Data Science for the "poor" - Octave 3.8.2 for the Big(ger) Data (the experimental --enable-64 compilation switch)
(You can skip to the next section, The Problem, if you are here solely for Octave built with --enable-64 on Linux. And if you are really not interested in any details, check the section 'The Solution for Lazy ones'.)
Big Data is all about processing massive amounts of data, distributed (cloud) computing, parallel processing, in-memory computing, and the need to apply complex algorithms (Machine Learning) to large-scale data sets. All of that leads to the need for efficient, well-performing methods (applied math/algebra/statistics) as well as a computing environment where you can test and run your hypothesis or algorithm, for example to learn a statistical model for the prediction software you are working on. That might be the ultimate clever algorithm solving a particular business challenge, or even work on the "Holy Grail" of all SW development: AI (Artificial Intelligence).
In my attempts to tackle various Big Data topics, I very often run into technological issues with current tooling. Much as I love challenges and hate unsolved problems, one problem I ran into about a year ago was getting my Octave (a free, open-source alternative to the otherwise very expensive and very powerful MATLAB software) to run my 'complex' math.
My experience with tooling for processing large data is that the whole process is still rather cumbersome. Yes, we have big clusters like Hadoop or Spark, SAP HANA and others to run our intensive computations on, but engineering a Machine Learning pipeline requires research and prototyping phases where you do not run your algorithms against real "Big Data" but against 'decent size' data-sets, to see and measure the algorithm's behavior given the statistical properties of the targeted Big Data data-set.
In those research and engineering phases, or R&D if you will, we would rather focus on finding the Solution to the problem than on finding the most efficient way to run the Solution in a particular compute environment (which I would call the Solution Implementation Phase). Though I know that in practice the platform you will use to execute your algorithm dictates a significant part of the design of your Solution if it is to be used for production / operations.
So, in those R&D phases we need flexibility and speed in the whole research & prototyping process. That means we need SW tools which provide convenient & reliable ways of processing large data-sets and doing complex math calculations.
There really is a huge variety of tools you can use. In practice you will most likely end up using a couple of them while building the end-to-end processing pipeline for your Big Data problem, as there is no "the one" tool which does all the tasks you need best (for instance complex math calculations, parallel processing, distributed computing, in-memory processing), and all that using your favorite programming language (R, MATLAB/Octave, Scala, ..) or ideally some kind of "Rapid Application Development" environment: a Drag & Drop IDE with support for scripting/coding extensions in your favorite scripting language, and thus extensible.
The execution platform you go after should therefore support a 64-bit architecture (to utilize all the memory available in today's HW), parallel execution (to utilize multi-core HW architectures) and distributed computing (to build clusters able to store & process huge amounts of data). These requirements can further limit your choice of tools and platforms during the Solution Implementation Phase.
To avoid any flame-war here, I intentionally do not recommend or nominate the best tool/environment, since there is none; it all depends on the context of your project. Below I mention a small "slice" of the tools I sometimes use in the R&D phase and with which I have had positive experiences in my prototyping attempts, e.g. while solving some of the endless number of challenges at Kaggle:
Tools for statistical data analysis and exploration:
PSPP (open-source alternative to SPSS)
and of course the "heavy lifters" MATLAB or Wolfram Mathematica
Remark: If you are interested in the above topic a little more - I recently found a reference to a paper from UMBC - "A Comparative Evaluation of Matlab, Octave, FreeMat, Scilab, R, and IDL on Tara".
Tools for data pre-processing / algorithms and coding:
Python / iPython Notebook
Remark: Yes, Rattle is good not only for analyzing the statistical properties of data but also for executing basic versions of ML models (learn & predict). In contrast, I cannot claim a positive experience with Weka from the performance perspective, though it has a huge library of ML algorithms. I really like the fact that Python / iPython Notebook can efficiently utilize a 64-bit architecture and process data very efficiently; I had similarly positive experiences using iPython on both Windows and Linux platforms.
As you see, Octave, mentioned a couple of times in the text above, is an open-source (and free) alternative to a very powerful engine well suited to writing the mathematical algorithms used in Machine Learning (ML). Of course there are other environments (like the whole R language eco-system) which you can choose from. That being said, if you (like myself) got into the Machine Learning domain through the MOOC ML course on Coursera (lessons here) held by Andrew Ng, maybe you would agree with Andrew and his choice to use Octave in the basic introductory course for ML. It really seems to be a little easier for newbies to start with Octave.
Remark: For those who alternate between the 2 worlds of MATLAB/Octave and R - I found a neat 'R and Octave cheat sheet' here ...
So, let's say we start learning ML using Octave. We develop some nice algorithms and would like to leverage the work already done for a more serious job: processing more data, and processing it as efficiently as possible within the budget allocated for the project (which eliminates the idea of buying a MATLAB license ... ;-) ).
The Problem
The problem I was facing with Octave was that when I tried to process bigger amounts of data I was getting the famous Octave error 'memory exhausted or requested size too large for range of Octave's index type', irrespective of Octave being compiled for the x86 (32-bit) or x64 (64-bit) architecture. (This was despite the fact that I was sure the real memory requirement of the operation I was trying to execute in Octave should be easily satisfied by my HW configuration: x64 Linux OS, 16GB of physical RAM and tons of GBs of swap, just in case ...)
While searching for the reasons for the error above I found on many web-sites (like here) a logical explanation saying that Octave (and the libraries used by Octave) use 32-bit indexing of memory objects, so you cannot allocate an object (let's say an array, matrix or vector) with more elements than fit into a 32-bit integer. (In fact the actual limit seems to be not 2^32 but only 2^31, as signed integers seem to be used for the index.)
If you use standard Octave with 32-bit indexing, it prevents you from allocating, e.g., a matrix of 65536 by 65536 elements of type int8 (which would normally require just 4GB of memory).
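The arithmetic behind this limit can be checked with a short Python sketch. It assumes, as the explanation above suggests, that the default index type is a signed 32-bit integer and that --enable-64 widens it to 64 bits; the helper names are illustrative and not part of Octave.

```python
# Sketch of the index-limit arithmetic. Assumption: Octave's default index
# type is a signed 32-bit integer, and --enable-64 switches it to 64 bits.
INT32_MAX = 2**31 - 1  # largest signed 32-bit value
INT64_MAX = 2**63 - 1  # largest signed 64-bit value


def element_count(rows, cols):
    """Total number of elements in a dense rows x cols matrix."""
    return rows * cols


def fits_index(rows, cols, max_index):
    """True if the element count does not exceed the largest index value."""
    return element_count(rows, cols) <= max_index


def dense_bytes(rows, cols, bytes_per_element):
    """Memory needed for the raw data of a dense matrix."""
    return element_count(rows, cols) * bytes_per_element


if __name__ == "__main__":
    rows = cols = 65536
    print("elements:         ", element_count(rows, cols))
    print("GiB as int8:      ", dense_bytes(rows, cols, 1) / 2**30)
    print("fits 32-bit index:", fits_index(rows, cols, INT32_MAX))
    print("fits 64-bit index:", fits_index(rows, cols, INT64_MAX))
```

The matrix holds 2^32 elements, which is only 4 GiB as int8 but more than a signed 32-bit index can count, so the allocation fails no matter how much RAM is installed.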
Remark: Yes, in ML we sometimes use algorithms which process many features, sometimes thousands and sometimes even tens or hundreds of thousands ... so we need large matrices. Some workarounds for this limitation of Octave suggest using Sparse matrices to avoid physically storing so many data fields, assuming that in many cases the features are not defined for every data entry in our data-set. And using Sparse Matrices would be the right way to design an ML algorithm when we process Sparse data. But we may not always have Sparse data ...
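To see why the Sparse workaround helps only for Sparse data, here is a rough memory estimate in Python. The storage model is an assumption made for illustration (compressed-column layout, 8-byte double values, 4-byte indices); the function names are hypothetical and the numbers are estimates, not measurements from Octave.

```python
# Rough memory estimate, dense vs. sparse. Assumptions (not from the post):
# compressed-column storage, 8-byte double values, 4-byte (32-bit) indices.


def dense_bytes(rows, cols, value_bytes=8):
    """Raw data size of a dense matrix of doubles."""
    return rows * cols * value_bytes


def sparse_bytes(cols, nonzeros, value_bytes=8, index_bytes=4):
    """One value and one row index per non-zero, plus cols + 1 column pointers."""
    return nonzeros * (value_bytes + index_bytes) + (cols + 1) * index_bytes


if __name__ == "__main__":
    rows = cols = 65536
    total = rows * cols
    for percent in (1, 50, 100):
        nonzeros = total * percent // 100
        print(percent, "% filled:",
              round(sparse_bytes(cols, nonzeros) / 2**30, 2), "GiB sparse vs",
              round(dense_bytes(rows, cols) / 2**30, 2), "GiB dense")
```

Under these assumptions sparse storage wins clearly when only a small fraction of the entries is non-zero, but for fully populated data it costs more than dense storage, so it is no substitute for larger indexing.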
To demonstrate the problem in regular Octave with 32-bit indexing, just run octave and try to allocate a matrix of the size mentioned above.
While searching for a resolution of this Octave limitation I found information about a magic experimental switch called "--enable-64". The first problem with this switch is that it is a compilation switch, so you need to re-compile Octave from the source code (not a big deal, right? Just execute "./configure && make && make install" ...). Well, not really: the second problem is that you also need to re-compile a few of the 3rd party libraries Octave uses (ARPACK, BLAS, LAPACK, SuiteSparse, GLPK, QHULL, QRUPDATE ...).
Though I searched a lot for a pre-compiled Octave version for Linux built with this experimental --enable-64 switch, I didn't find much. I did find some web-sites with partial instructions on how and which 3rd party libraries can/have to be re-compiled to support 64-bit indexing. Unfortunately none of the sites gave me full instructions on how to do that. I tried to follow the official instructions on the GNU Octave pages a year ago and failed badly while trying to compile the 3rd or 4th required library. Yes, I was expecting to just follow the routine ./configure && make && make install procedure with some minor changes to the makefile(s) or the source code. That did not work out though ...
Because my astrological sign is Aries, I am stubborn by nature and I hate unresolved issues, and so now, a year later (summer 2014), I sat down and started again from scratch. I am not gonna bore you by describing all of the issues I faced while trying to get Octave compiled and tested with the --enable-64 switch. It took me a couple of weeks to find the right sequence and the right modifications of source code and make-files, and to discover the right dependencies on other libraries and tools. I finally made it and decided to publish the whole work on GitHub so other people can use it, work on it, enhance it or port it to other Linux environments.
I published some more details at my GitHub repository calaba/octave-3.8.2-enable-64-ubuntu-14.04 - feel free to use/clone/correct/enhance.
Below are the steps to execute if you wanna get your own 64-bit Octave 3.8.2 with 64-bit indexing enabled (for Ubuntu Desktop Linux, tested on versions 14.04 / 14.04.1 and 14.10):
Remark: End-to-end it takes approximately 3-4 hours (depending on your internet connection & HW speed). Compilation process is fully automated, including the download & installation of 3rd party SW.
Steps to get your own Octave with 64-bit indexing:
1) Get some virtualization SW, e.g. VMWare Player / VMWare Workstation
2) Install Ubuntu Linux Desktop (64-bit version) - 14.xx - either directly from Ubuntu (downloads here or alternative downloads here). For my final tests I used version 14.10 and I uploaded the ISO file (CD/DVD image) also to my Google Drive here .
3) Once you have installed Ubuntu Linux Desktop, log in, make sure your internet connection is working within Ubuntu Linux and then execute the following commands (you can execute them either as root or as a non-root user):
a) cd ~ ; sudo apt-get install git
b) sudo git clone https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04.git
c) cd ~/octave-3.8.2-enable-64-ubuntu-14.04
d) ./all.sh 2> all-err.log
Now the installation & compilation process starts. You will be asked several times to provide user's sudo-password if you run the compilation as non-root user.
After approximately 1 hour of compilation of Octave source and required libraries you can test the Octave with 64-bit indexing:
e) To check that the Octave built-in tests were OK, execute this command: tail -n 25 4-compile-64-octave.log (alternatively you can re-execute the tests by executing 'cd octave-3.8.2 ; make check')
f) To install compiled octave in your Ubuntu Desktop Linux execute 'cd octave-3.8.2 ; make install'
The Solution for Lazy ones
1) Get VMWare Player/Workstation or any other virtualization SW which can run virtual appliances from the OVA files.
2) Download the OVA file with Ubuntu Desktop 14.10 distribution containing my GitHub repository cloned and with Octave 3.8.2 compiled from source code using the --enable-64 switch. Download it here (4.4GB).
3) Enjoy it - Password of the 'octave' and 'root' users is 'password'.
Remark for Windows users: As I finished my work on this topic, I discovered that there is actually a working Windows distribution of Octave: MXE Octave. It also offers support for 'large indexing', which means 64-bit indexing of memory objects. I didn't check this version in detail as I am a little skeptical about Windows memory management. Whenever I run ML/Big Data computation tasks requiring larger memory/swap file, I always use tools for 64-bit Linux distributions.
Enjoy and Happy New 2015 !