Learning From Data @learndata - Tumblr Blog

The Atlantic's Cities section picks its ten favorite open data releases from 2012. An interesting twist: this list comprises a set of tools and interactive visualizations, not just the raw data. Of note: beware of SF's "high injury corridors," shown in the image above!

(Unrelated addendum: I successfully defended my thesis this week! Very much looking forward to the holidays, an overdue vacation, and then getting much more involved in the world of data. More details to come, soon.)

#data #visualization #open data #public #social

"It is unfortunate that in our generation, the brightest minds are being used to try to get people to click on ads. I hope in this decade - in the 2010s - we have a shift from consumerism back to mission ... national security, financial risk, and health care."

#strata #o'reilly #big data #tim estes #national security #financial risk #health care

If you're looking for a quick, direct intro to R's ggplot2 graphics package, these summer '12 slides from Hadley Wickham himself seem like a good start.

#R #ggplot2 #reference #graphics

Drew Conway (of the now-infamous Data Science Venn Diagram) gives a great talk on the value of asking good questions over having good tools. (Monktoberfest 2012)

#drew conway #venn diagram #monktoberfest

Very rich data visualization from the Hewlett Foundation on the size, location and year distribution of their grants.

#data #visualization

fastcodesign

One mathematician’s ingenious solution to armoring heavy bombers inspired Facebook’s research manager to look at Facebook’s ocean of data from a new perspective.

#fastco #facebook #data

Two Finds: SQLZoo & 'Data Literacy' Short Course

I came across two interesting finds this week:

First, Y Combinator (a great Twitter feed to keep tabs on) shared a link to SQLZoo.net, a source of tutorials, tests, and references for learning SQL. I don't yet have any experience with relational databases, but it sounds like they're used often in traditional business arenas. Having a basic familiarity with the syntax and use of this special querying language seems wise.
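For a taste of what that syntax looks like, here is a minimal sketch using Python's built-in sqlite3 module. The `cities` table and its rows are made up purely for illustration; they are not from SQLZoo.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Springfield", 30000), ("Shelbyville", 12000), ("Capital City", 250000)],
)

# SELECT ... FROM ... WHERE ... ORDER BY is the core shape of most queries.
rows = conn.execute(
    "SELECT name, population FROM cities "
    "WHERE population > ? ORDER BY population DESC",
    (20000,),
).fetchall()
conn.close()

for name, population in rows:
    print(name, population)
```

The `?` placeholders keep values separate from the query text, which is the usual habit to pick up early.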

Second, Quora sent me an email with an update to a 'Data Science' thread that I had previously checked out. I'm actually having trouble finding the original thread now, but the original question was something along the lines of "What skills does a data scientist need to start out with?" The update was from one of the co-instructors of this MIT short course (Independent Activities Period, IAP) in 'Data Literacy.' It's a lab-heavy, six-day walkthrough of the basics in accessing data, extracting information, using statistical tools, and making visualizations. It appears that they're primarily using Python for this course, and it looks great. Very excited to start working through this one.

On an unrelated note, I'm sad to say that I've elected to abandon the programming exercises in the Coursera Machine Learning course. Though it is fun and incredibly educational, I need to devote much more time to finishing my graduate work and thesis. I look forward to watching the rest of the videos, and at least obtaining some familiarity with the remaining topics. Since this was a repeat of the course, I hope it'll be offered again and I can pick up where I left off in the programming exercises!

#sql #python #coursera #thesis

Learn Git: tryGit (Code School)

Among the items on my ad-hoc list of 'to-dos' for getting into data science: get an introduction to distributed version control systems (VCS). Just recently on the Kaggle blog, there was a great article on Engineering Practices in Data Science and the need for source control. I admit that the idea was a new one to me, but it makes complete sense. Instead of cleverly-named (or so you thought...) files stored away in random hard drive locations, having a good pipeline helps "get the goo of software out of the way so they can focus on valuable data problems."

Sounds great. So, for a rookie to the field, where do I begin? Turns out, Git has the answer. Many of them, in fact. From free online textbooks, to uber-intro (i.e. for me) walkthroughs like tryGit, you can find a way into VCS.

I had a hunch that Git was going to be useful to me eventually; I set up a GitHub account and used Gist to embed my 'reference cards' on Python and bash. Having run through the tryGit tutorial, I feel... interested. Without previous experience, it's all still a bit mysterious. But I can see the utility, so I'm looking forward to experimenting more with it, soon.

#Git #VCS #Kaggle

Multi-class Classification & Neural Networks (Coursera ML class)

The third programming exercise in Coursera's Machine Learning class deals with one-vs-all logistic regression (aka multi-class classification) and an introduction to the use of neural networks to recognize hand-written digits. This is - by far! - the most interesting assignment yet.

Getting my head wrapped around the setup for this assignment took almost as much time as actually implementing the solution. In short, the data we're looking at is a subset of the MNIST handwritten digit dataset. We have 5000 training examples, each of which is an "unrolled" version of a 20 x 20 pixel grayscale image (i.e. a 400-dimensional vector). Each pixel is encoded as a grayscale intensity, and the "0" digit is labeled "10" for convenience with Octave's 1-based vector indexing. We first get a sense of what we're dealing with by running some provided code that displays a random 10 x 10 array of training examples:

You can see that the clarity or messiness of the digits is all over the map, so to speak. This is very clearly a relevant opportunity for a learning algorithm!

We're asked to vectorize previous code, but if you were thinking that way the first time around, your code from previous exercises will already be vectorized. This saves a bunch of steps in this assignment. Next we add in regularization, but once more we leave the choice of lambda to Prof. Ng in the code that tests our solutions. I imagine (and hope!) that in exercises in the not-too-distant future, we'll be learning how to choose our own values of the regularization parameter.

Then we get to the business of implementing one-vs-all classification by training one regularized logistic classifier for each of the K classes in the dataset (here, K=10). We also make use of some new techniques in this exercise (logical arrays, and the fmincg advanced optimization function). In the end, our trained algorithm is loosed upon the same dataset (not ideal or realistic, but ok...) and, for each training example, outputs its prediction of the correct digit. On the given set of data, our algorithm correctly classifies about 95% of the training examples. Pretty good!
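To make the one-vs-all idea concrete, here is a rough sketch in plain Python rather than the course's vectorized Octave. It swaps fmincg for simple batch gradient descent, and the data is a hypothetical toy set of three 2-D clusters (not MNIST); all function names are my own.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hypothesis(theta, x):
    """theta[0] is the intercept; theta[1:] pairs with the features in x."""
    return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))


def train_binary(X, y, lam=0.1, alpha=0.5, iters=2000):
    """Regularized logistic regression fit by plain batch gradient descent."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iters):
        grad = [0.0] * (n + 1)
        for x, target in zip(X, y):
            err = hypothesis(theta, x) - target
            grad[0] += err
            for j, v in enumerate(x, start=1):
                grad[j] += err * v
        # The intercept (j == 0) is not regularized.
        theta = [
            t - alpha * (g + (lam * t if j > 0 else 0.0)) / m
            for j, (t, g) in enumerate(zip(theta, grad))
        ]
    return theta


def one_vs_all(X, labels, classes, **kwargs):
    """Train one 'this class vs. everything else' classifier per class."""
    return {
        k: train_binary(X, [1.0 if lab == k else 0.0 for lab in labels], **kwargs)
        for k in classes
    }


def predict(thetas, x):
    """Pick the class whose classifier reports the highest probability."""
    return max(thetas, key=lambda k: hypothesis(thetas[k], x))


# Hypothetical toy data: three well-separated 2-D clusters, labeled 1..3.
X = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.6), (0.4, 0.4),
     (4.0, 0.0), (3.6, 0.4), (4.2, 0.5), (3.8, 0.1),
     (0.0, 4.0), (0.3, 3.7), (0.5, 4.2), (0.1, 3.9)]
labels = [1] * 4 + [2] * 4 + [3] * 4

thetas = one_vs_all(X, labels, classes=[1, 2, 3])
accuracy = sum(predict(thetas, x) == lab for x, lab in zip(X, labels)) / len(X)
print("training accuracy:", accuracy)
```

As in the exercise, accuracy here is measured on the training data itself, so it says little about how the classifiers would do on unseen examples.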

The second part of the programming exercise deals with neural networks (NN). Turns out NN are pretty complicated; I watched the video lectures on this a few times and still barely caught it. Since the algorithm is complex, it's split across assignments; in this one, we're only implementing feedforward propagation using a previously-trained set of parameters/weights (theta matrices). When we complete the forward propagation, the test code randomly chooses an entry from the MNIST 5000-example subset we have and displays its guess at the value along with the actual image. Impressively, its training accuracy is about 97%. Here are some examples of the test output (may have to "View Image" in a new tab to read the prediction. Spoiler: they're all correct.):

This exercise was great. Though there are still some core details that are being hidden from us (e.g. the training of this NN), this is starting to look like a legitimate machine learning exercise. The details of the code are pretty interesting, too, and I'm working on getting those files available. Stay tuned on that front!
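In the meantime, the feedforward step described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the network is a tiny 2-input, 2-hidden-unit, 2-output one, and the weights are hand-picked hypothetical values (a small XNOR/XOR network), not the course's trained theta matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """Prepend the bias unit, then apply one row of weights per output unit."""
    a = [1.0] + list(activations)
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]


def feedforward_predict(theta1, theta2, x):
    """Input -> hidden -> output; return (1-based label, output activations)."""
    hidden = layer(theta1, x)
    output = layer(theta2, hidden)
    # The +1 mirrors the 1-based class labels used in the Octave exercise.
    return output.index(max(output)) + 1, output


# Hypothetical hand-picked weights (each row is [bias, w1, w2]).
# Hidden unit 1 acts like AND, hidden unit 2 like NOR.
THETA1 = [[-30.0, 20.0, 20.0],
          [10.0, -20.0, -20.0]]
# Output 1 fires for XNOR (inputs equal), output 2 for XOR (inputs differ).
THETA2 = [[-10.0, 20.0, 20.0],
          [10.0, -20.0, -20.0]]

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [feedforward_predict(THETA1, THETA2, x)[0] for x in inputs]
print(predictions)
```

The real exercise has the same shape, just with a 400-unit input layer and ten output units, and with the matrix products done in one vectorized step per layer.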

#neural network #multi-class classification #coursera #machine learning

Logistic Regression in Octave (Coursera ML class)

In programming exercise two of Prof. Ng's Machine Learning class, we implemented logistic regression on two unique sets of data. The first dataset was a distribution of exam score pairs corresponding to students who were either admitted to a fictitious program or not. We analyzed these data by making use of the Octave function fminunc - a built-in optimization solver. We implemented the sigmoid hypothesis function, then the cost function and its gradient. Passing fminunc these functions leads us to the following decision boundary:

Once we have this decision boundary, we can use it to predict the likelihood of admission based on a new pair of exam scores.
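A rough Python sketch of those pieces (sigmoid hypothesis, cost, gradient, and a prediction) is below. Two loud assumptions: fixed-step gradient descent stands in for fminunc, and the exam scores are hypothetical values rescaled to the 0-1 range, not the course data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_gradient(theta, X, y):
    """Unregularized logistic cost J(theta) and its gradient.

    Each row of X is expected to start with a constant 1.0 (intercept term).
    """
    m = len(X)
    cost = 0.0
    grad = [0.0] * len(theta)
    for row, target in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, row)))
        cost += -target * math.log(h) - (1 - target) * math.log(1 - h)
        for j, v in enumerate(row):
            grad[j] += (h - target) * v
    return cost / m, [g / m for g in grad]


# Hypothetical (intercept, exam 1, exam 2) rows; 1 = admitted, 0 = not.
X = [(1.0, 0.35, 0.40), (1.0, 0.45, 0.30), (1.0, 0.30, 0.60),
     (1.0, 0.60, 0.85), (1.0, 0.80, 0.75), (1.0, 0.90, 0.60)]
y = [0, 0, 0, 1, 1, 1]

theta = [0.0, 0.0, 0.0]
initial_cost, initial_grad = cost_and_gradient(theta, X, y)

# Stand-in for fminunc: plain gradient descent with a fixed step size.
alpha = 1.0
for _ in range(2000):
    _, grad = cost_and_gradient(theta, X, y)
    theta = [t - alpha * g for t, g in zip(theta, grad)]

final_cost, _ = cost_and_gradient(theta, X, y)

# Estimated admission probability for a new pair of (scaled) exam scores.
admit_probability = sigmoid(theta[0] + theta[1] * 0.85 + theta[2] * 0.80)
reject_probability = sigmoid(theta[0] + theta[1] * 0.30 + theta[2] * 0.30)
print(initial_cost, final_cost, admit_probability, reject_probability)
```

With all parameters at zero every prediction is 0.5, so the starting cost is ln 2 (about 0.693), which makes a handy sanity check on the cost function.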

The second dataset was more interesting. In this case, we get the quality control results (yes, no) as a function of two test scores on microchips. These data, however, were not separable by a straight line through the plot. In order to accommodate this situation, we used a technique called feature mapping to extend the existing features (two test scores) into additional features by creating higher-order polynomial terms (powers and products) of the first two. We extended our features up to sixth-order polynomials in the original two features, and this allows our decision boundary to be non-linear in the 2D data plot.
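Feature mapping is easy to sketch in Python. The function below is my own illustrative version (the name `map_feature` and the ordering of terms are assumptions); it builds every term x1^a * x2^b with a + b up to the chosen degree, plus a leading constant 1 for the intercept.

```python
def map_feature(x1, x2, degree=6):
    """Return [1, x1, x2, x1^2, x1*x2, x2^2, ...] up to the given total degree."""
    features = [1.0]
    for total in range(1, degree + 1):
        for j in range(total + 1):
            features.append((x1 ** (total - j)) * (x2 ** j))
    return features


# Two original features become 28 once mapped up to sixth order.
mapped = map_feature(0.5, -0.2)
print(len(mapped))

# A degree-2 example is small enough to read off by hand.
small = map_feature(2, 3, degree=2)
print(small)
```

The classifier is still linear in these 28 mapped features; it only looks curved when drawn back in the original two-score plot.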

For this dataset we also implemented regularization. Since we don't explicitly know ab initio which features (if any) will be more significant, we include a term that penalizes large parameter values and so discourages overfitting. We again passed fminunc the same cost function and its gradient (side note: if these two functions are properly vectorized, no changes need to be made between the two datasets). We also need to fix the regularization parameter now. The initial value was chosen to be 1, which results in a good-looking fit to the data:

By experimenting with the value of this parameter, we can observe the varying final results. For large values (e.g. 100), we find a simpler decision boundary that ultimately underfits the data:

In contrast, by removing the regularization parameter (i.e. set it to 0), we see that the regression leads to a complicated decision boundary, and ultimately overfits the data. We shouldn't be very confident in any predictions that come from this result:

So that's the first implementation of logistic regression. It's a very neat continuation of the principles of linear regression; it mostly just requires thinking about a different hypothesis, and the mechanism is largely the same.
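The effect of the regularization parameter is easier to see with numbers. Here is a small sketch, assuming the usual L2 penalty that leaves the intercept alone; the helper name `add_l2_penalty` and the example cost, gradient, and parameter values are hypothetical.

```python
def add_l2_penalty(cost, grad, theta, lam, m):
    """Add an L2 penalty to an unregularized cost and gradient.

    theta[0] (the intercept) is left out of the penalty; m is the number
    of training examples.
    """
    penalty = lam / (2.0 * m) * sum(t * t for t in theta[1:])
    reg_grad = [grad[0]] + [g + lam / m * t for g, t in zip(grad[1:], theta[1:])]
    return cost + penalty, reg_grad


# Hypothetical values, chosen only to make the arithmetic easy to follow.
theta = [1.0, 2.0, -3.0]
base_cost, base_grad, m = 0.5, [0.1, 0.2, 0.3], 10

results = {
    lam: add_l2_penalty(base_cost, base_grad, theta, lam, m)
    for lam in (0, 1, 100)
}
for lam, (cost, grad) in results.items():
    print(lam, cost, grad)
```

With the parameter at 0 nothing changes, at 1 the penalty is comparable to the data term, and at 100 it swamps it, which is why the optimizer then prefers tiny parameters and a boundary that underfits.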

#coursera #machine learning #octave #logistic #regression

Linear Regression in Octave (Coursera ML class)

Data!

Granted, this data comes pre-cleaned, pre-packaged, and - truth be told - even the setup for plotting was pre-made by Prof. Ng for the Coursera course. Still, it's great to fire some code and see something meaningful happen.

These were the results of the first programming assignment for Machine Learning. In this training dataset, we have the profits from house sales as a function of city population. We wrote the guts of the linear regression model and used gradient descent to minimize the defined cost function. The resulting (two) parameters were used to create the blue fit line. And it doesn't look half bad!

Furthermore, we actually plotted a grid of the parameter space for the cost function, and then a contour plot with an "X" marking the final result. From this second plot, you can see we did succeed in finding the minimum of the cost function.
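For anyone curious what the "guts" look like, here is a minimal Python sketch of single-feature linear regression by batch gradient descent. The data points are hypothetical (they lie exactly on y = 2x + 1, so the answer is known in advance), and the step size and iteration count are my own choices, not the course's.

```python
def compute_cost(theta0, theta1, xs, ys):
    """Squared-error cost J = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, iters=5000):
    """Minimize the cost above, starting from theta0 = theta1 = 0."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters together, using the same set of errors.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

theta0, theta1 = gradient_descent(xs, ys)
final_cost = compute_cost(theta0, theta1, xs, ys)
print(theta0, theta1, final_cost)
```

Evaluating `compute_cost` over a grid of (theta0, theta1) values is all the surface and contour plots need, and the minimum of that grid should sit where gradient descent ends up.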

This is all very cool. It's particularly fun for me, because I use linear and non-linear regression all the time in my own work. Often, I'll even be passing a software package the model function. But I still never got to actually implement the regression algorithm. Very, very cool to be in charge of the behind-the-scenes action.

This also seems like a good opportunity to figure out the how and what of putting code up on GitHub, too. I'll update this post with a link to the code for this linear regression exercise once I figure that out.

Unrelated to linear regression: I've realized that writing my thesis is taking up a large fraction of the time in my day. I hope to continue working on the exercises in this course as long as possible, but my current projection does not suggest success in completing the ML course.

#GNU #Octave #linear #regression #thesis #coursera

Bit Of A False Start

Pretty typical.

In my current role as an experimental scientist I often find myself needing more knowledge, sometimes in new fields. My typical modus operandi - and I believe this is common - is to go out and find a couple of resources (online, in print, colleagues) and tap them for information. Using a couple of different sources gives a rudimentary "error bar" on the information. Agreement on a topic? Maybe reliable. Vastly differing opinions? May be worth finding additional sources.

When it came to learning new skills for data science, I found online polls, books, introductory lessons, and devoured them. Just dive in, right? As I began to read through DAwOST, I realized that I'd most strongly benefit from being able to recreate the various graphs and data representations that are presented in the text. With the simple tools I had so far (e.g. Codecademy Python lessons), I couldn't do anything like that. This was a great eye-opener and suggested that I take the time to build some foundational skills and fundamentals before getting too far ahead of myself.

So, picking a language (perhaps Python, for no better reason than its popularity in data science) and learning how to a) run it locally, on my own machine, b) handle input files, i.e. arbitrary data, and c) make some basic graphs have moved to near the top of my to-do list. And in the "general skills" category, I've also enrolled in another Coursera course on basic statistics. I think it will be great to see how these concepts are traditionally taught (hopefully in some sort of logical order), but it also has a focus on programming in R, another language with which I want to be familiar. (Not enough spare time for that; my thesis deadline is knocking.)

Then, I look forward to punching into the meat of DAwOST and also beginning to play with data on my own.

#coursera #statistics #Python #DAwOST

Machine Learning: Coursera

On a recommendation, I'm planning to work through Coursera's Machine Learning course. It started a little over a week before I got going, so first I have to catch up! But it'll be a good opportunity to do some programming in Octave (new to me), an open-source language largely compatible with MATLAB, and pick up some ML fundamentals.

#Coursera #machine learning #octave

Learn CLI: bash (Learn Code The Hard Way)

A brief detour on my path toward practical Python programming (alliteration!): bash.

Among the desirable skills that I've seen listed in many data science job descriptions: bash (Unix command line). I've played around with this a little in the past (who hasn't goofed around at the Windows CMD or OS X Terminal??), but this crash course in bash from the Learn Code The Hard Way series was a great, structured introduction. I'm certainly not a skilled user, but I can do some very simple (and powerful!) things with it now.

In order to revisit and practice this more, I'd also like to work through this series of posts from QuickLeft.

#bash #command line #terminal #QuickLeft #to do

Links (late August)

In parallel with various coding progress, I come across many informative links online (Twitter is a blessing and a curse!). Here is the most recent dump of links (for both my own record and for anyone else interested):

( [*] = articles I still need to read )

General, news

Data scientist = rock stars (April 2012)

"The data scientist role is fast becoming the most sought after career of the technology world. We asked top data scientist Jake Porway from The New York Times about how he got his job, and his tips for success in the field."

McKinsey report on explosion of data, growth potential (2011)

0xdata wants to make everyone a data scientist (April 2012)

Interview with Edwin Chen of Twitter (July 2012)

Resources

RPI Data Science course with material (2011)

Quora: "How do I become a data scientist?"

^ Probably an infinite number of resources & links in here!

Data Science 101 (2012)

^ An individual's (?) blog about learning data science; lots of links

Learn Python (Codecademy)

Almost exactly the time I was deciding to get my hands dirty with R and Python, Codecademy announced they had released a Python track. Though my attention waned from their CodeYear JavaScript project after a few months, this one is certainly more relevant.

I notice that this is for Python 2.7 and there is apparently a newer v3 out with some new syntax changes.

After a week of some lunch breaks at work and evening sessions, I'd worked through the introductory course*. Since there are still some obvious common data types and programming structures lacking (e.g. arrays, loops), I'm assuming there will be additional lessons in the future. I'm amazed at how simple the structure of Python is.

For future reference, the Python Language Reference is here, and below is a condensed set of take-aways from the Codecademy lessons. This is also a good opportunity to start learning about GitHub. The Gist tool lets you write code and then include it in posts (see below).

*If this track is like some of their others, there will be more content added in the future; if so, I may return to work on it.

#codecademy #python #reference

Data Analysis with Open Source Tools (O'Reilly) [Ch 1]

My goal is to get into this field as quickly as possible. This means hopefully finding the most efficient sources from which to learn and doing so with the least overhead (cost) possible. Anything that seems relevant and open source is probably going to be spot-on.

After seeing a few references to the text Data Analysis with Open Source Tools (DAwOST), a quick search showed me that I could buy it for $32, or download it free from Squidoo. If all goes well, I'll be sending lots of money to O'Reilly & co. in the coming months and years; I'm sure they'll understand my broke-grad-student logic for the moment.

The text is rather long so as I work through it I'd like to break it up, here, by Chapters or Parts.

Chapter 1 is an introduction, a layout of the plans ahead. Some choice quotes:

As a physicist, I am not content merely to describe data or to make black-box predictions: the purpose of an analysis is always to develop an understanding for the processes or mechanisms that give rise to the data that we observe. (p. xiii)

YES.

Data analysis, as I understand it, is not a fixed set of techniques. It is a way of life, and it has a name: curiosity. There is always something else to find out and something more to learn. (p. xiv)

Music to my ears.

Simple is better than complex.

Cheap is better than expensive.

Explicit is better than opaque.

Purpose is more important than process.

Insight is more important than precision.

Understanding is more important than technique.

Think more, work less. (p. xv)

CAN I GET AN AMEN!

#DAwOST #O'Reilly #open source
