Adam Laiacano @adamlaiacano - Tumblr Blog

On the heels of the Airbnb blog post about how they use PMML models in their machine learning framework, I figured I'd post some code I've been toying with for using PMML for batch processing in Scalding.

I think this will be a huge help if you're able to build a predictive model on a subset of your data in something like R or python (which doesn't have the best support for PMML yet), and then apply it to a much larger data volume in Hadoop.

I'll have a full post up eventually, but here's a simple example usage.

All of the work is done in a single trait called Predictable, and any class that extends it will get a predict method to apply a model.

#scalding #machine learning #hadoop

Naive Bayes classification for large data sets

I put together a Naive Bayes classifier in scalding. It's modeled after the scikit-learn approach, and I've tried to make the API look similar. The advantage of using scalding is that it is designed to run over enormous data sets in Hadoop. I've run some binary classification jobs on 100GB+ of data and it works quite well.

Here's an example usage on the famous iris data set.

The three classes are the species of iris, and the four features/attributes are the length and width of the flower's sepal and petal. The train method (line 21) returns a Pipe containing all of the information required for classification.

The scikit-learn documentation contains a good explanation of Bayes' theorem, which boils down to the following equation:

\[ \hat{y} = \underset{y} {\mathrm{argmax}} ~P(y|x_i) = \underset{y} {\mathrm{argmax}} ~P(y) \prod_{i=1}^{n}P(x_i | y) \]

Where \( \hat{y} \) is the predicted class, \( y \) is the training class, and \( x_i \) are the features (sepalWidth, sepalLength, petalWidth, petalLength). Most Naive Bayes examples that I've seen are dealing with word counts, therefore \( P(x_i | y) = \frac{\text{number of times word } x_i \text{ appears in class } y}{\text{total number of words in class } y} \). That's how the MultinomialNB class works, but in the iris data set, we're dealing with continuous, normally distributed measurements and not counts of objects in a multinomial distribution.

Therefore, we want to use GaussianNB which uses the following equation in the classification:

\[ P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i-\mu_y)^2}{2\sigma_y^2}\right) \]

That means that our model pipe must contain the following information in order to calculate \(P(y|x_i)\)

classId, \(y\) - The type of flower.

feature, \(i\) - The name of the feature (sepalWidth, sepalLength, petalWidth, or petalLength).

classPrior, \(P(y)\) - Prior probability of an iris belonging to the class.

mu, \(\mu_y\) - The mean of the given feature within the class.

sigma, \(\sigma_y\) - The standard deviation of the given feature within the class.

The model Pipe is then crossed with the test set and we calculate the likelihood that the point belongs in each class, \(P(y|x_i)\). Once we have the probability of a data point belonging to each class, we simply group by the point's ID field and keep only the class with the maximum likelihood.
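To make the two steps concrete, here is a minimal sketch in plain Python rather than scalding. It is not the code from the post: the names `train`, `log_gaussian`, and `classify` are hypothetical, the measurements are invented toy values, and it sums log-probabilities instead of multiplying raw probabilities so that very unlikely points don't underflow to zero.

```python
import math
from collections import defaultdict

def train(rows):
    """rows: iterable of (class_id, {feature: value}).
    Returns {(class_id, feature): (class_prior, mu, sigma)}."""
    by_class = defaultdict(list)
    for class_id, features in rows:
        by_class[class_id].append(features)
    total = sum(len(points) for points in by_class.values())
    model = {}
    for class_id, points in by_class.items():
        prior = len(points) / total
        for feature in points[0]:
            values = [p[feature] for p in points]
            mu = sum(values) / len(values)
            # Population variance; this sketch assumes sigma > 0 for every feature.
            var = sum((v - mu) ** 2 for v in values) / len(values)
            model[(class_id, feature)] = (prior, mu, math.sqrt(var))
    return model

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian density with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def classify(model, features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for (class_id, feature), (prior, mu, sigma) in model.items():
        if class_id not in scores:
            scores[class_id] = math.log(prior)  # add the prior once per class
        scores[class_id] += log_gaussian(features[feature], mu, sigma)
    return max(scores, key=scores.get)

# Invented toy measurements, two features only.
train_rows = [
    ("setosa", {"petalLength": 1.4, "petalWidth": 0.2}),
    ("setosa", {"petalLength": 1.5, "petalWidth": 0.3}),
    ("setosa", {"petalLength": 1.3, "petalWidth": 0.2}),
    ("virginica", {"petalLength": 5.5, "petalWidth": 2.0}),
    ("virginica", {"petalLength": 6.0, "petalWidth": 2.3}),
    ("virginica", {"petalLength": 5.1, "petalWidth": 1.8}),
]
model = train(train_rows)
prediction = classify(model, {"petalLength": 1.45, "petalWidth": 0.25})
```

The dictionary keyed by class and feature plays the role of the model Pipe: one record per (classId, feature) pair carrying the prior, mean, and standard deviation.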

The results are shown below (plotting only two of the four features). The x's are the training points used, and o's are successfully classified points.

For now, your best bet for using this is to just copy the code off of github. If I get some time, I'd love to port this over to scalding's typed API, combine it with some other machine learning functions (such as the K nearest neighbor classifier I wrote) to provide a nice little library of tools for scaling machine learning algorithms. If you'd like to be involved, get in touch.

#machine learning #scalding #hadoop

I’m always amazed by the amount of work that gets done at every DataKind event. In just 14 (consecutive) hours, we were able to understand a very complex data problem, scope the project, and start laying the groundwork for a number of tools to help Amnesty International save lives.

Thanks to the folks from AI and the 30 or so people who helped out with our project.

#DataKind

I gave a talk about Digital Signal Processing in Hadoop at this month's NYC Machine Learning meetup. Here's the abstract:

In this talk I'm going to introduce the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and work through a few simple examples of low-pass filter design and application. It's much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. I'll show how they can be used either in a real-time stream or in batch-mode in Hadoop (with Scalding) and give a demo on how to detect trendy meme-ish blogs on Tumblr.

#machine learning #data science #digital signal processing #talks

The guys at yhat released a port of ggplot2 for python!

(via yhat/ggplot)

#rstats #ggplot2 #ggplot #data science #data visualization

This should be interesting: an online unconference, put on by the great people behind Simply Statistics. The listed speakers include Sinan Aral, Hilary Mason, Hadley Wickham, and others.

seanjtaylor

Finding Nate Silver

My [Sean's] (five minute) talk from Ignite Foo Camp on how we can build a reputation system to identify the people who are best at making predictions.

This is a great post showing how to build and work with random variables in scala. It starts with something as simple as drawing from a uniform distribution and moves to distribution transforms, adding random variables, conditional probability, and more. The code examples are great if you're a scala programmer who wants to learn more about probability, or if you already know the probability and want to learn some scala by example. Here's a quick taste of how to build a generic `Distribution` trait and create uniform and Bernoulli distributions out of it.

    trait Distribution[A] {
      self =>  // alias so the anonymous subclass in map can reach the outer get

      def get: A

      // Draw n independent samples.
      def sample(n: Int): List[A] = {
        List.fill(n)(this.get)
      }

      // Transform each draw with f to get a new distribution.
      def map[B](f: A => B): Distribution[B] = new Distribution[B] {
        override def get = f(self.get)
      }
    }

    val uniform = new Distribution[Double] {
      private val rand = new java.util.Random()
      override def get = rand.nextDouble()
    }

    // True with probability p.
    def bernoulli(p: Double): Distribution[Boolean] = {
      uniform.map(_ < p)
    }

#programming #scala #probability #statistics

Making data sensible with a Bayesian analysis

This is an excellent post about why raw numbers are not always a good representation of the overall data that you are analyzing. Turns out it's really easy to smooth the data and find the real trends and outliers in your sample (which is often what you're looking for as an analyst).

Thanks to Anna for the link

#bayesian statistics #data analysis

Data generating products.

People put a lot of effort into predicting the sentiment around a certain article, tweet, photo, or any other piece of information on the internet. There's huge value in knowing who is consuming your content, how they feel about it, and what kinds of things they feel similar about. For example, if I regularly share my love for Pepsi products and disdain for Coca-Cola products, then Pepsi can consider me a loyal customer and target their advertising dollars elsewhere.

Measuring sentiment is not always easy. The canonical approach is to build a large list of "positive" and "negative" words (either manually or via labeled training data), and then count how many of each group appear in the text that you want to classify. You can add some weights and filters to the words, but it really all comes down to the same sort of thing. This works OK in some contexts, but will never be exact and gets much more difficult if you try to figure out the sentiment of specific users and not an aggregate "mood."
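As a rough illustration of that word-counting approach, here is a minimal Python sketch. The word lists and the `sentiment_score` name are made up for the example; a real list would be far larger and probably weighted.

```python
import re

# Hypothetical, tiny word lists for illustration only.
POSITIVE = {"love", "great", "awesome", "enjoy"}
NEGATIVE = {"hate", "awful", "terrible", "disdain"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return positive - negative

score = sentiment_score("I love Pepsi, it is great")
```

A score above zero reads as positive and below zero as negative. Negation ("not great") and sarcasm go straight past a counter like this, which is part of why the approach is never exact.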

A few months ago I spoke at a conference and focused my talk on building "data generating products." By that, I mean building features that enhance a user's experience, and simultaneously let you collect useful data that you would otherwise have to predict in order to build new products on top of the new information. The example that I used was tumblr's typed post system.

With the seven post types, we can give users tools that make it easier to share specific types of media. Sharing a song? Search for it on Spotify or SoundCloud. Sharing a photo? Drag-and-drop the images, or take one with your webcam. It's easy for us to determine the type of media that you share or consume. If your blog is focused on sharing songs that you enjoy, we can recommend it to people who are using tumblr to discover new music.

Today bitly released a bitly for feelings bookmarklet (above). It lets users bookmark and/or share articles and websites that they come across, and uses cute short-URL domains like oppos.es or wtfthis.me, based on how you feel about the article. I'm not sure what their long-term plans are for this product (it's still in beta), but I'd love to be able to log into my bitly account and see all of the funny links that I've saved (lolthis.me), or all of the products I'd like to buy someday (iwantth.is).

It lets me, the user, organize content and makes me more likely to use bitly as a bookmarking service. It also tells them explicitly how I feel about a photo, article, or product and gives them an idea about why I am bookmarking and/or sharing it. There's no need to scrape my twitter account and analyze the words I use to describe the link that I'm sharing.

There are other products that have been well designed to collect rich user data, such as LinkedIn Endorsements or when Facebook switched from having text lists of bands/movies/books that I like to subscribing to individual 'like' pages for each item. These are fine ideas that create clear signals, but I'm no more likely to use LinkedIn because I can verify that my friend knows how to program in Ruby. Bitly added a simple interface to enhance their simple service, and are getting a wealth of valuable data out of it.

Now that they're collecting this great information, I can't wait to see what they build with it. Hopefully they'll work it into their real time search engine.

#data science #bitly

A great analysis of Zipcar's twitter followers. Via Who are brands really talking to on social media?

#zipcar #networks #twitter

Baseball by the (jersey) numbers

The other day I saw this video of a 14 year old high school basketball sensation. He's billed as "The Next LeBron" who was in turn billed as "The Next Michael Jordan." What caught my eye is that they all wear number 23 on their jersey (until LeBron moved to The Heat, at least).

When I was a kid I played lots of sports, and the best basketball player on my team was always #23 and the best football player was always #34 (Bo Jackson, Walter Payton, etc). So that got me to wondering what the "desirable" numbers are in other sports. I asked about other sports on twitter and #10 is clearly the best number to wear in soccer, and maybe #9 for hockey (Richard, Hull, Howe).

I wanted to figure out which jersey numbers have been the best historically, and baseball is the obvious sport to turn to for this, since data is so reliable and readily available. So I downloaded the career stats for a little over 17,000 players (which I think is every baseball player ever) and decided to see which are the best jersey numbers over all.

For each player, I got their career batting average and their jersey number, which proved to be harder than I expected. For example, Johnny Damon wore #18 from 2002-2009 (during his best seasons with the Red Sox and Yankees), but has also worn 51, 8, 22, and 33. So to make it a little easier on myself, I just got the number that they wore on the most teams. Johnny Damon has had 7 numbers on 6 teams, but wore #18 on 4 teams, so I'm associating him with #18.
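The "number worn on the most teams" rule can be sketched in a few lines of Python. The function name, the placeholder team labels, and the tie-breaking rule (lowest number wins) are my own assumptions here, not something taken from the post's code.

```python
from collections import Counter

def primary_number(stints):
    """stints: list of (team, jersey_number) pairs for one player.
    Returns the number worn on the most distinct teams."""
    # set() drops repeated (team, number) pairs, e.g. one row per season.
    teams_per_number = Counter(number for _team, number in set(stints))
    # Assumed tie-break: prefer the lowest number.
    return max(teams_per_number, key=lambda n: (teams_per_number[n], -n))

# Placeholder team labels, not a real career.
stints = [
    ("Team A", 18), ("Team A", 18), ("Team B", 18),
    ("Team B", 33), ("Team C", 51), ("Team D", 18),
]
number = primary_number(stints)
```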

For scoring the jersey number, I'm taking the mean of career batting averages for all players who wore that number, weighted by the number of plate appearances per batter. I perform this weighting so that someone like Roberto Clemente who hit .317 over 10,211 plate appearances would have more influence than someone like Buster Posey, who hit .311 with only 1,324 plate appearances. That's a bit of a mouthful so here's some math that might make more sense:

\[ S_j = \sum_{i=1}^{N_j} \frac{p_{j,i} b_{j,i}}{\sum_{k=1}^{N_j}p_{j,k}} \]

Where \( S_j \) is the score for jersey number \( j \), \( b_{j,i} \) and \( p_{j,i} \) are respectively the batting average and number of plate appearances for the \( i^{th} \) player who wears jersey number \( j \), and \( N_j \) is the total number of players who wore jersey number \( j \). I was also sure to limit the data set to \( p_{j,i} \geq 500 \) and \( N_j \geq 5 \). There were also a lot of older (~100 years ago) players whose number I couldn't gather, so I dropped those as well. This narrowed the data set down to about 3,500 players.
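For readers who prefer code to summation signs, here is a small Python sketch of the same weighted mean with both cutoffs applied. The function name `jersey_scores` and its parameter names are hypothetical. The sample rows are five of the #4 hitters listed later in this post, plus one invented row that falls below the plate-appearance cutoff.

```python
from collections import defaultdict

def jersey_scores(players, min_pa=500, min_players=5):
    """players: iterable of (jersey_number, plate_appearances, batting_average).
    Returns {jersey_number: plate-appearance-weighted mean batting average}."""
    grouped = defaultdict(list)
    for number, pa, avg in players:
        if pa >= min_pa:  # p_{j,i} >= 500
            grouped[number].append((pa, avg))
    scores = {}
    for number, rows in grouped.items():
        if len(rows) < min_players:  # N_j >= 5
            continue
        total_pa = sum(pa for pa, _avg in rows)
        scores[number] = sum(pa * avg for pa, avg in rows) / total_pa
    return scores

players = [
    (4, 9480, 0.358),   # Rogers Hornsby
    (4, 9663, 0.340),   # Lou Gehrig
    (4, 5134, 0.336),   # Riggs Stephenson
    (4, 2736, 0.331),   # Dale Alexander
    (4, 6228, 0.324),   # Babe Herman
    (99, 120, 0.400),   # invented row, dropped by the plate-appearance cutoff
]
scores = jersey_scores(players)
```

Because the sample only holds a handful of the best #4 hitters, the resulting score is much higher than the real score for #4 would be. It only demonstrates the arithmetic.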

The next thing to consider is that since I'm using lifetime batting average and number of plate appearances to rank players, what I'm really doing is ranking hitters. Many pitchers in the National League end up with over 500 at bats, but their batting average is just going to hurt the overall ranking of their jersey number. The following graph makes this crystal clear. The y-axis shows the jersey number, and the x-axis is the batting average for each player.

So once we remove the pitchers, here's where each number ranks in terms of weighted batting average:

There are some clear winners here. The number 51 is a bit of an outlier because there are only 10 batters with that number who meet the minimum plate appearance requirement, but the list includes Ichiro Suzuki (.321) and Bernie Williams (.297) among others with high batting averages. Here are the top players who wore #4, a more "classic" number:

name               plate_appearances  batting_average
Rogers Hornsby                  9480            0.358
Lou Gehrig                      9663            0.340
Riggs Stephenson                5134            0.336
Dale Alexander                  2736            0.331
Babe Herman                     6228            0.324
Luke Appling                   10254            0.310
Hack Wilson                     5556            0.307
Paul Molitor                   12167            0.306
Smead Jolley                    1815            0.305
Mel Ott                        11348            0.304

Not a bad group to be in. There are also some clear loser numbers, but they're mostly higher numbers which I think are often worn by pitchers.

Here are the 10 best and worst lifetime batting averages:

name               number  plate_appearances  batting_average
Rogers Hornsby          4               9480            0.358
Ted Williams            9               9788            0.344
Bill Terry              3               7108            0.341
Lou Gehrig              4               9663            0.340
Tony Gwynn             19              10232            0.338
Riggs Stephenson        4               5134            0.336
Al Simmons              7               9518            0.334
Paul Waner             24              10766            0.333
Dale Alexander          4               2736            0.331
Stan Musial             6              12717            0.331
...
Corky Miller           37                575            0.188
J.R. Phillips          17                545            0.188
Bill Plummer            8               1007            0.188
Gus Gil                18                538            0.186
Brandon Wood           32                751            0.186
Drew Butera            41                531            0.183
Kevin Cash             17                714            0.183
Tommy Dean              3                594            0.180
Ray Oyler               1               1445            0.175
John Vukovich          16                607            0.161

If you're a baseball fan, you can see that the bottom is all populated by pitchers, because no batter hitting under .190 would ever get 950 at bats in the majors. In fact, baseball's bias towards keeping good hitters and cutting poor hitters is obvious when you plot lifetime batting average against total plate appearances:

So what number would I wear if I were a pro baseball player? I can't NOT choose #9, even though it's retired by the Red Sox.

Also, all of the code and results for this blog post are available on my github page.

#sports #baseball

A great list of blogs on statistics, visualization, politics and more. Curated by Andrew Gelman. I'm not sure how Simply Statistics didn't make the list.

#links #andrew gelman

I sent out a tweet last night asking if there are any good data libraries for Scala. The next morning, Saddle gets released. I've heard it described as "Pandas for Scala," which is great because nothing beats having data frames ported to a new language.

#tech #scala #pandas #programming

All models are wrong, but some are useful.

George E P Box (1919-2013)

#george box #statistics #quotes

How to approach a problem: self-indulgent music recommendations.

I’ve been thinking a lot about music recommendations lately, and I realized that I’m usually a little bearish about listening to recommended bands that I’ve never heard of before. Maybe it’s just because I listen to a pretty broad variety of music, but I love re-discovering a band that I know but haven’t thought of in a while. So with that, let’s build a 100% self-centered music recommender. The goal is to remind myself of some bands that I might like to play next based on what I’m listening to right now.

Fortunately for me, I’ve used last.fm to record the last 135,000+ tracks that I’ve listened to over the course of 7 years (my “now playing” is listed at the top of this page). And even more fortunately, they let you grab your entire history via their API. I was actually able to get 127,873 of them, which is more than plenty to work with. So let’s check out which artists I’ve listened to the most and see how well it matches up with my last.fm profile:

Artist                 Plays
----------------------------
Tom Waits               3155
Justin Townes Earle     2613
Iron & Wine             2053
M. Ward                 2005
Lucero                  1832
Old 97's                1761
The Black Keys          1755
Beach House             1624
Death Cab for Cutie     1592
Ryan Adams              1527

The first thing that should become clear is that I listen to a lot of Sad Bastard music. At least both Dillinger Four and Samiam are in the top 20.

Approach

When designing this recommender, I’m going to try to answer the following question: Given the artist I’m currently listening to, what have I generally listened to next?

Now since this is specific to me, I’m going to add a few constraints. The first of which is this: I prefer listening to full albums rather than individual tracks. I’m not going to recommend songs, I’m going to recommend artists. Because of that, the only important attributes I need for each track are the time that I listened to it and the artist.

First attempt

Let’s look at the song-to-song transitions. That is, given that I’m listening to a song by The Antlers, which band am I likely to listen to next? This table shows the number of times I transition from listening to an Antlers song to each artist on the list, as well as the probability of the transition.

artist                    transitions  transition_prob
The Antlers                       854         0.870540
Beach House                        22         0.022426
Arcade Fire                         5         0.005097
Patrick Watson                      4         0.004077
The Tallest Man on Earth            3         0.003058
Okkervil River                      3         0.003058
The Avett Brothers                  3         0.003058
Carla Bruni                         3         0.003058
Pinback                             2         0.002039
North Highlands                     2         0.002039

This simply confirms what I stated earlier: when I listen to music, I listen to full albums. 87% of the time that I listen to an Antlers song, I listen to another one of their songs next. That’s not helpful for recommendations, so I’ll add another constraint: I’m only interested in transitions where the artists are not the same. Now the above list looks like this:

artist                    transitions  transition_prob
Beach House                        22         0.173228
Arcade Fire                         5         0.039370
Patrick Watson                      4         0.031496
The Tallest Man on Earth            3         0.023622
The Avett Brothers                  3         0.023622
Okkervil River                      3         0.023622
Carla Bruni                         3         0.023622
Pinback                             2         0.015748
North Highlands                     2         0.015748
Fleetwood Mac                       2         0.015748

The order is the same as before, but the transition probabilities are much higher. This is a reasonable list of artists to recommend to someone who listens to The Antlers. Even last.fm has Beach House and Okkervil River in the top related artists.
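Counting these transitions takes very little code. The sketch below is plain Python, not the pandas code used for the post. It assumes the plays are already sorted by listening time, and `transition_probs` and `skip_same` are names invented for the example.

```python
from collections import Counter

def transition_probs(plays, current, skip_same=True):
    """plays: artist names in listening order.
    Returns {next_artist: probability} for plays that follow `current`."""
    counts = Counter(
        nxt
        for cur, nxt in zip(plays, plays[1:])
        if cur == current and not (skip_same and nxt == current)
    )
    total = sum(counts.values())
    return {artist: n / total for artist, n in counts.items()} if total else {}

# A made-up listening history, oldest play first.
plays = [
    "The Antlers", "The Antlers", "Beach House",
    "The Antlers", "Arcade Fire", "The Antlers", "Beach House",
]
probs = transition_probs(plays, "The Antlers")
with_same = transition_probs(plays, "The Antlers", skip_same=False)
```

With `skip_same=False` the same-artist transitions stay in the denominator, which reproduces the first table's behavior. Dropping them renormalizes the remaining probabilities, which is why they come out higher in the second table.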

Modifications

We’re doing well so far, but let’s see if we can make it a little better with just a bit more work. Beach House is the top recommendation, but I listen to them a lot. Of all of the tracks I’ve recorded, 1.27% of them are Beach House tracks. Considering there are a total of 1,342 unique artists in my data set, that means I’m \(\frac{0.0127}{1 / 1342} = 17 \) times more likely to listen to Beach House than the “average” band.

So let’s use this information by dividing each transition probability by the unconditional probability of listening to a given artist. The unconditional probability is simply the total plays for each artist divided by the total number of plays (1624/127873 = 0.0127 for Beach House). The equation for the ranking has now become the following, where \( Pr(artist | Antlers) \) means the probability of listening to a given artist immediately after listening to The Antlers:

\[ \frac{Pr(artist | Antlers)}{Pr(artist)} \]

When I divide by the unconditional probability, I will give weight to artists that I listen to less often overall, making the results a little more exciting. If I multiply by this probability, however, I’ll give extra weight to artists I listen to more often, making the results more familiar. It’s probably important to point out that this is a bit of a hack that I came up with while typing up this blog post, and shouldn’t be confused with Bayes’ Theorem even though it looks sort of related. Anyway, let’s see what these rankings look like:

original                  less familiar           more familiar
------------------------------------------------------------------------------
Beach House               April March             Beach House
Arcade Fire               Broken Bells            Tom Waits
Patrick Watson            Beach House             Arcade Fire
The Tallest Man on Earth  Army of Ponch           Patrick Watson
The Avett Brothers        Mineral                 The Tallest Man on Earth
Okkervil River            Carla Bruni             Okkervil River
Carla Bruni               Arcade Fire             Bon Iver
Pinback                   North Highlands         The Avett Brothers
North Highlands           The Murder City Devils  Dillinger Four
Fleetwood Mac             Pinback                 Band of Horses

You can see that Tom Waits moved up on the chart on the right because I’ve listened to him more than anybody else. For the middle list, however, there’s lots of stuff that didn’t even make the original cut. A band like Army of Ponch might not seem like the best recommendation to someone currently listening to The Antlers, but I’ve made that transition twice and might want to again.
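Here is a minimal Python sketch of how the three orderings could be produced from the same inputs. The function name, the artist labels, and the numbers are all invented for illustration; only the divide-or-multiply idea comes from the post.

```python
def rank_variants(transition_prob, plays_per_artist):
    """transition_prob: {artist: Pr(artist | current artist)}.
    plays_per_artist: {artist: total plays} for every artist in the history.
    Returns (original, less_familiar, more_familiar) orderings."""
    total_plays = sum(plays_per_artist.values())
    # Unconditional probability of listening to each candidate artist.
    base = {a: plays_per_artist[a] / total_plays for a in transition_prob}
    original = sorted(transition_prob, key=transition_prob.get, reverse=True)
    less_familiar = sorted(
        transition_prob, key=lambda a: transition_prob[a] / base[a], reverse=True
    )
    more_familiar = sorted(
        transition_prob, key=lambda a: transition_prob[a] * base[a], reverse=True
    )
    return original, less_familiar, more_familiar

# Invented numbers: B is rarely played overall, C is played constantly.
transition_prob = {"Artist A": 0.5, "Artist B": 0.3, "Artist C": 0.2}
plays_per_artist = {"Artist A": 100, "Artist B": 10, "Artist C": 890}
original, less_familiar, more_familiar = rank_variants(transition_prob, plays_per_artist)
```

Dividing pushes the rarely played Artist B to the top, while multiplying pushes the heavily played Artist C to the top. That is the same effect that moves Tom Waits up the right-hand column.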

While we’re at it, here’s the list of recommendations for Bruce Springsteen:

original             less familiar             more familiar
----------------------------------------------------------------------
Chuck Ragan          Buddy Holly               Tom Waits
Built to Spill       Buckingham Nicks          Chuck Ragan
Camera Obscura       Sam Cooke                 Built to Spill
Tom Waits            The Jayhawks              Wilco
Wilco                Chuck Ragan               Camera Obscura
Okkervil River       Bridge and Tunnel         Okkervil River
Mean Creek           Built to Spill            Death Cab for Cutie
Death Cab for Cutie  Mastodon                  Old 97's
Dan Auerbach         Camera Obscura            Ryan Adams
Bridge and Tunnel    Iron & Wine and Calexico  Spoon

Just reading that list reminds me that Mean Creek has a new record out that I’m going to listen to right now.

Performance

So which list is best? How does it compare to standard information retrieval techniques? Well, that’s probably different for each person and the only way to find out is to test it. I could (and might eventually) put together a little app that recommends me some artists from my listening history based on what I’m listening to right now. With a simple A/B test, I could see which of the three recommendation algorithms I follow most often and stick with that one in the future. To do that, I would have to record

The artist that is currently playing

Which artist recommendations are displayed

Which recommended artist (if any) was played next

The recommendation algorithm that provides the highest play / display ratio is the one I’d like to go with in the future. This seems like an obvious place to plug in sifter for performing A/B and other types of testing in scenarios like this.
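A sketch of how that play / display ratio could be computed from such a log, in plain Python. The log layout, the function name, and every row below are assumptions made for the example. In particular, I'm assuming each log row also records which algorithm produced the recommendations.

```python
from collections import defaultdict

def play_display_ratio(log):
    """log: iterable of (algorithm, current_artist, displayed_artists, played_next).
    Returns {algorithm: share of displays where a recommended artist was played next}."""
    displays = defaultdict(int)
    plays = defaultdict(int)
    for algorithm, _current, displayed, played_next in log:
        displays[algorithm] += 1
        if played_next in displayed:
            plays[algorithm] += 1
    return {algorithm: plays[algorithm] / displays[algorithm] for algorithm in displays}

# Hypothetical log rows; played_next is None when nothing was played.
log = [
    ("original", "The Antlers", ["Beach House", "Arcade Fire"], "Beach House"),
    ("original", "The Antlers", ["Beach House", "Arcade Fire"], "Tom Waits"),
    ("less familiar", "The Antlers", ["April March", "Broken Bells"], "April March"),
    ("more familiar", "The Antlers", ["Beach House", "Tom Waits"], None),
]
ratios = play_display_ratio(log)
best = max(ratios, key=ratios.get)
```

With only a handful of rows the ratios are too noisy to act on, so a real test would need enough displays per algorithm before picking a winner.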

Conclusion

The point of this blog post is more about the thought process than the technical parts of the recommender. There are lots of things that I could have done “right,” like using properties of Markov Chains (which is essentially what I built) to improve the system, or account for the fact that Buddy Holly follows Bruce Springsteen in my music library, so maybe that isn’t a true transition.

I think the main takeaway is really in the constraints that I put on the system. The idea for this one-day project took the following course:

Build a music recommendation system

Build a music recommendation system that only uses my last.fm data

Build a music recommendation system that only recommends music I already know

Build an artist recommendation system, not songs

Only recommend artists that I’ve listened to immediately after the given artist

Come up with a few simple variations and test them for performance

Identifying the problem correctly let me build something in just a few hours. Is it the best recommender system the world has ever seen? Actually, it might be, because I’ve never seen any recommender that only suggests content that you are already familiar with and that’s what I wanted. But we’d have to test it against the likes of Last.fm, Spotify, and Pandora to find out.

As I said in the beginning, I’ve been thinking about this stuff a lot lately, but that’s not to say I’ve put this method into production anywhere. The code for all of this was done using python/pandas, and breaks pretty much every rule that I laid out in my previous blog post, so I’ll clean that up and get it posted soon.

#recommendations #data science

So, yes, lasso was a great idea, and I didn’t get the point. I’m still amazed in retrospect that as late as 2003, I was fitting uncontrolled regressions with lots of predictors and not knowing what to do. Or, I should say, as late as 2013, considering I still haven’t fully integrated these ideas into my work. I do use bayesglm() routinely, but that has very weak priors.

It's always impressive to see a leader in any field admit when they were "way way wrong" about a certain technique or idea. This post by Andrew Gelman is a great summary of what Lasso (or regularization in general) is, and how his opinion on it has changed over time.

Tibshirani announces new research result: A significance test for the lasso « Statistical Modeling, Causal Inference, and Social Science

#statistics #andrew gelman #lasso
