If you're a frequent user of e-commerce websites, you're probably familiar with the lists of related products these sites often feature to help customers find what they're looking for on the store. [Amazon](http://amazon.com), in particular, sports several such lists on every page, including "Frequently Bought Together" and "Customers Who Bought This Item Also Bought." If you've been to our site lately, or read [our announcement](http://inventables.blogspot.com/2010/09/new-beginning.html), you know that [Inventables](http://www.inventables.com) is now an e-commerce site. As such, we thought it would be a good time to explore adding some of these types of lists to our product pages. In this post I'll talk about how we scripted a process that uses [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/) to parse the millions of product views we've accumulated over the past year or so, and put it into a format that can be used to display the list "People who viewed this product also viewed" for every product on our site. To see examples of this list in the wild, check out one of [our](http://www.inventables.com/technologies/squishy-gel-magnet#viewed-technologies) [cool](http://www.inventables.com/technologies/bend-sensor#viewed-technologies) [products](http://www.inventables.com/technologies/rubber-glass#viewed-technologies). ### Formal Definition of our List Given two products `p1` and `p2`, we will define `p1 related-by-path-to p2` (and vice versa) if there is exists a user-session-specific path P where `page-view(p1)` and `page-view(p2)` are both members of P. Additionally, the weight of this relationship between `p1` and `p2` will be equal to the number of distinct paths for which they are both members. If 10 users viewed both `p1` and `p2` during a single visit, then those two products will have a path-relation of weight 10. To display the list for a given product, we will sort the set of related products in descending order of weight. For example, let's say you're looking at [Rubber Glass](http://www.inventables.com/technologies/rubber-glass). If you see [Flexible Metal Brick](http://www.inventables.com/technologies/flexible-metal-brick) as the first item in the People who viewed Rubber Glass also viewed list, then that means more people who looked at Rubber Glass looked at Flexible Metal Brick than any other product on the site. ### The Raw Data To build our list we'll be drawing on a single table from our application database. This table stores an entry for every unique product viewed by a user during a given session. Only two columns from this table are needed for our task: `path_id` and `product_id`. We can export this data from our [Amazon RDS](http://aws.amazon.com/rds/) MySQL instance, using the mysql command line utility like so: mysql -u $user --password=$pass --host=$host --batch -e \ "SELECT path_id, product_id FROM recent_items" \ $database > /tmp/input/recent_items.tsv This tab-separated file will constitute the initial input to our MapReduce workflow. ### Tool Selection I selected [Hadoop Streaming](http://hadoop.apache.org/mapreduce/docs/current/streaming.html) to drive the MapReduce workflow because of the large volume of data we're dealing with, and also because we run our application infrastructure on the [Amazon AWS](http://aws.amazon.com/) platform, which, in the form of their [Amazon Elastic MapReduce](http://aws.amazon.com/elasticmapreduce/) offering, has built-in support for Hadoop Streaming. For now we can get by with running Hadoop in [single node mode](http://hadoop.apache.org/common/docs/current/single_node_setup.html), but if our scale grows or we start doing more things with MapReduce, we'll be able to easily switch to true [cluster mode](http://hadoop.apache.org/common/docs/current/cluster_setup.html). ### Solution Outline The goal of this exercise was to start with a table of `path_id, product_id` values and ultimately generate a new table of `source_product_id, target_product_id, weight` values, where the source and target product are related to each other with the given weight. From this final data format, it is trivial to build a gallery to display the top n related products for a given source product. To get from the millions of `path_id, product_id` records to the desired `source_product_id, target_product_id, weight` format using MapReduce, we take a two-phase approach. ### MapReduce Phase 1 This phase creates an output file where each row represents one instance of a product-to-product relationship from a single path. As such, each row will carry a weight of 1. ## Map: The input to this function is already in the desired format of `path_id, product_id`, so this mapping step is just the identity function. ## Reduce: Hadoop MapReduce guarantees that all rows for a given key emitted by the map step will go to the same reducer. In our case, the key is a path id. So in this first reducer, we iterate over all the rows for a given path, appending each product id to an array. Once we have all the product id's for a given path, we can emit all of the permutations of `product_id, product_id` pairs. For example, let's say the input included the following rows: path1, product1 path1, product2 path1, product3 (The user looked at 3 products during his session.) The reducer would first map `path1` to the array `[product1, product2, product3]`. Then, it would emit six key/value pairs of the form `source_product_id_target_product_id, weight`: product1_product2, 1 product1_product3, 1 product2_product1, 1 product2_product3, 1 product3_product1, 1 product3_product2: 1 And the ruby source: ### MapReduce Phase 2 The second and final MapReduce will sum the weights associated with every unique `product1_product2` key. ## Map: Since the input to the second map phase is already in the desired format of `product1_product2, weight`, the identity function can also be used here. ## Reduce: The reduce step, being given all the weights for a given product-product combination at a time, simply iterates over each key, sums the weight, and emits one key/value pair with the total. Ruby source: ### Pulling in the data After the second MapReduce phase completes, we have the data we need. All that's needed is to import it into our database. We can accomplish this again with the mysql command line utility: mysqlimport --local --compress -u $user --host=$host \ --columns=source_product_id, target_product_id,count \ --replace $database \ /tmp/final_output/viewed_products.tsv Note: the `viewed_products` table has a unique index on `source_product_id, target_product_id`. ### Stitching it all up We want our process to be automated so that it can be run on a schedule. This can be accomplished with a [simple shell script](https://gist.github.com/720445#file_gistfile3.sh). ### Conclusion As you can see, it wasn't hard at all to build our first list of related products using Hadoop MapReduce. We plan to do a lot more in this realm going forward. --Jeff