PriceWeave @priceweave - Tumblr Blog

Mining Twitter: Analyzing Social Reactions to Products and Brands

[This post was written by Dipanjan. Dipanjan works in the Engineering Team with Mandar, addressing some of the problems related to Data Semantics. He loves watching English Sitcoms in his spare time.]

This is the second post in our series on social media analysis. We have already talked about Twitter mining in depth, and about how to analyze social trends in general and gather insights from YouTube. If you are more interested in developing a quick sentiment analysis app, you can check our short tutorial on that as well.

Our flagship product, PriceWeave, is all about delivering real-time actionable insights at scale. PriceWeave helps retailers and brands take decisions on product pricing, promotions, and assortments on a day-to-day basis. One of the areas we focus on is “Social Intelligence”, where we measure our customers' social presence in terms of their reach and engagement on different social channels. Social Intelligence also helps in discovering brands and products trending on social media.

Today, I will be talking about how we can get data from Twitter in real time and perform some interesting analytics on top of that to understand social reactions to trending brands and products. In our last post, we used Twitter’s Search API to get a selective set of tweets and performed some analytics on them. Today, we will be using Twitter’s Streaming API to access data feeds in real time. The two APIs differ in a couple of ways. The Search API is primarily a REST API which can be used to query for “historical data”, whereas the Streaming API gives us access to Twitter’s global stream of tweets. Moreover, it lets you acquire much larger volumes of data with keyword filters in real time compared to normal search.

Installing Dependencies

I will be using Python for my analysis as usual, so install it if you don’t have it already. You can use another language of your choice, but remember to use the relevant libraries of that language. To get started, install the following packages if you don't have them already. We use simplejson for JSON data processing at DataWeave, but you are most welcome to use the stock json library.

    [root@dip]# pip install twitter
    [root@dip]# pip install simplejson
    [root@dip]# pip install prettytable
    [root@dip]# pip install matplotlib
    [root@dip]# pip install nltk
    [root@dip]# pip install pandas

Acquiring Data

We will use the Twitter Streaming API and the equivalent Python wrapper to get the required tweets. Since we will be looking to get a large number of tweets in real time, there is the question of where we should store the data and what data model should be used. In general, when building a robust API or application over Twitter data, MongoDB, being a schemaless document-oriented database, is a good choice. It also supports expressive queries with indexing, filtering and aggregations. However, since we are going to analyze a relatively small sample of data using pandas, we shall be storing the tweets in flat files.

Note: Should you prefer to sink the data to MongoDB, the mongoexport command line tool can be used to export it to a newline delimited format that is exactly the same as what we will be writing to a file.

The following code snippet shows you how to create a connection to Twitter’s Streaming API and filter for tweets containing a specific keyword. For simplicity, each tweet is saved in a newline delimited file as a JSON document. Since we will be dealing with products and brands, I have queried on two trending brands, ‘Sony’ and ‘Microsoft’, and two trending products, ‘iPhone 6’ and ‘Galaxy S5’. You can write the code snippet as a function for ease of use and call it for specific queries to do a comparative study.

    import io
    import simplejson as json
    import twitter

    # Go to https://apps.twitter.com/ to create an app and get values for these credentials
    CONSUMER_KEY = ''
    CONSUMER_SECRET = ''
    OAUTH_TOKEN = ''
    OAUTH_TOKEN_SECRET = ''

    # Authenticate with OAuth
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

    # Create a connection to the Twitter Streaming API
    twitter_stream = twitter.TwitterStream(auth=auth)

    QUERY = 'microsoft'
    OUT_FILE = 'tweets_' + QUERY + '.json'

    print 'Filtering the public timeline for "{0}"'.format(QUERY)
    stream = twitter_stream.statuses.filter(track=QUERY)

    # Write one tweet per line as a JSON document.
    with io.open(OUT_FILE, 'a', encoding='utf-8', buffering=1) as f:
        for tweet in stream:
            f.write(unicode(u'{0}\n'.format(json.dumps(tweet, ensure_ascii=False))))
            # Limit notices and other non-tweet messages carry no 'text' field
            if 'text' in tweet:
                print tweet['text']

Let the data stream for a significant period of time so that you can capture a sizeable sample of tweets.
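As suggested earlier, the capture loop can be wrapped in a function and reused for each query term. What follows is only a minimal sketch, written for Python 3 with the stock json library rather than the Python 2 and simplejson used in this post. The name save_stream and its max_tweets parameter are my own inventions for illustration, and a plain list stands in for the live stream so the sketch runs without credentials.

```python
import io
import json
import os
import tempfile


def save_stream(tweets, out_path, max_tweets=None):
    """Append each message from an iterable to out_path, one JSON document per line.

    Returns the number of messages written. max_tweets=None means no limit.
    """
    written = 0
    with io.open(out_path, 'a', encoding='utf-8') as f:
        for tweet in tweets:
            f.write(json.dumps(tweet, ensure_ascii=False) + u'\n')
            written += 1
            if max_tweets is not None and written >= max_tweets:
                break
    return written


# Stand-in for twitter_stream.statuses.filter(track=...): any iterable of dicts works.
fake_stream = [
    {'id': 1, 'text': u'Sony PS4 deal'},
    {'id': 2, 'text': u'caf\u00e9 sony'},
]

# A temporary directory keeps the example from touching real data files.
out_path = os.path.join(tempfile.mkdtemp(), 'tweets_sony.json')
count = save_stream(fake_stream, out_path)
```

With a real connection you would pass the stream object in place of fake_stream and call the function once per query term, each with its own output file.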

Analyses and Visualizations

Now that you have amassed a collection of tweets from the API in a newline delimited format, let's start with the analyses. One of the easiest ways to load the data into pandas is to build a valid JSON array of the tweets. This can be accomplished using the following code segment.

    import pandas as pd

    DATA_FILES = ['tweets_microsoft.json', 'tweets_sony.json',
                  'tweets_galaxys5.json', 'tweets_iphone6.json']
    data_frames = dict()
    for data_file in DATA_FILES:
        # Join the newline delimited tweets into a single valid JSON array
        data = "[{0}]".format(",".join([line for line in open(data_file).readlines()]))
        # Key each data frame by the query term in the file name, e.g. 'sony'
        data_frames[data_file.split('_')[1].split('.')[0]] = pd.read_json(data, orient='records')

    # All the values should be of data frame type
    print {k: type(v) for k, v in data_frames.items()}
    # To see an individual sample data frame
    print data_frames['sony']

Note: With pandas, you will need to have an amount of working memory proportional to the amount of data that you’re analyzing. Once you run this, you should get a dictionary containing 4 data frames. The output I obtained is shown in the snapshot below.
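If you want to inspect the raw tweets without pandas, the same "wrap the lines in a JSON array" idea works with the standard library alone. This is a rough Python 3 sketch; load_tweets is a hypothetical helper name, and the two-line sample file is made up purely so the example runs.

```python
import io
import json
import os
import tempfile


def load_tweets(path):
    """Read a newline delimited JSON file and return a list of dicts."""
    with io.open(path, encoding='utf-8') as f:
        # Skip blank lines so a stray empty line does not break the array.
        lines = [line.strip() for line in f if line.strip()]
    # Joining the documents with commas inside [ ] yields one valid JSON array.
    return json.loads(u'[{0}]'.format(u','.join(lines)))


# Hypothetical sample file, written only to make the sketch runnable.
sample_path = os.path.join(tempfile.mkdtemp(), 'tweets_sample.json')
with io.open(sample_path, 'w', encoding='utf-8') as f:
    f.write(u'{"id": 1, "lang": "en"}\n{"id": 2, "lang": "ja"}\n')

tweets = load_tweets(sample_path)
```

The resulting list of dictionaries is handy for quick checks, while the pandas data frames remain the better fit for the grouping and counting that follows.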

Note: Per the Streaming API guidelines, Twitter will only provide up to 1% of the total volume of real-time tweets. When more tweets match your filter than that, the excess is dropped and the stream includes a “limit notice” message instead. The next snippet shows how to separate out and remove the limit notice rows if you encounter them.

    # Capture the limit notices by indexing into the data frame for non-null "limit" fields
    # df is a data frame here
    limit_notices = df[pd.notnull(df.limit)]
    # Drop the limit notice rows by keeping only the rows that have a tweet id
    df = df[pd.notnull(df['id'])]
    print "Number of total tweets that were rate-limited", sum([ln['track'] for ln in limit_notices.limit])
    print "Total number of limit notices", len(limit_notices)
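The same separation can also be done on the parsed messages before they reach pandas. The sketch below is in Python 3 and assumes a limit notice is a message shaped like {"limit": {"track": N}}, which is what the ln['track'] lookup above implies. The function name split_limit_notices and the sample records are made up for illustration.

```python
def split_limit_notices(records):
    """Separate ordinary tweets from limit notices in a list of parsed stream messages."""
    tweets = [r for r in records if 'id' in r]        # real tweets carry an id
    notices = [r for r in records if 'limit' in r]    # limit notices carry a 'limit' key
    return tweets, notices


# Made-up messages: two tweets with one limit notice in between.
records = [
    {'id': 1, 'text': 'sony'},
    {'limit': {'track': 12}},
    {'id': 2, 'text': 'microsoft'},
]
tweets, notices = split_limit_notices(records)
```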

Time-based Analysis

Each tweet we captured has a specific time when it was created. To analyze the time period over which we captured these tweets, let’s create a time-based index on the created_at field of each tweet, so that we can perform a time-based analysis and see at what times people post most frequently about our query terms.

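    from prettytable import PrettyTable

    pt = PrettyTable(['Brand \ Product', 'First tweet timestamp (UTC)', 'Last tweet timestamp (UTC)'])
    for key in sorted(data_frames.keys()):
        # Index each data frame by tweet creation time, keeping created_at as a column too
        data_frames[key].set_index('created_at', drop=False, inplace=True)
        created_at = data_frames[key]['created_at']
        pt.add_row([key, created_at.iloc[0], created_at.iloc[-1]])
    print pt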

The output I obtained is shown in the snapshot below.

I had started capturing the Twitter stream at around 7 pm on the 6th of December and stopped it at around 11:45 am on the 7th of December. So the results seem consistent based on that. With a time-based index now in place, we can trivially do some useful things like calculate the boundaries, compute histograms and so on. Operations such as grouping by a time unit are also easy to accomplish and seem a logical next step. The following code snippet illustrates how to group by the “hour” of our data frame, which is exposed as a datetime.datetime timestamp since we now have a time-based index in place. We print an hourly distribution of tweets also just to see which brand/product was most talked about on Twitter during that time period.

    brands = ['sony', 'microsoft']
    products = ['iphone6', 'galaxys5']

    # Group each data frame by the hour of its time-based index
    brands_grouped_time = {key: data_frames[key].groupby(lambda x: x.hour) for key in brands}
    products_grouped_time = {key: data_frames[key].groupby(lambda x: x.hour) for key in products}

    # dividing by 1000 in the distribution because people talked more about brands during this time period
    for brand, brand_grouped_time in brands_grouped_time.items():
        print "\nNumber of relevant tweets by the hour (UTC) for", brand
        pt = PrettyTable(['Hour', 'Total Tweets', 'Tweet Distribution'])
        pt.align["Tweet Distribution"] = "l"
        for hour, group in brand_grouped_time:
            pt.add_row([hour, len(group), '*' * (len(group) // 1000)])
        print pt

    # dividing by 100 in the distribution because people talked less about products during this time period
    for product, product_grouped_time in products_grouped_time.items():
        print "\nNumber of relevant tweets by the hour (UTC) for", product
        pt = PrettyTable(['Hour', 'Total Tweets', 'Tweet Distribution'])
        pt.align["Tweet Distribution"] = "l"
        for hour, group in product_grouped_time:
            pt.add_row([hour, len(group), '*' * (len(group) // 100)])
        print pt

The outputs I obtained are depicted in the snapshot below.

The “Hour” field here follows a 24-hour format. Interestingly, people have been talking more about Sony than Microsoft among brands. Among products, iPhone 6 seems to be trending more than Samsung’s Galaxy S5. The hourly trend also suggests that people tend to talk more on Twitter in the morning and late evening.
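For readers who want to see the hourly grouping without pandas, here is a small Python 3 sketch that counts tweets per hour straight from the created_at strings. The helper names and the three sample timestamps are made up. The format string matches the way Twitter writes created_at values, for example "Sat Dec 06 19:02:11 +0000 2014".

```python
from collections import Counter
from datetime import datetime

# Layout of Twitter's created_at strings, e.g. "Sat Dec 06 19:02:11 +0000 2014".
CREATED_AT_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def tweets_per_hour(tweets):
    """Count tweets by the hour (0-23, UTC) in which they were created."""
    hours = [datetime.strptime(t['created_at'], CREATED_AT_FORMAT).hour for t in tweets]
    return Counter(hours)


def text_histogram(counts, per_star=1):
    """Render one 'HH ***' row per hour, with one star per `per_star` tweets."""
    return ['{0:02d} {1}'.format(hour, '*' * (counts[hour] // per_star))
            for hour in sorted(counts)]


# Made-up sample covering two different hours.
sample = [
    {'created_at': 'Sat Dec 06 19:02:11 +0000 2014'},
    {'created_at': 'Sat Dec 06 19:40:00 +0000 2014'},
    {'created_at': 'Sun Dec 07 07:15:30 +0000 2014'},
]
hourly = tweets_per_hour(sample)
rows = text_histogram(hourly)
```

The per_star argument plays the same role as the division by 1000 or 100 in the tables above: it scales the bars so that busy terms still fit on one line.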

Time-based Visualizations

It could be helpful to further subdivide the time ranges into smaller intervals so as to increase the resolution of the extremes. Therefore, let’s group into a custom interval by dividing the hour into 15-minute segments. The code is pretty much the same as before except that you call a custom function to perform the grouping. This time, we will be visualizing the distributions using matplotlib.

    import matplotlib.pyplot as plt

    def group_by_15_min_intervals(x):
        if 0 <= x.minute <= 15:
            return (x.hour, "0-15")
        elif 15 < x.minute <= 30:
            return (x.hour, "16-30")
        elif 30 < x.minute <= 45:
            return (x.hour, "31-45")
        else:
            return (x.hour, "46-00")

    brands_grouped_time = {key: data_frames[key].groupby(group_by_15_min_intervals) for key in brands}
    products_grouped_time = {key: data_frames[key].groupby(group_by_15_min_intervals) for key in products}

    def plot_trend(grouped, style, label):
        # x values look like hour.start-minute (e.g. 7.16); skip the first and last buckets
        xs = [float(str(hour[0]) + '.' + hour[1].split('-')[0]) for hour, group in grouped][1:-1]
        ys = [len(group) for hour, group in grouped][1:-1]
        plt.plot(xs, ys, style, label=label)

    # Plot for brands (a new figure keeps the two charts from being drawn on top of each other)
    plt.figure()
    plt.ylabel("Tweet Volume")
    plt.xlabel("Time")
    plt.title("Brands Social Trend")
    plot_trend(brands_grouped_time['sony'], 'r', 'Sony')
    plot_trend(brands_grouped_time['microsoft'], 'b', 'Microsoft')
    plt.legend()

    # Plot for products
    plt.figure()
    plt.ylabel("Tweet Volume")
    plt.xlabel("Time")
    plt.title("Products Social Trend")
    plot_trend(products_grouped_time['iphone6'], 'r', 'iPhone 6')
    plot_trend(products_grouped_time['galaxys5'], 'b', 'Galaxy S5')
    plt.legend()
    plt.show()
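The bucketing function above puts minute 15 into the first bucket, so that bucket spans 16 minutes while the last one spans 14. If you prefer four equal buckets, integer division is one possible alternative. This is only a Python 3 sketch with hypothetical function names, not the function used for the plots in this post.

```python
from datetime import datetime


def quarter_hour_bucket(ts):
    """Map a timestamp to (hour, start minute of its 15-minute bucket)."""
    return (ts.hour, (ts.minute // 15) * 15)


def bucket_to_decimal_hour(bucket):
    """Convert a bucket such as (19, 45) to a plottable x value such as 19.75."""
    hour, start_minute = bucket
    return hour + start_minute / 60.0


example = quarter_hour_bucket(datetime(2014, 12, 6, 19, 47))
x_value = bucket_to_decimal_hour(example)
```

Because the buckets are tuples of integers they sort chronologically within a day, and the decimal-hour conversion spaces the points evenly along the x axis.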

The two visualizations are depicted below. Of course, ignore the section of the plots from after 11:30 am to around 7 pm, because I collected no tweets during this time. That gap shows up as a steep rise in the curve and is insignificant. The real regions of significance are from hour 7 to 11:30 and hour 19 to 22. Considering brands, the visualization for Microsoft vs. Sony is depicted below. Sony is the clear winner here.

Considering products, the visualization for iPhone 6 vs. Galaxy S5 is depicted below. The clear winner here is definitely iPhone 6.

Tweeting Frequency Analysis

In addition to time-based analysis, we can do other types of analysis as well. The most popular analysis in this case would be frequency based analysis of the users authoring the tweets. The following code snippet will compute the Twitter accounts that authored the most tweets and compare it to the total number of unique accounts that appeared for each of our query terms.

    from collections import Counter

    # Just to jog your memory, this was already initialized earlier
    brands = ['sony', 'microsoft']
    products = ['iphone6', 'galaxys5']

    # For brands
    # Expand each tweet's nested 'user' dictionary into its own data frame
    brands_user_coll = {key: data_frames[key].pop('user').apply(pd.Series) for key in brands}
    brands_authors = {key: brands_user_coll[key].screen_name for key in brands_user_coll.keys()}
    brands_authors_counter = {key: Counter(brands_authors[key].values) for key in brands_authors.keys()}

    # Display the results in a neat tabulated form
    for brand in brands_authors_counter.keys():
        print "\nMost frequent (top 10) authors of tweets for", brand
        pt = PrettyTable(['Author', 'Tweet Count'])
        for a, f in brands_authors_counter[brand].most_common(10):
            pt.add_row([a, f])
        print pt
        num_unique_authors = len(set(brands_authors[brand].values))
        print "There are {0} unique authors out of {1} tweets".format(num_unique_authors, len(data_frames[brand]))

    # For products
    products_user_coll = {key: data_frames[key].pop('user').apply(pd.Series) for key in products}
    products_authors = {key: products_user_coll[key].screen_name for key in products_user_coll.keys()}
    products_authors_counter = {key: Counter(products_authors[key].values) for key in products_authors.keys()}

    # Display the results in a neat tabulated form
    for product in products_authors_counter.keys():
        print "\nMost frequent (top 10) authors of tweets for", product
        pt = PrettyTable(['Author', 'Tweet Count'])
        for a, f in products_authors_counter[product].most_common(10):
            pt.add_row([a, f])
        print pt
        num_unique_authors = len(set(products_authors[product].values))
        print "There are {0} unique authors out of {1} tweets".format(num_unique_authors, len(data_frames[product]))

The results which I obtained are depicted below.

What we do notice is that a lot of these tweets are also made by bots, advertisers and SEO technicians. Some examples are Galaxy_Sleeves and iphone6_sleeves which are obviously selling covers and cases for the devices.
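Stripped of pandas, the author count boils down to a Counter over screen names. Here is a small Python 3 sketch on made-up tweets; author_counts is a hypothetical helper and the account names are invented.

```python
from collections import Counter


def author_counts(tweets):
    """Count tweets per author screen name."""
    return Counter(t['user']['screen_name'] for t in tweets)


# Made-up sample: one ordinary user and one account that posts repeatedly.
sample = [
    {'user': {'screen_name': 'alice'}},
    {'user': {'screen_name': 'bot_seller'}},
    {'user': {'screen_name': 'bot_seller'}},
    {'user': {'screen_name': 'bot_seller'}},
]
counts = author_counts(sample)
top = counts.most_common(1)    # most active author first
unique_authors = len(counts)   # number of distinct accounts
```

Accounts that dominate the top of such a list are good candidates for a closer look, since they are often the bots and sellers mentioned above.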

Tweeting Frequency Visualizations

After frequency analysis, we can plot these frequency values to get better intuition about the underlying distribution, so let’s take a quick look at it using histograms. The following code snippet created these visualizations for both brands and products using subplots.

    # Brands Tweets Visualizations
    fig, axes = plt.subplots(2, sharex=True)
    axes[0].hist(sorted(brands_authors_counter['sony'].values()), bins=20, alpha=0.7,
                 label='Sony', log=True, color='r')
    axes[0].set_title('Sony')
    axes[1].hist(sorted(brands_authors_counter['microsoft'].values()), bins=20, alpha=0.7,
                 label='Microsoft', log=True, color='b')
    axes[1].set_title('Microsoft')
    for ax in axes:
        ax.set_xlabel('Number of Tweets')
        ax.set_ylabel('Number of Authors')

    # Products Tweets Visualizations
    fig, axes = plt.subplots(2, sharex=True)
    axes[0].hist(sorted(products_authors_counter['iphone6'].values()), bins=20, alpha=0.7,
                 label='iPhone 6', log=True, color='r')
    axes[0].set_title('iPhone 6')
    axes[1].hist(sorted(products_authors_counter['galaxys5'].values()), bins=20, alpha=0.7,
                 label='Galaxy S5', log=True, color='b')
    axes[1].set_title('Galaxy S5')
    for ax in axes:
        ax.set_xlabel('Number of Tweets')
        ax.set_ylabel('Number of Authors')

The visualizations I obtained are depicted below.

The distributions follow the “Pareto Principle” as expected: a small number of users post a large number of tweets, while the majority of users post only a few. Besides that, based on the tweet distributions, Sony and iPhone 6 are trending more than their counterparts.
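One way to put a number on this skew is to compute the share of tweets written by the most active fifth of the authors. The following Python 3 sketch uses made-up counts, and top_share is a hypothetical helper rather than part of the analysis above.

```python
import math


def top_share(counts, fraction=0.2):
    """Share of all tweets written by the most active `fraction` of authors."""
    ordered = sorted(counts, reverse=True)
    if not ordered:
        return 0.0
    # Always include at least one author, and round the group size up.
    top_n = max(1, int(math.ceil(len(ordered) * fraction)))
    return sum(ordered[:top_n]) / float(sum(ordered))


# Made-up tweet counts for ten authors, 100 tweets in total.
tweet_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
share = top_share(tweet_counts)
```

In this made-up sample the two busiest authors account for 70 of the 100 tweets. With real data you could pass in the values of one of the author counters built earlier.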

Locale Analysis

Another important insight would be to see where your target audience is located and how often they tweet. The following code snippet approximates this using the language code (lang) attached to each tweet.

    # Top ten locales for Brands
    for brand in brands:
        print 'Top 10 locales for', brand
        pt = PrettyTable(['Language', 'Tweets'])
        top_ten = dict(data_frames[brand].lang.value_counts()[:10])
        top_ten = [[key, top_ten[key]] for key in sorted(top_ten.keys())]
        # Order the rows by tweet count, highest first
        top_ten.sort(key=lambda row: row[1], reverse=True)
        for row in top_ten:
            pt.add_row(row)
        print pt

    # Top ten locales for Products
    for product in products:
        print 'Top 10 locales for', product
        pt = PrettyTable(['Language', 'Tweets'])
        top_ten = dict(data_frames[product].lang.value_counts()[:10])
        top_ten = [[key, top_ten[key]] for key in sorted(top_ten.keys())]
        # Order the rows by tweet count, highest first
        top_ten.sort(key=lambda row: row[1], reverse=True)
        for row in top_ten:
            pt.add_row(row)
        print pt

The outputs which I obtained are depicted in the following snapshot. Remember that Twitter follows the ISO 639–1 language code convention.

The trend we see is that most of the tweets are in English, as expected. Surprisingly, most of the tweets regarding iPhone 6 are in Japanese, which suggests they come from Japan!

Analysis of Trending Topics

In this section, we will see some of the topics which are associated with the terms we used for querying Twitter. For this, we will be running our analysis on the English language tweets. We will be using the nltk library here to take care of a couple of things, like removing stopwords which have little significance. I will be doing the analysis here for brands only, but you are most welcome to try it out with products too, because the following code snippet can be used for both computations.

    from collections import Counter
    import nltk
    import copy

    # This is just a rough compilation, it can be done in a better way
    # The stopword list needs a one-time nltk.download('stopwords')
    ignore_terms = []
    ignore_terms.extend(nltk.corpus.stopwords.words('english'))
    ignore_terms.extend(['-', '#', '', 'rt'])  # ignoring tokens like retweets and symbols

    # Analysis for brands
    print '\n\nAnalysis for Sony'
    sony_df = copy.copy(data_frames['sony'])
    sony_en_text = sony_df[sony_df['lang'] == 'en'].pop('text')
    # Lowercase each whitespace-separated token and trim trailing punctuation
    sony_tokens = []
    for txt in sony_en_text.values:
        sony_tokens.extend([t.lower().strip(":,.") for t in txt.split()])
    sony_tokens_counter = Counter(sony_tokens)
    for t in ignore_terms:
        sony_tokens_counter.pop(t, None)
    print '\nMost common terms:'
    print sony_tokens_counter.most_common(20)
    print '\nMost common phrases:'
    nltk.Text([token.encode('utf-8').strip() for token in sony_tokens]).collocations()

    print '\n\nAnalysis for Microsoft'
    microsoft_df = copy.copy(data_frames['microsoft'])
    microsoft_en_text = microsoft_df[microsoft_df['lang'] == 'en'].pop('text')
    microsoft_tokens = []
    for txt in microsoft_en_text.values:
        microsoft_tokens.extend([t.lower().strip(":,.") for t in txt.split()])
    microsoft_tokens_counter = Counter(microsoft_tokens)
    for t in ignore_terms:
        microsoft_tokens_counter.pop(t, None)
    print '\nMost common terms:'
    print microsoft_tokens_counter.most_common(20)
    print '\nMost common phrases:'
    nltk.Text([token.encode('utf-8').strip() for token in microsoft_tokens]).collocations()

What the above code does is take each tweet, tokenize it, compute term frequencies and output the 20 most common terms for each brand. Of course, an n-gram analysis can give a deeper insight into trending topics, but much the same can be accomplished with nltk’s collocations function, which takes in the tokens and outputs pairs of words that frequently occur together. The outputs I obtained are depicted in the snapshot below.
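To give a feel for the n-gram idea mentioned above, here is a bare-bones bigram count in Python 3. It simply counts adjacent token pairs and is not what nltk’s collocations function computes, since that function also scores pairs by how strongly the words are associated. The token list is made up.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Made-up tokens standing in for a flattened list of tweet words.
tokens = 'sony hack north korea denies sony hack'.split()
common = bigram_counts(tokens).most_common(1)
```

Note that when the tokens of many tweets are flattened into one list, as in the snippet above, some pairs will straddle two different tweets. Counting bigrams per tweet and then summing the counters would avoid that.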

Some interesting insights we see from the above outputs are as follows.

Sony was hacked recently and it was rumored that North Korea was responsible for that, however they have denied that. We can see that is trending on Twitter in context of Sony. You can read about it here.

Sony has recently introduced Project Sony Skylight which lets you customize your PS4.

There are rumors of Lumia 1030, Microsoft’s first flagship phone. People are also talking a lot about Windows 10, the next OS which is going to be released by Microsoft pretty soon.

Interestingly, “ebay price” comes up for both the brands, this might be an indication that eBay is offering discounts for products from both these brands.

To get a detailed view on the tweets matching some of these trending terms, we can use nltk’s concordance function as follows.

    print 'Tweets for Sony talking about hack'
    nltk.Text([token.encode('utf-8').strip() for token in sony_tokens]).concordance('hack')
    print 'Tweets for Microsoft talking about Lumia 1030'
    nltk.Text([token.encode('utf-8').strip() for token in microsoft_tokens]).concordance('1030')

The outputs I obtained are as follows. We can clearly see the tweets which contain the token we searched for. In case you are unable to view the text clearly, click on the image to zoom.
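If you are curious what a concordance does under the hood, the idea is simply to show each occurrence of a token with a few words of context on either side. Below is a simplified Python 3 sketch of that idea. It is not nltk’s implementation, and keyword_in_context and its width parameter are hypothetical names.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens on either side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(' '.join(left + [token] + right))
    return hits


# Made-up tokens standing in for tweet text.
tokens = 'rumors say the sony hack came from abroad'.split()
hits = keyword_in_context(tokens, 'hack', width=2)
```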

Thus, you can see that the Twitter Streaming API is a really good source to track social reaction to any particular entity whether it is a brand or a product. On top of that, if you are armed with an arsenal of Python’s powerful analysis tools and libraries, you can get the best insights from the unending stream of tweets. That’s all for now folks! Before I sign off, I would like to thank Matthew A. Russell and his excellent book Mining the Social Web once again, without which this post would not have been possible.


Mining Twitter in Depth to Understand and Analyze Product Trends

[This post was written by Dipanjan. Dipanjan works in the Engineering Team with Mandar, addressing some of the problems related to Data Semantics. He loves watching English Sitcoms in his spare time. This blog post is cross posted on the DataWeave blog as well.]

Social media can be defined as virtual communities and networks, where social interaction takes place among people and a wide variety of content is shared including ideas, opinions, information, pictures, videos and much more. Due to the massive growth of social media in the last decade, it has become a rage among data enthusiasts to tap into the vast pool of social data and gather interesting insights like trending items, reception of newly released products by society, and popularity measures, to name a few.

As you are aware, we are constantly evolving PriceWeave, which has the most extensive set of offerings when it comes to providing actionable insights to retail stores and brands. As part of the product development, we look at social data from a variety of channels to mine things like: trending products/brands; social engagement of stores/brands; what content "works" and what doesn't on social media, and so forth. We do a number of experiments with Twitter data, and this series of blog posts is one of the outputs from those efforts.

In some of our recent blog posts, we have seen how to look at current trends and gather insights from YouTube, the popular video sharing website. We have also talked about how to create a quick bare-bones web application to perform sentiment analysis of tweets from Twitter. Today I will be talking about mining data from Twitter and doing much more with it than just sentiment analysis. We will be analyzing Twitter data in depth and then we will try to get some interesting insights from it.

To get data from Twitter, first we need to create a new Twitter application to get OAuth credentials and access to their APIs.
For doing this, head over to the Twitter Application Management page and sign in with your Twitter credentials. Once you are logged in, click on the Create New App button as you can see in the snapshot below. Once you create the application, you will be able to view it in your dashboard, just as the application I created, named DataScienceApp1_DS, shows up in my dashboard depicted below. Clicking the application takes you to your application management dashboard. Here, you will find the necessary keys you need in the Keys and Access Tokens section. The main tokens you need are highlighted in the snapshot below.

I will be doing most of my analysis using the Python programming language. To be more specific, I will be using the IPython shell, but you are most welcome to use the language of your choice, provided you get the relevant API wrappers and necessary libraries.

Installing necessary packages

After obtaining the necessary tokens, we will be installing some necessary libraries and packages, namely twitter, prettytable and matplotlib. Fire up your terminal or command prompt and use the following commands to install the libraries if you don't have them already.

    [root@dip]# pip install twitter
    [root@dip]# pip install prettytable
    [root@dip]# pip install matplotlib

Creating a Twitter API Connection

Once the packages are installed, you can start writing some code. For this, open up the IDE or text editor of your choice and use the following code segment to create an authenticated connection to Twitter's API. The following code snippet works by using your OAuth credentials to create an object called auth that represents your OAuth authorization. This is then passed to a class called Twitter belonging to the twitter library, and we create a resource object named twitter_api that is capable of issuing queries to Twitter's API.

    import twitter

    CONSUMER_KEY = 'REPLACE WITH YOUR KEY'
    CONSUMER_SECRET = 'REPLACE WITH YOUR SECRET'
    OAUTH_TOKEN = 'REPLACE WITH YOUR TOKEN'
    OAUTH_TOKEN_SECRET = 'REPLACE WITH YOUR TOKEN SECRET'

    # Build the OAuth authorization object and an API resource object that uses it
    auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
    twitter_api = twitter.Twitter(auth=auth)
    print twitter_api

If you do a print twitter_api and all your tokens are correct, you should be getting something similar to the snapshot below. This indicates that we've successfully used OAuth credentials to gain authorization to query Twitter's API.

Exploring Trending Topics

Now that we have a working Twitter resource object, we can start issuing requests to Twitter. Here, we will be looking at the topics which are currently trending worldwide using some specific API calls. The API can also be parameterized to constrain the topics to more specific locales and regions. Each query uses a unique identifier which follows the Yahoo! GeoPlanet’s Where On Earth (WOE) ID system, which is an API itself that aims to provide a way to map a unique identifier to any named place on Earth. The following code segment retrieves trending topics in the world, the US and in India.

    import json

    # Where On Earth (WOE) IDs for the places we are interested in
    WORLD_WOE_ID = 1
    US_WOE_ID = 23424977
    IND_WOE_ID = 23424848

    world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
    us_trends = twitter_api.trends.place(_id=US_WOE_ID)
    india_trends = twitter_api.trends.place(_id=IND_WOE_ID)

    print world_trends
    print us_trends
    print india_trends

Once you print the responses, you will see a bunch of outputs which look like JSON data. To view the output in a pretty format, use the following commands and you will get the output as a pretty printed JSON shown in the snapshot below. To view all the trending topics in a convenient way, we will be using list comprehensions to slice the data we need and print it using prettytable as shown below.

    from prettytable import PrettyTable

    # Keep only the trend names from each response
    world_trends = [trend['name'] for trend in world_trends[0]['trends']]
    us_trends = [trend['name'] for trend in us_trends[0]['trends']]
    india_trends = [trend['name'] for trend in india_trends[0]['trends']]

    pt = PrettyTable(field_names=['World Trends', 'US Trends', 'India Trends'])
    for world_trend, us_trend, india_trend in zip(world_trends, us_trends, india_trends):
        pt.add_row([world_trend, us_trend, india_trend])
    print pt
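With the trend names sitting in plain lists like these, checking which trends two regions share is a matter of set intersection. Here is a minimal Python 3 sketch. The trend names are invented stand-ins, since live trends change constantly.

```python
# Invented trend names standing in for the lists built from the API responses.
world = ['#WorldCup', 'Sony', '#MondayMotivation']
us = ['Sony', '#NFL', '#MondayMotivation']
india = ['#Cricket', 'Sony']

# The & operator and the intersection() method are equivalent ways to find common items.
common_world_us = set(world) & set(us)
common_us_india = set(us).intersection(india)
```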

Printing the table gives a neatly tabulated list of current trends, which keep changing with time. Now, we will try to analyze and see if some of these trends are common. For that we use Python's set data structure and compute intersections to get common trends as shown in the snapshot below. Interestingly, some of the trending topics at this moment in the US are common with some of the trending topics in the world. The same holds good for US and India.

Mining for Tweets

In this section, we will be looking at ways to mine Twitter for retrieving tweets based on specific queries and extracting useful information from the query results. For this we will be using Twitter API's GET search/tweets resource. Since the Google Nexus 6 phone was launched recently, I will be using that as my query string. You can use the following code segment to make a robust API request to Twitter to get a sizeable number of tweets.

    query = 'Nexus6'
    count = 100
    search_results = twitter_api.search.tweets(q=query, count=count)
    statuses = search_results['statuses']

    # Iterate through 5 more batches of results by following the cursor
    for _ in range(5):
        print "Length of status list", len(statuses)
        try:
            next_results = search_results['search_metadata']['next_results']
        except KeyError, e:
            # No next_results field means there are no more results
            break
        # create a dictionary of parameters to be passed to the search method
        kwargs = dict([kv.split('=') for kv in next_results[1:].split('&')])
        search_results = twitter_api.search.tweets(**kwargs)
        statuses += search_results['statuses']

    # Print one sample tweet by indexing into the list
    print json.dumps(statuses[0], indent=2)
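The query-string parsing inside the loop can also be handed to the standard library, which decodes URL-escaped characters such as %23 (the # sign) along the way. Here is a Python 3 sketch using urllib.parse.parse_qsl. In Python 2 the same function lives in the urlparse module. The next_results values below are made up for illustration.

```python
from urllib.parse import parse_qsl

# Made-up example of a next_results value from search_metadata.
next_results = '?max_id=540000000000000000&q=Nexus6&count=100&include_entities=1'

# Drop the leading '?' and split the rest into decoded key/value pairs.
kwargs = dict(parse_qsl(next_results.lstrip('?')))

# URL-escaped characters are decoded, so %23 becomes '#'.
hashtag_kwargs = dict(parse_qsl('q=%23Nexus6&count=100'))
```

The resulting dictionary can be passed as keyword arguments in the same way as the hand-built one.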

The search loop above makes repeated requests to the Twitter Search API. Search results contain a special search_metadata node that embeds a next_results field with a query string that provides the basis for making a subsequent query. If we weren't using a library like twitter to make the HTTP requests for us, this preconstructed query string would just be appended to the Search API URL, and we'd update it with additional parameters for handling OAuth. However, since we are not making our HTTP requests directly, we must parse the query string into its constituent key/value pairs and provide them as keyword arguments to the search/tweets API endpoint. I have provided a snapshot below, showing how this dictionary of key/value pairs is constructed and passed as kwargs to the Twitter.search.tweets(..) method.

Analyzing the structure of a Tweet

In this section we will see what the main features of a tweet are and what insights can be obtained from them. For this we will be taking a sample tweet from our list of tweets and examining it closely. To get a detailed overview of tweets, you can refer to this excellent resource created by Twitter. I have extracted a sample tweet into the variable sample_tweet for ease of use. sample_tweet.keys() returns the top-level fields for the tweet. Typically, a tweet has some of the following data points which are of great interest.

The identifier of the tweet can be accessed through sample_tweet['id']

The human-readable text of a tweet is available through sample_tweet['text']

The entities in the text of a tweet are conveniently processed and available through sample_tweet['entities']

The "interestingness" of a tweet is available through sample_tweet['favorite_count'] and sample_tweet['retweet_count'], which return the number of times it's been bookmarked or retweeted, respectively

An important thing to note, is that, the retweet_count reflects the total number of times the original tweet has been retweeted and should reflect the same value in both the original tweet and all subsequent retweets. In other words, retweets aren't retweeted

The user details can be accessed through sample_tweet['user'] which contains details like screen_name, friends_count, followers_count, name, location and so on
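The fields listed above can be pulled out of a tweet dictionary as sketched below in Python 3. The sample tweet is hand-built and heavily trimmed, real tweets carry many more fields, and the summarize helper with its combined "interest" score is purely illustrative.

```python
# Hand-built, trimmed-down stand-in for one element of the statuses list.
sample_tweet = {
    'id': 1,
    'text': 'Loving the new #Nexus6 @example_user',
    'favorite_count': 2,
    'retweet_count': 5,
    'entities': {
        'hashtags': [{'text': 'Nexus6'}],
        'user_mentions': [{'screen_name': 'example_user'}],
    },
    'user': {'screen_name': 'sample_author', 'followers_count': 120},
}


def summarize(tweet):
    """Collect the data points discussed above into one flat dictionary."""
    return {
        'id': tweet['id'],
        'author': tweet['user']['screen_name'],
        'hashtags': [h['text'] for h in tweet['entities']['hashtags']],
        'mentions': [m['screen_name'] for m in tweet['entities']['user_mentions']],
        # Illustrative score only: favorites plus retweets.
        'interest': tweet['favorite_count'] + tweet['retweet_count'],
    }


summary = summarize(sample_tweet)
```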

Some of the above data points are depicted in the snapshot below for the sample_tweet. Note that the names have been changed to protect the identity of the entity that created the status. Before we move on to the next section, my advice is that you should play around with the sample tweet and consult the documentation to clarify all your doubts. A good working knowledge of a tweet's anatomy is critical to effectively mining Twitter data.

Extracting Tweet Entities

In this section, we will be filtering out the text statuses of tweets and different entities of tweets like hashtags. For this, we will be using list comprehensions, which are compact and usually a little faster than equivalent explicit loops. Use the following code snippet to extract the texts, screen names and hashtags from the tweets. I have also displayed the first five samples from each list just for clarity.

from prettytable import PrettyTable

status_texts = [status['text'] for status in statuses]

screen_names = [user_mention['screen_name']
                for status in statuses
                for user_mention in status['entities']['user_mentions']]

hashtags = [hashtag['text']
            for status in statuses
            for hashtag in status['entities']['hashtags']]

# Split every status text on whitespace to get a flat list of words
# (defined here in case it is not already in scope)
words = [w for t in status_texts for w in t.split()]

# get samples of first five entities
texts = status_texts[0:5]
scr_names = list(set(screen_names))[0:5]
hash_tags = hashtags[0:5]
tweet_words = words[0:5]

# display the results as a table
pt = PrettyTable()
pt.add_column('Tweets', texts)
pt.add_column('Screen Names', scr_names)
pt.add_column('HashTags', hash_tags)
pt.add_column('Words', tweet_words)
print(pt)

Once you print the table, you should get a table of the sample data which looks something like the table below, but with different content of course!

Frequency Analysis of Tweet and Tweet Entities

Once we have all the required data in relevant data structures, we will do some analysis on it. The most common analysis is a frequency analysis, where we find the most common terms occurring in different entities of the tweets. For this, we will make use of the collections module. The following code snippet ranks the top ten most frequently occurring tweet entities and prints them as a table.

from collections import Counter

# get the top ten (item, frequency) pairs for each kind of entity
word_counts = Counter(words).most_common(10)
screen_name_counts = Counter(screen_names).most_common(10)
hashtag_counts = Counter(hashtags).most_common(10)

top_words = [item for item, _ in word_counts]
top_words_freq = [freq for _, freq in word_counts]
top_screen_names = [item for item, _ in screen_name_counts]
top_screen_names_freq = [freq for _, freq in screen_name_counts]
top_hashtags = [item for item, _ in hashtag_counts]
top_hashtags_freq = [freq for _, freq in hashtag_counts]

# print the results as a table
# (all columns must have the same length, so this assumes at least
# ten distinct words, screen names and hashtags)
pt = PrettyTable()
pt.add_column('Words', top_words)
pt.add_column('Frequency', top_words_freq)
pt.add_column('Screen Names', top_screen_names)
pt.add_column('Frequency', top_screen_names_freq)
pt.add_column('Hashtags', top_hashtags)
pt.add_column('Frequency', top_hashtags_freq)
print(pt)

The output I obtained is shown in the snapshot below. As you can see, there is a lot of noise in the tweets, because of which several meaningless terms and symbols have crept into the top ten list. To deal with this, we can use some pre-processing and data cleaning techniques.

Analyzing the Lexical Diversity of Tweets

A slightly more advanced measurement that builds on simple frequencies and can be applied to unstructured text is a metric called lexical diversity. Mathematically, lexical diversity is the number of unique tokens in the text divided by the total number of tokens in the text.

Let us take an example to understand this better. Suppose you are listening to someone who repeatedly says "and stuff" to broadly generalize information, as opposed to providing specific examples to reinforce points with more detail or clarity. Now, contrast that speaker with someone else who seldom uses the word "stuff" to generalize and instead reinforces points with concrete examples. The speaker who repeatedly says "and stuff" would have a lower lexical diversity than the speaker who uses a more diverse vocabulary.

The following code snippet computes the lexical diversity for the words in the status texts, the screen names, and the hashtags in our data set. We also measure the average number of words per tweet.

# A function for computing lexical diversity
def lexical_diversity(tokens):
    return 1.0 * len(set(tokens)) / len(tokens)

# A function for computing the average number of words per tweet
def average_words(statuses):
    total_words = sum([len(s.split()) for s in statuses])
    return 1.0 * total_words / len(statuses)

print(lexical_diversity(words))
print(lexical_diversity(screen_names))
print(lexical_diversity(hashtags))
print(average_words(status_texts))
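As a quick sanity check of the metric, independent of the Twitter data, here is a toy version of the "and stuff" example. The two sentences are made up for illustration, and this variant of lexical_diversity adds a guard for empty input that the snippet above does not have.

```python
def lexical_diversity(tokens):
    """Unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / float(len(tokens))


# Made-up sentences standing in for the two speakers
repetitive = "the phone is nice and stuff the screen is nice and stuff".split()
varied = "the phone charges quickly while its screen stays crisp outdoors".split()

print(lexical_diversity(repetitive))  # 7 unique tokens out of 12
print(lexical_diversity(varied))      # 10 unique tokens out of 10
```

The repetitive speaker scores lower because the same few tokens keep coming back, which is exactly what the metric is meant to capture.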

The output I obtained for our Nexus 6 data set is depicted in the snapshot below. You might be wondering what these numbers indicate. We can interpret them as follows.

The lexical diversity of the words in the text of the tweets is around 0.097. This means that only about 1 in every 10 words across the tweets is unique; put loosely, each status update carries around 9.7% unique information. This is because most of the tweets contain terms like Android, Nexus 6, Google

The lexical diversity of the screen names, however, is much higher, with a value of 0.59 or 59%, which means that about 29 out of 49 screen names mentioned are unique. This is expected because tweets about Nexus 6 come from, and mention, many different people

The lexical diversity of the hashtags is extremely low at around 0.029 or 2.9%, implying that a small set of hashtags, dominated by #Nexus6, is repeated over and over in the results. This is expected because tweets about Nexus 6 should contain this hashtag

The average number of words per tweet is around 18 words

The results for our data set give us some interesting insights; for example, people mostly talk about Nexus 6 when we query for that search keyword. Also, if we look at the top hashtags, we see that Nexus 5 co-occurs a lot with Nexus 6. This might be an indication that people are comparing these phones when they tweet.

Examining Patterns in Retweets

In this section, we will analyze our data to determine if there were any particular tweets that were highly retweeted. The approach we'll take to find the most popular retweets is to simply iterate over each status update and, if the status update is a retweet, store the retweet count, the originator of the retweet, and the status text. We will use a list comprehension and sort by the retweet count to display the top few results in the following code snippet.

retweets = [
    # Store a tuple of the following three values
    (status['retweet_count'],
     status['retweeted_status']['user']['screen_name'],
     status['text'])
    # for each status
    for status in statuses
    # as long as the status is a retweet
    if 'retweeted_status' in status
]

# Display the top 5 retweets with necessary fields
pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])
for row in sorted(retweets, reverse=True)[:5]:
    pt.add_row(row)
pt.max_width['Text'] = 50
print(pt)

The output I obtained is depicted in the following snapshot. From the results, we see that the top retweet is from the official googlenexus channel on Twitter, and the tweet talks about the phone being used non-stop for 6 hours on only a 15 minute charge. Based on its retweet count, this has clearly been received positively by users. You can detect similar interesting patterns in retweets based on the topics of your choice.

Visualizing Frequency Data

In this section, we will create some interesting visualizations from our data set. For plotting we will use matplotlib, a popular Python plotting library that integrates well with IPython. If you don't have matplotlib loaded by default, use the command import matplotlib.pyplot as plt in your code.

Visualizing word frequencies

In our first plot, we will display the results from the words variable, which contains the different words from the tweet status texts. Using Counter from the collections package, we generate a sorted list of tuples, where each tuple is a (word, frequency) pair. The x-axis value corresponds to the index of the tuple, and the y-axis corresponds to the frequency of the word in that tuple. We transform both axes to a logarithmic scale because the frequencies span a very wide range across a vast number of data points.

Visualizing words, screen names, and hashtags

A line chart of frequency values is decent enough. But what if we want to find out the number of words having a frequency between 1-5, 5-10, 10-15 and so on? For this purpose, we will use a histogram to depict the frequencies. The following code snippet achieves this.

for label, data in (('Words', words),
                    ('Screen Names', screen_names),
                    ('Hashtags', hashtags)):
    # Start a new figure for each data set
    plt.figure()

    # Build a frequency map for each set of data and plot the values
    c = Counter(data)
    plt.hist(list(c.values()))

    # Add a title and axis labels
    plt.title(label)
    plt.ylabel("Number of items in a bin")
    plt.xlabel("Bins (number of times an item appeared)")
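For the line chart of word frequencies described under "Visualizing word frequencies", a minimal sketch could look like the following. The helper names rank_frequencies and plot_rank_frequencies and the toy_words list are hypothetical; the sketch also starts the x-axis at rank 1 rather than index 0, since zero cannot be shown on a logarithmic axis.

```python
from collections import Counter


def rank_frequencies(tokens):
    """Return token frequencies sorted from most to least common."""
    return sorted(Counter(tokens).values(), reverse=True)


def plot_rank_frequencies(tokens, title='Words'):
    """Plot frequency against rank with both axes on a log scale."""
    # Imported here so rank_frequencies works without matplotlib installed
    import matplotlib.pyplot as plt

    freqs = rank_frequencies(tokens)
    ranks = range(1, len(freqs) + 1)
    plt.figure()
    plt.loglog(ranks, freqs)
    plt.title(title)
    plt.xlabel('Word rank')
    plt.ylabel('Frequency')
    plt.show()


# Made-up tokens; in the post this would be the words list
toy_words = "nexus 6 is here and the nexus 6 is big".split()
print(rank_frequencies(toy_words))
```

Calling plot_rank_frequencies(words) on the real data should give a curve that drops off steeply from a handful of very frequent words.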

The histogram snippet takes all the frequencies, groups them into bins (ranges), and plots the number of entities which fall into each bin. The plots I obtained are shown below. From these plots, we can observe that all three distributions appear to follow the "Pareto Principle": roughly 80% of the distinct words, screen names and hashtags account for only about 20% of all occurrences in the data set, while the remaining 20% or so account for about 80% of the occurrences. In short, if we consider hashtags, a lot of hashtags occur maybe only once or twice in the whole data set, and very few hashtags like #Nexus6 occur in almost all the tweets, leading to their high frequency values.

Visualizing retweets

In this visualization, we will use a histogram to visualize retweet counts using the following code snippet.

# Using underscores while unpacking values in a tuple is idiomatic for discarding them
counts = [count for count, _, _ in retweets]

plt.hist(counts)
plt.title("Retweets")
plt.xlabel('Bins (number of times retweeted)')
plt.ylabel('Number of tweets in bin')
print(counts)

The plot which I obtained is shown below. Looking at the frequency counts, it is clear that very few retweets have a large count.

I hope you have seen by now how powerful the Twitter APIs are, and how easy it is to generate powerful and interesting insights using simple Python libraries and modules. That's all for now folks! I will be talking more about Twitter Mining in another post sometime in the future. A ton of thanks goes out to Matthew A. Russell and his excellent book Mining the Social Web, without which this post would never have been possible.

#engineering #social trends #products #dipanjan #social intelligence #data mining

Smartphones vs Tablets: Does size matter?

[This post was written by Dipanjan with contributions from Mandar. Dipanjan is a Data Engineer at DataWeave who works with Mandar in the Engineering team, addressing some of the semantics related problems like product clustering and data normalization.] We have seen a steady increase in the number of smartphones and tablets over the last five years. Looking at the number of smartphones, tablets and now wearables (smart watches and fitbits) being launched in the mobile market, we can truly call this 'The Mobile Age'.

We, at DataWeave, deal with millions of data points related to products, which vary from electronics to apparel. One of the main challenges we encounter while dealing with this data is the amount of noise and variation present for the same products across different stores. One particular problem we have been facing recently is detecting whether a particular product is a mobile phone (smartphone) or a tablet. If it is mentioned explicitly somewhere in the product information or metadata, we can sit back and let our backend engines do the necessary work of classification and clustering. Unfortunately, with the data we extract and aggregate from the Web, the chances of finding this ontological information are quite slim. To address the above problem, we decided to take two approaches.

Try to extract this information from the product metadata

Try to get a list of smartphones and tablets from well known sites and use this information to augment the training of our backend engine

Here we will talk mainly about the second approach since it is more challenging and engaging than the former. To start with, we needed some data specific to phone models, brands, sizes, dimensions, resolutions and everything else related to the device specifications. For this, we relied on a popular mobiles/tablets product information aggregation site. We crawled, extracted and aggregated this information and stored it as a JSON dump. Each device is represented as a JSON document like the sample shown below.

{
  "Body": {
    "Dimensions": "200 x 114 x 8.7 mm",
    "Weight": "290 g (Wi-Fi), 299 g (LTE)"
  },
  "Sound": {
    "3.5mm jack ": "Yes",
    "Alert types": "N/A",
    "Loudspeaker ": "Yes, with stereo speakers"
  },
  "Tests": {
    "Audio quality": "Noise -92.2dB / Crosstalk -92.3dB"
  },
  "Features": {
    "Java": "No",
    "OS": "Android OS, v4.3 (Jelly Bean), upgradable to v4.4.2 (KitKat)",
    "Chipset": "Qualcomm Snapdragon S4Pro",
    "Colors": "Black",
    "Radio": "No",
    "GPU": "Adreno 320",
    "Messaging": "Email, Push Email, IM, RSS",
    "Sensors": "Accelerometer, gyro, proximity, compass",
    "Browser": "HTML5",
    "Features_extra detail": "- Wireless charging- Google Wallet- SNS integration- MP4/H.264 player- MP3/WAV/eAAC+/WMA player- Organizer- Image/video editor- Document viewer- Google Search, Maps, Gmail,YouTube, Calendar, Google Talk, Picasa- Voice memo- Predictive text input (Swype)",
    "CPU": "Quad-core 1.5 GHz Krait",
    "GPS": "Yes, with A-GPS support"
  },
  "title": "Google Nexus 7 (2013)",
  "brand": "Asus",
  "General": {
    "Status": "Available. Released 2013, July",
    "2G Network": "GSM 850 / 900 / 1800 / 1900 - all versions",
    "3G Network": "HSDPA 850 / 900 / 1700 / 1900 / 2100 ",
    "4G Network": "LTE 800 / 850 / 1700 / 1800 / 1900 / 2100 / 2600 ",
    "Announced": "2013, July",
    "General_extra detail": "LTE 700 / 750 / 850 / 1700 / 1800 / 1900 / 2100",
    "SIM": "Micro-SIM"
  },
  "Battery": {
    "Talk time": "Up to 9 h (multimedia)",
    "Battery_extra detail": "Non-removable Li-Ion 3950 mAh battery"
  },
  "Camera": {
    "Video": "Yes, 1080p@30fps",
    "Primary": "5 MP, 2592 x 1944 pixels, autofocus",
    "Features": "Geo-tagging, touch focus, face detection",
    "Secondary": "Yes, 1.2 MP"
  },
  "Memory": {
    "Internal": "16/32 GB, 2 GB RAM",
    "Card slot": "No"
  },
  "Data": {
    "GPRS": "Yes",
    "NFC": "Yes",
    "USB": "Yes, microUSB (SlimPort) v2.0",
    "Bluetooth": "Yes, v4.0 with A2DP, LE",
    "EDGE": "Yes",
    "WLAN": "Wi-Fi 802.11 a/b/g/n, dual-band",
    "Speed": "HSPA+, LTE"
  },
  "Display": {
    "Multitouch": "Yes, up to 10 fingers",
    "Protection": "Corning Gorilla Glass",
    "Type": "LED-backlit IPS LCD capacitive touchscreen, 16M colors",
    "Size": "1200 x 1920 pixels, 7.0 inches (~323 ppi pixel density)"
  }
}

From the above document, it is clear that there are a lot of attributes that can be assigned to a mobile device. However, we do not need all of them to build our simple algorithm for labeling smartphones and tablets. I had initially planned to use the device screen size to separate smartphones from tablets, but I also asked our team for suggestions. After sitting down and taking a long, hard look at our dataset, Mandar had the idea of using the device dimensions as well to achieve the same goal! Finally, the attributes that we decided to use were:

Size

Title

Brand

Device dimensions

Screen size

I wrote some regular expressions for extracting the features related to the device screen size and resolution. Getting the resolution was easy, and was achieved with the following Python code snippet. There were a couple of NA values, but we didn't go out of our way to fill them in by searching the web, because resolution varies a lot and is not a key attribute for determining whether a device is a phone or a tablet.

import re

# repr() wraps the string in quotes; they are stripped from the tokens below
size_str = repr(doc["Display"]["Size"])
resolution_pattern = re.compile(r'(?:\S+\s)x\s(?:\S+\s)\s?pixels')
matches = resolution_pattern.findall(size_str)
if matches:
    # e.g. "1200 x 1920 pixels" becomes "1200x1920"
    resolution = ''.join([token.replace("'", "")
                          for token in matches[0].split()[0:3]])
else:
    resolution = 'NA'
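As a self-contained illustration of the resolution extraction, here is a simplified sketch that uses capture groups instead of splitting the match into tokens. It is not the exact code we ran; the name extract_resolution and the assumption that the resolution always looks like "<width> x <height> pixels" are mine.

```python
import re

# Assumed format: "<width> x <height> pixels" somewhere in the string
RESOLUTION_RE = re.compile(r'(\d+)\s*x\s*(\d+)\s*pixels')


def extract_resolution(size_field):
    """Return 'WIDTHxHEIGHT', or 'NA' if no resolution is present."""
    match = RESOLUTION_RE.search(size_field)
    if match is None:
        return 'NA'
    return '{0}x{1}'.format(match.group(1), match.group(2))


# The first string is the "Size" value from the sample document above
print(extract_resolution("1200 x 1920 pixels, 7.0 inches (~323 ppi pixel density)"))
print(extract_resolution("4 x 20 chars"))
```

The first call should give 1200x1920 and the second NA, since that string carries no pixel resolution.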

But the real problems started when I wrote regular expressions for extracting the screen size. I started off with analyzing the dataset and it seemed that screen size was mentioned in inches so I wrote the following regular expression for getting screen size.

size_str = repr(doc["Display"]["Size"])
screen_size_pattern = re.compile(r'(?:\S+\s)\s?inches')
matches = screen_size_pattern.findall(size_str)
if matches:
    screen_size = matches[0].split()[0]
else:
    screen_size = 'NA'

However, I noticed that I was getting a lot of 'NA' values for many devices. On looking up the same devices online, I noticed there were three distinct patterns with regards to screen size. They are,

Screen size in 'inches'

Screen size in 'lines'

Screen size in 'chars' or 'characters'

Now, some of you might be wondering what on earth 'lines' and 'chars' mean and how they measure screen size. On digging into it, I found that both of them mean basically the same thing, but in different formats. If we have 'n lines' as the screen size, it means the screen can display at most 'n' lines of text at any one time. Likewise, if we have 'n x m chars' as the screen size, it means the device can display 'n' lines of text at any one time, with each line having a maximum of 'm' characters. The picture below makes things clearer. It represents a screen of 4 lines or 4 x 20 chars.

Thus, the earlier logic for extracting screen size had to be modified, and we used the following code snippet. We had to take care of multiple cases in our regexes, because the data did not have a consistent format.

size_str = repr(doc["Display"]["Size"])

# Try each known format in turn: inches, then lines, then chars/characters
inches_pattern = re.compile(r'(?:\S+\s)\s?inch(?:es)?')
lines_pattern = re.compile(r'(?:\S+\s)\s?lines')
chars_pattern = re.compile(r'(?:\S+\s)x\s(?:\S+\s)\s?char(?:s|acters)')

if inches_pattern.findall(size_str):
    screen_size = inches_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' inches'
elif lines_pattern.findall(size_str):
    screen_size = lines_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' lines'
elif chars_pattern.findall(size_str):
    # 'n x m chars' means n lines of m characters, so keep n as the line count
    screen_size = chars_pattern.findall(size_str)[0].replace("'", "").split()[0] + ' lines'
else:
    screen_size = 'NA'
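To show the three cases side by side, here is a self-contained sketch of the same idea written as a function with capture groups. The name parse_screen_size and the exact patterns are assumptions for illustration; they cover only the three formats described above, and real data may contain others.

```python
import re

# Assumed formats, based on the three patterns described above
INCHES_RE = re.compile(r'(\d+(?:\.\d+)?)\s*inch(?:es)?')
LINES_RE = re.compile(r'(\d+)\s*lines')
CHARS_RE = re.compile(r'(\d+)\s*x\s*(\d+)\s*char(?:s|acters)')


def parse_screen_size(size_field):
    """Normalize a screen size string to 'N inches', 'N lines' or 'NA'."""
    match = INCHES_RE.search(size_field)
    if match:
        return match.group(1) + ' inches'
    match = LINES_RE.search(size_field)
    if match:
        return match.group(1) + ' lines'
    match = CHARS_RE.search(size_field)
    if match:
        # 'n x m chars' means n lines of m characters each
        return match.group(1) + ' lines'
    return 'NA'


for raw in ("1200 x 1920 pixels, 7.0 inches (~323 ppi pixel density)",
            "4 lines", "4 x 20 chars", "monochrome"):
    print(raw, '->', parse_screen_size(raw))
```

Checking inches first mirrors the order used in the snippet above, so a string that mentions both pixels and inches is still reported in inches.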

Mandar helped me out with extracting the 'dimensions' attribute from the dataset and performing some transformations on it to get the total volume of the phone. It was achieved using the following code snippet.

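import operator
import re
from functools import reduce

dimensions = doc['Body']['Dimensions']
# Keep only the first set of measurements and drop units and stray symbols
dimensions = re.sub(r'[^\s*\w*.-]', '',
                    dimensions.split('(')[0].split(',')[0].split('mm')[0]).strip('-').strip('x')
if not dimensions:
    dimensions = 'NA'
    total_area = 'NA'
else:
    if 'cc' in dimensions:
        # Volume is already given in cubic centimeters
        total_area = dimensions.split('cc')[0]
    else:
        # Convert each side from mm to cm and multiply the sides together.
        # Despite its name, total_area is a volume in cubic centimeters.
        total_area = reduce(operator.mul,
                            [float(elem.split('-')[0]) / 10 for elem in dimensions.split('x')],
                            1)
    total_area = round(float(total_area), 3)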
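As a worked example of the arithmetic, the sketch below computes the volume for two dimension strings. The helper volume_cc is hypothetical and handles only the clean "L x W x H mm" case, not the noisier formats the snippet above deals with.

```python
import operator
from functools import reduce


def volume_cc(dimensions_mm):
    """Volume in cubic centimeters from a clean 'L x W x H mm' string."""
    sides = dimensions_mm.split('mm')[0].split('x')
    # Each side is converted from millimeters to centimeters before multiplying
    return round(reduce(operator.mul, [float(s) / 10 for s in sides], 1), 3)


print(volume_cc("136 x 68 x 9 mm"))     # 13.6 cm * 6.8 cm * 0.9 cm
print(volume_cc("200 x 114 x 8.7 mm"))  # 20.0 cm * 11.4 cm * 0.87 cm
```

The first string uses the Liquid E3 dimensions (136 x 68 x 9), and the second is the Nexus 7 from the sample document.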

We used PrettyTable to output the results in a clear and concise format as shown below.

+----------------+-------+-----------+----------+--------------------+----------+
| Model Name     | Brand |Screen Size|Resolution| Dimensions         |Total Area|
+----------------+-------+-----------+----------+--------------------+----------+
| Liquid E3      | Acer  | 4.7 inches| 720x1280 | 136 x 68 x 9       | 83.232   |
| Liquid Z4      | Acer  | 4.0 inches| 480x800  | 124 x 64 x 9.7     | 76.979   |
| Iconia B1-721  | Acer  | 7.0 inches| 600x1024 | 199 x 122.3 x 11.4 | 277.45   |
| Iconia B1-720  | Acer  | 7.0 inches| 600x1024 |198.1 x 121.9 x 10.2| 246.314  |
| Iconia A1-830  | Acer  | 7.9 inches| 768x1024 | 203 x 138.4 x 8.2  | 230.381  |
| Liquid Z5      | Acer  | 5.0 inches| 480x854  | 145.5 x 73.5 x 8.8 | 94.109   |
| Liquid S2      | Acer  | 6.0 inches|1080x1920 | 166 x 86 x 9       | 128.484  |
| ...            | ...   | ...       | ...      | ...                | ...      |
| ...            | ...   | ...       | ...      | ...                | ...      |
|Galaxy Note 8.0 |Samsung| 8.0 inches| 800x1280 | 210.8 x 135.9 x 8  | 229.182  |
| Rex 90 S5292   |Samsung| 3.5 inches| 320x480  | 113 x 61.9 x 11.9  | 83.237   |
| ...            | ...   | ...       | ...      | ...                | ...      |
| ...            | ...   | ...       | ...      | ...                | ...      |
| F101           | ZTE   | 2.0 inches| 176x220  | 105 x 46 x 12.6    | 60.858   |
| F100           | ZTE   | 2.0 inches| 176x220  | 105 x 46 x 12.6    | 60.858   |
|Coral200 Sollar | ZTE   | 1.5 inches| 128x128  | 106 x 45.6 x 18.1  | 87.488   |
+----------------+-------+-----------+----------+--------------------+----------+

Next, we stored the above data in a CSV file and used Pandas, Matplotlib, Seaborn and IPython to do some quick exploratory data analysis and visualizations. The following depicts the top ten brands with the most mobile devices as per the dataset.

Then, we looked at the device area frequency for each brand using boxplots, as depicted below. Based on the plot, it is quite evident that almost all the distributions are right skewed, with the majority of device dimensions (total area) falling in the range [0, 150]. There are some notable exceptions like 'Apple', where the skew is considerably less than the general trend. On slicing the data for the brand 'Apple', we noticed that this was because 'Apple' has an almost equal number of smartphones and tablets, leading to the distribution being almost normal.

Based on similar experiments, we noticed that tablets had larger dimensions than mobile phones, and screen sizes followed the same trend. We made some quick plots with respect to the device areas, as shown below.

Now, take a look at the above plots again. The second plot shows the distribution of device areas as a kernel density plot. This distribution resembles a Gaussian distribution, but with a right skew. [Mandar reckons that it actually resembles a Logistic distribution, but who's splitting hairs, eh? ;)] The histogram depicts the same, except that here we see the frequency of devices vs. the device areas. Looking at it closely, Mandar said that the bell shaped curve had the maximum number of devices and those must be all the smartphones, while the long thin tail on the right side must indicate tablets. So we set a cutoff of 160 cubic centimeters for distinguishing between phones and tablets (our 'Total Area' value is really a volume).

We also decided to calculate the correlation between 'Total Area' and 'Screen Size' because, as one might guess, devices with larger area have larger screen sizes. So we transformed the screen sizes from textual to numeric format with some processing, and calculated the correlation between them, which came to around 0.73 or 73%.

We did get a high correlation between Screen Size and Device Area. However, I still wanted to investigate why we didn't get a score close to 90%. On doing some data digging, I noticed an interesting pattern. After looking at the above results, what came to our minds immediately was: why do phones with such small screen sizes have such big dimensions? We soon realized that these devices were either "feature phones" of yore or smartphones with a physical keypad! Thus, we used screen sizes in conjunction with dimensions for labeling our devices. After a long discussion, we decided to use the following logic for labeling smartphones and tablets.

device_class = None
if total_area >= 160.0:
    device_class = 'Tablet'
elif total_area < 160.0:
    device_class = 'Phone'
if 'lines' in screen_size:
    device_class = 'Phone'
elif 'inches' in screen_size:
    if float(screen_size.split()[0]) < 6.0:
        device_class = 'Phone'
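The labeling logic can be wrapped in a function and checked against a few rows from our tables. This is a sketch under assumptions: the name classify_device and the two constants are mine, and the handling of a missing ('NA') area is my own addition rather than part of the rule above.

```python
AREA_CUTOFF = 160.0   # cubic centimeters, the cutoff chosen in this post
SCREEN_CUTOFF = 6.0   # inches, the screen size threshold used in the rule


def classify_device(total_area, screen_size):
    """Label a device as 'Phone' or 'Tablet' (None if nothing is known)."""
    device_class = None
    if total_area != 'NA':  # assumption: skip the size rule when area is missing
        device_class = 'Tablet' if total_area >= AREA_CUTOFF else 'Phone'
    # Screen size overrides the area rule for bulky devices with small screens
    if 'lines' in screen_size:
        device_class = 'Phone'
    elif 'inches' in screen_size:
        if float(screen_size.split()[0]) < SCREEN_CUTOFF:
            device_class = 'Phone'
    return device_class


print(classify_device(83.232, '4.7 inches'))   # Liquid E3
print(classify_device(277.45, '7.0 inches'))   # Iconia B1-721
print(classify_device(170.0, '2.0 inches'))    # hypothetical bulky feature phone
```

The last call shows why the screen size check matters: a device above the area cutoff is still labeled a phone when its screen is small.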

After all this fun and frolic with data analysis, we were able to label handheld devices correctly, just like we wanted it!

+---------------+-------+-------------+------------+------------+--------------+
| Model Name    | Brand | Screen Size | Resolution | Total Area | Device Class |
+---------------+-------+-------------+------------+------------+--------------+
| Liquid E3     | Acer  | 4.7 inches  | 720x1280   | 83.232     | Phone        |
| Liquid Z4     | Acer  | 4.0 inches  | 480x800    | 76.979     | Phone        |
| Iconia B1-721 | Acer  | 7.0 inches  | 600x1024   | 277.45     | Tablet       |
| Iconia B1-720 | Acer  | 7.0 inches  | 600x1024   | 246.314    | Tablet       |
| Iconia A1-830 | Acer  | 7.9 inches  | 768x1024   | 230.381    | Tablet       |
| Liquid Z5     | Acer  | 5.0 inches  | 480x854    | 94.109     | Phone        |
| Liquid S2     | Acer  | 6.0 inches  | 1080x1920  | 128.484    | Phone        |
| Liquid Z3     | Acer  | 3.5 inches  | 320x480    | 68.016     | Phone        |
| Liquid S1     | Acer  | 5.7 inches  | 720x1280   | 129.878    | Phone        |
| Iconia Tab A3 | Acer  | 10.1 inches | 800x1280   | 464.1      | Tablet       |
| ...           | ...   | ...         | ...        | ...        | ...          |
| ...           | ...   | ...         | ...        | ...        | ...          |
+---------------+-------+-------------+------------+------------+--------------+

#engineering #classification #visualization #dipanjan #mandar #data analysis

Meet our People: Sukruth

Name

Sukruth Ambuga Nagaraj

Bio

Sukruth handles Account Management and Customer Support. He's meticulous about his work and choosy in his dressing. Or it could be the other way round.

Birthday

November 18

When did you join DataWeave?

I joined DataWeave on 1 May, 2013. That was the day we moved to our new office.

What is your role?

I started with the Data Quality team. In fact, I was the first member in the Data Quality team. I was always interested in the customer facing aspects of our business. So, I transitioned there over a period of time. Now I work closely with the Business Development team and handle all aspects of Account Management.

What is your background?

Before DataWeave, I was working for a while as a Sports Analyst at Repucom Media Analysis India Pvt Ltd. There I worked on the analysis of certain metrics for Golf, MotorSports, etc.

Why did you join DataWeave?

To be part of an upcoming start up was an exciting opportunity for me. There is always a lot of challenging work, and growth options. It has worked out very well for me!

Can you share some highlights from your work?

We are a start up and it's a high pressure environment with a lot of activities going on at the same time. We are doing pilots with potential customers, creating accounts for new customers, and dealing with issues they might have. I am definitely the go to person as far as coordination of our delivery process is concerned. It's both exciting and challenging.

Hobbies?

Playing strategy games and working on my 2 decade old ride YEZDI ROADKING.

Quirks?

I am very meticulous and often obsessively clean.

Worst fears?

Yet to come, as I am always daring.

Favorite quote?

No Pain No Gain!

#team #life #people #life@priceweave

Why is product matching hard?

Product Matching is a combination of algorithmic and manual techniques to recognize and match identical products from different sources. Product matching is at the core of competitive intelligence for retail. A competitive intelligence product is most useful when it can accurately match products of a wide range of categories in a timely manner, and at scale.

Shown below is PriceWeave’s Products Tracking Interface, one of the features where product matching is in action. The Products Tracking Interface lets a brand or a retailer track their products and monitor prices, availability, offers, discounts, variants, and SLAs on a daily (or more frequent) basis.

A snapshot of products tracked for a large online mass merchant

Expanded view for a product shows the price related data points from competing stores

Product Matching helps a retailer or a brand in several ways:

Tracking competitor prices and stock availability

Organizing seller listings on a marketplace platform

Discovering gaps in product catalog

Filling the missing attributes in product catalog information

Comparing product life cycles across competitors

Given its criticality, every competitive intelligence product strives hard to make its product matching accurate and comprehensive. It is a hard problem, and one that cannot be completely addressed in an automated fashion. In the rest of this post, we will talk about why product matching is hard.

Product Matching Guidelines

Amazon provides a guideline to sellers about how they should write product catalog information in order to achieve good product matching with respect to their seller listings. These guidelines apply to any retail store or marketplace platform. The trouble is, more often than not these guidelines are not followed, or cannot be followed by retailers because they don’t have access to all the product related information. Some of the challenges are:

Products either don’t have a UPC code or it is not available. There are also non-standard products, unbranded products, and private label products.

There are products with slight variations in technical specifications, but the complete specs are not available.

Retailers manage a huge catalog of accessories, for instance Electronics Accessories (screen guards, flip covers, fancy USB drives, etc.).

Apparel and Lifestyle products often have very little by way of unique identifiers. There is no standard nomenclature for colors, material and style.

Products are often bundled with accessories or other related products. There are no standard ways of doing product bundling.

In the absence of standard ways of representing products, every retailer uses their own internal product IDs, product descriptions, and attribute names.

Algorithmic Product Matching using “Document Clustering”

Algorithmic product matching is done using Machine Learning, typically techniques from Document Clustering. A document is a text document or a web page, or a set of terms that usually occur within a “context”. Document clustering is the process of bringing together (forming clusters of) similar documents, and separating out dissimilar ones. There are many ways of defining similarity of documents that we will not delve into in this post. Documents have “features” that act as “identifiers” that help an algorithm cluster them.

A document in our case is a product description -- essentially a set of data points or attributes we have extracted from a product page. These attributes include: title, brand, category, price, and other specs. Therefore, these are the attributes that help us cluster together similar products and match products. The quality of clustering -- that is, how accurate and how complete the clusters are -- depends on how good the features are. In our case, most of the time the features are not good, and that is what makes clustering, and in turn product matching, a hard problem.
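To give a feel for what "similarity of documents" can mean, here is a minimal sketch of one simple option, Jaccard similarity over the terms of a product title. It is only an illustration and not how PriceWeave's matching works; the function names are hypothetical, and the titles are borrowed from examples used later in this post.

```python
def tokenize(title):
    """Lowercase a product title and split it into a set of terms."""
    return set(title.lower().split())


def jaccard_similarity(title_a, title_b):
    """Shared terms divided by all distinct terms across both titles."""
    a, b = tokenize(title_a), tokenize(title_b)
    if not a and not b:
        return 0.0
    return len(a & b) / float(len(a | b))


# 3 shared terms out of 4 distinct terms
print(jaccard_similarity("Samsung Galaxy Note", "Samsung Galaxy Note 2"))
# no shared terms at all
print(jaccard_similarity("Samsung Galaxy Note", "Puma Red Striped Bag"))
```

Note that the two Galaxy Note titles score high even though they are different products, which hints at why short descriptions with few distinguishing features are hard to cluster correctly.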

Noisy Small Factually Weak (NSFW) Documents

The documents that we deal with, the product descriptions, are not well formed and so not readily usable for product matching. We at PriceWeave characterize them endearingly as Noisy, Small, and Factually Weak (NSFW) documents. Let us see some examples to understand these terms.

Noisy

Spelling errors, non-standard and/or incomplete representations of product features.

Brands written as “UCB” and “WD” instead of “United Colors of Benetton” and “Western Digital”.

Model numbers might or might not be present. A camera’s model number might be written as any of the following variants: DSC-WX650 vs DSCWX650 vs DSC WX 650 vs WX 650.

Noisy/meaningless terms might be present (“brand new”, “manufacturer's warranty”, “with purchase receipt”)
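As a small illustration of this kind of noise, the sketch below normalizes the model number variants mentioned above by uppercasing them and dropping separators. The function normalize_model_number is hypothetical and is not part of PriceWeave's pipeline.

```python
import re


def normalize_model_number(raw):
    """Uppercase the text and drop everything except letters and digits."""
    return re.sub(r'[^A-Z0-9]', '', raw.upper())


variants = ["DSC-WX650", "DSCWX650", "DSC WX 650", "WX 650"]
normalized = [normalize_model_number(v) for v in variants]
print(normalized)
```

The first three variants collapse to the same string, but "WX 650" does not, because its prefix is missing altogether. Simple cleaning helps with formatting noise, but not with incomplete information.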

Small

Not much description. A product simply written as “Apple iPhone” without any mention of its generation, or other features.

Not many distinguishable features. For example, “Samsung Galaxy Note vs Samsung Galaxy Note 2”, “Apple ipad 3 16 GB wifi+cellular vs Apple ipad mini 16 GB wifi-cellular”

Factually Weak

Products represented with generic and subjective descriptions.

Colours and their combinations might be represented differently. For example, “Puma Red Striped Bag”, “Adidas Black/Red/Blue Polo Tshirt”.

In the absence of clean, sufficient, and specific product information, the quality of algorithmic matching suffers. Product matching includes many knobs and switches to adjust the weights given to different product attributes. For example, we might include a rule that says, “if two products are identical, then they fall in the same price range.” While such rules work well generally, they vary widely from category to category and across geographies. Further, adding more and more specific rules will start throwing off the algorithms in unexpected ways, rendering them less effective.

In this post, we discussed the challenges posed by product matching that make it a hard problem to crack. In the next post, we will discuss how we address these challenges to make PriceWeave’s product matching robust.

PriceWeave is an all-around Competitive Intelligence product for retailers, brands, and manufacturers. We’re built on top of huge amounts of product data to provide real-time actionable insights. PriceWeave's offerings include pricing intelligence, assortment intelligence, gaps in catalogs, and promotion analysis. Please visit PriceWeave to view all our offerings. If you’d like to try us out, request a demo.

#product matching #retail intelligence #Price Intelligence #pricing intelligence

Meet our People: Mukesh

Name

Mukesh Kumar

Bio

Mukesh builds robust back end systems. When he is done with that, he wants to turn his focus towards social service.

Birthday

October 6

When did you join DataWeave?

I joined DataWeave around mid November, 2013.

What is your role?

I work with the Back end Engineering team of PriceWeave. I am a Python developer. I am responsible for building some of the crawler components as well as diagnostics tools.

What is your background?

I have been a programmer throughout. Before joining DataWeave, I was working with another Bangalore based start up called PromptCloud for a few years.

Why did you join DataWeave?

I have always been curious about upcoming start ups, especially in the data aggregation and analytics space. I also wanted to explore more of myself. After interviewing with DataWeave, I found the work and the people very conducive to this.

Can you share some highlights from your work?

Many things, actually. We are a start up, so we face a lot of interesting engineering challenges on a daily basis. I recently implemented a customizable "user friendly" generic crawler that meets most of the crawling needs of PriceWeave. I also implemented and integrated a logger into the crawler pipeline. These systems improved the scalability and robustness of our crawlers.

Hobbies?

Reading

Quirks?

I am the more curt and serious member of a boisterous team.

Worst fears?

None

Favorite quote?

Our greatness doesn't lie in never falling but in rising every time we fall.

#team #life #people #life@priceweave

How Colours Influence Buying Patterns In Retail

Research shows that the colour of the clothes we wear significantly affects our day to day lives. For instance, wearing black might help us appear powerful and authoritative at the workplace, while a red dress can make us look more attractive to a date. A yellow top might brighten up one’s day, and a blue one might land us a nifty bonus.

Oftentimes buyers navigating the myriad nuances of current fashion look for help from friends, popular media and retailers themselves. Retailers, for their part, try to stay ahead of fashion trends by meticulously studying trends from magazines, keeping a close eye on competitors and wading through the chatter on social media and fashion blogs.

Now that most of retail is metrics driven and becoming smarter by the day, we asked ourselves whether there is a better way to analyse the influence of colors on customer buying decisions. Here’s how we went about doing it:

Method:

Thanks to the internet, a huge mine of valuable fashion data is available to us through e-commerce sites, brand Pinterest pages and fashion blogs, which regularly update their content streams with the newest fashion offerings. The featured fashion of the current season, the complete product catalogues of brands, and combinations of dresses that go together (even across brands) are all available for us to collect and analyse.

By crawling these sites, pages and blogs periodically, we can extract the colors in each of the images shared. This data helps any online or offline merchant visualize the current trend in the market and plan their own product offering. It is also possible to plot monthly data to capture the timeline of trends across different fashion websites.
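As a rough illustration of the colour extraction step, the sketch below assigns each pixel to its nearest named colour and reports each colour's share of the image. It assumes the image has already been decoded into a list of (R, G, B) tuples, and the small palette and function names are hypothetical. It is a simple nearest-colour count, not necessarily the technique used in production.

```python
from collections import Counter

# Hypothetical reference palette; a production palette would be much richer.
PALETTE = {
    "black": (0, 0, 0),
    "white": (255, 255, 255),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def nearest_colour(pixel):
    """Name of the palette colour closest to `pixel` in squared RGB distance."""
    return min(
        PALETTE,
        key=lambda name: sum((p - q) ** 2 for p, q in zip(pixel, PALETTE[name])),
    )


def colour_shares(pixels):
    """Fraction of pixels assigned to each palette colour."""
    counts = Counter(nearest_colour(pixel) for pixel in pixels)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}
```

Computing these shares for every crawled image and grouping them by month would give the kind of trend timeline described above.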

How is it Useful?

Let us assess the applications made possible from this data. How would color analysis assist product managers, category heads and merchandising heads?

1. Spotting current trends:

Color analysis can spot current trends across brands and various filters. This gives decision makers the ability to gauge and respond to current trends and offerings. Some filters that can be used for this analysis are price, colors, categories, subcategories, etc.

2. Predictive trends:

Using historical color data, future trends can be spotted with greater accuracy. With this data, decision makers can stay ahead of the demands and predictions of the market and gain a foothold in the ever changing world of fashion.

3. Assortment Analysis:

Assortment Analysis can become more in depth and insightful with color analysis. Comparing one’s own assortment against competitors’ offerings gives clear pointers on both the colors one currently offers and the categories one can focus on to get ahead of the competition.

4. Recommendations

A strong recommendation feature is vital in driving up sales by offering the right products to buyers at the right time. Analysis of colors helps recommendations become smarter and more relevant. For instance, the algorithm can help understand what tops go with which jeans or which shirts go with what ties.

Colours add a new dimension to current business analytics. Decision makers will be able to access enhanced analytics on existing products and compare across sources based on parameters such as price, categories, subcategories etc.

Color Analysis in retail is largely unexplored and rife with possibilities. Doing it at scale presents a number of unique challenges that we are addressing. We’re excited to bring novel techniques and the power of large scale data analytics to retail.

Color analysis will add to a retailer's understanding of consumer buying patterns. This will help retailers sell better and improve profit margins. We are currently working on integrating this feature into PriceWeave so that our customers can do a comparative assortment analysis with color as an additional dimension.

About Priceweave:

PriceWeave provides Competitive Intelligence for retailers, brands, and manufacturers. We’re built on top of huge amounts of product data to provide features such as pricing opportunities (and changes), assortment intelligence, gaps in catalogs, reporting and analytics, and tracking of promotions and product launches. PriceWeave lets you track any number of products across any number of categories against your competitors. If you’d like to try us out, request a demo.

The Price is Right!

Picture this. You’re approaching the biggest sale of the year for your business, the number of offerings is ever growing, and your competitors are inching in on your turf. How then are you to tackle the complex and challenging task of pricing your offerings? In short, how do you know if the price is right?

Here’s how we think it’s possible:

1. Prioritize your objectives

Pricing can be modified based on your priorities. A good pricing intelligence tool lets you understand pricing opportunities across different dimensions (categories/brands, etc.). Which categories do you want to score on? Which price battles do you choose to fight? Once you have decided your focus areas, you can make pricing decisions accordingly.

2. Trading off margins for market share (or vice versa)

Trading off profits for a larger market share often decreases overhead and increases profits due to network effects. This means that the value of your offerings increases as more people use them (e.g., the iOS or the Windows platform). If margins are crucial, do not hesitate to make smart and aggressive pricing decisions using inputs from pricing intelligence tools.

3. Avoiding underpricing and overpricing

Underpricing brings down the bottom line and overpricing alienates customers. Walking the thin line between these is both an art and a science. An effective path to balanced pricing is to employ a pricing intelligence tool, which helps you get the price right with ease for any number of your products.

4. Understanding consumers and balancing costs

Who IS your buyer? How much is she willing to shell out for the products you are selling? How much should you mark up your products to recoup your costs? What can you do to retain your consumers and attract new ones? What steps are your competitors taking to achieve this (discounts/combos/coupons/loyalty points)? Answer these questions and you are closer to the ideal price.

5. Monitor competition

The simplest and the most effective way to price your product right is to monitor your competitors. Every pricing win contributes to your profits and boosts your bottom line. Competitive Intelligence products let you monitor your products across any of your competitors.

Conclusion

There are many ways to determine the right price for your products, and an effective pricing tool goes a long way in helping you do so. It augments your experience, intuition, and internal analytics with solid competitive pricing data.

Why not give pricing intelligence a test ride then? Email us today at [email protected] to get started.

Assortment Intelligence: So you think you know your offerings?

In retail, product assortment plays a critical role in selling effectively. It impacts the everyday decision making of category managers, brand managers, the merchandising, planning, and logistics teams. A good assortment mix helps achieve the following objectives:

Reduce acquisition costs for new customers (as well as retain existing customers)

Increase penetration by catering to a variety of customer segments

Optimize planning and inventory management costs.

Increasingly, retailers are moving away from a generic one-size-fits all assortment planning model, to a more dynamic and data driven approach. As a result, assortment benchmarking followed by assortment planning are activities that take place round the year. The breadth and depth of one’s assortment achieved through assortment benchmarking can define how and when products get bought.

A number of factors are crucial for assortment planning: analytics over internal data, intuition, experience, and understanding gained through trends. In addition to these, tracking assortment changes on competitors’ websites helps retailers track and adjust their product mix by adjusting features such as brands, colors, variants, and pricing. The goal is to help users find exactly what they are looking for, the moment they are looking for it.

Let’s see how we can achieve this through Assortment Intelligence tools in a moment. But first, some basics.

What is Assortment Intelligence?

Assortment intelligence refers to online retailers tracking and analysing a competitor’s assortment, and benchmarking it against their own assortment. Assortment intelligence tools make this process efficient. A good assortment intelligence tool such as PriceWeave gives you information on the breadth and depth of your competitors’ assortment across categories and brands. It helps you analyze assortment through different lenses: colors, variants, sizes, shapes, and other technical specifications. With the help of an assortment intelligence tool, a retailer can get a good understanding of what products competitors have, how they perform, and whether to add these products to the existing catalog.

Who uses Assortment Intelligence?

Assortment tracking is used by retailers operating across categories as varied as footwear, electronics, jewelry, household goods, appliances, accessories, tools, handbags, furniture, clothing, baby products, and books, among others.

Some Uses of Assortment Intelligence

Gaps in Catalog: Discover products/brands your competitors are offering that are not on your catalog, and add them.

Unique Offerings: Find products/brands that only you are offering and decide whether you are pricing them right. Maybe you want to bump up their prices.

Compare and analyze product assortment across dimensions: Benchmark your assortments across different dimensions and combinations thereof. Understand your own as well as your competitors’ focus areas. You can do this in aggregate as well as at the category/brand/feature level. Below we show a few examples.

Effectively measure discount distributions across brands and/or sources. Understand your competitors' "sweet spots" in terms of discounts.

Discount distribution across sources

Understand assortment spread across price ranges. Are you focusing on all price ranges or only a few? Is that a decision you made consciously?

No. of SKUs in a given price range across sources

No. of SKUs in a given price range for Mobile accessories

Deep dive using smart filters -- monitor specific competitors, brands and sets of products with filters such as colors, variants, sizes and other product features.

Variant analysis for apparel brands
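Several of the uses listed above reduce to simple set and counting operations once products have been matched across sources. The sketch below assumes that matching has already produced shared product identifiers. The function names and the price-bucket edges are hypothetical and only meant to illustrate the idea.

```python
from collections import Counter


def catalog_gaps(own_skus, competitor_skus):
    """Products competitors carry that are missing from our catalog."""
    return set(competitor_skus) - set(own_skus)


def unique_offerings(own_skus, competitor_skus):
    """Products that only we carry."""
    return set(own_skus) - set(competitor_skus)


def skus_per_price_range(prices, edges):
    """Count SKUs in each [edges[i], edges[i+1]) price bucket."""
    counts = Counter()
    for price in prices:
        for low, high in zip(edges, edges[1:]):
            if low <= price < high:
                counts[(low, high)] += 1
                break  # each price falls in at most one bucket
    return dict(counts)
```

Running skus_per_price_range once per source yields the data behind a chart such as the number of SKUs in a given price range across sources.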

Why do it?

Assortment Intelligence not only increases sales and improves margins, but also helps reduce planning and inventory costs. It allows retailers to strike the right balance between assortment and inventory while maximizing sales. Retailers can take informed decisions by analyzing their own as well as competitors' assortments. Businesses gain an edge by identifying opportunities around changes in product mix and making quick decisions. By identifying areas that need focus and taking timely action, an assortment intelligence tool helps improve the bottom line.

What does PriceWeave bring in?

With a feature-rich product such as PriceWeave, you can do all of the above and more every day (or more frequently if you like). In addition, you can get all assortment related data as reports in case you want to do your own analysis. You can also set alerts on any changes that you want to track.

PriceWeave lets you drill down as deep as you like. Assortments do not have to be based on high level dimensions or standard features like colors and sizes. You can analyze assortments based on technical specs of products (RAM size, cloth material, style, shape, etc.) or their combinations.

Assortment Intelligence is an important part of the PriceWeave offering. If you’d like us to help you make smarter assortment decisions, talk to us.

7 features of an effective Retail Price Intelligence tool

[This is Part 3 of a series of posts on Competitive Intelligence for Retail. Find the previous posts here: Part 1 and Part 2.]

1. Accurate Product Matching

A fundamental feature of a Price Intelligence (PI) tool is that it lets you track and compare your products against your competition.

So, a PI tool must take care of matching each of your products across all other sources, so that you can make a straightforward comparison and take action. The more accurate the product matching, the more confident you are as a category manager about your decisions.

2. Extensive Product Coverage

Information is most useful when it is as correct and as complete as possible. If product matching is accurate, you are assured that the data is correct. But a PI tool needs to do this at scale.

What good is it if a large number of the products you want to track are not covered? Undoubtedly, the depth of product coverage is one of the most important features of a PI tool. Whether the product you’re tracking is a high end flat screen TV or an oven mitt, a PI tool should be able to cover and deliver intelligence on the chosen product.

3. High Data Update Frequency

Data points like product prices and offers get stale fairly quickly. Ideally, we want to see real time data, but real time is not always achievable at scale and might even be overkill in many cases.

However, an effective PI tool must present up-to-date data to the extent possible. Depending on the requirement, the refresh interval can vary from a day to a few hours, helping the business stay ahead of the price curve.

4. Pricing Opportunities

A good PI tool should present data at different levels of granularity: category, sub-category, brand, and individual product. This helps the category/merchandizing team or the pricing analysts to surgically strike problem areas. For instance, when you are tracking thousands or even hundreds of products, it’s next to impossible to go over every product and take pricing decisions.

Instead, the PI tool should highlight pricing opportunities, such that pricing decisions can be taken efficiently and quickly.
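One simple way to surface pricing opportunities is to compare your price with the lowest matched competitor price and flag only the products that fall outside a margin, grouped by category. The sketch below is an assumption-laden illustration: the dictionary keys, the 5% margin, and the labels are hypothetical, not PriceWeave's actual logic.

```python
from collections import defaultdict


def pricing_opportunities(products, margin=0.05):
    """Group flagged products by category.

    `products` is a list of dicts with hypothetical keys:
    'name', 'category', 'our_price', 'competitor_prices'.
    """
    flagged = defaultdict(list)
    for product in products:
        if not product["competitor_prices"]:
            continue  # nothing to compare against
        lowest = min(product["competitor_prices"])
        # Relative gap between our price and the cheapest competitor.
        gap = (product["our_price"] - lowest) / lowest
        if gap > margin:
            flagged[product["category"]].append(
                (product["name"], "overpriced", round(gap, 3))
            )
        elif gap < -margin:
            flagged[product["category"]].append(
                (product["name"], "room to raise", round(gap, 3))
            )
    return dict(flagged)
```

Products within the margin are left out entirely, so the reviewer only sees the handful of items per category that need a decision.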

5. Historical Pricing

“Prediction is very difficult, especially if it’s about the future.” But they also say, history can be a useful predictor of the future. Nowhere is it truer than in competitive price intelligence.

An analysis of historical data almost always shows a trend that can be capitalized on for competitive pricing. A good PI tool stores and presents historical pricing data in a useful manner.

6. “It’s not [just] about the money”

Retail is a highly competitive and commoditized sector. So, price is an important factor for a consumer when making a decision to buy a product. Having said that, as a retailer, you don’t always want to compete on pricing.

You may want to compete through better packaging, or giving the user more choice (variants/colours/sizes), or better SLAs. This is where a Price Intelligence tool needs to go beyond just pricing. It needs to capture and present all other relevant data points associated with a product.

7. Uncluttered User Experience

Any tool built for a user needs to be usable, intuitive, and uncluttered. This is even more true for busy managers who need to take several decisions quickly, day after day. A Price Intelligence tool is in essence a Data Product. A data product is built on top of a lot of data; however, a good data product is one “where data recedes to the background”.

A data product is not one that delivers a lot of data, but one that delivers actionable data and insights based on data. Data presentation is also another important aspect. A good PI tool delivers the most important data points in formats and templates that a customer can easily consume.

Think a Pricing Intelligence tool can be useful for your retail store? Talk to us!

PriceWeave provides Competitive Intelligence for retailers, brands, and manufacturers. PriceWeave is built on top of huge amounts of product data to provide features such as pricing opportunities (and changes), assortment intelligence, gaps in catalogs, reporting and analytics, and tracking of promotions and product launches.

PriceWeave is powered by distributed data crawling and processing engines that enable serving millions of product data points refreshed on a daily basis. This data is presented through dashboards, notifications, and reports. PriceWeave brings retailers the ability to use Big Data in compelling ways.

PriceWeave lets you track any number of products across any categories against your competitors. Still not convinced? Try us out. Just send us a request for a demo.
