Performing sentiment analysis with Twitter Streaming API, Python and Elasticsearch
Twitter users publish around 500 million tweets per day. The tweets, which could be about anything -- including a product, service, company or some other point of interest -- collectively become a massive and ever-flowing stream of semi-structured data.
This data, once captured and analysed, can provide an ongoing view of public sentiment towards a product, company or topic, measure reactions to a press release, track impacts to perception from a marketing campaign or create alerts for customer service teams to respond to feedback and complaints.
This recipe explores using the Twitter Streaming API, Docker and Kitematic (a fancy new Docker UI for OS X), Elasticsearch, Python and Kibana. The Python portions of this guide are based on a great post from Real Python published last November. This should take about an hour to cook, maybe less if you're familiar with Docker. Let's get started...
Set up Elasticsearch with (or without) Kitematic
You may opt to set up your own Docker environment, covered previously, and skip to the next section. Otherwise, download and install Kitematic. Kitematic is a UI replacement for Boot2Docker, which allows OS X users to install and run Docker containers through a lightweight Linux virtual machine. Kitematic simplifies the installation, configuration and management of containers during development.
Create an Elasticsearch container
In Kitematic, search for elasticsearch and create a container based on the official library Dockerfile. Alternatively, for those not using Kitematic, run docker run -p 9200:9200 -p 9300:9300 elasticsearch. Elasticsearch will then be exposed on a port determined by Kitematic (e.g. http://192.168.99.100:49156) or, if you used the docker run command, on http://localhost:9200.
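Before moving on, it can help to confirm that the container is actually answering. The following is only a minimal sketch and not part of the original recipe: the helper name elasticsearch_is_up is made up for illustration, and it simply assumes that a running Elasticsearch answers an HTTP GET on its root address with a JSON document.

```python
import json

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7


def elasticsearch_is_up(uri, timeout=2):
    """Return True if an HTTP GET on `uri` yields parseable JSON.

    Hypothetical helper: it only checks that *something* JSON-speaking
    answers at the address, which is assumed to be Elasticsearch.
    """
    try:
        response = urlopen(uri, timeout=timeout)
        json.loads(response.read().decode("utf-8"))
        return True
    except Exception:
        # connection refused, timeout, malformed URL or non-JSON reply
        return False


# Usage (the address is the example from this guide; use your own):
#   elasticsearch_is_up("http://192.168.99.100:49156")
```

If this returns False, check that the container is running and that the port matches the one shown by Kitematic.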
Obtain a Twitter API key and set up a config file
You'll need a Twitter account to create an app on Twitter's developer portal. Once done, the 'Keys and Access Tokens' tab will provide the various keys. Save these in a file called config.py in the following format. Important -- do not share these keys or store them in version control. In this file, also include the Elasticsearch host and port, and the keyword you would like to capture. For this demo, I've captured tweets containing the keyword "Easter".
## Twitter API
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret_key"
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"

## Elasticsearch
elasticsearch_uri = "http://192.168.99.100:49156"

## Stream
keyword = "easter"
Install Python package dependencies
OS X users will already have Python 2.7 installed by default. The script relies on the following modules:
tweepy: to connect to the Twitter Streaming API
textblob: to analyse tweet sentiment
elasticsearch: to connect to our Elasticsearch instance
config: to load the config.py (this is the local file created above, so there is nothing to install)
datetime: to convert Twitter's bad dates into ISO 8601 format (part of the Python standard library)
To install the three third-party packages in OS X, open Terminal.app and run:
curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install tweepy textblob elasticsearch
This will install pip, a Python package manager, as well as the above packages.
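To double-check the installation before running anything, a quick import test can be sketched as follows. The function name missing_packages is made up for this example; it only reports which modules fail to import and does not verify versions.

```python
import importlib


def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # the third-party packages used by sentiment.py
    print(missing_packages(["tweepy", "textblob", "elasticsearch"]))
```

An empty list means all three third-party packages are importable.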
Connect it all together
Save the following code into a file called sentiment.py in the same directory as config.py from above:
import json
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
from elasticsearch import Elasticsearch
from datetime import datetime

# import twitter keys and tokens
from config import *

# create instance of elasticsearch
es = Elasticsearch([elasticsearch_uri])


class TweetStreamListener(StreamListener):

    # on success
    def on_data(self, data):
        try:
            # decode json
            dict_data = json.loads(data)

            # pass tweet into TextBlob
            tweet = TextBlob(dict_data["text"])

            # determine if sentiment is positive, negative, or neutral
            if tweet.sentiment.polarity < 0:
                sentiment = "negative"
            elif tweet.sentiment.polarity == 0:
                sentiment = "neutral"
            else:
                sentiment = "positive"

            # fix the timestamp format
            # https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
            timestamp = datetime.strptime(
                dict_data["created_at"].replace("+0000 ", ""),
                "%a %b %d %H:%M:%S %Y").isoformat()

            # output sentiment
            print(timestamp + ":" + sentiment)

            # add text and sentiment info to elasticsearch
            es.index(index="sentiment-demo",
                     doc_type="tweet",
                     body={"source": dict_data["source"],
                           "author": dict_data["user"]["screen_name"],
                           "location": dict_data["user"]["location"],
                           "followers": dict_data["user"]["followers_count"],
                           "timestamp": timestamp,
                           "message": dict_data["text"],
                           "polarity": tweet.sentiment.polarity,
                           "subjectivity": tweet.sentiment.subjectivity,
                           "sentiment": sentiment})
        except Exception:
            # tweet skipped due to processing error
            # (Ctrl-C is not caught here, so the script can still be stopped)
            print("processing exception")
            print(data)

        return True

    # on failure
    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    # create instance of the tweepy tweet stream listener
    listener = TweetStreamListener()

    # set twitter keys/tokens
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # create instance of the tweepy stream
    stream = Stream(auth, listener)

    # search twitter for the keyword
    stream.filter(track=[keyword])
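The two small transformations inside on_data can be tried on their own, without Twitter or Elasticsearch. The sketch below pulls them out into standalone functions; the names label_sentiment and to_iso8601 are made up here, and the sample timestamp in the comment is an illustrative value in the created_at format the script expects. TextBlob reports polarity as a float between -1.0 and 1.0.

```python
from datetime import datetime


def label_sentiment(polarity):
    """Map a polarity score to the three labels used in sentiment.py."""
    if polarity < 0:
        return "negative"
    elif polarity == 0:
        return "neutral"
    return "positive"


def to_iso8601(created_at):
    """Convert a created_at string to ISO 8601.

    Expects the UTC form, e.g. "Mon Apr 06 12:34:56 +0000 2015";
    the "+0000 " offset is dropped before parsing, as in sentiment.py.
    """
    cleaned = created_at.replace("+0000 ", "")
    return datetime.strptime(cleaned, "%a %b %d %H:%M:%S %Y").isoformat()
```

Note that only a polarity of exactly 0 counts as neutral, so even a faintly positive or negative score gets a positive or negative label.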
Once saved, in Terminal.app run python sentiment.py. If the Elasticsearch container is running, config.py exists and details are accurate, the requisite Python packages are installed and an Internet connection is available, the command line should begin outputting a date and sentiment rating (positive, negative, neutral) for each tweet captured using your keyword as it is published. Leave this process running and the tweets will be captured indefinitely.
Visualise with Kibana
To visualise the sentiment analysis results, install Kibana. In Kitematic, search for 'Kibana4' and create a container using the Dockerfile maintained by user 'marcbachmann'. Non-Kitematic users can instead run the following (replacing the Elasticsearch variable with your own address):
docker run -e ELASTICSEARCH=http://192.168.99.100:49156 -P marcbachmann/kibana4
If you are using Kitematic, once setup is completed, go to the Settings pane and add an Environment Variable called 'ELASTICSEARCH' and set it to your Elasticsearch URI (e.g. http://192.168.99.100:49156).
Go ahead and open your browser to Kibana's local address (either http://localhost:5601 or set by Kitematic).
Configure an index pattern using sentiment-demo and the field name timestamp. Now you can start to explore the data in the Discover tab. The following details are captured for each tweet (and can be modified by changing sentiment.py):
Author (Twitter account)
Timestamp of publishing
Tweet message
Count of followers at time of tweet
Tweet source (e.g. Twitter for iPhone)
Author location on profile at time of tweet
Message polarity and sentiment
Message subjectivity
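The fields above correspond to the keys of the document that sentiment.py sends to Elasticsearch. As a rough sketch, the mapping could be factored into a function like the one below; build_document is a hypothetical name, and the function simply mirrors the body passed to es.index in the script. To capture something extra, another key would be added here.

```python
def build_document(dict_data, timestamp, polarity, subjectivity, sentiment):
    """Build the Elasticsearch document for one decoded tweet.

    `dict_data` is the tweet as decoded from JSON; the remaining
    arguments are the values computed in on_data.
    """
    user = dict_data["user"]
    return {
        "source": dict_data["source"],         # tweet source
        "author": user["screen_name"],         # Twitter account
        "location": user["location"],          # profile location
        "followers": user["followers_count"],  # follower count
        "timestamp": timestamp,                # ISO 8601 publish time
        "message": dict_data["text"],          # tweet message
        "polarity": polarity,
        "subjectivity": subjectivity,
        "sentiment": sentiment,
    }
```

Keeping the mapping in one place makes it easier to see which Kibana fields will change when the script is modified.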
To build a dashboard in Kibana, first create a visualisation on the Visualisation tab. Here we will create a breakdown of "Easter" sentiment over time:
Select 'Vertical bar chart'
For the y-axis metric, select 'Count' aggregation
For the x-axis buckets, select 'Date Histogram' aggregation on field 'timestamp' with interval 'Minute'
Add sub-aggregation, select 'Split Bars'
Sub-aggregation as 'Terms', on field 'sentiment'
In View Options, make sure Bar Mode is 'stacked'. You can also scale the y-axis.
Click 'Apply' and the save disk icon, saving as 'Sentiment over time'.
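The chart built in the steps above is roughly a date histogram with a terms sub-aggregation. For anyone who wants the same numbers outside Kibana, here is a sketch of an equivalent query; it is not necessarily what Kibana sends. The function names and the aggregation names over_time and by_sentiment are made up, and the interval parameter is the spelling older Elasticsearch releases accept (newer releases renamed it, so treat it as an assumption about your version).

```python
def sentiment_over_time_query(interval="minute"):
    """Build a query body: tweet counts per time bucket, split by sentiment."""
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "over_time": {
                "date_histogram": {"field": "timestamp", "interval": interval},
                "aggs": {
                    "by_sentiment": {"terms": {"field": "sentiment"}}
                },
            }
        },
    }


def flatten_buckets(response):
    """Turn the nested aggregation response into (time, sentiment, count) rows."""
    rows = []
    for bucket in response["aggregations"]["over_time"]["buckets"]:
        for sub in bucket["by_sentiment"]["buckets"]:
            rows.append((bucket["key_as_string"], sub["key"], sub["doc_count"]))
    return rows


# Usage, assuming the `es` client created in sentiment.py:
#   response = es.search(index="sentiment-demo", body=sentiment_over_time_query())
#   for row in flatten_buckets(response):
#       print(row)
```

Each row corresponds to one coloured segment of a stacked bar in the Kibana chart.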
Now in the Dashboard tab click the plus button to add your visualisation and re-size to fit. Save the dashboard as 'Easter Sentiment'. You can then add other visuals and metrics to the dashboard. In the top right menu bar, set refresh interval to '5 seconds' and you can watch the sentiment dashboard update as new data streams in with a view similar to this:
In the time that I spent putting this tutorial together, over 70,000 "Easter" tweets were captured, mainly positive and neutral, with a few negative tweets highlighting international conflicts, holiday illness and reluctance to return to work following the long weekend.
That's all for now. Stay tuned to see a tutorial covering creating a Dockerfile to encapsulate this demo, and doing further analysis of the tweets and sentiment measures with R.










