Using Azure Cognitive Services for Sentiment Analysis of Trump's Tweets
Irfan Elahi
Jul 9 · 15 min read
https://towardsdatascience.com/using-azure-cognitive-services-for-sentiment-analysis-of-trumps-tweets-part-1-f42d68c7e40a 1/17
8/25/2019 Using Azure Cognitive Services for Sentiment Analysis of Trump’s Tweets
. . .
With this context in mind, if one has to build the capability to perform sentiment analysis at scale, one option is to use traditional machine learning approaches to build a classifier which predicts whether the input text (transformed from its raw form into a representation that machine learning algorithms can consume, e.g. bag-of-words, TF-IDF or word2vec) belongs to one of two discrete categories (positive or negative). One can use the good old supervised machine learning algorithms like Naive Bayes, Logistic Regression, SVM and Decision Trees, or ensemble-based ones like Random Forest, and achieve a relatively good level of performance on the chosen evaluation metric. Or one can explore deep learning approaches like multi-layer perceptrons to achieve a similar objective, with caveats. But chances are that no matter how advanced or well-established your data science team is, unless you are Microsoft or Google or the like, it won't be easy for your approaches to compete, in terms of performance and efficiency, with the ones used by the big companies. One of the beauties of cloud services is that they provide ready-to-consume analytics services, like Azure Cognitive Services or AWS Comprehend, built on decades of investment in R&D. One can employ these services for sentiment analysis use-cases to enable rapid speed-to-value.
In my previous article, we explored the usage of Azure Cognitive Services for content moderation, whereas this post is about using yet another service in the suite of Azure Cognitive Services: sentiment analysis!
To make matters interesting, I'll demonstrate how you can use Azure Cognitive Services to perform sentiment analysis on the tweets published by Donald Trump. Not sure about you, but many may have a preconceived and biased assumption that tweets from him bear a "specific polarity". This exercise will endeavor to validate that assumption by analyzing his recent tweets to see what sentiments they bear.
The post will consist of two parts. In the first part, I'll demonstrate how you can retrieve the data (Trump's tweets in this case) using Python, and in the next I'll highlight how you can use Scala and Databricks (Spark) to analyze the tweets in a scalable fashion. And again, the code implementation won't be production grade and is meant to prove the capability only.
Let's start off with the first part, i.e. data extraction. The overall pipeline looks like this:
Extract Tweets -> Perform some processing -> Persist in file system -> Read
Tweets in Spark -> Integrate with Azure Cognitive Services -> Generate Sentiment
Predictions -> Perform Analysis
The first three steps (extraction, processing and persistence) will be the scope of this post. The rest will be covered in part II. So let's start, shall we?
If you haven't signed up for the community edition of Databricks, you can visit the following link. Once signed up, you will have to launch a cluster before using the notebooks. With the community edition, the available cluster is pretty small, but it will be more than enough for us. To create a cluster, click "Clusters" in the left-hand menu and then click the "+ Create Cluster" button:
It will present you with options to create a cluster. Specify a name of your choice for the cluster; the parameters' default values should be fine. Once done, click "Create Cluster". It will take a few minutes for the cluster to be set up.
The cluster comes with only bare-minimum packages installed. To extract tweets, we'll be using tweepy. To use it, you will first have to install it on the cluster. To do that, click Workspace -> Users.
Click your email ID, then right-click in the cascaded section -> Create -> Library. In that section, select PyPI, write "tweepy" in the package field and click "Create".
Once that's done, a notebook will become available for your use. If you have used Jupyter notebooks before, the UX will feel somewhat similar.
Now, with the environment initialized and provisioned, another dependency needs to be catered for: to enable integration with Twitter, you need to have a Twitter app created. I won't go into the specifics of this step; however, this link provides a step-by-step procedure. Once you've registered a Twitter app, you need the following four credentials related to it:
API Key
API Secret Key
Access Token
Access Token Secret
Now we are in a position to actually do some coding! I'll endeavor to explain the code as we proceed.
Let's start off by importing the required modules and initializing the required variables. The authentication mechanism when using a Twitter app is OAuth, and tweepy makes it convenient to perform it via the OAuthHandler class, in which you specify the Twitter app credentials. You then use tweepy.API, tweepy's wrapper around Twitter's REST API, for the rest of the steps:
The object returned by the tweepy.API class provides a number of methods that can be used to extract tweets. The one we'll use in this post is the user_timeline() function, which allows us to specify a Twitter handle and the number of tweets:
trump_tweets = auth_api.user_timeline(screen_name='realDonaldTrump',
    count=600, include_rts=False, tweet_mode='extended')
Many of the parameters are self-explanatory. Specifically, setting include_rts to False excludes retweets from the account, and setting tweet_mode to 'extended' gives the full text of tweets, which would otherwise be truncated. trump_tweets is an object of type tweepy.models.ResultSet. It's an iterable that can be traversed, and each element in it has a full_text attribute containing the required text of the tweet. So let's just use Python's list comprehension to do this task as follows:
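The comprehension itself appeared as an image in the original post; here is a minimal sketch of the idea, using a hypothetical stand-in for the tweepy result set (in the real notebook, trump_tweets comes from the user_timeline() call above):

```python
from types import SimpleNamespace

# Stand-in for tweepy.models.ResultSet: an iterable whose elements each
# expose a full_text attribute, like tweepy Status objects in extended mode.
trump_tweets = [
    SimpleNamespace(full_text="MAKE AMERICA GREAT AGAIN!"),
    SimpleNamespace(full_text="The Fake News Media is working overtime."),
]

# The list comprehension keeps only the text of each tweet.
tweets_text = [tweet.full_text for tweet in trump_tweets]
```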
Here's how the tweets looked when I extracted them. Yours may vary.
Now let's save this list as a text file so that we can use it in the second stage of our pipeline. As we'll be analyzing the tweets using Spark in Databricks in part 2, using a distributed file system or object store makes sense in this context. That's where DBFS (Databricks File System) makes things easier. This topic itself demands a great deal of comprehension, but at this stage it's enough to know that DBFS is a distributed file system, physically backed by an object store depending on the cloud environment (Blob storage for Azure or S3 for AWS). You can also mount your own object stores in DBFS and access them under its namespace. And lastly, you can use local file system APIs (e.g. Python's) to read/write on DBFS as well (courtesy of the FUSE mount). Thus:
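The write step was shown as an image in the original post; a minimal sketch, assuming the tweets_text list from the previous step (on Databricks the path would be the FUSE-mounted /dbfs/FileStore/tables/trump_tweets.txt; a local temp path stands in here so the snippet runs anywhere):

```python
import os
import tempfile

tweets_text = ["MAKE AMERICA GREAT AGAIN!", "The Fake News Media is working overtime."]

def save_tweets(tweets, path):
    """Persist each tweet on its own line using plain local-file APIs."""
    with open(path, "w") as f:
        for tweet in tweets:
            f.write(tweet + "\n")

# On Databricks: save_tweets(tweets_text, "/dbfs/FileStore/tables/trump_tweets.txt")
path = os.path.join(tempfile.gettempdir(), "trump_tweets.txt")
save_tweets(tweets_text, path)
```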
You can verify that the data has been written successfully in a number of ways. One of them is to read the file back and check its contents:
with open('/dbfs/FileStore/tables/trump_tweets.txt', 'r') as f:
    read_tweets = f.read().splitlines()
This concludes the first part of this blog post, in which the process of extracting tweets from a specific user using Python (via the tweepy module) was highlighted, along with the configuration steps for using Databricks notebooks.
. . .
Dependency Management:
Firstly, you'll need to take care of a few dependencies:
1. Text Analytics API in Azure Cognitive Services — If you have an Azure subscription (the free trial will work for this example too), you'll need to provision the Text Analytics API in Azure Cognitive Services. The process is similar to the one in my previous post, where I demonstrated how one can use the Content Moderation service of Azure Cognitive Services. After the Text Analytics API has been provisioned, you need to grab the following details:
Once done, your Text Analytics API will be ready to be consumed from a REST-based client written in a language of your choice (Scala in this case).
2. Scala libraries — We will be relying on a couple of Scala libraries to get the job done. Specifically, we'll require the following:
> scalaj (to send REST calls to the Azure Text Analytics API)
> spray-json (to parse the JSON responses of the Azure Text Analytics API for further processing)
As we are using Databricks, the process of managing these dependencies for your environment is the same as before: you will have to create a library and attach it to the cluster that you launched. For JVM languages, Maven is usually the go-to repository where you can find most such dependencies. To manage these libraries in Databricks, you provide the Maven coordinates for each (which consist of groupID:artifactID:version). Given these coordinates, a dependency manager (like the one used in Databricks) can find the artifacts in the Maven repository, download them and provision them in your environment. Here's an example of how you specify the Maven coordinates of the spray-json library in Databricks:
The overall workflow of this part is as follows:
Read the tweets from DBFS -> Remove any lines in the tweets file which just have a URL in
them (as they won't carry any sentiment signal) -> Format each tweet line into the
REST call required by the Azure Text Analytics API endpoint -> Send the REST
calls to the Azure Text Analytics API endpoint -> Process the results for analytical insights
With this workflow in view, let's consider the implementation of each step. As a good practice, we'll first create some helper functions that will help us perform many of the above-mentioned tasks.
def requestFormatter(givenTweet:String):String={
s"""{
"documents":[
{
"language":"en",
"id":1,
"text":"${givenTweet}"
}
]
}"""
}
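For illustration, here is a Python sketch of the same payload construction; unlike naive string interpolation, json.dumps escapes any quotes or newlines inside the tweet text (the single-document shape mirrors the Scala helper above):

```python
import json

def request_formatter(given_tweet):
    """Build the single-document JSON body for the sentiment endpoint."""
    return json.dumps({
        "documents": [
            {"language": "en", "id": "1", "text": given_tweet}
        ]
    })

body = request_formatter('He said "great" again')
```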
https://towardsdatascience.com/using-azure-cognitive-services-for-sentiment-analysis-of-trumps-tweets-part-1-f42d68c7e40a 11/17
8/25/2019 Using Azure Cognitive Services for Sentiment Analysis of Trump’s Tweets
Each document object consists of language, id and text fields, and you can send up to 5000 objects in one REST call. However, for the sake of simplicity, I am just sending one object per REST POST call.
def sendPostRequest(textAnalyticsUrl: String, subscriptionKey: String, requestBody: String): String = {
  import scalaj.http.Http
  Thread.sleep(3000)
  val result = Http(textAnalyticsUrl).postData(requestBody)
    .header("Content-Type", "application/json")
    .header("Ocp-Apim-Subscription-Key", subscriptionKey)
    .asString
  result.body
}
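A hedged Python sketch of the same call, split so the request construction can be inspected without touching the network (the endpoint URL here is a placeholder; the Ocp-Apim-Subscription-Key header is what the service authenticates on):

```python
def build_post_request(text_analytics_url, subscription_key, request_body):
    """Assemble the URL, headers and body an HTTP client needs for the sentiment call."""
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return text_analytics_url, headers, request_body

url, headers, body = build_post_request(
    "https://<your-region>.api.cognitive.microsoft.com/text/analytics/v2.1/sentiment",  # placeholder
    "<your-subscription-key>",
    '{"documents":[{"language":"en","id":"1","text":"hello"}]}',
)
# An actual send would then be e.g. requests.post(url, headers=headers, data=body).text
```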
The response from the API looks like this:
{
  "documents": [{"id": "1", "score": 0.85717505216598511}],
  "errors": []
}
where the documents object contains a list with the score corresponding to the id of each document. The returned score ranges from 0 to 1 and is the prediction yielded by the Azure Text Analytics API: values close to 0 represent negative sentiment and values close to 1 represent positive sentiment.
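To make that interpretation concrete, a small sketch (the 0.5 cut-off and the "neutral" band around it are my own illustrative assumptions, not part of the API):

```python
def label_score(score, neutral_band=0.1):
    """Map a 0..1 sentiment score to a coarse label (illustrative thresholds)."""
    if score < 0.5 - neutral_band:
        return "negative"
    if score > 0.5 + neutral_band:
        return "positive"
    return "neutral"

label = label_score(0.857)  # score from the sample response above
```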
def removeHttpLines(textLine: String): Boolean = {
  val pattern = "^http".r
  pattern.findFirstIn(textLine) match {
    case Some(x) => false
    case _ => true
  }
}
1. The function expects a parameter (textLine of type String) and returns a Boolean value (true or false).
2. It uses a regular expression that looks for lines starting with "http". It could certainly be refined further, but for simplicity's sake, let's use that.
3. It then tries to find that pattern in the given line. Here, Scala's pattern-matching constructs are used to match against two possibilities: if a match is found, i.e. Some(x), the value returned is false; otherwise it returns true. The reasoning behind these return values will become apparent shortly.
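For comparison, an equivalent filter predicate in Python (a sketch only; the Scala version above is what the Spark job actually uses):

```python
import re

HTTP_LINE = re.compile(r"^http")

def keep_line(text_line):
    """Return True for lines that do NOT start with 'http', i.e. lines to keep."""
    return HTTP_LINE.match(text_line) is None

lines = ["https://t.co/abc123", "Big win for the USA!"]
kept = [line for line in lines if keep_line(line)]
```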
Now with these functions in place, let's implement the rest of the logic, which, thanks to Scala being a functional programming language, can interestingly be expressed as a single chain of transformations:
val tweetsSentimentsRdd = sc.textFile("/FileStore/tables/trump_tweets.txt")
  .filter(removeHttpLines)
  .map(x => requestFormatter(x))
  .map(y => sendPostRequest(url, subscriptionKey, y))
1. The first call, sc.textFile, reads the tweets file from DBFS and yields an RDD in which each element is one line (i.e. one tweet).
2. The next step is to filter out any line consisting only of an http URL. That's where we use Spark's filter transformation. As its parameter, we pass a function (thanks to functional programming again), specifically removeHttpLines, which operates on each line of the data; only the lines for which it yields true (i.e. those that don't start with http) are retained.
3. The next part transforms each line of the filtered text (i.e. with http lines removed) and converts each tweet (using the requestFormatter function) into the required JSON string of the form:
{
  "documents": [
    {
      "language": "en",
      "id": 1,
      "text": "tweet text will come here"
    }
  ]
}
4. The last portion then invokes the Azure Text Analytics API endpoint using the sendPostRequest function.
Upon executing this, no actual computation happens, thanks to Spark's lazy execution model. As the data is small and this is a PoC setting, it's safe to use the collect action (however, avoid this in production settings, as it returns data from all of Spark's distributed executor nodes to the driver program, which can cause out-of-memory issues).
Now we have the responses in the form of a Scala collection. Each element of the list is a string (JSON) holding a response from the Azure Text Analytics API, and we can't do much with it in its current state if we want to answer questions like:
> What are the maximum and minimum values of the score?
> What's the average sentiment score in this tweets corpus?
> What are the most positive tweets by Trump in this corpus?
To answer these analytical questions (and many others), one treatment of the data in this form is to transform the existing JSON strings into Scala case classes, which are much better suited to such processing. There are many ways to do this; I resorted to using the spray-json library:
To do this, firstly we have to create a parser for the JSON. This involves creating case classes that represent the structure of the JSON as follows:
and then using spray-json constructs to specify the JSON's structure, including the types of the values and how the different sections are nested within each other, e.g.:
object ResponseJsonUtility {
  import MyJsonProtocol._
  import spray.json._

  def parser(givenJson: String): AzureTextAnalyticsResponse = {
    givenJson.parseJson.convertTo[AzureTextAnalyticsResponse]
  }
}
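For readers more at home in Python, the same parse step can be sketched with dataclasses (the field names mirror the sample Azure response shown earlier; the Scala pipeline itself uses the case classes and spray-json protocol described above):

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    id: str
    score: float

@dataclass
class AzureTextAnalyticsResponse:
    documents: List[Document]
    errors: List[dict]

def parse_response(given_json):
    """Convert the raw JSON response string into a typed response object."""
    raw = json.loads(given_json)
    docs = [Document(d["id"], d["score"]) for d in raw["documents"]]
    return AzureTextAnalyticsResponse(docs, raw.get("errors", []))

resp = parse_response('{"documents":[{"id":"1","score":0.857}],"errors":[]}')
```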
Now with these functions and objects created, what remains is to use them on the Scala collection to get the desired result:
val tweetsSentimentScore = tweetsSentimentList
  .filter(eachResponse => eachResponse.contains("documents"))
  .map(eachResponse => ResponseJsonUtility.parser(eachResponse))
  .map(parsedResponse => parsedResponse.documents(0).score)
The above expression should be familiar to you by now, but here is a breakdown of the steps:
1. Firstly, we filter to keep only the elements which have a "documents" section in the JSON.
2. We then parse each JSON string and convert it to the AzureTextAnalyticsResponse case class.
3. We then access the score of each parsed response to get a list consisting solely of sentiment scores.
Once we have this, further analysis becomes convenient. For example, we can calculate the average sentiment score as follows:
(tweetsSentimentScore.sum)/(tweetsSentimentScore.length)
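The same summary statistics can be computed in plain Python; the scores below are illustrative stand-ins (the real list would come from the pipeline above):

```python
tweets_sentiment_score = [0.99, 0.63, 0.30, 0.857]  # illustrative values

# Average, maximum and minimum sentiment across the corpus.
average = sum(tweets_sentiment_score) / len(tweets_sentiment_score)
highest = max(tweets_sentiment_score)
lowest = min(tweets_sentiment_score)
```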
The score turned out to be 0.629, which means that the analyzed tweets have a slightly positive sentiment on average :-)
And similarly, we can get the maximum sentiment score of Trump's tweets:
tweetsSentimentScore.max
which turns out to be 0.99, meaning quite a few tweets had a very high degree of positivity :-)