
8/25/2019 Using Azure Cognitive Services for Sentiment Analysis of Trump’s Tweets


Using Azure Cognitive Services for Sentiment Analysis of Trump’s Tweets

An extensive tutorial on how to use Azure Cognitive Services (Text Analytics API) to perform sentiment analysis using Databricks (Python, Scala)

Irfan Elahi
Jul 9 · 15 min read

https://towardsdatascience.com/using-azure-cognitive-services-for-sentiment-analysis-of-trumps-tweets-part-1-f42d68c7e40a 1/17
Photo by Con Karampelas on Unsplash

. . .

First Section — Extracting Tweets


Sentiments manifest anywhere: reviews, news, real-life conversations, and journalism, to name a few. The capability to accurately identify the polarity of sentiment by employing machine learning unlocks a series of business use-cases that can yield immense value. Instead of having humans parse through content to infer sentiment, a process that is inefficient in both cost and time to value, machines can determine sentiment automatically, and businesses can tap this insight to align their strategies. A retail company, for example, can expedite tailored marketing to reduce the risk of churn and limit the impact of negative word of mouth, or use it as a feedback loop to revisit and enhance its product or service. As Bill Gates once famously stated, “Your most unhappy customers are your greatest source of learning.”

With this context in mind, if one has to build the capability to perform sentiment analysis at scale, one option is the traditional machine learning route: build a classifier that predicts whether the input, text transformed from its raw form into a representation machine learning algorithms can understand (bag-of-words, TF-IDF, or another representation such as word2vec), belongs to one of two discrete categories (positive or negative). One can use the good old supervised learning algorithms like Naive Bayes, Logistic Regression, SVM, and Decision Trees, or ensembles like Random Forest, and achieve a relatively good level of performance on the chosen evaluation metric. Alternatively, one can explore deep learning approaches like multi-layer perceptrons to achieve a similar objective, with caveats. But chances are that no matter how advanced or well-established your data science team is, unless you are Microsoft or Google or the like, it won’t be easy for your approaches to compete, in performance and efficiency, with those built by the big companies. One of the beauties of cloud services is that they provide ready-to-consume analytics services like Azure Cognitive Services or AWS Comprehend, built on decades of investment in R&D. One can employ these services for sentiment analysis use-cases to enable rapid speed-to-value. In my previous article, we explored the use of Azure Cognitive Services for content moderation; this post is about using yet another service in the Azure Cognitive Services suite for sentiment analysis!
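For context, here is a toy, from-scratch sketch of the traditional route described above: a bag-of-words Naive Bayes classifier over a made-up corpus. Real systems would use a proper library and far more data; this only illustrates the mechanics.

```python
# Tiny bag-of-words Naive Bayes sentiment classifier (toy data, illustrative only).
import math
from collections import Counter

train = [
    ("great product love it", 1),
    ("fantastic experience great", 1),
    ("awful service very disappointed", 0),
    ("terrible awful would not recommend", 0),
]

# count word occurrences per class
counts = {0: Counter(), 1: Counter()}
class_totals = Counter()
for text, label in train:
    counts[label].update(text.split())
    class_totals[label] += 1

vocab = set(w for c in counts.values() for w in c)

def predict(text):
    best_label, best_logp = None, float("-inf")
    for label in (0, 1):
        # log prior + log likelihoods with add-one smoothing
        logp = math.log(class_totals[label] / sum(class_totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.split():
            logp += math.log((counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

print(predict("great fantastic product"))  # → 1
print(predict("awful terrible service"))   # → 0
```

Even this crude model gets the toy examples right, but closing the gap to a production-grade service is exactly the hard part the managed APIs solve.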

To make matters interesting, I’ll demonstrate how you can use Azure Cognitive Services to perform sentiment analysis on the tweets published by Donald Trump. Not sure about you, but many may have a preconceived and biased assumption that his tweets bear a “specific polarity”. This exercise will endeavor to validate that assumption by analyzing his recent tweets to see what sentiments they bear.

The post will consist of two parts. In the first part, I’ll demonstrate how you can retrieve the data (Trump’s tweets in this case) using Python, and in the second I’ll highlight how you can use Scala and Databricks (Spark) to analyze the tweets in a scalable fashion. And again, the code won’t be production grade; it is meant to prove the capability only.

Let’s start off with the first part, i.e. data extraction. For this, you will require the following:

1. Python (I am using Python 3.5)

2. Development Environment (IDE or notebooks. I am using Databricks)

The workflow will look something like this:

Extract Tweets -> Perform some processing -> Persist in file system -> Read
Tweets in Spark -> Integrate with Azure Cognitive Services -> Generate Sentiment
Predictions -> Perform Analysis

The first portion of the workflow (extracting, processing and persisting the tweets) will be the scope of this post. The rest will be covered in part II. So let’s start, shall we?

If you haven’t signed up for the community version of Databricks, you can visit the
following link. Once signed up, you will have to launch a cluster prior to using the
notebooks. With community edition, the available cluster is pretty small but it will be
more than enough for us. To create a cluster, click “Clusters” on the left hand menu and
then click “+ Create Cluster” button:

It will present you with options to create a cluster. Specify a name of your choice; the parameters’ default values should be fine. Once done, click “Create Cluster”. It will take a few minutes for the cluster to be set up.

The cluster comes with only bare-minimum packages installed. To extract tweets, we’ll be using tweepy, so you will first have to install it on the cluster. To do that, click Workspace -> Users.

Click your email id, then right-click in the cascaded section -> Create -> Library. In that section, select PyPI, write “tweepy” in the package field and click “Create”.

Once done, click “New Notebook” in the main page of Databricks:

Give a name to your notebook, select Python, and click “Create”.

Once that’s done, a notebook will become available for your use. If you have used
Jupyter notebooks before, the UX will appear to be somewhat similar.

Now, with the environment initialized and provisioned, another dependency needs to be catered for: to enable integration with Twitter, you need to have a Twitter app created. I won’t go into the specifics of this step; however, this link provides a step-by-step procedure. Once you’ve registered a Twitter app, you need the following four credentials related to it:

API Key

API secret key

Access token

Access token secret

Now we are in a position to actually write some code! I’ll endeavor to explain it as we proceed.

Let’s start off by importing the required modules and initializing the required variables.

from tweepy import OAuthHandler
from tweepy import API
from tweepy import Cursor

consumer_key = ""         # twitter app's API key
consumer_secret = ""      # twitter app's API secret key
access_token = ""         # twitter app's access token
access_token_secret = ""  # twitter app's access token secret

The authentication mechanism for a Twitter app is OAuth, and tweepy makes it convenient via the OAuthHandler class, in which you specify the Twitter app’s credentials. You then use tweepy’s API class, a wrapper around Twitter’s API, for the rest of the steps:

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
auth_api = API(auth)

The object returned by the tweepy API class provides a number of methods that can be used to extract tweets. The one we’ll use in this post is the user_timeline() function, which allows us to specify a Twitter handle and the number of tweets:

trump_tweets = auth_api.user_timeline(screen_name='realDonaldTrump',
                                      count=600,
                                      include_rts=False,
                                      tweet_mode='extended')

Many of the parameters are self-explanatory. Specifically, setting include_rts to False excludes retweets from the account, and tweet_mode='extended' gives the full text of tweets, which would otherwise be truncated. trump_tweets is an object of type tweepy.models.ResultSet. It’s an iterable that can be traversed, and each element contains an attribute full_text that holds the text of the tweet. So let’s just use Python’s list comprehension to do this task as follows:

final_tweets = [each_tweet.full_text for each_tweet in trump_tweets]

Here’s what the tweets looked like when I extracted them; yours may vary.

Now let’s save this list as a text file so we can use it in the second stage of our pipeline. As we’ll be analyzing the tweets with Spark in Databricks in part 2, using a distributed file system or object store makes sense, and that’s where DBFS (Databricks File System) makes things easier. The topic itself demands a great deal of coverage, but at this stage it’s enough to know that DBFS is a distributed file system, physically backed by an object store depending on the cloud environment (Blob storage for Azure, S3 for AWS). You can also mount your own object stores in DBFS and access them under its namespace. And lastly, you can use local file system APIs (e.g. Python’s) to read and write on DBFS as well (courtesy of a FUSE mount). Thus:

with open('/dbfs/FileStore/tables/trump_tweets.txt', 'w') as f:
    for item in final_tweets:
        f.write("%s\n" % item)

You can verify that the data has been written successfully in a number of ways. One of
the ways is to read the file again to ensure that it has been written correctly:

read_tweets = []
with open('/dbfs/FileStore/tables/trump_tweets.txt', 'r') as f:
    read_tweets.append(f.read())
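Note that f.read() returns the whole file as one string, so read_tweets ends up as a single-element list. If you want one tweet per element for spot checks, splitlines() helps; here is a self-contained sketch using a local temp path in place of the DBFS path:

```python
import os
import tempfile

tweets = ["Great meeting today!", "The Fake News Media is at it again."]

# stand-in for /dbfs/FileStore/tables/trump_tweets.txt
path = os.path.join(tempfile.mkdtemp(), "trump_tweets.txt")

# write one tweet per line, exactly as in the snippet above
with open(path, "w") as f:
    for item in tweets:
        f.write("%s\n" % item)

# read back one tweet per element instead of one big string
with open(path) as f:
    read_tweets = f.read().splitlines()

print(read_tweets == tweets)  # → True
```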

Here’s a sample output:

This concludes the first part of this blog post, in which we covered the process of extracting tweets from a specific user with Python (via the tweepy module), along with the configuration steps for using Databricks notebooks.

. . .

Second Section — Azure Cognitive Services Integration for Sentiment Analysis
After you have extracted the required set of tweets from your specified Twitter handle (@realDonaldTrump in this example), the next section of this analytical pipeline is to integrate with Azure Cognitive Services to get sentiment predictions. We’ll use Spark in Databricks to enable processing at scale (with some caveats). As mentioned before, Azure Cognitive Services is a suite of services, and for sentiment analysis of the tweets we will be using the Azure Text Analytics API. Also, to broaden the impact of this post and add more breadth to your skill-set, I’ll demonstrate how you can do all of this in Scala! So let’s get started.

Dependency Management:
Firstly, you’ll need to take care of a few dependencies:
1. Text Analytics API in Azure Cognitive Services — If you have an Azure subscription (the free trial will work for this example too), you’ll need to provision the Text Analytics API in Azure Cognitive Services. The process is similar to my previous post, where I demonstrated how one can use the Content Moderation service of Azure Cognitive Services. After the Text Analytics API has been provisioned, you need to grab the following details:

> Text Analytics API end-point


> API authentication key

Once done, your Text Analytics API will be ready to be consumed from a REST client written in a language of your choice (Scala in this case).

2. Scala libraries — We will be relying on a couple of Scala libraries to get the job
done. Specifically, we’ll require the following ones:
> scalaj (to send REST calls to Azure Text Analytics API)
> spray json (to parse the Json responses of Azure Text Analytics API that we’ll use for
further processing subsequently)

As we are using Databricks, the process of managing these dependencies is similar to before: you create a library and attach it to the cluster you launched. For JVM languages, Maven is usually the go-to repository where you can find such dependencies. To manage these libraries in Databricks, you provide their Maven coordinates (which consist of groupId:artifactId:version). Given these coordinates, a dependency manager (like the one built into Databricks) can locate the artifacts in the Maven repository, download them, and provision them in your environment. Here’s an example of how you specify the Maven coordinates of the spray-json library in Databricks:
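For reference, the coordinates for the two libraries look something like the following (versions are illustrative; pick builds on Maven Central matching your cluster’s Scala version):

```
io.spray:spray-json_2.11:1.3.5
org.scalaj:scalaj-http_2.11:2.4.1
```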

You’ll have to do the same for scalaj library.


Now, coming to the implementation, it always helps to have an overview of the workflow you’ll be implementing. Ours will consist of the following steps:

Read the tweets from DBFS -> Remove any lines in the tweets file which just have URL in
them (as they won’t have any signal of sentiment in them) -> Format each tweet line in the
form of a REST call that’s required for Azure Text Analytics API end-point -> send REST
calls to Azure Text Analytics API end-point -> Process the results for analytical insights

With this workflow in view, let’s consider implementation of each step. As a good
practice, we’ll firstly create some helper functions that will help us to perform many of
the above mentioned tasks.

Scala Function to format tweets data in REST POST call format:


Create a Databricks notebook, choose Scala as the language, and in a cell, write this
code:

def requestFormatter(givenTweet:String):String={
  // note: assumes givenTweet contains no unescaped double quotes or
  // backslashes; real-world text should be escaped by a Json library
  s"""{
    "documents":[
      {
        "language":"en",
        "id":1,
        "text":"${givenTweet}"
      }
    ]
  }"""
}

Let’s understand what’s happening in this code:

1. We created a function named requestFormatter that accepts one parameter (givenTweet) of type String.
2. The function returns a String.
3. The function builds the Json required by the Azure Text Analytics API, which consists of a key-value pair with the key “documents” and, as its value, a list of objects with language, id and text fields. These fields are self-explanatory; the id field should be unique within the list, and text is where the actual data (the tweet, in this case) is embedded. Also, as documents is a list, you can include multiple objects with language, id and text fields, and send up to 5000 objects in one REST call. For the sake of simplicity, however, I am sending just one object per REST POST call.
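One caveat with the string interpolation above: a tweet containing double quotes or backslashes will produce invalid Json. A safer approach, sketched here in Python for illustration, is to let a Json library handle the escaping (the function name is mine, mirroring requestFormatter):

```python
import json

def request_formatter(given_tweet: str) -> str:
    # json.dumps escapes quotes, backslashes and newlines for us
    return json.dumps({
        "documents": [
            {"language": "en", "id": "1", "text": given_tweet}
        ]
    })

body = request_formatter('He said "tremendous" today')
print(json.loads(body)["documents"][0]["text"])  # → He said "tremendous" today
```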

Scala function to Send REST POST call:


In another cell, type the following code:

def sendPostRequest(textAnalyticsUrl: String, subscriptionKey: String, requestBody: String): String = {
  import scalaj.http.Http
  Thread.sleep(3000)
  val result = Http(textAnalyticsUrl).postData(requestBody)
    .header("Content-Type", "application/json")
    .header("Ocp-Apim-Subscription-Key", subscriptionKey)
    .asString
  result.body
}

Here’s what we are doing in this function:

1. The function sendPostRequest accepts three parameters: textAnalyticsUrl (the Azure Text Analytics API endpoint URI), subscriptionKey (the key you retrieved previously, used to authenticate your REST calls) and requestBody (the data sent as the body of the REST call).
2. We introduced a delay of 3 seconds, just for this implementation, so that Azure doesn’t block our requests. There are better ways to address this limitation.
3. We then send a REST POST call, specifying the URI and headers (“Content-Type”: “application/json” and “Ocp-Apim-Subscription-Key”: subscriptionKey), and populate the body of the request with the Json produced by our previous function.
4. Lastly, we return the body of the REST response as a String, which looks like this:

{
  "documents": [{"id": "1", "score": 0.85717505216598511}],
  "errors": []
}

where the documents object contains a list with the score corresponding to the id of each document. The returned score varies from 0 to 1 and is the prediction yielded by the Azure Text Analytics API: values close to 0 represent negative sentiment, and values close to 1 represent positive sentiment.

Scala function to remove URL-only lines from the tweets file:

This function filters out lines in our file of tweets that consist of nothing but a URL.

def removeHttpLines(textLine: String): Boolean = {
  import scala.util.matching.Regex
  val pattern = "^http".r
  pattern.findFirstIn(textLine) match {
    case Some(x) => false
    case _ => true
  }
}

1. The function expects one parameter (textLine, of type String) and returns a Boolean (true or false).
2. It uses a regular expression to look for lines starting with “http”. The pattern could certainly be refined further, but for simplicity’s sake, let’s use that.
3. It then tries to find the pattern in each line, using Scala’s pattern matching to handle the two possibilities: if a match is found, i.e. Some(x), the function returns false; otherwise it returns true. The reasoning behind these return values will become apparent shortly.
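For comparison, the same URL-line filter can be sketched in Python (the function name is mine, not from the original code):

```python
import re

def keep_line(text_line: str) -> bool:
    # drop lines that start with "http" (i.e. URL-only lines), keep the rest
    return re.match(r"^http", text_line) is None

lines = ["https://t.co/abc123", "Just had a great rally!"]
print([l for l in lines if keep_line(l)])  # → ['Just had a great rally!']
```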

Now, with these functions in place, let’s implement the rest of the logic which, interestingly, and thanks to Scala being a functional programming language, can be expressed in one line of code:

// url and subscriptionKey hold your Text Analytics endpoint and key
val tweetsSentimentsRdd = sc.textFile("/FileStore/tables/trump_tweets.txt")
  .filter(removeHttpLines)
  .map(x => requestFormatter(x))
  .map(y => sendPostRequest(url, subscriptionKey, y))

Let’s decipher what’s happening here:

1. Firstly, we use the SparkContext (sc) textFile function, which reads data from a text file (usually on HDFS; in this case DBFS, which implements its interfaces). It accepts the file path as a string, which is where we specify the location of our tweets file. It returns an RDD of type String, where each element corresponds to one line of the file(s).

2. The next step is to filter out lines consisting of just an http URL. That’s where we use Spark’s filter transformation: we pass it a function (thanks to functional programming again), specifically removeHttpLines, which operates on each line and keeps only those lines for which it returns true (i.e. those that don’t start with http).
3. The next part transforms each line of the filtered text (with the http lines removed), converting each tweet (using the requestFormatter function) into the required Json string of the form:

{
  "documents": [
    {
      "language": "en",
      "id": 1,
      "text": "tweet text will come here"
    }
  ]
}

4. The last portion invokes the Azure Text Analytics API endpoint using the sendPostRequest function.

Upon executing this, nothing actually runs, thanks to Spark’s lazy execution model. As the data is small and this is a PoC setting, it’s safe to use the “collect” action (avoid it in production settings, however, as it pulls data from all the distributed executor nodes back to the driver program, which can cause out-of-memory problems).

val tweetsSentimentList = tweetsSentimentsRdd.collect()

Now we have the responses as a Scala collection. Each element of the list is a string (Json) containing the response from the Azure Text Analytics API, and we can’t do much with it in this raw state. Suppose you want to answer a few questions like:
> What are the maximum and minimum values of the score?
> What’s the average sentiment score across this tweets corpus?
> What are the most positive tweets by Trump in this corpus?

To answer these analytical questions (and many others), one treatment is to transform the Json strings into Scala case classes, which are well suited to such processing. There are many ways to do this, but I resorted to the spray-json library:

To do this, we will first have to create a parser for the Json. This involves creating case classes that represent the structure of the Json, as follows:

case class ResponseBody(id: String, score: Double)

case class AzureTextAnalyticsResponse(documents: List[ResponseBody],
                                      errors: List[String])

and then using spray-json constructs to specify the Json’s structure, including the key names and how the different sections are nested within each other, e.g.

object ResponseJsonUtility extends java.io.Serializable {

  import spray.json._
  import DefaultJsonProtocol._

  object MyJsonProtocol extends DefaultJsonProtocol {
    // this represents the inner document object of the Json
    implicit val responseBodyFormat = jsonFormat(ResponseBody, "id", "score")
    // this represents the outer key-value pairs of the Json
    implicit val responseFormat =
      jsonFormat(AzureTextAnalyticsResponse, "documents", "errors")
  }

  import MyJsonProtocol._

  // and lastly, a function to parse the Json (string), which returns
  // the data in the form of a case class object
  def parser(givenJson: String): AzureTextAnalyticsResponse = {
    givenJson.parseJson.convertTo[AzureTextAnalyticsResponse]
  }
}

Now, with these functions and objects created, what remains is to use them on the Scala collection to get the desired result:

val tweetsSentimentScore = tweetsSentimentList
  .filter(eachResponse => eachResponse.contains("documents"))
  .map(eachResponse => ResponseJsonUtility.parser(eachResponse))
  .map(parsedResponse => parsedResponse.documents(0).score)

The above expression should be familiar to you by now, but here is a breakdown of the steps:
1. Firstly, we filter so as to only consider elements whose Json has a “documents” section.
2. We then parse each Json string into an AzureTextAnalyticsResponse case class.
3. We then access the score of each parsed object to get a list consisting only of sentiment scores.
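The same filter-parse-extract chain can be sketched in Python for readers less familiar with Scala (the sample responses below are made up for illustration):

```python
import json

# hypothetical raw responses, as returned by the POST calls
responses = [
    '{"documents":[{"id":"1","score":0.857}],"errors":[]}',
    '{"error":"invalid request"}',  # a failed call: no "documents" key
]

scores = [
    json.loads(r)["documents"][0]["score"]  # steps 2 and 3: parse, take score
    for r in responses
    if "documents" in r                     # step 1: keep only valid responses
]
print(scores)  # → [0.857]
```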

Once we have this, further analysis becomes convenient. For example, we can calculate the average sentiment score as follows:

(tweetsSentimentScore.sum)/(tweetsSentimentScore.length)

The score turned out to be 0.629, which means that the analyzed tweets have a slightly positive sentiment on average :-)

And similarly, we can get the maximum sentiment score of Trump’s tweets:

tweetsSentimentScore.max

which turns out to be 0.99, meaning that quite a few tweets had a high degree of positivity :-)

Thus, in conclusion, we can use this approach to do a number of analyses addressing different sets of questions. This is also a pretty basic implementation with plenty of opportunities for enhancement, so I’d be keen to know what you do with this foundational knowledge. If you struggled with some of the Scala and Spark concepts, do check out my recent Scala Programming for Big Data Analytics book, published by Apress.
