Different Methods for Calculating Sentiment of Text

In this article, we will learn how to analyze text in order to find its sentiment score. Since we want to
calculate sentiment scores, we will look at text data generated by humans, such as product or restaurant
reviews, or tweets from famous personalities. With the help of these scores, we can understand the
emotion behind a text and classify it for further decision-making.

Table of Contents

1. Need for Sentiment Scores

2. Method 1: Using Positive and Negative Word Counts – With Normalization

3. Method 2: Using Positive and Negative Word Counts – With Semi-Normalization

4. Method 3: Using VADER SentimentIntensityAnalyzer

5. Conclusions

Need for Sentiment Scores

Today, text data is widely used to understand the behavior of customers and users towards services. For
example, companies use product reviews to understand customers’ opinions of their products. Based on
such analyses, customers can be segmented into groups, and specific decisions can be made for those who
are unhappy. The uses of text are endless, and data scientists around the world analyze it for trends.
You can also use star ratings to understand the sentiment of users’ feedback on products or services.
However, to understand the actual reaction, analyzing text-based feedback is a better solution.

In order to understand the emotion behind a text, we can analyze it using NLP techniques to find
patterns. However, if the same process has to be repeated for a large number of texts, it will take a lot
of time, and new texts will keep arriving in the meantime. For such cases, calculating sentiment scores is
useful: a positive score indicates a positive text and a negative score indicates a negative text.
Automating this sentiment score calculation helps us analyze texts and make quick decisions. To
calculate the scores, we will use both our own methods and a pre-defined rule-based method that
computes them in a jiffy.

Method 1: Using Positive and Negative Word Counts – With Normalization

In this method, we will calculate the sentiment score by counting the negative and positive words in the
given text and dividing the difference between the positive and negative word counts by the total word
count. We will be using the Amazon Cell Phone Reviews dataset from Kaggle.

To calculate the sentiment, we will use the formula:

sentiment = (positive word count - negative word count) / total word count

Let’s first import the libraries:

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
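
Note that word_tokenize, the stop word list, and the WordNetLemmatizer rely on NLTK data packages. If
you have never downloaded them, a one-time setup (a minimal sketch) looks like this:

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize (newer NLTK versions may also ask for 'punkt_tab')
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # WordNet data used by the lemmatizer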

Now, we will import the dataset. Out of all the columns present in this dataset, we
will use the ‘body’ column, which contains the cell phone reviews.

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])

Next, we will create an instance of WordNetLemmatizer, which we will use in the next step. We will also
load the stop words from the NLTK library.

lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

Next, we will define a function to preprocess the text. In this function, we first convert the input
string to lowercase. Then we keep only the letters, using a regular expression. Next, we tokenize the
string and filter out stop words. Finally, we lemmatize the remaining words and return them.

def text_prep(x: str) -> list:
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize
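
As a quick sanity check (a made-up review; the exact output can vary slightly with your NLTK version and
data), the function lowercases, strips punctuation and stop words, and lemmatizes:

print(text_prep("The phone's battery life is amazing!!"))
# e.g. ['phone', 'battery', 'life', 'amazing']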

Now, we will apply this function to every text in the ‘body’ column.

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

Next, we will record the number of words remaining in each preprocessed text. This will be needed later,
when we calculate the sentiment score.

df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

In the next step, we define generic lists of negative and positive words. For this, we use the
positive-words.txt and negative-words.txt files, together known as the Opinion Lexicon, compiled by Hu
and Liu. The files can be downloaded from the authors’ website.

with open('negative-words.txt', 'r') as file:
    neg_words = file.read().split()
with open('positive-words.txt', 'r') as file:
    pos_words = file.read().split()
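
Note that a plain read().split() also picks up the comment header that ships with the Opinion Lexicon
files (lines starting with ‘;’), and the files use Latin-1 encoding. A slightly more defensive loader (a
sketch, assuming the standard lexicon files) skips the header and stores the words in sets for faster
lookups:

def load_lexicon(path):
    # skip ';'-prefixed header lines; the files are Latin-1 encoded
    with open(path, encoding='latin-1') as f:
        return set(line.strip() for line in f
                   if line.strip() and not line.startswith(';'))

neg_words = load_lexicon('negative-words.txt')
pos_words = load_lexicon('positive-words.txt')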

Next, we will count the positive and negative words in each text. For this, we count all the words in
preprocess_txt that appear in the positive word list pos_words, and similarly all the words that appear
in the negative word list neg_words. We then add these counts to the main Pandas DataFrame.

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

Finally, we will calculate the sentiment. We will create a ‘sentiment’ column and perform the calculation
for each text.

df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
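
For intuition: a review whose preprocessed form has 10 words, 3 of them positive and 1 negative, scores
(3 - 1) / 10 = 0.2. Because the difference is divided by the total length, the score always lies between
-1 (all words negative) and +1 (all words positive).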

Putting it all together:

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

with open('negative-words.txt', 'r') as file:
    neg_words = file.read().split()
with open('positive-words.txt', 'r') as file:
    pos_words = file.read().split()

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
df.head()

On executing this code, we get the first five rows of the DataFrame, including the new pos_count,
neg_count, and sentiment columns.

Method 2: Using Positive and Negative Word Counts – With Semi-Normalization

In this method, we calculate the sentiment score as the ratio of the positive word count to the negative
word count plus one:

sentiment = positive word count / (negative word count + 1)

Since no difference of values is involved, the sentiment value will always be at least 0. Adding 1 to the
denominator also prevents a ZeroDivisionError when a text has no negative words. Let’s start with the
implementation. The code remains the same as in the previous method up to the positive and negative word
counts, except that this time we don’t need the total word count, so we omit that part.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

with open('negative-words.txt', 'r') as file:
    neg_words = file.read().split()
with open('positive-words.txt', 'r') as file:
    pos_words = file.read().split()

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

Next, we will apply the formula and add the resulting value to the main DataFrame.

df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
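
For intuition: a review with 3 positive and 1 negative word scores 3 / (1 + 1) = 1.5, while a review with
no positive words scores 0. A score above 1 means the positive words outnumber the negative words by more
than one.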

Putting it all together:

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

def text_prep(x: str) -> list:
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

with open('negative-words.txt', 'r') as file:
    neg_words = file.read().split()
with open('positive-words.txt', 'r') as file:
    pos_words = file.read().split()

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
df.head()

On executing this code, we get the first five rows of the DataFrame with the new sentiment column.

Method 3: Using VADER SentimentIntensityAnalyzer

In this method, we will use the SentimentIntensityAnalyzer, which uses the VADER lexicon. VADER stands
for Valence Aware Dictionary and sEntiment Reasoner, a rule-based sentiment analysis tool. VADER
calculates text emotions and determines whether the text is positive, neutral, or negative. The analyzer
produces four different output scores: positive, negative, neutral, and compound.

Here, we will make use of the compound score. The compound score is computed by summing the valence
scores of each word found in the lexicon (adjusted according to VADER’s rules) and normalizing the result
to lie between -1 (most negative) and +1 (most positive). It is not recommended to apply general text
preprocessing before the calculation, since VADER uses cues such as capitalization and punctuation, and
removing them may affect the results. Using these VADER results, you can also classify each text into
categories according to your needs, as shown at the end of this section.
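
For example (an illustrative sentence of our own; the exact numbers depend on the text and the NLTK
version), a single call returns all four scores in one dictionary:

from nltk.sentiment.vader import SentimentIntensityAnalyzer

scores = SentimentIntensityAnalyzer().polarity_scores("The camera is GREAT, but the battery dies fast!")
# scores is a dict with the keys 'neg', 'neu', 'pos', and 'compound'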

Now, we will use VADER to implement the calculation of the sentiment score. Since this process is
different from the previous two methods, we will be implementing it from scratch.

First, we will import the libraries. We will use the NLTK library for importing the
SentimentIntensityAnalyzer.

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
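
If the VADER lexicon has not been downloaded before, fetch it once via NLTK’s downloader:

import nltk
nltk.download('vader_lexicon')  # lexicon used by SentimentIntensityAnalyzer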

Next, we will create the instance of SentimentIntensityAnalyzer.

sent = SentimentIntensityAnalyzer()

Next, we will read the cell phone reviews dataset that we have been using previously.

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])

Finally, we will calculate the compound polarity score for each text and add it to the main DataFrame.

# str() guards against missing (NaN) review bodies
polarity = [round(sent.polarity_scores(str(i))['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity

Putting it all together:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sent = SentimentIntensityAnalyzer()
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
polarity = [round(sent.polarity_scores(str(i))['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity
df.head()

On executing this code, we get the first five rows of the DataFrame with the new sentiment_score column.
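
If you also need discrete classes, a common convention (the thresholds suggested in VADER’s
documentation, though any cutoffs can be chosen) maps the compound score to three labels:

def label(compound):
    # widely used thresholds: >= 0.05 positive, <= -0.05 negative, else neutral
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

df['sentiment_label'] = df['sentiment_score'].map(label)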

Conclusions

In this article, we learned three different ways of calculating sentiment scores for the given cell phone
reviews. The first two methods compute the score from our own word counts, while the third is rule-based
and evaluates the score by itself. Each of these methods has its own pros and cons, and there are numerous
other ways to calculate a sentiment score; one can even define one’s own scoring method. VADER is more
robust and precise with Internet slang, which is widely used today. One could also try building a new and
stronger evaluation method that covers a wider range of words.
