Professional Documents
Culture Documents
Sentiment Analysis On User-Generated Tweets
Sentiment Analysis On User-Generated Tweets
By Gokul Ghate
Sentiment Analysis is key to practice social listening on social media platforms.
Listening is the practice of actively receiving and interpreting messages in a
conversation, verbal or written. In social media, it applies the same way. Brands can
now actively listen and monitor how customers perceive a new product, services, or
features development.
For Example:
OnePlus famously used sentiment analysis to detect what their consumers
discussed in their new smartphone OnePlus6 launch .
Academic Papers have been developed around sentiment analysis of Twitter
data, such as in the 2012 U.S. Presidential Election Cycle .
Importing dependencies
import tensorflow as tf #deep learning library that uses neural network
behind it
from wordcloud import WordCloud # for data visualization
import matplotlib.pyplot as plt # for interactive visualizations
import pandas as pd
import numpy as np
import re
print("Tensorflow Version",tf.__version__)
Dataset Preprocessing
The dataset being used here is Sentiment-140. It contains a labels data of 1.6M Tweets and I
find it a good amount of data to train a model.
df = pd.read_csv('C:/Users/gokul/Desktop/Projects_on_github/Sentiment
Analysis on user-generated
tweets/archive(1)/training.1600000.processed.noemoticon.csv',
encoding = 'latin',header=None)
df.head()
0 1 2 3 4 \
0 0 1467810369 Mon Apr 06 22:19:45 PDT 2009 NO_QUERY _TheSpecialOne_
1 0 1467810672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton
2 0 1467810917 Mon Apr 06 22:19:53 PDT 2009 NO_QUERY mattycus
3 0 1467811184 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY ElleCTF
4 0 1467811193 Mon Apr 06 22:19:57 PDT 2009 NO_QUERY Karoli
5
0 @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 is upset that he can't update his Facebook by ...
2 @Kenichan I dived many times for the ball. Man...
3 my whole body feels itchy and like its on fire
4 @nationwideclass no, it's not behaving at all....
The columns are without any proper names. Renaming them to give them a
reference:
df.columns = ['sentiment', 'id', 'date', 'query', 'user_id', 'text']
df.head()
user_id text
0 _TheSpecialOne_ @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 scotthamilton is upset that he can't update his Facebook by ...
2 mattycus @Kenichan I dived many times for the ball. Man...
3 ElleCTF my whole body feels itchy and like its on fire
4 Karoli @nationwideclass no, it's not behaving at all....
sentiment text
0 Negative @switchfoot http://twitpic.com/2y1zl - Awww, t...
1 Negative is upset that he can't update his Facebook by ...
2 Negative @Kenichan I dived many times for the ball. Man...
3 Negative my whole body feels itchy and like its on fire
4 Negative @nationwideclass no, it's not behaving at all....
Decoding the labels. Mapping 0 -> Negative and 1 -> Positive. Following this we will
analyse the dataset by its distribution because it is important that we have small
amount of examples for given classes.
val_count = df.sentiment.value_counts()
plt.figure(figsize=(4,4))
plt.bar(val_count.index, val_count.values)
plt.title("Sentiment Data Distribution")
plt.show()
We can see that the dataset is without any skewness.
Data Exploration
import random
random_idx_list = [random.randint(1,len(df.text)) for i in range(10)] #
creates random indexes to choose from dataframe
df.loc[random_idx_list,:].head(10) # Returns the rows with the index and
display it
sentiment text
643864 Negative @selenagomez respond to your fans more
1127003 Positive so glad the day is over ! excited for tonight
584731 Negative How does it feel to repatriate your lover back...
1243807 Positive listening to mws the first decade. memories of...
917274 Positive @pranaydewan thanku cooking is simple but the...
1288089 Positive @Cannon007 Very true The only thing worse was...
1125909 Positive @phinnia it's so rofltastic and ridiculous. w...
604766 Negative @tyrone yes i now mhhhhhhhhhh i wi...
866687 Positive @ArsenalSarah @Ndnbluez Enjoy our coffee &...
794450 Negative @KhloeKardashian unfortch I'm workin in ICU to...
We have to take out irregularities from the text. The punctuations used has no
contextual meaning and adds no value to the feature for model training.
Text Preprocessing
Tweet texts often consists of other user mentions, hyperlink texts, emoticons and
punctuations. In order to use them for learning using a Language Model, we cannot
permit those texts for training a model. So we have to clean the text data using
various preprocessing and cleansing methods.
Stemming / Lemmatization
For grammatical reasons, documents are going to use different forms of a word,
such as write, writing and writes. Additionally, there are families of derivationally
related words with similar meanings. The goal of both stemming and lemmatization
is to reduce inflectional forms and sometimes derivationally related forms of a word
to a common base form.
Stemming usually refers to a process that chops off the ends of words in the hope of
achieving goal correctly most of the time and often includes the removal of
derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary
and morphological analysis of words, normally aiming to remove inflectional
endings only and to return the base and dictionary form of a word
For example:
Stopwords
Stopwords are commonly used words in English which have no contextual meaning
in an sentence. So therefore we remove them before classification. Some stopwords
are...
'i', 'our', 'yours', 'his', 'herself', 'them', 'which', 'these', 'was', 'have', 'does', 'could',
'she's' etc..
NLTK is a python library which has functions to perform text processing task for NLP.
stop_words = stopwords.words('english')
stemmer = SnowballStemmer('english')
text_cleaning_re = "@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
Positive Words
plt.figure(figsize = (8,8))
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate("
".join(df[df.sentiment == 'Positive'].text))
plt.imshow(wc, interpolation = 'bilinear')
plt.show()
Negative Words
plt.figure(figsize = (10,10))
wc = WordCloud(max_words = 2000 , width = 1600 , height = 800).generate("
".join(df[df.sentiment == 'Negative'].text))
plt.imshow(wc , interpolation = 'bilinear')
plt.show()
Train and Test split
TRAIN_SIZE = 0.8
MAX_NB_WORDS = 100000
MAX_SEQUENCE_LENGTH = 30
Using train_test_split to shuffle the dataset and split it into training and testing
dataset. It's important to shuffle our dataset before training.
train_data.head(10)
sentiment text
23786 Negative need friends
182699 Negative im trying call impossible
476661 Negative good pace going 3k 13 min missed 5k turn ended...
1181490 Positive u gonna shows ny soon luv see u live
878773 Positive hell yea get em tattoos ink free wish parents ...
130866 Negative yeah need 2 see ur mom calls back first rememb...
1235876 Positive sounds like cup tea sign
717314 Negative tired want sleep wtf
969880 Positive amazing wish
748698 Negative thank god wkrn abc affiliate nashville back mi...
Tokenization
Tokenization is the task of chopping it up into pieces, called tokens, perhaps at the
same time throwing away certain characters, such as punctuation.
tokenizer - create tokens for every word in the data corpus and map them to
a index using dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_data.text)
word_index = tokenizer.word_index
vocab_size = len(tokenizer.word_index) + 1
print("Using TensorFlow Backend")
print("Vocabulary Size :", vocab_size)
We now have a tokenizer object, which can be used to covert any word into a Key in
dictionary (number).
Now we are going to build a Sequence Model. A sequence of numbers has to be fed
in it.
x_train = pad_sequences(tokenizer.texts_to_sequences(train_data.text),
maxlen = MAX_SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(test_data.text),
maxlen = MAX_SEQUENCE_LENGTH)
print("Training X Shape:",x_train.shape)
print("Testing X Shape:",x_test.shape)
labels = train_data.sentiment.unique().tolist()
Label Encoding
We are building the model to predict class in enocoded form (0 or 1 as this is a
binary classification). We should encode our training labels to encodings.
encoder = LabelEncoder()
encoder.fit(train_data.sentiment.to_list())
y_train = encoder.transform(train_data.sentiment.to_list())
y_test = encoder.transform(test_data.sentiment.to_list())
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)
Word Embedding
In Language Model, words are represented in a way to intend more meaning and for
learning the patterns and contextual meaning behind it.
Word Embedding is one of the popular representation of document vocabulary. It
is capable of capturing context of a word in a document, semantic and syntactic
similarity, relation with other words, etc.
Basically, it's a feature vector representation of words which are used for other
natural language processing applications.
Going in the path of Computer Vision, we use Transfer Learning. We download the
pre-trained embedding and use it in our model.
The pretrained Word Embedding like GloVe & Word2Vec gives more insights for a
word which can be used for classification.
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
embeddings_index = {}
f = open(GLOVE_EMB, encoding="utf8")
for line in f:
values = line.split()
word = value = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
f.close()
embedding_layer = tf.keras.layers.Embedding(vocab_size,
EMBEDDING_DIM,
weights=[embedding_matrix],
input_length=MAX_SEQUENCE_LENGTH,
trainable=False)
Optimization Algorithm
This notebook uses Adam, optimization algorithm for Gradient Descent.
Callbacks
Callbacks are special functions which are called at the end of an epoch. We can use
any functions to perform specific operation after each epoch. I used two callbacks
here,
model.compile(optimizer=Adam(learning_rate=LR), loss='binary_crossentropy',
metrics=['accuracy'])
ReduceLROnPlateau = ReduceLROnPlateau(factor=0.1,
min_lr = 0.01,
monitor = 'val_loss',
verbose = 1)
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available: 0
Training on CPU...
Epoch 1/10
1250/1250 [==============================] - 2117s 2s/step - loss: 0.5197 -
accuracy: 0.7387 - val_loss: 0.4821 - val_accuracy: 0.7659 - lr: 0.0010
Epoch 2/10
1250/1250 [==============================] - 1698s 1s/step - loss: 0.4880 -
accuracy: 0.7618 - val_loss: 0.4728 - val_accuracy: 0.7730 - lr: 0.0010
Epoch 3/10
1250/1250 [==============================] - 1528s 1s/step - loss: 0.4776 -
accuracy: 0.7688 - val_loss: 0.4680 - val_accuracy: 0.7762 - lr: 0.0010
Epoch 4/10
1250/1250 [==============================] - 1520s 1s/step - loss: 0.4714 -
accuracy: 0.7728 - val_loss: 0.4643 - val_accuracy: 0.7778 - lr: 0.0010
Epoch 5/10
1250/1250 [==============================] - 1772s 1s/step - loss: 0.4670 -
accuracy: 0.7756 - val_loss: 0.4625 - val_accuracy: 0.7783 - lr: 0.0010
Epoch 6/10
1250/1250 [==============================] - 1840s 1s/step - loss: 0.4636 -
accuracy: 0.7775 - val_loss: 0.4600 - val_accuracy: 0.7800 - lr: 0.0010
Epoch 7/10
1250/1250 [==============================] - 2093s 2s/step - loss: 0.4607 -
accuracy: 0.7797 - val_loss: 0.4584 - val_accuracy: 0.7813 - lr: 0.0010
Epoch 8/10
1250/1250 [==============================] - 1991s 2s/step - loss: 0.4584 -
accuracy: 0.7808 - val_loss: 0.4575 - val_accuracy: 0.7816 - lr: 0.0010
Epoch 9/10
1250/1250 [==============================] - 1952s 2s/step - loss: 0.4563 -
accuracy: 0.7821 - val_loss: 0.4571 - val_accuracy: 0.7820 - lr: 0.0010
Epoch 10/10
1250/1250 [==============================] - 1626s 1s/step - loss: 0.4548 -
accuracy: 0.7831 - val_loss: 0.4555 - val_accuracy: 0.7824 - lr: 0.0010
Model Evaluation
Evaluating the performance of the model. We will use some evaluation metrics and
techniques to test the model.
Starting with the Learning Curve of loss and accuracy of the model on each epoch.
An epoch means training the neural network with all the training data for
one cycle. In an epoch, we use all of the data exactly once.
s, (at, al) = plt.subplots(2,1)
at.plot(history.history['accuracy'], c= 'b')
at.plot(history.history['val_accuracy'], c='r')
at.set_title('model accuracy')
at.set_ylabel('accuracy')
at.set_xlabel('epoch')
at.legend(['LSTM_train', 'LSTM_val'], loc='upper left')
al.plot(history.history['loss'], c='m')
al.plot(history.history['val_loss'], c='c')
al.set_title('model loss')
al.set_ylabel('loss')
al.set_xlabel('epoch')
al.legend(['train', 'val'], loc = 'upper left')
<matplotlib.legend.Legend at 0x20d46cbd6a0>
The model will predict a score between 0 and 1. We can classify 2 classes by setting
a threshold value for it. In this case, I have set it to 0.5 as the THRESHOLD value. If
the score is above 0.5, it will be classified as the POSITIVE sentiment.
def decode_sentiment(score):
return "Positive" if score>0.5 else "Negative"
Confusion Matrix
a summary of prediction results on a classification problem
The number of correct and incorrect predictions are summarized with count values
and broken down by each class.
import itertools
from sklearn.metrics import confusion_matrix, classification_report,
accuracy_score
def plot_confusion_matrix(cm, classes,
title='Confusion matrix',
cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
fmt = '.2f'
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, format(cm[i, j], fmt),
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")