"Sentiment Analysis of Imdb Movie Reviews": A Project Report

A PROJECT REPORT
on
“SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS”
Submitted to
KIIT Deemed to be University
In Partial Fulfilment of the Requirement for the Award of
BACHELOR’S DEGREE IN COMPUTER SCIENCE & ENGINEERING
BY
SAMEER SAGAR 1605389
ANIMESH TILAK 1605340
UNDER THE GUIDANCE OF
PROF. SURESH CHANDRA MOHARANA
SCHOOL OF COMPUTER ENGINEERING
KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA - 751024
April 2020
School of Computer Engineering
Bhubaneswar, ODISHA 751024
CERTIFICATE
This is to certify that the project entitled
“SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS”
submitted by

is a record of bonafide work carried out by them, in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Engineering (Computer Science &
Engineering OR Information Technology) at KIIT Deemed to be University, Bhubaneswar.
This work was done during the year 2019-2020, under our guidance.
Date: 29/04/2020
Prof. Suresh Chandra Moharana

Project Guide
Acknowledgement
We owe our deepest gratitude to Prof. Suresh Chandra Moharana, Professor, School of
Computer Engineering, KIIT Deemed to be University, Bhubaneswar, for his helpful advice,
support, motivation and encouragement throughout our work. We are extremely obliged to
have him by our side sharing his words of wisdom during the entire course of this project.
We would like to recognize the importance of our friends who have always supported us
during our tough times and kept us ever motivated to keep on going. And lastly, to our parents
who always support us and make us capable enough to reach the apex.
SAMEER SAGAR
ANIMESH TILAK
ABSTRACT
This project performs sentiment analysis of IMDB movie reviews using Natural Language
Processing techniques and compares the performance of several Machine Learning algorithms
on the reviews. The unstructured review text is first converted into structured data for
ease of analysis. The reviews are then classified into two categories, positive or
negative, on the basis of the words used in them. Machine Learning classifiers are trained
to categorize the reviews as accurately as possible, and their performance is compared on
the same dataset.
Keywords: Sentiment Analysis, Movie, Reviews, Machine Learning, NLP
Contents
1 Introduction
2 Software Requirement Specification
2.1 Tools, Languages and Libraries Used
3 Project Planning And Implementation
3.1 Converting unstructured data into structured data
3.2 Extracting features by TF-Idf transformer
3.3 Classification using machine learning algorithm
3.4 Accuracy Score
4 Implementation
4.1 Data Pre-processing
4.2 Using Count Vectorizer
4.3 Extraction of features using Tf-idf transformer
4.4 Result Analysis
5 Screen-shots of project
5.1 Data Loading
5.2 Data Pre-processing
5.3 Model training & Accuracy score
6 Conclusion And Future Scope
6.1 Conclusion
6.2 Future Scope
7 References
SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS
Chapter 1
Introduction
The inflow of unstructured data is increasing rapidly day by day, and this data needs to
be classified to get meaningful insight out of it. Sentiment Analysis can be used in
various fields, such as product performance analysis in the market, training chatbots to
respond with specific sentiments, and content rating for blogs, posts and videos. It can
also be used in story summarization and in page ranking systems for search engines.
An IMDB movie reviews dataset labeled as negative and positive reviews is used. The
dataset contains one thousand negative reviews and one thousand positive reviews.
These unstructured reviews are converted into structured data in the form of vectors. The
vectors, labeled as negative or positive, are used to train a model that classifies test
reviews into the positive or negative category.
CountVectorizer is used to tokenize the words present in the positive and negative
reviews, build a vocabulary of those words, and encode each review using that
vocabulary. English stop words are removed at this stage. TfidfTransformer is then used
to weight the words, giving higher weights to the words that are more distinctive for a
document. After these two steps are completed, the model is trained and the accuracy
score is checked for each of the Machine Learning algorithms used in this project.
School of Computer Engineering, KIIT, BBSR 1

Chapter 2
Software Requirements Specification
2.1 Tools Used
2.1.1 Anaconda Navigator

2.1.2 Jupyter Notebook
2.2 Languages and Libraries Used
2.2.1 Python 3.6

2.2.2 Numpy
2.2.3 Pandas
2.2.4 Sklearn

Chapter 3
Project Planning
3.1 Converting unstructured data into structured data
IMDB movie reviews are imported from a text file in which the reviews are separated by
line and labeled as negative or positive. A dictionary is then created from all the
words in the negative and the positive reviews.
3.1.1 Using CountVectorizer
An object of CountVectorizer is initialized with English stop words removed and is
fitted on the created dictionary. The CountVectorizer object assigns a unique index to
each word present in the dictionary.
Single-line reviews are then passed to the CountVectorizer object, which converts each
review from unstructured English text into a 1-D vector. Most entries of this vector are
0, because a review uses only a small part of the vocabulary.
If a review contains a word, the position at the index assigned to that word by the
CountVectorizer object holds the frequency of that word in the review.
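The encoding described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: the two reviews are invented, and the variable names are our own.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews invented for illustration; they are not from the project's dataset.
reviews = [
    "the plot was great and the acting was great",
    "the plot was boring",
]

# stop_words="english" drops common English words such as "the", "was", "and".
vectorizer = CountVectorizer(stop_words="english")
count_matrix = vectorizer.fit_transform(reviews)  # sparse matrix, one row per review

# vocabulary_ maps each remaining word to its unique column index.
vocabulary = vectorizer.vocabulary_

# The entry at a word's index holds how often that word occurs in the review.
first_review = count_matrix.toarray()[0]
great_count = first_review[vocabulary["great"]]
print(vocabulary)
print(first_review, great_count)
```

Note that the vector holds counts rather than only 0's and 1's: "great" occurs twice in the first review, so its entry is 2.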
3.2 Extracting Features by Tf-idfTransformer
TF in TF-IDF stands for term frequency and IDF stands for inverse document frequency. It
is a term weighting scheme commonly used in information retrieval.
TfidfTransformer takes the 1-D count vectors and weights each word according to its
importance for the classification.
The formula used to compute the tf-idf for a term t of a document d in a document set
is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1
(if smooth_idf=False), where n is the total number of documents in the document set and
df(t) is the document frequency of t, that is, the number of documents in the document
set that contain the term t. The effect of adding “1” to the idf in the equation above is
that terms with zero idf, i.e., terms that occur in all documents in a training set, will
not be completely ignored.
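As a worked example of the formula above, the following sketch computes idf and tf-idf by hand for the unsmoothed case (smooth_idf=False). The function names and the numbers are illustrative assumptions, and the natural logarithm is assumed.

```python
import math


def idf(n_documents, document_frequency):
    """idf(t) = log(n / df(t)) + 1, the unsmoothed form given in the text."""
    return math.log(n_documents / document_frequency) + 1.0


def tf_idf(term_count, n_documents, document_frequency):
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return term_count * idf(n_documents, document_frequency)


# A term found in all 4 documents has log(4 / 4) = 0, so the "+ 1" keeps it alive.
common_term_weight = tf_idf(term_count=3, n_documents=4, document_frequency=4)

# A term found in only 1 of 4 documents gets a larger idf and hence more weight.
rare_term_weight = tf_idf(term_count=3, n_documents=4, document_frequency=1)
print(common_term_weight, rare_term_weight)
```

A term that appears in every document still keeps a weight equal to its term frequency, while a rarer term with the same count is weighted more heavily.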
3.3 Classification using Machine Learning Algorithm
Classification is done with the help of classifiers like Logistic Regression, Support
Vector Machine, Gaussian Naive Bayes, Multinomial Naive Bayes and K-Nearest
Neighbors.
3.3.1 Logistic Regression

Logistic Regression is well suited to the classification of binary categorical data. The
review vectors are classified as positive or negative by this classifier.
It relies on the Sigmoid (or Logistic) function, which squashes any real value into the
range between 0 and 1.
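A minimal sketch of the sigmoid function is given below, assuming the standard definition sigmoid(z) = 1 / (1 + e^(-z)). The two-branch form is our own choice, made only to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


for value in (-5.0, 0.0, 5.0):
    print(value, sigmoid(value))
```

An output above 0.5 can be read as the positive class and an output below 0.5 as the negative class, which is how the squashed value becomes a binary decision.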
3.3.2 GAUSSIAN NAIVE BAYES
Gaussian Naive Bayes can be used to classify text data because it treats each word as
independent of the others. Words become features, and each contributes according to
its weight to the classification of a review that has been converted into a vector.
3.3.3 MULTINOMIAL NAIVE BAYES
It works on the same principle as Gaussian Naive Bayes: both classifiers assume that the
features are independent. Multinomial Naive Bayes calculates its probabilities from a
likelihood table of word counts. A likelihood table without smoothing has a limitation:
when a word has never been seen with a class in the training data, its probability
becomes zero, which is not the right decision. Multinomial Naive Bayes overcomes this
limitation by estimating the probability of feature i in class k as
θki = (Nki + α) / (Nk + α * n)
where Nki is the number of times feature i appears in samples of class k in the
training set T, Nk is the total count of all features for class k, and n is the number of
features. The smoothing prior α ≥ 0 accounts for features not present in the learning
samples and prevents zero probabilities in further computations. Setting α = 1 is called
Laplace smoothing, while α < 1 is called Lidstone smoothing.
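The effect of the smoothing prior can be sketched as below. The sketch assumes the smoothed estimate (Nki + α) / (Nk + α * n), where Nk is the total feature count for class k and n is the number of features, as in scikit-learn's MultinomialNB. The function name and the counts are invented for illustration.

```python
def smoothed_likelihood(count_in_class, total_count_in_class, n_features, alpha=1.0):
    """Estimate P(feature | class) as (N_ki + alpha) / (N_k + alpha * n)."""
    return (count_in_class + alpha) / (total_count_in_class + alpha * n_features)


# Toy class with a 2-word vocabulary: "good" seen 8 times, "awful" never seen.
word_counts = {"good": 8, "awful": 0}
total = sum(word_counts.values())
n_features = len(word_counts)

# Without smoothing (alpha = 0) the unseen word gets probability zero.
unsmoothed = smoothed_likelihood(word_counts["awful"], total, n_features, alpha=0.0)

# With Laplace smoothing (alpha = 1) it gets a small non-zero probability.
laplace = smoothed_likelihood(word_counts["awful"], total, n_features, alpha=1.0)
print(unsmoothed, laplace)
```

Because a review's likelihood is a product over its words, a single zero factor would wipe out the whole product, which is why the non-zero estimate matters.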
3.3.4 SUPPORT VECTOR MACHINE
SVM, which stands for Support Vector Machine, is a supervised machine learning algorithm
that can be used for both classification and regression problems. It separates the classes
with a hyperplane. Support vectors are the data points nearest to the hyperplane: the
points of a data set that, if removed, would change the position of the dividing
hyperplane. For this reason, they can be considered the critical elements of a data set.
3.3.5 K-NEAREST NEIGHBORS
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning family and has many applications
in pattern recognition, intrusion detection and data mining.
It is widely used in real-life scenarios because it is non-parametric, meaning that it
does not make any prior assumptions about the data distribution.
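To make the idea concrete, here is a minimal sketch of K-Nearest Neighbors, assuming the usual formulation: a query point receives the majority label among its k closest training points under Euclidean distance. The function name, the points and the labels are illustrative and do not come from the project.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    distances = []
    for point, label in zip(train_points, train_labels):
        # Euclidean distance between the training point and the query.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        distances.append((distance, label))
    distances.sort()
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D feature vectors standing in for vectorized reviews.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels = ["negative", "negative", "positive", "positive"]
print(knn_predict(points, labels, query=(0.2, 0.1), k=3))
print(knn_predict(points, labels, query=(5.5, 5.0), k=3))
```

No parameters are fitted in advance: the training points themselves are the model, which is what non-parametric means here.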
3.4 ACCURACY SCORE
Accuracy Score is used to measure the performance of each classifier used.
It is calculated by comparing the predicted labels with the actual labels.
Accuracy = (No. of correct Predictions) / (No. of total Predictions)
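The formula above can be written directly in Python. This is a small sketch with invented labels; the function name is our own.

```python
def accuracy(predicted_labels, actual_labels):
    """Return the fraction of predictions that match the actual labels."""
    if len(predicted_labels) != len(actual_labels):
        raise ValueError("predicted and actual labels must have the same length")
    correct = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return correct / len(actual_labels)


predicted = ["positive", "negative", "positive", "positive"]
actual = ["positive", "negative", "negative", "positive"]
print(accuracy(predicted, actual))
```

Three of the four predictions match, so the accuracy is 3 / 4 = 0.75.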

CHAPTER 4
Implementation
4.1 DATA PRE PROCESSING
In data pre-processing, the first step is to convert the unstructured data into structured
data. For this, the negative and positive review datasets have to be combined into a single
document, so the positive reviews dataset and the negative reviews dataset are both read
and then appended to one document.
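The merging step can be sketched as follows. The helper name, the label encoding (1 for positive, 0 for negative) and the sample lines are assumptions made for illustration. In practice the two lists would come from reading the line-separated review files.

```python
def build_labeled_corpus(positive_lines, negative_lines):
    """Append both review sets into one list of texts with a parallel label list."""
    texts, labels = [], []
    for label, lines in ((1, positive_lines), (0, negative_lines)):
        for line in lines:
            review = line.strip()
            if review:  # skip blank lines between reviews
                texts.append(review)
                labels.append(label)
    return texts, labels


# Invented lines standing in for the contents of the two review files.
positive_lines = ["a touching and well acted film\n", "\n"]
negative_lines = ["dull plot and flat characters\n"]
texts, labels = build_labeled_corpus(positive_lines, negative_lines)
print(texts, labels)
```

Keeping the labels in a list parallel to the texts means every later vector can still be matched to its class.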
4.2 USING CountVectorizer
After the reviews are appended into one document, i.e. the dictionary, the dictionary is
fitted with CountVectorizer, which tokenizes each word present in the dictionary and
gives a unique index to each word.
Single-line reviews are then passed to the CountVectorizer object, which converts each
review into a one-dimensional vector of word counts, most of which are 0.
English stop words are removed from the reviews because they hold no importance for the
further process: our objective is to find the words that distinguish the reviews, so words
like a, all, also, am, the etc., which occur very frequently and carry no sentiment, are
removed.
4.3 EXTRACTION OF FEATURES USING TF-Idf Transformer
Using CountVectorizer we tokenized the dictionary and built a vocabulary of words, and
the reviews were encoded as count vectors. For the extraction of features, the TF-IDF (Term
Frequency - Inverse Document Frequency) Transformer is now used. The dataframe is split into
two parts, X and Y, in which X contains the count vectors of all the reviews and Y contains
their labels. X is fitted into the TfidfTransformer, which gives a weight to each
word present in the data frame. This weight reflects the importance of each word in the
document, which helps us classify whether a review is positive or negative. The result of
TF-IDF is the feature set which we feed into our models to find the accuracy score of each
Machine Learning algorithm used in the project.
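The weighting step can be sketched with scikit-learn as below. The three reviews are invented, and TfidfTransformer is used with its default settings, which is an assumption, since the report does not state the parameters that were used.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy reviews invented for illustration; they are not from the project's dataset.
reviews = [
    "loved the acting and the story",
    "hated the boring story",
    "boring and slow",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)          # raw word counts
tfidf = TfidfTransformer().fit_transform(counts)    # counts re-weighted by tf-idf

vocabulary = vectorizer.vocabulary_
# "loved" occurs in one review and "story" in two, so "loved" is weighted higher.
loved_weight = tfidf[0, vocabulary["loved"]]
story_weight = tfidf[0, vocabulary["story"]]

# With the default norm="l2", each non-empty row is scaled to unit length.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(loved_weight, story_weight, row_norms)
```

The output keeps the same shape as the count matrix: the same columns, with each count replaced by a weight.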

4.4 RESULT ANALYSIS
The data is now split into a training set and a test set. The features which we get from
TF-IDF are the inputs, and the class to which each review belongs, i.e. positive or
negative, is the label to be predicted. The features form a sparse matrix which is fed to
the different machine learning algorithms used, such as K-Nearest Neighbors, Support Vector
Machine and Logistic Regression. After feeding the sparse matrix to each model and
calculating the accuracy score, the results are as follows:
1) Logistic Regression gave an accuracy score of 0.7054704595186, which is approximately
70.55%.
2) SVM (Support Vector Machine) gave an accuracy score of 0.745232885276649, which is
approximately 74.52%.
3) K-Nearest Neighbors gave an accuracy score of 0.5873710534542045, which is
approximately 58.74%.
4) Multinomial Naive Bayes gave an accuracy score of 0.7636761487964989, which is
approximately 76.37%.
5) Gaussian Naive Bayes gave an accuracy score of 0.6308221319162238, which is
approximately 63.08%.
According to our experiments, Multinomial Naive Bayes performed best of all the
classifiers. A likely reason is that all the columns in our feature set are treated as
independent: if our dictionary contains words like love, hate, boring etc., each is a
separate column that is independent of the others, and in this scenario Multinomial Naive
Bayes gives the best results. If new words are added to the dictionary and a Naive Bayes
classifier without smoothing is used, its accuracy score will be lower than that of
Multinomial Naive Bayes, because the probability of the new word, and of the review
associated with it, becomes zero. Multinomial Naive Bayes adds a smoothing factor which
does not let the probability of the word or of the review go down to zero. We believe the
other classifiers scored somewhat lower than Multinomial Naive Bayes for a similar
reason, since they do not add this smoothing factor.
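The comparison described in this chapter can be sketched end to end as follows. This is only an illustration on twelve invented reviews and will not reproduce the scores reported above. The split ratio, the random seed, the linear SVM kernel and n_neighbors=3 are all assumptions, since the report does not state the settings that were used.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy reviews invented for illustration; they are not from the project's dataset.
positive = [
    "loved this wonderful film",
    "brilliant acting and a moving story",
    "a delightful and funny movie",
    "great cast and superb direction",
    "beautiful, touching and memorable",
    "an excellent and enjoyable film",
]
negative = [
    "hated this boring film",
    "terrible acting and a dull story",
    "a tedious and unfunny movie",
    "weak cast and awful direction",
    "ugly, tiresome and forgettable",
    "a poor and disappointing film",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Count words, re-weight with tf-idf, and convert to a dense array
# (GaussianNB does not accept sparse input).
counts = CountVectorizer(stop_words="english").fit_transform(texts)
features = TfidfTransformer().fit_transform(counts).toarray()

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Support Vector Machine": SVC(kernel="linear"),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
    "Multinomial Naive Bayes": MultinomialNB(alpha=1.0),
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, classifier.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With so few samples the printed scores carry no meaning; the sketch only shows that the same features and the same split are given to every classifier, so that the accuracy scores are comparable.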

Chapter 5
Screen shots of Project

5.1 DATA LOADING
5.2 DATA PRE PROCESSING

5.2.1 DATA PRE PROCESSING
5.2.2 MODEL TRAINING

5.2.3 Model Training And Accuracy Score

Chapter 6
Conclusion and Future Scope
6.1 Conclusion
Sentiment analysis works better when the unstructured data is converted into structured
data, because machine learning models handle numerical data better than categorical or
language data. After applying the different classifiers, it is observed that Logistic
Regression, Multinomial Naïve Bayes and Support Vector Machine perform well in
classifying binary data. An accuracy of about 75% is a good result given that the dataset
was small; it is hard for classifiers to learn when only a small data set is given for
training.
Performance increased because, during data pre-processing, TfidfTransformer gave weights
to each word according to its importance. This technique made classification easier for
the classifiers. Therefore, higher performance can be achieved by changing the data
pre-processing, feature selection and feature engineering methods.
6.2 Future Scope
Machine learning has great potential to take over lexicon-based tasks that otherwise
require intensive human labor. To do so, such algorithms will also have to perceive and
examine natural text context-wise and concept-wise. Processing time will also play a
crucial part, given the amount of data which is being generated on the Web day by day.
Nowadays people use social media platforms every day to show their sentiments in the
form of text, videos, images etc., so Sentiment Analysis plays a crucial role in making
business strategy. The growth of Artificial Intelligence and Machine Learning is highly
reliant on how well machines recognize and match the different sentiments shown by
humans in the form of speech, text, videos, body language etc.


Appendix-I
STUDENT'S CONTRIBUTION TO THE PROJECT
NAME OF STUDENT: SAMEER SAGAR
ROLL NO: 1605389
PROJECT TITLE: SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS
ABSTRACT OF THE PROJECT (WITHIN 80 WORDS): Sentiment Analysis of IMDB Movie Reviews by the
use of Natural Language Processing Techniques and analyzing the performance of various
Machine Learning Algorithms on movie reviews.
CONTRIBUTION
1. CONTRIBUTION TO THE PROJECT REPORT: PROJECT PLANNING AND RESEARCH
2. CONTRIBUTION DURING IMPLEMENTATION: DATA PRE PROCESSING, MODEL TRAINING USING
MULTINOMIAL NAIVE BAYES & NAIVE BAYES
3. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: PRESENTING DATA
PRE-PROCESSING AND MODEL WORKING
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-II
STUDENT'S CONTRIBUTION TO THE PROJECT
NAME OF STUDENT: ANIMESH TILAK
ROLL NUMBER: 1605340
PROJECT TITLE: SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS
ABSTRACT OF THE PROJECT (WITHIN 80 WORDS): Sentiment Analysis of IMDB Movie Reviews by the
use of Natural Language Processing Techniques and analyzing the performance of various
Machine Learning Algorithms on movie reviews.
CONTRIBUTION
1. CONTRIBUTION TO THE PROJECT REPORT: RESEARCH ON CLASSIFIERS
2. CONTRIBUTION DURING IMPLEMENTATION: MODEL TRAINING USING KNN, SUPPORT VECTOR
MACHINE, LOGISTIC REGRESSION
3. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: EXPLAINING THE USE CASE
AND FUTURE SCOPE
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE

