
A PROJECT REPORT

on

“SENTIMENT ANALYSIS OF IMDB MOVIE


REVIEWS”

Submitted to
KIIT Deemed to be University

In Partial Fulfilment of the Requirement for the Award of

BACHELOR’S DEGREE IN COMPUTER


SCIENCE & ENGINEERING

BY

SAMEER SAGAR 1605389


ANIMESH TILAK 1605340

UNDER THE GUIDANCE OF


PROF. SURESH CHANDRA MOHARANA

SCHOOL OF COMPUTER ENGINEERING


KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA - 751024
April 2020

KIIT Deemed to be University
School of Computer Engineering
Bhubaneswar, ODISHA 751024

CERTIFICATE
This is to certify that the project entitled
“SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS”
submitted by

SAMEER SAGAR 1605389


ANIMESH TILAK 1605340

is a record of bona fide work carried out by them, in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Engineering (Computer
Science & Engineering OR Information Technology) at KIIT Deemed to be University,
Bhubaneswar. This work was done during the year 2019-2020, under our guidance.

Date: 29/04/2020

Prof. Suresh Chandra Moharana


Project Guide

Acknowledgement

We owe our deepest gratitude to Prof. Suresh Chandra Moharana, Professor, School of
Computer Engineering, KIIT Deemed to be University, Bhubaneswar, for his helpful advice,
support, motivation and encouragement throughout our work. We are extremely grateful to
have had him by our side, sharing his words of wisdom during the entire course of this project.
We would also like to thank our friends, who always supported us during tough times and
kept us motivated to keep going. Lastly, we thank our parents, who always support us and
make us capable enough to reach the apex.

SAMEER SAGAR
ANIMESH TILAK

ABSTRACT

This project performs Sentiment Analysis of IMDB Movie Reviews using Natural Language
Processing techniques and analyzes the performance of various Machine Learning algorithms
on movie reviews. The unstructured data is first converted into structured data for ease of
analysis. Movie reviews are then classified into two categories, positive or negative, on the
basis of the words used in the reviews. Machine Learning classifiers are used to categorize
these reviews as accurately as possible, and the performance of the classifiers is compared
on the same dataset.

Keywords: Sentiment Analysis, Movie, Reviews, Machine Learning, NLP

Contents

1 Introduction

2 Software Requirement Specification
2.1 Tools, Languages and Libraries Used

3 Project Planning and Implementation
3.1 Converting unstructured data into structured data
3.2 Extracting features by TF-IDF transformer
3.3 Classification using machine learning algorithm
3.4 Accuracy Score

4 Implementation
4.1 Data Pre-processing
4.2 Using Count Vectorizer
4.3 Extraction of features using TF-IDF transformer
4.4 Result Analysis

5 Screenshots of Project
5.1 Data Loading
5.2 Data Pre-processing
5.3 Model Training & Accuracy Score

6 Conclusion and Future Scope
6.1 Conclusion
6.2 Future Scope

7 References

SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS

Chapter 1

Introduction

The inflow of unstructured data is increasing rapidly day by day, and it needs to be
classified to extract meaningful insight. Sentiment Analysis can be used in various fields,
such as analyzing product performance in the market, training chatbots to respond with
specific sentiments, rating content such as blogs, posts and videos, and summarizing
stories. Sentiment Analysis is also used in page ranking systems for various search engines.

An IMDB movie reviews dataset labeled with negative and positive reviews is used. The
dataset contains one thousand negative reviews and one thousand positive reviews.

These unstructured reviews are converted into structured data in the form of vectors. The
vectors, labeled as negative or positive, are used to train a model that classifies test
reviews into the positive or negative category.

CountVectorizer is used to tokenize each review, remove English stop words, build a
vocabulary of the words present in the positive and negative reviews, and encode each
review using that vocabulary. TfidfTransformer is then used to weight the words in each
document, giving higher weights to the words that distinguish it from other documents.
After these two steps are complete, the model is trained and the accuracy score is checked
for each of the Machine Learning algorithms used in this project.

School of Computer Engineering, KIIT, BBSR 1



Chapter 2

Software Requirements Specification

2.1 Tools Used

2.1.1 Anaconda Navigator
2.1.2 Jupyter Notebook

2.2 Languages and Libraries Used

2.2.1 Python 3.6
2.2.2 NumPy
2.2.3 pandas
2.2.4 scikit-learn (sklearn)


Chapter 3

Project Planning

3.1 Converting unstructured data into structured data

IMDB movie reviews are imported from text files in which each line holds one review,
labeled as negative or positive. A dictionary (vocabulary) is then built from all the words
in both the negative and the positive reviews.

3.1.1 Using CountVectorizer

A CountVectorizer object is initialized with English stop-word removal and fitted on the
created dictionary. The fitted object assigns a unique index to each word that remains in
the dictionary.
Each single-line review is then passed to the CountVectorizer object, which converts it
from unstructured English text into a 1-D vector with one position per dictionary word.
If a review contains a word, the position at the index assigned to that word holds the
frequency of the word in the review. Every other position is 0, so the vector consists
mostly of 0's and 1's.
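
The conversion described above can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's CountVectorizer itself: the stop-word list, the tokenizing rule and the two sample reviews are assumptions made for the example, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; scikit-learn's built-in English list is much longer.
STOP_WORDS = {"a", "all", "also", "am", "the", "is", "and", "this", "was"}

def tokenize(text):
    """Lowercase the text, keep runs of 2+ letters/digits, drop stop words."""
    words = re.findall(r"[a-z0-9]{2,}", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_vocabulary(reviews):
    """Map each distinct word to a unique index, in alphabetical order."""
    words = sorted({w for review in reviews for w in tokenize(review)})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(review, vocabulary):
    """Return a 1-D list holding the frequency of every vocabulary word."""
    counts = Counter(tokenize(review))
    # The dict preserves insertion order, so positions follow the word indices.
    return [counts.get(word, 0) for word in vocabulary]

# Made-up sample reviews, one per line as in the dataset files.
reviews = ["The movie was great, great acting", "This movie was boring"]
vocabulary = build_vocabulary(reviews)
vectors = [to_count_vector(r, vocabulary) for r in reviews]

print(vocabulary)  # {'acting': 0, 'boring': 1, 'great': 2, 'movie': 3}
print(vectors)     # [[1, 0, 2, 1], [0, 1, 0, 1]]
```

Note that the word "great" occurs twice in the first review, so its position holds 2 rather than 1: the vector stores frequencies, not just presence.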

3.2 Extracting Features by TfidfTransformer

TF in TF-IDF stands for term frequency, while IDF stands for inverse document frequency.
It is a term-weighting scheme commonly used in information retrieval.
TfidfTransformer takes the 1-D count vectors and re-weights each word according to its
importance for the classification: words that occur in many reviews receive lower weights
than words that occur in only a few. The labels are not used in this step.

The formula used to compute the tf-idf for a term t of a document d in a document set is

tf-idf(t, d) = tf(t, d) * idf(t)

and the idf is computed as

idf(t) = log [ n / df(t) ] + 1   (if smooth_idf=False)

where n is the total number of documents in the document set and df(t) is the
document frequency of t; the document frequency is the number of documents in the
document set that contain the term t. The effect of adding “1” to the idf in the equation
above is that terms with zero idf, i.e., terms that occur in all documents in a
training set, will not be completely ignored.
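
As a worked example of the formula above, the following sketch applies the unsmoothed idf to a small matrix of counts. The counts are made up, the natural logarithm is assumed, and the helper names are hypothetical. scikit-learn's TfidfTransformer additionally L2-normalizes each row by default, which this sketch leaves out.

```python
import math

def idf(term_index, count_vectors):
    """idf(t) = ln(n / df(t)) + 1, the unsmoothed form quoted above.

    Assumes the term occurs in at least one document, so df(t) > 0.
    """
    n = len(count_vectors)
    df = sum(1 for vector in count_vectors if vector[term_index] > 0)
    return math.log(n / df) + 1

def tfidf(count_vectors):
    """Multiply every raw count tf(t, d) by idf(t)."""
    n_terms = len(count_vectors[0])
    idfs = [idf(t, count_vectors) for t in range(n_terms)]
    return [[tf * w for tf, w in zip(vector, idfs)] for vector in count_vectors]

# Hypothetical counts: rows are documents, columns are terms.
counts = [[1, 0, 2, 1],
          [0, 1, 0, 1]]
weights = tfidf(counts)

for row in weights:
    print([round(value, 3) for value in row])
```

The last term occurs in both documents, so log(n / df) is 0 and its idf is exactly 1: the term keeps its raw count instead of being ignored, which is the effect of the added "1".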

3.3 Classification using Machine Learning Algorithm

Classification is done with the help of classifiers like Logistic Regression, Support
Vector Machine, Gaussian Naive Bayes, Multinomial Naive Bayes and K-Nearest
Neighbors.

3.3.1 Logistic Regression


Logistic Regression is well suited to the classification of binary categorical data. This
classifier labels each review vector as a positive or a negative review.

It relies on the sigmoid (logistic) function, sigmoid(z) = 1 / (1 + e^(-z)), which squashes
any real value z into a value between 0 and 1. The output can be read as the probability
that the review is positive.
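
A minimal sketch of the sigmoid function and of how a logistic-regression style classifier could use it. The weights, bias and 0.5 threshold below are hypothetical values chosen for illustration, not parameters learned in this project.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for large negative z.
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(weights, bias, features, threshold=0.5):
    """Score a feature vector linearly, squash it, and apply a threshold."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "positive" if sigmoid(score) >= threshold else "negative"

# Hypothetical weights for two word features, e.g. "great" and "boring".
weights, bias = [1.5, -2.0], 0.0

print(sigmoid(0))                             # 0.5
print(predict_label(weights, bias, [2, 0]))   # positive
print(predict_label(weights, bias, [0, 1]))   # negative
```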

3.3.2 GAUSSIAN NAIVE BAYES

Gaussian Naive Bayes treats each word as conditionally independent of the
others given the class, and models the values of each feature with a normal
distribution. Words become features, and each contributes according to its
weight to classifying the review, which has been converted into a vector.

3.3.3 MULTINOMIAL NAIVE BAYES

It works on the same principle as Gaussian Naive Bayes: both classifiers
apply Bayes' rule using per-class likelihoods of the features. A Naive
Bayes model that estimates these likelihoods from raw counts has a
limitation, however: when a word appears that was never seen with a class
in the training data, its probability becomes zero, which is not the right
decision. Multinomial Naive Bayes overcomes this limitation by smoothing
the estimated likelihood of feature i in class k:

P(i | k) = (Nki + α) / (Nk + α n)

where Nki is the number of times feature i appears in the samples of class k in the
training set T, Nk is the total count of all features in class k, and n is the number of
features. The smoothing prior α ≥ 0 accounts for features not present in the learning
samples and prevents zero probabilities in further computations. Setting α = 1 is called
Laplace smoothing, while α < 1 is called Lidstone smoothing.
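
The effect of the smoothing prior can be shown with a short sketch. It assumes the standard additive-smoothing estimate (Nki + α) / (Nk + α n), with Nk taken as the total count of all features in class k and n as the number of features. The count vectors are made up and the function name is hypothetical.

```python
def smoothed_likelihoods(class_count_vectors, alpha=1.0):
    """Estimate P(feature i | class k) as (N_ki + alpha) / (N_k + alpha * n)."""
    n = len(class_count_vectors[0])                          # number of features
    n_ki = [sum(column) for column in zip(*class_count_vectors)]  # count per feature
    n_k = sum(n_ki)                                          # total count in the class
    return [(count + alpha) / (n_k + alpha * n) for count in n_ki]

# Hypothetical count vectors of the positive-class training reviews.
# The second feature never occurs in this class.
positive = [[1, 0, 2, 1],
            [0, 0, 1, 1]]

laplace = smoothed_likelihoods(positive, alpha=1.0)
unsmoothed = smoothed_likelihoods(positive, alpha=0.0)

print(laplace)        # [0.2, 0.1, 0.4, 0.3]
print(unsmoothed[1])  # 0.0
```

With α = 1 the unseen feature still receives a small non-zero likelihood, whereas with α = 0 its likelihood is zero and would force the probability of any review containing it to zero for this class.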

3.3.4 SUPPORT VECTOR MACHINE

SVM, which stands for Support Vector Machine, is a supervised machine learning algorithm that
can be used for both classification and regression problems. It separates the classes with a
hyperplane. Support vectors are the data points nearest to the hyperplane: the points of a data
set that, if removed, would change the position of the hyperplane that divides it. For this
reason, they can be considered the critical elements of a data set.
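
The notion of "points nearest to the hyperplane" can be illustrated with a small sketch. It does not train an SVM: the hyperplane is assumed to be already known, and the 2-D points, the hyperplane coefficients and the function names are all hypothetical.

```python
import math

def signed_distance(point, w, b):
    """Signed distance from a point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm

def nearest_points(points, w, b, tolerance=1e-9):
    """Return the points whose distance to the hyperplane is the smallest."""
    distances = [abs(signed_distance(p, w, b)) for p in points]
    margin = min(distances)
    return [p for p, d in zip(points, distances) if d - margin <= tolerance]

# Hypothetical data: two points per class, separated by the line x1 + x2 - 3 = 0.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
w, b = (1.0, 1.0), -3.0

print(nearest_points(points, w, b))  # [(1.0, 1.0), (2.0, 2.0)]
```

In this toy data set the points (1, 1) and (2, 2) play the role of support vectors: moving or removing either of them would shift the dividing line, while the two outer points could be removed without any effect.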

3.3.5 K-NEAREST NEIGHBORS

K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning family and has many applications in pattern
recognition, intrusion detection and data mining. A sample is assigned the class that is most
common among its k nearest training samples.
It is widely applicable in real-life scenarios because it is non-parametric, meaning that
it makes no prior assumptions about the data distribution.
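
A minimal sketch of the K-Nearest Neighbors idea, assuming Euclidean distance and a simple majority vote. The training vectors, labels and the choice k = 3 are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training vectors."""
    neighbours = sorted(
        (euclidean(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature review vectors with their labels.
train_vectors = [[2, 0], [3, 0], [0, 2], [0, 3]]
train_labels = ["positive", "positive", "negative", "negative"]

print(knn_predict(train_vectors, train_labels, [1, 0]))  # positive
print(knn_predict(train_vectors, train_labels, [0, 1]))  # negative
```

No parameters are fitted here: the training data itself is the model, which is what makes the method non-parametric.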

3.4 ACCURACY SCORE

Accuracy Score is used to measure the performance of each classifier used.
It is calculated by comparing the predicted labels with the actual labels.

Accuracy = (No. of correct Predictions) / (No. of total Predictions)
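
The formula above translates directly into code. This is a plain-Python sketch with made-up labels and a hypothetical function name.

```python
def accuracy(actual_labels, predicted_labels):
    """Fraction of predictions that match the actual labels."""
    if len(actual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for a, p in zip(actual_labels, predicted_labels) if a == p)
    return correct / len(actual_labels)

# Hypothetical labels: three of the four predictions are correct.
actual = ["pos", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg"]

print(accuracy(actual, predicted))  # 0.75
```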


CHAPTER 4

Implementation

4.1 DATA PRE PROCESSING

The first step in data pre-processing is to convert the unstructured data into structured data.
The negative and positive reviews must be combined into a single document, so the positive
reviews dataset and the negative reviews dataset are both read and then appended to one
document.
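
The reading and appending step can be sketched as follows. In-memory text streams stand in for the two review files, whose names are not given in this report; the 1/0 label encoding and the function name are assumptions made for the example.

```python
import io

def load_reviews(lines, label):
    """Return (review, label) pairs, one for every non-empty line."""
    return [(line.strip(), label) for line in lines if line.strip()]

# Stand-ins for the positive and negative review files; real code would
# pass open file objects instead.
positive_file = io.StringIO("Loved it\nGreat acting\n")
negative_file = io.StringIO("Boring plot\n\nHated it\n")

# Append both labeled sets into one combined dataset.
dataset = load_reviews(positive_file, 1) + load_reviews(negative_file, 0)
reviews = [text for text, _ in dataset]
labels = [label for _, label in dataset]

print(reviews)  # ['Loved it', 'Great acting', 'Boring plot', 'Hated it']
print(labels)   # [1, 1, 0, 0]
```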

4.2 USING CountVectorizer

After the reviews are appended into one document, i.e. the dictionary, the dictionary is fitted
to CountVectorizer, which tokenizes each word present in it, i.e. gives a unique index to each
word in the dictionary.
Each single-line review is then passed to the CountVectorizer object, which converts the
review into a one-dimensional vector of word counts (mostly 0's and 1's).
English stop words are removed from the reviews because they carry no importance for the
later steps. The objective is to find the words that distinguish one review from another, so
words such as a, all, also, am and the, which occur very frequently and hold no importance,
are removed.

4.3 EXTRACTION OF FEATURES USING TF-Idf Transformer

Using CountVectorizer, we tokenized the dictionary and built a vocabulary of words, with each
review encoded as a vector of counts (mostly 0's and 1's). The TF-IDF (Term Frequency - Inverse
Document Frequency) Transformer is now used to extract features. The dataframe is split into
two parts, X and Y: X contains the count vectors of all the reviews and Y contains their
labels. X is fitted to the TF-IDF Transformer, which assigns a weight to every word present in
the data frame. This weight reflects the importance of each word in the document, which helps
us classify whether a review is positive or negative. The output of TF-IDF is the feature set
that we feed into our models to find the accuracy score of each Machine Learning algorithm
used in the project.


4.4 RESULT ANALYSIS

The data is now organized as features and labels: the output of TF-IDF forms the features, and
the class to which each review belongs, i.e. positive or negative, forms the labels. Both are
split into a training set and a test set. The feature matrix is a sparse matrix, which is fed
to the different machine learning algorithms used, such as K-Nearest Neighbors, Support Vector
Machine and Logistic Regression. After training each model and calculating its accuracy score
on the test set, the results are as follows:

1) Logistic Regression gave an accuracy score of 0.7054704595186, which is approximately
70.55%.
2) SVM (Support Vector Machine) gave an accuracy score of 0.745232885276649, which is
approximately 74.52%.
3) K-Nearest Neighbors gave an accuracy score of 0.5873710534542045, which is
approximately 58.74%.
4) Multinomial Naive Bayes gave an accuracy score of 0.7636761487964989, which is
approximately 76.37%.
5) Gaussian Naive Bayes gave an accuracy score of 0.6308221319162238, which is
approximately 63.08%.

According to our results, Multinomial Naive Bayes performed best of all the classifiers. A
likely reason is that the columns of our feature matrix are treated as independent: if our
dictionary contains words such as love, hate and boring, each is a separate column, and
Multinomial Naive Bayes is well suited to this kind of word-count feature. Smoothing also
matters. If a new word that was never seen with a class during training appears and a Naive
Bayes classifier without smoothing is used, the probability of that word, and therefore of the
review associated with it, becomes zero, so its accuracy score will be lower than that of
Multinomial Naive Bayes. Multinomial Naive Bayes adds a smoothing factor that keeps the
probability of both the word and the review above zero. The other classifiers do not use this
smoothing factor, which may partly explain why they gave somewhat lower accuracy scores than
Multinomial Naive Bayes.
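
The split into training and test sets described at the start of this section can be sketched in plain Python. The 25% test fraction, the fixed seed and the function name are assumptions made for the example; the report does not state the split that was actually used.

```python
import random

def split_train_test(features, labels, test_fraction=0.25, seed=0):
    """Shuffle the samples and hold out a fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    # Features and labels are selected with the same indices, so each
    # feature vector stays paired with its own label.
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical data: eight one-feature samples whose label equals the feature.
features = [[i] for i in range(8)]
labels = list(range(8))
X_train, X_test, y_train, y_test = split_train_test(features, labels)

print(len(X_train), len(X_test))  # 6 2
```

Each classifier would then be trained on X_train and y_train, and its accuracy score computed by comparing its predictions for X_test with y_test.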


Chapter 5

Screen shots of Project


5.1 DATA LOADING

[Screenshot: data loading]

5.2 DATA PRE PROCESSING

[Screenshot: data pre-processing]

5.2.1 DATA PRE PROCESSING

[Screenshot: data pre-processing]

5.2.2 MODEL TRAINING

[Screenshot: model training]

5.2.3 Model Training And Accuracy Score

[Screenshot: model training and accuracy score]



Chapter 6

Conclusion and Future Scope

6.1 Conclusion

Sentiment analysis works better when the unstructured data is converted into
structured data, because machine learning models work with numerical data rather
than raw categorical or language data. After applying different classifiers, it is
observed that Logistic Regression, Multinomial Naïve Bayes and Support Vector Machine
perform well in classifying binary data, with accuracy scores between roughly 70% and
76%. An accuracy of about 75% is a good result given that the dataset was small: it is
hard for classifiers to learn when only a small data set is given for training.

During data pre-processing, TfidfTransformer gave a weight to each word according to
its importance, which made classification easier for the classifiers. Higher performance
may therefore be achievable by changing the data pre-processing, feature selection and
feature engineering methods.

6.2 Future Scope

Machine learning has great potential to take over lexicon-based tasks that currently
require intensive human labor. To do so, such algorithms will have to perceive and
examine natural text both context-wise and concept-wise. Processing time will also play
a crucial part, given the amount of data being generated on the Web every day.

Nowadays people use social media platforms every day to express their sentiments in
the form of text, videos, images etc., so Sentiment Analysis plays a crucial role in
shaping business strategy. The growth of Artificial Intelligence and Machine Learning
relies heavily on how well machines can recognize and respond to the different sentiments
shown by humans in the form of speech, text, videos, body language etc.


References

[1] Sentiment Analysis - A how-to guide with movie reviews, www.towardsdatascience.com
[2] Movie review sentiment analysis with Naive Bayes, levelup.gitconnected.com
[3] Predict sentiment from movie reviews, machinelearningmastery.com
[4] Sentiment Analysis using Natural Language Processing,
https://medium.com/@GeneAshis/nlp-sentiment-analysis-on-imdb-movie-dataset-fb0c4d346d23
[5] Movie Review Analysis, www.semanticscholar.org
[6] Sentiment Analysis in movie reviews application, www.igi-global.com


Appendix-I

STUDENT'S CONTRIBUTION TO THE PROJECT

NAME OF STUDENT: SAMEER SAGAR
ROLL NO: 1605389
PROJECT TITLE: SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS
ABSTRACT OF THE PROJECT (WITHIN 80 WORDS): Sentiment Analysis of IMDB Movie Reviews by the
use of Natural Language Processing Techniques and analyzing the performance of various
Machine Learning Algorithms on movie reviews.

CONTRIBUTION
1. CONTRIBUTION TO THE PROJECT REPORT: PROJECT PLANNING AND RESEARCH
2. CONTRIBUTION DURING IMPLEMENTATION: DATA PRE PROCESSING, MODEL TRAINING USING
MULTINOMIAL NAIVE BAYES & NAIVE BAYES
3. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: PRESENTING DATA
PRE-PROCESSING AND MODEL WORKING

SIGNATURE OF STUDENT

SIGNATURE OF GUIDE

Appendix-II

STUDENT'S CONTRIBUTION TO THE PROJECT

NAME OF STUDENT: ANIMESH TILAK
ROLL NUMBER: 1605340
PROJECT TITLE: SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS
ABSTRACT OF THE PROJECT (WITHIN 80 WORDS): Sentiment Analysis of IMDB Movie Reviews by the
use of Natural Language Processing Techniques and analyzing the performance of various
Machine Learning Algorithms on movie reviews.

CONTRIBUTION
1. CONTRIBUTION TO THE PROJECT REPORT: RESEARCH ON CLASSIFIERS
2. CONTRIBUTION DURING IMPLEMENTATION: MODEL TRAINING USING KNN, SUPPORT VECTOR
MACHINE, LOGISTIC REGRESSION
3. CONTRIBUTION FOR THE PROJECT DEMONSTRATION / PRESENTATION: EXPLAINING THE USE CASE
AND FUTURE SCOPE

SIGNATURE OF STUDENT

SIGNATURE OF GUIDE
