"Sentiment Analysis of Imdb Movie Reviews": A Project Report
"Sentiment Analysis of Imdb Movie Reviews": A Project Report
on
Submitted to
KIIT Deemed to be University
BY
I
A PROJECT REPORT
on
“SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS”
Submitted to
KIIT Deemed to be University
In Partial Fulfilment of the Requirement for the Award of
BY
KIIT Deemed to be University
School of Computer Engineering
Bhubaneswar, ODISHA 751024
CERTIFICATE
This is to certify that the project entitled
“SENTIMENT ANALYSIS OF IMDB MOVIE REVIEWS”
submitted by
is a record of bonafide work carried out by them, in partial fulfilment of the
requirement for the award of the Degree of Bachelor of Engineering (Computer
Science & Engineering OR Information Technology) at KIIT Deemed to be University,
Bhubaneswar. This work was done during the year 2019-2020, under our guidance.
Date: 29/04/2020
Acknowledgement
We owe our deepest gratitude to Prof. Suresh Chandra Moharana, Professor, School of
Computer Engineering, KIIT Deemed to be University, Bhubaneswar, for his helpful advice,
support, motivation and encouragement throughout our work. We are extremely obliged to
have him by our side sharing his words of wisdom during the entire course of this project.
We would like to recognize the importance of our friends who have always supported us
during our tough times and kept us ever motivated to keep on going. And lastly, to our parents
who always support us and make us capable enough to reach the apex.
SAMEER SAGAR
ANIMESH TILAK
ABSTRACT
This project performs sentiment analysis of IMDB movie reviews using Natural Language
Processing techniques and analyzes the performance of various Machine Learning algorithms
on those reviews. The unstructured review text is first converted into structured data for
ease of analysis. Each review is then classified into one of two binary categories, positive
or negative, on the basis of the words used in it. Machine Learning classifiers are used to
categorize the reviews as accurately as possible, and the performance of the classifiers is
compared on the same dataset.
Contents
1 Introduction
4 Implementation
4.1 Data Pre-processing
4.2 Using CountVectorizer
4.3 Extraction of Features Using the TF-IDF Transformer
4.4 Result Analysis
5 Screenshots of Project
5.1 Data Loading
5.2 Data Pre-processing
5.3 Model Training & Accuracy Score
6 Conclusion and Future Scope
6.1 Conclusion
6.2 Future Scope
7 References
Chapter 1
Introduction
The IMDB movie reviews dataset, labeled with negative and positive reviews, is used.
The dataset contains one thousand negative and one thousand positive reviews.
These unstructured reviews are converted into structured data in the form of vectors.
The vectors, labeled as negative or positive, are used to train a model that classifies
test reviews into the positive or negative category.
CountVectorizer is used to tokenize each word present in the positive and negative
reviews, build a vocabulary of those words, and encode each review using the
vocabulary that was created. The TF-IDF transformer is used to measure the uniqueness
of each word in a document, i.e. it gives higher weight to the important words in the
document and removes the English-language stop words. After these two processes are
complete, the model is trained and the accuracy score is checked for each of the
Machine Learning algorithms used in this project.
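The two-step encoding described above can be sketched with scikit-learn, which the report uses. This is a minimal illustration on two made-up reviews, not the project's actual data loading code:

```python
# Sketch of the encoding pipeline: tokenize with CountVectorizer, then
# re-weight the counts with TfidfTransformer (toy reviews for illustration).
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

reviews = ["a great and moving film", "a dull and boring film"]

# Tokenize each word, build the vocabulary, and encode reviews as term counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(reviews)

# Re-weight the counts so that words shared by every review count for less.
tfidf = TfidfTransformer().fit_transform(counts)
```

The resulting `tfidf` matrix (one row per review, one column per vocabulary word) is what the classifiers are trained on.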
Chapter 3
Project Planning
IMDB movie reviews are imported from a text file in which the reviews are line-separated
and labeled as negative or positive. A dictionary is then created from all the words in
both the negative and the positive reviews.
The formula used to compute the tf-idf for a term t of a document d in a document set
is tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log[ n / df(t) ] + 1
(when smooth_idf=False). Here n is the total number of documents in the document set and
df(t) is the document frequency of t, i.e. the number of documents in the set that contain
the term t. The effect of adding “1” to the idf in the equation above is that terms with
zero idf, i.e. terms that occur in all documents in the training set, are not completely
ignored.
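The formula above can be checked with a small worked example. The numbers here are illustrative, chosen only to make the arithmetic easy to follow:

```python
# Worked example of tf-idf(t, d) = tf(t, d) * idf(t),
# with idf(t) = log(n / df(t)) + 1 (the smooth_idf=False convention).
import math

n = 4        # total documents in the document set (illustrative)
df_t = 2     # documents containing term t
tf_td = 3    # occurrences of t in document d

idf_t = math.log(n / df_t) + 1      # log(4/2) + 1 ~ 1.693
tfidf = tf_td * idf_t               # 3 * 1.693 ~ 5.079

# A term that occurs in every document gets idf = log(1) + 1 = 1.0:
# down-weighted, but not completely ignored, which is the point of the "+1".
idf_common = math.log(n / n) + 1
```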
Classification is done with the help of classifiers like Logistic Regression, Support
Vector Machine, Gaussian Naive Bayes, Multinomial Naive Bayes and K-Nearest
Neighbors.
Naive Bayes classifiers are well suited to classifying text data because they treat
each word as independent of the others. Words become features and contribute,
according to their weights, to classifying a review that has been converted into a
vector.
Multinomial Naive Bayes works on the same principle as Gaussian Naive Bayes: both
classifiers use a likelihood table to calculate probabilities. However, Gaussian
Naive Bayes has a limitation: when an unseen word arrives that is not in the created
dictionary, it assigns that word a probability of zero, which is not the right
decision. Multinomial Naive Bayes overcomes this limitation with a smoothing factor.
SVM, which stands for Support Vector Machine, is a supervised machine learning algorithm
that can be used for both classification and regression problems. Support vectors are the
data points nearest to the hyperplane: the points of a data set that, if removed, would
change the position of the hyperplane that divides it. Because of this, they can be
considered the critical elements of a data set.
K-Nearest Neighbors is one of the most basic yet crucial classification algorithms in
Machine Learning. It belongs to the supervised learning family and finds a large number
of applications in pattern recognition, intrusion detection, and data mining.
It is commonly and widely used in real-life scenarios because it is non-parametric,
meaning it makes no prior assumptions about the data distribution.
Chapter 4
Implementation
In data pre-processing, the first step is to convert the unstructured data into structured
data. In this process the negative and positive review datasets are combined into a single
document: both the positive and the negative review files are read, and then both are
appended to one document.
After the reviews are appended into one document, i.e. the dictionary, the created
dictionary is fit into CountVectorizer, which tokenizes each word present in it, i.e. gives
a unique index to each word in the dictionary.
The single-line reviews are fed to the CountVectorizer object, which in turn converts each
review into a one-dimensional vector of 0’s and 1’s.
The reviews are also freed of English stop words, because these hold no importance for the
further processing: our objective is to find the uniqueness of each word, so words such as
“a”, “all”, “also”, “am”, and “the”, which occur very frequently and carry no importance,
are removed.
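The tokenizing, stop-word removal, and indexing steps above happen inside CountVectorizer; the following plain-Python sketch shows what that amounts to. The tiny stop-word set mirrors the examples in the text and is not the full English list scikit-learn uses:

```python
# Minimal sketch of tokenizing, stop-word removal, and vocabulary indexing
# (CountVectorizer does all of this internally; toy stop-word list only).
STOP_WORDS = {"a", "all", "also", "am", "the", "and", "of"}

def tokenize(review):
    """Lower-case the review, split on whitespace, and drop stop words."""
    return [w for w in review.lower().split() if w not in STOP_WORDS]

def build_vocabulary(reviews):
    """Give each remaining word a unique index, first-seen order."""
    vocab = {}
    for review in reviews:
        for word in tokenize(review):
            vocab.setdefault(word, len(vocab))
    return vocab

vocab = build_vocabulary(["The plot was great", "All of it was boring"])
# vocab -> {'plot': 0, 'was': 1, 'great': 2, 'it': 3, 'boring': 4}
```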
The dataframe is then split into training and test sets: the features obtained from
TF-IDF form the inputs, and the class each review belongs to, positive or negative,
forms the labels; both are divided into training and test portions. The result of the
feature extraction is a sparse matrix, which is fed to the different machine learning
algorithms used, such as K-Nearest Neighbors, Support Vector Machine, and Logistic
Regression. After feeding the sparse matrix to each model, the accuracy score is
calculated and the results are compared.
According to our results, Multinomial Naive Bayes performed best among all the
classifiers. The reason is that all the columns in our feature matrix are independent:
if our dictionary contains words like “love”, “hate”, and “boring”, each is an
independent column, and in this scenario of independent features Multinomial Naive
Bayes gives the best results. If we add new words to the dictionary and use a plain
Naive Bayes classifier, the accuracy score will be lower than that of Multinomial
Naive Bayes, because the probability of the new word, and of the review associated
with it, becomes zero. Multinomial Naive Bayes instead adds a smoothing factor, which
prevents the probability of the word, and of the review associated with it, from
dropping to zero. The other classifiers similarly performed below Multinomial Naive
Bayes: they do not add the smoothing factor that Multinomial Naive Bayes uses, and
hence give somewhat lower accuracy scores.
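The comparison described above can be sketched end-to-end with scikit-learn. The four repeated toy reviews stand in for the report's dataset of 1,000 reviews per class, and only a subset of the report's classifiers is shown:

```python
# Sketch: TF-IDF features, train/test split, and accuracy comparison across
# classifiers (toy data; the report's real dataset is far larger).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

reviews = ["loved the film great acting", "wonderful plot and cast",
           "boring waste of time", "terrible script hated it"] * 10
labels = [1, 1, 0, 0] * 10          # 1 = positive, 0 = negative

# TfidfVectorizer combines the CountVectorizer and TfidfTransformer steps.
X = TfidfVectorizer(stop_words="english").fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

scores = {}
for name, clf in [("MultinomialNB", MultinomialNB()),
                  ("LogisticRegression", LogisticRegression()),
                  ("LinearSVC", LinearSVC())]:
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
```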
Chapter 5
Screenshots of Project
5.1 Data Loading
5.2 Data Pre-processing
5.3 Model Training & Accuracy Score
Chapter 6
Conclusion and Future Scope
6.1 Conclusion
Sentiment analysis works better when we convert unstructured data into structured
data, because machine learning models understand numerical data better than
categorical or language data. After applying different classifiers, it is observed
that Logistic Regression, Multinomial Naive Bayes, and Support Vector Machine perform
very well in classifying binary data. An accuracy of 75% is a good result given that
the dataset was small; it is hard for classifiers to classify well when only a small
dataset is available for training.
6.2 Future Scope
The potential of machine learning is immense: it can take over some of the
lexicon-based tasks that require intensive human labor. Such algorithms will also
have to perceive and examine natural text context-wise and concept-wise. Time will
also play a crucial part, given the amount of data being generated on the Web
day-to-day.
Nowadays people use social media platforms every day to express their sentiments in
the form of text, videos, images, etc., so sentiment analysis plays a crucial role in
shaping business strategy. The growth of Artificial Intelligence and Machine Learning
relies heavily on how well machines recognize and match the different sentiments
expressed by humans in the form of speech, text, videos, body language, etc.
References
Appendix-I
CONTRIBUTION
1. CONTRIBUTION TO PROJECT PLANNING AND RESEARCH
2. CONTRIBUTION TO THE PROJECT REPORT
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE
Appendix-II
CONTRIBUTION
4. CONTRIBUTION TO THE RESEARCH ON CLASSIFIERS, PROJECT REPORT
6. CONTRIBUTION TO EXPLAINING THE USE CASE AND FUTURE SCOPE, PROJECT DEMONSTRATION / PRESENTATION
SIGNATURE OF STUDENT
SIGNATURE OF GUIDE