Literature Survey

LITERATURE SURVEY ON
SENTIMENT ANALYSIS FOR PREDICTING MOVIE REVIEW
Submitted in partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology
In
Computer Science And Engineering
By
Meghna Peethambaran
FEDERAL INSTITUTE OF SCIENCE AND TECHNOLOGY (FISAT)

ANGAMALY-683577, ERNAKULAM (DIST)

Affiliated to
MAHATMA GANDHI UNIVERSITY

Kottayam-686560
November 2017
FEDERAL INSTITUTE OF SCIENCE AND TECHNOLOGY (FISAT)
Mookkannor(P.O), Angamaly-683577
CERTIFICATE
This is to certify that the literature survey titled "Sentiment Analysis for predicting movie review"
is a bonafide work carried out by Meghna Peethambaran (14004078) in partial fulfilment of the requirements
for the award of Bachelor of Technology in Computer Science and Engineering from Mahatma Gandhi University,
Kottayam, Kerala during the academic year 2017-2018.
Staff In-charge Head of the Department
Place:
Date:
Internal Examiner External Examiner

ABSTRACT
Movie reviews are assessments of the aesthetic, entertainment, social and cultural merits and significance
of a current film or video. Reviews tend to be short to medium length articles, often written by a single
staff writer for a particular publication. For the film industry, online reviews from critical audiences
play an important role. On one hand, good comments on a movie can attract a larger audience. On
the other hand, good comments do not necessarily mean high box office revenue, and vice versa. Although
reviews are usually fairly "quick takes" on a movie, they can, in some instances, be lengthy, substantive,
and very insightful. Here we developed a model to perform sentiment analysis on movie reviews
and predict whether a review is positive or negative.
ACKNOWLEDGEMENT
Apart from the efforts put in by us, the success of this project depends largely on the encouragement
and guidelines of many others. We take this opportunity to express our gratitude to the people who have
been instrumental in the successful completion of this project:
Mr. Paul Mundadan, Chairman, FISAT, who provided us with the vital facilities required by the
project right from inception to completion.
Dr. George Issac, Principal, FISAT, for the amenities he provided, which helped us in the fulfillment
of our project.
Dr. Prasad J.C, HOD (CSE Dept), FISAT, who always guided us and rendered his help in all phases
of our project.
Mr. Pankaj Kumar G, for his constant encouragement and enthusiastic supervision and for guiding
us with patience in all the stages. Without his help and inspiration, this would not have materialized.
Ms. Divya John, Ms. Reshmi R, Ms. Preethi N P and Mr. Paul P Mathai, for their guidance
and constant supervision as well as for providing necessary information regarding the project and also
for their support in completing the project.
The faculty of the CSE Dept., FISAT and Lab Instructors, for providing us with the necessary Lab
facilities and helping us throughout this project.
Our families, who inspired, encouraged and fully supported us in every trial that came our way. Also,
we thank them for giving us not just financial, but moral and spiritual support.
Meghna Peethambaran
Contents
List of Figures
1 Introduction
1.1 Overview
2 Related Works
2.1 Genre Specific Aspect Based Sentiment Analysis of Movie Reviews
2.1.1 Implementation
2.1.2 Performance calculation
2.1.3 Summary
2.2 Reduced Feature Based Sentiment Analysis on Movie Reviews Using Key Terms
2.2.1 Implementation
2.2.2 Learning Phase
2.2.3 Detection Phase
2.2.4 Summary
2.3 Feature Selection & Classification Algorithms
2.3.1 Implementation
2.3.2 Summary
2.4 Design Approach for Accuracy in Movies Reviews Using Sentiment Analysis
2.4.1 Implementation
2.4.2 Summary
2.5 Aspect Based Sentiment Analysis of Movie Reviews
2.6 Sentiment Analysis of Movie Reviews
2.7 Improvement of Sentiment Analysis based on Clustering of Word2Vec Features
2.7.1 Implementation
2.7.2 Summary
3 Scope of the work
4 Conclusion
List of Figures
2.1 Diagram for the proposed method
2.2 Preprocessing and Plot Elimination
2.3 Key Term Extraction
2.4 Key Term Based Extraction
2.5 SentiWordNet Based Feature Extraction
2.6 Lexicon Based Feature Extraction
2.7 Proposed methodology
2.8 Work flow diagram
2.9 Pre-processing of dataset
2.10 Diagram for the proposed method
2.11 Work flow
Sentiment Analysis for predicting movie review
Chapter 1
Introduction
With the increasing popularity of social media sites such as Twitter and review sites like Yelp and Rotten
Tomatoes, it is important to be able to automatically make sense of these large amounts of subjective,
opinionated data. Sentiment analysis, which uses natural language processing and machine learning techniques
to characterize subjective human opinions or sentiments, has been rapidly gaining popularity as a method
of analyzing these large corpora for applications as diverse as predicting trends in the stock market
and characterizing diurnal and seasonal moods such as seasonal affective disorder. Most of the work to
date has focused on identifying how the presence of individual words in an excerpt, such as a tweet or movie
review, contributes to the sentiment of the entire excerpt (a so-called bag-of-words model).
1.1 Overview
Semantic analysis describes the process of understanding natural language, the way that humans
communicate, based on meaning and context. The semantic analysis of natural language content starts by
reading all of the words in the content to capture the real meaning of the text. It identifies the text
elements and assigns them to their logical and grammatical roles. It analyzes the context in the surrounding
text and the text structure to accurately disambiguate the proper meaning of words that have more
than one definition. Semantic technology processes the logical structure of sentences to identify the most
relevant elements in the text and understand the topic discussed.
Sentiment analysis [1] is a methodology by which we find out the sentimental orientation of a piece of
text. Using it, we can infer whether a particular person has conveyed a positive or negative sentiment
in the text under consideration. The issue of aspect based sentiment analysis of movie reviews was
tackled in an earlier publication [2], which used the concept of "driving factors" to enhance the
overall classification accuracy by amplifying the effect of certain movie aspects with respect to others. In
the current work, the same concept is used, but for reviews of different genres. Many researchers have
done work on aspect based analysis of reviews, be they movie or customer reviews, and many algorithms have
been developed for the same. But not much work has been done on genre specific aspect based analysis.
Genre specific reviews demand special techniques during analysis, as such reviews contain sentences or
words whose meaning depends on the context, i.e. the genre, in which they are used.
The researchers got inspired from approaches in different fields like information retrieval, natural
language processing, statistics, summarization, probability and machine learning, and different concepts
in these fields are applied for better opinion mining. Major steps involved in sentiment analysis include
data gathering, preprocessing, aspect identification, feature extraction and sentiment classification. An
opinion mining process can be done at different levels like document level, sentence level, phrase level,
tweet level or aspect level.

Chapter 2
Related Works
2.1 Genre Specific Aspect Based Sentiment Analysis of Movie Reviews
[1]
An unsupervised aspect based analysis model is developed that uses context specific, i.e. genre specific,
lexicons. A separate lexicon is used for each genre, which builds context sensitivity into
the model. Aspect based analysis also makes the model fine grained, working at the level of individual aspects.
2.1.1 Implementation
The method aims at developing a lexicon based aspect oriented analysis approach for genre specific
reviews. Fig 2.1 describes the flow of the proposed method. The dataset used here is the one that Mahesh Joshi,
Dipanjan Das, Kevin Gimpel, and Noah A. Smith used in their experiments. The dataset was in XML
format and each file contained movie details like the name of the movie, the genre of the movie, the date of
release and also cited web links for obtaining full reviews. Pre-processing was required, as the dataset was
not in accordance with the requirements of the method. The links and the genre of each movie were extracted
from the dataset. Since the dataset didn't contain ratings, the MovieLens dataset was used.
[10]
The next step was separating the review text into aspect specific text. The Aspect Based Text
Separator (ABTS) was used for this purpose. ABTS separates the review text into different groups based
on the movie aspects, using an aspect lexicon. The next step was the classification
of these separated aspect texts. To account for all the context sensitive words, a genre specific lexicon
is developed. This lexicon contains words whose orientation depends on the genre in
which they are used. A list of the 500 most frequently used adjectives in everyday language was formed into a
lexicon. To assign an orientation to these words based on the movie genre, the methodology
of semantic orientation is used.
Figure 2.1. Diagram for the proposed method
Before defining semantic orientation, Pointwise Mutual Information (PMI) is defined. The PMI between two
words is the amount of information that is acquired about the presence of one word when the
other is observed [14].
The formula for PMI is:
\[ \mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\, P(word_2)} \]
where P(word_1, word_2) is the probability that the two words co-occur and P(word_1), P(word_2) are the
probabilities of each word occurring on its own.
The semantic orientation of a word is calculated by:
\[ \mathrm{SO}(word) = \mathrm{PMI}(word, X) - \mathrm{PMI}(word, Y) \]

Here X denotes a positively oriented word or string of words and Y denotes a negatively oriented
word or string of words. Thus the co-occurrence of the word with a positive word and with a
negative word is found, and the PMI obtained with the negative word is subtracted from the PMI obtained
with the positive word to get the overall orientation of the word. If a negative value is obtained, the
overall SO is negative, which means that the word under consideration occurs more closely with the negative
word string. Similarly, if the result is positive, the word is closely associated with the positive word string.
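The calculation above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that the probabilities are estimated from raw counts over a corpus of `total` text units; the function names, the argument names and the small `smoothing` constant (added so that a zero co-occurrence count does not lead to a logarithm of zero) are assumptions made for the example and are not taken from the surveyed paper.

```python
import math

def pmi(joint_count, count_a, count_b, total):
    """PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), probabilities estimated from counts."""
    p_joint = joint_count / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_joint / (p_a * p_b))

def semantic_orientation(co_pos, co_neg, count_word, count_pos, count_neg,
                         total, smoothing=0.01):
    """SO(word) = PMI(word, X) - PMI(word, Y).

    co_pos / co_neg are the co-occurrence counts of the word with the positive
    string X and the negative string Y. `smoothing` is an assumed constant that
    keeps the logarithm defined when a co-occurrence count is zero.
    """
    pmi_pos = pmi(co_pos + smoothing, count_word, count_pos, total)
    pmi_neg = pmi(co_neg + smoothing, count_word, count_neg, total)
    return pmi_pos - pmi_neg

# A word seen 100 times that co-occurs 40 times with X and 10 times with Y
so = semantic_orientation(40, 10, 100, 200, 200, 1000, smoothing=0.0)
print(round(so, 3))
```

A positive result indicates that the word is more closely associated with the positive string, a negative result that it is more closely associated with the negative string.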
The NEAR operator functionality was programmatically recreated using Boolean operators to work
on our dataset. To find co-occurrence of two words, word1 and word2 we issued a Boolean query over our
dataset as: "word1 word2" OR "word2 word1" OR "word1 * word2" OR "word2 * word1". The above
query considers all the cases related to the positioning of the words. Here "*" represents a wildcard,
which means there can be single or multiple words between the two words.
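A co-occurrence count of this kind could be approximated as follows. The sketch assumes that the dataset is a list of review strings and that a "hit" is a review in which both words appear in either order; the optional `max_gap` parameter, which limits the number of words allowed between the two, is a hypothetical addition that mimics a stricter NEAR operator and is not described in the source.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def co_occurs(tokens, word1, word2, max_gap=None):
    """True if word1 and word2 both occur, in either order, with at most
    max_gap words between them (any distance when max_gap is None)."""
    positions1 = [i for i, t in enumerate(tokens) if t == word1]
    positions2 = [i for i, t in enumerate(tokens) if t == word2]
    for i in positions1:
        for j in positions2:
            if i == j:
                continue
            gap = abs(i - j) - 1  # number of words strictly between the two
            if max_gap is None or gap <= max_gap:
                return True
    return False

def hit_count(documents, word1, word2, max_gap=None):
    """Number of documents in which the two words co-occur."""
    return sum(co_occurs(tokenize(d), word1, word2, max_gap) for d in documents)

reviews = [
    "A gripping and excellent plot",
    "Excellent, truly gripping",
    "A dull plot",
    "Gripping",
]
print(hit_count(reviews, "gripping", "excellent"))
```

With `max_gap=None` the function covers all four cases of the Boolean query, since adjacency is simply a gap of zero words.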
After hit count collection, the SO of the adjective was calculated according to the given formula. The
SO values were normalised to lie between -1 and 1 before being stored in the
lexicon. This process was carried out for all 500 adjectives and for each genre. The SO values for each
genre corresponding to each adjective were stored in a file, and a genre specific lexicon was formed. After
the review is passed through the ABTS, the separated aspect texts are forwarded to be scored by the
Genre Specific Lexicon Scorer (GSLS). Here the adjectives are extracted from the text and scored using
the lexicon. If the adjective is present in the lexicon, it is given the score corresponding to the genre
of the review under consideration. If the adjective is not present in the lexicon, it
is scored using the SentiWordNet dictionary. SentiWordNet contains sentiment scores, which are not
context specific, for a huge collection of adjectives, adverbs and nouns.
The effect of negations and intensifiers on the adjective being scored is also considered. Negations
are words that change the polarity of the adjective. Intensifiers are words that enhance the score
of the adjective. After all the adjectives have been scored, the average of all the scores is
computed and assigned as the score for the aspect text. The next step is the application
of driving factors to these scores (DFM). Driving factors are used to amplify the importance of
certain aspects in the overall classification process and in the fine grained analysis of the review under
consideration. The application of driving factors is followed by the summation process, in which a
proper threshold was set, the review scores were compared with this threshold, and the review was
classified based on this comparison.
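The scoring pipeline described above might look roughly like the following sketch. Everything concrete in it is an assumption made for illustration: the negation list, the intensifier multipliers, the use of a weighted sum to apply the driving factors and the default threshold of 0 are not specified in the source, and the lexicons are passed in as plain dictionaries instead of being built from SO values or read from SentiWordNet.

```python
NEGATIONS = {"not", "never", "no"}              # illustrative list
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}  # hypothetical multipliers

def score_aspect_text(tokens, genre_lexicon, fallback_lexicon):
    """Average score of the scored words in one aspect text.

    genre_lexicon maps adjective -> score for the genre of the review;
    fallback_lexicon stands in for the context independent dictionary.
    """
    scores = []
    for i, token in enumerate(tokens):
        if token in genre_lexicon:
            score = genre_lexicon[token]
        elif token in fallback_lexicon:
            score = fallback_lexicon[token]
        else:
            continue
        j = i - 1
        if j >= 0 and tokens[j] in INTENSIFIERS:
            score *= INTENSIFIERS[tokens[j]]  # intensifier enhances the score
            j -= 1
        if j >= 0 and tokens[j] in NEGATIONS:
            score = -score                    # negation flips the polarity
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

def classify_review(aspect_scores, driving_factors, threshold=0.0):
    """Weight each aspect score by its driving factor, sum, compare to threshold."""
    total = sum(driving_factors.get(aspect, 1.0) * score
                for aspect, score in aspect_scores.items())
    return "positive" if total > threshold else "negative"

plot_score = score_aspect_text(["gripping", "but", "slow"],
                               {"gripping": 0.8}, {"slow": -0.4})
label = classify_review({"plot": plot_score, "acting": -0.1}, {"plot": 2.0})
print(plot_score, label)
```

Raising the driving factor of an aspect makes its score count for more in the final sum, which is how certain aspects can dominate the classification for a given genre.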

2.1.2 Performance calculation

Each file in the dataset consisted of the following information: the name of the movie, the genre of the
movie, some financial information regarding the movie, and a list of links to various sites, from which
the full movie review was obtained. The following methods were used to measure the performance of the
movie:
2.1.3 Summary
Applying the driving factors in the methodology described above gave the following results. For the action
genre, plot, movie and direction were the most important factors, as these aspects had the highest values;
for comedy, acting, plot and movie; for crime, screenplay, music and plot; for drama, music,
movie and direction; and for horror, music, direction and movie. These results hold only for
the particular dataset under consideration.
2.2 Reduced Feature Based Sentiment Analysis on Movie Reviews Using Key Terms
Word of mouth communication plays an influential role in propelling or sinking movies [4]. Proper analysis
of movie reviews published in social media helps in efficient decision making. Sentiment analysis aids in
automating this time consuming task. Identifying the relevant data by eliminating the plot in movie
reviews enhances the performance of any sentiment analysis algorithm. An entity recognition method is
implemented in the preprocessing stage to eliminate the irrelevant information from the reviews. Sets
of N-grams are extracted from the document corpus as key terms. The extracted key terms, together with an
external lexicon, are used to identify the sentiment conveying features in the review, and these features
are used to train the classifiers. Performance analysis is done on various classifiers, and it is found that
the proposed method outperforms existing approaches on movie reviews.
2.2.1 Implementation

Movie reviews can be analyzed in many ways to extract useful information from them. Such analysis helps in
opinion summarization, extracting the sentiment orientation of an author towards the movie, performance
comparison of multiple movies, identification of characters and storyline, and so on. The focus here is on
determining whether the overall opinion of an author towards a movie is positive or negative in nature. Hence
it can be considered a binary classification task. The proposed system for implementing this task is designed
as a supervised machine learning algorithm that uses some predefined lexicons and a set of key terms in
the document corpus for feature extraction. Hence it can be considered a hybrid of machine learning
based and lexicon based approaches. The algorithm has two main phases: the Learning Phase and the
Detection Phase.
In the Learning Phase, a set of positive and negative key terms is first extracted from the
training data. A set of features is then generated based on these terms. Additional features included
in the feature vector are a SentiWordNet (SWN) based score and features based on a lexicon of positive
and negative words. This feature vector and the sentiment labels of the documents are used to train classifiers.
The Detection Phase uses the trained classifiers to classify the test data as positive or negative.
2.2.2 Learning Phase

Given a set of labeled training data, the learning phase trains a machine learning classifier to handle the
sentiment detection problem in the detection phase. This phase mainly consists of the following steps:
A. Preprocessing and Plot Elimination
Input to the system is a set of training data that is labeled either as positive or negative and a set of
unlabeled data for testing. The textual data extracted from the movie review website is unstructured and
has to be converted into a form suitable for further processing.
Preprocessing and plot elimination are applied to both the labeled training documents for learning and the
unlabeled test documents for sentiment detection, as shown in figure 2.2. As part of the preprocessing phase,
some erroneous patterns that frequently occur in the data are removed through regular expression
matching. Some terms occur frequently in the data but do not contribute to the text
mining. Such terms, like "the", "and", "does" and "of", are called stop words and are removed as part
of preprocessing. Stop word removal is, however, done only when required. Unlike the existing
approaches for sentiment analysis, a plot elimination phase is also incorporated in the preprocessing phase.
It can be considered as a part of the preprocessing or as a separate phase.
Figure 2.2. Preprocessing and Plot Elimination
By analyzing the movie reviews, it is found that the performance of a sentiment classification system
may improve by eliminating the portions of reviews that describe the story, thus retaining sentences
bearing the actual opinion of the author. It is done in two phases:
a. Plot Elimination Phase I
b. Plot Elimination Phase II

B. Key Term Extraction

During feature extraction, key terms are detected in each review and, based on their presence, a novel
set of features is computed. Here key terms are N-grams of different sizes. An N-gram is
a consecutive sequence of n terms. Each review is divided into sentences. Each sentence is then
divided into N-grams, which are added to a list of N-grams. The probability distribution of
the set of N-grams over the document corpus is then computed. All N-grams whose ratio of probability of
occurrence in the positive document corpus to that in the negative document corpus is greater than a threshold
value are extracted as positive key terms. Similarly, all N-grams whose ratio of probability of occurrence
in the negative document corpus to that in the positive document corpus is greater than the threshold are
extracted as negative key terms. The threshold value is chosen based on the size of the document corpus.
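The ratio test described above can be sketched as follows, assuming that each corpus is given as a list of tokenized sentences. The function names, the default threshold of 2.0 and the small `floor` probability used for N-grams that never occur in the opposite corpus are assumptions made for the example; the source does not state how unseen N-grams are handled.

```python
from collections import Counter

def ngrams(tokens, n):
    """All consecutive sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_distribution(sentences, max_n):
    """Relative frequency of every N-gram (sizes 1..max_n) in a corpus."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def extract_key_terms(pos_sentences, neg_sentences, max_n=2,
                      threshold=2.0, floor=1e-6):
    """Split N-grams into positive and negative key terms by probability ratio.

    `floor` is an assumed stand-in probability for N-grams that are absent
    from one corpus, so the ratio stays defined.
    """
    p_pos = ngram_distribution(pos_sentences, max_n)
    p_neg = ngram_distribution(neg_sentences, max_n)
    positive, negative = set(), set()
    for gram in set(p_pos) | set(p_neg):
        prob_pos = p_pos.get(gram, floor)
        prob_neg = p_neg.get(gram, floor)
        if prob_pos / prob_neg > threshold:
            positive.add(gram)
        elif prob_neg / prob_pos > threshold:
            negative.add(gram)
    return positive, negative

pos_corpus = [["great", "movie"], ["great", "acting"]]
neg_corpus = [["boring", "movie"], ["boring", "plot"]]
print(extract_key_terms(pos_corpus, neg_corpus, max_n=1))
```

In this toy corpus "movie" is equally likely in both classes, so it is not selected as a key term for either side.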
Figure 2.3. Key Term Extraction

C. Feature Extraction
A review cannot be given directly, as it is, to a machine learning algorithm. Hence each document has
to be transformed into a set of features. These features are to be selected in a logical manner such that
they hold the information relevant for classification and prediction. The experiments are done using three
types of features:
- Key term based features
- SentiWordNet based features
- Lexicon based features
Key Term Based Extraction

To extract key term based features, the positive and negative key terms present in a review are located
first. This step uses the sets of positive and negative key terms that have already been identified in the
key term extraction phase. Key term detection is implemented as an iterative task. Each review is divided into
sentences and each sentence is again divided into N-grams with "n" set to its maximum chosen value.
If an N-gram is a positive or negative key term, it is added to the corresponding list of detected
key terms of the review. Otherwise it is divided into sub N-grams, and this process is repeated
iteratively until n reaches 1.
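One possible reading of this iterative procedure is sketched below for a single tokenized sentence. The choice to split an unmatched N-gram into its two sub N-grams of size n-1 (dropping the last or the first token) and the `seen` set that stops the same span from being examined twice are assumptions of this sketch; the source only says that the N-gram is divided and the process repeated.

```python
def detect_key_terms(tokens, positive_terms, negative_terms, max_n):
    """Detect key terms in one sentence, starting from the largest N-grams.

    positive_terms and negative_terms are sets of token tuples.
    Returns (found_positive, found_negative) as lists of tuples.
    """
    found_pos, found_neg = [], []
    n = min(max_n, len(tokens))
    # Spans are (start, end) index pairs into tokens, end exclusive.
    spans = [(i, i + n) for i in range(len(tokens) - n + 1)] if n > 0 else []
    seen = set()
    while spans:
        next_spans = []
        for start, end in spans:
            if (start, end) in seen:
                continue
            seen.add((start, end))
            gram = tuple(tokens[start:end])
            if gram in positive_terms:
                found_pos.append(gram)
            elif gram in negative_terms:
                found_neg.append(gram)
            elif end - start > 1:
                # No match: fall back to the two sub N-grams of size n - 1.
                next_spans.append((start, end - 1))
                next_spans.append((start + 1, end))
        spans = next_spans
    return found_pos, found_neg

sentence = ["not", "worth", "watching", "great", "cast"]
positive = {("great", "cast")}
negative = {("not", "worth", "watching")}
print(detect_key_terms(sentence, positive, negative, max_n=3))
```

The lengths of the two returned lists could then serve as the key term based features of the review.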
Figure 2.4. Key Term Based Extraction

SentiWordNet Based Feature Extraction

There may be some terms that carry high sentiment polarity but are not commonly used. To handle
such infrequent polar terms, SentiWordNet based features are extracted. The role that a term performs
in a sentence, and its semantic and sentiment orientation, depend on the Part of Speech (PoS) tag of the
term in that sentence; i.e., the sense of a word when used as a noun may be quite different from its
sense as an adjective. SentiWordNet scores can be computed for different PoS tags of the same
word. First, PoS tagging is performed on the review sentences. Then the score of a (word, PoS) pair is
computed as the difference between the positive score and the negative score obtained from SentiWordNet for
the pair. The overall sentiment score of a review is computed as the average SentiWordNet score of all terms
in it. The presence of negation terms like "not" can reverse the polarity of a sentence. Negation is handled
by reversing the score (i.e., multiplying by -1) of up to three terms following the negation term. Figure 2.5
shows the SentiWordNet based feature extraction.
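The scoring rule can be illustrated with a small sketch. To keep the example self-contained, SentiWordNet is replaced by a toy dictionary mapping (word, PoS) pairs to (positive score, negative score); the negation list, the tag letters and the decision to leave the negation term itself out of the average are assumptions of the sketch.

```python
NEGATION_TERMS = {"not", "no", "never"}  # illustrative list

def review_swn_score(tagged_tokens, swn_scores, negation_window=3):
    """Average (positive - negative) score over the terms of a review.

    tagged_tokens: list of (word, pos_tag) pairs from a PoS tagger.
    swn_scores: dict mapping (word, pos_tag) -> (pos_score, neg_score),
                a stand-in for a real SentiWordNet lookup.
    The scores of up to `negation_window` terms after a negation are reversed.
    """
    scores = []
    flips_left = 0
    for word, tag in tagged_tokens:
        if word in NEGATION_TERMS:
            flips_left = negation_window
            continue
        pos_score, neg_score = swn_scores.get((word, tag), (0.0, 0.0))
        score = pos_score - neg_score
        if flips_left > 0:
            score = -score
            flips_left -= 1
        scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0

toy_swn = {("good", "a"): (0.75, 0.0), ("dull", "a"): (0.0, 0.5)}
print(review_swn_score([("good", "a"), ("movie", "n")], toy_swn))
print(review_swn_score([("not", "r"), ("good", "a"), ("movie", "n")], toy_swn))
```

The second call shows the effect of negation: the same words yield a score of the opposite sign when preceded by "not".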
Figure 2.5. SentiWordNet Based Feature Extraction

Lexicon Based Feature Extraction

A lexicon of positive and negative terms is collected from online resources. Counts of positive and
negative terms in a review are added as lexicon based features, as shown in figure 2.6. Once the reviews
are transformed into sets of features, the feature set and the corresponding sentiment labels are given as
input to some popular machine learning classifiers to build a classification model that can later predict
the labels of unknown reviews during the detection phase. The machine learning classifiers that have been
chosen are Support Vector Machine (SVM), K Nearest Neighbor (KNN), Bernoulli Naive Bayes classifier,
Random Forest and Decision Tree Classifier.
Figure 2.6. Lexicon Based Feature Extraction
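The lexicon based features amount to two counts per review, which can be sketched as below. The word sets are tiny illustrative stand-ins for the lexicon collected from online resources, and the function name is hypothetical.

```python
# Tiny stand-in lexicons; a real lexicon would contain thousands of entries.
POSITIVE_WORDS = {"good", "great", "brilliant", "enjoyable"}
NEGATIVE_WORDS = {"bad", "boring", "dull", "weak"}

def lexicon_features(tokens, positive_words=POSITIVE_WORDS,
                     negative_words=NEGATIVE_WORDS):
    """Return [positive_count, negative_count] for a tokenized review."""
    positive_count = sum(1 for t in tokens if t in positive_words)
    negative_count = sum(1 for t in tokens if t in negative_words)
    return [positive_count, negative_count]

review = ["great", "cast", "but", "a", "dull", "and", "boring", "plot"]
print(lexicon_features(review))
```

These two numbers would be appended to the key term based and SentiWordNet based features to form the feature vector passed to the classifiers.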
2.2.3 Detection Phase

Input to the detection phase is a set of unlabeled reviews. The objective of this phase is to predict the
sentiment label of these documents. The major steps are:
- Preprocessing and Plot Elimination
- Feature extraction
- Sentiment Prediction
The first two steps are the same as in the learning phase. The same key terms extracted in the learning phase
are used for generating key term based features in the detection phase. Input to the sentiment prediction
step is the set of features corresponding to each review. Each review is classified as either positive or
negative using the machine learning classification model built in the learning phase. The result can be
compared with the actual labels of the test data to compute the accuracy of the proposed sentiment
classification approach.

2.2.4 Summary
Sentiment analysis on movie reviews is a challenging task because of the presence of plot descriptions
within the reviews. There was a considerable improvement in performance by eliminating those portions
of the review that describe the story line, thus enabling the sentiment classification algorithm to focus
on the relevant opinionated sentences. The study also introduced a novel set of features based on the
frequent N-grams used by authors to express their feelings in addition to a set of lexicon based features.
Elimination of plot and the reduced feature set made the proposed sentiment analysis system efficient in
terms of time and cost.
2.3 Feature Selection & Classification Algorithms

[6]
Sentiment analysis is a sub-domain of opinion mining where the analysis is focused on the extraction
of the emotions and opinions of people towards a particular topic from structured, semi-structured or
unstructured textual data. The sentiment expression is examined to classify the polarity of the movie
review on a scale of 0 (highly disliked) to 4 (highly liked). Feature extraction and ranking are performed,
and these features are used to train a multilabel classifier that assigns the movie review its correct label.
Because movie reviews follow informal jargon and lack strong grammatical structure, an
approach based on structured N-grams has been followed. In addition, a comparative study of different
classification approaches has been performed to determine the classifier best suited to the problem
domain. In this paper, a lexical approach using SentiWordNet is followed to determine the overall
polarity of the movie review.
A. Preprocessing of data
Preprocessing is the preparation of the dataset before any algorithm is applied to it. The data is stemmed
to remove the commoner morphological endings from English words. Stop word removal is then performed,
removing the most common words according to a stop list to reduce the size of the document. Part-of-speech
tagging is a processing technique in which each word is marked with its part of speech,
such as noun.
After the preprocessing, the next step was analyzing the data to find common observable patterns
that may affect the polarity of the document. In order to calculate the document polarity, it is necessary
to understand that the sentiment score of a term may be enhanced or diminished by its usage as well as by
its relationship with the nearby words. Once features have been identified from observation, the impact of
each feature on the polarity of the document has to be found in order to set the scaling factor for each
feature. To find this impact, the Information Gain of each feature was computed and a Feature Ranking
Algorithm was used to rank all the features.
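Information gain based ranking can be sketched as follows, using the standard definition (entropy of the labels minus the entropy remaining after splitting on the feature). The sketch assumes discrete feature values; the feature names and the toy data are hypothetical and are not taken from the surveyed paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remaining = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remaining

def rank_features(feature_table, labels):
    """Sort features by information gain, highest first."""
    gains = {name: information_gain(values, labels)
             for name, values in feature_table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

labels = ["pos", "pos", "neg", "neg"]
features = {
    "has_negation": [0, 0, 1, 1],     # separates the classes perfectly
    "has_exclamation": [1, 0, 1, 0],  # carries no information about the class
}
print(rank_features(features, labels))
```

Features at the bottom of the ranking are the natural candidates to drop when deriving a reduced feature set.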

Well known classifiers, namely Bagging, Random Forest, Decision Tree, Naive Bayes, K-Nearest Neighbor
and Classification via Regression, are used. The aim of the classification is to enable a machine
to predict the class label of a movie review whenever one arrives.
Figure 2.7. Proposed methodology
2.3.2 Summary
In this work, new features that have a strong impact on determining the polarity of movie
reviews were extracted, and computational linguistic methods were applied for the preprocessing of the data.
Feature impact analysis was performed by computing the information gain for each feature in the feature
set, which was used to derive a reduced feature set. Among the six classification techniques, the
highest accuracy, 88.95%, was given by Random Forest.

2.4 Design Approach for Accuracy in Movies Reviews Using Sentiment Analysis
[3]
Sentiment analysis is a sub-domain of opinion mining where the analysis is focused on the extraction
of emotions, or of a specific view or judgment on a certain topic. A sentiment analysis system classifies
text data into positive, negative or neutral polarity. In this domain most
previous researchers have focused on using one of three classifiers: SVM, Naïve Bayes, and
Maximum Entropy. There are other robust classifiers that are able to provide comparable or
better results.
This paper focuses on two areas: first, feature selection and ranking, and second, the use of machine
learning techniques. The polarity labels are assigned as follows: Strong Negative (-2), Weak Negative (-1),
Neutral (0), Weak Positive (1), Strong Positive (2). The work flow is shown in fig 2.8.
Figure 2.8. Work flow diagram
A. Input Data
The input data is in the form of reviews from the "Times of India" movie review dataset. A particular
movie is selected from the dataset and the reviews regarding that movie are displayed on a web page. After
the release of any new movie, the reviews of that movie are added to the dataset.
B. Pre-Processing
The text pre-processing techniques are divided into three subcategories (fig 2.9):
- Tokenization: The text document is made up of blocks of characters called tokens.
The documents are split into tokens, which are used for further processing of the data.
- Removal of Stop Words: Words that appear too often and carry no information for the
task are removed.
- Part of Speech Tagging: A POS tagger parses a sentence or document and tags each term with its part
of speech.
Figure 2.9. Pre-processing of dataset
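The first two pre-processing steps can be sketched with the Python standard library alone. The regular expression used for tokenization and the very small stop list are assumptions made for the example; part of speech tagging is left out because it needs a trained tagger, which the source does not name.

```python
import re

# Illustrative stop list only; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "was", "does", "to", "in"}

def tokenize(text):
    """Split text into lower-case word tokens, keeping contractions whole."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("The plot wasn't dull, and the acting was brilliant.")
print(remove_stop_words(tokens))
```

The remaining tokens would then be passed to a part of speech tagger before text transformation.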
C. Text Transformation
In the process of text transformation, the score of each sentence in the source document is calculated
as the sum of the weights of the terms in that sentence. The weight of each term is calculated
by multiplication, based on the adjective words extracted through part-of-speech tagging. The output of the
pre-processing step is given as input to the text transformation process.
D. Feature Extraction
In the process of feature extraction, movie features are extracted from every sentence. To find the
polarity of a text document, it is necessary to understand how the sentiment score of a term depends on
its usage and on its relationship with the nearby words.
E. Feature Reduction Approach
One of the biggest problems of sentiment analysis is dealing with text data of very
high dimensionality, which may affect the performance of the classifier. There is therefore a need for a
technique that eliminates the features that are not relevant and keeps only those that are. Information
Gain and Gain Ratio are the most popular among the many feature reduction techniques.

Information Gain: Information Gain technique is mainly used for finding importance of a feature in
decreasing overall entropy. Information Gain process is mainly based on the measure entropy. The
entropy measure indicates the impurity of collected samples.
Gain Ratio: In gain ratio the contribution of all features will be normalized before the classification of
the document.
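As a rough illustration of the two measures, the sketch below uses the textbook definitions: information gain is the entropy of the labels minus the weighted entropy after splitting on a feature, and gain ratio divides that gain by the entropy of the feature itself. Whether the paper uses exactly this normalization is an assumption, and the toy data and function names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the feature (split information)."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0
    return information_gain(feature_values, labels) / split_info

# Toy data: does each review contain the word "good" / the word "film"?
labels = ["pos", "pos", "neg", "neg"]
contains_good = [1, 1, 0, 0]
contains_film = [1, 0, 1, 0]
print(information_gain(contains_good, labels), information_gain(contains_film, labels))
```

In this toy data the word "good" separates the classes perfectly while "film" tells nothing about the class, so a feature reduction step would keep the first and discard the second.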
F.Classification
A lexicon-based approach using SentiWordNet finds the overall polarity of movie reviews. Classification
is done with the Random Forest classifier to determine the sentiment labels and to predict the class of
a movie review, positive or negative, whenever one arrives. The reduced features are the input to the
classification process, and the classification is based on the number of positive and negative
sentiments. The sentiments in a sentence are classified according to polarity.
2.4.2 Summary
Many researchers have worked on the domain of movie reviews. Pang et al. used three classification
techniques, NB, ME and SVM, and achieved an accuracy of 82.90%. Prabowo et al. used a hybrid SVM
method for classification and achieved an accuracy of 87.3%. Rui Yao et al. worked with the NB
classification technique and achieved an accuracy of 78.75%. Mullen and Collier performed sentiment
classification using SVM and achieved an accuracy of 86%. Compared with these previous models, the
approach in the paper achieves the highest accuracy for the classification of movie reviews.
2.5 Aspect Based Sentiment Analysis of Movie Reviews

[1]
The method aims at developing a technique for aspect based sentiment analysis of movie reviews.
Fig 2.10 describes the flow of the proposed method. Reviews of the movie are collected from different
sources and pre-processed to make them suitable for the method. The pre-processing step formats the
different reviews so that they can be aligned in a required format. For this, the HTML tags and other
tags were removed. The pre-processed reviews are then passed through the aspect based text separator
to obtain the separated review text. The movie aspects used are screenplay, music, acting, plot, movie,
and direction. The function of the separator was to separate the review aspect-wise. The separator
used an aspect specific lexicon for the purpose of text separation. Each word in the lexicon was
associated with its part of speech.
Figure 2.10. Diagram for the proposed method
While searching a sentence for a lexicon word, the sentence is tagged with the Stanford
Part-Of-Speech tagger, and the lexicon word is then matched against the words in the sentence that have
the same part of speech. These aspect-wise separated sentences were given as input to the classifiers
meant for each aspect. A Naive Bayes classifier was used for this purpose. It calculates the probability
of a word, or even a sentence, belonging to the positive or the negative class of reviews. The traditional
method of training and testing the classifier is applied. The output of the classifier is either 1 or -1,
denoting that the input text was of positive or negative orientation respectively. Instead of NB, any
classifier, such as SVM, that is able to clearly classify the text into two classes can be used. Based on
the weightage of the driving factors of the movie, the aspect based output is multiplied with the
respective driving factor.
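The separator and the driving-factor weighting can be sketched as follows. Everything concrete here is an assumption made for illustration: the lexicon entries, the driving-factor values and the function names are hypothetical, the part-of-speech matching is omitted, the per-aspect classifier is replaced by ready-made outputs of 1 or -1, and summing the weighted outputs into one score is a guess at how they are combined.

```python
# Hypothetical aspect specific lexicon (the paper also stores a POS tag per word).
ASPECT_LEXICON = {
    "acting": {"acting", "actor", "performance"},
    "music": {"music", "song", "soundtrack"},
    "plot": {"plot", "story"},
}

# Hypothetical driving factors (weightage of each aspect).
DRIVING_FACTORS = {"acting": 0.5, "music": 0.2, "plot": 0.3}

def separate_by_aspect(sentences):
    """Group sentences under every aspect whose lexicon words they mention."""
    separated = {aspect: [] for aspect in ASPECT_LEXICON}
    for sentence in sentences:
        words = set(sentence.lower().replace(".", "").split())
        for aspect, terms in ASPECT_LEXICON.items():
            if words & terms:
                separated[aspect].append(sentence)
    return separated

def overall_score(aspect_outputs):
    """Multiply each aspect's classifier output (1 or -1) by its driving factor and sum."""
    return sum(DRIVING_FACTORS[a] * out for a, out in aspect_outputs.items())

separated = separate_by_aspect(["The acting was superb.", "The story dragged."])
score = overall_score({"acting": 1, "plot": -1})
print(separated, score)
```

Here a positive verdict on acting outweighs a negative verdict on plot because acting has the larger driving factor.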

2.6 Sentiment Analysis of Movie Reviews

[6]
This paper explores a new SentiWordNet based scheme for both document-level and aspect-level
sentiment classification. The document-level classification uses different linguistic features (ranging
from the Adverb+Adjective combination to the Adverb+Adjective+Verb combination). A new domain
specific heuristic is devised for aspect-level sentiment classification of movie reviews. This scheme
locates the opinionated text around the desired aspect/feature in a review and computes its sentiment
orientation. For a movie, this is done for all the reviews. The sentiment scores on a particular aspect
from all the reviews are then aggregated. This process is carried out for all aspects under consideration.
Finally, a summarized sentiment profile of the movie on all aspects is presented in an easy to visualize
and understandable pictorial form. The steps are described below.
A.Document-level Sentiment Classification

The document-level sentiment classification attempts to classify the entire document (such as one review)
into the "positive" or the "negative" class. The approaches based on SentiWordNet target the term profile
of the review document and extract terms having the desired POS label (such as adjectives, adverbs or
verbs). The review text must therefore be passed through a POS tagger, which tags each term occurring
in it, before the SentiWordNet based formulation is applied. Selected terms (with the desired POS tag)
are then extracted and the sentiment score of each extracted term is obtained from the SentiWordNet
library. The scores for all extracted terms in a review are then aggregated using some weightage and
aggregation scheme. Thus two key issues are to decide (a) which POS tags should be extracted, and
(b) how to weight the scores of the different POS tags when computing the aggregate score.
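A minimal sketch of this aggregation is given below. The scores stand in for SentiWordNet lookups and are invented for illustration, as are the POS weights, the tag names and the rule that a non-negative total counts as positive; the input is assumed to be already POS-tagged.

```python
# Hypothetical term scores standing in for SentiWordNet (positive minus negative score).
TOY_SCORES = {"good": 0.6, "bad": -0.6, "boring": -0.5, "really": 0.2, "brilliant": 0.8}

# Hypothetical weightage of each extracted POS tag.
POS_WEIGHTS = {"ADJ": 1.0, "ADV": 0.5}

def document_score(tagged_terms):
    """Aggregate weighted scores of the terms whose POS tag is selected."""
    total = 0.0
    for term, pos in tagged_terms:
        if pos in POS_WEIGHTS and term in TOY_SCORES:
            total += POS_WEIGHTS[pos] * TOY_SCORES[term]
    return total

def classify(tagged_terms):
    """Label the document by the sign of its aggregate score (assumed threshold of 0)."""
    return "positive" if document_score(tagged_terms) >= 0 else "negative"

review_a = [("movie", "NOUN"), ("really", "ADV"), ("good", "ADJ")]
review_b = [("plot", "NOUN"), ("boring", "ADJ")]
print(classify(review_a), classify(review_b))
```

Changing `POS_WEIGHTS` is where issues (a) and (b) show up: adding a tag decides what is extracted, and its value decides how much it counts.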
B. Aspect-level Sentiment Analysis

The aspect-level sentiment analysis makes it possible to analyze the positive and negative aspects of an
item. However, this kind of analysis is often domain specific. The aspect-level sentiment analysis involves
the following: (a) identifying which aspects are to be analyzed, (b) locating the opinionated content about
that aspect in the review, and (c) determining the sentiment polarity of the views expressed about an
aspect. The first step was to identify which aspects are worth considering in the movie domain. Since a
particular aspect is expressed by users with different words (such as screenplay, screen presence, acting),
an aspect-vector was created for all aspects under consideration. The sentiment analysis around aspects
thus first locates opinionated content about an aspect in a review and then uses the SentiWordNet based
approach to compute its sentiment polarity. The SentiWordNet (AAC) scheme is used for this purpose.
When an aspect indicating term (a term that belongs to the aspect vector created in the beginning) is
located, the method first looks up to 5-gram backward for adjectives or adverb+adjective combinations.
If no such term is found, it searches up to 5-gram forward for their occurrence. In both cases the lookup
terminates at 5-gram or at the sentence boundary, whichever is encountered first. The sentiment polarity
of these terms is then computed using the SentiWordNet based formulation for AAC.
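The backward-then-forward lookup can be sketched as below. This is a simplified reading of the heuristic: "up to 5-gram" is taken to mean a window of at most five tokens, only plain adjectives are collected (adverb+adjective combinations are left out), and the aspect vector, tag names and function name are hypothetical. The input is one POS-tagged sentence, so the ends of the list act as the sentence boundary.

```python
ASPECT_TERMS = {"acting", "screenplay", "music"}  # hypothetical aspect vector
WINDOW = 5  # lookup distance in tokens

def find_opinion_terms(tagged_sentence):
    """For each aspect term, collect adjectives in the 5 tokens before it,
    falling back to the 5 tokens after it when none are found."""
    results = {}
    for i, (word, _) in enumerate(tagged_sentence):
        if word not in ASPECT_TERMS:
            continue
        backward = tagged_sentence[max(0, i - WINDOW):i]
        found = [w for w, pos in backward if pos == "ADJ"]
        if not found:
            forward = tagged_sentence[i + 1:i + 1 + WINDOW]
            found = [w for w, pos in forward if pos == "ADJ"]
        results[word] = found
    return results

after = find_opinion_terms(
    [("the", "DET"), ("acting", "NOUN"), ("was", "VERB"), ("superb", "ADJ")]
)
before = find_opinion_terms([("brilliant", "ADJ"), ("music", "NOUN")])
print(after, before)
```

The adjectives returned for each aspect term would then be scored with the SentiWordNet based formulation.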
C.Collecting Datasets
Ten reviews each for 100 movies were collected from the popular movie review database website
www.imdb.com [15]. All the reviews were labelled manually to evaluate the performance of the algorithmic
formulations. Out of the 1000 movie reviews collected, 760 are labelled positive and 240 are labelled negative.
D.Performance Evaluation
In order to evaluate the accuracy and performance of the algorithmic formulations, the standard
performance metrics of Accuracy, Precision, Recall and F-measure were computed.
The measure of Accuracy A used by us is:
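As a hedged illustration, the conventional confusion-matrix definitions of these four metrics are sketched below. They are assumed here as the usual textbook formulas and are not claimed to be the exact formulation used in the paper; the function name and the toy labels are hypothetical.

```python
def evaluate(actual, predicted, positive="positive"):
    """Return accuracy, precision, recall and F-measure for a two-class task."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure

actual = ["positive", "positive", "negative", "negative"]
predicted = ["positive", "negative", "negative", "positive"]
metrics = evaluate(actual, predicted)
print(metrics)
```

With a class imbalance such as 760 positive to 240 negative reviews, precision and recall are worth reading alongside accuracy, since a classifier that always answers "positive" would already score a high accuracy.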
2.7 Improvement of Sentiment Analysis based on Clustering of Word2Vec Features

[17]
Since the introduction of Word2Vec by Mikolov et al. to discover semantic relations between words, its
vectors have been used as features for several text classification tasks. The high dimensional nature of
the Word2Vec features increases the complexity for the classifier. Several feature extraction methods
can be applied in order to reduce the dimension of the Word2Vec features.
In this paper, a method of constructing the feature set is proposed to reduce the dimension of the
Word2Vec features for sentiment analysis. In particular, the terms in a vocabulary are clustered around
opinion words in order to distribute them based on polarity. It is hypothesized that such a method will
improve the effectiveness of sentiment classification of text.
Figure 2.11. Work flow
The vector representation of a corpus is obtained by using the Skip-gram technique of Word2Vec
[31] to calculate the probability distribution of terms. The terms in the vocabulary are clustered based
on their polarity in the distribution space. For each word in the dictionary that also appears in the
vocabulary, the associated Word2Vec vector from the earlier step is extracted. To construct the clusters
of terms in the vocabulary, the similarity is calculated between each term in the vocabulary and every
word in the sentiment lexical dictionary, which are selected as the centroids of the clusters.
Specifically, cosine similarity is used to measure the similarity between two vectors. Each term is
assigned to the cluster whose centroid is the most similar to it. As a result, the terms in the
vocabulary are clustered around the opinion words in the dictionary. In sentiment analysis, reviews
or comments need to be classified by polarity, positive or negative. Typically, the high
dimensional vectors of Word2Vec are used as the features for the classification techniques.
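The cluster assignment step described above can be sketched as follows. This is not the paper's algorithm listing, only an illustration of assigning each term to its most similar centroid by cosine similarity; the two-dimensional vectors are made up in place of real Word2Vec output, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def assign_clusters(vocab_vectors, centroid_vectors):
    """Assign every vocabulary term to the opinion word (centroid) it is most similar to."""
    clusters = {c: [] for c in centroid_vectors}
    for term, vec in vocab_vectors.items():
        best = max(centroid_vectors,
                   key=lambda c: cosine_similarity(vec, centroid_vectors[c]))
        clusters[best].append(term)
    return clusters

# Toy vectors standing in for Word2Vec embeddings.
centroids = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
vocab = {"great": [0.9, 0.1], "awful": [0.2, 0.8]}
clusters = assign_clusters(vocab, centroids)
print(clusters)
```

Each document can then be described by the clusters its terms fall into, which gives far fewer features than the raw Word2Vec dimensions.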
The algorithm for the proposed method is given as:

2.7.2 Summary
A method is proposed to reduce the size of the Word2Vec feature set for sentiment analysis. The method
constructs clusters of terms centered on a set of opinion words from a sentiment lexical dictionary. A
simple transformation is applied to the negative term vectors to redistribute the terms in the space based
on their polarity. A much smaller matrix of document vectors is produced based on the set of clusters.
Two classifiers, namely Logistic Regression and Support Vector Machine (SVM), are used to compare the
performance of the different feature sets for sentiment analysis. The performance of the proposed method
is encouraging, showing that it can be more effective and efficient than the baseline.
In the future, more investigation will be performed on the Word2Vec in terms of the perplexity

Chapter 3
Scope of the work
Sentiment analysis is tough because the same topic can be expressed in different ways. Also, words used
to express a positive sentiment in one statement can be negative in another. Movie reviews posted on the
internet follow no structured grammar, and ways of expressing opinions on a topic are never standardized;
one person's appreciation may differ from another's.
The problem statement on Sentiment Analysis of Movie Reviews has been described as:
• Extracting Sentiment Words
It is the heart of sentiment analysis; every review statement contains sentiment words, which make
a major contribution in determining the polarity of the review. For example, in "The movie was good
and interesting", the sentiment words good and interesting tell us that the polarity of the review
is positive.
• Sarcasm
It is really difficult to know the tone of the author in textual sentences; we cannot definitely say
whether bad means bad or good. For example, "The movie was supposed to be hilarious?"
• Parsing
What does the verb and/or adjective of a subjective or objective textual sentence really refer to?
• Scaling
What is the quantity of data input as a proportion of the total universe of users? 10% of the IMDB
corpus gives you a rough idea of what is going on, but the results are nowhere close to the resolution
you get with 50% of the reviews.

Chapter 4
Conclusion
Using the driving factors we were able to extract the most important aspects for a particular dataset
under consideration. Thus, by using this methodology, we can identify important aspects across various
datasets and across various genres. Using this knowledge, we might try to develop a fine grained
recommendation system which recommends movies to the user based not only on ratings, but also on the
aspects of the movies the user likes. Also, instead of using it on movie reviews, we can use it on customer
reviews for business analytics and marketing of a product. Customers give their opinion about various
product aspects like performance, usability, cost, build quality etc., thus creating a domain for
driving factor usage. The research can have a considerable application on reviews in Indian languages
too. Further study is needed for a feasible application of this model to Indian languages. We have only
used the concepts of negation, intensifiers and genre specific lexicon to induce some knowledge about
inter-word dependencies in the algorithm. Various other techniques like dependency trees and clause based
scoring can be used for further detailed analysis.
Sentiment analysis on movie reviews is a challenging task because of the presence of plot descriptions
within the reviews. There was a considerable improvement in performance by eliminating those portions
of the review that describe the story line, thus enabling the sentiment classification algorithm to focus
on the relevant opinionated sentences. The study also introduced a novel set of features based on the
frequent N-grams used by authors to express their feelings, in addition to a set of lexicon based features.
Elimination of plot and the reduced feature set made the proposed sentiment analysis system efficient in
terms of time and cost. More features could be incorporated into the feature set because of its small size.
Features based on deep learning techniques could be adopted in future to further improve the sentiment
classification results. Similarly, new context based features can also be added to the feature list. Context
is extracted in this method using the concept of N-grams.

References
[1] Vijay Parkhe, Bhaskar Bhiswas, "Genre Specific Aspect Based Sentiment Analysis of Movie Reviews"
[2] Qing Cao, Wenjing Duan and Qiwei Gan, "Exploring determinants of voting for the "helpfulness"
of online user reviews: A text mining approach"
[3] Rasika Wankhede and Prof. A.N. Thakare, "Design Approach for Accuracy in Movies Reviews Using
Sentiment Analysis"
[4] Sruthi S, Reshma Sheik and Ansamma John, "Reduced Feature Based Sentiment Analysis on Movie
Reviews Using Key Terms"
[5] V.K. Singh, R. Piryani, A. Uddin, P. Waila, "Sentiment analysis of movie reviews: A new feature-based
heuristic for aspect-level sentiment classification"
[6] Tirath Prasad Sahu and Sanjeev Ahuja, "Sentiment Analysis of Movie Reviews: A study on Feature
Selection & Classification Algorithms"
[7] Pang, B., Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization
with respect to rating scales. In Annual meeting-association for computational linguistics (Vol. 43, p.
115).
[8] https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
[9] https://nlp.stanford.edu/courses/cs224n/2012/reports/WuJean_PaoYuanyuan_224nReport.pdf
[10] http://www.expertsystem.com/natural-language-process-semantic-analysis-definition/
[11] https://machinelearningmastery.com/deep-learning-bag-of-words-model-sentiment-analysis/
[12] https://www.scribd.com/document/252659877/Sentiment-Analysis-of-Rotten-Tomatoes-for-Box-Office-Revenue-Prediction
[13] https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews
[14] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., Lin, C.-J. (2008, June). Liblinear: A library
for large linear classification. J. Mach. Learn. Res., 9, 1871-1874.
[15] Peter D. Turney, Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised
Classification of Reviews, Proceedings of the 40th Annual Meeting of the Association for Computa-
tional Linguistics (ACL), Philadelphia, July 2002, pp. 417-424.
[16] E. Nikolaidis, C. Sabo, J. A. R. Marshal, and A. Reina, Characterisation and upgrade of the commu-
nication between overhead controllers and Kilobots, White Rose Research Online, Tech. Rep., April
2017.
[17] Eissa M. Alshari, Azreen Azman, Shyamala Doraisamy, Norwati Mustapha and Mustafa Alkeshr,
"Improvement of Sentiment Analysis based on Clustering of Word2Vec Features"
