

Sentiment Analysis: Finding Preferred Genres Using the IMDb Rating System

Project report submitted to VIT Business School, Vellore

In partial fulfilment of the requirements for the degree of


Submitted by
Ganapathy B (19MBA0036)

Submitted to
Under the guidance of
Prof Dr. Shiva Kumar



October 2020
Abstract: This paper aims to find the most preferred film genres in different languages
using an average rating system. Secondary data on movie ratings were collected from the
IMDb data set. The collected data were separated by language and genre, and average
ratings were calculated in an Excel worksheet. From the prepared dataset, the most
preferred movie genre in each language was identified.
Keywords: sentiment analysis, preferred genres, average ratings, IMDb dataset


The Indian film industry is known for the many languages and genres in which its films are
made. Each Indian state carries different emotions, and its films clearly exhibit them.
Different genres and styles of film making create variety within each language's film
industry. Platforms like IMDb, Rotten Tomatoes and social media sites encourage individuals
to post opinions, which helps explain why a viewer does not like a film or genre. Most new
or upcoming film makers concentrate on what they want to present and do not consider viewer
sentiment. This approach can fail them even when their stories are interesting. Sentiment
analysis (opinion mining) is a means of finding out what viewers want.

Age affects genre preference. For example, older viewers prefer family drama, while younger
viewers prefer crime, thriller and romance films. Language is another important factor in
film viewing. A commercial film will not necessarily impress every audience. Many workers
are involved in and dependent on film making, so research and audience understanding are
very important. Opinion mining, like a voting system, helps in understanding the majority
view. Portals like IMDb let audiences rate films, which helps in marketing a film, as
viewers prefer watching films with higher ratings and good

Sentiment analysis (opinion mining) is necessary to understand viewer genre preference
region-wise. This analysis cannot be done with reviews and comments alone; data collected
from the past to the present are needed to give a qualitative result. The main objective of
this study is to find the impact of genre and language on film viewing over the last five
years, from 01/01/2015 to 31/12/2019. The IMDb platform has data on viewer votes, posted
reviews, and opinions from magazines, social media and other reviews. Based on these data,
IMDb produces movie ratings.

Sentiment Analysis
Sentiment analysis helps distinguish opinions from different data sources. Most sentiment
analysis studies now focus on social media sources such as IMDb, Facebook, Twitter and
reviews. Performing phrase-level analysis of movie reviews is a challenging task, so this
study uses movie ratings to find the preferred genre. Ratings are a kind of opinion: people
vote their opinion based on reviews, and after some tools are applied a rating is given for
the particular movie. Sentiment analysis aims to determine a person’s attitude towards a
topic, or the contextual polarity of data. Sentiment is defined as a kind of emotion, and
sentiment analysis represents opinion in text or data form.


Natasha Suri and Prof. Toran Verma surveyed machine learning and lexicon-based approaches
for multilingual sentiment analysis, where statements are composed using more than one
language. In the absence of a clear grammatical structure, it is hard to discover the
correct sentiment in such statements. Reviewing various techniques, they found that
translation software can help analyse multilingual sentiment more effectively.

Tirath Prasad Sahu and Sanjeev Ahuja proposed an approach for sentiment analysis based on
the IMDb movie review database to classify the polarity of a movie review on a scale of
0 (highly disliked) to 4 (highly liked). They followed a lexical approach using
SentiWordNet to determine the overall polarity of the review and studied the features that
affect the sentiment score of the review text. They also used classification algorithms to
evaluate the performance and accuracy of the approach and found that, among six
classification techniques, Random Forest gave the highest accuracy, 88.95%.

Bhattacharjee, B., Sridhar, A. and Dutta, A. studied the causal relationship between the
social media content of a Bollywood movie and its box-office success. The aim was to
understand whether the polarity of social media content about a movie can reveal insights
about its potential box-office revenue. They collected data from social media, used a text
mining approach to identify sentiments about a movie, and analysed the relationship between
those sentiments and total revenue in both pre-release and post-release scenarios. Linear
regression models were built on that basis. They conclude that the findings are expected to
be very useful for both management practitioners and academicians.

Sujata Rani and Parteek Kumar presented a systematic review of sentiment analysis in
general and in Indian languages specifically. The review is organised by techniques,
domains, sentiment levels and classes. This work assists in finding available resources
such as annotated datasets and pre-processing, linguistic and lexical resources in Indian
languages for sentiment analysis. They conclude that this survey can help researchers build
effective SA for their own Indian language by using the different methods and techniques
used by other researchers which can help in benefits of

Gangula Rama Rohit Reddy and Radhika Mamidi based their work on the view that sentiment
analysis becomes challenging for low-resource languages. The objective of their paper is
resource creation towards automated sentiment analysis in Telugu (a low-resource language)
and the integration of multiple domain sources to enhance sentiment prediction. They
developed a process for creating corpora and assigning polarities to them, then trained
classifiers that yielded good classification results. Using sentiment data from different
domains of this corpus, they tested the performance of sentiment analysis models built from
a single data source for both in-domain and cross-domain classification. Finally, they
compared all three approaches on model performance and found that generalized
(multi-source) sentiment classification yields better results than in-domain and
cross-domain classification.

Elshrif Elmurngi and Abdelouahed Gherbi proposed a method for detecting fake positive and
fake negative reviews among opinion reviews. The objective of their paper is to classify
movie reviews into positive or negative polarity using machine learning algorithms. They
analysed online movie reviews using SA methods in order to detect fake reviews, and
compared five supervised machine learning algorithms for sentiment classification: Naive
Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbours (KNN-IBK), KStar (K*) and
Decision Tree (DT-J48). They measured the accuracy, precision, recall and F-measure of each
algorithm to determine which is the most accurate. Furthermore, they were able to detect
fake positive and fake negative reviews through detection processes.

Akanksha Madan Thorat and R. Vishnu Priya studied sentiment analysis of movie reviews using
text mining. They used a feature selection technique to select the most important and most
effective words from each category in the text mining process. They suggest that, after
extracting features from movie reviews, a machine learning algorithm can be applied to find
the sentiment or opinion related to the movie.

Ang (Carl) Li designed a sentiment analysis for IMDb movie reviews. The main objective of
the paper is to introduce a classification model for sentiment analysis in which context
information participates in the feature space. The data were IMDb movie reviews: 1,000
instances were sampled from the full dataset and split in a 20%-70%-10% ratio into
development, cross-validation and final test sets. The work applied multiple rounds of
error analysis, including stretchy patterns, character N-grams and elimination of stop
words, together with a tuning procedure on the ridge parameter. Performance reached 84%
correctness and a kappa statistic of 0.6806, a marginal improvement over the baseline
Logistic Regression model.

Munir Ahmad, Shabib Aftab, Iftikhar Ali and Noureen Hameed reviewed hybrid tools and
techniques for sentiment analysis. The main objective of their paper is to classify the
sentiment in a given text as positive, negative or neutral using various methodologies.
Data are obtained from multiple sources and depend directly on users, who can be from any
part of the world. They discuss the hybrid approach, a combination of machine learning and
lexicon-based approaches, and conclude that it generally yields better results. Different
hybrid techniques and tools are discussed and analysed from different aspects.

Jyostna Devi Bodapati, N. Veeranjaneyulu and Shareef Shaik performed sentiment analysis of
movie reviews using LSTMs. The main objective is to understand the polarity of a given
movie review by classifying it as positively or negatively polarized. They used LSTMs, a
variant of RNNs, to predict sentiment for the task of movie review analysis. They collected
data from the IMDb benchmark dataset and concluded that the proposed LSTM-based
classification gives the best performance.

Deepa Anand and Deepan Naorem studied aspect-based sentiment analysis of movies using
reviews. Because many reviews largely describe the plot, they developed a methodology to
separate plot description from the user's views. They used two steps: first, filtering
statements from the review, and second, extracting sentiment from the remaining statements.
They found that filtering out plot sentences has a great impact on sentiment extraction
from the review.

Tun Thura Thet, Jin-Cheon Na and Christopher S.G. Khoo proposed automatic sentiment
analysis of movie reviews. By considering the grammatical dependency structure of each
clause, they determined the sentiment that independent clauses within a sentence express
towards different aspects. The prior sentiment scores of about 32,000 individual words are
derived from SentiWordNet. The approach is effective for aspect-based sentiment analysis of
short documents such as message posts on discussion boards. The accuracies of clause-level
sentiment classification for the overall movie, director, cast, story, scene and music
aspects are 75%, 86%, 83%, 80%, 90% and 81% respectively.

Kamil Topal and Gultekin Ozsoyoglu (2016) conducted a study on emotion analysis of IMDb
movie reviews to find out whether the ratings and reviews shown on a popular movie site
like IMDb influence moviegoers in deciding which movie to watch next. Using a K clustering
algorithm, they found that highly rated movies rarely show emotions at the extreme ends.
Each review is represented by four emotion dimensions, obtained using the SenticNet
database and Singular Value Decomposition. They conclude that there are other ways, such as
an Emotion Heat Map, to help find the movie to watch.

Mr. Abhishek Kesharwani and Mr. Rakesh Bharti used public opinion about movies from Twitter
data as the textual context to predict movie ratings. Their methodology consisted of tweet
collection, tweet classification and movie rating. The predicted ratings were compared with
IMDb and Rotten Tomatoes ratings, and Twitter data were found to be effective for producing
a movie's rating.

Sagar Chavan, Akash Morwal, Shivam Patanwala and Prachi Janrao applied sentiment analysis
to public reviews of movies to find the overall reaction to those movies. They implemented
their entire methodology in Python, and the final output indicated whether viewers liked or
disliked the movie.

Sandeep Ranjan and Dr. Sumesh Sood proposed a mathematical model of sentiment analysis to
predict the box-office success of a movie. Data were collected from Twitter for the
analysis. They found that a movie's hit or flop status can be predicted accurately from
Twitter sentiment, which encourages promoters to maintain positive sentiment among the
public to generate high revenue.

Viraj Parkhe and Bhaskar Biswas used driving factors to find the aspects of a movie review
that direct its polarity the most. They used an aspect-based text separator to find the
sentiment in the text. They concluded that the Movie, Acting and Plot aspects have the
highest driving factors overall, with an accuracy of 79.372% on the dataset considered.

Shriya Se, R. Vinayakumar, M. Anand Kumar and K. P. Soman proposed a method for classifying
Tamil movie reviews as positive or negative. Data were collected from different web pages,
and supervised machine learning algorithms were used to analyse the sentiment of the
reviews. They used SVM, Maxent, Decision Tree and Naive Bayes classifiers. They conclude
that SVM performs better than the other machine learning algorithms, giving an accuracy of
75.9% for classifying Tamil movie reviews.


IMDb launched online in 1990 and has been an Amazon subsidiary since 1998. It is the
world’s most popular and authoritative source for movie, TV and celebrity content. It helps
fans explore the world of movies and shows and decide what to watch. IMDb ratings are based
on viewer responses. IMDb considers viewer opinions through a voting system (registered
IMDb users can cast a vote on a scale of 1 to 10 on every released title), reviews on its
platform, magazines and other social media sources. These opinions are compared
collectively to produce the movie rating. IMDb displays a variety of opinions on a title so
users can make informed viewing decisions. It also displays the rating split so that users
can see the distribution of votes and determine how uniform or polarized the opinions are.
Users can update their votes as often as they’d like, but any new vote on the same title
overwrites the previous one, so it is one vote per title per user. IMDb takes all the
individual ratings from registered users and uses them to calculate a single rating.
Instead of the arithmetic mean (i.e. the sum of all votes divided by the number of votes),
it uses a weighted average for the rating calculation, although it displays the mean and
median votes on the votes breakdown page. The reason for using a weighted average is that,
although all votes received from users are accepted and considered, not all votes have the
same impact (or ‘weight’) on the final rating. Various filters are applied to the raw data
so that the rating reflects true opinion rather than people stuffing votes to change a
movie's rating. IMDb does not disclose the exact rating method, to keep the rating
mechanism unbiased. Top-rated movie ratings are calculated based on this system. The
formula used is:

Weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

Where: R = mean rating of the movie; v = number of votes for the movie; m = minimum
votes required to be listed; C = mean vote across the whole report
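
The weighted rating formula above can be sketched in a few lines of Python. This is only an illustration of the published formula: the function name `weighted_rating` and all input values are hypothetical, and the filters that IMDb applies to raw votes are not modelled.

```python
def weighted_rating(R: float, v: int, m: int, C: float) -> float:
    """Weighted rating WR = (v / (v + m)) * R + (m / (v + m)) * C.

    R: mean rating of the movie, v: number of votes for the movie,
    m: minimum votes required to be listed, C: mean vote across the whole report.
    """
    if v < 0 or m < 0 or v + m == 0:
        raise ValueError("v and m must be non-negative and not both zero")
    return (v / (v + m)) * R + (m / (v + m)) * C


# Hypothetical inputs chosen only to show the behaviour of the formula.
few_votes = weighted_rating(R=9.0, v=100, m=25000, C=6.5)
many_votes = weighted_rating(R=9.0, v=500000, m=25000, C=6.5)

# With few votes the result stays close to C; with many votes it approaches R.
print(round(few_votes, 2), round(many_votes, 2))
```

The design point is that a title with few votes is pulled towards the overall mean C, so a handful of extreme votes cannot dominate its rating.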

IMDB rating system

For example, for one movie, 596,026 IMDb users have given a weighted average vote of 8.1 / 10.
1. A rating scale of 1-10 is used
2. The number of votes per movie is calculated
3. The demographics of the votes are collected
4. The arithmetic mean and median values are calculated

The following is a sample representation of how the IMDb rating system works.
By considering all those aspects and the filtering process, the rating for a particular movie is
The first 5 languages, namely Tamil, Hindi, Telugu, Malayalam and Kannada, were chosen.
Language-wise and genre-wise data were collected. For each language, the data were
separated by genre and the average was taken. From this average the weighted average was
calculated. The collected data were imported into Excel, and the average of each genre was
calculated using the AVERAGE formula. The genre with the highest rating is the most
preferred one.
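
The genre-wise averaging described above was done in Excel; an equivalent calculation can be sketched in Python as follows. The rows below are hypothetical placeholders rather than the study's data, and the helper names `genre_averages` and `preferred_genre` are introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical (language, genre, rating) rows; not the data used in the study.
rows = [
    ("Tamil", "Drama", 7.0),
    ("Tamil", "Drama", 6.0),
    ("Tamil", "Action", 5.0),
    ("Tamil", "Action", 6.0),
    ("Hindi", "Drama", 8.0),
    ("Hindi", "Comedy", 6.5),
]


def genre_averages(rows):
    """Return the mean rating for each (language, genre) pair."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for language, genre, rating in rows:
        acc = sums[(language, genre)]
        acc[0] += rating
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def preferred_genre(averages, language):
    """Return the genre with the highest average rating for one language."""
    candidates = {g: avg for (lang, g), avg in averages.items() if lang == language}
    return max(candidates, key=candidates.get)


averages = genre_averages(rows)
print(preferred_genre(averages, "Tamil"))
```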

Purpose of the Study

Each language has a different genre preference according to circumstances. This study helps
newcomers to the industry and low-budget film makers do a thorough analysis of genre
preference before venturing into the film-making process. New film makers need to make
their mark with their first film itself; otherwise recognition and further opportunities
will be lean. Considering audience preference first and then working on the story helps one
set out in the right way. Low-budget film makers already have money constraints and cannot
bear heavy losses, so a failed film can be disastrous for them. A producer can invest only
if the film is likely to be a success, and sentiment analysis helps in judging this. Stories
might fail for many reasons, and some films of good quality lack appreciation because they
concentrate on story: they win awards but earn no box-office collection. Thus sentiment
analysis is an important tool for everyone, and it helps new and low-budget film makers
even more.
Data needed for the research were all collected from the IMDb benchmark dataset, as user
ratings are reflected quickly there after a film's release. The reason is simply that
audiences trust this platform and post their views and ratings without a second thought,
and the process of generating the ratings is clearly explained as being based on people's
opinions. For this reason the platform was chosen for the research work. The collected data
were then imported into an Excel sheet for further analysis. A simple Excel sheet makes the
work easier and faster compared to other tools.
Secondary data from IMDb were used. Microsoft Excel was the tool used for the calculations,
and the outcome of the calculations is presented in tables. The data were collected
language-wise and cover movie ratings from 01/01/2015 to 31/12/2019.

After the data were imported into the Excel sheet, the following steps were carried out:
I. The average was found genre-wise.
II. The weighted average was found genre-wise.

The weighted average was calculated to understand which genre is most preferred by the
audience in each language.

Data Collection

For Tamil, the data comprise 51 action, 24 romance, 56 comedy, 76 drama and 61 thriller
movies, for a total of 268 movies.

Mean 53.6
Standard Error 8.500588215
Median 56
Mode #N/A
Standard Deviation 19.0078931
Sample Variance 361.3
Kurtosis 1.791204517
Skewness -0.867208606
Range 52
Minimum 24
Maximum 76
Sum 268
Count 5
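
The descriptive statistics in these tables come from Excel; most of them can be reproduced with Python's standard `statistics` module. The sketch below uses the five Tamil genre counts stated above. The variable names are illustrative, and skewness and kurtosis are left out because the standard library does not provide them.

```python
import statistics

# Number of Tamil movies per genre, as stated in the text above.
tamil_counts = {"Action": 51, "Romance": 24, "Comedy": 56, "Drama": 76, "Thriller": 61}
values = list(tamil_counts.values())

n = len(values)
mean = statistics.mean(values)
median = statistics.median(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)      # sample standard deviation
std_error = std_dev / n ** 0.5          # standard error of the mean
value_range = max(values) - min(values)

print(f"Mean {mean}, Median {median}, Range {value_range}, Sum {sum(values)}")
print(f"Sample Variance {variance:.1f}, Standard Deviation {std_dev:.7f}")
print(f"Standard Error {std_error:.9f}")
```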

For Hindi, the data comprise 22 action, 15 romance, 30 comedy, 34 drama and 29 thriller
movies, for a total of 130 movies.

Mean 26
Standard Error 3.36154726
Median 29
Mode #N/A
Standard Deviation 7.51664818
Sample Variance 56.5
Kurtosis -0.43088731
Skewness -0.77703623
Range 19
Minimum 15
Maximum 34
Sum 130
Count 5

For Telugu, the data comprise 34 action, 18 romance, 19 comedy, 34 drama and 27 thriller
movies, for a total of 132 movies.

Mean 26.4
Standard Error 3.47275107
Median 27
Mode 34
Standard Deviation 7.76530746
Sample Variance 60.3
Kurtosis -
Skewness -
Range 16
Minimum 18
Maximum 34
Sum 132
Count 5

For Malayalam, the data comprise 20 action, 20 romance, 86 comedy, 119 drama and 68
thriller movies, for a total of 313 movies.

Mean 62.6
Standard Error 19.2187409
Median 68
Mode 20
Standard Deviation 42.974411
Sample Variance 1846.8
Kurtosis -1.7673957
Skewness 0.19823438
Range 99
Minimum 20
Maximum 119
Sum 313
Count 5

For Kannada, the data comprise 24 action, 24 romance, 7 comedy, 62 drama and 8 thriller
movies, for a total of 125 movies.

Mean 25
Standard Error 9.959919678
Median 24
Mode 24
Standard Deviation 22.27105745
Sample Variance 496
Kurtosis 2.480314289
Skewness 1.505235463
Range 55
Minimum 7
Maximum 62
Sum 125
Count 5

The result was further calculated using the Excel tool, as mentioned before.

The language-wise calculations have been presented in pie charts for further analysis and
understanding of the preferences.

The tables after calculation are as follows.

Average Ratings of Tamil Language

Genres Ratings
Action 5.347619
Comedy 5.37963
Drama 6.648684
Romantic 6.008
Thriller/Crime 6.52069

Here, by calculating the averages we found that the Drama genre has the highest average
rating, 6.648684, and the next highest is the Thriller genre with 6.52069, which indicates
that these genres are preferred over the other genres.

Average Ratings of Hindi Language

Genres Ratings
Action 7.1
Comedy 6.713333333
Drama 7.270588235
Romantic 6.50625
Thriller/Crime 7.196551724
Here, by calculating the averages we found that the Drama genre has the highest average
rating, 7.270588235, and the next highest is the Thriller genre with 7.196551724, which
indicates that these genres are preferred over the other genres.

Average Ratings of Telugu Language

Genres Average Ratings

Action 6.869697
Comedy 6.8578947
Drama 7.0852941
Romantic 6.8222222
Thriller/Crime 7.2518519

Here, by calculating the averages we found that the Thriller genre has the highest average
rating, 7.2518519, which indicates that this genre is preferred over the other genres in
this particular language.

Average Ratings of Malayalam Language

Genres Average Ratings

Action 5.847368421
Comedy 5.847368421
Drama 6.276033058
Romantic 6.385
Thriller/Crime 6.198148148

Here, by calculating the averages we found that the Romantic genre has the highest average
rating, 6.385, and the next highest is the Drama genre with 6.276033058, which indicates
that these genres are preferred over the other genres in this particular language.

Average Ratings of Kannada Language

Genres Average Ratings

Action 5.654166667
Comedy 6
Drama 7.248484848
Romantic 6.756
Thriller/Crime 7.68

Here, by calculating the averages we found that the Thriller genre has the highest average
rating, 7.68, which indicates that this genre is preferred over the other genres.
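
As a cross-check on the rankings read off the tables above, the sketch below sorts the reported averages and lists the two highest-rated genres per language. The figures are copied from the tables; the dictionary layout and the `top_genres` helper are illustrative, and the Telugu thriller/crime row is labelled "Thriller/Crime" here for consistency with the other languages.

```python
# Average ratings copied from the tables above.
AVERAGE_RATINGS = {
    "Tamil": {"Action": 5.347619, "Comedy": 5.37963, "Drama": 6.648684,
              "Romantic": 6.008, "Thriller/Crime": 6.52069},
    "Hindi": {"Action": 7.1, "Comedy": 6.713333333, "Drama": 7.270588235,
              "Romantic": 6.50625, "Thriller/Crime": 7.196551724},
    "Telugu": {"Action": 6.869697, "Comedy": 6.8578947, "Drama": 7.0852941,
               "Romantic": 6.8222222, "Thriller/Crime": 7.2518519},
    "Malayalam": {"Action": 5.847368421, "Comedy": 5.847368421, "Drama": 6.276033058,
                  "Romantic": 6.385, "Thriller/Crime": 6.198148148},
    "Kannada": {"Action": 5.654166667, "Comedy": 6.0, "Drama": 7.248484848,
                "Romantic": 6.756, "Thriller/Crime": 7.68},
}


def top_genres(ratings, k=2):
    """Return the k genres with the highest average rating, best first."""
    return sorted(ratings, key=ratings.get, reverse=True)[:k]


TOP_TWO = {language: top_genres(r) for language, r in AVERAGE_RATINGS.items()}
for language, genres in TOP_TWO.items():
    print(f"{language}: {', '.join(genres)}")
```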


Comparing the data collected from IMDb, the Tamil audience gives less preference to Action
and Comedy films, while the preference is higher for the Thriller and Drama genres. Each of
these two genres accounts for about 22% of the pie chart. Romance holds 20% of the chart,
which shows that romance is still a preference in this particular industry.

In the case of Hindi films, the Romance and Comedy genres rate lowest, at about 6.5 and
6.7, which indicates the least interest. Interest here is higher for Thriller/Crime and
Drama: the table shows ratings of about 7.20 for Thriller/Crime and 7.27 for Drama. Drama
is the genre the audience prefers most.

The Telugu audience prefers thriller and crime films, rated 7.25. Next, a fraction of a
point behind, is the Drama genre, rated 7.09. The remaining genres are rated close to one
another, with Romance the least preferred.

The Malayalam audience has Romance as its most preferred genre (6.39), followed by Drama
with a rating of 6.28. Action films have the lowest rating (5.85), a value the table also
lists for Comedy.

Kannada has a smaller number of crime thriller films, but the preference for this genre is
higher than for the other genres. Its share in the pie chart reaches 23%, with a rating of
about 7.68. The next preference is Drama, rated 7.25. This indicates the audience's
interest in crime thrillers.
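
The percentages quoted in this discussion are consistent with each genre's average rating taken as a share of the sum of the five genre averages for that language. That reading is an assumption, since the pie charts themselves are not reproduced here. A minimal sketch of the calculation for Tamil and Kannada, with an illustrative `rating_shares` helper:

```python
def rating_shares(ratings):
    """Return each genre's average rating as a percentage of the sum of all averages."""
    total = sum(ratings.values())
    return {genre: 100.0 * value / total for genre, value in ratings.items()}


# Average ratings copied from the Tamil and Kannada tables.
tamil = {"Action": 5.347619, "Comedy": 5.37963, "Drama": 6.648684,
         "Romantic": 6.008, "Thriller/Crime": 6.52069}
kannada = {"Action": 5.654166667, "Comedy": 6.0, "Drama": 7.248484848,
           "Romantic": 6.756, "Thriller/Crime": 7.68}

tamil_shares = rating_shares(tamil)
kannada_shares = rating_shares(kannada)

for genre, share in tamil_shares.items():
    print(f"Tamil {genre}: {share:.1f}%")
print(f"Kannada Thriller/Crime: {kannada_shares['Thriller/Crime']:.1f}%")
```

Because these shares are ratios of averages, they stay within a few points of 20% for every genre; the average ratings themselves are the more direct measure of preference.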


This sentiment analysis was done with data on the last 5 years of films from the film
industries of 5 languages. The ratings, as mentioned before, were taken from the IMDb
platform, whose rating polls provide extensive data on audience preference. This method of
analysis cannot predict the future, but it can be used as a metric for present analysis
when working on short-term projects. The analysis gives an idea of recent genre trends to
consider in order to impress the audience.
Also, the common perception of what a particular language's film industry produces is often
a misconception and is judgmental. Many films fail because of this misconception on the
part of film makers. This analysis would therefore help one understand audience preferences
in the short term.

[1] Bodapati, J. D., Veeranjaneyulu, N., & Shaik, S. (2019). Sentiment Analysis from Movie
Reviews Using LSTMs. Ingénierie des Systèmes d'Information, 24(1), 125-129.
[2] Li, A. C. (2019). Sentiment Analysis for IMDb Movie Review.
[3] Ahmad, M., Aftab, S., Ali, I., & Hameed, N. (2017). Hybrid tools and techniques for
sentiment analysis: A review. Int. J. Multidiscip. Sci. Eng, 8(3), 29-33.
[4] Thorat, A. M., & Priya, R. V. (2018). Sentiment analysis of movie review using text
mining. International Journal of Pure and Applied Mathematics, 119(16), 3561-3566.
[5] Gangula, R. R. R., & Mamidi, R. (2018, May). Resource creation towards automated
sentiment analysis in telugu (a low resource language) and integrating multiple domain
sources to enhance sentiment prediction. In Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018).
[6] Elmurngi, E., & Gherbi, A. (2018). Fake Reviews Detection on Movie Reviews through
Sentiment Analysis Using Supervised Learning Techniques. International Journal on
Advances in Systems and Measurements, 11(1 & 2), 196-207.
[7] Rani, S., & Kumar, P. (2019). A journey of Indian languages over sentiment analysis: a
systematic review. Artificial Intelligence Review, 52(2), 1415-1462.
[8] Suri, N., & Verma, T. (2017). Multilingual Sentimental Analysis on Twitter Dataset: A
Review. Advances in Computational Sciences and Technology, 10(9), 2789-2799.
[9] Bhattacharjee, B., Sridhar, A., & Dutta, A. (2017). Identifying the causal relationship
between social media content of a Bollywood movie and its box-office success-a text
mining approach. International Journal of Business Information Systems, 24(3), 344-368.
[10] Sahu, T. P., & Ahuja, S. (2016, January). Sentiment analysis of movie reviews: A
study on feature selection & classification algorithms. In 2016 International Conference
on Microelectronics, Computing and Communications (MicroCom) (pp. 1-6). IEEE.
[11] Parkhe, V., & Biswas, B. (2016). Sentiment analysis of movie reviews: finding most
important movie aspects using driving factors. Soft Computing, 20(9), 3373-3379.
[12] Anand, D., & Naorem, D. (2016). Semi-supervised aspect based sentiment analysis
for movies using review filtering. Procedia Computer Science, 84, 86-93.
[13] Thet, T. T., Na, J. C., & Khoo, C. S. (2010). Aspect-based sentiment analysis of
movie reviews on discussion boards. Journal of information science, 36(6), 823-848.
[14] Topal, K., & Ozsoyoglu, G. (2016, August). Movie review analysis: Emotion analysis
of IMDb movie reviews. In 2016 IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining (ASONAM) (pp. 1170-1176). IEEE.
[15] Kesharwani, A., & Bharti, R. (2017). Movie Rating Prediction Based on: Twitter
Sentiment Analysis. LAP LAMBERT Academic Publishing.
[16] Ranjan, S., & Sood, S. (2017). Online Word of Mouth Communication in Bollywood
Tweet Dataset. International Journal for Research in Applied Science & Engineering
Technology, 5(12), 1442-1449.
[17] Chavan, S., Morwal, A., Patanwala, S., & Janrao, P. (2017). Sentiment Analysis of
Movie Rating System. IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661,
p-ISSN: 2278-8727, 43-47.
[18] Se, S., Vinayakumar, R., Kumar, M. A., & Soman, K. P. (2016). Predicting the
sentimental reviews in tamil movie using machine learning algorithms. Indian Journal of
Science and Technology, 9(45).
