Professional Documents
Culture Documents
Sentiment Analysis of IMDb Movie Reviews
Sentiment Analysis of IMDb Movie Reviews
Abstract:- This paper presents a sentiment analysis dataset is partitioned into training and testing sets to facilitate
project focusing on IMDb movie reviews, aimed at model training and evaluation.
classifying reviews as either positive or negative based on
their textual content. Utilizing a dataset of 50,000 IMDb II. RELATED WORK
movie reviews, sourced from Kaggle, the study addresses
the binary classification challenge by employing pre- The task of sentiment analysis, also known as opinion
processing techniques such as TF-IDF vectorization. The mining, has been extensively studied in various domains to
dataset is split into training and testing sets, with models extract and analyze sentiment or subjective information from
trained on the former and evaluated on the latter. Three text data. Maas et al. (2011) conducted a seminal work on
machine learning algorithms—Logistic Regression, sentiment analysis by proposing a method for learning word
Random Forest, and Decision Tree—are implemented vectors to perform sentiment classification (Maas et al. 2011).
and compared using performance metrics including Kiritchenko et al. (2014) focused on sentiment analysis of
precision, recall, and F1-score. Results indicate that short informal texts, addressing the challenges of sentiment
Logistic Regression outperforms other models in classification in informal language (Kiritchenko et al. 2014).
sentiment analysis classification. The report concludes by Poria et al. (2016) explored aspect extraction for opinion
highlighting the project’s contributions and suggesting mining using deep convolutional neural networks,
avenues for future research, emphasizing the potential contributing to the advancement of sentiment analysis
benefits of expanding sentiment types and dataset size. techniques (Poria et al. 2016). Rodr´ıguez-Fernandez´ and
Ortega (2017) analyzed the factors influencing sentiment
I. INTRODUCTION analysis accuracy, shedding light on the challenges and
opportunities in this field (Rodr´ıguez-Fernandez and Ortega
Sentiment analysis, an integral task in natural language 2017).´
processing, plays a pivotal role in discerning the polarity of
textual content, thereby facilitating informed decision- In addition to sentiment analysis, research has
making in various domains (Maas et al. 2011; Kiritchenko et investigated the impact of social question-and-answer (Q&A)
al. 2014; Poria et al. 2016; Rodr´ıguez-Fernandez and Ortega sites on knowledge sharing in various communities. Vasilescu
2017). In this context, this´ paper delves into the realm of et al. (2015) examined how social Q&A sites are changing
sentiment analysis through the lens of IMDb movie reviews. knowledge sharing dynamics in open source software
The objective is to classify reviews as either positive or communities, highlighting the role of social platforms in
negative, leveraging a dataset comprising 50,000 labeled facilitating knowledge exchange (Vasilescu et al. 2015).
movie reviews sourced from Kaggle. By employing machine
learning techniques, particularly logistic regression, random III. DATASET DESCRIPTION
forest, and decision tree algorithms, this study endeavors to
unravel the sentiment embedded within the textual corpus. The dataset utilized in this study comprises 50,000
IMDb movie reviews, sourced from Kaggle, a popular
The significance of sentiment analysis in the realm of platform for hosting datasets and machine learning
movie reviews lies in its potential to offer valuable insights to competitions. Each movie review within the dataset is labeled
filmmakers, production houses, and audiences alike. with its corresponding sentiment, categorized as either
Understanding audience sentiment towards a particular positive or negative. This binary classification scheme
movie not only aids filmmakers in gauging audience simplifies the sentiment analysis task, enabling the
reception but also empowers viewers to make informed classification of reviews based on their polarity.
choices regarding their cinematic preferences. Therefore, this
paper seeks to contribute to the burgeoning field of sentiment The IMDb movie review dataset offers a diverse and
analysis by addressing the specific nuances inherent in movie comprehensive collection of textual data, encompassing a
reviews. wide range of movie genres, release dates, and viewer
preferences. With an equal distribution of 25,000 positive and
Structured into several sections, this paper begins with 25,000 negative reviews, the dataset provides a balanced
an overview of the dataset utilized, delineating its representation of sentiment polarity, thereby minimizing bias
characteristics and the binary classification problem it poses. during model training and evaluation.
Subsequently, the pre-processing steps involved in preparing
the textual data for analysis are elucidated, including TF-IDF
vectorization and data segmentation. Following this, the
Each movie review in the dataset is represented as a text Exploratory Data Analysis (EDA)
document, typically containing multiple sentences expressing A preliminary analysis of the dataset is conducted to
the reviewer’s opinion, critique, or evaluation of a particular understand the distribution of classes (positive and negative
film. The reviews vary in length, tone, and language style, sentiments) using visualizations, specifically a count plot.
reflecting the diversity of opinions prevalent among IMDb This step provides insights into the balance of classes within
users. the dataset.
B. Dataset Preprocessing
Before training the models, the dataset undergoes
several preprocessing steps. This includes removing any
irrelevant characters, such as punctuation marks and special
symbols, and converting the text data to lowercase to ensure
uniformity.
C. TF-IDF Vectorization
The TF-IDF (Term Frequency-Inverse Document
Frequency) vectorization technique is used to convert the
Fig 2 First Few Rows of the Dataset preprocessed text data into numerical features. TF-IDF
assigns each term in the document a weight that reflects its
IV. METHODOLOGY importance in the document relative to the entire corpus. TF-
IDF vectorization involves the following steps:
Data Loading and Preparation
The IMDb movie review dataset is loaded from a CSV Term Frequency (TF) Calculation:
file obtained from Kaggle. The dataset contains two columns: The frequency of each term in the document is
’review’ and ’sentiment’, representing the textual content of computed. TF is calculated using the formula:
the reviews and their corresponding sentiment labels (positive
or negative), respectively.
TF-IDF Calculation:
The TF-IDF score for each term in the document is
computed as the product of its TF and IDF scores:
TF-IDF vectorization is widely used in text mining and Where β0,β1,β2,...,βn are the model parameters learned
natural language processing tasks due to its ability to capture during training.
the semantic meaning of text data and identify important
features for analysis. Random Forest
Random forest is an ensemble learning method that
D. Model Training constructs multiple decision trees during training and outputs
In this subsection, we describe the training process of the mode of the classes for classification. Each decision tree
the machine learning models used for sentiment analysis of is trained on a bootstrap sample of the training data, and a
IMDb movie reviews. We employ logistic regression, random random subset of features is considered at each split to
forest, and decision tree algorithms for classification tasks. promote diversity among the trees. The final prediction of the