
An Enhanced Sentiment Analysis using Machine Learning Methods in Imbalanced Movie Review Streams

Abstract—This paper proposes a novel framework to enhance sentiment analysis algorithms, addressing the challenge of imbalanced datasets in the context of movie reviews. The significance of this research lies in its potential to interpret individuals' attitudes more accurately, an area often compromised by skewed data. The proposed framework employs the synthetic minority oversampling technique (SMOTE) and generative adversarial networks (GANs) to balance the dataset. We employ a resource-aware framework to harmonize the available computational resources with the needs of the classification process. This framework aims to create a system that remains effective and responsive to real-time data classification demands while optimizing hardware and software capabilities. The experiments used the IMDB movie review dataset to categorize real-time review streams. The model is iteratively refined based on classification outcomes, facilitating continuous enhancement. The Support Vector Classifier (SVC) and the Extra Tree algorithm excel in the native and streamed classification environments, respectively.

Index Terms—sentiment analysis, class imbalance, data streams

I. INTRODUCTION

Sentiment analysis is an essential element of natural language processing, aiming to understand and categorize emotions expressed in text, particularly in contexts such as movie reviews on platforms like IMDb. These analyses provide critical feedback for filmmakers and studios, offering valuable insights for strategic improvements. However, challenges in sentiment analysis of movie reviews include linguistic complexity, subjectivity, contextual nuances, and variations in dataset quality [1].

Previous studies have explored the accuracy and performance of different algorithms, especially in the context of Twitter data streams, offering valuable insights for dynamic analysis scenarios [2], [3]. While these works provide a solid foundation, they leave a noticeable gap in understanding the optimal mix of sampling and boosting techniques for enhanced sentiment analysis.

Our study proposes a framework that utilizes streaming data to refine sentiment analysis algorithms, specifically targeting IMDb movie reviews. We systematically evaluate various classifiers, focusing on their accuracy and overall performance. This research builds on existing literature and lays the groundwork for further advancements in sentiment analysis methodologies, ensuring a more accurate and nuanced understanding of sentiments in movie reviews and other textual content.

II. LITERATURE REVIEW

Sentiment analysis interprets and categorizes the opinions, emotions, and values expressed in text, supporting the understanding of various forms of communication, including movie reviews. Thet et al. [4] introduced a fine-grained methodology that delves into sentiment orientation and intensity across different facets of movie reviews, enabling a detailed and nuanced sentiment analysis.

A prevalent challenge in sentiment analysis is the issue of imbalanced datasets, which Dai [5] addressed by introducing the Word Embedding-based Synthetic Minority Oversampling Technique (WESMOTE). This technique enhances data balance, improving the reliability of sentiment classifiers. Krishna et al. [6] proposed a hybrid approach to assess word polarity within contexts, offering significant insights into the complex and subjective expressions prevalent in movie reviews. The same work combines Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) models in a hybrid deep learning approach that achieves remarkable precision in sentiment classification for IMDb movie reviews.

Motz et al. [7] contributed an innovative methodology for live sentiment analysis, utilizing machine learning and text processing to analyze trends and sentiments within Twitter data streams. This approach signifies a leap towards real-time sentiment analysis, though it also underscores the challenge of achieving high accuracy in such dynamic environments. Satrya et al. [8] integrated machine and ensemble learning with rule-based approaches to enhance sentiment classification accuracy, mainly focusing on complex data from Indonesian police-related Twitter content. Similarly, Sai et al. [9] conducted a comparative analysis of sentiment classification algorithms, highlighting nuanced performance variations and the influence of techniques such as Bag-of-Words and TF-IDF on algorithm efficacy.

III. METHODOLOGY

This section discusses the methodology used to develop and refine a model specifically for classifying streamed data. We utilize a resource-aware framework to balance the available computational resources with the demands of the classification process (Figure 1).
Fig. 1. Data streaming system architecture: the preprocessed dataset (with copies of the minority class) feeds model training on the training split; the trained model classifies reviews streamed over time, and the recorded predictions and confusion matrix drive parameter adjustment and re-training.

A. Data Preprocessing

Our data preprocessing incorporates oversampling techniques to address imbalances in movie review datasets and mitigate biases during training, specifically augmenting the minority class for a balanced representation of positive and negative sentiments. Integrating the Synthetic Minority Oversampling Technique (SMOTE) into our pipeline generates synthetic samples through interpolation, enhancing the model's proficiency in capturing patterns across the minority and majority classes in movie sentiment analysis (refer to Algorithm 1). The initial model is built on a randomly balanced dataset from the IMDB database, ensuring an unbiased representation of sentiments in 80% of the movie review collection. Subsequently, a streaming process deploys the initial model for sentiment analysis on randomly selected movie reviews, allowing continuous recording and integration of streaming data for periodic refinement. In response to potential imbalances in incoming data streams, our adaptive approach leverages the SMOTE oversampling technique to effectively manage imbalanced classes, ensuring a more robust and accurate sentiment analysis [1].

Algorithm 1 Oversampling a dataset with SMOTE
Require: A class-imbalanced data set S, a parameter k
  Divide S into the minority class S+ and the majority class S−
  Initialize the synthetic sample set ψ ← ∅
  for i = 1 to |S−| − |S+| do
    Select a sample x in S+ randomly
    Find the k nearest neighbors of x in S+
    Select one of the k nearest neighbors randomly, record it as y
    Generate a new sample x_new = x + δ · (y − x), where δ is a random number in [0, 1]
    Add x_new to ψ
  end for
  return the balanced dataset S ∪ ψ
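To make the oversampling step concrete, the sketch below implements Algorithm 1 directly with NumPy on a dense minority-class feature matrix. It is a minimal illustration under stated assumptions, not the exact code used in our experiments: the matrix X_min, the neighbor count k, and the number of synthetic samples are placeholders introduced only for this example.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples as in Algorithm 1.

    X_min       : (n_min, n_features) array of minority-class feature vectors
    n_synthetic : number of samples to synthesize, e.g. |S-| - |S+|
    k           : number of nearest neighbors considered per sample
    """
    rng = np.random.default_rng(rng)
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        # Pick a random minority sample x.
        x = X_min[rng.integers(len(X_min))]
        # Find its k nearest neighbors (Euclidean distance, excluding x itself).
        dists = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        y = X_min[rng.choice(neighbors)]
        # Interpolate: x_new = x + delta * (y - x), with delta drawn from U[0, 1].
        delta = rng.random()
        synthetic[i] = x + delta * (y - x)
    return synthetic

# Usage sketch: top up the minority (positive) class to match the majority.
# X_pos, X_neg = ...  # vectorized positive (minority) and negative reviews
# X_new = smote_oversample(X_pos, n_synthetic=len(X_neg) - len(X_pos))
# X_balanced = np.vstack([X_neg, X_pos, X_new])
```

In practice, an equivalent off-the-shelf SMOTE implementation (for example, the one in the imbalanced-learn package) could replace this hand-rolled version.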
B. Architecture of Data Stream

We initiate the streaming procedure by partitioning the IMDb dataset into an 80% training subset. We simulate real-time movie review streams using Algorithm 3, based on methodologies from Gaber et al. [10] and Cuzzocrea et al. [11]. The core model employs Algorithm 4 to dynamically classify incoming reviews and to continuously enhance the model with fresh, labeled streamed data. These predictions are then integrated into the training set using Algorithm 2. The iterative aspect of this approach progressively refines the model's predictive accuracy in line with the real-time data stream. This framework analyzes system resource utilization, focusing on RAM and CPU metrics, through Algorithm 5. This analysis ensures that the model classifies streamed data accurately while remaining resource-efficient.

Algorithm 2 Generating the classification model
Require: Dataset D: IMDB movie reviews
  t ← global variable: batch process duration
  Split D into training data T and testing data S
  Train M with T using a classification algorithm A
  while not end do
    if Batch == t then
      Get R from Algorithm 4
      Merge data: TR ← T + R
      Re-train M with TR using classification algorithm A
      Use M for Algorithm 4
    end if
  end while

Algorithm 3 Generating data streams
Require: Dataset D: IMDB movie reviews
  while not end do
    X ← get a random movie review from dataset D
    Remove the class label Y from X
    Feed X and Y into the classification algorithm A
  end while

Algorithm 4 Classifying data streams
Require: Model M
Input: X and Y from Algorithm 3
  while not end do
    Y′ ← predict the class label of X using M
    if Y == Y′ then
      Record R ← concat(X, Y)
      if Batch == t then
        Pass R to Algorithm 2
      end if
    end if
    Evaluate the prediction against the ground truth Y
    Calculate Accuracy using Equation 1
    Calculate Recall using Equation 2
    Calculate Precision using Equation 3
    Calculate F1 Score using Equation 4
  end while
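The interaction between Algorithms 2–4 can be summarized in a short Python sketch of the batch re-training loop. This is a simplified sketch under assumptions: the run_stream and stream_reviews helpers, the pre-fitted vectorizer, and the fixed 15-minute batch interval are illustrative choices, not the exact implementation used in this study.

```python
import random
import time

BATCH_SECONDS = 15 * 60  # assumed value of the batch duration t in Algorithm 2

def stream_reviews(dataset):
    """Algorithm 3: draw labeled (text, label) reviews at random to simulate a stream."""
    while True:
        yield random.choice(dataset)

def run_stream(model, vectorizer, train_texts, train_labels, dataset):
    """Algorithms 2 and 4: classify the stream and re-train on each batch."""
    recorded = []                               # R: correctly classified reviews
    last_retrain = time.time()
    for text, label in stream_reviews(dataset):
        pred = model.predict(vectorizer.transform([text]))[0]
        if pred == label:                       # keep fresh, labeled streamed data
            recorded.append((text, label))
        if time.time() - last_retrain >= BATCH_SECONDS and recorded:
            # Merge TR <- T + R and re-train M with the classification algorithm A.
            train_texts += [t for t, _ in recorded]
            train_labels += [l for _, l in recorded]
            model.fit(vectorizer.transform(train_texts), train_labels)
            recorded, last_retrain = [], time.time()
```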
C. Building Trained Model

We employ a variety of machine learning models to develop our sentiment analysis method and to ensure robustness and accuracy. These models include:

1) Random Forest: The Random Forest algorithm is central to our sentiment analysis technique due to its robustness and adaptability across various machine learning applications. Random Forest constructs multiple decision trees during training and aggregates their predictions to improve accuracy and control over-fitting. Each tree in the ensemble is trained on a random subset of the data, which introduces diversity in the model and enhances its generalization capabilities.

2) Support Vector Machine (SVM): We also consider the Support Vector Machine (SVM) for our sentiment analysis technique. SVM works by identifying an optimal hyperplane that distinctly separates data points of differing sentiment classes. The identification is conducted by transforming input features into higher-dimensional spaces, uncovering the complex and nuanced patterns often prevalent in text data.

3) Naive Bayes: Naive Bayes is a probabilistic classifier that employs Bayes' theorem with strong (naive) independence assumptions between features. It is particularly effective in text classification tasks, including sentiment analysis, due to its ability to calculate the likelihood of sentiments based on the features observed in the text.

4) Extra Tree: Extra Trees distinguishes itself through its unique approach to constructing decision trees. Randomly selecting thresholds for each feature split injects a higher degree of diversity into the ensemble, contributing to its robustness, reducing the likelihood of overfitting, and handling noisy data effectively.
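The four classifiers above are available off the shelf in scikit-learn; a minimal sketch of how they could be instantiated on TF-IDF features is shown below. The specific settings (library defaults, a multinomial Naive Bayes variant) are assumptions made for illustration, not the tuned configuration used in the experiments.

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# One TF-IDF + classifier pipeline per algorithm compared in this study.
models = {
    "Random Forest": make_pipeline(TfidfVectorizer(), RandomForestClassifier()),
    "SVC":           make_pipeline(TfidfVectorizer(), SVC()),
    "Naive Bayes":   make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "Extra Tree":    make_pipeline(TfidfVectorizer(), ExtraTreesClassifier()),
}

# texts, labels = ...  # balanced training reviews and their sentiment labels
# for name, pipeline in models.items():
#     pipeline.fit(texts, labels)
```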
D. Resource-aware Framework

A data stream, defined as a continuous flow of data generated, processed, and consumed over time, presents unique challenges for processing and analyzing information in real time [12]. In this research, the model is continuously re-trained every 15 minutes to update it with the results of its predictions, enabling continuous machine learning. To address these challenges, a resource-aware framework is employed to effectively manage system resources and maintain an uninterrupted detection process. The primary goal of this framework is to ensure that data is processed without disruptions, thereby optimizing the efficiency of the detection process. This involves continuous monitoring of crucial system resources such as CPU and RAM usage. By doing so, the framework provides real-time insights into resource consumption patterns, allowing for proactive identification of potential bottlenecks and the implementation of strategies to mitigate any adverse effects on the data stream processing (Algorithm 5).

Algorithm 5 Calculating RAM and CPU usage
  cpu ← global variable: batch process CPU usage
  ram ← global variable: batch process RAM usage
  while not end do
    if Batch == t then
      Re-train M with TR using classification algorithm A
      Use M for Algorithm 4
      cpu ← CPU usage
      ram ← RAM usage
    end if
  end while
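A lightweight way to realize Algorithm 5 in Python is to sample CPU and RAM usage around each re-training batch with the psutil package. The sketch below is one possible implementation under assumptions: the helper name, the sampling points, and the reporting units are chosen only for illustration.

```python
import psutil

def retrain_with_monitoring(model, X_batch, y_batch):
    """Re-train on the merged batch while recording CPU and RAM usage."""
    process = psutil.Process()                      # the current batch process
    psutil.cpu_percent(interval=None)               # prime the CPU counter
    model.fit(X_batch, y_batch)                     # re-train M with TR
    cpu = psutil.cpu_percent(interval=None)         # % CPU since the priming call
    ram = process.memory_info().rss / (1024 ** 2)   # resident memory in MiB
    return model, {"cpu_percent": cpu, "ram_mib": ram}
```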
E. Performance measurement

We evaluate classification performance by assessing accuracy, precision, recall, and F1-score based on the classification outcomes. Accuracy, the ratio of correct predictions, is calculated by dividing the sum of true positives and true negatives by the total number of observations (Equation 1). While accuracy offers an overall performance measure, it may not reveal specific issues such as a high misclassification rate for a particular class. Therefore, additional metrics like recall and F1-score become essential. Recall measures the proportion of correctly predicted positives (Equation 2), while the F1-score combines precision (Equation 3) and recall (Equation 2), providing a unified metric summarizing the model's overall performance (Equation 4).

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (1)

Recall = \frac{TP}{TP + FN}    (2)

Precision = \frac{TP}{TP + FP}    (3)

F1\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (4)

Furthermore, we compute the mean (Equation 5) and standard deviation (Equation 6) of the experimental results. The mean, or average, indicates the data's central tendency, providing insight into its typical value. Conversely, the standard deviation gauges the degree of variability or dispersion within the dataset: it assesses how widely the data points are spread relative to the mean. A higher standard deviation indicates greater scatter or dispersion of the data points, whereas a lower standard deviation signifies a closer concentration of data around the mean.

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (5)

s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}    (6)
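These metrics, together with their mean and standard deviation over repeated runs, map directly onto standard library calls. The hedged sketch below uses scikit-learn and NumPy; the repeated-runs loop, the variable names, and the weighted averaging choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Equations 1-4 for a binary positive/negative labeling."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall":    recall_score(y_true, y_pred, average="weighted"),
        "f1":        f1_score(y_true, y_pred, average="weighted"),
    }

# Equations 5 and 6 over repeated experimental runs (illustrative).
# accuracies = [evaluate(y_true, model.fit(Xtr, ytr).predict(Xte))["accuracy"]
#               for Xtr, ytr, Xte, y_true in resampled_runs]
# mean = np.mean(accuracies)
# std = np.std(accuracies, ddof=1)   # ddof=1 gives the sample formula of Eq. 6
```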
IV. EXPERIMENT AND ANALYSIS

A dedicated thread is implemented during the training and validation phases to record and monitor system resource utilization in parallel, including CPU and memory, as detailed in Algorithm 5. This parallel resource-monitoring thread enables real-time analysis of resource consumption, capturing valuable insights into the algorithms' performance characteristics.

The study employs two experimental approaches to assess the efficacy of diverse classification algorithms. The first approach involves fixed data for training and testing to gain performance insights within a controlled environment. The second approach simulates a dynamic data stream, continuously testing and updating the classification algorithms to evaluate their adaptability to evolving real-time scenarios.

A. Dataset

We utilized the IMDB Movie Ratings Sentiment Analysis dataset sourced from Kaggle, which presents a minor imbalance with 20,019 instances labeled as negative and 19,981 labeled as positive, summing to 40,000 data points, as shown in Figure 2a. After SMOTE oversampling, the data is balanced with 20,019 instances each for the negative and positive classes (Figure 2b).

B. Experiment 1: Classical model classification

We implement the balanced random sampling technique to ensure equitable representation of positive and negative sentiments within the training dataset. This approach aims to mitigate bias and enhance the model's capacity to accurately classify sentiments across both classes. Consistent with the previous experiment, we randomly select 1,000, 10,000, and 20,000 rows from a simulated data stream to generate balanced data samples for analysis. The experimental outcomes reaffirm the superior performance of Support Vector Classification (SVC) across all performance evaluations, consistent with the findings from the preceding experiment (refer to Table I, Table II, and Table III). These results underscore the robustness and efficacy of SVC in handling balanced datasets, further validating its suitability for sentiment analysis in dynamic and evolving data scenarios.
Fig. 2. Class distribution of the IMDB dataset: (a) distribution of labels without sampling; (b) class distribution after SMOTE oversampling (counts of negative and positive reviews).
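For completeness, the class-distribution check behind Figure 2 can be reproduced with a few lines of pandas. The file name and column names below are assumptions about the Kaggle export, not the exact paths used in this work, and the SMOTE call is only indicated in a comment.

```python
import pandas as pd

# Assumed file/column names for the Kaggle IMDB sentiment dataset.
reviews = pd.read_csv("movie.csv")          # columns: "text", "label" (0/1)
print(reviews["label"].value_counts())      # expected: roughly 20,019 vs 19,981

# After vectorization, SMOTE (e.g., from imbalanced-learn) tops up the minority
# class so both labels reach the majority count, as shown in Figure 2b.
# X_res, y_res = SMOTE().fit_resample(X_tfidf, reviews["label"])
# print(pd.Series(y_res).value_counts())
```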

TABLE I
RESULT OF 1000 BALANCED RANDOM SAMPLING DATA AS TRAINING DATA IN THE NATIVE CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy (mean) | F1 Score (mean) | Recall (mean) | Precision (mean) | Accuracy (std) | F1 Score (std) | Recall (std) | Precision (std)
Naive Bayes | 0.863525 | 0.863427076 | 0.86350045 | 0.86450621 | 0.003813422 | 0.003824676 | 0.003816954 | 0.003747296
Extra Tree | 0.859375 | 0.85937381 | 0.85937594 | 0.859387791 | 0.006733777 | 0.006734824 | 0.006733774 | 0.00672407
Random Forest | 0.842325 | 0.84231578 | 0.842319531 | 0.842389079 | 0.00380645 | 0.003815222 | 0.003810106 | 0.003744666
SVC | 0.90065 | 0.900640367 | 0.90065994 | 0.900829298 | 0.002868743 | 0.002870403 | 0.002866899 | 0.00284218

TABLE II
RESULT OF 10000 BALANCED RANDOM SAMPLING DATA AS TRAINING DATA IN THE NATIVE CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy (mean) | F1 Score (mean) | Recall (mean) | Precision (mean) | Accuracy (std) | F1 Score (std) | Recall (std) | Precision (std)
Naive Bayes | 0.863525 | 0.863427076 | 0.86350045 | 0.86450621 | 0.003813422 | 0.003824676 | 0.003816954 | 0.003747296
Extra Tree | 0.860975 | 0.860971289 | 0.860972304 | 0.861006681 | 0.004139558 | 0.004137167 | 0.004139132 | 0.004165179
Random Forest | 0.84515 | 0.845135646 | 0.845142327 | 0.845258638 | 0.005092427 | 0.005093073 | 0.005093051 | 0.005094881
SVC | 0.90065 | 0.900640367 | 0.90065994 | 0.900829298 | 0.002868743 | 0.002870403 | 0.002866899 | 0.00284218

TABLE III
RESULT OF 20000 BALANCED RANDOM SAMPLING DATA AS TRAINING DATA IN THE NATIVE CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy (mean) | F1 Score (mean) | Recall (mean) | Precision (mean) | Accuracy (std) | F1 Score (std) | Recall (std) | Precision (std)
Naive Bayes | 0.863525 | 0.863427076 | 0.86350045 | 0.86450621 | 0.003813422 | 0.003824676 | 0.003816954 | 0.003747296
Extra Tree | 0.860125 | 0.860123514 | 0.860123178 | 0.860134919 | 0.002489038 | 0.00248809 | 0.002488169 | 0.002497265
Random Forest | 0.843725 | 0.843717217 | 0.843720313 | 0.84377834 | 0.004385167 | 0.004396243 | 0.004389004 | 0.004304175
SVC | 0.90065 | 0.900640367 | 0.90065994 | 0.900829298 | 0.002868743 | 0.002870403 | 0.002866899 | 0.00284218

C. Experiment 2: Datastream classification

The experiment initiates by using 80% of the dataset to initialize the model, focusing on random selection for balanced representation. With a continuous influx of 1 to 100 new data points every minute, the system undergoes re-training every 15 minutes, ensuring the model adapts to the dynamic data stream for enhanced sentiment classification. Classifier performance evaluation, including the mean, standard deviation, and variance of accuracy, F1 score, recall, and precision (refer to Table IV, Table V, and Table VI), highlights the consistent superiority of the Extra Tree classifier. It exhibits superior accuracy, F1 score, recall, and precision with low standard deviation and variance, positioning it as the optimal choice for sentiment analysis on the evaluated dataset. Resource consumption, visually depicted in Figures 3a, 3b, 3c, and 3d, reveals distinct patterns for the Extra Tree and Random Forest classifiers, Naive Bayes, and SVC. The Extra Tree and Random Forest classifiers exhibit notable RAM demand during re-training, while Naive Bayes and SVC show significant CPU usage. The stability of the high CPU usage for Naive Bayes indicates higher algorithmic complexity. These consumption patterns offer valuable insights for understanding and optimizing classifiers in real-time sentiment analysis tasks.

TABLE IV
MEAN OF THE PERFORMANCE METRICS IN THE STREAMED CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy | Precision | Recall | F1-score
SVC | 0.9481777 | 0.9518184 | 0.9481777 | 0.9483182
Random Forest | 0.9681520 | 0.9702490 | 0.9681520 | 0.9681852
Naive Bayes | 0.8959449 | 0.9034277 | 0.8959449 | 0.8958279
Extra Tree | 0.9717636 | 0.9736189 | 0.9717636 | 0.9717993

TABLE V
STANDARD DEVIATION OF THE PERFORMANCE METRICS IN THE STREAMED CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy | Precision | Recall | F1-score
SVC | 0.0497305 | 0.0455899 | 0.0497305 | 0.0491354
Random Forest | 0.0365329 | 0.0341241 | 0.0365329 | 0.0368645
Naive Bayes | 0.0752644 | 0.0733693 | 0.0752644 | 0.0762665
Extra Tree | 0.0313473 | 0.0280731 | 0.0313473 | 0.0312633

TABLE VI
VARIANCE OF THE PERFORMANCE METRICS IN THE STREAMED CLASSIFICATION ENVIRONMENT

Algorithm | Accuracy | Precision | Recall | F1-score
SVC | 0.0024731 | 0.0020784 | 0.0024731 | 0.0024142
Random Forest | 0.0013346 | 0.0011644 | 0.0013346 | 0.0013589
Naive Bayes | 0.0056647 | 0.0053830 | 0.0056647 | 0.0058165
Extra Tree | 0.0009826 | 0.0007881 | 0.0009826 | 0.0009773
Fig. 3. Resource consumption in the datastream environment: (a) Extra Tree performance; (b) Random Forest classifier performance; (c) Naive Bayes performance; (d) SVC performance.

V. CONCLUSION

We developed a framework for sentiment analysis on movie review streams, utilizing a resource-aware framework and model refinement techniques. The objective is to maintain high accuracy and efficiency in sentiment analysis tasks, adapting seamlessly to the continuous influx of data and the evolving nature of streamed content. Our analysis identifies Support Vector Classification (SVC) as the leading algorithm in classical model classification. SVC has demonstrated remarkable performance across training scenarios. Its consistent superiority underscores its adaptability and robustness, making it an ideal choice for sentiment analysis tasks, irrespective of dataset size. In the streamed classification environment, the Extra Tree classifier achieves the best overall performance, confirming the framework's ability to sustain accurate classification under a continuous data stream.

REFERENCES

[1] Y. Al Amrani, M. Lazaar, and K. E. El Kadiri, "Random forest and support vector machine based hybrid approach to sentiment analysis," Procedia Computer Science, vol. 127, pp. 511–520, 2018.
[2] B. Gokulakrishnan, P. Priyanthan, T. Ragavan, N. Prasath, and A. Perera, "Opinion mining and sentiment analysis on a twitter data stream," IEEE Xplore, pp. 182–188, Dec. 2012. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/6423033/
[3] A. M. Shiddiqi, D. H. Fudholi, and A. Zahra, "A social media analytics on how people feel about their cancelled homecoming during covid-19 pandemic," in 2021 Fourth International Conference on Vocational Education and Electrical Engineering (ICVEE), 2021, pp. 1–6.
[4] T. T. Thet, J.-C. Na, and C. S. Khoo, "Aspect-based sentiment analysis of movie reviews on discussion boards," Journal of Information Science, vol. 36, pp. 823–848, Nov. 2010.
[5] H.-J. Dai and C.-K. Wang, "Classifying adverse drug reactions from imbalanced twitter data," International Journal of Medical Informatics, vol. 129, pp. 122–132, Sep. 2019.
[6] M. M. Krishna, B. Duraisamy, and J. Vankara, "Hybrid deep learning techniques for sentiment analysis on imdb datasets," in 2022 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE). IEEE, 2022, pp. 689–693.
[7] A. Motz, E. Ranta, A. S. Calderon, Q. Adam, F. Alzhouri, and D. Ebrahimi, "Live sentiment analysis using multiple machine learning and text processing algorithms," Procedia Computer Science, vol. 203, pp. 165–172, 2022.
[8] W. F. Satrya, R. Aprilliyani, and E. H. Yossy, "Sentiment analysis of indonesian police chief using multi-level ensemble model," Procedia Computer Science, vol. 216, pp. 620–629, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050922022554
[9] L. Sai, K. Yasaswi, V. Rakesh, M. Varun, M. Yeswanth, and J. S. Kiran, "Comparative study of algorithms for sentiment analysis on imdb movie reviews," Mar. 2023.
[10] M. M. Gaber and A. M. Shiddiqi, "Distributed data stream classification for wireless sensor networks," in Proceedings of the 2010 ACM Symposium on Applied Computing, ser. SAC '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 1629–1630. [Online]. Available: https://doi.org/10.1145/1774088.1774439
[11] A. Cuzzocrea, M. M. Gaber, and A. M. Shiddiqi, "Distributed classification of data streams: An adaptive technique," in Big Data Analytics and Knowledge Discovery, S. Madria and T. Hara, Eds. Cham: Springer International Publishing, 2015, pp. 296–309.
[12] A. M. Shiddiqi, E. D. Yogatama, and D. A. Navastara, "Resource-aware video streaming (ravis) framework for object detection system using deep learning algorithm," MethodsX, vol. 11, p. 102285, 2023.
