Sentimental Analysis on Web Scraping Using
net/publication/362540336
All content following this page was uploaded by Dr. Yusuf Perwej on 07 August 2022.
Abstract — Most consumers look to online reviews when deciding which e-commerce services or items to purchase.
Unfortunately, the main problem with these reviews, which is not properly addressed, is that there are dishonest reviews. The
novel feature of the proposed approach is the application of opinion mining on customer evaluations to help companies and
organizations continually enhance their marketing strategies and obtain a comprehensive analysis of consumer perceptions of
their products and brands. In this study, the long short-term memory (LSTM) and deep learning convolutional neural network
integrated with LSTM (CNN-LSTM) models were used to analyze the sentiment of reviews in the e-commerce market. Data
preprocessing techniques including lowercase processing, stop word removal, punctuation removal, and tokenization were used
for data purification. Clean data was processed using the LSTM and CNN-LSTM models to identify and classify the customers'
sentiment as either positive or negative. The best accuracy achieved by our model was 97.8%.
Index Terms — Long Short Term Memory (LSTM), Convolutional Neural Network (CNN), Sentiment Analysis, Machine Learning,
Pre-processing, Tokenization.
1. Introduction

The ability of technology to transport information quickly characterises the digital revolution,
often known as the data age [1]. In recent years, government agencies [2] and corporate trade
organisations have put a lot of effort into learning what their target customers think. But more
and more individuals are freely sharing their thoughts online. The availability of historical data
online today makes obtaining it straightforward. Individual opinions can be expressed by
participants verbally or via the use of evaluations or reviews. Furthermore, information provided
by a person is more trustworthy than information provided by a producer [3]. Therefore, in order
to market a product or develop new ones, it is essential to know what the customer likes and
dislikes about it. Most companies or producers rely on consumer input to increase sales. Product
reviews posted online work exactly like word-of-mouth recommendations from customers. The study
of sentiment is essential in the field of finance. The aim of sentiment analysis is to determine
the direction of the emotions suggested by the text. The term "Subjectivity Analysis" is a
by-product of opinion mining. The study of emotions, sentiments, views, behaviours, and
assessments is known as subjectivity analysis [4]. An opinion is, by definition, a judgement or a
belief that lacks support from facts or evidence. The attitude expressed in a review of a specific
product is rarely expressly favourable or negative; instead, consumers typically have a mixed
opinion about numerous qualities, some of which are positive and some of which are negative.
Consider the review "Despite the poor battery life, I appreciate Micromax's multimedia features"
[5]. This sentence expresses conflicting feelings. This study suggests an approach that uses
dependency parsing to identify relationships or patterns and the views which correspond to them,
representing those relationships as a graph comprising features and opinions. To retain only the
opinion expressions that are most closely connected to the target (user-specified) feature,
clustering is performed on the graph [6]. The other expressions are pruned. Compared to the
baseline, the approach was highly accurate in every domain. Compared with cutting-edge systems, it
attains a comparable level of accuracy with limited data [7]. Our model achieved the highest
accuracy compared to the other models.

The remainder of this paper is structured in the following manner. In section II, the related
work is discussed. In sections III and IV, we discuss Sentiment Analysis and Web Scraping. The
methodologies are presented in section V. We present the results and discussion in section VI.
Section VII brings this paper to a close.

2. Related Work

A software agent that simulates the human experience of Web surfing interaction is used in the web
scraping process, which includes the systematic extraction and combination of content from the
Internet [8]. From an operational perspective, it might be viewed as a manual task, which means
copying and pasting the data. The distinction in this case is that a virtual computer agent is
used to accomplish this operation automatically [9]. It might also be viewed as an extraction
procedure that converts unstructured Internet material into a structured format that is simple to
read and utilize for various studies.

The web scraper uses a request to get access to the website, and then uses the HTML code to discover specified
components, extract them, and store them in a structured manner using a data frame. This method
of extracting unstructured material from websites can be used in a variety of contexts. It may be
used, for instance, to compare product pricing across several websites, to remove advertisements
from connected pages, to address issues with inadequate statistical data, or to supplement
official datasets with additional data [10]. Web scraping may also be utilized in the human
resources field to locate open positions on various websites and categorize them using Naive
Bayes algorithms.

Another work on sentiment analysis for big data was done by Sharma et al. They endeavored to
present a summary of the most recent developments in opinion mining. Although difficulties
relating to unstructured data remain, they found that sentiment analysis has grown to be quite
popular in this research area and noted that a lot of research has already been done.
Furthermore, they claim that WordNet is the most frequently used lexicon source and that
supervised rather than dictionary-based procedures yield more accurate findings [11]. An ensemble
framework for sentiment classification, developed by merging several feature sets and
classification methodologies, was employed by Xia et al. [12]. They employed three base
classifiers (Naive Bayes, Maximum Entropy, and Support Vector Machines) and two kinds of feature
sets (part-of-speech information and word relations) in their research. For sentiment
classification, they used ensemble techniques such as fixed combination, weighted combination,
and meta-classifier combination and obtained improved accuracy. Text with a subjective context
performs better in sentiment analysis than text with a merely objective context. This is because,
when a text's context or perspective is objective, the text often states commonplace assertions
or facts without evoking any sentiment [13].

Additionally, Dandannavar and Mangalwede offered an overview of several methods for sentiment
analysis that use textual data. They analyzed the benefits and drawbacks of each method in their
study, and they came to the conclusion that several approaches could be used to do sentiment
analysis, some of which were based on a lexicon, some on training sets, and some of which
employed both. The approaches that were examined are domain-specific, and the majority of them
focus on English and Chinese. As a result, relatively few investigations on sentiment
categorization for other languages have been done [14].

[...] more synonyms and potential antonyms. The following iteration is then started when the new
words are added to the seed list. When no further words can be identified, the entire process
comes to an end.

3. Sentiment Analysis

The process of determining whether a block of text is positive, negative or neutral is known as
sentiment analysis. Sentiment analysis is the contextual mining of words to determine the social
sentiment of a brand and to assist businesses in determining whether or not the product they are
producing will find a market [18]. Sentiment analysis's objective is to examine public opinion in
a way that will support business growth. It emphasizes polarity (positive, negative, and neutral)
as well as emotions (happy, sad, angry, etc.). Sentiment analysis aids data analysts at major
corporations in understanding consumer experiences, doing complex market research, monitoring
brand and product reputation, and gauging public opinion. In order to provide their own clients
with insightful information, data analytics [19] firms frequently incorporate third-party
sentiment analysis APIs into their own platforms for customer experience management, social media
monitoring, and workforce analytics.

4. Web Scraping

Automated collection of web data and information is called web scraping. In essence, it is the
extraction of online data. Information retrieval, news collecting, website monitoring,
competitive marketing, and other tasks are among the applications of web scraping [20], as shown
in Figure 1. Web scraping makes it quick and straightforward to access the large quantity of
information available online. Compared to manually pulling data from websites, it is far faster
and easier. These days, web scraping is more and more common. A lot of data collection and
information extraction may be done quickly and easily using online data scraping software.
However, when individuals use the term "web scrapers," they often refer to computer programmes.
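The request, parse, and store flow described for web scrapers can be illustrated with a minimal sketch. It is not the authors' scraper: the HTML snippet, the CSS class names (`review`, `rating`, `text`), and the helper `scrape_reviews` are all hypothetical, and the download step is replaced by an inline string so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would first download this HTML
# with an HTTP request to the target site.
SAMPLE_HTML = """
<div class="review"><span class="rating">5</span><p class="text">Great screen, fast delivery.</p></div>
<div class="review"><span class="rating">2</span><p class="text">Battery life is poor.</p></div>
"""


class ReviewParser(HTMLParser):
    """Collects the rating and text of every <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.rows = []       # structured output: one dict per review
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "review":
            self.rows.append({"rating": None, "text": ""})
        elif css_class in ("rating", "text") and self.rows:
            self._field = css_class

    def handle_data(self, data):
        value = data.strip()
        if not value or self._field is None:
            return
        if self._field == "rating":
            self.rows[-1]["rating"] = int(value)
        else:
            self.rows[-1]["text"] += value

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None


def scrape_reviews(html):
    """Turn unstructured HTML into a list of structured records."""
    parser = ReviewParser()
    parser.feed(html)
    return parser.rows


rows = scrape_reviews(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dictionaries is already in a structured form; it could then be loaded into a data frame, for example with `pandas.DataFrame(rows)`, if pandas is available.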
[...] extract meaningful data. These bots can quickly retrieve enormous volumes of data by
automating this process [21]. In the digital era, when big data plays such a significant role and
is continuously updating and changing, this has obvious advantages. Web scraping has several
uses, particularly in the area of data analytics.

5. Methodology

The main contributions of the proposed work are the following.

Figure 2 The Methodology Flow Chart

5.1 Dataset and Description

To assess the proposed method, a dataset [22] in CSV file format was collected from comments on
the Amazon.com website. The collection includes reviews of televisions, video surveillance
devices, cellphones, tablets, laptops, and more, as shown in Figure 2. Several procedures are
involved in preparing the data, including lowercasing. The data contain meta-features such as the
reviewer's ID, the product ID, and the review content.

5.2 Data Preprocessing

Lower Case
All of the words in the review text are converted to lowercase.

Stopword Elimination
Common words like "the," "a," "an," "is," and "are" are examples of stop-words. Because they do
not provide any information that is crucial for the model, these words were removed from the
review's content.

Remove all punctuation
All punctuation was removed from the review texts.

Contraction Removal
Where a word is written in contracted form, it is replaced with its full form; for instance,
"when've" becomes "when have". This process is called contraction removal.

Tokenization
The sentences in the review texts were split into tokens, that is, small units of text such as
words.

Part-of-Speech Tagging
In this stage, each word in the phrase is assigned a POS [23] tag, such as "VB" for a verb, "JJ"
for an adjective, and "NN" for a noun.

Score Generation
A score was produced after sentiment analysis of the review text. The sentiment score was
calculated by comparing the dataset to the opinion lexicon [22], which contains 5,000 positive
words and 4,500 negative words, along with their associated ratings.

Word Embeddings
We created numerical vectors for each preprocessed sentence in the dataset of product reviews
using the "word embeddings" approach. To create word indices, we first turned each phrase in the
review text into a sequence. We used the Keras text tokenizer to obtain those indices [24]. We
made sure that every word received exactly one index in the tokenizer and that the vocabulary
size was updated accordingly. Then, for each word in the training and testing sets, a unique
index is created.
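The preprocessing, scoring, and indexing steps can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the stop-word list, the contraction table, and the four-word lexicon are toy stand-ins, the function names are hypothetical, and the word-index helper only mimics the idea of a tokenizer that gives each word one unique integer (it is not the Keras API). POS tagging is omitted.

```python
import re
from collections import Counter

# Small illustrative resources; a real pipeline would use much fuller lists.
STOP_WORDS = {"the", "a", "an", "is", "are"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "i've": "i have"}
# Toy stand-in for an opinion lexicon (word -> rating).
LEXICON = {"great": 1, "love": 1, "poor": -1, "bad": -1}


def preprocess(review):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stop words."""
    text = review.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)          # naive contraction removal
    text = re.sub(r"[^\w\s]", " ", text)          # punctuation removal
    tokens = text.split()                         # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


def lexicon_score(tokens):
    """Sum the lexicon ratings of the tokens; unknown words count as 0."""
    return sum(LEXICON.get(t, 0) for t in tokens)


def build_word_index(token_lists):
    """Give each word one unique index, most frequent first, starting at 1."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequences(token_lists, word_index, max_len):
    """Map tokens to indices and left-pad with 0 to a fixed length."""
    sequences = []
    for tokens in token_lists:
        ids = [word_index[t] for t in tokens if t in word_index][:max_len]
        sequences.append([0] * (max_len - len(ids)) + ids)
    return sequences


reviews = [
    "The screen is great, and I love the camera.",
    "It's bad; the battery is poor.",
]
tokens = [preprocess(r) for r in reviews]
scores = [lexicon_score(t) for t in tokens]
word_index = build_word_index(tokens)
padded = to_padded_sequences(tokens, word_index, max_len=8)
print(tokens, scores, padded, sep="\n")
```

Index 0 is kept free for padding in this sketch, so the vocabulary size seen by an embedding layer would be `len(word_index) + 1`. The score here simply adds up lexicon ratings and ignores negation, which a fuller implementation would need to handle.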
Figure 3 The CNN Model
5.3 Convolutional Layer
CNNs are among the most popular classification models. They are well-liked because they can
learn the features for each class on their own, without the need for manual feature extraction.
The convolutional layer, the primary building block of a CNN (shown in Figure 3) [25], performs
well with CSV and image data. By convolving the data with filters, the CNN develops features that
are subsequently passed on to the following layer. That layer then filters the previously created
features once more to create new features.
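The filtering step can be made concrete with a small one-dimensional sketch. It is a simplified illustration, not the model used in this study: the input values and the two filters are hand-picked assumptions (a trained CNN learns its filter weights), and the second layer filters a single feature map rather than summing over all input channels as a full convolutional layer would.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution as used in CNNs (cross-correlation, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses and set negative ones to zero."""
    return [max(0.0, v) for v in values]


def conv_layer(signal, kernels):
    """Apply several filters; each filter produces one feature map."""
    return [relu(conv1d(signal, kernel)) for kernel in kernels]


# Hypothetical input and hand-picked filters.
signal = [1.0, 2.0, 4.0, 3.0, 1.0, 0.0]
edge_filter = [-1.0, 1.0]     # responds to increases between neighbours
smooth_filter = [0.5, 0.5]    # averages neighbouring values

first_maps = conv_layer(signal, [edge_filter, smooth_filter])
# The next layer filters the features produced by the first layer once more.
second_maps = conv_layer(first_maps[1], [edge_filter])

print(first_maps)
print(second_maps)
```

Each layer shortens the sequence by the filter length minus one because no padding is applied, which is why the second-layer feature map has four values for a six-value input.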