
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/362540336

Sentimental Analysis on Web Scraping Using Machine Learning Method

Article in Journal of Information and Computational Science · August 2022


DOI: 10.12733/JICS.2022/V12I08.535569.67004


4 authors, including:

Dr. Yusuf Perwej, Professor, Ambalika Institute of Management & Technology
Km Divya, Ambalika Institute of Management & Technology
Puneet Kumar Yadav, Chandigarh University


All content following this page was uploaded by Dr. Yusuf Perwej on 07 August 2022.



Journal of Information and Computational Science ISSN: 1548-7741

Sentimental Analysis on Web Scraping Using Machine Learning Method
Saurabh Sahu1*, Km Divya2, Dr. Neeta Rastogi3, Puneet Kumar Yadav4, Dr. Yusuf Perwej5
1* B.Tech Scholar, Computer Science & Engineering, Ambalika Institute of Management & Technology, Lucknow
2 Assistant Professor, Department of Computer Science & Engineering, Ambalika Institute of Management & Technology, Lucknow
3 Professor, Department of Computer Science & Engineering, Ambalika Institute of Management & Technology, Lucknow
4 Assistant Professor, Department of Computer Science & Engineering, Ambalika Institute of Management & Technology, Lucknow
5 Professor, Department of Computer Science & Engineering, Ambalika Institute of Management and Technology, Lucknow

Abstract — Most consumers look to online reviews when deciding which e-commerce services or items to purchase.
Unfortunately, the main problem with these reviews, which is not properly addressed, is that there are dishonest reviews. The
novel feature of the proposed approach is the application of opinion mining on customer evaluations to help companies and
organizations continually enhance their marketing strategies and obtain a comprehensive analysis of consumer perceptions of
their products and brands. In this study, the long short-term memory (LSTM) and deep learning convolutional neural network
integrated with LSTM (CNN-LSTM) models were used to analyze the sentiment of reviews in the e-commerce market. Data
preprocessing techniques including lowercase processing, stop word removal, punctuation removal, and tokenization were used
for data purification. Clean data was processed using the LSTM and CNN-LSTM models to identify and classify the customers'
sentiment as either positive or negative. The best accuracy achieved by our model was 97.8%.

Index Terms — Long Short Term Memory (LSTM), Convolutional Neural Network (CNN), Sentiment Analysis, Machine Learning,
Pre-processing, Tokenization.

1. Introduction

The ability of technology to transport information quickly characterises the digital revolution, often known as the data age [1]. In recent years, government agencies [2] and corporate trade organisations have put a lot of effort into learning what their target customers think. At the same time, more and more individuals are freely sharing their thoughts online, and the availability of historical data online makes obtaining it straightforward. Individual opinions can be expressed verbally or through evaluations and reviews. Furthermore, information provided by a person is more trustworthy than information provided by a producer [3]. Therefore, in order to market a product or develop new ones, it is essential to know what the customer likes and dislikes about it. Most companies or producers rely on consumer input to increase sales. Product reviews posted online work exactly like word-of-mouth recommendations from customers. The study of sentiment is also essential in the field of finance. The aim of sentiment analysis is to determine the direction of the emotions expressed in a text. The term "subjectivity analysis" is a by-product of opinion mining: the study of emotions, sentiments, views, behaviours, and assessments is known as subjectivity analysis [4]. By definition, an opinion is a judgement or a belief that lacks support from facts or evidence. The attitude expressed in a review of a specific product is rarely expressly favourable or negative; instead, consumers typically have a mixed opinion about numerous qualities, some of which are positive and some of which are negative. Consider the review "Despite the poor battery life, I appreciate Micromax's multimedia features" [5]. This sentence evokes conflicting feelings. This study suggests an approach that uses dependency parsing to identify relationships or patterns together with the views that correspond to them, representing those relationships as a graph comprising features and opinions. In order to keep only those opinion expressions that are most closely connected to the target feature (a user-specified feature), clustering is done [6] on the graph, and the other expressions are pruned. Compared to the baseline, the approach was highly accurate in every domain. Compared with state-of-the-art systems, it attains a comparable level of accuracy with limited data [7]. Our model achieved the highest accuracy of the models compared.

The remainder of this paper is structured in the following manner. In Section 2, the related work is discussed. In Sections 3 and 4, we discuss sentiment analysis and web scraping. The methodology is presented in Section 5. We present the results and discussion in Section 6. Section 7 brings this paper to a close.

2. Related Work

The web scraping process uses a software agent that simulates human Web surfing to systematically extract and combine content from the Internet [8]. From an operational perspective, it might be viewed as a manual task, which means copying and pasting the data. The distinction in this case is that a virtual computer agent is used to accomplish this operation automatically [9]. It might also be viewed as an extraction procedure that converts unstructured Internet material into a structured format that is simple to read and use for various studies. The web scraper uses a request to get access to the website, and then uses the HTML code to discover specified

Volume 12 Issue 8 - 20221 24 www.joics.org



components, extract them, and store them in a structured manner using a data frame. This method of extracting unstructured material from websites can be used in a variety of contexts. It may be used, for instance, to compare product pricing across several websites, to remove advertisements from connected pages, to address issues with inadequate statistical data, or to supplement official datasets with additional data [10]. Web scraping may also be utilized in the human resources field to locate open positions on various websites and categorize them using Naive Bayes algorithms.

Another work on sentiment analysis for big data was done by Sharma et al., who endeavored to present a summary of the most recent developments in opinion mining. Although difficulties relating to unstructured data remain, they found that sentiment analysis has grown to be quite popular in this field of study and noted that a lot of research has already been done. Furthermore, they claim that WordNet is the most frequently used lexicon source and that supervised rather than dictionary-based procedures yield more accurate findings [11]. An ensemble framework for sentiment classification, developed by merging several feature sets and classification methodologies, was employed by Xia et al. [12]. They employed three basic classifiers (Naive Bayes, Maximum Entropy, and Support Vector Machines) and two kinds of feature sets (part-of-speech information and word relations) in their research. For sentiment classification, they used ensemble techniques such as fixed combination, weighted combination, and meta-classifier combination, and obtained improved accuracy. Text with a subjective context performs better in terms of sentiment analysis than text with a merely objective context. This is because, when a text's context or perspective is objective, the text often presents commonplace assertions or facts without evoking any sentiment [13].

Additionally, Dandannavar and Mangalwede offered an overview of several methods for sentiment analysis that use textual data. They analyzed the benefits and drawbacks of each method and came to the conclusion that sentiment analysis could be done in several ways, some based on a lexicon, some on training sets, and some employing both. The approaches that were examined are domain-specific, and the majority of them focus on English and Chinese. As a result, relatively few investigations on sentiment categorization for other languages have been done [14].

The bag-of-words technique was employed by Turney [15] for sentiment analysis, in which the connections between words were not taken into account at all and a text was represented only as a collection of words. The sentiment of every word was identified, and these values were then combined using various aggregation techniques in order to assess the overall sentiment of the document. WordNet's lexical database was utilized by Kamps et al. [16] to analyze the emotional qualities of words. They created a distance metric on WordNet and identified the semantic polarity of adjectives. A crucial method of the dictionary-based approach was also described by Hu and Liu [17]. In this method, a list of opinion words is manually compiled, and this set is expanded by looking in other places (such as the well-known corpus WordNet or a thesaurus) in order to find more synonyms and potential antonyms. The new words are added to the seed list, and the following iteration is then started. When no further words can be identified, the entire process comes to an end.

3. Sentiment Analysis

The process of determining whether a block of text is positive, negative or neutral is known as sentiment analysis. Sentiment analysis is the contextual mining of words to determine the social sentiment of a brand and to assist businesses in determining whether or not the product they are producing will find a market [18]. Its objective is to examine public opinion in a way that will support business growth. It considers polarity (positive, negative, and neutral) as well as emotions (happy, sad, angry, etc.). Sentiment analysis aids data analysts at major corporations in understanding consumer experiences, doing complex market research, monitoring brand and product reputation, and gauging public opinion. In order to provide their own clients with insightful information, data analytics [19] firms frequently incorporate third-party sentiment analysis APIs into their own platforms for customer experience management, social media monitoring, and workforce analytics.

4. Web Scraping

Automated collection of web data and information is called web scraping. In essence, it is the extraction of online data. Web scraping covers information retrieval, news collecting, website monitoring, competitive marketing, and other applications [20], as shown in Figure 1. Web scraping makes it quick and straightforward to access the large quantity of information available online. Compared to manually pulling data from websites, it is far faster and easier. These days, web scraping is more and more common. A lot of data collection and information extraction may be done quickly and easily using online data scraping software. However, when individuals use the term "web scrapers," they often refer to computer programmes.

Figure 1 The Web Scraping

Web scraping software (sometimes known as "bots") is designed to browse websites, scrape the pertinent pages, and


extract meaningful data. These bots can quickly retrieve enormous volumes of data by automating this process [21]. In the digital era, when big data plays such a significant role and is continuously updating and changing, this has obvious advantages. Web scraping has several uses, particularly in the area of data analytics.

5. Methodology

The main contributions of the proposed work are the following.

Figure 2 The Methodology Flow Chart

5.1 Dataset and Description

To assess the proposed method, a dataset [22] in CSV file format was collected from comments on the Amazon.com website. The collection includes reviews of televisions, video surveillance devices, cellphones, tablets, laptops, and more (see Figure 2). Each record contains meta-features such as the reviewer's ID, the product ID, and the review content. A variety of procedures are involved in the preparation of the data, including lowercase processing.

5.2 Data Preprocessing

We applied a number of preprocessing procedures to the data as a whole to make it simpler to process the review texts. To assess our model for the best accuracy, we use 80% of the data for training and 20% for testing.

• Lower Case
All of the words in the review text are converted to lowercase.

• Stopword Elimination
Common words like "the," "a," "an," "is," and "are" are examples of stop-words. Because they do not provide any knowledge that is crucial for the model, these words were removed from the review's content.

• Remove all punctuation
All punctuation was removed from the review texts.

• Contraction Removal
The full form is used instead of the abbreviated form wherever a contraction is written; for instance, “when've” becomes “when have”. Contraction removal is the name given to this process.

• Tokenization
The sentences in the review texts were split into tokens, that is, individual words or pieces of words.

• Part-of-Speech Tagging
In this stage, each word of the phrase is assigned a POS [23] tag, such as "VB" for a verb, "JJ" for an adjective, and "NN" for a noun.

• Score Generation
A score was produced after sentiment analysis of the review text. The sentiment score was calculated by comparing the dataset to the opinion lexicon [22], which contains 5,000 positive words and 4,500 negative words, along with their associated ratings.

• Word Embeddings
We created numerical vectors for each preprocessed sentence in the dataset of product reviews using the "word embeddings" approach. To create word indices, we first turned each phrase in the review text into a sequence. We use the Keras text tokenizer to get those indices [24]. We made sure that every word received an index in the tokenizer and that the vocabulary size was updated accordingly. Then, for each word in the training and testing sets, a unique index is created.

Figure 3 The CNN Model
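The preprocessing steps of Section 5.2 can be illustrated with a minimal Python sketch that uses only the standard library. It is not the authors' implementation: the stop-word list, the contraction table, the two tiny lexicons, and the helper names (`preprocess`, `lexicon_score`, `build_word_index`, `to_sequence`) are assumptions made for illustration. The hand-written indexer only stands in for the Keras tokenizer, which orders indices by word frequency rather than by first occurrence.

```python
import string

# Assumed, deliberately tiny lists; the paper does not publish its exact lists.
STOP_WORDS = {"the", "a", "an", "is", "are"}
CONTRACTIONS = {"when've": "when have", "don't": "do not", "it's": "it is"}
POSITIVE = {"great", "good"}   # stand-in for the positive part of the opinion lexicon
NEGATIVE = {"poor", "bad"}     # stand-in for the negative part of the opinion lexicon


def preprocess(review):
    """Lowercase, expand contractions, strip punctuation, tokenize, drop stop-words."""
    text = review.lower()
    # Contractions are expanded first, because the apostrophe is punctuation.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


def lexicon_score(tokens):
    """Count positive minus negative lexicon hits (unit weights are an assumption)."""
    positives = sum(1 for tok in tokens if tok in POSITIVE)
    negatives = sum(1 for tok in tokens if tok in NEGATIVE)
    return positives - negatives


def build_word_index(token_lists):
    """Give every distinct word a unique index starting at 1; 0 is kept for padding."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index) + 1
    return index


def to_sequence(tokens, word_index, max_len):
    """Map tokens to their indices, truncate to max_len, and right-pad with zeros."""
    seq = [word_index[tok] for tok in tokens if tok in word_index][:max_len]
    return seq + [0] * (max_len - len(seq))


reviews = ["The battery is poor, but the screen is great!", "It's a great phone."]
cleaned = [preprocess(r) for r in reviews]
scores = [lexicon_score(toks) for toks in cleaned]
word_index = build_word_index(cleaned)
sequences = [to_sequence(toks, word_index, max_len=6) for toks in cleaned]

print(cleaned)
print(scores)
print(sequences)
```

The fixed-length integer sequences are the kind of input an embedding layer expects; the padding length of 6 is chosen only to keep the example small.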


5.3 Convolutional Layer

CNNs are among the most popular classification models. They are well-liked because they can learn the features for each class on their own, without the need for manual feature extraction. The primary building block of a CNN, shown in Figure 3 [25], is the convolutional layer, which performs well with CSV and image data. By convolving the data with filters, a CNN produces features that are passed on to the following layer. That layer filters the previously created features once more to create new features.

5.3.1 Dense Layer

A "dense layer" is a layer of basic neurons in which every neuron receives input from every neuron in the preceding layer. The output of the convolutional layer is passed on to the dense layer.

5.3.2 Max Pooling Layer

With a particular kernel size (n*n), max pooling takes the largest value in each window of the image matrix and uses it to build a down-sampled (pooled) feature map, hence lowering the processing time. It is regularly employed following a convolutional layer. A 5 x 5 kernel was used.

5.3.3 Batch Normalization Layer

The batch normalization approach boosts the neural network's speed and stability. The layer performs operations such as standardization and normalization on the inputs it receives from the preceding layer.

Figure 4 The Summary of Model
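To make the max pooling step of Section 5.3.2 concrete, the following is a small pure-Python sketch. It assumes non-overlapping windows (stride equal to the kernel size) and uses a 2 x 2 kernel on a 4 x 4 matrix purely for brevity, whereas the model described here uses a 5 x 5 kernel. The function name `max_pool_2d` and the sample values are illustrative assumptions.

```python
def max_pool_2d(matrix, k):
    """Non-overlapping k x k max pooling (stride k); incomplete edge windows are dropped."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - k + 1, k):
        pooled_row = []
        for c in range(0, cols - k + 1, k):
            # Collect the k*k values of the current window and keep the largest.
            window = [matrix[i][j] for i in range(r, r + k) for j in range(c, c + k)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4 x 4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 4],
    [0, 5, 3, 8],
]
pooled = max_pool_2d(feature_map, 2)
print(pooled)  # [[6, 2], [7, 9]]
```

Each 2 x 2 window is reduced to a single value, so the 4 x 4 map shrinks to 2 x 2, which is how pooling lowers the amount of data the following layers must process.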
5.3.4 LSTM Layer

The LSTM [27] is an RNN [26] variant that can learn long-term dependencies. The LSTM layer was given 50 hidden units, whose outputs are passed to the subsequent layer. One of the most important advantages of employing a convolutional neural network for feature extraction, rather than a traditional LSTM alone, is the decrease in the total number of features [28], as shown in Figure 4. These features (words) from the feature extraction process are used by a sentiment classification model to evaluate whether the content of the product review is favorable or negative.

5.3.5 Sigmoid Activation Function

In this final layer, the output classes are identified and categorized as either positive or negative.

Evaluation metrics: the proposed models (CNN-LSTM and LSTM) were evaluated using accuracy, precision, recall, F1-score [29], and specificity.

6. Result and Discussion

The CNN-LSTM model's accuracy was fairly good at 97.8% [30], as shown in Figure 5. Our model had the highest accuracy of the models we applied to this dataset. Thus, among the models compared, our algorithm has the highest accuracy rate for sentiment analysis of customer reviews. The confusion matrix is used to show the sample's TP, FP, TN, and FN rates. These rates were used to generate the performance metrics for the CNN-LSTM model [31], which predicts consumer sentiment on unseen data [32]. These metrics include specificity, accuracy, recall, precision, and F1-score.

Figure 5 The Model Accuracy
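The performance metrics named above follow from the four confusion-matrix counts by their standard definitions. The sketch below shows those definitions in plain Python; the function name `classification_metrics` and the counts passed to it are hypothetical and are not the results reported in this paper.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)      # share of predicted positives that are correct
    recall = safe_div(tp, tp + fn)         # share of actual positives that are found
    specificity = safe_div(tn, tn + fp)    # share of actual negatives that are found
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 100 test reviews (not taken from the paper).
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting specificity alongside recall matters for review data, because a model can reach high accuracy on an imbalanced dataset while still misclassifying most of the minority class.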


7. Conclusion and Future Work

In order to assign weighted sentiment ratings to the entities, topics, themes, and categories included inside a sentence or phrase, sentiment analysis systems for text analysis integrate natural language processing (NLP) with machine learning approaches. The bulk of the sentiment analysis methods used in this study were created using machine learning techniques. In comparison to the other models, CNN-LSTM provides the best accuracy. Based on our evaluation, we conclude that our approach, which has an accuracy rate of 97.8%, was the most effective of the models compared for sentiment analysis. In future work, this approach could be extended to build sentiment analysis technology or robust machine learning algorithms with high accuracy.

References

[1] Nikhat Akhtar, Firoj Parwej, Yusuf Perwej, “A Perusal of Big Data Classification and Hadoop Technology”, International Transaction of Electrical and Computer Engineers System (ITECES), USA, Volume 4, No. 1, Pages 26-38, 2017, DOI: 10.12691/iteces-4-1-4
[2] Al-Mushayt O., Haq Kashiful, Yusuf Perwej, “Electronic-Government in Saudi Arabia; a Positive Revolution in the Peninsula”, International Transactions in Applied Sciences, India, ISSN-0974-7273, Volume 1, Number 1, Pages 87-98, 2009
[3] Rui Xia, Feng Xu, Chengqing Zong, Qianmu Li, Yong Qi and Tao Li, "Dual Sentiment Analysis: Considering Two Sides of One Review", IEEE Trans. on Knowledge and Data Engineering, 2015
[4] A. Pak and P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining", LREC, vol. 10, no. 2010, 2010
[5] Danushka Bollegala, David Weir and John Carroll, "Cross-Domain Sentiment Classification Using a Sentiment Sensitive Thesaurus", IEEE Trans. on Knowledge and Data Engineering, vol. 25, no. 8, August 2013
[6] Y. Dang, Y. Zhang, and H. Chen, “A lexicon-enhanced method for sentiment classification: an experiment on online product reviews,” IEEE Intelligent Systems, vol. 25, no. 4, pp. 46–53, 2010
[7] Yusuf Perwej, Bedine Kerim, Mohmed Sirelkhtem Adrees, Osama E. Sheta, “An Empirical Exploration of the Yarn in Big Data”, International Journal of Applied Information Systems (IJAIS) – ISSN: 2249-0868, Foundation of Computer Science FCS, New York, USA, Volume 12, No.9, Pages 19-29, 2017, DOI: 10.5120/ijais2017451730
[8] Glez-Peña, D.; Lourenço, A.; López-Fernández, H.; Reboiro-Jato, M.; Fdez-Riverola, F. Web scraping technologies in an API world. Brief. Bioinform. 2014, 15, 788–794.
[9] Saurkar, A.V.; Pathare, K.G.; Gode, S.A. An overview on web scraping techniques and tools. Int. J. Future Revolut. Comput. Sci. Commun. Eng. 2018, 4, 363–367.
[10] Hillen, J. Web scraping for food price research. Br. Food J. 2019, 121, 3350–3361.
[11] Sharma, D.; Sabharwal, M.; Goyal, V.; Vij, M. Sentiment analysis techniques for social media data: A review. In First International Conference on Sustainable Technologies for Computational Intelligence; Springer: Singapore, 2020; pp. 75–90.
[12] R. Xia, C. Zong, and S. Li, “Ensemble of feature sets and classification algorithms for sentiment classification,” Information Sciences: an International Journal, vol. 181, no. 6, pp. 1138–1152, 2011
[13] Sarkar, Dipanjan. Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from your Data. Apress, 2017
[14] Dandannavar, P.S.; Mangalwede, S.R.; Deshpande, S.B. Emoticons and their effects on sentiment analysis of Twitter data. In EAI International Conference on Big Data Innovation for Sustainable Cognitive Computing; Springer: Cham, Switzerland, 2020; pp. 191–201.
[15] P. D. Turney, “Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 417–424, Association for Computational Linguistics, 2002.
[16] J. Kamps, M. Marx, R. J. Mokken, and M. De Rijke, “Using wordnet to measure semantic orientations of adjectives,” 2004.
[17] Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25; pp. 168–177, 2004
[18] Svetlana Kiritchenko, Xiaodan Zhu and Saif M. Mohammad, "Sentiment Analysis of Short Informal Texts", Journal of Artificial Intelligence Research, vol. 50, pp. 723-762, 2014
[19] Firoj Parwej, Nikhat Akhtar, Yusuf Perwej, “A Close-Up View About Spark in Big Data Jurisdiction”, International Journal of Engineering Research and Application (IJERA), ISSN: 2248-9622, Volume 8, Issue 1, (Part -I1), Pages 26-41, 2018, DOI: 10.9790/9622-0801022641
[20] Erdinc Uzun, A novel web scraping approach using the additional information obtained from web pages, Preparation of Papers for IEEE Transactions and Journals, Vol 4, 2016
[21] Renita Crystal Pereira and T Vanitha, "Web Scraping of Social Networks", International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, pp. 237-239, Oct. 2018
[22] R. S. Jagdale, V. S. Shirsat, and S. N. Deshmukh, “Sentiment analysis on product reviews using machine learning techniques. Cognitive informatics and soft computing,” Advances in Intelligent Systems and Computing, vol. 768, 2018.
[23] Asif Perwej, Dr. Yusuf Perwej, Nikhat Akhtar, and Firoj Parwej, “A FLANN and RBF with PSO Viewpoint to Identify a Model for Competent Forecasting Bombay Stock Exchange”, COMPUSOFT, An International Journal of Advanced Computer Technology, 4 (1), Volume-IV, Issue-I, Pages 1454-1461, 2015, DOI: 10.6084/ijact.v4i1.60
[24] Tensorflow and text preprocessing, Accessed 20 January 2022, https://www.tensorflow.org/api_docs/python/tf/keras/preprocess/text/Tokenize
[25] Khadeeja Naqvi, Divyanshi Gautam, Ashish Kumar Srivastava, Prof. (Dr.) Syed Qamar Abbas, Dr. Nikhat Akhtar, “A Machine Learning-Based Rational Breast Cancer Diagnosis”, Journal of Emerging Technologies and Innovative Research, Volume 9, Issue 7, Pages 558-567, 2022, DOI: 10.6084/m9.jetir.JETIR2207677
[26] Yusuf Perwej, “Recurrent Neural Network Method in Arabic Words Recognition System”, International Journal of Computer Science and Telecommunications (IJCST), which is published by Sysbase Solution (Ltd), UK, London, Volume 3, Issue 11, Pages 43-48, 2012
[27] S. N. Alsubari, S. N. Deshmukh, M. H. Al-Adhaileh, F. W. Alsaade, and T. H. Aldhyani, “Development of Integrated Neural Network Model for Identification of Fake Reviews in E-Commerce Using Multidomain Datasets,” Applied Bionics and Biomechanics, vol. 2021, Article ID 5522574, 11 pages, 2021
[28] Yusuf Perwej, “The Bidirectional Long-Short-Term Memory Neural Network based Word Retrieval for Arabic Documents”, Transactions on Machine Learning and Artificial Intelligence (TMLAI), Society for Science and Education, United Kingdom (UK), ISSN 2054-7390, Volume 3, Issue 1, Pages 16 - 27, 2015, DOI: 10.14738/tmlai.31.863
[29] Nikhat Akhtar, Devendera Agarwal, “An Efficient Mining for Recommendation System for Academics”, International Journal of Recent Technology and Engineering (IJRTE), Volume-8, Issue-5, Pages 1619-1626, 2020, DOI: 10.35940/ijrte.E5924.018520
[30] Jorge E. Zambrano, Daniel P. Benalcazar, Claudio A. Perez, Kevin W. Bowyer, "Iris Recognition Using Low-Level CNN Layers Without Training and Single Matching", IEEE Access, vol. 10, pp. 41276-41286, 2022
[31] J. Fan, W. Xu, Y. Wu and Y. Gong, "Human tracking using convolutional neural networks", Neural Networks IEEE Transactions, 2010
[32] Svetlana Kiritchenko, Xiaodan Zhu and Saif M. Mohammad, "Sentiment Analysis of Short Informal Texts", Journal of Artificial Intelligence Research, vol. 50, pp. 723-762, 2014
