Sample Major Project Soft
BACHELOR OF ENGINEERING
In
By
Jitesh Garg (15XU1A0502)
Syed Tameem Alam Quadri (15XU1A0511)
CERTIFICATE
This is to certify that the project entitled “Amazon Fine Food Review
Analysis” is a bonafide work carried out by the following students during
the period 2018-2019 in partial fulfilment of the requirements for the
award of the degree of “Bachelor of Technology in Computer Science and
Engineering” from CSI Wesley Institute of Technology and Sciences,
Secunderabad affiliated to Jawaharlal Nehru Technological University,
Hyderabad (JNTUH) under our guidance and supervision.
The results embodied in the project work have not been submitted to any
other university or institute for the award of any degree or diploma.
DECLARATION
We, Jitesh Garg (15XU1A0502) and Syed Tameem Alam Quadri (15XU1A0511) students
of CSI Wesley Institute of Technology and Sciences, pursuing Bachelor’s degree in Computer
Science and Engineering, hereby declare that the project report entitled “Amazon Fine Food
Review Analysis”, carried out under the guidance of Mr. Subhan Allah, Assistant Professor,
Department of Computer Science and Engineering, is submitted in partial fulfilment of the
requirements for the degree of Bachelor of Engineering in Computer Science. This is a bonafide
record of the work carried out by us and the results embodied in this project have not been
reproduced/ copied from any source.
I would like to express my sincere gratitude to my project guide, Mr. Subhan Allah,
Assistant Professor, for giving me the opportunity to work on this topic. It would never have
been possible for me to take this project to this level without his innovative ideas and his
relentless support and encouragement.
I would like to thank my project coordinator, Mr. Subhan Allah, Professor, who
helped me by being an example of high vision and pushing towards greater limits of
achievement.
My sincere thanks to Mr. Subhan Allah, Associate Professor and Head of the
Department of Computer Science and Engineering, for his valuable guidance and
encouragement, which played a major role in the completion of the project, and for helping
me by being an example of high vision and pushing towards greater limits of achievement.
I would like to express a deep sense of gratitude towards Dr. P.M Yohan, Principal,
CSI WITS, for always being an inspiration and for always encouraging us in every possible
way.
I am indebted to the Department of Computer Science & Engineering and CSI Wesley
Institute of Technology and Sciences for providing me with all the required facilities to carry
out my work in a congenial environment. I extend my gratitude to the CSE Department staff for
providing the needful from time to time whenever requested.
I would like to thank my parents for allowing me to realize my potential. All the support
they have provided me over the years was the greatest gift anyone has ever given me, and I
thank them also for teaching me the value of hard work and education. My parents have offered
me tremendous support and encouragement; thanks to my parents for all the moral support and the
amazing opportunities they have given me over the years.
ABSTRACT
Amazon.com is known to be the largest online retailer in the world, providing a vast number of
services to its customers on the products sold. One of Amazon’s major requirements is
to categorize the reviews on its products into positive and negative (also called critical)
reviews, in order to present them to its customers and to improve its own services. Fine
food covers a wide range of products, from cheese and chocolates to tofu cubes
and almond flour. A positive review on these products helps in finding what is actually
liked by the customer, and a negative review helps in finding what is wrong with the
product and in improving the product and the services associated with it. The project Amazon
Fine Food Reviews Analysis is intended to solve this problem by categorizing the
reviews as positive or negative using applied Natural Language Processing, based on the
dataset provided by the Stanford Network Analysis Project for research and education.
The Machine Learning model is designed to process more than 550,000 real world
reviews written by multiple customers on various fine food products of Amazon, taken from this
dataset, and to classify each review as positive or negative.
CONTENTS
Chapter 1: Introduction 1
Chapter 2: Literature Survey 4
2.1 Need for Sentimental Analysis 5
2.2 Lexical Analysis 6
2.3 Machine Learning based analysis 6
2.4 Hybrid Analysis 7
2.5 Levels of Sentimental Analysis 7
2.6 Different approaches for sentiment analysis 10
Chapter 3: System Analysis 14
3.1 Problem statement 15
3.2 Proposed System 15
3.3 Proposed Logic: 15
3.4 Feasibility Study: 17
Chapter 4: Software Requirement Specification 18
4.1 Introduction 19
4.2 Purpose 19
4.3 Functional Requirements 20
4.4 Non-Functional Requirements 20
4.5 Hardware Requirements 21
4.6 Software Requirements 21
Chapter 5: System Design 22
5.1 Block Diagram 23
5.2 Data Flow Diagram 23
5.3 Class Diagram 24
5.4 Object Diagram 24
5.5 Use Case Diagram 25
5.6 Activity Diagram 25
5.7 State chart Diagram 26
5.8 Sequence Diagram 26
5.9 Collaboration Diagram 27
5.10 Component Diagram 27
Chapter 6: Implementation 28
Chapter 7: Testing 38
7.1 Test Cases 39
Chapter 8: Screen Shots 40
8.1 Reading data from the dataset 41
8.2 Records read from the dataset 41
8.3 Value of shape of the data before removing useless records 42
8.4 Value of shape of the data after removing useless records 42
8.5 Removing URL’s from the data 43
8.6 Removing HTML tags from the data reviews 43
8.7 De-contraction of reviews in the data 44
8.8 After De-contraction of data 44
8.9 Removing Numbers and Special Characters 45
8.10 Stop Words Considered 45
8.11 After removing Stop Words from the reviews 46
8.12 Applying Bi-Gram partition on the reviews 46
8.13 Applying Brute Force algorithm on the reviews 47
8.14 After applying KNN using Brute Force approach 47
8.15 Word2Vec using Brute Force approach 48
8.16 Accuracy to Neighbors graph using Brute Force 48
8.17 K-NN using kd-tree approach 49
8.18 Applying K-NN algorithm using kd-tree approach 49
8.19 Accuracy vs Neighbors graph using Kd-Tree approach 50
8.20 Implementing Word2Vec using Kd-Tree approach 50
8.21 Accuracy using Kd-Tree Implementation 51
Chapter 9: Conclusions 52
Chapter 10: Future Enhancements 54
Chapter 11: References 56
LIST OF FIGURES
Sl. No.  Figure Name  Pg. No.
1 5.1 Block Diagram 23
2 5.2 Data Flow Diagram 23
3 5.3 Class Diagram 24
4 5.4 Object Diagram 24
5 5.5 Use Case Diagram 25
6 5.6 Activity Diagram 25
7 5.7 State chart Diagram 26
8 5.8 Sequence Diagram 26
9 5.9 Collaboration Diagram 27
10 5.10 Component Diagram 27
11 8.1 Reading data from the dataset 41
12 8.2 Records read from the dataset 41
13 8.3 Value of shape of the data before removing useless records 42
14 8.4 Value of shape of the data after removing useless records 42
15 8.5 Removing URLs from the data 43
16 8.6 Removing HTML tags from the data reviews 43
17 8.7 De-contraction of reviews in the data 44
18 8.8 After De-contraction of data 44
19 8.9 Removing Numbers and Special Characters 45
20 8.10 Stop Words Considered 45
21 8.11 After removing Stop Words from the reviews 46
22 8.12 Applying Bi-Gram partition on the reviews 46
23 8.13 Applying Brute Force algorithm on the reviews 47
24 8.14 After applying KNN using Brute Force approach 47
25 8.15 Word2Vec using Brute Force approach 48
26 8.16 Accuracy to Neighbors graph using Brute Force 48
27 8.17 K-NN using kd-tree approach 49
28 8.18 Applying K-NN algorithm using kd-tree approach 49
29 8.19 Accuracy vs Neighbors graph using Kd-Tree approach 50
30 8.20 Implementing Word2Vec using Kd-Tree approach 50
31 8.21 Accuracy using Kd-Tree Implementation 51
CHAPTER 1
INTRODUCTION
INTRODUCTION:
Amazon is one of the world’s leading e-commerce websites and has spread its market across a wide
variety of fields, including food. Amazon collects reviews from its
customers and treats them as feedback on the food that is being sold. To improve its service,
Amazon used to go through all the reviews on all the food items and manually classify
them as good or bad. But as the market has expanded, the number of reviews from
customers has also increased, and it has become difficult for Amazon to classify all the
recorded reviews by hand. Amazon now needs a solution to this problem. Hence the concepts of data
mining, supervised learning and classification are employed to consider all the reviews
and classify them as either positive or critical reviews.
Reviews on Amazon relate not only to the product but also to the service given to the
customers. If users get a clear separation of product reviews and service reviews, it will be
easier for them to take a decision. In this paper we propose a system that performs the
classification of customer reviews followed by finding the sentiment of the reviews. A rule based
extraction of product feature sentiment is also done.
All information in the world can be broadly classified into two categories: facts and
opinions. Facts are objective statements about entities and worldly events. Opinions, on the other
hand, are subjective statements that reflect people’s sentiments or perceptions about
entities and events. Most existing research on text and information processing
has focused on mining and retrieving factual information. Before
the World Wide Web (WWW), collections of opinion data were scarce: when an individual needed to
make a decision, he or she typically asked for opinions from friends and family, and when an
organization needed to find the opinions of the general public about its products and services, it
conducted surveys and focus groups. With the growth of the Web, and especially with the drastic
growth of user generated content, the world has changed and so have the methods of
gathering opinions. One can post reviews of products at merchant sites and express views on
almost anything in Internet forums, discussion groups, and blogs, which are collectively called
user generated content. As connectivity technology has grown, the ways of
interpreting and processing users’ opinion information have changed as well. Some machine
learning techniques, such as Naïve Bayes, Maximum Entropy and Support Vector Machines, are
discussed in this paper. Extracting features from user opinion information is an emerging
task. The algorithm used in our project is the K-Nearest Neighbors classifier.
Data classification is the process of sorting and categorizing data into various types, forms or
other distinct classes. It enables the separation and classification of data
according to data set requirements for various business or personal objectives. An effective
data classification process is important because it can help an organization determine the
appropriate levels of control to maintain the confidentiality and integrity of its data.
Each review given by a customer is recorded in a dataset. A review
may contain images along with the text. Only the textual data is considered, and the reviews
are classified based on the textual data alone. Images are large and
consume a lot of storage, so they are excluded; the textual information
alone is enough for the classification. Amazon is a web application, and the reviews
are taken through the website itself. A review contains all kinds of data, which may or may
not be necessary for the classification, and a few reviews may not even be helpful for the customer.
The data set contains all of this data, and the unnecessary parts have to be removed. Nominal data
is not taken into consideration; only the ordinal data is considered.
CHAPTER 2
LITERATURE SURVEY
Sentiment analysis is a machine learning approach in which machines analyze and
classify the sentiments, emotions and opinions about a particular topic or entity that are
expressed in the form of text or speech [4]. Sentiment analysis is a challenge for Natural
Language Processing (NLP), text analytics and computational linguistics. In a general sense,
sentiment analysis determines the opinion regarding the object or subject under discussion [5].
People share knowledge, experiences and thoughts with the world through social media such as blogs,
forums, wikis, review sites, social networks, tweets and so on. This has changed the manner in
which people communicate and influence the social, political and economic behavior of other
people in the Web 2.0. Indeed, the Web 2.0 gives everyone a voice, promising to boost
human collaboration capabilities on a worldwide scale and enabling individuals to share opinions
by means of the read-write Web and user generated content [6].
The term sentiment analysis first appeared in, however the research on sentiments/opinions
appeared earlier. The literature on sentiment analysis spans different domains, from
management sciences to computer science, social sciences and business, due to its importance
to society as a whole, and covers different tasks such as subjective expressions, sentiments of
words, subjective sentences, and topics [6]. Sentiment analysis is used to extract the subjective
information in source material by applying techniques such as Natural Language
Processing (NLP), computational linguistics and text analysis, and to classify the polarity of the
opinion [5].
Sentiment analysis has been practiced on a variety of topics, for instance in
studies of movie reviews, product reviews, news and blogs. In this section, Twitter specific
sentiment analysis approaches are reported. The research on sentiment analysis so far has
mainly focused on two things: identifying whether a given textual entity is subjective or
objective, and identifying the polarity of subjective texts. Most sentiment analysis studies use
machine learning approaches [5].
In the sentiment analysis domain, texts belong to either a positive or a negative class. There
may also be binary or multi-valued classes such as positive, negative and neutral (or irrelevant).
The core complexity of classifying texts in sentiment analysis, compared with other
topic-based cataloging, is due to the non-usability of keywords, despite the fact that the number
of classes in sentiment analysis is smaller than in the latter approach. Opinion mining
(sentiment extraction) is employed on Twitter posts by means of Lexical analysis,
Machine learning based analysis, or Hybrid/Combined analysis [5].
2.2 Lexical Analysis:
This technique is governed by the use of a dictionary consisting of pre-tagged lexicons. The input
text is converted to tokens by the tokenizer. Every new token encountered is then matched against
the lexicons in the dictionary. If there is a positive match, the score is added to the total
score for the input text. For instance, if “dramatic” is a positive match in the dictionary, then the
total score of the text is incremented. Otherwise the score is decremented or the word is tagged
as negative. Though this technique appears to be simple in nature, its variants have proved to
be worthy [5].
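The dictionary lookup described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the lexicon below is a hypothetical toy dictionary, the function names are invented for this example, and tokens that are not found in the lexicon simply contribute nothing to the score, which is one common variant of the technique.

```python
import re

# Hypothetical toy lexicon; a real system would load a pre-tagged dictionary.
LEXICON = {"good": 1, "great": 1, "dramatic": 1, "bad": -1, "stale": -1, "awful": -1}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lexicon_score(text, lexicon=LEXICON):
    """Sum the lexicon scores of all tokens; unknown tokens contribute 0."""
    return sum(lexicon.get(token, 0) for token in tokenize(text))

def lexicon_label(text, lexicon=LEXICON):
    """Map the total score to a coarse sentiment label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `lexicon_label("The chips were stale and awful")` sums two negative matches and yields a negative label.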
2.3 Machine Learning based analysis:
Machine learning is one of the most prominent techniques, gaining the interest of researchers due
to its adaptability and accuracy. In sentiment analysis, mostly the supervised learning variants
of this technique are employed. It comprises five stages: data collection, pre-processing,
training, classification and plotting of results. For training, a collection of tagged
corpora is provided. The classifier is presented with a series of feature vectors derived from this
data. A model is created based on the training data set and is then applied to new, unseen
text for classification. In the machine learning technique, the key to the accuracy of a classifier
is the selection of appropriate features. Generally, unigrams (single words), bi-grams
(two consecutive words) and tri-grams (three consecutive words) are selected as feature vectors.
Other proposed features include the number of positive words, the number of negative
words and the length of the document, while Support Vector Machines (SVM) and the Naïve Bayes (NB)
algorithm are commonly used as classifiers. Accuracy is reported to vary from 63% to 80%
depending upon the combination of features selected [5].
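As an illustration of the n-gram features mentioned above, the following sketch counts unigrams and bi-grams of a text. It assumes simple whitespace tokenization, and the function names `ngrams` and `ngram_features` are chosen only for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams (tuples of n consecutive tokens) in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, max_n=2):
    """Count all 1..max_n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts
```

Bi-grams keep some word order, so a review containing "not good" produces the feature `("not", "good")`, which a unigram-only representation cannot distinguish from "good" on its own.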
Our project is a Machine Learning based analysis which compares
techniques implemented using bigram analysis and K-Nearest Neighbors (KNN) analysis. The
KNN algorithm is more accurate than the bigram analysis. KNN is first
implemented using the brute force technique, which takes a lot of computational time. Hence,
KNN is implemented again using the kd-tree (K-Dimensional tree), which takes much less time
than the brute force technique.
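The difference between the two search strategies can be sketched as follows. This is a simplified, pure Python illustration for a single nearest neighbor on small numeric tuples, not the implementation used in the project; the function names and the dictionary-based tree layout are assumptions made for this example. The brute force search measures the distance to every point, whereas the kd-tree search skips a whole subtree whenever the splitting plane is farther away than the best match found so far.

```python
import math

def brute_force_nearest(points, query):
    """Scan every point and return the one closest to the query."""
    return min(points, key=lambda p: math.dist(p, query))

def build_kd_tree(points, depth=0):
    """Recursively split the points on the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }

def kd_nearest(node, query, best=None):
    """Return the tree point closest to the query, pruning far subtrees."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    diff = query[axis] - point[axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kd_nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if abs(diff) < math.dist(best, query):
        best = kd_nearest(far, query, best)
    return best
```

Both functions return a point at the same minimal distance; the kd-tree version typically inspects fewer points when the data has few dimensions.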
2.4 Hybrid Analysis:
The advances in sentiment analysis lured researchers to explore the possibility of a hybrid
approach which could combine the accuracy of a machine learning approach with the
speed of a lexical approach. Pseudo documents encompassing all the words from the set of
chosen lexicons are created. The cosine similarity between the pseudo
documents and the unlabeled documents is then computed. Depending upon the measure of similarity,
the documents are assigned either a positive or a negative sentiment. This training dataset is
then fed to a Naïve Bayes classifier for training.
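The similarity step of the hybrid approach can be illustrated with a small sketch. It assumes bag-of-words count vectors and two hypothetical pseudo documents, one built from positive and one from negative lexicon words; all names are chosen for this example, and ties are resolved in favor of the positive class as an arbitrary choice.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a text as a word -> count mapping."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def pseudo_label(text, positive_doc, negative_doc):
    """Assign the label of the more similar pseudo document (ties go to positive)."""
    vec = bag_of_words(text)
    pos = cosine_similarity(vec, bag_of_words(positive_doc))
    neg = cosine_similarity(vec, bag_of_words(negative_doc))
    return "positive" if pos >= neg else "negative"
```

The labels produced this way can then serve as training data for a separate classifier, as the paragraph above describes.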
2.5 Levels of Sentimental Analysis:
Document level sentiment analysis is performed on a whole document. The basic unit of
information is a single document of opinionated text. In this type of
classification, a single review about a single topic is considered. In forums and blogs, however,
comparative sentences may appear, and customers may compare one product with another that
has similar characteristics; hence document level analysis is not desirable for forums and
blogs. While doing document level classification, irrelevant sentences must be eliminated in the
preprocessing phase. For document level classification, both supervised and unsupervised
machine learning methods are used. Supervised machine learning algorithms such
as Support Vector Machine (SVM), Naïve Bayes, KNN and Maximum Entropy can be used
to train the system. For the training and testing datasets, the reviewer rating (in the form of
1-5 stars) and the review text can be used. The features that can be used for machine learning are
term frequency, document frequency, the tf-idf measure, part of speech tagging, opinion words,
opinion phrases, negations and dependencies. Manually labeling the polarity of each document is a
time consuming task, and hence the available user rating can be used instead. Unsupervised
machine learning can be done by extracting the opinion words inside a document. The point-wise
mutual information can be used to find the semantics of the extracted words [4].
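To make the tf-idf measure mentioned above concrete, here is a minimal sketch. It uses one common weighting, term frequency multiplied by log(N / df), and whitespace tokenization; other tf-idf variants exist, and the function name `tf_idf` is an assumption made for this example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights
```

A term that occurs in every document receives a weight of zero, so words shared by all reviews do not influence the classification.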
Sentence level sentiment analysis is concerned with finding the sentiment of individual sentences,
that is, whether each sentence expresses a positive, negative or neutral sentiment. Sentence level
sentiment analysis is closely related to subjectivity classification. Here, the polarity of each
sentence is calculated, and the same document level classification methods are applied to the
sentence level classification problem. First, the objective and subjective sentences must be
identified. The subjective sentences contain opinion words which help in determining the
sentiment about an entity. After that, the polarity classification into positive, negative and
neutral classes is done [4].
Entity or aspect level sentiment analysis performs a finer-grained analysis. The goal is to
find the sentiment on entities or on aspects of those entities. For example, consider the statement
“My Nokia Lumia 510 cell phone has good picture quality but it has less battery backup.” Here
the opinion on the phone’s picture quality is positive, but the opinion on its
battery backup is negative. We can generate a summary of opinions about entities. Comparative
statements are part of entity or aspect level sentiment analysis but are handled with techniques of
comparative sentiment analysis [4].
In phrase level sentiment classification, the phrases that contain opinion words are identified
and a phrase level classification is done. This can be advantageous or disadvantageous. It is
advantageous where the exact opinion about an entity can be correctly extracted. In other
cases, where contextual polarity matters, the result may not be accurate. Where the negation of
words occurs locally, within the phrase, this type of sentiment analysis suffices [4].
Product features are considered as product attributes. Analyzing these features to identify
the sentiment of the document is called feature based sentiment analysis. In this approach,
a positive, negative or neutral opinion is identified from the extracted features. It is the most
fine-grained analysis model among all the models [4].
A. Review Sites
A review site is a website where users can post reviews, which give an opinion about people,
businesses, products, services or a particular entity. Most of the sentiment analysis work has
been done on movie and product review sites. The review data used in most sentiment
classification studies are collected from e-commerce websites such as www.amazon.com
(product reviews), www.yelp.com (restaurant reviews), www.CNET download.com (product
reviews) and www.reviewcentre.com, which have millions of product reviews by customers.
Besides these, there are professional review sites such as
www.dpreview.com and www.zdnet.com, and customer opinion sites on broad topics and products
such as www.consumerreview.com, www.epinions.com and www.bizrate.com [4].
B. Blogs
With an increasing usage of the internet, blogging and blog posts are growing rapidly [7]. The
term blog refers to a webpage consisting of brief paragraphs of opinion, information, personal
diary entries, or links, called posts, which are arranged chronologically with the most recent
first, in the style of an online journal [8]. Sentiment analysis on blogs [9] has been used to
predict product sales, movie sales and political mood, and appears in many of the studies related
to sentiment analysis.
C. Forums
Forums or message boards allow their members to hold conversations by posting on the site.
They are usually dedicated to a topic, and thus using forums as a data source allows us to do
sentiment analysis in a single domain.
D. Datasets
Most of the work in the field uses movie review and product review data for classification.
The movie review dataset which is available online is
(http://www.cs.cornell.edu/People/pabo/movie-review-data) [7]. Another dataset which is available
online is the multi-domain sentiment (MDS) dataset (Blitzer et al., 2007)
(http://www.cs.jhu.edu/mdredze/datasets/sentiment) [10]. Is; Zhu Jian, 2010; Pang and Lee,
2004; Bai et al., 2005; Kennedy and Inkpen, 2006; Zhou and Chaovalit, 2008; Yulan He, 2010;
Rudy Prabowo, 2009; Rui Xia, 2011) [7].
The approach that we are considering in our project is the dataset approach. The dataset is
collected from the website Kaggle. It originates from the Stanford Network Analysis
Project and was later released on Kaggle for learning purposes. The
dataset is used in the SQLite format. It consists of Amazon fine food
reviews collected over a period of 13 years and contains 568,454
records.
E. Micro-blogging
Twitter is a popular micro-blogging service where users create or write status messages called
"tweets". These tweets express opinions about different topics. Tweets are also used as a data
source for classifying sentiment.
Natural language processing (NLP) is the branch of computer science and technology focused on
developing systems that allow computers to communicate with people using natural language. Natural
language processing techniques play an important role in obtaining accurate sentiment analysis. NLP
techniques such as Bag of words, Hidden Markov model (HMM), part of speech (POS) tagging, N-gram
algorithms, large sentiment lexicon acquisition and parsing techniques are used to extract opinions
at the document level, phrase level, sentence level and aspect level [12,13]. Large sentiment
lexicon acquisition uses a sentiment word dictionary which contains many sentiment words with their
numeric threshold values for a particular domain [14]. The SentiWordNet dictionary is used for
subjective sentiment analysis. Part of speech (POS) tagging is often the most time consuming
and challenging task before doing sentiment analysis of any document, because online textual
reviews are short, ungrammatical sentences that contain slang, abbreviations and symbols,
which make POS tagging even more difficult. For example, consider the statement “The
camera is good. I love its picture quality.” Here, “camera” is referred to as a product and “picture
quality” as a feature. Products and features are tagged as nouns. We can
define a synonym list of products and features. Such a list is needed because online
reviews are informal and ungrammatical. For example, consider the comment “I like the high
res”. Here “res” refers to resolution, and resolution is similar to graphics. Sometimes textual
reviews may contain mixed sentiment, for example “I like the graphics, but it takes battery
a lot”. With feature based sentiment analysis, it is easy to handle such reviews:
in this case, the sentiment is positive for “graphics” and negative for “battery”. For this,
CLASSIFIER, CONCEPT, CONCEPT_RULE, and PREDICATE_RULE rules can be used
[6].
Machine learning techniques are the most useful techniques for sentiment classification,
categorizing text into positive, negative or neutral categories. In the machine learning technique,
training and testing datasets are required. A training dataset is used to learn from the documents
and a test dataset is used to validate the performance. A number of machine learning
algorithms are used to classify reviews. There are two types of machine learning techniques:
supervised machine learning algorithms such as maximum entropy, SVM, Naïve Bayes, KNN,
etc., and unsupervised machine learning algorithms such as HMM, neural networks, PCA, ICA,
SVD, etc.
i. Naïve Bayes
Naïve Bayes is a simple but effective classification algorithm. It is mostly used for
document level classification. The basic idea is to calculate the probabilities of categories given
a test document by using the joint probabilities of words and categories. Naïve Bayes is optimal
for certain problem classes with highly dependent features. Naïve Bayes classifiers are
computationally fast when taking decisions, and they do not require large amounts of data before
learning can begin [15].
KNN is a classifier that relies on the category labels attached to the training documents that are
similar to the test document. It classifies an object based on the majority class amongst
its k nearest neighbors. It is a type of lazy learning where the function is only approximated
locally and all computation is deferred until classification [17]. The KNN algorithm usually uses
the Euclidean or the Manhattan distance. However, any other distance, such as the Chebyshev
norm or the Mahalanobis distance, can also be used [18].
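Three of the distances named above can be written directly from their definitions. The sketch below assumes the two points are numeric sequences of equal length; the Mahalanobis distance is left out because it additionally requires the covariance matrix of the data.

```python
def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For the points (0, 0) and (3, 4), these give 5.0, 7 and 4 respectively, which shows how the choice of distance changes which neighbors count as nearest.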
iv. Winnow
Decision Tree Learning is a tree based approach in which a collection of root and child nodes
leads towards the target value. It is a flow chart like structure, where each internal node denotes
a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions [21]. The popular Decision Tree algorithms are ID3, C4.5 and
CART. The ID3 algorithm is considered a very simple decision tree algorithm. It uses
information gain as its splitting criterion. C4.5 is an evolution of ID3. It uses gain ratio as its
splitting criterion [22]. The CART algorithm uses the Gini coefficient as its selection
criterion, and each time selects the attribute with the smallest Gini coefficient as the test
attribute for a given set [23].
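The splitting criteria named above can be illustrated with a short sketch. It computes the entropy and the Gini measure (often called Gini index or Gini impurity) of a list of class labels, and the information gain of a candidate split; the function names are chosen for this example, and the gain ratio of C4.5 is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini measure: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)
```

A split that separates the classes perfectly reduces the entropy of every child group to zero, so its information gain equals the entropy of the parent.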
The need for the classification of the reviews is to help Amazon determine the number of positive
and negative reviews received from customers for a specific product. The
reviews then act as a feedback mechanism for the organization and help it adjust
its marketing strategies.
The techniques involved in performing the classification are data cleaning, text
preprocessing from data mining, supervised learning and machine learning.
This combination is widely used in the market and is found to be an efficient way to
classify data taken from a huge dataset.
The sentiment analysis is done at the phrase level. The reviews are taken from the dataset, and
from all the textual data only the necessary data is considered, thereby shrinking the entire
sentence into a single phrase. The phrase may be meaningless when read in sequence, but
the words of the sentence are used for the classification.
The dataset considered for this project is a collection of Amazon fine food reviews from
a period of 13 years. The data is a collection of 568,454 records and is stored in the .sqlite
format. It consists of reviews which may contain plain text, URLs of images or other
references, HTML tags (since the reviews are collected online) and many unnecessary stop
words. The data has 10 attributes, of which ‘Score’ is considered as the class attribute. The data
also has a helpfulness numerator and a helpfulness denominator, which are used to exclude a few of
the reviews from consideration.
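Reading such records from an SQLite file can be sketched with Python's standard `sqlite3` module. The table name `Reviews` and the column names `Score` and `Text` are assumptions about the file layout, and the in-memory database below only stands in for the real file so that the example is self-contained.

```python
import sqlite3

def load_reviews(conn, limit=None):
    """Return reviews as dicts holding the class attribute Score and the review Text."""
    conn.row_factory = sqlite3.Row  # rows can then be converted to dicts by column name
    query = "SELECT Score, Text FROM Reviews"  # assumed table and column names
    params = ()
    if limit is not None:
        query += " LIMIT ?"
        params = (limit,)
    return [dict(row) for row in conn.execute(query, params)]

# Stand-in for the real file, e.g. sqlite3.connect("database.sqlite") (hypothetical path).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reviews (Score INTEGER, Text TEXT)")
conn.executemany("INSERT INTO Reviews VALUES (?, ?)",
                 [(5, "Loved this tea"), (1, "Stale and bland")])
sample_reviews = load_reviews(conn)
```

The `limit` parameter is a convenience for experimenting on a subset before processing all records.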
The machine learning approach considered for classifying the data is the K-Nearest Neighbors
classification algorithm. Each record is considered as a point in the search space, and the
distance between points is computed. Among the k nearest neighbors of a new point (k being odd),
the class to which the majority of the neighbors belong is selected, and the new point is assigned
to that class.
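The majority vote described above can be sketched as follows. This is a minimal brute force version that assumes numeric feature tuples and Euclidean distance; the function name `knn_predict` and its parameters are chosen for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify the query by majority vote among its k nearest training points."""
    # Indices of the training points, ordered from nearest to farthest.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

With two classes, an odd k guarantees that the vote cannot end in a tie.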
CHAPTER 3
SYSTEM ANALYSIS
3.1 Problem statement:
The problem being addressed in this project is the poor quality of Amazon reviews at the top
of the forum despite the “helpfulness” rating system. The problem arises from the “free pass”
given to new reviews to be placed at the top of the forum, for a chance to be rated by the
community. The proposed solution to this problem is to use machine learning techniques to
design a system that “pre-rates” new reviews on their “helpfulness” before they are given a
position at the top of the forum. This way, poor quality reviews will be less likely to be
shown at the top of the forum, as they do not get the “free pass” merely because they are new. The
proposed system will use a set of Amazon review data to train itself to predict a helpfulness
classification (helpful, or not helpful) for new input data [1].
This is a supervised learning problem where we need to predict the positive or negative target
variable for each review. The goal will be to maximize the accuracy of this classification. We
will train our model on a dataset containing thousands of reviews presented as unstructured
text. Each review will be labeled as positive or negative.
The project provides the organization with a way to solve the problem of classification. The
project uses the concepts of data mining, machine learning and supervised learning to solve the
problem. The main objective of the project is to classify the reviews from the dataset into either
positive or critical reviews.
The data is first collected from all the customers into a dataset. The dataset consists of all the
reviews, which contain textual data including URLs, HTML tags, etc. Hence the data
is first taken through the Data Cleaning phase.
Data cleaning is divided into two parts: data de-duplication and text preprocessing.
Data deduplication is the phase in which reviews recorded more than once by the same user at
the same timestamp are removed, keeping only the first record. The data has the attributes
helpfulness numerator and helpfulness denominator. In a valid record the numerator cannot
exceed the denominator, so records in which the numerator is greater than the denominator are
removed and not taken into consideration. Only the reviews with scores 1, 2, 4 or 5 are
considered. Reviews with score 3 are treated as neutral and are dropped. All reviews with
scores 1 and 2 are labeled 0 and reviews with scores 4 and 5 are labeled 1. This process is
known as the partition.
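The deduplication, helpfulness check and partition steps can be sketched as below. This is a minimal plain-Python sketch under stated assumptions: the field names follow the attribute names used in this report ("UserId", "ProfileName", "Time", "Text", the helpfulness numerator and denominator, and the score), while the function `clean_records`, the "Label" field and the sample rows are hypothetical and exist only for illustration.

```python
def clean_records(records):
    """Deduplicate reviews, drop invalid and neutral ones, and attach a 0/1 label."""
    seen = set()
    cleaned = []
    for row in records:
        # Deduplication: keep only the first record per (user, profile, time, text).
        key = (row["UserId"], row["ProfileName"], row["Time"], row["Text"])
        if key in seen:
            continue
        seen.add(key)
        # A valid record never has more helpful votes than total votes.
        if row["HelpfulnessNumerator"] > row["HelpfulnessDenominator"]:
            continue
        # Score 3 is treated as neutral and dropped.
        if row["Score"] == 3:
            continue
        # Partition: scores 4 and 5 become 1, scores 1 and 2 become 0.
        cleaned.append({**row, "Label": 1 if row["Score"] > 3 else 0})
    return cleaned


def make_row(user, profile, time, text, num, den, score):
    """Build one hypothetical review record."""
    return {
        "UserId": user, "ProfileName": profile, "Time": time, "Text": text,
        "HelpfulnessNumerator": num, "HelpfulnessDenominator": den, "Score": score,
    }


sample = [
    make_row("U1", "A", 100, "great tea", 1, 1, 5),
    make_row("U1", "A", 100, "great tea", 1, 1, 5),  # duplicate of the first row
    make_row("U2", "B", 200, "stale", 3, 2, 1),      # numerator > denominator
    make_row("U3", "C", 300, "okay", 0, 0, 3),       # neutral score
    make_row("U4", "D", 400, "awful", 0, 2, 2),
]
cleaned = clean_records(sample)
print([(r["UserId"], r["Label"]) for r in cleaned])
```

Only the first and last sample rows survive: the second is a duplicate, the third fails the helpfulness check and the fourth is neutral.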
Text preprocessing is the process of refining the textual data in the reviews. The reviews are
taken from the Amazon web application, so they may contain many HTML tags and URLs; all URLs
and HTML tags are therefore removed. The reviews also contain spelling errors and contracted
words with apostrophes, which are expanded (de-contracted). Tokens containing special symbols
or numbers are not useful for classification and are removed, as are words of two characters
or fewer. The code also uses a list of 184 stop words: words that occur frequently in the
reviews but add little weight to the classification. Whenever these words are encountered in
the reviews, they are removed.
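The cleaning steps above can be sketched as one function. This is a hedged sketch, not the project's code: the test cases in Chapter 7 mention Beautiful Soup for tag removal, whereas this example substitutes a simple regular expression so that it needs only the standard library. The `STOP_WORDS` set and `CONTRACTIONS` map are tiny illustrative stand-ins for the 184-word list and the full de-contraction rules.

```python
import re

# Hypothetical, deliberately tiny stand-ins for the project's real lists.
STOP_WORDS = {"the", "and", "was", "this", "with", "for"}
CONTRACTIONS = {
    "won't": "will not",
    "can't": "can not",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}


def preprocess(text):
    """Return a cleaned, lower-cased review with noise tokens removed."""
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags (regex stand-in)
    text = re.sub(r"http\S+|www\.\S+", " ", text)    # strip URLs
    text = text.lower()
    for short, full in CONTRACTIONS.items():         # expand contracted words
        text = text.replace(short, full)
    text = re.sub(r"\S*\d\S*", " ", text)            # drop tokens containing digits
    text = re.sub(r"[^a-z\s]", " ", text)            # drop special symbols
    # Drop stop words and words of two characters or fewer.
    words = [w for w in text.split() if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(words)


raw = "This tea was <b>great</b>, I won't buy brand2 again! See http://example.com/x"
cleaned_text = preprocess(raw)
print(cleaned_text)
```

Tags are stripped before URLs in this sketch so that a link inside an `href` attribute disappears together with its tag.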
Data cleaning is followed by featurization. Featurization is a class that performs bigram
analysis by taking all consecutive two-word combinations from the cleaned reviews. The bigram
partition is done on all the reviews, which are then classified as either positive or critical.
However, the accuracy of the bigram analysis is low relative to the time it takes. Hence we use
another approach to solve the classification problem, the K-Nearest Neighbors classification.
The algorithm can be implemented in two ways: the brute-force approach and the kd-tree
approach.
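To make the bigram step concrete, the sketch below extracts consecutive two-word pairs from already-cleaned reviews and counts them. The helper name `bigrams` and the two sample reviews are assumptions for illustration; the project's Featurization class may build its features differently.

```python
from collections import Counter


def bigrams(review):
    """Return the consecutive two-word combinations of a cleaned review."""
    words = review.split()
    return [f"{first} {second}" for first, second in zip(words, words[1:])]


# Hypothetical cleaned reviews.
reviews = ["tea great will not buy again", "will not buy this tea"]

# Count how often each bigram occurs across all reviews.
bigram_counts = Counter(bg for review in reviews for bg in bigrams(review))
print(bigram_counts.most_common(2))
```

Each distinct bigram becomes one feature, so the feature space grows quickly with the vocabulary, which is one reason the bigram pass is slow.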
The brute-force approach computes the distance from the review being classified to every review
in the training data and then classifies it as either positive or critical. The accuracy
obtained is approximately 84 percent. However, the brute-force algorithm takes a lot of
computational time to classify the reviews. Hence, the kd-tree (k-dimensional tree) is
introduced into the project.
The kd-tree approach organizes the training data in a tree beforehand, so that the points lying
close to the point to be classified can be located without examining every record. Hence, the
time taken to classify the reviews is lower than with the brute-force approach. The kd-tree
approach gives almost the same accuracy as the brute-force approach, but takes much less
computational time.
So far, the reviews are partitioned based only on the words present in them. The semantic
meaning of the reviews is not considered. Hence, if a review is written with words that are not
in the bag of words, it may be classified incorrectly. To address this, Word2Vec is used. It
represents each word that occurs at least three times in the dataset as a point in a
50-dimensional feature space. When a new review arrives, each of its words found in the
vocabulary is looked up in this feature space. The review is then represented by the average of
the vectors of those words, which gives the system a representation of the sentence's semantic
meaning. This is known as average Word2Vec.
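The averaging idea can be illustrated without a trained model. In the sketch below, `toy_embeddings` is a hypothetical two-dimensional lookup table standing in for the 50-dimensional Word2Vec vectors, and `average_vector` is an illustrative helper, not the project's function.

```python
def average_vector(sentence, embeddings, dim):
    """Average the vectors of the words in `sentence` that have an embedding."""
    total = [0.0] * dim
    count = 0
    for word in sentence.split():
        vec = embeddings.get(word)
        if vec is None:
            continue  # words outside the vocabulary are skipped
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count:
        total = [t / count for t in total]
    return total


# Hypothetical 2-dimensional embeddings for two known words.
toy_embeddings = {"good": [1.0, 0.0], "tea": [0.0, 1.0]}
sentence_vector = average_vector("good tea unknownword", toy_embeddings, dim=2)
print(sentence_vector)
```

A review made up entirely of out-of-vocabulary words keeps the all-zero vector, which matches the guard against division by zero in the implementation listing.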
Average Word2Vec is implemented using both the brute-force approach and the kd-tree approach.
Both give approximately the same accuracy but differ greatly in computation time.
The algorithms in this project are computationally heavy, so the system must be capable of
performing the required computation. The technical requirements are high compared to an
average computer. The operating system used is Windows 10 Enterprise Edition. The system has
8 GB of RAM and 128 GB of SATA storage. It is also equipped with a GTX 1050 GPU, which helps in
performing the computation faster.
The project is not costly. The only costs are building the machine and paying the data
analyst.
Since the algorithm has to go through every bigram formed from the reviews, the system takes
considerable computation time. Benchmarks over different test cases showed that the project
takes 10 minutes to classify 3,500 records, 15 minutes for 8,000 records, 1 hour for 1.46 lakh
(146,000) records and 7 hours for all the reviews present in the dataset.
CHAPTER 4
SOFTWARE REQUIREMENT SPECIFICATION
4.1 Introduction
A Software Requirements Specification (SRS) – for a software system – is a
complete description of the behaviour of a system to be developed. It includes a set of use
cases that describe all the interactions the users will have with the software. In addition to use
cases, the SRS also contains non-functional requirements. Non-functional requirements are
requirements which impose constraints on the design or implementation (such as performance
engineering requirements, quality standards, or design constraints).
System requirements specification: A structured collection of information that embodies the
requirements of a system. A business analyst, sometimes titled system analyst, is responsible
for analysing the business needs of their clients and stakeholders to help identify business
problems and propose solutions. Within the systems development life cycle domain, the analyst
typically performs a liaison function between the business side of an enterprise and the information
technology department or external service providers. Projects are subject to three sorts of
requirements:
● Business requirements describe in business terms what must be delivered or accomplished
to provide value.
● Product requirements describe properties of a system or product (which could be one of
several ways to accomplish a set of business requirements).
● Process requirements describe activities performed by the developing organization.
For instance, process requirements could specify methodologies that must be followed, and
constraints that the organization must obey.
Product and process requirements are closely linked. Process requirements often specify
the activities that will be performed to satisfy a product requirement. For example, a maximum
development cost requirement (a process requirement) may be imposed to help achieve a
maximum sales price requirement (a product requirement); a requirement that the product be
maintainable (a product requirement) is often addressed by imposing requirements to follow
particular development styles.
4.2 Purpose:
In systems engineering, a requirement can be a description of what a system must
do, referred to as a functional requirement. This type of requirement specifies something that
the delivered system must be able to do. Another type of requirement specifies something
about the system itself, and how well it performs its functions. Such requirements are often
called non-functional requirements, 'performance requirements' or 'quality of service
requirements'. Examples of such requirements include usability, availability, reliability,
supportability, testability and maintainability.
● Model building
● Prediction
The system is implemented in Jupyter Notebook, and all the required packages are downloaded
and imported.
CHAPTER 5
SYSTEM DESIGN
Fig. 5.1 Block Diagram
Fig. 5.3 Class Diagram
Fig. 5.5 Use Case Diagram
Fig. 5.7 State chart Diagram
Fig. 5.9 Collaboration Diagram
CHAPTER 6
IMPLEMENTATION
Read Data Class:
filtered_data = None
class ReadData(object):
Deduplication Class:
class Deduplication(object):
        sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True,
                                                inplace=False, kind='quicksort',
                                                na_position='last')
        return sorted_data
class TextPreprocessing(object):
return x
i+=1
return x
return x
return x
Class Featurization:
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

class Featurization(object):
    # X_train and y_train are expected to be defined at notebook level.

    def apply_brute(self):
        # Grid search over odd k values with the brute-force neighbor search
        knn = KNeighborsClassifier(algorithm='brute')
        param_grid = {'n_neighbors': np.arange(1, 100, 2)}  # params we need to try on classifier
        tscv = TimeSeriesSplit(n_splits=10)  # for time-based splitting
        gsv = GridSearchCV(knn, param_grid, cv=tscv, verbose=1)
        gsv.fit(X_train, y_train)
        print("Best HyperParameter: ", gsv.best_params_)
        print("Best Accuracy: %.2f%%" % (gsv.best_score_ * 100))
        return gsv.best_params_

    def apply_kdtree(self):
        # Same grid search, using the kd-tree neighbor search
        knn = KNeighborsClassifier(algorithm='kd_tree')
        param_grid = {'n_neighbors': np.arange(1, 100, 2)}  # params we need to try on classifier
        tscv = TimeSeriesSplit(n_splits=10)  # for time-based splitting
        gsv = GridSearchCV(knn, param_grid, cv=tscv, verbose=1)
        gsv.fit(X_train, y_train)
        print("Best HyperParameter: ", gsv.best_params_)
        print("Best Accuracy: %.2f%%" % (gsv.best_score_ * 100))
        return gsv.best_params_
Applying Word2Vec:
from gensim.models import Word2Vec

# List of sentences in X_train text
sent_of_train = []
for sent in X_train:
    sent_of_train.append(sent.split())

# List of sentences in X_test text
sent_of_test = []
for sent in X_test:
    sent_of_test.append(sent.split())

# Train your own Word2Vec model using your own train text corpus
# min_count = 3 considers only words that occurred at least 3 times
# Written for gensim 3.x; gensim 4 renamed `size` to `vector_size`
# and `wv.vocab` to `wv.key_to_index`.
w2v_model = Word2Vec(sent_of_train, min_count=3, size=50, workers=4)
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred at least 3 times ", len(w2v_words))
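# Average Word2Vec vector for each review in the test set
test_vectors = []
for sent in sent_of_test:
    sent_vec = np.zeros(50)  # same dimension as the Word2Vec vectors
    cnt_words = 0
    for word in sent:
        if word in w2v_words:  # skip words outside the vocabulary
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    test_vectors.append(sent_vec)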
from sklearn.metrics import accuracy_score

# determining best k (neighbors and cv_scores come from the cross-validation step)
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)
# ============================== KNN with k = optimal_k ==============================
# instantiate learning model k = optimal_k
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='brute', n_jobs=-1)
# fit and predict (assumed steps; these lines are absent from the listing as extracted)
knn_optimal.fit(train_vectors, Y_train)
pred = knn_optimal.predict(test_vectors)
# evaluate accuracy
acc = accuracy_score(Y_test, pred) * 100
print('\nThe Test Accuracy of the K-NN classifier for k = %d is %f%%' % (optimal_k, acc))
# Variables that will be used for making table in Conclusion part of this assignment
Avg_Word2Vec_brute_K = optimal_k
Avg_Word2Vec_brute_train_acc = max(cv_scores)*100
Avg_Word2Vec_brute_test_acc = acc
from sklearn.model_selection import cross_val_score

# perform 3-fold cross validation over the candidate k values in `neighbors`
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm='kd_tree')
    scores = cross_val_score(knn, train_vectors, Y_train, cv=3, scoring='accuracy', n_jobs=-1)
    cv_scores.append(scores.mean())
# determining best k
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)
# fit and predict with the best k (assumed steps; absent from the listing as extracted)
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='kd_tree')
knn_optimal.fit(train_vectors, Y_train)
pred = knn_optimal.predict(test_vectors)
# evaluate accuracy
acc = accuracy_score(Y_test, pred) * 100
print('\nThe Test Accuracy of the K-NN classifier for k = %d is %f%%' % (optimal_k, acc))
# Variables that will be used for making table in Conclusion part of this assignment
Avg_Word2Vec_kdTree_K = optimal_k
Avg_Word2Vec_kdTree_train_acc = max(cv_scores)*100
Avg_Word2Vec_kdTree_test_acc = acc
CHAPTER 7
TESTING
Name | Description | Expected Output | Actual Output | Result
Data Deduplication | Removal of the duplicate reviews with the same "UserId", "ProfileName", "Time", "Text" attributes | The duplicate review must be dropped, keeping only the first record | The duplicate reviews are dropped, keeping only the first record | Success
Remove Useless Data | In the dataset, Helpfulness Numerator <= Helpfulness Denominator must always hold | The records with Helpfulness Numerator <= Helpfulness Denominator must be retained | All the records with Helpfulness Numerator <= Helpfulness Denominator are retained | Success
Remove URLs | The URL must be removed from the "Text" attribute | URL must be removed and substituted with nothing | URL is removed and substituted with nothing | Success
Remove All the HTML Tags | Use the Beautiful Soup function to remove HTML tags from the "Text" attribute | All the HTML tags must be removed from "Text" | The HTML tags are removed from the "Text" | Success
De-contraction | The short strings must be de-contracted | De-contraction of words | De-contraction of words | Success
Removing Strings with Numerical Characters | Strings with numerical characters add no value to "Text" | All the strings with numerical characters must be removed | All the strings with numerical characters are removed | Success
Remove Stop Words | Retaining stop words adds no value to "Text" | Stop words must be removed | Stop words are removed | Success
Table 7.1 Test Cases
CHAPTER 8
SCREENSHOTS
8.1 Reading data from the dataset
8.3 Value of shape of the data before removing useless records
8.5 Removing URLs from the data
8.7 De-contraction of reviews in the data
8.9 Removing Numbers and Special Characters
8.11 After removing Stop Words from the reviews
8.13 Applying Brute Force algorithm on the reviews
8.15 Word2Vec using Brute Force approach
8.17 K-NN using kd-tree approach
8.19 Accuracy vs Neighbors graph using Kd-Tree approach
8.21 Accuracy using Kd-Tree Implementation
CHAPTER 9
CONCLUSION
CONCLUSION:
The reviews were taken from the dataset provided by Amazon and classified as either positive or
critical. The raw reviews from the web application were subjected to data cleaning and text
preprocessing, in which URLs, HTML tags, useless reviews, special symbols and stop words are
removed and contracted words are expanded. Bigram partitioning is performed on the data after
data cleaning. The bigram analysis gives lower accuracy than the machine learning approach.
K-Nearest Neighbors was first implemented using the brute-force approach, which compares each
review against every training record and therefore consumes a lot of time. The algorithm was
then changed to use a kd-tree, which organizes the data beforehand and thereby reduces the
running time. Semantic understanding is added by using Word2Vec, so that the reviews can also
be classified based on their semantic meaning.
CHAPTER 10
FUTURE ENHANCEMENTS
FUTURE ENHANCEMENTS:
The reviews were classified based only on their textual data. If the text of a review
contradicts the images attached to it, the classification becomes ambiguous and may be
incorrect. The algorithm can be extended with more advanced machine learning techniques, such
as convolutional networks, to classify the images posted by customers in their reviews as
feedback. This would help the organization better understand the reviews and process the
feedback according to its marketing strategies. The algorithm could also be implemented using
logistic regression, which may give better accuracy than the K-Nearest Neighbors algorithm.
The classification can be made more accurate by improving the data cleaning algorithms. Data
cleaning can be done more efficiently so as to reduce the data being considered, reducing the
running time and increasing the accuracy.
CHAPTER 11
REFERENCES
REFERENCES:
[1] Aashutosh Bhatt, Ankit Patel, Harsh Chheda, Kiran Gawande, "Amazon Review Classification
and Sentiment Analysis", Computer Department, Sardar Patel Institute of Technology.
[2] https://t-lanigan.github.io/amazon-review-classifier/
[3] https://mc.ai/amazon-fine-food-reviews-case-study-from-scratch/
[4] https://www.ijcseonline.org/pub_paper/32-IJCSE-00858.pdf (literature survey)
[5] https://arxiv.org/ftp/arxiv/papers/1512/1512.01043.pdf (literature survey)
[6] https://www.ijcaonline.org/research/volume125/number3/dandrea-2015-ijca-905866.pdf (literature survey)