
CSI WESLEY INSTITUTE OF TECHNOLOGY & SCIENCES

1-7-132/142, P.G.ROAD, OPP: PARADISE


SECUNDERABAD – 500 003
(Computer Science and Engineering)
Department of Computer Science and Engineering

Amazon Fine Food Review Analysis

A project report submitted in partial fulfilment of the requirement
for the award of the Degree of

BACHELOR OF ENGINEERING

In

COMPUTER SCIENCE AND ENGINEERING

By
Jitesh Garg (15XU1A0502)
Syed Tameem Alam Quadri (15XU1A0511)

Under the Guidance of


Mr. Subhan Allah,
Associate Professor, Department of CSE
CSI WESLEY INSTITUTE OF TECHNOLOGY & SCIENCES
1-7-132/142, P.G.ROAD, OPP: PARADISE
SECUNDERABAD – 500 003
(Computer Science and Engineering)
Department of Computer Science and Engineering

CERTIFICATE

This is to certify that the project entitled “Amazon Fine Food Review
Analysis” is a bonafide work carried out by the following students during
the period 2018-2019 in partial fulfilment of the requirements for the
award of the degree of “Bachelor of Engineering in Computer Science and
Engineering” from CSI Wesley Institute of Technology and Sciences,
Secunderabad, affiliated to Jawaharlal Nehru Technological University,
Hyderabad (JNTUH), under our guidance and supervision.
The results embodied in the project work have not been submitted to any
other university or institute for the award of any degree or diploma.

Jitesh Garg (15XU1A0502)


Syed Tameem Alam Quadri (15XU1A0511)

Internal Guide: Mr. Subhan Allah
Head of Department: Dr. K. V. S. Sudhakar

External Examiner:
Principal:


CSI WESLEY INSTITUTE OF TECHNOLOGY & SCIENCES
1-7-132/142, P.G.ROAD, OPP: PARADISE
SECUNDERABAD – 500 003
(Computer Science and Engineering)

Department of Computer Science and Engineering

DECLARATION

We, Jitesh Garg (15XU1A0502) and Syed Tameem Alam Quadri (15XU1A0511), students
of CSI Wesley Institute of Technology and Sciences, pursuing a Bachelor’s degree in Computer
Science and Engineering, hereby declare that the project report entitled “Amazon Fine Food
Review Analysis”, carried out under the guidance of Mr. Subhan Allah, Associate Professor,
Department of Computer Science and Engineering, is submitted in partial fulfilment of the
requirements for the degree of Bachelor of Engineering in Computer Science. This is a bonafide
record of the work carried out by us, and the results embodied in this project have not been
reproduced/copied from any source.

Jitesh Garg (15XU1A0502)


Syed Tameem Alam Quadri (15XU1A0511)
ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my project guide Mr. Subhan Allah,
Associate Professor, for giving me the opportunity to work on this topic. It would never have been
possible for me to take this project to this level without his innovative ideas and his relentless
support and encouragement.

I would like to thank my project coordinator Mr. Subhan Allah, who
helped me by being an example of high vision and pushing me towards greater limits of
achievement.

My sincere thanks to Mr. Subhan Allah, Associate Professor and Head of the
Department of Computer Science and Engineering, for his valuable guidance and
encouragement, which has played a major role in the completion of the project.

I would like to express a deep sense of gratitude towards Dr. P.M Yohan, Principal,
CSI WITS for always being an inspiration and for always encouraging us in every possible
way.

I am indebted to the Department of Computer Science & Engineering and CSI Wesley
Institute of Technology and Sciences for providing me with all the required facilities to carry
out my work in a congenial environment. I extend my gratitude to the CSE Department staff for
providing the needful from time to time whenever requested.

I would like to thank my parents for allowing me to realize my potential and for
teaching me the value of hard work and education. All the support they have provided me
over the years is the greatest gift anyone has ever given me; thanks to my parents for all the
moral support and the amazing opportunities they have given me over the years.

ABSTRACT

Amazon.com is known to be the largest online retailer in the world, providing a vast number of
services to its customers on the products sold. One of Amazon's major requirements is
to categorize the reviews on its products into positive and negative (also called critical)
reviews, in order to present them to its customers and to improve its own services. Fine
foods cover a wide range of products, from cheese to chocolates, tofu cubes and almond
flour. A positive review on these products helps in finding what is actually
liked by the customer, while a negative review helps in finding what is wrong with the
product, and hence in improving the product and the services associated with it. The project Amazon
Fine Food Reviews Analysis is intended to solve this problem by helping to categorize the
reviews into positive or negative using applied Natural Language Processing, based on the
dataset provided by the Stanford Network Analysis Project for research and education.
The Machine Learning model is designed to process more than 550,000 real-world
reviews by multiple customers on various fine food products of Amazon from the dataset,
and then to classify these reviews as positive or negative, thereby solving the problem.

CONTENTS
Chapter 1: Introduction
Chapter 2: Literature Survey
2.1 Need for Sentimental Analysis
2.2 Lexical Analysis
2.3 Machine Learning based analysis
2.4 Hybrid Analysis
2.5 Levels of Sentimental Analysis
2.6 Different approaches for sentiment analysis
Chapter 3: System Analysis
3.1 Problem statement
3.2 Proposed System
3.3 Proposed Logic
3.4 Feasibility Study
Chapter 4: Software Requirement Specification
4.1 Introduction
4.2 Purpose
4.3 Functional Requirements
4.4 Non-Functional Requirements
4.5 Hardware Requirements
4.6 Software Requirements
Chapter 5: System Design
5.1 Block Diagram
5.2 Data Flow Diagram
5.3 Class Diagram
5.4 Object Diagram
5.5 Use Case Diagram
5.6 Activity Diagram
5.7 State Chart Diagram
5.8 Sequence Diagram
5.9 Collaboration Diagram
5.10 Component Diagram
Chapter 6: Implementation
Chapter 7: Testing
7.1 Test Cases
Chapter 8: Screenshots
8.1 Reading data from the dataset
8.2 Records read from the dataset
8.3 Value of shape of the data before removing useless records
8.4 Value of shape of the data after removing useless records
8.5 Removing URLs from the data
8.6 Removing HTML tags from the data reviews
8.7 De-contraction of reviews in the data
8.8 After de-contraction of data
8.9 Removing Numbers and Special Characters
8.10 Stop Words Considered
8.11 After removing Stop Words from the reviews
8.12 Applying Bi-Gram partition on the reviews
8.13 Applying Brute Force algorithm on the reviews
8.14 After applying KNN using Brute Force approach
8.15 Word2Vec using Brute Force approach
8.16 Accuracy vs Neighbors graph using Brute Force
8.17 K-NN using kd-tree approach
8.18 Applying K-NN algorithm using kd-tree approach
8.19 Accuracy vs Neighbors graph using Kd-Tree approach
8.20 Implementing Word2Vec using Kd-Tree approach
8.21 Accuracy using Kd-Tree Implementation
Chapter 9: Conclusions
Chapter 10: Future Enhancements
Chapter 11: References

LIST OF FIGURES
5.1 Block Diagram
5.2 Data Flow Diagram
5.3 Class Diagram
5.4 Object Diagram
5.5 Use Case Diagram
5.6 Activity Diagram
5.7 State Chart Diagram
5.8 Sequence Diagram
5.9 Collaboration Diagram
5.10 Component Diagram
8.1 Reading data from the dataset
8.2 Records read from the dataset
8.3 Value of shape of the data before removing useless records
8.4 Value of shape of the data after removing useless records
8.5 Removing URLs from the data
8.6 Removing HTML tags from the data reviews
8.7 De-contraction of reviews in the data
8.8 After de-contraction of data
8.9 Removing Numbers and Special Characters
8.10 Stop Words Considered
8.11 After removing Stop Words from the reviews
8.12 Applying Bi-Gram partition on the reviews
8.13 Applying Brute Force algorithm on the reviews
8.14 After applying KNN using Brute Force approach
8.15 Word2Vec using Brute Force approach
8.16 Accuracy vs Neighbors graph using Brute Force
8.17 K-NN using kd-tree approach
8.18 Applying K-NN algorithm using kd-tree approach
8.19 Accuracy vs Neighbors graph using Kd-Tree approach
8.20 Implementing Word2Vec using Kd-Tree approach
8.21 Accuracy using Kd-Tree Implementation

CHAPTER 1
INTRODUCTION

Amazon is the world’s leading e-commerce website, which has spread its market into a wide
variety of fields, including food. Amazon has been collecting reviews from customers as
feedback on the food being sold. To improve itself, Amazon used to go through all the
reviews on all food items and manually classify them as good or bad. But as the market
expanded, the number of customer reviews also increased, and it became difficult for Amazon
to classify all the recorded reviews manually. Amazon therefore needs a solution to this
problem. Hence the concepts of data mining, supervised learning and classification have
been employed to consider all the reviews and classify them as either positive or critical
reviews.

Reviews on Amazon relate not only to the product but also to the service given to the
customers. If users get a clear bifurcation between product reviews and service reviews, it will be
easier for them to make a decision. In this project we propose a system that performs the
classification of customer reviews, followed by finding the sentiment of the reviews. A rule-based
extraction of product-feature sentiment is also done.

All information in the world can be broadly classified into two main categories, facts and
opinions. Facts are objective statements about entities and worldly events. On the other hand,
opinions are subjective statements that reflect people’s sentiments or perceptions about
entities and events. Most existing research on text and information processing
is focused on mining and extracting factual information from text. Before the WWW,
there was no large collection of opinion data: when an individual needed to make a
decision, he/she typically asked for opinions from friends and family, and when an organization
needed to find the opinions of the general public about its products and services, it conducted
surveys and focus groups. But after the growth of the Web, especially with the drastic growth
of user-generated content on the Web, the world has changed and so have the methods of
gauging one's opinion. One can post reviews of products at merchant sites and express views on
almost anything in Internet forums, discussion groups, and blogs, which are collectively called
user-generated content. As the technology of connectivity grew, so have the ways of
interpreting and processing users' opinion information. Some machine
learning techniques like Naïve Bayes, Maximum Entropy and Support Vector Machines have
been discussed in this report. Extracting features from user opinion information is an emerging
task. The algorithm used in our project is the K-Nearest Neighbors classifier.

Data classification is the process of sorting and categorizing data into various types, forms or
other distinct classes. It enables the separation and classification of data
according to dataset requirements for various business or personal objectives. An effective
data classification process is important because it can help organizations determine the
appropriate levels of control to maintain the confidentiality and integrity of their data.

Each review given by a customer is recorded into a dataset. A review
may consist of images along with text; only the textual data is considered, and the reviews
are classified based on the textual data alone. Since images are large entities and consume
a lot of storage, they are excluded, as the textual information by itself is enough for the
classification. Amazon is a web application, and the reviews are taken through the website
itself. A review may contain all kinds of data which may or may not be necessary for the
classification, and a few reviews may not even be helpful to other customers. The dataset
contains all of this data, which then has to be removed. Nominal data is not taken into
consideration; only ordinal data is.

CHAPTER 2
LITERATURE SURVEY

Sentiment analysis is a machine learning approach in which a machine analyzes and
classifies the sentiments, emotions and opinions about any particular topic or entity that are
expressed in the form of text or speech [4]. Sentiment analysis is a challenge of Natural
Language Processing (NLP), text analytics and computational linguistics. In a general sense,
sentiment analysis determines the opinion regarding the object/subject in discussion [5]. People
share knowledge, experiences and thoughts with the world by using social media like blogs,
forums, wikis, review sites, social networks, tweets and so on. This has changed the manner in
which people communicate and influence the social, political and economic behavior of other
people in the Web 2.0. Indeed, the Web 2.0 allows everyone to have a voice, promising to boost
human collaboration capabilities on a worldwide scale, enabling individuals to share opinions
by means of the read-write Web and user-generated content [6].

The term sentiment analysis appeared in the literature relatively recently; however, research on
sentiments/opinions appeared earlier. The literature on sentiment analysis spans different
domains, from management sciences to computer science, social sciences and business, due to
its importance to society as a whole, and covers different tasks such as subjective expressions,
sentiments of words, subjective sentences, and topics [6]. Sentiment analysis is used to extract
the subjective information in source material by applying various techniques such as Natural
Language Processing (NLP), computational linguistics and text analysis, and to classify the
polarity of the opinion [5].

2.1 Need for Sentimental Analysis:

Sentiment analysis has been practiced on a variety of topics, for instance in studies of
movie reviews, product reviews, news and blogs. In this section, Twitter-specific
sentiment analysis approaches are reported. The research on sentiment analysis so far has
mainly focused on two things: identifying whether a given textual entity is subjective or
objective, and identifying the polarity of subjective texts. Most sentiment analysis studies use
machine learning approaches [5].

In the sentiment analysis domain, texts belong to either positive or negative classes. There
may also be multi-valued classes like positive, negative and neutral (or irrelevant).
The core complexity of classifying texts in sentiment analysis, compared with other
topic-based cataloguing, is due to the non-usability of keywords, despite the fact that the number
of classes in sentiment analysis is smaller than in the latter approach. Opinion mining
(sentiment extraction) is employed on Twitter posts by means of lexical analysis, machine
learning based analysis, or hybrid/combined analysis [5].

2.2 Lexical Analysis:

This technique is governed by the use of a dictionary consisting of pre-tagged lexicons. The input
text is converted to tokens by the tokenizer. Every new token encountered is then matched against
the lexicons in the dictionary. If there is a positive match, the score is added to the total pool of
scores for the input text. For instance, if “dramatic” is a positive match in the dictionary, then the
total score of the text is incremented; otherwise the score is decremented or the word is tagged
as negative. Though this technique appears simplistic in nature, its variants have proved to
be worthy [5].
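
As an illustration of this technique, the following minimal sketch scores a text against a tiny hand-made lexicon; the words and scores here are invented for the example, not taken from any actual dictionary:

# A minimal lexicon-based scorer. The lexicon here is illustrative only.
LEXICON = {"good": 1, "great": 2, "dramatic": 1, "bad": -1, "awful": -2}

def lexical_score(text):
    # Tokenize on whitespace and sum the scores of known lexicon words;
    # unknown words contribute nothing.
    return sum(LEXICON.get(token, 0) for token in text.lower().split())

print(lexical_score("the chocolate was great but the packaging was bad"))  # prints 1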

2.3 Machine Learning based analysis:

Machine learning is one of the most prominent techniques gaining the interest of researchers due
to its adaptability and accuracy. In sentiment analysis, mostly the supervised learning variants
of this technique are employed. It comprises several stages: data collection, pre-processing,
training, classification and plotting of results. In the training data, a collection of tagged
corpora is provided, and the classifier is presented with a series of feature vectors from this
data. A model is created based on the training dataset and is employed over new/unseen
text for classification. In machine learning techniques, the key to the accuracy of a classifier
is the selection of appropriate features. Generally, unigrams (single words), bi-grams
(two consecutive words) and tri-grams (three consecutive words) are selected as feature vectors.
A variety of features have been proposed, namely the number of positive words, the number of
negative words and the length of the document, along with classifiers such as Support Vector
Machines (SVM) and the Naïve Bayes (NB) algorithm. Accuracy is reported to vary from 63%
to 80% depending upon the combination of features selected [5].
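
For illustration, unigram and bigram feature vectors of the kind described above can be produced with scikit-learn's CountVectorizer; the two sample reviews are made up:

from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good taste great price", "stale chips not good"]  # invented examples
# ngram_range=(1, 2) extracts both unigrams and bigrams as features.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
print(X.toarray())  # one row of n-gram counts per review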

Our project is a machine learning based analysis which compares the
technique implemented using bigram analysis with K-Nearest Neighbors analysis. The
KNN algorithm is more accurate than the bigram analysis. KNN is first
implemented using the brute force technique, which takes a lot of computational time; it is then
implemented using the kd-tree (K-Dimensional tree), which takes much less time
compared to the brute force technique.

2.4 Hybrid Analysis:

The advances in sentiment analysis lured researchers to explore the possibility of a hybrid
approach which could collectively exhibit the accuracy of a machine learning approach and the
speed of a lexical approach. Pseudo documents encompassing all the words from the set of
chosen lexicons are created, and the cosine similarity between the pseudo documents and the
unlabeled documents is computed. Depending upon the measure of similarity, the
documents are assigned either a positive or a negative sentiment. This training dataset is
then fed to a Naïve Bayes classifier for training.

2.5 Levels of Sentimental Analysis:

A. Document Level Sentiment Analysis:

Document-level sentiment analysis is performed on the whole document; the basic unit of
information is a single document of opinionated text. In this type of document-level
classification, a single review about a single topic is considered. But in the case of forums or
blogs, comparative sentences may appear, and customers may compare one product with another
that has similar characteristics; hence document-level analysis is not desirable for forums and
blogs. While doing document-level classification, irrelevant sentences must be eliminated in the
preprocessing phase. For document-level classification, both supervised and unsupervised
machine learning classification methods are used. Supervised machine learning algorithms such
as Support Vector Machines (SVM), Naïve Bayes, KNN and Maximum Entropy can be used
to train the system. For the training and testing datasets, the reviewer rating (in the form of 1-5
stars) and the review text can be used. The features that can be used for machine learning are
term frequency, document frequency, the tf-idf measure, part-of-speech tagging, opinion words,
opinion phrases, negations and dependencies. Manually labeling the polarities of documents is a
time-consuming task, and hence the available user ratings can be made use of. Unsupervised
machine learning can be done by extracting the opinion words inside a document; pointwise
mutual information can be used to find the semantics of the extracted words [4].

B. Sentence Level Sentiment Analysis:

Sentence-level sentiment analysis is concerned with finding, for each sentence, whether it
expresses a positive, negative or neutral sentiment. It is closely related to subjectivity
classification. Here, the polarity of each sentence is calculated, and the same methods used for
document-level classification are applied to the sentence-level classification problem. The
objective and subjective sentences must then be found out. The subjective sentences contain
opinion words which help in determining the sentiment about the entity. After that, polarity
classification is done into positive, negative and neutral classes [4].

C. Entity or Aspect Level Sentiment Analysis:

Entity or aspect level sentiment analysis performs finer-grained analysis. The goal is to
find the sentiment on entities or on aspects of those entities. For example, consider the statement
“My Nokia Lumia 510 cell phone has good picture quality but it has less battery backup.” Here
the opinion on the phone's camera and display quality is positive, but the opinion on its
battery backup is negative. We can generate a summary of opinions about entities. Comparative
statements are also part of entity or aspect level sentiment analysis, but are dealt with by the
techniques of comparative sentiment analysis [4].

D. Phrase Level Sentiment Analysis

In phrase-level sentiment classification, the phrases that contain opinion words are found out
and a phrase-level classification is done. This may be advantageous or disadvantageous. It is
advantageous where the exact opinion about an entity can be correctly extracted; but in other
cases, where contextual polarity matters, the result may not be accurate. The negation of
words, however, tends to occur locally, and in such cases this type of sentiment analysis
suffices [4].

E. Feature Level Sentiment Analysis

Product features are considered as product attributes. Analysis of these features for identifying
the sentiment of the document is called feature-based sentiment analysis. In this approach, a
positive, negative or neutral opinion is identified from the extracted features. It is the finest-
grained analysis model among all the models [4].

Data Sources for data classification:

A. Review Sites

A review site is a website where users can post reviews giving their opinions about people,
businesses, products, services and other entities. Most of the sentiment analysis work has
been done on movie and product review sites. The review data used in most sentiment
classification studies is collected from e-commerce websites like www.amazon.com
(product reviews), www.yelp.com (restaurant reviews), www.CNET download.com (product
reviews) and www.reviewcentre.com, which host millions of product reviews by customers.
Besides these, there are professional review sites such as
www.dpreview.com and www.zdnet.com, and customer opinion sites on broad topics and products
such as www.consumerreview.com, www.epinions.com and www.bizrate.com [4].

B. Blogs

With the increasing usage of the Internet, blogging and blog posts are growing rapidly [7]. The
term blog refers to a webpage consisting of brief paragraphs of opinion, information, personal
diary entries, or links, called posts, which are arranged chronologically with the most recent
first, in the style of an online journal [8]. Sentiment analysis on blogs [9] has been used to
predict product sales, movie sales and political mood, and in many other studies related to
sentiment analysis.

C. Forums

Forums or message boards allow their members to hold conversations by posting on the site.
These are dedicated to a topic, and thus using forums as a data source allows us to do sentiment
analysis in a single domain.

D. Datasets

Most of the work in the field uses movie review and product review data for classification.
The movie review dataset is available online at
http://www.cs.cornell.edu/People/pabo/movie-review-data [7]. Another dataset available
online is the multi-domain sentiment (MDS) dataset (Blitzer et al., 2007)
(http://www.cs.jhu.edu/mdredze/datasets/sentiment) [10]. These datasets have been used in
several studies (Zhu Jian, 2010; Pang and Lee, 2004; Bai et al., 2005; Kennedy and Inkpen,
2006; Zhou and Chaovalit, 2008; Yulan He, 2010; Rudy Prabowo, 2009; Rui Xia, 2011) [7].

The approach we are considering in our project is the dataset approach. The dataset is
collected from the Kaggle website: it was originally published by the Stanford Network
Analysis Project and was later released on Kaggle for learning purposes. The
dataset is considered in the SQLite format. It consists of Amazon fine food
reviews collected over the past 13 years, a total of 568,454 records.

E. Micro-blogging

Twitter is a popular micro-blogging service where users create or write status messages called
"tweets". These tweets express opinions about different topics and are also used as a data
source for classifying sentiment.

2.6 Different approaches for sentiment analysis:

A. Natural Language Processing

Natural language processing is the branch of computer science and technology focused on
developing systems that allow computers to communicate with people using natural language.
NLP techniques play an important role in getting accurate sentiment analysis. Techniques
like bag of words, Hidden Markov Models (HMM), part-of-speech (POS) tagging, n-gram
algorithms, large sentiment lexicon acquisition and parsing are used to express opinion at the
document level, phrase level, sentence level and aspect level [12,13]. Large sentiment lexicon
acquisition uses a sentiment word dictionary which contains many sentiment words with their
numeric threshold values for a particular domain [14]. The SentiWordNet dictionary is used for
subjective sentiment analysis. Part-of-speech (POS) tagging is often the most time-consuming
and challenging task before doing sentiment analysis of any document. As online textual
reviews are short, ungrammatical sentences containing slang, abbreviations and symbols,
POS tagging becomes even more difficult. For example, consider the statement “The
camera is good. I love its picture quality.” Here, “camera” is referred to as a product and “picture
quality” as a feature. We know that products and features are tagged as nouns, and we can
define synonym lists of products and features. Maintaining such lists is difficult because of
uncertain and ungrammatical online reviews. For example, consider the following comment: “I
like the high res”. Here “res” refers to resolution, and resolution is similar to graphics.
Sometimes textual reviews may contain mixed sentiment, for example, “I like the graphics,
but it takes battery a lot”. Since we are doing feature-based sentiment analysis, it is easy to
handle such reviews: in this case, the sentiment is positive for “graphics” and negative for
“battery”. For this, CLASSIFIER, CONCEPT, CONCEPT_RULE and PREDICATE_RULE
rules can be used [6].
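
As a sketch of the POS tagging step on the example sentence above, NLTK's tagger (assuming its models have been downloaded) marks "camera" and "quality" as nouns:

import nltk
# One-time downloads, if not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The camera is good. I love its picture quality.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('camera', 'NN'), ('is', 'VBZ'), ('good', 'JJ'), ...]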

B. Machine Learning Techniques

Machine learning techniques are the most useful techniques for sentiment classification,
categorizing text into positive, negative or neutral categories. In machine learning techniques,
training and testing datasets are required: the training dataset is used to learn from the
documents, and the test dataset is used to validate the performance. There are a number of
machine learning algorithms used to classify reviews, of two types: supervised machine
learning algorithms such as Maximum Entropy, SVM, Naïve Bayes, KNN, etc., and
unsupervised machine learning algorithms such as HMM, neural networks, PCA, ICA,
SVD, etc.

i. Naïve Bayes

Naïve Bayes is a simple yet effective classification algorithm. It is mostly used for
document-level classification. The basic idea is to calculate the probabilities of categories given
a test document by using the joint probabilities of words and categories. Naïve Bayes is optimal
for certain problem classes with highly dependent features. Naïve Bayes classifiers are
computationally fast when taking decisions and do not require large amounts of data before
learning can begin [15].

ii. Support Vector Machine

SVM is a discriminative classifier considered to be among the best text classification methods.
It is a statistical classification method proposed by Vapnik. SVM maps input (real-valued)
feature vectors into a higher-dimensional feature space through some nonlinear mapping. SVMs
are developed on the principle of structural risk minimization, which seeks to find a hypothesis
(h) with the lowest probability of error, whereas traditional learning techniques for pattern
recognition are based on the minimization of empirical risk, which attempts to optimize
performance on the learning set. Computing the hyperplane to separate the data points, i.e.
training an SVM, leads to a quadratic optimization problem. SVMs can learn a larger set of
patterns and scale better, because classification complexity does not depend on the
dimensionality of the feature space. SVMs can also update the training patterns dynamically
whenever there is a new pattern during classification [16].

iii. k-Nearest Neighbor

KNN is a classifier that relies on the category labels attached to the training documents that are
similar to the test document. It classifies an object based on the majority class amongst
its k nearest neighbors. It is a type of lazy learning where the function is only approximated
locally and all computation is deferred until classification [17]. The KNN algorithm usually uses
the Euclidean or the Manhattan distance; however, any other distance, such as the Chebyshev
norm or the Mahalanobis distance, can also be used [18].
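
The principle can be sketched in a few lines of NumPy, using toy two-dimensional points, Euclidean distance and a majority vote:

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point x to every training point.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Take the labels of the k nearest points and return the majority class.
    nearest_labels = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest_labels).argmax()

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.95, 1.0])))  # prints 1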

iv. Winnow

Winnow is a well-known online, mistake-driven technique. It works by updating its weights in a
sequence of trials: on each trial, it first makes a prediction for one document and then
receives feedback; if a mistake is made, it updates its weight vector using the document.
During the training phase, a collection of training data can be processed repeatedly by
iterating over the data several times.

C. Decision Tree Learning

Decision tree learning is a tree-based approach, built from a root node and collections of child
nodes, which focuses on the target value. It has a flowchart-like structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the test, and leaf
nodes represent classes or class distributions [21]. The popular decision tree algorithms are
ID3, C4.5 and CART. The ID3 algorithm is considered a very simple decision tree algorithm;
it uses information gain as its splitting criterion. C4.5 is an evolution of ID3 and uses gain
ratio as its splitting criterion [22]. The CART algorithm uses the Gini coefficient for attribute
selection, each time selecting the attribute with the smallest Gini coefficient as the test attribute
for a given set [23].

Approaches involved in this project:

The need for the classification of reviews is to help Amazon detect the number of positive and
negative reviews received from customers for a specific product. The reviews then act as a
feedback mechanism for the organization, helping it adjust its marketing strategies.

The techniques involved here to perform the classification are data cleaning and text
preprocessing from data mining, together with supervised machine learning. This combination
is widely used in the market and is found to be a very efficient way to classify data taken from
a huge dataset.

The sentiment analysis is done at the phrase level. The reviews are taken from the dataset, and
from all the textual data only the necessary words are kept, thereby shrinking each sentence
into a phrase. The phrase may be meaningless when read as a sequence, but the words of the
sentence are what the classification uses.

The dataset considered for this project is a collection of Amazon fine food reviews from the
past 13 years, 568,454 records in total, present in .sqlite format. It consists of reviews which
may contain plain text, URLs of images or other references, HTML tags (since the reviews are
collected online) and many unnecessary stop words. The data has 10 attributes, of which
‘Score’ is taken as the class attribute. The data also has a helpfulness numerator and a
helpfulness denominator, which are used to exclude some of the reviews from consideration.

The machine learning approach considered for classifying the data is the K-Nearest Neighbors
classification algorithm. Each record is considered as a point in the search space and the
distances between points are computed. Out of the k nearest neighbors (k being odd), the class
to which the majority of the points belong is selected, and the new point is assigned to that class.

CHAPTER 3
SYSTEM ANALYSIS

3.1 Problem statement:

The problem being addressed in this project is the poor quality of Amazon reviews at the top
of the forum despite the “helpfulness” rating system. The problem arises from the “free pass”
given to new reviews to be placed at the top of the forum, for a chance to be rated by the
community. The proposed solution to this problem is to use machine learning techniques to
design a system that “pre-rates” new reviews on their “helpfulness” before they are given a
position at the top of the forum. This way, poor quality reviews will be less likely to be
shown at the top of the forum, as they do not get the “free pass” just for being new. The
proposed system will use a set of Amazon review data to train itself to predict a helpfulness
classification (helpful, or not helpful) for new input data [1].

This is a supervised learning problem where we need to predict the positive or negative target
variable for each review. The goal will be to maximize the accuracy of this classification. We
will train our model on a dataset containing thousands of reviews presented as unstructured
text. Each review will be labeled as positive or negative.

To solve this problem we will perform the following tasks:

1. Preprocess the data.

2. Train the classification model and tune its hyperparameters.

3. Test the accuracy of the model on the testing set.

3.2 Proposed System:

The project provides the organization a way to solve the classification problem. It uses the
concepts of data mining, machine learning and supervised learning. The main objective of the
project is to classify the reviews from the dataset into either positive or critical reviews.

3.3 Proposed Logic:

The data is first collected from all the customers into a dataset. The dataset consists of all the
reviews, which contain textual data including URLs, HTML tags, etc. Hence the data is first
taken through the data cleaning phase.

Data cleaning is further divided into two parts: data de-duplication and text preprocessing.

Data de-duplication is a phase where reviews that have been recorded more than once with the
same timestamp are removed. The data has the attributes helpfulness numerator and
helpfulness denominator; records where the numerator is greater than the denominator are
inconsistent, so those reviews are removed and not taken into consideration. Only reviews
with scores 1, 2, 4 or 5 are considered; reviews with score 3 are considered neutral and are
dropped. All the reviews with scores 1 and 2 are labeled 0, and the reviews with scores 4 and 5
are labeled 1. This process is known as partitioning.

Text preprocessing is the process of refining the textual data in the reviews. Since the reviews
are taken from the Amazon web application, they may contain many HTML tags and URLs, so
all URLs and HTML tags are removed from the reviews. The reviews also contain many
spelling errors and words with apostrophes; such data cannot be used for classification, so
contractions are expanded. Data containing special symbols and numbers is also not useful
for the classification, so special symbols and words of length two or less are removed from
consideration. The code also maintains a list of 184 stop words, words that occur in the
reviews but do not add much weight to the classification; whenever these
words are encountered in the reviews, they are removed.
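
The report's full 184-word stop-word list appears only in the screenshots, so the sketch below uses a small illustrative list; the filtering step itself works the same way:

# Illustrative stop-word filtering; the actual 184-word list is not reproduced here.
STOP_WORDS = {"the", "is", "a", "an", "and", "it", "this", "was"}

def remove_stopwords(text):
    # Drop stop words and, as described above, any word of length two or less.
    kept = [w for w in text.split() if w not in STOP_WORDS and len(w) > 2]
    return " ".join(kept)

print(remove_stopwords("this is a very tasty snack and it was cheap"))
# -> "very tasty snack cheap"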

The data cleaning process is then followed by featurization. Featurization is a class which is
used to perform the bigram analysis by taking all the possible two-word combinations from the
cleaned, shortened reviews. The bigram partition is done on all the reviews, which are then
classified into either positive or critical reviews. But the accuracy of the bigram analysis is low
once the time it takes is taken into consideration. Hence we use another approach to solve the
classification problem, K-Nearest Neighbors classification. The algorithm can be implemented
in two ways: one is the brute force approach, and the other is the kd-tree approach.

The brute force approach goes through each and every possibility for the review and classifies
it as either positive or critical. The accuracy given by the system is approximately 84 percent,
but the brute force algorithm takes a lot of computational time to classify the reviews. Hence,
the concept of the kd-tree (K-Dimensional tree) is introduced into the project.

The kd-tree approach is used to know beforehand which data lies closer to the point to be
classified. Hence, the time taken by the system to classify the reviews is relatively low when
compared to the brute force approach. The kd-tree approach also gives almost the same
accuracy as the brute force approach, but takes much less computational time when compared
to the brute force approach.
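
The trade-off between the two settings can be checked by timing both on the same data; the sketch below uses synthetic vectors purely for illustration, since actual query times depend heavily on the dataset and its dimensionality:

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 50-dimensional review vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 50))
y = rng.integers(0, 2, size=20000)

for algo in ("brute", "kd_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    start = time.time()
    knn.predict(X[:1000])  # time the query phase only
    print(algo, "query time: %.2f s" % (time.time() - start))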

So far, the reviews are partitioned based only on the words present in them; the semantic
meaning of the reviews is not considered. Hence, if a review is encountered that is written with
words outside the bag of words being considered, it may be classified incorrectly. This is where
the concept of Word2Vec is used. This class represents every word that occurs at least three
times in the dataset as a point in a 50-dimensional feature space. When a new word arises, it is
also represented as a point in the feature space. When a vector is computed over the words
present in a review, a representation of the sentence is formed which provides semantic
meaning to the system. This is known as average Word2Vec, where a sentence is recognized
through its semantic meaning.

Word2Vec is implemented using both the brute force approach and the kd-tree approach. Both
give approximately the same accuracy, with a huge difference in computational cost.

3.4 Feasibility Study:

The project involves heavy computation in its algorithms, so the system must be capable of
performing the required computation. The technical requirements of the system are high
compared to an average computer. The operating system used by us is Windows 10
Enterprise Edition. The system used has 8 GB of RAM and a 128 GB SATA drive, and is also
equipped with a GTX 1050 GPU, which helps in performing the computation faster.

The project is not very costly: the only costs are the cost of building the machine and the pay
for the data analyst.

Since the algorithm has to go through each and every bigram formed from the reviews, the
system takes considerable computational time. Benchmarks recorded while running different
test cases show that the program takes 10 minutes to classify 3,500 records, 15 minutes for
8,000 records, 1 hour for 1.46 lakh records and 7 hours for all the reviews present in the
dataset.

CHAPTER 4
SOFTWARE REQUIREMENT SPECIFICATION

4.1 Introduction
A Software Requirements Specification (SRS) for a software system is a
complete description of the behaviour of the system to be developed. It includes a set of use
cases that describe all the interactions the users will have with the software. In addition to use
cases, the SRS also contains non-functional requirements, which impose constraints on the
design or implementation (such as performance engineering requirements, quality standards,
or design constraints).
A system requirements specification is a structured collection of information that embodies the
requirements of a system. A business analyst, sometimes titled system analyst, is responsible
for analysing the business needs of clients and stakeholders to help identify business
problems and propose solutions. Within the systems development life cycle domain, the
business analyst typically performs a liaison function between the business side of an
enterprise and the information technology department or external service providers. Projects
are subject to three sorts of requirements:
● Business requirements describe in business terms what must be delivered or accomplished
to provide value.
● Product requirements describe properties of a system or product (which could be one of
several ways to accomplish a set of business requirements).
● Process requirements describe activities performed by the developing organization.
For instance, process requirements could specify the methodologies that must be
followed and the constraints that the organization must obey.
● Product and process requirements are closely linked: process requirements often specify
the activities that will be performed to satisfy a product requirement. For example, a maximum
development cost requirement (a process requirement) may be imposed to help achieve a
maximum sales price requirement (a product requirement), and a requirement that the product
be maintainable (a product requirement) often is addressed by imposing requirements to follow
particular development styles.
4.2 Purpose:
In systems engineering, a requirement can be a description of what a system must
do, referred to as a functional requirement. This type of requirement specifies something that
the delivered system must be able to do. Another type of requirement specifies something
about the system itself, and how well it performs its functions. Such requirements are often
called non-functional requirements, 'performance requirements' or 'quality of service
requirements.' Examples of such requirements include usability, availability, reliability,
supportability, testability and maintainability.

A collection of requirements defines the characteristics or features of the desired system. A
'good' list of requirements as far as possible avoids saying how the system should implement
the requirements, leaving such decisions to the system designer. Specifying how the system
should be implemented is called "implementation bias" or "solution engineering". However,
implementation constraints on the solution may validly be expressed by the future owner, for
example for required interfaces to external systems, for interoperability with other systems,
and for commonality (e.g. of user interfaces) with other owned products.

4.3 Functional Requirements:

● Load data
● Data analysis
● Data Pre-processing
● Model building
● Prediction

4.4 Non-Functional Requirements:

The major non-functional requirements of the system are as follows.

Usability
The system is designed as a completely automated process, hence there is little or no user
intervention.

Reliability
The system is reliable because of the qualities inherited from the chosen platform,
Python; code built using Python is more reliable.

Performance
The system is developed in high-level languages and using advanced technologies.

Supportability
The system is designed to be cross-platform, supported on a wide range of hardware and
software platforms.

Implementation
The system is implemented in a Jupyter Notebook, and all the required packages are
downloaded and imported.

4.5 Hardware Requirements:

• RAM: 4 GB and higher
• Processor: Intel i3 and above
• Hard Disk: 500 GB minimum

4.6 Software Requirements:

• OS: Windows or Linux
• Python IDE: Python 2.7.x and above
• Jupyter Notebook
• Setuptools and pip installed for Python 3.6 and above
• Language: Python scripting

CHAPTER 5
SYSTEM DESIGN

Fig. 5.1 Block Diagram

Fig. 5.2 Data Flow Diagram.

Fig. 5.3 Class Diagram

Fig. 5.4 Object Diagram

Fig. 5.5 Use Case Diagram

Fig. 5.6 Activity Diagram

Fig. 5.7 State chart Diagram

Fig. 5.8 Sequence Diagram

Fig. 5.9 Collaboration Diagram

Fig. 5.10 Component Diagram

CHAPTER 6
IMPLEMENTATION
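
The listings below are reproduced from a Jupyter notebook. For them to run, the notebook needs roughly the following imports; the exact import cell is not shown in the report, so this preamble is a reconstruction based on the names used in the code:

import sqlite3
import re
import numpy as np
import pandas as pd
import seaborn as sns
from bs4 import BeautifulSoup
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, cross_val_score
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)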

Read Data Class:

filtered_data = None

class ReadData(object):

    def extract(self, limit):
        '''
        Used to extract a specific number of records from the dataset.
        args:
            limit: desired integer value, less than the size of the dataset
        '''
        # Reviews with Score == 3 are neutral and are excluded up front.
        con = sqlite3.connect('database.sqlite')
        filtered_data = pd.read_sql_query("""SELECT * FROM Reviews WHERE Score != 3
            LIMIT """ + str(limit), con)
        print(filtered_data.shape)
        print("\nPrinting Top 5 Records of Data:\n\n")
        print(filtered_data.head(5))
        return filtered_data

    def partition(self, x):
        # Scores 1 and 2 become the negative class (0); 4 and 5 become positive (1).
        if x < 3:
            return 0
        return 1

    def get_filtered_data(self, filtered_data):
        filtered_data['Score'] = filtered_data['Score'].map(self.partition)
        return filtered_data

Deduplication Class:

class Deduplication(object):

    def sort_data(self, filtered_data):
        # Sorting by ProductId groups duplicate postings of the same review together.
        sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True,
                                                inplace=False, kind='quicksort',
                                                na_position='last')
        return sorted_data

    def deduplicate(self, sorted_data):
        # Keep only the first of any reviews sharing user, profile, timestamp and text.
        final_data = sorted_data.drop_duplicates(
            subset={"UserId", "ProfileName", "Time", "Text"},
            keep='first', inplace=False)
        return final_data

    def removeUselessData(self, x):
        # A helpfulness numerator larger than its denominator is inconsistent data.
        x = x[x.HelpfulnessNumerator <= x.HelpfulnessDenominator]
        return x

Text Pre-processing Class:

size = final_data.shape[0]

class TextPreprocessing(object):

    def removeURL(self, x):
        i = 0
        for index, row in x.iterrows():
            # Drop anything that looks like an http(s) link.
            x['Text'].values[i] = re.sub(r"http\S+", "", x['Text'].values[i])
            i += 1
        return x

    def removeHtml(self, x):
        i = 0
        for index, row in x.iterrows():
            # BeautifulSoup strips the HTML tags and keeps only the visible text.
            soup = BeautifulSoup(x['Text'].values[i], 'html.parser')
            x['Text'].values[i] = soup.get_text()
            i += 1
        return x

    def decontracted(self, x):
        i = 0
        for index, row in x.iterrows():
            phrase = x['Text'].values[i]
            # specific
            phrase = re.sub(r"won't", "will not", phrase)
            phrase = re.sub(r"can\'t", "can not", phrase)
            # general
            phrase = re.sub(r"n\'t", " not", phrase)
            phrase = re.sub(r"\'re", " are", phrase)
            phrase = re.sub(r"\'s", " is", phrase)
            phrase = re.sub(r"\'d", " would", phrase)
            phrase = re.sub(r"\'ll", " will", phrase)
            phrase = re.sub(r"\'t", " not", phrase)
            phrase = re.sub(r"\'ve", " have", phrase)
            phrase = re.sub(r"\'m", " am", phrase)
            x['Text'].values[i] = phrase
            i += 1  # advance to the next review
        return x

    def removeNumSS(self, x):
        i = 0
        for index, row in x.iterrows():
            # Remove words containing digits, then any remaining special characters.
            x['Text'].values[i] = re.sub("\S*\d\S*", "", x['Text'].values[i]).strip()
            x['Text'].values[i] = re.sub('[^A-Za-z0-9]+', ' ', x['Text'].values[i])
            i += 1
        return x

Class Featurization:

class Featurization(object):

    def apply_brute(self):
        # %time is a Jupyter line magic that reports execution time (notebook only).
        %time
        knn = KNeighborsClassifier(algorithm='brute')
        param_grid = {'n_neighbors': np.arange(1, 100, 2)}  # params we need to try on the classifier
        tscv = TimeSeriesSplit(n_splits=10)  # for time-based splitting
        gsv = GridSearchCV(knn, param_grid, cv=tscv, verbose=1)
        gsv.fit(X_train, y_train)
        print("Best HyperParameter: ", gsv.best_params_)
        print("Best Accuracy: %.2f%%" % (gsv.best_score_ * 100))
        return gsv.best_params_

    def apply_knn(self, ngh):
        # Testing accuracy on the test data
        knn = KNeighborsClassifier(n_neighbors=int(ngh))
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        print("Accuracy on test set: %0.3f%%" % (accuracy_score(y_test, y_pred) * 100))
        print("Precision on test set: %0.3f" % (precision_score(y_test, y_pred)))
        print("Recall on test set: %0.3f" % (recall_score(y_test, y_pred)))
        print("F1-Score on test set: %0.3f" % (f1_score(y_test, y_pred)))
        print("Confusion Matrix of test set:\n [ [TN FP]\n [FN TP] ]\n")
        df_cm = pd.DataFrame(confusion_matrix(y_test, y_pred), range(2), range(2))
        sns.set(font_scale=1.4)  # for label size
        sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='g')

    def apply_kdtree(self):
        # %time is a Jupyter line magic that reports execution time (notebook only).
        %time
        knn = KNeighborsClassifier(algorithm='kd_tree')
        param_grid = {'n_neighbors': np.arange(1, 100, 2)}  # params we need to try on the classifier
        tscv = TimeSeriesSplit(n_splits=10)  # for time-based splitting
        gsv = GridSearchCV(knn, param_grid, cv=tscv, verbose=1)
        gsv.fit(X_train, y_train)
        print("Best HyperParameter: ", gsv.best_params_)
        print("Best Accuracy: %.2f%%" % (gsv.best_score_ * 100))
        return gsv.best_params_

    def apply_kdtree_knn(self, kneigh):
        # Testing accuracy on the test data
        knn = KNeighborsClassifier(n_neighbors=kneigh, algorithm='kd_tree')
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        print("Accuracy on test set: %0.3f%%" % (accuracy_score(y_test, y_pred) * 100))
        print("Precision on test set: %0.3f" % (precision_score(y_test, y_pred)))
        print("Recall on test set: %0.3f" % (recall_score(y_test, y_pred)))
        print("F1-Score on test set: %0.3f" % (f1_score(y_test, y_pred)))
        print("Confusion Matrix of test set:\n [ [TN FP]\n [FN TP] ]\n")
        df_cm = pd.DataFrame(confusion_matrix(y_test, y_pred), range(2), range(2))
        sns.set(font_scale=1.4)  # for label size
        sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='g')

Applying Word2Vec:

# List of sentences (token lists) in the X_train text
sent_of_train = []
for sent in X_train:
    sent_of_train.append(sent.split())

# List of sentences in the X_test text
sent_of_test = []
for sent in X_test:
    sent_of_test.append(sent.split())

# Train your own Word2Vec model using your own train text corpus.
# min_count=3 considers only words that occurred at least 3 times.
# ('size' is the gensim < 4.0 keyword; newer versions call it 'vector_size'.)
w2v_model = Word2Vec(sent_of_train, min_count=3, size=50, workers=4)

w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred a minimum of 3 times ", len(w2v_words))

# Compute the average word2vec for each review in X_train.
train_vectors = []
for sent in sent_of_train:
    sent_vec = np.zeros(50)
    cnt_words = 0
    for word in sent:
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    train_vectors.append(sent_vec)

# Compute the average word2vec for each review in X_test.
test_vectors = []
for sent in sent_of_test:
    sent_vec = np.zeros(50)
    cnt_words = 0
    for word in sent:
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    test_vectors.append(sent_vec)

print("Done with the compute")

### Brute Force Implementation ###

%time
# Creating an odd list of K values for KNN.
myList = list(range(0, 50))
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# Empty list that will hold the cross-validation scores.
cv_scores = []

# Perform 3-fold cross validation for each candidate k.
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
    scores = cross_val_score(knn, train_vectors, Y_train, cv=3, scoring='accuracy', n_jobs=-1)
    cv_scores.append(scores.mean())

# Determining the best k.
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)

# ===================== KNN with k = optimal_k =====================
# Instantiate the learning model with k = optimal_k.
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='brute', n_jobs=-1)

# Fitting the model.
knn_optimal.fit(train_vectors, Y_train)

# Predict the response.
pred = knn_optimal.predict(test_vectors)

# Evaluate accuracy.
acc = accuracy_score(Y_test, pred) * 100
print('\nThe Test Accuracy of the K-NN classifier for k = %d is %f%%' % (optimal_k, acc))

# Variables that will be used for the summary table in the Conclusion.
Avg_Word2Vec_brute_K = optimal_k
Avg_Word2Vec_brute_train_acc = max(cv_scores) * 100
Avg_word2Vec_brute_test_acc = acc

### Kd-tree Implementation ###

myList = list(range(0, 50))
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# Empty list that will hold the cross-validation scores.
cv_scores = []

# Perform 3-fold cross validation for each candidate k.
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k, algorithm='kd_tree')
    scores = cross_val_score(knn, train_vectors, Y_train, cv=3, scoring='accuracy', n_jobs=-1)
    cv_scores.append(scores.mean())

# Determining the best k.
optimal_k = neighbors[cv_scores.index(max(cv_scores))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)

# ===================== KNN with k = optimal_k =====================
# Instantiate the learning model with k = optimal_k.
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k, algorithm='kd_tree', n_jobs=-1)

# Fitting the model.
knn_optimal.fit(train_vectors, Y_train)

# Predict the response.
pred = knn_optimal.predict(test_vectors)

# Evaluate accuracy.
acc = accuracy_score(Y_test, pred) * 100
print('\nThe Test Accuracy of the K-NN classifier for k = %d is %f%%' % (optimal_k, acc))

# Variables that will be used for the summary table in the Conclusion.
Avg_Word2Vec_kdTree_K = optimal_k
Avg_Word2Vec_kdTree_train_acc = max(cv_scores) * 100
Avg_Word2Vec_kdTree_test_acc = acc

CHAPTER 7
TESTING

Test cases (Name / Description / Expected Output / Actual Output / Result):

Data Deduplication
Description: Removal of duplicate reviews with the same "UserId", "ProfileName", "Time" and "Text" attributes.
Expected Output: The duplicate reviews must be dropped, keeping only the first record.
Actual Output: The duplicate reviews are dropped, keeping only the first record.
Result: Success

Remove Useless Data
Description: In the dataset, only records with Helpfulness Numerator <= Helpfulness Denominator must be retained.
Expected Output: The records with Helpfulness Numerator <= Helpfulness Denominator must be retained.
Actual Output: All the records with Helpfulness Numerator <= Helpfulness Denominator are retained.
Result: Success

Remove URLs
Description: The URLs in the "Text" attribute must be removed.
Expected Output: URLs must be removed and substituted with nothing.
Actual Output: URLs are removed and substituted with nothing.
Result: Success

Remove All the HTML Tags
Description: Use the BeautifulSoup function to remove HTML tags from the "Text" attribute.
Expected Output: All the HTML tags must be removed from "Text".
Actual Output: The HTML tags are removed from "Text".
Result: Success

De-contraction
Description: The short (contracted) strings must be de-contracted.
Expected Output: De-contraction of words.
Actual Output: De-contraction of words.
Result: Success

Removing Strings with Numerical Characters
Description: Strings with numerical characters add no value to "Text".
Expected Output: All the strings with numerical characters must be removed.
Actual Output: All the strings with numerical characters are removed.
Result: Success

Remove Stop Words
Description: Retaining stop words adds no value to "Text".
Expected Output: Stop words must be removed.
Actual Output: Stop words are removed.
Result: Success

Table 7.1 Test Cases
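
These cases were verified manually from the notebook output; as an illustration, the "Remove Useless Data" case could also be automated with a small pandas assertion like the sketch below (the sample frame is made up):

import pandas as pd

def test_remove_useless_data():
    # Toy frame: the second record is inconsistent (numerator > denominator).
    df = pd.DataFrame({"HelpfulnessNumerator": [1, 5],
                       "HelpfulnessDenominator": [2, 3]})
    cleaned = df[df.HelpfulnessNumerator <= df.HelpfulnessDenominator]
    # Exactly the one consistent record should survive.
    assert len(cleaned) == 1
    assert (cleaned.HelpfulnessNumerator <= cleaned.HelpfulnessDenominator).all()

test_remove_useless_data()
print("Remove Useless Data: Success")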

CHAPTER 8
SCREENSHOTS

8.1 Reading data from the dataset

8.2 Records read from the dataset

8.3 Value of shape of the data before removing useless records

8.4 Value of shape of the data after removing useless records

8.5 Removing URLs from the data

8.6 Removing HTML tags from the data reviews

8.7 De-contraction of reviews in the data

8.8 After de-contraction of data

8.9 Removing Numbers and Special Characters

8.10 Stop Words Considered

8.11 After removing Stop Words from the reviews

8.12 Applying Bi-Gram partition on the reviews

8.13 Applying Brute Force algorithm on the reviews

8.14 After applying KNN using Brute Force approach

8.15 Word2Vec using Brute Force approach

8.16 Accuracy vs Neighbors graph using Brute Force

8.17 K-NN using kd-tree approach

8.18 Applying K-NN algorithm using kd-tree approach

8.19 Accuracy vs Neighbors graph using Kd-Tree approach

8.20 Implementing Word2Vec using Kd-Tree approach

8.21 Accuracy using Kd-Tree Implementation

CHAPTER 9
CONCLUSION

The reviews were taken from the dataset provided by Amazon and classified into either
positive or critical reviews. The raw reviews collected from the web application were subjected
to data cleaning and text preprocessing: URLs, HTML tags, useless reviews, special symbols,
stop words and contracted words are removed from the reviews in this process. The bi-gram
partition is performed on the data after data cleaning; the bi-gram analysis gives less accuracy
when compared to the machine learning approach. K-Nearest Neighbors was first implemented
using the brute force approach, which goes through every possibility and consumes a lot of
computation time. The algorithm was then modified to use the kd-tree, where the nearby data
is known beforehand, thereby reducing the complexity of the algorithm. Semantic
understanding was added to the model using the concept of Word2Vec, so the reviews can now
also be classified based on their semantic meaning.

CHAPTER 10
FUTURE ENHANCEMENTS

The reviews have been classified only based on their textual data. If a textual review
contradicts the image data attached to it, ambiguity arises in the classification and may lead
to the review being classified incorrectly. The data considered in this project is only textual.
The algorithm can be further extended by making use of advanced machine learning concepts
such as convolutional neural networks to classify the images that customers post with their
reviews as feedback. This would help the organization better understand the reviews and
process the feedback according to its marketing strategies. The algorithm could also be
implemented using logistic regression, which gives better accuracy when compared to the
K-Nearest Neighbors algorithm. The classification can be made more accurate by improving
the data cleaning algorithms: more efficient data cleaning reduces the data being considered,
reducing the time complexity and increasing the accuracy.

CHAPTER 11
REFERENCES

[1] Aashutosh Bhatt, Ankit Patel, Harsh Chheda, Kiran Gawande, "Amazon Review
Classification and Sentiment Analysis", Computer Department, Sardar Patel Institute of
Technology.
[2] https://t-lanigan.github.io/amazon-review-classifier/
[3] https://mc.ai/amazon-fine-food-reviews-case-study-from-scratch/
[4] https://www.ijcseonline.org/pub_paper/32-IJCSE-00858.pdf (literature survey)
[5] https://arxiv.org/ftp/arxiv/papers/1512/1512.01043.pdf (literature survey)
[6] https://www.ijcaonline.org/research/volume125/number3/dandrea-2015-ijca-905866.pdf
(literature survey)
