
Mini Project Report

Entitled

Product Review Analysis and Prediction


Submitted to the Department of Electronics Engineering in Partial Fulfilment of the
Requirements for the Degree of

Bachelor of Technology
(Electronics and Communication)

: Presented & Submitted By :

Kuldeep Joshi and Viprav Patel


Roll No. (U20EC143 and U20EC158)
B. TECH. VI (EC), 6th Semester

: Guided By :

Dr. Kishor Upla


Assistant Professor, SVNIT

(Year: 2022-23)

DEPARTMENT OF ELECTRONICS ENGINEERING


SARDAR VALLABHBHAI NATIONAL INSTITUTE OF TECHNOLOGY
Surat-395007, Gujarat, INDIA.
Sardar Vallabhbhai National Institute Of Technology
Surat - 395 007, Gujarat, India

DEPARTMENT OF ELECTRONICS ENGINEERING

CERTIFICATE
This is to certify that the Mini-Project Report entitled “Product Review Analysis
and Prediction” is presented & submitted by Kuldeep Joshi and Viprav Patel, bearing
Roll Nos. U20EC143 and U20EC158, of B.Tech. VI, 6th Semester, in partial
fulfillment of the requirements for the award of the B.Tech. Degree in Electronics &
Communication Engineering for the academic year 2022-23.
They have successfully and satisfactorily completed their Mini-Project in all
respects. We certify that the work is comprehensive, complete and fit for evaluation.

Dr. Kishor Upla


Assistant Professor & Project Guide
Table of Contents

Abstract
Table of Contents
List of Figures
Chapters
References
0.1 Introduction
0.1.1 Product review using machine learning
0.1.2 Algorithms used
0.1.3 Tools and Libraries used
0.1.4 Code for Product review Analysis
0.1.5 Conclusion

List of Figures

1 Accuracy by Logistic Regression
2 Accuracy by SVM
3 Accuracy by KNN
4 Comparative accuracy graph

Product Review Analysis and Prediction
0.1 Introduction

0.1.1 Product review using machine learning


Product review analysis by machine learning involves using algorithms and statistical
models to analyze and extract insights from large amounts of customer feedback data,
such as online product reviews, ratings, and comments. These insights can then be used
to evaluate the overall sentiment of the reviews, identify specific product features that
customers like or dislike, and predict the likelihood of a customer purchasing the product.
The process usually involves using natural language processing (NLP) techniques, such
as sentiment analysis, topic modeling, and entity recognition, to preprocess and analyze
the text data. Machine learning algorithms can then be trained on these preprocessed
data to make predictions, such as predicting the star rating of a product based on its
features, or recommending similar products to customers based on their past purchases
and preferences.
Overall, product review analysis by machine learning can provide valuable insights
for businesses to improve their products, marketing strategies, and customer satisfaction.
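As a toy illustration of this pipeline (the reviews and labels below are invented; the
project's full code appears in Section 0.1.4), the steps map to a few lines of scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented example data: two reviews with sentiment labels (1 = positive)
reviews = ['love this product, works great', 'terrible, broke after a week']
labels = [1, 0]

X = CountVectorizer().fit_transform(reviews)  # text -> word-count features
model = LogisticRegression().fit(X, labels)   # train a sentiment classifier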

0.1.2 Algorithms used
1. Logistic regression

2. Support vector machine

3. K nearest neighbor

1) Logistic regression: Logistic regression is a statistical method used for binary
classification tasks. It is a type of regression analysis used to predict the probability
of a certain outcome, based on one or more predictor variables.

In logistic regression, the response variable is binary (either 0 or 1), and the
predictor variables can be continuous or categorical. The logistic regression model
calculates the probability of the response variable being 1, given the predictor
variables, and it uses a logistic function to map the predictor variables to the
probability of the response variable being 1.

Logistic regression is widely used in various fields, including healthcare, finance,
and marketing, for predicting binary outcomes such as disease diagnosis,
loan approval, and customer churn, respectively.
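The logistic function mentioned above can be written concretely. With x the vector of
predictor variables, w the learned weights, and b the intercept (symbols introduced
here for illustration), the model computes

$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(w^\top x + b)$

and a predicted probability above 0.5 is typically mapped to class 1, otherwise class 0.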
2) Support vector machine: Support vector machine (SVM) is a powerful and
popular machine learning algorithm used for classification and regression tasks.
It works by finding the best hyperplane that separates the data points of different
classes in a high-dimensional space.

The main idea behind SVM is to find a hyperplane that maximizes the margin
between the closest data points of different classes, known as support vectors. The
hyperplane that maximizes the margin is considered the best classifier, as it is the
one that is most likely to generalize well on new, unseen data.

SVM can handle both linear and non-linear classification problems by using
different kernel functions, such as polynomial, radial basis function (RBF), and
sigmoid. These kernel functions map the input data to a higher-dimensional
feature space, where the hyperplane can better separate the data points.

SVM is a versatile algorithm that can be used for a variety of classification
tasks, including image classification, text classification, and bioinformatics. It has
several advantages, such as being less prone to overfitting, having a solid theoretical
foundation, and producing accurate and robust results.

However, SVM can be computationally expensive for large datasets and requires
careful selection of the kernel function and hyperparameters to avoid overfitting.
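As a small illustration of the kernel choice described above (a sketch with placeholder
hyperparameters, not the project's tuned settings), scikit-learn exposes these kernels
through the kernel argument of svm.SVC:

from sklearn import svm

# Hypothetical settings for illustration; C, gamma and degree are placeholders
linear_clf = svm.SVC(kernel='linear', C=1.0)
rbf_clf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
poly_clf = svm.SVC(kernel='poly', degree=3, C=1.0)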

3) K nearest neighbor: K-nearest neighbor (KNN) is a supervised machine
learning algorithm used for classification and regression tasks. In KNN, the class
of a data point is determined by its proximity to other data points in the dataset.

The algorithm works by calculating the distance between the input data point
and all other data points in the training set. The K-nearest neighbors are the data
points in the training set that are closest to the input data point, where K is a
user-defined parameter.

For classification, the most common class among the K-nearest neighbors is
assigned to the input data point. For regression, the average or median of the target
values of the K-nearest neighbors is used as the predicted value for the input data
point.

KNN is a simple and effective algorithm, but it can be sensitive to outliers and
requires a large amount of memory to store the training set. It is commonly used
in applications such as image recognition, text classification, and recommendation
systems.
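The distance-and-vote procedure described above can be sketched in a few lines of
NumPy (the toy points below are invented for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # prints 0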
0.1.3 Tools and Libraries used

• pandas

• scikit-learn

• numpy

• matplotlib

1) Pandas: Pandas is a popular open-source data analysis and manipulation
library for the Python programming language. It provides a flexible and powerful
toolkit for working with structured data, including tabular data in the form of
spreadsheets or databases.

Some of the key features of pandas include:

Data manipulation: Pandas provides a rich set of functions for filtering, sorting,
aggregating, and transforming data.

Data cleaning: Pandas provides tools for handling missing values, removing
duplicates, and dealing with outliers.

Data exploration: Pandas enables data exploration through visualization and
statistical analysis tools.

Integration with other libraries: Pandas can be easily integrated with other
Python libraries for data analysis and visualization, such as NumPy, Matplotlib,
and Scikit-learn.

Pandas is widely used in data science, finance, social sciences, and other fields
where data analysis and manipulation are critical, and it is often combined with
other Python libraries to form a comprehensive analysis and visualization toolkit.
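A small example of the manipulation and cleaning features listed above (the DataFrame
contents are invented for illustration):

import pandas as pd

# Toy review table, invented for the example
df = pd.DataFrame({
    'review': ['Great product', None, 'Broke after a week', 'Great product'],
    'rating': [5, 3, 1, 5],
})
df = df.dropna()              # handle missing values
df = df.drop_duplicates()     # remove duplicate rows
print(df[df['rating'] >= 4])  # filter to highly rated reviews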

2) Scikit-learn: Scikit-learn (or sklearn) is a powerful Python library for
machine learning built on top of NumPy and SciPy. It provides a range of efficient
tools for supervised and unsupervised learning, including classification, regression,
clustering, and dimensionality reduction, via a consistent interface.

Some of the key features of scikit-learn include:

Easy-to-use API: Scikit-learn has a simple and intuitive API that makes it easy
to use for both beginners and experienced users.

Wide range of algorithms: Scikit-learn includes a large number of popular
machine learning algorithms such as decision trees, random forests, support vector
machines, k-nearest neighbors, and many more.

Preprocessing tools: Scikit-learn provides a variety of preprocessing tools for
feature extraction, feature scaling, and data normalization.

Cross-validation: Scikit-learn has built-in tools for cross-validation, which
allows for evaluating model performance on different subsets of the data.

Model selection: Scikit-learn includes tools for hyperparameter tuning and
model selection, which help to optimize model performance.

Overall, scikit-learn is a powerful and widely used tool for machine learning in
Python. Its ease of use, flexibility, and extensive documentation make it a popular
choice for both beginners and experienced data scientists.
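For example, the cross-validation tools mentioned above can be used as follows (a
sketch on scikit-learn's bundled iris dataset, not the project's review data):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())  # mean accuracy over 5 folds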

3) NumPy: NumPy (Numerical Python) is a popular Python library for
numerical computing that provides support for large, multi-dimensional arrays and
matrices, along with a wide range of mathematical functions to operate on them.
It is one of the fundamental libraries for scientific computing with Python.

Some of the key features of NumPy include:

N-dimensional array object: NumPy provides a powerful N-dimensional array
object that can handle large datasets efficiently.

Broadcasting: NumPy allows for efficient broadcasting of operations across
arrays with different shapes and sizes.

Mathematical functions: NumPy provides a wide range of mathematical functions
for linear algebra, Fourier transforms, random number generation, and more.

Array manipulation: NumPy includes tools for indexing, slicing, and reshaping
arrays, as well as for concatenating, splitting, and stacking them.

Integration with other libraries: NumPy is integrated with many other scientific
computing libraries in Python, including SciPy, Matplotlib, Pandas, and
scikit-learn.

Overall, NumPy is a powerful and widely used library for numerical computing
in Python, and its efficient array operations make it an essential tool for data
scientists and machine learning practitioners.
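A short example of the array operations described above:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2x3 array
print(a + 10)                         # broadcasting a scalar over the array
print(a.mean(axis=0))                 # column means: [2.5 3.5 4.5]
print(a.reshape(3, 2))                # reshaping the same data to 3x2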

4) Matplotlib: Matplotlib is a data visualization library in Python. Version
3.3.4 is one of the stable releases of the library and was released on February 22,
2021.

This version includes several improvements and bug fixes over the previous
release, including:

Improvements to the default settings for text and color handling in plots.

Better handling of errorbars in scatter plots.

Support for exporting plots in vectorized formats like SVG and PDF with
improved output quality.

Better support for interactive plotting in Jupyter notebooks.

Improvements to the layout and spacing of subplots.

Improved support for plotting time series data with datetime axes.

Overall, version 3.3.4 is a solid release that provides several important improvements
and bug fixes. It is recommended for anyone using Matplotlib to upgrade to this
version if they haven't already done so.
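A minimal sketch to check the installed release and confirm plotting works (the data
points are invented):

import matplotlib
import matplotlib.pyplot as plt

print(matplotlib.__version__)   # e.g. '3.3.4'

plt.plot([1, 2, 3], [2, 4, 1])  # invented data
plt.title('Matplotlib smoke test')
plt.show()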
0.1.4 Code for Product review Analysis

Using Logistic regression

#Importing libraries and loading the dataset
import pandas as pd
df = pd.read_csv(r'C:\Users\91798\Desktop\amazon_baby.csv')
#Getting rid of null values
df = df.dropna()
#Taking a 30% representative sample
import numpy as np
np.random.seed(34)
df1 = df.sample(frac=0.3)
#Adding the sentiments column: ratings 1-2 -> negative (0), 3-5 -> positive (1)
df1['sentiments'] = df1.rating.apply(lambda x: 0 if x in [1, 2] else 1)

X = df1['review']
y = df1['sentiments']

df.head(5)

#Logistic Regression

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.5, random_state=2)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
#Vectorizing the text data
Figure 1: Accuracy by Logistic Regression.

ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
from sklearn.linear_model import LogisticRegression
#Training the model
lr = LogisticRegression()
lr.fit(ctmTr, y_train)
#Accuracy score
lr_score = lr.score(X_test_dtm, y_test)
print("Results for Logistic Regression with CountVectorizer")
print(lr_score)
#Predicting the labels for test data
y_pred_lr = lr.predict(X_test_dtm)
from sklearn.metrics import confusion_matrix
#Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_lr).ravel()
print(tn, fp, fn, tp)
#True positive and true negative rates
tpr_lr = round(tp/(tp + fn), 4)
tnr_lr = round(tn/(tn+fp), 4)
print(tpr_lr, tnr_lr)
Figure 2: Accuracy by SVM.

Using Support Vector Machine

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.5, random_state=1)
#Vectorizing the text data
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
from sklearn import svm
#Training the model
svcl = svm.SVC()
svcl.fit(ctmTr, y_train)
svcl_score = svcl.score(X_test_dtm, y_test)
print("Results for Support Vector Machine with CountVectorizer")
print(svcl_score)
y_pred_sv = svcl.predict(X_test_dtm)
#Confusion matrix
cm_sv = confusion_matrix(y_test, y_pred_sv)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_sv).ravel()
print(tn, fp, fn, tp)
tpr_sv = round(tp/(tp + fn), 4)
tnr_sv = round(tn/(tn+fp), 4)
print(tpr_sv, tnr_sv)
Using K nearest neighbor

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.5, random_state=1)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(ctmTr, y_train)
knn_score = knn.score(X_test_dtm, y_test)
print("Results for KNN Classifier with CountVectorizer")
print(knn_score)
y_pred_knn = knn.predict(X_test_dtm)
#Confusion matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_knn).ravel()
print(tn, fp, fn, tp)
tpr_knn = round(tp/(tp + fn), 4)
tnr_knn = round(tn/(tn+fp), 4)
print(tpr_knn, tnr_knn)

Comparative accuracy of all algorithms

import matplotlib.pyplot as plt


Figure 3: Accuracy by KNN.

x_values = ['Logistic Regression', 'SVM', 'KNN']
y_values = [0.9028072227502011, 0.8909276993932305, 0.8545215293515608]

plt.plot(x_values,y_values)
plt.xlabel("Types of Method")
plt.ylabel("Accuracy")
plt.title("Comparative accuracy of all the Methods")

plt.show()
Figure 4: Comparative accuracy graph.
0.1.5 Conclusion

After implementing three different algorithms on the same dataset, we see that
Logistic Regression gave the best accuracy, followed by Support Vector Machine
(SVM), while K-nearest neighbor had the least accuracy. This difference can be
understood by the fact that logistic regression works very precisely on classifying
binary-labelled data, as was the case in our dataset.

Coming to the time to train the models: again, logistic regression took the least
amount of time to train, while SVM took a drastically longer time, and KNN took
a bit more than logistic regression.

Looking at all these observations, logistic regression may be the best method,
at least for binary-labelled data. This is because it is a very basic algorithm with
less computation, and hence it takes less time. SVM and KNN are better than
logistic regression when we have more than two categories of classification, and
especially when the dataset consists of images.
