Product Review Analysis and Prediction
Mini-Project Report
Bachelor of Technology
(Electronics and Communication)
Guided By:
(Year: 2022-23)
CERTIFICATE
This is to certify that the Mini-Project Report entitled "Product Review Analysis
and Prediction" is presented and submitted by Kuldeep Joshi and Viprav Patel,
bearing Roll Nos. U20EC143 and U20EC158, of B.Tech. VI (6th Semester), in partial
fulfillment of the requirement for the award of the B.Tech. Degree in Electronics &
Communication Engineering for the academic year 2022-23.
They have successfully and satisfactorily completed their Mini-Project in all
respects. We certify that the work is comprehensive, complete, and fit for evaluation.
List of Figures
1 Accuracy of Logistic Regression
2 Accuracy of SVM
3 Accuracy of KNN
4 Comparative accuracy graph
Product Review Analysis and Prediction
0.1 Introduction
The following algorithms were applied to the dataset:
1. Logistic Regression
2. Support Vector Machine (SVM)
3. K-Nearest Neighbors (KNN)
0.1.2 Algorithms used
1. Logistic Regression
2. Support Vector Machine (SVM)
3. K-Nearest Neighbors (KNN)
The KNN algorithm works by calculating the distance between the input data
point and all other data points in the training set. The K nearest neighbors are
the data points in the training set that are closest to the input data point, where
K is a user-defined parameter.
For classification, the most common class among the K-nearest neighbors is
assigned to the input data point. For regression, the average or median of the target
values of the K-nearest neighbors is used as the predicted value for the input data
point.
KNN is a simple and effective algorithm, but it can be sensitive to outliers and
requires a large amount of memory to store the training set. It is commonly used
in applications such as image recognition, text classification, and recommendation
systems.
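As a concrete illustration, a minimal sketch of KNN classification with scikit-learn
(one of the libraries used in this project); the toy points and the choice K = 3 are
hypothetical:

# Minimal KNN classification sketch (hypothetical toy data).
from sklearn.neighbors import KNeighborsClassifier

X_toy = [[1, 1], [1, 2], [8, 8], [9, 8]]   # training points
y_toy = [0, 0, 1, 1]                       # their class labels
knn = KNeighborsClassifier(n_neighbors=3)  # K is the user-defined parameter
knn.fit(X_toy, y_toy)
# The 3 nearest neighbors of [2, 1] are [1, 1], [1, 2], and [8, 8],
# so the majority class 0 is assigned.
print(knn.predict([[2, 1]]))               # -> [0]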
0.1.3 Tools and Libraries used
• pandas
• scikit-learn
• NumPy
• matplotlib
Data manipulation: pandas provides a rich set of functions for filtering, sorting,
aggregating, and transforming data.
Data cleaning: pandas provides tools for handling missing values, removing
duplicates, and dealing with outliers.
Data exploration: pandas enables data exploration through visualization and
statistical analysis tools.
Integration with other libraries: pandas integrates easily with other Python
libraries for data analysis and visualization, such as NumPy, Matplotlib, and
scikit-learn.
pandas is widely used in data science, finance, the social sciences, and other fields
where data analysis and manipulation are critical, often in combination with these
libraries as part of a comprehensive analysis toolkit.
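A minimal sketch of the pandas operations described above, on a small hypothetical
reviews table (the column names are assumptions, not the project's dataset):

# Hypothetical example of pandas cleaning, filtering, and exploration.
import pandas as pd

df = pd.DataFrame({
    "review": ["great product", "bad quality", None, "great product"],
    "rating": [5, 1, 3, 5],
})
df = df.dropna(subset=["review"])  # cleaning: drop rows with missing reviews
df = df.drop_duplicates()          # cleaning: remove duplicate rows
positive = df[df["rating"] >= 4]   # manipulation: filter by a condition
print(df["rating"].describe())     # exploration: summary statistics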
Overall, scikit-learn is a powerful and widely used tool for machine learning in
Python. Its ease of use, flexibility, and extensive documentation make it a popular
choice for both beginners and experienced data scientists.
Array manipulation: NumPy includes tools for indexing, slicing, and reshaping
arrays, as well as for concatenating, splitting, and stacking them.
Integration with other libraries: NumPy is integrated with many other sci-
entific computing libraries in Python, including SciPy, Matplotlib, Pandas, and
scikit-learn.
Overall, NumPy is a powerful and widely used library for numerical comput-
ing in Python, and its efficient array operations make it an essential tool for data
scientists and machine learning practitioners.
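A short sketch of the NumPy array operations described above, on hypothetical data:

# Hypothetical example of NumPy array manipulation.
import numpy as np

a = np.arange(12)                     # 1-D array [0, 1, ..., 11]
m = a.reshape(3, 4)                   # reshape into a 3x4 matrix
col = m[:, 1]                         # slicing: the second column
stacked = np.vstack([m, m])           # stacking: shape becomes (6, 4)
left, right = np.split(m, 2, axis=1)  # splitting into two 3x2 blocks
print(col, stacked.shape, left.shape)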
Matplotlib version 3.3.4 includes several improvements and bug fixes over the
previous release, including:
• Improvements to the default settings for text and color handling in plots.
• Better handling of errorbars in scatter plots.
• Support for exporting plots in vectorized formats like SVG and PDF with
improved output quality.
• Better support for interactive plotting in Jupyter notebooks.
• Improvements to the layout and spacing of subplots.
• Improved support for plotting time series data with datetime axes.
Overall, version 3.3.4 is a solid release, and upgrading is recommended for anyone
still using an earlier version. Two of these features are sketched below.
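A small sketch (hypothetical data) of datetime axes and vector-format export:

# Hypothetical example: plotting a time series and saving it as SVG.
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.date_range("2022-01-01", periods=30, freq="D")
values = range(30)

fig, ax = plt.subplots()
ax.plot(dates, values)    # datetime values get a formatted date axis
fig.autofmt_xdate()       # rotate date labels for readability
fig.savefig("trend.svg")  # vector output; "trend.pdf" also works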
0.1.4 Code for Product Review Analysis
# Select the review text and sentiment label columns
X = df1['review']
y = df1['sentiments']
df1.head(5)

# Train/test split and bag-of-words features
# (split parameters assumed; not shown in the original extract)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)

from sklearn.linear_model import LogisticRegression
# Training the model
lr = LogisticRegression()
lr.fit(ctmTr, y_train)

# Accuracy score
lr_score = lr.score(X_test_dtm, y_test)
print("Results for Logistic Regression with CountVectorizer")
print(lr_score)

# Predicting the labels for test data
y_pred_lr = lr.predict(X_test_dtm)

from sklearn.metrics import confusion_matrix
# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
tn, fp, fn, tp = cm_lr.ravel()
print(tn, fp, fn, tp)

# True positive and true negative rates
tpr_lr = round(tp / (tp + fn), 4)
tnr_lr = round(tn / (tn + fp), 4)
print(tpr_lr, tnr_lr)
Figure 2: Accuracy of SVM.
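The SVM and KNN training code does not appear in this extract. A minimal sketch,
assuming the same CountVectorizer features as above, of how the svm_score and
knn_score values used in the comparison plot below might be computed (the
estimator choices and parameters are assumptions, not the report's code):

# Assumed sketch: training SVM and KNN on the same features as above.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

svm = SVC()                                # kernel and parameters assumed
svm.fit(ctmTr, y_train)
svm_score = svm.score(X_test_dtm, y_test)  # accuracy on the test set

knn = KNeighborsClassifier(n_neighbors=5)  # value of K assumed
knn.fit(ctmTr, y_train)
knn_score = knn.score(X_test_dtm, y_test)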
# Comparative accuracy of the three methods; svm_score and knn_score
# come from the sketch above (their original code is not in this extract)
import matplotlib.pyplot as plt
x_values = ["Logistic Regression", "SVM", "KNN"]
y_values = [lr_score, svm_score, knn_score]
plt.plot(x_values, y_values, marker='o')
plt.xlabel("Types of Method")
plt.ylabel("Accuracy")
plt.title("Comparative accuracy of all the Methods")
plt.show()
Figure 4: Comparative accuracy graph.
0.1.5 Conclusion
After implementing the three algorithms on the same dataset, we see that Logistic
Regression gave the best accuracy, followed by Support Vector Machine (SVM),
with K-Nearest Neighbors (KNN) giving the lowest. This difference can be
explained by the fact that logistic regression classifies binary-labelled data very
precisely, as was the case with our dataset.
Regarding training time, Logistic Regression again took the least time, KNN took
slightly longer, and SVM took by far the longest to train.