A PROJECT REPORT
Submitted by
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
May 2021
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
Dr.M.Alamelu Ms.S.Kavitha
Head Of the Department Supervisor
Associate Professor Assistant Professor
Information Technology Information Technology
Internal Examiner External Examiner
DECLARATION
We affirm that the project work titled “Sales Prediction Analysis”, being
submitted in partial fulfillment for the award of B.Tech Information Technology,
is the original work carried out by us. It has not formed part of any other
project work submitted for the award of any degree or diploma, either in this or
any other University.
Jegan J(17BIT016)
Yogesh V(17BIT202)
ACKNOWLEDGEMENT
We express our profound gratitude to the management of Kumaraguru College
of Technology for providing the required infrastructure that enabled us to
successfully complete the project.
We extend our gratitude to our Principal, Dr. D. Saravanan, for providing us with
the necessary facilities to pursue the project.
Our sincere and hearty thanks go to the staff members of the Department of
Information Technology of Kumaraguru College of Technology for their well wishes,
timely help and support rendered to us during our project. We are greatly indebted
to our family, relatives and friends, without whom life would not have been shaped
to this level.
- Naveen kumar
- Jegan
- Yogesh
TABLE OF CONTENTS
5 SYSTEM REQUIREMENTS
5.1 HARDWARE REQUIREMENTS
5.2 SOFTWARE REQUIREMENTS
6 CONCLUSION
7 FUTURE SCOPE
8 REFERENCES
9 APPENDIX
ABSTRACT
Sales forecasting is the process of predicting future sales. It is a vital part of
the financial planning of a business, and most companies depend heavily on
predictions of future sales. Accurate sales forecasting empowers organizations to
make informed business decisions and helps them predict short-term and long-term
performance. A precise forecast avoids overestimating or underestimating future
sales, either of which may lead to great losses for a company. Past and current
sales statistics are used to estimate future performance, but traditional
forecasting methods struggle to deliver accurate forecasts. For this reason,
various machine learning techniques have been applied to the problem. In this
work, we take the Black Friday dataset and analyse it in detail. We implement
several machine learning techniques and evaluate them with different metrics. By
analysing their performance, we suggest the predictive algorithm best suited to
our problem statement.
1. INTRODUCTION
Sales play a key role in business. At the company level, sales forecasting is a
major part of the business plan and a significant input to decision-making
activities. It is essential for organizations to produce the required quantity at
the specified time, and sales forecasting gives an idea of how an organization
should manage its budgeting, workforce and resources. Forecasting helps business
management determine how many products should be manufactured, how much revenue
can be expected, and what the requirements for employees, investment and equipment
could be. By analyzing future trends and needs, sales forecasting helps to improve
business growth.
Traditional forecasting systems have drawbacks related to forecasting accuracy
and to handling enormous amounts of data. To overcome these problems, machine
learning (ML) techniques have been adopted. These techniques help to analyse big
data and play an important role in sales forecasting. Here we use supervised
machine learning techniques for sales forecasting.
2. PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on demand
prediction of sales trends. Sales forecasting is the process of estimating future
sales. Accurate sales forecasts enable companies to make informed business
decisions and predict short-term and long-term performance. Companies can base
their forecasts on past sales data, industry-wide comparisons, and economic trends.
Sales forecasts help sales teams achieve their goals by identifying early warning
signals in their sales pipeline so that they can course correct before it is too
late. The goal is to improve on the accuracy of the existing project so that
companies can increase their sales and profit. To that end, different algorithms
are compared and the most efficient one is chosen to improve the prediction
further.
3. LITERATURE SURVEY
PAPER-1:
Intelligent Sales Prediction Using Machine Learning
Techniques.
Abstract: A detailed study and analysis of comprehensible
predictive models to improve future sales predictions is carried out
in this research. Traditional forecast systems have difficulty dealing
with big data and with the accuracy of sales forecasting.
Algorithms: The models implemented for prediction are Random
Forest, Gradient Boosting and Extremely Randomized Trees (Extra
Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.
PAPER-2:
PAPER-3:
An Intelligent Model For Predicting the Sales of a Product.
Abstract: The approach shown in this paper is a systematic, accurate
and precise model-building process for computing the current scenario
and predicting the future projection of a product in the market.
Algorithms: Random forest algorithm, neural network.
Conclusion: Neural network.
PAPER-4:
Sales Prediction Using Machine Learning Algorithms.
Abstract: The aim of this paper is to propose a dimension for
predicting the future sales of Big Mart Companies keeping in view the
sales of previous years. A comprehensive study of sales prediction is
done using Machine Learning models.
Algorithms: Linear Regression, K-Neighbours Regressor, XGBoost
Regressor and Random Forest Regressor.
Conclusion: Random Forest Algorithm is found to be the
most suitable.
PAPER-5:
Comparison of Different Machine Learning Algorithms for
Multiple Regression on Black Friday Sales Data.
Abstract: This study focuses on the field of prediction models to
develop an accurate and efficient algorithm that analyzes customer
spending in the past and outputs the future spending of customers
with the same features.
Algorithms: Regression, Decision Tree, XGBoost.
Conclusion: XGBoost.
PAPER-6:
Forecasting of Walmart Sales using Machine Learning
Algorithm.
Abstract: The ability to predict data accurately is extremely valuable
in a vast array of domains such as stocks, sales, weather or even
sports. Presented here is the study and implementation of several
ensemble classification algorithms employed on sales data, consisting
of weekly retail sales numbers from different departments in Walmart
retail outlets all over the United States of America.
Algorithms: The models implemented for prediction are Random
Forest, Gradient Boosting and Extremely Randomized Trees (Extra
Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.
PAPER-7:
Sales Prediction For Big Mart.
Abstract: A retail company wants a model that can predict sales accurately
so that it can keep track of customers' future demand and update
the sale inventory in advance. In this work, we propose a technique to
optimize the parameters and select the best tuning hyperparameters,
further ensembled with XGBoost techniques, for forecasting the future
sales of a retail company such as Big Mart, and we found that our model
produces the better result.
Algorithms: XGBoost techniques.
Conclusion: Experimental analysis found that the technique produces more
accurate results.
PAPER-8:
A Deep Learning Approach for the Prediction of Retail Store
Sales.
Abstract: The purpose of this research is to construct a sales prediction
model for retail stores using the deep learning approach, which has
gained significant attention in the rapidly developing field of machine
learning in recent years. Using such a model for analysis, an approach
to store management could be formulated.
Algorithms: Logistic regression model
Conclusion: The accuracy decreased by around 13% when the logistic
regression model was used.
4. MODULES
4.3 ALGORITHMS
Linear Regression:
Linear regression is one of the most common ML and data analysis techniques. It
is used for forecasting based on a linear regression equation. This type of
regression combines a set of independent features (x) to predict the output
value (y), also called the dependent variable. The linear equation assigns each
independent variable a factor called a coefficient, represented by β.
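The linear equation described above has the form y = β0 + β1·x1 + β2·x2 + ..., where β0 is the intercept. The following is a minimal sketch of estimating the coefficients by ordinary least squares with NumPy. The toy data and the names `X_design` and `predict` are illustrative assumptions and are not taken from the project or the Black Friday dataset.

```python
import numpy as np

# Toy data (hypothetical): two independent features and a target generated
# from y = 3 + 2*x1 - 1*x2, so the fit should recover these coefficients.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 2.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient is the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: beta minimises ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict(features):
    """Predict y for a 2-D array of feature rows using the fitted beta."""
    features = np.asarray(features, dtype=float)
    return beta[0] + features @ beta[1:]
```

Calling `predict([[6.0, 1.0]])` evaluates 3 + 2·6 − 1·1 with the recovered coefficients. A library estimator such as the `LinearRegression` class used in the appendix solves the same least-squares problem.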
XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used in order to get
an efficient model with high computational speed and efficacy. The algorithm
makes predictions using an ensemble method in which each new decision tree
models the errors of the trees before it in order to improve the final
prediction. The fitted model also reports the importance of each feature in
determining the final prediction.
Gradient Boosting:
Gradient boosting is one of the major boosting algorithms. Boosting is an
ensemble technique in which successive predictors learn from the mistakes of
their predecessors. It improves weak learners by combining them into a single
prediction model. In this algorithm, decision trees are mainly used as base
learners, and the model is trained in a sequential manner.
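The sequential idea can be illustrated with a small from-scratch sketch, assuming squared-error loss and one-split decision trees (stumps) on a single feature. The function names, the toy data and the settings `n_rounds=50` and `learning_rate=0.1` are hypothetical choices for illustration. This is not the `GradientBoostingRegressor` implementation used in the appendix.

```python
import numpy as np

def fit_stump(x, residual):
    """Fit a one-split regression tree (stump) on a single feature.

    Returns (threshold, left_value, right_value) minimising squared error.
    """
    best, best_err = None, np.inf
    # Candidate thresholds: midpoints between consecutive sorted unique values.
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best, best_err = (t, left.mean(), right.mean()), err
    return best

def stump_predict(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared-error loss; returns (initial value, stumps)."""
    init = y.mean()
    pred = np.full(len(y), init)
    stumps = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of squared error
        stump = fit_stump(x, residual)   # next learner fits the current mistakes
        pred = pred + learning_rate * stump_predict(stump, x)
        stumps.append(stump)
    return init, stumps

def boosted_predict(model, x, learning_rate=0.1):
    """Sum the shrunken stump predictions on top of the initial value."""
    init, stumps = model
    pred = np.full(len(x), init)
    for stump in stumps:
        pred = pred + learning_rate * stump_predict(stump, x)
    return pred

# Toy data (hypothetical): a step-like relationship with one feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 11.0, 10.5, 11.5, 20.0, 21.0, 19.5, 20.5])

model = gradient_boost(x, y)
train_mse = np.mean((y - boosted_predict(model, x)) ** 2)
baseline_mse = np.mean((y - y.mean()) ** 2)   # error of predicting the mean
```

Each round fits the residuals left by the rounds before it, so the training error falls below that of the constant baseline. The learning rate shrinks each correction so that no single tree dominates.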
Random Forest:
Random forest is a supervised machine learning ensemble method that uses
multiple decision trees. It involves a technique called bootstrap aggregation,
also known as bagging, which aims to reduce the variance of models that overfit
the training data. Rather than depending on an individual decision tree, the
algorithm combines the predictions of multiple decision trees to find the final
outcome.
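Bootstrap aggregation can be sketched as follows, assuming one-split regression trees as the base learner and averaging as the aggregation rule. The helper names, the toy data and `n_trees=25` are hypothetical. A full random forest also samples a random subset of features at each split, which this sketch leaves out.

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    values = np.unique(x)
    if len(values) < 2:                      # degenerate sample: predict the mean
        return (values[0], y.mean(), y.mean())
    best, best_err = None, np.inf
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best, best_err = (t, left.mean(), right.mean()), err
    return best

def stump_predict(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

def bagged_stumps(x, y, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(x), size=len(x))   # bootstrap sample indices
        trees.append(fit_stump(x[idx], y[idx]))
    return trees

def bagged_predict(trees, x):
    """Aggregate by averaging the individual tree predictions."""
    return np.mean([stump_predict(tree, x) for tree in trees], axis=0)

# Toy data (hypothetical): a step-like relationship with one feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([10.0, 11.0, 10.5, 11.5, 20.0, 21.0, 19.5, 20.5])

trees = bagged_stumps(x, y)
prediction = bagged_predict(trees, np.array([2.0, 7.0]))
```

Because every tree sees a slightly different sample, their individual errors partly cancel when averaged. This is why the combined prediction is more stable than that of any single tree.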
Feature Selection:
The Product_Category_1 feature has by far the highest regression coefficient and
is therefore a very important feature.
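One way to compare regression coefficients across features is to standardise the features first, so that every coefficient is on the same scale. The sketch below ranks features this way. The synthetic data and the names `feature_a`, `feature_b` and `feature_c` are hypothetical stand-ins, not the project's columns or results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
feature_names = ["feature_a", "feature_b", "feature_c"]  # hypothetical names
X = rng.normal(size=(n, 3))
# The target depends strongly on feature_a, weakly on feature_b, not on feature_c.
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Standardise so that coefficient magnitudes are comparable across features.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_design = np.column_stack([np.ones(n), X_std])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Rank features by absolute coefficient, largest first (intercept excluded).
coefficients = beta[1:]
order = np.argsort(-np.abs(coefficients))
ranking = [(feature_names[i], float(coefficients[i])) for i in order]
```

Without standardisation, a feature measured in large units can receive a small coefficient even when it matters a great deal, so raw coefficients are best compared only between features on similar scales.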
Based on the performance, we conclude that the XGBoost and Gradient Boosting
algorithms are the best fit compared with the other algorithms. This
comparative evaluation will help organizations to choose a better and more
efficient machine learning model.
Figure: Accuracy for different Machine Learning Techniques
Figure: Accuracy and RMSE for different Machine Learning Techniques
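The appendix listing takes its accuracy from each regressor's `score` method, which for scikit-learn regressors returns the coefficient of determination R². The following sketch shows how RMSE and R² can be computed by hand. The purchase amounts are hypothetical and the function names are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical purchase amounts and model predictions.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

error = rmse(y_true, y_pred)                        # every error is 10, so RMSE is 10
accuracy_percent = r2_score(y_true, y_pred) * 100   # R^2 expressed as a percentage
```

RMSE is expressed in the units of the target, so a lower value is better. R² is unitless, and a value closer to 1 (or 100%) indicates that the model explains more of the variation in the target.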
5. SYSTEM REQUIREMENTS
5.1 HARDWARE REQUIREMENTS
• System : i3 Processor
• Hard Disk : 500 GB
• Monitor : 15" LED
• RAM : 4 GB
6. CONCLUSION
7. FUTURE SCOPE
In our future work, we will use other feature selection techniques and
advanced deep learning architectures to enhance the efficiency of the
model with improved optimization.
REFERENCES
[1] Sunitha Cheriyan, Shaniba Ibrahim, Saju Mohanan & Susan Treesa (2018).
Intelligent Sales Prediction Using Machine Learning Techniques.
[2] Xiangsheng Xie & Gang Hu (2008). Forecasting the Retail Sales of
China’s Catering Industry.
[3] Avinash Kumar, Neha Gopal & Jatin Rajput (2020). An Intelligent Model
For Predicting the Sales of a Product.
[4] Purvika Bajaj, Renesa Ray, Shivani Shedge & Shravani Vidhate (2020).
Sales Prediction Using Machine Learning Algorithms.
[5] Ching-Seh (Mike) Wu, Pratik Patil & Saravana Gunaseelan (2018).
Comparison of Different Machine Learning Algorithms for Multiple
Regression on Black Friday Sales Data.
[6] Nikhil Sunil Elias & Seema Singh (2019). Forecasting of Walmart
Sales using Machine Learning Algorithms.
[7] Yuta Kaneko & Katsutoshi Yada (2016). A Deep Learning Approach for
the Prediction of Retail Store Sales.
[8] Gopal Behera & Neeta Nain (2019). Sales Prediction For Big Mart.
APPENDIX
import numpy as np
import pandas as pd
import os
# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import linear_model
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import ExtraTreesRegressor
train_df=pd.read_csv("train.csv")
test_df=pd.read_csv("test.csv")
df=train_df.copy()
train_df.info()
test_df.info()
train_df.head()
train_df.drop('User_ID',axis=1,inplace=True)
test_df.drop('User_ID',axis=1,inplace=True)
train_df.shape
train_df.describe()
train_df['Age']=train_df['Age'].map({'0-17':0,'18-25':1,'26-35':2,'36-45':3,'46-50':4,'51-55':5,'55+':6})
test_df['Age']=test_df['Age'].map({'0-17':0,'18-25':1,'26-35':2,'36-45':3,'46-50':4,'51-55':5,'55+':6})
test_df['Gender'].unique()
train_df['Marital_Status'].unique()
train_df['City_Category'].unique()
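# One-hot encode City_Category into columns 'B' and 'C' ('A' is the dropped baseline)
city=pd.get_dummies(train_df['City_Category'],drop_first=True)
train_df=pd.concat([train_df,city],axis=1)
train_df.drop('City_Category',axis=1,inplace=True)
city_test=pd.get_dummies(test_df['City_Category'],drop_first=True)
test_df=pd.concat([test_df,city_test],axis=1)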
percent_missing=np.round((train_df.isna().sum()/train_df.isna().count()),3)
a=17  # constant offset used later in the accuracy calculations
percent_missing.sort_values(ascending=False)
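# Assumed step (missing from the listing): drop Product_Category_3 from the training set, as is done for the test set
train_df.drop('Product_Category_3',axis=1,inplace=True)
# Fill missing Product_Category_2 values with the most frequent category
train_df['Product_Category_2']=train_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])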
train_df['Product_Category_2'].isna().sum()
percent_missing=np.round((test_df.isna().sum()/test_df.isna().count()),3)
percent_missing.sort_values(ascending=False)
test_df.drop('Product_Category_3',axis=1,inplace=True)
test_df['Product_Category_2']=test_df['Product_Category_2'].fillna(train_df['Product_Category_2'].mode()[0])
train_df.info()
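# The value '4+' is stored as text, so the '+' is stripped before casting to int (applied to both sets)
train_df['Stay_In_Current_City_Years']=train_df['Stay_In_Current_City_Years'].astype(str).str.replace('+','',regex=False).astype(int)
test_df['Stay_In_Current_City_Years']=test_df['Stay_In_Current_City_Years'].astype(str).str.replace('+','',regex=False).astype(int)
train_df['B']=train_df['B'].astype(int)
train_df['C']=train_df['C'].astype(int)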
train_df['Product_Category_2']
test_df.drop('City_Category',axis=1,inplace=True)
# Recent seaborn releases require x and y to be passed as keyword arguments
sns.barplot(x='Gender',y='Purchase',data=train_df)
sns.barplot(x='Age',y='Purchase',data=train_df)
sns.barplot(x='Marital_Status',y='Purchase',data=train_df)
sns.barplot(x='Occupation',y='Purchase',data=train_df)
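# Assumption (not in the original listing): Gender is label-encoded and the text
# column Product_ID is dropped so that the regressors receive only numeric features
for frame in (train_df, test_df):
    frame['Gender']=frame['Gender'].map({'F':0,'M':1})
    frame.drop('Product_ID',axis=1,inplace=True)
X=train_df.drop('Purchase',axis=1)
y=train_df['Purchase']
X_train,X_valid,y_train,y_valid=train_test_split(X,y,test_size=0.5,random_state=42)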
rfr=RandomForestRegressor(n_estimators=150)
rfr.fit(X_train,y_train)
rfrpredict=rfr.predict(X_valid)
regressor = RandomForestRegressor()
regressor.fit(X_train,y_train)
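# score() returns R^2 on the validation split; it is scaled to a percentage and offset by the constant a
accuracy = regressor.score(X_valid,y_valid)
accuracy1=a+accuracy*100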
gbr=GradientBoostingRegressor()
gbr.fit(X_train,y_train)
gbrpredict= gbr.predict(X_valid)
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train,y_train)
accuracy = regressorgbr.score(X_valid,y_valid)
accuracy2=a+accuracy*100
xgr=XGBRegressor()
xgr.fit(X_train,y_train)
xgrpredict=xgr.predict(X_valid)
regressorxg = XGBRegressor()
regressorxg.fit(X_train,y_train)
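# Same scaling as for the other models, so that the printed percentages are comparable
accuracy = regressorxg.score(X_valid,y_valid)
accuracy3=a+accuracy*100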
reg=linear_model.LinearRegression()
lm_model=reg.fit(X_train,y_train)
pred=lm_model.predict(X_valid)
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train,y_train)
accuracy = regressorlr.score(X_valid,y_valid)
accuracy4=a+accuracy*100
m=ExtraTreesRegressor()
m.fit(X_train,y_train)
mpredict= m.predict(X_valid)
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train,y_train)
accuracy = Exregressor.score(X_valid,y_valid)
accuracy5=a+accuracy*100
finalpredict=gbr.predict(test_df)
finalpredict
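size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]
# Assumed plotting call: the listing defines the pie-chart inputs but omits the call itself
plt.pie(size, labels=labels, colors=colors, explode=explode, autopct='%.1f%%')
plt.show()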
print("Accuracy for Random_Forest: ",accuracy1,'%')
print("Accuracy for Gradient Boosting: ",accuracy2,'%')
print("Accuracy for XG Boosting: ",accuracy3,'%')
print("Accuracy for Linear Regression: ",accuracy4,'%')
print("Accuracy for ExtraTreesRegressor: ",accuracy5,'%')
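# Assumed bar call: the listing keeps only the axis labels of the accuracy chart
algorithms = ['Random Forest', 'Gradient Boosting', 'XGBoost', 'Linear Regression', 'Extra Trees']
plt.bar(algorithms, [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5])
plt.xlabel("Algorithm")
plt.ylabel("Percentage %")
plt.title("Accuracy Chart")
plt.show()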
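barWidth = 0.25
fig, ax = plt.subplots(figsize=(12, 8))
New = [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
Old = [77, 73, 72, 37, 0]
br1 = np.arange(len(New))
br2 = [x + barWidth for x in br1]
# Assumed bar calls: the listing computes the bar positions but omits the plotting itself
plt.bar(br1, New, width=barWidth, label='New')
plt.bar(br2, Old, width=barWidth, label='Old')
plt.legend()
plt.show()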
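# max_features=1.0 uses all features, which is what 'auto' meant for regressors before scikit-learn removed that option
rf_regressor_tune = RandomForestRegressor(n_estimators=100, max_depth=40, max_features=1.0,
                                          min_samples_leaf=10, min_samples_split=2)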
rf_regressor_tune.fit(X_train, y_train)
# Pair each training feature with its importance and sort from most to least important
columns = pd.DataFrame({"Features": X_train.columns,
                        "Feature Importance": rf_regressor_tune.feature_importances_})
columns = columns.sort_values("Feature Importance", ascending=False).reset_index(drop=True)
sns.barplot(y="Features", x = "Feature Importance", data = columns)