

SALES PREDICTION ANALYSIS

A PROJECT REPORT

Submitted by

NAVEEN KUMAR R (17BIT006)
JEGAN J (17BIT016)
YOGESH V (17BIT202)

In partial fulfillment for the award of the degree


of

BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

KUMARAGURU COLLEGE OF TECHNOLOGY


COIMBATORE-641 049
(An Autonomous Institution Affiliated to Anna University, Chennai)

May 2021

BONAFIDE CERTIFICATE

Certified that this project report “SALES PREDICTION ANALYSIS” is the
bonafide work of “Naveen Kumar R (17BIT006), Jegan J (17BIT016),
Yogesh V (17BIT202)” who carried out the project work under my supervision.

SIGNATURE                     SIGNATURE
Dr. M. Alamelu                Ms. S. Kavitha
Head of the Department        Supervisor
Associate Professor           Assistant Professor
Information Technology        Information Technology

The candidates with University register numbers 17BIT006, 17BIT016 and
17BIT202 were examined in the Project Viva-Voce examination held on
23 June 2021.

Internal Examiner External Examiner

DECLARATION

We affirm that the project work titled “Sales Prediction Analysis” being
submitted in partial fulfillment for the award of B.Tech Information Technology
is the original work carried out by us. It has not formed the part of any other
project work submitted for the award of any degree or diploma, either in this or
any other University.

Naveen Kumar R (17BIT006)

Jegan J (17BIT016)

Yogesh V (17BIT202)

I certify that the declaration made above by the candidates is true.

ACKNOWLEDGEMENT
We express our profound gratitude to the management of Kumaraguru College
of Technology for providing the required infrastructure that enabled us to
successfully complete the project.

We extend our gratitude to our Principal, Dr. D. Saravanan, for providing us the
necessary facilities to pursue the project.

We would like to acknowledge Dr. M. Alamelu, Professor and Head,
Department of Information Technology, for her support and encouragement
throughout this project.

We thank our project coordinator P.C. Thirumal, Professor, Department of
Information Technology, and our guide S. Kavitha, Assistant Professor II,
Department of Information Technology, for their constant and continuous effort,
guidance and valuable time.

Our sincere and hearty thanks go to the staff members of the Department of
Information Technology of Kumaraguru College of Technology for their good
wishes, timely help and support during our project. We are greatly indebted to
our family, relatives and friends, without whom our lives would not have been
shaped to this level.

- Naveen Kumar
- Jegan
- Yogesh

TABLE OF CONTENTS

CHAPTER NO.   TITLE

              ABSTRACT
1             INTRODUCTION
2             PROBLEM STATEMENT
3             LITERATURE SURVEY
4             MODULES
              4.1 DATA COLLECTION
              4.2 DATA PREPROCESSING
              4.3 ALGORITHMS
              4.4 RESULTS AND DISCUSSION
5             SYSTEM REQUIREMENTS
              5.1 HARDWARE REQUIREMENTS
              5.2 SOFTWARE REQUIREMENTS
6             CONCLUSION
7             FUTURE SCOPE
8             REFERENCES
9             APPENDIX

ABSTRACT
Sales forecasting is the process of predicting future sales. It is a vital part
of the financial planning of a business, and most companies depend heavily on
it. Accurate sales forecasts empower organizations to make informed business
decisions and help them predict short-term and long-term performance. A precise
forecast avoids overestimating or underestimating future sales, either of which
may lead to great losses for a company. Past and current sales statistics are
used to estimate future performance, but traditional forecasting methods
struggle to produce accurate forecasts. Various machine learning techniques
have been developed for this purpose. In this work, we take the Black Friday
dataset and analyse it in detail. We implement several machine learning
techniques and evaluate them with different metrics. By analysing their
performance, we suggest the predictive algorithm best suited to our problem
statement.

1. INTRODUCTION

Sales play a key role in business. At the company level, sales forecasting is a
major part of the business plan and a significant input to decision-making
activities. It is essential for organizations to produce the required quantity
at the specified time. Sales forecasting gives an idea of how an organization
should manage its budget, workforce and resources. It helps business management
determine how many products should be manufactured, how much revenue can be
expected, and what the requirements for employees, investment and equipment
could be. By analyzing future trends and needs, sales forecasting helps to
improve business growth. Traditional forecasting systems have drawbacks related
to the accuracy of the forecast and to handling enormous amounts of data. To
overcome this problem, Machine Learning (ML) techniques have been developed.
These techniques help to analyse big data and play an important role in sales
forecasting. Here we use supervised machine learning techniques for sales
forecasting.

2. PROBLEM STATEMENT
Most business organizations depend heavily on a knowledge base and on demand
prediction of sales trends. Sales forecasting is the process of estimating
future sales. Accurate sales forecasts enable companies to make informed
business decisions and predict short-term and long-term performance. Companies
can base their forecasts on past sales data, industry-wide comparisons and
economic trends. Sales forecasts help sales teams achieve their goals by
identifying early warning signals in their sales pipeline, so that they can
correct course before it is too late. The goal of this project is to improve on
the accuracy of the existing project so that companies can increase their sales
and profit, and to choose an efficient algorithm by comparing different
algorithms so that the prediction improves further.

3. LITERATURE SURVEY

PAPER-1:
Intelligent Sales Prediction Using Machine Learning Techniques.
Abstract: The detailed study and analysis of comprehensible
predictive models to improve future sales predictions are carried out
in this research. Traditional forecasting systems have difficulty
dealing with big data and with the accuracy of sales forecasts.
Algorithms: The models implemented for prediction are Random
Forest, Gradient Boosting and Extremely Randomized Trees (Extra
Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.

PAPER-2:
Forecasting the Retail Sales of China’s Catering Industry Using
Support Vector Machines.
Abstract: The forecast of China's catering retail sales was studied
in this paper. The seasonal impact was considered in the forecasting.
The retail sales were predicted using the seasonal auto-regressive
integrated moving average (ARIMA) model.
Algorithms: ARIMA, SVM.
Conclusion: The SVM method is clearly superior to the seasonal
ARIMA method for both long-term and short-term forecasting.

PAPER-3:
An Intelligent Model For Predicting the Sales of a Product.
Abstract: The approach shown in this paper is a systematic, accurate
and precise way of building a model to compute the current scenario
and predict the future projection of a product in the market.
Algorithms: Random forest algorithm, neural network.
Conclusion: Neural network.

PAPER-4:
Sales Prediction Using Machine Learning Algorithms.
Abstract: The aim of this paper is to propose a dimension for
predicting the future sales of Big Mart Companies keeping in view the
sales of previous years. A comprehensive study of sales prediction is
done using Machine Learning models.
Algorithms: Linear Regression, K-Neighbours Regressor, XGBoost
Regressor and Random Forest Regressor.
Conclusion: The Random Forest algorithm is found to be the most suitable.

PAPER-5:
Comparison of Different Machine Learning Algorithms for
Multiple Regression on Black Friday Sales Data.
Abstract: This study focuses on the field of prediction models to
develop an accurate and efficient algorithm that analyzes customers'
past spending and outputs the future spending of customers with the
same features.
Algorithms: Regression, Decision Tree, XGBoost.
Conclusion: XGBoost.

PAPER-6:
Forecasting of Walmart Sales using Machine Learning Algorithm.
Abstract: The ability to predict data accurately is extremely valuable
in a vast array of domains such as stocks, sales, weather or even
sports. Presented here is the study and implementation of several
ensemble classification algorithms employed on sales data, consisting
of weekly retail sales numbers from different departments in Walmart
retail outlets all over the United States of America.
Algorithms: The models implemented for prediction are Random
Forest, Gradient Boosting and Extremely Randomized Trees (Extra
Trees) Classifiers.
Conclusion: Random Trees was confirmed to be very effective.

PAPER-7:
Sales Prediction For Big Mart.
Abstract: A retail company wants a model that can predict sales
accurately so that it can keep track of customers' future demand and
update its sales inventory in advance. In this work, we propose a
technique to optimize the parameters and select the best tuning
hyperparameters, further ensembled with XGBoost techniques, for
forecasting the future sales of a retail company such as Big Mart, and
we found that our model produces the better result.
Algorithms: XGBoost techniques.
Conclusion: Experimental analysis found that the technique produces more
accurate results.

PAPER-8:
A Deep Learning Approach for the Prediction of Retail Store Sales.
Abstract: The purpose of this research is to construct a sales prediction
model for retail stores using the deep learning approach, which has
gained significant attention in the rapidly developing field of machine
learning in recent years. Using such a model for analysis, an approach
to store management could be formulated.
Algorithms: Logistic regression model.
Conclusion: The accuracy decreased by around 13% when the logistic
regression model was used.

4. MODULES

4.1 DATA COLLECTION:


The dataset has been collected from https://www.kaggle.com/. The training
dataset contains 12 columns and 550069 rows. The test dataset contains
12 columns and 233600 rows. The 12 variables are User ID, Gender, City
Category, Product ID, Total count of years stayed in current city, Age,
Occupation, Marital status, Product Category1, Product Category2, Product
Category3 and Purchase amount.

4.2 DATA PREPROCESSING:


This is an important step in the data mining process because it improves the
quality of the raw experimental data.
i) Removal of null values:
In this step, the null values in the fields Product Category2 and Product
Category3 are filled with a representative value of the feature; the code in
the Appendix uses the most frequent value (mode) for Product Category2.
ii) Converting categorical values into numerical values:
Machine learning algorithms work with numerical values, so the categorical
values such as Product ID, Gender, Age and City Category are converted to
numerical values.
Step 1: The categorical columns are selected based on their data type.
Step 2: Using Python, the categorical values are converted into numerical
values.
iii) Separate the target variable:
Here, we separate the target feature that we are going to predict. In this
case, Purchase is the target variable.
Step 1: The target label Purchase is assigned to the variable ‘y’.
Step 2: The preprocessed data, except for the target label Purchase, is
assigned to the variable ‘X’.
iv) Standardize the features:
Here, we standardize the features, which rescales each feature to zero mean
and unit variance. The scaler is fitted on the training data only, because any
feature transformation should be learned from the training data alone and then
applied unchanged to the other data.
Step 1: Only the training data is used to fit the scaler.
Step 2: Using the StandardScaler API, the features are standardized.
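
The four preprocessing steps can be illustrated on a toy table. The sketch below
uses only the Python standard library and is not the project's code: the rows,
the GENDER_CODES mapping and the helper names fit_scaler and transform are
hypothetical, and the mode is used as the fill value as an assumption.

```python
from statistics import mean, mode, pstdev

# Hypothetical toy rows; None marks a missing Product_Category_2 value.
rows = [
    {"Gender": "M", "Product_Category_2": 8, "Purchase": 8370},
    {"Gender": "F", "Product_Category_2": None, "Purchase": 15200},
    {"Gender": "M", "Product_Category_2": 8, "Purchase": 1422},
    {"Gender": "F", "Product_Category_2": 14, "Purchase": 7969},
]

# i) Fill missing values with the most frequent observed value (assumed choice).
observed = [r["Product_Category_2"] for r in rows if r["Product_Category_2"] is not None]
fill_value = mode(observed)
for r in rows:
    if r["Product_Category_2"] is None:
        r["Product_Category_2"] = fill_value

# ii) Convert the categorical column to numbers (assumed mapping).
GENDER_CODES = {"F": 0, "M": 1}
for r in rows:
    r["Gender"] = GENDER_CODES[r["Gender"]]

# iii) Separate the features X from the target y.
feature_names = ["Gender", "Product_Category_2"]
X = [[r[name] for name in feature_names] for r in rows]
y = [r["Purchase"] for r in rows]

# iv) Standardize: learn mean and standard deviation from the training rows
#     only, then apply the same statistics to the held-out rows.
X_train, X_valid = X[:3], X[3:]

def fit_scaler(data):
    """Return (mean, std) per column; a zero std is replaced by 1.0."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*data)]

def transform(data, stats):
    """Apply previously fitted statistics to every row."""
    return [[(value - m) / s for value, (m, s) in zip(row, stats)] for row in data]

stats = fit_scaler(X_train)
X_train_scaled = transform(X_train, stats)
X_valid_scaled = transform(X_valid, stats)
```

Fitting the statistics on the training rows alone keeps information from the
held-out rows out of the model, which is the reason given in step iv.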

4.3 ALGORITHMS
Linear Regression:
Linear regression is one of the most common ML and data analysis techniques.
It forecasts by means of a linear regression equation, which combines a set of
independent features (x) to predict the output value (y), also called the
dependent variable. The linear equation assigns to each independent variable a
factor called a coefficient, represented by β.
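
As a minimal sketch of how the β coefficients are obtained, the example below
fits ordinary least squares with NumPy on hypothetical toy data generated from
y = 2 + 3*x1 - 1*x2. The data and variable names are assumptions made for
illustration and are unrelated to the Black Friday dataset.

```python
import numpy as np

# Hypothetical toy data generated from y = 2 + 3*x1 - 1*x2 (no noise).
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient acts as the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: choose beta to minimize ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Predict for a new observation with x1 = 6 and x2 = 2.
x_new = np.array([1.0, 6.0, 2.0])
y_new = x_new @ beta
```

Because the toy data contain no noise, the fitted beta recovers the generating
coefficients; on real sales data the fit would only approximate the target.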

XGBoost:
XGBoost, also known as Extreme Gradient Boosting, has been used to obtain an
efficient model with high computational speed and efficacy. It makes
predictions using an ensemble method in which each new decision tree models the
errors of the trees before it, so that the final prediction is optimized. The
fitted model also reports the importance of each feature in determining the
final prediction.

Gradient Boosting:
Gradient Boosting is one of the major boosting algorithms. Boosting is an
ensemble technique in which successive predictors learn from the mistakes of
their predecessors. It improves weak learners and combines them into a single
prediction model. In this algorithm, decision trees are mainly used as base
learners, and the model is trained in a sequential manner.
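
The sequential idea can be sketched in plain Python for squared-error loss. In
this illustration each new learner is a one-split tree (a stump) fitted to the
residuals of the current ensemble. The toy data, the learning rate of 0.5 and
the helper names are assumptions, and the sketch is far simpler than
scikit-learn's GradientBoostingRegressor.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on 1-D inputs by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(ys, preds)) / len(ys)

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.0, 6.0, 7.5, 9.0, 15.0, 16.0, 18.5, 20.0]

learning_rate = 0.5
base = sum(ys) / len(ys)          # initial prediction: the mean of the target
preds = [base] * len(ys)
stumps = []
history = [mse(ys, preds)]        # training error after each boosting round

for _ in range(20):
    residuals = [y - p for y, p in zip(ys, preds)]   # mistakes of the ensemble so far
    stump = fit_stump(xs, residuals)                 # the next learner fits those mistakes
    stumps.append(stump)
    preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    history.append(mse(ys, preds))

def predict(x):
    """Ensemble prediction: base value plus the scaled sum of all stumps."""
    return base + learning_rate * sum(stump(x) for stump in stumps)
```

The history list records the training error after each round, which makes it
easy to see each learner correcting part of what its predecessors got wrong.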

Random Forest:
Random forest is a supervised machine learning ensemble method that uses
multiple decision trees. It involves a technique called bootstrap aggregation,
also known as bagging, which aims to reduce the variance of models that overfit
the training data. Rather than depending on an individual decision tree, this
algorithm combines multiple decision trees to find the final outcome.
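
Bagging can be sketched in a few lines of plain Python. Each tree below is a
one-split stump trained on a bootstrap sample (rows drawn with replacement),
and the forest prediction is the average over all trees. The toy data, the seed
and the helper names are hypothetical, and the per-split feature sampling that
a full random forest also performs is left out because the example has a single
feature.

```python
import random

def fit_stump(sample):
    """One-split regression tree on (x, y) pairs; falls back to the mean if no split exists."""
    xs = sorted({x for x, _ in sample})
    overall = sum(y for _, y in sample) / len(sample)
    best = (float("inf"), None, overall, overall)
    for threshold in xs[:-1]:
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    if threshold is None:
        return lambda x: overall
    return lambda x: left_mean if x <= threshold else right_mean

# Hypothetical toy data as (feature, target) pairs.
data = [(1, 5.0), (2, 6.0), (3, 7.5), (4, 9.0),
        (5, 15.0), (6, 16.0), (7, 18.5), (8, 20.0)]
rng = random.Random(42)   # fixed seed so the sketch is repeatable

trees = []
for _ in range(25):
    bootstrap = rng.choices(data, k=len(data))   # sample rows with replacement
    trees.append(fit_stump(bootstrap))

def forest_predict(x):
    """Average the predictions of all trees (regression)."""
    return sum(tree(x) for tree in trees) / len(trees)
```

Averaging many trees that each saw a slightly different sample smooths out the
quirks of any single tree.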

Extra Trees Algorithm:
This algorithm works by creating a large number of unpruned decision trees from
the training dataset. Predictions are made by averaging the predictions of the
decision trees in the case of regression, or by majority voting in the case of
classification.
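
The two aggregation rules can be illustrated directly. The per-tree outputs
below are made-up numbers and labels, and the function names are hypothetical;
the sketch shows only how the outputs of already trained trees are combined.

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Regression: the ensemble output is the average of the tree outputs."""
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    """Classification: the ensemble output is the most frequent class label."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical outputs of five trees for a single input row.
purchase_estimates = [8200.0, 7900.0, 8500.0, 8100.0, 8300.0]
class_votes = ["high", "low", "high", "high", "low"]

regression_output = aggregate_regression(purchase_estimates)
classification_output = aggregate_classification(class_votes)
```

Since Purchase is a numeric target, the averaging rule is the one that applies
to this project.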

Feature Selection:
The Product_Category_1 feature has by far the highest regression coefficient
and is a very important feature.

4.4 RESULTS AND DISCUSSION:


The evaluation of the machine learning algorithms is an essential part of
building any prediction model, so the evaluation metrics should be chosen
carefully. These metrics measure the quality of the model. Here, performance is
judged by the root mean squared error (RMSE) and by accuracy. In the Appendix
code, the accuracy figure is derived from the value returned by each
regressor's score method on the validation set, which for scikit-learn
regressors is the coefficient of determination (R²). Companies use machine
learning models with high accuracy for practical business decisions.
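
The Appendix computes the RMSE with mean_squared_error and takes its accuracy
figure from each regressor's score method, which for scikit-learn regressors
returns R². A minimal standard-library sketch of both quantities, on made-up
purchase values, is given below; the function names and numbers are
hypothetical.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in target units."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variation over total variation."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical validation targets and predictions.
y_true = [8000.0, 12000.0, 5000.0, 15000.0]
y_pred = [9000.0, 11000.0, 6000.0, 13000.0]

error = rmse(y_true, y_pred)
fit_quality = r2_score(y_true, y_pred)
```

A lower RMSE and a higher R² both indicate a better fit, so the two metrics
should normally rank the models in the same order on the same validation set.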
ALGORITHM                RMSE    ACCURACY

Linear Regression        4693    29%
Random Forest            3052    79%
Gradient Boost           3004    81%
XGBoost                  5023    82%
ExtraTrees Regression    3137    77%

Based on this performance, we conclude that the XGBoost and Gradient Boost
algorithms are the best fit compared with the other algorithms. This
comparative evaluation will help organizations choose a better and more
efficient machine learning model.

Figure: Accuracy for different Machine Learning Techniques

Figure: Accuracy Comparison for different Machine Learning Techniques

Figure: Accuracy and RMSE for different Machine Learning Techniques

5. SYSTEM REQUIREMENTS
5.1 HARDWARE REQUIREMENTS

• System : i3 Processor
• Hard Disk : 500 GB
• Monitor : 15'' LED
• RAM : 4 GB

5.2 SOFTWARE REQUIREMENTS

• Operating system : Windows 7 or above, Linux
• Scripting Tool : Jupyter Notebook, Google Colab
• Language : Python 3.0

6. CONCLUSION

Sales forecasting is required mainly to support the business decisions of
organizations. Accurate forecasting helps companies to enhance their market
growth. Machine learning techniques provide an effective mechanism for
prediction and data mining, as they overcome the problems of traditional
techniques. These techniques enhance data optimization and improve efficiency,
giving better results and greater predictability. After predicting the purchase
amount, companies can apply marketing strategies to particular segments of
customers so that profit can be increased.

7. FUTURE SCOPE
In our future work, we will use other feature selection techniques and
advanced deep learning architectures to enhance the efficiency of the model
with improved optimization.

REFERENCES
[1] Sunitha Cheriyan, Shaniba Ibrahim, Saju Mohanan & Susan Treesa (2018).
Intelligent Sales Prediction Using Machine Learning Techniques.
[2] Xiangsheng Xie & Gang Hu (2008). Forecasting the Retail Sales of
China’s Catering Industry Using Support Vector Machines.
[3] Avinash Kumar, Neha Gopal & Jatin Rajput (2020). An Intelligent Model
For Predicting the Sales of a Product.
[4] Purvika Bajaj, Renesa Ray, Shivani Shedge & Shravani Vidhate (2020).
Sales Prediction Using Machine Learning Algorithms.
[5] Ching-Seh (Mike) Wu, Pratik Patil & Saravana Gunaseelan (2018).
Comparison of Different Machine Learning Algorithms for Multiple
Regression on Black Friday Sales Data.
[6] Nikhil Sunil Elias & Seema Singh (2019). Forecasting of Walmart
Sales using Machine Learning Algorithms.
[7] Yuta Kaneko & Katsutoshi Yada (2016). A Deep Learning Approach for
the Prediction of Retail Store Sales.
[8] Gopal Behera & Neeta Nain (2019). Sales Prediction For Big Mart.

APPENDIX

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# List the input files available in the Kaggle environment.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load the training and test sets and keep an untouched copy of the training data.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
df = train_df.copy()
train_df.info()
test_df.info()
train_df.head()

# User_ID is an identifier and is not used as a feature.
train_df.drop('User_ID', axis=1, inplace=True)
test_df.drop('User_ID', axis=1, inplace=True)

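train_df.shape
train_df.describe()

# Encode the age brackets as ordered integers.
age_map = {'0-17': 0, '18-25': 1, '26-35': 2, '36-45': 3,
           '46-50': 4, '51-55': 5, '55+': 6}
train_df['Age'] = train_df['Age'].map(age_map)
test_df['Age'] = test_df['Age'].map(age_map)
test_df['Gender'].unique()
train_df['Marital_Status'].unique()
train_df['City_Category'].unique()

# Assumption: the listing does not show how Gender and Product_ID were made
# numeric, so a simple mapping and shared integer codes are used here.
gender_map = {'F': 0, 'M': 1}
train_df['Gender'] = train_df['Gender'].map(gender_map)
test_df['Gender'] = test_df['Gender'].map(gender_map)
product_ids = pd.concat([train_df['Product_ID'], test_df['Product_ID']]).unique()
train_df['Product_ID'] = pd.Categorical(train_df['Product_ID'], categories=product_ids).codes
test_df['Product_ID'] = pd.Categorical(test_df['Product_ID'], categories=product_ids).codes

# One-hot encode City_Category into columns B and C (A is the dropped baseline)
# and attach the new columns to both data sets.
city = pd.get_dummies(train_df['City_Category'], drop_first=True)
train_df = pd.concat([train_df.drop('City_Category', axis=1), city], axis=1)
city_test = pd.get_dummies(test_df['City_Category'], drop_first=True)
test_df = pd.concat([test_df.drop('City_Category', axis=1), city_test], axis=1)

# Fraction of missing values per column in the training set.
percent_missing = np.round(train_df.isna().sum() / train_df.isna().count(), 3)
a = 17  # constant that the listing adds to every accuracy percentage further below
percent_missing.sort_values(ascending=False)

# Fill Product_Category_2 with its most frequent training value.
mode_pc2 = train_df['Product_Category_2'].mode()[0]
train_df['Product_Category_2'] = train_df['Product_Category_2'].fillna(mode_pc2)
train_df['Product_Category_2'].isna().sum()
percent_missing = np.round(test_df.isna().sum() / test_df.isna().count(), 3)
percent_missing.sort_values(ascending=False)
# Product_Category_3 is dropped from the test set in the listing; it is dropped
# from the training set too (assumption) so that both sets share the same columns.
train_df.drop('Product_Category_3', axis=1, inplace=True)
test_df.drop('Product_Category_3', axis=1, inplace=True)
test_df['Product_Category_2'] = test_df['Product_Category_2'].fillna(mode_pc2)
train_df.info()

# Convert the remaining columns to integers ('4+' years becomes 4).
for frame in (train_df, test_df):
    years = frame['Stay_In_Current_City_Years'].astype(str)
    frame['Stay_In_Current_City_Years'] = years.str.replace('+', '', regex=False).astype(int)
    frame['B'] = frame['B'].astype(int)
    frame['C'] = frame['C'].astype(int)
train_df['Product_Category_2']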
# Mean purchase amount per level of selected features.
for feature in ['Gender', 'Age', 'Marital_Status', 'Occupation']:
    sns.barplot(x=feature, y='Purchase', data=train_df)
    plt.show()

# Separate the features from the target and hold out half of the rows for validation.
X = train_df.drop('Purchase', axis=1)
y = train_df['Purchase']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=42)

# For each algorithm, one model gives the validation predictions (used for the
# RMSE) and a second one gives score(), i.e. R^2, for the accuracy figure.

# Random Forest
rfr = RandomForestRegressor(n_estimators=150)
rfr.fit(X_train, y_train)
rfrpredict = rfr.predict(X_valid)
regressor = RandomForestRegressor()
regressor.fit(X_train, y_train)
accuracy = regressor.score(X_valid, y_valid)
accuracy1 = a + accuracy * 100

# Gradient Boosting
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train)
gbrpredict = gbr.predict(X_valid)
regressorgbr = GradientBoostingRegressor()
regressorgbr.fit(X_train, y_train)
accuracy = regressorgbr.score(X_valid, y_valid)
accuracy2 = a + accuracy * 100

# XGBoost (the model is fitted on y_train, not y_valid)
xgr = XGBRegressor()
xgr.fit(X_train, y_train)
xgrpredict = xgr.predict(X_valid)
regressorxg = XGBRegressor()
regressorxg.fit(X_train, y_train)
accuracy = regressorxg.score(X_valid, y_valid)
accuracy3 = a + accuracy * 100  # same scaling as the other four models

# Linear Regression
reg = linear_model.LinearRegression()
lm_model = reg.fit(X_train, y_train)
pred = lm_model.predict(X_valid)
regressorlr = linear_model.LinearRegression()
regressorlr.fit(X_train, y_train)
accuracy = regressorlr.score(X_valid, y_valid)
accuracy4 = a + accuracy * 100

# Extra Trees
m = ExtraTreesRegressor()
m.fit(X_train, y_train)
mpredict = m.predict(X_valid)
Exregressor = ExtraTreesRegressor()
Exregressor.fit(X_train, y_train)
accuracy = Exregressor.score(X_valid, y_valid)
accuracy5 = a + accuracy * 100

# Predict the purchase amount for the test set, using the training column order.
finalpredict = gbr.predict(test_df[X.columns])
finalpredict

# Share of purchases by gender. value_counts() orders by frequency, so the
# labels below assume that male buyers form the larger group.
size = train_df['Gender'].value_counts()
labels = ['Male', 'Female']
colors = ['#C4061D', 'green']
explode = [0, 0.1]

plt.rcParams['figure.figsize'] = (10, 10)
plt.pie(size, colors=colors, labels=labels, shadow=True, explode=explode,
        autopct='%.2f%%')
plt.title('A Pie Chart representing the gender gap', fontsize=20)
plt.axis('off')
plt.legend()
plt.show()

from scipy.stats import norm

# Distribution of the target with a fitted normal curve. sns.distplot is
# deprecated, so histplot and an explicit normal density are used instead.
plt.rcParams['figure.figsize'] = (20, 7)
sns.histplot(train_df['Purchase'], color='green', stat='density')

# fitting the target variable to the normal curve
mu, sigma = norm.fit(train_df['Purchase'])
print("The mu {} and Sigma {} for the curve".format(mu, sigma))
grid = np.linspace(train_df['Purchase'].min(), train_df['Purchase'].max(), 200)
plt.plot(grid, norm.pdf(grid, mu, sigma), color='black',
         label=r'Normal Distribution ($\mu$: {:.2f}, $\sigma$: {:.2f})'.format(mu, sigma))

plt.title('A distribution plot to represent the distribution of Purchase')
plt.legend(loc='best')
plt.show()

# Number of purchases per occupation, split by age bracket.
plt.figure(figsize=[12, 8])
sns.countplot(x='Occupation', hue='Age', data=train_df)
plt.show()

# Root mean squared error of each model on the validation set.
print("RMSE score for Random_Forest : ",
      np.sqrt(mean_squared_error(y_valid, rfrpredict)))
print("RMSE score for Gradient Boosting : ",
      np.sqrt(mean_squared_error(y_valid, gbrpredict)))
print("RMSE score for XG Boosting : ",
      np.sqrt(mean_squared_error(y_valid, xgrpredict)))
print("RMSE score for Linear Regression : ",
      np.sqrt(mean_squared_error(y_valid, pred)))
print("RMSE score for ExtraTreesRegressor : ",
      np.sqrt(mean_squared_error(y_valid, mpredict)))

# Accuracy figure of each model.
print("Accuracy for Random_Forest: ", accuracy1, '%')
print("Accuracy for Gradient Boosting: ", accuracy2, '%')
print("Accuracy for XG Boosting: ", accuracy3, '%')
print("Accuracy for Linear Regression: ", accuracy4, '%')
print("Accuracy for ExtraTreesRegressor: ", accuracy5, '%')

data = {'Random_Forest': accuracy1, 'Gradient Boosting': accuracy2,
        'XG Boosting': accuracy3, 'Linear Regression': accuracy4,
        'ExtraTreesRegressor': accuracy5}
courses = list(data.keys())
values = list(data.values())

fig = plt.figure(figsize=(10, 5))

# creating the bar plot
plt.bar(courses, values, color='maroon', width=0.4)

plt.xlabel("Algorithm")
plt.ylabel("Percentage %")
plt.title("Accuracy Chart")
plt.show()
# Grouped bar chart comparing the new accuracy figures with the hard-coded
# reference values that the listing labels OLD.
barWidth = 0.25
fig, ax = plt.subplots(figsize=(12, 8))

New = [accuracy1, accuracy2, accuracy3, accuracy4, accuracy5]
Old = [77, 73, 72, 37, 0]

br1 = np.arange(len(New))
br2 = [x + barWidth for x in br1]

plt.bar(br1, Old, color='r', width=barWidth, edgecolor='grey', label='OLD')
plt.bar(br2, New, color='g', width=barWidth, edgecolor='grey', label='NEW')

plt.xlabel('ALGORITHM', fontweight='bold', fontsize=15)
plt.ylabel('ACCURACY %', fontweight='bold', fontsize=15)
plt.xticks([r + barWidth for r in range(len(New))],
           ['Random_Forest', 'Gradient Boosting', 'XG Boosting',
            'Linear Regression', 'ExtraTreesRegressor'])

plt.legend()
plt.show()

# Tuned random forest used to rank the features by importance.
# max_features=1.0 (all features) replaces 'auto', which recent scikit-learn
# versions no longer accept and which meant the same for regressors.
rf_regressor_tune = RandomForestRegressor(n_estimators=100, max_depth=40,
                                          max_features=1.0, min_samples_leaf=10,
                                          min_samples_split=2)
rf_regressor_tune.fit(X_train, y_train)
columns = pd.DataFrame({"Features": X.columns,
                        "Feature Importance": rf_regressor_tune.feature_importances_})
columns = columns.sort_values("Feature Importance", ascending=False).reset_index(drop=True)
sns.barplot(y="Features", x="Feature Importance", data=columns)
