
SOUTH CAMPUS

High ground, Fatehgarh Anantnag-192101 Jammu & Kashmir

DEPARTMENT OF COMPUTER SCIENCE

SESSION (2017-2020)
A MAJOR PROJECT REPORT

Submitted in partial fulfillment of the requirements for the

Award of the Degree of Master of Computer Applications

“SALES FORECASTING”
Submitted by
Danish Rashid Shabnam Nisar Ovais Bhat
MCA (6th sem.) MCA (6th sem.) MCA (6th sem.)
(17045113002) (17045113040) (17045113043)

Under The Support and Guidance of


Dr. Hilal Ahmad Khanday
(Assistant Professor)
Department of Computer Science, South Campus,
University of Kashmir
SOUTH CAMPUS

“SALES FORECASTING”
Project submitted to the Department of Computer Science, South Campus,
University of Kashmir in partial fulfillment of the requirement
for the award of the Degree of

MASTER OF COMPUTER APPLICATIONS


(MCA)

Submitted by

1) Danish Rashid 17045113002


2) Shabnam Nisar 17045113040
3) Ovais Bhat 17045113043
Department of Computer Science, South Campus,
University of Kashmir

CERTIFICATE
This is to certify that the Project entitled

“SALES FORECASTING”
is the original work carried out by
1) Danish Rashid 17045113002
2) Shabnam Nisar 17045113040
3) Ovais Ahmad Bhat 17045113043

in partial fulfillment of the requirement for the award of the Degree of

MASTER OF COMPUTER APPLICATIONS


(MCA)
During the Academic year of 2021

Supervisor: Dr. Hilal Ahmad Khanday
Assistant Professor, Department of Computer Science, University of Kashmir, South Campus

Academic Coordinator: Mr. Mohsin Altaf Wani
Assistant Professor, Department of Computer Science, University of Kashmir, South Campus
Student Declaration/Certificate

We, Danish Rashid (17045113002), Shabnam Nisar (17045113040) and Ovais Ahmad
Bhat (17045113043), hereby declare that the work which is being presented in the
Project entitled “SALES FORECASTING”, in partial fulfillment of the requirement
for the award of the MASTER OF COMPUTER APPLICATIONS (MCA) degree in the
Session 2017, is an authentic record of our own work, carried out under the
supervision of Dr. Hilal Ahmad Khanday (Assistant Professor, Department of
Computer Science, University of Kashmir South Campus, Anantnag).
The matter embodied in this Project has not been submitted for the award of any
other degree.

Date:

1) Danish Rashid 2) Shabnam Nisar


(17045113002) (17045113040)

3) Ovais Ahmad Bhat


(17045113043)

This is to certify that the above statement made by the candidates is correct to
the best of my knowledge.

Supervisor: Dr. Hilal Ahmad Khanday
Assistant Professor, Department of Computer Science, University of Kashmir, South Campus

Academic Coordinator: Mr. Mohsin Altaf Wani
Assistant Professor, Department of Computer Science, University of Kashmir, South Campus

Acknowledgement
In the name of “Allah”, the most beneficent and merciful, the creator of treasures
of knowledge and wisdom, who gave us the strength and knowledge to complete this
project.
We are highly thankful to our project guide Dr. Hilal Ahmad Khanday (Assistant
Professor, Department of Computer Science, University of Kashmir South Campus),
without whose support this project would have been impossible. His constructive
advice and constant motivation have been responsible for the successful completion
of this project.
We are also thankful to our teachers and other staff members of our Department.
They have been very helpful and kind to us throughout our MCA course.

We would like to extend our thanks to all those authors and researchers whose
research papers and articles have provided the diversity of interesting material
that helped us make this work possible.

Danish Rashid (17045113002)


Shabnam Nisar (17045113040)
Ovais Ahmad Bhat (17045113043)
Table of Contents

ABSTRACT....................................................................................................................................................9
INTRODUCTION...........................................................................................................................................2
Problem Statement.................................................................................................................................4
Objectives................................................................................................................................................4
System Requirements..............................................................................................................................5
Development tools..................................................................................................................................7
IMPLEMENTATION....................................................................................................................................12
Chapter1:-Data Handling:..........................................................................................................................13
Import packages:-..................................................................................................................................13
Import datasets:-...................................................................................................................................13
Chapter 2:- Exploratory Data Analysis.......................................................................................................15
Exploring Data through visualizations:-.................................................................................................15
Chapter 3:-Feature Engineering:...............................................................................................................26
Treating the missing values...................................................................................................................26
Drawing the correlation matrix:-...........................................................................................................30
Dealing with categorical variables:-.......................................................................................................33
Label encoding for the categorical variables:..............................................................................33
One Hot Encoding for the categorical variables:-..............................................................................35
Chapter 4:-Model Building:.......................................................................................................................37
Splitting the Dataset..............................................................................................................................37
Labeling of Data:-..................................................................................................................................39
Chapter 5:- Modelling................................................................................................................................40
Linear Regression:-................................................................................................................................40
Decision Tree Regression:......................................................................................................................42
XGBoost Regression:..............................................................................................................................43
Random Forest Regression:...................................................................................................................44
Support Vector Regression:-..................................................................................................................45
RESULT:-....................................................................................................................................................47
FUTURE......................................................................................................................................................48
REFERENCES..............................................................................................................................49
ABSTRACT
Sales forecasting is the process of using a company’s past sales records to predict
its short-term or long-term sales performance. It is one of the pillars of proper
financial planning. As with any prediction-related process, risk and uncertainty are
unavoidable in sales forecasting as well. In the age of the internet, the amount of
data being generated is so large that it cannot be processed manually. Shopping
malls and big marts now keep track of the sales of each individual item in order to
predict future customer demand and update inventory management accordingly. These
stores hold large volumes of customer data and individual item attributes in a data
warehouse, from which anomalies and frequent patterns can be detected through data
mining. The resulting data can be used to predict future sales volume with the help
of different machine learning techniques for retailers such as Big Mart, and many
machine learning techniques have been developed for this purpose. In this project,
we predict the sales of a retail store using different machine learning techniques
and try to determine the algorithm best suited to our particular problem statement.
We have implemented various regression techniques and found that the resulting model
performs better than several existing models.

INTRODUCTION
Sales forecasting can be defined as the prediction of upcoming sales based on sales
that have occurred in the past. It is of paramount importance for companies that are
entering new markets, adding new services or products, or experiencing high growth.
The main reason a company prepares a forecast is to balance marketing resources and
sales against supply capacity planning.

In this model, we address the problem of forecasting the future customer demand for
items in different Big Mart stores across various locations and products, based on
previous records. Different machine learning algorithms such as linear regression
and random forest are used to predict or forecast sales volume. Since good sales are
the life of every organization, sales forecasting plays an important role in any
shopping complex. A better prediction is always helpful for developing and enhancing
business strategies for the marketplace, and it also improves knowledge of the
marketplace. A standard sales prediction study helps to analyze the situations or
conditions that occurred previously; inferences can then be drawn about customer
acquisition, funds inadequacy and strengths before setting a budget and marketing
plans for the upcoming year. In other words, sales prediction is based on the
resources available from the past. In-depth knowledge of the past is required for
improving the prospects in the marketplace irrespective of circumstances, especially
external circumstances, and it allows the business to prepare for upcoming needs.
Extensive research is going on in the retail domain for forecasting future sales
demand. The aim of this data science project is to build a predictive model and find
out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and
stores which play a key role in increasing sales. This may help cut down wasteful
expenditure so that goods can be offered at a fair price, and it may help the
business decide whether to add a new product to its product line or drop an
unsuccessful one.

Problem Statement

Predicting a company’s sales requires time series data for that company; based on
that data, a model can predict the company’s or a product’s future sales. In this
project we will analyze the time series sales data of a company and predict its
sales for the coming quarter and for a specific product.
The data scientists at Big Mart have collected 2013 sales data for 1559 products
across 10 stores in different cities. Certain attributes of each product and store
have also been defined. We will use this data set to train our model. The aim is to
build a predictive model and find out the sales of each product at a particular store.

Objectives

The main objective is to understand whether specific properties of products or
stores play a significant role in increasing or decreasing sales volume. To achieve
this goal we will build a predictive model and find out the sales of each product at
a particular store.

 Sales forecasting aims to predict future sales and is used as the basis for planning
time and resources. A good forecast should have several objectives, all directed at
identifying what you will sell, when you will sell it, and to whom.

 Understand the historical demand for the products.

 Build a predictive ML model to obtain forecasts.

System Requirements
Operating System: Windows XP Service Pack 3; Windows Server 2003 R2 with Service Pack 2; Windows Vista Service Pack 1 or 2; Windows Server 2008 Service Pack 2 or R2; Windows 7
Processors: Any Intel or AMD x86 processor supporting the SSE2 instruction set
Disk Space: 1 GB for ANACONDA only; 3–4 GB for a typical installation
RAM: 1024 MB (at least 2048 MB recommended)

Operating System: Mac OS X 10.5.5 (Leopard) and above; Mac OS X 10.6.x (Snow Leopard)
Processors: All Intel-based Macs
Disk Space: 1 GB for ANACONDA only; 3–4 GB for a typical installation
RAM: 1024 MB (at least 2048 MB recommended)

Operating System: Ubuntu 8.04, 8.10, 9.04, and 9.10; Red Hat Enterprise Linux 5.x
Processors: Any Intel or AMD x86 processor supporting the SSE2 instruction set
Disk Space: 1 GB for ANACONDA only; 3–4 GB for a typical installation
RAM: 1024 MB (at least 2048 MB recommended)
Development tools

There are many resources that can adequately cover all our machine learning
and artificial intelligence needs.
We used several tools to develop our “Sales Forecasting” model. These
tools are briefly listed below:-
Programming Languages

a. Python:- Python is a language favored for its readability, relatively mild learning
curve, and functional structure, and it is used in many domains. The language is beginner
friendly and quite simple, and you do not have to know all of its intricacies to use it
for machine learning. Python’s machine learning ecosystem is used in this model.
Data Analytics and Visualization Tools

a) ANACONDA:- Anaconda is a conditionally free and open-source distribution of
the Python and R programming languages for scientific computing (data science,
machine learning applications, large-scale data processing, predictive analytics,
etc.) that aims to simplify package management and deployment. The Anaconda
distribution comes with over 250 packages automatically installed, and over 7,500
additional open-source packages can be installed from PyPI as well as with the conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator,
as a graphical alternative to the command line interface (CLI).
The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data sci-
ence and the reason conda exists.
When pip installs a package, it automatically installs any dependent Python packages
without checking whether these conflict with previously installed packages. It will
install a package and any of its dependencies regardless of the state of the existing
installation. Because of this, a user with a working installation of, for example,
Google TensorFlow can find that it stops working after using pip to install a
different package that requires a different version of the dependent NumPy library
than the one used by TensorFlow. In some cases, the package may appear to work but
produce different results in detail.
In contrast, conda analyses the current environment, including everything currently
installed, and, together with any version limitations specified (e.g. the user may
wish to have TensorFlow version 2.0 or higher), works out how to install a compatible
set of dependencies, and shows a warning if this cannot be done.

b) Jupyter Notebook:- It is an application used in interactive computing. It is a
powerful and simple tool for tinkering with data analysis problems. It allows users
to write text descriptions and Python code, and to embed charts and plots directly
into an interactive webpage. Users can create and share documents, develop and
execute code, and present or discuss results using live code. It integrates with a
number of tools, supports container platforms, and extends to over 40 programming
languages.
c) Pandas

It is a popular library used for retrieving and preparing data to be used later in
other machine learning libraries. Pandas enables its users to fetch data from
different sources easily. It simplifies analysis by converting JSON, SQL, TSV, or CSV
data into a data frame: a Python object that looks like an SPSS table or an Excel
sheet, with rows and columns.
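As a minimal sketch of that workflow (the file name below is hypothetical, not part of this project):

import pandas as pd

# Load a CSV file into a DataFrame and take a first look at it.
sales = pd.read_csv('sales.csv')   # hypothetical file name
print(sales.head())                # first five rows
print(sales.dtypes)                # column types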

d) Matplotlib

It is a 2D plotting library for Python. Plotting can be defined as the visualization
of machine learning data. Matplotlib allows its users to generate production-quality
visualizations with just a few lines of code. Users can draw different kinds of
charts and plots for visualizing results, and the drawn plots can be easily embedded
in Jupyter Notebook. This means that a user can always visualize the data and the
results obtained from the models.
e) Seaborn :-

Seaborn is a library for making statistical graphics in Python. It builds on top of


matplotlib and integrates closely with pandas data structures. Seaborn helps you
explore and understand your data.
f) XGBoost :-

XGBoost is an open source library providing a high-performance implementation


of gradient boosted decision trees. An underlying C++ codebase combined with a
Python interface sitting on top makes for an extremely powerful yet easy to imple-
ment package.
g) Scikit-learn:-

Scikit-learn is a library in Python that provides many unsupervised and supervised


learning algorithms. It’s built upon some of the technology we might already be fa-
miliar with, like NumPy, pandas, and Matplotlib.
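As a hedged sketch of the fit/predict pattern used throughout this report, shown here on a tiny made-up dataset (the data and names below are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: a single feature and a target that is exactly twice the feature.
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # feature matrix
y = np.array([2.0, 4.0, 6.0, 8.0])           # target values

model = LinearRegression()
model.fit(X, y)                    # learn the coefficients
print(model.predict([[5.0]]))      # approximately [10.]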

IMPLEMENTATION

Fig.1. Working procedure of proposed model.

Chapter 1:- Data Handling:

Import packages:-

A package is basically a directory with Python files and a file with the
name __init__.py. This means that every directory inside the Python path which
contains a file named __init__.py will be treated as a package by Python. It is
possible to put several modules into a package.

We used various types of packages in our model and imported them at the start of
the project.

import warnings
warnings.filterwarnings('ignore')   # suppress warning messages in the output

import numpy as np                  # numerical computing
import pandas as pd                 # data handling
import matplotlib.pyplot as plt     # plotting
import seaborn as sns               # statistical visualization
import xgboost as xgb               # gradient boosted trees

Import datasets:-

The first step is to look at the data and try to identify the information we
hypothesized versus the available data. A comparison between the data dictionary on
the competition page and our hypotheses is shown below:

Observations: 8,523
Variables: 12

Variable                      Description
Item_Identifier               Unique product ID
Item_Weight                   Weight of product
Item_Fat_Content              Whether the product is low fat or not
Item_Visibility               The % of total display area of all products in a store allocated to the particular product
Item_Type                     The category to which the product belongs
Item_MRP                      Maximum Retail Price (list price) of the product
Outlet_Identifier             Unique store ID
Outlet_Establishment_Year     The year in which the store was established
Outlet_Size                   The size of the store in terms of ground area covered
Outlet_Location_Type          The type of city in which the store is located
Outlet_Type                   Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales             Sales of the product in the particular store. This is the outcome variable to be predicted.

Command used to load the Train data set (“Train.csv”):

Train_data = pd.read_csv(r'E:\Sales Forecasting\Dataset\Train.csv')

Command used to load the Test data set (“Test.csv”):

Test_data = pd.read_csv(r'E:\Sales Forecasting\Dataset\Test.csv')
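A quick sanity check after loading is useful (a hedged sketch; the expected shape follows from the data description above, and the output is not reproduced here):

print(Train_data.shape)   # (8523, 12) per the data description above
print(Test_data.shape)    # the test file has the same columns except Item_Outlet_Sales
Train_data.head()         # first few rows
Train_data.info()         # column types and missing-value counts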

Chapter 2:- Exploratory Data Analysis

We will start off by plotting and exploring all the individual variables to gain some
insights.

Exploring Data through visualizations:-

Univariate Analysis:

for i in Train_data.describe().columns:
    sns.distplot(Train_data[i].dropna())
    plt.show()

We also used one more type of graph to plot these visualizations:-

# Boxplot:-
for i in Train_data.describe().columns:
    sns.boxplot(Train_data[i].dropna())
    plt.show()

Bivariate Analysis:
After looking at every feature individually, let’s now explore them again with re-
spect to the target variable. Here we will make use of scatter plots for continuous
or numeric variables.

Item_Weight and Item_Outlet_Sales analysis:


plt.figure(figsize=(13,9))
plt.xlabel("Item_Weight")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_Weight and Item_Outlet_Sales analysis")
sns.scatterplot(x='Item_Weight', y='Item_Outlet_Sales', hue='Item_Type', size='Item_Weight', data=Train_data)

Item_Visibility and Item_Outlet_Sales Analysis:


plt.figure(figsize=(13,9))
plt.xlabel("Item_Visibility")
plt.ylabel("Item_Outlet_Sales")
plt.title("Item_Visibility and Item_Outlet_Sales analysis")
sns.scatterplot(x='Item_Visibility', y='Item_Outlet_Sales', hue='Item_Type', size='Item_Weight', data=Train_data)

Impact of Outlet_Type on Item_Outlet_Sales:-
Outlet_Type_Pivot = \
    Train_data.pivot_table(index='Outlet_Type', values="Item_Outlet_Sales", aggfunc=np.median)

Outlet_Type_Pivot.plot(kind='bar', color='brown', figsize=(12,7))
plt.xlabel("Outlet_Type")
plt.ylabel("Item_Outlet_Sales")
plt.title("Impact of Outlet_Type on Item_Outlet_Sales")
plt.xticks(rotation=0)
plt.show()

Impact of Item_Fat_Content on Item_Outlet_Sales:-


Item_Fat_Content_pivot = \
    Train_data.pivot_table(index='Item_Fat_Content', values="Item_Outlet_Sales", aggfunc=np.median)

Item_Fat_Content_pivot.plot(kind='bar', color='blue', figsize=(12,7))
plt.xlabel("Item_Fat_Content")
plt.ylabel("Item_Outlet_Sales")
plt.title("Impact of Item_Fat_Content on Item_Outlet_Sales")
plt.xticks(rotation=0)
plt.show()

Distribution Of Outlet_Type:-

plt.figure(figsize = (10,8))

sns.countplot(Train_data.Outlet_Type)

plt.xticks(rotation = 90)

Distribution Of Outlet_Location Type:-

plt.figure(figsize = (10,8))

sns.countplot(Train_data.Outlet_Location_Type)

Train_data.Outlet_Location_Type.value_counts()

Output Will be:


Tier 3 3350
Tier 2 2785
Tier 1 2388
Name: Outlet_Location_Type, dtype: int64

Chapter 3:- Feature Engineering:

Most of the time the given features in a dataset are not enough to give satisfactory
predictions. In such cases, we have to create new features which might help in
improving the model’s performance. Let’s try to create some new features for our
dataset.

In order to do the feature engineering, we combined both data sets so that the
features of both sets could be engineered together.

Train_data['Source'] = 'train'
Test_data['Source'] = 'test'
df = pd.concat((Train_data, Test_data), ignore_index = True)

df.shape
(14204, 13)
df.columns

Index(['Item_Fat_Content', 'Item_Identifier', 'Item_MRP', 'Item_Outlet_Sales',
       'Item_Type', 'Item_Visibility', 'Item_Weight',
       'Outlet_Establishment_Year', 'Outlet_Identifier',
       'Outlet_Location_Type', 'Outlet_Size', 'Outlet_Type', 'Source'],
      dtype='object')

We did feature engineering through two steps, discussed below:-

Treating the missing values

Missing data can have a severe impact on building predictive models because the
missing values might contain some vital information which could help in making
better predictions. So, it becomes imperative to carry out missing data imputation.
There are different methods to treat missing values based on the problem and the
data. Some of the common techniques are as follows:
1. Deletion of rows: In the train dataset, observations having missing values in any
   variable are deleted. The downside of this method is the loss of information and
   a drop in the prediction power of the model.
2. Mean/Median/Mode Imputation: In the case of a continuous variable, missing values
   can be replaced with the mean or median of all known values of that variable. For
   categorical variables, we can use the mode of the given values to replace the
   missing values.

 #Item_Weight:-
df['Item_Weight'].mean()
df['Item_Weight'].fillna(df['Item_Weight'].mean(),inplace=True)
df.isnull().sum()

Output will be:-

Item_Fat_Content 0
Item_Identifier 0
Item_MRP 0
Item_Outlet_Sales 5681
Item_Type 0
Item_Visibility 0
Item_Weight 0
Outlet_Establishment_Year 0
Outlet_Identifier 0
Outlet_Location_Type 0
Outlet_Size 4016
Outlet_Type 0
Source 0
dtype: int64

 #Outlet_Size:-
df['Outlet_Size'].value_counts()
df['Outlet_Size'].fillna('Medium',inplace=True)
df.isnull().sum()

 For Item_Visibility column:


#Item_Visibility:-
df[df['Item_Visibility']==0]['Item_Visibility'].count()
df['Item_Visibility'].fillna(df['Item_Visibility'].median(),inplace=True)

 #We will make one more column here that will show us how old the store
is and we will name it as Outlet_Years:-
df['Outlet_Establishment_Year'].value_counts()
df['Outlet_Years']=2020-df['Outlet_Establishment_Year']
df['Outlet_Years'].describe()

The output of above code will be:


count 14204.000000
mean 22.169319
std 8.371664
min 11.000000
25% 16.000000
50% 21.000000
75% 33.000000
max 35.000000
Name: Outlet_Years, dtype: float64
 #Item_Type:
#We will be creating 3 categories instead of the already existing 16 categories,
#using only the first 2 characters of Item_Identifier.

df['Item_Type'].value_counts()
df['Item_Identifier'].value_counts()
# 'FD'-FOOD
# 'DR'-DRINK
# 'NC'-NON-CONSUMABLE
df['New_Item_Type']=df['Item_Identifier'].apply(lambda x:x[0:2])
# Rename them to make categories:-
df['New_Item_Type']=df['New_Item_Type'].map({'FD':'Food','NC':'Non-Consumable','DR':'Drinks'})
df['New_Item_Type'].value_counts()

Output will be:


Food 10201
Non-Consumable 2686
Drinks 1317
Name: New_Item_Type, dtype: int64

# If a product is non-consumable, there is no reason to associate a fat content
# with it, so we mark such items as 'Non-Edible':

df.loc[df['New_Item_Type']=='Non-Consumable', 'Item_Fat_Content'] = 'Non-Edible'
df['Item_Fat_Content'].value_counts()

Output will be:-

Low Fat       5998
Regular       4824
Non-Edible    2686
LF             367
reg            195
low fat        134
Name: Item_Fat_Content, dtype: int64

 df.head()
 df.describe()
 df.columns

Drawing the correlation matrix:-

Train_data.corr()

plt.figure(figsize=(35,15))
sns.heatmap(Train_data.corr(),vmax=1,square=True,annot=True,cmap='viridis')
plt.title('Correlation between different attributes')
plt.show()

corr=df.corr()
sns.heatmap(corr,annot=True,cmap='coolwarm')

Dealing with categorical variables:-

In this stage, we will convert our categorical variables into numerical ones. We
will use 2 techniques — Label Encoding and One Hot Encoding.
1. Label encoding simply means converting each category in a variable to a
number. It is more suitable for ordinal variables — categorical variables
with some order.
2. In  One hot encoding, each category of a categorical variable is converted
into a new binary column (1/0).
We will use both the encoding techniques.

Label encoding for the categorical variables:


In machine learning, we usually deal with datasets which contain multiple labels
in one or more columns. These labels can be in the form of words or numbers. To
make the data understandable, or human readable, the training data is often labeled
in words.
Label encoding refers to converting the labels into numeric form so as to convert
them into machine-readable form. Machine learning algorithms can then decide, in a
better way, how those labels should be handled. It is an important preprocessing
step for a structured dataset in supervised learning.

Example:
Suppose a data set has a column Height containing the values tall, medium, and
short. After applying label encoding, the Height column is converted into numeric
labels, for example 0 for tall, 1 for medium, and 2 for short.
We will label encode
['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','New_Item_Type','Outlet_Type','Outlet']
as these are ordinal variables.

from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

# New variable for outlet:-
df['Outlet'] = label.fit_transform(df['Outlet_Identifier'])

varib = ['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','New_Item_Type','Outlet_Type','Outlet']
for i in varib:
    df[i] = label.fit_transform(df[i])

df.head()

One Hot Encoding for the categorical variables:-


Sometimes in datasets we encounter columns whose categories have no specific order
of preference. The data in such a column usually denotes a category, and when it is
simply label encoded the resulting numbers can mislead the machine learning model
into assuming an order. To avoid this, the data in the column should be One Hot
encoded.
One Hot encoding refers to splitting the column containing categorical data into as
many columns as there are categories present in it. Each new column contains “0” or
“1” depending on which category the row belongs to.

#DummyVariable:-

df = pd.get_dummies(df, columns=['Item_Fat_Content','Outlet_Location_Type','Outlet_Size','New_Item_Type','Outlet_Type','Outlet'])

df.head()

Chapter 4:- Model Building:

Before feeding our data into any model, it is good practice to preprocess the data.
We will do preprocessing on both the independent variables and the target variable.
Splitting the Dataset

When splitting a dataset there are two competing concerns:

- If you have less training data, your parameter estimates have greater variance.
- If you have less testing data, your performance statistic will have greater variance.

df.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)

df.columns

train=df.loc[df['Source']=='train']
train.shape
(8523, 38)

test=df.loc[df['Source']=='test']
test.shape
(5681, 38)

train.drop(['Source'],axis=1,inplace=True)
train.columns

OUTPUT:
Index(['Item_Identifier', 'Item_MRP', 'Item_Outlet_Sales', 'Item_Visibility',
       'Item_Weight', 'Outlet_Identifier', 'Outlet_Years', 'Item_Visib_avg',
       'Item_Fat_Content_0', 'Item_Fat_Content_1', 'Item_Fat_Content_2',
       'Item_Fat_Content_3', 'Item_Fat_Content_4', 'Item_Fat_Content_5',
       'Outlet_Location_Type_0', 'Outlet_Location_Type_1',
       'Outlet_Location_Type_2', 'Outlet_Size_0', 'Outlet_Size_1',
       'Outlet_Size_2', 'New_Item_Type_0', 'New_Item_Type_1',
       'New_Item_Type_2', 'Outlet_Type_0', 'Outlet_Type_1', 'Outlet_Type_2',
       'Outlet_Type_3', 'Outlet_0', 'Outlet_1', 'Outlet_2', 'Outlet_3',
       'Outlet_4', 'Outlet_5', 'Outlet_6', 'Outlet_7', 'Outlet_8', 'Outlet_9'],
      dtype='object')

test.drop(['Item_Outlet_Sales','Source'],axis=1,inplace=True)
test.columns

Labeling of Data:-

 For supervised learning to work, we need a labeled set of data that the model can
learn from to make correct decisions. Data labeling typically starts by asking hu-
mans to make judgments about a given piece of unlabeled data. For example, label-
ers may be asked to tag all the images in a dataset where “does the photo contain a
bird” is true. The tagging can be as rough as a simple yes/no or as granular as iden-
tifying the specific pixels in the image associated with the bird. The machine learn-
ing model uses human-provided labels to learn the underlying patterns in a process
called "model training." The result is a trained model that can be used to make pre-
dictions on new data.

In machine learning, a properly labeled dataset that we use as the objective stan-
dard to train and assess a given model is often called “ground truth.” The accuracy
of your trained model will depend on the accuracy of your ground truth, so spend-
ing the time and resources to ensure highly accurate data labeling is essential.

#Labelling Of Data Sets:

X_train=train.drop(['Item_Outlet_Sales','Item_Identifier','Outlet_Identifier'],axis=1)

X_train

y_train=train['Item_Outlet_Sales']

y_train.head()

X_test=test.drop(['Item_Identifier','Outlet_Identifier'],axis=1)

X_test.head()
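The models in the next chapter are scored on the training data itself. As a hedged sketch (not part of the original pipeline), the trade-off discussed at the start of this chapter could also be handled by holding out part of the labelled data as a validation set; the 80/20 ratio and random_state below are assumptions:

from sklearn.model_selection import train_test_split

# Hold out 20% of the labelled training data for validation (assumed ratio).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42)

print(X_tr.shape, X_val.shape)   # roughly 6818 training rows and 1705 validation rows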

Chapter 5:- Modelling

Finally, we have arrived at the most interesting stage of the whole process —
predictive modeling. We will start off with the simpler models and gradually move on
to more sophisticated ones. We will build the models using…

Linear Regression:-

Linear regression is the simplest and most widely used statistical technique for
predictive modeling. The linear regression equation is:

Y = θ0 + θ1X1 + θ2X2 + … + θnXn

where X1, X2, …, Xn are the independent variables, Y is the target variable, and the
thetas are the coefficients. The magnitude of a coefficient relative to the other
coefficients determines the importance of the corresponding independent variable.
For a good linear regression model, the data should satisfy a few assumptions. One of
these assumptions is the absence of multicollinearity, i.e., the independent variables
should not be correlated with each other. However, as per the correlation plot above,
we have a few highly correlated independent variables in our data. This issue of
multicollinearity can be dealt with using regularization.
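As a hedged illustration of that regularization (not part of the original pipeline), ridge regression adds an L2 penalty on the coefficients; the alpha value below is an assumption:

from sklearn.linear_model import Ridge

# Ridge regression: ordinary least squares plus an L2 penalty that shrinks the
# coefficients and reduces the variance caused by correlated variables.
ridge = Ridge(alpha=1.0)   # alpha controls the penalty strength (assumed value)
ridge.fit(X_train, y_train)
print("Ridge Regression Model Score:", ridge.score(X_train, y_train))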
For the time being, let’s build our linear regression model with all the variables:-

from sklearn.linear_model import LinearRegression

regressor = LinearRegression(normalize=True)

regressor.fit(X_train,y_train)

y_test=regressor.predict(X_test)

y_test

array([1843., 1454., 1883., ..., 1798., 3582., 1264.])
print("Linear Regression Model Score:",regressor.score(X_train,y_train))

Linear Regression Model Score: 0.5635701942241875

lr_accuracy=round(regressor.score(X_train,y_train)*100)
print("Linear Regression Model Accuracy:",lr_accuracy)

Linear Regression Model Accuracy: 56.0

Decision Tree Regression:

Decision tree regression observes features of an object and trains a model in the
structure of a tree to predict data in the future to produce meaningful continuous out-
put. Continuous output means that the output/result is not discrete, i.e., it is not repre-
sented just by a discrete, known set of numbers or values.
Let’s build our decision tree regression model with all the variables:-

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

tree = DecisionTreeRegressor(max_depth=15, min_samples_leaf=100)
tree.fit(X_train, y_train)
predict_r = tree.predict(X_test)
predict_r
array([1610.8769902 , 1412.15620504,  584.75776477, ..., 1827.83284545,
       3692.81910244, 1211.249992  ])

print("Decision Tree Regression Score:",tree.score(X_train,y_train))

Decision Tree Regression Score: 0.6153908290282177

tree_accuracy=round(tree.score(X_train,y_train)*100)
print("Decision Tree Regression Accuracy:",tree_accuracy)

Decision Tree Regression Accuracy: 62.0

XGBoost Regression:

XGBoost is a fast and efficient algorithm and has been used by the winners of many
data science competitions. It is a boosting algorithm. There are many tuning
parameters in XGBoost, which can be broadly classified into General Parameters,
Booster Parameters and Task Parameters.

 General parameters refer to which booster we are using to do the boosting. The
commonly used boosters are tree and linear models.
 Booster parameters depend on which booster you have chosen.
 Learning task parameters decide the learning scenario; for example, regression
tasks may use different parameters than ranking tasks.

Let’s have a look at the parameters that we are going to use in our model.

1. eta: It is also known as the learning rate or the shrinkage factor. It shrinks
   the feature weights to make the boosting process more conservative. The range is
   0 to 1. A low eta value means the model is more robust to overfitting.
2. gamma: The range is 0 to ∞. The larger the gamma, the more conservative the
   algorithm is.
3. max_depth: We can specify the maximum depth of a tree using this parameter.
4. subsample: It is the proportion of rows that the model will randomly select to
   grow trees.
5. colsample_bytree: It is the ratio of variables randomly chosen for building each
   tree in the model.
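For illustration only, a hedged sketch of how these parameters might be passed to XGBRegressor; the values below are assumptions and are not the settings used in the model that follows (which sets only the learning rate):

from xgboost import XGBRegressor

# Illustrative parameter settings (assumed values, not the project's choices).
demo_model = XGBRegressor(
    learning_rate=0.05,    # eta: shrinkage applied to each boosting step
    gamma=0,               # minimum loss reduction required to make a split
    max_depth=6,           # maximum depth of each tree
    subsample=0.8,         # fraction of rows sampled to grow each tree
    colsample_bytree=0.8,  # fraction of columns sampled for each tree
)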

from xgboost import XGBRegressor

model=XGBRegressor(learning_rate=0.05)

model.fit(X_train,y_train)

y_pred=model.predict(X_test)

y_pred
array([1660.4456, 1315.1495,  574.9488, ..., 1865.4355, 3743.5881,
       1242.0835], dtype=float32)

print("XGBoost Regression Model Score:",model.score(X_train,y_train))


XGBoost Regression Model Score: 0.6781707786110868

model_accuracy=round(model.score(X_train,y_train)*100)
print("XGBoost Regression Accuracy:",model_accuracy)

XGBoost Regression Accuracy: 68.0

RANDOM FOREST REGRESSION:

RandomForest is a tree-based bootstrapping algorithm wherein a certain number of
weak learners (decision trees) are combined to make a powerful prediction model.
For every individual learner, a random sample of rows and a few randomly chosen
variables are used to build a decision tree model. The final prediction can be a
function of all the predictions made by the individual learners. In the case of a
regression problem, the final prediction can be the mean of all the predictions.

from sklearn.ensemble import RandomForestRegressor


from sklearn.metrics import mean_squared_error

rf=RandomForestRegressor()
rf.fit(X_train,y_train)

predict_r=rf.predict(X_test)
predict_r
array([1621.5559 , 1516.3595 ,  755.74958, ..., 1796.52814, 4939.70336,
       1414.75842])

rf.score(X_train,y_train)
print("Random Forest Regression Model Score:",rf.score(X_train,y_train))

Random Forest Regression Model Score: 0.9142221121648255

rf_accuracy=round(rf.score(X_train,y_train)*100)
print("RandomForest Regression Accuracy:",rf_accuracy)

RandomForest Regression Accuracy: 91.0
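As a hedged illustration of the averaging described at the start of this section (using the rf model fitted above), a RandomForestRegressor prediction is simply the mean of the individual trees' predictions:

import numpy as np

# Stack the predictions of every fitted tree and average them; the result
# should match rf.predict(X_test) up to floating-point error.
tree_preds = np.stack([t.predict(X_test.values) for t in rf.estimators_])
manual_mean = tree_preds.mean(axis=0)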

SUPPORT VECTOR REGRESSION:-

A Support Vector Machine can also be used as a regression method, maintaining all
the main features that characterize the algorithm (maximal margin). Support Vector
Regression (SVR) uses the same principles as the SVM for classification, with only a
few minor differences. Because the output is a real number with infinitely many
possible values, an exact fit is difficult; instead, a margin of tolerance (epsilon)
is set around the regression function, within which errors are ignored. The main idea
is still the same: to minimize error by finding the hyperplane that maximizes the
margin, keeping in mind that part of the error is tolerated.
from sklearn.svm import SVR

svm=SVR(epsilon=15,kernel='linear')

svm.fit(X_train,y_train)

predict_r=svm.predict(X_test)

predict_r
array([1623.10709496, 1339.00939903, 2376.77467647, ..., 1770.54684921,
       3465.33724055, 1265.79536608])

svm.score(X_train,y_train)
print("Support Vector Regression Model Score:",svm.score(X_train,y_train))

Support Vector Regression Model Score: 0.5054852640111823

svm_accuracy=round(svm.score(X_train,y_train)*100)

print("Support Vector Regression Accuracy:",svm_accuracy)

Support Vector Regression Accuracy: 51.0

Result:-
After trying and testing 5 different algorithms, the RANDOM FOREST REGRESSOR has the
BEST SCORE (0.9142221121648255), followed by the XGBoost REGRESSOR with a SCORE of
0.6781707786110868.

     MODEL                        SCORE                 ACCURACY
01   LINEAR REGRESSION            0.5635701942241875    56.0
02   DECISION TREE REGRESSION     0.6153908290282177    62.0
03   XGBoost REGRESSION           0.6781707786110868    68.0
04   RANDOM FOREST REGRESSION     0.9142221121648255    91.0
05   SUPPORT VECTOR REGRESSION    0.5054852640111823    51.0

Future Work

No system is ever complete, and only time can reveal its shortcomings; the same is
the case with this system. Since this is an academic project, there is a lot of scope
for it in the future. Some of the future enhancements include:-

 Improving accuracy further through additional feature engineering.
 Training one of the models on our own data.
 Adding a recommender system to this project.

References

1. Beheshti-Kashi, S., Karimi, H.R., Thoben, K.D., Lutjen, M., Teucke, M.: A survey
on retail sales forecasting and prediction in fashion markets. Systems Science &
Control Engineering 3(1), 154–161 (2015)
2. Bose, I., Mahapatra, R.K.: Business data mining: a machine learning perspective.
Information & Management 39(3), 211–225 (2001)
3. Chu, C.W., Zhang, G.P.: A comparative study of linear and nonlinear models for
aggregate retail sales forecasting. International Journal of Production Economics
86(3), 217–231 (2003)
4. Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., Sartin, M.:
Combining content-based and collaborative filters in an online newspaper (1999)
5. Das, P., Chaudhury, S.: Prediction of retail sales of footwear using feedforward
and recurrent neural networks. Neural Computing and Applications 16(4-5), 491–502
(2007)
6. Domingos, P.M.: A few useful things to know about machine learning.
Communications of the ACM 55(10), 78–87 (2012)
7. Langley, P., Simon, H.A.: Applications of machine learning and rule induction.
Communications of the ACM 38(11), 54–64 (1995)
8. Loh, W.Y.: Classification and regression trees. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 1(1), 14–23 (2011)
9. Makridakis, S., Wheelwright, S.C., Hyndman, R.J.: Forecasting Methods and
Applications. John Wiley & Sons (2008)
10. Ni, Y., Fan, F.: A two-stage dynamic sales forecasting model for the fashion
retail. Expert Systems with Applications 38(3), 1529–1536 (2011)
11. Punam, K., Pamula, R., Jain, P.K.: A two-level statistical model for big mart
sales prediction. In: 2018 International Conference on Computing, Power and
Communication Technologies (GUCON), pp. 617–620. IEEE (2018)
12. Ribeiro, A., Seruca, I., Durão, N.: Improving organizational decision support:
Detection of outliers and sales prediction for a pharmaceutical distribution company.
Procedia Computer Science 121, 282–290 (2017)
13. Shrivas, T.: Big Mart dataset @ONLINE (Jun 2013),
https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/
14. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. Statistics
and Computing 14(3), 199–222 (2004)
