MAJOR PROJECT

SYNOPSIS
Submitted in partial fulfillment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
on

BIG MART SALES PREDICTION USING MACHINE LEARNING

Submitted by:
Aashish Kumar(BFSI)
Aryan Tayal(ECRA)
Vishwatej Ajay(ECRA)

Under the guidance of


Dr. Shweta Mongia
Associate Professor
Department of Informatics

School of Computer Science and Engineering


Department of Informatics
UPES, Dehradun-248007
2016-2020
DECLARATION

I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person, nor material which has been accepted for the award of any other
degree or diploma of the University or any other institute of higher learning, except
where due acknowledgement has been made in the text.

Mentor Name Mentor Signature

Date
ACKNOWLEDGEMENT

I would like to express my deepest appreciation to all those who made it possible
for me to complete this report. I am very thankful to UPES for giving me the
opportunity to undertake this project. I wish to express my sincere gratitude to my
project guide, Dr. Shweta Mongia, for providing me the opportunity to work on Big
Mart Sales Prediction and for her guidance and encouragement in carrying out this
project work. This project bears the imprint of many people.

I also appreciate the guidance given by the other supervisors and the panel
members, especially during the project presentation; their comments and advice
have improved our presentation skills.
1.3 WHAT IS PREDICTIVE MODELING?
Predictive modeling is a process that uses data and statistical techniques to predict
outcomes. Such models can be used to predict anything from sports outcomes and
TV ratings to technological advances and corporate earnings. The terms predictive
modeling and predictive analytics are often used interchangeably; however,
predictive analytics most often refers to commercial applications of predictive
modeling, while predictive modeling is used more generally or academically, and of
the two it is the more frequently used term. Machine learning is distinct from
predictive modeling and is defined as the use of statistical techniques that allow a
computer to construct predictive models. In practice, machine learning and
predictive modeling are often used interchangeably; however, machine learning is a
branch of artificial intelligence, which refers to intelligence displayed by machines.
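
As a minimal illustration of how a machine learning library constructs a predictive
model from data, the sketch below fits a linear regression to a tiny made-up dataset;
the numbers and variable names here are hypothetical and only demonstrate the
fit/predict pattern used later in this report.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: a single feature (e.g. advertising spend) vs. observed sales.
X = np.array([[10.0], [20.0], [30.0], [40.0]])
y = np.array([25.0, 45.0, 70.0, 90.0])

# The model learns its coefficients from the data, then predicts an unseen input.
model = LinearRegression().fit(X, y)
print(model.predict(np.array([[50.0]])))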

1.4 OVERVIEW:
Predictive modeling is useful because it can give insight into questions about the
future and allows users to create forecasts. To maintain a competitive advantage, it
is critical to have insight into future events and outcomes that challenge key
assumptions. Analytics leaders must align predictive modeling initiatives with an
organization’s strategic goals. For example, a computer chip manufacturer might set
a strategic priority to produce chips with the greatest number of transistors in the
industry by 2025. Analytics professionals could construct a predictive model to
forecast the number of transistors per chip, feeding the model product, geography,
sales, and other related trend data. Additional sources could include data about the
most transistor-dense chips, commercial demand for computing power, and
strategic partnerships between chip manufacturers and hardware manufacturers.
Once initiatives are in motion, analytics professionals can perform backward-looking
analyses to assess the accuracy of the predictive models and the success of the
initiatives.

1.5 BENEFITS OF PREDICTIVE MODELING:


In its multiple forms (predictive modeling, decision analysis and optimization,
transaction profiling, and predictive search), predictive analytics can be applied to
a range of business strategies and has been a key player in search advertising and
recommendation engines. These techniques can provide managers and executives
with decision-making tools to influence upselling, sales and revenue forecasting,
manufacturing optimization, and even new product development. Though useful
and beneficial, predictive analytics is not suitable for every problem or organization.
CODE AND DEMO:
Importing libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Importing Dataset:

train=pd.read_csv('Train.csv')

train.head(8)

O/P:
Taking a look at the features of the dataset:
train.info()
O/P:
Taking a look at the number of null values:
train.isnull().sum()
O/P:

Summary statistics of the dataset:


train.describe()
O/P:

Creating an array with feature sets:


x=train.iloc[:,:-1].values
x
O/P:

Creating an array of target variable:


y=train.iloc[:,11].values
y
O/P:

Imputing the missing values in the dataset:


For Item_Weight:
from sklearn.impute import SimpleImputer
# Impute missing Item_Weight values with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean', copy=False)
imputer = imputer.fit(train[['Item_Weight']])
train['Item_Weight'] = imputer.transform(train[['Item_Weight']])
train.isnull().sum()

O/P:

For Outlet_Size:

# Impute missing Outlet_Size values with the most frequent category.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent', copy=False)
imputer = imputer.fit(train[['Outlet_Size']])
train['Outlet_Size'] = imputer.transform(train[['Outlet_Size']])
train.isnull().sum()

O/P:
Importing test dataset:
test=pd.read_csv('Test.csv')

Concatenating the train and test datasets:


train['source']='train'
test['source']='test'
data = pd.concat([train, test],ignore_index=True,sort=True)
print(train.shape, test.shape, data.shape)
O/P:

Checking for missing values:


data.isnull().sum()
O/P:
Imputing missing values:
For Item_Weight:
# Refit the mean imputer on the combined data.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean', copy=False)
imputer = imputer.fit(data[['Item_Weight']])
data['Item_Weight'] = imputer.transform(data[['Item_Weight']])

For Outlet_Size:
# Refit the most-frequent imputer on the combined data.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent', copy=False)
imputer = imputer.fit(data[['Outlet_Size']])
data['Outlet_Size'] = imputer.transform(data[['Outlet_Size']])

Checking for missing values:


data.isnull().sum()
O/P:

Filtering categorical columns:


categorical_columns = [x for x in data.dtypes.index if data.dtypes[x]=='object']

Excluding ID columns and source:


categorical_columns = [x for x in categorical_columns if x not in ['Item_Identifier', 'Outlet_Identifier', 'source']]

Printing frequency of categories:


for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(data[col].value_counts())
O/P:
In our data exploration we saw that Item_Visibility has a minimum value of 0,
which makes no sense, since every product stocked in a store must have some
visibility to customers. Let’s treat these zeros as missing values and impute them
with the mean of the non-zero visibilities.
print(sum(data['Item_Visibility'] == 0))
O/P:
879

# Mean of Item_Visibility computed over the non-zero entries only.
non_zero_mean = data.loc[data['Item_Visibility'] != 0, 'Item_Visibility'].mean()
data.loc[data['Item_Visibility'] == 0, 'Item_Visibility'] = non_zero_mean
print(sum(data['Item_Visibility'] == 0))
O/P:
0
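
The fix above imputes with the overall mean of the non-zero visibilities. A
refinement, closer to the idea of using the mean visibility of that particular
product, is to impute per product; a sketch, assuming Item_Identifier identifies a
product:

non_zero = data['Item_Visibility'] != 0
# Mean of the non-zero visibilities for each product.
product_mean = data.loc[non_zero].groupby('Item_Identifier')['Item_Visibility'].mean()
# Replace each zero with the mean visibility of the same product
# (a product whose visibility is always zero would stay NaN and need a fallback).
data.loc[~non_zero, 'Item_Visibility'] = data.loc[~non_zero, 'Item_Identifier'].map(product_mean)
print(sum(data['Item_Visibility'] == 0))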

Creating a new column Outlet_Years:


data['Outlet_Years'] = 2013 - data['Outlet_Establishment_Year']
data['Outlet_Years'].describe()
O/P:

Earlier we saw that the Item_Type variable has 16 categories, which might not
prove very useful in our analysis, so it is a good idea to combine them. If we look
closely at the Item_Identifier of each item, we see that each one starts with either
“FD” (Food), “DR” (Drinks) or “NC” (Non-Consumables).
Getting the first two characters of ID:
data['Item_Type_Combined'] = data['Item_Identifier'].apply(lambda x: x[0:2])

Renaming them to more intuitive categories:


data['Item_Type_Combined'] = data['Item_Type_Combined'].map({'FD': 'Food', 'NC': 'Non-Consumable', 'DR': 'Drinks'})
data['Item_Type_Combined'].value_counts()
O/P:

Similarly for Item_Fat_Content:


print('Original Categories:')
print(data['Item_Fat_Content'].value_counts())
O/P:

data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'})
print(data['Item_Fat_Content'].value_counts())
O/P:

FEATURE ENGINEERING:
Since scikit-learn only accepts numerical variables, we need to convert all
categories of nominal variables into numeric types. Let’s start by turning all
categorical variables into numerical values using LabelEncoder(), which encodes
labels with values between 0 and n_classes - 1.

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
data['Outlet'] = le.fit_transform(data['Outlet_Identifier'])
var_mod = ['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size', 'Item_Type_Combined', 'Outlet_Type', 'Outlet']
for i in var_mod:
    data[i] = le.fit_transform(data[i])

After that, we can use get_dummies to generate dummy variables from these
numerical categorical variables.

data = pd.get_dummies(data, columns=['Item_Fat_Content', 'Outlet_Location_Type', 'Outlet_Size', 'Outlet_Type', 'Item_Type_Combined', 'Outlet'])
data.dtypes
O/P:
Dropping columns which have been converted to different types:
data.drop(['Item_Type','Outlet_Establishment_Year'],axis=1,inplace=True)
Dividing into test and train set:
# Use .copy() so the in-place drops below act on independent DataFrames.
train = data.loc[data['source'] == "train"].copy()
test = data.loc[data['source'] == "test"].copy()

Dropping unnecessary columns:

test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)

Exporting files as modified versions:


train.to_csv("train_modified.csv",index=False)
test.to_csv("test_modified.csv",index=False)
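
With the cleaned files exported, a natural next step is to fit a first predictive model
for Item_Outlet_Sales. The sketch below trains a plain linear regression on the
modified training data as a baseline; the hold-out split and the choice of
LinearRegression are illustrative assumptions, not the project’s final model.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

train = pd.read_csv('train_modified.csv')
# Identifier columns carry no numeric signal for a linear model.
X = train.drop(['Item_Outlet_Sales', 'Item_Identifier', 'Outlet_Identifier'], axis=1)
y = train['Item_Outlet_Sales']

# Hold out 20% of the training data to estimate generalization error.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
print('Validation RMSE:', rmse)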
