Big Sales Mart Final Script
1 What is EDA?
1. Exploratory Data Analysis, as the name suggests, helps us understand the data better.
2. It examines the variables and their relationship with the target variable.
3. It guides how the model should be built.
4. It gives an overall feel for the data and its patterns.
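The import cell is not visible in this export; the cells below assume the usual aliases for the libraries used throughout (a minimal sketch):

# Minimal sketch of the unshown import cell; these aliases are assumed by every cell below.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt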
[90]: train = pd.read_csv("./Downloads/Train_UWu5bXk.csv")
test = pd.read_csv("./Downloads/Test_u94Q5KV.csv")
Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 3735.1380
1 Supermarket Type2 443.4228
2 Supermarket Type1 2097.2700
3 Grocery Store 732.3800
4 Supermarket Type1 994.7052
(8523, 12)
(5681, 11)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier 8523 non-null object
Item_Weight 7060 non-null float64
Item_Fat_Content 8523 non-null object
Item_Visibility 8523 non-null float64
Item_Type 8523 non-null object
Item_MRP 8523 non-null float64
Outlet_Identifier 8523 non-null object
Outlet_Establishment_Year 8523 non-null int64
Outlet_Size 6113 non-null object
Outlet_Location_Type 8523 non-null object
Outlet_Type 8523 non-null object
Item_Outlet_Sales 8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.1+ KB
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713:
FutureWarning: Using a non-tuple sequence for multidimensional indexing is
deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will
be interpreted as an array index, `arr[np.array(seq)]`, which will result either
in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
[8]: # Summary Statistics of Sales
train.Item_Outlet_Sales.describe()
[9]: print(train.columns)
[10]: sns.distplot(train[pd.notnull(train.Item_Weight)]["Item_Weight"],
color="magenta")
[10]: <matplotlib.axes._subplots.AxesSubplot at 0x1c6cef39438>
[13]: train.Item_Visibility.describe()
1.2.4 Observation: Item MRP
Item MRP shows four distinct distributions (price bands), which means the values need treating/transforming; see the sketch below.
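One way to treat this, not shown in the original cells, is to bin Item_MRP into four price bands with pd.cut; the Item_MRP_Band column and the band labels below are illustrative assumptions:

# Hypothetical sketch: cut Item_MRP into four equal-width price bands.
train["Item_MRP_Band"] = pd.cut(train.Item_MRP, bins=4,
                                labels=["Low", "Medium", "High", "Very High"])
train.Item_MRP_Band.value_counts()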
[15]: # Item MRP Vs Sales
plt.scatter(train.Item_MRP, train.Item_Outlet_Sales, color= "hotpink")
[16]: train.head()
Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 3735.1380
1 Supermarket Type2 443.4228
2 Supermarket Type1 2097.2700
3 Grocery Store 732.3800
4 Supermarket Type1 994.7052
[19]: <matplotlib.axes._subplots.AxesSubplot at 0x1c6cf780d68>
[92]: # Replacing the Item_Fat_Content labels with their respective categories...
test.Item_Fat_Content = test.Item_Fat_Content.replace(
    to_replace=["low fat", "LF", "reg"],
    value=["Low Fat", "Low Fat", "Regular"])  # value list is an assumption; the original cell is cut off in this export
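The corresponding cleanup for train is not shown in this export; presumably the same mapping was applied (a sketch):

# Sketch: collapse the inconsistent labels "low fat"/"LF" to "Low Fat" and "reg" to "Regular" in train.
train.Item_Fat_Content = train.Item_Fat_Content.replace(
    to_replace=["low fat", "LF", "reg"],
    value=["Low Fat", "Low Fat", "Regular"])
train.Item_Fat_Content.value_counts()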
[22]: # Item Type
plt.figure(figsize=[10,7])
sns.countplot(x = "Item_Type", data = train)
plt.xticks(rotation= 90)
plt.show()
1.3.2 Observation: Item Type
1. Fruits and Vegetables
2. Snack Foods
3. Frozen Foods
These item types appear most frequently in the data.
[23]: sns.violinplot(x = "Outlet_Size",y = "Item_Outlet_Sales",data = train)
1.3.3 Bivariate Analysis - Item Visibility and Sales
[24]: plt.figure(figsize=[8,5])
plt.scatter(train.Item_Visibility, train.Item_Outlet_Sales,
color="red")
plt.xlabel("Item Visibility")
plt.ylabel("Sales")
plt.title("Scatterplot - Visibility Vs Sales")
plt.show()
1.3.4 Observation: Item Visibility Vs Sales
Items with higher visibility tend to have lower sales.
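A quick numeric check of this observation (not one of the original cells):

# A negative correlation coefficient would support the visibility-vs-sales observation above.
train[["Item_Visibility", "Item_Outlet_Sales"]].corr()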
[25]: # Item Weight Vs Sales
plt.scatter(train.Item_Weight, train.Item_Outlet_Sales,
color="blue", alpha = 0.3)
1.3.5 Observation: Item Weight Vs Sales
No clear pattern between item weight and sales…
[26]: train.head()
Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 3735.1380
1 Supermarket Type2 443.4228
2 Supermarket Type1 2097.2700
3 Grocery Store 732.3800
4 Supermarket Type1 994.7052
[29]: # Item Type Vs Sales
plt.figure(figsize=[10,6])
sns.boxplot(x = "Item_Type", y = "Item_Outlet_Sales",
data = train)
plt.xticks(rotation= 90)
plt.show()
[30]: print(train[train.Item_Outlet_Sales>8000].Item_Type.unique())
[32]: # OUT027 is the highest-selling outlet; check its establishment year
train[train.Outlet_Identifier=="OUT027"]["Outlet_Establishment_Year"].unique()
[33]: count 528.000000
mean 340.329723
std 249.979449
min 33.955800
25% 153.633350
50% 265.321300
75% 460.733600
max 1482.070800
Name: Item_Outlet_Sales, dtype: float64
1.3.9 Observation: Outlet_Location_Type & Outlet_Type
1. Tier 3 locations have the maximum sales.
2. Supermarket Type 3 outlets have the maximum sales.
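These two claims can be checked with a quick groupby (a sketch, not one of the original cells):

# Mean sales per location tier and per outlet type.
print(train.groupby("Outlet_Location_Type")["Item_Outlet_Sales"].mean())
print(train.groupby("Outlet_Type")["Item_Outlet_Sales"].mean())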
[93]: Item_Identifier 0
Item_Weight 1463
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 2410
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
[37]: test.isnull().sum()
[37]: Item_Identifier 0
Item_Weight 976
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 1606
Outlet_Location_Type 0
Outlet_Type 0
dtype: int64
[38]: test.Outlet_Size.value_counts()
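Later outputs show Outlet_Size with no missing values, but the imputation cells are not visible in this export. A minimal sketch of one common approach, assuming imputation by the most frequent size within each Outlet_Type:

# Sketch (assumption): fill missing Outlet_Size with the most common size for each Outlet_Type.
for df_ in (train, test):
    df_["Outlet_Size"] = df_.groupby("Outlet_Type")["Outlet_Size"].transform(
        lambda s: s.fillna(s.mode().iat[0]) if not s.mode().empty else s)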
Outlet_Type Item_Outlet_Sales
0 Supermarket Type1 3735.1380
1 Supermarket Type2 443.4228
2 Supermarket Type1 2097.2700
3 Grocery Store 732.3800
4 Supermarket Type1 994.7052
[96]: # FDN15
train[train.Item_Identifier=="FDN15"]["Item_Weight"]
[96]: 2 17.5
759 17.5
4817 17.5
5074 17.5
6163 17.5
6952 NaN
8349 NaN
Name: Item_Weight, dtype: float64
[108]: df.head()
[116]: train[train.Item_Identifier=="FDA15"]["Item_Weight"]
[116]: 0 9.3
831 9.3
2599 9.3
2643 9.3
4874 9.3
5413 9.3
6696 9.3
7543 9.3
Name: Item_Weight, dtype: float64
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
[114]: train["Item_Weight"] = train.groupby("Item_Identifier")["Item_Weight"].transform(lambda x: x.fillna(x.mean()))
[117]: test["Item_Weight"] = test.groupby("Item_Identifier")["Item_Weight"].transform(lambda x: x.fillna(x.mean()))
[118]: train.isnull().sum()
[118]: Item_Identifier 0
Item_Weight 4
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
Item_Outlet_Sales 0
dtype: int64
Index: []
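Four Item_Weight values are still missing after the identifier-wise fill, because a few identifiers have no recorded weight at all; the cell that resolves them is not shown. A minimal sketch of one way to handle the leftovers (falling back to the overall mean, an assumption rather than the notebook's method):

# Sketch (assumption): fall back to the overall mean weight for identifiers
# whose every occurrence of Item_Weight is missing.
train["Item_Weight"] = train.Item_Weight.fillna(train.Item_Weight.mean())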
[159]: test.isnull().sum()
[159]: Item_Identifier 0
Item_Weight 0
Item_Fat_Content 0
Item_Visibility 0
Item_Type 0
Item_MRP 0
Outlet_Identifier 0
Outlet_Establishment_Year 0
Outlet_Size 0
Outlet_Location_Type 0
Outlet_Type 0
dtype: int64
[160]: test[pd.isnull(test.Item_Weight)]
Empty DataFrame
Columns: [Item_Identifier, Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type, Item_MRP, Outlet_Identifier, Outlet_Establishment_Year, Outlet_Size, Outlet_Location_Type, Outlet_Type]
Index: []
[161]: 'FD'
[162]: itemid = []
for i in range(0, len(train.Item_Identifier)):
    itemid.append(train.Item_Identifier[i][:2])
[164]: train.head()
Outlet_Establishment_Year Outlet_Size Outlet_Location_Type \
0 1999 Medium Tier 1
1 2009 Medium Tier 3
2 1999 Medium Tier 1
3 1998 Medium Tier 3
4 1987 High Tier 3
[165]: itemid = []
for i in range(0, len(test.Item_Identifier)):
    itemid.append(test.Item_Identifier[i][:2])
# Creating a new variable called Item_Category
test["Item_Category"] = itemid
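The same two-character prefix can be obtained without an explicit loop; an equivalent vectorized form (a sketch), which also covers train, whose assignment is not visible in this export:

# Equivalent vectorized version of the prefix extraction above.
train["Item_Category"] = train.Item_Identifier.str[:2]
test["Item_Category"] = test.Item_Identifier.str[:2]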
[58]: train.head()
1 Supermarket Type2 443.4228 DR 4
2 Supermarket Type1 2097.2700 FD 14
3 Grocery Store 732.3800 FD 15
4 Supermarket Type1 994.7052 NC 26
[171]: train["Item_Weight"]=train.Item_Weight.astype("float")
[172]: test["Item_Weight"]=test.Item_Weight.astype("float")
[173]: test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5681 entries, 0 to 5680
Data columns (total 13 columns):
Item_Identifier 5681 non-null object
Item_Weight 5681 non-null float64
Item_Fat_Content 5681 non-null object
Item_Visibility 5681 non-null float64
Item_Type 5681 non-null object
Item_MRP 5681 non-null float64
Outlet_Identifier 5681 non-null object
Outlet_Establishment_Year 5681 non-null int64
Outlet_Size 5681 non-null object
Outlet_Location_Type 5681 non-null object
Outlet_Type 5681 non-null object
Item_Category 5681 non-null object
Existence_Years 5681 non-null int64
dtypes: float64(3), int64(2), object(8)
memory usage: 577.1+ KB
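Existence_Years appears in the info above, but the cell that creates it is not shown; presumably it was derived from Outlet_Establishment_Year. A minimal sketch, assuming 2013 (the reference year commonly used for this dataset) as the cutoff:

# Sketch (assumption): outlet age relative to 2013, the data-collection year
# commonly used for the Big Mart sales problem.
train["Existence_Years"] = 2013 - train.Outlet_Establishment_Year
test["Existence_Years"] = 2013 - test.Outlet_Establishment_Year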
[176]: train.shape
[177]: test.shape
[180]: # Item Type
train.Item_Type.unique()
[181]: # Perishables
perishables =['Dairy', 'Meat', 'Fruits and Vegetables','Frozen Foods',
'Breakfast','Breads','Seafood']
[184]: # Where it matches the list of perishables, "Perishables", else "Non Perishables"
train["ItemType_Cat"] = np.where(train.Item_Type.isin(perishables),
                                 "Perishables", "Non Perishables")
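The matching step for the test set is not shown; presumably the same split was applied (a sketch):

# Sketch: apply the same perishable / non-perishable split to the test set.
test["ItemType_Cat"] = np.where(test.Item_Type.isin(perishables),
                                "Perishables", "Non Perishables")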
[186]: train.head()
3 FDX07 19.20 Regular 0.053931
4 NCD19 8.93 Low Fat 0.053931
Price_Per_Unit ItemType_Cat
0 26.861204 Perishables
1 8.153581 Non Perishables
2 8.092457 Perishables
3 9.484115 Perishables
4 6.031512 Non Perishables
newtrain = train.drop([…], axis = 1)
newtest = test.drop([…], axis = 1)
[189]: print(newtrain.shape)
print(newtest.shape)
(8523, 12)
(5681, 11)
[190]: newtrain.head()
3 Applying Encoding
Label or one-hot encoding (OHE)?
Label encoding is applied where the values follow a natural order, i.e., one value can be said to be greater than another, e.g., shirt sizes S < M < L. Where there is no such order, it is appropriate to apply OHE (pd.get_dummies); a sketch follows below.
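A minimal sketch of how dummy_train and dummy_test could be produced with pd.get_dummies; the exact column list used in the notebook is not shown, so the one below is an assumption:

# Sketch (assumed column list): one-hot encode the nominal columns of newtrain / newtest.
cat_cols = ["Item_Fat_Content", "Outlet_Size", "Outlet_Location_Type",
            "Outlet_Type", "Item_Category", "ItemType_Cat"]
dummy_train = pd.get_dummies(newtrain, columns=cat_cols)
dummy_test = pd.get_dummies(newtest, columns=cat_cols)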
[192]: dummy_train.head()
0 26.861204 1 0
1 8.153581 0 1
2 8.092457 1 0
3 9.484115 0 1
4 6.031512 1 0
Outlet_Size_High Outlet_Size_Medium … \
0 0 1 …
1 0 1 …
2 0 1 …
3 0 1 …
4 1 0 …
[5 rows x 23 columns]
[195]: from sklearn.preprocessing import StandardScaler  # import and sc are assumed; the cell defining them is not shown
sc = StandardScaler()
scaled_train = pd.DataFrame(sc.fit_transform(dummy_train),
                            columns=dummy_train.columns)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py:625:
DataConversionWarning: Data with input dtype uint8, int64, float64 were all
converted to float64 by StandardScaler.
return self.partial_fit(X, y)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\base.py:462:
DataConversionWarning: Data with input dtype uint8, int64, float64 were all
converted to float64 by StandardScaler.
return self.fit(X, **fit_params).transform(X)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246:
FutureWarning: The default value of n_estimators will change from 10 in version
0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
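The FutureWarning above comes from a RandomForestRegressor cell that is not visible in this export; a minimal sketch of that step, with the feature/target split being an assumption:

# Sketch (assumption): scaled_train holds the scaled features plus the scaled target.
from sklearn.ensemble import RandomForestRegressor

X = scaled_train.drop("Item_Outlet_Sales", axis=1)
y = scaled_train["Item_Outlet_Sales"]

# Pinning n_estimators avoids the FutureWarning about its changing default.
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)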
3.0.1 Conversion of scaled values back into original values…
exp(x) and log(x)
1. This will be achieved using the StandardScaler.
2. You have to fit the StandardScaler on the column you wish to transform and then apply inverse_transform to bring the values back to the original scale.
[213]: # Apply Standard Scaler
mysales = sc.fit_transform(pd.DataFrame(train.Item_Outlet_Sales))
[209]: cd
C:\Users\Mumbai-Admin
[217]: # Adaboost
from sklearn.ensemble import AdaBoostRegressor
ada =AdaBoostRegressor()
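pred_ada is inverse-transformed in the next cell, but the cell that fits the model and produces it is not shown; a minimal sketch of that step, reusing the assumed X and y from the random-forest sketch above together with a hold-out split:

# Sketch (assumption): fit AdaBoost on a train split and predict on the hold-out part.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
ada.fit(X_train, y_train)
pred_ada = ada.predict(X_val)  # scaled predictions; brought back to the original sales units below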
[219]: # Inverse Transform
sales = sc.inverse_transform(pred_ada)