SFB - CIA 3 - Report - FINAL
By
ARITRA GHOSH (2327106)
DEBANIK HAZRA (2327113)
DIANA JOHN (2327114)
SHINY SHAH (2327150)
GARIKAPATI SANDEEP CHOWDARY (2327120)
MBA PROGRAMME
SCHOOL OF BUSINESS AND MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY), BANGALORE
JULY 2023
INTRODUCTION
Sales prediction in the retail domain is the focus of this study. For this purpose, a retail sales
dataset covering outlets of varied types and sizes across different locations is used.
This dataset contains crucial variables like Item Weight, Item Fat Content, Item Visibility, Item Type,
Item MRP, Outlet Size, Outlet Location Type, Outlet Type, and Item Outlet Sales, relevant to the
study.
The statistical methods used in this work are Point & Interval Estimation, One Sample t Test,
One Sample Test for Proportion, Two Sample t Test, ANOVA, Chi-square Test, Correlation, and
Simple Linear Regression.
RESEARCH METHODOLOGY
Dataset URL:
https://www.kaggle.com/datasets/adarshkumarjha/big-mart-sales-prediction
This dataset was utilized to forecast the demand in the retail domain.
BUSINESS PROBLEM
Demand Forecasting: The study focuses on forecasting the demand in the retail domain by analyzing
how product attributes like Item Weight, Item Fat Content, Item Visibility, Item Type, and Item MRP
and outlet attributes like Outlet Size, Outlet Location Type, and Outlet Type relate to Item Outlet Sales.
DATA DICTIONARY
Outlet Type          Type of outlet (e.g., Supermarket Type1, Grocery Store)    Qualitative     Nominal
Item Outlet Sales    Sales of the product in the outlet                         Quantitative    Continuous (Ratio)
Out of a population of 8523 observations, the optimum sample size for the study is computed below. Here:
Mean = 2181.2889
Assumed Mean = 2000
Margin of Error = 181
Significance level = Alpha = 0.05
1) Point & Interval Estimation
The sample mean and proportion are said to be the point estimators of the population mean and
proportion respectively. Interval estimation uses sample data to determine an interval of potential
values for an unknown population parameter.
The 100(1 - alpha)% confidence interval for μ is given by: Mean +- Margin of Error
Continuous Variables
Item Weight
Item Visibility
Item MRP
Interpretation: Here, the point and confidence interval estimates are the same for the manual and
Excel calculations.
Categorical Variables
Item Fat
Item Fat Content Count of Item Fat Content Sample Proportion
Low Fat 218 0.639296188
Regular 123 0.360703812
Grand Total 341 1
Margin of Error 0.050968905
Upper Level 0.690265093
Lower Level 0.588327282
Point Estimator = Sample Proportion = 0.6393
Item Type
Others 6 0.017595308
Seafood 3 0.008797654
Snack Foods 39 0.114369501
Soft Drinks 18 0.052785924
Starchy Foods 3 0.008797654
Grand Total 341 1
Margin of Error 0.038455519
Upper Level 0.193880739
Lower Level 0.116969701
Outlet Type
Outlet Type    Count of Outlet Type    Sample Proportion
Grocery Store 39 0.114369501
Supermarket Type1 235 0.68914956
Supermarket Type2 36 0.105571848
Supermarket Type3 31 0.090909091
Grand Total 341 1
Margin of Error 0.049125996
Upper Level 0.738275556
Lower Level 0.640023564
Interval = p bar +- Z alpha/2 * sqrt(p bar * (1 - p bar) / n)
Alpha = Significance level = 0.05
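The interval formula above can be sketched in Python using the Low Fat count from the Item Fat table (218 of 341). This is only an illustration, not the original Excel workbook; the function name `proportion_interval` is hypothetical, and z = 1.96 is assumed for alpha = 0.05.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Large-sample interval for a proportion: p_bar +- z * sqrt(p_bar * (1 - p_bar) / n)."""
    p_bar = successes / n
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return p_bar, margin, p_bar - margin, p_bar + margin

# Low Fat items: 218 out of a sample of 341 (counts taken from the Item Fat table).
p_bar, margin, lower, upper = proportion_interval(218, 341)
print(round(p_bar, 4), round(margin, 4), round(lower, 4), round(upper, 4))
```

Under these assumptions the margin of error and limits agree with the values reported in the Item Fat table.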
Outlet Size
Outlet Size    Count of Outlet_Size    Sample Proportion
High 77 0.225806452
Medium 135 0.395894428
Small 129 0.37829912
Grand Total 341 1
Margin of Error 0.051906889
Upper Level 0.447801317
Lower Level 0.34398754
Interpretation: Here, the point and confidence interval estimates are the same for the manual and
Excel calculations.
In statistics, the process of hypothesis testing involves putting an analyst's presumption about a
population parameter to the test. The type of data used and the purpose of the study will determine the
methodology the analyst uses.
Using sample data, hypothesis testing is done to determine whether a claim is plausible. These data
could originate from a broader population or a process that creates data. In the descriptions that
follow, "population" will be used to refer to both situations.
Practical Question    Statistical Question    Test    Statistical Decision    Practical Solution
Interpretation
Since the test statistic (1.7898) is greater than t 0.05, 340 (1.65), we reject H0 at the 5% level of
significance. This means that the average Item Outlet Sales is greater than 2000, with a 5% chance of
error in judgement.
Item MRP
Item MRP
Mean 141.8061314
Variance 3653.72487
Observations 341
Hypothesized Mean Difference 0
df 341
t Stat 43.32157904
P(T<=t) one-tail 8.9134E-141
t Critical one-tail 1.649347611
P(T<=t) two-tail 1.7827E-140
t Critical two-tail 1.966965734
Interpretation
Since the test statistic (-2.5032) is less than -t 0.05, 340 (-1.65), we reject H0 at the 5% level of
significance. This means that the average Item MRP is less than 150, with a 5% chance of error in
judgement.
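The test statistic quoted above can be reproduced from the summary values in the Item MRP table. The sketch below assumes a hypothesized mean of 150 (taken from the interpretation) and uses the one-tail critical value as reported in the table; the function name `one_sample_t` is hypothetical.

```python
import math

def one_sample_t(mean, variance, n, mu0):
    """t statistic for a one-sample test of the mean, from summary statistics."""
    standard_error = math.sqrt(variance / n)
    return (mean - mu0) / standard_error

# Mean, variance and sample size as reported for Item MRP; 150 is the hypothesized mean.
t_stat = one_sample_t(mean=141.8061314, variance=3653.72487, n=341, mu0=150)
t_critical = 1.649347611  # one-tail critical value as reported in the table

reject_h0 = t_stat < -t_critical  # lower-tailed test: reject when t falls below -t critical
print(round(t_stat, 4), reject_h0)
```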
Item Weight
Item Weight
Mean 11.0304
Variance 38.5341
Observations 341
Hypothesized Mean Difference 0
df 339
t Stat 32.7647
P(T<=t) one-tail 2E-107
t Critical one-tail 1.64936
P(T<=t) two-tail 4E-107
t Critical two-tail 1.96699
Step 1: Development of null hypothesis
H0: Average Item Weight is greater than or equal to 15 (u >= 15)
Interpretation
Since the test statistic (-11.7914) is less than -t 0.05, 340 (-1.65), we reject H0 at the 5% level of
significance. This means that the average Item Weight is less than 15, with a 5% chance of error in
judgement.
Item Visibility
Item Visibility
Mean 0.0627
Variance 0.0023
Observations 341
Hypothesized Mean Difference 0
df 340
t Stat 24.203
P(T<=t) one-tail 3E-76
t Critical one-tail 1.6493
P(T<=t) two-tail 6E-76
t Critical two-tail 1.967
Interpretation
Since the test statistic (1.0304) is less than t 0.05, 340 (1.65), we fail to reject H0 at the 5% level of
significance. This means that there is insufficient evidence to conclude that the average Item
Visibility is greater than 0.06.
Item Type
Item Type    Count of Item Type    Sample Proportion
Baking Goods 26 0.076246334
Breads 13 0.038123167
Breakfast 3 0.008797654
Canned 31 0.090909091
Dairy 20 0.058651026
Frozen Foods 32 0.093841642
Fruits and Vegetables 53 0.15542522
Hard Drinks 10 0.029325513
Health and Hygiene 28 0.082111437
Household 43 0.126099707
Meat 13 0.038123167
Others 6 0.017595308
Seafood 3 0.008797654
Snack Foods 39 0.114369501
Soft Drinks 18 0.052785924
Starchy Foods 3 0.008797654
Grand Total 341 1
Margin of Error 0.038455519
Upper Level 0.193880739
Lower Level 0.116969701
Interpretation
Since the test statistic (-2.05782) is less than -t 0.05, 340 (-1.65), we reject H0 at the 5% level of
significance. This means that the proportion of Fruits & Vegetables items is less than 0.2, with a
5% chance of error in judgement.
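The test statistic for the Fruits and Vegetables proportion can be checked with a short Python sketch. It assumes the usual large-sample form, with the standard error computed under the hypothesized proportion 0.2; the function name `one_sample_proportion_z` is hypothetical, and the lower-tail p-value from the standard normal distribution is an addition for illustration.

```python
import math
from statistics import NormalDist

def one_sample_proportion_z(successes, n, p0):
    """Test statistic for a one-sample proportion test against the hypothesized value p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # computed under H0
    return (p_hat - p0) / standard_error

# 53 of the 341 sampled items are Fruits and Vegetables; hypothesized proportion is 0.2.
z_stat = one_sample_proportion_z(53, 341, 0.2)
p_value = NormalDist().cdf(z_stat)  # lower-tail probability under the standard normal

print(round(z_stat, 4), p_value < 0.05)
```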
Item Fat Content
Item Fat Content    Count of Item Fat Content    Sample Proportion
Low Fat 218 0.639296188
Regular 123 0.360703812
Grand Total 341 1
Margin of Error 0.050968905
Upper Level 0.690265093
Lower Level 0.588327282
Interpretation
Since the test statistic (1.4812) is less than t 0.05, 340 (1.65), we fail to reject H0 at the 5% level of
significance. From the hypothesis testing, it can be concluded that there is insufficient evidence that
the proportion of Low-Fat products is greater than 0.6.
Step 6: Critical Value for one tailed test is 1.65
Interpretation
Since the test statistic (-4.4947) is less than -t 0.05, 340 (-1.65), we reject H0 at the 5% level of
significance. From the hypothesis testing, it can be concluded that the proportion of Tier 3 Outlet
Locations is less than 0.5, with a 5% chance of error in judgement.
Outlet Type
Outlet Type    Count of Outlet Type    Sample Proportion
Grocery Store 39 0.114369501
Supermarket Type1 235 0.68914956
Supermarket Type2 36 0.105571848
Supermarket Type3 31 0.090909091
Grand Total 341 1
Margin of Error 0
Upper Level 0.68914956
Lower Level 0.68914956
Interpretation
Since the test statistic (10.8992) is greater than t 0.05, 340 (1.65), we reject H0 at the 5% level of
significance. From the hypothesis testing, it can be concluded that the proportion of Supermarket
Type 1 outlets is greater than 0.4, with a 5% chance of error in judgement.
Outlet Size
Outlet Size    Count of Outlet_Size    Sample Proportion
High 77 0.225806452
Medium 135 0.395894428
Small 129 0.37829912
Grand Total 341 1
Margin of Error 0
Upper Level 0.395894428
Lower Level 0.395894428
Interpretation
Since the test statistic (-1.2159) is greater than -t 0.05, 340 (-1.65), we fail to reject H0 at the 5%
level of significance. From the hypothesis testing, there is insufficient evidence to conclude that the
proportion of Medium Outlet Sizes is less than 0.5.
Step 2: Development of alternate hypothesis
Ha: Average sales of Low-Fat food items is greater than Regular Fat food items (u1 > u2)
Interpretation
Here, p-value (0.0321) < alpha (0.05), H0 is rejected. From the hypothesis testing, it can be concluded
that the average sales of Low-Fat food items is greater than Regular Fat food items with a 5% chance
of error in judgement.
                                Supermarket Type 1 Sales    Other Outlet Types Sales
Mean                            2264.904274                 1914.332028
Variance                        1945409.208                 3959511.932
Observations                    235                         106
Hypothesized Mean Difference    0
df 153
t Stat 1.641125209
P(T<=t) one-tail 0.051412758
t Critical one-tail 1.654873847
P(T<=t) two-tail 0.102825515
t Critical two-tail 1.975590315
Interpretation
Here, p-value (0.0514) > alpha (0.05), H0 cannot be rejected. The sample does not provide sufficient
evidence, at the 5% level of significance, against the null hypothesis that the average sales in
Supermarket Type 1 are greater than or equal to those in other Outlet Types.
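The t Stat and df in the table above are consistent with a two-sample t test that does not assume equal variances. The following sketch, which assumes that unequal-variance (Welch) form, recomputes both values from the reported means, variances, and group sizes; the function name `welch_t` is hypothetical.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic and approximate degrees of freedom, unequal variances assumed."""
    a = var1 / n1
    b = var2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_stat, df

# Supermarket Type 1 sales versus sales of all other outlet types (values from the table).
t_stat, df = welch_t(2264.904274, 1945409.208, 235,
                     1914.332028, 3959511.932, 106)
print(round(t_stat, 4), round(df))
```

Because the statistic falls just below the reported one-tail critical value (1.654873847), the decision not to reject H0 follows from either the critical-value or the p-value comparison.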
Tier 1 Outlets Sales v/s Other Outlet Location Sales
Interpretation
Here, p-value (0.1578) > alpha (0.05), H0 cannot be rejected. The sample does not provide sufficient
evidence, at the 5% level of significance, against the null hypothesis that the average sales in Tier 1
Outlets are less than or equal to those in other Outlet Location Types.
5) Anova
ANOVA is a statistical method used to determine whether the means of two or more groups differ
from one another significantly. ANOVA compares the means of various samples to examine the
influence of one or more factors.
Anova: Single Factor
SUMMARY
Groups                     Count    Sum         Average     Variance
Small Outlet Size Sales    129      248017.8    1922.619    1993521
Medium Outlet Sales        135      332941.9    2466.237    3379642
High Outlet Sales          77       154211.9    2002.752    1974761
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 21827861 2 10913930 4.298803 0.014335 3.022441
Within Groups 8.58E+08 338 2538830
Hypothesis
H0: The average sales across different outlet sizes are the same (u1 = u2 = u3)
Ha: The average sales of at least one outlet size category differ from the others
Interpretation
As p-value (0.0143) < 0.05, H0 is rejected. This means that the average sales of at least one
outlet size category differ from the others.
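The F value for Outlet Size can be rebuilt from the SUMMARY rows alone. The sketch below assumes the standard one-way ANOVA decomposition and uses the rounded group counts, averages, and variances from the table, so the result matches the reported F only to a few decimal places; the function name `one_way_anova` is hypothetical.

```python
def one_way_anova(groups):
    """One-way ANOVA F statistic from (count, mean, variance) summaries of each group."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * v for n, _, v in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# (Count, Average, Variance) for Small, Medium and High outlet sizes, as in the SUMMARY table.
outlet_size_groups = [
    (129, 1922.619, 1993521),
    (135, 2466.237, 3379642),
    (77, 2002.752, 1974761),
]
f_stat, df_between, df_within = one_way_anova(outlet_size_groups)
f_critical = 3.022441  # F crit as reported in the ANOVA table

print(round(f_stat, 3), df_between, df_within, f_stat > f_critical)
```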
SUMMARY
Groups                          Count    Sum         Average     Variance
Tier 1 Outlet Location Sales    92       184965.9    2010.499    2655332
Tier 2 Outlet Location Sales    120      269192.3    2243.269    1583974
Tier 3 Outlet Location Sales    129      281013.5    2178.4      3491390
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 2926317 2 1463159 0.563892 0.569525 3.022441
Within Groups 8.77E+08 338 2594752
Hypothesis
H0: The average sales across different outlet location types are the same (u1 = u2 = u3)
Ha: The average sales of at least one outlet location type differ from the others
Interpretation
As F calculated (0.5639) < F critical value (3.0224), and p-value (0.5695) > 0.05, H0 cannot be
rejected. This means that there is no significant difference in average sales across outlet location types.
Outlet Type & Sales
SUMMARY
Groups                      Count    Sum         Average     Variance
Grocery Store Sales         39       15611.01    400.2824    75349.21
Supermarket Type 1 Sales    274      547863.5    1999.502    2103973
Supermarket Type 2 Sales    36       75367.23    2093.534    2385182
Supermarket Type 3 Sales    31       111941      3610.999    4986918
ANOVA
Source of Variation    SS         df     MS          F           P-value     F crit
Between Groups         1.8E+08    3      60117003    27.89456    2.62E-16    2.628646
Within Groups          8.1E+08    376    2155151
Hypothesis
H0: The average sales across different outlet types are the same (u1 = u2 = u3 = u4)
Ha: The average sales of at least one outlet type differ from the others
Interpretation
As F calculated (27.8946) > F critical value (2.6286), H0 is rejected. This means that the
average sales of at least one outlet type differ from the others.
6) Chi-Square Test
The chi-square test is conducted to check the association between two categorical variables.
Count of Item_Fat_Content, by Outlet Type and Item Fat Content
Outlet Type          Low Fat    Regular    Grand Total
Grocery Store        20         19         39
Supermarket Type1    154        81         235
Supermarket Type2    24         12         36
Supermarket Type3    20         11         31
Grand Total          218        123        341
Expected Values
Count of Item_Fat_Content, by Outlet Type and Item Fat Content
Outlet Type          Low Fat        Regular    Grand Total
Grocery Store        24.93255132    14.0674    39
Supermarket Type1    150.2346041    84.7654    235
Supermarket Type2    23.01466276    12.9853    36
Supermarket Type3    19.81818182    11.1818    31
Grand Total          218            123        341
Hypothesis
H0: There is no association between Outlet Type and Item Fat Content
Ha: There is an association between Outlet Type and Item Fat Content
alpha = 0.05
p - value = 0.3782
Interpretation
As p-value (0.3782) > alpha (0.05), H0 cannot be rejected. This means that there is no significant
evidence of an association between Outlet Type and Item Fat Content.
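The expected counts and p-value for Outlet Type versus Item Fat Content can be reproduced from the observed table. The sketch below uses only the Python standard library; `chi_square_statistic` and `chi_square_sf_df3` are hypothetical helper names, and the closed-form tail probability is valid only for 3 degrees of freedom, which is what a 4 x 2 table gives.

```python
import math

def chi_square_statistic(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # row total * column total / n
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

def chi_square_sf_df3(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Rows: Grocery Store, Supermarket Type1, Type2, Type3; columns: Low Fat, Regular.
observed = [[20, 19], [154, 81], [24, 12], [20, 11]]
stat, df = chi_square_statistic(observed)
p_value = chi_square_sf_df3(stat)  # valid here because df == 3

print(round(stat, 4), df, round(p_value, 4))
```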
Expected Values
Hypothesis
H0: There is no association between Outlet Location Type and Outlet Size
Ha: There is an association between Outlet Location Type and Outlet Size
alpha = 0.05
p - value = 1.3 x 10^-17
Interpretation
As p-value (1.3 x 10^-17) < alpha (0.05), H0 is rejected. This means that there is an association
between Outlet Location Type and Outlet Size.
Expected Values
Count of Item_Fat_Content, by Item Type and Item Fat Content
Item Type                Low Fat        Regular     Grand Total
Canned                   15.2           15.8        31
Frozen Foods             15.69032258    16.30968    32
Fruits and Vegetables    25.98709677    27.0129     53
Snack Foods              19.12258065    19.87742    39
Grand Total              76             79          155
Hypothesis
H0: There is no association between Item Type and Item Fat Content
Ha: There is an association between Item Type and Item Fat Content
alpha = 0.05
p - value = 0.0727
Interpretation
As p-value (0.0727) > alpha (0.05), H0 cannot be rejected. This means that there is no significant
evidence of an association between Item Type and Item Fat Content.
7) Correlation
Correlation is a statistical measure of the strength and direction of the linear relationship between
two variables.
Item Visibility and Item Outlet Sales
Step 1: Variables
X: Item Visibility
Y: Item Outlet Sales
Correlation: -0.1244
Interpretation
Both the variables are negatively correlated. Since the magnitude of the correlation value (0.1244) is
between 0.0 and 0.2, there is very weak to negligible correlation between Item Visibility and Item
Outlet Sales. This means that Item Outlet Sales tends to decrease only slightly as Item Visibility increases.
Step 1: Variables
X: Item MRP
Y: Item Outlet Sales
Correlation = 0.5514
Interpretation
Both the variables are positively correlated. Since the correlation value (0.5514) is between 0.5 and
0.7, there is moderate correlation between Item MRP and Item Outlet Sales. This means that as Item
MRP increases, Item Outlet Sales will also increase.
Objective: To check the relationship between the values of Item Weight and Item Outlet Sales.
Step 1: Variables
X: Item Weight
Y: Item Outlet Sales
Correlation: 0.0024
Interpretation: Both the variables are positively correlated. Since the correlation value (0.0024) is
between 0.0 and 0.2, there is very weak to negligible correlation between Item Weight and Item
Outlet Sales.
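The strength bands used in these interpretations can be written as a small Python helper. This is only a sketch: the bands above 0.7, between 0.5 and 0.7, and between 0.0 and 0.2 come from the report, whereas the "weak" label for the 0.2 to 0.5 band is an assumption, and the function name `correlation_strength` is hypothetical. The classification is applied to the magnitude of r, with the sign giving the direction.

```python
def correlation_strength(r):
    """Label the magnitude of a correlation coefficient using the bands in this report."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "weak"  # assumption: this band is not defined in the report
    return "very weak to negligible"

# Correlations of each variable with Item Outlet Sales, as reported above.
reported = [("Item Visibility", -0.1244), ("Item MRP", 0.5514), ("Item Weight", 0.0024)]
for name, r in reported:
    direction = "negative" if r < 0 else "positive"
    print(name, direction, correlation_strength(r))
```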
[Figure: scatter plot of Item_Outlet_Sales against Item Weight, with fitted trendline y = 148.02x]
Interpretation:
0 is the baseline Item Outlet Sales when the Item Weight is zero.
Significant F- Value
Regression Statistics
Multiple R           0.124399
R Square             0.015475
Adjusted R Square    0.012571
Standard Error       1598.612
Observations         341
Multiple R: 0.124399
Interpretation: As the value of Multiple R (0.1244) is below 0.2, it indicates a very weak correlation
between Item Outlet Sales and Item Visibility.
R Square: 0.015475
Interpretation: It indicates that 1.55% of the variance in Item Outlet Sales can be explained by Item
Visibility.
ANOVA
              df     SS          MS          F           Significance F
Regression    1      13617423    13617423    5.328547    0.02158017
Residual      339    8.66E+08    2555560
Total         340    8.8E+08
Hypothesis
H0: Item Outlet Sales does not have a linear relationship with Item Visibility
Ha: Item Outlet Sales has a linear relationship with Item Visibility
Interpretation
As, p-value (0.0216) < alpha (0.05), H0 is rejected. This means that Item Outlet Sales has a linear
relationship with Item Visibility, at the 5% level of significance.
Co-efficient Table
                   Coefficients    Standard Error    t Stat      P-value     Lower 95%     Upper 95%    Lower 95.0%    Upper 95.0%
Intercept          2418.225        142.8489          16.92855    5.06E-47    2137.24352    2699.2       2137.2         2699
Item Visibility    -4185.49        1813.186          -2.30836    0.02158     -7752.0074    -619         -7752          -619
Regression Equation
Y = (-4185.49*X) + 2418.225
Net Sales = (-4185.49*Item Visibility) + 2418.225
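As an illustration, the fitted equation can be used for prediction and cross-checked against the coefficient table. In the sketch below, the visibility value 0.10 is a hypothetical input, the standard error of the slope is read from the coefficient table, and the names `predict_sales`, `t_slope`, and `f_from_t` are hypothetical. In simple linear regression the F statistic equals the square of the slope's t statistic, which gives a quick consistency check.

```python
INTERCEPT = 2418.225        # intercept from the coefficient table
SLOPE = -4185.49            # Item Visibility coefficient
SLOPE_STD_ERROR = 1813.186  # standard error of the slope, read from the coefficient table

def predict_sales(item_visibility):
    """Predicted sales from the fitted line Y = (-4185.49 * X) + 2418.225."""
    return SLOPE * item_visibility + INTERCEPT

# t statistic of the slope, and the F statistic implied by it (F = t^2 with one predictor).
t_slope = SLOPE / SLOPE_STD_ERROR
f_from_t = t_slope ** 2

print(round(predict_sales(0.10), 3))  # 0.10 is a hypothetical visibility value
print(round(t_slope, 4), round(f_from_t, 3))
```

Both derived values agree with the t Stat in the coefficient table and the F in the ANOVA table to the precision shown there.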
Intercept Hypothesis
Interpretation
Since the p-value corresponding to the Intercept (5.06E-47) < alpha (0.05), H0 is rejected. This means
that the intercept is significantly different from zero, i.e., the regression line does not pass through
the origin. Outliers in the data may also influence the intercept estimate.
Assumption Checking
Linearity
[Figure: scatter plot of Item Outlet Sales (y-axis) against Item Visibility (x-axis), with fitted line y = -4185.5x + 2418.2]
Interpretation: The points are more or less along the straight line. This implies that the relationship
between Item Visibility and Item Outlet Sales is linear in the parameters m and c.
Normality
[Figure: normal probability plot; x-axis: Sample Percentile]
Interpretation: Normal Probability Plot indicates that all the points are along the 45-degree line. This
implies that the error in estimating Net Sales, i.e., ε, and hence Net Sales, follows a normal
distribution.
Homoscedasticity
[Figure: Item_Visibility Residual Plot; Residuals (y-axis) against Item_Visibility (x-axis)]
Interpretation: The points in the Item Visibility Residual plot are randomly placed without any
pattern. This implies that the error in estimating Net sales, i.e., ε, has a constant variance across all the
values of Item Visibility.
Independence
Interpretation: The errors, across all the values of Item Visibility, are independent of each other.
Regression Statistics
Multiple R           0.551436
R Square             0.304081
Adjusted R Square    0.302028
Standard Error       1344.03
Observations         341
Multiple R (0.551436)
Interpretation: As the value of Multiple R is greater than 0.5 but less than 0.7, it indicates a moderate
correlation between Item Outlet Sales and Item MRP.
R Square (0.304081)
Interpretation: It indicates that 30.4% of the variance in Item Outlet Sales can be explained by Item
MRP.
ANOVA
              df     SS             MS       F        Significance F
Regression    1      267577090.9    3E+08    148.1    1.60628E-28
Residual      339    612375309.3    2E+06
Total         340    879952400.2
Hypothesis
H0: Item Outlet Sales does not have a linear relationship with Item MRP
Ha: Item Outlet Sales has a linear relationship with Item MRP
Interpretation: As the p-value (1.60628E-28) < alpha (0.05), H0 is rejected. This means that Item
Outlet Sales has a linear relationship with Item MRP, at the 5% level of significance.
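The R Square, Multiple R, and F values for the Item MRP model all follow from the sums of squares in the ANOVA table. A minimal Python sketch, assuming the standard definitions R Square = SS Regression / SS Total and F = MS Regression / MS Residual (variable names are illustrative):

```python
# Sums of squares and degrees of freedom from the ANOVA table for the Item MRP regression.
ss_regression = 267577090.9
ss_residual = 612375309.3
df_regression = 1
df_residual = 339

ss_total = ss_regression + ss_residual
r_square = ss_regression / ss_total        # share of variance explained by Item MRP
multiple_r = r_square ** 0.5               # equals |correlation| with a single predictor
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(round(r_square, 6), round(multiple_r, 6), round(f_stat, 1))
```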
Co-efficient Table
             Coefficients    Standard Error    t Stat    P-value    Lower 95%       Upper 95%    Lower 95.0%    Upper 95.0%
Intercept    74.737          185.8453074       0.4021    0.688      -290.8182075    440.2922     -290.818       440.2922
Item_MRP     14.67632        1.205873101       12.171    2E-28      12.30438097     17.04825     12.30438       17.04825
Regression Equation
Y = 14.67632*X + 74.737
Net Sales = 14.67632*Item MRP + 74.737
Intercept Hypothesis
Interpretation
Since the p-value corresponding to the Intercept (0.688) > alpha (0.05), H0 cannot be rejected. This
means that the intercept is not significantly different from zero, i.e., a regression line passing
through the origin cannot be ruled out.
Assumption Checking
Linearity
[Figure: scatter plot of Item Outlet Sales (y-axis) against Item MRP (x-axis), with fitted line y = 14.676x + 74.737]
Interpretation: The points are more or less along the straight line. This implies that the relationship
between Item MRP and Item Outlet Sales is linear in the parameters m and c.
Normality
[Figure: normal probability plot; x-axis: Sample Percentile]
Interpretation: Normal Probability Plot indicates that all the points are along the 45-degree line. This
implies that the error in estimating Net Sales, i.e., ε, and hence Net Sales, follows a normal
distribution.
Homoscedasticity
[Figure: Item_MRP residual plot; residuals against Item_MRP]
Interpretation: The points in the Item MRP Residual plot are randomly placed without any
pattern. This implies that the error in estimating Net Sales, i.e., ε, has a constant variance across all the
values of Item MRP.
Independence
Interpretation: The errors, across all the values of Item MRP, are independent of each other.
CONCLUSION
The following factors could be considered by outlets in the retail domain to accelerate their sales:
Item Fat Content: The average sales of Low-Fat food items are greater than Regular Fat food items.
To accelerate the sales of Regular Fat food items, companies should focus on marketing the feel, taste
and other aspects of these products.
Outlet Type: The average sales in Supermarket Type 1 are greater than or equal to those in other Outlet
Types (Grocery, Supermarket Type 2 & 3). To boost sales in outlet types other than Supermarket
Type 1, the outlets should adopt competitive pricing strategies.
Outlet Location Type: The average sales in Tier 1 Outlet Locations are less than or equal to those in
other Outlet Location Types. To accelerate sales in these outlets, they should position themselves as
easy-to-access outlets where quality products are available at affordable rates. They could also adopt
home delivery services and provide free delivery to customers who hold a membership with the
outlet.