GR6 Epgp16sece
Semester – 1
Section E
Group No - 6
Mukul Shrivastava (459)
Hyndavii Chiramdasu (450)
Anmol (_)
Akash Dayal (_)
Kapil Chaurasia (_)
Table of Contents
Understanding Rossmann data (Descriptive Analytics)
Analysis on Stores & Assortment offered by them
Analysis on Sales trends
Seasonality in sales data
Analysis on Promo
School Holiday impact on Promo
Effect of competition on sales
Inferential statistics
Hypothesis testing
Predicting sales using the available variables (Regression Model)
Appendix 1
Background
The dataset under consideration pertains to Rossmann, a retail chain that operates over 3,000
drug stores across seven European countries. At present, Rossmann store managers are tasked
with predicting their daily sales up to six weeks in advance. These forecasts are used to create
staff schedules.
These sales are influenced by many factors, including promotions, competition, school and state
holidays, seasonality, and locality. With thousands of individual managers predicting sales
based on their unique circumstances, the accuracy of the results varies widely.
Business Outcome:
We are aiming to solve the following business problems for Rossmann via our analysis:
1. Improved Operational Efficiency: By accurately predicting sales, store managers can
create efficient staff schedules, leading to improved operational efficiency.
2. Cost Savings: Optimized staff scheduling will result in cost savings by aligning labor
resources with actual demand, reducing unnecessary labor expenses.
3. Enhanced Customer Satisfaction: Efficient staffing ensures that customers receive the
attention and service they expect, contributing to a positive shopping experience.
4. Consistency Across Stores: Standardized sales prediction models ensure consistency
in forecasting across all 1,115 stores, reducing disparities in accuracy.
5. Strategic Decision-Making: Insights into key drivers of sales allow for informed
decision-making, helping Rossmann adapt its strategies to maximize sales opportunities.
Data
We utilized the Rossmann Store Sales dataset, sourced from Kaggle. The dataset can be
accessed at the following URL: https://www.kaggle.com/c/rossmann-store-sales/data.
The data contains daily sales for 1,115 stores, along with other metrics such as
promotions, school holidays, nearby competitor stores, store type, and assortment type. The
detailed data structure, along with definitions, is given in Appendix 1.
Understanding Rossmann data (Descriptive Analytics)
Rossmann Germany is a $2.1 billion enterprise (2014 data) with 1,115 stores in Germany.
At $2.1 billion a year, Rossmann sells goods worth roughly $5.8 million per day on average ($2.1 billion / 365 days).
Store type a has the highest total sales of all store types, possibly because
it has the largest number of stores.
Store type b has nearly double the average sales per store (Sales / number of stores) of any other store type.
Store type a contributes 54% of sales across all years, followed by store type d,
which contributes ~30%.
This is because 600 of the 1,115 stores are type a and 348 are
type d.
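The share of sales and the average sales per store can be computed along the following lines. This is a minimal sketch on made-up toy data, not the Rossmann figures; the data frame name `sales` is hypothetical, while the column names `Store`, `StoreType` and `Sales` follow Appendix 1.

```r
# Toy data: made-up numbers, not the Rossmann figures.
sales <- data.frame(
  Store     = c(1, 1, 2, 2, 3, 3, 4, 4),
  StoreType = c("a", "a", "a", "a", "b", "b", "d", "d"),
  Sales     = c(5000, 6000, 4000, 5000, 9000, 11000, 5000, 5000)
)

# Total sales and number of distinct stores per store type.
total_sales <- tapply(sales$Sales, sales$StoreType, sum)
n_stores    <- tapply(sales$Store, sales$StoreType,
                      function(s) length(unique(s)))

# Share of overall sales (%) and average sales per store, by store type.
summary_by_type <- data.frame(
  StoreType       = names(total_sales),
  share_pct       = round(100 * as.numeric(total_sales) / sum(total_sales), 1),
  sales_per_store = as.numeric(total_sales) /
                    as.numeric(n_stores[names(total_sales)])
)
print(summary_by_type)
```

In the toy data, types a and b have the same total sales, but type b reaches it with half as many stores, so its sales per store are double. This is the same distinction the text draws between total sales and Sales / number of stores.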
1. There are 3 types of assortment (a, b, c) across the 4 store types (a, b, c,
d).
2. Assortment types a and c each contribute ~49% of sales, while assortment type b is the
smallest contributor at ~1.2%.
This is because assortment type a is offered in 593 stores, followed by c in 513 and b in a meagre 9 stores.
Seasonality in sales data
Sunday consistently had the lowest average sales. On digging into it, we found this was
because only a few stores were open on Sundays.
Monday contributes ~19% of weekly sales, the highest of any day of the
week, followed by Tuesday and Friday.
Analysis on Promo
1. Promo always increases sales, and the impact is higher if the promo falls on a school
holiday.
2. On a school working day, a promo increases sales by 93%, while a promo on a
school holiday increases sales by only 43%.
3. This might be because parents have more time and attention for shopping.
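The promo uplift within each school-holiday group can be computed as the ratio of mean sales on promo days to mean sales on non-promo days. The sketch below uses made-up toy data, not the Rossmann figures; the data frame name `daily` is hypothetical, and in the real data one would probably first drop days on which stores were closed (`Open == 0`), which is an assumption on our part.

```r
# Toy data: made-up numbers, not the Rossmann figures.
daily <- data.frame(
  SchoolHoliday = c(0, 0, 0, 0, 1, 1, 1, 1),
  Promo         = c(0, 0, 1, 1, 0, 0, 1, 1),
  Sales         = c(4000, 6000, 7000, 8000, 5000, 5000, 9000, 11000)
)

# Mean sales for each (SchoolHoliday, Promo) combination.
# Rows are SchoolHoliday (0/1), columns are Promo (0/1).
mean_sales <- tapply(daily$Sales,
                     list(daily$SchoolHoliday, daily$Promo),
                     mean)

# Percentage uplift of promo days over non-promo days,
# computed separately for each school-holiday group.
uplift_pct <- 100 * (mean_sales[, "1"] / mean_sales[, "0"] - 1)
print(round(uplift_pct, 1))
```

Comparing the two entries of `uplift_pct` (named "0" for school working days and "1" for school holidays) shows directly in which group the promo effect is larger.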
Effect of competition on sales
Sales do not appear to depend on the distance to the closest competitor; our stores do not seem to be
affected by competitors.
We also calculated the correlation between Sales and CompetitionDistance: -0.01836759. This indicates
essentially no linear relationship.
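A correlation of this kind can be computed as sketched below. The data are made-up toy values, not the Rossmann figures, and the data frame name `df` is hypothetical. The `use = "complete.obs"` argument is included on the assumption that some stores may have no recorded competitor distance.

```r
# Toy data: made-up numbers, not the Rossmann figures.
df <- data.frame(
  Sales               = c(5200, 6100, 4800, 7000, 5600, 6400),
  CompetitionDistance = c(500, 12000, NA, 300, 8000, 1500)
)

# Pearson correlation, dropping rows where either value is missing.
r <- cor(df$Sales, df$CompetitionDistance, use = "complete.obs")

# cor.test reports a p-value and confidence interval for the same
# coefficient; it also ignores incomplete pairs.
ct <- cor.test(df$Sales, df$CompetitionDistance)

print(r)
print(ct$p.value)
```

A coefficient near zero only rules out a linear relationship; a scatter plot of Sales against CompetitionDistance would be a useful complement.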
Inferential statistics
• Population: all customers who have shopped at Rossmann since the
stores opened
• Parameter to be estimated: Sales
• Sample: sales to customers shopping in 2013, 2014 and 2015
(up to August)
• Appropriate statistics to estimate the parameter: mean and standard deviation
• Mean Sales = $5,774
• Std dev = $3,850
> kurtosis(df_v1$Sales)
[1] 1.778351
> skewness(df_v1$Sales)
[1] 0.6414577
Outlier analysis
Sales:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    3727    5744    5774    7856   41551
Upper outlier threshold = Q3 + 1.5 × IQR = 7856 + 1.5 × (7856 − 3727) = 14,049.5
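  n()  sum(Promo)
19731       13942
Of the 19,731 observations above this threshold, 13,942 (about 71%) fall on promo days. It looks
like customers buy more during promos, which is what pushes these observations into outlier territory.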
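The outlier threshold above follows the usual 1.5 × IQR rule applied to the quartiles in the Sales summary. Worked out step by step:

```latex
\begin{align*}
\text{IQR} &= Q_3 - Q_1 = 7856 - 3727 = 4129 \\
\text{Upper fence} &= Q_3 + 1.5 \times \text{IQR} = 7856 + 6193.5 = 14049.5 \\
\text{Lower fence} &= Q_1 - 1.5 \times \text{IQR} = 3727 - 6193.5 = -2466.5
\end{align*}
```

Since the minimum of Sales is 0, no observation can fall below the lower fence, so only high-sales days can be flagged. If the two counts above are the rows beyond the upper fence and the promo rows among them (our reading of the output), the promo share is 13942 / 19731 ≈ 70.7%.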
Hypothesis testing
In this section our aim was to test some of the popular myths around Rossmann. Here is one
such myth:
In Germany, there is a prevailing sentiment that Rossmann runs excessive promotions. Social
media posts often advise against making non-promotional purchases at
Rossmann, claiming that promotional sales occur at least every other day, if not
more (i.e. more than 15 days in a month are promo days for at least 1 store).
While this hypothesis may be entertaining for influencers, its potential impact on our sales is a
matter of concern for our leadership team. Consequently, we set out to test this
claim.
Across the 31 months in our sample, the average
number of promotional days per month is 12.6, with a standard deviation of 1.58 days. We
tested the claim at a 95% confidence level.
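One way to test the claim with these summary figures is a one-sample t-test against the hypothesised value of 15 promo days per month. The sketch below assumes that the 31 months form the sample (n = 31), that 1.58 is the sample standard deviation, and that monthly counts are roughly independent and normally distributed; none of these are stated explicitly in the report.

```latex
\begin{align*}
SE &= \frac{s}{\sqrt{n}} = \frac{1.58}{\sqrt{31}} \approx 0.284 \\
t &= \frac{\bar{x} - \mu_0}{SE} = \frac{12.6 - 15}{0.284} \approx -8.46 \\
\text{95\% CI} &= \bar{x} \pm t_{0.025,\,30} \cdot SE
  = 12.6 \pm 2.042 \times 0.284 \approx (12.02,\ 13.18)
\end{align*}
```

Under these assumptions the whole interval lies below 15, so the sample does not support the claim that more than 15 days in a month are promo days.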
Probability Analysis
Evaluating the type of distribution of sales
Normal distribution. Test for normality
Predicting sales using the available variables (Regression Model)
The response (predicted) variable in this analysis was Sales. Our objective in this section is to explain how
we used our understanding of the data to arrive at a regression equation that explains
sales.
1. We had about 1,100 stores. Predicting sales for each store separately would
have been a mammoth task (and possibly prone to overfitting), so we grouped the
stores into classes based on Store Type, Store Assortment, Promo, and Date.
2. We converted categorical variables into dummy variables.
Since we were working with 16+ columns, this exercise was not feasible in Excel, so we turned
to R.
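The two preparation steps above, followed by a model fit, might look roughly like this in R. This is a minimal sketch on made-up toy data, not our actual code or the Rossmann figures: the data frame names (`raw`, `classes`, `model_df`) are hypothetical, Assortment is left out for brevity, and only `StoreType` is dummy-coded.

```r
# Toy data: made-up numbers, not the Rossmann figures.
raw <- data.frame(
  Date      = rep(as.Date("2015-01-05") + 0:5, each = 4),
  Store     = rep(1:4, times = 6),
  StoreType = rep(c("a", "a", "b", "d"), times = 6),
  Promo     = rep(c(1, 0, 1, 0, 1, 0), each = 4)
)
# Made-up sales: a per-store base level, a promo lift, a small daily drift.
base <- c(5000, 6000, 9000, 4000)
raw$Sales <- base[raw$Store] + 2000 * raw$Promo +
  100 * as.numeric(raw$Date - min(raw$Date))

# Step 1: collapse stores into classes, one row per (Date, StoreType, Promo).
classes <- aggregate(Sales ~ Date + StoreType + Promo, data = raw, FUN = sum)

# Step 2: expand the categorical column into 0/1 dummy columns.
dummies  <- model.matrix(~ StoreType - 1, data = classes)
model_df <- cbind(classes[, c("Sales", "Promo")], dummies)

# Step 3: fit a linear model. One dummy (StoreTyped) is left out as the
# baseline to avoid perfect collinearity with the intercept.
fit <- lm(Sales ~ Promo + StoreTypea + StoreTypeb, data = model_df)
print(coef(fit))
```

Passing the categorical column to `lm` directly would create the dummies automatically; building them explicitly with `model.matrix` simply mirrors the step described in the text.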
Results
We ran multiple models and, after several iterations, settled on the model below. This
model gave good results not only on the training data but also on the testing data.
Model Equation
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)       -753132.44   10422.32 -72.261  < 2e-16 ***
SchoolHoliday      -28624.27    6261.45  -4.572 4.89e-06 ***
cnt_str              5396.04      38.75 139.239  < 2e-16 ***
StoreType_a         44383.26    7978.88   5.563 2.72e-08 ***
StoreType_b       -118154.38    8117.35 -14.556  < 2e-16 ***
Assortment_a       -48587.40    5574.06  -8.717  < 2e-16 ***
Assortment_b       -30392.83   10785.89  -2.818 0.004843 **
StateHoliday_a     184434.62   18444.14  10.000  < 2e-16 ***
StateHoliday_b     108537.37   30084.90   3.608 0.000310 ***
DayOfWeek_1         84794.78    7956.19  10.658  < 2e-16 ***
DayOfWeek_3        -33899.52    7867.63  -4.309 1.66e-05 ***
DayOfWeek_4        -29509.96    7848.39  -3.760 0.000171 ***
perc_stores_open   861932.52    9972.20  86.434  < 2e-16 ***
perc_stores_promo  174577.90    5645.78  30.922  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 284200 on 11674 degrees of freedom
Multiple R-squared: 0.8261, Adjusted R-squared: 0.8259
F-statistic: 4265 on 13 and 11674 DF, p-value: < 2.2e-16
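Written out as an equation, the coefficient estimates above give the fitted model below. The variable names are taken verbatim from the R output. Reading `cnt_str` as the number of stores in a class, and `perc_stores_open` and `perc_stores_promo` as the shares of stores open and on promo, is our interpretation of the names rather than something the output states.

```latex
\begin{align*}
\widehat{\mathrm{Sales}} ={} & -753132.44
  - 28624.27\,\mathrm{SchoolHoliday}
  + 5396.04\,\mathrm{cnt\_str} \\
 & + 44383.26\,\mathrm{StoreType\_a}
  - 118154.38\,\mathrm{StoreType\_b} \\
 & - 48587.40\,\mathrm{Assortment\_a}
  - 30392.83\,\mathrm{Assortment\_b} \\
 & + 184434.62\,\mathrm{StateHoliday\_a}
  + 108537.37\,\mathrm{StateHoliday\_b} \\
 & + 84794.78\,\mathrm{DayOfWeek\_1}
  - 33899.52\,\mathrm{DayOfWeek\_3}
  - 29509.96\,\mathrm{DayOfWeek\_4} \\
 & + 861932.52\,\mathrm{perc\_stores\_open}
  + 174577.90\,\mathrm{perc\_stores\_promo}
\end{align*}
```

Each dummy coefficient is the expected change in class-level sales relative to the omitted baseline category, holding the other variables fixed.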
Appendix 1: Data fields
• Id - an Id that represents a (Store, Date) duple within the test set
• Store - a unique Id for each store
• Sales - the turnover for any given day [Y variable]
• Customers - the number of customers on a given day
• Open - an indicator for whether the store was open: 0 = closed, 1 = open
• StateHoliday - indicates a state holiday. Normally all stores, with few exceptions, are
closed on state holidays. Note that all schools are closed on public holidays and
weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None
• SchoolHoliday - indicates if the (Store, Date) was affected by the closure of public
schools
• StoreType - differentiates between 4 different store models: a, b, c, d
• Assortment - describes an assortment level: a = basic, b = extra, c = extended
• CompetitionDistance - distance in meters to the nearest competitor store
• CompetitionOpenSince[Month/Year] - gives the approximate year and month of the
time the nearest competitor was opened
• Promo - indicates whether a store is running a promo on that day