Analyzing Walmart Sales
It is important to thoroughly understand data before trying to solve any problem; a problem can rarely be solved without first gaining a thorough understanding of the data behind it. The data provided covers 45 Walmart stores located in various regions, and each store has various departments. The task in this assignment is to predict the department-wide sales for each store. In addition, Walmart runs various promotional markdown events throughout the year. These markdowns tend to precede prominent holidays such as the Super Bowl, Thanksgiving, Christmas, and Labor Day. The weeks preceding and including such holidays are given a weight five times higher than the weight given to weeks that do not fall within these holidays. One of the main challenges presented by the competition is modelling the effect markdowns have on these holiday weeks when complete historical data is not available.
Performance Metrics
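The five-times weighting of holiday weeks described above corresponds to a weighted mean absolute error; a minimal sketch of how such a metric can be computed (the array names are illustrative):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted MAE: holiday weeks weigh 5, all other weeks weigh 1."""
    w = np.where(is_holiday, 5.0, 1.0)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

# A holiday-week error counts five times as much as a normal-week error
print(wmae(np.array([10.0, 20.0]), np.array([12.0, 19.0]), np.array([True, False])))
```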
Problem Statement
The task in this assignment is to predict the sales from the given data set. This is a regression problem, so regression models have to be used to predict sales from the data set.
Analysis of Data
The analysis of the data is based on the perspective that sales can respond to time and
space factors and that the sales records are an aggregation of each department’s factors. The
day variable can be used to extract important information about sales. The first step in the data analysis process was to examine the unique values of store and type.
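This check can be done with pandas; a toy stand-in for the stores file (the real one lists 45 stores of types A, B, and C):

```python
import pandas as pd

# Toy stand-in for the stores file; column names follow the data set
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4],
    "Type": ["A", "B", "A", "C"],
})
print(stores["Store"].nunique())        # number of distinct stores
print(sorted(stores["Type"].unique()))  # the distinct store types
```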
From the data, there are 45 different stores, each belonging to one of three types. The next step was to examine how the data is distributed.
The groupby function was used to examine the behaviour of the data, including its mean, count, minimum and maximum sales values, standard deviation, and percentile values. The aim was to summarize how sales vary across store types.
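A sketch of this groupby summary on toy data:

```python
import pandas as pd

# Toy weekly-sales rows; the real data has one row per store, dept and week
sales = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C"],
    "Weekly_Sales": [200.0, 300.0, 150.0, 250.0, 100.0],
})
# describe() gives count, mean, std, min, quartiles and max per store type
summary = sales.groupby("Type")["Weekly_Sales"].describe()
print(summary[["count", "mean", "min", "max"]])
```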
The next step was to find the proportions of the data in each type and present them in a
pie-chart.
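The proportions feeding the pie chart can be computed with value_counts; a sketch with made-up counts chosen to reproduce the reported shares:

```python
import pandas as pd

# Made-up type counts chosen to match the shares in the pie chart
sales = pd.DataFrame({"Type": ["A"] * 489 + ["B"] * 378 + ["C"] * 133})
shares = sales["Type"].value_counts(normalize=True) * 100
print(shares.round(1))  # these percentages are what the pie chart displays
```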
The pie chart shows that 48.9% of the sales records belong to type A stores, 37.8% to type B, and 13.3% to type C. A box plot was used to understand the sizes of the type A, B, and C stores.
The data shows that type A stores are the largest while type C stores are the smallest. It is easy to tell the store types apart by size, as there is no overlap between the three types.
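The quartiles the box plot draws can also be computed directly; a sketch with made-up sizes that mimic the non-overlap:

```python
import pandas as pd

# Made-up store sizes; type A largest, type C smallest, no overlap
stores = pd.DataFrame({
    "Type": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "Size": [200000, 210000, 190000, 120000, 125000, 115000, 40000, 42000, 38000],
})
# The quartiles drawn by the box plot, per store type
q = stores.groupby("Type")["Size"].quantile([0.25, 0.5, 0.75]).unstack()
print(q)
```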
The next step was to visualize the weekly sales for types A, B, and C.
It is clear that there is an overlap in this data: the median for type A is the highest, while the median for type C is the lowest. The next step was to try to make sense of the training data. The training and stores data were merged and used to develop a scatter plot of store size against weekly sales. No useful information was obtained from that plot, so the next step was to construct a scatter plot that also distinguishes the store types, to understand the data in terms of store type, size, and weekly sales.
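The merge behind these plots can be sketched as follows (column names follow the data set; the values are made up):

```python
import pandas as pd

# Made-up training rows and store attributes
train = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Weekly_Sales": [1000.0, 1100.0, 400.0, 500.0],
})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "C"], "Size": [150000, 50000]})
# Merge on Store so every sales row carries its store's type and size,
# which is what the size-versus-sales scatter plots are drawn from
merged = train.merge(stores, on="Store", how="left")
print(merged)
```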
The next step was to find out whether there is a relationship between weekly sales, store, and IsHoliday.
The next step was to determine whether there is any relationship between department, holiday, and sales. Unlike the store-holiday relation, the department-holiday relation is not uniform: department 72 shows the highest surge in sales during holidays, while others do not, and in some departments non-holiday sales are even higher. That means the character of the products a department sells shapes its holiday behaviour.
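This department-level comparison can be made with a pivot table; a sketch on made-up rows in which department 72 surges on holidays and department 5 does not:

```python
import pandas as pd

# Made-up rows: department 72 surges on holidays, department 5 does not
train = pd.DataFrame({
    "Dept": [72, 72, 5, 5],
    "IsHoliday": [True, False, True, False],
    "Weekly_Sales": [900.0, 300.0, 200.0, 250.0],
})
# Mean sales per department, split by the holiday flag
pivot = train.pivot_table(index="Dept", columns="IsHoliday",
                          values="Weekly_Sales", aggfunc="mean")
print(pivot)
```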
Next, I checked whether the weekly sales data follows a Gaussian distribution, because if it does, other properties of the data can be inferred from the distribution. On the probability plot, the values in Weekly_Sales do not follow a straight line, so we can conclude that weekly sales are not Gaussian.
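Besides a probability plot, a quick skewness check makes the same point; a sketch on synthetic data (lognormal values stand in for the right-skewed weekly sales):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal values stand in for right-skewed weekly sales
sales = pd.Series(rng.lognormal(mean=8, sigma=1, size=10000))
gauss = pd.Series(rng.normal(size=10000))
print(round(sales.skew(), 2))  # strongly positive: far from Gaussian
print(round(gauss.skew(), 2))  # near zero: consistent with Gaussian
```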
Next, I found out the range of values in weekly sales.
The next step was to divide the departments into four equal parts and examine the sales behaviour of each part. Similarly, the mean values by store and date were broken into slots of five and checked for a relation with sales.
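Such equal-frequency binning can be done with pd.qcut; a sketch that splits made-up department means into four slots:

```python
import pandas as pd

# Made-up per-department mean sales
dept_mean = pd.Series([100.0, 400.0, 250.0, 900.0, 50.0, 600.0, 300.0, 700.0])
# qcut assigns each department to one of four equal-frequency bins
bins4 = pd.qcut(dept_mean, 4, labels=[0, 1, 2, 3])
print(bins4.tolist())  # → [0, 2, 1, 3, 0, 2, 1, 3]
```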
As I could not find any strong relation among these columns for sales prediction, I decided to build a dummy model just to check what the worst-case performance looks like.
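A minimal dummy baseline simply predicts the training mean everywhere; its error gives the worst-case reference point:

```python
import numpy as np

y_train = np.array([100.0, 200.0, 300.0])
y_test = np.array([150.0, 250.0])
# The dummy model ignores all features and predicts the training mean
pred = np.full_like(y_test, y_train.mean())
mae = np.mean(np.abs(y_test - pred))
print(pred, mae)  # → [200. 200.] 50.0
```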
Solution
The first step I took was to apply some common featurizations to the date, since the date is directly connected to sales. The core of featurization is domain knowledge, and featurization is one of the most important steps toward modelling because the accuracy of the models depends heavily on the features chosen. I decided to take a subset of all the features and see how the model performs.
def multi_holiday(data):
    # Builds holiday-indicator features for the current week and for the
    # weeks one and two weeks before/after each date. Assumes a helper
    # get_holiday(date_string) returning one flag per tracked holiday.
    # The date-shift lines and the joins of the flag columns onto each
    # copy were lost in extraction and are reconstructed as assumptions.
    dataset = data.copy()

    # Flags for the week itself
    dates = pd.to_datetime(dataset['Date']).dt.strftime('%Y-%m-%d').values
    holiday = pd.DataFrame([get_holiday(str(d)) for d in dates])
    holiday.columns = ['super_bowl', 'labor', 'thank', 'chris', 'columbus',
                       'veterans', 'independence', 'memorial', 'washington']
    normal_holiday = pd.concat([dataset.copy(), holiday], axis=1)

    # Flags for one week before / after
    dates = (pd.to_datetime(dataset['Date'])
             - pd.Timedelta(weeks=1)).dt.strftime('%Y-%m-%d').values
    holiday_pre1 = pd.DataFrame([get_holiday(str(d)) for d in dates])
    dates = (pd.to_datetime(dataset['Date'])
             + pd.Timedelta(weeks=1)).dt.strftime('%Y-%m-%d').values
    holiday_next1 = pd.DataFrame([get_holiday(str(d)) for d in dates])
    holiday_pre1.columns = ['super_bowl_p1', 'labor_p1', 'thank_p1', 'chris_p1',
                            'columbus_p1', 'veterans_p1', 'independence_p1',
                            'memorial_p1', 'washington_p1']
    holiday_next1.columns = ['super_bowl_n1', 'labor_n1', 'thank_n1', 'chris_n1',
                             'columbus_n1', 'veterans_n1', 'independence_n1',
                             'memorial_n1', 'washington_n1']
    PreNext1week = pd.concat([dataset.copy(), holiday_pre1, holiday_next1], axis=1)

    # Flags for two weeks before / after
    dates = (pd.to_datetime(dataset['Date'])
             - pd.Timedelta(weeks=2)).dt.strftime('%Y-%m-%d').values
    holiday_pre2 = pd.DataFrame([get_holiday(str(d)) for d in dates])
    dates = (pd.to_datetime(dataset['Date'])
             + pd.Timedelta(weeks=2)).dt.strftime('%Y-%m-%d').values
    holiday_next2 = pd.DataFrame([get_holiday(str(d)) for d in dates])
    holiday_pre2.columns = ['super_bowl_p2', 'labor_p2', 'thank_p2', 'chris_p2',
                            'columbus_p2', 'veterans_p2', 'independence_p2',
                            'memorial_p2', 'washington_p2']
    holiday_next2.columns = ['super_bowl_n2', 'labor_n2', 'thank_n2', 'chris_n2',
                             'columbus_n2', 'veterans_n2', 'independence_n2',
                             'memorial_n2', 'washington_n2']
    PreNext2week = pd.concat([dataset.copy(), holiday_pre2, holiday_next2], axis=1)

    return normal_holiday, PreNext1week, PreNext2week