
Analyzing Walmart’s Sales


Description of Data

It is important to thoroughly understand data before trying to solve any problem; one cannot solve a problem without first gaining a thorough understanding of the data behind it. The data provided covers 45 Walmart stores located in various regions, each with multiple departments. The task in this assignment is to predict the department-wide sales for each store. In addition, Walmart runs various promotional markdown events throughout the year. These markdowns tend to precede prominent holidays such as the Super Bowl, Thanksgiving, Christmas, and Labor Day. The weeks preceding and including these holidays are given a weight five times higher than the weight given to non-holiday weeks. One of the main challenges presented by the competition is modelling the effect markdowns have on these holiday weeks when complete historical data is not available.

Performance Metrics

Problem Statement
The task in this assignment is to predict weekly sales from the given data set. This is a
regression problem, so regression models have to be used to predict sales from the
data set.
Analysis of Data
The analysis of the data is based on the perspective that sales respond to time and
space factors and that the sales records are an aggregation of each department's factors. The
day variable can be used to extract important information about sales. The first step in the
data analysis process was to examine the unique values of store and type.

From the data, there are 45 different stores. The next step was to find how the data behaves with respect to store type.
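The check above can be sketched with pandas on a hypothetical miniature of the stores table (column names `Store`, `Type`, and `Size` follow the Kaggle data set; the values here are made up):

```python
import pandas as pd

# Hypothetical miniature of the stores table (the real one has 45 rows).
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Type": ["A", "A", "B", "C", "B"],
    "Size": [151315, 202307, 37392, 39690, 34875],
})

n_stores = stores["Store"].nunique()           # number of distinct stores
store_types = sorted(stores["Type"].unique())  # distinct store types
```

On the full data set, `nunique()` would return 45 and `unique()` the three types A, B, and C.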

The groupby function was used to examine the behavior of the data, including its mean, count, minimum and maximum sales values, standard deviation, and percentile values, in order to get an idea of how the data behaves.
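A minimal sketch of that summary, assuming a `Type`/`Weekly_Sales` layout as in the merged data (the numbers are illustrative only): `describe()` on a groupby yields count, mean, std, min, max, and the percentiles in one call.

```python
import pandas as pd

# Hypothetical weekly-sales sample; the real analysis groups by store type.
sales = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Weekly_Sales": [24000.0, 26000.0, 15000.0, 17000.0, 9000.0, 11000.0],
})

# count, mean, std, min, percentiles, and max per type in a single table.
summary = sales.groupby("Type")["Weekly_Sales"].describe()
```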

The next step was to find the proportion of the data in each type and present it in a pie chart.
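The shares behind such a pie chart come from `value_counts(normalize=True)`; a sketch with a hypothetical type column:

```python
import pandas as pd

# Hypothetical type labels for 45 stores; normalize=True gives fractions.
types = pd.Series(["A"] * 22 + ["B"] * 17 + ["C"] * 6)
shares = types.value_counts(normalize=True)
# shares.plot.pie(autopct="%1.1f%%") would draw the chart itself.
```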


The pie chart shows that 48.9% of the sales belong to type A stores, 37.8% to type B, and 13.3% to type C. A box plot was used to understand the size distribution of stores of types A, B, and C.

The data shows that the largest stores are of type A while the smallest are of type C. It is easy to predict the size of a store because there is no overlapping size information between the three types. The next step was to visualize the weekly sales for types A, B, and C.

It is clear that there is an overlap in the data. The median for type A is the highest while
the median for type C is the lowest. The next step was to try to make sense of the training
data. The training and stores data were merged and used to develop a scatter plot of size against
weekly sales in the train data.
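The merge step can be sketched as follows, assuming the Kaggle column names `Store`, `Dept`, `Weekly_Sales`, `Type`, and `Size` (the values are made up); joining on `Store` attaches type and size to every sales row before plotting:

```python
import pandas as pd

# Hypothetical slices of the train and stores tables.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Weekly_Sales": [24924.5, 50605.3, 13740.1],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "B"],
    "Size": [151315, 37392],
})

# Left join keeps every sales row and attaches its store's Type and Size.
merged = pd.merge(train, stores, on="Store", how="left")
```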



No useful information was obtained from the plot. The next step was to construct a scatter plot

between weekly sales and store size.

No useful information was gained from the plots above.

The next step was to try to understand the data in terms of store type and weekly sales in the merged training data.

The next step was to find out whether there is a relationship between weekly sales, store, and IsHoliday. After that, the question was whether there is any relationship between department, IsHoliday, and weekly sales.
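The tabular counterpart of those holiday comparisons is a groupby on the `IsHoliday` flag; a sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows for a single store; mean sales with and without holidays.
df = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "IsHoliday": [False, True, False, True],
    "Weekly_Sales": [20000.0, 26000.0, 21000.0, 27000.0],
})

# Mean weekly sales split by the holiday flag.
holiday_means = df.groupby("IsHoliday")["Weekly_Sales"].mean()
```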



Unlike the store–holiday relation, the department–holiday relation is not consistent. Department 72 shows the highest surge in sales during holidays, but others do not, and in some departments non-holiday sales are even higher. This means the character of the product (department) gives it a different relationship with sales.

The next step was to find out whether weekly sales follow a Gaussian distribution, because if they do, some other properties of weekly sales can also be predicted.



The weekly sales values do not follow a straight line on the probability plot, so we can conclude that weekly sales do not have a Gaussian distribution.
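One quick numerical companion to the probability plot is the sample skewness: a Gaussian variable has skewness near zero, while heavy-tailed sales data is strongly right-skewed. A sketch using a synthetic lognormal sample as a stand-in for weekly sales:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales": a lognormal sample stands in for weekly sales.
rng = np.random.default_rng(0)
sales = pd.Series(np.exp(rng.normal(8.0, 1.0, size=5000)))

# Strongly positive skew is one sign the data is not Gaussian; the probability
# plot makes the same point visually.
skew = sales.skew()
```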

The next step was to find out the range of values in weekly sales. The logarithm of the actual values was used to get an insight into the data.



As can be seen, the weekly sales range from 0 to 12 on the log scale.
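The transform can be sketched with `np.log1p`, which handles zero sales safely while compressing the heavy right tail (the sample values below are made up):

```python
import numpy as np
import pandas as pd

# Synthetic non-negative sales values; log1p maps the tail into a compact range.
sales = pd.Series([0.0, 10.0, 1500.0, 25000.0, 160000.0])
log_sales = np.log1p(sales)
lo, hi = log_sales.min(), log_sales.max()
```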

The next step was to divide the departments into four unique parts and find the mean sales of each part.
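A sketch of that split using `pd.qcut`, which cuts the departments into four equal-sized parts (the per-department means here are made up):

```python
import pandas as pd

# Hypothetical per-department mean sales for eight departments.
dept_sales = pd.DataFrame({
    "Dept": list(range(1, 9)),
    "Mean_Sales": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 8000.0],
})

# qcut assigns each department to one of four equal-sized parts.
dept_sales["part"] = pd.qcut(dept_sales["Dept"], q=4, labels=[1, 2, 3, 4])
part_means = dept_sales.groupby("part", observed=True)["Mean_Sales"].mean()
```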



The next step was to compute mean values by store and date, break them into slots of five, and then plot the mean sales over time.



As I could not find any strong relationship among these columns for sales prediction, I
decided to make a dummy model just to check what the worst conditions can be like.
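Such a dummy baseline can be sketched as a mean predictor scored with the competition's weighted MAE, where holiday weeks weigh 5 and other weeks weigh 1, as described earlier (the sales numbers below are made up):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    # Weighted MAE: holiday weeks get weight 5, non-holiday weeks weight 1.
    w = np.where(is_holiday, 5.0, 1.0)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

# Hypothetical weekly sales and holiday flags.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
is_holiday = np.array([False, False, True, False])

# Dummy model: predict the global mean for every week.
y_pred = np.full_like(y_true, y_true.mean())

baseline_error = wmae(y_true, y_pred, is_holiday)
```

Any real model should at least beat this baseline error.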

Solution

The first step I took was to apply some common featurizations on the date, since the date is directly connected to sales. The core of featurization is domain knowledge. Featurization is one of the most important steps toward modelling because the accuracy of the models depends heavily on the featurizations that we choose. I decided to take a subset of all the features and see how the model performs.

import datetime

import pandas as pd

# get_holiday is a helper assumed to be defined elsewhere in the notebook; it
# maps a 'YYYY-MM-DD' string to a list of nine holiday indicator flags.
def multi_holiday(data):
    dataset = data.copy()

    # Holiday flags for the week itself.
    holiday = []
    dates = data['Date'].apply(lambda x: x.strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday.append(get_holiday(str(dates[i])))
    holiday = pd.DataFrame(holiday)
    holiday.columns = ['super_bowl', 'labor', 'thank', 'chris', 'columbus',
                       'veterans', 'independence', 'memorial', 'washington']
    dataset = pd.merge(dataset, holiday, left_index=True, right_index=True)
    normal_holiday = dataset.copy()

    # Holiday flags shifted one week back and one week forward.
    holiday_pre1 = []
    dates = data['Date'].apply(
        lambda x: (x - datetime.timedelta(days=7)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_pre1.append(get_holiday(str(dates[i])))
    holiday_next1 = []
    dates = data['Date'].apply(
        lambda x: (x + datetime.timedelta(days=7)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_next1.append(get_holiday(str(dates[i])))
    holiday_pre1 = pd.DataFrame(holiday_pre1)
    holiday_next1 = pd.DataFrame(holiday_next1)
    holiday_pre1.columns = ['super_bowl_p1', 'labor_p1', 'thank_p1', 'chris_p1',
                            'columbus_p1', 'veterans_p1', 'independence_p1',
                            'memorial_p1', 'washington_p1']
    holiday_next1.columns = ['super_bowl_n1', 'labor_n1', 'thank_n1', 'chris_n1',
                             'columbus_n1', 'veterans_n1', 'independence_n1',
                             'memorial_n1', 'washington_n1']
    dataset = pd.merge(dataset, holiday_pre1, left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_next1, left_index=True, right_index=True)
    PreNext1week = dataset.copy()

    # Holiday flags shifted two weeks back and two weeks forward.
    holiday_pre2 = []
    dates = data['Date'].apply(
        lambda x: (x - datetime.timedelta(days=14)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_pre2.append(get_holiday(str(dates[i])))
    holiday_next2 = []
    dates = data['Date'].apply(
        lambda x: (x + datetime.timedelta(days=14)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_next2.append(get_holiday(str(dates[i])))
    holiday_pre2 = pd.DataFrame(holiday_pre2)
    holiday_next2 = pd.DataFrame(holiday_next2)
    holiday_pre2.columns = ['super_bowl_p2', 'labor_p2', 'thank_p2', 'chris_p2',
                            'columbus_p2', 'veterans_p2', 'independence_p2',
                            'memorial_p2', 'washington_p2']
    holiday_next2.columns = ['super_bowl_n2', 'labor_n2', 'thank_n2', 'chris_n2',
                             'columbus_n2', 'veterans_n2', 'independence_n2',
                             'memorial_n2', 'washington_n2']
    dataset = pd.merge(dataset, holiday_pre2, left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_next2, left_index=True, right_index=True)
    PreNext2week = dataset.copy()

    return normal_holiday, PreNext1week, PreNext2week