
Analyzing Walmart’s Sales


Description of Data

It is important to thoroughly understand data before trying to solve any problem; one cannot solve a problem without first gaining a thorough understanding of the data behind it. The data provided covers 45 Walmart stores located in various regions, each with multiple departments. The task in this assignment is to predict the department-wide sales for each store. In addition, Walmart runs various promotional markdown events throughout the year. These markdowns tend to precede prominent holidays such as the Super Bowl, Thanksgiving, Christmas, and Labor Day. The weeks preceding and including these holidays are given a weight five times higher than the weight given to non-holiday weeks. One of the main challenges presented by the competition is modelling the effect markdowns have on these holiday weeks when complete historical data is not available.

Performance Metrics

Problem Statement
The task in this assignment is to predict weekly sales from the given data set. This is a
regression problem, so regression models have to be used to predict sales from the
data set.
Analysis of Data
The analysis of the data is based on the perspective that sales respond to time and
space factors and that the sales records are an aggregation of each department's factors. The
day variable can be used to extract important information about sales. The first step in the
data analysis process was to examine the unique values of store and type.

From the data, there are 45 different stores. The next step was to find how the data behaves with respect to store type.
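The check above can be sketched with pandas on a hypothetical miniature of the stores table (column names `Store`, `Type`, and `Size` follow the Kaggle data set; the values here are made up):

```python
import pandas as pd

# Hypothetical miniature of the stores table (the real one has 45 rows).
stores = pd.DataFrame({
    "Store": [1, 2, 3, 4, 5],
    "Type": ["A", "A", "B", "C", "B"],
    "Size": [151315, 202307, 37392, 39690, 34875],
})

n_stores = stores["Store"].nunique()           # number of distinct stores
store_types = sorted(stores["Type"].unique())  # distinct store types
```

On the full data set, `nunique()` would return 45 and `unique()` the three types A, B, and C.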

The groupby function was used to examine the behavior of the data, including its mean, count, minimum and maximum sales values, standard deviation, and percentile values, in order to get an idea of how the data behaves.
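A minimal sketch of that summary, assuming a `Type`/`Weekly_Sales` layout as in the merged data (the numbers are illustrative only): `describe()` on a groupby yields count, mean, std, min, max, and the percentiles in one call.

```python
import pandas as pd

# Hypothetical weekly-sales sample; the real analysis groups by store type.
sales = pd.DataFrame({
    "Type": ["A", "A", "B", "B", "C", "C"],
    "Weekly_Sales": [24000.0, 26000.0, 15000.0, 17000.0, 9000.0, 11000.0],
})

# count, mean, std, min, percentiles, and max per type in a single table.
summary = sales.groupby("Type")["Weekly_Sales"].describe()
```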

The next step was to find the proportion of the data in each type and present it in a pie chart.
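The shares behind such a pie chart come from `value_counts(normalize=True)`; a sketch with a hypothetical type column:

```python
import pandas as pd

# Hypothetical type labels for 45 stores; normalize=True gives fractions.
types = pd.Series(["A"] * 22 + ["B"] * 17 + ["C"] * 6)
shares = types.value_counts(normalize=True)
# shares.plot.pie(autopct="%1.1f%%") would draw the chart itself.
```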


The pie chart shows that 48.9% of the sales belong to type A stores, 37.8% to type B, and 13.3% to type C. A box plot was used to understand the size distribution of stores of types A, B, and C.

The data shows that the largest stores are of type A while the smallest are of type C. It is easy to predict the size of a store because there is no overlapping size information between the three types. The next step was to visualize the weekly sales for types A, B, and C.

It is clear that there is an overlap in the data. The median for type A is the highest while
the median for type C is the lowest. The next step was to try to make sense of the training
data. The training and stores data were merged and used to develop a scatter plot of size against
weekly sales in the train data.
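The merge step can be sketched as follows, assuming the Kaggle column names `Store`, `Dept`, `Weekly_Sales`, `Type`, and `Size` (the values are made up); joining on `Store` attaches type and size to every sales row before plotting:

```python
import pandas as pd

# Hypothetical slices of the train and stores tables.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Dept": [1, 2, 1],
    "Weekly_Sales": [24924.5, 50605.3, 13740.1],
})
stores = pd.DataFrame({
    "Store": [1, 2],
    "Type": ["A", "B"],
    "Size": [151315, 37392],
})

# Left join keeps every sales row and attaches its store's Type and Size.
merged = pd.merge(train, stores, on="Store", how="left")
```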



No useful information was obtained from the plot. The next step was to construct a scatter plot

between weekly sales and store size.

No useful information was gained from the plots above.

The next step was to try to understand the data in terms of store type and weekly sales in the merged training data.

The next step was to find out whether there is a relationship between weekly sales, store, and IsHoliday. After that, the question was whether there is any relationship between department, IsHoliday, and weekly sales.
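The tabular counterpart of those holiday comparisons is a groupby on the `IsHoliday` flag; a sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows for a single store; mean sales with and without holidays.
df = pd.DataFrame({
    "Store": [1, 1, 1, 1],
    "IsHoliday": [False, True, False, True],
    "Weekly_Sales": [20000.0, 26000.0, 21000.0, 27000.0],
})

# Mean weekly sales split by the holiday flag.
holiday_means = df.groupby("IsHoliday")["Weekly_Sales"].mean()
```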



Unlike the store–holiday relation, the department–holiday relation is not consistent. Department 72 shows the highest surge in sales during holidays, but others do not, and in some departments non-holiday sales are even higher. This means the character of the product (department) gives it a different relationship with sales.

The next step was to find out whether weekly sales follow a Gaussian distribution, because if they do, some other properties of weekly sales can also be predicted.



The weekly sales values do not follow a straight line on the probability plot, so we can conclude that weekly sales do not have a Gaussian distribution.
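One quick numerical companion to the probability plot is the sample skewness: a Gaussian variable has skewness near zero, while heavy-tailed sales data is strongly right-skewed. A sketch using a synthetic lognormal sample as a stand-in for weekly sales:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales": a lognormal sample stands in for weekly sales.
rng = np.random.default_rng(0)
sales = pd.Series(np.exp(rng.normal(8.0, 1.0, size=5000)))

# Strongly positive skew is one sign the data is not Gaussian; the probability
# plot makes the same point visually.
skew = sales.skew()
```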

The next step was to find out the range of values in weekly sales. The logarithm of the actual values was used to get an insight into the data.



As can be seen, the weekly sales range from 0 to 12 on the log scale.
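The transform can be sketched with `np.log1p`, which handles zero sales safely while compressing the heavy right tail (the sample values below are made up):

```python
import numpy as np
import pandas as pd

# Synthetic non-negative sales values; log1p maps the tail into a compact range.
sales = pd.Series([0.0, 10.0, 1500.0, 25000.0, 160000.0])
log_sales = np.log1p(sales)
lo, hi = log_sales.min(), log_sales.max()
```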

The next step was to divide the departments into four unique parts and find the mean sales of each part.
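A sketch of that split using `pd.qcut`, which cuts the departments into four equal-sized parts (the per-department means here are made up):

```python
import pandas as pd

# Hypothetical per-department mean sales for eight departments.
dept_sales = pd.DataFrame({
    "Dept": list(range(1, 9)),
    "Mean_Sales": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 8000.0],
})

# qcut assigns each department to one of four equal-sized parts.
dept_sales["part"] = pd.qcut(dept_sales["Dept"], q=4, labels=[1, 2, 3, 4])
part_means = dept_sales.groupby("part", observed=True)["Mean_Sales"].mean()
```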



The next step was to compute mean values by store and date, break them into slots of five, and then plot the mean sales over time.



As I could not find any strong relationship among these columns for sales prediction, I
decided to make a dummy model just to check what the worst conditions can be like.
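Such a dummy baseline can be sketched as a mean predictor scored with the competition's weighted MAE, where holiday weeks weigh 5 and other weeks weigh 1, as described earlier (the sales numbers below are made up):

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    # Weighted MAE: holiday weeks get weight 5, non-holiday weeks weight 1.
    w = np.where(is_holiday, 5.0, 1.0)
    return np.sum(w * np.abs(y_true - y_pred)) / np.sum(w)

# Hypothetical weekly sales and holiday flags.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
is_holiday = np.array([False, False, True, False])

# Dummy model: predict the global mean for every week.
y_pred = np.full_like(y_true, y_true.mean())

baseline_error = wmae(y_true, y_pred, is_holiday)
```

Any real model should at least beat this baseline error.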

Solution

The first step I took was to apply some common featurizations on the date, since the date is directly connected to sales. The core of featurization is domain knowledge. Featurization is one of the most important steps toward modelling because the accuracy of the models depends heavily on the featurizations that we choose. I decided to take a subset of all the features and see how the model performs.

import datetime

import pandas as pd

# get_holiday is a helper assumed to be defined elsewhere in the notebook; it
# maps a 'YYYY-MM-DD' string to a list of nine holiday indicator flags.
def multi_holiday(data):
    dataset = data.copy()

    # Holiday flags for the week itself.
    holiday = []
    dates = data['Date'].apply(lambda x: x.strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday.append(get_holiday(str(dates[i])))
    holiday = pd.DataFrame(holiday)
    holiday.columns = ['super_bowl', 'labor', 'thank', 'chris', 'columbus',
                       'veterans', 'independence', 'memorial', 'washington']
    dataset = pd.merge(dataset, holiday, left_index=True, right_index=True)
    normal_holiday = dataset.copy()

    # Holiday flags shifted one week back and one week forward.
    holiday_pre1 = []
    dates = data['Date'].apply(
        lambda x: (x - datetime.timedelta(days=7)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_pre1.append(get_holiday(str(dates[i])))
    holiday_next1 = []
    dates = data['Date'].apply(
        lambda x: (x + datetime.timedelta(days=7)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_next1.append(get_holiday(str(dates[i])))
    holiday_pre1 = pd.DataFrame(holiday_pre1)
    holiday_next1 = pd.DataFrame(holiday_next1)
    holiday_pre1.columns = ['super_bowl_p1', 'labor_p1', 'thank_p1', 'chris_p1',
                            'columbus_p1', 'veterans_p1', 'independence_p1',
                            'memorial_p1', 'washington_p1']
    holiday_next1.columns = ['super_bowl_n1', 'labor_n1', 'thank_n1', 'chris_n1',
                             'columbus_n1', 'veterans_n1', 'independence_n1',
                             'memorial_n1', 'washington_n1']
    dataset = pd.merge(dataset, holiday_pre1, left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_next1, left_index=True, right_index=True)
    PreNext1week = dataset.copy()

    # Holiday flags shifted two weeks back and two weeks forward.
    holiday_pre2 = []
    dates = data['Date'].apply(
        lambda x: (x - datetime.timedelta(days=14)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_pre2.append(get_holiday(str(dates[i])))
    holiday_next2 = []
    dates = data['Date'].apply(
        lambda x: (x + datetime.timedelta(days=14)).strftime('%Y-%m-%d')).values
    for i in range(len(dates)):
        holiday_next2.append(get_holiday(str(dates[i])))
    holiday_pre2 = pd.DataFrame(holiday_pre2)
    holiday_next2 = pd.DataFrame(holiday_next2)
    holiday_pre2.columns = ['super_bowl_p2', 'labor_p2', 'thank_p2', 'chris_p2',
                            'columbus_p2', 'veterans_p2', 'independence_p2',
                            'memorial_p2', 'washington_p2']
    holiday_next2.columns = ['super_bowl_n2', 'labor_n2', 'thank_n2', 'chris_n2',
                             'columbus_n2', 'veterans_n2', 'independence_n2',
                             'memorial_n2', 'washington_n2']
    dataset = pd.merge(dataset, holiday_pre2, left_index=True, right_index=True)
    dataset = pd.merge(dataset, holiday_next2, left_index=True, right_index=True)
    PreNext2week = dataset.copy()

    return normal_holiday, PreNext1week, PreNext2week