
1.0 Introduction

This report examines the relationship between customers' yearly average spending
on clothing commodities from the new cosmetic brand and their collected behavioral data. The
data is limited to customers who hold a membership with the brand, so the sample may be
biased toward loyal customers if the findings were applied to the general customer base. Given
this limited data set, the report is best applied only to predicting the behavior of member
customers.
1.1 Research Questions

a. Build linear regression models to predict yearly amount spent with average time in-store,
time on the app, and time on the website.

b. Of the three variables modeled, which has the most significant impact on yearly amount
spent (if there is any difference)?

c. Is there a significant difference in the yearly amount spent by males compared with
females?

2.0 Experimental Design and Limitations of dataset

This part of the report focuses on introducing the factors, including independent and dependent
variables, possible lurking variables, outliers and possible errors, and experimental
methodologies.

The limitations of the provided data set are:

1. The data set contains only 61 male members and 60 female members.
2. The total population of users with a membership is not provided.
3. If the task is “investigating customer behavior” to find out “how to improve the website
and the app”, the provided data set is insufficient. (In that case, gathering user
feedback on each feature of the app or website might serve the end goal better.)

2.1 Variables
Independent Variables:
Average time in-store / Time on App / Time on Website

Dependent Variable:
Yearly amount spent

Possible Lurking Variables:
Lurking variables are uncollected data that may affect the dependent variable and the
analysis. These may include:
Customer’s income level / Location of each data collected / Age of members / Occupations

                    Avg. time   Time on   Time on   Length of    Yearly
                    in store    app       website   membership   amount spent
Mean                33.05       12.10     37.02     3.46         497.73
Median              33.03       12.01     37.06     3.47         493.18
Mode                32.84       12.73     36.95     3.28         NULL (identical data point doesn't exist)
Range               5.00        5.24      5.19      5.04         459.49
Variance            0.90        0.95      1.17      1.03         6045.16
Standard deviation  0.95        0.97      1.08      1.01         77.75

Table1 Descriptive Statistic Summary
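
The statistics in Table1 could be computed along the following lines. This is only a minimal sketch: the function name `describe` and the example values are hypothetical, and it assumes the sample variance (n - 1 denominator), which the report does not state.

```python
import math
from collections import Counter
from statistics import mean, median, variance


def describe(values):
    """Return the summary statistics listed in Table1 for one column."""
    top_value, top_count = Counter(values).most_common(1)[0]
    sample_var = variance(values)  # assumption: sample variance (n - 1)
    return {
        "mean": mean(values),
        "median": median(values),
        # Report no mode when every value occurs only once.
        "mode": top_value if top_count > 1 else None,
        "range": max(values) - min(values),
        "variance": sample_var,
        "std_dev": math.sqrt(sample_var),
    }


# Hypothetical values, not taken from the report's data set.
example = describe([31.2, 33.0, 33.0, 34.1, 35.6])
no_mode = describe([1.0, 2.0, 3.0, 4.0])
```

Returning `None` for the mode mirrors the NULL entry in the table, where no identical data point exists.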


2.1 Graphical Data summary

Figure1 Descriptive boxplot, both genders combined


Figure2 Boxplots for each gender
Figure3 Histogram of Avg time in store

Figure4 Histogram of Time on app


Figure5 Histogram of Time on website (bins from 34.48 to 39.8)

Figure6 Histogram of Length of membership (bins from 1.08 to 6.12)


Figure3 Histogram of Yearly amount spent

2.2 Outliers and possible errors in measurement


From Figure1 and Table1, the standard deviations of each numerical variable show a
relatively tight distribution, especially for time on the app and time on the website. The
outliers visible in each of the box plots may nevertheless affect the predictions of the
linear regression models.
Possible errors in measurement include sampling bias and self-reported bias.
Sampling bias occurs when the sampled data is not statistically representative of the
population. However, because the variance of the larger population is unknown, this
hypothesis cannot be tested.

2.3 Experimental Design and Rationale

In this section, the report discusses the relationship between the independent variables and
the single dependent variable. To make the best use of the data set, the analysis was
conducted on the whole sample and then separately for each gender, to look for further
clues.

First, the relationship between each independent variable and the dependent variable is
visualized. The Pearson correlation coefficient (r) gives the strength and direction of the
linear relationship between each selected independent variable and the dependent variable.
Second, the fit of each model is assessed using the coefficient of determination (R-square)
from the linear regression, which indicates how much of the variation in yearly amount spent
each independent variable explains. This linear regression modeling is used to answer the
first two research questions.
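
The two steps above can be sketched for a single predictor as follows. The function name `fit_line` and the sample values are hypothetical and are not the report's data; the formulas are the standard least-squares and Pearson definitions, and the report does not say which tool produced its figures.

```python
import math


def fit_line(x, y):
    """Least-squares line y = slope * x + intercept, with Pearson's r and R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    # With one predictor, the coefficient of determination equals r squared.
    return slope, intercept, r, r ** 2


# Hypothetical values, not taken from the report's data set.
time_in_store = [31.0, 32.0, 33.0, 34.0, 35.0]
yearly_spent = [420.0, 470.0, 480.0, 530.0, 560.0]
slope, intercept, r, r_squared = fit_line(time_in_store, yearly_spent)
```

Because each model here has a single predictor, the sign of r carries the direction of the relationship while R-square carries the share of variation explained.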

Third, to find out whether yearly amount spent differs between males and females, a t-test
is conducted with the null hypothesis “the yearly amount spent by males is not
significantly different from that of females.” The t-test was selected over the z-test,
despite the sample size being greater than 30, because the population variances are not
known and the extent to which the sample data accurately represents the population is
uncertain.

3.0 Results
Linear Regression Models (charts):

Figure4 Linear regression of Avg time in store (male)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 43.171x - 917.67
Pearson’s r                     r = 0.48
Coefficient of determination    R² = 0.2324
Table2 Summary of linear regression of Avg time in store (male)
Figure5 Linear regression of Avg time in store (female)
Treatment Site                  Summary of results
Least squares regression line   ŷ = 31.039x - 537.91
Pearson’s r                     r = 0.40
Coefficient of determination    R² = 0.1581
Table3 Summary of linear regression of Avg time in store (female)

Figure6 Linear regression of Avg time in store (combined)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 34.398x - 639.26
Pearson’s r                     r = 0.42
Coefficient of determination    R² = 0.1755
Table4 Summary of linear regression of Avg time in store (combined)
Figure set 1 Yearly amount spent / Average time in store

Figure7 Linear regression of Time on app (male)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 33.743x + 89.886
Pearson’s r                     r = 0.42
Coefficient of determination    R² = 0.1735
Table5 Summary of linear regression of Time on app (male)

Figure8 Linear regression of Time on app (female)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 25.604x + 186.28
Pearson’s r                     r = 0.32
Coefficient of determination    R² = 0.1031
Table6 Summary of linear regression of Time on app (female)

Figure8 Linear regression of Time on app (combined)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 29.822x + 136.75
Pearson’s r                     r = 0.37
Coefficient of determination    R² = 0.1391
Table7 Summary of linear regression of Time on app (combined)

Figure set 2 Yearly amount spent / Time on app

Figure9 Linear regression of Time on website (male)


Treatment Site                  Summary of results
Least squares regression line   ŷ = -11.105x + 912.54
Pearson’s r                     r = -0.16
Coefficient of determination    R² = 0.02
Table8 Summary of linear regression of Time on website (male)

Figure10 Linear regression of Time on website (female)

Treatment Site                  Summary of results
Least squares regression line   ŷ = 4.5109x + 326.08
Pearson’s r                     r = 0.06
Coefficient of determination    R² = 0.0038
Table9 Summary of linear regression of Time on website (female)
Figure11 Linear regression of Time on website (combined)

Treatment Site                  Summary of results
Least squares regression line   ŷ = -3.8762x + 641.24
Pearson’s r                     r = -0.05
Coefficient of determination    R² = 0.0029
Table10 Summary of linear regression of Time on website (combined)
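
As a consistency check on the three combined-sample regression lines reported above, a least-squares line passes through the point of means, so evaluating each line at the mean of its predictor from Table1 should land close to the mean yearly amount spent (497.73). The sketch below uses only the rounded figures printed in this report, so small rounding differences are expected; the variable names are illustrative.

```python
# (slope, intercept, mean of predictor) for the combined-sample models,
# copied from the regression summaries and Table1 of this report.
combined_models = {
    "Avg time in store": (34.398, -639.26, 33.05),
    "Time on app": (29.822, 136.75, 12.10),
    "Time on website": (-3.8762, 641.24, 37.02),
}
mean_yearly_spent = 497.73  # Table1

# Evaluate each fitted line at the mean of its own predictor.
predictions = {
    name: slope * mean_x + intercept
    for name, (slope, intercept, mean_x) in combined_models.items()
}

for name, value in predictions.items():
    gap = value - mean_yearly_spent
    print(f"{name}: predicted {value:.2f}, gap to table mean {gap:+.2f}")
```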

Residual plots:

Figure12 Residual plot of Avg time in store


Figure12 Residual plot of Time on app

Figure12 Residual plot of Time on website


t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1    Variable 2
Mean                           502.055082    493.337167
Variance                       6155.21996    5996.7331
Observations                   61            60
Hypothesized Mean Difference   0
df                             119
t Stat                         0.61514269
P(T<=t) one-tail               0.26981753
t Critical one-tail            1.65775928
P(T<=t) two-tail               0.53963507
t Critical two-tail            1.98009988

Table11 T-test table of yearly spent of males and females
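
The t statistic and degrees of freedom in Table11 can be reproduced from the summary rows alone. The sketch below uses the standard unequal-variance (Welch) formulas; the function name `welch_t` is hypothetical, and the critical value is copied from the table rather than computed.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's two-sample t statistic and degrees of freedom from summary data."""
    se1 = var1 / n1  # squared standard error of the first mean
    se2 = var2 / n2  # squared standard error of the second mean
    t_stat = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


# Mean, variance and observation count for each group, from Table11.
t_stat, df = welch_t(502.055082, 6155.21996, 61, 493.337167, 5996.7331, 60)

t_critical_two_tail = 1.98009988  # copied from Table11, not computed here
reject_null = abs(t_stat) > t_critical_two_tail
```

Comparing the absolute t statistic with the two-tail critical value gives the same decision as comparing the two-tail p-value with an alpha of 0.05.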

4.0 Discussions and Findings


Box and whisker plot:

From the general box plots, we can observe outliers on both sides of the data for yearly
amount spent, while average time in store and length of membership have outliers on the
high side.
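
Box plots commonly flag as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. The report does not state which rule its charting tool applied, so the sketch below is an assumption; the function name `iqr_outliers` and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles() defaults to the "exclusive" method; other tools may
    # compute quartiles slightly differently.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical values, not taken from the report's data set.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = iqr_outliers(sample)
```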

Linear regression model:

The largest R-square we found is between the average time in store and the yearly amount
spent, in the male group (R² = 0.2324). Based on the regression analysis, there is a
moderate positive relationship between the average time in store and the yearly amount
spent by customers. However, the average time in store explains only a little over 23% of
the variation in the yearly amount spent in that group (about 18% for both genders
combined), suggesting that while there is some association, other factors not accounted for
in this simple regression model also influence the yearly amount spent.

T-test:

Given that the p-values (0.27 one-tailed and 0.54 two-tailed) are greater than the common
alpha level of 0.05, there is not enough evidence to reject the null hypothesis of no
difference in mean yearly amount spent between the two genders. This means that,
statistically, there isn't a significant difference in the yearly amount spent between the
male group and the female group.

4.1 Final Suggestions:

From the discussions above, we can make the following two suggestions:

1. Investing in increasing the average time member customers spend in store may be the
best way, in this context, to increase the yearly amount spent, since the two have a
positive, though modest, relationship.

2. The t-test shows no statistically significant difference in yearly amount spent between
males and females. Focusing on gathering other information when collecting data from
customers may be a better way to understand customer behavior.
