P426853 Statistic Report

1.0 Introduction
This report examines the relationship between customers’ yearly average spending on clothing
commodities from the new cosmetic brand and their collected behavioral data. The data is
limited to customers who hold a membership with the brand, so the results may be biased
toward loyal customers if the report is applied to the general customer base. Given this limited
data set, the report is best applied only to predicting the behavior of member customers.
1.1 Research Questions
a. Build linear regression models to predict yearly amount spent with average time in-store,
time on the app, and time on the website.
b. Of the three variables modeled, which has the most significant impact on yearly amount
spent (if there is any difference)?
c. Is there a significant difference in the yearly amount spent by males compared with
females?
2.0 Experimental Design and Limitations of dataset
This part of the report introduces the variables (independent, dependent, and possible lurking
variables), the outliers and possible measurement errors, and the experimental methodology.
The limitations of the provided data set are:
1. The data set is small: it contains 61 male members and 60 female members.
2. The total population of users with a membership is not provided.
3. If the task is “investigating customer behavior” to find out “how to improve the website
and the app”, the provided data set is insufficient. (In that case, gathering user
feedback on each feature of the app or website might serve the end goal better.)
2.1 Variables
Independent Variables:
Average time in-store / Time on App / Time on Website
Dependent Variable:
Yearly amount spent
Possible Lurking Variables:
Lurking variables are uncollected data that may affect the dependent variable and the
analysis. These may include:
Customer’s income level / Location where each data point was collected / Age of members / Occupations
                     Avg. time   Time on   Time on   Length of    Yearly
                     in store    app       website   membership   amount spent
Mean                 33.05       12.10     37.02     3.46         497.73
Median               33.03       12.01     37.06     3.47         493.18
Mode                 32.84       12.73     36.95     3.28         NULL (identical data point doesn’t exist)
Range                5.00        5.24      5.19      5.04         459.49
Variance             0.90        0.95      1.17      1.03         6045.16
Standard deviation   0.95        0.97      1.08      1.01         77.75
Table1 Descriptive Statistic Summary
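The statistics in Table1 could be computed along the following lines. This is a minimal sketch rather than the report's actual workflow: the helper name `describe` and the sample values are hypothetical, and it assumes the sample (n - 1) variance, which the report does not state.

```python
import statistics
from collections import Counter


def describe(values):
    """Return the summary statistics listed in Table1 for one column."""
    # A mode is only reported when some value occurs more than once,
    # mirroring the NULL entry for yearly amount spent.
    value, count = Counter(values).most_common(1)[0]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": value if count > 1 else None,
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # assumption: sample variance
        "stdev": statistics.stdev(values),        # square root of the variance
    }


# Hypothetical values, not taken from the report's data set.
summary = describe([2.0, 4.0, 4.0, 5.0, 7.0])
```

The standard deviation row of Table1 is the square root of the variance row, which is how the two rows can be cross-checked.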

2.1 Graphical Data summary
Figure1 Descriptive Boxplot two gender combined

Figure2 Boxplot of respective two gender
[Histogram: Avg time in store. Bins: [30.74, 31.41], (31.41, 32.08], (32.08, 32.75], (32.75, 33.42], (33.42, 34.09], (34.09, 34.76], (34.76, 35.43], (35.43, 36.1]]
Figure3 Histogram of Avg time in store
[Histogram: Time on app. Bins: [9.48, 10.17], (10.17, 10.86], (10.86, 11.55], (11.55, 12.24], (12.24, 12.93], (12.93, 13.62], (13.62, 14.31], (14.31, 15]]
Figure4 Histogram of Time on app

[Histogram: Time on website. Bins: [34.48, 35.24], (35.24, 36], (36, 36.76], (36.76, 37.52], (37.52, 38.28], (38.28, 39.04], (39.04, 39.8]]
Figure5 Histogram of Time on website
[Histogram: Length of membership. Bins: [1.08, 1.8], (1.8, 2.52], (2.52, 3.24], (3.24, 3.96], (3.96, 4.68], (4.68, 5.4], (5.4, 6.12]]
Figure6 Histogram of Length of membership

[Histogram: Yearly amount spent]
Figure3 Histogram of Yearly amount spent
2.2 Outliers and possible errors in measurement

From Figure1 and Table1, the standard deviations of the numerical variables show relatively
tight distributions, especially for time on the app and time on the website. The outliers
visible for each variable may nevertheless affect the predictions of the linear regression
models.
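Outliers of the kind drawn on a box plot can be flagged numerically. The sketch below assumes the common 1.5 × IQR (Tukey) rule; the report does not state which rule produced its box plots, and both the function name `tukey_outliers` and the sample values are hypothetical.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # Assumption: quartiles use the "inclusive" method; other conventions
    # move the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical values, not taken from the report's data set.
flagged = tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Refitting the regression with and without the flagged points is one way to see how much they influence the fitted line.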
Possible sources of measurement error include sampling bias and self-reported bias.
Sampling bias occurs when the sampled data does not statistically represent the population.
However, because the variance of the larger population is unknown, this hypothesis cannot be
tested here.
2.3 Experimental Design and Rationale
This section describes how the relationship between each independent variable and the single
dependent variable is analyzed. To make full use of the data set, the analysis was run on the
combined data and then separately for each gender to look for further patterns.
First, the relationship between each independent variable and the dependent variable is
visualized. The Pearson correlation coefficient (r) gives the strength and direction of the
linear relationship between the selected independent variable and the dependent variable.
Second, the coefficient of determination (R-square) from each linear regression model is used
to judge how well that independent variable explains the yearly amount spent. This part of the
linear regression analysis answers the first two research questions.
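The first two steps can be sketched as follows. This is an illustrative implementation of simple least squares regression, not the tool used for the report; the function name `fit_line` and the sample data are hypothetical.

```python
import math


def fit_line(x, y):
    """Fit y = slope * x + intercept and return (slope, intercept, r, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    # With a single predictor, R-square equals r squared.
    return slope, intercept, r, r * r


# Hypothetical values, not taken from the report's data set.
slope, intercept, r, r_squared = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Because R-square equals r squared in a one-variable model, comparing the three predictors by R-square is equivalent to comparing them by the absolute value of r.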
Third, to find out whether the yearly amount spent differs between males and females, a t-test
is conducted with the null hypothesis “the yearly amount spent by males is not significantly
different from that of females.” The t-test was chosen over the z-test, despite a sample size
greater than 30, because the population variances are unknown and it is uncertain how
accurately the sample data represents the population.
3.0 Results
Linear Regression Models (charts):
Figure4 Linear regression of Avg time in store (male)
Least squares regression line: ŷ = 43.171x - 917.67
Pearson’s r: r = 0.48
Coefficient of determination: R² = 0.2324
Table2 Summary of linear regression of Avg time in store (male)
Figure5 Linear regression of Avg time in store (female)
Least squares regression line: ŷ = 31.039x - 537.91
Pearson’s r: r = 0.40
Table3 Summary of linear regression of Avg time in store (female)
Figure6 Linear regression of Avg time in store (combined)
Least squares regression line: ŷ = 34.398x - 639.26
Table4 Summary of linear regression of Avg time in store (combined)
Figure set 1 Yearly amount spent / Average time in store
Figure7 Linear regression of Time on app (male)
Least squares regression line: ŷ = 33.743x + 89.886
Table5 Summary of linear regression of Time on app (male)
Figure8 Linear regression of Time on app (female)
Least squares regression line: ŷ = 25.604x + 186.28
Table6 Summary of linear regression of Time on app (female)
Figure8 Linear regression of Time on app (combined)
Least squares regression line: ŷ = 29.822x + 136.75
Table7 Summary of linear regression of Time on app (combined)
Figure set 2 Yearly amount spent / Time on app
Figure9 Linear regression of Time on website (male)
Least squares regression line: ŷ = -11.105x + 912.54
Pearson’s r: r = -0.16
Table8 Summary of linear regression of Time on website (male)
Figure10 Linear regression of Time on website (female)
Least squares regression line: ŷ = 4.5109x + 326.08
Table9 Summary of linear regression of Time on website (female)
Figure11 Linear regression of Time on website (combined)
Least squares regression line: ŷ = -3.8762x + 641.24
Pearson’s r: r = -0.05
Table10 Summary of linear regression of Time on website (combined)
Residual plots:
[Residual plot: Avg time in store]
Figure12 Residual plot of Avg time in store

[Residual plot: Time on app]
Figure12 Residual plot of Time on app
[Residual plot: Time on website]
Figure12 Residual plot of Time on website

t-Test: Two-Sample Assuming Unequal Variances
                               Variable 1 (male)   Variable 2 (female)
Mean                           502.055082          493.337167
Variance                       6155.21996          5996.7331
Observations                   61                  60
Hypothesized Mean Difference   0
df                             119
t Stat                         0.61514269
P(T<=t) one-tail               0.26981753
t Critical one-tail            1.65775928
P(T<=t) two-tail               0.53963507
t Critical two-tail            1.98009988
Table11 T-test table of yearly spent of males and females
4.0 Discussions and Findings

Box and whisker plot:
In the combined box plots, yearly amount spent shows outliers on both sides of the
distribution, while average time in store and length of membership show outliers only on the
high side.
Linear regression model:
The largest R-square we found is for the regression of yearly amount spent on average time in
store (R² = 0.2324 in the male model, Table2). Based on the regression analysis, there is a
moderate positive relationship between the average time in store and the yearly amount spent
by customers. However, the average time in store explains only a little over 23% of the
variation in the yearly amount spent, which suggests that other factors not accounted for in
this simple regression model also influence the yearly amount spent.
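One quick consistency check on the fitted lines: a least squares line always passes through the point of means, so evaluating each combined model at the mean of its predictor from Table1 should return roughly the mean yearly amount spent (497.73). The sketch below uses the coefficients reported for the combined models; the names `models` and `predict` are illustrative only.

```python
# (slope, intercept, mean of predictor from Table1) for the combined models.
models = {
    "Avg time in store": (34.398, -639.26, 33.05),
    "Time on app": (29.822, 136.75, 12.10),
    "Time on website": (-3.8762, 641.24, 37.02),
}


def predict(slope, intercept, x):
    """Evaluate the fitted line y = slope * x + intercept at x."""
    return slope * x + intercept


# Predicted yearly amount spent at each predictor's mean value.
predictions = {
    name: predict(slope, intercept, mean_x)
    for name, (slope, intercept, mean_x) in models.items()
}
```

All three predictions land within a few tenths of 497.73, with the small gaps attributable to the means in Table1 being rounded to two decimals.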
T-test:
Given that the p-values (both one-tailed and two-tailed) are greater than the common alpha
level of 0.05, there is not enough evidence to reject the null hypothesis of no difference in mean
yearly amount spent between the two genders. This means that, statistically, there isn't a
significant difference in the yearly amount spent between the male group and the female group.
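The t statistic and degrees of freedom in Table11 can be recomputed from the summary rows alone. The sketch below applies the standard unequal-variance (Welch) formulas to the means, variances, and observation counts from Table11; the function name `welch_t` is hypothetical, and the critical value is copied from the table rather than computed.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Return (t, df) for a two-sample t-test assuming unequal variances."""
    se1, se2 = var1 / n1, var2 / n2  # squared standard errors of each mean
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Summary values from Table11 (Variable 1, then Variable 2).
t_stat, df = welch_t(502.055082, 6155.21996, 61, 493.337167, 5996.7331, 60)

t_critical_two_tail = 1.98009988  # taken from Table11
reject_null = abs(t_stat) > t_critical_two_tail
```

The computed statistic falls well below the two-tail critical value, which is the same conclusion the p-values give.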
4.1 Final Suggestions:
From the discussion above, we make the following two suggestions:
1. Investing in increasing the average time member customers spend in store may be the
best way, in this context, to increase the yearly amount spent, since in-store time has
a positive, though only moderate, relationship with it.
2. The t-test shows no statistically significant difference in the yearly amount spent
between males and females. Collecting other information from customers may be a better
way to improve our understanding of customer behavior.
