P426853 Statistic Report
1.0 Introduction
This report examines the relationship between customers’ yearly average spending on clothing
commodities from the new cosmetic brand and their collected behavioral data. The data is
limited to customers who hold a membership with the brand, so the results may be biased toward
loyal customers if the report were applied to the general customer base. Given this limited
data set, the report is best applied only to predicting the behavior of member customers.
1.1 Research Questions
a. Build linear regression models to predict yearly amount spent with average time in-store,
time on the app, and time on the website.
b. Of the three variables modeled, which has the most significant impact on yearly amount
spent (if there is any difference)?
c. Is there a significant difference in the yearly amount spent by males compared with
females?
This part of the report focuses on introducing the factors, including independent and dependent
variables, possible lurking variables, outliers and possible errors, and experimental
methodologies.
1. The data set contains 61 male members and 60 female members.
2. The total population of users with membership is not provided.
3. If the task is “investigating customer behavior” to find out “how to improve the website
and the app,” the provided data set is insufficient. (In that case, gathering user
feedback on each feature of the app or website might better serve the end goal.)
2.1 Variables
Independent Variables:
Average time in-store / Time on App / Time on Website
Dependent Variable:
Yearly amount spent
[Figure: histograms of the variables; recoverable panel titles: Time on app, Length of membership]
In this section, the report discusses the relationship between the independent variables and
the single dependent variable. To make the best use of the data set, the analysis was
conducted on the combined data and then separately for each gender to look for further
clues.
First, the relationship between each independent variable and the dependent variable is
visualized. The Pearson correlation coefficient (r) is then used to determine the strength
and direction of the linear relationship between each independent variable and the
dependent variable.
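As a minimal sketch of this first step, Pearson's r can be computed directly from its definition. The function name `pearson_r` and the sample values below are hypothetical and are not taken from the report's data set; they only illustrate the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only (not the report's data)
time_in_store = [31.0, 32.5, 33.0, 34.2, 35.1]
yearly_spent = [450.0, 480.0, 470.0, 520.0, 540.0]

r = pearson_r(time_in_store, yearly_spent)
print(f"r = {r:.3f}")
```

A value of r near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.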
Second, reliability is discussed based on the coefficient of determination (R-square) from
each linear regression model, which shows how well the model fits and how much of the
variation in the yearly amount spent each independent variable explains. This linear
regression modeling is used to answer the first two research questions.
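The second step can be sketched as an ordinary least-squares fit with one predictor, followed by the R-square calculation. This is an illustration under stated assumptions rather than the report's actual workflow: the function name `fit_simple_regression` and the sample values are hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical values for illustration only (not the report's data)
time_on_app = [10.5, 11.2, 12.0, 12.8, 13.5]
yearly_spent = [455.0, 470.0, 500.0, 510.0, 545.0]

slope, intercept, r2 = fit_simple_regression(time_on_app, yearly_spent)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.3f}")
```

For a regression with a single predictor, R-square equals the square of Pearson's r, so the two measures always tell a consistent story about the same pair of variables.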
Third, to find out whether the yearly amount spent differs between males and females, a
t-test is conducted with the null hypothesis “the yearly amount spent by males is not
significantly different from that of females.” The t-test was selected over the z-test,
despite each sample being larger than 30, because the population variances are not known
and it is uncertain how accurately the sample data represents the population.
3.0 Results
Linear Regression Models (charts):
Pearson’s r: r = -0.05
Coefficient of determination: R² = 0.0029
Table 10. Summary of linear regression of Time on website (combined)
[Figure: residual plots, two panels]
                              Variable 1    Variable 2
Mean                          502.055082    493.337167
Variance                      6155.21996    5996.7331
Observations                  61            60
Hypothesized Mean Difference  0
df                            119
t Stat                        0.61514269
P(T<=t) one-tail              0.26981753
t Critical one-tail           1.65775928
P(T<=t) two-tail              0.53963507
t Critical two-tail           1.98009988
From the general box plots, outliers appear on both sides of the data for the yearly amount
spent, while the average time in store and the length of membership have outliers only on
the larger side.
The largest R-square we found is for average time in store against yearly amount spent. Based
on the regression analysis, there is a moderate positive relationship between the average time
in store and the yearly amount spent by customers. However, the average time in store only
explains a little over 23% of the variation in the yearly amount spent, suggesting that while there
is some association, other factors are also influencing the yearly amount spent that are not
accounted for in this simple regression model.
t-test
Given that the p-values (0.27 one-tailed and 0.54 two-tailed) are greater than the common alpha
level of 0.05, there is not enough evidence to reject the null hypothesis of no difference in mean
yearly amount spent between the two genders. This means that, statistically, there isn't a
significant difference in the yearly amount spent between the male group and the female group.
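The t statistic can be rechecked from the summary statistics in the t-test table alone. The sketch below assumes the unequal-variance (Welch) form of the two-sample t-test; the report does not state which variant was run, but this form appears consistent with the reported t Stat and df. The function name `welch_t_from_stats` is hypothetical, and the observation counts of 61 and 60 are taken to correspond to the male and female groups.

```python
import math


def welch_t_from_stats(mean1, var1, n1, mean2, var2, n2):
    """Welch (unequal-variance) t statistic and approximate degrees of freedom."""
    se1 = var1 / n1  # squared standard error of the first sample mean
    se2 = var2 / n2  # squared standard error of the second sample mean
    t_stat = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t_stat, df


# Summary statistics copied from the t-test table in the Results section
t_stat, df = welch_t_from_stats(502.055082, 6155.21996, 61,
                                493.337167, 5996.7331, 60)

# Two-tail critical value from the same table (alpha = 0.05)
T_CRITICAL_TWO_TAIL = 1.98009988
reject_null = abs(t_stat) > T_CRITICAL_TWO_TAIL

print(f"t = {t_stat:.4f}, df = {df:.1f}, reject H0: {reject_null}")
```

Comparing |t| with the critical value leads to the same decision as comparing the p-value with alpha: because |t| is well below the two-tail critical value, the null hypothesis is not rejected.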
4.1 Final Suggestions:
1. Investing in increasing the average time member customers spend in store may be the
best way in this context to raise the yearly amount spent, since the two have a
positive, though modest, relationship.
2. The t-test shows no statistically significant difference in yearly amount spent between
males and females. Focusing on gathering other information when collecting data from
customers may be a better way to gain knowledge of customer behavior.