Inferential Statistics Extended Project
Business Report
Table of Contents
Problem 1
  Data overview
  Key Questions
Problem 2
  Data overview
  Key Questions
Problem 3
  Context
  Objective
  Problem 3 - Data Overview
  Univariate Analysis
  Bivariate Analysis
  Insights based on EDA
  Key Questions
  Actionable Insights & Recommendations
List of Figures
Fig 7. Bivariate analysis of Conversion status vs Time spent on the page
Fig 9. Relationship between preferred language, conversion status, and landing page
Fig 10. Relationship between group, time spent on the page, and conversion status
Fig 11. Visualize the time spent on both the old and new landing pages
Fig 13. Visualization of the time spent on the new page for different language users
List of Tables
Problem - 1
An independent research organization is trying to estimate the probability that an accident
at a nuclear power plant will result in radiation leakage. The types of accidents possible at
the plant are fire hazards, mechanical failure, or human error. The research organization
also knows that two or more types of accidents cannot occur simultaneously.
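According to the studies carried out by the organization, the probability of a radiation leak
in case of a fire is 20%, the probability of a radiation leak in case of a mechanical failure is
50%, and the probability of a radiation leak in case of a human error is 10%. The studies also
showed the following:
• The probability of a radiation leak occurring simultaneously with a fire is 0.1%.
• The probability of a radiation leak occurring simultaneously with a mechanical failure is 0.15%.
• The probability of a radiation leak occurring simultaneously with a human error is 0.12%.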
1.1 What are the probabilities of a fire, a mechanical failure, and a human error
respectively?
1.3 Suppose there has been a radiation leak in the reactor for which the definite cause is
not known. What is the probability that it has been caused by:
a) a fire?
b) a mechanical failure?
c) a human error?
1.1 Solution
Defining the events:
F - Fire
M - Mechanical failure
H - Human error
R - Radiation leak
N - No accident
Given Probabilities -
P(R|F) = 0.2
P(R|M) = 0.5
P(R|H) = 0.1
P(R ∩ F) = 0.001
P(R ∩ M) = 0.0015
P(R ∩ H) = 0.0012
Since the only possible types of accident are fire, mechanical failure, and human error, a
radiation leak cannot occur without one of them:
P(R|N) = 0
P(R ∩ N) = P(R|N)P(N) = 0
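By the definition of conditional probability, P(R|A) = P(R ∩ A) / P(A), so for each accident
type A:
P(A) = P(R ∩ A) / P(R|A)
P(F) = 0.001 / 0.2 = 0.005
P(M) = 0.0015 / 0.5 = 0.003
P(H) = 0.0012 / 0.1 = 0.012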
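The arithmetic for 1.1 can be checked with a short Python sketch. It assumes only the given conditional and joint probabilities listed above; the dictionary names are illustrative and not part of the original report.

```python
# Conditional probabilities of a leak given each accident type: P(R|A).
p_leak_given = {"fire": 0.2, "mechanical": 0.5, "human": 0.1}

# Joint probabilities of a leak occurring together with each accident type: P(R ∩ A).
p_leak_and = {"fire": 0.001, "mechanical": 0.0015, "human": 0.0012}

# Rearranging P(R|A) = P(R ∩ A) / P(A) gives P(A) = P(R ∩ A) / P(R|A).
p_accident = {name: p_leak_and[name] / p_leak_given[name] for name in p_leak_given}

for name, p in p_accident.items():
    print(f"P({name}) = {p:.4f}")
```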
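1.3 Solution
Suppose there has been a radiation leak in the reactor for which the definite cause is
unknown. Because the accident types are mutually exclusive and P(R ∩ N) = 0, the total
probability of a leak is
P(R) = P(R ∩ F) + P(R ∩ M) + P(R ∩ H) = 0.001 + 0.0015 + 0.0012 = 0.0037
By Bayes' theorem, the probability that the leak was caused by each accident type is:
a) P(F|R) = P(R ∩ F) / P(R) = 0.001 / 0.0037 ≈ 0.2703
b) P(M|R) = P(R ∩ M) / P(R) = 0.0015 / 0.0037 ≈ 0.4054
c) P(H|R) = P(R ∩ H) / P(R) = 0.0012 / 0.0037 ≈ 0.3243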
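As a numerical check for 1.3, the sketch below applies Bayes' theorem to the given joint probabilities. It assumes, as stated in the problem, that the accident types are mutually exclusive and that no leak occurs without an accident; the variable names are illustrative.

```python
# Joint probabilities of a leak occurring together with each accident type: P(R ∩ A).
p_leak_and = {"fire": 0.001, "mechanical": 0.0015, "human": 0.0012}

# Total probability of a leak. The accident types are mutually exclusive and
# P(R ∩ N) = 0, so P(R) is simply the sum of the joint probabilities.
p_leak = sum(p_leak_and.values())

# Bayes' theorem: P(A|R) = P(R ∩ A) / P(R).
posterior = {name: joint / p_leak for name, joint in p_leak_and.items()}

print(f"P(R) = {p_leak:.4f}")
for name, p in posterior.items():
    print(f"P({name} | leak) = {p:.4f}")
```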
Problem - 2
Grades of the final examination in a training course are found to be normally distributed,
with a mean of 77 and a standard deviation of 8.5. Based on the information given answer
the questions below.
2.1 What is the probability that a randomly chosen student gets a grade below 85 on this
exam?
2.2 What is the probability that a randomly selected student scores between 65 and 87?
2.3 What should be the passing cut-off so that 75% of the students clear the exam?
Solution – 2.1
To find the probability that a randomly chosen student gets a grade below 85 on this exam,
we need to calculate the cumulative probability up to the value of 85. First we convert the
grade to a z score:
Z = (X-µ)/ σ
Where:
X - The value we need to find the probability for (85 in this case)
µ - The mean of the distribution (77)
σ - The standard deviation of the distribution (8.5)
Z = (85-77)/8.5
Z = 0.941176
Now we can use a standard normal distribution table or a calculator to find the cumulative
probability corresponding to the z score of 0.941176.
From the standard normal distribution table, the cumulative probability (the area under the
curve to the left of z ≈ 0.94) is approximately 0.8264.
So, the probability that a randomly chosen student gets a grade below 85 on this
exam is approximately 82.64%.
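The table lookup can be cross-checked in Python. This is a minimal sketch using `statistics.NormalDist` from the standard library, with the mean and standard deviation from the problem statement. Because it evaluates the distribution at the unrounded z score, the result (about 0.827) differs slightly from the two-decimal table value.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5.
grades = NormalDist(mu=77, sigma=8.5)

# P(X < 85) is the cumulative distribution function evaluated at 85.
p_below_85 = grades.cdf(85)

print(f"P(X < 85) = {p_below_85:.4f}")
```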
Solution – 2.2
To find the probability that a randomly chosen student gets a grade between 65 & 87 on
this exam we need to calculate cumulative probability for both the values and subtract the
smaller probability from bigger probability,
For 65
Z1 = (65-77)/8.5 = -1.411765
For 87
Z2 = (87-77)/8.5 = 1.176471
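Using the standard normal distribution table with the z scores truncated to two decimal
places (Z1 ≈ -1.41, Z2 ≈ 1.17), we find the cumulative probabilities 0.0793 for Z1 and 0.8790
for Z2:
0.8790-0.0793 = 0.7997
Therefore, the probability that a randomly selected student scores between 65 and 87
is approximately 0.7997 or 79.97%. Evaluating the cumulative probabilities at the unrounded
z scores gives a slightly higher value, about 0.8013.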
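The same interval probability can be computed without a table. The sketch below again uses `statistics.NormalDist`; since it does not round the z scores, it returns roughly 0.801 rather than the table-based 0.7997.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5.
grades = NormalDist(mu=77, sigma=8.5)

# P(65 < X < 87) is the difference between the two cumulative probabilities.
p_between = grades.cdf(87) - grades.cdf(65)

print(f"P(65 < X < 87) = {p_between:.4f}")
```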
Solution – 2.3
To determine the passing cut-off so that 75% of the students clear the exam, we need the
grade that 75% of the students score above. This is the grade with a cumulative probability
of 0.25 (the 25th percentile), not 0.75.
Using the standard normal distribution table, we find the z score corresponding to a
cumulative probability of 0.25 is approximately -0.6745.
Z = (X-µ)/ σ
-0.6745 = (X-77)/8.5
Solving for X:
X-77 = -0.6745*8.5
X-77 = -5.73325
X = 71.26675
Therefore, the passing cut-off should be approximately 71.27.
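The cut-off can be checked with the inverse cumulative distribution function. The sketch below reads "clear the exam" as scoring at or above the cut-off, so the cut-off is the grade that only 25% of students fall below. Under that reading it gives a value near 71.27.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5.
grades = NormalDist(mu=77, sigma=8.5)

# 75% of students must score at or above the cut-off, so 25% fall below it.
cutoff = grades.inv_cdf(0.25)

# Sanity check: the share of students at or above the cut-off should be 0.75.
share_passing = 1 - grades.cdf(cutoff)

print(f"cut-off = {cutoff:.2f}, share passing = {share_passing:.2f}")
```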
Problem - 3
Context
E-news Express, an online news portal, aims to expand its business by acquiring new
subscribers. With every visitor to the website taking certain actions based on their interest,
the company plans to analyze these actions to understand user interests and determine
how to drive better engagement. The executives at E-news Express are of the opinion that
there has been a decline in new monthly subscribers compared to the past year because the
current web page is not designed well enough in terms of the outline & recommended
content to keep customers engaged long enough to decide to subscribe.
[Companies often analyze user responses to two variants of a product to decide which of
the two variants is more effective. This experimental technique, known as A/B testing, is
used to determine whether a new feature attracts users based on a chosen metric.]
Objective
The design team of the company has researched and created a new landing page that has a
new outline & more relevant content shown compared to the old page. In order to test the
effectiveness of the new landing page in gathering new subscribers, the Data Science team
conducted an experiment by randomly selecting 100 users and dividing them equally into
two groups. The existing landing page was served to the first group (control group) and the
new landing page to the second group (treatment group). Data regarding the interaction of
users in both groups with the two versions of the landing page was collected. Being a data
scientist in E-news Express, you have been asked to explore the data and perform a
statistical analysis (at a significance level of 5%) to determine the effectiveness of the new
landing page in gathering new subscribers for the news portal by answering the following
questions:
1. Do the users spend more time on the new landing page than on the existing landing
page?
2. Is the conversion rate (the proportion of users who visit the landing page and get
converted) for the new page greater than the conversion rate for the old page?
3. Does the converted status depend on the preferred language?
4. Is the time spent on the new page the same for the different language users?
Data Description
The data describes how users interacted with the E-news Express landing pages. The detailed
data dictionary is given below.
Data Dictionary
The data contains information regarding the interaction of users in both groups with the
two versions of the landing page.
Data Overview
The Data Frame has 6 columns mentioned in the data dictionary. Data in each row
corresponds to the activity, duration spent on the page and preferences of each user.
Table 2: Basic information of the datasets
Observations:
• Across all users, the maximum time spent on the page is 10.71 minutes.
• Across all users, the minimum time spent on the page is 0.19 minutes.
• Across all users, the average time spent on the page is 5.377 minutes.
• Across all users, the standard deviation of the time spent on the page is 2.37 minutes.
Observations
• All the categorical variables have 100 entries each, meaning that there is no
missing data.
• The group variable has two unique values, control and treatment, with 50 entries
each.
• The landing_page variable has two unique values, old and new, with 50 entries each.
• The converted variable has two unique values, yes and no, with 54 and 46 entries
respectively.
• The language_preferred variable has three unique values. Spanish and French are the
most frequent, with 34 entries each.
Univariate Analysis
Fig 1. Univariate analysis of time_spent on the page
Observations
Group
Fig 2. Univariate analysis of Group
Observations
Landing page
Fig 3. Univariate analysis of landing_page
Observations
• The countplot shows that 50% of users were served the old landing page and 50% of users
were served the new landing page.
Converted
Observations
• The countplot shows 54% of users are converted and 46% of users are not converted.
Language preferred
Observation:
The countplot shows that 32% of users prefer English, 34% prefer French, and 34% prefer
Spanish.
Bivariate Analysis
Fig 6. Bivariate analysis of Landing page vs Time spent on the page
Observation:
Conversion status vs Time spent on the page
Observation:
• There are a few outliers in the time spent on the page for both converted status "yes"
and converted status "no".
• In both groups, the mean time spent by users with converted status "yes" is higher
than the mean time spent by users whose converted status is "no".
Language preferred vs Time spent on the page
Observation:
• There are no outliers in the time spent on the page for users who prefer English.
• There are a few outliers in the time spent on the page for users who prefer French and
Spanish.
• The mean time spent by the treatment group is higher than that of the control group for
users of all three languages.
Problem 3.2 Insights based on EDA
Fig 9. Relationship between preferred language, conversion status, and landing page
Observation:
It appears that users that prefer Spanish and French opted not to convert to a subscriber
while viewing the old landing page. However, users of all languages preferred to convert to
subscribers while viewing the new landing page.
B) Relationship between group, time spent on the page, and conversion status
Fig 10. Relationship between group, time spent on the page, and conversion status
Observation:
It appears that more people converted to subscribers in the treatment group compared to
the control group. Additionally, users in the treatment group spent more time on the page.
Problem 3.3. Do the users spend more time on the new landing page than on
the old landing page?
Visual Analysis
Fig 11. Visualize the time spent on both the old and new landing pages
Observations:
• There are a few outliers in the data of the new landing page.
• The mean time spent by the users on the new landing page is higher than the
mean time spent on the old landing page.
• The median time spent on the new landing page is about 6 minutes.
• The median time spent on the old landing page is about 4.5 minutes.
Step 1: Define the null and alternate hypotheses
Null hypothesis H(0): The mean time spent by the users on the new page is equal to the mean
time spent by the users on the old page.
Alternate hypothesis H(1): The mean time spent by the users on the new page is greater
than the mean time spent by the users on the old page.
Step 2: Select the appropriate Test
This is a one-tailed test concerning two population means from two independent
populations. The population standard deviations are unknown. Based on this
information, a two-sample independent t-test would be the most appropriate.
Step 3: Decide the significance level
As given in the problem statement, we select α= 0.05.
Observations:
Based on the sample standard deviations of the two groups, the population standard
deviations can be assumed to be unequal.
Observation:
• The p-value is less than 0.05.
As the p-value (0.0001392381225166549, or about 0.00014) is less than the level of
significance, we reject the null hypothesis. Hence, we have enough evidence to support the
claim that users spend more time on the new landing page than on the old landing page.
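The test statistic behind a two-sample t-test with unequal variances (Welch's t-test) can be sketched as follows. The two samples are hypothetical values invented for illustration and are not the report's data; the function name `welch_t` is likewise an assumption. The sketch computes only the t statistic and the Welch-Satterthwaite degrees of freedom, using the standard library.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical minutes-on-page samples, invented for illustration only.
old_page = [4.1, 3.2, 5.0, 4.6, 2.9, 5.4, 4.0, 3.7]
new_page = [6.3, 5.8, 7.1, 6.0, 5.2, 7.4, 6.6, 5.9]


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and degrees of freedom for H1: mean_a > mean_b."""
    # Squared standard error of each sample mean (sample variance / sample size).
    se2_a = variance(sample_a) / len(sample_a)
    se2_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


t_stat, dof = welch_t(new_page, old_page)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The one-sided p-value is the upper-tail probability of a t distribution with `dof` degrees of freedom at `t_stat`. If SciPy is available, `scipy.stats.ttest_ind(new_page, old_page, equal_var=False, alternative="greater")` should report the same statistic together with that p-value.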
Visual analysis
Fig 12. Visualize the conversion status based on preferred language
Observations:
• There are a few outliers in the data of users who prefer English and Spanish.
• For all three languages, the mean time spent by users whose converted status is "yes"
is higher than that of users whose converted status is "no".
• Among users whose converted status is "yes", the mean time of users who prefer French
is slightly higher than that of the other language users.
• Among users whose converted status is "no", the mean time of users who prefer Spanish
is higher than that of the other language users.
• The mean and median are almost equal for every language and converted status, except
for Spanish with converted status "no".
The null hypothesis H(0): The converted status is independent of the preferred language.
The alternate hypothesis H(1): The converted status is dependent on the preferred
language.
Step 2: Select Appropriate test
Observations:
• Among the three languages, users who prefer English have the highest number of
conversions.
Step 5: Calculate the p-value
Perform a chi-squared test for independence and determine the p-value
Observations:
The p-value is greater than 0.05.
Since the p-value is greater than the 5% level of significance, we fail to reject the null
hypothesis. This means that we do not have enough evidence to conclude that the converted
status depends on the preferred language.
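The chi-squared test of independence can be sketched in plain Python. The contingency table below is illustrative only: its counts were chosen to match the marginal totals reported earlier (32 English, 34 French, 34 Spanish users; 54 converted, 46 not) and are not taken from the report. For a 2 x 3 table the test has 2 degrees of freedom, in which case the chi-squared upper-tail probability has the closed form exp(-x/2), so no statistics library is needed.

```python
from math import exp

# Illustrative counts only (assumption): consistent with the reported marginal
# totals, but NOT the report's actual contingency table.
observed = {
    "English": {"no": 11, "yes": 21},
    "French": {"no": 19, "yes": 15},
    "Spanish": {"no": 16, "yes": 18},
}

languages = list(observed)
statuses = ["no", "yes"]
row_total = {lang: sum(observed[lang].values()) for lang in languages}
col_total = {s: sum(observed[lang][s] for lang in languages) for s in statuses}
n = sum(row_total.values())

# Chi-squared statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total under independence.
chi2 = 0.0
for lang in languages:
    for s in statuses:
        expected = row_total[lang] * col_total[s] / n
        chi2 += (observed[lang][s] - expected) ** 2 / expected

dof = (len(languages) - 1) * (len(statuses) - 1)

# The closed form below is valid only for 2 degrees of freedom.
assert dof == 2
p_value = exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")
```

For tables of other shapes the p-value needs the general chi-squared distribution; `scipy.stats.chi2_contingency` is one commonly used option.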
Problem 3.5. Is the time spent on the new page the same for the different
language users?
Fig 13. Visualization of the time spent on the new page for different language users
Observations:
• The mean time spent on the new page is highest for users who prefer English and lowest
for users who prefer French.
H(0): The mean time spent on the new landing page is the same across all preferred
languages.
H(1): At least one of the mean times spent on the new landing page is different amongst the
preferred languages.
This is a problem concerning three population means. Based on this information, a one-way
ANOVA test would be the most appropriate.
Shapiro-Wilk’s test
Level of Significance
We select α= 0.05.
The p-value is 0.8040016293525696
Draw Inference
Since the p-value of the test is much larger than the 5% significance level, we fail to reject
the null hypothesis that the response follows the normal distribution.
Levene’s test
Level of Significance
We select α= 0.05
Find the p_value
The p-value is 0.46711357711340173
Draw Inference
Since the p-value is larger than the 5% significance level, we fail to reject the null hypothesis
of homogeneity of variances, so the variances can be assumed to be equal.
• The populations are normally distributed - Yes, the normality assumption is verified
using the Shapiro-Wilk’s test.
• Samples are independent simple random samples - Yes, we are informed that the
collected sample is a simple random sample.
• Population variances are equal - Yes, the homogeneity of variance assumption is
verified using the Levene's test.
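With the assumptions checked, the one-way ANOVA itself can be sketched as below. The three samples are hypothetical values invented for illustration, not the report's data. With exactly three groups the numerator has 2 degrees of freedom, and the upper tail of an F(2, d) distribution has the closed form (1 + 2F/d)^(-d/2), which keeps the sketch within the standard library.

```python
from statistics import mean

# Hypothetical minutes spent on the new page per preferred language,
# invented for illustration only.
groups = {
    "English": [6.2, 7.1, 5.8, 6.9, 6.5],
    "French": [5.9, 6.4, 6.8, 5.5, 6.1],
    "Spanish": [5.7, 6.0, 6.6, 5.4, 6.3],
}

k = len(groups)
all_values = [x for values in groups.values() for x in values]
n_total = len(all_values)
grand_mean = mean(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = k - 1
df_within = n_total - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

# Closed-form upper tail of F(2, df_within); valid only when df_between == 2.
assert df_between == 2
p_value = (1 + 2 * f_stat / df_within) ** (-df_within / 2)

print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
```

For any other number of groups the p-value needs the general F distribution; `scipy.stats.f_oneway` handles that case and returns both the statistic and the p-value.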
Observations:
Conclusion
1. Users spend more time on the new landing page than on the old landing page.
2. Users with converted status "yes" spend more time on the page than users
with converted status "no".
3. The conversion rate for the new page is greater than that for the old page.
4. Conversion status is independent of preferred language.
5. There is no significant difference in the mean time spent on the new page across the
three preferred languages.
6. Although there are comparatively fewer English-language users, their conversion
rate is higher than that of the other two language groups (a difference that is not
statistically significant, per conclusion 4).
Recommendations:
• E-News Express should fully implement the new landing page, as it appears to
attract much more attention than the old landing page. The greater amount of time
spent on the new landing page compared to the old one is evidence that users prefer
it.
• It might be beneficial to retire the old landing page, as it yields a lower average time
spent and a lower conversion rate. The new landing page has an increased conversion
rate; therefore, more resources should be directed towards it, as it has greater
potential to increase membership.
• Deploy the new landing page, incorporating all the existing preferred languages.
Since neither the average time spent on the new page nor the conversion status differs
significantly across the preferred languages, the new page can be expected to perform
similarly for all of them. Perhaps adding more languages to the portal will help reach a
wider audience.