
Advanced Statistics Extended Project
Business Report

Table of Contents

Problem 1
Data overview
Key Questions
Problem 2
Data overview
Key Questions
Problem 3
Context
Objective
Problem 3 - Data Overview
Univariate Analysis
Bivariate Analysis
Insights based on EDA
Key Questions
Actionable Insights & Recommendations

List of Figures

Fig 1. Univariate analysis of time_spent on the page
Fig 2. Univariate analysis of Group
Fig 3. Univariate analysis of landing_page
Fig 4. Univariate analysis of conversion
Fig 5. Univariate analysis of Language preferred
Fig 6. Bivariate analysis of Landing page vs Time spent on the page
Fig 7. Bivariate analysis of Conversion status vs Time spent on the page
Fig 8. Bivariate analysis of Language preferred vs Time spent on the page
Fig 9. Relationship between preferred language, conversion status, and landing page
Fig 10. Relationship between group, time spent on the page, and conversion status
Fig 11. Visualize the time spent on both the old and new landing pages
Fig 12. Visualize the conversion status based on preferred language
Fig 13. Visualization of the time spent on the new page for different language users

List of Tables

Table 1: Top 5 rows of the datasets
Table 2: Basic information of the datasets
Table 3: Statistical Summary for numerical variable
Table 4: Statistical Summary for categorical variable
Table 5: Table for the converted and language_preferred variables
Table 6: Value counts for language preferred

Problem - 1
An independent research organization is trying to estimate the probability that an accident
at a nuclear power plant will result in radiation leakage. The types of accidents possible at
the plant are fire hazards, mechanical failure, or human error. The research organization
also knows that two or more types of accidents cannot occur simultaneously.

According to the studies carried out by the organization, the probability of a radiation leak
in case of a fire is 20%, the probability of a radiation leak in case of a mechanical failure is
50%, and the probability of a radiation leak in case of a human error is 10%. The studies also
showed the following:

The probability of a radiation leak occurring simultaneously with a fire is 0.1%.

The probability of a radiation leak occurring simultaneously with a mechanical failure is
0.15%.

The probability of a radiation leak occurring simultaneously with a human error is 0.12%.

Based on the information available, answer the questions below:

1.1 What are the probabilities of a fire, a mechanical failure, and a human error
respectively?

1.2 What is the probability of a radiation leak?

1.3 Suppose there has been a radiation leak in the reactor for which the definite cause is
not known. What is the probability that it has been caused by:

a) a fire?

b) a mechanical failure?

c) a human error?

1.1 Solution
Defining the events -

F - Fire

M - Mechanical failure

H - Human error

R - Radiation leak

N - No accident

Given Probabilities -

P(R|F) = 0.2

P(R|M) = 0.5

P(R|H) = 0.1

P(R ∩ F) = 0.001

P(R ∩ M) = 0.0015

P(R ∩ H) = 0.0012

1.1 The probabilities of a fire, a mechanical failure, a human error respectively: -

P(F) = P(R ∩ F)/ P(R|F) = 0.001/0.2 = 0.005

P(M) = P(R ∩ M)/ P(R|M) = 0.0015/0.5 = 0.003

P(H) = P(R ∩ H)/ P(R|H) = 0.0012/0.1 = 0.012

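1.2 The probability of a radiation leak:

The only possible accident types are fire, mechanical failure and human error, and no two of
them can occur simultaneously, so the probability of no accident (N) is:

P(N) = 1 - (0.005+0.003+0.012) = 0.98

A radiation leak cannot occur without an accident, so:

P(R|N) = 0

P(R ∩ N) = P(R|N)P(N) = 0

By the law of total probability: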

P(R) = P(R ∩ F)+ P(R ∩ M)+ P(R ∩ H)+ P(R ∩ N)

P(R) = 0.001 + 0.0015 + 0.0012 + 0


P(R) = 0.0037

1.3 Suppose there has been a radiation leak in the reactor for which the definite cause is
unknown. By Bayes' theorem, the probability that it was caused by:

A) a fire is -

P(F|R) = P(R ∩ F)/P(R) = 0.001/0.0037 = 0.270

B) a mechanical failure is -

P(M|R) = P(R ∩ M)/P(R) = 0.0015/0.0037 = 0.405

C) a human error is -

P(H|R) = P(R ∩ H)/P(R) = 0.0012/0.0037 = 0.324
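
The three parts of Problem 1 can be checked with a few lines of Python. This is a minimal sketch that uses only the probabilities given in the problem statement; the dictionary names (joint, leak_given, prior, posterior) are illustrative choices, not part of the original report.

```python
# Joint probabilities P(R and cause) and conditionals P(R | cause) from the problem statement.
joint = {"fire": 0.001, "mechanical": 0.0015, "human": 0.0012}
leak_given = {"fire": 0.2, "mechanical": 0.5, "human": 0.1}

# 1.1: P(cause) = P(R and cause) / P(R | cause)
prior = {cause: joint[cause] / leak_given[cause] for cause in joint}

# 1.2: law of total probability; the no-accident term contributes 0
p_leak = sum(joint.values())

# 1.3: Bayes' theorem, P(cause | R) = P(R and cause) / P(R)
posterior = {cause: joint[cause] / p_leak for cause in joint}

for cause in joint:
    print(f"{cause}: P(cause) = {prior[cause]:.3f}, P(cause | R) = {posterior[cause]:.3f}")
print(f"P(R) = {p_leak:.4f}")
```

Because the three posteriors share the same denominator P(R), they sum to 1, which is a quick sanity check on the hand calculation.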

Problem - 2
Grades of the final examination in a training course are found to be normally distributed,
with a mean of 77 and a standard deviation of 8.5. Based on the information given answer
the questions below.

2.1 What is the probability that a randomly chosen student gets a grade below 85 on this
exam?

2.2 What is the probability that a randomly selected student score between 65 and 87?

2.3 What should be the passing cut-off so that 75% of the students clear the exam?

Solution – 2.1
To find the probability that a randomly chosen student gets a grade below 85 on this exam
we need to calculate cumulative probability up to the value of 85

Using the z score formula -

Z = (X-µ)/ σ

Where:

X – The value we need to find the probability for (85 in this case)

µ - the mean (77)

σ - the standard deviation (8.5)

Z = (85-77)/8.5

Z = 0.941176

Now we can use a standard normal distribution table or a calculator to find the cumulative
probability corresponding to the z score of 0.941176.

From the standard normal distribution table (read at z = 0.94), the cumulative probability
(area under the curve to the left of the z score) is approximately 0.8264.

So, the probability that a randomly chosen student gets a grade below 85 on this
exam is approximately 82.64%.
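
The same cumulative probability can be computed without a table. The sketch below uses NormalDist from Python's standard library; the variable names are illustrative. Because it evaluates the CDF at the unrounded z score, the result may differ from a table lookup in the third or fourth decimal place.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5
grades = NormalDist(mu=77, sigma=8.5)

# P(X < 85): area under the curve to the left of 85
p_below_85 = grades.cdf(85)
print(f"P(X < 85) = {p_below_85:.4f}")
```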

Solution – 2.2
To find the probability that a randomly chosen student gets a grade between 65 and 87 on
this exam, we need to calculate the cumulative probability for both values and subtract the
smaller probability from the larger one.

Using the z score formula -

For 65

Z1 = (65-77)/8.5 = -1.411765

For 87

Z2 = (87-77)/8.5 = 1.176471

Using the standard normal distribution table, we find the cumulative probabilities
corresponding to Z1 and Z2 (read at z = -1.41 and z = 1.17):

For Z1 = -1.411765, the cumulative probability is approximately 0.0793

For Z2 = 1.176471, the cumulative probability is approximately 0.8790

The probability of scoring between 65 and 87 is the difference between the cumulative
probabilities:

0.8790 - 0.0793 = 0.7997

Therefore, the probability that a randomly selected student scores between 65 and 87
is approximately 0.7997 or 79.97%.
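
As a cross-check, the interval probability can be computed directly as a difference of two CDF values. This is a minimal sketch using the standard library; it works with unrounded z scores, so it gives a value close to 0.80 that differs slightly from the table-based figure.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5
grades = NormalDist(mu=77, sigma=8.5)

# P(65 < X < 87) = P(X < 87) - P(X < 65)
p_between = grades.cdf(87) - grades.cdf(65)
print(f"P(65 < X < 87) = {p_between:.4f}")
```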

Solution – 2.3
To determine the passing cut-off so that 75% of the students clear the exam, we need the
grade X that 75% of the students score at or above. Equivalently, 25% of the students score
below X, so we find the z score for a cumulative probability of 0.25.

Using the standard normal distribution table, the z score corresponding to a
cumulative probability of 0.25 is approximately -0.6745 (the mirror image of 0.6745, the
z score for a cumulative probability of 0.75).

Using the z score formula -

Z = (X-µ)/ σ

-0.6745 = (X-77)/8.5

Solving for X:

X-77 = -0.6745*8.5

X-77 = -5.73325

X = 71.26675

Therefore, the passing cut-off should be set at approximately 71.27 for 75% of the
students to clear the exam.
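
A sketch of the cut-off computation with the standard library's inverse CDF. For 75% of students to clear the exam, the cut-off is the grade with 25% of the distribution below it, so the quantile requested is 0.25; the variable names are illustrative.

```python
from statistics import NormalDist

# Grade distribution from the problem statement: mean 77, standard deviation 8.5
grades = NormalDist(mu=77, sigma=8.5)

# 75% of students must score at or above the cut-off, so 25% fall below it
cutoff = grades.inv_cdf(0.25)

# Sanity check: the share of students at or above the cut-off
share_passing = 1 - grades.cdf(cutoff)
print(f"cut-off = {cutoff:.2f}, share passing = {share_passing:.2f}")
```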

Problem 3. Project Business Statistics: E-news Express


Business Context
The advent of e-news, or electronic news, portals has offered us a great opportunity to
quickly get updates on the day-to-day events occurring globally. The information on these
portals is retrieved electronically from online databases, processed using a variety of
software, and then transmitted to the users. There are multiple advantages of transmitting
news electronically, like faster access to the content and the ability to utilize different
technologies such as audio, graphics, video, and other interactive elements that are either
not being used or aren’t common yet in traditional newspapers.

E-news Express, an online news portal, aims to expand its business by acquiring new
subscribers. With every visitor to the website taking certain actions based on their interest,
the company plans to analyze these actions to understand user interests and determine
how to drive better engagement. The executives at E-news Express are of the opinion that
there has been a decline in new monthly subscribers compared to the past year because the
current web page is not designed well enough in terms of the outline & recommended
content to keep customers engaged long enough to decide to subscribe.

[Companies often analyze user responses to two variants of a product to decide which of
the two variants is more effective. This experimental technique, known as A/B testing, is
used to determine whether a new feature attracts users based on a chosen metric.]

Objective

The design team of the company has researched and created a new landing page that has a
new outline & more relevant content shown compared to the old page. In order to test the
effectiveness of the new landing page in gathering new subscribers, the Data Science team
conducted an experiment by randomly selecting 100 users and dividing them equally into
two groups. The existing landing page was served to the first group (control group) and the
new landing page to the second group (treatment group). Data regarding the interaction of
users in both groups with the two versions of the landing page was collected. Being a data
scientist in E-news Express, you have been asked to explore the data and perform a
statistical analysis (at a significance level of 5%) to determine the effectiveness of the new
landing page in gathering new subscribers for the news portal by answering the following
questions:

1. Do the users spend more time on the new landing page than on the existing landing
page?
2. Is the conversion rate (the proportion of users who visit the landing page and get
converted) for the new page greater than the conversion rate for the old page?
3. Does the converted status depend on the preferred language?
4. Is the time spent on the new page the same for the different language users?

Data Description

The data contains information related to E-news Express. The detailed data
dictionary is given below.

Data Dictionary
The data contains information regarding the interaction of users in both groups with the
two versions of the landing page.

1. user_id - Unique user ID of the person visiting the website
2. group - Whether the user belongs to the first group (control) or the second group
(treatment)
3. landing_page - Whether the landing page is new or old
4. time_spent_on_the_page - Time (in minutes) spent by the user on the landing page
5. converted - Whether the user gets converted to a subscriber of the news portal or
not
6. language_preferred - Language chosen by the user to view the landing page

Data Overview

Structure of the data:

The Data Frame has 6 columns mentioned in the data dictionary. Data in each row
corresponds to the activity, duration spent on the page and preferences of each user.

Table 1: Top 5 rows of the datasets.

The data frame has 100 rows and 6 columns.

Table 2: Basic information of the datasets

Observations:

• There are 4 object-type columns (variables), 1 integer column (variable) and 1
float column (variable) in the data set.
• There are no columns with null/missing values in the data set.
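
Checks like the ones above are typically run with pandas. The sketch below builds a tiny hypothetical data frame with the six columns from the data dictionary; the four rows are made up for illustration and are not the E-news Express data, so the printed numbers will not match the tables in this report.

```python
import pandas as pd

# Hypothetical rows for illustration only (assumption); not the real data set.
df = pd.DataFrame({
    "user_id": [1001, 1002, 1003, 1004],
    "group": ["control", "treatment", "treatment", "control"],
    "landing_page": ["old", "new", "new", "old"],
    "time_spent_on_the_page": [3.5, 7.1, 4.4, 3.0],
    "converted": ["no", "yes", "no", "no"],
    "language_preferred": ["Spanish", "English", "Spanish", "French"],
})

print(df.shape)                 # (number of rows, number of columns)
print(df.dtypes)                # data type of each column
n_missing = df.isna().sum()     # count of missing values per column
print(n_missing)
print(df["time_spent_on_the_page"].describe())  # count, mean, std, min, quartiles, max
```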

Statistical Summary for Numerical Variable:

Table 3: Statistical Summary for numerical variable

Observations:

• The maximum time spent on the page across all users is 10.71 minutes.
• The minimum time spent on the page across all users is 0.19 minutes.
• The average time spent on the page across all users is 5.377 minutes.
• The standard deviation of the time spent on the page across all users is 2.37 minutes.

Statistical Summary for Categorical Variable:

Table 4: Statistical Summary for categorical variable

Observations

• All the categorical variables have 100 entries each, meaning that there is no
missing data.
• The group variable has two unique values, control and treatment, with 50 entries
each.
• The landing_page variable has two unique values, old and new, with 50 entries each.
• The converted variable has two unique values, yes and no, with 54 and 46 entries
respectively.
• The language_preferred variable has three unique values; Spanish and French are the
most frequent, with 34 entries each.
Univariate Analysis

Time spent on the page

Fig 1. Univariate analysis of time_spent on the page

Observations

• The histogram and boxplot of the variable "time_spent_on_the_page" show that there are no
outliers in the data.
• The mean and median are almost equal; both lie between 5 and 6 mins.

Group

Fig 2. Univariate analysis of Group

Observations

• The time distribution of the "control" group appears approximately normal.
• The mean and median are nearly the same, at around 4.8 mins.

Landing page

Fig 3. Univariate analysis of landing_page

Observations
• The countplot shows that 50% of users were served the old landing page and 50% of users
were served the new landing page.

Converted

Fig 4. Univariate analysis of conversion

Observations

• The countplot shows 54% of users are converted and 46% of users are not converted.

Language preferred

Fig 5. Univariate analysis of Language preferred

Observation:

The countplot shows that 32% of users prefer English, 34% prefer French and
34% prefer Spanish.

Bivariate Analysis

Landing page vs Time spent on the page

Fig 6. Bivariate analysis of Landing page vs Time spent on the page

Observation:

• There are a few outliers in the time spent on the new landing page.
• Mean time spent on the new landing page is around 6 mins.
• Mean time spent on the old landing page is around 4.8 mins.

Conversion status vs Time spent on the page

Fig 7. Bivariate analysis of Conversion status vs Time spent on the page

Observation:
• There are a few outliers in the time spent on the page for both converted status "yes"
and converted status "no".
• The mean time spent by users in both groups with converted status "yes" is higher
than the mean time spent by users whose converted status is "no".

Language preferred vs Time spent on the page

Fig 8. Bivariate analysis of Language preferred vs Time spent on the page

Observation:
• There are no outliers in the time spent on the page by users who prefer English.
• There are a few outliers in the time spent on the page by users who prefer French and Spanish.
• Mean time spent by the treatment group is higher than that of the control group for users of
all languages.

Problem 3.2 Insights based on EDA

A) Relationship between preferred language, conversion status, and landing page

Fig 9. Relationship between preferred language, conversion status, and landing page

Observation:

It appears that users that prefer Spanish and French opted not to convert to a subscriber
while viewing the old landing page. However, users of all languages preferred to convert to
subscribers while viewing the new landing page.

B) Relationship between group, time spent on the page, and conversion status

Fig 10. Relationship between group, time spent on the page, and conversion status

Observation:

It appears that more people converted to subscribers in the treatment group compared to
the control group. Additionally, users in the treatment group spent more time on the page.

Problem 3.3. Do the users spend more time on the new landing page than
on the old landing page?

Visual Analysis

Fig 11. Visualize the time spent on both the old and new landing pages

Observations:

• There are a few outliers in the data for the new landing page.
• Mean time spent by the users on the new landing page is higher than the
mean time spent on the old landing page.
• The median time spent on the new landing page is around 6 mins.
• The median time spent on the old landing page is around 4.5 mins.

Step 1: Define the null and alternate hypotheses.

H(0) null hypothesis: The mean time spent by the users on the new page is equal to the mean
time spent by the users on the old page.

H(1) alternate hypothesis: The mean time spent by the users on the new page is greater
than the mean time spent by the users on the old page.

Step 2: Select the appropriate Test

This is a one-tailed test concerning two population means from two independent
populations. The population standard deviations are unknown. Based on this
information, a two-sample independent t-test would be the most appropriate.

Step 3: Decide the significance level
As given in the problem statement, we select α= 0.05.

Step 4: Collect and prepare data


The sample standard deviation of the time spent on the new page is: 1.82
The sample standard deviation of the time spent on the old page is: 2.58

Observations:

Based on the sample standard deviations of the two groups, the population standard
deviations can be assumed to be unequal.

Two-sample independent t-test assumptions:

• Continuous data - Yes, the time is measured on a continuous scale.
• Normally distributed populations - Yes, each group has 50 users. Since the sample
sizes are greater than 30, the Central Limit Theorem implies that the distribution of
the sample means will be approximately normal.
• Independent populations - As we are taking random samples for two different
groups, the two samples are from two independent populations.
• Unequal population standard deviations - As the sample standard deviations are
different, the population standard deviations may be assumed to be different.
• Random sampling from the population - Yes, we are informed that the collected
sample is a simple random sample.

We can use the two-sample independent t-test for this problem.

Step 5: Calculate the p-value


scipy.stats.ttest_ind performs a t-test on two independent samples of observations
when the population standard deviations are unknown. With the equal_var parameter set to
False it does not assume equal variances (Welch's t-test), and with the alternative
parameter set to 'greater' it returns the test statistic and p-value for a right-tailed test.

The p-value is 0.0001392381225166549
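
A minimal sketch of how such a right-tailed Welch's t-test can be run. The two arrays are synthetic stand-ins generated with arbitrary parameters (an assumption made only so the example runs on its own); they are not the experiment's data, so the printed p-value is not expected to reproduce the value reported in this section.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 50 simulated users per landing page.
time_new = rng.normal(loc=6.0, scale=1.8, size=50)
time_old = rng.normal(loc=4.5, scale=2.6, size=50)

# equal_var=False gives Welch's t-test (no equal-variance assumption);
# alternative="greater" tests H1: mean(time_new) > mean(time_old).
t_stat, p_value = stats.ttest_ind(
    time_new, time_old, equal_var=False, alternative="greater"
)

alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With the real data, the two arrays would be the time_spent_on_the_page values filtered by landing_page.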

Observation:
• p_value is <0.05

Step 6: Compare the p-value with α

As the p-value 0.0001392381225166549 is less than the level of significance, we reject the
null hypothesis.

Step 7: Draw Inference

As the p-value (~0.00013) is less than the level of significance, we can reject the null
hypothesis. Hence, we do have enough evidence to support the claim that users spend
more time on new_landing_page than the old_landing_page.

Problem 3.4. Does the converted status depend on the preferred language?

Visual analysis

Fig 12. Visualize the conversion status based on preferred language

Observations:

• There are a few outliers in the data for users who prefer English and Spanish.
• Mean time spent by users who converted ("yes") is higher than that of users who did
not convert ("no") for all three languages.
• Among converted users, the mean time of users who prefer French is slightly higher
than that of users of the other languages.
• Among non-converted users, the mean time of users who prefer Spanish is higher
than that of users of the other languages.
• The mean and median are almost equal for every language and converted status,
except for Spanish with converted status "no".

Step 1: Define null and alternative hypotheses

The null hypothesis H(0): The converted status is independent of the preferred language.

The alternate hypothesis H(1): The converted status is dependent on the preferred
language.

Step 2: Select Appropriate test

This is a test of independence concerning two categorical variables,
converted status and preferred language. Based on this information, a chi-square test
for independence would be the most appropriate.

Step 3: Decide the significance level

As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

Table 5. Table for the converted and language_preferred variables

Observations:

• The number of converted users is higher for English than for the other
languages.

Chi-Squared test for independence assumptions:

• Categorical variables - Yes
• Expected value of the number of sample observations in each level of the variable is
at least 5 - Yes, the number of observations in each level is greater than 5.
• Random sampling from the population - Yes, we are informed that the collected
sample is a simple random sample.

Step 5: Calculate the p-value
Perform a chi-squared test for independence and determine the p-value

The p-value is 0.2129888748754345
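
A sketch of a chi-squared test for independence with scipy.stats.chi2_contingency. The contingency table itself is not reproduced in the text of this report, so the counts below are placeholders (an assumption) chosen only to agree with the marginal totals stated earlier: 54 converted and 46 not converted, with 32 English, 34 French and 34 Spanish users. Its output should be read as an illustration of the mechanics.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Placeholder counts (assumption), consistent only with the reported marginal totals.
# Rows: converted = no, yes. Columns: English, French, Spanish.
observed = np.array([
    [11, 19, 16],
    [21, 15, 18],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
# Assumption check: every expected cell count should be at least 5
print("smallest expected count:", round(float(expected.min()), 2))
```

With the real data, the observed table would usually come from pd.crosstab on the converted and language_preferred columns.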

Observations:
The P_value is greater than 0.05.

Step 6: Compare the p-value with α


As the p-value 0.21298887487543447 is greater than the level of significance, we fail to
reject the null hypothesis.

Step 7: Draw Inference

Since the p-value is greater than the 5% level of significance, we fail to reject the null
hypothesis. Hence, we do not have enough statistical evidence to say that the converted
status depends on the preferred language.

Problem 3.5. Is the time spent on the new page the same for the different
language users?

Perform Visual Analysis

Fig 13. Visualization of the time spent on the new page for different language users

Observations:

• Mean time spent by the users who prefer English is highest and lowest for users who prefer
French.

Table 6. value counts for language preferred

Step 1: Define the null and alternate hypotheses

H(0): The mean time spent on the new landing page is the same across all preferred
languages.

H(1): At least one of the mean times spent on the new landing page is different amongst the
preferred languages.

Step 2: Select Appropriate test

This problem concerns three population means. Based on this information, a
one-way ANOVA test would be the most appropriate.

Step 3: Decide the significance level


As given in the problem statement, we select α = 0.05.

Step 4: Collect and prepare data

Shapiro-Wilk’s test

H(0) : Time spent on the new page follows a normal distribution

H(1) : Time spent on the new page does not follow a normal distribution

Level of Significance
We select α= 0.05.
The p-value is 0.8040016293525696

Draw Inference
Since the p-value of the test is much larger than the 5% significance level, we fail to reject the null
hypothesis that the response follows a normal distribution.

Levene’s test

H(0) : All the population variances are equal

H(1) : At least one variance is different from the rest

Level of Significance
We select α= 0.05
Find the p_value
The p-value is 0.46711357711340173

Draw Inference
Since the p-value is larger than the 5% significance level, we fail to reject the null hypothesis
of homogeneity of variances, so the variances can be treated as equal.
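
A sketch of the two assumption checks using scipy.stats.shapiro and scipy.stats.levene. The three groups are synthetic stand-ins with arbitrary sizes and parameters (an assumption); they are not the new-page data, so the p-values will differ from those reported in this section.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumption): time on the new page per preferred language.
groups = {
    "English": rng.normal(6.5, 1.8, size=16),
    "French": rng.normal(6.0, 1.8, size=17),
    "Spanish": rng.normal(6.2, 1.8, size=17),
}

# Shapiro-Wilk: H0 = the pooled response follows a normal distribution
pooled = np.concatenate(list(groups.values()))
shapiro_stat, shapiro_p = stats.shapiro(pooled)

# Levene: H0 = all group variances are equal
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p = {shapiro_p:.4f}")
print(f"Levene p = {levene_p:.4f}")
```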

One-way ANOVA test assumptions:

• The populations are normally distributed - Yes, the normality assumption is verified
using the Shapiro-Wilk’s test.
• Samples are independent simple random samples - Yes, we are informed that the
collected sample is a simple random sample.
• Population variances are equal - Yes, the homogeneity of variance assumption is
verified using the Levene's test.

Step 5: Calculate the p-value

The p-value is 0.43204138694325955


Test_statistics: 0.8543992770006823
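
A sketch of the one-way ANOVA itself with scipy.stats.f_oneway. As in any self-contained illustration, the groups are synthetic stand-ins (an assumption) rather than the report's data, so the statistic and p-value printed here are not the ones shown in this section.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumption): time on the new page per preferred language.
english = rng.normal(6.5, 1.8, size=16)
french = rng.normal(6.0, 1.8, size=17)
spanish = rng.normal(6.2, 1.8, size=17)

# H0: all three group means are equal; H1: at least one mean differs
f_stat, p_value = stats.f_oneway(english, french, spanish)

alpha = 0.05
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```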

Observations:

p_value is greater than 0.05.

Step 6: Compare the p-value with α


As the p-value 0.43204138694325955 is greater than the level of significance, we fail to
reject the null hypothesis.

Step 7: Draw Inference


Since the p-value is greater than the 5% significance level, we fail to reject the null
hypothesis. Hence, we do not have enough statistical evidence to say that the mean time
spent on the new page differs across the three preferred languages.

3.6 Conclusion and Business Recommendations

Conclusion
1. Users spend more time on the new landing page than on the old landing page.
2. Users with converted status "yes" spend more time on the page than users
with converted status "no".
3. Conversion rate for the new page is greater than for the old page.
4. There is no significant evidence that conversion status depends on preferred language.
5. There is no significant difference in the mean time spent on the new page across the
three preferred languages.
6. Though the count of English language users is comparatively low, their conversion
rate is higher than that of the other two language users.

Recommendations:

• E-news Express should fully implement the new landing page, as it appears to
attract much more attention than the old landing page. The greater amount of time
spent on the new landing page compared to the old one is evidence that users prefer
it.
• It might be beneficial to retire the old landing page, as it shows a lower average
time spent and a lower conversion rate. The new landing page has an increased
conversion rate; therefore, more resources should be directed towards it, as it has
greater potential to increase membership.
• Deploy the new landing page, incorporating all the existing preferred languages.
Since there is no significant difference in the average time spent on the new page
across the preferred languages, and conversion status does not depend significantly
on language, the page performs similarly for all three. Perhaps adding more languages
to the portal will help reach a wider audience.
