Unit -1 L2
Descriptive / inferential
• Descriptive statistics are methods that help
researchers organize, summarize, and simplify
the results obtained from research studies.
Statistic / parameter
• A summary value that describes a sample is called a statistic (e.g. M = 25, s = 2); the corresponding summary value that describes a population is called a parameter.
Frequency Distributions
One method of simplifying and organizing a set
of scores is to group them into an organized
display that shows the entire set.
Example (figure)
Histogram & Polygon (figures)
Bar Graphs (figures)
Central tendency
The goal of central tendency is to identify the
value that is most typical or most representative
of the entire group.
Central tendency
• The mean is the arithmetic average.
• The median measures central tendency by
identifying the score that divides the
distribution in half.
• The mode is the most frequently occurring
score in the distribution.
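The three measures above can be computed directly with Python's standard statistics module; the scores below are a hypothetical example, not data from the text:

```python
import statistics

scores = [2, 4, 4, 5, 6, 6, 6, 8, 9, 10]  # hypothetical sample of scores

mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # score dividing the distribution in half
mode = statistics.mode(scores)      # most frequently occurring score

print(mean, median, mode)  # 6.0 6.0 6
```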
Variability
Variability is a measure of the spread of scores
in a distribution.
Example: for a set of n = 10 scores with total ΣX = 60, the mean is 60/10 = 6 and the sum of squared deviations is SS = 70.
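As a sketch, the spread of a set of scores can be computed from the sum of squared deviations (SS); the scores below are hypothetical, chosen so that ΣX = 60 and SS = 70 as in the example figures above:

```python
scores = [2, 3, 3, 5, 6, 6, 7, 9, 9, 10]  # hypothetical scores with total 60

n = len(scores)
mean = sum(scores) / n                     # 60 / 10 = 6.0
ss = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations = 70.0
variance = ss / n                          # population variance = 7.0
std_dev = variance ** 0.5                  # spread of the scores

print(mean, ss, variance)  # 6.0 70.0 7.0
```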
Non-numerical Data
Proportion or percentage in each category.
For example,
• 43% prefer the Democratic candidate,
• 28% prefer the Republican candidate,
• 29% are undecided
Hypothesis testing
• A hypothesis test is a statistical procedure that
uses sample data to evaluate the credibility of
a hypothesis about a population.
5 elements of a hypothesis test
1. The Null Hypothesis
The null hypothesis is a statement about the
population, or populations, being examined, and
always says that there is no effect, no change, or no
relationship.
5 elements of a hypothesis test
3. The Standard Error
Standard error is a measure of the average, or standard, distance between a sample statistic and the corresponding population parameter.
The "standard error of the mean", s_M, refers to the standard deviation of the distribution of sample means taken from a population. For example, a two-sample t statistic divides the difference between sample means by its standard error:
t = (M1 − M2) / s_M
5 elements of a hypothesis test
5. The Alpha Level ( Level of Significance)
The alpha level, or level of significance, for a
hypothesis test is the maximum probability that the
research result was obtained simply by chance.
Errors in Hypothesis Testing
If a researcher is misled by the results from the
sample, it is likely that the researcher will reach
an incorrect conclusion.
Two kinds of errors can be made in hypothesis
testing.
Type I Errors
• A Type I error occurs when a researcher finds evidence
for a significant result when, in fact, there is no effect (
no relationship) in the population.
• The error occurs because the researcher has, by chance, selected an extreme sample that appears to show
the existence of an effect when there is none.
Type II error
• A Type II error occurs when sample data do
not show evidence of a significant effect
when, in fact, a real effect does exist in the
population.
• This often occurs when the effect is so small that it does not show up in the sample.
Types of Errors
• Type I error: the risk is typically restricted to 5%, i.e. α, the level of significance.
• Type II error: the risk (= 1 − power of the test) is controlled via the sample size.
Type I and Type II Errors – Example
Given H0: average life of pacemaker = 300 days, and HA: average life of pacemaker > 300 days:
(a) It is better to make a Type II error (H0 is false, i.e. the average life is actually more than 300 days, but we accept H0 and assume that the average life is equal to 300 days).
(b) As we increase the significance level (α) we increase the chance of making a Type I error. Since here it is better to make a Type II error, we should choose a low α.
Two Tail Test
Two tailed test will reject the null hypothesis if the sample mean
is significantly higher or lower than the hypothesized mean.
Appropriate when H0 : µ = µ0 and HA: µ ≠ µ0
e.g. The manufacturer of light bulbs wants to produce light bulbs
with a mean life of 1000 hours. If the lifetime is shorter he will
lose customers to the competition and if it is longer then he will
incur a high cost of production. He does not want to deviate
significantly from 1000 hours in either direction. Thus he
selects the hypotheses as
H0 : µ = 1000 hours and HA: µ ≠ 1000 hours and uses a
two tail test.
One Tail Test
A one-sided test is a statistical hypothesis test in which the
values for which we can reject the null hypothesis, H0 are
located entirely in one tail of the probability distribution.
Lower tailed test will reject the null hypothesis if the sample
mean is significantly lower than the hypothesized mean.
Appropriate
when H0 : µ = µ0 and HA: µ < µ0
e.g A wholesaler buys light bulbs from the manufacturer in large
lots and decides not to accept a lot unless the mean life is at
least 1000 hours.
H0 : µ = 1000 hours and HA: µ <1000 hours and uses a lower
tail test.
i.e. he rejects H0 only if the mean life of sampled bulbs is significantly below 1000 hours (he accepts HA and rejects H0).
One Tail Test
Upper tailed test will reject the null hypothesis if the sample mean
is significantly higher than the hypothesized mean. Appropriate
when H0 : µ = µ0 and HA: µ > µ0
e.g A highway safety engineer decides to test the load
bearing capacity of a 20 year old bridge. The minimum
load-bearing capacity of the bridge must be at least 10 tons.
H0 : µ = 10 tons and HA: µ >10 tons
and uses an upper tail test.
i.e. he rejects H0 only if the mean load-bearing capacity of the
bridge is significantly higher than 10 tons.
Hypothesis test for population mean
H0 : µ = µ0. Test statistic: t = √n (x̄ − µ0) / s, where x̄ is the sample mean and s is the sample standard deviation.
For HA: µ > µ0, reject H0 if t > t_{n−1,α}
For HA: µ < µ0, reject H0 if t < −t_{n−1,α}
Hypothesis test for population
mean
A weight reducing program that includes a strict diet and exercise claims in its online advertisement that an average overweight person loses 10 pounds in three months. Following the program's method, a group of twelve overweight persons have lost 8.1, 5.7, 11.6, 12.9, 3.8, 5.9, 7.8, 9.1, 7.0, 8.2, 9.3 and 8.0 pounds in three months. Test at 5% level of significance whether the program's advertisement is overstating the reality.
Hypothesis test for population
mean
Solution:
H0: µ = 10 (µ0), HA: µ < 10 (µ0)
n = 12, x̄ = 8.117, s = 2.489, α = 0.05
t = √12 (8.117 − 10) / 2.489 = 3.464 × (−1.883) / 2.489 = −2.62
Critical t-value = −t_{n−1,α} = −t_{11,0.05} = −1.796 (note that Excel's TINV(0.05, 11) = 2.201 is the two-tailed value; the one-tailed 5% critical value is 1.796).
Since t < −t_{n−1,α}, we reject H0 and conclude that the program is overstating the reality.
(What happens if we take α = 0.01? Is the program overstating the reality at 1% significance level?)
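The computation can be verified in Python directly from the twelve listed weight losses, using only the standard library; the computed t ≈ −2.62 lies in the rejection region:

```python
import math
import statistics

# Weight losses (pounds) of the twelve participants from the example
data = [8.1, 5.7, 11.6, 12.9, 3.8, 5.9, 7.8, 9.1, 7.0, 8.2, 9.3, 8.0]

n = len(data)
xbar = statistics.mean(data)  # sample mean
s = statistics.stdev(data)    # sample standard deviation
mu0 = 10                      # hypothesized mean under H0

t = math.sqrt(n) * (xbar - mu0) / s
print(round(t, 2))  # -2.62
```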
Hypothesis test for population proportion
H0 : p = p0. Test statistic: z = √n (p̂ − p0) / √(p0(1 − p0))
For HA: p > p0, reject H0 if z > z_α
For HA: p < p0, reject H0 if z < −z_α
Hypothesis test for population
proportion
A ketchup manufacturer is in the process of deciding whether to
produce an extra spicy brand. The research
department used a national telephone survey of 6000
households and found the extra spicy ketchup would be
purchased by 335 of them. A much more extensive study made
two years ago showed that 5% of the households would
purchase the brand then. At a 2% significance level, should the
company conclude that there is an increased interest in the
extra-spicy flavor?
Hypothesis test for population proportion
n = 6000, p̂ = 335/6000 = 0.05583
H0 : p = 0.05 (p0), HA : p > 0.05
z = √n (p̂ − p0) / √(p0(1 − p0)) = √6000 × 0.00583 / √(0.05 × 0.95) = 77.459 × 0.00583 / 0.218 = 2.072
α = 0.02
z_α (the critical value of z) = 2.05 (NORMSINV)
Since z > z_α, we reject H0, i.e. the current interest is significantly greater than the interest of two years ago.
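A quick check of the z statistic for the ketchup example in Python:

```python
import math

n = 6000
p_hat = 335 / n  # observed proportion, about 0.0558
p0 = 0.05        # proportion found in the study two years ago

# z = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))
z = math.sqrt(n) * (p_hat - p0) / math.sqrt(p0 * (1 - p0))
print(round(z, 2))  # 2.07
```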
Hypothesis test for population standard deviation
H0 : σ = σ0. Test statistic: χ² = (n − 1)s² / σ0²
For HA: σ > σ0, reject H0 if χ² > χ²_{n−1,α}
For HA: σ < σ0, reject H0 if χ² < χ²_{n−1,1−α}
Hypothesis test for comparing two population means
Consider two populations with means µ1, µ2 and standard deviations σ1 and σ2. Let x̄1 and x̄2 be the means of samples of sizes n1 and n2 drawn from the two populations.
H0 : µ1 = µ2. Test statistic: z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
For HA: µ1 > µ2, reject H0 if z > z_α
Hypothesis test for comparing population means
n1 = 32, x̄1 = 3.23, σ1 = 0.51; n2 = 38, x̄2 = 4.36, σ2 = 0.84
H0 : µ1 = µ2, HA : µ1 < µ2
σ_{x̄1−x̄2} = √(σ1²/n1 + σ2²/n2) = √(0.26/32 + 0.71/38) = √0.0267 = 0.163
z = ((x̄1 − x̄2) − (µ1 − µ2)_{H0}) / σ_{x̄1−x̄2} = (−1.13 − 0) / 0.163 = −6.92
α = 0.05; critical value of z = −z_α = −1.64
Since z < −z_α, we reject H0 and conclude that there has been a decline.
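The two-sample z statistic for this example can be checked in a few lines of Python:

```python
import math

n1, xbar1, sigma1 = 32, 3.23, 0.51
n2, xbar2, sigma2 = 38, 4.36, 0.84

# Standard error of the difference between the two sample means
se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = (xbar1 - xbar2) / se
print(round(se, 3), round(z, 2))  # 0.163 -6.92
```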
Hypothesis test for comparing population proportions
Consider two samples of sizes n1 and n2 with p1 and p2 as the respective proportions of successes. Then
p̂ = (n1 p1 + n2 p2) / (n1 + n2)
is the estimated overall proportion of successes in the two populations, and (with q̂ = 1 − p̂)
σ̂_{p1−p2} = √(p̂q̂/n1 + p̂q̂/n2)
is the estimated standard error of the difference between the two proportions.
H0 : p1 = p2. Test statistic: z = ((p1 − p2) − (p1 − p2)_{H0}) / σ̂_{p1−p2}
For HA: p1 > p2, reject H0 if z > z_α
Hypothesis test for comparing
population proportions
A large hotel chain is trying to decide whether to convert more of
its rooms into non-smoking rooms. In a random sample of 400
guests last year, 166 had requested non-smoking rooms. This year
205 guests in a sample of 380 preferred the non-smoking rooms.
Hypothesis test for comparing population proportions
n1 = 400, p1 = 166/400 = 0.415; n2 = 380, p2 = 205/380 = 0.5395
H0 : p1 = p2, HA : p1 < p2
p̂ = (n1 p1 + n2 p2) / (n1 + n2) = (400 × 0.415 + 380 × 0.5395) / (400 + 380) = 0.4757 (proportion of successes in the two populations)
σ̂_{p1−p2} = √(p̂q̂ (1/n1 + 1/n2)) = √(0.4757 × 0.5243 × (1/400 + 1/380)) = 0.0358
α = 0.01; critical value of z = −z_α = −2.32
z = ((p1 − p2) − (p1 − p2)_{H0}) / σ̂_{p1−p2} = (−0.1245 − 0) / 0.0358 = −3.48
Since z < −z_α, we reject H0: the hotel chain should convert more rooms to non-smoking rooms, as there has been a significant increase in the number of guests seeking non-smoking rooms.
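The pooled two-proportion z test for the hotel example can be verified directly from the counts:

```python
import math

n1, x1 = 400, 166  # last year: 166 of 400 guests requested non-smoking rooms
n2, x2 = 380, 205  # this year: 205 of 380 guests

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(round(p_pool, 4), round(z, 2))  # 0.4756 -3.48
```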
Steps to undertaking a Hypothesis test
Hypothesis Testing Flowchart:
1. Define study question
2. Set null and alternative hypothesis
3. Choose a suitable test
4. Calculate a test statistic
5. Calculate a p-value
HYPOTHESIS TESTING
Null Hypothesis
▪ State the hypothesized value of the parameter before sampling.
▪ The assumption we wish to test (or the assumption we are trying to reject).
▪ E.g. population mean µ = 20
▪ There is no difference between coke and diet coke.
Alternative Hypothesis
▪ All possible alternatives other than the null hypothesis.
▪ E.g. µ ≠ 20, µ > 20, µ < 20
▪ There is a difference between coke and diet coke.
Alternative Hypothesis
The alternative hypothesis, HA, is a statement of what a statistical hypothesis test is set up to establish. For example, in the clinical trial of a new drug, the alternative hypothesis might be that the new drug has a different effect, on average, compared to that of the current drug. We would write
HA: the two drugs have different effects, on average, or
HA: the new drug is better than the current drug, on average.
Procedure of Hypothesis Testing
The Hypothesis Testing comprises the following
steps: Step 1
Set up a hypothesis.
Step 2
Set up a suitable significance level.
The confidence with which an experimenter rejects or accepts the null hypothesis depends on the significance level adopted. The level of significance, usually denoted by α, is the rejection region (which lies outside the confidence, or acceptance, region).
Selecting and interpreting
significance level
1. Deciding on a criterion for accepting or rejecting the null hypothesis.
2. Significance level refers to the percentage of sample means that is outside certain prescribed limits. E.g. testing a hypothesis at the 5% level of significance means
▪ that we reject the null hypothesis if it falls in the two regions of area 0.025;
▪ do not reject the null hypothesis if it falls within the region of area 0.95.
3. The higher the level of significance, the higher is the probability of rejecting the null hypothesis when it is true (the acceptance region narrows).
(Figure: the critical values bounding the rejection regions)
If our sample statistic (calculated value) falls in the non-shaded region (acceptance region), then it simply means that there is no evidence to reject the null hypothesis.
After doing the computation, check the sample result. Compare the calculated value (sample result) with the value obtained from the table (tabulated or critical value).
Step 6
Making Decisions
Making a decision means either accepting or rejecting the null hypothesis. If the computed value (in absolute value) is more than the tabulated or critical value, then it falls in the critical region; in that case, reject the null hypothesis, otherwise accept it.
Type I and Type II Errors
When a statistical hypothesis is tested, there are 4 possible
results:
(1) The hypothesis is true but our test accepts it.
(2) The hypothesis is false but our test rejects it.
(3) The hypothesis is true but our test rejects it.
(4) The hypothesis is false but our test accepts it.
Example 1 - Court Room Trial
In court room, a defendant is considered not guilty as
long as his guilt is not proven. The prosecutor tries to
prove the guilt of the defendant. Only when there is
enough incriminating evidence is the defendant condemned.
In the start of the procedure, there are two hypotheses
H0: "the defendant is not guilty", and H1: "the
defendant is guilty". The first one is called null
hypothesis, and the second one is called alternative
(hypothesis).
                          Null Hypothesis (H0) is true    Alternative Hypothesis (H1) is true
                          (He is not guilty)              (He is guilty)
Accept Null Hypothesis    Right decision                  Wrong decision (Type II Error)
Reject Null Hypothesis    Wrong decision (Type I Error)   Right decision
Chi squared Test?
• Null: There is NO association between
class and survival
• Alternative: There IS an association between
class and survival
3×2 contingency table
www.statstutor.ac.uk
What would be expected if the null is true?
• Same proportion of people would have died in each class!
• Overall, 809 people died out of 1309 = 61.8%
Chi-squared test statistic
• The chi-squared test is used when we want to see if
two categorical variables are related
• The test statistic for the Chi-squared test uses the
sum of the squared differences between each pair of
observed (O) and expected values (E)
χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ, summed over the i = 1, …, n categories
The following steps are required to calculate the value of chi-square. Under the hypothesized 1 : 2 : 1 (Red : Pink : White) ratio, the expected frequencies for the 97 observations are:
Red = (1/4) × 97 = 24.25, Pink = (2/4) × 97 = 48.50, White = (1/4) × 97 = 24.25
χ² = 5.06/24.25 + 12.25/48.50 + 1.56/24.25 = 0.53
Conclusion –
The calculated chi-square value (0.53) is less than the tabulated chi-square value (5.99) at the 5% level of probability for 2 d.f. The hypothesis is, therefore, in agreement with the recorded facts.
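The chi-square computation can be sketched in Python. The observed counts below are an assumption, reconstructed to be consistent with the squared deviations shown in the worked figures (5.06, 12.25, 1.56); only the expected values and the final χ² come from the text:

```python
observed = [22, 52, 23]  # assumed counts, consistent with the worked figures
total = sum(observed)    # 97 observations in all

# Expected frequencies under the hypothesized 1:2:1 ratio
expected = [total * r for r in (0.25, 0.5, 0.25)]  # 24.25, 48.5, 24.25

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # 0.53
```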
ANOVA
• If the number of samples is more than two, the Z-test and t-test cannot be used.
• The technique of variance analysis developed by Fisher is very useful in such cases; with its help it is possible to study the significance of the differences of mean values of a large number of samples at the same time.
• ANOVA, or Analysis of Variance, is a statistical method used to
compare the means of three or more samples to determine if
at least one of the sample means significantly differs from the
others.
• ANOVA tests the null hypothesis that all groups have the same
population mean, against the alternative hypothesis that at
least one group is different.
Applications and Benefits of ANOVA towards Data Science
Applications of ANOVA
• A/B Testing: ANOVA is widely used in A/B testing where multiple versions of a web page
or app are compared to determine which one performs better on a given metric.
• Machine Learning Model Evaluation: It can be used to compare the performance of
different models or algorithms by treating the performance metric (e.g., accuracy, F1
score) as the dependent variable.
• Feature Selection: ANOVA can help in identifying which categorical variables (features)
have a significant impact on the target variable, which is crucial in model building and
optimization.
Benefits of ANOVA
• Efficiency: Allows for the simultaneous comparison of more than two groups, which is
more efficient than conducting multiple t-tests.
• Insightful: Provides insights into data by identifying variables that significantly impact
the outcome, aiding in understanding relationships in the data.
• Versatility: ANOVA can be applied to a wide range of data science projects, from
exploratory data analysis to complex experimental designs.
Terminologies of ANOVA
1. Independent Variable (Factor)
The variable that is being manipulated or categorized in an experiment. In ANOVA, it refers to
the groups or treatments being compared. If an ANOVA has one independent variable, it's
called a one-way ANOVA; if it has two, it's called a two-way ANOVA, and so on.
2. Dependent Variable
The outcome or response variable. This is the variable that is measured in the experiment and
is believed to be influenced by the independent variable(s).
3. Levels
The different categories or groups within an independent variable. For example, if the
independent variable is "fertilizer type," the levels would be the specific types of fertilizers
being tested.
Terminologies of ANOVA - Example
9. P-Value
•Example: If the p-value is less than 0.05, it suggests that there is a statistically significant
difference in exam scores among at least some of the study methods.
10. Null Hypothesis (H0)
•Example: The null hypothesis would state that there is no difference in average exam scores
between the four study methods (flashcards, group study, lecture recordings, textbooks).
11. Alternative Hypothesis (H1 or Ha)
•Example: The alternative hypothesis suggests that at least one study method leads to
significantly different average exam scores compared to the others.
Degree of Freedom df
Degrees of freedom refer to the number of independent pieces of information used in the
calculation of a statistic.
Suppose a teacher wants to compare the final exam scores of students who used three
different study methods: Method A, Method B, and Method C. Each method is used by a
different group of students. Here are the group sizes:
•Method A: 5 students
•Method B: 6 students
•Method C: 4 students
With N = 5 + 6 + 4 = 15 students in k = 3 groups, the between-groups degrees of freedom are k − 1 = 2 and the within-groups degrees of freedom are N − k = 15 − 3 = 12. These degrees of freedom are used in the ANOVA test to determine the critical values from the F-distribution.
F-statistic in ANOVA
The F-statistic in ANOVA is calculated to determine whether there are any statistically
significant differences between the means of three or more groups.
It is derived by dividing the variance between the groups by the variance within the groups.
Imagine we have three diet plans (A, B, C) and we're interested in understanding their effect
on weight loss over a month. We have the following weight loss data (in pounds) for each
group:
•Diet A: 2, 4, 3, 5
•Diet B: 3, 5, 4, 6
•Diet C: 5, 7, 6, 8
Step 1: Calculate the mean weight loss for each group and the overall mean
•Mean of Diet A: (2+4+3+5)/4 = 3.5
•Mean of Diet B: (3+5+4+6)/4 = 4.5
•Mean of Diet C: (5+7+6+8)/4 = 6.5
•Overall Mean (Grand Mean): (3.5+4.5+6.5)/3 = 4.833
Step 2: Calculate the Between-Group Variance (SSB: Sum of Squares Between)
This measures how much each group mean deviates from the grand mean, weighted by the
number of observations in each group.
SSB = n(mean of Diet A − grand mean)² + n(mean of Diet B − grand mean)² + n(mean of Diet C − grand mean)²
Where n is the number of observations per group (in this case, 4).
SSB = 4(3.5 − 4.833)² + 4(4.5 − 4.833)² + 4(6.5 − 4.833)² = 18.668
F-statistic in ANOVA
Step 3: Calculate the Within-Group Variance (SSW: Sum of Squares Within)
SSW is the sum of squared deviations of each observation from its group mean. It's calculated for each group and then summed up.
For Diet A: SSW_A = (2−3.5)² + (4−3.5)² + (3−3.5)² + (5−3.5)² = 5
For Diet B: SSW_B = (3−4.5)² + (5−4.5)² + (4−4.5)² + (6−4.5)² = 5
For Diet C: SSW_C = (5−6.5)² + (7−6.5)² + (6−6.5)² + (8−6.5)² = 5
Total SSW = SSW_A + SSW_B + SSW_C = 15
F-statistic in ANOVA
Step 4: Calculate the Mean Square Between (MSB) and Mean Square Within (MSW)
•MSB (Mean Square Between Groups): MSB = SSB/df_between, where df_between = number of groups − 1 = 3 − 1 = 2, so MSB = 18.668/2 = 9.334
•MSW (Mean Square Within Groups): MSW = SSW/df_within, where df_within = total observations − number of groups = 12 − 3 = 9, so MSW = 15/9 = 1.667
Step 5: Calculate the F-statistic
F = MSB/MSW = 9.334/1.667 = 5.6
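The whole one-way ANOVA calculation for the three diet groups can be sketched in pure Python, from the raw data through the F-statistic:

```python
diets = {
    "A": [2, 4, 3, 5],
    "B": [3, 5, 4, 6],
    "C": [5, 7, 6, 8],
}

all_obs = [x for group in diets.values() for x in group]
grand_mean = sum(all_obs) / len(all_obs)

# Between-group and within-group sums of squares
ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in diets.values())
ssw = sum((x - sum(g) / len(g)) ** 2 for g in diets.values() for x in g)

k, n = len(diets), len(all_obs)     # 3 groups, 12 observations
msb = ssb / (k - 1)                 # mean square between, ~9.33
msw = ssw / (n - k)                 # mean square within, ~1.67
f_stat = msb / msw
print(round(f_stat, 2))  # 5.6
```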
Steps of calculating ANOVA
Alternatively, it can be calculated directly from the variance of all observations from the
overall mean.
Types of ANOVA
1. One-Way ANOVA
•Description: Tests the effect of a single factor (independent variable) on a continuous
outcome variable across two or more groups.
•Use Case: Comparing the effectiveness of different diets on weight loss, where the diets are
the only variable being tested.
2. Two-Way ANOVA
•Description: Evaluates the impact of two independent variables on a continuous outcome,
allowing for the examination of interactions between factors.
•Use Case: Examining how diet (Factor A) and exercise regimen (Factor B) together affect
weight loss, including any interaction effects between diet and exercise.
3. Repeated Measures ANOVA
•Description: Used when the same subjects are measured multiple times under different
conditions or over time.
•Use Case: Measuring the cognitive performance of a group of individuals before and after
consuming different types of beverages (e.g., water, coffee, tea) at several intervals.
4. Multivariate ANOVA (MANOVA)
•Description: Extends ANOVA to test for the effect of independent variables on multiple
dependent variables simultaneously.
•Use Case: Investigating the effect of teaching methods on students' performance in different
subjects (e.g., math, science, literature) at once.
Types of ANOVA
5. Mixed-Design ANOVA
•Description: Combines elements of between-subjects (involving different groups of subjects)
and within-subjects (involving the same subjects over time or conditions) designs, suitable for
more complex experiments.
•Use Case: Studying the impact of a training program (between-subjects factor) on skill
improvement over time (within-subjects factor), with some participants receiving training and
others not.
6. ANCOVA (Analysis of Covariance)
•Description: Blends ANOVA and regression, allowing for the examination of the impact of
one or more factors on a dependent variable while statistically controlling for the variability
associated with one or more covariates.
•Use Case: Evaluating the effectiveness of different teaching methods on final exam scores
while controlling for students' initial knowledge levels.
7. Factorial ANOVA
•Description: Designed to explore the effects of two or more independent variables at
multiple levels and their interactions on a dependent variable.
•Use Case: Assessing the impact of various levels of temperature and humidity on the growth
rate of a plant species.
Difference between ANOVA and Hypothesis Testing
•ANOVA is a specialized form of hypothesis testing designed for comparing means across
three or more groups, using the F-statistic to test the null hypothesis of no difference
between group means. Hypothesis testing, on the other hand, is a more general approach
used to determine if sample data is significantly different from what is expected under the
null hypothesis, employing a variety of statistical tests depending on the nature of the data
and the hypothesis.
Key Difference
•Scope: Hypothesis testing is a broad concept encompassing various statistical tests,
including ANOVA, used to make inferences about population parameters. ANOVA is a
specific type of hypothesis test focused on comparing means across multiple groups.
•Application: Hypothesis testing can be applied to a wide range of questions and data
types (e.g., means, proportions, variances), while ANOVA specifically addresses questions
about the differences in means across groups.
•Statistical Test: While hypothesis testing uses various test statistics depending on the test
(e.g., t-statistic for t-tests, chi-square statistic for chi-square tests), ANOVA specifically uses
the F-statistic.
Difference between ANOVA and T-Test
2. Hypothesis Testing
•T-test: Tests the null hypothesis that there is no difference between the two group
means.
•ANOVA: Tests the null hypothesis that all group means are equal. If ANOVA indicates
significant differences, it means that at least one group mean differs from the others, but
it does not specify which groups are different.
3. Statistical Model
•T-test: Compares the means of two groups and considers the variability within those
groups.
•ANOVA: Compares the means across multiple groups by partitioning the total variance
into variance between groups and variance within groups.
Difference between ANOVA and T-Test
6. Multiple Comparisons
•T-test: When multiple t-tests are used to compare more than two groups, it increases the
risk of Type I error (false positive). This is not an efficient or recommended approach for
comparing multiple groups.
•ANOVA: Designed to compare multiple groups in one go, thus controlling for Type I error
rate. However, if ANOVA shows a significant difference, post hoc pairwise comparisons are
needed to determine which specific groups differ.
The t-test is suitable for comparing two groups, while ANOVA is designed for three or
more groups. Using multiple t-tests to perform the job of an ANOVA not only increases the
computational burden but also inflates the chance of making a Type I error.
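The inflation of the Type I error rate can be illustrated numerically. With m independent tests, each at α = 0.05, the familywise error rate is 1 − (1 − α)^m; treating the tests as independent is a simplification, but it shows how quickly the risk grows:

```python
# Probability of at least one false positive across m tests, each at alpha = 0.05
alpha = 0.05
for m in (1, 3, 10):  # e.g. 3 pairwise t-tests are needed to compare 3 groups
    familywise = 1 - (1 - alpha) ** m
    print(m, round(familywise, 3))
# 1 0.05
# 3 0.143
# 10 0.401
```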