
STATISTICAL ANALYSIS
PARAMETRIC STATISTICS
Parametric Statistics
Parametric statistical procedures are
inferential procedures that rely on testing
claims regarding parameters such as the
population mean, the population standard
deviation, or the population proportion.
In some circumstances, the use of parametric
procedures requires that certain requirements
regarding the distribution of the population,
such as normality, be satisfied.
Parametric Statistics
✦ Assume underlying statistical
distributions in the data. Therefore,
several conditions of validity must be met
so that the result of a parametric test is
reliable.
✦ Apply to data in ratio scale, and some
apply to data in interval scale.
Two Common Forms of
Statistical Inference

1. Estimation
2. Hypothesis Testing
Estimating the Value of
a Parameter

In statistics, an estimate is used to
approximate the value of an
unknown population parameter.
Two Types of
Estimation
1. Point estimation (single points that are
used to infer parameters directly).

2. Interval estimation (also called a
confidence interval for the parameter).
Parameter and
Statistic
A parameter is a numerical characteristic
of the population. Any characteristic of a
population is called a parameter.
A statistic is a numerical value that describes a
sample or a number computed from the sample
data.
What Properties make a
Good Point Estimator?
1. It's desirable that the sampling distribution be
centered around the true population parameter.
An estimator with this property is called
unbiased.
2. It's desirable that our chosen estimator have a
small standard error in comparison with other
estimators we might have chosen.
Confidence Interval
A confidence interval provides more information
than a point estimate; it consists of an interval
of numbers.

The level of confidence represents the expected
proportion of intervals that will contain the
parameter if a large number of different samples
is obtained.
The level of confidence is denoted by (1 − α) × 100 %
Confidence Interval
Confidence interval estimates are of the form
point estimate ± margin of error:

Point Estimate ± Margin of Error

The lower bound is Estimate − Margin of Error and the
upper bound is Estimate + Margin of Error, with the
point estimate at the center of the interval.
Margin of Error
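The margin of error of the estimate can be
computed using this formula:

$$E = z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \quad \text{or} \quad E = z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

The quantity in parentheses, $\sigma/\sqrt{n}$ or $s/\sqrt{n}$, is the standard error of the estimate.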


Margin of Error
The margin of error of a confidence interval
estimate of a parameter depends on three
factors:
1. Level of Confidence
2. Sample Size
3. Standard Deviation
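The effect of each factor can be checked numerically. The R sketch below uses a hypothetical helper, margin_of_error(), that is not part of the original slides; it simply evaluates the margin of error formula for a mean with known standard deviation, and the input values are chosen only for illustration.

```r
# Hypothetical helper (not from the slides): margin of error for a mean
# when the population standard deviation sigma is known.
margin_of_error <- function(sigma, n, conf = 0.95) {
  alpha <- 1 - conf
  z <- qnorm(1 - alpha / 2)      # critical value z_{alpha/2}
  z * sigma / sqrt(n)            # z times the standard error
}

base        <- margin_of_error(sigma = 1.2, n = 100, conf = 0.95)
higher_conf <- margin_of_error(sigma = 1.2, n = 100, conf = 0.99)
larger_n    <- margin_of_error(sigma = 1.2, n = 400, conf = 0.95)
larger_sd   <- margin_of_error(sigma = 2.4, n = 100, conf = 0.95)

print(c(base = base, higher_conf = higher_conf,
        larger_n = larger_n, larger_sd = larger_sd))
```

Raising the level of confidence or the standard deviation widens the interval, while quadrupling the sample size halves the margin of error.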
Interpretation of
Confidence Interval

A (1 − α) × 100 % confidence interval indicates
that, if we obtained many simple random
samples of size n from the population whose
mean, μ, is unknown, then approximately (1 − α) × 100 % of
the intervals will contain μ.
Interpretation of
Confidence Interval
In Other Words:
We are (insert level of confidence) confident that
the population mean is between (lower bound)
and (upper bound). This is an abbreviated way of
saying the method is correct (1 − α) × 100 % of
the time.
Interpretation of
Confidence Interval
Example:
If we constructed a 90% confidence interval with a
lower bound of 12 and an upper bound of 18, we would
interpret the intervals as follows:

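“We are 90% confident that the population
mean, μ, is between 12 and 18.”
Remember:

A 95% confidence interval does not mean
that there is a 95% probability that the
interval contains the population mean.
The 95% describes how often the method
produces an interval that captures the mean,
not the chance that one computed interval does.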
Estimating the Value of a
Parameter Using Confidence
Intervals

1. Constructing confidence intervals about a
population mean where the population
standard deviation is known or unknown.
2. Constructing confidence intervals about a
population proportion.
3. Constructing confidence intervals about a
population standard deviation.
Confidence intervals about a
population Mean where the Population
Standard Deviation is Known
Case 1:
σ is Known and n ≥ 30

$$\bar{x} \pm z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)$$

Here x̄ is the point estimator and the term after ± is the margin of error.
Confidence intervals about a
population Mean where the Population
Standard Deviation is Unknown
Case 2:
σ is unknown and n ≥ 30

$$\bar{x} \pm z_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$
Note:

If the sample size is large (n ≥ 30), then the sample
standard deviation can be used to estimate the
population standard deviation.
Confidence intervals about a
population Mean where the Population
Standard Deviation is Unknown
Case 3:
σ is unknown and n < 30

$$\bar{x} \pm t_{\alpha/2}\left(\frac{s}{\sqrt{n}}\right)$$

where $t_{\alpha/2}$ is computed with n − 1 degrees of freedom.


Example 1
How much do Filipinos sleep each night? Based on
a random sample of 1,120 Filipinos 15 years of
age or older, the mean amount of sleep per night
is 8.17 hours according to the Filipino Time Use
Survey conducted by the Bureau of Labor
Statistics. Assuming the population standard
deviation for amount of sleep per night is 1.2
hours, construct and interpret a 95% confidence
interval for the mean amount of sleep per night
of Filipinos 15 years of age or older.
Solution:
Given:
n = 1,120, x̄ = 8.17, σ = 1.2
The z-score for confidence level 95% in the z-table
is 1.96. Apply Case 1.

$$8.17 \pm 1.96\left(\frac{1.2}{\sqrt{1120}}\right) = 8.17 \pm 0.0703 = (8.0997,\ 8.2403)$$

“We are 95% confident that the population
mean is between 8.10 and 8.24.”
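The Case 1 computation for the sleep example can be reproduced in R. This is a minimal sketch using the values given above; qnorm() supplies the critical value in place of the z-table, so the digits beyond the fourth decimal may differ slightly from the hand computation.

```r
# Values from the example: sample size, sample mean, known population sd
n     <- 1120
xbar  <- 8.17
sigma <- 1.2

z  <- qnorm(0.975)               # z_{alpha/2} for a 95% level of confidence
E  <- z * sigma / sqrt(n)        # margin of error
ci <- c(lower = xbar - E, upper = xbar + E)

print(round(E, 4))
print(round(ci, 4))
```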
Example 2
Suppose we would like to estimate the mean
amount of money spent on books by BS Statistics
students in a semester. We have data from 20
randomly selected students. Construct and
interpret a 95% confidence interval.
Solution:
We will apply Case 3, since n <30 and σ is
unknown.

To determine the confidence interval, we
will use RStudio.
The t.test() command is used to find confidence
intervals; by default the level of confidence is 95%.
“We are 95% confident that the
population mean for the amount of
money spent on books is between
Php 132.35 and Php 173.35.”
Example 3
A simple random sample of size n = 40 is
drawn from a population. The sample mean
is found to be 20.1, and the sample
standard deviation is found to be 3.2.
Construct and interpret a 90% confidence
interval about the population mean.
Solution:
Given:
n = 40, x̄ = 20.1, s = 3.2
The z-score for confidence level 90% in the z-table
is 1.645. Apply Case 2.

$$20.1 \pm 1.645\left(\frac{3.2}{\sqrt{40}}\right) = 20.1 \pm 0.8323 = (19.2677,\ 20.9323)$$

“We are 90% confident that the population
mean is between 19.27 and 20.93.”
Example 4
A corporation monitors time spent by office
workers browsing the web on their computers
instead of working. Using a sample of computer
records of 15 workers, construct a 99%
confidence interval for the mean time spent by
selected office workers in browsing the web in an
eight-hour day.
Solution:
We will apply Case 3, since n <30 and σ is
unknown.

To determine the confidence interval, we
will use RStudio.
The t.test() command can also be used to find
confidence intervals with levels of confidence
different from 95%. We can specify the desired
level of confidence using the conf.level argument.
“We are 99% confident that the
population mean time spent by selected
office workers in browsing the web is
between 24.51 minutes and 41.22 minutes.”
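The original data for this example are not reproduced in the text, so the R sketch below uses a hypothetical vector of 15 browsing times purely to show how the conf.level argument of t.test() is used. The numbers are assumptions and will not reproduce the interval quoted above. The manual Case 3 formula is computed alongside for comparison.

```r
# Hypothetical data (an assumption, not the original sample):
# minutes spent browsing by 15 office workers
minutes <- c(28, 35, 41, 22, 30, 38, 45, 27, 33, 29, 36, 40, 25, 31, 34)

# 99% confidence interval from t.test()
result   <- t.test(minutes, conf.level = 0.99)
ci_ttest <- as.numeric(result$conf.int)

# The same interval from the Case 3 formula: xbar +/- t_{alpha/2} * s / sqrt(n)
n  <- length(minutes)
E  <- qt(0.995, df = n - 1) * sd(minutes) / sqrt(n)
ci_manual <- c(mean(minutes) - E, mean(minutes) + E)

print(ci_ttest)
print(ci_manual)
```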
Confidence Intervals About
a Population Proportion

The point estimate for the population
proportion is
$$\hat{p} = \frac{x}{n}$$
where x is the number of individuals in the
sample with the specified characteristic and n is
the sample size.
Confidence Intervals About
a Population Proportion
Suppose a simple random sample of size n is
taken from a population. A (1 − α) × 100 % confidence
interval for p is given by the following
quantities:
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
Note:

It must be the case that $n\hat{p}(1 - \hat{p}) \ge 10$ and $n \le 0.05N$,
where N is the population size, to
construct this interval.
Example 1
In a poll conducted by the Research Center for the
People and the Press, a simple random sample of 1505
Filipino adults was asked whether they were in favor
of tighter enforcement of government rules on TV
content during hours when children are most likely to
be watching. Of the 1,505 adults, 1,129 responded yes.
Obtain a 95% confidence interval for the proportion
of Filipinos who are in favor of tighter enforcement of
government rules on TV content during hours when
children are most likely to be watching.
Solution:
Given:
n = 1,505, x = 1,129
The z-score for confidence level 95% in the z-table
is 1.96.
$$\hat{p} = \frac{x}{n} = \frac{1{,}129}{1{,}505} = 0.7502$$
Check:
(1,505)(0.7502)(1 − 0.7502) ≥ 10
282.0369 ≥ 10
Solution:
$$0.7502 \pm 1.96\sqrt{\frac{0.7502(1 - 0.7502)}{1{,}505}} = 0.7502 \pm 0.0219 = (0.7283,\ 0.7721)$$
“We are 95% confident that the proportion of
Filipinos who are in favor of tighter enforcement of
government rules on TV content during hours when
children are most likely to be watching is between
0.73 and 0.77.”
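The same interval can be computed directly in R. This sketch applies the formula for a proportion to the poll counts given above; because it carries the unrounded value of the sample proportion, the fourth decimal place can differ slightly from the hand computation.

```r
# Poll counts from the example
x <- 1129
n <- 1505

p_hat <- x / n
check <- n * p_hat * (1 - p_hat)           # should be at least 10

z  <- qnorm(0.975)                         # 95% level of confidence
E  <- z * sqrt(p_hat * (1 - p_hat) / n)    # margin of error
ci <- c(lower = p_hat - E, upper = p_hat + E)

print(round(c(p_hat = p_hat, check = check, E = E), 4))
print(round(ci, 2))
```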
Example 2
Suppose a consumer advocacy group would like
to conduct a survey to find the proportion of
consumers who bought the newest generation
of an MP3 player and were happy with their
purchase. The advocacy group took a random
sample of 1000 consumers who recently
purchased this MP3 player and found that 400
were happy with their purchase. Find a 90%
confidence interval for p.
Solution:
Given:
n = 1000, x = 400
$$\hat{p} = \frac{x}{n} = \frac{400}{1000} = 0.40$$
Check:
(1000)(0.40)(1 − 0.40) ≥ 10
240 ≥ 10
The prop.test(x, n, p) command can also be used to
find confidence intervals with levels of
confidence different from 95%. We can specify
the desired level of confidence using the
conf.level argument.
“We are 90% confident that the population
proportion of consumers who bought the newest
generation of an MP3 player and were happy with their
purchase is between 0.37 and 0.43.”
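A minimal sketch of the prop.test() call for this example follows. Note that prop.test() builds its interval with the Wilson score method and a continuity correction by default, so its bounds can differ in the third decimal place from the p̂ ± z formula given earlier; to two decimals they agree with the interpretation above.

```r
# Survey counts from the example
x <- 400
n <- 1000

# 90% confidence interval for the population proportion
result <- prop.test(x, n, conf.level = 0.90)
ci <- as.numeric(result$conf.int)

print(round(ci, 2))
```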
Confidence Intervals About
a Population Variance
If a simple random sample of size n is taken from a
normal population with mean μ and standard
deviation σ, then a (1 − α) × 100 % confidence interval
about σ² is given by

$$\frac{(n - 1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}$$

with n − 1 degrees of freedom.


Remember:
A confidence interval about the population
variance or standard deviation is not of
the form “point estimate ± margin of
error” because the sampling distribution
of the sample variance is not symmetric.
Example 1
A simple random sample of size n = 12 is
drawn from a population that is normally
distributed. The sample variance is found
to be s² = 23.7. Construct a 90% confidence
interval about the population variance.
Solution:
Given:
n = 12, s² = 23.7, CI = 90 %
α = 1 − CI = 1 − 0.90 = 0.10
df = n − 1 = 12 − 1 = 11
$$\chi^2_{\alpha/2,\,df} = \chi^2_{0.10/2,\,11} = \chi^2_{0.05,\,11} = 19.675$$
$$\chi^2_{1-\alpha/2,\,df} = \chi^2_{1-0.10/2,\,11} = \chi^2_{0.95,\,11} = 4.575$$
Solution:
$$\frac{(n - 1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n - 1)s^2}{\chi^2_{1-\alpha/2}}$$
$$\frac{(12 - 1)(23.7)}{19.675} < \sigma^2 < \frac{(12 - 1)(23.7)}{4.575}$$
$$13.2503 < \sigma^2 < 56.9836$$

“We are 90% confident that the population
variance is between 13.25 and 56.98.”
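The chi-square critical values and the interval can also be obtained in R. In this sketch qchisq() takes a left-tail probability, so the table value written with subscript 0.05 (area to the right) corresponds to qchisq(0.95, df). The upper bound may differ from the hand computation in the third decimal place because the table value 4.575 is rounded.

```r
# Values from the example
n     <- 12
s2    <- 23.7       # sample variance
conf  <- 0.90
alpha <- 1 - conf
df    <- n - 1

chi_upper <- qchisq(1 - alpha / 2, df)   # area alpha/2 to the right (about 19.675)
chi_lower <- qchisq(alpha / 2, df)       # area 1 - alpha/2 to the right (about 4.575)

ci <- c(lower = (n - 1) * s2 / chi_upper,
        upper = (n - 1) * s2 / chi_lower)

print(round(c(chi_upper = chi_upper, chi_lower = chi_lower), 3))
print(round(ci, 2))
```

Taking the square root of both bounds gives the corresponding interval for the population standard deviation.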
Example 2
A jar of peanuts is supposed to have 16 ounces of peanuts.
The filling machine inevitably experiences fluctuations in
filling, so a quality-control manager randomly samples 12
jars of peanuts from the storage facility and measures
their contents. She obtained the following data:

Determine the sample standard deviation and construct
a 90% confidence interval for the population standard
deviation of the number of ounces of peanuts.
Exercises
Exercises 1:
Jane wants to estimate the proportion of
students on her campus who eat
cauliflower. After surveying 20 students,
she finds 2 who eat cauliflower. Obtain
and interpret a 95% confidence interval
for the proportion of students who eat
cauliflower on Jane’s campus.
Exercises 2:
Alan wants to estimate the proportion
of adults who walk to work. In a survey
of 10 adults, he finds 1 who walks to
work. Obtain and interpret a 95%
confidence interval for the proportion of
adults who walk to work.
Exercises 3:
The following data represent the pH of rain for a
random sample of 12 rain dates in Sta. Mesa,
Manila. A normal probability plot suggests the
data could come from a population that is
normally distributed.
4.58, 5.19, 5.05, 4.80, 4.77, 4.78,
5.71, 4.76, 5.02, 4.74, 4.75, 4.55
Construct and interpret a 95% confidence interval
for the mean pH of rainwater in Sta. Mesa,
Manila.
Exercises 4:
A Tootsie Pop is a sucker with a candy center. A famous
commercial for Tootsie Pops once asked, “How many licks
to the center of a Tootsie Pop?” In an attempt to answer
this question, Cory Heid of Siena Heights University asked
92 volunteers to count the number of licks required
before reaching the chocolate center. The mean number of
licks required was 356.1 with a standard deviation of
185.7. Find and interpret a 95% confidence interval for
the number of licks required to reach the candy center of
a Tootsie Pop.
Source: Heid, Cory. “Tootsie Pops: How Many Licks to the Chocolate?”
Significance, October, 2013 Volume 10 Issue 5.
Exercises 5:
Investors not only desire a high return on their money,
but they would also like the rate of return to be stable
from year to year. An investment manager invests with
the goal of reducing volatility (year-to-year fluctuations
in the rate of return). The following data represent the
rate of return (in percent) for his mutual fund for the
past 12 years.
13.8, 14.9, 10.0, 12.3, 11.2, 6.7,
9.8, 12.5, 10.4, 8.9, 15.9, 6.6
Construct a 95% confidence interval for the population
standard deviation of the rate of return.
Exercises 6:
Suppose a sample of 30 Stats students
are given an IQ test. If the sample has a
standard deviation of 12.23 points, find
a 90% confidence interval for the
population standard deviation and
interpret the result.
What is HYPOTHESIS
TESTING?

Definition:

Hypothesis testing is a procedure, based on
sample evidence and probability, used to
test claims regarding a characteristic of
one or more populations.
What is HYPOTHESIS?
Definition:
A statement or claim regarding a
characteristic of one or more
populations.
A preconceived idea, assumed to be
true, that has to be tested for its
truth or falsity.
Example of Hypothesis
✦ The mean body temperature for patients
admitted to elective surgery is not equal to
37.0 °C.
✦ A consumer advocate would like to know if the
mean lifetime of a bulb is less than 500 hours.
✦ A real estate broker believes that because of
changes in interest rates, as well as other
economic factors, the mean price has increased
since then.
Procedures for Testing
Hypothesis
1. State the null and alternative hypothesis.
2. Set the level of significance or alpha level (α).
3. Determine the test distribution to use.
4. Determine the critical region.
5. State the decision rule.
6. Calculate a test statistic.
7. Make statistical decision.
1. State the Null and
Alternative Hypothesis

Two Types of Hypothesis

1. Null Hypothesis

2. Alternative Hypothesis
Null Hypothesis
• Denoted by Ho
• The statement being tested.
• Assumed true until evidence indicates
otherwise.
• Must contain the condition of equality
and must be written with the symbol =, ≤ ,
or ≥ .
Example:
✦ Students who eat breakfast and students who do not will
perform the same on a math exam.
✦ Students who experience test anxiety prior to an English
exam and students who do not will get the same scores.
✦ Motorists who talk on the phone while driving and motorists
who do not will make the same number of errors on a
driving course.
Alternative Hypothesis
• Denoted by Ha
• Statement that must be true if the null
hypothesis is false.
• Sometimes referred to as the research
hypothesis.
• Must not contain the condition of equality and
must be written with the symbol ≠, <, or >.
Example:
✦ Students who eat breakfast will perform better
on a math exam than students who do not eat
breakfast.
✦ Students who experience  test anxiety  prior to an
English exam will get higher scores than students
who do not experience test anxiety.
✦ Motorists who talk on the phone while driving will
be more likely to make errors on a driving course
than those who do not talk on the phone.
Remember:
If you are conducting a research study
and you want to use a hypothesis test to
support your claim, the claim must be
stated in such a way that it becomes
the alternative hypothesis, so it
cannot contain the condition of
equality.
Two Types of Alternative Test

1. One - tailed test


✦ Left tailed
✦ Right tailed
2. Two - tailed test
2. Set the Level of Significance
or Alpha Level (α)
Definition:

The level of significance, α, is the
probability of making a Type I error.
Two Types of Error
Example:
Ho : The defendant is innocent.
Ha : The defendant is not innocent.

What happens to the defendant if the
jury makes a Type I or a Type II error?
Answer:
A Type I error is like putting an innocent
person in jail.

A Type II error is like letting a guilty
person go free.
Example:
Type I Error
BFAD allows the release of an ineffective
medicine.

Type II Error
BFAD does not allow the release of an
effective drug.
Remember:
It is important to set α before we start
our study because the Type I error is the
more ‘grievous’ error to make.
The smaller α is, the smaller the region
of rejection.
3. Determine the Test
Distribution to Use

Determine the best statistical test
to be used, based on the objective
and the assumptions that are
satisfied.
List of Common
Parametric Tests
1. One Sample z-Test
2. One Sample t-Test
3. One Sample Proportion Test
4. Independent Sample z-Test
5. Independent Sample t-Test
6. Two Sample Proportion Test
7. Paired Sample t-Test
8. Analysis of Variance (ANOVA) Test
9. Tukey Test (Post Hoc Analysis of ANOVA)
10. Two Way Analysis of Variance
11. Pearson Product Moment Correlation
12. Regression Analysis
4. Determine the Critical
Region
Definition:
The rejection region or critical
region is the set of all values of the
test statistic which will lead to the
rejection of Ho.
The acceptance region is the set of all
values of the test statistic that leads
the researcher to retain Ho.
5. State the Decision Rules

✦ Using confidence interval


✦ Using p-value approach
✦ Using traditional method
Using Confidence Interval

Decision Rule:
Reject the null hypothesis if the hypothesized
value of the parameter is not within the range
specified by the confidence interval.
Using P - Value Approach

Decision Rule:
Reject the null hypothesis if the computed
p-value is less than or equal to the set
significance level α; otherwise do not reject
the null hypothesis.
Example:
If the level of significance α = 0.05,
P-value    Decision
0.01       Reject Ho
0.05       Reject Ho
0.10       Fail to reject Ho
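The p-value decision rule can be written as a one-line comparison. The R sketch below uses a hypothetical helper, decide(), that is not part of the original slides, and applies it to the three p-values in the example.

```r
# Hypothetical helper (not from the slides): p-value decision rule
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value <= alpha, "Reject Ho", "Fail to reject Ho")
}

p_values  <- c(0.01, 0.05, 0.10)
decisions <- decide(p_values, alpha = 0.05)

print(data.frame(p_value = p_values, decision = decisions))
```

The comparison uses "less than or equal to", so a p-value exactly equal to α leads to rejection, as in the second row of the example.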
Using Traditional Method

Decision Rule:

Reject Ho if the computed value of


the test statistic falls in the region
of rejection.
6. Calculate Test Statistic

Once you determine the appropriate
statistical test to be used in step no. 3,
calculate the test statistic. The computed
value of the test statistic is then compared
with the critical value.
Definition:

Test statistic - a statistic computed
from the sample data that is
especially sensitive to the differences
between Ho and Ha.
7. Make Statistical Decision

✦ Fail to reject the null hypothesis/


Retain the null hypothesis.
✦ Reject the null hypothesis.
Remember:
It is important to recognize that we
never accept the null hypothesis. We
are merely saying that the sample
evidence is not strong enough to
warrant rejection of the null
hypothesis.
Normal Distribution
This graph is called the normal curve, which is
a bell-shaped curve that approximately
describes many phenomena that occur in
nature, industry, and research.

Normal Curve
Properties of a Normal Curve
1. The normal curve is bell-shaped and
symmetric about the mean.
2. The mean, median and mode are equal.
3. The total area under the curve is equal to
one.
4. The normal curve approaches, but never
touches, the x-axis as it extends farther and
farther away from the mean.
Testing Normality of the Data
To determine whether the data follow a
normal distribution, we can use the
graphical or numerical method.
Graphical:
Histogram and Normal Q-Q Plot
Numerical:
Kolmogorov Smirnov Test
Lilliefors
Anderson - Darling Test
Shapiro Wilk Test
How to Check Normality?
A histogram plots the observed values against their
frequency and gives a visual indication of whether the
distribution is bell-shaped or not.
How to Check Normality?
Q-Q probability plots display the observed values
against normally distributed data (represented by the
line).
Remember:

Graphical methods are typically not


very useful when the sample size is
small.
Common Statistical Test for
Normality
Test Method          Statistic   n Range         Distribution Based
Kolmogorov-Smirnov   D           n ≥ 3           EDF
Lilliefors           L           n ≥ 4           EDF
Anderson-Darling     A-square    n ≥ 8           EDF
Shapiro-Wilk         W           3 ≤ n ≤ 5000    -

Caveat: Meeting the sample size requirements (n in the
above table) does not guarantee that the test result is
efficient and powerful. Almost all normality test methods
perform poorly for small sample sizes (less than or equal to 30).
Common Statistical Test for
Normality
Kolmogorov Smirnov Test
It was first derived by Kolmogorov (1933) and later modified
and proposed as a test by Smirnov (1948). The test is non-
parametric and entirely agnostic to what this distribution
actually is.

This test has been shown to be less powerful than the other
tests in most situations. It is included only because of its
historical popularity. Some published articles would say “The
Kolmogorov-Smirnov test is only a historical curiosity. It
should never be used."

Tie scores should not be present in the data.


Common Statistical Test for
Normality
Lilliefors Test
Adaptation of the Kolmogorov-Smirnov Test for
the case when the mean and variance of the normal
distribution are unknown.

It is also used as a correction for the Kolmogorov-Smirnov
Test: when the parameters of the CDF are estimated
from the sample, the Kolmogorov-Smirnov test becomes
conservative and loses power.
Common Statistical Test for
Normality
Anderson - Darling Test

It is a modified Kolmogorov-Smirnov test that gives more
weight to the tails of the distribution.

This test, developed by Anderson and Darling
(1954), is popular among the tests that are
based on EDF statistics.
Common Statistical Test for
Normality
Shapiro - Wilk Test
One of the most popular tests for normality
assumption diagnostics. It has good power
properties and is based on the correlation between the given
observations and the associated normal scores.

The Shapiro-Wilk test statistic was derived by Shapiro
and Wilk (1965).

It doesn’t work well if several values in the data set
are the same (tie scores occur in the data).
Hypotheses of Normality Test
Ho: The sample data follow a normal distribution.

Ha: The sample data do not follow a normal
distribution.

When we are testing normality:

• If P value > alpha, we fail to reject Ho and treat the data as
normal.

• If P value ≤ alpha, we reject Ho and treat the data as NOT
normal.
Example:
Construct a graphical and numerical method
in testing the normality of these data.
Diameters of 36 rivet heads in 1/100 of an
inch:
Normal Q - Q Plot
To construct a normal Q-Q plot, use
the commands:
qqnorm(x)
qqline(x)
“x” is a numeric vector of data values.
Histogram
To construct a histogram, use the
commands:
hist(x, probability = TRUE, col = "choose your color")
lines(density(x), col = "choose your color")
“x” is a numeric vector of data values. Replace "choose your color"
with a color name such as "lightblue".
There is a warning message because some of the
data points are the same.
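The rivet data are not reproduced in the text, so the R sketch below runs two of the numerical tests on a simulated stand-in sample. The data, seed, mean and standard deviation are all assumptions made only for illustration. shapiro.test() and ks.test() are in base R; the Lilliefors and Anderson-Darling tests are available as lillie.test() and ad.test() in the add-on nortest package, which is not loaded here.

```r
# Hypothetical stand-in for the 36 rivet-head diameters (simulated, not the real data)
set.seed(1)
x <- rnorm(36, mean = 6.7, sd = 0.05)

# Shapiro-Wilk test
sw <- shapiro.test(x)

# Kolmogorov-Smirnov test against a normal distribution whose mean and sd
# are estimated from the sample (the situation the Lilliefors test corrects for)
ks <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

print(c(shapiro_wilk = sw$p.value, kolmogorov_smirnov = ks$p.value))
```

With real measurements that contain repeated values, ks.test() issues a warning about ties, which is the warning message mentioned above.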
Summary of Result
Test Method          P-value    Decision               Remarks
Kolmogorov-Smirnov   < 0.000    Reject Ho              Not Normal
Lilliefors           0.0571     Failed to Reject Ho    Normal
Anderson-Darling     0.2178     Failed to Reject Ho    Normal
Shapiro-Wilk         0.2804     Failed to Reject Ho    Normal
ONE SAMPLE HYPOTHESIS
TEST
• ONE SAMPLE Z- TEST
• ONE SAMPLE T - TEST
• ONE SAMPLE PROPORTION TEST
Test Concerning the
Population Mean
The one-sample z-test and one-sample
t-test are used to compare the mean of
one sample to a known standard (or
theoretical/hypothetical) mean (μ0).
Assumptions
1. The sample is obtained using simple
random sampling or from a randomized
experiment.
2. The population from which the data is
sampled is normally distributed.
Hypotheses
Two-Tailed Left-Tailed Right-Tailed

Ho : μ = μo Ho : μ = μo Ho : μ = μo
Ha : μ ≠ μo Ha : μ < μo Ha : μ > μo

Note: μ0 is a specified value of the population mean.


One Sample z - Test
Case 1: Testing means of a normal population
with known σ
Test Statistic:
$$z = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}}$$
Rejection Region
Alternative Hypothesis    Rejection Region
Ha : μ > μo               z ≥ zα (Right Tailed Test)
Ha : μ < μo               z ≤ −zα (Left Tailed Test)
Ha : μ ≠ μo               z ≤ −zα/2 or z ≥ zα/2 (Two Tailed Test)
Note:
In a two-tailed test, if the null hypothesis is rejected, the
conclusion is simply that the population mean doesn’t
equal the assumed value. It doesn’t matter if the true
value is likely to be more or less than the assumed
value.
A two-tailed test is one that rejects the null
hypothesis if the sample statistic is significantly
higher or lower than the assumed value of the
population parameter.
In a one-tailed test, there is only one rejection region,
and the null hypothesis is rejected only if the value of
a sample statistic falls into the single rejection region.
One Sample z - Test
Case 2: Large sample tests for means
with unknown σ
If σ is unknown and n > 30, use the z-test but replace
σ by s, that is,
Test Statistic:
$$z = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$$
Rejection Region
Alternative Hypothesis    Rejection Region
Ha : μ > μo               z ≥ zα (Right Tailed Test)
Ha : μ < μo               z ≤ −zα (Left Tailed Test)
Ha : μ ≠ μo               z ≤ −zα/2 or z ≥ zα/2 (Two Tailed Test)
Note:
Tabulated z-values for the common choices of α

α       0.01    0.05    0.10
zα      2.33    1.645   1.28
zα/2    2.576   1.96    1.645
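The tabulated critical values can be regenerated with qnorm(), which returns the z-value with a given area to its left. This is a short sketch; the row names are labels chosen here for readability, and the printed values are rounded to three decimals, so 2.33 and 1.28 in the table appear with one more digit.

```r
alpha <- c(0.01, 0.05, 0.10)

z_table <- rbind(
  z_alpha      = qnorm(1 - alpha),       # one-tailed critical values
  z_alpha_half = qnorm(1 - alpha / 2)    # two-tailed critical values
)
colnames(z_table) <- alpha

print(round(z_table, 3))
```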
One Sample t - Test
Case 3: Small sample tests for means
with unknown σ
If σ is unknown and n < 30, use the t-test and replace
σ by s, that is,
Test Statistic:
$$t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}}$$
Rejection Region
Alternative Hypothesis    Rejection Region
Ha : μ > μo               t ≥ tα,df (Right Tailed Test)
Ha : μ < μo               t ≤ −tα,df (Left Tailed Test)
Ha : μ ≠ μo               t ≤ −tα/2,df or t ≥ tα/2,df (Two Tailed Test)
Note: df = n − 1
Example 1:
Does an average box of cereal
contain more than 368 grams of
cereal? A random sample of 36
boxes showed x̄ = 372.5 and s = 15.
Test at the α = 0.01 level.
Solution:
Step 1:
Ho : μ = 368
Ha : μ > 368
Step 2:
α = 0.01
Step 3:
Since σ is not given, and n is greater than 30,
we will use One-Sample z-Test (Case 2) and
a right-tailed test.
Step 4:
z0.01 = 2.33. The rejection region is z ≥ 2.33.
[Figure: standard normal curve with the rejection region to the right of 2.33]
Step 5: If the test statistic is greater than CV (2.33), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$z = \frac{\bar{x} - \mu_o}{s/\sqrt{n}} = \frac{372.5 - 368}{15/\sqrt{36}} = 1.8$$
Step 7:
Since the test statistic (1.8) is less than CV (2.33), we fail to
reject Ho. Therefore, there is not sufficient sample evidence
to support the claim that the true mean is more than
368.
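Steps 4 to 7 of this example can be checked in R. The sketch below uses the values given in the problem; the p-value line is an addition for comparison with the p-value approach described earlier and is not part of the original solution.

```r
# Values from the example (right-tailed test, Case 2)
xbar  <- 372.5
mu0   <- 368
s     <- 15
n     <- 36
alpha <- 0.01

z       <- (xbar - mu0) / (s / sqrt(n))   # test statistic
cv      <- qnorm(1 - alpha)               # right-tailed critical value
p_value <- 1 - pnorm(z)                   # area to the right of z
reject  <- z >= cv

print(c(z = z, critical_value = cv, p_value = p_value))
print(reject)
```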
Example 2:
Does an average box of cereal
contain more than 368 grams of
cereal? A random sample of 25 boxes
showed x̄ = 372.5. The company has
specified σ to be 15 grams. Test at
the α = 0.05 level.
Solution:
Step 1:
Ho : μ = 368
Ha : μ > 368
Step 2:
α = 0.05
Step 3:
Since σ is given, and n is less than 30, we will
use One-Sample z-Test (Case 1) and a
right-tailed test.
Step 4:
z0.05 = 1.645. The rejection region is z ≥ 1.645.
[Figure: standard normal curve with the rejection region to the right of 1.645]
Step 5: If the test statistic is greater than CV (1.645), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$z = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}} = \frac{372.5 - 368}{15/\sqrt{25}} = 1.5$$
Step 7:
Since the test statistic (1.5) is less than CV (1.645), we fail to
reject Ho. Therefore, there is not sufficient sample evidence
to support the claim that the true mean is more than
368.
Example 3:
Does an average box of cereal
contain less than 368 grams of
cereal? A random sample of 25 boxes
showed x̄ = 372.5 and s = 15. Test at
the α = 0.01 level.
Solution:
Step 1:
Ho : μ = 368
Ha : μ < 368
Step 2:
α = 0.01
Step 3:
Since σ is not given, and n is less than 30, we
will use One-Sample t-Test (Case 3) and a
left-tailed test.
Step 4:
df = 25 − 1 = 24
−t0.01,24 = −2.492. The rejection region is t ≤ −2.492.
[Figure: t curve with the rejection region to the left of −2.492]
Step 5: If the test statistic is less than CV (−2.492), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$t = \frac{\bar{x} - \mu_o}{s/\sqrt{n}} = \frac{372.5 - 368}{15/\sqrt{25}} = 1.5$$
Step 7:
Since the test statistic (1.5) is greater than CV (−2.492), we
fail to reject Ho. Therefore, there is not sufficient sample
evidence to support the claim that the true mean is less
than 368.
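The left-tailed t-test in this example can be reproduced from the summary values. In the sketch below, qt() with a left-tail probability returns the negative critical value directly; the p-value line is an addition for comparison and is not part of the original solution.

```r
# Values from the example (left-tailed test, Case 3)
xbar  <- 372.5
mu0   <- 368
s     <- 15
n     <- 25
alpha <- 0.01
df    <- n - 1

t_stat  <- (xbar - mu0) / (s / sqrt(n))   # test statistic
cv      <- qt(alpha, df)                  # left-tailed critical value (negative)
p_value <- pt(t_stat, df)                 # area to the left of t_stat
reject  <- t_stat <= cv

print(c(t = t_stat, critical_value = cv, p_value = p_value))
print(reject)
```

When the raw observations are available, t.test(x, mu = 368, alternative = "less") performs the same test in a single call.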
Example 4:
Does an average box of cereal
contain 368 grams of cereal? A
random sample of 25 boxes showed
x̄ = 372.5. The company has specified
σ to be 15 grams. Test at the α =
0.05 level.
Solution:
Step 1:
Ho : μ = 368
Ha : μ ≠ 368
Step 2:
α = 0.05
Step 3:
Since σ is given, and n is less than 30, we will
use One-Sample z-Test (Case 1) and a
two-tailed test.
Step 4:
z0.05/2 = z0.025 = ±1.96. The rejection regions are z ≤ −1.96 and z ≥ 1.96.
[Figure: standard normal curve with rejection regions to the left of −1.96 and to the right of 1.96]
Step 5: If the test statistic is less than CV (−1.96) or
greater than CV (1.96), reject the null hypothesis;
otherwise fail to reject the null hypothesis.
Step 6:
$$z = \frac{\bar{x} - \mu_o}{\sigma/\sqrt{n}} = \frac{372.5 - 368}{15/\sqrt{25}} = 1.5$$
Step 7:
Since the test statistic (1.5) is greater than CV (−1.96) and
less than CV (1.96), we fail to reject Ho. Therefore, there is
not sufficient sample evidence to support the claim that
the true mean is not equal to 368.
Connection to
Confidence Intervals
Given: x̄ = 372.5, σ = 15, n = 25
The 95% confidence interval is:

$$372.5 - 1.96\left(\frac{15}{\sqrt{25}}\right) \le \mu \le 372.5 + 1.96\left(\frac{15}{\sqrt{25}}\right)$$

$$366.62 \le \mu \le 378.38$$
If this interval contains the hypothesized mean
(368), we do not reject the null hypothesis. Since the
computed interval contains the hypothesized mean
(368), we fail to reject the null hypothesis.
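The link between the confidence interval and the two-tailed test can be verified in a few lines of R. This is a sketch using the values given above; the variable names are chosen here for illustration.

```r
# Values from the example
xbar  <- 372.5
sigma <- 15
n     <- 25
mu0   <- 368

E  <- qnorm(0.975) * sigma / sqrt(n)      # margin of error at 95% confidence
ci <- c(lower = xbar - E, upper = xbar + E)

# The interval decision: reject Ho only if mu0 falls outside the interval
contains_mu0 <- ci["lower"] <= mu0 && mu0 <= ci["upper"]

print(round(ci, 2))
print(contains_mu0)
```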
Testing a Claim About a
Proportion
We can test a claim about a proportion, percentage,
or probability, as illustrated in these examples:
Based on a sample survey, fewer than ¼ of all
college graduates smoke.
The percentage of physicians leaving the country
is equal to 15%.
If a driver is fatally injured in a car crash, there is
a 0.35 probability that the driver was legally
impaired.
One Sample Proportion
Test
The One-Sample Proportion Test is used to
assess whether a population proportion (P1)
is significantly different from a
hypothesized value (P0). The hypotheses
may be stated in terms of the proportions,
their difference, their ratio, or their odds
ratio, but all four hypotheses result in the
same test statistic.
Assumptions
1. The conditions for a binomial experiment are
satisfied. That is, we have a fixed number of
independent trials having constant
probabilities, and each trial has two outcome
categories, which we classify as “success” and
“failure”.
2. The conditions npo ≥ 5 and n(1 − po) ≥ 5 are
both satisfied, so the binomial distribution of
sample proportions can be approximated by a
normal distribution with $\mu = np$ and
$\sigma = \sqrt{np(1 - p)}$.
Hypotheses
Two-Tailed Left-Tailed Right-Tailed

Ho : p = po Ho : p = po Ho : p = po
Ha : p ≠ po Ha : p < po Ha : p > po

Note: p0 is a specified value of the population proportion.


Rejection Region
Alternative Hypothesis    Rejection Region
Ha : p > po               z ≥ zα (Right Tailed Test)
Ha : p < po               z ≤ −zα (Left Tailed Test)
Ha : p ≠ po               z ≤ −zα/2 or z ≥ zα/2 (Two Tailed Test)
Test Statistic
$$z = \frac{\hat{p} - p_o}{\sqrt{\frac{p_o(1 - p_o)}{n}}}$$
Where:
$$\hat{p} = \frac{x}{n}$$
x = the number of individuals in the sample
with the specified characteristic
n = the sample size
Note:
When conducting a test of a claim about a
population proportion p, be careful to identify
correctly the sample proportion
1. The sample proportion p̂ is sometimes given
directly.
Example: “10% of the observed sports cars are
red.” This is expressed as p̂ = 0.10.
Note:
2. In other cases, we may need to calculate the
sample proportion by using $\hat{p} = x/n$.
Example: “96 surveyed households have cable
TV and 54 do not.” We can first find the sample
size n to be 96 + 54 = 150, then we can
calculate the value of the sample proportion of
households with cable TV as follows:
$$\hat{p} = \frac{x}{n} = \frac{96}{150} = 0.64$$
Example 1:
250 housewives were randomly selected
and asked whether they prefer
purchasing fish from supermarkets or
from wet (public) markets. If 114 of them
preferred supermarkets, is there evidence
at the 5% level of significance to suggest
that the proportion of housewives
throughout the city who prefer
supermarkets exceeds 40%?
Solution:
We first need to check whether npo ≥ 5 and n(1 − po) ≥ 5
to determine if the binomial distribution can be
approximated by the normal distribution.

npo = 250(0.40) = 100 > 5

n(1 − po) = 250(1 − 0.40) = 150 > 5

The assumption is satisfied.


Solution:
Step 1:
Ho : p = 40%
Ha : p > 40%
Step 2:
α = 0.05
Step 3:
Since we are testing a population proportion, we
use the one-sample proportion z-test and a
right-tailed test.
Step 4:
z0.05 = 1.645
(Figure: standard normal curve with the rejection region to the right of 1.645.)
Step 5: If the test statistic is greater than CV(1.645),
reject the null hypothesis; otherwise fail to reject
the null hypothesis.
Step 6:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{\dfrac{114}{250} - 0.40}{\sqrt{\dfrac{0.40(1 - 0.40)}{250}}} = 1.81$$
Step 7:
Since the test statistic (1.81) is greater than CV(1.645), we
reject Ho; therefore there is evidence at the 5% level of
significance that the proportion of housewives throughout
the city who prefer supermarkets exceeds 40%.
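The arithmetic of Example 1 can be checked with a short Python sketch. The function name `one_proportion_z` is hypothetical and introduced only for illustration; the one-sided critical value is taken from the standard library's `NormalDist`, which is an assumption about tooling rather than part of the slides.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """z statistic for Ho: p = p0, given x successes in n trials."""
    p_hat = x / n                             # sample proportion
    standard_error = sqrt(p0 * (1 - p0) / n)  # uses the hypothesized p0
    return (p_hat - p0) / standard_error


# Example 1: 114 of 250 housewives prefer supermarkets, p0 = 0.40
z = one_proportion_z(x=114, n=250, p0=0.40)

# One-sided (right-tail) critical value at alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - 0.05)

print(round(z, 2), round(z_crit, 3))
```

The standard error uses the hypothesized proportion, matching the test statistic defined earlier in this section.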
Exercises 1:
Kate Flower, President of Kate and Edith Cake
Company, says that the mean number of cakes
sold daily is 1,500. An employee wants to test
the accuracy of Kate's claim. A random sample
of 36 days shows that the mean daily sales
were 1,450 cakes. Use a level of significance
of α = 0.01 and assume σ = 120 cakes. What
should the worker conclude?
Solution:
Step 1:
Ho : μ = 1,500
Ha : μ ≠ 1,500
Step 2:
α = 0.01
Step 3:
Since σ is given and n is greater than 30, we
will use the One-Sample z-Test (Case 1) and a
two-tailed test.
Step 4:
z0.01 = ±2.576
(Figure: standard normal curve with rejection regions below −2.576 and above 2.576.)
Step 5: If the test statistic is less than CV(−2.576) or
greater than CV(2.576), reject the null
hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{1450 - 1500}{120 / \sqrt{36}} = -2.5$$
Step 7:
Since the test statistic (−2.5) is greater than CV(−2.576) and
less than CV(2.576), we fail to reject Ho; therefore there
is not sufficient sample evidence to reject Kate’s claim
that cake sales average 1,500 daily.
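A minimal Python sketch of the same one-sample z-test follows. The helper name `one_sample_z` is hypothetical, and the critical value is computed with the standard library's `NormalDist` rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(x_bar, mu0, sigma, n):
    """z statistic for Ho: mu = mu0 when the population SD sigma is known."""
    return (x_bar - mu0) / (sigma / sqrt(n))


alpha = 0.01
z = one_sample_z(x_bar=1450, mu0=1500, sigma=120, n=36)

# Two-tailed test: alpha is split between the two tails
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

# Reject Ho only if z falls in either tail
reject = abs(z) >= z_crit
print(z, round(z_crit, 3), reject)
```

Comparing `abs(z)` with a single positive critical value is equivalent to checking both tails separately.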
Exercises 2:
Juanita Lopez, a production supervisor at a
chemical company, wants to be sure that the
Super-Duper can is filled with an average of 16 oz
of product. If the mean volume is significantly less
than 16 oz, customers will likely complain,
prompting undesirable publicity. The physical size
of the can doesn’t allow a mean volume
significantly above 16 oz. A random sample of 36
cans shows a sample mean of 15.7 oz. Assuming
σ is 0.2 oz, conduct a hypothesis test with α = 0.01.
Solution:
Step 1:
Ho : μ = 16 oz
Ha : μ < 16 oz
Step 2:
α = 0.01
Step 3:
Since σ is given and n is greater than 30, we
will use the One-Sample z-Test (Case 1) and a
left-tailed test.
Step 4:
z0.01 = −2.33
(Figure: standard normal curve with the rejection region to the left of −2.33.)
Step 5: If the test statistic is less than CV(−2.33), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{15.7 - 16}{0.2 / \sqrt{36}} = -9.0$$
Step 7:
Since the test statistic (−9.0) is less than CV(−2.33), Juanita must
reject Ho and rush to correct the filling process. It’s virtually
impossible that a sample selected from a sampling distribution
that has a true mean of 16 oz will have a sample mean located
9.00 standard errors to the left of the true mean.
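To put a number on "virtually impossible", the left-tail probability of the test statistic can be sketched in Python. The helper `left_tail_p` is hypothetical; it uses `math.erfc` because the complementary error function keeps very small tail probabilities from rounding to zero.

```python
from math import erfc, sqrt


def left_tail_p(z):
    """P(Z <= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * erfc(-z / sqrt(2))


# Exercise 2: sample mean 15.7 oz, hypothesized mean 16 oz, sigma 0.2 oz, n = 36
z = (15.7 - 16) / (0.2 / sqrt(36))
p_value = left_tail_p(z)

print(round(z, 2), p_value)
```

The resulting p-value is many orders of magnitude below α = 0.01.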
Exercises 3:
We want to compare fasting serum cholesterol levels
of Filipino women to those of American women.
Assume the cholesterol level of 20 to 39 year old
women in the United States is normally distributed
with μ = 190 mg/dl. Blood tests performed on 19
female Filipinos in this age range rendered a sample
mean cholesterol level of 181.52 mg/dl and a standard
deviation of 40 mg/dl. Conduct a test of hypothesis
to determine whether Filipino women have a lower
average cholesterol level than their American
counterparts. Use α = 0.05.
Solution:
Step 1:
Ho : μ = 190 mg/dl
Ha : μ < 190 mg/dl
Step 2:
α = 0.05
Step 3:
Since σ is not given and n is less than 30, we
will use the One-Sample t-Test (Case 3) and a
left-tailed test.
Step 4:
df = 19 − 1 = 18
t0.05,18 = −1.734
(Figure: t curve with the rejection region to the left of −1.734.)
Step 5: If the test statistic is less than CV(−1.734), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{181.52 - 190}{40 / \sqrt{19}} = -0.9241$$
Step 7:
Since the test statistic (−0.9241) is greater than CV(−1.734), we
fail to reject Ho; therefore we don’t have enough evidence to
conclude that the mean cholesterol level of Filipino women
aged 20 to 39 is lower than 190 mg/dl.
Exercises 4:
In a study of air-bag effectiveness, it was
found that in 821 crashes of midsize cars
equipped with air bags, 46 of the crashes
resulted in hospitalization of the drivers.
Use a 0.01 level of significance to test the
claim that the airbag hospitalization rate
is lower than the 7.8% rate for crashes of
midsize cars equipped with automatic
safety belts.
Solution:
We first need to check whether npo ≥ 5 and n(1 − po) ≥ 5
to determine if the binomial distribution can be
approximated by the normal distribution.

npo = 821(0.078) = 64.038 > 5

n(1 − po) = 821(1 − 0.078) = 756.962 > 5

The assumption is satisfied.


Solution:
Step 1:
Ho : p = 7.8%
Ha : p < 7.8%
Step 2:
α = 0.01
Step 3:
Since we are testing a population proportion, we
use the one-sample proportion z-test and a
left-tailed test.
Step 4:
z0.01 = −2.33
(Figure: standard normal curve with the rejection region to the left of −2.33.)
Step 5: If the test statistic is less than CV(−2.33), reject the
null hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{\dfrac{46}{821} - 0.078}{\sqrt{\dfrac{0.078(1 - 0.078)}{821}}} = -2.35$$
Step 7:
Since the test statistic (−2.35) is less than CV(−2.33), we reject Ho;
therefore the airbag hospitalization rate is lower than the
7.8% rate for crashes of midsize cars equipped with automatic
safety belts.
Exercises 5:
Suppose that a teacher at a school claims that
the average weight of the student population is
greater than 140 lb, and we desire to test the
truth of this claim. We have a random sample of
the weights of 6 students from the student
population. Use a 0.10 level of significance.

Student       1    2    3    4    5    6
Weight (lb)   135  119  106  135  180  108
Solution:
Step 1:
Ho : μ = 140 lbs
Ha : μ > 140 lbs
Step 2:
α = 0.10
Step 3:
Since σ is not given and n is less than 30, we
will use the One-Sample t-Test (Case 3) and a
right-tailed test.
Step 4:
df = 6 − 1 = 5
t0.10,5 = 1.476
(Figure: t curve with the rejection region to the right of 1.476.)
Step 5: If the test statistic is greater than CV(1.476), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
We can solve for the test statistic and p-value of the
One-Sample t-Test using RStudio: TV(−0.852)
and p-value(0.784).
Step 7:
Since the test statistic (−0.852) is less than CV(1.476), we
fail to reject Ho; therefore we don’t have enough evidence
to support the claim of the teacher.
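The test statistic reported from RStudio can also be reproduced by hand. The sketch below uses Python's standard library as an illustration only; the variable names are assumptions, and the critical value is the table value quoted in Step 4 rather than a computed quantity.

```python
from math import sqrt
from statistics import mean, stdev

weights = [135, 119, 106, 135, 180, 108]  # sample from Exercise 5
mu0 = 140                                 # hypothesized mean
T_CRIT = 1.476                            # t(0.10, df = 5), taken from the t table

n = len(weights)
x_bar = mean(weights)   # sample mean
s = stdev(weights)      # sample SD with n - 1 in the denominator

t = (x_bar - mu0) / (s / sqrt(n))
reject = t > T_CRIT     # right-tailed decision rule

print(round(t, 3), reject)
```

Because the sample mean (130.5) is below 140, the statistic is negative and cannot fall in a right-tail rejection region.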
INFERENCE ABOUT TWO
MEANS
• INDEPENDENT SAMPLE Z- TEST
• INDEPENDENT SAMPLE T - TEST
• PAIRED SAMPLE T - TEST
Inference About Two
Means

To perform inference on the difference of
two population means, we must first
determine whether the data come from
an independent or dependent sample.
Distinguish between Independent
and Dependent Sample
A sampling method is independent when
the individuals selected for one sample do
not dictate which individuals are to be in a
second sample.
A sampling method is dependent when
the individuals selected to be in one sample
are used to determine the individuals to be
in the second sample.
Exercises:
Determine whether the sample is independent or
dependent.

1. A researcher wants to know if the mean length
of stay in for-profit hospitals is different from the
mean length of stay in not-for-profit hospitals. He
randomly selected 20 individuals in the for-profit
hospital and matched them with 20 individuals in
the not-for-profit hospital by diagnosis.

2. An urban economist believes that commute
times to work in the South are less than commute
times to work in the Midwest. He randomly
selects 40 employed individuals in the South and
45 employed individuals in the Midwest and
determines their commute times.

3. In an experiment conducted in biology class, Prof.
Rhea measured the time required for 12 students to
catch a falling meter stick using their dominant hand
and nondominant hand. The goal of the study was to
determine whether the reaction time in an
individual’s dominant hand is different from the
reaction time in the nondominant hand.

Answer:
1. Dependent
2. Independent
3. Dependent
Two Independent
Means
Allows researchers to evaluate or to
compare the mean difference between
two populations using the data from
two separate samples.
Used to test whether population
means are significantly different from
each other, using the means from
randomly drawn samples.
Assumptions
1. Your dependent variable should be measured
on a continuous scale (i.e., it is measured at
the interval or ratio level).
2. Your independent variable should consist of
two categorical, independent groups.
3. You should have independence of
observations, which means that there is no
relationship between the observations in each
group or between the groups themselves.
4. There should be no significant outliers.
5. Your dependent variable should be
approximately normally distributed
for each group of the independent
variable.
6. There needs to be homogeneity of
variances.
Hypotheses
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 < 0

H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 > 0

H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 ≠ 0
Independent Sample z -
Test
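Case 1: σ1 = σ2 = σ
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
Case 2: σ1 ≠ σ2
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
where:
$$\sigma_1 = \sqrt{\frac{\sum (x - \bar{x}_1)^2}{N_1}} \qquad \sigma_2 = \sqrt{\frac{\sum (x - \bar{x}_2)^2}{N_2}}$$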
Rejection Region
Alternative Hypothesis     Rejection Region
Ha : μ1 − μ2 > 0           z ≥ zα (Right Tailed Test)
Ha : μ1 − μ2 < 0           z ≤ −zα (Left Tailed Test)
Ha : μ1 − μ2 ≠ 0           z ≤ −zα/2 or z ≥ zα/2 (Two Tailed Test)
Note:
Use the z-distribution to conduct the test
if you have two independent samples
taken from normally distributed
populations and if you know both
population standard deviations, or if both
sample sizes exceed 30.
Independent Sample t -
Test
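Case 1: s1 = s2 = sp
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$
where:
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$
Case 2: s1 ≠ s2
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where:
$$s_1 = \sqrt{\frac{\sum (x - \bar{x}_1)^2}{n_1 - 1}} \qquad s_2 = \sqrt{\frac{\sum (x - \bar{x}_2)^2}{n_2 - 1}}$$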
Rejection Region
Alternative Hypothesis     Rejection Region
Ha : μ1 − μ2 > 0           t ≥ tα,df (Right Tailed Test)
Ha : μ1 − μ2 < 0           t ≤ −tα,df (Left Tailed Test)
Ha : μ1 − μ2 ≠ 0           t ≤ −tα/2,df or t ≥ tα/2,df (Two Tailed Test)
Note: df = n1 + n2 − 2
Note:
If both population standard deviations are
unknown and the sample sizes are small, use the
t-distribution. However, you first need to use the
F-test to determine whether the variances are equal.
If the result of the F-test is to fail to reject the null
hypothesis, then the variances are treated as equal
and you will use the t-distribution Case 1.
If the result of the F-test is to reject the null
hypothesis, then the variances are treated as not
equal and you will use the t-distribution Case 2.
F - Test
For the comparison of two variances or
standard deviations, an F-test is used.
The sampling distribution of the ratio of two
sample variances is called the F-distribution.
Test Statistic:
$$F = \frac{s_1^2}{s_2^2}$$
where:
$s_1^2$ : Variance of the first sample
$s_2^2$ : Variance of the second sample
Note:
The values of F cannot be negative.
The distribution is positively skewed.
The F distribution is a family of curves
based on the degrees of freedom of the
numerator and of the denominator.
Assumptions
1. The populations from which the
sample s were obtained must be
normally distributed.
2. The samples must be independent of
each other.
Example 1:
An agricultural research institute is studying
two new varieties of palay, both of which are
reputedly high-yielding varieties. There are a few
studies which suggest that the difference in the
yield per hectare may be significant. The head of
the institute decides to find out if there is, in fact,
a significant difference in yield. Forty hectares
are planted to variety A and thirty hectares to
variety B. Both varieties are grown under
identical laboratory conditions.
Example 1 (cont.):
At harvest time, the results are:

Indicators                       Variety A     Variety B
Average Yield per Hectare        250 cavans    240 cavans
Population Standard Deviation    20 cavans     15 cavans

At the 1% level of significance, is there a significant
difference in the yield of the two palay varieties?
Solution:
Step 1:
Ho : μA − μB = 0
There is no significant difference in the yield per hectare of
the two varieties of palay.
Ha : μA − μB ≠ 0
There is a significant difference in the yield per hectare of
the two varieties of palay.
Step 2:
α = 0.01
Step 3:
Since both population standard deviations are given and the
samples are large, we will use the Independent-Sample
z-Test (Case 2) and a two-tailed test.
Step 4:
z0.01 = ±2.576
(Figure: standard normal curve with rejection regions below −2.576 and above 2.576.)
Step 5:
If the test statistic is less than CV(−2.576) or
greater than CV(2.576), reject the null
hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
$$z = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{\sigma_A^2}{n_A} + \dfrac{\sigma_B^2}{n_B}}} = \frac{(250 - 240) - 0}{\sqrt{\dfrac{20^2}{40} + \dfrac{15^2}{30}}} = 2.39$$
Step 7:
Since the test statistic (2.39) is greater than
CV(−2.576) and less than CV(2.576), we fail to
reject Ho; therefore there is no significant
difference in the yield per hectare of varieties A
and B. The reputation that both are high-yielding
varieties is shown to be consistent. The studies
suggesting that the difference in yield is
significant are not conclusive.
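The Case 2 statistic above can be sketched in Python as a check on the arithmetic. The function name `two_sample_z` and its `diff0` parameter (the hypothesized difference, zero here) are assumptions made for illustration.

```python
from math import sqrt


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, diff0=0.0):
    """z statistic for Ho: mu1 - mu2 = diff0 with known population SDs."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return ((mean1 - mean2) - diff0) / standard_error


# Example 1: variety A (250, sd 20, n 40) versus variety B (240, sd 15, n 30)
z = two_sample_z(250, 240, 20, 15, 40, 30)

Z_CRIT = 2.576                 # two-tailed critical value at alpha = 0.01
reject = abs(z) >= Z_CRIT

print(round(z, 2), reject)
```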
Example 2:
Suppose we put people on 2 diets: “the fruit diet and
the bread diet”. Participants are randomly assigned
to either 7-days of eating exclusively fruits or 7-
week of exclusively eating bread. At the end of the
day, we measure weight gain by each participant. Does
the bread diet cause more weight gain compared to
the fruit diet? Test the claim using a 10% level of
significance.

Fruit Diet   3  4  4  4  5  6  6
Bread Diet   1  2  2  2  3  4  4
Solution:
Step 1:
Ho : μF − μB = 0
There is no significant difference between the bread and fruit
diets.
Ha : μF − μB < 0
The bread diet causes more weight gain compared to the fruit diet.
Step 2:
α = 0.10
Step 3:
Since σ is not given and n is less than 30, we
will use the Independent-Sample t-Test and a
left-tailed test, but we first need to use the F-test
to determine whether Case 1 or Case 2 applies.
Step 4:
df = 7 + 7 − 2 = 12
t0.10,12 = −1.356
(Figure: t curve with the rejection region to the left of −1.356.)
Before we proceed to the t.test() command, we must first
check whether the variances are homogeneous, using the
var.test() command for the F-test.
We obtained a p-value greater than 0.10, so the two
variances are homogeneous; therefore we will use the
Independent-Sample t-Test (Case 1).
Step 5:
If the test statistic is less than CV(−1.356), reject the
null hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
We can solve for the test statistic and p-value of the
Independent-Sample t-Test using RStudio:
TV(3.300) and p-value(0.997).
Step 7:
Since the test statistic (3.300) is greater than
CV(−1.356), we fail to reject Ho; therefore
there is not enough evidence that the bread diet
causes more weight gain than the fruit diet.
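The pooled-variance (Case 1) statistic for Example 2 can be reproduced from the raw data. This is a sketch in Python using only the standard library; `pooled_t` is a hypothetical helper name, and the computation follows the Case 1 formula given earlier in this section.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Case 1 t statistic (equal variances) and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp_sq = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    standard_error = sqrt(sp_sq * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / standard_error, df


fruit = [3, 4, 4, 4, 5, 6, 6]
bread = [1, 2, 2, 2, 3, 4, 4]

t, df = pooled_t(fruit, bread)   # fruit minus bread, matching muF - muB
print(round(t, 3), df)
```

The statistic is positive because the fruit group gained more on average, which is why it cannot fall in a left-tail rejection region.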
Example 3:
An apartment rental agent tells the personnel
manager of a firm thinking of building a plant in
the agent’s city that the mean rental rates for
two-bedroom apartments are the same in sectors
A and B of the city. To test this claim, the
personnel manager randomly samples
apartment complexes in each sector and
obtains the following data.
Example 3 (cont.):
                     Sector A         Sector B
Sample mean          x̄1 = $595       x̄2 = $580
Sample size          n1 = 10          n2 = 12
Standard deviation   s1 = $62         s2 = $32
Variance             s1² = 3,844      s2² = 1,024

What can the personnel manager conclude
about the agent’s claim at the 0.05 level?
Solution:
Step 1:
Ho : σA = σB
Equal Variances Assumed.
Ha : σA ≠ σB
Equal Variances Not Assumed.
Step 2:
α = 0.05
Step 3:
The F-distribution is used since we are testing
the variances.
Step 4:
df1 = n1 − 1 = 10 − 1 = 9
df2 = n2 − 1 = 12 − 1 = 11
f0.025,9,11 = 3.59 (upper critical value, with α/2 = 0.025 in each tail)
Step 5:
If the test statistic is greater than CV(3.59), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$F = \frac{s_1^2}{s_2^2} = \frac{3844}{1024} = 3.754$$
Step 7:
Since the test statistic (3.754) is greater than
CV(3.59), we reject Ho; therefore the variances are
not equal.
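The variance-ratio step can be sketched in Python as follows. The helper `variance_ratio` is hypothetical, and the critical value 3.59 is the table value quoted in the example; it is not computed here, since Python's standard library has no F distribution.

```python
def variance_ratio(s1_sq, s2_sq, n1, n2):
    """Return F = s1^2 / s2^2 with numerator and denominator degrees of freedom."""
    return s1_sq / s2_sq, n1 - 1, n2 - 1


# Example 3: sector A variance 3,844 (n = 10), sector B variance 1,024 (n = 12)
F, df1, df2 = variance_ratio(3844, 1024, 10, 12)

F_CRIT = 3.59            # table value quoted in the example
reject = F > F_CRIT      # True means the variances are treated as unequal

print(round(F, 3), df1, df2, reject)
```

Placing the larger sample variance in the numerator keeps F at or above 1, so only the upper critical value is needed.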
Solution:
Step 1:
Ho : μA − μB = 0
There is no significant difference in the mean rental
rates of sectors A and B.
Ha : μA − μB ≠ 0
There is a significant difference in the mean rental
rates of sectors A and B.
Step 2:
α = 0.05
Step 3:
Since σ is not given, n is less than 30, and equal
variances are not assumed based on the result
of the F-test, we will use the Independent-
Sample t-Test (Case 2) and a two-tailed
test.
Step 4:
df = 10 + 12 − 2 = 20
t0.05,20 = ±2.086
(Figure: t curve with rejection regions below −2.086 and above 2.086.)
Step 5: If the test statistic is less than CV(−2.086) or
greater than CV(2.086), reject the null
hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
$$t = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} = \frac{(595 - 580) - 0}{\sqrt{\dfrac{3844}{10} + \dfrac{1024}{12}}} = 0.692$$
Step 7:
Since the test statistic (0.692) is greater
than CV(−2.086) and less than CV(2.086),
we fail to reject Ho; therefore there is no
significant difference in the mean rental
rates of sectors A and B.
Example 4:
Consider the case of the two experimental diets
designed to add weight to malnourished third-
world children. The given table presents the weight
gains made by 8 children who were fed diet A and
9 children who received diet B.
Test whether the children on the two diets have the
same population mean weight gain at the 0.05 level.

                Diet A           Diet B
Sample mean     x̄1 = 5.8 lbs    x̄2 = 7.27 lbs
Sample size     n1 = 8           n2 = 9
Variance        s1² = 2.6029     s2² = 0.9800
Solution:
Step 1:
Ho : σA = σB
Equal Variances Assumed.
Ha : σA ≠ σB
Equal Variances Not Assumed.
Step 2:
α = 0.05
Step 3:
The F-distribution is used since we are testing
the variances.
Step 4:
df1 = n1 − 1 = 8 − 1 = 7
df2 = n2 − 1 = 9 − 1 = 8
f0.025,7,8 = 4.53 (upper critical value, with α/2 = 0.025 in each tail)
Step 5:
If the test statistic is greater than CV(4.53), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$F = \frac{s_1^2}{s_2^2} = \frac{2.6029}{0.9800} = 2.656$$
Step 7:
Since the test statistic (2.656) is less than CV(4.53),
we fail to reject Ho; therefore the variances are
treated as equal.
Solution:
Step 1:
Ho : μA − μB = 0
There is no significant difference between the two
diets.
Ha : μA − μB ≠ 0
There is a significant difference between the two diets.
Step 2:
α = 0.05
Step 3:
Since σ is not given, n is less than 30, and
equal variances are assumed based on the result
of the F-test, we will use the Independent-
Sample t-Test (Case 1) and a two-tailed test.
Step 4:
df = 8 + 9 − 2 = 15
t0.05,15 = ±2.131
(Figure: t curve with rejection regions below −2.131 and above 2.131.)
Step 5: If the test statistic is less than CV(−2.131) or
greater than CV(2.131), reject the null
hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(8 - 1)(2.6029) + (9 - 1)(0.9800)}{8 + 9 - 2}} = 1.318$$
$$t = \frac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)}{s_p\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} = \frac{(5.8 - 7.27) - 0}{1.318\sqrt{\dfrac{1}{8} + \dfrac{1}{9}}} = -2.295$$
Step 7:
Since the test statistic (−2.295) is less than
CV(−2.131), we reject Ho; therefore the
population mean weight gain of diet A
isn’t equal to the gain of diet B.
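When only summary statistics are available, as in Example 4, the pooled standard deviation and the t statistic can be computed directly. The following Python sketch assumes a hypothetical helper named `pooled_t_from_summary`; it takes sample variances, not standard deviations.

```python
from math import sqrt


def pooled_t_from_summary(mean1, mean2, var1, var2, n1, n2):
    """Case 1 t statistic from sample means, sample variances, and sizes."""
    df = n1 + n2 - 2
    sp = sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / df)   # pooled SD
    t = (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))
    return t, sp, df


# Example 4: diet A (mean 5.8, variance 2.6029, n 8), diet B (mean 7.27, variance 0.9800, n 9)
t, sp, df = pooled_t_from_summary(5.8, 7.27, 2.6029, 0.9800, 8, 9)

T_CRIT = 2.131                 # t table value for a two-tailed test, alpha = 0.05, df = 15
reject = abs(t) >= T_CRIT

print(round(t, 3), round(sp, 3), df, reject)
```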
Paired Sample t - Test
The dependent sample t-test (also called the
paired t-test or paired-samples t-test) compares
the means of two related groups to determine
whether there is a statistically significant
difference between these means.
Test Statistic:
$$t = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}}$$
Where:
$$\bar{x}_d = \frac{\sum d}{n} \qquad s_d = \sqrt{\frac{\sum (d - \bar{x}_d)^2}{n - 1}}$$
d = the difference between the two paired observations
n = the number of pairs
Assumptions
1. Your dependent variable should be
measured at the interval or ratio level
(i.e., it is continuous).
2. Your independent variable should consist
of two categorical, "related groups" or
"matched pairs".
3. There should be no significant outliers in
the differences between the two related
groups.
4. The distribution of the differences in
the dependent variable between the two
related groups should be approximately
normally distributed.
Hypotheses
H0 : μd = 0 and Ha : μd < 0

H0 : μd = 0 and Ha : μd > 0

H0 : μd = 0 and Ha : μd ≠ 0

Note: μ1 − μ2 = μd
Rejection Region
Alternative Hypothesis     Rejection Region
Ha : μd > 0                t ≥ tα,df (Right Tailed Test)
Ha : μd < 0                t ≤ −tα,df (Left Tailed Test)
Ha : μd ≠ 0                t ≤ −tα/2,df or t ≥ tα/2,df (Two Tailed Test)
Note: df = n − 1
Example 1:
An industrial engineer is evaluating a new technique to
assemble air compressors. If there is a difference in the
number of compressors that can be assembled when the
existing procedure is used and when the new technique
is followed, she will recommend that the company use
the approach that results in the greatest worker
productivity. A sample of 8 employees is selected at
random, and the number of compressors each assembles
in 1 week using the existing procedure
is recorded. The same 8 workers are then trained to use
the new technique, and their output for 1 week is then
noted:
Example 1 (cont.):
Employee   After   Before
A          85      80
B          84      88
C          80      76
D          93      90
E          83      74
F          71      70
G          79      81
H          83      83
Solution:
Step 1:
Ho : μd = 0
The mean difference between before and after production
is zero.
Ha : μd ≠ 0
There is a mean difference between production methods.
Step 2:
α = 0.05
Step 3:
Since there are two groups that are related,
we will use the Paired-Sample t-Test and a
two-tailed test.
Step 4:
df = 8 − 1 = 7
t0.05,7 = ±2.365
(Figure: t curve with rejection regions below −2.365 and above 2.365.)
Step 5: If the test statistic is less than CV(−2.365) or greater
than CV(2.365), reject the null hypothesis;
otherwise fail to reject the null hypothesis.
Step 6:
x1 (After)   x2 (Before)   d     d − x̄d   (d − x̄d)²
85           80            5     3         9
84           88            −4    −6        36
80           76            4     2         4
93           90            3     1         1
83           74            9     7         49
71           70            1     −1        1
79           81            −2    −4        16
83           83            0     −2        4
Sum                        16              120

$$\bar{x}_d = \frac{16}{8} = 2.0 \qquad s_d = \sqrt{\frac{120}{8 - 1}} = 4.1404$$
$$t = \frac{\bar{x}_d - \mu_d}{s_d / \sqrt{n}} = \frac{2.0 - 0}{4.1404 / \sqrt{8}} = 1.366$$
Step 7:
Since the test statistic (1.366) is greater than
CV(−2.365) and less than CV(2.365), we fail to
reject Ho; therefore there is not enough evidence of a
mean difference between the production methods. The
engineer can’t conclude that one assembly method is
better than the other.
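The paired computation reduces to a one-sample t statistic on the differences, which the following Python sketch illustrates. The helper name `paired_t` is hypothetical, and the null value of the mean difference is assumed to be zero.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Paired t statistic for Ho: mu_d = 0, with d = first - second."""
    d = [a - b for a, b in zip(first, second)]   # one difference per pair
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1


# Example 1: weekly output of the same 8 employees
after = [85, 84, 80, 93, 83, 71, 79, 83]
before = [80, 88, 76, 90, 74, 70, 81, 83]

t, df = paired_t(after, before)
print(round(t, 3), df)
```

Swapping the argument order only flips the sign of t, so for a one-tailed test the direction of d must match the alternative hypothesis.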
Example 2:
A researcher is interested in whether a training
course increases the teaching performance of the
teachers who attended the training course. Test
at the 10% level of significance. The data are shown
below:

Case              1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
Before Training   85 84 86 87 89 82 80 84 86 82 89 87 82 81 86 89 89 84 85 88
After Training    95 98 97 92 96 93 94 95 90 82 97 98 95 95 92 91 94 95 96 97
Solution:
Step 1:
Ho : μd = 0
There is no significant difference in the teaching
performance of the teachers before and after training.
Ha : μd < 0
The training course increases the teaching performance of
the teachers who attended the training
(with d = Before − After).
Step 2:
α = 0.10
Step 3:
Since there are two groups that are related,
we will use the Paired-Sample t-Test and a
left-tailed test.
Step 4:
df = 20 − 1 = 19
t0.10,19 = −1.328
(Figure: t curve with the rejection region to the left of −1.328.)
Step 5: If the test statistic is less than CV(−1.328), reject the
null hypothesis; otherwise fail to reject the null
hypothesis.
Step 6:
We can solve for the test statistic and p-value of the
Paired-Sample t-Test using RStudio: TV(−9.697)
and a p-value below 0.001.
Step 7:
Since the test statistic (−9.697) is less than CV(−1.328),
we reject Ho; therefore the training course helps to
increase the teaching performance of the teachers
who attended the training.
Two Sample Proportion
Test
A two-proportion z-test allows you to compare
two proportions to see if they are the same.
It is used when testing a hypothesis about two
population proportions, such as the proportions
of cured patients in a population given some
treatment and in a second population given a
placebo.
Two - Proportion z -
Test
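Test Statistic:
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n_1} + \dfrac{\hat{p}(1 - \hat{p})}{n_2}}}$$
Where:
$$\hat{p}_1 = \frac{x_1}{n_1} \qquad \hat{p}_2 = \frac{x_2}{n_2} \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$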
Assumptions
1. We have two independent sets of
randomly selected sample data.
2. For both samples, the conditions np ≥ 5
and n(1 − p) ≥ 5 are satisfied.
Hypotheses
H0 : p1 − p2 = 0 and Ha : p1 − p2 < 0

H0 : p1 − p2 = 0 and Ha : p1 − p2 > 0

H0 : p1 − p2 = 0 and Ha : p1 − p2 ≠ 0
Rejection Region
Alternative Hypothesis     Rejection Region
Ha : p1 − p2 > 0           z ≥ zα (Right Tailed Test)
Ha : p1 − p2 < 0           z ≤ −zα (Left Tailed Test)
Ha : p1 − p2 ≠ 0           z ≤ −zα/2 or z ≥ zα/2 (Two Tailed Test)
Example 1:
Johns Hopkins researchers conducted a study of
pregnant IBM employees. Among 30 employees
who worked with glycol ethers, 10 (or 33.3%)
had miscarriages, but among 750 who were not
exposed to glycol ethers, 120 (or 16.0%) had
miscarriages. At the 0.01 significance level, test
the claim that the miscarriage rate is greater
for women exposed to glycol ethers.
Solution:
We stipulate that sample 1 is the group that
worked with glycol ethers and sample 2 is the
group not exposed, so the sample statistics
can be summarized as shown here:
Exposed to Glycol Not Exposed to
Ethers Glycol Ethers
n1 = 30 n2 = 750
x1 = 10 x2 = 120
Solution:
We need first to check if np ≥ 5 and n(1-p) ≥ 5 to
determine if binomial distribution can be
approximated by the normal distribution.
n1p1̂ = 30(0.333) = 9.999 > 5
n1(1 − p1̂ ) = 30(1 − 0.333) = 20.001 > 5

n2 p2̂ = 750(0.16) = 120 > 5


n2(1 − p2̂ ) = 750(1 − 0.16) = 630 > 5
The assumption is satisfied.
Solution:
Step 1:
Ho : p1 − p2 = 0
There is no significant difference between the
miscarriage rates of respondents that are exposed
to glycol ethers and those not exposed.
Ha : p1 − p2 > 0
Respondents that are exposed to glycol ethers
have a greater miscarriage rate compared to
respondents that are not exposed.
Step 2:
α = 0.01
Step 3:
Since we are comparing two proportions, we
will use the Two-Proportion z-Test and a
right-tailed test.
Step 4:
z0.01 = 2.33
(Figure: standard normal curve with the rejection region to the right of 2.33.)
Step 5: If the test statistic is greater than CV(2.33), reject
the null hypothesis; otherwise fail to reject the
null hypothesis.
Step 6:
$$\hat{p}_1 = \frac{10}{30} = 0.3333 \qquad \hat{p}_2 = \frac{120}{750} = 0.16 \qquad \hat{p} = \frac{10 + 120}{30 + 750} = 0.1667$$
$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n_1} + \dfrac{\hat{p}(1 - \hat{p})}{n_2}}} = \frac{(0.3333 - 0.16) - 0}{\sqrt{\dfrac{0.1667(1 - 0.1667)}{30} + \dfrac{0.1667(1 - 0.1667)}{750}}} = 2.4973$$
Step 7:
Since the test statistic (2.4973) is greater than CV(2.33),
we reject Ho; therefore we can conclude that the
miscarriage rate is greater for women exposed to
glycol ethers. With this evidence, the Johns Hopkins
researchers concluded that women employees exposed
to glycol ethers “have a significantly increased risk of
miscarriage.” On the basis of these results, IBM
warned its employees of the danger, notified the
Environmental Protection Agency, and greatly reduced
its use of glycol ethers.
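The pooled two-proportion statistic can be sketched in Python as below. The function name `two_proportion_z` is hypothetical. Because the code keeps full precision instead of rounding the proportions to four decimals, its result differs from the hand computation in the third decimal place.

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """z statistic for Ho: p1 - p2 = 0 using the pooled sample proportion."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)   # pooled estimate under Ho
    standard_error = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / standard_error


# Example 1: 10 of 30 exposed, 120 of 750 not exposed
z = two_proportion_z(10, 30, 120, 750)

Z_CRIT = 2.33            # right-tailed critical value at alpha = 0.01
reject = z > Z_CRIT

print(round(z, 3), reject)
```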
Exercises
Exercises 1:
The production manager of a fruit canning
factory begins to suspect, as a result of
observing the machine operators, that the 16 oz. cans
of fruit may be filled slightly beyond the
required weight. He takes a random sample of 80
packed cans and finds that the mean weight is
16.08 oz. with a standard deviation of 0.04 oz.
At the 1% level of significance, can the production
manager conclude that the fruit cans were being
overfilled?
Exercises 2:
An insurance executive asserts that the mean
amount paid by his firm for personal injury
resulting from personal accidents is P18,500. An
actuary wants to check the accuracy of this
assertion and is allowed to sample randomly 36
cases involving personal injury. The sample mean
is P19,415. Assuming that σ = P2,600, test the
executive's belief at a level of significance of
0.05.
Exercises 3:
The manager of the Granite Rock Company
believes that the average truckload delivered
weighs 4,500 lbs. A stockholder, Chip Stone,
argues that this is an inflated figure to lure new
investors. Mr. Stone randomly samples the
records of 25 loads and finds the mean load to be
4,460 lbs with a standard deviation (s) of 250 lbs.
Can Mr. Stone reject the manager’s claim using
a significance level of 0.05?
Exercises 4:
A poultry raiser harvests an average of 300 eggs
per day. He has recently experimented with
different types of poultry feeds. As a result, he
noticed some fluctuations in the number of eggs laid
by the chickens, which is neither clearly higher nor
lower than in previous weeks. He decides to find out if
there might be a significant change in the number of
eggs laid by the chickens. He records his harvest of
eggs for 20 days. He finds that the average per day
is 290 eggs with a standard deviation of 15. At the 5%
level of significance, what did the poultry raiser
find out?
Exercises 5:
An experimental diet was followed by a random
sample of 6 people. The cholesterol level for each
was measured before and after the diet as follows:

Before:   174  160  151  121  275  118
After:    196  212  254  207  221  223

Test the hypothesis at the 0.01 level that there is a
significant decrease in the population cholesterol level
after the diet.
LINEAR CORRELATION
AND REGRESSION
ANALYSIS
Correlation Analysis
Used to measure the degree of
relationship between two variables x and
y by means of a single number called the
correlation coefficient.
Only concerned with the strength of the
relationship.
No causal effect is implied.
Note:
The value of the correlation coefficient,
denoted by the symbol “r”, ranges from -1 to
1.
The correlation between the variables may
show either a direct or an inverse
relationship.
Sample of Observations from Various
r Values
(Figure: five scatter plots of Y against X, for r = -1, r = -.6, r = 0, r = .6, and r = 1.)
Note:
Features of r
Unit free
Range bet ween -1 and 1
The closer to -1, the stronger the negative
linear relationship.
The closer to 1, the stronger the positive
linear relationship.
The closer to 0, the weaker the linear
relationship.
Caveats
A correlation of 70% does not mean
that 70% of the points are clustered
around a line. Nor should we claim here
that we have t wice as much linear
association with a set of points, which
has a correlation of 35%.
Correlation does not imply causation.
Caveats
The presence of outliers easily affects
the correlation of a set of data.
• In some situations, we ought to remove
these outliers from the data set and
redo the correlation analysis.
• In other cases, these outliers ought not to
be removed, as there will always be some
points detached from the rest of the
data.
Pearson Product Moment
Correlation Coefficient
Commonly called the Pearson r.
It measures the linear relationship between two
variables.
The level of measurement of the data for the two
variables is either interval or ratio scale.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
where:
x = the observed data for the independent variable
y = the observed data for the dependent variable
n = number of paired observations
Pearson Product Moment
Correlation Coefficient
Test Statistic:
$$t = r\sqrt{\frac{df}{1 - r^2}}$$
where:
df = degrees of freedom
r = Pearson correlation coefficient
Note:
df = n − 2
Qualitative Interpretation
Note:
If r is negative, this means that for every
increase in one variable, there is a
corresponding decrease in the second variable,
or that there is an inverse relationship
between variables x and y.
If r is positive, this means that for every
increase in one variable, there is a
corresponding increase in the second variable,
or that there is a direct relationship between
variables x and y.
Hypotheses
Ho : ρ = 0
There is no significant relationship
between the two variables.
Ha : ρ ≠ 0
There is a significant relationship
between the two variables.
Example 1:
The Rip-off Vending Machine Company operates coffee vending
machines in office buildings. The company wants to study the
relationship, if any, between the number of cups sold per day
and the number of persons working in each building. Sample
data for the study were collected by the company and are
presented below. Test the significance of the relationship at
the 0.05 level.

No. of Persons Working at Location    No. of Cups of Coffee Sold
 5                                    10
 6                                    20
14                                    30
19                                    40
15                                    30
11                                    20
18                                    40
22                                    40
26                                    50
Solution:
Step 1:
Ho : ρ = 0
There is no significant relationship between the number of
cups sold per day and the number of persons working in
each building.
Ha : ρ ≠ 0
There is a significant relationship between the number of
cups sold per day and the number of persons working in
each building.
Step 2:
α = 0.05
Solution:
Step 3:
Since we are testing the significance of the relationship
between two variables, we will use Pearson r.
Step 4: df = 9 − 2 = 7 t0.05,7 = ±2.365 (two-tailed)
Step 5:
If the test statistic is less than −2.365 or greater than
2.365, reject the null hypothesis; otherwise, fail to reject
the null hypothesis. The rejection regions lie in the two
tails, below −2.365 and above 2.365.
Solution:
Step 6:
         x     y   x squared   y squared      xy
         5    10          25         100      50
         6    20          36         400     120
        14    30         196         900     420
        19    40         361       1,600     760
        15    30         225         900     450
        11    20         121         400     220
        18    40         324       1,600     720
        22    40         484       1,600     880
        26    50         676       2,500   1,300
Sum:   136   280       2,448      10,000   4,920

∑x = 136, ∑y = 280, ∑x² = 2,448, ∑y² = 10,000, ∑xy = 4,920
Solution:
$$r = \frac{9(4920) - (136)(280)}{\sqrt{\left[9(2448) - (136)^2\right]\left[9(10000) - (280)^2\right]}} = 0.9681$$
Strong Positive Correlation
$$t = 0.9681\sqrt{\frac{9 - 2}{1 - (0.9681)^2}} = 10.222$$
Solution:
Step 7:
Since the test statistic (10.222) is greater than the
critical value (2.365), we reject Ho. Therefore, there is a
significant relationship between the number of
cups sold per day and the number of persons
working in each building.
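The arithmetic in Steps 6 and 7 can be checked with a short sketch. It assumes plain Python with no statistics library; the critical value 2.365 is copied from Step 4 rather than computed. Because the code keeps r unrounded, its t value differs slightly in the third decimal from the 10.222 obtained above with r rounded to 0.9681.

```python
from math import sqrt

persons = [5, 6, 14, 19, 15, 11, 18, 22, 26]  # x: persons working at location
cups = [10, 20, 30, 40, 30, 20, 40, 40, 50]   # y: cups of coffee sold

n = len(persons)
sum_x, sum_y = sum(persons), sum(cups)
sum_xy = sum(x * y for x, y in zip(persons, cups))
sum_x2 = sum(x * x for x in persons)
sum_y2 = sum(y * y for y in cups)

# Pearson r from the computational formula.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Test statistic with df = n - 2.
df = n - 2
t = r * sqrt(df / (1 - r ** 2))

CRITICAL_VALUE = 2.365  # two-tailed, alpha = 0.05, df = 7 (taken from Step 4)
reject_h0 = abs(t) > CRITICAL_VALUE

print(f"r = {r:.4f}, t = {t:.3f}, reject Ho: {reject_h0}")
```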
Example 2:
You want to examine the correlation between the annual
sales of produce stores and their size in square footage.
Sample data for seven stores were obtained.

Square Feet    Annual Sales ($1000)
1,726          3,681
1,542          3,395
2,816          6,653
5,555          9,543
1,292          3,318
2,208          5,563
1,313          3,760
Solution:
Step 1:
Ho : ρ = 0
There is no significant relationship between the annual
sales of produce stores and their size in square footage.
Ha : ρ ≠ 0
There is a significant relationship between the annual sales
of produce stores and their size in square footage.
Step 2:
α = 0.05
Solution:
Step 3:
Since we are testing the significance of the relationship
between two variables, we will use Pearson r.
Step 4: df = 7 − 2 = 5 t0.05,5 = ±2.571 (two-tailed)
Step 5:
If the test statistic is less than −2.571 or greater than
2.571, reject the null hypothesis; otherwise, fail to reject
the null hypothesis. The rejection regions lie in the two
tails, below −2.571 and above 2.571.
Solution:
Step 6:
We can obtain the test statistic and p-value for
Pearson r using RStudio: test statistic t = 9.010,
p-value = 0.0003.
Step 7:
Since the test statistic (9.010) is greater than the
critical value (2.571), we reject Ho. Therefore, there is a
significant relationship between the annual sales
of produce stores and their size in square footage.
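The slides obtain these values from RStudio. As a rough cross-check, the sketch below recomputes r and t in plain Python using sums of squares about the means. It is not the RStudio call itself, and it does not compute the p-value, which needs the t distribution.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

# Sums of squares and cross-products about the means.
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales))
ss_xx = sum((x - mean_x) ** 2 for x in square_feet)
ss_yy = sum((y - mean_y) ** 2 for y in annual_sales)

r = ss_xy / sqrt(ss_xx * ss_yy)
t = r * sqrt((n - 2) / (1 - r ** 2))  # df = n - 2 = 5

print(f"r = {r:.4f}, t = {t:.3f}")
```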
Regression Analysis
Regression analysis is used primarily to
model causality and provide prediction.
Predicts the value of a dependent (response)
variable based on the value of at least one
independent (explanatory) variable.
Explains the effect of the independent
variables on the dependent variable
Types of Regression Models
Simple Linear
Regression
Relationship between variables is
described by a linear function.
The change of one variable causes the
change in the other variable.
A dependency of one variable on the
other.
Population Linear Regression
Population regression line is a straight line that
describes the dependence of the average value of
one variable on the other.
Population Linear Regression
Sample Linear Regression
Sample regression line provides an estimate of the
population regression line as well as a predicted
value of Y.
Note:
b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the
sum of the squared residuals:
$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2$$
b0 provides an estimate of β0.
b1 provides an estimate of β1.
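The least-squares idea can be sketched numerically. The example below reuses the vending-machine data from the correlation example, with cups sold regressed on persons working, which is an illustrative choice not made in the slides. It shows that moving b0 or b1 away from the least-squares values increases the sum of squared residuals. The function name sum_squared_residuals and the perturbation sizes are assumptions.

```python
persons = [5, 6, 14, 19, 15, 11, 18, 22, 26]  # x
cups = [10, 20, 30, 40, 30, 20, 40, 40, 50]   # y


def sum_squared_residuals(b0, b1):
    """Sum of (y_i - y_hat_i)^2 for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(persons, cups))


n = len(persons)
mean_x = sum(persons) / n
mean_y = sum(cups) / n

# Least-squares estimates.
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(persons, cups)) / sum(
    (x - mean_x) ** 2 for x in persons
)
b0 = mean_y - b1 * mean_x

best = sum_squared_residuals(b0, b1)

# Nudge each coefficient up and down and record the resulting sums.
perturbed = [
    sum_squared_residuals(b0 + db0, b1 + db1)
    for db0, db1 in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.1), (0.0, -0.1)]
]

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, minimum sum = {best:.4f}")
print("sums after perturbation:", [round(s, 4) for s in perturbed])
```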
Interpretation of the
Slope and the Intercept
$b_0 = \hat{E}(Y \mid X = 0)$ is the estimated
average value of Y when the value of X
is zero.
$b_1 = \frac{\Delta \hat{E}(Y \mid X)}{\Delta X}$ is the estimated change in
the average value of Y as a result of a
one-unit change in X.
Note:
When b1>0, Y increases as X increases. In this
case, we say that Y is directly or positively
related to X.
When b1<0, Y decreases as X increases, and we
say that Y is inversely or negatively related to X.
When b1=0, Y is a constant and is equal to the
y-intercept b0. This implies that Y does not change
whatever the value of X is, so variables x and y
have no linear relationship.
Example:
Examine the linear dependency of the annual sales of
produce stores on their size in square footage. Find the
equation of the straight line that fits the data best.

Square Feet    Annual Sales ($1000)
1,726          3,681
1,542          3,395
2,816          6,653
5,555          9,543
1,292          3,318
2,208          5,563
1,313          3,760
Solution:
From RStudio Printout:
Equation of the straight line
ŷ = 1636.415 + 1.487(xi)
Solution:
To examine the linear dependency of the annual sales of
produce stores on their size in square footage, we will use
the regression scatter plot.
ŷ = 1636.415 + 1.487(xi)
Solution:
ŷ = 1636.415 + 1.487(xi)
The slope of 1.487 means that for each
increase of one unit in X, we predict the
average of Y to increase by an estimated
1.487 units.
Because annual sales are recorded in $1000, the model
estimates that for each increase of one square foot in
the size of the store, the expected annual sales are
predicted to increase by $1,487.
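The fitted line reported from RStudio can be reproduced, as a hedged cross-check, with the least-squares formulas in plain Python. The variable names are illustrative, and the conversion to dollars assumes the $1000 unit stated in the data table.

```python
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

# Slope and intercept of the least-squares line.
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales)) / sum(
    (x - mean_x) ** 2 for x in square_feet
)
b0 = mean_y - b1 * mean_x

# Sales are in $1000, so the slope in dollars per square foot is b1 * 1000.
slope_in_dollars = b1 * 1000

print(f"y-hat = {b0:.3f} + {b1:.3f} x")
print(f"Each extra square foot adds about ${slope_in_dollars:,.0f} in annual sales")
```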
Inference About the
Slope: t-Test
t - test for a population slope
Is there a linear dependency of Y on X ?
Null and Alternative Hypothesis
Ho : β1 = 0 (No linear dependency)
Ha : β1 ≠ 0 (Linear dependency)
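Test Statistic:
$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$
Where:
$$s_{b_1} = \frac{s_{xy}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
(here s_xy denotes the standard error of the estimate, and df = n − 2)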
Example:
Given the following information, determine whether the
square footage of a store affects its annual sales.

Square Feet    Annual Sales ($1000)
1,726          3,681
1,542          3,395
2,816          6,653
5,555          9,543
1,292          3,318
2,208          5,563
1,313          3,760
Solution:
Inference about the slope: Ho : β1 = 0 Ha : β1 ≠ 0
Since the p-value (0.0003) is less than the level of
significance 0.05, we reject the null hypothesis.
Therefore, there is a linear dependency of the annual
sales of produce stores on their size in square footage,
and there is evidence that square footage affects
annual sales.
Solution:
α = 0.05
df = 7 − 2 = 5
t0.05,5 = ±2.571 (two-tailed)
Decision: Reject Ho. The test statistic falls in the
rejection region beyond the critical values −2.571 and 2.571.
Confidence Interval of
the Slope
To calculate the confidence interval:
$$b_1 \pm t_{\alpha/2,\,df}\,(s_{b_1})$$
Note: df = n − 2
At the 95% level of confidence, the confidence interval for the
slope is (1.062, 1.911). The interval does not include 0.
Conclusion: There is a significant linear dependency of annual
sales on the size of the store.
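The interval (1.062, 1.911) can be reproduced with a short sketch, assuming the produce-store data used throughout this section. The critical value 2.571 is taken from the t table as in the slides rather than computed, and the standard error of the slope is the standard error of the estimate divided by the square root of the sum of squared deviations of x.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

ss_xx = sum((x - mean_x) ** 2 for x in square_feet)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales)) / ss_xx
b0 = mean_y - b1 * mean_x

# Standard error of the estimate, then standard error of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))
s_estimate = sqrt(sse / (n - 2))
s_b1 = s_estimate / sqrt(ss_xx)

T_CRITICAL = 2.571  # t for alpha/2 = 0.025 and df = 5 (from the t table)
lower = b1 - T_CRITICAL * s_b1
upper = b1 + T_CRITICAL * s_b1

print(f"s_b1 = {s_b1:.4f}")
print(f"95% confidence interval for the slope: ({lower:.3f}, {upper:.3f})")
```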
Inference About the
Slope: F-Test
F - test for a population slope
Is there a linear dependency of Y on X ?
Null and Alternative Hypothesis
Ho : β1 = 0 (No linear dependency)
Ha : β1 ≠ 0 (Linear dependency)
Test Statistic:
$$F = \frac{SSR / 1}{SSE / (n - 2)}$$
Note: df1 = 1, df2 = n − 2
Critical Value: Fα,df1,df2
Relationship between a
t Test and an F Test
Null and Alternative Hypothesis
Ho : β1 = 0
Ha : β1 ≠ 0
$$(t_{n-2})^2 = F_{1,\,n-2}$$
Example:
Using the same produce-store data (square feet and annual
sales in $1000 for seven stores), determine whether the
square footage of a store affects its annual sales.
Solution:
Inference about the slope: Ho : β1 = 0 Ha : β1 ≠ 0
Since the p-value (0.0003) is less than the level of
significance 0.05, we reject the null hypothesis. Therefore,
there is a linear dependency of the annual sales of produce
stores on their size in square footage, and there is evidence
that square footage affects annual sales.
Solution:
α = 0.05
df1 = 1
df2 = 7 − 2 = 5
F0.05,1,5 = 6.61
Decision: Reject Ho
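The link between the t test and the F test for the slope can be checked numerically. This sketch, which assumes the same produce-store data, computes F from SSR and SSE and compares it with the square of the slope's t statistic. The critical value 6.61 is copied from the slide.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $1000

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(annual_sales) / n

ss_xx = sum((x - mean_x) ** 2 for x in square_feet)
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, annual_sales)) / ss_xx
b0 = mean_y - b1 * mean_x

# Regression and error sums of squares.
ssr = sum((b0 + b1 * x - mean_y) ** 2 for x in square_feet)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, annual_sales))

f_stat = (ssr / 1) / (sse / (n - 2))

# t statistic for Ho: beta1 = 0.
s_b1 = sqrt(sse / (n - 2)) / sqrt(ss_xx)
t_stat = b1 / s_b1

F_CRITICAL = 6.61  # F for alpha = 0.05, df1 = 1, df2 = 5 (from the F table)
print(f"F = {f_stat:.3f}, t^2 = {t_stat ** 2:.3f}, reject Ho: {f_stat > F_CRITICAL}")
```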
A simple technique for prediction is linear
regression analysis, which uses an equation of
the form
ŷ = b0 + b1x
Where:
ŷ = Predicted Value
b0 = The y - intercept
b1 = The slope of the line
x = A given value of the independent variable
Calculation of the
Regression Equation
$$b_1 = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$
$$b_0 = \bar{y} - b_1\bar{x}$$
Where:
n = number of paired observations
ȳ = the mean of the y variable
x̄ = the mean of the x variable
Example:
Find an equation that describes the relationship between
the output of a sample of Tackey Toy employees and their
aptitude test scores.

Employee    Aptitude Test (x)    Output (y)
A            6                   30
B            9                   49
C            3                   18
D            8                   42
E            7                   39
F            5                   25
G            8                   41
H           10                   52
Solution:
Employee   Aptitude Test (x)   Output (y)     xy   x squared   y squared
A                  6               30        180        36         900
B                  9               49        441        81       2,401
C                  3               18         54         9         324
D                  8               42        336        64       1,764
E                  7               39        273        49       1,521
F                  5               25        125        25         625
G                  8               41        328        64       1,681
H                 10               52        520       100       2,704
Total             56              296      2,257       428      11,920
Solution:
$$\bar{x} = \frac{56}{8} = 7 \qquad \bar{y} = \frac{296}{8} = 37$$
$$b_1 = \frac{8(2257) - (56)(296)}{8(428) - (56)^2} = 5.1389$$
$$b_0 = 37 - (5.1389)(7) = 1.0277$$
Therefore, the regression equation that describes the
relationship between the output of the sample of Tackey Toy
employees and their aptitude test scores is
ŷ = 1.0277 + 5.1389(xi)
Suppose, for example, that the unfortunate Hiram
Ramos, personnel manager for Tackey Toy, is
considering hiring an applicant who scored a 4 on the
aptitude test. The supervisor of the department
wants someone hired who can produce an average of
30 dozen units. Of course, it is not possible to tell
exactly what the applicant’s future production might
be. By substituting 4 for x in the regression equation,
we have,

ŷ = 1.0277 + 5.1389(4) = 21.58


Therefore, the manager should not hire the applicant
who scored 4, because the predicted output is only
21.58 dozen units, below the required 30 dozen.
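The Tackey Toy calculation can be reproduced with a minimal sketch. It assumes plain Python; the names TARGET_OUTPUT and meets_target are illustrative, and the code keeps b1 unrounded, so b0 may differ from the slide in the last decimal place.

```python
aptitude = [6, 9, 3, 8, 7, 5, 8, 10]        # x: aptitude test score
output = [30, 49, 18, 42, 39, 25, 41, 52]   # y: output in dozen units

n = len(aptitude)
sum_x, sum_y = sum(aptitude), sum(output)
sum_xy = sum(x * y for x, y in zip(aptitude, output))
sum_x2 = sum(x * x for x in aptitude)

# Slope and intercept from the computational formulas.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * (sum_x / n)

# Predicted output for an applicant who scored 4.
predicted = b0 + b1 * 4

TARGET_OUTPUT = 30  # dozen units wanted by the supervisor
meets_target = predicted >= TARGET_OUTPUT

print(f"y-hat = {b0:.4f} + {b1:.4f} x")
print(f"Predicted output at x = 4: {predicted:.2f} dozen, meets target: {meets_target}")
```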
Residual Analysis
Purposes
• Examine linearity
• Evaluate violations of assumptions
Plot residuals vs. Xi , Yi and time
• Graphical Analysis of Residuals
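Residuals are the differences between observed and fitted values. The sketch below computes them for the Tackey Toy data from the previous example, as one possible starting point for a residual plot. No plotting library is assumed, so the residuals are only printed against x.

```python
aptitude = [6, 9, 3, 8, 7, 5, 8, 10]        # x
output = [30, 49, 18, 42, 39, 25, 41, 52]   # y, in dozen units

n = len(aptitude)
mean_x = sum(aptitude) / n
mean_y = sum(output) / n

b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(aptitude, output)) / sum(
    (x - mean_x) ** 2 for x in aptitude
)
b0 = mean_y - b1 * mean_x

# Residual e_i = observed y_i minus fitted y-hat_i.
residuals = [y - (b0 + b1 * x) for x, y in zip(aptitude, output)]

for x, e in sorted(zip(aptitude, residuals)):
    print(f"x = {x:2d}  residual = {e:7.4f}")

# For a least-squares line with an intercept, the residuals sum to zero.
print(f"sum of residuals = {sum(residuals):.10f}")
```

If the printed residuals show a curved pattern or a spread that grows with x, the linearity or equal-variance assumption deserves a closer look.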
Residual Analysis for Linearity
[Figure: plots of Y against X and of the residuals (e) against X, contrasting a pattern that is not linear with one that is linear]
Residual Analysis for Homoscedasticity
[Figure: plots of Y against X and of SR against X, contrasting heteroscedasticity with homoscedasticity]
Pitfalls of Regression
Analysis
Lacking an awareness of the assumptions
underlying least-squares regression.
Not knowing how to evaluate assumptions.
Not knowing the alternatives to classical
regression if some assumption is violated.
Using a regression model without knowledge
of the subject matter.
Strategies for Avoiding
the Pitfalls of Regression
Start with a scatter plot of Y against X to observe
a possible relationship.
Perform residual analysis to check the
assumptions.
• Use a histogram, stem-and-leaf display,
box-and-whisker plot, or normal probability plot
of the residuals to uncover possible
non-normality.
Strategies for Avoiding
the Pitfalls of Regression
If there is a violation of any assumption, use
alternative methods to least-squares
regression or alternative least-squares models
(e.g., curvilinear or multiple regression).
If there is no evidence of assumption violation,
then test for the significance of the regression
coefficients.
Exercises
Exercise 1:
Castle Rock Entertainment has produced many
movies over the past few years. The Vice-President
wants to see if there is a relationship
between the total cost of a film (including
production costs, salaries, and marketing
expenses) and the gross income produced by the
film through ticket sales in American movie
theaters. A random sample of films produced the
following data pairs.
Exercise 1 (cont.):
Costs               Gross Income
(Million Dollars)   (Million Dollars)
55                  150.50
42                  123.00
17                   68.00
30                   93.00
43                   16.00
26                    5.00
19                   10.00
35                   35.00
22                   20.00
13                   15.00
1. Predict the gross income for a film with a cost of 27
million.
2. Predict the gross income for a film with a cost of 35
million.
Exercise 2:
The scores of ten randomly selected senior high school
students on the mathematical portion of the National
Admission Test (NAT) and the mathematical ability part of a
university admission test were recorded as follows:

Student    x     y
A          5     6
B          7    15
C          9    16
D         10    12
E         11    21
F         12    22
G         15     8
H         17    26
I         20     5
J         26    30

Compute the coefficient of correlation (r).
Exercise 3:
In the following data, x = number of sessions
attended by 15 trainees in a leadership training
seminar, while y = scores obtained by the same
trainees in a test given after the seminar.
x 3 2 4 5 5 6 6 7 9 7 8 5 6 3 8
y 65 50 75 70 80 85 79 88 91 87 88 70 71 63 85
1. Determine the regression equation for predicting y
from x.
2. Predict the mean score of trainees who attend an
average of 8 sessions.
ONE - WAY ANALYSIS OF
VARIANCE
TWO - WAY ANALYSIS OF
VARIANCE
TUKEY TEST (POST HOC TEST)
One - Way ANOVA
One-way analysis of variance (ANOVA) is
a method of testing the equality of
three or more population means by
analyzing sample variances.
It is called the analysis of variance
because the test is based on the analysis
of variation in the data obtained from
different samples.
One - Way ANOVA
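Test Statistic:
$$F = \frac{MSB}{MSW}$$
Note:
$$SS_b = n\sum_{i=1}^{k}(\bar{x}_i - \bar{x})^2 \qquad SS_w = \sum_{i=1}^{k}\sum_{j=1}^{n}(x_{ij} - \bar{x}_i)^2$$
In these sums, n is the number of observations in each
sample and k is the number of samples.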
Note:
The ANOVA test is applied by calculating two
estimates of the variance of population
distributions: the variance between samples
and the variance within samples.
The variance between samples is also called
the mean square between samples or MSB. The
variance within samples is also called the
mean square within samples or MSW.
Assumptions
1. Your dependent variable should be measured at
the interval or ratio level (i.e., it is
continuous).
2. Your independent variable should consist of two
or more categorical, independent groups.
3. You should have independence of observations,
which means that there is no relationship
between the observations in each group or
between the groups themselves.
Assumptions
4. There should be no significant outliers.
5. Your dependent variable should be
approximately normally distributed
for each category of the independent
variable.
6. There needs to be homogeneity of
variances.
Hypotheses
The analysis of variance is used to test the
hypothesis that the means of three or more
populations are the same against the alternative
hypothesis that the mean of at least one
population is different from the others.
Ho : μ1 = μ2 = . . . = μk
Ha : At least one of the population means is
different from the others.
Rejection Region
One-way ANOVA is always right-tailed, with
the rejection region in the right tail of the F
distribution curve.
Critical Value: Fα,df1,df2
Note:
df1 = k − 1
df2 = n − k
Where:
k = number of categories
n = total number of observations
Example:
Suppose we have teachers at a school who have
devised three different methods to teach arithmetic.
They want to find out if these three methods produce
different mean scores. Let μ1, μ2 and μ3 be the mean
scores of all students who are taught by Methods I,
II, and III, respectively.
To test if the three teaching methods produce
different means, we test the null hypothesis
Ho : μ1 = μ2 = μ3
Ha : At least one of the population means is
different from the others.
Note:
Using a one-way ANOVA test, we analyze only one
factor or variable.
For instance, in the example of testing for the equality
of mean arithmetic scores of students taught by each
of the three different methods, we are considering only
one factor, which is the effect of different teaching
methods on the scores of students.
Sometimes we may analyze the effects of two factors.
For example, if different teachers teach arithmetic
using these three methods, we can analyze the effects
of teachers and teaching methods on the scores of
students. This is done by using a two-way ANOVA.
Note:
The variance between samples, MSB, gives an estimate of
variance based on the variation among the means of
samples taken from different populations.
For the example of three teaching methods, MSB will be
based on the values of the mean scores of three samples of
students taught by three different methods. If the means
of all populations under consideration are equal, the means
of the respective samples will still be different but the
variation among them is expected to be small, and
consequently, the value of MSB is expected to be small.
However, if the means of populations under consideration
are not all equal, the variation among the means of
respective samples is expected to be large, and consequently,
the value of MSB is expected to be large.
Note:
The variance within samples, MSW, gives an
estimate of variance based on the variation
within the data of different samples.
For the example of three teaching methods,
MSW will be based on the scores of individual
students included in the three samples taken
from three populations.
Example:
Callie Cruz, Vice-President of the Nikel and Dime Savings
Bank, is reviewing employees' performance for possible
salary increases. In evaluating tellers, Callie decides that
an important criterion is the number of customers served each
day. She expects that each teller should handle
approximately the same number of customers daily.
Otherwise, each teller should be rewarded or penalized
accordingly.
Callie randomly selects 6 business days, and the customer
traffic for each teller during these days is recorded. The
factor or variable of interest, then, is the number of
customers served. The sample data are shown below:
Example (cont.):
Customer Traffic Data
Day      Teller 1       Teller 2      Teller 3
         (Ms. David)    (Ms. Chua)    (Ms. Lim)
1        45             55            54
2        56             50            61
3        47             53            54
4        51             59            58
5        50             58            52
6        45             49            51
Total    294            324           330
Solution:
Step 1:
Ho : μ1 = μ2 = μ3
All population means are equal. That is, Ms. David, Ms. Chua
and Ms. Lim serve the same average number of customers
per day and are assumed to have the same workload.
Ha : At least one of the population means is
different from the others.
Not all the tellers are handling the same average number of
customers per day. At least one teller is performing
better than the others, or at least one is not
performing up to the standard of the others.
Solution:
Step 2: α = 0.05
Step 3:
Since we are comparing more than two groups, we
will use the F-distribution.
Step 4: df1 = 3 − 1 = 2 df2 = 18 − 3 = 15
F0.05,2,15 = 3.68
Solution:
Step 6:
$$\bar{x}_1 = \frac{294}{6} = 49 \qquad \bar{x}_2 = \frac{324}{6} = 54 \qquad \bar{x}_3 = \frac{330}{6} = 55$$
$$\bar{x} = \frac{49 + 54 + 55}{3} = 52.6667$$
$$SS_b = 6(49 - 52.6667)^2 + 6(54 - 52.6667)^2 + 6(55 - 52.6667)^2 = 124$$
Solution:
Step 6 (cont.):
$$SS_1 = (45 - 49)^2 + (56 - 49)^2 + (47 - 49)^2 + (51 - 49)^2 + (50 - 49)^2 + (45 - 49)^2 = 90$$
$$SS_2 = (55 - 54)^2 + (50 - 54)^2 + (53 - 54)^2 + (59 - 54)^2 + (58 - 54)^2 + (49 - 54)^2 = 84$$
$$SS_3 = (54 - 55)^2 + (61 - 55)^2 + (54 - 55)^2 + (58 - 55)^2 + (52 - 55)^2 + (51 - 55)^2 = 72$$
$$SS_w = 90 + 84 + 72 = 246$$
Solution:
Step 6 (cont.):
Source     Sum of Squares   Degrees of Freedom   Variance Estimate   F Ratio
Between    124              2                    62                  3.7805
Within     246              15                   16.4
Total      370              17
Solution:
Step 7:
Since the test statistic (3.7805) is greater than the critical
value (3.68), we reject Ho. Therefore, at least one of the tellers
among David, Chua and Lim is likely to be handling
more or fewer customers than the others.
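The ANOVA table for the teller example can be rebuilt with a short sketch. It assumes plain Python and copies the critical value 3.68 from Step 4 rather than computing it. The dictionary layout and variable names are illustrative choices.

```python
tellers = {
    "Ms. David": [45, 56, 47, 51, 50, 45],
    "Ms. Chua": [55, 50, 53, 59, 58, 49],
    "Ms. Lim": [54, 61, 54, 58, 52, 51],
}

k = len(tellers)                                   # number of groups
n_total = sum(len(v) for v in tellers.values())    # total observations
grand_mean = sum(sum(v) for v in tellers.values()) / n_total
means = {name: sum(v) / len(v) for name, v in tellers.items()}

# Between-samples and within-samples sums of squares.
ss_between = sum(len(v) * (means[name] - grand_mean) ** 2 for name, v in tellers.items())
ss_within = sum((x - means[name]) ** 2 for name, v in tellers.items() for x in v)

ms_between = ss_between / (k - 1)        # MSB, df1 = k - 1
ms_within = ss_within / (n_total - k)    # MSW, df2 = n - k
f_ratio = ms_between / ms_within

F_CRITICAL = 3.68  # F for alpha = 0.05, df1 = 2, df2 = 15 (from the F table)
print(f"SSB = {ss_between:.1f}, SSW = {ss_within:.1f}")
print(f"MSB = {ms_between:.1f}, MSW = {ms_within:.1f}, F = {f_ratio:.4f}")
print(f"Reject Ho: {f_ratio > F_CRITICAL}")
```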
Exercise 1:
A career counselor claims in Career Development
Quarterly that there is no difference in career
decision-making attitudes among the population of
students from various socioeconomic classes.
The results of scores from an attitudes test given to
random samples of students are as follows:
Lower Middle Upper
32 45 38
36 42 38
40 34 31
32 42 41
33 29
37 33
34
Test the counselor’s claim at the 0.01 level.
Exercise 2:
Fifteen fourth-grade students were randomly assigned to three
groups to experiment with three different methods of teaching
arithmetic. At the end of the semester, the same test was given
to all 15 students. The table gives the scores of students in the
three groups.

Method I    Method II    Method III
48          55           85
73          85           68
51          70           95
65          69           74
87          90           67

Test the claim that the mean scores of all three groups of
fourth-graders taught by the three different methods are not
equal. Assume that all the required assumptions hold true.
Use the 0.01 level of significance.
Post Hoc Tests on One-
Way Analysis of Variance
Suppose we perform a one-way ANOVA and
the results lead us to conclude that at least
one population is different from the others.
To determine which means differ
significantly, we make additional
comparisons between means. The
procedures for making these comparisons
are called multiple comparison methods.
Tukey Test

The Tukey test is also known as the Honestly


Significant Difference Test or the Wholly
Significant Difference Test. It is designed to
compare pairs of means after the null
hypothesis of equal means has been rejected.
Tukey Test

It tests Ho : μi = μj versus Ha : μi ≠ μj for all
means where i ≠ j. The goal of the test is to
determine which population means differ
significantly.
Tukey Test
Note:
The computation of the test statistic for
Tukey’s test follows the same logic as the
test for comparing two means from
independent samples, but the standard
error is not the same as the standard error
used in that test.
Distribution of Tukey
Test
The q-test statistic follows a
distribution called the Studentized
range distribution.
Standard Error
$$SE = \sqrt{\frac{s^2}{2} \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where:
s² = mean square error estimate (MSE) of σ² from
the one-way ANOVA
n1 = sample size from population 1
n2 = sample size from population 2
Test Statistic for
Tukey’s Test
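The test statistic for Tukey’s test when
testing Ho : μ1 = μ2 versus Ha : μ1 ≠ μ2 is given
by
$$q = \frac{(\bar{x}_2 - \bar{x}_1) - (\mu_2 - \mu_1)}{\sqrt{\frac{s^2}{2} \times \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
where x̄2 > x̄1, so that q is positive.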


Critical Value for the
Tukey’s Test
The critical value for Tukey’s test using a
familywise error rate α is given by
qα,v,k
where:
α = the level of significance, called the experimentwise
error rate or familywise error rate
v = the degrees of freedom due to error (the total
sample size minus the number of means being
compared, or n − k)
k = the total number of means being compared
Decision Rule

If q ≥ qα,v,k reject the null hypothesis


that Ho : μi = μj and conclude that the
means are significantly different.
Procedures Used to Make Multiple
Comparison Using Tukey Test
Step 1:
Arrange the sample means in ascending order.
Step 2:
Compute the pairwise differences, x̄i − x̄j,
where x̄i > x̄j.
Step 3:
Compute the test statistic for each
pairwise difference.
$$q = \frac{(\bar{x}_i - \bar{x}_j) - (\mu_i - \mu_j)}{\sqrt{\frac{s^2}{2} \times \left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$$
Procedures Used to Make Multiple
Comparison Using Tukey Test
Step 4:
Determine the Critical Value.
Step 5:
Determine the decision.

Step 6:
Determine the conclusion.
Example 1
Suppose that there is sufficient evidence to
reject Ho : μ1 = μ2 = μ3 = μ4 using a one-way
ANOVA. The mean square error from ANOVA
is determined to be 26.2. The sample means
are x̄1 = 42.6,x̄2 = 49.1,x̄3 = 46.8,x̄4 = 63.7 with
n1 = n2 = n3 = n4 = 6 .
Use Tukey’s test to determine which pairwise
means are significantly different using a
familywise error rate of 0.05.
Solution:
Step 1:
Arrange the sample means in ascending order.
x̄1 = 42.6, x̄3 = 46.8, x̄2 = 49.1, x̄4 = 63.7
Step 2:
Compute the pairwise differences.
x̄4 − x̄1 = 21.1    x̄4 − x̄3 = 16.9    x̄4 − x̄2 = 14.6
x̄2 − x̄1 = 6.5     x̄2 − x̄3 = 2.3     x̄3 − x̄1 = 4.2
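Solution:
Step 3:
Compute the test statistic for each pairwise
difference.
Ho : μ4 − μ1 = 0
$$q = \frac{(21.1) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 10.0974$$
Ho : μ4 − μ3 = 0
$$q = \frac{(16.9) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 8.0875$$
Ho : μ4 − μ2 = 0
$$q = \frac{(14.6) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 6.9868$$
Ho : μ2 − μ1 = 0
$$q = \frac{(6.5) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 3.1106$$
Ho : μ2 − μ3 = 0
$$q = \frac{(2.3) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 1.1007$$
Ho : μ3 − μ1 = 0
$$q = \frac{(4.2) - (0)}{\sqrt{\frac{26.2}{2} \times \left(\frac{1}{6} + \frac{1}{6}\right)}} = 2.0099$$
Step 4:
Determine the Critical Value.
With v = n − k = 24 − 4 = 20 and k = 4:
qα,v,k → q0.05,20,4 = 3.958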
Solution:
Step 5:

Determine the decision.

If q ≥ qα,v,k reject the null hypothesis


that Ho : μi = μj and conclude that the
means are significantly different.
Solution:
Step 6:
Determine the conclusion.
Reject Ho : μ4 − μ1 = 0 (10.0974 > 3.958)
Reject Ho : μ4 − μ3 = 0 (8.0875 > 3.958)
Reject Ho : μ4 − μ2 = 0 (6.9868 > 3.958)
Retain Ho : μ2 − μ1 = 0 (3.1106 < 3.958)
Retain Ho : μ2 − μ3 = 0 (1.1007 < 3.958)
Retain Ho : μ3 − μ1 = 0 (2.0099 < 3.958)
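The six comparisons in Example 1 can be reproduced with a small sketch. It assumes plain Python and takes the critical value 3.958 from the Studentized range table as given in Step 4, since the standard library cannot compute it. The dictionary keys are the population labels 1 to 4.

```python
from itertools import combinations
from math import sqrt

mse = 26.2                                   # mean square error from the ANOVA
means = {1: 42.6, 2: 49.1, 3: 46.8, 4: 63.7}
sizes = {1: 6, 2: 6, 3: 6, 4: 6}
Q_CRITICAL = 3.958                           # q for alpha = 0.05, v = 20, k = 4

results = {}
for i, j in combinations(means, 2):
    # Order each pair so that the larger sample mean comes first.
    hi, lo = (i, j) if means[i] > means[j] else (j, i)
    se = sqrt(mse / 2 * (1 / sizes[hi] + 1 / sizes[lo]))
    results[(hi, lo)] = (means[hi] - means[lo]) / se

significant = sorted(pair for pair, q in results.items() if q >= Q_CRITICAL)

for (hi, lo), q in sorted(results.items(), key=lambda item: -item[1]):
    decision = "Reject" if q >= Q_CRITICAL else "Retain"
    print(f"{decision} Ho: mu{hi} - mu{lo} = 0  (q = {q:.4f})")
```

Only the pairs involving population 4 exceed the critical value, which matches the conclusion in Step 6.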
Example 2
Suppose the following data are taken from three
different populations that are known to be
normally distributed with equal population
variances based on independent simple random
samples.
Example 2 (cont.)
A. Test the claim that each sample comes from a
population with the same mean at the 0.05 level of
significance. That is, test Ho : μ1 = μ2 = μ3.
B. If you rejected the null hypothesis in part (A),
use Tukey’s test to determine which pairwise
means differ using a familywise error rate of
0.05.
Based on the result of the ANOVA test, Ho : μ1 = μ2 = μ3 is
rejected at the 0.05 level of significance.
The output gives the difference in means, confidence
intervals and the adjusted p-values for all possible pairs.
The confidence intervals and p-values show that the only
significant between-group differences are for treatments
c and a, and c and b.
The confidence interval for the pair b and a contains 0,
so this pair shows no significant difference.
Two - Way Analysis of
Variance

The two-way ANOVA compares the
mean differences between groups that
have been split on two independent
variables (called factors).
Two - Way Analysis of
Variance

The primary purpose of a two-way ANOVA
is to understand if there is an interaction
between the two independent variables on
the dependent variable.
Two - Way Analysis of
Variance
The interaction term in a two-way ANOVA
informs you whether the effect of one of
your independent variables on the
dependent variable is the same for all
values of your other independent variable
(and vice versa).
Two - Way Analysis of
Variance
For example, you could use a two-way ANOVA
to understand whether there is an
interaction between gender and educational
level on test anxiety amongst university
students, where gender (males/females) and
education level (undergraduate/postgraduate)
are your independent variables, and test
anxiety is your dependent variable.
Reminders:
If you have three independent variables
rather than two, you need a three-way
ANOVA.
Alternatively, if you have a continuous
covariate, you need a two-way ANCOVA.
Assumptions
1. Your dependent variable should be
measured at the continuous level.
2. Your two independent variables
should each consist of two or more
categorical, independent groups.
Assumptions
3. You should have independence of
observations.
4. There should be no significant outliers.
5. Your dependent variable should be
approximately normally distributed for
each combination of the groups of the
two independent variables.
Assumptions

6. There needs to be homogeneity of
variances for each combination of the
groups of the two independent
variables.
Difference Between One-
Way and Two-Way ANOVA
Hypotheses Regarding
Interaction Effect

Ho : There is no interaction between the
factors.
Ha : There is an interaction between the
factors.
Hypotheses Regarding
Main Effects
Ho : There is no effect of factor A on the
response variable.
Ha : There is an effect of factor A on the response
variable.
Ho : There is no effect of factor B on the
response variable.
Ha : There is an effect of factor B on the response
variable.
Two-Way ANOVA Table
Reminders:
Whenever conducting a two-way ANOVA,
we always first test the hypothesis
regarding the interaction effect. If the null
hypothesis of no interaction is rejected,
we do not interpret the results of the
hypotheses involving the main effects.
Example
Independent Variables: Device and Task
Dependent Variable: Task Completion Time
