
EDA 223

LESSON 6: SAMPLING THEORY
➢ Creation of a sample set.
➢ Retains accuracy in bringing out the correct statistical information.
➢ Population of a topic → Sample of the population to be studied → Conclusion about the population

SAMPLING
• Process of selecting a small number of elements (sample) from a larger defined target group of elements (population).
• The information gathered from the small group allows judgments to be made about the larger group.

BASICS OF SAMPLING THEORY
➢ Population
• The group of individuals being studied.
➢ Element
• Basic unit of information that has a unique meaning and subcategories of distinct value.
➢ Defined target population
• Chosen group of individuals that is studied.
➢ Sampling unit
• Unit of a population that is used for statistical research or study.
➢ Sampling frame
• Eligible members of a population from which samples are drawn.

SAMPLING ERROR
• Any type of bias that is attributable to mistakes in either:
o Drawing a sample
o Determining the sample size

I. DEFINING A POPULATION OF INTEREST
The population of interest depends entirely on the:

a.) Management problem
b.) Research problems; and
c.) Research design

Some bases for defining a population:
o Geographic area
o Demographics
o Usage/Lifestyle
o Awareness

SAMPLING FRAME
• List of population elements (people, companies, houses, cities, etc.) from which units to be sampled can be selected.
• Difficult to get an accurate list.
• Sample frame error – occurs when certain elements of the population are accidentally omitted or not included on the list.

FOR POPULATION: A researcher is conducting a survey on customer satisfaction with a particular brand of smartphones.
SAMPLING FRAME: Database containing contact information or purchase records of all customers who have bought the brand’s smartphones from authorized retailers or online stores.

FOR POPULATION: A study investigating the prevalence of a rare medical condition in a specific region.
SAMPLING FRAME: List of patient records from hospitals, clinics, or medical registries within the region.

* Care must be taken to make sure your sampling frame is adequate for your needs. *

A good sample frame for a project on living conditions would:
• Include all individuals in the target population.
• Exclude all individuals not in the target population.
• Include accurate information that can be used to contact selected individuals.

II. SAMPLING METHODS

1. PROBABILITY SAMPLING
• Sampling technique in which the researcher chooses samples from a larger population using a method based on the Theory of Probability.
• Random selection

TYPES OF PROBABILITY SAMPLING:

1. Simple Random Sampling – every unit has an equal, nonzero chance of being selected.

Ex. In studying the reading habits of students at a university, researchers assign a unique number to each student and then use a random number generator to select a sample of 100 students from the entire student population.

2. Systematic Random Sampling – the defined target population is ordered, and the sample is selected according to position using a skip interval.

Ex. Individuals are picked according to their assigned number at a uniform interval (numbers divisible by 5 → 5th, 10th, 15th, etc.).

3. Stratified Random Sampling – the population is divided into different subgroups (strata) and samples are selected from each.
Ex. Stratifying employees working in different departments (sales, marketing, and HR).

• Steps in drawing a stratified random sample:
1. Divide the target population into homogeneous subgroups or strata.
2. Draw random samples from each stratum.
3. Combine the samples from each stratum into a single sample of the target population.

4. Cluster Sampling
- All members from randomly selected segments of a population are sampled.
- Used when the population falls into naturally occurring subgroups.

Ex. Dividing a city into neighborhoods (clusters) and randomly selecting several neighborhoods. Within each selected neighborhood, all residents are tested for the disease.

2. NON-PROBABILITY SAMPLING
• Sampling technique in which the researcher selects samples based on subjective judgment rather than random selection.

TYPES OF NON-PROBABILITY SAMPLING:

1. Convenience Sampling – consists only of available members of the population.

Ex. Asking people outside the university library some questions about the survey. All of them are picked based on the convenience of reaching them for data collection.

2. Judgment Sampling – relies upon the belief that participants fit certain characteristics.

Ex. Checking the characteristics expected of a Miss Universe Philippines candidate to pick individuals who can audition for the part.

3. Quota Sampling – dividing the population into subgroups based on predetermined characteristics (quota) and then selecting participants from each subgroup until the quota is filled.

Ex. If the population consists of 40% male and 60% female, the sample should reflect those percentages.

4. Snowball Sampling – (chain referral) recruiting participants through referrals from initial participants or through social networks, creating a “snowball” effect as the sample size grows.

Ex. If you are studying a club whose members are hard to contact, you take the chance of speaking to one member and ask that member to forward the survey to co-members, creating a chain referral.

III. DETERMINING A SAMPLE SIZE
• Should be carefully fixed so that it will be adequate to draw valid and generalized conclusions.
• To determine the appropriate sample size, the following should be determined:

1. Level of Precision
• Also called sampling error
• Range in which the true value of the population is estimated to be.
• Expressed in percentage points
• Determined by a statistical measure called the standard deviation.

High σ = Low precision
Low σ = High precision

2. Level of Confidence
• The risk level (confidence level) is based on the Central Limit Theorem.

CENTRAL LIMIT THEOREM – when a population is repeatedly sampled, the average of the values obtained from those samples approaches the true population value, and the sample means are approximately normally distributed around it.

• Approximately 95% of the sample values are within two standard deviations of the true population value. (Normal distribution only)
• If 95% confidence level → 95 out of 100 samples will have the true population value within the range of precision specified.

3. Confidence Interval
• Range of values in which we are confident the true value lies.

CONFIDENCE INTERVAL FORMULA

$$CI = \bar{x} \pm z \frac{s}{\sqrt{n}}$$

where:
CI − confidence interval
x̄ − sample mean
z − confidence level value (z-score)
s − sample standard deviation
n − sample size
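The confidence interval formula can be sketched in Python as follows. The input values (mean 175 cm, s = 20 cm, n = 40, z = 1.95996) are the ones used in the heights example of these notes; the function name `confidence_interval` is only an illustrative choice.

```python
import math


def confidence_interval(mean, s, n, z=1.95996):
    """Return (lower, upper, margin) for mean ± z * s / sqrt(n)."""
    margin = z * s / math.sqrt(n)  # margin of error
    return mean - margin, mean + margin, margin


# Values taken from the heights example in these notes
lower, upper, margin = confidence_interval(mean=175, s=20, n=40)
print(f"CI = 175 ± {margin:.3f}  ->  ({lower:.3f}, {upper:.3f})")
```

A larger z (higher confidence level) or a smaller n widens the interval, which matches the trade-off between confidence and precision described above.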
Example: We measure the heights of 40 randomly chosen men and get a mean height of 175 cm. We also know the standard deviation of men’s height is 20 cm.

1st → GIVEN:
x̄ (sample mean) = 175 cm
s (standard deviation) = 20 cm
n (no. of observations) = 40 men

2nd → CONFIDENCE LEVEL (Z-VALUE):
If confidence level = 95%, z-score = 1.95996

3rd → COMPUTATION OF CI:

$$CI = \bar{x} \pm z\left(\frac{s}{\sqrt{n}}\right)$$

$$CI = 175 \pm 1.95996\left(\frac{20}{\sqrt{40}}\right)$$

$$CI = 175 \pm 6.198$$

4. Degree of Variability (Proportion, p)
• Distribution of attributes in the population
• More homogeneous population = smaller sample size required
• More heterogeneous population = larger sample size required to obtain a given level of precision
• p = 50% gives the maximum variability

BASES OF DETERMINING SAMPLE SIZE

1. CENSUS FOR SMALL POPULATIONS
• Eliminates sampling error
• The entire population is used as the sample (which is why a census is not advisable for a large population)

2. SAMPLE SIZE OF A SIMILAR STUDY
• Without reviewing the methods used in these studies, you run the risk of repeating errors that were made in determining the sample size for another study.

3. PUBLISHED TABLES

4. USING FORMULAS TO CALCULATE THE SAMPLE SIZE

o COCHRAN FORMULA & MODIFIED COCHRAN FORMULA
- Allows you to calculate the ideal sample size given a level of precision, desired confidence level, and the estimated proportion of the attribute present in the population.
- For LARGE POPULATIONS

COCHRAN FORMULA

$$n_0 = \frac{z^2 p q}{e^2}$$

Where:
e − level of precision (margin of error)
p − proportion of the population
q − (1 − p)
z − z-score of the confidence level

MODIFIED COCHRAN FORMULA

$$n = \frac{n_0}{1 + \frac{n_0 - 1}{N}}$$

Where:
n0 − sample size from the Cochran formula
N − total population

o TARO YAMANE FORMULA / SLOVIN’S FORMULA
- For a finite population whose size is known:

SLOVIN’S FORMULA

$$n = \frac{N}{1 + N e^2}$$

Where:
e − margin of error
N − total population
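
The three sample-size formulas can be compared side by side with a short Python sketch. The inputs below (95% confidence with z = 1.96, p = 0.5, e = 0.05, and a population of N = 2000) are hypothetical values chosen only for illustration; they do not come from the notes. Results are rounded up because a sample size must be a whole number.

```python
import math


def cochran(z, p, e):
    """Cochran formula: n0 = z^2 * p * q / e^2, with q = 1 - p."""
    return (z ** 2) * p * (1 - p) / (e ** 2)


def modified_cochran(n0, N):
    """Modified Cochran formula: n = n0 / (1 + (n0 - 1) / N)."""
    return n0 / (1 + (n0 - 1) / N)


def slovin(N, e):
    """Slovin's (Taro Yamane) formula: n = N / (1 + N * e^2)."""
    return N / (1 + N * e ** 2)


# Hypothetical inputs, for illustration only
z, p, e, N = 1.96, 0.5, 0.05, 2000

n0 = cochran(z, p, e)
n_cochran = math.ceil(n0)
n_modified = math.ceil(modified_cochran(n0, N))
n_slovin = math.ceil(slovin(N, e))
print(n_cochran, n_modified, n_slovin)
```

With p = 0.5 (maximum variability) the Cochran result is the most conservative of the three, and the finite-population correction lowers it once N is taken into account.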

LESSON 7: CORRELATION AND REGRESSION
➢ CORRELATION
- Statistical tool to measure the association between variables.

- Relationship in the changes and movement of two variables.

- Measures the extent of the relationship between variables.

TYPES OF VARIABLES:
1. Nominal – data in named categories; any numbers used are labels only, with no order.

2. Ordinal – data that are in order or ranking.

3. Discrete – numerical data in whole numbers (counts).

4. Continuous – numerical data that can take any value in a range, including fractions and decimals.
CORRELATION COEFFICIENT
Pearson’s Correlation / Product Moment Correlation Coefficient

- Measures the nature and strength of the association between two quantitative variables.

- Measures the degree of linear association.

- Denoted by r

- Ranges from +1.0 (positive) to -1.0 (negative)


o Sign – nature of association.
o Value – strength of association

▪ POSITIVE CORRELATION

𝑾𝑯𝑬𝑹𝑬: 𝑿 ↑ 𝒀 ↑

▪ NEGATIVE CORRELATION

𝑾𝑯𝑬𝑹𝑬: 𝑿 ↑ 𝒀 ↓
▪ ZERO CORRELATION
-absence of any systematic tendency

Score Ranges (|r|)   Verbal Description                         Strength of relationship
0.00 − 0.299         Negligible correlation                     Very weak to none
0.30 − 0.499         Weak positive/negative correlation         Weak
0.50 − 0.699         Moderate positive/negative correlation     Moderate
0.70 − 0.899         High positive/negative correlation         Strong
0.90 − 1.00          Very high positive/negative correlation    Very strong

PEARSON – r FORMULA

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

Where:

r − Pearson product moment correlation coefficient
n − sample size
∑x − sum of the values of x
∑y − sum of the values of y
∑xy − sum of the products of x and y
(∑x)(∑y) − product of the sum of all x and the sum of all y
∑x² − sum of the squares of x
∑y² − sum of the squares of y


Ex. Below are the scores of 10 students in their Mathematics and Physics exams. Determine if there is a significant
relationship between the scores.

STUDENT   MATH SCORES (X)   PHYSICS SCORES (Y)   X²      Y²      XY
A         65                65                   4225    4225    4225
B         66                63                   4356    3969    4158
C         68                64                   4624    4096    4352
D         71                67                   5041    4489    4757
E         65                70                   4225    4900    4550
F         68                62                   4624    3844    4216
G         67                69                   4489    4761    4623
H         62                71                   3844    5041    4402
I         60                70                   3600    4900    4200
J         69                68                   4761    4624    4692
TOTAL:    661               669                  43789   44849   44175

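$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$r = \frac{10(44175) - (661)(669)}{\sqrt{\left[10(43789) - (661)^2\right]\left[10(44849) - (669)^2\right]}} = \frac{-459}{\sqrt{(969)(929)}}$$

Pearson r = -0.483774464

|r| falls in the 0.30 − 0.499 range: WEAK NEGATIVE CORRELATION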
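
The Pearson computation for the Mathematics and Physics scores can be reproduced with a few lines of Python. This is a minimal sketch that follows the raw-sums formula; the function name `pearson_r` is an illustrative choice.

```python
import math

# Scores of students A to J, taken from the table in these notes
math_scores = [65, 66, 68, 71, 65, 68, 67, 62, 60, 69]
physics_scores = [65, 63, 64, 67, 70, 62, 69, 71, 70, 68]


def pearson_r(x, y):
    """Pearson r from raw sums: n, sum x, sum y, sum xy, sum x^2, sum y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = pearson_r(math_scores, physics_scores)
print(f"r = {r:.9f}")
```

Note that n multiplies the sums of squares inside both brackets of the denominator; leaving it out gives a negative number under the square root for this data.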

SPEARMAN RANK FORMULA

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

Where:

r_s − coefficient of rank correlation
d − difference between the ranks of the two variables
n − number of cases
Ex. In a study of the relationship between level of education and income, the following data were obtained. Find the relationship between them and comment.

Sample   Level of Education (x)   Rank (x)   Income (y)   Rank (y)   Difference (d)   Squared Difference (d²)
A        Preparatory              5.5        45           4           1.5              2.25
B        Elementary               3.5        35           5          -1.5              2.25
C        University               1          18           7          -6               36
D        Elementary               3.5        50           2.5         1                1
E        Secondary                2          65           1           1                1
F        Illiterate               7          50           2.5         4.5             20.25
G        Preparatory              5.5        20           6          -0.5              0.25
TOTAL:                                                                                63

n = 7

$$r_s = 1 - \frac{6(63)}{7(7^2 - 1)} = 1 - \frac{378}{336} = -0.125$$

Spearman rho = -0.125

NEGLIGIBLE CORRELATION
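
The ranking and the Spearman formula can be sketched in Python as shown here. The education ranks are copied from the table as given, while the income ranks are recomputed with a small helper (`average_ranks`, an illustrative name) that gives tied values the average of their positions. When ties are present the simplified formula is only an approximation, so treat this as a check of the hand computation rather than a general-purpose routine.

```python
def average_ranks(values, descending=True):
    """Rank values (1 = largest when descending); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + 1 + end + 1) / 2  # average of the 1-based positions
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman_rho(rank_x, rank_y):
    """Simplified Spearman formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2


# Samples A to G from the table in these notes
education_rank = [5.5, 3.5, 1, 3.5, 2, 7, 5.5]
income = [45, 35, 18, 50, 65, 50, 20]

income_rank = average_ranks(income)
rho, sum_d2 = spearman_rho(education_rank, income_rank)
print(income_rank, sum_d2, rho)
```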

REGRESSION

- Technique concerned with predicting some variables by knowing others.

- Tells you how values of y change as a function of changes in values of x.

- Used for prediction.

Line of best fit

$$y = mx + b$$

Where:
m − slope
x − x variable
b − intercept

Slope (m)

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2}$$

Intercept (b)

$$b = \bar{y} - m\bar{x}$$

Where:

$$\bar{y} = \frac{\sum y}{n}; \quad \bar{x} = \frac{\sum x}{n}$$
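
The slope and intercept formulas translate directly into code. In the sketch below, the function name `fit_line` and the five data points are hypothetical and chosen so the answer is easy to verify by hand (the points lie exactly on y = 2x + 1); they are not data from these notes.

```python
def fit_line(x, y):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)  # b = y_bar - m * x_bar
    return m, b


# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
predicted = m * 6 + b  # prediction at x = 6
print(m, b, predicted)
```

The denominator is n∑x² − (∑x)²; multiplying the second term by n as well would change the slope.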
Problem: The following table shows the percent of water and the number of calories in various canned soups to
which 100g of water are added.

a.) Find the equation of the regression line for the data. Round constants to the nearest hundredth.

b.) Use the equation to find the expected number of calories in a soup that is 89% water. Round to the nearest whole number.

Regression Equation

$$y = mx + b$$

Slope:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = -7.98245614$$

Intercept:

$$b = \bar{y} - m\bar{x} = 767.122807$$

Regression equation (constants rounded to the nearest hundredth): y = -7.98x + 767.12

Expected number of calories if x = 89:

From the regression equation, y = -7.98(89) + 767.12 = 56.90 (with unrounded constants, y = 56.68421053)

y ≈ 57 calories
LESSON 8: HYPOTHESIS TESTING

- Z test for the mean (sigma known)

- P-value approach to hypothesis testing

CONNECTION TO CONFIDENCE INTERVAL ESTIMATION
• One-tail test
- T-test of hypothesis for the mean

- Z-test of hypothesis for the proportion

Hypothesis

- An assumption about the population parameter.
- Parameter → Population mean or proportion.
→ Must be identified before analysis.

NULL HYPOTHESIS, H0

- States the (numerical) assumption to be tested
H0: μ = 3
- Begin with the assumption that the null hypothesis is true.
(Innocent until proven guilty)

- Status quo

- Always contains the ‘=’ sign

- The null hypothesis may or may not be rejected.

ALTERNATE HYPOTHESIS, H1

- The opposite of the null hypothesis
H1: μ < 3

- Challenges the status quo

- Never contains the ‘=’ sign

- The alternate hypothesis may or may not be accepted.

IDENTIFY THE PROBLEM

STEPS:

1st – State the Null Hypothesis (H0: μ = 3)

2nd – State its opposite, the Alternative Hypothesis (H1: μ < 3)

Hypotheses are mutually exclusive & exhaustive.

* Sometimes it is easier to form the alternative hypothesis first. *

LEVEL OF SIGNIFICANCE, α
- Defines the unlikely values of the sample statistic if the null hypothesis is true
o Rejection region of the sampling distribution
- α – typical values are 0.01, 0.05, 0.10
- Selected by the researcher at the start
- Provides the critical value(s) of the test

ERRORS IN MAKING DECISIONS
• TYPE I ERROR
o Reject a true null hypothesis
o Has serious consequences
o Probability of a Type I error is α (level of significance)
• TYPE II ERROR
o Do not reject a false null hypothesis
o Probability of a Type II error is β

α & β HAVE AN INVERSE RELATIONSHIP

- Reduce the probability of one error and the other one goes up.

Z-TEST STATISTIC (σ KNOWN)
- Convert the sample statistic to a standardized Z variable

$$Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

- Compare to the critical Z value(s)
➔ If the Z test statistic falls in the critical region, reject H0; otherwise, do not reject H0.

P-VALUE TEST

- Probability of obtaining a test statistic as extreme as or more extreme (≤, ≥) than the actual sample value, given H0 is true

- Observed level of significance
o Smallest value of α at which H0 can be rejected

- Used to make the rejection decision
o If p-value > α, do not reject H0
o If p-value ≤ α, reject H0

ONE-TAIL Z TEST FOR MEAN (σ KNOWN)

ASSUMPTIONS
o Population is normally distributed
o If not normal, use large samples
o Null hypothesis has = sign only
o Z test statistic:

$$Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

T TEST (σ UNKNOWN)

t test statistic

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Where:
x̄ − sample mean
s − sample standard deviation
μ − population mean

z test statistic (2 means)

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

Confidence Level   z-value
80%                1.28
85%                1.44
90%                1.645
95%                1.96
98%                2.33
99%                2.58

Problem for Z test (1-tail): Does an average box of cereal contain more than 368 grams of cereal? A random sample of 25 boxes showed x̄ = 372.5. The company has specified σ = 15 grams. Test at the α = 0.05 level.

Z-test, population standard deviation known, 1-tail (right)

Ho: The mean weight of cereal per box is not greater than 368 grams (μ ≤ 368)
H1: The mean weight of cereal per box is greater than 368 grams (μ > 368)

GIVEN:

average (x̄)                 372.5
number of boxes (n)         25
std. dev (population, σ)    15
significance (α)            0.05
confidence                  0.95
mean (μ)                    368

Z-value          1.5           → $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Z-crit           1.644853627   → NORM.S.INV(1 − 0.05)
P-value (Z)      0.0668        → $0.5 - \int_0^{1.5} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$
P-value (crit)   0.05

0.0668 > 0.05

Null hypothesis is not rejected


Problem for Z-test: An electrical firm manufactures light bulbs that have a lifetime that is approximately normally distributed with a mean of 800 hours and a standard deviation of 40 hours. Test the hypothesis that μ = 800 hours against the alternative, μ ≠ 800 hours, if a random sample of 30 bulbs has an average life of 788 hours. Use a p-value in your answer.

Z-test, population standard deviation known, 2-tail

Ho: The mean lifetime of a light bulb is 800 hours (μ = 800)
H1: The mean lifetime of a light bulb is not 800 hours (μ ≠ 800)

GIVEN:

average (x̄)                 788.00
number of bulbs (n)         30.00
std. dev (population, σ)    40.00
significance (α)            0.05
confidence                  0.95
mean (μ)                    800.00

z-value          -1.643167673   → $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
z-value (crit)   -1.959963985   → NORM.S.INV(0.05 / 2)
p-value          0.100          → $2\left(0.5 - \int_0^{1.6432} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx\right)$
p-value (crit)   0.05

0.100 > 0.05

Ho is not rejected
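
Both one-sample Z-test problems (cereal boxes and light bulbs) can be checked with a short sketch that uses only the Python standard library. The helper names `normal_cdf` and `one_sample_z` are illustrative; the standard normal CDF is built from `math.erf`, which plays the role of the integral in the p-value formulas.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def one_sample_z(xbar, mu, sigma, n, tail):
    """Return (z, p_value) for tail = 'right', 'left', or 'two'."""
    z = (xbar - mu) / (sigma / math.sqrt(n))
    if tail == "right":
        p = 1 - normal_cdf(z)
    elif tail == "left":
        p = normal_cdf(z)
    else:  # two-tailed: double the area beyond |z|
        p = 2 * (1 - normal_cdf(abs(z)))
    return z, p


alpha = 0.05

# Cereal problem: H1 is mu > 368 (right tail)
z1, p1 = one_sample_z(xbar=372.5, mu=368, sigma=15, n=25, tail="right")

# Light bulb problem: H1 is mu != 800 (two tails)
z2, p2 = one_sample_z(xbar=788, mu=800, sigma=40, n=30, tail="two")

for label, z, p in [("cereal", z1, p1), ("bulbs", z2, p2)]:
    decision = "reject Ho" if p <= alpha else "do not reject Ho"
    print(f"{label}: z = {z:.4f}, p = {p:.4f} -> {decision}")
```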
Problem for t-test (SINGLE MEAN): Past experience indicates that the time required for high school seniors to complete a standardized test is a normal random variable with a mean of 35 minutes. A random sample of 20 high school seniors took an average of 33.1 minutes to complete this test, with a standard deviation of 4.3 minutes. Test the hypothesis, at the 0.05 level of significance, that μ = 35 minutes against the alternative that μ < 35 minutes.

t-test, sample standard deviation known, 1-tail (left)

Ho: The mean time required to complete the test is 35 minutes (μ = 35)
H1: The mean time required to complete the test is less than 35 minutes (μ < 35)

GIVEN:

average (x̄)           33.1
no. of students (n)   20
std. dev (sample, s)  4.3
significance (α)      0.05
confidence            0.95
mean (μ)              35
df → n − 1            19

t-value          -1.976060073   → $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
t-value (crit)   -1.729132812   → T.INV(0.05, 19)

-1.976060073 < -1.729132812

Ho is rejected
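
The t statistic for this problem can be recomputed with a minimal Python sketch. The standard library has no t distribution, so the critical value is simply copied from the notes (T.INV(0.05, 19)) rather than computed; the function name `one_sample_t` is illustrative.

```python
import math


def one_sample_t(xbar, mu, s, n):
    """Return (t, df) for a one-sample t test."""
    t = (xbar - mu) / (s / math.sqrt(n))
    return t, n - 1


t_stat, df = one_sample_t(xbar=33.1, mu=35, s=4.3, n=20)

# Critical value copied from the notes: T.INV(0.05, 19)
t_crit = -1.729132812

# Left-tailed test: reject Ho when t falls below the critical value
reject = t_stat < t_crit
print(f"t = {t_stat:.6f}, df = {df}, reject Ho: {reject}")
```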

t-stat to p-value (Excel functions)

Location         Equivalent formula
Left tail (<)    T.DIST(T, DF, TRUE)
Right tail (>)   T.DIST.RT(T, DF)
2-tail (≠)       T.DIST.2T(ABS(T), DF)


Problem z-test (double mean): A random sample of size n1 = 25, taken from a normal population with a standard deviation of σ1 = 5.2, has a mean x̄1 = 81. A second random sample of size n2 = 36, taken from a different normal population with a standard deviation σ2 = 3.4, has a mean x̄2 = 76. Test the hypothesis that μ1 = μ2 against the alternative, μ1 ≠ μ2. Quote a P-value in your conclusion.

z-test, standard deviations known (two means), 2-tail

Ho: The two population means are equal (μ1 = μ2)
H1: The two population means are not equal (μ1 ≠ μ2)

GIVEN:

1st number (n1)     25
1st ave (x̄1)        81
1st dev (σ1)        5.2
2nd number (n2)     36
2nd ave (x̄2)        76
2nd dev (σ2)        3.4
significance (α)    0.05
diff (d0)           0

z-value          4.221685587    → $z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
z-value (crit)   ±1.959963985   → NORM.S.INV(1 − 0.05 / 2)
p-value          2.42482E-05    → $2\left(0.5 - \int_0^{4.2217} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx\right)$
p-value (crit)   0.05

0.0000242 < 0.05

Reject Ho
Problem z-test (double mean): A manufacturer claims that the average tensile strength of thread A exceeds the average tensile strength of thread B by at least 12 kilograms. To test this claim, 50 pieces of each type of thread were tested under similar conditions. Type A thread had an average tensile strength of 86.7 kilograms with a standard deviation of 6.28 kilograms, while type B thread had an average tensile strength of 77.8 kilograms with a standard deviation of 5.61 kilograms. Test the manufacturer’s claim using a 0.05 level of significance.

z-test, standard deviations taken as known (two means), 1-tail (right)

Ho: The difference between the mean tensile strengths does not exceed 12 kilograms (μA − μB ≤ 12)
H1: The difference between the mean tensile strengths exceeds 12 kilograms (μA − μB > 12)

GIVEN:

1st number (n1)     50
1st ave (x̄1)        86.7
1st dev (σ1)        6.28
2nd number (n2)     50
2nd ave (x̄2)        77.8
2nd dev (σ2)        5.61
significance (α)    0.05
diff (d0)           12

z-value          -2.603103417   → $z = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
z-value (crit)   1.644853627    → NORM.S.INV(1 − 0.05)

-2.603103417 < 1.644853627

Do Not Reject Ho
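
The two-mean z statistic used in both problems above can be sketched in Python with the standard library only. The helper names are illustrative, and `math.erfc` is used for the upper-tail area because it stays accurate for very small p-values such as the one in the first problem.

```python
import math


def two_sample_z(x1, x2, sd1, sd2, n1, n2, d0=0.0):
    """z = ((x1 - x2) - d0) / sqrt(sd1^2/n1 + sd2^2/n2)."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return ((x1 - x2) - d0) / standard_error


def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# First problem: H1 is mu1 != mu2 (two tails), d0 = 0
z_a = two_sample_z(81, 76, 5.2, 3.4, 25, 36)
p_a = 2 * upper_tail(abs(z_a))

# Thread problem as set up in these notes: H1 is muA - muB > 12 (right tail)
z_b = two_sample_z(86.7, 77.8, 6.28, 5.61, 50, 50, d0=12)
z_crit_right = 1.644853627  # copied from the notes: NORM.S.INV(1 - 0.05)

print(f"z_a = {z_a:.6f}, p_a = {p_a:.3e}")
print(f"z_b = {z_b:.6f}, reject Ho: {z_b > z_crit_right}")
```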
LESSON 9: TEST OF INDEPENDENCE
The Chi-square test of independence is a statistical hypothesis test used to determine whether two categorical or nominal variables are likely to be related or not.
𝐻0 : 𝑉𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠 𝑎𝑟𝑒 𝑖𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡
𝐻1 : 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠 𝑎𝑟𝑒 𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑡

Example problem: A criminologist conducted a survey to determine whether the incidence of certain types of crime varied from one part of a large city to another. The particular
crimes of interest were assault, burglary, larceny, and homicide. The following table shows the numbers of crimes committed in four areas of the city during the past year.

STEPS AND PROCESS

OBSERVED VALUES (Actual)

                      TYPE OF CRIME
DISTRICT   Assault   Burglary   Larceny   Homicide   Total
A          162       118        451       18         749
B          310       196        996       25         1527
C          258       193        458       10         919
D          280       175        390       19         864
Total      1010      682        2295      72         4059

EXPECTED VALUES, USING

$$\text{EXPECTED} = \frac{\text{ROW TOTAL} \times \text{COLUMN TOTAL}}{\text{OVERALL TOTAL}}$$

WHERE:
ROW TOTAL – total of the row containing the cell
COLUMN TOTAL – total of the column containing the cell
OVERALL TOTAL – total of all observations

                      TYPE OF CRIME
DISTRICT   Assault    Burglary   Larceny    Homicide   Total
A          186.3735   125.8482   423.4922   13.28603   749
B          379.963    256.5691   863.3814   27.08647   1527
C          228.6746   154.4119   519.612    16.30155   919
D          214.9889   145.1707   488.5144   15.32594   864
Total      1010       682        2295       72         4059

CHI-SQUARE TABLE, USING

$$\text{CHI} = \frac{(\text{OBSERVED} - \text{EXPECTED})^2}{\text{EXPECTED}}$$

WHERE:
OBSERVED – observed value in the same cell
EXPECTED – expected value in the same cell

                      TYPE OF CRIME
DISTRICT   Assault    Burglary   Larceny    Homicide   Total
A          3.187508   0.489438   1.786755   1.672546   7.136247
B          12.88238   14.29875   20.37072   0.160721   47.71257
C          3.760725   9.643294   7.305519   2.435937   23.14548
D          19.65888   6.129233   19.86654   0.880775   46.53542
Total      39.48949   30.56071   49.32953   5.14998    124.5297

CHI-SQUARED STATISTIC (OVERALL SUM) AND DEGREES OF FREEDOM
DF = (COLUMNS − 1) × (ROWS − 1)

Chi-squared stat   124.5297127
df                 9

CRITICAL VALUE FORMULA = CHISQ.INV.RT(ALPHA, DF)

Significance       0.01
Chi crit value     21.66599433

COMPARE THE RESULTS AND GENERATE DECISION

124.5297127 > 21.66599433

RIGHT TAILED (CHI-SQUARED STATISTIC IS INSIDE THE REJECTION REGION)

Decision: Reject Ho

CONCLUSION
H1 - The occurrence of these types of crime is dependent on the city district.
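
The expected counts and the chi-squared statistic for the crime table can be recomputed with plain Python. This is a minimal sketch; the critical value is copied from the notes (CHISQ.INV.RT(0.01, 9)) rather than computed, since the standard library has no chi-squared distribution.

```python
# Observed counts from the notes: rows are districts A to D,
# columns are Assault, Burglary, Larceny, Homicide
observed = [
    [162, 118, 451, 18],
    [310, 196, 996, 25],
    [258, 193, 458, 10],
    [280, 175, 390, 19],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count per cell = row total * column total / overall total
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

# Chi-squared statistic = sum over cells of (O - E)^2 / E
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi_crit = 21.66599433  # copied from the notes: CHISQ.INV.RT(0.01, 9)

decision = "Reject Ho" if chi_squared > chi_crit else "Do Not Reject Ho"
print(f"chi-squared = {chi_squared:.4f}, df = {df}, decision: {decision}")
```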

LESSON 10: ONE WAY ANOVA (TEST OF VARIANCE)


- Commonly used to test the following:
o Statistical differences among the means of two or more groups.

o Statistical differences among the means of two or more interventions.

o Statistical differences among the means of two or more change scores.


STEPS AND PROCESS

HYPOTHESES (NULL & ALTERNATIVE)
H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6
H1: At least two means differ from each other

MEAN AND TOTAL SUM OF THE SCORES FOR EACH SAMPLE

                      MACHINE
          1      2        3        4       5        6        Total:
          17.5   16.4     20.3     14.6    17.5     18.3     104.6
          16.9   19.2     15.7     16.7    19.2     16.2     103.9
          15.8   17.7     17.8     20.8    16.5     17.5     106.1
          18.6   15.4     18.9     18.9    20.5     20.1     112.4
Total:    68.8   68.7     72.7     71      73.7     72.1     427
Mean      17.2   17.175   18.175   17.75   18.425   18.025   106.75

SQUARES OF EACH SCORE AND THEIR SUMS

                      MACHINE
          X1²       X2²       X3²       X4²      X5²       X6²
          306.25    268.96    412.09    213.16   306.25    334.89
          285.61    368.64    246.49    278.89   368.64    262.44
          249.64    313.29    316.84    432.64   272.25    306.25
          345.96    237.16    357.21    357.21   420.25    404.01
Total:    1187.46   1188.05   1332.63   1281.9   1367.39   1307.59

SUM OF SQUARES (SS FOR EACH MACHINE), USING

$$SS = \sum X^2 - \frac{(\sum X)^2}{n}$$

Where:
SS – sum of squares
∑X² – sum of the squared data
∑X – sum of the data
n – number of data for each machine

          1     2        3         4       5        6        Total
SS        4.1   8.1275   11.3075   21.65   9.4675   7.9875   62.64

GIVEN:
k − number of treatment conditions                    k = 6
n − number of scores in each treatment                n = 4
N − total number of scores (k × n)                    N = 24
T − total for each treatment condition
G − sum of all scores in the study                    G = 427
SS − sum of squares

SUM OF SQUARES

$$SS_{total} = \sum X^2 - \frac{G^2}{N}$$

$$SS_{within} = \sum SS_{\text{each machine}}$$

$$SS_{between} = SS_{total} - SS_{within}$$

           SS
Between    5.3383333
Within     62.64
Total      67.978333

DEGREES OF FREEDOM

$$df_{total} = N - 1 \qquad df_{within} = N - k \qquad df_{between} = k - 1$$

           df
Between    5
Within     18
Total      23

MEAN SQUARES

$$MS_{between} = \frac{SS_{between}}{df_{between}} \qquad MS_{within} = \frac{SS_{within}}{df_{within}}$$

           MS
Between    1.0676667
Within     3.48

F-STATISTIC

$$F = \frac{MS_{between}}{MS_{within}}$$

F value        0.3068008
Crit F value   2.7728532

F-STATISTIC & F CRITICAL
(IF F-STATISTIC > CRIT, REJECT NULL HYPOTHESIS)

F value            Crit F value
0.306800766   <    2.772853153

Decision: Do Not Reject Ho

CONCLUSION
Ho: µ1=µ2=µ3=µ4=µ5=µ6 (there is no significant difference among the means of the six machines)
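
The one-way ANOVA table for the six machines can be reproduced with a short Python sketch that follows the same SS, df, MS, and F steps. The critical F value is copied from the notes rather than computed, since the standard library has no F distribution.

```python
# Scores per machine (columns of the data table in these notes)
machines = [
    [17.5, 16.9, 15.8, 18.6],
    [16.4, 19.2, 17.7, 15.4],
    [20.3, 15.7, 17.8, 18.9],
    [14.6, 16.7, 20.8, 18.9],
    [17.5, 19.2, 16.5, 20.5],
    [18.3, 16.2, 17.5, 20.1],
]

k = len(machines)                  # number of treatment conditions
N = sum(len(m) for m in machines)  # total number of scores
G = sum(sum(m) for m in machines)  # sum of all scores

# Sums of squares
ss_total = sum(x * x for m in machines for x in m) - G ** 2 / N
ss_within = sum(sum(x * x for x in m) - sum(m) ** 2 / len(m) for m in machines)
ss_between = ss_total - ss_within

# Degrees of freedom and mean squares
df_between, df_within = k - 1, N - k
ms_between = ss_between / df_between
ms_within = ss_within / df_within

f_value = ms_between / ms_within
f_crit = 2.772853153  # copied from the notes (alpha = 0.05, df = 5 and 18)

decision = "Reject Ho" if f_value > f_crit else "Do Not Reject Ho"
print(f"F = {f_value:.6f}, F crit = {f_crit}, decision: {decision}")
```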
