
BASICS OF STATISTICS
Francisco S. Antonio
Associate Prof. IV
CATEGORIES OF
STATISTICS
Descriptive statistics
Inferential statistics
WHAT IS STATISTICS?
A science which deals with the
collection, organization,
presentation, analysis and
interpretation of data.
A set of numbers or figures or
processed data
DESCRIPTIVE STATISTICS
Deals with data description to acquire
information or knowledge
Involves organization and presentation
of data
DESCRIPTIVE STATISTICS
INFERENTIAL STATISTICS
Deals with making conclusions or generalizations
about a population of interest (a large set) when
only a part (smaller set) of it is examined
Involves random sampling
INFERENTIAL STATISTICS
EXAMPLE
A president of a certain state university wanted to determine
if a majority of the faculty members of state universities in
Region II are in favor of a change in the academic calendar. He
then took a random sample of faculty members in the
region and asked them whether or not they are in favor of the
change. Based on the random sample, it was concluded that
a majority of the faculty members in the region are indeed in
favor of the change in the academic calendar.
BASIC CONCEPTS
Universe
- set of objects, individuals or entities under study
Variable
- characteristic of interest which is measured or observed from the
elements of the universe
Population
-set of all possible values of a variable corresponding to the entire
collection of units
BASIC CONCEPTS
Sample
- subset of the universe or population
BASIC CONCEPTS
[Figure: the universe consists of units U1, U2, U3, ..., UN; observing the variable Y on each unit gives the values Y1, Y2, Y3, ..., YN.]
TYPES OF VARIABLES
 Qualitative
-take on values which are not numerical
 Quantitative
- take on numerical values which indicate the amount of the characteristic
LEVELS OF
MEASUREMENT
Nominal
Ordinal
Interval
Ratio
NOMINAL
Lowest level of measurement
Responses are categories or labels
Counts or frequencies and percentages can be obtained per category
ORDINAL
Responses are categories or labels which can be
ranked or ordered
Difference between two responses is meaningless
INTERVAL
Responses are categories or labels which can be ranked or ordered
Difference between two responses has meaning
A zero data point in this level is arbitrary
Addition and subtraction of responses are possible
RATIO
Highest level of measurement
All properties of the interval level hold
Zero value in this level means lack of that characteristic
All mathematical operations are possible
MEASURES OF LOCATION
Minimum (min) - lowest value in the data set
Maximum (max) - highest value in the data set
Measures of central tendency - give a central or typical value of the data set
-mean (µ)
-median (Md)
-mode (Mo)
Quantiles
-percentiles, deciles, quartiles
SOME DESCRIPTIVE
MEASURES
Measures of Location
Measure of Dispersion
Skewness
MEAN
The sum of all the observations in the data set divided by the total
number of observations.
PROPERTIES OF THE
MEAN
A data set can only have one mean.
It can only be computed for quantitative data sets.
The magnitude of each observation contributes to the value of the
mean.
It is easily affected by extremely high or low observations.
MEDIAN
The middle value in the data set, after arranging the observations in
increasing or decreasing order.
MODE
The observation in the data set that occurs most frequently.
PROPERTIES
A data set can have more than one mode.
It can be used to describe a quantitative or qualitative data set.
It is not determined by the magnitude of the observations.
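The three measures of central tendency can be checked with a short sketch. The data sets below are hypothetical and chosen only for illustration; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical quantitative data set (illustration only).
data = [2, 3, 3, 5, 7]

mean = statistics.mean(data)        # sum of observations / number of observations
median = statistics.median(data)    # middle value of the sorted observations
modes = statistics.multimode(data)  # every most-frequent value (there may be several)

print(mean, median, modes)

# A qualitative data set has a mode but no mean; here two values tie.
colors = ["red", "blue", "red", "green", "blue"]
color_modes = statistics.multimode(colors)
print(color_modes)
```

`multimode` is used instead of `mode` because, as noted above, a data set can have more than one mode.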
MEASURES OF
DISPERSION
Describes the spread or variability of the
observations in a data set
The higher the value, the greater the variability of
the data set
QUANTILES
Percentiles- divide the array of data
into 100 equal parts
Deciles- divide the array of data into
ten equal parts
Quartiles- divide the array of data into
four equal parts
MEASURES OF
DISPERSION
 Range (R)
Variance (σ² or s²)
Standard Deviation (σ or s)
Coefficient of Variation (CV)
RANGE
The difference between the highest and
lowest observations in the data set
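VARIANCE
The average squared deviation of the observations from the mean

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population)

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ (sample)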
PROPERTIES OF THE
RANGE
It is quick to compute and easy to understand.
It is a rough measure of dispersion.
It is usually reported together with the median.
PROPERTIES OF THE
VARIANCE
• One of the most useful measures of dispersion
• All observations in the data set contribute to the magnitude of the
variance
• Can only take on values of at least zero
• The unit of measure is the square of the unit of measure of the data set
STANDARD DEVIATION
The positive square root of the variance

$\sigma = \sqrt{\sigma^2}$ (population)

$s = \sqrt{s^2}$ (sample)
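As a rough illustration of the dispersion measures (range, variance, standard deviation, and the coefficient of variation as the ratio of the standard deviation to the mean), the sketch below uses a hypothetical data set. The population versions divide by N; the sample versions divide by n - 1.

```python
import statistics

# Hypothetical data set (illustration only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)      # highest minus lowest observation
pop_var = statistics.pvariance(data)    # population variance (divides by N)
samp_var = statistics.variance(data)    # sample variance (divides by n - 1)
pop_sd = statistics.pstdev(data)        # square root of the population variance
samp_sd = statistics.stdev(data)        # square root of the sample variance

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_percent = 100 * pop_sd / statistics.mean(data)

print(data_range, pop_var, samp_var, pop_sd, cv_percent)
```

Because the CV has no unit of measure, the same computation can be repeated on data sets recorded in different units and the results compared directly.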
COEFFICIENT OF
VARIATION
The ratio of the standard deviation to the mean of the data set, usually expressed as a percentage

$CV = \frac{s}{\bar{x}} \times 100\%$
PROPERTIES OF THE
COEFFICIENT OF VARIATION
Has no unit of measure
Can be used to compare the dispersion of two or more data sets with the
same or different units of measurement
MEASURE OF SKEWNESS
This measures the departure of the data set from symmetry, with
formula given by

SK
SYMMETRIC
DISTRIBUTIONS
NON-SYMMETRIC
DISTRIBUTIONS
SAMPLING
TECHNIQUES
BASIC CONCEPTS
SAMPLE – SUBSET OF THE
POPULATION OR UNIVERSE TAKEN
TO REPRESENT THE WHOLE
SAMPLING – THE PROCESS OF
SELECTING A SAMPLE
PARAMETER – DESCRIPTIVE
MEASURE OF THE POPULATION
STATISTIC – DESCRIPTIVE MEASURE
OF THE SAMPLE
INFERENTIAL STATISTICS
 DEALS WITH MAKING
GENERALIZATIONS ABOUT A
POPULATION WHEN ONLY PART OF IT
IS EXAMINED
 IT MAKES USE OF THE INDUCTIVE
METHOD OF DRAWING CONCLUSIONS
 INVOLVES RANDOM SAMPLING
AREAS OF CONCERN OF
INFERENTIAL STATISTICS
1. Estimation
2. Test of hypothesis
WHY DO WE SAMPLE?
 Reduce Cost
Greater accuracy and efficiency
Timeliness
Greater scope
Nature of testing procedure
CLASSIFICATION OF
SAMPLING
Non – Probability sampling
 No objective procedure in sample selection
Probabilities of selection of units are not known
 No inferences about the universe/ population can be made
CLASSIFICATION OF
SAMPLING
 Non – Probability sampling
 Probability sampling
CLASSIFICATION OF
SAMPLING
Non – Probability sampling
 accidental sampling
 Convenience sampling
 Purposive sampling
 Quota sampling
CLASSIFICATION OF
SAMPLING
Probability sampling
 Utilizes an objective procedure in sample selection (randomization
procedure)
 probabilities of selection of units are known
 valid inferences about the universe/population can be made
SAMPLE FRAME
 Complete list of all the elements of the universe or population

 Types:
- directory
- Map
CLASSIFICATION OF
SAMPLING
Probability sampling

 Simple random sampling


 stratified random sampling
 Systematic random sampling
 cluster sampling
 simple two-stage sampling (Multiple stages)
SIMPLE RANDOM
SAMPLING
This technique assigns equal chances of selection to all possible samples
of size n that may be formed from the universe.

Types:
SRS without replacement (SRSWOR)
- No repeats in the sample selection
SRS with replacement(SRSWR)
- Repeats in the sample selection is allowed
SIMPLE RANDOM
SAMPLING
PROCEDURE:
1. From the obtained sampling frame, assign an ID number (from 1 to
N) to each unit.
2. Using any randomization technique, generate n ID numbers. The
units corresponding to the n ID numbers generated will serve
as the simple random sample.
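The two steps above can be sketched as follows. The frame of 20 units, the function name `simple_random_sample`, and the seed are hypothetical; the random number generator plays the role of the randomization technique.

```python
import random

def simple_random_sample(frame, n, seed=None, replace=False):
    """Draw n units from a sampling frame by generating random ID numbers."""
    rng = random.Random(seed)
    N = len(frame)
    if replace:
        # SRSWR: repeats are allowed, so each ID is drawn independently.
        ids = [rng.randint(1, N) for _ in range(n)]
    else:
        # SRSWOR: no repeats, so n distinct IDs are drawn from 1..N.
        ids = rng.sample(range(1, N + 1), n)
    return [frame[i - 1] for i in ids]

# Hypothetical sampling frame with ID numbers 1 to 20.
frame = [f"unit-{i}" for i in range(1, 21)]
sample = simple_random_sample(frame, n=5, seed=1)
print(sample)
```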
SIMPLE RANDOM
SAMPLING
Advantages
• Easy to implement
• Most statistical methods available for data analysis assume SRS
• Efficient for small-scale surveys
Disadvantages
• Difficult to use in actual field operation
• Not very efficient when the target population is heterogeneous
RANDOMIZATION
PROCEDURES
 Lottery method (chips-in-the-box method)
 use of the table of random numbers
 Random number generator (computer programs that generate random
numbers, e.g. calculator)
STRATIFIED RANDOM
SAMPLING
Sampling procedure that subdivides the universe into mutually
exclusive subgroups or strata and draws independent samples from
stratum to stratum.
REASONS FOR
STRATIFICATION
To increase precision of estimates
To draw inferences for subclasses in the
population
STRATIFIED RANDOM
SAMPLING
PROCEDURE:
1. From the obtained sampling frame, divide the units into
strata or groups according to the chosen stratification
variable.
2. For each stratum, assign an ID number to each sampling
unit listed. For the first stratum, randomly generate n₁ ID
numbers.
3. Repeat step 2 for the other strata, generating nᵢ ID numbers for
stratum i. The samples taken from each stratum will constitute the
stratified random sample.
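A minimal sketch of this procedure is given below. The frame, the stratification variable (the first element of each unit), the per-stratum sample sizes, and the function name `stratified_sample` are all hypothetical.

```python
import random

def stratified_sample(frame, stratum_of, sizes, seed=None):
    """Draw an independent simple random sample from each stratum."""
    rng = random.Random(seed)
    # Step 1: divide the frame into strata by the stratification variable.
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Steps 2-3: sample n_i units without replacement within each stratum.
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

# Hypothetical frame: 30 units in stratum "A" and 10 units in stratum "B".
frame = [("A", i) for i in range(30)] + [("B", i) for i in range(10)]
sample = stratified_sample(frame, stratum_of=lambda u: u[0],
                           sizes={"A": 6, "B": 2}, seed=7)
print(sample)
```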
STRATIFIED RANDOM
SAMPLING
Advantages
• Better cross-section
• Simpler field operation
• Increased precision of estimates
Disadvantage
• Requires auxiliary information in the frame
SYSTEMATIC SAMPLING
This technique allows selecting one
unit at random and choosing additional
units using a fixed skipping pattern until
the desired number of samples is
obtained.
SYSTEMATIC SAMPLING
Advantages
• Simpler to implement
• Allows sample selection as the sampling frame is being
constructed
• As precise as SRS if there is random listing
Disadvantage
• Risky to use if there is periodicity in the frame
SYSTEMATIC SAMPLING
Procedure:
1. Assign ID numbers to the units listed in the sampling
frame.
2. Determine k = N/n as the sampling interval.
3. Determine S, the random start, by generating a random
number from 1 to k if k is a whole number, or from 1 to N
otherwise.
4. Take every kth unit thereafter from the sampling frame
until the desired n units are selected.
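The procedure can be sketched for the simpler case where k = N/n is a whole number. The frame of 100 units and the function name `systematic_sample` are hypothetical, and the non-integer case is deliberately left out of this sketch.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Systematic sample with a random start, for a whole-number interval k."""
    N = len(frame)
    if N % n != 0:
        raise ValueError("this sketch only handles a whole-number k = N/n")
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)       # random start S between 1 and k
    ids = [start + j * k for j in range(n)]   # S, S + k, S + 2k, ...
    return [frame[i - 1] for i in ids]

# Hypothetical frame whose units are simply their own ID numbers 1..100.
frame = list(range(1, 101))
sample = systematic_sample(frame, n=10, seed=3)
print(sample)
```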
CLUSTER SAMPLING
The population is first grouped into non-
overlapping subgroups (clusters).
A simple random sample of clusters is selected.
All the units in the chosen clusters are included in
the sample.
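The three statements above can be sketched as follows. The cluster names, their members, and the function name `cluster_sample` are hypothetical.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Select m clusters by SRS and include every unit of the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)                # SRS of cluster names
    units = [u for name in chosen for u in clusters[name]]  # all units in them
    return chosen, units

# Hypothetical non-overlapping clusters (e.g. households grouped by block).
clusters = {
    "C1": ["a", "b"],
    "C2": ["c", "d", "e"],
    "C3": ["f"],
    "C4": ["g", "h"],
}
chosen, units = cluster_sample(clusters, m=2, seed=5)
print(chosen, units)
```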
Cluster sampling
Advantages
• More economical
• Simple field operation
• Frame requirement is very much simpler
• More efficient than SRS if the clusters are heterogeneous internally
Disadvantage
• Efficiency decreases with increasing cluster size
MULTI-STAGE SAMPLING
Primary sampling unit (psu) – largest sampling unit, 1st stage sampling
unit
Secondary sampling unit (ssu) – 2nd largest sampling unit, 2nd stage
sampling unit
Tertiary sampling unit (tsu) – 3rd largest sampling unit, 3rd stage
sampling unit
ultimate sampling unit (usu) – unit whose measurements will be made,
smallest unit
MULTI-STAGE SAMPLING
An extension of cluster sampling (single/one-stage sampling)
Characterized by sampling being done in stages
Usually applied in large-scale surveys
MULTI-STAGE SAMPLING
PROCEDURE:
1. Assign an ID number to each primary sampling unit (psu).
2. Generate n ID numbers randomly. The psu’s
corresponding to the n ID numbers will be the psu’s
used in the survey.
3. Repeat steps 1 and 2 within each selected psu; the units selected
from each psu are the secondary sampling units
(ssu’s).
4. If more stages are required, the same set of procedures
is adopted for each succeeding stage of sampling.
MULTI-STAGE SAMPLING
Advantages
• More economical than selecting units of interest directly
• Simplifies the field operations considerably,
most especially in large-scale surveys
Disadvantage
• Complexity in the analytical formula to be used
ILLUSTRATION: SIMPLE
TWO-STAGE SAMPLING
SAMPLE SIZE
DETERMINATION
REQUIREMENTS
1. Level of confidence: (1 – α) x 100%
2. Maximum allowable/tolerable error (margin/bound of error): B
3. Measure of dispersion of the population of interest: σ²
4. For estimating a proportion, perceived value of the population
proportion, P.
CONSIDERATIONS
Size of the population
 Dispersion of the population
 Available resources (cost and time)
SAMPLE SIZE DETERMINATION
IN SRS TO ESTIMATE THE MEAN

$n = \frac{Z_{\alpha/2}^2 \, \sigma^2}{B^2}$

Where σ² = population variance
$Z_{\alpha/2}$ = standard normal variate with area α/2 to the right
B = error of estimation
REMARKS
If N is known, we use

In practice, the variance σ² is usually unknown. We use s² from prior
studies instead.
EXAMPLE
It is necessary to estimate the average amount of money for a
hospital’s accounts receivable. It is known from prior data that the
standard deviation of these accounts is about 1,250 pesos. If there are
N = 1,000 open accounts, find the sample size needed to estimate the
mean with a margin of error of only 150 pesos. Use a 95% confidence
level.
SOLUTION
Given: N = 1,000; σ = 1,250; B = 150
95% confidence level → α = .05
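The computation for this example can be sketched as below. Two assumptions are made that are not stated in the slides: the large-population formula is taken to be n₀ = (Z σ / B)², and the adjustment for a known N is taken to be the common finite population correction n = n₀ / (1 + n₀/N). The function name `sample_size_mean` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, B, confidence, N=None):
    """Sample size for estimating a mean under SRS (assumed formulas, see text)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z with area alpha/2 to the right
    n0 = (z * sigma / B) ** 2                 # assumed large-population formula
    if N is not None:
        n0 = n0 / (1 + n0 / N)                # assumed finite population correction
    return math.ceil(n0)                      # always round up to a whole unit

# Values from the example: sigma = 1,250; B = 150; 95% confidence; N = 1,000.
n_large = sample_size_mean(1250, 150, 0.95)
n_finite = sample_size_mean(1250, 150, 0.95, N=1000)
print(n_large, n_finite)
```

Under these assumptions the sketch gives 267 accounts when N is ignored and 211 accounts when N = 1,000 is taken into account.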
SAMPLE SIZE DETERMINATION
IN SRS TO ESTIMATE THE TOTAL

Where σ² = population variance
$Z_{\alpha/2}$ = standard normal variate with area α/2 to the right
B = error of estimation
EXAMPLE
It is necessary to estimate the total amount of money receivable
from a hospital’s open accounts. It is known from prior data that
the standard deviation of these accounts is about 1,250 pesos. If there
are N = 1,000 open accounts, find the sample size needed to estimate
the total amount collectible with a margin of error of only 2,000 pesos.
Use a 95% confidence level.
SOLUTION
Given: N = 1,000; σ = 1,250; B = 2,000
95% confidence level → α = .05
SAMPLE SIZE DETERMINATION IN
SRS TO ESTIMATE THE PROPORTION

Where p = perceived value of the population proportion; q = 1 − p
$Z_{\alpha/2}$ = standard normal variate with area α/2 to the right
B = error of estimation
EXAMPLE
It is necessary to estimate the proportion of clients with balances
who will pay their bill within a month. If there are N =1000 open
accounts, find the sample size needed to estimate the proportion with a
margin of error of only 0.10. Use a 90% confidence level.
SOLUTION
Given: N = 1,000; B = 0.10
90% confidence level → α = .10
Assume p = 0.30
EXAMPLE
Suppose that Matt’s Motors wants to conduct a survey to
determine the proportion of its customers who still own their cars 5
years after purchasing them. Suppose it wants to be 95% confident of
being correct to within 0.025 of the true proportion. What sample size
is needed?
ILLUSTRATION
Suppose B is decreased to 0.01
N = 1,000; B = 0.01
90% confidence level → α = .10
Assume p = 0.30
SAMPLE SIZE DETERMINATION IN
SRS TO ESTIMATE THE PROPORTION

If the population size is unknown,

$n = \frac{Z_{\alpha/2}^2 \, p \, q}{B^2}$

SOLUTION
Given: B = 0.025
95% confidence level → α = .05
Assume p = 0.30
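For the Matt's Motors example, a sketch of the computation is shown below. It assumes the standard unknown-population formula n = Z² p q / B² and uses the 95% confidence level stated in the example; the function name `sample_size_proportion` is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p, B, confidence):
    """Sample size for estimating a proportion when N is unknown (assumed formula)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z with area alpha/2 to the right
    q = 1 - p
    return math.ceil(z ** 2 * p * q / B ** 2)  # round up to a whole unit

# Values from the example: B = 0.025, 95% confidence, assumed p = 0.30.
n_matt = sample_size_proportion(p=0.30, B=0.025, confidence=0.95)
print(n_matt)

# With no prior idea of p, p = 0.5 gives the largest (most conservative) n.
n_conservative = sample_size_proportion(p=0.50, B=0.025, confidence=0.95)
```

Under these assumptions the required sample size is 1,291 customers.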
SAMPLE SIZE
DETERMINATION FOR STRS
 Do separate sample size determination for each stratum then
aggregate for the stratified population.
 Determine the ultimate sample size n and allocate this to the different
strata using an allocation procedure.
THE CV IN SAMPLE SIZE
DETERMINATION
When sampling to estimate the mean:

When sampling to estimate the proportion:

Where c = desired CV of the estimate


Cy = CV for the target population
SAMPLE SIZE
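ALLOCATION
• Equal Allocation

$n_i = \frac{n}{k}$

• Proportional Allocation

$n_i = n \cdot \frac{N_i}{N}$
SAMPLE SIZE ALLOCATION
Optimum Allocation

$n_i = n \cdot \frac{N_i \sigma_i / \sqrt{c_i}}{\sum_{j=1}^{k} N_j \sigma_j / \sqrt{c_j}}$

Where
$c_i$ = average cost per unit for stratum i
n = total sample size; k = number of strata; $N_i$ = size of stratum i; $\sigma_i$ = standard deviation of stratum i
SAMPLE SIZE
ALLOCATION
Neyman’s Allocation

$n_i = n \cdot \frac{N_i \sigma_i}{\sum_{j=1}^{k} N_j \sigma_j}$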
TEST OF
STATISTICAL
HYPOTHESIS
TYPES OF HYPOTHESIS
 Null Hypothesis (Ho)
- a statement of no difference (status quo)
 Alternative Hypothesis (Ha)
- a statement taken to be true when Ho is rejected
- also referred to as the “researcher’s hypothesis”
 Errors
Type I error (α) – rejecting Ho when it is true
Type II error (β) – failing to reject Ho when it is false
STATISTICAL
HYPOTHESIS
an assertion or conjecture concerning
parameter(s) or distribution(s) of one or
more populations
STEPS IN HYPOTHESIS
TESTING
1. State the null and alternative hypothesis.
2. Specify the level of significance (α).
3. Identify the test procedure/test statistic.
4. State the decision rule.
5. Compute the value of the test statistic/p-value.
6. Make a statistical decision.
7. State the conclusion to answer the objective.
ONE-TAILED TESTS
Any test of hypothesis where the alternative is one-sided (i.e., Ha:
θ > θ₀ or Ha: θ < θ₀) is called a one-tailed test.
p-VALUE IN HYPOTHESIS
TESTING
 Measures the weight of the evidence for rejecting (or failing to reject)
the null hypothesis
 The smaller the p-value of a test, the heavier the weight of the
evidence for rejecting Ho
TWO-TAILED TESTS
Any test of hypothesis where the alternative is two-sided (i.e., Ha:
θ ≠ θ₀) is called a two-tailed test.
TEST OF HYPOTHESIS ON
ONE POPULATION MEAN
Ho: µ = µ ₀ vs i) Ha: µ ≠ µ ₀
ii) Ha: µ > µ ₀
iii)Ha: µ < µ ₀
Assumptions:
1. The population is normally distributed.
2. The variable is measured in at least the interval scale.
TEST FOR NORMALITY
OF THE POPULATION
Ho: The data come from a normal population.
Ha: The data do not come from a normal population.
Test Procedure: Shapiro-Wilk Test,
Kolmogorov-Smirnov Test,
Cramér-von Mises Test,
Anderson-Darling Test
Decision rule: Reject Ho if p-value < α;
otherwise, fail to reject Ho.
EXAMPLE
It was found that the average time required by workers to complete a
certain manual operation was 29.6 min with a standard deviation of 3.5 min. A
group of 25 workers was randomly chosen to receive special training for
two weeks. After the training, it was found
that their average time to complete the same task was 28 min. Can it be
concluded that the special training speeds up the operation? Assume that
the time to finish this task is normally distributed and use α = 0.05.
TEST OF HYPOTHESIS ON ONE
POPULATION MEAN
Case 1: σ is known
Test statistic:

$Z_c = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$ under Ho

Decision rule: Reject Ho if
i) $|Z_c| > Z_{\alpha/2}$ or p-value < α
ii) $Z_c > Z_{\alpha}$ or p-value/2 < α
iii) $Z_c < -Z_{\alpha}$ or p-value/2 < α
otherwise, fail to reject Ho.
SOLUTION
Let µ = average time of workers to finish the task after training
Ho: µ = 29.6 vs. Ha: µ < 29.6
Test procedure: one-tailed Z-test at α = 0.05
Decision rule: Reject Ho if $Z_c < -Z_{0.05}$
(or p-value < 0.05). Otherwise, fail to reject Ho.
SOLUTION
Computations:

$Z_c = \frac{28 - 29.6}{3.5/\sqrt{25}} = -2.286$

$-Z_{0.05} = -1.645$
Using p-value:
p-value = P(Z < −2.286) = 1 − P(Z < 2.286)
p-value = 0.0111
Decision: Reject Ho
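The Z-test for this example can be reproduced with the standard library. The sketch assumes the usual statistic Z = (x̄ − µ₀)/(σ/√n) and takes all numbers from the example; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Values from the example.
mu0, sigma, n, xbar, alpha = 29.6, 3.5, 25, 28.0, 0.05

z_c = (xbar - mu0) / (sigma / math.sqrt(n))   # computed test statistic
z_crit = NormalDist().inv_cdf(alpha)          # -Z_0.05, lower-tail critical value
p_value = NormalDist().cdf(z_c)               # one-sided p-value, P(Z < z_c)

reject = z_c < z_crit
print(round(z_c, 3), round(z_crit, 3), round(p_value, 4), reject)
```

The statistic is about −2.286, below the critical value of about −1.645, with a one-sided p-value of about 0.011, so Ho is rejected.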
EXAMPLE
It was believed that, on the average, the score of students in the first long
exam in the Basic Statistics subject was 62. A random sample of 15 students from
this semester’s enrollees was selected and the following were their scores: 68, 75,
49, 57, 60, 82, 64, 56, 54, 66, 78, 67, 63, 61, and 70. Assuming the scores in this
exam are normally distributed, is there reason to believe that the students this
semester scored, on the average, higher than in previous years? Use α = 5%.
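TEST OF HYPOTHESIS ON ONE
POPULATION MEAN
Case 2: σ is unknown
Test procedure: one-sample t-test
Test statistic:

$t_c = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{(n-1)}$ under Ho

Decision rule: Reject Ho if:
i) $|t_c| > t_{\alpha/2,(n-1)}$ or p-value < α
ii) $t_c > t_{\alpha,(n-1)}$ or p-value/2 < α
iii) $t_c < -t_{\alpha,(n-1)}$ or p-value/2 < α
otherwise, fail to reject Ho.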
SOLUTION
Let x = First Exam score of STAT 166 students this semester
x ~ N(µ, σ²)

Ho: µ = 62 vs. Ha: µ > 62

Test procedure: one-sample t-test
Decision rule: Reject Ho if $t_c > t_{0.05,(14)}$
(or p-value/2 < 0.05);
otherwise, fail to reject Ho.
SOLUTION
Computations:
$t_c = \frac{64.667 - 62}{9.116/\sqrt{15}} = 1.133$
$t_{0.05,(14)} = 1.761$
Decision: Fail to reject Ho.
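A sketch of the one-sample t-test on the 15 scores is given below. The critical value 1.761 is the tabular value quoted in the solution; it is typed in as a constant because the standard library has no t distribution.

```python
import math
import statistics

scores = [68, 75, 49, 57, 60, 82, 64, 56, 54, 66, 78, 67, 63, 61, 70]
mu0 = 62

n = len(scores)
xbar = statistics.mean(scores)    # sample mean
s = statistics.stdev(scores)      # sample standard deviation (n - 1 divisor)
t_c = (xbar - mu0) / (s / math.sqrt(n))

t_crit = 1.761                    # t_0.05 with 14 df, taken from the solution
reject = t_c > t_crit
print(round(xbar, 3), round(s, 3), round(t_c, 3), reject)
```

The computed statistic is about 1.133, which does not exceed 1.761, in line with the decision to fail to reject Ho.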
TEST OF HYPOTHESIS ON THE
MEANS OF TWO POPULATIONS
Let µ₁ and µ₂ be the means of population 1 and population 2, respectively.

Parameter of interest: µ₁ − µ₂ (difference of the two means)

Hypothesis to be tested:
Ho: µ₁ − µ₂ = d₀ vs. i) Ha: µ₁ − µ₂ ≠ d₀
ii) Ha: µ₁ − µ₂ > d₀
iii) Ha: µ₁ − µ₂ < d₀
Where d₀ = hypothesized difference in means
COMPARING TWO
POPULATIONS
Parameter of interest
Difference of Means (Parametric tests)
Types of samples
1. Independent Samples
2. Related (Paired) Samples
 Matched/Paired observations
 Self-paired observations
REMARK
In situations wherein mere comparison of the population means is
desired, the hypothesized difference d₀ is usually set to zero.

Ho: µ₁ − µ₂ = 0 vs.

Possible alternatives
i) Ha: µ₁ − µ₂ ≠ 0
ii) Ha: µ₁ − µ₂ > 0
iii) Ha: µ₁ − µ₂ < 0
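INDEPENDENT SAMPLES
Assumptions:
1. Data are at least measured in the interval level.
2. Simple random samples of sizes n₁ and n₂ are drawn independently from each
population.
3. The two populations are normally distributed:
X₁ ~ N(µ₁, σ₁²); X₂ ~ N(µ₂, σ₂²)
CASE 1: KNOWN VARIANCES
Test statistic:

$Z_c = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} \sim N(0,1)$ under Ho

where d₀ is the hypothesized difference in means
Decision rule: Reject Ho if
i) $|Z_c| > Z_{\alpha/2}$ or p-value < α
ii) $Z_c > Z_{\alpha}$ or p-value/2 < α
iii) $Z_c < -Z_{\alpha}$ or p-value/2 < α
fail to reject Ho otherwise.
CASE 2: UNKNOWN VARIANCES BUT ASSUMED TO
BE EQUAL
Estimator for µ₁ − µ₂: $\bar{x}_1 - \bar{x}_2$

Standard error: $s_{(\bar{x}_1 - \bar{x}_2)} = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$

where $S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
CASE 2: UNKNOWN VARIANCES
BUT ASSUMED TO BE EQUAL
Test statistic:

$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t_{(n_1 + n_2 - 2)}$ under Ho

Decision rule: Reject Ho if:
i) $|t_c| > t_{\alpha/2,(n_1 + n_2 - 2)}$ or p-value < α
ii) $t_c > t_{\alpha,(n_1 + n_2 - 2)}$ or p-value/2 < α
iii) $t_c < -t_{\alpha,(n_1 + n_2 - 2)}$ or p-value/2 < α
fail to reject Ho, otherwise.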
EXAMPLE
A random sample of students was randomly assigned to either a
section with laboratory or without laboratory. In the section with
laboratory, it was found that the average grade of the 11 students
was 85 with a standard deviation of 4.7. In the section without
laboratory, the 17 students got an average of 79 with a standard
deviation of 6.1. Is there evidence to say that incorporating a laboratory
session improved the students’ performance in the course? Assume
grades to be normally distributed with equal variances between
sections. Use α = 5%.
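SOLUTION
Let X₁ = grade of students in the section with laboratory
X₂ = grade of students in the section without laboratory
X₁ ~ N(µ₁, σ²) and X₂ ~ N(µ₂, σ²)
Ho: µ₁ − µ₂ = 0 vs. Ha: µ₁ − µ₂ > 0
Test statistic: $t_c = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{1/n_1 + 1/n_2}} \sim t_{(26)}$ under Ho
Decision rule: Reject Ho if $t_c > t_{0.05,(26)} = 1.706$;
otherwise, fail to reject Ho.
SOLUTION
$n_1 = 11$; $\bar{x}_1 = 85$; $s_1 = 4.7$
$n_2 = 17$; $\bar{x}_2 = 79$; $s_2 = 6.1$

$S_p^2 = \frac{(11 - 1)(4.7)^2 + (17 - 1)(6.1)^2}{11 + 17 - 2} = 31.3946$

$t_c = \frac{85 - 79}{\sqrt{31.3946}\,\sqrt{1/11 + 1/17}} = 2.767$
SOLUTION
Decision: Since 2.767 > 1.706, we reject Ho.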
Conclusion: At the 5% level of significance, there is sufficient evidence to say
that incorporation of a laboratory session in this course improves the
performance of the students.
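The pooled-variance computation for this example can be verified from the summary statistics alone. The sketch assumes the usual pooled variance and pooled t statistic; the critical value 1.706 is the tabular value quoted in the decision.

```python
import math

# Summary statistics from the example.
n1, xbar1, s1 = 11, 85.0, 4.7   # section with laboratory
n2, xbar2, s2 = 17, 79.0, 6.1   # section without laboratory

# Pooled variance: weighted average of the two sample variances.
sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)

# Pooled t statistic for Ho: mu1 - mu2 = 0.
t_c = (xbar1 - xbar2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

t_crit = 1.706                  # t_0.05 with 26 df, taken from the solution
reject = t_c > t_crit
print(round(sp2, 4), round(t_c, 3), reject)
```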
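REMARKS
For large sample sizes, the t distribution approaches the standard normal distribution. Hence, we may
use the Z-table to approximate tabular values of t for large sample sizes.
CASE 3: UNKNOWN VARIANCES BUT
ARE UNEQUAL
Test statistic:

$t_c = \frac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} \sim t_{(v')}$ under Ho

Decision rule: Reject Ho if:
i) $|t_c| > t_{\alpha/2,(v')}$ or p-value < α
ii) $t_c > t_{\alpha,(v')}$ or (p-value/2 < α and $t_c > 0$)
iii) $t_c < -t_{\alpha,(v')}$ or (p-value/2 < α and $t_c < 0$)
fail to reject Ho, otherwise.
DEGREES OF FREEDOM
v' is the approximate degrees of freedom

$v' = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$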
EXAMPLE
Suppose two groups of students classified as smokers and non-smokers
were randomly selected and asked the number of times they came down with
a cough last year. The data collected are as follows:
Smokers: 3 5 4 2 4 5 6
Non-smokers: 0 1 2 1 2
At the 5% level of significance, can it be said that smokers tend to have more
occurrences of cough than non-smokers?
COMPARISON OF TWO
RELATED SAMPLES
Matching (pairing) of similar individuals:
units that have similar characteristics related to the variable(s) being
investigated are paired/matched.
One member of each pair is assigned to one condition (or assumed to
be coming from one population) while the other is assigned to the other
condition (or assumed to be coming from the other population).
DATA LAYOUT
For a sample of size n,
Group 1 (X₁):    X₁₁  X₁₂  …  X₁ₙ
Group 2 (X₂):    X₂₁  X₂₂  …  X₂ₙ
Difference (D):  D₁   D₂   …  Dₙ
COMPARISON OF TWO
RELATED SAMPLES
Self-pairing – a unit is measured on two
occasions, usually before and after a
treatment.
t-TEST FOR RELATED
SAMPLES
Parameter of interest: µ_D (mean difference)

µ_D = µ₁ − µ₂
PROCEDURE OF
ANALYSIS
1. Test the normality of the two populations.
2. Test the equality of the variances using Levene’s test.
3. Test the equality of the means by using the appropriate t-test.
Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
cough     Folded F  6       4       2.59     0.3771

T-Test
Variable  Method         Variances  DF   t Value  Pr > |t|
cough     Pooled         Equal      10   4.30     0.0016
cough     Satterthwaite  Unequal    9.9  4.66     0.0009
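TEST ON EQUALITY OF
TWO VARIANCES
Ho: σ₁² = σ₂² vs. Ha: σ₁² ≠ σ₂²
Test procedure: F-test
Test statistic: $F_c = \frac{s_1^2}{s_2^2} \sim F_{(df_1,\,df_2)}$ under Ho (with the larger sample variance in the numerator)

where df1 = numerator df
df2 = denominator df
Decision rule: Reject Ho if $F_c > F_{\alpha/2,(df_1,\,df_2)}$;
otherwise, fail to reject Ho.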
TEST OF HYPOTHESIS
Let X₁ = frequency of cough for smokers
X₂ = frequency of cough for non-smokers
X₁ ~ N(µ₁, σ²) and X₂ ~ N(µ₂, σ²)
Ho: µ₁ − µ₂ = 0 vs. Ha: µ₁ − µ₂ > 0
Test statistic: pooled $t_c \sim t_{(10)}$ under Ho
Decision rule: Reject Ho if $t_c > t_{0.05,(10)} = 1.812$;
otherwise, fail to reject Ho.
Decision: Since 4.30 > 1.812, we reject Ho.
Conclusion: At the 5% level of significance, there is evidence to indicate that
smokers have a higher frequency of cough than non-smokers.
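The figures in the software output for this example (folded F, pooled t, and the unequal-variance t with its approximate degrees of freedom) can be reproduced from the raw data. The sketch assumes the standard formulas for each quantity, with the Satterthwaite approximation for the degrees of freedom.

```python
import math
import statistics

smokers = [3, 5, 4, 2, 4, 5, 6]
non_smokers = [0, 1, 2, 1, 2]

n1, n2 = len(smokers), len(non_smokers)
m1, m2 = statistics.mean(smokers), statistics.mean(non_smokers)
v1, v2 = statistics.variance(smokers), statistics.variance(non_smokers)

# Folded F: larger sample variance over the smaller one.
f_c = max(v1, v2) / min(v1, v2)

# Pooled (equal-variance) t statistic with n1 + n2 - 2 degrees of freedom.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Unequal-variance t statistic with Satterthwaite's approximate df.
a, b = v1 / n1, v2 / n2
t_unequal = (m1 - m2) / math.sqrt(a + b)
df_approx = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

print(round(f_c, 2), round(t_pooled, 2), round(t_unequal, 2), round(df_approx, 1))
```

The rounded results match the output table: F = 2.59, pooled t = 4.30, unequal-variance t = 4.66 with about 9.9 degrees of freedom.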
ASSUMPTIONS
1. The data are measured at least in the interval level.
2. Both populations are normally distributed with a common variance,
and the two populations are related.
3. D ~ N(µ_D, σ_D²), where D = X₁ − X₂
TEST OF HYPOTHESIS
Ho: µ_D = d₀ → Ho: µ₁ − µ₂ = d₀ vs.
i) Ha: µ_D ≠ d₀ → Ha: µ₁ − µ₂ ≠ d₀
ii) Ha: µ_D > d₀ → Ha: µ₁ − µ₂ > d₀
iii) Ha: µ_D < d₀ → Ha: µ₁ − µ₂ < d₀

Note: In practice, the most common scenario is testing for the equality
of two means; thus, d₀ = 0.
 
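POINT ESTIMATORS
$\hat{\mu}_D = \bar{D} = \frac{\sum_{i=1}^{n} D_i}{n}$

$s_D^2 = \frac{\sum_{i=1}^{n}(D_i - \bar{D})^2}{n - 1}$

$s_{\bar{D}} = \frac{s_D}{\sqrt{n}}$
TEST OF HYPOTHESIS
Test statistic: $t_c = \frac{\bar{D} - d_0}{s_D/\sqrt{n}} \sim t_{(n-1)}$ under Ho

Decision Rule: Reject Ho if
i) $|t_c| > t_{\alpha/2,(n-1)}$ (or p-value < α)
ii) $t_c > t_{\alpha,(n-1)}$ (or p-value/2 < α)
iii) $t_c < -t_{\alpha,(n-1)}$ (or p-value/2 < α)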
EXAMPLE
Ten households were randomly selected to determine if their real income
(gross income adjusted for the CPI) changed after 5 years. Their real incomes
were computed for the years 2005 and 2010. The data gathered is given in the
table that follows. Test at 5% level of significance.
SOLUTION
Let µ₁ = mean real income in 2005
µ₂ = mean real income in 2010

Ho: µ_D = 0 → µ₁ − µ₂ = 0 → µ₁ = µ₂
(mean real income did not change)
Ha: µ_D ≠ 0 → µ₁ − µ₂ ≠ 0 → µ₁ ≠ µ₂
(mean real income changed)
Household  2005 Income  2010 Income
1          128,486      115,661
2          86,538       57,831
3          106,153      51,277
4          150,000      98,313
5          79,227       42,265
6          153,846      156,988
7          89,346       98,795
8          107,740      105,975
9          220,208      173,494
10         612,370      608,237
SOLUTION
Analysis Variable: diffinic
Mean      Std Dev   Std Error  t Value  Pr > |t|
22507.70  24205.70  7654.51    2.94     0.0165
SOLUTION
Test statistic: $t_c = \frac{\bar{D}}{s_D/\sqrt{n}} \sim t_{(9)}$ under Ho
Decision rule: Reject Ho if p-value < 0.05;
otherwise, fail to reject Ho.
Computations:
$t_c = \frac{22507.70}{7654.51} = 2.94$
p-value = 0.0165
Decision: Since p-value < 0.05, we reject Ho.
Conclusion: At α = 5%, average real income
changed after 5 years.
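The paired t statistic can be recomputed from the household table. The sketch takes each difference as 2005 income minus 2010 income. The two-tailed critical value 2.262 (t with 9 df at α/2 = 0.025) is a standard table value and does not appear in the slides, which use the p-value instead.

```python
import math
import statistics

income_2005 = [128486, 86538, 106153, 150000, 79227,
               153846, 89346, 107740, 220208, 612370]
income_2010 = [115661, 57831, 51277, 98313, 42265,
               156988, 98795, 105975, 173494, 608237]

# Differences D_i for each household (2005 minus 2010).
diffs = [a - b for a, b in zip(income_2005, income_2010)]

n = len(diffs)
d_bar = statistics.mean(diffs)           # mean difference
s_d = statistics.stdev(diffs)            # standard deviation of the differences
std_error = s_d / math.sqrt(n)
t_c = d_bar / std_error                  # test statistic for Ho: mu_D = 0

t_crit = 2.262                           # assumed table value, t_0.025 with 9 df
reject = abs(t_c) > t_crit
print(round(d_bar, 1), round(s_d, 1), round(std_error, 1), round(t_c, 2), reject)
```

The statistic rounds to 2.94, agreeing with the reported output, and the two-tailed test rejects Ho.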
ONE-WAY ANALYSIS OF
VARIANCE (ANOVA)
Ho: µ₁ = µ₂ = … = µ_k vs.
Ha: At least one mean is different from the rest.
Data layout:

Group/treatment
1     2     …   k
x₁₁   x₂₁   …   x_k1
…     …     …   …
TEST OF HYPOTHESIS ON
K POPULATIONS
The k different populations are classified on the basis of a single
criterion such as different treatments or groups.
Random samples of sizes n₁, n₂, …, n_k are selected from each of the k populations and are
used as basis for comparing the means of the k populations.
ASSUMPTIONS
1. The data are measured at least in the interval level.
2. The k populations are normally distributed with means µ₁, µ₂, …, µ_k and with a
common variance σ².
The ANOVA approach
i) Partitions the total variation of the data into two sources: variation among
groups/treatments and variation within groups/treatments; and
ii) Compares these two sources of variation.
 
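COMPUTATIONS
$SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$, $MSW = \frac{SSW}{n - k}$

$SSB = \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$, $MSB = \frac{SSB}{k - 1}$

where n = n₁ + n₂ + … + n_k
ANOVA TABLE
Source of variation  df     SS   MS   F
Between groups       k − 1  SSB  MSB  MSB/MSW
Within groups        n − k  SSW  MSW
Total                n − 1  SST

Test statistic: $F_c = \frac{MSB}{MSW} \sim F_{(k-1,\,n-k)}$ under Ho

Decision rule: Reject Ho if $F_c > F_{\alpha,(k-1,\,n-k)}$;
otherwise fail to reject Ho.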
PAIRWISE MEAN COMPARISONS
• The ANOVA is a powerful tool for testing the homogeneity of a set of means.
However, if we reject the null hypothesis (i.e., the means are not all equal), we
still do not know which means are different.
• Thus, we make pairwise comparisons.
EXAMPLE
The financial structure of a firm refers to the way the firm’s assets are
divided between equity and debt, and the financial leverage refers to the percentage
of assets financed by debt. Financial leverage can be used to increase the rate of return on equity,
i.e., stockholders can receive higher returns on equity with the same amount of
investment by the use of financial leverage. The following data show the rates
of return on equity using 4 levels of financial leverage for 24 randomly selected
firms.
DATA
Financial Leverage
Control (no debt)  Low  Medium  High
2.1                6.2  9.6     10.3
5.6                4.0  8.0     6.9
3.0                8.4  5.5     7.8
7.8                2.8  12.6    5.8
5.2                4.2  7.0     7.2
2.6                5.0  7.8     12.0
SOLUTION
Test of normality for each population:
--------------------------level = 1 -----------------------
Shapiro-Wilk W 0.916001 Pr<W 0.4770
--------------------------level = 2 -----------------------
Shapiro-Wilk W 0.947162 Pr<W 0.7172
--------------------------level = 3 -----------------------
Shapiro-Wilk W 0.941784 Pr<W 0.6736
--------------------------level = 4 -----------------------
Shapiro-Wilk W 0.911698 Pr<W 0.4477
SOLUTION
The ANOVA Procedure
Dependent Variable: returns

Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model            3   80.7683333      26.9227778   5.34     0.0073
Error            20  100.8700000     5.0435000
Corrected Total  23  181.6383333

R-Square  Coeff Var  Root MSE  returns Mean
0.444666  34.24306   2.245774  6.558333

Source  DF  Anova SS     Mean Square  F Value  Pr > F
Level   3   80.76833333  26.92277778  5.34     0.0073
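The sums of squares and the F value in this output can be reproduced directly from the financial leverage data. The sketch assumes the usual one-way ANOVA partition into between-group and within-group variation; the dictionary keys are illustrative labels.

```python
import statistics

# Rates of return on equity for the four leverage levels (from the data table).
groups = {
    "control": [2.1, 5.6, 3.0, 7.8, 5.2, 2.6],
    "low":     [6.2, 4.0, 8.4, 2.8, 4.2, 5.0],
    "medium":  [9.6, 8.0, 5.5, 12.6, 7.0, 7.8],
    "high":    [10.3, 6.9, 7.8, 5.8, 7.2, 12.0],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_values)
k, n = len(groups), len(all_values)

# Between-group sum of squares: group means around the grand mean.
ssb = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: observations around their own group mean.
ssw = sum((x - statistics.mean(g)) ** 2 for g in groups.values() for x in g)

msb = ssb / (k - 1)
msw = ssw / (n - k)
f_c = msb / msw
print(round(ssb, 4), round(ssw, 4), round(msb, 4), round(msw, 4), round(f_c, 2))
```

The group means (4.383, 5.100, 8.417, 8.333) are the same ones that appear in the pairwise comparison output.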
TEST OF HOMOGENEITY OF
VARIANCES: LEVENE’S TEST
Ho: σ₁² = σ₂² = … = σ_k²
Ha: At least one variance is different.

Procedure:
1. Compute the mean for each sample/group.
2. Get the absolute deviation of each observation from its group mean.
3. Perform ANOVA on these deviations.
SOLUTION
Pairwise Mean Comparison
(LSD) for returns

Means with the same letter are not significantly different.

t Grouping  Mean   N  Level
A           8.417  6  3
A           8.333  6  4
B           5.100  6  2
B           4.383  6  1
TEST ON
RELATIONSHIP
BETWEEN
VARIABLES
STRENGTH
 Some associations are quite pronounced (strong association or high
correlation)
 Some associations are relatively weak
RELATIONSHIP OR “CO-
VARIATION”
 When two (or more) variables vary with regards to each other in a
predictable manner, they are said to co-vary
 Co-variation is synonymous to association or correlation.
 important properties investigated are strength and direction.
DIRECTION
Positive or direct association – high values of one variable are
associated with high values of the other, and vice versa
Negative or indirect association – high values of one variable are
associated with low values of the other, and vice versa
ILLUSTRATION
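PEARSON’S PRODUCT MOMENT
CORRELATION COEFFICIENT (ρ)

$\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$, with $\sigma_{XY} = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_X)(Y_i - \mu_Y)$

Where $\sigma_{XY}$ = covariance between X and Y
$\sigma_X$ = standard deviation of the X values
$\sigma_Y$ = standard deviation of the Y values
N = number of paired observations in the population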
PEARSON’S PRODUCT MOMENT
CORRELATION COEFFICIENT (ρ)
A parametric statistic which measures the degree of linear relationship
between two variables measured in at least the interval level
Its values range from -1 to +1
ESTIMATOR

$r = \frac{s_{XY}}{s_X s_Y}$

Where $s_{XY}$ = sample covariance between X and Y
$s_X$, $s_Y$ = sample standard deviations of X and Y, respectively
QUALITATIVE INTERPRETATIONS

Value of |ρ|   Qualitative Interpretation
0              No linear relationship/association
0 - 0.20       Very weak linear association
0.20 - 0.40    Weak linear association
0.40 - 0.60    Moderate linear association
0.60 - 0.80    Strong linear association
0.80 - 1.0     Very strong linear association
1.0            Perfect linear association
TEST OF SIGNIFICANCE
FOR ρ
Ho: ρ = 0 (There is no linear association between X and Y.)

Against one of these:
i) Ha: ρ ≠ 0 (There is a linear association between X and Y.)
ii) Ha: ρ > 0 (There is a positive association between X and Y.)
iii) Ha: ρ < 0 (There is a negative association between X and Y.)
TEST OF SIGNIFICANCE
FOR ρ
Test statistic: $t_c = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}} \sim t_{(n-2)}$ under Ho

Decision rule: Reject Ho if
i) p-value < α
ii) p-value/2 < α
iii) p-value/2 < α
Fail to reject Ho otherwise.
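A sketch of the sample correlation coefficient and the associated t statistic follows. The five data pairs are hypothetical, the function names are illustrative, and the statistic t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom is the standard one, assumed here.

```python
import math

def pearson_r(x, y):
    """Sample correlation: sample covariance over the product of sample SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # The (n - 1) divisors of the covariance and the SDs cancel out.
    return sxy / math.sqrt(sxx * syy)

def t_for_r(r, n):
    """t statistic for Ho: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical paired observations (illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t_c = t_for_r(r, len(x))
print(round(r, 4), round(t_c, 4))
```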
EXAMPLE
A sociologist believes that the suicide rate and divorce rate of the states
in the US are positively correlated. He obtained the following data on
suicide rate (per 100,000) and divorce rate for selected states. Is there sufficient evidence to
indicate that the suicide rate and divorce rate are positively correlated? Use
a 5% level of significance.
DATA
Suicide rate:  8.4   11.6  13.1  11.9  9.9   7.7   7.1   13.2  14.0
Divorce rate:  29.1  28.9  30.6  26.0  18.4  10.7  25.3  24.0  25.7
SOLUTION
Ho: ρ = 0 (There is no linear association between suicide rate and divorce
rate.)
Ha: ρ > 0 (There is a positive association between suicide rate and divorce
rate.)

Decision rule: Reject Ho if p-value < 0.05
Fail to reject Ho otherwise.

Computations:
r = 0.524 p-value = 0.1472
SOLUTION
Decision: Since 0.1472 > 0.05, we fail to reject Ho.

Conclusion: At α = 5%, there is no significant linear association between
suicide rate and divorce rate.
SPEARMAN’S RANK-ORDER
CORRELATION COEFFICIENT (ρs)
A non-parametric alternative to the Pearson’s correlation coefficient
Used when the variables are measured in at least the ordinal level
Uses ranks instead of the original data measurements
Qualitative interpretation is the same as that of Pearson’s
SPEARMAN’S RANK-ORDER
CORRELATION COEFFICIENT (ρs)
Rank X and Y independently and obtain the differences of their ranks, dᵢ =
R(Xᵢ) − R(Yᵢ).

$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

Qualitative interpretation is the same as that of Pearson’s.
TEST OF SIGNIFICANCE
FOR ρs
Ho: ρs = 0 (There is no linear association between X and Y.)

Against one of these:
i) Ha: ρs ≠ 0 (There is a linear association between X and Y.)
ii) Ha: ρs > 0 (There is a positive linear association between X and Y.)
iii) Ha: ρs < 0 (There is a negative linear association between X and Y.)
EXAMPLE
A tutor rated the suitability of 7 randomly selected students for a job
and recorded their knowledge score in psychology as follows:

Student      1  2  3  4  5  6  7
Rate in Job  4  7  3  1  6  2  5
Score        5  6  4  2  7  3  1

At 1% level of significance, is there evidence to show that suitability rating
and knowledge score are related?
SOLUTION
Ho: ρs = 0 (There is no linear association between suitability
rating and knowledge score.)
Ha: ρs ≠ 0 (There is a linear association between suitability
rating and knowledge score.)
Decision rule: Reject Ho if p-value < 0.05
Fail to reject Ho otherwise.
Computations:
rs = 0.607 with p-value = 0.1482
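The coefficient in this example can be reproduced by hand. The sketch below assumes the usual no-ties formula, r_s = 1 − 6Σd²/(n(n² − 1)), and uses the fact that both rows of the table are already ranks from 1 to 7. The function name is illustrative.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's rank correlation for two lists of untied ranks."""
    n = len(rank_x)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Values from the example table; each row is a permutation of 1..7.
job_rating = [4, 7, 3, 1, 6, 2, 5]
knowledge_score = [5, 6, 4, 2, 7, 3, 1]

r_s = spearman_from_ranks(job_rating, knowledge_score)
print(round(r_s, 3))
```

The sum of squared rank differences is 22, so r_s = 1 − 132/336, which agrees with the value reported in the computations.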
Decision: Since 0.1482 > 0.05, we fail to reject Ho.

Conclusion: There is no significant association between suitability rating
and knowledge score.
CHI-SQUARE BASED
MEASURES OF ASSOCIATION
Computed using count data
Used to determine if two categorical variables are independent (i.e.,
not associated)
CONTINGENCY TABLE
                    Variable B
Variable A    1     2     ...   c     Total
1             O11   O12   ...   O1c   R1
2             O21   O22   ...   O2c   R2
.             .     .     .     .     .
r             Or1   Or2   ...   Orc   Rr
Total         C1    C2    ...   Cc    n
ASSUMPTIONS
The labels or classes are non-overlapping.
Each entity/unit should belong in only one class
No more than 20% of the classes should have expected frequencies less
than 5.
No class should have an expected frequency less than 1.
EXAMPLE
A simple random sample of 1,847 individuals was obtained and
classified according to gender and whether or not they believe in the
Filipino talent. Is there evidence (at α = 5%) that there is an association
between gender and the belief in the Filipino talent?
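PROCEDURE
Ho: X and Y are independent.
Ha: X and Y are associated.
Test statistic: $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$ under Ho
where $O_{ij}$ = observed frequency in row $i$ and column $j$, and
$E_{ij} = \frac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$ = expected frequency under Ho

Decision rule: Reject Ho if p-value < α;
otherwise, fail to reject Ho.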
CONTINGENCY TABLE:
          Do you believe in the Filipino talent?
Gender    No     Yes    Total
Male      315    489    804
Female    346    697    1043
Total     661    1186   1847
SOLUTION
Ho: Gender and belief in the Filipino talent are independent.
Ha: Gender and belief in the Filipino talent are associated.
Test procedure: χ² test of independence
Decision rule: Reject Ho if p-value < 0.05;
otherwise, fail to reject Ho.
Computations:
χ² = 7.126 p-value = 0.008
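The computation for this table can be sketched with the standard library alone. Expected counts are (row total × column total)/n. The p-value uses the identity P(χ² > x) = erfc(sqrt(x/2)), which holds only for 1 degree of freedom, as in a 2 x 2 table. Variable names are illustrative.

```python
import math

# Observed counts from the contingency table: rows = Male, Female; columns = No, Yes.
observed = [[315, 489],
            [346, 697]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n  # expected count under independence
        chi_square += (o_ij - e_ij) ** 2 / e_ij

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 3), round(p_value, 3))
```

For larger tables the degrees of freedom are (r − 1)(c − 1), and the tail probability would have to come from a chi-square table or a statistics library instead.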
SOLUTION
Decision: Since 0.008 < 0.05, we reject Ho.
Conclusion: At α = 5%, there is a significant association between gender
and believing in the Filipino talent.
MEASURES OF ASSOCIATION
BASED ON THE CHI-SQUARE
The chi-square test provides little information about the strength (i.e.,
weak or strong) and form (positive or negative) of the association
between two variables.
When two categorical variables are significantly associated according to
the chi-square test of independence, then we can compute measures of
association to assess the strength of the relationship.
CHI-SQUARE BASED
MEASURES OF ASSOCIATION
1. Phi Coefficient
- appropriate for 2 x 2 contingency tables
- ranges from 0 to 1

$\phi = \sqrt{\chi^2 / n}$
where n = sample size
CHI-SQUARE BASED
MEASURES OF ASSOCIATION
2. Contingency Coefficient
- can be used for higher order contingency tables
- ranges from 0 to less than 1
- generally cannot attain the upper limit of 1
- upper limit depends on the size of the contingency table (e.g., a 4 x 4
table has a maximum of 0.87)

$C = \sqrt{\chi^2 / (\chi^2 + n)}$
QUALITATIVE
INTERPRETATION
Value of the coefficient    Qualitative Interpretation
Less than 0.10              Weak association
Between 0.10 and 0.30       Moderate association
More than 0.30              Strong association
CHI-SQUARE BASED
MEASURES OF ASSOCIATION
3. Cramer’s V
- can be used for any dimension of the contingency table
- ranges from 0 to 1
- equal to the Phi coefficient when one of the dimensions is 2

$V = \sqrt{\chi^2 / (n(q - 1))}$
where q = min(r, c)
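As a minimal sketch, the three measures can be computed from a chi-square statistic using the usual textbook definitions: phi = sqrt(χ²/n), C = sqrt(χ²/(χ² + n)), and V = sqrt(χ²/(n(q − 1))). The inputs are the χ² = 7.126 and n = 1847 reported for the gender example. Function names are illustrative.

```python
import math


def phi_coefficient(chi_square, n):
    """Phi coefficient, intended for 2 x 2 tables."""
    return math.sqrt(chi_square / n)


def contingency_coefficient(chi_square, n):
    """Contingency coefficient C; always below 1."""
    return math.sqrt(chi_square / (chi_square + n))


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V with q = min(number of rows, number of columns)."""
    q = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (q - 1)))


chi_square, n = 7.126, 1847  # values reported for the gender example
phi = phi_coefficient(chi_square, n)
c = contingency_coefficient(chi_square, n)
v = cramers_v(chi_square, n, n_rows=2, n_cols=2)
print(round(phi, 4), round(c, 4), round(v, 4))
```

All three values come out near 0.06. Under the qualitative scale given in this section, anything below 0.10 is a weak association, so a statistically significant test does not by itself imply a strong relationship.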
REGRESSION
ANALYSIS
FORMS OF RELATIONSHIP
BETWEEN X AND Y
REGRESSION ANALYSIS
Used to find a possible functional relationship between two variables X
and Y, where X and Y are paired variables (i.e., both are measured on
the same units)
The ultimate objective is to predict Y (dependent or response variable)
given the value of X (independent or explanatory variable)
THE SIMPLE LINEAR
REGRESSION MODEL
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where $Y_i$ = observed value of Y
$X_i$ = observed value of X
$\beta_0$ = regression constant (true Y-intercept)
$\beta_1$ = regression coefficient (true change in Y per unit increase in X)
$\varepsilon_i$ = random error associated with $Y_i$ for given $X_i$
ASSUMPTIONS
- Normality of residuals
- The residuals (observed minus predicted values of Y) must be normally
distributed.
- This can be assessed by producing a histogram of the residuals and/or a
normal probability plot.
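ESTIMATION BASED ON
SRS OF SIZE n
$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$, $b_0 = \bar{Y} - b_1 \bar{X}$

Predicting equation: $\hat{Y} = b_0 + b_1 X$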
EVALUATION OF THE SAMPLE
REGRESSION EQUATION
An overall measure of the adequacy of the predicting equation is provided
by the coefficient of determination
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
where SSR, SSE and SST are the regression, error and total sums of squares.
$R^2$ gives the proportion of variation in Y that is accounted for by X. It ranges
from 0 to 1 (or 0% to 100%). The nearer $R^2$ is to 1, the better is the fit of the
regression line.
TESTING THE
SIGNIFICANCE OF β1
Ho: β1 = 0 (Y is not linearly dependent on X)

against
i) Ha: β1 ≠ 0 (Y is linearly dependent on X)
ii) Ha: β1 > 0 (Y is positively linearly dependent on X)
iii) Ha: β1 < 0 (Y is negatively linearly dependent on X)
EXAMPLE
A random sample of 10 households in a certain municipality was
observed for the amount spent on groceries per week (Y) and the
number of household members (X). Do the following data support the
claim that the bigger the household, the higher the expenditure for
groceries per week? Use α = 5%.
HH no.   Amount Spent on Groceries/wk (Y)   Number of HH members (X)
1        457.50                             2
2        331.90                             2
3        683.30                             3
4        1,069.20                           4
5        358.60                             1
6        1,306.20                           5
7        1,400.00                           5
8        850.00                             3
9        200.00                             1
10       1,800.00                           4
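The least-squares fit for these ten households can be sketched by hand. It uses the standard formulas b1 = Sxy/Sxx and b0 = Ȳ − b1·X̄, with R² computed as Sxy²/(Sxx·Syy). Variable names are illustrative.

```python
# Data from the example table.
members = [2, 2, 3, 4, 1, 5, 5, 3, 1, 4]
spent = [457.50, 331.90, 683.30, 1069.20, 358.60,
         1306.20, 1400.00, 850.00, 200.00, 1800.00]

n = len(members)
x_bar = sum(members) / n
y_bar = sum(spent) / n

# Corrected sums of squares and cross-products.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(members, spent))
s_xx = sum((x - x_bar) ** 2 for x in members)
s_yy = sum((y - y_bar) ** 2 for y in spent)

b1 = s_xy / s_xx                       # slope: change in weekly spending per extra member
b0 = y_bar - b1 * x_bar                # intercept
r_squared = s_xy ** 2 / (s_xx * s_yy)  # proportion of variation in Y explained by X

print(round(b0, 2), round(b1, 2), round(r_squared, 4))
```

The slope, intercept and R² obtained this way agree with the Parameter Estimates and R-Square entries in the regression output reported for this example.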
Analysis of Variance
Source            DF   Sum of squares   Mean square   F Value   Pr>F
Model             1    2032031          2032031       29.93     0.0006
Error             8    543195           67899
Corrected Total   9    2575226

Root MSE         260.57510   R-Square   0.7891
Dependent Mean   845.67000   Adj R-Sq   0.7627
Coeff Var        30.81286

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr>|t|
Intercept   1    -110.58000           193.24766        -0.57     0.5829
Members     1    318.75000            58.26636         5.47      0.0006
MULTIPLE
LINEAR
REGRESSION
REVIEW
 Correlation indicates the strength of linear relationship between two
variables.
 Simple linear regression describes the form of the linear relationship
between two variables.
 For linear regression to be valid, the following assumptions must hold:
 Linearity
 Normality of residuals
 Constant variance
NOTE
- A single dependent variable may be affected by two or more predictor
variables working together.
- Prediction of the dependent variable becomes more efficient if the
collective effect of these variables is considered, since the unexplained
variation in the dependent variable is reduced.
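THE MULTIPLE LINEAR
REGRESSION MODEL
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \varepsilon_i$
where $Y_i$ = value of the dependent variable in the $i$th observation
$\beta_0$ = regression constant
$\beta_j$ = regression coefficient for $X_j$
$X_{ji}$ = value of the $j$th independent variable in the $i$th observation
$\varepsilon_i$ = random error term associated with $Y_i$
ESTIMATION
Predicting equation:
$\hat{Y} = b_0 + b_1 X_1 + \dots + b_p X_p$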
EVALUATION OF THE SAMPLE
REGRESSION EQUATION
$R^2$ = coefficient of multiple determination
= the proportion of the variance in Y that can be explained by the
p predictors together
TEST OF SIGNIFICANCE
OF β’S
Ho: βj = 0 (Y does not linearly depend on Xj)
Ha: βj ≠ 0 (Y linearly depends on Xj)

Test statistic: $t = b_j / SE(b_j)$, which follows Student’s t distribution
with n - p - 1 degrees of freedom under Ho

Decision rule: Reject Ho if p-value < α;
otherwise, fail to reject Ho.
LIMITATIONS OF
REGRESSION ANALYSIS
Can only ascertain relationships but can never be sure about
underlying causal mechanism
The number of observations must exceed the number of independent
variables
The problem of multicollinearity
TEST OF SIGNIFICANCE
OF THE MODEL
Ho: β1 = β2 = … = βp = 0 (Y does not linearly depend on any of the X’s.)
Ha: At least one βj ≠ 0, j = 1, 2, …, p (Y linearly depends on at least one X.)

Test statistic: $F = MSR / MSE$
Decision rule: Reject Ho if p-value < α;
fail to reject Ho otherwise.
EXAMPLE
Consider the ten observations on the expenditure on groceries and the
household sizes. Information on gross income was also gathered. Fit the
multiple linear regression model using the expenditure on groceries as the
dependent variable.
HH no.   Amount Spent on Groceries/wk (Y)   Number of HH members
1        457.50                             2
2        331.90                             2
3        683.30                             3
4        1,069.20                           4
5        358.60                             1
6        1,306.20                           5
7        1,400.00                           5
8        850.00                             3
9        200.00                             1
10       1,800.00                           4
Analysis of Variance
Source            DF   Sum of squares   Mean square   F Value   Pr>F
Model             2    2105495          1052748       15.69     0.0026
Error             7    469731           67104
Corrected Total   9    2575226

Root MSE         259.04521   R-Square   0.8176
Dependent Mean   845.67000   Adj R-Sq   0.7655
Coeff Var        30.81286

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr>|t|
Intercept   1    -110.58000           193.24766        -0.57     0.5829
Members     1    423.43477            115.60894        3.66      0.0080
Income      1    -0.01837             0.01755          -1.05     0.3302
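The summary statistics in this output follow directly from the sums of squares in the Analysis of Variance table. A minimal sketch, using the usual definitions R² = SSR/SST, adjusted R² = 1 − MSE/(SST/(n − 1)) and F = MSR/MSE (variable names are illustrative):

```python
# Sums of squares and degrees of freedom copied from the Analysis of Variance table.
ss_model, df_model = 2105495, 2
ss_error, df_error = 469731, 7
ss_total, df_total = 2575226, 9

ms_model = ss_model / df_model
ms_error = ss_error / df_error

r_squared = ss_model / ss_total
adj_r_squared = 1 - ms_error / (ss_total / df_total)
f_value = ms_model / ms_error

print(round(r_squared, 4), round(adj_r_squared, 4), round(f_value, 2))
```

Adding Income raises R-Square from 0.7891 to 0.8176, but Adj R-Sq moves only from 0.7627 to 0.7655, which is consistent with the non-significant t test for Income.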
CATEGORICAL
DATA
ANALYSIS
CATEGORICAL RESPONSE
DATA
- Measurement scale consisting of a set of categories
Examples:
- Choice of accommodation (house, condominium, apartment)
- Whether patient survives operation (yes, no)
- Mental illness (schizophrenia, depression, neurosis)
- Alligator’s primary food choice (fish, invertebrate, reptile)
- Consumer’s preference among leading brands of product (Brand A,
Brand B, Brand C)
TEST FOR A SINGLE
PROPORTION
Parameter of interest: P = proportion of the population which possesses
the characteristic of interest

Ho: P = Po, where Po is the hypothesized value of P
Against one of the following:
a) Ha: P > Po
b) Ha: P < Po
c) Ha: P ≠ Po
TEST FOR A SINGLE
PROPORTION
Test procedure: approximate Z-test

Test statistic: $Z = \frac{\hat{p} - P_o}{\sqrt{P_o(1 - P_o)/n}}$, where $\hat{p}$ is the sample proportion and n is the sample size
Decision rule: Reject Ho if
a) $Z > z_{\alpha}$
b) $Z < -z_{\alpha}$
c) $|Z| > z_{\alpha/2}$
Fail to reject Ho otherwise.
EXAMPLE
Do the majority of adults in the Philippines believe that a pregnant
woman should be able to obtain an abortion? Of 893 respondents, 400
replied “yes” and 493 replied “no”. Use α = 5%.

Parameter of interest: P = proportion of adults who believe that a
pregnant woman should be able to obtain an abortion
EXAMPLE
Ho: P = 0.5; The proportion of adults who believe that a
pregnant woman should be able to obtain an abortion is
0.5
Ha: P > 0.5; The proportion of adults who believe that a
pregnant woman should be able to obtain an abortion is
greater than 0.5
Test procedure: one-tailed approximate Z-test
EXAMPLE
Test statistic: $Z = \frac{\hat{p} - P_o}{\sqrt{P_o(1 - P_o)/n}}$
Decision rule: Reject Ho if $Z > z_{0.05} = 1.645$
Fail to reject Ho otherwise
EXAMPLE
Computations:
$Z = \frac{400/893 - 0.5}{\sqrt{0.5(0.5)/893}} = -3.1121$
Decision: Since -3.1121 < 1.645, we fail to reject Ho.
Conclusion: At 5% level of significance, the sample does not provide
sufficient evidence that the proportion of Filipino adults who believe that a
pregnant woman should be able to obtain an abortion is more than 50%.
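The statistic in this example can be checked with a short sketch of the approximate Z-test for one proportion. It assumes the form Z = (p̂ − Po)/sqrt(Po(1 − Po)/n), and the function name is illustrative.

```python
import math


def one_proportion_z(successes, n, p0):
    """Approximate Z statistic for Ho: P = p0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # computed under Ho
    return (p_hat - p0) / standard_error


z = one_proportion_z(successes=400, n=893, p0=0.5)
reject_h0 = z > 1.645  # one-tailed test of Ha: P > 0.5 at the 5% level

print(round(z, 4), reject_h0)
```

Because the sample proportion (about 0.448) is below 0.5, Z is negative and can never fall in the upper-tail rejection region.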
COMPARISON OF TWO
PROPORTIONS
Parameter of interest: P1 - P2 = difference between the two proportions

Cases:
Independent samples
Related (Paired) samples
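COMPARISON OF TWO
PROPORTIONS: Independent Samples
Case
Ho: P1 - P2 = d0, where d0 is the hypothesized value of the difference
between two proportions

Against one of the following:
a) Ha: P1 - P2 > d0
b) Ha: P1 - P2 < d0
c) Ha: P1 - P2 ≠ d0
COMPARISON OF TWO
PROPORTIONS: Independent Samples
Case
Test procedure: approximate Z-test

Test statistic:
When d0 ≠ 0:
$Z = \frac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}}$
COMPARISON OF TWO
PROPORTIONS: Independent Samples
Case
Test statistic:
When d0 = 0:
$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$
where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled sample proportion, and $x_1$, $x_2$ are the numbers of successes in the two samples
COMPARISON OF TWO
PROPORTIONS: Independent Samples
Case
Decision rule: Reject Ho if
a) $Z > z_{\alpha}$
b) $Z < -z_{\alpha}$
c) $|Z| > z_{\alpha/2}$
fail to reject Ho otherwise.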
EXAMPLE
A clothing company is interested in determining whether women enjoy
shopping for clothes more than men do. A random sample of customers
was selected and asked the question, “Do you enjoy shopping for
clothing?” The results are shown in the following table. Use α = 5%.
EXAMPLE
         Enjoy shopping?
Group    Yes    No       Total
Women    189    10,845   11,034
Men      104    10,933   11,037

P = proportion of customers who enjoyed shopping for clothes
(P1 for women, P2 for men)
EXAMPLE
Ho: P1 - P2 = 0; The proportions of women and men customers who
enjoyed shopping for clothes are equal.
Ha: P1 - P2 > 0; The proportion of women customers who enjoyed
shopping is greater than the proportion of men
who enjoyed shopping.
Test procedure: one-tailed approximate Z-test
EXAMPLE
Test statistic: $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/n_1 + 1/n_2)}}$

Decision rule: Reject Ho if $Z > z_{0.05} = 1.645$;
fail to reject Ho otherwise
EXAMPLE
Computations:
$\hat{p} = \frac{189 + 104}{11034 + 11037} = \frac{293}{22071} = 0.0133$

$Z = \frac{189/11034 - 104/11037}{\sqrt{\hat{p}(1 - \hat{p})(1/11034 + 1/11037)}} = 5.0014$
EXAMPLE
Decision: Since 5.0014 > 1.645, we reject Ho.

Conclusion: At α = 5%, there is reason to believe that women enjoy shopping
for clothes more than men do.
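The pooled two-proportion Z statistic for this example can be sketched as follows. It assumes the hypothesized difference is 0, so a single pooled proportion is used in the standard error. The function name is illustrative.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for Ho: P1 - P2 = 0; returns (z, pooled proportion)."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error, pooled


# Women: 189 "yes" out of 11,034; men: 104 "yes" out of 11,037.
z, pooled = two_proportion_z(189, 11034, 104, 11037)
reject_h0 = z > 1.645  # one-tailed test at the 5% level

print(round(pooled, 4), round(z, 4), reject_h0)
```

Both sample proportions are small (under 2%), yet the very large sample sizes make the difference between them highly significant.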
COMPARISON OF TWO
PROPORTIONS: Related Samples Case
- Observations in the sample can be observed on two occasions, as in a
“before and after” type of research.
- The observations can therefore be placed in a two-way table as follows.

           After
Before     Success   Failure
Success    SS        SF
Failure    FS        FF
COMPARISON OF TWO
PROPORTIONS: Related Samples
Case
Test
  statistic:
When ≠
EXAMPLE
Prior to a television debate of two candidates, a random sample of
100 voters expressed their choices. After the debate, the same 100 voters
expressed their choices again. The results are given in a table in the next
slide. Using α = 5%, test if there is a change in voters’ decision after the
debate.
COMPARISON OF TWO
PROPORTIONS: Related Samples Case
 
Test statistic:

When ≠
EXAMPLE
                     After the debate
Before the debate    Jez    Dar
Jez                  63     21
Dar                  4      12
Let P1 = proportion of voters who chose Jez before the debate
P2 = proportion of voters who chose Jez after the debate
EXAMPLE
Ho: P1 - P2 = 0; The proportions of voters who chose Jez before and
after the debate are equal.
Ha: P1 - P2 ≠ 0; The proportions of voters who chose Jez before and after
the debate are not equal.

Test procedure: two-tailed approximate Z-test
EXAMPLE
Test statistic: $Z = \frac{SF - FS}{\sqrt{SF + FS}}$, where SF and FS are the
numbers of voters who changed their choice (Jez to Dar, and Dar to Jez)

Decision rule: Reject Ho if $|Z| > z_{0.025} = 1.96$
Fail to reject Ho otherwise.
EXAMPLE
Computation:
$Z = \frac{21 - 4}{\sqrt{21 + 4}} = \frac{17}{5} = 3.4$

Decision: Since 3.4 > 1.96, we reject Ho.

Conclusion: There is a change in voters’ decision after the debate at α = 5%.
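For related samples with a hypothesized difference of 0, only the voters who changed their choice carry information. A minimal sketch of the statistic Z = (SF − FS)/sqrt(SF + FS), the normal-approximation form commonly known as McNemar's test, applied to the debate table (the function name is illustrative):

```python
import math


def paired_proportion_z(sf, fs):
    """Z statistic for paired proportions, based on the discordant counts only."""
    return (sf - fs) / math.sqrt(sf + fs)


# Discordant cells of the debate table:
# 21 voters moved from Jez to Dar, 4 voters moved from Dar to Jez.
z = paired_proportion_z(sf=21, fs=4)
reject_h0 = abs(z) > 1.96  # two-tailed test at the 5% level

print(z, reject_h0)
```

The 63 and 12 voters who kept their original choice do not enter the statistic at all.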
MODELING
CATEGORICAL DATA
BINARY LOGISTIC
REGRESSION MODEL
- Used to study the relationship between a binary response variable
and explanatory variables (of any type)
- The log odds that the “event of interest” will occur is expressed as a
linear function of one or more categorical and/or continuous independent
variables.
- Y = 1 if the event of interest occurs and Y = 0 otherwise
BINARY LOGISTIC REGRESSION
MODEL: CONTINUOUS PREDICTOR
Given: Binary response variable Y(0-1) Single explanatory variable X
(interval/ratio)
APPLICATIONS
 Probability of having a particular disease
 Probability of buying a particular product
 Probability of surviving a particular treatment
 Probability of passing a course
 Probability that weight exceeds 50 kg
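MODELS
Log odds of Y=1:
$\text{logit}(Y=1) = \ln \frac{P(Y=1)}{1 - P(Y=1)} = \beta_0 + \beta_1 X$
Odds of Y=1:
$\text{odds}(Y=1) = e^{\beta_0 + \beta_1 X}$
Probability of Y=1:
$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$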
STEPS
1. Fit the model and estimate its parameters.
2. Test the significance of the estimates.
3. Evaluate the model adequacy/model fit.
4. Interpret the model.
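INTERPRETATION
$\text{odds}(Y=1) = e^{\beta_0 + \beta_1 X}$

For $\beta_1 > 0$, every unit increase in X increases the odds of Y=1
multiplicatively by $e^{\beta_1}$.
INTERPRETATION
$\text{logit}(Y=1) = \beta_0 + \beta_1 X$

For $\beta_1 > 0$, every unit increase in X increases the LOG odds of Y=1 by $\beta_1$.
INTERPRETATION
$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

For $\beta_1 > 0$, every unit increase in X increases the probability of Y=1,
by an amount that depends on the current value of X.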
EXAMPLE
Y = passing or failing the Licensure Examination for Teachers (LET)
X = weekly average number of hours spent in reviewing
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
Intercept   1    -2.2136    0.9988           4.9119            0.0267
Hours       1    0.0704     0.0267           6.9643            0.0083

Ho: β1 = 0   Ha: β1 ≠ 0

Reject Ho.
At α = 5%, the probability of passing the LET is dependent on the
weekly average number of hours spent in reviewing.
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
Intercept   1    -2.2136    0.9988           4.9119            0.0267
Hours       1    0.0704     0.0267           6.9643            0.0083

logit(Y=1) = -2.2136 + 0.0704 Hours
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
Intercept   1    -2.2136    0.9988           4.9119            0.0267
Hours       1    0.0704     0.0267           6.9643            0.0083

Ho: Not adequate   Ha: Adequate

Reject Ho.
At α = 5%, the model is adequate.
EXAMPLE
logit(Y=1) = -2.2136 + 0.0704 Hours

For every one-hour increase in the weekly average number of hours spent in
reviewing, the LOG ODDS of passing the LET increases by 0.0704.

For every one-hour increase in the weekly average number of hours
spent in reviewing, the ODDS of passing the LET increases
MULTIPLICATIVELY by exp(0.0704) = 1.0729.
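The fitted model can also be turned into predicted probabilities. The sketch below uses the two estimates from the output (-2.2136 and 0.0704) and the inverse-logit transformation. The review hours of 10, 20 and 40 are hypothetical inputs chosen only for illustration.

```python
import math

B0 = -2.2136  # intercept estimate from the output
B1 = 0.0704   # estimate for Hours from the output


def passing_probability(hours):
    """Predicted probability of passing the LET for a given weekly review time."""
    log_odds = B0 + B1 * hours
    return 1 / (1 + math.exp(-log_odds))


odds_ratio = math.exp(B1)  # multiplicative change in the odds per extra hour
print("odds ratio per hour:", round(odds_ratio, 4))

for hours in (10, 20, 40):  # hypothetical review hours
    print(hours, round(passing_probability(hours), 3))
```

The odds change by the same factor for every extra hour, but the change in probability is not constant: it depends on where on the curve the starting value lies.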
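BINARY LOGISTIC REGRESSION
MODEL: Categorical Predictor
Given: Binary response variable Y (0,1), single explanatory variable D
(nominal/ordinal)
MODELS
Log odds of Y=1:
$\text{logit}(Y=1) = \beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}$
Odds of Y=1:
$\text{odds}(Y=1) = e^{\beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}}$
Probability of Y=1:
$P(Y=1) = \frac{e^{\beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}}}{1 + e^{\beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}}}$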
STEPS
1. Construct the dummy variables, D1, D2, …, D(I-1).
2. Fit the model and estimate its parameters.
3. Test the significance of the estimates.
4. Evaluate the model adequacy/model fit.
5. Interpret the model.
CONSTRUCTION OF THE
DUMMY VARIABLES
For a nominal variable X with I categories, there will be (I - 1) dummy
variables, with one category serving as the reference category (all
dummies equal to 0), e.g., for I = 3:

X                         D1   D2
Category 1 (reference)    0    0
Category 2                1    0
Category 3                0    1
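Dummy coding of this kind can be sketched in a few lines. The function name `dummy_code` and the sample values are hypothetical. The category labels are borrowed from the treatment example in this section, with placebo as the reference category.

```python
def dummy_code(values, levels, reference):
    """Return one row of 0/1 dummies per value, skipping the reference level."""
    non_reference = [level for level in levels if level != reference]
    return [[1 if value == level else 0 for level in non_reference]
            for value in values]


levels = ["placebo", "basic", "advanced"]              # I = 3 categories
treatment = ["placebo", "basic", "advanced", "basic"]  # hypothetical observations

dummies = dummy_code(treatment, levels, reference="placebo")
for value, row in zip(treatment, dummies):
    print(value, row)
```

Each observation gets I − 1 = 2 dummies, and the reference category is the only one coded with all zeros.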
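INTERPRETATION
$\text{odds}(Y=1) = e^{\beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}}$

For $\beta_j > 0$, the odds of Y=1 is higher by a factor of $e^{\beta_j}$ for
category j relative to the reference category.
INTERPRETATION
$\text{logit}(Y=1) = \beta_0 + \beta_1 D_1 + \dots + \beta_{I-1} D_{I-1}$

For $\beta_j > 0$, the LOG odds of Y=1 is higher by $\beta_j$ for category j
relative to the reference category.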
EXAMPLE
Y = survival
X = treatment (2 = advanced, 1 = basic, 0 = placebo)
Analysis of Maximum Likelihood Estimates
Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
Intercept     1    0.0772     0.5093           0.0230            0.8795
TREATMENT 1   1    1.2221     0.5745           4.5250            0.0334
TREATMENT 2   1    2.3207     0.7893           8.6450            0.0033

logit(Y=1) = 0.0772 + 1.2221 TREATMENT1 + 2.3207 TREATMENT2
Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr>ChiSq
Likelihood Ratio   55.9138      2    <.0001
Score              46.4730      2    <.0001
Wald               21.1462      2    <.0001

Ho: Not adequate   Ha: Adequate

Reject Ho.
At α = 5%, the model is adequate.
Analysis of Maximum Likelihood Estimates
Parameter     DF   Estimate   Standard Error   Wald Chi-Square   Pr>ChiSq
Intercept     1    0.0772     0.5093           0.0230            0.8795
TREATMENT 1   1    1.2221     0.5745           4.5250            0.0334
TREATMENT 2   1    2.3207     0.7893           8.6450            0.0033

Ho: β1 = 0   Ha: β1 ≠ 0
Reject Ho.
At α = 5%, the probability of surviving is higher for basic treatment compared
to placebo.
EXAMPLE
logit(Y=1) = 0.0772 + 1.2221 TREATMENT1 + 2.3207 TREATMENT2

The LOG ODDS of surviving is higher by 1.2221 for basic treatment
relative to placebo.
The odds of surviving is exp(1.2221) ≈ 3.39 times as high for basic treatment
relative to placebo.
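The estimates in this example can be converted into odds ratios and predicted survival probabilities. A minimal sketch, assuming TREATMENT 1 is the dummy for basic treatment and TREATMENT 2 the dummy for advanced treatment, with placebo as the reference category:

```python
import math

# Estimates copied from the Analysis of Maximum Likelihood Estimates table.
INTERCEPT = 0.0772
B_BASIC = 1.2221     # TREATMENT 1 (assumed: basic vs. placebo)
B_ADVANCED = 2.3207  # TREATMENT 2 (assumed: advanced vs. placebo)


def survival_probability(basic, advanced):
    """Predicted P(survival) for given 0/1 dummy values."""
    log_odds = INTERCEPT + B_BASIC * basic + B_ADVANCED * advanced
    return 1 / (1 + math.exp(-log_odds))


or_basic = math.exp(B_BASIC)        # odds ratio, basic vs. placebo
or_advanced = math.exp(B_ADVANCED)  # odds ratio, advanced vs. placebo

p_placebo = survival_probability(0, 0)
p_basic = survival_probability(1, 0)
p_advanced = survival_probability(0, 1)

print(round(or_basic, 2), round(or_advanced, 2))
print(round(p_placebo, 3), round(p_basic, 3), round(p_advanced, 3))
```

An odds ratio of about 3.4 does not mean the survival probability is 3.4 times higher: the predicted probability rises from roughly 0.52 under placebo to roughly 0.79 under basic treatment.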
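BINARY LOGISTIC REGRESSION
MODEL: Multiple Predictors
Given: Binary response variable Y (0,1)
K explanatory variables, X1, X2, …, XK
(continuous and categorical)
MODELS
Probability of Y=1:
$P(Y=1) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_K X_K}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_K X_K}}$
MODELS
Log odds of Y=1:
$\text{logit}(Y=1) = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$
Odds of Y=1:
$\text{odds}(Y=1) = e^{\beta_0 + \beta_1 X_1 + \dots + \beta_K X_K}$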
STEPS
1. Construct the dummy variables, if necessary.
2. Fit the model and estimate its parameters.
3. Test the significance of the estimates.
4. Evaluate the model adequacy/model fit, choosing the best model in
the process.
5. Interpret the final model.
EXAMPLE
A study on the risk factors associated with low birth weight was
conducted.

Y = birth weight (1 = low, 0 = otherwise)

Xs: AGE = age of mother (in years)
VISIT = number of physician visits during pregnancy
WGT = weight at the last menstrual period (in pounds)
LABOR = history of premature labor (1 = yes, 0 = no)
HYPER = history of hypertension (1 = yes, 0 = no)
GOALS OF MODELING
 The model is complex enough to fit the data well.
 The model should be parsimonious.
EXAMPLE
The final model is
Logit (Y=1)=2.91100.17WGT+1.4071.894
MULTICATEGORY LOGISTIC
REGRESSION MODEL
- Nominal responses
- Y is categorical with multiple responses
- X is categorical or continuous

Models: Baseline category logits
Discrete choice models
MULTICATEGORY LOGISTIC
REGRESSION MODEL
- Ordinal responses
- Y is ordinal with multiple responses
- X is categorical or continuous

Models: Cumulative logit models with proportional odds property
Paired-category ordinal logit models: adjacent-category logits
