BASICS OF STATISTICS
Francisco S. Antonio
Associate Prof. IV
CATEGORIES OF
STATISTICS
Descriptive statistics
Inferential statistics
WHAT IS STATISTICS?
A science which deals with the
collection, organization,
presentation, analysis and
interpretation of data.
A set of numbers or figures or
processed data
DESCRIPTIVE STATISTICS
Deals with data description to acquire
information or knowledge
Involves organization and presentation
of data
INFERENTIAL STATISTICS
Deals with making conclusions or generalizations
about a population of interest (a large set) when
only a part (smaller set) of it is examined
Involves random sampling
EXAMPLE
A president of a certain state university wanted to determine
if the majority of the faculty members of state universities in
Region II are in favor of a change in the academic calendar. He
then took a random sample of faculty members in the
region and asked them whether or not they are in favor of the
change. Based on the random sample, it was concluded that
the majority of the faculty members in the region are indeed in
favor of the change in the academic calendar.
BASIC CONCEPTS
Universe
- set of objects, individuals or entities under study
Variable
- characteristic of interest which is measured or observed from the
elements of the universe
Population
-set of all possible values of a variable corresponding to the entire
collection of units
BASIC CONCEPTS
Sample
- subset of the universe or population
BASIC CONCEPTS
[Figure: each unit U1, U2, U3, ..., UN of the universe has a corresponding value Y1, Y2, Y3, ..., YN of the variable Y]
TYPES OF VARIABLES
Qualitative
-take on values which are not numerical
Quantitative
- take on numerical values which indicate the amount of the characteristic
LEVELS OF
MEASUREMENT
Nominal
Ordinal
Interval
Ratio
NOMINAL
Lowest level of measurement
Responses are categories or labels
Counts or frequencies and percentages can be obtained per category
ORDINAL
Responses are categories or labels which can be
ranked or ordered
Difference between two responses is meaningless
INTERVAL
Responses are categories or labels which can be ranked or ordered
Difference between two responses has meaning
A zero data point in this level is arbitrary
Addition and subtraction of responses are possible
RATIO
Highest level of measurement
All properties of the interval level hold
Zero value in this level means lack of that characteristic
All mathematical operations are possible
MEASURES OF LOCATION
Minimum (min) - lowest value in the data set
Maximum (max) - highest value in the data set
Measures of Central Tendency - give the middle value of the data set
-mean(µ)
-median(Md)
-mode(Mo)
Quantiles
-percentiles, deciles, quartiles
SOME DESCRIPTIVE
MEASURES
Measures of Location
Measure of Dispersion
Skewness
MEAN
The sum of all the observations in the data set divided by the total
number of observations.
PROPERTIES OF THE
MEAN
A data set can only have one mean.
It can only be computed for quantitative data sets.
The magnitude of each observation contributes to the value of the
mean.
It is easily affected by extremely high or low observations.
MEDIAN
The middle value in the data set, after arranging the observations in
increasing or decreasing order.
MODE
The observation in the data set that occurs most frequently.
PROPERTIES
A data set can have more than one mode.
It can be used to describe a quantitative or qualitative data set.
It is not determined by the magnitude of the observations.
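The properties of the mean, median and mode listed above can be checked with a small sketch. The data values below are hypothetical and only serve as an illustration; `statistics` is Python's standard library module.

```python
from statistics import mean, median, multimode

# Hypothetical quantitative data set (not from the slides)
data = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = data + [100]  # same data plus one extremely high observation

data_mean = mean(data)        # sum of observations / number of observations
data_median = median(data)    # middle value of the ordered data
data_modes = multimode(data)  # most frequent observation(s)

# The mean is pulled toward the extreme value; the median barely moves.
outlier_mean = mean(with_outlier)
outlier_median = median(with_outlier)

# The mode also works for qualitative data, and there can be more than one.
labels = ["A", "B", "B", "A", "C"]  # hypothetical categories
label_modes = multimode(labels)

print(data_mean, data_median, data_modes)
print(round(outlier_mean, 2), outlier_median)
print(label_modes)
```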
MEASURES OF
DISPERSION
Describes the spread or variability of the
observations in a data set
The higher the value, the greater the variability of
the data set
QUANTILES
Percentiles- divide the array of data
into 100 equal parts
Deciles- divide the array of data into
ten equal parts
Quartiles- divide the array of data into
four equal parts
MEASURES OF
DISPERSION
Range (R)
Variance ($\sigma^2$ or $s^2$)
Standard Deviation ($\sigma$ or $s$)
Coefficient of Variation (CV)
RANGE
The difference between the highest and
lowest observations in the data set
VARIANCE
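The average of the squared deviations of the observations from the mean
$\sigma^2 = \dfrac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$ (population)
$s^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{x})^2}{n - 1}$ (sample)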
PROPERTIES OF THE
RANGE
It is quick to compute and easy to understand.
It is a rough measure of dispersion.
It is usually reported together with the median.
PROPERTIES OF THE
VARIANCE
One of the most useful measures of dispersion
All observations in the data set contribute to the magnitude of the
variance
Can only take on values of at least zero
The unit of measure is the square of the unit of measure of the data set
STANDARD DEVIATION
The positive square root of the variance
$\sigma = \sqrt{\sigma^2}$ (population); $s = \sqrt{s^2}$ (sample)
PROPERTIES OF THE
COEFFICIENT OF VARIATION
Has no unit of measure
Can be used to compare dispersion of two or more data sets with the
same or different units of measurements
COEFFICIENT OF
VARIATION
The ratio of the standard deviation to the mean of the data set, usually expressed as a percentage:
$CV = \dfrac{s}{\bar{x}} \times 100\%$
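As a rough illustration of the measures of dispersion, the sketch below computes the range, variance, standard deviation and coefficient of variation for a hypothetical data set. The data values are assumptions made only for this example.

```python
from statistics import mean, pvariance, variance, stdev

# Hypothetical data set (not from the slides)
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)   # highest minus lowest observation
pop_var = pvariance(data)            # divides by N (population variance)
samp_var = variance(data)            # divides by n - 1 (sample variance)
samp_sd = stdev(data)                # square root of the sample variance
cv_percent = samp_sd / mean(data) * 100  # unit-free, in percent

print(data_range, pop_var, round(samp_var, 4), round(samp_sd, 4), round(cv_percent, 2))
```

Because the coefficient of variation has no unit, it can be compared across data sets measured in different units.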
MEASURE OF SKEWNESS
This measures the departure of the data set from symmetry and is denoted by SK.
SYMMETRIC
DISTRIBUTIONS
NON-SYMMETRIC
DISTRIBUTIONS
SAMPLING
TECHNIQUES
BASIC CONCEPTS
SAMPLE – SUBSET OF THE
POPULATION OR UNIVERSE TAKEN
TO REPRESENT THE WHOLE
SAMPLING – THE PROCESS OF
SELECTING A SAMPLE
PARAMETER – DESCRIPTIVE
MEASURE OF THE POPULATION
STATISTIC – DESCRIPTIVE MEASURE
OF THE SAMPLE
INFERENTIAL STATISTICS
DEALS WITH MAKING
GENERALIZATIONS ABOUT A
POPULATION WHEN ONLY PART OF IT
IS EXAMINED
IT MAKES USE OF THE INDUCTIVE
METHOD OF DRAWING CONCLUSIONS
INVOLVES RANDOM SAMPLING
AREAS OF CONCERN OF
INFERENTIAL STATISTICS
1. Estimation
2. Test of hypothesis
WHY DO WE SAMPLE?
Reduce Cost
Greater accuracy and efficiency
Timeliness
Greater scope
Nature of testing procedure
CLASSIFICATION OF
SAMPLING
Non – Probability sampling
No objective procedure in sample selection
Probabilities of selection of units are not known
No inferences about the universe/ population can be made
CLASSIFICATION OF
SAMPLING
Non – Probability sampling
Probability sampling
CLASSIFICATION OF
SAMPLING
Non – Probability sampling
accidental sampling
Convenience sampling
Purposive sampling
Quota sampling
CLASSIFICATION OF
SAMPLING
Probability sampling
Utilizes an objective procedure in sample selection (randomization
procedure)
probabilities of selection of units are known
valid inferences about the universe/population can be made
SAMPLE FRAME
Complete list of all the elements of the universe or population
Types:
- directory
- Map
CLASSIFICATION OF
SAMPLING
Probability sampling
Types:
SRS without replacement (SRSWOR)
- No repeats in the sample selection
SRS with replacement(SRSWR)
- Repeats in the sample selection is allowed
SIMPLE RANDOM
SAMPLING
PROCEDURE:
1. From the obtained sampling frame, assign an ID number (from 1 to
N) to each unit.
2. Using any randomization technique, generate n ID numbers. The
units corresponding to the n ID numbers generated will serve
as a simple random sample.
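The two-step procedure above can be sketched as follows. The function name `simple_random_sample` and its parameters are hypothetical; Python's `random` module stands in for "any randomization technique".

```python
import random

def simple_random_sample(N, n, with_replacement=False, seed=None):
    """Draw n ID numbers from the frame IDs 1..N (illustrative helper)."""
    rng = random.Random(seed)
    if with_replacement:
        # SRSWR: the same ID may be drawn more than once
        return [rng.randint(1, N) for _ in range(n)]
    # SRSWOR: no repeats in the selected IDs
    return rng.sample(range(1, N + 1), n)

# Hypothetical frame of N = 1000 units, sample of n = 10
srswor_ids = simple_random_sample(1000, 10, seed=1)
srswr_ids = simple_random_sample(1000, 10, with_replacement=True, seed=1)
print(srswor_ids)
print(srswr_ids)
```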
SIMPLE RANDOM
SAMPLING
Advantages Disadvantage
B = error of estimation
EXAMPLE
It is necessary to estimate the total amount of money receivable
from a hospital’s open accounts. It is known from prior data that
the standard deviation of these accounts is about 1250 pesos. If there
are N = 1000 open accounts, find the sample size needed to estimate
the total amount collectible with a margin of error of only 2000 pesos.
Use a 95% confidence level.
SAMPLE SIZE DETERMINATION IN
SRS TO ESTIMATE THE PROPORTION
Proportional Allocation
SAMPLE SIZE ALLOCATION
Optimum Allocation
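Test statistic:
$Z_c = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$
Decision rule: Reject Ho if
i) $|Z_c| > Z_{\alpha/2}$ or p-value < $\alpha$
ii) $Z_c > Z_{\alpha}$ or p-value/2 < $\alpha$
iii) $Z_c < -Z_{\alpha}$ or p-value/2 < $\alpha$
otherwise, fail to reject Ho.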
SOLUTION
Let
$\mu$ = average time of workers to finish the task after training
Ho: $\mu$ = 29.6 vs. Ha: $\mu$ < 29.6
Test procedure: one-tailed Z-test at $\alpha$ = 0.05
Decision rule: Reject Ho if $Z_c < -Z_{0.05}$
(or p-value < 0.05). Otherwise, fail to reject Ho.
SOLUTION
Computations:
$Z_c = -2.286$; $-Z_{0.05} = -1.645$
Using p-value:
p-value = P(Z < -2.286) = 1 - P(Z < 2.286)
p-value ≈ 0.0111
Decision: Reject Ho
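The p-value step can be checked numerically. This is a minimal sketch that evaluates the standard normal CDF through `math.erf`; the helper name `normal_cdf` is an assumption of this example.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative probability P(Z < z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z_c = -2.286              # computed test statistic from the solution
alpha = 0.05
p_value = normal_cdf(z_c)  # left-tailed test: P(Z < z_c), about 0.011
reject_ho = p_value < alpha

print(round(p_value, 4), reject_ho)
```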
EXAMPLE
It was believed that, on the average, the score of students in the first long
exam in the Basic Statistics subject was 62. A random sample of 15 students among
this semester’s enrollees was selected and the following were their scores: 68, 75,
49, 57, 60, 82, 64, 56, 54, 66, 78, 67, 63, 61, and 70. Assuming the scores in this
exam are normally distributed, is there reason to believe that the students this
semester scored, on the average, higher than in previous years? Use $\alpha$ = 5%.
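TEST OF HYPOTHESIS ON ONE POPULATION MEAN
Case 2. $\sigma$ is unknown
Test procedure: one-sample t-test
Test statistic: $t_c = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{(n-1)}$ under Ho
Ho: $\mu = \mu_0$ vs.
Possible alternatives
i) Ha: $\mu \neq \mu_0$
ii) Ha: $\mu > \mu_0$
iii) Ha: $\mu < \mu_0$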
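For the exam-score example, the one-sample t statistic can be computed as in the sketch below. It assumes the right-tailed alternative Ha: $\mu$ > 62 and takes the tabular value t(0.05, 14) = 1.761 from a t-table rather than computing it.

```python
from math import sqrt
from statistics import mean, stdev

scores = [68, 75, 49, 57, 60, 82, 64, 56, 54, 66, 78, 67, 63, 61, 70]
mu0 = 62                      # hypothesized mean under Ho
n = len(scores)

x_bar = mean(scores)
s = stdev(scores)             # sample standard deviation (n - 1 divisor)
t_c = (x_bar - mu0) / (s / sqrt(n))

t_tab = 1.761                 # tabular t at alpha = 0.05 with n - 1 = 14 df
reject_ho = t_c > t_tab       # right-tailed decision rule

print(round(x_bar, 2), round(s, 3), round(t_c, 3), reject_ho)
```

Under these assumptions the computed t does not exceed the tabular value, so Ho is not rejected.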
INDEPENDENT SAMPLES
Assumptions:
1. Data are at least measured in the interval level.
2. Simple random samples of sizes $n_1$ and $n_2$ are drawn independently from each
population.
3. The two populations are normally distributed:
$X_1 \sim N(\mu_1, \sigma_1^2)$; $X_2 \sim N(\mu_2, \sigma_2^2)$
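CASE 2: UNKNOWN VARIANCES BUT ASSUMED TO
BE EQUAL
Estimator for $\mu_1 - \mu_2$: $\bar{x}_1 - \bar{x}_2$
Standard error: $SE(\bar{x}_1 - \bar{x}_2) = s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$
where $s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$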
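CASE 1: KNOWN VARIANCES
Test statistic:
$Z_c = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$, where $d_0$ is the hypothesized value of $\mu_1 - \mu_2$
Decision rule: Reject Ho if
i) $|Z_c| > Z_{\alpha/2}$ or p-value < $\alpha$ (Ha: $\mu_1 - \mu_2 \neq d_0$)
ii) $Z_c > Z_{\alpha}$ or p-value/2 < $\alpha$ (Ha: $\mu_1 - \mu_2 > d_0$)
iii) $Z_c < -Z_{\alpha}$ or p-value/2 < $\alpha$ (Ha: $\mu_1 - \mu_2 < d_0$)
fail to reject Ho otherwise.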
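CASE 2: UNKNOWN VARIANCES
BUT ASSUMED TO BE EQUAL
Test statistic:
$t_c = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{s_p\sqrt{1/n_1 + 1/n_2}} \sim t_{(n_1 + n_2 - 2)}$ under Ho
Decision rule: Reject Ho if:
i) $|t_c| > t_{\alpha/2,\,n_1+n_2-2}$ or p-value < $\alpha$ (Ha: $\mu_1 - \mu_2 \neq d_0$)
ii) $t_c > t_{\alpha,\,n_1+n_2-2}$ or p-value/2 < $\alpha$ (Ha: $\mu_1 - \mu_2 > d_0$)
iii) $t_c < -t_{\alpha,\,n_1+n_2-2}$ or p-value/2 < $\alpha$ (Ha: $\mu_1 - \mu_2 < d_0$)
fail to reject Ho, otherwise.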
EXAMPLE
A random sample of students was randomly assigned to either a
section with laboratory or a section without laboratory. In the section with
laboratory, it was found that the average grade of the 11 students
was 85 with a standard deviation of 4.7. In the section without
laboratory, the 17 students got an average of 79 with a standard
deviation of 6.1. Is there evidence to say that incorporating a laboratory
session improved the students' performance in the course? Assume
grades to be normally distributed with equal variances between
sections. Use α=5%.
SOLUTION
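$\bar{x}_1$ = 85; $s_1$ = 4.7; $n_1$ = 11
$\bar{x}_2$ = 79; $s_2$ = 6.1; $n_2$ = 17
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \dfrac{10(4.7)^2 + 16(6.1)^2}{26} = 31.3946$
$t_c = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} = 2.767$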
SOLUTION
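Let
X₁ = grade of students in the section with laboratory
X₂ = grade of students in the section without laboratory
X₁ ~ $N(\mu_1, \sigma^2)$ and X₂ ~ $N(\mu_2, \sigma^2)$
Ho: $\mu_1 - \mu_2 = 0$ vs. Ha: $\mu_1 - \mu_2 > 0$
Test statistic: $t_c = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} \sim t_{(26)}$ under Ho
Decision rule: Reject Ho if $t_c > t_{0.05,\,26} = 1.706$;
otherwise, fail to reject Ho.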
SOLUTION
Decision: Since 2.767 > 1.706, we reject Ho.
Conclusion: At the 5% level of significance, there is sufficient evidence to say
that incorporation of a laboratory session in this course improves the
performance of the students.
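The computations for this example can be reproduced from the summary statistics alone. The sketch below assumes the pooled-variance t-test for independent samples; the tabular value 1.706 is the one used in the decision above.

```python
from math import sqrt

# Summary statistics from the laboratory example
x1_bar, s1, n1 = 85, 4.7, 11   # section with laboratory
x2_bar, s2, n2 = 79, 6.1, 17   # section without laboratory

# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# Test statistic under Ho: mu1 - mu2 = 0
t_c = (x1_bar - x2_bar) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

t_tab = 1.706                  # tabular t at alpha = 0.05 with 26 df
reject_ho = t_c > t_tab

print(round(sp2, 4), round(t_c, 3), reject_ho)
```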
REMARKS
For large sample sizes, the t-distribution approaches the standard normal distribution. Hence, we may
use the Z-table to approximate tabular values of t for large sample sizes.
DEGREES OF FREEDOM
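$v'$ is the approximate degrees of freedom
$v' = \dfrac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$
where $s_1^2$ and $s_2^2$ are the sample variances and $n_1$ and $n_2$ are the sample sizes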
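CASE 3: UNKNOWN VARIANCES BUT
ARE UNEQUAL
Test statistic: $t_c = \dfrac{(\bar{x}_1 - \bar{x}_2) - d_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
Decision rule: Reject Ho if:
i) $|t_c| > t_{\alpha/2,\,v'}$ or p-value < $\alpha$
ii) $t_c > t_{\alpha,\,v'}$, or p-value/2 < $\alpha$ and $t_c$ > 0
iii) $t_c < -t_{\alpha,\,v'}$, or p-value/2 < $\alpha$ and $t_c$ < 0
fail to reject Ho, otherwise.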
EXAMPLE
Suppose two groups of students classified as smokers and non-smokers
were randomly selected and asked the number of times they came down with
a cough last year. The data collected are as follows:
Smokers: 3 5 4 2 4 5 6
Non-smokers: 0 1 2 1 2
At the 5% level of significance, can it be said that smokers tend to have more
occurrences of cough than the non-smokers?
COMPARISON OF TWO
RELATED SAMPLES
Matching (pairing) of similar individuals
Units that have similar characteristics related to the variable(s) being
investigated are paired/matched.
One member of each pair is assigned to one condition (or assumed to
be coming from one population) while the other is assigned to the other
condition (or assumed to be coming from the other population).
DATA LAYOUT
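For a sample of size n,
Pair                           1          2          …   n
Group 1 ($X_1$)                $X_{11}$   $X_{12}$   …   $X_{1n}$
Group 2 ($X_2$)                $X_{21}$   $X_{22}$   …   $X_{2n}$
Difference ($D = X_1 - X_2$)   $D_1$      $D_2$      …   $D_n$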
COMPARISON OF TWO
RELATED SAMPLES
Self-pairing – a unit is measured on two
occasions, usually before and after a
treatment.
t-TEST FOR RELATED
SAMPLES
Parameter of interest: $\mu_D$ (mean difference)
$\mu_D = \mu_1 - \mu_2$
PROCEDURE OF
ANALYSIS
1. Test the normality of the two populations.
2. Test the equality of the variances using Levene’s test.
3. Test the equality of the means by using the appropriate t-test.
Equality of Variances
Variable  Method    Num DF  Den DF  F Value  Pr > F
cough     Folded F  6       4       2.59     0.3771
T-Test
Variable  Method         Variances  DF   t Value  Pr > |t|
cough     Pooled         Equal      10   4.30     0.0016
cough     Satterthwaite  Unequal    9.9  4.66     0.0009
TEST ON EQUALITY OF
TWO VARIANCES
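Ho: $\sigma_1^2 = \sigma_2^2$ vs. Ha: $\sigma_1^2 \neq \sigma_2^2$
Test procedure: F-test
Test statistic: $F_c = \dfrac{s_1^2}{s_2^2} \sim F_{(df_1,\,df_2)}$ under Ho
where $df_1$ = numerator df
$df_2$ = denominator df
Decision rule: Reject Ho if $F_c$ exceeds the tabular value of F with $(df_1, df_2)$ degrees of freedom;
otherwise, fail to reject Ho.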
TEST OF HYPOTHESIS
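Let
$X_1$ = frequency of cough for smokers
$X_2$ = frequency of cough for non-smokers
$X_1 \sim N(\mu_1, \sigma^2)$ and $X_2 \sim N(\mu_2, \sigma^2)$
Ho: $\mu_1 - \mu_2 = 0$ vs. Ha: $\mu_1 - \mu_2 > 0$
Test statistic: $t_c = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} \sim t_{(10)}$ under Ho
Decision rule: Reject Ho if $t_c > t_{0.05,\,10} = 1.812$;
otherwise, fail to reject Ho.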
Decision: Since 4.3>1.812, we reject Ho.
Conclusion: At 5% level of significance, there is evidence to indicate that
smokers have more frequency of cough than non-smokers.
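The values in the software output for the cough data (F = 2.59, pooled t = 4.30, Satterthwaite t = 4.66 with about 9.9 df) can be reproduced by hand. The sketch below uses the standard pooled and Satterthwaite formulas; variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

smokers = [3, 5, 4, 2, 4, 5, 6]
non_smokers = [0, 1, 2, 1, 2]

n1, n2 = len(smokers), len(non_smokers)
v1, v2 = variance(smokers), variance(non_smokers)   # sample variances
diff = mean(smokers) - mean(non_smokers)

# Folded F: larger sample variance over the smaller one
f_c = max(v1, v2) / min(v1, v2)

# Pooled (equal variances) t statistic with n1 + n2 - 2 df
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = diff / sqrt(sp2 * (1 / n1 + 1 / n2))

# Unequal-variances t statistic with Satterthwaite approximate df
a, b = v1 / n1, v2 / n2
t_unequal = diff / sqrt(a + b)
df_approx = (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

print(round(f_c, 2), round(t_pooled, 2), round(t_unequal, 2), round(df_approx, 1))
```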
ASSUMPTIONS
1. The data are measured at least in the interval level.
2. Both populations are normally distributed with a common variance;
and the two populations are related.
3. $D \sim N(\mu_D, \sigma_D^2)$
TEST OF HYPOTHESIS
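Ho: $\mu_D = d_0$ → Ho: $\mu_1 - \mu_2 = d_0$ vs.
i) Ha: $\mu_D \neq d_0$ → Ha: $\mu_1 - \mu_2 \neq d_0$
ii) Ha: $\mu_D > d_0$ → Ha: $\mu_1 - \mu_2 > d_0$
iii) Ha: $\mu_D < d_0$ → Ha: $\mu_1 - \mu_2 < d_0$
Note: In practice, the most common scenario is testing for the equality
of two means; thus, $d_0 = 0$.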
POINT ESTIMATORS
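$\bar{D} = \dfrac{\sum_{i=1}^{n} D_i}{n}$
$s_D = \sqrt{\dfrac{\sum_{i=1}^{n} (D_i - \bar{D})^2}{n - 1}}$
$SE(\bar{D}) = \dfrac{s_D}{\sqrt{n}}$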
TEST OF HYPOTHESIS
Test statistic: $t_c = \dfrac{\bar{D} - d_0}{s_D/\sqrt{n}} \sim t_{(n-1)}$ under Ho
1 128,486 115,661
2 86,538 57,831
3 106,153 51,277
4 150,000 98,313
5 79,227 42,265
6 153,846 156,988
7 89,346 98,795
8 107,740 105,975
9 220,208 173,494
10 612,370 608,237
SOLUTION
Analysis Variable: diffinic
Mean      Std Dev   Std Error  t Value  Pr > |t|
22507.70  24205.70  7654.51    2.94     0.0165
SOLUTION
Test statistic: $t_c = \dfrac{\bar{D}}{s_D/\sqrt{n}} \sim t_{(9)}$ under Ho
Decision rule: Reject Ho if p-value < 0.05;
otherwise, fail to reject Ho.
Computations:
$t_c$ = 2.94
p-value = 0.0165
Decision: Since p-value < 0.05, we reject Ho.
Conclusion: At $\alpha$ = 5%, average real income
changed after 5 years.
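The paired t statistic in the output can be recomputed from the ten pairs listed in the data table. In this sketch the column names `first` and `second` are assumptions, since the table headers did not survive; the difference is taken as first column minus second column.

```python
from math import sqrt
from statistics import mean, stdev

# Ten pairs from the data table (column labels are assumed)
first = [128486, 86538, 106153, 150000, 79227, 153846, 89346, 107740, 220208, 612370]
second = [115661, 57831, 51277, 98313, 42265, 156988, 98795, 105975, 173494, 608237]

diffs = [a - b for a, b in zip(first, second)]
n = len(diffs)

d_bar = mean(diffs)            # point estimate of the mean difference
s_d = stdev(diffs)             # standard deviation of the differences
se = s_d / sqrt(n)             # standard error of the mean difference
t_c = d_bar / se               # test statistic under Ho: mu_D = 0

print(round(d_bar, 1), round(s_d, 1), round(se, 1), round(t_c, 2))
```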
ONE-WAY ANALYSIS OF
VARIANCE (ANOVA)
Ho: $\mu_1 = \mu_2 = \dots = \mu_k$ vs.
Ha: At least one mean is different from the rest.
Data layout:
Group/treatment
1 2 … K
… … … …
…
TEST OF HYPOTHESIS ON
K POPULATIONS
The k different populations are classified on the basis of a single
criterion such as different treatment or groups.
Random samples of sizes $n_1, n_2, \dots, n_k$ are selected from each of the k populations and are
used as basis for comparing the means of the k populations.
ASSUMPTIONS
1. The data are measured at least in the interval level.
2. The k populations are normally distributed with means $\mu_1, \mu_2, \dots, \mu_k$ and with a
common variance $\sigma^2$.
The ANOVA approach
i) Partitions the total variation of the data into two sources: variation among
groups/treatments and variation within groups/treatments; and
ii) Compares these two sources of variation.
COMPUTATIONS
MSW
MSB
ANOVA TABLE
Procedure:
1. Compute the mean for each sample/group.
2. Get the absolute deviation of each observation from its group mean.
3. Perform ANOVA on these deviations.
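The three steps above (an ANOVA on absolute deviations from group means) can be sketched as follows. The helper names `one_way_anova_f` and `levene_f` are hypothetical, and the F statistic is the usual ratio MSB/MSW; the tabular F value and p-value are not computed here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = MSB / MSW for a list of groups (each a list of numbers)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation among groups and variation within groups
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

def levene_f(groups):
    """ANOVA F computed on absolute deviations from each group mean."""
    abs_dev = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_anova_f(abs_dev)

# Small hypothetical groups with identical spread but shifted means
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
toy_anova_f = one_way_anova_f(toy)   # means differ
toy_levene_f = levene_f(toy)         # spreads are identical

# Cough data from the smokers example, used only as a second illustration
cough_levene_f = levene_f([[3, 5, 4, 2, 4, 5, 6], [0, 1, 2, 1, 2]])
print(toy_anova_f, toy_levene_f, round(cough_levene_f, 3))
```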
SOLUTION
Pair Mean Comparison
(LSD) for returns
Divorce
rate 29.1 28.9 30.6 26.0 18.4 10.7 25.3 24.0 25.7
SOLUTION
Decision:
Since 0.1472 > 0.05, we fail to reject Ho.
Computations:
r=0.524 p-value = 0.1472
SPEARMAN’S RANK-ORDER
CORRELATION COEFFICIENT (ρs)
A non-parametric alternative to the Pearson’s correlation coefficient
Used when the variables are measured in at least the ordinal level
Uses ranks instead of the original data measurements
Qualitative interpretation is the same as that of Pearson’s
SPEARMAN’S RANK-ORDER
CORRELATION COEFFICIENT (ρs)
Rank X and Y independently and obtain the differences of their ranks, $d_i = R(X_i) - R(Y_i)$.
Student 1 2 3 4 5 6 7
Rate in Job 4 7 3 1 6 2 5
Score 5 6 4 2 7 3 1
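For the seven students above, Spearman's coefficient can be computed with the usual no-ties formula $\rho_s = 1 - \dfrac{6\sum d_i^2}{n(n^2 - 1)}$. The sketch assumes that the tabled values are already ranks from 1 to 7, which is how they appear.

```python
# Values from the table, assumed to be ranks already (1 to 7, no ties)
job_rating_rank = [4, 7, 3, 1, 6, 2, 5]
score_rank = [5, 6, 4, 2, 7, 3, 1]

n = len(job_rating_rank)

# Squared differences of the paired ranks
sum_d2 = sum((x - y) ** 2 for x, y in zip(job_rating_rank, score_rank))

# Spearman's rank-order correlation (formula valid when there are no ties)
rho_s = 1 - 6 * sum_d2 / (n * (n**2 - 1))

print(sum_d2, round(rho_s, 4))
```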
ASSUMPTIONS
The labels or classes are non-overlapping.
Each entity/unit should belong in only one class
No more than 20% of the classes should have expected frequencies less
than 5.
No class should have an expected frequency less than 1.
EXAMPLE
A simple random sample of 1847 individuals was obtained and
classified according to gender and whether or not they believe in the
Filipino talent. Is there evidence (at $\alpha$ = 5%) that there is an association
between gender and the belief in the Filipino talent?
PROCEDURE
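Ho: X and Y are independent.
Ha: X and Y are associated.
Test statistic: $\chi^2_c = \sum_{i=1}^{r}\sum_{j=1}^{c} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$ under Ho
where $E_{ij} = \dfrac{(\text{row } i \text{ total})(\text{column } j \text{ total})}{n}$
where q=min(r,c)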
REGRESSION
ANALYSIS
FORMS OF RELATIONSHIP
BETWEEN X AND Y
REGRESSION ANALYSIS
Used to find a possible functional relationship between two variables X
and Y, where X and Y are paired variables (i.e., both are measured on
the same units)
The ultimate objective is to predict Y (dependent or response variable)
given the value of X (independent or explanatory variable)
THE SIMPLE LINEAR
REGRESSION MODEL
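$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where $Y_i$ = observed value of Y
$X_i$ = observed value of X
$\beta_0$ = regression constant (true Y-intercept)
$\beta_1$ = regression coefficient (true change in Y per unit increase in X)
$\varepsilon_i$ = random error associated with $Y_i$ for given $X_i$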
ASSUMPTIONS
Normality of residuals
The residuals (observed minus predicted values of Y) must be normally
distributed.
This can be assessed by producing histograms of the residuals and/or the
normal probability plot.
ESTIMATION BASED ON
SRS OF SIZE n
Predicting equation: $\hat{Y} = b_0 + b_1 X$
EVALUATION OF THE SAMPLE
REGRESSION EQUATION
An overall measure of the adequacy of the predicting equation is provided by the
coefficient of determination
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
$R^2$ gives the proportion of variation in Y that is accounted for by X. It ranges
from 0 to 1 (or 0% to 100%). The nearer $R^2$ is to 1, the better is the fit of the
regression line.
TESTING THE
SIGNIFICANCE OF $\beta_1$
Ho: $\beta_1 = 0$ (Y is not linearly dependent on X)
against
i) Ha: $\beta_1 \neq 0$ (Y is linearly dependent on X)
ii) Ha: $\beta_1 > 0$ (Y is positively linearly dependent on X)
iii) Ha: $\beta_1 < 0$ (Y is negatively linearly dependent on X)
EXAMPLE
A random sample of 10 households in a certain municipality was
observed for the amount spent on groceries per week (Y) and the
number of household members (X). Do the following data support the
claim that the bigger the household, the higher the expenditure for
groceries per week? Use $\alpha$ = 5%.
HH no.  Amount Spent on Groceries/wk (Y)  Number of HH members (X)
1       457.50                            2
2       331.90                            2
3       683.30                            3
4       1,069.20                          4
5       358.60                            1
6       1,306.20                          5
7       1,400.00                          5
8       850.00                            3
9       200.00                            1
10      1,800.00                          4
MULTIPLE
LINEAR
REGRESSION
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model            1   2032031         2032031      29.93    0.0006
Error            8   543195          67899
Corrected Total  9   2575226
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -110.58000          193.24766       -0.57    0.5829
Members    1   318.75000           58.26636        5.47     0.0006
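The estimates in this output can be reproduced from the ten households in the grocery example. The sketch below uses the ordinary least-squares formulas $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{y} - b_1\bar{x}$, which are standard but not shown on the slides; variable names are illustrative.

```python
from statistics import mean

# Grocery example: household members (X) and weekly grocery spending (Y)
members = [2, 2, 3, 4, 1, 5, 5, 3, 1, 4]
spent = [457.50, 331.90, 683.30, 1069.20, 358.60,
         1306.20, 1400.00, 850.00, 200.00, 1800.00]

n = len(members)
x_bar, y_bar = mean(members), mean(spent)

s_xx = sum((x - x_bar) ** 2 for x in members)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(members, spent))

b1 = s_xy / s_xx               # estimated change in Y per added member
b0 = y_bar - b1 * x_bar        # estimated intercept

sst = sum((y - y_bar) ** 2 for y in spent)   # total variation in Y
ssr = b1**2 * s_xx                           # variation explained by X
sse = sst - ssr                              # unexplained variation
r_squared = ssr / sst
f_value = (ssr / 1) / (sse / (n - 2))        # Model MS over Error MS

print(round(b0, 2), round(b1, 2), round(r_squared, 3), round(f_value, 2))
```

About 79% of the variation in weekly grocery spending is accounted for by household size in this sample.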
REVIEW
Correlation indicates the strength of linear relationship between two
variables.
Simple linear regression describes the form of the linear relationship
between two variables.
For linear regression to be valid, the following assumptions must hold:
Linearity
Normality of residuals
Constant variance
NOTE
A single dependent variable may be affected by two or more predictor
variables working together.
Prediction of the dependent variable becomes more efficient if the
collective effect of these variables is considered, since the unexplained
variation in the dependent variable is minimized.
ESTIMATION
Predicting equation:
$\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$
THE MULTIPLE LINEAR
REGRESSION MODEL
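$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_p X_{pi} + \varepsilon_i$
where $Y_i$ = value of the dependent variable in the $i$th observation
$\beta_0$ = regression constant
$\beta_j$ = regression coefficient for $X_j$
$X_{ji}$ = value of the $j$th independent variable in the $i$th observation
$\varepsilon_i$ = random error term associated with $Y_i$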
EVALUATION OF THE SAMPLE
REGRESSION EQUATION
$R^2$ = coefficient of multiple determination
= the proportion of the variance in Y that can be explained by the
p predictors together
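TEST OF SIGNIFICANCE
OF $\beta$’S
Ho: $\beta_j = 0$ (Y does not linearly depend on $X_j$)
Ha: $\beta_j \neq 0$ (Y linearly depends on $X_j$)
Test statistic: $t_c = \dfrac{b_j}{SE(b_j)}$
Decision rule: Reject Ho if $|t_c| > t_{\alpha/2,\,n-p-1}$;
fail to reject Ho otherwise.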
EXAMPLE
Consider the ten observations on the expenditure on groceries and the
household sizes. Information on the gross income of the households was also gathered. Fit the multiple linear
regression model using the expenditure on groceries as the dependent
variable.
HH no. Amount Spent on Number of HH HH no.
Groceries/wk(Y) members
1 457.50 2 1
2 331.90 2 2
3 683.30 3 3
4 1,069.20 4 4
5 358.60 1 5
6 1,306.20 5 6
7 1,400.00 5 7
8 850.00 3 8
9 200.00 1 9
Analysis of Variance
Sum of Mean
Source DF squares square F Value Pr>F
Model 2 2105495 1052748 15.69 0.0026
Error 7 469731 67104
Corrected Total 9 2575226
Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -110.58000          193.24766       -0.57    0.5829
Members    1   423.43477           115.60894       3.66     0.0080
Income     1   -0.01837            0.01755         -1.05    0.3302
CATEGORICAL
DATA
ANALYSIS
TEST FOR SINGLE
PROPORTION
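Parameter of interest: P = proportion of the population which possesses
the characteristic of interest
Test statistic: $Z_c = \dfrac{\hat{p} - P_0}{\sqrt{P_0(1 - P_0)/n}}$, where $\hat{p}$ is the sample proportion and $P_0$ is the hypothesized value of P
Decision rule: Reject Ho if
a) $Z_c > Z_{\alpha}$
b) $Z_c < -Z_{\alpha}$
Fail to reject Ho otherwise.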
EXAMPLE
Do the majority of adults in the Philippines believe that a pregnant
woman should be able to obtain an abortion? Of 893 respondents, 400
replied “yes” and 493 replied “no”. Use $\alpha$ = 5%.
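A minimal sketch of the computation for this example, assuming the hypotheses Ho: P = 0.5 vs. Ha: P > 0.5 ("majority") and the usual approximate Z statistic for a single proportion. The tabular value $Z_{0.05}$ = 1.645 comes from the Z-table.

```python
from math import sqrt

yes, n = 400, 893
p0 = 0.5                       # hypothesized proportion under Ho
p_hat = yes / n                # sample proportion answering "yes"

# Approximate Z statistic for a single proportion
z_c = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

z_tab = 1.645                  # tabular Z at alpha = 0.05 (right-tailed)
reject_ho = z_c > z_tab

print(round(p_hat, 4), round(z_c, 3), reject_ho)
```

Since the sample proportion is below 0.5, the statistic is negative and the data give no evidence of a majority.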
Cases:
Independent samples
Related (Paired) samples
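COMPARISON OF TWO
PROPORTIONS: Independent Samples
Case
Ho: $P_1 - P_2 = d_0$, where $d_0$ is the hypothesized value of the difference
between two proportions
Test procedure: approximate Z-test
Test statistic:
When $d_0 \neq 0$: $Z_c = \dfrac{(\hat{p}_1 - \hat{p}_2) - d_0}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$
When $d_0 = 0$: $Z_c = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$
where $\hat{p} = \dfrac{x_1 + x_2}{n_1 + n_2}$
Decision rule: Reject Ho if
a) $Z_c > Z_{\alpha}$ (for Ha: $P_1 - P_2 > d_0$)
b) $Z_c < -Z_{\alpha}$ (for Ha: $P_1 - P_2 < d_0$)
c) $|Z_c| > Z_{\alpha/2}$ (for Ha: $P_1 - P_2 \neq d_0$)
fail to reject Ho otherwise.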
EXAMPLE
Enjoy shopping?
Group   Yes   No       Total
Women   189   10,845   11,034
Men     104   10,933   11,037
$Z_c$ = 5.0014
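The reported value 5.0014 can be reproduced with the pooled-proportion form of the approximate Z statistic (hypothesized difference of zero), as in this sketch. Variable names are illustrative.

```python
from math import sqrt

# Counts from the table: "Yes" responses and group totals
x1, n1 = 189, 11034            # women
x2, n2 = 104, 11037            # men

p1_hat, p2_hat = x1 / n1, x2 / n2
p_pooled = (x1 + x2) / (n1 + n2)   # common proportion under Ho: P1 - P2 = 0

# Approximate Z statistic using the pooled standard error
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_c = (p1_hat - p2_hat) / se

print(round(p1_hat, 4), round(p2_hat, 4), round(z_c, 4))
```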
COMPARISON OF TWO
PROPORTIONS: Related Samples Case
Observations in the sample can be observed on two occasions as in a
“before and after” type of research
The observations can therefore be placed in two-way table as follows.
                   After
Before     Success   Failure
Success    SS        SF
Failure    FS        FF
COMPARISON OF TWO
PROPORTIONS: Related Samples
Case
Test
statistic:
When ≠
EXAMPLE
Prior to a television debate of two candidates, a random sample of
100 voters expressed their choices. After the debate, the same 100 voters expressed again their
choices. The results are given in a table in the next slide. Using $\alpha$ = 5%,
test if there is a change in voters’ decision after the debate.
EXAMPLE
                      After the debate
Before the debate     Jez   Dar
Jez                   63    21
Dar                   4     12
Let $P_1$ = proportion of voters who chose Jez before the debate
$P_2$ = proportion of voters who chose Jez after the debate
EXAMPLE
Ho: $P_1 - P_2 = 0$
The proportions of voters who chose Jez before and
after the debate are equal.
Ha: $P_1 - P_2 \neq 0$
The proportions of voters who chose Jez before and after
the debate are not equal.
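One common way to test these hypotheses is McNemar's statistic, which uses only the voters who changed their choice. The slides' own test statistic did not survive extraction, so the formula $Z = (SF - FS)/\sqrt{SF + FS}$ below is an assumption of this sketch, not necessarily the one intended by the author.

```python
from math import sqrt

# Cell counts from the before/after table
jez_to_jez = 63   # chose Jez before and after
jez_to_dar = 21   # chose Jez before, Dar after (SF cell)
dar_to_jez = 4    # chose Dar before, Jez after (FS cell)
dar_to_dar = 12   # chose Dar before and after

n = jez_to_jez + jez_to_dar + dar_to_jez + dar_to_dar
p1_hat = (jez_to_jez + jez_to_dar) / n   # proportion for Jez before
p2_hat = (jez_to_jez + dar_to_jez) / n   # proportion for Jez after

# McNemar-type Z statistic: only the discordant cells matter
z_c = (jez_to_dar - dar_to_jez) / sqrt(jez_to_dar + dar_to_jez)

z_tab = 1.96                   # tabular Z at alpha/2 = 0.025 (two-tailed)
reject_ho = abs(z_c) > z_tab

print(p1_hat, p2_hat, z_c, reject_ho)
```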
For $\beta > 0$, every unit increase in X increases the odds of Y=1 multiplicatively
by $e^{\beta}$.
INTERPRETATION
logit(Y=1)
Ho: $\beta = 0$  Ha: $\beta \neq 0$
Reject Ho:
At $\alpha$ = 5%, the probability of passing the LET is dependent on the
weekly average number of hours spent in reviewing.
Analysis of Maximum Likelihood Estimates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr>ChiSq
Intercept 1 -2.2136 0.9988 4.9119 0.0267
Hours 1 0.0704 0.0267 6.9643 0.0083
logit(Y=1) = -2.2136 + 0.0704 (Hours)
For every hour increase in the weekly average number of hours spent in
reviewing, the LOG ODDS of passing the LET increases by 0.0704
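The estimates in the output can be turned into an odds ratio and predicted probabilities, as in this sketch. The helper `prob_pass` and the value of 40 hours are hypothetical illustrations; only the two coefficients come from the output.

```python
from math import exp

b0 = -2.2136   # intercept estimate from the output
b1 = 0.0704    # estimate for Hours from the output

# Each additional hour multiplies the odds of passing by exp(b1)
odds_ratio = exp(b1)

def prob_pass(hours):
    """Estimated probability of passing for a given number of weekly review hours."""
    logit = b0 + b1 * hours          # log odds
    return 1 / (1 + exp(-logit))     # inverse of the logit

p_40 = prob_pass(40)                 # hypothetical reviewer with 40 hours/week
print(round(odds_ratio, 4), round(p_40, 4))
```

An odds ratio of about 1.07 means the odds of passing rise by roughly 7% for every additional hour of review per week.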
D
X
0 0
1 0
0 1
INTERPRETATION
odds
(Y=1)
Parameter   DF  Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
Intercept   1   0.0772    0.5093          0.0230           0.8795
TREATMENT1  1   1.2221    0.5745          4.5250           0.0334
TREATMENT2  1   2.3207    0.7893          8.6450           0.0033
logit(Y=1) = 0.0772 + 1.2221 (TREATMENT1) + 2.3207 (TREATMENT2)
Testing global Null Hypothesis: BETA=0
Parameter
Chi-Square DF Pr>ChiSq
Likelihood Ratio 55.9138 2 <.0001
Score 46.4730 2 <.0001
Wald 21.1462 2 <.0001
Reject Ho.
At $\alpha$ = 5%, the model is adequate.
Analysis of Maximum Likelihood Estimates
Standard
Wald
Parameter DF Estimate Error Chi-SquarePr>ChiSq
Intercept 1 0.0772 0.5093 0.0230 0.8795
TREATMENT 1 1 1.2221 0.5745 4.5250 0.0334
TREATMENT2 1 2.3207 0.7893 8.6450 0.0033
Ho: $\beta_1 = 0$  Ha: $\beta_1 \neq 0$
Reject Ho.
At $\alpha$ = 5%, the probability of surviving is higher for basic treatment compared
to placebo.
EXAMPLE
Logit (Y=1)
The LOG ODDS of surviving is 1.222 higher for basic treatment
relative to placebo.
The odds of surviving is exp(1.222) ≈ 3.4 times as high for basic treatment
relative to placebo.
BINARY LOGISTIC REGRESSION
MODEL: Multiple Predictors
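Given:
Binary response variable Y (0, 1)
k explanatory variables, $X_1, X_2, \dots, X_k$
(continuous and categorical)
MODELS
Probability of Y=1:
$P(Y=1) = \dfrac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}}$
Log odds of Y=1:
logit(Y=1) $= \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$
Odds of Y=1:
odds(Y=1) $= e^{\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k}$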
STEPS
1. Construct the dummy variables, if necessary.
2. Fit the model and estimate its parameters.
3. Test the significance of the estimates.
4. Evaluate the model adequacy/model fit, choosing the best model in
the process.
5. Interpret the final model.
EXAMPLE
A study on the risk factors associated with low birth weight was
conducted.
Logit (Y=1)=2.91100.17WGT+1.4071.894
MULTICATEGORY LOGISTIC
REGRESSION MODEL
Nominal responses
Y is categorical with multiple responses
X is categorical or continuous