An Introduction To The Analysis of Categorical Data
Often data consist of counts of the number of events of certain types. We
will study methods of analysis specifically designed for count data. Examples are:
- Number of deaths from tornadoes in a state during a given year: 0, 1, 2, 3, ...
- The number of automobile accidents on Texas Avenue during a given week
The response variable is often called the dependent variable. As the name
indicates, it depends on several factors/variables, and it is the primary
variable used to assess a condition.
The explanatory variables, on the other hand, are often called predictors or
independent variables; their effect on the outcome variable is often the goal
of the study. Studying effects is not straightforward in observational studies.
Therefore, in general, interest often lies in studying the association between
outcome variables and explanatory variables.
Distinguish between response and explanatory variables:
- Attitude toward gun control; gender; education
- Cholesterol level (high, moderate, low); heart disease (yes, no)
- Number of deaths from coronavirus; number of people infected with coronavirus
A female horseshoe crab with a male crab attached to her nest sometimes
has other male crabs, called satellites, near her. Fit a logistic regression
model to estimate the probability that a female crab has at least one
satellite as a function of color, weight, and carapace width.
Fifty-nine alligators were sampled in Florida. The response is primary food
type: Fish, Invertebrate, and Other. The explanatory variable is length of
the alligator in meters.
In this section we will study the basic distributions used in the analysis of
categorical data:
Binomial distribution
Multinomial distribution
Poisson distribution
Models for overdispersed data
and consequently,

pr(Y = 0) = (1 − π)^n,                        pr(Y = n) = π^n,
pr(Y = 1) = nπ(1 − π)^{n−1},                  pr(Y = n − 1) = nπ^{n−1}(1 − π),
pr(Y = 2) = [n(n − 1)/2] π^2 (1 − π)^{n−2},   pr(Y = n − 2) = [n(n − 1)/2] π^{n−2} (1 − π)^2,
...
Show histograms of different binomial distributions using R. For given n
and π, generate 1000 binomial random numbers, and draw a histogram of the
random numbers. Note that the possible values of the random numbers must be
0, 1, . . . , n.
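The simulation can be sketched in stdlib Python instead of R's rbinom and hist; the values n = 10 and π = 0.3 are hypothetical choices for illustration:

```python
import random
from collections import Counter

def rbinom(n, p, size, seed=0):
    """Draw `size` Binomial(n, p) values by summing n Bernoulli trials each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(size)]

# 1000 draws from Binomial(n = 10, pi = 0.3), tabulated as a text histogram.
draws = rbinom(10, 0.3, 1000)
counts = Counter(draws)
for y in range(11):                      # possible values are 0, 1, ..., n
    print(f"{y:2d} | {'*' * (counts.get(y, 0) // 10)}")
```

In R the same picture comes from hist(rbinom(1000, 10, 0.3)).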
For any adult aged between 40 and 60, the chances of normal blood pressure,
elevated blood pressure, stage-I hypertension, stage-II hypertension, and stage-III
hypertension (a critical condition) are 0.5, 0.25, 0.2, 0.04, and 0.01, respectively.
Out of 100 randomly chosen people aged between 40 and 60, what is the probability
of observing the frequencies 1) (55, 30, 10, 2, 3) and 2) (60, 30, 8)? In the
second case, it is implied that the total number of people in stages II and III
combined is 2.
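Both probabilities follow from the multinomial formula; a stdlib Python sketch (in case 2 the pooled stage-II/III category gets probability 0.04 + 0.01 = 0.05):

```python
from math import factorial, prod

def dmultinom(counts, probs):
    """Multinomial probability n!/(x1!...xk!) * prod(p_i^x_i)."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)        # exact integer division at every step
    return coef * prod(p ** x for p, x in zip(probs, counts))

# Case 1: all five blood-pressure categories
p1 = dmultinom([55, 30, 10, 2, 3], [0.5, 0.25, 0.2, 0.04, 0.01])
# Case 2: stages II and III pooled into one category (prob 0.04 + 0.01)
p2 = dmultinom([60, 30, 8, 2], [0.5, 0.25, 0.2, 0.05])
print(p1, p2)
```

In R the same quantities come from dmultinom.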
pr(Y = y) = exp(−µ) µ^y / y!,    y = 0, 1, 2, . . .
The mean number of events over the specified time or space on which Y is
defined is E (Y ) = µ.
The variance of the number of events is var(Y ) = µ.
Based on the data during June 9–June 15, 2020 (31, 19, 26, 11, 60, 22, 18;
source: http://wtaw.com/brazos-county-health-district-coronavirus-update-june-16/),
on average 26.7 positive COVID-19 cases were found daily in the Brazos Valley. What
is the probability that on a given day during the said period 50 or more new
positive cases will be found?
Here X denotes the number of new positive cases on a given day. This number is
random. Assume that X ∼ Poisson(µ = 26.7). Then we calculate the probability

pr(X ≥ 50) = Σ_{r=50}^{∞} exp(−26.7) 26.7^r / r! = 1 − Σ_{r=0}^{49} exp(−26.7) 26.7^r / r!.

This can be calculated in R by
1-ppois(49, 26.7) = 3.629556e-05
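The same number can be reproduced without R; a stdlib Python sketch of 1 − pr(X ≤ 49):

```python
from math import exp

def ppois(q, mu):
    """Poisson CDF pr(X <= q), accumulating pmf terms recursively."""
    term = total = exp(-mu)          # pmf at x = 0
    for r in range(1, q + 1):
        term *= mu / r               # pmf(r) = pmf(r - 1) * mu / r
        total += term
    return total

tail = 1 - ppois(49, 26.7)           # pr(X >= 50)
print(tail)                          # agrees with R: 3.629556e-05
```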
The average number of oil tankers arriving each day at a certain port city is
known to be 10. The facilities at the port can handle at most 15 tankers per day.
What is the probability that on a given day tankers will have to wait a day or so
to dock or be sent away?
The number of tankers arriving on a given day is random; call it X. Suppose
that X ∼ Poisson(µ). Since the average number of tankers arriving in a day is
10, we set µ = 10. Now, we calculate pr(X > 15) = 1 − pr(X ≤ 15).
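Assuming "wait or be sent away" means more than 15 arrivals, a stdlib Python sketch of 1 − pr(X ≤ 15):

```python
from math import exp

def ppois(q, mu):
    """Poisson CDF pr(X <= q), accumulating pmf terms recursively."""
    term = total = exp(-mu)          # pmf at x = 0
    for r in range(1, q + 1):
        term *= mu / r               # pmf(r) = pmf(r - 1) * mu / r
        total += term
    return total

p_wait = 1 - ppois(15, 10)           # pr(X > 15) = pr(X >= 16)
print(round(p_wait, 4))              # about 0.0487
```

In R: 1-ppois(15, 10).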
Both the binomial distribution and the Poisson distribution have one parameter.
Usually for the binomial distribution, n, the number of trials, is assumed to be
known. For these one-parameter distributions, the variance must have a particular
mathematical relationship to the mean. When the sample mean and sample
variance do not reflect this mathematical relationship, we should consider
alternative models for the data.
For the binomial distribution, the mean is µ = nπ and the variance is σ^2 = nπ(1 − π). So,
σ^2 = µ(1 − µ/n) = µ(n − µ)/n. If the data exhibit a larger variance than this
relationship implies, the data are said to be overdispersed. For Poisson data, the population
mean and population variance are equal. If the data exhibit a larger variance than
the mean, the data are said to be overdispersed. This may occur when there are
different Poisson models in the population (the Poisson parameter is varying),
resulting in a larger observed variance than expected. We will consider a couple of
models that result in overdispersion.
Suppose that you randomly sample 1000 households that did not have
health insurance in the last year, and ask how many times each household visited
the ER in the last year. The number of visits is a random variable taking values
0, 1, 2, etc. What do you expect about the proportion of zeros? The
proportion of zeros will be much higher than the proportion of non-zero visits. The
histogram would look like the following.
[Histogram: frequency (0–700) versus number of visits (0–6); the zero-visit bar dominates.]
In these data, we see more zero counts than expected under any Poisson
model. We can model these extra zeros as a mixture of two populations: a Poisson
population that is selected with probability π and a population of zeros that is
selected with probability 1 − π.
The probability function for the zero-inflated Poisson r.v. is

pr(X = x | µ, π) = 1 − π + π exp(−µ)       if x = 0,
                   π exp(−µ) µ^x / x!      if x > 0.
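A quick sketch of this pmf in stdlib Python, with hypothetical values µ = 2.0 and π = 0.4, confirming that the probabilities sum to 1:

```python
from math import exp, factorial

def dzip(x, mu, pi):
    """Zero-inflated Poisson pmf: extra mass at zero with probability 1 - pi."""
    pois = exp(-mu) * mu ** x / factorial(x)
    return (1 - pi) + pi * pois if x == 0 else pi * pois

# Hypothetical mu = 2.0, pi = 0.4: check that the pmf sums to 1.
total = sum(dzip(x, 2.0, 0.4) for x in range(100))
print(total)
```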
For the negative binomial distribution (used below), E(X) = µ and
var(X) = µ(1 + Dµ), where D = 1/ν is the dispersion parameter.
The sample mean and variance of the data were 26.7 and 255.2, respectively. So the
Poisson model assumption on X is questionable. What if we compute
pr(X ≥ 50) assuming X ∼ NegBinomial(ν, p)? Note that in order to
calculate pr(X ≥ 50) we need to know (or estimate) the two unknowns ν and
p = ν/(ν + µ) from the data.
One way of estimating these two is via the method of moments (MOM). We know

ν = E^2(X) / (var(X) − E(X)),                          (1)

p = ν / (ν + E(X))
  = E^2(X) / (E^2(X) + E(X)var(X) − E^2(X))
  = E(X) / var(X).                                     (2)
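Plugging the sample moments into equations (1) and (2) and summing the negative binomial pmf gives the tail probability; a stdlib Python sketch (the pmf is written via lgamma since ν is not an integer):

```python
from math import exp, lgamma, log

mean, var = 26.7, 255.2          # sample moments of the daily counts

# Method-of-moments estimates from equations (1) and (2)
nu = mean ** 2 / (var - mean)    # about 3.12
p = mean / var                   # about 0.105

def dnbinom(x, nu, p):
    """Negative binomial pmf with size nu and success probability p."""
    logpmf = (lgamma(x + nu) - lgamma(nu) - lgamma(x + 1)
              + nu * log(p) + x * log(1 - p))
    return exp(logpmf)

tail = 1 - sum(dnbinom(x, nu, p) for x in range(50))   # pr(X >= 50)
print(round(nu, 2), round(p, 3), round(tail, 3))
```

The tail probability is roughly one in ten, in line with the "about 1 day out of 10" interpretation above.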
A slight difference in these two results occurred due to rounding of size and
prob (the arguments of R's pnbinom). These results are more believable than the
results under the Poisson model assumption, and they can be interpreted as: out of
10 days, approximately 1 day is likely to show a number of cases greater than or
equal to 50.
Look at the data given on the following webpage. It contains the number of
hurricanes in each category (1, 2, 3, 4, 5) per decade starting from 1851.
https://www.nhc.noaa.gov/pastdec.shtml
On different parts of the data, we can fit the different models that we have
discussed so far: multinomial, Poisson, zero-inflated Poisson, and negative
binomial distributions.
This dataset will be used in homework assignments.
For over-dispersed count data, we generally think of the negative binomial model.
For under-dispersed data, one may consider the generalized Poisson model.
We will now suppose that we have observed data from one of the previous
distributions where the value of the parameter is unknown. We will develop
methods for statistical inference concerning the unknown parameters.
We now apply some of these techniques to the binomial, Poisson, and multinomial
distributions. In later chapters, we will consider response data with one of the
above distributions in the situation where we have one or more explanatory
variables.
Suppose we observe data from a binomial distribution with unknown π. Our goal
is to estimate π.
The function L(π|Y = 14), a function of the parameter π, is called the likelihood
function.
[Plot: the likelihood function L(π|Y = 14) versus π.]
A similar argument for Y ∼ Bin(n, π) implies that the MLE of π is the sample
proportion p:

π̂ = p = Y/n.
E(π̂) = π (the average of the MLE is the parameter of interest)
var(π̂) = π(1 − π)/n (the variance of the MLE depends on the parameter of
interest and it decreases with the sample size)
For a large n, the distribution of π̂ is approximately normal with the above
mean and variance.
Fact: If Y ∼ Bin(n, π) and the probability histogram is not too skewed (nπ ≥ 5,
n(1 − π) ≥ 5), then Y is approximately normally distributed. This implies that
the MLE π̂ = Y/n is approximately normally distributed:

Z = (π̂ − π) / √(π(1 − π)/n) ∼ N(0, 1).
Null hypothesis: H0 : π = π0
The score statistic uses the distribution under H0 to obtain the standard error in
the test statistic:
Z = (π̂ − π0) / √(π0(1 − π0)/n)
The Wald statistic uses the estimated standard deviation of π̂ to standardize π̂:

Z = (π̂ − π0) / √(π̂(1 − π̂)/n)
The Wilson interval (also known as the score interval, as it is based on the score
statistic) is an approximate confidence interval based on the standardized π̂:

Z = (π̂ − π0) / √(π0(1 − π0)/n);

the interval consists of all values π0 with |Z| ≤ Z_{α/2}.
The Wilson interval has been shown to have better coverage properties than the
Wald interval.
Remark: Agresti and Coull derived a simple approximation to the level α Wilson
interval. Form the Wald interval based on n* = n + Z^2_{α/2} observations and
π̃ = (Y + Z^2_{α/2}/2)/(n + Z^2_{α/2}). This is equivalent to adding Z^2_{α/2}
observations, half of which are successes and half are failures.
π: probability that an adult shopper requests a rain check when an advertised
item is unavailable

π̂ ± Z_{α/2} √(π̂(1 − π̂)/n)
.249 ± 1.96 √(.249(1 − .249)/277)
.249 ± .051

The 95% Wald interval is (.198, .300).
We can also compute the more accurate confidence interval:

[ 0.249 + 1.96^2/(2 · 277) ± 1.96 √( 0.249 · 0.751/277 + 1.96^2/(4 · 277^2) ) ] / [ 1 + 1.96^2/277 ]
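Both the Wald interval and the Agresti-Coull approximation can be sketched in stdlib Python; here Y = 69 is assumed, so that Y/n = 69/277 ≈ 0.249:

```python
from math import sqrt

def wald_ci(y, n, z=1.96):
    """Wald interval: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = y / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def agresti_coull_ci(y, n, z=1.96):
    """Wald interval after adding z^2 pseudo-observations, half successes."""
    n_star = n + z ** 2
    p_tilde = (y + z ** 2 / 2) / n_star
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_star)
    return p_tilde - half, p_tilde + half

print(wald_ci(69, 277))             # about (0.198, 0.300)
print(agresti_coull_ci(69, 277))
```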
Consider testing H0 : π = 0.6 versus Ha : π > 0.6 for a binomial distribution with
n = 10. If we observe Y = 9 successes, the exact p-value equals
pr(Y ≥ 9) = (10 choose 9) 0.6^9 0.4^1 + 0.6^10 = 0.0403 + 0.0060 = 0.0463.
y    p_{0,y} = pr(Y = y) under H0
0 0.000
1 0.002
2 0.011
3 0.042
4 0.111
5 0.201
6 0.251
7 0.215
8 0.121
9 0.040
10 0.006
We found that p_{0,9} = 0.040, and p_{0,0}, p_{0,1}, p_{0,2}, p_{0,9}, p_{0,10} ≤ p_{0,9};
hence the p-value is 0.000 + 0.002 + 0.011 + 0.040 + 0.006 = 0.059. You may also get
the same answer using the R function binom.test(c(9, 1), p = 0.6).
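The small-probability method can be sketched in stdlib Python:

```python
from math import comb

def dbinom(y, n, p):
    """Binomial pmf pr(Y = y)."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, pi0, y_obs = 10, 0.6, 9
p_obs = dbinom(y_obs, n, pi0)

# Two-sided p-value: sum pr(Y = y) over all outcomes y whose probability
# under H0 is no larger than that of the observed outcome.
p_value = sum(dbinom(y, n, pi0) for y in range(n + 1)
              if dbinom(y, n, pi0) <= p_obs)
print(round(p_value, 4))        # about 0.0587, i.e. 0.059 after rounding
```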
where F_{a,b}(c) denotes the 1 − c quantile of the F distribution with df = (a, b).
Due to the discreteness of the binomial distribution, the actual coverage
probability of the Clopper-Pearson interval is usually greater than 1 − α. This
approach is used in the binom.test function of R.
The various confidence interval methods do not necessarily achieve the stated
confidence level. We create a plot showing the true coverage probabilities of the
Wald, Wilson, Agresti-Coull, and Clopper-Pearson intervals when n = 40 and
1 − α = 0.95, as a function of π.
[Plot: true coverage probability (CP) versus π for each interval method.]
1. The plot shows that the coverage probability changes with the underlying
proportion π from which the data are generated.
2. The coverage probability gets worse (farther from the nominal level 0.95) as
π gets close to 0 or 1.
Set this equal to zero and solve for µ to get the MLE:

µ̂ = Σ_{i=1}^{n} Y_i / n = Ȳ

1. E(µ̂) = µ (the average of the MLE is the parameter of interest)
2. var(µ̂) = µ/n (the variance depends on the parameter of interest and it
decreases with the sample size)
3. SE = √(µ̂/n)
4. The distribution of µ̂ is approximately normal with the above mean and
variance.
As we did for the binomial distribution, we can obtain confidence intervals for µ
based on either the Wald statistic or the score statistic. The Wald statistic for
testing H0: µ = µ0 is

Z = (Ȳ − µ0) / √(Ȳ/n).

The resulting level 1 − α confidence interval for µ is given by

Ȳ ± Z_{α/2} √(Ȳ/n).

The corresponding score statistic is

Z = (Ȳ − µ0) / √(µ0/n).
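A stdlib Python sketch of the Wald interval, illustrated with the earlier seven daily COVID-19 counts (Ȳ = 26.7, n = 7):

```python
from math import sqrt

def poisson_wald_ci(ybar, n, z=1.96):
    """Wald interval for the Poisson mean: Y-bar +/- z * sqrt(Y-bar / n)."""
    half = z * sqrt(ybar / n)
    return ybar - half, ybar + half

# Illustration with the earlier daily COVID-19 counts: 7 days, mean 26.7.
lo, hi = poisson_wald_ci(26.7, 7)
print(round(lo, 2), round(hi, 2))   # about 22.87 and 30.53
```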
Expected value: E(π̂_i) = π_i
Variance: var(π̂_i) = π_i(1 − π_i)/n
Correlation: corr(π̂_i, π̂_j) = −√( π_i π_j / [(1 − π_i)(1 − π_j)] )
When the null hypothesis is true, the maximized restricted likelihood under
the null hypothesis will be similar in value to the maximized unrestricted
likelihood.
However, if the null hypothesis is false, the maximized unrestricted likelihood
can be much larger.
Thus, we will reject H0 when Λ is small, or equivalently, when
G 2 = −2log(Λ) is large.
The following data are obtained from Table 1 of the paper.¹
Let πx-y be the death rate for age group x-y . Test if π0-29 < 0.002 and
π30- > 0.03.
¹ Vital Surveillances: The Epidemiological Characteristics of an Outbreak of 2019
Novel Coronavirus Diseases (COVID-19) — China, 2020. China CDC Weekly 2020.
Epub February 17, 2020.
Samiran Sinha (TAMU) Categorical Data May 28, 2021 75 / 95
First we write H0 : π0−29 ≥ 0.002 and π30− ≤ 0.03 and Ha : π0−29 < 0.002
or π30− > 0.03.
Pay attention to how the alternative hypothesis is written.
Different approaches can be taken to test this hypothesis. Here we use the
Pearson’s Chi-square goodness of fit test.
First we shall calculate the expected cell frequency under H0 . For 0-29, the
expected deaths are 9.17, and for the age group older than 30 years, the
expected deaths are 1202.64.
Next we form this table:
A sample of 156 dairy calves born in a particular county in Florida was classified
according to whether they caught a primary infection and a secondary infection
within 60 days of birth. A secondary infection can occur only after a primary one,
so the (primary No, secondary Yes) cell is structurally empty. The following data
were recorded:

                     Secondary Infection
Primary Infection    No      Yes
No                   63      0
Yes                  63      30

The cell probabilities in terms of θ, the probability of an infection:

                     Secondary Infection
Primary Infection    No          Yes     Total
No                   *           *       1 − θ
Yes                  θ(1 − θ)    θ^2     θ

In terms of the observed counts:

                     Secondary Infection
Primary Infection    No         Yes       Total
No                   *          *         n3 = 63
Yes                  n2 = 63    n1 = 30   n1 + n2 = 93

n1 + n2 + n3 = 156
Equivalently, maximize

L(θ) = Σ_{i=1}^{3} n_i log(π_i(θ))
Cell                     1        2        3        Total
Observed count           30       63       63       156
Estimated probability    0.244    0.250    0.506    1
Expected count µ̂_i       38.07    38.99    78.94    156
(O − E)^2/E              1.71     14.78    3.22     19.71
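The table's numbers can be reproduced in a short stdlib Python sketch (the closed-form MLE follows from maximizing L(θ)):

```python
# Observed counts: n1 = 30 (both infections), n2 = 63 (primary only),
# n3 = 63 (no primary infection).
n1, n2, n3 = 30, 63, 63
n = n1 + n2 + n3

# Cell probabilities theta^2, theta*(1-theta), 1-theta give a log-likelihood
# (2*n1 + n2)*log(theta) + (n2 + n3)*log(1 - theta), maximized in closed form:
theta = (2 * n1 + n2) / (2 * n1 + 2 * n2 + n3)

probs = [theta ** 2, theta * (1 - theta), 1 - theta]
expected = [n * p for p in probs]
x2 = sum((o - e) ** 2 / e for o, e in zip([n1, n2, n3], expected))
print(round(theta, 3), [round(e, 2) for e in expected], round(x2, 2))
```

The output matches the table above (θ̂ ≈ 0.494, X^2 ≈ 19.71).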
Eighty litters, each containing 3 rabbits, were observed and the number Y of male
rabbits in the litter was recorded, resulting in the following frequency distribution:
If the assumption of Bernoulli trials is correct for the sex of rabbits, the
distribution of Y should be binomial with n = 3 and θ = probability of a male
birth.
Solve for θ:

θ̂ = Σ_{i=0}^{3} n_i · i / (3 Σ_{i=0}^{3} n_i) = Number of Male Rabbits / Total Number of Rabbits
Test statistic:

T = Σ_i (O_i − E_i)^2 / E_i = 1.07
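Assuming the observed litter frequencies are (19, 32, 22, 7) for Y = 0, 1, 2, 3 (the counts used in the chisq.test illustration that follows), a stdlib Python sketch reproduces T:

```python
from math import comb

counts = [19, 32, 22, 7]                 # litters with Y = 0, 1, 2, 3 males
n_litters = sum(counts)                  # 80 litters of 3 rabbits each

# MLE: total number of male rabbits / total number of rabbits
theta = sum(y * c for y, c in enumerate(counts)) / (3 * n_litters)

# Expected counts under Binomial(3, theta) and the chi-square statistic
probs = [comb(3, y) * theta ** y * (1 - theta) ** (3 - y) for y in range(4)]
expected = [n_litters * p for p in probs]
t_stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(round(theta, 4), round(t_stat, 2))
```

This reproduces θ̂ ≈ 0.404 and T ≈ 1.07.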
Revisit the last example. The software does not know that θ was estimated, so
it uses 3 degrees of freedom, which is incorrect.
Code
obs.freq=c(19, 32, 22, 7)
prob.vec=c( 0.2118, 0.4305, 0.2918, 0.0659)
chisq.test(obs.freq, p=prob.vec)
data: obs.freq
X-squared = 1.0661, df = 3, p-value = 0.7853
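With θ estimated from the data, the correct reference distribution has 4 − 1 − 1 = 2 degrees of freedom; for df = 2 the chi-square upper-tail probability is simply exp(−x/2), so the corrected p-value can be sketched as:

```python
from math import exp

# With theta estimated, df = cells - 1 - estimated parameters = 4 - 1 - 1 = 2.
# For df = 2, the chi-square upper-tail probability is exp(-x/2).
x2 = 1.0661                 # X-squared reported by chisq.test
p_value = exp(-x2 / 2)
print(round(p_value, 4))    # corrected p-value, about 0.587
```

In R: 1 - pchisq(1.0661, df = 2).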