An Introduction To The Analysis of Categorical Data
Often data consist of counts of the number of events of certain types. We
will study methods of analysis specifically designed for count data. Examples are:
- Number of deaths from tornadoes in a state during a given year: 0, 1, 2, 3, ...
- The number of automobile accidents on Texas Avenue during a given week
The response variable is often called the dependent variable. As the name
indicates, it depends on several factors/variables, and it is the primary
variable used to assess a condition.
The explanatory variables, on the other hand, are often called predictors or
independent variables; their effect on the outcome variable is often the goal
of the study. Studying effects is not straightforward in observational studies.
Therefore, in general, interest often lies in studying the association between
outcome variables and explanatory variables.
Distinguish between response and explanatory variables:
- Attitude toward gun control; gender; education
- Cholesterol level (high, moderate, low); heart disease (yes, no)
- Number of deaths from coronavirus; number of people infected with coronavirus
A female horseshoe crab with a male crab attached to her nest sometimes
has other male crabs, called satellites, near her. Fit a logistic regression
model to estimate the probability that a female crab has at least one
satellite as a function of color, weight, and carapace width.
Fifty-nine alligators were sampled in Florida. The response is primary food
type: Fish, Invertebrate, and Other. The explanatory variable is length of
the alligator in meters.
In this section we will study the basic distributions used in the analysis of
categorical data:
Binomial distribution
Multinomial distribution
Poisson distribution
Models for overdispersed data
and consequently,

pr(Y = 0) = (1 − π)^n,                        pr(Y = n) = π^n,
pr(Y = 1) = nπ(1 − π)^{n−1},                  pr(Y = n − 1) = nπ^{n−1}(1 − π),
pr(Y = 2) = [n(n − 1)/2] π^2 (1 − π)^{n−2},   pr(Y = n − 2) = [n(n − 1)/2] π^{n−2} (1 − π)^2,
...
Show histograms of different binomial distributions using R. For given n
and π, generate 1000 binomial random numbers, and draw a histogram of the
random numbers. Note that the possible values of the random numbers must be
0, 1, . . . , n.
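The simulation can be sketched in stdlib Python instead of R's rbinom and hist; the values n = 10 and π = 0.3 are hypothetical choices for illustration:

```python
import random
from collections import Counter

def rbinom(n, p, size, seed=0):
    """Draw `size` Binomial(n, p) values by summing n Bernoulli trials each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(size)]

# 1000 draws from Binomial(n = 10, pi = 0.3), tabulated as a text histogram.
draws = rbinom(10, 0.3, 1000)
counts = Counter(draws)
for y in range(11):                      # possible values are 0, 1, ..., n
    print(f"{y:2d} | {'*' * (counts.get(y, 0) // 10)}")
```

In R the same picture comes from hist(rbinom(1000, 10, 0.3)).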
For any adult aged between 40 and 60, the chances of normal blood pressure,
elevated blood pressure, stage-I hypertension, stage-II hypertension, and stage-III
hypertension (a critical condition) are 0.5, 0.25, 0.2, 0.04, and 0.01, respectively.
Out of 100 randomly chosen people aged between 40 and 60, what is the probability
of observing the frequencies 1) (55, 30, 10, 2, 3) and 2) (60, 30, 8)? In the
second case, it is implied that the total number of people in stages II and III
combined is 2.
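Both probabilities follow from the multinomial formula; a stdlib Python sketch (in case 2 the pooled stage-II/III category gets probability 0.04 + 0.01 = 0.05):

```python
from math import factorial, prod

def dmultinom(counts, probs):
    """Multinomial probability n!/(x1!...xk!) * prod(p_i^x_i)."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)        # exact integer division at every step
    return coef * prod(p ** x for p, x in zip(probs, counts))

# Case 1: all five blood-pressure categories
p1 = dmultinom([55, 30, 10, 2, 3], [0.5, 0.25, 0.2, 0.04, 0.01])
# Case 2: stages II and III pooled into one category (prob 0.04 + 0.01)
p2 = dmultinom([60, 30, 8, 2], [0.5, 0.25, 0.2, 0.05])
print(p1, p2)
```

In R the same quantities come from dmultinom.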
pr(Y = y) = exp(−µ) µ^y / y!,    y = 0, 1, 2, . . .
The mean number of events over the specified time or space on which Y is
defined is E (Y ) = µ.
The variance of the number of events is var(Y ) = µ.
Based on the data during June 9–June 15, 2020 (31, 19, 26, 11, 60, 22, 18;
source: http://wtaw.com/brazos-county-health-district-coronavirus-update-june-16/),
on average 26.7 positive COVID-19 cases were found daily in the Brazos Valley. What
is the probability that on a given day during the said period 50 or more new
positive cases will be found?
Here X denotes the number of new positive cases on a given day. This number is
random. Assume that X ∼ Poisson(µ = 26.7). Then we calculate the probability

pr(X ≥ 50) = Σ_{r=50}^{∞} exp(−26.7) 26.7^r / r! = 1 − Σ_{r=0}^{49} exp(−26.7) 26.7^r / r!.

This can be calculated in R by
1-ppois(49, 26.7) = 3.629556e-05
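The same number can be reproduced without R; a stdlib Python sketch of 1 − pr(X ≤ 49):

```python
from math import exp

def ppois(q, mu):
    """Poisson CDF pr(X <= q), accumulating pmf terms recursively."""
    term = total = exp(-mu)          # pmf at x = 0
    for r in range(1, q + 1):
        term *= mu / r               # pmf(r) = pmf(r - 1) * mu / r
        total += term
    return total

tail = 1 - ppois(49, 26.7)           # pr(X >= 50)
print(tail)                          # agrees with R: 3.629556e-05
```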
The average number of oil tankers arriving each day at a certain port city is
known to be 10. The facilities at the port can handle at most 15 tankers per day.
What is the probability that on a given day tankers will have to wait a day or so
to dock or be sent away?
The number of tankers arriving on a given day is random; call it X. Suppose
that X ∼ Poisson(µ). Since the average number of tankers arriving in a day is
10, we set µ = 10. Now, we calculate pr(X > 15) = 1 − pr(X ≤ 15).
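Assuming "wait or be sent away" means more than 15 arrivals, a stdlib Python sketch of 1 − pr(X ≤ 15):

```python
from math import exp

def ppois(q, mu):
    """Poisson CDF pr(X <= q), accumulating pmf terms recursively."""
    term = total = exp(-mu)          # pmf at x = 0
    for r in range(1, q + 1):
        term *= mu / r               # pmf(r) = pmf(r - 1) * mu / r
        total += term
    return total

p_wait = 1 - ppois(15, 10)           # pr(X > 15) = pr(X >= 16)
print(round(p_wait, 4))              # about 0.0487
```

In R: 1-ppois(15, 10).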
Both the binomial distribution and the Poisson distribution have one parameter.
Usually for the binomial distribution, n, the number of trials, is assumed to be
known. For these one-parameter distributions, the variance must have a particular
mathematical relationship to the mean. When the sample mean and sample
variance do not reflect this mathematical relationship, we should consider
alternative models for the data.
For the binomial distribution, the mean is µ = nπ and the variance is σ^2 = nπ(1 − π). So,
σ^2 = µ(1 − µ/n) = µ(n − µ)/n. If the data exhibit a larger variance than this
relationship implies, the data are said to be overdispersed. For Poisson data, the population
mean and population variance are equal. If the data exhibit a larger variance than
the mean, the data are said to be overdispersed. This may occur when there are
different Poisson models in the population (the Poisson parameter is varying),
resulting in a larger observed variance than expected. We will consider a couple of
models that result in overdispersion.
Suppose that you randomly sample 1000 households that did not have
health insurance in the last year, and ask how many times each household visited
the ER in the last year. The number of visits is a random variable taking values
0, 1, 2, etc. What do you expect about the proportion of zeros? The
proportion of zeros will be much higher than the proportion of non-zero visits. The
histogram would look like the following.
[Histogram: frequency (0–700) versus number of visits (0–6); the zero-visit bar dominates.]
In these data, we see more zero counts than expected under any Poisson
model. We can model these extra zeros as a mixture of two populations: a Poisson
population that is selected with probability π and a population of zeros that is
selected with probability 1 − π.
The probability function for the zero-inflated Poisson r.v. is

pr(X = x | µ, π) = 1 − π + π exp(−µ)       if x = 0,
                   π exp(−µ) µ^x / x!      if x > 0.
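A quick sketch of this pmf in stdlib Python, with hypothetical values µ = 2.0 and π = 0.4, confirming that the probabilities sum to 1:

```python
from math import exp, factorial

def dzip(x, mu, pi):
    """Zero-inflated Poisson pmf: extra mass at zero with probability 1 - pi."""
    pois = exp(-mu) * mu ** x / factorial(x)
    return (1 - pi) + pi * pois if x == 0 else pi * pois

# Hypothetical mu = 2.0, pi = 0.4: check that the pmf sums to 1.
total = sum(dzip(x, 2.0, 0.4) for x in range(100))
print(total)
```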
For the negative binomial distribution (used below), E(X) = µ and
var(X) = µ(1 + Dµ), where D = 1/ν is the dispersion parameter.
The sample mean and variance of the data were 26.7 and 255.2, respectively. So the
Poisson model assumption on X is questionable. What if we compute
pr(X ≥ 50) assuming X ∼ NegBinomial(ν, p)? Note that in order to
calculate pr(X ≥ 50) we need to know (or estimate) the two unknowns ν and
p = ν/(ν + µ) from the data.
One way of estimating these two is via the method of moments (MOM). We know

ν = E^2(X) / (var(X) − E(X)),                          (1)

p = ν / (ν + E(X))
  = E^2(X) / (E^2(X) + E(X)var(X) − E^2(X))
  = E(X) / var(X).                                     (2)
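Plugging the sample moments into equations (1) and (2) and summing the negative binomial pmf gives the tail probability; a stdlib Python sketch (the pmf is written via lgamma since ν is not an integer):

```python
from math import exp, lgamma, log

mean, var = 26.7, 255.2          # sample moments of the daily counts

# Method-of-moments estimates from equations (1) and (2)
nu = mean ** 2 / (var - mean)    # about 3.12
p = mean / var                   # about 0.105

def dnbinom(x, nu, p):
    """Negative binomial pmf with size nu and success probability p."""
    logpmf = (lgamma(x + nu) - lgamma(nu) - lgamma(x + 1)
              + nu * log(p) + x * log(1 - p))
    return exp(logpmf)

tail = 1 - sum(dnbinom(x, nu, p) for x in range(50))   # pr(X >= 50)
print(round(nu, 2), round(p, 3), round(tail, 3))
```

The tail probability is roughly one in ten, in line with the "about 1 day out of 10" interpretation above.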
A slight difference in these two results occurred due to rounding of size and
prob (the arguments of R's pnbinom). These results are more believable than the
results under the Poisson model assumption, and they can be interpreted as: out of
10 days, approximately 1 day is likely to show a number of cases greater than or
equal to 50.
Look at the data given on the following webpage. It contains the number of
hurricanes in each category (1, 2, 3, 4, 5) per decade starting from 1851.
https://www.nhc.noaa.gov/pastdec.shtml
On different parts of the data, we can fit the different models that we have
discussed so far: multinomial, Poisson, zero-inflated Poisson, and negative
binomial distributions.
This dataset will be used in homework assignments.
For over-dispersed count data, we generally think of the negative binomial model.
For under-dispersed data, one may consider the generalized Poisson model.
We will now suppose that we have observed data from one of the previous
distributions where the value of the parameter is unknown. We will develop
methods for statistical inference concerning the unknown parameters.
We now apply some of these techniques to the binomial, Poisson, and multinomial
distributions. In later chapters, we will consider response data with one of the
above distributions in the situation where we have one or more explanatory
variables.
Suppose we observe data from a binomial distribution with unknown π. Our goal
is to estimate π.
The function L(π|Y = 14), a function of the parameter π, is called the likelihood
function.
[Plot: the likelihood function L(π|Y = 14) versus π.]
A similar argument for Y ∼ Bin(n, π) implies that the MLE of π is the sample
proportion p:

π̂ = p = Y/n.
E(π̂) = π (the average of the MLE is the parameter of interest)
var(π̂) = π(1 − π)/n (the variance of the MLE depends on the parameter of
interest and it decreases with the sample size)
For a large n, the distribution of π̂ is approximately normal with the above
mean and variance.
Fact: If Y ∼ Bin(n, π) and the probability histogram is not too skewed (nπ ≥ 5,
n(1 − π) ≥ 5), then Y is approximately normally distributed. This implies that
the MLE π̂ = Y/n is approximately normally distributed:

Z = (π̂ − π) / √(π(1 − π)/n) ∼ N(0, 1).
Null hypothesis: H0 : π = π0
The score statistic uses the distribution under H0 to obtain the standard error in
the test statistic:
Z = (π̂ − π0) / √(π0(1 − π0)/n)
The Wald statistic uses the estimated standard deviation of π̂ to standardize π̂:

Z = (π̂ − π0) / √(π̂(1 − π̂)/n)
The Wilson interval (also known as the score interval, as it is based on the score
statistic) is an approximate confidence interval based on the standardized π̂:

Z = (π̂ − π0) / √(π0(1 − π0)/n);

the interval consists of all values π0 with |Z| ≤ Z_{α/2}.
The Wilson interval has been shown to have better coverage properties than the
Wald interval.
Remark: Agresti and Coull derived a simple approximation to the level α Wilson
interval. Form the Wald interval based on n* = n + Z^2_{α/2} observations and
π̃ = (Y + Z^2_{α/2}/2)/(n + Z^2_{α/2}). This is equivalent to adding Z^2_{α/2}
observations, half of which are successes and half are failures.
π: probability that an adult shopper requests a rain check when an advertised
item is unavailable

π̂ ± Z_{α/2} √(π̂(1 − π̂)/n)
.249 ± 1.96 √(.249(1 − .249)/277)
.249 ± .051

The 95% Wald interval is (.198, .300).
We can also compute the more accurate confidence interval:

[ 0.249 + 1.96^2/(2 · 277) ± 1.96 √( 0.249 · 0.751/277 + 1.96^2/(4 · 277^2) ) ] / [ 1 + 1.96^2/277 ]
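Both the Wald interval and the Agresti-Coull approximation can be sketched in stdlib Python; here Y = 69 is assumed, so that Y/n = 69/277 ≈ 0.249:

```python
from math import sqrt

def wald_ci(y, n, z=1.96):
    """Wald interval: p-hat +/- z * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = y / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def agresti_coull_ci(y, n, z=1.96):
    """Wald interval after adding z^2 pseudo-observations, half successes."""
    n_star = n + z ** 2
    p_tilde = (y + z ** 2 / 2) / n_star
    half = z * sqrt(p_tilde * (1 - p_tilde) / n_star)
    return p_tilde - half, p_tilde + half

print(wald_ci(69, 277))             # about (0.198, 0.300)
print(agresti_coull_ci(69, 277))
```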
Consider testing H0 : π = 0.6 versus Ha : π > 0.6 for a binomial distribution with
n = 10. If we observe Y = 9 successes, the exact p-value equals
pr(Y ≥ 9) = (10 choose 9) 0.6^9 0.4^1 + 0.6^10 = 0.0403 + 0.0060 = 0.0463.
y    p_{0,y} = pr(Y = y) under H0
0 0.000
1 0.002
2 0.011
3 0.042
4 0.111
5 0.201
6 0.251
7 0.215
8 0.121
9 0.040
10 0.006
We found that p_{0,9} = 0.040, and p_{0,0}, p_{0,1}, p_{0,2}, p_{0,9}, p_{0,10} ≤ p_{0,9};
hence the p-value is 0.000 + 0.002 + 0.011 + 0.040 + 0.006 = 0.059. You may also get
the same answer using the R function binom.test(c(9, 1), p = 0.6).
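The small-probability method can be sketched in stdlib Python:

```python
from math import comb

def dbinom(y, n, p):
    """Binomial pmf pr(Y = y)."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n, pi0, y_obs = 10, 0.6, 9
p_obs = dbinom(y_obs, n, pi0)

# Two-sided p-value: sum pr(Y = y) over all outcomes y whose probability
# under H0 is no larger than that of the observed outcome.
p_value = sum(dbinom(y, n, pi0) for y in range(n + 1)
              if dbinom(y, n, pi0) <= p_obs)
print(round(p_value, 4))        # about 0.0587, i.e. 0.059 after rounding
```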
where F_{a,b}(c) denotes the 1 − c quantile of the F distribution with df = (a, b).
Due to the discreteness of the binomial distribution, the actual coverage
probability of the Clopper-Pearson interval is usually greater than 1 − α. This
approach is used in the binom.test function of R.
The various confidence interval methods do not necessarily achieve the stated
confidence level. We create a plot showing the true coverage probabilities of the
Wald, Wilson, Agresti-Coull, and Clopper-Pearson intervals when n = 40 and
1 − α = 0.95, as a function of π.
[Plot: true coverage probability (CP) versus π for each interval method.]
1. The plot shows that the coverage probability changes with the underlying
proportion π from which the data are generated.
2. The coverage probability gets worse (farther from the nominal level 0.95) as
π gets close to 0 or 1.
Set this equal to zero and solve for µ to get the MLE:

µ̂ = Σ_{i=1}^{n} Y_i / n = Ȳ

1. E(µ̂) = µ (the average of the MLE is the parameter of interest)
2. var(µ̂) = µ/n (the variance depends on the parameter of interest and it
decreases with the sample size)
3. SE = √(µ̂/n)
4. The distribution of µ̂ is approximately normal with the above mean and
variance.
As we did for the binomial distribution, we can obtain confidence intervals for µ
based on either the Wald statistic or the score statistic. The Wald statistic for
testing H0: µ = µ0 is

Z = (Ȳ − µ0) / √(Ȳ/n).

The resulting level 1 − α confidence interval for µ is given by

Ȳ ± Z_{α/2} √(Ȳ/n).

The corresponding score statistic is

Z = (Ȳ − µ0) / √(µ0/n).
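A stdlib Python sketch of the Wald interval, illustrated with the earlier seven daily COVID-19 counts (Ȳ = 26.7, n = 7):

```python
from math import sqrt

def poisson_wald_ci(ybar, n, z=1.96):
    """Wald interval for the Poisson mean: Y-bar +/- z * sqrt(Y-bar / n)."""
    half = z * sqrt(ybar / n)
    return ybar - half, ybar + half

# Illustration with the earlier daily COVID-19 counts: 7 days, mean 26.7.
lo, hi = poisson_wald_ci(26.7, 7)
print(round(lo, 2), round(hi, 2))   # about 22.87 and 30.53
```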
Expected value: E(π̂_i) = π_i
Variance: var(π̂_i) = π_i(1 − π_i)/n
Correlation: corr(π̂_i, π̂_j) = −√( π_i π_j / [(1 − π_i)(1 − π_j)] )
When the null hypothesis is true, the maximized restricted likelihood under
the null hypothesis will be similar in value to the maximized unrestricted
likelihood.
However, if the null hypothesis is false, the maximized unrestricted likelihood
can be much larger.
Thus, we will reject H0 when Λ is small, or equivalently, when
G 2 = −2log(Λ) is large.
The following data are obtained from Table 1 of the paper.¹
Let πx-y be the death rate for age group x-y . Test if π0-29 < 0.002 and
π30- > 0.03.
¹ Vital Surveillances: The Epidemiological Characteristics of an Outbreak of 2019
Novel Coronavirus Diseases (COVID-19) — China, 2020. China CDC Weekly 2020.
Epub February 17, 2020.
Samiran Sinha (TAMU) Categorical Data May 28, 2021 75 / 95
First we write H0 : π0−29 ≥ 0.002 and π30− ≤ 0.03 and Ha : π0−29 < 0.002
or π30− > 0.03.
Pay attention to how the alternative hypothesis is written.
Different approaches can be taken to test this hypothesis. Here we use the
Pearson’s Chi-square goodness of fit test.
First we shall calculate the expected cell frequency under H0 . For 0-29, the
expected deaths are 9.17, and for the age group older than 30 years, the
expected deaths are 1202.64.
Next we form this table:
A sample of 156 dairy calves born in a particular county in Florida was classified
according to whether they caught a primary infection and a secondary infection
within 60 days of birth. A secondary infection can occur only after a primary one,
so the (primary No, secondary Yes) cell is structurally empty. The following data
were recorded:

                     Secondary Infection
Primary Infection    No      Yes
No                   63      0
Yes                  63      30

The cell probabilities in terms of θ, the probability of an infection:

                     Secondary Infection
Primary Infection    No          Yes     Total
No                   *           *       1 − θ
Yes                  θ(1 − θ)    θ^2     θ

In terms of the observed counts:

                     Secondary Infection
Primary Infection    No         Yes       Total
No                   *          *         n3 = 63
Yes                  n2 = 63    n1 = 30   n1 + n2 = 93

n1 + n2 + n3 = 156
Equivalently, maximize

L(θ) = Σ_{i=1}^{3} n_i log(π_i(θ))
Cell                     1        2        3        Total
Observed count           30       63       63       156
Estimated probability    0.244    0.250    0.506    1
Expected count µ̂_i       38.07    38.99    78.94    156
(O − E)^2/E              1.71     14.78    3.22     19.71
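The table's numbers can be reproduced in a short stdlib Python sketch (the closed-form MLE follows from maximizing L(θ)):

```python
# Observed counts: n1 = 30 (both infections), n2 = 63 (primary only),
# n3 = 63 (no primary infection).
n1, n2, n3 = 30, 63, 63
n = n1 + n2 + n3

# Cell probabilities theta^2, theta*(1-theta), 1-theta give a log-likelihood
# (2*n1 + n2)*log(theta) + (n2 + n3)*log(1 - theta), maximized in closed form:
theta = (2 * n1 + n2) / (2 * n1 + 2 * n2 + n3)

probs = [theta ** 2, theta * (1 - theta), 1 - theta]
expected = [n * p for p in probs]
x2 = sum((o - e) ** 2 / e for o, e in zip([n1, n2, n3], expected))
print(round(theta, 3), [round(e, 2) for e in expected], round(x2, 2))
```

The output matches the table above (θ̂ ≈ 0.494, X^2 ≈ 19.71).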
Eighty litters, each containing 3 rabbits, were observed and the number Y of male
rabbits in the litter was recorded, resulting in the following frequency distribution:
If the assumption of Bernoulli trials is correct for the sex of rabbits, the
distribution of Y should be binomial with n = 3 and θ = probability of a male
birth.
Solve for θ:

θ̂ = Σ_{i=0}^{3} n_i · i / (3 Σ_{i=0}^{3} n_i) = Number of Male Rabbits / Total Number of Rabbits
Test statistic:

T = Σ_i (O_i − E_i)^2 / E_i = 1.07
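Assuming the observed litter frequencies are (19, 32, 22, 7) for Y = 0, 1, 2, 3 (the counts used in the chisq.test illustration that follows), a stdlib Python sketch reproduces T:

```python
from math import comb

counts = [19, 32, 22, 7]                 # litters with Y = 0, 1, 2, 3 males
n_litters = sum(counts)                  # 80 litters of 3 rabbits each

# MLE: total number of male rabbits / total number of rabbits
theta = sum(y * c for y, c in enumerate(counts)) / (3 * n_litters)

# Expected counts under Binomial(3, theta) and the chi-square statistic
probs = [comb(3, y) * theta ** y * (1 - theta) ** (3 - y) for y in range(4)]
expected = [n_litters * p for p in probs]
t_stat = sum((o - e) ** 2 / e for o, e in zip(counts, expected))
print(round(theta, 4), round(t_stat, 2))
```

This reproduces θ̂ ≈ 0.404 and T ≈ 1.07.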
Revisit the last example. The software does not know that θ was estimated, so
it uses 3 degrees of freedom, which is incorrect.
Code
obs.freq=c(19, 32, 22, 7)
prob.vec=c( 0.2118, 0.4305, 0.2918, 0.0659)
chisq.test(obs.freq, p=prob.vec)
data: obs.freq
X-squared = 1.0661, df = 3, p-value = 0.7853
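With θ estimated from the data, the correct reference distribution has 4 − 1 − 1 = 2 degrees of freedom; for df = 2 the chi-square upper-tail probability is simply exp(−x/2), so the corrected p-value can be sketched as:

```python
from math import exp

# With theta estimated, df = cells - 1 - estimated parameters = 4 - 1 - 1 = 2.
# For df = 2, the chi-square upper-tail probability is exp(-x/2).
x2 = 1.0661                 # X-squared reported by chisq.test
p_value = exp(-x2 / 2)
print(round(p_value, 4))    # corrected p-value, about 0.587
```

In R: 1 - pchisq(1.0661, df = 2).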