M3 L03 Contingency Analysis D2L


Module #3: Proportions and Frequencies

Lecture 3: Analysis of Frequencies (Contingency analysis)


(Reading: Whitlock and Schluter pp. 179-185, 191-192, 203-223, 245-255)

Here is a second type of analysis for frequency data: Contingency analysis. The
underlying arithmetic of the test is the same as the Goodness-of-fit test, but
now we are asking whether two categorical variables are associated with each
other. In contingency analysis, we are trying to determine if the frequencies of
different values for one variable depend on the value of another variable. We
want to determine if one variable is “contingent” or “dependent on” the other.
This is ultimately telling us whether two variables are independent or not.

But before we show you an example contingency analysis, we will look at
the difference between extrinsic and intrinsic hypotheses.

I. Extrinsic and Intrinsic Hypotheses


1) Extrinsic expectations/hypotheses –
- when expected frequencies are derived from information other than the data we are analyzing

- the expectations are treated as a given.

- For instance, studying the frequency of male and female
offspring in a lizard, you might be interested in testing if the
observed frequencies of males and females deviate from a
50:50 ratio.
- That expectation, which you can think of as a null
hypothesis, was chosen for reasons that have nothing to do
with the data you’re analyzing. The expected number of
male offspring would be half the observed total number of
offspring, and similarly for the expected number of female
offspring. If the expected number comes out as a fraction,
that’s fine.

ex. hockey player lab: the Canadian population birth data were extrinsic
ex. the Tim roll-up example; bird example 1
2) Intrinsic expectations/hypotheses -
- when expected frequencies are derived from the data you are analyzing
- no information is assumed prior to the study
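
As a minimal sketch of the extrinsic case above, the expected counts come from a ratio chosen in advance (50:50) and only the total comes from the data. The offspring counts here are hypothetical, invented purely for illustration; they are not from the lecture.

```python
# Hypothetical lizard offspring counts (invented for illustration only).
observed = {"male": 21, "female": 16}

# Extrinsic hypothesis: the 50:50 ratio is fixed before looking at the data.
hypothesised_proportions = {"male": 0.5, "female": 0.5}

# Only the total number of offspring is taken from the data.
total = sum(observed.values())

# Expected count = hypothesised proportion * total; fractions are fine.
expected = {sex: p * total for sex, p in hypothesised_proportions.items()}

print(expected)
```

With an intrinsic hypothesis, by contrast, the proportions themselves would have to be estimated from the observed counts.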

II. An example of a Contingency Analysis using an intrinsic hypothesis:

For instance, let’s say you’re interested in whether different species of
eel tend to be present in different habitats.

Young and Winn (2003) counted sightings of the spotted moray
eel, Gymnothorax moringa, and the purplemouth moray eel, G.
vicinus, in a 150-m by 250-m area of reef in Belize. They identified
each eel they saw, and classified the locations of the sightings into
three types: those in grass beds, those in sand and rubble, and
those within one meter of the border between grass and
sand/rubble. The numbers of sightings are shown in the table:

          Spotted   Purplemouth
Grass       127         116
Sand         99          67
Border      264         161

CONTINGENCY TABLE: used to analyze whether the row an observation falls in
is contingent (dependent) on the column it falls in, and vice versa.

It helps to calculate the totals in each category; these are called
“column totals” and “row totals”.
→These are the observed frequencies.
          Spotted   Purplemouth   Total
Grass       127         116        243
Sand         99          67        166
Border      264         161        425
Total       490         344        834

Now let’s say that:

H0: The proportions of the eel species (spotted or purplemouth) are equal
across the three habitat types (grass, sand, border) on the coral reef in
Belize.

HA: The proportions of the eel species (spotted or purplemouth) are not
equal across the three habitat types (grass, sand, border) on the coral
reef in Belize.

→ We will do a G-test for contingency analysis using a contingency
table.
• The hypothesis doesn’t actually tell you what the expected
number of eel sightings in each habitat would be.
• Instead, you need to estimate the expected numbers, using the
column and row totals.

Step 1: Calculate expected frequencies for each category

E= expected frequencies

          Spotted        Purplemouth     Total
Grass     127 (E=142)    116 (E=100)     243
Sand       99 (E=98)      67 (E=68)      166
Border    264 (E=250)    161 (E=175)     425
Total     490            344             834

344/834 = 41.2% of the eels were purplemouth, so multiply 243 × 41.2% to
get the expected number of purplemouth sightings in grass (E = 100).

What does it look like is happening in terms of habitat choice for the
two species?

These expected numbers are intrinsic to the data. We didn’t have any
scientific reason beforehand to say, “I expect 41.2% of eels found in
sandy habitats to be G. vicinus”.
The frequencies are features of our data, so our expected numbers are
said to constitute an intrinsic hypothesis.
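
The expected frequencies in Step 1 can be sketched in a few lines: each expected count is (row total × column total) / N, which is the same as multiplying a habitat's row total by the overall proportion of that species. The counts are the ones from the table; the variable names are just illustrative choices.

```python
# Observed eel sightings from the contingency table (habitat x species).
observed = {
    "Grass":  {"Spotted": 127, "Purplemouth": 116},
    "Sand":   {"Spotted": 99,  "Purplemouth": 67},
    "Border": {"Spotted": 264, "Purplemouth": 161},
}
species = ["Spotted", "Purplemouth"]

# Row totals (per habitat), column totals (per species), and grand total N.
row_totals = {h: sum(counts.values()) for h, counts in observed.items()}
col_totals = {s: sum(observed[h][s] for h in observed) for s in species}
n = sum(row_totals.values())

# Intrinsic expectation: E = row total * column total / N.
expected = {
    h: {s: row_totals[h] * col_totals[s] / n for s in species}
    for h in observed
}

for h in observed:
    print(h, {s: round(expected[h][s], 2) for s in species})
```

Note that the unrounded expected count for spotted eels in grass is about 142.8, whereas the table above uses E=142; the other five cells match the table after rounding.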

Step 2: Now that we have our observed and expected frequencies, we
can calculate our test statistic. Again, we will conduct a G-test.
$$G = 2 \sum_{i=1}^{k} O_i \ln\left(\frac{O_i}{E_i}\right)$$

In this eel-habitat example, we have a total of k=6 categories into which
observations can fall.
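$$G = 2\left[127\ln\left(\frac{127}{142}\right) + 116\ln\left(\frac{116}{100}\right) + 99\ln\left(\frac{99}{98}\right) + 67\ln\left(\frac{67}{68}\right) + 264\ln\left(\frac{264}{250}\right) + 161\ln\left(\frac{161}{175}\right)\right]$$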

G = 2(−14.178 + 17.2167 + 1.0050 − 0.99261 + 14.385 − 13.4244)

G = 8.022
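
The arithmetic in Step 2 can be checked with a short sketch. The function name `g_statistic` is an illustrative choice, not something from the lecture. The first call uses the rounded expected values from the table and reproduces a G of about 8.02. The second call uses unrounded expected values (row total × column total / N) and gives a noticeably smaller G, roughly 6.2, which shows how sensitive the statistic is to rounding the expected counts.

```python
import math

def g_statistic(observed, expected):
    """G = 2 * sum(O_i * ln(O_i / E_i)) over all k categories."""
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

# Cells in the order: grass, sand, border; spotted then purplemouth in each.
observed = [127, 116, 99, 67, 264, 161]

# Expected counts as rounded in the lecture table.
expected_rounded = [142, 100, 98, 68, 250, 175]
g_rounded = g_statistic(observed, expected_rounded)

# Unrounded expected counts: row total * column total / N.
row_totals = [243, 243, 166, 166, 425, 425]
col_totals = [490, 344, 490, 344, 490, 344]
expected_exact = [r * c / 834 for r, c in zip(row_totals, col_totals)]
g_exact = g_statistic(observed, expected_exact)

print(round(g_rounded, 3), round(g_exact, 3))
```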

Step 3: Compare the test statistic to the tabulated χ2 distribution


To do this, we need the degrees of freedom (df).
Recall:
$$df = k - p - 1$$
k = number of categories (k = rows × columns)
p = number of parameters estimated

Degrees of freedom for the intrinsic hypothesis:
If you have an intrinsic hypothesis, you have to estimate (c-1)+(r-1)
parameters, where c is the number of columns in your table and r is
the number of rows.
• The reason you only have to estimate c-1 column totals and r-1
row totals is because you can take the total number of
observations N as given.
• So if you know N, and you’ve estimated c-1 column totals, you
also have an estimate of the remaining column total. It’s the
same for rows. If you know N, and you’ve estimated r-1 row
totals, you also have an estimate of the remaining row total.
• For our example of an intrinsic hypothesis that eels and habitat
type are independent, the table has r=3 rows and c=2 columns, so
k=6 categories and p = (r-1)+(c-1) = (3-1)+(2-1) = 3 estimated
parameters, so df = 6-3-1 = 2; this is the same as
(r-1)(c-1) = (3-1)(2-1) = 2
So…
$$df = rc - [(r - 1) + (c - 1)] - 1$$
$$df = 6 - [(3 - 1) + (2 - 1)] - 1 = 2$$
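
A small sketch can confirm that the two routes to the degrees of freedom agree: df = k − p − 1 with p = (r−1)+(c−1), and the shortcut (r−1)(c−1). The function names are illustrative only.

```python
def df_from_parameters(rows, cols):
    """df = k - p - 1, with k = rows * cols and p = (rows-1) + (cols-1)."""
    k = rows * cols
    p = (rows - 1) + (cols - 1)
    return k - p - 1

def df_shortcut(rows, cols):
    """Equivalent shortcut for a contingency table: (rows-1) * (cols-1)."""
    return (rows - 1) * (cols - 1)

# Eel example: 3 habitats (rows) by 2 species (columns).
print(df_from_parameters(3, 2), df_shortcut(3, 2))
```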

$$\chi^2_{crit} = \chi^2_{0.05,2} = 5.991, \qquad G = 8.022$$

$G > \chi^2_{0.05,2}$, therefore REJECT the null hypothesis

Here, we are doing a G-test for independence using a contingency table
to see whether habitat choice for two species of eel is independent of
the species type.
Biological Conclusion: The proportions of the two eel species (purplemouth,
spotted) are not equal across the 3 habitat types (grass, sand, border)
(contingency analysis, G=8.022, df=2, p<0.05)
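
The comparison in Step 3 can be sketched without a statistical table, but only for this particular case: with exactly 2 degrees of freedom the χ2 distribution has the closed-form upper-tail probability P(X > x) = exp(−x/2). The code below relies on that special case and would not be valid for other df values.

```python
import math

alpha = 0.05
g = 8.022  # test statistic from Step 2

# Valid only for df = 2: P(X > x) = exp(-x / 2) for a chi-square variable.
critical_value = -2 * math.log(alpha)   # solves exp(-x / 2) = alpha
p_value = math.exp(-g / 2)

reject_null = g > critical_value

print(round(critical_value, 3), round(p_value, 4), reject_null)
```

The p-value comes out a little under 0.02, consistent with reporting p<0.05.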

• So now we know that habitat in which an eel is found is NOT
independent of the eel species, but what classes in the table are
similar to others, and which are different?

We won’t show how to do this, but know that there is a way to look at
exactly HOW the frequencies differ from what you expect:
partitioning G to see which terms in the calculation of G are
larger/smaller than you would expect due to random chance alone.

Summary:
What is the difference between using the G-test as a contingency
analysis or a test for goodness of fit?
- If you have 2 categorical variables and are looking at their independence, it is a contingency analysis, not a goodness-of-fit test.

What is the difference between an intrinsic and extrinsic hypothesis?
- It is all about whether you use the data to generate the expected frequencies.

Contingency analyses: the majority of the time they use an intrinsic hypothesis.

7. All of the following correctly describe contingency analysis except?
a. Contingency analysis is used to determine whether two or
more categorical variables are associated with one another.
b. The degrees of freedom are calculated as the number of
categories minus one.
c. In contingency analysis, the null hypothesis is testing the
independence of two or more categorical variables.
d. Contingency analysis utilizes expected frequencies for each
category combination compared to observed frequencies for
each category combination.
e. The χ2 contingency test statistic makes the same
assumptions as the χ2 goodness-of-fit test.

Note on the assumptions:
1. Random sample
2. No more than 20% of cells have an expected frequency of less than 5
(otherwise power will be too low)
3. No cells have expected frequency less than 1
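
The expected-frequency assumptions noted alongside question 7 (no more than 20% of cells with an expected frequency below 5, and no cell below 1) can be checked mechanically. This is a minimal sketch; the function name `check_expected_counts` is hypothetical, and the random-sample assumption cannot be checked from the counts alone.

```python
def check_expected_counts(expected):
    """Return True if the expected counts satisfy both rules of thumb:
    at most 20% of cells below 5, and no cell below 1."""
    n_cells = len(expected)
    share_below_5 = sum(e < 5 for e in expected) / n_cells
    any_below_1 = any(e < 1 for e in expected)
    return share_below_5 <= 0.20 and not any_below_1

# Rounded expected counts from the eel example.
eel_expected = [142, 100, 98, 68, 250, 175]
print(check_expected_counts(eel_expected))
```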

9. The table (below) shows 605 heart attack survivors that were
randomly assigned to eat two different diets. The patients were
tracked 4 years later to determine the frequencies of different
outcomes.

                                             Outcome
Diet                         Fatal Heart   Non-Fatal Heart   Cancer   Healthy   Total
                             Disease       Disease
American Heart Association       24             25             15       239      303
Mediterranean                    14              8              7       273      302
Total                            38             33             22       512      605

What would be the expected frequency for eating a “Mediterranean Diet”
and developing “Non-fatal Heart Disease” if the two variables were
independent?
a. 18.97
b. 8.00
(note: expected = row total × column total / N = 302 × 33 / 605 ≈ 16.47)
c. 16.47
d. 10.98
e. 302
