STATISTICS
HANDBOOK
Greg Martin
MBA, MPH, MBBCh, MFPHMI
Table of contents
Abbreviation list
Appendix
    Common research methods and statistical tests
    Risk
Reference list
GETTING ACQUAINTED
WITH DATA
Describing data
In this lesson, we are going to talk about the first step to good statistical
analysis. Specifically, we’ll explore how to describe, summarize, and visualize
data. We will also talk about producing tables, plots, and graphs for different
types of variables.
To explain what observations are, let’s talk about an example. James has certain
characteristics that we’re interested in—name, gender, age, weight, and height.
He’s a 27-year-old male who weighs 75.1 kg and is categorized as short. Each of
these pieces of information about James is an observation. In our spreadsheet, this
information is stored as data under the appropriate column headings or variable
headings. All of our observations make up a data set.
Name Gender Age Weight (kg) Height
James Male 27 75.1 Short
Barra Male 32 98.3 Short
Sarah Female 34 63.5 Medium
Bill Male 23 87.2 Tall
Categorical data
There are two types of categorical variables:
1. Nominal categorical variables, which have no inherent order (e.g., gender)
2. Ordinal categorical variables, which have a natural order (e.g., height
categorized as short, medium, or tall)
Think of these categories as buckets into which the values from any other variable
can be placed and then compared. For example, we could compare the average
weight or the average age of males and females in this group.
Numerical data
Numerical data consists of numbers that can be placed on a number
line. However, these numeric variables can fall on that number line in two
different ways:
1. Discrete variables
Age is a typical discrete variable. Each observation falls definitively on a
value on the integer number line (e.g., 32, 33, 34).
2. Continuous variables
Weight can fall on any number on the line, including fractions between two
integers. For example, Barra weighs 98.3 kg.
Describing variables
A trick to understanding how your numeric variable values are distributed along
the number line is to imagine them sitting on the number line. When there is
more than one observation for a particular number, they get stacked upon each
other. This turns into an interesting shape, which we call a distribution.
Let’s look at the most useful ways that we can describe the data in a distribution.
Range
Firstly, a distribution will have minimum and maximum values. The distance
between these two values is the range, which tells us the spread of the data.
If we divide all the observations into four equal groups, each of those groups will
contain a quarter of all the observations. The two middle quarters will be called the
interquartile range, which is another way to describe how the data are spread out.
Interquartile range
When the distribution, or the shape of the data, is symmetrical, the values of the
mean, median, and mode will be the same.
However, if the distribution of values has more observations toward one end of
the number line, it will appear to have a long tail to one side.
If this long tail occurs on the left side, the distribution is called left-skewed.
In this case, the mean is disproportionately affected by the outliers and extreme
values and is pulled too far to the left. Similarly, if the tail is to the right,
the distribution is right-skewed, and the mean is pulled too far to the right.
Right-skewed distribution
Therefore, the value of the mean is not a good measure of centrality. Clearly, when
the distribution is skewed, the median is a more robust measure of centrality.
Mean, median, and mode tell us where to find the middle of the
data. Range, interquartile range, and standard deviation describe
how spread out the data are.
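To make these measures concrete, here’s a minimal Python sketch (not part of the original handbook) using the standard statistics module; the weights are invented.

```python
# A minimal sketch of the measures of centrality and spread; weights invented.
import statistics

weights = [63.5, 75.1, 80.2, 87.2, 98.3, 75.1, 70.0, 75.1]

print("Mean:  ", statistics.mean(weights))
print("Median:", statistics.median(weights))
print("Mode:  ", statistics.mode(weights))
print("Range: ", max(weights) - min(weights))

# Quartiles: statistics.quantiles() returns the three cut points that split
# the data into four equal groups.
q1, q2, q3 = statistics.quantiles(weights, n=4)
print("Interquartile range:", q3 - q1)
print("Standard deviation: ", statistics.stdev(weights))
```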
Histograms
We can visualize the distribution of a numeric data set by imagining buckets that
represent different intervals along the x-axis. The buckets can be of any chosen
size (e.g., 0–10, 10–20, 20–30, and so on). By counting how many observations
fall into each of those buckets, we can create a histogram.
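Here’s a short sketch of that bucket-counting idea using NumPy; the ages and bucket edges are invented.

```python
# Count how many observations fall into each bucket along the x-axis.
import numpy as np

ages = [23, 27, 32, 34, 8, 15, 41, 56, 62, 38]
counts, edges = np.histogram(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70])

# Print a simple text histogram: one '#' per observation in each bucket
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>2.0f}-{hi:<3.0f}: {'#' * n}")
```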
Another useful way to visualize a distribution is a box plot. Let’s talk about
each of its components. The interquartile range is depicted by the box itself
and will contain 50% of the data. The median, or the value that splits all of
the data into two separate groups, is represented by the line in the middle of
the box.
The lines extending from the box are called whiskers. They extend out to 1.5 times
the interquartile range. Any values outside of the whiskers are called outliers.
For example, there are four people that have been categorized as short, two are
medium height, and two are tall, for a total of eight observations. This can be
recorded in a frequency table.
Height    Frequency
Short     4
Medium    2
Tall      2
Total     8
Relative frequency
We can determine the relative frequency, or proportion of the total, for each
category. In this case, the relative frequency of short individuals is found by
dividing the number of short people by the total number of observations, which
gives us a value of 0.5, or a half.
We can repeat this process for all the categories in the data set.
Height    Relative frequency
Short     0.50
Medium    0.25
Tall      0.25
We can calculate the relative frequency or the percentage, which can be recorded
in the frequency table in brackets next to the frequency value. For example, in
the male column, three males out of a total of five males are short, or 60% of the
males are short.
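A minimal pandas sketch of these frequency calculations; the eight-person data set below is modeled on the example but partly invented.

```python
# Frequency, relative frequency, and a cross-tabulation by gender.
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Male",
               "Female", "Female", "Female"],
    "height": ["Short", "Short", "Short", "Tall", "Medium",
               "Short", "Medium", "Tall"],
})

# Frequency and relative frequency of each height category
print(df["height"].value_counts())
print(df["height"].value_counts(normalize=True))

# Height frequencies within each gender, as percentages
# (e.g., 3 of the 5 males are short, or 60%)
print(pd.crosstab(df["height"], df["gender"], normalize="columns") * 100)
```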
One option is to create a scatter plot. A scatter plot is where each point
corresponds to the x and the y coordinates of a given observation, or in the case
of our data set, a person. You can also add a trendline to this plot.
For example, we’ve got Sarah who is 34 years of age and weighs 63.5 kg. Her
information can be represented in the following scatter plot.
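Here’s a minimal sketch of a scatter plot with a trendline using matplotlib and NumPy; apart from Sarah’s values, the points are invented.

```python
# Scatter plot of age vs. weight with a fitted trendline.
import numpy as np
import matplotlib.pyplot as plt

age = np.array([27, 32, 34, 23, 45, 51, 38, 29])
weight = np.array([75.1, 98.3, 63.5, 87.2, 82.0, 90.5, 78.3, 70.2])

plt.scatter(age, weight)

# Trendline: fit a straight line (degree-1 polynomial) to the points
slope, intercept = np.polyfit(age, weight, 1)
plt.plot(age, slope * age + intercept)

plt.xlabel("Age (years)")
plt.ylabel("Weight (kg)")
plt.show()
```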
Age is not affected by a person’s weight, so we treat age as an independent
variable and, by convention, put it on the x-axis. By contrast, weight might be
dependent on, or affected by, age, so there might be some sort of causative
relationship between your age and your weight. Therefore, weight is a dependent
variable. By convention, we put dependent variables on the y-axis.
Using this plot, we can see the difference in weight between males and females
in tall people, medium-height people, and in short people.
Consider a simple data set of the average (fictitious) earnings of males and
females. In this data set, there are no missing values. For males, the average
earning is $5000 a month. For females, the average earning is $4600.
Name Gender Earnings ($)
Barra Male 5000
Peter Male 7000
James Male 3000
Andrew Male 7000
Philip Male 3000
Mary Female 5000
Jane Female 7000
Chloe Female 3000
Wendy Female 7000
Colette Female 1000
The reason for this is that the missing data aren’t randomly distributed. There’s a
pattern to it, but the pattern isn’t obvious. You may need to look at the distribution
of your missing data in relation to another variable to see the pattern.
And in this example, the data had only been collected from people living in cities,
and the absence of rural data represented a systematic bias in the data set.
Name     Gender   Earnings ($)   Home
Barra    Male     5000           Urban
Peter    Male     NA             Rural
James    Male     3000           Urban
Andrew   Male     NA             Rural
Philip   Male     3000           Urban

Average monthly earnings = 3600
2. Delete just the rows where data are missing for a specific variable.
5. Replace missing values with our best guess as to what each value should be.
This is called imputation.
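Here’s a minimal pandas sketch of options 2 and 5, using the earnings table above; the mean-imputation strategy shown is just one possible best guess.

```python
# Two simple strategies for handling missing data.
import pandas as pd

df = pd.DataFrame({
    "name": ["Barra", "Peter", "James", "Andrew", "Philip"],
    "earnings": [5000, None, 3000, None, 3000],
})

# Option 2: delete just the rows where a specific variable is missing
complete = df.dropna(subset=["earnings"])

# Option 5: imputation, i.e., replace missing values with a best guess
# (here, the mean of the observed values)
imputed = df.fillna({"earnings": df["earnings"].mean()})

print(complete)
print(imputed)
```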
PERFORMING
STATISTICAL TESTS
Testing hypotheses
Generate a hypothesis
A hypothesis is a testable theory; something that can be falsified.
To check if we are correct about the average weight of adult males in Ireland, we
can’t weigh every male in Ireland. So, instead of weighing all of them, we look at a
small group of adult males, a sample. The sample needs to be random so that it
more closely represents the wider population. By looking at the characteristics of
that sample, we can make an inference about the population as a whole.
Let’s say that people have always believed that adult males in Ireland have an
average weight of 65 kg. That’s the null hypothesis, H0. However, we believe that is
incorrect and that the data will show that it’s incorrect. We’d like to argue that the
average weight is not equal to 65 kg; this is our alternative hypothesis, H1.
We want to use the data to show that one of them is likely to be false. And because
they are mutually exclusive, if we show that one is false, we can be confident that
the other is likely to be true.
So, our actual hypothesis is the alternative hypothesis, H1. But instead of testing
H1 directly, we test the opposite, the null hypothesis, H0. If we find that H0 is likely
incorrect, then we can infer that H1 is likely correct.
If that were true, then when we weighed a sample of adult males, the average
wouldn’t necessarily be exactly 65 kg, but it would likely be close to 65 kg.
For example, our first sample might have an average weight of just less than 65
kg. The second time around, the average weight might be just over 65 kg. The
third time, it might be exactly 65 kg. And we continue to repeat this and most of
our samples would be in and around 65 kg.
Remember, our assumption at this point is that the null hypothesis is correct and
that the average weight of adult males in Ireland is 65 kg. If that were true, most
of our samples would be in and around 65 kg, as shown below. A few of them by
virtue of absolute chance would be a little bit further away. Some of them might
be 55 kg, some of them 75 kg, very few of the samples through absolute random
chance may be 50, 70, or 80 kg, and very few might be 90 kg. But we’d expect
most to be in the middle. This is what we call a distribution.
Distribution of sample means if our null hypothesis is true, centered on 65 kg
In the next illustration, we’ve added a line above all of our data that shows the
shape of our distribution. Notice that our distribution for the null hypothesis
resembles the shape of a hill. Underneath the line of the hill, we can find all of the
samples, 100%. A small percentage, 2.5%, would fall in the small corners, or tails,
at either side. This shape is referred to as a normal distribution.
Shape of a normal distribution resembles a hill, with 100% of samples under the curve and its peak at 65 kg
If the null hypothesis is true, we expect our sample to be somewhere in the big
bundle in the middle. The likelihood of the sample landing within the tails is so
low that if our sample has an average within the tail, say 90 kg, we would lose
confidence in our null hypothesis. Indeed, we may have so little confidence that
we reject the null hypothesis, H0, and decide that the claim that the average
weight of adult males in Ireland is 65 kg is incorrect.
We would then accept the alternative hypothesis, H1, and decide that the average
weight of adult males in Ireland is not 65 kg.
The significance level is the line in the sand or threshold where we decide it’s too
unlikely. Anything beyond the threshold is close to the edge and so unlikely that
we will reject our null hypothesis. We call this threshold alpha (α).
Above, we’ve chosen a threshold of 2.5% (or 0.025) on each side. Since we have
two tails, together that makes our α 5% (or 0.05).
α = 5%
Now, we often use 5% (or 0.05) as the α or the significance level, but don’t use
it automatically. Instead, we should think about what we’re measuring and the
consequences, and then decide on an appropriate α. In other words, if it’s a
life-or-death matter, we may want to choose a smaller, stricter α.
On the other hand, our alternative hypothesis, H1, could have been
that the average weight was more than 65 kg. For this to be true,
it could only be above 65 kg. So, only one side of the distribution
would be relevant, and only one tail would need to be considered.
We would say that 75 kg ends up in a part of the curve where there’s a low
probability of getting a sample that’s far away from the middle. This orange-
shaded area under the curve is the probability.
Two-tailed thresholds of 2.5% at each end of the distribution, with sample means of 55, 65, and 75 kg marked
The P value is calculated assuming that the null hypothesis is true. Looking at
the distribution of samples that we’d expect, what is the probability that we
would get a random sample from the population that’s that far away from the
middle, or further?
If the P value, in this case, was 3% (or 0.03) and the α was 5% (or 0.05), that
would be sufficient evidence for us to reject the null hypothesis and accept the
alternative hypothesis. We’re confident that the average weight of adult males in
Ireland is not 65 kg.
On the other hand, if the P value was greater than α, then we’ve failed to reject
the null hypothesis. We don’t have sufficient evidence to accept the alternative
hypothesis.
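Here’s a minimal sketch of this kind of test using SciPy; the sample weights are invented, and the presumed population mean is the 65 kg from our null hypothesis.

```python
# One-sample t-test: H0 is that the population mean is 65 kg;
# H1 is that it is not 65 kg (two-tailed).
from scipy import stats

sample_weights = [72.1, 68.4, 75.0, 70.2, 66.8, 74.5, 69.9, 71.3]

result = stats.ttest_1samp(sample_weights, popmean=65)
print("t statistic:", result.statistic)
print("P value:    ", result.pvalue)

alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: the average weight is unlikely to be 65 kg.")
else:
    print("Fail to reject H0.")
```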
We’re asking the questions, can we make an inference about the wider population,
and is this representative of the truth or of the population from which the sample
was taken? In other words, is this difference statistically significant?
1. One-sample
To compare the mean of a single sample with a presumed population mean
2. Two-tailed
To compare the means of two different variables, asking whether they
are different
3. One-tailed
To compare the means of two different variables, asking whether there is a
difference in a particular direction
4. Paired
To compare the means of matched observations (can be one-tailed or
two-tailed)
We calculate the P value. If the P value is less than our threshold, such
as 0.05 (or 5%), then we can reject the null hypothesis that the presumed
average is accurate. We accept that the sample mean we’ve calculated is
statistically significant.
Two-tailed t-test
Two-tailed and one-tailed t-tests both compare two populations. We have
two options for how to do this. First, we will go through a two-tailed approach,
asking, are these two means different?
Let’s say we are comparing the life expectancy in Ireland and Switzerland. We
don’t know in which direction the difference might be. Is the difference that
we’re seeing statistically significant? Would we expect to see a difference of this
magnitude by chance?
Our null hypothesis is that the two countries have the same life expectancy.
Then, if that were correct, we test the probability that we would find samples
with means of 73 years in Ireland and 75.6 years in Switzerland by chance. If the
probability is less than our threshold (e.g., 0.05), we’d say that these means are
statistically significantly different. We would then reject the null hypothesis or
notion that these countries have the same life expectancy and accept that these
life expectancies are statistically significantly different.
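A minimal SciPy sketch of the two-tailed comparison; the life expectancy samples are invented, chosen only to roughly match the means in the example.

```python
# Two-tailed two-sample t-test: are these two means different?
from scipy import stats

ireland = [72.5, 73.8, 71.9, 74.0, 72.3, 73.5]
switzerland = [75.1, 76.2, 75.8, 74.9, 76.5, 75.4]

# ttest_ind is two-tailed by default: H1 is simply that the means differ
result = stats.ttest_ind(ireland, switzerland)
print("P value:", result.pvalue)
```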
One-tailed t-test
Now let’s look at life expectancy in Africa and in Europe. We could approach
the same problem in a slightly different way. We could say, we know that
Africa has a lower life expectancy than Europe, but we’re asking if a difference
of this magnitude is statistically significant. In this scenario, we would use a
one-tailed t-test.
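The same test becomes one-tailed by passing SciPy’s alternative argument (available in SciPy 1.6 and later); again, the samples are invented.

```python
# One-tailed two-sample t-test: H1 is that the first group's mean is lower.
from scipy import stats

africa = [52.1, 54.3, 50.8, 56.2, 53.7, 51.9]
europe = [74.5, 76.1, 73.8, 75.2, 77.0, 74.9]

result = stats.ttest_ind(africa, europe, alternative="less")
print("P value:", result.pvalue)
```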
Paired t-test
For a paired t-test, we’re talking about there being corresponding or paired data
for each observation.
Let’s say we know the life expectancy in Africa in 1957 and 2007. For each
sample or observation in 1957, there is a counterpart in 2007. For example, we’d
have the life expectancy in South Africa for both 1957 and 2007 and the life
expectancy in Malawi for both 1957 and 2007, and so on. There are pairs of data
with a counterpart at each time point.
If the P value was less than 5%, then we would reject the null hypothesis and
accept that the means were significantly different.
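A minimal SciPy sketch of the paired comparison; the per-country values are invented.

```python
# Paired t-test: each 1957 value has a matching 2007 value for the same country.
from scipy import stats

le_1957 = [41.0, 37.6, 44.2, 39.5, 42.8]  # invented per-country values, 1957
le_2007 = [49.3, 48.3, 56.7, 52.5, 59.4]  # the same countries, 2007

result = stats.ttest_rel(le_1957, le_2007)
print("P value:", result.pvalue)
```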
One-tailed
With a one-tailed paired t-test, our hypotheses could be as follows: H0, the mean difference in life expectancy between 1957 and 2007 is zero; H1, life expectancy in Africa was lower in 1957 than in 2007.
If the P value was less than 5%, then we would reject the null hypothesis and
accept that the life expectancy in Africa was significantly lower in 1957.
ANOVA
ANOVA (analysis of variance) lets us compare the means of more than two
populations at once. The null hypothesis is still the same, which is that there are no differences in the
means of these populations. The alternative hypothesis is that there is a
difference and that the difference we’re seeing is statistically significant.
The boxplots and density plots below illustrate the differences between the
means in three populations (Europe, the Americas, and Asia). These data are
from the Gapminder data set.
The data are showing a difference in the means across these different
populations. But is that difference real or statistically significant?
We’d like to know more, though. The ANOVA hasn’t told us which means differ, so
we need to do more analysis.
In the table below, we’ve run our model through what’s called a Tukey multiple
comparison of means. It takes each pair (Asia and the Americas, Europe and the
Americas, and Europe and Asia) and looks at them individually.
We learn two things from this analysis: the 95% confidence intervals and the
adjusted P values.
Comparison Difference Confidence interval Adjusted P value
Asia and Americas –2.9 –6.5 to 0.72 0.14
Europe and Americas 4.0 0.36 to 7.7 0.028
Europe and Asia 6.9 3.5 to 10 0.000019
Confidence intervals
Let’s start by interpreting the differences and confidence intervals. We are
looking to see if the confidence interval includes or crosses zero. For Asia and
the Americas, the confidence interval (–6.5 to 0.72) includes zero, meaning no
difference at all is plausible, so that comparison is not statistically significant.
Adjusted P values
We call these P values adjusted to differentiate them from the P values we would
have gotten if we had just compared two of these means using a t-test. For Asia
and the Americas, the adjusted P value (0.14) is greater than 0.05, so that
difference is not statistically significant.
But in the other two comparisons, where the confidence interval does not include
the possibility of no difference, the P values are less than 0.05. So, the difference
between Europe and the Americas and the difference between Europe and Asia
are considered statistically significant.
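Here’s a minimal SciPy sketch of an ANOVA followed by a Tukey comparison; the samples below are invented stand-ins for the Gapminder data, and tukey_hsd requires SciPy 1.8 or later.

```python
# One-way ANOVA, then Tukey's test to see which pairs of means differ.
from scipy import stats

americas = [72.9, 75.5, 71.2, 73.8, 70.4, 74.6]
asia = [70.7, 72.4, 65.5, 74.8, 66.8, 71.9]
europe = [77.6, 79.4, 80.7, 78.3, 76.5, 79.8]

# ANOVA: is there a difference somewhere among these means?
anova = stats.f_oneway(americas, asia, europe)
print("ANOVA P value:", anova.pvalue)

# Tukey's multiple comparison: which pairs differ, with adjusted P values
# and confidence intervals for each pairwise difference
tukey = stats.tukey_hsd(americas, asia, europe)
print(tukey)
```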
We can test whether or not there is a difference in the proportions across the
different categories. Does the distribution differ from our expectations?
Let’s say we have some flowers, irises to be specific, and we’ve categorized them
as small, medium, and large. In the first instance, we could ask the question, are
the proportions of flowers that are small, medium, and large the same? Do we
expect to see the same number of small, medium, and large flowers in a random
sample that we take from a population?
Proportion of flowers
Chi-squared goodness of fit test
Again, we use the P value to find out the probability of a difference occurring by
chance if the null hypothesis were true. Let’s use a threshold (α) of 0.05 again.
To test the hypothesis, we make a table, perform the chi-squared test, and we get
a P value. If that P value is less than the threshold, we reject the null hypothesis
and accept the alternative hypothesis, inferring that this difference we’re seeing in
the data is statistically significant.
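A minimal SciPy sketch of the goodness of fit test; the flower counts are invented.

```python
# Chi-squared goodness of fit: H0 is that the three sizes occur in equal
# proportions (the observed and expected totals must match).
from scipy import stats

observed = [60, 45, 45]   # counts of small, medium, and large flowers
expected = [50, 50, 50]   # equal proportions of the 150 flowers

result = stats.chisquare(observed, f_exp=expected)
print("P value:", result.pvalue)
```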
We’re asking the question, are the proportions of species dependent on, or
independent of, the size of the flowers? Does knowing the size of a flower tell
us anything about the probability that it belongs to a particular species?
Looking at the graph below, it seems that it does. But we need to check that
statistically.
Counts of each species (setosa, versicolor, and virginica) within each size category (small, medium, and large)
If the P value is very small, below our threshold, we reject the notion that
the proportions are the same and accept that the difference or relationship
we’re seeing is statistically significant.
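A minimal SciPy sketch of the test of independence; the contingency table of counts is invented.

```python
# Chi-squared test of independence on a size-by-species contingency table
# (rows: small, medium, large; columns: setosa, versicolor, virginica).
from scipy import stats

table = [
    [40, 6, 4],    # small
    [8, 30, 12],   # medium
    [2, 14, 34],   # large
]

chi2, p, dof, expected = stats.chi2_contingency(table)
print("P value:", p)
```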
Let’s say that we plot the speed of a car along the x-axis and the distance that it
takes for the car to stop along the y-axis.
Here’s our question, is it correct that the faster a car travels, the longer it’ll take
to stop? Intuitively, we all agree that yes, that’s probably the case. But let’s see
what our data show.
We suspect that the distance to stop may depend on the speed of the car, so we
refer to the distance to stop as our dependent variable. The speed of the car is
our independent variable.
On the x-axis, we plot the independent variable, or in this case the speed of the
car. On the y-axis, we plot the dependent variable, or in this case, the distance
taken to stop.
The model below will let us check whether the value of y depends on the value
of x. Each data point is represented by a red dot.
Stopping distance versus speed of car, with fitted regression line
Slope
The slope in the example on the previous page is 3.9. The model suggests that
for every movement of one unit along the x-axis, there is a movement of 3.9
units along the y-axis; that’s what a slope of 3.9 means. In other words, for
every increase in speed of one mile per hour, the distance required to stop the
car increases by 3.9 feet.
We need the y-intercept for the model because we can’t draw the line without
a slope and a y-intercept. However, as we interpret what the data mean, we
very often do not need the y-intercept and it may not have any meaningful
interpretation.
For example, the distance to stop can’t be –17 feet. The y-intercept is a theoretical
value indicating what y would be if the speed of the car were zero. In our example,
the y-intercept is not important.
There are some potential scenarios when we might be asking, what would y be
when x is zero? If this is our question, we’ve got a y-intercept and even a P value
to go with it.
P value
We’ve also got a P value for that slope, which lets us test a hypothesis. The
null hypothesis here would be that the slope is zero: there is no relationship,
and no real change in y with a change in x.
H0: Slope = 0
H1: Slope > 0
If the null hypothesis were true, what’s the probability that a random sample would
give us a slope of 3.9? It’s certainly possible, but very unlikely. If our threshold is
5% (if the P value is less than 5%), then we reject the null hypothesis and accept
the alternative hypothesis.
In our car example, the P value is extremely small: 1.9 × 10⁻¹². We can reject
the null hypothesis and accept that the slope is not only nonzero but also
statistically significant.
Not all of the change in the distance taken to stop can be explained by the
speed of the car. Multiple variables will contribute to this change. For example, the
kind of road, the driver’s reaction speed, and the time of day may all affect the
distance it takes to stop. We’re very unlikely to ever know all of the variables that
could possibly contribute to the dependent variable. So, this question is always
important to ask, how much of the change in y can we explain by the change
in x?
R-squared answers this question. It is a number between zero and one. In our
example, R-squared is 0.65 or 65%. That means that 65% of the change in the
distance to stop can be explained by a change in the speed of the car. Really
interesting!
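Here’s a minimal SciPy sketch that produces all of these numbers at once; the (speed, distance) points are invented, so the outputs will only be in the spirit of the example.

```python
# Simple linear regression: slope, y-intercept, P value, and R-squared.
from scipy import stats

speed = [4, 7, 9, 12, 15, 18, 20, 24]        # miles per hour
distance = [2, 13, 10, 28, 33, 56, 52, 93]   # feet to stop

fit = stats.linregress(speed, distance)
print("Slope:      ", fit.slope)
print("Y-intercept:", fit.intercept)
print("P value:    ", fit.pvalue)
print("R-squared:  ", fit.rvalue ** 2)
```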
We can check that our data are normally distributed by visually inspecting the
data using a histogram or a Q-Q plot. Or we can apply a Shapiro-Wilk test, which
tests the null hypothesis that the data are normally distributed.
Q-Q plot (expected normal quantiles) and histogram (frequency) used to check normality
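A minimal SciPy sketch of the Shapiro-Wilk test; the heights are invented.

```python
# Shapiro-Wilk test: H0 is that the data come from a normal distribution,
# so a small P value (below alpha) suggests the data are NOT normal.
from scipy import stats

heights = [162.3, 170.1, 168.4, 175.2, 160.8,
           173.5, 166.7, 171.9, 169.0, 164.2]

stat, p = stats.shapiro(heights)
print("P value:", p)
```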
Comparison tests
Comparison tests answer the question, are these different?
Comparison tests
• Two independent groups (one numeric variable, one categorical variable; e.g.,
height in males vs. females): unpaired t-test (parametric) or Mann-Whitney test
(non-parametric)
• More than two independent groups (one numeric variable, one categorical
variable with more than two categories; e.g., height in children, adults, and
older adults): one-way ANOVA (parametric) or Kruskal-Wallis test (non-parametric)
• Differences between paired samples (one numeric variable at two time points,
paired; e.g., weight before / after a diet): paired t-test (parametric) or
Wilcoxon signed-rank test (non-parametric)
If the data are normally distributed, we would usually use a t-test to compare
averages. The t-test compares the average of two independent groups and
assumes that the height variable is normally distributed.
If the data aren’t normally distributed, then we’d use a non-parametric test, like the
Mann-Whitney test to compare two averages.
If we are comparing more than two independent groups and the data aren’t
normally distributed, then the non-parametric test that we’d use would be the
Kruskal-Wallis test.
Paired samples
Let’s talk about paired data. Imagine our sample is a group of people who’ve
recently been on a particular diet. We want to know, did the diet produce a
significant change in weight? These data are paired. In other words, for each
weight-before data point, there’s a corresponding weight-after data point that
relates to the same person. So, the two weight variables are not independent of
each other.
Because the data are paired, we don’t just compare the average weight before with
the average weight after. Rather, we’re interested in the average of the differences
between the two weight variables. We’re asking, is that average significantly
different from zero?
If the data are normally distributed, then we would use the paired t-test, which
considers the average difference between paired samples.
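Here’s a minimal SciPy sketch of the three non-parametric tests from the table above; all of the data are invented.

```python
# Non-parametric comparison tests: no normality assumed.
from scipy import stats

# Mann-Whitney: two independent groups (e.g., height in males vs. females)
males = [178.2, 165.4, 181.0, 172.5, 169.8]
females = [158.1, 166.3, 161.7, 170.2, 155.9]
print("Mann-Whitney P:", stats.mannwhitneyu(males, females).pvalue)

# Kruskal-Wallis: more than two independent groups
children = [120.5, 131.2, 125.8, 118.4]
adults = [172.3, 168.9, 175.6, 170.1]
older_adults = [165.2, 161.8, 168.4, 163.7]
print("Kruskal-Wallis P:", stats.kruskal(children, adults, older_adults).pvalue)

# Wilcoxon signed-rank: paired samples (e.g., weight before / after a diet)
before = [85.2, 92.1, 78.4, 88.8, 95.5, 81.3]
after = [82.9, 89.4, 77.8, 85.1, 91.2, 80.6]
print("Wilcoxon P:", stats.wilcoxon(before, after).pvalue)
```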
Tests of association
Tests of association ask the question, did this change because of that? The table
below summarizes the parametric and non-parametric tests of association.
Tests of association
• Two numeric variables (e.g., height, weight): Pearson’s correlation
coefficient (parametric) or Spearman correlation coefficient (non-parametric)
• Two categorical variables (e.g., gender, age group): chi-squared test
(non-parametric)
Here, we use the chi-squared test, which has no assumption of normality. The chi-
squared test is a non-parametric test.
We would arrange these data in a spreadsheet with the various attributes (or
variables) in columns. These variables will be the objects of our inquiry. In
our data set, we’ve got two categorical variables (sex and age group) and two
numeric variables (height and weight).
The table below summarizes the five most important combinations of data.
Common combinations
What we observe in our sample data    Is it real?
One categorical variable              One sample proportion test
For each of these scenarios, we need to move through the following steps:
1. Define our question and hypotheses
2. Choose an alpha value
3. Analyze the data
When we collect our sample data, we can see that yes, the proportions do change
across the age groups. Is this due to chance? We test the idea that the proportions
are all the same—that’s our null hypothesis. Here, we conduct a chi-squared test.
If the P value is less than the alpha, we reject the null hypothesis and state that
our observation is statistically significant.
We collect some sample data, we find that the average height is indeed different
from the historic height. Is it statistically significant? Well, if there were no
difference, what would the chances be that we observed the difference that we do
(or a greater difference)? So, we conduct a t-test comparing the averages, and if
the P value is less than the alpha, then we can reject the null hypothesis and state
that the observed difference is statistically significant.
In our sample, we do observe a difference. Since our categorical variable has two
independent groups, we conduct a t-test, which gives us a P value. If the P value
is less than the alpha, we reject the null hypothesis and infer that the observation is
statistically significant.
If instead, we had a categorical variable with more than two categories (e.g., age
group with children, adults, and older adults), then we would perform an ANOVA.
We collect sample data and find some sort of relationship. To check if it is real or
by chance, we assume that it is by chance and that there’s no correlation between
the two variables. Here, we conduct a correlation test to find out two things:
1. Correlation coefficient
A number between negative one and one that represents the relationship
between two numeric variables.
2. P value
If the P value is less than alpha, we can reject the null hypothesis, and infer
that the correlation that we see is statistically significant.
• No correlation
If there is no relationship between the two variables, then the correlation
coefficient will be zero.
• Positive correlation
If as x goes up, y also goes up, we have what’s called a positive correlation
and the coefficient will be greater than zero. If there’s a perfectly positive
correlation, the correlation coefficient will be one.
• Negative correlation
If as x goes up, y goes down, we have a negative correlation and the coefficient
will be less than zero. A perfectly negative correlation gives a coefficient of
negative one.
By the way, it doesn’t matter which of our variables is on the x-axis versus the
y-axis; the correlation coefficient will be the same.
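A minimal SciPy sketch of both correlation tests; the heights and weights are invented.

```python
# Correlation tests: Pearson (parametric) and Spearman (non-parametric).
from scipy import stats

height = [152.4, 160.1, 165.8, 170.2, 175.9, 180.3, 185.0, 168.4]
weight = [54.2, 58.9, 65.1, 68.4, 74.8, 80.2, 85.5, 66.9]

r, p = stats.pearsonr(height, weight)
rho, p_s = stats.spearmanr(height, weight)

print("Pearson r:   ", r, " P value:", p)
print("Spearman rho:", rho, " P value:", p_s)

# Swapping x and y makes no difference to the coefficient
r_swapped, _ = stats.pearsonr(weight, height)
print("Same r with axes swapped:", r_swapped)
```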
ASSESSING RESEARCH
METHODS
Distinguishing between quantitative and
qualitative methods
Social scientists, anthropologists, epidemiologists, and economists use two
types of research methods:
1. Qualitative
2. Quantitative
Qualitative research
Qualitative research aims to unpack and understand the nature of a phenomenon or
the qualities associated with a particular phenomenon, asking questions such as:
• Who?
• What?
• Why?
• When?
• How?
Common data collection methods in qualitative research include the following:
• Interviews
e.g., in-depth interviews or key informant interviews
• Focus groups
• Surveys
Quantitative research
By contrast, quantitative research is interested in magnitude and asks, how much?
• Magnitude of an association
e.g., the relationship between a risk factor and an outcome
1. Interventional
Interventional studies test the effect of an intervention. The classic
example is the randomized controlled trial.
2. Observational
Observational studies are non-interventional. They observe what is
happening in the world. Some examples include the following:
- cross-sectional surveys
- case-control studies
- cohort studies
- ecological studies
Case-control studies
To do a case-control study, we start with outcomes of interest (e.g., a condition
or disease), and then study past exposures.
For example, let’s say that we found a group of people who, at the age of 21,
suddenly found that their hair went green. We collect those people in the group.
Then we create an analogous group that includes people that don’t have that
condition but are in every other way similar to the first group.
Drawbacks
The evidence that you get from a case-control study is not considered particularly
strong evidence.
Cohort studies
A cohort study is the opposite. We start off with an exposure of interest, then
study the outcomes.
It might be a rare or unusual exposure, for example, people that have been to the
North Pole. So, we find that group of people. Then we find an analogous group of
people who are similar to the first group—except they haven’t had the exposure of
interest. The demographics are similar but the second group has not been to the
North Pole.
Now, we look forward in time, or prospectively, and follow up with our groups over
time to see what outcomes emerge as a result of that exposure.
Cohort study design: initial observation of the exposed and unexposed groups, followed by follow-up observation of outcomes
Strengths
Cohort studies allow us to study rare exposures and multiple outcomes. The
evidence from a cohort study is considered stronger evidence than that from a
case-control study.
Drawbacks
Cohort studies take a long time because we have to follow up with the group over
time. They are also expensive.
In a randomized controlled trial, we randomly assign participants to either an
intervention group or a control group. Then we follow up with these two groups
and analyze the data to see if there’s any difference between the two groups
over time.
Let’s imagine, for example, that we’ve noticed that with an increase in shark
attacks, we see a corresponding increase in the sale of ice cream. We must
assume that there’s an alternative explanation, or confounding variable, for this
relationship. In this case, it would be hot weather: as the temperature increases,
people swim in the sea more, and so they’re more likely to get bitten by sharks.
At the same time, as the temperature goes up, people buy more ice cream to
cool down.
Based on this connection, it’s easy for us to doubt a causal relationship between
shark attacks and eating ice cream. But many relationships are not as obvious,
and we could erroneously assume that there’s a causal relationship and miss one
or more confounding variables.
Randomization deals with this problem: when participants are randomly assigned,
their characteristics, known and unknown, are distributed evenly between the
groups. Therefore, when you compare the two groups, the effect of that
confounding variable is nullified, and confounding variables that we haven’t
even considered are controlled for automatically.
Risk
Imagine we’re following a group of 50 people for a set period of time. At the end
of that time, we check to see how many of them have developed a disease that
we’re interested in. In our example, 17 participants got sick.
Fifty participants: red = sick, green = not sick
To calculate what proportion got sick and the risk of getting sick, we divide 17 by
50, which equals 0.34. So, the risk of getting sick in this cohort was 0.34. In other
words, each person had a 34% chance of getting sick, all else being equal.
17 ÷ 50 = 0.34
Relative risk
Of course, we know that not everyone’s risk is the same. Some people have
characteristics that inherently increase their risk of certain illnesses. Others
might have been exposed to something that increases their risk. In our example,
the circles represent people who smoke, and we think that they might be at
increased risk of the disease that we’re studying.
Let’s calculate the risk in smokers and nonsmokers separately to see if there is
a difference between them. If there is a difference, we will try to identify how
big it is. In our example, there are 25 smokers and 25 nonsmokers.
1. Exposure increases risk
If the risk in the exposed group is higher than in the unexposed group (e.g.,
a risk of 0.48 in smokers versus 0.20 in nonsmokers), the risk ratio will be
greater than one.
2. No effect
If, however, there is no difference in the risk, then exposure doesn’t make any
difference. Here, the risk ratio will equal one.
3. Exposure is protective
If the exposure of interest is protective against the particular outcome (e.g.,
the use of sunscreen protects against skin cancer), then the risk ratio will be
less than one.
Excess risk
Excess risk describes how much of someone’s risk can be attributed to a
particular exposure. This is also referred to as attributable risk or risk difference.
To calculate excess risk, we look at the difference between the risks in each
group. If we subtract 0.2 from 0.48, we get 0.28.
Imagine that you are a nonsmoker and your risk of disease is 20%. Then you
start smoking and your risk goes up to 48%. Smoking hasn’t caused all of that
risk; your baseline risk was already 20%. The excess risk attributable to
smoking is the difference, 28%.
Let’s apply this to an example of smoking during pregnancy and low birth weight.
In the group of nonsmokers, 53 babies had a low birth weight and 798 were
of normal weight. So, the risk of low birth weight babies among nonsmokers
was 0.063 or 6.3%. In the group of smokers, 19 babies had a low birth weight
and 139 were of normal weight. The risk of low birth weight babies among
smokers was 0.12 or 12%.
We can also calculate the excess risk by subtracting one risk from the other:
12% – 6.3% = 5.7%. In this case, if you smoke during pregnancy, your risk of
delivering a low birth weight baby increases by 5.7 percentage points.
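Here’s a minimal Python sketch of all three calculations, using the birth weight numbers from this example.

```python
# Risk, relative risk, and excess risk from the birth weight example.
def risk(cases, total):
    return cases / total

risk_nonsmokers = risk(53, 53 + 798)   # about 0.063, or 6.3%
risk_smokers = risk(19, 19 + 139)      # about 0.120, or 12%

relative_risk = risk_smokers / risk_nonsmokers
excess_risk = risk_smokers - risk_nonsmokers   # risk attributable to smoking

print(f"Risk (nonsmokers): {risk_nonsmokers:.3f}")
print(f"Risk (smokers):    {risk_smokers:.3f}")
print(f"Relative risk:     {relative_risk:.2f}")
print(f"Excess risk:       {excess_risk:.3f}")
```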
Common research methods and
statistical tests
Research methods

Case-control studies (observational)
• Ideal for: multiple exposures, rare outcomes
• Direction: retrospective
• Model: select an outcome of interest, then study past exposures

Cohort studies (observational)
• Ideal for: rare exposures, multiple outcomes
• Direction: retrospective or prospective
• Model: select a previous or current exposure of interest, then study current
or future outcomes

Ecological studies (observational)
• Ideal for: comparing populations or communities, or when individual data are
unavailable
• Direction: retrospective or prospective
• Model: use aggregates of individual-level data to measure exposures and
outcomes

Randomized controlled trials (interventional)
• Ideal for: testing treatments
• Direction: prospective
• Model: randomly assign individuals to an intervention or control group, then
study future outcomes
Statistical tests
Comparison tests
Comparison tests ask, are these different? Parametric tests assume our
data have a normal distribution. Non-parametric tests do not assume a
normal distribution.
Tests of association
Tests of association ask, did this change because of that?
Tests of association
• Two numeric variables (e.g., height, weight): Pearson’s correlation
coefficient (parametric) or Spearman correlation coefficient (non-parametric)
• Two categorical variables (e.g., gender, age group): chi-squared test
(non-parametric)
Calculating risk
Risk
• Calculation: (people with the outcome) ÷ (total in the group)
• Example: 17 sick ÷ 50 in total = 0.34
• Interpretation: each person had a 34% chance of getting sick

Relative risk
• Calculation: (risk of those exposed) ÷ (risk of those not exposed)
• Example: 0.48 risk in smokers ÷ 0.20 risk in nonsmokers = 2.4
• Interpretation: smokers are 2.4 times more likely to get this disease than
nonsmokers

Excess risk
• Calculation: (risk among those exposed) – (risk among those not exposed)
• Example: 0.48 risk in smokers – 0.20 risk in nonsmokers = 0.28
• Interpretation: the excess risk from smoking is 28%
Choi YG. 2013. Clinical statistics: Five key statistical concepts for clinicians. J Korean
Assoc Oral Maxillofac Surg. 39: 203–206. PMID: 24471046
Creswell, JW and Creswell, JD. 2018. Research Design: Qualitative, Quantitative, and
Mixed Methods Approaches. 5th edition. Los Angeles, CA: SAGE Publications, Inc.
Rossi, RJ. 2022. Applied Biostatistics for the Health Sciences. 2nd edition. Hoboken,
NJ: Wiley.
Gapminder. 2023. Free data from World Bank via gapminder.org. www.gapminder.org.
Accessed March 1, 2023.
Munnangi, S and Boktor, SW. 2022. Epidemiology of Study Design. In: StatPearls
[Internet]. Treasure Island (FL): StatPearls Publishing. PMID: 29262004
Ratelle, JT, Sawatsky, AP, and Beckman, TJ. 2019. Quantitative research methods in
medical education. Anesthesiology. 131: 23–35. PMID: 31045900
Sawatsky, AP, Ratelle, JT, and Beckman, TJ. 2019. Qualitative research methods in
medical education. Anesthesiology. 131: 14–22. PMID: 31045898