STATISTICS
HANDBOOK
Greg Martin
MBA, MPH, MBBCh, MFPHMI
Table of contents
Abbreviation list
Appendix
    Common research methods and statistical tests
    Risk
Reference list
GETTING ACQUAINTED
WITH DATA
Describing data
In this lesson, we are going to talk about the first step to good statistical
analysis. Specifically, we’ll explore how to describe, summarize, and visualize
data. We will also talk about producing tables, plots, and graphs for different
types of variables.
To explain what observations are, let’s talk about an example. James has certain
characteristics that we’re interested in—name, gender, age, weight, and height.
He’s a 27-year-old male who weighs 75.1 kg and is categorized as short. Each of
these pieces of information about James is an observation. In our spreadsheet, this
information is stored as data under the appropriate column headings or variable
headings. All of our observations make up a data set.
Name Gender Age Weight (kg) Height
James Male 27 75.1 Short
Barra Male 32 98.3 Short
Sarah Female 34 63.5 Medium
Bill Male 23 87.2 Tall
Categorical data
There are two types of categorical variables:
1. Nominal categorical variables, which have no inherent order (e.g., gender)
2. Ordinal categorical variables, which have a natural order (e.g., height
categorized as short, medium, or tall)
Think of these categories as buckets into which the values from any other variable
can be placed and then compared. For example, we could compare the average
weight or the average age of males and females in this group.
Numerical data
Numerical data consists of numbers that can be placed on a number
line. However, these numeric variables can fall on that number line in two
different ways:
1. Discrete variables
Age is a typical discrete variable. Each observation falls definitively on a
value on the integer number line (e.g., 32, 33, 34).
2. Continuous variables
Weight can fall on any number on the line, including fractions between two
integers. For example, Barra weighs 98.3 kg.
Describing variables
A trick to understanding how your numeric variable values are distributed along
the number line is to imagine them sitting on the number line. When there is
more than one observation for a particular number, they get stacked upon each
other. This turns into an interesting shape, which we call a distribution.
Let’s look at the most useful ways that we can describe the data in a distribution.
Range
Firstly, a distribution will have minimum and maximum values. The distance
between these two values is the range, which tells us the spread of the data.
If we divide all the observations into four equal groups, each of those groups will
contain a quarter of all the observations. The two middle quarters will be called the
interquartile range, which is another way to describe how the data are spread out.
Interquartile range
When the distribution, or the shape of the data, is symmetrical, the values of the
mean, median, and mode will be the same.
However, if the distribution of values has more observations toward one end of
the number line, it will appear to have a long tail to one side.
If this long tail occurs on the left side, the distribution is called left-skewed.
In this case, the mean is disproportionately affected by the outliers and extreme
values and is pulled too far to the left. Similarly, if the tail is to the right,
the distribution is right-skewed, and the mean is pulled too far to the right.
Right-skewed distribution
Therefore, the value of the mean is not a good measure of centrality. Clearly, when
the distribution is skewed, the median is a more robust measure of centrality.
Mean, median, and mode tell us where to find the middle of the
data. Range, interquartile range, and standard deviation describe
how spread out the data are.
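To make these measures concrete, here’s a minimal Python sketch (not part of the original handbook) using the standard statistics module; the weights are invented.

```python
# A minimal sketch of the measures of centrality and spread; weights invented.
import statistics

weights = [63.5, 75.1, 80.2, 87.2, 98.3, 75.1, 70.0, 75.1]

print("Mean:  ", statistics.mean(weights))
print("Median:", statistics.median(weights))
print("Mode:  ", statistics.mode(weights))
print("Range: ", max(weights) - min(weights))

# Quartiles: statistics.quantiles() returns the three cut points that split
# the data into four equal groups.
q1, q2, q3 = statistics.quantiles(weights, n=4)
print("Interquartile range:", q3 - q1)
print("Standard deviation: ", statistics.stdev(weights))
```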
Histograms
We can visualize the distribution of a numeric data set by imagining buckets that
represent different intervals along the x-axis. The buckets can be of any chosen
size (e.g., 0–10, 10–20, 20–30, and so on). By counting how many observations
fall into each of those buckets, we can create a histogram.
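Here’s a short sketch of that bucket-counting idea using NumPy; the ages and bucket edges are invented.

```python
# Count how many observations fall into each bucket along the x-axis.
import numpy as np

ages = [23, 27, 32, 34, 8, 15, 41, 56, 62, 38]
counts, edges = np.histogram(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70])

# Print a simple text histogram: one '#' per observation in each bucket
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:>2.0f}-{hi:<3.0f}: {'#' * n}")
```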
Another useful way to visualize a distribution is a box plot. Let’s talk about
each of its components. The interquartile range is depicted by the box itself
and will contain 50% of the data. The median, or the value that splits all of
the data into two separate groups, is represented by the line in the middle of
the box.
The lines extending from the box are called whiskers. They extend out to 1.5 times
the interquartile range. Any values outside of the whiskers are called outliers.
For example, there are four people that have been categorized as short, two are
medium height, and two are tall, for a total of eight observations. This can be
recorded in a frequency table.
Height    Frequency
Short     4
Medium    2
Tall      2
Total     8
Relative frequency
We can determine the relative frequency, or proportion of the total, for each
category. In this case, the relative frequency of short individuals is found by
dividing the number of short people by the total number of observations, which
gives us a value of 0.5, or a half.
We can repeat this process for all the categories in the data set.
Height    Relative frequency
Short     0.50
Medium    0.25
Tall      0.25
We can calculate the relative frequency or the percentage, which can be recorded
in the frequency table in brackets next to the frequency value. For example, in
the male column, three males out of a total of five males are short, or 60% of the
males are short.
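A minimal pandas sketch of these frequency calculations; the eight-person data set below is modeled on the example but partly invented.

```python
# Frequency, relative frequency, and a cross-tabulation by gender.
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Male", "Male",
               "Female", "Female", "Female"],
    "height": ["Short", "Short", "Short", "Tall", "Medium",
               "Short", "Medium", "Tall"],
})

# Frequency and relative frequency of each height category
print(df["height"].value_counts())
print(df["height"].value_counts(normalize=True))

# Height frequencies within each gender, as percentages
# (e.g., 3 of the 5 males are short, or 60%)
print(pd.crosstab(df["height"], df["gender"], normalize="columns") * 100)
```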
One option is to create a scatter plot. A scatter plot is where each point
corresponds to the x and the y coordinates of a given observation, or in the case
of our data set, a person. You can also add a trendline to this plot.
For example, we’ve got Sarah who is 34 years of age and weighs 63.5 kg. Her
information can be represented in the following scatter plot.
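Here’s a minimal sketch of a scatter plot with a trendline using matplotlib and NumPy; apart from Sarah’s values, the points are invented.

```python
# Scatter plot of age vs. weight with a fitted trendline.
import numpy as np
import matplotlib.pyplot as plt

age = np.array([27, 32, 34, 23, 45, 51, 38, 29])
weight = np.array([75.1, 98.3, 63.5, 87.2, 82.0, 90.5, 78.3, 70.2])

plt.scatter(age, weight)

# Trendline: fit a straight line (degree-1 polynomial) to the points
slope, intercept = np.polyfit(age, weight, 1)
plt.plot(age, slope * age + intercept)

plt.xlabel("Age (years)")
plt.ylabel("Weight (kg)")
plt.show()
```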
Age is not affected by a person’s weight, so we treat age as an independent
variable and, by convention, put it on the x-axis. By contrast, weight might be
dependent on, or affected by, age, so there might be some sort of causative
relationship between your age and your weight. Therefore, weight is a dependent
variable. By convention, we put dependent variables on the y-axis.
Using this plot, we can see the difference in weight between males and females
in tall people, medium-height people, and in short people.
Consider a simple data set of the average (fictitious) earnings of males and
females. In this data set, there are no missing values. For males, the average
earning is $5000 a month. For females, the average earning is $4600.
Name Gender Earnings ($)
Barra Male 5000
Peter Male 7000
James Male 3000
Andrew Male 7000
Philip Male 3000
Mary Female 5000
Jane Female 7000
Chloe Female 3000
Wendy Female 7000
Colette Female 1000
The reason for this is that the missing data aren’t randomly distributed. There’s a
pattern to it, but the pattern isn’t obvious. You may need to look at the distribution
of your missing data in relation to another variable to see the pattern.
And in this example, the data had only been collected from people living in cities,
and the absence of rural data represented a systematic bias in the data set.
Name     Gender   Earnings ($)   Home
Barra    Male     5000           Urban
Peter    Male     NA             Rural
James    Male     3000           Urban
Andrew   Male     NA             Rural
Philip   Male     3000           Urban

Average monthly earnings = 3600
2. Delete just the rows where data are missing for a specific variable.
5. Replace missing values with our best guess as to what each value should be.
This is called imputation.
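Here’s a minimal pandas sketch of options 2 and 5, using the earnings table above; the mean-imputation strategy shown is just one possible best guess.

```python
# Two simple strategies for handling missing data.
import pandas as pd

df = pd.DataFrame({
    "name": ["Barra", "Peter", "James", "Andrew", "Philip"],
    "earnings": [5000, None, 3000, None, 3000],
})

# Option 2: delete just the rows where a specific variable is missing
complete = df.dropna(subset=["earnings"])

# Option 5: imputation, i.e., replace missing values with a best guess
# (here, the mean of the observed values)
imputed = df.fillna({"earnings": df["earnings"].mean()})

print(complete)
print(imputed)
```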
PERFORMING
STATISTICAL TESTS
Testing hypotheses
Generate a hypothesis
A hypothesis is a testable theory; something that can be falsified.
To check if we are correct about the average weight of adult males in Ireland, we
can’t weigh every male in Ireland. So, instead of weighing all of them, we look at a
small group of adult males, a sample. The sample needs to be random so that it
more closely represents the wider population. By looking at the characteristics of
that sample, we can make an inference about the population as a whole.
Let’s say that people have always believed that adult males in Ireland have an
average weight of 65 kg. That’s the null hypothesis, H0. However, we believe that is
incorrect and that the data will show that it’s incorrect. We’d like to argue that the
average weight is not equal to 65 kg; this is our alternative hypothesis, H1.
We want to use the data to show that one of them is likely to be false. And because
they are mutually exclusive, if we show that one is false, we can be confident that
the other is likely to be true.
So, our actual hypothesis is the alternative hypothesis, H1. But instead of testing
H1 directly, we test the opposite, the null hypothesis, H0. If we find that H0 is likely
incorrect, then we can infer that H1 is likely correct.
If that were true, then when we weighed a sample of adult males, the average
wouldn’t necessarily be exactly 65 kg, but it would likely be close to 65 kg.
For example, our first sample might have an average weight of just less than 65
kg. The second time around, the average weight might be just over 65 kg. The
third time, it might be exactly 65 kg. And we continue to repeat this and most of
our samples would be in and around 65 kg.
Remember, our assumption at this point is that the null hypothesis is correct and
that the average weight of adult males in Ireland is 65 kg. If that were true, most
of our samples would be in and around 65 kg, as shown below. A few of them by
virtue of absolute chance would be a little bit further away. Some of them might
be 55 kg, some of them 75 kg, very few of the samples through absolute random
chance may be 50, 70, or 80 kg, and very few might be 90 kg. But we’d expect
most to be in the middle. This is what we call a distribution.
Distribution of sample means if our null hypothesis is true, centered on 65 kg
In the next illustration, we’ve added a line above all of our data that shows the
shape of our distribution. Notice that our distribution for the null hypothesis
resembles the shape of a hill. Underneath the line of the hill, we can find all of the
samples, 100%. A small percentage, 2.5%, would fall in the small corners, or tails,
at either side. This shape is referred to as a normal distribution.
Shape of a normal distribution resembles a hill, with 100% of samples under the curve and its peak at 65 kg
If the null hypothesis is true, we expect our sample to be somewhere in the big
bundle in the middle. The likelihood of the sample landing within the tails is so
low that if our sample has an average within the tail, say 90 kg, we would lose
confidence in our null hypothesis. Indeed, we may have so little confidence that
we reject the null hypothesis, H0, and decide that the claim that the average
weight of adult males in Ireland is 65 kg is incorrect.
We would then accept the alternative hypothesis, H1, and decide that the average
weight of adult males in Ireland is not 65 kg.
The significance level is the line in the sand or threshold where we decide it’s too
unlikely. Anything beyond the threshold is close to the edge and so unlikely that
we will reject our null hypothesis. We call this threshold alpha (α).
Above, we’ve chosen a threshold of 2.5% (or 0.025) on each side. Since we have
two tails, together that makes our α 5% (or 0.05).
α = 5%
Now, we often use 5% (or 0.05) as the α or the significance level, but don’t use
it automatically. Instead, we should think about what we’re measuring and the
consequences, and then decide on an appropriate α. In other words, if it’s a
life-or-death matter, we may want to choose a smaller, stricter α.
On the other hand, our alternative hypothesis, H1, could have been
that the average weight was more than 65 kg. For this to be true,
it could only be above 65 kg. So, only one side of the distribution
would be relevant, and only one tail would need to be considered.
We would say that 75 kg ends up in a part of the curve where there’s a low
probability of getting a sample that’s far away from the middle. This orange-
shaded area under the curve is the probability.
Two-tailed thresholds of 2.5% at each end of the distribution, with sample means of 55, 65, and 75 kg marked
The P value is calculated assuming that the null hypothesis is true. Looking at
the distribution of samples that we’d expect, what is the probability that we
would get a random sample from the population that’s that far away from the
middle, or further?
If the P value, in this case, was 3% (or 0.03) and the α was 5% (or 0.05), that
would be sufficient evidence for us to reject the null hypothesis and accept the
alternative hypothesis. We’re confident that the average weight of adult males in
Ireland is not 65 kg.
On the other hand, if the P value was greater than α, then we’ve failed to reject
the null hypothesis. We don’t have sufficient evidence to accept the alternative
hypothesis.
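Here’s a minimal sketch of this kind of test using SciPy; the sample weights are invented, and the presumed population mean is the 65 kg from our null hypothesis.

```python
# One-sample t-test: H0 is that the population mean is 65 kg;
# H1 is that it is not 65 kg (two-tailed).
from scipy import stats

sample_weights = [72.1, 68.4, 75.0, 70.2, 66.8, 74.5, 69.9, 71.3]

result = stats.ttest_1samp(sample_weights, popmean=65)
print("t statistic:", result.statistic)
print("P value:    ", result.pvalue)

alpha = 0.05
if result.pvalue < alpha:
    print("Reject H0: the average weight is unlikely to be 65 kg.")
else:
    print("Fail to reject H0.")
```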
We’re asking the questions, can we make an inference about the wider population,
and is this representative of the truth or of the population from which the sample
was taken? In other words, is this difference statistically significant?
1. One-sample
To compare the mean of a single sample with a presumed population mean
2. Two-tailed
To compare the means of two different variables, asking whether they
are different
3. One-tailed
To compare the means of two different variables, asking whether there is a
difference in a particular direction
4. Paired
To compare the means of matched observations (can be one-tailed or
two-tailed)
We calculate the P value. If the P value is less than our threshold, such
as 0.05 (or 5%), then we can reject the null hypothesis that the presumed
average is accurate. We accept that the sample mean we’ve calculated is
statistically significant.
Two-tailed t-test
Two-tailed and one-tailed t-tests both compare two populations. We have
two options for how to do this. First, we will go through a two-tailed approach,
asking, are these two means different?
Let’s say we are comparing the life expectancy in Ireland and Switzerland. We
don’t know in which direction the difference might be. Is the difference that
we’re seeing statistically significant? Would we expect to see a difference of this
magnitude by chance?
Our null hypothesis is that the two countries have the same life expectancy.
Then, if that were correct, we test the probability that we would find samples
with means of 73 years in Ireland and 75.6 years in Switzerland by chance. If the
probability is less than our threshold (e.g., 0.05), we’d say that these means are
statistically significantly different. We would then reject the null hypothesis or
notion that these countries have the same life expectancy and accept that these
life expectancies are statistically significantly different.
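A minimal SciPy sketch of the two-tailed comparison; the life expectancy samples are invented, chosen only to roughly match the means in the example.

```python
# Two-tailed two-sample t-test: are these two means different?
from scipy import stats

ireland = [72.5, 73.8, 71.9, 74.0, 72.3, 73.5]
switzerland = [75.1, 76.2, 75.8, 74.9, 76.5, 75.4]

# ttest_ind is two-tailed by default: H1 is simply that the means differ
result = stats.ttest_ind(ireland, switzerland)
print("P value:", result.pvalue)
```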
One-tailed t-test
Now let’s look at life expectancy in Africa and in Europe. We could approach
the same problem in a slightly different way. We could say, we know that
Africa has a lower life expectancy than Europe, but we’re asking if a difference
of this magnitude is statistically significant. In this scenario, we would use a
one-tailed t-test.
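The same test becomes one-tailed by passing SciPy’s alternative argument (available in SciPy 1.6 and later); again, the samples are invented.

```python
# One-tailed two-sample t-test: H1 is that the first group's mean is lower.
from scipy import stats

africa = [52.1, 54.3, 50.8, 56.2, 53.7, 51.9]
europe = [74.5, 76.1, 73.8, 75.2, 77.0, 74.9]

result = stats.ttest_ind(africa, europe, alternative="less")
print("P value:", result.pvalue)
```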
Paired t-test
For a paired t-test, we’re talking about there being corresponding or paired data
for each observation.
Let’s say we know the life expectancy in Africa in 1957 and 2007. For each
sample or observation in 1957, there is a counterpart in 2007. For example, we’d
have the life expectancy in South Africa for both 1957 and 2007 and the life
expectancy in Malawi for both 1957 and 2007, and so on. There are pairs of data
with a counterpart at each time point.
If the P value was less than 5%, then we would reject the null hypothesis and
accept that the means were significantly different.
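A minimal SciPy sketch of the paired comparison; the per-country values are invented.

```python
# Paired t-test: each 1957 value has a matching 2007 value for the same country.
from scipy import stats

le_1957 = [41.0, 37.6, 44.2, 39.5, 42.8]  # invented per-country values, 1957
le_2007 = [49.3, 48.3, 56.7, 52.5, 59.4]  # the same countries, 2007

result = stats.ttest_rel(le_1957, le_2007)
print("P value:", result.pvalue)
```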
One-tailed
With a one-tailed paired t-test, our hypotheses could be as follows: H0, the mean difference in life expectancy between 1957 and 2007 is zero; H1, life expectancy in Africa was lower in 1957 than in 2007.
If the P value was less than 5%, then we would reject the null hypothesis and
accept that the life expectancy in Africa was significantly lower in 1957.
ANOVA
ANOVA (analysis of variance) lets us compare the means of more than two
populations at once. The null hypothesis is still the same, which is that there are no differences in the
means of these populations. The alternative hypothesis is that there is a
difference and that the difference we’re seeing is statistically significant.
The boxplots and density plots below illustrate the differences between the
means in three populations (Europe, the Americas, and Asia). These data are
from the Gapminder data set.
The data are showing a difference in the means across these different
populations. But is that difference real or statistically significant?
We’d like to know more, though. The ANOVA hasn’t told us which means differ, so
we need to do more analysis.
In the table below, we’ve run our model through what’s called a Tukey multiple
comparison of means. It takes each pair (Asia and the Americas, Europe and the
Americas, and Europe and Asia) and looks at them individually.
We learn two things from this analysis: the 95% confidence intervals and the
adjusted P values.
Comparison Difference Confidence interval Adjusted P value
Asia and Americas –2.9 –6.5 to 0.72 0.14
Europe and Americas 4.0 0.36 to 7.7 0.028
Europe and Asia 6.9 3.5 to 10 0.000019
Confidence intervals
Let’s start by interpreting the differences and confidence intervals. We are
looking to see if the confidence interval includes or crosses zero. For Asia and
the Americas, the confidence interval (–6.5 to 0.72) includes zero, meaning no
difference at all is plausible, so that comparison is not statistically significant.
Adjusted P values
We call these P values adjusted to differentiate them from the P values we would
have gotten if we had just compared two of these means using a t-test. For Asia
and the Americas, the adjusted P value (0.14) is greater than 0.05, so that
difference is not statistically significant.
But in the other two comparisons, where the confidence interval does not include
the possibility of no difference, the P values are less than 0.05. So, the difference
between Europe and the Americas and the difference between Europe and Asia
are considered statistically significant.
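Here’s a minimal SciPy sketch of an ANOVA followed by a Tukey comparison; the samples below are invented stand-ins for the Gapminder data, and tukey_hsd requires SciPy 1.8 or later.

```python
# One-way ANOVA, then Tukey's test to see which pairs of means differ.
from scipy import stats

americas = [72.9, 75.5, 71.2, 73.8, 70.4, 74.6]
asia = [70.7, 72.4, 65.5, 74.8, 66.8, 71.9]
europe = [77.6, 79.4, 80.7, 78.3, 76.5, 79.8]

# ANOVA: is there a difference somewhere among these means?
anova = stats.f_oneway(americas, asia, europe)
print("ANOVA P value:", anova.pvalue)

# Tukey's multiple comparison: which pairs differ, with adjusted P values
# and confidence intervals for each pairwise difference
tukey = stats.tukey_hsd(americas, asia, europe)
print(tukey)
```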
We can test whether or not there is a difference in the proportions across the
different categories. Does the distribution differ from our expectations?
Let’s say we have some flowers, irises to be specific, and we’ve categorized them
as small, medium, and large. In the first instance, we could ask the question, are
the proportions of flowers that are small, medium, and large the same? Do we
expect to see the same number of small, medium, and large flowers in a random
sample that we take from a population?
Proportion of flowers
Chi-squared goodness of fit test
Again, we use the P value to find out the probability of a difference occurring by
chance if the null hypothesis were true. Let’s use a threshold (α) of 0.05 again.
To test the hypothesis, we make a table, perform the chi-squared test, and we get
a P value. If that P value is less than the threshold, we reject the null hypothesis
and accept the alternative hypothesis, inferring that this difference we’re seeing in
the data is statistically significant.
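A minimal SciPy sketch of the goodness of fit test; the flower counts are invented.

```python
# Chi-squared goodness of fit: H0 is that the three sizes occur in equal
# proportions (the observed and expected totals must match).
from scipy import stats

observed = [60, 45, 45]   # counts of small, medium, and large flowers
expected = [50, 50, 50]   # equal proportions of the 150 flowers

result = stats.chisquare(observed, f_exp=expected)
print("P value:", result.pvalue)
```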
We’re asking the question, are the proportions of species dependent on, or
independent of, the size of the flowers? Does knowing the size of a flower tell
us anything about the probability that it belongs to a particular species?
Looking at the graph below, it seems that it does. But we need to check that
statistically.
Counts of each species (setosa, versicolor, and virginica) within each size category (small, medium, and large)
If the P value is very small, below our threshold, we reject the notion that
the proportions are the same and accept that the difference or relationship
we’re seeing is statistically significant.
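A minimal SciPy sketch of the test of independence; the contingency table of counts is invented.

```python
# Chi-squared test of independence on a size-by-species contingency table
# (rows: small, medium, large; columns: setosa, versicolor, virginica).
from scipy import stats

table = [
    [40, 6, 4],    # small
    [8, 30, 12],   # medium
    [2, 14, 34],   # large
]

chi2, p, dof, expected = stats.chi2_contingency(table)
print("P value:", p)
```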
Let’s say that we plot the speed of a car along the x-axis and the distance that it
takes for the car to stop along the y-axis.
Here’s our question, is it correct that the faster a car travels, the longer it’ll take
to stop? Intuitively, we all agree that yes, that’s probably the case. But let’s see
what our data show.
We suspect that the distance to stop may depend on the speed of the car, so we
refer to the distance to stop as our dependent variable. The speed of the car is
our independent variable.
On the x-axis, we plot the independent variable, or in this case the speed of the
car. On the y-axis, we plot the dependent variable, or in this case, the distance
taken to stop.
The model below will let us check whether the value of y depends on the value
of x. Each data point is represented by a red dot.
Stopping distance versus speed of car, with fitted regression line
Slope
The slope in the example on the previous page is 3.9. The model suggests that
for every movement of one unit along the x-axis, there is a movement of 3.9
units along the y-axis; that’s what a slope of 3.9 means. In other words, for
every increase in speed of one mile per hour, the distance required to stop the
car increases by 3.9 feet.
We need the y-intercept for the model because we can’t draw the line without
a slope and a y-intercept. However, as we interpret what the data mean, we
very often do not need the y-intercept and it may not have any meaningful
interpretation.
For example, the distance to stop can’t be –17 feet. The y-intercept is a theoretical
value indicating what y would be if the speed of the car were zero. In our example,
the y-intercept is not important.
There are some potential scenarios when we might be asking, what would y be
when x is zero? If this is our question, we’ve got a y-intercept and even a P value
to go with it.
P value
We’ve also got a P value for that slope, which lets us test a hypothesis. The
null hypothesis here would be that the slope is zero: there is no relationship,
and no real change in y with a change in x.
H0: Slope = 0
H1: Slope > 0
If the null hypothesis were true, what’s the probability that a random sample would
give us a slope of 3.9? It’s certainly possible, but very unlikely. If our threshold is
5% (if the P value is less than 5%), then we reject the null hypothesis and accept
the alternative hypothesis.
In our car example, the P value is extremely small: 1.9 × 10⁻¹². We can reject
the null hypothesis and accept that the slope is not only nonzero but also
statistically significant.
Not all of the change in the distance taken to stop can be explained by the
speed of the car. Multiple variables will contribute to this change. For example, the
kind of road, the driver’s reaction speed, and the time of day may all affect the
distance it takes to stop. We’re very unlikely to ever know all of the variables that
could possibly contribute to the dependent variable. So, this question is always
important to ask, how much of the change in y can we explain by the change
in x?
R-squared answers this question. It is a number between zero and one. In our
example, R-squared is 0.65 or 65%. That means that 65% of the change in the
distance to stop can be explained by a change in the speed of the car. Really
interesting!
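Here’s a minimal SciPy sketch that produces all of these numbers at once; the (speed, distance) points are invented, so the outputs will only be in the spirit of the example.

```python
# Simple linear regression: slope, y-intercept, P value, and R-squared.
from scipy import stats

speed = [4, 7, 9, 12, 15, 18, 20, 24]        # miles per hour
distance = [2, 13, 10, 28, 33, 56, 52, 93]   # feet to stop

fit = stats.linregress(speed, distance)
print("Slope:      ", fit.slope)
print("Y-intercept:", fit.intercept)
print("P value:    ", fit.pvalue)
print("R-squared:  ", fit.rvalue ** 2)
```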
We can check that our data are normally distributed by visually inspecting the
data using a histogram or a Q-Q plot. Or we can apply a Shapiro-Wilk test, which
tests the null hypothesis that the data are normally distributed.
Q-Q plot (expected normal quantiles) and histogram (frequency) used to check normality
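A minimal SciPy sketch of the Shapiro-Wilk test; the heights are invented.

```python
# Shapiro-Wilk test: H0 is that the data come from a normal distribution,
# so a small P value (below alpha) suggests the data are NOT normal.
from scipy import stats

heights = [162.3, 170.1, 168.4, 175.2, 160.8,
           173.5, 166.7, 171.9, 169.0, 164.2]

stat, p = stats.shapiro(heights)
print("P value:", p)
```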
Comparison tests
Comparison tests answer the question, are these different?
Comparison tests
• Two independent groups (one numeric variable, one categorical variable; e.g.,
height in males vs. females): unpaired t-test (parametric) or Mann-Whitney test
(non-parametric)
• More than two independent groups (one numeric variable, one categorical
variable with more than two categories; e.g., height in children, adults, and
older adults): one-way ANOVA (parametric) or Kruskal-Wallis test (non-parametric)
• Differences between paired samples (one numeric variable at two time points,
paired; e.g., weight before / after a diet): paired t-test (parametric) or
Wilcoxon signed-rank test (non-parametric)
If the data are normally distributed, we would usually use a t-test to compare
averages. The t-test compares the average of two independent groups and
assumes that the height variable is normally distributed.
If the data aren’t normally distributed, then we’d use a non-parametric test, like the
Mann-Whitney test to compare two averages.
If we are comparing more than two independent groups and the data aren’t
normally distributed, then the non-parametric test that we’d use would be the
Kruskal-Wallis test.
Paired samples
Let’s talk about paired data. Imagine our sample is a group of people who’ve
recently been on a particular diet. We want to know, did the diet produce a
significant change in weight? These data are paired. In other words, for each
weight-before data point, there’s a corresponding weight-after data point that
relates to the same person. So, the two weight variables are not independent of
each other.
Because the data are paired, we don’t just compare the average weight before with
the average weight after. Rather, we’re interested in the average of the differences
between the two weight variables. We’re asking, is that average significantly
different from zero?
If the data are normally distributed, then we would use the paired t-test, which
considers the average difference between paired samples.
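Here’s a minimal SciPy sketch of the three non-parametric tests from the table above; all of the data are invented.

```python
# Non-parametric comparison tests: no normality assumed.
from scipy import stats

# Mann-Whitney: two independent groups (e.g., height in males vs. females)
males = [178.2, 165.4, 181.0, 172.5, 169.8]
females = [158.1, 166.3, 161.7, 170.2, 155.9]
print("Mann-Whitney P:", stats.mannwhitneyu(males, females).pvalue)

# Kruskal-Wallis: more than two independent groups
children = [120.5, 131.2, 125.8, 118.4]
adults = [172.3, 168.9, 175.6, 170.1]
older_adults = [165.2, 161.8, 168.4, 163.7]
print("Kruskal-Wallis P:", stats.kruskal(children, adults, older_adults).pvalue)

# Wilcoxon signed-rank: paired samples (e.g., weight before / after a diet)
before = [85.2, 92.1, 78.4, 88.8, 95.5, 81.3]
after = [82.9, 89.4, 77.8, 85.1, 91.2, 80.6]
print("Wilcoxon P:", stats.wilcoxon(before, after).pvalue)
```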
Tests of association
Tests of association ask the question, did this change because of that? The table
below summarizes the parametric and non-parametric tests of association.
Tests of association
• Two numeric variables (e.g., height, weight): Pearson’s correlation
coefficient (parametric) or Spearman correlation coefficient (non-parametric)
• Two categorical variables (e.g., gender, age group): chi-squared test
(non-parametric)
Here, we use the chi-squared test, which has no assumption of normality. The chi-
squared test is a non-parametric test.
We would arrange these data in a spreadsheet with the various attributes (or
variables) in columns. These variables will be the objects of our inquiry. In
our data set, we’ve got two categorical variables (sex and age group) and two
numeric variables (height and weight).
The table below summarizes the five most important combinations of data.
Common combinations
What we observe in our sample data    Is it real?
One categorical variable              One sample proportion test
For each of these scenarios, we need to move through the following steps:
1. Define our question and hypotheses
2. Choose an alpha value
3. Analyze the data
When we collect our sample data, we can see that yes, the proportions do change
across the age groups. Is this due to chance? We test the idea that the proportions
are all the same—that’s our null hypothesis. Here, we conduct a chi-squared test.
If the P value is less than the alpha, we reject the null hypothesis and state that
our observation is statistically significant.
We collect some sample data, we find that the average height is indeed different
from the historic height. Is it statistically significant? Well, if there were no
difference, what would the chances be that we observed the difference that we do
(or a greater difference)? So, we conduct a t-test comparing the averages, and if
the P value is less than the alpha, then we can reject the null hypothesis and state
that the observed difference is statistically significant.
In our sample, we do observe a difference. Since our categorical variable has two
independent groups, we conduct a t-test, which gives us a P value. If the P value
is less than the alpha, we reject the null hypothesis and infer that the observation is
statistically significant.
If instead, we had a categorical variable with more than two categories (e.g., age
group with children, adults, and older adults), then we would perform an ANOVA.
We collect sample data and find some sort of relationship. To check if it is real or
by chance, we assume that it is by chance and that there’s no correlation between
the two variables. Here, we conduct a correlation test to find out two things:
1. Correlation coefficient
A number between negative one and one that represents the relationship
between two numeric variables.
2. P value
If the P value is less than alpha, we can reject the null hypothesis, and infer
that the correlation that we see is statistically significant.
• No correlation
If there is no relationship between the two variables, then the correlation
coefficient will be zero.
• Positive correlation
If as x goes up, y also goes up, we have what’s called a positive correlation
and the coefficient will be greater than zero. If there’s a perfectly positive
correlation, the correlation coefficient will be one.
• Negative correlation
If as x goes up, y goes down, we have a negative correlation and the coefficient
will be less than zero. A perfectly negative correlation gives a coefficient of
negative one.
By the way, it doesn’t matter which of our variables is on the x-axis versus the
y-axis; the correlation coefficient will be the same.
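A minimal SciPy sketch of both correlation tests; the heights and weights are invented.

```python
# Correlation tests: Pearson (parametric) and Spearman (non-parametric).
from scipy import stats

height = [152.4, 160.1, 165.8, 170.2, 175.9, 180.3, 185.0, 168.4]
weight = [54.2, 58.9, 65.1, 68.4, 74.8, 80.2, 85.5, 66.9]

r, p = stats.pearsonr(height, weight)
rho, p_s = stats.spearmanr(height, weight)

print("Pearson r:   ", r, " P value:", p)
print("Spearman rho:", rho, " P value:", p_s)

# Swapping x and y makes no difference to the coefficient
r_swapped, _ = stats.pearsonr(weight, height)
print("Same r with axes swapped:", r_swapped)
```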
ASSESSING RESEARCH
METHODS
Distinguishing between quantitative and
qualitative methods
Social scientists, anthropologists, epidemiologists, and economists use two
types of research methods:
1. Qualitative
2. Quantitative
Qualitative research
Qualitative research aims to unpack and understand the nature of a phenomenon or
the qualities associated with a particular phenomenon, asking questions such as:
• Who?
• What?
• Why?
• When?
• How?
Common data collection methods in qualitative research include the following:
• Interviews
e.g., in-depth interviews or key informant interviews
• Focus groups
• Surveys
Quantitative research
By contrast, quantitative research is interested in magnitude and asks, how much?
• Magnitude of an association
e.g., the relationship between a risk factor and an outcome
1. Interventional
Interventional studies test the effect of an intervention. The classic
example is the randomized controlled trial.
2. Observational
Observational studies are non-interventional. They observe what is
happening in the world. Some examples include the following:
- cross-sectional surveys
- case-control studies
- cohort studies
- ecological studies
Case-control studies
To do a case-control study, we start with outcomes of interest (e.g., a condition
or disease), and then study past exposures.
For example, let’s say that we found a group of people who, at the age of 21,
suddenly found that their hair went green. We collect those people in the group.
Then we create an analogous group that includes people that don’t have that
condition but are in every other way similar to the first group.
Drawbacks
The evidence that you get from a case-control study is not considered particularly
strong evidence.
Cohort studies
A cohort study is the opposite. We start off with an exposure of interest, then
study the outcomes.
It might be a rare or unusual exposure, for example, people that have been to the
North Pole. So, we find that group of people. Then we find an analogous group of
people who are similar to the first group—except they haven’t had the exposure of
interest. The demographics are similar but the second group has not been to the
North Pole.
Now, we look forward in time, or prospectively, and follow up with our groups over
time to see what outcomes emerge as a result of that exposure.
Cohort study design: initial observation of the exposed and unexposed groups, followed by follow-up observation of outcomes
Strengths
Cohort studies allow us to study rare exposures and multiple outcomes. The
evidence from a cohort study is considered stronger evidence than that from a
case-control study.
Drawbacks
Cohort studies take a long time because we have to follow up with the group over
time. They are also expensive.
In a randomized controlled trial, we randomly assign participants to either an
intervention group or a control group. Then we follow up with these two groups
and analyze the data to see if there’s any difference between the two groups
over time.
Let’s imagine, for example, that we’ve noticed that with an increase in shark
attacks, we see a corresponding increase in the sale of ice cream. We must
assume that there’s an alternative explanation, or confounding variable, for this
relationship. In this case, it would be hot weather: as the temperature increases,
people swim in the sea more, and so they’re more likely to get bitten by sharks.
At the same time, as the temperature goes up, people buy more ice cream to
cool down.
Based on this connection, it’s easy for us to doubt a causal relationship between
shark attacks and eating ice cream. But many relationships are not as obvious,
and we could erroneously assume that there’s a causal relationship and miss one
or more confounding variables.
Randomization deals with this problem: when participants are randomly assigned,
their characteristics, known and unknown, are distributed evenly between the
groups. Therefore, when you compare the two groups, the effect of that
confounding variable is nullified, and confounding variables that we haven’t
even considered are controlled for automatically.
Risk
Imagine we’re following a group of 50 people for a set period of time. At the end
of that time, we check to see how many of them have developed a disease that
we’re interested in. In our example, 17 participants got sick.
Fifty participants: red = sick, green = not sick
To calculate what proportion got sick and the risk of getting sick, we divide 17 by
50, which equals 0.34. So, the risk of getting sick in this cohort was 0.34. In other
words, each person had a 34% chance of getting sick, all else being equal.
17 ÷ 50 = 0.34
Relative risk
Of course, we know that not everyone’s risk is the same. Some people have
characteristics that inherently increase their risk of certain illnesses. Others
might have been exposed to something that increases their risk. In our example,
the circles represent people who smoke, and we think that they might be at
increased risk of the disease that we’re studying.
Let’s calculate the risk in smokers and nonsmokers separately to see if there is
a difference between them. If there is a difference, we will try to identify how
big it is. In our example, there are 25 smokers and 25 nonsmokers.
1. Exposure increases risk
If the risk in the exposed group is higher than in the unexposed group (e.g.,
a risk of 0.48 in smokers versus 0.20 in nonsmokers), the risk ratio will be
greater than one.
2. No effect
If, however, there is no difference in the risk, then exposure doesn’t make any
difference. Here, the risk ratio will equal one.
3. Exposure is protective
If the exposure of interest is protective against the particular outcome (e.g.,
the use of sunscreen protects against skin cancer), then the risk ratio will be
less than one.
Excess risk
Excess risk describes how much of someone’s risk can be attributed to a
particular exposure. This is also referred to as attributable risk or risk difference.
To calculate excess risk, we look at the difference between the risks in each
group. If we subtract 0.2 from 0.48, we get 0.28.
Imagine that you are a nonsmoker and your risk of disease is 20%. Then you
start smoking and your risk goes up to 48%. Smoking hasn’t caused all of that
risk; your baseline risk was already 20%. The excess risk attributable to
smoking is the difference, 28%.
Let’s apply this to an example of smoking during pregnancy and low birth weight.
In the group of nonsmokers, 53 babies had a low birth weight and 798 were
of normal weight. So, the risk of low birth weight babies among nonsmokers
was 0.063 or 6.3%. In the group of smokers, 19 babies had a low birth weight
and 139 were of normal weight. The risk of low birth weight babies among
smokers was 0.12 or 12%.
We can also calculate the excess risk by subtracting one risk from the other:
12% – 6.3% = 5.7%. In this case, if you smoke during pregnancy, your risk of
delivering a low birth weight baby increases by 5.7 percentage points.
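Here’s a minimal Python sketch of all three calculations, using the birth weight numbers from this example.

```python
# Risk, relative risk, and excess risk from the birth weight example.
def risk(cases, total):
    return cases / total

risk_nonsmokers = risk(53, 53 + 798)   # about 0.063, or 6.3%
risk_smokers = risk(19, 19 + 139)      # about 0.120, or 12%

relative_risk = risk_smokers / risk_nonsmokers
excess_risk = risk_smokers - risk_nonsmokers   # risk attributable to smoking

print(f"Risk (nonsmokers): {risk_nonsmokers:.3f}")
print(f"Risk (smokers):    {risk_smokers:.3f}")
print(f"Relative risk:     {relative_risk:.2f}")
print(f"Excess risk:       {excess_risk:.3f}")
```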
Common research methods and
statistical tests
Research methods

Case-control studies (observational)
• Ideal for: multiple exposures, rare outcomes
• Direction: retrospective
• Model: select an outcome of interest, then study past exposures

Cohort studies (observational)
• Ideal for: rare exposures, multiple outcomes
• Direction: retrospective or prospective
• Model: select a previous or current exposure of interest, then study current
or future outcomes

Ecological studies (observational)
• Ideal for: comparing populations or communities, or when individual data are
unavailable
• Direction: retrospective or prospective
• Model: use aggregates of individual-level data to measure exposures and
outcomes

Randomized controlled trials (interventional)
• Ideal for: testing treatments
• Direction: prospective
• Model: randomly assign individuals to an intervention or control group, then
study future outcomes
Statistical tests
Comparison tests
Comparison tests ask, are these different? Parametric tests assume our
data have a normal distribution. Non-parametric tests do not assume a
normal distribution.
Tests of association
Tests of association ask, did this change because of that?
Tests of association
• Two numeric variables (e.g., height, weight): Pearson’s correlation
coefficient (parametric) or Spearman correlation coefficient (non-parametric)
• Two categorical variables (e.g., gender, age group): chi-squared test
(non-parametric)
Calculating risk
Risk
• Calculation: (people with the outcome) ÷ (total in the group)
• Example: 17 sick ÷ 50 in total = 0.34
• Interpretation: each person had a 34% chance of getting sick

Relative risk
• Calculation: (risk of those exposed) ÷ (risk of those not exposed)
• Example: 0.48 risk in smokers ÷ 0.20 risk in nonsmokers = 2.4
• Interpretation: smokers are 2.4 times more likely to get this disease than
nonsmokers

Excess risk
• Calculation: (risk among those exposed) – (risk among those not exposed)
• Example: 0.48 risk in smokers – 0.20 risk in nonsmokers = 0.28
• Interpretation: the excess risk from smoking is 28%
Choi YG. 2013. Clinical statistics: Five key statistical concepts for clinicians. J Korean
Assoc Oral Maxillofac Surg. 39: 203–206. PMID: 24471046
Creswell, JW and Creswell, JD. 2018. Research Design: Qualitative, Quantitative, and
Mixed Methods Approaches. 5th edition. Los Angeles, CA: SAGE Publications, Inc.
Rossi, RJ. 2022. Applied Biostatistics for the Health Sciences. 2nd edition. Hoboken,
NJ: Wiley.
Gapminder. 2023. Free data from World Bank via gapminder.org. www.gapminder.org.
Accessed March 1, 2023.
Munnangi, S and Boktor, SW. 2022. Epidemiology of Study Design. In: StatPearls
[Internet]. Treasure Island (FL): StatPearls Publishing. PMID: 29262004
Ratelle, JT, Sawatsky, AP, and Beckman, TJ. 2019. Quantitative research methods in
medical education. Anesthesiology. 131: 23–35. PMID: 31045900
Sawatsky, AP, Ratelle, JT, and Beckman, TJ. 2019. Qualitative research methods in
medical education. Anesthesiology. 131: 14–22. PMID: 31045898