
COURSE NOTES

GOOD PRACTICES OF INDUSTRY

COURSE:
Biostatistics

Publication Date: May 2018


INTRODUCTION
1. Course Information

Transcript
Welcome to the “Biostatistics” course.

Before beginning, please take a moment to read the course details.

Click on the Course Notes button to download a PDF version of the course. The course notes
contain all the text associated with the course, including objectives, transcripts, and progress
checks.

Onscreen Text: Course Details


Course Length: Approximately two hours. Your time may vary based on connection speed,
prerequisite knowledge, and other factors.
Date Published: May 2018

You must reach the end of every lesson in order to receive a Mark Complete status for the
course.

INTRODUCTION
2. Course Overview

Introduction
Using real examples from the medical literature, this course presents an introduction to
clinical research and applied statistics. This foundation will help you read the medical
literature critically.

Transcript
This course is divided into five sections:

• The first section, THE RESEARCH QUESTION, provides the background you need to
understand hypothesis-driven research.

• The second section, STUDY DESIGN, describes the different types of studies used in
medical research, including their strengths and weaknesses.

• The third section, TYPES OF MEASUREMENT, explains the different ways that
variables are measured, and also gives background on diagnostic testing.

• The fourth section, STATISTICAL INFERENCE, teaches basic concepts of statistical
inference, including hypothesis testing, confidence intervals, and sample size and
power.

• The fifth section, STATISTICAL ANALYSES, discusses the appropriate use and
interpretation of common statistical tests used in medical studies.

THE RESEARCH QUESTION
3. Objectives

Introduction
When reading a paper from the medical literature, the tendency is to go straight to the results
and conclusions and forget to ask how and why the study was done. This omission makes it
impossible to understand the study results in their appropriate context. For example, the
results of a study could indicate that a new drug lowers risk of myocardial infarction, but
unless it is noted that this study was done to assess recurrence (and not first occurrence) of
myocardial infarction, it may be falsely concluded that healthy people should start taking the
drug. Therefore, it is important to first determine why the study was done and what question it
addressed (i.e., the research question). This course focuses on studies that are designed to
test specific hypotheses.

Transcript
These are the objectives for this section.

Onscreen Text: Objectives


After you finish this section, you should be able to:

• Describe the empirical cycle


• Differentiate between exploratory and confirmatory studies
• Identify the research hypothesis in a given study
• Identify the predictor and outcome in a given study
• Define study population and study sample

THE RESEARCH QUESTION
4. The Empirical Cycle

Transcript
The empirical cycle of scientific inquiry, which provides a broad outline of hypothesis-driven
research, consists of six phases.[1]

1. In the observation phase, scientists gather ideas from previous studies, case reports,
anecdotal evidence, biological studies, disease surveillance, and exploratory studies.
For example, anecdotal evidence and isolated cases suggested that the combination
of psychotherapy and antidepressants surpasses psychotherapy alone in the
treatment of depression. A literature search on this question will bring up the results
of several previous studies that addressed this topic—with mixed results.

2. In the induction phase, scientists formulate specific research hypotheses. In clinical
research, these are hypotheses about whether specific risk factors and treatments
are causally related to disease. For example, psychotherapy plus antidepressants is a
better treatment for depression than psychotherapy alone.

3. In the deduction phase, scientists spell out testable predictions from their hypothesis.
For example, if the hypothesis is true, then patients treated with the combination
therapy should have lower scores on the Hamilton Rating Scale for Depression than
patients treated with psychotherapy alone.

4. In the data collection phase, scientists run a study to test their hypothesis. For
example, investigators ran a randomized trial comparing improvements on the
Hamilton Rating Scale for Depression between a group treated with combination
therapy and a group treated with psychotherapy alone.[2]

5. In the conclusion phase, scientists evaluate their original hypothesis in light of the
evidence they have collected. For example, there was not sufficient evidence to
suggest a difference between combined therapy and psychotherapy alone.

6. In the publication phase, scientists document and disseminate their results, including
suggesting areas for further research—which feeds back to the first phase,
observation.

[1] de Groot AD. Methodology: Foundations of Inference and Research in the Behavioral Sciences. Mouton & Co, Belgium; 1969.
[2] de Jonghe F, Hendricksen M, van Aalst G, et al. Psychotherapy alone and combined with pharmacotherapy in the treatment of depression. Br J Psychiatry. 2004 Jul;185:37-45.

THE RESEARCH QUESTION
5. Exploratory vs. Confirmatory Studies

Transcript
Exploratory studies can activate the empirical cycle by providing preliminary evidence of
associations between exposures and diseases. These studies are not designed to test
specific hypotheses and are likely to yield findings merely due to chance; therefore, the
results of exploratory studies should be interpreted cautiously.[3]

Exploratory studies typically rummage through a large set of possible factors to uncover
associations with disease. For example, imagine a room full of people who are separated into
two groups: people born on odd days and people born on even days. Eventually, if enough
questions are asked, an imbalanced factor will be found—a personal characteristic, behavior,
or exposure that is significantly higher in one of the groups. But this is most likely a chance
association because there is no known reason to believe that birth on an odd vs. an even
date should create a real difference in any meaningful characteristic between people. Thus,
such a result will most likely not hold up in a new sample of people that are separated in an
identical way. Similarly, exploratory clinical studies look at so many factors that some will be
associated with disease merely by chance.
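This multiple-testing problem is easy to demonstrate with a short simulation (a hypothetical
sketch, not part of the course): split people into two arbitrary groups and test 100 unrelated
yes/no characteristics, counting how many differ "significantly" at the conventional 5% level.

```python
import math
import random

random.seed(1)  # fixed seed so the run is reproducible

n_per_group = 100   # people "born on odd days" vs. "born on even days"
n_factors = 100     # unrelated yes/no characteristics, each present in ~50% of people
z_crit = 1.96       # two-sided 5% significance threshold for a z statistic

def two_proportion_z(x1, x2, n):
    """z statistic comparing proportions x1/n and x2/n (pooled standard error)."""
    p1, p2 = x1 / n, x2 / n
    p = (x1 + x2) / (2 * n)
    se = math.sqrt(p * (1 - p) * 2 / n)
    return 0.0 if se == 0 else (p1 - p2) / se

significant = 0
for _ in range(n_factors):
    # counts of people with this (totally random) characteristic in each group
    x_odd = sum(random.random() < 0.5 for _ in range(n_per_group))
    x_even = sum(random.random() < 0.5 for _ in range(n_per_group))
    if abs(two_proportion_z(x_odd, x_even, n_per_group)) > z_crit:
        significant += 1

print(f"{significant} of {n_factors} unrelated factors look 'significant' by chance")
# At a 5% threshold, about 5 false positives are expected on average.
```

Rerunning the simulation with a fresh sample (a new seed) flags a different handful of
factors each time, which is exactly why such findings rarely hold up in a new sample.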

However, the results from exploratory studies may nonetheless lead to novel hypotheses,
which are tested by confirmatory studies—starting with observational studies and sometimes
progressing to experimental trials. This module is focused on confirmatory, or hypothesis-
driven, studies.

[3] de Groot AD. Methodology: Foundations of Inference and Research in the Behavioral Sciences. Mouton & Co, Belgium; 1969.

THE RESEARCH QUESTION
6. The Research Hypothesis

Transcript
The research hypothesis is a refined version of the research question that provides the basis
for statistical testing. The research hypothesis specifies a relationship between a predictor
and outcome in a specific population known as the “target population.”

Predictors are factors that might increase or decrease the risk of disease or alter disease
prognosis. The outcome is the onset of disease or the change in disease prognosis.

Transforming a research question into a hypothesis that specifies precisely how the
hypothesis will be tested (referred to as “operationalizing” a hypothesis) is important for
several reasons, including the need to generate a null hypothesis for statistical testing and
to provide a way for other scientists to test the same hypothesis in the same way (known as
reproducibility).[4]

Here’s an example: Atherosclerosis increases depressive symptoms in elderly men and
women. Atherosclerosis is the predictor; the predictor is hypothesized to increase levels of
the outcome; the outcome is depressive symptoms; the target population is elderly men and
women.

[4] Motulsky H. Intuitive Biostatistics. Oxford University Press; 1995.

THE RESEARCH QUESTION
7. Population vs. Sample

Transcript
Researchers hope to generalize their results to a broad group of people, known as the
population, such as people with mild or moderate depression, intravenous drug users in large
metropolitan areas in the United States, or healthy elderly people in The Netherlands. In
general, it is impractical to try to study every person in a population because it is simply too
large, and not every person is available to participate in a study. Instead, researchers define a
narrower population (called a “study population”), from which they can conveniently select
participants, such as all people with mild or moderate depression from a nearby geographic
area or who receive health care at a particular clinic. This study population should be
sufficiently representative of the overall population of interest so study results can be
generalized back to the overall population. Researchers invite some or all individuals from the
study population to participate; those who are invited and actually participate in the study
constitute the “sample” of the study.[5,6]

The study population is defined by geographical and temporal factors as well as specific
criteria that determine who can be included in or excluded from a study (known as “inclusion
and exclusion criteria,” respectively).[7] For example, to test treatments for people with mild
or moderate depression, researchers used a study population that consisted of all newly
registered patients during a three-year period at two outpatient clinics of a large psychiatric
facility in Amsterdam who met the following inclusion criteria: patients had to be 18 to 65
years of age, have a diagnosable depressive disorder, and have a score of 12 to 24 on the
Hamilton Rating Scale for Depression.[8] Furthermore, patients were excluded from the study
if they had a psychotic, psycho-organic, or dissociative illness, a drug problem, specific
language or physical barriers, or any contraindications for antidepressant use. These
exclusion criteria were established to protect potential participants from harm by the study
as well as to maintain the integrity of the study by removing unreliable participants.

After applying these criteria, the study population consisted of 233 patients. Out of these, 208
enrolled in the study, and thus constituted the study sample.

When reading a medical paper, it is important to carefully consider how participants were
selected to judge whether the results of a study can be applied to the overall population, a
process called “generalizability.” For example, smokers in the United Kingdom and Ireland
were recruited to participate in a randomized trial of the nicotine patch alone vs. the nicotine
patch plus a web-based computer support program.[9] Participants in the combination group
reported quitting smoking at higher rates than participants in the patch-only group. It would be
tempting to conclude that all Western Europeans who are trying to quit smoking will benefit
from adding a web-based smoking cessation program to nicotine patch treatment. However,
smokers who volunteer for research studies are known to differ from smokers in the general
population. For example, research volunteers are often more educated and more motivated.
Thus, the study only provides evidence that web-based programs are effective in the
sub-population of smokers who would volunteer for research studies, rather than the general
population of smokers.

[5] Freedman D, Pisani R, Purves R. Statistics. 4th edition. W.W. Norton & Company; 2007.
[6] Weiss NS. Clinical Epidemiology: The Study of the Outcome of Illness. 3rd edition. Oxford University Press; 2006.
[7] Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Fundamentals of Clinical Trials. Springer Publishing; 1998.
[8] de Jonghe F, Hendricksen M, van Aalst G, et al. Psychotherapy alone and combined with pharmacotherapy in the treatment of depression. Br J Psychiatry. 2004;185:37-45.
[9] Strecher VJ, Shiffman S, West R. Randomized controlled trial of a web-based computer-tailored smoking cessation program as a supplement to nicotine patch therapy. Addiction. 2005;100:682. Abstract.

THE RESEARCH QUESTION
8. Progress Check

Transcript
Match each term to the appropriate part of the research question.

Term                 Part of Research Question

Predictor            Antidepressants

Outcome              Angina, heart attack, or coronary artery surgery

Study population     74,948 registered patients from 9 general practices in Trent,
                     England, who had electronic records available from 1995 to 1999

Overall population   General population of Western Europe

Research hypothesis  The use of antidepressants increases risk of heart disease
                     (angina, heart attack, or coronary artery surgery) in the
                     general population of Western Europe.

Study sample         Individuals who actually participate in a study

STUDY DESIGN
9. Objectives

Introduction
This section introduces the different types of studies that are used in medical research and
discusses their relative strengths and weaknesses.

Transcript
These are the objectives for this section.

Onscreen Text: Objectives


After you finish this section, you should be able to:

• Critically evaluate evidence for causal relationships


• Apply the Bradford-Hill criteria for judging causality
• Define bias and confounding
• Differentiate between observational and experimental studies
• Describe the different types of studies used in medical research
• Recognize the advantages and limitations of the different study designs
• Identify sources of bias in different study designs
• Describe the relative strengths of each type of study design in establishing
causality

STUDY DESIGN
10. Causal Inference

Transcript
The goal of clinical research is to determine whether or not particular risk factors and
treatments are causally related to disease. However, since clinical research is conducted at
the population rather than molecular or cellular level, proving causal relationships is difficult.
For example, investigators in one study found an association between the presence of
atherosclerosis and the presence of depression in a large sample of elderly men and
women.[10] These results do not prove that atherosclerosis causes depression. In fact,
causality may go in the opposite direction: depression may lead to lack of exercise and poor
eating habits, which then leads to plaque build-up in the arteries. Alternatively, a third factor,
such as advancing age, may increase the risk for both atherosclerosis and depression
independently.

In the 1960s, Sir Austin Bradford Hill outlined nine points to consider when judging whether or
not an observed association between a risk factor and disease is likely to be causal.[11] Though
many researchers have since modified this list, this table shows Bradford-Hill's original
criteria. Examine these nine criteria and keep them in mind as we discuss the relative
strengths and weaknesses of different study designs used in medical research.

Onscreen Text: Bradford-Hill Criteria for Inferring Causality

An observed association between a risk factor and a disease is more likely to be a causal
relationship when the association:

1. Is strong: The stronger the relationship between the risk factor and the disease, the
less likely that the association is due to an extraneous reason.

2. Follows the appropriate time sequence: The risk factor must temporally precede the
onset of disease, since it is logically necessary for a cause to precede an effect in time.

3. Shows a dose-response relationship: A causal relationship is more likely if a
dose-response relationship can be shown. As the amount of exposure to the risk factor is
increased, disease risk should similarly increase.

4. Is consistent across different studies: The findings must be consistent across research
sites and methodologies.

5. Is specific: The risk factor should increase (or decrease) the likelihood of a specific
disease, rather than just overall disease in general.

6. Is biologically plausible: There should be a plausible biological reason, usually at the
molecular level, for how the risk factor can cause the disease.

7. Is coherent with previous knowledge: A cause-and-effect interpretation cannot conflict
with what is known from other high-quality studies.

8. Is analogous to a known phenomenon: Sometimes a commonly accepted phenomenon in
one area of study can be applied to another area of study.

9. Comes from experimental evidence: Experimental studies yield stronger evidence of
causality than observational studies.

[10] Tiemeier H, van Dijck W, Hofman A, Witteman JC, Stijnen T, Breteler MM. Relationship between atherosclerosis and late-life depression: the Rotterdam Study. Arch Gen Psychiatry. 2004;61(4):369-76.
[11] Hill AB. The environment and disease: Association or causation? Proceedings of the Royal Society of Medicine. 1965;58:295-300.

His list was not meant to be exhaustive, nor is it necessary to meet all criteria in order to
establish causality. Experimental studies are the gold standard for proving causality because
the investigator controls the exposure or treatment. In clinical research, experimental studies
generally take the form of a randomized clinical trial (RCT). However, randomized clinical
trials are not always possible to perform.

STUDY DESIGN
11. Bias and Confounding

Transcript
All medical studies are subject to bias and confounding, which may undermine the validity of
study results.

Biases arise when differences exist in the selection, measurement, or analysis of study
subjects depending on their exposures or outcomes. Consider a randomized clinical trial for
treatment of depression, comparing psychotherapy alone versus psychotherapy plus
antidepressants.[12] If the investigator who evaluates outcomes is not blinded, he or she
might perceive greater improvements in patients in the therapy-plus-drug group because he
or she expects better results in patients receiving two kinds of treatment rather than one,
even if there is no truly objective difference between the two groups. Bias will be discussed
more completely in the context of the different study designs.[13]

Confounding occurs when a factor that is associated with both the predictor and the outcome
is ultimately responsible for their apparent relationship. A classic example of confounding is the
association seen between heavy alcohol consumption and lung cancer. Heavy drinkers get
lung cancer at higher rates than non-drinkers, but there is no known biological relationship
that explains how alcohol consumption can cause lung cancer. Rather, it is known that heavy
drinkers also smoke at higher rates than non-drinkers; and smoking is known to cause lung
cancer. Drinkers who do not smoke are not at increased risk of lung cancer. We thus say that
smoking confounds the relationship between heavy alcohol consumption and lung cancer.

Remember, a confounder must be associated with both the outcome (e.g., disease) and
predictor (e.g., risk factor) of interest and must not simply be a link in a causal pathway
between the predictor and outcome. For example, people who eat a high-fat diet are more
likely to have high cholesterol and are also more likely to develop heart disease, but
cholesterol is not considered a confounder of the relationship between fatty foods and heart
disease because it’s a well established intermediate step in the causal pathway to heart
disease.

In the alcohol and lung cancer example, the effect of the confounder is to create the
appearance of an association between two things that are not causally related. In addition to
misleading researchers to state causal relationships that are not real, confounding can also
cause researchers to overestimate, underestimate, or even invert a true causal relationship.
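The possibility of an inverted association can be shown with made-up numbers (a
hypothetical illustration in the same spirit as the alcohol example). Here disease severity
confounds a comparison of two treatments: the first drug does better within every severity
stratum, yet looks worse overall because severe cases cluster in its group.

```python
# Hypothetical counts: (successes, patients), stratified by disease severity,
# which acts as the confounder (severe cases cluster in the Drug A group).
data = {
    "Drug A": {"mild": (81, 87),   "severe": (192, 263)},
    "Drug B": {"mild": (234, 270), "severe": (55, 80)},
}

def rate(successes, n):
    """Success proportion within one stratum."""
    return successes / n

def overall(strata):
    """Crude (unstratified) success proportion across all strata."""
    total_s = sum(s for s, _ in strata.values())
    total_n = sum(n for _, n in strata.values())
    return total_s / total_n

for drug, strata in data.items():
    for severity, (s, n) in strata.items():
        print(f"{drug}, {severity}: {rate(s, n):.1%} success")
    print(f"{drug}, overall: {overall(strata):.1%} success")
# Drug A wins in both strata (93.1% vs 86.7% mild; 73.0% vs 68.8% severe),
# yet Drug B looks better overall (82.6% vs 78.0%): the confounder inverts
# the crude comparison.
```

Stratifying by the confounder, as above, recovers the within-group comparison; the crude
totals alone would mislead.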

In observational studies, which include cross-sectional studies, case-control studies, and
cohort studies, the investigator observes the population, but does not interfere. In this case,
confounding can be a huge problem because there may exist at least one unmeasured or
unknown factor that could explain the study results. In experimental studies, the investigator
can minimize or remove the effect of such factors by randomly assigning subjects to different
groups. Randomization minimizes confounding by distributing confounding variables roughly
equally among different groups.
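A small simulation (hypothetical numbers) shows this balancing in action for one measured
confounder, age; the same logic applies equally to confounders nobody thought to measure.

```python
import random

random.seed(7)  # fixed seed for a reproducible run

# Hypothetical cohort of 1,000 subjects; age is a potential confounder.
ages = [random.gauss(60, 10) for _ in range(1000)]

# Randomization: a coin flip assigns each subject to treatment or control.
assignment = [random.random() < 0.5 for _ in ages]
treated = [a for a, t in zip(ages, assignment) if t]
control = [a for a, t in zip(ages, assignment) if not t]

def mean(xs):
    return sum(xs) / len(xs)

print(f"mean age, treated: {mean(treated):.1f}")
print(f"mean age, control: {mean(control):.1f}")
# The two group means agree closely by chance alone; unmeasured confounders
# are balanced in exactly the same way, which no amount of measuring can guarantee.
```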

[12] de Jonghe F, Hendricksen M, van Aalst G, et al. Psychotherapy alone and combined with pharmacotherapy in the treatment of depression. Br J Psychiatry. 2004 Jul;185:37-45.
[13] Gordis L. Epidemiology. Elsevier Saunders Publishing; 2009.

Though the randomized clinical trial is considered the gold standard for showing a causal link
between a risk factor and a disease, RCTs are not always feasible. Observational studies are
often cheaper and faster, can examine longer-term effects, and can be used to generate
hypotheses for RCTs. In some cases, RCTs are impossible for obvious reasons. For
example, a study comparing the clinical characteristics of angina manifesting in males vs.
females cannot be randomized. A randomized study may also be impossible for ethical
reasons – for example, a study directing half of the participants to smoke.

STUDY DESIGN
12. Cross-Sectional Studies

Transcript
Before discussing the details of a cross-sectional study, it is important to briefly define two
medical terms: prevalence and incidence. Prevalence is the proportion of people in the
population who have the disease, including new and old cases. Incidence is the rate at which
new cases of disease occur over a specific period of time.[14]
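These two definitions translate directly into arithmetic; a minimal sketch with illustrative
numbers (not from the course):

```python
# Illustrative numbers (hypothetical)
population = 10_000
existing_cases = 300                       # old and new cases combined

# Prevalence: the proportion of the population that has the disease now.
prevalence = existing_cases / population    # 0.03, i.e., 3%

# Incidence: the rate at which NEW cases arise among people initially at risk.
at_risk = population - existing_cases       # 9,700 disease-free people
new_cases_in_year = 97
incidence = new_cases_in_year / at_risk     # 0.01 new cases per person per year

print(f"prevalence = {prevalence:.1%}; incidence = {incidence:.1%} per year")
```

Note that the denominators differ: prevalence uses everyone, while incidence counts only
people who were free of disease and therefore at risk of becoming a new case.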

Cross-sectional studies measure risk factors and disease in a sample of people at one time
point. Consider the cross-sectional study that you see here: “Relationship between
atherosclerosis and late-life depression.”[15] Investigators assessed atherosclerosis and
depressive symptoms in more than 4,000 men and women aged 60 years and older who
were randomly selected from a large community in The Netherlands. Depression was found
to be more common in people who had high levels of atherosclerosis. Use the link to view the
publication.

When reading observational studies, it is important to consider the characteristics of people
who refused to participate in the study as a potential source of bias. Out of 5,901 subjects
asked to participate, 4,019, or 68%, participated in both the interview and the clinical work-up
for atherosclerosis.[16] If non-participants had higher rates of depression or atherosclerosis,
this would lead to underestimates of the prevalence of these conditions in the population.

The results of cross-sectional studies are difficult to interpret because the direction of
causality cannot easily be established. In the case here, it wouldn’t be unreasonable to
believe that depression leads to atherosclerosis, rather than vice versa.

The association between atherosclerosis and depression might also be explained by
confounding. For example, older people are likely to have more atherosclerosis; if older
people also have more depression, the two conditions would appear related. In this study, the
association remained strong even after accounting for age, sex, cognitive function, education,
antidepressant medication, smoking, cholesterol, weight, height, and the presence of other
cardiovascular disease.[17] However, it is impossible to control for all possible confounders,
particularly since both atherosclerosis and depression may have developed a long time ago
and potential confounders that occurred in the past were not measured.

Cross-sectional studies are relatively inexpensive, fast, and simple; data collection occurs
only once, minimizing time, cost, and inconvenience to the participants. If the sample is
randomly selected from a particular population and participation rates are high, cross-
sectional studies provide useful information about the presence of risk factors and disease in
a population.

[14] Gordis L. Epidemiology. Elsevier Saunders Publishing; 2009.
[15] Tiemeier H, van Dijck W, Hofman A, Witteman JC, Stijnen T, Breteler MM. Relationship between atherosclerosis and late-life depression: the Rotterdam Study. Arch Gen Psychiatry. 2004 Apr;61(4):369-76.
[16] Tiemeier H, van Dijck W, Hofman A, Witteman JC, Stijnen T, Breteler MM. Relationship between atherosclerosis and late-life depression: the Rotterdam Study. Arch Gen Psychiatry. 2004 Apr;61(4):369-76.
[17] Tiemeier H, van Dijck W, Hofman A, Witteman JC, Stijnen T, Breteler MM. Relationship between atherosclerosis and late-life depression: the Rotterdam Study. Arch Gen Psychiatry. 2004 Apr;61(4):369-76.

However, because cause and effect are measured simultaneously, temporal direction may be
uncertain unless the risk factor is a fixed characteristic, such as genetics. Cross-sectional
studies are inefficient for studying rare diseases or exposures, which occur too infrequently in
a random sample.

STUDY DESIGN
13. Case-Control Studies

Transcript
A case-control study is a type of observational study that measures past exposures in a
sample of people who have a disease (the cases) and in a comparable sample of people who
do not have that disease (the controls). This design is efficient for studying rare disease
because investigators specifically seek out people with disease from a target population.
Case-control studies are useful for evaluating associations between exposures and diseases,
but do not yield estimates of the absolute risk, prevalence, or incidence of disease. Since
researchers select the relative proportion of cases to controls, the prevalence of disease in
the study sample does not reflect the true prevalence of disease in the population. For
example, if one control is selected for every case, half the sample will have the disease, but
this does not mean that half the population has the disease.[18]
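A numeric sketch (hypothetical counts, not from the study) makes the point: the case-control
sample grossly overstates prevalence by design, while the exposure-disease association,
summarized here as an odds ratio, is preserved.

```python
# Hypothetical population 2x2 table: counts by exposure and disease status.
diseased = {"exposed": 90, "unexposed": 10}
healthy = {"exposed": 9910, "unexposed": 9990}

def odds_ratio(d_exp, h_exp, d_unexp, h_unexp):
    """OR = (d_exp * h_unexp) / (h_exp * d_unexp) for a 2x2 table."""
    return (d_exp * h_unexp) / (h_exp * d_unexp)

n_pop = sum(diseased.values()) + sum(healthy.values())
true_prevalence = sum(diseased.values()) / n_pop    # 0.5% of the population

# Case-control design: enroll every case plus one control per case,
# drawing controls in proportion to exposure among healthy people.
n_cases = sum(diseased.values())
n_healthy = sum(healthy.values())
ctrl_exp = n_cases * healthy["exposed"] / n_healthy
ctrl_unexp = n_cases - ctrl_exp

sample_prevalence = n_cases / (2 * n_cases)          # 50% by construction

pop_or = odds_ratio(diseased["exposed"], healthy["exposed"],
                    diseased["unexposed"], healthy["unexposed"])
cc_or = odds_ratio(diseased["exposed"], ctrl_exp,
                   diseased["unexposed"], ctrl_unexp)

print(f"prevalence: population {true_prevalence:.1%}, sample {sample_prevalence:.0%}")
print(f"odds ratio: population {pop_or:.2f}, case-control sample {cc_or:.2f}")
# The association measure survives the design; the absolute prevalence does not.
```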

Case-control studies are often used in outbreak investigations. Researchers interview
everyone who became sick and a sample of people who did not become sick and look for
exposures that appear disproportionately in the case group. Case-control studies also have
historical significance; they have been used to identify associations between mouth cancer
and pipe smoking in the 1920s; between breast cancer and reproductive history in the 1920s;
and between smoking and lung cancer in the 1950s. During the early years of the AIDS
epidemic—when the cause of the disease was still a mystery—case-control studies identified
sexual contact and blood transfusions as the main routes of transmission of AIDS.

Consider the case-control study that you see here, “Antidepressants as a risk factor for
ischemic heart disease: case-control study in primary care.”[19] Using electronic health records,
researchers from Trent, England, identified all men and women from nine primary care
practices who were diagnosed with heart disease from 1995 to 1999. Researchers then
combed through the same records to find controls—that is, people whose records gave no
indication of heart disease; they selected 4 to 6 controls of the same age and sex for each
case (a practice called “matching” controls to cases). Researchers then extracted
computerized data about antidepressant prescriptions for cases and controls. They found an
elevated rate of antidepressant drug use in the case group. Use the link to view the
publication.

Proper control selection is vital to the validity of case-control studies. Cases are likely to be
found at hospitals or doctor’s offices; if controls are drawn outside of these settings, this
differential selection can bias the results. For example, in this study, if controls had been
drawn from the community-at-large rather than from the same doctors’ offices as the cases,
the sample may have included people who see a doctor less frequently and thus obtain fewer
prescriptions. In this scenario, the case group would likely appear to take more of almost any
prescription drug the investigator cared to examine, but this association would merely be a
consequence of poor study design.

[18] Dunn OJ, Clark VA. Basic Statistics: A Primer for the Biomedical Sciences. 4th edition. Wiley; 2009.
[19] Hippisley-Cox J, Pringle M, Hammersley V, et al. Antidepressants as risk factor for ischaemic heart disease: case-control study in primary care. BMJ. 2001 Sep 22;323(7314):666-9.

Case-control studies often ask cases and controls to recall exposures that occurred in the
distant past. Not surprisingly, people who have a disease are usually searching for reasons
why they got sick, and they may be more likely than controls to remember and report
exposures. This situation leads to recall bias and is a common problem for case-control
studies.[20] It is believed recall bias was not a factor in this study, since computerized records
provided the prescription histories.

As with any observational study, confounding is a problem. After adjustment for potential
confounders, the researchers found that the association between antidepressants and heart
disease was limited to the use of tricyclic antidepressants.[21] Matching controls to cases, as
was done in this study, helps reduce confounding by ensuring the similarity of cases and
controls on important factors such as age and sex.

Even though case-control studies ask about exposures in the past, establishing temporal
relationships may still be tricky. In this study, heart disease may have been present long
before a diagnosis was made in the medical records; thus, whether antidepressant use
actually preceded heart disease is still an open question. Diagnosis with disease may also
change behaviors, such as diet, and cases may have difficulty recalling which behaviors
preceded and which followed disease diagnosis.

[20] Gordis L. Epidemiology. Elsevier Saunders Publishing; 2009.
[21] Hippisley-Cox J, Pringle M, Hammersley V, et al. Antidepressants as risk factor for ischaemic heart disease: case-control study in primary care. BMJ. 2001;323(7314):666-9.

STUDY DESIGN
14. Cohort Studies

Transcript
In a prospective cohort study, investigators measure risk factors at the beginning, or baseline,
of the study, and then follow the study participants (i.e., the cohort) over time to measure
disease occurrence. Cohort studies can also be retrospective; retrospective cohort studies
involve searching existing databases for past information about baseline measurements,
follow-up measurements, and outcomes in a defined cohort of people.[22]

Cohort studies yield estimates of the incidence and cumulative risk of disease in different
exposure groups, where risk, in this case, is defined as the chance that a person will develop
disease in a given time period.
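The estimates a cohort design yields can be sketched with hypothetical follow-up counts; the
ratio of the two risks (the relative risk) is a natural by-product of the design.

```python
# Hypothetical 10-year follow-up of a cohort, split by a baseline exposure.
exposed = {"n": 2000, "new_cases": 120}
unexposed = {"n": 3000, "new_cases": 90}

def cumulative_risk(group):
    """Chance of developing disease over the follow-up period."""
    return group["new_cases"] / group["n"]

risk_exposed = cumulative_risk(exposed)      # 0.06
risk_unexposed = cumulative_risk(unexposed)  # 0.03
relative_risk = risk_exposed / risk_unexposed

print(f"10-year risk: exposed {risk_exposed:.1%}, unexposed {risk_unexposed:.1%}; "
      f"relative risk = {relative_risk:.1f}")
```

Because the cohort is followed forward from exposure to outcome, these absolute risks are
directly estimable, which (as noted earlier) a case-control design cannot provide.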

The Framingham Heart Study is a prospective cohort study that is more than half a century
old. In 1948, investigators assembled a cohort of more than 5,200 residents of Framingham,
Massachusetts, aged 28 to 62 years; Framingham was considered an average community in
the United States.
Investigators measured participants’ health and lifestyle factors, including blood pressure,
weight, and exercise. Since then, researchers have carefully tracked participants to measure
the development of heart disease. It's from the Framingham study that we now know that
smoking, high blood pressure, physical inactivity, cholesterol, and obesity are all risk factors
for heart disease.

Consider the retrospective cohort study “A longitudinal study of premorbid IQ score and risk of
developing schizophrenia, bipolar disorder, severe depression, and other nonaffective
psychoses.”²³ In this study, researchers had access to records for more than 50,000 Swedish
men drafted for compulsory military training during 1969 to 1970; the initial evaluation
included an IQ test. Using a national hospital database, the researchers were able to track
which of these men went on to be admitted to the hospital for schizophrenia, bipolar disorder,
severe depression, and other nonaffective psychoses in the 27 years following the IQ test.
The researchers found that lower IQ score was associated with increased risk for developing
all of these conditions except bipolar disorder. Outcomes were relatively rare, ranging from
0.22% for bipolar disorder to 0.8% for schizophrenia.

Use the link to view the publication.

Cohort studies involving rare outcomes require huge sample sizes and lengthy follow-up to
ensure that sufficient cases develop. Because of the availability of computerized records,
researchers in this study were able to follow a sample of 50,000 people for 27 years. The
longer the follow-up, the higher the chance for participant attrition. Loss to follow up can bias
results if certain exposure groups are more likely to withdraw from the study. Losses in this
study were minimized by the use of computerized records.

This study only captured cases of disease that required hospitalization; most patients with
schizophrenia are hospitalized at some point in their illness, but patients with depression and
bipolar disorder may never be hospitalized. If IQ is related to hospital admission for these conditions,
this could bias results.

22. Gordis L. Epidemiology. Elsevier Saunders Publishing; 2009.
23. Zammit S, Allebeck P, David AS, et al. A longitudinal study of premorbid IQ score and risk of developing schizophrenia, bipolar disorder, severe depression, and other nonaffective psychoses. Arch Gen Psychiatry. 2004 Apr;61(4):354-60.

The investigators had limited information on confounders collected at conscription, including
psychiatric problems at conscription, drug use, place of upbringing, and paternal age. These
confounders did not appear to account for the observed results.

Cohort studies follow participants longitudinally, or over time. Researchers measure
predictors before study subjects develop disease, thus establishing that the predictors
preceded the outcome. Cohort studies yield estimates of rate and risk of disease at different
levels of exposure. Cohort studies can be used to study rare exposures and multiple
outcomes. Cohort studies have provided invaluable information about risk factors for chronic
diseases. Cohort studies are expensive, take time, and require careful tracking of participants
to avoid losing them during follow up.

STUDY DESIGN
15. Randomized Clinical Trials

Transcript
The randomized clinical trial (RCT) is the only study design in clinical research that
approximates the laboratory experiment.

In an RCT, investigators randomly assign participants to different intervention groups (also
called “arms”). Randomization minimizes confounding by baseline variables—for example,
smokers or obese patients or patients with high blood pressure are evenly assigned to
different intervention arms. The double-blind RCT is optimal because assignments are
concealed from both subjects and investigators, thereby minimizing bias. However, it is not
always possible to blind all parties.²⁴

Consider the randomized clinical trial that you see here, “Psychotherapy alone and combined
with pharmacotherapy in the treatment of depression.”²⁵ Patients with mild or moderate
depressive disorder were randomly assigned to receive psychotherapy alone or
psychotherapy plus antidepressants. Independent, blinded observers evaluated outcomes,
since neither physicians nor patients were blinded to treatment assignment. After six months,
the patients were evaluated for improvement in depression symptoms. Neither the
independent observers nor the treating physicians saw differences between the two treatment
groups; however, the patients in the drug plus therapy group reported greater improvement in
symptoms than patients in the therapy-only group. Use the link to view the publication.

Patients who took antidepressants may simply have experienced a placebo effect; they may
have felt or reported greater improvements because they believed that drugs should help. To
control for placebo effects, many randomized clinical trials use a placebo intervention.
Investigators have even gone so far as to use sham surgery—where surgeons act as if they
are operating on a patient without performing any truly therapeutic procedure. For example, a
study of surgery for knee osteoarthritis found that arthroscopic surgery was no more effective
than sham surgery.²⁶ The reduction in pain that both groups experienced could thus be
attributed to a placebo effect. (Please note the ethics of subjecting patients to sham surgery is
a topic of ongoing debate in the medical community and is rarely practiced.)

This case study was an example of a parallel clinical trial, where outcomes are compared
between two (or more) separate treatment groups.²⁷ Randomized trials may also follow a
crossover design.²⁸ In a crossover design, each person receives all treatments sequentially in
random order. In a crossover design, each person serves as his or her own control and intra-
individual differences are evaluated. Crossover studies generally require fewer participants to
measure the same effect, but they may be clouded by carryover effects from one treatment to
the next. To avoid carryover effects, investigators usually build in a washout period between
treatment cycles.

24. Weiss NS. Clinical Epidemiology. Oxford Publishing; 2006.
25. de Jonghe F, Hendricksen M, van Aalst G, et al. Psychotherapy alone and combined with pharmacotherapy in the treatment of depression. Br J Psychiatry. 2004;185:37-45.
26. Moseley JB, O’Malley K, Petersen NJ, et al. A controlled trial of arthroscopic surgery for osteoarthritis of the knee. NEJM. 2002;347(2):81-8.
27. de Jonghe F, Hendricksen M, van Aalst G, et al. Psychotherapy alone and combined with pharmacotherapy in the treatment of depression. Br J Psychiatry. 2004;185:37-45.
28. Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB. Fundamentals of Clinical Trials. Springer Publishing; 1998.

In both parallel and crossover RCTs, patients do not always stick to the regimen to which they
are randomized, and this can reduce the ability to detect effects. For example, in the trial of
psychotherapy alone versus psychotherapy plus antidepressants, 15% of patients in the
combination group had discontinued medication by the end of the trial. This non-compliance
dilutes the intervention in the combination group, making it more similar to the therapy-only
group and reducing the apparent treatment effect. Nevertheless, the results are still relevant
because RCTs are designed to estimate how effective an intervention is in a real-world
situation—which includes patients who don’t follow doctor’s orders—rather than the efficacy
of the intervention when administered perfectly.

RCTs may involve a select group of people, as in cases of severe disease. Also, people who
are willing to leave their treatment up to fate may be different from people who are not.
Therefore, results sometimes have limited applicability to the real world or to similar
populations (a limitation termed “generalizability”).

Randomized clinical trials are expensive—typically costing thousands or even tens of
thousands of US dollars per participant. Thus, they can only measure short-term effects, on
the order of months to a few years. At times, they are not ethical or practical. Nonetheless,
randomized clinical trials remain the gold standard for proving cause-and-effect relationships.

Select the “I” icon for more information.

More Information
Differential drop-out can make randomized clinical trials difficult to evaluate, and
sometimes side effects make blinding difficult. Nonetheless, RCTs are the gold
standard among study designs.

STUDY DESIGN
16. Meta-Analysis

Transcript
A meta-analysis is not a type of study design, but instead a statistical practice of pooling the
results of separate but related studies. Meta-analysis can yield a precise estimate of effect,
evaluate consistency, and address apparent contradictions in the literature. Meta-analysis
also allows one to address questions that smaller individual studies could not address due to
small sample sizes. The first step of meta-analysis is to search for all studies related to the
hypothesis of interest, as well as unpublished data when obtainable. For example, consider
the study seen here: “Efficacy of venlafaxine compared with selective serotonin reuptake
inhibitors and other antidepressants: a meta-analysis.”²⁹ The investigators searched
electronic databases of scientific articles using the keywords venlafaxine and Effexor to
identify 2,349 studies; from there, the investigators manually selected all double-blinded,
randomized studies comparing venlafaxine with alternative antidepressants for the treatment
of depression.³⁰ There were a total of 32 such studies.

Using these studies, the researchers found sufficient evidence to conclude that venlafaxine is
consistently more effective than selective serotonin reuptake inhibitors in the treatment of
depression.

Use the link to view the publication.

Select the “I” icon for more information.

More Information
Meta-analysis can evaluate the overall weight of evidence for causal relationships,
making it a very powerful technique. However, meta-analysis is limited by publication
bias and by the quality of studies available.

Meta-analysis draws primarily from published studies, and published studies have an
inherent bias because researchers often fail to publish negative results. For example,
if ten research groups conduct studies of hormone replacement therapy and
dementia and only three find a positive association and publish their results, the
published literature paints a skewed picture of the overall body of knowledge on HRT
and dementia. A meta-analysis that draws solely from published data will be similarly
skewed.

Finally, even the most elegant meta-analysis cannot salvage a collection of poorly
designed studies.

Meta-analysis may also provide a level of precision that can detect tiny effects that
have no clinical meaning. For example, if you pool data from 10,000 people, you may
be able to detect a 0.005% difference between two groups, but this difference is too
small to be clinically relevant.
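The pooling step at the heart of meta-analysis can be illustrated with fixed-effect inverse-variance weighting, one common approach. This is a sketch with invented effect sizes and standard errors, not necessarily the method used in the venlafaxine meta-analysis:

```python
# Hypothetical per-study effect estimates and their standard errors
effects = [0.30, 0.10, 0.25]
std_errors = [0.10, 0.15, 0.08]

# Weight each study by the inverse of its variance: precise studies count more.
weights = [1 / se ** 2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

print(round(pooled, 3))     # weighted average of the study effects
print(round(pooled_se, 3))  # smaller than any single study's standard error
```

Because the pooled standard error shrinks as studies accumulate, a meta-analysis can detect effects that no single study could, which is exactly why clinically trivial differences can become statistically significant.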

29. Smith D, Dempster C, Glanville J, Freemantle N, Anderson I. Efficacy and tolerability of venlafaxine compared with selective serotonin reuptake inhibitors and other antidepressants: a meta-analysis. Br J Psychiatry. 2002;180:396-404.
30. Weiss NS. Clinical Epidemiology: The Study of the Outcome of Illness. 3rd edition. Oxford University Press; 2006.

STUDY DESIGN
17. Progress Check

Transcript
Match one advantage and one disadvantage to each study design.

Cross-sectional study
  Advantage: Can be used to measure the prevalence of disease
  Disadvantage: Is a poor study design for distinguishing cause and effect

Case-control study
  Advantage: Is efficient for studying rare diseases
  Disadvantage: Is subject to recall bias

Cohort study
  Advantage: Measures risk factors prior to the onset of disease
  Disadvantage: Data collection may take decades to complete

Randomized controlled trial
  Advantage: Is an experimental study
  Disadvantage: Is too expensive for evaluation of long-term outcomes

Meta-analysis
  Advantage: Can evaluate consistency over many studies
  Disadvantage: Is limited by publication bias

Determine whether the statements are true or false.

A significant correlation between exposure and disease is proof of causality. (False)

If the risk factor does not temporally precede the onset of disease, this is proof against causality. (True)

Confounding can cause researchers to underestimate the magnitude of an association between exposure and disease. (True)

Randomization increases the chance of having hidden confounders. (False)

TYPES OF MEASUREMENT
18. Objectives

Introduction
The characteristics, exposures, and outcomes that are measured in medical studies are
called variables. The values that are recorded for these variables (e.g., yes/no, 0 to 100)
constitute the study data. The type of data guides the statistical analysis. This section
introduces the different types of data that are encountered in clinical research and explains
how diagnostic tests are evaluated.

Transcript
These are the objectives for this section.

Onscreen Text: Objectives


After you finish this section, you should be able to:

• Describe sources of variation in medical data


• Differentiate between continuous and categorical data
• Define time-to-event data
• Define mean and standard deviation
• Identify how data were measured in real studies
• Describe how diagnostic tests are evaluated
• Define and calculate sensitivity, specificity, negative predictive value, and positive
predictive value

TYPES OF MEASUREMENT
19. Variation

Transcript
Medicine is not an exact science—there are many sources of variation and uncertainty.
Variation in medical data may arise from real biological differences, variation in data collection
methods, measurement error, and random chance.³¹

Human beings differ in personal characteristics, behaviors, exposures, and responses. For
example, people differ naturally in height. The same person may also differ when followed
over time. For example, a child’s height changes as the child ages. Groups of people may
also differ in their collective characteristics. For example, on average, men are taller than
women.

The process of measuring variables also adds variability. For example, questionnaires for
diagnosing depression may differ in quality and focus, and thus yield different results. People
may score higher or lower on a depression questionnaire depending on the season and time
of day that it is administered. Finally, simple human error or faulty equipment can alter the
measurement. Differential measurement errors—those that only affect certain groups in a
study—can bias study results.

There are several ways to quantify variability. The simplest measure is the range of observed
values; for example, historically, adult human height has ranged from just under two feet to
almost nine feet. A more informative measure of variability is the standard deviation, which
gives one a feel for the typical range of values away from the average. For example, if the
average (or mean) height for men is 5 feet 9 inches and the standard deviation is 3 inches,
this means that most men have heights within a few inches of 5 feet 9. Another measure of
variation is the inter-quartile range, which indicates the range of the middle 50% of the data.

Select the “I” icon for more information about standard deviation.

More information
The standard deviation expresses the average deviation from the mean. Because the
sum of the negative and positive deviations from the mean will always be 0, one
cannot directly calculate an average deviation. Instead, one must first calculate the
variance, which is the average of the squared deviations from the mean. The
standard deviation is then obtained by taking the square root of the variance.

Variance:

    σ² = Σᵢ₌₁ⁿ (xᵢ − mean)² / (n − 1)

Standard deviation:

    σ = √[ Σᵢ₌₁ⁿ (xᵢ − mean)² / (n − 1) ]
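These formulas translate directly into code. The following is a minimal Python sketch (the height values are invented for illustration); Python’s built-in statistics module gives the same answer:

```python
import statistics

def sample_variance(xs):
    """Average squared deviation from the mean, with n - 1 in the denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    """Standard deviation: the square root of the variance."""
    return sample_variance(xs) ** 0.5

heights = [66, 68, 69, 70, 72]  # hypothetical heights in inches
print(sample_variance(heights))             # 5.0
print(round(sample_sd(heights), 2))         # 2.24
print(round(statistics.stdev(heights), 2))  # 2.24 (library agrees)
```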

31. Freedman D, Pisani R, Purves R. Statistics. 4th edition. W.W. Norton & Company; 2007.

TYPES OF MEASUREMENT
20. Continuous Data

Transcript
Data can be broadly divided into numerical and categorical. Categorical data classifies people
into categories, such as male and female. Numerical data involve numbers that you can add,
subtract, multiply, and divide. Numerical data may further be discrete or continuous. Discrete
data can only be whole numbers, usually counts. Continuous data are uninterrupted—
meaning they can take on any value within a defined range.³²

For example, a crude tape measure shows that a woman is six feet tall. If you could measure
her with a more precise tape measure, you might find that she’s actually 6 feet one-half inch
tall. If you could measure her with an extremely precise tape measure, you might find that
she’s actually 6 feet 0.525 inches tall, and so on. The measurement of continuous variables is
limited only by the precision of the measuring device.

Many clinical and personal characteristics are continuous, such as age, weight, blood
pressure, and income. Researchers will also treat some discrete numerical variables like
continuous variables if they take on a wide range of values, such as scores on a 40-point
scale for depression.

Typically, studies report the mean and standard deviation of continuous variables for their
study sample. The mean value is just the average value. The standard deviation is a measure
of the average scatter around the mean value. For example, in a cross-sectional study of
atherosclerosis and late-life depression, the continuous variables blood pressure, body mass
index, and cholesterol were measured. Among those with normal depression scores, the
mean cholesterol was 224 with a standard deviation of 38.7.

Studies may also report the median value and interquartile range of a continuous variable.
The median value is the value exactly in the middle of the sample; if the sample contains an
even number of observations, the median value is the average of the two values closest to
the middle; for example, the median cholesterol value in the sample pictured here is 225. The
interquartile range is the middle 50% of the data, as shown here. The mean and median
values may be quite different if the data have an uneven (i.e., skewed) distribution. For
example, adding a cholesterol value of 500 to the data skews the mean from 234 to 264, but
the median is relatively unaffected at 230. The mean and median will be the same if the
variable follows a symmetric distribution such as a normal distribution, also known as the “bell
curve.” Select the following data distributions to find the median, mean, interquartile range,
and standard deviation.
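The pull of a skewed value on the mean, but not the median, is easy to verify. Here is a short Python sketch with invented cholesterol values (not the study’s actual data):

```python
import statistics

cholesterol = [180, 200, 210, 225, 240, 260, 320]  # hypothetical values

print(round(statistics.mean(cholesterol), 1))   # 233.6
print(statistics.median(cholesterol))           # 225

# Add one extreme value: the mean jumps, the median barely moves.
with_outlier = cholesterol + [500]
print(round(statistics.mean(with_outlier), 1))  # 266.9
print(statistics.median(with_outlier))          # 232.5
```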

Many continuous variables conform roughly to a normal distribution. This means that values
tend to cluster around the mean value and drop in frequency farther from the mean, forming a
symmetric bell shape. The bell is wider or thinner depending on the standard deviation of the
characteristic. Normal distributions are predictable: 68% of observations fall within 1 standard
deviation of the mean; 95% fall within 2 standard deviations of the mean, and only a small
fraction of observations are ever more than three standard deviations away from the mean.
For example, if cholesterol is normally distributed in a population with a mean of 220 and a
standard deviation of 40, then 68% of people will have cholesterol between 180 and 260, 95%
will have cholesterol between 140 and 300, and 99.7% will have cholesterol between 100 and
340.
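The 68-95-99.7 rule can be checked against the cholesterol example using Python’s statistics.NormalDist:

```python
from statistics import NormalDist

chol = NormalDist(mu=220, sigma=40)  # mean and SD from the example

within_1sd = chol.cdf(260) - chol.cdf(180)  # within 1 SD: 180 to 260
within_2sd = chol.cdf(300) - chol.cdf(140)  # within 2 SD: 140 to 300
within_3sd = chol.cdf(340) - chol.cdf(100)  # within 3 SD: 100 to 340

print(round(within_1sd, 3))  # 0.683
print(round(within_2sd, 3))  # 0.954
print(round(within_3sd, 3))  # 0.997
```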

32. Moore DS. Statistics: Concepts and Controversies. Freeman Publishing; 2001.

TYPES OF MEASUREMENT
21. Categorical Data

Transcript
Categorical data classify study subjects into categories, such as exposed and unexposed,
and can be binary, nominal, or ordinal.³³

Binary data is the simplest type of data; everyone falls into one of two categories. For
example: yes or no, disease or not disease.

Nominal data involve two or more categories that have no particular order. For example,
blood type—A, B, AB, or O—is nominal data.

Ordinal data involve two or more ordered categories—such as healthy, mild depression,
moderate depression, and severe depression.

Categorical variables are reported as the percentage of study subjects that fall into each
category. For example, in a cross-sectional study of atherosclerosis and late-life depression,
smoking was measured as never, current, or ex-smoker. Among those with normal
depression scores, 15.9% were current smokers and 50.4% were ex-smokers. The remaining
33.7% had never smoked.

Sometimes continuous data may be collapsed into categorical data. For example, continuous
IQ score could be grouped into low, medium, or high IQ categories to simplify the statistical
analysis and interpretation. The trade-off for this simplicity is a loss of precision. The results
will also vary depending on where cut-off points for the categories are drawn.

33. Moore DS, Notz WI. Statistics: Concepts and Controversies. 5th edition. W.H. Freeman; 2001.

TYPES OF MEASUREMENT
22. Time-to-Event Data

Transcript
In some cohort studies and randomized trials, participants are followed until an event of
interest occurs, such as the subject recovers from disease, develops a disease, or dies.
Because time is measured, researchers can estimate rates of recovery, disease, or death in
the population. The corresponding data are called time-to-event data or survival data. This
type of data has two parts: a continuous part (time) and a binary part (whether or not the
event occurred). For participants who have an event during study follow-up, researchers
record the time it took for the event to occur. For participants who leave or finish the study
before having the event, researchers record their total time of follow-up and designate them
as “censored,” indicating that they were event-free at their final measurement.³⁴
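A common way to represent time-to-event data in code is one (time, event-flag) pair per participant; censored participants carry a flag of 0. Below is a minimal Python sketch with invented follow-up data, which also computes an incidence rate per person-year:

```python
# (years of follow-up, event flag: 1 = developed disease, 0 = censored)
participants = [(2.0, 1), (5.0, 0), (3.5, 1), (5.0, 0), (1.2, 0), (4.0, 1)]

events = sum(flag for _, flag in participants)
person_years = sum(time for time, _ in participants)
rate = events / person_years

print(events)           # 3 events observed
print(person_years)     # 20.7 person-years of follow-up
print(round(rate, 3))   # 0.145 events per person-year
```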

For example, the Women’s Health Initiative studied the effects of postmenopausal hormones
on the incidence of heart disease. This graphic shows the cumulative risk of heart disease
over time, comparing the hormone group with the placebo group. The curve goes up slightly
each time one of the thousands of women in the study develops heart disease.

34. Rosner B. Fundamentals of Biostatistics. Thomson Learning & Brooks/Cole Publishing; 2006.

TYPES OF MEASUREMENT
23. Diagnostic Testing

Transcript
Correctly measuring the presence or absence of disease is critical to clinical research as well
as to clinical practice. Yet, diagnostic tests are imperfect, yielding false negatives and false
positives. In a medical study, it may be too costly, time consuming, or risky to administer the
gold standard diagnostic to all subjects. Therefore, a less accurate—but cheaper and faster—
test may be used. When reading the medical literature, it is important to consider the
accuracy of the diagnostic test used. You may also encounter medical papers whose sole
purpose is to evaluate the performance of diagnostic tests—by estimating their sensitivity and
specificity in a particular population compared with the gold standard test.³⁵

Use the link to view the publication.³⁶

The sensitivity of a test is its innate or intrinsic ability to detect disease. Sensitivity is defined
as the percentage of people with the disease of interest who have positive test results.
Sensitivity is calculated by dividing the true positives by the true positives plus the false
negatives. The specificity of a test is its innate or intrinsic ability to detect people who do not
have the disease of interest and is therefore defined as the percentage of people without the
disease who have negative test results. Specificity is calculated by dividing the true negatives
by the true negatives plus the false positives. The sensitivity and specificity of a test can vary
in different populations and during different stages of disease. For example, a questionnaire
designed to detect depression in patients with cancer may be a poor tool for detecting
depression in patients with Alzheimer’s disease.

Consider the following study. This study evaluated the performance of the Beck Depression
Inventory-Short Form questionnaire in diagnosing moderate and severe depression in people
in the hospital, compared with the diagnoses made by the gold standard clinical interview.³⁷ A
cutoff score of 13/14 correctly classified 29 of 31 depressed people as having depression and
119 of 124 people without depression as not having depression, for a sensitivity of 93.5% and
a specificity of 96%.

The positive predictive value, or PPV, is the probability that a person who tests positive
actually has the disease. The PPV is calculated as the true positives divided by the true
positives plus the false positives. In this example, the PPV is 29 out of 34 positive test results,
or 85.3%. Similarly, the negative predictive value, or NPV, is the probability that a person who
tests negative actually does not have the disease. NPV is calculated as the true negatives
divided by the true negatives plus the false negatives. In this example, the NPV is 119 out of
121, or 98.3%. The PPV and NPV depend on the sensitivity and specificity of the test as well
as on the prevalence of depression, which here is 31 out of 155, or 20%.

35. Furlanetto LM, Mendlowicz MV, Romildo Bueno J. The validity of the Beck Depression Inventory-Short Form as a screening and diagnostic instrument for moderate and severe depression in medical inpatients. J Affect Disord. 2005;86(1):87-91.
36. Furlanetto LM, Mendlowicz MV, Romildo Bueno J. The validity of the Beck Depression Inventory-Short Form as a screening and diagnostic instrument for moderate and severe depression in medical inpatients. J Affect Disord. 2005;86(1):87-91.
37. Furlanetto LM, Mendlowicz MV, Romildo Bueno J. The validity of the Beck Depression Inventory-Short Form as a screening and diagnostic instrument for moderate and severe depression in medical inpatients. J Affect Disord. 2005;86(1):87-91.

If the prevalence of depression in this population were twice as high, the Beck Depression
Inventory would have the same sensitivity and specificity at a cutoff of 13/14, since the test’s
innate ability to detect or rule out depression would not change. However, the PPV and NPV
would change. As the proportion of people with depression increases, the ratio of true
positives to false positives increases, thus increasing the PPV from 85.3% to 93.5%.
Correspondingly, the ratio of true negatives to false negatives decreases, thus decreasing the
NPV from 98.3% to 95.9%. In general, increasing disease prevalence increases the PPV of a
test and decreasing disease prevalence increases the NPV of a test. Testing in high-risk
groups—groups that are likely to have high disease prevalence—can increase the PPV for
the same diagnostic test.
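The dependence of predictive values on prevalence follows from Bayes’ theorem. Here is a Python sketch using the test’s sensitivity (93.5%) and specificity (96%); the exact percentages differ slightly from those in the text, which are computed from whole-person counts:

```python
def ppv(sens, spec, prev):
    """P(disease | positive test), by Bayes' theorem."""
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    """P(no disease | negative test)."""
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

sens, spec = 0.935, 0.96
for prev in (0.20, 0.40):
    print(f"prevalence {prev:.0%}: "
          f"PPV = {ppv(sens, spec, prev):.1%}, "
          f"NPV = {npv(sens, spec, prev):.1%}")
# PPV rises and NPV falls as prevalence increases.
```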

Test results often occur along a continuum, as with the Beck Depression Inventory-Short
Form Scale. Moving the cutoff point changes the test’s sensitivity, specificity, and positive and
negative predictive values. As seen here, choosing a more stringent cutoff decreases the
sensitivity of the test but increases the specificity.

Use the link to view the publication.

Select the “I” icon for more information.

More Information
Even if a diagnostic test has a high sensitivity and high specificity, it can yield an
alarming number of false positives if used for general screening in a low-risk
population. For example, say a population has an HIV infection rate of 1% and the
HIV antibody test has 99% specificity and 99% sensitivity. If 100,000 people from the
population are screened, 1,000 of them will carry the HIV virus and therefore 99,000
will not be infected. With a sensitivity of 99%, the test will correctly find about 990 out
of 1,000 true positives. But, a specificity rate of 99% applied to 99,000 people will
yield roughly 990 false positives. In other words, about half of those who test positive
won’t have the disease. Because false-positive results are highly traumatic, widespread
screening is unwarranted in this population.
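The arithmetic in this screening example is easy to verify:

```python
population = 100_000
infected = population // 100        # 1% prevalence -> 1,000 infected
uninfected = population - infected  # 99,000 not infected

sens = spec = 0.99
true_positives = round(sens * infected)           # 990 infected test positive
false_positives = round((1 - spec) * uninfected)  # ~990 uninfected also test positive

ppv = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(ppv, 2))  # 990 990 0.5
```

About half of the positive results are false, matching the text.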

TYPES OF MEASUREMENT
24. Progress Check

Transcript
Determine whether the statements are true or false.

Age is an example of a categorical variable. (False)
  Feedback: This statement is false. Age is a continuous variable.

The outcome of a coin toss is an example of a binary variable. (True)
  Feedback: This statement is true. A coin toss only has two outcomes (heads or tails), making it a binary variable.

The sensitivity of a diagnostic test depends on the prevalence of the disease being diagnosed. (False)
  Feedback: This statement is false. The sensitivity and specificity of a diagnostic test are intrinsic characteristics of the test, and do not depend on disease prevalence.

As disease prevalence increases, the positive predictive value of a diagnostic test decreases. (False)
  Feedback: This statement is false. As disease prevalence increases, the positive predictive value of a diagnostic test increases.

STATISTICAL INFERENCE
25. Objectives

Introduction
This section introduces the fundamentals of statistical inference. Statistical inference is the
process of making inferences about the population of interest from the results obtained from
the study sample.

Transcript
These are the objectives for this section.

Onscreen Text: Objectives


After you finish this section, you should be able to:

• Describe sampling variation


• Interpret a p-value
• Define type I and type II errors
• Differentiate between statistical and clinical significance
• Understand statistical power
• Interpret a confidence interval
• Describe the problem of multiple comparisons

STATISTICAL INFERENCE
26. Hypothesis Testing

Transcript
The goal of a hypothesis test is to evaluate whether associations seen in a study sample
reflect real associations in the target population or merely chance variation.

A statistic is a summary measure calculated from a sample. A parameter is a summary
measure from a target population. Usually we cannot measure the entire population, so we
infer something about the population from the sample statistic. This is the process of
statistical inference.³⁸

A statistic is any value that can be calculated from the sample data. For continuous variables,
this could be a mean or a difference in the means between two groups. For categorical
variables, this could be a proportion or the difference in proportions between two groups.
Other statistics that you will learn about include the odds ratio, risk ratio, and hazard ratio.

Statistics vary from sample to sample due to random chance. Imagine a population of
100,000 people that has an average IQ of 100. If you sample five random people from this
population, the average IQ of the sample might be 130 if you happen to pick a few geniuses.
If you take a different sample of five random people, however, the sample average might be
90. Another sample of five may yield a sample average of 110. This is the idea of sampling
variability. The field of statistics provides guidance on how to make conclusions in the face of
this chance variation.
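Sampling variability can be demonstrated with a short simulation (a hypothetical population with IQ scores drawn from a normal distribution with mean 100 and standard deviation 15):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible
population = [random.gauss(100, 15) for _ in range(100_000)]

# The population mean is close to 100 ...
print(round(sum(population) / len(population), 1))

# ... but the means of small samples of 5 bounce around it.
for _ in range(3):
    sample = random.sample(population, 5)
    print(round(sum(sample) / 5, 1))
```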

For example, in a case-control study of ischemic heart disease and antidepressant use, 23%
of 933 cases with heart disease had used antidepressants compared with only 16% of 5,516
controls.³⁹ This 7% difference could reflect a true association between antidepressant use
and heart disease in the larger target population, or it could be a fluke in this particular
sample. The question is: is 7% bigger or smaller than the expected sampling variability?

Use of antidepressants and recording of depression in cases and controls before diagnosis
of ischemic heart disease in cases

                                               Cases (n=933)    Controls (n=5,516)
  Code for depression but no antidepressants   41 (4%)          191 (3%)
  Any antidepressant drug ever                 217 (23%)        871 (16%)

Source: “Antidepressants as risk factor for ischemic heart disease: case-control study in
primary care,” Hippisley-Cox et al., BMJ, 2001.

Difference = 23% − 16% = 7%

Null Hypothesis: There is no association between antidepressant use and heart disease in
the target population.

38. Freedman D, Pisani R, Purves R. Statistics. 4th edition. W.W. Norton & Company; 2007.
39. Hippisley-Cox J, Pringle M, Hammersley V, et al. Antidepressants as risk factor for ischaemic heart disease: case-control study in primary care. BMJ. 2001;323(7314):666-9.

Statisticians try to answer this question with a formal hypothesis test:

First, assume that the 7% difference is just a fluke—and that there is no difference in
antidepressant use between heart disease patients and controls in the larger population. This
statement of no effect is called a null hypothesis. The null hypothesis is usually the opposite
of what the researcher is hoping to prove. Analogous to the legal notion of “innocent until
proven guilty” the null hypothesis is initially presumed to be true—and is only rejected if the
evidence against it is beyond a reasonable doubt.

Select the research hypotheses pictured here to turn them into null hypotheses.

Research (Alternative) Hypothesis                  Null Hypothesis

Venlafaxine has greater efficacy than SSRIs        Venlafaxine has the same efficacy as SSRIs
in treating depression.                            in treating depression.

Low IQ is a risk factor for bipolar disorder.      Low IQ and bipolar disorder are not associated.

Psychotherapy combined with antidepressants        Antidepressant drugs do not make a difference
is a better treatment for depression than          in psychotherapy treatment for depression.
psychotherapy alone.

The second step is to predict the sampling variability assuming the null hypothesis is true.
These predictions can be made by mathematical theory or by computer simulation. In
computer simulation, the computer takes repeated samples of the same size from the same
population, and the researcher can then observe the sampling variability.

Here are the results of a computer simulation of 1,000 different case-control studies with 933
cases and 5,516 controls; for each simulated study, the difference in the proportion of cases
and controls using antidepressants is calculated. Under the null hypothesis, this difference is
0%. This graph displays the frequency of different results from the simulated studies. The
difference between cases and controls is close to 0% for most of the studies, but with some
variation due to chance—which is called the standard error. Most values fall between –5%
and +5%. A value of 7% or higher is not predicted to occur once in 1,000 studies if the null
hypothesis is true. Recall that the result of the real case-control study was a difference value
of 7%.
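This kind of simulation is easy to sketch. The version below is a minimal illustration, not the study authors' code: it pools the two groups to get a common antidepressant-use rate of about 17% (1,088 users among 6,449 subjects, from the table above), then repeatedly redraws both groups under the null hypothesis:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible

n_cases, n_controls = 933, 5516
# Pooled use rate (~0.17), assumed equal in both groups under the null
p_null = (217 + 871) / (933 + 5516)

diffs = []
for _ in range(1000):
    # Simulate one case-control study under the null hypothesis
    cases_using = sum(random.random() < p_null for _ in range(n_cases))
    controls_using = sum(random.random() < p_null for _ in range(n_controls))
    diffs.append(cases_using / n_cases - controls_using / n_controls)

sd = statistics.stdev(diffs)                  # empirical standard error of the difference
extreme = sum(abs(d) >= 0.07 for d in diffs)  # simulated studies as extreme as the observed 7%
```

With this seed, the simulated differences cluster tightly around 0%, and none of the 1,000 simulated studies produces a difference as large as the observed 7%, mirroring the graph described above.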

The third step is to quantify how unlikely the observed result is if the null hypothesis is true.
This quantity is called a p-value, short for “probability value.” Here, the probability of getting a
value of 7% or higher by random chance is less than 1 in 10,000 (0.01%), which means it is
highly improbable.

The fourth step is to make a decision about whether or not to reject the null hypothesis. If the
p-value is small enough, reject the null hypothesis in favor of an alternative hypothesis. Here,
we reject the null hypothesis in favor of the alternative hypothesis that antidepressant use and
heart disease are associated.

Smaller studies are more susceptible to chance variability. If this study had sampled only 50
cases and 50 controls, the sampling variability would have been much higher—as shown in
this computer simulation. Assuming the null hypothesis is true, a difference of 7% or higher
would have arisen by chance about 170 out of 1,000 times. Thus, the probability of the
observed data under the null hypothesis is 17%, which is not improbable enough to warrant
rejecting the null hypothesis.

The p-value here will actually be 17% multiplied by two, or 34%, to make this a two-tailed
hypothesis test. A two-tailed test acknowledges that, under the null hypothesis, there is an
equal chance that controls will use antidepressants either more frequently or less frequently
than cases.

By convention, two-tailed p-values of less than 0.05 are often accepted as “statistically
significant” in the medical literature, but this is an arbitrary cut-off.[40,41]

A cut-off of p < .05 is not infallible; it allows that in 5 of 100 studies, a result will appear
significant just by chance. This error rate is the false-positive error rate, also called a type I
error, designated by the Greek symbol α.

The type I error rate is the chance of finding an association when no association actually
exists. Investigators make a type I error whenever they reject a null hypothesis that is
actually true. Before beginning a study, investigators must decide how low this error rate
should be (a value of α = 0.05 is usually chosen), because concluding that an effect exists
when it doesn’t may lead to harmful or unnecessary health recommendations or policy
decisions. At the completion of a study, a p-value is calculated from the data; if the p-value
is less than α, the convention is to reject the null hypothesis. Conversely, the type II error,
designated by the Greek symbol β, is the false-negative rate, or the chance of missing a
real association. Investigators make a type II error whenever they fail to reject a null
hypothesis that is false. Generally, researchers are willing to accept a higher false-negative
rate than false-positive rate.

True State of Null Hypothesis (H0)

Your Statistical Decision    H0 True             H0 False
Reject H0                    Type I Error (α)    Correct
Do not reject H0             Correct             Type II Error (β)

Because the type II error rate is usually around 20%, failure to reject the null hypothesis does
not prove the null hypothesis. Results that are not statistically significant should not be
interpreted as "evidence of no effect,” but rather as “no evidence of an effect.” Studies may
miss effects if the type II error rate is too high—these studies are said to have low statistical
power.

40
Moyé LA. Statistical Monitoring of Clinical Trials: Fundamentals for Investigators. Springer-Verlag New York;
2006.
41
Norman GR, Streiner DL. Biostatistics: The Bare Essentials. BC Decker Publishing; 2008.

STATISTICAL INFERENCE
27. Power and Sample Size

Transcript
Statistical power is the chance you have of concluding that there is an association between
exposure and disease if an association truly exists. Put another way, statistical power is the
sensitivity of your study to detect true associations.[42]

Studies have more power to detect an effect when the effect is large and when the sampling
variability is low. Increasing the p-value cutoff for statistical significance also increases power,
but at the cost of a higher type I error rate.

Sampling variability, measured as standard error, is a function of two components: the
variability of the outcome of interest and the size of the study sample.

Sampling variability is higher if the outcome being measured has a lot of inherent variability or
was measured imprecisely. One way to increase statistical power is to reduce variability due
to measurement error.

Sampling variability is lower with bigger sample sizes. A case-control study with 933 cases
and 5,516 controls has a much lower standard error than a similar study with only 50 cases
and 50 controls. Increasing sample size increases statistical power.

Understanding statistical power is critical to interpreting studies that find no association. A null
result could indicate a real lack of association in the target population or it could simply reflect
low power. Check the paper to see if power was estimated and reported. If no power is
reported, check to see if the sample size is small. Statisticians have developed formulas for
estimating statistical power for a given sample size and the sample size needed to achieve a
desired statistical power for different study designs.

This table shows you the sample size needed to detect a given effect size with 80% power at
a 0.05 significance level. Effect size here is the difference between two groups, presented as
standard deviations. For example, if you want to test whether or not there is a difference in
average cholesterol between subjects on a low-fat diet and subjects on a normal diet, you
would need 63 subjects per group to have 80% power to detect a between-group difference of
a half standard deviation of cholesterol. If the standard deviation of cholesterol is 40 points, a
half standard deviation is 20. Notice that to detect smaller and smaller differences, you need
larger and larger sample sizes—at some point, the detectable difference is so small, it
becomes clinically irrelevant. For example, one-tenth of a standard deviation of cholesterol is
only a 4-point difference; a difference this small is not going to influence clinical practice.
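The 63-per-group figure can be reproduced with the standard normal-approximation formula for sample size in a two-group comparison of means, n = 2(z₁₋α/₂ + z₁₋β)² / d², where d is the effect size in standard-deviation units. A sketch using only the Python standard library:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size per group for comparing two means.

    effect_size is the between-group difference in standard-deviation
    units (Cohen's d), using the common normal-approximation formula.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Half a standard deviation (e.g., 20 cholesterol points if the SD is 40):
print(n_per_group(0.5))   # 63 per group
# One-tenth of a standard deviation (only a 4-point difference):
print(n_per_group(0.1))   # 1570 per group -- tiny effects need huge samples
```

Note how quartering the effect size multiplies the required sample size by sixteen: detectable difference shrinks with the square root of n.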

If an association is shown to be statistically significant, your job as an interpreter of the
medical literature is not finished. Statistical significance does not tell you anything about the
size of the effect. A study with a huge sample size may have adequate power to detect trivial
differences between groups. You must consider the magnitude of the effect size before
deciding that the study results are of clinical importance.

Select the sample sizes pictured here to see how small of an effect you could detect.

42
Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.

Sample Size           Detectable Effect

1,600 per group       A 2.5% difference between the groups would be statistically
                      significant at a significance level of 0.05.
3,200 per group       A 1.25% difference between the groups would be statistically
                      significant at a significance level of 0.05.
6,400 per group       A 0.625% difference between the groups would be statistically
                      significant at a significance level of 0.05.

More Information
Clinical significance is not always clear-cut. For example, in a meta-analysis of
venlafaxine versus other antidepressants for the treatment of depression, venlafaxine
43
had a pooled 14% higher remittance rate than other antidepressants. This was
statistically significant with a 95% confidence interval of 7% to 22%. Is this 14%
difference clinically relevant? A psychiatrist would have to treat over seven additional
patients with venlafaxine just to relieve one additional patient of depression. Thus, the
answer may depend on the incremental cost of venlafaxine compared with other
drugs.

Finally, just because a statistically and clinically significant association is found, this does not
prove cause and effect. Even if exposure does not cause disease, a statistical association
can arise if there are biases in study design and data collection, if disease caused exposure,
or through the effect of an unmeasured confounding variable.

43
Smith D, Dempster C, Glanville J, Freemantle N, Anderson I. Efficacy and tolerability of venlafaxine compared
with selective serotonin reuptake inhibitors and other antidepressants: a meta-analysis. Br J Psychiatry.
2002;180:396-404.

STATISTICAL INFERENCE
28. Confidence Intervals

Transcript [44,45,46]
A major goal of inferential statistics is to make an estimate of a population parameter from a
sample statistic. For example, we might want to estimate the difference in response rates
between two antidepressants. Confidence intervals give a plausible range of values for a
population parameter and also give information about the precision of an estimate. When
sampling variability is high, the confidence interval will be wide to reflect the uncertainty of the
observation.

The following computer simulation illustrates the theoretical meaning of a confidence interval.
Imagine that the true population value for some parameter is 10. If we could take 50 different
samples of the same size from a population and calculate 95% confidence intervals for each,
the confidence intervals will include the true population value of 10 about 95% of the time.
Here you can see the results of 50 mock runs of the study. Only three confidence intervals out
of 50, or about 6%, missed the mark. Besides 95% confidence intervals, we can also produce
90%, 99%, and 99.9% confidence intervals depending on how confident we want to be that
we captured the true population parameter.
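A simulation like this is easy to reproduce. The sketch below draws 50 samples from a hypothetical normal population with mean 10 and counts how many 95% confidence intervals capture that true value (the population SD and sample size are invented for illustration):

```python
import random
import statistics

random.seed(1)  # reproducible run

TRUE_MEAN, TRUE_SD, N = 10, 3, 50  # hypothetical population; sample size per study

covered = 0
for _ in range(50):  # 50 mock runs of the same study
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5       # standard error of the mean
    lo, hi = mean - 1.96 * se, mean + 1.96 * se    # 95% confidence interval
    covered += lo <= TRUE_MEAN <= hi

print(covered)  # close to 95% of the 50 intervals contain the true mean of 10
```

Each run misses or captures the true value by chance alone; over many runs, about 1 interval in 20 will miss.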

A confidence interval is calculated as the observed sample statistic plus or minus a multiple of
the standard error. Higher standard errors give wider confidence intervals, reflecting
imprecision. To increase the certainty of capturing the true population parameter, we multiply
the standard error by a larger factor. Typically, we use a factor of two, which comes from the
fact that sample statistics tend to be normally distributed and 95% of normally distributed
observations fall within two standard deviations of their mean.

In the case-control study of antidepressant use and ischemic heart disease, the observed
difference between cases and controls was 7%, with a standard error of about 2%, so the
95% confidence interval would be 3% to 11%. The 99% confidence interval requires a larger
confidence factor of about 2.58, widening the interval to roughly 1.8% to 12.2%.

Confidence intervals can also be used to perform hypothesis tests. For example, in the
previous study the null hypothesis is that antidepressant use and heart disease are unrelated
(i.e., the difference between cases and controls is 0%). However, the 95% confidence interval
is 3% to 11%, which excludes 0%. Thus, we can be 95% confident that the true difference is
not 0%. Whenever this is true, we automatically know that the p-value for the hypothesis test
will be less than 0.05, and we can reject the null hypothesis of no association at the 5%
significance level. Similarly, the 99% confidence interval of roughly 1.8% to 12.2% excludes
0%; thus, we can be 99% confident that the true value is not 0%. In addition, we know that
the p-value for the hypothesis test will correspondingly be less than 0.01, and we can reject
the null hypothesis of no association at the 1% significance level.
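The interval arithmetic can be sketched directly, deriving each multiplier from the standard normal distribution rather than hard-coding a rounded factor (a minimal illustration, not tied to any particular statistics package):

```python
from statistics import NormalDist

estimate, se = 0.07, 0.02  # observed difference and its standard error

def conf_int(estimate: float, se: float, level: float) -> tuple:
    """Normal-approximation confidence interval: estimate +/- z * se."""
    z = NormalDist().inv_cdf((1 + level) / 2)  # 1.96 for 95%, 2.58 for 99%
    return estimate - z * se, estimate + z * se

lo95, hi95 = conf_int(estimate, se, 0.95)  # roughly 3% to 11%
lo99, hi99 = conf_int(estimate, se, 0.99)  # wider: roughly 2% to 12%
```

Because both intervals exclude 0, the corresponding two-tailed p-value must be below both 0.05 and 0.01.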

44
Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.
45
Norman GR, Streiner DL. Biostatistics: The Bare Essentials. BC Decker Publishing; 2008.
46
Motulsky H. Intuitive Biostatistics. Oxford University Press; 1995.

When comparing differences in means between two groups, the null value is zero.
Sometimes, however, fractions or ratios are reported rather than differences. For example, it
is common to report the risk ratio, which is calculated as the risk of disease in an exposed
group divided by the risk of disease in an unexposed group. A value greater than one means
exposure increases disease risk; a value less than one means exposure decreases disease
risk; and a value of 1.0 is the null value, indicating equal risk in the two groups. An odds ratio
approximates the risk ratio when the outcome is rare.
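From the counts in the study's table, the crude (unadjusted) odds ratio can be computed directly; it comes out slightly below the paper's reported 1.67, which presumably reflects adjustment for other factors:

```python
# Counts from the antidepressant/heart-disease case-control study
cases_exposed, cases_total = 217, 933
controls_exposed, controls_total = 871, 5516

odds_cases = cases_exposed / (cases_total - cases_exposed)              # 217/716
odds_controls = controls_exposed / (controls_total - controls_exposed)  # 871/4645
odds_ratio = odds_cases / odds_controls

print(round(odds_ratio, 2))  # 1.62 -- close to the published (adjusted) 1.67
```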

In the case-control study of antidepressant use and heart disease, researchers report an odds
ratio of 1.67 with a 95% confidence interval of 1.41 to 1.99, which excludes the null value of
1.0, indicating statistical significance.[47] The magnitude of the effect is also estimated:
antidepressant use increases the risk of heart disease by 41% to 99%.

Select the null hypotheses pictured here to see the corresponding null values.

Null Hypothesis                               Null Value

Venlafaxine has the same efficacy as          0% difference in response between groups treated
SSRIs in treating depression.                 with either venlafaxine or SSRIs.

Low IQ and bipolar disorder are not           (Percent of people with low IQ who develop bipolar
associated.                                   disorder) divided by (percent of people with high IQ
                                              who develop bipolar disorder) = 1.0

Antidepressants do not make a difference      There is a 0% difference in improvement of depression
when combined with psychotherapy to           between the psychotherapy-plus-antidepressant group
treat depression.                             and the psychotherapy-only group.

47
Hippisley-Cox J, Pringle M, Hammersley V et al. Antidepressants as risk factor for ischaemic heart disease: case-
control study in primary care. BMJ. 2001; 323(7314):666-9.

STATISTICAL INFERENCE
29. Multiple Comparisons

Transcript [48,49]
In 1980, a group of researchers randomized 1,073 heart disease patients into two groups—
call them A and B—but gave no treatment to either group.[50] Not surprisingly, there was no
difference in survival. Then they divided the patients into 18 subgroups based on prognostic
factors. In one subgroup of 397 patients—those with three-vessel disease and an abnormal
left ventricular contraction—survival of those in group A was significantly better than survival
of those in group B, with a p-value less than 0.025. How could this be?

If subjects are divided in enough different ways, a subgroup can be found where survival
rates for groups A and B differ to a statistically significant extent. This example illustrates a
larger problem in statistics: if too many hypotheses are tested, eventually a statistically
significant result will arise just by chance. Each individual hypothesis test has a 5% chance of
giving a false positive result, so the cumulative type I error rate will be higher than 5%. We
can quantify this increase for independent tests—for example, 18 independent tests give an
overall type I error rate of 60% (i.e., there is a 60% chance that at least one result that is
calculated to be statistically significant is actually a false positive).
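The arithmetic behind this cumulative error rate is simple. The sketch below also shows one common correction, the Bonferroni adjustment, an example of the more stringent p-value cut-offs the text mentions (Bonferroni is not named in the course itself):

```python
alpha = 0.05
n_tests = 18

# Chance of at least one false positive across 18 independent tests
fwer = 1 - (1 - alpha) ** n_tests
print(round(fwer, 2))  # 0.6 -- a 60% family-wise error rate

# Bonferroni correction: divide alpha by the number of tests
bonferroni_alpha = alpha / n_tests  # ~0.0028 per-test cut-off
fwer_corrected = 1 - (1 - bonferroni_alpha) ** n_tests
print(round(fwer_corrected, 3))  # 0.049 -- back under the nominal 0.05
```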

To minimize this increase, investigators should establish analysis plans prior to data
collection, in which they specify a limited number of hypotheses to be tested, including
subgroup analyses. Statisticians may also correct for multiple comparisons by requiring more
stringent p-value cut-offs to achieve statistical significance. When evaluating a medical study,
it’s important to consider the number of hypotheses tested and whether or not corrections
were made for multiple comparisons.

Multiple comparisons appear in many forms. For example, in cohort studies and randomized
clinical trials there may be multiple analyses of the data, and each analysis adds to the total
number of hypothesis tests conducted on the data.

Sources of Multiple Comparisons

• Evaluation of multiple outcomes
• Comparison of more than two treatment or exposure groups
• Repeated measures over time
• Multiple analyses of the data during sequential interim monitoring

48
Norman GR, Streiner DL. Biostatistics: The Bare Essentials. BC Decker Publishing; 2008.
49
Motulsky H. Intuitive Biostatistics. Oxford University Press; 1995.
50
Lee KL, McNeer JF, Starmer CF, Harris PJ, Rosati RA. Clinical judgment and statistics. Lessons from a simulated
randomized trial in coronary artery disease. Circulation. 1980; 61(3): 508-515.

STATISTICAL INFERENCE
30. Progress Check

Transcript
A risk ratio is a statistic that is formed by dividing the proportion of exposed subjects by the
proportion of unexposed subjects. The risk ratio gives the increased (or decreased) risk of a
specified outcome due to exposure. The null value of the risk ratio is 1.0—indicating equal
risk in both groups. Use your understanding of statistical and clinical significance to decide
which risk ratio is statistically significant, clinically significant, both, or neither.

Risk Ratio (Exposed vs. Unexposed)
and 95% confidence interval           Answer

1.02 (1.01, 1.03)                     Statistically significant (p < 0.05), but not clinically
                                      significant.

5.0 (0.08, 9.5)                       Clinically significant, but not statistically significant.
                                      More Information: An estimated treatment difference that
                                      is not statistically significant, but is large enough to
                                      indicate possible clinical significance; indicates that the
                                      sample size may be too small.

0.99 (0.89, 1.09)                     Neither statistically nor clinically significant.

0.20 (0.05, 1.05)                     Clinically significant, but not statistically significant.
                                      More Information: An estimated treatment difference that
                                      is not statistically significant, but is large enough to
                                      indicate possible clinical significance; indicates that the
                                      sample size may be too small.

STATISTICAL ANALYSES
31. Objectives

Introduction
This section introduces common statistical tests used in medical studies and their appropriate
use and interpretation. This section also presents material on more advanced statistical tests.

Transcript
These are the objectives for this section.
Onscreen Text: Objectives
After you finish this section, you should be able to:

• Describe the intention-to-treat principle


• Explain last observation carried forward
• Calculate odds ratios and risk ratios
• Interpret odds ratios, risk ratios, and hazard ratios
• Understand the results of simple statistical tests
• Recognize the names of more advanced statistical tests
• Identify the appropriate statistical test for a given study design and type of data
• Understand common pitfalls in medical statistics

STATISTICAL ANALYSES
32. Intention-to-Treat Analysis

Transcript
Investigators should make every effort to obtain complete follow-up data on their subjects;
however, randomized trial data are often incomplete and imperfect. Intention-to-treat analysis
maintains the balance that was achieved through randomization by comparing outcomes
according to how participants were randomized, even if they refused or discontinued the
intervention, used it incorrectly, or otherwise violated study protocol.[51]

Intention-to-treat: Participants will be counted in the intervention group to which they
were originally assigned, even if they
• Refused the intervention after randomization
• Discontinued the intervention during the study
• Followed the intervention incorrectly
• Violated study protocol

For example, in a randomized, placebo-controlled trial of divalproex for treatment of bipolar
depression, of the 25 patients who were randomized, only 80% completed at least four weeks
of the study, and only 48% completed the full eight weeks.[52] The investigators analyzed the
data using intention to treat, including data from all participants in all comparisons, according
to their original assignment to treatment or placebo.


Intention to treat has two strong rationales. First, the goal of randomization is to balance
potential confounding factors in the study groups. This balance will be lost if the data are
analyzed according to how participants self-selected rather than how they were randomized.
Second, intention-to-treat analysis simulates real life, where patients often don’t adhere
perfectly to treatment or may discontinue treatment altogether. Intention-to-treat analysis
evaluates the real world effectiveness of a drug rather than its efficacy when taken optimally.

Intention-to-treat analysis adds variability that will dilute observable effects, but will not create
spurious associations. Thus, any effect that is observed is likely a conservative estimate of
the true effect. This concept is best illustrated with a simple mathematical example. Imagine a
randomized trial of a drug treatment for depression where patients given a placebo have a
25% chance of recovering from depression during the one-year study, but patients given the
study drug have a 50% or two-fold higher chance of recovering in the same time period. The
true risk ratio is 50% divided by 25% or 2.0.

If you study 100 treated and 100 placebo patients who follow protocol exactly, then around 50
drug patients and 25 placebo patients will recover, for an estimated risk ratio of 2.0.

However, if early in the study, 25 placebo subjects start taking the study drug and 25 treated
patients stop taking the study drug, then there will be more than 25 recoveries in the placebo
group and fewer than 50 in the treated group.

Thus, the observed risk ratio will be closer to the null value of 1.0.
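The dilution in this hypothetical trial can be worked through in a few lines (the recovery rates and crossover counts are exactly as in the example above):

```python
# Hypothetical trial from the text: 25% recovery on placebo, 50% on drug
P_PLACEBO, P_DRUG = 0.25, 0.50

# Perfect adherence, 100 patients per arm
rr_ideal = P_DRUG / P_PLACEBO  # true risk ratio of 2.0

# Crossover: 25 placebo patients start the drug; 25 drug patients stop.
# Intention-to-treat still analyzes everyone by original assignment.
drug_arm_recoveries = 75 * P_DRUG + 25 * P_PLACEBO       # 43.75 of 100
placebo_arm_recoveries = 75 * P_PLACEBO + 25 * P_DRUG    # 31.25 of 100
rr_itt = drug_arm_recoveries / placebo_arm_recoveries

print(rr_itt)  # 1.4 -- diluted toward the null value of 1.0
```

The effect is attenuated, but no spurious association is created, which is why intention-to-treat estimates are considered conservative.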

51 Weiss NS. Clinical Epidemiology: The Study of the Outcome of Illness. 3rd edition. Oxford University Press; 2006.
52 Davis LL, Bartolucci A, Petty F. Divalproex in the treatment of bipolar depression: a placebo-controlled study. J
Affect Disord. 2005; 85(3):259-66.

Alternative strategies to intention-to-treat include per-protocol and treatment-received. A per-
protocol analysis excludes all noncompliant participants from the final analysis, including
anyone who switched groups, withdrew, missed measurements, or otherwise violated study
protocol. This approach could bias the results, making one treatment appear to be better
simply because one group was more adherent. It also does not reflect real life, where patients
are often non-adherent. In the randomized trial of psychotherapy alone versus psychotherapy
plus antidepressants, the investigators analyzed data according to a per-protocol analysis,
excluding the 16 participants who withdrew from the therapy plus drug group immediately
after randomization. This is not the optimal analysis since people who would refuse
antidepressant drugs may be different from people who wouldn’t, and omitting these patients
only from the psychotherapy plus drug arm creates an imbalance.

The treatment-received approach analyzes all participants according to the treatment they
actually received, regardless of what treatment they were originally assigned. For example, if
a participant assigned to the therapy-only group decides to take antidepressants, they will be
counted in the therapy plus drug group. This approach tries to get at the efficacy of a
treatment when actually taken, but at the cost of turning a randomized trial into an
observational study.

STATISTICAL ANALYSES
33. Predictors and Outcomes

Transcript
Dividing variables into predictors and outcomes provides a useful framework for choosing the
appropriate statistical test for your data. The form of the outcome variable—whether it is
categorical, continuous, or time-to-event data—guides the choice of statistical test.[53]

Type of Predictor Variable(s)     Type of Outcome Variable    Statistical Test or Measure of Association

Binary (two groups)               Continuous                  Two-sample t-test
Categorical (>two groups)         Continuous                  ANOVA
Continuous                        Continuous                  Simple linear regression
Multivariate (categorical         Continuous                  Multiple linear regression, including
and continuous)                                               Analysis of Covariance (ANCOVA)
Categorical                       Categorical                 Z-test, chi-square (χ²) test,
                                                              Fisher’s exact test
Binary                            Binary                      Odds ratio, risk ratio
Multivariate                      Binary                      Logistic regression
Categorical                       Time-to-event               Kaplan-Meier curve/log-rank test
Multivariate                      Time-to-event               Cox proportional hazards regression,
                                                              hazard ratio

The terms predictor and outcome only imply causality and order if the data come from a
randomized trial. Otherwise, exposure and disease can be used interchangeably as the
predictor and outcome for statistical analyses. For example, when comparing antidepressant
use between heart disease cases and controls, case/control status is the predictor and
antidepressant use is the outcome, even though we are interested in evaluating causality in
the other direction. Similarly, in a cross-sectional study of atherosclerosis and late-life
depression, the presence or absence of depression can be treated as the predictor and the
level of atherosclerosis as the outcome in the statistical analysis.

It is important to consider possible confounders of the association between a main predictor
and main outcome variable. Multivariate regression techniques control for confounding
variables (sometimes called covariates); the effect of the risk factor on disease is corrected to
account for the effect of confounders by including them in the statistical model. Statistical
adjustment by no means eliminates the problem of confounding—residual confounding by
unknown or unmeasured confounders will always remain a potential problem.

Examples of Multivariate Regression


• Multiple linear regression
• Logistic regression
• Cox-proportional hazards regression

53
Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.

STATISTICAL ANALYSES
34. Analyses for Continuous Outcomes

Transcript
A two-sample t-test compares the mean value of the outcome between two groups, such as
treatment and placebo. The groups can be thought of as a binary predictor variable. The null
hypothesis is that there is no difference in the means of the two groups. A t-test is just a ratio
of the observed difference in means to the sampling variability (or standard error) of the
difference in means. If the ratio is high enough, this suggests that the observed difference
isn’t just due to chance variation.[54]

For example, in a study of psychotherapy versus combined therapy, the outcome variable
was a score on the Hamilton Rating Scale for Depression, which the investigators treated as
a continuous variable. Here you can see the mean values in the psychotherapy versus
combined therapy groups. Looking just at week 24, the mean score for the psychotherapy
group was 11.35; and for the combined therapy group was 9.53, for a difference of 1.82
points. The standard error is about 1—as pictured here—so the t-statistic is a little shy of 2,
which corresponds to a p-value of .083.

Alternatively, the investigators could have reported a confidence interval for the difference in
two means rather than the p-value from a t-test. Here the 95% confidence interval would just
cross 0.
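The week-24 comparison can be sketched with the rounded figures quoted above (the paper's exact p-value of .083 comes from the t distribution with the study's actual degrees of freedom, which this illustration ignores):

```python
# Week-24 Hamilton scores from the psychotherapy trial described above
mean_psychotherapy, mean_combined, se = 11.35, 9.53, 1.0

diff = mean_psychotherapy - mean_combined  # 1.82 points
t = diff / se                              # a little shy of 2

# The matching 95% confidence interval just crosses zero
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(round(lo, 2), round(hi, 2))  # -0.14 3.78
```

Because the interval includes 0, the difference is not statistically significant at the 5% level, consistent with the reported p-value of .083.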

The Analysis of Variance test (known as ANOVA) compares means across two or more
groups (a categorical predictor). ANOVA partitions the variability in the outcome into two
sources: between-group variability and within-group variability. If the variability between
groups surpasses the background variability within groups, this is evidence against the null
hypothesis of no difference between the groups. For example, in the cross-sectional study of
atherosclerosis and late-life depression, investigators divided participants into three groups:
normal, subthreshold depression, and depression. They compared ages between the three
groups and found a significant difference between the groups. A significant ANOVA test
indicates that at least two groups differ, but does not pin down which groups differ. Here, it
appears that the normal group is on average two years younger than the subthreshold
depression and depression groups.

54
Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.

For the remainder of this section, we will introduce you to more advanced statistical
techniques that you may encounter in the medical literature.

Analysis of Covariance, or ANCOVA, is like ANOVA, except it adjusts the outcome means by
potential confounders (covariates). For example, returning to table 1 of the atherosclerosis
and late-life depression paper, the authors compared mean values of several variables
between the three groups, using ANCOVA to adjust the means for age and sex, two
potentially strong confounders. After adjustment for age and sex, mean pack-years of
smoking was similar between the three groups.

ANCOVA is a specific type of linear regression; linear regression is the multivariate technique
for analyzing continuous outcome data.

Repeated measures occur when several outcome measurements are taken on the same
person over time, as might occur in a cohort study or randomized trial. For example, in the
placebo-controlled trial of divalproex to treat bipolar depression, researchers measured
depression on the Hamilton Rating Scale at baseline and every week for eight weeks
thereafter.

With many serial measurements, participants who miss measurements or withdraw from the
study will have missing data points. For example, in this trial of divalproex, fewer than half of
the participants completed all eight follow-up measurements. Intention-to-treat analysis
requires that these missing data be filled in, such that all participants can be included in the
statistical comparisons. A common technique for resolving missing data is to carry the last
observation forward wherever data are missing, as shown here.
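Last observation carried forward is straightforward to implement. Here is a minimal sketch for a hypothetical participant who withdrew after week 4 (the weekly scores are invented for illustration):

```python
def locf(scores):
    """Fill missing values (None) by carrying the last observation forward."""
    filled, last = [], None
    for s in scores:
        last = s if s is not None else last  # keep previous value when missing
        filled.append(last)
    return filled

# Hypothetical Hamilton scores; the participant withdrew after week 4
weekly_hamd = [24, 21, 19, 18, None, None, None, None]
print(locf(weekly_hamd))  # [24, 21, 19, 18, 18, 18, 18, 18]
```

Note that LOCF assumes a participant's last measured value would have held steady, an assumption that can bias results and that modern analyses often replace with likelihood-based or multiple-imputation methods.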

The simplest method for analyzing repeated measures data is a paired t-test—which
accommodates data from two time points. The null hypothesis is that the average change in
the outcome from time 1 to time 2 is zero. A change score is calculated for each person, and
this is averaged over the study sample—to see if the average change is significantly different
from 0. When there are more than two measurements over time, either repeated-measures
ANOVA or multivariate ANOVA (or MANOVA) may be used. MANOVA is just an extension of
the paired t-test, but considers more than one change score simultaneously. Repeated-
measures ANOVA partitions out at an additional source of variation: variation over time. A
significant result in either of these tests indicates changes over time. For example, in the
divalproex trial, repeated-measures ANOVA reveals a significant change over time within the
divalproex group and a significant difference over time in the placebo vs. divalproex group.
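The change-score logic behind the paired t-test can be sketched in a few lines; the baseline and week-8 scores below are hypothetical, not the trial's data:

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t = mean(change) / (sd(change) / sqrt(n)); null: mean change is 0."""
    changes = [a - b for b, a in zip(before, after)]
    n = len(changes)
    standard_error = statistics.stdev(changes) / math.sqrt(n)
    return statistics.mean(changes) / standard_error, n - 1  # (t, degrees of freedom)

# Hypothetical Hamilton depression scores at baseline and week 8:
baseline = [24, 28, 21, 26, 23, 27]
week8 = [18, 25, 20, 19, 17, 22]
t, df = paired_t_statistic(baseline, week8)
# Scores fell on average, so t is large and negative (about -5.08 with 5 df)
```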

Underlying these tests for continuous outcome variables is a hidden assumption that the
outcome variable is normally distributed. Fortunately, violations of this assumption are
generally only worrisome with small samples. If the sample is small and the outcome variable
is highly skewed, the t-test and ANOVA tests are invalid. In these cases, it is better to use an
alternate test that compares the medians rather than the means between groups. These
alternate tests are part of a larger collection of tests, called distribution-free or non-parametric,
that do not make distributional assumptions or estimate population parameters. Non-
parametric tests produce p-values, but not confidence intervals.

Onscreen Text:
Statistical Test Non-Parametric Equivalent
Two-sample t-test Mann-Whitney U test
ANOVA Kruskal-Wallis test
Paired t-test Wilcoxon signed-rank sum test

STATISTICAL ANALYSES
35. Analyses for Categorical Outcomes

Transcript
When the outcome variable is binary (yes/no), the difference in proportions test compares the
proportion of “yes” outcomes between two groups. The null hypothesis is that there is no
difference in proportions between the two groups. The ratio of the observed difference in
proportions to the standard error is called a Z-statistic. If the ratio is high enough, this
suggests that the observed difference isn't just due to chance variation.[55]

For example, in a case-control study of ischemic heart disease and antidepressant use, 23%
of cases with heart disease had used antidepressants compared with only 16% of controls,
which is a difference of 7%. The standard error is about 2%—as pictured here—so the Z-
value is 3.5, which corresponds to a highly significant p-value. Alternatively, the investigators
could have reported a confidence interval for the difference rather than the p-value from a Z-
test; here the 95% confidence interval does not cross zero.
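The Z-statistic arithmetic can be sketched directly from the rounded figures quoted above (a 7% difference with a standard error of about 2%; the exact values depend on the study's full counts):

```python
# Sketch of the difference-in-proportions Z-test, using the rounded
# figures quoted in the transcript above.
difference = 0.23 - 0.16        # antidepressant use: cases minus controls
standard_error = 0.02           # rounded value quoted in the transcript
z = difference / standard_error  # about 3.5, a highly significant Z-value

# A 95% confidence interval for the difference: estimate +/- 1.96 * SE
ci_95 = (difference - 1.96 * standard_error,
         difference + 1.96 * standard_error)
# The lower bound is above zero, so the interval does not cross zero.
```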

A chi-square test is a generalization of the difference in proportions test and can be used to
compare the difference in proportions across more than two groups, for an outcome with
more than two levels. For example, in a cross-sectional study of atherosclerosis and late-life
depression, the normal, subthreshold depression, and depression groups were cross-
classified into low, medium, and high levels of coronary calcification. These counts can be
translated into percentages for ease of comparison. If atherosclerosis and depression status
are unrelated, then the proportion of subjects who have low, medium, and high calcification
should be similar across all three groups. It appears that the group with depressive disorders
has a disproportionately high number of people with high calcification—44%, compared with
27% and 28% in the other two groups.

The chi-square test evaluates whether this shift in proportions is real or simply an artifact of
chance. Under the null hypothesis, the proportions should be equally distributed in the three
groups—which translates to the expected cell counts shown here. The chi-square statistic is a
function of difference in observed and expected counts. Here the p-value is 0.096, which is
not quite enough evidence to reject the null hypothesis.

Select the more information icon if you would like to see more mathematical details about
calculating the chi-square statistic.

More Information
The chi-square statistic is the sum, over all cells, of the squared difference
between the observed and expected counts, divided by the expected count:

χ²(4 df) = Σ over the 9 cells of (observed − expected)² / expected
= (865 − 856)²/856 + (20 − 21)²/21 + (9 − 17)²/17 + …

The 4 degrees of freedom come from (3 rows − 1) × (3 columns − 1).

A significant chi-square test indicates that at least two groups differ, but it does not pin down
which groups differ.

[55] Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.

When the sample size in any of the cells in the contingency table is less than five, then the
assumptions of the chi-square test may be violated and a Fisher’s exact test should be used
instead.
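The observed-versus-expected arithmetic behind the chi-square test can be sketched in a few lines; the 2×2 table below is hypothetical, not taken from the study discussed above:

```python
# Sketch of the chi-square computation for a contingency table.
# The 2x2 counts below are hypothetical.
def chi_square_statistic(observed):
    """Chi-square statistic and degrees of freedom for a table of counts.

    Expected counts are derived from the row and column totals under the
    null hypothesis of no association between rows and columns.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

statistic, df = chi_square_statistic([[10, 20], [20, 10]])
# All four expected counts are 15, so the statistic is 4 * (5**2 / 15),
# about 6.67 on 1 degree of freedom.
```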

Whenever you have a binary outcome measure, you can also calculate a relative risk. The
relative risk gives an estimate of the relative increase or decrease in risk of an outcome given
a particular exposure. There are several measures of relative risk, including the risk ratio, the
odds ratio, and the hazard ratio.

Risk is the probability, or chance, of developing an outcome over a specified time period. A
risk ratio is a ratio of risks for two groups. In a cohort study or randomized trial, the risk ratio is
the proportion of exposed or treated subjects who develop the outcome divided by the
proportion of unexposed or control subjects who develop the outcome. A value greater than
one means exposure increases risk; a value less than one means exposure decreases risk;
and a value of 1.0 indicates equal risk in the two groups. For example, in the randomized
placebo-controlled trial of divalproex to treat bipolar depression, 6 out of 13 patients in the
treatment group had remission of depression compared with only 3 out of 12 of the placebo
group. The resulting risk ratio is 1.84—meaning that the divalproex group has an 84% higher
chance of remission than the placebo group. The confidence intervals are wide and cross 1.0,
however, reflecting the small sample size.
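The risk ratio arithmetic from the trial counts quoted above takes only a few lines (the confidence interval calculation, which needs a log transformation, is omitted from this sketch):

```python
# Risk ratio from the divalproex trial counts quoted above.
risk_treated = 6 / 13    # proportion remitting on divalproex
risk_placebo = 3 / 12    # proportion remitting on placebo
risk_ratio = risk_treated / risk_placebo
# About 1.84: an 84% higher chance of remission in the divalproex group.
```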

In case-control studies, you cannot directly calculate a risk ratio, since case-control studies do
not yield estimates of disease risk. The odds ratio (OR) can be calculated without information
about the absolute risk of disease, however. The OR is an approximation of the risk ratio.
The odds of an event is the chance of it happening divided by the chance of it not
happening. For example, if the probability of an event occurring is one-third (1 in 3), the
odds of it happening are 1 to 2. The concept of odds is most familiar from gambling; for
example, if a horse has 9-to-1 odds against, it has a 1 in 10 chance of winning.

Select the probabilities pictured here to see the corresponding odds.

Probability Odds
0.25 1:3
0.50 1:1
0.75 3:1
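The conversion in the table above can be sketched as:

```python
# Converting a probability to odds.
def probability_to_odds(p):
    """Odds = chance of happening / chance of not happening."""
    return p / (1 - p)

odds_low = probability_to_odds(0.25)   # 1/3, i.e., odds of 1:3
odds_mid = probability_to_odds(0.50)   # 1.0, i.e., odds of 1:1
odds_high = probability_to_odds(0.75)  # 3.0, i.e., odds of 3:1
```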

In a case-control study, the OR is calculated as the ratio of the odds of exposure among
cases to the odds of exposure among controls. Fortunately, this turns out to be
mathematically equivalent to the ratio of the odds of disease among the exposed to the odds
of disease among the unexposed—which is the causal direction we care about. For example,
in the case-control study of ischemic heart disease and antidepressant drugs, 217 out of 933
cases had used antidepressants versus 871 out of 5,516 controls. The OR is then 217 to 716
divided by 871 to 4,645. An OR of 1.62 indicates a 62% increase in the odds of heart disease
with the use of antidepressant drugs.
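The odds ratio arithmetic from the counts quoted above can be sketched as:

```python
# Odds ratio from the case-control counts quoted above:
# 217 of 933 cases used antidepressants (716 did not);
# 871 of 5,516 controls used antidepressants (4,645 did not).
odds_cases = 217 / 716
odds_controls = 871 / 4645
odds_ratio = odds_cases / odds_controls
# Rounds to 1.62: a 62% increase in the odds of heart disease.
```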

As long as the outcome is rare, as in rare diseases, the OR is a good approximation of the
risk ratio. When the outcome is more common, however, the OR is inflated compared with the
risk ratio. Protective factors will artificially appear more protective, and risk factors will
artificially appear more risky. For example, in the divalproex randomized trial, remission from
depression occurred in more than one-third of the sample. The risk ratio for remission is 1.84,
but the odds ratio is 2.6. If you’re reading a paper that reports ORs and the outcome is
common, keep this distortion effect in mind.
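The distortion is easy to see by computing both measures from the same trial counts:

```python
# With a common outcome, the odds ratio overstates the risk ratio.
# Divalproex trial counts: 6 of 13 remissions vs. 3 of 12.
risk_ratio = (6 / 13) / (3 / 12)   # about 1.84
odds_ratio = (6 / 7) / (3 / 9)     # about 2.6, inflated relative to the RR
```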

Cohort studies may also report ORs, because the multivariate technique for analyzing binary
outcome data (called logistic regression) yields odds ratios, not risk ratios. Logistic regression
gives confounder-adjusted ORs and ORs for categorical and continuous exposures.

STATISTICAL ANALYSES
36. Analyses for Time-to-Event Outcome

Transcript
The goal of survival analysis is to describe and compare how events—such as disease onset,
disease remission, or death—happen over time. The Kaplan-Meier curve graphically displays
this survival function. The height of the curve at a given time point is the probability of
surviving event-free past that time point. Take a simple example with only five subjects.
Subject E dies at four months; he represents one-fifth of the subjects at risk, so the curve
drops 20%. Subject A is censored (or removed from the risk pool) at six months, leaving just
three subjects at-risk. Subject C dies at seven months; he represents one-third of the risk
pool, so the curve drops by one-third of 80%, or 27%, to 53%. Subjects B and D survive the
whole year, and are censored when the study ends. The probability of surviving the entire
year, taking into account censoring, is 53%.[56]
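The five-subject example above can be sketched as a minimal product-limit (Kaplan-Meier) calculation:

```python
# Sketch of the Kaplan-Meier product-limit estimate for the five-subject
# example above. Each subject is a (time, died) pair; died is False for
# censored subjects. At tied times, deaths are processed before censorings.
def kaplan_meier(subjects):
    """Estimated probability of surviving past the last event time."""
    at_risk = len(subjects)
    survival = 1.0
    for time, died in sorted(subjects, key=lambda s: (s[0], not s[1])):
        if died:
            survival *= (at_risk - 1) / at_risk  # curve drops at each death
        at_risk -= 1  # deaths and censorings both shrink the risk pool
    return survival

# E dies at 4 months, A is censored at 6, C dies at 7,
# B and D are censored at 12 months when the study ends.
subjects = [(4, True), (6, False), (7, True), (12, False), (12, False)]
survival = kaplan_meier(subjects)  # 0.8 * (2/3), about 0.53
```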

Kaplan-Meier curves are used to visually compare the survival functions of different groups.
For example, in the randomized trial of psychotherapy versus combined therapy for the
treatment of depression, subjects were followed until the time-to-remission of depression. The
Kaplan-Meier curve is shown here. Remissions happened at 4, 8, 12, and 24 months—
corresponding to the intervals at which depression was evaluated. The chance of depression
continuing past 12 months was 75% for the psychotherapy group and 65% for the combined
therapy group. Though the combined therapy group did slightly better, a formal log-rank test
comparing the two survival curves shows that this difference is not statistically significant. You
do not need to worry about the mathematical details of the log-rank test.

To quantify the difference in survival between two groups, we can calculate the hazard ratio,
which is a ratio of incidence rates. Unlike proportions, incidence rates take into account the
heterogeneity of follow-up times. The interpretation is similar to that of a risk ratio: a hazard
ratio greater than one indicates an increased rate of disease in the treated or exposed group,
and a value less than one indicates a decreased rate.
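An incidence rate ratio can be sketched with hypothetical person-time totals (the adjusted hazard ratios from a Cox model require iterative fitting, which this sketch omits):

```python
# Incidence rates account for unequal follow-up: events per person-year.
# The counts below are hypothetical, not from the studies discussed above.
events_exposed, person_years_exposed = 10, 500
events_unexposed, person_years_unexposed = 6, 600

rate_exposed = events_exposed / person_years_exposed        # 0.02 per person-year
rate_unexposed = events_unexposed / person_years_unexposed  # 0.01 per person-year
rate_ratio = rate_exposed / rate_unexposed
# 2.0: twice the event rate in the exposed group.
```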

The Cox proportional hazards model is a multivariate technique for calculating confounder-
adjusted hazard ratios. This model allows event rates to change over time, but assumes that
the relative difference between groups is constant. As with multivariate-adjusted odds ratios,
hazard ratios may have continuous or categorical predictors; thus you should check the units
of the predictor when interpreting hazard ratios from Cox regression.

[56] Rosner B. Fundamentals of Biostatistics. Thomson & Brooks/Cole Publishing; 2006.

STATISTICAL ANALYSES
37. Progress Check

Transcript
Match each term to the appropriate row.

Type of Predictor Variable(s) | Type of Outcome Variable | Statistical Test or Measure of Association
Binary (two groups) | Continuous | Two-sample t-test
Categorical (>two groups) | Continuous | ANOVA
Continuous | Continuous | Simple linear regression
Multivariate (categorical and continuous) | Continuous | Multiple linear regression, including ANCOVA
Categorical | Categorical | Z-test, chi-square (χ²) test, Fisher's exact test
Binary | Binary | Odds ratio, risk ratio
Multivariate | Binary | Logistic regression
Categorical | Time-to-event | Kaplan-Meier curve/log-rank test
Multivariate | Time-to-event | Cox proportional hazards regression, hazard ratio

CONCLUSION
38. Summary

Transcript
The goal of clinical research is to evaluate whether specific risk factors and treatments are
causally related to disease risk or prognosis.

The process of generating, refining, and testing specific hypotheses is called the empirical
cycle of scientific inquiry. Clinical researchers gather ideas from literature reviews, exploratory
studies, surveillance studies, clinical experience, and lab evidence. They generate formal
research hypotheses about whether specific risk factors and treatments are causally related
to disease or disease prognosis. They formally test these hypotheses in focused, confirmatory
studies. These studies inevitably generate ideas for new studies and new hypotheses.
Initially, research hypotheses are evaluated and refined in observational studies; if possible,
these hypotheses are eventually addressed in randomized trials.

Studies that were not designed to test specific hypotheses are likely to find chance
associations, and should be interpreted skeptically.

Researchers hope to generalize their findings to as broad a group of people as possible.


Since it is impractical to try to enumerate or reach every person in this idealized target
population, researchers define a more accessible study population from which they can
conveniently select participants. Out of the eligible study population, those who actually
participate make up the study sample. When reading a medical paper, you should carefully
consider how participants were selected to judge the generalizability of the results to the
target population.

When reading the medical literature, you need to evaluate how strongly the data support a
causal relationship between a risk factor and a disease (or a treatment and disease
prognosis). The Bradford-Hill criteria give nine points to consider when judging whether an
observed association between a risk factor and disease is likely to be causal. Some of the key
criteria are reviewed here.

This table summarizes the key advantages and disadvantages of the different study designs
for medical research. Use the scroll bar to move up and down through the table.

Observational: The investigator observes the population, but does not interfere.
• Advantages: Cheaper; faster; can examine long-term effects.
• Disadvantages: Weaker evidence of causality; unmeasured confounding.

Cross-sectional: Measures prevalence of risk factor and disease at one time point.
• Advantages: Cheap and easy; generalizable; estimates disease and exposure prevalence.
• Disadvantages: Cannot determine cause and effect; difficult design for rare diseases and exposures.

Case-control: Compares past exposures in a sample of cases (with disease) and a comparable sample of controls (without disease).
• Advantages: Cheap and fast; efficient for rare diseases.
• Disadvantages: Control selection is tricky; temporality is unclear; recall bias and selection bias.

Cohort: Tracks the incidence of disease (or other outcome) in a cohort of people whose exposures are measured prior to disease occurrence.
• Advantages: Temporality is correct; avoids recall bias; estimates rate and risk of disease; can be used to study multiple outcomes.
• Disadvantages: Lengthy and costly; requires a large sample and long follow-up for rare outcomes; loss to follow-up.

Experimental: The investigator controls the environment.
• Advantages: Stronger evidence of causality; minimizes confounding.
• Disadvantages: Expensive; not always ethically or practically feasible; follow-up time is limited.

Randomized trial: Investigator randomly assigns participants to different intervention arms.
• Advantages: Gold standard for demonstrating causal relationships; randomization minimizes confounding and selection bias.
• Disadvantages: Expensive; not practical for long-term outcomes; not always generalizable; not always feasible.

Cross-over randomized trial: Each participant receives all interventions sequentially, in random order.
• Advantages: Compared with parallel trials, requires fewer participants.
• Disadvantages: Compared with parallel trials, subject to carry-over effects.

Meta-analysis: Pools data from several studies to evaluate the overall weight of evidence for a causal relationship.
• Advantages: Yields precise estimates of effect; evaluates consistency over different studies; addresses contradictions in the literature; evaluates overall evidence.
• Disadvantages: Is limited by publication bias; is limited by the quality of the existing data; may yield statistically significant associations that are not clinically significant.

Bolded Terms

• Confounding: A measured or unmeasured variable that is related to both exposure and
disease and distorts or masks the true effect of exposure on disease.
• Prevalence: The proportion of people in a specified population who have a disease,
including new and old cases.
• Recall bias: Systematic error due to the differences in accuracy or completeness of recall
of past events or experiences. For example, cases may be more likely than controls to
remember and report exposures.
• Selection bias: When study subjects are selected differently depending on their risk
factors or outcomes. For example, in case-control studies, cases may be selected
differently than controls, creating spurious associations.
• Incidence: The rate at which new cases of disease develop in a population.
• Risk: The chance that a person will develop a disease in a given time-period.

• Loss to follow-up: Loss of contact with some participants, so that researchers cannot
complete data collection as planned. Loss to follow-up is a common cause of missing data,
especially in long-term studies.
• Gold standard: A method, procedure, or measurement that is widely accepted as being
the best available.
• Statistically significant: The probability that the association between the factor and the
outcome is due to chance is less than a specified level (by convention, p < 0.05).
• Clinically significant: The association between the factor and the outcome is large
enough to matter to patients.

Whether variables are measured as continuous or categorical guides the statistical analysis.
Categorical data classify things into categories. Continuous data are uninterrupted numerical
data.

This table summarizes the different types of data used in medical research. Use the scroll bar
to move up and down through the table.

Type of Data | Subtype | Examples | Summary Measures
Continuous | Follows a normal (or symmetric) distribution | Age; weight; height; blood pressure | Mean and standard deviation
Continuous | Follows a more skewed distribution | Scores on a test of depression | Median and interquartile range
Categorical | Binary (two categories) | Case/control; exposed/unexposed; treatment/placebo | Proportion of study subjects who are cases, exposed, treated, etc.
Categorical | Nominal (two or more unordered categories) | Intervention group; marital status | Percentage of study subjects that fall into each category
Categorical | Ordinal (two or more ordered categories) | Low/medium/high exposure; birth order | Percentage of study subjects that fall into each category; median and interquartile range
Time-to-event | | Time to death; time to disease; time to injury; time to recovery | Median time to event or incidence rate; cumulative risk

Bolded Terms

• Normal distribution: Values are distributed in a symmetric, bell shape, where 68% of
observations fall within one standard deviation of the mean, 95% fall within two standard
deviations of the mean, and 99.7% fall within three standard deviations of the mean
• Mean: The average value
• Standard deviation: The average scatter around the mean
• Skewed distribution: The distribution of values is non-symmetrical
• Median: The middle number in a set of ordered data
• Interquartile range: The middle 50% of the data

The goal of a hypothesis test is to evaluate whether exposure-disease associations seen in a
sample reflect real associations in the population or merely chance variation.

First, specify a null hypothesis, which is usually the opposite of what you're trying to show.
Then, predict the likelihood of a range of possible outcomes for your study given that the null
hypothesis is true.
Then, run a study to collect empirical data.
Then, calculate how probable (or improbable) the results of your study are if the null
hypothesis is true. If the data are sufficiently unlikely, this leads you to reject your null
hypothesis in favor of an alternative.

When reading the results of hypothesis tests, keep these caveats in mind:

Onscreen Text:
• When multiple hypothesis tests are performed, the overall likelihood of a chance
association will be greater than 0.05.
• Failure to reject a null hypothesis is not evidence of a lack of association. If the
sample size is small, null results could be due to insufficient statistical power.
• Statistical association is no guarantee of clinical relevance. The larger the sample
size, the more power to detect statistically significant but trivially small
associations.
• Confidence intervals allow you to evaluate clinical significance as well as
statistical significance. A confidence interval is calculated as the observed sample
statistic plus or minus a multiple of the standard error of the statistic.
• Standard error, which is a measure of the sampling variability, decreases as
sample size increases. For example, the standard error of the mean is calculated
as the standard deviation of the characteristic divided by the square root of the
sample size.
• Confidence intervals tell you the magnitude of an effect and the precision with
which it was estimated.

Select the data in the left column to see the corresponding confidence interval.

Intention-to-treat analysis is the preferred analysis method for randomized trial data. Intention
to treat compares outcomes according to how participants were randomized even if they
refused or discontinued the intervention, used it incorrectly, or otherwise violated study
protocol. Intention-to-treat minimizes bias and confounding by maintaining randomization,
measures real-life effectiveness; and tends to produce conservative estimates of effect.

The following statistical tests and measures of association are commonly used in clinical
research. Select the last column to review details of each test.

Onscreen Text: Statistical Analyses

Two-sample t-test (binary predictor with two groups; continuous outcome)
• Used for comparing the means of two groups.

ANOVA, repeated-measures ANOVA (categorical predictor with more than two groups; continuous outcome)
• Used for comparing the means of more than two groups. Repeated-measures ANOVA evaluates changes in the mean over time.

Linear regression, including ANCOVA (multivariate predictors; continuous outcome)
• ANCOVA calculates confounder-adjusted means across different groups, and is a specific example of linear regression. Linear regression is a multivariate technique for continuous outcome data.

Z-test, chi-square (χ²) test, Fisher's exact test (categorical predictor; categorical outcome)
• The Z-test is used to evaluate associations between two categorical variables. The chi-square test is used to evaluate associations between two or more categorical variables, as displayed in a contingency table. Fisher's exact test should be used when any cell in the contingency table has fewer than five observations.

Risk ratio, odds ratio (binary predictor; binary outcome)
• The risk ratio is a measure of the relative increase in risk of disease for an exposed or treated group versus an unexposed or control group. A risk ratio greater than one denotes increased risk; less than one denotes decreased risk; and equal to one denotes equal risk between the two groups. The odds ratio is the ratio of the odds of the disease or outcome in one group versus another, and is a good approximation of the risk ratio when the outcome is rare.

Logistic regression (multivariate predictors; binary outcome)
• Gives multivariate-adjusted odds ratios, and odds ratios for continuous and categorical predictors.

Kaplan-Meier curve/log-rank test (categorical predictor; time-to-event outcome)
• Compares time-to-event or event-free survival of two or more groups.

Cox proportional hazards regression (multivariate predictors; time-to-event outcome)
• Gives multivariate-adjusted hazard ratios, and hazard ratios for continuous and categorical predictors.

