Short Notes in Medical Statistics
Table of contents
Part 1: Study Design
Observational and Experimental Studies
Case Series
Cross-sectional Studies
Case-control studies
Cohort Studies
Relative Risk Reduction (RRR)
Number Needed to Treat, and Number Needed to Harm
Part 4: Descriptive statistics
Types of data variables
Descriptive statistics for categorical variables
(experimental) studies.
Observational (Non-experimental) studies
or collect data without intervening with
- Case-control studies
- Cohort studies
Experimental studies
that may be a new drug, a surgical
Case Series
• This type of study includes a small number of patients with a specific disease or
condition.
• It is not a true study and does not aim to draw general conclusions.
Cross-sectional Studies
the population (current cases) but not the incidence (new cases).
Example
Advantages
• Quick, easy, and inexpensive (No waiting for the occurrence of outcome).
Disadvantages
Case-control studies
group of individuals who are free of the disease (controls) regarding the past
• We start with cases and controls, then we look for the past exposure
(retrospective).
Example
• Cases: Students from Cairo University who are diagnosed with bipolar disorder.
• Controls: Students from Cairo University who don’t have bipolar disorder.
Advantages
Disadvantages
• Recall bias (the study is retrospective, and participants may not report their
Cohort Studies
a period of time (often years) to see whether they develop the disease or
outcome of interest.
• The rates of disease incidence among the exposed and unexposed groups are
retrospective, but both types define the cohorts based on the exposure status,
at the time of the study (current time). Pre-existing data, such as medical
records or employee files, can be used to assess exposure status in the past.
Example
To study if obesity is a risk factor for depression among Cairo University students:
• Ensuring that the outcome is not present (none of the participants have
depression).
• Follow up the different exposure groups (obese and non-obese) for a specific
period of time.
Advantages
control studies.
Disadvantages
Ecological Studies
• They are observational studies in which the outcome of interest is the rate of an
(placebo).
Example
[Figure: randomized controlled trial. Population, Sample, Randomization, Treatment A (new treatment), Outcome +ve / Outcome -ve]
efficacy of weight reduction in controlling depression in obese
depression).
Blinding (Masking)
• The goal is to reduce bias due to the subjectivity in reporting (by patient), and
Types of blinding
• Open label (no blinding): the patient and physicians know which
• Single blinded: the patient does not know what drug he/she is receiving.
• Double-blinded: both the patient and the investigator do not know which
Randomization
in the study.
• It produces balanced groups, i.e., the measured and unknown factors and
Advantages of RCTs
• Can provide comparison between the new drug and the current standard
Disadvantages of RCTs
• Difficult to study rare events and a long follow up period might be needed.
considerations.
participants eligible for trials may not be representative of all patients with the
condition of interest.
• Parallel design where patients are randomized to one of the two groups of
treatments, A and B, and each patient receives only one type of treatment.
[Figure: crossover design. Sample, Randomization, Treatment A, Treatment B, Result, Result]
switch groups in the study. For example,
treatment A after the halfway point. In this study design participants act as their
own controls. The main disadvantage is the carryover effect, which may affect
the direct intervention effect (if the effect of the drug in the first period affects
group they were originally allocated to during randomization (i.e. using data
from all patients, including those who did not complete the study or changed
the treatments.
• Per-Protocol Analysis (PP): Only participants who strictly adhered to the study
protocol are analyzed. All others – such as participants who moved, who did not
concomitant medication they should not have – are excluded from the
analysis.
Meta-analysis
• Statistical methods are used to combine the results of each independent study
Bias
• Attrition bias occurs when patients who are lost to follow-up differ in a
systematic way from those who continued the study (they might be older or
• Recall bias occurs when individuals with disease may be more likely to
incorrectly recall/believe they were exposed to a possible risk factor than those
• Observer bias occurs when knowledge of exposure status (e.g. race, gender)
biases the observer towards a diagnosis; this occurs more commonly with
analysis. This occurs when some studies are less likely to be published (usually those
Confounding effect
and distorts the estimated effect of an exposure if not accounted for in the
study design/analysis.
• Content validity: Does the scale cover all the relevant areas?
who develops the disease during a specific period of time (new cases/total).
Over one year, if 10 women are diagnosed with breast cancer, out of the total female
study population of 1000 (who do not have breast cancer at the beginning of the study
period), then the incidence of breast cancer in this population is 10/1000= 0.01, or 1%.
If a survey included 1150 university students, a total of 170 reported daily smoking. The
• If a group of researchers comes up with a new diagnostic test (e.g., blood test)
to diagnose a certain disease (e.g., presence of cancer), they will have to run an
experiment to see how good this new diagnostic test is (which may be cheaper,
• We need to compare this new diagnostic test to the gold standard test that
• So, we apply this new test and the gold standard test (true diagnosis) to a group
• Based on the results of the two tests, we will come up with 4 groups:
A. True positive: Positive for the blood test and positive for the histopathology
B. False positive: Positive for the blood test and negative for the histopathology
C. False negative: Negative for the blood test and positive for the histopathology
D. True negative: Negative for the blood test and negative for the histopathology
The new Test True positive False positive Total test positive
Example
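If 1000 individuals underwent both tests and the results are summarized as
follows:
                      Gold standard positive   Gold standard negative   Total
New test positive     180 (a)                  80 (b)                   260
New test negative     20 (c)                   720 (d)                  740
Total                 200                      800                      1000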
• Sensitivity
Sensitivity is the percentage of true positives, i.e., the proportion of those who have the
In other words: the probability that a test result will be positive when the disease is
present.
\[ \text{Sensitivity} = \frac{a}{a+c} = \frac{\text{number of true positives } (a)}{\text{number of true positives } (a) + \text{number of false negatives } (c)} \]
\[ \text{Sensitivity} = \frac{180}{180+20} = 0.9 = 90\% \]
This 90% sensitivity means that if we are sure that 100 patients have the disease
(based on the gold standard test), the new diagnostic test will be positive in 90 cases.
• Specificity
Specificity is the percentage of true negatives, i.e., the proportion of those who don’t
have the disease who are correctly identified by the test as negative.
In other words: the probability that a test result will be negative when the disease is
absent.
\[ \text{Specificity} = \frac{d}{b+d} = \frac{\text{number of true negatives } (d)}{\text{number of false positives } (b) + \text{number of true negatives } (d)} \]
\[ \text{Specificity} = \frac{720}{80+720} = 0.9 = 90\% \]
This 90% specificity means that if we are sure that 100 individuals don’t have the
disease (based on the gold standard test), the new diagnostic test will be negative in
90 cases.
90% of people who do not have the disease will test negative.
• To rule out a disease, we want to be sure that a negative result is really negative
(no disease); therefore, few false negatives should occur. High sensitivity
helps rule out if the test is negative. If we use SN for sensitivity, we use a highly
probability that the patient has the disease (a positive test result should really
indicate disease). Therefore, we want few false positives. High specificity helps
rule in if the test is positive. If we use SP for specificity, we use a highly specific
Sensitivity and specificity are characteristics of the test. But the physician and the
patient may have a different question: what is the chance that a person with a positive
test truly has the disease? Here come two other calculations:
• Positive predictive value is the probability that when having a positive test
result, that individual will truly have that specific disease. It is the proportion of
\[ \text{PPV} = \frac{a}{a+b} = \frac{180}{180+80} = 0.69 = 69\% \]
• Negative predictive value is the probability that when having a negative test
result, that individual will truly be free of the disease. It is the proportion of people
\[ \text{NPV} = \frac{d}{c+d} = \frac{720}{20+720} = 0.97 = 97\% \]
For those who test negative, 97% do not have the disease.
PPV = true positive / testing positive
NPV = true negative / testing negative
Sensitivity and specificity are characteristics of the test and are not affected by the
disease prevalence, while PPV and NPV are affected by the disease prevalence.
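The four measures can be checked with a short calculation. The sketch below uses the counts from the worked example (a = 180, b = 80, c = 20, d = 720); the helper `ppv_at_prevalence` and the 1% prevalence scenario are illustrative assumptions, added only to show how PPV changes with prevalence while sensitivity and specificity stay fixed.

```python
# 2x2 counts from the worked example (new test vs. gold standard)
a, b, c, d = 180, 80, 20, 720  # true pos, false pos, false neg, true neg

sensitivity = a / (a + c)  # 180 / 200
specificity = d / (b + d)  # 720 / 800
ppv = a / (a + b)          # 180 / 260
npv = d / (c + d)          # 720 / 740


def ppv_at_prevalence(sens, spec, prevalence):
    """PPV from sensitivity, specificity and prevalence (Bayes' theorem).

    Hypothetical helper, not part of the original notes.
    """
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Prevalence in the example is 200/1000 = 20%; 1% is an assumed scenario.
ppv_study = ppv_at_prevalence(sensitivity, specificity, 0.20)
ppv_rare = ppv_at_prevalence(sensitivity, specificity, 0.01)

print(f"Sensitivity {sensitivity:.0%}, specificity {specificity:.0%}")
print(f"PPV {ppv:.0%}, NPV {npv:.0%}")
print(f"PPV at 20% prevalence {ppv_study:.1%}, at 1% prevalence {ppv_rare:.1%}")
```

With the same 90% sensitivity and specificity, the PPV falls sharply when the disease is rare, because most positive results are then false positives.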
Likelihood ratios
• The likelihood ratio of a positive test result: ratio between the probability of a
positive test result in the presence of the disease and the probability of a
• The likelihood ratio of a negative test result: ratio between the probability of a
negative test result in the presence of the disease and the probability of a
High likelihood ratios for positive test results and low likelihood ratios for negative test
cut-off point.
specificity.
• We can choose the optimal cut-off point depending on the implications of false
positive and false negative results, and the prevalence of the condition.
accept more false positives (lower specificity) in return for fewer false negatives
(higher sensitivity).
Risk Ratio (Relative Risk, RR) and Odds Ratio (OR) are different measures of association.
Example
A cohort study followed 800 individuals for a 5-year period; 400 were smokers
and 400 were non-smokers. They were followed up for the occurrence of coronary heart
disease (CHD).
              CHD          No CHD        Total
Smokers       40 (a)       360 (b)       400 (a+b)
Non-smokers   20 (c)       380 (d)       400 (c+d)
Total         60 (a+c)     740 (b+d)     800 (a+b+c+d)
• Relative risk (RR) is the risk (incidence) of having the disease among the
exposed divided by the risk of having the disease among the non-exposed.
Risk (incidence) is calculated by dividing the number who developed the disease by
\[ \text{RR} = \frac{\text{Incidence among exposed}}{\text{Incidence among non-exposed}} = \frac{a/(a+b)}{c/(c+d)} \]
\[ \text{RR} = \frac{a/(a+b)}{c/(c+d)} = \frac{40/400}{20/400} = 2 \]
• Odds ratio (OR) is the odds of having the disease among the exposed divided
The odds are calculated by dividing the number who have the disease by the number
\[ \text{OR} = \frac{\text{Odds of having the disease among exposed}}{\text{Odds of having the disease among non-exposed}} = \frac{a/b}{c/d} = \frac{ad}{bc} \]
\[ \text{OR} = \frac{a/b}{c/d} = \frac{40/360}{20/380} = 2.11 \]
RR >1 if the group represented in the numerator is at higher “risk” of the event.
RR <1 if the group represented in the numerator is at lower “risk” of the event.
RR =1 if the group represented in the numerator is at the same “risk” of the event.
OR is interpreted in the same way, but we use the word “odds” instead of “risk”.
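As a quick check of the arithmetic, the relative risk and odds ratio for the smoking and coronary heart disease example can be computed directly from the four cell counts. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Cell counts from the cohort example: smokers (a, b), non-smokers (c, d)
a, b = 40, 360   # smokers: with disease, without disease
c, d = 20, 380   # non-smokers: with disease, without disease

risk_exposed = a / (a + b)      # incidence among smokers
risk_unexposed = c / (c + d)    # incidence among non-smokers
relative_risk = risk_exposed / risk_unexposed

odds_exposed = a / b            # odds of disease among smokers
odds_unexposed = c / d          # odds of disease among non-smokers
odds_ratio = odds_exposed / odds_unexposed  # same as (a * d) / (b * c)

print(f"RR = {relative_risk:.2f}")  # about 2.00
print(f"OR = {odds_ratio:.2f}")     # about 2.11
```

Because the disease is fairly uncommon in both groups (10% and 5%), the odds ratio is close to the relative risk; the two drift apart as the outcome becomes more common.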
• Attributable risk is simply the difference in incidence (risk) between the exposed
group and the non-exposed group. It refers to the increase in risk that can be
\[ \text{Attributable risk} = \text{Incidence among exposed} \left(\frac{a}{a+b}\right) - \text{Incidence among non-exposed} \left(\frac{c}{c+d}\right) \]
• If the incidence for a specific disease among smokers is 12%, and the incidence
of the same disease among non-smokers is 5%. So, the attributable risk is 12%-
• Note that in calculating the relative risk, we divide the risk in one group by the
risk in the other group, while here in the attributable risk, we calculate the
• Absolute risk reduction (ARR), or risk difference is the same as the attributable
risk. It is the difference between two risks. We use it when a treatment causes
ARR = Incidence (risk) among control group − Incidence (risk) among treatment group
• If the incidence for a specific disease among non-vaccinated group is 12%, and
the incidence of the same disease among the vaccinated group is 5%. So, the
the population).
Relative Risk Reduction (RRR) is the amount of risk reduction relative to the baseline
risk. It is the difference in the risk of the event between the control and experimental
\[ \text{RRR} = \frac{\text{ARR}}{\text{Incidence (risk) among control group}} \]
An alternative way of calculating the (RRR) is to use the relative risk (RR):
RRR = (1 - RR)
is 12%, and the incidence of the same disease among the vaccinated
• The Number Needed to Treat (NNT) is the number of individuals that need to be
• Both Number Needed to Treat (NNT) and Number Needed to Harm (NNH) are
• If there is risk reduction, we calculate the number needed to treat, and if there
We need to vaccinate 14 people to prevent one of them from having the disease.
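The vaccination example (12% incidence without the vaccine, 5% with it) can be worked through in a few lines. The sketch assumes the standard relationship NNT = 1 / ARR, and the variable names are illustrative.

```python
# Incidence (risk) in each group, from the vaccination example
risk_control = 0.12    # non-vaccinated group
risk_treated = 0.05    # vaccinated group

arr = risk_control - risk_treated          # absolute risk reduction
relative_risk = risk_treated / risk_control
rrr = arr / risk_control                   # relative risk reduction
rrr_from_rr = 1 - relative_risk            # alternative route: RRR = 1 - RR
nnt = 1 / arr                              # number needed to treat (assumed formula)

print(f"ARR = {arr:.0%}")        # 7%
print(f"RRR = {rrr:.1%}")        # about 58.3%
print(f"NNT = {nnt:.1f}")        # about 14.3, reported as 14 in the text
```

Both routes to the RRR give the same value, which is a useful check when only the relative risk is reported.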
numerical.
ordered), as sex (female, male) and blood groups: (A, B, AB, O).
agree).
• Categorical variables that consist of only two categories are called binomial
Numerical variables are either measured or counted, presented in numbers, and have
• Discrete variables: They take only integer values (no decimals) such as 0, 5,
22, 106, etc. They usually represent a count of something, such as the number of children in a
family.
• Continuous variables: They can take any real numerical value, including
decimals (such as 14.55, 48.8, 178.2). They involve measurements such as height and
weight.
Categorical variables such as sex, smoking status, and disease severity are presented
using:
each category.
Numerical variables are usually described using two numbers: one represents the
center of the data (central tendency), and the other represents the spread of the data
(dispersion).
Mean: it is the sum of the observed values divided by the number of observations. It is
Median: it is the point at the center of the data values, where half of the data points
are above, and half are below it. To calculate the median, we first arrange (order) our
data from the smallest value to the largest value. Then, the median is the value in the
• Measures of dispersion
Measures of dispersion (spread of the data) are used to describe variability in the
data. The commonly used measures of dispersion are range, inter-quartile range,
= Q3-Q1.
• The first quartile (Q1, lower quartile): is the point where 25% of the data are below
• The third quartile (Q3, upper quartile): is the point where 75% of the data are
Variance: it is a measure of spread that considers all data points in the calculation. It
represents the distance of all data points from the mean. Variance is measured using
the data values from their mean. It is calculated as the square root of the variance, so
Note that:
• The mean and standard deviation are affected by the presence of extreme values.
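A small numerical illustration of these measures, using Python's standard `statistics` module. The two data sets are invented for illustration, and quartile values depend on the calculation convention (the module's default method is used here).

```python
from statistics import mean, median, quantiles, stdev, variance

data = [2, 3, 4, 5, 6]            # hypothetical values
with_outlier = [2, 3, 4, 5, 60]   # same values with one extreme observation

# Central tendency
mean_plain, median_plain = mean(data), median(data)
mean_extreme, median_extreme = mean(with_outlier), median(with_outlier)

# Dispersion
value_range = max(data) - min(data)
q1, q2, q3 = quantiles(data, n=4)   # quartiles (default "exclusive" method)
iqr = q3 - q1
var = variance(data)                # sample variance
sd = stdev(data)                    # square root of the sample variance

print(f"Mean {mean_plain} vs {mean_extreme} once the extreme value is added")
print(f"Median {median_plain} vs {median_extreme} (unchanged)")
print(f"Range {value_range}, IQR {iqr}, variance {var}, SD {sd:.2f}")
```

The single extreme value pulls the mean from 4 to 14.8 while the median stays at 4, which is why the median and inter-quartile range are preferred for skewed data.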
• If we take a number of samples from a population and calculate the mean of each
sample, those means will form a distribution around the true
population mean.
• The standard deviation of this distribution, i.e. the standard deviation of sample
• The standard error tells us how accurate the mean of any sample is likely to be
equal.
in the tails (bell shape) and are symmetrical around the mean.
Skewed distribution
• Positive skew: the long tail is on the right side (the distribution is skewed to the
right).
• Negative skew: the long tail is on the left side (the distribution is skewed to the left).
The mean for a skewed data variable is located nearer to the tail (as it is affected by
For each research question, we define two types of hypotheses: the null hypothesis
Both are mutually exclusive (not overlapping) and only one of them is true.
• Fail to reject the null hypothesis (implies accepting the null hypothesis) and
• Reject the null hypothesis (implies accepting the alternative hypothesis) and
association.
While doing medical research, there is a possibility to reach a false conclusion and
result).
negative result).
• Type I error is more serious (it might indicate that a drug is effective while in fact
error.
Power is the probability of not committing type II error. So, power = 1-β
• The statistical power of a study is the power (or ability) of the study to detect a
• In practice, β is usually set at 0.2. This provides a power value of 0.8 (80%).
80%.
The level of significance (α) is the maximum allowed probability of committing type I
error.
• The smaller the value of α, the lower the risk of committing type I error.
• Common values for α are 0.05 and 0.01 indicating 5% and 1%, respectively.
• Studies with larger power will need larger sample size, and studies with lower
probability for type I error (α) will need larger sample size.
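The link between α, power and sample size can be illustrated with a common normal-approximation formula for comparing two means, n per group = 2 (z for 1−α/2 + z for power)² σ² / δ². This formula, the function name `n_per_group`, and the values of σ (standard deviation) and δ (difference to detect) are assumptions made for the sketch; they are not taken from the notes.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means.

    Hypothetical helper using the normal approximation (two-sided test).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # z value for 1 - beta
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# Assumed scenario: SD of 10 units, clinically relevant difference of 5 units
n_base = n_per_group(sigma=10, delta=5)                      # alpha 0.05, power 80%
n_more_power = n_per_group(sigma=10, delta=5, power=0.90)    # higher power
n_stricter_alpha = n_per_group(sigma=10, delta=5, alpha=0.01)  # lower alpha

print(n_base, n_more_power, n_stricter_alpha)
```

Raising the power to 90% or lowering α to 0.01 both increase the required number of participants per group.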
P-value
• When doing a statistical test using the computer software, we get the p-value
• If the null hypothesis is true, the p-value is the probability of obtaining this result
(or something more extreme). In other words, the p-value is the probability of
seeing the observed difference (in the collected data), or greater, just by
• Put simply: the p-value is the probability of seeing the observed difference
just by chance, assuming the null hypothesis is true.
• We compare the p-value (from the statistical test) to the level of significance
• If the p-value is greater than the level of significance, then we do not reject the
null hypothesis (we say that the result is not statistically significant; there is no
statistically significant difference).
• If the p-value is less than the level of significance, then we reject the null
hypothesis (we say that the p-value is significant and there is a statistically
significant difference).
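The decision rule can be illustrated on the smoking and coronary heart disease counts (40/360 among smokers, 20/380 among non-smokers). The sketch below assumes a chi-square test without continuity correction; the shortcut formula and the `erfc` identity used for the p-value apply only to a 2x2 table (1 degree of freedom), and the function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value for a 2x2 table (1 degree of freedom).

    Hypothetical helper; no continuity correction is applied.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


alpha = 0.05  # level of significance
chi2, p_value = chi_square_2x2(40, 360, 20, 380)
reject_null = p_value < alpha

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Do not reject the null hypothesis")
```

Here the p-value is below 0.05, so the difference in incidence between smokers and non-smokers would be called statistically significant at the 5% level.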
It is important to consider the clinical significance and not only the statistical
significance.
• If we have a very large sample size, comparing two groups might be statistically
significant even if the difference between them is very small and has no clinical
importance.
• On the other hand, if we have a small sample size, the result might be
statistically not significant (due to low power of the study), even if the difference
Confidence interval
• A Confidence Interval (CI) is a range of values within which we are fairly sure
the true population value lies (e.g. the mean). It is bounded by the upper and
• It is frequently reported as 95%CI (i.e. if this study was repeated 100 times,
• As the sample size increases, the confidence interval becomes narrower (more
precise).
• If the confidence interval for the difference between the two groups contains 0,
the relative risk (RR) or odds ratio (OR)
[Figure: confidence intervals labelled "Significant difference" and "Not significant difference"]
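To see how the interval narrows as the sample grows, here is a minimal sketch of a 95% confidence interval for a mean, computed as mean ± z × standard error. It assumes the normal approximation (for small samples the t distribution is normally used instead), and the blood pressure values and function name are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Approximate confidence interval for a mean (normal approximation).

    Hypothetical helper, not part of the original notes.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)           # about 1.96 for 95%
    standard_error = stdev(values) / sqrt(len(values))  # SD / sqrt(n)
    centre = mean(values)
    return centre - z * standard_error, centre + z * standard_error


small_sample = [120, 125, 130, 135, 140]   # hypothetical measurements, n = 5
large_sample = small_sample * 4            # same values repeated, n = 20

low_small, high_small = mean_ci(small_sample)
low_large, high_large = mean_ci(large_sample)
width_small = high_small - low_small
width_large = high_large - low_large

print(f"n = 5:  {low_small:.1f} to {high_small:.1f}")
print(f"n = 20: {low_large:.1f} to {high_large:.1f}")
```

Both intervals are centred on the same mean, but the interval from the larger sample is less than half as wide.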
Statistical tests
• Statistical tests are used to study the association between variables or the
• We use them to get the p-value and reach a conclusion if there is a statistical
significance or not.
• Parametric tests are used to compare means of the groups while non-
• Parametric tests are used to compare groups where the numerical variables
Statistical test | Used for | Parametric or non-parametric | Example
Independent samples t test (Student t test) | Used to compare means of two independent groups | Parametric | Comparing haemoglobin level between patients in the treatment and control groups
Mann-Whitney test | Used to compare the medians of two independent groups (variable is not normally distributed) | Non-parametric | Comparing the hospital length of stay (not normally distributed) between the treatment and control groups
Paired t-test | Used to compare the means of one group under two conditions or time points (paired data) | Parametric | Comparing the weight of a group of individuals before and after being on a specific diet
Wilcoxon Signed Rank test | Used to compare the values for one group under two conditions or time points (paired data) | Non-parametric | Comparing the pain score of a group of individuals before and after receiving a specific medication
One-way ANOVA | Used to compare the means of more than two independent groups | Parametric | Comparing the birthweight of infants to mothers with different smoking status (never smoke, quit before pregnancy, smoke during pregnancy)
Kruskal-Wallis test | Used to compare the medians of more than two independent groups (variable is not normally distributed) | Non-parametric | Comparing the neonatal intensive care unit (NICU) length of stay for infants of mothers with different smoking status (never smoke, quit before pregnancy, smoke during pregnancy)
Chi-square test | To study if there is a relationship/association between two categorical variables | Non-parametric | Comparing males and females regarding having complications (yes or no): is there an association between sex and having a complication?
Fisher’s exact | The same as Chi-square test but for small samples
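The table can be read as a small decision rule. The sketch below encodes it as a lookup function; the function name `suggest_test` and its parameters are hypothetical, and the rule covers only the tests listed in the table.

```python
def suggest_test(outcome, groups=2, paired=False, normal=True, small_sample=False):
    """Suggest a test from the table above (hypothetical helper).

    outcome: "numerical" or "categorical"
    groups: number of independent groups being compared
    paired: True for one group measured under two conditions or time points
    normal: True if the numerical variable is normally distributed
    small_sample: True for small samples (categorical outcomes only)
    """
    if outcome == "categorical":
        return "Fisher's exact test" if small_sample else "Chi-square test"
    if paired:
        return "Paired t-test" if normal else "Wilcoxon Signed Rank test"
    if groups == 2:
        return "Independent samples t test" if normal else "Mann-Whitney test"
    if groups > 2:
        return "One-way ANOVA" if normal else "Kruskal-Wallis test"
    raise ValueError("At least two groups (or paired data) are needed")


# Examples taken from the table
stay = suggest_test("numerical", groups=2, normal=False)   # hospital length of stay
weight = suggest_test("numerical", paired=True)            # weight before/after diet
birthweight = suggest_test("numerical", groups=3)          # three smoking groups
complications = suggest_test("categorical")                # sex vs complication

print(stay, weight, birthweight, complications, sep="\n")
```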
Correlation
variables.
• The correlation coefficient (r): shows the strength and direction of the
correlation.
• The closer the r value is to +1 or -1, the stronger the correlation between the two
variables, and the closer the value to 0, the weaker the relationship.
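A short calculation shows how r captures both strength and direction. The sketch computes Pearson's r from its definition; the data values and the function name are invented for illustration, and r squared is shown as the share of variance in one variable explained by the other.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired measurements
x = [1, 2, 3, 4, 5]
y_positive = [2, 4, 5, 4, 5]     # tends to rise with x
y_negative = [10, 8, 6, 4, 2]    # falls in a perfect straight line

r_pos = pearson_r(x, y_positive)
r_neg = pearson_r(x, y_negative)
r_squared = r_pos ** 2

print(f"r = {r_pos:.2f} (positive, fairly strong), r squared = {r_squared:.2f}")
print(f"r = {r_neg:.2f} (perfect negative correlation)")
```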
variance in one variable that can be explained by the other variable. It is the
• Types of correlation:
Pearson’s correlation(r): parametric test, used for numerical data that are linearly
Spearman’s correlation (rho): non-parametric test, used for ordinal data or numerical
• Scatterplots help to illustrate the correlation. The following graphs show each
Regression
Regression is a statistical tool used mainly to study the association between one or
more variables and an outcome variable. It quantifies this relationship as we can get
• The variable that is being affected by other variables is called the outcome
• The variable that is studied for a possible effect on the outcome is called
• Prediction: we can use one or more variables to estimate the value of the
outcome variable.
Types of regression:
• If the outcome variable is time to event (survival data), we use Cox regression.
Examples of outcome variables by type:
Heart rate, blood pressure, quality of life score
Yes/No, Diseased/Not diseased, Complication/No complication
Survival data: time to death, time to recurrence, time to second heart attack
www.stats4drs.com
Moha med l s heri 2022
https://www.connectmedical .academ / 12
• Based on the number of predictor variables, the regression model is either
there is more than one predictor variable. So, we have simple linear regression,
etc.
Estimates resulting from simple regression are called crude or unadjusted, while those
resulting from multiple regression are called adjusted (adjusted and unadjusted OR).
When reporting the results of regression, we use the coefficients for linear regression,
while odds ratio (OR) is used for logistic regression, and hazard ratio (HR) is used for
Cox regression.
Survival analysis
• Survival analysis is used when the outcome of interest is the time until an event
occurs (time to event). This event is usually death, as survival after breast
• When the study ends, some individuals still haven't had the event yet.
• Other individuals drop out or get lost in the middle of the study, and all we know
about them is the last time they were still 'free' of the event.
• Time to event: The time from entry into a study until a subject has a particular
event (outcome).
• Censoring (no event): Subjects are said to be censored if they are lost to follow
up or drop out of the study, or if the study ends before they die or have an
outcome of interest.
baseline.
a specific time and can be used to estimate the median survival time which is
• Kaplan-Meier Curve can be used to compare the survival in two groups. If the
curve goes down rapidly, the occurrence of the event is at a higher rate in this
group. The statistical test used for this comparison is the log rank test.
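The Kaplan-Meier estimate can be sketched in a few lines: at each event time, the running survival probability is multiplied by the proportion of at-risk subjects who did not have the event, and censored subjects simply leave the at-risk group. The follow-up data and function names below are invented for illustration.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival estimates from (time, event) pairs.

    event is 1 if the event occurred and 0 if the subject was censored.
    Returns a list of (time, survival probability) at each event time.
    """
    event_times = sorted({time for time, event in observations if event})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for time, _ in observations if time >= t)
        events = sum(1 for time, event in observations if time == t and event)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First time at which the survival estimate drops to 50% or below."""
    for time, survival in curve:
        if survival <= 0.5:
            return time
    return None  # median not reached during follow-up


# Hypothetical follow-up in months: (time, 1 = event, 0 = censored)
follow_up = [(2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (8, 1)]

curve = kaplan_meier(follow_up)
median_time = median_survival(curve)

for time, survival in curve:
    print(f"month {time}: S(t) = {survival:.3f}")
print(f"median survival: {median_time} months")
```

In this toy data set the estimate first falls to 50% or below at month 5, so that is the median survival time.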
• Cox regression is the type of regression used in survival studies, and hazard