Statistics Made Easy

Hani Tamim, MPH, PhD


Assistant Professor
Epidemiology and Biostatistics
Research Center / College of Medicine
King Saud bin Abdulaziz University for Health Sciences
Riyadh – Saudi Arabia
Objective of medical research

™ Is treatment A better than treatment B for patients with hypertension?

™ What is the survival rate among ICU patients?

™ What is the incidence of Down’s syndrome among a certain group of people?

™ Is the use of Oral Contraceptives associated with an increased risk of breast cancer?
Research Process?

™ Planning
™ Design
™ Data collection
™ Analysis
¾ Data entry
¾ Data cleaning
¾ Data management
¾ Data analysis

™ Reporting
Statistics is used in ….
What is statistics?

™ Scientific methods for:


¾ Collecting
¾ Organizing
¾ Summarizing
¾ Presenting
¾ Interpreting

™ data
Definition of some basic terms

™ Population: The largest collection of entities in which we have an interest at a particular time

™ Sample: A part of a population

™ Simple random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected
Definition of some basic terms

™ Variable: A characteristic of the subjects under observation that takes on different values for different cases; examples: age, gender, diastolic blood pressure

™ Quantitative variables: variables whose measurements convey information about amount

™ Qualitative variables: variables whose measurements consist of categorizations
Types of variables

™ Categorical variables

™ Continuous variables
Categorical variables

™ Nominal: unordered data

¾ Death

¾ Gender

¾ Country of birth

™ Ordinal: a predetermined order among the response classifications

¾ Education

¾ Satisfaction
Continuous variables

™ Continuous: Not restricted to integers

¾ Age

¾ Weight

¾ Cholesterol

¾ Blood pressure
Steps involved (data)

™ Data collection

™ Database structure

™ Data entry

™ Data cleaning

™ Data management

™ Data analyses
Data collection

™ Data collection:

¾ Collection of information that will be used to answer the research


question

¾ Could be done through questionnaires, interviews, data abstraction,


etc.
Data collection
Database structure

™ Database structure:

¾ Structure the database (using SPSS) into which the data will be
entered
Data entry

™ Data entry:

¾ Entering the information (data) into the computer

¾ Most of the time done manually, as either:


ƒ Single data entry
ƒ Double data entry
Data cleaning

™ Data cleaning:

¾ Identify any data entry mistakes

¾ Correct such mistakes


Data management

™ Data management:

¾ Create new variables based on different criteria

¾ Such as:
ƒ BMI
ƒ Recoding
ƒ Categorizing age (less than 50 years, and 50 years and above)
ƒ Etc.
Data analyses

™ Data analyses:

¾ Descriptive statistics: are the techniques used to describe the main


features of a sample

¾ Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter
Data analyses

™ Data analyses:

¾ Univariate analyses

¾ Bivariate analyses

¾ Multivariate analyses
Bottom line

There are different statistical methods


for different types of variables
Descriptive statistics: categorical variables

™ Frequency distribution

™ Graphical representation
Descriptive statistics: categorical variables

™ Frequency distribution

¾ A frequency distribution lists, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population
Descriptive statistics: categorical variables

™ Frequency distribution:

™ How to describe a categorical variable (marital status)?


Descriptive statistics: categorical variables

™ Construct a frequency distribution

¾ Title

¾ Values

¾ Frequency

¾ Relative frequency (percent)

¾ Valid relative frequency (valid percent)

¾ Cumulative relative frequency (cumulative percent)


Descriptive statistics: categorical variables

Marital status of the 291 patients admitted to the Emergency Department

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married        266        91.4        94.7              94.7
          Single          13         4.5         4.6              99.3
          Widow            2          .7          .7             100.0
          Total          281        96.6       100.0
Missing   System          10         3.4
Total                    291       100.0
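The same frequency distribution can be reproduced outside SPSS. The sketch below rebuilds the marital-status table in plain Python; the data list is reconstructed from the table's counts (not from the raw dataset), with None standing in for the 10 missing values:

```python
from collections import Counter

# Marital status of 291 patients; None marks the 10 missing (System) cases
data = ["Married"] * 266 + ["Single"] * 13 + ["Widow"] * 2 + [None] * 10

valid = [v for v in data if v is not None]
counts = Counter(valid)

n_total, n_valid = len(data), len(valid)
for value, freq in counts.most_common():
    percent = 100 * freq / n_total        # relative frequency (percent)
    valid_percent = 100 * freq / n_valid  # valid relative frequency (valid percent)
    print(f"{value:8s} {freq:4d} {percent:5.1f}% {valid_percent:5.1f}%")
```

Note how "Percent" divides by all 291 cases while "Valid Percent" divides by the 281 non-missing ones, which is exactly the distinction in the SPSS output.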
Example
Example: summarizing data
Descriptive statistics: categorical variables

™ Graphical representation

¾ A graph lists, for each value (or small range of values) of a variable,
the number or proportion of times that observation occurs in the
study population
Descriptive statistics: categorical variables

™ Graphical representation:

™ Two types
¾ Bar chart

¾ Pie chart
Descriptive statistics: categorical variables

™ Construct a bar or pie chart

¾ Title

¾ Values

¾ Frequency or relative frequency

¾ Properly labelled axes


Descriptive statistics: categorical variables
Descriptive statistics: categorical variables
Descriptive statistics: continuous variables

™ Central tendency

™ Dispersion

™ Graphical representation
Descriptive statistics: continuous variables

™ How to describe a continuous variable (Systolic blood pressure)?

™ Central tendency:
¾ Mean

¾ Median

¾ Mode
Descriptive statistics: continuous variables

™ Mean:

™ Add up data, then divide by sample size (n)


™ The sample size n is the number of observations (pieces of
data)
™ Example
¾ n = 5 systolic blood pressures (mmHg):
  X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

  X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg
Descriptive statistics: continuous variables

™ Formula

  X̄ = (X1 + X2 + ... + Xn) / n = (Σ Xi) / n, with the sum running from i = 1 to n

™ Summation sign
¾ The summation sign (Σ) is just mathematical shorthand for “add
up all of the observations”

  Σ Xi = X1 + X2 + X3 + ... + Xn
Descriptive statistics: continuous variables

™ Also called sample average or arithmetic mean X

™ Sensitive to extreme values

¾ One data point could make a great change in sample mean

™ Uniqueness

™ Simplicity
Descriptive statistics: continuous variables

™ Median: is the middle number, or the number that cuts the data in
half

80 90 95 110 120

™ The sample median is not sensitive to extreme values


¾ For example: If 120 became 200, the median would remain the
same, but the mean would change to 115.
Descriptive statistics: continuous variables

™ If the sample size is an even number

80 90 95 110 120 125

95 + 110
= 102.5 mmHg
2
Descriptive statistics: continuous variables

™ Median: Formula

¾ n odd: Median = the middle value, i.e. the ((n + 1) / 2)th ordered value

¾ n even: Median = the mean of the two middle values, i.e. the (n / 2)th and (n / 2 + 1)th ordered values

™ Properties:

¾ Uniqueness
¾ Simplicity
¾ Not affected by extreme values
Descriptive statistics: continuous variables

™ Mode: Most frequently occurring number

80 90 95 95 120 125

™ Mode = 95
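The three central-tendency measures can be checked with Python's standard statistics module, using the blood-pressure values from the slides:

```python
import statistics

bp = [120, 80, 90, 110, 95]  # n = 5 systolic blood pressures (mmHg)

print(statistics.mean(bp))                              # 99
print(statistics.median(bp))                            # 95, the middle ordered value
print(statistics.median([80, 90, 95, 110, 120, 125]))   # 102.5, mean of the two middle values
print(statistics.mode([80, 90, 95, 95, 120, 125]))      # 95, the most frequent value
```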
Descriptive statistics: continuous variables

™ Example:

Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Median 144.50
Mode 155
Descriptive statistics: continuous variables

™ Central tendency measures do not tell the whole story

™ Example:
21 22 23 23 23 24 24 25 28
Mean = 213/9 = 23.6
Median = 23

15 18 21 21 23 25 25 32 33
Mean = 213/9 = 23.6
Median = 23
Descriptive statistics: continuous variables

™ How to describe a continuous variable (Systolic blood pressure)


in addition to central tendency?

™ Measures of dispersion:
¾ Range
¾ Variance
¾ Standard Deviation
Descriptive statistics: continuous variables

™ Range

™ Range = Maximum – Minimum

™ Example: X 1 = 120
¾ Range = 120 – 80 = 40
X 2 = 80
X 3 = 90
X 4 = 110
X 5 = 95
Descriptive statistics: continuous variables

™ Sample variance (s² or var)

™ The sample variance is the average of the squared deviations about the sample mean:

  s² = Σ (Xi − X̄)² / (n − 1), summed over i = 1 to n

™ Sample standard deviation (s or SD)

™ It is the square root of the variance:

  s = √[ Σ (Xi − X̄)² / (n − 1) ]
Descriptive statistics: continuous variables

™ Example: n = 5 systolic blood pressures (mmHg):
  X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

™ Recall, from earlier: mean X̄ = 99 mmHg

  Σ (Xi − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)²
               + (110 − 99)² + (95 − 99)² = 1020
Descriptive statistics: continuous variables

™ Sample variance

  s² = Σ (Xi − X̄)² / (n − 1) = 1020 / 4 = 255

™ Sample standard deviation (SD)

  s = √s² = √255 = 15.97 (mmHg)
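These hand calculations are easy to verify in Python; statistics.variance and statistics.stdev use the same n − 1 denominator as the formulas above:

```python
import statistics

bp = [120, 80, 90, 110, 95]            # systolic blood pressures (mmHg)
xbar = statistics.mean(bp)             # 99

ss = sum((x - xbar) ** 2 for x in bp)  # sum of squared deviations: 1020
var = statistics.variance(bp)          # 1020 / (5 - 1) = 255
sd = statistics.stdev(bp)              # sqrt(255), about 15.97
rng = max(bp) - min(bp)                # range: 120 - 80 = 40
```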
Descriptive statistics: continuous variables

™ The bigger s, the more variability

™ s measures the spread about the mean

™ s can equal 0 only if there is no spread


¾ All n observations have the same value

™ The units of s are the same as the units of the data (for example,
mmHg)
Descriptive statistics: continuous variables

™ Example:

Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Median 144.50
Mode 155
Std. Deviation 35.312
Variance 1246.916
Range 202
Minimum 55
Maximum 257
Example: summarizing data
Descriptive statistics: continuous variables

™ Graphical representation:

™ Different types
¾ Histogram
Descriptive statistics: continuous variables

™ Construct a chart

¾ Title

¾ Values

¾ Frequency or relative frequency

¾ Properly labelled axes


Descriptive statistics: continuous variables
Shapes of the Distribution

™ Three common shapes of frequency distributions:

  A: Symmetrical and bell-shaped
  B: Positively skewed (skewed to the right)
  C: Negatively skewed (skewed to the left)
Shapes of Distributions

™ Symmetric (right and left sides are mirror images)

¾ Left tail looks like right tail
¾ Mean = Median = Mode (all three coincide at the center)
Shapes of Distributions

™ Left skewed (negatively skewed)

¾ Long left tail
¾ Mean < Median (the mean is pulled toward the tail: Mean, Median, Mode from left to right)
Shapes of Distributions

™ Right skewed (positively skewed)

¾ Long right tail
¾ Mean > Median (the mean is pulled toward the tail: Mode, Median, Mean from left to right)
Shapes of the Distribution

™ Three less common shapes of frequency distributions:

  A: Bimodal
  B: Reverse J-shaped
  C: Uniform
Probability
Probability

™ Definition:
¾ The likelihood that a given event will occur

¾ It ranges between 0 and 1:


ƒ 0 means the event cannot occur
ƒ 1 means the event is certain to occur
How do we calculate it?

™ Frequentist Approach:
¾ Probability: is the long term relative frequency

™ Thus, it is an idealization based on imagining what would happen to the relative frequencies in an indefinitely long series of trials
Application in medicine

™ How does probability apply in medicine?

™ Probability is the most important theory behind biostatistics

™ It is used at different levels


Descriptive

™ Example: a 4% chance of a patient dying after admission to the emergency department (from the previous example)
¾ What do we mean?
Out of every 100 patients admitted to the emergency department, 4
will die, whereas 96 will be discharged alive

™ Example: 1 in 1000 babies is born with a certain abnormality!

™ Incidence and prevalence


Associations

™ Example: the association between cigarette smoking and death after admission to the emergency department with an MI

Current cigarette smoking in association with death at discharge (Count)

                               Death at discharge
                               Death   Discharged   Total
Current cigarette    No          5        123        128
smoking              Yes         5        154        159
Total                           10        277        287

Probability of being a smoker = 159 / 287 = 55.4%
Probability of dying if a smoker = 5 / 159 = 3.1%
Probability of dying if a non-smoker = 5 / 128 = 3.9%
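These probabilities come straight from the 2 × 2 table; a small Python sketch (using the cell counts from the slide) makes the arithmetic explicit:

```python
# Cell counts from the slide: smoking status vs. death at discharge
deaths = {"smoker": 5, "non_smoker": 5}
totals = {"smoker": 159, "non_smoker": 128}

n = totals["smoker"] + totals["non_smoker"]                     # 287 patients in total
p_smoker = totals["smoker"] / n                                 # probability of being a smoker
p_die_smoker = deaths["smoker"] / totals["smoker"]              # 5 / 159, about 3.1%
p_die_nonsmoker = deaths["non_smoker"] / totals["non_smoker"]   # 5 / 128, about 3.9%
```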
Associations

™ Same is applied to:

¾ Relative risk

¾ Risk difference

¾ Attributable risk

¾ Odds ratio

¾ Etc…..
Bottom line

™ Probability is applied at all levels of statistical analyses


Probability distributions

™ Probability distributions list or describe probabilities for all possible


occurrences of a random variable

™ There are two types of probability distributions:

¾ Categorical distributions

¾ Continuous distributions
Probability distributions: categorical variables

™ Categorical variables
¾ Frequency distribution

¾ Other distributions, such as binomial


Probability distributions: continuous variables

™ Continuous variables
¾ Continuous distribution

¾ Such as Z and t distributions


Normal Distribution

™ Properties of a Normal Distribution

¾ Also called Gaussian distribution

¾ A continuous, bell-shaped, symmetrical distribution; both tails extend to infinity

¾ The mean, median, and mode are identical

¾ The shape is completely determined by the mean and


standard deviation
Normal Distribution

™ A normal distribution can have any µ and any σ, e.g. Age: µ = 40, σ = 10

™ The area under the curve represents 100% of all the observations

Mean
Median
Mode
Normal Distribution
Normal Distribution

Age distribution for a specific population

50% 50%

Mean=40
SD=10
Normal Distribution

Age distribution for a specific population

Age = 25 Mean=40
SD=10
Normal distribution

¾ The formula used to calculate the area below a certain point in a normal distribution is the probability density function of the normal distribution with mean µ and variance σ²:

  f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
Normal distribution

¾ Thus, for any normal distribution, once we have the mean and sd,
we can calculate the percentage of subjects:
¾ Above a certain level
¾ Below a certain level
¾ Between different levels

¾ But the problem is:


¾ Calculation is very complicated and time consuming, so:
Standardized Normal Distribution

™ We standardize to a normal distribution

™ What does this mean?

™ For a specific distribution, we calculate all possible probabilities,


and record them in a table

™ A normal distribution with a µ = 0, σ = 1 is called a Standardized


Normal Distribution
Standardized Normal Distribution

Mean=0
SD=1
Area under the Normal Curve from 0 to X

X 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.00000 0.00399 0.00798 0.01197 0.01595 0.01994 0.02392 0.02790 0.03188 0.03586
0.1 0.03983 0.04380 0.04776 0.05172 0.05567 0.05962 0.06356 0.06749 0.07142 0.07535
0.2 0.07926 0.08317 0.08706 0.09095 0.09483 0.09871 0.10257 0.10642 0.11026 0.11409
0.3 0.11791 0.12172 0.12552 0.12930 0.13307 0.13683 0.14058 0.14431 0.14803 0.15173
0.4 0.15542 0.15910 0.16276 0.16640 0.17003 0.17364 0.17724 0.18082 0.18439 0.18793
0.5 0.19146 0.19497 0.19847 0.20194 0.20540 0.20884 0.21226 0.21566 0.21904 0.22240
0.6 0.22575 0.22907 0.23237 0.23565 0.23891 0.24215 0.24537 0.24857 0.25175 0.25490
0.7 0.25804 0.26115 0.26424 0.26730 0.27035 0.27337 0.27637 0.27935 0.28230 0.28524
0.8 0.28814 0.29103 0.29389 0.29673 0.29955 0.30234 0.30511 0.30785 0.31057 0.31327
0.9 0.31594 0.31859 0.32121 0.32381 0.32639 0.32894 0.33147 0.33398 0.33646 0.33891
1.0 0.34134 0.34375 0.34614 0.34849 0.35083 0.35314 0.35543 0.35769 0.35993 0.36214
1.1 0.36433 0.36650 0.36864 0.37076 0.37286 0.37493 0.37698 0.37900 0.38100 0.38298
1.2 0.38493 0.38686 0.38877 0.39065 0.39251 0.39435 0.39617 0.39796 0.39973 0.40147
1.3 0.40320 0.40490 0.40658 0.40824 0.40988 0.41149 0.41308 0.41466 0.41621 0.41774
1.4 0.41924 0.42073 0.42220 0.42364 0.42507 0.42647 0.42785 0.42922 0.43056 0.43189
1.5 0.43319 0.43448 0.43574 0.43699 0.43822 0.43943 0.44062 0.44179 0.44295 0.44408
1.6 0.44520 0.44630 0.44738 0.44845 0.44950 0.45053 0.45154 0.45254 0.45352 0.45449
1.7 0.45543 0.45637 0.45728 0.45818 0.45907 0.45994 0.46080 0.46164 0.46246 0.46327
1.8 0.46407 0.46485 0.46562 0.46638 0.46712 0.46784 0.46856 0.46926 0.46995 0.47062
1.9 0.47128 0.47193 0.47257 0.47320 0.47381 0.47441 0.47500 0.47558 0.47615 0.47670
2.0 0.47725 0.47778 0.47831 0.47882 0.47932 0.47982 0.48030 0.48077 0.48124 0.48169
2.1 0.48214 0.48257 0.48300 0.48341 0.48382 0.48422 0.48461 0.48500 0.48537 0.48574
2.2 0.48610 0.48645 0.48679 0.48713 0.48745 0.48778 0.48809 0.48840 0.48870 0.48899
2.3 0.48928 0.48956 0.48983 0.49010 0.49036 0.49061 0.49086 0.49111 0.49134 0.49158
2.4 0.49180 0.49202 0.49224 0.49245 0.49266 0.49286 0.49305 0.49324 0.49343 0.49361
2.5 0.49379 0.49396 0.49413 0.49430 0.49446 0.49461 0.49477 0.49492 0.49506 0.49520
2.6 0.49534 0.49547 0.49560 0.49573 0.49585 0.49598 0.49609 0.49621 0.49632 0.49643
2.7 0.49653 0.49664 0.49674 0.49683 0.49693 0.49702 0.49711 0.49720 0.49728 0.49736
2.8 0.49744 0.49752 0.49760 0.49767 0.49774 0.49781 0.49788 0.49795 0.49801 0.49807
2.9 0.49813 0.49819 0.49825 0.49831 0.49836 0.49841 0.49846 0.49851 0.49856 0.49861
3.0 0.49865 0.49869 0.49874 0.49878 0.49882 0.49886 0.49889 0.49893 0.49896 0.49900
3.1 0.49903 0.49906 0.49910 0.49913 0.49916 0.49918 0.49921 0.49924 0.49926 0.49929
3.2 0.49931 0.49934 0.49936 0.49938 0.49940 0.49942 0.49944 0.49946 0.49948 0.49950
3.3 0.49952 0.49953 0.49955 0.49957 0.49958 0.49960 0.49961 0.49962 0.49964 0.49965
3.4 0.49966 0.49968 0.49969 0.49970 0.49971 0.49972 0.49973 0.49974 0.49975 0.49976
3.5 0.49977 0.49978 0.49978 0.49979 0.49980 0.49981 0.49981 0.49982 0.49983 0.49983
3.6 0.49984 0.49985 0.49985 0.49986 0.49986 0.49987 0.49987 0.49988 0.49988 0.49989
3.7 0.49989 0.49990 0.49990 0.49990 0.49991 0.49991 0.49992 0.49992 0.49992 0.49992
3.8 0.49993 0.49993 0.49993 0.49994 0.49994 0.49994 0.49994 0.49995 0.49995 0.49995
3.9 0.49995 0.49995 0.49996 0.49996 0.49996 0.49996 0.49996 0.49996 0.49997 0.49997
4.0 0.49997 0.49997 0.49997 0.49997 0.49997 0.49997 0.49998 0.49998 0.49998 0.49998
Standardized Normal Distribution

Normal Distribution: Mean = µ, SD = σ
Standardized Normal Distribution (Z): Mean = 0, SD = 1

TRANSFORM: Z = (x − µ) / σ
Standardized Normal Distribution

Normal Distribution: Mean = 40, SD = 10
Standardized Normal Distribution (Z): Mean = 0, SD = 1

TRANSFORM: Z(40) = (x − µ) / σ = (40 − 40) / 10 = 0
Standardized Normal Distribution

Normal Distribution: Mean = 40, SD = 10, x = 30
Standardized Normal Distribution (Z): Mean = 0, SD = 1, z = −1

TRANSFORM: Z(30) = (x − µ) / σ = (30 − 40) / 10 = −1
Standardized Normal Distribution: summary

™ For any normal distribution, we can


™ Transform the values to the standardized normal distribution (Z)
™ Use the Z table to get the following areas
¾ Above a certain level
¾ Below a certain level
¾ Between different levels
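Instead of looking areas up in the printed table, the same transform-then-look-up idea can be computed directly; the sketch below uses math.erf (a standard-library route to the normal CDF, so no external packages are needed):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma): transform to Z, then evaluate the standard normal CDF."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Age ~ Normal(mean = 40, SD = 10), as in the slides
below_30 = normal_cdf(30, 40, 10)                          # area below z = -1, about 0.159
between = normal_cdf(50, 40, 10) - normal_cdf(30, 40, 10)  # within 1 SD, about 0.68
```

The value 0.159 agrees with the Z table: the table gives the area from 0 to 1.0 as 0.34134, so the tail below z = −1 is 0.5 − 0.34134 ≈ 0.159.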
Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

30 68% 50

Mean – 1SD Mean + 1SD


Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

20 95% 60

Mean – 2SD Mean + 2SD


Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

10 99.7% 70

Mean – 3SD Mean + 3SD


Practical example
Practical example
The 68-95-99.7 Rule for the Normal Distribution

™ 68% of the observations fall within one standard deviation of the


mean
™ 95% of the observations fall within two standard deviations of the
mean
™ 99.7% of the observations fall within three standard deviations of
the mean
™ When applied to ‘real data’, these estimates are considered
approximate!
Distributions of Blood Pressure

[Figure: The 68-95-99.7 rule applied to the distribution of systolic blood pressure in men. Mean = 125 mmHg, s = 14 mmHg; about 68% of observations fall between 111 and 139, 95% between 97 and 153, and 99.7% between 83 and 167.]
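The interval endpoints in that blood-pressure figure follow directly from mean ± k·SD; a quick check in Python:

```python
mean, s = 125, 14  # systolic blood pressure in men (mmHg), from the figure

for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = mean - k * s, mean + k * s
    print(f"mean ± {k} SD: ({lo}, {hi})  holds about {pct} of observations")
```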
Data analyses

™ Data analyses:

¾ Descriptive statistics: are the techniques used to describe the main


features of a sample

¾ Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter
Why do we carry out research?

population

sample

Inference: Drawing
conclusions on certain
questions about a
population from sample data
Inferential statistics

™ Since we are not taking the whole population, we have to draw


conclusions on the population based on results we get from the
sample

™ Simple example: Say we want to estimate the average systolic


blood pressure for patients admitted to the emergency department
after having an MI

™ Other more complicated measures might be quality of life,


satisfaction with care, risk of outcome, etc.
Inferential statistics

™ What do we do?

™ Take a sample (n=291) of patients admitted to emergency


department in a certain hospital

™ Calculate the mean and SD (descriptive statistics) of systolic blood


pressure
Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Std. Deviation 35.312
Inferential statistics

™ The next step is to make a link between the estimates we observed


from the sample and those of the underlying population (inferential
statistics)

™ What can we say about these estimates as compared to the


unknown true ones???

™ In other words, we are trying to estimate the average systolic blood pressure for ALL patients admitted to the emergency department after an MI
Inferential statistics

™ Sample data

N=291

Mean=144
SD=35
Inference

™ In statistical inference we usually encounter TWO issues

¾ Estimate value of the population parameter. This is done through


point estimate and interval estimate (Confidence Interval)

¾ Evaluate a hypothesis about a population parameter rather than


simply estimating it. This is done through tests of significance
known as hypothesis testing (P-value)
1- Confidence Interval
Confidence Intervals

™ A point estimate:
A single numerical value used to estimate a population parameter.

™ Interval estimate:
Consists of 2 numerical values defining a range of values that with
a specified degree of confidence includes the parameter being
estimated.
(Usually interval estimate with a degree of 95% confidence is
used)
Example

™ What is the average systolic blood pressure for patients admitted


to emergency departments after an MI?
™ Select a sample
™ Point estimate = mean = 144

™ Interval estimate = 95% CI = 144 ± 1.96 × (35 / √291) ≈ (140 – 148)

™ 95% Confidence Interval: x̄ ± z(1−α/2) × SE

- Upper limit = x̄ + z(1−α/2) × SE
- Lower limit = x̄ − z(1−α/2) × SE
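The 95% CI can be reproduced in a few lines. The sketch below uses the sample summary from the slides (n = 291, mean = 144, SD = 35) and the usual z value of 1.96:

```python
import math

n, mean, sd = 291, 144, 35
se = sd / math.sqrt(n)     # standard error of the mean, about 2.05
z = 1.96                   # z(1 - alpha/2) for 95% confidence

lower, upper = mean - z * se, mean + z * se
print(f"95% CI: ({lower:.0f}, {upper:.0f})")
```

Increasing n shrinks the standard error, and with it the interval width, which is the point made in the standard-error slide below.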
Sampling distribution of mean

N = 291

[Figure: sampling distribution of the mean, with 95% of the area between µ − 2SE and µ + 2SE]


Standard error

™ Standard error
¾ = sd / √n
¾ As sample size increases the standard error decreases
¾ The estimation, as measured by the confidence interval, will be better, i.e. a narrower confidence interval
Interpretation

™ 95% Confidence Interval

¾ There is 95% probability that the true parameter is within the


calculated interval

¾ Thus, if we repeat the sampling procedure 100 times, the above


statement will be:

ƒ correct 95 times (the true parameter is within the interval)

ƒ wrong 5 times (the true parameter is outside the interval); this error rate is also called the α error
Notes on Confidence Intervals

™ Interpretation
¾ It provides the level of confidence of the value for the population
average systolic blood pressure

™ Are all CIs 95%?


¾ No
¾ It is the most commonly used
¾ A 99% CI is wider
¾ A 90% CI is narrower
Notes on Confidence Intervals

™ To be “more confident” you need a bigger interval


¾ For a 99% CI, you need ± 2.6 SEM
¾ For a 95% CI, you need ± 2 SEM
¾ For a 90% CI, you need ± 1.65 SEM
2- P-value
Inference

™ P-value
¾ Is related to another type of inference
¾ Hypothesis testing
¾ Evaluate a hypothesis about a population parameter rather than
simply estimating it
Hypothesis testing

™ Back to our previous example

™ We want to make inference about the average systolic blood


pressure of patients admitted to emergency department after MI

™ Assume that the normal systolic blood pressure is 120

™ The question is whether the average systolic blood pressure for


patients admitted to emergency departments is different than the
normal, which is 120
Hypothesis testing

™ Two types of hypotheses:

¾ Null hypothesis: is a statement consistent with “no difference”

¾ Alternative hypothesis: is a statement that disagrees with the null


hypothesis, and is consistent with presence of “difference”
The logic of hypothesis testing

™ To decide which of the hypothesis is true


™ Take a sample from the population
™ If the data are consistent with the null hypothesis, then we do not
reject the null hypothesis (conclusion = “no difference”)
™ If the sample data are not consistent with the null hypothesis, then
we reject the null (conclusion = “difference”)
Hypothesis testing

™ Example: is the systolic blood pressure for patients admitted to


emergency department after an MI normal (ie =120)?
- Ho: µ = 120
- Ha: µ ≠ 120

™ How do we answer this question?

™ We take a sample and find that the mean is 144 mmHg

™ Can we consider that 144 is consistent with the normal value
(120 mmHg)?
Hypothesis testing

N = 291

[Figures: sampling distribution under Ho: µ = 120, with the observed mean of 144 marked]

It looks like it is consistent with the null hypothesis

Is it still consistent with the null hypothesis?
Hypothesis testing

N = 291

[Figure: sampling distribution under Ho: µ = 120, with 95% of the area between µ − 2SE and µ + 2SE and 2.5% in each tail]
Test statistic

™ It is the statistic used for deciding whether the null hypothesis


should be rejected or not

™ Used to calculate the probability of getting the observed results if


the null hypothesis is true.

™ This probability is called the p-value.


How to decide

™ We calculate the probability of obtaining a sample with mean of


144 if the true mean is 120 due to chance alone (p-value)
™ Based on p-value we make our decision:
¾ If the p-value is low then this is taken as evidence that it is unlikely
that the null hypothesis is true, then we reject the null hypothesis (we
accept alternative one)
¾ If the p-value is high, it indicates that most probably the null
hypothesis is true, and thus we do not reject the Ho
Problem!

™ We could be making the wrong decisions

Decision Ho True Ho False

Do not reject Ho Correct decision Type II error

Reject Ho Type I error Correct decision

™ Type I error: is rejecting the null hypothesis when it is true

™ Type II error: is not rejecting the null hypothesis when it is false


Error

™ Type I error:
¾ Referred to as α
¾ Probability of rejecting a true null hypothesis
™ Type II error:
¾ Referred to as β
¾ Probability of accepting a false null hypothesis
™ Power:
¾ Represented by 1-β
¾ Probability of correctly rejecting a false null hypothesis
Significance level

™ The significance level, α, of a hypothesis test is defined as the


probability of making a type I error, that is the probability of
rejecting a true null hypothesis
™ It could be set to any value, as:
¾ 0.05
¾ 0.01
¾ 0.1
Statistical significance

™ If the p-value is less than some pre-determined cutoff (e.g. 0.05), the result is called “statistically significant”

™ This cutoff is the α-level

™ The α-level is the probability of a type I error

™ It is the probability of falsely rejecting H0


Back to the example

™ To test whether the average systolic blood pressure for patients


admitted to the emergency department after an MI is different
than 120 (which is the normal blood pressure)

™ We carry out a test called “one sample t-test” which provides a p-


value based on which we accept or reject the null hypothesis.
Back to the example

One-Sample Statistics

                           N     Mean    Std. Deviation   Std. Error Mean
Systolic blood pressure   286   144.13       35.312            2.088

One-Sample Test (Test Value = 120)

                            t       df    Sig. (2-tailed)   Mean Difference   95% CI of the Difference
Systolic blood pressure   11.558   285         .000             24.133            (20.02, 28.24)

™ Since p-value is less than 0.05, then the conclusion will be that the
systolic blood pressure for patients admitted to emergency
department after an MI is significantly higher than the normal
value which is 120
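The SPSS output can be reproduced from the summary statistics alone; a sketch of the one-sample t statistic:

```python
import math

n, mean, sd = 286, 144.13, 35.312   # sample summary from the output above
mu0 = 120                           # hypothesized (normal) systolic blood pressure

se = sd / math.sqrt(n)              # standard error, about 2.088
t = (mean - mu0) / se               # about 11.56, matching the SPSS t of 11.558
df = n - 1                          # 285 degrees of freedom
```

With df = 285, the t table's bottom rows show that any t above about 3.3 already gives p < 0.001, so a t near 11.6 yields the Sig. of .000 reported by SPSS.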
p-values

™ p-values are probabilities (numbers between 0 and 1)

™ Small p-values mean that the sample results are unlikely when the
null is true

™ The p-value is the probability of obtaining a result as/or more


extreme than you did by chance alone assuming the null
hypothesis H0 is true
t-distribution

™ The t-distribution looks like a standard normal curve

™ A t-distribution is determined by its degrees of freedom (n-1), the


lower the degrees of freedom, the flatter and fatter it is

[Figure: standard normal curve N(0,1) together with t-distributions for 35 and 15 degrees of freedom; the lower the df, the flatter and fatter the curve]
ν 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%

1 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6

2 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60

3 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92

4 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610

5 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869

6 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959

7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408

8 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041

9 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781

10 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587

11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437

12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318

13 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221

14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140

15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073

16 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015

17 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965

18 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922

19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883

20 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850

100 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390

120 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373

∞ 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
Hypothesis Testing

™ Different types of hypothesis:


¾ Mean (a) = Mean (b)
¾ Proportion (a) = Proportion (b)
¾ Variance (a) = Variance (b)
¾ OR = 1
¾ RR = 1
¾ RD = 0
¾ Test of homogeneity
¾ Etc..
Example
Comparing two means: paired testing

™ In the previous example, is the heart rate at admission different


than the heart rate at discharge among the patients admitted to the
emergency department after an MI?
Statistics

                    Heart Rate at admission   Heart Rate at discharge
N       Valid               286                        77
        Missing               5                       214
Mean                        82.64                    76.99
Std. Deviation              22.598                   17.900

™ Is this decrease in heart rate statistically significant?

™ Thus, we have to make inference….


Comparing two means: paired testing

™ What type of test to be used?

™ Since the measurements of the heart rate at admission and at


discharge are dependent on each other (not independent), another
type of test is used

™ Paired t-test
Comparing two means: paired testing

Paired Samples Statistics

                                   Mean    N    Std. Deviation   Std. Error Mean
Pair 1   Heart Rate at admission   81.16   75       23.546            2.719
         Heart Rate at discharge   76.72   75       17.973            2.075

Paired Samples Test (Heart Rate at admission − Heart Rate at discharge)

Paired Differences:
  Mean = 4.440, Std. Deviation = 25.302, Std. Error Mean = 2.922
  95% CI of the Difference: (−1.381, 10.261)
  t = 1.520, df = 74, Sig. (2-tailed) = .133

95% CI = 4.4 ± 1.96 × 2.9

H0: µb − µa = 0
HA: µb − µa ≠ 0

P-value = 0.133, thus no significant difference
How Are p-values Calculated?

t = (sample mean difference − 0) / SEM

t = 4.4 / 2.9 = 1.52

The value t = 1.52 is called the test statistic

Then we can compare the t-value to the table and get the
p-value, or get it from the computer (0.13)
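The same arithmetic for the paired test, starting from the paired-differences summary in the output (mean difference 4.440, SD 25.302, n = 75):

```python
import math

mean_diff, sd_diff, n = 4.440, 25.302, 75   # paired differences (admission - discharge)

se = sd_diff / math.sqrt(n)   # standard error of the mean difference, about 2.92
t = mean_diff / se            # about 1.52, the test statistic from the slide
df = n - 1                    # 74 degrees of freedom
```

Looking at the t-table row for df near 75, a t of 1.52 falls well below the 97.5% column value of about 1.99, which is why the two-tailed p-value (0.133) exceeds 0.05.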
Interpreting the p-value

™ The p-value in the example is 0.133

¾ Interpretation: If there is no difference in heart rate between admission and discharge from the emergency department, then the chance of finding a mean difference as extreme or more extreme than 4.4 in a sample of 75 paired patients is 0.133

¾ Thus, this probability is big (bigger than 0.05), which leads to saying that the difference of 4.4 is due to chance
Notes

™ How to decide on significance from the 95% CI?

™ 3 scenarios

-15 -10 -5 0 5 10 15

-15 -10 -5 0 5 10 15

-15 -10 -5 0 5 10 15
Comparing two means: Independent sample testing

™ In the previous example, is the systolic blood pressure different


between males and females among the patients admitted to the
emergency department after an MI?
Group Statistics

Std. Error
Sex N Mean Std. Deviation Mean
Systolic blood pressure Male 240 145.05 35.162 2.270
Female 44 138.64 35.753 5.390

™ Is this difference in systolic blood pressure statistically significant?

™ Thus, we have to make inference….


Comparing two means: Independent sample testing

™ Null hypothesis:
¾ Ho: Mean SBP(Males) = Mean SBP (Females)
¾ Ho: Mean SBP (Males) - Mean SBP (Females) = 0

™ Alternative hypothesis:
¾ Ha: Mean SBP(Males) ≠ Mean SBP (Females)
¾ Ha: Mean SBP(Males) - Mean SBP (Females) ≠ 0
Comparing two means: Independent sample testing

™ Thus, we carry out a test called: independent samples t-test


™ Formula to use is:
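The formula image did not survive extraction; for reference, a sketch of the standard pooled-variance (equal variances) version of the independent-samples t statistic is:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```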
Comparing two means: Independent sample testing

™ What we need to know is that we can calculate a p-value out of the


t-test (based on the t-distribution)

™ Based on this p-value, make the decision:


¾ P-value > 0.05, then do not reject the null (the two means are equal)

¾ P-value < 0.05, then reject the null (the two means are different)
Comparing two means: Independent sample testing

Group Statistics

Std. Error
Sex N Mean Std. Deviation Mean
Systolic blood pressure Male 240 145.05 35.162 2.270
Female 44 138.64 35.753 5.390

Independent Samples Test

Levene's Test for


Equality of Variances t-test for Equality of Means
95% Confidence
Interval of the
Mean Std. Error Difference
F Sig. t df Sig. (2-tailed) Difference Difference Lower Upper
Systolic blood pressure Equal variances
.044 .835 1.109 282 .269 6.409 5.781 -4.970 17.789
assumed
Equal variances
1.096 59.267 .278 6.409 5.848 -5.292 18.111
not assumed

Two formulas for the calculation of the t-test:
1- when variances are equal
2- when variances are not equal

To know which one to use, Levene's test (hypothesis test):
Ho: variance(males) = variance(females)
Ha: variance(males) ≠ variance(females)

1- If p-value > 0.05 then variances are equal (use the "equal variances assumed" row)

2- If p-value < 0.05 then variances are not equal (use the "equal variances not assumed" row)
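The SPSS result can be reproduced from the group summary statistics alone (a sketch using scipy's ttest_ind_from_stats; the means, SDs, and Ns are those in the Group Statistics table):

```python
from scipy import stats

# Group summary statistics from the SPSS output
# (systolic blood pressure by sex)
t_eq, p_eq = stats.ttest_ind_from_stats(
    mean1=145.05, std1=35.162, nobs1=240,   # males
    mean2=138.64, std2=35.753, nobs2=44,    # females
    equal_var=True)                         # pooled-variance t-test

t_uneq, p_uneq = stats.ttest_ind_from_stats(
    mean1=145.05, std1=35.162, nobs1=240,
    mean2=138.64, std2=35.753, nobs2=44,
    equal_var=False)                        # Welch's t-test

print(round(t_eq, 2), round(p_eq, 2))      # pooled: t ≈ 1.11, p ≈ 0.27
print(round(t_uneq, 2), round(p_uneq, 2))  # Welch:  t ≈ 1.10, p ≈ 0.28
```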
Example
T-test

Ho: Mean1 = Mean2
Ha: Mean1 ≠ Mean2

T-test: P-value = 0.89 → No significant difference
Chi square
Example

™ In the MI example, we would like to check if hypertension is


associated with gender.

™ In other words, are males at higher or lower risk of having


hypertension?

Sex * Hypertension Crosstabulation

Count
Hypertension
No Yes Total
Sex Male 191 52 243
Female 24 20 44
Total 215 72 287
Example

Sex * Hypertension Crosstabulation

Hypertension
No Yes Total
Sex Male Count 191 52 243
% within Sex 78.6% 21.4% 100.0%
Female Count 24 20 44
% within Sex 54.5% 45.5% 100.0%
Total Count 215 72 287
% within Sex 74.9% 25.1% 100.0%
Example

™ To answer the question, we do a hypothesis test:


¾ H0: P1 = P2 (P1 - P2 = 0)

¾ Ha: P1 ≠ P2 (P1 - P2 ≠ 0)

™ (Pearson’s) Chi-Square Test (χ2)


¾ Calculation is easy (can be done by hand)

™ Works well for big sample sizes

™ Can be extended to compare proportions between more than two


independent groups in one test
The Chi-Square Approximate Method

χ² = Σ (O − E)² / E    (summed over the 4 cells)
™ Looks at discrepancies between observed and expected cell counts
™ Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true
O = observed

E = expected = (row total × column total) / grand total
The Chi-Square Approximate Method

™ The distribution of this statistic, when the null hypothesis is true,

is a chi-square distribution with one degree of freedom

™ We can use this to determine how likely it was to get such a big

discrepancy between the observed and expected by chance alone


Distribution: Chi-Square with One Degree of Freedom

[Plot of the chi-square distribution with one degree of freedom; χ² = 3.84 ↔ p = 0.05]
Example of Calculations of
Chi-Square 2x2 Contingency Table

™ Test statistic: χ² = Σ (O − E)² / E    (summed over the 4 cells)
Chi-Square Tests

Asymp. Sig. Exact Sig. Exact Sig.


Value df (2-sided) (2-sided) (1-sided)
Pearson Chi-Square 11.471b 1 .001
Continuity Correctiona 10.227 1 .001
Likelihood Ratio 10.366 1 .001
Fisher's Exact Test .001 .001
Linear-by-Linear
11.431 1 .001
Association
N of Valid Cases 287
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.04.

χ² = 11.471
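The Pearson chi-square value can be checked directly from the crosstab counts (a sketch using scipy):

```python
from scipy import stats

# Observed counts from the Sex * Hypertension crosstab
#            No   Yes
observed = [[191, 52],   # Male
            [24,  20]]   # Female

chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)

print(round(chi2, 3), df)         # chi-square ≈ 11.471 with 1 df
print(round(p, 3))                # p ≈ 0.001
print(round(expected[1][1], 2))   # minimum expected count ≈ 11.04
```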
Sampling Distribution: Chi-Square with One Degree of
Freedom

[Plot of the chi-square distribution with one degree of freedom]
Example

™ The value that corresponds to 95% confidence, or 5% error, for one
degree of freedom is 3.841.

™ Thus we reject Ho since 11.471 is > 3.841

™ We conclude that Ho is false and that there is a relationship

between gender and diagnosis with hypertension

™ The p-value is = 0.001


Chi-square

Ho: Proportion1 = Proportion2
Ha: Proportion1 ≠ Proportion2

ChiSquare: P-value = 0.96 → No significant difference
Relative Risk (RR):

™ Study the association between Vioxx use and Myocardial Infarction

MI
Yes No
Vioxx 71 52
Drug
Placebo 29 48

Ho: RR = 1
RR=1.5, 95% CI = (1.1 - 1.9) (p-value = 0.01)
Ha: RR ≠ 1
Significant association
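The RR can be computed from the 2×2 table; the sketch below uses the standard log-scale Wald interval, which gives the same lower limit as the slide (the slide's rounded upper limit may come from a different method):

```python
import math

# 2x2 table: Vioxx vs placebo by MI status
a, b = 71, 52   # Vioxx:   MI yes, MI no
c, d = 29, 48   # Placebo: MI yes, MI no

risk_exposed   = a / (a + b)          # risk of MI on Vioxx
risk_unexposed = c / (c + d)          # risk of MI on placebo
rr = risk_exposed / risk_unexposed    # relative risk ≈ 1.53

# 95% CI on the log scale (large-sample Wald formula)
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(round(rr, 2), round(lo, 2), round(hi, 2))
# The interval excludes 1, so the association is significant
```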
Notes

™ How to decide on significance from the 95% CI?

™ 3 scenarios

[Three number lines from 0 to 7, showing 95% CIs for an RR that lie entirely below 1, include 1, and lie entirely above 1; only an interval that excludes 1 indicates a significant association]
Example

™ We would like to check if there is an association between gender


and both Hypertension and diabetes combined.
Sex * Hypterension and Diabetes combined Crosstabulation

Hypterension and Diabetes


combined
Either HT Both HT
None or DM and DM Total
Sex Male Count 145 67 28 240
% within Sex 60.4% 27.9% 11.7% 100.0%
Female Count 13 12 19 44
% within Sex 29.5% 27.3% 43.2% 100.0%
Total Count 158 79 47 284
% within Sex 55.6% 27.8% 16.5% 100.0%

™ Ho: gender and the combined hypertension/diabetes variable are independent.


™ Ha: the two variables are not independent.

™ Ho: P1 = P2 = P3
™ Ha: at least two of the proportions are not equal
Example

™ Conclusion
Sex * Hypterension and Diabetes combined Crosstabulation

Hypterension and Diabetes


combined
Either HT Both HT
None or DM and DM Total
Sex Male Count 145 67 28 240
% within Sex 60.4% 27.9% 11.7% 100.0%
Female Count 13 12 19 44
% within Sex 29.5% 27.3% 43.2% 100.0%
Total Count 158 79 47 284
% within Sex 55.6% 27.8% 16.5% 100.0%

Chi-Square Tests

Asymp. Sig.
Value df (2-sided)
Pearson Chi-Square 28.691a 2 .000
Likelihood Ratio 24.336 2 .000
Linear-by-Linear
25.341 1 .000
Association
N of Valid Cases 284
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 7.28.
Example

™ The value that corresponds to 95% confidence, or 5% error, for two
degrees of freedom is 5.991.

™ Thus we reject Ho since 28.691 is > 5.991

™ We conclude that Ho is false and that there is a relationship

between gender and diagnosis with hypertension and/or diabetes

™ The p-value is < 0.0001
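The same check works for the 2×3 table, which has (2−1)×(3−1) = 2 degrees of freedom (a sketch using scipy):

```python
from scipy import stats

# Sex * (Hypertension and Diabetes combined) crosstab
#            None  Either  Both
observed = [[145,  67,     28],   # Male
            [13,   12,     19]]   # Female

chi2, p, df, expected = stats.chi2_contingency(observed)

print(round(chi2, 3), df)        # chi-square ≈ 28.691 with 2 df
print(round(expected.min(), 2))  # minimum expected count ≈ 7.28
# p < 0.0001, so the association is significant
```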


ANOVA
The problem

™ We have samples from a number of independent groups.

™ We have a single numerical or ordinal variable and are interested

in whether the values of the variable vary between the groups.

™ Example: Does systolic blood pressure vary between men of

different smoking status?


The problem

™ One-way ANOVA can answer the question by comparing the

group means.

™ So the null and alternative hypotheses are:


H0: all group means in the population are equal

HA : at least two of the means are not equal

™ ANOVA is an extension of the comparison of 2 independent groups.

™ But the 2-group technique cannot simply be applied repeatedly.


The problem

- If 5 groups are available, then 10 two-group t-tests would have to be performed.

- The high Type I error rate, resulting from the large number of

comparisons, means that we may draw incorrect conclusions.


Assumptions

Analysis of variance requires the following assumptions:

™ Independent random samples have been taken from each


population.

™ The populations are normal.

™ The population variances are all equal.


The ANOVA Table

™ ANOVA table summaries the calculation needed to test the main


hypothesis.

Sources   df      SS           MS                                  F

Factor    k − 1   SS(factor)   MS(factor) = SS(factor) / (k − 1)   MS(factor) / MS(error)

Error     n − k   SS(error)    MS(error) = SS(error) / (n − k)
___________________________________________________________
Total     n − 1   SS(total)
Rationale

™ One-way ANOVA separates the total variability, SS(total), in the

data into:
¾ Differences between the individuals from the different groups
(between-group variation), SS(factor)

¾ The random variation between the individuals within each group

(within-group variation), SS(error), also called unexplained variation
Rationale

™ These components of variation are measured using variances,


hence the name analysis of variance (ANOVA).

™ Under the null hypothesis that the group means are the same, the
between-group variance, MS(factor), will be similar to the
within-group variance, MS(error).

™ The test is based on the ratio of these two variances.

™ If there are differences between-groups, then between-groups


variance will be larger than within-group variance.
Example

™ A new variable is created which combines diagnosis with


Hypertension and Diabetes together as follows:

Hypterension and Diabetes combined

Cumulative
Frequency Percent Valid Percent Percent
Valid None 159 54.6 55.6 55.6
Either HT or DM 80 27.5 28.0 83.6
Both HT and DM 47 16.2 16.4 100.0
Total 286 98.3 100.0
Missing System 5 1.7
Total 291 100.0
Example

™ We would like to check whether the systolic blood pressure is the


same for the three groups defined by their HT and DM status.

™ Ho: Mean1 = Mean2 = Mean3


™ Ha: at least two of the means are not equal

Hypterension and Diabetes combined

Cumulative
Frequency Percent Valid Percent Percent
Valid None 159 54.6 55.6 55.6
Either HT or DM 80 27.5 28.0 83.6
Both HT and DM 47 16.2 16.4 100.0
Total 286 98.3 100.0
Missing System 5 1.7
Total 291 100.0
Example
Descriptives

Systolic blood pressure


95% Confidence Interval for
Mean
N Mean Std. Deviation Std. Error Lower Bound Upper Bound Minimum Maximum
None 155 144.52 32.789 2.634 139.32 149.73 78 248
Either HT or DM 79 142.97 39.634 4.459 134.10 151.85 56 257
Both HT and DM 47 146.55 36.360 5.304 135.88 157.23 55 235
Total 281 144.43 35.319 2.107 140.28 148.57 55 257

ANOVA

Systolic blood pressure


Sum of
Squares df Mean Square F Sig.
Between Groups 380.517 2 190.259 .152 .859
Within Groups 348908.2 278 1255.066
Total 349288.8 280

™ Since the p-value (0.859) is bigger than 0.05, we do not reject Ho:


the average systolic blood pressures for the three groups are the same.
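The F statistic and its p-value follow directly from the sums of squares in the ANOVA table (a sketch using scipy):

```python
from scipy import stats

# Values from the SPSS ANOVA table above
ss_between, df_between = 380.517, 2     # Between Groups
ss_within,  df_within  = 348908.2, 278  # Within Groups

ms_between = ss_between / df_between    # mean square between ≈ 190.26
ms_within  = ss_within / df_within      # mean square within ≈ 1255.07
f = ms_between / ms_within              # F ratio ≈ 0.152

# p-value from the F distribution with (2, 278) degrees of freedom
p = stats.f.sf(f, df_between, df_within)

print(round(f, 3), round(p, 3))         # matches SPSS: 0.152, 0.859
```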
Bivariate analyses

                          DEPENDENT (outcome)

                          2 LEVELS            > 2 LEVELS          CONTINUOUS

INDEPENDENT  2 LEVELS     X2 (chi square)     X2 (chi square)     t-test

(exposure)   > 2 LEVELS   X2 (chi square)     X2 (chi square)     ANOVA

             CONTINUOUS   t-test              ANOVA               - Correlation
                                                                  - Linear regression
New scenario

™ If both the dependent and independent variables are continuous, then

we can use neither the t-test nor the chi-square test.
Regression and Correlation

™ Describing association between two continuous variables

¾ Scatterplot

¾ Correlation coefficient

¾ Simple linear regression


Correlation
Correlation

™ It is a measure of linear correlation

™ Called Pearson correlation coefficient (r)

™ Ranges between:

™ +1.0 (perfect positive correlation)

™ -1.0 (perfect negative correlation)


Scatter plot and correlation
The Correlation Coefficient (r)

™ Measures the direction and strength of the linear association


between x and y
™ The correlation coefficient is between -1 and +1
¾ r>0 Positive association
¾ r<0 Negative association
¾ r=0 No association
[Four scatter plots of Y against X illustrating r = 0.01, r = 0.68, r = 0.98, and r = −0.9]
Correlation in the Plasma Example

[Scatter plot: Y, plasma volume (liters) against X, body weight (kg); r = 0.76]


Correlation

™ Study the association between Heart Rate and Systolic Blood


Pressure
Ho: Correlation = 0
Ha: Correlation ≠ 0

[Scatter plot of the association between heart rate at admission (x-axis) and systolic blood pressure (y-axis)]

Correlation = 0.190
P-value = 0.001

Correlations

                                                Systolic blood   Heart Rate at
                                                pressure         admission
Systolic blood pressure   Pearson Correlation   1                .190**
                          Sig. (2-tailed)                        .001
                          N                     286              285
Heart Rate at admission   Pearson Correlation   .190**           1
                          Sig. (2-tailed)       .001
                          N                     285              286
**. Correlation is significant at the 0.01 level (2-tailed).

Significant correlation
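The p-value for r can be reproduced from r and n alone, using t = r·√(n−2)/√(1−r²) (a sketch; r = 0.190 and n = 285 complete pairs, as in the output above):

```python
import math
from scipy import stats

r, n = 0.190, 285   # Pearson correlation and number of complete pairs

# Test statistic for Ho: correlation = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Two-tailed p-value from the t-distribution with n - 2 df
p = 2 * stats.t.sf(abs(t), n - 2)

print(round(t, 2), round(p, 3))   # t ≈ 3.26, p ≈ 0.001
```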
Problem

™ Important to note that correlation measures the strength of linear association

™ There could be a strong non-linear relationship between y and x, and r may not catch it

[Scatter plot of a strong non-linear relationship between Y and X with r = 0]
Correlation Coefficient

™ Outliers can really affect the correlation coefficient

™ One extreme point can change r sizably

[Scatter plot in which a single extreme point produces r = 0.7]
Simple linear regression
Simple linear regression

™ Used to quantify the association between two variables

™ It is “simple” in the sense of having only one independent variable

™ The association is assumed to be linear in nature

™ Formula: Dependent = β0 + β1 (independent)


The Equation of a Line

™ β0 and β1 are called regression coefficients

™ These two quantities are estimated by the “least squares” method

™ The intercept β0 is the estimated expected value of y when x is 0

™ The slope β1 is the estimated expected change in y corresponding

to a unit increase in x
The Slope

™ The slope β1 is the expected change in y corresponding to a unit

increase in x

¾ β1 = 0 No association between y and x

¾ β1 > 0 Positive association (as x increases y tends to increase)

¾ β1 < 0 Negative association (as x increases y tends to decrease)


The Equation of a Line

[Graph of the fitted line ŷ = b̂0 + b̂1·x, with intercept b̂0 where the line crosses the y-axis and slope b̂1 as the change in y per unit of x]
The Slope

[Three lines through the same axes illustrating β1 > 0, β1 = 0, and β1 < 0]
Simple linear regression

™ Systolic blood pressure and age

Model Summary

Adjusted Std. Error of


Model R R Square R Square the Estimate
1 .054a .003 -.001 35.387
a. Predictors: (Constant), Age

™ Correlation:
R = 0.054
Simple linear regression
Simple linear regression

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 136.400 8.812 15.479 .000
Age .148 .162 .054 .910 .364
a. Dependent Variable: Systolic blood pressure

™ Simple linear regression:


SBP = 136.400 + 0.148 (Age)
If age = 0, then SBP = 136.400 + 0 = 136.400
As age increases by 1 year, SBP increases by 0.148 units
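The fitted equation can be written as a small prediction function (a sketch; the function name is illustrative):

```python
# Coefficients from the SPSS output (Constant and Age)
b0, b1 = 136.400, 0.148

def predict_sbp(age):
    """Predicted systolic blood pressure from the fitted line."""
    return b0 + b1 * age

print(predict_sbp(0))                             # 136.4 (the intercept)
print(round(predict_sbp(61) - predict_sbp(60), 3))  # 0.148 per extra year
```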
Simple Linear Regression

™ How do we decide if there is significant association between age


and SBP?
™ Hypothesis test
Ho: β1 = 0
Ha: β1 ≠ 0
™ SBP = β0 + β1 (Age)
™ If reject Ho, then as age changes, SBP changes significantly
™ If Ho is not rejected, then as age changes, there is no effect on SBP
Multiple Linear Regression

™ The important aspect of linear regression is that we can include


more than 1 independent variable
™ This is to control for the effect of another variable
™ Study the association between Age and SBP while controlling for
gender
™ SBP = β0 + β1 (Age) + β2 (Gender)
Multiple Linear Regression

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 143.090 9.742 14.688 .000
Age .216 .171 .080 1.261 .208
Sex -8.992 6.123 -.093 -1.469 .143
a. Dependent Variable: Systolic blood pressure

™ Multiple linear regression:


SBP = 143.090 + 0.216 (Age) − 8.992 (Gender)
As age increases by 1 year, SBP increases by 0.216 units
after adjusting for gender
The difference in SBP between males and females is 8.992 units
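The multiple-regression equation works the same way (a sketch; the 0/1 coding of Gender is an assumption, since the output does not show which sex is coded 1):

```python
# Coefficients from the SPSS output (Constant, Age, Sex)
b0, b_age, b_sex = 143.090, 0.216, -8.992

def predict_sbp(age, sex):
    """Predicted SBP; `sex` assumed coded 0/1 (which group is 1 is
    not shown in the output)."""
    return b0 + b_age * age + b_sex * sex

# Holding age fixed, the two groups differ by 8.992 units
diff = predict_sbp(60, 0) - predict_sbp(60, 1)
print(round(diff, 3))   # 8.992
```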
Choosing the right statistical test
Choosing a statistical test

™ Choosing the right statistical test depends on:


¾ Nature of the data
¾ Sample characteristics
¾ Inferences to be made
Choosing a statistical test

™ A consideration of the nature of data includes:

™ Number of variables

¾ not for entire study, but for the specific question at hand

™ Type of data

¾ numerical, continuous

¾ dichotomous, categorical information


Choosing a statistical test

™ A consideration of the sample characteristics includes:

™ Number of groups

™ Sample type

¾ normal distribution (parametric) or not (non-parametric)

¾ independent or dependent
Choosing a statistical test

™ A consideration of the inferences to be made includes:

™ Data represent the population

™ The group means are different

™ There is a relationship between variables


Choosing a statistical test

™ Before choosing a statistical test, ask:

™ How many variables?

™ How many groups?

™ Is the distribution of data normal?

™ Are the samples (groups) independent?

™ What is your hypothesis or research question?

™ Is the data continuous, ordinal, or categorical?


Descriptive analyses

Type of variable Measure

Categorical Proportion (%)

Continuous
Mean (SD)
(Normal)
Continuous - Median
(Not Normal) - Inter-quartile range
Different types of statistics

™ Parametric vs non-parametric analyses


¾ Parametric:
ƒ Assume data follows a specific probability distribution

ƒ More powerful

¾ Non-parametric:
ƒ Also called distribution free

ƒ No assumptions required about the distribution of the data

ƒ Less powerful, but more robust


Univariate analyses

Type of variable Measure

Categorical Z proportions

Continuous
T-test
(Normal)

Continuous - n > 30 → t-test


(Not Normal) - n < 30 → Kolmogorov-Smirnov Test
Bivariate analyses

Type of
2 levels > 2 levels Continuous
variable

2 levels Chi squared Chi squared T-test

> 2 levels Chi squared Chi squared Anova

- Correlation
Continuous T-test Anova
- linear regression
Bivariate analyses

Type of
variable      2 levels             > 2 levels          Continuous

2 levels      - Fisher’s test      Fisher’s test       - Mann-Whitney
              - McNemar’s test                         - Wilcoxon test

> 2 levels    Fisher’s test        Fisher’s test       - Kruskal-Wallis
                                                       - Friedman test

Continuous    - Mann-Whitney       - Kruskal-Wallis    - Correlation
              - Wilcoxon test      - Friedman test     - Regression
Multivariate analyses

Type of outcome variable     Measure

Categorical (2 levels)       Logistic regression

Categorical (> 2 levels)     Multinomial regression

Continuous                   Linear regression
Overview

              Measurement   Ordinal or Measurement   Binomial   Survival Time
              (Gaussian)    (Non-Gaussian)
Describe one group Mean, SD Median, interquartile Proportion Kaplan Meier survival
range curve
Compare two unpaired Unpaired t test Mann-Whitney test Fisher's test Log-rank test or
groups Chi-square Mantel-Haenszel*

Compare two paired groups Paired t test Wilcoxon test McNemar's test Conditional
proportional hazards
regression*

Compare three or more One-way ANOVA Kruskal-Wallis test Chi-square test Cox regression
unmatched groups
Compare three or more Repeated-measures Friedman test Cochrane Q** Conditional
matched groups ANOVA proportional hazards
regression*

Quantify association between Pearson correlation Spearman correlation Contingency


two variables coefficients**
Predict value from another Simple linear Nonparametric Simple logistic Cox regression
measured variable regression regression** regression*

Predict value from several Multiple linear Multiple logistic Cox regression
measured or binomial regression* regression*
variables
Sample size calculation
Sample size and power calculation

™ Important step in designing a study

™ If it is not done, then sample size might be high or low:

¾ If it is low: lack precision to provide reliable answers

¾ If it is high: resources will be wasted for minimal gain


Sample size and power calculation

™ This step addresses two questions:

¾ How precise will my parameter estimates tend to be if I select a

particular sample size?

¾ How big a sample do I need to attain a desirable level of precision?


Sample size and power calculation: example

™ A cross-sectional survey of the prevalence of diabetes (diagnosed

or undiagnosed) among native Americans would require a sample

size of 1421 to allow estimation of the prevalence within a

precision of ±0.02 with 90% confidence, assuming a true

prevalence no larger than 30%.


Sample size and power calculation

™ Should be done at the DESIGN stage, i.e. before data is collected

™ Drives the whole study

™ To determine the sample size:

¾ Objectives should be clearly defined

¾ Main exposure and outcome should be specified

¾ Analyses plan should be clarified


Sample size and power calculation

™ Different equations are used:

™ Depends on:

¾ Study design

¾ Objectives (prevalence, risk, etc.)

¾ Types of variables

™ Following is an example of sample size calculation for comparing

the means in two groups


Sample size and power calculation: example

™ A randomized clinical trial of a new drug treatment vs. placebo

for decreasing blood pressure would require 126 patients for a

two-sided test at α = 0.05 to provide 80% power to detect a 5%

difference in blood pressure.


Sample size calculation: comparing two means

N ≈ 2 × SD² × (zα + zβ)² / Difference²

™ N = the number of subjects in each group

™ α = level of significance (error)

™ 1 - β = power

™ Difference = Minimal significant difference


Sample size calculation: comparing two means

™ N = the number of subjects in each group

¾ ↑ N = more power or less α

¾ ↓ N = less power or more α


Sample size calculation: comparing two means

™ α = level of significance (error)

¾ ↑ α = more power or smaller N

¾ ↓ α = less power or larger N


Sample size calculation: comparing two means

™ 1 - β = power

¾ ↑ 1 - β = less α or larger N

¾ ↓ 1 - β = more α or smaller N
Sample size calculation: comparing two means

™ Difference = Minimal significant difference

¾ ↑ Difference = larger power or smaller N

¾ ↓ Difference = smaller power or larger N


Sample size calculation: comparing two means

™ N = to be found

™ α = level of significance (error) = 0.05 or 5%

™ 1 - β = power = 0.80 or 80%

™ Difference = Minimal significant difference
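Plugging typical values into the formula reproduces the earlier example's 126 patients (a sketch; the SD of 10 units is an assumption not stated on the slide, chosen because it is consistent with that example):

```python
import math
from scipy import stats

alpha, power = 0.05, 0.80
sd = 10          # assumed SD of blood pressure (not given on the slide)
difference = 5   # minimal difference worth detecting

z_alpha = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96 (two-sided)
z_beta  = stats.norm.ppf(power)           # ≈ 0.84

# N per group ≈ 2 * SD^2 * (z_alpha + z_beta)^2 / Difference^2
n_per_group = math.ceil(2 * sd**2 * (z_alpha + z_beta)**2 / difference**2)

print(n_per_group, 2 * n_per_group)   # 63 per group, 126 in total
```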


Thank you
