Statistics Made Easy

Hani Tamim, MPH, PhD


Assistant Professor
Epidemiology and Biostatistics
Research Center / College of Medicine
King Saud bin Abdulaziz University for Health Sciences
Riyadh – Saudi Arabia
Objective of medical research

™ Is treatment A better than treatment B for patients with hypertension?

™ What is the survival rate among ICU patients?

™ What is the incidence of Down’s syndrome among a certain group of people?

™ Is the use of Oral Contraceptives associated with an increased risk of breast cancer?
Research Process?

™ Planning
™ Design
™ Data collection
™ Analysis
¾ Data entry
¾ Data cleaning
¾ Data management
¾ Data analysis

™ Reporting
Statistics is used in ….
What is statistics?

™ Scientific methods for:


¾ Collecting
¾ Organizing
¾ Summarizing
¾ Presenting
¾ Interpreting

™ data
Definition of some basic terms

™ Population: The largest collection of entities in which we have an interest at a particular time

™ Sample: A part of a population

™ Simple random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected
Definition of some basic terms

™ Variable: A characteristic of the subjects under observation that takes on different values for different cases; examples: age, gender, diastolic blood pressure

™ Quantitative variables: variables whose measurements convey information about amount

™ Qualitative variables: variables whose measurements consist of categorizations
Types of variables

™ Categorical variables

™ Continuous variables
Categorical variables

™ Nominal: unordered data

¾ Death

¾ Gender

¾ Country of birth

™ Ordinal: a predetermined order among the response classifications

¾ Education

¾ Satisfaction
Continuous variables

™ Continuous: Not restricted to integers

¾ Age

¾ Weight

¾ Cholesterol

¾ Blood pressure
Steps involved (data)

™ Data collection

™ Database structure

™ Data entry

™ Data cleaning

™ Data management

™ Data analyses
Data collection

™ Data collection:

¾ Collection of information that will be used to answer the research


question

¾ Could be done through questionnaires, interviews, data abstraction,


etc.
Data collection
Database structure

™ Database structure:

¾ Structure the database (using SPSS) into which the data will be
entered
Data entry

™ Data entry:

¾ Entering the information (data) into the computer

¾ Most of the time done manually, as either:


ƒ Single data entry
ƒ Double data entry
Data cleaning

™ Data cleaning:

¾ Identify any data entry mistakes

¾ Correct such mistakes


Data management

™ Data management:

¾ Create new variables based on different criteria

¾ Such as:
ƒ BMI
ƒ Recoding
ƒ Categorizing age (less than 50 years, and 50 years and above)
ƒ Etc.
Data analyses

™ Data analyses:

¾ Descriptive statistics: are the techniques used to describe the main


features of a sample

¾ Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter
Data analyses

™ Data analyses:

¾ Univariate analyses

¾ Bivariate analyses

¾ Multivariate analyses
Bottom line

There are different statistical methods


for different types of variables
Descriptive statistics: categorical variables

™ Frequency distribution

™ Graphical representation
Descriptive statistics: categorical variables

™ Frequency distribution

¾ A frequency distribution lists, for each value (or small range of values) of a variable, the number or proportion of times that observation occurs in the study population
Descriptive statistics: categorical variables

™ Frequency distribution:

™ How to describe a categorical variable (marital status)?


Descriptive statistics: categorical variables

™ Construct a frequency distribution

¾ Title

¾ Values

¾ Frequency

¾ Relative frequency (percent)

¾ Valid relative frequency (valid percent)

¾ Cumulative relative frequency (cumulative percent)


Descriptive statistics: categorical variables

Marital status of the 291 patients admitted to the Emergency Department

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married        266        91.4        94.7              94.7
          Single          13         4.5         4.6              99.3
          Widow            2          .7          .7             100.0
          Total          281        96.6       100.0
Missing   System          10         3.4
Total                    291       100.0
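The same frequency distribution can be reproduced outside SPSS. The sketch below rebuilds the marital-status table in plain Python; the data list is reconstructed from the table's counts (not from the raw dataset), with None standing in for the 10 missing values:

```python
from collections import Counter

# Marital status of 291 patients; None marks the 10 missing (System) cases
data = ["Married"] * 266 + ["Single"] * 13 + ["Widow"] * 2 + [None] * 10

valid = [v for v in data if v is not None]
counts = Counter(valid)

n_total, n_valid = len(data), len(valid)
for value, freq in counts.most_common():
    percent = 100 * freq / n_total        # relative frequency (percent)
    valid_percent = 100 * freq / n_valid  # valid relative frequency (valid percent)
    print(f"{value:8s} {freq:4d} {percent:5.1f}% {valid_percent:5.1f}%")
```

Note how "Percent" divides by all 291 cases while "Valid Percent" divides by the 281 non-missing ones, which is exactly the distinction in the SPSS output.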
Example
Example: summarizing data
Descriptive statistics: categorical variables

™ Graphical representation

¾ A graph lists, for each value (or small range of values) of a variable,
the number or proportion of times that observation occurs in the
study population
Descriptive statistics: categorical variables

™ Graphical representation:

™ Two types
¾ Bar chart

¾ Pie chart
Descriptive statistics: categorical variables

™ Construct a bar or pie chart

¾ Title

¾ Values

¾ Frequency or relative frequency

¾ Properly labelled axes


Descriptive statistics: categorical variables
Descriptive statistics: categorical variables
Descriptive statistics: continuous variables

™ Central tendency

™ Dispersion

™ Graphical representation
Descriptive statistics: continuous variables

™ How to describe a continuous variable (Systolic blood pressure)?

™ Central tendency:
¾ Mean

¾ Median

¾ Mode
Descriptive statistics: continuous variables

™ Mean:

™ Add up data, then divide by sample size (n)


™ The sample size n is the number of observations (pieces of
data)
™ Example
¾ n = 5 systolic blood pressures (mmHg):
  X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

  X̄ = (120 + 80 + 90 + 110 + 95) / 5 = 99 mmHg
Descriptive statistics: continuous variables

™ Formula

  X̄ = (X1 + X2 + ... + Xn) / n = (Σ Xi) / n, with the sum running from i = 1 to n

™ Summation sign
¾ The summation sign (Σ) is just mathematical shorthand for “add
up all of the observations”

  Σ Xi = X1 + X2 + X3 + ... + Xn
Descriptive statistics: continuous variables

™ Also called sample average or arithmetic mean X

™ Sensitive to extreme values

¾ One data point could make a great change in sample mean

™ Uniqueness

™ Simplicity
Descriptive statistics: continuous variables

™ Median: is the middle number, or the number that cuts the data in
half

80 90 95 110 120

™ The sample median is not sensitive to extreme values


¾ For example: If 120 became 200, the median would remain the
same, but the mean would change to 115.
Descriptive statistics: continuous variables

™ If the sample size is an even number

80 90 95 110 120 125

95 + 110
= 102.5 mmHg
2
Descriptive statistics: continuous variables

™ Median: Formula

¾ n odd: Median = the middle value, i.e. the ((n + 1) / 2)th ordered value

¾ n even: Median = the mean of the two middle values, i.e. the (n / 2)th and (n / 2 + 1)th ordered values

™ Properties:

¾ Uniqueness
¾ Simplicity
¾ Not affected by extreme values
Descriptive statistics: continuous variables

™ Mode: Most frequently occurring number

80 90 95 95 120 125

™ Mode = 95
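The three central-tendency measures can be checked with Python's standard statistics module, using the blood-pressure values from the slides:

```python
import statistics

bp = [120, 80, 90, 110, 95]  # n = 5 systolic blood pressures (mmHg)

print(statistics.mean(bp))                              # 99
print(statistics.median(bp))                            # 95, the middle ordered value
print(statistics.median([80, 90, 95, 110, 120, 125]))   # 102.5, mean of the two middle values
print(statistics.mode([80, 90, 95, 95, 120, 125]))      # 95, the most frequent value
```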
Descriptive statistics: continuous variables

™ Example:

Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Median 144.50
Mode 155
Descriptive statistics: continuous variables

™ Central tendency measures do not tell the whole story

™ Example:
21 22 23 23 23 24 24 25 28
Mean = 213/9 = 23.6
Median = 23

15 18 21 21 23 25 25 32 33
Mean = 213/9 = 23.6
Median = 23
Descriptive statistics: continuous variables

™ How to describe a continuous variable (Systolic blood pressure)


in addition to central tendency?

™ Measures of dispersion:
¾ Range
¾ Variance
¾ Standard Deviation
Descriptive statistics: continuous variables

™ Range

™ Range = Maximum – Minimum

™ Example: X 1 = 120
¾ Range = 120 – 80 = 40
X 2 = 80
X 3 = 90
X 4 = 110
X 5 = 95
Descriptive statistics: continuous variables

™ Sample variance (s² or var)

™ The sample variance is the average of the squared deviations about the sample mean:

  s² = Σ (Xi − X̄)² / (n − 1), summed over i = 1 to n

™ Sample standard deviation (s or SD)

™ It is the square root of the variance:

  s = √[ Σ (Xi − X̄)² / (n − 1) ]
Descriptive statistics: continuous variables

™ Example: n = 5 systolic blood pressures (mmHg):
  X1 = 120, X2 = 80, X3 = 90, X4 = 110, X5 = 95

™ Recall, from earlier: mean X̄ = 99 mmHg

  Σ (Xi − X̄)² = (120 − 99)² + (80 − 99)² + (90 − 99)²
               + (110 − 99)² + (95 − 99)² = 1020
Descriptive statistics: continuous variables

™ Sample variance

  s² = Σ (Xi − X̄)² / (n − 1) = 1020 / 4 = 255

™ Sample standard deviation (SD)

  s = √s² = √255 = 15.97 (mmHg)
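These hand calculations are easy to verify in Python; statistics.variance and statistics.stdev use the same n − 1 denominator as the formulas above:

```python
import statistics

bp = [120, 80, 90, 110, 95]            # systolic blood pressures (mmHg)
xbar = statistics.mean(bp)             # 99

ss = sum((x - xbar) ** 2 for x in bp)  # sum of squared deviations: 1020
var = statistics.variance(bp)          # 1020 / (5 - 1) = 255
sd = statistics.stdev(bp)              # sqrt(255), about 15.97
rng = max(bp) - min(bp)                # range: 120 - 80 = 40
```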
Descriptive statistics: continuous variables

™ The bigger s, the more variability

™ s measures the spread about the mean

™ s can equal 0 only if there is no spread


¾ All n observations have the same value

™ The units of s are the same as the units of the data (for example,
mmHg)
Descriptive statistics: continuous variables

™ Example:

Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Median 144.50
Mode 155
Std. Deviation 35.312
Variance 1246.916
Range 202
Minimum 55
Maximum 257
Example: summarizing data
Descriptive statistics: continuous variables

™ Graphical representation:

™ Different types
¾ Histogram
Descriptive statistics: continuous variables

™ Construct a chart

¾ Title

¾ Values

¾ Frequency or relative frequency

¾ Properly labelled axes


Descriptive statistics: continuous variables
Shapes of the Distribution

™ Three common shapes of frequency distributions:

  A: Symmetrical and bell-shaped
  B: Positively skewed (skewed to the right)
  C: Negatively skewed (skewed to the left)
Shapes of Distributions

™ Symmetric (right and left sides are mirror images)

¾ Left tail looks like right tail
¾ Mean = Median = Mode (all three coincide at the center)
Shapes of Distributions

™ Left skewed (negatively skewed)

¾ Long left tail
¾ Mean < Median (the mean is pulled toward the tail: Mean, Median, Mode from left to right)
Shapes of Distributions

™ Right skewed (positively skewed)

¾ Long right tail
¾ Mean > Median (the mean is pulled toward the tail: Mode, Median, Mean from left to right)
Shapes of the Distribution

™ Three less common shapes of frequency distributions:

  A: Bimodal
  B: Reverse J-shaped
  C: Uniform
Probability
Probability

™ Definition:
¾ The likelihood that a given event will occur

¾ It ranges between 0 and 1:


ƒ 0 means the event cannot occur
ƒ 1 means the event is certain to occur
How do we calculate it?

™ Frequentist Approach:
¾ Probability: is the long term relative frequency

™ Thus, it is an idealization based on imagining what would happen to the relative frequencies in an indefinitely long series of trials
Application in medicine

™ How does probability apply in medicine?

™ Probability is the most important theory behind biostatistics

™ It is used at different levels


Descriptive

™ Example: a 4% chance of a patient dying after admission to the emergency department (from the previous example)
¾ What do we mean?
Out of every 100 patients admitted to the emergency department, 4
will die, whereas 96 will be discharged alive

™ Example: 1 in 1000 babies is born with a certain abnormality!

™ Incidence and prevalence


Associations

™ Example: the association between cigarette smoking and death after admission to the emergency department with an MI

Current cigarette smoking in association with death at discharge (Count)

                               Death at discharge
                               Death   Discharged   Total
Current cigarette    No          5        123        128
smoking              Yes         5        154        159
Total                           10        277        287

Probability of being a smoker = 159 / 287 = 55.4%
Probability of dying if a smoker = 5 / 159 = 3.1%
Probability of dying if a non-smoker = 5 / 128 = 3.9%
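These probabilities come straight from the 2 × 2 table; a small Python sketch (using the cell counts from the slide) makes the arithmetic explicit:

```python
# Cell counts from the slide: smoking status vs. death at discharge
deaths = {"smoker": 5, "non_smoker": 5}
totals = {"smoker": 159, "non_smoker": 128}

n = totals["smoker"] + totals["non_smoker"]                     # 287 patients in total
p_smoker = totals["smoker"] / n                                 # probability of being a smoker
p_die_smoker = deaths["smoker"] / totals["smoker"]              # 5 / 159, about 3.1%
p_die_nonsmoker = deaths["non_smoker"] / totals["non_smoker"]   # 5 / 128, about 3.9%
```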
Associations

™ Same is applied to:

¾ Relative risk

¾ Risk difference

¾ Attributable risk

¾ Odds ratio

¾ Etc…..
Bottom line

™ Probability is applied at all levels of statistical analyses


Probability distributions

™ Probability distributions list or describe probabilities for all possible


occurrences of a random variable

™ There are two types of probability distributions:

¾ Categorical distributions

¾ Continuous distributions
Probability distributions: categorical variables

™ Categorical variables
¾ Frequency distribution

¾ Other distributions, such as binomial


Probability distributions: continuous variables

™ Continuous variables
¾ Continuous distribution

¾ Such as Z and t distributions


Normal Distribution

™ Properties of a Normal Distribution

¾ Also called Gaussian distribution

¾ A continuous, bell-shaped, symmetrical distribution; both tails extend to infinity

¾ The mean, median, and mode are identical

¾ The shape is completely determined by the mean and


standard deviation
Normal Distribution

™ A normal distribution can have any µ and any σ, e.g. Age: µ = 40, σ = 10

™ The area under the curve represents 100% of all the observations

Mean
Median
Mode
Normal Distribution
Normal Distribution

Age distribution for a specific population

50% 50%

Mean=40
SD=10
Normal Distribution

Age distribution for a specific population

Age = 25 Mean=40
SD=10
Normal distribution

¾ The formula used to calculate the area below a certain point in a normal distribution is the probability density function of the normal distribution with mean µ and variance σ²:

  f(x) = (1 / (σ√(2π))) · e^(−(x − µ)² / (2σ²))
Normal distribution

¾ Thus, for any normal distribution, once we have the mean and sd,
we can calculate the percentage of subjects:
¾ Above a certain level
¾ Below a certain level
¾ Between different levels

¾ But the problem is:


¾ Calculation is very complicated and time consuming, so:
Standardized Normal Distribution

™ We standardize to a normal distribution

™ What does this mean?

™ For a specific distribution, we calculate all possible probabilities,


and record them in a table

™ A normal distribution with a µ = 0, σ = 1 is called a Standardized


Normal Distribution
Standardized Normal Distribution

Mean=0
SD=1
Area under the Normal Curve from 0 to X

X 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.00000 0.00399 0.00798 0.01197 0.01595 0.01994 0.02392 0.02790 0.03188 0.03586
0.1 0.03983 0.04380 0.04776 0.05172 0.05567 0.05962 0.06356 0.06749 0.07142 0.07535
0.2 0.07926 0.08317 0.08706 0.09095 0.09483 0.09871 0.10257 0.10642 0.11026 0.11409
0.3 0.11791 0.12172 0.12552 0.12930 0.13307 0.13683 0.14058 0.14431 0.14803 0.15173
0.4 0.15542 0.15910 0.16276 0.16640 0.17003 0.17364 0.17724 0.18082 0.18439 0.18793
0.5 0.19146 0.19497 0.19847 0.20194 0.20540 0.20884 0.21226 0.21566 0.21904 0.22240
0.6 0.22575 0.22907 0.23237 0.23565 0.23891 0.24215 0.24537 0.24857 0.25175 0.25490
0.7 0.25804 0.26115 0.26424 0.26730 0.27035 0.27337 0.27637 0.27935 0.28230 0.28524
0.8 0.28814 0.29103 0.29389 0.29673 0.29955 0.30234 0.30511 0.30785 0.31057 0.31327
0.9 0.31594 0.31859 0.32121 0.32381 0.32639 0.32894 0.33147 0.33398 0.33646 0.33891
1.0 0.34134 0.34375 0.34614 0.34849 0.35083 0.35314 0.35543 0.35769 0.35993 0.36214
1.1 0.36433 0.36650 0.36864 0.37076 0.37286 0.37493 0.37698 0.37900 0.38100 0.38298
1.2 0.38493 0.38686 0.38877 0.39065 0.39251 0.39435 0.39617 0.39796 0.39973 0.40147
1.3 0.40320 0.40490 0.40658 0.40824 0.40988 0.41149 0.41308 0.41466 0.41621 0.41774
1.4 0.41924 0.42073 0.42220 0.42364 0.42507 0.42647 0.42785 0.42922 0.43056 0.43189
1.5 0.43319 0.43448 0.43574 0.43699 0.43822 0.43943 0.44062 0.44179 0.44295 0.44408
1.6 0.44520 0.44630 0.44738 0.44845 0.44950 0.45053 0.45154 0.45254 0.45352 0.45449
1.7 0.45543 0.45637 0.45728 0.45818 0.45907 0.45994 0.46080 0.46164 0.46246 0.46327
1.8 0.46407 0.46485 0.46562 0.46638 0.46712 0.46784 0.46856 0.46926 0.46995 0.47062
1.9 0.47128 0.47193 0.47257 0.47320 0.47381 0.47441 0.47500 0.47558 0.47615 0.47670
2.0 0.47725 0.47778 0.47831 0.47882 0.47932 0.47982 0.48030 0.48077 0.48124 0.48169
2.1 0.48214 0.48257 0.48300 0.48341 0.48382 0.48422 0.48461 0.48500 0.48537 0.48574
2.2 0.48610 0.48645 0.48679 0.48713 0.48745 0.48778 0.48809 0.48840 0.48870 0.48899
2.3 0.48928 0.48956 0.48983 0.49010 0.49036 0.49061 0.49086 0.49111 0.49134 0.49158
2.4 0.49180 0.49202 0.49224 0.49245 0.49266 0.49286 0.49305 0.49324 0.49343 0.49361
2.5 0.49379 0.49396 0.49413 0.49430 0.49446 0.49461 0.49477 0.49492 0.49506 0.49520
2.6 0.49534 0.49547 0.49560 0.49573 0.49585 0.49598 0.49609 0.49621 0.49632 0.49643
2.7 0.49653 0.49664 0.49674 0.49683 0.49693 0.49702 0.49711 0.49720 0.49728 0.49736
2.8 0.49744 0.49752 0.49760 0.49767 0.49774 0.49781 0.49788 0.49795 0.49801 0.49807
2.9 0.49813 0.49819 0.49825 0.49831 0.49836 0.49841 0.49846 0.49851 0.49856 0.49861
3.0 0.49865 0.49869 0.49874 0.49878 0.49882 0.49886 0.49889 0.49893 0.49896 0.49900
3.1 0.49903 0.49906 0.49910 0.49913 0.49916 0.49918 0.49921 0.49924 0.49926 0.49929
3.2 0.49931 0.49934 0.49936 0.49938 0.49940 0.49942 0.49944 0.49946 0.49948 0.49950
3.3 0.49952 0.49953 0.49955 0.49957 0.49958 0.49960 0.49961 0.49962 0.49964 0.49965
3.4 0.49966 0.49968 0.49969 0.49970 0.49971 0.49972 0.49973 0.49974 0.49975 0.49976
3.5 0.49977 0.49978 0.49978 0.49979 0.49980 0.49981 0.49981 0.49982 0.49983 0.49983
3.6 0.49984 0.49985 0.49985 0.49986 0.49986 0.49987 0.49987 0.49988 0.49988 0.49989
3.7 0.49989 0.49990 0.49990 0.49990 0.49991 0.49991 0.49992 0.49992 0.49992 0.49992
3.8 0.49993 0.49993 0.49993 0.49994 0.49994 0.49994 0.49994 0.49995 0.49995 0.49995
3.9 0.49995 0.49995 0.49996 0.49996 0.49996 0.49996 0.49996 0.49996 0.49997 0.49997
4.0 0.49997 0.49997 0.49997 0.49997 0.49997 0.49997 0.49998 0.49998 0.49998 0.49998
Standardized Normal Distribution

Normal Distribution: Mean = µ, SD = σ
Standardized Normal Distribution (Z): Mean = 0, SD = 1

TRANSFORM: Z = (x − µ) / σ
Standardized Normal Distribution

Normal Distribution: Mean = 40, SD = 10
Standardized Normal Distribution (Z): Mean = 0, SD = 1

TRANSFORM: Z(40) = (x − µ) / σ = (40 − 40) / 10 = 0
Standardized Normal Distribution

Normal Distribution: Mean = 40, SD = 10, x = 30
Standardized Normal Distribution (Z): Mean = 0, SD = 1, z = −1

TRANSFORM: Z(30) = (x − µ) / σ = (30 − 40) / 10 = −1
Standardized Normal Distribution: summary

™ For any normal distribution, we can


™ Transform the values to the standardized normal distribution (Z)
™ Use the Z table to get the following areas
¾ Above a certain level
¾ Below a certain level
¾ Between different levels
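Instead of looking areas up in the printed table, the same transform-then-look-up idea can be computed directly; the sketch below uses math.erf (a standard-library route to the normal CDF, so no external packages are needed):

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma): transform to Z, then evaluate the standard normal CDF."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Age ~ Normal(mean = 40, SD = 10), as in the slides
below_30 = normal_cdf(30, 40, 10)                          # area below z = -1, about 0.159
between = normal_cdf(50, 40, 10) - normal_cdf(30, 40, 10)  # within 1 SD, about 0.68
```

The value 0.159 agrees with the Z table: the table gives the area from 0 to 1.0 as 0.34134, so the tail below z = −1 is 0.5 − 0.34134 ≈ 0.159.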
Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

30 68% 50

Mean – 1SD Mean + 1SD


Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

20 95% 60

Mean – 2SD Mean + 2SD


Normal Distribution

Age distribution for a specific population

Mean=40
SD=10

10 99.7% 70

Mean – 3SD Mean + 3SD


Practical example
Practical example
The 68-95-99.7 Rule for the Normal Distribution

™ 68% of the observations fall within one standard deviation of the


mean
™ 95% of the observations fall within two standard deviations of the
mean
™ 99.7% of the observations fall within three standard deviations of
the mean
™ When applied to ‘real data’, these estimates are considered
approximate!
Distributions of Blood Pressure

[Figure: The 68-95-99.7 rule applied to the distribution of systolic blood pressure in men. Mean = 125 mmHg, s = 14 mmHg; about 68% of observations fall between 111 and 139, 95% between 97 and 153, and 99.7% between 83 and 167.]
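The interval endpoints in that blood-pressure figure follow directly from mean ± k·SD; a quick check in Python:

```python
mean, s = 125, 14  # systolic blood pressure in men (mmHg), from the figure

for k, pct in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    lo, hi = mean - k * s, mean + k * s
    print(f"mean ± {k} SD: ({lo}, {hi})  holds about {pct} of observations")
```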
Data analyses

™ Data analyses:

¾ Descriptive statistics: are the techniques used to describe the main


features of a sample

¾ Inferential statistics: is the process of using the sample statistic to


make informed guesses about the value of a population parameter
Why do we carry out research?

population

sample

Inference: Drawing
conclusions on certain
questions about a
population from sample data
Inferential statistics

™ Since we are not taking the whole population, we have to draw


conclusions on the population based on results we get from the
sample

™ Simple example: Say we want to estimate the average systolic


blood pressure for patients admitted to the emergency department
after having an MI

™ Other more complicated measures might be quality of life,


satisfaction with care, risk of outcome, etc.
Inferential statistics

™ What do we do?

™ Take a sample (n=291) of patients admitted to emergency


department in a certain hospital

™ Calculate the mean and SD (descriptive statistics) of systolic blood


pressure
Statistics

Systolic blood pressure


N Valid 286
Missing 5
Mean 144.13
Std. Deviation 35.312
Inferential statistics

™ The next step is to make a link between the estimates we observed


from the sample and those of the underlying population (inferential
statistics)

™ What can we say about these estimates as compared to the


unknown true ones???

™ In other words, we are trying to estimate the average systolic blood pressure for ALL patients admitted to the emergency department after an MI
Inferential statistics

™ Sample data

N=291

Mean=144
SD=35
Inference

™ In statistical inference we usually encounter TWO issues

¾ Estimate value of the population parameter. This is done through


point estimate and interval estimate (Confidence Interval)

¾ Evaluate a hypothesis about a population parameter rather than


simply estimating it. This is done through tests of significance
known as hypothesis testing (P-value)
1- Confidence Interval
Confidence Intervals

™ A point estimate:
A single numerical value used to estimate a population parameter.

™ Interval estimate:
Consists of 2 numerical values defining a range of values that with
a specified degree of confidence includes the parameter being
estimated.
(Usually interval estimate with a degree of 95% confidence is
used)
Example

™ What is the average systolic blood pressure for patients admitted


to emergency departments after an MI?
™ Select a sample
™ Point estimate = mean = 144

™ Interval estimate = 95% CI = 144 ± 1.96 × (35 / √291) ≈ (140 – 148)

™ 95% Confidence Interval: x̄ ± z(1−α/2) × SE

- Upper limit = x̄ + z(1−α/2) × SE
- Lower limit = x̄ − z(1−α/2) × SE
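The 95% CI can be reproduced in a few lines. The sketch below uses the sample summary from the slides (n = 291, mean = 144, SD = 35) and the usual z value of 1.96:

```python
import math

n, mean, sd = 291, 144, 35
se = sd / math.sqrt(n)     # standard error of the mean, about 2.05
z = 1.96                   # z(1 - alpha/2) for 95% confidence

lower, upper = mean - z * se, mean + z * se
print(f"95% CI: ({lower:.0f}, {upper:.0f})")
```

Increasing n shrinks the standard error, and with it the interval width, which is the point made in the standard-error slide below.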
Sampling distribution of mean

N = 291

[Figure: sampling distribution of the mean, with 95% of the area between µ − 2SE and µ + 2SE]


Standard error

™ Standard error
¾ = sd / √n
¾ As sample size increases the standard error decreases
¾ The estimation, as measured by the confidence interval, will be better, i.e. a narrower confidence interval
Interpretation

™ 95% Confidence Interval

¾ There is 95% probability that the true parameter is within the


calculated interval

¾ Thus, if we repeat the sampling procedure 100 times, the above


statement will be:

ƒ correct 95 times (the true parameter is within the interval)

ƒ wrong 5 times (the true parameter is outside the interval); this error rate is also called the α error
Notes on Confidence Intervals

™ Interpretation
¾ It provides the level of confidence of the value for the population
average systolic blood pressure

™ Are all CIs 95%?


¾ No
¾ It is the most commonly used
¾ A 99% CI is wider
¾ A 90% CI is narrower
Notes on Confidence Intervals

™ To be “more confident” you need a bigger interval


¾ For a 99% CI, you need ± 2.6 SEM
¾ For a 95% CI, you need ± 2 SEM
¾ For a 90% CI, you need ± 1.65 SEM
2- P-value
Inference

™ P-value
¾ Is related to another type of inference
¾ Hypothesis testing
¾ Evaluate a hypothesis about a population parameter rather than
simply estimating it
Hypothesis testing

™ Back to our previous example

™ We want to make inference about the average systolic blood


pressure of patients admitted to emergency department after MI

™ Assume that the normal systolic blood pressure is 120

™ The question is whether the average systolic blood pressure for


patients admitted to emergency departments is different than the
normal, which is 120
Hypothesis testing

™ Two types of hypotheses:

¾ Null hypothesis: is a statement consistent with “no difference”

¾ Alternative hypothesis: is a statement that disagrees with the null


hypothesis, and is consistent with presence of “difference”
The logic of hypothesis testing

™ To decide which of the hypothesis is true


™ Take a sample from the population
™ If the data are consistent with the null hypothesis, then we do not
reject the null hypothesis (conclusion = “no difference”)
™ If the sample data are not consistent with the null hypothesis, then
we reject the null (conclusion = “difference”)
Hypothesis testing

™ Example: is the systolic blood pressure for patients admitted to


emergency department after an MI normal (ie =120)?
- Ho: µ = 120
- Ha: µ ≠ 120

™ How do we answer this question?

™ We take a sample and find that the mean is 144 mmHg

™ Can we consider that 144 is consistent with the normal value
(120 mmHg)?
Hypothesis testing

N = 291

[Figures: sampling distribution under Ho: µ = 120, with the observed mean of 144 marked]

It looks like it is consistent with the null hypothesis

Is it still consistent with the null hypothesis?
Hypothesis testing

N = 291

[Figure: sampling distribution under Ho: µ = 120, with 95% of the area between µ − 2SE and µ + 2SE and 2.5% in each tail]
Test statistic

™ It is the statistic used for deciding whether the null hypothesis


should be rejected or not

™ Used to calculate the probability of getting the observed results if


the null hypothesis is true.

™ This probability is called the p-value.


How to decide

™ We calculate the probability of obtaining a sample with mean of


144 if the true mean is 120 due to chance alone (p-value)
™ Based on p-value we make our decision:
¾ If the p-value is low then this is taken as evidence that it is unlikely
that the null hypothesis is true, then we reject the null hypothesis (we
accept alternative one)
¾ If the p-value is high, it indicates that most probably the null
hypothesis is true, and thus we do not reject the Ho
Problem!

™ We could be making the wrong decisions

Decision Ho True Ho False

Do not reject Ho Correct decision Type II error

Reject Ho Type I error Correct decision

™ Type I error: is rejecting the null hypothesis when it is true

™ Type II error: is not rejecting the null hypothesis when it is false


Error

™ Type I error:
¾ Referred to as α
¾ Probability of rejecting a true null hypothesis
™ Type II error:
¾ Referred to as β
¾ Probability of accepting a false null hypothesis
™ Power:
¾ Represented by 1-β
¾ Probability of correctly rejecting a false null hypothesis
Significance level

™ The significance level, α, of a hypothesis test is defined as the


probability of making a type I error, that is the probability of
rejecting a true null hypothesis
™ It could be set to any value, as:
¾ 0.05
¾ 0.01
¾ 0.1
Statistical significance

™ If the p-value is less than some pre-determined cutoff (e.g. 0.05), the result is called “statistically significant”

™ This cutoff is the α-level

™ The α-level is the probability of a type I error

™ It is the probability of falsely rejecting H0


Back to the example

™ To test whether the average systolic blood pressure for patients


admitted to the emergency department after an MI is different
than 120 (which is the normal blood pressure)

™ We carry out a test called “one sample t-test” which provides a p-


value based on which we accept or reject the null hypothesis.
Back to the example

One-Sample Statistics

                           N     Mean    Std. Deviation   Std. Error Mean
Systolic blood pressure   286   144.13       35.312            2.088

One-Sample Test (Test Value = 120)

                            t       df    Sig. (2-tailed)   Mean Difference   95% CI of the Difference
Systolic blood pressure   11.558   285         .000             24.133            (20.02, 28.24)

™ Since p-value is less than 0.05, then the conclusion will be that the
systolic blood pressure for patients admitted to emergency
department after an MI is significantly higher than the normal
value which is 120
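The SPSS output can be reproduced from the summary statistics alone; a sketch of the one-sample t statistic:

```python
import math

n, mean, sd = 286, 144.13, 35.312   # sample summary from the output above
mu0 = 120                           # hypothesized (normal) systolic blood pressure

se = sd / math.sqrt(n)              # standard error, about 2.088
t = (mean - mu0) / se               # about 11.56, matching the SPSS t of 11.558
df = n - 1                          # 285 degrees of freedom
```

With df = 285, the t table's bottom rows show that any t above about 3.3 already gives p < 0.001, so a t near 11.6 yields the Sig. of .000 reported by SPSS.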
p-values

™ p-values are probabilities (numbers between 0 and 1)

™ Small p-values mean that the sample results are unlikely when the
null is true

™ The p-value is the probability of obtaining a result as/or more


extreme than you did by chance alone assuming the null
hypothesis H0 is true
t-distribution

™ The t-distribution looks like a standard normal curve

™ A t-distribution is determined by its degrees of freedom (n-1), the


lower the degrees of freedom, the flatter and fatter it is

[Figure: standard normal curve N(0,1) together with t-distributions for 35 and 15 degrees of freedom; the lower the df, the flatter and fatter the curve]
ν 75% 80% 85% 90% 95% 97.5% 99% 99.5% 99.75% 99.9% 99.95%

1 1.000 1.376 1.963 3.078 6.314 12.71 31.82 63.66 127.3 318.3 636.6

2 0.816 1.061 1.386 1.886 2.920 4.303 6.965 9.925 14.09 22.33 31.60

3 0.765 0.978 1.250 1.638 2.353 3.182 4.541 5.841 7.453 10.21 12.92

4 0.741 0.941 1.190 1.533 2.132 2.776 3.747 4.604 5.598 7.173 8.610

5 0.727 0.920 1.156 1.476 2.015 2.571 3.365 4.032 4.773 5.893 6.869

6 0.718 0.906 1.134 1.440 1.943 2.447 3.143 3.707 4.317 5.208 5.959

7 0.711 0.896 1.119 1.415 1.895 2.365 2.998 3.499 4.029 4.785 5.408

8 0.706 0.889 1.108 1.397 1.860 2.306 2.896 3.355 3.833 4.501 5.041

9 0.703 0.883 1.100 1.383 1.833 2.262 2.821 3.250 3.690 4.297 4.781

10 0.700 0.879 1.093 1.372 1.812 2.228 2.764 3.169 3.581 4.144 4.587

11 0.697 0.876 1.088 1.363 1.796 2.201 2.718 3.106 3.497 4.025 4.437

12 0.695 0.873 1.083 1.356 1.782 2.179 2.681 3.055 3.428 3.930 4.318

13 0.694 0.870 1.079 1.350 1.771 2.160 2.650 3.012 3.372 3.852 4.221

14 0.692 0.868 1.076 1.345 1.761 2.145 2.624 2.977 3.326 3.787 4.140

15 0.691 0.866 1.074 1.341 1.753 2.131 2.602 2.947 3.286 3.733 4.073

16 0.690 0.865 1.071 1.337 1.746 2.120 2.583 2.921 3.252 3.686 4.015

17 0.689 0.863 1.069 1.333 1.740 2.110 2.567 2.898 3.222 3.646 3.965

18 0.688 0.862 1.067 1.330 1.734 2.101 2.552 2.878 3.197 3.610 3.922

19 0.688 0.861 1.066 1.328 1.729 2.093 2.539 2.861 3.174 3.579 3.883

20 0.687 0.860 1.064 1.325 1.725 2.086 2.528 2.845 3.153 3.552 3.850

100 0.677 0.845 1.042 1.290 1.660 1.984 2.364 2.626 2.871 3.174 3.390

120 0.677 0.845 1.041 1.289 1.658 1.980 2.358 2.617 2.860 3.160 3.373

∞ 0.674 0.842 1.036 1.282 1.645 1.960 2.326 2.576 2.807 3.090 3.291
Hypothesis Testing

™ Different types of hypothesis:


¾ Mean (a) = Mean (b)
¾ Proportion (a) = Proportion (b)
¾ Variance (a) = Variance (b)
¾ OR = 1
¾ RR = 1
¾ RD = 0
¾ Test of homogeneity
¾ Etc..
Example
Comparing two means: paired testing

™ In the previous example, is the heart rate at admission different


than the heart rate at discharge among the patients admitted to the
emergency department after an MI?
Statistics

                    Heart Rate at admission   Heart Rate at discharge
N       Valid               286                        77
        Missing               5                       214
Mean                        82.64                    76.99
Std. Deviation              22.598                   17.900

™ Is this decrease in heart rate statistically significant?

™ Thus, we have to make inference….


Comparing two means: paired testing

™ What type of test to be used?

™ Since the measurements of the heart rate at admission and at


discharge are dependent on each other (not independent), another
type of test is used

™ Paired t-test
Comparing two means: paired testing

Paired Samples Statistics

                                   Mean    N    Std. Deviation   Std. Error Mean
Pair 1   Heart Rate at admission   81.16   75       23.546            2.719
         Heart Rate at discharge   76.72   75       17.973            2.075

Paired Samples Test (Heart Rate at admission − Heart Rate at discharge)

Paired Differences:
  Mean = 4.440, Std. Deviation = 25.302, Std. Error Mean = 2.922
  95% CI of the Difference: (−1.381, 10.261)
  t = 1.520, df = 74, Sig. (2-tailed) = .133

95% CI = 4.4 ± 1.96 × 2.9

H0: µb − µa = 0
HA: µb − µa ≠ 0

P-value = 0.133, thus no significant difference
How Are p-values Calculated?

t = (sample mean difference − 0) / SEM

t = 4.4 / 2.9 = 1.52

The value t = 1.52 is called the test statistic

Then we can compare the t-value to the table and get the
p-value, or get it from the computer (0.13)
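The same arithmetic for the paired test, starting from the paired-differences summary in the output (mean difference 4.440, SD 25.302, n = 75):

```python
import math

mean_diff, sd_diff, n = 4.440, 25.302, 75   # paired differences (admission - discharge)

se = sd_diff / math.sqrt(n)   # standard error of the mean difference, about 2.92
t = mean_diff / se            # about 1.52, the test statistic from the slide
df = n - 1                    # 74 degrees of freedom
```

Looking at the t-table row for df near 75, a t of 1.52 falls well below the 97.5% column value of about 1.99, which is why the two-tailed p-value (0.133) exceeds 0.05.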
Interpreting the p-value

™ The p-value in the example is 0.133

¾ Interpretation: If there is no difference in heart rate between admission and discharge from the emergency department, then the chance of finding a mean difference as extreme or more extreme than 4.4 in a sample of 75 paired patients is 0.133

¾ Thus, this probability is big (bigger than 0.05), which leads to saying that the difference of 4.4 is due to chance
Notes

™ How to decide on significance from the 95% CI?

™ 3 scenarios

-15 -10 -5 0 5 10 15

-15 -10 -5 0 5 10 15

-15 -10 -5 0 5 10 15
Comparing two means: Independent sample testing

™ In the previous example, is the systolic blood pressure different


between males and females among the patients admitted to the
emergency department after an MI?
Group Statistics

Std. Error
Sex N Mean Std. Deviation Mean
Systolic blood pressure Male 240 145.05 35.162 2.270
Female 44 138.64 35.753 5.390

™ Is this difference in systolic blood pressure statistically significant?

™ Thus, we have to make inference….


Comparing two means: Independent sample testing

™ Null hypothesis:
¾ Ho: Mean SBP(Males) = Mean SBP (Females)
¾ Ho: Mean SBP (Males) - Mean SBP (Females) = 0

™ Alternative hypothesis:
¾ Ha: Mean SBP(Males) ≠ Mean SBP (Females)
¾ Ha: Mean SBP(Males) - Mean SBP (Females) ≠ 0
Comparing two means: Independent sample testing

™ Thus, we carry out a test called: independent samples t-test


™ Formula to use is:
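The formula image did not survive extraction; for reference, a sketch of the standard pooled-variance (equal variances) version of the independent-samples t statistic is:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
```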
Comparing two means: Independent sample testing

™ What we need to know is that we can calculate a p-value out of the


t-test (based on the t-distribution)

™ Based on this p-value, make the decision:


¾ P-value > 0.05, then do not reject the null (the two means are equal)

¾ P-value < 0.05, then reject the null (the two means are different)
Comparing two means: Independent sample testing

Group Statistics

Std. Error
Sex N Mean Std. Deviation Mean
Systolic blood pressure Male 240 145.05 35.162 2.270
Female 44 138.64 35.753 5.390

Independent Samples Test

Levene's Test for


Equality of Variances t-test for Equality of Means
95% Confidence
Interval of the
Mean Std. Error Difference
F Sig. t df Sig. (2-tailed) Difference Difference Lower Upper
Systolic blood pressure Equal variances
.044 .835 1.109 282 .269 6.409 5.781 -4.970 17.789
assumed
Equal variances
1.096 59.267 .278 6.409 5.848 -5.292 18.111
not assumed

Two formulas for the calculation of the t-test:
1- when variances are equal
2- when variances are not equal

To know which one to use, Levene's test (hypothesis test):
Ho: variance(males) = variance(females)
Ha: variance(males) ≠ variance(females)

1- If p-value > 0.05 then variances are equal (use the "equal variances assumed" row)

2- If p-value < 0.05 then variances are not equal (use the "equal variances not assumed" row)
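The SPSS result can be reproduced from the group summary statistics alone (a sketch using scipy's ttest_ind_from_stats; the means, SDs, and Ns are those in the Group Statistics table):

```python
from scipy import stats

# Group summary statistics from the SPSS output
# (systolic blood pressure by sex)
t_eq, p_eq = stats.ttest_ind_from_stats(
    mean1=145.05, std1=35.162, nobs1=240,   # males
    mean2=138.64, std2=35.753, nobs2=44,    # females
    equal_var=True)                         # pooled-variance t-test

t_uneq, p_uneq = stats.ttest_ind_from_stats(
    mean1=145.05, std1=35.162, nobs1=240,
    mean2=138.64, std2=35.753, nobs2=44,
    equal_var=False)                        # Welch's t-test

print(round(t_eq, 2), round(p_eq, 2))      # pooled: t ≈ 1.11, p ≈ 0.27
print(round(t_uneq, 2), round(p_uneq, 2))  # Welch:  t ≈ 1.10, p ≈ 0.28
```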
Example
T-test

Ho: Mean1 = Mean2
Ha: Mean1 ≠ Mean2

T-test: P-value = 0.89 → No significant difference
Chi square
Example

™ In the MI example, we would like to check if hypertension is


associated with gender.

™ In other words, are males at higher or lower risk of having


hypertension?

Sex * Hypertension Crosstabulation

Count
Hypertension
No Yes Total
Sex Male 191 52 243
Female 24 20 44
Total 215 72 287
Example

Sex * Hypertension Crosstabulation

Hypertension
No Yes Total
Sex Male Count 191 52 243
% within Sex 78.6% 21.4% 100.0%
Female Count 24 20 44
% within Sex 54.5% 45.5% 100.0%
Total Count 215 72 287
% within Sex 74.9% 25.1% 100.0%
Example

™ To answer the question, we do a hypothesis test:


¾ H0: P1 = P2 (P1 - P2 = 0)

¾ Ha: P1 ≠ P2 (P1 - P2 ≠ 0)

™ (Pearson’s) Chi-Square Test (χ2)


¾ Calculation is easy (can be done by hand)

™ Works well for big sample sizes

™ Can be extended to compare proportions between more than two


independent groups in one test
The Chi-Square Approximate Method

χ² = Σ (O − E)² / E    (summed over the 4 cells)
™ Looks at discrepancies between observed and expected cell counts
™ Expected refers to the values for the cell counts that would be
expected if the null hypothesis is true
O = observed

E = expected = (row total × column total) / grand total
The Chi-Square Approximate Method

™ The distribution of this statistic, when the null hypothesis is true,

is a chi-square distribution with one degree of freedom

™ We can use this to determine how likely it was to get such a big

discrepancy between the observed and expected by chance alone


Distribution: Chi-Square with One Degree of Freedom

[Plot of the chi-square distribution with one degree of freedom; χ² = 3.84 ↔ p = 0.05]
Example of Calculations of
Chi-Square 2x2 Contingency Table

™ Test statistic: χ² = Σ (O − E)² / E    (summed over the 4 cells)
Chi-Square Tests

Asymp. Sig. Exact Sig. Exact Sig.


Value df (2-sided) (2-sided) (1-sided)
Pearson Chi-Square 11.471b 1 .001
Continuity Correctiona 10.227 1 .001
Likelihood Ratio 10.366 1 .001
Fisher's Exact Test .001 .001
Linear-by-Linear
11.431 1 .001
Association
N of Valid Cases 287
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 11.04.

χ² = 11.471
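The Pearson chi-square value can be checked directly from the crosstab counts (a sketch using scipy):

```python
from scipy import stats

# Observed counts from the Sex * Hypertension crosstab
#            No   Yes
observed = [[191, 52],   # Male
            [24,  20]]   # Female

chi2, p, df, expected = stats.chi2_contingency(observed, correction=False)

print(round(chi2, 3), df)         # chi-square ≈ 11.471 with 1 df
print(round(p, 3))                # p ≈ 0.001
print(round(expected[1][1], 2))   # minimum expected count ≈ 11.04
```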
Sampling Distribution: Chi-Square with One Degree of
Freedom

[Plot of the chi-square distribution with one degree of freedom]
Example

™ The value that corresponds to 95% confidence, or 5% error, for one
degree of freedom is 3.841.

™ Thus we reject Ho since 11.471 is > 3.841

™ We conclude that Ho is false and that there is a relationship

between gender and diagnosis with hypertension

™ The p-value is = 0.001


Chi-square

Ho: Proportion1 = Proportion2
Ha: Proportion1 ≠ Proportion2

ChiSquare: P-value = 0.96 → No significant difference
Relative Risk (RR):

™ Study the association between Vioxx use and Myocardial Infarction

MI
Yes No
Vioxx 71 52
Drug
Placebo 29 48

Ho: RR = 1
RR=1.5, 95% CI = (1.1 - 1.9) (p-value = 0.01)
Ha: RR ≠ 1
Significant association
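The RR can be computed from the 2×2 table; the sketch below uses the standard log-scale Wald interval, which gives the same lower limit as the slide (the slide's rounded upper limit may come from a different method):

```python
import math

# 2x2 table: Vioxx vs placebo by MI status
a, b = 71, 52   # Vioxx:   MI yes, MI no
c, d = 29, 48   # Placebo: MI yes, MI no

risk_exposed   = a / (a + b)          # risk of MI on Vioxx
risk_unexposed = c / (c + d)          # risk of MI on placebo
rr = risk_exposed / risk_unexposed    # relative risk ≈ 1.53

# 95% CI on the log scale (large-sample Wald formula)
se_log_rr = math.sqrt(1/a - 1/(a + b) + 1/c - 1/(c + d))
lo = math.exp(math.log(rr) - 1.96 * se_log_rr)
hi = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(round(rr, 2), round(lo, 2), round(hi, 2))
# The interval excludes 1, so the association is significant
```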
Notes

™ How to decide on significance from the 95% CI?

™ 3 scenarios

[Three number lines from 0 to 7, showing 95% CIs for an RR that lie entirely below 1, include 1, and lie entirely above 1; only an interval that excludes 1 indicates a significant association]
Example

™ We would like to check if there is an association between gender


and both Hypertension and diabetes combined.
Sex * Hypterension and Diabetes combined Crosstabulation

Hypterension and Diabetes


combined
Either HT Both HT
None or DM and DM Total
Sex Male Count 145 67 28 240
% within Sex 60.4% 27.9% 11.7% 100.0%
Female Count 13 12 19 44
% within Sex 29.5% 27.3% 43.2% 100.0%
Total Count 158 79 47 284
% within Sex 55.6% 27.8% 16.5% 100.0%

™ Ho: gender and the combined hypertension/diabetes variable are independent.


™ Ha: the two variables are not independent.

™ Ho: P1 = P2 = P3
™ Ha: at least two of the proportions are not equal
Example

™ Conclusion
Sex * Hypterension and Diabetes combined Crosstabulation

Hypterension and Diabetes


combined
Either HT Both HT
None or DM and DM Total
Sex Male Count 145 67 28 240
% within Sex 60.4% 27.9% 11.7% 100.0%
Female Count 13 12 19 44
% within Sex 29.5% 27.3% 43.2% 100.0%
Total Count 158 79 47 284
% within Sex 55.6% 27.8% 16.5% 100.0%

Chi-Square Tests

Asymp. Sig.
Value df (2-sided)
Pearson Chi-Square 28.691a 2 .000
Likelihood Ratio 24.336 2 .000
Linear-by-Linear
25.341 1 .000
Association
N of Valid Cases 284
a. 0 cells (.0%) have expected count less than 5. The
minimum expected count is 7.28.
Example

™ The value that corresponds to 95% confidence, or 5% error, for two
degrees of freedom is 5.991.

™ Thus we reject Ho since 28.691 is > 5.991

™ We conclude that Ho is false and that there is a relationship

between gender and diagnosis with hypertension and/or diabetes

™ The p-value is < 0.0001
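The same check works for the 2×3 table, which has (2−1)×(3−1) = 2 degrees of freedom (a sketch using scipy):

```python
from scipy import stats

# Sex * (Hypertension and Diabetes combined) crosstab
#            None  Either  Both
observed = [[145,  67,     28],   # Male
            [13,   12,     19]]   # Female

chi2, p, df, expected = stats.chi2_contingency(observed)

print(round(chi2, 3), df)        # chi-square ≈ 28.691 with 2 df
print(round(expected.min(), 2))  # minimum expected count ≈ 7.28
# p < 0.0001, so the association is significant
```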


ANOVA
The problem

™ We have samples from a number of independent groups.

™ We have a single numerical or ordinal variable and are interested

in whether the values of the variable vary between the groups.

™ Example: Does systolic blood pressure vary between men of

different smoking status?


The problem

™ One-way ANOVA can answer the question by comparing the

group means.

™ So the null and alternative hypotheses are:


H0: all group means in the population are equal

HA : at least two of the means are not equal

™ ANOVA is an extension of the comparison of 2 independent groups.

™ But the 2-group technique cannot simply be applied repeatedly.


The problem

- If 5 groups are available, then 10 two-group t-tests would have to be performed.

- The high Type I error rate, resulting from the large number of

comparisons, means that we may draw incorrect conclusions.


Assumptions

Analysis of variance requires the following assumptions:

™ Independent random samples have been taken from each


population.

™ The populations are normal.

™ The population variances are all equal.


The ANOVA Table

™ ANOVA table summaries the calculation needed to test the main


hypothesis.

Sources   df      SS           MS                                  F

Factor    k − 1   SS(factor)   MS(factor) = SS(factor) / (k − 1)   MS(factor) / MS(error)

Error     n − k   SS(error)    MS(error) = SS(error) / (n − k)
___________________________________________________________
Total     n − 1   SS(total)
Rationale

™ One-way ANOVA separates the total variability, SS(total), in the

data into:
¾ Differences between the individuals from the different groups
(between-group variation), SS(factor)

¾ The random variation between the individuals within each group

(within-group variation), SS(error), also called unexplained variation
Rationale

™ These components of variation are measured using variances,


hence the name analysis of variance (ANOVA).

™ Under the null hypothesis that the group means are the same, the
between-group variance, MS(factor), will be similar to the
within-group variance, MS(error).

™ The test is based on the ratio of these two variances.

™ If there are differences between-groups, then between-groups


variance will be larger than within-group variance.
Example

™ A new variable is created which combines diagnosis with


Hypertension and Diabetes together as follows:

Hypterension and Diabetes combined

Cumulative
Frequency Percent Valid Percent Percent
Valid None 159 54.6 55.6 55.6
Either HT or DM 80 27.5 28.0 83.6
Both HT and DM 47 16.2 16.4 100.0
Total 286 98.3 100.0
Missing System 5 1.7
Total 291 100.0
Example

™ We would like to check whether the systolic blood pressure is the


same for the three groups defined by their HT and DM status.

™ Ho: Mean1 = Mean2 = Mean3


™ Ha: at least two of the means are not equal

Hypterension and Diabetes combined

Cumulative
Frequency Percent Valid Percent Percent
Valid None 159 54.6 55.6 55.6
Either HT or DM 80 27.5 28.0 83.6
Both HT and DM 47 16.2 16.4 100.0
Total 286 98.3 100.0
Missing System 5 1.7
Total 291 100.0
Example
Descriptives

Systolic blood pressure


95% Confidence Interval for
Mean
N Mean Std. Deviation Std. Error Lower Bound Upper Bound Minimum Maximum
None 155 144.52 32.789 2.634 139.32 149.73 78 248
Either HT or DM 79 142.97 39.634 4.459 134.10 151.85 56 257
Both HT and DM 47 146.55 36.360 5.304 135.88 157.23 55 235
Total 281 144.43 35.319 2.107 140.28 148.57 55 257

ANOVA

Systolic blood pressure


Sum of
Squares df Mean Square F Sig.
Between Groups 380.517 2 190.259 .152 .859
Within Groups 348908.2 278 1255.066
Total 349288.8 280

™ Since the p-value (0.859) is bigger than 0.05, we do not reject Ho:


the average systolic blood pressures for the three groups are the same.
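The F statistic and its p-value follow directly from the sums of squares in the ANOVA table (a sketch using scipy):

```python
from scipy import stats

# Values from the SPSS ANOVA table above
ss_between, df_between = 380.517, 2     # Between Groups
ss_within,  df_within  = 348908.2, 278  # Within Groups

ms_between = ss_between / df_between    # mean square between ≈ 190.26
ms_within  = ss_within / df_within      # mean square within ≈ 1255.07
f = ms_between / ms_within              # F ratio ≈ 0.152

# p-value from the F distribution with (2, 278) degrees of freedom
p = stats.f.sf(f, df_between, df_within)

print(round(f, 3), round(p, 3))         # matches SPSS: 0.152, 0.859
```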
Bivariate analyses

                          DEPENDENT (outcome)

                          2 LEVELS            > 2 LEVELS          CONTINUOUS

INDEPENDENT  2 LEVELS     X2 (chi square)     X2 (chi square)     t-test

(exposure)   > 2 LEVELS   X2 (chi square)     X2 (chi square)     ANOVA

             CONTINUOUS   t-test              ANOVA               - Correlation
                                                                  - Linear regression
New scenario

™ If both the dependent and independent variables are continuous, then

we can use neither the t-test nor the chi-square test.
Regression and Correlation

™ Describing association between two continuous variables

¾ Scatterplot

¾ Correlation coefficient

¾ Simple linear regression


Correlation
Correlation

™ It is a measure of linear correlation

™ Called Pearson correlation coefficient (r)

™ Ranges between:

™ +1.0 (perfect positive correlation)

™ -1.0 (perfect negative correlation)


Scatter plot and correlation
The Correlation Coefficient (r)

™ Measures the direction and strength of the linear association


between x and y
™ The correlation coefficient is between -1 and +1
¾ r>0 Positive association
¾ r<0 Negative association
¾ r=0 No association
[Four scatter plots of Y against X illustrating r = 0.01, r = 0.68, r = 0.98, and r = −0.9]
Correlation in the Plasma Example

[Scatter plot: Y, plasma volume (liters) against X, body weight (kg); r = 0.76]


Correlation

™ Study the association between Heart Rate and Systolic Blood


Pressure
Ho: Correlation = 0
Ha: Correlation ≠ 0

[Scatter plot of the association between heart rate at admission (x-axis) and systolic blood pressure (y-axis)]

Correlation = 0.190
P-value = 0.001

Correlations

                                                Systolic blood   Heart Rate at
                                                pressure         admission
Systolic blood pressure   Pearson Correlation   1                .190**
                          Sig. (2-tailed)                        .001
                          N                     286              285
Heart Rate at admission   Pearson Correlation   .190**           1
                          Sig. (2-tailed)       .001
                          N                     285              286
**. Correlation is significant at the 0.01 level (2-tailed).

Significant correlation
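The p-value for r can be reproduced from r and n alone, using t = r·√(n−2)/√(1−r²) (a sketch; r = 0.190 and n = 285 complete pairs, as in the output above):

```python
import math
from scipy import stats

r, n = 0.190, 285   # Pearson correlation and number of complete pairs

# Test statistic for Ho: correlation = 0
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Two-tailed p-value from the t-distribution with n - 2 df
p = 2 * stats.t.sf(abs(t), n - 2)

print(round(t, 2), round(p, 3))   # t ≈ 3.26, p ≈ 0.001
```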
Problem

™ Important to note that correlation measures the strength of linear association

™ There could be a strong non-linear relationship between y and x, and r may not catch it

[Scatter plot of a strong non-linear relationship between Y and X with r = 0]
Correlation Coefficient

™ Outliers can really affect the correlation coefficient

™ One extreme point can change r sizably

[Scatter plot in which a single extreme point produces r = 0.7]
Simple linear regression
Simple linear regression

™ Used to quantify the association between two variables

™ It is “simple” in the sense of having only one independent variable

™ The association is assumed to be linear in nature

™ Formula: Dependent = β0 + β1 (independent)


The Equation of a Line

™ β0 and β1 are called regression coefficients

™ These two quantities are estimated by the “least squares” method

™ The intercept β0 is the estimated expected value of y when x is 0

™ The slope β1 is the estimated expected change in y corresponding

to a unit increase in x
The Slope

™ The slope β1 is the expected change in y corresponding to a unit

increase in x

¾ β1 = 0 No association between y and x

¾ β1 > 0 Positive association (as x increases y tends to increase)

¾ β1 < 0 Negative association (as x increases y tends to decrease)


The Equation of a Line

[Graph of the fitted line ŷ = b̂0 + b̂1·x, with intercept b̂0 where the line crosses the y-axis and slope b̂1 as the change in y per unit of x]
The Slope

[Three lines through the same axes illustrating β1 > 0, β1 = 0, and β1 < 0]
Simple linear regression

™ Systolic blood pressure and age

Model Summary

Adjusted Std. Error of


Model R R Square R Square the Estimate
1 .054a .003 -.001 35.387
a. Predictors: (Constant), Age

™ Correlation:
R = 0.054
Simple linear regression
Simple linear regression

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 136.400 8.812 15.479 .000
Age .148 .162 .054 .910 .364
a. Dependent Variable: Systolic blood pressure

™ Simple linear regression:


SBP = 136.400 + 0.148 (Age)
If age = 0, then SBP = 136.400 + 0 = 136.400
As age increases by 1 year, SBP increases by 0.148 units
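The fitted equation can be written as a small prediction function (a sketch; the function name is illustrative):

```python
# Coefficients from the SPSS output (Constant and Age)
b0, b1 = 136.400, 0.148

def predict_sbp(age):
    """Predicted systolic blood pressure from the fitted line."""
    return b0 + b1 * age

print(predict_sbp(0))                             # 136.4 (the intercept)
print(round(predict_sbp(61) - predict_sbp(60), 3))  # 0.148 per extra year
```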
Simple Linear Regression

™ How do we decide if there is significant association between age


and SBP?
™ Hypothesis test
Ho: β1 = 0
Ha: β1 ≠ 0
™ SBP = β0 + β1 (Age)
™ If reject Ho, then as age changes, SBP changes significantly
™ If Ho is not rejected, then as age changes, there is no effect on SBP
Multiple Linear Regression

™ The important aspect of linear regression is that we can include


more than 1 independent variable
™ This is to control for the effect of another variable
™ Study the association between Age and SBP while controlling for
gender
™ SBP = β0 + β1 (Age) + β2 (Gender)
Multiple Linear Regression

Coefficientsa

Unstandardized Standardized
Coefficients Coefficients
Model B Std. Error Beta t Sig.
1 (Constant) 143.090 9.742 14.688 .000
Age .216 .171 .080 1.261 .208
Sex -8.992 6.123 -.093 -1.469 .143
a. Dependent Variable: Systolic blood pressure

™ Multiple linear regression:


SBP = 143.090 + 0.216 (Age) − 8.992 (Gender)
As age increases by 1 year, SBP increases by 0.216 units
after adjusting for gender
The difference in SBP between males and females is 8.992 units
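The multiple-regression equation works the same way (a sketch; the 0/1 coding of Gender is an assumption, since the output does not show which sex is coded 1):

```python
# Coefficients from the SPSS output (Constant, Age, Sex)
b0, b_age, b_sex = 143.090, 0.216, -8.992

def predict_sbp(age, sex):
    """Predicted SBP; `sex` assumed coded 0/1 (which group is 1 is
    not shown in the output)."""
    return b0 + b_age * age + b_sex * sex

# Holding age fixed, the two groups differ by 8.992 units
diff = predict_sbp(60, 0) - predict_sbp(60, 1)
print(round(diff, 3))   # 8.992
```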
Choosing the right statistical test
Choosing a statistical test

™ Choosing the right statistical test depends on:


¾ Nature of the data
¾ Sample characteristics
¾ Inferences to be made
Choosing a statistical test

™ A consideration of the nature of data includes:

™ Number of variables

¾ not for entire study, but for the specific question at hand

™ Type of data

¾ numerical, continuous

¾ dichotomous, categorical information


Choosing a statistical test

™ A consideration of the sample characteristics includes:

™ Number of groups

™ Sample type

¾ normal distribution (parametric) or not (non-parametric)

¾ independent or dependent
Choosing a statistical test

™ A consideration of the inferences to be made includes:

™ Data represent the population

™ The group means are different

™ There is a relationship between variables


Choosing a statistical test

™ Before choosing a statistical test, ask:

™ How many variables?

™ How many groups?

™ Is the distribution of data normal?

™ Are the samples (groups) independent?

™ What is your hypothesis or research question?

™ Is the data continuous, ordinal, or categorical?


Descriptive analyses

Type of variable Measure

Categorical Proportion (%)

Continuous
Mean (SD)
(Normal)
Continuous - Median
(Not Normal) - Inter-quartile range
Different types of statistics

™ Parametric vs non-parametric analyses


¾ Parametric:
ƒ Assume data follows a specific probability distribution

ƒ More powerful

¾ Non-parametric:
ƒ Also called distribution free

ƒ No assumptions required about the distribution of the data

ƒ Less powerful, but more robust


Univariate analyses

Type of variable Measure

Categorical Z proportions

Continuous
T-test
(Normal)

Continuous - n > 30 → t-test


(Not Normal) - n < 30 → Kolmogorov-Smirnov Test
Bivariate analyses

Type of
2 levels > 2 levels Continuous
variable

2 levels Chi squared Chi squared T-test

> 2 levels Chi squared Chi squared Anova

- Correlation
Continuous T-test Anova
- linear regression
Bivariate analyses

Type of
variable      2 levels             > 2 levels          Continuous

2 levels      - Fisher’s test      Fisher’s test       - Mann-Whitney
              - McNemar’s test                         - Wilcoxon test

> 2 levels    Fisher’s test        Fisher’s test       - Kruskal-Wallis
                                                       - Friedman test

Continuous    - Mann-Whitney       - Kruskal-Wallis    - Correlation
              - Wilcoxon test      - Friedman test     - Regression
Multivariate analyses

Type of outcome variable     Measure

Categorical (2 levels)       Logistic regression

Categorical (> 2 levels)     Multinomial regression

Continuous                   Linear regression
Overview

              Measurement   Ordinal or Measurement   Binomial   Survival Time
              (Gaussian)    (Non-Gaussian)
Describe one group Mean, SD Median, interquartile Proportion Kaplan Meier survival
range curve
Compare two unpaired Unpaired t test Mann-Whitney test Fisher's test Log-rank test or
groups Chi-square Mantel-Haenszel*

Compare two paired groups Paired t test Wilcoxon test McNemar's test Conditional
proportional hazards
regression*

Compare three or more One-way ANOVA Kruskal-Wallis test Chi-square test Cox regression
unmatched groups
Compare three or more Repeated-measures Friedman test Cochrane Q** Conditional
matched groups ANOVA proportional hazards
regression*

Quantify association between Pearson correlation Spearman correlation Contingency


two variables coefficients**
Predict value from another Simple linear Nonparametric Simple logistic Cox regression
measured variable regression regression** regression*

Predict value from several Multiple linear Multiple logistic Cox regression
measured or binomial regression* regression*
variables
Sample size calculation
Sample size and power calculation

™ Important step in designing a study

™ If it is not done, then sample size might be high or low:

¾ If it is low: lack precision to provide reliable answers

¾ If it is high: resources will be wasted for minimal gain


Sample size and power calculation

™ This step addresses two questions:

¾ How precise will my parameter estimates tend to be if I select a

particular sample size?

¾ How big a sample do I need to attain a desirable level of precision?


Sample size and power calculation: example

™ A cross-sectional survey of the prevalence of diabetes (diagnosed

or undiagnosed) among native Americans would require a sample

size of 1421 to allow estimation of the prevalence within a

precision of ±0.02 with 90% confidence, assuming a true

prevalence no larger than 30%.


Sample size and power calculation

™ Should be done at the DESIGN stage, i.e. before data is collected

™ Drives the whole study

™ To determine the sample size:

¾ Objectives should be clearly defined

¾ Main exposure and outcome should be specified

¾ Analyses plan should be clarified


Sample size and power calculation

™ Different equations are used:

™ Depends on:

¾ Study design

¾ Objectives (prevalence, risk, etc.)

¾ Types of variables

™ Following is an example of sample size calculation for comparing

the means in two groups


Sample size and power calculation: example

™ A randomized clinical trial of a new drug treatment vs. placebo

for decreasing blood pressure would require 126 patients for a

two-sided test at α = 0.05 to provide 80% power to detect a 5%

difference in blood pressure.


Sample size calculation: comparing two means

N ≈ 2 × SD² × (zα + zβ)² / Difference²

™ N = the number of subjects in each group

™ α = level of significance (error)

™ 1 - β = power

™ Difference = Minimal significant difference


Sample size calculation: comparing two means

™ N = the number of subjects in each group

¾ ↑ N = more power or less α

¾ ↓ N = less power or more α


Sample size calculation: comparing two means

™ α = level of significance (error)

¾ ↑ α = more power or smaller N

¾ ↓ α = less power or larger N


Sample size calculation: comparing two means

™ 1 - β = power

¾ ↑ 1 - β = less α or larger N

¾ ↓ 1 - β = more α or smaller N
Sample size calculation: comparing two means

™ Difference = Minimal significant difference

¾ ↑ Difference = larger power or smaller N

¾ ↓ Difference = smaller power or larger N


Sample size calculation: comparing two means

™ N = to be found

™ α = level of significance (error) = 0.05 or 5%

™ 1 - β = power = 0.80 or 80%

™ Difference = Minimal significant difference
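Plugging typical values into the formula reproduces the earlier example's 126 patients (a sketch; the SD of 10 units is an assumption not stated on the slide, chosen because it is consistent with that example):

```python
import math
from scipy import stats

alpha, power = 0.05, 0.80
sd = 10          # assumed SD of blood pressure (not given on the slide)
difference = 5   # minimal difference worth detecting

z_alpha = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96 (two-sided)
z_beta  = stats.norm.ppf(power)           # ≈ 0.84

# N per group ≈ 2 * SD^2 * (z_alpha + z_beta)^2 / Difference^2
n_per_group = math.ceil(2 * sd**2 * (z_alpha + z_beta)**2 / difference**2)

print(n_per_group, 2 * n_per_group)   # 63 per group, 126 in total
```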


Thank you
