Statistics Level 1
Advances in our understanding of factors which affect health and wellbeing come through
research in the health sciences. Examples of such research include surveys to describe
patterns of disease in a community or risk factors for disease such as diet and smoking; studies
trying to find out whether a newly developed treatment works; studies of factors which may
prevent disease such as physical activity; studies of barriers to improving health such as
reasons for declining vaccination rates in children, prevention of smoking. Biostatistics
(statistics applied in the health sciences) is a vital tool in our mission to improve health and
wellbeing for all people.
STAT115 provides an introduction to the core principles and methods of biostatistics. In this
course you will gain an understanding of how statistics is used to answer research questions:
how to look for patterns in data, how to test hypotheses about disease causation and prevention
and improvement in well-being. The understanding and skills gained in STAT115 can be a
starting point for a career in biostatistics or can be used to assist understanding of research in
other disciplines including physiology, anatomy, human nutrition, sports science, and
psychology.
Lectures
Lectures are held as follows: Monday, Tuesday, Thursday and Friday at 11.00 am,
commencing Monday 8 July. Although these notes are extensive, experience shows that
students who miss lectures are at a severe disadvantage.
Support Classes
We have students from a range of backgrounds in the course. If you are concerned about your
mathematical skills, we have a Support Class available on Tuesday evenings from 6pm to
8pm (commencing week 1) in the North CAL lab, ground floor Science III building, opposite
the Science Library. In order to check whether this is appropriate for you, have a look at the
"Basics Booklet" in Appendix 1 of these notes. The Support Class is designed for those who
struggle with the material in the booklet.
Practice resources
Check out the following:
1. Basics Booklet in Appendix 1
2. MATHERCIZE http://mathercize.otago.ac.nz, login password is plus
References
There is no set text for the course as this course booklet contains all material necessary. If
further reference materials are desired, two useful texts are:
Clark, M.J. and Randal, J. R. A First Course in Applied Statistics. Pearson
MacGillivray, H. Utts & Heckard's Mind on Statistics. Cengage Learning.
Multiple copies of both references are in the Science Library on close reserve at the Loans
Desk.
Computing
The R statistical software will be used in tutorials. No prior knowledge of R is needed: a
handout is provided at the end of this book, and full instructions will be given in lectures. All students will
have their own User Name and Password. The User Name is the name on your student ID
card and the Password is your student ID number.
Time Commitment
STAT 115 is a one semester course worth 18 points. It is expected that students should spend
an average of 12 hours per week on this course. After allowing four hours per week attending
lectures, this leaves eight hours for other course related activities such as assignments, reading
notes and revising.
Calculators
You can use any type of calculator in the tests, the final exam and the assignments, as long as
it has no communication capability. This means cellphones are not permitted. You are
allowed to use a graphics calculator, but you do not need to buy one specially. A basic
scientific calculator that allows you to calculate a mean and a standard deviation is sufficient.
Be familiar with the working of your calculator.
Tilman, 2 lectures
Megan, 6 lectures
Tilman, 8 lectures
Tilman, 5 lectures
Tilman, 3 lectures
Tilman, 3 lectures
Categorical data: tests for association; rates, relative risk and risk differences; odds ratios; confidence intervals for relative risk and odds ratio.
Megan, 4 lectures
Regression and correlation: the simple linear regression model; tests on the slope; predictions; confidence intervals for predictions; correlation.
Megan, 4 lectures
Multiple regression: tests on the estimated parameters; dummy variables for qualitative predictors; parallel regressions and control of confounding.
Megan, 4 lectures
Ethics and study design: ethical issues, bias and confounding.
Katrina, 7 lectures
Internal Assessment
There will be eight assignments and three mastery tests. Each assessment will have a scaled
mark recorded out of 20. These assessments will be administered electronically.
Your scores from the mastery tests contribute 2/3 of your internal assessment component; the
assignments contribute the remaining 1/3.
A) Assignments
1. These can be completed anywhere you have an internet connection.
2. Because they are electronic, the cutoff at 0900 is strictly enforced.
3. Because the assignments are electronic, no extensions are possible.
B) Mastery tests
WHERE AND WHEN These 3 tests are administered electronically in North CAL on the
dates specified on the course schedule. The tests cannot be taken outside of the scheduled
testing periods nor in any other venue.
BOOKING - This is done in advance online via the Resources Page. You will be advised on
Blackboard when booking is open.
TIME ALLOWED - 20 minutes (within the 30-minute booking slot)
FORMAT - Multi-choice. It is also open book, but if you simply rely on having all your notes
available you will probably find it difficult to complete the test on time.
TOPICS - Material to be tested will be advised in advance on Blackboard
NOTES
1. Bring your student ID card.
2. Be on time so as not to disturb others and also to avoid delaying the start of the next
session.
3. Bring your calculator. You are not allowed to use your cellphone as a calculator.
4. Open R before you log on to the Resources Page.
5. Be mindful when scrolling that you do not inadvertently change your selected answer.
6. Any issues arising during the test must be brought immediately to the attention of the
supervisor so that remedial action can be taken and you complete the test before leaving
the room. There is no comeback once you have left the venue.
7. There is no resit for these mastery tests.
IMPORTANT - The only devices to be in operation during any testing period are:
1. the lab computer with just the test and R active on screen;
2. a calculator
The use of other devices, eg cellphone, tablet, laptop etc, or using other programmes on
the computer, eg email, may be deemed as an attempt to cheat.
Security
You are strongly advised to ID your calculator and other personal devices as these frequently
get left behind in the computer laboratory.
Exam format
A three-hour multiple choice exam will produce a mark out of 100.
Final mark
In your overall mark we will count your exam mark for 2/3 of the total and the internal
assessment for 1/3. However, if your final exam mark out of 100 is greater than this weighted
mark, we will use just the final exam mark. That is, the final mark F will be calculated as:
F = max{E, (2E + A)/3}
where E (exam mark) is out of 100 and A (internal assessment) is out of 100. The internal
assessment marks will be made up 1/3 from the eight assignments and 2/3 from the three
mastery tests.
E.g. 1 - 85% in exam, 80% for internal assessment. The exam mark is greater than the
internal assessment, so final mark = 85%.
E.g. 2 - 60% in exam, 90% for internal assessment. The internal assessment is greater than
the exam, so final mark = 2/3 of exam (40%) + 1/3 of internal assessment (30%) = 70%.
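The final mark rule can be checked with a few lines of R. This is only an illustrative sketch of the rule stated above; the function name final_mark is an assumption made for this example and is not part of any course software.

```r
# Final mark rule from these notes: F = max{E, (2E + A)/3}
# exam and internal are both marks out of 100.
final_mark <- function(exam, internal) {
  weighted <- (2 * exam + internal) / 3  # 2/3 exam + 1/3 internal assessment
  pmax(exam, weighted)                   # use the exam mark alone if it is higher
}

final_mark(85, 80)  # E.g. 1: exam mark is used, 85
final_mark(60, 90)  # E.g. 2: weighted mark is used, 70
```

Because pmax works element by element, the same function can be applied to whole vectors of exam and internal marks.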
Terms requirement
There is no terms requirement for this course. If you miss a test or an assignment, you can
still potentially pass the course.
STAT115
Introduction to Biostatistics
Section 1: Introduction
Lecture 1

Section 1
Biostatistics and research: an overview

Course aim:
An introduction to the core biostatistical methods essential to the health sciences
  scientific method
  design of research studies
  description and analysis of data

Learning aims and objectives
By the end of the course students should
  be aware of the appropriate use of common study designs and their strengths and weaknesses
  be able to describe the information contained in a data set
  be able to carry out common statistical data analyses
  be able to interpret the results of common statistical analyses in the context of the particular study design used
  be able to critically evaluate selected research articles published in health sciences journals

Goal of health sciences professions
To improve the health and well-being of individuals and communities
This involves
  treatment of disease
  prevention of disease
  promotion of health
In order to do this we need knowledge about
  causes of disease
  diagnosis
  disease processes
  effectiveness of treatments
  societal factors which affect health

Examples of current gaps in knowledge
Common diseases:
  Diabetes
  Cancer
New diseases
  HIV, SARS, avian influenza
Exposures
  Vitamin D deficiency
  Smoking
New technologies
  Cell phones, 3D technology

Research
A process for providing answers to questions for which the answer is not immediately available

Examples of general health research questions
  What are the genetic events which lead to childhood cancer?
  Can we develop a vaccine to prevent SARS?
  Can we develop vaccines against cancer cells?
  How do we stop people smoking?
  Can a new drug improve survival in people with colorectal cancer?
  How can we prevent childhood overweight and obesity?
  What are the main factors affecting quality of life of people with a chronic illness?
Research provides a systematic process for answering these questions

Invasive potential of CIN3
Epidemiologists
  Margaret McCredie, Charlotte Paul, David Skegg
Statistician
  Katrina Sharples
  Department of Preventive and Social Medicine, University of Otago
Gynaecological oncologist
  Ron Jones
  National Women's Hospital, Auckland
Pathologist
  Judith Baranyai
  Lab Plus, Auckland
Cytologist
  Gabrielle Medley
  Melbourne Pathology, Victoria, Australia

Dr Green's clinical study
Clinical study of the natural history of carcinoma in situ (CIS) (1965-74)
Carried out at National Women's Hospital, Auckland, New Zealand
Aim: to investigate Dr Green's hypothesis that CIS was not a precursor of invasive cancer
Involved withholding or delaying treatment of curative intent for a group of women diagnosed with CIS
It has since been the subject of a Judicial Inquiry (1987-88), which
  concluded that the study was unethical
  recommended that the histological and other material kept at National Women's should be available for properly planned and approved research and teaching

HPV Transmission model
[Figure: HPV transmission model. Labels: HPV infection; absence of virus production; virus production; E6-E7 production; viral DNA integration.]
Smith MA, Canfell K. Int. J. Cancer: 123, 1854-1863 (2008)
http://www.bioacademy.gr/lab/lab.php?lb=36&pg=6

Research Aim
To estimate the long term risk of cervical cancer in
  i) women whose CIN3 lesion was minimally disturbed and
  ii) women who had persisting CIN3

Our study
Women diagnosed with CIS at National Women's Hospital between 1 Jan 1955 and 31 Dec 1976 (1063 women)
Information on smears and procedures extracted from hospital notes
Endpoint: invasive cancer of cervix or vaginal vault
[Figure labels: Initial treatment of CIN3 lesion; Time until adequate treatment; Invasive cancer of cervix or vaginal vault; Initial treatment punch or wedge biopsy; Follow-up censored after treatment; No censoring.]

Blakely T, Shaw C, Atkinson J, Tobias M, Bastiampillai N, Sloane K, Sarfati D, Cunningham R. 2010. Cancer Trends: Trends in Incidence by Ethnic and Socioeconomic Group, New Zealand 1981-2004. Wellington: University of Otago and Ministry of Health.

Possible reasons for poorer survival after diagnosis
  Different disease sub-type
  Different stage of disease at diagnosis
  Differences in access to or uptake of treatment
  Co-morbidities
  Poorer follow-up
Hill et al. J Epidemiol Community Health 2010;64:117-123.

PIPER Aims
To compare progression free survival in patients diagnosed with colon cancer and rectal cancer (CRC) according to: location of residence (urban or rural, and distance from treating centre); ethnicity; and socio-economic deprivation of area of residence.
To identify differences in patient presentation, diagnostic evaluation, treatment and follow-up which contribute to differences in outcome by rurality, ethnicity or socio-economic deprivation

PIPER Study design
National study including 6323 patients
Data obtained by reviewing patients' notes and hospital databases
Analyses will compare the demographic and disease characteristics, treatment delivery and follow-up among the different ethnic groups
Overall goal is to identify areas for intervention in order to improve outcomes
A secondary goal is to set up prospective data collection to allow research into better treatments

Research
The objective for most research studies is to use data from a sample to draw inference about a larger population.

Scientific method
[Figure: the scientific method as a cycle. Labels: Hypothesis; Predictions; Experiment/observation (carry out research); Accept/reject/refine theory.]

Steps in the research process
  Development of the research question
  Design of the study
  Collection of information
  Data description and analysis
  Interpretation of results

Research questions relevant to course
Epidemiology: the study of distribution and determinants of disease frequency
Clinical research: the study of questions relating to care of patients
Descriptive questions:
  What is the distribution of a disease?
  What is the natural history of a disease?
Analytic questions:
  What are the causes of a disease?
  Will this approach prevent disease?
  Does this treatment improve outcome?

Introduction to study design
1. Descriptive studies (studies which describe things)
2. Analytic studies (studies which test hypotheses)
   Experimental studies
   Observational studies
   Examples of types of analytic study
3. Summary
   Classification of research designs
   Classification of common study types

Descriptive studies
Aim: to describe, for example:
  the characteristics of people with a disease (person, place, time)
  lifestyle patterns in a population
  attitudes to health care
Descriptive studies are often called surveys or cross-sectional studies
Descriptive studies generally use a sample from a population

Example: What are the serum cholesterol levels of New Zealanders?
Method: Select a subgroup (sample) of people and measure their serum cholesterol levels
Random sampling
  choose the sample in such a way that every individual in the population has a known chance of being selected
  in a simple random sample, everyone has an equal chance of being chosen
  this method is the best way of obtaining a sample which is representative of the population

Suppose we want to estimate mean cholesterol in the population:
Sample mean = Population (true) mean + Error
where the error is made up of systematic error and random error
Random error:
  due to natural biological variability
  increasing the sample size will reduce the random fluctuations in the sample mean
Systematic error (= bias)
  due to aspects of the design or conduct of the study which systematically distort the results
  occurs if a sample is not representative of the population
  cannot be reduced by increasing the sample size
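
The difference between random and systematic error can be seen in a small simulation. The sketch below is not from the lecture: the population values are invented (a normal distribution with mean 5.5 and standard deviation 1), and the biased design, which can only select people with cholesterol above 5.0, is a hypothetical example of a non-representative sample.

```r
set.seed(115)

# Invented population of serum cholesterol values (illustrative numbers only)
population <- rnorm(100000, mean = 5.5, sd = 1)
true_mean <- mean(population)

# Random error: spread of the sample mean over repeated simple random samples
sd_of_sample_mean <- function(n, reps = 500) {
  sd(replicate(reps, mean(sample(population, n))))
}
spread_small <- sd_of_sample_mean(25)   # small samples: sample mean fluctuates a lot
spread_large <- sd_of_sample_mean(400)  # large samples: fluctuations shrink

# Systematic error: a hypothetical design that can only reach people above 5.0
biased_pool <- population[population > 5.0]
bias_small <- mean(replicate(500, mean(sample(biased_pool, 25)))) - true_mean
bias_large <- mean(replicate(500, mean(sample(biased_pool, 400)))) - true_mean

c(spread_small = spread_small, spread_large = spread_large)
c(bias_small = bias_small, bias_large = bias_large)
```

Under these assumptions the spread of the sample mean falls as the sample size grows, while the bias stays at about the same size for both sample sizes.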

Analytic studies
Purpose: to test hypotheses about, for example:
  causes of disease
  methods for prevention of disease
  the effects of treatments
Experimental studies
  the researcher intervenes and records the result of their intervention
  the aim is to control all other factors to isolate the effects of the intervention
  best way to study causation
Observational studies
  the investigator does not intervene, simply observes a naturally occurring process, and collects information
  ideal is to get as close as possible to the information that would have been obtained if the experimental study could have been done

Example: Options for studying the relationship between smoking and heart disease
Experiment (randomised controlled trial)
  [Figure: participants (non-smokers) are randomly allocated to "smoke" or "don't smoke", then followed up to see who develops heart disease.]
Observational study (cohort study)
  [Figure: participants who are smokers and non-smokers are followed up to see who develops heart disease.]
Observational study (case-control study)
  [Figure: people with heart disease (cases) and people without heart disease (controls) are compared: is the proportion of smokers the same in both groups?]

STAT115
Introduction to Biostatistics
Section 1: Introduction
Lecture 2

Randomised controlled trial
The "gold standard" analytic study
Characteristics of a RCT:
  select a group of people
  randomly allocate them to either an intervention group(s) or a control group
  follow participants up over time, and measure outcome
A control group is used to isolate the effects of the intervention
Random allocation, or randomisation, means every person has the same chance of being in each group. This gives the best chance of getting two groups which are comparable in all respects

Common analytic study designs
Experimental:
  Randomised controlled trial
Observational:
  Cohort study
  Case-control study

Randomised controlled trial
Used to evaluate new treatments or preventive strategies
Often not ethical in studies of disease causation

Example RCT: LIPID study (NEJM, 1998)
Does treatment with pravastatin reduce the risk of death in patients with coronary heart disease?
Study participants:
  9014 patients
  age 31-75
  coronary heart disease
  cholesterol 155-271 mg/decilitre

LIPID study
As participants were recruited to the study they were allocated to either pravastatin or control according to a random number sequence

LIPID trial results
  Participants (n=9014), randomised and followed up for 6 years
  Placebo (n=4502): 8.3% died
  Pravastatin (n=4512): 6.4% died
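
The two LIPID percentages (8.3% of the placebo group and 6.4% of the pravastatin group died) can be compared on two scales that are covered later in the course. The sketch below is an illustration added to these notes, not part of the trial report; it uses only the two reported percentages, and the variable names are assumptions.

```r
# Proportion who died in each group of the LIPID trial, as reported on the slide
risk_placebo     <- 0.083
risk_pravastatin <- 0.064

# Risk difference: absolute reduction in risk with pravastatin
risk_difference <- risk_placebo - risk_pravastatin

# Relative risk: risk on pravastatin relative to risk on placebo
relative_risk <- risk_pravastatin / risk_placebo

round(100 * risk_difference, 1)  # percentage points
round(relative_risk, 2)
```

A relative risk below 1 means the risk of death was lower in the pravastatin group than in the placebo group.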

Randomised controlled trial
Advantages:
  experiment: the best way to test a hypothesis
  if the trial is well conducted, differences in outcome can be attributed to the intervention
Disadvantages:
  may not be ethical or feasible

Cohort study
Observational study, generally carried out to test hypotheses
Characteristics:
  participants are selected before disease has developed
  followed over time to determine development of disease
  information is collected about exposures at baseline and during follow-up

Example: British Doctors Study
Aim: to investigate the relationship between smoking and lung cancer
  Doctors on the British medical register (men n=24,389) were sent a questionnaire about their smoking habits
  Smokers (n=21,296) and non-smokers (n=3093) were followed up for lung cancer
  Found smokers had a 14 fold higher risk of lung cancer than the non-smokers

Case-control study
Observational study, generally carried out to test hypotheses
Characteristics
  participants are chosen on the basis of their disease status: a group with disease (cases) and a group without (controls)
  information is collected from people with and without disease about exposures that occurred in the past
  longitudinal (retrospective)

Example: Risk of venous thromboembolism after air travel
Case-control study
  Cases: people in the population with a deep vein thrombosis (n=210)
    Long distance flight (n=31)
    No long distance flight (n=179)
  Controls: a random sample of people who have not had a deep vein thrombosis (n=210)
    Long distance flight (n=16)
    No long distance flight (n=194)
Findings:
  odds ratio = 2.1, 95% confidence interval (1.1 to 4.0)
  Air travel doubled the odds of venous thromboembolism
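
The odds ratio in this example can be reproduced from the four counts on the slide (31 and 179 among cases, 16 and 194 among controls). The sketch below is an illustration, not the study's own analysis: it assumes the common large-sample interval based on the standard error of the log odds ratio, which may not be the exact method the authors used.

```r
# Counts from the air travel example
cases_flight    <- 31   # deep vein thrombosis, long distance flight
cases_no_flight <- 179  # deep vein thrombosis, no long distance flight
ctrl_flight     <- 16   # no deep vein thrombosis, long distance flight
ctrl_no_flight  <- 194  # no deep vein thrombosis, no long distance flight

# Odds of exposure among cases divided by odds of exposure among controls
odds_ratio <- (cases_flight / cases_no_flight) / (ctrl_flight / ctrl_no_flight)

# Assumed method: 95% interval on the log scale, then back-transformed
se_log_or <- sqrt(1 / cases_flight + 1 / cases_no_flight +
                  1 / ctrl_flight + 1 / ctrl_no_flight)
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(odds_ratio, 1)
round(ci, 1)
```

With these counts the calculation agrees with the reported odds ratio of 2.1 and interval of 1.1 to 4.0 after rounding to one decimal place.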

Cohort vs case-control studies
Cohort study
Advantages
  closest observational study to randomised controlled trial
  good for examining common outcomes
  can evaluate the effect of exposure on multiple outcomes
Disadvantages
  long duration needed if the disease takes a long time to develop after exposure
  if the disease is rare, the number of participants needs to be very large
Case-control study
Advantages
  relatively quick
  smaller than cohort studies, particularly for rare diseases
  can examine the effects of multiple exposures
Disadvantages
  events have already occurred so the potential for bias is higher

Classification of research designs
i) Classification by purpose of the study
   descriptive (describe things) vs. analytic (testing hypotheses)
ii) Classification by form of the design
   experimental (researcher intervenes) vs. observational (researcher observes)
iii) Classification by time
   cross-sectional (information collected about one point in time) vs. longitudinal

Classification of common study types
Randomised controlled trial
   analytic, experimental, longitudinal (prospective)
Cohort study
   analytic, observational, longitudinal (usually prospective)
Case-control studies
   analytic, observational, longitudinal (retrospective)
These classifications provide a useful framework for thinking about the strengths and weaknesses of different study designs, but they will not always work

STAT115
Introduction to Biostatistics
Megan Drysdale
University of Otago
Course Co-ordinator: Dr Tilman Davies
Data forming the response and exposure variables can be either categorical
or numerical (otherwise known as qualitative and quantitative
respectively).
Often the values show a pattern similar to what is called the bell-shaped
normal curve, with many values clustered around a central point and few
values in the tails.
Ratio: fraction given by one quantity over another. Both quantities have
the same units.
12.7% = 0.127...
Usual practice is to simplify rates to a per unit measure...
0.127 × 200 companies = 25.4 companies with female directors? Hmmm...
Using R: At home
A bit of a demo...
Start up R
Graphs can be used to summarise sample data, though many graphs can
be highly misleading. Today we shall look at summarising numerical data
graphically using histograms, and how to make these with R.
These statistics measure the centre and the variability of the sample data
respectively.
And then we will look at box and whisker plots, another way of displaying
numerical data.
Computing proportions ...and percentages
Class j   Frequency fj       Proportion       %
1         2                  2/56 = 0.036     3.6%
2         7                  7/56 = 0.125     12.5%
3         9                  9/56 = 0.161     16.1%
4         10                 10/56 = 0.179    17.9%
5         11                 11/56 = 0.196    19.6%
6         7                  7/56 = 0.125     12.5%
7         8                  8/56 = 0.143     14.3%
8         2                  2/56 = 0.036     3.6%
Total     56 (sample size)   56/56 = 1.0      100%
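
The proportions and percentages can be computed from the class frequencies in R. This is a short illustrative sketch using the eight frequencies listed above; the object names are assumptions made for the example.

```r
# Class frequencies f_j from the frequency table
freq <- c(2, 7, 9, 10, 11, 7, 8, 2)

n <- sum(freq)           # sample size
proportion <- freq / n   # f_j / n
percent <- 100 * proportion

# Rounded as on the slide: proportions to 3 dp, percentages to 1 dp
freq_table <- data.frame(freq,
                         proportion = round(proportion, 3),
                         percent = round(percent, 1))
freq_table
```

The proportions always sum to 1 and the percentages to 100, apart from rounding.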
Issues
[Figure: bar chart of the blood pressure data. Vertical axis: % Frequency (1dp), from 0.00% to 20.00%. Horizontal axis: class boundaries 59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5.]
Labelling issues....
Bars 1,2,7,8 actually cover 10mm range while the others cover 5mm
range. This makes us overestimate the proportion with low or high
blood pressure.
Switching to R
Start up R.
Notice that
Y-axis gives the raw frequencies. We can change options to give
proportions or percentages.
Equally spaced bins, makes it easier to compare areas.
Number of bins chosen automatically (to make it look good) but we
can alter that.
No distracting 3d graphics!
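# Default histogram of the Pressure variable in Dataset
hist(Dataset$Pressure)
# Same histogram with dark gray bars
hist(Dataset$Pressure, col = "darkgray")
# Ask for about 5 bins (R treats a single number as a suggestion)
hist(Dataset$Pressure, col = "darkgray", breaks = 5)
Setting the breaks between the bins
hist(Dataset$Pressure, col = "darkgray",
     breaks = c(59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5))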
Reading a histogram
The heights of the first two and last two rectangles are halved but
their bases are doubled from 5 to 10mm. Area is proportional to
frequency.
A histogram with rectangle heights proportional to class frequencies
would give a misleading picture of the data.
You will find that most of the histograms produced by statistical
packages like R have class intervals of equal length and you can
decide the number of intervals you want in the graph.
Usually between 5 and 20 intervals of equal length are chosen for a
good summary of the data.
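
The point that area, not height, represents frequency can be checked in R. The raw blood pressure readings are not reproduced in these notes, so the sketch below builds stand-in data by repeating each class midpoint according to the class frequencies; that stand-in data set and the object names are assumptions made for illustration.

```r
freq <- c(2, 7, 9, 10, 11, 7, 8, 2)
brk  <- c(59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5)

# Stand-in data: each class midpoint repeated f_j times
mids <- (brk[-length(brk)] + brk[-1]) / 2
x <- rep(mids, times = freq)

# With unequal class widths, hist() works on the density scale
h <- hist(x, breaks = brk, plot = FALSE)

width <- diff(brk)         # 10, 10, 5, 5, 5, 5, 10, 10
area <- h$density * width  # area of each rectangle

rbind(width, count = h$counts, area = round(area, 3))
# plot(h, col = "darkgray") would draw the histogram itself
```

Each rectangle's area equals the class proportion, so the areas sum to 1 even though the class widths differ.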
Many numerical data sets look roughly like a blob in the middle with two
tails extending out either side.
Grades histogram
The sample mean is the average of the quantities in the sample. Compute
it by summing up all of the values and dividing by the sample size.
Example
Six patients lived the following years after diagnosis of HIV
Symbol   x1    x2    x3    x4    x5    x6
Datum    1.8   3.2   6.8   4.6   2.8   7.9
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = (x_1 + x_2 + \cdots + x_n)/n.$$
The mean may be different to all of the values, but it will be within
the same range of values.
In R the command
mean(<data set>)
computes the mean.
You can also choose
summary(<data set>)
which computes the mean and other summary statistics.
The median is the middle value, once the sample has been sorted from
smallest to largest.
Example
To find the median of 1,6,4,8,3 we first sort the values:
1,3,4,6,8
and then take the middle value (4).
Example
To find the median of 5,3,7,5,3,6,3 we first sort the values:
3,3,3,5,5,6,7
and then take the middle value (5).
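
The mean and median examples can be checked directly in R. This sketch simply types in the values from the examples above; the vector names are assumptions made for the illustration.

```r
# Years lived after diagnosis of HIV (six patients)
survival_years <- c(1.8, 3.2, 6.8, 4.6, 2.8, 7.9)
mean(survival_years)     # sum of the values divided by n = 6
summary(survival_years)  # mean together with other summary statistics

# The two median examples: R sorts the values for you
median(c(1, 6, 4, 8, 3))        # 4
median(c(5, 3, 7, 5, 3, 6, 3))  # 5
sort(c(5, 3, 7, 5, 3, 6, 3))    # the sorted values, for checking by eye
```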
Median examples
Consider the data:
Data:    95, 86, 78, 90, 62, 73, 89
We first sort the data:
Sorted:  62, 73, 78, 86, 89, 90, 95
The middle value, 86, is the median.
NZ weekly income
Mode
If the data is discrete or if it has been binned, then we can talk about the
mode of the data. The mode is the most frequently occurring value.
Note for later in the course: it is much more common to talk about the
mode of a density, rather than a data set.
Example: grades
Here again the (default) histogram for the blood pressure data we looked
at earlier.
In R we compute that the mean of these marks is 63.3 and the median of
these marks is 70.
In R we compute that the mean of these marks is 89.32 and the median of
these marks is 89.50.
Sources of variability
Consider the data set 95, 86, 78, 90, 62, 73, 89.
The mean is the value $x$ which minimises the sum of squares
$$(x-95)^2 + (x-86)^2 + (x-78)^2 + (x-90)^2 + (x-62)^2 + (x-73)^2 + (x-89)^2.$$
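
This property of the mean can be verified numerically. The sketch below is an added illustration: it uses R's general-purpose one-dimensional minimiser optimize() to search for the value of x with the smallest sum of squares, and compares the result with mean(). The function name sum_of_squares is an assumption.

```r
marks <- c(95, 86, 78, 90, 62, 73, 89)

# Sum of squared distances from a candidate value x to every data point
sum_of_squares <- function(x) sum((x - marks)^2)

# Search between the smallest and largest data values
best <- optimize(sum_of_squares, interval = range(marks))

best$minimum  # value of x that minimises the sum of squares
mean(marks)   # the sample mean, 573/7
```

The two numbers agree up to the numerical tolerance of the search.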
If data are highly variable there are problems analysing the data and it
will be necessary to select larger samples.
Range
The range is the difference between the largest and smallest values in the
sample.
Example
The range of the sample 95, 86, 78, 90, 62, 73, 89 is 95 - 62 = 33.
While the range does tell us something about the sample, it is affected a
lot by outliers and random noise. For this reason, we don't often use the
range to tell us something about the underlying population.
Standard deviation
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Variance example
Find the sample variance and standard deviation of 11, 18, 14, 15, 12.
x_i - \bar{x}      (x_i - \bar{x})^2
(11-14) = -3       (-3)^2 = 9
(18-14) = 4        4^2 = 16
(14-14) = 0        0^2 = 0
(15-14) = 1        1^2 = 1
(12-14) = -2       (-2)^2 = 4
Sum: 0             Sum: 30
Hence
$$S^2 = 30/(5-1) = 7.5$$
and
$$S = \sqrt{7.5} = 2.74.$$
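
The worked example can be reproduced step by step in R and compared with the built-in var() and sd() functions. This is an illustrative sketch; the object names are assumptions.

```r
x <- c(11, 18, 14, 15, 12)
n <- length(x)

deviations <- x - mean(x)  # -3, 4, 0, 1, -2 (these always sum to 0)
squared <- deviations^2    # 9, 16, 0, 1, 4

s2 <- sum(squared) / (n - 1)  # 30 / 4 = 7.5
s <- sqrt(s2)                 # about 2.74

# Built-in functions give the same answers
c(by_hand = s2, built_in = var(x))
c(by_hand = s, built_in = sd(x))
```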
And on a technical note...
The formula for the sample variance is
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
You can compute the mean in R using
mean(<variable name>)
You can compute the variance in R using
var(<variable name>)
Why (n - 1)?
The variance of the whole population is defined as
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2,$$
where $\mu$ is the population mean.
However, if you take a random sample of size n and use this variance
formula (dividing by n) you will, on average, get an amount that is
$\frac{n-1}{n}$ times what you want.
The factor $\frac{1}{n-1}$ corrects for this.
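
A simulation makes the (n - 1) argument concrete. The sketch below is an added illustration under stated assumptions: samples of size 5 are drawn from a normal distribution with variance 4 (standing in for a very large population), and the two versions of the variance formula are averaged over many samples.

```r
set.seed(2013)

n <- 5         # sample size
reps <- 20000  # number of repeated samples
true_variance <- 4

# One sample per row, drawn from a normal distribution with sd = 2
samples <- matrix(rnorm(n * reps, mean = 0, sd = 2), nrow = reps)

# Variance formula dividing by n, and the sample variance dividing by n - 1
divide_by_n <- apply(samples, 1, function(x) sum((x - mean(x))^2) / n)
divide_by_n_minus_1 <- apply(samples, 1, var)

mean(divide_by_n)          # close to (n - 1)/n * 4 = 3.2
mean(divide_by_n_minus_1)  # close to the true variance, 4
```

On average the divide-by-n version falls short by the factor (n - 1)/n, while the divide-by-(n - 1) version is centred on the true variance.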
Range
Recall that the range of a sample is the difference between the maximum
and minimum values.
Interquartile range
The median is the value which is (informally) in the middle of the sample
values.
The lower quartile is the value which is 1/4 of the way up the sample
values.
The upper quartile is the value which is 3/4 of the way up the sample
values.
The interquartile range (IQR) is the difference between the upper quartile
and the lower quartile.
[Figure: the sorted data split into four groups, each holding 25% of the data, by the lower quartile (LQ), median and upper quartile (UQ). The interquartile range spans the middle 50% of the values; the range spans all of them.]
The upper and lower quartiles are produced by the summary command in
R.
Fiddly details
We usually can't get exactly 25% of the data points below some value,
so we do something in between.
General instructions:
To find the lower quartile from a sample of size n
Example: Here are the maximum temperatures in Dunedin for the last 14
days.
10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12
We sort the values:
10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 13, 13, 14
0.25 × 11 + 0.75 × 11 = 11
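
The quartiles of the Dunedin temperatures can be computed in R. This sketch is an added illustration: it uses R's default quantile rule, which interpolates between neighbouring sorted values and is assumed, not confirmed, to match the hand method described in the lecture.

```r
# Maximum temperatures in Dunedin for the last 14 days
temps <- c(10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12)

sort(temps)

lower_q <- unname(quantile(temps, 0.25))  # lower quartile
upper_q <- unname(quantile(temps, 0.75))  # upper quartile

c(lower = lower_q, median = median(temps), upper = upper_q)
IQR(temps)      # upper quartile minus lower quartile
summary(temps)  # reports the quartiles alongside min, mean and max
```

For this sample the lower quartile falls between two sorted values that are both 11, so any interpolation rule gives 11.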
If
If
If
If
r
r
r
r
Skinks
Box plots
1
8
18
8
16
11
4
5
14
13
10
10
1
14
33
12
16
10
2
6
11
11
14
6
5
8
16
8
10
13
0
10
20
10
7
7
1
7
1
17
10
10
Replanted
Min.    : 1.00
1st Qu. : 9.50
Median  : 12.50
Mean    : 14.62
3rd Qu. : 18.00
Max.    : 33.00
5
4
17
29
8
8
6
8
12
3
12
12
5
13
27
12
19
6
6
6
26
5
17
12
2
4
4
11
5
11
2
1
8
16
14
10
Tussock
Min.    : 5.0
1st Qu. : 9.5
Median  : 11.0
Mean    : 12.0
3rd Qu. : 14.0
Max.    : 29.0
0
1
31
12
15
29
3
3
24
6
23
12
Tussock
4
11
15
18
14
7
Pasture
Replanted
Draw a box and a middle line at the upper quartile, median and lower
quartile.
Starting at the top of the box, measure up a length of 1.5 × IQR. Here,
that gives 14.0 + 1.5 × 4.5 = 20.75.
Don't draw the line there. Instead find the largest data point that is at
most 20.75. The sorted values in this example (Tussock) are
5, 6, 6, 7, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 10, 11,
11, 12, 12, 12, 12, 13, 14, 14, 14, 15, 16, 16, 17, 19, 23, 29
The largest data value that is at most 20.75 equals 19. We draw the top
line here.
Now start at the bottom of the box and measure down a length of 1.5 × IQR.
That gives 9.5 − 1.5 × 4.5 = 2.75.
The smallest data value that is at least as large as 2.75 equals 5. We draw
the bottom line here.
If there are any data values outside the range of the bottom and top line,
then we plot crosses or dots for each one of them.
In this case, 5 is the minimum value (none below that), but 23 and 29 are
both above the top line.
[Figure: box plot of the Tussock data on an axis running from about 5 to 30. The box is labelled with the lower quartile, median and upper quartile; the top line is labelled "Largest value at most 1.5 IQR above UQ" and the bottom line "Smallest value at most 1.5 IQR below LQ".]
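The box plot construction for the Tussock data can be sketched in R as follows. The variable names are illustrative only, and the quartiles are taken from quantile with its default rule; R's own boxplot function uses a slightly different rule ("hinges") for the box, so its picture may differ a little from this hand calculation.

```r
# Sorted Tussock counts from the example
tussock <- c(5, 6, 6, 7, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 10, 11,
             11, 12, 12, 12, 12, 13, 14, 14, 14, 15, 16, 16, 17, 19, 23, 29)

lq  <- unname(quantile(tussock, 0.25))   # lower quartile
uq  <- unname(quantile(tussock, 0.75))   # upper quartile
iqr <- uq - lq                           # interquartile range

upper_limit <- uq + 1.5 * iqr            # measure up 1.5 x IQR from the box
lower_limit <- lq - 1.5 * iqr            # measure down 1.5 x IQR from the box

# The lines are drawn at actual data values inside those limits
top_line    <- max(tussock[tussock <= upper_limit])
bottom_line <- min(tussock[tussock >= lower_limit])

# Anything beyond the lines is plotted as an individual point
outliers <- tussock[tussock > top_line | tussock < bottom_line]
```

Calling boxplot(tussock) afterwards draws the plot itself.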
Comparing variables
Box plots provide some information about the center, range and symmetry
of the data. We can easily put multiple box plots on one graph. In R, use
boxplot(<dataset>) to see box plots for all variables in a data set.
library(UsingR)
violinplot(Skinks)
Overview
Prevalence
Prevalence of cataracts = 310/2477 = 0.125 = 125 cases per 1000 people = 12.5%
Prevalence of blindness = 22/2477 = 0.009 = 9 cases per 1000 people = 0.9%
In the following diagram shaded lines indicate the time each person has
the disease.
[Figure: disease timelines for subjects 1 to 5 plotted against time t. At the four marked time points the prevalence is 1/5, 2/5, 3/5 and 2/5.]
Note on prevalence
The time point may refer to calendar time, or to a fixed point in the
course of events.
Example: the proportion of people free from back pain 2 months after
back injury. (The time point here is relative to an event, rather than an
absolute time.)
Incidence
Cumulative incidence
There were 609 men aged 40-76 who had no detected heart disease
in 1960.
These men were followed for 7 years and 71 cases of heart disease
were detected during this period.
Number of cases = 71. Population size = 609.
Cumulative incidence = 71/609 = 0.117 cases per person = 11.7%
People-time
People-time at risk
[Figure: follow-up timelines for subjects A to E on an axis from Jan 1997 to Jan 2002. The start of each line marks initiation of follow-up and x marks development of disease; one subject is lost to follow up.]
Subject   Total time at risk (years)
A         2.0
B         3.0
C         5.0
D         4.0
E         2.5
Incidence rate:
Hepatitis
Example. Frequency of hepatitis in two regions.
Location   New cases of hepatitis   Reporting period   Population
Region A   58                       1985               25,000
Region B   35                       1984-1985          7000
In Region A,
Number cases = 58
Person-years = 1 × 25000 = 25000
Incidence rate = 58/25000 cases per person-year
= 2.32 cases per 1000 people per year.
In Region B,
Number cases = 35
Person-years = 2 × 7000 = 14000
Incidence rate = 35/14000 cases per person-year
= 2.50 cases per 1000 people per year.
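The hepatitis calculation can be written as a small R helper. This is only a sketch: the function name incidence_rate and its arguments are hypothetical, and it assumes the whole population is at risk for the full reporting period.

```r
# Incidence rate = new cases / person-years, scaled to "per 1000 people per year"
incidence_rate <- function(cases, population, years, per = 1000) {
  person_years <- population * years   # everyone assumed at risk throughout
  cases / person_years * per
}

rate_A <- incidence_rate(cases = 58, population = 25000, years = 1)
rate_B <- incidence_rate(cases = 35, population = 7000,  years = 2)
```

Dividing by person-years rather than by population alone is what makes the two regions comparable, even though Region B reported over two years.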
Num. cases of stroke   Person-years of observation (over 8 years)   Stroke incidence rate (per 100,000 person years)
70                     395,594                                      17.7
65                     232,712                                      27.9
139                    280,141                                      49.6
274                    908,447                                      30.2
(Total) incidence rate = $\frac{274}{908,447} \times 100,000$ = 30.2 cases of stroke per
100,000 person-years of observation.
$\frac{908,447}{118,539}$ = 7.7 years.
The denominator for measures of incidence should include only those who
are at risk of developing the disease. It should exclude
those who already have the disease
those who cannot develop the disease
Failure to do this will lead to an underestimate of the true incidence since
fewer will develop the condition.
For example when studying the incidence of endometrial cancer we should
exclude women who have had a hysterectomy.
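Example: Disease A
[Figure: disease timelines for subjects 1 to 5 against time, with the time point L marked.]
Prevalence at time L = 2/5.
Example: Disease B
[Figure: disease timelines for subjects 1 to 5 against time, with the time point L marked.]
Prevalence at time L = 5/5.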
Afterword: units
There are different expressions for the units used with incidence rate.
The simplest (I think) is new cases per person-year. This is the same as
x cases per person per year,
where x is some fraction.
Sometimes you see
cases per 100 people per year (multiply x by 100 for this)
cases per 100,000 people per year (multiply x by 100,000 for this).
The following mean the same:
0.002 cases per person per year
0.2 cases per 100 people per year
200 cases per 100,000 people per year
In the same way we can change the period of time. The following are the
same
Location   New cases of hepatitis   Reporting period   Population
Region A   58                       1985               25,000
Region B   35                       1984-1985          7000
In Region A,
Number cases = 58
Person-years = 1 × 25000
Incidence rate = 58/25000 per year
= 2.32 per 1000 per year.
Overview
Example association
Data
Data from cohort study of oral contraceptive use (OC) and bacteria in the
urine among women aged 16-49 years over 3 years.
Recall from lecture two: a cohort study is one where we start with a complete
population.
OC Use   Bacteria present: Yes   Bacteria present: No   Total
Yes      27                      455                    482
No       77                      1831                   1908
Total    104                     2286                   2390
Measuring association
The Relative effect (also known as the relative risk). This expresses
(or more correctly, estimates) the incidence rate of the disease within
one population (group) relative to the rate in the other population
(group).
The Absolute effect (also known as the attributable risk) expresses the
absolute difference in incidence between the two groups.
Both measures are still statistics and might still reflect random error. Later
in the course you'll learn how to model and test whether a given
association is statistically significant or not.
Relative risk
The relative effect (RR) (also called relative risk) is the ratio of the
(cumulative) incidence in the exposed group ($I_e$) to the incidence in the
unexposed group ($I_0$).
$RR = \frac{I_e}{I_0}$
If RR is...
> 1 (exposure → disease)
= 1 (no association between exposure and disease)
< 1 (exposure is protective).
Indicates how much more likely disease is to develop in the exposed
group than in the unexposed group.
Good measure of strength of an association, and the usual measure in
studies of causation of disease.
We can also calculate ratios of prevalences, but the interpretation is
different.
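Association with OC
OC Use   Bacteria present: Yes   Bacteria present: No   Total
Yes      27                      455                    482
No       77                      1831                   1908
Total    104                     2286                   2390
Incidence among OC users: $I_e$ = 27/482 = 56 per 1000.
Incidence among non-users: $I_0$ = 77/1908 = 40 per 1000.
RR = 56/40 = 1.4,
AR = 56 − 40 = 16 cases per 1000 women.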
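The relative and attributable risk for the oral contraceptive data can be reproduced in R. This is a sketch with illustrative object names; it works from the unrounded incidences, so the results agree with the slide values only after rounding.

```r
# 2 x 2 table: rows are OC use, columns are bacteria present
oc <- matrix(c(27, 455,
               77, 1831),
             nrow = 2, byrow = TRUE,
             dimnames = list(OC = c("Yes", "No"), Bacteria = c("Yes", "No")))

# Cumulative incidence in each exposure group (cases / group total)
risk <- oc[, "Yes"] / rowSums(oc)

rr <- unname(risk["Yes"] / risk["No"])   # relative risk, I_e / I_0
ar <- unname(risk["Yes"] - risk["No"])   # attributable risk, I_e - I_0

round(risk * 1000)   # incidence per 1000 women in each group
round(rr, 1)
round(ar * 1000)     # extra cases per 1000 women among OC users
```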
Treatment/control is the Exposure Category (explanatory variable).
            Improvement in pain   No improvement in pain   Total
Treatment   18                    7                        25
Control     8                     17                       25
$I_e$ = 18/25, $I_0$ = 8/25.
$RR = \frac{I_e}{I_0} = \frac{18/25}{8/25}$ = 2.3.
Absolute risk is
$AR = I_e - I_0$ = 18/25 − 8/25 = 10/25, that is, 10 more improvements per 25 patients.
Variations on a theme
          Number examined   Number with CHD   Prevalence per 1,000
Males     2024              48                23.7
Females   2445              28                11.5
Compute relative and absolute risk using prevalence per 1000 (note
23.7 = 48/2024 × 1000).
Consider males as the exposed group.
RR = 23.7/11.5 = 2.1
AR = 23.7 − 11.5 = 12.2
Heart disease is about twice as common among the males, and there are
12.2 more cases of heart disease in 1000 men than in 1000 women.
Postmenopausal hormone use   Coronary heart disease cases   Person-years
Yes                          30                             54,308.7
No                           60                             51,477.5
Incidence rate:
Users: 30/54308.7 = 55 per 100,000 person-years
Non-users: 60/51477.5 = 117 per 100,000 person-years
Attributable Risk:
55 − 117 = −62 cases of CHD per 100,000 person-years
Hormone use prevents 62 cases per 100,000 person-years
Relative Risk: 55/117 = 0.47
The risk of CHD among users is 0.47 times the risk in non-users (a 53%
reduction in risk).
Notes
Relative risks
provide information on the strength of an association;
can be used to assist in assessment of the likelihood of a causal
association.
Attributable risks
measure the impact of an exposure, (assuming that it is causal).
If a disease is common a small relative risk will translate to a large
attributable risk. [see previous example]
Smokers
Non-smokers
RR
AR
STAT115
Introduction to Biostatistics
Dr Tilman Davies
University of Otago
Section 3. Probability
Lecture 9
Some definitions
Calculating Probabilities
Probability of an event A is
$\Pr(A) = \frac{n_A}{N} = \frac{\text{number of trials for which A is true}}{\text{total number of trials}}$
We don't always need to conduct the experiments to measure these,
as we can make logical arguments.
For example, flipping a coin: Pr(heads) = Pr(tails) = 1/2
The Rules
Complementary Events
Two events are complementary if exactly one of them is always true.
For example, the coin flip. Will always be either heads or tails.
$\bar{A}$ is called the complement of A.
$\Pr(A) + \Pr(\bar{A}) = 1$
Addition Rule
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
Example
A = Rolling an even number {2,4,6}
B = Rolling 3 or less {1,2,3}
$\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
= 3/6 + 3/6 - 1/6
= 5/6
Intuitively, this makes sense!
The Rules
Multiplication Rule
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
Please read $\Pr(B \mid A)$ as "probability of B given A"
Example
A = Rolling an even number {2,4,6}
B = Rolling 3 or less {1,2,3}
$\Pr(A \text{ and } B) = \Pr(A) \times \Pr(B \mid A)$
= 3/6 x 1/3
= 1/6
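Both rules can be checked for the die example by listing outcomes in R. This is a small sketch; the helper pr is a made-up name and simply counts equally likely outcomes.

```r
die <- 1:6                 # equally likely outcomes of one roll
A <- c(2, 4, 6)            # rolling an even number
B <- c(1, 2, 3)            # rolling 3 or less

# Probability of an event = outcomes in the event / total outcomes
pr <- function(event) length(event) / length(die)

# Addition rule: count the union directly, then use the formula
p_union_direct  <- pr(union(A, B))
p_union_formula <- pr(A) + pr(B) - pr(intersect(A, B))

# Multiplication rule: Pr(B | A) is the share of A's outcomes that are also in B
p_B_given_A <- length(intersect(A, B)) / length(A)
p_both      <- pr(A) * p_B_given_A
```

Here p_union_direct and p_union_formula both come to 5/6, and p_both matches pr(intersect(A, B)) = 1/6.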
Independent Events
When the occurrence of one event does not affect the outcome of
another event.
For example flipping 3 heads in a row.
Blood Type    A      B      AB     O
Probability   0.38   0.11   0.04   0.47
What is the probability that 3 randomly selected people have blood group
O?
Pr(O and O and O) = 0.47 × 0.47 × 0.47
= 0.104
The probability that a randomly selected person has blood group A or B is
Pr(A or B) = 0.38 + 0.11
= 0.49
Note that A and B are mutually exclusive,
that is they cannot both occur, and Pr(A and B) = 0.
Hospital Patients
Pr(A) = 0.25
Pr(B) = 0.10
Pr(A | B) = 0.85
Find the probability a patient has both diabetes and high blood
pressure.
$\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$
= 0.85 × 0.10
= 0.085
$\Pr(A) \neq \Pr(A \mid B)$, so A and B are not independent.
Summary
Calculating a probability.
$\Pr(A) = \frac{n_A}{N} = \frac{\text{number of trials for which A is true}}{\text{total number of trials}}$
Addition Rule.
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
Multiplication Rule.
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
Read $\Pr(B \mid A)$ as "probability of B given A".
Addition Rule - Mutually Exclusive
$\Pr(A \cup B) = \Pr(A) + \Pr(B)$
Multiplication Rule - Independent Events
$\Pr(A \cap B) = \Pr(A) \Pr(B)$
Questions
Section 3. Probability
Lecture 10
CONDITIONAL EVENTS
Two events are conditional if the probability of one event changes
depending on the outcome of another event.
Rearranging the multiplication rule gives the conditional probability:
$\Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
$\Pr(A \cap B) \div \Pr(A) = \Pr(A) \Pr(B \mid A) \div \Pr(A)$
$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$
[Figure: Venn diagram of two events A and B for a die roll, with Pr(A ∩ B) = 1/3.]
$A \cup B$ = {2, 4, 5, 6}, therefore $\Pr(A \cup B)$ = 4/6 = 2/3
Find $\Pr(A \cup B)$
Use addition rule
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
$\Pr(A \cup B)$ = 1/2 + 1/2 − 1/3
$\Pr(A \cup B)$ = 2/3
The same result we observed in our diagram.
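Tree Diagrams
[Figure: two-stage tree diagram. The first-stage branches are $\Pr(A)$ and $\Pr(\bar{A})$. The second-stage branches are $\Pr(B \mid A)$, $\Pr(\bar{B} \mid A)$, $\Pr(B \mid \bar{A})$ and $\Pr(\bar{B} \mid \bar{A})$. The four branch ends give $\Pr(A \cap B)$, $\Pr(A \cap \bar{B})$, $\Pr(\bar{A} \cap B)$ and $\Pr(\bar{A} \cap \bar{B})$.]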
Independent Stages
[Figure: three-stage tree diagram for sighting tuatara (T) at each of three sites. At every stage the branch for a sighting has probability 0.4 and the branch for no sighting has probability 0.6. The eight branch ends are labelled X = 3, 2, 2, 1, 2, 1, 1, 0, where X is the number of sites with a sighting.]
$\Pr(X = 2) = \Pr(TT\bar{T}) + \Pr(T\bar{T}T) + \Pr(\bar{T}TT)$ = 3 × 0.4 × 0.4 × 0.6 = 0.288
Find the probability of seeing tuatara at one of the three sites
Take advantage of the fact that all possibilities add to 1
Pr(X = 3) = 0.4 × 0.4 × 0.4 = 0.064
Pr(X = 2) = 0.288
Pr(X = 0) = 0.6 × 0.6 × 0.6 = 0.216
Pr(X = 1) = 1 − 0.288 − 0.216 − 0.064
= 0.432
Summary
Conditional Probability
$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$
Tree Diagrams
[Figure: two-stage tree diagram with first-stage branches $\Pr(A)$ and $\Pr(\bar{A})$, second-stage branches $\Pr(B \mid A)$, $\Pr(\bar{B} \mid A)$, $\Pr(B \mid \bar{A})$ and $\Pr(\bar{B} \mid \bar{A})$, and the four joint probabilities at the branch ends.]
Questions
100 Otago students were asked if they like L&P.
Of the 75 males (M) surveyed: 50 said they like L&P (L).
Of the 25 females ($\bar{M}$) surveyed: 20 said they like L&P (L).
Find:
Pr(L)
Pr(L | M)
Pr(L | $\bar{M}$)
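The three-site tuatara tree (a sighting with probability 0.4 at each site, sites independent) can be reproduced in R by listing all eight paths through the tree. This is a sketch with illustrative names, not the course's own code.

```r
p_seen <- 0.4                                   # Pr(sighting) at each site

# Every path through the tree: 1 = tuatara seen, 0 = not seen
paths <- expand.grid(site1 = 0:1, site2 = 0:1, site3 = 0:1)
x <- rowSums(paths)                             # X = number of sites with a sighting

# Multiply along each path (independent stages)
path_prob <- p_seen^x * (1 - p_seen)^(3 - x)

# Add up the paths that give the same value of X
dist <- tapply(path_prob, x, sum)
dist

# Pr(X = 1) using the fact that all possibilities add to 1
p_x1 <- 1 - dist[["0"]] - dist[["2"]] - dist[["3"]]
```

The eight path probabilities sum to 1, which is what makes the shortcut for Pr(X = 1) work.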
Section 3. Probability
Lecture 11
Summary
Conditional Probability
$\Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$
Tree Diagrams
[Figure: two-stage tree diagram with first-stage branches $\Pr(A)$ and $\Pr(\bar{A})$, second-stage branches $\Pr(B \mid A)$, $\Pr(\bar{B} \mid A)$, $\Pr(B \mid \bar{A})$ and $\Pr(\bar{B} \mid \bar{A})$, and the four joint probabilities at the branch ends.]
SENSITIVITY = $\Pr(T \mid D)$
Think of this as the probability of a positive test result, given the
person actually has the disease
SPECIFICITY = $\Pr(\bar{T} \mid \bar{D})$
Think of this as the probability of a negative test result, given the
person does NOT have the disease
SENSITIVITY and SPECIFICITY will appear very naturally in your
tree diagrams!
Screening Programmes
Find the probability that a woman has the cancer given the biopsy
says she does (i.e. does the biopsy diagnose true patient status?).
Let C be the event "woman has the cancer" and T be the event "test
is positive".
Pr(C) = 1/10000 = 0.0001 (disease prevalence)
$\Pr(T \mid C)$ = 0.90 (conditional probability)
$\Pr(T \mid \bar{C})$ = 0.001
Tree Diagrams
[Figure: tree diagram with first-stage branches $\Pr(C)$ = 0.0001 and $\Pr(\bar{C})$ = 0.9999, and second-stage branches $\Pr(T \mid C)$ = 0.90, $\Pr(\bar{T} \mid C)$ = 0.10, $\Pr(T \mid \bar{C})$ = 0.001 and $\Pr(\bar{T} \mid \bar{C})$ = 0.999.]
$\Pr(T \text{ and } C)$ = 0.0001 × 0.90 = 0.00009
$\Pr(\bar{T} \text{ and } C)$ = 0.0001 × 0.10 = 0.00001
$\Pr(T \text{ and } \bar{C})$ = 0.9999 × 0.001 = 0.00100
$\Pr(\bar{T} \text{ and } \bar{C})$ = 0.9999 × 0.999 = 0.99890
$\Pr(T) = \Pr(T \cap C) + \Pr(T \cap \bar{C})$ = 0.00009 + 0.00100 = 0.00109
$\Pr(C \mid T) = \frac{\Pr(C \cap T)}{\Pr(T)} = \frac{0.00009}{0.00109}$ = 0.083
Classification table
Find the negative predictive value $\Pr(\bar{C} \mid \bar{T})$. To calculate this we use
the conditional probability formula
$\Pr(\bar{C} \mid \bar{T}) = \frac{\Pr(\bar{C} \cap \bar{T})}{\Pr(\bar{T})}$
$\Pr(\bar{C} \mid \bar{T}) = \frac{0.99890}{0.99891}$
$\Pr(\bar{C} \mid \bar{T})$ = 0.99999
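The screening tree can be worked through in R from its three inputs: prevalence 0.0001, Pr(T | C) = 0.90, and a false positive probability of 0.001. The variable names below are illustrative, and the sketch keeps full precision rather than the rounded branch values.

```r
prev      <- 1 / 10000   # Pr(C), disease prevalence
sens      <- 0.90        # Pr(T | C), sensitivity
false_pos <- 0.001       # Pr(T | not C), so specificity is 1 - false_pos

# Multiply along each branch of the tree
p_T_and_C       <- prev * sens
p_notT_and_C    <- prev * (1 - sens)
p_T_and_notC    <- (1 - prev) * false_pos
p_notT_and_notC <- (1 - prev) * (1 - false_pos)

# Total probability of each test result
p_T    <- p_T_and_C + p_T_and_notC
p_notT <- p_notT_and_C + p_notT_and_notC

ppv <- p_T_and_C / p_T             # Pr(C | T), positive predictive value
npv <- p_notT_and_notC / p_notT    # Pr(not C | not T), negative predictive value
```

Because the disease is so rare, most positive results come from the large disease-free group, which is why ppv is small even though the test itself is accurate.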
Calculating Probabilities
         No by-catch   By-catch   Total
NZ       90            6          96
Russia   100           23         123
Total    190           29         219
Let (B) be by-catch, and ($\bar{B}$) be no by-catch.
Estimate the probability that a sampled vessel is Russian (R).
Find Pr(R)
Given that the sampled vessel had by-catch what is the probability
that it is Russian?
Find Pr(R | B)
Tree Diagram
[Figure: tree diagram with first-stage branches Russia (123/219) and NZ (96/219). The second-stage branches are by-catch (23/123) and no by-catch (100/123) for Russia, and by-catch (6/96) and no by-catch (90/96) for NZ. The branch ends give the joint probabilities 23/219, 100/219, 6/219 and 90/219.]
Given that the sampled vessel had by-catch what is the probability
that it is Russian?
i.e. find Pr(R | B).
First calculate the total probability of a by-catch
Pr(B) = 23/219 + 6/219 = 29/219
$\Pr(R \mid B) = \frac{\Pr(R \cap B)}{\Pr(B)}$
$\Pr(R \mid B) = \frac{23/219}{29/219}$
$\Pr(R \mid B) = \frac{23}{29}$
$\Pr(R \mid B)$ = 0.793
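The by-catch probabilities can also be read straight off the table in R. This is a sketch; the matrix and variable names are chosen for illustration.

```r
# Vessel counts: rows are country, columns are by-catch status
vessels <- matrix(c(90, 6,
                    100, 23),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Country = c("NZ", "Russia"),
                                  Catch = c("No by-catch", "By-catch")))
n <- sum(vessels)                                  # 219 vessels in total

p_R       <- sum(vessels["Russia", ]) / n          # Pr(R)
p_B       <- sum(vessels[, "By-catch"]) / n        # Pr(B)
p_R_and_B <- vessels["Russia", "By-catch"] / n     # Pr(R and B)

# Conditional probability formula
p_R_given_B <- p_R_and_B / p_B
```

The same answer comes from restricting attention to the by-catch column: 23 of the 29 vessels with by-catch are Russian.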
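Summary
SENSITIVITY = $\Pr(T \mid D)$
Think of this as the probability of a positive test result, given the
person actually has the disease.
SPECIFICITY = $\Pr(\bar{T} \mid \bar{D})$
Think of this as the probability of a negative test result, given the
person does NOT have the disease.
POSITIVE PREDICTIVE VALUE = $\Pr(D \mid T)$
The proportion of patients with positive test results who are correctly
diagnosed.
NEGATIVE PREDICTIVE VALUE = $\Pr(\bar{D} \mid \bar{T})$
[Figure: tree diagram with first-stage branches $\Pr(D)$ and $\Pr(\bar{D})$, second-stage branches $\Pr(T \mid D)$, $\Pr(\bar{T} \mid D)$, $\Pr(T \mid \bar{D})$ and $\Pr(\bar{T} \mid \bar{D})$, and branch ends $\Pr(D \cap T)$, $\Pr(D \cap \bar{T})$, $\Pr(\bar{D} \cap T)$ and $\Pr(\bar{D} \cap \bar{T})$.]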
Section 3. Probability
Lecture 12
Random Variables
The probabilities of all the possible values add to one: $\sum_{i=1}^{k} \Pr(X = x_i) = 1$
$\mu_X = \sum_{i=1}^{k} x_i \Pr(X = x_i)$ is the mean of X
$\sigma_X^2 = \sum_{i=1}^{k} (x_i - \mu_X)^2 \Pr(X = x_i)$ is the variance of X
Learn this right now:
$\sqrt{\text{variance } (\sigma^2)}$ = standard deviation ($\sigma$)
variance ($\sigma^2$) = (standard deviation ($\sigma$))$^2$
Contagious disease
$X = x_i$        0      1      2      3      4
$\Pr(X = x_i)$   0.10   0.25   0.40   0.20   0.05
$\sigma_X = \sqrt{\sigma_X^2} = \sqrt{1.0275}$ = 1.0137
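The mean, variance and standard deviation for the contagious disease table can be computed directly from their definitions in R. A minimal sketch, with illustrative variable names:

```r
x <- 0:4                                   # possible values of X
p <- c(0.10, 0.25, 0.40, 0.20, 0.05)       # Pr(X = x) for each value

stopifnot(isTRUE(all.equal(sum(p), 1)))    # probabilities must add to 1

mu     <- sum(x * p)                       # mean: sum of x * Pr(X = x)
sigma2 <- sum((x - mu)^2 * p)              # variance: sum of (x - mu)^2 * Pr(X = x)
sigma  <- sqrt(sigma2)                     # standard deviation
```

Here mu comes to 1.85, which is the value subtracted inside the variance sum.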
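Examples
Here's how we calculate the mean and variance of the new random
variable, Z:
Z = a + bY
mean: $\mu_Z = a + b\mu_Y$
variance: $\sigma_Z^2 = b^2 \sigma_Y^2$
Z = aX + bY (X and Y independent)
mean: $\mu_Z = a\mu_X + b\mu_Y$
variance: $\sigma_Z^2 = a^2 \sigma_X^2 + b^2 \sigma_Y^2$
Say $\mu_X = 5$ and $\sigma_X^2 = 3$, and $\mu_Y = 2$ and $\sigma_Y^2 = 1$.
Consider the new random variable Z = 2X
In this case:
mean: $\mu_Z = 2\mu_X$
= 2 × 5 = 10
variance: $\sigma_Z^2 = 2^2 \sigma_X^2$
= 2² × 3 = 12
Consider Z = 4X + 3
In this case:
mean: $\mu_Z = 4\mu_X + 3$
= 4 × 5 + 3 = 23
variance: $\sigma_Z^2 = 4^2 \sigma_X^2$
= 4² × 3 = 48
Consider Z = X − 2Y
In this case:
mean: $\mu_Z = 1 \times \mu_X + (-2) \times \mu_Y$
= 1 × 5 + (−2) × 2
= 5 − 4 = 1
variance: $\sigma_Z^2 = 1^2 \sigma_X^2 + (-2)^2 \sigma_Y^2$
= 1 × 3 + 4 × 1
= 3 + 4 = 7
In general, with a = 1 and b = 1, Z = X + Y
In this case:
$\mu_Z = \mu_X + \mu_Y$
$\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$
and for Z = X − 2Y
In this case:
$\mu_Z = \mu_X - 2\mu_Y$
$\sigma_Z^2 = (1)^2 \sigma_X^2 + (-2)^2 \sigma_Y^2$
$= \sigma_X^2 + 4\sigma_Y^2$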
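The rules for linear combinations can be wrapped in a short R function. This is a sketch: the function name combine and its arguments are hypothetical, the variance rule assumes X and Y are independent, and the values for Y (mean 2, variance 1) are the ones implied by the worked arithmetic.

```r
# Mean and variance of Z = const + a*X + b*Y, with X and Y independent
combine <- function(const, a, b, mu_x, var_x, mu_y, var_y) {
  c(mean     = const + a * mu_x + b * mu_y,
    variance = a^2 * var_x + b^2 * var_y)   # added constants do not change variance
}

z_2x      <- combine(const = 0, a = 2, b = 0,  mu_x = 5, var_x = 3, mu_y = 2, var_y = 1)
z_4x_3    <- combine(const = 3, a = 4, b = 0,  mu_x = 5, var_x = 3, mu_y = 2, var_y = 1)
z_x_min2y <- combine(const = 0, a = 1, b = -2, mu_x = 5, var_x = 3, mu_y = 2, var_y = 1)
```

Note that the coefficient −2 is squared in the variance, so subtracting 2Y still adds 4 times the variance of Y.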
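Temperature Problem
$C = \frac{5}{9}(F - 32)$
$= \frac{5}{9}F - \frac{5}{9} \times 32$
$= \frac{5}{9}F - \frac{160}{9}$
$= -\frac{160}{9} + \frac{5}{9}F$
This has the form C = a + bF, with $a = -\frac{160}{9}$ and $b = \frac{5}{9}$.
Mean:
$\mu_C = -\frac{160}{9} + \frac{5}{9}\mu_F$
$= -\frac{160}{9} + \frac{5}{9} \times 70$
= 21.1 °C
Variance:
$\sigma_C^2 = b^2 \sigma_F^2$
$= \left(\frac{5}{9}\right)^2 \times 5^2$
= 7.716 (°C)²
The Standard deviation is just the square root of the variance
$\sqrt{7.716}$ = 2.78 °C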
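The temperature conversion can be checked numerically in R, using the Fahrenheit mean of 70 and standard deviation of 5 from the worked example. The variable names are illustrative.

```r
mu_F <- 70            # mean temperature in Fahrenheit
sd_F <- 5             # standard deviation in Fahrenheit

# C = a + b * F with a = -160/9 and b = 5/9
a <- -160 / 9
b <- 5 / 9

mu_C  <- a + b * mu_F        # the mean shifts and scales
var_C <- b^2 * sd_F^2        # the variance only scales, by b squared
sd_C  <- sqrt(var_C)         # same as b * sd_F, since b is positive
```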
Summary
Questions
Section 3. Probability
Lecture 13
Binomial Distribution
Probability Distribution
Single Trial
$y_i$            1       0
$\Pr(Y = y_i)$   $\pi$   $1 - \pi$
mean: $\mu_Y = 1 \times \pi + 0 \times (1 - \pi) = \pi$
The variance of this single trial is:
variance: $\sigma_Y^2 = (1 - \pi)^2 \pi + (0 - \pi)^2 (1 - \pi)$
$= (1 - \pi)^2 \pi + \pi^2 (1 - \pi)$
$= \pi (1 - \pi)(1 - \pi + \pi)$
$= \pi (1 - \pi)$
Binomial Distribution
$X = Y_1 + Y_2 + Y_3 + \ldots + Y_n$
Where all the $Y_i$s are independent of one another.
mean: $\mu_X = \mu_{Y_1} + \mu_{Y_2} + \mu_{Y_3} + \ldots + \mu_{Y_n}$
$= \pi + \pi + \pi + \ldots + \pi$
$= n\pi$
variance: $\sigma_X^2 = \sigma_{Y_1}^2 + \sigma_{Y_2}^2 + \sigma_{Y_3}^2 + \ldots + \sigma_{Y_n}^2$
$= \pi(1 - \pi) + \pi(1 - \pi) + \ldots + \pi(1 - \pi)$
$= n\pi(1 - \pi)$
standard deviation: $\sigma_X = \sqrt{n\pi(1 - \pi)}$
Use p (to estimate $\pi$) where $p = \frac{X}{n}$
Notes
Outcome is binary.
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
Where $\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$
Calculating Probabilities
n = 3, x = 2, $\pi$ = 0.4
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
$= \binom{3}{2} 0.4^2 (1 - 0.4)^{3 - 2}$
= 3 × 0.4² × 0.6
= 0.288
This is of course a very simple example. For more complicated examples,
instead of using the formula all the time we can use tables or software to
find probabilities.
Records show that twenty percent of violin pupils are known to develop
OOS during the course of their training. Define X to be the number of
violin pupils out of 9 who develop OOS during their training.
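For the small case n = 3, x = 2 with success probability 0.4, the binomial formula and R's dbinom function can be compared side by side. A minimal sketch with illustrative variable names:

```r
n <- 3       # number of trials
x <- 2       # number of successes of interest
p <- 0.4     # probability of success on each trial

# Binomial formula: (n choose x) * p^x * (1 - p)^(n - x)
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probability from the built-in binomial function
by_dbinom <- dbinom(x, size = n, prob = p)
```

Both give 0.288, and dbinom(0:n, size = n, prob = p) returns the whole distribution at once.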
Summary
Questions
Summary
Section 3. Probability
Lecture 14
A report suggests that 75% of NZ-born children live with both parents. A
random sample of 20 Māori children is selected, and asked whether they
live with both of their parents.
Immediately recognize this as a Binomial Distribution (X).
Define the parameters of the distribution of X (n = 20, $\pi$ = 0.75)
Find Pr(X = 15)
Find the probability that 11 or fewer live with both parents,
Pr(X ≤ 11).
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
$\Pr(X = 15) = \binom{20}{15} 0.75^{15} (1 - 0.75)^{20 - 15}$
$= \frac{20!}{15!\,(20 - 15)!} 0.75^{15} (0.25)^5$
= 0.2023
In practice, you'll use R:
dbinom(x=15,size=20,prob=0.75)
The command dbinom will provide the individual binomial probabilities
associated with a given outcome, provided the number of trials (size) and
the probability of success (prob).
Find the probability that 11 or fewer are found to live with both
parents.
Pr(X ≤ 11) = Pr(X = 0, 1, 2, . . . , 11)
= Pr(X = 0) + Pr(X = 1) + . . . + Pr(X = 11)
$= \sum_{i=0}^{11} \Pr(X = i)$
= 0.0410
R:
pbinom(q=11,size=20,prob=0.75)
The command pbinom will provide the sum of all individual binomial
probabilities less than or equal to a given outcome, provided the number
of trials (size) and the probability of success (prob). Can you think how
we can use this to determine greater than probabilities?
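One way to get "greater than" probabilities is to subtract the cumulative probability from 1, or to ask pbinom for the upper tail directly. The sketch below uses the n = 20, probability 0.75 example; the variable names are illustrative.

```r
# Pr(X <= 11): the cumulative probability that pbinom returns by default
p_le_11 <- pbinom(11, size = 20, prob = 0.75)

# Pr(X > 11) = 1 - Pr(X <= 11), since all possibilities add to 1
p_gt_11 <- 1 - p_le_11

# The same upper-tail probability, requested directly
p_gt_11_alt <- pbinom(11, size = 20, prob = 0.75, lower.tail = FALSE)

# "11 or more" includes 11 itself, so step back one value: Pr(X >= 11) = Pr(X > 10)
p_ge_11 <- pbinom(10, size = 20, prob = 0.75, lower.tail = FALSE)
```

The distinction between "more than 11" and "11 or more" matters for a discrete variable, which is why the last line uses 10 rather than 11.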
In a pilot study in Auckland, three out of seven patients given a new
drug had their tumor size halved. What conclusion if any can be
drawn about the new drug? Explain how you reach your conclusion.
Find the probability that three or more of the patients have their
tumor size halved
$\Pr(X \geq 3) = \sum_{i=3}^{7} \Pr(X = i)$
= Pr(X = 3) + Pr(X = 4) + . . . + Pr(X = 7)
= 0.2269 + 0.0972 + 0.0250 + 0.0036 + 0.0002
= 0.3529
p = Probability of the observed occurrence (X = 3), or a more
unlikely occurrence
p = Pr(X ≥ 3) = 0.3529
p > 0.05, implying this is not an unusual event, hence the outcome is
consistent with the standard drug success rate of 30%.
There is no evidence to suggest the new drug is any different to the
standard.
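The tumor example (7 patients, standard success rate 30%) can be reproduced in R. This is a sketch with illustrative names; it treats the number of patients responding as binomial with n = 7 and success probability 0.3, which is the assumption behind the calculation.

```r
n_patients <- 7
p_standard <- 0.3          # success rate of the standard drug

# Individual probabilities Pr(X = 3), ..., Pr(X = 7)
terms <- dbinom(3:7, size = n_patients, prob = p_standard)
round(terms, 4)

# Pr(X >= 3) is the upper tail beyond 2
p_value <- pbinom(2, size = n_patients, prob = p_standard, lower.tail = FALSE)

# Compare with the usual 0.05 threshold
unusual <- p_value < 0.05
```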
Questions
Summary
Section 3. Probability
Lecture 15
Normal Distribution
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2}$.
Examples
Always draw a diagram to identify the area you want.
In the following examples, pnorm(q) gives you the area (of a Standard
Normal) below the point q.
Find Pr (0 < Z < 1.64)
pnorm(1.64) - pnorm(0)
Probability = 0.4495
or
Probability = 0.0505
pnorm(1.64) - pnorm(1)
Probability = 0.1082
Probability = 0.7908
Calculating Probabilities
Find the value Z above which 25% of the area lies.
qnorm(0.75)
or
qnorm(0.25,lower.tail=FALSE)
Z = 0.6745
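The Standard Normal calculations that use pnorm and qnorm can be run as follows. The variable names are illustrative; pnorm gives the area below a point and qnorm goes the other way, from an area to a point.

```r
# Areas between two points are differences of areas below each point
p_0_to_1.64 <- pnorm(1.64) - pnorm(0)      # Pr(0 < Z < 1.64)
p_1_to_1.64 <- pnorm(1.64) - pnorm(1)      # Pr(1 < Z < 1.64)

# Area above a point is 1 minus the area below it
p_above_1.64 <- 1 - pnorm(1.64)            # Pr(Z > 1.64)

# Value with 25% of the area above it, found in two equivalent ways
z_upper25     <- qnorm(0.75)
z_upper25_alt <- qnorm(0.25, lower.tail = FALSE)
```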
Probability = 0.1359
Probability = 0.0668
Percentage = 6.68%
Height = 182.82 cm
Z-Score
Any Normal distribution value (say with mean $\mu_X$ and s.d. $\sigma_X$) can
be put on the Standard Normal scale.
This is called calculating the Z-Score, or Z-Value.
It's essentially transforming your Normal value into a Standard
Normal value.
Z-Value: $Z = \frac{X - \mu_X}{\sigma_X}$