Statistics Level 1

STAT115 INTRODUCTION TO BIOSTATISTICS 2013
Advances in our understanding of factors which affect health and wellbeing come through
research in the health sciences. Examples of such research include surveys to describe
patterns of disease in a community or risk factors for disease such as diet and smoking; studies
trying to find out whether a newly developed treatment works; studies of factors which may
prevent disease such as physical activity; studies of barriers to improving health such as
reasons for declining vaccination rates in children, prevention of smoking. Biostatistics
(statistics applied in the health sciences) is a vital tool in our mission to improve health and
wellbeing for all people.
STAT115 provides an introduction to the core principles and methods of biostatistics. In this
course you will gain an understanding of how statistics is used to answer research questions:
how to look for patterns in data, how to test hypotheses about disease causation and prevention
and improvement in well-being. The understanding and skills gained in STAT115 can be a
starting point for a career in biostatistics or can be used to assist understanding of research in
other disciplines including physiology, anatomy, human nutrition, sports science, and
psychology.
GENERAL INFORMATION AND ADMINISTRATION

Lecturers
Dr Tilman Davies, Room 516, Science III Building.
Ms Megan Drysdale, Room 231, Science III Building.
Dr Katrina Sharples, Dept. of Preventive and Social Medicine, Adams Building.
Contact: stat115@maths.otago.ac.nz
Lectures
Lectures are held as follows: Monday, Tuesday, Thursday and Friday at 11.00 am,
commencing Monday 8 July. Although these notes are extensive, experience shows that
students who miss lectures are at a severe disadvantage.
STAT 115 Web Page and Resource Area

The STAT 115 web page: www.maths.otago.ac.nz/?stat115 will contain course resource
material. Notices, old exam papers with solutions and any other useful information will be
posted here. You can access such information by clicking on the Resources button. You are
strongly advised to check through the solutions to weekly exercises as students who fail to do
this are at a severe disadvantage.
Help Sessions and Tutorials

These will be held in the North CAL laboratory in Science III (directly outside the Science
Library). Tutorials are cafeteria style which means that you can attend at any scheduled time.
Times can be found on the STAT115 paper page on the Mathematics and Statistics
Department website. Help with weekly exercises will be given as well as access to
computers.
Support Classes
We have students from a range of backgrounds in the course. If you are concerned about your
mathematical skills, we have a Support Class available on Tuesday evenings from 6pm to
8pm (commencing week 1) in the North CAL lab, ground floor Science III building, opposite
the Science Library. In order to check whether this is appropriate for you, have a look at the
"Basics Booklet" in Appendix 1 of these notes. The Support Class is designed for those who
struggle with the material in the booklet.
Practice resources
Check out the following:
1. Basics Booklet in Appendix 1
2. MATHERCIZE http://mathercize.otago.ac.nz, login password is plus
References
There is no set text for the course as this course booklet contains all material necessary. If
further reference materials are desired, two useful texts are:
Clark, M.J. and Randal, J. R. A First Course in Applied Statistics. Pearson
MacGillivray, H. Utts & Heckard's Mind on Statistics. Cengage Learning.
Multiple copies of both references are in the Science Library on close reserve at the Loans
Desk.
Computing
The R package will be used in tutorials. No prior knowledge of R is needed: a handout
(found at the end of this book) is provided and full instructions will be given in lectures. All students will
have their own User Name and Password. The User Name is the name on your student ID
card and the Password is your student ID number.
Time Commitment
STAT 115 is a one semester course worth 18 points. It is expected that students should spend
an average of 12 hours per week on this course. After allowing four hours per week attending
lectures, this leaves eight hours for other course related activities such as assignments, reading
notes and revising.
Calculators
You can use any type of calculator in the tests, the final exam and the assignments, as long as
it has no communication capability. This means cellphones are not permitted. You are
allowed to use a graphics calculator, but you do not need to buy one specially. A basic
scientific calculator that allows you to calculate a mean and a standard deviation is sufficient.
Be familiar with the working of your calculator.
Course content (in approximate lecture order)

Introduction (Tilman, 2 lectures): research methods and study design; designed experiments
versus observational studies; case control, cohort and intervention studies.
Data description and presentation (Megan, 6 lectures): the use of R; histograms,
box-and-whisker plots, measures of centre and spread of data, measures of disease frequency
and association.
Probability (Tilman, 8 lectures): the nature of random variation; diagnostic tests; probability
distributions including the binomial and normal distributions.
Estimation (Tilman, 5 lectures): sampling distributions; confidence intervals for means,
differences and proportions.
Hypothesis testing (Tilman, 3 lectures): classical procedures for means, proportions, and
differences; the p-value; statistical vs clinical significance; power and sample size.
Analysis of variance (Tilman, 3 lectures): completely randomised design; multiple comparisons.
Categorical data (Megan, 4 lectures): tests for association; rates, relative risk and risk
differences, odds ratios; confidence intervals for relative risk and odds ratio.
Regression and correlation (Megan, 4 lectures): the simple linear regression model; tests on
the slope; predictions; confidence intervals for predictions; correlation.
Multiple regression (Megan, 4 lectures): tests on the estimated parameters; dummy variables
for qualitative predictors; parallel regressions and control of confounding.
Ethics and study design (Katrina, 7 lectures): ethical issues, bias and confounding.
Internal Assessment
There will be eight assignments and three mastery tests. Each assessment will have a scaled
mark recorded out of 20. These assessments will be administered electronically.
Your scores from the mastery tests contribute 2/3 of your internal assessment component, the
assignments contribute the remaining 1/3.
A) Assignments
1. These can be completed anywhere you have an internet connection.
2. Because they are electronic, cutoff is prompt at 0900.
3. Due to the assignment being electronic, no extension is possible.
B) Mastery tests
WHERE AND WHEN - These 3 tests are administered electronically in North CAL on the
dates specified on the course schedule. The tests cannot be taken outside of the scheduled
testing periods nor in any other venue.
BOOKING - This is done in advance online via the Resources Page. You will be advised on
Blackboard when booking is open.
TIME ALLOWED - 20 minutes (within the 30-minute booking slot)
FORMAT - Multi-choice. It is also open book, but if you simply rely on having all your notes
available you will probably find it difficult to complete the test on time.
TOPICS - Material to be tested will be advised in advance on Blackboard
NOTES
1. Bring your student ID card.
2. Be on time so as not to disturb others and also to avoid delaying the start of the next
session.
3. Bring your calculator. You are not allowed to use your cellphone as a calculator.
4. Open R before you log on to the Resources Page.
5. Be mindful when scrolling that you do not inadvertently change your selected answer.
6. Any issues arising during the test must be brought immediately to the attention of the
supervisor so that remedial action can be taken and you complete the test before leaving
the room. There is no comeback once you have left the venue.
7. There is no resit for these mastery tests.
IMPORTANT - The only devices to be in operation during any testing period are:
1. the lab computer with just the test and R active on screen;
2. a calculator
The use of other devices, e.g. cellphone, tablet or laptop, or using other programmes on
the computer, e.g. email, may be deemed an attempt to cheat.
Security
You are strongly advised to ID your calculator and other personal devices as these frequently
get left behind in the computer laboratory.
Exam format
A three-hour multiple choice exam will produce a mark out of 100.
Final mark
In your overall mark we will count your exam mark for 2/3 of the total and the internal
assessment for 1/3. However, if your final exam mark taken out of 100 is greater than this
weighted mark, we will use just the final exam mark. That is, the final mark F will be calculated as:
F = max{E, (2E + A)/3}
where E (exam mark) is out of 100 and A (internal assessment) is out of 100. The internal
assessment marks will be made up 1/3 from the eight assignments and 2/3 from the three
mastery tests.
E.g. 1 - 85% in exam, 80% for internal assessment. The exam mark is greater than the
internal assessment so final mark = 85%.
E.g. 2 - 60% in exam, 90% for internal assessment. The internal assessment is greater than
the exam so final mark = 2/3 of exam (40%) + 1/3 of internal assessment (30%) = 70%.
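The final-mark rule and the two worked examples can be checked with a few lines of code. This is a minimal Python sketch for illustration only (the course itself uses R); the function names `internal_assessment` and `final_mark` are made up here and are not part of any course software.

```python
def internal_assessment(assignments_pct, mastery_pct):
    """Internal mark A: 1/3 from the assignments, 2/3 from the mastery tests."""
    return assignments_pct / 3 + 2 * mastery_pct / 3


def final_mark(exam, internal):
    """Final mark F: the larger of E and (2E + A)/3, with E and A out of 100."""
    return max(exam, (2 * exam + internal) / 3)


# E.g. 1: exam 85, internal 80 -> the exam mark alone is used
print(final_mark(85, 80))
# E.g. 2: exam 60, internal 90 -> 40 + 30
print(final_mark(60, 90))
```

Because the larger of the two values is taken, a strong internal assessment can raise the final mark above the exam mark, but a weak one cannot lower it.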
Terms requirement
There is no terms requirement for this course. If you miss a test or an assignment, you can
still potentially pass the course.
STAT115
Introduction to Biostatistics
Section 1: Introduction
Lecture 1

Section 1
Biostatistics and research: an overview

Course aim:
An introduction to the core biostatistical methods essential to the health sciences
- scientific method
- design of research studies
- description and analysis of data

Learning aims and objectives
By the end of the course students should
- be aware of the appropriate use of common study designs and their strengths and weaknesses
- be able to describe the information contained in a data set
- be able to carry out common statistical data analyses
- be able to interpret the results of common statistical analyses in the context of the particular study design used
- be able to critically evaluate selected research articles published in health sciences journals

Goal of health sciences professions
To improve the health and well-being of individuals and communities
This involves
- treatment of disease
- prevention of disease
- promotion of health
In order to do this we need knowledge about
- causes of disease
- diagnosis
- disease processes
- effectiveness of treatments
- societal factors which affect health

Examples of current gaps in knowledge
Common diseases:
- Diabetes
- Cancer
New diseases
- HIV, SARS, avian influenza
Exposures
- Vitamin D deficiency
- Smoking
New technologies
- Cell phones, 3D technology

Research
A process for providing answers to questions for which the answer is not immediately available

Examples of general health research questions
- What are the genetic events which lead to childhood cancer?
- Can we develop a vaccine to prevent SARS?
- Can we develop vaccines against cancer cells?
- How do we stop people smoking?
- Can a new drug improve survival in people with colorectal cancer?
- How can we prevent childhood overweight and obesity?
- What are the main factors affecting quality of life of people with a chronic illness?
Research provides a systematic process for answering these questions

Invasive potential of CIN3
Epidemiologists
- Margaret McCredie
- Charlotte Paul
- David Skegg
Statistician
- Katrina Sharples
Department of Preventive and Social Medicine, University of Otago
Gynaecological oncologist
- Ron Jones, National Women's Hospital, Auckland
Pathologist
- Judith Baranyai, Lab Plus, Auckland
Cytologist
- Gabrielle Medley, Melbourne Pathology, Victoria, Australia

Dr Green's clinical study
- Clinical study of the natural history of carcinoma in situ (CIS) (1965-74)
- Carried out at National Women's Hospital, Auckland, New Zealand
- Aim: to investigate Dr Green's hypothesis that CIS was not a precursor of invasive cancer
- Involved withholding or delaying treatment of curative intent for a group of women diagnosed with CIS
It has since been the subject of a Judicial Inquiry (1987-88), which
- concluded that the study was unethical
- recommended that the histological and other material kept at National Women's should be available for properly planned and approved research and teaching

HPV transmission model
[Figure labels: HPV infection; Absence of virus production; virus production; E6-E7 production; Viral DNA integration]
Smith MA, Canfell K. Int. J. Cancer: 123, 1854-1863 (2008)
http://www.bioacademy.gr/lab/lab.php?lb=36&pg=6

Research Aim
To estimate the long term risk of cervical cancer in
i) women whose CIN3 lesion was minimally disturbed and
ii) women who had persisting CIN3

Our study
- Women diagnosed with CIS at National Women's Hospital between 1 Jan 1955 and 31 Dec 1976 (1063 women)
- Information on smears and procedures extracted from hospital notes
- Endpoint: invasive cancer of cervix or vaginal vault
[Figure labels: Initial treatment of CIN3 lesion; Time until adequate treatment; Invasive cancer of cervix or vaginal vault; Initial treatment punch or wedge biopsy; Follow-up censored after treatment; No censoring]

Blakely T, Shaw C, Atkinson J, Tobias M, Bastiampillai N, Sloane K, Sarfati D, Cunningham R. 2010.
Cancer Trends: Trends in Incidence by Ethnic and Socioeconomic Group, New Zealand 1981-2004.
Wellington: University of Otago and Ministry of Health.

Possible reasons for poorer survival after diagnosis
- Different disease sub-type
- Different stage of disease at diagnosis
- Differences in access to or uptake of treatment
- Co-morbidities
- Poorer follow-up
Hill et al. J Epidemiol Community Health 2010;64:117-123.

PIPER Aims
- To compare progression free survival in patients diagnosed with colon cancer and rectal cancer (CRC) according to: location of residence (urban or rural & distance from treating centre); ethnicity; and socio-economic deprivation of area of residence.
- To identify differences in patient presentation, diagnostic evaluation, treatment and follow-up which contribute to differences in outcome by rurality, ethnicity or socio-economic deprivation

PIPER Study design
- National study including 6323 patients
- Data obtained by reviewing patients' notes and hospital databases
- Analyses will compare the demographic and disease characteristics, treatment delivery and follow-up among the different ethnic groups
- Overall goal is to identify areas for intervention in order to improve outcomes
- A secondary goal is to set up prospective data collection to allow research into better treatments

Research
The objective for most research studies is to use data from a sample to draw inference about a larger population.

Scientific method
[Diagram: the scientific method as a cycle linking hypothesis, predictions, experiment/observation (carry out research) and accept/reject/refine theory]

Steps in the research process
- Development of the research question
- Design of the study
- Collection of information
- Data description and analysis
- Interpretation of results

Research questions relevant to course
Epidemiology: the study of distribution and determinants of disease frequency
Clinical research: the study of questions relating to care of patients
Descriptive questions:
- What is the distribution of a disease?
- What is the natural history of a disease?
Analytic questions:
- What are the causes of a disease?
- Will this approach prevent disease?
- Does this treatment improve outcome?

Introduction to study design
1. Descriptive studies (studies which describe things)
2. Analytic studies (studies which test hypotheses)
   - Experimental studies
   - Observational studies
   - Examples of types of analytic study
3. Summary
   - Classification of research designs
   - Classification of common study types

Descriptive studies
Aim: to describe, for example:
- the characteristics of people with a disease (person, place, time)
- lifestyle patterns in a population
- attitudes to health care
Descriptive studies are often called surveys or cross-sectional studies
Descriptive studies generally use a sample from a population

Example: What are the serum cholesterol levels of New Zealanders?
Method: Select a subgroup (sample) of people and measure their serum cholesterol levels

Random sampling
- choose the sample in such a way that every individual in the population has a known chance of being selected
- in a simple random sample, everyone has an equal chance of being chosen
- this method is the best way of obtaining a sample which is representative of the population

Suppose we want to estimate mean cholesterol in the population:
Sample mean = Population (true) mean + Error
The error is made up of systematic error and random error.
Random error:
- due to natural biological variability
- increasing the sample size will reduce the random fluctuations in the sample mean
Systematic error (= bias):
- due to aspects of the design or conduct of the study which systematically distort the results
- occurs if a sample is not representative of the population
- cannot be reduced by increasing the sample size
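
The difference between the two kinds of error can be seen in a small simulation. This is a minimal Python sketch for illustration only (the course itself uses R). The population mean, standard deviation and size of the bias are hypothetical numbers chosen for the example, not data from any study.

```python
import random
import statistics

random.seed(115)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 5.5  # hypothetical population mean cholesterol
TRUE_SD = 1.0    # hypothetical population standard deviation
BIAS = 0.3       # hypothetical systematic error added to every measurement


def sample_mean(n, bias=0.0):
    """Mean of n simulated measurements, each shifted by a fixed bias."""
    values = [random.gauss(TRUE_MEAN, TRUE_SD) + bias for _ in range(n)]
    return statistics.mean(values)


def summarise(n, bias=0.0, repeats=200):
    """Average error and spread of the sample mean over repeated samples."""
    means = [sample_mean(n, bias) for _ in range(repeats)]
    return statistics.mean(means) - TRUE_MEAN, statistics.stdev(means)


for n in (10, 100, 1000):
    avg_error, spread = summarise(n, bias=BIAS)
    print(f"n={n:5d}  average error={avg_error:+.3f}  spread={spread:.3f}")
```

As the sample size grows, the spread of the sample mean (random error) shrinks, while the average error stays close to the bias (systematic error) regardless of the sample size.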

Analytic studies
Purpose: to test hypotheses about, for example:
- causes of disease
- methods for prevention of disease
- the effects of treatments
Experimental studies
- the researcher intervenes and records the result of their intervention
- the aim is to control all other factors to isolate the effects of the intervention
- best way to study causation
Observational studies
- the investigator does not intervene, simply observes a naturally occurring process, and collects information
- ideal is to get as close as possible to the information that would have been obtained if the experimental study could have been done

Example: Options for studying the relationship between smoking and heart disease
Experiment: randomised controlled trial
[Diagram: Participants (non-smokers) -> Random allocation -> Smoke / Don't smoke -> Follow-up -> Heart disease?]
Observational study: cohort study
[Diagram: Participants -> Smokers / Non-smokers -> Follow-up -> Heart disease?]
Observational study: case-control study
[Diagram: People with heart disease (cases) and people without heart disease (controls) -> Smokers? in each group, and the two groups are compared]

STAT115
Introduction to Biostatistics
Section 1: Introduction
Lecture 2

Randomised controlled trial
The "gold standard" analytic study
Characteristics of an RCT:
- select a group of people
- randomly allocate them to either an intervention group(s) or a control group
- follow participants up over time, and measure outcome
A control group is used to isolate the effects of the intervention
Random allocation, or randomisation, means every person has the same chance of being in each group. This gives the best chance of getting two groups which are comparable in all respects
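
Random allocation can be sketched in a few lines. This is a minimal Python illustration (the course itself uses R), assuming a simple shuffle-and-split scheme. The participant IDs and the helper name `randomise` are made up for the example, and real trials may use other allocation schemes.

```python
import random


def randomise(participants, seed=None):
    """Shuffle the participants and split them into two groups of (near) equal size."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"intervention": shuffled[:half], "control": shuffled[half:]}


participants = [f"P{i:02d}" for i in range(1, 21)]  # 20 hypothetical participant IDs
groups = randomise(participants, seed=2013)
print("Intervention:", groups["intervention"])
print("Control:     ", groups["control"])
```

Because the shuffle ignores everything about the participants, neither the researcher nor the participant influences who ends up in which group.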

Common analytic study designs
Experimental:
- Randomised controlled trial
Observational:
- Cohort study
- Case-control study

Randomised controlled trial
- Used to evaluate new treatments or preventive strategies
- Often not ethical in studies of disease causation

Example RCT: LIPID study (NEJM, 1998)
Does treatment with pravastatin reduce the risk of death in patients with coronary heart disease?
Study participants:
- 9014 patients
- age 31-75
- coronary heart disease
- cholesterol 155-271 mg/decilitre

LIPID study
As participants were recruited to the study they were allocated to either pravastatin or control according to a random number sequence

LIPID trial results
[Diagram: Participants (n=9014) -> Randomisation -> Placebo (n=4502) and Pravastatin (n=4512) -> Follow-up 6 years]
- Placebo: 8.3% died
- Pravastatin: 6.4% died
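
The two death proportions in the LIPID results can be compared directly. This is a minimal Python sketch for illustration only (the course itself uses R). It assumes the 8.3% figure belongs to the placebo group and the 6.4% figure to the pravastatin group, and it uses the rounded percentages rather than the underlying counts.

```python
placebo_risk = 0.083      # proportion who died, placebo group (assumed pairing)
pravastatin_risk = 0.064  # proportion who died, pravastatin group (assumed pairing)

# Absolute difference in risk between the two groups
risk_difference = placebo_risk - pravastatin_risk
# Risk in the treated group relative to the control group
relative_risk = pravastatin_risk / placebo_risk

print(f"Risk difference: {100 * risk_difference:.1f} percentage points")
print(f"Relative risk (pravastatin vs placebo): {relative_risk:.2f}")
```

Risk differences and relative risks are treated formally in the categorical data section of the course.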

Randomised controlled trial
Advantages:
- experiment - the best way to test an hypothesis
- if the trial is well conducted, differences in outcome can be attributed to the intervention
Disadvantages:
- may not be ethical or feasible

Cohort study
Observational study, generally carried out to test hypotheses
Characteristics:
- participants are selected before disease has developed
- followed over time to determine development of disease
- information is collected about exposures at baseline and during follow-up

Example: British Doctors Study
Aim: to investigate the relationship between smoking and lung cancer
[Diagram: Doctors on the British medical register (men, n=24,389) -> Sent a questionnaire about their smoking habits -> Smokers (n=21,296) and Non-smokers (n=3093) -> Lung cancer?]
Found smokers had a 14 fold higher risk of lung cancer than the non-smokers

Case-control study
Observational study, generally carried out to test hypotheses
Characteristics
- participants are chosen on the basis of their disease status: a group with disease (cases) and a group without (controls)
- information is collected from people with and without disease about exposures that occurred in the past
- longitudinal (retrospective)

Example: Risk of venous thromboembolism after air travel (case-control study)
Cases: people in the population with deep vein thrombosis (n=210)
- Long distance flight (n=31)
- No long distance flight (n=179)
Controls: a random sample of people who have not had a deep vein thrombosis (n=210)
- Long distance flight (n=16)
- No long distance flight (n=194)
Findings:
- odds ratio = 2.1, 95% confidence interval (1.1 to 4.0)
- Air travel doubled the odds of venous thromboembolism
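
The reported odds ratio can be recomputed from the four counts in the air travel example. This is a minimal Python sketch for illustration only (the course itself uses R). The confidence interval uses the log-based (Woolf) approximation, which is an assumption, since the source does not say how its interval was obtained.

```python
import math

# Counts from the example; exposure = long distance flight
cases_exposed, cases_unexposed = 31, 179        # people with deep vein thrombosis
controls_exposed, controls_unexposed = 16, 194  # people without deep vein thrombosis

# Odds of exposure among cases divided by odds of exposure among controls
odds_ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)

# Assumed method: standard error of log(OR) from the four cell counts
se_log_or = math.sqrt(
    1 / cases_exposed + 1 / cases_unexposed
    + 1 / controls_exposed + 1 / controls_unexposed
)
z = 1.96  # multiplier for an approximate 95% interval
lower = math.exp(math.log(odds_ratio) - z * se_log_or)
upper = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"Odds ratio = {odds_ratio:.1f}, 95% CI ({lower:.1f} to {upper:.1f})")
```

Under this approximation the result agrees with the reported 2.1 (1.1 to 4.0) to one decimal place.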

Cohort vs case-control studies

Cohort study
Advantages
- closest observational study to randomised controlled trial
- good for examining common outcomes
- can evaluate the effect of exposure on multiple outcomes
Disadvantages
- long duration needed if the disease takes a long time to develop after exposure
- if the disease is rare, the number of participants needs to be very large

Case-control study
Advantages
- relatively quick
- smaller than cohort studies, particularly for rare diseases
- can examine the effects of multiple exposures
Disadvantages
- events have already occurred so the potential for bias is higher

Classification of research designs
i) Classification by purpose of the study
   descriptive (describe things) vs. analytic (testing hypotheses)
ii) Classification by form of the design
   experimental (researcher intervenes) vs. observational (researcher observes)
iii) Classification by time
   cross-sectional (information collected about one point in time) vs. longitudinal

Classification of common study types
Randomised controlled trial
   analytic, experimental, longitudinal (prospective)
Cohort study
   analytic, observational, longitudinal (usually prospective)
Case-control studies
   analytic, observational, longitudinal (retrospective)
These classifications provide a useful framework for thinking about the strengths and
weaknesses of different study designs, but they will not always work

STAT115
Introduction to Biostatistics
Megan Drysdale
University of Otago
Course Co-ordinator: Dr Tilman Davies
Section 2: Data Description and Presentation

Lecture 3

Data and variables
There are two types of measurement of interest in many scientific studies...
1. First, the outcomes measured on each experimental unit (plant, animal, person) provide
values of what is called a response variable.
2. Second, the characteristics or levels of exposure that explain at least some of the
differences in the observed values of the response variable are called explanatory variables.
Data forming the response and exposure variables can be either categorical
or numerical (otherwise known as qualitative and quantitative
respectively).

Iron levels in newborn children
- Anthony-Sivan et al. (2012) measured cord ferritin levels in 140 pregnant women in Israel.
- Mothers in the stress group (n = 63) were in the first trimester during the period that the
area was under rocket attack.
- Mothers in the control group became pregnant 3-4 months after the attacks ended.
- Results indicated that cord ferritin levels tended to be lower in the stress group.
Question: what are the response and explanatory variables?

Types of data: categorical data
Categorical data takes on values in a fixed number of categories. The
simplest kind involves just two categories. For example, a person could
be...
- male/female
- smoker/non-smoker
- diabetic/non-diabetic
Such data are also called binary data, dichotomous data, yes/no data and
0-1 data.
The last is particularly important, for example 0 would typically represent
non-diabetic and 1 would represent diabetic.

Example: Altruism gene?
- Subjects had two versions of a particular gene (COMT)
- Participants earned money, and then had the opportunity to give it to
a poor child in a developing country
- Study claimed that participants carrying the Val version of the gene
gave more money

Types of data: categorical data (multiple)
We often need to use more than two categories.
Data are nominal if there is no natural (or relevant) ordering...
- Blood group: A/B/AB/O
- Ethnicity: Maori/Pacific Island/Caucasian/Asian
Data are ordinal if there is a natural ordering...
- Degree of pain: minimal/moderate/severe/unbearable
- "Megan is an awesome lecturer": strongly agree/agree/neutral/disagree/strongly disagree
Even in this case it can be misleading to code the categories as integer
values (e.g. 0,1,2,3 for Degree of pain). Is "unbearable" three times more
severe than "moderate"?

Example: Beer, anyone?
- Survey data collected to analyse drinking habits in NZ.
- Age categories on the form were 18-24, 25-34, 35-49, 50+

Types of data: discrete numerical data
With discrete data, observations take only certain numerical values,
typically integers or whole numbers. For example:
- number of possums caught in traps
- number of children in a family (0,1,2,3,4,...)
It is important to note that these are not like categorical data as the
numerical representations are always consistent... e.g. 3 children is three
times as many as one.
This type of data can be treated as though it is categorical if we must, but
this discards information about the magnitude of the relationships between
successive outcomes.

Types of data: continuous numerical data
Here recorded values or observations result from some form of
measurement. For example:
- height, age, blood pressure, serum cholesterol, oxygen levels in a lake, ...
Often no restriction on values other than that caused by accuracy of
equipment for recording values.
Often the values show a pattern similar to what is called the bell-shaped
normal curve, with many values clustered around a central point and few
values in the tails.

Commonly used measures: ratios, proportions
Ratio: fraction given by one quantity over another. Both quantities have
the same units.
Example: In a class with 10 boys and 20 girls, the ratio of boys to
girls is 10/20 (= 1/2) and the ratio of girls to boys is 20/10 (= 2)
Proportion: fraction of one quantity when compared to the whole.
Example: In a class with 10 boys and 20 girls, the proportion of boys
is 10/(10+20) = 1/3

Commonly used measures: percentages
Proportions are often expressed in terms of percentages.
To convert proportions to percentages, multiply by 100 and add a % sign.
To convert percentages to proportions, divide by 100 and remove the % sign.
Example: 30% = 0.3, 56% = 0.56 etc.
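
The ratio, proportion and percentage calculations for the class of 10 boys and 20 girls can be reproduced in code. This is a minimal Python sketch for illustration only (the course itself uses R); the helper names `to_percentage` and `to_proportion` are made up for the example.

```python
from fractions import Fraction

boys, girls = 10, 20

ratio_boys_to_girls = Fraction(boys, girls)     # one quantity over another
ratio_girls_to_boys = Fraction(girls, boys)
proportion_boys = Fraction(boys, boys + girls)  # one quantity over the whole


def to_percentage(proportion):
    """Convert a proportion to a percentage: multiply by 100."""
    return 100 * proportion


def to_proportion(percentage):
    """Convert a percentage to a proportion: divide by 100."""
    return percentage / 100


print("Ratio of boys to girls:", ratio_boys_to_girls)
print("Ratio of girls to boys:", ratio_girls_to_boys)
print("Proportion of boys:", proportion_boys)
print(f"Percentage of boys: {to_percentage(float(proportion_boys)):.1f}%")
print("30% as a proportion:", to_proportion(30))
```

Using `Fraction` keeps the ratio and proportion exact, so they print in the same reduced form as in the examples.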

Careful with percentages
NZ Herald, Friday, June 8, 2012:
"Of Australia's top 200 listed companies, 12.7 per cent had
female directors by the beginning of August, compared to 9.3 per
cent for the top 100 listed companies here."
12.7% = 0.127...
0.127 × 200 companies = 25.4 companies with female directors? Hmmm...

Commonly used measures: rates
Rates are like ratios for quantities with different units.
- Number of new cases of HIV in NZ per year (this is the incidence of HIV)
- Number of people with HIV in NZ at a certain time (this is the prevalence of HIV)
- Number of children per family
- Number of yawns per lecture
Usual practice is to simplify rates to a per unit measure...
- 13 deaths over 5 years is 2.6 deaths per year
- Question: Is 13 deaths per 5 years better than 10 deaths per 4 years?
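
Putting both rates on a per-year basis makes them directly comparable. This is a minimal Python sketch for illustration only (the course itself uses R); the helper name `rate_per_unit` is made up for the example.

```python
def rate_per_unit(count, units):
    """Simplify a rate to a per-unit measure, e.g. deaths per year."""
    return count / units


rate_a = rate_per_unit(13, 5)  # 13 deaths over 5 years
rate_b = rate_per_unit(10, 4)  # 10 deaths over 4 years

print(f"13 deaths per 5 years = {rate_a:.1f} deaths per year")
print(f"10 deaths per 4 years = {rate_b:.1f} deaths per year")
```

Once both are expressed per year, the comparison in the question comes down to comparing two numbers with the same units.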

Commonly used measures: scores
Continuous phenomena are often scored and binned in survey data as
ordinal categories. For example:
- Responses in a question about back pain might be on a scale of 1 (no
pain) to 5 (unbearable pain).
- Levels of agreement in a survey (e.g. course evaluation) might be
labelled a great deal / somewhat / not much / not at all
In both cases, the responses might be numbered (e.g. 0,1,2,3,4), but care
must be taken interpreting these as numerical data.

Computing with Data: R
- R is an open source (free) software package designed for statistical analysis.
- It was designed and first developed in NZ, but is now used around the world.
- It is powerful, but takes a bit of learning.
Anyone on Facebook at the moment?

Using R: At home
You can use R on Macs, Windows and Linux. R is available as a free
download from
http://cran.r-project.org/
This is a self-extracting executable that will install everything for you.

A bit of a demo...
To give you an initial look, I'll now
1. Start up R
2. Import some data
3. Recode a numerical variable as a categorical variable

STAT115
Megan Drysdale
University of Otago

Lecture 4

Statistics and samples
In statistics we often (informally) talk about samples from a population.
What do we mean?
By population we mean a set of individuals or entities or subjects that we
wish to make inferences about. This could be a real population (e.g.
males in Dunedin), but doesn't need to be.
By sample we mean a subset of the population, usually selected at random.
A statistic is a quantity that we can compute from a sample (hence the
subject name!)

Describing numerical data
Graphs can be used to summarise sample data, though many graphs can
be highly misleading. Today we shall look at summarising numerical data
graphically using histograms, and how to make these with R.
In subsequent lectures we will talk about particular values which
summarise numerical data, including:
mean; median; mode
standard deviation; interquartile range
These statistics measure the centre and the variability of the sample data
respectively.
And then we will look at box and whisker plots, another way of displaying
numerical data.
Example: hypertension data
In a hypertension study 56 men who are heavy smokers (smoked for 25
years) have blood pressures measured (in mm of Hg). Blood pressures are
classified into intervals to form a frequency table and interval frequencies
(f_j) are obtained:
Pressure (mm of Hg)    Frequency f_j
59.5 - 69.5             2
69.5 - 79.5             7
79.5 - 84.5             9
84.5 - 89.5            10
89.5 - 94.5            11
94.5 - 99.5             7
99.5 - 109.5            8
109.5 - 119.5           2
Total                  56 (sample size)
Computing proportions
...and percentages
Pressure (mm of Hg)    f_j    Proportion
59.5 - 69.5             2     2/56 = 0.036
69.5 - 79.5             7     7/56 = 0.125
79.5 - 84.5             9     9/56 = 0.161
84.5 - 89.5            10     10/56 = 0.179
89.5 - 94.5            11     11/56 = 0.196
94.5 - 99.5             7     7/56 = 0.125
99.5 - 109.5            8     8/56 = 0.143
109.5 - 119.5           2     2/56 = 0.036
Total                  56
Excel Bar Chart
Pressure (mm of Hg)    f_j    Proportion       %
59.5 - 69.5             2     2/56 = 0.036     3.6%
69.5 - 79.5             7     7/56 = 0.125     12.5%
79.5 - 84.5             9     9/56 = 0.161     16.1%
84.5 - 89.5            10     10/56 = 0.179    17.9%
89.5 - 94.5            11     11/56 = 0.196    19.6%
94.5 - 99.5             7     7/56 = 0.125     12.5%
99.5 - 109.5            8     8/56 = 0.143     14.3%
109.5 - 119.5           2     2/56 = 0.036     3.6%
Total                  56     56/56 = 1.0      100%
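The proportion and percentage columns can be reproduced in R. This is only a sketch: the frequencies are typed in by hand from the table, and the variable names (freq, prop, pct) are illustrative rather than part of the course files.

```r
# Interval frequencies copied from the hypertension table
freq <- c(2, 7, 9, 10, 11, 7, 8, 2)

n    <- sum(freq)      # sample size
prop <- freq / n       # proportion in each interval
pct  <- 100 * prop     # the same thing as a percentage

# Show the three columns side by side, rounded as in the table
data.frame(freq = freq,
           proportion = round(prop, 3),
           percent = round(pct, 1))
```

The proportions must add up to 1 (and the percentages to 100), which is a quick check that no interval has been missed.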
Issues
[Figure: Excel 3D bar chart of % frequency (1 dp) against pressure interval, vertical axis from 0.00% to 20.00%]
3D effect doesn't help us understand the data (in general, avoid 3D
effects wherever possible).
Labelling issues....
Bars 1,2,7,8 actually cover 10mm range while the others cover 5mm
range. This makes us overestimate the proportion with low or high
blood pressure.
Switching to R
Start up R.
Import the data file BPdata.txt (available on the resources page)
Read the data into R
Create a histogram using the command:
hist(Dataset$Pressure, scale="frequency", breaks="Sturges", col="darkgray")
About the Histogram
The command to generate the histogram was
hist(Dataset$Pressure, scale="frequency", breaks="Sturges", col="darkgray")
Notice that
Y-axis gives the raw frequencies. We can change options to give
proportions or percentages.
Equally spaced bins make it easier to compare areas.
Number of bins chosen automatically (to make it look good) but we
can alter that.
No distracting 3D graphics!
Actually, a much simpler version would have also worked:
hist(Dataset$Pressure)
Modifying the histogram
Suppose we want to specify the actual bins (values) to use in the
histogram.
The simplest histogram:
hist(Dataset$Pressure)
Colouring in the boxes:
hist(Dataset$Pressure, col="darkgray")
Specifying the number of bins:
hist(Dataset$Pressure, col="darkgray", breaks=5)
Setting the breaks between the bins:
hist(Dataset$Pressure, col="darkgray",
     breaks=c(59.5,69.5,79.5,84.5,89.5,94.5,99.5,109.5,119.5))
Adding options fine tunes the presentation of the graph.
Technical point: R uses c( , , , ) to combine values into a vector (array).
Shape and skewness
Reading a histogram
The heights of the first two and last two rectangles are halved but
their bases are doubled from 5 to 10mm. Area is proportional to
frequency.
A histogram with rectangle heights proportional to class frequencies
would give a misleading picture of the data.
You will find that most of the histograms produced by statistical
packages like R have class intervals of equal length and you can
decide the number of intervals you want in the graph.
Usually between 5 and 20 intervals of equal length are chosen for a
good summary of the data.
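The rule that area, not height, is proportional to frequency can be checked numerically. The sketch below is an illustration under one loud assumption: the raw blood pressures are not listed in these notes, so each observation is placed at the midpoint of its interval as a stand-in. The names breaks, freq, height and pressure are illustrative.

```r
breaks <- c(59.5, 69.5, 79.5, 84.5, 89.5, 94.5, 99.5, 109.5, 119.5)
freq   <- c(2, 7, 9, 10, 11, 7, 8, 2)
width  <- diff(breaks)                  # 10, 10, 5, 5, 5, 5, 10, 10

# Height that makes area equal to proportion: height = proportion / width
height <- (freq / sum(freq)) / width

# ASSUMPTION: stand-in data, every observation at its interval midpoint
mids     <- breaks[-length(breaks)] + width / 2
pressure <- rep(mids, times = freq)

# With unequal breaks, hist() reports densities (area = proportion)
h <- hist(pressure, breaks = breaks, plot = FALSE)
h$counts    # the interval frequencies
h$density   # matches 'height' above
```

Calling hist(pressure, breaks = breaks) without plot = FALSE draws the histogram itself; the wide end bars then come out half as tall as equal-count 5mm bars would.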
Many numerical data sets look roughly like a blob in the middle with two
tails extending out either side.
If the left tail is longer than the right we say the data is left-skewed.
If the right tail is longer than the left we say the data is right-skewed.
Data that is neither left nor right skewed is symmetric.
STAT 115 Final exam marks
The file Stat115Grades2011.txt contains the (anonymised) internal
assessment and final exam marks for STAT 115 students in 2011.
Import these data in R, produce histograms with varying numbers of bins,
and a histogram with bins
0-49, 50-59, 60-64, 65-69, 70-74, 75-79, 80-84, 85-89, 90-100.
Grades histogram
In this histogram, the boxes correspond to letter grades.
The R command to use is:
hist(Grades$Final, breaks = c(0,50,60,65,70,75,80,85,90,100))
These data are left-skewed since the left tail is longer.
STAT115
Megan Drysdale
University of Otago

Lecture 5
Measures of central tendency: the mean
In this lecture we will study some standard summary statistics, quantities
computed from the sample which in some way summarise properties of
that sample.
Notes on the mean
The sample mean is the average of the quantities in the sample. Compute
it by summing up all of the values and dividing by the sample size.
In general, the mean of a sample of size n is given by
$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = (x_1 + x_2 + \cdots + x_n)/n. $$
Example
Six patients lived the following years after diagnosis of HIV
Symbol   x1    x2    x3    x4    x5    x6
Datum    1.8   3.2   6.8   4.6   2.8   7.9
Sample mean = (1.8 + 3.2 + 6.8 + 4.6 + 2.8 + 7.9)/6 years
= 4.52 years.
Notes on the mean
Be careful with a calculator. To compute the mean of 1,6,5 enter
( 1 + 6 + 5 ) ÷ 3
not
1 + 6 + 5 ÷ 3
The mean may be different to all of the values, but it will be within
the same range of values.
We sometimes use x̄ to denote the mean of x1 , . . . , xn .
Computing the sample mean in R
In R the command
mean(<data set>)
computes the mean.
You can also choose
summary(<data set>)
which computes the mean and other summary statistics.
The sample median
The median is the middle value, once the sample has been sorted from
smallest to largest.
Example
To find the median of 1,6,4,8,3 we first sort the values:
1,3,4,6,8
and then take the middle value (4).
Example
To find the median of 5,3,7,5,3,6,3 we first sort the values:
3,3,3,5,5,6,7
and then take the middle value (5).
Sample median calculation
Suppose that the sample size is n.

If n is odd then there is a unique middle element in the sorted list
(element (n + 1)/2).
If n is even then there are two middle elements, and the median
equals their average.
Example
To find the median of 7,6,1,4 we sort:
1,4,6,7
and then average the two middle elements (4,6), to give a median of 5.
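The odd and even cases can be written as a short R function. This is a sketch for illustration only (the name sample_median is made up here); in practice the built-in median() does the same job.

```r
# Median by hand: sort, then take the middle value (n odd)
# or the average of the two middle values (n even).
sample_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    mean(s[c(n / 2, n / 2 + 1)])
  }
}

sample_median(c(1, 6, 4, 8, 3))          # odd n
sample_median(c(5, 3, 7, 5, 3, 6, 3))    # odd n, with repeated values
sample_median(c(7, 6, 1, 4))             # even n
```

The three calls use the three examples above, so the results can be compared with the hand calculations and with median().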
Median examples
Consider the data:
Data:    95  86  78  90  62  73  89
We first sort the data:
Sorted:  62  73  78  86  89  90  95
and then take the middle (4th) entry: 86.

Note that if the value 62 had been replaced by, say, 5, then the median
would be unchanged, whereas the mean would have been changed
substantially. This is important, since data sets often contain outlier
values which are due to experimental or measurement error. They can be
radically different from the other values, and should be ignored.
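A quick way to see this in R is to make the replacement and recompute both statistics. The vectors x and y below are simply the example data before and after swapping 62 for 5.

```r
x <- c(95, 86, 78, 90, 62, 73, 89)   # original sample
y <- replace(x, x == 62, 5)          # same sample with 62 replaced by 5

c(mean = mean(x), median = median(x))
c(mean = mean(y), median = median(y))
```

The median is the same in both lines of output, while the mean drops by several units.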
NZ weekly income
In June 2011 (latest reported data), in NZ,
the median weekly income (from all sources) was $550;
the mean weekly income (from all sources) was $703.
Why do you think there is such a difference?
Mode
If the data is discrete or if it has been binned, then we can talk about the
mode of the data. The mode is the most frequently occurring value.
Note for later in the course: it is much more common to talk about the
mode of a density, rather than a data set.
Example: grades
Here is the (default) histogram for STAT115 exam results in 2011.
In R we compute that the mean of these marks is 63.3 and the median of
these marks is 70.
Example: blood pressure
Here again is the (default) histogram for the blood pressure data we looked
at earlier.
In R we compute that the mean of these pressures is 89.32 and the median of
these pressures is 89.50.
A nice mathematical property
Consider the data set 95, 86, 78, 90, 62, 73, 89.
The mean is the value x which minimises the sum of squares
(x - 95)^2 + (x - 86)^2 + (x - 78)^2 + (x - 90)^2 + (x - 62)^2 + (x - 73)^2 + (x - 89)^2.
The median is a value x which minimises the sum of absolute differences
|x - 95| + |x - 86| + |x - 78| + |x - 90| + |x - 62| + |x - 73| + |x - 89|.
Try and convince yourself that this holds in general.
Sources of variability
Variability reflects differences in the values collected for different units
being measured, for example people, or animals or plants or
companies or readings on different days etc. Two sets of values can
have the same mean and median yet show quite different patterns.
If data are highly variable there are problems analysing the data and it
will be necessary to select larger samples.
We look at several ways to quantify variability in a sample.
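The minimisation property of the mean and the median for the data set 95, 86, 78, 90, 62, 73, 89 can be explored numerically. The sketch below uses R's general-purpose one-dimensional minimiser optimize(); the function names sum_sq and sum_abs are illustrative, and the answers are only accurate to the minimiser's tolerance.

```r
x <- c(95, 86, 78, 90, 62, 73, 89)

sum_sq  <- function(m) sum((x - m)^2)    # sum of squared differences
sum_abs <- function(m) sum(abs(x - m))   # sum of absolute differences

# Search for the minimising value between the smallest and largest data value
best_sq  <- optimize(sum_sq,  interval = range(x))$minimum
best_abs <- optimize(sum_abs, interval = range(x))$minimum

c(best_sq = best_sq, mean = mean(x))
c(best_abs = best_abs, median = median(x))
```

Changing x to any other sample and rerunning is one way to convince yourself that the property holds in general.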
Range
The range is the difference between the largest and smallest values in the
sample.
Example
The range of the sample 95, 86, 78, 90, 62, 73, 89 is 95 - 62 = 33.
While the range does tell us something about the sample, it is affected a
lot by outliers and random noise. For this reason, we don't often use the
range to tell us something about the underlying population.
Sample variance and sample standard deviation
The sample variance S^2 is defined by the formula
$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . $$
Although the divisor is (n - 1) rather than n in this equation, we can see
that S^2 is effectively the average of the squared deviations of the
individual data values (x_i) from their mean (x̄).
The sample variance is an overall measure of the extent to which the x_i
values differ from their mean (x̄).
If you didn't take squares, the values above the mean would cancel those
below the mean, and you would end up with 0.
Standard deviation
A convenient alternative to the sample variance is the sample standard
deviation.
$$ S = \sqrt{S^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
The standard deviation, S, is measured in the same units as the original
data (taking the square root cancels the squaring).
Example In the hypertension example, the values in the data are measured
in mm (of Hg). Hence the variance is measured in mm^2 while the standard
deviation is measured in mm.
Variance example
Find the sample variance and standard deviation of 11, 18, 14, 15, 12.
The mean is x̄ = (11 + 18 + 14 + 15 + 12)/5 = 14.
x_i      x_i - x̄            (x_i - x̄)^2
11       (11 - 14) = -3      (-3)^2 = 9
18       (18 - 14) = 4       4^2 = 16
14       (14 - 14) = 0       0^2 = 0
15       (15 - 14) = 1       1^2 = 1
12       (12 - 14) = -2      (-2)^2 = 4
Total 70   0                 30
Hence
S^2 = 30/(5 - 1) = 7.5
and
S = sqrt(7.5) = 2.74.
Computing sample mean and variance in R
You can compute the mean in R using
mean(<variable name>)
You can compute the variance in R using
var(<variable name>)
You can compute the standard deviation in R using
sd(<variable name>)
And on a technical note...
The formula for the sample variance is
$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 . $$
Why (n - 1)?
The variance of the whole population (of size N, with population mean \mu) is defined as
$$ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 . $$
However if you take a random sample of size n and use this variance
formula you will, on average, get an amount that is (n - 1)/n times what you
want.
The 1/(n - 1) term in the S^2 formula corrects for this. (see STAT 261)
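The (n - 1)/n claim can be checked exactly on a tiny example by listing every possible sample. The sketch below makes two assumptions that are not in the notes: the population is the made-up set 1, 2, 3, 4, and samples of size n = 2 are drawn with replacement so that every ordered pair is equally likely.

```r
population <- c(1, 2, 3, 4)   # hypothetical population
n <- 2                        # sample size

# Population variance: divisor N, deviations from the population mean
sigma2 <- mean((population - mean(population))^2)

# All ordered samples of size n, drawn with replacement
samples <- expand.grid(rep(list(population), n))

# Variance of each sample using divisor n, and using divisor n - 1 (var)
var_n  <- apply(samples, 1, function(s) mean((s - mean(s))^2))
var_n1 <- apply(samples, 1, var)

c(sigma2 = sigma2,
  average_divisor_n  = mean(var_n),    # (n - 1)/n times sigma2
  average_divisor_n1 = mean(var_n1))   # equal to sigma2
```

Averaged over all samples, the divisor-n version comes out too small by the factor (n - 1)/n, while var(), which divides by n - 1, hits the population variance.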
STAT115

Megan Drysdale
University of Otago

Lecture 6
In the last lecture we studied some standard summary statistics, quantities
computed from the sample which in some way summarise properties of
that sample. Previously, we had looked at histograms as a way to
represent sample data.
Today we look at the interquartile range and its graphical equivalent, box
and whisker plots.
Range
Recall that the range of a sample is the difference between the maximum
and minimum values.
The median is the value which is (informally) in the middle of the sample
values.
The lower quartile is the value which is 1/4 of the way up the sample
values.
The upper quartile is the value which is 3/4 of the way up the sample
values.
[Figure: the sorted data split into four blocks, each holding 25% of the data, by the lower quartile (LQ), the median and the upper quartile (UQ); the range spans all four blocks]
Interquartile range
The interquartile range (IQR) is the difference between the upper quartile
and the lower quartile.
[Figure: 25% of the values lie below the lower quartile, 50% of the values lie between the lower and upper quartiles (the interquartile range), and 25% of the values lie above the upper quartile]
The upper and lower quartiles are produced by the summary command in
R.
Fiddly details
We usually can't get exactly 25% of the data points below some value,
so we do something in between.
Example Here are the maximum temperatures in Dunedin for the last 14
days.
10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12
We sort the values:
10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 13, 13, 14
Lower quartile: (14+1)/4 = 3.75. We want (0.25) times 3rd value +
(0.75) times fourth, or
0.25 × 11 + 0.75 × 11 = 11
Upper quartile: (14+1) × (3/4) = 11.25. We want (0.75) times 11th value +
(0.25) times 12th value, or
0.75 × 12 + 0.25 × 13 = 12.25
General instructions:
To find the lower quartile from a sample of size n
1. Sort the numbers in increasing order:
   x(1), x(2), x(3), . . . , x(n)
2. Let m be the whole number part of (n+1)/4 and let r be the fractional
   part. (e.g. if n = 6 then (n+1)/4 = 1.75, m = 1 and r = 0.75)
3. If r = 0 then return x(m).
   If r = 0.25 then return 0.75 x(m) + 0.25 x(m+1).
   If r = 0.5 then return 0.5 x(m) + 0.5 x(m+1).
   If r = 0.75 then return 0.25 x(m) + 0.75 x(m+1).
Note: different software and textbooks use different rules here.
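The general instructions translate into a few lines of R. The function below is a sketch (quartile_rule is a made-up name) that handles a general fraction p, so p = 0.25 gives the lower quartile and p = 0.75 the upper quartile; it assumes the sample is large enough that both x(m) and x(m+1) exist. R's own quantile() uses a different rule by default, but its type = 6 option follows the same (n + 1)p convention.

```r
# Quartile by the (n + 1) * p rule described above
quartile_rule <- function(x, p) {
  s   <- sort(x)
  n   <- length(s)
  pos <- (n + 1) * p
  m   <- floor(pos)      # whole number part
  r   <- pos - m         # fractional part
  if (r == 0) return(s[m])
  (1 - r) * s[m] + r * s[m + 1]
}

temps <- c(10, 11, 12, 11, 11, 11, 11, 11, 11, 11, 13, 14, 13, 12)

quartile_rule(temps, 0.25)                  # lower quartile
quartile_rule(temps, 0.75)                  # upper quartile
quantile(temps, c(0.25, 0.75), type = 6)    # built-in, same convention
```

Running summary(temps) instead gives slightly different quartiles, which illustrates the note about different rules.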
Skinks
Box plots
Thirty-two traps were placed in each of three habitats: pasture, replanted
forest and tussock on Stephens Island. The data are the counts of skinks
per trap totalled over a ten-day period in each habitat.
A box plot gives a graphical summary of some of these numbers.
1
8
18
8
16
11
4
5
14
13
10
10
1
14
33
12
16
10
2
6
11
11
14
6
5
8
16
8
10
13
0
10
20
10
7
7
1
7
1
17
10
10
The summary command in R gives
    Pasture           Replanted
 Min.   : 0.000   Min.   : 1.00
 1st Qu.: 2.000   1st Qu.: 9.50
 Median : 4.500   Median :12.50
 Mean   : 4.875   Mean   :14.62
 3rd Qu.: 6.250   3rd Qu.:18.00
 Max.   :14.000   Max.   :33.00
5
4
17
29
8
8
6
8
12
3
12
12
5
13
27
12
19
6
6
6
26
5
17
12
2
4
4
11
5
11
2
1
8
16
14
10
    Tussock
 Min.   : 5.0
 1st Qu.: 9.5
 Median :11.0
 Mean   :12.0
 3rd Qu.:14.0
 Max.   :29.0
0
1
31
12
15
29
3
3
24
6
23
12
Tussock
4
11
15
18
14
7
Pasture
Replanted
Box plots step one
Draw a box and a middle line at the upper quartile, median and lower
quartile.
[Figure: box for the Tussock counts with the lower quartile, median and upper quartile marked; axis from 10 to 25]
Box plots step two
Upper (3rd) quartile is 14.0. IQR is 4.5.
Starting at the top of the box, measure up a length of 1.5 × IQR. Here,
that gives 14.0 + 1.5 × 4.5 = 20.75.
Don't draw the line there. Instead find the largest data point that is at
most 20.75. The sorted values in this example are
 5  6  6  7  7  7  8  8 10 10 10 10 10 10 10 11
11 12 12 12 12 13 14 14 14 15 16 16 17 19 23 29
The largest value that is at most 20.75 equals 19.
We draw the top line at 19.
Box plots step two
In the same way we subtract 1.5 × IQR from the lower quartile. In this
example, we get
9.5 - 1.5 × 4.5 = 2.75
The smallest data value that is at least as large as 2.75 equals 5. We draw
the bottom line here.
[Figure: Tussock box plot with the upper quartile, median and lower quartile marked and the top and bottom lines added; axis from 10 to 25]
Box plots step three
If there are any data values outside the range of the bottom and top line,
then we plot crosses or dots for each one of them.
In this case, 5 is the minimum value (none below that), but 23 and 29 are
both above the top line.
Summary of box plots
[Figure: annotated Tussock box plot showing, from top to bottom: outliers (extreme values); largest value at most 1.5 × IQR above UQ; upper quartile; median; lower quartile; smallest value at least 1.5 × IQR below LQ]
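The three steps for the Tussock counts can be retraced in R. In this sketch the 32 sorted values are typed in from the list in step two, and the variable names are illustrative; the quartiles come from quantile() with its default rule, which is what summary() reports.

```r
tussock <- c(5, 6, 6, 7, 7, 7, 8, 8, 10, 10, 10, 10, 10, 10, 10, 11,
             11, 12, 12, 12, 12, 13, 14, 14, 14, 15, 16, 16, 17, 19, 23, 29)

# Step one: the box
q   <- unname(quantile(tussock, c(0.25, 0.75)))   # lower and upper quartile
iqr <- q[2] - q[1]

# Step two: measure 1.5 * IQR out from the box, then pull back to real data
upper_limit    <- q[2] + 1.5 * iqr
lower_limit    <- q[1] - 1.5 * iqr
top_whisker    <- max(tussock[tussock <= upper_limit])
bottom_whisker <- min(tussock[tussock >= lower_limit])

# Step three: anything beyond the lines is plotted individually
outliers <- tussock[tussock > upper_limit | tussock < lower_limit]

c(LQ = q[1], UQ = q[2], IQR = iqr, top = top_whisker, bottom = bottom_whisker)
outliers
```

boxplot(tussock) draws the plot directly, so the numbers above can be compared with the picture.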
Comparing variables
Box plots provide some information about the center, range and symmetry
of the data. We can easily put multiple box plots on one graph. In R, use
boxplot(<dataset>) to see box plots for all variables in a data set.
Combining histograms and box plots
Box plots will soon be superseded by violin plots.
To make these, use the commands
library(UsingR)
violinplot(Skinks)
STAT115
Megan Drysdale
University of Otago

Lecture 7
Overview
In the last few lectures we looked at statistics and graphs used to
summarise the data in a sample.
Today we will look at two statistics used in epidemiology, incidence
and prevalence. They get special treatment because
They are very widely used, and you'll need to know exactly what they
mean,
They're often confused, or at least confusing.
Proportions and rates
Measures of disease frequency are typically presented in the form
numerator/denominator.
Recall:
Proportion. Expresses the value as a fraction of the whole. Example
In STAT 115 this year there are 310 female students and 138 male
students. The proportion of female students is
310/(310 + 138) = 0.69
The denominator and numerator have the same units.
Rate. A fraction where the numerator and denominator have different
units (e.g. children per family, new cases per year). Usually, the
denominator is a measure of time.
Prevalence
Prevalence gives frequency of existing cases of disease. It is useful for
measuring the disease burden in a community and is often measured in a
cross-sectional survey.
Example The proportion of students in this class who currently have a cold.
Example The proportion of Otago students who had swine flu at 3pm last
Tuesday.
Prevalence of eye disease
In a survey of eye disease among 2477 people aged 52-85 in Framingham,
Massachusetts, there were 310 with cataracts and 22 blind.
Prevalence of cataracts
310/2477 = 0.125 = 125 cases per 1000 people = 12.5%
Prevalence of blindness
22/2477 = 0.009 = 9 cases per 1000 people = 0.9%
Prevalences and timelines
In the following diagram shaded lines indicate the time each person has
the disease.
[Figure: disease timelines for subjects 1 to 5 plotted against time; the prevalence at four marked time points is 1/5, 2/5, 3/5 and 2/5]
Note on prevalence
Prevalence is the proportion of people in a population who have the
disease at a given point in time.
The time point may refer to calendar time, or to a fixed point in the
course of events.
Example: the proportion of people free from back pain 2 months after
back injury. (The time point here is relative to an event, rather than an
absolute time.)
Incidence
Incidence measures the frequency of new cases of a disease. As such, it is
useful for looking at the causes of disease.
Example
How many people in this lecture theatre currently have a cold?
(Prevalence)
How many people in this lecture theatre got a cold this week?
(Incidence)
Cumulative incidence
Cumulative incidence is the proportion of people who become diseased
during a specified period of time.
Cumulative incidence
= (number of new cases of disease during time period) / (size of total population at risk)
This provides an estimate of the probability, or risk, that an individual will
develop the disease during the specified period of time.
The period of time could be one day, one week, one year, five years etc.
Cumulative incidence example
A study of heart disease was made in Evans County, Georgia.
There were 609 men aged 40-76 who had no detected heart disease
in 1960.
These men were followed for 7 years and 71 cases of heart disease
were detected during this period.
Number of cases = 71. Population size = 609.
Cumulative incidence = 71/609 = 0.117 cases per person = 11.7%
over the 7 year period.
NOTE: For the cumulative incidence to be interpretable, the time period
must be specified.
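Prevalence and cumulative incidence are both simple proportions, so the Framingham and Evans County figures can be recomputed in a couple of lines of R. The helper name proportion_of is illustrative only.

```r
# A proportion: cases divided by the number of people they came from
proportion_of <- function(cases, population) cases / population

prev_cataracts <- proportion_of(310, 2477)   # prevalence, Framingham survey
prev_blindness <- proportion_of(22, 2477)    # prevalence, Framingham survey
cum_incidence  <- proportion_of(71, 609)     # 7-year cumulative incidence, Evans County

round(c(cataracts = prev_cataracts,
        blindness = prev_blindness,
        heart_disease_7yr = cum_incidence), 3)

# The same figures expressed as cases per 1000 people
round(1000 * c(prev_cataracts, prev_blindness, cum_incidence))
```

The arithmetic is identical in each case; what differs is the meaning of the numerator (existing cases at one time point versus new cases over a stated period).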
Incidence rate motivation
Cumulative incidence calculations assume that the population at the

start of the study is exactly the same as at the end of the study.
In practice, people are lost to follow up, and people enter the study at
different times.
We therefore compute a rate of incidence which is not dependent on
the exact study time, but instead summarises the incidence per year.
People-time
People-time is the total time each person in the population is

part of the study, and
at risk (for the disease being studied).
It is the same if we follow 16 people for one year or 4 people for four years,
or 1 person for 16 years.
People-time at risk
[Figure: follow-up timelines for subjects A to E between Jan 1997 and Jan 2002; one subject is lost to follow up]
Subject   Total time at risk (years)
A         2.0
B         3.0
C         5.0
D         4.0
E         2.5
Here
start of line = initiation of follow-up
x = development of disease.
Number of person-years at risk = 16.5.
Incidence rate:
The incidence rate
= (number of new cases of disease) / (total person-time at risk)
In the previous example
The number of new cases was two.
The number of person-years at risk = 16.5
Incidence rate = 2/16.5 = 0.121,
which is 0.121 cases per person year of observation, or 12.1 cases per 100
person years of observation.
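The person-time calculation for subjects A to E is easy to reproduce in R. This is a sketch with the times typed in from the table; the variable names are illustrative.

```r
# Years at risk for each subject, from the table
time_at_risk <- c(A = 2.0, B = 3.0, C = 5.0, D = 4.0, E = 2.5)
new_cases    <- 2

person_years   <- sum(time_at_risk)          # total person-time at risk
incidence_rate <- new_cases / person_years   # cases per person-year

person_years
round(incidence_rate, 3)          # cases per person-year
round(100 * incidence_rate, 1)    # cases per 100 person-years
```

Only the total matters for the rate: any other set of follow-up times adding to the same person-years, with the same number of new cases, gives the same answer.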
Hepatitis
Example. Frequency of hepatitis in two regions.
Location    New cases of hepatitis    Reporting period    Population
Region A    58                        1985                25,000
Region B    35                        1984-1985           7000
In Region A,
Number cases = 58
Person-years = 1 × 25000
Incidence rate = 58/25000 cases per person-year
= 2.32 cases per 1000 people per year.
In Region B,
Number cases = 35
Person-years = 2 × 7000 = 14000
Incidence rate = 35/14000 cases per person-year
= 2.5 cases per 1000 people per year.
Strokes in women aged 30-55
A study in the United States measured the incidence rate of stroke in a
group of 118,539 women aged 30-55 years of age. The women were free
from stroke in 1986, and were followed for 8 years.
Smoking         Num. cases   Person-years of      Stroke incidence rate
category        of stroke    observation          (per 100,000
                             (over 8 years)       person years)
Never smoked     70          395,594              17.7
Ex-smoker        65          232,712              27.9
Smoker          139          280,141              49.6
Total           274          908,447              30.2
(Total) incidence rate = 274/908,447 × 100,000 = 30.2 cases of stroke per
100,000 person-years of observation.
Average follow-up per woman (person-time) = 908,447/118,539
= 7.7 years.
Notes on incidence rate
The denominator for measures of incidence should include only those who
are at risk of developing the disease. It should exclude
those who already have the disease
those who cannot develop the disease
Failure to do this will lead to an underestimate of the true incidence since
fewer will develop the condition.
For example when studying the incidence of endometrial cancer we should
exclude women who have had a hysterectomy.
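The incidence rates in the stroke table for women aged 30-55 can be recomputed from the case counts and person-years. The sketch below types the table in by hand; the vector names are illustrative.

```r
cases        <- c(never = 70, ex = 65, smoker = 139)
person_years <- c(never = 395594, ex = 232712, smoker = 280141)

# Incidence rate per 100,000 person-years in each smoking category
rate <- cases / person_years * 1e5

# Overall rate uses total cases over total person-years
total_rate <- sum(cases) / sum(person_years) * 1e5

# Average follow-up per woman, in years
follow_up <- sum(person_years) / 118539

round(rate, 1)
round(total_rate, 1)
round(follow_up, 1)
```

Note that the overall rate is not the plain average of the three category rates, because the categories contribute different amounts of person-time.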
Incidence versus prevalence
Example: Disease A
[Figure: disease timelines for subjects 1 to 5 against time, with time point L marked]
Cumulative incidence = 5/5 in t years.
Incidence rate = 5/(5t) cases per person year.
Prevalence at time L = 2/5.
Example: Disease B
[Figure: disease timelines for subjects 1 to 5 against time, with time point L marked]
Cumulative incidence = 5/5 in t years.
Incidence rate = 5/(5t) cases per person year.
Prevalence at time L = 5/5.
Prevalence versus incidence
Prevalence depends on the incidence rate, as well as the
duration of disease.
Adult onset diabetes has a low incidence rate but a long duration, as
the disease is neither curable nor fatal. Hence prevalence is high relative to
incidence.
A cold has a (very) high incidence, but the duration is short, so
prevalence is low relative to incidence.
Afterword: units
There are different expressions for the units used with incidence rate.
The simplest (I think) is new cases per person-year. This is the same as
x cases per person per year,
where x is some fraction.
Sometimes you see
cases per 100 people per year (multiply x by 100 for this)
cases per 100,000 people per year (multiply x by 100,000 for this)
The following mean the same
0.002 cases per person per year
0.2 cases per 100 people per year
200 cases per 100,000 people per year
Afterword: units again
In the same way we can change the period of time. The following are the
same
0.002 cases per person-year
0.002 cases per person per year
0.2 cases per 100 people per year
0.02 cases per person per 10 years
2 cases per 100 people per 10 years
Sometimes we get sloppy....
Location    New cases of hepatitis    Reporting period    Population
Region A    58                        1985                25,000
Region B    35                        1984-1985           7000
In Region A,
Number cases = 58
Incidence rate = 58/25000 per year
= 2.32 per 1000 per year.
STAT115
Megan Drysdale
University of Otago

Lecture 8
Overview
In the last few lectures we looked at statistics and graphs used to
summarise the data in a sample.
Today we will look at statistics which are used to assess disease
association. This is an extremely important field of statistical inference.
The study of associations between diseases and different factors or groups
is an important step in identifying causes and/or potential treatments.
Example association
Data from cohort study of oral contraceptive (OC) use and bacteria in the
urine among women aged 16-49 years over 3 years.
Recall from lecture two: a cohort study is one where we start with a complete
population.
In this case, the researchers visited every household in a specific
Italian-American neighbourhood.
Data
              Bacteria Present
OC Use        Yes      No       Total
Yes            27      455       482
No             77     1831      1908
Total         104     2286      2390
Bacteria Present is called the Disease Category (the outcome variable)
OC use is called the Exposure Category (explanatory variable).
Cumulative incidence
OC users: 27/482 = 0.056, or 56 cases per 1000 in the 3 years.
Non users: 77/1908 = 0.040 or 40 cases per 1000 in the 3 years.
Measuring association
We will look at two different ways to measure association in the data:
1. The Relative effect (also known as the relative risk). This expresses
   (or more correctly, estimates) the incidence rate of the disease within
   one population (group) relative to the rate in the other population
   (group).
2. The Absolute effect expresses the absolute difference in incidence
   between the two groups.
Both measures are still statistics and might still reflect random error. Later
in the course you'll learn how to model and test whether a given
association is statistically significant or not.
Relative risk
The relative effect (RR) (also called relative risk) is the ratio of the
(cumulative) incidence in the exposed group (I_e) to the incidence in the
unexposed group (I_0).
$$ RR = \frac{I_e}{I_0} $$
If RR is...
> 1 (exposure → disease)
= 1 (no association between exposure and disease)
< 1 (exposure is protective).
Indicates how much more likely disease is to develop in the exposed
group than in the unexposed group.
Good measure of strength of an association, and the usual measure in
studies of causation of disease.
We can also calculate ratios of prevalences, but the interpretation is
different.
Attributable risk
The absolute effect, or attributable risk (AR) is the difference in incidence
between exposed and unexposed groups:
$$ AR = I_e - I_0 . $$
Assuming a cause-effect relationship between exposure and disease, we say:
If AR is...
> 0: AR equals the number of cases attributed to exposure
= 0 (no association between exposure and disease)
< 0: -AR is the number of cases prevented by exposure.
AR has the same units as the incidence rate (cases per person-time).
Association with OC
              Bacteria Present
OC Use        Yes      No       Total
Yes            27      455       482
No             77     1831      1908
Total         104     2286      2390
For the three year period:
I_e = 27/482 = 56 per 1000
I_0 = 77/1908 = 40 per 1000
RR = 56/40 = 1.4,
AR = 56 - 40 = 16 per 1000.
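Relative and attributable risk follow directly from the four counts in a 2 by 2 table, so they are easy to wrap in a small R function. The function name risk_measures and its argument names are made up for this sketch. Because it works with unrounded incidences, the attributable risk comes out slightly different from the value obtained from the rounded figures 56 and 40.

```r
# Cumulative incidences, relative risk and attributable risk from counts
risk_measures <- function(cases_exposed, n_exposed, cases_unexposed, n_unexposed) {
  ie <- cases_exposed / n_exposed        # incidence in the exposed group
  i0 <- cases_unexposed / n_unexposed    # incidence in the unexposed group
  c(Ie = ie, I0 = i0, RR = ie / i0, AR = ie - i0)
}

# Oral contraceptive use and bacteria in the urine (3 year period)
oc <- risk_measures(27, 482, 77, 1908)
round(oc, 3)
round(1000 * oc[c("Ie", "I0", "AR")], 1)   # per 1000 women
```

The same function can be reused for any other 2 by 2 table of exposure against outcome by changing the four counts.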
Does infra-red treatment help with arthritis?
A randomised trial of the effectiveness of infra-red stimulation compared
with placebo on pain caused by cervical osteoarthritis (degenerative joint
disease in the neck) carried out over two months. The control group were
given a placebo treatment.
                          Treatment    Control
Improvement in pain          18           8
No improvement in pain        7          17
Total                        25          25
Improvement/No improvement is the Disease Category (the outcome
variable)
Treatment/control is the Exposure Category (explanatory variable).
Cumulative incidence over the two months:
I_e = 18/25
I_0 = 8/25.
Relative risk is
RR = I_e / I_0 = (18/25) / (8/25) = 2.25, or about 2.3.
Absolute risk is
AR = I_e - I_0 = 18/25 - 8/25 = 10/25 = 0.4,
that is, 10 more patients improved per 25 treated.
Variations on a theme
Sometimes the setup or presentation of data and relative risk varies.
Relative and absolute risk can be computed in terms of prevalence
rather than just cumulative incidence. The resulting quantities have a
slightly different interpretation, but the computation is the same. See
the Framingham study
Relative and absolute risk can also be computed from
incidence rates, for the same reasons that we often look at incidence
rates instead of cumulative incidence (e.g. variable risk times for
people in the sample). See the hormone and heart study
It can be informative to compare relative risk of different diseases or
conditions. We will see that in common diseases, a smaller relative
risk can lead to a larger absolute risk. See the British doctors study
Framingham study: relative risk from prevalence
Prevalence of coronary heart disease (CHD) at initial examination among
4469 persons age 30-62 years of age in the Framingham Study.
            Number      Number      Prevalence
            examined    with CHD    per 1,000
Males       2024        48          23.7
Females     2445        28          11.5
Compute relative and absolute risk using prevalence per 1000 (note
23.7 = 48/2024 × 1000).
Consider males as the exposed group.
RR = (23.7/11.5) = 2.1
AR = 23.7 - 11.5 = 12.2.
Heart disease is about twice as common among the males, and there are
12.2 more cases of heart disease in 1000 men than in 1000 women.
Hormone and heart study: relative risk from incidence rates
Data from a cohort study of postmenopausal hormone use and coronary
heart disease among female nurses.
Postmenopausal    Coronary heart disease    Person-years
hormone use       Yes        No
Yes               30         -              54,308.7
No                60         -              51,477.5
Incidence rate:
Users: 30/54308.7 = 55 per 100,000 person-years
Non-users: 60/51477.5 = 117 per 100,000 person years
Attributable Risk:
55 - 117 = -62 cases of CHD per 100,000 person years
Hormone use prevents 62 cases per 100,000 person years
Relative Risk: 55/117 = 0.47
The risk of CHD among users is 0.47 times the risk in non-users (a 53%
reduction in risk).
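The same measures can be computed from incidence rates rather than cumulative incidences. This sketch recomputes the hormone and heart study figures in R with illustrative variable names; because it keeps the unrounded rates, the attributable risk differs a little from the -62 obtained from the rounded rates 55 and 117.

```r
# Incidence rates per 100,000 person-years
rate_users     <- 30 / 54308.7 * 1e5
rate_non_users <- 60 / 51477.5 * 1e5

relative_risk     <- rate_users / rate_non_users
attributable_risk <- rate_users - rate_non_users   # negative: fewer cases among users

round(c(users = rate_users, non_users = rate_non_users))
round(relative_risk, 2)
round(attributable_risk, 1)
round(100 * (1 - relative_risk))   # percentage reduction in risk
```

A relative risk below 1 and a negative attributable risk are two views of the same protective association.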
Notes
Relative risks
provide information on the strength of an association;
can be used to assist in assessment of the likelihood of a causal
association.
Attributable risks
measure the impact of an exposure, (assuming that it is causal).
If a disease is common a small relative risk will translate to a large
attributable risk. [see previous example]
Doctors' study: comparing relative risk
Relative and attributable risks of mortality from lung cancer and coronary
heart disease among cigarette smokers in a cohort study in British male
physicians.
Annual mortality rate per 100,000
               Lung cancer    Heart disease
Smokers        140            669
Non-smokers     10            413
RR              14.0           1.6
AR             130            256
(Risk is per 100,000 per year)
Heart disease is more common therefore a smaller relative increase in risk
produces more people with disease.
STAT115
Dr Tilman Davies
University of Otago
Section 3. Probability
Lecture 9
First of all: What is Probability?
There are varying definitions of probability
Statisticians can be split into two main groups who have differing
views on probability.
Frequentists consider probability to be the relative frequency in the
long run of outcomes.
Bayesians consider probability to be a way to represent an individual's
degree of belief in a statement given the evidence.
Consider these statements.
We can quantify these probabilities (Frequentist)
What is the probability I win lotto tonight?
What is the probability I roll a 6?
Based on personal and subjective belief (Bayesian)
What is the chance I pass Stat110?
What is the chance I will do an OE after graduating?
Some definitions
Experiment = process by which observations/measurements are
obtained
e.g. tossing a fair die
Event = outcome of experiment
rolling a 2, rolling a 5, etc.
Sample space = the set of all possible outcomes
In this case, {1, 2, 3, 4, 5, 6}
Conditions for a valid probability
Each probability is between 0 and 1.
The sum of the probabilities over all possible events is 1.
In other words, the total probability across all possible outcomes is
equal to 1.

Calculating Probabilities
Probability of an event A is
$\Pr(A) = \frac{n_A}{N} = \frac{\text{number of trials for which A is true}}{\text{total number of trials}}$
We don't always need to conduct the experiments to measure these,
as we can make logical arguments.
For example, flipping a coin: Pr(heads) = Pr(tails) = 1/2

Some more things to know
If the event A cannot happen then Pr(A) = 0
If the event A is certain to happen then Pr(A) = 1
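The relative-frequency definition of a probability (trials where A is true, divided by total trials) can be illustrated by simulation. This is a sketch only: the number of trials and the seed are arbitrary choices, and the event A is taken to be "rolling a 6".

```r
set.seed(115)                                   # fixed seed so the run is repeatable
N <- 100000                                     # total number of trials (arbitrary)
rolls <- sample(1:6, size = N, replace = TRUE)  # simulate N rolls of a fair die

n_A  <- sum(rolls == 6)   # number of trials for which A (rolling a 6) is true
pr_A <- n_A / N           # relative frequency estimate of Pr(A)

pr_A                      # should be close to the logical-argument value 1/6
abs(pr_A - 1/6)           # size of the gap for this particular run
```

With more trials the gap tends to shrink, which is the frequentist "long run" idea.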
The Rules
Addition Rule
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
Example
A = Rolling an even number {2,4,6}
B = Rolling 3 or less {1,2,3}
$\Pr(A \text{ or } B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
$= 3/6 + 3/6 - 1/6$
$= 5/6$
Intuitively, this makes sense!

Complementary Events
Two events are complementary if exactly one of them is always true.
For example, the coin flip: the result will always be either heads or tails.
$\bar{A}$ is called the complement of A.
$\Pr(A) + \Pr(\bar{A}) = 1$
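The die example can be checked by listing the sample space and counting outcomes. The sketch below assumes equally likely outcomes; the helper `pr()` is an illustrative name, not a built-in R function.

```r
S <- 1:6                  # sample space for one roll of a fair die
A <- c(2, 4, 6)           # rolling an even number
B <- c(1, 2, 3)           # rolling 3 or less
pr <- function(E) length(E) / length(S)   # probability by counting outcomes

direct   <- pr(union(A, B))                      # Pr(A or B) counted directly
by_rule  <- pr(A) + pr(B) - pr(intersect(A, B))  # addition rule
comp_sum <- pr(A) + pr(setdiff(S, A))            # Pr(A) + Pr(complement of A)

c(direct = direct, by_rule = by_rule, comp_sum = comp_sum)
```

Subtracting the intersection removes the outcome 2, which would otherwise be counted once in A and once in B.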
The Rules
Multiplication Rule
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
Please read $\Pr(B \mid A)$ as "probability of B given A".
Example
A = Rolling an even number {2,4,6}
B = Rolling 3 or less {1,2,3}
$\Pr(A \text{ and } B) = \Pr(A) \times \Pr(B \mid A)$
$= 3/6 \times 1/3$
$= 1/6$

Addition rule - special case
Mutually Exclusive Events
There is no intersection between the two events.
In other words, events are mutually exclusive if they cannot both
occur.
Back to coin flipping: getting heads and getting tails.
In this case the addition rule simplifies to
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B)$
because A and B cannot both occur, so $\Pr(A \cap B) = 0$.
If you think carefully, this makes sense too!
Multiplication rule - special case
Independent Events
When the occurrence of one event does not affect the outcome of
another event.
For example, flipping 3 heads in a row.
In this case, $\Pr(B \mid A) = \Pr(B)$.
The multiplication rule simplifies to
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \Pr(B)$
Think about it:
A = First 3 flips are heads.
B = The 4th flip is heads.
$\Pr(B \mid A) = \Pr(B)$..... A and B are independent, for sure!

Blood donor example
The probability of being in each of the 4 blood groups (Dunedin
donor centre)

Blood Type     A      B      AB     O
Probability    0.38   0.11   0.04   0.47
Blood donor example - Addition Rule
What is the probability that a person is either A or B?
Pr(A or B) = Pr(A) + Pr(B)
= 0.38 + 0.11
= 0.49
Note that A and B are mutually exclusive,
that is they cannot both occur, and Pr(A and B) = 0.

Blood donor example - Multiplication Rule
What is the probability that 3 randomly selected people all have blood
group O (assuming their blood groups are independent)?
$\Pr(O) \times \Pr(O) \times \Pr(O) = 0.47^3 = 0.104$
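Both blood donor calculations can be written in a few lines of R. This is a sketch using the probabilities from the donor table; the vector name `blood` is an illustrative choice, and the three donors are assumed to be independent.

```r
# Blood group probabilities from the Dunedin donor centre table
blood <- c(A = 0.38, B = 0.11, AB = 0.04, O = 0.47)
total <- sum(blood)                        # a valid distribution sums to 1

pr_A_or_B  <- blood[["A"]] + blood[["B"]]  # mutually exclusive events: add
pr_three_O <- blood[["O"]]^3               # independent events: multiply

round(c(total = total, A_or_B = pr_A_or_B, three_O = pr_three_O), 3)
```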
Hospital Patients
A survey of hospital patients shows that the probability a patient has
high blood pressure given he/she is diabetic is 0.85. If 10% of the
patients are diabetic and 25% have high blood pressure:
Find the probability a patient has both diabetes and high blood
pressure.
Are the conditions of diabetes and high blood pressure independent?

Hospital Patients - Relevant information
Let A be the event "A patient has high blood pressure"
Let B be the event "A patient is diabetic"
Pr(B) = 0.10
Pr(A|B) = 0.85
Pr(A) = 0.25

Hospital Patients - Question 1
Find the probability a patient has both diabetes and high blood
pressure.
$\Pr(A \cap B) = \Pr(A \mid B) \Pr(B)$
$= 0.85 \times 0.10$
$= 0.085$

Hospital Patients - Question 2
Are the conditions of diabetes and high blood pressure independent?
Remember when discussing the special case of the multiplication rule
we said if A and B are independent then:
$\Pr(A \mid B) = \Pr(A)$
We can use this to test for independence.
$\Pr(A \mid B) = 0.85$
$\Pr(A) = 0.25$
$\Pr(A) \neq \Pr(A \mid B)$
$\Rightarrow$ A and B are not independent
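The two hospital questions reduce to one multiplication and one comparison. A minimal sketch in R follows; the variable names are illustrative.

```r
pr_B         <- 0.10   # Pr(B): patient is diabetic
pr_A         <- 0.25   # Pr(A): patient has high blood pressure
pr_A_given_B <- 0.85   # Pr(A | B): high blood pressure given diabetic

# Question 1: multiplication rule
pr_A_and_B <- pr_A_given_B * pr_B

# Question 2: independence would require Pr(A | B) to equal Pr(A)
independent <- isTRUE(all.equal(pr_A_given_B, pr_A))

pr_A_and_B
pr_A * pr_B    # what Pr(A and B) would be if A and B were independent
independent
```

The joint probability under independence (0.025) is well below the value implied by the survey (0.085), which is another way of seeing the dependence.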
Summary
Calculating a probability.
$\Pr(A) = \frac{n_A}{N}$
Addition Rule.
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
Multiplication Rule.
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
Read $\Pr(B \mid A)$ as "probability of B given A".
Addition Rule - Mutually Exclusive
$\Pr(A \cup B) = \Pr(A) + \Pr(B)$
Multiplication Rule - Independent Events
$\Pr(A \cap B) = \Pr(A) \Pr(B)$

Questions
For each question, assume a standard deck of 52 playing cards:
What is the probability the first card drawn is a 4 or a 5?
Say the first card drawn is a 5. What is the probability the second card
drawn is an Ace?
(Think of this as "What is the probability the second card drawn is an
Ace, given the first card drawn was a 5?")
Different question: What is the probability the first card drawn is a 5
and the second card drawn is an Ace?
Lecture 10
Calculating Conditional Probabilities
CONDITIONAL EVENTS
Two events are conditional if the probability of one event changes
depending on the outcome of another event.
Rearranging the multiplication rule gives the conditional probability:
$\Pr(A \cap B) = \Pr(A) \Pr(B \mid A)$
$\frac{\Pr(A \cap B)}{\Pr(A)} = \frac{\Pr(A) \Pr(B \mid A)}{\Pr(A)}$
$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$

Fair Die Example
A fair die is thrown. A is the event "a number greater than 3 is
thrown" and B is the event "an even number is thrown".
Find $\Pr(A \cup B)$ and $\Pr(A \cap B)$
Pr(A) = 3/6 = 1/2
Pr(B) = 3/6 = 1/2

Fair Die Example - Visualisation
Event A: a number greater than 3 is thrown
Event B: an even number is thrown
[Figure: Venn diagram of the sample space. 4 and 6 lie in both A and B,
5 lies in A only, 2 lies in B only, and 1 and 3 lie outside both.]
$A \cup B = \{2, 4, 5, 6\}$, therefore $\Pr(A \cup B) = 4/6 = 2/3$
$A \cap B = \{4, 6\}$, therefore $\Pr(A \cap B) = 2/6 = 1/3$

Calculate $\Pr(A \cap B)$: Multiplication rule
A fair die is thrown.
Define Event A: a number greater than 3 is thrown
Define Event B: an even number is thrown
Brief thought exercise:
$\Pr(B \mid A)$ = Pr(even number, given we have rolled a 4, 5, or 6)
= 2/3
$\Pr(A \cap B) = 1/2 \times 2/3$
$\Pr(A \cap B) = 1/3$
The same result we observed in our diagram.
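The fair die results can be confirmed by enumerating the sample space and applying the conditional probability formula. This is a sketch assuming equally likely outcomes; `pr()` is an illustrative helper, not a built-in.

```r
S <- 1:6
A <- S[S > 3]             # a number greater than 3: 4, 5, 6
B <- S[S %% 2 == 0]       # an even number: 2, 4, 6
pr <- function(E) length(E) / length(S)   # probability by counting outcomes

pr_A_and_B   <- pr(intersect(A, B))   # outcomes 4 and 6
pr_B_given_A <- pr_A_and_B / pr(A)    # conditional probability formula
pr_A_or_B    <- pr(union(A, B))       # outcomes 2, 4, 5, 6

c(A_and_B = pr_A_and_B, B_given_A = pr_B_given_A, A_or_B = pr_A_or_B)
```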
Find $\Pr(A \cup B)$
Use the addition rule
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
$\Pr(A \cup B) = 1/2 + 1/2 - 1/3$
$\Pr(A \cup B) = 2/3$
The same result we observed in our diagram.
Tree Diagrams
Useful for examining combinations of several random variables.
Choose what letters you use wisely!
Always use A and $\bar{A}$ for your event names.
Rules:
Add vertically.
Multiply across.
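Basic Tree Diagrams
[Figure: two-stage tree diagram. First-stage branches: $\Pr(A)$ and
$\Pr(\bar{A})$. Second-stage branches: $\Pr(B \mid A)$, $\Pr(\bar{B} \mid A)$,
$\Pr(B \mid \bar{A})$, $\Pr(\bar{B} \mid \bar{A})$. End points: $\Pr(A \cap B)$,
$\Pr(A \cap \bar{B})$, $\Pr(\bar{A} \cap B)$, $\Pr(\bar{A} \cap \bar{B})$.]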
Independent Stages
Stephens Island is an uninhabited island in Cook Strait where tuatara are
being re-established. For some years three locations have been visited on
the island and tuatara have been found at a location with probability 0.4.
At any visit X represents the number of locations out of three at which
tuatara are observed. X can take values 0, 1, 2, or 3. Find the
probabilities that 0, 1, 2, or 3 locations have tuatara on a visit. T is
the event "location has tuatara" and $\bar{T}$ is the complementary event
"location has no tuatara".
Define Event T: Location has tuatara
Define Event $\bar{T}$: Location does NOT have tuatara
Tree Diagrams - Building Step 1
[Figure: three-stage tree diagram, one stage per location. Every stage
splits into T (probability 0.4) and $\bar{T}$ (probability 0.6). The eight
end points, from top to bottom, are labelled X = 3, 2, 2, 1, 2, 1, 1, 0.]
Probabilities of observing tuatara
Find the probability of seeing tuatara at the first site observed.
This was given in the problem statement:
Pr(T) = 0.4
Also found by following one T branch in the tree.
Find the probability of seeing tuatara at the first two sites
(we assume independence of each site)
$\Pr(T \cap T) = \Pr(T) \Pr(T) = 0.4 \times 0.4 = 0.16$
This is also found by multiplying the probabilities following two T
branches on the tree diagram ("multiply across" rule).
Probabilities of observing tuatara
Find the probability of seeing tuatara at all three sites.
(again assume independence)
$\Pr(T \cap T \cap T) = \Pr(T) \Pr(T) \Pr(T) = 0.4 \times 0.4 \times 0.4 = 0.064$
This is also found by multiplying the probabilities along three T
branches on the tree diagram ("multiply across" rule).

Add down rule
Find the probability of seeing tuatara at two of the three sites:
[Figure: the three-stage tuatara tree diagram, with end points labelled
X = 3, 2, 2, 1, 2, 1, 1, 0.]
$\Pr(X = 2) = \Pr(TT\bar{T} \cup T\bar{T}T \cup \bar{T}TT)$
$= \Pr(TT\bar{T}) + \Pr(T\bar{T}T) + \Pr(\bar{T}TT)$
$= 0.096 + 0.096 + 0.096$
$= 0.288$
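The "multiply across, add down" rules can be mimicked in R by listing every path through the three-stage tree. The sketch below assumes the three locations are independent, as the slides do; the object names are illustrative.

```r
p <- 0.4    # Pr(T): tuatara found at a single location

# Every path through the tree: 1 = tuatara seen (T), 0 = not seen
paths <- expand.grid(site1 = 0:1, site2 = 0:1, site3 = 0:1)

# Multiply across: probability of each path under independence
paths$prob <- apply(paths[, 1:3], 1,
                    function(s) prod(ifelse(s == 1, p, 1 - p)))

# Number of locations with tuatara on each path
paths$X <- rowSums(paths[, 1:3])

# Add down: total probability for each value of X
dist_X <- tapply(paths$prob, paths$X, sum)
dist_X
```

The entry for X = 2 collects the three paths with exactly two T branches, each contributing 0.096.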
Find the probability of seeing tuatara at one of the three sites
Take advantage of the fact that all possibilities add to 1
$\Pr(X = 2) = 0.288$
$\Pr(X = 0) = 0.6 \times 0.6 \times 0.6 = 0.216$
$\Pr(X = 3) = 0.4 \times 0.4 \times 0.4 = 0.064$
$\Pr(X = 1) = 1 - 0.288 - 0.216 - 0.064$
$= 0.432$

Summary
Conditional Probability
$\Pr(B \mid A) = \frac{\Pr(A \cap B)}{\Pr(A)}$
Tree Diagrams
Introduction to Tree diagrams
Multiply across, Add down
[Figure: two-stage tree diagram with first-stage branches $\Pr(A)$ and
$\Pr(\bar{A})$, second-stage conditional branches, and end points
$\Pr(A \cap B)$, $\Pr(A \cap \bar{B})$, $\Pr(\bar{A} \cap B)$, $\Pr(\bar{A} \cap \bar{B})$.]

Questions
100 Otago students were asked if they like L&P.
Of the 75 males (M) surveyed: 50 said they like L&P (L).
Of the 25 females ($\bar{M}$) surveyed: 20 said they like L&P (L).
Find:
$\Pr(L)$
$\Pr(L \mid M)$
$\Pr(L \mid \bar{M})$
Are the events M and L independent?
Lecture 11
Sensitivity and Specificity
Given the following Events:
D: some condition (D) is present.
T: the related test (T) for the presence of D is positive.
Note: This test result may or may not be correct.
SENSITIVITY = $\Pr(T \mid D)$
Think of this as the probability of a positive test result, given the
person actually has the disease.
SPECIFICITY = $\Pr(\bar{T} \mid \bar{D})$
Think of this as the probability of a negative test result, given the
person does NOT have the disease.
SENSITIVITY and SPECIFICITY will appear very naturally in your
tree diagrams!
PPV and NPV
POSITIVE PREDICTIVE VALUE = $\Pr(D \mid T)$
The proportion of patients with positive test results who are correctly
diagnosed.
NEGATIVE PREDICTIVE VALUE = $\Pr(\bar{D} \mid \bar{T})$
The proportion of patients with negative test results who are correctly
diagnosed.
Notice the different order of the D and the T events.
These will NOT appear naturally in your tree diagrams.
See slide later for how to calculate these.

Screening Programmes
A patient with certain symptoms consults her doctor to be checked
for a cancer, and she undergoes a biopsy.
With this test there is a probability of 0.90 that a woman with the
cancer shows a positive biopsy, and a probability of only 0.001 that a
healthy woman incorrectly shows a positive biopsy.
Historical information also suggests that the prevalence of this cancer
in the population is 1 in 10000.
Find the probability that a woman has the cancer given the biopsy
says she does (i.e. does the biopsy diagnose true patient status?).
Tree Diagrams
Let C be the event "woman has the cancer" and T be the event "test
is positive".
$\Pr(C) = 1/10000 = 0.0001$ (disease prevalence)
$\Pr(T \mid C) = 0.90$ (conditional probability)
$\Pr(T \mid \bar{C}) = 0.001$
[Figure: tree diagram. First-stage branches: $\Pr(C) = 0.0001$ and
$\Pr(\bar{C}) = 0.9999$. Second-stage branches: $\Pr(T \mid C) = 0.90$,
$\Pr(\bar{T} \mid C) = 0.10$, $\Pr(T \mid \bar{C}) = 0.001$,
$\Pr(\bar{T} \mid \bar{C}) = 0.999$.]
Positive Predictive value
Find the positive predictive value $\Pr(C \mid T)$. To calculate this we use
the conditional probability formula.
From the tree diagram (multiply across):
$\Pr(T \text{ and } C) = 0.0001 \times 0.90 = 0.00009$
$\Pr(\bar{T} \text{ and } C) = 0.0001 \times 0.10 = 0.00001$
$\Pr(T \text{ and } \bar{C}) = 0.9999 \times 0.001 = 0.00100$
$\Pr(\bar{T} \text{ and } \bar{C}) = 0.9999 \times 0.999 = 0.99890$
$P(T) = P(T \cap C) + P(T \cap \bar{C}) = 0.00009 + 0.00100 = 0.00109$
$\Pr(C \mid T) = \frac{\Pr(C \cap T)}{\Pr(T)} = \frac{0.00009}{0.00109} = 0.083$
Only 8.3% of those women identified as having the disease actually
do.
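The predictive values for this screening example follow mechanically from the prevalence and the two test error rates. Below is a minimal sketch in R; the variable names are illustrative, and unrounded intermediate values are used, so the last digits can differ slightly from the hand calculation.

```r
prevalence  <- 1 / 10000   # Pr(C)
sensitivity <- 0.90        # Pr(T | C)
false_pos   <- 0.001       # Pr(T | not C), so specificity = 0.999

# Multiply across the tree to get the four joint probabilities
pos_and_c     <- prevalence * sensitivity
pos_and_not_c <- (1 - prevalence) * false_pos
neg_and_c     <- prevalence * (1 - sensitivity)
neg_and_not_c <- (1 - prevalence) * (1 - false_pos)

# Add down to get Pr(T) and Pr(not T), then apply the conditional formula
ppv <- pos_and_c / (pos_and_c + pos_and_not_c)
npv <- neg_and_not_c / (neg_and_c + neg_and_not_c)

round(c(PPV = ppv, NPV = npv), 5)
```

The low positive predictive value comes from the rarity of the disease: false positives among the many healthy women outnumber true positives among the few with cancer.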
Negative Predictive value
Find the negative predictive value $\Pr(\bar{C} \mid \bar{T})$.
To calculate this we use the conditional probability formula
$\Pr(\bar{C} \mid \bar{T}) = \frac{\Pr(\bar{C} \cap \bar{T})}{\Pr(\bar{T})}$
$= \frac{0.99890}{0.99891}$
$= 0.99999$
About 99.999% of those women identified as not having the disease do not
have it.

Classification table
Sometimes the information is presented in a different manner.
Table layout for 2 Random Variables
Closure of the squid fishery in the sub-Antarctic islands due to Hooker
sea lion by-catch is a costly issue for fishing companies and much
research is carried out on this. The following table classifies a sample
of 219 vessels according to vessel nation and by-catch status over nine
years.
           No by-catch   By-catch   Total
NZ              90           6        96
Russia         100          23       123
Total          190          29       219

Let B be "by-catch" and $\bar{B}$ be "no by-catch".
Estimate the probability that a sampled vessel is Russian (R).
Find Pr(R)
Given that the sampled vessel had by-catch, what is the probability
that it is Russian?
Find Pr(R|B)

Estimate the probability that a sampled vessel is Russian.
The estimated probability a sampled vessel is Russian is
$P(R) = \frac{123}{219} = 0.562$
Tree Diagram
[Figure: tree diagram. First-stage branches: $\Pr(R) = 123/219$ and
$\Pr(NZ) = 96/219$. Second-stage branches: $\Pr(B \mid R) = 23/123$,
$\Pr(\bar{B} \mid R) = 100/123$, $\Pr(B \mid NZ) = 6/96$,
$\Pr(\bar{B} \mid NZ) = 90/96$. End points: 23/219, 100/219, 6/219, 90/219.]

Calculating Conditional Probabilities
Given that the sampled vessel had by-catch, what is the probability
that it is Russian?
i.e. find P(R|B).
First calculate the total probability of a by-catch
$\Pr(B) = \frac{23}{219} + \frac{6}{219} = \frac{29}{219}$
Now use the conditional probability formula
$\Pr(R \mid B) = \frac{\Pr(R \cap B)}{\Pr(B)}$
$\Pr(R \mid B) = \frac{23/219}{29/219}$
$\Pr(R \mid B) = \frac{23}{29}$
$\Pr(R \mid B) = 0.793$
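When the information comes as a classification table, the same probabilities can be read off row and column totals. This sketch stores the vessel counts in an R matrix; the object and dimension names are illustrative.

```r
# Vessel counts from the classification table
vessels <- matrix(c( 90,  6,
                    100, 23),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(nation = c("NZ", "Russia"),
                                  status = c("No by-catch", "By-catch")))

n <- sum(vessels)                                  # total number of vessels
pr_R         <- sum(vessels["Russia", ]) / n       # Pr(R), from the row total
pr_B         <- sum(vessels[, "By-catch"]) / n     # Pr(B), from the column total
pr_R_and_B   <- vessels["Russia", "By-catch"] / n  # Pr(R and B), from the cell
pr_R_given_B <- pr_R_and_B / pr_B                  # conditional probability

round(c(Pr_R = pr_R, Pr_B = pr_B, Pr_R_given_B = pr_R_given_B), 3)
```

Because the two divisions by n cancel, the conditional probability is simply the cell count over the column total, 23/29.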
Summary
SENSITIVITY = $\Pr(T \mid D)$
Think of this as the probability of a positive test result, given the
person actually has the disease.
SPECIFICITY = $\Pr(\bar{T} \mid \bar{D})$
Think of this as the probability of a negative test result, given the
person does NOT have the disease.
POSITIVE PREDICTIVE VALUE = $\Pr(D \mid T)$
The proportion of patients with positive test results who are correctly
diagnosed.
NEGATIVE PREDICTIVE VALUE = $\Pr(\bar{D} \mid \bar{T})$
The proportion of patients with negative test results who are correctly
diagnosed.
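Summary
[Figure: tree diagram. First-stage branches: $\Pr(D)$ and $\Pr(\bar{D})$.
Second-stage branches: $\Pr(T \mid D)$, $\Pr(\bar{T} \mid D)$, $\Pr(T \mid \bar{D})$,
$\Pr(\bar{T} \mid \bar{D})$. End points: $\Pr(D \cap T)$, $\Pr(D \cap \bar{T})$,
$\Pr(\bar{D} \cap T)$, $\Pr(\bar{D} \cap \bar{T})$.]
Here's how to calculate all the things which do NOT already
appear in your tree diagram:
$P(T) = P(T \cap D) + P(T \cap \bar{D})$
$P(\bar{T}) = P(\bar{T} \cap D) + P(\bar{T} \cap \bar{D})$, quite naturally
$P(D \mid T) = \frac{P(D \cap T)}{P(T)}$, which is the Positive Predictive Value
$P(\bar{D} \mid \bar{T}) = \frac{P(\bar{D} \cap \bar{T})}{P(\bar{T})}$, which is the Negative Predictive Value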
Lecture 12
Random Variables
A random variable has values which depend on the outcome of a
random experiment.
Random variables are labelled with a capital letter (X).
The value of an outcome is denoted by a lower-case letter (x).
They can be discrete or continuous.
Random Variables - Discrete Example
Consider the Tuatara example
X is the random variable
X is discrete
The values of $x_i$ are the possible outcomes: 0, 1, 2, or 3.
If many trials are carried out then the relative frequencies of each $x_i$
stabilize to give the probabilities for each outcome.
This set of probabilities forms the probability distribution.

Probability Distribution - Discrete Random Variable
Note that for a probability distribution of a discrete random variable:
The sum of all the probabilities $\Pr(X = x_i)$ adds to one.
$\sum_{i=0}^{3} \Pr(X = x_i) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$
$= 0.216 + 0.432 + 0.288 + 0.064$
$= 1$
All probabilities are between 0 and 1.
The collection of all possible outcomes and their associated
probabilities is referred to as the probability mass function.
Describing Probability Mass Functions
Just as for a data set, we can describe a probability distribution by
finding the mean (to describe the centre) and by finding the variance
or standard deviation (to describe the variability).
If X is the discrete random variable with a probability mass function:
$\mu_X$ is the mean of X, and
$\sigma_X^2$ is the variance of X
Learn this right now:
$\sqrt{\text{variance } (\sigma^2)} = \text{standard deviation } (\sigma)$
$\text{variance } (\sigma^2) = (\text{standard deviation } (\sigma))^2$

Calculating the Mean - Probability Mass Function
The mean is:
$\mu_X = \sum_{i=1}^{k} x_i \Pr(X = x_i)$
where k is the total number of possible outcomes, i.e. the number of
different values $x_i$.
For the tuatara example (k = 4 possible values: 0, 1, 2, 3):
$\mu_X = 0 \times \Pr(X = 0) + 1 \times \Pr(X = 1) + 2 \times \Pr(X = 2) + 3 \times \Pr(X = 3)$
$= 0 \times 0.216 + 1 \times 0.432 + 2 \times 0.288 + 3 \times 0.064$
$= 1.2$
Calculating the Variance - Probability Mass Function
The variance is:
$\sigma_X^2 = \sum_{i=1}^{k} (x_i - \mu_X)^2 \Pr(X = x_i)$
$\mu_X$ is the value of the mean, that we just learned how to calculate.

Contagious disease
A person infected with a disease can pass it on to others.
Let the discrete random variable X be the number of others infected
by this person.

$X = x_i$         0      1      2      3      4
$\Pr(X = x_i)$    0.10   0.25   0.40   0.20   0.05

Using the formula:
Mean
$\mu_X = 0(0.10) + 1(0.25) + 2(0.40) + 3(0.20) + 4(0.05) = 1.85$
Variance
$\sigma_X^2 = (0 - 1.85)^2 \times 0.10 + (1 - 1.85)^2 \times 0.25 + (2 - 1.85)^2 \times 0.40 + (3 - 1.85)^2 \times 0.20 + (4 - 1.85)^2 \times 0.05 = 1.0275$
Standard Deviation
$\sigma_X = \sqrt{\sigma_X^2} = \sqrt{1.0275} = 1.0137$
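The mean, variance and standard deviation of a probability mass function are weighted sums, which R handles with vector arithmetic. The following sketch uses the contagious disease table; the variable names are illustrative.

```r
x  <- 0:4                                # possible numbers of others infected
pr <- c(0.10, 0.25, 0.40, 0.20, 0.05)    # Pr(X = x) from the table
stopifnot(isTRUE(all.equal(sum(pr), 1))) # a valid PMF sums to 1

mu_X    <- sum(x * pr)                   # mean: values weighted by probability
var_X   <- sum((x - mu_X)^2 * pr)        # variance: weighted squared deviations
sigma_X <- sqrt(var_X)                   # standard deviation

round(c(mean = mu_X, variance = var_X, sd = sigma_X), 4)
```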
Combining Random Variables
Often we are interested in the mean and the variance of a rescaled
random variable, or in the mean and variance of sums (or differences)
of random variables.
Assume:
X is a R.V., with mean $\mu_X$ and variance $\sigma_X^2$
Y is a R.V., with mean $\mu_Y$ and variance $\sigma_Y^2$
a and b are constants (that means they're fixed numbers).
We will consider new random variables of the form:
Z = a + bY
Z = aX + bY
Here's how we calculate the mean and variance of the new random
variable, Z:
Z = a + bY
mean: $\mu_Z = a + b\mu_Y$
variance: $\sigma_Z^2 = b^2 \sigma_Y^2$
Z = aX + bY
mean: $\mu_Z = a\mu_X + b\mu_Y$
variance: $\sigma_Z^2 = a^2 \sigma_X^2 + b^2 \sigma_Y^2$ (for independent X and Y)

Examples
Say $\mu_X = 5$ and $\sigma_X^2 = 3$
Consider the new random variable Z = 2X
In this case:
mean: $\mu_Z = 2\mu_X$
$= 2 \times 5 = 10$
variance: $\sigma_Z^2 = 2^2 \sigma_X^2$
$= 2^2 \times 3 = 12$
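Examples
Again, let $\mu_X = 5$ and $\sigma_X^2 = 3$
This time consider the new random variable Z = 4X + 3
In this case:
mean: $\mu_Z = 4\mu_X + 3$
$= 4 \times 5 + 3 = 23$
variance: $\sigma_Z^2 = 4^2 \sigma_X^2$
$= 4^2 \times 3 = 48$

Examples
Finally, let $\mu_X = 5$ and $\sigma_X^2 = 3$,
and another random variable Y, with $\mu_Y = -2$ and $\sigma_Y^2 = 1$
Consider the new random variable Z = X + 2Y
In this case:
mean: $\mu_Z = 1 \times \mu_X + 2 \times \mu_Y$
$= 1 \times 5 + 2 \times (-2)$
$= 5 - 4 = 1$
variance: $\sigma_Z^2 = 1^2 \sigma_X^2 + 2^2 \sigma_Y^2$
$= 1 \times 3 + 4 \times 1$
$= 3 + 4 = 7$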
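Things to look out for:
Either a and/or b are negative
Say a = 1, and b = -2
Z = X - 2Y
In this case:
$\mu_Z = \mu_X - 2\mu_Y$
$\sigma_Z^2 = (1)^2 \sigma_X^2 + (-2)^2 \sigma_Y^2$
$= \sigma_X^2 + 4\sigma_Y^2$
Say a = -1, and b = 1
Z = -X + Y
In this case:
$\mu_Z = -\mu_X + \mu_Y$
$\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$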
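The rules for Z = aX + bY can be wrapped in a small helper to see how negative coefficients behave. This is a sketch assuming X and Y are independent; `combine()` is a hypothetical helper, and the values for Y are illustrative inputs rather than figures the helper depends on.

```r
# Mean and variance of Z = aX + bY for independent X and Y
combine <- function(a, b, mu_x, var_x, mu_y, var_y) {
  c(mean     = a * mu_x + b * mu_y,
    variance = a^2 * var_x + b^2 * var_y)   # coefficients are squared
}

# X has mean 5 and variance 3; Y is given an illustrative mean -2 and variance 1
z_sum  <- combine(a = 1, b =  2, mu_x = 5, var_x = 3, mu_y = -2, var_y = 1)  # Z = X + 2Y
z_diff <- combine(a = 1, b = -2, mu_x = 5, var_x = 3, mu_y = -2, var_y = 1)  # Z = X - 2Y

z_sum
z_diff
```

The sign of b changes the mean but not the variance: subtracting a random variable still adds its variability.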
Temperature Problem
Temperatures can be recorded in degrees Fahrenheit. Suppose a
random variable F measures January temperature (°F) in Dunedin.
Daily maximum summer temperatures have a mean of 70 °F with a
standard deviation of 5 °F.
Use the conversion formula $C = \frac{5}{9}(F - 32)$ to find the mean and
standard deviation for the temperature in degrees Celsius.

Rearranging the Formula
$C = \frac{5}{9}(F - 32)$
$= \frac{5}{9}F - \frac{5}{9} \times 32$
$= \frac{5}{9}F - \frac{160}{9}$
$= -\frac{160}{9} + \frac{5}{9}F$

Calculating the mean
$C = -\frac{160}{9} + \frac{5}{9}F$
Mean:
$\mu_C = a + b\mu_F$
$= -\frac{160}{9} + \frac{5}{9} \times 70$
$= 21.1$ °C

Calculating the standard deviation
$C = -\frac{160}{9} + \frac{5}{9}F$
Variance:
$\sigma_C^2 = b^2 \sigma_F^2$
$= \left(\frac{5}{9}\right)^2 \times 5^2$
$= 7.716$ (°C)$^2$
The standard deviation is just the square root of the variance:
$\sigma_C = \sqrt{7.716} = 2.78$ °C
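The temperature conversion is an instance of Z = a + bY, so it can be checked in a few lines of R. A minimal sketch, with illustrative variable names:

```r
mu_F <- 70; sd_F <- 5        # daily maximum temperature in degrees Fahrenheit
a <- -160 / 9; b <- 5 / 9    # C = a + b * F, from C = (5/9)(F - 32)

mu_C  <- a + b * mu_F        # the mean is shifted and scaled
var_C <- b^2 * sd_F^2        # the variance is scaled by b^2; the shift a has no effect
sd_C  <- sqrt(var_C)         # equivalently abs(b) * sd_F

round(c(mean_C = mu_C, var_C = var_C, sd_C = sd_C), 3)
```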
Summary
If X and Y are independent random variables and a and b are
constants:
Consider a new random variable Z = a + bY:
mean: $\mu_Z = a + b\mu_Y$
variance: $\sigma_Z^2 = b^2 \sigma_Y^2$
Alternatively, say Z = aX + bY:
mean: $\mu_Z = a\mu_X + b\mu_Y$
variance: $\sigma_Z^2 = a^2 \sigma_X^2 + b^2 \sigma_Y^2$

Questions
Say the average time required to do a stats question is 5 minutes,
with a variance of 2 (minutes squared), and the average time for writing
an essay is 20 minutes, with a variance of 10 (minutes squared).
Find the mean, variance, and standard deviation of the time necessary
to do 3 stats questions and write 2 essays.
Lecture 13
Binomial Distribution
Probability Distribution
Arises when investigating proportions
Requires a binary outcome, which means exactly two possible
outcomes: success, and failure.
These may be defined however you wish.
We conduct a fixed number of trials, n
Each trial has an equal probability of success: $\pi$

Single Trial
For a single trial (call this Y), the probability distribution is:

$Y = y_i$         1 (Success)   0 (Failure)
$\Pr(Y = y_i)$    $\pi$         $1 - \pi$

The mean of this single trial is:
mean: $\mu_Y = 1 \times \pi + 0 \times (1 - \pi) = \pi$
The variance of this single trial is:
variance: $\sigma_Y^2 = (1 - \pi)^2 \pi + (0 - \pi)^2 (1 - \pi)$
$= (1 - \pi)^2 \pi + \pi^2 (1 - \pi)$
$= \pi (1 - \pi)(1 - \pi + \pi)$
$= \pi (1 - \pi)$
Binomial Distribution
Now, we'll define the Binomial distribution as the sum of n such
independent trials.
Total number of successes:
$X = Y_1 + Y_2 + Y_3 + \ldots + Y_n$
where all the $Y_i$ are independent of one another.

Mean and Variance of the Binomial Distribution
Combining random variables:
$X = Y_1 + Y_2 + Y_3 + \ldots + Y_n$
Gives:
mean: $\mu_X = \mu_{Y_1} + \mu_{Y_2} + \mu_{Y_3} + \ldots + \mu_{Y_n}$
$= \pi + \pi + \pi + \ldots + \pi$
$= n\pi$
variance: $\sigma_X^2 = \sigma_{Y_1}^2 + \sigma_{Y_2}^2 + \sigma_{Y_3}^2 + \ldots + \sigma_{Y_n}^2$
$= \pi(1 - \pi) + \pi(1 - \pi) + \ldots + \pi(1 - \pi)$
$= n\pi(1 - \pi)$
standard deviation: $\sigma_X = \sqrt{n\pi(1 - \pi)}$

Binomial mean and variance
Mean of the Binomial:
$\mu_X = n\pi$
Variance of the Binomial:
$\sigma_X^2 = n\pi(1 - \pi)$
Standard Deviation of the Binomial:
$\sigma_X = \sqrt{n\pi(1 - \pi)}$

Estimate $\pi$ when we don't know it exactly
Sometimes the value of the parameter $\pi$ is not known and we need to
estimate it.
Use p where $p = \frac{X}{n}$
X is the number of successes.
n is the number of trials.
Then simply replace $\pi$ with p, in each formula:
$\mu_X = np$
$\sigma_X^2 = np(1 - p)$
$\sigma_X = \sqrt{np(1 - p)}$
Three necessary conditions for the Binomial
1. Outcome is binary.
2. We have n independent trials.
3. Probability of success must stay constant.

Notes About the Conditions
Outcome is binary.
There may be more than two possible outcomes, as long as the
outcomes can be combined into two subsets.
One subset is success, the other is failure.
e.g. Rolling a 4 is a success - Any other number is a failure
or Having blue eyes is a success - Any other eye colour is a failure
Probability of success must stay constant.
Sampling without replacement from a small population does not
produce a binomial random variable.
For example, suppose a class consists of 10 boys and 10 girls. Five are
randomly selected to be in a play and X = the number of girls selected.
This is not binomially distributed because each time an individual is
removed from the class the probability that a girl is selected changes.

Probability distribution of the Binomial
The probability of x successes where x takes the values 0 to n is given by:
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
where
$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$
is the binomial coefficient.
Consider the Tuatara
Stephens Island is an uninhabited island in Cook Strait where tuatara are
being re-established. For some years three locations have been visited on
the island and tuatara have been found at a location with probability 0.4.
At any visit X represents the number of locations out of three at which
tuatara are observed (X can take values 0, 1, 2, or 3). Find the
probabilities that 0, 1, 2 or 3 locations have tuatara on a visit. T is the
event "location has tuatara" and $\bar{T}$ is the complementary event
"location has no tuatara".
Find the probability of seeing tuatara at two of the three sites.
$\Pr(X = 2) = \Pr(TT\bar{T},\ T\bar{T}T,\ \bar{T}TT)$
$= 0.096 + 0.096 + 0.096$
$= 0.288$
This is a binomial example with n = 3, $\pi = 0.4$.
In this case we are interested in the probability of two successes.
Using the Formula
n = 3, x = 2, $\pi = 0.4$
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
$= \binom{3}{2} 0.4^2 (1 - 0.4)^{3 - 2}$
$= 3 \times 0.4^2 \times 0.6$
$= 0.288$
This is of course a very simple example. For more complicated examples,
instead of using the formula all the time we can use tables or software to
find probabilities.

Example: Developing OOS
Records show that twenty percent of violin pupils are known to develop
OOS during the course of their training. Define X to be the number of
violin pupils out of 9 who develop OOS during their training.
(a) Find the probability mass function of X (in R).
(b) Probability that none of the 9 pupils develop OOS?
Pr(X = 0) = 0.1342 = 13.42%
(c) Probability that more than 4 pupils develop OOS?
We could add: 0.0165 + 0.0028 + ...... etc.
Or, let's use R again.
P(X > 4) = 0.0196 = 1.96%
(d) We observe 5 out of 9 students develop OOS. Conclusions?
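The R commands behind parts (a) to (c) are not shown on the slide. One plausible way to obtain the quoted values uses the standard `dbinom` and `pbinom` functions; the sketch below is an assumption about the approach, not a copy of the original code.

```r
n <- 9; p <- 0.2                          # 9 pupils, 20% develop OOS

# (a) probability mass function: Pr(X = x) for x = 0, ..., 9
pmf <- setNames(dbinom(0:n, size = n, prob = p), 0:n)
round(pmf, 4)

# (b) probability that none of the 9 pupils develop OOS
p_none <- dbinom(0, size = n, prob = p)

# (c) more than 4 means 5 or more: Pr(X > 4) = 1 - Pr(X <= 4)
p_more_than_4 <- 1 - pbinom(4, size = n, prob = p)

round(c(none = p_none, more_than_4 = p_more_than_4), 4)
```

Note that `pbinom(4, ...)` already includes X = 4, so subtracting it from 1 leaves exactly the outcomes 5 to 9.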
Summary
A binomial distribution is defined by two parameters:
The fixed number of trials (n)
The constant probability of success on each trial ($\pi$)
Three necessary conditions for Binomial Distribution:
Outcome is binary (success or failure)
n independent trials
The probability of success ($\pi$) is the same for all trials
Say X is a Binomial, with parameters n and $\pi$:
mean: $\mu_X = n\pi$
variance: $\sigma_X^2 = n\pi(1 - \pi)$
standard deviation: $\sigma_X = \sqrt{n\pi(1 - \pi)}$

Questions
Say we flip a coin 100 times.
What is the mean, variance, and standard deviation for the number of
heads?
Say we have an unfair coin, and flip it 100 times.
This results in 80 heads.
What is your best estimate of Pr(heads) for this coin?
Find your best estimate of the mean, variance, and standard deviation
for the number of heads.
Lecture 14
Family make-up example
A report suggests that 75% of NZ-born children live with both parents. A
random sample of 20 Maori children is selected, and asked whether they
live with both of their parents.
Immediately recognize this as a Binomial Distribution (X).
Define the parameters of the distribution of X
Find Pr(X = 15)
Find the probability that 11 or fewer live with both parents,
$\Pr(X \le 11)$.

Family make-up example solutions
X is Binomial, with parameters n = 20 and $\pi = 0.75$
$\Pr(X = x) = \binom{n}{x} \pi^x (1 - \pi)^{n - x}$
$\Pr(X = 15) = \binom{20}{15} 0.75^{15} (1 - 0.75)^{20 - 15}$
$= \frac{20!}{15!\,(20 - 15)!} 0.75^{15} (0.25)^5$
$= 0.2023$
In practice, you'll use R:
dbinom(x=15,size=20,prob=0.75)
The command dbinom will provide the individual binomial probabilities
associated with a given outcome, provided the number of trials (size) and
the probability of success (prob).
Find the probability that 11 or fewer are found to live with both
parents.
$\Pr(X \le 11) = \Pr(X = 0, 1, 2, \ldots, 11)$
$= \Pr(X = 0) + \Pr(X = 1) + \ldots + \Pr(X = 11)$
$= \sum_{i=0}^{11} \Pr(X = i)$
$= 0.0409$
R:
pbinom(q=11,size=20,prob=0.75)
The command pbinom will provide the sum of all individual binomial
probabilities less than or equal to a given outcome (q), provided the number
of trials (size) and the probability of success (prob). Can you think how
we can use this to determine "greater than" probabilities?

Say that we randomly sample 20 children, and find that only 11 are
living with both parents.
Does this provide evidence that less than 75% of children live with
both parents?
If $\pi = 0.75$ is assumed for families, then
$\Pr(X \le 11) = 0.0409$ is very small (less than 5%).
5% is a very important number in statistical analyses.
By convention, occurrences with less than 5% chance of happening
are considered unlikely, and these provide sufficient evidence
against an assumption.
Since $\Pr(X \le 11) = 0.041$, this provides evidence against the claim
that 75% of children live with both parents.

This is called a p-value
The p-value (p).
Defined as the probability of the observation which occurred, or
observing an even more unlikely outcome.
If our observation had been 19 out of 20 families,
then $p = \Pr(X \ge 19)$
If our observation had been 6 out of 20 families,
then $p = \Pr(X \le 6)$
p < 0.05 (by convention) implies an event is sufficiently unlikely.
$p \ge 0.05$ means an event is not unusual.
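Lower-tail and upper-tail p-values for the family example can both be built from `pbinom`. The sketch below shows one way, assuming the binomial model with n = 20 and probability 0.75; the variable names are illustrative.

```r
n <- 20; p0 <- 0.75     # sample size and the assumed probability of success

# Lower tail: observed 11 or fewer, so p-value = Pr(X <= 11)
p_low <- pbinom(11, size = n, prob = p0)

# Upper tail: Pr(X >= 19) = 1 - Pr(X <= 18)
p_high     <- 1 - pbinom(18, size = n, prob = p0)
# The same area via lower.tail = FALSE, which returns Pr(X > 18)
p_high_alt <- pbinom(18, size = n, prob = p0, lower.tail = FALSE)

round(c(p_low = p_low, p_high = p_high), 4)
p_low < 0.05            # compare the p-value with the 5% convention
```

For "greater than or equal to" probabilities the cut-off passed to `pbinom` is one less than the observed count, because `pbinom` includes its first argument in the lower tail.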
Cancer drug example
The standard drug for treating tumors is claimed to halve the tumor size in
30% of patients.
Suppose we treat 7 randomly selected patients with a new, experimental
drug, and observe how many have their tumor size halved (X).
List the conditions for X to be Binomial
Write down the probability that three of the patients have their tumor
size halved
$\Pr(X = 3)$
Find the probability that three or more of the patients have their
tumor size halved.
$\Pr(X \ge 3)$
In a pilot study in Auckland, three out of seven patients given a new
drug had their tumor size halved. What conclusion, if any, can be
drawn about the new drug?
Cancer drug example solutions
List the conditions which must be met if X is binomial
Trials must be independent. We can't select people from the same
family, for example.
Exactly two outcomes: Tumor halved, or not halved.
Constant probability for all the patients
Write down the probability that three of the patients have their tumor
size halved.
$\Pr(X = 3) = \binom{7}{3} 0.3^3 (1 - 0.3)^{7 - 3}$
From equation, or using R,
$\Pr(X = 3) = 0.2269$.
Find the probability that three or more of the patients have their
tumor size halved
$\Pr(X \ge 3) = \sum_{i=3}^{7} \Pr(X = i)$
$= \Pr(X = 3) + \Pr(X = 4) + \ldots + \Pr(X = 7)$
$= 0.2269 + 0.0972 + 0.0250 + 0.0036 + 0.0002$
$= 0.3529$
In a pilot study in Auckland, three out of seven patients given a new
drug had their tumor size halved. What conclusion, if any, can be
drawn about the new drug? Explain how you reach your conclusion
p = Probability of the observed occurrence (X = 3), or a more
unlikely occurrence
$p = \Pr(X \ge 3) = 0.3529$
p > 0.05, implying this is not an unusual event, hence the outcome is
consistent with the standard drug success rate of 30%.
There is no evidence to suggest the new drug is any different to the
standard.
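The cancer drug probabilities can be obtained with the same two R functions. This is a sketch assuming the binomial model with n = 7 and success probability 0.3; the variable names are illustrative.

```r
n <- 7; p0 <- 0.3                               # 7 patients, standard success rate 30%

p_exactly_3 <- dbinom(3, size = n, prob = p0)   # Pr(X = 3)
p_value     <- 1 - pbinom(2, size = n, prob = p0)   # Pr(X >= 3) = 1 - Pr(X <= 2)

round(c(exactly_3 = p_exactly_3, p_value = p_value), 4)
p_value > 0.05                                  # not unusual under the 30% rate
```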
Summary
The p-value (p) of an experiment is defined as:
The probability of the observation which occurred, or observing an
even more unlikely outcome.
"More unlikely" will be in the direction away from the mean.
5% is our threshold for the p-value:
p < 0.05 implies an event is sufficiently unlikely,
in which case we have evidence to reject the assumption we began with
(e.g. $\pi = 0.75$)

Questions
We are curious whether males are more likely to attend the E3
Conference (Electronic Entertainment Expo, held annually in Los
Angeles) than females.
Say we observe a random sample of 4 attendees, and they all happen
to be male.
Is this sufficient evidence that males are more likely to attend?
Say instead, we observe a random sample of 5 attendees, who all
happen to be male.
Is this sufficient evidence?
Lecture 15
Normal Distribution
Demonstration: The Normal Distribution PDF
(probability density function)
Normal Distribution parameters:
mean ($\mu$)
variance ($\sigma^2$)
The Standard Normal (Z):
$\mu = 0$
$\sigma^2 = 1$
Normal Distribution Notes
The graph is symmetrical about $\mu$ (centre).
The two parameters, $\mu$ and $\sigma$, completely define the normal
distribution. We say
$X \sim N(\mu, \sigma^2)$.
Standard Normal: $Z \sim N(0, 1)$.
Increasing $\mu$ moves the curve but does not change its shape.
Increasing $\sigma$ spreads the curve more widely about $X = \mu$ but does not
alter the centre.

Equation of the Normal PDF
Normal Distribution PDF:
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{X - \mu}{\sigma}\right)^2}$
We estimate the parameters from data:
$\mu$ is estimated by the sample mean, $\bar{x} = (\sum_i x_i)/n$
$\sigma$ is estimated by the sample standard deviation, s.

Probability Density Function
Compare a relative frequency histogram with a probability distribution.
Relative frequency histogram represents a sample (smaller number of
individuals).
Probability density function represents a population (large number of
individuals).
[Figure: Frequency Histogram of some Height Data]
[Figure: Normal Fit to Height Data; sample mean $\bar{x}$ = 172.5 cm, sample
standard deviation s = 10.6]
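The normal density formula can be written out by hand and compared with R's built-in `dnorm`. The sketch below uses the fitted height parameters (mean 172.5, standard deviation 10.6); `normal_pdf()` is a hypothetical helper and the three heights are arbitrary illustrative values.

```r
# Normal PDF written out term by term from the formula
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- c(150, 172.5, 190)                        # illustrative heights in cm
manual  <- normal_pdf(x, mu = 172.5, sigma = 10.6)
builtin <- dnorm(x, mean = 172.5, sd = 10.6)   # R's built-in normal density

all.equal(manual, builtin)                     # the two calculations agree
```

Here `pi` is the mathematical constant, not the binomial success probability used in earlier lectures.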
Area Under The Curve
Probabilities are equivalent to areas under the normal distribution
curve.
Total area under the curve is equal to 1.
The curve is symmetric about the mean, so there is an area of exactly 0.5
on either side.
The probability Pr(a < X < b) is found using the area under the
curve between X = a and X = b.
Never forget:
The areas under the curve represent probabilities.
Areas Under a Normal Curve, Using R
We will use R to find these areas (probabilities).
Always draw a diagram to identify the area you want.
Areas under this curve are found by using pnorm:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
Enter the value from which you want the area above/below (q).
Enter variable values: (mean: mean) and (standard deviation: sd)
Choose the tail (upper/lower) (lower.tail must be either TRUE or
FALSE)
For the Standard Normal: μ = 0 and σ² = 1 (so σ = 1), and these are the
default values in R if you don't enter a mean or an sd.
Examples
In the following examples, pnorm(q) gives you the area (of a Standard
Normal) below the point q.
Find Pr (0 < Z < 1.64)
pnorm(1.64) - pnorm(0)
Probability = 0.4495
Find Pr (Z > 1.64)
1 - pnorm(1.64)
or
pnorm(1.64, lower.tail=FALSE)
Find Pr (1 < Z < 1.64)
pnorm(1.64) - pnorm(1)
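The pnorm calls above can be run together as a short script. This is a sketch using the default Standard Normal settings; the variable names are illustrative.

```r
# Pr(0 < Z < 1.64): area below 1.64 minus area below 0
p_0_to_164 <- pnorm(1.64) - pnorm(0)

# Pr(Z > 1.64), found two equivalent ways
p_upper_complement <- 1 - pnorm(1.64)
p_upper_tail       <- pnorm(1.64, lower.tail = FALSE)

# Pr(1 < Z < 1.64): area below 1.64 minus area below 1
p_1_to_164 <- pnorm(1.64) - pnorm(1)

# By symmetry, the area below -1.64 equals the area above 1.64
p_lower_mirror <- pnorm(-1.64)

print(round(c(p_0_to_164 = p_0_to_164,
              p_upper_tail = p_upper_tail,
              p_1_to_164 = p_1_to_164,
              p_lower_mirror = p_lower_mirror), 4))
```

The general pattern is that an area between two points is the difference of two lower-tail areas, which is why drawing the diagram first helps.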
Find Pr (-1 < Z < 1.64)
pnorm(1.64) - pnorm(-1)
Inverse Problems Using R
Normal quantiles are found with qnorm:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
Enter the probability (p)
Choose the tail (upper/lower: lower.tail)
Enter the mean (mean), and standard deviation (sd)
When lower.tail=TRUE (default) you should interpret qnorm as
"what is the value of the normal distribution defined with mean and
sd below which an area of exactly p occurs?"
When lower.tail=FALSE you should interpret qnorm as "what is
the value of the normal distribution defined with mean and sd above
which an area of exactly p occurs?"
Some special quantiles we will meet later are often also called
critical values.
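The two interpretations of qnorm, and its relationship to pnorm, can be illustrated with a short sketch. The probability 0.975 is chosen here only as an illustration (it gives the familiar value of about 1.96); the variable names are not from the course material.

```r
# Value with an area of 0.975 BELOW it (lower.tail = TRUE is the default)
z_from_lower <- qnorm(0.975)

# The same point described by the area ABOVE it
z_from_upper <- qnorm(0.025, lower.tail = FALSE)

# qnorm is the inverse of pnorm: feeding the quantile back into pnorm
# returns the probability we started with
p_back <- pnorm(z_from_lower)

# By symmetry, the value with an area of 0.025 below it is the negative
z_mirror <- qnorm(0.025)

print(round(c(z_from_lower = z_from_lower,
              z_from_upper = z_from_upper,
              z_mirror = z_mirror), 4))
print(p_back)
```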
Find the value Z above which 25% of the area lies.
qnorm(0.75)
or
qnorm(0.25,lower.tail=FALSE)
Z = 0.6745
Assume that heights of students enrolled in 100 level university papers
have a normal distribution with mean μ_X = 170 cm and standard deviation
σ_X = 10 cm.
Find the proportion of students with a height between 180 cm and 190 cm.
Find the percentage of students taller than 185 cm.
Percentage = 6.68%
Find the height which is exceeded by 10% of students.
Height = 182.82 cm
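One way to work the three height questions in R is sketched below, passing the mean and standard deviation directly to pnorm and qnorm. The variable names are illustrative.

```r
# Heights assumed Normal with mean 170 cm and standard deviation 10 cm
mu    <- 170
sigma <- 10

# Proportion between 180 cm and 190 cm: difference of two lower-tail areas
prop_180_190 <- pnorm(190, mean = mu, sd = sigma) -
                pnorm(180, mean = mu, sd = sigma)

# Percentage taller than 185 cm: upper-tail area, times 100
pct_over_185 <- 100 * pnorm(185, mean = mu, sd = sigma, lower.tail = FALSE)

# Height exceeded by 10% of students: quantile with area 0.10 above it
height_top_10 <- qnorm(0.10, mean = mu, sd = sigma, lower.tail = FALSE)

print(round(prop_180_190, 4))   # about 0.1359
print(round(pct_over_185, 2))   # 6.68, matching the notes
print(round(height_top_10, 2))  # 182.82, matching the notes
```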
Z-Score
Any Normal distribution value (say with mean μ_X and s.d. σ_X) can
be put on the Standard Normal scale.
This is called calculating the Z-Score, or Z-Value.
It's essentially transforming your Normal value into a Standard
Normal value.
Z-Value:
$$ Z = \frac{X - \mu_X}{\sigma_X} $$
Sometimes you may be required to calculate a Z-Value.
Other times you may choose to, in order to compare different Normal
distributions.
Demonstration: Probabilities from z-scores (salaries)
Normal Distribution Summary
Parameters: mean (μ) and standard deviation (σ).
The two parameters, μ and σ, completely define the normal
distribution, i.e. X ~ N(μ, σ²).
The probability Pr (a < X < b) is found using the area under the
curve between X = a and X = b.
The Standard Normal has mean μ = 0 and standard deviation
σ = 1, i.e. Z ~ N(0, 1).
To transform onto the Standard Normal scale, calculate the Z-Value:
$$ Z = \frac{X - \mu_X}{\sigma_X} $$
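The Z-Value transformation can be sketched in R as follows. The helper z_value is an illustrative name rather than a built-in function, the first example reuses the height distribution from the notes (mean 170 cm, s.d. 10 cm), and the two sets of test marks are hypothetical.

```r
# Illustrative helper implementing Z = (X - mean) / s.d.
z_value <- function(x, mu, sigma) (x - mu) / sigma

# A height of 185 cm on the Standard Normal scale
z_185 <- z_value(185, mu = 170, sigma = 10)

# The same upper-tail probability on the original and the Z scale
p_original <- pnorm(185, mean = 170, sd = 10, lower.tail = FALSE)
p_standard <- pnorm(z_185, lower.tail = FALSE)

# Hypothetical comparison across two different Normal distributions:
# a mark of 75 where marks have mean 60 and s.d. 10, versus
# a mark of 80 where marks have mean 70 and s.d. 8
z_test_a <- z_value(75, mu = 60, sigma = 10)
z_test_b <- z_value(80, mu = 70, sigma = 8)

print(c(z_185 = z_185, z_test_a = z_test_a, z_test_b = z_test_b))
print(c(p_original = p_original, p_standard = p_standard))
```

In the hypothetical comparison the mark of 75 has the larger Z-Value, so it is further above its own mean in standard deviation units even though the raw mark is lower.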

