
Probability Theory and

Statistics
With a view towards the natural sciences

Lecture notes

Niels Richard Hansen


Department of Mathematical Sciences
University of Copenhagen
November 2010
Preface

The present lecture notes have been developed over the last couple of years for a
course aimed primarily at students taking the Master's programme in bioinformatics at the
University of Copenhagen. There is an increasing demand for a general introductory
statistics course at the Master’s level at the university, and the course has also
become a compulsory course for the Master's programme in eScience. Both programmes
emphasize a computational and data-oriented approach to science, in particular
the natural sciences.
The aim of the notes is to combine the mathematical and theoretical underpinning
of statistics and statistical data analysis with computational methodology and prac-
tical applications. Hopefully the notes pave the way for an understanding of the
foundation of data analysis with a focus on the probabilistic model and the method-
ology that we can develop from this point of view. In a single course there is no
hope that we can present all models and all relevant methods that the students will
need in the future, and for this reason we develop general ideas so that new models
and methods can be more easily approached by students after the course. On the
other hand, we cannot develop the theory without a number of good examples to
illustrate its use. Due to the history of the course, most examples in the notes are
biological in nature, but they span a range of different areas, from molecular biology
and biological sequence analysis through molecular evolution and genetics to
toxicology and various assay procedures.
Students who take the course are expected to become users of statistical methodology
in a subject matter field and potentially also developers of models and methodology
in such a field. It is therefore intentional that we focus on the fundamental principles
and develop these principles, which are by nature mathematical. Advanced mathematics
is, however, kept out of the main text. Instead, a number of math boxes can
be found in the notes, in which relevant but mathematically more sophisticated
issues are treated. The main text does not depend on results developed in


the math boxes, but the interested and capable reader may find them illuminating.
The formal mathematical prerequisites for reading the notes are a standard calculus
course in addition to a few useful mathematical facts collected in an appendix. The
reader who is not so accustomed to the symbolic language of mathematics may,
however, find the material challenging to begin with.
To fully benefit from the notes it is also necessary to obtain and install the statisti-
cal computing environment R. It is evident that almost all applications of statistics
today require the use of computers for computations and very often also simula-
tions. The program R is a free, full-fledged programming language and should be
regarded as such. Previous experience with programming is thus beneficial but not
necessary. R is a language developed for statistical data analysis and it comes with
a huge number of packages, which makes it a convenient framework for handling
most standard statistical analyses, for implementing novel statistical procedures, and
for doing simulation studies; last but not least, it does a fairly good job of producing
high-quality graphics.
We all have to crawl before we can walk – let alone run. We begin the notes with
the simplest models but develop a sustainable theory that can embrace the more
advanced ones too.
Last, but not least, I owe special thanks to Jessica Kasza for detailed comments on
an earlier version of the notes and for correcting a number of grammatical mistakes.

November 2010
Niels Richard Hansen
Contents

1 Introduction 1
1.1 Notion of probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Statistics and statistical models . . . . . . . . . . . . . . . . . . . . . 4

2 Probability Theory 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Sample spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Probability measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Probability measures on discrete sets . . . . . . . . . . . . . . . . . . 21
2.5 Descriptive methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.1 Mean and variance . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Probability measures on the real line . . . . . . . . . . . . . . . . . . 32
2.7 Descriptive methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.7.1 Histograms and kernel density estimation . . . . . . . . . . . 43
2.7.2 Mean and variance . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7.3 Quantiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.8 Conditional probabilities and independence . . . . . . . . . . . . . . 59
2.9 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.9.1 Transformations of random variables . . . . . . . . . . . . . . 63
2.10 Joint distributions, conditional distributions and independence . . . 70


2.10.1 Random variables and independence . . . . . . . . . . . . . . 70


2.10.2 Random variables and conditional distributions . . . . . . . . 75
2.10.3 Transformations of independent variables . . . . . . . . . . . 81
2.11 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
2.12 Local alignment - a case study . . . . . . . . . . . . . . . . . . . . . 91
2.13 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . . . 97
2.13.1 Conditional distributions and conditional densities . . . . . . 106
2.14 Descriptive methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
2.15 Transition probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 111

3 Statistical models and inference 117


3.1 Statistical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.2 Classical sampling distributions . . . . . . . . . . . . . . . . . . . . . 127
3.3 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.3.1 Parametric Statistical Models . . . . . . . . . . . . . . . . . . 131
3.3.2 Estimators and Estimates . . . . . . . . . . . . . . . . . . . . 132
3.3.3 Maximum Likelihood Estimation . . . . . . . . . . . . . . . . 136
3.4 Hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.4.1 Two sample t-test . . . . . . . . . . . . . . . . . . . . . . . . 163
3.4.2 Likelihood ratio tests . . . . . . . . . . . . . . . . . . . . . . . 169
3.4.3 Multiple testing . . . . . . . . . . . . . . . . . . . . . . . . . . 172
3.5 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
3.5.1 Parameters of interest . . . . . . . . . . . . . . . . . . . . . . 181
3.6 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
3.6.1 Ordinary linear regression . . . . . . . . . . . . . . . . . . . . 190
3.6.2 Non-linear regression . . . . . . . . . . . . . . . . . . . . . . . 204
3.7 Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
3.7.1 The empirical measure and non-parametric bootstrapping . . 214
3.7.2 The percentile method . . . . . . . . . . . . . . . . . . . . . . 216

4 Mean and Variance 219



4.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219


4.1.1 The empirical mean . . . . . . . . . . . . . . . . . . . . . . . 224
4.2 More on expectations . . . . . . . . . . . . . . . . . . . . . . . . . . 225
4.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
4.4 Multivariate Distributions . . . . . . . . . . . . . . . . . . . . . . . . 234
4.5 Properties of the Empirical Approximations . . . . . . . . . . . . . . 239
4.6 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . 245
4.7 Asymptotic Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
4.7.1 MLE and Asymptotic Theory . . . . . . . . . . . . . . . . . . 256
4.8 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

A R 267
A.1 Obtaining and running R . . . . . . . . . . . . . . . . . . . . . . . . 267
A.2 Manuals, FAQs and online help . . . . . . . . . . . . . . . . . . . . . 268
A.3 The R language, functions and scripts . . . . . . . . . . . . . . . . . 269
A.3.1 Functions, expression evaluation, and objects . . . . . . . . . 269
A.3.2 Writing functions and scripts . . . . . . . . . . . . . . . . . . 270
A.4 Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
A.5 Packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
A.5.1 Bioconductor . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
A.6 Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
A.7 Other resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

B Mathematics 277
B.1 Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
B.2 Combinatorics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
B.3 Limits and infinite sums . . . . . . . . . . . . . . . . . . . . . . . . . 279
B.4 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
B.4.1 Gamma and beta integrals . . . . . . . . . . . . . . . . . . . 282
B.4.2 Multiple integrals . . . . . . . . . . . . . . . . . . . . . . . . . 283
1 Introduction

1.1 Notion of probabilities

Flipping coins and throwing dice are two commonly occurring examples in an in-
troductory course on probability theory and statistics. They represent archetypical
experiments where the outcome is uncertain – no matter how many times we roll
the dice we are unable to predict the outcome of the next roll. We use probabilities
to describe the uncertainty; a fair, classical die has probability 1/6 for each side to
turn up. Elementary probability computations can to some extent be handled based
on intuition, common sense and high school mathematics. In the popular dice game
Yahtzee the probability of getting a Yahtzee (five of a kind) in a single throw is for
instance
6 · (1/6)^5 = 1/6^4 = 0.0007716.
The argument for this and many similar computations is based on the pseudo theorem
that the probability for any event equals
(number of favourable outcomes) / (number of possible outcomes).
Getting a Yahtzee consists of the six favourable outcomes with all five dice facing the
same side upwards. We call the formula above a pseudo theorem because, as we will
show in Section 2.4, it is only the correct way of assigning probabilities to events
under a very special assumption about the probabilities describing our experiment.
The special assumption is that all outcomes are equally probable – something we
tend to believe if we don’t know any better, or can see no way that one outcome
should be more likely than others.
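The pseudo theorem is easy to check in R by brute-force enumeration. The sketch below counts the favourable outcomes for a single-throw Yahtzee among all 6^5 equally probable outcomes; the use of `expand.grid` is just one convenient way to list them all.

```r
# Enumerate all 6^5 = 7776 equally probable outcomes of throwing five dice.
outcomes <- expand.grid(rep(list(1:6), 5))

# An outcome is favourable if all five dice show the same side.
yahtzee <- apply(outcomes, 1, function(x) all(x == x[1]))

sum(yahtzee)                   # 6 favourable outcomes
sum(yahtzee) / nrow(outcomes)  # 0.0007716049, matching the formula above
```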
However, without some training most people will either get it wrong or have to give
up if they try computing the probability of anything except the most elementary


events – even when the pseudo theorem applies. There exist numerous tricky prob-
ability questions where intuition somehow breaks down and wrong conclusions can
be drawn if one is not extremely careful. A good challenge could be to compute the
probability of getting a Yahtzee in three throws under the usual rules, provided that
we always hold as many equal dice as possible.
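An exact answer requires careful bookkeeping, but a Monte Carlo estimate is easy to obtain in R. The sketch below implements the hold strategy just described, keeping the largest group of equal dice and rethrowing the rest; the number of replications and the seed are arbitrary choices for illustration.

```r
# Simulate one game: up to three throws of five dice, always holding
# as many equal dice as possible, and report whether a Yahtzee occurred.
simulate_yahtzee <- function() {
  dice <- sample(1:6, 5, replace = TRUE)
  for (throw in 2:3) {
    counts <- tabulate(dice, nbins = 6)
    if (max(counts) == 5) return(TRUE)  # Yahtzee already obtained
    side <- which.max(counts)           # the side we hold between throws
    dice <- c(rep(side, max(counts)),
              sample(1:6, 5 - max(counts), replace = TRUE))
  }
  all(dice == dice[1])
}

set.seed(1)
mean(replicate(100000, simulate_yahtzee()))  # estimate, roughly 0.046
```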
Figure 1.1: The relative frequency of times that the first dice sequence comes out
before the second, as a function of the number of times the dice game has been
played. [Plot omitted: relative frequency (0.3 to 0.7) on the vertical axis against
number of games (0 to 500) on the horizontal axis.]

The Yahtzee problem can in principle be solved by counting: simply write down all
combinations and count the numbers of favourable and possible combinations. Then
the pseudo theorem applies. It would be a tedious task, but it is in principle possible.
In many cases it is, however, impossible to rely on counting – even in principle. As
an example we consider a simple dice game with two participants: First I choose
a sequence of three dice throws, , say, and then you choose , say. We
throw the dice until one of the two sequences comes out, and I win if comes
out first and otherwise you win. If the outcome is

then I win. It is natural to ask with what probability you will win this game. In
addition, it is clearly a quite boring game, since we have to throw a lot of dice and
simply wait for one of the two sequences to occur. Another question could therefore
be how boring the game actually is. Can we, for instance, compute the probability for
