Mathematical
Statistics
Alberto Brini
Office: MF4.???
Email: a.brini@tue.nl
Paulo Serra
Office: MF4.???
Email: p.j.serra@tue.nl
http://www.win.tue.nl/~rmcastro/2WS30
Before we start…
Setup of the course:
• Two weekly lectures
• Two weekly instructions/tutorials/advising
Student Assessment:
• Homework assignments (20%)
• Modeling project (20%)
• Final Exam (60%)
Prerequisites:
• Probability Theory (2WS20)
• A reasonable level of mathematical maturity
Important Topics from Prob. Theory
• Expectation of r.v.’s and functions of r.v.’s
• Conditional expectations
Before we start…
Study Materials:
• “Statistical Theory: A Concise Introduction”,
Abramovich and Ritov
• Others (see website)
What is Statistics?
Statistics is often associated only with polls, censuses, and
other “boring” stuff.
However, this is a very limiting view of statistics:
Probability AND Statistics
Probability and Statistics are NOT the same thing!!!
A Typical Example
The presidential elections in the United States work in a funny
way: in each state there is essentially a separate election.
Very important are the so-called “swing states”, for which it
is difficult to predict the outcome of the electoral process.
A Typical Dataset
To better understand the impact of smoking during pregnancy, a big
study was conducted in the USA. All the pregnancies under a
certain health cooperative (in San Francisco) were monitored
between 1960 and 1967, and figures like the mother's age,
smoking status, baby weight at birth, etc. were collected (a
total of 1236 valid entries).
27 33 28 36 23 25 33 23 25 30 27 32 23 36 30 38
25 33 33 43 22 27 25 30 23 27 (…)
Descriptive Statistics
Typically we can only say something sensible about data or a
dataset if we assume a statistical model for it. Nevertheless,
a good start is to summarize the contents of a dataset, or
represent them in a palatable way. This is also a key aspect of
Exploratory Data Analysis.
Our hope is that the sample is somewhat representative of the
entire population (mothers in San Francisco)…
Sample variance: always non-negative. Notice the units are squared !!!
In our example
This seems fishy. Actually, there are two entries in the data
that are 99. It turns out this value is not the age of the
mother, but rather indicates that her age was unknown. So we
must treat these two entries as missing values and remove them
before recomputing the summaries.
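As a minimal sketch of the point above, the snippet below computes the sample mean and sample variance with and without the entries coded as 99. It uses only the first few ages listed earlier; the two trailing 99s are appended purely for illustration, and the n − 1 divisor for the variance is an assumption about the intended definition.

```python
from statistics import mean, variance

# First few mother ages as listed above. The two trailing 99s are added
# here only for illustration; the full dataset is not reproduced.
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38, 99, 99]

MISSING_CODE = 99  # sentinel meaning "age unknown", as described in the text


def summarize(values):
    """Return (sample mean, sample variance using the n - 1 divisor)."""
    return mean(values), variance(values)


# Treat the sentinel as a missing value and drop it.
clean_ages = [a for a in ages if a != MISSING_CODE]

raw_mean, raw_var = summarize(ages)
clean_mean, clean_var = summarize(clean_ages)

print(f"with 99s:    mean={raw_mean:.2f}, variance={raw_var:.2f}")
print(f"without 99s: mean={clean_mean:.2f}, variance={clean_var:.2f}")
```

Both summaries drop noticeably once the 99s are removed, which is why a sentinel value hiding in the data can make the numbers look fishy.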
Other Numerical Summaries
There are many other numerical summaries that are important
(we’ll encounter these again, in the context of graphical
representations of data)
Sample Median and Percentiles
Definition: Sample Median
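A minimal sketch, assuming the usual definition: the sample median is the middle order statistic when the sample size is odd, and the average of the two middle order statistics when it is even. The helper name `sample_median` is hypothetical, and the "inclusive" interpolation used for the quartiles is only one of several conventions.

```python
from statistics import median, quantiles

# First few mother ages as listed earlier (illustrative subset only).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38]


def sample_median(values):
    """Middle order statistic (odd n) or mean of the two middle ones (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


m = sample_median(ages)

# Quartiles are the 25th, 50th and 75th percentiles; the middle one is the median.
quartiles = quantiles(ages, n=4, method="inclusive")

print("median:", m)
print("quartiles:", quartiles)
```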
Histograms – Choice of Binning
The choice of the number of bins is a tricky business…
[Figure: histograms of x (Frequency vs. x) with different numbers of bins; one is marked “Just right”!!!]
There are rules-of-thumb for the number of bins that most
software will use… You don’t need to worry too much (yet)...
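One such rule of thumb is Sturges' rule, which suggests ceil(log2(n)) + 1 bins for n observations. The sketch below is only an illustration of that rule with equal-width bins; the function names are hypothetical, and other rules give different answers.

```python
import math


def sturges_bins(n):
    """Sturges' rule of thumb: ceil(log2(n)) + 1 bins for n observations."""
    if n < 1:
        raise ValueError("need at least one observation")
    return math.ceil(math.log2(n)) + 1


def histogram_counts(values, k):
    """Count observations in k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # If all values coincide, everything goes in the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


# First few mother ages as listed earlier (illustrative subset only).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38]

k = sturges_bins(len(ages))
print("bins:", k, "counts:", histogram_counts(ages, k))
print("bins for 1236 entries:", sturges_bins(1236))
```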
Histograms
Actually, if the data can be viewed as independent samples
from some continuous distribution, the histogram (after proper
normalization) can be interpreted as an estimate of the true
underlying density function !!!
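The "proper normalization" can be sketched as follows: divide each bin count by n times the bin width, so that the bars enclose total area 1, as a density must. This is a minimal illustration with equal-width bins; the function name and the choice of 5 bins are assumptions.

```python
def density_histogram(values, k):
    """Return (bin_edges, heights) where heights = count / (n * bin_width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value is placed in the last bin.
        counts[min(int((v - lo) / width), k - 1)] += 1
    n = len(values)
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(k + 1)]
    return edges, heights


# First few mother ages as listed earlier (illustrative subset only).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38]

edges, heights = density_histogram(ages, k=5)
bin_width = edges[1] - edges[0]
area = sum(h * bin_width for h in heights)
print("heights:", heights)
print("total area:", area)
```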
[Figure: Histogram of y (Density vs. y)]
Baby weight: this histogram has a bell-like shape. Is it
reasonable to model baby weight as a sample from a normal
distribution?
Density Estimators
Histograms are actually a very crude density estimator. There
are much better alternatives, like kernel-based estimators
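As a minimal sketch of a kernel-based estimator, the snippet below places a Gaussian bump of width h on every observation and averages them. The Gaussian kernel and the bandwidth h = 2.0 are illustrative assumptions; in practice the bandwidth is chosen by a data-driven rule.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return f(x) = (1 / (n h)) * sum_i phi((x - x_i) / h), phi = N(0, 1) pdf."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


# First few mother ages as listed earlier (illustrative subset only).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38]

f_hat = gaussian_kde(ages, bandwidth=2.0)  # bandwidth chosen by hand here

# Crude Riemann sum over a wide grid to check the estimate integrates to ~1.
step = 0.05
total = sum(f_hat(i * step) * step for i in range(int(60 / step) + 1))
print("estimated density at age 29:", f_hat(29))
print("approximate integral:", total)
```

Unlike the histogram, the resulting estimate is smooth and does not depend on where bin edges happen to fall, although it does depend on the bandwidth.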
[Figure: kernel density estimate, density.default(x = y, n = 50000) (Density vs. y)]
Actually, box-plots are generally a bit more sophisticated…
Box and Whisker Plots
Box and whisker plots are usually presented using the following
rules:
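A minimal sketch of one common convention (Tukey's 1.5 × IQR rule), which is an assumption here and not necessarily the exact set of rules intended above: the box spans the first to third quartile with a line at the median, the whiskers reach the most extreme observations within 1.5 × IQR of the box, and anything beyond is drawn as an individual point.

```python
from statistics import quantiles


def box_plot_stats(values, whisker=1.5):
    """Five box-plot numbers plus outliers, using the 1.5 * IQR convention."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lo_fence = q1 - whisker * iqr
    hi_fence = q3 + whisker * iqr
    inside = [v for v in ordered if lo_fence <= v <= hi_fence]
    outliers = [v for v in ordered if v < lo_fence or v > hi_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest observation inside the fences
        "whisker_high": inside[-1],  # largest observation inside the fences
        "outliers": outliers,
    }


# First few mother ages as listed earlier, plus two illustrative 99 codes.
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38, 99, 99]

stats = box_plot_stats(ages)
print(stats)
```

Under this convention the two 99 entries show up as isolated points beyond the upper whisker, which makes such suspicious values easy to spot.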
[Figure: x plotted against Index, with a histogram of x (Frequency vs. x); r plotted against Index, with a histogram of r (Frequency vs. r)]
There seems to be much less of a
temporal trend in the returns, so
histograms and box-plots are
potentially useful representations
of the data…
The choice of Statistical Model is
already important for description
of the data!!!
Quantile-Quantile Plots
These are part of a general class of qualitative plots that are
meant to help you assess some properties of the data. Namely,
if the data can be reasonably modeled by independent samples
from some distribution…
Quantile-Quantile Plots
We can compare the order statistics to the values we would
expect under some distribution (e.g. a normal distribution).
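The comparison can be sketched as follows: sort the sample, pair the i-th order statistic with the standard normal quantile at probability (i − 0.5)/n, and check how close the pairs lie to a straight line. The plotting positions (i − 0.5)/n, the synthetic samples, and the use of a correlation coefficient as a rough "straightness" score are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist


def normal_qq_points(sample):
    """Pair each order statistic with the matching standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


def straightness(points):
    """Pearson correlation of the Q-Q points; near 1 means close to a line."""
    n = len(points)
    mx = sum(t for t, _ in points) / n
    my = sum(s for _, s in points) / n
    sxy = sum((t - mx) * (s - my) for t, s in points)
    sxx = sum((t - mx) ** 2 for t, _ in points)
    syy = sum((s - my) ** 2 for _, s in points)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
exp_sample = [rng.expovariate(1.0) for _ in range(500)]

corr_normal = straightness(normal_qq_points(normal_sample))
corr_exp = straightness(normal_qq_points(exp_sample))
print("normal sample:     ", corr_normal)
print("exponential sample:", corr_exp)
```

The Q-Q plot remains a qualitative tool: the score only summarizes what the eye would see, namely that the exponential sample bends away from the line while the normal sample does not.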
[Figure: Q-Q plots (Sample Quantiles vs. Theoretical Quantiles) alongside density plots of r]
Normal Quantile-Quantile Plots
Example: Synthetic data – exponential distribution
[Figure: Normal Q-Q Plot (vs. Theoretical Quantiles) and Histogram of r (Density vs. r)]
[Figure: Q-Q plots (vs. Theoretical Quantiles) and density of r; annotation: “Too many points away from the line”]
Quantile-Quantile Plots
[Figure: Normal Q-Q Plot (Sample Quantiles vs. Theoretical Quantiles)]
What’s Next
Now that we can summarize and represent data in nice ways we
would like to make meaningful statements about the population
that gave rise to this data.