Curso

2WS30/39
Mathematical
Statistics
2WS30 - Introduction and E.D.A./Descriptive Statistics

Before we start…
Lecturers/Instructors:
Rui M. Castro
Office: MF4.075
Phone: (040 247) 2499
Email: rmcastro@tue.nl
Alessandro Di Bucchianico
Office: MF7.097a
Email: a.d.bucchianico@tue.nl
Alberto Brini
Office: MF4.???
Email: a.brini@tue.nl
Paulo Serra
Office: MF4.???
Email: p.j.serra@tue.nl
http://www.win.tue.nl/~rmcastro/2WS30
I.2
Before we start…
Setup of the course:
•  Two weekly lectures
•  Two weekly instructions/tutorials/advising
Student Assessment:
•  Homework assignments (20%)
•  Modeling project (20%)
•  Final Exam (60%)
Prerequisites:
•  Probability Theory (2WS20)
•  A reasonable level of mathematical maturity
I.3
Important Topics from Prob. Theory
•  Expectation of r.v.’s and functions of r.v.’s
•  Computing the distribution of a function of random variables using the Probability Integral Transform
•  The properties of the sum of random variables
•  Law of large numbers
•  Central limit theorem
•  Convergence of random variables
•  Conditional expectations
I.4
Before we start…
Study Materials:
•  “Statistical Theory: A Concise Introduction”,
Abramovich and Ritov
•  Others (see website)
Announcements and other course materials:
•  I’ll post everything on the course webpage
•  I’ll send emails when necessary
I ASSUME YOU ARE ALL REGISTERED FOR THE COURSE!
(you won’t receive any announcements otherwise)

I.5
What is Statistics?
According to the Encyclopedia Britannica:
“Statistics is the art and science of gathering, analyzing, and
making inferences from data”
In his book “Statistical Models”, A. C. Davison answers the question in a much more thorough way:
“Statistics concerns what can be learned from data. Applied
statistics comprises a body of methods for data collection
and analysis across the whole range of science, and in areas
such as engineering, medicine, business, and law - wherever
variable data must be summarized, or used to test or confirm
theories, or to inform decisions. Theoretical statistics
underpins this by providing a framework for understanding
the properties and scope of methods used in applications.”
I.6
What is Statistics?
Statistics is often associated only with polls, censuses, and other “boring” stuff.
However, this is a very limiting view of statistics.
I.7
Probability AND Statistics
Probability and Statistics are NOT the same thing!!!
•  Probability provides the foundation of statistics
•  Statistics is concerned with testing hypotheses/making inferences about the “world” by using data (assumed to be collected according to some probabilistic model)
[Diagram: Population (World); Probabilistic Model (models how data is created); Sample (small part of the population)]
Statistics: inference about the population/world from the sample
I.8
In this Course
•  Emphasis on the theoretical underpinnings and foundations of statistical inference.
•  In the modeling/homework assignments you will also encounter other aspects of statistics, such as the gathering, description and summarization of data.
•  Very importantly, you’ll encounter the issues related to the choice of a “good” statistical model.
I.9
A Typical Example
The presidential elections in the United States work in a funny way: in each state there is essentially a separate election. Very important are the so-called “swing states”, for which it is difficult to predict the outcome of the electoral process.
In the 2012 election Wisconsin appeared to be such a swing state. A phone survey (July 25, 2012) of 480 likely voters yielded the following data: 248 individuals indicated they would vote for Barack Obama; 232 individuals indicated they would vote for Mitt Romney.
What predictions can be made about the outcome of the Wisconsin election (if it were to take place on that same day)?
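One simple way to start answering this question is sketched below. It assumes (this is a modeling assumption, not something stated in the example) that the 480 answers behave like independent draws with the same unknown probability p of favoring Obama, and it uses the central limit theorem to get an approximate 95% interval for p. The variable names are illustrative.

```python
import math

# Survey counts taken from the example above.
n_obama = 248
n_romney = 232
n = n_obama + n_romney  # 480 respondents

# Point estimate of the proportion of Obama voters.
p_hat = n_obama / n

# Standard error under the assumed model: n independent Bernoulli(p) answers.
std_err = math.sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% interval from the central limit theorem (z = 1.96).
z = 1.96
ci_low = p_hat - z * std_err
ci_high = p_hat + z * std_err

print(f"estimated proportion: {p_hat:.3f}")
print(f"approximate 95% interval: ({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions the interval contains 0.5, so the survey alone does not clearly favor either candidate.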
The data in this example is loosely based on a poll, as described in http://www.rasmussenreports.com/public_content/politics/elections/election_2012/election_2012_presidential_election/wisconsin/election_2012_wisconsin_president.
I.10
German Tanks
During World War II it was important for the Allies to assess the number of German tanks and V2 rockets that the Germans were able to produce in a certain period of time.
A lot of money was spent on intelligence to do so. However, the most successful and accurate approach was based on a relatively simple statistical method (and some naivety on the part of the Germans):
Each German tank that was captured had serial numbers on various parts (e.g., the engine block). As the name indicates, these were serial, essentially ranging from 1 to N. Assuming simplistically that each produced tank is equally likely to be captured gives a possible way to estimate N.
I.11
German Tanks!
A Concrete instance:
During a certain period six German tanks were captured, with
serial numbers 17, 68, 94, 127, 135, 212. Then a good estimate
for N is given by
Date          Estimate   True value   Intelligence estimate
June 1940     169        122          1000
June 1941     244        271          1550
August 1942   327        342          1550
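For the six captured serial numbers of the concrete instance, an estimate can be computed as in the sketch below. It assumes the commonly used estimator m(1 + 1/k) - 1, where m is the largest observed serial number and k is the number of captured tanks; the slides may have used a different formula, and the function name is hypothetical.

```python
def estimate_total(serials):
    """Estimate N from serial numbers sampled without replacement from 1..N.

    Assumed estimator (not confirmed by the slides): m * (1 + 1/k) - 1,
    where m is the largest observed serial number and k the sample size.
    """
    k = len(serials)
    m = max(serials)
    return m * (1 + 1 / k) - 1


captured = [17, 68, 94, 127, 135, 212]  # serial numbers from the example
n_hat = estimate_total(captured)
print(f"estimated number of tanks: {n_hat:.1f}")
```

The idea is that the largest serial number alone underestimates N, so it is pushed up by roughly the average gap between observed serial numbers.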
I.12
Biology and Estimation of Missing Mass!
Suppose you are working with biologists studying the ecosystem of a certain lake. They would like to know how many species of fish inhabit the lake. They set several (fish-friendly) nets in different places and record the following catch:
You later go fishing on the lake. What is the probability you’ll encounter a species you haven’t seen before?
The Good-Turing estimator of this quantity is 2/12 ≈ 0.167
I.13
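The Good-Turing estimate of the missing mass is the number of species seen exactly once divided by the total number of fish caught. The sketch below uses a HYPOTHETICAL catch, invented only so that it has 12 fish and two species seen once; it is not the catch recorded on the slide.

```python
from collections import Counter

# HYPOTHETICAL catch, invented for illustration only (not the slide's data):
# 12 fish in total, two species seen exactly once.
catch = (["perch"] * 5 + ["pike"] * 3 + ["carp"] * 2
         + ["eel"] + ["trout"])

species_counts = Counter(catch)

# Number of species observed exactly once ("singletons").
n_singletons = sum(1 for c in species_counts.values() if c == 1)

# Good-Turing estimate of the probability that the next fish caught
# belongs to a species not seen so far.
missing_mass = n_singletons / len(catch)
print(f"{n_singletons}/{len(catch)} = {missing_mass:.3f}")
```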
What is Data?
Definition: Data and Dataset
This seems a bit vague… For our purposes:
Data is a collection of numerical or categorical observations of a certain process (physical, biological, social, etc.).
Depending on the questions one wants to answer, the order of the data might be important (e.g. AEX over time); other times it is irrelevant (e.g. exam grades of 2WS30 ordered by student last name).
I.14
A Typical Dataset
To better understand the impact of smoking during pregnancy, a big study was conducted in the USA. All the pregnancies under a certain health cooperative (in San Francisco) were monitored between 1960 and 1967, and figures like the mother’s age, smoking status, baby weight at birth, etc. were collected (a total of 1236 valid entries).
For instance, this is a list of the mothers’ ages (in years):
27 33 28 36 23 25 33 23 25 30 27 32 23 36 30 38
25 33 33 43 22 27 25 30 23 27 (…)
We wish to make “meaningful” statements about mothers in San Francisco, but using only this sample…
I.15
Descriptive Statistics
Typically we can only say something sensible about data or a dataset if we assume a statistical model for it. Nevertheless, a good start is to summarize the contents of a dataset, or represent them in a palatable way. This is also a key aspect of Exploratory Data Analysis.
This is the goal of Descriptive Statistics, which are either numerical or graphical summaries and representations of data.
In what follows we will concentrate mostly on scenarios where the ordering of the elements in the dataset is not considered important. E.g.:
•  Exam grades of 2WS30
•  Customer satisfaction ratings of a store
•  Number of rotten apples in each crate of apples from a certain producer (order of the crates doesn’t matter)
I.16
A Typical Dataset
[Diagram: Population (mothers in San Francisco); Sample (a small number of mothers in San Francisco)]
Our hope is that the sample is somewhat representative of the entire population…
Before trying to do this, let’s see if we can “understand” the data a bit better, and summarize it in nice ways…
I.17
Numerical Summaries – Sample Mean
Often it is good to have an idea of where the data values are hovering. There are a number of natural ways to quantify this:
Definition: Sample Mean/Sample Average
For the dataset of the previous slides we have
Clearly this is good information to have, but it would be good to know if the mother’s age is always close to this, or differs wildly…
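As a small illustration, the sample mean (the sum of the observations divided by their number, (x_1 + … + x_n)/n) can be computed as below. The sketch uses only the 26 ages listed earlier, which are an excerpt of the 1236 entries, so its result is not the value for the full dataset.

```python
import statistics

# The 26 mother's ages listed earlier (only an excerpt of the 1236 entries).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

# Sample mean: sum of the observations divided by their number.
n = len(ages)
sample_mean = sum(ages) / n

# The standard library gives the same value.
assert abs(sample_mean - statistics.mean(ages)) < 1e-12

print(f"n = {n}, sample mean = {sample_mean:.2f}")
```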
I.18
Sample Variance/Standard Deviation
Definition: Sample Variance/Standard Deviation
In our example
Notice the units are squared!!!
The sample standard deviation is given by
An intuitive interpretation of what the sample standard deviation represents is not so easy, but we can still understand why it does measure variability:
I.19
Sample Variance/Standard Deviation
Properties: Sample Variance/Standard Deviation
(always non-negative)
The last expression typically makes hand computations easier, but numerically it can be a very bad choice…
I.20
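The remark about numerical behavior can be checked with a small experiment. The sketch below assumes the usual convention of dividing by n - 1 (the slides' own formula is not reproduced here, so this is an assumption) and compares the definition with the algebraically equal shortcut (sum of squares minus n times the squared mean). The function names are illustrative, and the data are the 26 listed ages.

```python
import math

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]


def variance_two_pass(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """Algebraically equal shortcut: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)


s2 = variance_two_pass(ages)
s = math.sqrt(s2)  # standard deviation, back in the original units (years)
print(f"variance = {s2:.2f} years^2, standard deviation = {s:.2f} years")

# Shifting the data does not change the variance, but with a huge offset the
# shortcut subtracts two nearly equal large numbers and loses precision.
shifted = [x + 1e9 for x in ages]
print("two-pass on shifted data:", variance_two_pass(shifted))
print("shortcut on shifted data:", variance_shortcut(shifted))
```

On the original ages both functions agree; on the shifted data only the two-pass version stays close to the true value.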

The Sample Range
Another way to assess variability:
Definition: Sample Range
In our example
This seems fishy. Actually, there are two entries in the data that are 99. It turns out this value is not the age of the mother, but rather indicates that her age was unknown. So we must treat these two entries as missing values. Removing these you’ll get
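The effect of the two missing-value codes on the sample range (largest observation minus smallest observation) can be mimicked as follows. This is only a sketch: it uses the 26 listed ages with two entries of 99 appended by hand, not the full dataset, so the printed values are not the ones from the slide.

```python
MISSING_CODE = 99  # value used in the data to mark an unknown age

# The listed excerpt of ages plus two entries coded as 99, appended by hand
# to mimic the situation described above.
ages_raw = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
            36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27, 99, 99]


def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


# Treat the coded entries as missing values and drop them.
ages_clean = [x for x in ages_raw if x != MISSING_CODE]

print("range with the 99s kept:   ", sample_range(ages_raw))
print("range with the 99s removed:", sample_range(ages_clean))
```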
I.21
Other Numerical Summaries
There are many other numerical summaries that are important
(we’ll encounter these again, in the context of graphical
representations of data)
Definition: Order Statistics
I.22
Sample Median and Percentiles
Definition: Sample Median
This is essentially the value that “splits” the dataset in two: approximately half of the data is below the median and half is above the median. More generally, we can define
Definition: Sample Percentiles
The calculation of sample percentiles is not done the same way everywhere, and most statistical packages use a definition that involves interpolation (like the median above).
I.23
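A minimal sketch of these summaries on the 26 listed ages: the order statistics are the sorted observations, and the median is taken as the middle order statistic, or the average of the two middle ones when n is even (the usual convention, assumed here). The quartiles come from Python's statistics.quantiles with its default interpolation rule, which is just one of several conventions in use.

```python
import statistics

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

# Order statistics: the observations sorted from smallest to largest.
ordered = sorted(ages)
n = len(ordered)

# Sample median: middle order statistic (n odd) or the average of the
# two middle order statistics (n even).
if n % 2 == 1:
    median = ordered[n // 2]
else:
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Quartiles (25th, 50th and 75th percentiles) with the library's default
# interpolation rule; other packages may return slightly different values.
quartiles = statistics.quantiles(ages, n=4)

print("smallest and largest order statistics:", ordered[0], ordered[-1])
print("median:", median)
print("quartiles:", quartiles)
```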
Sample Median and Percentiles
For our dataset the median is 26. This value does not change if we remove the two entries valued 99.
The sample median is a measure of location that is robust to outliers, unlike the sample mean.
However, the median also seems to discard a lot of information in comparison with the sample mean. A compromise between the two is the trimmed mean.
Definition: 10% Trimmed Mean
In our example
I.24
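A trimmed mean drops a fixed fraction of the smallest and of the largest observations before averaging. The sketch below assumes that the number dropped at each end is 10% of n rounded down (conventions differ between packages), and applies it to the 26 listed ages with two 99 entries appended by hand, so the numbers are illustrative only.

```python
def trimmed_mean(xs, proportion=0.10):
    """Mean after dropping the smallest and largest `proportion` of the data.

    Assumed convention: the number dropped from each end is rounded down.
    """
    ordered = sorted(xs)
    k = int(len(ordered) * proportion)  # observations dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Listed excerpt of ages plus two missing-value codes (99) appended by hand.
ages_raw = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
            36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27, 99, 99]

plain_mean = sum(ages_raw) / len(ages_raw)
print(f"plain mean:       {plain_mean:.2f}")
print(f"10% trimmed mean: {trimmed_mean(ages_raw):.2f}")
```

Here the two 99 entries fall in the trimmed upper tail, so they no longer pull the average upwards.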

Graphical Representations
Especially for large datasets, graphical representations are often much more (qualitatively) informative than numerical summaries. Perhaps the simplest graphical representation is the scatter plot (baby weight, in grams):
[Figure: scatter plot of baby weight, horizontal axis from 1500 to 5000 grams]
It is sometimes convenient to jitter the abscissas of the points, so it is easier to see what’s going on…
[Figure: jittered scatter plot of baby weight, horizontal axis from 1500 to 5000 grams]

I.25
Histograms
Scatter plots are still a bit difficult to read; a way we can get a better view is by aggregating data into bins.
[Figure: scatter plot of baby weight (1500 to 5000 grams) and the corresponding histogram of x, vertical axis: Frequency]
I.26
Histograms – Choice of Binning
The choice of the number of bins is a tricky business…
[Figure: three histograms of x (vertical axis: Frequency) with different numbers of bins, labelled “Too few !!!”, “Just right !!!” and “Too many !!!”]
There are rules of thumb for the number of bins that most software will use… You don’t need to worry too much (yet)...
I.27
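To see how the number of bins changes the picture, the sketch below counts the 26 listed ages in equal-width bins for a few choices. One well-known rule of thumb is Sturges' rule, ceil(log2(n)) + 1 bins; it is used here only as an example, and the slides do not say which rule a given software package applies.

```python
import math

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]


def histogram_counts(xs, n_bins):
    """Count observations in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        # The maximum falls in the last bin instead of opening a new one.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Sturges' rule of thumb: ceil(log2(n)) + 1 bins.
n_bins = math.ceil(math.log2(len(ages))) + 1

for k in (2, n_bins, 20):
    print(f"{k:2d} bins:", histogram_counts(ages, k))
```

With very few bins almost all structure disappears, and with very many bins most bins hold zero or one observation.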
Histograms
Actually, if the data can be viewed as independent samples from some continuous distribution, the histogram (after proper normalization) can be interpreted as an estimate of the true underlying density function !!!
[Figure: histogram of y, vertical axis: Density, horizontal axis from 1500 to 5000]
Baby weight: this histogram has a bell-like shape. Is it reasonable to model baby weight as a sample from a normal distribution?
I.28
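The "proper normalization" mentioned above can be sketched as follows: each bin count is divided by n times the bin width, so that the bars have total area 1, as a density should. The function name and the choice of 6 bins are illustrative, and the data are the 26 listed ages rather than the baby weights.

```python
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]


def density_histogram(xs, n_bins):
    """Return (bar heights, bin width) of a histogram with total area 1."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(xs)
    # Height = count / (n * width), so that height * width = fraction in bin.
    heights = [c / (n * width) for c in counts]
    return heights, width


heights, width = density_histogram(ages, n_bins=6)
total_area = sum(h * width for h in heights)
print("bar heights:", [round(h, 4) for h in heights])
print("total area: ", total_area)
```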
Density Estimators
Histograms are actually a very crude density estimator. There are much better alternatives, like kernel-based estimators.
[Figure: kernel density estimate, density.default(x = y, n = 50000), N = 1236, Bandwidth = 102, vertical axis: Density]
The principle behind all these estimators is still the same: locally averaging data. However, these can be much more accurate than the histogram.
I.29
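A minimal sketch of a kernel-based estimator, assuming a Gaussian kernel and a bandwidth picked by hand (both are assumptions; the figure's estimate was produced by other software with its own bandwidth choice). The estimate at a point t averages Gaussian bumps centered at the observations, which is the "local averaging" idea.

```python
import math


def gaussian_kde(xs, bandwidth):
    """Return f_hat(t) = (1 / (n h)) * sum_i phi((t - x_i) / h)."""
    n = len(xs)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def f_hat(t):
        return norm * sum(math.exp(-0.5 * ((t - x) / bandwidth) ** 2)
                          for x in xs)

    return f_hat


ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

f_hat = gaussian_kde(ages, bandwidth=2.0)  # bandwidth chosen by hand

# Riemann-sum check that the estimate integrates to about 1 over 0..80 years.
step = 0.05
grid = [i * step for i in range(1600)]
area = sum(f_hat(t) * step for t in grid)

print(f"estimated density at age 27: {f_hat(27):.4f}")
print(f"estimated density at age 60: {f_hat(60):.6f}")
print(f"approximate total area:      {area:.4f}")
```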
Box and Whisker Plots
These are funny-looking plots that give a nice graphical representation of the (mother’s age) data…
[Figure: box plot of mother’s age, horizontal axis from 15 to 45, with the following annotations]
Smallest Value
First Quartile: 25% of the data is below this
Sample Median: half the data is below this
Third Quartile: 75% of the data is below this
Largest Value
Actually, box plots are generally a bit more sophisticated…
I.30
Box and Whisker Plots
Box and whisker plots are usually presented using the following rules:
[Figure: box plot of mother’s age, horizontal axis from 15 to 45, with the following annotations]
First Quartile, Sample Median, Third Quartile
InterQuartile Range (IQR)
Whisker extends to the smallest point within 1.5 IQR
Whisker extends to the largest point within 1.5 IQR
Outliers
Why 1.5?
These plots are easy to understand, and are therefore quite useful. We can even compare different datasets easily…
I.31
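The rules above can be turned into numbers as in the sketch below. It assumes the common convention that the whiskers reach the most extreme points within 1.5 IQR of the first and third quartiles, and that anything further out is drawn as an outlier. The quartiles use Python's default interpolation rule, and the data are the 26 listed ages with two 99 entries appended by hand.

```python
import statistics


def boxplot_summary(xs):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR rule."""
    q1, med, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside),    # smallest point within the fences
        "whisker_high": max(inside),   # largest point within the fences
        "outliers": sorted(x for x in xs if x < low_fence or x > high_fence),
    }


# Listed excerpt of ages plus two missing-value codes (99) appended by hand.
ages_raw = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
            36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27, 99, 99]

summary = boxplot_summary(ages_raw)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this illustration the two 99 entries end up flagged as outliers, while the whiskers stop at the smallest and largest genuine ages.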
Multiple Box Plots
Birth-weight data comparison between smoking and non-smoking mothers:
[Figure: comparative box plots of birth weight for smoking and non-smoking mothers, horizontal axis from 1500 to 5000]
Does this mean smoking is bad for you???

I.32
Time-Sequence Plots
Sometimes the order of the data matters !!! Example: PSI20 financial data over the period 01/07/2011 - 29/06/2012
[Figure: PSI20 index x plotted against observation index (0 to 250), together with a histogram of x (vertical axis: Frequency) and a box plot of x]
There is clearly a temporal trend that is completely ignored in the histogram or box plot representations!!! The order of the data really matters…
I.33
PSI20 Example
We can, however, look at the daily returns instead…
[Figure: PSI20 returns r plotted against observation index (0 to 250), together with a histogram of r (vertical axis: Frequency) and a box plot of r]
There seems to be much less of a temporal trend in the returns, so histograms and box plots are potentially useful representations of the data…
The choice of Statistical Model is already important for the description of the data!!!
I.34
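Going from index values to daily returns can be sketched as follows. It assumes the simple return, i.e. the relative change from one day to the next (the slides may instead use log returns), and the prices are HYPOTHETICAL values invented for illustration, not PSI20 data.

```python
# HYPOTHETICAL closing prices, invented for illustration (not PSI20 data).
prices = [7400.0, 7350.0, 7380.0, 7200.0, 7250.0, 7100.0]

# Simple daily return: relative change from one day to the next.
returns = [(today - yesterday) / yesterday
           for yesterday, today in zip(prices, prices[1:])]

for day, r in enumerate(returns, start=1):
    print(f"day {day}: return = {r:+.4f}")
```

A series of n prices gives n - 1 returns, and a steady drift in the price level shows up in the returns only as a small shift of their average.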
Quantile-Quantile Plots
These are part of a general class of qualitative plots that are meant to help you assess some properties of the data, namely whether the data can be reasonably modeled by independent samples from some distribution…
Let’s recall some concepts from a few slides back:
Definition: Order Statistics
I.35
We can compare the order statistics to the values we would expect for some distribution (e.g. a normal distribution).
So this gives an easy visual way to check if assuming normality is somewhat reasonable…
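The comparison described above can be sketched by pairing each order statistic with a standard normal quantile. The plotting positions (i - 0.5)/n used below are an assumption (several conventions exist and software packages differ), and the data are the 26 listed ages.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(xs):
    """Pair each order statistic with a standard normal quantile.

    Assumed plotting positions: (i - 0.5) / n for i = 1, ..., n.
    """
    ordered = sorted(xs)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23,
        36, 30, 38, 25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

points = normal_qq_points(ages)

# If the data were roughly normal, the points would lie near the line
# sample = mean + stdev * theoretical.
m, s = mean(ages), stdev(ages)
for t, x in points[:3]:
    print(f"theoretical {t:+.2f}  sample {x}  line {m + s * t:.1f}")
```

Plotting the pairs, with theoretical quantiles on the horizontal axis and sample quantiles on the vertical axis, gives the normal Q-Q plot.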
I.36
Normal Quantile-Quantile Plots
Example: PSI20 Daily returns
[Figure: normal Q-Q plot (sample quantiles vs. theoretical quantiles) and density histogram of r]
Daily returns seem to be reasonably modeled by a normal distribution !!!
I.37
Example: Synthetic data – normal distribution (sanity check)

[Figure: normal Q-Q plot (sample quantiles vs. theoretical quantiles) and density histogram of the synthetic normal data]
I.38
Example: Synthetic data – exponential distribution
[Figure: normal Q-Q plot (sample quantiles vs. theoretical quantiles) and density histogram of the synthetic exponential data, annotated “Too many points away from the line”]
Everything seems to make sense here…

I.39
Summary: normal QQ plots give us a qualitative way to check
if data can be reasonably modeled by a normal distribution.
If most points lie approximately on a straight line then the
normal modeling assumption might be reasonable – otherwise it
is doubtful.
[Figure: normal Q-Q plots (sample quantiles vs. theoretical quantiles) for the PSI20 daily returns and for birth weight, with a histogram of r; one panel is annotated “Too many points away from the line”]
I.40
[Figure: normal Q-Q plot, sample quantiles (1500 to 4500) vs. theoretical quantiles]
I.41
What’s Next
Now that we can summarize and represent data in nice ways we would like to make meaningful statements about the population that gave rise to this data.
For this we need to make some assumptions, leading to the notion of a Statistical Model.
In this course we will focus mainly on one type of statistical model. However, going beyond this model will not be hard given the foundational knowledge you’ll develop.
Important!!! All models are wrong… but some are useful.
(George E.P. Box)
I.42
