2WS30/39

Mathematical Statistics

2WS30 - Introduction and E.D.A./Descriptive Statistics


Before we start…
Lecturers/Instructors:

Rui M. Castro
Office: MF4.075
Phone: (040 247) 2499
Email: rmcastro@tue.nl

Alessandro di Bucchianico
Office: MF7.097a
Email: a.d.bucchianico@tue.nl

Alberto Brini
Office: MF4.???
Email: a.brini@tue.nl

Paulo Serra
Office: MF4.???
Email: p.j.serra@tue.nl

http://www.win.tue.nl/~rmcastro/2WS30
Before we start…
Setup of the course:
•  Two weekly lectures
•  Two weekly instructions/tutorials/advising

Student Assessment:
•  Homework assignments (20%)
•  Modeling project (20%)
•  Final Exam (60%)

Prerequisites:
•  Probability Theory (2WS20)
•  A reasonable level of mathematical maturity
Important Topics from Prob. Theory
•  Expectation of r.v.’s and functions of r.v.’s

•  Computing the distribution of a function of random
variables using the Probability Integral Transform

•  The properties of the sum of random variables

•  Law of large numbers

•  Central limit theorem

•  Convergence of random variables

•  Conditional expectations
Before we start…
Study Materials:
•  “Statistical Theory: A Concise Introduction”,
Abramovich and Ritov
•  Others (see website)

Announcements and other course materials:


•  I’ll post everything on the course webpage
•  I’ll send emails when necessary

I ASSUME YOU ARE ALL REGISTERED FOR THE COURSE!

(you won’t receive any announcements otherwise)


What is Statistics?
According to the Encyclopedia Britannica:
“Statistics is the art and science of gathering, analyzing, and
making inferences from data”

In his book “Statistical Models”, A. C. Davison answers
the question in a much more thorough way:
“Statistics concerns what can be learned from data. Applied
statistics comprises a body of methods for data collection
and analysis across the whole range of science, and in areas
such as engineering, medicine, business, and law - wherever
variable data must be summarized, or used to test or confirm
theories, or to inform decisions. Theoretical statistics
underpins this by providing a framework for understanding
the properties and scope of methods used in applications.”

What is Statistics?
Statistics is often associated only with polls, censuses, and
other “boring” stuff.
However, this is a very limiting view of statistics:

Probability AND Statistics
Probability and Statistics are NOT the same thing!!!

•  Probability provides the foundation of statistics


•  Statistics is concerned with testing hypotheses/making
inferences about the “world” by using data (assumed to be
collected according to some probabilistic model)

[Diagram: Population (World); Probabilistic Model (models how data is created); Sample (small part of the population). Statistics: inference about the population/world from the sample]
In this Course

•  Emphasis on the theoretical underpinnings and
foundations of statistical inference.

•  In the modeling/homework assignments you will also
encounter other aspects of statistics, such as the
gathering, description and summarization of data

•  Very importantly, you’ll encounter the issues related
to the choice of a “good” statistical model.

A Typical Example
The presidential elections in the United States work in a funny
way, and in each state there is essentially a separate election.
Very important are the so-called “swing states”, for which it
is difficult to predict the outcome of the electoral process.

In the 2012 election Wisconsin appeared to be such a swing
state. A phone survey (July 25, 2012) of 480 likely voters
yielded the following data: 248 individuals indicated they would
vote for Barack Obama; 232 individuals indicated they would vote
for Mitt Romney.

What predictions can be made about the outcome of the
Wisconsin election (if it were to take place on that same day)?

The data in this example is loosely based on a poll, as described in
http://www.rasmussenreports.com/public_content/politics/elections/election_2012/election_2012_presidential_election/wisconsin/election_2012_wisconsin_president.
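
One simple way to start on the Wisconsin question can be sketched as follows. The sketch assumes something the slide does not state: that the 480 respondents behave like independent draws from the voting population, so that the number of Obama voters is binomial and a normal approximation gives a rough 95% interval for the true proportion.

```python
import math

# Survey counts from the Wisconsin example.
n_obama = 248
n_romney = 232
n = n_obama + n_romney  # 480 likely voters

# Sample proportion of respondents favouring Obama.
p_hat = n_obama / n

# Estimated standard error of p_hat (ASSUMPTION: i.i.d. Bernoulli responses).
std_err = math.sqrt(p_hat * (1 - p_hat) / n)

# Rough 95% interval from the normal approximation (z = 1.96).
z = 1.96
ci_low = p_hat - z * std_err
ci_high = p_hat + z * std_err

print(f"p_hat = {p_hat:.3f}, interval = ({ci_low:.3f}, {ci_high:.3f})")
```

Under these assumptions the interval contains 0.5, so this poll alone does not settle who is ahead.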
German Tanks
During World War II it was important for the Allies to assess
the number of tanks and V2 rockets that the Germans were able
to produce in a certain period of time.

A lot of money was spent on intelligence to do so. However, the
most successful and accurate estimates came from a relatively
simple statistical approach (and some naivety by the Germans):

Each German tank that was captured had serial numbers on
various parts (e.g., the engine block). As the name indicates,
these were serial, essentially ranging from 1 to N. Assuming
simplistically that each produced tank is equally likely to be
captured gives a possible way to estimate N.
German Tanks!
A Concrete instance:
During a certain period six German tanks were captured, with
serial numbers 17, 68, 94, 127, 135, 212. Then a good estimate
for N is given by

Date          Estimate   True value   Intelligence estimate
June 1940     169        122          1000
June 1941     244        271          1550
August 1942   327        342          1550
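
For the concrete instance with six captured tanks, the estimator can be illustrated with a short sketch. The formula used here is an assumption, not taken from the slide: it is the commonly quoted estimate m + m/k - 1, where m is the largest observed serial number and k is the number of captured tanks (serials assumed to be drawn uniformly without replacement from 1, ..., N).

```python
# Serial numbers of the six captured tanks from the concrete instance.
serials = [17, 68, 94, 127, 135, 212]

k = len(serials)   # number of captured tanks
m = max(serials)   # largest observed serial number

# ASSUMED estimator (not recovered from the slide): m + m/k - 1.
# Intuition: the largest serial underestimates N by roughly the
# average gap between consecutive observed serials.
n_hat = m + m / k - 1

print(f"k = {k}, max serial = {m}, estimate of N = {n_hat:.1f}")
```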
Biology and Estimation of Missing Mass!
Suppose you are working with biologists studying the ecosystem
of a certain lake. They would like to know how many species of
fish inhabit the lake. They set several (fish-friendly) nets in
different places and record the following catch:

You later go fishing on the lake. What is the probability you’ll
encounter a species you haven’t seen before?
The Good-Turing estimator of this quantity is 2/12 = 0.167
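
The Good-Turing estimate of the missing mass is the number of species seen exactly once divided by the total number of fish caught. A minimal sketch follows; the catch below is HYPOTHETICAL (the actual catch table is not reproduced here) and was chosen only so that it has 12 fish and 2 species seen once, matching the 2/12 quoted above.

```python
from collections import Counter

# HYPOTHETICAL catch: species names and counts are made up for illustration.
catch = ["perch"] * 5 + ["pike"] * 3 + ["bream"] * 2 + ["carp", "eel"]

counts = Counter(catch)   # number of fish caught per species
n_fish = len(catch)       # total number of fish caught

# Species observed exactly once ("singletons").
n_singletons = sum(1 for c in counts.values() if c == 1)

# Good-Turing estimate of the probability that the next fish
# belongs to a species not seen so far.
missing_mass = n_singletons / n_fish

print(f"{n_singletons}/{n_fish} = {missing_mass:.3f}")
```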
What is Data?
Definition: Data and Dataset

This seems a bit vague… For our purposes:

Data is a collection of numerical or categorical observations of
a certain process (either physical, biological, social, etc…).

Depending on the questions one wants to answer, the order of
the data might be important (e.g. AEX over time); other times
it is irrelevant (exam grades of 2WS30 ordered by student
last name).
A Typical Dataset
To better understand the impact of smoking during pregnancy, a
big study was conducted in the USA. All the pregnancies under a
certain health cooperative (in San Francisco) were monitored
between 1960 and 1967, and figures like the mother’s age,
smoking status, baby weight at birth, etc. were collected (a
total of 1236 valid entries).

For instance, this is a list of the mothers’ ages (in years):

27 33 28 36 23 25 33 23 25 30 27 32 23 36 30 38
25 33 33 43 22 27 25 30 23 27 (…)

We desire to make “meaningful” statements about mothers in
San Francisco, but using only this sample…
Descriptive Statistics
Typically we can only say something sensible about data or a
dataset if we assume a statistical model for it. Nevertheless,
a good start is to summarize the contents of a dataset, or
represent them in a palatable way. This is also a key aspect of
Exploratory Data Analysis.

This is the goal of Descriptive Statistics, which are either
numerical or graphical summaries and representations of data.

In what follows we will concentrate mostly on scenarios where
the ordering of the elements in the dataset is not considered
important. E.g.:

•  Exam grades of 2WS30
•  Customer satisfaction ratings of a store
•  Number of rotten apples in each crate of apples from a
certain producer (order of the crates doesn’t matter)
A Typical Dataset
[Diagram: Sample (a small number of mothers in San Francisco) taken from the Population (mothers in San Francisco)]

Our hope is that the sample is somewhat representative of the
entire population…

Before trying to do this, let’s see if we can “understand” the
data a bit better, and summarize it in nice ways…
Numerical Summaries – Sample Mean
Often it is good to have an idea of where the data values are
hovering around. There are a number of natural ways to
quantify this:

Definition: Sample Mean/Sample Average

For the dataset of the previous slides we have

Clearly this is good information to have, but it would be good to
know if the mother’s age is always close to this, or differs wildly…
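
As a small illustration, the sample mean (the sum of the observations divided by their number, the standard definition) can be computed for the 26 ages listed on the earlier slide. This is only the visible part of the dataset, so the result is not the value for the full 1236 entries.

```python
# The 26 mother's ages listed on the earlier slide (the full dataset
# has 1236 entries, which are not reproduced here).
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

n = len(ages)

# Sample mean: sum of the observations divided by their number.
sample_mean = sum(ages) / n

print(f"n = {n}, sample mean = {sample_mean:.2f} years")
```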
Sample Variance/Standard Deviation
Definition: Sample Variance/Standard Deviation

In our example

Notice the units are squared!!!

The sample standard deviation is given by

An intuitive interpretation of what the sample standard
deviation represents is not so easy, but we can still understand
why it does measure variability:
Sample Variance/Standard Deviation

always non-negative

Properties: Sample Variance/Standard Deviation

The last expression typically makes hand computations
easier, but numerically it can be a very bad choice…
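
The numerical warning can be seen in a small experiment. This sketch makes two assumptions, since the formulas themselves are not reproduced here: the sample variance uses the 1/(n-1) normalization, and the "last expression" is the shortcut form (sum of squares minus n times the squared mean). The data values are made up for illustration.

```python
import math

def variance_two_pass(xs):
    """Sample variance from squared deviations about the mean (1/(n-1) assumed)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_shortcut(xs):
    """Shortcut form: (sum of x^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)

# Made-up data with variance exactly 30, and the same data shifted by 1e9.
small = [4.0, 7.0, 13.0, 16.0]
shifted = [x + 1e9 for x in small]

var_small = variance_two_pass(small)
var_shifted_two_pass = variance_two_pass(shifted)
var_shifted_shortcut = variance_shortcut(shifted)
std_small = math.sqrt(var_small)  # standard deviation, back in the original units

print(var_small, var_shifted_two_pass, var_shifted_shortcut)
```

Shifting the data does not change its variance, yet the shortcut form subtracts two huge, nearly equal numbers and loses essentially all significant digits in floating-point arithmetic.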


The Sample Range
Another way to assess variability:

Definition: Sample Range

In our example

This seems fishy. Actually, there are two entries in the data
that are 99. It turns out this value is not the age of the
mother, but rather indicates their age was unknown. So we
must treat these two entries as missing values. Removing these
you’ll get

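
The effect of the 99 codes on the sample range (largest minus smallest value) can be sketched as below. Only the 26 listed ages are used, and the two 99 entries are appended by hand for illustration, so the numbers are not those of the full dataset.

```python
MISSING_CODE = 99  # value used in the data to mark an unknown age

# The 26 listed ages, plus two missing-value codes APPENDED FOR ILLUSTRATION.
ages_raw = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
            25, 33, 33, 43, 22, 27, 25, 30, 23, 27, MISSING_CODE, MISSING_CODE]

def sample_range(xs):
    """Sample range: largest observation minus smallest observation."""
    return max(xs) - min(xs)

# Treat the coded entries as missing values and drop them.
ages_clean = [a for a in ages_raw if a != MISSING_CODE]

range_raw = sample_range(ages_raw)
range_clean = sample_range(ages_clean)

print(f"range with codes: {range_raw}, range without codes: {range_clean}")
```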
Other Numerical Summaries
There are many other numerical summaries that are important
(we’ll encounter these again, in the context of graphical
representations of data)

Definition: Order Statistics

Sample Median and Percentiles
Definition: Sample Median

This is essentially the value that “splits” the dataset in two:
approximately half of the data is below the median and half is
above the median. More generally, we can define

Definition: Sample Percentiles

Calculation of sample percentiles is not done the same way
everywhere, and most statistical packages use a definition that
involves interpolation (like the median above).
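
A minimal sketch of order statistics, the sample median and interpolated percentiles follows. The interpolation rule used here (linear interpolation at position (n-1)p/100 in the sorted data) is one common convention and is an assumption, not necessarily the definition on the slide. Only the 26 listed ages are used, so the median differs from the value for the full dataset.

```python
import statistics

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

# Order statistics: the data sorted in increasing order.
order_stats = sorted(ages)

def sample_median(sorted_xs):
    """Middle value, or the average of the two middle values when n is even."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def percentile(sorted_xs, p):
    """p-th percentile (0 <= p <= 100), ASSUMED linear-interpolation convention."""
    pos = (len(sorted_xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] + frac * (sorted_xs[hi] - sorted_xs[lo])

median = sample_median(order_stats)
q1 = percentile(order_stats, 25)
q3 = percentile(order_stats, 75)

print(f"median = {median}, first quartile = {q1}, third quartile = {q3}")
```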
Sample Median and Percentiles
For our dataset we have that the median is 26. This value does
not change if we remove the two entries valued 99.

The sample median is a measure of location that is robust to
outliers, unlike the sample mean.

However, the median seems to also discard a lot of information
in comparison with the sample mean. A compromise between
the two is the trimmed mean

Definition: 10% Trimmed Mean

In our example
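
The trimmed mean can be sketched as follows, under an assumed definition (the slide's own definition is not reproduced here): sort the data, drop the lowest 10% and the highest 10% of the observations (rounding the count down), and average the rest. The data are the 26 listed ages, with two 99 codes appended by hand to show the robustness.

```python
def trimmed_mean(xs, proportion=0.10):
    """Mean after dropping a proportion of the data from each end (ASSUMED definition)."""
    sorted_xs = sorted(xs)
    k = int(proportion * len(sorted_xs))  # observations dropped from each end
    kept = sorted_xs[k:len(sorted_xs) - k]
    return sum(kept) / len(kept)

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

# Same data with two outlying 99 entries APPENDED FOR ILLUSTRATION.
ages_with_codes = ages + [99, 99]

tm_clean = trimmed_mean(ages)
tm_codes = trimmed_mean(ages_with_codes)
mean_codes = sum(ages_with_codes) / len(ages_with_codes)

print(f"trimmed mean: {tm_clean:.2f}; with 99s: {tm_codes:.2f}; "
      f"plain mean with 99s: {mean_codes:.2f}")
```

The two 99 entries pull the plain mean up by several years, while the trimmed mean barely moves because both entries fall in the discarded upper tail.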


Graphical Representations
Especially for large datasets, graphical representations are
often much more (qualitatively) informative than numerical
summaries. Perhaps the simplest graphical representation is the
scatter-plot (baby weight, in grams)

[Figure: scatter plot of baby weight; horizontal axis from 1500 to 5000 grams]

It is sometimes convenient to jitter the positions of the points,
so it is easier to see what’s going on…

[Figure: jittered scatter plot of baby weight; horizontal axis from 1500 to 5000 grams]
Histograms
Scatterplots are still a bit difficult to read – a way we can get
a better view is by aggregating data into bins

[Figure: scatter plot of baby weight and the corresponding “Histogram of x”; vertical axis: Frequency; horizontal axis from 1500 to 5000]
Histograms – Choice of Binning
The choice of the number of bins is a tricky business…
[Figure: three versions of “Histogram of x” (Frequency against x) with different numbers of bins, labelled “Too few!!!”, “Just right!!!” and “Too many!!!”]

There are rules-of-thumb for the number of bins that most
software will use… You don’t need to worry too much (yet)...
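
One well-known rule of thumb is Sturges' rule, which suggests about ceil(log2(n)) + 1 bins for n observations. The sketch below applies it and counts observations per equal-width bin. The choice of this particular rule is an assumption made for illustration, and the data are the 26 listed mother's ages rather than the baby weights, which are not reproduced here.

```python
import math

def sturges_bins(n):
    """Number of bins suggested by Sturges' rule for n observations."""
    return math.ceil(math.log2(n)) + 1

def histogram_counts(xs, n_bins):
    """Counts per equal-width bin between min(xs) and max(xs)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        # The maximum falls on the right edge; put it in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, width

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

n_bins = sturges_bins(len(ages))
counts, width = histogram_counts(ages, n_bins)

print(f"{n_bins} bins of width {width:.2f}: {counts}")
print(f"Sturges' rule for n = 1236: {sturges_bins(1236)} bins")
```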
Histograms
Actually, if the data can be viewed as independent samples
from some continuous distribution, the histogram (after proper
normalization) can be interpreted as an estimate of the true
underlying density function !!!

[Figure: “Histogram of y” on the density scale; vertical axis: Density (0 to 8e-04); horizontal axis: y, from 1500 to 5000]

Baby weight: this histogram has a bell-like shape. Is it
reasonable to model baby weight as a sample from a normal
distribution?
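
The "proper normalization" can be made concrete: dividing each bin count by (n times the bin width) makes the bar areas sum to one, as a density must. A minimal sketch, using the 26 listed mother's ages as stand-in data (the baby weights are not reproduced here) and an arbitrary choice of 7 bins:

```python
def histogram_density(xs, n_bins):
    """Bar heights of a density-scale histogram with equal-width bins."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in xs:
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(xs)
    # Height = count / (n * width), so that height * width = relative frequency.
    heights = [c / (n * width) for c in counts]
    return heights, width

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

heights, width = histogram_density(ages, n_bins=7)  # 7 bins: arbitrary choice
total_area = sum(h * width for h in heights)

print(f"bar heights: {[round(h, 4) for h in heights]}, total area = {total_area:.3f}")
```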
Density Estimators
Histograms are actually a very crude density estimator. There
are much better alternatives, like kernel-based estimators

[Figure: kernel density estimate, density.default(x = y, n = 50000); N = 1236, Bandwidth = 102; vertical axis: Density]

The principle behind all these estimators is still the same:
locally averaging data. However, these can be much more
accurate than the histogram.
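
The idea of "locally averaging data" can be sketched with a Gaussian kernel estimator: the estimate at x is the average of bell-shaped bumps centred at the observations. The kernel, the bandwidth of 2 years and the use of the 26 listed ages as data are all assumptions made for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function x -> kernel density estimate with a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred at each observation.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)

    return density

ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27]

f_hat = gaussian_kde(ages, bandwidth=2.0)  # HYPOTHETICAL bandwidth choice

# Riemann-sum check that the estimate integrates to (about) one.
step = 0.1
grid = [i * step for i in range(701)]  # 0.0, 0.1, ..., 70.0
area = sum(f_hat(x) for x in grid) * step

print(f"estimate at 27 years: {f_hat(27.0):.4f}, total area = {area:.4f}")
```

A smaller bandwidth gives a wigglier estimate and a larger one a smoother estimate, much like the number of bins in a histogram.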
Box and Whisker Plots
These are funny looking plots that give a nice graphical
representation of the (mother’s age) data…

[Figure: box plot of mother’s age; axis from 15 to 45. Annotations:
•  Smallest Value
•  First Quartile: 25% of the data is below this
•  Sample Median: half the data is below this
•  Third Quartile: 75% of the data is below this
•  Largest Value]

Actually, box-plots are generally a bit more sophisticated…
Box and Whisker Plots
Box and whisker plots are usually presented using the following
rules:

[Figure: box plot of mother’s age; axis from 15 to 45. Annotations: First Quartile; Sample Median; Third Quartile; InterQuartile Range (IQR); Outliers; one whisker extends to the smallest point within 1.5 IQR, the other to the largest point within 1.5 IQR. Why 1.5?]

These plots are easy to understand, and are therefore quite
useful; we can even compare different datasets easily…
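
The box-plot rules can be sketched in code: compute the quartiles and the IQR, let each whisker end at the most extreme point within 1.5 IQR of the box, and flag everything beyond as an outlier. The quartile convention (linear interpolation) is an assumption, and the data are the 26 listed ages with two 99 codes appended by hand.

```python
def percentile(sorted_xs, p):
    """p-th percentile by linear interpolation (ASSUMED convention)."""
    pos = (len(sorted_xs) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (pos - lo) * (sorted_xs[hi] - sorted_xs[lo])

def boxplot_stats(xs, whisker=1.5):
    """Quartiles, whisker ends and outliers under the 1.5 IQR rule."""
    s = sorted(xs)
    q1, med, q3 = percentile(s, 25), percentile(s, 50), percentile(s, 75)
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [x for x in s if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0],     # smallest point within the fences
        "whisker_high": inside[-1],   # largest point within the fences
        "outliers": [x for x in s if x < low_fence or x > high_fence],
    }

# The 26 listed ages plus two 99 codes APPENDED FOR ILLUSTRATION.
ages = [27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, 38,
        25, 33, 33, 43, 22, 27, 25, 30, 23, 27, 99, 99]

stats = boxplot_stats(ages)
print(stats)
```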
Multiple Box Plots
Birth-weight data comparison between smoking and non-smoking
mothers:

[Figure: comparative boxplots of Birth Weight for Smoking and Non-smoking mothers; axis from 1500 to 5000]

Does this mean smoking is bad for you???
Time-Sequence Plots
Sometimes the order of the data matters!!! Example:
PSI20 financial data (01/07/2011 - 29/06/2012)

[Figure: time-sequence plot of x against Index (0 to 250), “Histogram of x” (Frequency), and a box plot of x; value axis from 4500 to 7500]

There is clearly a temporal trend that is completely ignored
in the histogram or boxplot representations!!! The order of
the data really matters…
PSI20 Example
We can, however, look at the daily returns instead…
[Figure: “PSI20 returns” (r against Index, 0 to 250), “Histogram of r” (Frequency), and a box plot of r]

There seems to be much less of a temporal trend in the returns,
so histograms and box-plots are potentially useful
representations of the data…

The choice of Statistical Model is already important for
description of the data!!!
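
Going from index values to daily returns can be sketched as below. The slides do not say which definition of return is used; this sketch assumes the simple return (p_t - p_{t-1}) / p_{t-1} (the log return log(p_t / p_{t-1}) is a common alternative). The prices are HYPOTHETICAL, not PSI20 data.

```python
# HYPOTHETICAL closing values of an index on four consecutive days.
prices = [6000.0, 6060.0, 5999.4, 6059.394]

# ASSUMED definition: simple daily return relative to the previous day.
returns = [(today - yesterday) / yesterday
           for yesterday, today in zip(prices, prices[1:])]

for day, r in enumerate(returns, start=1):
    print(f"day {day}: return = {r:+.4f}")
```

There is one fewer return than there are prices, since each return compares a day with the previous one.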
Quantile-Quantile Plots
These are part of a general class of qualitative plots that are
meant to help you assess some properties of the data. Namely,
whether the data can be reasonably modeled by independent
samples from some distribution…

Let’s recall some concepts from a few slides back:

Definition: Order Statistics
Quantile-Quantile Plots
We can compare the order statistics to the values we would
expect for some distributions (e.g. a normal distribution).

So this gives an easy visual way to check if assuming normality
is somewhat reasonable…
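
The comparison can be sketched as follows: pair the i-th order statistic with the standard normal quantile at probability (i - 0.5)/n, and see how close the pairs are to a straight line. The plotting position (i - 0.5)/n is one common convention and an assumption here; the correlation is used only as a crude numerical stand-in for looking at the plot, and the data are synthetic.

```python
import random
from statistics import NormalDist

def normal_qq_points(xs):
    """Pairs (theoretical normal quantile, order statistic) for a normal Q-Q plot."""
    order_stats = sorted(xs)
    n = len(order_stats)
    std_normal = NormalDist()
    # ASSUMED plotting positions: (i - 0.5) / n for i = 1, ..., n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, order_stats

def correlation(us, vs):
    """Pearson correlation, here a rough measure of straightness of the Q-Q plot."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    su = sum((u - mu) ** 2 for u in us) ** 0.5
    sv = sum((v - mv) ** 2 for v in vs) ** 0.5
    return cov / (su * sv)

random.seed(2)
normal_sample = [random.gauss(0.0, 1.0) for _ in range(500)]   # SYNTHETIC data
expo_sample = [random.expovariate(1.0) for _ in range(500)]    # SYNTHETIC data

r_normal = correlation(*normal_qq_points(normal_sample))
r_expo = correlation(*normal_qq_points(expo_sample))

print(f"normal sample: r = {r_normal:.4f}; exponential sample: r = {r_expo:.4f}")
```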
Normal Quantile-Quantile Plots
Example: PSI20 Daily returns

[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles, -3 to 3) and “Histogram of r” (Density)]

Daily returns seem to be reasonably modeled by a normal
distribution!!!
Normal Quantile-Quantile Plots
Example: Synthetic data – normal distribution (sanity check)

[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles, both from -3 to 3) and “Histogram of r” (Density)]
Normal Quantile-Quantile Plots
Example: Synthetic data – exponential distribution
[Figure: “Normal Q-Q Plot” (Sample Quantiles against Theoretical Quantiles) and “Histogram of r” (Density, r from 0 to 4); the Q-Q plot is annotated “Too many points away from the line”]

Everything seems to make sense here…
Quantile-Quantile Plots
Summary: normal QQ plots give us a qualitative way to check
if data can be reasonably modeled by a normal distribution.
If most points lie approximately on a straight line then the
normal modeling assumption might be reasonable – otherwise it
is doubtful.

[Figure: normal Q-Q plots (Sample Quantiles against Theoretical Quantiles) for birth weight and for PSI20 daily returns; annotation: “Too many points away from the line”]
Quantile-Quantile Plots
[Figure: “Normal Q-Q Plot”; Sample Quantiles (1500 to 4500) against Theoretical Quantiles (-3 to 3)]
What’s Next
Now that we can summarize and represent data in nice ways we
would like to make meaningful statements about the population
that gave rise to this data.

For this we need to make some assumptions, leading into the
notion of a Statistical Model.

In this course we will focus mainly on one type of statistical
model. However, going beyond this model will not be hard given
the foundational knowledge you’ll develop.

Important!!! All models are wrong…
…but some are useful.
(George E.P. Box)