Lecture 10: Statistics and Sampling Distributions

Data Analysis MA2224

2024-03-06

Estimation and Statistics

Setup

Illustrative Example: Suppose you regularly just stare at stuff on your phone while at work when you are supposed
to be doing actual work. Every so often your boss will walk by your office. In order to maximize your loafing and
minimize your stress, you want to model this situation using a random variable in order to try and predict when she
will walk by again. What would you use?
Time until her next walk by:
T ∼ Exp(β)
or T ∼ Unif(α, β)
or T ∼ N(µ, σ²),
or number of times she walks by each hour: N ∼ Pois(λ).
There are plenty of options. Regardless of which one of these models you decide to use, you need to come up with
values for the parameters α, β, µ, σ² or λ.
Say you decide to model the time until your boss passes by your office using an exponential distribution. So we have
T ∼ Exp(β) and we could calculate the probability she walks by in the next 5 minutes if we knew β. We decide to
take a random sample of size 10 throughout the week. This means Yi ∼ Exp(β) for i = 1, ..., 10 is your theoretical
sample, and you eventually get the data or observed sample yi for i = 1, ..., 10 to be

7.32 0.56 2.59 0.17 1.81
0.19 3.04 2.17 1.61 19.41

With this data we now want to ESTIMATE what the value of β is. This is what we use statistics for.

Estimation

Point Estimate of A Parameter.


A point estimate of a parameter θ (like µ, σ, λ, etc.) is a single number, θ̂ (like µ̂, σ̂, λ̂, etc.), that is used in
place of the parameter when you don't know it.

For the most part, a point estimate of a parameter can be any number you want. If someone asks me what the average
height of men is (µ), I can use my own height of 6 feet tall as the point estimate because why not? This probably
won't be a very good estimate though.
Point estimates therefore can be good or bad, and there are several properties of good point estimates that you might
study in a more advanced class. We won't worry about that here, but there is one important requirement to get a good
estimate and that's to have a good sample.

Samples

In lecture 1 we introduced the definition of a Sample as just a subset of the population. From here on out we focus
on a specific type of sample called a Random Sample.

Random Sample.
The collection of random variables X1, X2, ..., Xn is a Random Sample of size n if the Xi's are all independent
and have identical population models.

Xi ∼ Distribution(parameters) i = 1, ..., n

When I say “sample” after this, I mean Random Sample as defined above.

Statistics

Figuring out the model Parameters is the next goal of this class. To do so, the obvious thing to do is to investigate
a sample from your model or population and see if you can figure out the parameters from your sample. This
means we need numbers in a sample that reflect the parameter values. These sample “special numbers” are called
statistics.

Statistics.
A statistic of a sample is just a special number calculated from the sample. In math words, a statistic θ̂ is just a
function of the sample
θ̂ = g(X1 , X2 , ..., Xn )

Statistics: Measures of Center

Recall that in an exponential model, β = µ, so we need to estimate µ.


When trying to figure out the parameter µ there are three options we could use. Since the mean µ is the center of a
probability distribution, statistics that are meant to estimate it are called measures of center.

The Mode.
The most frequently occurring value in a sample is called The Mode. This is generally used for discrete data.

The Median.
When the data are ordered from smallest to largest, the value m that has 50% of the data below it and 50% above it
is called the Median. When ordered lowest to highest, the median is in the 0.5(n + 1)th position. It will either be
a value in the sample or the average of two values.

m = y_((n+1)/2) if n is odd
m = (y_(n/2) + y_(n/2+1))/2 if n is even

The Average.
The Average or Mean of a sample is denoted ȳ and

ȳ = (y1 + y2 + ... + yn)/n = (Σ_{i=1}^n yi)/n

In the sample from before, we get the median and average.

7.32 0.56 2.59 0.17 1.81
0.19 3.04 2.17 1.61 19.41

ȳ = (Σ_{i=1}^n yi)/n = (7.32 + 0.56 + 2.59 + 0.17 + 1.81 + 0.19 + 3.04 + 2.17 + 1.61 + 19.41)/10 = 3.887
For the median we have to order the values

0.17, 0.19, 0.56, 1.61, 1.81, 2.17, 2.59, 3.04, 7.32, 19.41

Then locate the 0.5(10 + 1)th = 5.5th value, which lies between the 5th and 6th values, 1.81 and 2.17, which we
average to get

m = (1.81 + 2.17)/2 = 1.99

While both of these numbers attempt to give an estimate of µ, the "center" of the distribution, they do it in
different ways. The average gives the "center of weight" of the sample, while the median is the 50-50 divide.
We will use the sample average as an estimate of the population mean, ȳ ≈ µ, which means our model from before
becomes T ∼ Exp(β = 3.887), and we can approximate the probability of the boss walking by in the next 5 minutes as

P(T < 5) = 1 − e^(−5/3.887) ≈ 72.4%
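
If you want to verify these numbers in software, here is a minimal Python sketch (standard library only; the variable
names are just illustrative) that reproduces the sample average, the sample median, and the walk-by probability:

import math
from statistics import mean, median

# observed sample of waiting times from the table above
times = [7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41]

ybar = mean(times)    # sample average, our estimate of beta
m = median(times)     # sample median

# P(T < 5) for T ~ Exp(beta) with beta replaced by the sample average
p_within_5 = 1 - math.exp(-5 / ybar)

print(round(ybar, 3), round(m, 2), round(p_within_5, 3))   # 3.887 1.99 0.724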

Statistics: Measures of Location - Percentiles

The mean, median and mode tell you the location of certain points of interest in your data, specifically locations of
the center.
Other important locations are where the data is split, which we know as percentiles.
While the median is the 50-50 split of the data, i.e. the 50th percentile, we can easily extend this to other splits.

Percentiles.
When the data are ordered from smallest to largest, the value k_p that has p × 100% of the data below it is called
the pth Percentile. Special cases are The First Quartile (25-75) located at the 0.25(n + 1)th position,
The Median (50-50) or The Second Quartile located at the 0.5(n + 1)th position, and The Third Quartile (75-25)
located at the 0.75(n + 1)th position.

Statistics: Measures of Spread

After thinking about the “center”, the second thing is to think about the “spread”. Statistics that try and capture the
variation in a sample are the following.

The Range.
The Range measures the total distance covered by your data.

Range = Max − Min

The IQR.
The Interquartile Range or IQR measures how close together the "middle" of the data is, i.e. how close together the
middle 50% of the data is.
IQR = Q3 − Q1

The Variance and Standard Deviation.


The Variance is the "average" squared distance from each sample value to the sample mean.
The Standard Deviation is the square root of the variance.

s²_y = Σ_{i=1}^n (yi − ȳ)² / (n − 1)        s_y = √(s²_y)

The measures of spread can be used to estimate σ² and σ.

We will use the sample variance and standard deviation as estimates of the population variance and standard deviation,
s²_y ≈ σ² and s_y ≈ σ.

Calculator: TI-30 X IIS


1. Enter Stat mode by pressing “Stat”: 2nd → Data


2. Select “1VAR” Statistics since we only have 1 group.
3. Press “Data” to start entering the data. Type each value and press enter to type in the next value. You can
keep FRQ, the frequency of each value, at 1 as we won't be giving you any really large datasets.
4. Press “Statvar” to get the statistics. Scroll to the right to see what you get.
5. Exit Stat mode by pressing “Exitstat”: 2nd → Statvar

WARNING: The calculators give the standard deviation s_x, which you need to square to get the variance
s²_x. DO NOT USE σ_x.

Calculator: TI-30 XS


1. Select “DATA” to bring up several lists to use.


2. Enter the data values into the first list.
3. Press “STAT”: 2nd → Data.
4. Select “1VAR” Statistics since we only have 1 group. You can keep FRQ, the frequency of each value, at “ONE”
as we won't be giving you any really large datasets. You can scroll down to get x̄, m, Q1, Q3 and s_x.
5. You can clear your lists by pressing “DATA” twice.

WARNING: The calculators give the standard deviation s_x, which you need to square to get the variance
s²_x. DO NOT USE σ_x.

Example

For the following sample, determine the mean, median, quartiles, range, IQR, standard deviation and variance.

12.80 10.94 10.67 10.23 15.68
13.27 15.63 15.65 14.22 7.76
13.14 18.32 13.99 12.38 20.20

Statistic Value
x̄ 13.66
m 13.27
Q1 11.66
Q3 15.64
Range 12.44
IQR 3.98
s_x 3.19
s²_x 10.18

Note: Quartiles may be different and that’s ok.
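
If you would rather check a calculator's output in software, the following Python sketch (standard library only)
reproduces the table above; note it assumes the "inclusive" quartile convention, which happens to match these values,
while other conventions give slightly different quartiles:

from statistics import mean, median, quantiles, variance, stdev

data = [12.80, 10.94, 10.67, 10.23, 15.68,
        13.27, 15.63, 15.65, 14.22, 7.76,
        13.14, 18.32, 13.99, 12.38, 20.20]

q1, m, q3 = quantiles(data, n=4, method="inclusive")  # quartile conventions vary

print(round(mean(data), 2))                            # 13.66
print(round(m, 2), round(q1, 2), round(q3, 2))         # 13.27 11.66 15.64
print(round(max(data) - min(data), 2))                 # range: 12.44
print(round(q3 - q1, 2))                               # IQR: 3.98
print(round(stdev(data), 2), round(variance(data), 2)) # 3.19 10.18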


Sampling Distributions Introduction

Statistics as Random Variables

The key insights to make from these definitions in this lecture are:

1) A random sample is a collection of random variables.
2) A statistic combines those random variables into a single value.
3) A statistic is therefore also a random variable.
4) Since a statistic is a random variable, we have to find the usual three things: the center (Expectation), the
spread (Variance) and the shape (pdf).
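
One way to make points 3) and 4) concrete is a quick simulation: draw many random samples from one fixed model,
compute the same statistic on each, and watch the statistic vary from sample to sample. A rough Python sketch (the
Exp(β = 3.887) model, sample size and replication count are just illustrative choices, not part of the notes):

import random
from statistics import mean, stdev

random.seed(1)
beta = 3.887   # exponential mean from the earlier example
n = 10         # size of each random sample
reps = 10000   # number of simulated samples

# the sample average of each simulated random sample
ybars = [mean(random.expovariate(1 / beta) for _ in range(n)) for _ in range(reps)]

# the statistic Ybar is itself random, with its own center and spread
print(round(mean(ybars), 3))   # close to mu = beta = 3.887
print(round(stdev(ybars), 3))  # close to sigma / sqrt(n) = 3.887 / sqrt(10), about 1.23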

Main Statistics

Though we just covered a bunch of statistics, the main ones we will look at for developing ideas will be the following
three:

The Sample Sum.


For a random sample Y1 , ..., Yn the Sum is

U = Y1 + Y2 + ... + Yn = Σ_{i=1}^n Yi

The Sample Average.


For a random sample Y1 , ..., Yn the Average is

Ȳ = (Y1 + Y2 + ... + Yn)/n = (Σ_{i=1}^n Yi)/n = (1/n) Σ_{i=1}^n Yi

The Sample Variance.


For a random sample Y1 , ..., Yn the Variance is
S²_Y = Σ_{i=1}^n (Yi − Ȳ)² / (n − 1)

Sampling Distributions

There is nothing that makes a statistic, as a random variable, different from our usual random variables in terms of
what you do with it. However, they are philosophically different, so this unfortunately means they get entirely new
names for things that we already have names for.

The Sampling Distribution of a Statistic.

A statistic θ̂ is a random variable having its own pdf/pmf/mgf, mean, variance, etc. We call this distribution
The Sampling Distribution of θ̂, which has the following usual properties with new special symbols:
- Center or Mean, denoted: µ_θ̂
- Spread or Variance/SD, denoted: σ²_θ̂ and σ_θ̂
- Shape or pdf/cdf/mgf, denoted: f(θ̂), F(θ̂) and M_θ̂(t)

We have three primary statistics: U, Ȳ, and S²_Y, so we need to determine those properties for each, which makes the
symbols:

• Sample Sum:

E(U) = µ_U    Var(U) = σ²_U or SD(U) = σ_U    f(u) or M_U(t)

In words: “The mean of the Sum, the variance of the Sum and the distribution of the Sum”

• Sample Average:

E(Ȳ) = µ_Ȳ    Var(Ȳ) = σ²_Ȳ or SD(Ȳ) = σ_Ȳ    f(ȳ) or M_Ȳ(t)

In words: “The mean of the Mean, the variance of the Mean and the distribution of the Mean”
• Sample Variance:

E(S²) = µ_S²    Var(S²) = σ²_S² or SD(S²) = σ_S²    f(s²) or M_S²(t)

In words: “The mean of the Variance, the variance of the Variance and the distribution of the Variance”
It’s definitely worth your time to understand the symbols.

Distributions of The Sum and Mean

In lecture 9, we had some concluding facts that let us find sampling distributions right away.

Linear Combinations of Independent Normal Random Variables.


If X1, X2, ..., Xn are independent random variables with Xi ∼ N(µi, σ²_i) and we take the linear
combination L = a0 + Σ_{i=1}^n ai Xi, then

L ∼ N(a0 + Σ_{i=1}^n ai µi, Σ_{i=1}^n a²_i σ²_i)

Notice that this covers a lot of options and that these will all be normal distributions:

X1 − X2 with a = 1, b = −1

U = X1 + X2 with a = b = 1

X̄ = (X1 + X2)/2 with a = b = 1/2

Sum and Average Distributions.


Given a random sample Yi for i = 1, . . . , n, if we know additionally that Yi ∼ N(µ, σ²) for i = 1, . . . , n (or at
least approximately normal), then we derived the following sampling distributions for the Sum, U = Σ_{i=1}^n Yi,
and the sample mean, Ȳ = (1/n) Σ_{i=1}^n Yi:

U ∼ N(nµ, nσ²)        Ȳ ∼ N(µ, σ²/n)
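
A quick simulation can sanity-check the box above. This is only a sketch with illustrative values µ = 10, σ = 2 and
n = 5 (not numbers from the notes):

import random
from statistics import mean, variance

random.seed(2)
mu, sigma, n, reps = 10.0, 2.0, 5, 20000   # illustrative values

sums, avgs = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sums.append(sum(sample))
    avgs.append(mean(sample))

print(round(mean(sums), 2), round(variance(sums), 2))  # near n*mu = 50 and n*sigma^2 = 20
print(round(mean(avgs), 2), round(variance(avgs), 2))  # near mu = 10 and sigma^2/n = 0.8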

Example

Suppose that the amount of cash an Engineering student has on them is normally distributed with a mean of
$25 and a standard deviation of $8.

1. If you mug 9 of your classmates, what is the probability you get less than $200?

We have a random sample of n = 9 with

Ei ∼ N(µ = 25, σ² = 8²)

We care about the Sum

U = Σ_{i=1}^9 Ei

U ∼ N(µ_U = 9(25), σ²_U = 9(64)) → U ∼ N(µ_U = 225, σ²_U = 576)

P(U < 200) = Φ((200 − 225)/√576) = Φ(−1.04) = 1 − Φ(1.04) = 0.1492
2. What is the probability that from mugging 16 Tandon students you get an average of $22 or more?

Ē ∼ N(µ_Ē = 25, σ²_Ē = 64/16) → Ē ∼ N(µ_Ē = 25, σ²_Ē = 4)

P(Ē ≥ 22) = 1 − Φ((22 − 25)/√4) = 1 − Φ(−1.50) = 1 − (1 − Φ(1.50)) = 0.9332
3. Suppose the average amount of cash a Business student has on them is normally distributed with a mean of $30
and standard deviation of $12. If you mug nine Engineering students and seven Business students, what is the
probability you get more cash from the business students?

UE ∼ N(µ = 9(25), σ² = 9(64)) → UE ∼ N(µ = 225, σ² = 576)

UB ∼ N(µ = 7(30), σ² = 7(144)) → UB ∼ N(µ = 210, σ² = 1008)

Here we care about the difference of the outcomes, i.e. we want to know:

P (UB > UE) = P (UB − UE > 0)

We now need to find the distribution of the difference of two random variables. Everything is normal, so things all
stay normal; we just need to get the expectation and variance of

UB − UE

E(UB − UE) = E(UB) − E(UE) = 210 − 225 = −15

V (UB − UE) = V (UB) + V (UE) = 1008 + 576 = 1584

UB − UE ∼ N(−15, 1584)

P(UB > UE) = P(UB − UE > 0) = 1 − Φ((0 − (−15))/√1584) = 1 − Φ(0.38) = 1 − 0.648 = 0.352
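
All three answers can be double-checked numerically. A minimal Python sketch, writing the standard normal cdf Φ with
math.erf so no outside libraries are needed:

from math import erf, sqrt

def Phi(z):
    # standard normal cdf
    return 0.5 * (1 + erf(z / sqrt(2)))

# 1. U ~ N(225, 576): P(U < 200)
print(round(Phi((200 - 225) / sqrt(576)), 4))   # about 0.149

# 2. Ebar ~ N(25, 4): P(Ebar >= 22)
print(round(1 - Phi((22 - 25) / sqrt(4)), 4))   # 0.9332

# 3. UB - UE ~ N(-15, 1584): P(UB - UE > 0)
print(round(1 - Phi(15 / sqrt(1584)), 4))       # about 0.353 (0.352 above comes from rounding z to 0.38)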
