10 Statistics PDF
2024-03-06
Setup
Illustrative Example. Suppose you regularly just stare at stuff on your phone while at work when you are supposed
to be doing actual work. Every so often your boss will walk by your office. In order to maximize your loafing and
minimize your stress you want to model this situation using a random variable in order to try and predict when she
will walk by again. What would you use?
Time until her next walk by:
T ∼ Exp(β)
or T ∼ Unif(α, β)
or T ∼ N(µ, σ²)
or number of times she walks by each hour: N ∼ Pois(λ).
There are plenty of options. Regardless of which one of these models you decide to use, you need to come up with
values for the parameters α, β, µ, σ² or λ.
Say you decide to model the time until your boss passes by your office using an exponential distribution. So we have
Y ∼ Exp(β) and we could calculate the probability she walks by in the next 5 minutes if we knew β. We decide to
take a random sample of size 10 throughout the week. This means Yi ∼ Exp(β) for i = 1, ..., 10 is your theoretical
sample and you eventually get the data or observed sample yi for i = 1, ..., 10 (in minutes) to be
7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41
With this data we now want to ESTIMATE what the value of β is. This is what we use statistics for.
Estimation
For the most part, a point estimate of a parameter can be any number you want. If someone asks me what the height
of men is on average (µ), I can use my own height of 6 feet as the point estimate because why not? This probably
won't be a very good estimate though.
Point estimates therefore can be good or bad, and there are several properties of good point estimates that you might
study in a more advanced class. We won't worry about that here, but there is one important requirement to get a good
estimate and that's to have a good sample.
Samples
In lecture 1 we introduced the definition of a Sample as just a subset of the population but now we focus on a specific
type of sample called a Random Sample from here on out.
Random Sample.
The collection of random variables X1, X2, ..., Xn is a Random Sample of size n if the Xi's are all independent
and have identical population models.
Xi ∼ Distribution(parameters), i = 1, ..., n
When I say “sample” after this, I mean Random Sample as defined above.
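A random sample can be mimicked in software by drawing independent values from one and the same distribution. The sketch below assumes an exponential population whose mean is β = 4 (a made-up value, used only for illustration) and a hypothetical helper name draw_random_sample. Note that Python's random.expovariate takes the rate 1/β rather than the mean.

```python
import random

def draw_random_sample(n, beta, seed=None):
    """Draw n independent values from Exp(beta), where beta is the mean."""
    rng = random.Random(seed)
    # expovariate expects the rate lambda = 1 / mean
    return [rng.expovariate(1.0 / beta) for _ in range(n)]

# beta = 4 is a hypothetical value chosen only for illustration
sample = draw_random_sample(n=10, beta=4.0, seed=1)
print(sample)
```

Every call to the generator uses the same distribution and does not depend on earlier draws, which is the "independent and identical" requirement in the definition.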
Statistics
Figuring out the model Parameters is the next goal of this class. To do so, the obvious thing to do is to investigate
a sample from your model or population and see if you can figure out the parameters from your sample. This
means we need numbers in a sample that reflect the parameter values. These sample “special numbers” are called
statistics.
Statistics.
A statistic of a sample is just a special number calculated from the sample. In math words, a statistic θ̂ is just a
function of the sample
θ̂ = g(X1, X2, ..., Xn)
Statistics
The Mode.
The most frequently occurring value in a sample is called The Mode. This is generally used for discrete data.
The Median.
When the data are ordered from smallest to largest, the value m that has 50% below it and 50% above it is called
The Median. When ordered lowest to highest, the median is in the 0.5(n + 1)th position. It will either be a
value in the sample or the average of two values.
$$m = \begin{cases} y_{(n+1)/2} & n \text{ odd} \\ \dfrac{y_{n/2} + y_{n/2+1}}{2} & n \text{ even} \end{cases}$$
The Average.
The Average or Mean of a sample is denoted ȳ and
$$\bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{\sum_{i=1}^{n} y_i}{n}$$
For the average of our observed sample we have
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{7.32 + 0.56 + 2.59 + 0.17 + 1.81 + 0.19 + 3.04 + 2.17 + 1.61 + 19.41}{10} = 3.887$$
For the median we have to order the values
0.17, 0.19, 0.56, 1.61, 1.81, 2.17, 2.59, 3.04, 7.32, 19.41
Then locate the 0.5(10 + 1)th = 5.5th number, which lies between the 5th and 6th numbers, 1.81 and 2.17, which we average
to get
$$m = \frac{1.81 + 2.17}{2} = 1.99$$
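The two calculations above can be double-checked in a few lines of Python. This is a minimal sketch: sample_median is a hypothetical helper that follows the 0.5(n + 1)th position rule, and the standard library's statistics.median is used only as a cross-check.

```python
import math
import statistics

# Observed sample from the notes
y = [7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41]

def sample_median(values):
    """Median via the 0.5(n + 1)th position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]  # 1-based position (n + 1)/2
    # n even: average the values in positions n/2 and n/2 + 1
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

y_bar = sum(y) / len(y)
m = sample_median(y)

# Cross-check against the standard library
assert math.isclose(m, statistics.median(y))
print(round(y_bar, 3), round(m, 2))  # 3.887 1.99
```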
Statistics: Measures of Location - Percentiles
While both of these numbers attempt to give an estimate of µ or the “center” of the distribution, they do it in
different ways. The average gives the “center of weight” of the sample; the median is the 50-50 divide.
We will use the sample average as an estimate of the population mean, ȳ ≈ µ. Since the mean of an Exp(β) distribution is β, our model from before
can be T ∼ Exp(β = 3.887) and we can approximate the probability of the boss walking by in the next 5 minutes as
$$P(T < 5) = 1 - e^{-5/3.887} \approx 0.724 = 72.4\%$$
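The probability can be evaluated directly from the exponential cdf. The sketch below assumes, as the notes do, that β is the mean of the distribution; exp_cdf is a hypothetical helper name.

```python
import math

def exp_cdf(t, beta):
    """P(T < t) for T ~ Exp(beta), with beta the mean of the distribution."""
    return 1.0 - math.exp(-t / beta)

beta_hat = 3.887          # sample average used as the estimate of beta
p = exp_cdf(5, beta_hat)  # chance the boss walks by within 5 minutes
print(round(p, 3))
```

Changing the first argument gives the chance of a walk-by within any other time window under the same fitted model.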
The mean, median and mode tell you the location of certain points of interest in your data, specifically locations of
the center.
Other important locations are where the data is split, which we know as percentiles.
While the median is the 50-50 split of the data, i.e. the 50th percentile, we can easily extend this to other splits.
Percentiles.
When the data are ordered from smallest to largest, the value kp that has p × 100% of the data below it is called
The pth Percentile. Special cases are The First Quartile (25-75) located at the 0.25(n + 1)th position,
The Median (50-50) or The Second Quartile located at the 0.5(n + 1)th position, and The Third Quartile (75-25)
located at the 0.75(n + 1)th position.
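The position rule can be sketched in code as follows. The notes do not say what to do when the position is fractional (other than averaging for the median), so the linear interpolation used here is an assumption, and percentile_by_position is a hypothetical helper name.

```python
def percentile_by_position(values, p):
    """Value at the p(n + 1)th position of the ordered data (1-based).

    Fractional positions are linearly interpolated between neighbours;
    that interpolation rule is an assumption, not stated in the notes.
    """
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n + 1)
    pos = min(max(pos, 1), n)  # clamp to the valid positions 1..n
    lower = int(pos)           # whole part of the position
    frac = pos - lower         # fractional part of the position
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

# Observed sample from the notes
y = [7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41]
q1 = percentile_by_position(y, 0.25)  # position 2.75
m = percentile_by_position(y, 0.50)   # position 5.5
q3 = percentile_by_position(y, 0.75)  # position 8.25
print(q1, m, q3)
```

For the median the interpolation reduces to the plain average of the two middle values, so it agrees with the definition given earlier.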
After thinking about the “center”, the second thing is to think about the “spread”. Statistics that try and capture the
variation in a sample are the following.
The Range.
The Range measures the total distance covered by your data.
Range = Max − Min
The IQR.
The Interquartile Range or IQR measures how close together the "middle" of the data is, that is, how spread out
the middle 50% of the data is.
IQR = Q3 − Q1
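Both spread measures are quick to compute for the boss walk-by sample. This sketch leans on statistics.quantiles from the Python standard library; its "exclusive" method places the quartiles at the 0.25(n + 1), 0.5(n + 1) and 0.75(n + 1) positions and interpolates linearly in between, which is an assumption about how fractional positions should be handled.

```python
import statistics

# Observed sample from the notes
y = [7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41]

data_range = max(y) - min(y)

# Quartiles at the p(n + 1)th positions, linear interpolation in between
q1, q2, q3 = statistics.quantiles(y, n=4, method="exclusive")
iqr = q3 - q1

print(round(data_range, 2), round(iqr, 4))
```

The range is driven entirely by the single large value 19.41, while the IQR ignores it, which is why the two numbers differ so much for this sample.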
The Sample Variance and Standard Deviation.
$$s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} \qquad s_y = \sqrt{s_y^2}$$
Calculator: TI-30 X IIS
TI-30 X IIS
WARNING: The calculators give the standard deviation sx, which you need to square to get the variance
sx². DO NOT USE σx.
Calculator: TI-30 XS
TI-30 XS
WARNING: The calculators give the standard deviation sx, which you need to square to get the variance
sx². DO NOT USE σx.
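As a cross-check on calculator output, the sample variance and standard deviation can also be computed in software. The sketch below applies the n − 1 formula by hand to the boss walk-by sample and compares it with Python's statistics module, where variance and stdev divide by n − 1 (like sx) and pvariance and pstdev divide by n (like σx).

```python
import math
import statistics

# Observed sample from the notes
y = [7.32, 0.56, 2.59, 0.17, 1.81, 0.19, 3.04, 2.17, 1.61, 19.41]

n = len(y)
y_bar = sum(y) / n
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)  # sample variance
s = math.sqrt(s2)                                  # sample standard deviation

# Library cross-check: these use the same n - 1 divisor
assert math.isclose(s2, statistics.variance(y))
assert math.isclose(s, statistics.stdev(y))

print(round(s2, 3), round(s, 3))
```

Using statistics.pstdev here would reproduce the σx mistake the warning describes, since it divides by n.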
Example
For the following sample, determine the mean, median, quartiles, range, IQR, standard deviation and variance.
Statistic Value
x̄ 13.66
m 13.27
Q1 11.66
Q3 15.64
x̄ 12.44
m 3.98
Q1 3.19
Q3 10.18
The key insights to make from these definitions in this lecture are:
Main Statistics
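Though we just covered a bunch of statistics, the main ones we will look at for developing ideas will be the following
three:
$$U = Y_1 + Y_2 + \dots + Y_n = \sum_{i=1}^{n} Y_i$$
$$\bar{Y} = \frac{Y_1 + Y_2 + \dots + Y_n}{n} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$
$$S_y^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$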
Sampling Distributions
There is nothing that makes a statistic, as a random variable, different from our usual random variables in terms of what
you do with them. However, they are philosophically different, so this unfortunately means they get entirely new names
for things that we already have names for.
A statistic θ̂ is a random variable having its own pdf/pmf/mgf, mean, variance, etc. We call this distribution
The Sampling Distribution of θ̂, which has the following usual properties with new special symbols:
- Center or Mean denoted:
$$\mu_{\hat{\theta}}$$
- Spread or Variance/SD denoted:
$$\sigma_{\hat{\theta}}^2 \text{ and } \sigma_{\hat{\theta}}$$
We have three primary statistics, U, Ȳ, and S²y, so we need to determine those properties for each. That gives the
symbols:
• Sample Sum:
$$E(U) = \mu_U, \qquad Var(U) = \sigma_U^2 \text{ or } SD(U) = \sigma_U, \qquad f(u) \text{ or } M_U(t)$$
In words: “The mean of the Sum, the variance of the Sum and the distribution of the Sum”
• Sample Average:
In words: “The mean of the Mean, the variance of the Mean and the distribution of the Mean”
• Sample Variance:
In words: “The mean of the Variance, the variance of the Variance and the distribution of the Variance”
It’s definitely worth your time to understand the symbols.
Distributions of The Sum and Mean
In lecture 9, we had some concluding facts that let us find sampling distributions right away
$$L = a_0 + \sum_{i=1}^{n} a_i X_i \sim N\left(a_0 + \sum_{i=1}^{n} a_i \mu_i,\ \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$$
Notice that this covers a lot of options and that these will all be normal distributions
X1 − X2:              a = 1, b = −1
U = X1 + X2:          a = b = 1
X̄ = (X1 + X2)/2:      a = b = 1/2
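The linear-combination rule is easy to turn into a small helper. This is a sketch under the assumption that the Xi are independent normal random variables; linear_combination is a hypothetical function name, and the N(25, 8²) inputs are illustrative values only.

```python
import math
from statistics import NormalDist

def linear_combination(a0, coeffs, dists):
    """Distribution of a0 + sum(a_i * X_i) for independent normal X_i."""
    mean = a0 + sum(a * d.mean for a, d in zip(coeffs, dists))
    var = sum(a ** 2 * d.variance for a, d in zip(coeffs, dists))
    return NormalDist(mean, math.sqrt(var))

# Hypothetical X1, X2 ~ N(25, 8^2), used only to exercise the formula
x1 = NormalDist(25, 8)
x2 = NormalDist(25, 8)

diff = linear_combination(0, [1, -1], [x1, x2])    # X1 - X2
total = linear_combination(0, [1, 1], [x1, x2])    # X1 + X2
avg = linear_combination(0, [0.5, 0.5], [x1, x2])  # (X1 + X2)/2
print(diff, total, avg)
```

Note that the difference and the sum have the same variance, because the coefficients are squared before they multiply the variances.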
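In particular, for a random sample Y1, ..., Yn from N(µ, σ²):
$$U \sim N(n\mu,\ n\sigma^2) \qquad \bar{Y} \sim N\left(\mu,\ \frac{\sigma^2}{n}\right)$$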
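A quick simulation can make these two sampling distributions concrete. The sketch below repeatedly draws random samples from a normal population and looks at the mean and variance of the resulting sums and averages. The values n = 9, µ = 25 and σ = 8 are borrowed from the cash example in these notes; the number of repetitions, the seed and the helper name simulate_sum_and_mean are arbitrary choices.

```python
import random
import statistics

def simulate_sum_and_mean(n, mu, sigma, reps, seed=0):
    """Draw `reps` random samples of size n from N(mu, sigma^2).

    Returns the list of sample sums and the list of sample averages.
    """
    rng = random.Random(seed)
    sums, means = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        total = sum(sample)
        sums.append(total)
        means.append(total / n)
    return sums, means

n, mu, sigma = 9, 25, 8
sums, means = simulate_sum_and_mean(n, mu, sigma, reps=20000)

# Theory: E(U) = n*mu = 225 and Var(U) = n*sigma^2 = 576
print(statistics.mean(sums), statistics.variance(sums))
# Theory: E(Y-bar) = mu = 25 and Var(Y-bar) = sigma^2/n = 7.11...
print(statistics.mean(means), statistics.variance(means))
```

The simulated values should land close to the theoretical ones, but not exactly on them, since they are themselves statistics computed from random draws.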
Example
Suppose that the amount of cash an Engineering student has on them is normally distributed with a mean of
$25 and standard deviation of $8.
1. If you mug 9 of your classmates, what is the probability you get less than $200?
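$$E_i \sim N(\mu = 25,\ \sigma^2 = 8^2), \qquad U = \sum_{i=1}^{9} E_i \sim N(9 \cdot 25 = 225,\ 9 \cdot 8^2 = 576)$$
$$P(U < 200) = \Phi\left(\frac{200 - 225}{\sqrt{576}}\right) = \Phi(-1.04) = 1 - \Phi(1.04) = 0.1492$$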
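The same probability can be checked with Python's statistics.NormalDist, a sketch that skips the z-table entirely. Because software uses the unrounded z = −25/24 ≈ −1.0417 rather than −1.04, the result differs from the table answer in the fourth decimal place.

```python
from statistics import NormalDist

n, mu, sigma = 9, 25, 8

# U ~ N(n*mu, n*sigma^2) = N(225, 576), so SD(U) = 24
U = NormalDist(n * mu, (n * sigma ** 2) ** 0.5)

p = U.cdf(200)  # P(U < 200)
print(round(p, 4))
```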
2. What is the probability that from mugging 16 Tandon students you get an average of $22 or more?
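$$\bar{E} \sim N\left(\mu_{\bar{E}} = 25,\ \sigma_{\bar{E}}^2 = \frac{64}{16}\right) \rightarrow \bar{E} \sim N(\mu_{\bar{E}} = 25,\ \sigma_{\bar{E}}^2 = 4)$$
$$P(\bar{E} \geq 22) = 1 - \Phi\left(\frac{22 - 25}{\sqrt{4}}\right) = 1 - \Phi(-1.50) = 1 - (1 - \Phi(1.50)) = 0.9332$$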
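For the sample average, the only change in code is that the standard deviation shrinks by a factor of √n. A minimal sketch, again using statistics.NormalDist:

```python
from statistics import NormalDist

n, mu, sigma = 16, 25, 8

# E-bar ~ N(mu, sigma^2 / n) = N(25, 4), so SD(E-bar) = 8 / 4 = 2
E_bar = NormalDist(mu, sigma / n ** 0.5)

p = 1 - E_bar.cdf(22)  # P(E-bar >= 22)
print(round(p, 4))
```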
3. Suppose the amount of cash a Business student has on them is normally distributed with a mean of $30
and standard deviation of $12. If you mug nine Engineering students and seven Business students, what is the
probability you get more cash from the business students?
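Here we care about the difference of the outcomes, i.e. we want to know:
$$P(U_B > U_E) = P(U_B - U_E > 0)$$
We now need to find the distribution of the difference of two random variables. Everything is normal so things all
stay normal; we just need to get the expectation and variance of
$$U_B - U_E$$
where U_B is the total from the seven Business students and U_E is the total from the nine Engineering students:
$$E(U_B - U_E) = 7(30) - 9(25) = -15 \qquad Var(U_B - U_E) = 7(12^2) + 9(8^2) = 1584$$
$$U_B - U_E \sim N(-15,\ 1584)$$
$$P(U_B > U_E) = P(U_B - U_E > 0) = 1 - \Phi\left(\frac{0 - (-15)}{\sqrt{1584}}\right) = 1 - \Phi(0.38) = 1 - 0.648 = 0.352$$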
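The mean and variance of the difference, and the final probability, can be verified with a short script. This is a sketch that assumes the two groups of students are independent; software keeps z unrounded, so the answer matches the table value only to about three decimals.

```python
import math
from statistics import NormalDist

n_b, mu_b, sd_b = 7, 30, 12  # Business students
n_e, mu_e, sd_e = 9, 25, 8   # Engineering students

# Means subtract; variances add, even for a difference
mean_diff = n_b * mu_b - n_e * mu_e          # 210 - 225 = -15
var_diff = n_b * sd_b ** 2 + n_e * sd_e ** 2  # 1008 + 576 = 1584

D = NormalDist(mean_diff, math.sqrt(var_diff))  # U_B - U_E
p = 1 - D.cdf(0)                                # P(U_B - U_E > 0)
print(mean_diff, var_diff, round(p, 3))
```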