Lecture 2 Review of Probability and Statistics
Tingting Wu
➢ Data sampling
• Stochastic process for representative sample data (Typical causes for Sample Bias?)
(A: Daily expense for Jerry)
Date        Daily Expense ($)
01/01/2023  25
02/01/2023  37
03/01/2023  78
04/01/2023  25
…...        …..
16/01/2023  76

(B: Surveys of consumption for alcohol)
Month  Name   Alcohol (ml)
Jan    Jenny  100
Jan    Mary   293
Jan    Ting   20
Jan    Ying   30
Feb    Tom    120
Feb    Jerry  20

(C: Monthly Survey of consumption for alcohol)
Month  Name   Alcohol (ml)
Jan    Jenny  100
Jan    Mary   293
Jan    Ting   20
Feb    Jenny  200
Feb    Mary   124
Feb    Ting   30
Types of Datasets
Can you tell the type of each dataset?
But, what is the importance of knowing probability rules in real life situations…?
Practice for probability
An urn contains 20 red and 10 blue balls. Two balls are drawn from the urn one after the other without replacement.
• What is the probability that both balls drawn are red?
• In the first draw, $P(A) = P(\text{red ball in first draw}) = \frac{20}{30}$
• In the second draw, the conditional probability of B (red ball in second draw) given A is $P(B|A) = \frac{19}{29}$
• By the multiplication rule of probability,
$$P(A \cap B) = P(A) \times P(B|A) = \frac{20}{30} \times \frac{19}{29} = \frac{38}{87}$$
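The multiplication rule above can be checked with exact fractions. The following is a minimal sketch; the function names `prob_both_red` and `prob_both_red_by_counting` are illustrative and not part of the lecture. The second function cross-checks the rule by counting ordered pairs of balls.

```python
from fractions import Fraction
from itertools import permutations


def prob_both_red(red: int, blue: int) -> Fraction:
    """P(both draws are red) for two draws without replacement."""
    total = red + blue
    p_a = Fraction(red, total)                  # P(A): first ball is red
    p_b_given_a = Fraction(red - 1, total - 1)  # P(B|A): second is red given first was red
    return p_a * p_b_given_a                    # multiplication rule


def prob_both_red_by_counting(red: int, blue: int) -> Fraction:
    """Same probability, by enumerating every ordered pair of distinct balls."""
    balls = ["R"] * red + ["B"] * blue
    pairs = list(permutations(range(len(balls)), 2))
    hits = sum(1 for i, j in pairs if balls[i] == "R" and balls[j] == "R")
    return Fraction(hits, len(pairs))


print(prob_both_red(20, 10))              # 38/87
print(prob_both_red_by_counting(20, 10))  # 38/87
```

Both approaches give 38/87, roughly 0.437.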
Random Variables
Some definitions:
Outcomes are the mutually exclusive potential results of a random process
• The number of days it will rain next week
A random variable is a numerical summary of a random outcome
• The number of days it will rain next week is random and takes on a numerical value (0, 1, 2, 3, 4, 5, 6 or 7)
• Let the random variable S be the number of days it will rain in the last week of August.
Probability Distribution of S
Outcome (s) 0 1 2 3 4 5 6 7
Probability 0.2 0.25 0.2 0.15 0.1 0.05 0.04 0.01
Example: Pr(S = 1) = 0.25
Cumulative distribution of discrete random variables
A cumulative probability distribution is the probability that the random variable is less
than or equal to a particular value
• The probability that it will rain less than or equal to s days, F(s) = Pr(S ≤ s) is the cumulative
probability distribution of S evaluated at s.
• A cumulative probability distribution is also referred to as a cumulative distribution or a CDF.
[Figure: probability distribution (PDF) and cumulative distribution (CDF) of S]
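The cumulative distribution F(s) = Pr(S ≤ s) can be built from the probability table of S by a running sum. This is a small sketch using the probabilities listed above; the names `outcomes`, `probs` and `cdf` are chosen here for illustration.

```python
from itertools import accumulate

# Probability distribution of S (number of rainy days), taken from the table above
outcomes = [0, 1, 2, 3, 4, 5, 6, 7]
probs = [0.20, 0.25, 0.20, 0.15, 0.10, 0.05, 0.04, 0.01]

# Running sum of the probabilities gives F(s) at each outcome
cdf_values = list(accumulate(probs))


def cdf(s: float) -> float:
    """F(s) = Pr(S <= s): add up the probabilities of all outcomes not exceeding s."""
    return sum(p for o, p in zip(outcomes, probs) if o <= s)


for s, p, big_f in zip(outcomes, probs, cdf_values):
    print(f"s={s}  Pr(S=s)={p:.2f}  F(s)={big_f:.2f}")
```

For example, F(1) = 0.20 + 0.25 = 0.45, and F(7) = 1 because the probabilities sum to one.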
Moments of a Probability Distribution
Mean (expected value)
The mean or expected value of a random variable X is the average value over many repeated trials or occurrences.
It measures the location of the central point of the density curve of X.
• The mean or expected value of a continuous random variable X with probability density function $f(x)$ is
$$E(X) = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
Variance
The variance of a random variable X is the expected value of the square of the deviation of X
from its mean.
• The variance of X is a measure of the dispersion/spread of the density of X.
• Suppose a discrete random variable X takes on k possible values $x_1, \dots, x_k$. Then
$$\sigma_X^2 = Var(X) = E\left[(X - E(X))^2\right] = \sum_{i=1}^{k} (x_i - E(X))^2 \cdot \Pr(X = x_i)$$
The mean is typically more strongly influenced by extreme values than the median.
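As an illustration of these moments, the sketch below applies the discrete formulas to the rainy-days variable S from the earlier table. It uses the discrete analogue of the expected-value formula, E(S) = Σ s · Pr(S = s), which is an assumption added here since the slide states the mean only for the continuous case.

```python
# Probability distribution of S (number of rainy days), from the earlier table
outcomes = [0, 1, 2, 3, 4, 5, 6, 7]
probs = [0.20, 0.25, 0.20, 0.15, 0.10, 0.05, 0.04, 0.01]

# E(S): probability-weighted average of the outcomes
mean = sum(s * p for s, p in zip(outcomes, probs))

# Var(S): probability-weighted average of squared deviations from the mean
variance = sum((s - mean) ** 2 * p for s, p in zip(outcomes, probs))

print(round(mean, 4))      # 2.06
print(round(variance, 4))  # 2.9364
```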
Often used probability distributions in Econometrics
Normal Distribution
“Everyone believes in the normal distribution: experimentalists think that it is a mathematical theorem, while mathematicians believe it to be an experimental fact.”
- Gabriel Lippmann, France, 1845-1921 (1908 Nobel Prize in Physics)
• The most often encountered probability distribution (density) function in Econometrics is the Normal
distribution:
Normal Distribution
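A normal distribution with mean µ and standard deviation σ (variance $\sigma^2$) is denoted as $N(\mu, \sigma^2)$
$$P(a < X < b) = \int_a^b f(x) \, dx$$
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}$$
[Figure: normal density curve with a and b marked on the horizontal axis]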
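To make the link between the density f(x) and the probability P(a < X < b) concrete, the sketch below evaluates the normal density and approximates the integral numerically. The midpoint rule and the names `normal_pdf` and `prob_between` are choices made for this illustration only; they are not the lecture's method.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) evaluated at x."""
    coef = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def prob_between(a: float, b: float, mu: float = 0.0, sigma: float = 1.0,
                 steps: int = 10_000) -> float:
    """Approximate P(a < X < b) as the area under the density (midpoint rule)."""
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        midpoint = a + (i + 0.5) * width
        total += normal_pdf(midpoint, mu, sigma) * width
    return total


# Area within one standard deviation of the mean
print(round(prob_between(-1.0, 1.0), 4))  # 0.6827
```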
Standard Normal Distribution
• To look up probabilities of a general normally distributed random variable $X \sim N(\mu, \sigma^2)$,
• we need to standardize (rescale) X to obtain the standard normal random variable Z:
$$Z = \frac{X - \mu}{\sigma}$$
This transformation is called standardization, and the resulting value is the z-score.
Find the probability that a z-score lies between 0 and 1. (Z table: cumulative from the mean)
z    .00    .01    .02
0.0  .0000  .0040  .0080
0.1  .0398  .0438  .0478
1. Check the Z table (row => integer part + first decimal; column => second decimal).
2. The entry for z = 1.00 is .3413, so the area between 0 and 1 is 34.13%.
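The table lookup can be reproduced in code. The sketch below computes the "cumulative from the mean" area Pr(0 < Z < z) with the error function from Python's standard library; the helper names `z_score` and `area_from_mean` are illustrative, not from the lecture.

```python
import math


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def area_from_mean(z: float) -> float:
    """Pr(0 < Z < z) for z >= 0, the quantity a 'cumulative from mean' Z table lists."""
    return 0.5 * math.erf(z / math.sqrt(2.0))


print(round(area_from_mean(1.00), 4))  # 0.3413
print(round(area_from_mean(0.12), 4))  # 0.0478 (row 0.1, column .02)

# A value of 110 from N(100, 10^2) standardizes to z = 1
print(z_score(110.0, 100.0, 10.0))     # 1.0
```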
Probabilities for Standardized Normal Distribution
Find the area between -2 and 1 under the standard normal distribution curve.
[Figure: area from -2 to 1 = area from -2 to 0 (48%) + area from 0 to 1 (34%)]
The area between -2 and 0 is the same as the area between 0 and 2, by symmetry.
The area between -2 and 0 is about 48% and the area between 0 and 1 is about 34%.
Hence the area between -2 and 1 is about 48% + 34% = 82%.
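The same area can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch of the calculation above; the variable names are illustrative.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

left_part = Z.cdf(0) - Z.cdf(-2)   # area between -2 and 0
right_part = Z.cdf(1) - Z.cdf(0)   # area between 0 and 1
total_area = Z.cdf(1) - Z.cdf(-2)  # area between -2 and 1

print(round(left_part, 4))   # 0.4772
print(round(right_part, 4))  # 0.3413
print(round(total_area, 4))  # 0.8186
```

The exact area is about 81.9%; the 82% on the slide comes from rounding the two pieces to 48% and 34% before adding.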
The Chi-Squared Distribution
The chi-squared distribution is the distribution of the sum of m squared independent standard
normal random variables
• If we have m independent standard normal random variables $Z_1, \dots, Z_m$, then the sum of the squares of these random variables, $\sum_{i=1}^{m} Z_i^2$, has a chi-squared distribution with m degrees of freedom: $\chi_m^2$
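The definition can be illustrated by simulation: draw m independent standard normals, square them, and add them up. The sketch below is a hedged illustration with an arbitrary seed, m = 3 and 20,000 draws (all choices made here, not in the lecture). It relies on the standard fact that a chi-squared variable with m degrees of freedom has mean m.

```python
import random


def chi_squared_draw(m: int, rng: random.Random) -> float:
    """One draw from chi-squared(m): the sum of m squared independent N(0, 1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))


rng = random.Random(0)  # fixed seed so the run is reproducible
m = 3                   # degrees of freedom (illustrative choice)
draws = [chi_squared_draw(m, rng) for _ in range(20_000)]

sample_mean = sum(draws) / len(draws)
print(sample_mean)  # should be close to m = 3
```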
$\Pr(Y = 1) = \Pr(X = 1, Y = 1) + \Pr(X = 0, Y = 1) = 0.22$
$\Pr(Y = y \mid X = x) = \Pr(Y = y)$
• If X and Y are independent this also implies
$\Pr(Y = y, X = x) = \Pr(X = x) \cdot \Pr(Y = y)$ (see slide 8)
The covariance between rain (Y) and it being very cold (X):
$$Cov(X, Y) = (1 - 0.3)(1 - 0.22) \cdot 0.15 + (1 - 0.3)(0 - 0.22) \cdot 0.15 + (0 - 0.3)(1 - 0.22) \cdot 0.07 + (0 - 0.3)(0 - 0.22) \cdot 0.63 = 0.084$$
Note: If X and Y are independently distributed, then $Cov(X, Y) = 0$ (but not vice versa!)
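The covariance calculation can be reproduced from the four joint probabilities that appear in it. In the sketch below the dictionary `joint` is keyed by (x, y), with X = 1 meaning very cold and Y = 1 meaning rain; the layout and names are illustrative choices.

```python
# Joint distribution Pr(X = x, Y = y), keyed by (x, y); values from the covariance formula
joint = {(1, 1): 0.15, (1, 0): 0.15, (0, 1): 0.07, (0, 0): 0.63}

# Marginal probabilities: sum the joint probabilities over the other variable
p_x1 = sum(p for (x, y), p in joint.items() if x == 1)  # Pr(X = 1)
p_y1 = sum(p for (x, y), p in joint.items() if y == 1)  # Pr(Y = 1)

# For 0/1 variables the mean equals the probability of a 1
mean_x, mean_y = p_x1, p_y1

# Cov(X, Y) = sum over all cells of (x - E(X)) (y - E(Y)) Pr(X = x, Y = y)
cov = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

print(round(p_x1, 2), round(p_y1, 2))  # 0.3 0.22
print(round(cov, 3))                   # 0.084

# Independence would require Pr(X=1, Y=1) = Pr(X=1) Pr(Y=1); here 0.15 vs 0.066
print(round(p_x1 * p_y1, 3))           # 0.066
```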
Two Random Variables: Correlation
• The units of the covariance of X and Y are the units of X multiplied by the units of Y.
• This makes it hard to interpret the size of the covariance.
$$Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
• A correlation is always between -1 and 1 and X and Y are uncorrelated if Corr (X, Y) = 0
• If X and Y are uncorrelated this does not necessarily imply mean Independence!
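The sketch below computes a correlation from a joint distribution and applies it to two cases. The first is the rain (Y) and very cold (X) example. The second is a hypothetical distribution, not from the lecture, where X takes the values -1, 0, 1 with equal probability and Y = X². It shows that a zero correlation does not rule out dependence. The function names are illustrative.

```python
import math


def moments(joint):
    """Return (Cov(X,Y), Var(X), Var(Y)) for a joint distribution keyed by (x, y)."""
    mean_x = sum(x * p for (x, y), p in joint.items())
    mean_y = sum(y * p for (x, y), p in joint.items())
    cov = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())
    var_x = sum((x - mean_x) ** 2 * p for (x, y), p in joint.items())
    var_y = sum((y - mean_y) ** 2 * p for (x, y), p in joint.items())
    return cov, var_x, var_y


def corr(joint):
    """Corr(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y))."""
    cov, var_x, var_y = moments(joint)
    return cov / math.sqrt(var_x * var_y)


# Rain (Y) and very cold (X)
rain_cold = {(1, 1): 0.15, (1, 0): 0.15, (0, 1): 0.07, (0, 0): 0.63}
print(round(corr(rain_cold), 4))  # 0.4425

# Hypothetical example: Y = X^2 is fully determined by X, yet uncorrelated with it
third = 1.0 / 3.0
parabola = {(-1, 1): third, (0, 0): third, (1, 1): third}
print(abs(corr(parabola)) < 1e-12)  # True

# E(Y | X = 1) differs from E(Y), so Y is not mean independent of X
mean_y = sum(y * p for (x, y), p in parabola.items())
p_x1 = sum(p for (x, y), p in parabola.items() if x == 1)
mean_y_given_x1 = sum(y * p for (x, y), p in parabola.items() if x == 1) / p_x1
print(round(mean_y, 4), mean_y_given_x1)  # 0.6667 1.0
```

Because the correlation divides out the units of X and Y, the 0.4425 for the rain example is easier to interpret than the covariance of 0.084.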
Two Random Variables: Correlation
✓ corr(X,Y) ≠ 0?
✓ corr(X,Y) > 0?
✓ corr(X,Y) < 0?
✓ corr(X,Y) = 0?
✓ Others?
Two Random Variables: Correlation for a sample dataset
Example: Association between the scores of midterm and final
1. Collect the data for two variables over each individual (Cross-sectional dataset)
Midterm and Final tables, each with columns: Rank, id, major, name, score
[Figure: scatter plot of final score against midterm score (midterm on the horizontal axis as the independent variable), with a 45-degree line]
• The midterm score helps a lot in predicting the final score
Two Random Variables: Correlation for a sample dataset
Population correlation $\rho_{XY}$:
$$\rho_{XY} = Corr(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$
x   y
1   5
3   9
4   7
5   1
7   13
$\bar{x} = 4$, $s_x = 2.24$
$\bar{y} = 7$, $s_y = 4.47$
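The sample correlation for this small dataset can be worked out as follows. This sketch assumes the usual sample formulas with an n − 1 divisor, which is consistent with the reported values of s_x and s_y; the function names are illustrative.

```python
import math

x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]


def mean(values):
    return sum(values) / len(values)


def sample_cov(a, b):
    """Sample covariance with an n - 1 divisor (assumed convention)."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)


def sample_std(a):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_cov(a, a))


s_x, s_y = sample_std(x), sample_std(y)
s_xy = sample_cov(x, y)
r = s_xy / (s_x * s_y)  # sample counterpart of the population correlation

print(mean(x), mean(y))                # 4.0 7.0
print(round(s_x, 2), round(s_y, 2))    # 2.24 4.47
print(round(s_xy, 2), round(r, 2))     # 4.0 0.4
```

The sample covariance is 4 and the sample correlation is 0.4, a moderate positive association.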
1. Correlation by Chance
• Replicate with a big enough sample size
• Simultaneity
Income and investment
This is it for today!