Cheat Sheet


Uncertainty, Data, and Judgment
Michèle Hibon and Spyros Zoumpoulis

Binomial distribution

1. n ‘Bernoulli trials’: n trials such that each trial has only two possible outcomes: ‘success’ (1) or ‘failure’ (0)
2. P(success) = p is the same for all trials; P(failure) = 1 − p = q
3. All trials are independent

Interested in the probability of observing X = # of successes in n trials.

μ = np,  σ = √(np(1 − p))
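These formulas are easy to check numerically. A minimal Python sketch (the values n = 10 and p = 0.3 are purely illustrative):

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Illustrative values: n = 10 trials, probability of success p = 0.3
n, p = 10, 0.3
mean = n * p                    # mu = np
sd = sqrt(n * p * (1 - p))      # sigma = sqrt(np(1 - p))
prob_3 = binomial_pmf(3, n, p)  # P(X = 3), i.e., exactly 3 successes
```

Summing the pmf over k = 0, …, n gives 1, which is a quick sanity check on the formula.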

Poisson distribution

1. Observe # of ‘successes’ or ‘arrivals’ over some continuum (such as time or space)
2. The expected # of ‘successes’ or ‘arrivals’ is constant in the unit of measure (λ = expected # of arrivals in the interval)
3. The ‘successes’ or ‘arrivals’ are independent

Interested in the probability of observing X = # of arrivals in a given period of time.

μ = λ,  σ = √λ

Normal distribution

[Figure: bell curve, with the X axis marked μ − 3σ … μ + 3σ and the corresponding Z axis marked −3 … +3]

The Z-score tells you how many standard deviations you are away from the mean.
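A minimal Poisson sketch, with λ = 4 as an illustrative arrival rate:

```python
from math import exp, factorial, sqrt

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): k arrivals when lam arrivals are expected."""
    return exp(-lam) * lam**k / factorial(k)

# Illustrative value: lam = 4 expected arrivals per interval
lam = 4
mean = lam        # mu = lambda
sd = sqrt(lam)    # sigma = sqrt(lambda)

# P(X <= 2): probability of at most 2 arrivals in the interval
prob_at_most_2 = sum(poisson_pmf(k, lam) for k in range(3))
```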
Transformations

z = (x − μ)/σ,  so  Prob(x > x₁) = Prob(z > z₁), where z₁ = (x₁ − μ)/σ

In the table of Normal Probability (areas to the right of indicated Z values):

Prob(Z > value) = F(Z)
Prob(Z < value) = 1 − F(Z)
Prob(Z₁ < Z < Z₂) = F(Z₁) − F(Z₂)

Central limit theorem (CLT)

For a random sample of size n taken from any population X with mean μ and standard deviation σ, the sample mean X̄ follows* a Normal distribution, with

• mean equal to the population mean: μ_X̄ = μ
• standard deviation (standard error): σ_X̄ = σ/√n

* This is an approximation. It works well for n ≥ 30.
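The transformation and the standard error can be sketched in a few lines; the right-tail normal probability uses the standard identity with `math.erf`, and the values μ = 100, σ = 15, x₁ = 130, n = 36 are illustrative:

```python
from math import erf, sqrt

def prob_z_greater(z):
    """Area to the right of z under the standard normal curve."""
    return 0.5 * (1 - erf(z / sqrt(2)))

# Transformation: x ~ Normal(mu, sigma); what is Prob(x > x1)?
mu, sigma, x1 = 100, 15, 130     # illustrative values
z1 = (x1 - mu) / sigma           # z-score: 2 standard deviations above the mean
p = prob_z_greater(z1)           # Prob(Z > 2), roughly 0.0228

# CLT: for a sample of size n, the standard error of the sample mean is sigma / sqrt(n)
n = 36
std_error = sigma / sqrt(n)
```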

Confidence intervals for means

[X̄ − Z_{α/2} · σ/√n,  X̄ + Z_{α/2} · σ/√n]

If we don’t know the population standard deviation σ, use the sample standard deviation s instead.

Confidence intervals for proportions

With p̂ = x/n, the interval is

[p̂ − Z_{α/2} · √(p̂(1 − p̂)/n),  p̂ + Z_{α/2} · √(p̂(1 − p̂)/n)]

Sample size n for a given Max Error E:  n = Z²_{α/2} / (4E²)

Hypothesis Testing: Two-sided test

H₀: μ = μ₀
Hₐ: μ ≠ μ₀

z_obs = (X̄ − μ₀) / (s/√n)

[Figure: standard normal curve with rejection regions in both tails: reject H₀ if z_obs < −Z_{α/2} or z_obs > Z_{α/2}; accept H₀ otherwise]
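A minimal sketch of all three interval formulas at the 95% level (Z_{α/2} = 1.96); the sample numbers are made up for illustration:

```python
from math import sqrt

Z = 1.96  # Z_{alpha/2} for a 95% confidence level

# CI for a mean (illustrative: sample mean 50, sample sd 10, n = 100)
xbar, s, n = 50, 10, 100
ci_mean = (xbar - Z * s / sqrt(n), xbar + Z * s / sqrt(n))

# CI for a proportion (illustrative: x = 40 successes out of n = 100)
x = 40
phat = x / n
half = Z * sqrt(phat * (1 - phat) / n)   # half-width of the interval
ci_prop = (phat - half, phat + half)

# Sample size for a maximum error E (conservative, uses p = 0.5): n = Z^2 / (4 E^2)
E = 0.03
n_needed = Z**2 / (4 * E**2)   # about 1067.1; round up to 1068 respondents
```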
p-value = probability of a result as or more extreme (in the direction of Hₐ) as that observed, assuming H₀ is true.

For a two-sided test, the p-value is the probability of a result at least as extreme (on either side) as that observed, assuming H₀ is true:

p-value = 2 · P(Z > |Z_observed|)

Decision rule:
If p-value < α, you reject H₀.
If p-value > α, you fail to reject (i.e., you accept) H₀.

Regression analysis

• Regression analysis is used to:
  - Predict the value of a dependent variable based on at least one independent variable
  - Explain the impact of changes in an independent variable on the dependent variable
• Jargon:
  - Dependent variable: the variable we wish to predict or explain
  - Independent variable: the variable used to explain the dependent variable
• Types of regression models:
  - Simple regression: use one independent variable to predict the dependent variable
  - Multiple regression: use more than one independent variable to predict the dependent variable
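The two-sided p-value and decision rule can be sketched as follows (the sample numbers are illustrative):

```python
from math import erf, sqrt

def two_sided_p_value(z_obs):
    """p-value = 2 * P(Z > |z_obs|) under the standard normal distribution."""
    tail = 0.5 * (1 - erf(abs(z_obs) / sqrt(2)))
    return 2 * tail

# Illustrative test: xbar = 52, mu0 = 50, s = 10, n = 100
z_obs = (52 - 50) / (10 / sqrt(100))   # z_obs = 2.0
p = two_sided_p_value(z_obs)

alpha = 0.05
reject_h0 = p < alpha   # compare the p-value with the significance level
```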

Postulated vs. estimated model

Postulated model (for the population):  Y = A + B₁X₁ + B₂X₂ + B₃X₃ + ⋯ + ε
Estimated model (based on a sample):  Y = a + b₁X₁ + b₂X₂ + b₃X₃ + ⋯ + ε̂

Definitions:
- Y: dependent variable
- Xᵢ: independent variables
- A, Bᵢ: unknown regression parameters
- a, bᵢ: estimated regression coefficients
- εᵢ: random error term
- ε̂ᵢ: residual error

Assumptions:
- εᵢ ~ Normal(0, σ²)
- the εᵢ’s are independent

Goal: estimate A and the Bᵢ’s (based on a sample). The prediction produced by the model:

Ŷ = a + b₁X₁ + b₂X₂ + b₃X₃ + ⋯

Testing Bᵢ: is there evidence of a relationship between Xᵢ and Y?

Objective: is Bᵢ different from zero? If not, we want it out of the model.

Test:  H₀: Bᵢ = 0 (there is no relationship between Xᵢ and Y)
       Hₐ: Bᵢ ≠ 0 (there is a relationship)

Statistic: if H₀ is true, then

t-stat = (bᵢ − Bᵢ)/s_bᵢ = bᵢ/s_bᵢ ~ Z (approx.)

Decision: reject H₀ at the α level if p-value < α or if |t-stat| > Z_{α/2}; then the variable is significant (Y). Otherwise it is not significant (N).

[Figure: standard normal curve with rejection regions beyond −Z_{α/2} and +Z_{α/2}]
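For simple regression the slope, its standard error, and the t-stat can be computed by hand. A minimal sketch with made-up data (the xs/ys values are hypothetical, chosen so that y is roughly 2x plus noise):

```python
from math import sqrt

# Hypothetical sample: y is approximately 2x with some noise
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx  # slope b1
a = ybar - b * xbar                                             # intercept a

# Standard error of the slope and the t-stat for H0: B1 = 0
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_b = sqrt(sse / (n - 2) / sxx)
t_stat = b / s_b

significant = abs(t_stat) > 1.96  # compare with Z_{alpha/2} at the 5% level
```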
Coefficient of determination R² and adjusted R²

• Measures of how good the model is
• R² measures the proportion of variation in the dependent variable (Y) that is explained by the model
• Always true: 0 ≤ R² ≤ 1
• If more variables are added, R² goes up or stays the same, regardless of what variables are added; that’s not good.
• Adjusted R² accounts for the number of variables in the model. It penalizes having too many variables (“overfitting”) and is used to compare models of different sizes. In contrast to R², it can go up or down when a variable is added to the model.

Possible problems with regression

• Multicollinearity (MC)
  - A strong linear relationship between two independent variables
• Why is it a problem?
  - One of the independent variables becomes redundant; this may result in unstable/unreliable coefficients
• How to spot it?
  - Rule of thumb: the correlation between two independent variables is higher than 0.7 in absolute value
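Both measures follow from the usual sums-of-squares decomposition (R² = 1 − SSE/SST, with the adjusted version dividing each sum by its degrees of freedom). A minimal sketch with illustrative values:

```python
# Illustrative sums of squares:
# SST = total variation in Y, SSE = unexplained variation,
# n = number of observations, k = number of independent variables
sst, sse, n, k = 200.0, 40.0, 30, 3

r2 = 1 - sse / sst                                   # proportion of variation explained
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))   # penalizes extra variables
```

Note that adj_r2 is always below r2 when k > 0, which is exactly the overfitting penalty described above.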

Dealing with Multicollinearity

Two independent variables are highly correlated (positively or negatively): |Corr| > 0.7.

• If neither is significant: remove the one with the highest p-value (or, equivalently, the smallest absolute t-stat).
• If one is significant: remove the one which is not significant.
• If both are significant: does the model make sense? (Do the signs of the coefficients make sense?)
  - If yes, leave either both or only one in, whichever model has higher adjusted R².
  - If not, there could be a computational problem; remove the one with the highest p-value.

Confidence intervals for Bⱼ

100(1 − α)% interval for a regression coefficient:

lower limit: bⱼ − Z_{α/2} · S_bⱼ
upper limit: bⱼ + Z_{α/2} · S_bⱼ

Confidence intervals for a forecast

100(1 − α)% interval for a forecast:

lower limit: Ŷ − Z_{α/2} · StDevReg
upper limit: Ŷ + Z_{α/2} · StDevReg
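Both intervals have the same shape (estimate ± Z_{α/2} times a standard error); a minimal sketch at the 95% level, with illustrative values for the coefficient, its standard error, the forecast, and StDevReg:

```python
Z = 1.96  # Z_{alpha/2} for a 95% interval

# CI for a regression coefficient (illustrative: b_j = 2.5, S_bj = 0.4)
b_j, s_bj = 2.5, 0.4
ci_coef = (b_j - Z * s_bj, b_j + Z * s_bj)

# CI for a forecast (illustrative: Y_hat = 120.0, StDevReg = 8.0)
y_hat, stdev_reg = 120.0, 8.0
ci_forecast = (y_hat - Z * stdev_reg, y_hat + Z * stdev_reg)
```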
