Stat Final Notes111


CH1: DESCRIPTIVE STATISTICS

- Categorical: Nominal (region); Ordinal (low, high) | Numerical: has units; interval (temp), ratio (age)
- Boxplot: 5 no. summary: min q1 median q3 max | Mean > Median -> Right-skewed curve; vice versa
- Quartiles (Q): even no. of data points > split the list into two halves and find the median of each; odd no. of data points > take out Q2, the left and right sides are the two halves
- Variance (Var): | Mean & SD: affected by extreme values; Md & Q: robust

Categorical: Two-way / Contingency tables

- (With proportion/%) Joint distribution: each cell gives proportion of TOTAL sample
- Marginal distribution: the two 'Total' margins at the edges of the table
- Conditional distribution: denominator = total no. of the condition (the small totals at the edge > eg 577 on above)
- Lurking variable: overlooked variable w/ impt effect | Simpson's Paradox: change in direction of variables' association when data are separated into groups defined by a THIRD variable
- How to answer: 3rd variable: __;. *The worst x among the 3rd variable X* is the worst as rate of delay is the *highest/*. *A**hv a lot/ worst option*, B no. absolute no. of delays for *A/* > relatively *smaller/
➢ Numerical: Scatterplot > explanatory x-axis, response y-axis | Correlation (r): Between -1 to 1, 0 = no r
- High/moderate/low; +ve/ -ve/zero; no units; NOT robust | Correlation does NOT imply causation

2. SD(0.5X+0.5Y) = √Var(0.5X+0.5Y): [note whether X and Y are dependent or not! > if dependent, use the 'given any 2, find the 3rd' formula]
Properties of variation
- Bernoulli Distribution: 1 = success, 0 = failure; fixed, constant probi of success 'p'; independence
➢ P(X=1) = p, P(X=0) = 1-p; E(X) = p; Var(X) = p(1-p) = pq
- Binomial Distribution: Criteria: same as above & n identical trials; X = total no. of successes in n trials; p = probi of success in EACH trial > for X ~ Binomial(n, p): E(X) = np; Var(X) = npq
Steps to solve questions
- 1st find n and p (may be from ratios) Eg: US boys : girls at birth = 1.09:1, what proportion of US fam w/ 6 children hv at least 3 boys?
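The boys-and-girls question in the binomial notes can be worked through in code as a rough sketch. It assumes births are independent with a constant p = 1.09/2.09 read off the 1.09:1 ratio; `binomial_pmf` is only an illustrative helper name, not something taken from the notes.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Step 1: find n and p. Ratio boys : girls = 1.09 : 1, so p = 1.09 / (1.09 + 1).
n = 6
p = 1.09 / 2.09

# Step 2: "at least 3 boys" means P(X >= 3) = P(3) + P(4) + P(5) + P(6).
p_at_least_3 = sum(binomial_pmf(k, n, p) for k in range(3, n + 1))

# Mean and variance from E(X) = np and Var(X) = npq.
mean = n * p
variance = n * p * (1 - p)

print(f"p = {p:.4f}")
print(f"P(X >= 3) = {p_at_least_3:.4f}")
print(f"E(X) = {mean:.4f}, Var(X) = {variance:.4f}")
```

Summing from 3 to n is the direct route; the complement 1 - P(X ≤ 2) gives the same number with fewer terms.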
CH2: PROBABILITY (PROBI)
- A∪B: OR; A∩B: AND; A^C: complement | Mutually exclusive > if one happens, the other cannot > roll a die and get 6 = cannot get 1-5
3. Addition Rule for MuEx: P(A∪B) = P(A) + P(B); P(A∩B) = 0 | 5. General Addition Rule: P(A∪B) = P(A) + P(B) - P(A∩B) | Independent > one does not affect the other > roll a die twice, events of the 2 trials are ind. > must satisfy Rule 6
- 6. Multiplication Rule (for independent events): P(A∩B) = P(A)*P(B) | Bayes' Rule: P(A|B) = P(B|A)*P(A)/P(B)
- Conditional: P(A|B) = P(A∩B)/P(B) > P(B) ≠ 0
- Multiplicative Rule for dependent events: P(A∩B) = P(A|B)*P(B) = P(B|A)*P(A)
- Law of Total Probi: P(Ace) = P(Ace and Red) + P(Ace and Black) = 2/52 + 2/52 = 4/52
➢ Steps to solve questions: frequency (no. of x)/total no. OR P(A∩B) = P(A)*P(B) > when outcomes involve 2/+ ind. events

CH3: DISCRETE RANDOM VARIABLE
Discrete (PMF)
- Covariance (Cov): > Cov(X, X) = Var(X)
- Correlation (r)
- rXY = Cov(X, Y)/(SD(X)*SD(Y)) > given any 2 of the terms, the 3rd can be found
- Possible question type: Given SDX, SDY, r; find SD(X+Y): 1. Cov(X,Y) = r*SDX*SDY
- 2. Var(X+Y) = SDX² + SDY² + 2Cov(X,Y) 3. Take the square root to find SD(X+Y)
- Coefficient of Variation (CV) > CVX = SDX/µX; Sharpe Ratio (S) > S(X) = (µ - rf)/SD
- For mixed portfolio S(0.5X+0.5Y): 1. E(0.5X+0.5Y) = 0.5E(X) + 0.5E(Y)

CH4: CONTINUOUS RANDOM VARIABLE (PDF)
- Uniform Distribution > f(x) = 1/(d-c) for c≤x≤d; 0 otherwise (x<c; x>d) <state both cases!
- X ~ Uniform[c, d] #c<d | P(a≤X≤b): c ≤ a<b ≤ d: (b-a)/(d-c); a<c / d<b: area under curve between a & b | µX = median = (c+d)/2; σX = (d-c)/√12
- Empirical Rule | Skewness K3: K3 ≈ 0 'symmetric'; K3 > 0 'right-skewed'
- Normal probi: If X ~ N(µ, σ²), Z = (X-µ)/σ is standard normal, mean 0 and SD 1 > Z ~ N(0,1)
- Steps to solve questions: 1. Use X to calculate Z 2. Use Z to look up the corresponding probi 3. Check whether the target probi lies above or below Z > if it lies above Z > look up the probi of -Z 4. For the probi between two values > subtract
- Normal approx to Binomial: n large, p near 0.5 > use X ~ N(µ, σ²) > µ = np; σ² = npq | np ≥ 5; nq ≥ 5

CH5: SAMPLING DISTRIBUTION (Samp dis)
- Central Limit Theorem (CLT): µ(X̄) = µ; σ(X̄) = σ/√n | samp dis = normal when n large enough
- σ(X̄) < σ(popu) | popu σ not given > estimate by SD (p1 √var) | t-distribution #df = n-1 for n < 30
- Smaller df, t-distribution has fatter tails
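The "find SD(X+Y) given SDX, SDY, r" steps from the covariance notes, together with the equal-weight portfolio and Sharpe ratio formulas, can be sketched as below. All input numbers are hypothetical, the function names are illustrative, and the weighted form Var(aX+bY) = a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y) is the standard generalisation of step 2 rather than something spelled out in the notes.

```python
import math


def sd_of_combination(sd_x, sd_y, r, w_x=1.0, w_y=1.0):
    """SD of w_x*X + w_y*Y from SD(X), SD(Y) and the correlation r."""
    cov_xy = r * sd_x * sd_y                                   # step 1: Cov from r
    var = (w_x * sd_x) ** 2 + (w_y * sd_y) ** 2 + 2 * w_x * w_y * cov_xy  # step 2
    return math.sqrt(var)                                      # step 3: square root


def sharpe_ratio(mean_return, risk_free, sd):
    """S = (mu - rf) / SD."""
    return (mean_return - risk_free) / sd


# Hypothetical inputs, chosen only for illustration.
sd_x, sd_y, r = 3.0, 4.0, 0.5
mu_x, mu_y, rf = 8.0, 10.0, 2.0

sd_sum = sd_of_combination(sd_x, sd_y, r)                 # SD(X+Y)
mean_mix = 0.5 * mu_x + 0.5 * mu_y                        # E(0.5X+0.5Y)
sd_mix = sd_of_combination(sd_x, sd_y, r, 0.5, 0.5)       # SD(0.5X+0.5Y)
sharpe_mix = sharpe_ratio(mean_mix, rf, sd_mix)

print(f"SD(X+Y) = {sd_sum:.4f}")
print(f"E(mix) = {mean_mix:.2f}, SD(mix) = {sd_mix:.4f}, Sharpe(mix) = {sharpe_mix:.4f}")
```

With r = 0 the covariance term drops out, which is why the notes stress checking whether X and Y are dependent before taking the square root.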
CH6: CONFIDENCE INTERVAL
- Convert given % of confidence to a > find za/2 from table > find SE w/ given p̂
- Interpretation: We are x% confident that the required popu proportion lies within this interval
- Mean (above is for proportion p): | For n < 30, za/2 changes to ta/2 with df = n-1
- Sample size n requirement > to keep the margin of error within a range, a minimum n is required
- For proportion: ; For mean: (if σ unknown, use SD -,-)
- !! For proportion, L is % range; For mean, L is numerical value range | Round up ans if n not integer

CH7: HYPOTHESIS TESTING
- Type I error: Reject true null hypo > a / significance level > smaller a requires more opposing evd
- Type II error: Do not reject false null hypo > ß | 1 - ß / power: Probi of correctly rejecting false null
- ß varies inversely with a; Small a when cost of rejecting truth (Type I) is high (cancer eg)
- Large a when interest in changing default
//Hypo testing methods// (illustrated by popu MEAN)
❖ Z-test (critical value) > Requirement: popu standard deviation σ is KNOWN; SRS | Popu proportion ↓
➢ 1-sided/Tail tests: For 'less than' Ha > lower-tail test; vice versa
- 1. Cal 'z' of given sample > z = (X̄-µ)/(σ/√n) 2. Find 'za' from normal table (eg a=0.05, za=1.645)
3. Rejection rule: Reject H0 in favor of Ha if z > za ('greater than' Ha); vice versa
- How to phrase it: Reject H0: … in favor of Ha: … at the … significance level
➢ 2-sided test: For 'not equal to' Ha | 1. Find a/2 2. Find za/2 & -za/2
- Rejection rule: Reject H0 in favor of Ha if z > za/2 OR z < -za/2 !! Different Ha, different rejection rules
- For n<30 -> apply t-test pls! -> dun forget to still do a/2 for 2-sided!
❖ P-value: Rejection rule: p-value ≤ a, we can reject H0…
➢ 1-sided/Tail tests: 1. Cal 'z' of given sample
- 2. Find area under curve to RIGHT ('greater than' Ha) of 'z' on table > p-value!
- Apply rejection rule > smaller than a > reject H0! > this is the z-test done in another way
- 2-sided test: 1. Cal 'p-value' based on calculated 'z' > then ×2 > REAL P-VALUE
- Under same rejection rule, reject only when Real p-value ≤ a !! SAME rejection rule all the time !!
❖ Chi-Square Test for Independence > H0: 2 variables statistically independent; Ha: dependent -,-
- Observed table > data collected; Artificial/expected table > assuming 2 variables independent
- Oij = Observed cell frequency; ri = total obs in ith row; cj = total obs in jth column
- Expected cell frequency for ith row & jth column under independence: Eij = (ri*cj)/n
▪ Requirement: each expected frequency ≥ 5; observations obtained from SRS
- χ² = Σ(Oij - Eij)²/Eij > do NOT round Eij to integers! | χ²a with (r-1)(c-1) df | Reject H0 if χ² > χ²a

CH8: (SIMPLE) LINEAR REGRESSION
- Probabilistic: hypothesize deterministic (exact) + random error > Y = ?X + ε
- Y = ß0 + ß1X + ε | ß0: popu y-intercept; ß1: popu slope; ε: random error > E(ε)=0! | µY|X: ß0 + ß1X
- 'Best fit': Least Square Estimation (LSE): minimize Sum of Square Error (SSE)
- 1. b1 = Cov(X,Y)/Var(X) > Cov(X,Y) on top, Var(X) on the bottom 2. Find X̄; Ȳ
3. b0 = Ȳ - b1X̄
4. ŷ = b0 + b1x
- Danger of Extrapolation outside the experimental region > relationship may become curved outside exp region
- For each value of X, value of Y is normally distributed with some mean (may linearly depend on X), and a σ (not depend on X) > the σ is a constant, same for all values > σ of Y is also the σ of all ε
- Model assumptions: LINE 1. Linearity: Mean value of Y has linear relation with X - εi is independent of Xi > E(εi) = 0 > 'mean zero assumption' 2. Independence between errors for different X 3. Normality (distributed) of errors for different X 4. Equal variance for different X
- Residuals (ei / εi-hat) | Mean Square Error (MSE): s² = MSE = SSE/(n-2) | Standard Error: s = √MSE
- Steps to solve: 1. plug the 'X's into ŷ = b0 + b1x to obtain ŷ 2. Sum up all '(y - ŷ)²' for SSE 3. Cal MSE and SE
▪ check for 'LINE': "E": In residual plot evenly spread > homoscedastic; "N": QQ plot > standard normal VS residual quantile; "I": Most likely violated when X is time stamps (time-series data)
- SST = Σ(yi - Ȳ)² | Coefficient of Determination R²: Interpretation: About …% of sample variation in … (Y) can be explained by the linear regression model where we use … (X) to predict … (Y)
- R² = SSR/SST = 1 - SSE/SST | Between 0 to 1: 1 = perfect match; 0 = no linear relationship
- (correlation) r² = R² > can find R² by: 1. Find SD(X), SD(Y), Cov(X,Y) 2. Cal r 3. r² = R²!
- b1 = sxy/s²x = r*sy/sx; sign of r same as sign of b1
- Testing for slope ß1: When Y & X no linear relationship > H0: ß1=0 (default no relation); H1: ß1≠0
- SE of b1: sb1 = s/√Σ(xi - X̄)² > Recap s = √(SSE/(n-2)) | t-test uses df = n-2 !!
- Roughly the same as CH7
- Solving (Given b0, b1, s): Find z-/t-value for a or a/2 from tables > Find sb1 > Cal b1/sb1 > Apply rejection rules
- Test for y-int ß0: everything same except
- Point on regression line corresponding to a particular value X0: ŷ = b0 + b1X0
- Confidence interval for mean value of Y; prediction interval for an individual observed value of Y
- CI: ŷ ± ta/2,(n-2) * s * √(distance value) | PI: ŷ ± ta/2,(n-2) * s * √(1 + distance value) | the term inside the square root (without the 1+): distance value = 1/n + (X0 - X̄)²/Σ(xi - X̄)²
- Confidence Interval solving: Find 95% CI for mean sales when X is 4 (X0); Given b0, b1, s
- Prediction Interval solving: Use 95% PI, predict sales when X is 4 (X0); Given b0, b1, s | everything the same > except the final step: the PI needs the +1! +1! +1!
- Factors affecting interval width (+/- as relationship): Confidence level 1-a (+); Data dispersion s (+); sample size n (-); Distance of X0 from mean X̄ (+) > the arc shape in the figure above!
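The least-squares steps and the SSE / MSE / R² definitions from these regression notes can be sketched as follows. The five data points are made up purely for illustration, and the variable names are this sketch's own, not from the notes.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

# Means of X and Y.
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 1: slope b1 = Cov(X, Y) / Var(X), using sample (n-1) versions of both.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_xx = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
b1 = s_xy / s_xx

# Step 3: intercept b0 = Y-bar - b1 * X-bar.
b0 = y_bar - b1 * x_bar

# Step 4: fitted values y-hat = b0 + b1 * x.
y_hat = [b0 + b1 * xi for xi in x]

# SSE sums the SQUARED residuals; MSE divides by n - 2; s is its square root.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
mse = sse / (n - 2)
s = math.sqrt(mse)

# R^2 = 1 - SSE / SST, where SST is the total variation of Y around its mean.
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")
print(f"SSE = {sse:.4f}, MSE = {mse:.4f}, s = {s:.4f}, R^2 = {r_squared:.4f}")
```

The (n-1) divisors cancel in the slope, so using sums of cross-products and squares directly would give the same b1.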
