3 V ISUAL 5 Continuous random variables 6 Sampling distributions
Card & R Commands All plots have optional arguments: CDF F(x) gives area to the left of x, F −1 (p) expects p µx̄ = µ σx̄ = √ σ (57) by Anthony Tanbakuchi. Version 1.8.2 • main="" sets title is area to the left. n http://www.tanbakuchi.com • xlab="", ylab="" sets x/y-axis label • type="p" for point plot f (x) : probability density (34) r pq ANTHONY @ TANBAKUCHI · COM Z ∞ µ p̂ = p σ p̂ = (58) Get R at: http://www.r-project.org • type="l" for line plot E =µ= x · f (x) dx (35) n R commands: bold typewriter text • type="b" for both points and lines −∞ Ex: plot(x, y, type="b", main="My Plot") rZ ∞ 7 Estimation 1 Misc R Plot Types: σ= (x − µ)2 · f (x) dx (36) hist(x) histogram −∞ 7.1 C ONFIDENCE I NTERVALS To make a vector / store data: x=c(x1, x2, ...) stem(x) stem & leaf F(x) : cumulative prob. density (CDF) (37) Help: general RSiteSearch("Search Phrase") proportion: p̂ ± E, E = zα/2 · σ p̂ (59) Help: function ?functionName boxplot(x) box plot F −1 (x) : inv. cumulative prob. density (38) plot(T) bar plot, T=table(x) mean (σ known): x̄ ± E, E = zα/2 · σx̄ (60) Get column of data from table: Z x tableName$columnName plot(x,y) scatter plot, x, y are ordered vectors F(x) = f (x0 ) dx0 (39) mean (σ unknown, use s): x̄ ± E, E = tα/2 · σx̄ , (61) −∞ List all variables: ls() plot(t,y) time series plot, t, y are ordered vectors curve(expr, xmin,xmax) plot expr involving x p = P(x < x0 ) = F(x0 ) (40) d f = n−1 Delete all variables: rm(list=ls()) 0 x =F −1 (p) (41) (n − 1)s2 (n − 1)s2 √ variance: < σ2 < , (62) x = sqrt(x) (1) 2.4 A SSESSING N ORMALITY p = P(x > a) = 1 − F(a) (42) 2 χR χ2L xn = x∧ n (2) Q-Q plot: qqnorm(x); qqline(x) d f = n−1 p = P(a < x < b) = F(b) − F(a) (43) r n = length(x) (3) pˆ1 qˆ1 pˆ2 qˆ2 3 Probability 5.1 U NIFORM DISTRIBUTION 2 proportions: ∆ p̂ ± zα/2 · + (63) T = table(x) (4) n1 n2 Number of successes x with n possible outcomes. s (Don’t double count!) p = P(u < u0 ) = F(u0 ) s21 s2 2 Descriptive Statistics 2 means (indep): ∆x̄ ± tα/2 · + 2, (64) xA = punif(u’, min=0, max=1) (44) n1 n2 2.1 N UMERICAL P(A) = (17) n u0 = F −1 (p) = qunif(p, min=0, max=1) (45) d f ≈ min (n1 − 1, n2 − 1) Let x=c(x1, x2, x3, ...) P(Ā) = 1 − P(A) (18) sd n matched pairs: d¯ ± tα/2 · √ , di = xi − yi , (65) total = ∑ xi = sum(x) (5) P(A or B) = P(A) + P(B) − P(A and B) (19) 5.2 N ORMAL DISTRIBUTION n i=1 P(A or B) = P(A) + P(B) if A, B mut. excl. (20) 2 d f = n−1 1 (x−µ) min = min(x) (6) −1 P(A and B) = P(A) · P(B|A) (21) f (x) = √ · e 2 σ2 (46) 2πσ 2 max = max(x) (7) 7.2 CI C RITICAL VALUES ( TWO SIDED ) P(A and B) = P(A) · P(B) if A, B independent (22) p = P(z < z0 ) = F(z0 ) = pnorm(z’) (47) six number summary : summary(x) (8) n! = n(n − 1) · · · 1 = factorial(n) (23) zα/2 = Fz−1 (1 − α/2) = qnorm(1-alpha/2) (66) ∑ xi z0 = F −1 (p) = qnorm(p) (48) µ= = mean(x) (9) = n! Perm. no elem. alike (24) p = P(x < x0 ) = F(x0 ) tα/2 = Ft−1 (1 − α/2) = qt(1-alpha/2, df) (67) N n Pk (n − k)! ∑ xi = pnorm(x’, mean=µ, sd=σ) (49) χ2L = Fχ−1 2 (α/2) = qchisq(alpha/2, df) (68) x̄ = = mean(x) (10) n! n = Perm. n1 alike, . . . (25) n1 !n2 ! · · · nk ! x0 = F −1 (p) χ2R = Fχ−1 2 (1 − α/2) = qchisq(1-alpha/2, df) x̃ = P50 = median(x) (11) n! = qnorm(p, mean=µ, sd=σ) (50) (69) nCk = = choose(n,k) (26) s 2 (n − k)!k! ∑ (xi − µ) σ= (12) N 5.3 t- DISTRIBUTION 7.3 R EQUIRED SAMPLE SIZE s 4 Discrete Random Variables z 2 ∑ (xi − x̄) 2 p = P(t < t 0 ) = F(t 0 ) = pt(t’, df) (51) proportion: n = p̂q̂ α/2 , (70) s= = sd(x) (13) E n−1 P(xi ) : probability distribution (27) t 0 = F −1 (p) = qt(p, df) (52) σ s ( p̂ = q̂ = 0.5 if unknown) CV = = (14) E = µ = ∑ xi · P(xi ) (28) µ x̄ χ2 - DISTRIBUTION zα/2 · σ̂ 2
q 5.4 σ = ∑ (xi − µ)2 · P(xi ) (29) mean: n = (71) E 2.2 R ELATIVE STANDING 2 p = P(χ < χ ) = F(χ ) 20 20 x−µ x − x̄ 4.1 B INOMIAL DISTRIBUTION z= = (15) σ s = pchisq(X 2 ’, df) (53) Percentiles: 20 µ = n· p (30) χ =F −1 (p) = qchisq(p, df) (54) Pk = xi , (sorted x) √ σ = n· p·q (31) i − 0.5 5.5 F - DISTRIBUTION k= · 100% (16) P(x) = nCx px q(n−x) = dbinom(x, n, p) (32) n To find xi given Pk , i is: p = P(F < F 0 ) = F(F 0 ) 1. L = (k/100%)n 4.2 P OISSON DISTRIBUTION = pf(F’, df1, df2) (55) 2. if L is an integer: i = L + 0.5; µx · e−µ otherwise i=L and round up. P(x) = = dpois(x, µ) (33) F 0 = F −1 (p) = qf(p, df1, df2) (56) x! 8 Hypothesis Tests 8.6 2- SAMPLE MATCHED PAIRS TEST 9.4 P REDICTION I NTERVALS Test statistic and R function (when available) are listed for each. H0 : µd = 0 To predict y when x = 5 and show the 95% prediction interval with regression Optional arguments for hypothesis tests: t.test(x, y, paired=TRUE, alternative="two.sided") model in results: alternative="two.sided" can be: where: x and y are ordered vectors of sample 1 and sample 2 data. predict(results, newdata=data.frame(x=5), "two.sided", "less", "greater" d¯ − µd0 int="pred") conf.level=0.95 constructs a 95% confidence interval. Standard CI t= √ , di = xi − yi , d f = n − 1 (78) sd / n only when alternative="two.sided". 10 ANOVA Required Sample size: Optional arguments for power calculations & Type II error: 10.1 O NE WAY ANOVA power.t.test(delta=h, sd =σ, sig.level=α, power=1 − alternative="two.sided" can be: β, type ="paired", alternative="two.sided") 1. results=aov(depVarColName∼indepVarColName, "two.sided" or "one.sided" data=tableName) Run ANOVA with data in TableName, sig.level=0.05 sets the significance level α. 8.7 T EST OF HOMOGENEITY, TEST OF INDEPENDENCE factor data in indepVarColName column, and response data in depVarColName column. 8.1 1- SAMPLE PROPORTION H0 : p1 = p2 = · · · = pn (homogeneity) 2. summary(results) Summarize results H0 : X and Y are independent (independence) H0 : p = p0 3. boxplot(depVarColName∼indepVarColName, chisq.test(D) prop.test(x, n, p=p0 , alternative="two.sided") data=tableName) Boxplot of levels for factor Enter table: D=data.frame(c1, c2, ...), where c1, c2, ... are p̂ − p0 column data vectors. z= p (72) MS(treatment) p0 q0 /n Or generate table: D=table(x1, x2), where x1, x2 are ordered vectors F= , d f1 = k − 1, d f2 = N − k (85) MS(error) of raw categorical data. 8.2 1- SAMPLE MEAN (σ KNOWN ) To find required sample size and power see power.anova.test(...) (Oi − Ei )2 H0 : µ = µ0 χ2 = ∑ , d f = (num rows - 1)(num cols - 1) (79) Ei 11 Loading and using external data and tables x̄ − µ0 (row total)(column total) z= √ (73) Ei = = npi (80) 11.1 L OADING EXCEL DATA σ/ n (grand total) 1. Export your table as a CSV file (comma seperated file) from Excel. For 2 × 2 contingency tables, you can use the Fisher Exact Test: 2. Import your table into MyTable in R using: 8.3 1- SAMPLE MEAN (σ UNKNOWN ) fisher.test(D, alternative="greater") MyTable=read.csv(file.choose()) H0 : µ = µ0 (must specify alternative as greater) t.test(x, mu=µ0 , alternative="two.sided") Where x is a vector of sample data. 9 Linear Regression 11.2 L OADING AN .RDATA FILE x̄ − µ0 You can either double click on the .RData file or use the menu: t = √ , d f = n−1 (74) 9.1 L INEAR CORRELATION • Windows: File→Load Workspace. . . s/ n Required Sample size: H0 : ρ = 0 • Mac: Workspace→Load Workspace File. . . cor.test(x, y) power.t.test(delta=h, sd =σ, sig.level=α, power=1 − where: x and y are ordered vectors. 11.3 U SING TABLES OF DATA β, type ="one.sample", alternative="two.sided") ∑ (xi − x̄)(yi − ȳ) r−0 1. To see all the available variables type: ls() r= , t=q , d f = n−2 (81) 8.4 2- SAMPLE PROPORTION TEST (n − 1)sx sy 1−r2 2. To see what’s inside a variable, type its name. n−2 H0 : p1 = p2 or equivalently H0 : ∆p = 0 3. If the variable tableName is a table, you can also type prop.test(x, n, alternative="two.sided") names(tableName) to see the column names or type 9.2 M ODELS IN R head(tableName) to see the first few rows of data. where: x=c(x1 , x2 ) and n=c(n1 , n2 ) MODEL TYPE EQUATION R MODEL 4. To access a column of data type tableName$columnName ∆ p̂ − ∆p0 linear 1 indep var y = b0 + b1 x1 y∼x1 z= q , ∆ p̂ = p̂1 − p̂2 (75) An example demonstrating how to get the women’s height data and find the p̄q̄ p̄q̄ . . . 0 intercept y = 0 + b1 x1 y∼0+x1 n + n 1 2 mean: linear 2 indep vars y = b0 + b1 x1 + b2 x2 y∼x1+x2 x1 + x2 . . . inteaction y = b0 + b1 x1 + b2 x2 + b12 x1 x2 y∼x1+x2+x1*x2 p̄ = , q̄ = 1 − p̄ (76) > ls() # See what variables are defined n1 + n2 polynomial y = b0 + b1 x1 + b2 x22 y∼x1+I(x2∧ 2) Required Sample size: [1] "women" "x" power.prop.test(p1=p1 , p2=p2 , power=1 − β, > head(women) #Look at the first few entries 9.3 R EGRESSION height weight sig.level=α, alternative="two.sided") Simple linear regression steps: 1 58 115 1. Make sure there is a significant linear correlation. 2 59 117 8.5 2- SAMPLE MEAN TEST 2. results=lm(y∼x) Linear regression of y on x vectors 3 60 120 H0 : µ1 = µ2 or equivalently H0 : ∆µ = 0 3. results View the results > names(women) # Just get the column names t.test(x1, x2, alternative="two.sided") 4. plot(x, y); abline(results) Plot regression line on data [1] "height" "weight" where: x1 and x2 are vectors of sample 1 and sample 2 data. 5. plot(x, results$residuals) Plot residuals > women$height # Display the height data ∆x̄ − ∆µ0 [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 t=r d f ≈ min (n1 − 1, n2 − 1), ∆x̄ = x̄1 − x̄2 (77) s21 s22 y = b0 + b1 x1 (82) > mean(women$height) # Find the mean of the heights n1 + n2 [1] 65 ∑ (xi − x̄)(yi − ȳ) Required Sample size: b1 = (83) ∑(xi − x̄)2 power.t.test(delta=h, sd =σ, sig.level=α, power=1 − β, type ="two.sample", alternative="two.sided") b0 = ȳ − b1 x̄ (84)