All of Statistics: A Concise Course in Statistical Inference

Larry Wasserman

All of Statistics: A Concise Course in Statistical Inference

Springer

To Isa

1 Probability

1.1 Introduction

Probability is a mathematical language for quantifying uncertainty. In this chapter we introduce the basic concepts underlying probability theory.

1.2 Sample Spaces and Events

The sample space $\Omega$ is the set of possible outcomes of an experiment. Points $\omega$ in $\Omega$ are called sample outcomes, realizations, or elements. Subsets of $\Omega$ are called events.

1.1 Example. If we toss a coin twice then $\Omega = \{HH, HT, TH, TT\}$. The event that the first toss is heads is $A = \{HH, HT\}$.

1.2 Example. Let $\omega$ be the outcome of a measurement of some physical quantity, for example, temperature. Then $\Omega = \mathbb{R} = (-\infty, \infty)$. One could argue that taking $\Omega = \mathbb{R}$ is not accurate, but there is usually no harm in taking the sample space to be larger than needed.

1.3 Example. If we toss a coin forever, then the sample space is the set of infinite sequences $\Omega = \{\omega = (\omega_1, \omega_2, \ldots) : \omega_i \in \{H, T\}\}$. Let $E$ be the event that the first head appears on the third toss. Then $E = \{(\omega_1, \omega_2, \ldots) : \omega_1 = T, \omega_2 = T, \omega_3 = H, \omega_i \in \{H, T\} \text{ for } i > 3\}$.

Given an event $A$, let $A^c = \{\omega \in \Omega : \omega \notin A\}$ denote the complement of $A$. The union of events $A$ and $B$ is defined as $A \cup B = \{\omega \in \Omega : \omega \in A \text{ or } \omega \in B \text{ or } \omega \in \text{both}\}$, which can be thought of as "$A$ or $B$." If $A_1, A_2, \ldots$ is a sequence of sets, then $\bigcup_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for at least one } i\}$. The intersection of $A$ and $B$ is $A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\}$, read "$A$ and $B$"; sometimes we write $A \cap B$ as $AB$ or $(A, B)$. The set difference is $A - B = \{\omega : \omega \in A, \omega \notin B\}$. Given an event $A$, define the indicator function of $A$ by $I_A(\omega) = 1$ if $\omega \in A$ and $0$ otherwise.

We say that $A_1, A_2, \ldots$ are disjoint, or mutually exclusive, if $A_i \cap A_j = \emptyset$ whenever $i \neq j$. For example, $A_1 = [0, 1)$, $A_2 = [1, 2)$, $A_3 = [2, 3), \ldots$ are disjoint. A partition of $\Omega$ is a sequence of disjoint sets $A_1, A_2, \ldots$ such that $\bigcup_{i=1}^{\infty} A_i = \Omega$. A sequence of sets $A_1, A_2, \ldots$ is monotone increasing if $A_1 \subset A_2 \subset \cdots$, and we define $\lim_n A_n = \bigcup_{i=1}^{\infty} A_i$; it is monotone decreasing if $A_1 \supset A_2 \supset \cdots$, and then we define $\lim_n A_n = \bigcap_{i=1}^{\infty} A_i$. In either case, we will write $A_n \to A$.

1.4 Example. Let $\Omega = \mathbb{R}$ and let $A_i = [0, 1/i)$ for $i = 1, 2, \ldots$. Then $\bigcup_i A_i = [0, 1)$ and $\bigcap_i A_i = \{0\}$. If instead we define $A_i = (0, 1/i)$ then $\bigcup_i A_i = (0, 1)$ and $\bigcap_i A_i = \emptyset$.

1.3 Probability

We will assign a real number $P(A)$ to every event $A$, called the probability of $A$.
We call $P$ a probability distribution or a probability measure. To qualify as a probability, $P$ must satisfy three axioms.

1.5 Definition. A function $P$ that assigns a real number $P(A)$ to each event $A$ is a probability distribution (or probability measure) if it satisfies:
Axiom 1: $P(A) \geq 0$ for every $A$.
Axiom 2: $P(\Omega) = 1$.
Axiom 3: If $A_1, A_2, \ldots$ are disjoint then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

There are many interpretations of $P(A)$. The two common interpretations are frequencies and degrees of belief. In the frequency interpretation, $P(A)$ is the long-run proportion of times that $A$ is true in repetitions. For example, if we say that the probability of heads is 1/2, we mean that if we flip the coin many times, the proportion of heads tends to 1/2. The degree-of-belief interpretation is that $P(A)$ measures an observer's strength of belief that $A$ is true. In either interpretation, we require that Axioms 1 to 3 hold. The difference in interpretation will not matter much until we deal with statistical inference, where the differing interpretations lead to two schools of inference: the frequentist and the Bayesian. We defer discussion until Chapter 11.

From the axioms one can derive many properties: $P(\emptyset) = 0$; $A \subset B$ implies $P(A) \leq P(B)$; $0 \leq P(A) \leq 1$; $P(A^c) = 1 - P(A)$; and for any events $A$ and $B$, $P(A \cup B) = P(A) + P(B) - P(AB)$.

1.8 Theorem (Continuity of Probabilities). If $A_n \to A$ then $P(A_n) \to P(A)$ as $n \to \infty$.

Proof. Suppose that $A_n$ is monotone increasing so that $A_1 \subset A_2 \subset \cdots$, and let $A = \lim_n A_n = \bigcup_i A_i$. Define $B_1 = A_1$, $B_2 = \{\omega \in \Omega : \omega \in A_2, \omega \notin A_1\}$, $B_3 = \{\omega \in \Omega : \omega \in A_3, \omega \notin A_2, \omega \notin A_1\}$, and so on. It can be shown that $B_1, B_2, \ldots$ are disjoint, that $A_n = \bigcup_{i=1}^{n} A_i = \bigcup_{i=1}^{n} B_i$ for each $n$, and that $\bigcup_i B_i = \bigcup_i A_i$. From Axiom 3,
$$P(A_n) = \sum_{i=1}^{n} P(B_i) \to \sum_{i=1}^{\infty} P(B_i) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = P(A). \qquad \blacksquare$$

1.5 Independent Events

1.9 Definition. Two events $A$ and $B$ are independent if $P(AB) = P(A)P(B)$, written $A \amalg B$. A set of events $\{A_i : i \in I\}$ is independent if $P\left(\bigcap_{i\in J} A_i\right) = \prod_{i\in J} P(A_i)$ for every finite subset $J$ of $I$.

Independence can arise in two distinct ways: sometimes we explicitly assume it, and sometimes we verify it by checking that $P(AB) = P(A)P(B)$. Disjoint events with positive probability are never independent, since then $P(AB) = P(\emptyset) = 0 \neq P(A)P(B)$. Except in this special case, there is no way to judge independence by looking at the sets in a Venn diagram.

1.10 Example. Toss a fair coin 10 times. Let $A$ = "at least one head," and let $T_j$ be the event that tails occurs on the $j$th toss. Then, using independence,
$$P(A) = 1 - P(\text{all tails}) = 1 - P(T_1)P(T_2)\cdots P(T_{10}) = 1 - \left(\frac{1}{2}\right)^{10} \approx .999.$$

1.11 Example. Two people take turns trying to sink a basketball into a net. Person 1 succeeds with probability 1/3 while person 2 succeeds with probability 1/4. What is the probability that person 1 succeeds before person 2? Let $E$ denote the event of interest, and let $A_j$ be the event that the first success is by person 1 and that it occurs on trial number $j$. Note that $A_1, A_2, \ldots$ are disjoint and that $E = \bigcup_{j=1}^{\infty} A_j$. Hence $P(E) = \sum_{j=1}^{\infty} P(A_j)$. Now $P(A_1) = 1/3$. $A_2$ occurs if we have the sequence: person 1 misses, person 2 misses, person 1 succeeds. This has probability $P(A_2) = (2/3)(3/4)(1/3)$. Following this logic, $P(A_j) = \left((2/3)(3/4)\right)^{j-1}(1/3)$. Hence,
$$P(E) = \frac{1}{3}\sum_{j=1}^{\infty}\left(\frac{1}{2}\right)^{j-1} = \frac{2}{3}.$$
Here we used the fact that, if $0 < r < 1$, then $\sum_{j=k}^{\infty} r^j = r^k/(1-r)$.

1.6 Conditional Probability

Assuming that $P(B) > 0$, we define the conditional probability of $A$ given $B$ as
$$P(A \mid B) = \frac{P(AB)}{P(B)}.$$
Think of $P(A \mid B)$ as the fraction of times $A$ occurs among those in which $B$ occurs. For any fixed $B$ such that $P(B) > 0$, $P(\cdot \mid B)$ is a probability; it satisfies the three axioms of probability. In particular, $P(A \mid B) \geq 0$, $P(\Omega \mid B) = 1$, and if $A_1, A_2, \ldots$ are disjoint then $P(\bigcup_i A_i \mid B) = \sum_i P(A_i \mid B)$. But it is in general not true that $P(A \mid B \cup C) = P(A \mid B) + P(A \mid C)$: the rules of probability apply to events on the left of the bar. In general it is also not the case that $P(A \mid B) = P(B \mid A)$. People get this confused all the time. For example, the probability of spots given you have measles is 1, but the probability that you have measles given that you have spots is not 1. In this case the difference between $P(A \mid B)$ and $P(B \mid A)$ is obvious, but there are cases where it is less obvious. This mistake is made often enough in legal cases that it is sometimes called the prosecutor's fallacy.

1.13 Example. A medical test for a disease $D$ has outcomes $+$ and $-$. Sick people yield a positive test 90 percent of the time and healthy people yield a negative test 90 percent of the time, so the test appears fairly accurate. Suppose you go for a test and get a positive result. What is the probability you have the disease? Most people answer .90. The correct answer, obtained from the definition of conditional probability together with the low prior probability of disease, is far smaller. The lesson here is that you need to compute the answer numerically, as in the sketch below.
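As a quick numeric check of Example 1.13, here is a minimal Python sketch of the Bayes' rule calculation. The 90 percent sensitivity and specificity come from the example; the 1 percent prevalence is an illustrative assumption, not a figure recovered from the text.

    # Bayes' rule for the medical test of Example 1.13.
    # Assumed numbers: P(+|D) = 0.9, P(-|not D) = 0.9, prevalence P(D) = 0.01.
    sens, spec, prev = 0.9, 0.9, 0.01
    p_pos = sens * prev + (1 - spec) * (1 - prev)  # law of total probability
    p_d_given_pos = sens * prev / p_pos            # Bayes' theorem
    print(p_d_given_pos)                           # about 0.083, far from 0.9

With these assumed inputs the posterior is roughly .08: a positive result raises the probability of disease, but nowhere near the naive answer of .90.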
1.14 Lemma. If $A$ and $B$ are independent events, then $P(A \mid B) = P(A)$. Also, for any pair of events $A$ and $B$, $P(AB) = P(A \mid B)P(B) = P(B \mid A)P(A)$.

From the lemma, we see that another interpretation of independence is that knowing $B$ doesn't change the probability of $A$. The formula $P(AB) = P(A)P(B \mid A)$ is sometimes helpful for calculating probabilities.

1.15 Example. Draw two cards from a deck, without replacement. Let $A$ be the event that the first draw is the Ace of Clubs and let $B$ be the event that the second draw is the Queen of Diamonds. Then $P(AB) = P(A)P(B \mid A) = (1/52)(1/51)$.

1.7 Bayes' Theorem

Bayes' theorem is the basis of "expert systems" and "Bayes' nets," which are discussed in Chapter 17. First, we need a preliminary result.

1.16 Theorem (The Law of Total Probability). Let $A_1, \ldots, A_k$ be a partition of $\Omega$. Then, for any event $B$,
$$P(B) = \sum_{i=1}^{k} P(B \mid A_i)P(A_i).$$

Proof. Define $C_j = BA_j$ and note that $C_1, \ldots, C_k$ are disjoint and that $B = \bigcup_{j=1}^{k} C_j$. Hence,
$$P(B) = \sum_j P(C_j) = \sum_j P(BA_j) = \sum_j P(B \mid A_j)P(A_j),$$
where the last equality uses the definition of conditional probability. $\blacksquare$

1.17 Theorem (Bayes' Theorem). Let $A_1, \ldots, A_k$ be a partition of $\Omega$ such that $P(A_i) > 0$ for each $i$. If $P(B) > 0$ then, for each $i = 1, \ldots, k$,
$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_j P(B \mid A_j)P(A_j)}.$$

1.18 Remark. We call $P(A_i)$ the prior probability of $A_i$ and $P(A_i \mid B)$ the posterior probability of $A_i$.

Proof. We apply the definition of conditional probability twice, followed by the law of total probability:
$$P(A_i \mid B) = \frac{P(A_i B)}{P(B)} = \frac{P(B \mid A_i)P(A_i)}{P(B)} = \frac{P(B \mid A_i)P(A_i)}{\sum_j P(B \mid A_j)P(A_j)}. \qquad \blacksquare$$

1.19 Example. I divide my email into three categories: $A_1$ = "spam," $A_2$ = "low priority," and $A_3$ = "high priority." From previous experience, I can assign prior probabilities to the three categories and conditional probabilities that a message contains the word "free" given each category; Bayes' theorem then gives the posterior probability that a message containing "free" is spam.

1.8 Bibliographic Remarks

The material in this chapter is standard. Details can be found in any number of books. At the introductory level, there is DeGroot and Schervish (2002); at the intermediate level, Grimmett and Stirzaker (1982) and Karr (1993); at the advanced level there are Billingsley (1979) and Breiman (1992). I adapted many examples and exercises from DeGroot and Schervish (2002) and Grimmett and Stirzaker (1982).

1.9 Appendix

Generally, it is not feasible to assign probabilities to all subsets of a sample space $\Omega$. Instead, one restricts attention to a set of events called a $\sigma$-algebra or a $\sigma$-field, a class $\mathcal{A}$ that satisfies: (i) $\emptyset \in \mathcal{A}$, (ii) if $A_1, A_2, \ldots \in \mathcal{A}$ then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{A}$, and (iii) $A \in \mathcal{A}$ implies $A^c \in \mathcal{A}$. The sets in $\mathcal{A}$ are said to be measurable. We call $(\Omega, \mathcal{A})$ a measurable space. If $P$ is a probability measure defined on $\mathcal{A}$, then $(\Omega, \mathcal{A}, P)$ is called a probability space. When $\Omega$ is the real line, we take $\mathcal{A}$ to be the smallest $\sigma$-field that contains all the open subsets, which is called the Borel $\sigma$-field.

1.10 Exercises (selected)

1. Fill in the details of the proof of Theorem 1.8. Also, prove the monotone decreasing case.
2. Prove the properties of $P$ listed after Definition 1.5.
3. Let $\Omega$ be a sample space and let $A_1, A_2, \ldots$ be events. Define $B_n = \bigcup_{i=n}^{\infty} A_i$ and $C_n = \bigcap_{i=n}^{\infty} A_i$. Show that $C_1 \subset C_2 \subset \cdots$ and that $B_1 \supset B_2 \supset \cdots$.

2 Random Variables

THE EXPONENTIAL DISTRIBUTION. $X$ has an Exponential distribution with parameter $\beta$, denoted $X \sim \text{Exp}(\beta)$, if
$$f(x) = \frac{1}{\beta}e^{-x/\beta}, \quad x > 0,$$
where $\beta > 0$. The exponential distribution is used to model the lifetimes of electronic components and waiting times.

THE GAMMA DISTRIBUTION. $X$ has a Gamma distribution with parameters $\alpha$ and $\beta$, denoted by $X \sim \text{Gamma}(\alpha, \beta)$, if
$$f(x) = \frac{1}{\beta^{\alpha}\Gamma(\alpha)}x^{\alpha-1}e^{-x/\beta}, \quad x > 0,$$
where $\alpha, \beta > 0$ and $\Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy$. The Exponential distribution is just a Gamma$(1, \beta)$ distribution. If $X_i \sim \text{Gamma}(\alpha_i, \beta)$ are independent, then $\sum_{i=1}^{n} X_i \sim \text{Gamma}\left(\sum_{i=1}^{n}\alpha_i, \beta\right)$.

THE BETA DISTRIBUTION. $X$ has a Beta distribution with parameters $\alpha > 0$ and $\beta > 0$, denoted by $X \sim \text{Beta}(\alpha, \beta)$, if
$$f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}, \quad 0 < x < 1.$$

THE $t$ AND CAUCHY DISTRIBUTIONS. $X$ has a $t$ distribution with $\nu$ degrees of freedom, written $X \sim t_{\nu}$, if
$$f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}\frac{1}{\sqrt{\nu\pi}}\frac{1}{\left(1 + \frac{x^2}{\nu}\right)^{(\nu+1)/2}}.$$
The $t$ distribution is similar to a Normal but it has thicker tails. In fact, the Normal corresponds to a $t$ with $\nu = \infty$. The Cauchy distribution is a special case of the $t$ distribution corresponding to $\nu = 1$. The density is
$$f(x) = \frac{1}{\pi(1+x^2)}.$$
To see that this is indeed a density,
$$\int_{-\infty}^{\infty} f(x)\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty}\frac{dx}{1+x^2} = \frac{1}{\pi}\left[\tan^{-1}(\infty) - \tan^{-1}(-\infty)\right] = \frac{1}{\pi}\left[\frac{\pi}{2} - \left(-\frac{\pi}{2}\right)\right] = 1.$$

2.5 Bivariate Distributions

Given a pair of discrete random variables $X$ and $Y$, define the joint mass function by $f(x, y) = P(X = x \text{ and } Y = y)$. From now on, we write $P(X = x \text{ and } Y = y)$ as $P(X = x, Y = y)$. We write $f$ as $f_{X,Y}$ when we want to be more explicit.

2.18 Example. Here is a bivariate distribution for two random variables $X$ and $Y$, each taking values 0 or 1:

              Y = 0    Y = 1
    X = 0      1/9      2/9     1/3
    X = 1      2/9      4/9     2/3
               1/3      2/3       1

Thus, $f(1, 1) = P(X = 1, Y = 1) = 4/9$.

2.19 Definition. In the continuous case, we call a function $f(x, y)$ a PDF for the random variables $(X, Y)$ if (i) $f(x, y) \geq 0$ for all $(x, y)$, (ii) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$, and (iii) for any set $A \subset \mathbb{R}\times\mathbb{R}$, $P((X, Y) \in A) = \int\int_A f(x, y)\,dx\,dy$.

2.20 Example. Let $(X, Y)$ be uniform on the unit square. Find $P(X < 1/2, Y < 1/2)$. The event $A = \{X < 1/2, Y < 1/2\}$ corresponds to a subset of the unit square. Integrating $f$ over this subset corresponds, in this case, to computing the area of the set $A$, which is 1/4. So $P(X < 1/2, Y < 1/2) = 1/4$.

2.21 Example. Let $(X, Y)$ have density
$$f(x, y) = \begin{cases} x + y & \text{if } 0 \leq x \leq 1,\ 0 \leq y \leq 1 \\ 0 & \text{otherwise.} \end{cases}$$
Then $\int_0^1\int_0^1 (x+y)\,dx\,dy = \frac{1}{2} + \frac{1}{2} = 1$, which verifies that this is a PDF.

2.22 Example. If the distribution is defined over a non-rectangular region, the calculations are more complicated. Let $(X, Y)$ have density
$$f(x, y) = \begin{cases} c\,x^2 y & \text{if } x^2 \leq y \leq 1 \\ 0 & \text{otherwise.} \end{cases}$$
Note first that $-1 \leq x \leq 1$. Now let us find the value of $c$. The trick here is to be careful about the range of integration:
$$1 = \int\int f(x, y)\,dy\,dx = \int_{-1}^{1}\int_{x^2}^{1} c\,x^2 y\,dy\,dx = \frac{4c}{21}.$$
Hence, $c = 21/4$. Now let us compute $P(X \geq Y)$. This corresponds to the set $A = \{(x, y) : 0 \leq x \leq 1,\ x^2 \leq y \leq x\}$ (you can see this by drawing a diagram), and
$$P(X \geq Y) = \frac{21}{4}\int_0^1\int_{x^2}^{x} x^2 y\,dy\,dx = \frac{3}{20}.$$

2.6 Marginal Distributions

2.23 Definition. If $(X, Y)$ have joint mass function $f_{X,Y}$, then the marginal mass function for $X$ is $f_X(x) = P(X = x) = \sum_y f(x, y)$, and similarly for $Y$. For continuous random variables, the marginal densities are $f_X(x) = \int f(x, y)\,dy$ and $f_Y(y) = \int f(x, y)\,dx$.

2.7 Independent Random Variables

2.29 Definition. Two random variables $X$ and $Y$ are independent if, for every $A$ and $B$, $P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$, and we write $X \amalg Y$. Otherwise we say that $X$ and $Y$ are dependent.

In principle, to check whether $X$ and $Y$ are independent we need to check the definition for all subsets $A$ and $B$. Fortunately, we have the following result.

2.30 Theorem. Let $X$ and $Y$ have joint PDF $f_{X,Y}$. Then $X \amalg Y$ if and only if $f_{X,Y}(x, y) = f_X(x)f_Y(y)$ for all values $x$ and $y$. (The statement is not rigorous because the density is defined only up to sets of measure 0.)

2.31 Example. Let $X$ and $Y$ have the distribution of Example 2.18. Checking each of the four cells shows that $f(x, y) = f_X(x)f_Y(y)$, so $X$ and $Y$ are independent.

2.32 Example. Suppose that $X$ and $Y$ are independent and both have the same density $f(x) = 2x$ for $0 \leq x \leq 1$. Let us find $P(X + Y \leq 1)$. Using independence, the joint density is
$$f(x, y) = f_X(x)f_Y(y) = \begin{cases} 4xy & \text{if } 0 \leq x \leq 1,\ 0 \leq y \leq 1 \\ 0 & \text{otherwise.} \end{cases}$$
Now,
$$P(X + Y \leq 1) = \int_0^1\int_0^{1-x} 4xy\,dy\,dx = 2\int_0^1 x(1-x)^2\,dx = \frac{1}{6}.$$

The following result is a useful shortcut for verifying independence.

2.33 Theorem. Suppose that the range of $X$ and $Y$ is a (possibly infinite) rectangle. If $f(x, y) = g(x)h(y)$ for some functions $g$ and $h$ (not necessarily probability density functions), then $X$ and $Y$ are independent.

2.34 Example. Let $X$ and $Y$ have density $f(x, y) = 2e^{-(x+2y)}$ for $x > 0$ and $y > 0$. The range is the rectangle $(0, \infty)\times(0, \infty)$, and we can write $f(x, y) = g(x)h(y)$ where $g(x) = 2e^{-x}$ and $h(y) = e^{-2y}$. Thus $X \amalg Y$.

2.8 Conditional Distributions

2.36 Definition. The conditional probability mass function is
$$f_{X\mid Y}(x \mid y) = P(X = x \mid Y = y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$
provided $f_Y(y) > 0$. For continuous random variables, the conditional probability density function is defined the same way, and then
$$P(X \in A \mid Y = y) = \int_A f_{X\mid Y}(x \mid y)\,dx.$$

2.37 Example. Let $X$ and $Y$ have a joint uniform distribution on the unit square. Then $f_{X\mid Y}(x \mid y) = 1$ for $0 \leq x \leq 1$ and 0 otherwise. Given $Y = y$, $X$ is Uniform$(0, 1)$. We can write this as $X \mid Y = y \sim \text{Uniform}(0, 1)$.

From the definition of the conditional density, we see that $f_{X,Y}(x, y) = f_{X\mid Y}(x \mid y)f_Y(y) = f_{Y\mid X}(y \mid x)f_X(x)$.

2.39 Example. Consider the density of Example 2.22. Let us find $f_{Y\mid X}(y \mid x)$. When $X = x$, $y$ must satisfy $x^2 \leq y \leq 1$. Earlier, we saw that $f_X(x) = \frac{21}{8}x^2(1 - x^4)$. Hence, for $x^2 \leq y \leq 1$,
$$f_{Y\mid X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{\frac{21}{4}x^2 y}{\frac{21}{8}x^2(1 - x^4)} = \frac{2y}{1 - x^4}.$$
Let us compute $P(Y \geq 3/4 \mid X = 1/2)$:
$$P(Y \geq 3/4 \mid X = 1/2) = \int_{3/4}^{1}\frac{2y}{1 - (1/2)^4}\,dy = \frac{1 - (3/4)^2}{1 - (1/16)} = \frac{7}{15}.$$

2.9 Multivariate Distributions and IID Samples

Let $X = (X_1, \ldots, X_n)$ where $X_1, \ldots, X_n$ are random variables. We call $X$ a random vector. Let $f(x_1, \ldots, x_n)$ denote the PDF. We say that $X_1, \ldots, X_n$ are independent if $P(X_1 \in A_1, \ldots, X_n \in A_n) = \prod_{i=1}^{n} P(X_i \in A_i)$. It suffices to check that $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$.

2.41 Definition. If $X_1, \ldots, X_n$ are independent and each has the same marginal distribution with CDF $F$, we say that $X_1, \ldots, X_n$ are IID (independent and identically distributed) and we write $X_1, \ldots, X_n \sim F$. We also call $X_1, \ldots, X_n$ a random sample of size $n$ from $F$.

2.10 Two Important Multivariate Distributions

THE MULTINOMIAL. Consider drawing a ball from an urn which has balls with $k$ different colors labeled "color 1, color 2, ..., color $k$." Let $p = (p_1, \ldots, p_k)$ where $p_j \geq 0$, $\sum_{j=1}^{k} p_j = 1$, and $p_j$ is the probability of drawing a ball of color $j$. Draw $n$ times (independent draws with replacement) and let $X = (X_1, \ldots, X_k)$ where $X_j$ is the number of times that color $j$ appears. Hence, $n = \sum_{j=1}^{k} X_j$. We say that $X$ has a Multinomial$(n, p)$ distribution, written $X \sim \text{Multinomial}(n, p)$. Its mass function is
$$f(x) = \binom{n}{x_1\cdots x_k}p_1^{x_1}\cdots p_k^{x_k}.$$

2.42 Lemma. Suppose that $X \sim \text{Multinomial}(n, p)$ where $X = (X_1, \ldots, X_k)$ and $p = (p_1, \ldots, p_k)$. The marginal distribution of $X_j$ is Binomial$(n, p_j)$. (A simulation check of this lemma appears after the exercises below.)

THE MULTIVARIATE NORMAL. If $Z_1, \ldots, Z_p$ are independent standard Normal random variables, we say $Z = (Z_1, \ldots, Z_p)$ has a standard multivariate Normal distribution; more generally, $X \sim N(\mu, \Sigma)$ for a mean vector $\mu$ and covariance matrix $\Sigma$.

2.14 Exercises (selected)

1. Let $X \sim N(1, 4)$. Using a Normal table or computer, find $P(X > -2)$ and find $x$ such that $P(X < x) = .1$.
2. Let $X, Y \sim \text{Uniform}(0, 1)$ be independent. Find the PDF for $X - Y$.
3. Hint 1: You may use the following fact: if $X \sim \text{Poisson}(\lambda)$ and $Y \sim \text{Poisson}(\mu)$, and $X$ and $Y$ are independent, then $X + Y \sim \text{Poisson}(\lambda + \mu)$. Hint 2: Note that $\{X = x, X + Y = n\} = \{X = x, Y = n - x\}$.
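Here is a minimal Python sketch (assuming NumPy is available) that checks Lemma 2.42 by simulation; the choices $n = 10$ and $p = (.2, .3, .5)$ are illustrative, not from the text.

    import numpy as np

    # Empirical check of Lemma 2.42: the marginal of one coordinate of a
    # Multinomial(n, p) is Binomial(n, p_j), so its mean is n*p_j and its
    # variance is n*p_j*(1 - p_j).
    rng = np.random.default_rng(0)
    n, p = 10, np.array([0.2, 0.3, 0.5])
    draws = rng.multinomial(n, p, size=100_000)
    print(draws[:, 0].mean(), n * p[0])                 # about 2.0 vs 2.0
    print(draws[:, 0].var(), n * p[0] * (1 - p[0]))     # about 1.6 vs 1.6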
3 Expectation

3.1 Expectation of a Random Variable

The expectation of a random variable $X$ is the average value of $X$.

3.1 Definition. The expected value, or mean, or first moment of $X$ is defined to be
$$E(X) = \int x\,dF(x) = \begin{cases} \sum_x x f(x) & \text{if } X \text{ is discrete} \\ \int x f(x)\,dx & \text{if } X \text{ is continuous,} \end{cases}$$
assuming the sum (or integral) is well defined. We use $\mu$ or $\mu_X$ to denote $E(X)$.

If $Y = r(X)$ then $E(Y) = \int r(x)\,dF(x)$. The variance is $\sigma^2 = V(X) = E(X - \mu)^2$. To compute the variance, first compute $E(X^2)$ and use $V(X) = E(X^2) - \mu^2$. If $X_1, \ldots, X_n$ are IID with mean $\mu$ and variance $\sigma^2$, then $E(\overline{X}_n) = \mu$ and $V(\overline{X}_n) = \sigma^2/n$.

3.8 Exercises (selected)

1. A fair coin is tossed until a head is obtained. What is the expected number of tosses?
2. Let $X$ be a continuous random variable with CDF $F$. Suppose that $P(X > 0) = 1$ and that $E(X)$ exists. Show that $E(X) = \int_0^{\infty} P(X > x)\,dx$.
3. (Computer Experiment.) Let $X_1, X_2, \ldots$ be $N(0, 1)$ random variables and let $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Plot $\overline{X}_n$ versus $n$ for $n = 1, \ldots, 10{,}000$.
4. Let $X \sim N(0, 1)$ and let $Y = e^X$. Find $E(Y)$ and $V(Y)$.
5. (Computer Experiment: Simulating the Stock Market.) Let $Y_1, Y_2, \ldots$ be independent random variables such that $P(Y_i = 1) = P(Y_i = -1) = 1/2$. Let $X_n = \sum_{i=1}^{n} Y_i$. Think of $Y_i = 1$ as "the stock price increased by one dollar" and $Y_i = -1$ as "the stock price decreased by one dollar." $X_n$ is the stock price at time $n$. Simulate $X_n$ and plot $X_n$ versus $n$ for $n = 1, 2, \ldots, 10{,}000$. (This is known as a random walk.)
6. This question is to help you understand the idea of a sampling distribution. Let $X_1, \ldots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$ and let $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Then $\overline{X}_n$ is a statistic, that is, a function of the data. Since $\overline{X}_n$ is a random variable, it has a distribution, called the sampling distribution of the statistic. Recall from Theorem 3.17 that $E(\overline{X}_n) = \mu$ and $V(\overline{X}_n) = \sigma^2/n$. Don't confuse the distribution of the data and the distribution of the statistic $\overline{X}_n$. To make this clear, let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$. Plot the density of the Uniform$(0, 1)$. Find $E(\overline{X}_n)$ and $V(\overline{X}_n)$ and plot them as a function of $n$. Simulate the sampling distribution of $\overline{X}_n$ for $n = 1, 5, 25, 100$. Check that the simulated values of $E(\overline{X}_n)$ and $V(\overline{X}_n)$ agree with your theoretical calculations. What do you notice about the sampling distribution of $\overline{X}_n$ as $n$ increases?
7. Let $X$ and $Y$ be random variables. Suppose that $E(Y \mid X) = X$. Show that $\text{Cov}(X, Y) = V(X)$.

4 Inequalities

4.1 Probability Inequalities

Inequalities are useful for bounding quantities that might otherwise be hard to compute.

4.1 Theorem (Markov's Inequality). Let $X$ be a non-negative random variable and suppose that $E(X)$ exists. For any $t > 0$,
$$P(X > t) \leq \frac{E(X)}{t}.$$

4.2 Theorem (Chebyshev's Inequality). Let $\mu = E(X)$ and $\sigma^2 = V(X)$. Then,
$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}.$$

4.3 Example. Suppose we test a prediction method on a set of $n$ new cases. Let $X_i = 1$ if the predictor is wrong and $X_i = 0$ if the predictor is right. Then $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$ is the observed error rate. Each $X_i$ may be regarded as a Bernoulli with unknown mean $p$, the true but unknown error rate. Intuitively, we expect that $\overline{X}_n$ should be close to $p$. How likely is $\overline{X}_n$ to not be within $\epsilon$ of $p$? We have that $V(\overline{X}_n) = V(X_1)/n = p(1-p)/n$ and
$$P(|\overline{X}_n - p| > \epsilon) \leq \frac{V(\overline{X}_n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2} \leq \frac{1}{4n\epsilon^2},$$
since $p(1-p) \leq 1/4$ for all $p$. For $\epsilon = .2$ and $n = 100$ the bound is .0625.

Hoeffding's inequality is similar in spirit to Markov's inequality, but it is a sharper inequality.

4.4 Theorem (Hoeffding's Inequality). Let $Y_1, \ldots, Y_n$ be independent observations such that $E(Y_i) = 0$ and $a_i \leq Y_i \leq b_i$. Let $\epsilon > 0$. Then, for any $t > 0$,
$$P\left(\sum_{i=1}^{n} Y_i \geq \epsilon\right) \leq e^{-t\epsilon}\prod_{i=1}^{n} e^{t^2(b_i - a_i)^2/8}.$$

4.5 Corollary. If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then, for any $\epsilon > 0$,
$$P(|\overline{X}_n - p| > \epsilon) \leq 2e^{-2n\epsilon^2}.$$

Hoeffding's inequality gives us a simple way to create a confidence interval for a binomial parameter $p$ (see Chapter 6), but here is the basic idea. Fix $\alpha > 0$ and let $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$. Let $C = (\overline{X}_n - \epsilon_n, \overline{X}_n + \epsilon_n)$. Then $P(p \notin C) = P(|\overline{X}_n - p| > \epsilon_n) \leq 2e^{-2n\epsilon_n^2} = \alpha$. Hence, $P(p \in C) \geq 1 - \alpha$; that is, the random interval $C$ traps the true parameter value $p$ with probability $1 - \alpha$. We call $C$ a $1 - \alpha$ confidence interval. More on this later. (A numerical comparison of the Chebyshev and Hoeffding bounds appears below.)

4.7 Theorem (Mill's Inequality). Let $Z \sim N(0, 1)$. Then,
$$P(|Z| > t) \leq \sqrt{\frac{2}{\pi}}\,\frac{e^{-t^2/2}}{t}.$$
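The following Python sketch compares the Chebyshev and Hoeffding bounds on $P(|\overline{X}_n - p| > \epsilon)$ with the empirical frequency. The values $n = 100$ and $\epsilon = .2$ follow Example 4.3; $p = 1/2$ and the number of simulation replicates are illustrative choices.

    import numpy as np

    # Chebyshev vs. Hoeffding for Bernoulli(p) sample means.
    rng = np.random.default_rng(1)
    n, p, eps = 100, 0.5, 0.2
    xbar = rng.binomial(n, p, size=100_000) / n
    print((np.abs(xbar - p) > eps).mean())    # empirical probability (tiny)
    print(p * (1 - p) / (n * eps**2))         # Chebyshev bound: 0.0625
    print(2 * np.exp(-2 * n * eps**2))        # Hoeffding bound: ~0.00067

Both are valid upper bounds, but Hoeffding's is dramatically tighter here, which is why it is the tool of choice for confidence intervals of this kind.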
4.2 Inequalities for Expectations

This section contains two inequalities on expected values.

4.8 Theorem (Cauchy-Schwartz Inequality). If $X$ and $Y$ have finite variances, then $E|XY| \leq \sqrt{E(X^2)E(Y^2)}$.

Recall that a function $g$ is convex if for each $x, y$ and each $\alpha \in [0, 1]$, $g(\alpha x + (1-\alpha)y) \leq \alpha g(x) + (1-\alpha)g(y)$. If $g$ is twice differentiable and $g''(x) \geq 0$ for all $x$, then $g$ is convex. A convex function lies above any tangent line. A function $g$ is concave if $-g$ is convex.

4.9 Theorem (Jensen's Inequality). If $g$ is convex, then $E\,g(X) \geq g(E X)$. If $g$ is concave, then $E\,g(X) \leq g(E X)$.

Proof. Let $L(x) = a + bx$ be a line tangent to $g(x)$ at the point $E(X)$. Since $g$ is convex, it lies above the line $L(x)$. Take expectations of both sides and use the fact that $L$ is linear:
$$E\,g(X) \geq E\,L(X) = E(a + bX) = a + bE(X) = L(E X) = g(E X). \qquad \blacksquare$$
From Jensen's inequality we see that $E(X^2) \geq (E X)^2$, and if $X$ is positive, then $E(1/X) \geq 1/E(X)$.

4.3 Bibliographic Remarks

Devroye et al. (1996) is a good reference on probability inequalities and their use in statistics and pattern recognition. The proof of Hoeffding's inequality is from that text.

4.4 Appendix

PROOF OF HOEFFDING'S INEQUALITY. We will make use of the exact form of Taylor's theorem: if $g$ is a smooth function, then there is a number $\xi \in (0, u)$ such that $g(u) = g(0) + ug'(0) + \frac{u^2}{2}g''(\xi)$. For any $t > 0$, we have, from Markov's inequality,
$$P\left(\sum_i Y_i \geq \epsilon\right) = P\left(e^{t\sum_i Y_i} \geq e^{t\epsilon}\right) \leq e^{-t\epsilon}\,E\left(e^{t\sum_i Y_i}\right) = e^{-t\epsilon}\prod_i E\left(e^{tY_i}\right).$$
Writing $g(u) = \log E(e^{uY_i})$, one shows with Taylor's theorem that $g''(u) \leq (b_i - a_i)^2/4$ for all $u > 0$, so $E(e^{tY_i}) \leq e^{t^2(b_i - a_i)^2/8}$, which gives the stated bound. $\blacksquare$

PROOF OF COROLLARY 4.5. For Bernoulli variables, $Y_i = X_i - p$ satisfies $E(Y_i) = 0$ and $-p \leq Y_i \leq 1 - p$, so $b_i - a_i = 1$. Applying Theorem 4.4,
$$P(\overline{X}_n - p > \epsilon) = P\left(\sum_i Y_i > n\epsilon\right) \leq e^{-tn\epsilon}e^{t^2 n/8}$$
for any $t > 0$. In particular, take $t = 4\epsilon$ and we get $P(\overline{X}_n - p > \epsilon) \leq e^{-2n\epsilon^2}$. By a similar argument we can show that $P(\overline{X}_n - p < -\epsilon) \leq e^{-2n\epsilon^2}$. Putting these together, we get $P(|\overline{X}_n - p| > \epsilon) \leq 2e^{-2n\epsilon^2}$. $\blacksquare$

4.5 Exercises

1. Let $X \sim \text{Exponential}(\beta)$. Find $P(|X - \mu_X| \geq k\sigma_X)$ for $k > 1$ and compare this to the bound from Chebyshev's inequality.
2. Let $X \sim \text{Poisson}(\lambda)$. Use Chebyshev's inequality to show that $P(X \geq 2\lambda) \leq 1/\lambda$.
3. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Bound $P(|\overline{X}_n - p| > \epsilon)$ using Chebyshev's inequality and using Hoeffding's inequality. Show that, when $n$ is large, the bound from Hoeffding's inequality is smaller than the bound from Chebyshev's inequality.
4. (Computer Experiment.) Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, $\alpha = .05$, and define $C_n = (\overline{X}_n - \epsilon_n, \overline{X}_n + \epsilon_n)$ where $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$. Investigate the coverage of $C_n$ by simulation. How large should $n$ be so that the length of the interval is no more than .05?
5. Prove Mill's inequality, Theorem 4.7. Hint: Note that $P(|Z| > t) = 2P(Z > t)$. Now write out what $P(Z > t)$ means and note that $x/t > 1$ whenever $x > t$.
6. Let $Z \sim N(0, 1)$. Find $P(|Z| > t)$ and plot this as a function of $t$. From Markov's inequality, we have the bound $P(|Z| > t) \leq E|Z|^k/t^k$ for any $k > 0$. Plot these bounds for $k = 1, 2, 3, 4, 5$ and compare them to the true value of $P(|Z| > t)$. Also, plot the bound from Mill's inequality.

5 Convergence of Random Variables

5.1 Introduction

The most important aspect of probability theory concerns the behavior of sequences of random variables. This part of probability is called large sample theory, or limit theory, or asymptotic theory. The basic question is: what can we say about the limiting behavior of a sequence of random variables $X_1, X_2, X_3, \ldots$? Since statistics and data mining are all about gathering data, we will naturally be interested in what happens as we gather more and more data.

In calculus we say that a sequence of real numbers $x_n$ converges to a limit $x$ if, for every $\epsilon > 0$, $|x_n - x| < \epsilon$ for all large $n$. In probability, convergence is more subtle. Consider the following probabilistic version of this question. Suppose $X_1, X_2, \ldots$ are IID and each has a $N(0, 1)$ distribution. Since these all have the same distribution, we are tempted to say that $X_n$ "converges" to $X \sim N(0, 1)$. But this can't quite be right since $P(X_n = X) = 0$ for all $n$. Here is another example. Consider $X_1, X_2, \ldots$ where $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is very concentrated around 0 for large $n$, so we would like to say that $X_n$ converges to 0. But $P(X_n = 0) = 0$ for all $n$. Clearly, we need to develop some tools for discussing convergence in a rigorous way. This chapter develops the appropriate methods.

There are two main ideas in this chapter, which we state informally here:
1. The law of large numbers says that the sample average $\overline{X}_n$ converges in probability to the expectation $\mu = E(X_i)$. This means that $\overline{X}_n$ is close to $\mu$ with high probability.
2. The central limit theorem says that $\sqrt{n}(\overline{X}_n - \mu)$ converges in distribution to a Normal distribution. This means that the sample average has approximately a Normal distribution for large $n$.

5.2 Types of Convergence

The two main types of convergence are defined as follows.

5.1 Definition. Let $X_1, X_2, \ldots$ be a sequence of random variables and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$ and let $F$ denote the CDF of $X$.
1. $X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$.
2. $X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if $\lim_n F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.

When the limiting random variable is a point mass at $c$, we write $X_n \xrightarrow{P} c$ (and similarly $X_n \rightsquigarrow c$).

5.2 Definition. $X_n$ converges to $X$ in quadratic mean (also called convergence in $L_2$), written $X_n \xrightarrow{qm} X$, if
$$E(X_n - X)^2 \to 0 \quad \text{as } n \to \infty.$$
Again, if $X$ is a point mass at $c$ we write $X_n \xrightarrow{qm} c$ instead of $X_n \xrightarrow{qm} X$.

5.3 Example. Let $X_n \sim N(0, 1/n)$. Intuitively, $X_n$ is concentrating at 0, so we would like to say that $X_n$ converges to 0. Let's see if this is true. Let $F$ be the distribution function for a point mass at 0. Note that $\sqrt{n}X_n \sim N(0, 1)$, and let $Z$ denote a standard Normal random variable. For $t < 0$, $F_n(t) = P(X_n < t) = P(\sqrt{n}X_n < \sqrt{n}t) = P(Z < \sqrt{n}t) \to 0$ since $\sqrt{n}t \to -\infty$. For $t > 0$, $F_n(t) = P(Z < \sqrt{n}t) \to 1$ since $\sqrt{n}t \to \infty$. Hence, $F_n(t) \to F(t)$ for all $t \neq 0$, and so $X_n \rightsquigarrow 0$. Notice that $F_n(0) = 1/2 \neq F(0) = 1$, so convergence fails at $t = 0$.
That doesn't matter because $t = 0$ is not a continuity point of $F$, and the definition of convergence in distribution only requires convergence at continuity points. Now consider convergence in probability. For any $\epsilon > 0$, using Markov's inequality,
$$P(|X_n| > \epsilon) = P(X_n^2 > \epsilon^2) \leq \frac{E(X_n^2)}{\epsilon^2} = \frac{1/n}{\epsilon^2} \to 0$$
as $n \to \infty$. Hence, $X_n \xrightarrow{P} 0$.

The next theorem gives the relationship between the types of convergence.

5.4 Theorem. The following relationships hold:
(a) $X_n \xrightarrow{qm} X$ implies that $X_n \xrightarrow{P} X$.
(b) $X_n \xrightarrow{P} X$ implies that $X_n \rightsquigarrow X$.
(c) If $X_n \rightsquigarrow X$ and if $P(X = c) = 1$ for some real number $c$, then $X_n \xrightarrow{P} X$.
In general, none of the reverse implications hold except the special case in (c).

Proof. We start by proving (a). Suppose that $X_n \xrightarrow{qm} X$. Fix $\epsilon > 0$. Then, using Markov's inequality,
$$P(|X_n - X| > \epsilon) = P(|X_n - X|^2 > \epsilon^2) \leq \frac{E|X_n - X|^2}{\epsilon^2} \to 0.$$
Proof of (b): This is a little more complicated; you may skip it if you wish. Fix $\epsilon > 0$ and let $x$ be a continuity point of $F$. Then
$$F_n(x) = P(X_n \leq x) = P(X_n \leq x, X \leq x + \epsilon) + P(X_n \leq x, X > x + \epsilon) \leq F(x + \epsilon) + P(|X_n - X| > \epsilon),$$
and similarly $F(x - \epsilon) \leq F_n(x) + P(|X_n - X| > \epsilon)$. Hence,
$$F(x - \epsilon) - P(|X_n - X| > \epsilon) \leq F_n(x) \leq F(x + \epsilon) + P(|X_n - X| > \epsilon).$$
Take the limit as $n \to \infty$ to conclude that $F(x - \epsilon) \leq \liminf_n F_n(x) \leq \limsup_n F_n(x) \leq F(x + \epsilon)$. This holds for all $\epsilon > 0$. Take the limit as $\epsilon \to 0$ and use the fact that $F$ is continuous at $x$ to conclude that $\lim_n F_n(x) = F(x)$.
Proof of (c): Fix $\epsilon > 0$. Then,
$$P(|X_n - c| > \epsilon) = P(X_n < c - \epsilon) + P(X_n > c + \epsilon) \leq F_n(c - \epsilon) + 1 - F_n(c + \epsilon) \to F(c - \epsilon) + 1 - F(c + \epsilon) = 0 + 1 - 1 = 0. \qquad \blacksquare$$

Let us now show that the reverse implications do not hold.

CONVERGENCE IN PROBABILITY DOES NOT IMPLY CONVERGENCE IN QUADRATIC MEAN. Let $U \sim \text{Unif}(0, 1)$ and let $X_n = \sqrt{n}\,I_{(0, 1/n)}(U)$. Then $P(|X_n| > \epsilon) = P(\sqrt{n}\,I_{(0,1/n)}(U) > \epsilon) = P(0 \leq U < 1/n) = 1/n \to 0$. Hence, $X_n \xrightarrow{P} 0$. But $E(X_n^2) = n\int_0^{1/n} du = 1$ for all $n$, so $X_n$ does not converge in quadratic mean.

CONVERGENCE IN DISTRIBUTION DOES NOT IMPLY CONVERGENCE IN PROBABILITY. Let $X \sim N(0, 1)$ and let $X_n = -X$ for $n = 1, 2, 3, \ldots$; hence $X_n \sim N(0, 1)$. $X_n$ has the same distribution function as $X$ for all $n$, so, trivially, $\lim_n F_n(x) = F(x)$ for all $x$. Therefore, $X_n \rightsquigarrow X$. But $P(|X_n - X| > \epsilon) = P(|2X| > \epsilon) = P(|X| > \epsilon/2) \neq 0$, so $X_n$ does not converge to $X$ in probability.

Warning! One might conjecture that if $X_n \xrightarrow{P} b$, then $E(X_n) \to b$. This is not true. Let $X_n$ be a random variable defined by $P(X_n = n^2) = 1/n$ and $P(X_n = 0) = 1 - (1/n)$. Now, $P(|X_n| < \epsilon) = P(X_n = 0) = 1 - (1/n) \to 1$. Hence, $X_n \xrightarrow{P} 0$. However, $E(X_n) = [n^2 \times (1/n)] + [0 \times (1 - (1/n))] = n$. Thus, $E(X_n) \to \infty$.

Some convergence properties are preserved under transformations.

5.5 Theorem. Let $X_n, X, Y_n, Y$ be random variables and let $g$ be a continuous function.
(a) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n + Y_n \xrightarrow{P} X + Y$.
(b) If $X_n \xrightarrow{qm} X$ and $Y_n \xrightarrow{qm} Y$, then $X_n + Y_n \xrightarrow{qm} X + Y$.
(c) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n + Y_n \rightsquigarrow X + c$.
(d) If $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$, then $X_n Y_n \xrightarrow{P} XY$.
(e) If $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow c$, then $X_n Y_n \rightsquigarrow cX$.
(f) If $X_n \xrightarrow{P} X$, then $g(X_n) \xrightarrow{P} g(X)$.
(g) If $X_n \rightsquigarrow X$, then $g(X_n) \rightsquigarrow g(X)$.
Parts (c) and (e) are known as Slutsky's theorem. It is worth noting that $X_n \rightsquigarrow X$ and $Y_n \rightsquigarrow Y$ do not in general imply that $X_n + Y_n \rightsquigarrow X + Y$.

5.3 The Law of Large Numbers

Now we come to a crowning achievement in probability: the law of large numbers. It says that the mean of a large sample is close to the mean of the distribution. For example, the proportion of heads in a large number of coin tosses is expected to be close to 1/2. We now make this more precise. Let $X_1, X_2, \ldots$ be an IID sample, let $\mu = E(X_1)$ and $\sigma^2 = V(X_1)$. Recall that the sample mean is defined as $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$ and that $E(\overline{X}_n) = \mu$ and $V(\overline{X}_n) = \sigma^2/n$.

5.6 Theorem (The Weak Law of Large Numbers (WLLN)). If $X_1, \ldots, X_n$ are IID, then $\overline{X}_n \xrightarrow{P} \mu$.

Interpretation of the WLLN: The distribution of $\overline{X}_n$ becomes more concentrated around $\mu$ as $n$ gets large.

Proof. Assume that $\sigma < \infty$. Using Chebyshev's inequality,
$$P(|\overline{X}_n - \mu| > \epsilon) \leq \frac{V(\overline{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2},$$
which tends to 0 as $n \to \infty$. $\blacksquare$

5.7 Example. Consider flipping a coin for which the probability of heads is $p$. Let $X_i$ denote the outcome of a single toss (0 or 1). Hence, $p = P(X_i = 1) = E(X_i)$. The fraction of heads after $n$ tosses is $\overline{X}_n$. According to the law of large numbers, $\overline{X}_n$ converges to $p$ in probability. How large should $n$ be so that $P(.4 < \overline{X}_n < .6) \geq .7$? First, $E(\overline{X}_n) = p = 1/2$ and $V(\overline{X}_n) = \sigma^2/n = p(1-p)/n = 1/(4n)$. From Chebyshev's inequality,
$$P(.4 < \overline{X}_n < .6) = P(|\overline{X}_n - p| < .1) \geq 1 - \frac{1}{4n(.1)^2} = 1 - \frac{25}{n}.$$
The last expression will be larger than .7 if $n \geq 84$. (A small simulation confirming this appears below.)
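Here is a minimal Python sketch of the WLLN in action. The value $p = 1/2$, the interval $(.4, .6)$, and the sample size $n = 84$ follow Example 5.7; the other sample sizes and the number of replicates are illustrative.

    import numpy as np

    # WLLN: the fraction of heads concentrates near p = 1/2.
    # Chebyshev guarantees P(.4 < Xbar < .6) >= .7 once n >= 84.
    rng = np.random.default_rng(2)
    for n in (10, 84, 1000):
        xbar = rng.binomial(n, 0.5, size=50_000) / n
        print(n, ((xbar > 0.4) & (xbar < 0.6)).mean())

The simulated probabilities grow with $n$ and comfortably exceed .7 at $n = 84$, as the (conservative) Chebyshev bound predicts.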
5.4 The Central Limit Theorem

The law of large numbers says that the distribution of $\overline{X}_n$ piles up near $\mu$. This isn't enough to help us approximate probability statements about $\overline{X}_n$; for that we need the central limit theorem. Suppose that $X_1, \ldots, X_n$ are IID with mean $\mu$ and variance $\sigma^2$. The central limit theorem (CLT) says that $\overline{X}_n = n^{-1}\sum_i X_i$ has a distribution which is approximately Normal with mean $\mu$ and variance $\sigma^2/n$. This is remarkable since nothing is assumed about the distribution of $X_i$ except the existence of the mean and variance.

5.8 Theorem (The Central Limit Theorem (CLT)). Let $X_1, \ldots, X_n$ be IID with mean $\mu$ and variance $\sigma^2$. Let $\overline{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Then
$$Z_n \equiv \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \rightsquigarrow Z$$
where $Z \sim N(0, 1)$. In other words,
$$\lim_{n\to\infty} P(Z_n \leq z) = \Phi(z) = \int_{-\infty}^{z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,dx.$$

Interpretation: Probability statements about $\overline{X}_n$ can be approximated using a Normal distribution. It's the probability statements that we are approximating, not the random variable itself.

There are several forms of notation for the fact that the distribution of $Z_n$ is converging to a Normal. They all mean the same thing:
$$Z_n \approx N(0, 1), \quad \overline{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right), \quad \overline{X}_n - \mu \approx N\left(0, \frac{\sigma^2}{n}\right), \quad \sqrt{n}(\overline{X}_n - \mu) \approx N(0, \sigma^2), \quad \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} \approx N(0, 1).$$

5.9 Example. Suppose that the number of errors per computer program has a Poisson distribution with mean 5. We get 125 programs. Let $X_1, \ldots, X_{125}$ be the number of errors in the programs. We want to approximate $P(\overline{X}_n < 5.5)$. Let $\mu = E(X_1) = \lambda = 5$ and $\sigma^2 = V(X_1) = \lambda = 5$. Then,
$$P(\overline{X}_n < 5.5) = P\left(\frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma} < \frac{\sqrt{n}(5.5 - \mu)}{\sigma}\right) \approx P(Z < 2.5) = .9938.$$
(A simulation check of this approximation appears below.)

The central limit theorem tells us that $Z_n = \sqrt{n}(\overline{X}_n - \mu)/\sigma$ is approximately $N(0, 1)$. However, we rarely know $\sigma$. Later, we will see that we can estimate $\sigma^2$ from $X_1, \ldots, X_n$ by
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2.$$
This raises the following question: if we replace $\sigma$ with $S_n$, is the central limit theorem still true? The answer is yes.

5.10 Theorem. Assume the same conditions as the CLT. Then,
$$\frac{\sqrt{n}(\overline{X}_n - \mu)}{S_n} \rightsquigarrow N(0, 1).$$

You might wonder how accurate the Normal approximation is. The answer is given in the Berry-Esséen theorem.

5.11 Theorem (The Berry-Esséen Inequality). Suppose that $E|X_1|^3 < \infty$. Then
$$\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{33}{4}\frac{E|X_1 - \mu|^3}{\sqrt{n}\,\sigma^3}.$$

There is also a multivariate version of the central limit theorem.

5.12 Theorem (Multivariate central limit theorem). Let $X_1, \ldots, X_n$ be IID random vectors with mean vector $\mu$ and variance matrix $\Sigma$. Let $\overline{X}$ be the vector of sample means. Then $\sqrt{n}(\overline{X} - \mu) \rightsquigarrow N(0, \Sigma)$.

5.5 The Delta Method

If $Y_n$ has a limiting Normal distribution, then the delta method allows us to find the limiting distribution of $g(Y_n)$ where $g$ is any smooth function.

5.13 Theorem (The Delta Method). Suppose that
$$\frac{\sqrt{n}(Y_n - \mu)}{\sigma} \rightsquigarrow N(0, 1)$$
and that $g$ is a differentiable function such that $g'(\mu) \neq 0$. Then
$$\frac{\sqrt{n}(g(Y_n) - g(\mu))}{|g'(\mu)|\,\sigma} \rightsquigarrow N(0, 1).$$
In other words, $Y_n \approx N(\mu, \sigma^2/n)$ implies that $g(Y_n) \approx N\left(g(\mu), (g'(\mu))^2\sigma^2/n\right)$.

5.14 Example. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, $\sqrt{n}(\overline{X}_n - \mu)/\sigma \rightsquigarrow N(0, 1)$. Let $W_n = e^{\overline{X}_n}$. Thus, $W_n = g(\overline{X}_n)$ where $g(s) = e^s$. Since $g'(s) = e^s$, the delta method implies that $W_n \approx N(e^{\mu}, e^{2\mu}\sigma^2/n)$.

There is also a multivariate version of the delta method.

5.15 Theorem (The Multivariate Delta Method). Suppose that $Y_n = (Y_{n1}, \ldots, Y_{nk})$ is a sequence of random vectors such that $\sqrt{n}(Y_n - \mu) \rightsquigarrow N(0, \Sigma)$. Let $g : \mathbb{R}^k \to \mathbb{R}$ and let
$$\nabla g(y) = \left(\frac{\partial g}{\partial y_1}, \ldots, \frac{\partial g}{\partial y_k}\right)^T$$
be the gradient of $g$. Assume that the elements of $\nabla_\mu \equiv \nabla g(\mu)$ are nonzero. Then
$$\sqrt{n}\left(g(Y_n) - g(\mu)\right) \rightsquigarrow N\left(0, \nabla_\mu^T\,\Sigma\,\nabla_\mu\right).$$

5.16 Example. Let $(X_{11}, X_{21}), \ldots, (X_{1n}, X_{2n})$ be IID random vectors with mean $\mu = (\mu_1, \mu_2)^T$ and variance matrix $\Sigma$, and define $Y_n = \overline{X}_1\overline{X}_2$. Thus, $Y_n = g(\overline{X}_1, \overline{X}_2)$ where $g(s_1, s_2) = s_1 s_2$, so $\nabla g = (s_2, s_1)^T$. By the multivariate delta method,
$$\sqrt{n}\left(\overline{X}_1\overline{X}_2 - \mu_1\mu_2\right) \rightsquigarrow N\left(0,\ \mu_2^2\sigma_{11} + 2\mu_1\mu_2\sigma_{12} + \mu_1^2\sigma_{22}\right).$$

5.6 Bibliographic Remarks

Convergence plays a central role in modern probability theory. For more details, see Grimmett and Stirzaker (1982), Billingsley (1979), and van der Vaart (1998). Advanced convergence theory is treated in great detail in van der Vaart and Wellner (1996) and van der Vaart (1998).
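The following Python sketch checks the Normal approximation of Example 5.9 against simulation; $n = 125$ and $\lambda = 5$ come from the example, while the number of replicates is an arbitrary choice (SciPy is assumed available for the Normal CDF).

    import numpy as np
    from scipy import stats

    # CLT approximation for the mean of n = 125 Poisson(5) counts.
    rng = np.random.default_rng(3)
    n, lam = 125, 5.0
    z = np.sqrt(n) * (5.5 - lam) / np.sqrt(lam)     # = 2.5
    print(stats.norm.cdf(z))                        # approx 0.9938
    xbar = rng.poisson(lam, size=(50_000, n)).mean(axis=1)
    print((xbar < 5.5).mean())                      # simulated probability

The two numbers should agree to two or three decimal places, which is what the Berry-Esséen bound leads us to expect at this sample size.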
5.7 Appendix

5.7.1 Almost Sure and $L_1$ Convergence. We say that $X_n$ converges almost surely to $X$, written $X_n \xrightarrow{as} X$, if $P(\{\omega : X_n(\omega) \to X(\omega)\}) = 1$. We say that $X_n$ converges in $L_1$ to $X$, written $X_n \xrightarrow{L_1} X$, if $E|X_n - X| \to 0$ as $n \to \infty$.

The weak law of large numbers says that $\overline{X}_n$ converges to $E(X_1)$ in probability. The strong law asserts that this is also true almost surely.

5.18 Theorem (The Strong Law of Large Numbers). Let $X_1, X_2, \ldots$ be IID. If $\mu = E|X_1| < \infty$, then $\overline{X}_n \xrightarrow{as} \mu$.

A sequence $X_n$ is asymptotically uniformly integrable if
$$\lim_{M\to\infty}\limsup_n E\left(|X_n|\,I(|X_n| > M)\right) = 0.$$

5.19 Theorem. If $X_n \xrightarrow{P} b$ and $X_n$ is asymptotically uniformly integrable, then $E(X_n) \to b$.

5.7.2 Proof of the Central Limit Theorem. Recall that if $X$ is a random variable, its moment generating function (MGF) is $\psi_X(t) = E\,e^{tX}$. Assume in what follows that the MGF is finite in a neighborhood around $t = 0$.

5.20 Lemma. Let $Z_1, Z_2, \ldots$ be a sequence of random variables. Let $\psi_n$ be the MGF of $Z_n$. Let $Z$ be another random variable and denote its MGF by $\psi$. If $\psi_n(t) \to \psi(t)$ for all $t$ in some open interval around 0, then $Z_n \rightsquigarrow Z$.

PROOF OF THE CENTRAL LIMIT THEOREM. Let $Y_i = (X_i - \mu)/\sigma$. Then $Z_n = n^{-1/2}\sum_i Y_i$. Let $\psi(t)$ be the MGF of $Y_i$. The MGF of $\sum_i Y_i$ is $(\psi(t))^n$ and the MGF of $Z_n$ is $\left(\psi(t/\sqrt{n})\right)^n \equiv \xi_n(t)$. Now $\psi'(0) = E(Y_1) = 0$ and $\psi''(0) = E(Y_1^2) = V(Y_1) = 1$, so
$$\psi(t) = 1 + \frac{t^2}{2} + \frac{t^3}{3!}\psi'''(0) + \cdots$$
Now,
$$\xi_n(t) = \left(\psi\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 + \frac{t^2}{2n} + \frac{t^3}{3!\,n^{3/2}}\psi'''(0) + \cdots\right)^n \to e^{t^2/2},$$
which is the MGF of a $N(0, 1)$. The result follows from Lemma 5.20. In the last step we used the fact that if $a_n \to a$ then $(1 + a_n/n)^n \to e^a$. $\blacksquare$

5.8 Exercises

1. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu = E(X_1)$ and finite variance $\sigma^2 = V(X_1)$. Let $\overline{X}_n$ be the sample mean and let $S_n^2$ be the sample variance. (a) Show that $E(S_n^2) = \sigma^2$. (b) Show that $S_n^2 \xrightarrow{P} \sigma^2$. Hint: Show that $S_n^2 = c_n n^{-1}\sum_{i=1}^{n} X_i^2 - d_n\overline{X}_n^2$ where $c_n \to 1$ and $d_n \to 1$. Apply the law of large numbers to $n^{-1}\sum_i X_i^2$ and to $\overline{X}_n$. Then use part (e) of Theorem 5.5.
2. Let $X_1, X_2, \ldots$ be a sequence of random variables. Show that $X_n \xrightarrow{qm} b$ if and only if $\lim_n E(X_n) = b$ and $\lim_n V(X_n) = 0$.
3. Let $X_1, \ldots, X_n$ be IID with finite mean $\mu$ and finite variance. Show that $\overline{X}_n \xrightarrow{qm} \mu$.
4. Let $X_1, X_2, \ldots$ be a sequence of random variables such that
$$P\left(X_n = \frac{1}{n}\right) = 1 - \frac{1}{n^2} \quad \text{and} \quad P(X_n = n) = \frac{1}{n^2}.$$
Does $X_n$ converge in probability? Does $X_n$ converge in quadratic mean?
5. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Prove that
$$\frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{P} p \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n} X_i^2 \xrightarrow{qm} p.$$
6. Suppose that the height of men has mean 68 inches and standard deviation 2.6 inches. We draw 100 men at random. Find (approximately) the probability that the average height of the men in the sample is at least 68 inches.
7. Let $\lambda_n = 1/n$ for $n = 1, 2, \ldots$ and let $X_n \sim \text{Poisson}(\lambda_n)$. (a) Show that $X_n \xrightarrow{P} 0$. (b) Let $Y_n = nX_n$. Show that $Y_n \xrightarrow{P} 0$.
8. Suppose that the number of errors per program is Poisson with mean 1 and that the programs are independent. Let $Y = \sum_i X_i$ be the total number of errors in 100 programs. Use the central limit theorem to approximate $P(Y < 90)$.
9. Suppose that $P(X = 1) = P(X = -1) = 1/2$. Define
$$X_n = \begin{cases} X & \text{with probability } 1 - \frac{1}{n} \\ e^n & \text{with probability } \frac{1}{n}. \end{cases}$$
Does $X_n$ converge to $X$ in probability? Does $X_n$ converge to $X$ in distribution? Does $E(X - X_n)^2$ converge to 0?
10. Let $Z \sim N(0, 1)$ and let $t > 0$. Show that, for any $k > 0$, $P(|Z| > t) \leq E|Z|^k/t^k$. Compare this to Mill's inequality in Chapter 4.
11. Suppose that $X_n \sim N(0, 1/n)$ and let $X$ be a random variable with distribution $F(x) = 0$ if $x < 0$ and $F(x) = 1$ if $x \geq 0$. Does $X_n$ converge to $X$ in probability? (Prove or disprove.) Does $X_n$ converge to $X$ in distribution? (Prove or disprove.)
12. Let $Z_1, Z_2, \ldots$ be IID, positive, with density $f$ that is continuous and positive at 0, and let $\lambda = \lim_{x\downarrow 0} f(x) > 0$. Let $X_n = n\min\{Z_1, \ldots, Z_n\}$. Show that $X_n \rightsquigarrow Z$ where $Z$ has an exponential distribution with mean $1/\lambda$.
13. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, 1)$ and let $Y_n = \overline{X}_n^2$. Find the limiting distribution of $Y_n$.
14. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be IID random vectors with mean vector $\mu = (\mu_1, \mu_2)$ and variance matrix $\Sigma$. Define $W_n = \overline{X}_n/\overline{Y}_n$ and find the limiting distribution of $W_n$.

Part II: Statistical Inference

6 Models, Statistical Inference and Learning

6.1 Introduction

Statistical inference, or "learning" as it is called in computer science, is the process of using data to infer the distribution that generated the data. A typical statistical inference question is: Given a sample $X_1, \ldots, X_n \sim F$, how do we infer $F$?

6.2 Parametric and Nonparametric Models

A statistical model $\mathfrak{F}$ is a set of distributions (or densities or regression functions). A parametric model is a set $\mathfrak{F}$ that can be parameterized by a finite number of parameters. For example, if we assume that the data come from a Normal distribution, the model is
$$\mathfrak{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} :\ \mu \in \mathbb{R},\ \sigma > 0 \right\}.$$
This is a two-parameter model. A nonparametric model is a set $\mathfrak{F}$ that cannot be parameterized by a finite number of parameters.

6.3 Fundamental Concepts in Inference

Many inferential problems can be identified as being one of three types: estimation, confidence sets, or hypothesis testing. We will treat all of these problems in detail in the rest of the book. Here we give a brief introduction to the ideas.

6.3.1 Point Estimation

Point estimation refers to providing a single "best guess" of some quantity of interest. The quantity of interest could be a parameter in a parametric model, a CDF $F$, a probability density function $f$, a regression function, or a prediction for a future value $Y$ of some random variable. By convention, we denote a point estimate of $\theta$ by $\hat{\theta}$ or $\hat{\theta}_n$. Remember that $\theta$ is a fixed, unknown quantity. The estimate $\hat{\theta}$ depends on the data, so $\hat{\theta}$ is a random variable. More formally, let $X_1, \ldots, X_n$ be $n$ IID data points from some distribution $F$. A point estimator $\hat{\theta}_n$ of a parameter $\theta$ is some function of $X_1, \ldots, X_n$:
$$\hat{\theta}_n = g(X_1, \ldots, X_n).$$
The bias of an estimator is defined by
$$\text{bias}(\hat{\theta}_n) = E_{\theta}(\hat{\theta}_n) - \theta.$$
We say that $\hat{\theta}_n$ is unbiased if $E(\hat{\theta}_n) = \theta$. Unbiasedness used to receive much attention but these days it is considered less important; many of the estimators we will use are biased. A reasonable requirement for an estimator is that it should converge to the true parameter value as we collect more data: $\hat{\theta}_n$ is consistent if $\hat{\theta}_n \xrightarrow{P} \theta$.

The distribution of $\hat{\theta}_n$ is called the sampling distribution. The standard deviation of $\hat{\theta}_n$ is called the standard error, denoted by
$$\text{se} = \text{se}(\hat{\theta}_n) = \sqrt{V(\hat{\theta}_n)}.$$
Often, the standard error depends on the unknown $F$. In those cases, se is an unknown quantity, but we usually can estimate it; the estimated standard error is denoted by $\widehat{\text{se}}$.

6.8 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1}\sum_i X_i$. Then $E(\hat{p}_n) = n^{-1}\sum_i E(X_i) = p$, so $\hat{p}_n$ is unbiased. The standard error is $\text{se} = \sqrt{V(\hat{p}_n)} = \sqrt{p(1-p)/n}$. The estimated standard error is $\widehat{\text{se}} = \sqrt{\hat{p}(1-\hat{p})/n}$.

The quality of a point estimate is sometimes assessed by the mean squared error, or MSE, defined by
$$\text{MSE} = E_{\theta}(\hat{\theta}_n - \theta)^2.$$
Keep in mind that $E_{\theta}(\cdot)$ refers to expectation with respect to the distribution $f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$ that generated the data. It does not mean we are averaging over a distribution for $\theta$.

6.9 Theorem. The MSE can be written as
$$\text{MSE} = \text{bias}^2(\hat{\theta}_n) + V_{\theta}(\hat{\theta}_n).$$

Proof. Let $\overline{\theta}_n = E_{\theta}(\hat{\theta}_n)$. Then
$$E_{\theta}(\hat{\theta}_n - \theta)^2 = E_{\theta}(\hat{\theta}_n - \overline{\theta}_n + \overline{\theta}_n - \theta)^2 = E_{\theta}(\hat{\theta}_n - \overline{\theta}_n)^2 + 2(\overline{\theta}_n - \theta)E_{\theta}(\hat{\theta}_n - \overline{\theta}_n) + (\overline{\theta}_n - \theta)^2 = V_{\theta}(\hat{\theta}_n) + \text{bias}^2(\hat{\theta}_n),$$
since $E_{\theta}(\hat{\theta}_n - \overline{\theta}_n) = 0$. $\blacksquare$

6.10 Theorem. If $\text{bias} \to 0$ and $\text{se} \to 0$ as $n \to \infty$, then $\hat{\theta}_n$ is consistent, that is, $\hat{\theta}_n \xrightarrow{P} \theta$.

Proof. If bias $\to 0$ and se $\to 0$ then, by Theorem 6.9, MSE $\to 0$. It follows that $\hat{\theta}_n \xrightarrow{qm} \theta$ (recall Definition 5.2). The result follows from part (a) of Theorem 5.4. $\blacksquare$

6.11 Example. Returning to the coin flipping example, we have that $E_p(\hat{p}_n) = p$, so the bias $= p - p = 0$, and $\text{se} = \sqrt{p(1-p)/n} \to 0$. Hence, $\hat{p}_n \xrightarrow{P} p$; that is, $\hat{p}_n$ is a consistent estimator.

Many of the estimators we will encounter will turn out to have, approximately, a Normal distribution.

6.12 Definition. An estimator is asymptotically Normal if
$$\frac{\hat{\theta}_n - \theta}{\text{se}} \rightsquigarrow N(0, 1).$$

6.3.2 Confidence Sets

A $1 - \alpha$ confidence interval for a parameter $\theta$ is an interval $C_n = (a, b)$ where $a = a(X_1, \ldots, X_n)$ and $b = b(X_1, \ldots, X_n)$ are functions of the data such that
$$P_{\theta}(\theta \in C_n) \geq 1 - \alpha \quad \text{for all } \theta \in \Theta.$$
In words, $(a, b)$ traps $\theta$ with probability $1 - \alpha$. We call $1 - \alpha$ the coverage of the confidence interval.

Warning! $C_n$ is random and $\theta$ is fixed. Commonly, people use 95 percent confidence intervals, corresponding to $\alpha = 0.05$. If $\theta$ is a vector, then we use a confidence set (such as a sphere or an ellipse) instead of an interval.

Warning! There is much confusion about how to interpret a confidence interval. A confidence interval is not a probability statement about $\theta$, since $\theta$ is a fixed quantity, not a random variable. Some texts say: if I repeat the experiment over and over, the interval will contain the parameter 95 percent of the time. This is correct but useless, since we rarely repeat the same experiment over and over. A better interpretation is this:

On day 1, you collect data and construct a 95 percent confidence interval for a parameter $\theta_1$. On day 2, you collect new data and construct a 95 percent confidence interval for an unrelated parameter $\theta_2$. On day 3, you collect new data and construct a 95 percent confidence interval for an unrelated parameter $\theta_3$. You continue this way, constructing confidence intervals for a sequence of unrelated parameters $\theta_1, \theta_2, \ldots$ Then 95 percent of your intervals will trap the true parameter value. There is no need to introduce the idea of repeating the same experiment over and over.

6.13 Example. Every day, newspapers report opinion polls. For example, they might say that "83 percent of the population favor arming pilots with guns," with the poll "accurate to within 4 points 95 percent of the time." They are saying that $83 \pm 4$ is a 95 percent confidence interval for the true but unknown proportion $p$. If you form a confidence interval this way every day of your life, 95 percent of your intervals will contain the true parameter. This is true even though you are estimating a different quantity (a different poll question) every day.

6.14 Example. The fact that a confidence interval is not a probability statement about $\theta$ is confusing. Consider this example from Berger and Wolpert (1984). Let $\theta$ be a fixed, known real number and let $X_1, X_2$ be independent random variables such that $P(X_i = 1) = P(X_i = -1) = 1/2$. Now define $Y_i = \theta + X_i$ and suppose that you only observe $Y_1$ and $Y_2$. Define the following "interval," which actually contains only one point:
$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \neq Y_2. \end{cases}$$
No matter what $\theta$ is, $P_{\theta}(\theta \in C) = 3/4$, so this is a 75 percent confidence interval. Suppose we now do the experiment and get $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence interval is $\{16\}$. However, we are certain that $\theta = 16$. If you wanted to make a probability statement about $\theta$, you would probably say that $P(\theta \in C \mid Y_1, Y_2) = 1$. There is nothing wrong with saying that the interval has 75 percent coverage, but the coverage is not a statement about $P(\theta \in C \mid Y_1, Y_2)$. In Chapter 11 we will discuss Bayesian methods, in which we treat $\theta$ as if it were a random variable and do make probability statements about $\theta$.

6.15 Example. In the coin flipping setting, let $C_n = (\hat{p}_n - \epsilon_n, \hat{p}_n + \epsilon_n)$ where $\epsilon_n^2 = \log(2/\alpha)/(2n)$. From Hoeffding's inequality (4.4) it follows that $P(p \in C_n) \geq 1 - \alpha$ for every $p$, so $C_n$ is a $1 - \alpha$ confidence interval.

Later, we will discuss methods for constructing confidence intervals in greater generality. The most common approach uses asymptotic Normality.

6.16 Theorem (Normal-based Confidence Interval). Suppose that $\hat{\theta}_n \approx N(\theta, \widehat{\text{se}}^2)$. Let $\Phi$ be the CDF of a standard Normal, let $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, and let
$$C_n = \left(\hat{\theta}_n - z_{\alpha/2}\,\widehat{\text{se}},\ \hat{\theta}_n + z_{\alpha/2}\,\widehat{\text{se}}\right).$$
Then $P_{\theta}(\theta \in C_n) \to 1 - \alpha$.

Proof. Let $Z_n = (\hat{\theta}_n - \theta)/\widehat{\text{se}}$. By assumption, $Z_n \rightsquigarrow Z \sim N(0, 1)$. Hence,
$$P_{\theta}(\theta \in C_n) = P(-z_{\alpha/2} < Z_n < z_{\alpha/2}) \to P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1 - \alpha. \qquad \blacksquare$$
For 95 percent intervals, $\alpha = 0.05$ and $z_{\alpha/2} = 1.96 \approx 2$, leading to the approximate interval $\hat{\theta}_n \pm 2\,\widehat{\text{se}}$.

6.17 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\hat{p}_n = n^{-1}\sum_i X_i$. Then $V(\hat{p}_n) = p(1-p)/n$, so $\widehat{\text{se}} = \sqrt{\hat{p}_n(1-\hat{p}_n)/n}$. By the central limit theorem, $\hat{p}_n \approx N(p, \widehat{\text{se}}^2)$. Therefore, an approximate $1 - \alpha$ confidence interval is
$$\hat{p}_n \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_n(1 - \hat{p}_n)}{n}}.$$
Compare this with the Hoeffding interval in Example 6.15. (A small numerical sketch of this interval appears below.)
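Here is a minimal Python sketch of the Normal-based interval of Example 6.17, assuming NumPy and SciPy; the data are simulated with an arbitrary true value $p = .3$.

    import numpy as np
    from scipy import stats

    # Normal-based 1 - alpha interval: phat +/- z * sqrt(phat(1-phat)/n).
    rng = np.random.default_rng(4)
    n, p, alpha = 100, 0.3, 0.05
    x = rng.binomial(1, p, size=n)
    phat = x.mean()
    se = np.sqrt(phat * (1 - phat) / n)
    z = stats.norm.ppf(1 - alpha / 2)
    print(phat - z * se, phat + z * se)   # should usually cover p = 0.3

Repeating this over many simulated data sets, the interval covers the true $p$ close to 95 percent of the time, as Theorem 6.16 predicts.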
6.3.3 Hypothesis Testing

6.18 Example (Testing if a Coin is Fair). Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ be $n$ independent coin flips. Suppose we want to test the hypothesis that the coin is fair. Let $H_0$ denote the hypothesis that the coin is fair ($p = 1/2$) and let $H_1$ denote the hypothesis that the coin is not fair. $H_0$ is called the null hypothesis and $H_1$ is called the alternative hypothesis. We can write the hypotheses as $H_0 : p = 1/2$ versus $H_1 : p \neq 1/2$. It would be reasonable to reject $H_0$ if $T = |\hat{p}_n - 1/2|$ is large. When we discuss hypothesis testing in detail, we will be precise about how large $T$ should be to reject $H_0$.

6.4 Bibliographic Remarks

Statistical inference is covered in many texts. Elementary texts include DeGroot and Schervish (2002) and Larsen and Marx (1986). At the intermediate level I recommend Casella and Berger (2002), Bickel and Doksum (2000), and Rice (1995). At the advanced level, Cox and Hinkley (2000) and Lehmann and Casella (1998).

6.6 Exercises (selected)

1. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$ and let $\hat{\lambda} = n^{-1}\sum_{i=1}^{n} X_i$. Find the bias, se, and MSE of this estimator.
2. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and let $\hat{\theta} = \max\{X_1, \ldots, X_n\}$. Find the bias, se, and MSE of this estimator.

7 Estimating the CDF and Statistical Functionals

The first inference problem we will consider is nonparametric estimation of the CDF $F$. Then we will estimate statistical functionals of $F$, which are functions of $F$, such as the mean, the variance, and the correlation.

7.1 The Empirical Distribution Function

Let $X_1, \ldots, X_n \sim F$ be an IID sample where $F$ is a distribution function on the real line. We will estimate $F$ with the empirical distribution function, which is defined as follows.

7.1 Definition. The empirical distribution function $\hat{F}_n$ is the CDF that puts mass $1/n$ at each data point $X_i$. Formally,
$$\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \leq x)$$
where $I(X_i \leq x) = 1$ if $X_i \leq x$ and 0 otherwise.

7.3 Theorem. At any fixed value of $x$,
$$E\left(\hat{F}_n(x)\right) = F(x) \quad \text{and} \quad V\left(\hat{F}_n(x)\right) = \frac{F(x)(1 - F(x))}{n}.$$
Thus, $\text{MSE} \to 0$ and hence $\hat{F}_n(x) \xrightarrow{P} F(x)$.

7.4 Theorem (The Glivenko-Cantelli Theorem). Let $X_1, \ldots, X_n \sim F$. Then
$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{P} 0.$$

7.5 Theorem (The Dvoretzky-Kiefer-Wolfowitz (DKW) Inequality). Let $X_1, \ldots, X_n \sim F$. Then, for any $\epsilon > 0$,
$$P\left(\sup_x |F(x) - \hat{F}_n(x)| > \epsilon\right) \leq 2e^{-2n\epsilon^2}.$$
From the DKW inequality, we can construct a nonparametric $1 - \alpha$ confidence band for $F$: take $(L(x), U(x))$ with $L(x) = \max\{\hat{F}_n(x) - \epsilon_n, 0\}$, $U(x) = \min\{\hat{F}_n(x) + \epsilon_n, 1\}$, and $\epsilon_n = \sqrt{\log(2/\alpha)/(2n)}$. (A code sketch of this band appears at the end of this chapter.)

7.2 Statistical Functionals

A statistical functional $T(F)$ is any function of $F$. Examples are the mean $\mu = \int x\,dF(x)$, the variance $\sigma^2 = \int (x - \mu)^2 dF(x)$, and the median $m = F^{-1}(1/2)$.

7.6 Definition. The plug-in estimator of $\theta = T(F)$ is defined by $\hat{\theta}_n = T(\hat{F}_n)$. In other words, just plug in $\hat{F}_n$ for the unknown $F$.

7.8 Definition. If $T(F) = \int r(x)\,dF(x)$ for some function $r(x)$, then $T$ is called a linear functional. Linearity means that $T(aF + bG) = aT(F) + bT(G)$. Recall that $\int r(x)\,dF(x)$ is defined to be $\int r(x)f(x)\,dx$ in the continuous case and $\sum_j r(x_j)f(x_j)$ in the discrete case.

7.9 Theorem. The plug-in estimator for a linear functional $T(F) = \int r(x)\,dF(x)$ is
$$T(\hat{F}_n) = \int r(x)\,d\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} r(X_i).$$
The empirical CDF $\hat{F}_n(x)$ is discrete, putting mass $1/n$ at each $X_i$; hence the formula.

Sometimes we can find the estimated standard error $\widehat{\text{se}}$ of $T(\hat{F}_n)$ by doing some calculations. However, in other cases it is not obvious how to estimate the standard error. In the next chapter, we will discuss a general method for finding $\widehat{\text{se}}$. For now, let us just assume that somehow we can find $\widehat{\text{se}}$. In many cases, it turns out that
$$T(\hat{F}_n) \approx N\left(T(F), \widehat{\text{se}}^2\right),$$
in which case an approximate $1 - \alpha$ confidence interval for $T(F)$ is $T(\hat{F}_n) \pm z_{\alpha/2}\,\widehat{\text{se}}$. We will call this the Normal-based interval. For a 95 percent confidence interval, the interval is $T(\hat{F}_n) \pm 2\,\widehat{\text{se}}$.

7.10 Example (The mean). Let $\mu = T(F) = \int x\,dF(x)$. The plug-in estimator is $\hat{\mu} = \int x\,d\hat{F}_n(x) = \overline{X}_n$. The standard error is $\text{se} = \sqrt{V(\overline{X}_n)} = \sigma/\sqrt{n}$. If $\hat{\sigma}$ denotes an estimate of $\sigma$, the estimated standard error is $\hat{\sigma}/\sqrt{n}$. (In the next example, we shall see how to estimate $\sigma$.) A Normal-based confidence interval for $\mu$ is $\overline{X}_n \pm z_{\alpha/2}\,\widehat{\text{se}}$.

7.11 Example (The variance). Let $\sigma^2 = T(F) = V(X) = \int x^2\,dF(x) - \left(\int x\,dF(x)\right)^2$. The plug-in estimator is
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \overline{X}_n)^2.$$

7.12 Example (The skewness). The skewness measures the lack of symmetry of a distribution: $\kappa = E(X - \mu)^3/\sigma^3$. To find the plug-in estimate, first recall that $\hat{\mu} = \overline{X}_n$ and $\hat{\sigma}^2 = n^{-1}\sum_i(X_i - \overline{X}_n)^2$. The plug-in estimate of $\kappa$ is
$$\hat{\kappa} = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\hat{\sigma}^3}.$$

7.13 Example (Correlation). Let $Z = (X, Y)$ and let $\rho = T(F) = E(X - \mu_X)(Y - \mu_Y)/(\sigma_X\sigma_Y)$ denote the correlation between $X$ and $Y$, where $F(x, y)$ is bivariate. We can write $T(F) = a(T_1(F), \ldots, T_5(F))$ where
$$T_1(F) = \int x\,dF, \quad T_2(F) = \int y\,dF, \quad T_3(F) = \int xy\,dF, \quad T_4(F) = \int x^2\,dF, \quad T_5(F) = \int y^2\,dF$$
and $a(t_1, \ldots, t_5) = (t_3 - t_1t_2)/\sqrt{(t_4 - t_1^2)(t_5 - t_2^2)}$. Replacing $F$ with $\hat{F}_n$ in $T_1(F), \ldots, T_5(F)$ and taking $\hat{\rho} = a(T_1(\hat{F}_n), \ldots, T_5(\hat{F}_n))$ gives
$$\hat{\rho} = \frac{\sum_i (X_i - \overline{X}_n)(Y_i - \overline{Y}_n)}{\sqrt{\sum_i (X_i - \overline{X}_n)^2}\sqrt{\sum_i (Y_i - \overline{Y}_n)^2}},$$
which is called the sample correlation.

7.14 Example (Quantiles). Let $F$ be strictly increasing with density $f$. The $p$th quantile is $T(F) = F^{-1}(p)$. The plug-in estimate is $T(\hat{F}_n) = \hat{F}_n^{-1}(p)$. We have to be a bit careful since $\hat{F}_n$ is not invertible; to avoid ambiguity we define $\hat{F}_n^{-1}(p) = \inf\{x : \hat{F}_n(x) \geq p\}$.

7.15 Example (Plasma Cholesterol). Consider plasma cholesterol levels for two groups of patients, with and without evidence of narrowed arteries. Comparing the plug-in estimates of the two group means yields $\hat{\theta} = \hat{\mu}_2 - \hat{\mu}_1 = 216.19 - 195.27 = 20.92$, with a standard error computed from the two groups, and an approximate 95 percent confidence interval $\hat{\theta} \pm 2\,\widehat{\text{se}}$. There is evidence that cholesterol is higher among those with narrowed arteries, but we should not jump to the conclusion (from these data) that cholesterol causes heart disease. The leap from statistical evidence to causation is very subtle and is discussed in Chapter 16.

7.3 Bibliographic Remarks

The Glivenko-Cantelli theorem is the tip of the iceberg. The theory of distribution functions is a special case of what are called empirical processes, which underlie much of modern statistical theory. Some references on empirical processes are Shorack and Wellner (1986) and van der Vaart and Wellner (1996).

7.4 Exercises (selected)

1. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $Y_1, \ldots, Y_m \sim \text{Bernoulli}(q)$. Find the plug-in estimator and estimated standard error for $p$. Find an approximate 90 percent confidence interval for $p$. Find the plug-in estimator and estimated standard error for $p - q$. Find an approximate 90 percent confidence interval for $p - q$.
2. Get the data on eruption times and waiting times between eruptions of the Old Faithful geyser from the book website. Estimate the mean waiting time and give a standard error for the estimate. Also, give a 90 percent confidence interval for the mean waiting time. Now estimate the median waiting time. In the next chapter we will see how to get the standard error for the median.
3. In 1975, an experiment was conducted to see if cloud seeding produced rainfall: 26 clouds were seeded with silver nitrate and 26 were not. Estimate the difference in mean rainfall and give a confidence interval.
4. (Computer Experiment.) Generate 100 observations from a $N(0, 1)$ distribution. Compute a 95 percent confidence band for the CDF $F$ using the DKW inequality. Repeat this 1000 times and see how often the confidence band contains the true distribution function.
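The following Python sketch builds the DKW confidence band of Theorem 7.5; the Normal data and the sample size are illustrative stand-ins for a real data set.

    import numpy as np

    # 95% DKW band: L(x) = max(Fhat - eps, 0), U(x) = min(Fhat + eps, 1),
    # with eps = sqrt(log(2/alpha) / (2n)).
    rng = np.random.default_rng(5)
    x = np.sort(rng.normal(size=100))
    n, alpha = len(x), 0.05
    fhat = np.arange(1, n + 1) / n                  # Fhat at the order statistics
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    lower = np.clip(fhat - eps, 0, 1)
    upper = np.clip(fhat + eps, 0, 1)
    print(eps)                                      # about 0.136 when n = 100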
8 The Bootstrap

The bootstrap is a method for estimating standard errors and computing confidence intervals. Let $T_n = g(X_1, \ldots, X_n)$ be a statistic, that is, any function of the data. Suppose we want to know $V_F(T_n)$, the variance of $T_n$. The bootstrap idea has two parts: first, estimate $V_F(T_n)$ with $V_{\hat{F}_n}(T_n)$; second, approximate $V_{\hat{F}_n}(T_n)$ by simulation.

8.1 Simulation

Suppose we draw an IID sample $Y_1, \ldots, Y_B$ from a distribution $G$. By the law of large numbers,
$$\overline{Y}_B = \frac{1}{B}\sum_{j=1}^{B} Y_j \xrightarrow{P} \int y\,dG(y) = E(Y)$$
as $B \to \infty$. So if we draw a large sample from $G$, we can use the sample mean $\overline{Y}_B$ to approximate $E(Y)$. More generally, the sample variance of $Y_1, \ldots, Y_B$ approximates $V(Y)$, and such approximations can be made arbitrarily accurate by taking $B$ large enough.

8.2 Bootstrap Variance Estimation

How do we simulate from the distribution of $T_n$ when the data are assumed to have distribution $\hat{F}_n$? The answer is to simulate $X_1^*, \ldots, X_n^* \sim \hat{F}_n$. Since $\hat{F}_n$ puts mass $1/n$ at each data point,

drawing an observation from $\hat{F}_n$ is equivalent to drawing one point at random from the original data set.

Thus, to simulate $X_1^*, \ldots, X_n^* \sim \hat{F}_n$ it suffices to draw $n$ observations with replacement from $X_1, \ldots, X_n$. Here is a summary:

    Bootstrap Variance Estimation
    1. Draw X*_1, ..., X*_n ~ Fhat_n.
    2. Compute T*_n = g(X*_1, ..., X*_n).
    3. Repeat steps 1 and 2, B times, to get T*_{n,1}, ..., T*_{n,B}.
    4. Let v_boot be the sample variance of T*_{n,1}, ..., T*_{n,B}.

The following schematic diagram will remind you that we are using two approximations:
$$V_F(T_n) \underset{\text{not so small}}{\approx} V_{\hat{F}_n}(T_n) \underset{\text{small}}{\approx} v_{\text{boot}}.$$

Real world: $F \implies X_1, \ldots, X_n \implies T_n = g(X_1, \ldots, X_n)$.
Bootstrap world: $\hat{F}_n \implies X_1^*, \ldots, X_n^* \implies T_n^* = g(X_1^*, \ldots, X_n^*)$.

8.1 Example. The following pseudocode shows how to use the bootstrap to estimate the standard error of the median (a runnable version appears at the end of this chapter):

    Bootstrap for the Median
    Given data X = (X(1), ..., X(n)):
    Tboot <- vector of length B
    for i in 1:B {
        Xstar <- sample of size n from X (with replacement)
        Tboot[i] <- median(Xstar)
    }
    se <- sqrt(variance(Tboot))

8.3 Bootstrap Confidence Intervals

There are several ways to construct bootstrap confidence intervals. The Normal interval is $T_n \pm z_{\alpha/2}\,\widehat{\text{se}}_{\text{boot}}$ where $\widehat{\text{se}}_{\text{boot}} = \sqrt{v_{\text{boot}}}$; this interval is accurate only if the distribution of $T_n$ is approximately Normal. The pivotal interval is based on the pivot $R_n = \hat{\theta}_n - \theta$. The percentile interval is $C_n = \left(T^*_{\alpha/2},\ T^*_{1-\alpha/2}\right)$, where $T^*_{\beta}$ denotes the $\beta$ sample quantile of the bootstrap replications $T^*_{n,1}, \ldots, T^*_{n,B}$.

8.4 Example (The Cholesterol Data). Let us return to the cholesterol data. The statistic of interest is the difference in means $\theta$; the bootstrap yields a standard error and the interval $(\hat{\theta} - 2\,\widehat{\text{se}},\ \hat{\theta} + 2\,\widehat{\text{se}})$. The conclusion is that cholesterol appears higher in the diseased group, although there is considerable uncertainty about how much higher. (We use these examples for their pedagogical value, but we do want to sound a note of caution about drawing causal conclusions from them.)

8.6 Example. Here is an example that was one of the first used to illustrate the bootstrap: LSAT scores (for entrance to law school) and GPA for fifteen law schools. Each data point is of the form $Z_i = (X_i, Y_i)$ with $X_i = \text{LSAT}_i$ and $Y_i = \text{GPA}_i$. The quantity of interest is the correlation $\theta = \rho(X, Y)$. The plug-in estimate is the sample correlation, and the bootstrap provides its standard error and confidence intervals.

8.5 The Jackknife

There is another method for computing standard errors called the jackknife, due to Quenouille (1949). It is less computationally expensive than the bootstrap but less general. Let $T_n = T(X_1, \ldots, X_n)$ be a statistic and let $T_{(-i)}$ denote the statistic with the $i$th observation removed. The jackknife estimate of $\text{var}(T_n)$ is
$$v_{\text{jack}} = \frac{n-1}{n}\sum_{i=1}^{n}\left(T_{(-i)} - \frac{1}{n}\sum_{j=1}^{n} T_{(-j)}\right)^2,$$
and the jackknife estimate of the standard error is $\widehat{\text{se}}_{\text{jack}} = \sqrt{v_{\text{jack}}}$. Under suitable conditions on $T$, it can be shown that $v_{\text{jack}}$ consistently estimates $\text{var}(T_n)$. However, unlike the bootstrap, the jackknife does not produce consistent estimates of the standard error of sample quantiles.

8.5.2 Justification for the Percentile Interval

Suppose there exists a monotone transformation $U = m(T)$ such that $U \sim N(\phi, c^2)$ where $\phi = m(\theta)$. We do not suppose we know the transformation, only that one exists. Let $U^*_b = m(T^*_{n,b})$ and let $u^*_{\beta}$ be the $\beta$ sample quantile of the $U^*_b$'s. Since a monotone transformation preserves quantiles, $u^*_{\alpha/2} = m(T^*_{\alpha/2})$. Also, since $U \sim N(\phi, c^2)$, the $\alpha/2$ quantile of $U$ is $\phi - z_{\alpha/2}c$, so $u^*_{\alpha/2} \approx \phi - z_{\alpha/2}c$, and similarly for the upper quantile. Therefore the percentile interval $\left(T^*_{\alpha/2}, T^*_{1-\alpha/2}\right) = \left(m^{-1}(u^*_{\alpha/2}), m^{-1}(u^*_{1-\alpha/2})\right)$ has approximately correct coverage for $\theta = m^{-1}(\phi)$.
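Here is a runnable Python version of the median-bootstrap pseudocode of Example 8.1, assuming NumPy; the Normal data stand in for a real data set, and $B = 1000$ is an illustrative number of replications.

    import numpy as np

    # Bootstrap standard error and Normal interval for the sample median.
    rng = np.random.default_rng(6)
    x = rng.normal(loc=5.0, size=50)          # placeholder data
    B = 1000
    tboot = np.array([np.median(rng.choice(x, size=len(x), replace=True))
                      for _ in range(B)])
    se = tboot.std(ddof=1)                    # sqrt of the bootstrap variance
    theta_hat = np.median(x)
    print(theta_hat, se, (theta_hat - 2 * se, theta_hat + 2 * se))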
8.6 Exercises

1. Consider the LSAT/GPA data in Example 8.6. Find the plug-in estimate of the correlation coefficient. Estimate the standard error using the bootstrap. Find a 95 percent confidence interval using the pivotal method.
2. (Computer Experiment.) Conduct a simulation to compare the bootstrap confidence interval methods. Let $n = 50$ and let $T(F) = \int (x - \mu)^3 dF(x)/\sigma^3$ be the skewness. Draw $Y_1, \ldots, Y_n \sim N(0, 1)$ and set $X_i = e^{Y_i}$. Construct the three types of bootstrap 95 percent intervals for $T(F)$ from the data $X_1, \ldots, X_n$. Repeat this whole thing many times and estimate the true coverage of the three intervals.
3. Let $X_1, \ldots, X_n \sim t_3$ where $n = 25$. Let $\theta = T(F) = (q_{.75} - q_{.25})/1.34$ where $q_p$ denotes the $p$th quantile. Do a simulation to compare the coverage and length of the following confidence intervals for $\theta$: (i) Normal interval with standard error from the bootstrap, (ii) bootstrap percentile interval, and (iii) pivotal bootstrap interval.
4. Let $X_1, \ldots, X_n$ be distinct observations (no ties). Show that there are $\binom{2n-1}{n}$ distinct bootstrap samples.
5. Let $X_1, \ldots, X_n$ be distinct observations (no ties), let $X_1^*, \ldots, X_n^*$ denote a bootstrap sample, and let $\overline{X}_n^* = n^{-1}\sum_i X_i^*$. Find $E(\overline{X}_n^* \mid X_1, \ldots, X_n)$, $V(\overline{X}_n^* \mid X_1, \ldots, X_n)$, $E(\overline{X}_n^*)$, and $V(\overline{X}_n^*)$.
6. (Computer Experiment.) Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$, $\theta = e^{\mu}$, and $\hat{\theta} = e^{\overline{X}}$. Create a data set (using $\mu = 5$) consisting of $n = 100$ observations. Use the bootstrap to get the se and 95 percent confidence interval for $\theta$. Plot a histogram of the bootstrap replications; this is an estimate of the distribution of $\hat{\theta}$. Compare it to the true sampling distribution of $\hat{\theta}$.
7. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$ and $\hat{\theta} = X_{(n)} = \max\{X_1, \ldots, X_n\}$. Generate a data set of size 50 with $\theta = 1$. Find the distribution of $\hat{\theta}$ and compare the true distribution of $\hat{\theta}$ to the histograms from the bootstrap.

9 Parametric Inference

We now turn our attention to parametric models, that is, models of the form
$$\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$$
where $\Theta \subset \mathbb{R}^k$ is the parameter space and $\theta = (\theta_1, \ldots, \theta_k)$ is the parameter. The problem of inference then reduces to the problem of estimating the parameter $\theta$.

9.1 Parameter of Interest

Often, we are only interested in some function $T(\theta)$. For example, if $X \sim N(\mu, \sigma^2)$ then the parameter is $\theta = (\mu, \sigma)$. If our goal is to estimate $\mu$, then $\mu = T(\theta)$ is called the parameter of interest and $\sigma$ is called a nuisance parameter.

9.1 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ and suppose the parameter of interest is $\tau = P(X_1 > 1)$. Then
$$\tau = P(X > 1) = 1 - P(X < 1) = 1 - \Phi\left(\frac{1 - \mu}{\sigma}\right),$$
a function of $\theta = (\mu, \sigma)$.

9.2 Example. Recall that $X$ has a Gamma$(\alpha, \beta)$ distribution; the parameter of interest might be, say, the mean $T(\theta) = \alpha\beta$.

9.2 The Method of Moments

Suppose that the parameter $\theta = (\theta_1, \ldots, \theta_k)$ has $k$ components. For $1 \leq j \leq k$, define the $j$th moment $\alpha_j \equiv \alpha_j(\theta) = E_{\theta}(X^j)$ and the $j$th sample moment $\hat{\alpha}_j = n^{-1}\sum_{i=1}^{n} X_i^j$.

9.3 Definition. The method of moments estimator $\hat{\theta}_n$ is the value of $\theta$ such that $\alpha_j(\hat{\theta}_n) = \hat{\alpha}_j$ for $j = 1, \ldots, k$.

9.4 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then $\alpha_1 = E_p(X) = p$ and $\hat{\alpha}_1 = n^{-1}\sum_i X_i$, so the method of moments estimator is $\hat{p} = \overline{X}_n$.

9.3 Maximum Likelihood

The most common method for estimating parameters in a parametric model is the maximum likelihood method.

9.6 Definition. The likelihood function is $\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta)$, and the log-likelihood function is $\ell_n(\theta) = \log\mathcal{L}_n(\theta)$. The maximum likelihood estimator (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.

9.11 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. The likelihood function is (ignoring constants)
$$\mathcal{L}_n(\mu, \sigma) = \prod_i \frac{1}{\sigma}\exp\left\{-\frac{1}{2\sigma^2}(X_i - \mu)^2\right\} = \frac{1}{\sigma^n}\exp\left\{-\frac{nS^2}{2\sigma^2}\right\}\exp\left\{-\frac{n(\overline{X} - \mu)^2}{2\sigma^2}\right\}$$
where $\overline{X} = n^{-1}\sum_i X_i$ is the sample mean and $S^2 = n^{-1}\sum_i(X_i - \overline{X})^2$. The last equality follows from the fact that $\sum_i(X_i - \mu)^2 = nS^2 + n(\overline{X} - \mu)^2$, which can be verified by writing $\sum_i(X_i - \mu)^2 = \sum_i(X_i - \overline{X} + \overline{X} - \mu)^2$ and then expanding the square. Solving the equations obtained by differentiating the log-likelihood, we conclude that $\hat{\mu} = \overline{X}$ and $\hat{\sigma} = S$. It can be verified that these are indeed global maxima of the likelihood.

9.12 Example (A Hard Example). Here is an example that many people find confusing. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Recall that
$$f(x; \theta) = \begin{cases} 1/\theta & 0 \leq x \leq \theta \\ 0 & \text{otherwise.} \end{cases}$$
Consider a fixed value of $\theta$. Suppose $\theta < X_i$ for some $i$. Then $f(X_i; \theta) = 0$ and hence $\mathcal{L}_n(\theta) = \prod_i f(X_i; \theta) = 0$. It follows that $\mathcal{L}_n(\theta) = 0$ if any $X_i > \theta$. Therefore, $\mathcal{L}_n(\theta) = 0$ if $\theta < X_{(n)}$ where $X_{(n)} = \max\{X_1, \ldots, X_n\}$. Now consider any $\theta \geq X_{(n)}$. For every $X_i$ we then have that $f(X_i; \theta) = 1/\theta$, so that $\mathcal{L}_n(\theta) = \theta^{-n}$. In conclusion,
$$\mathcal{L}_n(\theta) = \begin{cases} \left(\frac{1}{\theta}\right)^n & \theta \geq X_{(n)} \\ 0 & \theta < X_{(n)}. \end{cases}$$
Now $\mathcal{L}_n(\theta)$ is strictly decreasing over the interval $[X_{(n)}, \infty)$. Hence, $\hat{\theta}_n = X_{(n)}$.

9.4 Properties of Maximum Likelihood Estimators

Under certain regularity conditions on the model, the maximum likelihood estimator $\hat{\theta}_n$ possesses many properties that make it an appealing choice of estimator:
1. The MLE is consistent: $\hat{\theta}_n \xrightarrow{P} \theta_*$, where $\theta_*$ denotes the true value of the parameter.
2. The MLE is equivariant: if $\hat{\theta}_n$ is the MLE of $\theta$, then $g(\hat{\theta}_n)$ is the MLE of $g(\theta)$.
3. The MLE is asymptotically Normal: $(\hat{\theta} - \theta_*)/\widehat{\text{se}} \rightsquigarrow N(0, 1)$.
4. The MLE is asymptotically optimal or efficient: roughly, among all well-behaved estimators, the MLE has the smallest variance, at least for large samples.
5. The MLE is approximately the Bayes estimator. (This point will be explained later.)

In sufficiently complicated problems these properties no longer hold, and the MLE may no longer be a good estimator. For now we focus on the simpler situations where the MLE works well. Unless otherwise stated, we shall tacitly assume that the regularity conditions hold.

9.5 Consistency of Maximum Likelihood Estimators

To proceed, we need a definition. If $f$ and $g$ are PDFs, define the Kullback-Leibler distance between $f$ and $g$ to be
$$D(f, g) = \int f(x)\log\left(\frac{f(x)}{g(x)}\right)dx.$$
The model $\mathfrak{F}$ is identifiable if $\theta \neq \psi$ implies $D(f(\cdot;\theta), f(\cdot;\psi)) > 0$.

9.13 Theorem. Under appropriate regularity conditions, the MLE is consistent: $\hat{\theta}_n \xrightarrow{P} \theta_*$.

9.6 Equivariance of the MLE

9.14 Theorem. Let $\tau = g(\theta)$ be a one-to-one function of $\theta$ and let $\hat{\theta}_n$ be the MLE of $\theta$. Then $\hat{\tau}_n = g(\hat{\theta}_n)$ is the MLE of $\tau$.

Proof. Let $h = g^{-1}$ denote the inverse of $g$. Then $\hat{\theta}_n = h(\hat{\tau}_n)$. For any $\tau$, $\mathcal{L}(\tau) = \prod_i f(X_i; h(\tau)) = \mathcal{L}(\theta)$ where $\theta = h(\tau)$. Hence, for any $\tau$, $\mathcal{L}_n(\tau) = \mathcal{L}_n(\theta) \leq \mathcal{L}_n(\hat{\theta}) = \mathcal{L}_n(\hat{\tau})$. $\blacksquare$

9.15 Example. Let $X_1, \ldots, X_n \sim N(\theta, 1)$. The MLE for $\theta$ is $\hat{\theta}_n = \overline{X}_n$. Let $\tau = e^{\theta}$. Then the MLE for $\tau$ is $\hat{\tau} = e^{\hat{\theta}} = e^{\overline{X}}$.

9.7 Asymptotic Normality

It turns out that $\hat{\theta}_n$ is approximately Normal and we can compute its approximate variance analytically.

9.16 Definition. The score function is
$$s(X; \theta) = \frac{\partial\log f(X; \theta)}{\partial\theta},$$
and the Fisher information is
$$I_n(\theta) = V_{\theta}\left(\sum_{i=1}^{n} s(X_i; \theta)\right) = \sum_{i=1}^{n} V_{\theta}\left(s(X_i; \theta)\right).$$
For $n = 1$ we write $I(\theta) = I_1(\theta)$. It can be shown that $E_{\theta}(s(X; \theta)) = 0$, from which it follows that $I_n(\theta) = nI(\theta)$ and that
$$I(\theta) = -E_{\theta}\left(\frac{\partial^2\log f(X; \theta)}{\partial\theta^2}\right).$$

9.17 Theorem. Under appropriate regularity conditions, $\text{se} \approx \sqrt{1/I_n(\theta)}$ and
$$\frac{\hat{\theta}_n - \theta}{\text{se}} \rightsquigarrow N(0, 1).$$

9.18 Theorem. The same result holds with the estimated standard error $\widehat{\text{se}} = \sqrt{1/I_n(\hat{\theta}_n)}$:
$$\frac{\hat{\theta}_n - \theta}{\widehat{\text{se}}} \rightsquigarrow N(0, 1).$$
Informally, the distribution of the MLE can be approximated by $N(\theta, 1/I_n(\hat{\theta}_n))$.

9.20 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. The MLE is $\hat{p}_n = \overline{X}_n$, and a short calculation shows $I(p) = 1/(p(1-p))$, so $\widehat{\text{se}} = \sqrt{\hat{p}(1-\hat{p})/n}$.

9.21 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$ where $\sigma^2$ is known. The score function is $s(X; \theta) = (X - \theta)/\sigma^2$ and $I(\theta) = 1/\sigma^2$, so $\hat{\theta}_n = \overline{X}_n \sim N(\theta, \sigma^2/n)$; in this case, the Normal approximation is exact.

9.22 Example. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. The MLE is $\hat{\lambda}_n = \overline{X}_n$ and $I(\lambda) = 1/\lambda$, so $\widehat{\text{se}} = \sqrt{\hat{\lambda}_n/n}$. (For models where the likelihood equations have no closed-form solution, the MLE can be found numerically, as in the sketch below.)
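Here is a minimal Python sketch of numerical maximum likelihood, assuming NumPy and SciPy. The Gamma model is chosen because its MLE has no closed form; the true shape 2 and scale 3, the sample size, and the Nelder-Mead optimizer are all illustrative choices rather than anything prescribed by the text.

    import numpy as np
    from scipy import optimize, stats

    # Numerical MLE for a Gamma(alpha, beta) model by minimizing the
    # negative log-likelihood.
    rng = np.random.default_rng(7)
    x = rng.gamma(shape=2.0, scale=3.0, size=200)

    def negloglik(params):
        a, b = params                       # shape alpha, scale beta
        if a <= 0 or b <= 0:
            return np.inf                   # keep the search in the valid region
        return -np.sum(stats.gamma.logpdf(x, a, scale=b))

    res = optimize.minimize(negloglik, x0=[1.0, 1.0], method="Nelder-Mead")
    print(res.x)                            # MLE of (alpha, beta), near (2, 3)

The same pattern (write the negative log-likelihood, hand it to a general-purpose optimizer) works for most parametric models in this chapter.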
9.8 Optimality

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The MLE is $\hat{\theta}_n = \overline{X}_n$ and satisfies $\sqrt{n}(\hat{\theta}_n - \theta) \rightsquigarrow N(0, \sigma^2)$. Another reasonable estimator of $\theta$ is the sample median $\tilde{\theta}_n$, which satisfies $\sqrt{n}(\tilde{\theta}_n - \theta) \rightsquigarrow N(0, \sigma^2\pi/2)$. This means that the median converges to the right value but has a larger variance than the MLE. More generally, for two estimators with $\sqrt{n}(T_n - \theta) \rightsquigarrow N(0, t^2)$ and $\sqrt{n}(U_n - \theta) \rightsquigarrow N(0, u^2)$, we define the asymptotic relative efficiency of $U$ to $T$ by $\text{ARE}(U, T) = t^2/u^2$. In the Normal example, $\text{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = 2/\pi \approx .63$: using the median is effectively like throwing away a fraction of the data.

9.23 Theorem. If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then $\text{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \leq 1$. Thus, the MLE has the smallest (asymptotic) variance, and we say that the MLE is efficient or asymptotically optimal. We will get more insight into optimality when we discuss decision theory in Chapter 12.

9.9 The Delta Method

Now we address the following question: what is the distribution of a function of the MLE? Let $\tau = g(\theta)$ where $g$ is differentiable and $g'(\theta) \neq 0$.

9.24 Theorem (The Delta Method). If $\hat{\tau}_n = g(\hat{\theta}_n)$ is the MLE of $\tau = g(\theta)$, then
$$\frac{\hat{\tau}_n - \tau}{\widehat{\text{se}}(\hat{\tau}_n)} \rightsquigarrow N(0, 1)$$
where $\widehat{\text{se}}(\hat{\tau}_n) = |g'(\hat{\theta}_n)|\,\widehat{\text{se}}(\hat{\theta}_n)$.

9.25 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and let $\psi = g(p) = \log(p/(1-p))$. The MLE of $\psi$ is $\hat{\psi} = \log(\hat{p}/(1-\hat{p}))$. Since $g'(p) = 1/(p(1-p))$ and $\widehat{\text{se}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, the delta method gives
$$\widehat{\text{se}}(\hat{\psi}) = \frac{1}{\sqrt{n\,\hat{p}(1-\hat{p})}}.$$
(A numeric check of this calculation appears later in this chapter.)

9.10 Multiparameter Models

Let $\theta = (\theta_1, \ldots, \theta_k)$ and let $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_k)$ be the MLE. Let $\ell_n = \sum_{i=1}^{n}\log f(X_i; \theta)$ and let $H_{jj'} = \partial^2\ell_n/\partial\theta_j\partial\theta_{j'}$. Define the Fisher information matrix
$$I_n(\theta) = -\begin{pmatrix} E_{\theta}(H_{11}) & \cdots & E_{\theta}(H_{1k}) \\ \vdots & \ddots & \vdots \\ E_{\theta}(H_{k1}) & \cdots & E_{\theta}(H_{kk}) \end{pmatrix},$$
and let $J_n(\theta) = I_n^{-1}(\theta)$ denote its inverse.

9.27 Theorem. Under appropriate regularity conditions, $(\hat{\theta} - \theta) \approx N(0, J_n)$. If $\hat{\theta}_j$ is the $j$th component of $\hat{\theta}$, then $(\hat{\theta}_j - \theta_j)/\widehat{\text{se}}_j \rightsquigarrow N(0, 1)$, where $\widehat{\text{se}}_j^2$ is the $j$th diagonal element of $J_n$.

There is also a multiparameter delta method. Let $\tau = g(\theta_1, \ldots, \theta_k)$ and let
$$\nabla g = \left(\frac{\partial g}{\partial\theta_1}, \ldots, \frac{\partial g}{\partial\theta_k}\right)^T.$$

9.28 Theorem (Multiparameter delta method). Suppose that $\nabla g$ evaluated at $\hat{\theta}$ is not zero. Let $\hat{\tau} = g(\hat{\theta})$. Then
$$\frac{\hat{\tau} - \tau}{\widehat{\text{se}}(\hat{\tau})} \rightsquigarrow N(0, 1)$$
where
$$\widehat{\text{se}}(\hat{\tau}) = \sqrt{\left(\hat{\nabla} g\right)^T\hat{J}_n\left(\hat{\nabla} g\right)},$$
$\hat{J}_n = J_n(\hat{\theta})$, and $\hat{\nabla} g$ is $\nabla g$ evaluated at $\theta = \hat{\theta}$.

9.29 Example. Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ and let $\tau = g(\mu, \sigma) = \sigma/\mu$. The multiparameter delta method, with the two-parameter Fisher information matrix, yields the estimated standard error of $\hat{\tau} = \hat{\sigma}/\hat{\mu}$.

9.11 The Parametric Bootstrap

For parametric models, standard errors and confidence intervals may also be estimated using the bootstrap. The only difference from the nonparametric bootstrap is that we sample $X_1^*, \ldots, X_n^*$ from $f(x; \hat{\theta}_n)$ instead of from $\hat{F}_n$.

9.30 Example. The parametric bootstrap standard error can be compared with the delta-method standard error; typically they are close.

9.13 Appendix

9.13.1 Proofs. The proof of Theorem 9.13 (consistency) compares the log-likelihood at $\theta_*$ with the log-likelihood at other values, using the law of large numbers and the fact that the Kullback-Leibler distance $D(f(\cdot;\theta_*), f(\cdot;\theta))$ is minimized (at 0) when $\theta = \theta_*$.

OUTLINE OF PROOF OF THEOREM 9.18. Let $\ell(\theta) = \log\mathcal{L}(\theta)$. A Taylor expansion of $\ell'(\hat{\theta}) = 0$ around $\theta_*$ gives
$$0 = \ell'(\hat{\theta}) \approx \ell'(\theta_*) + (\hat{\theta} - \theta_*)\ell''(\theta_*),$$
so
$$\sqrt{n}(\hat{\theta} - \theta_*) \approx \frac{n^{-1/2}\ell'(\theta_*)}{-n^{-1}\ell''(\theta_*)}.$$
The numerator converges in distribution to $N(0, I(\theta_*))$ by the central limit theorem (the score has mean 0 and variance $I(\theta_*)$); the denominator converges in probability to $I(\theta_*)$ by the law of large numbers. By Slutsky's theorem (Theorem 5.5),
$$\sqrt{n}(\hat{\theta} - \theta_*) \rightsquigarrow N\left(0, \frac{1}{I(\theta_*)}\right). \qquad \blacksquare$$

9.13.2 Sufficiency

A statistic is a function $T(X^n)$ of the data $X^n = (X_1, \ldots, X_n)$. Roughly speaking, a sufficient statistic is a statistic that contains all the information in the data.

9.32 Definition. $T$ is sufficient for $\theta$ if the conditional distribution of $X^n$ given $T$ does not depend on $\theta$.

9.33 Example. Let $X_1, X_2 \sim \text{Bernoulli}(\theta)$. To see whether a statistic is sufficient, you need the conditional distribution of $(X_1, X_2)$ given the statistic. For $T = X_1 + X_2$, one checks directly that none of the conditional probabilities, such as $P(X_1 = 1, X_2 = 0 \mid T = 1) = 1/2$, depend on $\theta$. Thus $T$ is sufficient. By contrast, the first observation alone, $V = X_1$, is not sufficient.

9.35 Definition. A sufficient statistic $T$ is minimal sufficient if it is a function of every other sufficient statistic.

9.36 Example. For $N(\mu, \sigma^2)$, $T = (\overline{X}, S)$ is minimal sufficient. For the Bernoulli model, $T = \sum_i X_i$ is minimal sufficient. For the Poisson model, $T = \sum_i X_i$ is minimal sufficient. Note that the whole data set is always sufficient, but not minimal.

9.40 Theorem (Factorization Theorem). $T$ is sufficient for $\theta$ if and only if there are functions $g(t, \theta)$ and $h(x)$ such that $f(x^n; \theta) = g(T(x^n), \theta)h(x^n)$.

9.41 Example. Return to the two coin flips. Let $T = X_1 + X_2$. Then
$$f(x_1, x_2; \theta) = \theta^{x_1 + x_2}(1-\theta)^{2 - x_1 - x_2} = g(T, \theta)h(x_1, x_2)$$
with $g(t, \theta) = \theta^t(1-\theta)^{2-t}$ and $h = 1$. Therefore, $T = X_1 + X_2$ is sufficient.

Now we discuss an implication of sufficiency in point estimation. The Rao-Blackwell theorem says that an estimator should depend only on the sufficient statistic, otherwise it can be improved.

9.42 Theorem (Rao-Blackwell). Let $\hat{\theta}$ be an estimator and let $T$ be a sufficient statistic. Define a new estimator by $\tilde{\theta} = E(\hat{\theta} \mid T)$. Then, for every $\theta$, $\text{MSE}(\tilde{\theta}) \leq \text{MSE}(\hat{\theta})$.

9.43 Example. Consider flipping a coin twice and define $\hat{\theta} = X_1$. This is a well-defined (unbiased) estimator, but it is not a function of the sufficient statistic $T = X_1 + X_2$. Note that $E(X_1 \mid T) = (X_1 + X_2)/2$; by the Rao-Blackwell theorem, $(X_1 + X_2)/2$ has MSE at least as small as that of $\hat{\theta} = X_1$.
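Returning to the delta-method calculation of Example 9.25, here is a minimal Python check; the data are simulated with an arbitrary true $p = .3$.

    import numpy as np

    # Delta-method se for psi = log(p/(1-p)):
    # se(psihat) = |g'(phat)| * se(phat) = 1 / sqrt(n * phat * (1 - phat)).
    rng = np.random.default_rng(8)
    n = 200
    x = rng.binomial(1, 0.3, size=n)
    phat = x.mean()
    psihat = np.log(phat / (1 - phat))
    se_psi = 1 / np.sqrt(n * phat * (1 - phat))
    print(psihat, se_psi, (psihat - 2 * se_psi, psihat + 2 * se_psi))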
9.13.3 Exponential Families

Most of the parametric models we have studied so far are special cases of a general class of models called exponential families. We say that $\{f(x; \theta) : \theta \in \Theta\}$ is a one-parameter exponential family if there are functions $\eta(\theta)$, $B(\theta)$, $T(x)$, and $h(x)$ such that
$$f(x; \theta) = h(x)\,e^{\eta(\theta)T(x) - B(\theta)}.$$
It can be shown that $T(X)$ is sufficient; we call $T$ the natural sufficient statistic.

9.45 Example. Let $X \sim \text{Binomial}(n, \theta)$. Then
$$f(x; \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x} = \binom{n}{x}\exp\left\{x\log\left(\frac{\theta}{1-\theta}\right) + n\log(1-\theta)\right\},$$
an exponential family with $\eta(\theta) = \log(\theta/(1-\theta))$, $B(\theta) = -n\log(1-\theta)$, $T(x) = x$, and $h(x) = \binom{n}{x}$. (The Poisson is handled similarly.)

9.46 Example. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Then $f(x; \theta) = \theta^{-1}I(x \leq \theta)$, where $I(\cdot)$ is 1 if the term inside the parentheses is true and 0 otherwise. We cannot write this in exponential family form, so the Uniform does not belong to an exponential family.

With $n$ IID samples from a one-parameter exponential family, $\sum_{i=1}^{n} T(X_i)$ is sufficient. More generally, $\{f(x; \theta)\}$ with $\theta = (\theta_1, \ldots, \theta_k)$ is a $k$-parameter exponential family if
$$f(x; \theta) = h(x)\exp\left\{\sum_{j=1}^{k}\eta_j(\theta)T_j(x) - B(\theta)\right\},$$
and then $T = (T_1(X), \ldots, T_k(X))$ is sufficient.

9.48 Example. Consider the Normal family with $\theta = (\mu, \sigma)$. Then
$$f(x; \theta) = \exp\left\{\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 - \left(\frac{\mu^2}{2\sigma^2} + \log(\sqrt{2\pi}\sigma)\right)\right\},$$
a two-parameter exponential family with natural sufficient statistic $T = (x, x^2)$. Hence, with $n$ samples, $\left(\sum_i X_i, \sum_i X_i^2\right)$ is sufficient.

9.13.4 Computing Maximum Likelihood Estimates

Sometimes we can find the MLE in closed form. Often this is not possible, and we find the MLE by numerical methods. We briefly describe two commonly used iterative methods: (i) Newton-Raphson and (ii) the EM algorithm.

NEWTON-RAPHSON. Let $\theta^j$ denote the estimate at step $j$. A Taylor expansion of the log-likelihood $\ell(\theta)$ around $\theta^j$ suggests the update
$$\theta^{j+1} = \theta^j - H^{-1}\ell'(\theta^j),$$
where $\ell'(\theta^j)$ is the vector of first derivatives and $H$ is the matrix of second derivatives of the log-likelihood, both evaluated at $\theta^j$.

THE EM ALGORITHM. The letters EM stand for Expectation-Maximization. The algorithm is useful when the likelihood of the model of interest, $f(y; \theta)$, is hard to maximize, but we can find another random variable $Z$ such that $f(y; \theta) = \int f(y, z; \theta)\,dz$ and such that the likelihood based on $f(y, z; \theta)$ is easy to maximize. In other words, the model of interest is the marginal of a model with a simpler likelihood. In this case, we call $Y$ the observed data and $Z$ the hidden (or latent or missing) data. If we could just "fill in" the missing data, we would have an easy problem. Conceptually, the EM algorithm works by filling in the missing data, maximizing the log-likelihood, and iterating. It can be shown that the EM algorithm always increases the likelihood: $\mathcal{L}(\theta^{j+1}) \geq \mathcal{L}(\theta^j)$.

9.49 Example (Mixture of Normals). Sometimes it is reasonable to assume that the distribution of the data is a mixture of two Normals. Think of heights of people, which is a mixture of men's heights and women's heights. Let $\phi(y; \mu, \sigma)$ denote a Normal density with mean $\mu$ and standard deviation $\sigma$. The density of a mixture of two Normals is
$$f(y; \theta) = p\,\phi(y; \mu_0, \sigma_0) + (1 - p)\,\phi(y; \mu_1, \sigma_1).$$
The idea is that an observation is drawn from the first Normal with probability $p$ and from the second with probability $1 - p$; however, we don't know which Normal it was drawn from. The parameters are $\theta = (p, \mu_0, \sigma_0, \mu_1, \sigma_1)$. The log-likelihood for the observed data involves a sum inside a logarithm and is difficult to maximize directly.

9.50 Example (Continuation of Example 9.49). Consider again the mixture of two Normals, and for simplicity take $p = 1/2$ and $\sigma_0 = \sigma_1 = 1$, so the density is $f(y; \mu_0, \mu_1) = \frac{1}{2}\phi(y - \mu_0) + \frac{1}{2}\phi(y - \mu_1)$. Each observation is of the form $Y_i = (1 - Z_i)Y_{0i} + Z_iY_{1i}$, where $Z_i \in \{0, 1\}$ indicates the component: if $Z_i = 0$ then $Y_i$ is from $N(\mu_0, 1)$, and if $Z_i = 1$ then $Y_i$ is from $N(\mu_1, 1)$. The complete-data likelihood is easy to work with:
$$f(y, z; \theta) = \prod_i \phi(y_i - \mu_0)^{1 - z_i}\phi(y_i - \mu_1)^{z_i}.$$
The EM algorithm alternates between the E step, computing $\tau_i = E(Z_i \mid Y_i, \theta)$, the posterior probability that observation $i$ came from the second component, and the M step, which maximizes the expected complete log-likelihood and amounts to weighted means:
$$\mu_0 = \frac{\sum_i(1 - \tau_i)Y_i}{\sum_i(1 - \tau_i)}, \qquad \mu_1 = \frac{\sum_i\tau_iY_i}{\sum_i\tau_i}.$$
(A minimal implementation of these two steps appears below.)
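Here is a minimal Python sketch of the EM iteration for the simplified mixture of Example 9.50 ($p = 1/2$, unit variances). The true component means 0 and 4, the starting values, and the iteration count are illustrative assumptions; with other starting points the labels of the two components may swap.

    import numpy as np
    from scipy import stats

    # EM for f(y) = 0.5*phi(y - mu0) + 0.5*phi(y - mu1).
    rng = np.random.default_rng(9)
    z = rng.integers(0, 2, size=500)
    y = rng.normal(loc=np.where(z == 1, 4.0, 0.0))   # simulated observed data

    mu0, mu1 = -1.0, 1.0                             # crude starting values
    for _ in range(100):
        # E step: posterior probability each point came from component 1
        d0, d1 = stats.norm.pdf(y - mu0), stats.norm.pdf(y - mu1)
        tau = d1 / (d0 + d1)
        # M step: weighted means maximize the expected complete log-likelihood
        mu0 = np.sum((1 - tau) * y) / np.sum(1 - tau)
        mu1 = np.sum(tau * y) / np.sum(tau)
    print(mu0, mu1)                                  # near (0, 4)

Each sweep performs one E step and one M step, so by the monotonicity property quoted above the likelihood never decreases from one iteration to the next.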
9.14 Exercises (selected)

1. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Show that the MLE is consistent. Hint: Let $Y = X_{(n)}$. For any $c$, $P(Y < c) = P(X_1 < c, X_2 < c, \ldots, X_n < c)$.
2. For the models in this chapter, find the method of moments estimator, the MLE, and the Fisher information, and compare the delta-method and parametric bootstrap standard errors.

10 Hypothesis Testing and p-values

Suppose we partition the parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and wish to test $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \in \Theta_1$. We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.

10.1 The Wald Test

10.3 Definition (The Wald Test). Consider testing $H_0 : \theta = \theta_0$ versus $H_1 : \theta \neq \theta_0$. Assume that $\hat{\theta}$ is asymptotically Normal:
$$\frac{\hat{\theta} - \theta_0}{\widehat{\text{se}}} \rightsquigarrow N(0, 1).$$
The size $\alpha$ Wald test is: reject $H_0$ when $|W| > z_{\alpha/2}$, where
$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\text{se}}}.$$

10.4 Theorem. Asymptotically, the Wald test has size $\alpha$: under $H_0$, $W \rightsquigarrow N(0, 1)$, so the probability of falsely rejecting is
$$P_{\theta_0}(|W| > z_{\alpha/2}) \to P(|Z| > z_{\alpha/2}) = \alpha.$$

10.6 Theorem. Suppose the true value of $\theta$ is $\theta_* \neq \theta_0$. The power of the Wald test, that is, the probability of correctly rejecting the null, is approximately
$$1 - \Phi\left(\frac{\theta_0 - \theta_*}{\widehat{\text{se}}} + z_{\alpha/2}\right) + \Phi\left(\frac{\theta_0 - \theta_*}{\widehat{\text{se}}} - z_{\alpha/2}\right).$$

10.8 Example (Comparing Two Prediction Algorithms). We test a prediction algorithm on a test set of size $n$ and a second algorithm on a second test set of the same size. Let $X$ be the number of incorrect predictions for algorithm 1 and $Y$ the number for algorithm 2, so $X \sim \text{Binomial}(n, p_1)$ and $Y \sim \text{Binomial}(n, p_2)$. To test the null hypothesis that $p_1 = p_2$, write $\delta = p_1 - p_2$ and test $H_0 : \delta = 0$ versus $H_1 : \delta \neq 0$. The MLE is $\hat{\delta} = \hat{p}_1 - \hat{p}_2$ with estimated standard error
$$\widehat{\text{se}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n} + \frac{\hat{p}_2(1-\hat{p}_2)}{n}}.$$
When both algorithms are applied to the same test cases, a comparison based on the paired differences is appropriate instead; this is called a paired comparison.

10.9 Example (Comparing Two Medians). The same setup works for medians: the Wald statistic uses the plug-in estimates of the two medians and a bootstrap standard error.

There is a relationship between the Wald test and the $1 - \alpha$ asymptotic confidence interval $\hat{\theta} \pm \widehat{\text{se}}\,z_{\alpha/2}$.

10.10 Theorem. The size $\alpha$ Wald test rejects $H_0 : \theta = \theta_0$ if and only if $\theta_0 \notin C$ where $C = (\hat{\theta} - \widehat{\text{se}}\,z_{\alpha/2},\ \hat{\theta} + \widehat{\text{se}}\,z_{\alpha/2})$. Thus, any confidence interval that excludes $\theta_0$ corresponds to rejecting $H_0$.

10.2 p-values

10.11 Definition. The p-value is the smallest level $\alpha$ at which the test rejects. Informally, the p-value is the probability, under $H_0$, of observing a value of the test statistic the same as or more extreme than what was actually observed.

10.13 Theorem. Let $w = (\hat{\theta} - \theta_0)/\widehat{\text{se}}$ denote the observed value of the Wald statistic $W$. The p-value is given by
$$\text{p-value} = P_{\theta_0}(|W| > |w|) \approx P(|Z| > |w|) = 2\Phi(-|w|),$$
where $Z \sim N(0, 1)$.

10.14 Theorem. If the test statistic has a continuous distribution, then under $H_0 : \theta = \theta_0$, the p-value has a Uniform$(0, 1)$ distribution.

In other words, if $H_0$ is true, the p-value is like a random draw from a Uniform$(0, 1)$ distribution. If $H_1$ is true, the distribution of the p-value will tend to concentrate closer to 0.

10.15 Example. Recall the cholesterol data from Example 7.15. To test whether the two means are different, we compute the Wald statistic and find a very small p-value: strong evidence of a difference in means.

10.3 The $\chi^2$ Distribution

Let $Z_1, \ldots, Z_k$ be independent standard Normals. We say that $V = \sum_{i=1}^{k} Z_i^2$ has a $\chi^2$ distribution with $k$ degrees of freedom, written $V \sim \chi^2_k$.

10.4 Pearson's $\chi^2$ Test for Multinomial Data

Pearson's $\chi^2$ test is used for multinomial data: if $X = (X_1, \ldots, X_k) \sim \text{Multinomial}(n, p)$, the test statistic compares observed counts to expected counts $E_j = np_{0j}$ under the null and has a limiting $\chi^2_{k-1}$ distribution.

10.23 Example (Mendel's Peas). Mendel bred peas with round yellow seeds and wrinkled green seeds and counted the four seed types among the offspring. The $\chi^2$ test applied to the observed counts yields a large p-value, so the data do not contradict Mendel's theory.

10.5 The Permutation Test

The permutation test is a nonparametric method for testing whether two distributions are the same. It is exact, meaning it is not based on large sample approximations.

10.17 Definition. Suppose that $X_1, \ldots, X_m \sim F_X$ and $Y_1, \ldots, Y_n \sim F_Y$ are two independent samples and $H_0$ is the hypothesis that the two samples are identically distributed: $H_0 : F_X = F_Y$ versus $H_1 : F_X \neq F_Y$. Let $T(X_1, \ldots, X_m, Y_1, \ldots, Y_n)$ be some test statistic, for example $T = |\overline{X}_m - \overline{Y}_n|$. Let $N = m + n$ and consider forming all $N!$ permutations of the data $X_1, \ldots, X_m, Y_1, \ldots, Y_n$. For each permutation, compute the test statistic $T$; denote these values by $T_1, \ldots, T_{N!}$. Under the null hypothesis, each of these values is equally likely. The distribution that puts mass $1/N!$ on each $T_j$ is called the permutation distribution of $T$. Assuming we reject when $T$ is large, the p-value is
$$\text{p-value} = P(T^* > T) = \frac{1}{N!}\sum_{j=1}^{N!} I(T_j > T).$$

10.19 Example. Here is a toy example to make the idea clear. Suppose the data are $(X_1, X_2, Y_1) = (1, 9, 3)$. Let $T(X_1, X_2, Y_1) = |\overline{X} - \overline{Y}| = 2$. The $3! = 6$ permutations of the data yield six permutation values of $T$, and the p-value is $P(T^* > 2) = 4/6$.

Usually it is not practical to evaluate all $N!$ permutations. We can approximate the p-value by sampling random permutations:

    Algorithm for the Permutation Test
    1. Compute the observed value of the test statistic
       t_obs = T(X_1, ..., X_m, Y_1, ..., Y_n).
    2. Randomly permute the data. Compute the statistic again
       using the permuted data.
    3. Repeat the previous step B times and let T_1, ..., T_B
       denote the resulting values.
    4. The approximate p-value is (1/B) * sum_j I(T_j > t_obs).

(A runnable sketch of this algorithm appears below.)

10.20 Example. DNA microarrays measure the expression levels of thousands of genes. The permutation test can be used to compare expression levels between two conditions, with results similar to a t-test but without Normality assumptions.
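Here is a runnable Python sketch of the permutation-test algorithm with $T = |\overline{X} - \overline{Y}|$, as in Definition 10.17. The simulated data and the number of random permutations are illustrative choices.

    import numpy as np

    # Approximate permutation p-value by random relabelings.
    rng = np.random.default_rng(10)
    x = rng.normal(0.5, 1.0, size=20)
    y = rng.normal(0.0, 1.0, size=20)
    obs = abs(x.mean() - y.mean())
    pooled, nx = np.concatenate([x, y]), len(x)
    B = 10_000
    count = 0
    for _ in range(B):
        perm = rng.permutation(pooled)
        if abs(perm[:nx].mean() - perm[nx:].mean()) >= obs:
            count += 1
    print((count + 1) / (B + 1))    # approximate p-value (with +1 smoothing)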
10.6 The Likelihood Ratio Test

The Wald test is useful for testing a scalar parameter. The likelihood ratio test is more general and can be used for testing a vector-valued parameter.

10.21 Definition. Consider testing $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$. The likelihood ratio statistic is
$$\lambda = 2\log\left(\frac{\sup_{\theta\in\Theta}\mathcal{L}(\theta)}{\sup_{\theta\in\Theta_0}\mathcal{L}(\theta)}\right) = 2\log\left(\frac{\mathcal{L}(\hat{\theta})}{\mathcal{L}(\hat{\theta}_0)}\right),$$
where $\hat{\theta}$ is the MLE and $\hat{\theta}_0$ is the MLE when $\theta$ is restricted to lie in $\Theta_0$.

10.22 Theorem. Suppose that $\theta = (\theta_1, \ldots, \theta_q, \theta_{q+1}, \ldots, \theta_r)$ and let $\Theta_0 = \{\theta : (\theta_{q+1}, \ldots, \theta_r) = (\theta_{0,q+1}, \ldots, \theta_{0,r})\}$. Under $H_0 : \theta \in \Theta_0$,
$$\lambda \rightsquigarrow \chi^2_{r-q},$$
where $r - q$ is the dimension of $\Theta$ minus the dimension of $\Theta_0$; the p-value is $P(\chi^2_{r-q} > \lambda)$. For example, if the full model has four free parameters but the null constrains them so that only one remains free, the difference of dimensions is three and the limiting distribution is $\chi^2_3$. When both the likelihood ratio test and the Wald test are applicable, as in testing a scalar parameter, they usually lead to similar results as long as the sample size is large.

10.7 Multiple Testing

In some situations, we may conduct many hypothesis tests. If each test is conducted at level $\alpha$, the chance of at least one false rejection grows with the number of tests $m$.

10.25 Theorem (Bonferroni Method). Given p-values $P_1, \ldots, P_m$, reject null hypothesis $H_{0i}$ if $P_i < \alpha/m$. Then the probability of falsely rejecting any null hypothesis is at most $\alpha$: letting $R_i$ be the event that the $i$th null is falsely rejected,
$$P\left(\bigcup_i R_i\right) \leq \sum_i P(R_i) \leq \sum_i\frac{\alpha}{m} = \alpha.$$

10.26 Theorem (Benjamini and Hochberg). Rather than controlling the probability of any false rejection, the Benjamini-Hochberg (BH) method controls the false discovery rate (FDR), the expected proportion of rejections that are false. The BH procedure orders the p-values and rejects all hypotheses with p-value below a data-dependent threshold; if the p-values are independent, the FDR of this procedure is at most $\alpha$.

11 Bayesian Inference

11.1 The Bayesian Philosophy

So far we have treated parameters as fixed, unknown constants (the frequentist approach). In the Bayesian approach, probability describes degree of belief, and we can make probability statements about parameters. Inference proceeds as follows: choose a prior distribution $f(\theta)$ expressing beliefs about $\theta$ before seeing the data; choose a model $f(x \mid \theta)$; after observing $X_1, \ldots, X_n$, update beliefs and compute the posterior distribution
$$f(\theta \mid x^n) \propto \mathcal{L}_n(\theta)f(\theta).$$

11.1 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ with a Beta prior on $p$; the posterior is again a Beta distribution.

11.2 Example (Simulation). Posterior inferences about functions of the parameters can be carried out by simulation: draw $\theta^1, \ldots, \theta^B$ from the posterior and apply the function to each draw.

11.6 Flat Priors, Improper Priors, and "Noninformative" Priors

An important question in Bayesian inference is: where does one get the prior? Subjective priors are reasonable but impractical in complicated problems, and injecting subjective opinion may be at odds with the goal of making scientific inference as objective as possible. An obvious candidate for a noninformative prior is a flat prior, $f(\theta) \propto$ constant.

11.6 Example. Consider the Bernoulli$(p)$ model with the flat prior $f(p) = 1$. The posterior is $f(p \mid x^n) \propto p^s(1-p)^{n-s}$ where $s = \sum_i x_i$, a Beta$(s + 1, n - s + 1)$ density, which seems very reasonable. But unfettered use of flat priors raises some questions.

IMPROPER PRIORS. Let $X \sim N(\theta, \sigma^2)$ with $\sigma$ known, and adopt the flat prior $f(\theta) \propto c$ with $c > 0$. Then $\int f(\theta)\,d\theta = \infty$, so this is not a probability density in the usual sense. We call such a prior an improper prior. Nonetheless, we can still formally compute the posterior density by multiplying the prior and the likelihood: $f(\theta \mid x^n) \propto \mathcal{L}_n(\theta)$. This gives $\theta \mid X^n \sim N(\overline{X}, \sigma^2/n)$, and the resulting point and interval estimates agree exactly with their frequentist counterparts. In general, improper priors are not a problem as long as the resulting posterior is a well-defined probability distribution.

FLAT PRIORS ARE NOT INVARIANT. Let $X \sim \text{Bernoulli}(p)$ and suppose we use the flat prior $f(p) = 1$ to express ignorance about $p$ before the experiment. Now let $\psi = \log(p/(1-p))$. This is a transformation of $p$, and we can compute the resulting distribution for $\psi$; it is not flat. But if we are ignorant about $p$, we should also be ignorant about $\psi$, in which case we should use a flat prior for $\psi$. This contradiction shows that the notion of a flat prior is not well defined: a flat prior on one parameterization does not correspond to a flat prior on another parameterization.

JEFFREYS' PRIOR. Jeffreys came up with a rule for creating priors that is transformation invariant: take $f(\theta) \propto I(\theta)^{1/2}$, where $I(\theta)$ is the Fisher information function. In the Bernoulli$(p)$ model, this yields
$$f(p) \propto \sqrt{\frac{1}{p(1-p)}} = p^{-1/2}(1-p)^{-1/2},$$
which is a Beta$(1/2, 1/2)$ density and is very close to a uniform density. In multiparameter problems, the Jeffreys prior is defined to be $f(\theta) \propto \sqrt{\det I(\theta)}$, where $\det(A)$ denotes the determinant of a matrix $A$. Jeffreys' prior might be a reasonable noninformative prior, but we will not go into detail here.

11.7 Multiparameter Problems

In principle, the multiparameter case is the same: compute the posterior density by multiplying the prior and the likelihood. Inference for a single parameter of interest proceeds by integrating out the nuisance parameters; simulation can help if we cannot do the integral. Draw $\theta^1, \ldots, \theta^B$ from the posterior (the superscripts index the different draws; each $\theta^j$ is a vector), then collect together the relevant component of each draw.

11.8 Example (Comparing Two Binomials). Suppose we have $n_1$ control patients and $n_2$ treatment patients, and that $X_1$ control patients survive while $X_2$ treatment patients survive. We want to estimate $\tau = g(p_1, p_2) = p_2 - p_1$. With independent flat priors $f(p_1, p_2) = 1$, the posterior factors into independent Beta posteriors:
$$p_1 \mid \text{data} \sim \text{Beta}(X_1 + 1, n_1 - X_1 + 1), \qquad p_2 \mid \text{data} \sim \text{Beta}(X_2 + 1, n_2 - X_2 + 1).$$
The posterior of $\tau$ can then be obtained by simulation, as sketched below.
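Here is a minimal Python sketch of the posterior simulation for Example 11.8; the counts below are placeholders, not the data from the example.

    import numpy as np

    # Posterior of tau = p2 - p1 under independent flat priors:
    # p_i | data ~ Beta(X_i + 1, n_i - X_i + 1).
    rng = np.random.default_rng(11)
    n1, s1 = 50, 40          # control: trials, survivors (illustrative)
    n2, s2 = 50, 30          # treatment: trials, survivors (illustrative)
    p1 = rng.beta(s1 + 1, n1 - s1 + 1, size=100_000)
    p2 = rng.beta(s2 + 1, n2 - s2 + 1, size=100_000)
    tau = p2 - p1
    print(tau.mean(), np.percentile(tau, [2.5, 97.5]))  # posterior mean, interval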
Unde squared err ons X flows: The pont t tly positive p Tow are mininasty and iis nk? general re maybe 12.21 Theorem. Suppose tat he ‘ n re enn prone a restricted yerson of Theorem 12.1 f a 2.22 Theorem, Let Xiy..0X yt a 12,23 Theorem, 1 2.8 Bibliographic Remarks Part TI Statistical Models and Methods } 7 } 13 Linear and Logistic Regression dual sums of squar i 13.3 Definition. ‘The least squares estimates « {Ba Theorem, 7 TK XY : wy Pex t \ ye 183) ne ion model We wil ha Jer the sar dt ro Example 13.2. The st ' a Wep ates an 8 and 0.14ifi, The fitted! tine 18 40.166 " F he TB. Definition. The Sipe near Regresion Mode 13.6 Example (The 2001 President Elton). Fis 13. ‘ ¥ x ites (omitting Palm Beach Coun ud x I om tl Sc cl Herne ae j ty ) : bd . ) : shake. 1 ; 3 ly 13.40) | 0 vera: 8) £0 t 13.3 Properties of the Least Squares Estimators ; V " ction dat aoe bv egreion piens we nstlly foes on the 1.10 Gxample, For 1318 Theorem. 1 note te ea — ; ue = ( Prediction enat ¥ 0 at v ( hae estimated egesion we 4 We aierve th a EL Nal hj and we want o prec owt The etinat standard sly are obtain by tk \ sos i this equation. Th 13.9 Theorem. Cn TBI Theorem (Pedicton neva U ‘ XX | 1) vst Sa \ 12.12 Example (Eketion Data Rested) On the og seal, on i . el In Palu Beach, ha v4 4 Bucluanan baad 3, wes C \ x Xue} 5 ' \ : [ { \ (6-200.6.57) whi len 151 Inder, 8.151 ‘ \ \e) Multiple Regressior Uh for ofthe kt 8 min , ee th The dota a 1313 Theorem, : x “ r i mn "i 2 (a) 15.15 Theorem 1 he th fats) =¥ The prediction A relsted method for estinuting rik is AIC (Akaike Information ( x Bayeian interpretation. Let = (8% eno : Town that the pasterior probability fora mae! i approxima 1 Igliklihood of the model evaluabed atthe SLE. #7 cna iitng Mall's Cys F ing the aoe with highest BIC is He noel with ng AIC inequivalent to a falow's Cy: soe Exercise i Feeterorpraability, Te BIC sore so Is formation theoet ‘ fe poblen of nual sear, If there are & conan ee Ri the preition for ¥, ob del with ¥ rere ne 2 po Is. We nel to search thong alt \ Revi) = 9 (Ta 1 asi : net ofl the mode Thus, oe need . sod se except that we he ' ; enna phason nov with the best score. Another popu method i toe mols th 1 model to predic the models, However th ma sted fr eah of the k groups and th im : 13.16 Example. Wi pple horas tp nso to thin AIC. The following was obtained fom the p This prorat liner falows Cand : ned wen I diferent a ir dint th mms we wil ists See es emible AIC. This is the sane is uniniuiing Mallows ts ful sch all avaiates) has AIC= 310.7. In sen \ nnethod is BIC (Ba aration criterion). Hi il ne (whe in 0 0 ser, the Al for deleting ane variable ae a f 14 {ultivariate Models MAL Random Vectors , (-) i . idol, | . : 2 Estim: the Correlation He | } y x 1 py). To compute Cov(.X.X, pr i N " \ i o ve hand, \ } , Cou Xi X 14.6 Theorem. As x 14.5 Theorem. 1 i 5 Bibliographic Remarks (: | [ | rn) Ti ml tin cone ee 6 Appendix of Then 138 he i rank von 1 hen we smi we have to eo we nat efice 1 : Soret ns " Im yx x Ain) =O (Sp -1). os z a ‘ Yue — ots te , x 1 we st Sumy xt) + 0 the Fie aor thee ‘ the : » About Indeper TICE hie ¥ Z uy Pa Se Xyk x,=EXy. X= EN x= la T r This sn cone tne thronghont th 7 Ipok. Ds uh nee popal inleperntene is Penrson ute z 15.4 Theorem, 1 T veyy Wek 5:1 Definition. Tis orks ratio HU 1 15.2 Theorem. 1 7 ‘ 15.5 Example. r : hy V Mh: ¥ 8 Z ¥ o L ' mane Two Discrete Variable 15.6 Theorem. Y 1) and 2 € oh 15.7 Remark " ' 5.9 Theorem. 7 c 1 SOY Ka tox (22% 1 5.8 Example. 
: : 5 exe v=} 1 Oo 1 by their respon wr Partial Respon 26a 6u i tat ent nd histo inferences abot th not conclude that Variable ar One 15.11 Theorem. Wie ¥ 12 Theorem. 1 5 Appendi yu 4 : z 7 \ety/ i tat emnestn/DASL/Datafiles/USTemperatun me Tint whether oop in Moo 15.13 Theorem. DIEVEU (Use the date 5.6 Exercises Tore u vn 1 sty in Mayan: Death $ Seutenc ( 6 usal Inference ly speaking, the statement “X cam : tion, The fist unterfactual random variab h directed acyclic graphs ssl (X =0) an Cy isthe outcome if the su treated (X oval relative risk 16.1) i called the consistency relationship. 7 aot ty make the iden ee : 6.1 Theorem N sion r ‘yp ‘causal effect or average treatment effet t ih measuring the 4, For exanpe, if a fine the causal ject doe y 1.0), Healthy people tend : 1 don't Te 8 this tween (CoC sux x nt creat ation between X aud ¥. 1 had eat aud ¥ leone that X and ¥ ed. : Th Xi and 1x espe th snd conchude that vitamin C prevent il # enconrage everyone to ta in C. Mf rot people comply with Proor. Since X is randomly assigned, X 8 indepenven Hen 1 00 Oo YIN =1)-BIYIX <0) wince ¥ = 1 1 ™ , . net th server wh , uple, @= 0 and a= 1. tis not hard to create exampl ple aver 16.3 Theorem K=0)>0 1)>0. . 3.2 Beyond Binary Treatments | 3 Observational Studies and Confounding, if 1 independent NH : pt : ee = ; have if 1 Tw ob a p 16.4 Theorem. / | IF \ 16.5 Example q r m Ranken e * 17 Directed Graphs and Conditional Independence 17.2. Condition: x 17d Definition. Int X. Men. X and ¥ conditionally indepentlent given Z, XUy |Z, 17.2 Theorem, 73 DAG X aul Y are adjacent. cut of child X}. A directed path :— ¥.2) in Figure 1 i treet pat collider slider ” — an undirected path, es pining into th are nt adjacent tunshielded, directd pth that starts a alles eycle. A directed graph is aeyelie if ha tht directed acyclic graph o Probability and DAGs 17.3 Definition. Markov ta 17.4 Example. § 17.5 Example. DAG. Fru represents Condition 17.6 Theorem. 7.7 Example. th Figure 17 More Independenc Rel fe Marko Markos c\ | SNS | /\ \/ Indepenen 17.5 More eerste Relat ye x i x aio atch onside the DAG in Figures 17.6 am 1 17.11 Example, ‘The fet that editing om a collider ex lide, X and Z are decomneetod, bt they are this idea v1 appears to be late fora museting 2. IX and Z collide at Y, then X'ane Z are desoparated, but they we J deconnected give ¥ Independent. This seen Jnce before we know anvthing about . be descendant of «collider as te same effet nt friend Dein a mpet these vats tbe independent ditioning on the colder. ‘Tux in Figure 17.7, X and Z ld abo expect = yesiLate = yes) > P(Aliens = yes) d-sepmrated tut they are d-conneeted given W ving that your fiend is Ite certainly increas the probability that si deted. But when we lear that you forgot to set your watch propedy id lower the chance that your frend was abducted. Hence, P(Alens Here is mor Finition of dexeparation. Lot X and ¥ he dist siLate = yes) # F(Aliens = yeslLate = yes, Watch = no). Thus, Aliens and tices and lt W be a set of vertices not containing X or Y. Then X ¥ and ¥ such that (i) every collider on Chas a descendant in Wl ‘ : ' . to ather vertex om U ain Wf A, und Wave detines to of vertion 172 Exam. Case he DAG in Fe 172 In hs til SRS ea eee ear ight an smoking re ary len but they are depwdent given “eu aaa os that look eliferent may’ actually imply the same independence the DAG in Figue 178. From: the dorparation ms IG isa DAG, wo let TG) a independence statements ial by 6. 
Twos DAGs Gy aud Ge for the same variables V are Marko Sees tee equivalent if (0 Gen DAG 6, bt n(n (5.5 ircetedl graph obtained hy replacing the arrows with mnairected edges X and ¥ ate desparate given (S$. 17.13 Theorem. Tivo DAGS Gy and Ge ere Mast dent if and on keton( Gy) ~ soleton(G) snd (i) G, and Ge Rowe the same wasicld 17.10 Theorem. ® Let A. Hun be disjoint Then AUB us a 17.14 Example, ‘The fst three DAGs iu Figure 17.6 are Markov equivalent in na Pe eh has nota indepen bac lower right of the Figure is noe Mathov equivalent to th mation for DAG: = y t how often was Z = 2? To auswer, note that ited raph a "hej probability tb he answer to or question is given b jal distribution 1 ee, \ r= ot Le " o ne that the second graph the cor al 5 W = Lom the second graph. There are We shall denote this as P(Z = 2I¥ jiely a ul for ervention ¥ = ») conditioning by observation or passive conditioning, W to W tha nod to Broke the arent he sas inal grnph Ts heh would imply that hangin Z nditioning by intervention or active condition Both 1 probability graphs but only th rect cansaly fen that Joe smokes, what isthe probability he wil get lng cancer’ ave core A litionn Hl i the eorect ears graph by using background knowlege If Joe quits smoking, what isthe probability he wil get lung cancer” Consider a pair (0,P) where ¢ is a DAG and i bi a dletriba 17.16 Remark. \\ learn the or spl fru data but a the DAG, Let p denote the probability funtion for langerons fu fet i is impossible with two variables. With mote than Her ng and fixing le X be ena a. We represen i methods that cat fl the causal graph under ee a two thin tions but ey ate Ia vethods and, fartherwary, there bs uo The new pair (G*.*) reprevents the intervention “set X soonest : 17.15 Example, You iy have relation et 2 end i auteonn, a confonnding variable Z isa variable with arrows int that iy the variable “Rai lepanent of the varia NX anud Vs se Figute 17.11, 1s easy to check, using the formalin 0 Lave, Consider the falling two Di Ta rationnized study the jween Z ood X is broken, this case Z unobserved (represented by enclosing Z in viele, th Nein Wer tag Wee Lown between X an timable becmse it ean The frst DAG iamples that ile the socond in vx IX hot involve the tinobservel Z. In tury, with all confounders aber vx e ut the jant distribution crt Xan = 2) in formmla (16.7). AZ is mobserved then 1 graphs are correct, Be LW are noe independent. y r * le the second i wrong, s became x x yt ny of saying ‘uk a he mt. wo can make w pects co tween DAG aul conte m ing the rule of interven reuk the arrows into ais as follows, Suppene that X and ¥ ee binary: Define the eonfou 18 Undirected Graphs Undirected Graph 18.7 Example. ‘The snail the graph in Figure 18.9 an TT . - UN |X 8 raphs to Data ae ne arena te dere ces one we to Ba ah dat oe on 1 | Nae amd NOUN Aik 18.5 Bibliographic Rema 19910}. Write a tiinimnal conditio ue fi Po 19 Log-Linear Models tudy log-linear models which ate nseful for mosel LX = (Xie Ny) be a iserete rand yeetor with probability finetion 218, LayeLinear Mod 19.1 Theorem. ts J(2) of a single random F 1 fanetion off “ : " WEE A and.2,=0,¢ = {4 The form calle the logetin Given (p a Bach «4(¢) may depend on some unknown parameters fy. L . A.C 5) be the st of all these ps We will write fla) = fled ve want dep oa he unkown parameters 19.3 Example, Let X= (x he li : P ProeeesPX) } This i an 7 space. fn the loti hey yoo data veto a o-{ n.pe Ph here (pis the set of hp. The set © is 0 N " sin uf ck su forth bet 19.2 Example. 
Let X~ Bernonlli(p) where 0 < p< 1. We ity mass funetion for X a OA, where » » Hence (7) \ J re X; € {0,1} amd X 5) Th rite as a 2-by- table as fll satisfied, The si parameters of tet a tog (tt) y= tox (24) ( eg (BAI) y= ty (Bs) The next theorem gives an ary way to heck for comitionalipen a logtinewe mon 19.48 Theorem. Jot (No Ni ‘i x x x \ 1 ada term to the model and the graph does not change Tom theorem, we wil we the following lemma whose proof fl me conditional ndepwnen en the model isnot graphical ler the geoph in Figue 19.1 Xo Xe XU XIX and ont Togeinear mae that correspon co tas graph i Proor. (Theorem 1.4.) Suppose tl 1 scene va low in band ¢ Hence, vy nO 2 abort 2 aU r= ; : anodes graphical. The elge m he rap at at ag tat pair oF tel frown the mode, Fo Exponentiating, we se that the joint density is of the fx ‘ z . It omitted Is hee (24) asin an 19.2. Graphical Log-Linear Models emitter, Thee nee other ge aswell You can eck that te A loglinen model is grape! ifm espn co eat Seen ema 19.6 Definition Ea 7 «graphical ; vay J (iJ) CA and E Tuaw a graph for this wel, we wil get the su For exanp ris eon (1.5) s0 we omit the eke eto Xy aed No, Bt this x x x Jes conditional invdepenence const ean al thing. I M are only concerned sbont presence or absence of candi mya th ee eae depeinlenes, then we need tat conskler atch w iui, The prexeue of 5 Vere —— . 19.3. Hierarchical Log-Linear Models : ; : ; A 1 ste a Bit This atv the hierarchic! log li FIGURE 194 The motel for this sh vehi . ishicrarecs fh ot graphienl. The graph coreesponaling 1 th ir rete; see Figure 19-8 Tt phi =o ‘ Nernrchieal fog Vane ot carson co any pairwise conditional independence. a He. Let 19.9 Lemma, " OTe aphcorrespencing isin Figure 1.4. This model fy not hierarchical ot rat, Shee itis wot hierareiea, Ht not graphical ether, fe mode al te graph is ova in Figure 19.2, ‘The ape! hea i ed ao 19.4 Model Generators . HHcearchicnl moxtels can he written suevinety nsng generators, This is most ly explained by example, Suppone that X = (X,,Xa, Xa). Then, A 19.11 Example, Lot ; Theforumla A = 1.241 hae ey and ¢:° Wehnve tabi sim that we shou oly earch ous a itachi a the ler oder terns oi won't be ieateica The nerator AF nels ae les integpetabh {he saturated swe ferent apron fb hypothe testing The nel hat ine The sata Consider A= 142 The Ht ratio st fo this hype the deviance 19.13 Definition. waned M, defi the deviance dev(M) bi the natal independence model, Final wei 9.14 Theorem. ‘he he tio for k near Models to Data Hy the ost Ive mod is M . wor wl A. The gli x fntion (19.1). The MLE 3 generally Tas ‘ i i te whiten that there sample opportunity oe makin dia rh font " : heh tl we inode the mode fer fn ea the cotesponing raph ssl the sae a the me problem i ee regent f 19.15 Example. The flown na ave fons Morro ota on wo we AIC. Let A denote ene lg na, Dil Toa). Ted areon sage lear ye (X,) aK weir wo Xml 7 baat x ied _survivad died survived The saturated log tear model 19.7 Exerc rad vival 2. Prove Lew Prove L ve ous 02 “a7 ae ee an ona fe) Us this The best sub-model, selected using AIC andl back i he hu a i b) lox 6) los f 9.6 Bibliographic Remarks For this chapter, I drew heavily on Whittaker (1990) which ll text on log linear models and graph ferns ofthe sin Example 19 sudom variables (Xy. Na, Xa, Ny) Stppowe th ity raph G for these variabh low all independence atl cnaitional independence relation he graph nude graphical? 
Is it hieratehieal at ce proportional to the following v1 be bin the independence graphs correspon Mowing log-linear iovlels, Also, identify whether each 20 Nonparametric Curve Estimation er we dinenss nouparanett of probabil I Chap nw that its pons to consistently stint a ena wed to perform f smoothing operation on the dat An exannple of nator iy Mistograms, whieh we discuss 1 iin Section 20.2. To fora a histota estimator of a livia sine to sjoint sets called bins, The histogram estimator bs piecewise stant Fa eve the height of the Function is propestional te ma el 20.1 ‘The Bins-Variance Tradeoft fo a ma na.) = | fuse a SK = BIAS! + VARUANCE on 20.2. Histograms 1 Xena be nD ea rm revsthon tt ot " mtn “ we ‘i un always ue th m this interval. b 20.4 Theorem. ng, Wee inal chine i Th ster te of comers he i [ Peve sl rer to Bd rs, althongh i lifes fom the ra ik 20.9 Definition. i confide 205 Dafiton, 7 7 estimator of Fisk (te) <3 een 20. iu Shox co , | 20.40 : 20.6 Theowem, 7 : : uy 20.7 Theorem, 1 , 008. Herein outline of Fro the central in th 1 By the ji w(t) on san tt ie appa indepen. There 1h ih a ava( vis - va) =z 20.19 20.8 Example, Wo nel erosatiaton inthe astconamy ex 110 isan approximate mininizer tt the resin histyea dh / i ; this nee, The histogram inthe top rin in Fi \ 7 wge| VF ~ v } it % ' (inne Voce - y ) =2 (ms |B y ul init = 1 sisticaly make cofilence statement V ten ity J. Isten, we sal make ede far resolu the histogram. ‘To this end, ek \ ~ ’ ) ] jute =" to 20. ) 0 cruel are the Epanechnikov kernel BOA Definition. Given a Fernel K and « pos Teale he vandwidth, the kerwel density estimator tb 20.13 Example. Figute 20,14 Theorem. J SS a we 0! om o2 0.602 0.604 0.606 By a similar ealenation, pK The result follows fom integrating the sued bins plus the ‘We see that kernel estimators converge at rate! whi xerge atthe slower rt ican be shown that, under weak The expression for h* depends on the unknown density f jim = [7 here Fb the kernel density estimator after omitting the i 20.15 Theorem, Foe any > q ie ee (2%) 4 2 Ko k K 2K(2) and [(: K isa N(0.1) Gaussian kernel then K he N We then case the bandwidth fy that minis F0).* A f is given by the following remarkable theaters duet 20,16 Theorem (Stone's Theorem). Sippose th 1 20.17 Example, The top sight panel of Figure 20.6% based on ers-vliaton, These dat ave rounded which causes problems for cross-validation. Speci ally t causes the minimizer to be A = 0, To overcome this problem, we ed a small ant of random Norunal noise tothe dat. The rst th Tih) is very snwoth with » well defined minitaun, « 20.18 Remark, Do wt assume that, if the estimator Fis wig then eros | 1 yu Fale y ik (5%) an(te teal 19 Example jee K and the wei 6 se power spectrin of the temper uations, What yo av y= — " Teor from the big bang, I r(x) denotes the tre power srt Dak (G2) 1 a raul error with mean 0. The beatin sl sine of peaks using here dnsity eotimation an thes nsertng the eta 08 shows theft base on erneaiaton as well at wdersnaotel nd . fit, The crosevalation ft presence of thre wel viva ~ uptatenty ted peaks, a polite by the phys of the i ng. a 20,21 Theorem. Sapp 08. The vs ofthe Nadaraya-W timation, However, we ist mad to stint 0, Suppose that a an ys Mfatweceae)' f (ore +2 Sly a Y= joie 2 1 f ca thus us the average of the = 1 lifes Ys ~¥; to etna Tn pectic 0 Wh A we mini tat a : ay ee 5: Sw 20s 20.22 Theorem. 7 in) = S20 a data fom BOOMERANG (Nettertel ot al (2002). Maximo: (La ot m1 DAST (Halverson ot a (2002). 
T ‘ ' Confidence Bands for Kernel Regression A opprxinuate 1 ~ er confidence and ar F(t) is & ass \a a = or (Lt @ is dein (20.80) an wv the width ofthe kerk. fn cose the kernel vex not have tite wide thet we take w to be the effective width, that re range over whic the kore snes, Hu pare 1s 4 95 percent cunfidence envelope for the ME data, West tat we highly onfdent of the existence nnd position the fist peak. We are more weet bout the second nad third peak sion to msatiple regressions N= ( with Ker desity estimation we just rplace the kernel with a wltvar X_) i straight formant : ive regress it hn : : radar nas ae eee ee : only no fp enna tons, The wean Xy) + Doral XNa) + us itive madly are sally fit by au algorithn called baekftting. snl deviation of Fu) 1 the bins nil the standard deviation, ‘This the sec tem do 1 lage sample sizes. This mans that the cnfedence interval Backfieg, 0.6 Bibliographic Remarks wnen-F t,t be the tion ction cine ey grein he o_o 2.7 Rxercise 2. I canverget STOP. Elbe, go bck Ny ~ fal a fb ta ere tor wg kitve modes have the adage that they avoid the curs of dine sea em (0 ty ad they cnt be Bt icky, at they awe one dinatage: the { [ 20.5 Appendix [ f Coneioxnct S08 AND BIAS. The confidense bands we compte a ly { 0) Show a > and he a8 + then Fa) 21 pk web atte the det fe me a Smoothing Using Orthogonal Functions wits and bali. Cannent on the slate on 1 Prove Lema 20. Prove Tore 203 6. Pe hte Vis the mean ofall the ¥ rng to the LL Orthogonal Functions and Ly Spaces Fin the approxitnate risk of this « From this expression risk ind the cptinal badwidth, At what tate ds the vik » Jenote a three 5 : al numbers. Let dene 5 Wa Wa senlar (4 x) a or, we define The sn of veto 0. Show that with stables eesumptions on r(x), 3 ta equ duct 3 islet The inner prod 10. Prove Thee {a vector « is dfinea sectors ate orthogonal (or perpendicular) if Let dy = (1.0.0, #4 = 0.1.0), 64 = (0.1). Tse veto aes this es, the eto 6 83- fom n ismewinthai € smal Hs fir V se they ave he allowing proper en J can be written as! rtogoual i) hey fr fo, which ms Ua any Yea be writen Ba combination of A infil rosk Parseval’s relation whi says that Ysie, when 21 si'= [ Pnde=S 4 a Not oy, Thre a tse 3 = (8 21.1 Example. An example ofan orthonorial bis for £2(0-1) i the cosine (444) s). eee (yet -2) aed a fli red Sosoy we 21.2 Example. Lt Now we make the leap frome vectors: metions, Basically, we just reph epee Lalasb ; fore 1 Ae J Woven oe that fa) ets cle t Fla). The coi { L j Fi jeosteyr were computed munca & Wet write tof a. Te er pret betwee ee ae eee 21.3 Example, ‘The Legendre polyoma on [11] oe dla ] p = 1, 5=0,1,, 21.8) , It can be shown that these functions are complete and orthogonal and that j=! [ peewe 2 ag ai\/ 214 Theorem, The » te this eatimator, note thot 62 is au unbiased estimate of and 2 isan unbiased estimator of 32. We take the positive part of the latter (a)=% ¥(a) 7 mace nek tha canna be negative. We now cons 1< J < pt ninae ROP) Heres suey Prooe. The mean i 1. Let | neo 5-1 Soe (i) = 1B ex a Jar | ne off this etm problem, However, if we net ou ee * te af hes tevin es te the tue demity f(z) = So 20,2 Tm the confidence bash 215 Theorem, The rsh of 21.6 Theorem, An pposinate 1 emf and for oo 8. (z_% ni 1 Poor Hees an atin of he prot. Lt = 3} (8 9 By the 3, s N(Ajvo}/n). Hence, 3 ayej/ vi where Desi 1 Since J is an average, the ventral Iiait theorem tells ws q apnrosimately Normally distributed, 21.8 Theorem. 
Fea = Hyon diet be the risk of the estimator 21.9 Theorem, The risk IRI) of the esti (= Eh, Hoyt ny = 22 9 (us “J ga” # (21.26) here k= n/4. ‘To motivate this estimator, reall that if fis smooth, then or So, for j > fy J; = Na?) ane ts, 3 02%) fo fe wee Z, ~ N(0 1). Therefore oot (= i & QW #2) =o, Abo, Viyf) = 2k allen ot /K2\(2k 1. Thus we expect @ to be a consistent estimator of 02, There i thing special about the hoe =n. Any’ tnt increases with 78a syproprite rate wil stir We estimate the risk 24 (#-E) (on 21.10 Example, Figure 21.4 shows the doppler function n= 2.088 erentoms genrated from the model Y=rl = i)nces >» N(G,(.1)2). The fig shows the dats ad the estimated ~ ‘Orthogonal Series Regression Estimator ml = 2S veiled. Fe tem aat ys 25 Fo (21.28) | 1 Ta (4> Sao) =P (Fx Se oa 21.12 Example. Fi sal i HAV lan. Wi NW law VV 214 Wavelets father wavelet or Haar sealing fane SSS a then BIZ ere ¢ = ¥/97e is w constant. This 15 Appendix a >, vie DWT Fon HAAR WAVELETS. Let y be the vector of Ys lengths) ane 1 = logan). Create lst D with clement > 1 1):0)f 1 21.6 Bibliographic Remarks 1B, 28, 25, AO A, 65, 76.78, 8, ven iy Og (1997). A more al 3 is Consider he glass frgment data from the b ste, Let ¥ be 1 al, (1998) ‘The theory of statistical estimation sing woweets has b refractive inex aud let Xe altuna content (the fourth waiable 1b mans anthors especially David Donoho ad Lan Jobst Duval aut Jolson (190), Dowoli an Johastone (1985). Don somparanietrie gression to tthe malel ¥ = fn) 4-« asin 2LT Exercises band (adeva)©= (Swe) = (Jere Ha) = alt= aha (=) D Parseval’s relation equation (21.0) “ ee “ ™ a} Fit the curve using the cosine basis method, Plot the fitnetion esti 10, a sity Estimation) Let Xi... Nn ~ f for some density J on ssification Fe) = 00) 4 Bavsaled In this question, we wil explore the mii squntion (21 XX ~ NUko!). Let ania (Xo eae 22.1 Introduction Shnntate = 100 observations fr a N(O,1) ctr the MSE eV iscalled elassification, supervised learning, discrimination Repeat (h) bit add sane antlers tthe data. To do his, sin pattern recogniti a ach observation fom w N(01) ith ity 5 a sane a ate Xne¥o) wore chservaton fron a N(0,10) with probabil eer Repeat question (us the Hast bas Jinwensional vtor al; takes wie a some tite st J. elas ation rule i finitio A: 2. When we obwerwe a new X we pret Ire AC 221 Example, Hew in example with fake data. Figun 2 pnts, ne X= (X1.%2) b sonal am Y= 40.1}. The ¥ cl atthe plo with ho ten ting Y= Cad the ¥ = 0) Abwoshown isa Bear eatin tle eporsented This fs rile of the for ny fast 0 (0 other thing abo the line schist 4 0 al everyting Melon the Fe ' 22.3 Definition, The true error ratelof a classifier bi Ln) = UEAX) 4 VY 22. nd the ompicical error rate or training error rate Eucn) = LY HX #¥ Ths tao props ave perelly separated by the Inet de First we comir the special cane where Y = (0,1). Let 22.2 Example. Recall the the Coronary Risk-Faetor Study (CORIS) 1 2) = EY y aan from Example 13.17. There are 462 males between the ag of 15 and 64 three rural areas in South Attica, The ateome ¥ Is the presence (= 1 Jeno Two that pence (Y= 0) of coronary wart disease and there are 9 covariates: sy ‘ood cumulative tobmec (kg), a (ow density lipoprotein ch : y aux terol fast (Ean history of heat disemse),typon (LypO-A ; Y= NFO =) bavi 3 (eurrent alk consumption), and age. 1 camp Tea Kel = OPW = Toundary wsing the LDA method based on to of tho aarti 3 saints, syteic load pressure a tabmceo consumption, The LDA meth ] i wil be explained shorty. 
fn this exanph ps ato hard to tall apar hen I fact, 1 of abject are misclisified sing this elssifcation ri . f fis¥ =0 fils) = fle\¥=1 nt, it is worth revisiting the Stalisticy/Data Mining deta yer Ea =Y rok 22.4 Definition. Tir Bayes classification rule h ‘lata tin ii) Ye) weyef 1 Mr? } 7 ‘luaiier hypothe inp hr Meas fetimation lear finding good classifier im ; eee decision boundary. Warning! The Bayes rule has nothing to do with Bayesinn inference. We Error Rates and the Bayes Classifier snld estimate the Bayes rile using ether frequents. or Bayesian method We Bayes rule ray be wet in sevceal equivalent frm: isto ind a classiiation rue h that makes accurate predictions, W art with the folowing definitions [One ca ether nos Fr sity ww . fi epee =x Y <0 10 otherwi cad ney ef fahGa) > =m hate 10 others 225 Theorem. 1 le i optimal, thal is fb The Bayes le depen on unknown quantities 0 to fl some approximation to te B At he there are three mai ap 1, Empirical Risk Minimization. Choose a set of cl Regression, Find nt estimate 7 of the cegzession function ius 1 Ae P= 10 other Density Estimation, Estimate fa from the X's fr which ¥ from the Xs for which ¥; ~ 1 nnd Ie yy ? y =aiN Ale " funy of) Re 10 other Jot general a skes on more than follow 22.6 Theorem. 5 yt Tie opt Vy = AX = 2) = file n= POY =r), Sole) = sal =7) and ans th 3 Gaussian and Linear Classi ee \ hte) ~ gaara mo {-} \ 22.7 Theorem. 1) XI =0~ Nig. %) ond XWP =~ Non {1 it eto 4210534) +e (88) ne ein Lo. others 2 WOM = py), 112 Oy eatin Mahatanobis distance. 0 gnralent wy of a Mog is S Kk A dent f 1 J quadratic analysis (QDA). I aa yaey ‘ Ly x x Xs —folX 5 ; ast Saye calla the discriminant function. The decision homdaey ( | is near 90 this method calls linear discrimination (Lupa), 22.8 Example. Let ns turn to the South Afkican heat dis sii as clasts 1 The observed mi 141/462 = 1. teting ll ehe tes rece th The sess fro alate cserin ited as 0 classi as 1 LDA 22.9 Theorem. Supp yaaty Leo lo ¢ We estimate dy() by by inserting estimates of jig, Ex and mp. There is another version of linear ditcitiinant nnalysis due to Fisher. ‘The Mea it the a ie, Algobraicalls, this means replacing the covariate X XiyovesX¢) with a linear combination U = aT X = SE, aX, The go to chun the vector w = ( tg) that “best separates the data.” Then perforin clasifieation with the oucdimensional cowrate Z instead of X We ses define what we meat hy separation of the groups. We wo Ik he two groups to have meas that are far apart roative to their spend. L lenote the aus of X far ¥; and let 3 be the variance matsis of X. Then vi TX = j) = lay and V(U) = wT Sa. # Deine th We estiminte Jas fallow Let my = S21; =J) be the mnber of ebser ations ingroup j, let X, be the sample mea weetor of the 2s for group sud let S, be the sample nuntrix in group j. Def va he resis of ; si Us wed) = 2(v1— -¥4 5) 22.10 Theorem. Th Let X dote the N (1) wate of the fan x pix x Lx x Uw x 2 Ky RIP IGEN Li Xo | Whee Y= sees Ma) Then ad the vel ean be writen a yx fo iret Lt irerx ete From Thorens 1.13, X'X) Ix! Y-x 1 Linear Regression and Logistic Regressi CO eee . ves ta lesson panel ter 12. The model bis section, s y= (0.0.1 2) =P =X c 2 an athe Mkt 7 is obtained anmerie 1.0. otherwise ~ 22.11 Example. Lot ws ween eth jens data The ML i ge The sles rel the tea ogre model Exaniple 12.17. The err rate sing this ode for clusion . ax (22 Wes cau get n better elasiter hy Biting «ier mode, or example, This ne - Fare ¥ . Iosit POY = 11N a y 22,12 Example. 
If ws mel the error rate 22.5 Relationship Between Logistie Regression and LDA LDA and logistie regression are wlnost the sae thing, Ife astm that yan (=) 1 oe (RUSUSES) = tag (22) — 20 4 007 B-"n — 00) y= ux : es a0" Sopra) > These ate the si TLies1 = [sein [] 400 a Is b In logistic regression we maximized the condition Hhelboo TL, f(y[20) bu Sine ation only requires knowing fy). we don’t really need to wonparamvetric thas LDA. "This it LDA, nv both lead 0 sti 22.6 Density Estimation and Naive Bayes TE fis). 0 that X “The Naive Bayes Clasifer The uaive Bayes chasifer is poplar when 6 highetimensioual sad ds rte, In that ens, pecially sitnpe 7 Trees Trees are cnssification methods that: pati sovatiate space X i sand then clasify the observations according 10 which pti alin, As the name implies, the clasifer can be represented Forillustration, suppose there are two covariates, X ud No = bl pressure. Figure 22.2 shows « elassiention tree sing these varables The treo is used in the following way. Ifa subject has Age > 50 then ify hin ay Y = 1, a subject has Age < 50 then we check his bl sare, If aystolie bla presnre & < 100 then we classify him as ¥ = 1 otherwise we classify hi as Y= 0. Figur tte Jassie 2 patton ofthe comarate spac Hee is how tree is constructed. First, suppose that y € 9? = {0.1} there is only w single eovnriate X. We choowe a split joi ttt divi real lie ato & \ Ay = (toh Let Bald) be ponton of observations in A EM = h.Me a. . j= Be for n= 1,2 0,1 The impurity of the split # is defined to b m=3 y 22.80) prtienlar measure of impurity fs known as the Gin index. Ifa partition lemeut A, contains all's or all Us, the 0. Otherwise, 72 > 0, W re the spit point ¢ to ninknize the inparty. (Other ilies of fp ites cam be used ess the Gin index When ther al oa oe whichew and sp ul to the lest impurity, This process is contin antl some stoppin cron is met. For example, we mht stop when every partition element les ofthe tree are calle the Heaves. Each lea asl 0 ether there are more data points with ¥ = Dor Y =1 in thar partition This procedure is easily geneealized to the ease where ¥ y only define the iupnriy b ey 2 re (3) the proportn of observations in the partition element for which sl ie shows the 10-fold eros 5 el 15 22.8 Assessing Error Rates and Choosing a Good FIGURE 225. Th rate ad aed fine isthe Classifier framiliation etal of tr How do we chose a sifer? We would hike to havea elasifer A with There are many was to estimate the err rate. We'll consider two: Crass low true error rate (4), Usunly, we ea use the training ere rate Zy(h validatio ul probability inequalities. Vauipari Cre The bs 22.14 Example. Consider the lear disease dat ata into two pees set Toul the validation 22.18 Example, Fit, supoate that = {n-v-fin} const of itely many cla 7 a nn pee ase aa Scone See ra As Aca} voi 22.16 Theorem (Unifarm Convergence). sume Hi finite and J the nuns of subsets of F “pike ant” by A Hone (2) dena ee nue Th umber of elements of » set B. The shatter coefficient is defined by Luni >e) wA.n) = me NAF (237 De, We wi + inequality nd we wil also tse the ere Fy consists fall ite set foie. Now let Xi. ie ‘ eto en PU A) < E No ia \ y=! SKK ( sax [Ba (tt) — E04 J Zath) — L0H $F (lett) —201>4) tern bods the distance between Pave i 22.18 Theorem (Vepnik and Chervnenks (1970). F a . { cup 1Pa) —P(A)] > €} < Bat 22.8) uy j 2217 Theorem, Let % The proof though very leg, Tong unl we one HF A sw eto yn) ificrs, define A to be the cl of the form fae: Gr) = 1. 
W Then En(R)¢ i «1c confidence idera Poo. This follows fom the fact that Fath) — Lay, > «} < 8 leat) — E1091 >0)

! Ae confidence tre doth s2/m)log (San When age the confidence intra for £(R) sage The more function See eee fr by having a larger confidence interval W220 Detintion, The V ens) densi ofa Fase we sal 24 that ae infinite, sch as " ‘ > fora % To extend our atalat to thar enscs we wat tobe able to define VCC) to eth for This, the ¥Celoesion is the ize of the largest ite set F that can be (sun iE4009 — 2081 >) < something not too big shattered by A meaning that A picks ot cach subset of F. PRL isa set of ‘ Jassifiers we define VC(H) = VC(A) where A is the of sets of the form (ne way to develop sucha gnerllation sy way of the Vapnile Chervonenk fe: fe) = i} ash yrs in 7. "The following sre ss tha if Aas or VC dimensic finite VC-dimension, then the shatter evefficients grow as a polynorial in x 22.21 Theorem. If A has sien A 1 22.22 Example. Let A= {(-% R). The A shatters every I 1 r} bit it shatters no set of the form (251). Three, VCLA) = 1 22.23 Example. Let 1 of close interns on the real Tne, T 4 $ hat it cannot shatter sets with pois. Cons : One cannot find an interval A sich th 4 te all linear hal: nthe plan in {aw all on a fie) cx be shattered, No 4 ean b Consider, for ¢ 1 points forming & diamond, Let 7 a itmast points. we yicked ot, Other configurations v-ansatterable, So VC In general, hallapaces in RE 0 Le 22.25 Example. Let A bw all rectangles ow the plane with sides paral ve punt that not lft, rightist, uppermun, oF oermst. Let Tb 22.26 Theorem, sion dan et H be th se of i oa (8 dy Support Vector Machine First spon tha Tinenrly separable, that i, there exis 22.27 Lemma. ‘The data un be sepavoted by some hyperplane if ane o Poor. Suppane the data cas be separates by a hyperplau y 1 follows that there exists soe constant esi that ¥; = 1 implies WAX) Se and Y= =1 implies WL Therefore, YAWN) > fo alli, Lot He) = a + re a) = bye. Then YoHLX,) > 1 The rover dinection ie senightForward a In the separable cave, there will be sting hyperplanes. Ho Jhonld we choose one? Intuitively i sans reasonable to choo the plane “furthest” from the data iv the sense that it separates the +18 and ul maximizes the distance to the lowest pot, ‘This hyperphane ts ealled th ‘maximum margin hyperplane. ‘The margin is the distance to from the lyperplane othe uennst point. Pots ot the boundary of the margin ule support vectors. Seo Figure 22.28 Theorem. pane Fits Th fas that separeten th Te tums oat that this problem can be recast problem, Let (Xj.Xq) = XFNe denote the in 22.29 Theorem. 0) hyper So AS evarneenins 0-Soad a (nc : Ba) =n Soar here are many sofware pcg hat wl lve ths problem quel The variables § re called slack variables. We nun maximize (22.40) subject Yara The constaut +i tuning parameter that controls the amount of overlap. 22.10 Kernelization There isa tie called kernelization for improving computationally sinple clasiier he, The idea is to up the cvatiate X’— which takes vanes in 2 This ean yield a more Hexible clasifter while retaining computational Tans, 6 maps &° = R? int Ia the higher-dimensional space Z, the Ys are separable by a livear decison bowndary. I other wor 5 linear classifier ina higher-dimensiona space corresponds to noe linear clsiier in the orginal space he ps that to set of classiiers we da not need to give up the rience of linear classifiers. We simply map the covariates to a high limensional space. This is akin to making linear regression more Bexible by ssn polynomial There is patent drawback. 
Ife significantly expand the dimenso of the problem, we might increase the computational burden, For exarapl x has dinasion d = 256 and we wanted to ase all fourth-order terms 11 2 = (2) has dimension 18181.376, We are spared this computa f that the maximizing vector w is a nen combination runs of the kernel, Formally, the solution the Z's. Hence we can writ . aight invertible. I this case one me constant b, Finally, the prokection onto das writen Also U = nT oa) = Saker) Z xr The support veetor machine can similarly be kerulied, We stmply replace Therefor X,.X,) with A(X,,N;), For example, insted of maxiniin 10 eZ, 1S o(xonly 2.13) ¥, = zP OX, The hyper : Y= Ho(NYTAN 22.11 Other Classifiers is wy eH = IKK There are may’ other classifiers and space prochules fll dissin of al of a The kenearest-neighbors classifier is very simple. Given a point find ere My is a vector whose #4 componcat is data pin tor, Clasify + using the wajority vote of thes seghbors. 7 1 ruudenly, The parameter K ean be chosen Xu i Xa XMvi=dl omevulidatin Bagging is. mith for reducing the variaility of «elastin, Wis mon Ie follow tha pf for highly 0 es ch a es We dw bets o’Ggw j and : Stop 1: Draw Xa ~ pa. Ths, PX , oa Step 2: Denote the outcome of step 1 by & Draw X, ~ P. In other wo 23.12 Theorem, The on ms relation satisfies the following prop Xy =X Step 3 Supe the outeome of stop 2s j. Draw Xe ~ P. kn other w X= HX » And 90 on. ote Ie might be dificult vo unders ig of jy. age simulating fond jo d 4 the chain many times. Collect all the outcomes at time n from all the The set of states can be written as « disjoint onion of elasson X This histogram would look approximately ike jy. A conseaence of theorent RUM U-= there tro states fad j ih oth ates eaunmuniente with evel other, then th chain is called ire in the ehain will return to state F again, By repenting thi uci osed if, once yu ent n samen, we conc ¥iX x. If is tration, then f ql n xing cles absorbing When the cain isin state i, there i probability 1m > O that it will never ene return to sate, This the prubability that th chain is i state 4 exact ines a? — a). Ths sa gonuetrie distribution wie has fnte mean, 23.13 Example. Let = [1 : i \ 23.16 Theorem. Futs about entre } : so The cl 2} 48} and (4). State 4 isan along sate. A finite Maskon chain tnast have af east one rvearrent sta " 23.17 Theorem (Decomposition Theorem). The state spwor 23.14 Definition ‘ccurrent vv persistent o vay Uae 23.18 Example (Random Walk). Let 101. 2.c.-5) amd a 23.15 Theorem. «1 woth pr piict <4 l= p. All states connate, hence either = all the states ate recurrent oral ae transient, To se spy a 4 at Xy =D. Note ° tail (atepa tothe left). We ean approximate this expremon using Stirling's Fornala which siys that x 5 ; Ln nl ~ nye ua wes that heh ee ee Inserting tis approximation into (2.11) shows ty vx SC EUIX Slew, = aX y ve sh Wis easy to check that 5, pvo(n) < ae if and only if 2, po2n) < x by 7 a a Moreover, 3, poo(2n) = 20 if and cay fp = = 1/2. By Thee (28.10) = the cain bs recurve if p= 1/2 otherwise it transient, CosvenGENce oF MARKOV Cuaiss, Te discus the convergence of eu ‘ve ne few anote definitions. Suppose that Xp = i, Define the recurrence Ty = min{n > 0: Xn= i} 2 assuming Xp ever returns to state f, otherwlse define Ty = 26. ‘The mean omy = BT) ~ Sofa oa. filo) =P #.% Xa FAX ) A vecurent state i mull xo othiewise tis elled nonnull oF posi 23.19 Lemma. If! 
23.20 Lemma, Jn finite state Mar all recurent states are posit Consider three-state cin with transition matric Io Suppose we start the chain in state 1, Ther we wil be in state 3 a times 3.6 8, This isan example of peri cal, Formally, the period of stat isd if pa(n) = 0 whenever ns uot divisible hy dando the largest int with this property: Thus, d= gd( pun) > O} where ge meas “great common divisor.” State és periodie if (i) > 1 an aperiodie if d(j) = 1 A state with period 1 i called aperiodic 23.21 Lemma, state i has peri d "23.22 Definition. 4 Let = X) be a vector of non [23.28 Definition. Wr | eistribution if Here is the intuition, Draw Xp fru distribution 7 nnd suppose that = 84 stationary distribution, Now draw X; noconding to the transition probabil of the chin, The distebntion of Xy is then jy = oP = xP = x. The distsibution of Np is xP! = (xP)P = =P = x. Comtinying this way, we se tat the distribution of X is 7P" = x, In other words Wat anytime the chain has distribution x, then i will continue to 23.24 Definition, We soy thal a chain has Timniting distribution + if ergoic chain convergs to its stationary distribution, Also, sample average BADE Theorem. an reducible, cigodic Markow chatw haw a wndque jen 2791 0) = Dati) eau Finally, there is ther definition that wil be nsefal later, W satisfies detailed balance if 23.26 Theorem. If satisfies a nn isa st Th of detailed balance will become clear when we di Markow chains Moute Carlo methods in Chapter 24 Warning! Jnst boca lai has a stationary distibution i 23.27 Example. Lo vt 1. 1/8.1/3). Then xP = x 90 7 ia stationary distribution. 1 i arta with the dstoibition = # wil stay” tha alist Iinagine simnlating, many eliains amd checking the marginal distibution ec tne twill always be the uniform distribution x. But have a init, It continies to eyeke around forever, Exanrues oF Manoy Citaty 23.28 Example, Let = {1,2,2,4,5,6). Let 23.29 Example (Hardy-Weinberg). Here is » fens example from Snpponea ene ean be type Aor type a. Theve ate three types of people (call jenotypes): AA, Aa aul a, Let (7) denote the Fraction of peopl of ea jeotype, We assume that everyone cntributes oue of ther two cops of th fee at ravdon to thelr eiklen, We also assume tha mates ae selected muda. The latter iv not realistic reasonable to ass on det choose your mat wr they are AA, Aa, « This wonld be false ifthe and if people ch fs hase on eye color.) Imagine if we pooled everyone's genes together Jar (y/2) A chil ie NA with probatiley P20 PQ, nul an with probability Q2, Thus, the fraction of A this generat P+ PQ= (p44) +( ; However, r = 1 — p — 4. Substitite this in the ahowe equation and yon get PQ = P. A similar eslenlation shows tit the fraction of “a” genes this ems stable afer th eration, The pro AA, Aa, a P®.2P°Q.(2) from the second 3 ald the Hardy-Weinberg, lw Arsumne everyone ls exactly ane child. Nom sl person let Xe the geuotyp Aesectant ain with tate space X = (AA, Aa). Some basi ca oa that th [req The stationary distribution is = = (P.2PQ,02). 23.30 Example (Markov chain Monte Carla). In Chapter 24 ww wll poet lotion ethos called! Marko cain Monte Carko (MCMC), Here sa brie Iescription of the iden. Let f(x) be a probability density on the real Hine au ypmone that fi " a known function and ¢ > 0 1/ f ole). However, it may not be Feasible tw perform ths nega, no i it uceestry fo know inthe follwing algorithm, Let Xo be an asbitrar V(X; cb!) where 6 > 0 fs sone ied onstant Let inf 00). 
4} Lwin 'f Draw £7 ~ Unifortn(9, 1) and x= 4 it We will see in Chapter 2 vs nem av Xaver tlie MU Hence, we 6 he ra v NY}. Suppose we observe observations Xyy...+X from thi ‘hain. The unknown parameters of Markov chain are the inital probabit fio = (dol 1} po(2)-v-s) at the elements of the transition matrix P. Each TEL 23.31 Theorem (Consistency and Asymptotic Normality ofthe at). Assume 23.3. Poisson Processes As the name siggests the Poisson prcess is intimately related 1X has a nt distridnition with paranacter A etten X Also recall X) = And VIX) =A. IFN ~ Poison), ¥ and XY, then X-+¥ ~ Poissou(A+v). Pinay, if. n(A) and YN 1n~ Binowial(n,p), then the margiual distribution of ¥ ts ¥ ~ Polson( Now we describe the Poison process, Imagine that you are at your ect puter, Each time a new eval message arrives you record the tte. Let X; be {Xp LE foo} is process with state space X= (01,2, A process of this form is calla! a counting, process. A Poisnn proces is counting px fies certain conditions. I what fells, we wil mes write X(t) instend of Xj. Aho, we need the following notation, Write (ht) = off) if f()/h > 0 as h > 0. This means that f(h) i smal than J when fs close to 0, Fr example, 12 = of) 23.32 Definition. 4 Poisson process is 0 . Xe: 1 [0,90)} with state space X = ( | 2 For any0=t0 0 for all baud SZ 1, Assume each animal has the same Wfespan and that they produce ofipring according tothe distribution py. Let Xy be the mnbir of animals in the n® generation. Let ¥ 40 be the sfpring prod in the n™ goneration, Note that Xe = ¥" rf Let w= BUY ¥(V). Assuaue throughout this question that Show thot Af(n-+ 1) = pA(n) and Qn 41) vin. Stow that An) =e and that V0) = 04 Y () What happens ta the variance i > 1 What happens to the yar ance if j= 1? What happens to the variance fj <1 1) The population goes extinet iF.Xy =O for sane n, Let ths define the exinctum time by Fin) = PON the CoP ofthe random variable N. Show tat F 1 Hine: Note tha pis th {Xs = oF Tims, BUN <0 DO). Let k be the number of espn th origina p ulation becomes extinct at tine 2 i an bly feach ofthe k sub-popmlations generate fom the k offspring 0 Supp 2 4. Use the forma from 1) 0 Show that i is racnrrent state and #45 j hen j 6 voeurent state 24 Simulation Methods 1 Inference Revisi pnsey= ff faint 24.2. Basic Monte Carlo Integration ! x we X y erate Nyooece Ay ~ Unis), em by eh 72 tyatx wat 213 This is the base Monte Carlo Integration method. We eau abs comput the standard error of the estima 24.1 Example. Let wen, f ot | y Fis 24.3 Example (Bayesian Inference for Two Binomias). Let X~ Binowil and ¥ ~ Binomil(nepa). We wou ike to estinnte bi. The ae lta method whic yids RO) , Rh ey : Bayesian analysis, Suppose we use the por (pp) = fn fpe) = sth flor pal XY) = SUL ~ my" BEL = pay” GUIRE 24.1 Panter 0 6 fn sna, X= Sand ¥ = 6, From a posterior sauple of ie 1000 we get a 5 perecnt The posterior mao of 8 is pvterio interval of 0:20.20) The ponerior dit ea estate on [Uf si oertov xr inden = f° fp fto XY Mou Wistogram ofthe slated vasa soe in Figure 241 we want the psteror density of we cau fist get the posterior COF 24.4 Example (Bayesian Inference for Dose Response). Supe we cnet a PIX) = PES ANY) = f SlpupalX.Y Mode 2 For each dine hve, we ae rts an . re A= (mp2) > J. The density can then be obtained by Biri know fons lag consideration tat high iferentiating Joes sould have ihr probity of dat, Ths, pio W To avoid all thee integrals, let's use simnlation, Note that f(p. 
pl X.¥ nt toatimate the dese at which the annals have a 60 prcent chance of (ou) fp) whic np sad pendent wer the po yi, This called the LD50. Form cr terior dstibation Als mix Ion X 41) and pl saint , Beta +1,m-—Y +1) Hence, we cn sn p rf Af " tra the posterior by dawn Notice that 5 is ipl (opened) function ofp? 90 HOw Metal + Lon x then te a il 8. The posterior monn of 8 Pf) Betal¥ 1am —¥ 44 for = Ace. Now kt 4 = P49 — 7. Th Te integral er the eon snl ye pa 1-0 ’ ’ ss, ind nding th 025 and 975 quantile, The posterior density $(8.X.¥ Flelvin o¥ (5 SV on simply by plotting histgrarn For example, suppone that 1, f for for --Palion- Modine er : : ie Monte Carlo eat ony ih y yo inevete variable. We can estimate its probability mass fetion rata S15 sul be sina to fn aft PHoor. Tho rain of w= fh Preeti oioite—(f f wa tpg ta : (fe ) The depen om g, s0-we only need to minimize th This establishes a lower bound on Eq(IV2). However, Ey-(1V2) equals th lower bond which proves the claim. This theorenn is interesting bt it is omy of theoretical interest. 1 we di not know how to sample from J then i unlikely thet we cou snenple fron h(x)|f(a)/ fVal)|f(a)ds. tw practice, we simply try to find w thick Astebution g whic is similar toh 24.6 Example (Tail Probability). Let esti Z>3) Z~ NOI). Write 1 = [htelflalds where fle) is the x Alensity and > 3h, and W otherwise, The basic estimator is P= NUS A(X) where Nyy... XW ~ NOI. U find (from sulting tvs) that EZ) = 15 and YZ) = Notice tht most observations are wasted in the sense that mast are na the right tail, Now we will estimate this with importance sinpling tak te bea Normal(t) density, We draw vals fom o and the estimate is ne P= NUS) F(X )MX,)/a6Xi) In this case we find that ECF) = 0001 a 7) =.0002, We have reduce the standard deviation hy 4 fuctor of 20, 24.7 Example (Measurement Model With Outliers). Suppose we have meas ments Xiy e000 X we assume that &¢ ~ N(Os1) then X,~ (01). However, when tak roasurements it soften the ee that we get the orcasional wild ob have thin tails whic nap xtreme observations are tare. Oe way improve the model is to sea density for «, with a thicker til, for example a edistibution with v degrees of feed which hus the for tt) = FE) (42) Smaller values of v correspond to thicker tails. For the sake of istration we il take v= 8. Suppose we observe 1X; = 0 whore, has v= 3, We will al X,—4) and We can estate the top and bottom integeal using importance sampling. We daw 8),..-.8y ~ 9 and then, To lusteate the idea, we drew 1 = 2 observations, The posterior mean (com puted snuerialy) is 0.54, Using a Normal importance sampler g yields 24.4 MCMC Part I: The Metropolis-Hastings, Algorithm Now we inirduce Markow chain Mite Catto (MCMC) methods, The iden is to construct a Markov chain Xy.Xa,...y whose stationary distribution if mx, ip This works because there i a law of large mn heorean 2825. The Metropolis- Hastings algorithi is specific MCMC metho! that ks as follows. Let q(ylt) be an arbitrary, friendly distribution (Le. knoe how ¢o sample from q(y))- ‘The conditional density g(ylr) is called ne proposal distribution, ‘The Metropolis Hastings alyorithin creates a sequence of sbservntions Xo, Nis---y a6 follows Metropolis Hastings Algorthen | Chuose Xo anbitravily. Suppose we have generated Xo,Xiy..+5Ni To | senerate X54 do the following | | (1) Generate a proposal or eandidate value ¥ ~ g(uX | ftw) atslw) f Yuet rte) = min! } fy tyr X= 1 x, ty 1 Ad ah thal nA A hy yaa 248 Remark. A sinple way to eect tp (2) so erate U ~ (0.1). 
UerseX int X 249 Remarks A conn chief fg) is N(x) for some b> 0. This impli t nin { £2. 4} vay GUE 242 Thre Metropol ca caespanding 10 = 7 By costrction, XX)... 8.8 Marko chain, But why dos thy Mar ' sain hve at stationary trib? Bene we expin why, i : : 24.40 Example. The Couey sition as dasity : Or goal st snate a Mark sion Asses nthe rear Soin ti sa) = min f : ’ ty XY won {h, t ty Danka ’ The snnlatorrquite cie of, Figure 21.2 shows the can of ng , N= 1.00 sing Van b= 10. Setting b = 1 ores the to tke small tops As'w rot the chin dom “expire” ich ofthe 7 sample space. The histogram Go thew sora thet fait Temty very wel Setting b~ 10 cans the propos to on be fat nthe che etic ls for i val he and that ry.) = 1. Now ple) isthe probability of jumping fom 24 This reqltes two things: (3) he propasaldistrivtion must generate, a i) you rst accep. This 24.5 MCMC Part Il: Different Flavors There are diffrent types of MCMC algorithm, Here we will consider a few of red drawing a proposal Y af the form v=X, were ¢; comes fot some distribution with density 9. kn other wo so j { tou) min This is calls « random-walk-Metropolis- Hastings wethod. The do the ancept reject step, we would b for gts N08). Th Ure of tha nes X 1 ‘Warning! his wthod doesu't mks values on the whole rea line, IF is restricted t al then itis transfor X. For example, if X € (0,oc) then you might take ¥ = log. and then Simulate the distribution for Y instead of X IsprbesneNce-Mettorouss-HaStiNGs. This an importane-sanupin version of MCMC, We draw the proposal from a xed distri Gen ally, 9 chonen to be an approximation to f- The aoceptance probability becomes ; 1 eu) =min ff } ines SaMPUNG. ‘The two previons methods ean he easly adapted, i principle, ta work in higher dimensions lu practice, tuning te chains to mak the mix well is hard, ibs sampling isa wo to turn a hish-dinenson problem ito several one-dimensional problems, Here's how i works for bivariate problem, Suppose that (NY) has den ity Jx.y(2.) Fist, suppose tat itis possible to simulate frm the cond tional stitutions fy five. Let (No, Yo) be starting values sve drawn (Xas¥% Xe Ya)e Than the Gibbs sapling al itn for geting (X 24.11 Example (Normal Hierarchical Model). Gibbs sampling is very ef f todels calle hlerarchleal models. Here is a simple ease draw a sample of cr eity we drawn people nr many people ¥; have a disease. Thus, ¥;~ Binommil(nyp,). We wing for different disease tutes in diferent cites, We We are inteested in esti ef Recall ths with g. We shall treat 0; as own, Furthermore, west take the distribution Niwa? Zils ~ Nene} As yet another simplfiation we take 7 = 1, ‘The unknown parameter are 8 = (iu ¥iyes vst). The likelihood funetion is ce) x Tsosin [] zt) = Too {-p0.- 0} 0 {35 as randonn draws from somne distribution F, We can write this model It we use the prior f(p) 21 then the posterior is proportional to the Hel the other variables. We eas thre aseny any terias ~ N(be 1h). Nest we will find fret). Again, w Loa an {si - 7} 1 most recently drawn version of each We generated « numerical example wit 1d n= 20 people ac back i rom each city. After runing, the chain, 24.6 Bibliographic Remarks 4 Hex ‘hol called accept-reject sampling for drawing observa 1 Keep rep ating unt u finally get an observation 24.7 Exercises ‘Shox the distribution of ¥ is f t= | Sard frente No tion, Dram hitogran ofthe sane to vei a) Estimate F usin b Monte Carlo method. Use N= 100,000. hae the snenple Repoac Lia Estimate J nsing importance sampling. 
Take y to be N(L5, with ~ vy ’ vem) ther ae aay extreme re YE wa 2(4)- 2+ oy Let ¥~ (0,1) and XIV = y~ Ngo + 92) Use the method in the tat 5a) o 1, Use the Gibbs: Metropolis alg Plot histograms of the posterions for the 3's Get the posterior menn Bibliography AGRESTI, A. (1008), Cateqor Axaine, H. (1973). Information theory and Tiel TW. (1084), An Fotendaction¢ Wiley Barekon, A. SCHEAVISH, M. J. and WASSERMAN, L, (1900), The « rior distributions in nonparametric problems. The si 5 Bexcuen, H, (1959). Measurement of Subjective Responses. Oxford Univer Bexsanna, Y. and Hoctnena, ¥. (1995). Controlling the fal ate: A practical and powerful approach to multiple testin BeRAN, R. (2000), REACT seatterplot smothers: Supeefficency ugh wis economy. Journal ofthe Amerionn Statistical Assoriation 95 155-17 Beran, Rand Diiwncen, L. (1908). Modulation of estimators andl eon Bokoen, J. and Wourerr, R. (1084). 7h of Masher Lik Bencen, J. ©. (1986), Statistical jon Analyse Second Bukiton). Springer-Vers Bence, J. O. and DeLAMPADY, M. (1987). Testing precise hypotheses (¢/ 335-152). Statistical Seience 2 317-385 problem, The Annals of Statistics AL SL4-826 Teas ana Selected Topics, Vol. 1 (Seeond Edition). Prentice Hall Bruuwestey, P. (1979). Prob y and Measure. Wiley Biswor, Y. M. M, Frisnen, S. B, and HOLLAND, P. W i Practice. MIT Press Multivariate Analyses: The BREIAN, L. (1962). Probiity. Society for Industral and Applied Mathe BranEGan, C. S, (2963). Mark Twain and the Quintus Curtins Snodgrass letters: A statistical test of authorship. Journal of the American Statist Anta, B. P. and Lous, T. A. (1906). Bayes ond Empirical Bayes Methods AseLLA, G. and Bente, RL. (2002), Statistica! Inference. Duxbury Pre uavoiin, Peand MARRON, J. in curves, Journal of the An al Association 94 SOT 823, Cox, D. and Lewis, P. ( An cb of Bren (Chapman & Hall Cox, D. D. (1998). An analysis of Bayesian inference for nonparamettic regresion, The Annals ofS = 21 903-923, and HiskLEy, D. V, (2000), Theoretiat cox, D. Hal Davisox, A. C. and HiNKtzy, D, V. (1007). Bootstrap Methods and The ion. Cambridge University Pres DeGnoor, M. and Scents, M. (2002). Probiiity and Statistics (Thind n), Addison-Wesley Devnove, Lin GvEne, L. and LUGOSL, G. (1996). A Probabilistic Theory Pa Ver a Recognition, Spin Diaconis, P, aad FRRebMAN, D. (1986). On inconsistent Bayes estimates f location, The Anna Dowson, A. J. (2001). An intraduction to generalised linear models, Chap Dovouo, D. Land Jonnston, LM. (1984). Meal spatial adaptation b Jet shrinkage. Biometrika BL 425-415, Dowono, D. L. ad Jonsrosr, T.-M. (1995). Adapting to unk Dowouo, D. L. aud JOHNSTONE, I. M. (1098). Minima estimation 21 wovelet shrinkage, The Annals of Statistics 26 879 DoNoM, D. La, JOHNSTONE, I, My, KERKYACHARIAN, G, and Pteanb, D. (1995). Roya st 4 shrinkage: Asymptopia? (Disc: p DuNsMone, L, Daby, FEF AL. (INST). M545 Staisioal Methods, Unit 9 Categorical Data. The Open University. Epwanps, B. (1905). Int Ernomovien, S. (1909). Now and Applications. Sp Eraon, B. (1979) Annals of Statistics 7 1-26 steal Association 96 1151-1160 EpRON, B. aud Traswawant, R. J. (1908). A fi Frisia, R. (1921). On the pi duced from a small sample. Metron 4 1-32 fining and Knowledge Discovery 1 Geinan, A. Camus, J.B, Stites, HS. and Remy, D, B, (1995 Guosat, $., Gwost, J. K. and Vay Den Vaarer, A. W, (2000). Conver R., RICHARDSON, S. and Spmeceuaauren, D. J. (19 Monte Carlo in Practice. Chapanan fe Hal Gamer, G. and StizaKeit, D. (1982). Probability aud Random Pro Hall aus, P. (1902). 
Th 1p and Edgeworth Brpansion. Spinger-Verl avvensc Paykt, C., Kovac, J, CaRisruont, J How 1, Carerwnscitr, J, MASON. Hh, PADIN S.Pi M. and READHEAD, A. (2002). DASE fist sul stn nictonve background angular power Hanoi, W., KeRkvACHaRian, G., PloanD, D-and TSvBAKOV. A. (198) Hasrie, 1. Tiasimant, R. and PRIEDMAN, J.-H. 2001), The Blements Learning: Date Mining, Inference, and Prediction. Spr Henamcn, R. (2002), 4 MIT Pre Jouxsoy, RA. nd Wicues, D. W. (1982). Applied Maltiog Aaalyss, Prentice-Hall a7 1122-112 ny Preparation. awn, A. (1). Ps ity, Springer-Verlag. Kass, REL and RAPTERY, ALE, (15), Bayes facons, Jonrnal Kass, RE, and WASSERMAN, L, (1996). The selection of prior distribution by formal res (corr: 198 v99 p 412). Journal of the American Statistic ians (Second Eaition). Prentice Hall Lae, A. Tor at. (2001). A high onmic macrowave background anisotropy data, Ast Lie, P.M, (1997). Bayesian Statistics: Aw h re. Edward Armokl Lamyaant, B. Land CASei.a, G. (1908). Theory of Point Bit Maftnon, J. S. and Wap, M. P ct ment fategoated! squared Annals of Statistics 20 A.. BLACK, M., Lowe, C., MAcMaNON, B. atid YESA, S. RoUSSEAUN, Ji, DU PLESSIS, J.. BENADE, A., JORDAN, P. Korar, J ne international differences in hit survival in breast Jooste, P. tl FERREIRA, J (1983). Coronary risk factor screening in terational Journal of Cancer 11 261-267 Ute rua evn 10 Med 164 130-2 Nevrnmten, ©. Ber at. (2002), A measurement hy boomerang of mul Scwesvtst, M. 1. (1996). Theory of Stites. Springer-Ver ie pks in the anlar oer ete ft te eowave bark ee ere eee ee eee eee onl. Astrophys J. BTL 604-614, Veetor Machines, Regularization, Opti nud Beyond. MUP Pres The Annals of Prari, J. 200), € rmadets, reasoning, and inferener. Cambridge Scort, B., Gorro, A. CoLt, J. and Gonky, G. (1978). Plasma lipids as University Pres lateral risk factors in coronary artery disease: study of 371 males with Prtaavs, D. and Kise, E, (1988). De a hotklay: Mortality eee rounding al occasions. Lanoet 2 728-732 Scort, DW. (1992). Multivariate Density Esti ory, Procte Viswatisation. Wi Patuirs, D. and Ssttri, D. (1980). Postponement of death nt ically meaningful occasions. 1e American Medica! Assocation Siia0, J. and ‘Te, BD. (1085). The Jeckinife and Bootstrap (Germa 268 1917-196 Springer Ver Quexornmie, ME. wate tests of correlation in tine series. Suen, X. and WASSERMAN, L (2001). Rates of convergence of posteron Tourval of the Boal Satstira Society BAL 18-84 distributions. The Annals of Stata M. Rick, JA. Mathematical Statistics and Data Analysis (Second Ea Swonsck, G. Re and Weuusne, J. A, (1986). Empirical Processes W » Statistis, Wiley on). Duxbury Pr SuuveRstas, B, W. (1986). Density Estimation for Statistics en Rowen, C. P, (199 Chapman & Hall fon. Springer Very Springer-Verlag ; Travton HM. and Kantan, $. (1004). 4 “ st Ronins, J. SemEnses, R., Seuvtes, P. and WASSERMAN, L, (2018). Ua 1 Atadensie Press convergence i causal inference, Biometrika (to appear YAN DER LAAN, M. and Rosins, J. (2008). U Censore J, Mand Revov, Y. (1007). Toward a cure of dimensionality ap- Longitudinal Da y. Springer Vering (CODA ic theoty for semiparametric models, Statistics Medicine 16 van pent Waar, A. W, (1998) ie Statistics. Cambridge Universit Pres Rosesnavat P. (2002). Observational Shuies, Springer-Verly " aN A.W, ond WELLER, J. A. Ross, S. (2002). Pr pe, Academie Pres Vapnik. VN. (Hf ning Theory. Wiley List of Symbols Wenssena, S. (1985 n. Wile General Symbols WirraKer, 3. 
General Symbols

ℝ                        the real numbers
x_n = o(a_n)             x_n / a_n → 0
x_n = O(a_n)             |x_n / a_n| is bounded for large n
I_A(x)                   indicator function: 1 if x ∈ A and 0 otherwise

Probability Symbols

P(A)                     probability of event A
A ⫫ B                    A and B are independent
A ⫫̸ B                    A and B are dependent
F(x) = P(X ≤ x)          cumulative distribution function
f(x)                     probability density (or mass) function
X ~ F                    X has distribution F
X ~ f                    X has density f
X ≐ Y                    X and Y have the same distribution
iid                      independent and identically distributed
X_1, …, X_n ~ F          iid sample of size n from F
φ                        standard Normal probability density
Φ                        standard Normal distribution function
z_α                      upper α quantile of N(0,1): z_α = Φ^{-1}(1 − α)
E(X) = ∫ x dF(x)         expected value (mean) of random variable X
E(r(X)) = ∫ r(x) dF(x)   expected value (mean) of r(X)
V(X)                     variance of random variable X
Cov(X, Y)                covariance between X and Y
X_1, …, X_n              data

Convergence Symbols

→^P                      convergence in probability
⇝                        convergence in distribution
X_n ≈ N(μ, σ_n²)         (X_n − μ)/σ_n ⇝ N(0,1)
X_n = o_P(a_n)           X_n / a_n →^P 0
X_n = O_P(a_n)           |X_n / a_n| is bounded in probability for large n

Statistical Models

θ̂                        estimate of parameter θ
L_n(θ)                   likelihood function

Useful Math Facts

Γ(a) = ∫_0^∞ y^{a−1} e^{−y} dy for a > 0. If a > 1 then Γ(a) = (a − 1)Γ(a − 1).
If n is a positive integer then Γ(n) = (n − 1)!. Some special values are
Γ(1) = 1 and Γ(1/2) = √π.
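The Gamma-function facts above are easy to verify numerically; the following is a sanity check only, using Python's standard library.

    import math

    # Gamma(n) = (n - 1)! for positive integers n
    assert math.isclose(math.gamma(5), math.factorial(4))
    # Gamma(1/2) = sqrt(pi)
    assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))
    # Recurrence: Gamma(a) = (a - 1) * Gamma(a - 1) for a > 1
    a = 3.7
    assert math.isclose(math.gamma(a), (a - 1) * math.gamma(a - 1))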
