All of Statistics Final
(Recall: if Z_1, ..., Z_n are independent standard Normal random variables, then sum_j Z_j^2 has a chi-squared distribution with n degrees of freedom.)

2.21 Example. Let (X, Y) be uniform on the unit square. Find P(X < 1/2, Y < 1/2). The event {X < 1/2, Y < 1/2} corresponds to a subset of the unit square. Integrating the (constant) density over this subset amounts to computing the area of the set, which is 1/4. Hence, P(X < 1/2, Y < 1/2) = 1/4.

2.22 Example. If the distribution is defined over a non-rectangular region, the calculations are a bit more complicated. Here is an example. Let (X, Y) have density

f(x, y) = c x^2 y   if x^2 <= y <= 1,   and 0 otherwise.

Note first that -1 <= x <= 1. Now let us find the value of c. The trick here is to integrate first over y and then over x:

1 = c ∫_{-1}^{1} ∫_{x^2}^{1} x^2 y dy dx = c ∫_{-1}^{1} x^2 (1 - x^4)/2 dx = 4c/21.

Hence, c = 21/4. Now let us compute P(X >= Y). This corresponds to the set A = {(x, y) : 0 <= x <= 1, x^2 <= y <= x}. (You can see this by drawing a diagram.) So,

P(X >= Y) = (21/4) ∫_0^1 ∫_{x^2}^{x} x^2 y dy dx = (21/8) ∫_0^1 x^2 (x^2 - x^4) dx = 3/20.
2.6 Marginal Distributions
The marginal densities are obtained by integrating out the other variable:

f_X(x) = ∫ f(x, y) dy   and   f_Y(y) = ∫ f(x, y) dx.

2.24 Example. For a discrete joint mass function, the marginals are obtained by summing; for instance, a joint mass function may give f_X(0) = f_X(1) = 1/2.

2.7 Independent Random Variables

2.25 Definition. Two random variables X and Y are independent if, for every pair of sets A and B,
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B),
and we write X ⊥ Y. Otherwise we say that X and Y are dependent.

In principle, to check whether X and Y are independent we need to check the equation above for all subsets A and B. Fortunately, we have the following result.

2.30 Theorem. Let X and Y have joint pdf f(x, y). Then X ⊥ Y if and only if f(x, y) = f_X(x) f_Y(y) for all values x and y. (The statement is not rigorous because the density is defined only up to sets of measure 0.)

2.32 Example. Suppose that X and Y are independent and both have the same density
f(x) = 2x if 0 <= x <= 1, and 0 otherwise.
Let us find P(X + Y <= 1). Using independence, the joint density is
f(x, y) = f_X(x) f_Y(y) = 4xy if 0 <= x <= 1, 0 <= y <= 1, and 0 otherwise.
Now,
P(X + Y <= 1) = 4 ∫_0^1 x [ ∫_0^{1-x} y dy ] dx = 2 ∫_0^1 x (1 - x)^2 dx = 1/6.

Note that the marginal is recovered by integration: ∫_0^1 4xy dy = 2x = f_X(x).
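The computation in Example 2.32 can be checked by simulation. This is a minimal sketch, assuming X and Y each have density f(x) = 2x on (0, 1), so that by the inverse-CDF method X = sqrt(U) with U ~ Uniform(0, 1):

```python
import math
import random

def prob_sum_le_one(n_sim=200_000, seed=0):
    """Monte Carlo estimate of P(X + Y <= 1) for X, Y iid with density 2x on (0,1).

    F(x) = x^2, so F^{-1}(u) = sqrt(u) gives draws from f by inversion.
    """
    rng = random.Random(seed)
    hits = sum(
        1
        for _ in range(n_sim)
        if math.sqrt(rng.random()) + math.sqrt(rng.random()) <= 1
    )
    return hits / n_sim

est = prob_sum_le_one()  # should be close to 1/6
```

The estimate agrees with the exact answer 1/6 up to Monte Carlo error of order 1/sqrt(n_sim).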
2.33 Theorem. Suppose that the range of X and Y is a (possibly infinite) rectangle. If f(x, y) = g(x) h(y) for some functions g and h (not necessarily probability density functions), then X ⊥ Y. This result is useful for verifying independence.

2.34 Example. Let X and Y have a density that factors as f(x, y) = g(x) h(y) over the rectangle (0, 2) x (0, 2). We can write f in product form, so X ⊥ Y.

2.8 Conditional Distributions

2.35 Definition. The conditional probability mass function is
f_{X|Y}(x|y) = P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y) = f(x, y) / f_Y(y)
if f_Y(y) > 0.

For continuous distributions we use the same definitions. The interpretation differs: in the discrete case, f_{X|Y}(x|y) is P(X = x | Y = y), but in the continuous case we must integrate to get a probability.

2.36 Definition. For continuous random variables, the conditional probability density function is
f_{X|Y}(x|y) = f(x, y) / f_Y(y),
assuming that f_Y(y) > 0. Then
P(X ∈ A | Y = y) = ∫_A f_{X|Y}(x|y) dx.

2.37 Example. Let X and Y have a joint uniform distribution on the unit square. Then f_{X|Y}(x|y) = 1 for 0 <= x <= 1 and 0 otherwise. From the definition of the conditional density, the conditional distribution of X given Y = y is again Uniform(0, 1). We can write this as X ⊥ Y.

2.39 Example. Let us find P(X < 1/4 | Y = 1/3). First note that to define the conditional probability we compute f_{X|Y}(x|y) = f(x, y)/f_Y(y) and then integrate over (0, 1/4).

2.40 Example. Consider the density in Example 2.22. Let us find f_{Y|X}(y|x). When we computed the marginal earlier we found f_X(x) = (21/8) x^2 (1 - x^4). Hence, for x^2 <= y <= 1,
f_{Y|X}(y|x) = f(x, y) / f_X(x) = ((21/4) x^2 y) / ((21/8) x^2 (1 - x^4)) = 2y / (1 - x^4).
2.9 Multivariate Distributions and IID Samples
Let X = (X_1, ..., X_n) where X_1, ..., X_n are random variables. We call X a random vector. Let f(x_1, ..., x_n) denote its pdf. It is possible to define marginals, conditionals, etc. much the same way as in the bivariate case. We say that X_1, ..., X_n are independent if, for every A_1, ..., A_n,
P(X_1 ∈ A_1, ..., X_n ∈ A_n) = Π_{i=1}^n P(X_i ∈ A_i).
It suffices to check that f(x_1, ..., x_n) = Π_{i=1}^n f_{X_i}(x_i).

If X_1, ..., X_n are independent and each has the same marginal distribution with cdf F, we say that X_1, ..., X_n are IID (independent and identically distributed) and we write
X_1, ..., X_n ~ F.
We also call X_1, ..., X_n a random sample of size n from F.
2.10 Two Important Multivariate Distributions
Multinomial. Consider drawing a ball from an urn which has balls with k different colors labeled "color 1, color 2, ..., color k." Let p = (p_1, ..., p_k) where p_j >= 0, sum_j p_j = 1, and p_j is the probability of drawing a ball of color j. Draw n times (independent draws with replacement) and let X = (X_1, ..., X_k) where X_j is the number of times that color j appears. Hence, n = sum_{j=1}^k X_j. We say that X has a Multinomial(n, p) distribution, written X ~ Multinomial(n, p).

2.42 Lemma. Suppose that X ~ Multinomial(n, p) where X = (X_1, ..., X_k) and p = (p_1, ..., p_k). The marginal distribution of X_j is Binomial(n, p_j).

Hint 1: You may use the following fact: if X ~ Poisson(λ) and Y ~ Poisson(μ), and X and Y are independent, then X + Y ~ Poisson(λ + μ).
Hint 2: Note that {X = x, X + Y = n} = {X = x, Y = n - x}.
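The urn description above translates directly into code. This is a minimal sketch of a multinomial draw by repeated color draws; it also checks Lemma 2.42 empirically (the color probabilities and sample sizes here are illustrative, not from the text):

```python
import random

def multinomial_draw(n, p, rng):
    """One draw X = (X_1, ..., X_k) ~ Multinomial(n, p) via n independent color draws."""
    counts = [0] * len(p)
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for j, pj in enumerate(p):
            acc += pj
            if u <= acc:
                counts[j] += 1
                break
        else:
            # Guard against floating-point round-off in the cumulative sum.
            counts[-1] += 1
    return counts

rng = random.Random(1)
p = [0.2, 0.3, 0.5]
n = 10
draws = [multinomial_draw(n, p, rng) for _ in range(20_000)]
# By Lemma 2.42, the marginal of X_1 is Binomial(n, p_1), so E(X_1) = n p_1 = 2.
mean_x1 = sum(d[0] for d in draws) / len(draws)
```

Each draw sums to n by construction, and the empirical mean of the first coordinate matches the binomial marginal.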
Exercises.
Find P(X < 3 | Y = ...).
18. Let X ~ N(...). Solve the following using a Normal table and using a computer: Find P(X > -2). Find x such that P(X > x) = ...
20. Let X, Y ~ Uniform(0, 1) be independent. Find the PDF for X - Y.
21. Let X_1, ..., X_n ~ Exp(β) be IID. Let Y = max{X_1, ..., X_n}. Find ...

Expectation

3.1 Expectation of a Random Variable

The expectation of a random variable X is the average value of X.

3.1 Definition. The expected value, or mean, or first moment, of X is defined to be
E(X) = ∫ x dF(x),
assuming the integral is well defined.

A confidence interval of the form ± sqrt((1/(2n)) log(2/α)) trades off width against coverage: the more confidence we require, the larger the confidence interval must be.
When H is infinite, the finite-class bound no longer applies. To extend our analysis to these cases we want to be able to bound
P( sup_{h ∈ H} |L̂_n(h) - L(h)| > ε ) <= something not too big.
One way to develop such a generalization is by way of the Vapnik-Chervonenkis, or VC, dimension.

22.20 Definition. The VC dimension of a class of sets A, written VC(A), is the size of the largest finite set F that can be shattered by A, meaning that A picks out each subset of F. If H is a set of classifiers, we define VC(H) = VC(A) where A is the class of sets of the form {x : h(x) = 1} as h varies in H.

The following theorem shows that if A has finite VC-dimension, then the shatter coefficients grow as a polynomial in n.

22.21 Theorem. If A has finite VC dimension v, then s(A, n) <= n^v + 1.
22.22 Example. Let A = {(-∞, a] : a ∈ R}. Then A shatters every 1-point set {x} but it shatters no set of the form {x, y}. Therefore, VC(A) = 1.

22.23 Example. Let A be the set of closed intervals on the real line. Then A shatters every 2-point set, but it cannot shatter sets with 3 points. Consider S = {x, y, z} with x < y < z: one cannot find an interval A such that A ∩ S = {x, z}. So VC(A) = 2.

22.24 Example. Let A be all linear half-spaces on the plane. Any 3 points (not all on a line) can be shattered. No 4 points can be shattered. Consider, for example, 4 points forming a diamond. Let T be the leftmost and rightmost points; no half-space picks out exactly T. Other configurations are also unshatterable. So VC(A) = 3. In general, half-spaces in R^d have VC dimension d + 1.

22.25 Example. Let A be all rectangles on the plane with sides parallel to the axes. Any 4-point set with no point that is simultaneously interior (not leftmost, rightmost, uppermost, or lowermost) can be shattered. Given 5 points, let T consist of the leftmost, rightmost, uppermost, and lowermost points; any rectangle containing T must also contain the remaining point. Hence VC(A) = 4.

22.26 Theorem. Let x have dimension d and let H be the set of linear classifiers. The VC-dimension of H is d + 1. Hence, a 1 - α confidence interval for the true error rate can be obtained from the VC bound.
Support Vector Machine
First suppose that the data are linearly separable, that is, there exists a hyperplane that perfectly separates the two outcome classes.

22.27 Lemma. The data can be separated by some hyperplane if and only if there exists a hyperplane H(x) = a_0 + sum_j a_j x_j such that Y_i H(X_i) >= 1 for all i.

Proof. Suppose the data can be separated by a hyperplane W(x) = b_0 + sum_j b_j x_j. It follows that there exists some constant c such that Y_i = 1 implies W(X_i) >= c and Y_i = -1 implies W(X_i) <= -c. Therefore, Y_i W(X_i) >= c for all i. Let H(x) = a_0 + sum_j a_j x_j where a_j = b_j / c. Then Y_i H(X_i) >= 1 for all i. The reverse direction is straightforward.

In the separable case, there will be many separating hyperplanes. How should we choose one? Intuitively, it seems reasonable to choose the hyperplane "furthest" from the data in the sense that it separates the +1s and -1s and maximizes the distance to the closest point. This hyperplane is called the maximum margin hyperplane. The margin is the distance from the hyperplane to the nearest point. Points on the boundary of the margin are called support vectors. See the figure.

22.28 Theorem. The maximum margin hyperplane is the hyperplane that separates the data and maximizes the margin.

It turns out that this problem can be recast as a quadratic optimization problem. Let <X_i, X_j> = X_i^T X_j denote the inner product of X_i and X_j.

22.29 Theorem. The coefficients are found by maximizing
sum_i α_i - (1/2) sum_i sum_j α_i α_j Y_i Y_j <X_i, X_j>     (22.40)
subject to α_i >= 0 and sum_i α_i Y_i = 0.

There are many software packages that will solve this problem quickly. If the data are not linearly separable, the optimization is modified by introducing variables ξ_i that allow some points to fall on the wrong side of the margin. The variables ξ_i are called slack variables. We now maximize (22.40) subject to
0 <= α_i <= c   and   sum_i α_i Y_i = 0.
The constant c is a tuning parameter that controls the amount of overlap.
22.10 Kernelization

There is a trick called kernelization for improving a computationally simple classifier h. The idea is to map the covariate X (which takes values in the original space) into a higher-dimensional space Z and apply the classifier in the bigger space. This can yield a more flexible classifier while retaining computational simplicity.

Thus, a map φ takes the original covariate space into the higher-dimensional space Z. In the higher-dimensional space Z, the Y_i's are separable by a linear decision boundary. In other words, a linear classifier in a higher-dimensional space corresponds to a non-linear classifier in the original space.

The point is that to get a richer set of classifiers we do not need to give up the convenience of linear classifiers. We simply map the covariates to a higher-dimensional space. This is akin to making linear regression more flexible by adding polynomial terms.
There is a potential drawback. If we significantly expand the dimension of the problem, we might increase the computational burden. For example, if x has dimension d = 256 and we wanted to use all fourth-order terms, then z = φ(x) has dimension 183,181,376. We are spared this computational nightmare by two facts. First, many classifiers, including the support vector machine, require only the inner products between pairs of points, not the coordinates of the points themselves; the maximizing vector is a linear combination of the Z_i's, so the solution can be written in terms of the data. Second, the inner product in Z can often be computed directly in the original space: there is a kernel K such that
<φ(x), φ(x~)> = K(x, x~).
In the quadratic example, K(x, x~) = (x^T x~)^2.

The support vector machine can similarly be kernelized. We simply replace <X_i, X_j> with K(X_i, X_j). For example, instead of maximizing (22.40), we maximize
sum_i α_i - (1/2) sum_i sum_j α_i α_j Y_i Y_j K(X_i, X_j).
The resulting classifier is ĥ(x) = sign(Ĥ(x)) where Ĥ(x) = â_0 + sum_i α̂_i Y_i K(X_i, x).
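The identity <φ(x), φ(x~)> = K(x, x~) can be verified directly for the degree-2 polynomial kernel. This is a small illustrative check (the particular feature map φ below is one standard choice for 2-D inputs, used here as an assumption):

```python
import math

def phi(x):
    """Explicit feature map for 2-D x with <phi(x), phi(y)> = (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

def K(x, y):
    """Degree-2 polynomial kernel (no offset) computed in the original space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in Z (3 coordinates)
rhs = K(x, y)                                     # kernel in the original space (2 coordinates)
```

The two quantities agree exactly, which is why the high-dimensional coordinates of φ(x) never need to be computed.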
22.11 Other Classifiers

There are many other classifiers and space precludes a full discussion of all of them. Let us briefly mention two.

The k-nearest-neighbors classifier is very simple. Given a point x, find the k data points closest to x. Classify x using the majority vote of these k neighbors. Ties can be broken randomly. The parameter k can be chosen by cross-validation.

Bagging is a method for reducing the variability of a classifier. It is most helpful for highly nonlinear classifiers such as trees. We draw B bootstrap samples from the data; the b-th bootstrap sample yields a classifier h_b, and the final classifier is the majority vote of h_1, ..., h_B.
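The k-nearest-neighbors rule described above fits in a few lines. This is a minimal sketch with Euclidean distance and a toy training set (the data points are illustrative, not from the text):

```python
import math
from collections import Counter

def knn_classify(x, data, k=3):
    """Classify x by majority vote among its k nearest training points.

    data is a list of ((x1, x2), y) pairs; distance is Euclidean.
    """
    neighbors = sorted(data, key=lambda d: math.dist(x, d[0]))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0),
         ((5, 5), 1), ((5, 6), 1), ((6, 5), 1)]
label = knn_classify((0.5, 0.5), train, k=3)  # a point near the Y = 0 cluster
```

In practice k would be chosen by cross-validation, as the text notes.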
Step 1: Draw X_0 ~ μ_0. Thus, P(X_0 = i) = μ_0(i).
Step 2: Denote the outcome of step 1 by i. Draw X_1 ~ P. In other words, P(X_1 = j | X_0 = i) = p_{ij}.
Step 3: Suppose the outcome of step 2 is j. Draw X_2 ~ P. In other words, P(X_2 = k | X_1 = j) = p_{jk}.
And so on.

It might be difficult to understand the meaning of μ_n. Imagine simulating the chain many times. Collect all the outcomes at time n from all the simulations. This histogram would look approximately like μ_n.

23.12 Theorem. The communication relation satisfies the properties of an equivalence relation. A consequence is that the set of states can be written as a disjoint union of classes. If all states communicate with each other, then the chain is called irreducible. A set of states is closed if, once you enter that set of states, you never leave. A closed set consisting of a single state is called an absorbing state.

Suppose a transient state i is returned to with probability a < 1. Each time the chain is in state i, there is probability 1 - a > 0 that it will never return to state i. Thus the probability that the chain is in state i exactly n times is a^{n-1}(1 - a). This is a geometric distribution, which has finite mean.
23.13 Example. Let X = {1, 2, 3, 4} with a transition matrix whose classes are {1, 2}, {3} and {4}. State 4 is an absorbing state.

23.14 Definition. State i is recurrent (or persistent) if
P(X_n = i for some n >= 1 | X_0 = i) = 1.
Otherwise, state i is transient.

23.15 Theorem. State i is recurrent if and only if sum_n p_{ii}(n) = ∞, and transient if and only if sum_n p_{ii}(n) < ∞.

23.16 Theorem. Facts about recurrence: in particular, a finite Markov chain must have at least one recurrent state.

23.17 Theorem (Decomposition Theorem). The state space X can be written as the disjoint union X = X_T ∪ X_1 ∪ X_2 ∪ ... where X_T are the transient states and each X_i is a closed, irreducible set of recurrent states.

23.18 Example (Random Walk). Let X = {..., -2, -1, 0, 1, 2, ...} and suppose that p_{i,i+1} = p and p_{i,i-1} = 1 - p. All states communicate, hence either all the states are recurrent or all are transient. To see which, suppose we start at X_0 = 0. Note that X_{2n} = 0 only if the walk takes n steps to the right and n
steps to the left. Hence,
p_{00}(2n) = C(2n, n) p^n (1 - p)^n.
We can approximate this expression using Stirling's formula, which says that
n! ~ n^n sqrt(n) e^{-n} sqrt(2π).
Inserting this approximation into p_{00}(2n) shows that
p_{00}(2n) ~ (4 p (1 - p))^n / sqrt(n π).
It is easy to check that sum_n p_{00}(n) < ∞ if and only if sum_n p_{00}(2n) < ∞. Moreover, sum_n p_{00}(2n) = ∞ if and only if p = 1/2. By Theorem 23.15, the chain is recurrent if p = 1/2, otherwise it is transient.

Convergence of Markov Chains. To discuss the convergence of chains, we need a few more definitions. Suppose that X_0 = i. Define the recurrence time
T_i = min{n > 0 : X_n = i},
assuming X_n ever returns to state i; otherwise define T_i = ∞. The mean recurrence time of a recurrent state i is
m_i = E(T_i) = sum_n n f_{ii}(n)
where
f_{ii}(n) = P(X_1 ≠ i, ..., X_{n-1} ≠ i, X_n = i | X_0 = i).
A recurrent state is null if m_i = ∞; otherwise it is called non-null or positive.
23.19 Lemma. If state i is null and recurrent, then p_{ii}(n) → 0 as n → ∞.

23.20 Lemma. In a finite state Markov chain, all recurrent states are positive.

Consider a three-state chain with transition matrix
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]].
Suppose we start the chain in state 1. Then we will return to state 1 at times 3, 6, 9, .... This is an example of a periodic chain. Formally, the period of state i is d if p_{ii}(n) = 0 whenever n is not divisible by d, and d is the largest integer with this property. Thus, d(i) = gcd{n : p_{ii}(n) > 0} where gcd means "greatest common divisor." State i is periodic if d(i) > 1 and aperiodic if d(i) = 1. A state with period 1 is called aperiodic.

23.21 Lemma. If state i has period d and i communicates with j, then j also has period d.

23.22 Definition. A state is ergodic if it is recurrent, non-null and aperiodic. A chain is ergodic if all of its states are ergodic.
Let π = (π_i : i ∈ X) be a vector of non-negative numbers that sum to one.

23.23 Definition. We say that π is a stationary (or invariant) distribution if π = πP.

Here is the intuition. Draw X_0 from distribution π and suppose that π is a stationary distribution. Now draw X_1 according to the transition probabilities of the chain. The distribution of X_1 is then μ_1 = μ_0 P = πP = π. The distribution of X_2 is πP^2 = (πP)P = πP = π. Continuing this way, we see that the distribution of X_n is πP^n = π. In other words:
If at any time the chain has distribution π, then it will continue to have distribution π forever.

23.24 Definition. We say that a chain has limiting distribution π if p_{ij}(n) → π_j as n → ∞ for all i; that is, the rows of P^n all converge to the vector π.

23.25 Theorem. An irreducible, ergodic Markov chain has a unique stationary distribution π. The limiting distribution exists and is equal to π. Moreover, if g is any bounded function, then, with probability one,
(1/N) sum_{n=1}^N g(X_n) → sum_j g(j) π_j.
Thus, an ergodic chain converges to its stationary distribution, and sample averages converge to averages under π.

Finally, there is one more definition that will be useful later. We say that π satisfies detailed balance if
π_i p_{ij} = π_j p_{ji}.

23.26 Theorem. If π satisfies detailed balance, then π is a stationary distribution. The importance of detailed balance will become clear when we discuss Markov chain Monte Carlo methods in Chapter 24.
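The intuition above (repeatedly applying P drives μ_n toward π for an ergodic chain) can be sketched numerically. The three-state matrix below is an illustrative example, not one from the text:

```python
def stationary(P, iters=500):
    """Approximate the stationary distribution pi = pi P of a finite chain
    by repeatedly applying the transition matrix to a starting distribution."""
    n = len(P)
    pi = [1.0 / n] * n  # arbitrary starting distribution mu_0
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# A small irreducible, aperiodic (hence ergodic) chain.
P = [[0.9, 0.1, 0.0],
     [0.4, 0.4, 0.2],
     [0.1, 0.3, 0.6]]
pi = stationary(P)
```

After enough iterations, pi changes negligibly under one more application of P, which is exactly the fixed-point condition π = πP.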
Warning! Just because a chain has a stationary distribution does not mean it converges to it.

23.27 Example. Let
P = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
and let π = (1/3, 1/3, 1/3). Then πP = π, so π is a stationary distribution. A chain started with the distribution π will stay in that distribution. Imagine simulating many chains and checking the marginal distribution at each time n: it will always be the uniform distribution π. But this chain does not have a limit. It continues to cycle around forever.
Examples of Markov Chains

23.28 Example. Let X = {1, 2, 3, 4, 5, 6} and let P be the given transition matrix.
23.29 Example (Hardy-Weinberg). Here is a famous example from genetics. Suppose a gene can be type A or type a. There are three types of people (called genotypes): AA, Aa, and aa. Let (p, q, r) denote the fractions of people of each genotype. We assume that everyone contributes one of their two copies of the gene, at random, to their children. We also assume that mates are selected at random. (The latter is not realistic in general, but it is reasonable to assume that you do not choose your mate based on whether they are AA, Aa, or aa. This would be false if the gene determined eye color and people chose mates based on eye color.) Imagine that we pooled everyone's genes together. The proportion of A genes is P = p + q/2 and the proportion of a genes is Q = r + q/2. A child is AA with probability P^2, Aa with probability 2PQ, and aa with probability Q^2. Thus, the fraction of A genes in this generation is
P^2 + PQ = (p + q/2)^2 + (p + q/2)(r + q/2).
However, r = 1 - p - q. Substitute this into the above equation and you get P^2 + PQ = P. A similar calculation shows that the fraction of "a" genes is Q. Thus the genotype proportions remain stable after the first generation. The proportions P^2, 2PQ, Q^2 of AA, Aa, aa from the second generation on are called the Hardy-Weinberg law.

Assume everyone has exactly one child. Now fix a person and let X_n be the genotype of their n-th descendant. This is a Markov chain with state space X = {AA, Aa, aa}. Some basic calculations show that the transition matrix (rows indexed by the parent's genotype, the other allele drawn from the gene pool) is
[ P     Q    0   ]
[ P/2   1/2  Q/2 ]
[ 0     P    Q   ]
The stationary distribution is π = (P^2, 2PQ, Q^2).
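The claim that π = (P^2, 2PQ, Q^2) is stationary for the genotype chain can be checked numerically. This is a minimal sketch; the starting genotype fractions (p, q, r) are illustrative:

```python
p, q, r = 0.3, 0.5, 0.2          # genotype fractions of AA, Aa, aa (illustrative)
P_, Q_ = p + q / 2, r + q / 2    # allele fractions of A and a; P_ + Q_ = 1

# Transition matrix over states (AA, Aa, aa): the parent passes one allele at
# random, and the other allele is drawn from the pooled gene frequencies.
T = [[P_,      Q_,  0.0],
     [P_ / 2,  0.5, Q_ / 2],
     [0.0,     P_,  Q_]]

pi = (P_ ** 2, 2 * P_ * Q_, Q_ ** 2)   # Hardy-Weinberg proportions
pi_next = [sum(pi[i] * T[i][j] for i in range(3)) for j in range(3)]
```

Applying T leaves the Hardy-Weinberg vector unchanged, confirming π = πT for this chain.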
23.30 Example (Markov chain Monte Carlo). In Chapter 24 we will present a simulation method called Markov chain Monte Carlo (MCMC). Here is a brief description of the idea. Let f(x) be a probability density on the real line and suppose that f(x) = c g(x), where g(x) > 0 is a known function and c > 0 is unknown. In principle, we can compute c since ∫ f(x) dx = 1 implies c = 1 / ∫ g(x) dx. However, it may not be feasible to perform this integral, nor is it necessary to know c in the following algorithm. Let X_0 be an arbitrary starting value. Given X_0, ..., X_i, draw X_{i+1} as follows. First, draw W ~ N(X_i, b^2) where b > 0 is some fixed constant. Let
r = min{ g(W) / g(X_i), 1 }.
Draw U ~ Uniform(0, 1) and set
X_{i+1} = W if U < r, and X_{i+1} = X_i otherwise.
We will see in Chapter 24 that, under weak conditions, X_0, X_1, ... is an ergodic Markov chain with stationary distribution f. Hence, we can use the chain to generate (correlated) samples from f.
Inference for Markov Chains. Consider a chain with finite state space X = {1, ..., N}. Suppose we observe n observations X_1, ..., X_n from this chain. The unknown parameters of a Markov chain are the initial probabilities μ_0 = (μ_0(1), μ_0(2), ...) and the elements of the transition matrix P. Each row of P is a multinomial distribution, so the MLE of p_{ij} is the observed fraction of transitions out of state i that go to state j.

23.31 Theorem (Consistency and Asymptotic Normality of the MLE). Assume ...
23.3 Poisson Processes

As the name suggests, the Poisson process is intimately related to the Poisson distribution. Let us first recall some facts about the Poisson distribution. Recall that X has a Poisson distribution with parameter λ, written X ~ Poisson(λ), if
P(X = x) = e^{-λ} λ^x / x!,  x = 0, 1, 2, ....
Also recall that E(X) = λ and V(X) = λ. If X ~ Poisson(λ), Y ~ Poisson(ν), and X and Y are independent, then X + Y ~ Poisson(λ + ν). Finally, if N ~ Poisson(λ) and Y | N = n ~ Binomial(n, p), then the marginal distribution of Y is Y ~ Poisson(λp).

Now we describe the Poisson process. Imagine that you are at your computer. Each time a new email message arrives, you record the time. Let X_t be the number of messages you have received up to and including time t. Then {X_t : t ∈ [0, ∞)} is a stochastic process with state space X = {0, 1, 2, ...}. A process of this form is called a counting process. A Poisson process is a counting process that satisfies certain conditions. In what follows, we will sometimes write X(t) instead of X_t. Also, we need the following notation. Write f(h) = o(h) if f(h)/h → 0 as h → 0. This means that f(h) is smaller than h when h is close to 0. For example, h^2 = o(h).

23.32 Definition. A Poisson process is a stochastic process {X(t) : t ∈ [0, ∞)} with state space X = {0, 1, 2, ...} such that X(0) = 0; for any 0 = t_0 < t_1 < ... < t_n, the increments X(t_1) - X(t_0), ..., X(t_n) - X(t_{n-1}) are independent; and there is a function λ(t) > 0, called the intensity function, such that
P(X(t + h) - X(t) = 1) = λ(t) h + o(h)   and   P(X(t + h) - X(t) >= 2) = o(h).
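A homogeneous Poisson process (constant intensity λ) is easy to simulate because the waiting times between events are independent Exponential(λ) random variables, a standard fact assumed here as background. A minimal sketch:

```python
import random

def poisson_process(rate, t_max, rng):
    """Event times of a homogeneous Poisson process with intensity `rate`
    on [0, t_max], built from exponential waiting times between events."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t > t_max:
            return times
        times.append(t)

rng = random.Random(7)
lam, t_max = 2.0, 100.0
counts = [len(poisson_process(lam, t_max, rng)) for _ in range(500)]
mean_count = sum(counts) / len(counts)   # X(t_max) ~ Poisson(lam * t_max), mean 200
```

The simulated count X(t_max) has mean λ t_max, matching E(X) = λ for the Poisson distribution recalled above.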
Commonly used kernels include the Gaussian kernel and the Epanechnikov kernel.

20.12 Definition. Given a kernel K and a positive number h, called the bandwidth, the kernel density estimator is defined to be
f̂(x) = (1/n) sum_{i=1}^n (1/h) K( (x - X_i) / h ).

20.13 Example. See Figure 20.?.

20.14 Theorem. By a similar calculation to the bias, the variance of the estimator contributes a term of order 1/(nh). The result follows from integrating the squared bias plus the variance.

We see that kernel estimators converge at rate n^{-4/5}, while histograms converge at the slower rate n^{-2/3}. It can be shown that, under weak assumptions, no nonparametric density estimator converges faster than n^{-4/5}. The expression for the optimal bandwidth h* depends on the unknown density f, so it cannot be used directly.
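The estimator in Definition 20.12 is a few lines of code. This is a minimal sketch with a Gaussian kernel; the simulated N(0, 1) data and the bandwidth h = 0.3 are illustrative choices, not values from the text:

```python
import math
import random

def kde(x, data, h):
    """Gaussian-kernel density estimate at x with bandwidth h:
    (1/(n h)) * sum_i K((x - X_i) / h)."""
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - xi) / h) for xi in data) / (len(data) * h)

rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(2_000)]
est = kde(0.0, data, h=0.3)   # true N(0,1) density at 0 is about 0.3989
```

With 2,000 observations the estimate at the mode is close to the true density value, with the small discrepancy reflecting the bias-variance trade-off controlled by h.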
In practice the bandwidth is chosen by cross-validation. The cross-validation score is
Ĵ(h) = ∫ f̂^2(x) dx - (2/n) sum_{i=1}^n f̂_{(-i)}(X_i),
where f̂_{(-i)} is the kernel density estimator after omitting the i-th observation.

20.15 Theorem. For any h > 0, E[Ĵ(h)] equals the expected risk (up to a constant that does not depend on h).

The score can be computed without refitting: for large n,
Ĵ(h) ≈ (1/(h n^2)) sum_i sum_j K*((X_i - X_j)/h) + (2/(nh)) K(0),
where K*(x) = K^{(2)}(x) - 2K(x) and K^{(2)}(z) = ∫ K(z - y) K(y) dy. If K is a N(0,1) Gaussian kernel, then K^{(2)}(z) is the N(0, 2) density.

We then choose the bandwidth h_n that minimizes Ĵ(h). A justification for this method is given by the following remarkable theorem due to Stone.

20.16 Theorem (Stone's Theorem). Suppose that f is bounded. Then the risk of the estimator with the cross-validation bandwidth is asymptotically equivalent to the risk of the best possible bandwidth.

20.17 Example. The top right panel of Figure 20.6 is based on cross-validation. These data are rounded, which causes problems for cross-validation. Specifically, it causes the minimizer to be h = 0. To overcome this problem, we added a small amount of random Normal noise to the data. The result is that Ĵ(h) is very smooth with a well-defined minimum.

20.18 Remark. Do not assume that, if the estimator f̂ is wiggly, then cross-validation has let you down.

For regression, the Nadaraya-Watson kernel estimator is
r̂(x) = sum_{i=1}^n w_i(x) Y_i,   where   w_i(x) = K((x - X_i)/h) / sum_{j=1}^n K((x - X_j)/h).
20.19 Example. Figure 20.? shows an estimate of the power spectrum of the temperature fluctuations in the cosmic microwave background (CMB). What you are seeing is the leftover heat from the big bang. If r(x) denotes the true power spectrum, the data take the form Y = r(X) + ε, where ε is a random error with mean 0. The location and size of the peaks carry information about the early universe. Figure 20.8 shows the fit based on cross-validation as well as an undersmoothed and an oversmoothed fit. The cross-validation fit shows the presence of three well-defined peaks, as predicted by the physics of the big bang.
20.21 Theorem. The risk of the Nadaraya-Watson kernel estimator consists of a squared-bias term of order h^4 plus a variance term of order 1/(nh).

To choose the bandwidth we again use cross-validation. However, we first need to estimate σ^2. Suppose that the variance is constant, σ^2(x) = σ^2, and that the X_i are ordered. Since r is smooth, consecutive responses differ mainly by noise, so we can use the averages of the n - 1 squared differences Y_{i+1} - Y_i to estimate σ^2:
σ̂^2 = (1/(2(n - 1))) sum_{i=1}^{n-1} (Y_{i+1} - Y_i)^2.

20.22 Theorem. ...

The CMB data are from BOOMERANG (Netterfield et al. (2002)), Maxima (Lee et al.), and DASI (Halverson et al. (2002)).
Confidence Bands for Kernel Regression

An approximate 1 - α confidence band for r(x) is
r̂(x) ± c ŝe(x),
where ŝe(x) is the estimated standard error of r̂(x) and the constant c, defined in (20.?), accounts for simultaneous coverage over x; it depends on w, the width of the kernel. In case the kernel does not have finite width, we take w to be the effective width, that is, the range over which the kernel is non-negligible. The figure shows a 95 percent confidence envelope for the CMB data. We see that we are highly confident of the existence and position of the first peak. We are more uncertain about the second and third peaks.
The extension to multiple regressors X = (X_1, ..., X_p) is straightforward: as with kernel density estimation, we just replace the kernel with a multivariate kernel. However, the curse of dimensionality applies, and in high dimensions kernel smoothers need enormous sample sizes. A common compromise is the additive model
r(x) = α + sum_{j=1}^p r_j(x_j),
so that Y = α + sum_j r_j(X_j) + ε. The model is not fully general, but each component r_j is a one-dimensional smooth function, so the curse of dimensionality is avoided. Additive models are usually fit by an algorithm called backfitting.

A further remark on the confidence bands above: the band width reflects both the bias and the standard deviation of the estimator, and the bias does not become negligible relative to the standard deviation even for large sample sizes. This means that the confidence interval is really centered on a smoothed version of the true function.
Backfitting.
1. Set α̂ = Ȳ and initialize r̂_1, ..., r̂_p (for example, to 0).
2. For each j = 1, ..., p: compute the partial residuals e_i = Y_i - α̂ - sum_{k ≠ j} r̂_k(X_{ik}) and set r̂_j equal to a one-dimensional smooth of the e_i on the X_{ij}.
3. If converged, STOP. Else, go back to step 2.

20.5 Appendix

Confidence Sets and Bias. The confidence bands we computed are not bands for the true function but rather for the smoothed version of the true function. Getting a confidence set for the true function itself requires dealing explicitly with the bias of the estimator.

20.6 Bibliographic Remarks

20.7 Exercises

1. Let F̂_n be the empirical distribution function; derive ...
4. Prove Lemma 20.?.
5. Prove Theorem 20.3.
6. Consider the regressogram, where Ŷ is the mean of all the Y_i belonging to the same bin. Find the approximate risk of this estimator. From this expression for the risk, find the optimal bandwidth. At what rate does the risk go to zero?
9. Show that, with suitable assumptions on r(x), ...
10. Prove Theorem 20.?.

Smoothing Using Orthogonal Functions
21.1 Orthogonal Functions and L2 Spaces

Let v = (v_1, v_2, v_3) denote a three-dimensional vector, that is, a list of three real numbers, and let V denote the set of such vectors. We can define scalar multiplication a v = (a v_1, a v_2, a v_3) and the sum of vectors v + w = (v_1 + w_1, v_2 + w_2, v_3 + w_3). The norm (or length) of a vector v is defined by
||v|| = sqrt( sum_{j=1}^3 v_j^2 ),
and the inner product between two vectors v and w is <v, w> = sum_j v_j w_j. Two vectors are orthogonal (or perpendicular) if <v, w> = 0.

Let φ_1 = (1, 0, 0), φ_2 = (0, 1, 0), φ_3 = (0, 0, 1). These vectors are said to be an orthonormal basis for V since they have the following properties: (i) they are orthogonal; (ii) they are normalized, meaning each has norm 1; (iii) they form a basis, which means that any v ∈ V can be written as a linear combination of φ_1, φ_2, φ_3:
v = sum_{j=1}^3 β_j φ_j,   where   β_j = <v, φ_j>.
The norm satisfies Parseval's relation, which says that
||v||^2 = sum_j β_j^2.
21.1 Example. An example ofan orthonorial bis for £2(0-1) i the cosine
(444) s). eee (yet -2) aed a fli red
Sosoy we 21.2 Example. Lt
Now we make the leap frome vectors: metions, Basically, we just reph epee
Lalasb ; fore 1 Ae J Woven oe that fa) ets cle t Fla). The coi
{ L j Fi jeosteyr were computed munca &
Wet write tof a. Te er pret betwee
ee ae eee 21.3 Example, ‘The Legendre polyoma on [11] oe dla
] p = 1, 5=0,1,, 21.8)
, It can be shown that these functions are complete and orthogonal and that
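The cosine-basis expansion of Example 21.1 can be computed directly. This is a minimal sketch: the target function f(x) = x(1 - x) and the cutoff J = 20 are illustrative, and the coefficients are approximated by a simple midpoint Riemann sum rather than exact integration:

```python
import math

def cosine_basis(j, x):
    """Orthonormal cosine basis for L2(0,1): phi_0 = 1, phi_j = sqrt(2) cos(j pi x)."""
    return 1.0 if j == 0 else math.sqrt(2) * math.cos(j * math.pi * x)

def approx(f, J, n_grid=2_000):
    """Project f onto phi_0, ..., phi_J; beta_j = <f, phi_j> via a midpoint rule."""
    xs = [(i + 0.5) / n_grid for i in range(n_grid)]
    betas = [sum(f(x) * cosine_basis(j, x) for x in xs) / n_grid
             for j in range(J + 1)]
    return lambda x: sum(b * cosine_basis(j, x) for j, b in enumerate(betas))

f = lambda x: x * (1 - x)
fhat = approx(f, J=20)
err = max(abs(f(x) - fhat(x)) for x in [0.1, 0.3, 0.5, 0.7, 0.9])
```

Because the coefficients of this smooth f decay like 1/j^2, the partial sum with J = 20 already tracks f closely, illustrating the convergence referred to in Example 21.2.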
21.4 Theorem. Each sample coefficient β̂_j is an unbiased estimator of β_j.

To choose the cutoff J, we estimate the risk and minimize the estimate R̂(J). To motivate the construction, note that the plug-in variance term is unbiased; we take the positive part of the squared-coefficient term since the risk cannot be negative.

21.5 Theorem. The risk of the truncated estimator f̂_J = sum_{j=1}^J β̂_j φ_j is
R(J) = sum_{j=1}^J V(β̂_j) + sum_{j>J} β_j^2.

21.6 Theorem. An approximate 1 - α confidence band for f can be based on the approximate Normality of the estimator.

Proof. Here is an outline of the proof. Since the statistic is an average, the central limit theorem tells us that it is approximately Normally distributed.

For regression, consider the model
Y_i = r(i/n) + σ ε_i,   ε_i ~ N(0, 1),
and expand r in the basis, r = sum_j β_j φ_j, with estimates β̂_j = (1/n) sum_i Y_i φ_j(i/n).

21.9 Theorem. The risk R(J) of the estimator r̂_J = sum_{j=1}^J β̂_j φ_j is
R(J) = J σ^2 / n + sum_{j>J} β_j^2.   (21.26)

To use this result we need to estimate σ^2. Let k = n/4 and define
σ̂^2 = (n/k) sum_{j=n-k+1}^{n} β̂_j^2.
To motivate this estimator, recall that if r is smooth, then the true coefficients β_j are close to 0 for large j. So, for large j, β̂_j ≈ N(0, σ^2/n) and thus β̂_j^2 ≈ (σ^2/n) Z_j^2 where Z_j ~ N(0, 1). Therefore E(σ̂^2) ≈ σ^2 and V(σ̂^2) → 0, so we expect σ̂^2 to be a consistent estimator of σ^2. There is nothing special about the choice k = n/4. Any k that increases with n at an appropriate rate will suffice.

We estimate the risk by
R̂(J) = J σ̂^2 / n + sum_{j>J} (β̂_j^2 - σ̂^2/n)_+
and choose Ĵ to minimize R̂(J).
21.10 Example. Figure 21.4 shows the Doppler function
r(x) = sqrt(x(1 - x)) sin( 2.1π / (x + 0.05) )
and n = 2,048 observations generated from the model
Y_i = r(i/n) + ε_i,   ε_i ~ N(0, (.1)^2).
The figure shows the data and the estimated function.

Summary (Orthogonal Series Regression Estimator): compute the β̂_j, estimate σ̂^2 and the risk R̂(J), choose Ĵ to minimize R̂(J), and set
r̂(x) = sum_{j=1}^{Ĵ} β̂_j φ_j(x).   (21.28)

21.12 Example. The figure shows the resulting fit.

21.4 Wavelets

Wavelet bases are built from a father wavelet (or scaling function) and a mother wavelet; the simplest example is the Haar father wavelet.
Then E(Z) <= c sqrt(2 log n) where c is a constant; this bound motivates wavelet thresholding.

21.5 Appendix

The DWT for Haar Wavelets. Let y be the vector of Y_i's (length n) and let J = log_2(n). Create a list D with elements obtained by repeatedly averaging and differencing adjacent pairs of the data.

21.6 Bibliographic Remarks

A more detailed treatment is given by Ogden (1997). The theory of statistical estimation using wavelets has been developed by many authors, especially David Donoho and Iain Johnstone; see Donoho and Johnstone (1994) and Donoho and Johnstone (1995).

21.7 Exercises

1. Prove Parseval's relation, equation (21.?).
...
(a) Fit the curve using the cosine basis method. Plot the function estimate. Consider the glass fragment data from the book's website: let Y be refractive index and let X be aluminum content (the fourth variable), and use nonparametric regression to fit the model Y = f(x) + ε.
(Density Estimation) Let X_1, ..., X_n ~ f for some density f. In this question, we will explore the risk equation (21.?). Simulate n = 100 observations from a N(0, 1) distribution and compute the MSE. Repeat, but add some outliers to the data: to do this, simulate each observation from a N(0, 1) with probability .95 and from a N(0, 10) with probability .05. Repeat the question using the Haar basis.

Classification

22.1 Introduction

The problem of predicting a discrete random variable Y from another random variable X is called classification, supervised learning, discrimination, or pattern recognition. Consider iid data (X_1, Y_1), ..., (X_n, Y_n) where X_i = (X_{i1}, ..., X_{id}) is a d-dimensional vector and Y_i takes values in some finite set Y. A classification rule is a function h : X → Y. When we observe a new X, we predict Y to be h(X).
22.1 Example. Here is an example with fake data. The figure shows 100 data points. The covariate X = (X_1, X_2) is 2-dimensional and the outcome Y ∈ Y = {0, 1}. The Y values are indicated on the plot. Also shown is a linear classification rule represented by the solid line. This is a rule of the form
h(x) = 1 if a + b_1 x_1 + b_2 x_2 > 0, and 0 otherwise.
Everything above the line is classified as a 1 and everything below the line is classified as a 0. In this example, the two groups are perfectly separated by the linear decision boundary.

22.3 Definition. The true error rate of a classifier h is
L(h) = P( h(X) ≠ Y ),
and the empirical error rate or training error rate is
L̂_n(h) = (1/n) sum_{i=1}^n I( h(X_i) ≠ Y_i ).
First we consider the special case where Y = {0, 1}. Let
r(x) = E(Y | X = x) = P(Y = 1 | X = x)
denote the regression function. From Bayes' theorem we have that
r(x) = P(Y = 1 | X = x) = f_1(x) π / ( f_1(x) π + f_0(x) (1 - π) ),
where
f_0(x) = f(x | Y = 0),   f_1(x) = f(x | Y = 1),   and   π = P(Y = 1).

22.2 Example. Recall the Coronary Risk-Factor Study (CORIS) data from Example 13.17. There are 462 males between the ages of 15 and 64 from three rural areas in South Africa. The outcome Y is the presence (Y = 1) or absence (Y = 0) of coronary heart disease and there are 9 covariates: systolic blood pressure, cumulative tobacco (kg), ldl (low density lipoprotein cholesterol), adiposity, famhist (family history of heart disease), typea (type-A behavior), obesity, alcohol (current alcohol consumption), and age. We can compute a linear decision boundary using the LDA method based on two of the covariates, systolic blood pressure and tobacco consumption. The LDA method will be explained shortly. In this example, the groups are hard to tell apart; in fact, 141 of the 462 subjects are misclassified using this classification rule.
At this point, it is worth revisiting the Statistics/Data Mining dictionary: classification corresponds to supervised learning, the data are the training sample, the covariates are the features, a classifier is a hypothesis, and estimation is learning, that is, finding a good classifier.

22.2 Error Rates and the Bayes Classifier

The goal is to find a classification rule h that makes accurate predictions.

22.4 Definition. The Bayes classification rule h* is
h*(x) = 1 if r(x) > 1/2, and 0 otherwise.
The set D(h) = {x : P(Y = 1 | X = x) = P(Y = 0 | X = x)} is called the decision boundary.

Warning! The Bayes rule has nothing to do with Bayesian inference. We could estimate the Bayes rule using either frequentist or Bayesian methods.

The Bayes rule may be written in several equivalent forms:
h*(x) = 1 if P(Y = 1 | X = x) > P(Y = 0 | X = x), and 0 otherwise,
and
h*(x) = 1 if π f_1(x) > (1 - π) f_0(x), and 0 otherwise.

22.5 Theorem. The Bayes rule is optimal; that is, if h is any other classification rule, then L(h*) <= L(h).

The Bayes rule depends on unknown quantities, so we need to use the data to find some approximation to it. At the risk of oversimplifying, there are three main approaches:

1. Empirical Risk Minimization. Choose a set of classifiers H and find ĥ ∈ H that minimizes some estimate of L(h).

2. Regression. Find an estimate r̂ of the regression function r and define
ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

3. Density Estimation. Estimate f_0 from the X_i's for which Y_i = 0, estimate f_1 from the X_i's for which Y_i = 1, and let π̂ = (1/n) sum_i Y_i. Define
r̂(x) = π̂ f̂_1(x) / ( π̂ f̂_1(x) + (1 - π̂) f̂_0(x) )
and
ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.

In the general case, Y takes on more than two values and we have the following.

22.6 Theorem. Suppose that Y ∈ Y = {1, ..., K}. The optimal rule is
h(x) = argmax_k P(Y = k | X = x) = argmax_k π_k f_k(x),
where
P(Y = k | X = x) = f_k(x) π_k / sum_r f_r(x) π_r,
π_r = P(Y = r), and f_r(x) = f(x | Y = r).
22.3 Gaussian and Linear Classifiers

Perhaps the simplest approach to classification is to use the density estimation strategy and assume a parametric model for the densities. Suppose that Y = {0, 1} and that f_0 = f(x | Y = 0) and f_1 = f(x | Y = 1) are both multivariate Gaussians:
f_k(x) = (1 / ((2π)^{d/2} |Σ_k|^{1/2})) exp{ -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) },   k = 0, 1.

22.7 Theorem. If X | Y = 0 ~ N(μ_0, Σ_0) and X | Y = 1 ~ N(μ_1, Σ_1), then the Bayes rule is
h*(x) = 1 if r_1^2 < r_0^2 + 2 log(π_1/π_0) + log(|Σ_0|/|Σ_1|), and 0 otherwise,
where
r_i^2 = (x - μ_i)^T Σ_i^{-1} (x - μ_i),   i = 0, 1,
is the Mahalanobis distance. An equivalent way of expressing the Bayes rule is
h*(x) = argmax_k δ_k(x),
where
δ_k(x) = -(1/2) log |Σ_k| - (1/2)(x - μ_k)^T Σ_k^{-1} (x - μ_k) + log π_k.

The decision boundary of the above classifier is quadratic, so this procedure is called quadratic discriminant analysis (QDA). In practice, we insert sample estimates of π_k, μ_k, Σ_k into the Bayes rule.

A simplification occurs if we assume that Σ_0 = Σ_1 = Σ. In that case, the Bayes rule is h*(x) = argmax_k δ_k(x) where now
δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k.
The function δ_k(x) is called the discriminant function. The decision boundary {x : δ_0(x) = δ_1(x)} is linear, so this method is called linear discriminant analysis (LDA).
22.8 Example. Let us turn to the South African heart disease data. The confusion matrix tabulates how the subjects are classified as 0 or 1 against their true status. The observed misclassification rate is 141/462 = .305. Including all the covariates reduces the error rate.

22.9 Theorem. Suppose that Y ∈ {1, ..., K}. Under the common-covariance Gaussian model, the Bayes rule classifies to argmax_k δ_k(x).

We estimate δ_k(x) by inserting estimates of μ_k, Σ, and π_k. There is another version of linear discriminant analysis due to Fisher. The idea is to first reduce the dimension of the covariates. Algebraically, this means replacing the covariate X = (X_1, ..., X_d) with a linear combination U = w^T X = sum_{j=1}^d w_j X_j. The goal is to choose the vector w = (w_1, ..., w_d) that "best separates the data." Then we perform classification with the one-dimensional covariate U instead of X.

We need to define what we mean by separation of the groups. We would like the two groups to have means that are far apart relative to their spread. Let μ_j denote the mean of X for Y = j and let Σ be the variance matrix of X. Then E(U | Y = j) = w^T μ_j and V(U) = w^T Σ w. Define the separation by
J(w) = (w^T μ_0 - w^T μ_1)^2 / (w^T Σ w).
We estimate J as follows. Let n_j = sum_i I(Y_i = j) be the number of observations in group j, let X̄_j be the sample mean vector of the X's for group j, and let S_j be the sample covariance matrix in group j. Define
Ĵ(w) = (w^T X̄_0 - w^T X̄_1)^2 / (w^T S_W w),   S_W = ((n_0 - 1) S_0 + (n_1 - 1) S_1) / (n_0 + n_1 - 2).
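For a single covariate, the LDA discriminant functions above reduce to scalar arithmetic, which makes the method easy to sketch. This is a minimal one-dimensional illustration with made-up data (the two groups and their values are assumptions, not from the text):

```python
import math

def lda_fit(x0, x1):
    """1-d LDA: delta_k(x) = x*mu_k/s2 - mu_k^2/(2*s2) + log(pi_k),
    with a pooled variance estimate s2 shared by the two groups."""
    n0, n1 = len(x0), len(x1)
    m0, m1 = sum(x0) / n0, sum(x1) / n1
    ss = sum((x - m0) ** 2 for x in x0) + sum((x - m1) ** 2 for x in x1)
    s2 = ss / (n0 + n1 - 2)                  # pooled variance
    p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)  # estimated class priors

    def classify(x):
        d0 = x * m0 / s2 - m0 * m0 / (2 * s2) + math.log(p0)
        d1 = x * m1 / s2 - m1 * m1 / (2 * s2) + math.log(p1)
        return 0 if d0 > d1 else 1

    return classify

h = lda_fit(x0=[1.0, 1.2, 0.8, 1.1], x1=[3.0, 3.3, 2.9, 3.1])
```

With equal priors the linear decision boundary sits at the midpoint of the two group means, as the algebra of the discriminant functions predicts.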
22.10 Theorem. Let X denote the n x (d + 1) design matrix whose i-th row is (1, X_{i1}, ..., X_{id}) and let Y = (Y_1, ..., Y_n)^T. Then, from Theorem 13.13, the least squares estimate is
β̂ = (X^T X)^{-1} X^T Y.

22.4 Linear Regression and Logistic Regression

One approach to classification is to use the regression strategy with a simple parametric model. With the linear regression model
Y = β_0 + sum_{j=1}^d β_j X_j + ε,
we classify by
ĥ(x) = 1 if r̂(x) > 1/2, and 0 otherwise.
An alternative is the logistic regression model from Chapter 12:
r(x) = P(Y = 1 | X = x) = e^{β_0 + sum_j β_j x_j} / (1 + e^{β_0 + sum_j β_j x_j}),
equivalently,
logit P(Y = 1 | X = x) = β_0 + sum_j β_j x_j.
The MLE β̂ is obtained numerically.

22.11 Example. Let us return to the heart disease data. The MLE is given in Example 12.17. The error rate, using this model for classification, is lower. We can get a better classifier by fitting a richer model, for example, one including interactions.

22.12 Example. If we use such a richer model, the error rate changes accordingly.
22.5 Relationship Between Logistic Regression and LDA

LDA and logistic regression are almost the same thing. If we assume that each group is Gaussian with the same covariance matrix, then
log( P(Y = 1 | X = x) / P(Y = 0 | X = x) )
= log(π_1/π_0) - (1/2)(μ_0 + μ_1)^T Σ^{-1} (μ_1 - μ_0) + x^T Σ^{-1} (μ_1 - μ_0)
= α_0 + α^T x.
On the other hand, the logistic model is, by assumption,
log( P(Y = 1 | X = x) / P(Y = 0 | X = x) ) = β_0 + β^T x.
These are the same model since they both lead to classification rules that are linear in x. The difference is in how we estimate the parameters.

The joint density of a single observation is f(x, y) = f(x | y) f(y) = f(y | x) f(x). In LDA we maximized the full likelihood
Π_i f(x_i, y_i) = [ Π_i f(x_i | y_i) ] [ Π_i f(y_i) ].
In logistic regression we maximized the conditional likelihood Π_i f(y_i | x_i) but we ignored the marginal f(x). Since classification only requires knowing f(y | x), we don't really need to estimate the whole joint distribution. Logistic regression leaves the marginal distribution f(x) unspecified, so it is more nonparametric than LDA. This is an advantage of the logistic regression approach over LDA.

To summarize: LDA and logistic regression both lead to a linear classification rule. In LDA we estimate the entire joint distribution f(x, y) = f(x | y) f(y). In logistic regression we only estimate f(y | x) and we don't bother estimating f(x).
22.6 Density Estimation and Naive Bayes

Another strategy is to estimate the densities f_0(x) and f_1(x) nonparametrically and plug them into the Bayes rule. In high dimensions, however, nonparametric density estimation is unreliable. A simplification is the naive Bayes classifier, which assumes that the components of X are independent within each class, so that f_k(x) = Π_{j=1}^d f_{kj}(x_j). The naive Bayes classifier is popular when x is high-dimensional and discrete. In that case, the model is especially simple: each one-dimensional factor f_{kj} is estimated separately.
22.7 Trees

Trees are classification methods that partition the covariate space X into disjoint pieces and then classify the observations according to which partition element they fall in. As the name implies, the classifier can be represented as a tree.

For illustration, suppose there are two covariates, X_1 = age and X_2 = blood pressure. Figure 22.2 shows a classification tree using these variables. The tree is used in the following way. If a subject has Age >= 50, then we classify him as Y = 1. If a subject has Age < 50, then we check his blood pressure. If systolic blood pressure is < 100, then we classify him as Y = 1; otherwise we classify him as Y = 0. The figure also shows the same classifier as a partition of the covariate space.

Here is how a tree is constructed. First, suppose that y ∈ Y = {0, 1} and that there is only a single covariate X. We choose a split point t that divides the real line into two sets A_1 = (-∞, t] and A_2 = (t, ∞). Let p̂_s(j) be the proportion of observations in A_s such that Y_i = j:
p̂_s(j) = sum_i I(Y_i = j, X_i ∈ A_s) / sum_i I(X_i ∈ A_s)
for s = 1, 2 and j = 0, 1. The impurity of the split t is defined to be
I(t) = sum_{s=1}^2 γ_s,   where   γ_s = 1 - sum_{j=0}^1 p̂_s(j)^2.   (22.30)
This particular measure of impurity is known as the Gini index. If a partition element A_s contains all 0s or all 1s, then γ_s = 0. Otherwise, γ_s > 0. We choose the split point t to minimize the impurity. (Other measures of impurity can be used besides the Gini index.)

When there are several covariates, we choose whichever covariate and split that lead to the lowest impurity. This process is continued until some stopping criterion is met. For example, we might stop when every partition element has fewer than some fixed number of data points. The bottom nodes of the tree are called the leaves. Each leaf is assigned a 0 or 1 depending on whether there are more data points with Y = 0 or Y = 1 in that partition element.

This procedure is easily generalized to the case where Y ∈ {1, ..., K}. We simply define the impurity by
γ_s = 1 - sum_{k=1}^K p̂_s(k)^2,
where p̂_s(k) is the proportion of observations in the partition element for which Y_i = k.
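The Gini split criterion in (22.30) can be sketched directly for a single covariate. This is a minimal illustration with toy data (the particular values are assumptions, chosen so the best split is obvious):

```python
def gini(labels):
    """Gini index 1 - sum_j p(j)^2 of the labels in one partition element."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def split_impurity(xs, ys, t):
    """Impurity of splitting at t: Gini of {x <= t} plus Gini of {x > t}."""
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return gini(left) + gini(right)

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
best_t = min(xs[:-1], key=lambda t: split_impurity(xs, ys, t))
```

Splitting at t = 3 puts all the 0s on one side and all the 1s on the other, so both Gini terms vanish, which is exactly the pure-split case described in the text.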
22.8 Assessing Error Rates and Choosing a Good Classifier

How do we choose a classifier? We would like to have a classifier h with a low true error rate L(h). Usually, we cannot use the training error rate L̂_n(h) as an estimate of the true error rate, because it is biased downward.

Figure 22.5 shows the 10-fold cross-validation estimate of the error rate (solid line) and the training error rate (dashed line) of a tree classifier as a function of tree size.

There are many ways to estimate the error rate. We'll consider two: cross-validation and probability inequalities.

Cross-Validation. The basic idea of cross-validation is to split the data into two pieces: the training set, used to fit the classifier, and the validation set, used to estimate its error rate.

22.14 Example. Consider the heart disease data.

Probability Inequalities. First, suppose that H = {h_1, ..., h_m} consists of finitely many classifiers.
22.16 Theorem (Uniform Convergence). Assume H is finite, with m elements. Then,
P( max_{h ∈ H} |L̂_n(h) - L(h)| > ε ) <= 2 m e^{-2 n ε^2}.
The proof uses Hoeffding's inequality together with the union bound P(∪_i A_i) <= sum_i P(A_i).

22.17 Theorem. Let
ε = sqrt( (1/(2n)) log(2m/α) ).
Then L̂_n(ĥ) ± ε is a 1 - α confidence interval for L(ĥ).

Proof. This follows from the fact that
P( |L̂_n(ĥ) - L(ĥ)| > ε ) <= P( max_{h ∈ H} |L̂_n(h) - L(h)| > ε ) <= 2 m e^{-2 n ε^2},
and the right-hand side equals α for the given ε.

When H is large, these bounds are loose and we need the Vapnik-Chervonenkis theory. Let A be a class of sets and let F be a finite set. Let N_A(F) denote the number of subsets of F "picked out" by A, that is, the number of distinct sets of the form A ∩ F for A ∈ A. Hence, if F has n elements, N_A(F) <= 2^n. The shatter coefficient is defined by
s(A, n) = max_{F ∈ F_n} N_A(F),   (22.37)
where F_n consists of all finite sets of size n. Now let X_1, ..., X_n ~ P and let
P_n(A) = (1/n) sum_i I(X_i ∈ A)
denote the empirical probability measure. The following remarkable theorem bounds the distance between P and P_n.

22.18 Theorem (Vapnik and Chervonenkis (1971)). For any P, n and ε > 0,
P( sup_{A ∈ A} |P_n(A) - P(A)| > ε ) <= 8 s(A, n) e^{-n ε^2 / 32}.   (22.38)

The proof, though very elegant, is long and we omit it. If H is a set of classifiers, we define A to be the class of sets of the form {x : h(x) = 1} as h varies in H.