
STAT210B Notes

Matt Olfat
February 13, 2017

Efron-Stein Inequality
Let $X_1, \dots, X_n$ be independent random variables, and let $Z = f(X_1, \dots, X_n)$. Then we have that
$$\mathrm{Var}(Z) \le \mathbb{E}\left[\sum_{i=1}^n \left(E_i[Z^2] - (E_i[Z])^2\right)\right],$$
where $E_i[\cdot]$ denotes expectation over $X_i$ alone, with the remaining variables held fixed.
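As a quick sanity check, here is a minimal Monte Carlo sketch (assuming Python with NumPy; the choice $f(x) = \max_i x_i$ and the sample sizes are illustrative, not from the notes) comparing $\mathrm{Var}(Z)$ to the Efron-Stein bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_outer, n_inner = 5, 5000, 200

def f(x):
    # arbitrary illustrative choice of f
    return x.max(axis=-1)

X = rng.uniform(size=(n_outer, n))
Z = f(X)

# Efron-Stein bound: E[ sum_i ( E_i[Z^2] - (E_i[Z])^2 ) ],
# estimated by resampling coordinate i while holding the others fixed.
bound = 0.0
for i in range(n):
    Xi = np.repeat(X[:, None, :], n_inner, axis=1)        # (n_outer, n_inner, n)
    Xi[:, :, i] = rng.uniform(size=(n_outer, n_inner))     # resample coordinate i
    Zi = f(Xi)                                             # (n_outer, n_inner)
    bound += np.mean(Zi.var(axis=1))                       # E[ E_i[Z^2] - (E_i Z)^2 ]

print("Var(Z) estimate:       ", Z.var())
print("Efron-Stein bound est.:", bound)
```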

We will show that, writing $\mathrm{Ent}_\phi(Z) = \mathbb{E}[\phi(Z)] - \phi(\mathbb{E} Z)$: for $\phi(x) = x^2$ this is exactly $\mathrm{Var}(Z)$, recovering Efron-Stein, while for $\phi(x) = x \log x$ we will obtain the analogous bound
$$\mathrm{Ent}_\phi(Z) \le \mathbb{E}\left[\sum_{i=1}^n \left(E_i[\phi(Z)] - \phi(E_i[Z])\right)\right].$$
Let $X$ be a random variable taking values in a countable set $\mathcal{X}$.

Definition 1 The Shannon entropy of $X$ is $H(X) = \mathbb{E}[-\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$.


Definition 2 If $P, Q$ are probability distributions on $\mathcal{X}$, then $D(P\|Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \ge 0$.

Nonnegativity can be shown using $\log x \le x - 1$ and the convention $0 \log 0 = 0$:
$$D(P\|Q) = -\sum_{x \in \mathcal{X},\, p(x)>0} p(x) \log \frac{q(x)}{p(x)} \ge -\sum_{x \in \mathcal{X},\, p(x)>0} p(x)\left(\frac{q(x)}{p(x)} - 1\right) \ge 0.$$

Entropy is maximized by the uniform distribution: let $|\mathcal{X}| < \infty$ and $q(x) = \frac{1}{|\mathcal{X}|}$. Then $D(P\|Q) = \log|\mathcal{X}| - H(P) \ge 0$, so $H(P) \le \log|\mathcal{X}|$. Moreover, $D(P\|Q) = 0$ iff $P = Q$.
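A minimal numerical illustration of Definitions 1-2 and the bound $H(P) \le \log|\mathcal{X}|$ (a sketch assuming Python with NumPy; the distributions below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    # Shannon entropy H(P) = -sum p(x) log p(x), with 0 log 0 = 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl(p, q):
    # relative entropy D(P||Q) = sum p(x) log(p(x)/q(x))
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.25, 0.125, 0.125])   # arbitrary example distribution
u = np.full(4, 0.25)                       # uniform on |X| = 4 points

print(entropy(p), "<=", np.log(4))             # H(P) <= log|X|
print(kl(p, u), "=", np.log(4) - entropy(p))   # D(P||Unif) = log|X| - H(P)
print(kl(p, u) >= 0, kl(u, u) == 0)            # nonnegativity; D = 0 iff P = Q
```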

Entropy on Product Spaces


If $X, Y$ are random variables taking values in $\mathcal{X}, \mathcal{Y}$, respectively, with some joint probability $P(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, then $H(X, Y) = -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log p(x, y)$.

Definition 3 The mutual information between $X$ and $Y$ is $I(X, Y) = H(X) + H(Y) - H(X, Y)$.
This measures how dependent the variables are.

Now, $P_X(x) = \sum_{y \in \mathcal{Y}} P(X = x, Y = y)$. Then
$$I(X, Y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p_X(x)\, p_Y(y)} = D(P_{XY} \| P_X \otimes P_Y),$$
where $\otimes$ denotes the product measure, $(P_X \otimes P_Y)(x, y) = P_X(X = x) P_Y(Y = y)$. This is zero when $X, Y$ are independent.

Definition 4 The conditional entropy of $X$ given $Y$ is $H(X|Y) = H(X, Y) - H(Y) = \mathbb{E}_Y[H(P(X \mid Y))]$.

This leads to $D(P_{XY} \| P_X \otimes P_Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) \ge 0$, since $H(X) \ge H(X|Y)$. All of this together gives us the chain rule $H(X_1, \dots, X_m) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_1, X_2) + \dots$.
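As an illustration (a minimal sketch assuming Python with NumPy; the joint table below is an arbitrary example), one can check the identities $I(X,Y) = H(X) + H(Y) - H(X,Y) = D(P_{XY}\|P_X \otimes P_Y)$ and $H(X|Y) = H(X,Y) - H(Y)$ numerically:

```python
import numpy as np

def H(p):
    # Shannon entropy of a (possibly multi-dimensional) probability table
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# arbitrary example joint distribution on a 2 x 3 space
P = np.array([[0.2, 0.1, 0.1],
              [0.1, 0.3, 0.2]])
Px, Py = P.sum(axis=1), P.sum(axis=0)

I = H(Px) + H(Py) - H(P)                      # mutual information
D = np.sum(P * np.log(P / np.outer(Px, Py)))  # D(P_XY || P_X x P_Y)
print(I, D)                                   # the two agree

H_X_given_Y = H(P) - H(Py)                    # conditional entropy
print(H(Px) - H_X_given_Y, I)                 # H(X) - H(X|Y) = I(X,Y)
```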

Theorem 5 (Han's inequality) Let $X_1, \dots, X_n$ be discrete random variables. Then
$$H(X_1, \dots, X_n) \le \frac{1}{n-1} \sum_{i=1}^n H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$

Proof. $H(X_1, \dots, X_n) = H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$. Sum this over all $i$ to get
$$n H(X_1, \dots, X_n) = \sum_{i=1}^n H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + \sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$$
$$\le \sum_{i=1}^n H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + \sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1})$$
$$= \sum_{i=1}^n H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) + H(X_1, \dots, X_n),$$
where the inequality uses that conditioning reduces entropy and the last step uses the chain rule. Rearranging gives the claim. This is tight when the variables are independent.
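A brute-force check of Theorem 5 (a minimal sketch assuming Python with NumPy; the random joint table is an arbitrary example):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(1)
n = 4
P = rng.random((2,) * n)          # random joint table for n binary variables
P /= P.sum()

lhs = H(P)
rhs = sum(H(P.sum(axis=i)) for i in range(n)) / (n - 1)  # leave-one-out marginals
print(lhs, "<=", rhs, lhs <= rhs + 1e-12)
```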


Consider the binary hypercube with vertices in $\{-1, 1\}^n$. The Hamming distance measures the number of differing entries in two vectors. Also consider the graph $G = (V, E)$, where each corner of the cube is a node and adjacent corners (Hamming distance one) have edges between them. Then $|V| = 2^n$ and $|E| = n\, 2^{n-1}$, so $\frac{|E|}{|V|} = \frac{n}{2} = \frac{\log_2 |V|}{2}$.

We want to show that for any $A \subseteq V = \{-1, 1\}^n$, $|E(A)| \le \frac{|A|}{2} \log_2 |A|$, where $E(A)$ is the set of edges with both endpoints in $A$.

Proof. Let $X$ have a uniform distribution over $A$, so $H(X) = \log |A|$. By Han's inequality, $\sum_{i=1}^n \left(H(X) - H(X^{(i)})\right) \le H(X)$. The term $H(X) - H(X^{(i)})$ is the entropy of the $i$th coordinate given everything else, which is also equal to $-\sum_{x \in A} p(x) \log p(x_i \mid x^{(i)})$. However, writing $\bar{x}^{(i)} = (x_1, \dots, x_{i-1}, -x_i, x_{i+1}, \dots, x_n)$, we have $p(x_i \mid x^{(i)}) = \frac{1}{2}$ if $\bar{x}^{(i)} \in A$, and $p(x_i \mid x^{(i)}) = 1$ otherwise. We therefore have
$$\sum_{i=1}^n \left(H(X) - H(X^{(i)})\right) = \frac{\log 2}{|A|} \sum_{x \in A} \sum_{i=1}^n \mathbf{1}_{\{\bar{x}^{(i)} \in A\}} = \frac{2 \log 2}{|A|}\, |E(A)| \le H(X) = \log |A|,$$
since each edge of $E(A)$ is counted once from each endpoint. Rearranging gives $|E(A)| \le \frac{|A|}{2} \log_2 |A|$.
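A brute-force verification of the edge bound on a small hypercube (a minimal sketch assuming Python; the random subsets $A$ are arbitrary examples):

```python
import itertools, math, random

def edges_within(A):
    # count hypercube edges with both endpoints in A (Hamming distance 1)
    A = set(A)
    count = 0
    for x in A:
        for i in range(len(x)):
            y = x[:i] + (-x[i],) + x[i+1:]
            if y in A:
                count += 1
    return count // 2        # each edge is counted from both endpoints

random.seed(0)
n = 4
V = list(itertools.product([-1, 1], repeat=n))
for _ in range(5):
    A = random.sample(V, random.randint(2, len(V)))
    lhs = edges_within(A)
    rhs = 0.5 * len(A) * math.log2(len(A))
    print(lhs, "<=", round(rhs, 3), lhs <= rhs + 1e-9)
```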
Let $\mathcal{X}$ be a countable set and let $P, Q$ be probability distributions on $\mathcal{X}^n$. Assume $P = P_1 \otimes P_2 \otimes \dots \otimes P_n$, let $Q^{(i)}$ denote the marginal of $Q$ on the coordinates other than $i$, with density $q^{(i)}(x^{(i)}) = \sum_{x_i \in \mathcal{X}} q(x_1, \dots, x_n)$, and denote $p^{(i)}(x^{(i)}) = p_1(x_1) \cdots p_{i-1}(x_{i-1})\, p_{i+1}(x_{i+1}) \cdots p_n(x_n)$.

Theorem 6 (Han's inequality for relative entropies) $D(Q\|P) \ge \frac{1}{n-1} \sum_{i=1}^n D(Q^{(i)}\|P^{(i)})$, or equivalently $D(Q\|P) \le \sum_{i=1}^n \left(D(Q\|P) - D(Q^{(i)}\|P^{(i)})\right)$.
Proof. The original theorem gives us $H(Q) \le \frac{1}{n-1} \sum_{i=1}^n H(Q^{(i)})$, i.e.
$$-\sum_{x \in \mathcal{X}^n} q(x) \log q(x) \le -\frac{1}{n-1} \sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} q^{(i)}(x^{(i)}) \log q^{(i)}(x^{(i)}).$$
We want to show that
$$\sum_{x \in \mathcal{X}^n} q(x) \log p(x) = \frac{1}{n-1} \sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} q^{(i)}(x^{(i)}) \log p^{(i)}(x^{(i)}).$$
We may use the fact that $p$ is a product measure, so that $p(x) = p^{(i)}(x^{(i)})\, p_i(x_i)$ for every $i$, to get
$$\sum_{x \in \mathcal{X}^n} q(x) \log p(x) = \frac{1}{n} \sum_{i=1}^n \sum_{x \in \mathcal{X}^n} q(x) \log\!\left(p^{(i)}(x^{(i)})\, p_i(x_i)\right) = \frac{1}{n} \sum_{i=1}^n \sum_{x \in \mathcal{X}^n} q(x) \log p_i(x_i) + \frac{1}{n} \sum_{i=1}^n \sum_{x \in \mathcal{X}^n} q(x) \log p^{(i)}(x^{(i)}).$$
Since $\sum_{i=1}^n \log p_i(x_i) = \log p(x)$, the first term equals $\frac{1}{n} \sum_x q(x) \log p(x)$; rearranging and marginalizing over $x_i$ in the second term gives
$$\sum_{x \in \mathcal{X}^n} q(x) \log p(x) = \frac{1}{n-1} \sum_{i=1}^n \sum_{x^{(i)} \in \mathcal{X}^{n-1}} q^{(i)}(x^{(i)}) \log p^{(i)}(x^{(i)}),$$
as desired.
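A numerical check of Theorem 6 (a minimal sketch assuming Python with NumPy; the random $Q$ and product $P$ below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 3, 2                                   # n coordinates, each taking k values

Q = rng.random((k,) * n); Q /= Q.sum()        # arbitrary joint distribution Q
marg = [rng.dirichlet(np.ones(k)) for _ in range(n)]
P = np.einsum('i,j,k->ijk', *marg)            # product measure P = P1 x P2 x P3

def kl(q, p):
    q, p = q.ravel(), p.ravel()
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

lhs = kl(Q, P)
rhs = sum(kl(Q.sum(axis=i), P.sum(axis=i)) for i in range(n)) / (n - 1)
print(lhs, ">=", rhs, lhs >= rhs - 1e-12)
```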

February 13th
Let $Z_i = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n)$, where $X_i'$ is an independent copy of $X_i$. By symmetry between $Z$ and $Z_i$,
$$\mathbb{E}[(Z - Z_i)^2] = \mathbb{E}[(Z - Z_i)^2 \mathbf{1}_{\{Z > Z_i\}}] + \mathbb{E}[(Z - Z_i)^2 \mathbf{1}_{\{Z < Z_i\}}] = 2\,\mathbb{E}[(Z - Z_i)_+^2] = 2\,\mathbb{E}[(Z - Z_i)_-^2]. \quad (1)$$
This gives us various versions of Efron-Stein.
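A quick simulation of identity (1) (a minimal sketch assuming Python with NumPy; the choice $f(x) = \sum_j x_j^2$ is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 200000

def f(x):
    return (x ** 2).sum(axis=-1)     # arbitrary example function

X = rng.normal(size=(m, n))
Xp = X.copy()
Xp[:, 0] = rng.normal(size=m)        # replace coordinate i = 0 by an independent copy
Z, Zi = f(X), f(Xp)
d = Z - Zi

print(np.mean(d ** 2))                       # E[(Z - Z_i)^2]
print(2 * np.mean(np.maximum(d, 0) ** 2))    # 2 E[(Z - Z_i)_+^2]
print(2 * np.mean(np.minimum(d, 0) ** 2))    # 2 E[(Z - Z_i)_-^2]
```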

Definition 7 Let $f : \mathcal{X}^n \to [0, \infty)$. $f$ is self-bounded if for all $i$ there exists $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that $0 \le f(x) - f_i(x^{(i)}) \le 1$ and $\sum_{i=1}^n \left(f(x) - f_i(x^{(i)})\right) \le f(x)$.

Corollary 8 Let $Z = f(X_1, \dots, X_n)$ with $X_1, \dots, X_n$ independent, $f$ self-bounded, and $Z \in L^2$. Then $\mathrm{Var}(Z) \le \mathbb{E}[Z]$.

Proof. The $f_i$ are given by the definition. By Efron-Stein,
$$\mathrm{Var}(Z) \le \sum_{i=1}^n \mathbb{E}\left[\left(f(X) - f_i(X^{(i)})\right)^2\right] \le \sum_{i=1}^n \mathbb{E}\left[f(X) - f_i(X^{(i)})\right] \le \mathbb{E}[f(X)] = \mathbb{E}[Z],$$
where the middle inequality uses $0 \le f(x) - f_i(x^{(i)}) \le 1$.
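For a concrete sanity check (a minimal sketch assuming Python with NumPy; the "number of distinct values" statistic is my own illustrative example of a self-bounded function, not one from the notes), one can verify $\mathrm{Var}(Z) \le \mathbb{E}[Z]$ by simulation:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 30, 50, 100000

# Z = number of distinct values among X_1, ..., X_n (X_i uniform on {0, ..., k-1}).
# Dropping one coordinate decreases the count by at most 1, and the total decrease
# over all i equals the number of values seen exactly once, which is at most Z,
# so this f is self-bounded.
X = rng.integers(0, k, size=(m, n))
Z = np.array([len(set(row)) for row in X])

print("Var(Z):", Z.var())
print("E[Z]:  ", Z.mean())           # Var(Z) <= E[Z] should hold
```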
So self-bounded functions have variances smaller than their expected values. One application
of this is in relative stability:
Definition 9 We say that nonnegative random variables $Z_n$ are relatively stable if $\frac{Z_n}{\mathbb{E}[Z_n]} \to 1$ in probability as $n$ increases.

Thus, the expectation is all we need to know about the magnitude of $Z_n$. If we assume that $\mathrm{Var}(Z_n) \le \mathbb{E}[Z_n]$, then by Chebyshev's inequality
$$P\left(\left|\frac{Z_n}{\mathbb{E}[Z_n]} - 1\right| \ge \epsilon\right) \le \frac{\mathrm{Var}(Z_n)}{\epsilon^2\, \mathbb{E}[Z_n]^2} \le \frac{1}{\epsilon^2\, \mathbb{E}[Z_n]}.$$
Now, we move on to configuration functions. First, we say that a property $\Pi$ is defined over a union of finite products of a set $\mathcal{X}$: let $\Pi_1 \subseteq \mathcal{X}$, $\Pi_2 \subseteq \mathcal{X} \times \mathcal{X}$, and so on. We say that $(x_1, \dots, x_n) \in \mathcal{X}^n$ satisfies the property if $(x_1, \dots, x_n) \in \Pi_n$. $\Pi$ is hereditary (monotone) if the following holds: if $(x_1, \dots, x_n) \in \Pi_n$, then for any indices $i_1 < \dots < i_k$ we have $(x_{i_1}, \dots, x_{i_k}) \in \Pi_k$. Let $f : \cup_i \mathcal{X}^i \to \mathbb{N}$ map any string $x = (x_1, \dots, x_n)$ to the size of the largest subsequence of that string that satisfies $\Pi$. Then $f$ is called a configuration function.

Corollary 10 If $f$ is a configuration function, then $f$ is self-bounded.

Proof. Let $f_i(x^{(i)}) = f(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ and let $x = (x_1, \dots, x_n) \in \mathcal{X}^n$. Let $i_1, \dots, i_k$ be such that $(x_{i_1}, \dots, x_{i_k})$ is a maximal subsequence satisfying $\Pi$, so $f(x) = k$. Then, clearly, $0 \le f(x) - f_i(x^{(i)}) \le 1$, and the two can only differ for $i \in \{i_1, \dots, i_k\}$, so $\sum_{i=1}^n \left(f(x) - f_i(x^{(i)})\right) \le k = f(x)$. Then, by Corollary 8, $\mathrm{Var}(f(X)) \le \mathbb{E}[f(X)]$.
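As an illustration (a minimal sketch assuming Python with NumPy; the longest-increasing-subsequence statistic is my own example of a configuration function, since "being an increasing subsequence" is a hereditary property):

```python
import numpy as np
from bisect import bisect_left

def lis_length(x):
    # length of the longest increasing subsequence (patience sorting, O(n log n))
    tails = []
    for v in x:
        j = bisect_left(tails, v)
        if j == len(tails):
            tails.append(v)
        else:
            tails[j] = v
    return len(tails)

rng = np.random.default_rng(5)
n, m = 100, 20000
Z = np.array([lis_length(rng.random(n)) for _ in range(m)])
print("Var(Z):", Z.var(), " E[Z]:", Z.mean())   # Corollary 10: Var(Z) <= E[Z]
```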
VC dimensions are an example of this. Let $\mathcal{A}$ be a collection of subsets of $\mathcal{X}$, and let $X = (x_1, \dots, x_n) \in \mathcal{X}^n$. Define the trace $\mathrm{Tr}(X) = \{A \cap \{x_1, \dots, x_n\} : A \in \mathcal{A}\}$. We can see that, depending on $\mathcal{A}$, the trace of $X$ may not capture the richness of the entire power set of $\{x_1, \dots, x_n\}$ (consider $\mathcal{X}$ a collection of points and $\mathcal{A}$ the set of half-spaces facing to the right). The shatter coefficient (or growth coefficient) of $X$ is $|\mathrm{Tr}(X)|$. A subset $\{x_{i_1}, \dots, x_{i_k}\}$ is shattered by $\mathcal{A}$ if its trace is equal to its power set. The VC dimension $D(X)$ of $\mathcal{A}$ with respect to that particular $X$ is the size of the largest subset shattered by $\mathcal{A}$. Clearly, the property of being shattered is monotone, so the VC dimension is a configuration function and we have $\mathrm{Var}[D(X)] \le \mathbb{E}[D(X)]$. This is an example of an empirical process that concentrates.
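A brute-force computation of the empirical VC dimension (a minimal sketch assuming Python with NumPy; the small random class $\mathcal{A}$ of subsets of a finite ground set is my own illustrative construction, chosen so that $D(X)$ actually varies with the sample):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

# a fixed, arbitrary finite class A of subsets of the ground set {0, ..., 9}
A = [frozenset(np.flatnonzero(rng.random(10) < 0.5)) for _ in range(8)]

def vc_dim(sample):
    # size of the largest subset of the sample that is shattered by A
    pts = set(sample)
    best = 0
    for k in range(1, len(pts) + 1):
        for S in combinations(pts, k):
            trace = {frozenset(a & set(S)) for a in A}
            if len(trace) == 2 ** k:          # S is shattered
                best = k
                break
    return best

samples = [rng.integers(0, 10, size=6) for _ in range(5000)]
D = np.array([vc_dim(s) for s in samples])
print("Var(D):", D.var(), " E[D]:", D.mean())   # configuration function: Var <= E
```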
Another key example of an empirical process that concentrates is the Rademacher complexity. Let $x_1, \dots, x_n$ be independent uniform $[0, 1]^d$ random variables and $\epsilon_1, \dots, \epsilon_n$ be i.i.d. Rademacher($\frac{1}{2}$) signs. Let
$$Z = \mathbb{E}\left[\max_{k \in [d]} \sum_{j=1}^n \epsilon_j (x_j^\top e_k) \,\Big|\, x_1, \dots, x_n\right].$$
$Z$ has the self-bounding property, as removing one element from the summation inside the maximization can only decrease the total value, and by at most one, i.e. $0 \le Z - Z_i \le 1$, where $Z_i$ is defined with the $i$th term of the sum removed. Furthermore,
$$\sum_{i=1}^n (Z - Z_i) = \sum_{i=1}^n \left(\mathbb{E}\left[\max_{k \in [d]} \sum_{j=1}^n \epsilon_j (x_j^\top e_k) \,\Big|\, x\right] - \mathbb{E}\left[\max_{k \in [d]} \sum_{j \ne i} \epsilon_j (x_j^\top e_k) \,\Big|\, x\right]\right) \le Z,$$
so $\mathrm{Var}(Z) \le \mathbb{E}[Z]$.
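A simulation sketch (assuming Python with NumPy; the dimensions and sample sizes are arbitrary) estimating the conditional Rademacher average $Z$ by an inner Monte Carlo over the signs and checking $\mathrm{Var}(Z) \le \mathbb{E}[Z]$ over draws of the data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 20, 5
n_data, n_eps = 2000, 400

def rademacher_avg(x, n_eps, rng):
    # Z = E_eps[ max_k sum_j eps_j * x_j[k] ] for fixed data x of shape (n, d)
    eps = rng.choice([-1.0, 1.0], size=(n_eps, x.shape[0]))
    return (eps @ x).max(axis=1).mean()

# note: the inner Monte Carlo noise slightly inflates the empirical variance,
# but the inequality Var(Z) <= E[Z] still has plenty of slack here
Z = np.array([rademacher_avg(rng.uniform(size=(n, d)), n_eps, rng)
              for _ in range(n_data)])
print("Var(Z):", Z.var(), " E[Z]:", Z.mean())   # self-bounding: Var(Z) <= E[Z]
```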
