Part 2
Empirical Processes
2.1 Introduction
\[ f \mapsto \mathbb P_n f. \]
Here, we use the abbreviation $Qf = \int f\, dQ$ for a given measurable function f and signed measure Q. Let P be the common distribution of the $X_i$. The centered and scaled version of the given map is the $\mathcal F$-indexed empirical process $\mathbb G_n$ given by
\[ f \mapsto \mathbb G_n f = \sqrt n\,(\mathbb P_n - P) f = \frac{1}{\sqrt n} \sum_{i=1}^n \bigl( f(X_i) - P f \bigr). \]
Frequently the signed measure $\mathbb G_n = n^{-1/2} \sum_{i=1}^n (\delta_{X_i} - P)$ will be identified with the empirical process.
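As a quick numerical companion to these definitions (not part of the text), the maps $f \mapsto \mathbb P_n f$ and $f \mapsto \mathbb G_n f$ can be evaluated by direct simulation; the sample distribution, the function f, and the value Pf below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_measure(sample, f):
    # P_n f: the average of f over the observations X_1, ..., X_n.
    return float(np.mean(f(sample)))

def empirical_process(sample, f, Pf):
    # G_n f = sqrt(n) (P_n - P) f; the true mean Pf is supplied analytically.
    n = len(sample)
    return float(np.sqrt(n) * (empirical_measure(sample, f) - Pf))

# Illustration: X_i ~ Uniform(0, 1) and f(x) = x, so Pf = 1/2 and
# P(f - Pf)^2 = 1/12; by the CLT, G_n f is approximately N(0, 1/12).
x = rng.uniform(size=100_000)
f = lambda t: t
pnf = empirical_measure(x, f)
gnf = empirical_process(x, f, 0.5)
```

Evaluating the same two maps at other functions f (indicators, smooth functions) gives the other coordinates of the same empirical measure and empirical process.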
For a given function f, it follows from the law of large numbers and the central limit theorem that
\[ \mathbb P_n f \to P f \ \text{almost surely}, \qquad \mathbb G_n f \rightsquigarrow N\bigl(0, P(f - Pf)^2\bigr), \]
provided $Pf$ exists and $Pf^2 < \infty$, respectively. This part is concerned with making these two statements uniform in f varying over a class $\mathcal F$.
Classical empirical process theory concerns the special cases when the
sample space X is the unit interval in [0, 1], the real line R, or Rd and the
indexing collection F is taken to be the set of indicators of left half-lines
in R or lower-left orthants in Rd . In this part we are concerned with em-
pirical measures and processes indexed by these classes of functions F, but
also many others, including indicators of half-spaces, balls, ellipsoids, and
other sets in Rd , classes of smooth functions from Rd to R, and monotone
functions.
With the notation $\|Q\|_{\mathcal F} = \sup\bigl\{ |Qf| : f \in \mathcal F \bigr\}$, the uniform version of the law of large numbers becomes
\[ \|\mathbb P_n - P\|_{\mathcal F} \to 0, \]
where the convergence is in outer probability or is outer almost surely. A
class F for which this is true is called a Glivenko-Cantelli class, or also
P -Glivenko-Cantelli class to bring out the dependence on the underlying
measure P .
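For the classical case, with F the indicators of left half-lines on the real line, $\|\mathbb P_n - P\|_{\mathcal F}$ is the Kolmogorov-Smirnov distance $\sup_t |F_n(t) - F(t)|$, and the Glivenko-Cantelli property can be watched numerically. The sketch below (Uniform(0,1) data and a fixed seed are illustrative choices) uses the fact that the supremum is attained at the order statistics.

```python
import numpy as np

rng = np.random.default_rng(1)

def sup_deviation(sample):
    # sup_t |F_n(t) - F(t)| for F the Uniform(0,1) cdf; the supremum over
    # the class of indicators 1{(-inf, t]} is attained at the sample points.
    x = np.sort(sample)
    n = len(x)
    upper = np.arange(1, n + 1) / n - x   # F_n(x_i) - F(x_i)
    lower = x - np.arange(0, n) / n       # F(x_i) - F_n(x_i-)
    return float(max(upper.max(), lower.max()))

devs = [sup_deviation(rng.uniform(size=n)) for n in (100, 10_000)]
# Glivenko-Cantelli: the uniform deviation shrinks as n grows.
```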
In order to discuss a uniform version of the central limit theorem, it is assumed that
\[ \sup_{f\in\mathcal F} |f(x) - P f| < \infty, \quad \text{for every } x. \]
In this case, it is natural to identify f = 1{(−∞, t]} with t ∈ R̄d and the
space `∞ (F) with the space `∞ (R̄d ). Of course, the sample paths of the
empirical distribution function are all contained in a much smaller space
(such as D[−∞, ∞]), but as long as this space is equipped with the
supremum metric, this is irrelevant for the central limit theorem.
It is known from classical results (for the line by Donsker) that the
class of lower rectangles is Donsker for any underlying law P of X1 , . . . , Xn .
These classical results are a simple consequence of the results of this part.
Moreover, with no additional effort, analogous results are obtained for the
empirical process indexed by the collection of closed balls, rectangles, half-
spaces, and ellipsoids, as well as many collections of functions.
$\|f\| \le \|g\|$. For such norms, if f is in the 2ε-bracket [l, u], then it is in the ball of radius ε around (l + u)/2. Thus covering and bracketing numbers are related by
\[ N\bigl(\varepsilon, \mathcal F, \|\cdot\|\bigr) \le N_{[\,]}\bigl(2\varepsilon, \mathcal F, \|\cdot\|\bigr). \]
In general, there is no converse inequality, so that bracketing numbers may
be bigger than covering numbers. (A notable exception is the uniform norm,
for which the previous inequality is an identity.) However, sufficient condi-
tions for the theorems of interest in terms of bracketing numbers can often
be stated using only a single norm on F, while sufficient conditions in terms
of entropy numbers usually involve many norms. This is intuitively obvious
since brackets control a function f (x) pointwise in its argument x, rather
than in norm. As a consequence, sufficient conditions in terms of bracketing
numbers or covering numbers are unrelated
in general. Both are of interest.
We shall write $N\bigl(\varepsilon, \mathcal F, L_r(Q)\bigr)$ for covering numbers relative to the $L_r(Q)$-norm
\[ \|f\|_{Q,r} = \Bigl( \int |f|^r\, dQ \Bigr)^{1/r}, \]
and similarly for bracketing numbers. An envelope function of a class $\mathcal F$ is any function $x \mapsto F(x)$ such that $|f(x)| \le F(x)$, for every x and f.
The minimal envelope function is $x \mapsto \sup_f |f(x)|$. It will usually be assumed that this function is finite for every x. The uniform entropy numbers (relative to $L_r$) are defined as
\[ \sup_Q \log N\bigl( \varepsilon \|F\|_{Q,r}, \mathcal F, L_r(Q) \bigr), \]
The proof of this theorem is elementary and extends the usual proof of the
classical Glivenko-Cantelli theorem for the empirical distribution function.
The second theorem is more complicated but has a simple corollary involv-
ing the uniform covering numbers. A class F is P -Glivenko-Cantelli if it is
P -measurable with envelope F such that P ∗ F < ∞ and satisfies
\[ \sup_Q N\bigl( \varepsilon \|F\|_{Q,1}, \mathcal F, L_1(Q) \bigr) < \infty, \quad \text{for every } \varepsilon > 0. \]
The full statement of the second theorem yields necessary and sufficient conditions for a P-measurable class $\mathcal F$ to be P-Glivenko-Cantelli in terms of the random covering numbers $N\bigl(\varepsilon, \mathcal F_M, L_1(\mathbb P_n)\bigr)$, where $\mathcal F_M = \bigl\{ f 1\{F \le M\} : f \in \mathcal F \bigr\}$.

‡ Alternatively, the supremum could be taken over all discrete probability measures. This yields the same results.

This characterization of the Glivenko-Cantelli theorem
will be used in Chapter 2.10 to establish useful Glivenko-Cantelli preserva-
tion theorems. Chapter 2.4 concludes with the statement of a result char-
acterizing universal Glivenko-Cantelli classes. As mentioned already, the
second Glivenko-Cantelli theorem requires the assumption that F is a “P -
measurable class,” a concept that is defined in Definition 2.3.3. The pres-
ence of this measurability hypothesis is a consequence of the way uniform
entropy is used to control suprema via “randomization by Rademacher ran-
dom variables” (or “symmetrization”) together with the lack of a general
version of Fubini’s theorem for outer integrals. This same type of addi-
tional measurability hypothesis also enters in the corresponding Donsker
theorems with uniform entropy hypotheses. Without these hypotheses, the
theorems are false, but on the other hand it requires some effort to con-
struct a class for which the hypotheses fail. Thus, in applications, this type
of measurability will hardly ever play a role.
The method of symmetrization is discussed in Chapter 2.3 together
with measurability conditions. This section is a necessary preparation for
the uniform entropy Glivenko-Cantelli and Donsker theorems in Chap-
ters 2.4 and 2.5. The theorems that are proved with bracketing do not
use symmetrization and do not need measurability hypotheses.
One of the two Donsker theorems proved in Chapter 2.5 uses bracketing
entropy. A slightly simpler version of this theorem asserts that a class F
of functions is P-Donsker under an integrability condition on the $L_2(P)$-entropy with bracketing. A class $\mathcal F$ of measurable functions is P-Donsker if
\[ \int_0^\infty \sqrt{\log N_{[\,]}\bigl( \varepsilon, \mathcal F, L_2(P) \bigr)}\; d\varepsilon < \infty. \]
The other Donsker theorem in this chapter asserts that a class F of func-
tions is P -Donsker under a similar condition on uniform entropy numbers.
If $\mathcal F$ is a class of functions such that
\[ \int_0^\infty \sup_Q \sqrt{\log N\bigl( \varepsilon \|F\|_{Q,2}, \mathcal F, L_2(Q) \bigr)}\; d\varepsilon < \infty, \tag{2.1.7} \]
Some thought shows that the number of subsets picked out by a collection C
is closely related to the covering numbers of the class of indicator functions
{1C : C ∈ C} in L1 (Q) for discrete, empirical type measures Q. By a clever
argument, Sauer’s lemma can be used to bound the uniform covering (or
entropy) numbers for this class. Theorem 2.6.4 asserts that there exists a
universal constant K such that, for any VC-class C of sets,
\[ N\bigl( \varepsilon, \mathcal C, L_r(Q) \bigr) \le K\, V(\mathcal C)\, (4e)^{V(\mathcal C)} \Bigl( \frac{1}{\varepsilon} \Bigr)^{r(V(\mathcal C)-1)}, \]
for any probability measure Q and r ≥ 1 and 0 < ε < 1. Consequently, VC-classes are examples of polynomial classes in the sense that their covering
numbers are bounded by a polynomial in 1/ε. The upper bound shows
that VC-classes satisfy the sufficient conditions for the Glivenko-Cantelli
theorem and Donsker theorem discussed previously (with much to spare)
provided they possess certain measurability properties.
It is possible to obtain similar results for classes of functions. A collec-
tion of real-valued, measurable functions F on a sample space X is called a
VC-subgraph class (or simply a VC-class of functions) if the collection of all
subgraphs of the functions in F forms a VC-class of sets in X ×R. Just as for
sets, the covering numbers of a VC-subgraph class F grow polynomially in
1/ε, so that suitably measurable VC-subgraph classes are Glivenko-Cantelli
and Donsker under moment conditions on the envelope function.
Among the many examples of VC-classes are the collections of left half-
lines and lower-left orthants, with which classical empirical process theory is
concerned. Methods to construct a great variety of VC-classes are discussed
in Section 2.6.5.
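The polynomial growth can be made concrete for the simplest VC-class, the left half-lines on the real line: they shatter every one-point set but no two-point set, so the VC index is V(C) = 2, and Sauer's lemma bounds the number of subsets picked out of n points by $\sum_{j \le V(\mathcal C)-1} \binom{n}{j} = n + 1$. The enumeration below is an illustrative sketch, not from the text.

```python
import numpy as np
from math import comb

def subsets_picked_out(points):
    # Distinct subsets of the form {x in points : x <= t}, t real:
    # the traces of left half-lines on the finite set.
    xs = sorted(set(points))
    picked = {frozenset()}
    for t in xs:
        picked.add(frozenset(x for x in xs if x <= t))
    return picked

pts = np.random.default_rng(2).uniform(size=12)
n_picked = len(subsets_picked_out(pts))
sauer_bound = sum(comb(12, j) for j in range(2))  # j = 0, 1 since V(C) = 2
# Both equal n + 1 = 13: linear, not exponential, growth in n.
```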
in $\ell^\infty(\mathcal F)$. Then we pose the following question: for what sequences of i.i.d., real-valued random variables $\xi_1, \ldots, \xi_n$ independent of the original data does it follow that
\[ \frac{1}{\sqrt n} \sum_{i=1}^n \xi_i Z_i \rightsquigarrow G? \]
In fact, these two displays turn out to be equivalent if the $\xi_i$ have zero mean, have variance 1, and satisfy the $L_{2,1}$-condition
\[ \|\xi_1\|_{2,1} = \int_0^\infty \sqrt{P\bigl( |\xi_1| > t \bigr)}\; dt < \infty. \]
Chapter 2.9 also presents conditional versions of this multiplier central limit
theorem. These are basic for part 3’s development of limit theorems for the
bootstrap empirical process.
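The $L_{2,1}$-norm is easy to evaluate numerically. In the sketch below (the integration grid and the two multiplier laws are illustrative choices), Rademacher multipliers give $\|\xi_1\|_{2,1} = 1$ exactly, since $P(|\xi_1| > t)$ is 1 on [0, 1) and 0 beyond.

```python
import math

def l21_norm(survival, upper=20.0, steps=100_000):
    # Midpoint-rule approximation of the integral of sqrt(P(|xi| > t))
    # over [0, upper]; `survival` is the tail function t -> P(|xi| > t).
    h = upper / steps
    return sum(math.sqrt(survival((i + 0.5) * h)) for i in range(steps)) * h

rademacher = lambda t: 1.0 if t < 1.0 else 0.0   # P(|xi| > t) for Rademacher
normal = lambda t: math.erfc(t / math.sqrt(2.0)) # standard normal tail

r_norm = l21_norm(rademacher)  # = 1 up to discretization error
n_norm = l21_norm(normal)      # finite, so normal multipliers also qualify
```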
Chapter 2.11 gives extensions of the Donsker theorem for sums of independent, but not identically distributed, processes $\sum_{i=1}^n Z_{ni}$ indexed by an arbitrary collection $\mathcal F$. One example, which occurs naturally in statistical
contexts and is covered by this situation, is the empirical process indexed
by a collection F of functions based on observations Xn1 , . . . , Xnn that
are independent but not identically distributed. Once again we present two
main central limit theorems, based on entropy with and without bracketing. Even when specialized to the i.i.d. situation, the theorems in Chapter 2.11
are more general than the theorems obtained before. For instance, we use
\[ P^*\bigl( \|\mathbb G_n\|_{\mathcal F} > t \bigr) \]
established in Chapter 2.15. As it turns out, bounds for these latter prob-
abilities depend much less heavily on the structure of the class F than do
bounds for the mean E∗ kGn kF itself. Several of the concentration inequal-
ities in Chapter 2.15 are sharp in that they reduce to classical bounds in
the case when the class F consists of just a single function. Chapter 2.15
concludes Part 2.
It may be noted that the seminorm $\rho_P$ is the $L_2(P)$-norm of f centered at its expectation (in fact, the standard deviation of $f(X_i)$). The centering is natural in view of the fact that the empirical process is invariant under the addition of constants to $\mathcal F$: $\mathbb G_n f = \mathbb G_n(f - c)$ for every c. Under the condition that $\sup_{f\in\mathcal F} |Pf| < \infty$, the seminorm $\rho_P$ can be replaced by the slightly simpler $L_2(P)$-seminorm without loss of generality (Problem 2.1.2).

In most of this part, the notation $\mathcal F_\delta$ means $\mathcal F_\delta = \{f - g : f, g \in \mathcal F,\ \rho(f - g) < \delta\}$, for ρ equal to either the $L_2(P)$-semimetric or $\rho_P$.
This inequality explains the appearance of the square root of the logarithm
of the entropy in the sufficient integral conditions for a class to be Donsker,
which were discussed in the preceding chapter.
While maximal inequalities based on covering numbers suffice for almost all statistical applications, they are known not to be sharp in general.
This suboptimality is especially well-understood for Gaussian maximal in-
equalities for which the classical bounds based on covering numbers fail,
but bounds based on generic chaining and majorizing measures succeed in
providing sharp bounds. Section 2.2.4 presents a brief discussion of these
alternative methods and the resulting improvements.
\[ x^{**} : f \mapsto f(x), \]
and the σ-field generated by $\mathcal F$ is precisely the Borel σ-field. It follows that the $Z_i$ are Borel measurable if and only if every $f(Z_i)$ is measurable. The conclusion is that the sequence $n^{-1/2} \sum_{i=1}^n (Z_i - E Z_i)$ converges in distribution if and only if the unit ball of the dual space is Donsker with respect to the common Borel law of the $Z_i$.
for the function f defined by f (x) = Z(f˜, x). That Z is a stochastic process
means precisely that each function x 7→ f (x) is measurable. The processes
Z1 , Z2 , . . . satisfy the central limit theorem in `∞ (F̃) if and only if the class
of functions {f : f˜ ∈ F̃} is P -Donsker.
(In case p = 2, this can be simplified to the single requirement $E\|Z_1\|_2^2 < \infty$. For this case, the theorem is given in Chapter 1.8.) In terms of empirical processes, this becomes as follows.
On the other hand, Part 2 gives many interesting results on the central
limit theorem in Banach spaces of the type `∞ (F), where F is a collection
of measurable functions. For this large class of Banach spaces, sufficient
conditions for the central limit theorem go beyond the Banach space struc-
ture, but they can be stated in terms of the structure of the function class
F.
Then (i) ⇒ (ii) ⇒ (iii), but both converses are false in general. If the functions $a_n$ are decreasing, then (i) and (ii) are equivalent, but still (iii) does not imply (ii).
[Hint: Define $a(\delta) := \limsup_{n\to\infty} a_n(\delta)$.
(i) ⇒ (ii). If the double limit in (ii) is positive, then there exist ε > 0 and
ηm ↓ 0 such that a(ηm ) > ε, for every m, whence there exist n1 < n2 < · · ·
with anm (ηm ) > ε, for every m. Define δn = ηm if n ∈ [nm , nm+1 ), and
arbitrary for n < n1 . Then δn ↓ 0 and anm (δnm ) > ε, for every m, which
contradicts that an (δn ) → 0.
(ii) ⇒ (iii). Set An (δ) = supm≥n am (δ) and fix ε > 0 and ηm ↓ 0. For
every m there exists nm with |An (ηm ) − a(ηm )| < ε, for n ≥ nm . Without
loss of generality take n1 < n2 < · · · and set δn = ηm for n ∈ [nm , nm+1 ).
Then δn ↓ 0 and an (δn ) ≤ An (δn ) ≤ a(δn ) + ε, for all n ≥ n1 , so an (δn ) → 0.
(ii) ⇒ (i) if the an are decreasing. If an (δn ) does not tend to zero for a
given δn ↓ 0, then there exist ε > 0 and n1 < n2 < · · · such that anm (δnm ) >
ε, for every m. If an is decreasing, then anm (δnj ) > ε, for every j ≤ m. Thus
a(δnj ) ≥ lim inf m→∞ anm (δnj ) ≥ ε, for every j, which contradicts a(δ) → 0.
For counterexamples consider the array g(n, m) = 1n>m , which has the
properties that m 7→ g(n, m) is decreasing, and g(n, m) → 1 if n → ∞
followed by m → ∞, while g(n, n) = 0, for every n. Set an (δ) = g(n, m), for
1/(m +1) < δ ≤ 1/m, so that an (1/m) = g(n, m). This function has an (δ) →
1 as n → ∞ for every δ and hence does not satisfy (ii), but an (δn ) = 0 → 0,
for δn = 1/n, so that an does satisfy (iii). Second, set an (δ) = 1 − g(n, m),
for 1/(m + 1) < δ ≤ 1/m. Then an (δ) → 0 for every δ and hence an satisfies
(ii), but an (δn ) = 1, for δn = 1/n.]
For our purposes, the Orlicz norms of most interest are those given by $\psi_p(x) = e^{x^p} - 1$ for p ≥ 1, which give much more weight to the tails of X.
The bound xp ≤ ψp (x) for all nonnegative x implies that kXkp ≤ kXkψp
for each p. It is not true that the exponential Orlicz norms are all bigger
than all Lp -norms. However, we have the inequalities
(see Problem 2.2.5). Since for the present purposes fixed constants in in-
equalities are irrelevant, this means that a bound on an exponential Orlicz
norm always gives a better result than a bound on an Lp -norm.
Any Orlicz norm can be used to obtain an estimate of the tail of a distribution. By Markov's inequality,
\[ P\bigl( |X| > x \bigr) \le P\Bigl( \psi\bigl( |X| / \|X\|_\psi \bigr) \ge \psi\bigl( x / \|X\|_\psi \bigr) \Bigr) \le \frac{1}{\psi\bigl( x / \|X\|_\psi \bigr)}. \]
For $\psi_p(x) = e^{x^p} - 1$, this leads to tail estimates $\exp(-C x^p)$ for any random variable with a finite $\psi_p$-norm. Conversely, an exponential tail bound of this type shows that $\|X\|_{\psi_p}$ is finite.

2.2.1 Lemma. Let X be a random variable with $P(|X| > x) \le K e^{-C x^p}$ for every x, for constants K and C, and for p ≥ 1. Then its Orlicz norm satisfies $\|X\|_{\psi_p} \le \bigl( (1 + K)/C \bigr)^{1/p}$.
Now insert the inequality on the tails of |X| and obtain the explicit upper bound KD/(C − D). This is less than or equal to 1 for $D^{-1/p}$ greater than or equal to $\bigl( (1 + K)/C \bigr)^{1/p}$.
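The Orlicz norm $\|X\|_\psi = \inf\{C > 0 : E\psi(|X|/C) \le 1\}$ can be found by bisection whenever $E\psi(|X|/C)$ is computable. The sketch below (an illustration, not from the text) does this for $X \sim \mathrm{Exp}(1)$ and $\psi_1(x) = e^x - 1$; here $P(X > x) = e^{-x}$, so Lemma 2.2.1 with K = C = p = 1 gives the bound $\bigl((1+K)/C\bigr)^{1/p} = 2$, which in this particular example happens to be attained exactly.

```python
def orlicz_psi1_norm(mgf, lo=1.0 + 1e-9, hi=100.0, tol=1e-9):
    # ||X||_{psi_1} = inf{C : E exp(|X|/C) - 1 <= 1}, found by bisection.
    # `mgf(s)` must return E exp(s |X|); monotonicity in C makes this valid.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mgf(1.0 / mid) - 1.0 <= 1.0:
            hi = mid   # mid is large enough to be an admissible C
        else:
            lo = mid
    return hi

# X ~ Exp(1): E exp(sX) = 1/(1 - s) for s < 1.
exp_mgf = lambda s: 1.0 / (1.0 - s) if s < 1.0 else float("inf")
norm = orlicz_psi1_norm(exp_mgf)
```

Here $E e^{X/D} - 1 = 1/(D - 1)$, which equals 1 exactly at D = 2, so the bisection converges to the lemma's bound.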
A similar inequality is valid for many Orlicz norms, in particular the expo-
nential ones. Here, in the general case, the factor m1/p becomes ψ −1 (m),
where ψ −1 is the inverse function of ψ.
Proof. For simplicity of notation assume first that $\psi(x)\psi(y) \le \psi(cxy)$ for all $x, y \ge 1$. In that case $\psi(x/y) \le \psi(cx)/\psi(y)$ for all $x \ge y \ge 1$. Thus, for $y \ge 1$ and any C,
\[ \max_i \psi\Bigl( \frac{|X_i|}{Cy} \Bigr) \le \max_i \Bigl[ \frac{\psi\bigl( c|X_i|/C \bigr)}{\psi(y)} + \psi\Bigl( \frac{|X_i|}{Cy} \Bigr) 1\Bigl\{ \frac{|X_i|}{Cy} < 1 \Bigr\} \Bigr] \le \sum_i \frac{\psi\bigl( c|X_i|/C \bigr)}{\psi(y)} + \psi(1). \]
Set $C = c \max_i \|X_i\|_\psi$, and take expectations to get
\[ E\, \psi\Bigl( \frac{\max_i |X_i|}{Cy} \Bigr) \le \frac{m}{\psi(y)} + \psi(1). \]
When $\psi(1) \le 1/2$, this is less than or equal to 1 for $y = \psi^{-1}(2m)$, which is greater than 1 under the same condition. Thus
\[ \Bigl\| \max_i |X_i| \Bigr\|_\psi \le \psi^{-1}(2m)\, c \max_i \|X_i\|_\psi. \]
For the present purposes, the value of the constant in the previous lemma is irrelevant. (For the $L_p$-norms, it can be taken equal to 1.) The important conclusion is that the inverse of the ψ-function determines the size of the ψ-norm of a maximum in comparison to the ψ-norms of the individual terms. The ψ-norm grows slowest for rapidly increasing ψ. For $\psi_p(x) = e^{x^p} - 1$, the growth is at most logarithmic, because
\[ \psi_p^{-1}(m) = \bigl( \log(1 + m) \bigr)^{1/p}. \]
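This logarithmic growth is visible in simulation. The sketch below (sample sizes, seed, and replication count are arbitrary choices) estimates $E \max_{i \le m} |X_i|$ for standard Gaussians, which all have the same $\psi_2$-norm; the estimates track $\psi_2^{-1}(m) = \sqrt{\log(1 + m)}$ up to a bounded factor.

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_max_abs_gaussians(m, reps=2000):
    # Monte Carlo estimate of E max_{i <= m} |X_i| for i.i.d. N(0, 1).
    return float(np.abs(rng.standard_normal((reps, m))).max(axis=1).mean())

ms = [10, 100, 1000]
ratios = [mean_max_abs_gaussians(m) / np.sqrt(np.log(1.0 + m)) for m in ms]
# The ratios stay bounded even though the maxima themselves keep growing.
```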
d(s, t) = kXs − Xt kψ .
For the present purposes, both covering and packing numbers can be
used. In all arguments one can be replaced by the other through the in-
equalities
\[ N(\varepsilon, d) \le D(\varepsilon, d) \le N\bigl( \tfrac{1}{2}\varepsilon, d \bigr). \]
Clearly, covering and packing numbers become bigger as ε ↓ 0. By defini-
tion, the semimetric space T is totally bounded if and only if the covering
and packing numbers are finite for every ε > 0. The upper bound in the
following maximal inequality depends on the rate at which D(ε, d) grows
as ε ↓ 0, as measured through an integral criterion.
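On the real line both quantities can be computed exactly for a finite set, which makes the sandwich $N(\varepsilon, d) \le D(\varepsilon, d) \le N(\varepsilon/2, d)$ checkable. In the sketch below (an integer grid standing in for the unit interval; the function names are ad hoc), the greedy left-to-right sweep is optimal in one dimension.

```python
def covering_number_1d(points, eps):
    # Minimal number of closed eps-balls with centers in the set needed to
    # cover it; on the line the greedy left-to-right sweep is optimal.
    xs = sorted(points)
    i, count = 0, 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[j + 1] <= xs[i] + eps:
            j += 1                      # furthest admissible center
        count += 1
        reach = xs[j] + eps
        while i < len(xs) and xs[i] <= reach:
            i += 1                      # skip everything this ball covers
    return count

def packing_number_1d(points, eps):
    # Maximal number of points with pairwise distances strictly > eps;
    # on the line the greedy left-to-right choice attains the maximum.
    xs = sorted(points)
    packed = []
    for p in xs:
        if not packed or p - packed[-1] > eps:
            packed.append(p)
    return len(packed)

T = range(0, 1001)                      # 1001 grid points, metric |x - y|
N, D, N_half = (covering_number_1d(T, 100), packing_number_1d(T, 100),
                covering_number_1d(T, 50))
# N(eps) <= D(eps) <= N(eps/2): here 5 <= 10 <= 10.
```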
Proof. Assume without loss of generality that the packing numbers and
the associated “covering integral” are finite. Construct nested sets T0 ⊂
T1 ⊂ T2 ⊂ · · · ⊂ T such that every Tj is a maximal set of points such
that d(s, t) > η2−j for every s, t ∈ Tj where “maximal” means that no
point can be added without destroying the validity of the inequality. By
‡ Separable may be understood in the sense that $\sup_{d(s,t)<\delta} |X_s - X_t|$ remains almost surely the same if the index set T is replaced by a suitable countable subset.
where for fixed j the maximum is taken over all links (u, v) from Tj+1 to Tj .
Thus the jth maximum is taken over at most #Tj+1 links, with each link
having a ψ-norm $\|X_u - X_v\|_\psi$ bounded by $C\, d(u, v) \le C \eta 2^{-j}$. It follows with the help of Lemma 2.2.2 that, for a constant K depending only on ψ and C,
\[ \Bigl\| \max_{s,t\in T_{k+1}} \bigl| (X_s - X_{s_0}) - (X_t - X_{t_0}) \bigr| \Bigr\|_\psi \le K \sum_{j=0}^{k} \psi^{-1}\bigl( D(\eta 2^{-j-1}, d) \bigr)\, \eta 2^{-j} \le 4 K \int_0^\eta \psi^{-1}\bigl( D(\varepsilon, d) \bigr)\, d\varepsilon. \tag{2.2.6} \]
In this bound, s0 and t0 are the endpoints of the chains starting at s and
t, respectively.
The maximum of the increments |Xsk+1 − Xtk+1 | can be bounded by
the maximum on the left side of (2.2.6) plus the maximum of the discrep-
ancies |Xs0 − Xt0 | at the end of the chains. The maximum of the latter
discrepancies will be analyzed by a seemingly circular argument. For ev-
ery pair of endpoints s0 , t0 of chains starting at two points in Tk+1 within
distance δ of each other, choose exactly one pair sk+1 , tk+1 in Tk+1 , with
d(sk+1 , tk+1 ) < δ, whose chains end at s0 , t0 . By definition of T0 , this gives
at most D2 (η, d) pairs. By the triangle inequality,
|Xs0 − Xt0 | ≤ (Xs0 − Xsk+1 ) − (Xt0 − Xtk+1 ) + |Xsk+1 − Xtk+1 |.
Take the maximum over all pairs of endpoints s0 , t0 as above. Then the
corresponding maximum over the first term on the right in the last display
is bounded by the maximum in the left side of (2.2.6). Its ψ-norm can be
bounded by the right side of this equation. Combine this with (2.2.6) to
find that
\[ \Bigl\| \max_{\substack{s,t\in T_{k+1} \\ d(s,t) < \delta}} |X_s - X_t| \Bigr\|_\psi \le 8 K \int_0^\eta \psi^{-1}\bigl( D(\varepsilon, d) \bigr)\, d\varepsilon + \Bigl\| \max |X_{s_{k+1}} - X_{t_{k+1}}| \Bigr\|_\psi. \]
2.2 Maximal Inequalities and Covering Numbers 145
Here the maximum on the right is taken over the pairs $s_{k+1}, t_{k+1}$ in $T_{k+1}$ uniquely attached to the pairs $s_0, t_0$ as above. Thus the maximum is over at most $D^2(\eta, d)$ terms, each of whose ψ-norms is bounded by δ. Its ψ-norm is bounded by $K\, \psi^{-1}\bigl( D^2(\eta, d) \bigr)\, \delta$.
Thus the upper bound given by the theorem is a bound for the maxi-
mum of increments over Tk+1 . Let k tend to infinity to conclude the proof.
The corollary follows immediately from the previous proof, after noting
that, for η equal to the diameter of T , the set T0 consists of exactly one
point. In that case s0 = t0 for every pair s, t, and the increments at the end
of the chains are zero. The corollary also follows from the theorem upon
taking η = δ = diam T and noting that D(η, d) = 1, so that the second term in the maximal inequality can also be written $\delta\, \psi^{-1}\bigl( D(\eta, d) \bigr)$. Since the function $\varepsilon \mapsto \psi^{-1}\bigl( D(\varepsilon, d) \bigr)$ is decreasing, this term can be absorbed into the integral, perhaps at the cost of increasing the constant K.
Proof. For any λ and a Rademacher variable ε, one has $E e^{\lambda\varepsilon} = (e^\lambda + e^{-\lambda})/2 \le e^{\lambda^2/2}$, where the last inequality follows after writing out the power series. Thus, by Markov's inequality, for any λ > 0,
\[ P\Bigl( \sum_{i=1}^n a_i \varepsilon_i > x \Bigr) \le e^{-\lambda x}\, E\, e^{\lambda \sum_{i=1}^n a_i \varepsilon_i} \le e^{(\lambda^2/2) \|a\|^2 - \lambda x}. \]
The best upper bound is obtained for $\lambda = x/\|a\|^2$ and is the exponential in the probability bound of the lemma. Combination with a similar bound for the lower tail yields the probability bound.
The bound on the ψ₂-norm is a consequence of the probability bound in view of Lemma 2.2.1.
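Hoeffding's bound can be compared with a Monte Carlo estimate of the tail; the weights, the level x, and the replication count in this sketch are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def hoeffding_check(a, x, reps=100_000):
    # Empirical P(sum_i a_i eps_i > x) for Rademacher eps_i, versus the
    # sub-Gaussian bound exp(-x^2 / (2 ||a||^2)) from the lemma.
    a = np.asarray(a, dtype=float)
    eps = rng.choice([-1.0, 1.0], size=(reps, len(a)))
    tail = float(np.mean(eps @ a > x))
    bound = float(np.exp(-x ** 2 / (2.0 * (a @ a))))
    return tail, bound

tail, bound = hoeffding_check(np.ones(50), x=10.0)
# The empirical tail sits below the bound exp(-1), with room to spare.
```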
Proof. Apply the general maximal inequality with $\psi_2(x) = e^{x^2} - 1$ and η = δ. Since $\psi_2^{-1}(m) = \sqrt{\log(1 + m)}$, we have $\psi_2^{-1}\bigl( D^2(\delta, d) \bigr) \le \sqrt{2}\, \psi_2^{-1}\bigl( D(\delta, d) \bigr)$. Thus the second term in the maximal inequality can first be replaced by $\sqrt{2}\, \delta\, \psi_2^{-1}\bigl( D(\eta, d) \bigr)$ and next be incorporated in the first (the covering integral) at the cost of increasing the constant. We obtain
\[ \Bigl\| \sup_{d(s,t)\le\delta} |X_s - X_t| \Bigr\|_{\psi_2} \le K \int_0^\delta \sqrt{\log\bigl( 1 + D(\varepsilon, d) \bigr)}\; d\varepsilon. \]
Here D(ε, d) ≥ 2 for every ε that is strictly less than the diameter of T. Since $\log(1 + m) \le 2 \log m$ for m ≥ 2, the 1 inside the logarithm can be removed at the cost of increasing K.
for v ≥ var(Y1 + · · · + Yn ).
Proof. See Pollard (1984) or Shorack and Wellner (1986), page 855.
\[ E |Y|^m \le M^{m-2}\, E Y^2, \]
for every m ≥ 2. A much weaker inequality than this is the essential element
in the proof of Bernstein’s inequality.
for v ≥ v1 + · · · + vn .
for $\Omega_{t_{j+1},x}$ the event that $|X_{t_{j+1}} - X_{t_j}| > x\, d(t_{j+1}, t_j)\, 4^{j+1}$. We show at the end of the proof that
\[ A_k := \sup_{t\in T_{k+1}} \sum_{j=0}^{k} d(t_{j+1}, t_j)\, 4^{j+1} \le 4 \sup_{t\in T} \sum_{j=0}^{\infty} d(t, T_j)\, 4^{j+1} =: 4A. \]
The right side A is the first term in the general bound given by the theorem, and the only term in the bound for the exponential ψ. Therefore it suffices to complement the bound $A_k \le 4A$ by a bound on the first variable on the right in the second last display. We derive this bound first for the exponential ψ and next for general ψ.
For $\psi(x) = e^{x^\alpha} - 1$ the preceding decomposition gives that
\[ P\Bigl( \sup_{t\in T} |X_t - X_{t_0}| > x A \Bigr) \le P\Bigl( \sum_{j=0}^{k} |X_{t_{j+1}} - X_{t_j}|\, 1_{\Omega_{t_{j+1},x}} > 0 \Bigr) \le \sum_{j=0}^{k} \sum_{u\in T_{j+1}} P(\Omega_{u,x}) \le \sum_{j=0}^{k} \frac{\psi(4^{j+1})}{\psi(x 4^{j+1}/C)} \le 2 e^{-(x/C)^\alpha}, \]
$y \ge 0$, by concavity of $\varphi^{-1}$ with $\varphi^{-1}(0) = 0$, the right side of the display does not become smaller if A is replaced by $A \vee \tilde A$, where we take
\[ \tilde A = 2 \sum_j \psi(4^{j+1})^{-1} \sum_{t\in T_{j+1}} d(t, T_j)\, 4^j, \]
which is bigger than twice the second infinite series in the upper bound given by the theorem. By monotonicity of φ we conclude that
\[ \varphi\Bigl( \frac{\sum_{j=0}^{k} |X_{t_{j+1}} - X_{t_j}|\, 1_{\Omega_{t_{j+1},x}}}{A \vee \tilde A} \Bigr) \le \sum_{j=0}^{k} \frac{\psi\Bigl( \frac{|X_{t_{j+1}} - X_{t_j}|}{d(t_{j+1}, t_j)} \Bigr)}{\psi(4^{j+1})}\, \frac{d(t_{j+1}, t_j)\, 4^{j+1}}{A \vee \tilde A} \le \sum_{j=0}^{k} \sum_{u\in T_{j+1}} \frac{\psi\Bigl( \frac{|X_u - X_{\pi_j u}|}{d(u, \pi_j u)} \Bigr)}{\psi(4^{j+1})}\, \frac{d(u, \pi_j u)\, 4^{j+1}}{A \vee \tilde A}. \]
It follows that the φ-Orlicz norm of $\sum_{j=0}^{k} |X_{t_{j+1}} - X_{t_j}|\, 1_{\Omega_{t_{j+1},1}}$ is bounded above by $A \vee \tilde A$. This concludes the proof of the theorem for a general Young function ψ.
The definition of $t_i$ implies that $d(t_i, t_{i+1}) \le d(\pi_i t, t_{i+1})$. Combining this with the triangle inequality, we see that $d(t, t_i) \le d(t, t_{i+1}) + d(\pi_i t, t_{i+1}) \le 2 d(t, t_{i+1}) + d(\pi_i t, t)$, by a second application of the triangle inequality, and hence $d(t, t_i) \le d(t, T_i) + 2 d(t, t_{i+1})$. Iterating this inequality for $i = j, j+1, \ldots, k$, we find $d(t, t_j) \le \sum_{i=j}^{k} d(t, T_i)\, 2^{i-j}$, for every $t \in T_{k+1}$, where the remainder term of the last iteration involving $d(t, t_{k+1})$ has vanished as $t = t_{k+1}$. We conclude that
\[ \sum_{j=0}^{k} d(t, t_j)\, 4^j \le \sum_{j=0}^{k} \sum_{i=j}^{k} d(t, T_i)\, 2^{i-j}\, 4^j \le 2 \sum_{i=0}^{k} d(t, T_i)\, 4^i. \]
By another application of the triangle inequality, $\sum_{j=0}^{k} d(t_{j+1}, t_j)\, 4^j$ is not bigger than twice the left side, and hence twice the right side, of this display, for any $t \in T_{k+1}$. This completes the proof that $A_k \le 4A$.
constants and scaling, then the assumption is also satisfied for φ = ψ, and
the theorem provides a bound on the ψ-norm of the supremum.
The first term in the bound in the theorem involves a supremum over t taken outside the infinite sum. This expression increases if the supremum is moved inside the sum. The second term in the bound is a sum of averages over the sets $T_{j+1}$ and increases if these are replaced by suprema. It follows that both terms are bounded above by
\[ \sum_{j=0}^{\infty} \sup_{t\in T} d(t, T_j)\, 4^j. \]
where the infimum is taken over all sequences of nested partitions $T = \cup_i T_{i,j}$ consisting of at most $2^{2^j}$ sets. The latter number is equal to $\psi(2^{j/\alpha}) + 1$ for the Orlicz norm $\psi(x) = 2^{x^\alpha} - 1$. (The base 4 in the powers $4^j$ in the theorem is arbitrary in general and could be replaced by other numbers bigger than 1. See Problem 2.2.18.)
A zero-mean Gaussian process X satisfies the increment bound $\|X_s - X_t\|_\psi \lesssim d(s, t)$ for $\psi(x) = e^{x^2} - 1$ and d the standard deviation metric. The relevant generic chaining number is $\gamma_2(T, d)$. In this case the generic chaining upper bound is known to be sharp up to multiplicative constants.
for a constant K depending on ψ and C only, and $D(t, T) = \sup_{s\in T} d(t, s)$.
where the sets $A_j(t_l)$ are defined by $A_j(t_l) = \{ t \in T : d(t, t_l)/2 < r_j(t) + r_j(t_l) \}$, for l ≥ 1, and the recursions continue until the sets $A_j(t_l)$ have exhausted T. By construction $d(t_k, t_l)/2 \ge r_j(t_k) + r_j(t_l)$, for k > l, whence the balls $B_d\bigl(t_l, r_j(t_l)\bigr)$ for different $t_l \in T_j$ are disjoint (easily), and hence $1 = \mu(T) \ge \sum_l \mu\bigl( B_d(t_l, r_j(t_l)) \bigr) \ge |T_j| / \psi(4^j)$. This shows that the recursive procedure ends after finitely many steps and yields a set of cardinality $|T_j| \le \psi(4^j)$. By construction $d(t, t_l) < 2 r_j(t) + 2 r_j(t_l) \le 4 r_j(t)$ if $t_l$ is the first element of $T_j$ with $t \in A_j(t_l)$, whence $d(t, T_j) \le 4 r_j(t)$, for every $t \in T$. If we set $r_0(t) = D(t, T)$, then this is also true for j = 0 and an arbitrary one-point set $T_0$. Since $\mu\bigl( B_d(t, \varepsilon) \bigr) < 1/\psi(4^j)$ for $\varepsilon \in \bigl( r_{j+1}(t), r_j(t) \bigr)$, by the definition of $r_j(t)$, and hence $\psi^{-1}\bigl( 1/\mu(B_d(t, \varepsilon)) \bigr) \ge 4^j$, we have
\[ \int_0^{D(t,T)} \psi^{-1}\Bigl( \frac{1}{\mu\bigl( B_d(t, \varepsilon) \bigr)} \Bigr)\, d\varepsilon \ge \sum_{j=0}^{\infty} 4^j \bigl( r_j(t) - r_{j+1}(t) \bigr) \ge \frac{3}{4} \sum_{j=0}^{\infty} r_j(t)\, 4^j \ge \frac{3}{4} \sum_{j=0}^{\infty} d(t, T_j)\, 4^{j-1}. \]
This concludes the proof that the majorizing measure integral dominates
the first generic chaining number.
Next consider the second generic chaining number of the sequence
T0 , T1 , . . .. We claim that rj+1 (t) ≤ 2rj+1 (s), for any t ∈ Tj+1 and s ∈ T
with d(s, t) ≤ rj+1 (t). Indeed, let tl be the first element of Tj+1 such that
s ∈ Aj+1 (tl ). If tl appeared after t, then rj+1 (t) ≤ rj+1 (tl ) ≤ rj+1 (s), by
the definitions of t and $t_l$ as minimizers of $r_j$. On the other hand, if $t_l$ appeared before t in the recursions, then $t \notin A_{j+1}(t_l)$, as it was selected later for $T_{j+1}$, and hence
\[ r_{j+1}(t) + r_{j+1}(t_l) \le \frac{d(t, t_l)}{2} \le \frac{d(t, s)}{2} + \frac{d(s, t_l)}{2} < \frac{r_{j+1}(t)}{2} + r_{j+1}(s) + r_{j+1}(t_l), \]
which implies the claim rj+1 (t) ≤ 2rj+1 (s), after rearranging.
Since $B_d\bigl( t, r_j(s) + d(s, t) \bigr) \supset B_d\bigl( s, r_j(s) \bigr)$ by the triangle inequality, the definitions of $r_j(s)$ and $r_j(t)$ give that $r_j(t) \le r_j(s) + d(s, t)$. Combining this with the result of the preceding paragraph, we conclude that $r_j(t) \le r_j(s) + 2 r_{j+1}(s)$, whenever $t \in T_{j+1}$ and $d(s, t) \le r_{j+1}(t)$. Consequently, for every $t \in T_{j+1}$,
\[ \frac{d(t, T_j)}{\psi(4^j)} \le 4\, r_j(t)\, \mu\bigl( B_d(t, r_{j+1}(t)) \bigr) \le 4 \int_{B_d(t, r_{j+1}(t))} \bigl( r_j(s) + 2 r_{j+1}(s) \bigr)\, d\mu(s). \]
Since the balls $B_d\bigl( t, r_{j+1}(t) \bigr)$, for $t \in T_{j+1}$, are disjoint, the sum of the left side over $t \in T_{j+1}$ is bounded above by $4 \int_T \bigl( r_j(s) + 2 r_{j+1}(s) \bigr)\, d\mu(s)$. Next multiply by $4^j$ and sum over j to find that
\[ \sum_{j=0}^{\infty} \sum_{t\in T_{j+1}} \frac{d(t, T_j)\, 4^j}{\psi(4^j)} \le 4 \int_T \sum_{j=0}^{\infty} \bigl( r_j(s) + 2 r_{j+1}(s) \bigr) 4^j\, d\mu(s). \]
The integrand is bounded above by $(3/2) \sum_{j=0}^{\infty} r_j(s)\, 4^j$, which was already seen to be bounded above by the majorizing measure integral at the end of the first part of the proof.
the quantities
\[ S(T, d, \psi) = \sup_{X} E \sup_{s,t} |X_s - X_t|, \]
\[ \bar M(T, d, \psi) = \sup_{\mu} \int_T \int_0^{D(t,T)} \psi^{-1}\Bigl( \frac{1}{\mu\bigl( B_d(t, \varepsilon) \bigr)} \Bigr)\, d\varepsilon\, d\mu(t), \]
\[ M(T, d, \psi) = \inf_{\mu} \sup_{t\in T} \int_0^{D(t,T)} \psi^{-1}\Bigl( \frac{1}{\mu\bigl( B_d(t, \varepsilon) \bigr)} \Bigr)\, d\varepsilon, \]
\[ A(T, d, \psi) = \inf_{(T_j)} \Bigl[ \sup_{t\in T} \sum_{j=0}^{\infty} d(t, T_j)\, 4^j + \sum_{j=0}^{\infty} \frac{1}{|T_{j+1}|} \sum_{t\in T_{j+1}} d(t, T_j)\, 4^j \Bigr]. \]
Here the supremum in the definition of $S(T, d, \psi)$ is taken over all separable stochastic processes X that satisfy the increment bound $\|X_s - X_t\|_\psi \le d(s, t)$, and the infimum in $A(T, d, \psi)$ is taken over all ψ-admissible sequences $T_0, T_1, \ldots$ of subsets of T.
The first of these quantities may be smaller than the other three. How-
ever, the following theorem shows that the metric d may be replaced by a
larger metric, giving a larger class of processes X in the definition of S and
hence increasing S, so that all four quantities become equivalent, and are
also equivalent to the last three quantities for the original metric d.
Proof. The following three inequalities are true for any metric d: $S(T, d, \psi) \lesssim A(T, d, \psi)$ (by Theorem 2.2.13); $A(T, d, \psi) \lesssim M(T, d, \psi)$ (by Theorem 2.2.15); and $M(T, d, \psi) \le 2 \bar M(T, d, \psi)$ (by a convexity argument, see Problem 2.2.19). Furthermore, Bednorz (2010) shows that $\bar M(T, d, \psi) \lesssim S(T, d, \psi)$, for any p-concave metric d. It follows that the four quantities are equivalent up to multiplicative constants when d is p-concave.
The condition on ψ implies (1.2) in Bednorz (2012), who constructs an ultra-metric ρ ≥ d on T for which $A(T, \rho, \psi) \lesssim A(T, d, \psi)$. (By definition, an ultra-metric satisfies $\rho(s, t) \le \rho(s, u) \vee \rho(t, u)$, for every s, t, u; in particular, it is p-concave for every p > 1.) By the preceding paragraph the four quantities with d replaced by ρ are then equivalent. Since ρ ≥ d, the four quantities with d are bounded above by the corresponding quantities with ρ. Then
by a change of variables. The second term on the far right is bounded above by a multiple of $D^p (p/(\alpha e))^{p/\alpha + 1/2}$, by Stirling's formula. Conclude that $D^{-1} \|X\|_p / p^{1/\alpha} \lesssim (\log 2/p)^{1/\alpha} + (p/\alpha)^{1/(2p)} (\alpha e)^{-1/\alpha}$. The right side is decreasing in p, for fixed α, and hence is bounded in p ≥ 1.]
5. (Comparing ψ_p-norms) For any X, the numbers $(\log 2)^{1/p} \|X\|_{\psi_p}$ for $\psi_p(x) = e^{x^p} - 1$ are nondecreasing in p ≥ 1.
[Hint: The function φ for which $\psi_p\bigl( x (\log 2)^{1/p} \bigr) = \varphi\bigl( \psi_q\bigl( (\log 2)^{1/q} x \bigr) \bigr)$ is concave and satisfies φ(1) = 1, for q ≥ p. Use Jensen's inequality.]
2.2 Problems and Complements 159
9. If kXkψ < ∞, then kXkσψ < ∞ for every σ > 0. Markov’s inequality yields
the bound P(|X| > t) ≤ 1/σ · 1/ψ(t/kXkσψ ) for every t. Is there an optimal
value of σ?
10. The condition on the quotient ψ(x)ψ(y)/ψ(cxy) in Lemma 2.2.2 is not automatic. For instance, the function $\psi(x) = (x + 1)\log(x + 1) - x$ is convex and increasing for x ≥ 0 but does not satisfy the condition.
11. The function ψp (x) = exp(xp ) − 1 satisfies the hypothesis of Lemma 2.2.2 for
every 0 < p ≤ 2.
12. Let ψ be a convex, positive function, with ψ(0) = 0 such that log ψ is convex.
Then there is a constant σ such that φ(x) = σψ(x) satisfies both φ(1) < 1
and φ(x)φ(y) ≤ φ(xy), for all x, y ≥ 1.
[Hint: Define $\varphi(x) = \psi(x)/\bigl( 2\psi(1) \bigr)$ for x ≥ 1. Then $\varphi(1) = 1/2$, and $g(x, y) = \varphi(x)\varphi(y)/\varphi(xy)$ satisfies $g(1, 1) = 1/2$ and is decreasing in both x and y for x, y ≥ 1 if log ψ is convex. The derivative of g with respect to x is given by
\[ \frac{\psi(x)\psi(y)}{2\psi(1)\psi(xy)} \Bigl( \frac{\psi'(x)}{\psi(x)} - \frac{y\, \psi'(xy)}{\psi(xy)} \Bigr), \]
which is less than or equal to 0, since y ≥ 1 and log ψ is convex. By symmetry, the derivative with respect to y is also bounded above by zero.]
15. Let T be a compact, semimetric space and q > 0. Then the map x 7→ xq
from the nonnegative functions in `∞ (T ) to `∞ (T ) is continuous at every
continuous x.
[Hint: It suffices to show that kxn −xk∞ → 0 implies that xqn (tn )−xq (tn ) → 0
for every sequence tn . By compactness of T , the sequence tn may be assumed
convergent. Since x is continuous, x(tn ) → x(t). Thus xn (tn ) → x(t), whence
xqn (tn ) → xq (t).]
16. Suppose the metric space T has a finite diameter and log N (ε, T, d) ≤ h(ε),
for all sufficiently small ε > 0 and a fixed, positive, continuous function h.
Then there exists a constant K with log N (ε, T, d) ≤ Kh(ε) for all ε > 0.
[Hint: For sufficiently large ε, the entropy numbers are zero, and for
sufficiently small ε, the inequality is valid with K = 1. The function
log N (ε, T, d)/h(ε) is bounded on intervals that are bounded away from zero
and infinity.]
17. Let X be a stochastic process with sup_{ρ(s,t)<δ} |X(s) − X(t)| → 0 in outer probability as δ ↓ 0. Then almost all sample paths of X are uniformly continuous.
[Hint: Given a decreasing sequence of numbers ε_k → 0, there are numbers δ_k > 0 such that P*( sup_{ρ(s,t)<δ_k} |X(s) − X(t)| > ε_k ) ≤ 2^{−k} for every k. By the Borel-Cantelli lemma, sup_{ρ(s,t)<δ_k} |X(s) − X(t)| ≤ ε_k for all sufficiently large k, almost surely.]
18. For every sequence T_0, T_1, . . . of subsets with |T_0| = 1 and |T_j| ≤ ψ(4^j), for j ≥ 1, there exists a sequence T_0′, T_1′, . . . of subsets with |T_0′| = 1 and |T_j′| ≤ ψ(2^{j/2}), for j ≥ 1, such that sup_t Σ_{j=0}^∞ d(t, T_j′) 2^{j/2} ≤ 4 sup_t Σ_{j=0}^∞ d(t, T_j) 4^j. Conversely, for every sequence T_0′, T_1′, . . . of subsets with |T_0′| = 1 and |T_j′| ≤ ψ(2^{j/2}), for j ≥ 1, there exists a sequence T_0, T_1, . . . with the analogous property in the opposite direction.
19. For any compact metric space (T, d) and Young function ψ, we have M(T, d, ψ) ≤ 2M̄(T, d, ψ). (See Section 2.2.4 for this notation.)
[Hint: Suppose first that T is finite and that the map x ↦ ψ^{−1}(1/x) is convex. Suppose M̄ := M̄(T, d, ψ) < ∞ and set f_µ(t) = ∫ ψ^{−1}( 1/µ(B_d(t, ε)) ) dε. The set F of functions f: T → R such that f ≥ f_µ, pointwise, for some µ, can be identified with a closed, convex subset of R^T. If the constant function M̄(T, ψ, d) were not contained in F, then the separating hyperplane theorem would give a vector ν ≠ 0 such that ν^T M̄ < ν^T f, for every f ∈ F. Since f + u1_t ∈ F, for every u ≥ 0 if f ∈ F, the vector ν cannot have negative coordinates, and hence may be assumed to be a probability vector. Then M̄ < ν^T f_ν, which contradicts the definition of M̄(T, ψ, d). Thus M̄ ∈ F, which implies that M̄ ≥ f_µ, for some µ. If T is not finite, then this argument gives for every finite S ⊂ T a measure µ_S with M̄ ≥ f_{µ_S}(t), for every t ∈ S. By Prohorov's theorem the net µ_S has a weak limit point µ, and by the Portmanteau theorem this satisfies f_µ(t) ≤ lim inf_S f_{µ_S}(t), for every t ∈ T. Finally, if x ↦ ψ^{−1}(1/x) is not convex, then there exists a convex function ξ with ξ(x) ≤ ψ^{−1}(1/x) ≤ 2ξ(x).]
2.3.1 Symmetrization
Let ε_1, . . . , ε_n be i.i.d. Rademacher random variables. Instead of the empirical process
f ↦ (P_n − P)f = n^{−1} Σ_{i=1}^n ( f(X_i) − Pf ),
one studies the symmetrized process
f ↦ P_n° f = n^{−1} Σ_{i=1}^n ε_i f(X_i).
Here the repeated outer expectation can be bounded above by the joint
outer expectation E∗ by Lemma 1.2.6.
Adding a minus sign in front of a term f (Xi ) − f (Yi ) has the ef-
fect of exchanging Xi and Yi . By construction of the underlying prob-
ability space as a product space, the outer expectation of any function
f (X1 , . . . , Xn , Y1 , . . . , Yn ) remains unchanged under permutations of its 2n
arguments. Hence the expression
E*Φ( ‖ n^{−1} Σ_{i=1}^n e_i ( f(X_i) − f(Y_i) ) ‖_F )
is the same for any n-tuple (e_1, . . . , e_n) ∈ {−1, 1}^n. Deduce that
E*Φ( ‖P_n − P‖_F ) ≤ E_ε E*_{X,Y} Φ( ‖ n^{−1} Σ_{i=1}^n ε_i ( f(X_i) − f(Y_i) ) ‖_F ).
Use the triangle inequality to separate the contributions of the X's and the Y's, and next use the convexity of Φ to bound the previous expression by
(1/2) E_ε E*_{X,Y} Φ( 2‖ n^{−1} Σ_{i=1}^n ε_i f(X_i) ‖_F ) + (1/2) E_ε E*_{X,Y} Φ( 2‖ n^{−1} Σ_{i=1}^n ε_i f(Y_i) ‖_F ).
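As a numerical illustration of the symmetrization device, the sketch below makes choices not fixed by the text: F is taken as the indicators 1[0, t] on [0, 1], P is the uniform distribution, and Φ is the identity, so the chain above reduces to E‖P_n − P‖_F ≤ 2 E‖n^{−1} Σ ε_i f(X_i)‖_F.

```python
import random

random.seed(0)

def sup_emp_minus_P(xs):
    # ||Pn - P||_F for F = {1[0, t]} and P uniform: the Kolmogorov-Smirnov statistic
    n = len(xs)
    s = sorted(xs)
    return max(max(abs((i + 1) / n - s[i]), abs(i / n - s[i])) for i in range(n))

def sup_symmetrized(xs, signs):
    # sup_t | n^-1 sum_i eps_i 1{x_i <= t} |: track partial sums in sorted order
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    running, best = 0.0, 0.0
    for i in order:
        running += signs[i] / n
        best = max(best, abs(running))
    return best

n, reps = 200, 200
lhs = rhs = 0.0  # Monte Carlo averages of the two suprema
for _ in range(reps):
    xs = [random.random() for _ in range(n)]
    signs = [random.choice([-1, 1]) for _ in range(n)]
    lhs += sup_emp_minus_P(xs) / reps
    rhs += sup_symmetrized(xs, signs) / reps
```

With these choices the simulated averages respect the symmetrization bound lhs ≤ 2·rhs, typically with room to spare.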
are measurable for every n-tuple (e1 , . . . , en ) ∈ {−1, 1}n . For the intended
application of Fubini’s theorem, it suffices that this is the case for the
completion of (X n , An , P n ).
This follows from Example 1.7.5 upon noting that the conditions imply that the process (x_1, . . . , x_n, f) ↦ Σ_{i=1}^n e_i f(x_i) is jointly measurable on X^n × F.
This “measurable Suslin condition” replaces the original problem of
showing P -measurability of F by the problem of finding a suitable topology
on F for which the map (x, f ) 7→ f (x) is measurable. This is sometimes
easy – for instance, if the class F is indexed by a parameter in a nice manner
– but sometimes as hard as the original problem.
The far left and far right sides do not depend on the particular f, and the inequality between them is valid on the set { ‖Σ_{i=1}^n Z_i‖_F > x }. Integrate the two sides out with respect to Z_1, . . . , Z_n over this set to obtain
β P*( ‖ Σ_{i=1}^n Z_i ‖_F > x ) ≤ P*_Z P*_Y ( ‖ Σ_{i=1}^n (Y_i − Z_i) ‖_F > x/2 ).
I.i.d. processes Z1 , . . . , Zn with mean zero satisfy the condition for the
given β in view of Chebyshev’s inequality.
lim_{x→∞} x² sup_n P*( ‖ n^{−1/2} Σ_{i=1}^n Z_i ‖_F > x ) = 0.
In particular, the random variable ‖Z_1‖*_F possesses a weak second moment. If a distribution on the real line has tails of the order o(x^{−2}), then all its moments of order 0 < r < 2 exist. One interesting consequence of the preceding lemma is that the sequence of moments E*‖n^{−1/2} Σ_{i=1}^n Z_i‖_F^r is uniformly bounded, for every 0 < r < 2. Given the weak convergence, this gives that the sequence E*‖n^{−1/2} Σ_{i=1}^n Z_i‖_F^r converges to the corresponding moment of the limit variable.
Proof. The equivalence of (i) and (ii) is clear from the general results on
weak convergence in `∞ (F) obtained in Chapter 1.5. Condition (iii) implies
(ii) by Markov’s inequality.
Suppose (i) and (ii) hold. Then P*( ‖Z_1‖_F > x ) = o(x^{−2}) as x tends to infinity.
sequence g_m in G with g_m → f and G(g_m, ω) → G(f, ω). A stochastic process {G̃(f): f ∈ F} is a separable version of a given process {G(f): f ∈ F} if G̃ is separable and G̃(f) = G(f) almost surely, for every f. (The two processes must be defined on the same probability space, and the exceptional null sets may depend on f.)
It is well known that every stochastic process possesses a separable
version (possibly taking values in the extended real line). By its definition, a
separable process has excellent measurability properties. In this subsection
it is shown that an empirical process possesses a separable version that
is itself an empirical process. This fact is used in Chapter 2.10 (and there
only) to remove unnecessary measurability conditions from two preservation
theorems.
For any separable process G̃ with countable “separant” G,
sup_{ρ(f,g)<δ; f,g∈F} |G̃(f) − G̃(g)| = sup_{ρ(f,g)<δ; f,g∈G} |G̃(f) − G̃(g)|, a.s.
G̃n (f ) = Gn (f˜)
The preceding theorem shows that the difference between Gn and G̃n
must be small due to the fact that the differences f − f˜ are small: the
difference cannot be small through the cancellation of positive terms by
negative terms.
There is an analogous, but simpler result for Glivenko-Cantelli classes.
The functions f −f˜ are zero almost surely. The argument in the proof of
Theorem 2.3.15 shows that the last mentioned convergence is equivalent to
convergence in outer probability to zero of the sequence supf ∈F Pn |f − f˜|.
is the restriction [to the set of functions of the type (x1 , . . . , xn ) 7→ f (xi ),
where f ranges over L∞ (X , A, P )] of a lifting of L∞ (X n , An , P n ) for each
1 ≤ i ≤ n.
Combination of the last two displayed equations yields that ρ_n 1_{A_D}(x) ≥ 1_{A_f}(x) for every x. If x ∈ A_f and x ∉ N_n, then it follows that x ∈ A_D.
It has been proved that, for every open U ⊂ F and open G ⊂ R^n, there exists a countable set D_{U,G} ⊂ U and a null set N_{n,U,G} ⊂ X^n such that
x ∉ N_{n,U,G} ∧ ( ρΦ∘f(x_1), . . . , ρΦ∘f(x_n) ) ∈ G ∧ f ∈ U
⇒ ∃ g ∈ D_{U,G}: ( ρΦ∘g(x_1), . . . , ρΦ∘g(x_n) ) ∈ G.
Repeat the construction for every U and G in countable bases for the open sets in F and R^n, respectively. Let N_n and G be the unions of all null sets N_{n,U,G} and of all countable sets D_{U,G}, respectively. Then for every x ∉ N_n and f ∈ F, there exists a sequence g_m in G with ( ρΦ∘g_m(x_1), . . . , ρΦ∘g_m(x_n) ) → ( ρΦ∘f(x_1), . . . , ρΦ∘f(x_n) ) and g_m → f in L_r(P). Thus the class of functions {ρ(Φ∘f): f ∈ F} is pointwise separable.
6. If X_1, X_2, . . . are i.i.d. random variables with E|X_1| < ∞ and r < 1, then E( sup_{n≥1} |X_n|/n )^r < ∞.
[Hint: Use Problem 2.3.5.]
7. For i.i.d. random variables X_1, X_2, . . ., the expectation E sup_{n≥1} |X_n|/n is finite if and only if E( |X_1|(1 ∨ log |X_1|) ) is finite.
[Hint: Use Problem 2.3.5.]
Proof. Fix ε > 0. Choose finitely many ε-brackets [l_i, u_i] whose union contains F and such that P(u_i − l_i) < ε for every i. Then, for every f ∈ F, there is a bracket such that
P_n f ≤ P_n u_i = P u_i + (P_n − P)u_i ≤ P f + ε + (P_n − P)u_i.
Consequently,
sup_{f∈F} (P_n − P)f ≤ max_i (P_n − P)u_i + ε.
The right side converges almost surely to ε by the strong law of large
numbers for real variables. Combination with a similar argument for
inf f ∈F (Pn − P )f yields that lim sup kPn − P k∗F ≤ ε almost surely, for every
ε > 0. Take a sequence εm ↓ 0 to see that the limsup must actually be zero
almost surely.
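The bracketing step of this proof can be traced numerically for the classical class F = {1[0, t]: t ∈ [0, 1]} under the uniform distribution; the class, the measure, and the grid of brackets below are choices made here for illustration, not part of the proof.

```python
import random
random.seed(1)

n, eps = 5000, 0.1
xs = sorted(random.random() for _ in range(n))  # uniform sample, P = uniform[0,1]

def Pn_minus_P(t):
    # (Pn - P) 1[0, t] under P = uniform[0, 1]
    return sum(x <= t for x in xs) / n - t

# brackets: for t in (t_{j-1}, t_j] use l = 1[0, t_{j-1}], u = 1[0, t_j], so P(u - l) = eps
uppers = [j * eps for j in range(1, int(1 / eps) + 1)]

# supremum over the full class (attained at the jump points of the empirical cdf)
sup_full = max(max((i + 1) / n - xs[i] for i in range(n)), 0.0)
# the bound from the proof: sup_f (Pn - P) f <= max_i (Pn - P) u_i + eps
bound = max(Pn_minus_P(t) for t in uppers) + eps
```

The inequality sup_full ≤ bound holds deterministically by the bracketing argument, whatever the sample; as n grows both sides shrink toward ε and then ε can be sent to zero, as in the proof.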
2.4 Glivenko-Cantelli Theorems 179
Both the statement and the proof of the following theorem are more
complicated than the previous bracketing theorem. However, its sufficiency
condition for the Glivenko-Cantelli property can be checked for many classes
of functions by elegant combinatorial arguments, as discussed in a later
chapter. Furthermore, its random entropy condition is necessary.
by the triangle inequality, for every M > 0. For sufficiently large M , the
last term is arbitrarily small. To prove convergence in mean, it suffices to
show that the first term converges to zero for fixed M . Fix X1 , . . . , Xn . If
G is an ε-net in L1 (Pn ) over FM , then
(2.4.4) E_ε ‖ n^{−1} Σ_{i=1}^n ε_i f(X_i) ‖_{F_M} ≤ E_ε ‖ n^{−1} Σ_{i=1}^n ε_i f(X_i) ‖_G + ε.
The cardinality of G can be chosen equal to N( ε, F_M, L_1(P_n) ). Bound the L_1-norm on the right by the Orlicz norm for ψ_2(x) = exp(x²) − 1, and use the maximal inequality Lemma 2.2.2 to find that the last expression does not exceed a multiple of
√( 1 + log N( ε, F_M, L_1(P_n) ) ) · sup_{f∈G} ‖ n^{−1} Σ_{i=1}^n ε_i f(X_i) ‖_{ψ_2|X} + ε,
The first term is bounded for fixed ε, M > 0, and the second was seen to
converge to zero in mean. This concludes the proof of (2.4.5).
In Chapter 2.6 it is shown that the entropy in the left side is uniformly bounded by a constant for so-called Vapnik-Červonenkis classes F. Hence any appropriately measurable Vapnik-Červonenkis class is Glivenko-Cantelli, provided its envelope function is integrable.
‖P_{n+1}‖*_F ≤ (n + 1)^{−1} Σ_{i=1}^{n+1} ‖P_n^i‖*_F, a.s.
This is true for any version of the measurable covers. Choose the left side
Sn+1 -measurable and take conditional expectations with respect to Sn+1
to arrive at
‖P_{n+1}‖*_F ≤ (n + 1)^{−1} Σ_{i=1}^{n+1} E( ‖P_n^i‖*_F | S_{n+1} ), a.s.
Each term in the sum on the right side changes on at most a null set if ‖P_n^i‖*_F is replaced by another version. For h* given in the first paragraph of the proof, choose the version h*(X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_{n+1}) for ‖P_n^i‖*_F. Since the function h* is permutation-symmetric, it follows by symmetry that the conditional expectations E( ‖P_n^i‖*_F | S_{n+1} ) do not depend on i (almost surely). Thus the right side of the display equals E( ‖P_n^{n+1}‖*_F | S_{n+1} ).
7. The σ-field Sn in Lemma 2.4.6 equals the σ-field generated by the variables
Pk f with k ranging over all integers greater than or equal to n and f ranging
over the set of all measurable, integrable f : X → R.
8. The sequence of averages Ȳn of the first n elements of a sequence of i.i.d. ran-
dom variables Y1 , Y2 , . . . with finite first absolute moment is a reverse mar-
tingale with respect to its natural filtration: E(Ȳn | Ȳn+1 , Ȳn+2 , . . .) = Ȳn+1 .
[Hint: The conditional expectation equals n^{−1} Σ_{i=1}^n E(Y_i | Ȳ_{n+1}, Y_{n+2}, . . .), and E(Y_i | Ȳ_{n+1}) is constant in 1 ≤ i ≤ n + 1 by symmetry.]
[Hint: For every ε > 0 and f, there exist m and a bracket [l_m, u_m] in L_1(P^m) such that m^{−1}P^m u_m − m^{−1}P^m l_m < ε and
(nm)^{−1} Σ_{i=1}^n l_m(Y_{i,m}) ≤ P_{nm} f ≤ (nm)^{−1} Σ_{i=1}^n u_m(Y_{i,m}),
m^{−1} P^m l_m ≤ P f ≤ m^{−1} P^m u_m,
where Y_{1,m}, Y_{2,m}, . . . are the blocks [X_1, . . . , X_m], [X_{m+1}, . . . , X_{2m}], . . . . Conclude that for every ε > 0 there exists m such that lim sup_{n→∞} ‖P_{nm} − P‖*_F < ε. It suffices to “close the gaps.”]
10. Theorem 2.4.3 remains valid if the L_1(P_n)-norm is replaced by the L_p(P_n)-norm, for any 0 < p < ∞.
[Hint: For 0 < r < s < ∞ and f, g ∈ F_M we have ‖f − g‖_{P_n,s}^{s/r} ≤ (2M)^{s/r−1} ‖f − g‖_{P_n,r} and ‖f − g‖_{P_n,r} ≤ ‖f − g‖_{P_n,s}. Use this to compare the entropy numbers N( ε, F_M, L_r(P_n) ).]
In this chapter we present the two main empirical central limit theorems.
The first is based on uniform entropy, and its proof relies on symmetrization.
The second is based on bracketing entropy.
Here the supremum is taken over all finitely discrete probability measures Q on (X, A) with ‖F‖²_{Q,2} = ∫ F² dQ > 0. These conditions are by no means necessary, but they suffice for many examples. Finiteness of the previous integral will be referred to as the uniform entropy condition.
Use the second part of the maximal inequality Corollary 2.2.8 to find that
(2.5.3) E_ε ‖ n^{−1/2} Σ_{i=1}^n ε_i f(X_i) ‖_{F_{δ_n}} ≲ ∫_0^∞ √( log N( ε, F_{δ_n}, L_2(P_n) ) ) dε.
For large values of ε the set F_{δ_n} fits in a single ball of radius ε around the origin, in which case the integrand is zero. This is certainly the case for values of ε larger than θ_n, where
θ_n² = sup_{f∈F_{δ_n}} ‖f‖_n² = ‖ n^{−1} Σ_{i=1}^n f²(X_i) ‖_{F_{δ_n}}.
Here the supremum is taken over all discrete probability measures. The integrand is integrable by assumption. Furthermore, ‖F‖_n is bounded below by ‖F_*‖_n, which converges almost surely to its expectation, which may be assumed positive. Use the Cauchy-Schwarz inequality and the dominated convergence theorem to see that the expectation (with respect to X_1, . . . , X_n) of this integral converges to zero provided θ_n → 0 in outer probability. This would conclude the proof of asymptotic equicontinuity.
Since sup{P f²: f ∈ F_{δ_n}} → 0 and F_{δ_n} ⊂ F_∞, it is certainly enough to prove that
‖P_n f² − P f²‖_{F_∞} → 0 in outer probability.
This is a uniform law of large numbers for the class F_∞². This class has integrable envelope (2F)² and is measurable by assumption. For any pair f, g of functions in F_∞,
P_n|f² − g²| ≤ P_n( |f − g| 4F ) ≤ ‖f − g‖_n ‖4F‖_n.
It follows that the covering number N( ε‖2F‖_n², F_∞², L_1(P_n) ) is bounded by the covering number N( ε‖F‖_n, F_∞, L_2(P_n) ). By assumption, the latter
2.5 Donsker Theorems 187
2.5.2 Bracketing
The second main empirical central limit theorem uses bracketing entropy
rather than uniform entropy. The simplest version of this theorem uses
L2 (P )-brackets, and requires finiteness of the corresponding bracketing in-
tegral
(2.5.5) ∫_0^∞ √( log N_{[ ]}( ε, F, L_2(P) ) ) dε < ∞.
(Actually this is not really a norm, because it does not satisfy the triangle
inequality. It can be shown that there exists a norm that is equivalent to this
“norm” up to a constant 2, but this is irrelevant for the present purposes.)
Note that kf kP,2,∞ ≤ kf kP,2 , so that the bracketing numbers relative to
L2,∞ (P ) are smaller.
The proofs of the preceding theorem and its strengthening below are
based on a chaining argument. Unlike in the proof of the uniform entropy
theorem, this will be applied to the original summands, without an initial
symmetrization. Because the summands are not appropriately bounded,
Hoeffding’s inequality and the sub-Gaussian maximal inequality do not
apply. Instead, Bernstein's inequality is used in the form
P( |G_n f| > x ) ≤ 2 exp( −(1/2) x² / ( P f² + (2/3)‖f‖_∞ x/√n ) ).
(2.5.7) E‖G_n‖_F ≲ max_f ( ‖f‖_∞/√n ) log |F| + max_f ‖f‖_{P,2} √( log |F| ).
The chaining argument is set up in such a way that the two terms on the right are of comparable magnitude. This is the case if ‖f‖_∞ ∼ √n ‖f‖_{P,2}/√( log |F| ); the inequality is applied after truncating the functions f at this level.
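The displayed Bernstein bound can be checked empirically. The sketch below makes choices not in the text: f = 1{X ≤ 0.3} with X uniform on [0, 1], so that P f² = 0.3 and ‖f‖_∞ = 1, and Monte Carlo replications of G_n f are compared with the bound at a few levels x.

```python
import math
import random

random.seed(2)

def bernstein_bound(x, Pf2, fmax, n):
    # the displayed form: 2 exp( -x^2/2 / (Pf^2 + (2/3) ||f||_inf x / sqrt(n)) )
    return 2 * math.exp(-0.5 * x * x / (Pf2 + (2 / 3) * fmax * x / math.sqrt(n)))

n, reps, p = 100, 2000, 0.3
levels = (0.5, 1.0, 1.5)
exceed = {x: 0 for x in levels}
for _ in range(reps):
    # one replication of Gn f = sqrt(n) (Pn - P) f for f = 1{X <= p}
    Pn = sum(random.random() <= p for _ in range(n)) / n
    G = math.sqrt(n) * (Pn - p)
    for x in levels:
        if abs(G) > x:
            exceed[x] += 1
tails = {x: exceed[x] / reps for x in levels}
bounds = {x: bernstein_bound(x, p, 1.0, n) for x in levels}
```

The simulated tail probabilities sit below the bound at every level; for small x the bound exceeds 1 and is trivial, which is why the chaining argument balances the two terms of (2.5.7) by truncation.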
Proofs. It suffices to prove the second theorem. For each natural number q, there exists a partition F = ∪_{i=1}^{N_q} F_{qi} of F into N_q disjoint subsets such that
Σ_q 2^{−q} √( log N_q ) < ∞,
sup_{f,g∈F_{qi}} ‖ |f − g|* ‖_{P,2,∞} < 2^{−q},
π_{q−1} f in the “chain” is bounded in absolute value by √n a_q (note that |π_q f − π_{q−1} f| ≤ Δ_{q−1} f). For a rigorous derivation, note that either all B_q f are zero or there is a unique q_1 with B_{q_1} f = 1. In the first case, the first two terms in the decomposition are zero and the third term is an infinite series (all A_q f equal 1) whose qth partial sum telescopes out to π_q f − π_{q_0} f and converges to f − π_{q_0} f by the definition of the A_q f. In the second case, A_{q−1} f = 1 if and only if q ≤ q_1, and the decomposition is as mentioned, apart from the separate treatment of the case q_1 = q_0, when already the first link fails the test.
Next apply the empirical process G_n = √n(P_n − P) to each of the three terms separately, and take suprema over f ∈ F. It will be shown that the resulting three variables converge to zero in probability as n → ∞ followed by q_0 → ∞.
First, since |f − π_{q_0} f| B_{q_0} f ≤ 2F 1{2F > √n a_{q_0}}, one has
E* ‖ G_n (f − π_{q_0} f) B_{q_0} f ‖_F ≤ 4√n P* F 1{2F > √n a_{q_0}}.
The right side converges to zero as n → ∞, for each fixed q_0, by the assumption that F has a weak second moment (Problem 2.5.6).
Second, since the partitions are nested, Δ_q f B_q f ≤ Δ_{q−1} f B_q f, whence by the inequality of Problem 2.5.5,
√n a_q P( Δ_q f B_q f ) ≤ 2‖Δ_q f‖²_{P,2,∞} ≤ 2 · 2^{−2q}.
Since Δ_{q−1} f B_q f is bounded by √n a_{q−1} for q > q_0, we obtain
P( Δ_q f B_q f )² ≤ √n a_{q−1} P( Δ_q f 1{Δ_q f > √n a_q} ) ≤ 2 (a_{q−1}/a_q) 2^{−2q}.
Since a_q is decreasing, the quotient a_{q−1}/a_q can be replaced by its square. Then, in view of the definition of a_q, the series on the right can be bounded by a multiple of Σ_{q=q_0+1}^∞ 2^{−q} √( log N_q ). This upper bound is independent of n and converges to zero as q_0 → ∞.
Third, there are at most N_q functions π_q f − π_{q−1} f and at most N_{q−1} functions A_{q−1} f. Since the partitions are nested, the function |π_q f − π_{q−1} f| A_{q−1} f is bounded by Δ_{q−1} f A_{q−1} f ≤ √n a_{q−1}. The L_2(P)-norm of |π_q f − π_{q−1} f| is bounded by 2^{−q+1}. Apply inequality (2.5.7) to find
E* ‖ Σ_{q=q_0+1}^∞ G_n (π_q f − π_{q−1} f) A_{q−1} f ‖_F ≲ Σ_{q=q_0+1}^∞ [ a_{q−1} log N_q + 2^{−q} √( log N_q ) ].
2.5 Problems and Complements 191
2.5.9 Example (Cells in R). The classical empirical central limit theorem was shown to be a special case of the uniform entropy central limit theorem in the preceding subsection. This classical theorem is a special case of the bracketing central limit theorem as well. This follows since N_{[ ]}( 2ε, F, L_2(P) ) ≤ N_{[ ]}( ε², F, L_1(P) ) for every class of functions f: X → [0, 1]. The L_1-bracketing numbers were bounded above by a power of 1/ε in the preceding subsection.
if the supremum on the right is taken over all discrete probability measures
Q.
[Hint: If the left side is equal to m, then there exist functions f_1, . . . , f_m such that P|f_i − f_j|^r > 2^r ε^r P F^r for i ≠ j. By the strong law of large numbers, P_n|f_i − f_j|^r → P|f_i − f_j|^r almost surely. Also P_n F^r → P F^r almost surely. Thus there exist n and ω such that P_n(ω)|f_i − f_j|^r > 2^r ε^r P F^r and P_n(ω) F^r < 2^r P F^r.]
5. For each positive random variable X, one has the inequalities ‖X‖²_{2,∞} ≤ sup_{t>0} t E X 1{X > t} ≤ 2‖X‖²_{2,∞}.
[Hint: The first inequality follows from Markov's inequality. For the second, write the expectation in the expression in the middle as ∫_0^∞ P( X 1{X > t} > x ) dx. Next, split the integral into the part from zero to t and its complement. On the first part, the integrand is constant; on the second, it can be bounded by x^{−2} times the square of the L_{2,∞}-norm.]
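Both inequalities can be checked in closed form for a concrete X (a choice made here for illustration): take X with survival function P(X > t) = min(1, t^{−2}), i.e. density 2x^{−3} on [1, ∞), for which E X 1{X > t} equals 2 for t < 1 and 2/t for t ≥ 1.

```python
def surv(t):
    # P(X > t) for X with density 2 x^{-3} on [1, infinity)
    return min(1.0, t ** -2) if t > 0 else 1.0

def mean_above(t):
    # E[X 1{X > t}] in closed form for this X
    return 2.0 if t < 1 else 2.0 / t

ts = [0.1 * k for k in range(1, 101)]  # grid of t values in (0, 10]
weak_sq = max(t * t * surv(t) for t in ts)       # ||X||_{2,inf}^2 = sup_t t^2 P(X > t)
middle = max(t * mean_above(t) for t in ts)      # sup_t t E X 1{X > t}
```

Here ‖X‖²_{2,∞} = 1 while sup_t t E X 1{X > t} = 2, so the upper inequality of the problem is attained with equality; this example shows the constant 2 cannot be improved.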
In Section 2.5.1 the empirical process was shown to converge weakly for indexing sets F satisfying a uniform entropy condition. In particular, if
sup_Q log N( ε‖F‖_{Q,2}, F, L_2(Q) ) ≤ K (1/ε)^{2−δ},
for some δ > 0, then the entropy integral (2.5.1) converges and F is a Donsker class for any probability measure P such that P*F² < ∞, provided measurability conditions are met. Many classes of functions satisfy this condition and often even the much stronger condition
sup_Q N( ε‖F‖_{Q,2}, F, L_2(Q) ) ≤ K (1/ε)^V, 0 < ε < 1,
for some number V.
for some number V . In this chapter this is shown for classes satisfying
certain combinatorial conditions. For classes of sets, these were first studied
by Vapnik and Červonenkis, whence the name VC-classes. In the second
part of this chapter, VC-classes of functions are defined in terms of VC-
classes of sets. The remainder of this chapter considers operations on classes
that preserve entropy properties, such as taking convex hulls.
2.6.2 Lemma. Let {x1 , . . . , xn } be arbitrary points. Then the total num-
ber of subsets ∆n (C, x1 , . . . , xn ) picked out by C is bounded above by the
number of subsets of {x1 , . . . , xn } shattered by C.
Proof. All sets shattered by a VC-class are among the sets of size at most
V (C). The number of shattered sets gives an upper bound on ∆n by the
preceding lemma.
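The lemma can be checked directly in a small case. The sketch below takes C to be the half-lines (−∞, t] acting on five arbitrary points (both the class and the points are choices made here): the subsets picked out are exactly the prefixes of the ordered sample, and their number is compared with the number of shattered subsets.

```python
from itertools import combinations

pts = [0.3, 1.2, 2.7, 3.1, 4.8]
cuts = sorted(pts) + [min(pts) - 1]  # thresholds realizing every distinct trace

def picked(t):
    # subset of pts picked out by the half-line (-inf, t]
    return frozenset(x for x in pts if x <= t)

picked_out = {picked(t) for t in cuts}

def is_shattered(subset):
    # subset is shattered if every one of its subsets is a trace of some picked set
    want = {frozenset(c) for r in range(len(subset) + 1)
            for c in combinations(subset, r)}
    got = {p & frozenset(subset) for p in picked_out}
    return want <= got

shattered_count = sum(
    is_shattered(s) for r in range(len(pts) + 1) for s in combinations(pts, r))
```

Here the half-lines pick out the 6 prefix subsets, and the shattered subsets are the empty set and the 5 singletons (no pair {x, y} with x < y is shattered, since {y} alone is never a trace), so the lemma's inequality 6 ≤ 6 holds with equality.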
2.6.4 Theorem. There exists a universal constant K such that for any VC-class C of sets, any probability measure Q, any r ≥ 1, and 0 < ε < 1,
N( ε, C, L_r(Q) ) ≤ K V(C) (4e)^{V(C)} (1/ε)^{rV(C)}.
Proof. In view of the equality ‖1_C − 1_D‖_{Q,r} = Q^{1/r}(C △ D) for any pair of sets C and D, the upper bound for a general r is an easy consequence of the bound for r = 1. The proof for r = 1 is long; Problem 2.6.4 gives a short proof of a slightly weaker result.
By Problem 2.6.3, it suffices to consider the case that Q is an empir-
ical type measure: a measure on a finite set of points y1 , . . . , yk such that
Q{yi } = li /n for integers l1 , . . . , lk that add up to n. Then it is no loss of
generality to assume that each set in the collection C is a subset of these
points.
Let x1 , . . . , xn be the set of points y1 , . . . , yk with each yi occurring li
times. For each subset C of y1 , . . . , yk , form a subset C̃ of x1 , . . . , xn by tak-
ing all xi that are copies of some yi in C. More precisely, let φ: {1, . . . , n} →
{y1 , . . . , yk } be an arbitrary,
fixed map such that #(j: φ(j) = yi ) = li for
every i. Let C̃ = j: φ(j) ∈ C . Now identify x1 , . . . , xn with 1, . . . , n.
If a set {x_j: j ∈ J} is shattered by the collection C̃ of sets C̃, then every x_j in this set must correspond to a different y_i (i.e., the map φ must be one-to-one on {x_j: j ∈ J}), and the subset {φ(x_j): j ∈ J} of {y_1, . . . , y_k} must be shattered by C. This shows that the collection C̃ is VC of the same dimension as C. By construction, Q(C △ D) = Q̃(C̃ △ D̃) for Q̃ equal to the discrete uniform measure on x_1, . . . , x_n. Thus the L_1(Q)-distance on C corresponds to the L_1(Q̃)-distance on C̃. Conclude that N( ε, C̃, L_1(Q̃) ) = N( ε, C, L_1(Q) ), and it suffices to prove the upper bound for C̃ and Q̃. For simplicity of notation, assume that Q is the discrete uniform measure on a set of points x_1, . . . , x_n.
Each set C can be represented by an n-vector of ones and zeros indicat-
ing whether or not the points xi are contained in the set. Thus the collection
C is identified with a subset Z of the vertices of the n-dimensional hyper-
cube [0, 1]n . Alternatively, C can be identified with an (n × #C)-matrix of
zeros and ones, the columns representing the individual sets C. For a subset
I of rows, let ZI be the matrix obtained by first deleting the other rows and
next dropping duplicate columns. Alternatively, ZI is the projection of the
point set Z onto [0, 1]I . In these terms ZI corresponds to the collection of
subsets C ∩ {xi : i ∈ I} of the set of points {xi : i ∈ I}, and this set is shat-
tered if the matrix ZI consists of all 2I possible columns or if, as a subset
of the I-dimensional hypercube, ZI contains all vertices. By assumption,
this is possible only for I ≤ V (C). Abbreviate V = V (C).
The normalized Hamming metric on the point set Z is defined by
d(w, z) = n^{−1} Σ_{j=1}^n |w_j − z_j|, z, w ∈ Z.
Take the sum over all (n choose m+1) such subsets I on both the left and the right. The double sum on the left can be rearranged: instead of leaving out one element at a time from every possible subset of size m + 1, we may add missing elements to every possible set of size m. Conclude that
(2.6.5) Σ_J E Σ_{i∉J} var(Z_i | Z_J) ≤ (n choose m+1) V,
This is bounded above by the right side of (2.6.5). Rearrange and simplify
the resulting inequality to obtain
Proof. If the values of all coordinates, except the ith coordinate, of Z are
given, then the vector Z can have at most two values. If there is only one
possible value, then the conditional variance of Zi is zero. If there are two
possible values, then we can call these v and w, where vi = 0 and wi = 1
and the other coordinates of v and w agree. Write p(z) for P(Z = z). Then Z_i is 1 or 0 with conditional probabilities p := p(w)/( p(v) + p(w) ) and 1 − p, respectively, and the conditional variance of Z_i is p(1 − p). The marginal probability that (Z_j: j ≠ i) is equal to (v_j: j ≠ i) is p(v) + p(w).
Form a graph by taking the points of Z as the nodes and connecting
any two nodes that are at Hamming distance 1/n. This is a subgraph of the
set of edges of the n-dimensional unit cube. Let Ei and E be the set of all
edges that cross the ith dimension and all edges in the graph, respectively.
If Ei is empty, then all variances in the ith term of the lemma are zero.
In the other case, the edges {v, w} in E concern nodes v and w as in the
preceding paragraph. Then, with an empty sum understood as zero,
Σ_{i=1}^n E var(Z_i | Z_j, j ≠ i) = Σ_{i=1}^n Σ_{{v,w}∈E_i} ( p(v) + p(w) ) · ( p(v)/(p(v) + p(w)) ) · ( p(w)/(p(v) + p(w)) )
≤ Σ_{{v,w}∈E} p(v) ∧ p(w).
2.6 Uniform Entropy Numbers 199
Suppose that the edges E can be directed in such a way that at each node
(point in Z) the number of arrows that are directed away is at most V (C).
Then the last sum can be rearranged as a double sum, the outer sum carried
out over the nodes, and the inner sum over the arrows P directed away from
the particular node. Thus the sum is bounded by z∈Z p(z)#arrows(z),
which is bounded by V (C).
It may be shown that the edges can always be directed in the given
manner (Problem 2.6.5), but it is more direct to proceed in a different
manner. Recall from the proof of Lemma 2.6.2 that a class C is hereditary
if it contains every subset of every set contained in C. Since each set in such
a class C is shattered, each set in a hereditary VC-class C has at most V (C)
points. For such a class the edges of the graph can be directed as desired
by decreasing number of elements (or downward in the cube): direct the
edge between v and w in the direction w → v if the (one) coordinate of w
by which w differs from v equals 1. For each given w this creates at most
V (C) arrows as w has at most so many coordinates 1. This concludes the
proof in the case that C is hereditary.
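The counting step for hereditary classes can be verified in a toy case. The sketch below builds the hereditary class of all subsets of size at most V of an n-point set (n and V are arbitrary small choices), directs each edge downward as described, and checks that every node has at most V arrows directed away.

```python
from itertools import combinations

n, V = 5, 2
# hereditary (downward-closed) class: all subsets of size <= V, as 0/1 vectors
Z = [tuple(1 if i in s else 0 for i in range(n))
     for r in range(V + 1) for s in combinations(range(n), r)]
Zset = set(Z)

def out_degree(w):
    # direct the edge w -> v whenever v arises from w by lowering one coordinate 1 to 0
    deg = 0
    for i in range(n):
        if w[i] == 1:
            v = w[:i] + (0,) + w[i + 1:]
            if v in Zset:
                deg += 1
    return deg

max_out = max(out_degree(w) for w in Z)
```

For a hereditary class every down-neighbor of a member is again a member, so the out-degree of a node w equals the number of ones in w, which is at most V; the simulation confirms this bound, and it is attained by the sets of maximal size.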
It was shown in the proof of Lemma 2.6.2 that an arbitrary collection
C can be transformed into a hereditary collection by repeated application
of certain operators Ti : C → C, which are one-to-one and do not increase
the VC-dimension. The identification of sets in C with points in Z allows us to
view the operators also as maps Ti : Z → Z, and they can be described as
pushing a point z ∈ Z that is at height 1 in the ith coordinate (i.e. zi = 1)
down to height 0 (i.e. changing zi to 0) if the resulting point was not already
in Z, and leaving all other points invariant. We can form a new graph by
taking the set of nodes equal to Ti (Z) and connecting any two nodes at
Hamming distance 1/n. The operator Ti induces a one-to-one map from the set of edges E = E(Z) in the original graph into (but not necessarily onto) the set of edges E(Ti(Z)) of the new graph as follows. The nodes
v and w of an edge {v, w} in E that crosses the ith dimension (i.e. v and
w differ in their ith coordinates and no other ones) are left invariant by Ti ;
we map such an edge onto itself. We do the same for an edge at height 0
for the ith coordinate. For an edge {v, w} ∈ E at height 1 (i.e., v_i = w_i = 1) there are three possibilities. If Ti pushes both v and w down, say to v⁰ and w⁰, then the latter two nodes were not in Z, and hence neither was the edge {v⁰, w⁰}; we map {v, w} to the latter edge. If Ti pushes exactly one of v and w down, say v to v⁰, then v⁰ was not in Z, but w⁰ was in Z, whence {v⁰, w⁰} was not in E but is in E(Ti(Z)); we map {v, w} to {v⁰, w⁰}. Third, if Ti pushes neither v nor w down, then we map {v, w} to itself. Thus we have defined a one-to-one map from E into E(Ti(Z)). We also define a probability measure p_i on Ti(Z) by shifting mass downward in the following manner: first we give every z ∈ Ti(Z) the mass of its pre-image in Z; next, for every edge {z⁰, z¹} in E(Ti(Z)) that crosses the ith dimension, we place the bigger of the two masses p(z⁰) and p(z¹) at height 0 (i.e., at z⁰) and the smaller at height 1 (i.e., at z¹).
Indeed, any edge {v, w} ∈ E(Z) in the sum on the left that crosses the ith
dimension reappears in the sum on the right, with the same value p(v) ∧
p(w), which is invariant under redistributing the masses between the planes.
Any such edge in E(Z) at height 1 that is pushed down by Ti can be matched
to its image in the sum on the right, which has no smaller value attached
as the bigger masses are redistributed to the bottom plane. Any such edge
at height 1 that is not pushed down forms a pair with an edge in E(Z)
at height 0; they reappear as a pair in the sum on the right and together
contribute no smaller value in view of the inequality (a⁰ ∧ b⁰) + (a¹ ∧ b¹) ≤ (a⁰ ∨ a¹) ∧ (b⁰ ∨ b¹) + (a⁰ ∧ a¹) ∧ (b⁰ ∧ b¹), valid for any numbers aⁱ, bⁱ. Finally
the remaining edges in E(Z) are at height 0 and do not have a paired edge
above them; they reappear in the sum on the right with possibly a bigger
value.
Thus by repeated application of the downward shift operation, the
original graph and probability measure are changed into a hereditary graph
and another probability measure, meanwhile increasing the target function.
The lemma follows.
The form of this definition is different from the definitions given by other authors. However, it can be shown to lead to the same concept. See Problems 2.6.10 and 2.6.11.
for the probability measure R with density F^{r−1}/QF^{r−1} with respect to Q. Thus the L_r(Q)-distance is bounded by the distance 2 (QF^{r−1})^{1/r} ‖f − g‖_{R,1}^{1/r}. Elementary manipulations yield
This can be bounded by a constant times 1/ε raised to the power rV (F)
by the result of the previous paragraph.
The preceding theorem shows that a VC-class satisfies the uniform en-
tropy condition (2.5.1), with much to spare. Thus Theorem 2.5.2 shows that
a suitably measurable VC-class is P -Donsker for any underlying measure
P for which the envelope is square integrable. The latter condition on the
envelope can be relaxed a little to the minimal condition that the envelope
has a weak second moment.
x → ∞. Then F is P -Donsker.
Proof. Under the stronger condition that the envelope has a finite second
moment, this follows from Theorem 2.5.2. (Then the assumption that F is
pre-Gaussian is automatically satisfied.) For the refinement, see Alexander
(1987c).
elements. Apply the induction hypothesis to obtain a C_{k−1}Ln^{−W}-net over conv F_{n^{(k−1)q}} with respect to L_r(Q) consisting of at most e^{D_{k−1}n} elements. Combine the nets as before to obtain a C_kLn^{−W}-net over conv F_{n^{kq}} consisting of at most e^{D_kn} elements, for
C_k = C_{k−1} + 1/k²,
1 + log 1 + k q+r0 q/V −2r0 /(2qr0 /V +r0 br )
Dk = Dk−1 + .
k r0 q/V −2r0 /(2qr0 /V +r0 br )
For (q/V − 2)r₀ ≥ 2, or equivalently q ≥ 2V(1 + 1/r₀), the resulting sequences C_k and D_k are bounded.
B_r^{1/r} k^{−1/r₀} diam F to the convex combination Σ λ_j f_j. Every realization is an average of the form k^{−1} Σ_{i=1}^k f_{j_i} (allowing for multiple use of an f_j ∈ F). There are at most (n+k−1 choose k) such averages. Conclude that
N( B_r^{1/r} k^{−1/r₀} diam F, conv F, L_r(Q) ) ≤ (n+k−1 choose k) ≤ e^k (1 + n/k)^k.
The last inequality can be proved by using Stirling's formula with error bounds. For ε ≥ 1, the assertion of the lemma is trivial. For ε < 1, conclude the proof by choosing the smallest k such that B_r^{1/r} k^{−1/r₀} ≤ ε.
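The count of k-fold averages used above can be confirmed by enumeration (n and k are arbitrary small values chosen here): the realizations correspond to multisets of size k drawn from n functions, of which there are (n+k−1 choose k).

```python
import math
from itertools import combinations_with_replacement

n, k = 4, 3
# realizations of the average k^{-1}(f_{j_1} + ... + f_{j_k}) correspond to
# multisets of size k chosen from n functions
multisets = list(combinations_with_replacement(range(n), k))
count = len(multisets)
# the crude bound from the display
bound = math.e ** k * (1 + n / k) ** k
```

Here the enumeration gives count = C(6, 3) = 20, comfortably below e^k (1 + n/k)^k.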
for a constant K that depends only on the VC-dimension V_m(F) of the VC-subgraph class connected with F and for F an envelope function of this VC-subgraph class.
log N( ε‖F‖_{Q,2}, F, L_2(Q) ) ≤ C (1/ε)^V, 0 < ε < 1.
Then there exists a constant K that depends on C and V only such that
log N( ε‖F‖_{Q,2}, sconv(F), L_2(Q) ) ≤
K (1/ε)² (log(1/ε))^{1−2/V}, if 0 < V < 2,
K (1/ε)² (log(1/ε))², if V = 2,
K (1/ε)^V, if V > 2.
206 2 Empirical Processes
Then 0 ≤ fε ≤ f, and for a given x, either f(x) ∈ (tⱼ, tⱼ₋₁] for some 1 ≤ j ≤ m or f(x) ≤ t_m. In the first case 0 ≤ f(x) − fε(x) ≤ tⱼ₋₁ − tⱼ = εtⱼ ≤ εf(x) ≤ εF(x), while in the second case 0 ≤ f(x) − fε(x) ≤ f(x) ≤ ε‖F‖_{Q,r}. This implies that Q|f − fε|^r ≤ 2ε^rQF^r. We conclude that the entropy of the class F at 3ε‖F‖_{Q,r} is bounded by the entropy of the class of functions Fε = {fε: f ∈ F} at ε.
In fact, the class Fε is a VC-subgraph class. The subgraphs can be
decomposed as
Since F is a VC-major class, the collection of sets {x: f(x) > tⱼ₋₁}, when f ranges over F, is a VC-class of subsets of X, which (trivially) implies that for every j the class Cⱼ of sets {x: f(x) > tⱼ₋₁} × (tⱼ, tⱼ₋₁], when f ranges over F, is a VC-class of subsets of the set X × (tⱼ, tⱼ₋₁]. Because the latter sets are disjoint, the collection ⊔ⱼ Cⱼ is a VC-class of subsets of X × (0, 1], of VC-dimension at most the sum of the dimensions of the classes Cⱼ, i.e. mV for V a bound on the dimension of the VC-class attached to the VC-major class F (cf. Problem 2.6.12).
2.6 Uniform Entropy Numbers 207
Define the sum over an empty set as zero. There exists such a vector a with at least one strictly positive coordinate. For this vector, the set {(xᵢ, tᵢ): aᵢ > 0} cannot be of the form {(xᵢ, tᵢ): tᵢ < f(xᵢ)} for some f, because then the left side of the equation would be strictly positive and the right side nonpositive for this f. Conclude that the subgraphs of F do not pick out the set {(xᵢ, tᵢ): aᵢ > 0}. Hence the subgraphs shatter no set of n points.
2.6.17 Lemma. The set of all translates {ψ(x − h): h ∈ R} of a fixed monotone function ψ: R → R is VC of dimension 1.
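The lemma can be illustrated by brute force in the special case ψ(x) = x (an assumption of this sketch): a point (x, t) lies in the subgraph of x ↦ ψ(x − h) exactly when h < x − t, so each translate picks out a threshold set in the values xᵢ − tᵢ, and no two-point set can be shattered. Helper names are ours:

```python
import random

# For psi(x) = x the subgraph of x -> psi(x - h) contains (x, t) iff t < x - h,
# i.e. iff h < x - t: the subset picked out by the translate with shift h is a
# threshold set in the values s_i = x_i - t_i.
def picked_subsets(points):
    s = [x - t for (x, t) in points]
    cuts = sorted(set(s))
    thresholds = [cuts[0] - 1.0] + cuts   # one h below all s_i, plus each s_i
    return {frozenset(i for i, si in enumerate(s) if h < si) for h in thresholds}

random.seed(0)
for _ in range(200):
    pts = [(random.random(), random.random()) for _ in range(2)]
    assert len(picked_subsets(pts)) < 4   # no 2-point set is shattered: VC dimension <= 1
assert picked_subsets([(1.0, 0.0)]) == {frozenset(), frozenset({0})}  # one point is shattered
```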
Proof. The set C c picks out the points of a given set x1 , . . . , xn that C does
not pick out. Thus if C shatters a given set of points, so does C c . This proves
(i) (and shows that the dimensions of C and C c are equal). From n points C
can pick out O(nV (C) ) subsets. From each of these subsets, D can pick out
at most O(nV (D) ) further subsets. Thus C ∩ D can pick out O(nV (C)+V (D) )
subsets. For large n, this is certainly smaller than 2n . This proves (ii). Next
(iii) follows from a combination of (i) and (ii), since C ∪ D = (C c ∩ Dc )c .
For (iv) note first that if φ(C) shatters {y1 , . . . , yn }, then each yi must
be in the range of φ and there exist x1 , . . . , xn such that φ is a bijection
between x1 , . . . , xn and y1 , . . . , yn . Thus C must shatter x1 , . . . , xn . For (v)
the argument is analogous: if ψ −1 (C) shatters z1 , . . . , zn , then all ψ(zi ) must
be different and the restriction of ψ to z1 , . . . , zn is a bijection on its range.
To prove (vi) take any set of points x1 , . . . , xn and any set C̄ from
the sequential closure. If C̄ is the pointwise limit of a net Cα , then for
sufficiently large α the equality 1{C̄}(xi ) = 1{Cα }(xi ) is valid for each i.
For such α the set Cα picks out the same subset as C̄.
For (vii) note first that C × Y and X × D are VC-classes. Then by (ii)
so is their intersection C × D.
(v) F + g = {f + g: f ∈ F} is VC-subgraph;
(vi) F · g = {f g: f ∈ F} is VC-subgraph;
(vii) F ◦ ψ = {f (ψ): f ∈ F} is VC-subgraph;
(viii) φ ◦ F is VC-subgraph for monotone φ.
It suffices to show that these sets are VC in (X ∩ {g > 0}) × R, (X ∩ {g < 0}) × R, and (X ∩ {g = 0}) × R, respectively (Problem 2.6.12). Now, for instance, {i: (xᵢ, tᵢ) ∈ C⁻} is the set of indices of the points (xᵢ, tᵢ/g(xᵢ))
2.6.21 Lemma. If F and G are VC-hull classes for sets (in particular,
uniformly bounded VC-major classes), then FG is a VC-hull class for sets.
Proofs. The set {h ∘ f > t} can be rewritten as {f ·> h⁻¹(t)} for a suitable inverse h⁻¹ and ·> meaning either > or ≥ depending on t. Now the sets
2.6.25 Theorem. There exist universal constants D and c such that, for
any class F of functions f : X → [0, 1], any probability measure Q and
0 < ε < 1,
    N(ε, F, L₂(Q)) ≤ (2/ε)^{DV(cε,F)}.
For a proof see Mendelson and Vershynin (2003).
A comparison of the bound with Theorem 2.6.7, which gives a bound of the order (1/ε)^{2V(F)} on the covering number, is impossible without further
knowledge of the constant D. However, if the fat-shattering dimension is
considerably smaller than the VC-dimension, then this may compensate
even a large constant D, and the bound of the preceding theorem may be
preferable. The following example illustrates that this may be the case. In
fact, the example shows that the VC-dimension can be infinite while the
fat-shattering dimension is finite for every ε > 0.
but choosing the points tᵢ increasing as well is crucial. Because strict increase, without quantification, suffices, the points easily fit within the range [0, 1] of the functions f. The latter changes if we require fat-shattering. If the points must be picked out with margin ε, then the sequence t₁ < ⋯ < t_n must increase by steps bigger than ε, and only of the order 1/ε of them fit in the interval [0, 1]. This explains that

    V(ε, F) ≤ 1 + 1/ε.
The preceding theorem therefore gives the bound of the order (1/ε) log(1/ε)
on the entropy of F. This bound is not sharp (see Theorem 2.7.7 and the
discussion below), but reasonable, whereas Theorem 2.6.7 is not applicable.
For a precise proof of the bound on the fat-shattering numbers, take
any sequence x1 < · · · < xn and assume that the subgraphs of F shatter
the points (xi , ti ) at margin ε, for some t1 , . . . , tn . In particular, there exist
two functions fo and fe that pick out the odd and even numbered points
at margin ε:
fo (xi ) ≥ ti + ε, i odd, fe (xi ) ≥ ti + ε, i even,
fo (xi ) ≤ ti , i even, fe (xi ) ≤ ti , i odd.
Then fo (xi ) − fe (xi ) ≥ ε for odd i and fo (xi ) − fe (xi ) ≤ −ε for even i, and
hence the variation of fo − fe over x1 , . . . , xn is at least (n − 1)2ε. Since fo
and fe are monotone with range [0, 1], this variation is at most 2, whence
n − 1 ≤ 1/ε.
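The margin argument can also be checked by brute force. In the sketch below (our own construction), a monotone nondecreasing f: R → [0, 1] with prescribed bounds lᵢ ≤ f(xᵢ) ≤ uᵢ at increasing points exists iff the running maximum of the lower bounds never exceeds an upper bound; ε-shattering then requires all 2ⁿ margin constraints to be feasible. Three points can be ε-shattered at ε = 0.3 but not at ε = 0.4, consistent with the bound n − 1 ≤ 1/ε:

```python
import itertools

def monotone_feasible(lower, upper):
    # A nondecreasing f: R -> [0, 1] with lower[i] <= f(x_i) <= upper[i] exists
    # iff the running maximum of the lower bounds stays below every upper bound.
    run = 0.0
    for l, u in zip(lower, upper):
        run = max(run, l)
        if run > u + 1e-12:
            return False
    return True

def shattered(ts, eps):
    # (x_1 < ... < x_n) with levels ts is eps-shattered iff every subset S is
    # picked out with margin: f(x_i) >= t_i + eps on S, f(x_i) <= t_i off S.
    for bits in itertools.product([0, 1], repeat=len(ts)):
        lower = [t + eps if b else 0.0 for t, b in zip(ts, bits)]
        upper = [1.0 if b else t for t, b in zip(ts, bits)]
        if not monotone_feasible(lower, upper):
            return False
    return True

assert shattered((0.0, 0.35, 0.7), 0.3)      # three points fit at eps = 0.3
assert not shattered((0.0, 0.35, 0.7), 0.4)  # but not at eps = 0.4
assert shattered((0.1, 0.5), 0.4)            # two points still fit at eps = 0.4
```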
This theorem removes the factor log(1/ε) from the bound on the en-
tropy implied by Theorem 2.6.25. The additional condition V (dε, F) ≤
V (ε, F)/2 that makes this possible is less innocuous than it may seem at
first. For instance, it cannot be satisfied by a nontrivial VC-class, as then a bound log N(ε, F, L₂(Q)) ≤ KV(F), free of ε, would result.
The following theorem shows that in the uniform entropy integral the
fat-shattering dimension can replace the uniform entropy. In particular,
the fat-shattering dimension characterizes the uniform Donsker property of
uniformly bounded classes of functions.
2.6 Problems and Complements 213
2.6.28 Theorem. There exist positive constants c < C such that, for any class F of functions f: X → [0, 1],

    c ∫₀¹ √(V(ε, F)) dε ≤ ∫₀¹ sup_Q √(log N(ε, F, L₂(Q))) dε ≤ C ∫₀¹ √(V(ε, F)) dε.
2. Unlike the case of sets, the VC-subgraph property does not characterize uni-
versal Donsker classes of functions. There exist bounded universal Donsker
classes that do not even satisfy the uniform entropy condition.
[Hint: Dudley (1987).]
3. Let F be a class of measurable functions such that D(ε, F, L_r(Q)) ≤ g(ε) for every empirical type probability measure Q and a fixed function g(ε). Then the bound is valid for every probability measure.
[Hint: If D(ε, F, L_r(P)) = m, then there are functions f₁, …, f_m such that P|fᵢ − fⱼ|^r > ε^r for every i ≠ j. By the strong law of large numbers, Pn|fᵢ − fⱼ|^r → P|fᵢ − fⱼ|^r almost surely, for every (i, j). Thus there exist ω and n such that Pn(ω)|fᵢ − fⱼ|^r > ε^r, for every i ≠ j.]
4. There exists a short proof of the fact that for any VC-class of sets C and δ > 0,

    N(ε, C, L_r(Q)) ≤ K(1/ε)^{r(V(C)+δ)},

for any probability measure Q, r ≥ 1, 0 < ε < 1, and a constant K depending on V(C) and δ only.
[Hint: Take any subcollection of sets C1 , . . . , Cm from C such that Q(Ci 4
Cj ) > ε for every pair i 6= j. Generate a sample X1 , . . . , Xn from Q. Two sets
Ci and Cj pick out the same subset from a realization of the sample if and
only if no Xk falls in the symmetric difference Ci 4 Cj . If every symmetric
difference contains a point of the sample, then all Ci pick out a different
subset from the sample. In that case C picks out at least m subsets from
X1 , . . . , Xn . The probability that this event does not occur is bounded by
    Σ_{i<j} Q(X_k ∉ Cᵢ △ Cⱼ for every k) ≤ Σ_{i<j} (1 − Q(Cᵢ △ Cⱼ))^n ≤ \binom{m}{2}(1 − ε)^n.
For sufficiently large n, the last expression is strictly less than 1. For such n there exists a set of n points from which C picks out at least m subsets. In other words, for n > −log\binom{m}{2}/log(1 − ε), one has the first inequality in

    m ≤ Δ_n(C) ≤ Kn^{V(C)}.

The last inequality follows from Corollary 2.6.3, and the constant K depends only on V(C). Since −log(1 − ε) > ε, we can take n = 3(log m)/ε and obtain

    m ≤ K(3 log m/ε)^{V(C)}.

Since log m is bounded by a constant times m^δ, it follows that m^{1−δ} is bounded by a constant times (3/ε)^{V(C)}.]
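The arithmetic at the end of the hint can be verified directly. A small sketch (with the assumed choice n = ⌈3(log m)/ε⌉; since log(1 − ε) ≤ −ε, the failure probability \binom{m}{2}(1 − ε)^n is below 1):

```python
import math

# With n = ceil(3*log(m)/eps) sample points, the probability that some pair
# (C_i, C_j) is missed by the whole sample is at most C(m,2)*(1-eps)^n < 1:
# (1-eps)^n <= exp(-3*log m) = m^(-3), and C(m,2) <= m^2/2.
def failure_bound(m, eps):
    n = math.ceil(3 * math.log(m) / eps)
    return math.comb(m, 2) * (1 - eps) ** n

for m in (10, 100, 10_000):
    for eps in (0.5, 0.1, 0.01):
        assert failure_bound(m, eps) < 1, (m, eps)
```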
5. The edges of the graph with nodes Z ⊂ {0, 1}n representing a VC-class C of
subsets of a set of points {x1 , . . . , xn } can be directed in such a manner that
at most V (C) arrows are positively incident with each node.
[Hint: The number of edges of a hereditary collection C can be enumerated
by listing for each node v the edges pointing inward: edges {v, w} such that w
has a zero in the (one) position in which it differs from v. Since every subset
of a hereditary class has at most V (C) points, it follows that #E/#Z ≤ V (C).
By repeated application of the operators Ti , an arbitrary collection of sets
can be transformed into a hereditary class. The operations do not decrease
the quotient #E/#Z: the number of edges may increase, while the number
of nodes remains the same. Conclude that for any VC-collection of sets,
#E/#Z ≤ V (C). Since a subcollection of a given VC-class is VC of no greater
dimension, it follows that #E 0 ≤ #Z 0 V (C) for every subgraph (Z 0 , E 0 ) as well.
Now apply Hall’s marriage lemma [e.g. Dudley (1989), Theorem 11.6.1,
page 318], representing directing an edge {v, w} as marrying the edge to one
of the nodes v and w. In fact, let the eligible partners for {v, w} be the
collection of V (C) copies of v and V (C) copies of w. Then the total number
of partners for a given set of edges is at least V (C) times the number of edges.
By the marriage lemma, successful marriage is possible.]
11. (Between graphs) For a set F of measurable functions, define the "between" graphs as the sets {(x, t): 0 ≤ t ≤ f(x) or f(x) ≤ t ≤ 0}. The "between" graphs form a VC-class of sets if and only if F is a VC-subgraph class.
[Hint: The "closed" subgraphs intersected with the set {t ≥ 0} yield a VC-class of sets, which is the positive half of the "between" graphs. The "open" subgraphs intersected with the set {t ≤ 0} and next complemented within {t ≤ 0} give the lower parts of the "between" graphs.]
15. The class of all closed convex subsets in Rd is not VC for d ≥ 2. The same is
true for the collection of all open convex sets.
[Hint: Any set of n points on the rim of the unit ball is shattered by the
closed convex sets. The closed convex sets are in the sequential closure of the
open convex sets.]
16. The set of all open polygons with extreme points on the rim of the unit circle
in R2 is not VC.
[Hint: Form a regular polygon with its n extreme points on the rim of the
unit circle. Choose n points in the n gaps between the polygon and the unit
circle.]
18. The set of all monotone functions f : R → [0, 1] is VC-hull but not VC-
subgraph.
[Hint: Any set of points (xi , ti ) with both xi and ti strictly increasing in i
is shattered.]
20. The class of functions of the form x ↦ c1_{(a,b]}(x) with a and b ranging over R and c over (0, ∞) is VC of dimension 2.
[Hint: Any f whose subgraph picks out the subset {(x₁, t₁), (x₃, t₃)} from three points (xᵢ, tᵢ) with x₁ ≤ x₂ ≤ x₃ also picks out (x₂, t₂) unless t₂ > t₁ ∨ t₃. If a set of four points with nondecreasing xᵢ is shattered, it follows that t₂ > t₁ ∨ t₃, but also t₃ > t₂ ∨ t₄, t₂ > t₁ ∨ t₄, and t₃ > t₁ ∨ t₄.]
21. The "Box-Cox family of transformations" F = {f_λ: (0, ∞) → R: λ ∈ R − {0}}, with f_λ(x) = (x^λ − 1)/λ, is a VC-subgraph class.
[Hint: The "between" graph class of sets (as defined in Exercise 2.6.11) is the class of subsets C = {C_λ: λ ≠ 0} of R², where C_λ = {(x, t): 0 ≤ t ≤ (x^λ − 1)/λ}. Examine the "dual class" of subsets of R given by D = {D_{(x,t)}: (x, t) ∈ (0, ∞) × (0, ∞)}, where
Show that D is a VC-class of sets, and apply Assouad (1983), page 246, Proposition 2.12.]
While the VC-theory gives control over the entropy numbers of many in-
teresting classes through simple combinatorial arguments, results on brack-
eting numbers can be found in approximation theory. This section gives
examples, and in particular gives bracketing numbers of the most impor-
tant classes. In some cases, the bracketing numbers are actually uniform in
the underlying measure.
    D^k = ∂^{k.} / (∂x₁^{k₁} ⋯ ∂x_d^{k_d}),
2.7 Entropies of Function Classes 219
where k. = Σᵢ kᵢ. Then for a function f: X → R, let

    ‖f‖_α = max_{k. ≤ α̲} sup_x |D^k f(x)| + max_{k. = α̲} sup_{x,y} |D^k f(x) − D^k f(y)| / ‖x − y‖^{α−α̲},

where α̲ is the greatest integer strictly smaller than α and the suprema are taken over all x, y in the interior of X with x ≠ y.
The Hölder space C^α(X) of order α is the space of all continuous functions f: X → R for which the norm ‖f‖_α is well defined and finite. Let C^α_M(X) be the set of all continuous functions f: X → R with ‖f‖_α ≤ M. Thus C^α₁(X) denotes the unit ball in the Hölder space of order α.†
Bounds on the entropy numbers of the classes C1α (X ) with respect to
the supremum norm k·k∞ were among the first known after the introduction
of the concept of covering numbers. These readily yield bounds on the
Lr (Q)-bracketing numbers for the present classes of functions, as well as
for the class of subgraphs corresponding to them. The latter are classes of
sets with “smooth boundaries.”
Proof. Write ≲ for "less than or equal to a constant times," where the constant depends on α and d only. Let β denote the greatest integer strictly smaller than α.
Since the functions in C1α (X ) are continuous on X by assumption, it
may be assumed without loss of generality that X is open, so that Taylor’s
theorem applies everywhere on X .
Fix δ = ε^{1/α} ≤ 1, and form a δ-net for X of points x₁, …, x_m contained in X. The number m of points can be taken to satisfy m ≲ λ(X¹)/δ^d. For each vector k = (k₁, …, k_d) with k. ≤ β, form for each f the vector

    A_k f = ( D^k f(x₁)/δ^{α−k.}, …, D^k f(x_m)/δ^{α−k.} ).
† Note the inconsistency in notation. The notation ‖·‖_α is used only with α < ∞ to define the present classes of functions. The notation ‖·‖_∞ always denotes the supremum norm.
where |R| ≲ ‖x − xᵢ‖^α, because the highest-order partial derivatives are uniformly bounded. In the multidimensional case, the notation h^k/k! is short for Π_{i=1}^d hᵢ^{kᵢ}/kᵢ!. Thus

    |f − g|(x) ≲ Σ_{k. ≤ β} δ^{k.}δ^{α−k.}/k! + δ^α ≤ δ^α(e^d + 1).
Thus given the values in the ith column of Af, the values D^k f(xⱼ) range over an interval of length proportional to δ^{α−k.}. It follows that the values in the jth column of Af range over integers in an interval of length proportional to δ^{k.−α}δ^{α−k.} = 1. Consequently, there exists a constant C depending only on α and d such that

    #Af ≤ (2δ^{−α} + 2)^{(β+1)^d} C^{m−1}.
The theorem follows upon replacing δ by ε^{1/α} and m by its upper bound λ(X¹)ε^{−d/α}, respectively, taking logarithms, and bounding log(1/ε) by a constant times (1/ε)^{d/α}.
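The counting in the proof can be made concrete in the simplest case d = α = 1, that is, for 1-Lipschitz functions f: [0, 1] → [−1, 1]. The sketch below (our own discretization, in the spirit of the proof) counts sequences of values on an ε-grid that move by at most one ε-step between adjacent grid points; every 1-Lipschitz function lies within ε of such a sequence, and the log-count is of the order 1/ε = (1/ε)^{d/α}:

```python
def net_count(m, n):
    # Sequences (v_0, ..., v_m) with values in {-n, ..., n} (in units of eps)
    # and |v_{i+1} - v_i| <= 1, counted by dynamic programming.
    counts = [1] * (2 * n + 1)
    for _ in range(m):
        counts = [
            counts[j]
            + (counts[j - 1] if j > 0 else 0)
            + (counts[j + 1] if j < 2 * n else 0)
            for j in range(2 * n + 1)
        ]
    return sum(counts)

for inv_eps in (4, 8, 16):        # eps = 1/4, 1/8, 1/16;  m = n = 1/eps
    c = net_count(inv_eps, inv_eps)
    # 3^(1/eps) <= count <= (2/eps + 1) * 3^(1/eps):  log N(eps) of order 1/eps.
    assert 3 ** inv_eps <= c <= (2 * inv_eps + 1) * 3 ** inv_eps
```

The lower bound holds because every step-sequence started at the middle value stays in range; the upper bound is the trivial count of starts times step choices.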
The corollary, together with either the bracketing central limit theo-
rem or the uniform entropy theorem, implies that C1α [0, 1]d is universally
Donsker for α > d/2. For instance, on the unit interval in the line, uniformly
bounded and uniformly Lipschitz of order > 1/2 suffices and in the unit
square it suffices that the partial derivatives exist and satisfy a Lipschitz
condition.
The previous results are restricted to bounded subsets of Euclidean
space. Under appropriate conditions on the tails of the underlying distribu-
tions they can be extended to classes of functions on the whole of Euclidean
space. The general conclusion is that under a weak tail condition, the same
amount of smoothness suffices for the class to be Donsker. For instance,
for any distribution on the line with a finite 2 + δ moment, the class of all
uniformly bounded and uniformly Lipschitz functions of order α > 1/2 has
a finite bracketing integral and hence is Donsker. A more general result is
an easy corollary to the first theorem of this section also.
where the sequence i1 , i2 , . . . ranges over all possible values: for each j the
integer ij ranges over {1, 2, . . . , pj }. The number of brackets is bounded by
Πⱼ pⱼ, and each bracket has L_r(Q)-size equal to 2ε(Σⱼ aⱼ^r Q(Iⱼ))^{1/r}. It follows that

    log N_{[ ]}( 2ε(Σⱼ aⱼ^r Q(Iⱼ))^{1/r}, F, L_r(Q) ) ≲ K(1/ε)^{d/α} Σ_{j: aⱼε ≤ Mⱼ} λ(Iⱼ¹)(Mⱼ/aⱼ)^{d/α}
        ≤ K(1/ε)^V Σ_{j=1}^∞ λ(Iⱼ¹)(Mⱼ/aⱼ)^V,
for every V ≥ d/α. The choice aⱼ^{V+r} = λ(Iⱼ¹)Mⱼ^V/Q(Iⱼ) reduces both series in this expression to essentially the series in the statement of the corollary. Simplify to obtain the result.
the corollary. The class is known to be Glivenko-Cantelli or Donsker if and only if Σ_{j=1}^∞ MⱼP(Iⱼ) < ∞ or Σ_{j=1}^∞ MⱼP^{1/2}(Iⱼ) < ∞, respectively. See Section 2.10.5 and the Notes at the end of Part 2 for further details.
These expressions give differences in the direction h, and are well defined
if x, x + h, or x, x + h, . . . , x + rh are contained in X , respectively.
For α ∈ N and a function f : X → R, let
The Lipschitz spaces are defined through a first-order difference of the derivatives of order α̲, the greatest integer strictly smaller than α. These
The inclusions in this display are also embeddings: the inclusion map, mapping the space on the left into the space on the right, is continuous. Thus the unit ball of each of the spaces on the left is contained in a multiple of the unit ball of the spaces on the right.‡ For p = 2 these inclusions clearly become identities.
We also have the following inclusions, which are embeddings as well: for α > 0, 1 ≤ p ≤ p′ ≤ ∞,[

    B^α_{p,q}(X) ⊂ B^α_{p,q′}(X),     if 1 ≤ q ≤ q′ ≤ ∞,
    B^α_{p,q}(X) ⊂ B^{α′}_{p′,q}(X),  if α′ − d/p′ ≤ α − d/p,
    B^α_{p,q}(X) ⊂ C(X),              if α > d/p.
These are examples of Sobolev embedding theorems. The first inclusion was
already noted in the preceding. The second inclusion is easy in the case
that p = p0 , when it merely says that higher order smoothness classes are
contained in lower order smoothness classes, if the smoothness is measured
in a single manner. In its stated, more general form the inclusion shows that
higher order (α) smoothness measured in a weak sense (Lp ) implies lower
order (α0 ) smoothness in a stronger sense (Lp0 for p0 ≤ pd/(d − p(α − α0 ))).
For instance, a function on the line (d = 1) whose derivative is contained
in L2 (α = 1, p = 2) belongs to C 1/2 (α0 = 1/2, p0 = ∞), an assertion that
can easily be proved directly using the Cauchy-Schwarz inequality. The last
inclusion can be viewed as a limiting case of this principle (p0 = ∞, α0 = 0),
and shows that functions in Besov spaces are continuous if the smoothness
level is large enough relative to p. Only for p = ∞ does every smoothness level α > 0 qualify to give continuity.
In view of these inclusions an entropy bound on the Besov spaces au-
tomatically yields a bound on the Sobolev, Lipschitz, and Hölder spaces.
The following theorem describes the entropy relative to the Lτ (X , λ)-norm,
for a bounded, Lipschitz domain X ⊂ Rd , and given τ ∈ (0, ∞]. It can be
‡ We assume that X is strong local Lipschitz; see Bergh and Löfström (1976), Theorem 6.4.4, page 152, and Adams and Fournier (2003), Theorem 7.31, page 228.
[ Compare Härdle et al. (1998), page 124, and Bergh and Löfström (1976), page 153. We again assume that X is a Lipschitz domain. A function f in a Besov space on the left can then be extended to a function in B^α_{p,q}(R^d), and the embeddings follow from the case X = R^d.
shown that the unit ball of the Besov space B^α_{p,q}(X) is compact in L_τ(X, λ) if and only if α > d(1/p − 1/τ)₊. Under this condition the order of the entropy depends on τ, α and d only, and not on p or q.]

    log N(ε, B^α_{p,q}(X)₁, L_τ(X, λ)) ≲ (1/ε)^{d/α}.
The key to the proof of this theorem is a representation of Besov spaces
through series expansions on a wavelet basis, after which the entropy cal-
culations can be done on the space of coefficients. Because the Besov norm
is decreasing in q, it suffices to prove the theorem for q = ∞.
For functions f : R → R a suitable wavelet expansion can be written
in terms of a pair of functions φ and ψ, called father wavelet and mother
wavelet. These functions can be chosen such that the set of translates x 7→
φk (x) = φ(x − k) for k ∈ Z of the father, together with the translated
dilations x 7→ ψj,k (x) = 2j/2 ψ(2j x − k) for j, k ∈ Z of the mother form
an orthonormal basis of L2 (R, λ). Thus any function f ∈ L2 (R, λ) can be
written as

    f = Σ_{k∈Z} β_{0,k}φ_k + Σ_{j=1}^∞ Σ_{k∈Z} β_{j,k}ψ_{j,k}.
In addition the father and mother wavelet can be chosen to have compact
support.
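For the Haar pair, φ = 1_{[0,1)} and ψ = 1_{[0,1/2)} − 1_{[1/2,1)} (compactly supported, though not regular enough for all the equivalences discussed below), the expansion is exactly computable for functions that are piecewise constant on a dyadic grid. A sketch with illustrative helper names:

```python
def haar_coeffs(vals):
    # vals gives f on [0,1), piecewise constant on a grid of length N = 2**J;
    # integrals against the Haar functions are then exact Riemann sums.
    N = len(vals)
    dx = 1.0 / N
    mean = sum(vals) * dx                     # <f, phi>, phi = 1 on [0, 1)
    betas = {}
    for j in range(N.bit_length() - 1):
        span = N // 2 ** j                    # grid cells under psi_{j,k}
        for k in range(2 ** j):
            lo = k * span
            plus = sum(vals[lo: lo + span // 2])
            minus = sum(vals[lo + span // 2: lo + span])
            betas[(j, k)] = 2 ** (j / 2) * (plus - minus) * dx   # beta_{j,k}
    return mean, betas

def haar_reconstruct(mean, betas, N):
    # Evaluate  f = <f,phi>*phi + sum beta_{j,k} * psi_{j,k}  on the grid.
    out = [mean] * N
    for (j, k), b in betas.items():
        span = N // 2 ** j
        for i in range(k * span, (k + 1) * span):
            sign = 1.0 if i < k * span + span // 2 else -1.0
            out[i] += b * 2 ** (j / 2) * sign
    return out

vals = [0.3, -1.2, 4.0, 0.0, 2.5, 2.5, -0.7, 1.0]
mean, betas = haar_coeffs(vals)
rec = haar_reconstruct(mean, betas, len(vals))
assert all(abs(a - b) < 1e-9 for a, b in zip(vals, rec))   # exact reconstruction
```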
To simplify notation we write φk = ψ0,k , so that we obtain an or-
thonormal basis ψj,k of L2 (R, λ) indexed by the pairs (j, k) ∈ (N ∪ {0}) × Z.
Then any f ∈ L₂(R, λ) can be expanded as

    f = Σ_{j=0}^∞ Σ_{k∈Z} β_{j,k}ψ_{j,k},
where β_{j,k} = ∫ fψ_{j,k} dλ are the Fourier coefficients of f relative to the basis
ψj,k . Now it can be shown that for sufficiently regular father and mother
wavelets, the Besov norm defined earlier is equivalent to the coefficient norm
given by

    ‖f‖_{α|p,q} = ( Σ_{j=0}^∞ [ 2^{jα} (Σ_{k∈Z} |β_{j,k}|^p)^{1/p} 2^{j(1/2−1/p)} ]^q )^{1/q},
    ‖f‖_{α|p,∞} = sup_{j=0,…,∞} 2^{jα} (Σ_{k∈Z} |β_{j,k}|^p)^{1/p} 2^{j(1/2−1/p)}.
] See Adams and Fournier (2003), page 83, for the definition of a Lipschitz domain.
(For p = ∞ read max_k |β_{j,k}| for (Σ_{k∈Z} |β_{j,k}|^p)^{1/p}.) The Besov space B^α_{p,q}(R) consists exactly of all functions f whose wavelet coefficients render the expression on the right finite, and its unit ball is the set of functions for which the expression is bounded above by a given constant. (We abuse notation by writing ‖·‖_{α|p,q} for both the function and coefficient norm, which are equivalent, but not identical.)
For functions f: R^d → R the wavelet expansion is slightly more complicated. A function f ∈ L₂(R^d) can be written

(2.7.5)    f = Σ_{k∈Z^d} Σ_{v∈{0,1}^d} β^v_{0,k}ψ^v_{0,k} + Σ_{j=1}^∞ Σ_{k∈Z^d} Σ_{v∈{0,1}^d−{0}} β^v_{j,k}ψ^v_{j,k},

for functions ψ^v_{j,k} that are orthogonal for different indices (j, k, v) and are scaled and translated versions of the set of 2^d "base wavelets" ψ^v_{0,0}:

    ψ^v_{j,k}(x) = 2^{jd/2}ψ^v_{0,0}(2^j x − k).
Such a higher-dimensional wavelet basis can be obtained as tensor products ψ^v_{0,0} = φ_{v₁} × ⋯ × φ_{v_d} of a given father wavelet φ₀ and mother wavelet φ₁ in one dimension. If it is understood that a sum over v ranges over {0,1}^d if nested within j = 0 and over {0,1}^d − {0} if nested within j ∈ N, then the two sums can be collapsed again to a single sum over j ∈ N ∪ {0}. The corresponding Besov norm then takes the form

    ‖f‖_{α|p,q} = ( Σ_{j=0}^∞ [ 2^{jα} (Σ_{k∈Z^d} Σ_v |β^v_{j,k}|^p)^{1/p} 2^{jd(1/2−1/p)} ]^q )^{1/q},
    ‖f‖_{α|p,∞} = sup_{j∈N∪{0}} 2^{jα} (Σ_{k∈Z^d} Σ_v |β^v_{j,k}|^p)^{1/p} 2^{jd(1/2−1/p)}.

For suitable base wavelets this norm is equivalent to the Besov norm defined previously.
In order to measure the size of a function in L_τ(R^d, λ), it is useful to reexpress the expansion in terms of the "renormalized wavelet basis". The base functions ψ^v_{j,k} have been scaled so that their L₂-norms are equal to the L₂-norm of the base wavelet ψ^v_{0,0}, independent of the value of (j, k). To stabilize the L_τ-norms the appropriate scaling factors are 2^{jd/τ} rather than 2^{jd/2}. For a given value of τ the function 2^{jd(1/τ−1/2)}ψ^v_{j,k} has L_τ-norm independent of (j, k), and yields a wavelet basis "normalized in L_τ". If we define

    β^v_{j,k|τ} = 2^{jd(1/2−1/τ)}β^v_{j,k},    ψ^v_{j,k|τ} = 2^{jd(1/τ−1/2)}ψ^v_{j,k},

then the product β^v_{j,k|τ}ψ^v_{j,k|τ} is independent of τ, and the function f can be reexpressed as

    f = Σ_{j=0}^∞ Σ_{k∈Z^d} Σ_v β^v_{j,k|τ}ψ^v_{j,k|τ}.
For τ = 2 the functions ψj,k|2 = ψj,k are orthonormal and hence the in-
equalities in this display and in the preceding one are actually identities.
For general τ there is a one-sided inequality up to constant only, with the
corresponding inequality valid within a resolution level only. In every case
the Lτ -norm of a function can be bounded in terms of its L2 -wavelet coeffi-
cients, with additional factors 2jd(1/2−1/τ ) added to correct for the scaling.
Besov spaces B^α_{p,q}(X) on an unbounded domain X are typically not compact in L_τ(X, λ), for which reason the preceding theorem is restricted to bounded domains. Every function f ∈ B^α_{p,q}(X) for a Lipschitz domain X can be extended to a function f̃ ∈ B^α_{p,q}(R^d) of Besov norm satisfying
in view of Problem 2.7.8 and the definition of s. The choice of λ shows that
this is a multiple of ε.
Next we proceed to count the number of β̃ in each of the three cases.
For each fixed j we represent a typical vector β̃j,· by
(i) the set (k1 , k2 , . . . , kLj ) of indices k of the nonzero coordinates β̃j,k .
(ii) the vector of signs (δ1 , δ2 , . . . , δLj ) of these nonzero coordinates β̃j,k .
(iii) the absolute truncation levels (n1 , n2 , . . . , nLj ) of these nonzero coor-
dinates β̃j,k (i.e. ni = b|βj,ki |/λj c for every i.)
The number of different β̃ is bounded by the product over j of the product of the numbers of different entities of the three types, and is a bound on the Cε-covering number of BB^δ_p.
The number of different subsets as in (i) is bounded by, if Lⱼ > 0,

    Σ_{l=0}^{Lⱼ} \binom{Kⱼ}{l} ≤ (eKⱼ/Lⱼ)^{Lⱼ} ≲ C^{Lⱼ}(2^{jd}/Lⱼ)^{Lⱼ}.
(Cf. Problem 2.7.11.) Each β̃_{j,·} with positive coordinates gives rise to 2^{Lⱼ} vectors β̃_{j,·} with arbitrary signs, whence the number of sign vectors as in (ii) is bounded by 2^{Lⱼ}.
The number Lⱼ of nonzero coefficients β̃_{j,k} trivially satisfies Lⱼ ≤ Kⱼ ≲ 2^{jd} in all three cases (a), (b), (c). In cases (a) and (b), where p < ∞, the assumption that β ∈ BB^δ_p implies that Σ_k |β_{j,k}|^p ≤ 2^{−jδp}. Because |β_{j,k}| ≥ λⱼ for each nonzero coordinate β̃_{j,k}, it follows that in cases (a) and (b) also Lⱼ ≲ 2^{−jδp}/λⱼ^p for each j. In case (c) we have Lⱼ = 0 for j > jε by construction.
In cases (a) and (b), the number of different sets of levels (n₁, …, n_{Lⱼ}) as in (iii) is bounded by

    #{ (n₁, …, n_{Lⱼ}): Σ_k |λⱼn_k|^p ≤ 2^{−jδp} } ≤ 4^{Lⱼ} max{ n₁²n₂² ⋯ n_{Lⱼ}²: Σ_k |λⱼn_k|^p ≤ 2^{−jδp} },

where we used Jensen's inequality in the second last step, and the restriction on (n₁, …, n_{Lⱼ}) in the last step. It follows that for p < ∞ the logarithm of
    = 2^{jεd} Σ_{j≤jε} 2^{(j−jε)d} log( (λ^p/λⱼ^p) / 2^{(j−jε)(d+δp)} )
      + (2^{−jεδp}/λ^p) Σ_{j>jε} 2^{(jε−j)δp} (λ^p/λⱼ^p) log( 2^{(j−jε)(d+δp)} / (λ^p/λⱼ^p) ).
The leading factors 2^{jεd} and 2^{−jεδp}/λ^p in front of the two sums are both equal to the desired upper bound (1/ε)^{d/s}, whence it suffices to show that the sums are bounded. In case (a) λⱼ = λ, and the sums can be seen to be bounded by making the change of summation index j − jε → j. In case (b) λ/λⱼ = (j − jε)², and the situation is the same, apart from the extra factors (j − jε)^{2p}.
Finally, in case (c) we count in (iii) the vectors (n₁, …, n_{Lⱼ}) ∈ N^{Lⱼ} with max_k n_k ≤ 2^{−jδ}/λⱼ, the number being understood to be 0 if j > jε, when Lⱼ = 0. This gives an upper bound on the logarithm of the number of different β̃ equal to a multiple of

    Σⱼ [ Lⱼ log(2^{jd}/Lⱼ) + Lⱼ log 2 + Lⱼ log 4 + Lⱼ log(C2^{−jδ}/λⱼ) ]
      ≲ Σ_{j≤jε} 2^{jd}( 1 + log(2^{−jδ}/λⱼ) ) ≲ 2^{jεd} Σ_{j≤jε} 2^{(j−jε)d} log( (jε − j)²/(2^{jδ}ε) ).
The leading factor 2^{jεd} is equal to (1/ε)^{d/δ}, which is the upper bound given in the theorem in this case, and the sum can be seen to be bounded after substituting ε = 2^{−jεδ} and changing the summation index.
into two halves of equal length. It is clear that ε0 ≤ 1 and that εi+1 ≤ cεi ≤
2εi+1 . Let ni = ni (f ) be the number of intervals in Pi and si = ni+1 − ni
the number of members of Pi that are divided to obtain Pi+1 . Then by the
definitions of sᵢ and εᵢ,

    sᵢ(cεᵢ)^{r/(r+1)} ≤ Σⱼ (f(xⱼ) − f(xⱼ₋₁))^{r/(r+1)} (xⱼ − xⱼ₋₁)^{1/(r+1)}
        ≤ ( Σⱼ (f(xⱼ) − f(xⱼ₋₁)) )^{r/(r+1)} ( Σⱼ (xⱼ − xⱼ₋₁) )^{1/(r+1)},
by Hölder's inequality. This is bounded by (f(1) − f(0))^{r/(r+1)} · 1 ≤ 1. Consequently, the sum of the numbers of intervals up to the ith partition satisfies

(2.7.8)    Σ_{j=1}^i nⱼ = i + Σ_{j=1}^i j sᵢ₋ⱼ ≤ 2 Σ_{j=1}^i j (cεᵢ₋ⱼ)^{−r/(r+1)}
               ≲ Σ_{j=1}^i j c^{rj/(r+1)} εᵢ^{−r/(r+1)} ≲ εᵢ^{−r/(r+1)},
    ε̃ᵢ = sup_f εᵢ(f),
where the supremum is taken over all functions f in the subset. While
ε̃i = ε̃i (f ) may be thought of as depending on f , it really depends only
on the subset of f in the final partitioning, unlike εi . Since the numbers
n1 ≤ n2 ≤ · · · ≤ ni also depend on the sequence P0 ⊂ · · · ⊂ Pi only,
inequality (2.7.8) remains valid if εi is replaced by ε̃i .
For a fixed function f, define a left-bracketing function fᵢ, which is constant on each interval [xⱼ₋₁, xⱼ) in the partition Pᵢ, recursively as follows. First f₀ = 0. Next, given fᵢ₋₁, define fᵢ on the interval [xⱼ₋₁, xⱼ) by

    fᵢ(xⱼ₋₁) = fᵢ₋₁(xⱼ₋₁) + ε̃ᵢ lⱼ / (xⱼ − xⱼ₋₁)^{1/r},
Consequently,

    ‖f − fᵢ‖^r_{λ,r} ≤ nᵢ(2ε̃ᵢ)^r ≲ ε̃ᵢ^{r²/(r+1)},

since nᵢ ≲ ε̃ᵢ^{−r/(r+1)} by (2.7.8). For i = k(f), this is bounded up to a constant by ε^r. Thus the brackets have the correct size.
Second, we count the number of functions f_k obtained when f ranges over F. In view of the definition of k = k(f), we have that ε_{k−1}(g)^r > ε^{r+1} for some function g that is equivalent to f at stage k − 1. This implies that ε_{k−1}(g)^{−r/(r+1)} < 1/ε. Combining this with (2.7.8), we find Σ_{j=1}^{k−1} nⱼ ≲ 1/ε and trivially n_k(f) ≤ 2n_{k−1} ≲ 1/ε. (Note that the numbers nⱼ for j ≤ k − 1 are the same for f and g.) The number of sequences 1 = n₀ ≤ n₁ ≤ ⋯ ≤ n_k ≤ C/ε is equal to the number of ways we can choose k integers from the set {2, 3, …, ⌊C/ε⌋}. It does not exceed 2^{C/ε}. Given a sequence of this type, the number of ways to obtain Pᵢ₊₁ from Pᵢ is \binom{nᵢ}{sᵢ}, which is bounded by 2^{nᵢ}. We conclude that the total number of different final partitions P₀ ⊂ ⋯ ⊂ P_k generated when f ranges over F is bounded by
Multiply this with the upper bound on the number of partitions to conclude that the total number of left brackets does not exceed (2L)^{2C/ε}.
the set BV₁[a, b] is contained in the unit ball of the Besov space B¹_{1,∞}[a, b]. In view of Theorem 2.7.4 the L_τ(λ)-entropy of the latter unit ball is of the order (1/ε) provided that 1 > 1 · (1 − 1/τ)₊, i.e. τ < ∞. Thus this theorem
gives the correct bound on the entropy of the set of monotone functions,
albeit not the bracketing entropy. (The restriction to Lebesgue measure on
a compact interval is not a real disadvantage of this approach, as the second
paragraph of the proof of Theorem 2.7.7 shows that the general setting can
be easily reduced to this case.)
Restricted to the closed subsets, this defines a metric (which may be infi-
nite). The following lemma gives the entropy of the collection of all com-
pact, convex subsets of a fixed, bounded subset of Rd with respect to the
Hausdorff metric.
2.7.10 Lemma. For the class C of all compact, convex subsets of a fixed, bounded subset of R^d, with d ≥ 2, one has

    K₁(1/ε)^{(d−1)/2} ≤ log N(ε, C, h) ≤ K₂(1/ε)^{(d−1)/2},

for constants Kᵢ that depend on d and the bounded set only.
2.7.11 Corollary. For the class C of all compact, convex subsets of a fixed, bounded subset of R^d with d ≥ 2, one has the entropy bound

    log N_{[ ]}(ε, C, L_r(Q)) ≤ K(1/ε)^{(d−1)r/2},
Proof. For a given set C, let εC = {x: d(x, Cᶜ) > ε} be the points that are at least a distance ε inside C, and let C^ε be the points within distance ε of C. Then there exists a constant K₀ depending only on C such that the Lebesgue measure λ satisfies

    λ(C^ε − εC) ≤ K₀ε,
† This is not trivial, although it is intuitively clear.
The entropy in the preceding theorem is relative to the L_r-norm for the
Lebesgue measure on C. The same bounds are valid for the ‖q‖_∞^{1/r} ε-entropy
relative to the L_r(Q)-norm, for any measure Q with a uniformly bounded
Lebesgue density q on C.
The theorem implies that the bracketing entropy integral of the classes
F_∞ is finite for d ≤ 3 in the case that C is a closed, convex polytope, and
for d ≤ 2 in the case of a general compact, convex domain. (In the latter
case the cut-point for r is d/(d − 1); the bound is (1/ε) times a logarithmic
factor if d = 2 and (1/ε)^2 if d = 3, both times taken in the regime where
the bound is worse than (1/ε)^{d/2}.)
The theorem does not provide bounds for the uniform norm. A bound
for r = ∞ is available only if the class of functions is further restricted
to uniformly Lipschitz functions. It can then be derived from the entropy
of the class of their supergraphs, which are convex sets in R^{d+1}, for the
Hausdorff metric. The assumption that the functions are Lipschitz is
unpleasant, though it may be noted that every function f that is convex on
C^η for some η > 0 is automatically Lipschitz on C, with Lipschitz constant
2 sup{|f(x)|: x ∈ C^η}/η. (See Problem 2.7.4.) The proof of the following
result is much shorter than the proof of Theorem 2.7.12, although the latter
also uses the device that a convex function is automatically Lipschitz in the
η-interior of its domain.
238 2 Empirical Processes
Fix a point x such that f(x) < g(x). Then the boundary of the supergraph
of f is below the supergraph of g at x, and the projection (closest point) of
the point (x, f(x)) on the supergraph of g has the form (y, g(y)) for some
point y. The distance d((x, f(x)), (y, g(y))) between the two points in R^{d+1}
is bounded above by the Hausdorff distance between the two supergraphs.
On the other hand, the distance between the two points is bounded below
by a multiple of
2.7.14 Theorem. The class A_A[0, 1]^d of all functions f: [0, 1]^d → R that
can be extended to an analytic function on the set G = {z ∈ C^d: ‖z −
[0, 1]^d‖_∞ < A} with sup_{z∈G} |f(z)| ≤ 1 satisfies, for ε < 1/2 and a constant
c_d that depends on d only,

log N(ε, A_A[0, 1]^d, ‖·‖_∞) ≤ c_d (1/A)^d (log(1/ε))^{1+d}.
Proof. Let {x1 , . . . , xm } be an A/2-net in [0, 1]d for the maximum norm,
and let [0, 1]d = ∪i Bi be a partition in sets obtained by assigning every x ∈
for some metric d on the index set, function F on the sample space, and
every x. Then (diam T )F is an envelope function for the class {ft − ft0 : t ∈
T } for any fixed t0 . The bracketing numbers of this class are bounded by
the covering numbers of T .
2.7.16 Theorem. For 0 < b_1 < b_2 < ∞ with b_1 < 1 and a ≥ e, let
P_{a,b_1,b_2} be the set of all normal mixtures p_{F,σ} with F ranging over the
probability distributions on [−a, a] and σ ∈ [b_1, b_2]. Then there exists a
universal constant K such that for d equal to the L_1-norm or the uniform
norm, and 0 < ε < 1/2,

log N_{[ ]}(ε, P_{a,b_1,b_2}, d) ≤ K (a/b_1) (log(ab_2/(εb_1)))^2.
For fixed b_1, b_2, and a the entropy is of the order (log(1/ε))^2. This is
bigger, but not much bigger, than the entropy of finite-dimensional classes.
On the other hand, the entropy bound increases proportionally to the inverse
of the minimal scale b_1 of the mixture components if this scale approaches zero.
The set of mixtures with an arbitrary mixing distribution on R is not
totally bounded relative to the L1 -metric (even if the scale is restricted
to 1) and hence the preceding theorem can only be extended to mixing
distributions with possibly noncompact support if the mixing distribution
is restricted in some other way. We consider mixing distributions with tails
bounded by a given decreasing function A: (0, ∞) → [0, 1].
For Φ the standard normal distribution function, let Ā be the function
defined by Ā(a) = A(a) + ∫_a^∞ A dλ + 1 − Φ(a), and let Ā^{−1} be its inverse.
2.7.17 Theorem. For 0 < b_1 < 1 < b_2 < ∞ and A a given decreasing
function, let P_{A,b_1,b_2} be the set of all normal mixtures p_{F,σ} with F ranging
over the probability distributions on R with F[−a, a]^c ≤ A(a) for all a > 0
and σ ∈ [b_1, b_2]. Then there exist universal constants c and K such that,
for 0 < ε < 1/4 ∧ A(e),
2.7.19 Lemma. There exist universal constants C and D such that for
any 0 < ε < 1/2, any a, σ > 0 and any probability measure F on [−a, a]
there exists a discrete probability measure F′ on [−a, a] with fewer than
D(aσ^{−1} ∨ 1) log(1/ε) support points such that ‖p_{F,σ} − p_{F′,σ}‖_∞ ≤ Cε/σ.

Proof. We first prove the lemma for a = σ = 1. For ε < 1/2 we have
M := √(8 log(1/ε)) > 2 and hence, for any probability measures F and F′
on [−1, 1],

sup_{|x|≥M} |p_{F,1}(x) − p_{F′,1}(x)| ≤ 2φ(M − 1) ≤ 2φ(M/2) ≤ ε.
G′_i show that ‖p_{F_i,σ} − p_{F′_i,σ}‖_∞ = σ^{−1} ‖p_{G_i,1} − p_{G′_i,1}‖_∞. Combining this with
the inequality ‖p_{F,σ} − p_{F′,σ}‖_∞ ≤ Σ_{i=1}^{k+1} F(I_i) ‖p_{F_i,σ} − p_{F′_i,σ}‖_∞, we conclude
that p_{F′,σ} has the required distance to p_{F,σ}. The number of support points
of F′ is bounded by the number of intervals k + 1 times the maximum
number of support points of an F′_i, and hence is bounded as claimed.
Thus the set P of mixtures pF,σ with F ranging over F and σ over Σ is an
ε-net over Pa,b1 ,b2 for the uniform norm, if the expression at the right of
the last display is made to be less than ε.
We next construct a finite net over P by restricting the support points
and weights of the mixing distributions to suitable grids. For a given γ >
0 let S be a γ-net over the N -dimensional simplex for the `1 -norm, and
let P 0 be the set of all mixtures pF,σ ∈ P such that the (maximally) N
support points of F are among the points {0, ±δ, ±2δ, . . .} ∩ [−a, a] with
weights belonging to S, for given δ > 0. We may project pF,σ ∈ P into
pF 0 ,σ ∈ P 0 by first moving each point mass of F to a closest point in the
grid {0, ±δ, ±2δ, . . .}, and next changing the vector of sizes of the point
masses to a closest vector in S. If F = Σ_j p_j δ_{z_j} and F′ = Σ_j p′_j δ_{z′_j} with z′_j
the projection of z_j on the grid, then

|p_{F,σ}(x) − p_{F′,σ}(x)| ≤ Σ_j [ p_j |φ_σ(x − z_j) − φ_σ(x − z′_j)| + |p_j − p′_j| φ_σ(x − z′_j) ].
Because the function φ′_σ is negative on the positive half line, the integral
on the right is bounded by its restriction to the interval (a, x). If x/2 > a,
then it can be further bounded by

∫_a^{x/2} |φ′_σ(z − x)| dz A(a) + ∫_{x/2}^x |φ′_σ(z − x)| dz A(x/2) ≲ φ_σ(x/2) A(a) + A(x/2)/σ.

If x/2 < a < x, then the second integral in this expression alone is a valid
upper bound. For x < a,

∫_a^∞ φ_σ(x − z) dF(z) ≤ φ_σ(x − a) F[a, ∞) ≲ φ_σ(x − a) A(a).

Together with analogous bounds for the left tail, we obtain that, for any x,

∫_{|z|>a} φ_σ(x − z) dF(z) ≲ (b_2/b_1) [φ_{b_2}(x − a) + φ_{b_2}(x/2) + φ_{b_2}(x + a)] A(a)
  + (A(x/2)/b_1) 1{|x| ≥ 2a}.
Let H denote the appropriate multiple of the right side of this inequality. If
F_a is the renormalized restriction to [−a, a] of a probability measure F, then
F[−a, a] p_{F_a,σ} ≤ p_{F,σ} ≤ F[−a, a] p_{F_a,σ} + H. Consequently, given ε-brackets
[l_i, u_i] that cover P_{a,b_1,b_2}, there exists, for every (F, σ) as in the definition
of P_{A,b_1,b_2}, a bracket [l_i, u_i] with l_i (1 − A(a)) ≤ p_{F,σ} ≤ u_i + H. Thus the
brackets [l_i (1 − A(a)), u_i + H] cover P_{A,b_1,b_2}. The size in L_1 of a bracket
[l_i, u_i] is bounded by ‖u_i − l_i‖_1 + ‖l_i‖_1 A(a) + ‖H‖_1 ≲ ε + Ā(a) b_2/b_1, as
b_1 < 1 < b_2 by assumption. Now we choose a such that Ā(a) ≤ b_1 ε/b_2 and
apply Theorem 2.7.16 to bound the number of brackets [l_i, u_i].
2.7.20 Corollary. Let C_{α,d} be the collection of subgraphs of C^α_1[0, 1]^{d−1}.
There exists a constant K depending only on α and d such that

log N_{[ ]}(ε, C_{α,d}, L_r(Q)) ≤ K ‖q‖_∞^{d/α} (1/ε)^{(d−1)r/α},

for every r ≥ 1, ε > 0, and probability measure Q with bounded Lebesgue
density q on R^d.
The L_r(Q)-size of the brackets is the (1/r)th power of this. It follows that
the bracketing number N_{[ ]}((2ε‖q‖_∞)^{1/r}, C_{α,d}, L_r(Q)) is bounded by p. Finally,
apply the previous theorem and make a change of variable.
It follows that the subgraphs of the functions in C^α_1[0, 1]^{d−1} are P-Donsker
for Lebesgue-dominated measures P with bounded density, provided
α > d − 1. For instance, for the sets cut out in the plane by functions
f: [0, 1] → [0, 1], a uniform Lipschitz condition of any order on the
derivatives suffices. The restriction to smooth measures P is natural, as
smoothness of the graphs is of little help if the probability mass is distributed in
an erratic manner.
Subgraphs of smooth functions provide a simple way to define sets with
smooth boundaries, but the asymmetry in the arguments is unattractive.
A more natural definition of such sets is to map the unit sphere S d−1 =
{x ∈ Rd : kxk = 1} smoothly into Rd and define a corresponding set as
the “interior” of the range of the map, the latter giving the “boundary” of
the set. For instance, the boundaries of smooth sets in R2 are defined as
images of the unit circle under smooth maps. Unfortunately, it is somewhat
complicated to make this definition precise.
To start we need to define smooth maps f: S^{d−1} → R^d. To this end we
can cover S^{d−1} by finitely many sets (an “atlas”) V_i such that each V_i is the
image φ_i(U^{d−1}) of the open unit ball U^{d−1} in R^{d−1} under a C^∞-isomorphism
φ_i: U^{d−1} → V_i. These maps can be constructed to be restrictions of C^∞-
isomorphisms of a neighborhood of the closure of U^{d−1} into S^{d−1}. For given
α, M > 0 we then define G_{α,M,d} to be the collection of all maps f: S^{d−1} →
R^d such that each of the maps f_j ∘ φ_i: U^{d−1} → R, where f_j: S^{d−1} → R is
a coordinate function of f = (f_1, . . . , f_d), is contained in the Hölder ball
C^α_M(U^{d−1}).
log N(ε, K_{α,M,d}, h) ≤ K (1/ε)^{(d−1)/α},

log N_{[ ]}(ε, K_{α,M,d}, L_r(Q)) ≤ K′ (1/ε)^{(d−1)r/α}.
Proof. See Sun and Pyke (1982) or Dudley (1999), Theorem 8.2.16.
6. Let ε > 0 and let D be a compact, convex subset of R^k. Then there exists a
finite set D_0 ⊂ D^ε and a constant c depending only on ε and D, such that
‖f‖_D is bounded by c‖f‖_{D_0} for every convex function f: D^ε → R.
[Hint: Since we can cover D by finitely many cubes that are contained in
D^ε, it suffices to bound the supremum norm of f over a given closed cube
C ⊂ D^ε. By convexity, the maximum of f over C is bounded above by
the maximum of |f| over the corners. Next, f can be bounded below by a
supporting hyperplane at an arbitrary point in the interior of C. This takes
the form f(c) + ⟨∇f(c), x − c⟩, which is bounded below by f(c) − ‖∇f(c)‖ ‖x − c‖,
and the coordinates of ∇f(c) can be bounded by the maximum of the values
f(c + εe_i) − f(c).]
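The first step of the hint, that a convex function on a closed cube attains its maximum at a corner, is easy to confirm numerically; the sketch below uses a hypothetical convex quadratic on the cube [−1, 1]³ (all names and values illustrative).

```python
import itertools
import numpy as np

# A convex function on R^3: x -> |Ax|^2 + b.x (sum of convex pieces).
rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
b = rng.normal(size=3)
f = lambda x: float(np.sum((A @ x) ** 2) + b @ x)

# Maximum over the 8 corners of the cube [-1, 1]^3.
corners = [np.array(c, dtype=float) for c in itertools.product([-1, 1], repeat=3)]
corner_max = max(f(c) for c in corners)

# No interior point exceeds the corner maximum (convexity).
interior = rng.uniform(-1, 1, size=(2000, 3))
assert all(f(x) <= corner_max + 1e-9 for x in interior)
```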
9. For any a, b > 0 there exists a constant K depending only on (a, b) such that,
for every c > 0,

Σ_{j=0}^∞ (c2^{−aj} ∧ 2^{bj}) [ log( (c2^{−aj} ∨ 2^{bj}) / (c2^{−aj} ∧ 2^{bj}) ) + 1 ] ≤ K c^{b/(a+b)}.

[Hint: The solution j_0 to the equation c2^{−aj} = 2^{bj} satisfies c2^{−aj_0} = 2^{bj_0} =
c^{b/(a+b)}. For the nearest integer these equalities hold up to a constant depending
only on a, b. The sum without the “+1” in the logarithm is bounded
by

Σ_{j≤j_0} 2^{bj} log(c2^{−(a+b)j}) + Σ_{j>j_0} c2^{−aj} log(2^{(a+b)j}/c)
≲ c^{b/(a+b)} [ Σ_{j≤j_0} 2^{−b(j_0−j)} log(2^{(a+b)(j_0−j)}) + Σ_{j>j_0} 2^{−a(j−j_0)} log(2^{(a+b)(j−j_0)}) ].

The series on the right are bounded by (a + b) log 2 times Σ_{j≥0} j2^{−bj} and
Σ_{j≥0} j2^{−aj}, respectively.]
[Hint: For given (n_1, . . . , n_l) form a binary string as follows: write each n_k
in its binary representation, duplicate every bit in this representation (e.g.
011 becomes 001111), and concatenate the strings with separator 01. This
gives a string of length bounded by

2 Σ_{k=1}^l (2 log n_k + 1) = 2(2 log(n_1 · · · n_l)) + 2l.
11. The number of subsets of size smaller than l from a set of size n is bounded
above by (en/l)^l.
[Hint: The number of subsets can be written as 2^n P(X ≤ l) for a binomial
(n, 1/2)-distributed random variable X. By Markov's inequality this is bounded
for any r ≤ 1 by 2^n E r^{X−l} = 2^n r^{−l} (1/2 + r/2)^n. Choose r = l/n and use that
(1 + l/n)^n ≤ e^l.]
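The bound is easy to confirm numerically for moderate n and l:

```python
import math

def subsets_up_to(n, l):
    # Number of subsets of size at most l from a set of size n.
    return sum(math.comb(n, k) for k in range(l + 1))

for n in (10, 50, 200):
    for l in (1, 3, n // 4):
        assert subsets_up_to(n, l) <= (math.e * n / l) ** l
```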
[Hint: The left side is bounded by the integral ∫_0^{c/ε} ε log(1/(εx)) dx = c +
c log(1/c), as can be seen by discretization on the grid points cj.]
14. For a given bounded, nondecreasing function g: [0, ∞) → [0, ∞) and q > 0 the
norm (∫_0^∞ (t^{−α} g(t))^q t^{−1} dt)^{1/q} is equivalent to the ℓ_q-norm of the sequence
(2^{−jα} g(2^j))_{j∈Z}. The latter norms are decreasing in q.
2.7 Problems and Complements 249
The previous chapters present empirical laws of large numbers and cen-
tral limit theorems for observations from a fixed underlying distribution P .
Many of the sufficient conditions given there are actually satisfied by very
large classes of underlying measures; typically, the only limitation is finite-
ness of some appropriate moment of the envelope function. For instance,
classes satisfying the uniform entropy condition are, up to measurability,
Glivenko-Cantelli or Donsker for all P with P ∗ F < ∞ or P ∗ F 2 < ∞, re-
spectively. In particular, many bounded classes of functions are universally
Donsker: Donsker for every probability measure on the sample space.
In this chapter we note that even stronger results are typically true.
For example, not only does the central limit theorem hold for all underlying
measures, or very large classes of measures, it is also valid uniformly in
the underlying measure, when this ranges over large classes of measures.
Essentially, the main empirical limit theorems are valid uniformly in the
underlying measure if the conditions hold in a uniform sense, which appears
to be the case frequently. This type of uniformity is certainly of interest in
statistical applications.
where the supremum is taken over the set Qn of all discrete probability
measures with atoms of size integer multiples of 1/n. Then F is Glivenko-
Cantelli uniformly in P ∈ P.
By the uniform strong law for real random variables (Proposition A.5.1),
the second term on the right converges uniformly in P almost surely to
P F {F > M }. By the first condition of the theorem, this quantity can be
made arbitrarily small uniformly in P by choosing large M . Conclude that
it suffices to show that the first term on the right, which equals kPn −P kFM ,
converges almost surely to zero uniformly in P for every fixed M .
The class F_M satisfies the entropy condition of the theorem relative to
the envelope function M. This follows from the inequality

sup_{Q∈Q_n} N(εM, F_M, L_1(Q)) ≤ sup_{n≤k≤2n} sup_{Q∈Q_k} N(ε‖F‖_{Q,1}, F, L_1(Q)),

which may be proved as follows. Suppose that k out of the n support points
x_1, . . . , x_n of a given measure Q ∈ Q_n (with multiple appearance of a given
point permitted) satisfy F(x_i) ≤ M. If Q_k ∈ Q_k is the discrete measure on
these points, then Q|f 1{F ≤ M}| = (k/n) Q_k |f| for any function f. Since
Q_k F ≤ M, it follows that an ε Q_k F-net for F in L_1(Q_k) yields an ε(k/n)M-net
for F_M in L_1(Q). Thus N(εM, F_M, L_1(Q)) ≤ N(ε‖F‖_{Q_k,1}, F, L_1(Q_k)).
Since Qk ⊂ Q2k ⊂ · · ·, the right side is bounded by the right side of the
preceding display. This completes the proof that FM satisfies the entropy
condition of the theorem relative to the envelope function M .
for the supremum taken over all probability measures Q. This trivially im-
plies the entropy condition of the previous theorem. It follows that a suit-
ably measurable VC-class is uniformly Glivenko-Cantelli over any class of
underlying measures for which its envelope function is uniformly integrable.
In particular, a uniformly bounded VC-class is uniformly Glivenko-Cantelli
over all probability measures, provided it satisfies the measurability condi-
tion.
(Here BL_1 is the set of all functions h: ℓ^∞(F) → R which are uniformly
bounded by 1 and satisfy |h(z_1) − h(z_2)| ≤ ‖z_1 − z_2‖_F.) We shall call F
Donsker uniformly in P ∈ P if this convergence is uniform in P.
2.8 Uniformity in the Underlying Distribution 253
(i) ⇒ (ii). For each fixed δ > 0 and P , the truncated modulus function
for every δ > 0. On the right side G can be replaced by F. By assumption the
resulting expression can be made arbitrarily small by choice of δ. Conclude
that, for every ε, η > 0, there exists δ > 0 such that the left side of the
display is smaller than η for every P. Let G increase to a countable ρ_P-dense
subset of F, and use the continuity of the sample paths of G_P to see that
the same statement is true (with the same ε, η, and δ) with G replaced by
F.
This gives a probability form of the second requirement of uniform
pre-Gaussianity. The expectation form follows with the help of Borell’s
inequality A.2.1, which allows one to bound the moment EkGP kFδ by a
universal constant times the median of kGP kFδ (Problem A.2.2).
Fix δ > 0. By the uniform total boundedness, there exists for each P a
δ-net G_P for ρ_P over F such that the cardinalities of the G_P are uniformly
bounded by a constant k. Then

‖G_P‖_F ≤ ‖G_P‖_{G_P} + sup_{ρ_P(f,g)<δ} |G_P(f) − G_P(g)|.
Next, since every h ∈ BL_1 satisfies the inequality |h(z_1) − h(z_2)| ≤ 2 ∧ ‖z_1 −
z_2‖_F, we have, for every ε > 0,
Combination of the last three displayed equations shows that E∗P h(Gn,P )
converges to Eh(GP ) uniformly in h ∈ BL1 and P .
Exactly as in the proof of Theorem 2.5.2, it can be shown that for some
universal constant K,

P*_P( ‖G_{n,P}‖_{F_{δ_n,P}} > ε ) ≤ (K/ε²) E*_P [ ∫_0^{θ_{n,P}/‖F‖_n} sup_Q √(log N(ε‖F‖_{Q,2}, F, L_2(Q))) dε ]² E_P ‖F‖²_n.
The uniform entropy condition requires the supremum over all proba-
bility measures of the root entropy to be integrable. In terms of entropy with
bracketing, it suffices to consider the supremum over the class of interest.
Proof. The first condition implies that sup_P ‖F‖_{P,2} < ∞. Finiteness of the
integral implies finiteness of the integrand. Therefore,

sup_P N_{[ ]}(ε‖F‖_{P,2}, F, L_2(P)) < ∞,
Proofs. The first lemma was argued earlier (under slightly weaker con-
ditions). (Inspection of the proof shows that the implication (i) ⇒ (ii) of
Theorem 2.8.2 is valid without uniform integrability of the envelope.)
For the second lemma, first note that pointwise convergence of the
covariance functions and zero-mean normality of each GP imply marginal
convergence. The uniform pre-Gaussianity gives

lim_{δ↓0} limsup_{n→∞} P_{P_n}( sup_{ρ_{P_n}(f,g)<δ} |G_{P_n}(f) − G_{P_n}(g)| > ε ) = 0.
By the uniform convergence of ρPn to ρP0 , this is then also true with ρPn
replaced by ρP0 . The uniform pre-Gaussianity implies uniform total bound-
edness of F by Theorem 2.8.3, which together with (2.8.5) implies total
boundedness for ρP0 . Finally, apply Theorems 1.5.4 and 1.5.7.
With the notation Z_i = δ_{X_i} − P, the empirical central limit theorem can
be written

(1/√n) Σ_{i=1}^n Z_i ⇝ G,
If the ξi have mean zero, have variance 1, and satisfy a moment condition,
then the multiplier central limit theorem is true if and only if F is Donsker:
in that case, the two displays are equivalent.
A more refined and deeper result is the conditional multiplier central
limit theorem, according to which

(1/√n) Σ_{i=1}^n ξ_i Z_i ⇝ G,
In spite of the notation, this is not a norm (but there exists a norm that
is equivalent to ‖·‖_{2,1}). Finiteness of ‖ξ‖_{2,1} requires slightly more than a
finite second moment, but it is implied by a finite 2 + ε absolute moment
(Problem 2.9.1).
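The norm ‖ξ‖_{2,1} = ∫_0^∞ √P(|ξ| > t) dt can be evaluated numerically; the sketch below uses a hypothetical distribution with survival function P(|ξ| > t) = min(1, t^{−4}), for which the integral equals 2 exactly while Eξ² = 2, so the second moment is finite and ‖ξ‖_2 = √2 < ‖ξ‖_{2,1}.

```python
import math

def survival(t):
    # P(|xi| > t) for the example distribution (Pareto-type tail).
    return min(1.0, t ** -4) if t > 0 else 1.0

def norm21(surv, upper=200.0, n=200_000):
    # Midpoint-rule approximation of the integral of sqrt(P(|xi| > t)).
    h = upper / n
    return sum(math.sqrt(surv((i + 0.5) * h)) for i in range(n)) * h

val = norm21(survival)
# Exact value: int_0^1 1 dt + int_1^inf t^{-2} dt = 2 (up to truncation at 200).
assert abs(val - 2.0) < 1e-2
```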
In Chapter 2.3 it is shown that the norms of a given process Σ_i Z_i and
its symmetrized version Σ_i ε_i Z_i are comparable in magnitude. The following
lemma extends this comparison to the norm of a general multiplier process
Σ_i ξ_i Z_i.
≤ 2(n_0 − 1) E*‖Z_1‖_F E max_{1≤i≤n} |ξ_i|/√n
  + 2√2 ‖ξ‖_{2,1} max_{n_0≤k≤n} E* ‖(1/√k) Σ_{i=n_0}^k ε_i Z_i‖_F.

For symmetrically distributed variables ξ_i, the constants 1/2, 2, and 2√2
can all be replaced by 1.
the variables ε_i|ξ_i| possess the same distribution as the ξ_i. In that case, the
inequality on the left follows from

E* ‖Σ_{i=1}^n ε_i E_ξ|ξ_i| Z_i‖_F ≤ E* ‖Σ_{i=1}^n ε_i |ξ_i| Z_i‖_F.
For ξ̃_{k+1} < t ≤ ξ̃_k, one has k = #(i: |ξ_i| ≥ t). Thus the first expectation in
the product can be written as

Σ_{k=n_0}^n E ∫_{ξ̃_{k+1}}^{ξ̃_k} √k dt ≤ E ∫_0^∞ √(#(i: |ξ_i| ≥ t)) dt ≤ ∫_0^∞ √(n P(|ξ_i| ≥ t)) dt,
2.9 Multiplier Central Limit Theorems 263
by Jensen’s inequality. This concludes the proof of the upper bound for
symmetric variables.
For possibly asymmetric variables, first note that

E* ‖(1/√n) Σ_{i=1}^n ξ_i Z_i‖_F ≤ E* ‖(1/√n) Σ_{i=1}^n (ξ_i − η_i) Z_i‖_F.
(1/2) ‖ξ‖_1 E* ‖(1/√n) Σ_{i=1}^n ε_i Z_i‖_F ≤ E* ‖(1/√n) Σ_{i=1}^n ξ_i Z_i‖_F
  ≤ 2√2 ‖ξ‖_{2,1} max_{1≤k≤n} E* ‖(1/√k) Σ_{i=1}^k ε_i Z_i‖_F.
For variables ξi with range [−1, 1], the maximum in the right side may be
replaced by only its nth term (Problem 2.9.3), but this appears to be untrue
in general. This simpler inequality obtained for n0 = 1 is too weak to yield
the following theorem.
Proof. Since both the empirical processes n^{−1/2} Σ_{i=1}^n (δ_{X_i} − P) and the
multiplier processes n^{−1/2} Σ_{i=1}^n ξ_i (δ_{X_i} − P) do not change if indexed by
the class of functions {f − Pf: f ∈ F} instead of F, it may be assumed
without loss of generality that Pf = 0 for every f. Marginal convergence
of both sequences of processes is equivalent to F ⊂ L_2(P). It suffices to
show that the asymptotic equicontinuity conditions for the empirical and
the multiplier processes are equivalent.
If F is Donsker, then P*(F > x) = o(x^{−2}) as x → ∞ by Lemma 2.3.9.
By the same lemma convergence of the multiplier processes to a tight limit
implies that P*(|ξ|F > x) = o(x^{−2}). In particular, P*F < ∞ in both cases.
Since the assumption ‖ξ‖_{2,1} < ∞ implies the existence of a second moment,
we have E max_{1≤i≤n} |ξ_i|/√n → 0. Combination with the multiplier
≤ 2√2 ‖ξ‖_{2,1} sup_{n_0≤k} E* ‖(1/√k) Σ_{i=1}^k ε_i Z_i‖_{F_δ},
for every n_0 and δ > 0. By Lemma 2.3.6, the Rademacher variables can be
deleted in this statement at the cost of changing the constants. Conclude
that E* ‖n^{−1/2} Σ_{i=1}^n Z_i‖_{F_{δ_n}} → 0 if and only if E* ‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_{F_{δ_n}} → 0.
These are the mean versions of the asymptotic equicontinuity conditions.
By Lemma 2.3.11, these are equivalent to the probability versions.
The first two terms on the right converge weakly if and only if F is Donsker;
the third is in ℓ^∞(F) if and only if ‖P‖_F < ∞; in that case, it converges.
where G and G′ are independent (tight) Brownian bridges and are independent
of the random variable Z, which is standard normally distributed.
The limit process µG + σG′ + σZP is a Gaussian process with zero mean
and covariance function (σ² + µ²)Pfg − µ²Pf Pg.
The first condition is true for almost all sequences by the law of large
numbers. Furthermore, a finite second moment, E‖Z_i‖² < ∞, implies that
max_{1≤i≤n} ‖Z_i‖/√n → 0 for almost all sequences. For the intersection of
the two sets of sequences, both conditions are satisfied.
Here BL_1 is the set of all functions h: ℓ^∞(F) → [−1, 1] such that |h(z_1) −
h(z_2)| ≤ ‖z_1 − z_2‖_F for every z_1, z_2.
Proof. (i) ⇒ (ii). For a Donsker class F, the sequence G′_n converges in
distribution to a Brownian bridge process by the unconditional multiplier
theorem, Theorem 2.9.2. Thus, it is asymptotically measurable.
A Donsker class is totally bounded for ρ_P. For each fixed δ > 0 and
f ∈ F, let Π_δ f denote a closest element in a given, finite δ-net for F. By
uniform continuity of the limit process G, we have G ∘ Π_δ → G almost
surely, as δ ↓ 0. Hence, it is certainly true that

sup_{h∈BL_1} |Eh(G ∘ Π_δ) − Eh(G)| → 0.
for almost every sequence X_1, X_2, . . . . (To see this, define A: R^p → ℓ^∞(F)
by Ay(f) = y_i if Π_δ f = f_i. Then h(G ∘ Π_δ) = g(G(f_1), . . . , G(f_p)), for the
function g defined by g(y) = h(Ay). If h is bounded Lipschitz on ℓ^∞(F),
then g is bounded Lipschitz on R^p with a smaller bounded Lipschitz norm.)
Since BL_1(R^p) is separable for the topology of uniform convergence on
compacta, the supremum in the preceding display can be replaced by a
countable supremum. It follows that the variable in the display is measurable,
because h(G′_n ∘ Π_δ) is measurable. Third,
If (ii) holds, then, by dominated convergence, the second term on the right
side converges to zero for every h ∈ BL_1. The first term on the right is
bounded above by E_X E_ξ h(G′_n)* − E_X E_ξ h(G′_n)_*, which converges to zero
if G′_n is asymptotically measurable. Thus, (ii) implies that G′_n ⇝ G
unconditionally. Then F is Donsker by the converse part of the unconditional
multiplier theorem, Theorem 2.9.2.
This implies that E_ξ h(G′_n) → Eh(G) for every h ∈ BL_1, for almost every
sequence X_1, X_2, . . . . By the portmanteau theorem, this is then also true
for every continuous, bounded h. Thus, the sequence G′_n converges in
distribution to G given almost every sequence X_1, X_2, . . . , also in a more naive
sense of almost sure conditional convergence.
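As an illustration (a simulation sketch with made-up sample sizes, not from the text), the conditional multiplier process for the class of half-line indicators can be generated given the data; its supremum should behave like the supremum of a Brownian bridge, whose 95% quantile is about 1.36.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 2000, 500

# For f ranging over half-line indicators 1{(-inf, x]}, the multiplier
# process G'_n f = n^{-1/2} sum_i xi_i (f(X_i) - P_n f) evaluated at the
# order statistics depends on the data only through the ranks:
# G'_n(k) = n^{-1/2} (S_k - (k/n) S_n) with S_k = xi_1 + ... + xi_k.
sups = np.empty(B)
for b in range(B):
    xi = rng.standard_normal(n)
    G = np.cumsum(xi - xi.mean()) / np.sqrt(n)
    sups[b] = np.abs(G).max()

q95 = np.quantile(sups, 0.95)
assert 1.1 < q95 < 1.6   # near the Brownian-bridge quantile ~1.36
```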
Proof. (i) ⇒ (ii). For the first assertion of (ii), the proof of the in-probability
conditional central limit theorem applies, except that it must be argued
that E_ξ ‖G′_n‖*_{F_δ} converges to zero outer almost surely, as n → ∞ followed
by δ ↓ 0. By Corollary 2.9.9,

limsup_{n→∞} E_ξ ‖G′_n‖*_{F_δ} ≤ 6√2 limsup_{n→∞} E* ‖G′_n‖_{F_δ}, a.s.
E* sup_n E_ξ ‖(1/√n) Σ_{i=1}^n ξ_i Z_i‖_F ≤ C sup_n E* ‖(1/√n) Σ_{i=1}^n ξ_i Z_i‖_F + C √(E*‖Z_1‖²_F),
E sup_n (S_{2^n} − 6 E S_{2^n}) / 2^{n/2} ≤ ∫_0^∞ Σ_n P( S_{2^n} − 6 E S_{2^n} > t 2^{n/2} ) dt
  ≲ ∫_0^∞ ( E*‖Z_1‖²_F / t² ∧ 1 ) dt.

This is bounded by two times the square root of E*‖Z_1‖²_F. The second
assertion now follows similarly from the monotonicity of the variables S_n.
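The final integral can be checked directly: for a > 0, ∫_0^∞ (a/t² ∧ 1) dt = √a + ∫_{√a}^∞ a t^{−2} dt = 2√a. A quick numerical confirmation:

```python
import math

def integral(a, upper_mult=10_000.0, n=200_000):
    # Midpoint rule for int_0^upper min(a/t^2, 1) dt; the tail beyond
    # upper contributes a/upper, which is negligible here.
    upper = upper_mult * math.sqrt(a)
    h = upper / n
    return sum(min(a / ((i + 0.5) * h) ** 2, 1.0) for i in range(n)) * h

for a in (0.25, 1.0, 9.0):
    assert abs(integral(a) - 2 * math.sqrt(a)) < 0.05 * math.sqrt(a)
```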
2. For any pair of random variables ξ and η, one has ‖ξ + η‖²_{2,1} ≤ 4‖ξ‖²_{2,1} +
4‖η‖²_{2,1}.

4. If X_1, X_2, . . . are i.i.d. random variables with limsup |X_n|/√n < ∞ almost
surely, then EX_1² < ∞.
[Hint: By the Kolmogorov 0-1 law, limsup |X_n|/√n is constant. Thus, there
exists a constant M with P(limsup{|X_n| > M√n}) = 0. By the Borel-
Cantelli lemma, Σ P(|X_n|² > nM²) < ∞.]
Here the star on the left side denotes a measurable majorant with respect to
the variables (ξ1 , . . . , ξn , Z1 , . . . , Zn ) jointly.
[Hint: Without loss of generality, assume that E* ‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F ≤
1/2, for every n, and E|ξ_1| = 1. Since Eξ = 0, the random variables
E_ξ ‖Σ_{i=1}^n ξ_i Z_i‖*_F are nondecreasing in n by Jensen's inequality. Therefore,
it suffices to consider the limsup along the subsequence indexed by 2^n. Since
E*‖Z_1‖²_F < ∞,

Σ_{n=1}^∞ P( max_{1≤i≤2^n} ‖Z_i‖*_F > 2^{n/2} ) ≤ Σ_{n=1}^∞ 2^n P( ‖Z_1‖*_F > 2^{n/2} ) < ∞.
It follows by the Borel-Cantelli lemma that each variable ‖Z_i‖*_F, for
1 ≤ i ≤ 2^n, is bounded by 2^{n/2} eventually, almost surely. Consequently,
if Z̃_i = Z_i 1{‖Z_i‖*_F ≤ 2^{n/2}}, then the variables E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z_i‖*_F and
E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F are equal eventually, almost surely. (We abuse notation,
since Z̃_i depends on n.) It suffices to show that the random variable
limsup 2^{−n/2} E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F is bounded by some fixed number almost
surely. By the Borel-Cantelli lemma, this is certainly the case if

Σ_{n=1}^∞ P( E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 5 · 2^{n/2} ) < ∞.
prove that

(2.9.10)  Σ_{n=1}^∞ P( | E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F − E ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F | > 2^{n/2} ) < ∞.
where

C_i = E_ξ ‖Σ_{j=1}^{2^n} ξ_j Z̃_j‖*_F − E_ξ ‖Σ_{j=1, j≠i}^{2^n} ξ_j Z̃_j‖*_F.
Substitute E*‖Z̃_1‖³_F = Σ_{k=−∞}^n E*‖Z_1‖³_F 1{2^{(k−1)/2} < ‖Z_1‖*_F ≤ 2^{k/2}}; next
bound one factor of ‖Z_1‖_F by 2^{k/2}; and, finally, change the order of summation
to bound the previous display by

Σ_{k=−∞}^∞ E*‖Z_1‖²_F 1{2^{(k−1)/2} < ‖Z_1‖*_F ≤ 2^{k/2}} Σ_{n=k}^∞ 2^{k/2−n/2}.

This equals E*‖Z_1‖²_F √2/(√2 − 1). The proof is complete.]
for every t > 0. Here the star on the left side denotes a measurable majorant
with respect to (ξ_1, . . . , ξ_n, Z_1, . . . , Z_n) jointly.
[Hint: The supremum over all n ≥ 1 inside the probability can be replaced by
√2 times the supremum over the numbers 2^n. Set Z̃_i = Z_i 1{‖Z_i‖*_F ≤ t2^{n/2}}.
The given probability is bounded by

Σ_n P( max_{1≤i≤2^n} ‖Z_i‖*_F > t2^{n/2} ) + Σ_n P( E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 2^{n/2}(2M + 3t) )
≤ Σ_n 2^n P( ‖Z_1‖*_F > t2^{n/2} ) + Σ_n P( E_ξ ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 2^{n/2}(M + t) )².

Here the squared probability is obtained by the same argument as before, but
with T equal to the first l such that S_{1,l} > 2^{n/2}(M + t). The proof can be
finished as before.]
By the converse part of Theorem 2.4.3 the outer expectation of the logarithm
of each of the terms in the product on the right is of order o(n). It follows
that E* log N(ε, H_M, L_1(P_n)) = o(n), for every ε > 0 and M > 0. By
The following lemma is the crux of the proof of Theorem 2.10.1. Let
‖·‖ be any norm on R^k.
Proofs. The first two results are immediate consequences of the char-
acterization of weak convergence in `∞ (F) as marginal convergence plus
asymptotic equicontinuity. In both cases the modulus of continuity does
not increase when passing from F to the new class.
‡
Pointwise convergence means convergence for every argument, not convergence almost
surely.
2.10 Permanence of the Glivenko-Cantelli and Donsker Properties 277
The norm does not increase if F is replaced by sconv F. Thus there exists
a version of the empirical process that converges outer almost surely
in ℓ^∞(sconv F) to a process with uniformly continuous sample paths.
This implies that sconv F is Donsker. (Note that by perfectness of φ_n,
E h*({G_n f: f ∈ sconv F}) = (P^n)* h({G_n f: f ∈ sconv F}) for every function
h on ℓ^∞(sconv F).)
for all x. Thus, condition (2.10.7) is satisfied and the theorem applies.
Proof. Without loss of generality, assume that each of the functions Lα,i
is positive. For f ∈ Lα F = Lα,1 F1 × · · · × Lα,k Fk , define ψ(f ) = φ(f /Lα ).
Then ψ is uniformly Lipschitz in the sense of (2.10.7). By the preceding
theorem, ψ(Lα F) = φ(F) is Donsker.
equal to the L_2(P_n) semimetric. This observation permits the use of several
comparison principles for Gaussian processes, including Slepian's lemma,
Proposition A.2.5.
Three lemmas precede the proof of Theorem 2.10.8. The first is a com-
parison principle of independent interest.
2.10.16 Lemma. Let F be a Donsker class with ‖P‖_F < ∞. Then the class
F² = {f²: f ∈ F} is Glivenko-Cantelli in probability: ‖P_n − P‖*_{F²} →_P 0. If,
in addition, P*F² < ∞ for some envelope function F, then F² is also
Glivenko-Cantelli almost surely and in mean.
The right side is bounded by the bounded Lipschitz distance between the
processes Z_n and Z. Conclude that with inner probability at least 1 − ε,

‖P_n f² − P f²‖_F ≤ K sup_{h∈BL_1} |E_ξ h(Z_n) − E h(Z)|.
Proof. For any partition F = ∪_{i=1}^m F_i and ε > 0, one has N(ε, F, ρ_P) ≤
Σ_{i=1}^m N(ε, F_i, ρ_P). Bound the sum by m times the maximum of its terms,
and take logarithms to obtain

ε √(log N(ε, F, ρ_P)) ≤ ε √(log m) + ε max_{i≤m} √(log N(ε, F_i, ρ_P)).
For each l, let ξ_{l1}, . . . , ξ_{ln} be i.i.d. random variables with a standard normal
distribution, chosen independently for different l. Define G_n(f) =
Σ_{l=1}^k n^{−1/2} Σ_{i=1}^n ξ_{li} f_l(X_i). In view of (2.10.7),

var_ξ( H_n(f) − H_n(g) ) ≤ Σ_{l=1}^k (1/n) Σ_{i=1}^n (f_l − g_l)²(X_i) = var_ξ( G_n(f) − G_n(g) ).
By Slepian's lemma A.2.5, the expectation E_ξ ‖H_n − H_n(h_j)‖_{H_j} is bounded
above by two times the same expression, but with G_n instead of H_n. Make
this substitution in the right side of (2.10.20), and conclude that the left
side of (2.10.20) is bounded up to a constant by

E_ξ sup_{e_P(f,h)<δ} |G_n(f) − G_n(h)| + √(log m) sup_{e_P(f,h)<δ} ‖f − h‖_n.
Here the supremum is taken over all finitely discrete probability measures
Q. Consequently, if the right side is finite and P*(L_α · F^α)² < ∞, then the
class φ(F) is Donsker provided its members are square integrable and the
class is suitably measurable.
For each i, construct a minimal ε^{1/α_i} ‖F_i‖_{R_i,2α_i}-net G_i for F_i with respect
to the L_{2α_i}(R_i)-norm. Then the set of points φ(g_1, . . . , g_k) with (g_1, . . . , g_k)
ranging over all possible combinations of g_i from G_i forms a net for φ(F)
of L_2(Q)-size bounded by the square root of

Σ_i ε² ‖F_i‖^{2α_i}_{R_i,2α_i} = ε² Q Σ_i L²_{α,i} F_i^{2α_i}.

This equals ε² ‖L_α · F^α‖²_{Q,2}. Conclude that

N( ε‖L_α · F^α‖_{Q,2}, φ(F), L_2(Q) ) ≤ Π_{i=1}^k N( ε^{1/α_i} ‖F_i‖_{R_i,2α_i}, F_i, L_{2α_i}(R_i) ).

Since expressions of the type N(ε‖F‖_{cR,r}, F, L_r(cR)) are constant in c, the
bound is also valid with R_i replaced by the probability measure R_i/R_i(1).
If Q ranges over all finitely discrete measures, then each of the R_i runs
through finitely discrete measures. Substitute the bound of the preceding
display in the left side of the theorem, and next make a change of variables
ε^{1/α_i} ↦ ε to conclude the proof.
2.10.24 Example. Let the functions in F be nonnegative with lower envelope function F_* and envelope function F, and consider the class √F = {√f : f ∈ F}. If F satisfies the uniform-entropy condition and P* F²/F_* < ∞, then √F is Donsker if suitably measurable. This follows since |√f − √g| ≤ |f − g|/√F_*.
    The conclusion that √F is Donsker can also be obtained under other combinations of moment and entropy conditions. First, Theorem 2.10.8 shows that √F is Donsker if F is Donsker and F_* ≥ δ > 0. This combines a stronger assumption on the lower envelope with a weaker entropy condition. Second, the present theorem can be applied with a different Lipschitz order. For 1/2 ≤ α ≤ 1,

    |√f − √g| / |f − g|^α = |√f − √g|^{1−α} / |√f + √g|^α ≤ (2√F_*)^{1−2α}.

Thus, a suitably measurable class √F is Donsker if, for some α ≥ 1/2, the upper and lower envelopes of F satisfy P* F^{2α} F_*^{1−2α} < ∞ and, in addition, F satisfies

    ∫_0^∞ sup_Q √(log N(ε ||F||_{Q,2α}, F, L_{2α}(Q))) dε/ε^{1−α} < ∞.
Each term of the series on the right tends to zero by the hypothesis that the classes F_j are P-Glivenko-Cantelli. Furthermore, the jth term is dominated by E* P_n(F 1_{X_j}) + P(F 1_{X_j}) ≤ 2 P(F 1_{X_j}), where Σ_{j=1}^∞ P(F 1_{X_j}) = P(F) < ∞. Thus the right side of the display tends to zero by the dominated convergence theorem.
For the Donsker property we impose a stronger, but still simple, condition.
Proof. We can assume without loss of generality that the class F contains
the constant function 1.
Since F_j is Donsker, the sequence of empirical processes indexed by F_j converges in distribution in ℓ^∞(F_j) to a tight Brownian bridge H_j, for each j. This implies that E* ||G_n||_{F_j} → E ||H_j||_{F_j}. Thus, E ||H_j||_{F_j} ≤ C c_j. Let Z_j be a standard normal variable independent of H_j, constructed on the same probability space. Since sup_{f∈F_j} |Pf| < ∞, the process f ↦ Z_j(f) = H_j(f) + Z_j Pf is well defined on F_j and takes its values in ℓ^∞(F_j). We can construct these processes for different j as independent random elements on a single probability space. Then the series Z(f) = Σ_{j=1}^∞ Z_j(f 1_{X_j}) converges in second mean for every f, and satisfies EZ(f)Z(g) = Pfg, for every f and g. Thus, the series defines a version of a Brownian motion process.
Since each of the processes {Z_j(f 1_{X_j}): f ∈ F} has bounded and uniformly continuous sample paths with respect to the L_2(P)-seminorm, so have the partial sums Z_{≤k} = {Σ_{j=1}^k Z_j(f 1_{X_j}): f ∈ F}. Furthermore,

    E sup_{f∈F} |Σ_{j>k} Z_j(f 1_{X_j})| ≤ Σ_{j>k} (E ||H_j||_{F_j} + E|Z_j| P* F 1_{X_j})
        ≤ C Σ_{j>k} c_j + √(2/π) P* F 1_{∪_{j>k} X_j}.
This converges to zero as k → ∞. Thus the series Z(f) = Σ_{j=1}^∞ Z_j(f 1_{X_j}) converges in mean in the space ℓ^∞(F). By the Ito-Nisio theorem, it also converges almost surely. Conclude that almost all sample paths of the process Z are bounded and uniformly continuous. Since E ||Z||_F < ∞, the class F is totally bounded in L_2(P), by Sudakov's inequality. Hence Z is a tight version of a Brownian motion process indexed by F. The process G(f) = Z(f) − Z(1) Pf defines a tight Brownian bridge process indexed by F. We have proved that F is pre-Gaussian.
For each k, set G_{n,≤k}(f) = G_n(f 1_{∪_{j≤k} X_j}). The continuity modulus of the process {G_{n,≤k}(f): f ∈ F} is bounded by the continuity modulus of the empirical process indexed by the sum class Σ_{j≤k} F_j. Since the sum of finitely many Donsker classes is Donsker, we can conclude that the sequence (G_{n,≤k})_{n=1}^∞ is asymptotically tight in ℓ^∞(F). Considering the marginal distributions, we can conclude that the sequence converges in distribution to
Use the theorem with the partition into the intervals X_j = (x_j, x_{j+1}], with x_j = 1 − 2^{−j} for each integer j ≥ 0. The condition of the theorem involves the series

    Σ_{j=0}^∞ ||F_j||_{P,2} ≤ Σ_{j=1}^∞ F(1 − 2^{−j}) √(1 − (1 − 2^{−j})) ≤ 2 ∫_0^1 F(u)/√(1 − u) du.
3. For any class F of functions with envelope function F and 0 < r < s < ∞,
4. For any class F of functions with strictly positive envelope function F, the expression sup_Q N(ε ||F||_{Q,r}, F, L_r(Q)), where the supremum is taken over all finitely discrete measures, is nondecreasing in 0 < r < ∞. (Here L_r(Q) is equipped with (Q|f|^r)^{1/r} even though this is not a norm for r < 1.)
[Hint: Given r < s and Q, define dR = F^{r−s} dQ. Then Q|f − g|^r ≤ ||f − g||^r_{R,s} ||F||^{s−r}_{R,s}, by Hölder's inequality, with (p, q) = (s/r, s/(s − r)).]
5. The expression N(ε ||aF||_{bQ,r}, aF, L_r(bQ)) is constant in a and b.
6. If F is P -pre-Gaussian and A: L2 (P ) → L2 (P ) is continuous, then the class
{Af : f ∈ F} is P -pre-Gaussian.
7. If F is P -Donsker and A: L2 (P ) → L2 (P ) is an orthogonal projection, then
the class {Af : f ∈ F} is not necessarily P -Donsker.
8. If both F and G are Donsker, then the class F · G of products x 7→ f (x)g(x)
is Glivenko-Cantelli in probability.
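The Hölder step in the hint to Problem 4 can be checked numerically on a finitely discrete Q; the data and the strictly positive envelope below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
w = rng.dirichlet(np.ones(m))                   # weights of a finitely discrete Q
f, g = rng.normal(size=m), rng.normal(size=m)
Fenv = np.maximum(np.abs(f), np.abs(g)) + 0.1   # strictly positive envelope

r, s = 1.0, 2.0                                  # any 0 < r < s works; here (p, q) = (2, 2)
wR = w * Fenv ** (r - s)                         # dR = F^{r-s} dQ (not a probability measure)

lhs = np.sum(w * np.abs(f - g) ** r)             # Q |f - g|^r
norm_fg = np.sum(wR * np.abs(f - g) ** s) ** (1.0 / s)   # ||f - g||_{R,s}
norm_F = np.sum(wR * Fenv ** s) ** (1.0 / s)             # ||F||_{R,s}
rhs = norm_fg ** r * norm_F ** (s - r)
```

With r = 1 and s = 2 the bound is exactly the Cauchy–Schwarz inequality applied under the measure R.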
2.11
Central Limit Theorem for Processes
So far we have focused on limit theorems and inequalities for the empirical
process of independent and identically distributed random variables. Most
of the methods of proof apply more generally. In this chapter we indicate
some extensions to the case of independent but not identically distributed
processes.
We consider the general situation of sums of independent stochastic processes {Z_{ni}(f): f ∈ F} with bounded sample paths, indexed by an arbitrary set F.
is measurable, for every δ > 0, every vector (e_1, . . . , e_{m_n}) ∈ {−1, 0, 1}^{m_n}, and every natural number n. (Measurability for the completion of the probability spaces Π_{i=1}^{m_n} (X_{ni}, A_{ni}, P_{ni}) suffices.)
Proof. The Lindeberg condition for norms implies the Lindeberg condition
for marginals. Together with the assumption that the covariance function
converges, this gives marginal weak convergence (to a Gaussian process).
Set Z°_{ni} = Z_{ni} − EZ_{ni}, and let δ_n ↓ 0 be arbitrary. For fixed t and sufficiently large n, Chebyshev's inequality and the second condition give the bound P(|Σ_{i=1}^{m_n} (Z°_{ni}(f) − Z°_{ni}(g))| > t/2) ≤ 1/2 for every pair f, g with ρ(f, g) < δ_n. By the symmetrization lemma, Lemma 2.3.7, for sufficiently large n,

    P*( sup_{ρ(f,g)<δ_n} |Σ_{i=1}^{m_n} (Z°_{ni}(f) − Z°_{ni}(g))| > t )
        ≤ 4 P( sup_{ρ(f,g)<δ_n} |Σ_{i=1}^{m_n} ε_i (Z_{ni}(f) − Z_{ni}(g))| > t/4 ).

For fixed values of the processes Z_{n1}, . . . , Z_{nm_n}, define the subset A_n ⊂ R^{m_n} as the set of all vectors (Z_{n1}(f) − Z_{n1}(g), . . . , Z_{nm_n}(f) − Z_{nm_n}(g)) when the pairs (f, g) range over the set {(f, g) ∈ F × F: ρ(f, g) < δ_n}. By Hoeffding's inequality, Lemma 2.2.7, the stochastic process {Σ_i ε_i a_i: a ∈ A_n} is sub-Gaussian for the Euclidean metric on A_n. By Corollary 2.2.8,

    E_ε sup_{ρ(f,g)<δ_n} |Σ_{i=1}^{m_n} ε_i (Z_{ni}(f) − Z_{ni}(g))| ≲ ∫_0^∞ √(log N(ε, A_n, ||·||)) dε,

where ||·|| is the Euclidean norm. The integrand can be bounded using the inequality N(ε, A_n, ||·||) ≤ N²(ε/2, F, d_n). Furthermore, if

    θ_n² = sup_{a∈A_n} Σ_{i=1}^{m_n} a_i² = ||Σ_{i=1}^{m_n} a_i²||_{A_n},
then, for ε > θ_n, the set A_n fits in the ball of radius ε around the origin and the integrand vanishes. Conclude from this and the entropy condition (2.11.2) that the integral converges to zero in outer probability if θ_n → 0 in probability. Under the measurability assumption, this implies the asymptotic equicontinuity of the sequence Σ_{i=1}^{m_n} Z°_{ni}.
By the Lindeberg condition, there exists a sequence of numbers η_n ↓ 0 such that E* ||Σ a_i² 1{||Z_{ni}||*_F > η_n}||_{A_n} converges to zero. Thus, for showing that θ_n → 0 in outer probability, it is not a loss of generality to assume that each Z_{ni} satisfies ||Z_{ni}||_F ≤ η_n. Fix Z_{n1}, . . . , Z_{nm_n}, and take an ε-net B_n for A_n for the Euclidean norm. For every a ∈ A_n, there exists b ∈ B_n with

    Σ ε_i a_i² = Σ ε_i (a_i − b_i)² + 2 Σ ε_i (a_i − b_i) b_i + Σ ε_i b_i² ≤ ε² + 2ε ||b|| + Σ ε_i b_i².
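The algebraic identity and the bound in this last display can be verified directly on simulated vectors; the data below (a point a, a nearby net point b, and Rademacher signs) are a hypothetical sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
a = rng.normal(size=m)
b = a + 0.01 * rng.normal(size=m)        # b plays the role of a net point near a
signs = rng.choice([-1.0, 1.0], size=m)  # Rademacher variables eps_i

lhs = np.sum(signs * a**2)
expanded = (np.sum(signs * (a - b) ** 2)
            + 2 * np.sum(signs * (a - b) * b)
            + np.sum(signs * b**2))

e = np.linalg.norm(a - b)                # ||a - b||, at most the net width eps
bound = e**2 + 2 * e * np.linalg.norm(b) + np.sum(signs * b**2)
```

The first two terms of the expansion are bounded by e² and 2e||b|| (Cauchy–Schwarz), giving the stated bound.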
By Hoeffding's inequality, Lemma 2.2.7, the variable Σ ε_i b_i² has Orlicz norm for ψ_2 bounded by a multiple of (Σ b_i⁴)^{1/2} ≲ η_n (Σ b_i²)^{1/2}. Apply Lemma 2.2.2 to the third term on the right, and replace suprema over B_n by suprema over A_n to obtain

    E_ε ||Σ ε_i a_i²||_{A_n} ≲ ε² + 2ε ||Σ a_i²||^{1/2}_{A_n} + √(1 + log|B_n|) η_n ||Σ a_i²||^{1/2}_{A_n}.
The size of the ε-net can be chosen |B_n| ≤ N²(ε/2, F, d_n). By the entropy condition (2.11.2), this variable is bounded in probability (Problem 2.11.1). Conclude that for some constant K,

    P( ||Σ ε_i a_i²||_{A_n} > t ) ≤ P*(|B_n| > M) + (K/t) [ε² + (ε + η_n √(log M)) E ||Σ a_i²||^{1/2}_{A_n}].

For M and t sufficiently large, the right side is smaller than 1 − v_1 for the constant v_1 in Hoffmann-Jørgensen's Proposition A.1.5. More precisely, this can be achieved for M such that P*(|B_n| > M) ≤ (1 − v_1)/2 and t equal to the numerator of the second term (i.e., K times the term in square brackets)
    ≲ η_n² + ε² + (ε + η_n √(log M)) E ||Σ a_i²||^{1/2}_{A_n}.
Since this is true for every η > 0, it is also true for every sequence η_n ↓ 0 that converges to zero sufficiently slowly. The processes Z_{ni,η_n} certainly satisfy the Lindeberg condition. If they (or the centered processes Z_{ni,η_n} − EZ_{ni,η_n}) also satisfy the other conditions of the theorem, then the sequence Σ_{i=1}^{m_n} (Z_{ni} − EZ_{ni,η_n}) converges weakly in ℓ^∞(F). The random semimetrics d_n decrease by the truncation. Hence the conditions for the truncated processes are weaker than those for the original processes.
Then the preceding theorem readily yields a central limit theorem for processes with increments that are bounded by a (random) L_2-metric on F. Call the processes Z_{ni} measurelike with respect to (random) measures µ_{ni} if

    |Z_{ni}(f) − Z_{ni}(g)|² ≤ ∫ (f − g)² dµ_{ni},   for every f, g ∈ F.

For measurelike processes, the random semimetric d_n is bounded by the L_2(Σ µ_{ni})-semimetric, and the entropy condition (2.11.2) can be related to the uniform-entropy condition.
On the set where ||F||_{µ_n} > η, the right side of the next-to-last displayed equation is bounded by J(δ_n/η) O_P(1). This converges to zero in probability for every η > 0. On the set where ||F||_{µ_n} ≤ η, we have the bound J(∞) η. This can be made arbitrarily small by the choice of η. Thus, the entropy condition (2.11.2) is satisfied.
Then the preceding lemma verifies the entropy condition (2.11.2), while the first two conditions of the display ensure the other conditions of Theorem 2.11.1. Note that ||Z_{ni}||_F ≤ |Z_{ni}(f)| + 2(µ_{ni} F²)^{1/2}, for any f. Thus, the sequence Σ_{i=1}^{m_n} (Z_{ni} − EZ_{ni}) converges in ℓ^∞(F) provided that the sequence of covariance functions converges pointwise and the measurability conditions are met.
2.11.2 Bracketing
For each n, let Z_{n1}, . . . , Z_{nm_n} be independent stochastic processes indexed by a common index set F. In this section we aim at generalizations of the bracketing central limit theorem, Theorem 2.5.6, in two directions: we prove a version for the non-i.i.d. case, and we also weaken the entropy conditions to conditions involving the existence of either majorizing measures or certain Gaussian processes.
For the first generalization, define, for every n, the bracketing number N_[](ε, F, L_2^n) as the minimal number N_ε of sets in a partition F = ∪_{j=1}^{N_ε} F^n_{εj} of the index set into sets F^n_{εj} such that, for every partitioning set F^n_{εj},

    Σ_{i=1}^{m_n} E* sup_{f,g∈F^n_{εj}} |Z_{ni}(f) − Z_{ni}(g)|² ≤ ε².

See the Problems and Appendix A.2.3.
    sup_{t>0} t² P*( sup_{f,g∈B(ε)} |Σ_{i=1}^{m_n} (Z_{ni}(f) − Z_{ni}(g))| > t ) ≤ ε²,

for every ρ-ball B(ε) ⊂ F of radius less than ε and for every n. Then the sequence Σ_{i=1}^{m_n} (Z_{ni} − EZ_{ni}) is asymptotically tight in ℓ^∞(F). It converges in distribution provided it converges marginally.
(The process is separable, with any dense subset of [0, 1] as the separant, because it is continuous in probability; the sums on the right side are increasing in n and bound dyadic increments.) The right-hand side has mean bounded by Kε. Since the processes are bounded by 1,

    E sup_{a<f≤g<a+ε} |Z(f) − Z(g)|² ≤ 2Kε.
The equivalence of (ii) and (iii) can be argued from properties of Brownian
motion. Here we deduce that (ii) implies (i) from Theorem 2.11.11.
In view of the second condition of (ii), the envelope function F = (1/q) 1_{(0,1/2]} has a weak second moment and ||P||_F = sup{t/q(t): 0 < t ≤ 1/2} is finite. Hence the pre-Gaussianness implies the existence of a tight version of Brownian motion; its standard deviation metric is Gaussian. Since, for s < t,

    1_{(0,s]}/q(s) − 1_{(0,t]}/q(t) = −1_{(s,t]}/q(t) + (1/q(s) − 1/q(t)) 1_{(0,s]},

the square of this metric equals

    ρ²(s, t) = (t − s)/q²(t) + (1/q(s) − 1/q(t))² s,   s ≤ t.

If ρ(s, t) < ε, then both terms on the right are bounded by ε². Conclude that, for every fixed s,

    sup_{t>s, ρ(s,t)<ε} |1_{(0,s]}/q(s) − 1_{(0,t]}/q(t)| ≤ sup_{t>s} (ε 1_{(s,t]}/√(t − s) + ε 1_{(0,s]}/√s) = ε 1_{(s,1]}/√(· − s) + ε 1_{(0,s]}/√s.

The L_{2,∞}(P)-norm of the function x ↦ 1_{(s,1]}(x)/√(x − s) equals 1 for every s. It follows that the L_{2,∞}(P)-norm of the left side of the preceding display is bounded by a multiple of ε.
Shorack and Wellner (1986), page 462.
partition a = t_0 < t_1 < · · · < t_N = b such that F(t_i−) − F(t_{i−1}) < ε² for every i. Then, for every i,

    E sup_{t_{i−1}≤s,t<t_i} |Z(s) − Z(t)|² ≤ E (Z(t_i−) − Z(t_{i−1})) Z(b) < ε².

The number of points in the partition can be chosen smaller than a constant times 1/ε². (Make sure that the points where F jumps more than ε² are among t_0, . . . , t_N.) Thus the entropy condition of Theorem 2.11.9 is easily satisfied for the partition [a, b] = ∪(t_{i−1}, t_i] ∪ {a}.
Proof. The condition implies the upper bound 2 exp(−x²/(4b)) on P(|X| > x), for every x ≤ b/a, and the upper bound 2 exp(−x/(4a)), for all other positive x. Consequently, the same upper bounds hold for all x > 0 for the probabilities P(|X| 1{|X| ≤ b/a} > x) and P(|X| 1{|X| > b/a} > x), respectively. By Lemma 2.2.1, this implies that the Orlicz norms ||X 1{|X| ≤ b/a}||_{ψ_2} and ||X 1{|X| > b/a}||_{ψ_1} are up to constants bounded by √b and a, respectively. For any random variable Y, Jensen's inequality yields, for any convex, increasing, nonnegative function ψ,

    E|Y| 1_A ≤ ψ^{−1}( Eψ(|Y|/||Y||_ψ) / P(A) ) ||Y||_ψ P(A) ≤ ψ^{−1}( 1/P(A) ) ||Y||_ψ P(A).

Apply this to the random variables |X| 1{|X| ≤ b/a} and |X| 1{|X| > b/a} separately and use the triangle inequality to find that

    E|X| 1_A ≲ ψ_2^{−1}(1/P(A)) √b P(A) + ψ_1^{−1}(1/P(A)) a P(A).

Finally, apply the inequality p log^k(1 + 1/p) ≲ (p + µ) log^k(1/µ), which is valid for small µ > 0 and k = 1/2, 1.
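The final inequality can be checked numerically over a grid of p and small µ; the constant 2 in the assertion below is an empirical choice for this grid, not a constant from the text.

```python
import numpy as np

p = np.logspace(-6, 1, 200)[:, None]    # p > 0
mu = np.logspace(-6, -1, 100)[None, :]  # small mu in (0, 0.1]

worst = 0.0
for k in (0.5, 1.0):
    lhs = p * np.log1p(1.0 / p) ** k            # p log^k(1 + 1/p)
    rhs = (p + mu) * np.log(1.0 / mu) ** k      # (p + mu) log^k(1/mu)
    worst = max(worst, float(np.max(lhs / rhs)))
```

On this grid the worst ratio stays well below 1, consistent with the inequality holding up to a modest universal constant.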
(2.11.18)    lim_{q_0→∞} limsup_{n→∞} sup_f Σ_{q>q_0} 2^{−q} √(log(1/µ_n(F_q^n f))) = 0,

    sup_{f,g∈F^n_{qj}} Σ_{i=1}^{m_n} E |Z_{ni}(f) − Z_{ni}(g)|² ≤ 2^{−2q},

    sup_{t>0} t² P*( sup_{f,g∈F^n_{qj}} |Σ_{i=1}^{m_n} (Z_{ni}(f) − Z_{ni}(g))| > t ) ≤ 2^{−2q}.
Here F_q^n f is the set in the qth partition to which f belongs. Under the conditions of Theorem 2.11.9, this is true with 1/µ_n(F_q^n f) replaced by the number N_q^n of sets in the qth partition. The measures µ_n can then be constructed as µ_n = Σ_q 2^{−q} µ_{n,q}, where µ_{n,q}(F_q^n f) = (N_q^n)^{−1}, for every f and q. (Alternatively, the expression 1/µ_n(F_q^n f) can be replaced by N_q^n throughout the proof.) Under the conditions of Theorem 2.11.11, a single measure µ = µ_n can be derived from a majorizing measure corresponding to the Gaussian semimetric ρ. The existence of this measure is indicated in Proposition A.2.22. The construction of a sequence of partitions (in sets of ρ-diameter at most 2^{−q}) is carried out in Lemma A.2.24. In the following, it is not a loss of generality to assume that µ_n(F_q^n f) ≤ 1/4 for every q and f. Most of the argument is carried out for a fixed n; this index will be suppressed in the notation.
The remainder of the proof is similar to the proof of Theorem 2.5.6, except that Lemma 2.11.17 is substituted for Lemma 2.2.10 in the chaining argument, which now uses majorizing measures. Choose an element f_{qj} from each partitioning set F_{qj} and define

    π_q f = f_{qj},    (Δ_q f)_{ni} = sup_{f,g∈F_{qj}} |Z_{ni}(f) − Z_{ni}(g)|,    if f ∈ F_{qj},

    a_q f = 2^{−q} / √(log(1/µ(F_{q+1} f))).
Center each term at zero expectation; take the sum over i; and, finally, take the supremum over f for all three terms on the right separately. It will be shown that the resulting expressions converge to zero in mean as n → ∞ followed by q_0 → ∞, whence the centered processes Z°_{ni} satisfy

(2.11.20)    lim_{q_0→∞} limsup_{n→∞} E* ||Σ_{i=1}^{m_n} (Z°_{ni}(f) − Z°_{ni}(π_{q_0}^n f))||_F = 0.

In view of Theorem 1.5.6, this gives a complete proof in the case that the partitions do not depend on n. The case that the partitions depend on n requires an additional argument, given at the end of this proof.
Since (Δ_q f)_{ni} ≤ 2η_n, the first term in (2.11.19) is zero as soon as 2η_n ≤ inf_f a_{q_0} f. For every fixed q_0, this is true for sufficiently large n, because, by assumption (2.11.18), inf_f a_{q_0} f (which depends on n) is bounded away from zero as n → ∞ for every fixed q_0.
    By the nesting of the partitions, (Δ_q f)_{ni} ≤ (Δ_{q−1} f)_{ni}, which is bounded by a_{q−1} f on the set where (B_q f)_{ni} = 1. It follows that

    |Z_{ni}(f) − Z_{ni}(π_q f)| (B_q f)_{ni} ≤ (Δ_q f)_{ni} (B_q f)_{ni} ≤ a_{q−1} f,

    Σ_{i=1}^{m_n} var((Δ_q f)_{ni} (B_q f)_{ni}) ≤ a_{q−1} f Σ_{i=1}^{m_n} E (Δ_q f)_{ni} 1{(Δ_q f)_{ni} > a_q f} ≤ 2 (a_{q−1} f / a_q f) 2^{−2q},
a = 2a_{q−1} f. Thus, the lemma implies that, for every f and measurable set A,

    E |Σ_{i=1}^{m_n} ((Δ_q f)_{ni} (B_q f)_{ni} − E(Δ_q f)_{ni} (B_q f)_{ni})| 1_A
        ≲ ( a_{q−1} f √(log(1/µ(F_q f))) + (a_{q−1} f / a_q f) 2^{−q} √(log(1/µ(F_q f))) ) ( P(A) + µ(F_q f) )
        ≲ 2^{−q} √(log(1/µ(F_q f))) ( P(A) + µ(F_q f) ).
This concludes the proof that the second term in the decomposition resulting from (2.11.19) converges to zero as n → ∞ followed by q_0 → ∞. The third term can be handled in a similar manner. The proof of (2.11.20) is complete.
Finally, if the partitions depend on n, then the discrepancies at the ends of the chains must be analyzed separately. If the q_0th partition consists of N_{q_0}^n sets, then the combination of Bernstein's inequality and Lemma 2.2.10 yields

    E sup_{ρ(f,g)<δ_n} |Σ_{i=1}^{m_n} (Z°_{ni}(π_{q_0}^n f) − Z°_{ni}(π_{q_0}^n g))| ≲ (log N_{q_0}^n) η_n + √(log N_{q_0}^n) (2^{−q_0} + δ_n).

The entropy condition of Theorem 2.11.9 implies that 2^{−q_0} √(log N_{q_0}^n) → 0 as n → ∞ followed by q_0 → ∞. Hence the expression in the display converges to zero as n → ∞ followed by q_0 → ∞. Combine this with (2.11.20) to see that the sequence Σ_{i=1}^{m_n} Z°_{ni} is asymptotically equicontinuous with respect to ρ. Finally, Theorem 1.5.7 shows that the sequence is asymptotically tight.
Then the central limit theorem holds under an entropy condition. The following theorems impose a uniform-entropy condition and a bracketing entropy condition, respectively.

2.11.22 Theorem. For each n, let F_n = {f_{n,t}: t ∈ T} be a class of measurable functions indexed by a totally bounded semimetric space (T, ρ) such that the classes F_{n,δ} = {f_{n,s} − f_{n,t}: ρ(s, t) < δ} and F²_{n,δ} are P-measurable for every δ > 0. Suppose (2.11.21) holds, as well as

    sup_Q ∫_0^{δ_n} √(log N(ε ||F_n||_{Q,2}, F_n, L_2(Q))) dε → 0,   for every δ_n ↓ 0.
2.11.23 Theorem. For each n, let F_n = {f_{n,t}: t ∈ T} be a class of measurable functions indexed by a totally bounded semimetric space (T, ρ). Suppose (2.11.21) holds, as well as

    ∫_0^{δ_n} √(log N_[](ε ||F_n||_{P,2}, F_n, L_2(P))) dε → 0,   for every δ_n ↓ 0.

    d_n²(s, t) = (1/n) Σ_{i=1}^n (f_{n,s} − f_{n,t})²(X_i) = P_n (f_{n,s} − f_{n,t})².
It follows that N(ε, T, d_n) = N(ε, F_n, L_2(P_n)), for every ε > 0. If F_n is replaced by F_n ∨ 1, then the conditions of the theorem still hold. Hence, assume without loss of generality that F_n ≥ 1. Insert the bound on the covering numbers and next make a change of variables to bound the entropy integral (2.11.2) by

    ∫_0^{δ_n} √(log N(ε ||F_n||_{P_n,2}, F_n, L_2(P_n))) dε ||F_n||_{P_n,2}.
Conclude that the Gaussian process G defined in the preceding problem has
a version with bounded, uniformly ρ-continuous sample paths. Hence ρ is a
Gaussian semimetric.
[Hint: The convergence of the integral is immediate from the fact that N(2^{−q}, F, ρ) = N_q. The continuity follows from Corollary 2.2.8 and Problem 2.2.17. Also see Appendix A.2.3.]
4. Any semimetric d on an arbitrary set F such that ∫_0^∞ √(log N(ε, F, d)) dε < ∞ is Gaussian-dominated.
[Hint: Construct a sequence of nested partitions F = ∪_j F_{qj} in sets of d-diameter at most 2^{−q} for each q. The number N_q of sets in the qth partition can be chosen such that Σ_q 2^{−q} √(log N_q) < ∞. As in the preceding problems, the semimetric ρ defined from this sequence of partitions is Gaussian by the preceding problem and has the property that each ball B_ρ(f, 2^{−q}) equals a partitioning set; hence these balls have d-diameter at most 2^{−q} for every q. The latter implies that d ≤ 2ρ.]
5. Any semimetric d on an arbitrary set F for which there exists a Borel probability measure µ with

    lim_{δ↓0} sup_f ∫_0^δ √(log(1/µ(B(f, ε)))) dε = 0

    sup_{f,g∈F_{qj}} || f − g ||_{P,2,∞} ≤ 2^{−q},
where G_n = √n(P_n − P) is the empirical process indexed by F. The index (s, f) ranges over [0, 1] × F. The covariance function of Z_n is given by

    cov(Z_n(s, f), Z_n(t, g)) = ((⌊ns⌋ ∧ ⌊nt⌋)/n) (Pfg − Pf Pg).

By the multivariate central limit theorem, the marginals of the sequence of processes {Z_n(s, f): (s, f) ∈ [0, 1] × F} converge to the marginals of a Gaussian process Z. The latter is known as the Kiefer-Müller process; it has mean zero and covariance function

    cov(Z(s, f), Z(t, g)) = (s ∧ t) (Pfg − Pf Pg).
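The prelimit time factor (⌊ns⌋ ∧ ⌊nt⌋)/n converges to s ∧ t at rate 1/n, which can be checked directly; a minimal sketch:

```python
import numpy as np

def cov_factor(n, s, t):
    """Prelimit time factor in cov(Z_n(s, f), Z_n(t, g))."""
    return min(np.floor(n * s), np.floor(n * t)) / n

n = 1000
for s, t in [(0.2, 0.9), (0.5, 0.5), (0.73, 0.31)]:
    # |floor(ns)/n - s| < 1/n, so the factor is within 1/n of s ∧ t
    assert abs(cov_factor(n, s, t) - min(s, t)) <= 1.0 / n
```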
2.12.3 Example. Let X = [0, 1] be the unit interval in the real line and C the collection of cells [0, s] with 0 ≤ s ≤ 1. If Q_{ni} is the Dirac measure at the point i/n (for 1 ≤ i ≤ n), then

    S_n[0, s] = Σ_{i/n≤s} Y_{ni} = Σ_{i=1}^k Y_{ni},   if k/n ≤ s < (k + 1)/n.

The choice Y_{ni} = Y_i/√n for an i.i.d. sequence Y_1, Y_2, . . . gives the classical partial-sum process.
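The process in this example is straightforward to simulate; the helper below is a hypothetical illustration of the definition with the classical scaling.

```python
import numpy as np

def partial_sum(Y, s):
    """S_n[0, s] = sum of Y_ni over the points i/n lying in [0, s]."""
    n = len(Y)
    k = int(np.floor(n * s))   # number of indices i with i/n <= s
    return Y[:k].sum()

rng = np.random.default_rng(3)
n = 100
Y = rng.normal(size=n) / np.sqrt(n)  # classical scaling Y_ni = Y_i / sqrt(n)
```

The sample path s ↦ S_n[0, s] is a step function, constant on each interval [k/n, (k + 1)/n).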
2.12.4 Example. Let X = [0, 1] be the unit interval in the real line and C the collection of cells [0, s] with 0 ≤ s ≤ 1. If Q_{ni} is the uniform measure on the interval ((i − 1)/n, i/n], then

    S_n[0, s] = Σ_{i=1}^k Y_{ni} + ((s − k/n)/(1/n)) Y_{n,k+1},   if k/n ≤ s < (k + 1)/n.

This is a linear interpolation of the partial-sum process in the previous example.
2.12.5 Example. If each Q_{ni} equals the Dirac measure at some point x_{ni} ∈ X, then the general process reduces to

    S_n(C) = Σ_{i: x_{ni}∈C} Y_{ni}.

This may be visualized as random weights Y_{ni} being located at fixed points x_{ni}. Each set C is charged with the sum of the weights that it carries.
    The process S_n = Σ Y_{ni} Q_{ni} is the sum of the independent processes Z_{ni} = Y_{ni} Q_{ni}. Since

    |Z_{ni}(C) − Z_{ni}(D)|² ≤ Y²_{ni} Q_{ni}(C △ D),

the processes Z_{ni} are measurelike with respect to the measures µ_{ni} = Y²_{ni} Q_{ni}. Thus, for a collection C that satisfies the uniform-entropy condition, Theorem 2.11.1 combined with Lemma 2.11.6 readily yields a central limit theorem for the sequence S_n. This is true in particular for VC-classes of sets.
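For Dirac measures Q_{ni} and cells C = [0, s], the measurelike bound above in fact holds with equality, which is easy to verify numerically; the data below are a hypothetical sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = np.arange(1, n + 1) / n             # Q_ni = Dirac measure at x_i
Y = rng.normal(size=n) / np.sqrt(n)

def Z(i, s):                            # Z_ni(C) = Y_ni * Q_ni(C) for C = [0, s]
    return Y[i] * float(x[i] <= s)

def Q_symdiff(i, s, t):                 # Q_ni(C △ D) for C = [0, s], D = [0, t]
    lo, hi = min(s, t), max(s, t)
    return float(lo < x[i] <= hi)

ok = all(
    (Z(i, s) - Z(i, t)) ** 2 <= Y[i] ** 2 * Q_symdiff(i, s, t) + 1e-15
    for s, t in [(0.3, 0.7), (0.5, 0.5), (0.9, 0.1)]
    for i in range(n)
)
```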
Proof. Apply Theorem 2.11.1 with Z_{ni} = Y_{ni} Q_{ni} and ρ the L_1(Q)-semimetric. Then ||Z_{ni}||_F can be taken equal to |Y_{ni}|. The entropy condition (2.11.2) is satisfied in view of Lemma 2.11.6. Since C satisfies the uniform-entropy condition, it is totally bounded under the L_1(Q)-semimetric (for any probability measure Q).
    The measurability conditions listed before Theorem 2.11.1 are satisfied, because the suprema can be replaced by countable suprema. Indeed, since C satisfies the uniform-entropy condition, it is totally bounded and hence separable in L_1(Σ_{i=1}^{m_n} Q_{ni} + Q). Thus, there exists a countable subcollection that contains, for every C ∈ C, a sequence C_k with (Σ_{i=1}^{m_n} Q_{ni} + Q)(C_k △ C) → 0 as k → ∞ (and n fixed). For this sequence, S_n(C) = lim_{k→∞} S_n(C_k). Similar arguments applied to the multiplier processes show that the suprema that are assumed measurable in Theorem 2.11.1 can be replaced by countable suprema.
    The theorem follows from Theorem 2.11.1.
2.12.8 Example. Let X = [0, 1]^d be the unit cube and C the collection of all cells [0, t] with 0 ≤ t ≤ 1. Let the collection of measures Q_{ni} be the n^d Dirac measures located at the nodes of the regular grid consisting of the n^d points {1/n, 2/n, . . . , 1}^d. This gives a higher-dimensional version of the classical partial-sum process. If Y_{ni} = Y_i/n^{d/2} for an i.i.d. mean-zero sequence Y_1, Y_2, . . . with unit variance, then the measure Q_n reduces to

    Q_n = n^{−d} Σ_{i=1}^{n^d} Q_{ni}.

This is the discrete uniform measure on the grid points. The sequence Q_n converges weakly to Lebesgue measure, and the collection of cells [0, t] satisfies the bracketing condition for Lebesgue measure. Thus, (2.12.7) is satisfied for Q equal to Lebesgue measure, and the sequence of covariance functions converges. The limiting Gaussian process has covariance function ES(C)S(D) = λ(C ∩ D) for Lebesgue measure λ (a standard Brownian sheet).
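The convergence of the covariance to λ(C ∩ D) can be illustrated numerically for d = 2 and unit-variance weights, where E S_n[0,s] S_n[0,t] = n^{−d} × (number of grid points in [0, s ∧ t]); the grid size is an arbitrary choice.

```python
import numpy as np

n = 400
axis = np.arange(1, n + 1) / n
gx, gy = np.meshgrid(axis, axis)
grid = np.column_stack([gx.ravel(), gy.ravel()])   # the n^2 grid points

def cov_Sn(s, t):
    """E S_n[0,s] S_n[0,t] = n^{-d} * #(grid points in [0, s ∧ t]) for unit-variance Y_i."""
    m = np.minimum(s, t)                           # [0, s] ∩ [0, t] = [0, s ∧ t]
    return np.mean(np.all(grid <= m, axis=1))

s, t = np.array([0.3, 0.8]), np.array([0.6, 0.5])
approx = cov_Sn(s, t)
exact = float(np.prod(np.minimum(s, t)))           # λ([0, s ∧ t]) = 0.3 * 0.5
```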
2.12.9 Example. Let X = [0, 1]^d be the unit cube and C a VC-class. Partition [0, 1]^d into n^d cubes of volume n^{−d}, and let the collection Q_{ni} be the set of uniform probability measures on the cubes. This gives a smoothed version of the partial-sum process in higher dimensions. If each of the Y_{ni} has variance n^{−d}, then the measure Q_n reduces to Lebesgue measure. Thus (2.12.7) is satisfied for Q equal to Lebesgue measure.
    If the cubes in the partition are denoted C_{ni}, then the covariance function can be written

    n^{−d} Σ_{i=1}^{n^d} (λ(C_{ni} ∩ C)/λ(C_{ni})) (λ(C_{ni} ∩ D)/λ(C_{ni})) = E E(1_C(U) | A_n) E(1_D(U) | A_n),

where U is the identity map on [0, 1]^d equipped with Lebesgue measure and A_n is the σ-field generated by the partition. For n → ∞, the sequence E(1_C(U) | A_n) converges in probability to 1_C(U) for every set C. (This
In this section we consider some Donsker classes of interest that do not fit well in the framework of the preceding chapters.
2.13.1 Sequences
A countable class of functions may be shown to be Donsker by any of the criteria we have discussed so far. A very simple sufficient condition is that the sequence converges to zero sufficiently fast.
Proof. For a fixed natural number m, define a partition {f_i} = ∪_{i=1}^{m+1} F_i by letting F_i consist of the single function f_i for i ≤ m and F_{m+1} = {f_{m+1}, f_{m+2}, . . .}. Since the variation over the first m sets in the partition is zero,

    P( sup_i sup_{f,g∈F_i} |G_n(f − g)| > ε ) ≤ P( sup_{f∈F_{m+1}} |G_n f| > ε/2 ) ≤ (4/ε²) Σ_{i=m+1}^∞ P(f_i − Pf_i)²,

where Σ_{i=1}^∞ P(f_i − Pf_i)² < ∞.
    ≤ 2 Σ_{i=1}^k (c_i − d_i)² Pf_i² Σ_{i=1}^k (G_n²(f_i)/Pf_i²) + 2 Σ_{i=k+1}^∞ (c_i − d_i)² Σ_{i=k+1}^∞ G_n²(f_i)

    ≤ 2 ||f − g||²_{P,2} Σ_{i=1}^k (G_n²(f_i)/Pf_i²) + 8 Σ_{i=k+1}^∞ G_n²(f_i).
Take the supremum over all pairs of series f and g with ||f − g||_{P,2} < δ. Since EG_n²(f_i) ≤ Pf_i², the expectation of this supremum is bounded by 2δ²k + 8 Σ_{i=k+1}^∞ Pf_i². This expression can be made arbitrarily small by first choosing k large and next δ small.
Since the cosines are bounded, the series defining the elements of F are uniformly convergent in this case. The classical Cramér-von Mises statistic equals the square of the Kolmogorov-Smirnov statistic over the elliptical class F:

    ∫_0^1 G_n²(t) dt = ||G_n||²_F.
†
Dudley (1967b), Proposition 6.3.
To see this, note that the Cramér-von Mises statistic is the square of the L_2[0, 1]-norm of the function t ↦ G_n(t). Since the functions {√2 sin πjt: j = 1, 2, . . .} form an orthonormal base of L_2[0, 1], Parseval's formula yields

    ∫_0^1 G_n²(t) dt = Σ_{j=1}^∞ (∫_0^1 G_n(t) √2 sin(πjt) dt)² = Σ_{j=1}^∞ (1/(π²j²)) G_n²(√2 cos πjt).

(Note that −∫ G_n(t) f′(t) dt = ∫ f dG_n = G_n f by the definitions of t ↦ G_n(t) and G_n f.) The functions {√2 cos πjt: j = 1, 2, . . .} form an orthonormal system in L_2[0, 1], so that the result follows from the general series representation of ||G_n||²_F for elliptical classes.
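This series representation can be verified numerically for a small uniform sample, comparing the classical computing formula for the Cramér–von Mises statistic with a truncation of the cosine series; the sample size and truncation level below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
X = np.sort(rng.uniform(size=n))

# Direct value of the integral of G_n(t)^2, via the classical closed form
# n * Int (F_n(t) - t)^2 dt = sum_i (X_(i) - (2i - 1)/(2n))^2 + 1/(12n).
i = np.arange(1, n + 1)
direct = np.sum((X - (2 * i - 1) / (2 * n)) ** 2) + 1 / (12 * n)

# Series value: sum_j G_n(sqrt(2) cos(pi j .))^2 / (pi j)^2, where
# G_n f = n^{-1/2} sum_i f(X_i), since P f = 0 for these cosines.
J = 200_000
j = np.arange(1, J + 1)
Gnf = (np.sqrt(2.0) * np.cos(np.pi * np.outer(j, X))).sum(axis=1) / np.sqrt(n)
series = np.sum(Gnf ** 2 / (np.pi ** 2 * j ** 2))
```

The truncation error is at most 2n/(π²J) even in the worst case, so the two values agree to several digits.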
The series are again uniformly convergent. The Watson statistic equals the square of the Kolmogorov-Smirnov statistic over the elliptical class F:

    ∫_0^1 (G_n(t) − ∫ G_n(s) ds)² dt = ||G_n||²_F.

This can be seen by a similar argument. The Watson statistic is the square of the L_2[0, 1]-norm of the projection of the function t ↦ G_n(t) onto the mean-zero functions. The functions {√2 sin 2πjt, √2 cos 2πjt: j = 1, 2, . . .} form an orthonormal base of the mean-zero functions. Application of Parseval's formula followed by partial integration as in the preceding example yields the result.
The argument is slightly more involved than in the preceding examples, but it is based on the same idea. The normalized Legendre polynomials satisfy the differential equations p″_j(u)(1 − u²) − 2u p′_j(u) = −j(j + 1) p_j(u) for u ∈ [−1, 1] (Problem 2.13.1). By a change of variables and partial integration, it can be deduced that

    ∫_0^1 2 p′_i(2t − 1) p′_j(2t − 1) t(1 − t) dt = −(1/4) ∫_{−1}^1 p_i(u) (p″_j(u)(1 − u²) − 2u p′_j(u)) du = (1/4) j(j + 1) δ_{ij}.

It follows that the functions 2√2 p′_j(2t − 1) √(t(1 − t))/√(j(j + 1)), with j ranging over {1, 2, . . .}, form an orthonormal base of L_2[0, 1]. By Parseval's formula, the Anderson-Darling statistic equals

    Σ_{j=1}^∞ (8/(j(j + 1))) (∫_0^1 G_n(t) p′_j(2t − 1) dt)² = Σ_{j=1}^∞ (2/(j(j + 1))) G_n²(p_j(2t − 1)).

This equals ||G_n||²_F by the general series representation of ||G_n||²_F for elliptical classes.
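Both the differential equation and the orthogonality relation used here can be checked numerically with NumPy's Legendre polynomials; `p(j)` below denotes the normalized polynomial, and the quadrature order is an arbitrary choice that is exact for these degrees.

```python
import numpy as np
from numpy.polynomial.legendre import Legendre, leggauss

def p(j):
    """Normalized Legendre polynomial: the integral of p_j^2 over [-1, 1] is 1."""
    return Legendre.basis(j) * np.sqrt((2 * j + 1) / 2)

u, w = leggauss(60)  # Gauss-Legendre nodes and weights on [-1, 1]

# differential equation: p_j''(u)(1 - u^2) - 2u p_j'(u) = -j(j+1) p_j(u)
j = 3
pj = p(j)
lhs = pj.deriv(2)(u) * (1 - u**2) - 2 * u * pj.deriv(1)(u)
deq_ok = np.allclose(lhs, -j * (j + 1) * pj(u))

# orthogonality: Int_0^1 2 p_i'(2t-1) p_j'(2t-1) t(1-t) dt
#              = (1/4) Int_{-1}^1 p_i'(u) p_j'(u) (1 - u^2) du = j(j+1)/4 * delta_ij
orth_ok = True
for a in range(1, 5):
    for b in range(1, 5):
        val = 0.25 * np.sum(w * p(a).deriv()(u) * p(b).deriv()(u) * (1 - u**2))
        exact = 0.25 * b * (b + 1) * (a == b)
        orth_ok &= abs(val - exact) < 1e-8
```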
Gaussian.
[Hint: It suffices to prove the same relationship for the polynomials p_j(u) defined as u^j minus the projection of u^j on the linear space spanned by 1, u, . . . , u^{j−1}. The left side of the differential equation is a polynomial of degree j with leading term −j(j + 1)u^j. By definition, the right side has the same property. It suffices to show that the left side is orthogonal to the functions 1, u, . . . , u^{j−1}. By partial integration,

    ∫ p′_j(u) 2u u^k du = 2p_j(1) ± 2p_j(−1) − ∫ p_j(u) 2(k + 1) u^k du
Then the Watson statistic equals ||G_n||²_F. (The terms in the accompanying series representation are not asymptotically independent; hence this representation is of less interest.)
[Hint: The functions {√2 cos πjt: j = 1, 2, . . .} form an orthonormal base of the mean-zero functions in L_2[0, 1].]
2.14
Maximal Inequalities and Tail Bounds
In this chapter we derive moment and tail bounds for the supremum ||G_n||_F of the empirical process. Throughout this chapter, G_n = √n(P_n − P) denotes the empirical process of an i.i.d. sample X_1, . . . , X_n from a probability measure P, defined as the coordinate projections of a product probability space (X^∞, A^∞, P^∞).
In the first two sections we consider classes of functions such that
either the uniform-entropy integral or the bracketing integral converges.
Then both Lp -moments and ψp -moments can be bounded by a multiple
of the entropy integral times the corresponding moment of the envelope
function. In view of general results that bound the higher order Lp and
the ψp -norms in terms of the L1 -norm, the main job is to derive upper
bounds for the expectation E∗ kGn kF . We derive such bounds also with a
view toward statistical applications in Part 3.
For some of these applications it is important that the bound takes
account of the (small) variance of the individual functions. We present
useful bounds both under uniform entropy and bracketing.
Bounds on moments imply rates of decrease for the tail probabilities P*(||G_n||_F > t) as t → ∞. For uniformly bounded classes F, the tails of ||G_n||_F are exponentially small. In Section 2.14.5, we derive exponential bounds that give the correct constant in the exponent for such classes. Most of the tail bounds are uniform in n.
All bounds measure the size of the supremum ||G_n||_F. In Chapter 2.15 we consider the concentration of this variable around its mean. Unlike the bounds in the present chapter, the concentration bounds do not depend (much) on the size of the class F, and hence have a different character.
where the supremum is taken over all discrete probability measures Q with ||F||_{Q,2} > 0. The envelope function need not be the minimal envelope function. If the envelope function is clear from the context, we may also omit it from the notation and write J(δ, F, L_2).
    Certainly J(1, F, L_2) < ∞ if F satisfies the uniform-entropy condition (2.5.1). For Vapnik-Červonenkis classes F, the function J(δ, F, L_2) is of the order O(δ √(log(1/δ))) as δ ↓ 0.
Because ||G_n F||_{P,2} ≤ ||F||_{P,2}, the far right bound of the theorem can be interpreted as saying that the supremum of the empirical process has the same order of magnitude as its value on the envelope, as soon as the uniform-entropy integral is finite. This is a useful result, also for many statistical applications, but it has the drawback that it is insensitive to the sizes of the individual functions.
    The next theorem remedies this. Its bound is not accurate when δ is very small (relative to n and the entropy integral), but it has exactly the right form for applications to obtain rates of convergence (in Chapter 3.4).
The middle term on the right side is bounded by a multiple of the sum of
the first and third terms, since x ≤ 1/p + x^q/q for any conjugate pair (p, q)
and any x > 0, in particular for x = J(A^r)B/A.
In the next two theorems we relax the assumption that the class F
of functions is uniformly bounded, made in Theorem 2.14.2, to moment
and exponential bounds, respectively, at the cost of a deterioration of the
dependence of the bounds on δ.
   (2.14.11)   E*_P σ_{n,2}² ≲ δ² P F² + (1/√n) E*_P [ J( σ_{n,4}² / (Pn F⁴)^{1/2}, F, L2 ) (Pn F⁴)^{1/2} ].
The next step is to bound σ_{n,4} in terms of σ_{n,2}.
By Hölder's inequality, for any conjugate pair (p, q) and any 0 < s < 4,

   Pn f⁴ ≤ Pn |f|^{4−s} F^s ≤ (Pn |f|^{(4−s)p})^{1/p} (Pn F^{sq})^{1/q}.

Choosing s such that (4 − s)p = 2 (and hence sq = (4p − 2)/(p − 1)), we
find that

   σ_{n,4}⁴ ≤ σ_{n,2}^{2/p} (Pn F^{sq})^{1/q}.
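The Hölder step can be verified on data. The sample, the single function f, and the constant envelope F = 1 below are illustrative assumptions:

```python
import random

rng = random.Random(1)
n = 1000
xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
f = [0.3 * x for x in xs]   # one function of the class, with |f| <= F
F = 1.0                     # envelope on the sample (assumed)

p, q = 3.0, 1.5             # conjugate pair: 1/p + 1/q = 1
s = 4.0 - 2.0 / p           # chosen so that (4 - s) p = 2

Pn_f4 = sum(v ** 4 for v in f) / n
Pn_f2 = sum(v ** 2 for v in f) / n
Pn_Fsq = sum(F ** (s * q) for _ in range(n)) / n
lhs = Pn_f4                                        # P_n f^4
rhs = Pn_f2 ** (1.0 / p) * Pn_Fsq ** (1.0 / q)     # Hoelder upper bound
```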
We insert this bound in (2.14.11). The function (x, y) ↦ x^{1/p} y^{1/q} is concave,
and hence the function (x, y, z) ↦ J( √(x^{1/p} y^{1/q}) / z, F, L2 ) z can be seen
to be concave by the same arguments as in the proof of Theorem 2.14.2.
Therefore, we can apply Jensen's inequality to see that

   E*_P σ_{n,2}² ≲ δ² P F² + (1/√n) J( (E*_P σ_{n,2}²)^{1/(2p)} (P F^{sq})^{1/(2q)} / (P F⁴)^{1/2}, F, L2 ) (P F⁴)^{1/2}.
We conclude that z := (E*_P σ_{n,2}²)^{1/2} / ‖F‖_{P,2} satisfies
(The constant in ≲ may depend on r.) To see this, note that for C > c the
left side is bounded by a multiple of log(2 + t/c), whereas for small C the
left side is bounded by a multiple of (log(2 + t) + log(1 + 1/C))/ℓ(C) ≲
log(2 + t) + 1.
From the inequality Ψ(f) ≤ f ψ(f), we obtain that, for f > 0,

   Ψ( f / log^r(2 + f) ) ≲ f.
Therefore, by (2.14.12) followed by Young's inequality,

   f⁴ / k(C²) = [ (f²/C²) / log^r(2 + f²/C²) ] · [ f² log^r(2 + f²/C²) / ℓ^r(C²) ]
   ≲ f²/C² + Ψ( F² log^r(2 + F²) ).
On integrating this with respect to the empirical measure, with C² = Pn f²,
we see that, with G = Ψ( F² log^r(2 + F²) ),

   Pn f⁴ ≲ k(Pn f²)(1 + Pn G).
4
We take the supremum over f to bound σn,4 as in (2.14.10) in terms of
2
k(σn,2 ), and next substitute this bound in (2.14.11) to find that
   E*_P σ_{n,2}² ≤ δ² P F² + (1/√n) E*_P [ J( √(k(σ_{n,2}²)(1 + Pn G)) / (Pn F⁴)^{1/2}, F, L2 ) (Pn F⁴)^{1/2} ]
   ≤ δ² P F² + (1/√n) J( √(k(E*_P σ_{n,2}²)(1 + P G)) / (P F⁴)^{1/2}, F, L2 ) (P F⁴)^{1/2},
where we have used the concavity of k, and the concavity of the other
maps, as previously. By assumption the expected value P G is finite for
r = 2/p. It follows that z² = E*_P σ_{n,2}² / P F² satisfies, for suitable constants
a, b, c depending on r, P F², P F⁴ and P G,

   z² ≲ δ² + (a/√n) J( √(k(z² b)) c, F, L2 ).
By concavity and the fact that k(0) = 0, we have k(Cz) ≤ C k(z), for
C ≥ 1 and z > 0. The function z ↦ √(k(z² b)) c inherits this property.
Therefore we can apply Lemma 2.14.13, with the k of the lemma equal to
the present function z ↦ √(k(z² b)) c, to obtain a bound on J(z, F, L2)
in terms of J(δ, F, L2) and J(√(k(δ² b)) c, F, L2), which we substitute in
(2.14.3). Here k(δ²) = δ² log^r(1/δ) for sufficiently small δ > 0 and
k(δ²) ≲ δ² ≲ δ² log^r(1/δ) for δ < 1/2 and bounded away from 0. Thus
we can simplify the bound to the one in the statement of the theorem,
possibly after increasing the constants a, b, c to be at least 1, to complete
the proof.
As in the proof of Lemma 2.14.6 we can solve this for J ∘ k(z) to find that

   J ∘ k(z) ≲ J ∘ k(A) + (B/A)² (J ∘ k(A))².

Next, again by the monotonicity of J,

   J(z) ≤ J( √(A² + B² J ∘ k(z)) ) ≤ J(A) √(1 + (B/A)² J ∘ k(z))
   ≲ J(A) [ 1 + (B/A) √(J ∘ k(A)) + (B/A)² J ∘ k(A) ].

The middle term on the right side is bounded by the sum of the first and
third terms.
Consequently, if ‖f‖_{P,2} < δ‖F‖_{P,2} for every f in the class F, then

   ‖ ‖Gn‖*_F ‖_{P,1} ≲ J_{[ ]}(δ, F, L2(P)) ‖F‖_{P,2} + √n P*F{F > √n a(δ)}.
This "Bernstein norm" is neither homogeneous nor does it satisfy the triangle
inequality, but we shall nevertheless use it to measure the "size" of functions.
It dominates the L2(P)-norm.
Proofs. For the proof of the lemma, fix integers q₀ and q₂ such that 2^{−q₀} <
δ ≤ 2^{−q₀+1} and 2^{−q₂−2} < γ ≤ 2^{−q₂−1}. For q ≥ q₀, construct a nested
sequence of partitions F = ∪_{i=1}^{N_q} F_{qi} such that the q₀th partition is the
partition given in the statement of the lemma and, for each q > q₀,

   ‖ sup_{f,g∈F_{qi}} |f − g|* ‖ < 2^{−q}.

By the choice of q₀ the number of sets in the q₀th partition satisfies N_{q₀} ≤
N_{[ ]}(2^{−q₀}, F, ‖·‖) and the preceding display is valid for q = q₀ up to a factor
2. The number N_q of subsets in the qth partition can be chosen to satisfy

   log N_q ≤ Σ_{r=q₀}^{q} log N_{[ ]}(2^{−r}, F, ‖·‖).
Now adopt the same notation as in the proof of Theorem 2.5.6 (except define
a_q with 1 + log rather than log) and decompose, pointwise in x (which is
suppressed in the notation),

   f − π_{q₀}f = (f − π_{q₀}f) B_{q₀}f + Σ_{q=q₀+1}^{q₂} (f − π_q f) B_q f
                + Σ_{q=q₀+1}^{q₂} (π_q f − π_{q−1} f) A_{q−1}f + (f − π_{q₂} f) A_{q₂}f.
See the proof of Theorem 2.5.6 for the notation and motivation.
Apply the empirical process Gn to each of the three terms separately
and take suprema over f ∈ F. The L_p(P)-norm of the first term resulting
from this procedure can be bounded by

   ‖ ‖Gn (f − π_{q₀}f) B_{q₀}f ‖_F ‖*_{P,p} ≲ ‖ ‖Gn ∆_{q₀}f ‖_F ‖*_{P,p} + √n ‖ P ∆_{q₀}f B_{q₀}f ‖_F.

The first term on the right arises as the last term in the upper bound of
the lemma. The second term on the right can be bounded by

   a_{q₀}^{−1} P ∆_{q₀}²f ≲ δ √(1 + log N_{[ ]}(δ/2, F, ‖·‖)) ≲ ∫_{δ/3}^{δ} √(1 + log N_{[ ]}(ε, F, ‖·‖)) dε.
of ‖Gn‖_F on the left. Thus the second and third terms give contributions
to the L_p(P)-norm of ‖Gn‖*_F bounded up to a constant by

   Σ_{q=q₀+1}^{q₂+1} 2^{−q} √(1 + log N_q) ≲ ∫_{2^{−q₂−1}}^{2^{−q₀}} √(1 + log N_{[ ]}(ε, F, ‖·‖)) dε.
For δ = 8η‖F‖_{P,2}, the right side is bounded by √n P*F{F > √n a(η)}.
Finally,

   E* ‖Gn π_{q₀}f ‖_F ≲ E* ‖Gn π_{q₀}f 1{F ≤ √n a(η)} ‖_F + √n P*F{F > √n a(η)}.

Each function π_{q₀}f 1{F ≤ √n a(η)} is uniformly bounded by √n a(η). In-
equality (2.5.7) yields a bound for the first term on the right equal to a
multiple of

   (1 + log N_{q₀}) a(η) + √(1 + log N_{q₀}) ‖ ‖f‖_{P,2} ‖_F.

The first term is bounded by a multiple of J_{[ ]}(η, F, L2(P)) ‖F‖_{P,2}. The
second arises as the last term in the first inequality of the theorem.
For the second inequality of Theorem 2.14.15, it suffices to note that
the integrand in the bracketing integral is decreasing, so that

   δ √(1 + log N_{[ ]}(δ‖F‖_{P,2}, F, L2(P))) ≤ J_{[ ]}(δ, F, L2(P)).

For the final inequality of the theorem, choose δ = 2 in the second. Since
the class F fits in the single bracket [−F, F], it follows that a(2) = 2‖F‖_{P,2}.
For the proofs of Theorems 2.14.16 and 2.14.17, apply Lemma 2.14.18
with γ = 0 and p = 1. It suffices to bound the second and third terms on
the right appropriately.
336 2 Empirical Processes
Since the entropy is decreasing in δ, the second term is smaller than the
entropy integral. By the same reasoning the part in brackets of the first
term is bounded by the square of the entropy integral divided by δ 2 .
Theorem 2.14.17 follows similarly, this time using the refined form
of Bernstein's inequality, Lemma 2.2.11, in combination with the "norm"
‖·‖_{P,B}.
Proof. We first prove that the left side is smaller than the far right
side. By the symmetrization inequality, Lemma 2.3.1, the left side is
bounded by 2E*‖G°n‖_C for G°n the symmetrized empirical process, given
by G°n 1_C = n^{−1/2} Σ_{i=1}^n ε_i 1_C(X_i). For given X₁, ..., X_n the process G°n is
sub-Gaussian relative to the L2(Pn)-norm, and (by definition) there are
∆_n(C, X₁, ..., X_n) different vectors (1_C(X₁), ..., 1_C(X_n)) as C ranges over
C, all of L2(Pn)-norm bounded by 1. Therefore, by Lemma 2.2.2, or, in order
to obtain an optimal constant, direct integration of the tail bound, we have
E*_ε ‖G°n‖_C ≤ √(2 log 2∆_n(C, X₁, ..., X_n)). Finally the right side is bounded
in Corollary 2.6.3 (see Problem 2.6.7).
To prove that the expression in the middle is between the other
expressions, we introduce independent copies of X1 , . . . , Xn , and insert
Rademacher variables as in the proof of Lemma 2.3.1. However, we do not
separate the original variables and their copies (which produces a factor
2), but work with 2n variables. Given these 2n variables the process is still
sub-Gaussian and we bound its conditional expectation as before.
   P(|X| > t) ≤ t^{−p} ‖X‖_p^p.
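The displayed Markov bound can be illustrated on an empirical distribution, for which it holds deterministically; the Gaussian sample below is an illustrative assumption:

```python
import random

rng = random.Random(2)
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
p, t = 2.0, 1.5
n = len(sample)
# empirical tail probability and empirical moment bound ||X||_p^p / t^p
emp_tail = sum(1 for x in sample if abs(x) > t) / n
moment_bound = (sum(abs(x) ** p for x in sample) / n) / t ** p
```

Since 1{|x| > t} ≤ |x|^p/t^p pointwise, the empirical tail never exceeds the empirical moment bound, whatever the sample.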
It is useful to employ also exponential Orlicz norms for 0 < p < 1. The fact
that the functions x ↦ e^{x^p} − 1 are convex only on the interval (c_p, ∞)
for c_p = (1/p − 1)^{1/p} leads to the inconvenience that ‖·‖_{ψ_p} (if defined ex-
actly as before) does not satisfy the triangle inequality. The usual solution
is to define ψ_p(x) as e^{x^p} − 1 for x ≥ c_p and to be linear and contin-
uous on (0, c_p). Since we are interested in using these norms to measure
tail probabilities, the particular adaptation of the definition of ψ_p is not
important.
The following theorem shows that a general Orlicz norm of ‖Gn‖*_F is
bounded by its L1-norm plus the corresponding Orlicz norm of the envelope
function F . In fact, Orlicz norms of sums of independent processes can in
general be bounded by their L1 -norm plus a maximum or sum of Orlicz
norms of the individual terms.
Here 1/p + 1/q = 1 and the constants in the inequalities ≲ depend only on
the type of norm involved in the statement.
Next, Lemma 2.2.2 can be used to further bound the maxima appearing
on the right. This gives the first two lines of the theorem. The third line is
immediate from Proposition A.1.6.
To obtain tail bounds for the random variable ‖Gn‖*_F, the last theorem
may be combined with the bounds on E*‖Gn‖_F given by the preceding
theorems. While the last theorem is valid for any class F of functions, the
bounds on E*‖Gn‖_F rely on the finiteness of an entropy integral. If either
the uniform-entropy integral or the bracketing entropy integral of the class
F is finite, then

   E*‖Gn‖_F ≲ ‖F‖_{P,2},

where the constant depends on the entropy integral. The L2(P)-norm of the
envelope function can in turn be bounded by an exponential Orlicz norm.
Combination with the last theorem shows that

   ‖ ‖Gn‖*_F ‖_{P,ψ_p} ≲ ‖F‖_{P,ψ_p}   (0 < p ≤ 2).
Conclude that ‖Gn‖*_F has tails of the order exp(−Ct^p) for some 0 < p ≤ 2,
whenever the envelope function F has tails of this order. Similar reasoning
can be applied using the L_p(P)-norms. It may be noted that, for large n
and 0 < p < 2, the size of the ψ_p-norm of ‖Gn‖*_F is mostly determined
by the entropy integral and the L2-norm of the envelope: according to the
preceding theorem, the ψ_p-norm of the envelope enters the upper bound
multiplied by a factor that converges to zero as n → ∞.
To see some of the strength of the preceding theorems, it is instruc-
tive to apply them to a class F consisting of a single function f . Then
for every t > K√W and a constant D that depends on K, K′, W, V, and
V′ only.
into the bound. If every f takes its values in the interval [0, 1], this variance
is at most 1/4. Thus a bound of the order exp(−(1/2)t²/σ_F²) could be
better than the bounds given by the preceding theorems. This bound is valid
in the limit as n → ∞, but for finite sample size a correction is necessary.
Proof. See Devroye (1982) and Shorack and Wellner (1986), pages 829–
830.
The following lemma collects four exponential bounds on the tail probabil-
ities of a mean of a sample without replacement.
Proof. For a proof of the first three inequalities, see Shorack and Wellner
(1986). We prove the fourth.
The sample without replacement can be generated in two steps. First,
partition the set of points {c1 , . . . , cN } randomly into n subsets J1 , . . . , Jn
of m elements each. Next, randomly choose one element of each subset.
Let E2 denote expectation with respect to the second step only, treating
the partition obtained in the first step as fixed. Given the partition, the
The choice s = 2nt/(mσ_N²) yields an upper bound equal to half the upper
bound given by the lemma. The probability P(−Ūn + c̄_N > t) can be
bounded similarly.
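The two-step generation of a sample without replacement used above (random partition of the N = nm points into n subsets of m elements, then one uniform pick per subset) can be sketched as follows; the point set and the sizes are illustrative:

```python
import random

def two_step_sample(points, n, rng):
    # Step 1: randomly partition the N = n*m points into n subsets of m each.
    # Step 2: choose one element uniformly from each subset.
    N = len(points)
    m = N // n
    assert N == n * m
    shuffled = list(points)
    rng.shuffle(shuffled)
    subsets = [shuffled[i * m:(i + 1) * m] for i in range(n)]
    return [rng.choice(s) for s in subsets]

rng = random.Random(3)
draw = two_step_sample(list(range(12)), 4, rng)
```

The n drawn points are automatically distinct, because the subsets are disjoint; conditioning on the partition is what the expectation E₂ in the proof refers to.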
is convergent with limit (P̃n,N − PN )f . To see this, note first that the qth
partial sum of the series telescopes out to (P̃n,N − PN )πq f . Since P̃n,N f ≤
mPN f for every nonnegative f , we have
and the conclusion follows. The triangle inequality can now be applied to
obtain
   ‖P̃n,N − PN‖_F ≤ ‖(P̃n,N − PN)π_{q₀}f‖_F + Σ_{q>q₀} ‖(P̃n,N − PN)(π_q − π_{q−1})f‖_F.
If the ε_q-net has N_q elements, then the suprema are over at most N_{q₀}
elements in the first term on the right and N_q N_{q−1} elements in each of the
terms of the series, respectively. Conclude that for any positive numbers η_q
and b with Σ_{q>q₀} η_q + b ≤ 1:

   (2.14.37)   P_R( ‖P̃n,N − PN‖_F > t ) ≤ N_{q₀} P_R( ‖(P̃n,N − PN)π_{q₀}f‖_F > tb )
               + Σ_{q>q₀} N_q² P_R( ‖(P̃n,N − PN)(π_q − π_{q−1})f‖_F > tη_q ).
The probabilities on the right will be bounded with the help of the various
exponential bounds of Lemma 2.14.36. Lemma 2.14.35 turns the resulting
upper bounds into bounds on the tails of the distribution of kPn − P kF .
For appropriate choices of tuning constants, this gives the desired results.
In all four cases the second term on the right is negligible with respect to
the first.
For the proofs of Theorems 2.14.25 and 2.14.26 (with 3V + δ instead
of V ), use Hoeffding’s inequality on the first term in the right of (2.14.37)
and Massart’s inequality on each of the terms of the series. Thus, (2.14.37)
is bounded by
   (2.14.38)   N_{q₀} 2 exp(−2nt²b²) + Σ_{q>q₀} N_q² 2 exp( −nt²η_q² / (4m ε_{q−1}²) ).

Note that P_N(π_q f − π_{q−1}f)² ≤ 2ε_q² + 2ε_{q−1}² is bounded by 4ε_{q−1}².
For the proof of Theorem 2.14.25, choose α large and

   a = 1 − t^{−2};  b = 1 − t^{−2};  m = ⌊t²⌋;  q₀ = 2 + ⌊t^{2/(α−1)}⌋;
   √m ε_q = q^{−α−1};  η_q = (q − 1)^{−α}.
Combination of Lemma 2.14.35 and (2.14.22) with (2.14.37) and (2.14.38)
yields as upper bound for P*(‖Gn‖_F > t) a constant times

   ( 1 − n/(4(1 − a)²t²n′) )^{−1} [ (t q₀^{α+1})^V 2 exp(−2t²b²a²n′²/N²)
   + Σ_{q>q₀} (t q^{α+1})^{2V} 2 exp(−(1/4) t²(q − 1)² a² n′²/N²) ].
For sufficiently large t, the leading multiplicative factor outside the square
brackets is bounded by 2. The quotient n′/n = 1 − 1/m is bounded below
by 1 − 2t^{−2}. Furthermore, the series is bounded by its q₀th term. To see
this, write it as Σ_{q>q₀} exp(−ψ(q)) for ψ(q) = (1/4)t²(q − 1)²a²n′²/N² −
2V(α + 1) log q and apply Problem 2.14.3. Thus, for sufficiently large t, the
last display is up to a constant bounded by

   t^V (3 + t^{2/(α−1)})^{V(α+1)} exp(−2t²(1 − t^{−2})⁶)
   + t^{2V} (2 + t²)^{2V(α+1)} exp(−(1/4) t² t^{4/(α−1)} (1 − 2t^{−2})⁴),
the second term being an upper bound for the q₀th term of the series. For
large t, the second term is certainly bounded by e^{−2t²}. For sufficiently large
α, the first term is bounded by a multiple of (1 ∨ t)^{3V+δ} e^{−2t²}. This concludes
the proof of the first theorem.
For the proof of Theorem 2.14.26, choose α large and

   a = 1 − t^{−2};  b = 1 − t^{−r};  m = ⌊t²⌋;  q₀ = 2 + ⌊t^{r/(α−1)}⌋;
   √m ε_q = q^{−α−β};  η_q = (q − 1)^{−α};  β = Wα/(2 − W);  2 − r = W + Wr(α + β)/(α − 1).

(2.14.39)
   ( 1 − n/(4(1 − a)²t²n′) )^{−1} [ exp( K(t q₀^{α+β})^W ) 2 exp(−2t²a²b²n′²/N²)
   + Σ_{q>q₀} exp( 2K(t q^{α+β})^W ) 2 exp(−(1/4) t²(q − 1)^{2β} a² n′²/N²) ].
Once again the multiplicative factor is bounded and the series is bounded by
its q₀th term. If t is sufficiently large, then so is t/σ ≥ t, m ≥ (1/2)(t/σ)²,
and s ≤ 3/√n. The preceding expression is up to a constant bounded by

   (1/σ) (2 + t/σ)^V (t/σ)^{2V(α+1)/(α−1)} exp( −(1/2) t²(1 − 2σ²/t²)⁶ / (σ² + (3 + t)/√n) )
   + σ^{−2V} exp(−(1/2) t²/σ²) + P*(C_n).
The first two terms are bounded as desired. The last term can be bounded
with the help of Theorems 2.14.25 and 2.14.26. Since each |f| is bounded
by 1,

   ( 1 − nσ²/((1 − a)²t²n′) )^{−1} [ P*(C_n)
   + exp( K((t/σ)^{r/2} q₀^{α+β})^W ) 2 exp( −t²b²a²n′²/N² / (2(σ² + s) + tban′/(N√n)) )
   + Σ_{q>q₀} exp( 2K((t/σ)^{r/2} q^{α+β})^W ) 2 exp( −(1/4)(t²/σ²)(q − 1)^{2β} a² n′²/N² ) ].
Since (α + β)W = 2β, the series can be written in the form Σ_{q>q₀} e^{−ψ(q)}
for a function of the form ψ(q) = c(q − 1)^{2β} − d q^{2β}. The constants c and d
   P*(U ≥ t, W ≥ w) ≤ (K/√(nu)) exp( −nh(u) − ϕ(w) + 5nu(u − t) ).
   (Pn(C − C₀) − P(C − C₀)) / Pn(C₀^c) + (P(C − C₀)/P(C₀^c)) · (P(C₀^c) − Pn(C₀^c)) / Pn(C₀^c).

This is bounded by 1/n times (W₂ + U a/σ₀²). The third term can be handled
similarly.
Now consider the exponential bound. Since nU is distributed as the
absolute deviation from the mean of a binomial(n, P(C₀)) variable, Tala-
grand's inequality for the binomial tail (Proposition A.6.4) yields

   P(U ≥ t) ≤ (2K/√(nu)) exp(−nh(u)) exp(5nu(u − t)).
For I ⊂ {1, ..., n}, let Ω_I be the set such that X_i ∈ C₀ for i ∈ I and
X_i ∉ C₀ otherwise. Define P₁ on Ω_I by P₁(A) = P(A ∩ C₀)/P(C₀). Then for
#I = k, conditionally on Ω_I, W₁ has the same distribution as √k ‖G^Y_k‖_{C₁},
where Y₁, ..., Y_k are i.i.d. P₁ and C₁ = {C₀ − C: C ∈ C}. Similarly, if we
define P₂ on Ω_I by P₂(A) = P(A ∩ C₀^c)/P(C₀^c), then conditionally on Ω_I,
W₂ has the same distribution as √(n − k) ‖G^Z_{n−k}‖_{C₂}, where Z₁, ..., Z_{n−k} are
i.i.d. P₂ and C₂ = {C − C₀: C ∈ C}. Also, note that for any class F of
functions,

   √k E*‖G^Y_k‖_F ≤ K√n E*‖Gn‖_F

for an absolute constant K (Problem 2.14.6). A similar inequality is valid
for √(n − k) E*‖G^Z_{n−k}‖_F. By Theorem 2.15.5 applied twice conditionally on
Ω_I, for w ≥ 2CK√n μ′_n,

   P*(W ≥ w | Ω_I) ≤ 2D exp( −ϕ_{S′}(w/(2C)) ) ≤ 2D exp( −ϕ_S(w/K) ),
with

   S′ = [ √k E*‖G^Y_k‖_{C₁} ∨ 1 + k ‖P₁(f − P₁f)²‖_{C₁} ]
      ∨ [ √(n − k) E*‖G^Z_{n−k}‖_{C₂} ∨ 1 + (n − k) ‖P₂(f − P₂f)²‖_{C₂} ]
   ≤ K√n μ̄′_n + na/σ²
   ≤ K′(na + √n μ̄′_n) ≡ K′S.
Then if l > 0,

   nU(1 + 2a/σ₀²) ≤ t√n − ld,

so W ≥ ld. Thus, since W ≥ 0, we can find l ≥ 0 such that

   W ≥ ld  and  U ≥ u − (l + 1)d/n.

In other words,

   { nU(1 + 2a/σ₀²) + W ≥ t√n } ⊂ ∪_{l=1}^∞ { W ≥ ld, U ≥ u − (l + 1)d/n }.
For l ≥ 1,

   ϕ(ld) ≥ l ϕ(d) ≥ l · 11ud ≥ 5u(l + 1)d + lud.

Hence 5u(l + 1)d − ϕ(ld) ≤ −lud ≤ −l and

   Σ_{l=1}^∞ e^{5u(l+1)d − ϕ(ld)} ≤ Σ_{l=1}^∞ e^{−l} ≡ K.

Thus

   P*( ‖Gn‖_C > t ) ≤ (K/√(nu)) e^{−nh(u)} e^{5ud}.
It remains to bound ud and u in terms of t.
By Problem 2.14.5 with t = K(C)uS, it follows that d ≤ d₀ ≡
K(C)uS. Since d ≥ 1/u and d ≥ C√n μ′_n, we deduce that d ≤
max{1/u, C√n μ′_n, K(C)uS}, or

   ud ≤ max{1, C√n μ′_n u, K(C)u²S}.

Hence, using t/√n ≤ 1,
It follows that

   h(u) ≥ h(t/√n) + (u − t/√n) h′(t/√n)
        ≥ h(t/√n) − 5 (t/√n)(t/√n − u)
        = h(t/√n) − 5 (t²/n) (2a/σ₀²)/(1 + 2a/σ₀²).

Putting this together yields the claimed inequality.
where
Since each set T_{l+1,j} has diameter ≤ 4^{−l}, A_i has diameter at most 2·4^{−l} +
2·4^{−l} = 4^{−l+1}. Now disjointify the sets {A_i} by defining C_i = A_i − ∪_{j<i} A_j
for every i = 1, ..., N. The collection {C_i} forms a partition Q of T that
is coarser than P_{l+1}. Certain elements of Q might contain more than k_l
elements of P_{l+1}; partition any such element of Q into sets all of which
except one contain exactly k_l elements of P_{l+1} and one being the union
of at most k_l sets of P_{l+1}. This constructs P_l satisfying (i) and (ii); (iii)
follows easily from the construction.
To prove the last assertion, note that by (iii) and the hypotheses we
have

   #P_l ≤ (K4^l)^V + #P_{l+1}/(2·4^V).

Induction then yields the claim.

   M = #P_p ≤ 2 (Kt²/V)^V.
Consider the atoms {D_j}_{j=1}^M of P_p. To bound μ_n, we first show that if
3(q − p)V ≤ n4^{−q}, then

   (2.14.47)   μ_n ≤ K̃ [ √V 4^{−p} + √q 4^{−q} + 4^{−q}√(log K) + n^{−1/2}(qV + V log K) ].
   C_k = { C ∈ C₁: P(C △ C_k) ≤ 4^{−q+1} },  G_k = { 1_C − 1_{C_k}: C ∈ C_k },  k = 1, ..., m,
Hence by Lemma 2.3.6, Theorem 2.15.5, and Problem 2.14.8, the variables

   Z_k = ‖ Σ_{i=1}^n (1_C − 1_{C_k})(X_i) − n(P(C) − P(C_k)) ‖_{C_k}
and also bound the sum of the second and third terms in the exponential
in the conclusion of Theorem 2.14.41. Namely, after using (2.14.47) and
4^{−p} ≤ V/t², we need to bound

   Q = K̃ t [ √(Vq) 4^{−q} + √(V log K) 4^{−q} + n^{−1/2}(Vq + V log K) ] − t⁴/(4n).
To do this, let q be the largest integer so that (2.14.51) holds. Then

   3(q − p)4^{q−p} ≤ (n/V) 4^{−p} ≤ (n/V)(V/t²) = n/t²

and hence q − p ≤ K₄ log(K₄n/t²) for some constant K₄. Furthermore, by
the definition of q,

   Q ≤ K̃ (t/√n) [ 2Vq + 3V√(q log K) + V log K ] − t⁴/(4n)
     ≤ K₅ (tV/√n) [q + log K] − t⁴/(4n)
     = K₅ (tV/√n) q − t⁴/(8n) + K₅ (tV/√n) log K − t⁴/(8n)
     ≡ Q₁ + Q₂.

But from the definition of p it follows that 4^{−(p−1)+1} ≥ V/t² and hence
4^{−p} ≥ V/(16t²), or p ≤ K₆ log(16t²/V). Then, by writing q = p + (q − p), it
follows that, for t ≥ K̃√(V log K), we have

   q ≤ K₆ log(16t²/V) + K₄ log(K₄n/t²) ≤ K₇ log(K₇n/V),
   Q₁ ≤ K₈ (tV/√n) log(K₇n/V) − t⁴/(8n) ≤ sup_{0≤t≤√n} [ K₈ (tV/√n) log(K₇n/V) − t⁴/(8n) ] ≤ K₉ V.

Furthermore, if t ≤ √n/log K, then Q₂ ≤ K₅V, while

   sup_{n≥1} Q₂ ≤ K₁₀ (V²/t²)(log K)²,

so that Q ≤ K₁₁V if t ≤ √n/log K, and

   Q ≤ K₁₂ ( V + (V²/t²)(log K)² )
in any case. Thus, it follows from Theorem 2.14.41 and our choices for a,
p, and q that

   P*( ‖Gn‖_{C₁} ≥ t ) ≤ P*( max_{1≤j≤M} ‖Gn‖_{D_j} ≥ t )
   ≤ Σ_{j=1}^M P*( ‖Gn‖_{D_j} ≥ t )
   ≤ M (K/t) e^{−2t²} exp( Kat² + Kμ_n t − t⁴/(4n) )
   ≤ 2 (Kt²/V)^V (K/t) e^{−2t²} exp( KV + K₁₂V + K₁₂ (V²/t²)(log K)² )
   ≤ K₁₃ (Kt²/V)^V (1/t) e^{−2t²} exp(3K₁₄V)
   ≤ (D/t) (DKt²/V)^V e^{−2t²}

if t ≥ √(V log K) or n ≥ t²(log K)². Choosing D = D(K) so large that the
last bound is ≥ 1 for t ≤ √(V log K) and combining with (2.14.43) completes
the proof of Theorem 2.14.29 when (2.14.27) holds.
2.14 Problems and Complements 357
Then ψ̃_p is convex and there exists a constant C_p such that ‖X‖_{ψ̃_p} ≤
‖X‖_{ψ_p} ≤ C_p ‖X‖_{ψ̃_p}. Thus ‖X‖_{ψ̃_p} defines a true Orlicz norm and can be
used interchangeably with ‖X‖_{ψ_p} up to constants. A bound on the latter
pseudonorm leads to the tail bound

   P(|X| > t) ≤ (1 + e^{c_p^p}) e^{−t^p/‖X‖_{ψ_p}^p}.

[Hint: Note that ψ̃_p ≤ ψ_p ≤ ψ̃_p + ψ_p(c_p). Also Eψ̃_p(Y/C) ≤ Eψ̃_p(Y)/C for
every C > 1 and Y by the convexity of ψ̃_p.]
[Hint: Consider the two regions K(C)√(rt) ≥ rC and K(C)√(rt) ≤ rC. On
the first region, 11C² exp(11C² − 1) works for K²(C), while 11C² works for
the second region. Hence K²(C) can be chosen to be the larger of these two
constants.]
[Hint: Use Corollary 2.2.8 with d(C, D) = √(n ‖1_C − 1_D‖_{L₁(Q_n)}), where Q_n
is the empirical measure on the points x_i. The diameter of C for d is at most
√(2N).]
and

   E* ‖ Σ_{i=1}^n ε_i 1_C(X_i) ‖_C ≤ K̃ √(Vn) [ a + (V/n) log(K/a) ]^{1/2} log^{1/2}(K/a).
9. Suppose that C is a class of sets satisfying (2.14.27) and sup_{C∈C} P(C) ≤ σ₀².
Then for some constants D and K̃

   P*( ‖Gn‖_C > t ) ≤ (D/t) e^{−2t²},

for all t ≥ K̃√(V log K) and n ≥ 1.
[Hint: Use Lemma 2.15.10 and the second result proved in Problem 2.14.8.]
[Hint: Recall the proof of Lemma 2.2.2; compute the left side as the integral
∫₀^∞ P( max_{1≤j≤m} |Z_j| > t ) dt; and split the region of integration at τ ≡
D ∨ C√(r log m).]
13. The upper bound in Theorem 2.14.16 can (with the same proof) be slightly
improved to

   J_{[ ]}(δ, F, L2(P)) ‖F‖_{P,2} + log N_{[ ]}(δ‖F‖_{P,2}, F, L2(P)) / √n.

A similar improvement is possible for Theorem 2.14.17.
2.15 Concentration
The location of the distribution of the norm ‖Gn‖_F of the empirical process,
as measured for instance by its mean value E*‖Gn‖_F, depends of course
strongly on the size of the class of functions F. Chapter 2.14 gives vari-
ous methods to estimate this mean value. It turns out that the spread or
"concentration" of the distribution around its center hardly depends on the
complexity of the class F.
In this chapter we present two bounds on the concentration. The first
“bounded difference inequality” is an exponential inequality of Hoeffding
type, and gives a bound expressed in terms of the range of the functions
f . The second, deeper result is of Bennett-Bernstein type, and incorpo-
rates also the variance of the functions f . The bounds can be viewed as
uniform versions of the Hoeffding and Bernstein inequalities, Lemmas 2.2.7
and 2.2.9.
A similar phenomenon of concentration is known for suprema of cen-
tered Gaussian processes, for which Borell's inequality (Proposition A.2.1)
gives a sub-Gaussian bound for probabilities of deviations from the mean
that is valid without conditions.
As usual for measurability purposes the observations X1 , . . . , Xn are
assumed to be defined as the coordinate projections on a product probabil-
ity space.
Both the norm ‖Gn‖_F and the supremum sup_{f∈F} Gn f satisfy this inequal-
ity, with √n c_i the supremum over f of the range of f. (Therefore, so do their
measurable cover functions, almost surely under the product law P^{n+1}, if
these variables are not measurable. See Problem 2.15.1.) In the application
to the empirical process the variables X1 , . . . , Xn are i.i.d., but in the fol-
lowing theorem it suffices that they are independent. The variable Yi (used
in the statement of the theorem to relax assumption (2.15.2) to an almost
sure inequality) is an independent copy of Xi .
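The bounded-difference constant c_i = n^{−1/2} can be seen concretely for the supremum over indicators of left half-lines under the uniform law (an illustrative choice): replacing a single observation changes ‖Gn‖_F by at most n^{−1/2}.

```python
import random

def sup_Gn(xs):
    # ||G_n||_F for F the left half-line indicators, P = uniform[0,1]:
    # sqrt(n) * sup_t |P_n(-inf,t] - t|.
    n = len(xs)
    s = sorted(xs)
    d = max(max(abs((i + 1) / n - x), abs(i / n - x)) for i, x in enumerate(s))
    return n ** 0.5 * d

rng = random.Random(4)
n = 200
xs = [rng.random() for _ in range(n)]
base = sup_Gn(xs)
max_change = 0.0
for _ in range(50):
    ys = list(xs)
    ys[rng.randrange(n)] = rng.random()   # change one coordinate only
    max_change = max(max_change, abs(sup_Gn(ys) - base))
```

Changing one observation moves every value of the empirical distribution function by at most 1/n, hence the supremum of √n|Pn − P| by at most 1/√n, whatever the perturbation.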
(If (2.15.2) only holds almost everywhere, use the essential infimum and
supremum relative to the law of Xi in the definitions of Ai and Bi instead.)
With the understanding that E(T | ∅) = ET, we then have T − ET = Σ_{i=1}^n T_i.
Furthermore, A_i ≤ T_i ≤ B_i and B_i − A_i ≤ c_i by integration of the bounded
P f_i² are the integrals of f_i and f_i² with respect to their first arguments,
the arguments (X_j: j ≠ i) being fixed.) Finally, for given X₁, ..., X_n and
f₀ the function where the supremum in the definition of T is assumed,

   (n − 1)T = Σ_{i=1}^n Σ_{j≠i} f₀(X_j) ≤ Σ_{i=1}^n Σ_{j≠i} f_i(X_j) = Σ_{i=1}^n T_i.

Rearranging this inequality gives that Σ_{i=1}^n (T − T_i) ≤ T. The first assertion
of the theorem follows from Proposition 2.15.12 with the choice u = 1.
The second assertion follows by applying the first to the class of func-
tions F ∪ (−F).
This has the same form as Bernstein's inequality (Lemma 2.2.9), with
w_n playing the role of the variance. If F consists of a single function f,
then w_n is indeed the variance of f, and hence the theorem exactly re-
covers Bernstein's inequality. In the general case the (positive) constant
n^{−1/2} E* sup_f Gn f must be added to the variance, increasing the bound.
For a Donsker class this added term is small and tends to zero, as does the
term t/(3√n) in the denominator of the exponent. In the limit as n → ∞ the
bound takes the form exp(−(1/2)t²/σ_F²), and hence we could interpret these
additional terms as finite-sample corrections on the Gaussian bound. On
the other hand, for t of the order √n or for very large classes F the correction
terms dominate the variance term.
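The comparison with Bernstein's inequality can be made concrete. Taking h(x) = x(log x − 1) + 1, the standard Bennett form (its agreement with (2.15.4) is an assumption of this sketch), the Bennett-type exponent v·h(1 + t/v) always dominates the Bernstein-type exponent t²/(2(v + t/3)):

```python
import math

def h(x):
    # Bennett-type function; h(x) = x(log x - 1) + 1 (assumed form of (2.15.4))
    return x * (math.log(x) - 1.0) + 1.0

checks = []
for v in (0.5, 1.0, 5.0):
    for t in (0.1, 1.0, 10.0):
        bennett = v * h(1.0 + t / v)
        bernstein = t * t / (2.0 * (v + t / 3.0))
        checks.append(bennett >= bernstein - 1e-9)
```

Both exponents agree to second order as t/v → 0, which is why the two inequalities are interchangeable in the moderate-deviation regime.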
Proof. We apply (2.15.8) to the functions f/M, and use the inequality

   √(2t(σ² + 2μ/√n)) ≤ √(2t) σ + t/√n + μ.
Proof. Since |f(x) − Pf| ≤ 1 for all x, the probability on the left side of
the above display is zero for all t ≥ √n. Hence any nonnegative bound will
be trivially true for t > √n, and we may restrict attention to t ≤ √n.
For t ≥ K₀μ_{n,F} the left side is bounded above by P*( ‖Gn‖_F − μ_{n,F} >
t(1 − K₀^{−1}) ). Hence the theorem follows if we can show the bound with (e.g.)
12 instead of 11 in the exponent for the centered norm of the empirical
process.
By the assumptions w_n = σ_F² + 2μ_n/√n ≤ σ₀² + 2K₀^{−1}. Therefore, for
t ≤ ε√n we have w_n + t/(3√n) ≤ σ₀² + 2K₀^{−1} + ε/3, which is smaller than
1/24 if σ₀, K₀^{−1} and ε are chosen sufficiently small. The bound then follows
from (2.15.7).
Finally if ε√n ≤ t ≤ √n, then t/(√n w_n) ≥ ε/(σ₀² + 2K₀^{−1}), which can
be made arbitrarily large by first choosing small ε to satisfy the bound of
the preceding paragraph and next choosing σ₀ and K₀^{−1} to be still smaller.
Because h(t) ∼ t log t as t → ∞, and hence h(t) ≥ 12t for sufficiently large
t, we then have

   n w_n h( 1 + t/(√n w_n) ) ≥ 12 n w_n · t/(√n w_n) = 12√n t ≥ 12t².

Thus the bound follows from Theorem 2.15.5.
Unlike the bounded difference inequality, Theorem 2.15.5 and its con-
sequences (2.15.7) and (2.15.8) concern upper tail probabilities only, and
do not show the size of deviations of the supremum below its mean. Although
these are of the same order (up to constants), it appears that their
estimation requires a separate argument.
The function h in the following theorem is as in the previous theorem
and is defined in (2.15.4).
The proofs of Theorems 2.15.5 and 2.15.11 rely upon the entropy
method. The entropy of a nonnegative random variable Y is defined as
Ent(Y ) = EY log Y − EY log EY . Problems 2.15.3 and 2.15.4 derive the
relevant properties of entropy.
Theorem 2.15.5 is a special case of a theorem on more general functions
T = T (X1 , . . . , Xn ) of independent variables X1 , . . . , Xn . The abstract ver-
sion is somewhat technical, but its proof is relatively straightforward.
   S_i ≤ T − T_i ≤ 1,
   E(S_i | X_j: j ≠ i) ≥ 0,
   S_i ≤ u,
   Σ_{i=1}^n (T − T_i) ≤ T.

If σ² is a constant satisfying σ² ≥ Σ_{i=1}^n E(S_i² | X_j: j ≠ i), and v = (1 +
u)ET + σ², then, for all t ≥ 0,

   P( T − ET ≥ t ) ≤ exp( −v h(1 + t/v) ).
   Ent(e^{λT}) ≤ Σ_{i=1}^n E Ent_i(e^{λT}) = Σ_{i=1}^n E e^{λT_i} Ent_i(e^{λ(T − T_i)}),
for the function g_{λ,α}: ℝ → ℝ given by g_{λ,α}(t) = e^{λt}(e^{−λt} − 1 + λt)/(e^{λt} −
1 − λt + λαt²). Because T − T_i ≤ 1 by assumption, and the function g_{λ,α} is
increasing on (−∞, 1] (see Problem 2.15.5), and also the term in brackets is
nonnegative, the expression in the preceding display increases if g_{λ,α}(T − T_i)
is replaced by g_{λ,α}(1). Because g_{λ,α}(1) > 0 and −t + αt² ≤ −s + αs² if both
s ≤ u ≤ 1 and s ≤ t ≤ 1 (remember that α = 1/(1 + u); the parabola is
symmetric about (u + 1)/2 ≥ u ≥ s and hence decreasing on [0, (1 + u)/2]
and equal at u and 1), the preceding display is further bounded by

   g_{λ,α}(1) [ e^{λT} − e^{λT_i} − λS_i e^{λT_i} + λα S_i² e^{λT_i} ].
We take the expected value and sum over i to obtain an upper bound on
(2.15.13). Next we use that E(−λS_i e^{λT_i}) ≤ 0, because E(S_i | X_j: j ≠ i) ≥ 0
and T_i is measurable relative to (X_j: j ≠ i). We also use that Σ_i E S_i² e^{λT_i} ≤
The exponentials e^{λT} and e^{λT_i} on the right side are in the reverse order
relative to (2.15.13). Therefore, adding the present inequality and (2.15.13)
gives

   (1/g_{λ,α}(1)) Ent(e^{λT}) ≤ λασ² E e^{λT} + Σ_{i=1}^n E λ(T − T_i) e^{λT}.

The last term on the right can be further bounded by λ E T e^{λT}, since
Σ_{i=1}^n (T − T_i) ≤ T by assumption. Next we multiply across by e^{−λET} to
rewrite the inequality as (cf. Problem 2.15.3(iii))

   (1/g_{λ,α}(1)) Ent(e^{λ(T−ET)}) ≤ λασ² E e^{λ(T−ET)} + E λT e^{λ(T−ET)}
   = λαv E e^{λ(T−ET)} + E λ(T − ET) e^{λ(T−ET)}.
The entropy Ent(e^{λ(T−ET)}) can be expressed in terms of the Laplace trans-
form F(λ) = E e^{λ(T−ET)} as Ent(e^{λ(T−ET)}) = λF′(λ) − F(λ) log F(λ), and
E(T − ET)e^{λ(T−ET)} = F′(λ). After also inserting the explicit form of
g_{λ,α}(1), the preceding inequality can be simplified to

   (log F)′(λ) − (e^λ − 1 + α)/(e^λ − 1 − λ + λα) · log F(λ) ≤ e^λ(e^{−λ} − 1 + λ)/(e^λ − 1 − λ + λα) · vα.

Some algebra shows that the corresponding equality is solved (in log F) by
the function λ ↦ vψ(λ), for ψ(λ) = e^λ − 1 − λ. Consequently, we can apply
Problem 2.15.6 to see that log F(λ) ≤ vψ(λ).
Combining this with Markov's inequality, we obtain P(T − ET ≥ t) ≤
e^{−λt}F(λ) ≤ e^{−(λt − vψ(λ))}. The (optimal) choice λ = log(1 + t/v) yields the
proposition.
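The final optimization can be verified numerically. With ψ(λ) = e^λ − 1 − λ and h(x) = x(log x − 1) + 1 (the form for which v·h(1 + t/v) = sup_λ(λt − vψ(λ)); an assumption of this sketch), the choice λ = log(1 + t/v) attains the supremum of the Chernoff exponent:

```python
import math

def psi(lam):
    return math.exp(lam) - 1.0 - lam

def h(x):
    return x * (math.log(x) - 1.0) + 1.0

t, v = 2.0, 3.0                            # illustrative values
lam_star = math.log(1.0 + t / v)           # the claimed optimal choice
best = lam_star * t - v * psi(lam_star)    # value of the Chernoff exponent
# compare against a brute-force grid search over lambda
grid_best = max(lam * t - v * psi(lam)
                for lam in (i / 1000.0 for i in range(1, 3000)))
```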
Boucheron et al. (2000) show that under the same hypotheses, for nonneg-
ative and bounded variables T and for 0 < t ≤ ET,

   P( T − ET ≤ −t ) ≤ exp( −ET h(1 − t/ET) ).
by Problem 2.15.3(vi), for any random variables S_{i,λ}, where E_i means the
expectation with respect to X_i with the other variables (X_j: j ≠ i) fixed.
For f̂_λ a function f that attains the minimum in the definition of S_λ, we
choose

   S_{i,λ} = Σ_f P_i(f̂_λ = f) e^{−λ Σ_{j=1}^n f(X_j)} / M_f(λ)^n.
The expression inside brackets on the right side can also be written as
λ^{−1} log S_λ + nL_{f̂_λ}(λ) − nL′_{f̂_λ}(λ). We conclude that F(λ) = ES_λ satisfies

   λF′(λ) = λ E S′_λ = E S_λ log S_λ − n E S_λ ( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) ).
We solve this for ES_λ log S_λ, and use the definition of Ent(S_λ) and inequal-
ities (2.15.14) and (2.15.19) to infer that, for λ ≤ λ₀,
$$
\lambda F'(\lambda) - \bigl(1 - \gamma(\lambda)\bigr)F(\lambda)\log F(\lambda) \le n\bigl(\nu(\lambda) - 1\bigr)\, ES_\lambda\bigl(\lambda L'_{\hat f_\lambda}(\lambda) - L_{\hat f_\lambda}(\lambda)\bigr).
$$
Here λL′_f(λ) − L_f(λ) = Ent(e^{−λf(X₁)})/M_f(λ) ≤ e^λ(e^{−λ} − 1 + λ)σ²_F, by
Problem 2.15.3(v) and because M_f(λ) ≥ 1, by Jensen’s inequality and the
assumption that Pf = 0. We conclude that, for λ ≤ λ₀,
Because min_f e_f ≤ min_f(e_f/c_f) · max_f c_f, for any positive constants
c_f, e_f, we have Ee^{−λT} ≤ ES_λ · max_f M_f(λ)^n. Combining this with Bennett’s
bound log M_f(λ) ≤ Pf²(e^λ − 1 − λ) and (2.15.20), we conclude that
$$
\frac{E_i\, e^{-\lambda\sum_{j=1}^{n}\hat f_{i,\lambda}(X_j)}\big/M_{\hat f_{i,\lambda}}(\lambda)^{n}}{e^{-\lambda\sum_{j=1}^{n}\hat f_{\lambda}(X_j)}\big/M_{\hat f_{\lambda}}(\lambda)^{n}}
= \frac{e^{-\lambda\sum_{j\neq i}\hat f_{i,\lambda}(X_j)}\big/M_{\hat f_{i,\lambda}}(\lambda)^{n-1}}{e^{-\lambda\sum_{j=1}^{n}\hat f_{\lambda}(X_j)}\big/M_{\hat f_{\lambda}}(\lambda)^{n}}.
$$
By the definition of f̂_{i,λ}, the right side increases if f̂_{i,λ} is replaced by f̂_λ.
This yields the middle inequality of (2.15.16). The second inequality follows
because e^{λf} ≤ e^λ for every f, by the assumption that f ≤ 1, while M_f(λ) ≤
cosh λ, by Problem 2.15.9.
To prove (2.15.17) we first write
$$
E_i S_{i,\lambda} = \sum_{f} P_i(\hat f_\lambda = f)\,\frac{e^{-\lambda\sum_{j\neq i} f(X_j)}}{M_f(\lambda)^{n-1}} = E_i\, S_\lambda\, e^{\lambda\hat f_\lambda(X_i)} M_{\hat f_\lambda}(\lambda).
$$
$$
\sum_{i=1}^{n} ES_\lambda\bigl(e^{\lambda\hat f_\lambda(X_i)} M_{\hat f_\lambda}(\lambda) - 1\bigr)
= \sum_{i=1}^{n} ES_\lambda\Bigl(e^{\lambda\hat f_\lambda(X_i)} M_{\hat f_\lambda}(\lambda) - 1 - \nu(\lambda)\log\bigl(e^{\lambda\hat f_\lambda(X_i)} M_{\hat f_\lambda}(\lambda)\bigr)\Bigr)
- \nu(\lambda)\sum_{i=1}^{n} ES_\lambda \log\frac{e^{-\lambda\hat f_\lambda(X_i)}}{M_{\hat f_\lambda}(\lambda)}.
$$
The first term on the right increases if e^{λf̂_λ(X_i)}M_{f̂_λ}(λ) is replaced by
E_iS_λ/S_λ, by (2.15.16) and the fact that the function x ↦ x − 1 − ν log x is
decreasing for x ≤ ν. The sum in the second term on the right is equal to
ES_λ log S_λ. It follows that the right side is bounded above by
$$
\sum_{i=1}^{n} ES_\lambda\Bigl(\frac{E_i S_\lambda}{S_\lambda} - 1 - \nu(\lambda)\log\frac{E_i S_\lambda}{S_\lambda}\Bigr) - \nu(\lambda)\, ES_\lambda \log S_\lambda.
$$
Here Ent_i(e^{−λf(X_i)}) = M_f(λ)(λL′_f(λ) − L_f(λ)) and the order of summation
$$
= E_i\, S_\lambda\, e^{\lambda\hat f_\lambda(X_i)} M_{\hat f_\lambda}(\lambda)\bigl(\lambda L'_{\hat f_\lambda}(\lambda) - L_{\hat f_\lambda}(\lambda)\bigr).
$$
We bound e^{λf̂_λ(X_i)}M_{f̂_λ}(λ) by ν(λ) using (2.15.16), next take the expecta-
tion over the other variables, and sum over i to arrive at (2.15.18).
372 2 Empirical Processes
7. Let α(λ) = (1 − γ)(λ)/λ for γ = ν log ν and ν(λ) = (e^{2λ} + 1)/2, and
let Λ, β: [0, 1] → R be functions such that Λ is absolutely continuous with
Λ′ − αΛ ≤ β. Then
$$
\Lambda(\lambda) \le \mu\lambda e^{-I(\lambda)} + \lambda e^{-I(\lambda)} \int_0^{\lambda} s^{-1} e^{I(s)}\beta(s)\, ds,
$$
for μ = lim sup_{ε↓0} Λ(ε)/ε and I(λ) = ∫₀^λ γ(s)/s ds.
[Hint: The function I is well defined (as γ(s) = O(s) as s → 0), but the “in-
tegrating factor” exp(−∫₀^λ α(s) ds) for the linear differential equation is not.
This motivates considering the functions first on the interval [ε, 1] for ε ∈
(0, 1). The corresponding integrating factor is A_ε(λ) = exp(−∫_ε^λ α(s) ds) =
(ε/λ)e^{I(λ)−I(ε)}. From the inequality (A_εΛ)′ = A_ε(Λ′ − αΛ) ≤ A_εβ we obtain
that (A_εΛ)(λ) ≤ (A_εΛ)(ε) + ∫_ε^λ (A_εβ)(s) ds. Dividing by A_ε(λ) and taking
the limit as ε ↓ 0 gives the desired result.]
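The closed form of the integrating factor in the hint follows because ∫_ε^λ α(s) ds = log(λ/ε) − (I(λ) − I(ε)). A numerical sketch with the stated γ and ν (the grid size and interval endpoints are arbitrary choices):

```python
import math

def nu(s):
    return (math.exp(2 * s) + 1) / 2

def gamma(s):
    return nu(s) * math.log(nu(s))

def alpha(s):
    return (1 - gamma(s)) / s

def integral(f, a, b, n=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n)))

eps, lam = 0.05, 0.8  # arbitrary interval endpoints
lhs = math.exp(-integral(alpha, eps, lam))
# I(lam) - I(eps), computed directly as an integral from eps:
rhs = (eps / lam) * math.exp(integral(lambda s: gamma(s) / s, eps, lam))
assert abs(lhs - rhs) < 1e-6
```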
satisfy
(i) λe^{−I(λ)}J(λ) + e^λ − 1 − λ ≤ (e^{3λ} − 1 − 3λ)/9;
(ii) λ(1 − e^{−I(λ)}) ≤ 2(e^{3λ} − 1 − 3λ)/9.
9. Any random variable X with EX = 0 and |X| ≤ 1 satisfies Ee^{λX} ≤ cosh λ,
for λ ≥ 0.
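Problem 2.15.9 follows from convexity: for |x| ≤ 1, e^{λx} ≤ cosh λ + x sinh λ, and taking expectations kills the odd term when EX = 0. A numerical sketch over randomly generated centered two-point variables (the ranges are arbitrary):

```python
import math
import random

def mgf_two_point(lam, a, b):
    """E exp(lam*X) for X in {a, b} with a > 0 > b and P(X = a) chosen
    so that EX = 0."""
    p = -b / (a - b)
    return p * math.exp(lam * a) + (1 - p) * math.exp(lam * b)

random.seed(0)
for _ in range(1000):
    a = random.uniform(0.01, 1.0)    # values stay inside [-1, 1]
    b = random.uniform(-1.0, -0.01)
    lam = random.uniform(0.0, 5.0)
    assert mgf_two_point(lam, a, b) <= math.cosh(lam) + 1e-12
```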
Notes
2.2. The maximal inequality Theorem 2.2.4 uses the special properties of
the Orlicz norm only through the application of Lemma 2.2.2. Hence an
identical result can be proved using other norms provided the analogue of
the lemma is available. For instance, Ledoux and Talagrand (1991) show
that the lemma holds for weak Lp -norms with ψ taken equal to xp . Second,
it may be remarked that often the full strength of Theorem 2.2.4 is not
needed and the weakened version of this theorem, where the Orlicz norm
on the left is replaced by E sups,t |Xs − Xt |, suffices. This weakened version
can be proved along exactly the same lines, provided a similarly weakened
version of Lemma 2.2.2 is available. The latter is contained in Problem 2.2.8
for any Orlicz norm.
The method of chaining to prove continuity or obtain bounds on
suprema of stochastic processes is due to “Kolmogorov and his school”,
and goes back to the 1930s and 40s. The combination of chaining with “en-
tropy” in the study of the continuity of Gaussian processes goes back to
Dudley (1967b) and Sudakov (1969). Dudley (1967b) credits V. Strassen
with the idea and basic results, but see the review of Sudakov (1976) by
Dudley (1978b). Strassen and Dudley (1969) apparently made the first use
of entropy in connection with empirical processes. Maximal inequalities for-
mulated as entropy bounds for general Young functions were obtained in
Kono (1980) and Pisier (1983).
Upper bounds for suprema of Gaussian processes in terms of majoriz-
ing measures were obtained by Fernique in Fernique (1974), and shown
to be sharp in Talagrand (1987c). After giving simplified proofs in Tala-
See Pollard (1989a). The present organization and notation of the chaining
argument are borrowed from Arcones and Giné (1993), who chain by ex-
ponential bounds for probabilities. The use of means in combination with
Lemma 2.2.10 in the proof is new.
2.8. Theorem 2.8.1 is due to Dudley, Giné, and Zinn (1991), who give a
2.10. Theorem 2.10.1 is due to Van der Vaart and Wellner (2000).
Alexander (1987c) gives Theorem 2.10.3 and derives that F +G and F ∪
G are Donsker if F and G are Donsker from his more general Proposition 2.6.
Theorem 2.10.4 and Theorem 2.10.5 are due to Dudley (1985).
Versions of Theorem 2.10.8 are established by Giné and Zinn (1986a)
(one-dimensional φ under measurability conditions) and Talagrand (1987a)
(uniformly bounded classes). Lemma 2.10.16 is established by Giné and
Zinn (1986a); the proof given here is new. Section 2.10.5 is based on Van
der Vaart (1993) and Van der Vaart and Wellner (2000). The result of Ex-
ample 2.10.28 specialized to the class C_1^1(R) was first obtained by Giné
and Zinn (1986b) by a different method. Pollard (1990), Section 5, gives a
variety of results similar to those in Section 2.10.4.
2.11. The limit theory in this section has its antecedents in Koul (1970),
Shorack (1973, 1979), and Van Zuijlen (1978). Also see Shorack and Beir-
lant (1986), Marcus and Zinn (1984), and Shorack and Wellner (1986),
Section 3.3, pages 108–109.
The first subsection, based on random and uniform-entropy, follows
Alexander (1987b), but with a simplified proof for the main Theorem 2.11.1.
See Alexander’s paper for a discussion of the minimal moment conditions
on the envelope within the context of this theorem. The improvement over
Theorem 2.5.2 in the i.i.d. case was already obtained by Giné and Zinn
(1984). Pollard (1990) develops a combinatorial theory for the non-i.i.d.
case paralleling the VC-theory of Chapter 2.6.
The bracketing central limit theorem, Theorem 2.11.11, and its corol-
lary are due to Andersen, Giné, Ossiander, and Zinn (1988), building on
Ossiander (1987) and work on majorizing measures and Gaussian processes
due to Fernique (1974) and Talagrand (1987c). Theorem 2.11.9 does not
seem to be a corollary of their result, but appears to be the natural gen-
eralization of Ossiander’s result to the case of non-i.i.d. observations. The
proof of Theorem 2.11.11 shows that the theorem remains true if the sin-
gle majorizing measure, which is guaranteed to exist by the assumption of
Gaussian domination, is replaced by certain sequences of majorizing mea-
sures (depending on n). The proof given here is greatly simplified in com-
parison to the original proof.
Jain and Marcus proved their theorem, as given here, in Jain and
Marcus (1975). Example 2.11.14 is taken from Giné and Zinn (1986a). Dud-
ley (1985), Section 6, gives the first satisfactory integration of the famous
“Chibisov-O’Reilly theorem” treated in Example 2.11.15 with modern em-
pirical process theory.
2.13. Theorem 2.13.1 is taken from Dudley (1984), who also shows that,
in a certain sense, the condition that the sequence converges is sharp. Giné
and Zinn (1986a) give further results for sequences of functions bounded in
uniform norm.
Dudley (1987) notes that the square of the supremum of the empirical
process over an elliptical class F equals a weighted sum of the squares of
the empirical process acting on the basis functions defining the elliptical
class, and he points out the practical importance of such classes.
The equivalence of (iii) and (iv) in Theorem 2.13.6 is due to Giné and
Zinn (1984); the equivalence of (i), (ii), and (iii) was proved by Talagrand
(1988).
this book. Theorems 2.14.2, 2.14.7, and 2.14.8 were obtained more recently
in Van der Vaart and Wellner (2011).
Dvoretzky, Kiefer, and Wolfowitz (1956) proved the first exponential
bound for the supremum distance between the empirical distribution func-
tion and the true distribution function in the case of X = R. Massart (1990)
found the best constant D multiplying the exponential for this classical case.
Related bounds for the more difficult case X = Rk were established by
Kiefer (1961). Vapnik and Červonenkis (1971) proved exponential bounds
of the same form as those given in Theorem 2.14.29, but with no powers of t
in front and the constant D depending on (and increasing with) n. Alexan-
der (1984) gives bounds of the same type as those in Theorems 2.14.25,
2.14.29, and 2.14.33 and also has sharper bounds than the bound given in
2.14.33.
Theorems 2.14.25, 2.14.29, and 2.14.30 are due to Talagrand (1994),
while Theorems 2.14.26, 2.14.33, and 2.14.34 are from Massart (1986). For
still more bounds, see Smith and Dudley (1992) and Adler and Brown
(1986).