PART 2

Empirical Processes
2.1 Introduction

This part is concerned with convergence of a particular type of random map: the empirical process. The empirical measure P_n of a sample of random elements X_1, …, X_n in a measurable space (X, A) is the discrete random measure given by P_n(C) = n^{−1} #(1 ≤ i ≤ n : X_i ∈ C). Alternatively (if points are measurable), it can be described as the random measure that puts mass 1/n at each observation. We shall frequently write the empirical measure as the linear combination P_n = n^{−1} ∑_{i=1}^n δ_{X_i} of the Dirac measures at the observations.
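As a concrete illustration, the two descriptions of P_n coincide: averaging an indicator over the sample is the same as putting mass 1/n at each observation. A minimal pure-Python sketch (the helper name `empirical_measure` is ours, not the book's):

```python
import random

def empirical_measure(sample):
    """P_n(C) = n^{-1} #(1 <= i <= n : X_i in C); equivalently the measure
    putting mass 1/n at each observation (n^{-1} times a sum of Dirac masses)."""
    n = len(sample)
    def Pn(indicator):            # a set C is passed as its indicator function
        return sum(indicator(x) for x in sample) / n
    return Pn

random.seed(0)
Pn = empirical_measure([random.random() for _ in range(1000)])  # X_i ~ Uniform(0, 1)
mass = Pn(lambda x: x <= 0.5)   # estimates P((-inf, 0.5]) = 0.5
```

The total mass is exactly 1, and P_n(C) fluctuates around P(C) by order n^{−1/2}.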
Given a collection F of measurable functions f: X → R, the empirical measure induces a map from F to R given by

    f ↦ P_n f.

Here we use the abbreviation Qf = ∫ f dQ for a given measurable function f and signed measure Q. Let P be the common distribution of the X_i. The centered and scaled version of the given map is the F-indexed empirical process G_n given by

    f ↦ G_n f = √n (P_n − P) f = n^{−1/2} ∑_{i=1}^n (f(X_i) − P f).

Frequently the signed measure G_n = n^{−1/2} ∑_{i=1}^n (δ_{X_i} − P) will be identified with the empirical process.

For a given function f, it follows from the law of large numbers and the central limit theorem that

    P_n f → P f   almost surely,
    G_n f ⇝ N(0, P(f − P f)²),

provided P f exists and P f² < ∞, respectively. This part is concerned with making these two statements uniform in f varying over a class F.
Classical empirical process theory concerns the special cases when the sample space X is the unit interval [0, 1], the real line R, or R^d, and the indexing collection F is taken to be the set of indicators of left half-lines in R or lower-left orthants in R^d. In this part we are concerned with empirical measures and processes indexed by these classes of functions F, but also many others, including indicators of half-spaces, balls, ellipsoids, and other sets in R^d, classes of smooth functions from R^d to R, and monotone functions.
With the notation ‖Q‖_F = sup{|Qf| : f ∈ F}, the uniform version of the law of large numbers becomes

    ‖P_n − P‖_F → 0,

where the convergence is in outer probability or outer almost surely. A class F for which this is true is called a Glivenko-Cantelli class, or also P-Glivenko-Cantelli class to bring out the dependence on the underlying measure P.
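For the classical case of left half-lines on the line, ‖P_n − P‖_F is the Kolmogorov-Smirnov distance between the empirical and true distribution functions, and its decay can be watched directly. A minimal simulation sketch (pure Python; the helper name `ks_distance` is ours):

```python
import random

def ks_distance(sample):
    """||P_n - P||_F for F = indicators of left half-lines and P = Uniform(0, 1),
    i.e. the Kolmogorov-Smirnov statistic sup_t |F_n(t) - t|."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

random.seed(1)
d_small = ks_distance([random.random() for _ in range(100)])
d_big = ks_distance([random.random() for _ in range(100_000)])  # shrinks like 1/sqrt(n)
```

With a thousandfold increase in sample size the supremum distance drops by roughly a factor √1000.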
In order to discuss a uniform version of the central limit theorem, it is assumed that

    sup_{f ∈ F} |f(x) − P f| < ∞,   for every x.

Under this condition the empirical process {G_n f : f ∈ F} can be viewed as a map into ℓ^∞(F). Consequently, it makes sense to investigate conditions under which

(2.1.1)    G_n = √n (P_n − P) ⇝ G,   in ℓ^∞(F),

where the limit G is a tight Borel measurable element in ℓ^∞(F). A class F for which this is true is called a Donsker class, or P-Donsker class to be more complete.
The nature of the limit process G follows from consideration of its marginal distributions. The marginals G_n f converge if and only if the functions f are square-integrable. In that case the multivariate central limit theorem yields that for any finite set f_1, …, f_k of functions

    (G_n f_1, …, G_n f_k) ⇝ N_k(0, Σ),

where the k × k matrix Σ has (i, j)th element P(f_i − P f_i)(f_j − P f_j). Since convergence in ℓ^∞(F) implies marginal convergence, it follows that the limit process {Gf : f ∈ F} must be a zero-mean Gaussian process with covariance function

(2.1.2)    E Gf_1 Gf_2 = P(f_1 − P f_1)(f_2 − P f_2) = P f_1 f_2 − P f_1 P f_2.

According to Lemma 1.5.3, this and tightness completely determine the distribution of G in ℓ^∞(F). It is called the P-Brownian bridge.
An application of Slutsky’s lemma shows that every Donsker class is
a Glivenko-Cantelli class in probability. In fact, this is also true with “in
probability” replaced by “almost surely.” Conversely, not every Glivenko-
Cantelli class is Donsker, but the collection of all Donsker classes contains
many members of interest. Many classes are also universally Donsker, which
is defined as P -Donsker for every probability measure P on the sample
space.

2.1.3 Example (Empirical distribution function). Let X_1, …, X_n be i.i.d. random elements in R^d, and let F be the collection of all indicator functions of lower rectangles {1{(−∞, t]} : t ∈ R̄^d}. The empirical measure indexed by F can be identified with the empirical distribution function

    t ↦ P_n 1{(−∞, t]} = n^{−1} ∑_{i=1}^n 1{X_i ≤ t}.

In this case, it is natural to identify f = 1{(−∞, t]} with t ∈ R̄^d and the space ℓ^∞(F) with the space ℓ^∞(R̄^d). Of course, the sample paths of the empirical distribution function are all contained in a much smaller space (such as D[−∞, ∞]), but as long as this space is equipped with the supremum metric, this is irrelevant for the central limit theorem.
It is known from classical results (for the line by Donsker) that the class of lower rectangles is Donsker for any underlying law P of X_1, …, X_n. These classical results are a simple consequence of the results of this part. Moreover, with no additional effort, analogous results are obtained for the empirical process indexed by the collection of closed balls, rectangles, half-spaces, and ellipsoids, as well as many collections of functions.

2.1.4 Example (Empirical process indexed by sets). Let C be a collection of measurable sets in the sample space (X, A), and take the class of functions F equal to the set of indicator functions of sets in C. This leads to the empirical distribution indexed by sets

    C ↦ P_n(C) = n^{−1} #(X_i ∈ C).

In this case, it is convenient to make the identification C ↔ 1{C}, both in notation and in terminology. Hence C is called a Glivenko-Cantelli class if ‖P_n − P‖_C converges to zero in outer probability or outer almost surely, and C is a Donsker class if √n (P_n − P) converges weakly to a tight limit in ℓ^∞(C).
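A set-indexed empirical measure is easy to evaluate for any single set C. A small pure-Python sketch for the closed unit ball in R², sampling from the uniform law on the square [−1, 1]² (helper name `Pn_of_set` is ours):

```python
import math
import random

def Pn_of_set(sample, C):
    """Empirical measure indexed by sets: C -> n^{-1} #(X_i in C)."""
    return sum(C(x) for x in sample) / len(sample)

random.seed(2)
pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(4000)]
disc = lambda p: p[0] ** 2 + p[1] ** 2 <= 1.0   # the closed unit ball in R^2
est = Pn_of_set(pts, disc)                      # P(C) = pi/4 under Uniform([-1, 1]^2)
```

The Glivenko-Cantelli and Donsker questions ask whether such estimates are accurate uniformly over all sets C in the collection at once, not just for one fixed C.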

An unfortunate technicality of the preceding definitions of Glivenko-Cantelli and Donsker classes is that they depend on the underlying probability space on which X_1, X_2, … are defined, because this determines the
outer expectations. In this book, unless specified otherwise, this technical-
ity is resolved by always assuming that X1 , X2 , . . . are defined canonically.
Thus the underlying probability space is the product space (X ∞ , A∞ , P ∞ ),
and Xi is the projection onto the ith coordinate.†

2.1.1 Overview of Chapters 2.2–2.15


Whether a given class F is a Glivenko-Cantelli or Donsker class depends on
the size of the class. A finite class of square integrable functions is always
Donsker, while at the other extreme the class of all square integrable, uni-
formly bounded functions is almost never Donsker. A relatively simple way
to measure the size of a class F is to use entropy numbers. The ε-entropy of
F is essentially the logarithm of the number of “balls” or “brackets” of size
ε needed to cover F. From this informal definition, it is already clear that
the entropy numbers increase as ε decreases to zero. Simple sufficient con-
ditions for a class to be Glivenko-Cantelli or Donsker can be given in terms
of the rate of increase as ε tends to zero. The main results on Glivenko-
Cantelli classes are given in Chapter 2.4, and the main results on Donsker
classes are given in Chapter 2.5. These chapters are the core of Part 2.
For an informal description of the main results, we first define entropy
with and without bracketing. Let (F, ‖·‖) be a subset of a normed space of real functions f: X → R on some set. We are mostly thinking of L_r(Q)-spaces for probability measures Q.

2.1.5 Definition (Covering numbers). The covering number N(ε, F, ‖·‖) is the minimal number of balls {g : ‖g − f‖ < ε} of radius ε needed to cover the set F. The centers of the balls need not belong to F, but they should have finite norms. The entropy (without bracketing) is the logarithm of the covering number.

2.1.6 Definition (Bracketing numbers). Given two functions l and u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. An ε-bracket is a bracket [l, u] with ‖u − l‖ < ε. The bracketing number N_{[ ]}(ε, F, ‖·‖) is the minimum number of ε-brackets needed to cover F. The entropy with bracketing is the logarithm of the bracketing number. In the definition of the bracketing number, the upper and lower bounds u and l of the brackets need not belong to F themselves but are assumed to have finite norms.

The norms that will be of interest later, such as the L_r(Q)-norms, all possess the (Riesz) property: for a pair of functions, if |f| ≤ |g|, then ‖f‖ ≤ ‖g‖. For such norms, if f is in the 2ε-bracket [l, u], then it is in the ball of radius ε around (l + u)/2. Thus covering and bracketing numbers are related by

    N(ε, F, ‖·‖) ≤ N_{[ ]}(2ε, F, ‖·‖).

† Since the empirical measure depends only on the first n coordinates, and the projection of X^∞ onto X^n is a perfect map, any outer expectation E*h(P_n) may be evaluated equivalently assuming that X_1, …, X_n are defined on X^n or on X^∞.
In general, there is no converse inequality, so that bracketing numbers may
be bigger than covering numbers. (A notable exception is the uniform norm,
for which the previous inequality is an identity.) However, sufficient condi-
tions for the theorems of interest in terms of bracketing numbers can often
be stated using only a single norm on F, while sufficient conditions in terms
of entropy numbers usually involve many norms. This is intuitively obvious
since brackets control a function f (x) pointwise in its argument x, rather
than in norm. As a consequence, sufficient conditions in terms of bracketing
numbers or covering numbers are unrelated in general. Both are of interest.
We shall write N(ε, F, L_r(Q)) for covering numbers relative to the L_r(Q)-norm

    ‖f‖_{Q,r} = (∫ |f|^r dQ)^{1/r},

and similarly for bracketing numbers. An envelope function of a class F is any function x ↦ F(x) such that |f(x)| ≤ F(x), for every x and f. The minimal envelope function is x ↦ sup_f |f(x)|. It will usually be assumed that this function is finite for every x. The uniform entropy numbers (relative to L_r) are defined as

    sup_Q log N(ε ‖F‖_{Q,r}, F, L_r(Q)),

where the supremum is over all probability measures Q on (X, A) with 0 < QF^r < ∞.‡
In Chapter 2.4, two Glivenko-Cantelli theorems are given, based on entropy with and without bracketing, respectively. The first theorem asserts that F is P-Glivenko-Cantelli if

    N_{[ ]}(ε, F, L_1(P)) < ∞,   for every ε > 0.

The proof of this theorem is elementary and extends the usual proof of the classical Glivenko-Cantelli theorem for the empirical distribution function. The second theorem is more complicated but has a simple corollary involving the uniform covering numbers. A class F is P-Glivenko-Cantelli if it is P-measurable with envelope F such that P*F < ∞ and satisfies

    sup_Q N(ε ‖F‖_{Q,1}, F, L_1(Q)) < ∞,   for every ε > 0.

The full statement of the second theorem yields necessary and sufficient conditions for a P-measurable class F to be P-Glivenko-Cantelli in terms of the random covering numbers N(ε, F_M, L_1(P_n)), where F_M = {f 1{F ≤ M} : f ∈ F}. This characterization of the Glivenko-Cantelli theorem will be used in Chapter 2.10 to establish useful Glivenko-Cantelli preservation theorems. Chapter 2.4 concludes with the statement of a result characterizing universal Glivenko-Cantelli classes. As mentioned already, the second Glivenko-Cantelli theorem requires the assumption that F is a "P-measurable class," a concept that is defined in Definition 2.3.3. The presence of this measurability hypothesis is a consequence of the way uniform entropy is used to control suprema via "randomization by Rademacher random variables" (or "symmetrization"), together with the lack of a general version of Fubini's theorem for outer integrals. This same type of additional measurability hypothesis also enters in the corresponding Donsker theorems with uniform entropy hypotheses. Without these hypotheses, the theorems are false, but on the other hand it requires some effort to construct a class for which the hypotheses fail. Thus, in applications, this type of measurability will hardly ever play a role.

‡ Alternatively, the supremum could be taken over all discrete probability measures. This yields the same results.
The method of symmetrization is discussed in Chapter 2.3 together
with measurability conditions. This section is a necessary preparation for
the uniform entropy Glivenko-Cantelli and Donsker theorems in Chap-
ters 2.4 and 2.5. The theorems that are proved with bracketing do not
use symmetrization and do not need measurability hypotheses.
One of the two Donsker theorems proved in Chapter 2.5 uses bracketing entropy. A slightly simpler version of this theorem asserts that a class F of functions is P-Donsker under an integrability condition on the L_2(P)-entropy with bracketing. A class F of measurable functions is P-Donsker if

    ∫_0^∞ √(log N_{[ ]}(ε, F, L_2(P))) dε < ∞.
The other Donsker theorem in this chapter asserts that a class F of functions is P-Donsker under a similar condition on uniform entropy numbers. If F is a class of functions such that

(2.1.7)    ∫_0^∞ sup_Q √(log N(ε ‖F‖_{Q,2}, F, L_2(Q))) dε < ∞,

then F is P-Donsker for every probability measure P such that P*F² < ∞ for some envelope function F and for which certain measurability hypotheses are satisfied. The integral conditions both measure the rate at which the entropy grows as ε decreases to zero. Note that the entropies are zero for large ε (if F is bounded), so that convergence of the integrals at ∞ is automatic.
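The role of the growth rate can be made concrete numerically: polynomial (VC-type) covering numbers with log N(ε) of order log(1/ε) always give a finite entropy integral, entropy of order ε^{−p} gives a finite integral exactly when p < 2, and faster growth makes the integral blow up at zero. A quadrature sketch (pure Python; the helper name `entropy_integral` and the cutoff 10^{−6} are ours, for illustration only):

```python
import math

def entropy_integral(log_N, lo=1e-6, hi=1.0, steps=200_000):
    """Midpoint-rule approximation of int_lo^hi sqrt(log N(eps)) d(eps)."""
    h = (hi - lo) / steps
    return sum(math.sqrt(max(log_N(lo + (k + 0.5) * h), 0.0)) * h
               for k in range(steps))

poly = entropy_integral(lambda e: 5.0 * math.log(1.0 / e))  # VC-type: finite
ok = entropy_integral(lambda e: e ** -1.0)    # log N ~ 1/eps (p = 1 < 2): finite
bad = entropy_integral(lambda e: e ** -3.0)   # p = 3 > 2: diverges as lo -> 0
```

Lowering the cutoff `lo` leaves `poly` and `ok` essentially unchanged while `bad` grows without bound.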
Once the main Glivenko-Cantelli theorems and Donsker theorems are
in hand, the following question arises: how can we verify the hypotheses on
uniform entropy numbers and bracketing numbers? Tools and techniques
for this purpose are developed in Chapter 2.6 for uniform entropy numbers
and in Chapter 2.7 for bracketing numbers.

One of the starting points for controlling uniform covering numbers is the notion of a Vapnik-Červonenkis class of sets, or simply VC-class. Say that a collection C of subsets of the sample space X picks out a certain subset of the finite set {x_1, …, x_n} ⊂ X if it can be written as {x_1, …, x_n} ∩ C for some C ∈ C. The collection C is said to shatter {x_1, …, x_n} if C picks out each of its 2^n subsets. The VC-dimension V(C) of C is the largest n for which some set of size n is shattered by C. A collection C of measurable sets is called a VC-class if its dimension V(C) is finite. Thus, by definition, a VC-class of sets picks out strictly fewer than 2^n subsets from any set of n > V(C) elements. The surprising fact is that such a class can necessarily pick out only a polynomial number O(n^{V(C)}) of subsets, well below the 2^n − 1 that the definition appears to allow. This result is a consequence of a combinatorial result, known as Sauer's lemma, which states that the number Δ_n(C, x_1, …, x_n) of subsets picked out by a VC-class C satisfies

    max_{x_1,…,x_n} Δ_n(C, x_1, …, x_n) ≤ ∑_{j=0}^{V(C)} \binom{n}{j} ≤ (ne/V(C))^{V(C)}.
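Sauer's lemma can be checked directly in small cases. A pure-Python sketch for the class of left half-lines (−∞, t] on the line, which has VC-dimension 1 in the sense above (every single point is shattered, but no two-point set is); the helper names `picked_out` and `shatters` and the particular points and thresholds are ours:

```python
from math import comb

def picked_out(points, sets):
    """The subsets of `points` picked out by the collection `sets`
    (each set represented by its indicator function)."""
    return {frozenset(p for p in points if C(p)) for C in sets}

def shatters(points, sets):
    return len(picked_out(points, sets)) == 2 ** len(points)

pts = [0.1, 0.4, 0.5, 0.7, 0.9]
half_lines = [lambda x, t=t: x <= t for t in (-1.0, 0.2, 0.45, 0.6, 0.8, 1.0)]

delta_n = len(picked_out(pts, half_lines))  # the n + 1 "prefixes": Delta_n = 6
V = 1                                       # VC-dimension of half-lines
sauer_bound = sum(comb(len(pts), j) for j in range(V + 1))   # 1 + n = 6
```

Half-lines pick out exactly the n + 1 prefixes of the ordered sample, which meets Sauer's bound with equality here, far below the 2^n − 1 a priori possibilities.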

Some thought shows that the number of subsets picked out by a collection C
is closely related to the covering numbers of the class of indicator functions
{1C : C ∈ C} in L1 (Q) for discrete, empirical type measures Q. By a clever
argument, Sauer’s lemma can be used to bound the uniform covering (or
entropy) numbers for this class. Theorem 2.6.4 asserts that there exists a universal constant K such that, for any VC-class C of sets,

    N(ε, C, L_r(Q)) ≤ K V(C) (4e)^{V(C)} (1/ε)^{rV(C)},

for any probability measure Q, any r ≥ 1, and any 0 < ε < 1. Consequently, VC-classes are examples of polynomial classes in the sense that their covering
numbers are bounded by a polynomial in 1/ε. The upper bound shows
that VC-classes satisfy the sufficient conditions for the Glivenko-Cantelli
theorem and Donsker theorem discussed previously (with much to spare)
provided they possess certain measurability properties.
It is possible to obtain similar results for classes of functions. A collec-
tion of real-valued, measurable functions F on a sample space X is called a
VC-subgraph class (or simply a VC-class of functions) if the collection of all
subgraphs of the functions in F forms a VC-class of sets in X ×R. Just as for
sets, the covering numbers of a VC-subgraph class F grow polynomially in
1/ε, so that suitably measurable VC-subgraph classes are Glivenko-Cantelli
and Donsker under moment conditions on the envelope function.
Among the many examples of VC-classes are the collections of left half-
lines and lower-left orthants, with which classical empirical process theory is
concerned. Methods to construct a great variety of VC-classes are discussed
in Section 2.6.5.

Section 2.6.6 introduces the related notion of fat-shattering dimension. This notion of "dimension" is defined at a fixed level ε and is always smaller than a natural related VC-dimension. In spite of this, it is possible to bound L_2(Q) covering numbers in terms of the fat-shattering dimension. Statements of the available bounds are provided in this section.
The application of the other main Glivenko-Cantelli and Donsker theorems requires estimates on the bracketing numbers of classes of functions. These are given in Chapter 2.7 for classes of smooth functions, sets with smooth boundaries, convex sets, monotone functions, functions smoothly depending on a parameter, natural subsets of Besov spaces, Gaussian mixtures, and classes of convex functions. Coupled with the results of Chapters 2.4 and 2.5, these estimates yield many more interesting Donsker classes.
Given a basic collection of Glivenko-Cantelli and Donsker classes and estimates of their entropies, new classes with similar properties can be constructed in several ways. For instance, the symmetric convex hull sconv F of a class F is the collection of all functions of the form ∑_{i=1}^m α_i f_i, with each f_i ∈ F and ∑_{i=1}^m |α_i| ≤ 1. In Chapter 2.10 it is shown that the convex hull of a Donsker class is Donsker. In addition to this, Theorem 2.6.9 gives a useful bound for the entropy of a convex hull in the case that the L_r(Q) covering numbers with r > 1 for F are polynomial: if

    N(ε ‖F‖_{Q,r}, F, L_r(Q)) ≤ C (1/ε)^V,

then the convex hull of F satisfies

    log N(ε ‖F‖_{Q,r}, sconv F, L_r(Q)) ≤ K (1/ε)^{rV/((r−1)V+r)},   if 1 < r < 2,
    log N(ε ‖F‖_{Q,r}, sconv F, L_r(Q)) ≤ K (1/ε)^{2V/(V+2)},        if 2 ≤ r < ∞.

An important feature of this bound is that the power 2V/(V + 2) of 1/ε in the second case is strictly less than 2 for any V, and in the first case the power rV/((r − 1)V + r) is strictly less than 2 if V < 2r/(2 − r), so that the convex hull of a polynomial class satisfies the uniform entropy condition (2.1.7). This bound is true in particular for VC-hull classes, which are defined as the sequential closures of the convex hulls of VC-classes.
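The arithmetic behind this remark is easy to verify directly. A small sketch (the function name `hull_exponent` is ours):

```python
def hull_exponent(V, r):
    """Power of 1/eps in the entropy bound for sconv F:
    rV/((r-1)V + r) for 1 < r < 2, and 2V/(V + 2) for 2 <= r < infinity."""
    if 1 < r < 2:
        return r * V / ((r - 1) * V + r)
    return 2 * V / (V + 2)

exps = [hull_exponent(V, 2) for V in (1, 5, 100, 10_000)]  # all strictly below 2
```

For r = 2 the exponent 2V/(V + 2) approaches, but never reaches, the critical value 2 as V grows; for r = 1.5 the cutoff 2r/(2 − r) equals 6, so V = 5 still satisfies the uniform entropy condition while V = 7 does not.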
Other operations preserving the Glivenko-Cantelli or Donsker properties are considered in Chapter 2.10. A short summary is that, given Glivenko-Cantelli (or Donsker) classes F_1, …, F_k and a fixed function φ on R^k, the new class

    H = {φ(f_1, …, f_k) : f_j ∈ F_j, j = 1, …, k}

is Glivenko-Cantelli (or Donsker) if φ is continuous (or Lipschitz), respectively, subject to having a suitably integrable envelope. For example, the classes of pairwise sums F + G, pairwise infima F ∧ G, and pairwise suprema F ∨ G, as well as the union F ∪ G, are Glivenko-Cantelli (or Donsker) classes if both F and G are Glivenko-Cantelli (or Donsker). Similarly, pairwise products FG and quotients F/G are Glivenko-Cantelli (or Donsker) if F and G are Glivenko-Cantelli (or Donsker) and bounded above, or bounded away from, zero.
away from, zero. The latter boundedness assumptions can be traded against
stronger conditions on the entropies of the classes. Such “permanence prop-
erties” applied to the basic Glivenko-Cantelli or Donsker classes resulting
from Chapters 2.6 and 2.7 provide a very effective method to verify the
Glivenko-Cantelli (or Donsker) property in, for instance, statistical appli-
cations.
The remainder of Part 2 explores refinements, extensions, special
classes F, uniformity in the underlying distribution, multiplier processes,
partial-sum processes, the non-i.i.d. situation, and moment and tail bounds
for the supremum of the empirical process.
The uniformity of weak convergence of Gn to G with respect to the
underlying probability distribution P generating the data is addressed in
Chapter 2.8. Again, the main results use either uniform entropy or brack-
eting entropy.
The viewpoint in Chapter 2.9 on multiplier central limit theorems is that, for a Donsker class F,

    G_n = n^{−1/2} ∑_{i=1}^n (δ_{X_i} − P) ≡ n^{−1/2} ∑_{i=1}^n Z_i ⇝ G,   in ℓ^∞(F).

Then we pose the following question: for what sequences of i.i.d., real-valued random variables ξ_1, …, ξ_n independent of the original data does it follow that

    n^{−1/2} ∑_{i=1}^n ξ_i Z_i ⇝ G?

In fact, these two displays turn out to be equivalent if the ξ_i have zero mean, have variance 1, and satisfy the L_{2,1}-condition

    ‖ξ_1‖_{2,1} = ∫_0^∞ √(P(|ξ_1| > t)) dt < ∞.
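The L_{2,1}-norm is easy to evaluate numerically. For Rademacher multipliers (ξ = ±1 with probability 1/2 each, hence mean zero and variance 1) the norm equals 1 exactly, and for standard normal multipliers it is finite as well, so both choices qualify. A quadrature sketch (pure Python; the helper name `l21_norm` and the truncation point are ours):

```python
import math

def l21_norm(tail, hi=50.0, steps=200_000):
    """||xi||_{2,1} = int_0^inf sqrt(P(|xi| > t)) dt, truncated at `hi`;
    `tail(t)` must return P(|xi| > t)."""
    h = hi / steps
    return sum(math.sqrt(tail((k + 0.5) * h)) * h for k in range(steps))

rademacher = l21_norm(lambda t: 1.0 if t < 1.0 else 0.0)        # exactly 1
gaussian = l21_norm(lambda t: math.erfc(t / math.sqrt(2.0)))    # finite, about 1.3
```

The condition fails only for quite heavy-tailed multipliers; square-integrability alone does not imply it, which is why the L_{2,1}-condition appears explicitly.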

Chapter 2.9 also presents conditional versions of this multiplier central limit theorem. These are basic for Part 3's development of limit theorems for the bootstrap empirical process.
Chapter 2.11 gives extensions of the Donsker theorem for sums of independent, but not identically distributed, processes ∑_{i=1}^n Z_{ni} indexed by an arbitrary collection F. One example, which occurs naturally in statistical contexts and is covered by this situation, is the empirical process indexed by a collection F of functions based on observations X_{n1}, …, X_{nn} that are independent but not identically distributed. Once again we present two main central limit theorems, based on entropy with and without bracketing. Even if specialized to the i.i.d. situation, the theorems in Chapter 2.11 are more general than the theorems obtained before. For instance, we use
a random entropy condition, rather than a uniform entropy condition, and also consider bracketing using majorizing measures.
Two types of partial-sum processes are studied in Chapter 2.12. The first type is the sequential empirical process

    Z_n(s, f) = n^{−1/2} ∑_{i=1}^{⌊ns⌋} (f(X_i) − P f) = √(⌊ns⌋/n) G_{⌊ns⌋}(f),

which records the "history" of the empirical process G_n as sampling progresses. The second type of partial-sum processes studied are those generated by independent random variables located at the points of a lattice. These processes can be viewed as closely related to the multiplier processes studied in Chapter 2.9, since they are empirical processes with random masses (the multipliers) at (degenerately) random points; nevertheless they have some special features and deserve special attention.
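The identity in the display above is exact for every realization, not just in distribution, which makes it easy to verify. A pure-Python sketch (helper names `G` and `Z` are ours; the choice f(x) = x with P f = 1/2 under Uniform(0, 1) is for illustration):

```python
import math
import random

def G(sample, f, Pf):
    """Empirical process G_n f = n^{-1/2} sum_i (f(X_i) - Pf)."""
    return sum(f(x) - Pf for x in sample) / math.sqrt(len(sample))

def Z(sample, s, f, Pf):
    """Sequential empirical process Z_n(s, f), 0 <= s <= 1:
    n^{-1/2} sum over the first floor(ns) observations of f(X_i) - Pf."""
    n = len(sample)
    k = math.floor(n * s)
    return sum(f(x) - Pf for x in sample[:k]) / math.sqrt(n)

random.seed(3)
xs = [random.random() for _ in range(1000)]   # Uniform(0, 1), so P f = 1/2 for f(x) = x
f, Pf = (lambda x: x), 0.5
z_half = Z(xs, 0.5, f, Pf)   # the process at "time" s = 1/2
```

At s = 1 the sequential process reduces to the ordinary empirical process G_n, and at interior s it equals √(⌊ns⌋/n) times the empirical process of the first ⌊ns⌋ observations.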
Chapter 2.13 gives Donsker theorems for several special types of classes
F, namely, sequences and elliptical classes.
Chapter 2.14 gives a detailed study of moment bounds, tail bounds,
and exponential bounds for suprema of the empirical process. The moment
bounds play an important role in Chapters 3.2 and 3.4 on the limit theory
of M-estimators. The proofs of the exponential bounds for

    P*(‖G_n‖_F > t)

in Chapter 2.14 rely on concentration inequalities for

    P*(‖G_n‖_F − E*‖G_n‖_F > t)   and   P*(−(‖G_n‖_F − E*‖G_n‖_F) > t)

established in Chapter 2.15. As it turns out, bounds for these latter probabilities depend much less heavily on the structure of the class F than do bounds for the mean E*‖G_n‖_F itself. Several of the concentration inequalities in Chapter 2.15 are sharp in that they reduce to classical bounds in the case when the class F consists of just a single function. Chapter 2.15 concludes Part 2.

2.1.2 Asymptotic Equicontinuity


A class of measurable functions is called pre-Gaussian if the (tight) limit process G in the uniform central limit theorem (2.1.1) exists. By Kolmogorov's extension theorem, there always exists a zero-mean Gaussian process {Gf : f ∈ F} with covariance function given by (2.1.2). A more precise description of pre-Gaussianity is that there exists a version of this Gaussian process that is a tight, Borel measurable map from some probability space into ℓ^∞(F). Usually the name "Brownian bridge" introduced earlier refers to this tight process with its special sample path properties, rather than to the general stochastic process G.
It is desirable to have a more concrete description of the tightness property of a Brownian bridge and hence of the notion of pre-Gaussianity. In Chapter 1.5 it was seen that tightness of a random map into ℓ^∞(F) is closely connected to continuity of its sample paths. Define a seminorm ρ_P by

    ρ_P(f) = (P(f − P f)²)^{1/2}.

Then, by Example 1.5.10, a class F is pre-Gaussian if and only if
• F is totally bounded for ρ_P, and
• there exists a version of G with uniformly ρ_P-continuous sample paths f ↦ G(f).
Actually, this example shows that instead of the centered L_2-norm ρ_P, any centered L_r-norm can be used as well. However, the L_2-norm appears to be the most convenient.§
While pre-Gaussianity of a class F is necessary for the uniform central limit theorem, it is not sufficient. A Donsker class F satisfies the stronger condition that the sequence G_n is asymptotically tight. By Theorem 1.5.7, the latter entails replacing the condition that the sample paths of the limit process are continuous by the condition that the empirical process is asymptotically continuous: for every ε > 0,

(2.1.8)    lim_{δ↓0} limsup_{n→∞} P*( sup_{ρ_P(f−g)<δ} |G_n(f − g)| > ε ) = 0.

Much of Part 2 is concerned with this equicontinuity condition. With the notation¶ F_δ = {f − g : f, g ∈ F, ρ_P(f − g) < δ}, it is equivalent to the following statement: for every decreasing sequence δ_n ↓ 0,

    ‖G_n‖_{F_{δ_n}} → 0   in outer probability

(Problem 2.1.5). In view of Theorem 1.5.7, a class F is Donsker if and only if F is totally bounded in L_2(P) and satisfies the asymptotic equicontinuity condition as in the preceding displays.

2.1.3 Maximal Inequalities


Both the uniform law of large numbers and the asymptotic equicontinuity condition for the central limit theorem are thus concerned with showing that a supremum of real-valued variables converges to zero. To a certain extent, the derivation of these results is done in parallel, as both need to make use of maximal inequalities that bound probabilities involving suprema of random variables. Chapter 2.2 develops

§ It may be noted that the seminorm ρ_P is the L_2(P)-norm of f centered at its expectation (in fact, the standard deviation of f(X_i)). The centering is natural in view of the fact that the empirical process is invariant under the addition of constants to F: G_n f = G_n(f − c) for every c. Under the condition that sup_{f∈F} |P f| < ∞, the seminorm ρ_P can be replaced by the slightly simpler L_2(P)-seminorm without loss of generality (Problem 2.1.2).
¶ In most of this part, the notation F_δ means F_δ = {f − g : f, g ∈ F, ρ(f − g) < δ}, for ρ equal to either the L_2(P)-semimetric or ρ_P.
such inequalities in an abstract setting, using Orlicz norms. If ψ is a nondecreasing, convex function with ψ(0) = 0, then the Orlicz norm ‖X‖_ψ of a random variable X is defined by

    ‖X‖_ψ = inf{ C > 0 : E ψ(|X|/C) ≤ 1 }.
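The infimum in this definition can be computed by bisection for a variable with finite support, which gives a quick sanity check: for ψ(x) = x² the Orlicz norm reduces to the ordinary L₂-norm. A pure-Python sketch (the helper name `orlicz_norm` is ours):

```python
import math

def orlicz_norm(values, probs, psi, tol=1e-9):
    """||X||_psi = inf{C > 0 : E psi(|X|/C) <= 1} for a finite-support X,
    with psi nondecreasing and convex, psi(0) = 0; found by bisection."""
    def ok(C):
        return sum(p * psi(abs(v) / C) for v, p in zip(values, probs)) <= 1.0
    lo, hi = tol, 1.0
    while not ok(hi):            # grow the bracket until E psi(|X|/hi) <= 1
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (lo, mid) if ok(mid) else (mid, hi)
    return hi

l2 = orlicz_norm([1.0, 3.0], [0.5, 0.5], lambda x: x * x)  # sqrt(5) for psi(x) = x^2
psi2 = lambda x: math.exp(x * x) - 1.0                     # the sub-Gaussian function
c = orlicz_norm([1.0], [1.0], psi2)                        # 1/sqrt(log 2) for X = 1
```

For the constant variable X = 1 and ψ₂(x) = exp(x²) − 1, the condition E ψ₂(1/C) ≤ 1 solves to C = 1/√(log 2), which the bisection recovers.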
The functions ψ of primary interest are ψ(x) = x^p and ψ(x) = exp(x^p) − 1 (for p ≥ 1, or adapted versions if 0 < p < 1). For ψ(x) = x^p, the Orlicz norm ‖X‖_ψ is just the usual L_p-norm ‖X‖_p. The key Lemma 2.2.2 provides a bound for the Orlicz norm of a finite maximum max_{1≤i≤m} X_i in terms of the Orlicz norms of the variables X_i and the inverse function ψ^{−1}(m). Suppose that limsup_{x,y→∞} ψ(x)ψ(y)/ψ(xy) < ∞. Then there is a constant K depending only on ψ so that, for arbitrary random variables X_1, …, X_m,

    ‖ max_{1≤i≤m} X_i ‖_ψ ≤ K ψ^{−1}(m) max_{1≤i≤m} ‖X_i‖_ψ.

The interesting feature of this inequality is that a bound on max_i ‖X_i‖_ψ for a rapidly growing function ψ, such as ψ_2(x) = exp(x²) − 1, yields a bound on the rate of growth with m of the Orlicz norm of max X_i governed by the slowly growing function ψ^{−1}(m), such as ψ_2^{−1}(m) = √(log(m + 1)). Since Gaussian random variables and sums of independent, bounded random variables enjoy finite ψ_2-Orlicz norms, the particular function ψ_2^{−1}(m) figures prominently in the developments in later chapters.
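The √log m growth is visible in a small simulation: the expected maximum of m standard Gaussians grows roughly like ψ₂⁻¹(m) = √(log(m + 1)). A Monte Carlo sketch (pure Python; the helper name, sample sizes, and replication count are ours, for illustration):

```python
import math
import random

random.seed(4)

def mean_max_abs_gauss(m, reps=200):
    """Monte Carlo estimate of E max_{1 <= i <= m} |X_i| for i.i.d. N(0, 1)."""
    return sum(max(abs(random.gauss(0.0, 1.0)) for _ in range(m))
               for _ in range(reps)) / reps

m8, m4096 = mean_max_abs_gauss(8), mean_max_abs_gauss(4096)
ratio = m4096 / m8   # compare with sqrt(log 4097 / log 9), about 1.95
```

Multiplying m by 512 roughly doubles the expected maximum, in line with the √log m prediction, rather than multiplying it by anything like 512.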
Bounds on finite suprema can be extended to general maximal inequalities with the help of the chaining method, which Kolmogorov pioneered. The idea is to relate maxima over an infinite set to maxima of increments over an increasing sequence of finite sets. Here covering numbers are introduced to measure the size of the approximating finite sets. For a semimetric space (T, d), the covering number N(ε, T, d) is defined as the minimal number of balls of radius ε needed to cover T. Given a separable stochastic process {X_t : t ∈ T} and a fixed function ψ, define a semimetric by d(s, t) = ‖X_s − X_t‖_ψ for every s, t. Then, for any η, δ > 0 and a constant K depending only on ψ,

    ‖ sup_{d(s,t)≤δ} |X_s − X_t| ‖_ψ ≤ K [ ∫_0^η ψ^{−1}(N(ε, d)) dε + δ ψ^{−1}(N²(η, d)) ].

Thus the ψ-norm of a supremum is bounded by an integral that involves the inverse ψ^{−1} and the covering numbers of the index set T with respect to the intrinsic semimetric d.
In particular, a process {X_t : t ∈ T} is said to be sub-Gaussian if the ψ_2-norms of the increments X_s − X_t are finite. This can be shown to be true if and only if, for some semimetric d,

    P(|X_s − X_t| > x) ≤ 2 e^{−x²/(2d²(s,t))},   for every x > 0.
Then the preceding inequality can be simplified to

    E sup_{d(s,t)≤δ} |X_s − X_t| ≤ K ∫_0^δ √(log N(ε, T, d)) dε.

This inequality explains the appearance of the square root of the logarithm
of the entropy in the sufficient integral conditions for a class to be Donsker,
which were discussed in the preceding chapter.
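The sub-Gaussian tail bound above can be verified numerically for the prototypical case of Gaussian increments: if X_s − X_t ~ N(0, d²(s, t)), then P(|X_s − X_t| > x) = P(|Z| > x/d) with Z standard normal, and the two-sided tail is dominated by 2 exp(−x²/(2d²)). A pure-Python check (the sampled grid of x and d values is ours):

```python
import math

def gauss_tail(x):
    """P(|Z| > x) for Z ~ N(0, 1)."""
    return math.erfc(x / math.sqrt(2.0))

# If X_s - X_t ~ N(0, d^2), then P(|X_s - X_t| > x) = P(|Z| > x/d), and the
# sub-Gaussian bound 2 exp(-x^2 / (2 d^2)) holds for every x > 0.
checks = [gauss_tail(x / d) <= 2.0 * math.exp(-x * x / (2.0 * d * d))
          for d in (0.5, 1.0, 2.0) for x in (0.1, 0.5, 1.0, 2.0, 5.0)]
```

Gaussian processes are thus sub-Gaussian with respect to (a multiple of) their own standard-deviation semimetric, which is how the √log entropy integral enters the Donsker theorems.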
While maximal inequalities based on covering numbers suffice for almost all statistical applications, they are known not to be sharp in general. This suboptimality is especially well understood for Gaussian processes, for which the classical bounds based on covering numbers can fail, but bounds based on generic chaining and majorizing measures succeed in providing sharp results. Section 2.2.4 presents a brief discussion of these alternative methods and the resulting improvements.

* 2.1.4 Central Limit Theorem in Banach Spaces


The convergence in distribution in ℓ^∞(F) of the empirical process G_n = n^{−1/2} ∑_{i=1}^n (δ_{X_i} − P) can be considered a central limit theorem for the i.i.d. random elements δ_{X_1} − P, …, δ_{X_n} − P in the Banach space ℓ^∞(F). In this section it is first shown that, conversely, any central limit theorem in a Banach space can be stated in terms of empirical processes. Next it is argued that this observation is only moderately useful.
Suppose Z_1, …, Z_n are i.i.d. random maps with values in a Banach space X. Let F be a subset of the dual space of X such that sup_f |f(x)| = ‖x‖ for every x and such that f(Z_i) is measurable for each f. A mean of Z_i is an element EZ_i of X such that f(EZ_i) = Ef(Z_i), for every f ∈ F. Every given element x ∈ X can be identified with a map

    x** : f ↦ f(x),

from F to R. By assumption, this gives an isometric identification x ↔ x** of X with a subset of ℓ^∞(F). Thus random maps Z_1, …, Z_n in X satisfy the central limit theorem if and only if the random elements Z_1**, …, Z_n** satisfy the central limit theorem in ℓ^∞(F). Since ∑_{i=1}^n (Z_i − EZ_i)**(f) = ∑_{i=1}^n (f(Z_i) − Ef(Z_i)), this appears to be the case if and only if F is a Donsker class of functions. However, for an accurate statement, it is necessary to describe the measurability structure more precisely.

2.1.9 Example. In case of a separable Banach space it is usually assumed


that Z1 , . . . , Zn are Borel measurable maps. Then “i.i.d.” can be understood
in the usual manner. If F is taken equal to the unit ball of the dual space,
then the equality supf |f (x)| = kxk is valid by the Hahn-Banach theorem,

* This section may be skipped at first reading.



and the σ-field generated by F is precisely the Borel σ-field. It follows that
the Zi are Borel measurable if and only if every f (Zi ) is measurable.
    The conclusion is that the sequence n^{−1/2} Σ_{i=1}^n (Zi − EZi ) converges
in distribution if and only if the unit ball of the dual space is Donsker with
respect to the common Borel law of the Zi .

2.1.10 Example. Suppose Z1 , . . . , Zn are i.i.d. stochastic processes with


bounded sample paths, indexed by an arbitrary set F̃ (not necessarily
a class of functions). The notion “i.i.d.” should include that the finite-
dimensional marginals Zi (f˜1 ), . . . , Zi (f˜k ) are i.i.d. vectors. In general, this
does not determine the “distribution” of the processes in `∞ (F̃) adequately.
Assume in addition that each Zi is defined on a product probability space
(X^n , A^n , P^n ) as Zi (x1 , . . . , xn )(f̃ ) = Z(f̃ , xi ), for some fixed stochastic pro-
cess {Z(f̃ , x): f̃ ∈ F̃}. Then Σ_{i=1}^n ( Zi (f̃ ) − EZi (f̃ ) ) = Σ_{i=1}^n ( f (xi ) − P f ),
for the function f defined by f (x) = Z(f̃ , x). That Z is a stochastic process
means precisely that each function x 7→ f (x) is measurable. The processes
Z1 , Z2 , . . . satisfy the central limit theorem in ℓ∞ (F̃) if and only if the class
of functions {f : f̃ ∈ F̃} is P -Donsker.

Thus the central limit theorem in an arbitrary Banach space can be


phrased as a result about empirical processes. Even though this conclusion
is formally correct, the methods (traditionally) used for proving empirical
central limit theorems yield unsatisfactory descriptions of the central limit
theorem in special Banach spaces. For instance, the conditions for the cen-
tral limit theorem in Lp -space are simple and well known. If (S, S, µ) is
a σ-finite measure space and Z1 , . . . , Zn are i.i.d., zero-mean Borel mea-
surable maps into Lp (S, S, µ), then the sequence n^{−1/2} Σ_{i=1}^n Zi converges
weakly if and only if P( kZ1 kp > t ) = o(t^{−2}) as t → ∞ and

    ∫_S ( EZ1^2 (s) )^{p/2} dµ(s) < ∞.

(In case p = 2, this can be simplified to the single requirement EkZ1 k2^2 < ∞.
For this case, the theorem is given in Chapter 1.8.) In terms of empirical
processes, this becomes as follows.

2.1.11 Proposition. Let (S, S, µ) be a σ-finite measure space, let 1 ≤


p < ∞, and let P be a Borel probability measure on Lp (S, S, µ). Then
the unit ball of the dual space of Lp (S, S, µ) is P -Donsker if and only if
∫_S ( ∫ z(ω, s)^2 dP (ω) )^{p/2} dµ(s) < ∞ and P (kzkp > t) = o(t^{−2}) as t → ∞.

This proposition is proved in texts on probability in Banach spaces.†


The methods developed in this part do not yield this result. In fact, al-
ready the formulation of the proposition is unnecessarily complicated: it is
awkward to push this central limit theorem into an empirical mold.

† See Ledoux and Talagrand (1991), Theorem 10.10.

On the other hand, Part 2 gives many interesting results on the central
limit theorem in Banach spaces of the type `∞ (F), where F is a collection
of measurable functions. For this large class of Banach spaces, sufficient
conditions for the central limit theorem go beyond the Banach space struc-
ture, but they can be stated in terms of the structure of the function class
F.

Problems and Complements


1. (Total boundedness in L2 (P )) If F is totally bounded in L2 (P ), then it
is totally bounded for the seminorm ρP . If F is totally bounded for ρP and
kP kF = sup{|P f |: f ∈ F} is finite, then it is totally bounded in L2 (P ).
[Hint: Suppose f1 , . . . , fm form an ε-net for ρP . If sup{|P f |: f ∈ F} is
finite, then the numbers P f with f ranging over some ρP -ball of radius ε
are contained in an interval. Choose for every fi a finite collection fi,j such
that, for every f with ρP (f − fi ) < ε, there is an fi,j with |P f − P fi,j | < ε.
Then the fi,j form a √5 ε-net for F in L2 (P ).]

2. (Using k · kP,2 instead of ρP ) Suppose sup{|P f |: f ∈ F} is finite. Then


a class F of measurable functions is P -Donsker if and only if F is totally
bounded in L2 (P ) and the empirical process is asymptotically equicontinuous
in probability for the L2 (P )-semimetric.
[Hint: Use the previous problem and the next problem.]

3. Suppose ρ is a semimetric on a class of measurable functions F that is uni-


formly stronger than ρP in the sense that ρP (f − g) ≤ φ( ρ(f, g) ), for a
function φ with φ(ε) → 0 as ε ↓ 0. Suppose F is totally bounded under ρ.
Then F is P -Donsker if and only if the empirical process indexed by F is
asymptotically equicontinuous in probability with respect to ρ.
[Hint: Use the addendum to Theorem 1.5.7.]

4. (Brownian motion) If F is P -pre-Gaussian and kP kF < ∞, then a P -


Brownian bridge has uniformly continuous sample paths with respect to the
L2 (P )-semimetric. The process Z(f ) = G(f ) + ξP f for a standard normal
variable ξ independent of G has a version that is a Borel measurable, tight
map in `∞ (F). This process, which is Gaussian with mean zero and covariance
function P f g, is called a Brownian motion.

5. If an : (0, 1] → [0, ∞) is a sequence of nondecreasing functions, then there


exists δn ↓ 0 such that lim supn→∞ an (δn ) = limδ→0 lim supn→∞ an (δ). De-
duce that limδ→0 lim supn→∞ an (δ) = 0 if and only if an (δn ) → 0 for every
δn → 0.

6. For arbitrary functions an : (0, 1] → [0, ∞) consider:


(i) limn→∞ an (δn ) = 0, for every δn ↓ 0.
(ii) limδ↓0 lim supn→∞ an (δ) = 0.
(iii) limn→∞ an (δn ) = 0, for some δn ↓ 0.

Then (i) ⇒ (ii) ⇒ (iii), but both converses are false in general. If the functions
an are decreasing, then (i) and (ii) are equivalent, but still (iii) does not imply
(ii).
[Hint: Define a(δ): = lim supn→∞ an (δ).
(i) ⇒ (ii). If the double limit in (ii) is positive, then there exist ε > 0 and
ηm ↓ 0 such that a(ηm ) > ε, for every m, whence there exist n1 < n2 < · · ·
with anm (ηm ) > ε, for every m. Define δn = ηm if n ∈ [nm , nm+1 ), and
arbitrary for n < n1 . Then δn ↓ 0 and anm (δnm ) > ε, for every m, which
contradicts that an (δn ) → 0.
(ii) ⇒ (iii). Set An (δ) = supm≥n am (δ) and fix ε > 0 and ηm ↓ 0. For
every m there exists nm with |An (ηm ) − a(ηm )| < ε, for n ≥ nm . Without
loss of generality take n1 < n2 < · · · and set δn = ηm for n ∈ [nm , nm+1 ).
Then δn ↓ 0 and an (δn ) ≤ An (δn ) ≤ a(δn ) + ε, for all n ≥ n1 , so an (δn ) → 0.
(ii) ⇒ (i) if the an are decreasing. If an (δn ) does not tend to zero for a
given δn ↓ 0, then there exist ε > 0 and n1 < n2 < · · · such that anm (δnm ) >
ε, for every m. If an is decreasing, then anm (δnj ) > ε, for every j ≤ m. Thus
a(δnj ) ≥ lim inf m→∞ anm (δnj ) ≥ ε, for every j, which contradicts a(δ) → 0.
For counterexamples consider the array g(n, m) = 1n>m , which has the
properties that m 7→ g(n, m) is decreasing, and g(n, m) → 1 if n → ∞
followed by m → ∞, while g(n, n) = 0, for every n. Set an (δ) = g(n, m), for
1/(m +1) < δ ≤ 1/m, so that an (1/m) = g(n, m). This function has an (δ) →
1 as n → ∞ for every δ and hence does not satisfy (ii), but an (δn ) = 0 → 0,
for δn = 1/n, so that an does satisfy (iii). Second, set an (δ) = 1 − g(n, m),
for 1/(m + 1) < δ ≤ 1/m. Then an (δ) → 0 for every δ and hence an satisfies
(ii), but an (δn ) = 1, for δn = 1/n.]

7. The packing numbers of a ball B(0, R) = {x: kxk ≤ R} of radius R in Rd


satisfy, for any norm,

    D( ε, B(0, R), k · k ) ≤ (3R/ε)^d ,     0 < ε ≤ R.

[Hint: If x1 , . . . , xm is an ε-separated subset in B(0, R), then the balls of radii


ε/2 around the xi are disjoint and contained in B(0, R + ε/2). Comparison
of the volume of their union and the volume of B(0, R + ε/2) shows that
m(ε/2)d ≤ (R + ε/2)d .]
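The volume comparison above is easy to check numerically. The following sketch (not part of the original text; all code is illustrative) greedily builds an ε-separated set from a fine grid inside the Euclidean unit ball in R^2 and verifies that its size stays below (3R/ε)^d; any ε-separated subset of the ball is at most the packing number, so the bound must hold.

```python
import itertools
import math

def greedy_packing(points, eps):
    """Greedily select a subset of points with pairwise distances strictly > eps."""
    chosen = []
    for p in points:
        if all(math.dist(p, q) > eps for q in chosen):
            chosen.append(p)
    return chosen

# Fine grid inside the Euclidean unit ball B(0, 1) in R^2 (R = 1, d = 2).
grid = [(x / 50.0, y / 50.0)
        for x, y in itertools.product(range(-50, 51), repeat=2)
        if x * x + y * y <= 2500]

for eps in (0.5, 0.25, 0.1):
    m = len(greedy_packing(grid, eps))
    bound = (3.0 / eps) ** 2          # (3R/eps)^d with R = 1, d = 2
    print(eps, m, int(bound))
    assert m <= bound                 # any eps-separated set is at most this large
```

The greedy set is ε-separated by construction, so the assertion is guaranteed by the volume argument in the hint.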
2.2
Maximal Inequalities
and Covering Numbers

In this chapter we derive a class of maximal inequalities that can be used


to establish the asymptotic equicontinuity of the empirical process. Since
the inequalities have much wider applicability, we temporarily leave the
empirical process framework.
Let ψ be a nondecreasing, convex function with ψ(0) = 0 and X a
random variable. Then the Orlicz norm kXkψ is defined as

    kXkψ = inf{ C > 0: Eψ( |X|/C ) ≤ 1 }.
(Here the infimum over the empty set is ∞.) Using Jensen’s inequality,
it is not difficult to check that this indeed defines a norm (on the set of
random variables for which kXkψ is finite). The best-known examples of
Orlicz norms are those corresponding to the functions x 7→ xp for p ≥ 1:
the corresponding Orlicz norm is simply the Lp -norm

    kXkp = ( E|X|^p )^{1/p} .

For our purposes, Orlicz norms of more interest are the ones given by
ψp (x) = e^{x^p} − 1 for p ≥ 1, which give much more weight to the tails of X.
The bound x^p ≤ ψp (x) for all nonnegative x implies that kXkp ≤ kXkψp
for each p. It is not true that the exponential Orlicz norms are all bigger
than all Lp -norms. However, we have the inequalities

    kXkψp ≤ kXkψq (log 2)^{1/q−1/p} ,     p ≤ q,
    kXkp ≤ p! kXkψ1

(see Problem 2.2.5). Since for the present purposes fixed constants in in-
equalities are irrelevant, this means that a bound on an exponential Orlicz
norm always gives a better result than a bound on an Lp -norm.
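As a concrete numerical illustration of the definition (not part of the original text), the ψ2 -norm of a standard normal variable can be computed by bisection on the defining condition Eψ2 (|X|/C) ≤ 1, using the χ²₁ moment generating function E e^{tX^2} = (1 − 2t)^{−1/2} for t < 1/2; the exact value √(8/3) is quoted in Section 2.2.1.

```python
import math

def psi2_norm_std_normal():
    """psi_2-Orlicz norm of X ~ N(0,1): the smallest C with
    E[exp(X^2/C^2) - 1] <= 1, i.e. E exp(X^2/C^2) <= 2.  Uses the
    chi-square_1 mgf E exp(t X^2) = (1 - 2t)^(-1/2), valid for t < 1/2."""
    def expected_psi2(c):                   # E psi_2(|X|/c), decreasing in c
        return (1.0 - 2.0 / c ** 2) ** -0.5 - 1.0
    lo, hi = math.sqrt(2.0) * (1.0 + 1e-12), 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if expected_psi2(mid) > 1.0:        # C too small: the norm is larger
            lo = mid
        else:
            hi = mid
    return hi

print(psi2_norm_std_normal(), math.sqrt(8 / 3))   # both ~ 1.63299
```

Solving (1 − 2/C²)^{−1/2} = 2 by hand gives C² = 8/3, so the bisection recovers the value stated in the text.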
Any Orlicz norm can be used to obtain an estimate of the tail of a
distribution. By Markov’s inequality,

    P( |X| > x ) ≤ P( ψ( |X|/kXkψ ) ≥ ψ( x/kXkψ ) ) ≤ 1/ψ( x/kXkψ ).

For ψp (x) = e^{x^p} − 1, this leads to tail estimates exp(−Cx^p ) for any random
variable with a finite ψp -norm. Conversely, an exponential tail bound of this
type shows that kXkψp is finite.
p
2.2.1 Lemma. Let X be a random variable with P( |X| > x ) ≤ K e^{−Cx^p}
for every x, for constants K and C, and for p ≥ 1. Then its Orlicz norm
satisfies kXkψp ≤ ( (1 + K)/C )^{1/p} .

Proof. By Fubini’s theorem,

    E( e^{D|X|^p} − 1 ) = E ∫_0^{|X|^p} D e^{Ds} ds = ∫_0^∞ P( |X| > s^{1/p} ) D e^{Ds} ds.

Now insert the inequality on the tails of |X| and obtain the explicit upper
bound KD/(C − D). This is less than or equal to 1 for D^{−1/p} greater than
or equal to ( (1 + K)/C )^{1/p} .

Next consider the ψ-norm of a maximum of finitely many random
variables. Using the fact that max |Xi |^p ≤ Σ |Xi |^p , one easily obtains for
the Lp -norms

    k max_{1≤i≤m} Xi kp = ( E max_{1≤i≤m} |Xi |^p )^{1/p} ≤ m^{1/p} max_{1≤i≤m} kXi kp .

A similar inequality is valid for many Orlicz norms, in particular the expo-
nential ones. Here, in the general case, the factor m1/p becomes ψ −1 (m),
where ψ −1 is the inverse function of ψ.

2.2.2 Lemma. Let ψ be a convex, nondecreasing, nonzero function with


ψ(0) = 0 and lim supx,y→∞ ψ(x)ψ(y)/ψ(cxy) < ∞ for some constant c.
Then, for any random variables X1 , . . . , Xm ,

    k max_{1≤i≤m} Xi kψ ≤ K ψ^{−1}(m) max_i kXi kψ ,

for a constant K depending only on ψ.



Proof. For simplicity of notation assume first that ψ(x)ψ(y) ≤ ψ(cxy) for
all x, y ≥ 1. In that case ψ(x/y) ≤ ψ(cx)/ψ(y) for all x ≥ y ≥ 1. Thus, for
y ≥ 1 and any C,

    max_i ψ( |Xi |/(Cy) ) ≤ max_i [ ψ( c|Xi |/C )/ψ(y) + ψ( |Xi |/(Cy) ) 1{ |Xi |/(Cy) < 1 } ]
                          ≤ Σ_i ψ( c|Xi |/C )/ψ(y) + ψ(1).

Set C = c max_i kXi kψ , and take expectations to get

    E ψ( max_i |Xi |/(Cy) ) ≤ m/ψ(y) + ψ(1).

When ψ(1) ≤ 1/2, this is less than or equal to 1 for y = ψ^{−1}(2m), which is
greater than 1 under the same condition. Thus

    k max_i |Xi | kψ ≤ ψ^{−1}(2m) c max_i kXi kψ .

By the convexity of ψ and the fact that ψ(0) = 0, it follows that


ψ −1 (2m) ≤ 2 ψ −1 (m). The proof is complete for every special ψ that meets
the conditions made previously.
For a general ψ, there are constants σ ≤ 1 and τ > 0 such that
φ(x) = σψ(τ x) satisfies the conditions of the previous paragraph. Apply the
inequality to φ, and observe that kXkψ ≤ kXkφ /(στ ) ≤ kXkψ /σ (Prob-
lem 2.2.3).

For the present purposes, the value of the constant in the previous
lemma is irrelevant. (For the Lp -norms, it can be taken equal to 1.) The
important conclusion is that the inverse of the ψ-function determines the
size of the ψ-norm of a maximum in comparison to the ψ-norms of the
individual terms. The ψ-norm grows slowest for rapidly increasing ψ. For
ψ(x) = e^{x^p} − 1, the growth is at most logarithmic, because

    ψp^{−1}(m) = ( log(1 + m) )^{1/p} .
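The message of Lemma 2.2.2, that a maximum of m variables with bounded ψ2 -norms grows only like ψ2^{−1}(m) = √(log(1 + m)), is easy to see in simulation. A Monte Carlo sketch (not part of the original text; sample sizes are arbitrary):

```python
import math
import random

random.seed(0)

def mean_max_abs_normals(m, reps=500):
    """Monte Carlo estimate of E max_{i<=m} |X_i| for i.i.d. N(0,1)."""
    return sum(max(abs(random.gauss(0.0, 1.0)) for _ in range(m))
               for _ in range(reps)) / reps

# The ratio to psi_2^{-1}(m) = sqrt(log(1+m)) should stay roughly constant.
for m in (10, 100, 1000):
    est = mean_max_abs_normals(m)
    print(m, round(est, 3), round(est / math.sqrt(math.log(1 + m)), 3))
```

The estimated maxima grow slowly with m while the ratio in the last column stays near a fixed constant, in line with the lemma.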

The previous lemma is useless in the case of a maximum over infinitely


many variables. However, such a case can be handled via repeated applica-
tion of the lemma via a method known as chaining. Every random variable
in the supremum is written as a sum of “little links,” and the bound de-
pends on the number and size of the little links needed. For a stochastic
process {Xt : t ∈ T }, the number of links depends on the entropy of the
index set for the semimetric

d(s, t) = kXs − Xt kψ .

The general definition of “metric entropy” is as follows.



2.2.3 Definition (Covering numbers). Let (T, d) be an arbitrary semi-


metric space. Then the covering number N (ε, d) is the minimal number of
balls of radius ε needed to cover T . Call a collection of points ε-separated
if the distance between each pair of points is strictly larger than ε. The
packing number D(ε, d) is the maximum number of ε-separated points in
T . The corresponding entropy numbers are the logarithms of the covering
and packing numbers, respectively.

For the present purposes, both covering and packing numbers can be
used. In all arguments one can be replaced by the other through the in-
equalities

    N (ε, d) ≤ D(ε, d) ≤ N (ε/2, d).
Clearly, covering and packing numbers become bigger as ε ↓ 0. By defini-
tion, the semimetric space T is totally bounded if and only if the covering
and packing numbers are finite for every ε > 0. The upper bound in the
following maximal inequality depends on the rate at which D(ε, d) grows
as ε ↓ 0, as measured through an integral criterion.

2.2.4 Theorem. Let ψ be a convex, nondecreasing, nonzero function with


ψ(0) = 0 and lim supx,y→∞ ψ(x)ψ(y)/ψ(cxy) < ∞, for some constant c.
Let {Xt : t ∈ T } be a separable stochastic process‡ with

kXs − Xt kψ ≤ C d(s, t), for every s, t,

for some semimetric d on T and a constant C. Then, for any η, δ > 0,

    k sup_{d(s,t)≤δ} |Xs − Xt | kψ ≤ K [ ∫_0^η ψ^{−1}( D(ε, d) ) dε + δ ψ^{−1}( D^2(η, d) ) ],

for a constant K depending on ψ and C only.

2.2.5 Corollary. The constant K can be chosen such that

    k sup_{s,t} |Xs − Xt | kψ ≤ K ∫_0^{diam T} ψ^{−1}( D(ε, d) ) dε,

where diam T is the diameter of T .

Proof. Assume without loss of generality that the packing numbers and
the associated “covering integral” are finite. Construct nested sets T0 ⊂
T1 ⊂ T2 ⊂ · · · ⊂ T such that every Tj is a maximal set of points with
d(s, t) > η 2^{−j} for every s, t ∈ Tj , where “maximal” means that no
point can be added without destroying the validity of the inequality. By


‡ Separable may be understood in the sense that sup_{d(s,t)<δ} |Xs − Xt | remains almost
surely the same if the index set T is replaced by a suitable countable subset.

the definition of packing numbers, the number of points in Tj is less than


or equal to D(η2−j , d).
“Link” every point tj+1 ∈ Tj+1 to a unique tj ∈ Tj such that
d(tj , tj+1 ) ≤ η2−j . Thus obtain for every tk+1 a chain tk+1 , tk , . . . , t0 that
connects it to a point in T0 . For arbitrary points sk+1 , tk+1 in Tk+1 , the
difference in increments along their chains can be bounded by
    |(X_{s_{k+1}} − X_{s_0}) − (X_{t_{k+1}} − X_{t_0})| = | Σ_{j=0}^k [ (X_{s_{j+1}} − X_{s_j}) − (X_{t_{j+1}} − X_{t_j}) ] |
                                                ≤ 2 Σ_{j=0}^k max |Xu − Xv |,

where for fixed j the maximum is taken over all links (u, v) from Tj+1 to Tj .
Thus the jth maximum is taken over at most #Tj+1 links, with each link
having a ψ-norm kXu − Xv kψ bounded by C d(u, v) ≤ C η2−j . It follows
with the help of Lemma 2.2.2 that, for a constant depending only on ψ and
C,
    k max_{s,t∈T_{k+1}} |(Xs − X_{s_0}) − (Xt − X_{t_0})| kψ ≤ K Σ_{j=0}^k ψ^{−1}( D(η 2^{−j−1}, d) ) η 2^{−j}

(2.2.6)                                              ≤ 4K ∫_0^η ψ^{−1}( D(ε, d) ) dε.

In this bound, s0 and t0 are the endpoints of the chains starting at s and
t, respectively.
The maximum of the increments |Xsk+1 − Xtk+1 | can be bounded by
the maximum on the left side of (2.2.6) plus the maximum of the discrep-
ancies |Xs0 − Xt0 | at the end of the chains. The maximum of the latter
discrepancies will be analyzed by a seemingly circular argument. For ev-
ery pair of endpoints s0 , t0 of chains starting at two points in Tk+1 within
distance δ of each other, choose exactly one pair sk+1 , tk+1 in Tk+1 , with
d(sk+1 , tk+1 ) < δ, whose chains end at s0 , t0 . By definition of T0 , this gives
at most D2 (η, d) pairs. By the triangle inequality,

    |X_{s_0} − X_{t_0} | ≤ |(X_{s_0} − X_{s_{k+1}}) − (X_{t_0} − X_{t_{k+1}})| + |X_{s_{k+1}} − X_{t_{k+1}} |.

Take the maximum over all pairs of endpoints s0 , t0 as above. Then the
corresponding maximum over the first term on the right in the last display
is bounded by the maximum in the left side of (2.2.6). Its ψ-norm can be
bounded by the right side of this equation. Combine this with (2.2.6) to
find that
    k max_{s,t∈T_{k+1}, d(s,t)<δ} |Xs − Xt | kψ ≤ 8K ∫_0^η ψ^{−1}( D(ε, d) ) dε + k max |X_{s_{k+1}} − X_{t_{k+1}} | kψ .

Here the maximum on the right is taken over the pairs sk+1 , tk+1 in Tk+1
uniquely attached to the pairs s0 , t0 as above. Thus the maximum is over
at most D^2(η, d) terms, each of whose ψ-norms is bounded by δ. Its ψ-norm
is bounded by K ψ^{−1}( D^2(η, d) ) δ.
Thus the upper bound given by the theorem is a bound for the maxi-
mum of increments over Tk+1 . Let k tend to infinity to conclude the proof.
The corollary follows immediately from the previous proof, after noting
that, for η equal to the diameter of T , the set T0 consists of exactly one
point. In that case s0 = t0 for every pair s, t, and the increments at the end
of the chains are zero. The corollary also follows from the theorem upon
taking η = δ = diam T and noting that D(η, d) = 1, so that the second
term in the maximal inequality can also be written δ ψ −1 D(η, d) . Since
the function ε 7→ ψ −1 D(ε, d) is decreasing, this term can be absorbed
into the integral, perhaps at the cost of increasing the constant K.

Though the theorem gives a bound on the continuity modulus of the


process, a bound on the maximum of the process will be needed. Of course,
for any t0 ,
    k sup_t |Xt | kψ ≤ kX_{t_0} kψ + k sup_t |Xt − X_{t_0} | kψ .
Nevertheless, to state the maximal inequality in terms of the increments ap-
pears natural. The increment bound shows that the process X is continuous
in ψ-norm whenever the covering integral ∫_0^η ψ^{−1}( D(ε, d) ) dε converges for
some η > 0. (In that case, the right side in Theorem 2.2.4 can be made ar-
bitrarily small by choosing first η small and next δ.) It is a small step to
deduce the continuity of almost all sample paths from this inequality, but
this is not needed at this point (Problem 2.2.17).

2.2.1 Sub-Gaussian Inequalities


A standard normal variable has tails of the order x^{−1} exp(−x^2/2) and satisfies
P( |X| > x ) ≤ 2 exp(−x^2/2) for every x ≥ 0. By direct calculation one finds a
ψ2 -norm of √(8/3). In this subsection we study random variables satisfying
similar tail bounds.
    Hoeffding’s inequality asserts a “sub-Gaussian” tail bound for random
variables of the form X = Σ Xi with X1 , . . . , Xn i.i.d. with zero means
and bounded range. (See Proposition A.6.1.) The following special case of
Hoeffding’s inequality will be needed.

2.2.7 Lemma (Hoeffding’s inequality). Let a1 , . . . , an be constants and


ε1 , . . . , εn be independent Rademacher random variables; i.e., with P(εi =
1) = P(εi = −1) = 1/2. Then

    P( |Σ εi ai | > x ) ≤ 2 e^{−x^2/(2 kak^2)} ,

for the Euclidean norm kak. Consequently, kΣ εi ai kψ2 ≤ √6 kak.

Proof. For any λ and Rademacher variable ε, one has Ee^{λε} = (e^λ +
e^{−λ})/2 ≤ e^{λ^2/2} , where the last inequality follows after writing out the
power series. Thus by Markov’s inequality, for any λ > 0,

    P( Σ_{i=1}^n ai εi > x ) ≤ e^{−λx} E e^{λ Σ_{i=1}^n ai εi} ≤ e^{(λ^2/2) kak^2 − λx} .

The best upper bound is obtained for λ = x/kak2 and is the exponential
in the probability bound of the lemma. Combination with a similar bound
for the lower tail yields the probability bound.
The bound on the ψ-norm is a consequence of the probability bound
in view of Lemma 2.2.1.
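Since a Rademacher sum over a few coefficients takes only 2^n values, the inequality just proved can be checked exactly rather than by simulation. A sketch (not part of the original text; the coefficient vector is an arbitrary illustration):

```python
import itertools
import math

a = [0.5, -1.0, 2.0, 1.5, -0.7, 0.3]      # arbitrary illustrative coefficients
norm_sq = sum(c * c for c in a)           # ||a||^2

def exact_tail(x):
    """Exact P(|sum_i a_i eps_i| > x), enumerating all 2^6 Rademacher sign vectors."""
    signs = list(itertools.product((-1, 1), repeat=len(a)))
    return sum(abs(sum(s * c for s, c in zip(sv, a))) > x for sv in signs) / len(signs)

for x in (1.0, 2.0, 4.0):
    p = exact_tail(x)
    bound = 2.0 * math.exp(-x * x / (2.0 * norm_sq))
    print(x, p, round(bound, 4))
    assert p <= bound                     # Hoeffding's inequality, verified exactly
```

The exact tail probabilities sit well below the sub-Gaussian bound, as the lemma guarantees for every x.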

A stochastic process is called sub-Gaussian with respect to the semi-


metric d on its index set if

    P( |Xs − Xt | > x ) ≤ 2 e^{−x^2/(2 d^2(s,t))} ,     for every s, t ∈ T, x > 0.
(The constants 2 and 1/2 are of no special importance. See Problem 2.2.14.)
Any Gaussian process is sub-Gaussian for the standard deviation semimet-
ric d(s, t) = σ(Xs − Xt ). Another example is the Rademacher process
n
X
Xa = a i εi , a ∈ Rn ,
i=1

for Rademacher variables ε1 , . . . , εn . By Hoeffding’s inequality, this is sub-


Gaussian for the Euclidean distance d(a, b) = ka − bk.
    Sub-Gaussian processes satisfy the increment bound kXs − Xt kψ2 ≤
√6 d(s, t). Since the inverse of the ψ2 -function is essentially the square root
of the logarithm, the general maximal inequality leads for sub-Gaussian pro-
cesses to a bound in terms of an entropy integral. (Remember that entropy
is defined as the logarithm of packing numbers.) Furthermore, because of
the special properties of the logarithm, the statement can be slightly sim-
plified.

2.2.8 Corollary. Let {Xt : t ∈ T } be a separable sub-Gaussian process.


Then for every δ > 0,

    E sup_{d(s,t)≤δ} |Xs − Xt | ≤ K ∫_0^δ √( log D(ε, d) ) dε,

for a universal constant K. In particular, for any t0 ,

    E sup_t |Xt | ≤ E|X_{t_0} | + K ∫_0^∞ √( log D(ε, d) ) dε.

Proof. Apply the general maximal inequality with ψ2 (x) = e^{x^2} − 1
and η = δ. Since ψ2^{−1}(m) = √( log(1 + m) ), we have ψ2^{−1}( D^2(δ, d) ) ≤
√2 ψ2^{−1}( D(δ, d) ). Thus the second term in the maximal inequality can first
be replaced by √2 δ ψ2^{−1}( D(η, d) ) and next be incorporated in the first (the
covering integral) at the cost of increasing the constant. We obtain

    k sup_{d(s,t)≤δ} |Xs − Xt | kψ2 ≤ K ∫_0^δ √( log(1 + D(ε, d)) ) dε.

Here D(ε, d) ≥ 2 for every ε that is strictly less than the diameter of T .
Since log(1 + m) ≤ 2 log m for m ≥ 2, the 1 inside the logarithm can be
removed at the cost of increasing K.
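For Brownian motion on [0, 1] the standard-deviation semimetric is d(s, t) = √|s − t|, so D(ε, d) behaves like ε^{−2} and the entropy integral gives E sup_{|s−t|≤h} |Bs − Bt | of order √(h log(1/h)). A Monte Carlo sketch of this modulus (not part of the original text; the discretization and sample sizes are arbitrary):

```python
import math
import random

random.seed(3)

def bm_modulus(h, n=512, reps=50):
    """Monte Carlo estimate of E sup_{|s-t|<=h} |B_s - B_t| for Brownian
    motion on [0,1], discretized as a Gaussian random walk on n steps."""
    k = max(1, int(h * n))
    total = 0.0
    for _ in range(reps):
        b, path = 0.0, [0.0]
        for _ in range(n):
            b += random.gauss(0.0, 1.0 / math.sqrt(n))
            path.append(b)
        # sup over |s-t| <= h equals the largest (max - min) over a window of length h
        total += max(max(path[i:i + k + 1]) - min(path[i:i + k + 1])
                     for i in range(n - k + 1))
    return total / reps

# The ratio to sqrt(h log(1/h)) should stay roughly constant as h decreases.
for h in (1 / 4, 1 / 16, 1 / 64):
    est = bm_modulus(h)
    print(h, round(est, 3), round(est / math.sqrt(h * math.log(1 / h)), 3))
```

The estimated moduli shrink with h at the entropy-integral rate, up to a constant, as Corollary 2.2.8 predicts.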

2.2.2 Bernstein’s Inequality


Since many random variables have larger than normal tails, the results of
the previous subsection are not always useful. A sum Σ_{i=1}^n Yi of indepen-
dent variables with mean zero and bounded range is, for large n, approx-
imately normally distributed with variance v = var(Y1 + · · · + Yn ). The
tails of a normal N (0, v) variable are of the order exp(−x^2/(2v)). If the vari-
ables Yi have range [−M, M ], then Bernstein’s inequality gives a tail bound
exp(−x^2/[2v + (2M x/3)]) for the variables Σ_{i=1}^n Yi . The extra term, 2M x/3,
may be viewed as a penalty for the nonnormality: for n → ∞, it is typically
negligible with respect to v = vn .

2.2.9 Lemma (Bernstein’s inequality). For independent random vari-


ables Y1 , . . . , Yn with bounded ranges [−M, M ] and zero means,

    P( |Y1 + · · · + Yn | > x ) ≤ 2 e^{−x^2/(2(v + M x/3))} ,

for v ≥ var(Y1 + · · · + Yn ).

Proof. See Pollard (1984) or Shorack and Wellner (1986), page 855.
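A Monte Carlo sketch of the inequality (not part of the original text; the choice of Uniform(−1, 1) summands and all sample sizes are arbitrary):

```python
import math
import random

random.seed(4)
n, M = 30, 1.0                    # Uniform(-1,1) summands have range [-M, M]
v = n / 3.0                       # var(Y_1 + ... + Y_n): each var is 1/3
reps = 10000

def empirical_tail(x):
    """Monte Carlo estimate of P(|Y_1 + ... + Y_n| > x)."""
    hits = sum(abs(sum(random.uniform(-1.0, 1.0) for _ in range(n))) > x
               for _ in range(reps))
    return hits / reps

for x in (4.0, 6.0, 8.0):
    emp = empirical_tail(x)
    bound = 2.0 * math.exp(-x * x / (2.0 * (v + M * x / 3.0)))
    print(x, emp, round(bound, 4))
    assert emp <= bound           # Bernstein's inequality
```

The empirical tails stay below the bound with room to spare; the gap widens for large x, where the linear term M x/3 in the exponent takes over from the variance term.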

For large x, the upper bound in Bernstein’s inequality is essentially


of the exponential type exp (−x/M ); for x close to zero the upper bound
behaves like the normal upper bound, exp(−x2 /(2v)). This suggests that a
maximum of variables that satisfy a Bernstein-type bound can be controlled
using a combination of the ψ1 and ψ2 Orlicz norms.

2.2.10 Lemma. Let X1 , . . . , Xm be arbitrary random variables that satisfy


the tail bound

    P( |Xi | > x ) ≤ 2 e^{−x^2/(2(b + ax))} ,

for all x (and i) and fixed a, b > 0. Then

    k max_{1≤i≤m} Xi kψ1 ≤ K ( a log(1 + m) + √( b log(1 + m) ) ),

for a universal constant K.

Proof. The condition implies the upper bound 2 exp(−x2 /(4b)) on


P (|Xi | > x), for every x ≤ b/a, and the upper bound 2 exp(−x/(4a)),
for all other positive x. Consequently, the same upper bounds hold for all
x > 0 for the probabilities P( |Xi |1{|Xi | ≤ b/a} > x ) and P( |Xi |1{|Xi | >
b/a} > x ), respectively. By Lemma 2.2.1, this implies that the Orlicz
norms kXi 1{|Xi | ≤ b/a}kψ2 and kXi 1{|Xi | > b/a}kψ1 are up to constants
bounded by √b and a, respectively. Next

    k max_i Xi kψ1 ≤ k max_i Xi 1{|Xi | ≤ b/a}kψ1 + k max_i Xi 1{|Xi | > b/a}kψ1 .

Since the ψp -norms are up to constants nondecreasing in p, the first ψ1 -norm


on the right can be replaced by a ψ2 -norm. Finally, apply Lemma 2.2.2 to
find the bound as stated.

A refined form of Bernstein’s inequality will be useful in Part 3. A


random variable Y with bounded range [−M, M ] satisfies

    E|Y |^m ≤ M^{m−2} EY^2 ,

for every m ≥ 2. A much weaker inequality than this is the essential element
in the proof of Bernstein’s inequality.

2.2.11 Lemma (Bernstein’s inequality). Let Y1 , . . . , Yn be independent


random variables with zero mean such that E|Yi |^m ≤ m! M^{m−2} vi /2, for
every m ≥ 2 (and all i) and some constants M and vi . Then

    P( |Y1 + · · · + Yn | > x ) ≤ 2 e^{−x^2/(2(v + M x))} ,

for v ≥ v1 + · · · + vn .

Proof. See Bennett (1962), pages 37–38.

The moment condition in the refined form of Bernstein’s inequality is


somewhat odd. It is implied by

    E( e^{|Yi |/M} − 1 − |Yi |/M ) M^2 ≤ vi /2.
Conversely, if Yi satisfies the moment condition of the lemma, then the
preceding display is valid with M replaced by 2M and vi replaced by 2vi .
Thus for applications where constants in Bernstein’s inequality are unim-
portant, the preceding display is “equivalent” to the moment condition of
the lemma.

* 2.2.3 Tightness Under an Increment Bound


In the next chapters we obtain general central limit theorems for empirical
processes through the application of maximal inequalities. Independently,
the next example shows how a classical, simple sufficient condition for weak
convergence follows also from the maximal inequalities.

2.2.12 Example. Let {Xn (t): t ∈ [0, 1]} be a sequence of separable
stochastic processes with bounded sample paths and increments satisfying

    E|Xn (s) − Xn (t)|^p ≤ K|s − t|^{1+r} ,

for constants p, K, r > 0 independent of n. Assume that the sequences
of marginals (Xn (t1 ), . . . , Xn (tk )) converge weakly to the corresponding
marginals of a stochastic process {X(t): t ∈ [0, 1]}. Then there exists a
version of X with continuous sample paths and Xn ⇝ X in ℓ∞ [0, 1]. (Hence
also in D[0, 1] or C[0, 1], provided every Xn has all its sample paths in these
spaces.)
    To prove this for p > 1, apply Theorem 2.2.4 with ψ(x) = x^p and
d(s, t) = |s − t|^α , where α = ((1 + r)/p) ∧ 1. A d-ball of radius ε around
some t is simply the interval [t − ε^{1/α} , t + ε^{1/α} ]. Thus the index set [0, 1] can
be covered with N (ε, d) = (1/2)ε^{−1/α} balls of radius ε. Since ψ^{−1}(x) = x^{1/p} ,
the covering integral can be bounded by a multiple of

    ∫_0^η ψ^{−1}( N (ε, d) ) dε ≤ ∫_0^η ε^{−1/(pα)} dε.

For p > 1, the integral converges. Since kXn (s) − Xn (t)kp ≤ K^{1/p} |s − t|^α ,
the general maximal inequality and Markov’s inequality give

    P( sup_{|s−t|<δ} |Xn (s) − Xn (t)| > x ) ≤ (C/x) [ ∫_0^η ε^{−1/(pα)} dε + δ η^{−2/(pα)} ],

|s−t|<δ x 0
for some constant C independent of n. By choosing first η and next δ,
one can make the right side arbitrarily small. This verifies the asymptotic
equicontinuity of Xn . Its weak convergence and the continuity of the limit
process follow from Theorem 1.5.7.
    For p ≤ 1, use the inequality |(x^+ )^{1/q} − (y^+ )^{1/q}| ≤ |x − y|^{1/q} (valid for
all reals x, y, and q ≥ 1), with q = 2/p, to derive that

    E|Xn^+ (s)^{p/2} − Xn^+ (t)^{p/2}|^2 ≤ K|s − t|^{1+r} ,     for every s, t.

By the previous argument, it follows that (Xn^+ )^{p/2} ⇝ (X^+ )^{p/2} , where
(X^+ )^{p/2} is a process with continuous sample paths. Since the map x 7→ x^q
from ℓ∞ [0, 1] to ℓ∞ [0, 1] is continuous at every x ∈ C[0, 1] (Problem 2.2.15),
it follows that Xn^+ ⇝ X^+ . By a similar argument, Xn^− ⇝ X^− . Together
this yields Xn ⇝ X.

* This section may be skipped at first reading.



2.2.4 Generic Chaining and Majorizing Measures


Maximal inequalities based on covering numbers appear to suffice for al-
most all statistical applications, but are known to be suboptimal in general.
The suboptimality is particularly well documented for Gaussian maximal
inequalities, for which the alternate techniques generic chaining and ma-
jorizing measures give sharp bounds. In this section we briefly discuss these
techniques within the context of Theorem 2.2.4. Although at present they
have found very few applications in statistics, the section may serve as an
introduction to facilitate future use. In the remainder of this book majoriz-
ing measures are used only in one proof, and even this proof can be read
independently from the present section.
As in Theorem 2.2.4 consider a separable stochastic process {Xt : t ∈ T }
indexed by a semimetric space (T, d), with increments satisfying the bound
kXs − Xt kψ ≲ d(s, t), for a given Young function ψ. Unlike Theorem 2.2.4,
the following theorem does not impose further conditions on ψ, but instead
formulates its bound in terms of another Young function φ (which can
always be taken to be the identity). The bound is expressed in terms of
a sequence T0 , T1 , T2 , . . . of subsets of T , not necessarily nested, which are
only limited by their cardinalities. The set T0 must consist of a single point,
while the cardinality of Tj must satisfy |Tj | ≤ ψ(4^j ), for every j ≥ 1. Such
a sequence of subsets is called ψ-admissible.

2.2.13 Theorem. Let φ and ψ be convex, nondecreasing, nonzero func-


tions with φ(0) = ψ(0) = 0 and φ(y/x) ≤ ψ(y)/ψ(x), for every y ≥ x ≥ 1.
Let {Xt : t ∈ T } be a separable stochastic process with

kXs − Xt kψ ≤ C d(s, t), for every s, t,

for some semimetric d on T and a constant C. Then, for any sequence
T0 , T1 , T2 , . . . of subsets of T such that |T0 | = 1 and |Tj | ≤ ψ(4^j ), for every
j ≥ 1,

    k sup_{s,t∈T} |Xs − Xt | kφ ≤ K sup_{t∈T} Σ_{j=0}^∞ d(t, Tj ) 4^j
                                  + K Σ_{j=0}^∞ |Tj+1 |^{−1} Σ_{t∈Tj+1} d(t, Tj ) 4^j ,

for a constant K depending on ψ and C only. Furthermore, in the special
case that ψ(x) = e^{x^α} − 1, for some α ≥ 1, this can be improved to

    k sup_{s,t∈T} |Xs − Xt | kψ ≤ K sup_{t∈T} Σ_{j=0}^∞ d(t, Tj ) 4^j .

Proof. By the assumed separability of X and the monotone convergence


theorem, it suffices to prove the bounds for a finite set T . As the bounds
decrease if the sets Tj are enlarged, it is then no loss of generality to assume

that T = Tk+1 , for some k. For t = tk+1 ∈ T and j ≥ 0, let πj t be a point


in Tj closest to t, so that d(t, Tj ) = d(t, πj t), and for every t ∈ Tk+1 form
the chain tk+1 , tk , . . . , t0 with tj = πj tj+1 , for every j. By telescoping and
the triangle inequality, for every x > 0,

    |Xt − X_{t_0} | ≤ Σ_{j=0}^k |X_{t_{j+1}} − X_{t_j} | 1_{Ω_{t_{j+1},x}} + x Σ_{j=0}^k d(tj+1 , tj ) 4^{j+1} ,

for Ω_{t_{j+1},x} the event that |X_{t_{j+1}} − X_{t_j} | > x d(tj+1 , tj ) 4^{j+1} . We show at the
end of the proof that

    Ak := sup_{t∈Tk+1} Σ_{j=0}^k d(tj+1 , tj ) 4^{j+1} ≤ 4 sup_{t∈T} Σ_{j=0}^∞ d(t, Tj ) 4^{j+1} =: A.

The right side A is the first term in the general bound given by the theorem,
and the only term in the bound for the exponential ψ. Therefore it suffices to
complement the bound Ak ≤ A by a bound on the first variable on the right
in the second last display. We derive this bound first for the exponential ψ
and next for general ψ.
    For ψ(x) = e^{x^α} − 1 the preceding decomposition gives that

    P( sup_{t∈T} |Xt − X_{t_0} | > xA ) ≤ P( Σ_{j=0}^k |X_{t_{j+1}} − X_{t_j} | 1_{Ω_{t_{j+1},x}} > 0 )
        ≤ Σ_{j=0}^k Σ_{u∈Tj+1} P(Ω_{u,x} ) ≤ Σ_{j=0}^k ψ(4^{j+1})/ψ( x 4^{j+1}/C ) ≤ 2 e^{−(x/C)^α} ,

for (x/C)^α ≥ 2, where we used Markov’s inequality and the assumption
that Eψ( |Xs − Xt |/(C d(s, t)) ) ≤ 1, for every s, t. Lemma 2.2.1 gives that
k sup_{t∈T} |Xt − X_{t_0} | kψ ≲ A, which completes the proof for the exponential
function ψ.
For a general Young function ψ we choose x = 1 in the decomposition of the first paragraph, and apply the assumption φ(y/x) ≤ ψ(y)/ψ(x) with y = |X_{t_{j+1}} − X_{t_j}|/d(t_{j+1}, t_j) and x = 4^{j+1}, on the event Ω_{t_{j+1},x}, where the restriction y ≥ x ≥ 1 is valid, to see that
\[
\sum_{j=0}^{k} |X_{t_{j+1}} - X_{t_j}|\, 1_{\Omega_{t_{j+1},x}}
\le \sum_{j=0}^{k} \varphi^{-1}\biggl(\frac{\psi\bigl(|X_{t_{j+1}} - X_{t_j}|/d(t_{j+1}, t_j)\bigr)}{\psi(4^{j+1})}\biggr)\, d(t_{j+1}, t_j)\,4^{j+1}
\le \varphi^{-1}\biggl(\sum_{j=0}^{k} \frac{\psi\bigl(|X_{t_{j+1}} - X_{t_j}|/d(t_{j+1}, t_j)\bigr)}{\psi(4^{j+1})}\, \frac{d(t_{j+1}, t_j)\,4^{j+1}}{A}\biggr)\, A,
\]
by concavity of φ^{-1} with φ^{-1}(0) = 0, since Σ_{j=0}^{k} d(t_{j+1}, t_j) 4^{j+1}/A ≤ 1,
by the definition of A. As the map s ↦ φ^{-1}(y/s)s is increasing, for any y ≥ 0, by concavity of φ^{-1} with φ^{-1}(0) = 0, the right side of the display does not become smaller if A is replaced by A ∨ Ã, where we take Ã = 2 Σ_j ψ(4^{j+1})^{-1} Σ_{t∈T_{j+1}} d(t, T_j) 4^{j+1}, which is bigger than twice the second infinite series in the upper bound given by the theorem. By monotonicity of φ we conclude that
\[
\varphi\Biggl(\frac{\sum_{j=0}^{k} |X_{t_{j+1}} - X_{t_j}|\, 1_{\Omega_{t_{j+1},x}}}{A \vee \tilde A}\Biggr)
\le \sum_{j=0}^{k} \frac{\psi\bigl(|X_{t_{j+1}} - X_{t_j}|/d(t_{j+1}, t_j)\bigr)}{\psi(4^{j+1})}\, \frac{d(t_{j+1}, t_j)\,4^{j+1}}{A \vee \tilde A}
\le \sum_{j=0}^{k} \sum_{u\in T_{j+1}} \frac{\psi\bigl(|X_u - X_{\pi_j u}|/d(u, \pi_j u)\bigr)}{\psi(4^{j+1})}\, \frac{d(u, \pi_j u)\,4^{j+1}}{A \vee \tilde A}.
\]

The right side is independent of t ∈ T_{k+1}, and hence by the assumption on the increments of X the expectation of the supremum over t ∈ T_{k+1} of the left side is bounded above by
\[
\sum_{j=0}^{k} \sum_{u\in T_{j+1}} \frac{d(u, \pi_j u)\,4^{j+1}}{\psi(4^{j+1})\,(A \vee \tilde A)} \le \frac{\tilde A}{A \vee \tilde A} \le 1.
\]

It follows that the φ-Orlicz norm of Σ_{j=0}^{k} |X_{t_{j+1}} − X_{t_j}| 1_{Ω_{t_{j+1},1}} is bounded above by A ∨ Ã. This concludes the proof of the theorem for a general Young function ψ.
The definition of t_i implies that d(t_i, t_{i+1}) ≤ d(π_i t, t_{i+1}). Combining this with the triangle inequality, we see that d(t, t_i) ≤ d(t, t_{i+1}) + d(π_i t, t_{i+1}) ≤ 2d(t, t_{i+1}) + d(π_i t, t), by a second application of the triangle inequality, and hence d(t, t_i) ≤ d(t, T_i) + 2d(t, t_{i+1}). Iterating this inequality for i = j, j + 1, . . . , k, we find d(t, t_j) ≤ Σ_{i=j}^{k} d(t, T_i) 2^{i−j}, for every t ∈ T_{k+1}, where the remainder term of the last iteration involving d(t, t_{k+1}) has vanished as t = t_{k+1}. We conclude that
\[
\sum_{j=0}^{k} d(t, t_j)\,4^j \le \sum_{j=0}^{k} \sum_{i=j}^{k} d(t, T_i)\,2^{i-j}\,4^j \le 2 \sum_{i=0}^{k} d(t, T_i)\,4^i.
\]
By another application of the triangle inequality, Σ_{j=0}^{k} d(t_{j+1}, t_j)4^j is not bigger than twice the left side, and hence twice the right side, of this display, for any t ∈ T_{k+1}. This completes the proof that A_k ≤ 4A.

The condition that φ(y/x) ≤ ψ(y)/ψ(x), for y ≥ x ≥ 1, is satisfied for φ the identity and every Young function ψ (as ψ(x) = ψ((x/y)y + (1 − x/y)·0) ≤ (x/y)ψ(y), for every x ≤ y). Thus the theorem gives a bound on E sup_{s,t} |X_s − X_t| for any stochastic process whose increments are bounded
in terms of an arbitrary Young function ψ. If ψ(x)ψ(y) ≤ ψ(xy) for every
x, y ≥ 1, which is the assumption in Theorem 2.2.4 up to multiplicative
constants and scaling, then the assumption is also satisfied for φ = ψ, and
the theorem provides a bound on the ψ-norm of the supremum.
The first term in the bound in the theorem involves a supremum over t
taken outside the infinite sum. This expression increases if the supremum is
moved inside the sum. The second term in the bound is a sum of averages
over the sets Tj+1 and increases if these are replaced by suprema. It follows
that both terms are bounded above by
\[
\sum_{j=0}^{\infty} \sup_{t\in T} d(t, T_j)\,4^j.
\]

The infimum of this expression over all ψ-admissible sequences T_0, T_1, T_2, . . . is essentially the entropy bound of Theorem 2.2.4, as the optimal choice of T_j would be a minimal ε_j-net, for ε_j determined so that |T_j| ≍ ψ(4^j). In particular, for any sequence ε_j ↓ 0 with ψ(4^j) ≤ N(ε_j, T, d) ≤ ψ(4^j) + 1, we can choose T_j a minimal ε_j-net over T, and see that the preceding display is bounded above by
\[
\sum_{j=0}^{\infty} \varepsilon_j\, \psi^{-1}\bigl(N(\varepsilon_j, T, d)\bigr) \lesssim \int_0^{\infty} \psi^{-1}\bigl(N(\varepsilon, T, d)\bigr)\, d\varepsilon.
\]
(Here use that ψ^{-1}(N(ε_{j−1}, T, d)) ∼ 4^{j−1}, as j → ∞.) Thus Theorem 2.2.13 improves on Theorem 2.2.4.


The sequence of subsets T0 , T1 , T2 , . . . may be replaced by a sequence
of partitions of T . A nested sequence T, T = ∪i Ti,1 , T = ∪i Ti,2 , . . . of par-
titions of T is called ψ-admissible if the number of sets in the jth partition
is bounded above by ψ(4j ), for every j ≥ 1, and is equal to 1 for j = 0.
For given t ∈ T denote the set Ti,j in the jth partition that contains t by
Tj (t). Then the generic chaining bounds are defined as the infima over all
ψ-admissible partitions of the numbers
\[
\sup_{t\in T} \sum_{j=0}^{\infty} \operatorname{diam}\bigl(T_j(t)\bigr)\,4^j
\qquad\text{and}\qquad
\sum_{j=0}^{\infty} \frac{1}{\psi(4^j)} \sum_{i} \operatorname{diam}(T_{i,j})\,4^j.
\]

If for every j ≥ 0 a set Tj ⊂ T of points is constructed by choosing exactly


one point from each set Ti,j in the partition T = ∪i Ti,j , then the sequence of
sets T0 , T1 , T2 , . . . is clearly ψ-admissible, and the two series in the preceding
display dominate the two series in Theorem 2.2.13.
In the case of the exponential Orlicz norms ψ(x) = e^{x^α} − 1, only the first of the two series is relevant, and the procedure of the preceding paragraph can also be reversed, in the sense that given a sequence of admissible subsets there exists a sequence of admissible partitions such that sup_t Σ_{j=0}^∞ diam(T_j(t)) 4^j ≲ sup_t Σ_{j=0}^∞ d(t, T_j) 4^j. The partitions can be constructed by starting from the Voronoi partition defined by the sets T_j, and cutting these further into sets of appropriate diameters, without overly increasing the cardinality (see Talagrand (2014), Corollary 2.3.2). In
the exponential case it is customary to replace the powers 4^j by 2^{j/α} and write the generic chaining bound in the form
\[
\gamma_{\alpha}(T, d) = \inf \sup_{t\in T} \sum_{j=0}^{\infty} \operatorname{diam}\bigl(T_j(t)\bigr)\,2^{j/\alpha},
\]
where the infimum is taken over all sequences of nested partitions T = ∪_i T_{i,j} consisting of at most 2^{2^j} sets. The latter number is equal to ψ(2^{j/α}) + 1 for the Orlicz norm ψ(x) = 2^{x^α} − 1. (The base 4 in the powers 4^j in the theorem is arbitrary in general and could be replaced by other numbers bigger than 1. See Problem 2.2.18.)
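As a quick check, added here for convenience (the computation is elementary and not part of the original text), the cardinality 2^{2^j} matches this Orlicz function exactly:

```latex
% For psi(x) = 2^{x^alpha} - 1, evaluated at x = 2^{j/alpha}:
\[
\psi\bigl(2^{j/\alpha}\bigr) + 1 = 2^{\,(2^{j/\alpha})^{\alpha}} = 2^{\,2^{j}},
\]
```

so a nested sequence of partitions into at most 2^{2^j} sets is precisely a ψ-admissible sequence for this ψ.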
A zero-mean Gaussian process X satisfies the increment bound ‖X_s − X_t‖_ψ ≲ d(s, t) for ψ(x) = e^{x²} − 1 and d the standard deviation metric.
The relevant generic chaining number is γ2 (T, d). In this case the generic
chaining upper bound is known to be sharp up to multiplicative constants.

2.2.14 Proposition. There exist universal constants c, C > 0 such that, for any zero-mean, separable Gaussian process X,
\[
c\, \gamma_2(T, d) \le \mathrm{E} \sup_{s,t\in T} |X_s - X_t| \le C\, \gamma_2(T, d).
\]

For a proof of the proposition, see Talagrand (2014), Chapter 2. The


proposition underlines the superiority of generic chaining bounds over en-
tropy bounds, as relatively simple examples of Gaussian processes show
that the entropy bounds can be too large. (For instance, see Talagrand (2014), pages 51–55, for the case of linear Gaussian processes indexed by ellipsoids.)
Majorizing measures offer a different approach that also gives sharp
bounds in certain cases. A majorizing measure µ is any probability measure
on T for which the integral in the following theorem is finite, which then
gives a nontrivial bound on the supremum of the stochastic process X.
Let Bd (t, ε) be the closed ball of radius ε around t relative to the
semimetric d.

2.2.15 Theorem. Let φ and ψ be convex, nondecreasing, nonzero functions with φ(0) = ψ(0) = 0 and φ(y/x) ≤ ψ(y)/ψ(x), for every y ≥ x ≥ 1. Let {X_t: t ∈ T} be a separable stochastic process with
\[
\|X_s - X_t\|_{\psi} \le C\, d(s, t), \qquad \text{for every } s, t,
\]
for some semimetric d on T and a constant C. Then, for any probability measure μ on T,
\[
\Bigl\| \sup_{s,t\in T} |X_s - X_t| \Bigr\|_{\varphi} \le K \sup_{t\in T} \int_0^{D(t,T)} \psi^{-1}\Bigl(\frac{1}{\mu\bigl(B_d(t, \varepsilon)\bigr)}\Bigr)\, d\varepsilon,
\]
for a constant K depending on ψ and C only and D(t, T) = sup_{s∈T} d(t, s).

Proof. Given an arbitrary majorizing measure, we construct an admissi-


ble sequence T0 , T1 , T2 , . . . of subsets of T whose generic chaining bounds
are smaller than the majorizing measure integral, up to a constant. The
theorem is then a consequence of Theorem 2.2.13. For a direct proof based
on chaining, see for instance Ledoux and Talagrand (1991), or the proof of
Theorem 2.11.11.
For a given probability measure μ, and for j ≥ 1, define
\[
r_j(t) = \inf\Bigl\{\varepsilon > 0:\ \mu\bigl(B_d(t, \varepsilon)\bigr) \ge \frac{1}{\psi(4^j)}\Bigr\}.
\]
Next for every fixed j ≥ 1 construct a set T_j ⊂ T recursively, by first setting t_1 = argmin_{t∈T} r_j(t), and next, given t_1, . . . , t_{k−1}, setting
\[
t_k = \operatorname*{argmin}_{t \in T \setminus \cup_{l<k} A_j(t_l)} r_j(t),
\]
where the sets A_j(t_l) are defined by A_j(t_l) = {t ∈ T: d(t, t_l)/2 < r_j(t) + r_j(t_l)}, for l ≥ 1, and the recursions continue until the sets A_j(t_l) have exhausted T. By construction d(t_k, t_l)/2 ≥ r_j(t_k) + r_j(t_l), for k > l, whence the balls B_d(t_l, r_j(t_l)) for different t_l ∈ T_j are disjoint (easily), and hence 1 = μ(T) ≥ Σ_l μ(B_d(t_l, r_j(t_l))) ≥ |T_j|/ψ(4^j). This shows that the recursive procedure ends after finitely many steps and yields a set of cardinality |T_j| ≤ ψ(4^j). By construction d(t, t_l) < 2r_j(t) + 2r_j(t_l) ≤ 4r_j(t) if t_l is the first element of T_j with t ∈ A_j(t_l), whence d(t, T_j) ≤ 4r_j(t), for every t ∈ T.
If we set r_0(t) = D(t, T), then this is also true for j = 0 and an arbitrary one-point set T_0. Since μ(B_d(t, ε)) < 1/ψ(4^j) for ε ∈ (r_{j+1}(t), r_j(t)), by the definition of r_j(t), and hence ψ^{-1}(1/μ(B_d(t, ε))) ≥ 4^j, we have
\[
\int_0^{D(t,T)} \psi^{-1}\Bigl(\frac{1}{\mu\bigl(B_d(t, \varepsilon)\bigr)}\Bigr)\, d\varepsilon
\ge \sum_{j=0}^{\infty} 4^j \bigl(r_j(t) - r_{j+1}(t)\bigr)
\ge \frac{3}{4} \sum_{j=0}^{\infty} r_j(t)\,4^j
\ge \frac{3}{4} \sum_{j=0}^{\infty} d(t, T_j)\,4^{j-1}.
\]

This concludes the proof that the majorizing measure integral dominates
the first generic chaining number.
Next consider the second generic chaining number of the sequence T_0, T_1, . . .. We claim that r_{j+1}(t) ≤ 2r_{j+1}(s), for any t ∈ T_{j+1} and s ∈ T with d(s, t) ≤ r_{j+1}(t). Indeed, let t_l be the first element of T_{j+1} such that s ∈ A_{j+1}(t_l). If t_l appeared after t, then r_{j+1}(t) ≤ r_{j+1}(t_l) ≤ r_{j+1}(s), by the definitions of t and t_l as minimizers of r_{j+1}. On the other hand, if t_l appeared before t in the recursions, then t ∉ A_{j+1}(t_l) as t was later selected for T_{j+1}, and hence
\[
r_{j+1}(t) + r_{j+1}(t_l) \le \frac{d(t, t_l)}{2} \le \frac{d(t, s)}{2} + \frac{d(s, t_l)}{2} < \frac{r_{j+1}(t)}{2} + r_{j+1}(s) + r_{j+1}(t_l),
\]
which implies the claim r_{j+1}(t) ≤ 2r_{j+1}(s), after rearranging.
Since B_d(t, r_j(s) + d(s, t)) ⊃ B_d(s, r_j(s)) by the triangle inequality, the definitions of r_j(s) and r_j(t) give that r_j(t) ≤ r_j(s) + d(s, t). Combining this with the result of the preceding paragraph, we conclude that r_j(t) ≤ r_j(s) + 2r_{j+1}(s), whenever t ∈ T_{j+1} and d(s, t) ≤ r_{j+1}(t). Consequently, for every t ∈ T_{j+1},
\[
\frac{d(t, T_j)}{\psi(4^{j+1})} \le 4 r_j(t)\, \mu\bigl(B_d(t, r_{j+1}(t))\bigr) \le 4 \int_{B_d(t, r_{j+1}(t))} \bigl(r_j(s) + 2r_{j+1}(s)\bigr)\, d\mu(s).
\]


Since the balls B_d(t, r_{j+1}(t)), for t ∈ T_{j+1}, are disjoint, the sum of the left side over t ∈ T_{j+1} is bounded above by 4 ∫_T (r_j(s) + 2r_{j+1}(s)) dμ(s). Next multiply by 4^j and sum over j to find that
\[
\sum_{j=0}^{\infty} \frac{4^j}{\psi(4^{j+1})} \sum_{t\in T_{j+1}} d(t, T_j) \le 4 \int_T \sum_{j=0}^{\infty} \bigl(r_j(s) + 2r_{j+1}(s)\bigr)\,4^j\, d\mu(s).
\]

The integrand is bounded above by (3/2) Σ_{j=0}^{∞} r_j(s) 4^j, which was already
seen to be bounded above by the majorizing measure integral at the end of
the first part of the proof.

Compared to the entropy integral appearing in Theorem 2.2.4, the


majorizing measure integral in the preceding theorem replaces the packing
numbers D(ε, T, d) by the inverses of the probabilities of the balls Bd (t, ε).
The latter are local to the point t, which is compensated by taking the
supremum over t ∈ T , outside the integral, much as in the generic chaining
bound. It is this local nature that gives the majorizing measures more flex-
ibility in measuring the complexity of the index set T . Indeed, the infimum
of the majorizing measure integrals over all probability measures is smaller
than a multiple of the entropy integral. In particular, for the exponential
Young functions the majorizing measure integral of a convex combination
of uniform measures on a sequence of ε-nets is comparable to the entropy
integral.
In the case of the exponential Young functions ψ(x) = e^{x^α} − 1, the
infimum of the majorizing measure integrals over all majorizing measures
can be shown to be equivalent to γα (T, d) up to multiplicative constants
(see Talagrand (2001)). In particular, both generic chaining and majorizing
measures give sharp bounds for suprema of Gaussian processes.
The equivalence of generic chaining numbers and majorizing measures
integrals extends to more general Young functions that satisfy a mild regu-
larity condition. Furthermore, for metrics d that are p-concave, i.e. metrics
such that dp is a metric, for some p > 1, these numbers also give sharp
bounds on expected suprema of stochastic processes. Consider the quantities
\[
S(T, d, \psi) = \sup_{X} \mathrm{E} \sup_{s,t} |X_s - X_t|,
\]
\[
\bar M(T, d, \psi) = \inf_{\mu} \int \int_0^{D(t,T)} \psi^{-1}\Bigl(\frac{1}{\mu\bigl(B_d(t, \varepsilon)\bigr)}\Bigr)\, d\varepsilon\, d\mu(t),
\]
\[
M(T, d, \psi) = \inf_{\mu} \sup_{t\in T} \int_0^{D(t,T)} \psi^{-1}\Bigl(\frac{1}{\mu\bigl(B_d(t, \varepsilon)\bigr)}\Bigr)\, d\varepsilon,
\]
\[
A(T, d, \psi) = \inf_{(T_j)} \Bigl[\, \sup_{t\in T} \sum_{j=0}^{\infty} d(t, T_j)\,4^j + \sum_{j=0}^{\infty} \frac{1}{|T_{j+1}|} \sum_{t\in T_{j+1}} d(t, T_j)\,4^j \Bigr].
\]

Here the supremum in the definition of S(T, d, ψ) is taken over all separable stochastic processes X that satisfy the increment bound ‖X_s − X_t‖_ψ ≤ d(s, t), the infima in M̄(T, d, ψ) and M(T, d, ψ) are taken over all probability measures μ on T, and the infimum in A(T, d, ψ) is taken over all ψ-admissible sequences T_0, T_1, . . . of subsets of T.
The first of these quantities may be smaller than the other three. How-
ever, the following theorem shows that the metric d may be replaced by a
larger metric, giving a larger class of processes X in the definition of S and
hence increasing S, so that all four quantities become equivalent, and are
also equivalent to the last three quantities for the original metric d.

2.2.16 Proposition. Let (T, d) be a compact metric space, and let ψ


be a convex, nondecreasing, nonzero function with ψ(0) = 0 satisfying
ψ(2x) ≥ 2Cψ(x), for every x ≥ 0 and some constant C > 1. If A(T, d, ψ)
is finite, then there exists a metric ρ on T with d ≤ ρ, such that all of
S(T, ρ, ψ), M̄ (T, ρ, ψ), M (T, ρ, ψ), A(T, ρ, ψ), M (T, d, ψ), M̄ (T, d, ψ) and
A(T, d, ψ) are equivalent up to multiplicative constants. Furthermore, if
dp is a metric for some p > 1, then these quantities are also equivalent to
S(T, d, ψ). Finally, for the exponential ψ functions, these assertions are true
with A(T, d, ψ) defined as equal to only the first generic chaining series.

Proof. The following three inequalities are true for any metric d: S(T, d, ψ) ≲ A(T, d, ψ) (by Theorem 2.2.13); A(T, d, ψ) ≲ M(T, d, ψ) (by Theorem 2.2.15); and M(T, d, ψ) ≤ 2M̄(T, d, ψ) (by a convexity argument, see Problem 2.2.19). Furthermore, Bednorz (2010) shows that M̄(T, d, ψ) ≲ S(T, d, ψ), for any p-concave metric d. It follows that the four quantities are equivalent up to multiplicative constants when d is p-concave.
The condition on ψ implies (1.2) in Bednorz (2012), who constructs an ultra-metric ρ ≥ d on T for which A(T, ρ, ψ) ≲ A(T, d, ψ). (By definition, an ultra-metric satisfies d(s, t) ≤ d(s, u) ∨ d(t, u), for every s, t, u; in particular, it is p-concave for every p > 1.) By the preceding paragraph the four quantities with d replaced by ρ are then equivalent. Since ρ ≥ d, the four quantities with d are bounded above by the corresponding quantity with ρ. Then A(T, ρ, ψ) and A(T, d, ψ) are equivalent, and equivalent to all quantities with ρ. Furthermore, the quantities A(T, d, ψ) ≲ M(T, d, ψ) ≤ 2M̄(T, d, ψ) are bounded above by the corresponding quantities with ρ instead of d, which are equivalent to A(T, ρ, ψ). This concludes the proof of the first assertion of the proposition.
The final assertion follows by the same arguments.

Problems and Complements


1. A standard normal variable X possesses norm ‖X‖_{ψ₂} = √(8/3) for ψ₂(x) = exp(x²) − 1.
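A worked derivation of this constant, added here as a sketch (it is not part of the original problem list), uses the explicit Gaussian integral.

```latex
% For X standard normal and C^2 > 2 the Gaussian integral gives
\[
\mathrm{E}\, e^{X^2/C^2}
  = \int_{-\infty}^{\infty} e^{x^2/C^2}\,\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx
  = \Bigl(1 - \tfrac{2}{C^2}\Bigr)^{-1/2},
\]
% so the Orlicz condition E psi_2(|X|/C) <= 1 becomes
\[
\Bigl(1 - \tfrac{2}{C^2}\Bigr)^{-1/2} - 1 \le 1
\iff 1 - \tfrac{2}{C^2} \ge \tfrac14
\iff C^2 \ge \tfrac83,
\]
% and the smallest admissible constant is ||X||_{psi_2} = sqrt(8/3).
```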
2. The constant random variable X = 1 possesses norm ‖X‖_{ψ_p} = (log 2)^{−1/p} for ψ_p(x) = e^{x^p} − 1.
3. For any constant C ≥ 1, convex, increasing ψ, and random variable X, one has ‖X‖_{Cψ} ≤ C‖X‖_ψ. For C ≤ 1 the reverse inequality holds.
[Hint: By convexity of ψ, one has Eψ(|X|/(C‖X‖_ψ)) ≤ 1/C, for C ≥ 1.]

4. (Comparing L_p- and ψ_α-norms) For any α ≥ 1 there exist constants 0 < c_α < C_α < ∞ such that, for any random variable X,
\[
c_{\alpha} \|X\|_{\psi_\alpha} \le \sup_{p\ge 1} \frac{\|X\|_p}{p^{1/\alpha}} \le C_{\alpha} \|X\|_{\psi_\alpha}.
\]
In particular, ‖X‖₁ ≤ ‖X‖₂ ≤ ‖X‖_{ψ₂}.
[Hint: For the lower bound set D = sup_{p≥1} ‖X‖_p/p^{1/α}, so that E|X|^{αj} ≤ D^{αj}(αj)^j, for j ≥ 1. Expand the exponential to conclude that Eψ_α(X/C) ≤ Σ_{j=1}^∞ (D/C)^{αj}(αj)^j/j!. Further bound the sum by using Stirling's formula j^j/j! ≤ e^j/(2πj)^{1/2}, and reduce it to a number smaller than 1 by choosing C a sufficiently large multiple of D.
For the upper bound set D = ‖X‖_{ψ_α} and use Markov's inequality to see that P(|X| > x) ≤ 2e^{−(x/D)^α}, for x ≥ x_α := D(log 2)^{1/α}. Conclude that
\[
\mathrm{E}|X|^p \le x_{\alpha}^p + \int_{x_\alpha}^{\infty} p x^{p-1}\, 2 e^{-(x/D)^{\alpha}}\, dx \le x_{\alpha}^p + 2 D^p\, \Gamma(p/\alpha + 1),
\]
by a change of variables. The second term on the far right is bounded above by a multiple of D^p (p/(αe))^{p/α+1/2}, by Stirling's formula. Conclude that D^{−1}‖X‖_p/p^{1/α} ≲ ((log 2)/p)^{1/α} + (p/α)^{1/(2p)}(αe)^{−1/α}. The right side is decreasing in p, for fixed α, and hence is bounded in p ≥ 1.]

5. (Comparing ψ_p-norms) For any X, the numbers (log 2)^{1/p}‖X‖_{ψ_p} for ψ_p(x) = exp(x^p) − 1 are nondecreasing in p ≥ 1.
[Hint: The function φ for which ψ_p(x(log 2)^{1/p}) = φ(ψ_q((log 2)^{1/q} x)) is concave and satisfies φ(1) = 1 for q ≥ p. Use Jensen's inequality.]

6. (Monotone convergence for Orlicz norms) Let ψ be a convex, nondecreasing, nonzero function on [0, ∞) with ψ(0) = 0. If 0 ≤ X_n ↑ X almost surely, then ‖X_n‖_ψ ↑ ‖X‖_ψ.
[Hint: By the monotone convergence theorem, lim Eψ(X_n/(r‖X‖_ψ)) > 1, for any r < 1.]

7. The infimum in the definition of an Orlicz norm is attained (at ‖X‖_ψ).


8. Let ψ be any function as in the definition of an Orlicz norm. Then for any random variables X₁, . . . , X_m, one has E max |X_i| ≤ ψ^{−1}(m) max ‖X_i‖_ψ.
[Hint: For any C, one has E max |X_i|/C ≤ ψ^{−1}(E max ψ(|X_i|/C)) by Jensen's inequality. Bound the maximum by a sum and take C = max ‖X_i‖_ψ.]
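The hint can be written out explicitly (a short derivation added here; it is not part of the original problem):

```latex
% With C = max_i ||X_i||_psi, each E psi(|X_i|/C) <= 1. Since psi is
% convex and increasing with psi(0) = 0, the inverse psi^{-1} is
% concave, and Jensen's inequality gives
\[
\mathrm{E}\max_{i\le m}\frac{|X_i|}{C}
  = \mathrm{E}\,\psi^{-1}\Bigl(\max_{i\le m}\psi\bigl(|X_i|/C\bigr)\Bigr)
  \le \psi^{-1}\Bigl(\mathrm{E}\max_{i\le m}\psi\bigl(|X_i|/C\bigr)\Bigr)
  \le \psi^{-1}\Bigl(\sum_{i=1}^{m}\mathrm{E}\,\psi\bigl(|X_i|/C\bigr)\Bigr)
  \le \psi^{-1}(m).
\]
```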

9. If ‖X‖_ψ < ∞, then ‖X‖_{σψ} < ∞ for every σ > 0. Markov's inequality yields the bound P(|X| > t) ≤ (1/σ)·1/ψ(t/‖X‖_{σψ}) for every t. Is there an optimal value of σ?
10. The condition on the quotient ψ(x)ψ(y)/ψ(cxy) in Lemma 2.2.2 is not automatic. For instance, the function ψ(x) = (x + 1)(log(x + 1) − 1) + 1 is convex and increasing for x ≥ 0 but does not satisfy the condition.
11. The function ψ_p(x) = exp(x^p) − 1 satisfies the hypothesis of Lemma 2.2.2 for every 0 < p ≤ 2.
12. Let ψ be a convex, positive function, with ψ(0) = 0, such that log ψ is convex. Then there is a constant σ such that φ(x) = σψ(x) satisfies both φ(1) < 1 and φ(x)φ(y) ≤ φ(xy), for all x, y ≥ 1.
[Hint: Define φ(x) = ψ(x)/(2ψ(1)) for x ≥ 1. Then φ(1) = 1/2 and g(x, y) = φ(x)φ(y)/φ(xy) satisfies g(1, 1) = 1/2 and is decreasing in both x and y for x, y ≥ 1 if log ψ is convex. The derivative of g with respect to x is given by (ψ(x)ψ(y)/(2ψ(1)ψ(xy)))(ψ′(x)/ψ(x) − yψ′(xy)/ψ(xy)), which is less than or equal to 0 since y ≥ 1 and log ψ is convex. By symmetry, the derivative with respect to y is also bounded above by zero.]

13. The functions ψ_p, with 1 ≤ p ≤ 2, and more generally the functions ψ(x) = exp(h(x)), with h(x) convex (for instance, h(x) = x(log(x) − 1) + 1), satisfy the conditions in the preceding problem. The functions ψ_p, for 0 < p ≤ 1, do not satisfy these conditions.
 
14. Suppose that P(|X_s − X_t| > x) ≤ K exp(−Cx²/d²(s, t)) for a given stochastic process and certain positive constants K and C. Then the process is sub-Gaussian for a multiple of the distance d.
[Hint: There exists a constant D depending only on K and C such that 1 ∧ Ke^{−Cx²} is bounded above by 2e^{−Dx²}, for every x ≥ 0.]

15. Let T be a compact, semimetric space and q > 0. Then the map x ↦ x^q from the nonnegative functions in ℓ^∞(T) to ℓ^∞(T) is continuous at every continuous x.
[Hint: It suffices to show that ‖x_n − x‖_∞ → 0 implies that x_n^q(t_n) − x^q(t_n) → 0 for every sequence t_n. By compactness of T, the sequence t_n may be assumed convergent. Since x is continuous, x(t_n) → x(t). Thus x_n(t_n) → x(t), whence x_n^q(t_n) → x^q(t).]

16. Suppose the metric space T has a finite diameter and log N (ε, T, d) ≤ h(ε),
for all sufficiently small ε > 0 and a fixed, positive, continuous function h.
Then there exists a constant K with log N (ε, T, d) ≤ Kh(ε) for all ε > 0.
[Hint: For sufficiently large ε, the entropy numbers are zero, and for
sufficiently small ε, the inequality is valid with K = 1. The function
log N (ε, T, d)/h(ε) is bounded on intervals that are bounded away from zero
and infinity.]
17. Let X be a stochastic process with sup_{ρ(s,t)<δ} |X(s) − X(t)| → 0 in outer probability as δ ↓ 0. Then almost all sample paths of X are uniformly continuous.
[Hint: Given a decreasing sequence of numbers ε_k → 0, there are numbers δ_k > 0 such that P*(sup_{ρ(s,t)<δ_k} |X(s) − X(t)| > ε_k) ≤ 2^{−k} for every k. By the Borel–Cantelli lemma, |X(s) − X(t)| ≤ ε_k whenever ρ(s, t) < δ_k, for all sufficiently large k, almost surely.]

18. For every sequence T_0, T_1, . . . of subsets with |T_0| = 1 and |T_j| ≤ ψ(4^j), for j ≥ 1, there exists a sequence T_0′, T_1′, . . . of subsets with |T_0′| = 1 and |T_j′| ≤ ψ(2^{j/2}), for j ≥ 1, such that sup_t Σ_{j=0}^∞ d(t, T_j′)2^{j/2} ≤ 4 sup_t Σ_{j=0}^∞ d(t, T_j)4^j. Conversely, for every sequence T_0′, T_1′, . . . of subsets with |T_0′| = 1 and |T_j′| ≤ ψ(2^{j/2}), for j ≥ 1, there exists a sequence T_0, T_1, . . . of subsets with |T_0| = 1 and |T_j| ≤ ψ(4^j), for j ≥ 1, such that sup_t Σ_{j=0}^∞ d(t, T_j)4^j ≤ sup_t Σ_{j=0}^∞ d(t, T_j′)2^{j/2}.
[Hint: For the first take T′_{4j} = T′_{4j+1} = T′_{4j+2} = T′_{4j+3} = T_j. For the second take T_j = T′_{4j}.]

19. For any compact metric space (T, d) and Young function ψ, we have M(T, d, ψ) ≤ 2M̄(T, d, ψ). (See Section 2.2.4 for this notation.)
[Hint: Suppose first that T is finite and that the map x ↦ ψ^{−1}(1/x) is convex. Suppose M̄ := M̄(T, d, ψ) < ∞ and set f_μ(t) = ∫ ψ^{−1}(1/μ(B_d(t, ε))) dε. The set F of functions f: T → ℝ such that f ≥ f_μ, pointwise, for some μ, can be identified with a closed, convex subset of ℝ^T. If the constant function M̄(T, ψ, d) were not contained in F, then the separating hyperplane theorem would give a vector ν ≠ 0 such that ν^T M̄ < ν^T f, for every f ∈ F. Since f + u1_t ∈ F, for every u ≥ 0 if f ∈ F, the vector ν cannot have negative coordinates, and hence may be assumed to be a probability vector. Then M̄ < ν^T f_ν, which contradicts the definition of M̄(T, ψ, d). Thus M̄ ∈ F, which implies that M̄ ≥ f_μ, for some μ. If T is not finite, then this argument gives for every finite S ⊂ T a measure μ_S with M̄ ≥ f_{μ_S}(t), for every t ∈ S. By Prohorov's theorem the net μ_S has a weak limit point μ and by the Portmanteau theorem this satisfies f_μ(t) ≤ lim inf_S f_{μ_S}(t), for every t ∈ T. Finally, if x ↦ ψ^{−1}(1/x) is not convex, then there exists a convex function ξ with ξ(x) ≤ ψ^{−1}(1/x) ≤ 2ξ(x).]

20. If X₁, . . . , X_n are independent standard normal variables, then E max_{1≤i≤n} |X_i|/√(2 log n) → 1, as n → ∞. Furthermore,
\[
(\pi \log 2)^{-1/2} \sqrt{\log n} \le \mathrm{E} \max_{1\le i\le n} X_i \le \mathrm{E} \max_{1\le i\le n} |X_i| \le \sqrt{2 \log(2n)}.
\]
[Hint: Lemma 2.2.2 gives the slightly worse upper bound √((8/3) log(n + 1)), which is valid without independence. This may be improved by exact calculation using the assumed independence. The asymptotic result follows also by direct calculation. The lower bound with its exact constant (which is sharp at n = 2) can be found in Fernique (1997); see the proof of Corollary A.2.6.]
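As a numerical sketch (added here; not part of the original problem set), the two-sided bound and the slow convergence of the ratio can be illustrated by simulation. The sample size n, seed, and number of replications are arbitrary choices.

```python
import math
import random

# Monte Carlo illustration of the bounds
#   (pi log 2)^(-1/2) sqrt(log n)  <=  E max X_i  <=  E max |X_i|  <=  sqrt(2 log(2n))
# for independent standard normal X_1, ..., X_n, and of the ratio
# E max |X_i| / sqrt(2 log n), which tends to 1 only slowly as n grows.
random.seed(0)
n, reps = 1000, 500

sum_max = 0.0      # accumulates max_i X_i over replications
sum_absmax = 0.0   # accumulates max_i |X_i| over replications
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    sum_max += max(xs)
    sum_absmax += max(abs(x) for x in xs)

e_max = sum_max / reps        # estimate of E max X_i
e_absmax = sum_absmax / reps  # estimate of E max |X_i|

lower = math.sqrt(math.log(n) / (math.pi * math.log(2)))
upper = math.sqrt(2 * math.log(2 * n))
print(lower, e_max, e_absmax, upper)
print(e_absmax / math.sqrt(2 * math.log(n)))
```

At n = 1000 the estimated ratio is still visibly below 1, consistent with the asymptotic statement being a limit rather than a finite-sample identity.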
2.3
Symmetrization and Measurability

One of the two main approaches toward deriving Glivenko-Cantelli and


Donsker theorems is based on the principle of comparing the empirical pro-
cess to a “symmetrized” empirical process. In this chapter we derive the
main symmetrization theorem, as well as a number of technical comple-
ments, which may be skipped at first reading.

2.3.1 Symmetrization
Let ε₁, . . . , ε_n be i.i.d. Rademacher random variables. Instead of the empirical process
\[
f \mapsto (P_n - P)f = \frac{1}{n} \sum_{i=1}^{n} \bigl(f(X_i) - Pf\bigr),
\]
consider the symmetrized process
\[
f \mapsto P_n^o f = \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i),
\]

where ε₁, . . . , ε_n are independent of (X₁, . . . , X_n). Both processes have mean function zero (because E(ε_i f(X_i) | X_i) = 0 by symmetry of ε_i). It turns out that the law of large numbers or the central limit theorem for one of these processes holds if and only if the corresponding result is true for the other process. One main approach to proving empirical limit theorems is to pass from P_n − P to P_n^o and next apply arguments conditionally on the original X's. The idea is that, for fixed X₁, . . . , X_n, the symmetrized

empirical measure is a Rademacher process, hence a sub-Gaussian process,


to which Corollary 2.2.8 can be applied.
Thus we need to bound maxima and moduli of the process P_n − P by those of the symmetrized process. To formulate such bounds, we must be careful about the possible nonmeasurability of suprema of the type ‖P_n − P‖_F. The result will be formulated in terms of outer expectation,
but it does not hold for every choice of an underlying probability space on
which X1 , . . . , Xn are defined. Throughout this part, if outer expectations
are involved, it is assumed that X1 , . . . , Xn are the coordinate projections
on the product space (X n , An , P n ), and the outer expectations of functions
(X1 , . . . , Xn ) 7→ h(X1 , . . . , Xn ) are computed for P n . Thus “independent”
is understood in terms of a product probability space. If auxiliary variables,
independent of the X’s, are involved, as in the next lemma, we use a similar
convention. In that case, the underlying probability space is assumed to be
of the form (X n , An , P n ) × (Z, C, Q) with X1 , . . . , Xn equal to the coor-
dinate projections on the first n coordinates and the additional variables
depending only on the (n + 1)st coordinate.
The following lemma will be used mostly with the choice Φ(x) = x.

2.3.1 Lemma (Symmetrization). For every nondecreasing, convex Φ: ℝ → ℝ and class of measurable functions F,
\[
\mathrm{E}^* \Phi\bigl(\|P_n - P\|_{\mathcal F}\bigr) \le \mathrm{E}^* \Phi\bigl(2 \|P_n^o\|_{\mathcal F}\bigr),
\]
where the outer expectations are computed as indicated in the preceding paragraph.

Proof. Let Y₁, . . . , Y_n be independent copies of X₁, . . . , X_n, defined formally as the coordinate projections on the last n coordinates in the product space (X^n, A^n, P^n) × (Z, C, Q) × (X^n, A^n, P^n). The outer expectations in the statement of the lemma are unaffected by this enlargement of the underlying probability space, because coordinate projections are perfect maps.
For fixed values X₁, . . . , X_n,
\[
\|P_n - P\|_{\mathcal F} = \sup_{f\in\mathcal F} \Bigl| \frac{1}{n} \sum_{i=1}^{n} \bigl(f(X_i) - \mathrm{E} f(Y_i)\bigr) \Bigr| \le \mathrm{E}_Y^* \sup_{f\in\mathcal F} \Bigl| \frac{1}{n} \sum_{i=1}^{n} \bigl(f(X_i) - f(Y_i)\bigr) \Bigr|,
\]
where E*_Y is the outer expectation with respect to Y₁, . . . , Y_n computed for P^n for given, fixed values of X₁, . . . , X_n. Combination with Jensen's inequality yields
\[
\Phi\bigl(\|P_n - P\|_{\mathcal F}\bigr) \le \mathrm{E}_Y \Phi\Bigl( \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \bigl(f(X_i) - f(Y_i)\bigr) \Bigr\|_{\mathcal F}^{*Y} \Bigr),
\]
where *Y denotes the minimal measurable majorant of the supremum with respect to Y₁, . . . , Y_n, still with X₁, . . . , X_n fixed. Because Φ is nondecreasing and continuous, the *Y inside Φ can be moved to E*_Y (Problem 1.2.8).

Next take the expectation with respect to X₁, . . . , X_n to get
\[
\mathrm{E}^* \Phi\bigl(\|P_n - P\|_{\mathcal F}\bigr) \le \mathrm{E}_X^* \mathrm{E}_Y^* \Phi\Bigl( \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \bigl(f(X_i) - f(Y_i)\bigr) \Bigr\|_{\mathcal F} \Bigr).
\]

Here the repeated outer expectation can be bounded above by the joint outer expectation E* by Lemma 1.2.6. Adding a minus sign in front of a term (f(X_i) − f(Y_i)) has the effect of exchanging X_i and Y_i. By construction of the underlying probability space as a product space, the outer expectation of any function f(X₁, . . . , X_n, Y₁, . . . , Y_n) remains unchanged under permutations of its 2n
arguments. Hence the expression
\[
\mathrm{E}^* \Phi\Bigl( \Bigl\| \frac{1}{n} \sum_{i=1}^{n} e_i \bigl(f(X_i) - f(Y_i)\bigr) \Bigr\|_{\mathcal F} \Bigr)
\]
is the same for any n-tuple (e₁, . . . , e_n) ∈ {−1, 1}^n. Deduce that
\[
\mathrm{E}^* \Phi\bigl(\|P_n - P\|_{\mathcal F}\bigr) \le \mathrm{E}_\varepsilon \mathrm{E}_{X,Y}^* \Phi\Bigl( \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i \bigl(f(X_i) - f(Y_i)\bigr) \Bigr\|_{\mathcal F} \Bigr).
\]

Use the triangle inequality to separate the contributions of the X's and the Y's and next use the convexity of Φ to bound the previous expression by
\[
\tfrac{1}{2} \mathrm{E}_\varepsilon \mathrm{E}_{X,Y}^* \Phi\Bigl( 2 \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(X_i) \Bigr\|_{\mathcal F} \Bigr) + \tfrac{1}{2} \mathrm{E}_\varepsilon \mathrm{E}_{X,Y}^* \Phi\Bigl( 2 \Bigl\| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(Y_i) \Bigr\|_{\mathcal F} \Bigr).
\]

By perfectness of coordinate projections, the expectation E*_{X,Y} is the same as E*_X and E*_Y in the two terms, respectively. Finally, replace the repeated outer expectations by a joint outer expectation.

The symmetrization lemma is valid for any class F. In the proofs of


Glivenko-Cantelli and Donsker theorems, it will be applied not only to the
original set of functions of interest, but also to several classes constructed
from such a set F (such as the class Fδ of small differences). The next
step in these proofs is to apply a maximal inequality to the right side of
the lemma, conditionally on X1 , . . . , Xn . At that point we need to write the
joint outer expectation as the repeated expectation E∗X Eε , where the indices
X and ε mean expectation over X and ε, respectively, conditionally on the
remaining variables. Unfortunately, Fubini’s theorem is not valid for outer
expectations. To overcome this problem, it is assumed that the integrand in the right side of the lemma is jointly measurable in (X₁, . . . , X_n, ε₁, . . . , ε_n). Since the Rademacher variables are discrete, this is the case if and only if the maps
\[
(2.3.2) \qquad (X_1, \ldots, X_n) \mapsto \Bigl\| \sum_{i=1}^{n} e_i f(X_i) \Bigr\|_{\mathcal F}
\]

are measurable for every n-tuple (e1 , . . . , en ) ∈ {−1, 1}n . For the intended
application of Fubini’s theorem, it suffices that this is the case for the
completion of (X n , An , P n ).

2.3.3 Definition (Measurable class). A class F of measurable functions f: X → ℝ on a probability space (X, A, P) is called a P-measurable class if the function (2.3.2) is measurable on the completion of (X^n, A^n, P^n) for every n and every vector (e₁, . . . , e_n) ∈ ℝ^n.

In the following, one cannot dispense with measurability conditions of


some form. Examples show that the law of large numbers or the central limit
theorem may fail only because of the violation of measurability conditions
such as the one just introduced. On the other hand, the measurability
conditions in the present form are not necessary, but they are suggested
by the methods of proof. A trivial, but nevertheless useful, device to relax
measurability requirements a little is to replace the class F by a class G for which measurability can be checked. If ‖P_n − P‖_F = ‖P_n − P‖_G inner almost surely, then a law of large numbers for G clearly implies one for F. Similarly, if ‖P_n − P‖_{F_δ} = ‖P_n − P‖_{G_δ} inner almost surely for every δ > 0, then asymptotic equicontinuity of the F-indexed empirical process follows
from the same property for G. Furthermore, if in both cases G is chosen
to be a subclass of F, then the other conditions for the uniform law or
the central limit theorem typically carry over from F onto G. Rather than
formalizing this principle into a concept such as “nearly measurable class,”
we will state the conditions directly in terms of F.

2.3.4 Example (Pointwise measurable classes). Suppose F contains a countable subset G such that for every f ∈ F there exists a sequence g_m in G with g_m(x) → f(x) for every x. Then F is P-measurable for every P. This claim is immediate from the fact that ‖Σ e_i f(X_i)‖_F equals ‖Σ e_i f(X_i)‖_G.
Some examples of this situation are the collection of indicators of cells in Euclidean space, the collection of indicators of balls, and collections of functions that are separable for the supremum norm.
In Section 2.3.3 it will be seen that every separable collection F ⊂
L2 (P ) has a “version” that is “almost” pointwise separable. (The name
pointwise separable class will be reserved for these versions.)

2.3.5 Example (Measurable Suslin). Suppose F is a Suslin topological


space (with respect to an arbitrary topology) and the map (x, f ) 7→ f (x)
is jointly measurable on X × F for the product σ-field of A and the Borel
σ-field. [Recall that every Borel subset (even analytic subset) of a Polish
space is Suslin for the relative topology.] Then F is P -measurable for every
probability measure P .

This follows from Example 1.7.5 upon noting that the conditions imply that the process (x₁, . . . , x_n, f) ↦ Σ_{i=1}^{n} e_i f(x_i) is jointly measurable on X^n × F.
This “measurable Suslin condition” replaces the original problem of
showing P -measurability of F by the problem of finding a suitable topology
on F for which the map (x, f ) 7→ f (x) is measurable. This is sometimes
easy – for instance, if the class F is indexed by a parameter in a nice manner
– but sometimes as hard as the original problem.

* 2.3.2 More Symmetrization


This section continues with additional results on symmetrization. These are
not needed for the main results of this part (Chapters 2.2, 2.4, and 2.5), and
the reader may wish to skip the remainder of this chapter at first reading.
For future reference it is convenient to generalize the notation. Instead
of the empirical distribution, consider sums Σⁿᵢ₌₁ Zᵢ of independent stochastic
processes {Zᵢ(f): f ∈ F}. Again the processes need not possess any further
measurability properties besides measurability of all marginals Zᵢ(f).
However, for computing outer expectations, it will be understood that the
underlying probability space is a product space Πⁿᵢ₌₁ (Xᵢ, Aᵢ, Pᵢ) × (Z, C, Q)
and each Zᵢ is a function of the ith coordinate of (x, z) = (x₁, . . . , xₙ, z)
only. For i.i.d. stochastic processes, it is understood in addition that the
spaces (Xᵢ, Aᵢ, Pᵢ) as well as the maps Zᵢ(f) defined on them are identical.
Additional (independent) Rademacher or other variables are understood to
be functions of the (n + 1)st coordinate z only. The empirical distribution
corresponds to taking Zᵢ(f) = f(Xᵢ) − Pf.
First consider "desymmetrization." The inequality of the preceding
lemma can only be reversed (up to constants) for classes F that are centered
(Problem 2.3.1). The following lemma shows that the maxima of the
processes Pn − P and n⁻¹ Σⁿᵢ₌₁ εᵢ(δXᵢ − P) are fully comparable.

2.3.6 Lemma. Let Z₁, . . . , Zₙ be independent stochastic processes with
mean zero. Then

E∗Φ( ½ ‖Σⁿᵢ₌₁ εᵢ Zᵢ‖_F ) ≤ E∗Φ( ‖Σⁿᵢ₌₁ Zᵢ‖_F ) ≤ E∗Φ( 2 ‖Σⁿᵢ₌₁ εᵢ(Zᵢ − μᵢ)‖_F ),

for every nondecreasing, convex Φ: R → R and arbitrary functions μᵢ: F → R.

Proof. The inequality on the right follows by similar arguments as in the


proof of Lemma 2.3.1. For the inequality on the left, let Y1 , . . . , Yn be an
independent copy of Z1 , . . . , Zn (suitably defined on the probability space

* This section may be skipped at first reading.


2.3 Symmetrization and Measurability 167
Πⁿᵢ₌₁ (Xᵢ, Aᵢ, Pᵢ) × (Z, C, Q) × Πⁿᵢ₌₁ (Xᵢ, Aᵢ, Pᵢ) and depending on the last n
coordinates exactly as Z₁, . . . , Zₙ depend on the first n coordinates). Since
EYᵢ(f) = 0, the left side of the lemma is an average of expressions of the
type

E∗_Z Φ( ½ ‖Σⁿᵢ₌₁ eᵢ(Zᵢ(f) − EYᵢ(f))‖_F ),

where (e₁, . . . , eₙ) ranges over {−1, 1}ⁿ (cf. Lemma 1.2.7). By Jensen's
inequality, this expression is bounded above by

E∗_{Z,Y} Φ( ½ ‖Σⁿᵢ₌₁ eᵢ(Zᵢ(f) − Yᵢ(f))‖_F ) = E∗_{Z,Y} Φ( ½ ‖Σⁿᵢ₌₁ (Zᵢ(f) − Yᵢ(f))‖_F ),

where the equality uses the symmetry of the variables Zᵢ − Yᵢ. Finally,
apply the triangle inequality and convexity of Φ.
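With Φ(x) = x, μᵢ = 0, and the empirical choice Zᵢ(f) = f(Xᵢ) − Pf, the two-sided comparison of Lemma 2.3.6 can be observed in simulation. The sketch below is only an illustration: the two-element class F, the uniform law P, and all constants are our own choices, and the three expectations are estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 4000

# A two-element class F on [0, 1]; P is the uniform distribution.
fs = [lambda x: x, lambda x: (x > 0.5).astype(float)]
Pf = np.array([0.5, 0.5])                  # exact means P f under the uniform law

left, mid, right = [], [], []
for _ in range(reps):
    X = rng.uniform(size=n)
    eps = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
    Z = np.stack([f(X) - m for f, m in zip(fs, Pf)])   # Z_i(f) = f(X_i) - Pf
    sym = np.abs((eps * Z).sum(axis=1)).max()          # ||sum eps_i Z_i||_F
    mid.append(np.abs(Z.sum(axis=1)).max())            # ||sum Z_i||_F
    left.append(0.5 * sym)
    right.append(2.0 * sym)                            # mu_i = 0 in the lemma

# E*Phi(1/2||sum eps Z||_F) <= E*Phi(||sum Z||_F) <= E*Phi(2||sum eps Z||_F)
assert np.mean(left) <= np.mean(mid) <= np.mean(right)
```

The outer estimates bracket the middle one with room to spare, since the constants ½ and 2 of the lemma are not sharp for this example.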

The most important choice for Φ in the preceding lemmas is Φ(x) = x.


The requirement that Φ be convex rules out Φ(x) = 1{x > a}. Nevertheless,
there is an analogous symmetrization inequality for probabilities.

2.3.7 Lemma (Symmetrization for probabilities). For arbitrary independent
stochastic processes Z₁, . . . , Zₙ and arbitrary functions μ₁, . . . , μₙ:
F → R,

βₙ(x) P∗( ‖Σⁿᵢ₌₁ Zᵢ‖_F > x ) ≤ 2P∗( 4 ‖Σⁿᵢ₌₁ εᵢ(Zᵢ − μᵢ)‖_F > x ),

for every x > 0 and βₙ(x) ≤ inf_f P( |Σⁿᵢ₌₁ Zᵢ(f)| < x/2 ). In particular,
this is true for i.i.d. mean-zero processes, and βₙ(x) = 1 −
(4n/x²) sup_f var Z₁(f). Here the outer expectations are computed as
indicated previously.

Proof. Let Y₁, . . . , Yₙ be an independent copy of Z₁, . . . , Zₙ, suitably defined
on a product space as previously. If ‖Σⁿᵢ₌₁ Zᵢ‖_F > x, then there is
certainly some f for which |Σⁿᵢ₌₁ Zᵢ(f)| > x. Fix a realization of Z₁, . . . , Zₙ
and an f for which both are the case. For this fixed realization,

β ≤ P∗_Y( |Σⁿᵢ₌₁ Yᵢ(f)| < x/2 ) ≤ P∗_Y( |Σⁿᵢ₌₁ Yᵢ(f) − Σⁿᵢ₌₁ Zᵢ(f)| > x/2 ) ≤ P∗_Y( ‖Σⁿᵢ₌₁ (Yᵢ − Zᵢ)‖_F > x/2 ).

The far left and far right sides do not depend on the particular f, and the
inequality between them is valid on the set { ‖Σⁿᵢ₌₁ Zᵢ‖_F > x }. Integrate
the two sides out with respect to Z₁, . . . , Zₙ over this set to obtain

β P∗( ‖Σⁿᵢ₌₁ Zᵢ‖_F > x ) ≤ P∗_Z P∗_Y( ‖Σⁿᵢ₌₁ (Yᵢ − Zᵢ)‖_F > x/2 ).

By symmetry, the right side equals

E_ε P∗_Z P∗_Y( ‖Σⁿᵢ₌₁ εᵢ(Yᵢ − Zᵢ)‖_F > x/2 ).

In view of the triangle inequality, this expression is not bigger than
2P∗( ‖Σⁿᵢ₌₁ εᵢ(Zᵢ − μᵢ)‖_F > x/4 ).

I.i.d. processes Z₁, . . . , Zₙ with mean zero satisfy the condition for the
given β in view of Chebyshev's inequality.

Since they only add in signs, Rademacher variables appear to yield an
efficient method of symmetrization. Nevertheless, in most arguments they
can be replaced by other variables; for instance, standard normal ones. Then
the form of the symmetrization inequalities may become more complicated.
A discussion is deferred to Chapter 2.9, where it is shown in general that
the sequences of processes n^{−1/2} Σⁿᵢ₌₁ ξᵢZᵢ and n^{−1/2} Σⁿᵢ₌₁ Zᵢ possess the
same asymptotic behavior for most choices of ξᵢ.
Inequalities in terms of the means of suprema of processes are easier to
handle than inequalities in terms of probabilities. However, so far the main
condition (2.1.8) for the empirical central limit theorem has been stated in
terms of probabilities. In view of Markov's inequality, the condition

(2.3.8)  E∗ √n ‖Pn − P‖_{F_δn} → 0,  for every δn → 0,

is sufficient for asymptotic equicontinuity (2.1.8). In view of Lemma 2.3.6,
this condition is equivalent to the analogous condition in terms of the symmetrized
empirical measure Pₙ°. It is therefore of interest that there is no
loss of generality in passing from probabilities (2.1.8) to expectations.
For the proof we need a preparatory lemma.

2.3.9 Lemma. Let Z₁, Z₂, . . . be i.i.d. stochastic processes such that
n^{−1/2} Σⁿᵢ₌₁ Zᵢ converges weakly in ℓ∞(F) to a tight Gaussian process. Then

lim_{x→∞} x² supₙ P∗( ‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_F > x ) = 0.

In particular, the random variable ‖Z₁‖∗_F possesses a weak second moment.

Proof. Let Y₁, . . . , Yₙ be an independent copy of Z₁, . . . , Zₙ, for the benefit
of the outer expectations defined as coordinate projections on an extra
factor of a product probability space, as usual. Marginal convergence
to a Gaussian process implies that EZᵢ(f) = 0 and var Zᵢ(f) < ∞ for
every f. Furthermore, the existence of a tight limit ensures that F is totally
bounded for the semimetric ρ(f, g) = σ( Z(f) − Z(g) ). In particular,
α² = sup_{f∈F} var Z(f) is finite. By an intermediate step in the proof of the
symmetrization Lemma 2.3.7,

(1 − 4α²/x²) P∗( ‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_F > x ) ≤ P∗( ‖n^{−1/2} Σⁿᵢ₌₁ (Zᵢ − Yᵢ)‖_F > x/2 ).

It suffices to prove the lemma for Zᵢ − Yᵢ instead of Zᵢ. For convenience of
notation, suppose that the original Zᵢ are symmetric.
Fix ε > 0. The norm ‖G‖_F of the Gaussian limit variable G has moments
of all orders (see Proposition A.2.3). Thus there exists x such that
P( ‖G‖_F ≥ x ) ≤ ε/x² ≤ 1/8. By the portmanteau theorem, there exists N
such that, for all n ≥ N,

(2.3.10)  P∗( ‖Σⁿᵢ₌₁ Zᵢ‖_F > x√n ) ≤ 2P( ‖G‖_F ≥ x ) ≤ 2ε/x².

By Lévy's inequalities A.1.2,

P( max_{1≤i≤n} ‖Zᵢ‖∗_F > x√n ) ≤ 2P∗( ‖Σⁿᵢ₌₁ Zᵢ‖_F > x√n ) ≤ 4ε/x².

In view of Problem 2.3.2, for every n ≥ N,

x² n P∗( ‖Z₁‖_F > x√n ) ≤ 8ε.

This immediately implies the second assertion of the lemma.


To obtain the first
Pmassertion, apply the preceding argument to the
processes Zi = m−1/2 j=1 Zi,j for each 1 ≤ i ≤ n, where Z1,1 , . . . , Zn,m are
Pn Pn Pm
i.i.d. copies of Z1 . The sequence n−1/2 i=1 Zi = (nm)−1/2 i=1 j=1 Zi,j
converges to the same limit G for each m as n → ∞. Since the convergence
is “accelerated,” there exists N such that (2.3.10) holds for n ≥ N and all
m, where x can be chosen dependent on ε and G only, as before. In other
words, there exists x and N such that for n ≥ N ,
m

 
2 ∗ 1
X

x nP √ Z1,j > x n ≤ 8ε,

m j=1 F

for every m. This implies the lemma.

If a distribution on the real line has tails of the order o(x⁻²), then all
its moments of order 0 < r < 2 exist. One interesting consequence of the
preceding lemma is that the sequence of moments E∗‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_F^r is
uniformly bounded for every 0 < r < 2. Given the weak convergence, this
gives that the sequence E∗‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_F^r converges to the corresponding
moment of the limit variable.

2.3.11 Lemma. Let Z₁, Z₂, . . . be i.i.d. stochastic processes, linear in f.
Set ρ_Z(f, g) = σ( Z₁(f) − Z₁(g) ) and F_δ = {f − g: ρ_Z(f, g) < δ}. Then the
following statements are equivalent:
(i) n^{−1/2} Σⁿᵢ₌₁ Zᵢ converges weakly to a tight limit in ℓ∞(F);
(ii) (F, ρ_Z) is totally bounded and ‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_{F_δn} → 0 in outer probability for every δn ↓ 0;
(iii) (F, ρ_Z) is totally bounded and E∗‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_{F_δn} → 0 for every δn ↓ 0.
These conditions imply that the sequence E∗‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_F^r converges
to E‖G‖_F^r for every 0 < r < 2.

Proof. The equivalence of (i) and (ii) is clear from the general results on
weak convergence in ℓ∞(F) obtained in Chapter 1.5. Condition (iii) implies
(ii) by Markov's inequality.
Suppose (i) and (ii) hold. Then P∗( ‖Z₁‖_F > x ) = o(x⁻²) as x tends
to infinity by Lemma 2.3.9. This implies that

E∗ max_{1≤i≤n} ‖Zᵢ‖_F / √n → 0

(Problem 2.3.3). In view of the triangle inequality, the same is true with F
replaced by F_δn. Convergence to zero in probability of ‖n^{−1/2} Σⁿᵢ₌₁ Zᵢ‖_{F_δn}
implies pointwise convergence to zero of the sequence of their quantile functions.
Apply Hoffmann-Jørgensen's inequality (Proposition A.1.5) to obtain
(iii).
The last assertion follows from the remark after Lemma 2.3.9.

To recover the empirical process, set Zi (f ) = f (Xi ) − P f .

2.3.12 Corollary. Let F be a class of measurable functions. Then the


following are equivalent:
(i) F is P -Donsker;
(ii) (F, ρP ) is totally bounded and (2.1.8) holds;
(iii) (F, ρP ) is totally bounded and (2.3.8) holds.

2.3.13 Corollary. Every P-Donsker class F satisfies P( ‖f − Pf‖∗_F > x ) =
o(x⁻²) as x tends to infinity. Consequently, if ‖Pf‖_F < ∞, then F possesses
an envelope function F with P(F > x) = o(x⁻²).

* 2.3.3 Separable Versions

* This section may be skipped at first reading.

Let (F, ρ) be a given separable, semimetric space. A stochastic process
{G(f, ω): f ∈ F} is called separable if there exists a null set N and a countable
subset G ⊂ F such that, for all ω ∉ N and f ∈ F, there exists a
sequence gₘ in G with gₘ → f and G(gₘ, ω) → G(f, ω). A stochastic process
{G̃(f): f ∈ F} is a separable version of a given process {G(f): f ∈ F}
if G̃ is separable and G̃(f) = G(f) almost surely for every f. (The two processes
must be defined on the same probability space, and the exceptional
null sets may depend on f.)
It is well known that every stochastic process possesses a separable
version (possibly taking values in the extended real line). By its definition, a
separable process has excellent measurability properties. In this subsection
it is shown that an empirical process possesses a separable version that
is itself an empirical process. This fact is used in Chapter 2.10 (and there
only) to remove unnecessary measurability conditions from two preservation
theorems.
For any separable process G̃ with countable "separant" G,

sup_{ρ(f,g)<δ; f,g∈F} |G̃(f) − G̃(g)| = sup_{ρ(f,g)<δ; f,g∈G} |G̃(f) − G̃(g)|,  a.s.

If G̃ is a version of G, then the countability of G implies that the right side
changes at most on a null set if G̃ is replaced by G. Next the expression
increases if G is replaced by F. Thus, for every separable version G̃ of a
process G,

sup_{ρ(f,g)<δ; f,g∈F} |G̃(f) − G̃(g)| ≤ sup_{ρ(f,g)<δ; f,g∈F} |G(f) − G(g)|,  a.s.

It follows that the modulus of continuity of a process decreases when replacing
it by a separable version. Consequently, asymptotic equicontinuity
of a sequence of processes implies the same for any sequence of separable
versions. Since the marginal distributions of a process do not change when
passing to a separable version, this yields the following lemma.

2.3.14 Lemma. Let G̃n be separable versions of a sequence of stochastic
processes Gn with sample paths in ℓ∞(F). Then the sequence Gn converges
in distribution to a tight limit in ℓ∞(F) if and only if the sequence G̃n
converges to a tight limit and Gn − G̃n converges to zero in outer probability
in ℓ∞(F).

Empirical processes allow a special type of separable version, constructed
from a separable version of the index class F. Let F be a class
of square integrable measurable functions f: X → R on a probability
space (X, A, P) equipped with a semimetric ρ. Call F a pointwise separable
class relative to ρ if there exists a countable subset G ⊂ F and
for each n a Pⁿ-null set Nₙ ⊂ Xⁿ such that, for all (x₁, . . . , xₙ) ∉ Nₙ
and f ∈ F, there exists a sequence gₘ in G such that ρ(gₘ, f) → 0 and
(gₘ(x₁), . . . , gₘ(xₙ)) → (f(x₁), . . . , f(xₙ)). A class F̃ of measurable functions
is a pointwise separable version of F if F̃ is pointwise separable and
there exists a bijection f ↔ f̃ between F and F̃ such that f = f̃ almost
surely.
Let Gn = √n(Pn − P) be the empirical process indexed by the class
of functions F. If F̃ is a pointwise separable version of F, then

G̃n(f) = Gn(f̃)

defines a separable version of Gn, for every n. This separable version is
itself an empirical process, and its index class is obtained by changing each
element of the original index class on a null set. Furthermore, the difference
process Gn − G̃n is the empirical process indexed by the class {f − f̃: f ∈ F}.
Each of the functions in this class is zero almost surely. It turns out that
the empirical process indexed by a zero class converges in probability to
zero if and only if the same is true for the class of absolute values. This
yields the following theorem.

2.3.15 Theorem. Let F̃ be a pointwise separable version relative to the
L₂(P)-norm of a class of square integrable measurable functions F. Then
F is Donsker if and only if F̃ is Donsker and sup_{f∈F} √n Pn|f − f̃| → 0 in
outer probability.

Proof. By the preceding lemma, F is Donsker if and only if F̃ is Donsker
and ‖√n Pn(f − f̃)‖_F → 0 in outer probability. It suffices to show that for
a class of measurable functions f: X → R such that f = 0 almost surely,
for every f, the two statements √n ‖Pn f‖_F → 0 and √n ‖Pn|f|‖_F → 0 in
outer probability are equivalent. It is clear that the second statement implies the first.
For a proof in the other direction, fix ε > 0 and define U = { x ∈
Xⁿ: ‖Pn‖∗_F ≤ ε }. It will be shown that there exists a measurable set
U′ ⊂ U of the same measure as U such that, for all x ∈ U′ and every
subset I ⊂ {1, 2, . . . , n},

| n⁻¹ Σ_{i∈I} f(xᵢ) | ≤ ε.

Then it follows that P∗( ‖Pn‖_{|F|} ≤ 2ε ) ≥ P∗( ‖Pn‖_F ≤ ε ), and the desired
result follows.
For I ⊂ {1, 2, . . . , n}, identify x ∈ Xⁿ with (α, β), where α ∈ X^I are
the coordinates xᵢ with i ∈ I and β ∈ X^{n−I} are the remaining coordinates.
Let Uα ⊂ Xⁿ be the "vertical" section {α} × { β: (α, β) ∈ U } of U at "width"
α, and let πUα be the projection of Uα on X^{n−I} (so that Uα = {α} × πUα).
Let V_I be the union of all sections Uα with projection πUα equal to a null
set (under P^{n−I}). By Fubini's theorem, Pⁿ(V_I) = 0. Set

U′ = U − ∪_I V_I,

where I ranges over all nonempty, proper subsets of {1, . . . , n}.
Fix I and f. Each x ∈ U′ can be identified with a pair (α, β) as before.
By construction P^{n−I}(πUα) > 0, and by assumption (f(β′₁), . . . , f(β′_{n−I})) =
0 almost surely under P^{n−I}. Conclude that there must be a β′ ∈ πUα with
(f(β′₁), . . . , f(β′_{n−I})) = 0. Thus

| n⁻¹ Σ_{i∈I} f(xᵢ) | = | n⁻¹ Σ_{i∈I} f(αᵢ) + n⁻¹ Σ_{i∉I} f(β′ᵢ) | ≤ ε,

because (α, β′) ∈ U. This concludes the proof.

The preceding theorem shows that the difference between Gn and G̃n
must be small due to the fact that the differences f − f˜ are small: the
difference cannot be small through the cancellation of positive terms by
negative terms.
There is an analogous, but simpler result for Glivenko-Cantelli classes.

2.3.16 Theorem. Let F̃ be a pointwise separable version relative to the


L1 (P )-norm of a class of integrable measurable functions F. Then F is
weak P -Glivenko-Cantelli if and only if F̃ is weak P -Glivenko-Cantelli and
supf ∈F Pn |f − f˜| → 0 in outer probability.

Proof. Because the process f ↦ Pn f̃ is a separable version of f ↦ Pn f,
there exists a countable separant G ⊂ F such that, almost surely,

sup_{f∈F} |Pn f̃ − P f̃| = sup_{f∈G} |Pn f̃ − P f̃| = sup_{f∈G} |Pn f − P f| ≤ sup_{f∈F} |Pn f − P f|.

It follows that the class F̃ is Glivenko-Cantelli if F is Glivenko-Cantelli.
We next conclude that F is Glivenko-Cantelli if and only if the class F̃ is
Glivenko-Cantelli and sup_{f∈F} |Pn(f − f̃)| tends to zero in outer probability.
The functions f − f̃ are zero almost surely. The argument in the proof of
Theorem 2.3.15 shows that the last mentioned convergence is equivalent to
convergence in outer probability to zero of the sequence sup_{f∈F} Pn|f − f̃|.

A pointwise separable version of a given class of functions can be con-


structed from a special type of lifting of L∞ (X , A, P ). A lifting is a map
ρ: L∞ (X , A, P ) → L∞ (X , A, P ) that assigns to each equivalence class f of
essentially bounded functions a representative ρf in such a way that
(i) ρ1 = 1;
(ii) ρ is positive: ρf (x) ≥ 0 for all x if f ≥ 0 almost surely;
(iii) ρ is linear;
(iv) ρ is multiplicative: ρ(f g) = ρf ρg.
Any lifting automatically possesses the further property:
(vi) ρh(f1 , . . . , fn ) ≥ h(ρf1 , . . . , ρfn ) for every lower semicontinuous
h: Rn → R and f1 , . . . , fn in L∞ (X , A, P ) (Problem 2.3.8).
It is well known that every complete probability space (X , A, P ) admits
a lifting ρ of the space L∞ (X , A, P ). For the construction of a separable

version of the empirical process, a lifting with a further property is needed.


This exists by the following proposition.

2.3.17 Proposition (Consistent lifting). Any complete probability space
(X, A, P) admits a lifting of the space L∞(X, A, P) such that for each n
the map

ρₙ: [ (x₁, . . . , xₙ) ↦ f(xᵢ) ]  ⟼  [ (x₁, . . . , xₙ) ↦ ρf(xᵢ) ]

is the restriction [to the set of functions of the type (x₁, . . . , xₙ) ↦ f(xᵢ),
where f ranges over L∞(X, A, P)] of a lifting of L∞(Xⁿ, Aⁿ, Pⁿ) for each
1 ≤ i ≤ n.

This proposition is proved by Talagrand (1982), who calls a lifting ρ


with the given property a consistent lifting. There also exist liftings that
are not consistent.
Suppose F is a class of measurable, uniformly bounded functions. The
following theorem shows that a pointwise separable version F̃ of F can be
defined by

f̃(x) = ρf(x)

for a consistent lifting ρ of L∞(X, A, P). (Interpret ρf as the image of the
class of f under ρ.) This separable version inherits the nice properties of a
lifting: the map f ↦ f̃ is positive, linear, and multiplicative.
If F contains unbounded functions, then a pointwise separable version
can be defined by

f̃(x) = Φ⁻¹ ∘ ρ(Φ ∘ f)(x),

for a consistent lifting as before and a fixed, strictly monotone bijection
Φ of R̄ onto a compact interval. This separable modification f ↦ f̃ lacks
linearity and multiplicativity but is positive.

2.3.18 Theorem (Separable modification). Let (X, A, P) be a complete
probability space and F be a class of measurable functions f: X → R. If F
is separable in L_r(P), then there exists a pointwise separable version of F
relative to the L_r(P)-norm.

Proof. Define f̃ as indicated, using a consistent lifting, so that the map ρₙ
in the preceding proposition is a lifting. Then f = f̃ almost surely, because
ρΦ∘f is a representative of the class of Φ∘f.
Fix n and open sets U ⊂ F and G ⊂ Rⁿ. Define

A_f = { (x₁, . . . , xₙ) ∈ Xⁿ: (ρΦ∘f(x₁), . . . , ρΦ∘f(xₙ)) ∈ G }.

For D ⊂ F set A_D = ∪_{f∈D} A_f. There exists a countable subset D ⊂
U for which Pⁿ(A_D) is maximal over all countable subsets D of U. Any
maximizing D satisfies 1_{A_f} ≤ 1_{A_D} almost surely for all f ∈ U. By positivity
of the lifting ρₙ, it follows that

ρₙ1_{A_f}(x) ≤ ρₙ1_{A_D}(x),  for every x.

Set Nₙ = { x ∈ Xⁿ: 1_{A_D}(x) = 0, ρₙ1_{A_D}(x) = 1 }. By multiplicativity the
image of an indicator function under a lifting is an indicator function. It
follows that Nₙ can also be described as the set of x such that 1_{A_D}(x) <
ρₙ1_{A_D}(x). Let (Φ∘f)ᵢ denote the function (x₁, . . . , xₙ) ↦ Φ∘f(xᵢ). Since
G is open, property (vi) of the lifting ρₙ gives

ρₙ1_{A_f} = ρₙ 1_G( (Φ∘f)₁, . . . , (Φ∘f)ₙ ) ≥ 1_G( ρₙ(Φ∘f)₁, . . . , ρₙ(Φ∘f)ₙ ) = 1_{A_f}.

Combination of the last two displayed equations yields that ρₙ1_{A_D}(x) ≥
1_{A_f}(x) for every x. If x ∈ A_f and x ∉ Nₙ, then it follows that x ∈ A_D.
It has been proved that, for every open U ⊂ F and open G ⊂ Rⁿ,
there exists a countable set D_{U,G} ⊂ U and a null set N_{n,U,G} ⊂ Xⁿ such
that

x ∉ N_{n,U,G} ∧ (ρΦ∘f(x₁), . . . , ρΦ∘f(xₙ)) ∈ G ∧ f ∈ U
⟹ ∃g ∈ D_{U,G}: (ρΦ∘g(x₁), . . . , ρΦ∘g(xₙ)) ∈ G.

Repeat the construction for every U and G in countable bases for the open
sets in F and Rⁿ, respectively. Let Nₙ and G be the union of all null
sets N_{n,U,G} and countable sets D_{U,G}. Then for every x ∉ Nₙ and f ∈ F,
there exists a sequence gₘ in G with (ρΦ∘gₘ(x₁), . . . , ρΦ∘gₘ(xₙ)) →
(ρΦ∘f(x₁), . . . , ρΦ∘f(xₙ)) and gₘ → f in L_r(P). Thus the class of
functions { ρ(Φ∘f): f ∈ F } is pointwise separable.

Problems and Complements

1. There is no universal constant K such that E|Σᵢ εᵢXᵢ| ≤ K E|Σᵢ (Xᵢ − EXᵢ)|
for any i.i.d. random variables X₁, . . . , Xₙ.
[Hint: Take n = 1 and X₁ ∼ N(μ, 1).]

2. For independent random variables ξ₁, . . . , ξₙ,

P( maxᵢ |ξᵢ| > x ) ≥ Σᵢ P(|ξᵢ| > x) / ( 1 + Σᵢ P(|ξᵢ| > x) ).

In particular, if the left side is less than 1/2, then 2P(maxᵢ |ξᵢ| > x) ≥
Σᵢ P(|ξᵢ| > x).
[Hint: For x ≥ 0, one has 1 − x ≤ exp(−x) and 1 − e⁻ˣ ≥ x/(1 + x).]
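For i.i.d. variables the inequality of Problem 2 can be checked in closed form: with p = P(|ξ₁| > x), the left side equals 1 − (1 − p)ⁿ while the lower bound is np/(1 + np). A quick numerical check over a grid (the grid values are our own choice):

```python
# With p = P(|xi_1| > x) and i.i.d. variables, the left side of Problem 2
# is 1 - (1 - p)^n and the claimed lower bound is n*p / (1 + n*p).
for n in [1, 2, 5, 10, 100]:
    for p in [0.0, 1e-4, 0.01, 0.1, 0.5, 0.9, 1.0]:
        lhs = 1.0 - (1.0 - p) ** n
        rhs = n * p / (1.0 + n * p)
        assert lhs >= rhs - 1e-12   # small tolerance for rounding
```

The two elementary bounds in the hint (1 − p ≤ e^{−p} and 1 − e^{−x} ≥ x/(1 + x)) combine to give exactly this inequality.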
3. Let X_{n1}, . . . , X_{nn} be an arbitrary array of real random variables.
(i) If sup_{x>ε√n} n⁻¹ Σⁿᵢ₌₁ x² P( |X_{ni}| > x ) → 0 for every ε > 0, then
E max_{1≤i≤n} |X_{ni}|/√n → 0.
(ii) If the array is rowwise i.i.d., then max_{1≤i≤n} |X_{ni}| = o_P(√n) if and only
if P( |X_{ni}| > ε√n ) = o(n⁻¹) for every ε > 0.
If the variables in the triangular array are i.i.d., then all assertions are equivalent
to P(|X₁| > x) = o(x⁻²).
[Hint: E max |X_{ni}| 1{|X_{ni}| > ε√n} is bounded by Σᵢ ∫₀^∞ P( |X_{ni}| 1{|X_{ni}| >
ε√n} > t ) dt. This integral splits into two pieces. The integral over the
second part [ε√n, ∞) can be bounded by sup_{x>ε√n} x² Σᵢ P( |X_{ni}| > x ) times
∫_{ε√n}^∞ x⁻² dx.]

4. Let ξ1 , . . . , ξn be i.i.d. random variables.


(i) Show that the following are equivalent.
(a) P(|ξ1 | > x) = o(x−1 ).
(b) max1≤i≤n |ξi |/n converges to zero in probability.
(c) max1≤i≤n |ξi |r /nr converges to zero in mean for every r < 1.
(ii) Show that the following are equivalent:
(a) E|ξ1 | < ∞.
(b) max1≤i≤n |ξi |/n converges to zero almost surely.
(c) max1≤i≤n |ξi |/n converges to zero in mean.
(iii) Let r > 1. Use (i) and (ii) to show the following:
(a) Show that P(|ξ1 | > x) = o(x−r ) if and only if max1≤i≤n |ξi |/n1/r
converges to zero in probability if and only if max1≤i≤n |ξi |/n1/r
converges to zero in mean.
(b) Show that E|ξ1 |r < ∞ if and only if max1≤i≤n |ξi |r /n converges to
zero almost surely if and only if max1≤i≤n |ξi |r /n converges to zero
in mean.
5. If {ξᵢ} is a sequence of independent, positive random variables with E|ξᵢ|^r <
∞, for all i and some r > 0, then, for t_λ = inf{ t > 0: Σᵢ P(ξᵢ > t) ≤ λ } and
any λ > 0,

λt_λ^r/(1+λ) + (1+λ)⁻¹ Σᵢ ∫_{t_λ}^∞ P(ξᵢ > t) dt^r ≤ E maxᵢ |ξᵢ|^r ≤ t_λ^r + Σᵢ ∫_{t_λ}^∞ P(ξᵢ > t) dt^r.

[Hint: Use Problem 2.3.2.]

6. If X₁, X₂, . . . are i.i.d. random variables with E|X₁| < ∞ and r < 1, then
E sup_{n≥1} (|Xₙ|/n)^r < ∞.
[Hint: Use Problem 2.3.5.]
 
7. For i.i.d. random variables X₁, X₂, . . ., the expectation E( sup_{n≥1} |Xₙ|/n ) is
finite if and only if E( |X₁|(1 ∨ log |X₁|) ) is finite.
[Hint: Use Problem 2.3.5.]

8. Any lifting ρ of L∞(X, A, P) has the following additional properties:
(i) ‖ρf‖∞ ≤ ‖f‖∞;
(ii) ρh(f₁, . . . , fₙ) = h(ρf₁, . . . , ρfₙ) for every continuous function h: Rⁿ →
R and f₁, . . . , fₙ in L∞(X, A, P);
(iii) ρh(f₁, . . . , fₙ) ≥ h(ρf₁, . . . , ρfₙ) for every lower semicontinuous function
h: Rⁿ → R and f₁, . . . , fₙ in L∞(X, A, P);
(iv) ρ1_A = 1_Ã for some measurable set Ã.
[Hint: For polynomials h, property (ii) is a consequence of linearity and multiplicativity
of a lifting. By the Stone-Weierstrass theorem, every continuous
h can be uniformly approximated by polynomials on any given compact subset
of Rⁿ. Thus the general form of (ii) can be reduced to polynomials in
view of (i).
Property (iii) is a consequence of (ii), because every lower semicontinuous
function can be approximated pointwise from below by a sequence of
continuous functions.]
2.4
Glivenko-Cantelli Theorems

In this chapter we prove two types of Glivenko-Cantelli theorems. The first


theorem is the simplest and is based on entropy with bracketing. Its proof
relies on finite approximation and the law of large numbers for real vari-
ables. The second theorem uses random L1 -entropy numbers and is proved
through symmetrization followed by a maximal inequality.
Recall Definition 2.1.6 of the bracketing numbers of a class F of func-
tions.

2.4.1 Theorem. Let F be a class of measurable functions such that
N_[ ](ε, F, L₁(P)) < ∞ for every ε > 0. Then F is Glivenko-Cantelli.

Proof. Fix ε > 0. Choose finitely many ε-brackets [lᵢ, uᵢ] whose union
contains F and such that P(uᵢ − lᵢ) < ε for every i. Then, for every f ∈ F,
there is a bracket such that

(Pn − P)f ≤ (Pn − P)uᵢ + P(uᵢ − f) ≤ (Pn − P)uᵢ + ε.

Consequently,

sup_{f∈F} (Pn − P)f ≤ maxᵢ (Pn − P)uᵢ + ε.

The right side converges almost surely to ε by the strong law of large
numbers for real variables. Combination with a similar argument for
inf_{f∈F} (Pn − P)f yields that lim sup ‖Pn − P‖∗_F ≤ ε almost surely, for every
ε > 0. Take a sequence εₘ ↓ 0 to see that the limsup must actually be zero
almost surely.
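The finite-approximation step of the proof can be traced numerically for the empirical distribution function, using the brackets described in Example 2.4.2. In the sketch below (an illustration with our own choices: P uniform on [0, 1], grid tᵢ = iε), the one-sided supremum over F = {1(−∞, c]} is compared with the bracketing bound:

```python
import numpy as np

rng = np.random.default_rng(2)
n, eps = 200, 0.1
X = np.sort(rng.uniform(size=n))

# One-sided supremum over F = {1(-inf, c]}: attained at the order statistics.
sup_F = np.max(np.arange(1, n + 1) / n - X)

# Brackets [1(-inf, t_i], 1(-inf, t_{i+1})) with t_i = i*eps, so that
# P(u_i - l_i) = eps under the uniform distribution P.
t = np.linspace(0.0, 1.0, int(round(1.0 / eps)) + 1)
upper_devs = [(X < u).mean() - u for u in t[1:]]       # (Pn - P)u_i
bound = max(upper_devs) + eps

# The key display of the proof: sup_f (Pn - P)f <= max_i (Pn - P)u_i + eps.
assert sup_F <= bound + 1e-12
```

As n grows, each (Pn − P)uᵢ tends to zero by the strong law, so the bound shrinks to ε, exactly as in the proof.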

2.4.2 Example. The previous proof generalizes a well-known proof of the
Glivenko-Cantelli theorem for the empirical distribution function on the
real line. Indeed, the set of indicator functions of cells (−∞, c] possesses
finite bracketing numbers for any underlying distribution. Simply use the
brackets [ 1{(−∞, tᵢ]}, 1{(−∞, tᵢ₊₁)} ] for a grid of points −∞ = t₀ < t₁ <
· · · < tₘ = ∞ with the property P(tᵢ, tᵢ₊₁) < ε for each i. Bracketing
numbers of many other classes of functions are discussed in Chapter 2.7.

Both the statement and the proof of the following theorem are more
complicated than the previous bracketing theorem. However, its sufficiency
condition for the Glivenko-Cantelli property can be checked for many classes
of functions by elegant combinatorial arguments, as discussed in a later
chapter. Furthermore, its random entropy condition is necessary.

2.4.3 Theorem. Let F be a P-measurable class of measurable functions
with envelope F such that P∗F < ∞. Let F_M be the class of functions
f1{F ≤ M} when f ranges over F. Then ‖Pn − P‖∗_F → 0 almost surely
if and only if n⁻¹ log N(ε, F_M, L₁(Pn)) → 0 in outer probability, for every
ε and M > 0. In that case both convergences are also in outer mean. In
particular, the class F is Glivenko-Cantelli if log N(ε, F, L₁(Pn)) = o∗_P(n)
for every ε > 0.

Proof. The last assertion of the theorem is a consequence of the inequality
N(ε, F_M, L_r(Q)) ≤ N(ε, F, L_r(Q)), which is valid for every ε > 0, every
measure Q, and every r > 0.
We first show the sufficiency of the entropy condition. By the symmetrization
Lemma 2.3.1, measurability of the class F, and Fubini's theorem,

E∗‖Pn − P‖_F ≤ 2E_X E_ε ‖n⁻¹ Σⁿᵢ₌₁ εᵢ f(Xᵢ)‖_F
≤ 2E_X E_ε ‖n⁻¹ Σⁿᵢ₌₁ εᵢ f(Xᵢ)‖_{F_M} + 2P∗F{F > M},

by the triangle inequality, for every M > 0. For sufficiently large M, the
last term is arbitrarily small. To prove convergence in mean, it suffices to
show that the first term converges to zero for fixed M. Fix X₁, . . . , Xₙ. If
G is an ε-net in L₁(Pn) over F_M, then

(2.4.4)  E_ε ‖n⁻¹ Σⁿᵢ₌₁ εᵢ f(Xᵢ)‖_{F_M} ≤ E_ε ‖n⁻¹ Σⁿᵢ₌₁ εᵢ f(Xᵢ)‖_G + ε.

The cardinality of G can be chosen equal to N(ε, F_M, L₁(Pn)). Bound the
L₁-norm on the right by the Orlicz norm for ψ₂(x) = exp(x²) − 1, and use
the maximal inequality Lemma 2.2.2 to find that the last expression does
not exceed a multiple of

√(1 + log N(ε, F_M, L₁(Pn)))  sup_{f∈G} ‖n⁻¹ Σⁿᵢ₌₁ εᵢ f(Xᵢ)‖_{ψ₂|X} + ε,

where the Orlicz norms ‖·‖_{ψ₂|X} are taken over ε₁, . . . , εₙ with X₁, . . . , Xₙ
fixed. By Hoeffding's inequality, they can be bounded by √(6/n) (Pn f²)^{1/2},
which is less than √(6/n) M. Thus the last displayed expression is bounded
by

√(1 + log N(ε, F_M, L₁(Pn))) √(6/n) M + ε → ε  in outer probability,

if the entropy condition holds, for any ε > 0 and M > 0. It has been
shown that the left side of (2.4.4) converges to zero in probability. Since it
is bounded by M, its expectation with respect to X₁, . . . , Xₙ converges to
zero by the dominated convergence theorem.
This concludes the proof that ‖Pn − P‖∗_F → 0 in mean. That it also
converges almost surely follows from the fact that the sequence ‖Pn − P‖∗_F
is a reverse submartingale with respect to a suitable filtration. See Lemma 2.4.6
below.
The proof of the converse result relies on the multiplier inequality
(with Gaussian multipliers) given by Lemma 2.9.1 together with Sudakov's
inequality. We shall actually show the (seemingly) stronger assertion that
convergence to zero of ‖Pn − P‖∗_F in probability implies that, for every
ε > 0 and M > 0,

(2.4.5)  E∗ log N(ε, F_M, L₂(Pn)) = o(n).

The assumed outer integrability of the envelope function F implies (see
Problem 2.3.4) that E max_{k≤n} F∗(X_k)/n → 0. Combination with the
Hoffmann-Jørgensen inequality for moments, Proposition A.1.5, allows to
strengthen the convergence in probability of ‖Pn − P‖∗_F to convergence in
mean. Next, by the desymmetrization inequality, Lemma 2.3.6,

½ E∗ ‖n⁻¹ Σⁿᵢ₌₁ εᵢ (f(Xᵢ) − Pf)‖_F ≤ E∗‖Pn − P‖_F → 0.

Because E‖n⁻¹ Σⁿᵢ₌₁ εᵢ Pf‖_F ≤ E|n⁻¹ Σⁿᵢ₌₁ εᵢ| ‖P‖_F tends to zero in mean
by the law of large numbers, we also obtain convergence to zero of the left
side without the term Pf. Next, by the multiplier inequalities, Lemma 2.9.1,
we see that also E∗‖n⁻¹ Σⁿᵢ₌₁ ξᵢ f(Xᵢ)‖_F → 0 for i.i.d. standard normal multipliers
ξ₁, . . . , ξₙ independent from X₁, . . . , Xₙ. (Apply the inequality for
fixed n₀ and let first n → ∞ and next n₀ → ∞.) Given X₁, . . . , Xₙ the process
n^{−1/2} Σⁿᵢ₌₁ ξᵢ f(Xᵢ) is mean-zero Gaussian, with natural pseudometric
given by the L₂(Pn)-norm. By Sudakov's inequality, Proposition A.2.6,

n^{−1/2} sup_{ε>0} ε √( log N(ε, F, L₂(Pn)) ) ≤ 3E_ξ ‖n⁻¹ Σⁿᵢ₌₁ ξᵢ f(Xᵢ)‖_F.

Taking the outer expectation across this inequality, we see that the sequence
n^{−1/2} √( log N(ε, F, L₂(Pn)) ) tends to zero in outer expectation,
for every ε > 0. Because N(ε, F_M, L₂(Pn)) is bounded above both by
N(ε, F_M, L∞(Pn)) ≤ (2M/ε)ⁿ and by N(ε, F, L₂(Pn)), for every ε and
M > 0, it follows that

√( n⁻¹ log N(ε, F_M, L₂(Pn)) ) ≤ √( log(2M/ε) ) ∧ √( n⁻¹ log N(ε, F, L₂(Pn)) ).

The first term is bounded for fixed ε, M > 0, and the second was seen to
converge to zero in mean. This concludes the proof of (2.4.5).

If F has a measurable envelope with P F < ∞, then Pn F = O(1)


almost surely and the random entropy condition of the preceding theorem
is implied by
log N ε kF kPn ,1 , F, L1 (Pn ) = o∗P (n).


In Chapter 2.6 it is shown that the entropy on the left side is uniformly bounded by a constant for so-called Vapnik-Červonenkis classes F. Hence any appropriately measurable Vapnik-Červonenkis class is Glivenko-Cantelli, provided its envelope function is integrable.
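As a concrete illustration of the Glivenko-Cantelli property for the simplest VC class, the half-lines (−∞, t] in R, the following sketch (added here for illustration; the choice P = Uniform[0, 1], the seed, and the sample sizes are our own assumptions, not the book's) estimates ‖Pn − P‖_F by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    """sup_t |Pn(-inf, t] - P(-inf, t]| for an i.i.d. Uniform[0, 1] sample.

    For F = {1{(-inf, t]}: t in R} the supremum is attained at the order
    statistics X_(1) <= ... <= X_(n), just before or just after a jump of
    the empirical distribution function.
    """
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

for n in [100, 1000, 10000]:
    devs = [sup_deviation(n) for _ in range(50)]
    print(n, np.mean(devs))  # decreases with n, roughly like 1/sqrt(n)
```

The printed averages shrink toward zero, which is exactly the convergence ‖Pn − P‖*_F → 0 asserted by the Glivenko-Cantelli property for this class.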

2.4.6 Lemma. Let F be a class of measurable functions with envelope F such that P*F < ∞. Define a filtration by letting S_n be the σ-field generated by all measurable functions h: X^∞ → R that are permutation-symmetric in their first n arguments. Then

    E( ‖Pn − P‖*_F | S_{n+1} ) ≥ ‖P_{n+1} − P‖*_F,   a.s.

Furthermore, there exist versions of the measurable cover functions ‖Pn − P‖*_F that are adapted to the filtration. Any such versions form a reverse submartingale and converge almost surely to a constant.

Proof. Assume without loss of generality that P f = 0 for every f. The function h: X^n → R given by h(x_1, ..., x_n) = ‖ n^{-1} Σ_{i=1}^n f(x_i) ‖_F is permutation-symmetric. Its measurable cover h* (for P^n) can be chosen permutation-symmetric as well (Problem 2.4.4). Then h*(X_1, ..., X_n) is a version of ‖Pn‖*_F and is S_n-measurable.
Let P_n^i be the empirical measure of X_1, ..., X_{i-1}, X_{i+1}, ..., X_{n+1}. Then P_{n+1} f = (n+1)^{-1} Σ_{i=1}^{n+1} P_n^i f for every f, whence

    ‖P_{n+1}‖*_F ≤ (n+1)^{-1} Σ_{i=1}^{n+1} ‖P_n^i‖*_F,   a.s.

This is true for any version of the measurable covers. Choose the left side S_{n+1}-measurable and take conditional expectations with respect to S_{n+1} to arrive at

    ‖P_{n+1}‖*_F ≤ (n+1)^{-1} Σ_{i=1}^{n+1} E( ‖P_n^i‖*_F | S_{n+1} ),   a.s.

Each term in the sum on the right side changes on at most a null set if ‖P_n^i‖*_F is replaced by another version. For h* given in the first paragraph of the proof, choose the version h*(X_1, ..., X_{i-1}, X_{i+1}, ..., X_{n+1}) for ‖P_n^i‖*_F. Since the function h* is permutation-symmetric, it follows by symmetry that the conditional expectations E( ‖P_n^i‖*_F | S_{n+1} ) do not depend on i (almost surely). Thus the right side of the display equals E( ‖P_n^{n+1}‖*_F | S_{n+1} ) = E( ‖Pn‖*_F | S_{n+1} ), since P_n^{n+1} is the empirical measure of X_1, ..., X_n.

While bracketing numbers need not be finite for validity of the


Glivenko-Cantelli property under a given underlying distribution P , this
changes when the property is imposed for every P on the sample space. A
class F of functions that is P -Glivenko-Cantelli for every probability mea-
sure P on (X , A) is called a universal Glivenko-Cantelli class. Under some
regularity conditions a uniformly bounded, universal Glivenko-Cantelli class
has finite bracketing numbers.

2.4.7 Theorem. Suppose that the sample space (X, A) is a Polish space with its Borel σ-field and that the class F of measurable functions is uniformly bounded and contains a countable subset that is dense for the topology of pointwise convergence on X. Then the following assertions are equivalent:
(i) F is a universal Glivenko-Cantelli class.
(ii) N_[]( ε, F, L1(P) ) < ∞, for every ε > 0 and probability measure P on (X, A).
(iii) N( ε, F, L1(P) ) < ∞, for every ε > 0 and probability measure P on (X, A).
Furthermore, under these conditions ‖Pn − P‖_F is a measurable random variable.

Proof. See Van Handel (2013).



Problems and Complements


1. (Necessity of integrability of the envelope) Suppose that for a class F of measurable functions the empirical measure of an i.i.d. sample satisfies ‖Pn − P‖_F →_{as*} 0. Then P ‖f − P f‖*_F < ∞. Consequently, if ‖P‖_F < ∞, then P F < ∞ for an envelope function F.
[Hint: Since (1/n) ‖f(X_n) − P f‖_F is bounded above by ‖Pn − P‖_F + (1 − 1/n) ‖P_{n−1} − P‖_F, one has that P( ‖f(X_n) − P f‖*_F ≥ n, i.o. ) = 0. By the Borel-Cantelli lemma, it follows that the series Σ_n P( ‖f(X_n) − P f‖*_F ≥ n ) converges. This series is an upper bound for the given first moment, because the X_n are identically distributed.]

2. The Lr(Q)-entropy numbers of the class F_M = { f 1{F ≤ M}: f ∈ F } are smaller than those of F, for any probability measure Q and for numbers M > 0 and r ≥ 1.
3. (Stability of the Glivenko-Cantelli property) If F, F1 , and F2 are
Glivenko-Cantelli classes of functions, then
(i) {a1 f1 + a2 f2 : fi ∈ Fi , |ai | ≤ 1} is Glivenko-Cantelli;
(ii) F1 ∪ F2 is Glivenko-Cantelli;
(iii) the class of all functions that are both the pointwise limit and the L1 (P )-
limit of a sequence in F is Glivenko-Cantelli.
4. Let h: X^n → R be permutation-symmetric. Then there exists a version h* of the measurable cover for P^n on A^n that is permutation-symmetric.
[Hint: Take an arbitrary measurable cover h̃ ≥ h, and set h* = min_σ h̃ ∘ σ.]

5. If the underlying probability space is not the product space (X^∞, A^∞, P^∞), then ‖Pn − P‖_F may fail to be a reverse martingale, even if ‖Pn − P‖_F is measurable for each n.
[Hint: There exists a (nonmeasurable) set A ⊂ [0, 1] with inner and outer Lebesgue measures λ_*(A) = 0 and λ*(A) = 1, respectively. Take (X_n, A_n, P_n) equal to (A, B ∩ A, λ_A) for odd values of n and (A^c, B ∩ A^c, λ_{A^c}) for even values of n, where λ_A is the trace measure defined in Problem 1.2.16. Define X_n to be the embedding of X_n into the unit interval, viewed as a map on the infinite product ∏ X_i, and let F be equal to the set of indicators of all finite subsets of A^c. Then ‖Pn − P‖_F equals (n − 1)/2n for odd values of n, and 1/2 for n even.]

6. If the filtration S_n in Lemma 2.4.6 is replaced by the filtration consisting of the σ-fields Ã_n generated by the variables P_k f for k ≥ n and f ranging over F, then ‖Pn − P‖*_F may fail to be a reverse submartingale.
[Hint: Let (X, A, P) be the unit interval in R with Borel sets and Lebesgue measure, and X_1, X_2, ... the coordinate projections of the infinite product space, as usual. For the collection F = {1_{x}: x ∈ [1/2, 1]}, the variable ‖Pn − P‖_F is measurable with respect to A^∞, but not Ã_n-measurable for any n. It is also not measurable with respect to the P^∞-completion of Ã_n, even though F is image-admissible Suslin.]

7. The σ-field Sn in Lemma 2.4.6 equals the σ-field generated by the variables
Pk f with k ranging over all integers greater than or equal to n and f ranging
over the set of all measurable, integrable f : X → R.
8. The sequence of averages Ȳ_n of the first n elements of a sequence of i.i.d. random variables Y_1, Y_2, ... with finite first absolute moment is a reverse martingale with respect to its natural filtration: E( Ȳ_n | Ȳ_{n+1}, Ȳ_{n+2}, ... ) = Ȳ_{n+1}.
[Hint: The conditional expectation equals n^{-1} Σ_{i=1}^n E( Y_i | Ȳ_{n+1}, Y_{n+2}, ... ), and E( Y_i | Ȳ_{n+1} ) is constant in 1 ≤ i ≤ n + 1 by symmetry.]

9. (Block-bracketing) Given a class F of functions f: X → R with integrable envelope, define F^{⊕m} to be the set of functions (x_1, ..., x_m) ↦ Σ_{i=1}^m f(x_i) on X^m. Suppose that for every ε > 0 there exists m such that N_[]( εm, F^{⊕m}, L1(P^m) ) < ∞. Then F is Glivenko-Cantelli.
[Hint: For every ε > 0 and f, there exist m and a bracket [l_m, u_m] in L1(P^m) such that m^{-1} P^m u_m − m^{-1} P^m l_m < ε and

    (nm)^{-1} Σ_{i=1}^n l_m(Y_{i,m}) ≤ P_{nm} f ≤ (nm)^{-1} Σ_{i=1}^n u_m(Y_{i,m}),
    m^{-1} P^m l_m ≤ P f ≤ m^{-1} P^m u_m,

where Y_{1,m}, Y_{2,m}, ... are the blocks [X_1, ..., X_m], [X_{m+1}, ..., X_{2m}], .... Conclude that for every ε > 0 there exists m such that lim sup_{n→∞} ‖P_{nm} − P‖*_F < ε. It suffices to “close the gaps.”]

10. Theorem 2.4.3 remains valid if the L1(Pn)-norm is replaced by the Lp(Pn)-norm, for any 0 < p < ∞.
[Hint: For 0 < r < s < ∞ and f, g ∈ F_M we have ‖f − g‖_{Pn,r} ≤ ‖f − g‖_{Pn,s} and ‖f − g‖_{Pn,s}^{s/r} ≤ (2M)^{s/r − 1} ‖f − g‖_{Pn,r}. Use this to compare the entropy numbers N( ε, F_M, Lr(Pn) ).]

11. A P-Glivenko-Cantelli class F with ‖P‖_F < ∞ is totally bounded in L1(P). If F has an envelope that is contained in Lr(P), then F is also totally bounded in Lr(P), for every r ∈ (1, ∞).
[Hint: By Problem 2.4.1, F possesses an integrable envelope. It is no loss of generality to assume that this is finite everywhere. The class F − F is Glivenko-Cantelli. By a result of Dudley (1998) (and also by Theorem 2.10.1, but this uses the present problem to resolve measurability issues), the classes (F − F)^+ and (F − F)^− are Glivenko-Cantelli, whence the class |F − F| of absolute differences is Glivenko-Cantelli. Thus there exists a sequence of finitely discrete probability measures P_n such that L_n := sup_{f,g∈F} (P_n − P)|f − g| → 0. For given ε > 0 there exists n_0 such that L_{n_0} < ε. Because, restricted to the support of P_{n_0}, the functions f ∈ F are equivalent to a cube in R^{n_0}, there exists a finite ε-net f_1, ..., f_N over F relative to the L1(P_{n_0})-norm. This is a 2ε-net over F relative to the L1(P)-norm.]
2.5
Donsker Theorems

In this chapter we present the two main empirical central limit theorems.
The first is based on uniform entropy, and its proof relies on symmetrization.
The second is based on bracketing entropy.

2.5.1 Uniform Entropy


In this section weak convergence of the empirical process will be established under the condition that the envelope function F be square integrable, combined with the uniform entropy bound

(2.5.1)        ∫_0^∞ sup_Q √( log N( ε‖F‖_{Q,2}, F, L2(Q) ) ) dε < ∞.

Here the supremum is taken over all finitely discrete probability measures Q on (X, A) with ‖F‖²_{Q,2} = ∫ F² dQ > 0. These conditions are by no means necessary, but they suffice for many examples. Finiteness of the previous integral will be referred to as the uniform entropy condition.
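For a finitely discrete Q the covering numbers appearing in (2.5.1) can be upper-bounded mechanically. The sketch below is our own illustration, with an arbitrarily chosen discrete Q and the class of half-line indicators; the greedy net gives an upper bound on N(ε, F, L2(Q)):

```python
import numpy as np

def greedy_covering_number(funcs, q_weights, eps):
    """Upper bound on N(eps, F, L2(Q)) by a greedy eps-net.

    funcs: (k, m) array; each row is one function evaluated at the m
    support points of the finitely discrete measure Q with weights
    q_weights.  Squared L2(Q)-distances are weighted sums over support.
    """
    uncovered = list(range(len(funcs)))
    centers = 0
    while uncovered:
        c = funcs[uncovered[0]]  # open a ball around the first uncovered f
        d2 = ((funcs[uncovered] - c) ** 2) @ q_weights
        uncovered = [i for i, dd in zip(uncovered, d2) if dd > eps ** 2]
        centers += 1
    return centers

# Q uniform on m random support points; F = half-line indicators 1{(-inf, t]}.
m = 200
support = np.sort(np.random.default_rng(1).uniform(size=m))
q = np.full(m, 1.0 / m)
ts = np.linspace(0, 1, 500)
F = (support[None, :] <= ts[:, None]).astype(float)

for eps in [0.5, 0.25, 0.125]:
    print(eps, greedy_covering_number(F, q, eps))  # grows roughly like 1/eps**2
```

For indicators, ‖1_s − 1_t‖²_{Q,2} equals the Q-mass of the interval between s and t, so ε-balls in L2(Q) correspond to intervals of mass ε², in line with the polynomial bounds established in Chapter 2.6.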

2.5.2 Theorem. Let F be a class of measurable functions that satisfies the uniform entropy bound (2.5.1). Let the classes F_δ = { f − g: f, g ∈ F, ‖f − g‖_{P,2} < δ } and F²_∞ be P-measurable for every δ > 0. If P*F² < ∞, then F is P-Donsker.

Proof. Let δ_n ↓ 0 be arbitrary. By Markov's inequality and the symmetrization Lemma 2.3.1,

    P*( ‖Gn‖_{F_{δ_n}} > x ) ≤ (2/x) E* ‖ n^{-1/2} Σ_{i=1}^n ε_i f(X_i) ‖_{F_{δ_n}}.

Since the supremum on the right-hand side is measurable by assumption, Fubini's theorem applies and the outer expectation can be calculated as E_X E_ε. Fix X_1, ..., X_n. By Hoeffding's inequality, the stochastic process f ↦ n^{-1/2} Σ_{i=1}^n ε_i f(X_i) is sub-Gaussian for the L2(Pn)-seminorm

    ‖f‖_n = √( n^{-1} Σ_{i=1}^n f²(X_i) ).

Use the second part of the maximal inequality Corollary 2.2.8 to find that

(2.5.3)        E_ε ‖ n^{-1/2} Σ_{i=1}^n ε_i f(X_i) ‖_{F_{δ_n}} ≲ ∫_0^∞ √( log N( ε, F_{δ_n}, L2(Pn) ) ) dε.

For large values of ε the set F_{δ_n} fits in a single ball of radius ε around the origin, in which case the integrand is zero. This is certainly the case for values of ε larger than θ_n, where

    θ_n² = sup_{f ∈ F_{δ_n}} ‖f‖_n² = ‖ n^{-1} Σ_{i=1}^n f²(X_i) ‖_{F_{δ_n}}.

Furthermore, covering numbers of the class F_δ are bounded by covering numbers of F_∞ = { f − g: f, g ∈ F }. The latter satisfy N( ε, F_∞, L2(Q) ) ≤ N²( ε/2, F, L2(Q) ) for every measure Q.
Limit the integral in (2.5.3) to the interval (0, θ_n), make a change of variables, and bound the integrand to obtain the bound

    ∫_0^{θ_n/‖F‖_n} sup_Q √( log N( ε‖F‖_{Q,2}, F, L2(Q) ) ) dε · ‖F‖_n.

Here the supremum is taken over all discrete probability measures. The integrand is integrable by assumption. Furthermore, ‖F‖_n is bounded below by ‖F_*‖_n, which converges almost surely to its expectation, which may be assumed positive. Use the Cauchy-Schwarz inequality and the dominated convergence theorem to see that the expectation (with respect to X_1, ..., X_n) of this integral converges to zero provided θ_n → 0 in outer probability. This would conclude the proof of asymptotic equicontinuity.
Since sup{ P f²: f ∈ F_{δ_n} } → 0 and F_{δ_n} ⊂ F_∞, it is certainly enough to prove that

    ‖ Pn f² − P f² ‖_{F_∞} → 0   in outer probability.
This is a uniform law of large numbers for the class F²_∞. This class has integrable envelope (2F)² and is measurable by assumption. For any pair f, g of functions in F_∞,

    Pn |f² − g²| ≤ Pn |f − g| 4F ≤ ‖f − g‖_n ‖4F‖_n.

It follows that the covering number N( ε‖2F‖_n², F²_∞, L1(Pn) ) is bounded by the covering number N( ε‖F‖_n, F_∞, L2(Pn) ). By assumption, the latter

number is bounded by a fixed number, so its logarithm is certainly o*_P(n), as required for the uniform law of large numbers, Theorem 2.4.3. This concludes the proof of asymptotic equicontinuity.
Finally, we show that F is totally bounded in L2(P). By the result of the last paragraph, there exists a sequence of discrete measures P_n with ‖ (P_n − P) f² ‖_{F_∞} converging to zero. Take n sufficiently large so that the supremum is bounded by ε². By assumption, N( ε, F, L2(P_n) ) is finite. Any ε-net for F in L2(P_n) is a √2 ε-net in L2(P).

2.5.4 Example (Cells in R^k). The set F of all indicator functions 1{(−∞, t]} of cells in R satisfies

    N( ε, F, L2(Q) ) ≤ N_[]( ε², F, L1(Q) ) ≤ 2/ε²,

for any probability measure Q and ε ≤ 1. Since ∫_0^1 √( log(1/ε) ) dε < ∞, the class of cells in R is Donsker. The covering numbers of the class of cells (−∞, t] in higher dimensions satisfy a similar bound, but with a higher power of 1/ε. Thus the class of all cells (−∞, t] in R^k is Donsker for any dimension.
In the next chapter these classes of sets are shown to be examples of
the large collection of VC-classes, which are all Donsker provided they are
suitably measurable.
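The Donsker property for the cells in R can be seen in a small simulation (our own illustration; the uniform sampling distribution, seed, and replication counts are arbitrary choices): the law of √n ‖Pn − P‖_F stabilizes as n grows, its limit being the supremum of the absolute value of a Brownian bridge (mean about 0.87):

```python
import numpy as np

rng = np.random.default_rng(2)

def scaled_sup(n):
    """sqrt(n) * sup_t |Fn(t) - t| for a Uniform[0, 1] sample, i.e. the
    supremum of |Gn| over the class of cells (-inf, t]."""
    x = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    return np.sqrt(n) * max(np.max(i / n - x), np.max(x - (i - 1) / n))

# By the Donsker property the distribution of scaled_sup(n) stabilizes:
for n in [200, 2000]:
    print(n, np.mean([scaled_sup(n) for _ in range(200)]))
```

Both averages land near the same value, in contrast with the unscaled suprema of the Glivenko-Cantelli chapter, which tend to zero.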

2.5.2 Bracketing

The second main empirical central limit theorem uses bracketing entropy rather than uniform entropy. The simplest version of this theorem uses L2(P)-brackets, and requires finiteness of the corresponding bracketing integral

(2.5.5)        ∫_0^∞ √( log N_[]( ε, F, L2(P) ) ) dε < ∞.

2.5.6 Theorem. Any class F of measurable functions satisfying (2.5.5) is


P -Donsker.

Unlike the uniform entropy condition, the bracketing integral involves


only the true underlying measure P . However, part of this gain is offset
by the fact that bracketing numbers can be larger than covering numbers.
As a result, the two sufficient conditions for a class to be Donsker are not
comparable. Examples of classes of functions that satisfy the bracketing
condition are given in Chapter 2.7.
We shall prove a slightly stronger version of the theorem. From a given finite set of L2(P)-brackets it is not difficult to construct an envelope function F that is square-integrable, and hence the existence of such an envelope is implicit in the assumption of the preceding theorem. It is known that the latter is not necessary for a class F to be Donsker, though the envelope must possess a weak second moment (meaning that P*(F > x) = o(x^{-2}) as x → ∞). Similarly, the L2(P)-norm used to measure the size of the brackets can be replaced by a weaker norm, which makes the bracketing
as x → ∞). Similarly, the L2 (P )-norm used to measure the size of the
brackets can be replaced by a weaker norm, which makes the bracketing
numbers smaller and the convergence of the integral easier. The result below
measures the size of the brackets by their L2,∞ -norm given by
 1/2
kf kP,2,∞ = sup x2 P (|f | > x) .
x>0

(Actually this is not really a norm, because it does not satisfy the triangle inequality. It can be shown that there exists a norm that is equivalent to this “norm” up to a constant 2, but this is irrelevant for the present purposes.) Note that ‖f‖_{P,2,∞} ≤ ‖f‖_{P,2}, so that the bracketing numbers relative to L2,∞(P) are smaller.
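Under an empirical (finitely discrete) distribution the weak norm can be computed exactly, because the supremum over x is approached along the observed values of |f|. The sketch below (our own illustration) computes it and checks the comparison ‖f‖_{P,2,∞} ≤ ‖f‖_{P,2} just noted:

```python
import numpy as np

def weak_l2_norm(sample):
    """||f||_{2,infty} = sup_x (x^2 P(|f| > x))^{1/2} under the empirical
    distribution of `sample`.

    With |f|-values a_1 >= ... >= a_n sorted decreasingly, the supremum
    equals max_k a_k^2 * (k/n), approached as x increases to a_k.
    """
    a = np.sort(np.abs(sample))[::-1]
    n = len(a)
    tail = np.arange(1, n + 1) / n  # empirical P(|f| >= a_k) = k/n
    return np.sqrt(np.max(a ** 2 * tail))

rng = np.random.default_rng(3)
x = rng.standard_normal(100000)
print(weak_l2_norm(x), np.sqrt(np.mean(x ** 2)))  # weak norm <= L2 norm
```

The inequality printed here is exactly Chebyshev's bound x² P(|f| > x) ≤ P f², valid for every distribution.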
The proofs of the preceding theorem and its strengthening below are based on a chaining argument. Unlike in the proof of the uniform entropy theorem, this will be applied to the original summands, without an initial symmetrization. Because the summands are not appropriately bounded, Hoeffding's inequality and the sub-Gaussian maximal inequality do not apply. Instead, Bernstein's inequality is used in the form

    P( |Gn f| > x ) ≤ 2 exp( − (1/2) x² / ( P f² + (2/3) ‖f‖_∞ x/√n ) ).

This is valid for every square integrable, uniformly bounded function f. Lemma 2.2.10 implies that for a finite set F of cardinality |F| ≥ 2,

(2.5.7)        E ‖Gn‖_F ≲ max_f ( ‖f‖_∞ / √n ) log|F| + max_f ‖f‖_{P,2} √( log|F| ).

The chaining argument is set up in such a way that the two terms on the right are of comparable magnitude. This is the case if ‖f‖_∞ ∼ √n ‖f‖_{P,2} / √( log|F| ); the inequality is applied after truncating the functions f at this level.
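Substituting the stated truncation level into (2.5.7) makes the balance explicit (a one-line check in the notation above):

```latex
\frac{\|f\|_\infty}{\sqrt n}\,\log|\mathcal F|
  \;\sim\; \frac{1}{\sqrt n}\cdot\frac{\sqrt n\,\|f\|_{P,2}}{\sqrt{\log|\mathcal F|}}\cdot\log|\mathcal F|
  \;=\; \|f\|_{P,2}\,\sqrt{\log|\mathcal F|},
```

so after truncation at this level both terms of (2.5.7) are of the order max_f ‖f‖_{P,2} √(log|F|).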

2.5.8 Theorem. Let F be a class of measurable functions such that

    ∫_0^∞ √( log N_[]( ε, F, L2,∞(P) ) ) dε + ∫_0^∞ √( log N( ε, F, L2(P) ) ) dε < ∞.

Moreover, assume that the envelope function F of F possesses a weak second moment. Then F is P-Donsker.

Proofs. It suffices to prove the second theorem. For each natural number q, there exists a partition F = ∪_{i=1}^{N_q} F_{qi} of F into N_q disjoint subsets such that

    Σ_q 2^{-q} √( log N_q ) < ∞,
    sup_{f,g ∈ F_{qi}} ‖ |f − g|* ‖_{P,2,∞} < 2^{-q},
    sup_{f,g ∈ F_{qi}} ‖ f − g ‖_{P,2} < 2^{-q}.

To see this, first cover F separately with minimal numbers of L2(P)-balls and L2,∞(P)-brackets of size 2^{-q}, disjointify, and take the intersection of the two partitions. The total number of sets will be N_q = N_{q1} N_{q2} if N_{q1} and N_{q2} are the numbers of sets in the two separate partitions. The logarithm turns the product into a sum, and the first condition is satisfied if it is satisfied for both N_{qi}.
The sequence of partitions can, without loss of generality, be chosen as successive refinements. Indeed, first construct a sequence of partitions F = ∪_{i=1}^{N̄_q} F̄_{qi}, possibly without this property. Next take the partition at stage q to consist of all intersections of the form ∩_{p=1}^q F̄_{p,i_p}. This gives partitions into N_q = N̄_1 ··· N̄_q sets. Using the inequality (log ∏_p N̄_p)^{1/2} ≤ Σ_p (log N̄_p)^{1/2} and rearranging sums, it is seen that the first of the three displayed conditions is still satisfied.
Choose for each q a fixed element f_{qi} from each partitioning set F_{qi}, and set

    π_q f = f_{qi}   if f ∈ F_{qi},
    Δ_q f = sup_{f,g ∈ F_{qi}} |f − g|*   if f ∈ F_{qi}.

Note that π_q f and Δ_q f run through a set of N_q functions if f runs through F. In view of Theorem 1.5.6, it suffices to show that the sequence ‖ Gn(f − π_{q_0} f) ‖_F converges in probability to zero as n → ∞ followed by q_0 → ∞.
Define for each fixed n and (large) q ≥ q_0 numbers and indicator functions

    a_q = 2^{-q} / √( log N_{q+1} ),
    A_{q-1} f = 1{ Δ_{q_0} f ≤ √n a_{q_0}, ..., Δ_{q-1} f ≤ √n a_{q-1} },
    B_q f = 1{ Δ_{q_0} f ≤ √n a_{q_0}, ..., Δ_{q-1} f ≤ √n a_{q-1}, Δ_q f > √n a_q },
    B_{q_0} f = 1{ Δ_{q_0} f > √n a_{q_0} }.
Note that A_q f and B_q f are constant in f on each of the partitioning sets F_{qi} at level q, because the partitions are nested. Now decompose, pointwise in x (which is suppressed in the notation),

    f − π_{q_0} f = (f − π_{q_0} f) B_{q_0} f + Σ_{q_0+1}^∞ (f − π_q f) B_q f + Σ_{q_0+1}^∞ (π_q f − π_{q-1} f) A_{q-1} f.

The idea here is to write the left side as the sum of f − π_{q_1} f and Σ_{q_0+1}^{q_1} (π_q f − π_{q-1} f) for the largest q_1 = q_1(f, x) such that each of the “links” π_q f −
π_{q-1} f in the “chain” is bounded in absolute value by √n a_q (note that |π_q f − π_{q-1} f| ≤ Δ_{q-1} f). For a rigorous derivation, note that either all B_q f are zero or there is a unique q_1 with B_{q_1} f = 1. In the first case, the first two terms in the decomposition are zero and the third term is an infinite series (all A_q f equal 1) whose qth partial sum telescopes out to π_q f − π_{q_0} f and converges to f − π_{q_0} f by the definition of the A_q f. In the second case, A_{q-1} f = 1 if and only if q ≤ q_1 and the decomposition is as mentioned, apart from the separate treatment of the case that q_1 = q_0, when already the first link fails the test.
Next apply the empirical process Gn = √n (Pn − P) to each of the three terms separately, and take suprema over f ∈ F. It will be shown that the resulting three variables converge to zero in probability as n → ∞ followed by q_0 → ∞.
First, since |f − π_{q_0} f| B_{q_0} f ≤ 2F 1{2F > √n a_{q_0}}, one has

    E* ‖ Gn (f − π_{q_0} f) B_{q_0} f ‖_F ≤ 4 √n P* F 1{2F > √n a_{q_0}}.

The right side converges to zero as n → ∞, for each fixed q_0, by the assumption that F has a weak second moment (Problem 2.5.6).
Second, since the partitions are nested, Δ_q f B_q f ≤ Δ_{q-1} f B_q f, whence by the inequality of Problem 2.5.5,

    √n a_q P Δ_q f B_q f ≤ 2 ‖Δ_q f‖²_{P,2,∞} ≤ 2 · 2^{-2q}.

Since Δ_{q-1} f B_q f is bounded by √n a_{q-1} for q > q_0, we obtain

    P (Δ_q f B_q f)² ≤ √n a_{q-1} P Δ_q f 1{Δ_q f > √n a_q} ≤ 2 (a_{q-1}/a_q) 2^{-2q}.

Apply the triangle inequality and inequality (2.5.7) to find

    E* ‖ Σ_{q_0+1}^∞ Gn (f − π_q f) B_q f ‖_F
        ≤ Σ_{q_0+1}^∞ E* ‖ Gn Δ_q f B_q f ‖_F + Σ_{q_0+1}^∞ 2 √n ‖ P Δ_q f B_q f ‖_F
        ≲ Σ_{q_0+1}^∞ [ a_{q-1} log N_q + √( a_{q-1}/a_q ) 2^{-q} √( log N_q ) + (4/a_q) 2^{-2q} ].

Since a_q is decreasing, the quotient a_{q-1}/a_q can be replaced by its square. Then, in view of the definition of a_q, the series on the right can be bounded by a multiple of Σ_{q_0+1}^∞ 2^{-q} √( log N_q ). This upper bound is independent of n and converges to zero as q_0 → ∞.
Third, there are at most N_q functions π_q f − π_{q-1} f and at most N_{q-1} functions A_{q-1} f. Since the partitions are nested, the function |π_q f − π_{q-1} f| A_{q-1} f is bounded by Δ_{q-1} f A_{q-1} f ≤ √n a_{q-1}. The L2(P)-norm of |π_q f − π_{q-1} f| is bounded by 2^{-q+1}. Apply inequality (2.5.7) to find

    Σ_{q_0+1}^∞ E* ‖ Gn (π_q f − π_{q-1} f) A_{q-1} f ‖_F ≲ Σ_{q_0+1}^∞ [ a_{q-1} log N_q + 2^{-q} √( log N_q ) ].

Again this upper bound is independent of n and converges to zero as q0 →


∞.

2.5.9 Example (Cells in R). The classical empirical central limit theorem was shown to be a special case of the uniform entropy central limit theorem in the preceding subsection. This classical theorem is a special case of the bracketing central limit theorem as well. This follows since N_[]( √2 ε, F, L2(P) ) ≤ N_[]( ε², F, L1(P) ) for every class of functions f: X → [0, 1]. The L1-bracketing numbers were bounded above by a power of 1/ε in the preceding subsection.

Problems and Complements


1. (Taking the supremum over all Q) The supremum in (2.5.1) can be replaced by the supremum over all probability measures Q such that 0 < QF² < ∞, without changing the condition. In fact, for any probability measure P and r > 0 such that 0 < P F^r < ∞,

    D( 2ε‖F‖_{P,r}, F, Lr(P) ) ≤ sup_Q D( ε‖F‖_{Q,r}, F, Lr(Q) ),

if the supremum on the right is taken over all discrete probability measures Q.
[Hint: If the left side is equal to m, then there exist functions f_1, ..., f_m such that P|f_i − f_j|^r > 2^r ε^r P F^r for i ≠ j. By the strong law of large numbers, Pn|f_i − f_j|^r → P|f_i − f_j|^r almost surely. Also Pn F^r → P F^r almost surely. Thus there exist n and ω such that Pn(ω)|f_i − f_j|^r > 2^r ε^r P F^r and Pn(ω) F^r < 2^r P F^r.]

2. Can the constant 2 in the preceding problem be replaced by 1? Is a similar


inequality valid for covering numbers, and what is the best constant?
3. The presence of the factor ‖F‖_{Q,2} in the uniform entropy condition (2.5.1) is helpful if it is larger than 1 but is detrimental otherwise. However, given any class F with a square integrable envelope, there is always a square integrable envelope F with ‖F‖_{Q,2} ≥ 1 for all Q.

4. If F is compact, then the function ε ↦ N( ε, F, L2(P) ) is left continuous.
[Hint: The function f ↦ inf_i ‖f − f_i‖ attains its maximum for every finite set f_1, ..., f_p.]

5. For each positive random variable X, one has the inequalities ‖X‖²_{2,∞} ≤ sup_{t>0} t E X 1{X > t} ≤ 2 ‖X‖²_{2,∞}.
[Hint: The first inequality follows from Markov's inequality. For the second, write the expectation in the expression in the middle as ∫_0^∞ P( X 1{X > t} > x ) dx. Next, split the integral in the part from zero to t and its complement. On the first part, the integrand is constant; on the second, it can be bounded by x^{-2} times the square of the L2,∞-norm.]

6. Each random variable with P(|X| > t) = o(t^{-2}) as t → ∞ (a “weak second moment”) satisfies E |X| 1{|X| > t} = o(t^{-1}) as t → ∞.
7. For any random variable X, one has ‖X‖_1 ≤ 3 ‖X‖_{2,∞}.
2.6
Uniform Entropy Numbers

In Section 2.5.1 the empirical process was shown to converge weakly for indexing sets F satisfying a uniform entropy condition. In particular, if

    sup_Q log N( ε‖F‖_{Q,2}, F, L2(Q) ) ≤ K (1/ε)^{2−δ},

for some δ > 0, then the entropy integral (2.5.1) converges and F is a Donsker class for any probability measure P such that P*F² < ∞, provided measurability conditions are met. Many classes of functions satisfy this condition and often even the much stronger condition

    sup_Q N( ε‖F‖_{Q,2}, F, L2(Q) ) ≤ K (1/ε)^V,   0 < ε < 1,

for some number V. In this chapter this is shown for classes satisfying certain combinatorial conditions. For classes of sets, these were first studied by Vapnik and Červonenkis, whence the name VC-classes. In the second part of this chapter, VC-classes of functions are defined in terms of VC-classes of sets. The remainder of this chapter considers operations on classes that preserve entropy properties, such as taking convex hulls.

2.6.1 VC-Classes of Sets


Let C be a collection of subsets of a set X. An arbitrary set of n points {x_1, ..., x_n} possesses 2^n subsets. Say that C picks out a certain subset from {x_1, ..., x_n} if this can be formed as a set of the form C ∩ {x_1, ..., x_n} for a C in C. The collection C is said to shatter {x_1, ..., x_n} if each of its 2^n subsets can be picked out in this manner. The VC-dimension V(C) of the class C is the largest n such that some set of size n is shattered by C.*
Clearly, the more refined C is, the larger is its dimension. The dimension is more formally defined through

    Δ_n(C, x_1, ..., x_n) = #{ C ∩ {x_1, ..., x_n}: C ∈ C },
    V(C) = sup{ n: max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n) = 2^n }.

Thus the dimension is infinite if C shatters sets of arbitrarily large size.


A collection of measurable sets C is called a VC-class if its dimension
is finite. The main result of this section is the remarkable fact that the
covering numbers of any VC-class grow polynomially in 1/ε as ε → 0, of
order dependent on the dimension of the class. Since the envelope of a class
of sets is certainly square integrable, it follows that a VC-class of sets is
Donsker for any underlying probability measure, provided measurability
conditions are met.

2.6.1 Example (Cells in R^d). The collection of all cells of the form (−∞, c] in R shatters no two-point set {x_1, x_2}, because it fails to pick out the larger of the two points. Hence its VC-dimension equals 1. The collection of all cells (a, b] in R shatters every two-point set but cannot pick out the subset consisting of the smallest and largest points of any set of three points. Thus its VC-dimension equals 2. With more effort, it can be seen that the dimensions of the same types of sets in R^d are d and 2d, respectively.
Techniques to show that many other collections of sets are VC are
discussed in Section 2.6.5.
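For small finite examples, shattering and the VC-dimension can be verified by brute force. The sketch below (our own illustration; the five-point grid and the discrete analogues of the cells (−∞, c] and (a, b] are choices made for the example) recovers the dimensions 1 and 2 of Example 2.6.1:

```python
from itertools import combinations

def picks_out(collection, points):
    """Number of distinct subsets of `points` of the form C ∩ points,
    i.e. Delta_n(C, x_1, ..., x_n)."""
    return len({frozenset(p for p in points if p in C) for C in collection})

def vc_dimension(collection, universe):
    """Largest n such that some n-point subset of `universe` is shattered.
    Brute force; feasible only for small finite examples."""
    dim = 0
    for n in range(1, len(universe) + 1):
        if any(picks_out(collection, pts) == 2 ** n
               for pts in combinations(universe, n)):
            dim = n
    return dim

universe = [1, 2, 3, 4, 5]
left_cells = [set(range(1, c + 1)) for c in range(0, 6)]          # (-inf, c]
intervals = [set(range(a + 1, b + 1)) for a in range(6) for b in range(a, 6)]
print(vc_dimension(left_cells, universe))   # 1, as in Example 2.6.1
print(vc_dimension(intervals, universe))    # 2
```

The half-lines fail to pick out the larger point of any pair, and the intervals fail to pick out the two extreme points of any triple, exactly as argued above.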

The following combinatorial result is needed: the number of subsets shattered by a class C is at least the number of subsets picked out by C. In particular, if the class C shatters no set of V + 1 points, then the number of subsets picked out by C is at most Σ_{j=0}^V (n choose j), the number of subsets of size ≤ V.

2.6.2 Lemma. Let {x1 , . . . , xn } be arbitrary points. Then the total num-
ber of subsets ∆n (C, x1 , . . . , xn ) picked out by C is bounded above by the
number of subsets of {x1 , . . . , xn } shattered by C.

Proof. Assume without loss of generality that every C is a subset of the given set of points, so that each C picks out itself and Δ_n(C, x_1, ..., x_n) is the cardinality of C.
* In the first edition of this book the notation V(C) was used for the VC-index, which is defined as the smallest n such that no set of n points is shattered. This index is the present dimension plus 1.

Call C hereditary if it has the property that B ∈ C whenever B ⊂ C, for a set C ∈ C. Each of the sets in a hereditary collection of sets is shattered, whence a hereditary collection shatters at least |C| sets, and the assertion of the lemma is certainly true for hereditary collections. It will be shown that a general C can be transformed into a hereditary collection, without changing its cardinality and without increasing the number of its shattered subsets.
Given 1 ≤ i ≤ n and C ∈ C, define the set T_i(C) to be C − {x_i} if C − {x_i} does not belong to the collection C, and to be C otherwise. Thus x_i is deleted from C if this creates a “new” set; otherwise it is retained. If x_i ∉ C, then C is left unchanged.
Since the map T_i is one-to-one, the collections C and T_i(C) have the same cardinality. Furthermore, every subset A ⊂ {x_1, ..., x_n} that is shattered by T_i(C) is shattered by C. If x_i ∉ A, this is clear from the fact that the collection of sets C ∩ A does not change if each C is replaced by its transformation T_i(C). Second, if x_i ∈ A and A is shattered by T_i(C), then for every B ⊂ A, there is C ∈ C with B ∪ {x_i} = T_i(C) ∩ A. This implies x_i ∈ T_i(C), whence T_i(C) = C, whence C − {x_i} ∈ C. Thus both B ∪ {x_i} and B − {x_i} = (C − {x_i}) ∩ A are picked out by C. One of them equals B.
It has been shown that the assertion of the lemma is true for C if it is true for T_i(C). The same is true for the operator T_1 ∘ T_2 ∘ ··· ∘ T_n playing the role of T_i. Apply this operator repeatedly, until the collection of sets does not change any more. This happens after at most Σ_{C∈C} |C| steps, because Σ_{C∈C} |T_i(C)| < Σ_{C∈C} |C| whenever the collections T_i(C) and C are different. The collection D obtained in this manner has the property that D − {x_i} is contained in D for every D ∈ D and every x_i. Consequently, D is hereditary.

2.6.3 Corollary. For a VC-class of sets of dimension V(C), one has

    max_{x_1,...,x_n} Δ_n(C, x_1, ..., x_n) ≤ Σ_{j=0}^{V(C)} (n choose j).

Consequently, the numbers on the left side grow polynomially, of order at most O(n^{V(C)}) as n → ∞.

Proof. All sets shattered by a VC-class are among the sets of size at most
V (C). The number of shattered sets gives an upper bound on ∆n by the
preceding lemma.
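The corollary can be checked numerically. The sketch below (our own illustration, with an arbitrarily chosen grid) computes Δn for the class of intervals (a, b] restricted to a grid, a VC-class of dimension 2 as in Example 2.6.1, and compares it with the binomial bound; for this class the bound happens to hold with equality:

```python
from itertools import combinations
from math import comb

def delta_n(collection, points):
    """Delta_n(C, x_1, ..., x_n): number of subsets picked out by C."""
    return len({frozenset(p for p in points if p in C) for C in collection})

# Intervals (a, b] restricted to a grid: a VC-class of dimension V = 2.
grid = list(range(1, 21))
intervals = [set(range(a + 1, b + 1)) for a in range(21) for b in range(a, 21)]

V = 2
for n in [5, 10, 20]:
    points = grid[:n]
    sauer = sum(comb(n, j) for j in range(V + 1))   # 1 + n + n(n-1)/2
    print(n, delta_n(intervals, points), sauer)     # Delta_n <= Sauer bound
```

The subsets picked out of n grid points are the empty set and the n(n+1)/2 consecutive runs, which is exactly 1 + n + n(n−1)/2, so the polynomial growth O(n^{V(C)}) of the corollary is visible directly.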

2.6.4 Theorem. There exists a universal constant K such that for any VC-class C of sets, any probability measure Q, any r ≥ 1, and 0 < ε < 1,

    N( ε, C, Lr(Q) ) ≤ K V(C) (4e)^{V(C)} (1/ε)^{r V(C)}.

Proof. In view of the equality ‖1_C − 1_D‖_{Q,r} = Q^{1/r}(C △ D) for any pair of sets C and D, the upper bound for a general r is an easy consequence of the bound for r = 1. The proof for r = 1 is long. Problem 2.6.4 gives a short proof of a slightly weaker result.
By Problem 2.6.3, it suffices to consider the case that Q is an empirical-type measure: a measure on a finite set of points y_1, ..., y_k such that Q{y_i} = l_i/n for integers l_1, ..., l_k that add up to n. Then it is no loss of generality to assume that each set in the collection C is a subset of these points.
Let x_1, ..., x_n be the set of points y_1, ..., y_k with each y_i occurring l_i times. For each subset C of y_1, ..., y_k, form a subset C̃ of x_1, ..., x_n by taking all x_i that are copies of some y_i in C. More precisely, let φ: {1, ..., n} → {y_1, ..., y_k} be an arbitrary, fixed map such that #( j: φ(j) = y_i ) = l_i for every i. Let C̃ = { j: φ(j) ∈ C }. Now identify x_1, ..., x_n with 1, ..., n.
If a set {x_j: j ∈ J} is shattered by the collection C̃ of sets C̃, then every x_j in this set must correspond to a different y_i (i.e., the map φ must be one-to-one on {x_j: j ∈ J}), and the subset { φ(x_j): j ∈ J } of {y_1, ..., y_k} must be shattered by C. This shows that the collection C̃ is VC of the same dimension as C. By construction, Q(C △ D) = Q̃(C̃ △ D̃) for Q̃ equal to the discrete uniform measure on x_1, ..., x_n. Thus the L1(Q)-distance on C corresponds to the L1(Q̃)-distance on C̃. Conclude that N( ε, C̃, L1(Q̃) ) = N( ε, C, L1(Q) ), and it suffices to prove the upper bound for C̃ and Q̃. For simplicity of notation, assume that Q is the discrete uniform measure on a set of points x_1, ..., x_n.
Each set C can be represented by an n-vector of ones and zeros indicat-
ing whether or not the points xi are contained in the set. Thus the collection
C is identified with a subset Z of the vertices of the n-dimensional hyper-
cube [0, 1]n . Alternatively, C can be identified with an (n × #C)-matrix of
zeros and ones, the columns representing the individual sets C. For a subset
I of rows, let ZI be the matrix obtained by first deleting the other rows and
next dropping duplicate columns. Alternatively, Z_I is the projection of the point set Z onto [0, 1]^I. In these terms Z_I corresponds to the collection of subsets C ∩ {x_i: i ∈ I} of the set of points {x_i: i ∈ I}, and this set is shattered if the matrix Z_I consists of all 2^{|I|} possible columns or if, as a subset of the |I|-dimensional hypercube, Z_I contains all vertices. By assumption, this is possible only for |I| ≤ V (C). Abbreviate V = V (C).
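The identification of a finite collection of sets with 0/1 columns, and the projection test for shattering, are easy to make concrete. The following is a minimal sketch in plain Python (the helper names are mine, not the book's): a collection Z ⊂ {0, 1}^n is stored as a set of tuples, the projection Z_I drops the other coordinates and duplicate columns, and an index set I is shattered exactly when Z_I contains all 2^{|I|} vertices.

```python
from itertools import combinations

def project(Z, I):
    """Project the column set Z (tuples of 0/1) onto the coordinates in I,
    dropping duplicate columns."""
    return {tuple(z[i] for i in I) for z in Z}

def is_shattered(Z, I):
    """I is shattered iff the projection Z_I contains all 2^|I| vertices."""
    return len(project(Z, I)) == 2 ** len(I)

def vc_dimension(Z, n):
    """Largest |I|, over I subset of {0, ..., n-1}, that is shattered (brute force)."""
    V = 0
    for k in range(1, n + 1):
        if any(is_shattered(Z, I) for I in combinations(range(n), k)):
            V = k
    return V

# Indicator vectors of the left half-lines on 4 points: {}, {x1}, {x1,x2}, ...
Z = {(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1)}
```

For the half-lines above the brute-force dimension is 1: the projection onto any pair of coordinates misses the pattern (0, 1), so no two-point set is shattered.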
The normalized Hamming metric on the point set Z is defined by

    d(w, z) = (1/n) Σ_{j=1}^n |w_j − z_j|,    z, w ∈ Z.

This corresponds exactly to the L_1(Q)-metric on the collection C: if the sets C and D correspond to the vertices w and z, then Q(C △ D) = d(w, z) for the discrete uniform measure Q.
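The identity Q(C △ D) = d(w, z) for the discrete uniform measure can be verified directly; a small illustrative sketch (helper names are mine):

```python
def hamming(w, z):
    """Normalized Hamming distance between two 0/1 vectors of equal length."""
    n = len(w)
    return sum(abs(wi - zi) for wi, zi in zip(w, z)) / n

def sym_diff_mass(C, D, n):
    """Q(C symmetric-difference D) for Q the discrete uniform measure on n points."""
    return len(C.symmetric_difference(D)) / n

def indicator(C, n):
    """0/1 vector (vertex of the hypercube) representing the set C."""
    return tuple(1 if i in C else 0 for i in range(n))

n = 6
C, D = {0, 2, 3}, {2, 4}   # C triangle D = {0, 3, 4}, so both sides equal 3/6
```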
2.6 Uniform Entropy Numbers 197

Fix a maximal ε-separated collection of sets C ∈ C. For simplicity of notation assume that C (or equivalently, Z) itself is ε-separated. We shall bound its cardinality, which constitutes a bound on the packing number D(ε, C, L_1(Q)).
Let Z be a random variable with the discrete uniform distribution
on the point set Z in [0, 1]n . The coordinates of Z = (Z1 , . . . , Zn ) are
(correlated, non-i.i.d.) Bernoulli variables taking values in {0, 1}. Fix an integer V ≤ m < n. Given a subset I of {1, 2, . . . , n} of size m + 1, apply Lemma 2.6.6 (below) to the projection Z_I = (Z_i: i ∈ I) of Z, which takes its values in Z_I, to conclude that

    Σ_{i∈I} E var(Z_i | Z_{I−{i}}) ≤ V.

Take the sum over all (n choose m+1) such subsets I on both left and right. The double sum on the left can be rearranged. Instead of leaving out one element at a time from every possible subset of size m + 1, we may add missing elements to every possible set of size m. Conclude that

(2.6.5)    Σ_J Σ_{i∉J} E var(Z_i | Z_J) ≤ (n choose m+1) V,

where the first sum is over all subsets J of size m.


Conditionally on Z_J = s, the vector Z is uniformly distributed over the set of columns z in Z for which z_J = s. Suppose there are N_s such columns. Let W and W̃ be independent random vectors, defined on a common probability space and each distributed uniformly over these columns. By the ε-separation of Z, the vectors W and W̃ are at Hamming distance d(W, W̃) at least ε whenever they are unequal, which happens with probability 1 − 1/N_s. Since var W_i = E(W_i − W̃_i)^2/2,

    Σ_{i∉J} var(Z_i | Z_J = s) = Σ_{i∉J} ½ E(W_i − W̃_i)^2 = ½ E n d(W, W̃) ≥ ½ nε (1 − 1/N_s).

If we integrate the left side with respect to s under the distribution of Z_J and next sum over J, then we obtain the left side of (2.6.5). Since P(Z_J = s) = N_s/#Z, the resulting expression is bounded below by, with #Z̄_J denoting the average of #Z_J over J,

    Σ_J Σ_{s∈Z_J} (N_s/#Z) ½ εn (1 − 1/N_s) = (n choose m) ½ εn (1 − #Z̄_J/#Z).

This is bounded above by the right side of (2.6.5). Rearrange and simplify the resulting inequality to obtain

    #Z ≤ #Z̄_J · nε(m + 1)/(nεm + nε − 2nV + 2mV) ≤ #Z̄_J · εm/(εm − 2V).

The number of points in Z_J is equal to the number of subsets picked out by C from the points {x_i: i ∈ J}. By Sauer's lemma, this is bounded by Σ_{j=0}^V (m choose j), which is smaller than (em/V)^V for m ≥ V. Thus we obtain

    #Z ≤ (e/V)^V m^{V+1} ε/(mε − 2V),
for every integer V ≤ m < n. The optimal unrestricted choice of m ∈ (0, ∞) is m = 2(V + 1)/ε, for which the upper bound takes the form

    ½ (e/V)^V (2(V + 1)/ε)^{V+1} ≤ (V + 1) e (2e/ε)^V.
Since the upper bound evaluated at m = 2(V + 1)/ε + 1 differs from this
only up to a universally bounded constant, the discretization of m causes no
problem. Furthermore, for the optimal choice of m, the inequality m > V
is implied by ε < 2, while for m = 2(V + 1)/ε ≥ n, we can use the trivial
bound #Z ≤ n ≤ m, which is certainly bounded by the right side of the
preceding display for V ≥ 1. For V = 0, the collection C consists of at most
one set, and the theorem is trivial.
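Sauer's lemma, which the argument above uses to bound #Z̄_J, is easy to probe numerically. The sketch below (illustrative only; the helper names are mine) counts the subsets of m collinear points picked out by the intervals on the line, a VC-class of dimension 2, and compares the count with the bound Σ_{j=0}^V (m choose j) and with (em/V)^V.

```python
from math import comb, e

def picked_out_by_intervals(m):
    """Subsets of {0, ..., m-1} picked out by intervals [a, b] of the line:
    the runs of consecutive indices, plus the empty set."""
    subsets = {frozenset()}
    for a in range(m):
        for b in range(a, m):
            subsets.add(frozenset(range(a, b + 1)))
    return subsets

def sauer_bound(m, V):
    """Sauer's lemma bound: sum_{j=0}^V (m choose j)."""
    return sum(comb(m, j) for j in range(V + 1))

m, V = 10, 2                 # intervals form a VC-class of dimension 2
n_picked = len(picked_out_by_intervals(m))
```

For intervals the count is 1 + m(m + 1)/2 = 56 at m = 10, which happens to attain the Sauer bound exactly, while (em/V)^V ≈ 184.7 is comfortably larger.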

2.6.6 Lemma. Let Z be an arbitrary random vector taking values in a set Z ⊂ {0, 1}^n that corresponds to a VC-class C of subsets of a set of points {x_1, . . . , x_n}. Then

    Σ_{i=1}^n E var(Z_i | Z_j, j ≠ i) ≤ V (C).

Proof. If the values of all coordinates, except the ith coordinate, of Z are given, then the vector Z can have at most two values. If there is only one possible value, then the conditional variance of Z_i is zero. If there are two possible values, then we can call these v and w, where v_i = 0 and w_i = 1 and the other coordinates of v and w agree. Write p(z) for P(Z = z). Then Z_i is 1 or 0 with conditional probabilities p := p(w)/(p(v) + p(w)) and 1 − p, respectively, and the conditional variance of Z_i is p(1 − p). The marginal probability that (Z_j: j ≠ i) is equal to (v_j: j ≠ i) is p(v) + p(w).
Form a graph by taking the points of Z as the nodes and connecting
any two nodes that are at Hamming distance 1/n. This is a subgraph of the
set of edges of the n-dimensional unit cube. Let Ei and E be the set of all
edges that cross the ith dimension and all edges in the graph, respectively.
If Ei is empty, then all variances in the ith term of the lemma are zero.
In the other case, the edges {v, w} in E concern nodes v and w as in the
preceding paragraph. Then, with an empty sum understood as zero,

    Σ_{i=1}^n E var(Z_i | Z_j, j ≠ i) = Σ_{i=1}^n Σ_{{v,w}∈E_i} (p(v) + p(w)) [p(v)/(p(v) + p(w))] [p(w)/(p(v) + p(w))] ≤ Σ_{{v,w}∈E} p(v) ∧ p(w).
Suppose that the edges E can be directed in such a way that at each node
(point in Z) the number of arrows that are directed away is at most V (C).
Then the last sum can be rearranged as a double sum, the outer sum carried out over the nodes and the inner sum over the arrows directed away from the particular node. Thus the sum is bounded by Σ_{z∈Z} p(z) #arrows(z), which is bounded by V (C).
It may be shown that the edges can always be directed in the given
manner (Problem 2.6.5), but it is more direct to proceed in a different
manner. Recall from the proof of Lemma 2.6.2 that a class C is hereditary
if it contains every subset of every set contained in C. Since each set in such
a class C is shattered, each set in a hereditary VC-class C has at most V (C)
points. For such a class the edges of the graph can be directed as desired
by decreasing number of elements (or downward in the cube): direct the
edge between v and w in the direction w → v if the (one) coordinate of w
by which w differs from v equals 1. For each given w this creates at most
V (C) arrows, as w has at most V (C) coordinates equal to 1. This concludes the
proof in the case that C is hereditary.
It was shown in the proof of Lemma 2.6.2 that an arbitrary collection
C can be transformed into a hereditary collection by repeated application
of certain operators Ti : C → C, which are one-to-one and do not increase
the VC-dimension. The identification of sets in C with points in Z allows us to view the operators also as maps T_i: Z → Z, and they can be described as
pushing a point z ∈ Z that is at height 1 in the ith coordinate (i.e. zi = 1)
down to height 0 (i.e. changing zi to 0) if the resulting point was not already
in Z, and leaving all other points invariant. We can form a new graph by
taking the set of nodes equal to Ti (Z) and connecting any two nodes at
Hamming distance 1/n. The operator T_i induces a one-to-one map from the set of edges E = E(Z) in the original graph into (but not necessarily onto) the set of edges E(T_i(Z)) of the new graph, as follows. The nodes
v and w of an edge {v, w} in E that crosses the ith dimension (i.e. v and
w differ in their ith coordinates and no other ones) are left invariant by Ti ;
we map such an edge onto itself. We do the same for an edge at height 0
for the ith coordinate. For an edge {v, w} ∈ E at height 1 (i.e. v_i = w_i = 1) there are three possibilities. If T_i pushes both v and w down, say to v′ and w′, then the latter two nodes were not in Z and hence neither was the edge {v′, w′}; we map {v, w} to the latter edge. If T_i pushes exactly one of v and w down, say v to v′, then v′ was not in Z, but w′ was in Z, whence {v′, w′} was not in E but is in E(T_i(Z)); we map {v, w} to {v′, w′}. Third, if T_i pushes neither of v and w down, then we map {v, w} to itself. Thus we have defined a
one-to-one map from E into E(T_i(Z)). We also define a probability measure p_i on T_i(Z) by shifting mass downward in the following manner: first we give every z ∈ T_i(Z) the mass of its pre-image in Z; next for every edge {z^0, z^1} in E(T_i(Z)) that crosses the ith dimension we place the bigger of the two masses p(z^0) and p(z^1) at height 0 (i.e. at z^0) and the smaller at height 1 (i.e. at z^1). We claim that


    Σ_{{v,w}∈E(Z)} p(v) ∧ p(w) ≤ Σ_{{v,w}∈E(T_i(Z))} p_i(v) ∧ p_i(w).

Indeed, any edge {v, w} ∈ E(Z) in the sum on the left that crosses the ith
dimension reappears in the sum on the right, with the same value p(v) ∧
p(w), which is invariant under redistributing the masses between the planes.
Any such edge in E(Z) at height 1 that is pushed down by Ti can be matched
to its image in the sum on the right, which has no smaller value attached
as the bigger masses are redistributed to the bottom plane. Any such edge
at height 1 that is not pushed down forms a pair with an edge in E(Z) at height 0; they reappear as a pair in the sum on the right and together contribute no smaller value in view of the inequality (a_0 ∧ b_0) + (a_1 ∧ b_1) ≤ (a_0 ∨ a_1) ∧ (b_0 ∨ b_1) + (a_0 ∧ a_1) ∧ (b_0 ∧ b_1), valid for any numbers a_i, b_i. Finally
the remaining edges in E(Z) are at height 0 and do not have a paired edge
above them; they reappear in the sum on the right with possibly a bigger
value.
Thus by repeated application of the downward shift operation, the
original graph and probability measure are changed into a hereditary graph
and another probability measure, meanwhile increasing the target function.
The lemma follows.
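Lemma 2.6.6 can be checked by brute force on a small example. In the illustrative sketch below (not the book's code, and the names are mine), Z is the set of indicator vectors of the left half-lines on four points, a VC-class of dimension 1, the vector Z is uniform on Z, and the sum Σ_i E var(Z_i | Z_j, j ≠ i) is computed directly by grouping the columns on the remaining coordinates.

```python
from collections import defaultdict

def sum_conditional_variances(Z):
    """Sum over coordinates i of E var(Z_i | Z_j, j != i), for Z uniform on Z."""
    Z = list(Z)
    n = len(Z[0])
    p = 1.0 / len(Z)
    total = 0.0
    for i in range(n):
        groups = defaultdict(list)          # group columns by the other coordinates
        for z in Z:
            groups[z[:i] + z[i + 1:]].append(z[i])
        for vals in groups.values():
            mass = p * len(vals)            # P(Z_{-i} = this configuration)
            mean = sum(vals) / len(vals)    # conditional mean of Z_i
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            total += mass * var
    return total

halflines = [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1)]
```

Here each of the four edges of the resulting graph contributes (2/5) · 1/4 = 0.1, so the sum is 0.4, well below V (C) = 1.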

2.6.2 VC-Classes of Functions


The subgraph of a function f: X → R is the subset of X × R given by†

    {(x, t): t < f (x)}.

A collection F of measurable functions on a sample space is called a VC-subgraph class, or just a VC-class, if the collection of all subgraphs of the
functions in F forms a VC-class of sets (in X × R). Just as for sets, the
covering numbers of VC-classes of functions grow at a polynomial rate.
Let V (F) be the VC-dimension of the set of subgraphs of functions in
F.

2.6.7 Theorem. For a VC-class of functions with measurable envelope function F and r ≥ 1, one has, for any probability measure Q with ‖F‖_{Q,r} > 0,

    N(ε‖F‖_{Q,r}, F, L_r(Q)) ≤ K V (F) (16e)^{V(F)} (1/ε)^{rV(F)},

for a universal constant K and 0 < ε < 1.

† The form of this definition is different from the definitions given by other authors. However, it can be shown to lead to the same concept. See Problems 2.6.10 and 2.6.11.
Proof. Let C be the set of all subgraphs C_f of functions f in F. By Fubini's theorem, Q|f − g| = Q × λ(C_f △ C_g) if λ is Lebesgue measure on the real line. Renormalize Q × λ to a probability measure on the set {(x, t): |t| ≤ F (x)} by defining P = (Q × λ)/(2QF ). Then, by the result for sets in the previous section,

    N(ε 2QF, F, L_1(Q)) = N(ε, C, L_1(P )) ≤ K V (F) (4e/ε)^{V(F)},

for a universal constant K.
This concludes the proof for r = 1. For r > 1, note that

    Q|f − g|^r ≤ Q|f − g| (2F )^{r−1} = 2^{r−1} R|f − g| · QF^{r−1},

for the probability measure R with density F^{r−1}/QF^{r−1} with respect to Q. Thus the L_r(Q)-distance is bounded by 2 (QF^{r−1})^{1/r} ‖f − g‖_{R,1}^{1/r}. Elementary manipulations yield

    N(ε 2‖F‖_{Q,r}, F, L_r(Q)) ≤ N(ε^r RF, F, L_1(R)).

This can be bounded by a constant times 1/ε raised to the power rV (F) by the result of the previous paragraph.

The preceding theorem shows that a VC-class satisfies the uniform en-
tropy condition (2.5.1), with much to spare. Thus Theorem 2.5.2 shows that
a suitably measurable VC-class is P -Donsker for any underlying measure
P for which the envelope is square integrable. The latter condition on the
envelope can be relaxed a little to the minimal condition that the envelope
has a weak second moment.

2.6.8 Theorem. Let F be a pointwise separable, P -pre-Gaussian VC-class of functions with envelope function F such that P*(F > x) = o(x^{−2}) as x → ∞. Then F is P -Donsker.

Proof. Under the stronger condition that the envelope has a finite second
moment, this follows from Theorem 2.5.2. (Then the assumption that F is
pre-Gaussian is automatically satisfied.) For the refinement, see Alexander
(1987c).

2.6.3 Convex Hulls and VC-Hull Classes


The symmetric convex hull sconv F of a class of functions is defined as the set of functions Σ_{i=1}^m α_i f_i, with Σ_{i=1}^m |α_i| ≤ 1 and each f_i contained in F.
A set of measurable functions is called a VC-hull class if it is in the
pointwise sequential closure of the symmetric convex hull of a VC-class of
functions. More formally, a collection of functions F is VC-hull if there
exists a VC-class G of functions such that every f ∈ F is the pointwise
limit of a sequence of functions f_m contained in sconv G. If the class G can be taken equal to a class of indicator functions, then the class F is called a
VC-hull class for sets.
In Chapter 2.10 it is shown that the sequentially closed symmetric con-
vex hull of a Donsker class is Donsker. Since a suitably measurable VC-class
of functions is Donsker, provided its envelope has a weak second moment,
many VC-hull classes can be seen to be Donsker by this general result. In
most cases the same conclusion can also be obtained from the stronger re-
sult that any VC-hull class satisfies the uniform entropy condition. This is
shown in the following, which gives an upper bound on the entropy.
Even though VC-hull classes are small enough to have a finite uni-
form entropy integral, they can be considerably larger than VC-classes.
Their entropy numbers (logarithms of the covering numbers), rather than
their covering numbers, behave polynomially in 1/ε. More precisely, the
Lr -entropy numbers of the convex hull of any polynomial class are of lower
order than (1/ε)W , where W < 2 if r ≥ 2 (and also for smaller r depending
on the polynomial order). The bound W < 2 is just enough to ensure that
the class satisfies the uniform entropy condition (2.5.1).
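The arithmetic behind "W < 2 is just enough" is elementary: with log N(ε) = (1/ε)^W the entropy integral is ∫_0^1 ε^{−W/2} dε = 2/(2 − W), finite precisely when W < 2, and the exponent 2V/(V + 2) from the r = 2 case of Theorem 2.6.9 never reaches 2. A small numerical sketch (illustrative only):

```python
def hull_entropy_exponent(V):
    """Exponent W = 2V/(V + 2) for the entropy of a convex hull (r = 2 case)."""
    return 2.0 * V / (V + 2)

def entropy_integral(W, steps=100000):
    """Midpoint-rule approximation of int_0^1 sqrt((1/eps)^W) d(eps),
    which equals 2/(2 - W) in closed form when W < 2."""
    h = 1.0 / steps
    return sum(((k + 0.5) * h) ** (-W / 2.0) * h for k in range(steps))
```

For instance V = 2 gives W = 1 and entropy integral 2/(2 − 1) = 2, while W → 2 makes the integral blow up.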

2.6.9 Theorem. Let Q be a probability measure on (X , A), and let F be a class of measurable functions with measurable envelope function F such that QF^r < ∞ for some 1 < r < ∞ and

    N(ε‖F‖_{Q,r}, F, L_r(Q)) ≤ C (1/ε)^V,    0 < ε < 1.

Then there exists a constant K that depends on r, C and V only such that

    log N(ε‖F‖_{Q,r}, conv F, L_r(Q)) ≤ K (1/ε)^{rV/((r−1)V+r)}   if 1 < r ≤ 2,
    log N(ε‖F‖_{Q,r}, conv F, L_r(Q)) ≤ K (1/ε)^{2V/(V+2)}        if 2 ≤ r < ∞.

Proof. Every point in the convex hull of F is within distance ε of the convex hull of an ε-net over F. Therefore, to prove the assertion of the theorem for a fixed ε, it is no loss of generality to assume that F is finite. Set 1/r_0 = min{1 − 1/r, 1/2}, W = 1/r_0 + 1/V, and L = C^{1/V} ‖F‖_{Q,r}. Then by assumption F can be covered by n balls of radius at most Ln^{−1/V} for every natural number n. (Note that the assumption is trivially true for 1 ≤ ε ≤ C^{1/V}.) Form sets F_1 ⊂ F_2 ⊂ · · · ⊂ F such that for each n the set F_n is a maximal, Ln^{−1/V}-separated net over F. Thus F_n contains at most n points. It will be shown by induction that there exist constants C_k and D_k depending only on C and V such that sup_k C_k ∨ D_k < ∞ and, for q ≥ 2V (1 + 1/r_0),

(2.6.10)    log N(C_k Ln^{−W}, conv F_{nk^q}, L_r(Q)) ≤ D_k n,    n, k ≥ 1.
This would imply the theorem. The proof of (2.6.10) consists of a nested
induction argument. The outer layer is induction on k. The case k = 1 is
proved for each n by induction on n.
Assume k = 1. For n ≤ n_0 and fixed n_0, the statement is trivially true for sufficiently large C_1. It suffices to choose C_1 Ln_0^{−W} ≥ ‖F‖_{Q,r}, in which case the left side of (2.6.10) vanishes for every n ≤ n_0. For general n, fix m = n/d for large enough d to be chosen. Each f ∈ F_n − F_m is within distance at most Lm^{−1/V} of some element Π_m f of F_m. Thus each element of conv F_n can be written as

    Σ_{f∈F_n} λ_f f = Σ_{f∈F_m} μ_f f + Σ_{f∈F_n−F_m} λ_f (f − Π_m f),
where μ_f ≥ 0 and Σ μ_f = Σ λ_f = 1. If G_n is the set of functions 0 and f − Π_m f, with f ranging over F_n − F_m, then it follows that conv F_n ⊂ conv F_m + conv G_n for a set G_n containing at most n elements, each of norm smaller than Lm^{−1/V}, whence diam G_n ≤ 2Lm^{−1/V}. Apply Lemma 2.6.11 (below) to G_n with ε defined by m^{−1/V} ε = (1/4)C_1 n^{−W} to find a (1/2)C_1 Ln^{−W}-net over conv G_n consisting of at most

    (e + e n ε^{r_0}/b_r)^{b_r/ε^{r_0}} ≤ (e + e C_1^{r_0}/(b_r 4^{r_0} d^{r_0/V}))^{n b_r 4^{r_0} C_1^{−r_0} d^{r_0/V}}
elements. Apply the induction hypothesis to F_m to find a C_1 Lm^{−W}-net over conv F_m consisting of at most e^m elements. This defines a partition of conv F_m into m-dimensional sets of diameter at most 2C_1 Lm^{−W}. Such a set can be isometrically identified with a subset of a ball of radius C_1 Lm^{−W} in R^m. Thus each of these sets can be partitioned into

    (3C_1 Lm^{−W}/((1/2)C_1 Ln^{−W}))^m = (6d^W)^{n/d}

sets of diameter (1/2)C_1 Ln^{−W} (Problem 2.1.7). Take a function from each
of these sets, and form all sums f + g of a function f attached to conv Fm
and a function g attached to conv Gn by the preceding procedure. These
form a C_1 Ln^{−W}-net over conv F_n of cardinality bounded by

    e^{n/d} (6d^W)^{n/d} (e + e C_1^{r_0}/(b_r 4^{r_0} d^{r_0/V}))^{n b_r 4^{r_0} C_1^{−r_0} d^{r_0/V}}.

This is bounded by e^n for suitable choices of C_1 and d depending on V and r only. This concludes the proof of (2.6.10) for k = 1 and every n.
The argument continues by induction on k. By a similar construction as before, conv F_{nk^q} ⊂ conv F_{n(k−1)^q} + conv G_{n,k} for a set G_{n,k} containing at most nk^q elements, each of norm smaller than L(n(k − 1)^q)^{−1/V}, so that diam G_{n,k} ≤ 2Ln^{−1/V} k^{−q/V} 2^{q/V}. Apply Lemma 2.6.11 to G_{n,k} with ε = 2^{−1} k^{q/V−2} 2^{−q/V} n^{−1/r_0} to find an Lk^{−2} n^{−W}-net over conv G_{n,k} consisting of at most

    (e + e nk^q ε^{r_0}/b_r)^{b_r/ε^{r_0}} = (e + e k^{q+r_0 q/V−2r_0}/(2^{q r_0/V+r_0} b_r))^{n b_r 2^{q r_0/V+r_0} k^{2r_0−r_0 q/V}}
elements. Apply the induction hypothesis to obtain a C_{k−1} Ln^{−W}-net over conv F_{n(k−1)^q} with respect to L_r(Q) consisting of at most e^{D_{k−1} n} elements. Combine the nets as before to obtain a C_k Ln^{−W}-net over conv F_{nk^q} consisting of at most e^{D_k n} elements, for

    C_k = C_{k−1} + 1/k^2,
    D_k = D_{k−1} + [1 + log(1 + k^{q+r_0 q/V−2r_0}/(2^{q r_0/V+r_0} b_r))] / [k^{r_0 q/V−2r_0}/(2^{q r_0/V+r_0} b_r)].

For (q/V − 2)r_0 ≥ 2, or equivalently q ≥ 2V (1 + 1/r_0), the resulting sequences C_k and D_k are bounded.

Since the symmetric convex hull sconv F of a class F is contained in the convex hull of the class F ∪ −F ∪ {0}, the bound of the preceding
theorem is valid for sconv F as well. It suffices to note that the covering
numbers of the class F ∪ −F ∪ {0} are at most twice the covering numbers
of F plus 1.
An important ingredient in the proof of the preceding theorem is the
following lemma, which gives a useful bound on the covering numbers of
the convex hull of an arbitrary finite set of small diameter.

2.6.11 Lemma. Let F be an arbitrary set of n measurable functions f: X → R of finite L_r(Q)-diameter diam F. Then for every r > 1 there exists a constant b_r such that, for 1/r_0 = min(1 − 1/r, 1/2) and every ε > 0,

    N(ε diam F, conv F, L_r(Q)) ≤ (e + e n ε^{r_0}/b_r)^{b_r/ε^{r_0}}.

Proof. Write F = {f_1, . . . , f_n}. For given λ in the n-dimensional unit simplex and natural number k, let Y_1, . . . , Y_k be i.i.d. random elements with P(Y_i = f_j) = λ_j for j = 1, . . . , n. Then EY_i = Σ_j λ_j f_j and, by Marcinkiewicz' inequality (Proposition A.1.9),

    E|Σ_{i=1}^k (Y_i − EY_i)|^r ≤ B_r E(Σ_{i=1}^k (Y_i − EY_i)^2)^{r/2}.
For r ≤ 2 we bound the right side further by B_r E Σ_{i=1}^k |Y_i − EY_i|^r, using subadditivity of the map y ↦ y^{r/2}, while for r ≥ 2 we use Hölder's inequality to bound it by this same expression times k^{r/2−1}. Next we integrate over Q, divide by k^r, and use Fubini's theorem to see that

    E‖Ȳ_k − EY_1‖_{Q,r}^r ≤ (B_r (1 ∨ k^{r/2−1})/k^r) Σ_{i=1}^k E ∫ |Y_i − EY_1|^r dQ ≤ (B_r/k^{r/r_0}) (diam F)^r,

since r/r_0 = r − 1 if r ≤ 2 and r/r_0 = r/2 if r ≥ 2. The inequality shows that at least one realization of Ȳ_k must have L_r(Q)-distance at most B_r^{1/r} k^{−1/r_0} diam F to the convex combination Σ_j λ_j f_j. Every realization is an average of the form k^{−1} Σ_{i=1}^k f_{j_i} (allowing for multiple use of an f_j ∈ F). There are at most (n+k−1 choose k) such averages. Conclude that

    N(B_r^{1/r} k^{−1/r_0} diam F, conv F, L_r(Q)) ≤ (n+k−1 choose k) ≤ e^k (1 + n/k)^k.

The last inequality can be proved by using Stirling's bound k! ≥ (k/e)^k. For ε ≥ 1, the assertion of the lemma is trivial. For ε < 1, conclude the proof by choosing the smallest k such that B_r^{1/r} k^{−1/r_0} ≤ ε.
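The random-averaging argument in this proof is constructive: some average of k functions from F is within B_r^{1/r} k^{−1/r_0} diam F of any given convex combination, and for r = 2 a direct computation gives E‖Ȳ_k − EY_1‖²_{Q,2} ≤ (diam F)²/k, so the constant can be taken 1. The sketch below (illustrative names; Q uniform on a finite sample space, functions stored as value tuples) verifies this by enumerating all k-fold averages.

```python
from itertools import combinations_with_replacement
from math import sqrt

def l2(u, v):
    """L2(Q) distance for Q the uniform measure on the sample points."""
    m = len(u)
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / m)

def best_sparse_approx(F, weights, k):
    """Smallest L2(Q)-distance from the convex combination sum_j w_j f_j
    to an average of k functions drawn from F (with repetition)."""
    m = len(F[0])
    target = [sum(w * f[i] for w, f in zip(weights, F)) for i in range(m)]
    best = float("inf")
    for multiset in combinations_with_replacement(F, k):
        avg = [sum(f[i] for f in multiset) / k for i in range(m)]
        best = min(best, l2(avg, target))
    return best

# three functions on a 4-point sample space and a convex combination of them
F = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0), (0.0, 0.0, 1.0, 0.0)]
weights = (0.5, 0.3, 0.2)
diam = max(l2(f, g) for f in F for g in F)
```

The guarantee best ≤ diam/√k holds with a lot of room here: already for k = 4 the best average is within a few hundredths of the target.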

2.6.12 Corollary. For any VC-hull class F of measurable functions and probability measure Q,

    log N(ε‖F‖_{Q,2}, F, L_2(Q)) ≤ K (1/ε)^{2V_m(F)/(V_m(F)+1)},

for a constant K that depends only on the VC-dimension V_m(F) of the VC-subgraph class connected with F and for F an envelope function of this VC-subgraph class.

The entropy of the convex hull of a much bigger class of polynomial entropy (1/ε)^V can be similarly bounded. Remarkably, taking the convex
entropy (1/ε)V can be similarly bounded. Remarkably, taking the convex
hull does not increase the order of the entropy in the case of a very big
class with V > 2, while the entropy increases by a logarithmic factor only if
V = 2. For V < 2 the entropy increases more substantially, but the entropy
integral of the convex hull still converges if V < 2/3. See Carl et al. (1999)
for a proof of the following theorem for V 6= 2 and Gao (2001) for the case
V = 2.

2.6.13 Theorem. Let Q be a probability measure on (X , A), and let F be a class of measurable functions with measurable envelope function F such that QF^2 < ∞ and, for some positive constants C and V,

    log N(ε‖F‖_{Q,2}, F, L_2(Q)) ≤ C (1/ε)^V,    0 < ε < 1.

Then there exists a constant K that depends on C and V only such that log N(ε‖F‖_{Q,2}, sconv F, L_2(Q)) is bounded by

    K (1/ε)^2 (log(1/ε))^{1−2/V}   if 0 < V < 2,
    K (1/ε)^2 (log(1/ε))^2          if V = 2,
    K (1/ε)^V                       if V > 2.
2.6.4 VC-Major Classes


A class of measurable functions is called a VC-major class if the sets
{x: f (x) > t} with f ranging over F and t over R form a VC-class of
sets.
Because any function f: X → [0, 1] is the uniform limit of the sequence of discretizations f_m = Σ_{i=1}^m m^{−1} 1{f > i/m}, bounded VC-major classes are scalar multiples of VC-hull classes. Hence Corollary 2.6.12 allows one to obtain a bound on their entropy, from which it follows that suitably measurable bounded VC-major classes are universally Donsker.
Both results can be improved considerably by direct arguments.
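The uniform approximation underlying the reduction to VC-hull classes is elementary: f_m = Σ_{i=1}^m m^{−1} 1{f > i/m} satisfies 0 ≤ f − f_m ≤ 1/m for any f with values in [0, 1]. A quick sketch (illustrative only), applied to a few precomputed function values:

```python
def discretize(f_vals, m):
    """f_m(x) = sum_{i=1}^m (1/m) * 1{f(x) > i/m}, applied to a list of values of f."""
    return [sum(1 for i in range(1, m + 1) if v > i / m) / m for v in f_vals]

f_vals = [0.0, 0.149, 0.5, 0.731, 0.999, 1.0]
approx = discretize(f_vals, 50)
err = max(abs(a - v) for a, v in zip(approx, f_vals))   # uniform error, at most 1/50
```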

2.6.14 Lemma (Bounded VC-major classes). If F is a VC-major class of functions with envelope F ≤ 1, then for every r ≥ 1 there is a constant A > 0 such that, for every probability measure Q and every ε ∈ (0, 1/2),

    log N(ε‖F‖_{Q,r}, F, L_r(Q)) ≤ (A/ε) log(1/ε) log(1/(ε‖F‖_{Q,r})).

Proof. It is no loss of generality to assume that the functions f are nonnegative. For fixed ε ∈ (0, 1/2) define a grid of points t_0 = 1 > t_1 > t_2 > · · · > t_m > 0 by t_j = (1 + ε)^{−j}, where we choose m the smallest integer such that t_m ≤ ε‖F‖_{Q,r}. Furthermore, for f ∈ F set

    f_ε = Σ_{j=1}^m t_j 1{t_j < f ≤ t_{j−1}}.

Then 0 ≤ fε ≤ f , and for a given x, either f (x) ∈ (tj , tj−1 ] for some
1 ≤ j ≤ m or f (x) ≤ tm . In the first case 0 ≤ f (x) − fε (x) ≤ tj−1 − tj =
εtj ≤ εf (x) ≤ εF (x), while in the second case 0 ≤ f (x) − fε (x) ≤ f (x) ≤
εkF kQ,r . This implies that Q|f − fε |r ≤ 2εr QF r . We conclude that the
entropy of the class F at 3εkF kQ,r is bounded by the entropy of the class
of functions Fε = {fε : f ∈ F} at ε.
In fact, the class Fε is a VC-subgraph class. The subgraphs can be
decomposed as

    {(x, t): f_ε(x) ≥ t} = ∪_{j=1}^m {x: f (x) > t_{j−1}} × (t_j, t_{j−1}].

Since F is a VC-major class, the collection of sets {x: f (x) > tj−1 }, when
f ranges over F, is a VC-class of subsets of X , which (trivially) implies that
for every j the class Cj of sets {x: f (x) > tj−1 } × (tj , tj−1 ], when f ranges
over F, is a VC-class of subsets of the set X × (tj , tj−1 ]. Because the latter
sets are disjoint, the collection ⊔_j C_j is a VC-class of subsets of X × (0, 1],
of VC dimension the sum of the dimensions of the classes Cj , i.e. mV for V
bounded by the dimension of the VC-class attached to the VC-major class
F (cf. Problem 2.6.12).
Thus the class F_ε is VC-subgraph of dimension bounded by mV. Since F is also an envelope function for F_ε, it follows from Theorem 2.6.7 that

    N(ε‖F‖_{Q,r}, F_ε, L_r(Q)) ≤ (A/ε)^{rV m}.

Here, by its definition, m = m(ε) satisfies (1 + ε)^{−(m−1)} ≥ ε‖F‖_{Q,r}, from which it follows that m ≤ 1 + 2 log(1/(ε‖F‖_{Q,r}))/ε. The proof is concluded by inserting this bound.
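The size of the geometric grid, which drives the entropy bound in this proof, can be checked numerically; the names below are illustrative.

```python
from math import log

def grid_size(eps, norm_F):
    """Smallest m with (1 + eps)^(-m) <= eps * ||F||_{Q,r} (direct search)."""
    m = 0
    while (1 + eps) ** (-m) > eps * norm_F:
        m += 1
    return m

def grid_size_bound(eps, norm_F):
    """The bound m <= 1 + (2/eps) log(1/(eps * ||F||_{Q,r})) used in the proof."""
    return 1 + 2 * log(1 / (eps * norm_F)) / eps

eps, norm_F = 0.1, 1.0
m = grid_size(eps, norm_F)   # here log(10)/log(1.1) ~ 24.2, so m = 25
```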

2.6.15 Theorem. Let F be a pointwise separable VC-major class of measurable functions with envelope F such that ∫ √(P*(F > x)) dx < ∞. Then F is P -Donsker.

Proof. See Dudley and Koltchinskii (1994).

2.6.5 Examples and Permanence Properties


The first two lemmas of this section give basic methods for generating VC-subgraph classes. This is followed by a discussion of methods that allow one to build up new classes related to the VC-property from more basic classes.

2.6.16 Lemma. Any finite-dimensional vector space F of measurable functions f: X → R is VC-subgraph of dimension smaller than or equal to dim(F) + 1.

Proof. Take any collection of n = dim(F) + 2 points (x_1, t_1), . . . , (x_n, t_n) in X × R. By assumption, the vectors

    (f (x_1) − t_1, . . . , f (x_n) − t_n),    f ∈ F,

are contained in a (dim(F) + 1) = (n − 1)-dimensional subspace of R^n. Any vector a ≠ 0 that is orthogonal to this subspace satisfies

    Σ_{a_i>0} a_i (f (x_i) − t_i) = Σ_{a_i<0} (−a_i) (f (x_i) − t_i),    for every f ∈ F.

Define the sum over an empty set as zero. There exists such a vector a with at least one strictly positive coordinate. For this vector, the set {(x_i, t_i): a_i > 0} cannot be of the form {(x_i, t_i): t_i < f (x_i)} for some f, because then the left side of the equation would be strictly positive and the right side nonpositive for this f. Conclude that the subgraphs of F do not pick out the set {(x_i, t_i): a_i > 0}. Hence the subgraphs shatter no set of n points.
2.6.17 Lemma. The set of all translates {ψ(x − h): h ∈ R} of a fixed monotone function ψ: R → R is VC of dimension 1.

Proof. By the monotonicity, the subgraphs are linearly ordered by inclusion: if ψ is nondecreasing, then the subgraph of x ↦ ψ(x − h_1) is contained
in the subgraph of x 7→ ψ(x − h2 ) if h1 ≥ h2 . Any collection of sets with
this property has VC-dimension 1.
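The linear ordering of the subgraphs is visible numerically: for a nondecreasing ψ and h_1 ≥ h_2, the indicator 1{t < ψ(x − h_1)} is pointwise at most 1{t < ψ(x − h_2)}, so the subgraph for h_1 is contained in the subgraph for h_2. A quick grid check (illustrative sketch):

```python
from math import tanh

def in_subgraph(psi, h, x, t):
    """Is the point (x, t) in the subgraph of x -> psi(x - h)?"""
    return t < psi(x - h)

grid = [(x / 4.0, t / 4.0) for x in range(-8, 9) for t in range(-8, 9)]
h1, h2 = 1.5, -0.5   # h1 >= h2, psi = tanh is nondecreasing

# subgraph(h1) contained in subgraph(h2): membership in the first implies the second
ordered = all(
    in_subgraph(tanh, h2, x, t) or not in_subgraph(tanh, h1, x, t)
    for (x, t) in grid
)
```

A collection of sets that is linearly ordered by inclusion can pick out at most 2 of the 4 subsets of any two points, which is the VC-dimension-1 statement of the lemma.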

2.6.18 Lemma. Let C and D be VC-classes of sets in a set X and φ: X → Y and ψ: Z → X fixed functions. Then
(i) C^c = {C^c: C ∈ C} is VC;
(ii) C ⊓ D = {C ∩ D: C ∈ C, D ∈ D} is VC;
(iii) C ⊔ D = {C ∪ D: C ∈ C, D ∈ D} is VC;
(iv) φ(C) is VC if φ is one-to-one;
(v) ψ −1 (C) is VC;
(vi) the sequential closure of C for pointwise convergence of indicator func-
tions is VC.
For VC-classes C and D in sets X and Y,
(vii) C × D is VC in X × Y.

Proof. The set C c picks out the points of a given set x1 , . . . , xn that C does
not pick out. Thus if C shatters a given set of points, so does C c . This proves
(i) (and shows that the dimensions of C and C^c are equal). From n points C can pick out O(n^{V(C)}) subsets. From each of these subsets, D can pick out at most O(n^{V(D)}) further subsets. Thus C ⊓ D can pick out O(n^{V(C)+V(D)}) subsets. For large n, this is certainly smaller than 2^n. This proves (ii). Next (iii) follows from a combination of (i) and (ii), since C ⊔ D = (C^c ⊓ D^c)^c.
For (iv) note first that if φ(C) shatters {y1 , . . . , yn }, then each yi must
be in the range of φ and there exist x1 , . . . , xn such that φ is a bijection
between x1 , . . . , xn and y1 , . . . , yn . Thus C must shatter x1 , . . . , xn . For (v)
the argument is analogous: if ψ −1 (C) shatters z1 , . . . , zn , then all ψ(zi ) must
be different and the restriction of ψ to z1 , . . . , zn is a bijection on its range.
To prove (vi) take any set of points x1 , . . . , xn and any set C̄ from
the sequential closure. If C̄ is the pointwise limit of a net Cα , then for
sufficiently large α the equality 1{C̄}(xi ) = 1{Cα }(xi ) is valid for each i.
For such α the set Cα picks out the same subset as C̄.
For (vii) note first that C × Y and X × D are VC-classes. Then by (ii)
so is their intersection C × D.
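The stability claims (i)–(iii) can be probed by brute force on a finite ground set; the helpers below are an illustrative sketch, not the book's code. Starting from the half-lines on n points (dimension 1), the complements are the tails (also dimension 1), and the intersections of half-lines with tails are the intervals, whose VC-dimension grows to 2 but stays finite, in line with the proof of (ii).

```python
from itertools import combinations

def shatters(classes, pts):
    """Do the traces of the sets in `classes` on `pts` realize all subsets?"""
    traces = {frozenset(C & set(pts)) for C in classes}
    return len(traces) == 2 ** len(pts)

def vc_dim(classes, n):
    """Largest shattered subset of the ground set {0, ..., n-1} (brute force)."""
    V = 0
    for k in range(1, n + 1):
        if any(shatters(classes, I) for I in combinations(range(n), k)):
            V = k
    return V

n = 6
halflines = [frozenset(range(j)) for j in range(n + 1)]    # {}, {0}, {0,1}, ...
tails = [frozenset(range(n)) - H for H in halflines]        # complements, as in (i)
intervals = [H & T for H in halflines for T in tails]       # intersections, as in (ii)
```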

2.6.19 Lemma. Let F and G be VC-subgraph classes of functions on a set X and g: X → R, φ: R → R, and ψ: Z → X fixed functions. Then
(i) F ∧ G = {f ∧ g: f ∈ F, g ∈ G} is VC-subgraph;
(ii) F ∨ G is VC-subgraph;

(iii) {F > 0} = {{f > 0}: f ∈ F} is VC;
(iv) −F is VC-subgraph;
(v) F + g = {f + g: f ∈ F} is VC-subgraph;
(vi) F · g = {f g: f ∈ F} is VC-subgraph;
(vii) F ◦ ψ = {f (ψ): f ∈ F} is VC-subgraph;
(viii) φ ◦ F is VC-subgraph for monotone φ.

Proof. The subgraphs of f ∧ g and f ∨ g are the intersection and union of the subgraphs of f and g, respectively. Hence (i) and (ii) are consequences
of the preceding lemma. For (iii) note that the sets {f > 0} are one-to-one
images of the intersections of the (open) subgraphs with the set X × {0}.
Thus the class {F > 0} is VC by (ii) and (iv) of the preceding lemma.
The subgraphs of the class −F are the images of the open supergraphs
of F under the map (x, t) 7→ (x, −t). The open supergraphs are the comple-
ments of the closed subgraphs, which are VC by Problem 2.6.10. Now (iv)
follows from the previous lemma. For (v) it suffices to note that the sub-
graphs of the class F + g shatter a given set of points (x1 , t1), . . . , (xn , tn ) if
and only if the subgraphs of F shatter the set {(x_i, t_i − g(x_i))}. The subgraph of the function f g is the union of the sets

    C^+ = {(x, t): t < f (x)g(x), g(x) > 0},
    C^− = {(x, t): t < f (x)g(x), g(x) < 0},
    C^0 = {(x, t): t < 0, g(x) = 0}.





It suffices to show that these sets are VC in (X ∩ {g > 0}) × R, (X ∩ {g < 0}) × R, and (X ∩ {g = 0}) × R, respectively (Problem 2.6.12). Now, for instance, {i: (x_i, t_i) ∈ C^−} is the set of indices of the points (x_i, t_i/g(x_i)) picked out by the open supergraphs of F. These are the complements of the closed subgraphs and hence form a VC-class.
The subgraphs of the class F ◦ ψ are the inverse images of the subgraphs of functions in F under the map (z, t) ↦ (ψ(z), t). Thus (v) of the previous lemma implies (vii).
For (viii) suppose the subgraphs of φ ◦ F shatter the set of points (x_1, t_1), . . . , (x_n, t_n). Choose f_1, . . . , f_m from F such that the subgraphs of the functions φ ◦ f_j pick out all m = 2^n subsets. For each fixed i, define s_i = max{f_j(x_i): φ(f_j(x_i)) ≤ t_i}. Then s_i < f_j(x_i) if and only if t_i < φ ◦ f_j(x_i), for every i and j, and the subgraphs of f_1, . . . , f_m shatter the points (x_i, s_i).

2.6.20 Lemma. If F is VC-major, then the class of functions h ◦ f, with h ranging over the monotone functions h: R → R and f over F, is VC-major.

2.6.21 Lemma. If F and G are VC-hull classes for sets (in particular,
uniformly bounded VC-major classes), then FG is a VC-hull class for sets.

Proofs. The set {h ◦ f > t} can be rewritten as {f ·> h^{−1}(t)} for a suitably defined inverse h^{−1} and ·> meaning either > or ≥, depending on t. Now the sets
{f ≥ t} with f ranging over F and t over R form a VC-class, since they


are in the sequential pointwise closure of the same sets defined with strict
inequality.
If F and G are VC-hull classes for sets, then any function f g can be ap-
proximated by a product of two symmetric convex combinations of indicator
functions of sets from VC-classes. Since the set of pairwise intersections of
sets from two VC-classes is VC, the products of the approximations are
symmetric convex combinations of indicators of sets from a VC-class.

2.6.22 Example. The set of all monotone functions f: R → [0, 1] is Donsker for every probability measure.

2.6.23 Lemma. Let F: X → R be a fixed, nonnegative function. Then the class of functions x ↦ t1{F (x) ≥ t} with t ranging over R is VC of dimension V (F) ≤ 2.

Proof. Consider three arbitrary points (x_1, t_1), (x_2, t_2), and (x_3, t_3) in X × [0, ∞) such that F (x_1) ≤ F (x_2) ≤ F (x_3). Suppose the three-point set is
shattered by F. Since F selects the single point (x_1, t_1), there exists s_1 with

    (t_1 < s_1 ≤ F (x_1)) ∧ (s_1 ≤ t_2 or s_1 > F (x_2)) ∧ (s_1 ≤ t_3 or s_1 > F (x_3)).

Because F (x_i) is increasing in i, this can be reduced to

    (t_1 < s_1 ≤ F (x_1)) ∧ (s_1 ≤ t_2) ∧ (s_1 ≤ t_3).

Since F selects the single point (x_2, t_2), there exists s_2 with

    (s_2 ≤ t_1 or s_2 > F (x_1)) ∧ (t_2 < s_2 ≤ F (x_2)) ∧ (s_2 ≤ t_3 or s_2 > F (x_3)).

In the first part of this logical statement, s_2 ≤ t_1 is impossible, because t_1 < s_1 ≤ t_2 < s_2 in view of the first two parts of the preceding statement and the middle part of the last statement. Thus the last statement can be reduced to

    (s_2 > F (x_1)) ∧ (t_2 < s_2 ≤ F (x_2)) ∧ (s_2 ≤ t_3).

This implies that t_3 ≥ s_2 > F (x_1). But then F cannot select the two-point set {(x_1, t_1), (x_3, t_3)}: this would require an s with t_1 < s ≤ F (x_1) and t_3 < s, which is impossible because s ≤ F (x_1) < t_3.

2.6.24 Example. For every strictly increasing function φ and arbitrary
functions F ≥ 0 and G, the set of functions x ↦ φ(t)G(x)1{F(x) ≥ t}
(with t ranging over R) is VC of dimension V(F) ≤ 3.
Indeed, for G ≡ 1 this class is contained in the class of functions
x ↦ s1{φ ◦ F(x) ≥ s}. The general case follows from Lemma 2.6.19.
Consequently, the class of functions x ↦ t^{k−1}F(x)1{F(x) ≥ t} is
Donsker for every measurable function F with ∫ F^{2k} dP < ∞, by the uni-
form entropy central limit theorem.

2.6.6 Fat Shattering
The VC-dimension V (F) of a class F of functions f : X → R is the largest
number n for which there exist points x1 , . . . , xn in X and t1 , . . . , tn in R
such that the set of points (x1 , t1 ), . . . , (xn , tn ) in X × R is shattered by the
subgraphs of F, i.e. for every subset I ⊂ {1, 2, . . . , n} there exists f ∈ F
with

f(xi) > ti,  if i ∈ I,
f(xi) ≤ ti,  if i ∉ I.
This notion is purely combinatorial, without quantification of the position
of the points (xi , ti ), in contrast to the following, which is a “dimension at
ε”.
The fat-shattering dimension V (ε, F) of a class of functions F at ε > 0
is defined as the largest n such that there exist x1 , . . . , xn and t1 , . . . , tn
such that for every subset I ⊂ {1, 2, . . . , n} there exists f ∈ F with

f(xi) ≥ ti + ε,  if i ∈ I,
f(xi) ≤ ti,  if i ∉ I.
The two cases of inequalities arise if the point (xi , ti ) is “ε below” or “above”
the graph of f , respectively. We could describe this by saying that the
subgraph of f picks out the points {(xi , ti ): i ∈ I} with margin ε. Clearly if
this is true for some ε > 0, then the subgraph also picks out these points in
the ordinary sense, whence it follows that V (ε, F) ≤ V (F), for every ε > 0.
The following theorem shows that these smaller numbers can never-
theless be used to bound entropy numbers.

2.6.25 Theorem. There exist universal constants D and c such that, for
any class F of functions f : X → [0, 1], any probability measure Q and
0 < ε < 1,

N(ε, F, L2(Q)) ≤ (2/ε)^{D V(cε,F)}.
For a proof see Mendelson and Vershynin (2003).
A comparison of the bound with Theorem 2.6.7, which gives a bound of
the order (1/ε)^{2V(F)} on the covering number, is impossible without further
knowledge of the constant D. However, if the fat-shattering dimension is
considerably smaller than the VC-dimension, then this may compensate for
even a large constant D, and the bound of the preceding theorem may be
preferable. The following example illustrates that this may be the case. In
fact, the example shows that the VC-dimension can be infinite while the
fat-shattering dimension is finite for every ε > 0.

2.6.26 Example. The class F of monotone functions f : R → [0, 1] is not
VC. For instance, its subgraphs can pick out every subset from a set of
points (x1, t1), . . . , (xn, tn) with x1 < x2 < · · · < xn and 0 < t1 < t2 <
· · · < tn < 1. Here choosing the xi to be increasing is no loss of generality,
but choosing the points ti also increasing is crucial. Because strict increase,
without quantification, suffices, the points easily fit within the range [0, 1]
of the functions f. The latter changes if we require fat-shattering. If the
points must be picked out with margin ε, then the sequence t1 < · · · < tn
must increase by steps bigger than ε and only of the order 1/ε of them fit
the interval [0, 1]. This explains that

V(ε, F) ≤ 1 + 1/ε.
The preceding theorem therefore gives the bound of the order (1/ε) log(1/ε)
on the entropy of F. This bound is not sharp (see Theorem 2.7.7 and the
discussion below), but reasonable, whereas Theorem 2.6.7 is not applicable.
For a precise proof of the bound on the fat-shattering numbers, take
any sequence x1 < · · · < xn and assume that the subgraphs of F shatter
the points (xi , ti ) at margin ε, for some t1 , . . . , tn . In particular, there exist
two functions fo and fe that pick out the odd and even numbered points
at margin ε:

fo(xi) ≥ ti + ε, i odd;   fo(xi) ≤ ti, i even;
fe(xi) ≥ ti + ε, i even;  fe(xi) ≤ ti, i odd.
Then fo (xi ) − fe (xi ) ≥ ε for odd i and fo (xi ) − fe (xi ) ≤ −ε for even i, and
hence the variation of fo − fe over x1 , . . . , xn is at least (n − 1)2ε. Since fo
and fe are monotone with range [0, 1], this variation is at most 2, whence
n − 1 ≤ 1/ε.

The preceding example illustrates the potential advantage of fat-


shattering, but also that Theorem 2.6.25 is not sharp in general. An im-
proved bound is available if the fat-shattering dimension satisfies an addi-
tional condition.

2.6.27 Theorem. For any class F of functions f : X → [0, 1] with fat-
shattering dimension such that V(dε, F) ≤ V(ε, F)/2 for some d > 1 and
every ε > 0, there exists a constant D that depends on d only and a universal
constant c > 0 such that, for any probability measure Q and 0 < ε < 1,

log N(ε, F, L2(Q)) ≤ D V(cε, F).

This theorem removes the factor log(1/ε) from the bound on the en-
tropy implied by Theorem 2.6.25. The additional condition V (dε, F) ≤
V (ε, F)/2 that makes this possible is less innocuous than it may seem at
first. For instance, it cannot be satisfied by a nontrivial VC-class, as then
a bound log N(ε, F, L2(Q)) ≤ K V(F), free of ε, would result.
The following theorem shows that in the uniform entropy integral the
fat-shattering dimension can replace the uniform entropy. In particular,
the fat-shattering dimension characterizes the uniform Donsker property of
uniformly bounded classes of functions.

2.6.28 Theorem. There exist positive constants c < C such that, for any
class F of functions f : X → [0, 1],

c ∫_0^1 √V(ε, F) dε ≤ ∫_0^1 sup_Q √(log N(ε, F, L2(Q))) dε ≤ C ∫_0^1 √V(ε, F) dε.

For proofs of the preceding theorems, see Rudelson and Vershynin
(2006).

Problems and Complements

1. Every collection of sets that is pre-Gaussian for every underlying measure is
VC.
[Hint: Dudley (1984), 11.4.1.]

2. Unlike the case of sets, the VC-subgraph property does not characterize uni-
versal Donsker classes of functions. There exist bounded universal Donsker
classes that do not even satisfy the uniform entropy condition.
[Hint: Dudley (1987).]

3. Let F be a class of measurable functions such that D(ε, F, Lr(Q)) ≤ g(ε) for
every empirical type probability measure Q and a fixed function g(ε). Then
the bound is valid for every probability measure.
[Hint: If D(ε, F, Lr(P)) = m, then there are functions f1, . . . , fm such that
P|fi − fj|^r > ε^r for every i ≠ j. By the strong law of large numbers, Pn|fi −
fj|^r → P|fi − fj|^r almost surely, for every (i, j). Thus there exist ω and n
such that Pn(ω)|fi − fj|^r > ε^r, for every i ≠ j.]

4. There exists a short proof of the fact that for any VC-class of sets C and
δ > 0,

N(ε, C, Lr(Q)) ≤ K (1/ε)^{r(V(C)+δ)},

for any probability measure Q, r ≥ 1, 0 < ε < 1, and a constant K depending
on V(C) and δ only.
[Hint: Take any subcollection of sets C1, . . . , Cm from C such that Q(Ci △
Cj) > ε for every pair i ≠ j. Generate a sample X1, . . . , Xn from Q. Two sets
Ci and Cj pick out the same subset from a realization of the sample if and
only if no Xk falls in the symmetric difference Ci △ Cj. If every symmetric
difference contains a point of the sample, then all Ci pick out a different
subset from the sample. In that case C picks out at least m subsets from
X1, . . . , Xn. The probability that this event does not occur is bounded by

∑_{i<j} Q(Xk ∉ Ci △ Cj for every k) ≤ ∑_{i<j} (1 − Q(Ci △ Cj))^n
                                     ≤ (m choose 2) (1 − ε)^n.

For sufficiently large n, the last expression is strictly less than 1. For such n
there exists a set of n points from which C picks out at least m subsets. In
other words, for n > −log (m choose 2) / log(1 − ε), one has the first inequality in

m ≤ max_{x1,...,xn} ∆n(C, x1, . . . , xn) ≤ K n^{V(C)}.

The last inequality follows from Corollary 2.6.3 and the constant K depends
only on V(C). Since −log(1 − ε) > ε, we can take n = 3(log m)/ε and obtain

m ≤ K (3 log m / ε)^{V(C)}.

Since log m is bounded by a constant times m^{δ/V(C)}, it follows that m^{1−δ} is
bounded by a constant times (3/ε)^{V(C)}.]
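The counting in this hint is easy to check numerically. In the sketch below (the function name is mine), `points_needed(m, eps)` is the smallest n with (m choose 2)(1 − ε)^n < 1, and it never exceeds the choice n = 3(log m)/ε used above, since −log(1 − ε) > ε.

```python
from math import log, ceil

def points_needed(m, eps):
    """Smallest n with (m choose 2)(1 - eps)^n < 1: with that many sample
    points, with positive probability every symmetric difference of the
    eps-separated sets C_1, ..., C_m contains a sample point."""
    pairs = m * (m - 1) / 2
    n = 0
    while pairs * (1 - eps) ** n >= 1:
        n += 1
    return n

# never larger than the convenient choice n = 3 (log m) / eps used in the hint
for m in (10, 100, 1000):
    for eps in (0.1, 0.3):
        assert points_needed(m, eps) <= ceil(3 * log(m) / eps)
```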

5. The edges of the graph with nodes Z ⊂ {0, 1}n representing a VC-class C of
subsets of a set of points {x1 , . . . , xn } can be directed in such a manner that
at most V (C) arrows are positively incident with each node.
[Hint: The number of edges of a hereditary collection C can be enumerated
by listing for each node v the edges pointing inward: edges {v, w} such that w
has a zero in the (one) position in which it differs from v. Since every subset
of a hereditary class has at most V (C) points, it follows that #E/#Z ≤ V (C).
By repeated application of the operators Ti , an arbitrary collection of sets
can be transformed into a hereditary class. The operations do not decrease
the quotient #E/#Z: the number of edges may increase, while the number
of nodes remains the same. Conclude that for any VC-collection of sets,
#E/#Z ≤ V (C). Since a subcollection of a given VC-class is VC of no greater
dimension, it follows that #E 0 ≤ #Z 0 V (C) for every subgraph (Z 0 , E 0 ) as well.
Now apply Hall’s marriage lemma [e.g. Dudley (1989), Theorem 11.6.1,
page 318], representing directing an edge {v, w} as marrying the edge to one
of the nodes v and w. In fact, let the eligible partners for {v, w} be the
collection of V (C) copies of v and V (C) copies of w. Then the total number
of partners for a given set of edges is at least V (C) times the number of edges.
By the marriage lemma, successful marriage is possible.]

6. For every pair of integers S, r ≥ 1, and n = Sr, there exists a subset Z ⊂
{0, 1}^n such that, for all 1 ≤ k ≤ n and S = V(Z),

D(k/n, Z, d) ≥ (n / (2e(k + S)))^S.

[Hint: Start with the subset W = {(0, 0, . . . , 0), (1, 0, . . . , 0), (1, 1, . . . , 0), . . . ,
(1, 1, 1, . . . , 1)} of {0, 1}^r, and let Z = W^S be the set of all vectors in {0, 1}^n
obtained by concatenating S vectors from W.]
7. Every VC-class C of sets satisfies ∆n(C, x1, . . . , xn) ≤ (ne/V(C))^{V(C)} for
n ≥ V(C).
[Hint: Consider a random variable Y with a binomial distribution with pa-
rameters n and 1/2. Bound P(Y ≤ k) by E r^{Y−k} for a simple (not optimal)
choice of r.]
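Granting the binomial-sum bound ∆n ≤ ∑_{i≤V} (n choose i) of Sauer's lemma, the problem reduces to the elementary inequality ∑_{i≤V} (n choose i) ≤ (ne/V)^V for n ≥ V, which can be sanity-checked numerically (a sketch, not a proof; the function name is mine):

```python
from math import comb, e

def sauer_poly_bound(n, V):
    """True if sum_{i<=V} C(n, i) <= (n e / V)^V, the step that turns the
    binomial-sum bound into the polynomial bound of this problem."""
    return sum(comb(n, i) for i in range(V + 1)) <= (n * e / V) ** V

# spot-check over a range of (V, n) pairs with n >= V
assert all(sauer_poly_bound(n, V) for V in range(1, 8) for n in range(V, 80))
```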

8. Let Q be a finitely discrete probability measure supported on x1, . . . , xn.
Then for any collection of sets, one has ∆n(C, x1, . . . , xn) = N(ε, C, Lr(Q))
for all ε ≤ inf Q{x}^{1/r}.
[Hint: The Lr-distance equals Q^{1/r}(C △ D) and can be less than a sufficiently
small ε only if C △ D does not contain a support point.]

9. If a collection of sets C is a VC-class, then the collection of indicators of sets
in C is a VC-subgraph class of the same index.

10. (Open and closed subgraphs) For a set F of measurable functions, define
“closed” and “open” subgraphs by {(x, t): t ≤ f(x)} and {(x, t): t < f(x)},
respectively. Then the collection of “closed” subgraphs has the same VC-
dimension as the collection of “open” subgraphs. Consequently, “closed” and
“open” are equivalent in the definition of a VC-subgraph class.
[Hint: Suppose the “closed” subgraphs shatter the set (x1, t1), . . . , (xn, tn).
Choose f1, . . . , fm whose “closed” subgraphs pick out all m = 2^n subsets.
Set 2ε = inf{ti − fj(xi): ti − fj(xi) > 0}. Then the “open” subgraphs shatter
the set (x1, t1 − ε), . . . , (xn, tn − ε). The converse can be argued in a similar
manner.]

11. (Between graphs) For a set F of measurable functions, define the “be-
tween” graphs as the sets {(x, t): 0 ≤ t ≤ f(x) or f(x) ≤ t ≤ 0}. The
“between” graphs form a VC-class of sets if and only if F is a VC-subgraph
class.
[Hint: The “closed” subgraphs intersected with the set {t ≥ 0} yield a VC-
class of sets, which is the positive half of the “between” graphs. The “open”
subgraphs intersected with the set {t ≤ 0} and next complemented within
{t ≤ 0} give the lower parts of the “between” graphs.]

12. If X is the union of finitely many disjoint sets Xi, and Ci is a VC-class of
subsets of Xi for each i, i = 1, . . . , m, then ⊔_{i=1}^m Ci is a VC-class in ∪_{i=1}^m Xi of
dimension ∑_{i=1}^m V(Ci).
13. If F is a VC-major class, then the sets {x: f (x) ≥ t}, with f ranging over F
and t over R, form a VC-class.
 
[Hint: The set {x: f(x) ≥ t} is the intersection of the sets {x: f(x) > t −
n^{−1}}. Use Lemma 2.6.18(vi).]

14. The collection of all half-spaces in R^d is a VC-class of dimension d + 1. A
half-space is a set of the form {x ∈ R^d: ⟨x, u⟩ ≤ c} for fixed u ∈ R^d and
c ∈ R. The collection of all closed balls in R^d is a VC-class of dimension
d + 1.
[Hint: Use Lemma 2.6.16. See Dudley (1979).]
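In R² (d = 2) both claims can be probed by brute force over a sampled family of half-planes. Sampling can only certify shattering, never non-shattering; the second assertion below holds because no half-plane at all can pick out a diagonal pair of the unit square, so a sampled family certainly cannot. A sketch (the helper name and parameter grid are chosen for illustration):

```python
from math import cos, sin, pi

def patterns(points, halfplanes):
    """Distinct subsets of `points` picked out by half-planes {z: <z,u> <= c},
    encoded as boolean membership tuples."""
    return {tuple(u1 * x + u2 * y <= c for x, y in points)
            for (u1, u2), c in halfplanes}

# sampled directions (5 degree steps) and offsets in [-2, 2]
H = [((cos(2 * pi * k / 72), sin(2 * pi * k / 72)), j / 4)
     for k in range(72) for j in range(-8, 9)]

# three points in general position are shattered (all 8 patterns occur) ...
assert len(patterns([(0, 0), (1, 0), (0, 1)], H)) == 8
# ... but the 4 corners of a square are not: diagonals cannot be separated
corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
assert len(patterns(corners, H)) < 16
```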

15. The class of all closed convex subsets in Rd is not VC for d ≥ 2. The same is
true for the collection of all open convex sets.
[Hint: Any set of n points on the rim of the unit ball is shattered by the
closed convex sets. The closed convex sets are in the sequential closure of the
open convex sets.]

16. The set of all open polygons with extreme points on the rim of the unit circle
in R2 is not VC.
[Hint: Form a regular polygon with its n extreme points on the rim of the
unit circle. Choose n points in the n gaps between the polygon and the unit
circle.]

17. If C is a VC-class of subsets of X and φ: X → Y is an arbitrary map, then
φ(C) need not be VC.
[Hint: Take X = Y = N with φ(2n − 1) = φ(2n) = n. Let C be
the collection consisting of the empty set, all singletons of odd integers,
and all finite subsets of even integers with more than two elements: C =
{∅, {1}, {3}, . . . , {2, 4}, {2, 6}, . . . , {2, 4, 6}, . . .}. Show that no subset of two
points in X is shattered, while φ(C) consists of all finite subsets of N.]

18. The set of all monotone functions f : R → [0, 1] is VC-hull but not VC-
subgraph.
[Hint: Any set of points (xi , ti ) with both xi and ti strictly increasing in i
is shattered.]

19. For a VC-subgraph class F, the class {f − Pf : f ∈ F} is not necessarily
VC-subgraph.
[Hint: For any countable collection of functions gn: X → [0, 1], the subgraphs
of the collection gn + n form a linearly ordered set. Hence the functions gn + n
form a VC-subgraph class.]

20. The class of functions of the form x ↦ c 1_{(a,b]}(x) with a, b, and c > 0 ranging
over R is VC of dimension 3.

[Hint: Any f whose subgraph picks out the subset {(x1, t1), (x3, t3)} from
three points (xi, ti) with x1 ≤ x2 ≤ x3 also picks out (x2, t2) unless t2 >
t1 ∨ t3. If a set of four points with nondecreasing xi is shattered, it follows
that t2 > t1 ∨ t3, but also t3 > t2 ∨ t4, t2 > t1 ∨ t4, and t3 > t1 ∨ t4.]
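The hint's dichotomy can be explored by brute force over a sampled parameter grid (an illustration, not a proof; helper names are mine): a three-point set with t2 > t1 ∨ t3 is picked out in all eight patterns, while no four grid points are shattered, in line with the four-point contradiction above.

```python
from itertools import product, combinations

grid = [i / 4 for i in range(9)]                    # 0, 0.25, ..., 2
funcs = [(a, b, c) for a, b in product(grid, grid) if a < b
         for c in (0.1, 0.3, 0.5, 0.7, 0.9)]

def value(f, x):
    a, b, c = f
    return c if a < x <= b else 0.0                 # c * 1_{(a, b]}(x)

def shattered(pts):
    got = {tuple(value(f, x) > t for x, t in pts) for f in funcs}
    return len(got) == 2 ** len(pts)

# a three-point set with t2 > t1 v t3 is shattered ...
assert shattered([(0.5, 0.1), (1.0, 0.6), (1.5, 0.1)])
# ... while no four grid points are (cf. the four-point argument)
sample = [(x, t) for x in (0.25, 0.75, 1.25, 1.75) for t in (0.1, 0.6)]
assert not any(shattered(q) for q in combinations(sample, 4))
```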

21. The “Box-Cox family of transformations” F = {fλ: (0, ∞) → R: λ ∈ R −
{0}}, with fλ(x) = (x^λ − 1)/λ, is a VC-subgraph class.
[Hint: The “between” graph class of sets (as defined in Exercise 2.6.11)
is the class of subsets C = {Cλ: λ ≠ 0} of R², where Cλ = {(x, t): 0 ≤
t ≤ (x^λ − 1)/λ}. Examine the “dual class” of subsets of R given by D =
{D(x,t): (x, t) ∈ (0, ∞) × (0, ∞)}, where

D(x,t) = {λ ≠ 0: (x, t) ∈ Cλ} = {λ ≠ 0: 0 ≤ t ≤ (x^λ − 1)/λ}.

Show that D is a VC-class of sets, and apply Assouad (1983), page 246,
Proposition 2.12.]

22. The VC-dimension of a collection Θ of functions θ: X → {0, 1} on some
arbitrary set X is equal to the VC-dimension of the collection {Cθ: θ ∈ Θ}
of subsets Cθ = {(x, y): y = θ(x)} of X × {0, 1}.

23. For VC-classes C1, . . . , Cm of subsets of a set X, of VC-dimensions V1, . . . , Vm,
the VC-dimension of the collection of all intersections C1 ∩ · · · ∩ Cm and of
the collection of all unions C1 ∪ · · · ∪ Cm of sets Ci ∈ Ci is bounded above
by c1 V log(c2 m) − Ent(V⃗)/V̄ ≤ c1 V log(c2 m), where V⃗ = (V1, . . . , Vm),
V̄ = m^{−1} ∑ Vi, c1 = e/((e − 1) log 2) ≈ 2.28231, and c2 = e/log 2 ≈ 3.92165.
The same is true for the collection of all Cartesian products C1 × · · · × Cm
of sets Ci belonging to classes Ci of subsets of sets Xi.
[Hint: Van der Vaart and Wellner (2009).]

24. If a class F of functions is closed under scalar multiplication (if f ∈ F, then
cf ∈ F for every c ∈ R), then V(ε, F) = V(F), for every ε > 0.
2.7
Entropies of Function Classes

While the VC-theory gives control over the entropy numbers of many in-
teresting classes through simple combinatorial arguments, results on brack-
eting numbers can be found in approximation theory. This section gives
examples, and in particular gives bracketing numbers of the most impor-
tant classes. In some cases, the bracketing numbers are actually uniform in
the underlying measure.

2.7.1 Hölder Classes
Hölder spaces provided the simplest method for defining regularity of func-
tions by smoothness properties. In recent years they have been superseded
by the Besov spaces, which provide a more refined scale of regularity levels.
In this section we develop the classical results for Hölder spaces, deferring
a treatment of Besov spaces (which imply the results obtained in the
present section) to the next section.
For given α > 0 we study the class of all functions on a bounded set X
in R^d that possess uniformly bounded partial derivatives up to order α̲ (the
greatest integer smaller than α) and whose highest partial derivatives are
Lipschitz of order α − α̲. A simple example is Lipschitz functions of some
order 0 < α ≤ 1 (for which α̲ = 0 even if α = 1!).
For a more precise description, define for any vector k = (k1, . . . , kd)
of d integers the differential operator

D^k = ∂^{k.} / (∂x1^{k1} · · · ∂xd^{kd}),

where k. = ∑ ki. Then for a function f : X → R, let

‖f‖_α = max_{k. ≤ α̲} sup_x |D^k f(x)| + max_{k. = α̲} sup_{x,y} |D^k f(x) − D^k f(y)| / ‖x − y‖^{α−α̲},

where the suprema are taken over all x, y in the interior of X with x ≠ y.
The Hölder space C^α(X) of order α is the space of all continuous functions
f : X → R for which the norm ‖f‖_α is well defined and finite. Let C^α_M(X)
be the set of all continuous functions f : X → R with ‖f‖_α ≤ M. Thus
C^α_1(X) denotes the unit ball in the Hölder space of order α.†
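For 0 < α ≤ 1 the norm contains no derivative terms and can be approximated on a finite grid. A small sketch (the function name is mine) illustrates the definition on x ↦ √x, which has 1/2-Hölder norm 2 on [0, 1] but an unbounded Lipschitz (α = 1) quotient near zero:

```python
import math
from itertools import combinations

def holder_norm(f, alpha, grid):
    """Grid approximation of ||f||_alpha for 0 < alpha <= 1: sup|f| plus
    the Hoelder seminorm sup |f(x) - f(y)| / |x - y|^alpha over grid pairs."""
    sup = max(abs(f(x)) for x in grid)
    semi = max(abs(f(x) - f(y)) / abs(x - y) ** alpha
               for x, y in combinations(grid, 2))
    return sup + semi

grid = [i / 200 for i in range(201)]
assert abs(holder_norm(math.sqrt, 0.5, grid) - 2.0) < 1e-6  # ||sqrt||_{1/2} = 2
assert holder_norm(math.sqrt, 1.0, grid) > 10               # no Lipschitz bound
```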
Bounds on the entropy numbers of the classes C^α_1(X) with respect to
the supremum norm ‖·‖∞ were among the first known after the introduction
of the concept of covering numbers. These readily yield bounds on the
Lr(Q)-bracketing numbers for the present classes of functions, as well as
for the class of subgraphs corresponding to them. The latter are classes of
sets with “smooth boundaries.”
2.7.1 Theorem. Let X be a bounded, convex subset of R^d with nonempty
interior. There exists a constant K depending only on α and d such that

log N(ε, C^α_1(X), ‖ · ‖∞) ≤ K λ(X^1) (1/ε)^{d/α},

for every ε > 0, where λ(X^1) is the Lebesgue measure of the set {x: ‖x −
X‖ < 1}.

Proof. Write ≲ for “less than or equal to a constant times,” where the
constant depends on α and d only. Let β denote the greatest integer strictly
smaller than α.
Since the functions in C^α_1(X) are continuous on X by assumption, it
may be assumed without loss of generality that X is open, so that Taylor's
theorem applies everywhere on X.
Fix δ = ε^{1/α} ≤ 1, and form a δ-net for X of points x1, . . . , xm contained
in X. The number m of points can be taken to satisfy m ≲ λ(X^1)/δ^d. For
each vector k = (k1 , . . . , kd ) with k. ≤ β, form for each f the vector
A_k f = (⌊D^k f(x1)/δ^{α−k.}⌋, . . . , ⌊D^k f(xm)/δ^{α−k.}⌋).

Then the vector δ^{α−k.} A_k f consists of the values D^k f(xi) discretized on a
grid of mesh-width δ^{α−k.}.


† Note the inconsistency in notation. The notation ‖·‖_α is used only with α < ∞ to define
the present classes of functions. The notation ‖ · ‖∞ always denotes the supremum norm.

If a given pair of functions f and g satisfy A_k f = A_k g for each k with
k. ≤ β, then ‖f − g‖∞ ≲ ε. Indeed, for each x there exists an xi with
‖x − xi‖ ≤ δ. By Taylor's theorem,

(f − g)(x) = ∑_{k. ≤ β} ((x − xi)^k / k!) D^k(f − g)(xi) + R,

where |R| ≲ ‖x − xi‖^α, because the highest-order partial derivatives are
uniformly bounded. In the multidimensional case, the notation h^k/k! is
short for ∏_{i=1}^d hi^{ki}/ki!. Thus

|f − g|(x) ≲ ∑_{k. ≤ β} δ^{α−k.} (δ^{k.} / k!) + δ^α ≤ δ^α (e^d + 1).

It follows that there exists a constant C depending on α and d only such
that the covering number N(Cε, C^α_1(X), ‖ · ‖∞) is bounded by the number
of different matrices

Af = (A_{0,0,...,0} f, A_{1,0,...,0} f, . . . , A_{0,0,...,β} f),

when f ranges over C^α_1(X). Each row in the matrix Af is one of the vectors
A_k f for k with k. ≤ β. Hence the number of rows is certainly smaller than
(β + 1)^d. By definition of each A_k f and the fact that |D^k f(xi)| ≤ 1 for each
i, the number of possible values of each element in the row A_k f is bounded
by 2/δ^{α−k.} + 2, which does not exceed 2δ^{−α} + 2. Thus each column of the
matrix can have at most (2δ^{−α} + 2)^{(β+1)^d} different values.
Assume without loss of generality that x1, . . . , xm have been chosen
and ordered in such a way that for each j > 1 there is an index i < j such
that ‖xi − xj‖ < 2δ. Then use the crude bound obtained previously for
the first column only. For each later column, indexed by xj, there exists a
previous xi with ‖xi − xj‖ < 2δ. By Taylor's theorem,

D^k f(xj) = ∑_{k.+l. ≤ β} ((xj − xi)^l / l!) D^{k+l} f(xi) + R,

where |R| ≲ ‖xi − xj‖^{α−k.}. Thus with B_k f = δ^{α−k.} A_k f,

|D^k f(xj) − ∑_{k.+l. ≤ β} ((xj − xi)^l / l!) B_{k+l} f(xi)|
    ≲ ∑_{k.+l. ≤ β} |B_{k+l} f(xi) − D^{k+l} f(xi)| ‖xi − xj‖^{l.} / l! + δ^{α−k.}
    ≲ ∑_{k.+l. ≤ β} δ^{α−k.−l.} (δ^{l.} / l!) + δ^{α−k.} ≲ δ^{α−k.}.

Thus given the values in the ith column of Af, the values D^k f(xj) range
over an interval of length proportional to δ^{α−k.}. It follows that the val-
ues in the jth column of Af range over integers in an interval of length
proportional to δ^{k.−α} δ^{α−k.} = 1. Consequently, there exists a constant C
depending only on α and d such that

#Af ≤ (2δ^{−α} + 2)^{(β+1)^d} C^{m−1}.

The theorem follows upon replacing δ by ε^{1/α} and m by its upper bound
λ(X^1) ε^{−d/α}, respectively, taking logarithms, and bounding log(1/ε) by a
constant times (1/ε)^{d/α}.
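In the simplest case d = 1, α = 1 (1-Lipschitz functions from [0, 1] into [0, 1]) the net of this construction can be counted directly: the discretized value at the first grid point can be chosen in about 1/ε ways, after which each of the roughly 1/ε steps moves the discretized value by at most one grid level, giving log N(ε) = O(1/ε), in accordance with the bound (1/ε)^{d/α}. A counting sketch (the function name is mine; it counts the net of the construction, an upper bound for the covering number at a constant multiple of ε):

```python
from math import ceil, log

def lipschitz_net_size(eps):
    """Size of the sup-norm net from the proof of Theorem 2.7.1 for d = 1,
    alpha = 1: record the floor-discretized value (multiples of eps) at each
    grid point of mesh eps; 1-Lipschitz continuity lets the discretized
    value move by at most one level between neighboring grid points."""
    m = ceil(1 / eps) + 1           # grid points covering [0, 1]
    levels = int(1 / eps) + 1       # possible discretized values in [0, 1]
    return levels * 3 ** (m - 1)    # free first value, then moves in {-1,0,+1}

assert lipschitz_net_size(0.2) == 6 * 3 ** 5
for eps in (0.2, 0.1, 0.05, 0.02):
    assert log(lipschitz_net_size(eps)) <= 10 / eps   # log N = O(1/eps)
```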

2.7.2 Corollary. Let X be a bounded, convex subset of R^d with nonempty
interior. There exists a constant K depending only on α, diam X, and d such
that

log N[ ](ε, C^α_1(X), Lr(Q)) ≤ K (1/ε)^{d/α},

for every r ≥ 1, ε > 0, and probability measure Q on R^d.

Proof. Let f1, . . . , fp be the centers of ‖ · ‖∞-balls of radius ε that cover
C^α_1(X). Then the brackets [fi − ε, fi + ε] cover C^α_1(X). Each bracket has
Lr(Q)-size at most 2ε. By the previous theorem, log p can be chosen smaller
than the given polynomial in 1/ε.

The corollary, together with either the bracketing central limit theo-
rem or the uniform entropy theorem, implies that C^α_1[0, 1]^d is universally
Donsker for α > d/2. For instance, on the unit interval in the line, uniformly
bounded and uniformly Lipschitz of order > 1/2 suffices, and in the unit
square it suffices that the partial derivatives exist and satisfy a Lipschitz
condition.
The previous results are restricted to bounded subsets of Euclidean
space. Under appropriate conditions on the tails of the underlying distribu-
tions they can be extended to classes of functions on the whole of Euclidean
space. The general conclusion is that under a weak tail condition, the same
amount of smoothness suffices for the class to be Donsker. For instance,
for any distribution on the line with a finite 2 + δ moment, the class of all
uniformly bounded and uniformly Lipschitz functions of order α > 1/2 has
a finite bracketing integral and hence is Donsker. A more general result is
an easy corollary to the first theorem of this section also.

2.7.3 Corollary. Let R^d = ∪_{j=1}^∞ Ij be a partition of R^d into bounded, con-
vex sets with nonempty interior, and let F be a class of functions f : R^d → R
such that the restrictions F|Ij belong to C^α_{Mj}(Ij) for every j. Then there
exists a constant K depending only on α, V, r, and d such that

log N[ ](ε, F, Lr(Q)) ≤ K (1/ε)^V (∑_{j=1}^∞ λ(Ij^1)^{r/(V+r)} Mj^{Vr/(V+r)} Q(Ij)^{V/(V+r)})^{(V+r)/r},

for every ε > 0, V ≥ d/α, and probability measure Q.

Proof. Let aj be any sequence of numbers in (0, ∞]. For each j ∈ N,
take an εaj-net fj,1, . . . , fj,pj for C^α_{Mj}(Ij) for the uniform norm on Ij. By
Theorem 2.7.1, pj can be chosen to satisfy

log pj ≤ K λ(Ij^1) (Mj/(εaj))^{d/α},

for a constant K depending on d and α only. It is clear that, for εaj > Mj,
the value pj can be chosen equal to 1. Now form the brackets

[∑_{j=1}^∞ (fj,ij − εaj)1{Ij}, ∑_{j=1}^∞ (fj,ij + εaj)1{Ij}],

where the sequence i1, i2, . . . ranges over all possible values: for each j the
integer ij ranges over {1, 2, . . . , pj}. The number of brackets is bounded by
∏ pj, and each bracket has Lr(Q)-size equal to 2ε(∑ aj^r Q(Ij))^{1/r}. It follows
that

log N[ ](2ε(∑ aj^r Q(Ij))^{1/r}, F, Lr(Q)) ≤ K (1/ε)^{d/α} ∑_{j: aj ε ≤ Mj} λ(Ij^1) (Mj/aj)^{d/α}
                                        ≤ K (1/ε)^V ∑_{j=1}^∞ λ(Ij^1) (Mj/aj)^V,

for every V ≥ d/α. The choice aj^{V+r} = λ(Ij^1) Mj^V / Q(Ij) reduces both series
in this expression to essentially the series in the statement of the corollary.
Simplify to obtain the result.

The preceding upper bound may be applied to obtain a simple sufficient
condition for classes of smooth functions on the whole space to have a finite
bracketing integral. As an example, consider the class C^α_1(R). If α > 1/2
and ∑_{j=1}^∞ P(Ij)^s < ∞ for a partition of R into intervals of fixed length and
some s < 1/2, then for sufficiently small δ > 0,

log N[ ](ε, C^α_1(R), L2(P)) ≤ K (1/ε)^{2−δ}

(for a constant K depending on P, α, and δ). Consequently, the class C^α_1(R)
has a finite bracketing integral and is P-Donsker. Convergence of the series
is implied by a tail condition: the series converges for some s < 1/2 if
∫ |x|^{2+δ} dP(x) < ∞ for some δ > 0.
The bound given by the preceding corollary can be proved to be the
best in terms of a power of (1/ε), but it is not sharp in terms of lower-
order terms. As a consequence, the best conditions for the class F as in
the corollary to be Glivenko-Cantelli or Donsker cannot be obtained from
the corollary. The class is known to be Glivenko-Cantelli or Donsker if and
only if ∑_{j=1}^∞ Mj P(Ij) < ∞ or ∑_{j=1}^∞ Mj P^{1/2}(Ij) < ∞, respectively. See
Section 2.10.5 and the Notes at the end of Part 2 for further details.

2.7.2 Besov and Sobolev Spaces
The scale of Besov spaces is indexed by three parameters, a level of regu-
larity α, and two parameters p, q indicating norms by which the regularity
is measured. For p = q = ∞ the Besov space of regularity α contains the
Hölder space C^α(X) and its entropy has the same order. The Besov spaces
indexed by different pairs p, q can be viewed as refinements of this space,
and many of them are “equal in size” as measured by entropy. Combinations
of parameters such that α > d/p permit the use of the uniform norm in
the entropy bound, and it is under this condition that the entropy numbers
obtained in this section translate into bracketing numbers.
The scale of Sobolev spaces is indexed by two parameters, a level of reg-
ularity α, and a parameter p indicating the norm by which the regularity is
measured. The Sobolev spaces indexed by p = 2 are the simplest examples.
They are Hilbert spaces and coincide with the corresponding Besov spaces
with p = q = 2. For values of p unequal to 2, the Sobolev spaces can be
sandwiched between Besov spaces, and their entropy follows immediately
from the entropy of Besov spaces.
Let X ⊂ Rd be a fixed measurable set. For most of the following we
need that X is sufficiently regular, for instance a (strong local) Lipschitz
domain in the sense of Adams and Fournier (2003), page 83, which requires
that the boundary of the set X is sufficiently smooth. For instance, ev-
ery bounded rectangle or convex set with nonempty interior and smooth
boundary is a Lipschitz domain.
For a function f : X → R and vector k = (k1, . . . , kd) ∈ N^d let D^k f de-
note the kth partial derivative, and set k. = ∑ ki, as in Section 2.7.1. Besov
spaces are defined in terms of differences (“discretized derivatives”), rather
than derivatives. The directional difference operator is defined recursively
for a “direction” h ∈ R^d by

∆h f(x) = f(x + h) − f(x),    ∆h^r f(x) = ∆h(∆h^{r−1} f)(x).

These expressions give differences in the direction h, and are well defined
if x, x + h, or x, x + h, . . . , x + rh are contained in X, respectively.
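A minimal numerical sketch of the recursion (the function names are mine): the second difference ∆h² annihilates affine functions and sends x² to the constant 2h², the discrete analogue of a second derivative.

```python
def delta(f, h):
    """First difference: (Delta_h f)(x) = f(x + h) - f(x)."""
    return lambda x: f(x + h) - f(x)

def delta_r(f, h, r):
    """r-th difference Delta_h^r f = Delta_h(Delta_h^{r-1} f), recursively."""
    for _ in range(r):
        f = delta(f, h)
    return f

# Delta_h^2 kills affine functions and maps x^2 to the constant 2 h^2
assert abs(delta_r(lambda x: 3 * x + 1, 0.5, 2)(0.3)) < 1e-9
assert abs(delta_r(lambda x: x * x, 0.5, 2)(0.3) - 2 * 0.5 ** 2) < 1e-9
```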
For α ∈ N and a function f : X → R, let

‖f‖_{α|p} := max_{0 ≤ k. ≤ α} ‖D^k f‖_{λ,p},

where ‖ · ‖_{λ,p} denotes the Lp-norm relative to the Lebesgue measure on
X. The Sobolev space W^α_p(X) is defined to be the subset of Lp(X, λ) of
functions for which the norm ‖f‖_{α|p} is finite (where the partial derivatives in
the previous display may be in the weak sense). It can be shown that this is
the same as the completion of the set of functions {f ∈ C^α(X): ‖f‖_{α|p} < ∞}
for the norm ‖ · ‖_{α|p} (see e.g. Adams and Fournier (2003), page 60 and
Theorem 3.17 on page 67). These Sobolev spaces of integer order can be
extended to arbitrary α > 0 via interpolation (as [Lp(X), W^r_p(X)]_{α/r}), but
we shall not discuss this operation here.
For 1 ≤ p, q ≤ ∞ and fixed r ∈ N with r > α, define

‖f‖_{α|p,q} = ‖f‖_{λ,p} + (∫_0^∞ (t^{−α} sup_{‖h‖≤t} ‖∆h^r f‖_{λ,p})^q dt/t)^{1/q},   (q < ∞),
‖f‖_{α|p,∞} = ‖f‖_{λ,p} + sup_{t>0} t^{−α} sup_{‖h‖≤t} ‖∆h^r f‖_{λ,p}.

The corresponding Besov space B^α_{p,q}(X) is the collection of all f ∈ Lp(X)
such that ‖f‖_{α|p,q} < ∞. The norm ‖ · ‖_{λ,p} is the Lp-norm relative to the
Lebesgue measure on the domains where the functions (f or ∆h^r f) are
naturally defined. The value of r must be strictly larger than α, also if α is
an integer (e.g. r = 2 if α = 1).
Because the function h ↦ ‖∆h^r f‖_{λ,p} is bounded above by a multiple of
‖f‖_{λ,p} and ∫_1^∞ t^{−αq−1} dt = (αq)^{−1} < ∞, the integral in the definition of
‖f‖_{α|p,q} is mostly determined by the behavior of the modulus of continuity
h ↦ ‖∆h^r f‖_{λ,p} near zero, and its upper bound ∞ can be replaced by 1
without much consequence. Because ∫_0^1 (1/t) dt = ∞, in a very rough sense
finiteness of the norm ‖f‖_{α|p,q} requires that the modulus ‖∆h^r f‖_{λ,p} is of
smaller order than ‖h‖^α as h ↓ 0. The parameter q determines how this
order is measured. Higher values of q make the requirement less stringent
(i.e. the norms ‖f‖_{α|p,q} are up to constants decreasing in q), so that larger
q lead to bigger Besov spaces (see Problem 2.7.14). A further intuitive
understanding of the role of this parameter appears to be a delicate matter.
For our purpose it suffices to consider the value q = ∞, which defines the
biggest class of functions.
The use of the directional difference operator, and differences of or-
der r > α, make the definition of Besov spaces a bit complicated. It
would be natural to use the same definition, but with first differences
of derivatives. For instance, for functions on the line one might replace
t^{−α} sup_{‖h‖≤t} ‖∆h^r f‖_{λ,p} by t^{−(α−α̲)} ‖∆h f^{(α̲)}‖_{λ,p}, for α̲ the largest integer
smaller than α. This would result, in general, in spaces that are slightly
smaller. Similarly, using exactly α differences if α is an integer (i.e. r = α)
would give a smaller space. (See Problem 2.7.15.)
In particular, the Besov space B^α_{p,∞}(X) contains the Lipschitz space Lip^α_p(X) of all functions f: X → R with finite α-Lipschitz norm

$$\|f\|_{\alpha,p} = \|f\|_{\lambda,p} + \sup_{t>0} \frac{1}{t^{\alpha-\alpha_0}} \sup_{\|h\|\le t} \max_{k.=\alpha_0} \|\Delta_h(D^k f)\|_{\lambda,p},$$

where α_0 is the biggest integer strictly smaller than α. The Lipschitz spaces are thus defined through a first order difference of the derivatives of order α_0.
2.7 Entropies of Function Classes 225

These spaces are a natural generalization of the Hölder spaces considered in Section 2.7.1, replacing the uniform norm by the L_p-norm, and reduce to the latter spaces if p = ∞. For noninteger α > 1 and p ≥ 1 the spaces Lip^α_p(X) and B^α_{p,∞}(X) coincide, but for integer-valued α the Besov space is slightly larger.
The Besov spaces “bracket” the Sobolev spaces:

$$B^\alpha_{p,p}(X) \subset W^\alpha_p(X) \subset B^\alpha_{p,2}(X), \qquad \text{if } 1 < p \le 2,$$
$$B^\alpha_{p,2}(X) \subset W^\alpha_p(X) \subset B^\alpha_{p,p}(X), \qquad \text{if } 2 \le p < \infty.$$

The inclusions in this display are also embeddings: the inclusion map, map-
ping the space on the left into the space on the right, is continuous. Thus
the unit ball of each of the spaces on the left is included into a multiple of
the unit ball of the spaces on the right.‡ For p = 2 these inclusions clearly
become identities.
We also have the following inclusions, which are embeddings as well: for α > 0, 1 ≤ p ≤ p′ ≤ ∞,[

$$B^\alpha_{p,q}(X) \subset B^\alpha_{p,q'}(X), \qquad \text{if } 1 \le q \le q' \le \infty,$$
$$B^\alpha_{p,q}(X) \subset B^{\alpha'}_{p',q}(X), \qquad \text{if } \alpha' - d/p' \le \alpha - d/p,$$
$$B^\alpha_{p,q}(X) \subset C(X), \qquad \text{if } \alpha > d/p.$$

These are examples of Sobolev embedding theorems. The first inclusion was
already noted in the preceding. The second inclusion is easy in the case
that p = p0 , when it merely says that higher order smoothness classes are
contained in lower order smoothness classes, if the smoothness is measured
in a single manner. In its stated, more general form the inclusion shows that higher order (α) smoothness measured in a weak sense (L_p) implies lower order (α′) smoothness in a stronger sense (L_{p′} for p′ ≤ pd/(d − p(α − α′))). For instance, a function on the line (d = 1) whose derivative is contained in L_2 (α = 1, p = 2) belongs to C^{1/2} (α′ = 1/2, p′ = ∞), an assertion that can easily be proved directly using the Cauchy-Schwarz inequality. The last inclusion can be viewed as a limiting case of this principle (p′ = ∞, α′ = 0), and shows that functions in Besov spaces are continuous if the smoothness level is large enough relative to p. Only for p = ∞ does every smoothness level α > 0 qualify to give continuity.
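The one-dimensional assertion just mentioned can be verified in a single display (our own spelling-out of the Cauchy-Schwarz argument, not the book's):

```latex
\[
  |f(x+h) - f(x)|
  = \Big|\int_x^{x+h} f'(s)\,ds\Big|
  \le \Big(\int_x^{x+h} 1^2\,ds\Big)^{1/2}
      \Big(\int_x^{x+h} f'(s)^2\,ds\Big)^{1/2}
  \le \sqrt{h}\,\|f'\|_2,
\]
```

so a derivative in L_2 indeed forces Hölder continuity of order 1/2, uniformly in x.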
In view of these inclusions an entropy bound on the Besov spaces au-
tomatically yields a bound on the Sobolev, Lipschitz, and Hölder spaces.
The following theorem describes the entropy relative to the Lτ (X , λ)-norm,
for a bounded, Lipschitz domain X ⊂ Rd , and given τ ∈ (0, ∞]. It can be


‡ We assume that X is strong local Lipschitz; see Bergh and Löfström (1976), Theorem 6.4.4, page 152, and Adams and Fournier (2003), Theorem 7.31, page 228.
[ Compare Härdle et al. (1998), page 124, and Bergh and Löfström (1976), page 153. We again assume that X is a Lipschitz domain. A function f in a Besov space on the left can then be extended to a function in B^α_{p,q}(R^d), and the embeddings follow from the case that X = R^d.

shown that the unit ball of the Besov space B^α_{p,q}(X) is compact in L_τ(X, λ) if and only if α > d(1/p − 1/τ)_+. Under this condition the order of the entropy depends on τ, α and d only, and not on p or q.]

2.7.4 Theorem. Let 0 < τ ≤ ∞, 0 < p, q ≤ ∞ and a bounded Lipschitz domain X be given. If α > d(1/p − 1/τ)_+, then the unit ball B^α_{p,q}(X)_1 of the Besov space B^α_{p,q}(X) satisfies

$$\log N\big(\varepsilon, B^\alpha_{p,q}(X)_1, L_\tau(X, \lambda)\big) \lesssim \Big(\frac{1}{\varepsilon}\Big)^{d/\alpha}.$$
The key to the proof of this theorem is a representation of Besov spaces
through series expansions on a wavelet basis, after which the entropy cal-
culations can be done on the space of coefficients. Because the Besov norm
is decreasing in q, it suffices to prove the theorem for q = ∞.
For functions f : R → R a suitable wavelet expansion can be written
in terms of a pair of functions φ and ψ, called father wavelet and mother
wavelet. These functions can be chosen such that the set of translates x 7→
φk (x) = φ(x − k) for k ∈ Z of the father, together with the translated
dilations x 7→ ψj,k (x) = 2j/2 ψ(2j x − k) for j, k ∈ Z of the mother form
an orthonormal basis of L_2(R, λ). Thus any function f ∈ L_2(R, λ) can be written as

$$f = \sum_{k\in\mathbb{Z}} \beta_{0,k}\,\phi_k + \sum_{j=1}^\infty \sum_{k\in\mathbb{Z}} \beta_{j,k}\,\psi_{j,k}.$$

In addition the father and mother wavelet can be chosen to have compact
support.
To simplify notation we write φk = ψ0,k , so that we obtain an or-
thonormal basis ψj,k of L2 (R, λ) indexed by the pairs (j, k) ∈ (N ∪ {0}) × Z.
Then any f ∈ L_2(R, λ) can be expanded as

$$f = \sum_{j=0}^\infty \sum_{k\in\mathbb{Z}} \beta_{j,k}\,\psi_{j,k},$$
where β_{j,k} = ∫ f ψ_{j,k} dλ are the Fourier coefficients of f relative to the basis
ψj,k . Now it can be shown that for sufficiently regular father and mother
wavelets, the Besov norm defined earlier is equivalent to the coefficient norm
given by

$$\|f\|_{\alpha|p,q} = \bigg(\sum_{j=0}^\infty \Big(2^{j\alpha}\Big(\sum_{k\in\mathbb{Z}} |\beta_{j,k}|^p\Big)^{1/p} 2^{j(1/2-1/p)}\Big)^q\bigg)^{1/q},$$
$$\|f\|_{\alpha|p,\infty} = \sup_{j\in\mathbb{N}\cup\{0\}} 2^{j\alpha}\Big(\sum_{k\in\mathbb{Z}} |\beta_{j,k}|^p\Big)^{1/p} 2^{j(1/2-1/p)}.$$

]
See Adams and Fournier (2003), page 83 for the definition of a Lipschitz domain.

(For p = ∞ read max_k |β_{j,k}| for (∑_{k∈Z} |β_{j,k}|^p)^{1/p}.) The Besov space B^α_{p,q}(R) consists exactly of all functions f whose wavelet coefficients render the expression on the right finite, and its unit ball is the set of functions for which the expression is bounded above by a given constant. (We abuse notation by writing ‖·‖_{α|p,q} for both the function and coefficient norm, which are equivalent, but not identical.)
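As a toy numerical illustration (our own sketch, not part of the text), the coefficient norm is straightforward to evaluate once the coefficients are arranged per resolution level; here `coeffs[j]` holds the finitely many nonzero β_{j,k} of level j:

```python
import numpy as np

def besov_seq_norm(coeffs, alpha, p, q):
    """Coefficient Besov norm: the ell_q norm over levels j of
    2^{j alpha} * ||beta_{j,.}||_p * 2^{j(1/2 - 1/p)}."""
    levels = []
    for j, beta in enumerate(coeffs):
        beta = np.abs(np.asarray(beta, dtype=float))
        lp = beta.max() if np.isinf(p) else (beta ** p).sum() ** (1.0 / p)
        inv_p = 0.0 if np.isinf(p) else 1.0 / p
        levels.append(2.0 ** (j * alpha) * lp * 2.0 ** (j * (0.5 - inv_p)))
    levels = np.array(levels)
    return levels.max() if np.isinf(q) else (levels ** q).sum() ** (1.0 / q)

# Coefficients decaying like 2^{-j(alpha + 1/2 - 1/p)} make every level contribute equally.
coeffs = [[1.0], [0.5, 0.5], [0.25] * 4]
norm = besov_seq_norm(coeffs, alpha=0.5, p=2.0, q=np.inf)
```

With p = 2 each level above contributes exactly 1, so the q = ∞ norm equals 1 up to floating-point rounding; smaller q gives a larger value, in line with the norms being decreasing in q.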
For functions f: R^d → R the wavelet expansion is slightly more complicated. A function f ∈ L_2(R^d) can be written

$$(2.7.5)\qquad f = \sum_{k\in\mathbb{Z}^d} \sum_{v\in\{0,1\}^d} \beta^v_{0,k}\,\psi^v_{0,k} + \sum_{j=1}^\infty \sum_{k\in\mathbb{Z}^d} \sum_{v\in\{0,1\}^d-\{0\}} \beta^v_{j,k}\,\psi^v_{j,k},$$

for functions ψ^v_{j,k} which are orthogonal for different indices (j, k, v), and are scaled and translated versions of the set of 2^d “base wavelets” ψ^v_{0,0}:

$$\psi^v_{j,k}(x) = 2^{jd/2}\,\psi^v_{0,0}(2^j x - k).$$
Such a higher-dimensional wavelet basis can be obtained as tensor products ψ^v_{0,0} = φ^{v_1} × ··· × φ^{v_d} of a given father wavelet φ^0 and mother wavelet φ^1
in one dimension. If it is understood that a sum over v ranges over {0, 1}d
if nested within j = 0 and over {0, 1}d − {0} if nested within j ∈ N, then
the two sums can be collapsed again to a single sum over j ∈ N ∪ {0}. The
corresponding Besov norm then takes the form
$$\|f\|_{\alpha|p,q} = \bigg(\sum_{j=0}^\infty \Big(2^{j\alpha}\Big(\sum_{k\in\mathbb{Z}^d}\sum_v |\beta^v_{j,k}|^p\Big)^{1/p} 2^{jd(1/2-1/p)}\Big)^q\bigg)^{1/q},$$
$$\|f\|_{\alpha|p,\infty} = \sup_{j\in\mathbb{N}\cup\{0\}} 2^{j\alpha}\Big(\sum_{k\in\mathbb{Z}^d}\sum_v |\beta^v_{j,k}|^p\Big)^{1/p} 2^{jd(1/2-1/p)}.$$

For suitable base wavelets this norm is equivalent to the Besov norm defined
previously.
In order to measure the size of a function in Lτ (Rd , λ), it is useful to
reexpress the expansion in terms of the “renormalized wavelet basis”. The
base functions ψ^v_{j,k} have been scaled so that their L_2-norms are equal to the L_2-norm of the base wavelet ψ^v_{0,0}, independent of the value of (j, k). To stabilize the L_τ-norms the appropriate scaling factors are 2^{jd/τ} rather than 2^{jd/2}. For a given value of τ the function 2^{jd(1/τ−1/2)} ψ^v_{j,k} has L_τ-norm independent of (j, k), and yields a wavelet basis “normalized in L_τ”. If we define

$$\beta^v_{j,k|\tau} = 2^{jd(1/2-1/\tau)}\,\beta^v_{j,k}, \qquad \psi^v_{j,k|\tau} = 2^{jd(1/\tau-1/2)}\,\psi^v_{j,k},$$

then the product β^v_{j,k|τ} ψ^v_{j,k|τ} is independent of τ, and the function f can be reexpressed as

$$f = \sum_{j=0}^\infty \sum_{k\in\mathbb{Z}^d} \sum_v \beta^v_{j,k|\tau}\,\psi^v_{j,k|\tau}.$$

Under this normalization the L_τ-norm of the projection on level j is equivalent to the ℓ_τ-norm of the coefficients at that level:

$$\sum_{k\in\mathbb{Z}^d}\sum_v |\beta^v_{j,k|\tau}|^\tau \lesssim \Big\|\sum_{k\in\mathbb{Z}^d}\sum_v \beta^v_{j,k|\tau}\,\psi^v_{j,k|\tau}\Big\|_\tau^\tau \lesssim \sum_{k\in\mathbb{Z}^d}\sum_v |\beta^v_{j,k|\tau}|^\tau.$$

Thus by the triangle inequality the L_τ-norm of the function f satisfies

$$\|f\|_\tau \lesssim \sum_{j=0}^\infty \Big(\sum_{k\in\mathbb{Z}^d}\sum_v |\beta^v_{j,k|\tau}|^\tau\Big)^{1/\tau}.$$

For τ = 2 the functions ψj,k|2 = ψj,k are orthonormal and hence the in-
equalities in this display and in the preceding one are actually identities.
For general τ there is a one-sided inequality up to constant only, with the
corresponding inequality valid within a resolution level only. In every case
the Lτ -norm of a function can be bounded in terms of its L2 -wavelet coeffi-
cients, with additional factors 2jd(1/2−1/τ ) added to correct for the scaling.
Besov spaces B^α_{p,q}(X) on an unbounded domain X are typically not compact in L_τ(X, λ), for which reason the preceding theorem is restricted to bounded domains. Every function f ∈ B^α_{p,q}(X) for a Lipschitz domain X can be extended to a function f̃ ∈ B^α_{p,q}(R^d) of Besov norm satisfying ‖f̃‖_{α|p,q} ≲ ‖f‖_{α|p,q}. We can next expand the extension f̃ on a given wavelet


basis. If we measure the function f˜ in Lτ (X , λ), then terms of the expansion
that refer to a base function whose support does not intersect X do not play
a role. Thus if we use compactly supported wavelets, then at each level j only of the order 2^{jd} terms of the expansion are important. Consequently, for the proof of Theorem 2.7.4, it suffices to consider coefficients β^v_{j,k} with j ∈ N ∪ {0} and (k, v) ranging over a set of cardinality of the
order 2jd . The necessary calculations are given in Section 2.7.2.1, stripped
of any interpretation in terms of function spaces. Because the parameter
v loses its geometric meaning in this representation, it can without loss of
generality be subsumed in the parameter k.
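The count of active translates at a given level can be made concrete in one dimension (a toy sketch of our own; the support length S = 4 is an arbitrary choice, not from the text):

```python
def n_active(j, S=4):
    """Number of integer translates k whose support [k/2^j, (k+S)/2^j]
    meets (0, 1).  The support meets (0, 1) iff -S < k < 2^j, giving
    2^j + S - 1 translates: of the order 2^j, plus an edge correction."""
    return sum(1 for k in range(-S + 1, 2 ** j))

counts = [n_active(j) for j in range(5)]
```

At levels j = 0, …, 4 this gives 4, 5, 7, 11, 19 active translates, growing like 2^j, in line with the 2^{jd} count used above.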

2.7.2.1 Besov Bodies


Let BB^δ_p be the collection of all doubly indexed sequences β = (β_{j,k}), where j = 0, 1, ... and k = 1, 2, ..., K_j for some K_j ≲ 2^{jd}, with

$$\sup_{j\in\mathbb{N}\cup\{0\}} 2^{j\delta} \Big(\sum_{k=1}^{K_j} |\beta_{j,k}|^p\Big)^{1/p} \le 1.$$

We equip this with the norm

$$\|\beta\|_\tau = \sum_{j=0}^\infty \Big(\sum_{k=1}^{K_j} |\beta_{j,k}|^\tau\Big)^{1/\tau}, \quad \text{if } \tau < \infty, \qquad\qquad \|\beta\|_\infty = \sum_{j=0}^\infty \max_{k=1,\dots,K_j} |\beta_{j,k}|.$$

For β wavelet coefficients normalized in L_τ and δ = α − d(1/p − 1/τ) this corresponds to functions f with Besov norm ‖f‖_{α|p,∞} bounded by a constant, equipped with the L_τ-norm.

2.7.6 Theorem. If δ > d(1/τ − 1/p)_+, then, with s = δ + d(1/p − 1/τ),

$$\log N\big(\varepsilon, BB^\delta_p, \|\cdot\|_\tau\big) \lesssim \Big(\frac{1}{\varepsilon}\Big)^{d/s}.$$
Proof. For given j let β_{j,·} denote the vector (β_{j,k})_{k=1,...,K_j} in R^{K_j}, and let ‖β_{j,·}‖_τ be its τ-norm (∑_k |β_{j,k}|^τ)^{1/τ}. These norms are decreasing in τ, while, in the other direction, for p > τ we have ‖β_{j,·}‖_τ ≲ K_j^{1/τ−1/p} ‖β_{j,·}‖_p by Hölder's inequality. The latter comparison and the fact that K_j ≲ 2^{jd} show that, for p > τ and some constant C,

$$BB^\delta_p \subset C\,BB^{\delta+d(1/p-1/\tau)}_\tau.$$
Therefore, the theorem for τ = p and δ > 0 with upper bound (1/ε)^{d/δ}
implies the theorem for p > τ and δ > d(1/τ − 1/p) as given. Thus we may
assume that p ≤ τ and then prove the bound for arbitrary δ > 0.
For a given sequence of numbers (λ_j) and β ∈ BB^δ_p let β̃ be defined by shrinking for each j each of the coordinates β_{j,k} of β to the grid with mesh width λ_j. Thus β̃_{j,k} = sign(β_{j,k})⌊|β_{j,k}|/λ_j⌋λ_j for each (j, k). If λ_j = 0 it will be understood that each β_{j,k} is shrunk to 0. The proof proceeds by determining (λ_j) such that ‖β − β̃‖_τ ≤ Cε for some constant C, in which case the number of different arrays β̃ is an upper bound on the Cε-covering number of BB^δ_p. For defining the numbers (λ_j) and counting the number of different β̃ we differentiate between three cases:
(a) p ≤ τ < ∞. We choose λ_j = λ ∼ ε^{(s+d/τ)/s}, independent of j.
(b) p < τ = ∞. For j_ε the solution of the equation 2^{−j_ε(δ+d/p)} = ε we choose λ_j = ε/(j − j_ε)², where ε/0 = ε.
(c) p = τ = ∞. For j_ε the solution of the equation 2^{−j_ε δ} = ε we choose λ_j = ε/(j − j_ε)² for j < j_ε and λ_j = 0 otherwise.
In all three cases clearly |β_{j,k} − β̃_{j,k}| ≤ λ_j for every (j, k) such that λ_j > 0. This implies that ‖β_{j,·} − β̃_{j,·}‖_τ ≤ λ_j K_j^{1/τ} ≲ λ_j 2^{jd/τ} for τ < ∞, and ‖β_{j,·} − β̃_{j,·}‖_∞ ≤ λ_j whenever λ_j > 0.
In case (b) the last bound shows that ‖β − β̃‖_∞ ≤ ∑_j λ_j, which is bounded by a multiple of ε. In case (c) the same bound implies that ‖β − β̃‖_∞ ≤ ∑_{j≤j_ε} λ_j + ∑_{j>j_ε} max_k |β_{j,k}|, which is also bounded by a multiple of ε, since the assumption that β ∈ BB^δ_∞ implies that |β_{j,k}| ≤ 2^{−jδ} for every (j, k) and 2^{−j_ε δ} = ε by assumption.
In case (a), where p ≤ τ < ∞, we have next to the trivial bound ‖β_{j,·} − β̃_{j,·}‖_τ ≲ λ 2^{jd/τ} also that ∑_k |β_{j,k} − β̃_{j,k}|^τ ≤ λ^{τ−p} ∑_k |β_{j,k} − β̃_{j,k}|^p, and hence

$$\|\beta_{j,\cdot} - \tilde\beta_{j,\cdot}\|_\tau \le \lambda^{1-p/\tau}\, \|\beta_{j,\cdot} - \tilde\beta_{j,\cdot}\|_p^{p/\tau} \le \lambda^{1-p/\tau}\, 2^{p/\tau}\, \|\beta_{j,\cdot}\|_p^{p/\tau} \lesssim \lambda^{1-p/\tau}\, 2^{-j\delta p/\tau},$$

since β ∈ BB^δ_p. Combination with the trivial bound yields

$$\|\beta - \tilde\beta\|_\tau \lesssim \sum_{j=0}^\infty \big(\lambda^{1-p/\tau}\, 2^{-j\delta p/\tau} \wedge \lambda\, 2^{jd/\tau}\big) \lesssim \lambda^{s/(s+d/\tau)},$$

in view of Problem 2.7.8 and the definition of s. The choice of λ shows that this is a multiple of ε.
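The shrinking step used throughout this argument is elementary; a minimal numerical sketch (our own toy arrays and parameter values, not the book's):

```python
import numpy as np

def shrink_to_grid(beta, lam):
    """Shrink each coordinate toward zero to the grid {0, ±lam, ±2·lam, ...}."""
    if lam == 0:
        return np.zeros_like(beta)
    return np.sign(beta) * np.floor(np.abs(beta) / lam) * lam

rng = np.random.default_rng(0)
beta = rng.normal(size=100)
beta_tilde = shrink_to_grid(beta, 0.1)

# Each coordinate moves by less than lam, and shrinking never increases
# absolute values, so |beta - beta_tilde| <= lam and |beta_tilde| <= |beta|.
assert np.all(np.abs(beta - beta_tilde) <= 0.1 + 1e-12)
assert np.all(np.abs(beta_tilde) <= np.abs(beta) + 1e-12)
```

These two properties are exactly the ones the proof uses: the first drives the approximation bound, the second keeps β̃ inside a multiple of the Besov body.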
Next we proceed to count the number of β̃ in each of the three cases.
For each fixed j we represent a typical vector β̃j,· by
(i) the set (k1 , k2 , . . . , kLj ) of indices k of the nonzero coordinates β̃j,k .
(ii) the vector of signs (δ1 , δ2 , . . . , δLj ) of these nonzero coordinates β̃j,k .
(iii) the absolute truncation levels (n_1, n_2, ..., n_{L_j}) of these nonzero coordinates β̃_{j,k} (i.e. n_i = ⌊|β_{j,k_i}|/λ_j⌋ for every i).
The number of different β̃ is bounded by the product over j of the product
of the numbers of different entities of the three types, and is a bound on
the Cε-covering number of BB^δ_p.
The number of different subsets as in (i) is bounded by, if L_j > 0,

$$\sum_{l=0}^{L_j} \binom{K_j}{l} \le \Big(\frac{eK_j}{L_j}\Big)^{L_j} \lesssim \Big(C\,\frac{2^{jd}}{L_j}\Big)^{L_j}.$$

(Cf. Problem 2.7.11.) Each β̃_{j,·} with positive coordinates gives rise to 2^{L_j} vectors β̃_{j,·} with arbitrary signs, whence the number of sign vectors as in (ii) is bounded by 2^{L_j}.
The number L_j of nonzero coefficients β̃_{j,k} trivially satisfies L_j ≤ K_j ≲ 2^{jd} in all three cases (a), (b), (c). In cases (a) and (b), where p < ∞, the assumption that β ∈ BB^δ_p implies that ∑_k |β_{j,k}|^p ≤ 2^{−jδp}. Because |β_{j,k}| ≥ λ_j for each nonzero coordinate β̃_{j,k} it follows that in cases (a) and (b) also L_j ≲ 2^{−jδp}/λ_j^p for each j. In case (c) we have L_j = 0 for j > j_ε by construction.
In cases (a) and (b), the number of different sets of levels (n_1, ..., n_{L_j}) as in (iii) is bounded by

$$\#\Big\{(n_1,\dots,n_{L_j})\colon \sum_k |\lambda_j n_k|^p \le 2^{-j\delta p}\Big\} \le 4^{L_j} \max\Big\{n_1^2 n_2^2 \cdots n_{L_j}^2\colon \sum_k |\lambda_j n_k|^p \le 2^{-j\delta p}\Big\},$$

in view of Problem 2.7.10. Here we can bound

$$\log\big(n_1^2 n_2^2 \cdots n_{L_j}^2\big) = \frac{2}{p} \sum_k \log n_k^p \lesssim L_j \log\Big(\frac{1}{L_j}\sum_k n_k^p\Big) \le L_j \log\Big(\frac{2^{-j\delta p}/\lambda_j^p}{L_j}\Big),$$

where we used Jensen’s inequality in the second last step, and the restriction
on (n_1, ..., n_{L_j}) in the last step. It follows that for p < ∞ the logarithm of

the number of different arrays β̃ is bounded by a multiple of

$$\sum_{j=0}^\infty \Big[L_j \log\Big(\frac{2^{jd}C}{L_j}\Big) + L_j \log 2 + L_j \log 4 + L_j \log\Big(\frac{2^{-j\delta p}/\lambda_j^p}{L_j}\Big)\Big].$$

Let j_ε be the solution to the equation 2^{−j_ε(d+δp)} = λ^p, for λ^{s/(s+d/τ)} = ε, and bound L_j by a multiple of 2^{jd} for j ≤ j_ε and by a multiple of 2^{−jδp}/λ_j^p if j > j_ε. Then the preceding display reduces to

$$\sum_{j\le j_\varepsilon} 2^{jd}\Big[1 + \log\Big(\frac{2^{-j\delta p}/\lambda_j^p}{2^{jd}}\Big)\Big] + \sum_{j>j_\varepsilon} \frac{2^{-j\delta p}}{\lambda_j^p}\Big[\log\Big(\frac{2^{jd}}{2^{-j\delta p}/\lambda_j^p}\Big) + 1\Big]$$
$$\qquad = 2^{j_\varepsilon d} \sum_{j\le j_\varepsilon} 2^{(j-j_\varepsilon)d}\, \log\Big(\frac{\lambda^p/\lambda_j^p}{2^{(j-j_\varepsilon)(d+\delta p)}}\Big) + \frac{2^{-j_\varepsilon \delta p}}{\lambda^p} \sum_{j>j_\varepsilon} 2^{(j_\varepsilon-j)\delta p}\, \frac{\lambda^p}{\lambda_j^p}\, \log\Big(\frac{2^{(j-j_\varepsilon)(d+\delta p)}}{\lambda^p/\lambda_j^p}\Big).$$

The leading factors 2^{j_ε d} and 2^{−j_ε δp}/λ^p in front of the two sums are both equal to the desired upper bound (1/ε)^{d/s}, whence it suffices to show that the sums are bounded. In case (a) λ_j = λ, and the sums can be seen to be bounded by making the change of summation index j − j_ε → j. In case (b) λ/λ_j = (j − j_ε)², and the situation is the same, apart from the extra factors (j − j_ε)^{2p}.
Finally, in case (c) we count in (iii) the vectors (n_1, ..., n_{L_j}) ∈ N^{L_j} with max_k n_k ≤ 2^{−jδ}/λ_j, the number being understood to be 0 if j > j_ε, when L_j = 0. This gives an upper bound on the number of different β̃ equal to a multiple of

$$\sum_j \Big[L_j \log\Big(\frac{2^{jd}C}{L_j}\Big) + L_j \log 2 + L_j \log 4 + L_j \log\Big(\frac{2^{-j\delta}}{\lambda_j}\Big)\Big] \lesssim \sum_{j\le j_\varepsilon} 2^{jd}\Big[1 + \log\Big(\frac{2^{-j\delta}}{\lambda_j}\Big)\Big] \lesssim 2^{j_\varepsilon d} \sum_{j\le j_\varepsilon} 2^{(j-j_\varepsilon)d} \log\Big(\frac{(j_\varepsilon - j)^2}{2^{j\delta}\,\varepsilon}\Big).$$

The leading factor 2^{j_ε d} is equal to (1/ε)^{d/δ}, which is the upper bound given in the theorem in this case, and the sum can be seen to be bounded after substituting ε = 2^{−j_ε δ} and changing the summation index.

2.7.3 Monotone Functions


The class of all uniformly bounded, monotone functions on the real line is
Donsker. This may be proved in many ways; for instance, by verifying that
the class possesses a finite bracketing integral. The following theorem shows
that the bracketing entropy is of the order 1/ε uniformly in the underlying
measure.

2.7.7 Theorem. The class F of monotone functions f: R → [0, 1] satisfies

$$\log N_{[\,]}\big(\varepsilon, \mathcal{F}, L_r(Q)\big) \le K\,\frac{1}{\varepsilon},$$
for every probability measure Q, every r ≥ 1, and a constant K that de-
pends on r only.

Proof. It suffices to establish the bound for the class of monotonically


increasing functions. It also suffices to prove the bound for Q equal to the
uniform measure λ on [0, 1]. To see the latter, note first that if Q^{−1}(u) = inf{x: Q(x) ≥ u} denotes the quantile function of Q, then the class F ∘ Q^{−1} consists of functions f ∘ Q^{−1}: [0, 1] → [0, 1] that are monotone. Since Q^{−1} ∘ Q(x) ≤ x for every x and u ≤ Q ∘ Q^{−1}(u) for every u, an ε-bracket [l, u] for f ∘ Q^{−1} for λ yields a function l ∘ Q with the properties l ∘ Q ≤ f ∘ Q^{−1} ∘ Q ≤ f and

$$\|f - l \circ Q\|_{Q,r} = \|f \circ Q^{-1} - l \circ Q \circ Q^{-1}\|_{\lambda,r} \le \|f \circ Q^{-1} - l\|_{\lambda,r} < \varepsilon.$$

Thus l ∘ Q is a “lower bracket.” If Q is strictly increasing, then u ∘ Q can be used as an upper bracket. More generally, we can repeat the preceding construction with Q^{−1}(u) = sup{x: Q(x) ≤ u} and obtain upper brackets u ∘ Q. A set of full brackets can next be formed by taking every pair [l ∘ Q, u ∘ Q] of functions attached to a same function f.
Call a function g a left bracket for f if g ≤ f and ‖f − g‖_{λ,r} ≤ ε. It suffices to construct a set of left brackets of the given cardinality. Right brackets may next be constructed from left brackets for the class of functions h(x) = 1 − f(1 − x).
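A much cruder construction than the adaptive partitioning that follows already produces left brackets, and may help fix ideas (our own toy sketch, not the proof's construction):

```python
import numpy as np

def left_bracket(f, n):
    """Crude left bracket for a nondecreasing f on [0, 1]: the step function
    equal to f(j/n) on [j/n, (j+1)/n).  It lies below f by monotonicity."""
    vals = f(np.arange(n) / n)
    return lambda x: vals[np.minimum((x * n).astype(int), n - 1)]

f = lambda x: np.sqrt(x)          # a monotone function into [0, 1]
l = left_bracket(f, 100)
x = np.linspace(0.0, 1.0, 1001)

assert np.all(l(x) <= f(x) + 1e-12)
# The L1 error telescopes: it is at most (f(1) - f(0)) / n = 1/100 here, so
# uniform-grid left brackets of L1-size eps need of the order 1/eps cells.
assert np.mean(f(x) - l(x)) <= 0.02
```

The adaptive partitions of the proof are needed to make the *number* of such brackets exp(K/ε) rather than merely their size small.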
Fix ε and set c = (1/2)^{1/r}. Fix a function f. Let P_0 be the partition 0 = x_0 < x_1 = 1 of the unit interval. Given a partition P_i of the form 0 = x_0 < x_1 < ··· < x_n = 1, define ε_i = ε_i(f) by

$$\varepsilon_i = \max_j\, \big(f(x_j) - f(x_{j-1})\big)(x_j - x_{j-1})^{1/r}.$$

Form a partition P_{i+1} by dividing the intervals [x_{j−1}, x_j] in P_i for which

$$\big(f(x_j) - f(x_{j-1})\big)(x_j - x_{j-1})^{1/r} \ge c\,\varepsilon_i$$

into two halves of equal length. It is clear that ε_0 ≤ 1 and that ε_{i+1} ≤ cε_i ≤ 2ε_{i+1}. Let n_i = n_i(f) be the number of intervals in P_i and s_i = n_{i+1} − n_i
the number of members of P_i that are divided to obtain P_{i+1}. Then by the definitions of s_i and ε_i,

$$s_i\,(c\varepsilon_i)^{r/(r+1)} \le \sum_j \big(f(x_j) - f(x_{j-1})\big)^{r/(r+1)} (x_j - x_{j-1})^{1/(r+1)} \le \Big(\sum_j \big(f(x_j) - f(x_{j-1})\big)\Big)^{r/(r+1)} \Big(\sum_j (x_j - x_{j-1})\Big)^{1/(r+1)},$$
by Hölder's inequality. This is bounded by (f(1) − f(0))^{r/(r+1)} · 1 ≤ 1. Consequently, the sum of the numbers of intervals up to the ith partition satisfies
$$(2.7.8)\qquad \sum_{j=1}^i n_j = i + \sum_{j=1}^i j\,s_{i-j} \le 2\sum_{j=1}^i j\,(c\varepsilon_{i-j})^{-r/(r+1)} \lesssim \sum_{j=1}^i j\,c^{rj/(r+1)}\, \varepsilon_i^{-r/(r+1)} \lesssim \varepsilon_i^{-r/(r+1)},$$

where ≲ denotes smaller than up to a constant that depends on r only.


Each function f generates a sequence of partitions P0 ⊂ P1 ⊂ · · ·.
Call two functions equivalent at stage i if their partitions up to the i-th
are the same. For each i this yields a partitioning of the class F in equiv-
alence classes, which can for increasing i be visualized as a tree structure
with the different equivalence classes at stage i as the nodes at level i. Con-
tinue the partitioning of a certain branch until the first level k such that
ε_k(f)^r ≤ ε^{r+1} for every f in that branch. This defines a partitioning of the class F into finitely many subsets, each corresponding to some sequence of partitions P_0 ⊂ ··· ⊂ P_k of the unit interval. For each subset and i, define

$$\tilde\varepsilon_i = \sup_f \varepsilon_i(f),$$

where the supremum is taken over all functions f in the subset. While
ε̃i = ε̃i (f ) may be thought of as depending on f , it really depends only
on the subset of f in the final partitioning, unlike εi . Since the numbers
n1 ≤ n2 ≤ · · · ≤ ni also depend on the sequence P0 ⊂ · · · ⊂ Pi only,
inequality (2.7.8) remains valid if εi is replaced by ε̃i .
For a fixed function f , define a left-bracketing function fi , which is con-
stant on each interval [xj−1 , xj ) in the partition Pi , recursively as follows.
First f0 = 0. Next given fi−1 , define fi on the interval [xj−1 , xj ) by

$$f_i(x_{j-1}) = f_{i-1}(x_{j-1}) + l_j\, \frac{\tilde\varepsilon_i}{(x_j - x_{j-1})^{1/r}},$$

where l_j ≥ 0 is the largest integer such that f_i ≤ f. Thus to construct f_i, the left bracket f_{i−1} is raised at x_{j−1} by as many steps of size ε̃_i (x_j − x_{j−1})^{−1/r} as possible.
We claim that the set of functions fk , when f ranges over F and
k = k(f ) is the final level of partitioning for f , constitutes a set of left
brackets of size proportional to ε of the required cardinality.
First, it is immediate from the construction of fi that, for each of the
intervals [xj−1 , xj ) in the ith partition,

$$\big(f(x_{j-1}) - f_i(x_{j-1})\big)(x_j - x_{j-1})^{1/r} \le \tilde\varepsilon_i.$$




Combining this with the definition of ε_i and the monotonicity of f, we obtain

$$(2.7.9)\qquad \big(f(x) - f_i(x_{j-1})\big)(x_j - x_{j-1})^{1/r} \le \varepsilon_i + \tilde\varepsilon_i \le 2\tilde\varepsilon_i, \qquad x \in [x_{j-1}, x_j).$$

Consequently,

$$\|f - f_i\|_{\lambda,r}^r \le n_i\,(2\tilde\varepsilon_i)^r \lesssim \tilde\varepsilon_i^{\,r^2/(r+1)},$$

since n_i ≲ ε̃_i^{−r/(r+1)} by (2.7.8). For i = k(f), this is bounded up to a constant by ε^r. Thus the brackets have the correct size.
Second, we count the number of functions f_k obtained when f ranges over F. In view of the definition of k = k(f), we have that ε_{k−1}(g)^r > ε^{r+1} for some function g which is equivalent to f at stage k − 1. This implies that ε_{k−1}(g)^{−r/(r+1)} < 1/ε. Combining this with (2.7.8), we find ∑_{j=1}^{k−1} n_j ≲ 1/ε and trivially n_k(f) ≤ 2n_{k−1} ≲ 1/ε. (Note that the numbers n_j for j ≤ k − 1 are the same for f and g.) The number of sequences 1 = n_0 ≤ n_1 ≤ ··· ≤ n_k ≤ C/ε is equal to the number of ways we can choose k integers from the set {2, 3, ..., ⌊C/ε⌋}. It does not exceed 2^{C/ε}. Given a sequence of this type, the number of ways to obtain P_{i+1} from P_i is the binomial coefficient $\binom{n_i}{s_i}$, which is bounded by 2^{n_i}. We conclude that the total number of different final partitions P_0 ⊂ ··· ⊂ P_k generated when f ranges over F is bounded by

$$2^{C/\varepsilon}\, 2^{n_0} 2^{n_1} \cdots 2^{n_{k-1}} \le 2^{2C/\varepsilon}.$$

Given a sequence of partitions P_0 ⊂ ··· ⊂ P_k, let F_i be the set of all left brackets f_i constructed on the partition P_i when f ranges over the subset of F corresponding to this sequence of partitions. Then F_0 = {0}, and since ε̃_i is fixed for the given sequence of partitions,

$$\# \mathcal{F}_i \le \Big(\prod_{j=1}^{n_i} m_j\Big)\, \# \mathcal{F}_{i-1},$$

where m_j is the number of nonnegative integers that can occur in the definition of a function f_i ∈ F_i from a function f_{i−1} ∈ F_{i−1}. Let the interval [x_{j−1}, x_j) occur in the ith partition. By (2.7.9) we have that f(x_{j−1}) − f_{i−1}(x_{j−1}) ≤ 2ε̃_{i−1}(x_{j,i} − x_{j−1,i})^{−1/r}, where [x_{j−1,i}, x_{j,i}) is the interval in the partition P_{i−1} that contains [x_{j−1}, x_j). It follows that

$$0 \le m_j \le \frac{2\tilde\varepsilon_{i-1}\,(x_{j,i} - x_{j-1,i})^{-1/r}}{\tilde\varepsilon_i\,(x_j - x_{j-1})^{-1/r}} + 2 \lesssim \frac{2}{c} + 2 =: L.$$
Conclude that for a given sequence of partitions P_0 ⊂ ··· ⊂ P_k, the total number of left brackets f_k is bounded by

$$L^{n_k} L^{n_{k-1}} \cdots L^{n_1} \le L^{2C/\varepsilon}.$$

Multiply this with the upper bound on the number of partitions to conclude that the total number of left brackets does not exceed (2L)^{2C/ε}.

A function f : R → R is of bounded variation if it is the distribution


function f (x) = µ(−∞, x] of a signed measure on R. Each such function
can be written as the difference of two monotone functions, corresponding
to the positive and negative parts of the signed measure. The space BV(R)
is the collection of all functions of bounded variation, and it can be normed
with the total variation norm on the corresponding signed measures. It
follows from Theorem 2.7.7 that the Lr (Q)-bracketing entropy of the unit
ball of BV(R) is of the order (1/ε), uniformly in Q. Similar bounds are valid
for the bounded variation functions f : [a, b] → R on a subinterval of R.
It is instructive to relate the bound on the entropy of the class BV1 [a, b]
of functions f : [a, b] → R of variation bounded by 1 to the entropy bounds
on Besov spaces. Because, for h > 0,

$$\int_a^{b-h} \big|f(x+h) - f(x)\big|\,dx = \int_a^{b-h} \Big|\int_{(x,x+h]} df(s)\Big|\,dx \le h \int_a^b |df(s)|,$$

the set BV_1[a, b] is contained in the unit ball of the Besov space B^1_{1,∞}[a, b]. In view of Theorem 2.7.4 the L_τ(λ)-entropy of the latter unit ball is of the order (1/ε) provided that 1 > 1·(1 − 1/τ)_+, i.e. τ < ∞. Thus this theorem
gives the correct bound on the entropy of the set of monotone functions,
albeit not the bracketing entropy. (The restriction to Lebesgue measure on
a compact interval is not a real disadvantage of this approach, as the second
paragraph of the proof of Theorem 2.7.7 shows that the general setting can
be easily reduced to this case.)

2.7.4 Closed Convex Sets and Convex Functions


For a pair of subsets C and D of a metric space, the Hausdorff distance is
defined as
$$h(C, D) = \sup_{x\in C} d(x, D) \vee \sup_{x\in D} d(x, C).$$

Restricted to the closed subsets, this defines a metric (which may be infi-
nite). The following lemma gives the entropy of the collection of all com-
pact, convex subsets of a fixed, bounded subset of Rd with respect to the
Hausdorff metric.
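For finite point sets both suprema in this definition can be computed directly from the pairwise distance matrix; a small sketch (our own helper, not from the text):

```python
import numpy as np

def hausdorff(C, D):
    """Hausdorff distance between finite point sets given as (n, d) arrays."""
    diff = C[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise Euclidean distances
    # sup_{x in C} d(x, D)  v  sup_{x in D} d(x, C)
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

C = np.array([[0.0, 0.0], [1.0, 0.0]])
D = np.array([[0.0, 0.0], [1.0, 1.0]])
print(hausdorff(C, D))  # 1.0
```

For compact convex sets one would apply the same definition with the suprema and infima taken over the (infinite) sets themselves.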

2.7.10 Lemma. For the class C of all compact, convex subsets of a fixed, bounded subset of R^d, with d ≥ 2, one has

$$K_1 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2} \le \log N(\varepsilon, \mathcal{C}, h) \le K_2 \Big(\frac{1}{\varepsilon}\Big)^{(d-1)/2},$$
for constants Ki that depend on d and the bounded set only.

Proof. See Bronštein (1976) or Dudley (1984).



As a consequence we obtain Lr (Q)-bracketing numbers for Lebesgue


absolutely continuous probability measures Q. Combination with the brack-
eting central limit theorem shows that the closed convex sets of the unit
ball in two dimensions form a Donsker class for, for instance, the uniform
measure. For any dimension, this class is Glivenko-Cantelli.
For an extension to unbounded convex sets, see Section 2.10.5.

2.7.11 Corollary. For the class C of all compact, convex subsets of a fixed, bounded subset of R^d with d ≥ 2, one has the entropy bound

$$\log N_{[\,]}\big(\varepsilon, \mathcal{C}, L_r(Q)\big) \le K \Big(\frac{1}{\varepsilon}\Big)^{(d-1)r/2},$$

for every Lebesgue absolutely continuous probability measure Q with


bounded density and a constant K that depends on the bounded set, Q,
and d only.


Proof. For a given set C, let {}^εC = {x: d(x, C^c) > ε} be the points that are at least a distance ε inside C, and let C^ε be the points within distance ε of C. Then there exists a constant K_0 depending only on the fixed bounded set such that the Lebesgue measure λ satisfies

$$\lambda(C^\varepsilon - {}^\varepsilon C) \le K_0\,\varepsilon,$$

for every 0 < ε < 1.†


If h(C, D) < ε, then it must be that {}^εC ⊂ D ⊂ C^ε. Thus if C_1, ..., C_p are the centers of Hausdorff balls of radius ε that cover C, then the pairs [{}^εC_i, C_i^ε] form a class of brackets that cover C. Their sizes in L_r(Q) are bounded by Q(C^ε − {}^εC)^{1/r}, which is bounded by ‖q‖_∞^{1/r} (K_0 ε)^{1/r}. Now the corollary follows from the previous lemma.

Next consider the set of all convex functions f : C → R defined on


a compact, convex subset of Rd , which we assume throughout to have
nonempty interior. We consider the class of such functions that are uni-
formly bounded in Lp , for some p. The entropy of this class turns out to
depend on the shape of C. The nicest case is that C is the convex hull of finitely many points, when the L_r-entropy is of order (1/ε)^{d/2} for any r < p. This covers for instance the unit interval in R, or a rectangle in R^d. However, the constant in the bound increases with the number of extreme points of C, and for general convex sets, such as the Euclidean unit ball, the entropy is (1/ε)^{d/2} only if r < d/(d − 1 + d/p) and deteriorates otherwise.


† This is not trivial, although it is intuitively clear.

2.7.12 Theorem. Let F_p be the class of all convex functions f: C → [0, 1] defined on a compact, convex subset C ⊂ R^d with ‖f‖_p ≤ 1, for some fixed p ∈ [1, ∞]. If C is the closed convex hull of m points, then, for 1 ≤ r < p,

$$\log N\big(\varepsilon, F_p, L_r(\lambda)\big) \le K_d\, m^{\lceil d/2\rceil}\, \lambda(C)^{d/(2r)-d/(2p)} \Big(\frac{1}{\varepsilon}\Big)^{d/2},$$

for a constant K_d that depends on the dimension d only. Furthermore, if C is an arbitrary compact, convex subset of R^d, then, for a constant K that depends on d, p, r and the diameter of C,

$$\log N\big(\varepsilon, F_p, L_r(\lambda)\big) \le \begin{cases} K\big(\frac{1}{\varepsilon}\big)^{d/2}, & \text{if } r < \frac{d}{d-1+d/p},\\[6pt] K\big(\frac{1}{\varepsilon}\big)^{d/2}\big(\log\frac{1}{\varepsilon}\big)^{1+\frac{(p-r)d}{2pr}}, & \text{if } r = \frac{d}{d-1+d/p},\\[6pt] K\big(\frac{1}{\varepsilon}\big)^{\frac{(d-1)pr}{2(p-r)}}, & \text{if } r > \frac{d}{d-1+d/p}. \end{cases}$$

In the last two formulas (p − r)/p and p/(p − r) are read as 1 if p = ∞. In both cases, if p = ∞, then the same bound is valid for the bracketing entropy.

Proof. See Gao and Wellner (2017).

The entropy in the preceding theorem is relative to the L_r-norm for the Lebesgue measure on C. The same bounds are valid for the ‖q‖_∞^{1/r} ε-entropy relative to the L_r(Q)-norm, for any measure Q with a uniformly bounded Lebesgue density q on C.
The theorem implies that the bracketing entropy integral of the classes
F∞ is finite for d ≤ 3 in the case that C is a closed, convex polytope, and
for d ≤ 2 in the case of a general compact, convex domain. (In the latter
case the cut-point for r is d/(d − 1); the bound is (1/ε) times a logarithmic factor if d = 2 and (1/ε)² if d = 3, both times taken in the regime where the bound is worse than (1/ε)^{d/2}.)
The theorem does not provide bounds for the uniform norm. A bound
for r = ∞ is available only if the class of functions is further restricted
to uniformly Lipschitz functions. It can then be derived from the entropy
of the class of their supergraphs, which are convex sets in Rd+1 , for the
Hausdorff metric. The assumption that the functions are Lipschitz is un-
pleasant, though it may be noted that every function f that is convex on C^η for some η > 0 is automatically Lipschitz on C, with Lipschitz constant 2 sup{|f(x)|: x ∈ C^η}/η. (See Problem 2.7.4.) The proof of the following
result is much shorter than the proof of Theorem 2.7.12, although the latter
also uses the device that a convex function is automatically Lipschitz in the
η-interior of its domain.

2.7.13 Corollary. Let F be the class of all convex functions f: C → [0, 1] defined on a compact, convex subset C ⊂ R^d such that |f(x) − f(y)| ≤ L‖x − y‖ for every x, y. Then

$$\log N\big(\varepsilon, \mathcal{F}, \|\cdot\|_\infty\big) \le K_{d,C}\,(1 + L)^{d/2} \Big(\frac{1}{\varepsilon}\Big)^{d/2},$$

for a constant K_{d,C} that depends on the dimension d and C only.

Proof. Let C_f be the supergraph {(x, t): f(x) ≤ t} of a function f. For every pair of points x and y and functions f and g,

$$f(x) - g(x) \le f(x) - g(y) + L\|y - x\|.$$

Fix a point x such that f(x) < g(x). Then the boundary of the supergraph of f is below the supergraph of g at x, and the projection (closest point) of the point (x, f(x)) on the supergraph of g has the form (y, g(y)) for some point y. The distance d((x, f(x)), (y, g(y))) between the two points in R^{d+1} is bounded above by the Hausdorff distance between the two supergraphs. On the other hand, the distance between the two points is bounded below by a multiple of

$$\|x - y\| + \big|f(x) - g(y)\big| \ge (1 + L)^{-1}\big|f(x) - g(x)\big|.$$

Conclude that ‖f − g‖_∞ ≲ (1 + L) h(C_f, C_g). Finally, apply Lemma 2.7.10 to the set of supergraphs intersected with the set C × [0, 1].

2.7.5 Analytic Functions


The entropy of the unit ball in the Hölder or Besov scales decreases below any polynomial in 1/ε as the regularity index α increases indefinitely. Thus the entropy of classes of functions that are infinitely smooth is smaller than any polynomial in 1/ε. The following theorem illustrates that it may be logarithmic in 1/ε.

2.7.14 Theorem. The class A_A[0, 1]^d of all functions f: [0, 1]^d → R that can be extended to an analytic function on the set G = {z ∈ C^d: ‖z − [0, 1]^d‖_∞ < A} with sup_{z∈G} |f(z)| ≤ 1 satisfies, for ε < 1/2 and a constant c_d that depends on d only,

$$\log N\big(\varepsilon, A_A[0,1]^d, \|\cdot\|_\infty\big) \le c_d \Big(\frac{1}{A}\Big)^d \Big(\log\frac{1}{\varepsilon}\Big)^{1+d}.$$

Proof. Let {x_1, ..., x_m} be an A/2-net in [0, 1]^d for the maximum norm, and let [0, 1]^d = ∪_i B_i be a partition in sets obtained by assigning every x ∈

[0, 1]^d to a closest x_i ∈ {x_1, ..., x_m}. Consider the piecewise polynomials P = ∑_{i=1}^m P_{i,a_i} 1_{B_i}, for

$$P_{i,a_i}(x) = \sum_{k.\le n} a_{i,k}\,(x - x_i)^k.$$

Here the sum ranges over all multi-index vectors k = (k_1, ..., k_d) ∈ (N ∪ {0})^d with k. = k_1 + ··· + k_d ≤ n, and for y = (y_1, ..., y_d) ∈ R^d the notation y^k is short for y_1^{k_1} y_2^{k_2} ··· y_d^{k_d}. We obtain a finite set of functions by discretizing the coefficients a_{i,k} for each (i, k) over an ε/A^{k.}-net over the interval [−1/A^{k.}, 1/A^{k.}]. The log cardinality of this set is bounded by

$$\log \prod_i \prod_{k\colon k.\le n} \# a_{i,k} \le m \log \prod_{k\colon k.\le n} \frac{2/A^{k.}}{\varepsilon/A^{k.}} \le m\, n^d \log\Big(\frac{2}{\varepsilon}\Big).$$

We can choose m ≤ (2/A)d . The proof is complete once it is shown that


the resulting set of functions is an ε-net for n of the order log(1/ε).
By the Cauchy formula (d applications of the formula in one dimension
suffice), for C1 , . . . , Cd circles of radius Ā < A in the complex plane around
the coordinates x_{i1} , . . . , x_{id} of x_i, and with D^k the partial derivative of
order k and k! = k1 ! k2 ! · · · kd !,

| D^k f(x_i)/k! | = | (2πi)^{−d} ∮_{C1} · · · ∮_{Cd} f(z)/(z − x_i)^{k+1} dz1 · · · dzd | ≤ 1/Ā^{k.}.

Since Ā < A is arbitrary, also |D^k f(x_i)/k!| ≤ 1/A^{k.}, so that the coefficients
indeed lie in the intervals used in the discretization.


Consequently, for any z ∈ B_i and appropriately chosen a_i,

| ∑_{k. > n} (D^k f(x_i)/k!) (z − x_i)^k | ≤ ∑_{k. > n} (A/2)^{k.} (1/A^{k.}) ≤ ∑_{l=n+1}^∞ l^d/2^l ≲ (3/4)^n,

| ∑_{k. ≤ n} (D^k f(x_i)/k!) (z − x_i)^k − P_{i,a_i}(z) | ≤ ∑_{k. ≤ n} (ε/A^{k.}) (A/2)^{k.} ≤ ε ∑_{l=1}^n l^d/2^l ≲ ε.

We conclude that the piecewise polynomials form a cε-net for n sufficiently
large that (3/4)^n is smaller than ε.
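The mechanism behind this proof can be observed numerically: for an analytic function, piecewise Taylor polynomials of degree n on a fixed net have sup-norm error decaying geometrically in n, so a degree of order log(1/ε) suffices. A minimal sketch (sin is our own illustrative choice of analytic function, not part of the text's construction):

```python
import math

def taylor_coeffs_sin(c, n):
    # Taylor coefficients of sin at the point c, up to degree n;
    # the derivatives of sin cycle with period four
    cyc = [math.sin(c), math.cos(c), -math.sin(c), -math.cos(c)]
    return [cyc[k % 4] / math.factorial(k) for k in range(n + 1)]

def piecewise_taylor_error(n, cells=4, grid=200):
    # expand around the midpoint of each cell of [0, 1] and
    # return the sup-norm error of the piecewise polynomial
    err = 0.0
    for i in range(cells):
        c = (i + 0.5) / cells
        coef = taylor_coeffs_sin(c, n)
        for j in range(grid):
            x = (i + j / (grid - 1)) / cells
            approx = sum(a * (x - c) ** k for k, a in enumerate(coef))
            err = max(err, abs(math.sin(x) - approx))
    return err

# the error decays geometrically in the degree n,
# so degree of order log(1/eps) gives accuracy eps
errs = [piecewise_taylor_error(n) for n in (1, 3, 5, 7)]
```

The geometric decay of `errs` mirrors the (3/4)^n term in the proof; the number of polynomial pieces plays the role of the net {x1, . . . , xm}.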

2.7.6 Classes Lipschitz in a Parameter


A simple, but useful, application of bracketing is to classes of functions
x 7→ ft (x) that are Lipschitz in the index parameter t ∈ T . Suppose that

|fs (x) − ft (x)| ≤ d(s, t) F (x),

for some metric d on the index set, function F on the sample space, and
every x. Then (diam T )F is an envelope function for the class {ft − ft0 : t ∈
T } for any fixed t0 . The bracketing numbers of this class are bounded by
the covering numbers of T .

2.7.15 Theorem. Let F = {ft : t ∈ T } be a class of functions satisfying


the preceding display for every s and t and some fixed function F . Then,
for any norm k · k,

N[ ](2εkF k, F, k · k) ≤ N(ε, T, d).

Proof. Let t1 , . . . , tp be an ε-net for the metric d over T. Then the brackets
[f_{t_i} − εF, f_{t_i} + εF] cover F. They are of size 2εkF k.
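This construction is easy to carry out concretely. A minimal numerical sketch (the family f_t(x) = sin(tx), with d(s, t) = |s − t| and envelope F(x) = |x|, is our own illustrative choice, not from the text): an ε-net t1, . . . , tp of T yields p brackets [f_{t_i} − εF, f_{t_i} + εF] that capture every member of the class.

```python
import numpy as np

# Hypothetical example family: f_t(x) = sin(t*x) on x in [-2, 2].
# It satisfies |f_s(x) - f_t(x)| <= |s - t| * |x|, so d(s,t) = |s-t|, F(x) = |x|.
x = np.linspace(-2.0, 2.0, 401)
F = np.abs(x)

eps = 0.1
# eps-net of the parameter set T = [0, 1]: grid with mesh <= eps
net = np.arange(eps / 2, 1.0, eps)

def f(t):
    return np.sin(t * x)

# check that the brackets [f_ti - eps*F, f_ti + eps*F] cover the class:
# every f_t lies inside the bracket of the closest net point
covered = True
for t in np.linspace(0.0, 1.0, 101):
    ti = net[np.argmin(np.abs(net - t))]
    lower, upper = f(ti) - eps * F, f(ti) + eps * F
    covered &= bool(np.all(lower <= f(t)) and np.all(f(t) <= upper))
```

Because the closest net point is within ε/2 of t, each f_t even sits inside the bracket with room to spare, exactly as in the proof.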

2.7.7 Gaussian Mixtures


For σ > 0 and a probability distribution F on R let pF,σ : R → R be the
mixture density

pF,σ(x) = ∫ (1/σ) φ((x − z)/σ) dF(z),
for φ the density of the standard normal distribution. In this section we
estimate the bracketing entropy of classes of normal mixtures of this form.
We consider these under the L1 -norm relative to the Lebesgue measure on
R and the uniform norm on R.
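For a discrete mixing distribution, pF,σ is just a finite weighted sum of shifted and rescaled normal densities, which makes it easy to compute. A small sketch (the particular F and σ below are arbitrary illustrative choices):

```python
import numpy as np

def phi(u):
    # standard normal density
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def p_mix(x, support, weights, sigma):
    # p_{F,sigma}(x) = integral of (1/sigma) phi((x - z)/sigma) dF(z),
    # for a discrete mixing distribution F
    x = np.atleast_1d(x).astype(float)
    return np.sum(weights * phi((x[:, None] - support) / sigma), axis=1) / sigma

# illustrative mixing distribution on [-a, a] with a = 1, and a scale sigma
support = np.array([-0.8, 0.0, 0.5])
weights = np.array([0.2, 0.5, 0.3])
sigma = 0.7

x = np.linspace(-6.0, 6.0, 2401)
p = p_mix(x, support, weights, sigma)
# total mass by the trapezoid rule; should be close to one
mass = float(np.sum((p[1:] + p[:-1]) / 2 * np.diff(x)))
```

Note the pointwise bound pF,σ ≤ φ(0)/σ, visible in the code as each summand being at most φ(0) weights/σ; bounds of this type drive the envelope constructions later in the section.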
A first theorem concerns mixtures with mixing distribution F re-
stricted to an interval [−a, a].

2.7.16 Theorem. For 0 < b1 < b2 < ∞ with b1 < 1 and a ≥ e, let
Pa,b1 ,b2 be the set of all normal mixtures pF,σ with F ranging over the
probability distributions on [−a, a] and σ ∈ [b1 , b2 ]. Then there exists a
universal constant K such that for d equal to the L1 -norm or the uniform
norm, and 0 < ε < 1/2,

log N[ ](ε, P_{a,b1,b2}, d) ≤ K (a/b1) (log(a b2/(ε b1)))^2.
For fixed b1 , b2 and a the entropy is of the order (log(1/ε))^2. This is
bigger, but not much bigger, than the entropy of finite-dimensional classes.
On the other hand, the entropy bound grows proportionally to 1/b1 (up to
logarithmic factors) as the minimal scale b1 of the mixture components
approaches zero.
The set of mixtures with an arbitrary mixing distribution on R is not
totally bounded relative to the L1 -metric (even if the scale is restricted
to 1) and hence the preceding theorem can only be extended to mixing
distributions with possibly noncompact support if the mixing distribution
is restricted in some other way. We consider mixing distributions with tails
bounded by a given decreasing function A: (0, ∞) → [0, 1].
For Φ the standard normal distribution function, let Ā be the function
defined by Ā(a) = A(a) + ∫_a^∞ A dλ + 1 − Φ(a), and let Ā^{−1} be its inverse.

2.7.17 Theorem. For 0 < b1 < 1 < b2 < ∞ and A a given decreasing
function, let P_{A,b1,b2} be the set of all normal mixtures pF,σ with F ranging
over the probability distributions on R with F([−a, a]^c) ≤ A(a) for all a > 0
and σ ∈ [b1 , b2 ]. Then there exist universal constants c and K such that,
for 0 < ε < 1/4 ∧ A(e),

log N[ ](cε, P_{A,b1,b2}, k · k1) ≤ K (Ā^{−1}(εb1/b2)/b1) (log(Ā^{−1}(εb1/b2) b2/(εb1)))^2.

2.7.18 Example. If the mixing distributions have sub-Gaussian tails, then
we can apply the preceding theorem with A(a) equal to 1 − Φ(a) ≤ φ(a), whence
Ā is bounded by a multiple of the same function. Then Ā^{−1}(ε) is bounded
by a multiple of (log(1/ε))^{1/2} and, for fixed b1 , b2 , the bracketing entropy
is bounded by a multiple of (log(1/ε))^{5/2}. As long as the tails of the mixing
distributions are bounded by a function of the form A(a) = exp(−d a^δ), the
entropy of the set of mixtures increases at most through a power of log(1/ε).
On the other hand, polynomially decreasing tails incur an additional factor
of ε^{−k} in the entropy bound, so that the entropy bound is dominated by
the tails in that case.

The proofs of the preceding theorems are based on approximation by


discrete mixtures with finite support.

2.7.19 Lemma. There exist universal constants C and D such that for
any 0 < ε < 1/2, any a, σ > 0 and any probability measure F on [−a, a]
there exists a discrete probability measure F′ on [−a, a] with fewer than
D(a σ^{−1} ∨ 1) log(1/ε) support points such that kpF,σ − pF′,σ k∞ ≤ Cε/σ.

Proof. We first prove the lemma for a = σ = 1. For ε < 1/2 we have
M := √(8 log(1/ε)) > 2 and hence, for any probability measures F and F′
on [−1, 1],

sup_{|x|≥M} |pF,1(x) − pF′,1(x)| ≤ 2φ(M − 1) ≤ 2φ(M/2) ≤ ε.

The remainder in the Taylor expansion of order N − 1 of the function x ↦ e^x
is bounded above by |x|^N/N! ≤ (e|x|)^N/N^N. Consequently, there exists a
polynomial P_{2N−2} of degree 2N − 2 such that

|φ(x) − P_{2N−2}(x)| ≤ (e x²/2)^N / N^N.

If F and F′ are probability measures on [−1, 1] with the same moments of
orders 1, . . . , 2N − 2, then ∫ P_{2N−2}(x − z) d(F − F′)(z) = 0, and hence

sup_{|x|≤M} |pF,1(x) − pF′,1(x)| ≤ 2 sup_{|x|≤M, |z|≤1} |φ(x − z) − P_{2N−2}(x − z)|.

The argument x − z in the right side is bounded in absolute value by
M + 1 ≤ 3M/2 = √(18 log(1/ε)). Therefore, there exists a universal constant
c such that the left side of the preceding display is bounded above by a
multiple of

(c log(1/ε)/N)^N = exp(−N (log N − log(c log(1/ε)))).
This is bounded by ε for N equal to a sufficiently large multiple of log(1/ε).
We can match 2N − 2 moments with a discrete distribution F′ with
2N − 1 support points. (See Problem 2.7.16.) This completes the proof of
the lemma if a = σ = 1.
We partition a general interval [−a, a] into k = ⌊2a/σ⌋ disjoint, consecutive
subintervals I1 , . . . , Ik of length σ and a final interval I_{k+1} of length
l_{k+1} smaller than σ. We may write F = ∑_{i=1}^{k+1} F(I_i) F_i, where each F_i is a
probability measure concentrated on I_i, and then pF,σ = ∑_{i=1}^{k+1} F(I_i) p_{F_i,σ}.
For ease of notation, let Z_i be a random variable distributed according to
F_i, and for a_i the left end point of I_i, let G_i be the law of W_i = (Z_i − a_i)/σ.
Thus G_i is a probability measure on the interval [0, 1] for i = 1, . . . , k and
on the interval [0, l_{k+1}/σ] ⊂ [0, 1] for i = k + 1.
By the first part of the proof there exist probability measures G′_i on
[0, 1] with N_i ≲ log(1/ε) support points such that kp_{G_i,1} − p_{G′_i,1} k∞ ≲ ε.
For i = k + 1 the measure G′_i can be taken supported on [0, l_{k+1}/σ]. Let F′_i
be the law of a_i + σW′_i if W′_i has law G′_i, and set F′ = ∑_{i=1}^{k+1} F(I_i) F′_i. The
identity p_{F_i,σ}(x) = σ^{−1} p_{G_i,1}((x − a_i)/σ), and similar identities for F′_i and
G′_i, show that kp_{F_i,σ} − p_{F′_i,σ} k∞ = σ^{−1} kp_{G_i,1} − p_{G′_i,1} k∞. Combining this with
the inequality kpF,σ − pF′,σ k∞ ≤ ∑_{i=1}^{k+1} F(I_i) kp_{F_i,σ} − p_{F′_i,σ} k∞, we conclude
that pF′,σ has the required distance to pF,σ. The number of support points
of F′ is bounded by the number of intervals k + 1 times the maximum
number of support points of an F′_i, and hence is bounded as claimed.
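The moment-matching step (Problem 2.7.16) can be made concrete with Gauss–Legendre quadrature: the n-point rule defines a discrete distribution on [−1, 1] that reproduces the moments of the uniform distribution up to order 2n − 1. A sketch under the illustrative choice of the uniform mixing distribution, showing that a small number of support points already gives a very small uniform distance between the mixtures:

```python
import numpy as np

def phi(u):
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def p_uniform(x, m=4001):
    # p_{F,1} for F uniform on [-1, 1], computed by the trapezoid rule
    z = np.linspace(-1.0, 1.0, m)
    vals = phi(x[:, None] - z)
    return np.sum((vals[:, 1:] + vals[:, :-1]) / 2 * np.diff(z), axis=1) / 2

def p_discrete(x, n):
    # discrete mixing distribution from the n-point Gauss-Legendre rule:
    # it matches the moments of the uniform distribution up to order 2n - 1
    nodes, w = np.polynomial.legendre.leggauss(n)
    return np.sum((w / 2) * phi(x[:, None] - nodes), axis=1)

x = np.linspace(-8.0, 8.0, 801)
# sup-norm distance between the continuous and the discrete mixture,
# already tiny for a number of support points of order log(1/eps)
errors = [float(np.max(np.abs(p_uniform(x) - p_discrete(x, n)))) for n in (2, 4, 8)]
```

The rapid decay of `errors` is the numerical face of the lemma: matching a logarithmic number of moments suffices for ε-accuracy in the uniform norm.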

Proof of Theorems 2.7.16 and 2.7.17. For given positive numbers α
and β fix a minimal α-net Σ over the interval [b1 , b2 ], and let F be the set
of discrete probability distributions on [−a, a] with at most N = D(a b1^{−1} ∨
1) log(1/β) support points, for the constant D of Lemma 2.7.19. For every
σ ∈ [b1 , b2 ] there exists σ′ ∈ Σ with |σ − σ′| ≤ α, whence, with φσ the
normal density with scale σ,

kpF,σ − pF,σ′ k∞ ≤ kφσ − φσ′ k∞ ≲ |σ − σ′|/(σ ∧ σ′)² ≲ α/b1².

By Lemma 2.7.19 there exists for every given probability measure F on
[−a, a] an element F′ ∈ F (possibly depending on σ′) such that kpF,σ′ −
pF′,σ′ k∞ ≲ β/b1. Hence

kpF,σ − pF′,σ′ k∞ ≲ α/b1² + β/b1.

Thus the set P of mixtures pF,σ with F ranging over F and σ over Σ is an
ε-net over Pa,b1 ,b2 for the uniform norm, if the expression at the right of
the last display is made to be less than ε.
We next construct a finite net over P by restricting the support points
and weights of the mixing distributions to suitable grids. For a given γ >
0 let S be a γ-net over the N-dimensional simplex for the ℓ1-norm, and
let P′ be the set of all mixtures pF,σ ∈ P such that the (maximally) N
support points of F are among the points {0, ±δ, ±2δ, . . .} ∩ [−a, a] with
weights belonging to S, for given δ > 0. We may project pF,σ ∈ P into
pF′,σ ∈ P′ by first moving each point mass of F to a closest point in the
grid {0, ±δ, ±2δ, . . .}, and next changing the vector of sizes of the point
masses to a closest vector in S. If F = ∑_j p_j δ_{z_j} and F′ = ∑_j p′_j δ_{z′_j} with z′_j
the projection of z_j on the grid, then

|pF,σ(x) − pF′,σ(x)| ≤ ∑_j ( p_j |φσ(x − z_j) − φσ(x − z′_j)| + |p_j − p′_j| φσ(x − z′_j) ).

We conclude that kpF,σ − pF′,σ k∞ ≲ δ kφ′σ k∞ + γ kφσ k∞ ≲ δ/b1² + γ/b1.


There are at most (b2 −b1 )/α ∨1 possible choices of σ ∈ Σ. Each zj0 can
assume at most 2a/δ + 1 different values, for j = 1, . . . , N . The cardinality
of a minimal γ-net S over the N -dimensional unit simplex is bounded by
(5/γ)N for γ ≤ 1 (cf. Problem 2.1.7). Therefore,
b − b  2a N  5 N
2 1
log #P 0 . log ∨1 +1
α δ γ∧1
b − b  a  1 h  2a   5 i
2 1
. log ∨1 + ∨1 log log + 1 + log .
α b1 β δ γ∧1
This number is a bound on the ε-entropy of Pa,b1 ,b2 for the uniform norm
at ε equal to a multiple of (α/b1 )2 + (β/b1 ) + (δ/b21 ) + (γ/b1 ). We choose
α = δ = b21 ε and β = γ = b1 ε to conclude the proof of Theorem 2.7.16 for
the uniform norm.
To extend this to a bound on the L1-bracketing numbers we first note
that

H(x) = (1/b1) φ(x/(2b2)) 1_{|x|>2a} + (1/b1) φ(0) 1_{|x|≤2a}

is an envelope function for P_{a,b1,b2}. Therefore, given an η-net p1 , . . . , pM
over P_{a,b1,b2} for the uniform norm, the brackets [l_i, u_i] defined by l_i =
(p_i − η) ∨ 0 and u_i = (p_i + η) ∧ H cover P_{a,b1,b2}. Because u_i − l_i ≤ (2η) ∧ H,
the size of these brackets in L1 can be bounded by, for any B > 2a,

∫ (u_i − l_i) dλ ≤ 2ηB + ∫_{|x|>B} H(x) dx ≲ ηB + (b2/b1) φ(B/(2b2)).

For B = 2a ∨ √(8 a² b2² log(1/η)) we obtain an upper bound equal to a
multiple of ηB + (b2/b1)η ≤ η a (b2/b1) √(log(1/η)), for η < 1/2 and a >
e. Choose η equal to a small multiple of (εb1/(ab2))/√(log(1/ε)) to render
this smaller than ε. Replacing ε in the upper bound of Theorem 2.7.16 by


this multiple does not change the form of the upper bound, at the cost of
increasing the constant K. This concludes the proof of Theorem 2.7.16.
For any F satisfying the tail bound assumed in Theorem 2.7.17, by
partial integration, for any x > a > 0,

∫_a^∞ φσ(x − z) dF(z) = (1 − F)(a) φσ(x − a) + ∫_a^∞ φ′σ(z − x)(1 − F)(z) dz.

Because the function φ′σ is negative on the positive half line, the integral
on the right is bounded by its restriction to the interval (a, x). If x/2 > a,
then it can be further bounded by

∫_a^{x/2} φ′σ(z − x) dz · A(a) + ∫_{x/2}^x |φ′σ(z − x)| dz · A(x/2) ≲ φσ(x/2) A(a) + A(x/2)/σ.

If x/2 < a < x, then the second integral in this expression alone is a valid
upper bound. For x < a,

∫_a^∞ φσ(x − z) dF(z) ≤ φσ(x − a) F[a, ∞) ≲ φσ(x − a) A(a).

Together with analogous bounds for the left tail, we obtain that, for any x,

∫_{|z|>a} φσ(x − z) dF(z) ≲ (b2/b1) ( φ_{b2}(x − a) + φ_{b2}(x/2) + φ_{b2}(x + a) ) A(a)
+ (A(x/2)/b1) 1_{|x|≥2a}.

Let H denote the appropriate multiple of the right side of this inequality. If
F_a is the renormalized restriction to [−a, a] of a probability measure F, then
F[−a, a] p_{F_a,σ} ≤ pF,σ ≤ F[−a, a] p_{F_a,σ} + H. Consequently, given ε-brackets
[l_i, u_i] that cover P_{a,b1,b2}, there exists, for every (F, σ) as in the definition
of P_{A,b1,b2}, a bracket [l_i, u_i] with l_i (1 − A(a)) ≤ pF,σ ≤ u_i + H. Thus the
brackets [l_i (1 − A(a)), u_i + H] cover P_{A,b1,b2}. The size in L1 of such a bracket
is bounded by ku_i − l_i k1 + kl_i k1 A(a) + kHk1 ≲ ε + Ā(a) b2/b1, as
b1 < 1 < b2 by assumption. Now we choose a such that Ā(a) ≤ b1 ε/b2 and
apply Theorem 2.7.16 to bound the number of brackets [l_i, u_i].

2.7.8 Sets with Smooth Boundaries


Subgraphs of smooth functions f : [0, 1]d−1 → [0, 1] are subsets of [0, 1]d
with smooth boundaries. Their bracketing entropy can easily be obtained
from an entropy bound on the functions themselves.

2.7.20 Corollary. Let C_{α,d} be the collection of subgraphs of C1^α[0, 1]^{d−1}.
There exists a constant K depending only on α and d such that

log N[ ](ε, C_{α,d}, Lr(Q)) ≤ K kqk∞^{(d−1)/α} (1/ε)^{(d−1)r/α},

for every r ≥ 1, ε > 0, and probability measure Q with bounded Lebesgue
density q on R^d.

Proof. Let f1 , . . . , fp be the centers of k · k∞-balls of radius ε that cover
C1^α[0, 1]^{d−1}. For each i, define the sets C_i and D_i as the subgraphs of f_i − ε
and f_i + ε, respectively. Then the pairs [C_i, D_i] form brackets that cover
C_{α,d}. (More precisely, their indicator functions bracket the set of indicator
functions of the sets in C_{α,d}.) Their L1(Q)-size equals

Q(C_i △ D_i) = ∫_{[0,1]^{d−1} × R} 1{f_i(x) − ε ≤ t < f_i(x) + ε} dQ(t, x) ≤ 2ε kqk∞.

The Lr(Q)-size of the brackets is the (1/r)th power of this. It follows that
the bracketing number N[ ]((2εkqk∞)^{1/r}, C_{α,d}, Lr(Q)) is bounded by p. Finally,
apply the previous theorem and make a change of variable.

It follows that the subgraphs of the functions in C1^α[0, 1]^{d−1} are P-
Donsker for Lebesgue-dominated measures P with bounded density, pro-
vided α > d − 1. For instance, for the sets cut out in the plane by functions
f: [0, 1] → [0, 1], a uniform Lipschitz condition of any positive order on the
derivative suffices. The restriction to smooth measures P is natural, as smooth-
ness of the graphs is of little help if the probability mass is distributed in
an erratic manner.
Subgraphs of smooth functions provide a simple way to define sets with
smooth boundaries, but the asymmetry in the arguments is unattractive.
A more natural definition of such sets is to map the unit sphere S d−1 =
{x ∈ Rd : kxk = 1} smoothly into Rd and define a corresponding set as
the “interior” of the range of the map, the latter giving the “boundary” of
the set. For instance, the boundaries of smooth sets in R2 are defined as
images of the unit circle under smooth maps. Unfortunately, it is somewhat
complicated to make this definition precise.
To start we need to define smooth maps f: S^{d−1} → R^d. To this end we
can cover S^{d−1} by finitely many sets (an "atlas") V_i such that each V_i is the
image φ_i(U^{d−1}) of the open unit ball U^{d−1} in R^{d−1} under a C^∞-isomorphism
φ_i: U^{d−1} → V_i. These maps can be constructed to be restrictions of C^∞-
isomorphisms of a neighborhood of the closure of U^{d−1} into S^{d−1}. For given
α, M > 0 we then define G_{α,M,d} to be the collection of all maps f: S^{d−1} →
R^d such that each of the maps f_j ∘ φ_i: U^{d−1} → R, where f_j: S^{d−1} → R is
a coordinate function of f = (f1 , . . . , fd ), is contained in the Hölder ball
C^α_M(U^{d−1}).

From each function f in this class we now define a set I(f ) ⊂ Rd as


the set of points x ∈ Rd that do not belong to the range of f and such
that within the collection of maps g: S d−1 → Rd − {x} the map f is not
homotopic to any (constant) map g (which maps S^{d−1} onto a point y ≠ x).
(Two functions f, g: S d−1 → Rd − {x} are called homotopic if there exists
a continuous map F : [0, 1] × S d−1 → Rd − {x} such that F (0, ·) = f and
F (1, ·) = g.) We interpret I(f ) as the interior of the set with boundary the
range f (S d−1 ) of f . (A point x is on the “outside” of the “closed curve”
f (S d−1 ) if and only if f (S d−1 ) can be continuously contracted to a point
without the contracted curves passing through x.) Finally we define K_{α,M,d}
as the collection of all sets I(f) ∪ f(S^{d−1}) as f ranges over G_{α,M,d}.
The entropies of these collections of sets are of the same order as the
entropies of the sets of subgraphs of Hölder functions considered in the
preceding corollary.

2.7.21 Theorem. Let d ≥ 2, M ≥ 1 and α ≥ 1, let h be the Hausdorff
metric, and let Q be a probability measure with bounded Lebesgue density
q. There exist constants K, K′ depending only on α, M, d, r, and kqk∞
such that, for every r ≥ 1 and ε > 0,

log N(ε, K_{α,M,d}, h) ≤ K (1/ε)^{(d−1)/α},

log N[ ](ε, K_{α,M,d}, Lr(Q)) ≤ K′ (1/ε)^{(d−1)r/α}.

Proof. See Sun and Pyke (1982) or Dudley (1999), Theorem 8.2.16.

Problems and Complements


1. (Lower bound) There exists a constant K depending only on d and α such
that log N(ε, C1^α[0, 1]^d, k · k∞) ≥ K (1/ε)^{d/α} for every ε > 0.
[Hint: See Kolmogorov and Tikhomirov (1961).]

2. Let k · k be a norm on a vector space of real-valued functions. Then given a


subclass F of functions, the entropy numbers of the class M F = {M f : f ∈ F}
satisfy N (ε, M F, k · k) = N (ε/M, F, k · k). The same relationship holds for
bracketing numbers.

3. Let F be a class of measurable functions x 7→ f (x, r) indexed by 0 ≤ r ≤ 1


such that r 7→ f (x, r) is monotone for each x. If the envelope function of F
is square integrable, then the bracketing numbers of F are polynomial.

4. Let ε > 0, C a convex subset of a normed space, and f: C^ε → R a bounded
and convex function. Then |f(x) − f(y)| ≤ 2ε^{−1} kf k_{C^ε} kx − yk for every x, y
in C.
[Hint: Take x ≠ y in C and define z = y + η(y − x)/ky − xk for fixed
η < ε. Then z ∈ C^ε and y is a convex combination of x and z. By convexity,
f(y) ≤ (1 − λ)f(x) + λf(z), whence f(y) − f(x) ≤ λ( f(z) − f(x) ). Calculate
λ.]

5. Let ε > 0, C a bounded, convex subset of R^k, and F a class of convex
functions f: C^ε → R such that {f(x): f ∈ F} is bounded above for every
x ∈ C^ε and bounded below for at least one x ∈ C^ε. Then F is uniformly
bounded and uniformly Lipschitz.
[Hint: The function sup_{f∈F} f(x) is finite and convex. Therefore it is contin-
uous and hence bounded on compact sets. For a lower bound on F suppose
that {f(y): f ∈ F} is bounded below. For any x ∈ C^ε define z as in the
preceding problem. By convexity of f it follows that (for λ depending on x),
f(x) ≥ (1 − λ)^{−1} f(y) − λ(1 − λ)^{−1} f(z), which is uniformly bounded below.]

6. Let ε > 0 and let D be a compact, convex subset of R^k. Then there exists a
finite set D0 ⊂ D^ε and a constant c depending only on ε and D, such that
kf k_D is bounded by c kf k_{D0} for every convex function f: D^ε → R.
[Hint: Since we can cover D by finitely many cubes that are contained in
D^ε, it suffices to bound the supremum norm of f over a given closed cube
C ⊂ D^ε. By convexity, the maximum of f over C is bounded above by
the maximum of |f| over the corners. Next, f can be bounded below by a
supporting hyperplane at an arbitrary point in the interior of C. This takes
the form f(c) + ⟨∇f(c), x − c⟩, which is bounded below by f(c) − k∇f(c)k kx − ck,
and the coordinates of ∇f(c) can be bounded by the maximum of the values
|f(c ± εe_i) − f(c)|/ε.]

7. Let ε > 0, C a bounded, convex subset of a normed space, and F a class
of convex functions f: C^ε → R that is uniformly bounded above and such
that {f(x): f ∈ F} is bounded below for at least one x ∈ C^ε. Then the class
of restrictions to C of functions in F is uniformly bounded and uniformly
Lipschitz.
8. For any a, b > 0 there exists a constant K depending only on (a, b) such that,
for every c > 0,

∑_{j=0}^∞ (c 2^{−aj} ∧ 2^{bj}) ≤ K c^{b/(a+b)}.
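A quick numerical sanity check of this bound (the grid of (a, b, c) values below is an arbitrary illustrative choice): the ratio of the sum to c^{b/(a+b)} stays bounded while c ranges over many orders of magnitude.

```python
def lhs(a, b, c, terms=400):
    # sum_{j >= 0} (c 2^{-aj} ∧ 2^{bj}); beyond the crossover index the
    # summand is c 2^{-aj} and decays geometrically, so a few hundred
    # terms are ample for the values of c used below
    return sum(min(c * 2.0 ** (-a * j), 2.0 ** (b * j)) for j in range(terms))

# the ratio of the sum to c^{b/(a+b)} should stay bounded over many scales of c
ratios = [lhs(a, b, c) / c ** (b / (a + b))
          for a, b in [(0.5, 1.0), (1.0, 1.0), (2.0, 0.5)]
          for c in [1.0, 1e2, 1e4, 1e6, 1e8]]
```

Both series on either side of the crossover index j0 (where c 2^{−aj} = 2^{bj}) are geometric, which is why the ratio is bounded by a constant depending only on a and b.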

9. For any a, b > 0 there exists a constant K depending only on (a, b) such that,
for every c > 0,

∑_{j=0}^∞ (c 2^{−aj} ∧ 2^{bj}) log( (c 2^{−aj} ∨ 2^{bj})/(c 2^{−aj} ∧ 2^{bj}) + 1 ) ≤ K c^{b/(a+b)}.

[Hint: The solution j0 to the equation c 2^{−aj} = 2^{bj} satisfies c 2^{−aj0} = 2^{bj0} =
c^{b/(a+b)}. For the nearest integer these equalities hold up to a constant de-
pending only on a, b. The sum without the "+1" in the logarithm is bounded
by

∑_{j≤j0} 2^{bj} log(c 2^{−(a+b)j}) + ∑_{j>j0} c 2^{−aj} log(2^{(a+b)j}/c)
≲ c^{b/(a+b)} ( ∑_{j≤j0} 2^{−b(j0−j)} log(2^{(a+b)(j0−j)}) + ∑_{j>j0} 2^{−a(j−j0)} log(2^{(a+b)(j−j0)}) ).

The series on the right are bounded by (a + b) log 2 times ∑_{j≥0} j 2^{−bj} and
∑_{j≥0} j 2^{−aj}, respectively.]

10. For any subset C ⊂ N^l,

#C ≤ 4^l sup_{(n1,...,nl)∈C} n1² n2² · · · nl².

[Hint: For given (n1 , . . . , nl ) form a binary string as follows: write each nk
in its binary representation, duplicate every bit in this representation (e.g.
011 becomes 001111), and concatenate the strings with separator 01. This
gives a string of length bounded by

∑_{k=1}^l 2(2 log nk + 1) = 2(2 log(n1 · · · nl)) + 2l.

For N the supremum of these numbers over all (n1 , . . . , nl ) in C, we can
extend this string to a string of length N by padding it with strings 01 at
the end. This recipe yields a one-to-one map of C into {0, 1}^N. Hence the
number of elements of C is bounded by 2^N.]

11. The number of subsets of size smaller than l from a set of size n is bounded
above by (en/l)^l.
[Hint: The number of subsets can be written as 2^n P(X ≤ l) for a binomial
(n, 1/2)-distributed random variable X. By Markov's inequality this is bounded
for any r ≤ 1 by 2^n E r^{X−l} = 2^n r^{−l} (1/2 + r/2)^n. Choose r = l/n and use that
(1 + l/n)^n ≤ e^l.]
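This bound is easy to check numerically against the exact count ∑_{k<l} C(n, k) (the particular values of n below are arbitrary):

```python
import math

def subsets_below(n, l):
    # exact number of subsets of size smaller than l from an n-set
    return sum(math.comb(n, k) for k in range(l))

def bound(n, l):
    # the claimed upper bound (en/l)^l
    return (math.e * n / l) ** l

checks = [(n, l) for n in (10, 50, 200) for l in range(1, n + 1)]
ok = all(subsets_below(n, l) <= bound(n, l) for n, l in checks)
```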

12. For any c > 1 and ε > 0,

(c − 1) ∑_{j∈N∪{0}: c^j ε ≤ 1} c^j ε log(1/(ε c^j)) ≤ c + c log(1/c).

[Hint: The left side is bounded by the integral ∫_0^{c/ε} ε log(1/(εx)) dx = c +
c log(1/c), as can be seen by discretization on the grid points c^j.]

13. For p ≤ q and α ∈ ℓ_p we have kαk_q ≤ kαk_p.
[Hint: Combine kαk∞ ≤ kαk_p with kαk_q ≤ kαk_p^{p/q} kαk∞^{1−p/q}.]

14. For a given bounded, nondecreasing function g: [0, ∞) → [0, ∞) and q > 0 the
norm ( ∫_0^∞ (t^{−α} g(t))^q t^{−1} dt )^{1/q} is equivalent to the ℓ_q-norm of the sequence
(2^{−jα} g(2^j))_{j∈Z}. The latter norms are decreasing in q.

15. If X is convex, then k∆h f kp ≤ khk k∇f kp and k∆h f kp ≤ 2kf kp .


16. (Moment matching) Let ψ1 , . . . , ψN: K → R be continuous functions on a
compact metric space K. For any probability measure F on K, there exists
a discrete probability measure F′ on K with at most N + 1 support points
such that Fψ_j = F′ψ_j for every j = 1, . . . , N.
[Hint: By the geometric form of Jensen's inequality, (Fψ1 , . . . , FψN) is con-
tained in the convex hull of the (compact) set C = {(ψ1(x), . . . , ψN(x)): x ∈
K}. Any element of this convex hull can be written as a convex combination
of at most N + 1 elements of C (see Rudin (1973), page 73).]
2.8
Uniformity in the
Underlying Distribution

The previous chapters present empirical laws of large numbers and cen-
tral limit theorems for observations from a fixed underlying distribution P .
Many of the sufficient conditions given there are actually satisfied by very
large classes of underlying measures; typically, the only limitation is finite-
ness of some appropriate moment of the envelope function. For instance,
classes satisfying the uniform entropy condition are, up to measurability,
Glivenko-Cantelli or Donsker for all P with P ∗ F < ∞ or P ∗ F 2 < ∞, re-
spectively. In particular, many bounded classes of functions are universally
Donsker: Donsker for every probability measure on the sample space.
In this chapter we note that even stronger results are typically true.
For example, not only does the central limit theorem hold for all underlying
measures, or very large classes of measures, it is also valid uniformly in
the underlying measure, when this ranges over large classes of measures.
Essentially, the main empirical limit theorems are valid uniformly in the
underlying measure if the conditions hold in a uniform sense, which appears
to be the case frequently. This type of uniformity is certainly of interest in
statistical applications.

2.8.1 Glivenko-Cantelli Theorems


We start with a uniform Glivenko-Cantelli theorem. Convergence almost
surely to zero of a sequence of random variables Xn is equivalent to con-
vergence in probability of supm≥n |Xm | to zero. The latter characterization
can be used to define “almost sure convergence uniformly in the underlying
measure.” A class F of measurable functions on a measurable space (X , A)

is said to be Glivenko-Cantelli uniformly in P ∈ P for a given class P of
probability measures on (X , A) if

sup_{P∈P} P∗_P ( sup_{m≥n} kPm − P kF > ε ) → 0,

for every ε > 0 as n → ∞.


The set Qn in the statement of the following theorem is the collection
of all possible realizations of empirical measures of n observations.

2.8.1 Theorem. Let F be a P -measurable class of functions on a measur-


able space for every probability measure P in a class P. Suppose that, for
some measurable envelope function F ,
lim_{M→∞} sup_{P∈P} P F{F > M} = 0,

sup_{Q∈Qn} log N(ε kF k_{Q,1}, F, L1(Q)) = o(n), for every ε > 0,

where the supremum is taken over the set Qn of all discrete probability
measures with atoms of size integer multiples of 1/n. Then F is Glivenko-
Cantelli uniformly in P ∈ P.

Proof. For fixed M, let FM be the class of all functions f 1{F ≤ M} as f
ranges over F. Then

kPn − P kF ≤ k(Pn − P) f 1{F ≤ M}kF + Pn F 1{F > M} + P F{F > M}.

By the uniform strong law for real random variables (Proposition A.5.1),
the second term on the right converges uniformly in P almost surely to
P F {F > M }. By the first condition of the theorem, this quantity can be
made arbitrarily small uniformly in P by choosing large M . Conclude that
it suffices to show that the first term on the right, which equals kPn −P kFM ,
converges almost surely to zero uniformly in P for every fixed M .
The class FM satisfies the entropy condition of the theorem relative to
the envelope function M . This follows from the inequality
 
sup_{Q∈Qn} N(εM, FM , L1(Q)) ≤ sup_{n≤k≤2n} sup_{Q∈Qk} N(ε kF k_{Q,1}, F, L1(Q)),

which may be proved as follows. Suppose that k out of the n support points
x1 , . . . , xn of a given measure Q ∈ Qn (with multiple appearance of a given
point permitted) satisfy F (xi ) ≤ M . If Qk ∈ Qk is the discrete measure on
these points, then Q|f {F ≤ M }| = (k/n)Qk |f | for any function f . Since
Qk F ≤ M , it follows that an εQk F -net for F in L1 (Qk ) yields an ε(k/n)M
net for FM in L1 (Q). Thus N εM, FM , L1 (Q) ≤ N εkF kQk ,1 , F, L1 (Qk ) .
Since Qk ⊂ Q2k ⊂ · · ·, the right side is bounded by the right side of the
preceding display. This completes the proof that FM satisfies the entropy
condition of the theorem relative to the envelope function M .

Fix η > 0 and values X1 , . . . , Xn , and take a minimal ηM-net FnX
in FM for the L1(Pn) semimetric. It has just been proved that the cardi-
nality N(ηM, FM , L1(Pn)) of FnX is bounded by a deterministic sequence
Nn(η) (uniformly in X1 , . . . , Xn ) satisfying log Nn(η) = o(n). Let P°n be the
symmetrized empirical measure as defined in Chapter 2.3. Then

kP°n kFM ≤ kP°n kFnX + ηM.

The measurability of F implies the measurability of FM , so that the prob-
ability P_P( kP°n kFM > ε ) can be written as E_{P,X} P_ε( kP°n kFM > ε ). By
Hoeffding's inequality, Lemma 2.2.7 applied for fixed X1 , . . . , Xn ,

E_{P,X} P_ε( kP°n kFnX > ε − ηM ) ≤ E_{P,X} Nn(η) 2 e^{−n(ε − ηM)²/(2M²)}.

The integrand on the right side is independent of X1 , . . . , Xn . Since Nn(η) =
exp o(n), the integrand is bounded by 2 exp(−nε²/4M²) for sufficiently
large n and small η. Conclude that ∑_{m≥n} P( kP°m kFM > ε ) → 0 for ev-
ery ε > 0. In view of the symmetrization Lemma 2.3.7 for probabilities, the
same is true for the empirical measure replacing the symmetrized empirical
measure. This concludes the proof that kPn − P kFM converges to zero outer
almost surely uniformly in P.

Vapnik-Červonenkis classes of sets or functions satisfy

sup_Q log N(ε kF k_{Q,1}, F, L1(Q)) < ∞,

for the supremum taken over all probability measures Q. This trivially im-
plies the entropy condition of the previous theorem. It follows that a suit-
ably measurable VC-class is uniformly Glivenko-Cantelli over any class of
underlying measures for which its envelope function is uniformly integrable.
In particular, a uniformly bounded VC-class is uniformly Glivenko-Cantelli
over all probability measures, provided it satisfies the measurability condi-
tion.
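For the classical VC example, the indicators of half-lines (−∞, t] on the real line, this uniform Glivenko-Cantelli property can be watched in simulation: the discrepancy sup_t |F_n(t) − F(t)| becomes small at a rate that does not depend on which P generated the sample. A rough sketch (the three distributions and the two sample sizes are illustrative choices):

```python
import numpy as np
from math import erf, exp, sqrt

rng = np.random.default_rng(0)

def ks_distance(sample, cdf):
    # sup over half-lines (-inf, t] of |F_n(t) - F(t)|; the supremum is
    # attained at a data point, just before or at the jump of F_n
    x = np.sort(sample)
    n = len(x)
    F = np.array([cdf(v) for v in x])
    return max((np.arange(1, n + 1) / n - F).max(), (F - np.arange(n) / n).max())

# three illustrative choices of the underlying distribution P
dists = [
    (lambda n: rng.normal(size=n), lambda t: (1 + erf(t / sqrt(2))) / 2),
    (lambda n: rng.uniform(size=n), lambda t: min(max(t, 0.0), 1.0)),
    (lambda n: rng.exponential(size=n), lambda t: 1 - exp(-max(t, 0.0))),
]

# worst-case discrepancy over the three P's shrinks as n grows
worst = [max(ks_distance(gen(n), cdf) for gen, cdf in dists) for n in (100, 10000)]
```

This is only a finite collection of P's, of course; the theorem asserts the same behavior uniformly over all probability measures for a bounded, suitably measurable VC-class.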

2.8.2 Donsker Theorems


Next consider uniform in P central limit theorems. Let F be a class of mea-
surable functions f: X → R that is Donsker for every P in a set P of proba-
bility measures on (X , A). Thus the empirical process Gn,P = √n(Pn − P)
converges weakly in ℓ∞(F) to a tight, Borel measurable version of the
Brownian bridge GP. According to Chapter 1.12, this is equivalent to

sup_{h∈BL1} | E∗_P h(Gn,P) − E h(GP) | → 0.

(Here BL1 is the set of all functions h: `∞ (F) → R which are uniformly
bounded by 1 and satisfy |h(z1 ) − h(z2 )| ≤ kz1 − z2 kF .) We shall call F
Donsker uniformly in P ∈ P if this convergence is uniform in P .

The weak convergence Gn,P ⇝ GP under a fixed P is equivalent to
asymptotic equicontinuity of the sequence Gn,P and total boundedness of F
under the seminorm ρP(f) = kf − P f k_{P,2}. It is to be expected that uniform
versions of these two conditions are sufficient for the uniform Donsker prop-
erty. The sequence Gn,P is called asymptotically equicontinuous uniformly
in P ∈ P if, for every ε > 0,

lim_{δ↓0} lim sup_{n→∞} sup_{P∈P} P∗_P ( sup_{ρP(f,g)<δ} |Gn,P(f) − Gn,P(g)| > ε ) = 0.

Furthermore, the class F is ρP-totally bounded uniformly in P ∈ P if its
covering numbers satisfy sup_{P∈P} N(ε, F, ρP) < ∞ for every ε > 0.
These uniform versions of the two conditions actually imply more than
the uniform Donsker property. In the situation for a fixed P , the asymptotic
equicontinuity of the sequence Gn,P has its counterpart in the continuity
of the sample paths of the limit process GP . Requiring the asymptotic
equicontinuity uniformly in P yields, apart from uniform weak convergence,
also uniformity in the continuity of the limit process. The latter is expressed
in the following. The class F is pre-Gaussian uniformly in P ∈ P if the
Brownian bridges satisfy the two conditions

sup_{P∈P} E kGP kF < ∞;   lim_{δ↓0} sup_{P∈P} E sup_{ρP(f,g)<δ} |GP(f) − GP(g)| = 0.

(Some authors include the requirement of uniform pre-Gaussianity in the


concept of a uniform Donsker class.)
Throughout this section it is assumed that the class F possesses a
measurable envelope function F with the property

lim_{M→∞} sup_{P∈P} P F²{F > M} = 0.

Such an envelope function is called square integrable uniformly in P ∈ P.

2.8.2 Theorem. Let F be a class of measurable functions with envelope


function F that is square integrable uniformly in P ∈ P. Then the following
statements are equivalent:
(i) F is Donsker and pre-Gaussian, both uniformly in P ∈ P;
(ii) the sequence Gn,P is asymptotically ρP -equicontinuous uniformly in
P ∈ P and supP ∈P N (ε, F, ρP ) < ∞ for every ε > 0.
Furthermore, uniform pre-Gaussianity implies uniform total boundedness
of F.

Proof. The last assertion is a consequence of Sudakov's minorization
inequality A.2.6 for Gaussian processes, which gives the upper bound
3 E kGP kF for ε √(log N(ε, F, ρP)) for every ε > 0 and P.

(i) ⇒ (ii). For each fixed δ > 0 and P, the truncated modulus function

z ↦ hP(z) = sup_{ρP(f,g)<δ} |z(f) − z(g)| ∧ 1

is contained in 2BL₁. If F is uniformly Donsker, then E*_P h_P(G_{n,P}) → E h_P(G_P) uniformly in P. Given uniform pre-Gaussianity, the expressions E h_P(G_P) can be made uniformly small by choice of a sufficiently small δ. Thus lim sup E*_P h_P(G_{n,P}) is uniformly small for sufficiently small δ. This implies that the sequence G_{n,P} is uniformly asymptotically ρ_P-equicontinuous.
(ii) ⇒ (i). Fix an arbitrary finite subset G of F and P ∈ P. The sequence of random vectors {G_{n,P}(g): g ∈ G} converges weakly to the Gaussian vector {G_P(g): g ∈ G}. By the portmanteau theorem,

    P( sup_{ρ_P(f,g)<δ; f,g∈G} |G_P(f) − G_P(g)| > ε )
        ≤ lim inf_{n→∞} sup_{P∈P} P*_P( sup_{ρ_P(f,g)<δ; f,g∈G} |G_{n,P}(f) − G_{n,P}(g)| > ε ),

for every δ > 0. On the right side G can be replaced by F. By assumption the
resulting expression can be made arbitrarily small by choice of δ. Conclude
that, for every ε, η > 0, there exists δ > 0 such that the left side of the
display is smaller than η for every P . Let G increase to a ρP -countable dense
subset of F, and use the continuity of the sample paths of GP to see that
the same statement is true (with the same ε, η, and δ) with G replaced by
F.
This gives a probability form of the second requirement of uniform pre-Gaussianity. The expectation form follows with the help of Borell's inequality A.2.1, which allows one to bound the moment E‖G_P‖_{F_δ} by a universal constant times the median of ‖G_P‖_{F_δ} (Problem A.2.2).
Fix δ > 0. By the uniform total boundedness, there exists for each P a δ-net G_P for ρ_P over F such that the cardinalities of the G_P are uniformly bounded by a constant k. Then

    ‖G_P‖_F ≤ ‖G_P‖_{G_P} + sup_{ρ_P(f,g)<δ} |G_P(f) − G_P(g)|.
By the result of the preceding paragraph, the expectation of the second term on the right can be made arbitrarily small uniformly in P by choice of δ; it is certainly finite. The first term on the right is a maximum over at most k Gaussian random variables, each having mean zero and variance not exceeding sup_{P∈P} P F² < ∞. Thus E‖G_P‖_{G_P} is uniformly bounded. This concludes the proof of uniform pre-Gaussianity.
For each P let Π_P: F → G_P map each f into a ρ_P-closest element of G_P. Write z ∘ Π_P for the discretized element of ℓ^∞(F) taking the value z(Π_P f) at each f ∈ F. By the uniform central limit theorem for R^k, the random vectors {G_{n,P}(f): f ∈ G_P} in R^k (with zero coordinates added if |G_P| < k) converge weakly to the random vectors {G_P(f): f ∈ G_P} uniformly in P ∈ P (Proposition A.5.2). This implies that

    sup_{h∈BL₁} |E*_P h(G_{n,P} ∘ Π_P) − E h(G_P ∘ Π_P)| → 0.

Next, since every h ∈ BL₁ satisfies the inequality |h(z₁) − h(z₂)| ≤ 2 ∧ ‖z₁ − z₂‖_F, we have, for every ε > 0,

    sup_{h∈BL₁} |E*_P h(G_{n,P} ∘ Π_P) − E*_P h(G_{n,P})| ≤ ε + 2 P*_P( ‖G_{n,P} ∘ Π_P − G_{n,P}‖_F > ε ).

By construction of G_P, the random variable ‖G_{n,P} ∘ Π_P − G_{n,P}‖_F is bounded by the modulus of continuity ‖G_{n,P}‖_{F_δ} of the process G_{n,P}. In view of the uniform asymptotic equicontinuity, lim sup P*_P( ‖G_{n,P} ∘ Π_P − G_{n,P}‖_F > ε ) can be made arbitrarily small by choice of δ, uniformly in P, for every fixed ε. Thus the lim sup of the left side of the last display can be made arbitrarily small by choice of δ.
Using the uniform pre-Gaussianity, we can obtain the analogous result for the limit processes:

    sup_{h∈BL₁} |E h(G_P ∘ Π_P) − E h(G_P)| → 0,   as δ ↓ 0.

Combination of the last three displayed equations shows that E*_P h(G_{n,P}) converges to E h(G_P) uniformly in h ∈ BL₁ and P.

In the situation that P consists of all probability measures on the underlying measurable space, there is an interesting addendum to the previous theorem. In that case, the uniform pre-Gaussianity implies the uniform Donsker property (under suitable measurability conditions on F). We do not reproduce the proof of this result here. (See the Notes for further comments and references.) Instead we derive the uniform versions of the two main empirical limit theorems: the central limit theorem under the uniform entropy condition and the central limit theorem with bracketing.
The uniform entropy condition (2.5.1) implies the uniform Donsker property.

2.8.3 Theorem. Let F be a class of measurable functions with measurable envelope function F such that F_{δ,P} = {f − g: f, g ∈ F, ‖f − g‖_{P,2} < δ} and F²_∞ = {(f − g)²: f, g ∈ F} are P-measurable for every δ > 0 and P ∈ P. Furthermore, suppose that, as M → ∞,

    sup_{P∈P} P F²{F > M} → 0,

and that

    ∫_0^∞ sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε < ∞,

where Q ranges over all finitely discrete probability measures. Then F is


Donsker and pre-Gaussian uniformly in P ∈ P.

Proof. It may be assumed that the envelope function satisfies F ≥ 1, because replacing a given F by F + 1 does not affect the uniform integrability and makes the uniform entropy condition weaker. Given an arbitrary sequence δ_n ↓ 0, set

    θ²_{n,P} = ‖ n^{-1} Σ_{i=1}^n f²(X_i) ‖_{F_{δ_n,P}}.

Exactly as in the proof of Theorem 2.5.2, it can be shown that for some universal constant K,

    P*_P( ‖G_{n,P}‖_{F_{δ_n,P}} > ε )
        ≤ (K/ε²) E*_P [ ∫_0^{θ_{n,P}/‖F‖_n} sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε ]² E_P‖F‖²_n.

Here ‖F‖_n ≥ 1 and E_P‖F‖²_n = P F² is uniformly bounded in P. Conclude that the right side is up to a constant uniformly bounded by

    [ ∫_0^η sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε ]² + sup_{P∈P} P*_P( θ_{n,P} > η ),

for every η > 0. To establish uniform asymptotic equicontinuity, it suffices


to show that the second term converges to zero for all η. This follows as in
the proof of Theorem 2.5.2, except that presently we use the uniform law
of large numbers, Theorem 2.8.1.
By assumption, supQ N (ε, F, ρQ ) is finite for every ε > 0 if Q ranges
over all finitely discrete measures. That the supremum is still finite if Q
also ranges over P can be proved as in the proof of Theorem 2.5.2.
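As a numerical aside (not part of the text's argument), once a covering-number bound is available the uniform entropy condition reduces to checking finiteness of a one-dimensional integral. The sketch below assumes a hypothetical polynomial bound sup_Q N(ε‖F‖_{Q,2}, F, L₂(Q)) ≤ (3/ε)², of the kind valid for VC-type classes, and evaluates the entropy integral by a midpoint rule:

```python
import numpy as np

# Assumed polynomial covering bound of VC type (illustrative only):
# sup_Q N(eps * ||F||_{Q,2}, F, L2(Q)) <= (3 / eps)^2 for 0 < eps <= 1.
def root_entropy(eps):
    return np.sqrt(np.log((3.0 / eps) ** 2))

# Midpoint rule on (0, 1]; the integrand blows up at 0 but is integrable.
m = 200_000
h = 1.0 / m
mid = (np.arange(m) + 0.5) * h
integral = float(np.sum(root_entropy(mid)) * h)

print(f"entropy integral ~ {integral:.3f}")  # about 2.0, in particular finite
assert np.isfinite(integral)
```

A finite value of the integral is exactly condition (2.5.1); for covering numbers growing faster than polynomially in 1/ε the same computation can diverge.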

The uniform entropy condition requires the supremum over all proba-
bility measures of the root entropy to be integrable. In terms of entropy with
bracketing, it suffices to consider the supremum over the class of interest.

2.8.4 Theorem. Let F be a class of measurable functions such that

    lim_{M→∞} sup_{P∈P} P F²{F > M} = 0,
    ∫_0^∞ sup_{P∈P} √(log N_{[ ]}(ε‖F‖_{P,2}, F, L₂(P))) dε < ∞.

Then F is Donsker and pre-Gaussian uniformly in P ∈ P.

Proof. The first condition implies that sup_P ‖F‖_{P,2} < ∞. Finiteness of the integral implies finiteness of the integrand. Therefore,

    sup_{P∈P} N_{[ ]}(ε‖F‖_{P,2}, F, L₂(P)) < ∞,

and F is totally bounded uniformly in P. Uniform asymptotic equicontinuity follows by making the proof of Theorem 2.5.6 uniform in P. Actually, the proof of this theorem contains the essence of the proof of the maximal inequality given by Theorem 2.14.15, which implies that

    E*_P ‖G_{n,P}‖_{F_{δ,P}} ≲ ∫_0^δ √(log N_{[ ]}(ε, F, L₂(P))) dε + P F²{F > √n b(δ, P)} / b(δ, P).

Here

    b(δ, P) = δ / √(1 + log N_{[ ]}(δ, F, L₂(P))).

The right side of the first display converges to zero uniformly in P as n → ∞ followed by δ ↓ 0.

2.8.3 Central Limit Theorem Under Sequences

As an application of the uniform central limit theorem, consider the empirical process based on triangular arrays. For each n, let X_{n1}, ..., X_{nn} be i.i.d. according to a probability measure P_n, and set

    ℙ_n = n^{-1} Σ_{i=1}^n δ_{X_{ni}}.

If the sequence of underlying measures P_n converges to a measure P_0 in a suitable sense, then we may hope to derive that the sequence G_{n,P_n} = √n(ℙ_n − P_n) converges in distribution to G_{P_0} in ℓ^∞(F). According to the general results on weak convergence to Gaussian processes, this is equivalent to marginal convergence plus F being totally bounded and the sequence G_{n,P_n} being asymptotically uniformly equicontinuous, both with respect to ρ_{P_0}. Suppose that the ρ_{P_n} semimetrics converge uniformly to ρ_{P_0} in the sense that

(2.8.5)    sup_{f,g∈F} |ρ_{P_n}(f, g) − ρ_{P_0}(f, g)| → 0.

Then asymptotic equicontinuity for ρ_{P_0} follows from asymptotic equicontinuity "for the sequence ρ_{P_n}," defined as

    lim_{δ↓0} lim sup_{n→∞} P*_{P_n}( sup_{ρ_{P_n}(f,g)<δ} |G_{n,P_n}(f) − G_{n,P_n}(g)| > ε ) = 0.

In particular, it suffices that F be asymptotically equicontinuous uniformly in the set {P_m}. This is in turn implied by F being Donsker and pre-Gaussian uniformly in {P_m}.

For marginal convergence, we can apply the Lindeberg central limit theorem. Pointwise convergence of ρ_{P_n} to ρ_{P_0} means exactly that the covariance functions of the G_{n,P_n} converge to the desired limit. The Lindeberg condition for the sequence G_{n,P_n} f is certainly implied by

(2.8.6)    lim_{n→∞} P_n F²{F ≥ ε√n} = 0,   for every ε > 0.

One possible conclusion is that F being uniformly Donsker and pre-Gaussian in the sequence {P_m}, together with (2.8.5) and (2.8.6), is sufficient for the convergence G_{n,P_n} ⇝ G_{P_0}. It is of interest that under the same conditions the Gaussian limit distributions are continuous in the underlying measure.

2.8.7 Lemma. Let F be Donsker and pre-Gaussian uniformly in the sequence {P_m}, and let (2.8.5) and (2.8.6) hold. Then G_{n,P_n} ⇝ G_{P_0} in ℓ^∞(F).

2.8.8 Lemma. Let F be pre-Gaussian uniformly in the sequence {P_m}, and let (2.8.5) hold. Then G_{P_n} ⇝ G_{P_0} in ℓ^∞(F).

Proofs. The first lemma was argued earlier (under slightly weaker conditions). (Inspection of the proof shows that the implication (i) ⇒ (ii) of Theorem 2.8.2 is valid without uniform integrability of the envelope.)
For the second lemma, first note that pointwise convergence of the covariance functions and zero-mean normality of each G_P imply marginal convergence. The uniform pre-Gaussianity gives

    lim_{δ↓0} lim sup_{n→∞} P_{P_n}( sup_{ρ_{P_n}(f,g)<δ} |G_{P_n}(f) − G_{P_n}(g)| > ε ) = 0.

By the uniform convergence of ρ_{P_n} to ρ_{P_0}, this is then also true with ρ_{P_n} replaced by ρ_{P_0}. The uniform pre-Gaussianity implies uniform total boundedness of F by Theorem 2.8.2, which together with (2.8.5) implies total boundedness for ρ_{P_0}. Finally, apply Theorems 1.5.4 and 1.5.7.

The uniform Donsker and pre-Gaussian property in the first lemma


could be established by the uniform entropy condition or by a bracketing
condition. For easy reference we formulate two easily applicable Donsker
theorems for sequences. They are proved as the uniform Donsker theo-
rems 2.8.3 and 2.8.4. Alternatively, compare Theorems 2.11.1 and 2.11.9 or
Examples 2.14.14 and 2.14.19.

2.8.9 Theorem. Let F be a class of measurable functions with a measurable envelope function F such that F_{δ,P_n} = {f − g: f, g ∈ F, ‖f − g‖_{P_n,2} < δ} and F²_∞ = {(f − g)²: f, g ∈ F} are P_n-measurable for every δ > 0 and n. Furthermore, suppose that F satisfies the uniform entropy condition, that P_n F² = O(1), and that (2.8.5) and (2.8.6) hold. Then G_{n,P_n} ⇝ G_{P_0} in ℓ^∞(F).

2.8.10 Theorem. Let F be a class of measurable functions, with a measurable envelope function F, that is totally bounded for ρ_{P_0}, satisfies (2.8.5) and (2.8.6), and is such that

    ∫_0^{δ_n} √(log N_{[ ]}(ε, F, L₂(P_n))) dε → 0,   for every δ_n ↓ 0.

Then G_{n,P_n} ⇝ G_{P_0} in ℓ^∞(F).

Problems and Complements

1. Suppose ρ_{P_n} → ρ_{P_0} uniformly on F × F. Then F is totally bounded for ρ_{P_0} if and only if, for every ε > 0, there exists an N such that sup_{n≥N} N(ε, F, ρ_{P_n}) < ∞.
2.9
Multiplier Central Limit Theorems

With the notation Z_i = δ_{X_i} − P, the empirical central limit theorem can be written

    n^{-1/2} Σ_{i=1}^n Z_i ⇝ G,

in ℓ^∞(F), where G is a (tight) Brownian bridge. Given i.i.d. real-valued random variables ξ₁, ..., ξ_n, which are independent of Z₁, ..., Z_n, the multiplier central limit theorem asserts that

    n^{-1/2} Σ_{i=1}^n ξ_i Z_i ⇝ G.

If the ξi have mean zero, have variance 1, and satisfy a moment condition,
then the multiplier central limit theorem is true if and only if F is Donsker:
in that case, the two displays are equivalent.
A more refined and deeper result is the conditional multiplier central limit theorem, according to which

    n^{-1/2} Σ_{i=1}^n ξ_i Z_i ⇝ G,

given almost every sequence Z1 , Z2 , . . . . This more interesting assertion


turns out to be true under only slightly stronger conditions. Next to this
almost sure version, we also discuss a version in probability of the condi-
tional central limit theorem.

The techniques developed in this chapter have applications to statisti-


cal problems, especially to the study of various ways of bootstrapping the
empirical process (see Chapter 3.7). The results are also used in the proofs
of Chapter 2.10.
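The simplest instance of this application is the multiplier (or "wild") bootstrap of a sample mean. The sketch below is our own illustration, not the construction of Chapter 3.7; as is usual in practice, it recenters at the empirical mean, so that conditionally on the data the draws are exactly normal with the sample variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, boot = 500, 2000

# One observed sample; with f the identity, the multiplier process evaluated at f
# is n^{-1/2} sum_i xi_i (X_i - mean), approximating the law of sqrt(n)(mean - mu).
x = rng.exponential(scale=1.0, size=n)   # true standard deviation equals 1
xi = rng.standard_normal((boot, n))      # mean-zero, variance-one multipliers
draws = (xi * (x - x.mean())).sum(axis=1) / np.sqrt(n)

# Conditionally on the data, draws ~ N(0, sample variance), which estimates
# the N(0, 1) limit distribution of sqrt(n)(mean - 1).
print(f"bootstrap sd ~ {draws.std():.3f} (target 1)")
```

The conditional theorems below (Theorems 2.9.6 and 2.9.7) are what justify reading these conditional draws as an approximation of the limit law, uniformly over a Donsker class rather than a single f.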
It is unnecessary to make measurability assumptions on F. How-
ever, as usual, outer expectations are understood relative to a product
space. Throughout the section let X1 , X2 , . . . be defined as the coordi-
nate projections on the “first” ∞ coordinates in the probability space
(X ∞ × Z, A∞ × C, P ∞ × Q), and let ξ1 , ξ2 , . . . depend on the last coor-
dinate only.
The (unconditional) multiplier central limit theorem is a corollary of a symmetrization inequality, which complements the symmetrization inequalities for Rademacher variables obtained in Chapter 2.3. For a random variable ξ, set

    ‖ξ‖_{2,1} = ∫_0^∞ √(P(|ξ| > x)) dx.

In spite of the notation, this is not a norm (but there exists a norm that is equivalent to ‖·‖_{2,1}). Finiteness of ‖ξ‖_{2,1} requires slightly more than a finite second moment, but it is implied by a finite 2 + ε absolute moment (Problem 2.9.1).
In Chapter 2.3 it is shown that the norms of a given process Σ Z_i and its symmetrized version Σ ε_i Z_i are comparable in magnitude. The following lemma extends this comparison to the norm of a general multiplier process Σ ξ_i Z_i.
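As a numerical aside (our own illustration), ‖ξ‖_{2,1} can be computed directly from this definition. For a standard normal multiplier, P(|ξ| > x) = erfc(x/√2), and the value can be checked against the bounds (1/2)‖ξ‖₂ ≤ ‖ξ‖_{2,1} ≤ (r/(r − 2))‖ξ‖_r of Problem 2.9.1:

```python
import math
import numpy as np

# ||xi||_{2,1} = int_0^infty sqrt(P(|xi| > x)) dx for xi ~ N(0, 1);
# the two-sided normal tail is P(|xi| > x) = erfc(x / sqrt(2)).
xs = np.linspace(0.0, 12.0, 240_000)  # the tail is negligible beyond x = 12
tail = np.array([math.erfc(v / math.sqrt(2.0)) for v in xs])
norm21 = float(np.sum(np.sqrt(tail)) * (xs[1] - xs[0]))  # Riemann sum

# Problem 2.9.1 with r = 3: (1/2)||xi||_2 <= ||xi||_{2,1} <= 3 ||xi||_3,
# where ||xi||_3 = (E|xi|^3)^(1/3) and E|xi|^3 = 2 sqrt(2/pi) for N(0, 1).
norm3 = (2.0 * math.sqrt(2.0 / math.pi)) ** (1.0 / 3.0)
assert 0.5 <= norm21 <= 3.0 * norm3
print(f"||xi||_2,1 ~ {norm21:.3f}")  # roughly 1.3
```

So for Gaussian multipliers the moment condition of the theorems below is comfortably satisfied.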

2.9.1 Lemma (Multiplier inequalities). Let Z₁, ..., Z_n be i.i.d. stochastic processes with E*‖Z_i‖_F < ∞, independent of the Rademacher variables ε₁, ..., ε_n. Then for every i.i.d. sample ξ₁, ..., ξ_n of mean-zero random variables independent of Z₁, ..., Z_n, and any 1 ≤ n₀ ≤ n,

    (1/2)‖ξ‖₁ E*‖ n^{-1/2} Σ_{i=1}^n ε_i Z_i ‖_F ≤ E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_F
        ≤ 2(n₀ − 1) E*‖Z₁‖_F E max_{1≤i≤n} |ξ_i|/√n + 2√2 ‖ξ‖_{2,1} max_{n₀≤k≤n} E*‖ k^{-1/2} Σ_{i=n₀}^k ε_i Z_i ‖_F.

For symmetrically distributed variables ξ_i, the constants 1/2, 2, and 2√2 can all be replaced by 1.

Proof. Define ε₁, ..., ε_n as independent of ξ₁, ..., ξ_n (on "their own" factor of a product probability space). If the ξ_i are symmetrically distributed, then the variables ε_i|ξ_i| possess the same distribution as the ξ_i. In that case, the inequality on the left follows from

    E*‖ Σ_{i=1}^n ε_i E_ξ|ξ_i| Z_i ‖_F ≤ E*‖ Σ_{i=1}^n ε_i |ξ_i| Z_i ‖_F.

For the general case, let η₁, ..., η_n be an independent copy of ξ₁, ..., ξ_n. Then ‖ξ_i‖₁ = E|ξ_i − Eη_i| ≤ ‖ξ_i − η_i‖₁, so that ‖ξ_i‖₁ can be replaced by ‖ξ_i − η_i‖₁ in the left-hand side. Next, apply the inequality for symmetric variables to the variables ξ_i − η_i, and then use the triangle inequality to see that

    E*‖ Σ_{i=1}^n (ξ_i − η_i) Z_i ‖_F ≤ 2 E*‖ Σ_{i=1}^n ξ_i Z_i ‖_F.

This concludes the proof of the inequality on the left.
Again assume that the ξ_i are symmetrically distributed. Let ξ̃₁ ≥ ··· ≥ ξ̃_n be the (reversed) order statistics of |ξ₁|, ..., |ξ_n|. By the definition of Z₁, ..., Z_n as fixed functions of the coordinates on the product space (X^n, B^n), it follows that for any fixed ξ₁, ..., ξ_n,

    E_ε E*_Z ‖ Σ_{i=1}^n ε_i|ξ_i| Z_i ‖_F = E_ε E*_Z ‖ Σ_{i=1}^n ε_i ξ̃_i Z_i ‖_F.

In view of Lemma 1.2.7, the joint outer expectation E* can be replaced by E_ξ E*_Z. Thus

    E*‖ Σ_{i=1}^n ξ_i Z_i ‖_F = E_{ξ,ε} E*_Z ‖ Σ_{i=1}^n ε_i|ξ_i| Z_i ‖_F = E_{ξ,ε} E*_Z ‖ Σ_{i=1}^n ε_i ξ̃_i Z_i ‖_F
        ≤ (n₀ − 1) E ξ̃₁ E*‖Z₁‖_F + E*‖ Σ_{i=n₀}^n ε_i ξ̃_i Z_i ‖_F.

Substitute ξ̃_i = Σ_{k=i}^n (ξ̃_k − ξ̃_{k+1}) in the second term (with ξ̃_{n+1} = 0) and change the order of summation in the resulting double sum to see that it is equal to

    E*‖ Σ_{i=n₀}^n ε_i ξ̃_i Z_i ‖_F = E*‖ Σ_{k=n₀}^n (ξ̃_k − ξ̃_{k+1}) Σ_{i=n₀}^k ε_i Z_i ‖_F
        ≤ E Σ_{k=n₀}^n √k (ξ̃_k − ξ̃_{k+1}) max_{n₀≤k≤n} E*‖ k^{-1/2} Σ_{i=n₀}^k ε_i Z_i ‖_F.

For ξ̃_{k+1} < t ≤ ξ̃_k, one has k = #(i: |ξ_i| ≥ t). Thus the first expectation in the product can be written as

    E Σ_{k=n₀}^n ∫_{ξ̃_{k+1}}^{ξ̃_k} √k dt ≤ E ∫_0^∞ √(#(i: |ξ_i| ≥ t)) dt ≤ ∫_0^∞ √(n P(|ξ_i| ≥ t)) dt,

by Jensen's inequality. This concludes the proof of the upper bound for symmetric variables.
For possibly asymmetric variables, first note that

    E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_F ≤ E*‖ n^{-1/2} Σ_{i=1}^n (ξ_i − η_i) Z_i ‖_F.

Next apply the bound for symmetric variables to the variables ξ_i − η_i, and then use the triangle inequality and the "corrected" triangle inequality ‖ξ − η‖_{2,1} ≤ 2√2 ‖ξ‖_{2,1} (Problem 2.9.2).

For n₀ = 1, the preceding lemma gives the inequalities

    (1/2)‖ξ‖₁ E*‖ n^{-1/2} Σ_{i=1}^n ε_i Z_i ‖_F ≤ E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_F
        ≤ 2√2 ‖ξ‖_{2,1} max_{1≤k≤n} E*‖ k^{-1/2} Σ_{i=1}^k ε_i Z_i ‖_F.

For variables ξ_i with range [−1, 1], the maximum on the right side may be replaced by its nth term alone (Problem 2.9.3), but this appears to be untrue in general. This simpler inequality obtained for n₀ = 1 is too weak to yield the following theorem.
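The left inequality above is easy to probe by simulation. The sketch below uses a toy three-function class and standard normal multipliers (all choices are ours, not from the text); for a standard normal, ‖ξ‖₁ = √(2/π):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 4000

# Toy processes Z_i(f) = f(X_i) - Pf, for X_i ~ U(0, 1) and F = {x, x^2, 1[x > 1/2]}.
def draw_Z():
    x = rng.random(n)
    return np.stack([x - 0.5, x**2 - 1.0 / 3.0, (x > 0.5) - 0.5])  # shape (|F|, n)

def mean_sup_norm(multipliers):
    # Monte Carlo estimate of E sup_f | n^{-1/2} sum_i w_i Z_i(f) |.
    tot = 0.0
    for _ in range(reps):
        tot += np.abs(draw_Z() @ multipliers()).max() / np.sqrt(n)
    return tot / reps

rademacher = lambda: rng.integers(0, 2, n) * 2.0 - 1.0
gaussian = lambda: rng.standard_normal(n)

lhs = 0.5 * np.sqrt(2.0 / np.pi) * mean_sup_norm(rademacher)  # (1/2)||xi||_1 E||sum eps_i Z_i||
rhs = mean_sup_norm(gaussian)                                  # E||sum xi_i Z_i||
print(f"{lhs:.3f} <= {rhs:.3f}")
assert lhs <= rhs
```

For Gaussian multipliers the gap is in fact large, consistent with the Gaussian process dominating the Rademacher one.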

2.9.2 Theorem. Let F be a class of measurable functions. Let ξ₁, ..., ξ_n be i.i.d. random variables with mean zero, variance 1, and ‖ξ‖_{2,1} < ∞, independent of X₁, ..., X_n. Then the sequence n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) converges to a tight limit process in ℓ^∞(F) if and only if F is Donsker. In that case, the limit process is a P-Brownian bridge.

Proof. Since both the empirical processes n^{-1/2} Σ_{i=1}^n (δ_{X_i} − P) and the multiplier processes n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) do not change if indexed by the class of functions {f − Pf: f ∈ F} instead of F, it may be assumed without loss of generality that Pf = 0 for every f. Marginal convergence of both sequences of processes is equivalent to F ⊂ L₂(P). It suffices to show that the asymptotic equicontinuity conditions for the empirical and the multiplier processes are equivalent.
If F is Donsker, then P*(F > x) = o(x^{-2}) as x → ∞ by Lemma 2.3.9. By the same lemma convergence of the multiplier processes to a tight limit implies that P*(|ξ|F > x) = o(x^{-2}). In particular, P*F < ∞ in both cases. Since the assumption ‖ξ‖_{2,1} < ∞ implies the existence of a second moment, we have E max_{1≤i≤n} |ξ_i|/√n → 0. Combination with the multiplier inequalities, Lemma 2.9.1, gives

    (1/2)‖ξ‖₁ lim sup_{n→∞} E*‖ n^{-1/2} Σ_{i=1}^n ε_i Z_i ‖_{F_δ} ≤ lim sup_{n→∞} E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_{F_δ}
        ≤ 2√2 ‖ξ‖_{2,1} sup_{n₀≤k} E*‖ k^{-1/2} Σ_{i=1}^k ε_i Z_i ‖_{F_δ},

for every n₀ and δ > 0. By Lemma 2.3.6, the Rademacher variables can be deleted in this statement at the cost of changing the constants. Conclude that E*‖ n^{-1/2} Σ_{i=1}^n Z_i ‖_{F_{δ_n}} → 0 if and only if E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_{F_{δ_n}} → 0. These are the mean versions of the asymptotic equicontinuity conditions. By Lemma 2.3.11, these are equivalent to the probability versions.

Under the conditions of the preceding theorem, the sequence

    ( n^{-1/2} Σ_{i=1}^n (δ_{X_i} − P),  n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) )

is jointly asymptotically tight. Since the two coordinates are uncorrelated and the joint marginals converge to multivariate normal distributions, the sequence converges jointly to a vector of two independent Brownian bridges.

2.9.3 Corollary. Under the conditions of the preceding theorem, the sequence of processes ( n^{-1/2} Σ_{i=1}^n (δ_{X_i} − P), n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) ) converges in ℓ^∞(F) × ℓ^∞(F) in distribution to a vector (G, G′) of independent (tight) Brownian bridges G and G′.

An additional corollary applies when the multipliers ξ_i do not have mean zero. Let Eξ_i = μ and var ξ_i = σ². Simple algebra gives

    n^{-1/2} Σ_{i=1}^n (ξ_i δ_{X_i} − μP)
        = (μ/√n) Σ_{i=1}^n (δ_{X_i} − P) + (σ/√n) Σ_{i=1}^n ((ξ_i − μ)/σ)(δ_{X_i} − P) + √n(ξ̄_n − μ)P.

The first two terms on the right converge weakly if and only if F is Donsker; the third is in ℓ^∞(F) if and only if ‖P‖_F < ∞; in that case, it converges.

2.9.4 Corollary. Let F be Donsker with ‖P‖_F < ∞. Let ξ₁, ..., ξ_n be i.i.d. random variables with mean μ, variance σ², and ‖ξ_i‖_{2,1} < ∞, independent of X₁, ..., X_n. Then

    n^{-1/2} Σ_{i=1}^n (ξ_i δ_{X_i} − μP) ⇝ μG + σG′ + σZ P,

where G and G′ are independent (tight) Brownian bridges and are independent of the random variable Z, which is standard normally distributed. The limit process μG + σG′ + σZP is a Gaussian process with zero mean and covariance function (σ² + μ²)Pfg − μ²Pf Pg.
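The covariance formula can be checked by simulation for a single f. The sketch below (our own choices: ξ ~ N(1, 1), X uniform on [0, 1], f(x) = x) targets the limit variance (σ² + μ²)Pf² − μ²(Pf)² = 2·(1/3) − 1·(1/4) = 5/12:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, mu, sigma = 400, 4000, 1.0, 1.0

# f(x) = x on X ~ U(0, 1): Pf = 1/2 and Pf^2 = 1/3, so the limit variance of
# n^{-1/2} sum_i (xi_i f(X_i) - mu Pf) is (sigma^2 + mu^2) Pf^2 - mu^2 (Pf)^2 = 5/12.
x = rng.random((reps, n))
xi = mu + sigma * rng.standard_normal((reps, n))
vals = (xi * x).sum(axis=1) / np.sqrt(n) - np.sqrt(n) * mu * 0.5
est = float(vals.var())

print(f"simulated variance {est:.3f} vs theoretical {5 / 12:.3f}")
assert abs(est - 5.0 / 12.0) < 0.05
```

Here the variance is exact for every n, since each term ξ_i f(X_i) is i.i.d.; the simulation only adds Monte Carlo noise.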

Next consider conditional multiplier central limit theorems. For finite


F, the almost sure version is a simple consequence of the Lindeberg theo-
rem.

2.9.5 Lemma. Let Z₁, Z₂, ... be i.i.d. Euclidean random vectors with EZ_i = 0 and E‖Z_i‖² < ∞, independent of the i.i.d. sequence ξ₁, ξ₂, ..., with Eξ_i = 0 and Eξ_i² = 1. Then, conditionally on Z₁, Z₂, ...,

    n^{-1/2} Σ_{i=1}^n ξ_i Z_i ⇝ N(0, cov Z₁),

for almost every sequence Z₁, Z₂, ....

Proof. According to the Lindeberg central limit theorem, the statement is true for every sequence Z₁, Z₂, ... such that, for every ε > 0,

    n^{-1} Σ_{i=1}^n Z_i Z_i′ → cov Z₁;    n^{-1} Σ_{i=1}^n ‖Z_i‖² E_ξ ξ_i² 1{|ξ_i| ‖Z_i‖ > ε√n} → 0.

The first condition is true for almost all sequences by the law of large numbers. Furthermore, a finite second moment, E‖Z_i‖² < ∞, implies that max_{1≤i≤n} ‖Z_i‖/√n → 0 for almost all sequences. For the intersection of the two sets of sequences, both conditions are satisfied.

The preceding lemma takes care of marginal convergence in the conditional multiplier central limit theorems. For convergence in ℓ^∞(F), it suffices to check asymptotic equicontinuity.
Conditional weak convergence in probability must be formulated in terms of a metric on conditional laws. Since "conditional laws" do not exist without proper measurability, we utilize the bounded dual Lipschitz distance based on outer expectations. In Chapter 1.12 it is shown that weak convergence, G_n ⇝ G, of a sequence of random elements G_n in ℓ^∞(F) to a separable limit G is equivalent to

    sup_{h∈BL₁} |E* h(G_n) − E h(G)| → 0.

Here BL₁ is the set of all functions h: ℓ^∞(F) → [−1, 1] such that |h(z₁) − h(z₂)| ≤ ‖z₁ − z₂‖_F for every z₁, z₂.

2.9.6 Theorem. Let F be a class of measurable functions. Let ξ₁, ..., ξ_n be i.i.d. random variables with mean zero, variance 1, and ‖ξ‖_{2,1} < ∞, independent of X₁, ..., X_n. Let G′_n = n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P). Then the following assertions are equivalent:
(i) F is Donsker;
(ii) sup_{h∈BL₁} |E_ξ h(G′_n) − E h(G)| → 0 in outer probability, and the sequence G′_n is asymptotically measurable.

Proof. (i) ⇒ (ii). For a Donsker class F, the sequence G′_n converges in distribution to a Brownian bridge process by the unconditional multiplier theorem, Theorem 2.9.2. Thus, it is asymptotically measurable.
A Donsker class is totally bounded for ρ_P. For each fixed δ > 0 and f ∈ F, let Π_δ f denote a closest element in a given, finite δ-net for F. By uniform continuity of the limit process G, we have G ∘ Π_δ → G almost surely, as δ ↓ 0. Hence, it is certainly true that

    sup_{h∈BL₁} |E h(G ∘ Π_δ) − E h(G)| → 0.

Second, by the preceding lemma, as n → ∞ and for every fixed δ > 0,

    sup_{h∈BL₁} |E_ξ h(G′_n ∘ Π_δ) − E h(G ∘ Π_δ)| → 0,

for almost every sequence X₁, X₂, .... (To see this, define A: R^p → ℓ^∞(F) by Ay(f) = y_i if Π_δ f = f_i. Then h(G ∘ Π_δ) = g(G(f₁), ..., G(f_p)), for the function g defined by g(y) = h(Ay). If h is bounded Lipschitz on ℓ^∞(F), then g is bounded Lipschitz on R^p with a smaller bounded Lipschitz norm.) Since BL₁(R^p) is separable for the topology of uniform convergence on compacta, the supremum in the preceding display can be replaced by a countable supremum. It follows that the variable in the display is measurable, because h(G′_n ∘ Π_δ) is measurable. Third,

    sup_{h∈BL₁} |E_ξ h(G′_n ∘ Π_δ) − E_ξ h(G′_n)| ≤ sup_{h∈BL₁} E_ξ |h(G′_n ∘ Π_δ) − h(G′_n)|
        ≤ E_ξ ‖G′_n ∘ Π_δ − G′_n‖*_F ≤ E_ξ ‖G′_n‖*_{F_δ},

where F_δ is the class {f − g: ρ_P(f, g) < δ}. Thus, the outer expectation of the left side is bounded above by E*‖G′_n‖_{F_δ}. As in the proof of Theorem 2.9.2, we can use the multiplier inequalities of Lemma 2.9.1 to see that the latter expression converges to zero as n → ∞ followed by δ → 0. Combine this with the previous displays to obtain one-half of the theorem.
(ii) ⇒ (i). Letting h(G′_n)* and h(G′_n)_* denote measurable majorants and minorants with respect to (ξ₁, ..., ξ_n, X₁, ..., X_n) jointly, we have, by the triangle inequality and Fubini's theorem,

    |E* h(G′_n) − E h(G)| ≤ ( E_X E_ξ h(G′_n)* − E*_X E_ξ h(G′_n) ) + |E*_X E_ξ h(G′_n) − E h(G)|.


If (ii) holds, then, by dominated convergence, the second term on the right side converges to zero for every h ∈ BL₁. The first term on the right is bounded above by E_X E_ξ h(G′_n)* − E_X E_ξ h(G′_n)_*, which converges to zero if G′_n is asymptotically measurable. Thus, (ii) implies that G′_n ⇝ G unconditionally. Then F is Donsker by the converse part of the unconditional multiplier theorem, Theorem 2.9.2.

It may be noted that the functions (ξ₁, ..., ξ_n) ↦ h(G′_n) are (Lipschitz) continuous, hence Borel measurable, for every given sequence X₁, X₂, ... (under the assumption that ‖f(x) − Pf‖_F < ∞ for every x, which is implicitly made to ensure that our processes take their values in ℓ^∞(F)). Thus, the expectations E_ξ h(G′_n) in the statement of the preceding theorem make sense even though they may be nonmeasurable functions of (X₁, ..., X_n). The second assumption in (ii) implies that these functions are asymptotically measurable and appears to be necessary for the "easy" implication, (ii) ⇒ (i).
The almost sure version of the conditional central limit theorem asserts weak convergence of the sequence G′_n = n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) given almost every sequence X₁, X₂, .... It was seen in Example 1.9.4 that almost sure convergence without proper measurability may not mean much. Thus, it is more attractive to define almost sure conditional convergence by strengthening part (ii) of the preceding theorem. It will be shown that

    sup_{h∈BL₁} |E_ξ h(G′_n) − E h(G)| → 0   outer almost surely.

This implies that E_ξ h(G′_n) → E h(G) for every h ∈ BL₁, for almost every sequence X₁, X₂, .... By the portmanteau theorem, this is then also true for every continuous, bounded h. Thus, the sequence G′_n converges in distribution to G given almost every sequence X₁, X₂, ..., also in a more naive sense of almost sure conditional convergence.

2.9.7 Theorem. Let F be a class of measurable functions. Let ξ₁, ..., ξ_n be i.i.d. random variables with mean zero, variance 1, and ‖ξ‖_{2,1} < ∞, independent of X₁, ..., X_n. Define G′_n = n^{-1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P). Then the following assertions are equivalent:
(i) F is Donsker and P*‖f − Pf‖²_F < ∞;
(ii) sup_{h∈BL₁} |E_ξ h(G′_n) − E h(G)| → 0 outer almost surely, and the sequence E_ξ h(G′_n)* − E_ξ h(G′_n)_* converges almost surely to zero for every h ∈ BL₁.
Here h(G′_n)* and h(G′_n)_* denote measurable majorants and minorants with respect to (ξ₁, ..., ξ_n, X₁, ..., X_n) jointly.

Proof. (i) ⇒ (ii). For the first assertion of (ii), the proof of the in-probability conditional central limit theorem applies, except that it must be argued that E_ξ‖G′_n‖*_{F_δ} converges to zero outer almost surely, as n → ∞ followed by δ ↓ 0. By Corollary 2.9.9,

    lim sup_{n→∞} E_ξ‖G′_n‖*_{F_δ} ≤ 6√2 lim sup_{n→∞} E*‖G′_n‖_{F_δ},   a.s.

The right-hand side decreases to zero as δ ↓ 0, because G′_n ⇝ G unconditionally, by Theorem 2.9.2.
To see that the sequence E_ξ h(G′_n) is strongly asymptotically measurable, obtain first by the same proof, but with a star added, that

    E_ξ h(G′_n)* − E h(G) → 0   outer almost surely.

The same proof also shows that this is true with a lower star. Thus, the sequence E_ξ h(G′_n)* − E_ξ h(G′_n)_* converges to zero almost surely.
(ii) ⇒ (i). The class F is Donsker in view of the in-probability conditional central limit theorem. We need to prove the condition on its envelope function. As in the proof of the portmanteau theorem, there exists a sequence of Lipschitz functions h_m: R → R such that 1 ≥ h_m(‖z‖_F) ↓ 1{‖z‖_F ≥ t} pointwise. By Lemma 1.2.2(vi),

    P_ξ( ‖G′_n‖*_F > t ) = E_ξ (1{‖G′_n‖_F > t})* ≤ E_ξ h_m(‖G′_n‖_F)*.

Under assumption (ii), the right side converges to E h_m(‖G‖_F) almost surely as n → ∞, for every fixed m. Conclude that

    lim sup_{n→∞} P_ξ( ‖G′_n‖*_F > t ) ≤ P( ‖G‖_F ≥ t ),   a.s.

Set Z_i = δ_{X_i} − P. By the triangle inequality, ‖ξ_n Z_n‖*_F ≤ √n ‖G′_n‖*_F + √(n − 1) ‖G′_{n−1}‖*_F. Thus, for sufficiently large t,

    lim sup_{n→∞} P_ξ( ‖ξ_n Z_n‖*_F > t√n ) < 1,   a.s.

Since ξ_n is distributed as ξ₁, it follows that lim sup ‖Z_n‖*_F/√n < ∞ almost surely. This implies E*‖Z₁‖²_F < ∞ (Problem 2.9.4).

The proof of the following lemma is based on isoperimetric inequalities


for product measures, which are treated in Appendix A.4. The Problems
section gives alternative proofs of similar results using more conventional
methods.

2.9.8 Lemma. Let Z₁, Z₂, ... be i.i.d. stochastic processes such that E*‖Z₁‖²_F < ∞. Let ξ₁, ξ₂, ... be i.i.d. random variables with mean zero, independent of Z₁, Z₂, .... Then there exists a constant K such that, for every t > 0,

    Σ_n P( E_ξ‖ Σ_{i=1}^{2^n} ξ_i Z_i ‖*_F > 6 E*‖ Σ_{i=1}^{2^n} ξ_i Z_i ‖_F + t 2^{n/2} ) ≤ K E*‖Z₁‖²_F / t².

Here the first star on the left denotes a measurable majorant with respect to the variables (ξ₁, ..., ξ_n, Z₁, ..., Z_n) jointly.

2.9.9 Corollary. In the situation of the preceding lemma,

    lim sup_{n→∞} E_ξ‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖*_F ≤ 6√2 lim sup_{n→∞} E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_F,   a.s.,

    E sup_n E_ξ‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖*_F ≤ C sup_n E*‖ n^{-1/2} Σ_{i=1}^n ξ_i Z_i ‖_F + C √(E*‖Z₁‖²_F),

for a universal constant C.


Proofs. Abbreviate S_n = E_ξ‖ Σ_{i=1}^n ξ_i Z_i ‖*_F. Fix n, and let Z_{[1]}, ..., Z_{[2^n]} be the processes Z₁, ..., Z_{2^n} ordered by decreasing norms ‖Z_{[i]}‖*_F. Define the control number f(A, A, A, Z₁, ..., Z_{2^n}) as in Appendix A.4 for the event A = {S_{2^n} ≤ 2ES_{2^n}}. Then on the event {f(A, A, A, Z) < k},

    S_{2^n} ≤ Σ_{i=1}^{k−1} ‖Z_{[i]}‖* + 6ES_{2^n}.

(See the explanation after the proof of Proposition A.4.1.) By Markov's inequality, the event A has probability at least 1/2. Combining this with Proposition A.4.1 yields, for every k ≥ 3,

    P( S_{2^n} > 6ES_{2^n} + t2^{n/2} ) ≤ (2/3)^k + P( Σ_{i=1}^{k−1} ‖Z_{[i]}‖* > t2^{n/2} )
        ≤ (2/3)^k + P( ‖Z_{[1]}‖* > (t/2)2^{n/2} ) + P( k‖Z_{[2]}‖* > (t/2)2^{n/2} )
        ≲ [ (1/k)^8 + γ_n(t/2) + γ_n²(t/2k) ] ∧ 1.

Here γ_n(t) = 2^n P( ‖Z₁‖*_F > t2^{n/2} ), and the square arises from the binomial inequality of Problem 2.9.6. Let β_n(t) = Σ_{l=0}^∞ γ_{n−l}(t) 2^{−l} be the convolution of the sequences γ_n(t) and (1/2)^n. Then β_n(t) ≥ γ_{n−l}(t)(1/2)^l for any natural number l, whence γ_n(t2^{−l}) = 2^{2l} γ_{n−2l}(t) ≤ 2^{4l} β_n(t). This readily yields that γ_n(t/k) ≲ k⁴ β_n(t) for every number k ≥ 3. Use this inequality with the choice k = ⌊β_n(t)^{−1/8}⌋ ∨ 3 to bound the last line of the preceding display up to a constant by

    [ (1/k)^8 + β_n(t) + k^8 β_n²(t) ] ∧ 1 ≲ β_n(t).

The proof of the lemma is complete upon noting that Σ_n β_n(t) ≤ 2 Σ_{n=−∞}^∞ γ_n(t), which in turn is bounded by a multiple of E*‖Z₁‖²_F / t².
Since the random variables S_n are nondecreasing in n by Jensen's inequality, we obtain

    sup_{n≥m} S_n/√n ≤ √2 sup_{n ≥ log₂ m} S_{2^n}/2^{n/2}.

In view of the Borel-Cantelli lemma, Lemma 2.9.8 gives the inequality lim sup (S_{2^n} − 6ES_{2^n})/2^{n/2} ≤ 0 almost surely. The first assertion of the corollary follows. Next,

    E sup_n (S_{2^n} − 6ES_{2^n})/2^{n/2} ≤ ∫_0^∞ Σ_n P( S_{2^n} − 6ES_{2^n} > t2^{n/2} ) dt
        ≲ ∫_0^∞ ( E*‖Z₁‖²_F / t² ∧ 1 ) dt.

This is bounded by two times the square root of E*‖Z₁‖²_F. The second assertion now follows similarly from the monotonicity of the variables S_n.

Problems and Complements

1. For any random variable ξ and r > 2, one has (1/2)‖ξ‖₂ ≤ ‖ξ‖_{2,1} ≤ (r/(r − 2))‖ξ‖_r.

2. For any pair of random variables ξ and η, one has ‖ξ + η‖²_{2,1} ≤ 4‖ξ‖²_{2,1} + 4‖η‖²_{2,1}.

3. In the situation of Lemma 2.9.1, if the variables ξ_i are symmetric and possess bounded range contained in [−M, M], then E*‖ Σ_{i=1}^n ξ_i Z_i ‖_F ≤ M E*‖ Σ_{i=1}^n ε_i Z_i ‖_F.
[Hint: Use the contraction principle, Proposition A.1.11.]


4. If X₁, X₂, ... are i.i.d. random variables with lim sup |X_n|/√n < ∞ almost surely, then EX₁² < ∞.
[Hint: By the Kolmogorov 0-1 law, lim sup |X_n|/√n is constant almost surely. Thus, there exists a constant M with P(|X_n| > M√n infinitely often) = 0. Since the events are independent, the Borel-Cantelli lemma gives Σ P(|X_n|² > nM²) < ∞.]

5. Let F be Donsker with ‖P‖_F < ∞. Let ξ_1, . . . , ξ_n be i.i.d. random variables
   with mean µ, variance σ², and ‖ξ_i‖_{2,1} < ∞, independent of X_1, . . . , X_n. Let
   N_n be Poisson-distributed with mean n and independent of the ξ_i and X_i.
   Then the sequence n^{−1/2}(Σ_{i=1}^{N_n} ξ_i δ_{X_i} − nµP) converges in distribution in
   ℓ∞(F) to (σ² + µ²)^{1/2} times a Brownian motion process.

6. If Z_[1], . . . , Z_[n] are the reversed order statistics of an i.i.d. sample Z_1, . . . , Z_n,
   then P(Z_[k] > x) ≤ (n choose k) P(Z_1 > x)^k for every x.
   [Hint: Note that P(Z_[k] > x) ≤ (n choose k) P(Z_1 > x, . . . , Z_k > x).]
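This bound can be verified exactly in a small case. For n = 5 i.i.d. uniform(0, 1) variables with k = 2 and x = 0.8 (illustrative choices), the event {Z_[2] > x} means that at least two observations exceed x, so its probability is a binomial tail probability, which the sketch below compares with the bound.

```python
from math import comb

# Exact check of P(Z_[k] > x) <= C(n, k) P(Z_1 > x)^k for uniform(0, 1)
# variables with n = 5, k = 2, x = 0.8, so p = P(Z_1 > x) = 0.2.
n, k, p = 5, 2, 0.2
# P(Z_[2] > x) = P(at least k of the n observations exceed x)
prob = sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))
bound = comb(n, k) * p**k
print(prob, bound)  # 0.26272 <= 0.4
```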

7. (Alternative proof of Corollary 2.9.9) Let Z_1, Z_2, . . . be i.i.d. stochastic
   processes with sample paths in ℓ∞(F) and E*‖Z_1‖²_F < ∞. Let ξ_1, ξ_2, . . . be
   i.i.d. random variables with mean zero, independent of Z_1, Z_2, . . . . Then there
   exists a constant K with

      lim sup_{n→∞} (E_ξ‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F)* ≤ K lim sup_{n→∞} E*‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F,   a.s.

   Here the star on the left side denotes a measurable majorant with respect to
   the variables (ξ_1, . . . , ξ_n, Z_1, . . . , Z_n) jointly.
   [Hint: Without loss of generality, assume that E*‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F ≤ 1/2
   for every n, and E|ξ_1| = 1. Since Eξ = 0, the random variables
   E_ξ‖Σ_{i=1}^n ξ_i Z_i‖*_F are nondecreasing in n by Jensen's inequality. Therefore,
   it suffices to consider the limsup along the subsequence indexed by 2^n. Since
   E*‖Z_1‖²_F < ∞,

      Σ_{n=1}^∞ P(max_{1≤i≤2^n} ‖Z_i‖*_F > 2^{n/2}) ≤ Σ_{n=1}^∞ 2^n P(‖Z_1‖*_F > 2^{n/2}) < ∞.

   It follows by the Borel-Cantelli lemma that each variable ‖Z_i‖*_F, for
   1 ≤ i ≤ 2^n, is bounded by 2^{n/2} eventually, almost surely. Consequently,
   if Z̃_i = Z_i 1{‖Z_i‖*_F ≤ 2^{n/2}}, then the variables E_ξ‖Σ_{i=1}^{2^n} ξ_i Z_i‖*_F and
   E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F are equal eventually, almost surely. (We abuse notation,
   since Z̃_i depends on n.) It suffices to show that the random variable
   lim sup 2^{−n/2} E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F is bounded by some fixed number almost
   surely. By the Borel-Cantelli lemma, this is certainly the case if

      Σ_{n=1}^∞ P(E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 5·2^{n/2}) < ∞.

   The probabilities in this series can be replaced by their squares at the cost
   of decreasing 5 to 2. Indeed, the random variables S_{k,l} = E_ξ‖Σ_{i=k}^l ξ_i Z̃_i‖*_F
   are monotonely decreasing in k and increasing in l. Let T be the first l such
   that S_{1,l} > 2·2^{n/2}. Then T ≤ 2^n if and only if S_{1,2^n} > 2·2^{n/2}. Furthermore,
   if T = k and S_{1,2^n} > 5·2^{n/2}, then S_{k+1,2^n} > 2·2^{n/2}, because S_{k,k} =
   ‖Z̃_k‖*_F ≤ 2^{n/2}. Thus

      P(S_{1,2^n} > 5·2^{n/2}) ≤ Σ_{k=1}^{2^n} P(T = k, S_{1,2^n} > 5·2^{n/2})

                             ≤ Σ_{k=1}^{2^n} P(T = k, S_{k+1,2^n} > 2·2^{n/2})

                             ≤ Σ_{k=1}^{2^n} P(T = k) max_{1≤k≤2^n} P(S_{k,2^n} > 2·2^{n/2})

                             ≤ P(S_{1,2^n} > 2·2^{n/2})².

   In view of the contraction principle stated in Proposition A.1.11, we have
   E*‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖_F ≤ E*‖Σ_{i=1}^{2^n} ξ_i Z_i‖_F ≤ 2^{n/2}. Conclude that it suffices to
   prove that

      (2.9.10)   Σ_{n=1}^∞ P(E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F − E‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 2^{n/2})² < ∞.

   Let E_i denote conditional expectation with respect to the σ-field generated
   by Z_1, . . . , Z_i. More precisely, let E_0 denote unconditional expectation, and
   if Z_i is the projection on the ith coordinate of the product measurable space
   (X^∞ × Z, A^∞ × C), let E_i denote conditional expectation given A^i × X^{∞−i} × C.
   Set

      D_i = (E_i − E_{i−1}) E_ξ‖Σ_{j=1}^{2^n} ξ_j Z̃_j‖*_F = (E_i − E_{i−1})(C_i),

   where

      C_i = E_ξ‖Σ_{j=1}^{2^n} ξ_j Z̃_j‖*_F − E_ξ‖Σ_{j=1, j≠i}^{2^n} ξ_j Z̃_j‖*_F.

   The random variables inside the probabilities in (2.9.10) can be written as the
   (telescoping) sums Σ_{i=1}^{2^n} D_i. By the triangle inequality, |D_i| ≤ 2C_i ≤ 2‖Z̃_i‖*.
   Furthermore, in view of Problem 2.9.9, the expectation of C_i can be bounded
   by EC_i ≤ 2^{−n} E*‖Σ_{j=1}^{2^n} ξ_j Z̃_j‖_F ≤ 2^{−n/2}. The random variables D_1, . . . , D_{2^n}
   are uncorrelated and have mean zero. By Chebyshev's inequality, (2.9.10) is
   bounded by

      Σ_{n=1}^∞ (2^{−n} Σ_{i=1}^{2^n} ED_i²)² ≤ Σ_{n=1}^∞ max_{1≤i≤2^n} E|D_i|³ E|D_i| ≲ Σ_{n=1}^∞ 2^{−n/2} E*‖Z̃_1‖³_F.

   Substitute E*‖Z̃_1‖³_F = Σ_{k=−∞}^n E*‖Z_1‖³_F 1{2^{(k−1)/2} < ‖Z_1‖*_F ≤ 2^{k/2}}; next
   bound one factor of ‖Z_1‖_F by 2^{k/2}; and, finally, change the order of summa-
   tion to bound the previous display by

      Σ_{k=−∞}^∞ Σ_{n=k}^∞ E*‖Z_1‖²_F 1{2^{(k−1)/2} < ‖Z_1‖*_F ≤ 2^{k/2}} 2^{k/2−n/2}.

   This equals E*‖Z_1‖²_F √2/(√2 − 1). The proof is complete.]

8. (Alternative proof of Corollary 2.9.9) Let Z_1, Z_2, . . . be i.i.d. stochastic
   processes with sample paths in ℓ∞(F). Let ξ_1, ξ_2, . . . be i.i.d. random
   variables with mean zero, independent of Z_1, Z_2, . . . . Suppose M =
   sup_n E*‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F < ∞. Then there exists a universal constant K
   with

      P(sup_{n≥1} (E_ξ‖n^{−1/2} Σ_{i=1}^n ξ_i Z_i‖_F)* > √2(2M + 3t)) ≤ K (1/t² + M/t³) E*‖Z_1‖²_F,

   for every t > 0. Here the star on the left side denotes a measurable majorant
   with respect to (ξ_1, . . . , ξ_n, Z_1, . . . , Z_n) jointly.
   [Hint: The supremum over all n ≥ 1 inside the probability can be replaced by
   √2 times the supremum over the numbers 2^n. Set Z̃_i = Z_i 1{‖Z_i‖*_F ≤ t2^{n/2}}.
   The given probability is bounded by

      Σ_n P(max_{1≤i≤2^n} ‖Z_i‖*_F > t2^{n/2}) + Σ_n P(E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 2^{n/2}(2M + 3t))

      ≤ Σ_n 2^n P(‖Z_1‖*_F > t2^{n/2}) + Σ_n P(E_ξ‖Σ_{i=1}^{2^n} ξ_i Z̃_i‖*_F > 2^{n/2}(M + t))².

   Here the square probability is obtained by the same argument as before, but
   with T equal to the first l such that S_{1,l} > 2^{n/2}(M + t). The proof can be
   finished as before.]

9. Given i.i.d. stochastic processes Z_1, Z_2, . . . with sample paths in ℓ∞(F), define
   S_n = Σ_{i=1}^n Z_i. Then E*‖S_n‖_F − E*‖S_{n−1}‖_F ≤ n^{−1} E*‖S_n‖_F.
   [Hint: The inequality is equivalent to E*‖S_n/n‖_F ≤ E*‖S_{n−1}/(n − 1)‖_F.
   The sequence ‖S_n/n‖*_F forms a reversed submartingale with respect to the
   filtration S_n generated by the symmetric σ-fields, as in Lemma 2.4.6.]
2.10
Permanence of
the Glivenko-Cantelli
and Donsker Properties

In this chapter we consider a number of operations that preserve the


Glivenko-Cantelli and Donsker properties and allow the formation of many
new Donsker classes from given examples. For instance, unions, convex
hulls, and certain closures of Donsker classes are Donsker. The main re-
sults of this chapter concern pointwise transformations of Glivenko-Cantelli
classes and Donsker classes.
For given classes F_1, . . . , F_k of real functions defined on some set X
and a fixed map φ: R^k → R, let φ ◦ (F_1, . . . , F_k) denote the class of functions
x ↦ φ(f_1(x), . . . , f_k(x)) as f = (f_1, . . . , f_k) ranges over F_1 × ··· × F_k. In
Section 2.10.1 we show that the class φ ◦ (F_1, . . . , F_k) is Glivenko-Cantelli
if F_1, . . . , F_k are Glivenko-Cantelli, as soon as φ is continuous. In Sec-
tion 2.10.3 we show that the Donsker property is similarly preserved if
φ is Lipschitz.
Section 2.10.4 covers the preservation of the uniform-entropy condition,
and in Section 2.10.5 new Glivenko-Cantelli and Donsker classes are formed
through union of sample spaces.

2.10.1 Continuous Transformations of GC-Classes


For given classes F1 , . . . , Fk of functions and given φ: Rk → R define the
class of functions φ ◦ (F1 , . . . , Fk ) as indicated.

2.10.1 Theorem. Let F1 , . . . , Fk be P -Glivenko-Cantelli classes with inte-


grable envelopes. If φ: Rk → R is continuous, then the class φ ◦ (F1 , . . . , Fk )
is P -Glivenko-Cantelli provided that it has an integrable envelope function.

Proof. We first assume that the classes of functions F_i are appropriately
measurable. Let F_1, . . . , F_k and H be integrable envelopes for the classes
F_1, . . . , F_k and H := φ ◦ (F_1, . . . , F_k), respectively, and set F = F_1 ∨ ··· ∨ F_k.
For M ∈ (0, ∞) define H_M to be the class of functions φ(f)1{F ≤ M},
where f = (f_1, . . . , f_k) ranges over F_1 × ··· × F_k. Then

      ‖(P_n − P)φ(f)‖_F ≤ (P_n + P)H 1{F > M} + ‖(P_n − P)h‖_{H_M}.

Because the expectation of the first term on the right converges to 0 as M →
∞, it suffices to show that the class H_M is P-Glivenko-Cantelli for every
fixed M. By Lemma 2.10.2 below, applied to the function φ: [−M, M]^k → R
and ‖·‖ the L_1-norm on R^k, there exists for every ε > 0 a δ > 0, depending
only on ε, φ and M, such that for any pairs (f_j, g_j) ∈ F_j the inequality

      P_n|f_j − g_j|1{F_j ≤ M} ≤ δ/k,    j = 1, . . . , k,

implies that

      P_n|φ(f_1, . . . , f_k) − φ(g_1, . . . , g_k)|1{F ≤ M} ≤ ε.

We conclude that

      N(ε, H_M, L_1(P_n)) ≤ Π_{j=1}^k N(δ/k, F_j 1{F_j ≤ M}, L_1(P_n)).

By the converse part of Theorem 2.4.3 the outer expectation of the loga-
rithm of each of the terms in the product on the right is of order o(n). It fol-
lows that E* log N(ε, H_M, L_1(P_n)) = o(n), for every ε > 0 and M > 0. By
Theorem 2.4.3 in the other direction, the class φ ◦ (F_1, . . . , F_k) is P-Glivenko-
Cantelli.


The preceding argument concludes the proof for classes F_1, . . . , F_k
that together with φ ◦ (F_1, . . . , F_k) satisfy the measurability assumptions of
Theorem 2.4.3. We extend the theorem to general Glivenko-Cantelli classes
using separable versions of these classes, as defined in Section 2.3.3. As
shown in the preceding argument, it is not a loss of generality to assume
that the classes F_i are uniformly bounded. Furthermore, it suffices to show
that the class φ ◦ (F_1, . . . , F_k) is weak Glivenko-Cantelli.
Because a Glivenko-Cantelli class with integrable envelope is totally
bounded in L_1(P) (cf. Problem 2.4.11), it is separable as a subset of L_1(P).
Therefore, by Theorem 2.3.16 every class F_i admits a separable modifica-
tion F̃_i relative to the L_1(P)-norm, and by Theorem 2.3.16 the class F_i is
Glivenko-Cantelli if and only if F̃_i is Glivenko-Cantelli and sup_{f∈F_i} P_n|f −
f̃| → 0 in outer probability. The classes F̃_i possess enough measurability
to make the preceding argument work. Thus we conclude that the class
φ ◦ (F̃_1, . . . , F̃_k) is Glivenko-Cantelli. Finally, we invoke Lemma 2.10.2 to
see that convergence to zero of sup_{f∈F_i} P_n|f − f̃| for i = 1, . . . , k
implies convergence to zero of sup_f P_n|φ(f_1, . . . , f_k) − φ(f̃_1, . . . , f̃_k)|.
The theorem follows in the general case by application of Theorem 2.3.16
in the other direction.

The following lemma is the crux of the proof of Theorem 2.10.1. Let
k · k be any norm on Rk .

2.10.2 Lemma. Suppose that K ⊂ R^k is compact and φ: K → R is con-
tinuous. Then for every ε > 0 there exists δ > 0 such that for all n and for
all a_1, . . . , a_n, b_1, . . . , b_n ∈ K the inequality n^{−1} Σ_{i=1}^n ‖a_i − b_i‖ < δ implies
that n^{−1} Σ_{i=1}^n |φ(a_i) − φ(b_i)| < ε.

Proof. Letting U_n be a random variable with the uniform distribution on
{1, . . . , n}, and setting X_n = a_{U_n} and Y_n = b_{U_n}, we can represent the
two discrepancies in the lemma as n^{−1} Σ_{i=1}^n ‖a_i − b_i‖ = E‖X_n − Y_n‖ and
n^{−1} Σ_{i=1}^n |φ(a_i) − φ(b_i)| = E|φ(X_n) − φ(Y_n)|. Therefore, it suffices to show
that for every ε > 0 there exists δ > 0 such that for random vectors (X, Y)
taking values in K × K,

      E‖X − Y‖ < δ   implies   E|φ(X) − φ(Y)| < ε.

If this were not the case, then for some ε > 0 and for all m = 1, 2, . . . there
would exist (X_m, Y_m) such that E‖X_m − Y_m‖ < m^{−1} and E|φ(X_m) − φ(Y_m)| ≥ ε.
By Prohorov's theorem there would exist a subsequence (X_{m′}, Y_{m′}) that
converges in distribution to a limit (X, Y). Then, by dominated convergence,
both E‖X − Y‖ = lim_{m′→∞} E‖X_{m′} − Y_{m′}‖ = 0, and also
E|φ(X) − φ(Y)| = lim_{m′→∞} E|φ(X_{m′}) − φ(Y_{m′})| ≥ ε > 0. This yields a
contradiction, and hence the implication in the preceding display is valid.
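For φ(x) = x² on K = [0, 1] the modulus in Lemma 2.10.2 is explicit: |a² − b²| ≤ 2|a − b|, so δ = ε/2 works. The sketch below (with illustrative randomly generated points, not from the text) checks the implication for this choice of φ and δ.

```python
import random

# Lemma 2.10.2 for phi(x) = x^2 on [0, 1]: if the average of |a_i - b_i| is
# below delta = eps/2, the average of |phi(a_i) - phi(b_i)| is below eps.
random.seed(0)
n, eps = 100, 0.1
delta = eps / 2
a = [random.random() for _ in range(n)]
# perturb each point by less than delta, staying inside [0, 1]
b = [min(1.0, x + random.uniform(0, 2 * delta / 3)) for x in a]
mean_gap = sum(abs(x - y) for x, y in zip(a, b)) / n
mean_phi_gap = sum(abs(x * x - y * y) for x, y in zip(a, b)) / n
print(mean_gap < delta, mean_phi_gap < eps)  # True True
```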

2.10.2 Closures and Convex Hulls


Given a class F of measurable functions, let F̄ denote the set of all f: X → R
for which there exists a sequence f_m in F with f_m → f both pointwise and
in L_2(P). Let sconv F denote the set of convex combinations Σ_{i=1}^∞ λ_i f_i of
functions f_i in F where Σ_i |λ_i| ≤ 1 and the series converges both pointwise
and in L_2(P).‡

2.10.3 Theorem. If F is Donsker and G ⊂ F, then G is Donsker.

2.10.4 Theorem. If F is Donsker, then F̄ is Donsker.

2.10.5 Theorem. If F is Donsker, then sconv F is Donsker.

Proofs. The first two results are immediate consequences of the char-
acterization of weak convergence in `∞ (F) as marginal convergence plus
asymptotic equicontinuity. In both cases the modulus of continuity does
not increase when passing from F to the new class.

‡ Pointwise convergence means convergence for every argument, not convergence almost
surely.

For the third result, let {ψ_i} be an orthonormal base of a subspace of
L_2(P) that contains F, and let Z_1, Z_2, . . . be i.i.d. standard normally
distributed random variables. Since F is pre-Gaussian, the series
Σ_{i=1}^∞ P((f − Pf)ψ_i) Z_i is uniformly convergent in ℓ∞(F) and represents
a tight Brownian bridge process (Problem 2.10.1). Thus the sequence of
partial sums G_k = Σ_{i=1}^k P((f − Pf)ψ_i) Z_i satisfies

      ‖G_k − G_l‖_{sconv F} = ‖G_k − G_l‖_F → 0,   a.s.,

as k, l → ∞. It follows that the sequence of partial sums forms a Cauchy


sequence in `∞ (sconv F) almost surely and hence converges almost surely
to a limit G. Since each of the partial sums is linear and ρP -uniformly
continuous on sconv F, so is the limit G, with probability 1. This proves
that the symmetric convex hull of F is pre-Gaussian.
According to the almost sure representation theorem, Theorem 1.10.4,
there exists a probability space (Ω, U, P) and perfect maps φ_n: Ω → X^n and
φ such that

      ‖n^{−1/2} Σ_{i=1}^n (f(φ_n(ω)_i) − Pf) − G(f, φ(ω))‖_F → 0,   outer almost surely.

The norm does not increase if F is replaced by sconv F. Thus there ex-
ists a version of the empirical process that converges outer almost surely
in ℓ∞(sconv F) to a process with uniformly continuous sample paths.
This implies that sconv F is Donsker. (Note that by perfectness of φ_n,
Eh*{G_n f: f ∈ sconv F} = (P^n)* h{G_n f: f ∈ sconv F} for every func-
tion h on ℓ∞(sconv F).)

2.10.6 Example. The class of all functions x ↦ µ(−∞, x] with µ rang-
ing over all signed measures on R^k with total variation bounded by 1 is
universally Donsker.
This can be deduced by applying the preceding results several times.
The given class is in the convex hull of the set of cumulative distribution
functions of probability measures. In view of the classical Glivenko-Cantelli
theorem, any cumulative distribution function is the uniform limit of a
sequence of finitely discrete cumulative distribution functions. The class
of functions x ↦ Σ_i p_i 1{t_i ≤ x}, with p_i ≥ 0 and Σ_i p_i = 1, is in the
convex hull of the class of indicator functions of cells [t, ∞). This is Vapnik-
Červonenkis and suitably measurable, hence universally Donsker.

2.10.3 Lipschitz Transformations of Donsker Classes


For given classes F_1, . . . , F_k of real functions defined on some set X and
a fixed map φ: R^k → R, let φ ◦ (F_1, . . . , F_k) denote the class of functions
x ↦ φ(f_1(x), . . . , f_k(x)) as f = (f_1, . . . , f_k) ranges over F_1 × ··· × F_k. It
will be shown that the Donsker property is preserved if φ satisfies

      (2.10.7)   |φ ◦ f(x) − φ ◦ g(x)|² ≤ Σ_{l=1}^k |f_l(x) − g_l(x)|²,

for every f, g ∈ F_1 × ··· × F_k and x. This is certainly the case for Lipschitz
functions φ.

2.10.8 Theorem. Let F1 , . . . , Fk be Donsker classes with kP kFi < ∞ for


each i. Let φ: Rk → R satisfy (2.10.7). Then the class φ ◦ (F1 , . . . , Fk )
is Donsker, provided φ ◦ (f1 , . . . , fk ) is square integrable for at least one
(f1 , . . . , fk ).

2.10.9 Example. If F and G are Donsker classes and kP kF ∪G < ∞, then


the pairwise infima F ∧ G, the pairwise suprema F ∨ G, and pairwise sums
F + G are Donsker classes.
Since F ∪ G is contained in the sum of F ∪ {0} and G ∪ {0}, the union
of F and G is Donsker as well.

2.10.10 Example. If F and G are uniformly bounded Donsker classes,


then the pairwise products F ·G form a Donsker class. The function φ(f, g) =
f g is Lipschitz on bounded subsets of R2 , but not on the whole plane. The
condition that F and G be uniformly bounded cannot be omitted in general.
(In fact, the envelope function of the product of two Donsker classes need
not be weak L2 .)

2.10.11 Example. If F is Donsker with ‖P‖_F < ∞ and f ≥ δ for every
f ∈ F and some constant δ > 0, then 1/F = {1/f: f ∈ F} is Donsker.

2.10.12 Example. If F is a Donsker class with ‖P‖_F < ∞ and g a uni-
formly bounded, measurable function, then F · g is Donsker. Although the
function φ(f, g) = fg is not Lipschitz on the domain of interest, we have

      |φ(f_1(x), g(x)) − φ(f_2(x), g(x))| ≤ ‖g‖_∞ |f_1(x) − f_2(x)|,

for all x. Thus, condition (2.10.7) is satisfied and the theorem applies.

2.10.13 Example. If F is Donsker with integrable envelope function F,
then the class F_{≤M} = {f 1{F ≤ M′}: f ∈ F, M′ ≤ M} is Donsker. Indeed,
the class of sets {F ≤ M′} is linearly ordered by inclusion, so that it is VC.
Furthermore, the function φ(f, g) = fg is Lipschitz on the set {(f, 0): f ∈
R} ∪ {(f, 1): |f| ≤ M}.
Consequently, the class F^M = {f 1{F > M}: f ∈ F} is Donsker be-
cause it is contained in F − F_{≤M}.

Many transformations of interest are Lipschitz, but not uniformly so.
Consider a map φ: R^k → R such that

      (2.10.14)   |φ ◦ f(x) − φ ◦ g(x)|² ≤ Σ_{l=1}^k L²_{α,l}(x) |f_l(x) − g_l(x)|²,

for given measurable functions L_{α,1}, . . . , L_{α,k} and every f, g ∈ F_1 × ··· × F_k
and x. Then the class φ ◦ (F_1, . . . , F_k) is Donsker if each of the classes
L_{α,i} F_i, consisting of the functions x ↦ L_{α,i}(x) f(x) with f ranging over
F_i, is Donsker. By Example 2.10.12, the class gF is Donsker whenever
g is bounded and F is Donsker, but this is not useful to generalize the
preceding theorem. Typically, it is possible to relax the requirement that g
is bounded at the cost of more stringent conditions on F. For instance, in
the next subsection it is shown that gF is Donsker for every combination
of a Donsker class F that satisfies the uniform entropy condition and a
function g such that Pg²F² < ∞.

2.10.15 Corollary. Let φ: Rk → R satisfy (2.10.14). Let each of the classes


Lα,i Fi be Donsker with kP kLα,i Fi < ∞. Then the class φ ◦ (F1 , . . . , Fk )
is Donsker, provided φ ◦ (f1 , . . . , fk ) is square integrable for at least one
(f1 , . . . , fk ).

Proof. Without loss of generality, assume that each of the functions Lα,i
is positive. For f ∈ Lα F = Lα,1 F1 × · · · × Lα,k Fk , define ψ(f ) = φ(f /Lα ).
Then ψ is uniformly Lipschitz in the sense of (2.10.7). By the preceding
theorem, ψ(Lα F) = φ(F) is Donsker.

The main tool in the proof of Theorem 2.10.8 is "Gaussianization." Let
ξ_1, ξ_2, . . . be i.i.d. random variables with a standard normal distribution,
independent of X_1, X_2, . . . . By Chapter 2.9, the (conditional as well as
unconditional) asymptotic behavior of the empirical process is related to
the behavior of the processes

      Z_n = n^{−1/2} Σ_{i=1}^n ξ_i δ_{X_i}.

Given fixed values X_1, . . . , X_n, the process {Z_n(f): f ∈ L_2(P)} is Gaussian
with zero mean and standard deviation metric

      σ_ξ(Z_n(f) − Z_n(g)) = (n^{−1} Σ_{i=1}^n (f(X_i) − g(X_i))²)^{1/2}

equal to the L_2(P_n) semimetric. This observation permits the use of several
comparison principles for Gaussian processes, including Slepian's lemma,
Proposition A.2.5.
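The identity between the conditional standard deviation metric of Z_n and the L_2(P_n) semimetric can be checked directly: given the data, Z_n(f) − Z_n(g) is a linear combination of the standard normal multipliers ξ_i, so its variance is the sum of the squared coefficients. A minimal sketch with illustrative data points and functions (not from the text):

```python
# Conditional-variance identity for Z_n(f) = n^{-1/2} sum_i xi_i f(X_i):
# var_xi(Z_n(f) - Z_n(g)) = n^{-1} sum_i (f - g)^2(X_i) = P_n (f - g)^2.
xs = [0.1, 0.5, 0.9, 1.3]               # illustrative data
f = lambda x: x * x
g = lambda x: x
n = len(xs)
# variance of a linear combination of i.i.d. N(0, 1) multipliers
var_xi = sum(((f(x) - g(x)) / n ** 0.5) ** 2 for x in xs)
Pn_sq = sum((f(x) - g(x)) ** 2 for x in xs) / n
print(abs(var_xi - Pn_sq) < 1e-12)  # True
```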
Three lemmas precede the proof of Theorem 2.10.8. The first is a com-
parison principle of independent interest.

2.10.16 Lemma. Let F be a Donsker class with ‖P‖_F < ∞. Then the class
F² = {f²: f ∈ F} is Glivenko-Cantelli in probability: ‖P_n − P‖*_{F²} → 0 in
probability. If, in addition, P*F² < ∞ for some envelope function F, then F²
is also Glivenko-Cantelli almost surely and in mean.

Proof. Let ξ_1, ξ_2, . . . be i.i.d. random variables with a standard normal
distribution independent of X_1, X_2, . . . . Given fixed values X_1, . . . , X_n, the
stochastic process Z_n(f) = n^{−1/2} Σ_{i=1}^n ξ_i f(X_i) is a Gaussian process with
zero mean and variance var_ξ Z_n(f) = P_n f². Thus

      ‖P_n f²‖_F^{1/2} = √(π/2) ‖E_ξ|Z_n(f)|‖_F ≤ √(π/2) E_ξ‖Z_n‖_F.

Since F is Donsker and ‖P‖_F < ∞, the sequence Z_n converges (uncondi-
tionally) to a tight Gaussian process Z by the unconditional multiplier the-
orem, Theorem 2.9.2. By Lemma 2.3.11, this implies that E*‖Z_n‖_F = O(1).
Conclude that for every ε > 0 there exist constants M and N such that
with inner probability at least 1 − ε: P_n f² ≤ M for every f ∈ F and every
n ≥ N. Since F is bounded in L_2(P), one can also ensure that Pf² ≤ M
for every f ∈ F.
For fixed M and any σ², τ² ≤ M, the bounded Lipschitz distance
between the normal distributions N(0, σ²) and N(0, τ²) is bounded below
by K|σ² − τ²| for a constant K depending on M only (Problem 2.10.2).
Conclude that with inner probability at least 1 − ε and n ≥ N,

      K|P_n f² − Pf²| ≤ sup_{h∈BL_1(R)} |E_ξ h(Z_n(f)) − Eh(Z(f))|.

The right side is bounded by the bounded Lipschitz distance between the
processes Z_n and Z. Conclude that with inner probability at least 1 − ε,

      K‖P_n f² − Pf²‖_F ≤ sup_{h∈BL_1} |E_ξ h(Z_n) − Eh(Z)|.

The right side converges to zero in outer probability by the conditional
multiplier theorem, Theorem 2.9.6.
Under the condition P*F² < ∞, the submartingale argument in the
proof of Theorem 2.4.3 (based on Lemma 2.4.6) applies and shows that the
convergence in probability can be strengthened to convergence in mean and
almost surely.

2.10.17 Lemma. If F is pre-Gaussian, then ε² log N(ε, F, ρ_P) → 0 as ε →
0. If, in addition, ‖P‖_F < ∞, then also ε² log N(ε, F, L_2(P)) → 0.

Proof. For any partition F = ∪_{i=1}^m F_i and ε > 0, one has N(ε, F, ρ_P) ≤
Σ_{i=1}^m N(ε, F_i, ρ_P). Bound the sum by m times the maximum of its terms,
and take logarithms to obtain

      ε √(log N(ε, F, ρ_P)) ≤ ε √(log m) + ε max_{i≤m} √(log N(ε, F_i, ρ_P)).

For fixed δ > 0, choose a partition into m = N(δ, F, ρ_P) sets of diam-
eter at most δ. Take an arbitrary element f_i from each partitioning set.
Let G be a separable Brownian bridge process. According to Sudakov's in-
equality A.2.6, ε √(log N(ε, G, ρ_P)) ≤ 3E*‖G‖_G for any set of functions G
and ε. Since N(ε, F_i, ρ_P) = N(ε, F_i − f_i, ρ_P), the preceding display can be
bounded by

      (3ε/δ) E*‖G‖_F + 3 max_{i≤m} E*‖G‖_{F_i − f_i} ≤ (3ε/δ) E*‖G‖_F + 3E*‖G‖_{F_δ},

where F_δ = {f − g: ρ_P(f − g) < δ, f, g ∈ F}. Since F is pre-Gaussian,
G can be chosen bounded and uniformly continuous. This implies that
E*‖G‖_F < ∞ and E*‖G‖_{F_δ} → 0 as δ → 0. Thus, the first term on the
right is finite and converges to zero as ε ↓ 0. The second term can be made
arbitrarily small by choosing δ sufficiently small.
The second assertion of the lemma is proved in a similar way, now
using a Brownian motion process G + ξP (where ξ is standard normally
distributed and independent of G) instead of G (cf. Problem 2.1.4).

2.10.18 Lemma. Let Z_1, . . . , Z_m be separable, mean-zero Gaussian pro-
cesses indexed by arbitrary sets T_i. Then

      E max_{1≤i≤m} ‖Z_i‖_{T_i} ≤ K (max_{1≤i≤m} E‖Z_i‖_{T_i} + √(log m) max_{1≤i≤m} ‖σ(Z_{i,t})‖_{T_i}),

for a universal constant K.



Proof. Without loss of generality, assume that m ≥ 2. Set σ_i = ‖σ(Z_{i,t})‖_{T_i}
and Y_i = ‖Z_{i,t}‖_{T_i}. By Borell's inequality A.2.1,

      (2.10.19)   P(|Y_i − M_i| > x) ≤ e^{−x²/(2σ_i²)},

where M_i is a median of the random variable Y_i. By Lemma 2.2.1, the
variable Y_i − M_i has Orlicz norm ‖Y_i − M_i‖_{ψ_2} bounded by a multiple of σ_i.
In view of the triangle inequality,

      E max_i ‖Z_i‖_{T_i} ≤ E max_i |Y_i − M_i| + max_i M_i ≲ √(log m) max_i σ_i + max_i M_i,

by Lemma 2.2.2. Integration of (2.10.19) (or another application of the in-
equality ‖Y_i − M_i‖_{ψ_2} ≲ σ_i) yields E|Y_i − M_i| ≲ σ_i. The lemma follows upon
using this inequality to bound max_i M_i in the last displayed equation.

Proof of Theorem 2.10.8. The square integrability of φ ◦ (f1 , . . . , fk )


for one element of F1 × · · · × Fk and the Lipschitz condition imply square
integrability of every function of this type. This ensures marginal conver-
gence.

Let ξ_1, ξ_2, . . . , ξ_n be i.i.d. random variables with a standard normal
distribution, independent of X_1, . . . , X_n. Given fixed values X_1, . . . , X_n,
the process

      H_n(f) = n^{−1/2} Σ_{i=1}^n ξ_i φ ◦ f(X_i),   f = (f_1, . . . , f_k),

is zero-mean Gaussian with "intrinsic" semimetric σ_ξ(H_n(f) − H_n(g)) equal
to the L_2(P_n) semimetric ‖φ ◦ f − φ ◦ g‖_n. Because each F_l is Donsker and
each ‖P‖_{F_l} < ∞, the set F = F_1 × ··· × F_k is totally bounded in the product
space L_2(P)^k. For given δ > 0, let F = ∪_{j=1}^m H_j be a partition of F into sets
of diameter less than δ with respect to the supremum product metric. The
number of sets in the partition can be chosen as m ≤ Π_{l=1}^k N(δ, F_l, L_2(P)).
Choose an arbitrary element h_j from each H_j. Given the same X_1, . . . , X_n
as before, the processes {H_n(f) − H_n(h_j): f ∈ H_j} possess the same intrinsic
semimetric as H_n. By the preceding lemma,

      (2.10.20)   E_ξ max_j sup_{H_j} |H_n − H_n(h_j)| ≲ max_{j≤m} E_ξ‖H_n − H_n(h_j)‖_{H_j}
                                                      + √(log m) max_{j≤m} sup_{f∈H_j} ‖φ ◦ f − φ ◦ h_j‖_n.

For each l, let ξ_{l1}, . . . , ξ_{ln} be i.i.d. random variables with a standard nor-
mal distribution, chosen independently for different l. Define G_n(f) =
Σ_{l=1}^k n^{−1/2} Σ_{i=1}^n ξ_{li} f_l(X_i). In view of (2.10.7),

      var_ξ(H_n(f) − H_n(g)) ≤ Σ_{l=1}^k n^{−1} Σ_{i=1}^n (f_l − g_l)²(X_i) = var_ξ(G_n(f) − G_n(g)).

By Slepian's lemma A.2.5, the expectation E_ξ‖H_n − H_n(h_j)‖_{H_j} is bounded
above by two times the same expression, but with G_n instead of H_n. Make
this substitution in the right side of (2.10.20), and conclude that the left
side of (2.10.20) is bounded up to a constant by

      E_ξ sup_{e_P(f,h)<δ} |G_n(f) − G_n(h)| + √(log m) sup_{e_P(f,h)<δ} ‖f − h‖_n.

(Here e_P and ‖f − h‖_n denote product metrics corresponding to ‖·‖_{P,2} and
‖·‖_n.) Since each of the classes F_l is Donsker and satisfies ‖P‖_{F_l} < ∞,
the sequence G_n is asymptotically equicontinuous with respect to e_P. (See
Theorem 2.9.2 and Problem 2.1.2.) Hence the outer expectation of the first
term converges to zero as n → ∞ followed by δ ↓ 0. For the second term,
note that the classes F_l − F_l are Donsker by Theorem 2.10.5, so that their
squares (F_l − F_l)² are Glivenko-Cantelli in probability, by Lemma 2.10.16.
Thus, as n → ∞,

      √(log m) sup_{e_P(f,h)<δ} ‖f − h‖_n →^{P*} √(log m) sup_{e_P(f,h)<δ} e_P(f, h)
                                          ≤ Σ_{l=1}^k √(log N(δ, F_l, L_2(P))) δ.

This converges to zero as δ ↓ 0, by Lemma 2.10.17. Combination of the
preceding displays gives that the left side of (2.10.20) converges to zero in
probability as n → ∞ and next as δ ↓ 0.
Suppose that the processes H_n are separable, so that the variables
‖H_n − H_n(h_j)‖_{H_j} are measurable. Then it can be concluded that for all
positive numbers ε and η there exists a finite partition F = ∪_{j=1}^m H_j such
that

      lim sup_{n→∞} P(sup_j ‖H_n − H_n(h_j)‖_{H_j} > ε) < η.

This implies that the sequence H_n is asymptotically tight in ℓ∞(F), by
Theorem 1.5.6.
Define a map T: φ ◦ F → F by assigning to each function g ∈ φ ◦ F
a unique (but arbitrary) vector f = Tg ∈ F such that φ(f) = g. The
map z ↦ z ◦ T from ℓ∞(F) to ℓ∞(φ ◦ F) is (Lipschitz) continuous. Hence,
by the continuous mapping theorem, the sequence of multiplier empirical
processes G′_n = H_n ◦ T converges weakly to a tight limit in ℓ∞(φ ◦ F).
Since ‖P‖_{φ◦F} < ∞, it follows that the sequence of processes
n^{−1/2} Σ_{i=1}^n ξ_i(δ_{X_i} − P) converges weakly to a tight limit as well. By Theo-
rem 2.9.2, the same is true for the sequence of empirical processes G_n.
Finally, we remove the assumption that the processes H_n are separable.
According to Chapter 2.3.3, there exist pointwise-separable versions F̃_l of
the classes F_l. Almost every value f̃_l(x) is the limit of a sequence g̃_{m,l}(x)
with each g̃_{m,l} contained in a countable "separant" of F̃_l. Each element g̃_l
of the separant is almost surely equal to an element g_l of the original class
F_l. It follows that, except possibly for x in a null set, the function φ is
defined at each vector f̃(x) or can be extended to this vector by continuity.
The extended function satisfies (2.10.7) on F̃ ∪ F except perhaps for x in a
fixed null set. Then the class of functions φ ◦ F̃ is (essentially) well defined
and is a pointwise-separable version of φ ◦ F. By the preceding argument,
φ ◦ F̃ is Donsker. By (2.10.7),

      √n ‖P_n φ(f) − P_n φ(f̃)‖_F ≲ Σ_{l=1}^k √n ‖P_n|f_l − f̃_l|‖_{F_l}.

According to Theorem 2.3.15, this converges to zero in probability. Thus
φ ◦ F is Donsker also.

2.10.4 Permanence of the Uniform Entropy Bound


The main theorem of the preceding subsection combines the minimal condi-
tion that the classes F_1, . . . , F_k are Donsker with the strong condition that
φ is uniformly Lipschitz. It may be expected that stronger assumptions on
the classes F_i combined with weaker conditions on φ will yield the same
conclusion that the class φ(F_1, . . . , F_k) is Donsker. Obviously, a concrete
formulation of this principle is useful only if it is sufficiently simple, because
otherwise it is preferable to deal with the class φ(F_1, . . . , F_k) directly. Uni-
form entropy, integral-type conditions allow such a simple statement.
Let F_1, . . . , F_k be classes of measurable functions and φ: R^k → R a
given map that is Lipschitz of orders α_1, . . . , α_k ∈ (0, 1] in the sense that

      (2.10.21)   |φ ◦ f(x) − φ ◦ g(x)|² ≤ Σ_{i=1}^k L²_{α,i}(x) |f_i(x) − g_i(x)|^{2α_i},

for all f = (f_1, . . . , f_k) and g = (g_1, . . . , g_k) in F = F_1 × ··· × F_k and for
every x. In the special case that k = 1, we can set

      L_α = sup_{f,g} |φ(f) − φ(g)| / |f − g|^α.
This relaxes condition (2.10.7) in two ways: the functions L_{α,i} need not be
bounded, and a lower-order Lipschitz condition suffices. In the following
theorem, the first weakening of (2.10.7) is compensated by a moment con-
dition on L_α; the second by a strengthening of the entropy condition on
the F_i.
If F_i has envelope function F_i, then for every pair f and f_0 in F,

      |φ ◦ f − φ ◦ f_0|² ≤ 4 Σ_{i=1}^k L²_{α,i} F_i^{2α_i}.

Thus, the function 2(Σ_{i=1}^k L²_{α,i} F_i^{2α_i})^{1/2}, which will be denoted 2L_α · F^α,
is an envelope function of the class φ(F) − φ(f_0). We use this function to
standardize the entropy numbers of the class φ ◦ F.

2.10.22 Theorem. Let F_1, . . . , F_k be classes of measurable functions with
measurable envelopes F_i, and let φ: R^k → R be a map that satisfies
(2.10.21). Then, for every δ > 0,

      ∫_0^δ sup_Q √(log N(ε‖L_α · F^α‖_{Q,2}, φ(F), L_2(Q))) dε
            ≤ Σ_{i=1}^k ∫_0^{δ^{1/α_i}} sup_Q √(log N(ε‖F_i‖_{Q,2α_i}, F_i, L_{2α_i}(Q))) dε/ε^{1−α_i}.

Here the supremum is taken over all finitely discrete probability measures
Q. Consequently, if the right side is finite and P*(L_α · F^α)² < ∞, then the
class φ(F) is Donsker provided its members are square integrable and the
class is suitably measurable.

Proof. For a given finitely discrete probability measure Q, define measures
R_i by dR_i = L²_{α,i} dQ. Then

      ‖φ ◦ f − φ ◦ g‖²_{Q,2} ≤ Σ_i R_i |f_i − g_i|^{2α_i}.

For each i, construct a minimal ε^{1/α_i}‖F_i‖_{R_i,2α_i}-net G_i for F_i with respect
to the L_{2α_i}(R_i)-norm. Then the set of points φ(g_1, . . . , g_k) with (g_1, . . . , g_k)
ranging over all possible combinations of g_i from G_i forms a net for φ(F)
of L_2(Q)-size bounded by the square root of

      Σ_i ε² ‖F_i‖^{2α_i}_{R_i,2α_i} = ε² Q Σ_i L²_{α,i} F_i^{2α_i}.

This equals ε²‖L_α · F^α‖²_{Q,2}. Conclude that

      N(ε‖L_α · F^α‖_{Q,2}, φ(F), L_2(Q)) ≤ Π_{i=1}^k N(ε^{1/α_i}‖F_i‖_{R_i,2α_i}, F_i, L_{2α_i}(R_i)).

Since expressions of the type N(ε‖F‖_{cR,r}, F, L_r(cR)) are constant in c, the
bound is also valid with R_i replaced by the probability measure R_i/R_i 1.
If Q ranges over all finitely discrete measures, then each of the R_i runs
through finitely discrete measures. Substitute the bound of the preceding
display in the left side of the theorem, and next make a change of variables
ε^{1/α_i} ↦ ε to conclude the proof.

2.10.23 Example. If F satisfies the uniform entropy condition and


P ∗ F 2k < ∞ for k > 1, then F k is Donsker if suitably measurable. In
comparison with Theorem 2.10.8, this result has traded a stronger entropy
condition for a weaker (almost optimal) condition on the envelope function:
Theorem 2.10.8 shows that F k is Donsker if F is Donsker and uniformly
bounded.
The present result follows since |f k − g k | ≤ |f − g|(2F )k−1 , so that
(2.10.21) applies with α = 1 and L = (2F )k−1 .
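The elementary inequality driving this example can be spot-checked numerically. The following sketch is illustrative only; the particular functions, envelope, and grid are arbitrary choices, not from the text. It verifies |f^k − g^k| ≤ |f − g|(2F)^{k−1} pointwise for functions bounded in absolute value by F.

```python
# Spot-check of |f^k - g^k| <= |f - g| * (2F)^(k-1) whenever |f|, |g| <= F.
# The particular f, g, F below are arbitrary illustrations.
import math

def power_lipschitz_holds(f, g, F, k, xs, tol=1e-12):
    for x in xs:
        fx, gx, Fx = f(x), g(x), F(x)
        assert abs(fx) <= Fx and abs(gx) <= Fx, "F must envelope f and g"
        if abs(fx ** k - gx ** k) > abs(fx - gx) * (2 * Fx) ** (k - 1) + tol:
            return False
    return True

xs = [i / 100 for i in range(101)]          # grid on [0, 1]
f = lambda x: math.sin(5 * x)               # |f| <= 1 everywhere
g = lambda x: x ** 2 - 0.5                  # |g| <= 1 on [0, 1]
F = lambda x: 1.0                           # common envelope on [0, 1]
print(all(power_lipschitz_holds(f, g, F, k, xs) for k in (2, 3, 5)))
```

The bound itself comes from factoring f^k − g^k = (f − g) Σ_{j<k} f^j g^{k−1−j} and noting |Σ_{j<k} f^j g^{k−1−j}| ≤ kF^{k−1} ≤ (2F)^{k−1}.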

2.10.24 Example. Let the functions in F be nonnegative with lower envelope function F̲, and consider the class √F. If F satisfies the uniform-entropy condition and P*F²/F̲ < ∞, then √F is Donsker if suitably measurable. This follows since |√f − √g| ≤ |f − g|/√F̲.
The conclusion that √F is Donsker can also be obtained under other combinations of moment and entropy conditions. First, Theorem 2.10.8 shows that √F is Donsker if F is Donsker and F̲ ≥ δ > 0. This combines a stronger assumption on the lower envelope with a weaker entropy condition. Second, the present theorem can be applied with a different Lipschitz order. For 1/2 ≤ α ≤ 1,

    |√f − √g| / |f − g|^α = |√f − √g|^{1−α} / (√f + √g)^α ≤ (2√F̲)^{1−2α}.

Thus, a suitably measurable class √F is Donsker if, for some α ≥ 1/2, the upper and lower envelopes of F satisfy P*F^{2α}F̲^{1−2α} < ∞ and, in addition, F satisfies

    ∫₀^∞ sup_Q √(log N(ε‖F‖_{Q,2α}, F, L_{2α}(Q))) dε/ε^{1−α} < ∞.

It appears that finiteness of the integral is more restrictive than a finite uniform L₂-entropy integral (though the supremum in the integrand is increasing in α). For polynomial classes F and more generally classes with uniform L₁-entropy of the order (1/ε)^r with r < 1, the integral is certainly finite for α = 1/2. For such classes, the class √F is Donsker provided its envelope √F is square integrable.

2.10.25 Example. Let F and G satisfy the uniform-entropy condition and


be suitably measurable. Then FG is Donsker provided the envelopes F and
G satisfy P ∗ F 2 G2 < ∞.
This follows from the theorem with Lα = (G, F ) and Lipschitz orders
α = (1, 1).

2.10.5 Partitions of the Sample Space


Let X = ∪∞ j=1 Xj be a partition of X into measurable sets, and let Fj be the
class of functions f 1Xj when f ranges over F. If the class F is Glivenko-
Cantelli or Donsker, then each class Fj is Glivenko-Cantelli or Donsker.
This follows from Theorems 2.10.1 and 2.10.8, because a restriction is a contraction in the sense that |(f1_{Xj})(x) − (g1_{Xj})(x)| ≤ |f(x) − g(x)| for every x. Consider the converse of these statements. While the sum of infinitely many Glivenko-Cantelli or Donsker classes need not be Glivenko-Cantelli or Donsker, it is clear that if each Fj is Glivenko-Cantelli or Donsker and the classes Fj become suitably small as j → ∞, then F is Glivenko-Cantelli or Donsker.
The following theorems make this precise. For the Glivenko-Cantelli
property “suitably small” can be simply taken to be the (almost necessary)
condition that F possesses an integrable envelope function.

2.10.26 Theorem. If the restriction Fj of F to Xj is P -Glivenko-Cantelli


for each j and F has an integrable envelope function, then F is itself P -
Glivenko-Cantelli.
Proof. Since f = Σ_{j=1}^∞ f1_{Xj}, it follows that

    E*‖Pn − P‖_F ≤ Σ_{j=1}^∞ E*‖Pn − P‖_{Fj}.

Each term of the series on the right tends to zero by the hypothesis that the classes Fj are P-Glivenko-Cantelli. Furthermore, the jth term is dominated by E*Pn(F1_{Xj}) + P(F1_{Xj}) ≤ 2P(F1_{Xj}), where Σ_{j=1}^∞ P(F1_{Xj}) = P(F) < ∞. Thus the right side of the display tends to zero by the dominated convergence theorem.
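The triangle inequality driving this proof holds sample by sample and is easy to watch in a small simulation. The sketch below is illustrative only: it takes uniform observations, F the half-line indicators on [0, 1] (so ‖Pn − P‖_F is the Kolmogorov-Smirnov statistic), and a partition of [0, 1) into J equal intervals, and checks ‖Pn − P‖_F ≤ Σⱼ ‖Pn − P‖_{Fⱼ} on one sample.

```python
# Check ||P_n - P||_F <= sum_j ||P_n - P||_{F_j} for F = {1_[0,t]: t in [0,1]},
# P uniform on [0,1), and the partition X_j = [(j-1)/J, j/J).  Illustrative only.
import random

random.seed(0)
n, J = 500, 8
xs = [random.random() for _ in range(n)]
ts = [i / 200 for i in range(201)]  # grid of thresholds t

def pn_minus_p(t, lo, hi):
    """(P_n - P) applied to the restricted function 1_[0,t] * 1_[lo,hi)."""
    emp = sum(1 for x in xs if lo <= x < hi and x <= t) / n
    pop = max(min(hi, t) - lo, 0.0)
    return emp - pop

lhs = max(abs(pn_minus_p(t, 0.0, 1.0)) for t in ts)
rhs = sum(max(abs(pn_minus_p(t, (j - 1) / J, j / J)) for t in ts)
          for j in range(1, J + 1))
print(lhs <= rhs)
```

The inequality is deterministic (it is just |Σⱼ aⱼ| ≤ Σⱼ supⱼ|aⱼ| applied cell by cell), so the printed value does not depend on the seed.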

For the Donsker property we impose a stronger, but still simple con-
dition.

2.10.27 Theorem. For each j, let the class Fj , the restriction of F to Xj ,


be Donsker and satisfy

    E*‖Gn‖_{Fj} ≤ C cj,

for a constant C not depending on j or n. If Σ_{j=1}^∞ cj < ∞ and P*F < ∞
for some envelope function, then the class F is Donsker.

Proof. We can assume without loss of generality that the class F contains
the constant function 1.
Since Fj is Donsker, the sequence of empirical processes indexed by Fj converges in distribution in ℓ∞(Fj) to a tight Brownian bridge Hj for each j. This implies that E*‖Gn‖_{Fj} → E‖Hj‖_{Fj}. Thus, E‖Hj‖_{Fj} ≤ Ccj. Let
Zj be a standard normal variable independent of Hj , constructed on the
same probability space. Since supf ∈Fj |P f | < ∞, the process f 7→ Zj (f ) =
Hj (f ) + Zj P f is well defined on Fj and takes its values in `∞ (Fj ). We can
construct these processes for different j as independent random elements on a single probability space. Then the series Z(f) = Σ_{j=1}^∞ Zj(f1_{Xj}) converges in second mean for every f, and satisfies EZ(f)Z(g) = Pfg, for every f and g. Thus, the series defines a version of a Brownian motion process. Since each of the processes {Zj(f1_{Xj}): f ∈ F} has bounded and uniformly continuous sample paths with respect to the L₂(P)-seminorm, so have the partial sums Z_{≤k} = {Σ_{j=1}^k Zj(f1_{Xj}): f ∈ F}. Furthermore,
    E sup_{f∈F} |Σ_{j>k} Zj(f1_{Xj})| ≤ Σ_{j>k} E‖Hj‖_{Fj} + Σ_{j>k} E|Zj| P*F1_{Xj}
        ≤ C Σ_{j>k} cj + √(2/π) P*F1_{∪_{j>k}Xj}.

This converges to zero as k → ∞. Thus the series Z(f) = Σ_{j=1}^∞ Zj(f1_{Xj}) converges in mean in the space ℓ∞(F). By the Itô-Nisio theorem, it also
converges almost surely. Conclude that almost all sample paths of the process Z are bounded and uniformly continuous. Since E‖Z‖_F < ∞, the class F is totally bounded in L₂(P), by Sudakov's inequality. Hence Z is a tight version of a Brownian motion process indexed by F. The process G(f) = Z(f) − Z(1)Pf defines a tight Brownian bridge process indexed by F. We have proved that F is pre-Gaussian.
For each k set G_{n,≤k}(f) = Gn(f1_{∪_{j≤k}Xj}). The continuity modulus of the process {G_{n,≤k}(f): f ∈ F} is bounded by the continuity modulus of the empirical process indexed by the sum class Σ_{j≤k} Fj. Since the sum of finitely many Donsker classes is Donsker, we can conclude that the sequence (G_{n,≤k})_{n=1}^∞ is asymptotically tight in ℓ∞(F). Considering the marginal distributions, we can conclude that the sequence converges in distribution to

    G_{≤k}(f) = Z_{≤k}(f) − d_k Z_{≤k}(1) P(f1_{∪_{j≤k}Xj}),



as n → ∞ for each fixed k, for d_k a solution to the equation d²P(∪_{j≤k}Xj) − 2d = −1.
As k → ∞, the sequence G≤k converges almost surely, whence in distri-
bution to G in `∞ (F). Since weak convergence to a tight limit is metrizable,
and the array Gn,≤k converges along every row to limits that converge to
G, there exist integers kn → ∞ such that the sequence Gn,≤kn converges in
distribution to G. We also have that the sequence Gn − Gn,≤kn converges
in outer probability to zero, since

    E*‖Gn − G_{n,≤kn}‖_F ≤ C Σ_{j>kn} cj → 0.

An application of Slutsky’s lemma completes the proof.

Methods to obtain maximal inequalities for the L1 -norm are discussed


in Section 2.14.1. We give three applications of the preceding theorem.

2.10.28 Example (Smooth functions). Let ℝᵈ = ∪_{j=1}^∞ Xj be a partition of ℝᵈ into uniformly bounded, convex sets with nonempty interior. Consider the class F of functions such that the class Fj of restrictions is contained in C^α_{Mj}(Xj) for each j, for given constants Mj.
If α > d/2 and Σ_{j=1}^∞ Mj P(Xj)^{1/2} < ∞, then the class F is P-Donsker. This follows by combining Theorem 2.10.27 with the (last) maximal inequality in Theorem 2.14.15, and the entropy estimate Theorem 2.7.1.

2.10.29 Example (Closed convex sets). Let ℝ² = ∪_{j=1}^∞ Xj be a partition into squares of fixed size. Let C be the class of closed convex subsets of ℝ².
Then C is P-Donsker for every probability measure P with a density p such that Σⱼ ‖p‖_{Xj}^{1/2} < ∞. In particular, this is the case if [(1 + |x|)(1 + |y|)]^{2+δ} p(x, y) is bounded for some δ > 0.

2.10.30 Example (Monotone functions). Consider the class F of all


nondecreasing functions f : R → R, such that 0 ≤ f ≤ F , for a given
nondecreasing function F . This class is Donsker provided kF k2,1 < ∞.
To see this, first reduce the problem to the case of uniform [0,1] ob-
servations by the quantile transformation. If P −1 is the quantile function
corresponding to the underlying measure P , then the class F ◦ P −1 is
Donsker with respect to the uniform measure if and only if F is P -Donsker.
The class G = F ◦ P −1 consists of monotone functions g: [0, 1] → R, with
0 ≤ g ≤ F ◦ P^{-1}. Furthermore,

    ∫₀¹ (F ◦ P^{-1})(u)/√(1 − u) du ≤ ∫ F(x)/√(1 − P(−∞, x]) dP(x).
By partial integration the latter can be shown to be finite if and only if
kF k2,1 is finite. For simplicity of notation assume now that P is the uniform
measure.

Use the theorem with the partition into the intervals Xj = (xj , xj+1 ],
with xj = 1 − 2−j for each integer j ≥ 0. The condition of the theorem
involves the series

    Σ_{j=0}^∞ ‖Fj‖_{P,2} ≤ Σ_{j=1}^∞ [F(1 − 2^{−j})/√(1 − (1 − 2^{−j}))] 2^{−j} ≤ 2 ∫₀¹ F(u)/√(1 − u) du.

The result follows by the quantile transformation.
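For a concrete envelope the chain of bounds in this example can be evaluated numerically. The sketch below uses the illustrative choice F(u) = (1 − u)^{−1/4} under the uniform measure on [0, 1] (an assumption for the sake of the example, not from the text), for which every quantity in the display has a closed form, and confirms Σⱼ ‖Fⱼ‖_{P,2} ≤ Σⱼ F(1 − 2^{−j}) 2^{−j}/√(2^{−j}) ≤ 2 ∫₀¹ F(u)/√(1 − u) du.

```python
# Numeric check of the partition bound of Example 2.10.30 for the
# illustrative envelope F(u) = (1 - u)^(-1/4), P uniform on [0, 1].
import math

# X_j = (1 - 2^-j, 1 - 2^-(j+1)], so 1 - u ranges over [2^-(j+1), 2^-j) there.
def norm_Fj(j):
    # ||F_j||_{P,2}^2 = integral over X_j of (1-u)^(-1/2) du,
    # which is 2*sqrt(2^-j) - 2*sqrt(2^-(j+1)) in closed form.
    return math.sqrt(2 * 2.0 ** (-j / 2) - 2 * 2.0 ** (-(j + 1) / 2))

J = 200  # truncation level of the series; the terms decay geometrically
series = sum(norm_Fj(j) for j in range(J))
# middle term: F(1 - 2^-j) * 2^-j / sqrt(2^-j) simplifies to 2^(-j/4)
middle = sum(2.0 ** (-j / 4) for j in range(1, J + 1))
integral = 4.0  # closed form: integral of (1-u)^(-3/4) over [0, 1]
print(series <= middle <= 2 * integral)
```

Working with the exponents directly (rather than forming 1 − u near u = 1) avoids floating-point cancellation for large j.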

Problems and Complements


1. Let {ψi } be an orthonormal base of a subspace of L2 (P ) that contains
F and the constant functions, and let Z₁, Z₂, . . . be i.i.d. standard normally distributed random variables. If F is P-pre-Gaussian, then the series Σ_{i=1}^∞ Zᵢ P((f − Pf)ψᵢ) is almost surely convergent in ℓ∞(F) and represents a tight Brownian bridge process.
[Hint: The series converges in second mean for every single f, and the limit process (series) {G(f): f ∈ L₂(P)} is a well-defined stochastic process that is continuous in second mean with respect to ρ_P. Since F is separable, the latter implies that {G(f): f ∈ F} is a separable process. A separable version of a uniformly continuous process is automatically uniformly continuous. Thus, {G(f): f ∈ F} defines a tight Brownian bridge.
Let Gn be the nth partial sum of the series, and set F_δ = {f − g: f, g ∈ F, ρ_P(f − g) < δ}. By Jensen's inequality, E‖Gn‖_{F_δ} ≤ E‖G‖_{F_δ}. Since F is pre-Gaussian, the right side converges to zero as δ ↓ 0. Hence the sequence Gn is uniformly tight in ℓ∞(F). Since Gn − G ⇝ 0 marginally, it follows that Gn − G ⇝ 0 in ℓ∞(F). By the Itô-Nisio theorem A.1.3, it converges almost surely also.]

2. The bounded Lipschitz distance between the normal distributions N (0, σ 2 )


and N (0, τ 2 ) on the line is bounded below by |σ − τ | g(1 ∧ τ −1 ∧ σ −1 ) for
g(t) = t2 φ(t) and φ the standard normal density.
[Hint: For 0 ≤ c ≤ 1, the function h(x) = (c − |x|) ∨ 0 is contained in BL₁(ℝ). For σ < τ, the integral ∫(h(xσ) − h(xτ))φ(x) dx is bounded below by φ(c/τ) times ∫_{−c/τ}^{c/τ} (h(xσ) − h(xτ)) dx = (τ − σ)(c/τ)². Take c = τ ∧ 1.]

3. For any class F of functions with envelope function F and 0 < r < s < ∞,

    sup_Q N(2ε‖F‖_{Q,s}, F, L_s(Q)) ≤ sup_Q N(2ε^{s/r}‖F‖_{Q,r}, F, L_r(Q)),

if the supremum is taken over all finitely discrete measures.
[Hint: If dR = (2F)^{s−r} dQ, then ‖f − g‖_{Q,s} ≤ ‖f − g‖_{R,r}^{r/s}.]

4. For any class F of functions with strictly positive envelope function F, the expression sup_Q N(ε‖F‖_{Q,r}, F, L_r(Q)), where the supremum is taken over all finitely discrete measures, is nondecreasing in 0 < r < ∞. (Here L_r(Q) is equipped with (Q|f|^r)^{1/r} even though this is not a norm for r < 1.)
[Hint: Given r < s and Q, define dR = F^{r−s} dQ. Then Q|f − g|^r ≤ ‖f − g‖_{R,s}^r ‖F‖_{R,s}^{s−r}, by Hölder's inequality with (p, q) = (s/r, s/(s − r)).]


5. The expression N(ε‖aF‖_{bQ,r}, aF, L_r(bQ)) is constant in a and b.
6. If F is P -pre-Gaussian and A: L2 (P ) → L2 (P ) is continuous, then the class
{Af : f ∈ F} is P -pre-Gaussian.
7. If F is P -Donsker and A: L2 (P ) → L2 (P ) is an orthogonal projection, then
the class {Af : f ∈ F} is not necessarily P -Donsker.
8. If both F and G are Donsker, then the class F · G of products x 7→ f (x)g(x)
is Glivenko-Cantelli in probability.
2.11
Central Limit Theorem
for Processes

So far we have focused on limit theorems and inequalities for the empirical
process of independent and identically distributed random variables. Most
of the methods of proof apply more generally. In this chapter we indicate
some extensions to the case of independent but not identically distributed
processes.
We consider the general situation of sums of independent stochastic processes {Zni(f): f ∈ F} with bounded sample paths indexed by an arbitrary set F.

2.11.1 Random Entropy


For each n, let Zn1, . . . , Zn,mn be independent stochastic processes indexed by a common (arbitrary) semimetric space (F, ρ). As usual, for computation of outer expectations, the independence is understood in the sense that the processes are defined on a product probability space Π_{i=1}^{mn} (Xni, Ani, Pni) with each Zni(f) = Zni(f, x) depending only on the ith coordinate of x = (x₁, . . . , x_{mn}). In the first theorem it is assumed that every one of the maps

    (x₁, . . . , x_{mn}) ↦ sup_{ρ(f,g)<δ} |Σ_{i=1}^{mn} eᵢ(Zni(f) − Zni(g))|,

    (x₁, . . . , x_{mn}) ↦ sup_{ρ(f,g)<δ} Σ_{i=1}^{mn} eᵢ(Zni(f) − Zni(g))²,

is measurable, for every δ > 0, every vector (e₁, . . . , e_{mn}) ∈ {−1, 0, 1}^{mn}, and every natural number n. (Measurability for the completion of the probability spaces Π_{i=1}^{mn} (Xni, Ani, Pni) suffices.)

Define a random semimetric by

    dn²(f, g) = Σ_{i=1}^{mn} (Zni(f) − Zni(g))².

The simplest result on marginal convergence of the processes Σ_{i=1}^{mn} Zni is the Lindeberg central limit theorem. The following theorem gives a uniform central limit theorem under a Lindeberg condition on norms combined with a condition on entropies with respect to the random semimetric dn, a “random entropy condition.”

2.11.1 Theorem. For each n, let Zn1, . . . , Zn,mn be independent stochastic processes indexed by a totally bounded semimetric space (F, ρ). Assume that the sums Σ_{i=1}^{mn} eᵢZni are measurable as indicated and that

    Σ_{i=1}^{mn} E*‖Zni‖²_F 1{‖Zni‖_F > η} → 0,  for every η > 0,

    sup_{ρ(f,g)<δn} Σ_{i=1}^{mn} E(Zni(f) − Zni(g))² → 0,  for every δn ↓ 0,

(2.11.2)    ∫₀^{δn} √(log N(ε, F, dn)) dε →^{P*} 0,  for every δn ↓ 0.

Then the sequence Σ_{i=1}^{mn} (Zni − EZni) is asymptotically ρ-equicontinuous. It converges in distribution in ℓ∞(F) provided the sequence of covariance functions converges pointwise on F × F.

Proof. The Lindeberg condition for norms implies the Lindeberg condition
for marginals. Together with the assumption that the covariance function
converges, this gives marginal weak convergence (to a Gaussian process).
Set Z°ni = Zni − EZni, and let δn ↓ 0 be arbitrary. For fixed t and sufficiently large n, Chebyshev's inequality and the second condition give the bound P(|Σ_{i=1}^{mn} (Z°ni(f) − Z°ni(g))| > t/2) ≤ 1/2 for every pair f, g with ρ(f, g) < δn. By the symmetrization lemma, Lemma 2.3.7, for sufficiently large n,

    P*( sup_{ρ(f,g)<δn} |Σ_{i=1}^{mn} (Z°ni(f) − Z°ni(g))| > t )
        ≤ 4 P( sup_{ρ(f,g)<δn} |Σ_{i=1}^{mn} εᵢ(Zni(f) − Zni(g))| > t/4 ).

For fixed values of the processes Zn1, . . . , Zn,mn, define the subset An ⊂ ℝ^{mn} as the set of all vectors (Zn1(f) − Zn1(g), . . . , Zn,mn(f) − Zn,mn(g)) when the pairs (f, g) range over the set {(f, g) ∈ F × F: ρ(f, g) < δn}. By Hoeffding's
inequality, Lemma 2.2.7, the stochastic process {Σεᵢaᵢ: a ∈ An} is sub-Gaussian for the Euclidean metric on An. By Corollary 2.2.8,

    E_ε sup_{ρ(f,g)<δn} |Σ_{i=1}^{mn} εᵢ(Zni(f) − Zni(g))| ≲ ∫₀^∞ √(log N(ε, An, ‖·‖)) dε,

where ‖·‖ is the Euclidean norm. The integrand can be bounded using the inequality N(ε, An, ‖·‖) ≤ N²(ε/2, F, dn). Furthermore, if
    θn² = sup_{a∈An} Σ_{i=1}^{mn} aᵢ² = ‖Σ_{i=1}^{mn} aᵢ²‖_{An},

then, for ε > θn , the set An fits in the ball of radius ε around the origin
and the integrand vanishes. Conclude from this and the entropy condition
(2.11.2) that the integral converges to zero in outer probability if θn → 0 in
probability. Under the measurability assumption, this implies the asymptotic equicontinuity of the sequence Σ_{i=1}^{mn} Z°ni.
By the Lindeberg condition, there exists a sequence of numbers ηn ↓ 0 such that E*‖Σ aᵢ² 1{‖Zni‖*_F > ηn}‖_{An} converges to zero. Thus, for showing that θn →^{P*} 0, it is not a loss of generality to assume that each Zni satisfies ‖Zni‖_F ≤ ηn. Fix Zn1, . . . , Zn,mn, and take an ε-net Bn for An for the Euclidean norm. For every a ∈ An, there exists b ∈ Bn with
    Σεᵢaᵢ² = Σεᵢ(aᵢ − bᵢ)² + 2Σεᵢ(aᵢ − bᵢ)bᵢ + Σεᵢbᵢ² ≤ ε² + 2ε‖b‖ + Σεᵢbᵢ².
By Hoeffding's inequality, Lemma 2.2.7, the variable Σεᵢbᵢ² has Orlicz norm for ψ₂ bounded by a multiple of (Σbᵢ⁴)^{1/2} ≲ ηn(Σbᵢ²)^{1/2}. Apply Lemma 2.2.2 to the third term on the right, and replace suprema over Bn by suprema over An to obtain

    E_ε‖Σεᵢaᵢ²‖_{An} ≲ ε² + 2ε‖Σaᵢ²‖_{An}^{1/2} + √(1 + log|Bn|) ηn ‖Σaᵢ²‖_{An}^{1/2}.

The size of the ε-net can be chosen |Bn| ≤ N²(ε/2, F, dn). By the entropy condition (2.11.2), this variable is bounded in probability (Problem 2.11.1). Conclude that, for some constant K,

    P(‖Σεᵢaᵢ²‖_{An} > t) ≤ P*(|Bn| > M) + (K/t)[ε² + (ε + ηn√(log M)) (E‖Σaᵢ²‖_{An})^{1/2}].

For M and t sufficiently large, the right side is smaller than 1 − v1 for the
constant v1 in Hoffmann-Jørgensen’s Proposition A.1.5. More precisely, this
can be achieved for M such that P∗ (|Bn | > M ) ≤ (1 − v1 )/2 and t equal to
the numerator of the second term (i.e. K times the term in square brackets)

divided by (1 − v₁)/2. Then t is bigger than the v₁-quantile of the variable ‖Σεᵢaᵢ²‖_{An}. Thus, Proposition A.1.5 yields

    E‖Σεᵢaᵢ²‖_{An} ≲ E‖max_i aᵢ²‖_{An} + t
        ≲ ηn² + ε² + (ε + ηn√(log M)) (E‖Σaᵢ²‖_{An})^{1/2}.

By the second assumption of the theorem, ‖Σ Eaᵢ²‖_{An} → 0. Combine this with Lemma 2.3.6 to see that

    E‖Σaᵢ²‖_{An} ≤ E‖Σεᵢaᵢ²‖_{An} + o(1) ≤ ζ + ζ (E‖Σaᵢ²‖_{An})^{1/2},

for ζ > ε² ∨ ε and sufficiently large n. The inequality c ≤ ζ + ζ√c for a nonnegative number c implies that c ≤ 2ζ + 4ζ². Apply this to c = E‖Σaᵢ²‖_{An} and conclude that E‖Σaᵢ²‖_{An} → 0 as n tends to infinity.

2.11.3 Example (I.i.d. observations). The empirical process of a sample


X1 , . . . , Xn studied in the earlier sections of this part can be recovered by
setting Zni (f ) = n−1/2 f (Xi ). In this situation, F is a class of measurable
functions on the sample space.
Theorem 2.11.1 implies that the class F is Donsker if it is suitably
measurable, is totally bounded in L₂(P), possesses a square-integrable envelope, and satisfies, for every sequence δn ↓ 0,

    ∫₀^{δn} √(log N(ε, F, L₂(Pn))) dε →^P 0.

The latter random-entropy condition is certainly satisfied if F satisfies


the uniform-entropy condition (2.5.1) and the envelope function is square-
integrable. Thus, Theorem 2.11.1 is a generalization of Theorem 2.5.2.
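For the half-line indicators the random entropy in this condition can be computed exactly from the order statistics: ‖1_{(−∞,s]} − 1_{(−∞,t]}‖²_{L₂(Pn)} is the fraction of observations in (s, t], so an ε-net needs only about ε^{−2} functions, uniformly in the sample. A sketch (the sample size and ε values are arbitrary illustrative choices):

```python
# Size of an eps-net for {1_(-inf,t]} under L2(P_n): walk through the order
# statistics and start a new net point whenever the empirical mass since the
# previous one exceeds eps^2 (so every threshold is within eps of some center).
import random

random.seed(1)
n = 400
xs = sorted(random.random() for _ in range(n))  # actual values are irrelevant

def net_size(eps):
    centers, last = 1, 0
    for j in range(1, n + 1):  # j/n = empirical mass of (-inf, x_(j)]
        if (j - last) / n > eps ** 2:
            centers += 1
            last = j
    return centers

for eps in (0.5, 0.25, 0.1):
    print(eps, net_size(eps))
```

Because net_size(ε) ≤ ε^{−2} + 2 regardless of the sample, the random entropy integral ∫₀^δ √(log N(ε, F, L₂(Pn))) dε is bounded by a fixed finite quantity for this class, in line with the uniform-entropy condition.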

2.11.4 Example (Truncation). The Lindeberg condition on the norms


is certainly not necessary for the central limit theorem. In combination
with truncation, the preceding theorem applies to more general processes.
Consider stochastic processes Zn1, . . . , Zn,mn such that

    Σ_{i=1}^{mn} P*(‖Zni‖_F > η) → 0,  for every η > 0.

Then the truncated processes Zni,η(f) = Zni(f)1{‖Zni‖_F ≤ η} satisfy

    Σ_{i=1}^{mn} Zni − Σ_{i=1}^{mn} Zni,η →^{P*} 0,  in ℓ∞(F).

Since this is true for every η > 0, it is also true for every sequence
ηn ↓ 0 that converges to zero sufficiently slowly. The processes Zni,ηn cer-
tainly satisfy the Lindeberg condition. If they (or the centered processes
Zni,ηn − EZni,ηn) also satisfy the other conditions of the theorem, then the sequence Σ_{i=1}^{mn} (Zni − EZni,ηn) converges weakly in ℓ∞(F). The random
semimetrics dn decrease by the truncation. Hence the conditions for the
truncated processes are weaker than those for the original processes.

2.11.1.1 Measurelike Processes


The preceding theorem is valid for arbitrary index sets F. Consider the spe-
cial case that F is a set of measurable functions f : X → R on a measurable
space (X, A) that satisfies a uniform-entropy condition:

(2.11.5)    ∫₀^∞ sup_{Q∈Q} √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε < ∞.

Then the preceding theorem readily yields a central limit theorem for pro-
cesses with increments that are bounded by a (random) L2 -metric on F.
Call the processes Zni measurelike with respect to (random) measures µni if

    (Zni(f) − Zni(g))² ≤ ∫ (f − g)² dµni,  for every f, g ∈ F.

For measurelike processes, the random semimetric dn is bounded by the L₂(Σ µni)-semimetric, and the entropy condition (2.11.2) can be related to the uniform-entropy condition.

2.11.6 Lemma. Let F be a class of measurable functions with envelope


function F . Let Zn1 , . . . , Znmn be measurelike processes indexed by F. If F
satisfies the uniform-entropy condition (2.11.5) for a set Q that contains the measures Σ_{i=1}^{mn} µni, and if Σ_{i=1}^{mn} µni F² = O_{P*}(1), then the entropy condition (2.11.2) is satisfied.
Proof. Set µn = Σ_{i=1}^{mn} µni. Since dn is bounded by the L₂(µn)-semimetric, we have

    ∫₀^{δn} √(log N(ε, F, dn)) dε ≤ (∫₀^{δn/‖F‖_{µn}} √(log N(ε‖F‖_{µn}, F, L₂(µn))) dε) ‖F‖_{µn},

on the set where ‖F‖²_{µn} = µnF² is finite. Abbreviate

    J(δ) = ∫₀^δ sup_Q √(log N(ε‖F‖_{Q,2}, F, L₂(Q))) dε.

On the set where ‖F‖_{µn} > η, the right side of the next-to-last displayed equation is bounded by J(δn/η) O_P(1). This converges to zero in probability for every η > 0. On the set where ‖F‖_{µn} ≤ η, we have the bound J(∞) η. This can be made arbitrarily small by the choice of η. Thus, the entropy condition (2.11.2) is satisfied.

2.11.7 Example. Suppose the independent processes Zn1 , . . . , Znmn are


measurelike. Let F satisfy the uniform-entropy condition. Assume that, for
some probability measure P with P*F² < ∞,

    Σ_{i=1}^{mn} E(µni F² 1{µni F² > η}) → 0,  for every η > 0,

    sup_{‖f−g‖_{P,2}<δn} Σ_{i=1}^{mn} E µni(f − g)² → 0,  for every δn ↓ 0,

    Σ_{i=1}^{mn} µni F² = O_{P*}(1).

Then the preceding lemma verifies the entropy condition (2.11.2), while the first two conditions of the display ensure the other conditions of Theorem 2.11.1. Note that ‖Zni‖_F ≤ |Zni(f)| + 2(µniF²)^{1/2}, for any f.
Thus, the sequence Σ_{i=1}^{mn} (Zni − EZni) converges in ℓ∞(F) provided that the sequence of covariance functions converges pointwise and that measurability conditions are met.

2.11.8 Example (Weighted empirical processes). For each n, let


Xn1 , . . . , Xnmn be independent random elements in a measurable space
(X , A). Let Xni have law Pni and let Pni f exist for each element f of
a class F of measurable functions f: X → ℝ. Given a triangular array of constants cni, consider the weighted empirical process Gn(f) = Σ_{i=1}^{mn} cni(f(Xni) − Pni f). Suppose F satisfies the uniform-entropy condition and that

    max_{1≤i≤mn} |cni| → 0,    Σ_{i=1}^{mn} c²ni Pni ≤ P,

for a probability measure P with P*F² < ∞.


Then under measurability conditions, the sequence Gn converges
weakly to a Gaussian process in `∞ (F), provided the sequence converges
marginally. Furthermore, there is a version of the limit process with uni-
formly continuous sample paths with respect to the L2 (P )-semimetric.
This follows from the preceding example applied to the processes Zni =
cni δXni , which are measurelike for the measures µni = c2ni δXni .
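A quick simulation shows the marginal behavior this example predicts. With the classical weights cni = n^{−1/2} and f = 1_{(0,t]} on uniform data (all of these are illustrative choices, and only a single marginal is checked, not the uniform convergence), Gn(f) should be approximately centered Gaussian with variance Σᵢ c²ni Var f(Xni) = t(1 − t):

```python
# Monte Carlo marginal check for the weighted empirical process
# G_n(f) = sum_i c_ni (f(X_ni) - P_ni f), with c_ni = 1/sqrt(n),
# f = 1_(0,t], and X_ni uniform on (0,1).  Illustrative choices throughout.
import random

random.seed(2)
n, reps, t = 400, 2000, 0.5
c = n ** -0.5
draws = []
for _ in range(reps):
    gn = sum(c * ((random.random() <= t) - t) for _ in range(n))
    draws.append(gn)
mean = sum(draws) / reps
var = sum((g - mean) ** 2 for g in draws) / reps
print(round(mean, 2), round(var, 2))  # should be near 0 and t*(1-t) = 0.25
```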

2.11.2 Bracketing
For each n, let Zn1 , . . . , Znmn be independent stochastic processes indexed
by a common index set F. In this chapter we aim at generalizations of
the bracketing central limit theorem, Theorem 2.5.6, in two directions: we

prove a version for the non-i.i.d. case, and we also weaken the entropy con-
ditions to conditions involving the existence of either majorizing measures
or certain Gaussian processes.
For the first generalization, define, for every n, the bracketing number N[ ](ε, F, Lⁿ₂) as the minimal number Nε of sets in a partition F = ∪_{j=1}^{Nε} Fⁿ_{εj} of the index set into sets Fⁿ_{εj} such that, for every partitioning set Fⁿ_{εj},

    E* sup_{f,g∈Fⁿ_{εj}} Σ_{i=1}^{mn} (Zni(f) − Zni(g))² ≤ ε².

Note that the partitions are allowed to depend on n.

2.11.9 Theorem (Bracketing central limit theorem). For each n, let


Zn1 , . . . , Zn,mn be independent stochastic processes with finite second mo-
ments indexed by a totally bounded semimetric space (F, ρ). Suppose
    Σ_{i=1}^{mn} E*‖Zni‖_F 1{‖Zni‖_F > η} → 0,  for every η > 0,

    sup_{ρ(f,g)<δn} Σ_{i=1}^{mn} E(Zni(f) − Zni(g))² → 0,  for every δn ↓ 0,

    ∫₀^{δn} √(log N[ ](ε, F, Lⁿ₂)) dε → 0,  for every δn ↓ 0.

Then the sequence Σ_{i=1}^{mn} (Zni − EZni) is asymptotically tight in ℓ∞(F) and converges in distribution provided it converges marginally. If the partitions can be chosen independent of n, then the middle one of the displayed conditions is unnecessary.

A central limit theorem for the empirical process of i.i.d. observations can be recovered from the preceding theorem by setting Zni(f) = n^{−1/2}f(Xᵢ). Then the bracketing numbers in the theorem may be reduced to N[ ](ε, F, L₂(P)). The resulting theorem is weaker than the bracketing central limit theorem, Theorem 2.5.6, in that the latter theorem uses a combination of L₂(P)- and L₂,∞(P)-entropies. In the present theorem the L₂-entropies may also be replaced by smaller numbers: the preceding theorem remains true if N[ ](ε, F, Lⁿ₂) is replaced by the minimal number Nε of sets in a partition F = ∪_{j=1}^{Nε} Fⁿ_{εj} of the index set into sets Fⁿ_{εj} such that, for every j and every n,


    sup_{f,g∈Fⁿ_{εj}} Σ_{i=1}^{mn} E(Zni(f) − Zni(g))² ≤ ε²,

    sup_{t>0} t² Σ_{i=1}^{mn} P*( sup_{f,g∈Fⁿ_{εj}} |Zni(f) − Zni(g)| > t ) ≤ ε².

These bracketing numbers appear to be rather hard to work with.



Another refinement of the theorem is to replace its entropy assumption


by a majorizing measure condition. This entails the existence of partitions F = ∪ⱼ Fⁿ_{εj} as before and discrete probability measures µn on F such that

(2.11.10)    sup_f ∫₀^{δn} √(log(1/µn(Fⁿ_ε f))) dε → 0,  for every δn ↓ 0.

Here Fⁿ_ε f is defined as the partitioning set at level ε to which f belongs.


This majorizing measure condition is a weakening of finiteness of the en-
tropy integral. Entropies correspond to certain uniform majorizing mea-
sures; the majorizing measure criterion permits one to measure the size of
F in a nonuniform way.[
Majorizing measures are also hard to work with. The resulting central
limit theorem can be expressed more elegantly in terms of the existence of
Gaussian semimetrics. For simplicity, we utilize only a single semimetric,
although it may be fruitful to allow the semimetric to depend on n. Call a
semimetric ρ Gaussian if it is the standard deviation semimetric
    ρ(f, g) = (E(G(f) − G(g))²)^{1/2}

of a tight, zero-mean, Gaussian random element G in `∞ (F). The con-


nection with majorizing measures is the characterization of continuity of
Gaussian processes by majorizing measures. See Appendix A.2.3 for a dis-
cussion. In this chapter this deep characterization is used only in the proof
of the following theorem to translate its condition into the majorizing mea-
sure condition (2.11.10). The proof itself is next based on the existence of
the majorizing measure.
Call a semimetric ρ Gaussian-dominated if it is bounded above (on
F × F) by a Gaussian semimetric. Any semimetric ρ such that
    ∫₀^∞ √(log N(ε, F, ρ)) dε < ∞

is Gaussian-dominated (see Problem 2.11.4).

2.11.11 Theorem (Bracketing by Gaussian hypotheses). For each n, let


Zn1 , . . . , Zn,mn be independent stochastic processes indexed by an arbitrary
index set F. Suppose that there exists a Gaussian-dominated semimetric ρ
on F such that
    Σ_{i=1}^{mn} E*‖Zni‖_F 1{‖Zni‖_F > η} → 0,  for every η > 0,

    Σ_{i=1}^{mn} E(Zni(f) − Zni(g))² ≤ ρ²(f, g),  for every f, g,

[
See the Problems and Appendix A.2.3.

    sup_{t>0} t² Σ_{i=1}^{mn} P*( sup_{f,g∈B(ε)} |Zni(f) − Zni(g)| > t ) ≤ ε²,

for every ρ-ball B(ε) ⊂ F of radius less than ε and for every n. Then the sequence Σ_{i=1}^{mn} (Zni − EZni) is asymptotically tight in ℓ∞(F). It converges in distribution provided it converges marginally.

The specialization of the preceding theorem to the case of i.i.d. obser-


vations is of interest because of the use of a Gaussian semimetric instead of
entropy integrals. For instance, the theorem contains the classical Chibisov-
O’Reilly theorem on the weighted empirical distribution function on the real
line (see Example 2.11.15).
To recover the i.i.d. case, let X₁, . . . , Xn be a random sample in a measurable space (X, A). For a class F of measurable functions f: X → ℝ, set Zni(f) = f(Xᵢ)/√n. Recall that ‖f‖_{P,2,∞} is the weak L₂-pseudonorm, defined as the square root of sup_{t>0} t² P(|f| > t).

2.11.12 Corollary. Let F be a pre-Gaussian class of measurable functions


whose envelope function possesses a weak second moment. Suppose there
exists a Gaussian-dominated semimetric ρ on F such that, for every ε > 0,

    sup_{f,g∈B(ε)} ‖f − g‖_{P,2,∞} ≤ ε,

for every ρ-ball B(ε) ⊂ F of radius ε. Then F is P-Donsker.

Proof. Since F is P -pre-Gaussian and kP kF ≤ P ∗ F < ∞, there exists a


tight version of Brownian motion. This means that the L₂(P) semimetric e_P is Gaussian. The semimetric d = √(e_P² + ρ²) is the standard deviation
metric of the sum of Brownian motion and an independent copy of the
process corresponding to ρ. Thus, this semimetric is Gaussian-dominated
also. The conditions of the preceding theorem are satisfied for d playing the
role of ρ.

2.11.13 Example (Jain-Marcus theorem). For each natural number n,


let Zn1 , . . . , Zn,mn be independent stochastic processes indexed by an arbi-
trary index set F such that

    |Zni(f) − Zni(g)| ≤ Mni ρ(f, g),  for every f, g,
for independent random variables Mn1, . . . , Mn,mn and a semimetric ρ such that

    ∫₀^∞ √(log N(ε, F, ρ)) dε < ∞,    Σ_{i=1}^{mn} EMni² = O(1).

If the triangular array of norms ‖Zni‖_F satisfies the Lindeberg condition, then the sequence Σ_{i=1}^{mn} (Zni − EZni) converges in distribution in ℓ∞(F) to a tight Gaussian process provided the sequence of covariance functions converges pointwise on F × F.
This follows from Theorems 2.11.9 and 2.11.11. The entropy condition
implies that ρ is Gaussian-dominated.
In view of the preceding discussion, the conditions may be relaxed in
two ways: the L2 -norm can be partly replaced by the weak L2 -norm, and
the entropy criterion can be replaced by the majorizing measure criterion.

2.11.14 Example. Let Z, Z1 , Z2 , . . . be i.i.d. stochastic processes indexed


by the unit interval F = [0, 1] ⊂ ℝ, with ‖Z‖_F ≤ 1, such that

    E|Z(f) − Z(g)| ≤ K|f − g|,

for some constant K. Then the sequence n^{−1/2} Σ_{i=1}^n (Zᵢ − EZᵢ) converges in
distribution to a tight Gaussian process in `∞ [0, 1]. Note that the condition
bounds the mean of the increments, not the increments themselves as re-
quired by the Jain-Marcus theorem. On the other hand, the index set must
be a compact interval in the real line and the processes uniformly bounded.
The result can be deduced from Theorem 2.11.11. For every interval
[a, a + ε],

    sup_{a<f≤g<a+ε} |Z(f) − Z(g)| ≤ lim_{n→∞} Σ_{k=1}^{2ⁿ} |Z(a + εk2^{−n}) − Z(a + ε(k − 1)2^{−n})|.

(The process is separable, with any dense subset of [0, 1] as the separant,
because it is continuous in probability; the sums on the right side are in-
creasing in n and bound dyadic increments.) The right-hand side has mean
bounded by Kε. Since the processes are bounded by 1,

    E sup_{a<f≤g<a+ε} |Z(f) − Z(g)|² ≤ 2Kε.

Every set of diameter ε for the semimetric ρ(f, g) = |f − g|^{1/2} is contained in an interval [a, a + ε²]. Conclude that the condition of Theorem 2.11.11 is satisfied up to a constant for this semimetric. This semimetric is Gaussian-dominated, because it has a finite entropy integral.

2.11.15 Example (Weighted empirical distribution function). Let


X1 , X2 , . . . be i.i.d. random variables with the uniform distribution P on
[0, 1] ⊂ ℝ. For a fixed function q: (0, 1/2] → ℝ₊, consider the class of functions

    F = { 1_{(0,t]}/q(t) : 0 < t ≤ 1/2 }.
For simplicity, assume that the function 1/q is decreasing. The empirical
process indexed by this class is the classical weighted empirical process

(restricted to 0 < t ≤ 1/2 for convenience). According to the Chibisov-O'Reilly theorem] the following statements are equivalent:
(i) F is Donsker;
(ii) F is pre-Gaussian and 1/q(t) = o(1/√t);
(iii) ∫₀^{1/2} t^{−1} exp(−εq²(t)/t) dt < ∞, for every ε > 0.


The equivalence of (ii) and (iii) can be argued from properties of Brownian motion. Here we deduce that (ii) implies (i) from Theorem 2.11.11.
In view of the second condition of (ii), the envelope function F = (1/q)1_{(0,1/2]} has a weak second moment, and ‖P‖_F = sup{t/q(t): 0 < t ≤ 1/2} is finite. Hence the pre-Gaussianness implies the existence of a tight version of Brownian motion; its standard deviation metric is Gaussian. Since, for s < t,

$$\frac{1_{(0,t]}}{q(t)} - \frac{1_{(0,s]}}{q(s)} = \frac{1_{(s,t]}}{q(t)} + \Bigl(\frac{1}{q(t)} - \frac{1}{q(s)}\Bigr) 1_{(0,s]},$$

the square of this metric equals

$$\rho^2(s,t) = \frac{t-s}{q^2(t)} + \Bigl(\frac{1}{q(t)} - \frac{1}{q(s)}\Bigr)^2 s, \qquad s \le t.$$
If ρ(s, t) < ε, then both terms on the right are bounded by ε². Conclude that, for every fixed s,

$$\sup_{\substack{s<t\\ \rho(s,t)<\varepsilon}} \Bigl| \frac{1_{(0,s]}}{q(s)} - \frac{1_{(0,t]}}{q(t)} \Bigr| \le \sup_{t>s} \frac{\varepsilon 1_{(s,t]}}{\sqrt{t-s}} + \frac{\varepsilon 1_{(0,s]}}{\sqrt{s}} = \frac{\varepsilon 1_{(s,1]}}{\sqrt{\cdot - s}} + \frac{\varepsilon 1_{(0,s]}}{\sqrt{s}}.$$

The L_{2,∞}(P)-norm of the function x ↦ 1_{(s,1]}(x)/√(x − s) equals 1 for every s. It follows that the L_{2,∞}(P)-norm of the left side of the preceding display is bounded by a multiple of ε.
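Not part of the text: a small numerical sketch of this example. The helper name `weighted_emp_sup` and the weight q(t) = t^{0.4} are our own choices; a weight of this form satisfies √t = o(q(t)) as t ↓ 0, and the simulated suprema of the weighted uniform empirical process then remain stochastically bounded.

```python
import numpy as np

def weighted_emp_sup(n, q, ts, rng):
    """Sup over t of |G_n 1_{(0,t]}| / q(t) for a uniform(0,1) sample.

    G_n 1_{(0,t]} = sqrt(n) (F_n(t) - t), with F_n the empirical cdf.
    """
    x = np.sort(rng.uniform(size=n))
    Fn = np.searchsorted(x, ts, side="right") / n
    Gn = np.sqrt(n) * (Fn - ts)
    return np.max(np.abs(Gn) / q(ts))

rng = np.random.default_rng(0)
ts = np.linspace(1e-4, 0.5, 2000)
q = lambda t: t ** 0.4              # then 1/q(t) = o(1/sqrt(t)) as t -> 0
sups = [weighted_emp_sup(100_000, q, ts, rng) for _ in range(20)]
print(float(np.median(sups)))       # stays of moderate size as n grows
```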

2.11.16 Example (Monotone processes). Let Z be a stochastic process indexed by an interval [a, b] ⊂ R̄, whose sample paths t ↦ Z(t) are cadlag and nondecreasing. If EZ²(a) < ∞ and EZ²(b) < ∞, then Z satisfies the central limit theorem in ℓ∞[a, b]. More precisely, if Z₁, Z₂, . . . are i.i.d. copies of Z, then the sequence $n^{-1/2}\sum_{i=1}^n (Z_i - \mathrm{E}Z_i)$ converges in ℓ∞[a, b] to a tight Gaussian process.
We shall derive this from Theorem 2.11.9. The conditions that Z(a) and Z(b) have finite second moments are necessary for the convergence of the processes evaluated at a and b, respectively.
Since we can replace Z(t) by Z(t) − Z(a), we may assume that
Z(a) = 0. The envelope function of Z is kZk = Z(b) and is square in-
tegrable by assumption. It suffices to verify the entropy condition of The-
orem 2.11.9, where we shall choose the partitions independent of n. Define
F (t) = EZ(t)Z(b). This function is right-continuous with left limits, and
is nondecreasing from F (a) = 0 to F (b) = EZ 2 (b). Given ε > 0 choose a

] Shorack and Wellner (1986), page 462.
302 2 Empirical Processes

partition a = t₀ < t₁ < · · · < t_N = b such that F(tᵢ−) − F(tᵢ₋₁) < ε² for every i. Then, for every i,

$$\mathrm{E} \sup_{t_{i-1}\le s,t< t_i} \bigl|Z(s) - Z(t)\bigr|^2 \le \mathrm{E}\bigl( Z(t_i-) - Z(t_{i-1}) \bigr) Z(b) < \varepsilon^2.$$

The number of points in the partition can be chosen smaller than a constant
times 1/ε2 . (Make sure that the points where F jumps more than ε2 are
among t0 , . . . , tN .) Thus the entropy condition of Theorem 2.11.9 is satisfied
easily for the partition [a, b] = ∪(ti−1 , ti ] ∪ {a}.
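Not in the text: the partition construction from F(t) = EZ(t)Z(b) can be sketched numerically. The toy process Z(t) = W·1{t ≥ U} and the helper name `bracket_partition` are hypothetical choices of ours; the point is only that a greedy choice of cut points yields of the order F(b)/ε² partitioning sets.

```python
import numpy as np

def bracket_partition(ts, F_vals, eps):
    """Greedy cut points a = t_0 < ... < t_N = b with F-increment < eps^2
    between consecutive cuts (up to one grid step, since F is only
    evaluated on a finite grid)."""
    cuts = [ts[0]]
    last = F_vals[0]
    for t, v in zip(ts[1:], F_vals[1:]):
        if v - last >= eps ** 2:
            cuts.append(t)
            last = v
    if cuts[-1] != ts[-1]:
        cuts.append(ts[-1])
    return cuts

# Toy monotone process Z(t) = W 1{t >= U}; then F(t) = E Z(t)Z(1) = E W^2 1{U <= t}.
rng = np.random.default_rng(1)
U = rng.uniform(size=50_000)
W = rng.exponential(size=50_000)
ts = np.linspace(0.0, 1.0, 1001)
F_vals = np.array([(W**2 * (U <= t)).mean() for t in ts])

eps = 0.2
cuts = bracket_partition(ts, F_vals, eps)
print(len(cuts))    # of the order F(1)/eps^2 = E W^2 / 0.04
```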

The proof of the theorems is based on Bernstein’s inequality combined


with a chaining argument that utilizes majorizing measures. The following
lemma is used to control the finite suprema over the links at a fixed level
in the chain.

2.11.17 Lemma. Let X be an arbitrary random variable such that

$$\mathrm{P}\bigl( |X| > x \bigr) \le 2\, e^{-\frac12 \frac{x^2}{b + ax}},$$

for every x > 0. Then there exists a universal constant K such that

$$\mathrm{E}|X| 1_A \le K \Bigl( a \log\frac{1}{\mu} + \sqrt{b}\, \sqrt{\log\frac{1}{\mu}} \Bigr) \bigl( \mu + \mathrm{P}(A) \bigr),$$

for every measurable set A and every constant 0 < µ < e⁻¹.
for every measurable set A and every constant 0 < µ < e−1 .

Proof. The condition implies the upper bound 2 exp(−x²/(4b)) on P(|X| > x), for every x ≤ b/a, and the upper bound 2 exp(−x/(4a)), for all other positive x. Consequently, the same upper bounds hold for all x > 0 for the probabilities P(|X|1{|X| ≤ b/a} > x) and P(|X|1{|X| > b/a} > x), respectively. By Lemma 2.2.1, this implies that the Orlicz norms ‖X1{|X| ≤ b/a}‖_{ψ₂} and ‖X1{|X| > b/a}‖_{ψ₁} are up to constants bounded by √b and a, respectively. For any random variable Y, Jensen's inequality yields, for any convex, increasing, nonnegative function ψ,

$$\mathrm{E}|Y| 1_A \le \psi^{-1}\Bigl( \mathrm{E}\,\psi\Bigl(\frac{|Y|}{\|Y\|_\psi}\Bigr) \frac{1_A}{\mathrm{P}(A)} \Bigr)\, \|Y\|_\psi\, \mathrm{P}(A) \le \psi^{-1}\Bigl( \frac{1}{\mathrm{P}(A)} \Bigr)\, \|Y\|_\psi\, \mathrm{P}(A).$$

Apply this to the random variables |X|1{|X| ≤ b/a} and |X|1{|X| > b/a} separately and use the triangle inequality to find that

$$\mathrm{E}|X| 1_A \lesssim \psi_2^{-1}\Bigl( \frac{1}{\mathrm{P}(A)} \Bigr) \sqrt{b}\, \mathrm{P}(A) + \psi_1^{-1}\Bigl( \frac{1}{\mathrm{P}(A)} \Bigr)\, a\, \mathrm{P}(A).$$

Finally, apply the inequality p log^k(1 + 1/p) ≲ (p + µ) log^k(1/µ), which is valid for small µ > 0 and k = 1/2, 1.

Proof of Theorems 2.11.9 and 2.11.11. There exists a sequence of numbers ηₙ ↓ 0 such that

$$\sum_{i=1}^{m_n} \mathrm{E}^* \|Z_{ni}\|_{\mathcal F}\, 1\bigl\{ \|Z_{ni}\|_{\mathcal F} > \eta_n \bigr\} \to 0.$$

Therefore, it is no loss of generality to assume that ‖Z_{ni}‖_F ≤ ηₙ for every i and n. Otherwise, replace Z_{ni} by Z_{ni} 1{‖Z_{ni}‖*_F ≤ ηₙ}; the conditions of the theorems remain valid, and the difference between the centered sums and the centered truncated sums converges to zero.
Under the conditions of the theorems, there exists for every n a sequence of nested partitions $\mathcal{F} = \cup_j \mathcal{F}^n_{qj}$ and discrete subprobability measures µₙ on F such that, for every j and n,

$$(2.11.18)\qquad \lim_{q_0\to\infty} \limsup_{n\to\infty} \sup_f \sum_{q>q_0} 2^{-q} \sqrt{\log \frac{1}{\mu_n(\mathcal{F}^n_q f)}} = 0,$$

$$\sup_{f,g\in\mathcal{F}^n_{qj}} \sum_{i=1}^{m_n} \mathrm{E}\bigl( Z_{ni}(f) - Z_{ni}(g) \bigr)^2 \le 2^{-2q},$$

$$\sup_{t>0} t^2 \sum_{i=1}^{m_n} \mathrm{P}^*\Bigl( \sup_{f,g\in\mathcal{F}^n_{qj}} \bigl| Z_{ni}(f) - Z_{ni}(g) \bigr| > t \Bigr) \le 2^{-2q}.$$

Here $\mathcal{F}^n_q f$ is the set in the qth partition to which f belongs. Under the conditions of Theorem 2.11.9, this is true with $1/\mu_n(\mathcal{F}^n_q f)$ replaced by the number $N^n_q$ of sets in the qth partition. The measures µₙ can then be constructed as $\mu_n = \sum_q 2^{-q} \mu_{n,q}$, where $\mu_{n,q}(\mathcal{F}^n_q f) = (N^n_q)^{-1}$, for every f and q. (Alternatively, the expression $1/\mu_n(\mathcal{F}^n_q f)$ can be replaced by $N^n_q$ throughout the proof.) Under the conditions of Theorem 2.11.11, a single measure µ = µₙ can be derived from a majorizing measure corresponding to the Gaussian semimetric ρ. The existence of this measure is indicated in Proposition A.2.22. The construction of a sequence of partitions (in sets of ρ-diameter at most 2⁻q) is carried out in Lemma A.2.24. In the following, it is not a loss of generality to assume that $\mu_n(\mathcal{F}^n_q f) \le 1/4$ for every q and f. Most of the argument is carried out for a fixed n; this index will be suppressed in the notation.
The remainder of the proof is similar to the proof of Theorem 2.5.6, except that Lemma 2.11.17 is substituted for Lemma 2.2.10 in the chaining argument, which now uses majorizing measures. Choose an element f_{qj} from each partitioning set F_{qj} and define

$$\pi_q f = f_{qj}, \qquad (\Delta_q f)_{ni} = \sup_{f,g\in\mathcal{F}_{qj}} \bigl| Z_{ni}(f) - Z_{ni}(g) \bigr|, \qquad \text{if } f \in \mathcal{F}_{qj},$$

$$a_q f = \frac{2^{-q}}{\sqrt{\log\bigl( 1/\mu(\mathcal{F}_{q+1} f) \bigr)}}.$$

Next, for q > q₀, define indicator functions

$$(A_{q-1} f)_{ni} = 1\bigl\{ (\Delta_{q_0} f)_{ni} \le a_{q_0} f, \ldots, (\Delta_{q-1} f)_{ni} \le a_{q-1} f \bigr\},$$
$$(B_q f)_{ni} = 1\bigl\{ (\Delta_{q_0} f)_{ni} \le a_{q_0} f, \ldots, (\Delta_{q-1} f)_{ni} \le a_{q-1} f,\ (\Delta_q f)_{ni} > a_q f \bigr\},$$
$$(B_{q_0} f)_{ni} = 1\bigl\{ (\Delta_{q_0} f)_{ni} > a_{q_0} f \bigr\}.$$

The indicator functions A_q f and B_q f are constant in f on each of the partitioning sets F_{qj} at level q, because the partitions are nested. Now decompose:

$$Z_{ni}(f) - Z_{ni}(\pi_{q_0} f) = \bigl( Z_{ni}(f) - Z_{ni}(\pi_{q_0} f) \bigr)(B_{q_0} f)_{ni}$$
$$(2.11.19)\qquad\qquad + \sum_{q_0+1}^{\infty} \bigl( Z_{ni}(f) - Z_{ni}(\pi_q f) \bigr)(B_q f)_{ni}$$
$$\qquad\qquad + \sum_{q_0+1}^{\infty} \bigl( Z_{ni}(\pi_q f) - Z_{ni}(\pi_{q-1} f) \bigr)(A_{q-1} f)_{ni}.$$

Center each term at zero expectation; take the sum over i; and, finally, take the supremum over f for all three terms on the right separately. It will be shown that the resulting expressions converge to zero in mean as n → ∞ followed by q₀ → ∞, whence the centered processes $Z^o_{ni}$ satisfy

$$(2.11.20)\qquad \lim_{q_0\to\infty} \limsup_{n\to\infty} \mathrm{E}^* \Bigl\| \sum_{i=1}^{m_n} \bigl( Z^o_{ni}(f) - Z^o_{ni}(\pi^n_{q_0} f) \bigr) \Bigr\|_{\mathcal F} = 0.$$

In view of Theorem 1.5.6, this gives a complete proof in the case that the
partitions do not depend on n. The case that the partitions depend on n
requires an additional argument, given at the end of this proof.
Since (∆_q f)_{ni} ≤ 2ηₙ, the first term in (2.11.19) is zero as soon as 2ηₙ ≤ inf_f a_{q₀} f. For every fixed q₀, this is true for sufficiently large n, because, by assumption (2.11.18), inf_f a_{q₀} f (which depends on n) is bounded away from zero as n → ∞.
By the nesting of the partitions, (∆_q f)_{ni} ≤ (∆_{q−1} f)_{ni}, which is bounded by a_{q−1} f on the set where (B_q f)_{ni} = 1. It follows that

$$\bigl| Z_{ni}(f) - Z_{ni}(\pi_q f) \bigr| (B_q f)_{ni} \le (\Delta_q f)_{ni} (B_q f)_{ni} \le a_{q-1} f,$$

$$\mathrm{var}\Bigl( \sum_{i=1}^{m_n} (\Delta_q f)_{ni} (B_q f)_{ni} \Bigr) \le a_{q-1} f \sum_{i=1}^{m_n} \mathrm{E} (\Delta_q f)_{ni}\, 1\bigl\{ (\Delta_q f)_{ni} > a_q f \bigr\} \le 2\, \frac{a_{q-1} f}{a_q f}\, 2^{-2q},$$

by Problem 2.5.5. Combination with Bernstein's inequality 2.2.9 shows that the random variable $\sum_{i=1}^{m_n} (\Delta_q f)_{ni}(B_q f)_{ni}$, centered at its expectation, satisfies the condition of Lemma 2.11.17 with $b = 2(a_{q-1} f/a_q f)\, 2^{-2q}$ and

a = 2aq−1 f . Thus, the lemma implies that, for every f and measurable set
A,
$$\mathrm{E}\Bigl| \sum_{i=1}^{m_n} \bigl( (\Delta_q f)_{ni}(B_q f)_{ni} - \mathrm{E}(\Delta_q f)_{ni}(B_q f)_{ni} \bigr) \Bigr| 1_A$$
$$\lesssim \Bigl( a_{q-1} f \log\frac{1}{\mu(\mathcal{F}_q f)} + \sqrt{\frac{a_{q-1} f}{a_q f}}\; 2^{-q} \sqrt{\log\frac{1}{\mu(\mathcal{F}_q f)}} \Bigr) \bigl( \mathrm{P}(A) + \mu(\mathcal{F}_q f) \bigr)$$
$$\lesssim 2^{-q} \sqrt{\log\frac{1}{\mu(\mathcal{F}_q f)}}\; \bigl( \mathrm{P}(A) + \mu(\mathcal{F}_q f) \bigr),$$

since µ(F_{q+1} f) ≤ µ(F_q f). For each q, let Ω = ∪_j Ω_{qj} be a partition of the underlying probability space such that the maximum $\bigl\| \sum_{i=1}^{m_n} \bigl( (\Delta_q f)_{ni}(B_q f)_{ni} - \mathrm{E}(\Delta_q f)_{ni}(B_q f)_{ni} \bigr) \bigr\|_{\mathcal F}$ is achieved at f_{qj} on the set Ω_{qj}. For every q, there are as many sets Ω_{qj} as there are sets F_{qj} in the qth partition. Then

$$\mathrm{E}\Bigl\| \sum_{q>q_0} \sum_{i=1}^{m_n} \bigl( (\Delta_q f)_{ni}(B_q f)_{ni} - \mathrm{E}(\Delta_q f)_{ni}(B_q f)_{ni} \bigr) \Bigr\|_{\mathcal F}$$
$$\le \sum_{q>q_0} \sum_j \mathrm{E}\Bigl| \sum_{i=1}^{m_n} \bigl( (\Delta_q f_{qj})_{ni}(B_q f_{qj})_{ni} - \mathrm{E}(\Delta_q f_{qj})_{ni}(B_q f_{qj})_{ni} \bigr) \Bigr| 1_{\Omega_{qj}}$$
$$\le \sum_j \sup_f \sum_{q>q_0} 2^{-q} \sqrt{\log\frac{1}{\mu(\mathcal{F}_q f)}}\; \bigl( \mathrm{P}(\Omega_{qj}) + \mu(\mathcal{F}_q f_{qj}) \bigr)$$
$$\le 2 \sup_f \sum_{q>q_0} 2^{-q} \sqrt{\log\frac{1}{\mu(\mathcal{F}_q f)}}.$$

This converges to zero as n → ∞ followed by q₀ → ∞ and takes care of the second term resulting from the decomposition (2.11.19), apart from the centering. The centering is bounded by

$$\sum_{q>q_0} \sum_{i=1}^{m_n} \mathrm{E}(\Delta_q f)_{ni}(B_q f)_{ni} \le \sum_{q>q_0} \sum_{i=1}^{m_n} \mathrm{E}(\Delta_q f)_{ni}\, 1\bigl\{ (\Delta_q f)_{ni} > a_q f \bigr\}$$
$$\le \sup_f \sum_{q>q_0} \frac{2^{-2q}}{a_q f} \lesssim \sup_f \sum_{q>q_0} 2^{-q} \sqrt{\log\frac{1}{\mu(\mathcal{F}_q f)}}.$$

This concludes the proof that the second term in the decomposition result-
ing from (2.11.19) converges to zero as n → ∞ followed by q0 → ∞.
The third term can be handled in a similar manner. The proof of
(2.11.20) is complete.
Finally, if the partitions depend on n, then the discrepancies at the ends
of the chains must be analyzed separately. If the q0 th partition consists of

$N^n_{q_0}$ sets, then the combination of Bernstein's inequality and Lemma 2.2.10 yields

$$\mathrm{E} \sup_{\rho(f,g)<\delta_n} \Bigl| \sum_{i=1}^{m_n} \bigl( Z^o_{ni}(\pi^n_{q_0} f) - Z^o_{ni}(\pi^n_{q_0} g) \bigr) \Bigr| \lesssim \bigl( \log N^n_{q_0} \bigr)\, \eta_n + \sqrt{\log N^n_{q_0}}\; 2^{-q_0} + \delta_n.$$

The entropy condition of Theorem 2.11.9 implies that $2^{-q_0}\sqrt{\log N^n_{q_0}} \to 0$ as n → ∞ followed by q₀ → ∞. Hence the expression in the display converges to zero as n → ∞ followed by q₀ → ∞. Combine this with (2.11.20) to see that the sequence $\sum_{i=1}^{m_n} Z^o_{ni}$ is asymptotically equicontinuous with respect to ρ. Finally, Theorem 1.5.7 shows that the sequence is asymptotically tight.

2.11.3 Classes of Functions Changing with n


Let X1 , X2 , . . . be a sequence of independent random elements with common
law P on a measurable space (X , A), and let x 7→ fn,t (x) be functions from
X to R indexed by n ∈ N and a fixed, totally bounded semimetric space
(T, ρ). We wish to derive conditions for the stochastic processes

$$\Bigl\{ n^{-1/2} \sum_{i=1}^n \bigl( f_{n,t}(X_i) - P f_{n,t} \bigr) : t \in T \Bigr\}$$

to converge in distribution in the space ℓ∞(T). These are the empirical processes {Gₙ f_{n,t}: t ∈ T} indexed by classes of functions Fₙ = {f_{n,t}: t ∈ T} changing with n.
This situation fits in the general set-up of this section upon setting $Z_{ni}(t) = f_{n,t}(X_i)/\sqrt{n}$.
Given envelope functions Fₙ, assume that

$$P^* F_n^2 = O(1),$$
$$(2.11.21)\qquad P^* F_n^2\, 1\bigl\{ F_n > \eta\sqrt{n} \bigr\} \to 0, \qquad \text{for every } \eta > 0,$$
$$\sup_{\rho(s,t)<\delta_n} P\bigl( f_{n,s} - f_{n,t} \bigr)^2 \to 0, \qquad \text{for every } \delta_n \downarrow 0.$$

Then the central limit theorem holds under an entropy condition. The fol-
lowing theorems impose a uniform-entropy condition and a bracketing en-
tropy condition, respectively.

2.11.22 Theorem. For each n, let Fₙ = {f_{n,t}: t ∈ T} be a class of measurable functions indexed by a totally bounded semimetric space (T, ρ) such that the classes F_{n,δ} = {f_{n,s} − f_{n,t}: ρ(s, t) < δ} and $\mathcal{F}^2_{n,\delta}$ are P-measurable for every δ > 0. Suppose (2.11.21) holds, as well as

$$\sup_Q \int_0^{\delta_n} \sqrt{\log N\bigl( \varepsilon \|F_n\|_{Q,2}, \mathcal{F}_n, L_2(Q) \bigr)}\, d\varepsilon \to 0, \qquad \text{for every } \delta_n \downarrow 0.$$

Then the sequence {Gn fn,t : t ∈ T } is asymptotically tight in `∞ (T ) and


converges in distribution to a Gaussian process provided the sequence of
covariance functions P fn,s fn,t − P fn,s P fn,t converges pointwise on T × T .


2.11.23 Theorem. For each n, let Fₙ = {f_{n,t}: t ∈ T} be a class of measurable functions indexed by a totally bounded semimetric space (T, ρ). Suppose (2.11.21) holds, as well as

$$\int_0^{\delta_n} \sqrt{\log N_{[\,]}\bigl( \varepsilon \|F_n\|_{P,2}, \mathcal{F}_n, L_2(P) \bigr)}\, d\varepsilon \to 0, \qquad \text{for every } \delta_n \downarrow 0.$$

Then the sequence {Gn fn,t : t ∈ T } is asymptotically tight in `∞ (T ) and


converges in distribution to a tight Gaussian process provided the sequence
of covariance functions P fn,s fn,t −P fn,s P fn,t converges pointwise on T ×T .

Proofs. The random distance given in Theorem 2.11.1 reduces to

$$d_n^2(s,t) = \frac{1}{n} \sum_{i=1}^n (f_{n,s} - f_{n,t})^2(X_i) = \mathbb{P}_n (f_{n,s} - f_{n,t})^2.$$

It follows that $N(\varepsilon, T, d_n) = N\bigl( \varepsilon, \mathcal{F}_n, L_2(\mathbb{P}_n) \bigr)$, for every ε > 0. If Fₙ is replaced by Fₙ ∨ 1, then the conditions of the theorem still hold. Hence, assume without loss of generality that Fₙ ≥ 1. Insert the bound on the covering numbers and next make a change of variables to bound the entropy integral (2.11.2) by

$$\int_0^{\delta_n} \sqrt{\log N\bigl( \varepsilon \|F_n\|_{\mathbb{P}_n,2}, \mathcal{F}_n, L_2(\mathbb{P}_n) \bigr)}\, d\varepsilon\; \|F_n\|_{\mathbb{P}_n,2}.$$

This converges to zero in probability for every δn ↓ 0. Apply Theorem 2.11.1


to obtain the first theorem.
The second theorem is an immediate consequence of Theorem 2.11.9.

2.11.24 Example. The uniform-entropy condition of the first theorem is certainly satisfied if, for each n, the set of functions Fₙ = {f_{n,t}: t ∈ T} is a VC-class with VC-dimension bounded by some constant independent of n.

Problems and Complements


1. The random entropy condition (2.11.2) implies that the sequence N(ε, F, dₙ) is bounded in probability for every ε > 0.
[Hint: For every δₙ ≤ ε,

$$\mathrm{P}\bigl( N(\varepsilon, \mathcal{F}, d_n) \ge M_n \bigr) \le \mathrm{P}\Bigl( \int_0^{\delta_n} \sqrt{\log N(\varepsilon, \mathcal{F}, d_n)}\, d\varepsilon \ge \delta_n \sqrt{\log M_n} \Bigr).$$

Given Mₙ → ∞, choose δₙ ↓ 0 such that δₙ√(log Mₙ) is bounded away from zero.]

2. Let F = ∪j Fqj be a sequence (q ∈ N) of nested partitions of an arbitrary set


F. Take arbitrary fqj ∈ Fqj for every partitioning set, and define πq f = fqj
and Fq f = Fqj if f ∈ Fqj . Let ρ(f, g) be 2−q0 for the first value q0 (counting
from 1) such that f and g do not belong to the same partitioning set at
level q0 . (Set ρ(f, g) = 0 if f and g are never separated.) Then ρ defines a
semimetric on F inducing open balls

$$B(f, 2^{-q}) = \mathcal{F}_q f, \qquad \text{for every } q.$$

Furthermore, if $G(f) = \sum_q 2^{-q} \xi_{q,\pi_q f}$ for i.i.d. standard normal variables $\xi_{q,f_{qj}}$, then

$$\mathrm{var}\bigl( G(f) - G(g) \bigr) = \frac{8}{3}\, \rho^2(f,g),$$

for every f, g.
[Hint: We have ρ(f, g) < 2−q if and only if f and g are in the same parti-
tioning set at level q if and only if Fq f = Fq g. Of course, g ∈ Fq g for every
g.]

3. In the situation of the preceding problem, let the number N_q of partitioning sets F_{qj} at level q satisfy

$$\sum_q 2^{-q} \sqrt{\log N_q} < \infty.$$

Then the entropy numbers for the semimetric ρ satisfy

$$\int_0^\infty \sqrt{\log N(\varepsilon, \mathcal{F}, \rho)}\, d\varepsilon < \infty.$$

Conclude that the Gaussian process G defined in the preceding problem has
a version with bounded, uniformly ρ-continuous sample paths. Hence ρ is a
Gaussian semimetric.
[Hint: The convergence of the integral is immediate from the fact that
N (2−q , F, ρ) = Nq . The continuity follows from Corollary 2.2.8 and Prob-
lem 2.2.17. Also see Appendix A.2.3.]
4. Any semimetric d on an arbitrary set F such that $\int_0^\infty \sqrt{\log N(\varepsilon, \mathcal{F}, d)}\, d\varepsilon < \infty$ is Gaussian-dominated.
[Hint: Construct a sequence of nested partitions F = ∪ⱼ F_{qj} in sets of d-diameter at most 2⁻q for each q. The number N_q of sets in the qth partition can be chosen such that $\sum_q 2^{-q} \sqrt{\log N_q} < \infty$. As in the preceding problems, the semimetric ρ defined from this sequence of partitions is Gaussian by the preceding problem and has the property that each ball B_ρ(f, 2⁻q) equals a partitioning set; hence these balls have d-diameter at most 2⁻q for every q. The latter implies that d ≤ 2ρ.]

5. Any semimetric d on an arbitrary set F for which there exists a Borel probability measure µ with

$$\lim_{\delta\downarrow 0} \sup_f \int_0^\delta \sqrt{\log\bigl( 1/\mu(B(f,\varepsilon)) \bigr)}\, d\varepsilon = 0$$

is Gaussian-dominated. Here B(f, ε) is the d-ball of radius ε.
[Hint: By Lemma A.2.24, there exists a sequence of nested partitions of F = ∪ᵢ F_{qi} in sets of d-diameter at most 2⁻q for each q and a Borel probability measure m such that

$$\lim_{q_0\to\infty} \sup_f \sum_{q>q_0} 2^{-q} \sqrt{\log\frac{1}{m(\mathcal{F}_q f)}} = 0.$$

As in the preceding problems, the semimetric ρ defined from this sequence of partitions has the property that each ball B_ρ(f, 2⁻q) equals a partitioning set. Hence

$$\lim_{\delta\downarrow 0} \sup_f \int_0^\delta \sqrt{\log\bigl( 1/m(B_\rho(f,\varepsilon)) \bigr)}\, d\varepsilon = 0,$$

where B_ρ(f, ε) is a ball with respect to ρ. By Proposition A.2.22, the Gaussian process G defined in Problem 2 has a version with bounded, uniformly ρ-continuous sample paths. The inequalities diam B(f, 2⁻q) ≤ 2⁻q imply that d ≤ 2ρ.]

6. Let F be a class of measurable functions such that

$$\int_0^\infty \sqrt{\log N_{[\,]}\bigl( \varepsilon, \mathcal{F}, L_{2,\infty}(P) \bigr)}\, d\varepsilon$$

is finite. Then there exists a Gaussian semimetric ρ on F such that

$$\Bigl\| \sup_{f,g\in B(\varepsilon)} |f - g| \Bigr\|_{P,2,\infty} \le \varepsilon,$$

for every ρ-ball B(ε) ⊂ F of radius ε.
[Hint: The finiteness of the integral implies the existence of a sequence of partitions of F, as in the preceding problems, with the further property

$$\Bigl\| \sup_{f,g\in\mathcal{F}_{qj}} |f - g| \Bigr\|_{P,2,\infty} \le 2^{-q}.$$

The semimetric ρ in the preceding problems is Gaussian, and every ball of radius 2⁻q coincides with a partitioning set. Thus the inequality is valid for ε = 2⁻q and hence up to a constant 2 for every ε. Change the metric to remove the 2.]

7. Corollary 2.11.12 is a refinement of Theorem 2.5.6.


8. If the Lindeberg condition on norms in Theorems 2.11.1 and 2.11.9 is replaced by the two assumptions

$$\sup_f \sum_{i=1}^{m_n} \mathrm{E}\bigl| Z_{ni}(f) \bigr|\, 1\bigl\{ \|Z_{ni}\|^*_{\mathcal F} > \eta \bigr\} \to 0,$$
$$\sum_{i=1}^{m_n} \mathrm{P}^*\bigl( \|Z_{ni}\|_{\mathcal F} > \eta \bigr) \to 0,$$

then the conclusion that the sequence of processes $\sum_{i=1}^{m_n} (Z_{ni} - \mathrm{E} Z_{ni})$ is asymptotically tight is still valid.
2.12
Partial-Sum Processes

The name “Donsker class of functions” was chosen in honor of Donsker's theorem on weak convergence of the empirical distribution function. A second famous theorem by Donsker concerns the partial-sum process

$$Z_n(s) = \frac{1}{\sqrt{n}} \sum_{i=1}^{\lfloor ns \rfloor} Y_i = \frac{1}{\sqrt{n}} \sum_{i=1}^{k} Y_i, \qquad \frac{k}{n} \le s < \frac{k+1}{n},$$

for i.i.d. random variables Y₁, . . . , Yₙ with zero mean and variance 1. Donsker essentially proved that the sequence of processes {Zₙ(t): 0 ≤ t ≤ 1} converges in distribution in the space ℓ∞[0, 1] to a standard Brownian motion process [Donsker (1951)].
In this chapter we discuss generalizations of this result in two directions. The first is to replace the Yᵢ's by processes {f(Xᵢ): f ∈ F} and consider convergence in ℓ∞([0, 1] × F). The second is to consider real variables on a lattice. Partial sums are then obtained by intersecting the lattice with sets in a given collection.

2.12.1 The Sequential Empirical Process


Let X₁, . . . , Xₙ be i.i.d. random elements with law P in the measurable space (X, A), and let F be a collection of square-integrable, measurable functions f: X → R. The sequential empirical process is defined as

$$Z_n(s, f) = \frac{1}{\sqrt{n}} \sum_{i=1}^{\lfloor ns \rfloor} \bigl( f(X_i) - P f \bigr) = \sqrt{\frac{\lfloor ns \rfloor}{n}}\; \mathbb{G}_{\lfloor ns \rfloor}(f),$$

where $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ is the empirical process indexed by F. The index (s, f) ranges over [0, 1] × F. The covariance function of Zₙ is given by

$$\mathrm{cov}\bigl( Z_n(s,f), Z_n(t,g) \bigr) = \frac{\lfloor ns \rfloor \wedge \lfloor nt \rfloor}{n} \bigl( P f g - P f\, P g \bigr).$$
By the multivariate central limit theorem, the marginals of the sequence of processes {Zₙ(s, f): (s, f) ∈ [0, 1] × F} converge to the marginals of a Gaussian process Z. The latter is known as the Kiefer-Müller process; it has mean zero and covariance function

$$\mathrm{cov}\bigl( Z(s,f), Z(t,g) \bigr) = (s \wedge t)\bigl( P f g - P f\, P g \bigr).$$
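As a sanity check (not from the text), the covariance formula above can be verified by simulation for the uniform distribution on [0, 1] and indicators f = 1_{[0,u]}, g = 1_{[0,v]}, for which Pfg − Pf Pg = u∧v − uv. The setup below, including the name `Zn`, is a hypothetical illustration of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 20_000
s, t, u, v = 0.3, 0.8, 0.4, 0.6

X = rng.uniform(size=(reps, n))           # reps independent samples of size n

def Zn(frac, level):
    """Z_n(frac, 1_{[0,level]}) = n^{-1/2} sum_{i <= floor(n*frac)} (1{X_i <= level} - level)."""
    k = int(n * frac)
    return ((X[:, :k] <= level) - level).sum(axis=1) / np.sqrt(n)

emp_cov = float(np.mean(Zn(s, u) * Zn(t, v)))
theory = (min(int(n * s), int(n * t)) / n) * (min(u, v) - u * v)
print(emp_cov, theory)                    # both close to 0.048
```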

The aim of this section is to show that the weak convergence Zₙ ⇝ Z is uniform with respect to the semimetric of ℓ∞([0, 1] × F) for every Donsker class of functions F. More precisely, for every such F, the sequence Zₙ is asymptotically tight and there exists a tight, Borel measurable version of the Kiefer-Müller process Z.
In accordance with the general results on Gaussian processes, tightness and measurability of the limit process Z are equivalent to the existence of a version of Z with all sample paths (s, f) ↦ Z(s, f) uniformly bounded and uniformly continuous with respect to the semimetric whose square is given by

$$\mathrm{E}\bigl( Z(s,f) - Z(t,g) \bigr)^2 = |s - t| \bigl( \rho_P^2(f) 1_{s>t} + \rho_P^2(g) 1_{s\le t} \bigr) + (s \wedge t)\, \rho_P^2(f - g).$$

This intrinsic semimetric is somewhat complicated. However, it is up to a


constant bounded (on [0, 1] × F) by the natural semimetric |s − t| + ρP (f −
g) if F is bounded for ρP ; in particular, if F is a Donsker class. By the
addendum of Theorem 1.5.7, asymptotic equicontinuity with respect to the
intrinsic semimetric is equivalent to asymptotic equicontinuity with respect
to the natural semimetric whenever [0, 1] × F is totally bounded under the
natural semimetric, particularly if F is a Donsker class.
Call F a functional Donsker class  if and only if the sequence Zn con-
verges in distribution in `∞ [0, 1] × F to a tight limit. Then the result can
be expressed as follows.

2.12.1 Theorem. A class of measurable functions is functionally Donsker


if and only if it is Donsker.

Proof. The empirical process $\mathbb{G}_n = \sqrt{n}(\mathbb{P}_n - P)$ can be recovered from the sequential process as Gₙ(f) = Zₙ(1, f). Since a restriction map is continuous, a class is certainly Donsker if it is functionally Donsker.
For the converse, it suffices to establish the asymptotic equicontinuity of the sequence Zₙ. With the usual notation Fδ = {f − g: f, g ∈ F, ρ_P(f − g) < δ}, the triangle inequality yields

$$(2.12.2)\qquad \sup_{|s-t|+\rho_P(f-g)<\delta} \bigl| Z_n(s,f) - Z_n(t,g) \bigr| \le \sup_{|s-t|<\delta} \bigl\| Z_n(s,f) - Z_n(t,f) \bigr\|_{\mathcal F} + \sup_{0\le t\le 1} \bigl\| Z_n(t,f) \bigr\|_{\mathcal F_\delta}.$$

In the second term on the right the parameter t may be restricted to the points k/n with k ranging over 1, 2, . . . , n. Thus, this term is equal to $\max_{k\le n} \sqrt{k/n}\, \|\mathbb{G}_k\|_{\mathcal F_\delta}$. By Ottaviani's inequality A.1.1,

$$\mathrm{P}^*\Bigl( \max_{k\le n} \sqrt{k/n}\, \|\mathbb{G}_k\|_{\mathcal F_\delta} > 2\varepsilon \Bigr) \le \frac{ \mathrm{P}^*\bigl( \|\mathbb{G}_n\|_{\mathcal F_\delta} > \varepsilon \bigr) }{ 1 - \max_{k\le n} \mathrm{P}^*\bigl( \sqrt{k/n}\, \|\mathbb{G}_k\|_{\mathcal F_\delta} > \varepsilon \bigr) }.$$

If the class F is Donsker, then the sequence Gₙ is asymptotically equicontinuous. Thus, the numerator converges to zero as n → ∞ followed by δ ↓ 0. The terms of the maximum in the denominator indexed by k ≤ n₀ can be controlled with the help of the inequality $\sqrt{k}\, \|\mathbb{G}_k\|_{\mathcal F_\delta} \le 2 \sum_{i=1}^{n_0} F(X_i) + 2 n_0 P^* F$ for an envelope function F. For sufficiently large n₀, the terms indexed by k > n₀ are bounded away from 1 by the asymptotic equicontinuity of Gₙ. Conclude that the denominator is bounded away from zero. Hence the second term on the right in (2.12.2) converges to zero in probability as n → ∞ followed by δ ↓ 0.
To prove the same for the first term on the right in (2.12.2), it suffices to prove convergence to zero of

$$\mathrm{P}^*\Bigl( \max_{0\le j\delta\le 1} \sup_{j\delta\le s\le (j+1)\delta} \bigl\| Z_n(s,f) - Z_n(j\delta,f) \bigr\|_{\mathcal F} > 2\varepsilon \Bigr).$$

By the stationarity of the increments of Zₙ in s, the at most ⌈1/δ⌉ terms in the maximum are identically distributed. Thus, the probability can be bounded by

$$\Bigl\lceil \frac{1}{\delta} \Bigr\rceil\, \mathrm{P}^*\Bigl( \sup_{0\le s\le \delta} \bigl\| Z_n(s,f) \bigr\|_{\mathcal F} > 2\varepsilon \Bigr).$$

Again replace the continuous index s by a discrete one and conclude from Ottaviani's inequality that the preceding display is not larger than

$$\Bigl\lceil \frac{1}{\delta} \Bigr\rceil\, \mathrm{P}^*\Bigl( \max_{k\le n\delta} \sqrt{k/n}\, \|\mathbb{G}_k\|_{\mathcal F} > 2\varepsilon \Bigr) \le \frac{ \lceil 1/\delta \rceil\, \mathrm{P}^*\bigl( \sqrt{\lfloor n\delta \rfloor/n}\, \|\mathbb{G}_{\lfloor n\delta \rfloor}\|_{\mathcal F} > \varepsilon \bigr) }{ 1 - \max_{k\le n\delta} \mathrm{P}^*\bigl( \sqrt{k/n}\, \|\mathbb{G}_k\|_{\mathcal F} > \varepsilon \bigr) }.$$

By the portmanteau theorem, the lim sup as n → ∞ of the probability in the numerator is bounded by $\mathrm{P}\bigl( \|\mathbb{G}\|_{\mathcal F} \ge \varepsilon/\delta^{1/2} \bigr)$. Since the norm ‖G‖_F of a Brownian bridge has moments of all orders (cf. Proposition A.2.3), the latter probability converges to zero faster than any power of δ as δ ↓ 0.
Conclude that the numerator converges to zero as n → ∞ followed by δ ↓ 0. By a similar argument as before, but now also using the fact that $\mathrm{P}\bigl( \|\mathbb{G}\|_{\mathcal F} > \varepsilon \bigr) < 1$ for every ε > 0 (cf. Problem A.2.5), the denominator remains bounded away from zero.

2.12.2 Partial-Sum Processes on Lattices


For each positive integer n, let Y_{n1}, . . . , Y_{nmₙ} be independent, real-valued random variables, and let Q_{n1}, . . . , Q_{nmₙ} be deterministic probability measures on a measurable space (X, A). Given a collection C of measurable subsets of X, consider the stochastic process

$$S_n(C) = \sum_{i=1}^{m_n} Y_{ni}\, Q_{ni}(C).$$

Thus, Sₙ is a randomly weighted sum of the measures Q_{ni}. Both the classical partial-sum process and its smoothed version are of this type.

2.12.3 Example. Let X = [0, 1] be the unit interval in the real line and C the collection of cells [0, s] with 0 ≤ s ≤ 1. If Q_{ni} is the Dirac measure at the point i/n (for 1 ≤ i ≤ n), then

$$S_n\bigl([0,s]\bigr) = \sum_{i:\, i/n \le s} Y_{ni} = \sum_{i=1}^{k} Y_{ni}, \qquad \text{if } \frac{k}{n} \le s < \frac{k+1}{n}.$$

The choice $Y_{ni} = Y_i/\sqrt{n}$ for an i.i.d. sequence Y₁, Y₂, . . . gives the classical partial-sum process.

2.12.4 Example. Let X = [0, 1] be the unit interval in the real line and C the collection of cells [0, s] with 0 ≤ s ≤ 1. If Q_{ni} is the uniform measure on the interval ((i − 1)/n, i/n), then

$$S_n\bigl([0,s]\bigr) = \sum_{i=1}^{k} Y_{ni} + \frac{s - k/n}{1/n}\, Y_{n,k+1}, \qquad \text{if } \frac{k}{n} \le s < \frac{k+1}{n}.$$

This is a linear interpolation of the partial-sum process in the previous example.
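A minimal numerical sketch of Examples 2.12.3 and 2.12.4, not part of the text (the function names are ours): the step version and its linear interpolation differ at any s by at most one weight Y_{n,k+1}.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
Y = rng.standard_normal(n) / np.sqrt(n)        # Y_ni = Y_i / sqrt(n)

def Sn_step(s):
    """Example 2.12.3: S_n([0,s]) = sum of the Y_ni with i/n <= s."""
    k = int(np.floor(n * s))
    return Y[:k].sum()

def Sn_interp(s):
    """Example 2.12.4: linear interpolation between the partial sums."""
    k = int(np.floor(n * s))
    out = Y[:k].sum()
    if k < n:
        out += (s - k / n) * n * Y[k]          # fractional share of Y_{n,k+1}
    return out

s = 0.4371
print(abs(Sn_step(s) - Sn_interp(s)) <= abs(Y).max())   # True: paths differ by at most one weight
```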

2.12.5 Example. If each Q_{ni} equals the Dirac measure at some point x_{ni} ∈ X, then the general process reduces to

$$S_n(C) = \sum_{i:\, x_{ni} \in C} Y_{ni}.$$

This may be visualized as random weights Y_{ni} being located at fixed points x_{ni}. Each set C is charged with the sum of the weights that it carries.
The process $S_n = \sum_i Y_{ni} Q_{ni}$ is the sum of the independent processes Z_{ni} = Y_{ni} Q_{ni}. Since

$$\bigl( Z_{ni}(C) - Z_{ni}(D) \bigr)^2 \le Y_{ni}^2\, Q_{ni}\bigl( C \triangle D \bigr)^2,$$

the processes Z_{ni} are measurelike with respect to the measures $\mu_{ni} = Y_{ni}^2 Q_{ni}$. Thus, for a collection C that satisfies the uniform-entropy condition, Theorem 2.11.1 combined with Lemma 2.11.6 readily yields a central limit theorem for the sequence Sₙ. This is true in particular for VC-classes of sets.

2.12.6 Theorem. For each n, let Q_{n1}, . . . , Q_{nmₙ} be deterministic probability measures on some measurable space, and let Y_{n1}, . . . , Y_{nmₙ} be independent, real-valued random variables with mean zero, satisfying

$$\sum_{i=1}^{m_n} \mathrm{E} Y_{ni}^2 = O(1),$$
$$\sum_{i=1}^{m_n} \mathrm{E} Y_{ni}^2\, 1\bigl\{ |Y_{ni}| > \eta \bigr\} \to 0, \qquad \text{for every } \eta > 0.$$

Let C be a class of measurable sets that satisfies the uniform-entropy condition, and assume that for some probability measure Q,

$$(2.12.7)\qquad \sup_{Q(C\triangle D)<\delta_n} \sum_{i=1}^{m_n} \mathrm{E} Y_{ni}^2 \bigl( Q_{ni}(C) - Q_{ni}(D) \bigr)^2 \to 0, \qquad \text{for every } \delta_n \downarrow 0.$$

Finally, suppose that the covariance function ESₙ(C)Sₙ(D) converges pointwise on C × C. Then the sequence $S_n = \sum_{i=1}^{m_n} Y_{ni} Q_{ni}$ converges weakly in ℓ∞(C) to a tight Gaussian process with uniformly continuous sample paths with respect to the semimetric Q(C △ D).

Proof. Apply Theorem 2.11.1 with Z_{ni} = Y_{ni} Q_{ni} and ρ the L₁(Q)-semimetric. Then ‖Z_{ni}‖_F can be taken equal to |Y_{ni}|.
The entropy condition (2.11.2) is satisfied in view of Lemma 2.11.6. Since C satisfies the uniform-entropy condition, it is totally bounded under the L₁(Q)-semimetric (for any probability measure Q).
The measurability conditions listed before Theorem 2.11.1 are satisfied, because the suprema can be replaced by countable suprema. Indeed, since C satisfies the uniform-entropy condition, it is totally bounded and hence separable in $L_1\bigl( \sum_{i=1}^{m_n} Q_{ni} + Q \bigr)$. Thus, there exists a countable subcollection that contains, for every C ∈ C, a sequence C_k with $\bigl( \sum_{i=1}^{m_n} Q_{ni} + Q \bigr)(C_k \triangle C) \to 0$ as k → ∞ (and n fixed). For this sequence, Sₙ(C) = lim_{k→∞} Sₙ(C_k). Similar arguments applied to the multiplier processes show that the suprema that are assumed measurable in Theorem 2.11.1 can be replaced by countable suprema.
The theorem follows from Theorem 2.11.1.

The covariance function of the process Sₙ is given by

$$\mathrm{E} S_n(C) S_n(D) = \sum_{i=1}^{m_n} \mathrm{E} Y_{ni}^2\, Q_{ni}(C)\, Q_{ni}(D).$$

In the special case that each Q_{ni} is a Dirac measure, we have the equality Q_{ni}(C)Q_{ni}(D) = Q_{ni}(C ∩ D) for every pair of sets, and the covariance function reduces to Qₙ(C ∩ D) for the measure

$$Q_n = \sum_{i=1}^{m_n} \mathrm{E} Y_{ni}^2\, Q_{ni}.$$

Then pointwise convergence of Qₙ to a limit on the collection of sets {C ∩ D: C, D ∈ C} verifies convergence of the sequence of covariance functions.
For any measures Q_{ni}, the sum on the left side of (2.12.7) is bounded by Qₙ(C △ D). Thus, consideration of the measures Qₙ can also be helpful for verification of the other, more involved, condition of the theorem. A simple sufficient condition for (2.12.7) is that Qₙ converges uniformly to Q on the collection of sets {C △ D: C, D ∈ C}. This puts further restrictions on the class C. A simple further sufficient condition is that the sequence Qₙ converges weakly to Q and, for every ε > 0, the collection C can be covered by finitely many brackets [C_l, C^u] of size Q(C^u − C_l) < ε and having Q-continuity sets C_l and C^u as boundaries (Problem 2.12.1).

2.12.8 Example. Let X = [0, 1]^d be the unit cube and C the collection of all cells [0, t] with 0 ≤ t ≤ 1. Let the collection of measures Q_{ni} be the n^d Dirac measures located at the nodes of the regular grid consisting of the n^d points {1/n, 2/n, . . . , 1}^d. This gives a higher-dimensional version of the classical partial-sum process. If $Y_{ni} = Y_i/n^{d/2}$ for an i.i.d. mean-zero sequence Y₁, Y₂, . . . with unit variance, then the measure Qₙ reduces to

$$Q_n = \frac{1}{n^d} \sum_{i=1}^{n^d} Q_{ni}.$$

This is the discrete uniform measure on the grid points. The sequence Qₙ converges weakly to Lebesgue measure, and the collection of cells [0, t] satisfies the bracketing condition for Lebesgue measure. Thus, (2.12.7) is satisfied for Q equal to Lebesgue measure, and the sequence of covariance functions converges. The limiting Gaussian process has covariance function ES(C)S(D) = λ(C ∩ D) for Lebesgue measure λ (a standard Brownian sheet).
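Not part of the text: a Monte Carlo check of the limiting covariance in Example 2.12.8 for d = 2 (the grid size and cells are arbitrary choices of ours): ESₙ([0, t¹])Sₙ([0, t²]) should be close to the Lebesgue measure of the intersection of the two cells.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 40, 5_000
t1, t2 = (0.5, 0.7), (0.8, 0.4)                  # cells [0, t1] and [0, t2] in [0,1]^2

Y = rng.standard_normal((reps, n, n)) / n        # Y_ni = Y_i / n^{d/2} with d = 2

def S(t):
    """S_n([0,t]): sum the weights at grid points (i/n, j/n) inside the cell."""
    k1, k2 = int(n * t[0]), int(n * t[1])
    return Y[:, :k1, :k2].sum(axis=(1, 2))

emp = float(np.mean(S(t1) * S(t2)))
theory = min(t1[0], t2[0]) * min(t1[1], t2[1])   # lambda([0,t1] ∩ [0,t2]) = 0.5 * 0.4 = 0.2
print(emp, theory)
```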

2.12.9 Example. Let X = [0, 1]^d be the unit cube and C a VC-class. Partition [0, 1]^d into n^d cubes of volume n^{−d}, and let the collection Q_{ni} be the set of uniform probability measures on the cubes. This gives a smoothed version of the partial-sum process in higher dimensions. If each of the Y_{ni} has variance n^{−d}, then the measure Qₙ reduces to Lebesgue measure. Thus (2.12.7) is satisfied for Q equal to Lebesgue measure.
If the cubes in the partition are denoted C_{ni}, then the covariance function can be written

$$\frac{1}{n^d} \sum_{i=1}^{n^d} \frac{\lambda(C_{ni}\cap C)}{\lambda(C_{ni})}\, \frac{\lambda(C_{ni}\cap D)}{\lambda(C_{ni})} = \mathrm{E}\Bigl[ \mathrm{E}\bigl( 1_C(U) \mid \mathcal{A}_n \bigr)\, \mathrm{E}\bigl( 1_D(U) \mid \mathcal{A}_n \bigr) \Bigr],$$

where U is the identity map on [0, 1]^d equipped with Lebesgue measure and A_n is the σ-field generated by the partition. For n → ∞, the sequence E(1_C(U) | A_n) converges in probability to 1_C(U) for every set C. (This follows from a martingale convergence theorem or a direct argument: it is clear that E(h | A_n) → h in quadratic mean for every continuous function h; each indicator function 1_C can be approximated arbitrarily closely by a continuous function.) Thus, the sequence of covariance functions converges to the limit E 1_C(U) 1_D(U) = λ(C ∩ D).

Problems and Complements


1. Suppose the sequence of Borel measures Qn converges weakly to a probability
measure Q. Let C be a collection of Borel sets that for every ε > 0 can be
covered with finitely many brackets [Cl , C u ] of size Q(C u − Cl ) < ε and
consisting of Q-continuity sets Cl and C u . Then Qn → Q uniformly in C ∈
C 4 C and C ∈ C ∩ C.
[Hint: The following problem is helpful. Compare the proof of Theo-
rem 2.4.1.]

2. If C has the bracketing property as in the preceding problem, then so do the


collections C 4 C and C ∩ C.

3. If $Z_{ni}(s, f) = f(X_i)\, 1\{ i/n \le s \}$, then the sequential empirical process can be written $Z_n = n^{-1/2} \sum_{i=1}^n (Z_{ni} - \mathrm{E} Z_{ni})$. Next the results of Chapter 2.11 can be used to deduce that the sequence Zₙ converges in distribution. Compare the result with the present chapter.
4. (Time reversal) If Z is a Kiefer-Müller process indexed by [1, ∞) × F, then
the process (t, f ) 7→ tZ(1/t, f ) is a Kiefer-Müller process indexed by (0, 1]×F.
5. If Zₙ is the sequential empirical process, then the sequence of processes (t, f) ↦ Zₙ(t, f)/(t ∨ 1) converges weakly in ℓ∞(R⁺ × F).
[Hint: The process equals $\sqrt{n}\, \bigl( \lfloor nt \rfloor / (n(t \vee 1)) \bigr) (\mathbb{P}_{\lfloor nt \rfloor} - P)$. Use the reverse martingale property of ‖Pₙ − P‖*_F given in Lemma 2.4.6.]

6. Combination of the preceding problems yields that $\sup_{m\ge n} \sqrt{n}\, \|\mathbb{P}_m - P\|_{\mathcal F}$ converges weakly to $\sup_{0\le t\le 1} \|Z(t, f)\|_{\mathcal F}$. This generalizes results of Müller (1968). See Hjort and Fenstad (1992), Section 4, for applications of this result.
7. A standard Brownian sheet S indexed by R⁺ × [0, 1] is a zero-mean Gaussian process with continuous sample paths and covariance function

$$\mathrm{cov}\bigl( S(s,u), S(t,v) \bigr) = (s \wedge t)(u \wedge v).$$

Then the process S(s, u) − uS(s, 1) is a classical Kiefer-Müller process (a Kiefer-Müller process for X = [0, 1] with uniform measure and F the collection of indicators of cells [0, u]).
8. With S a standard Brownian sheet, consider the processes S(s, u) − suS(1, 1) and S(s, u) − uS(s, 1) − sS(1, u) + suS(1, 1). On which sides of the unit square are these processes zero almost surely? (The second process appears in connection with tests for independence; see Chapter 3.9 and Appendix A.2.2.)
2.13
Other Donsker Classes

In this section we consider some Donsker classes of interest that do not fit well into the framework of the preceding chapters.

2.13.1 Sequences
A countable class of functions may be shown to be Donsker by any of the
criteria we have discussed so far. A very simple sufficient condition is that
the sequence converges to zero sufficiently fast.

2.13.1 Theorem (Sequences). Any sequence {fi} of square-integrable, measurable functions with the property Σ_{i=1}^∞ P(fi − P fi)² < ∞ is P-Donsker.

Proof. For a fixed natural number m, define a partition {fi} = ∪_{i=1}^{m+1} Fi by letting Fi consist of the single function fi for i ≤ m and F_{m+1} = {f_{m+1}, f_{m+2}, . . .}. Since the variation over the first m sets in the partition is zero,

P*( sup_i sup_{f,g∈Fi} |Gn(f − g)| > ε ) ≤ P*( sup_{f∈F_{m+1}} |Gn f| > ε/2 ) ≤ (4/ε²) Σ_{i=m+1}^∞ P(fi − P fi)²,

by Chebyshev's inequality. For sufficiently large m, this is smaller than any prescribed η > 0. The result follows from Theorem 1.5.6.
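The Chebyshev bound in this display is easy to probe by simulation. Below is a minimal sketch under illustrative assumptions: the sequence fi(x) = 2^{−i} √2 cos(πix) on uniform observations (so that P fi = 0 and Σ P(fi − P fi)² = Σ 4^{−i} < ∞), with arbitrary choices of n, ε and m; it compares a Monte Carlo estimate of the tail probability with the bound (4/ε²) Σ_{i>m} P(fi − P fi)².

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 200, 500          # sample size and Monte Carlo repetitions (arbitrary)
eps, m, i_max = 0.5, 2, 30   # threshold, cut-off m, truncation of the sequence

i = np.arange(m + 1, i_max + 1)
# P(f_i - P f_i)^2 = Var(2^{-i} sqrt(2) cos(pi*i*X)) = 4^{-i} for uniform X
bound = (4 / eps**2) * np.sum(4.0 ** (-i.astype(float)))

exceed = 0
for _ in range(n_rep):
    x = rng.uniform(size=n)
    # G_n f_i = n^{-1/2} * sum_k f_i(X_k), since P f_i = 0
    gn = (2.0 ** (-i[:, None].astype(float)) * np.sqrt(2)
          * np.cos(np.pi * i[:, None] * x)).sum(axis=1) / np.sqrt(n)
    exceed += np.max(np.abs(gn)) > eps / 2
emp = exceed / n_rep

assert emp <= bound  # the Chebyshev bound is (very) conservative here
```

With these numbers the bound is about 1/3, while the empirical frequency is typically a few percent; the bound is crude, but that suffices for the asymptotic-equicontinuity argument.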
2.13.2 Elliptical Classes

The preceding theorem may be combined with Theorem 2.10.5 to conclude that the class

{ Σ_{i=1}^∞ ci fi : Σ_{i=1}^∞ |ci| ≤ 1, and the series converges pointwise }

is Donsker for any given sequence fi with Σ_{i=1}^∞ P fi² < ∞. Under the additional condition that the functions fi are orthogonal, this can be improved.

2.13.2 Theorem (Elliptical classes). Let {fi} be a sequence of measurable functions such that P fi fj = 0 for every i ≠ j and Σ_{i=1}^∞ P fi² < ∞. Then the class of all pointwise converging series Σ_{i=1}^∞ ci fi, such that Σ_{i=1}^∞ ci² ≤ 1, is P-Donsker.
Proof. By the condition on c, each of the series Σ_{i=1}^∞ ci fi converges pointwise as well as in L2(P). The class F of all these series is totally bounded in L2(P), because it is bounded and can be approximated arbitrarily closely by a finite-dimensional set, since

P( Σ_{i>m} ci fi )² = Σ_{i>m} ci² P fi² ≤ max_{i>m} P fi² → 0,   m → ∞.

It suffices to show that the sequence of empirical processes Gn indexed by F is asymptotically equicontinuous with respect to the L2(P)-seminorm. For f = Σ ci fi, g = Σ di fi, and any natural number k,

(Gn(f) − Gn(g))² = ( Σ_{i=1}^∞ (ci − di) Gn(fi) )²
  ≤ 2 Σ_{i=1}^k (ci − di)² P fi² · Σ_{i=1}^k Gn²(fi)/(P fi²) + 2 Σ_{i=k+1}^∞ (ci − di)² · Σ_{i=k+1}^∞ Gn²(fi).

In view of the assumption that ‖c − d‖₂ ≤ ‖c‖₂ + ‖d‖₂ ≤ 2, this expression is bounded by

2 ‖f − g‖²_{P,2} Σ_{i=1}^k Gn²(fi)/(P fi²) + 8 Σ_{i=k+1}^∞ Gn²(fi).

Take the supremum over all pairs of series f and g with ‖f − g‖_{P,2} < δ. Since E Gn²(fi) ≤ P fi², the expectation of this supremum is bounded by 2δ²k + 8 Σ_{i=k+1}^∞ P fi². This expression can be made arbitrarily small by first choosing k large and next δ small.
In terms of an orthonormal sequence {ψi} in L2(P), the preceding theorem can be stated in the following manner. For a given sequence of numbers bi, the elliptical class

F = { Σ_{i=1}^∞ ci ψi : Σ_{i=1}^∞ ci²/bi² ≤ 1 and the series converges pointwise }

is P-Donsker if Σ_{i=1}^∞ bi² < ∞. The latter condition is also known to be necessary. In fact, it is already necessary for the class to be pre-Gaussian.† The pointwise convergence of the series forming F appears not to be automatic. In view of the Cauchy-Schwarz inequality, a simple sufficient condition for absolute convergence at x is that Σ ψi(x)² bi² < ∞.
Elliptical classes are of some interest, because some well-known test statistics, including the Cramér-von Mises and the Anderson-Darling statistics, arise as the (generalized) Kolmogorov-Smirnov statistic ‖Gn‖_F indexed by an elliptical class. This is shown in the examples ahead. The Kolmogorov-Smirnov statistic corresponding to an elliptical class can be represented as a series of uncorrelated variables. Indeed, for F equal to the class of functions in the previous display,

‖Gn‖²_F = sup_f ( Σ_{i=1}^∞ ci Gn(ψi) )² = Σ_{i=1}^∞ bi² Gn²(ψi),

in view of the Cauchy-Schwarz inequality. If the ψi are uncorrelated and of unit variance, then the sequence ‖Gn‖²_F is asymptotically distributed as Σ bi² Zi² for an i.i.d. sequence Z1, Z2, . . . of standard normal variables. This follows from the series representation for ‖Gn‖²_F; it can also be obtained from the fact that ‖Gn‖²_F converges in distribution to the square norm of a Brownian bridge G and a series representation for ‖G‖²_F, which can be obtained in an analogous manner.

2.13.3 Example (Cramér-von Mises). Let Pn be the empirical distribution of an i.i.d. sample of size n from the uniform distribution on the unit interval [0, 1] ⊂ R. Write Gn(t) = √n (Pn − P)[0, t] for the classical empirical process (t ∈ [0, 1]). Let F be the class of functions

F = { Σ_{j=1}^∞ cj √2 cos(πjt) : Σ_{j=1}^∞ cj² π² j² ≤ 1 }.

Since the cosines are bounded, the series defining the elements of F are uniformly convergent in this case. The classical Cramér-von Mises statistic equals the square of the Kolmogorov-Smirnov statistic over the elliptical class F:

∫₀¹ Gn²(t) dt = ‖Gn‖²_F.


† Dudley (1967b), Proposition 6.3.
To see this, note that the Cramér-von Mises statistic is the square of the L2[0, 1]-norm of the function t ↦ Gn(t). Since the functions √2 sin(πjt), j = 1, 2, . . ., form an orthonormal base of L2[0, 1], Parseval's formula yields

∫₀¹ Gn²(t) dt = Σ_{j=1}^∞ ( ∫₀¹ Gn(t) √2 sin(πjt) dt )² = Σ_{j=1}^∞ (1/(π²j²)) Gn²(√2 cos(πjt)).

(Note that −∫ Gn(t) f′(t) dt = ∫ f dGn = Gn f, by the definitions of t ↦ Gn(t) and Gn f.) The functions √2 cos(πjt), j = 1, 2, . . ., form an orthonormal system in L2[0, 1], so that the result follows from the general series representation of ‖Gn‖²_F for elliptical classes.
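This Parseval identity holds path by path and can be verified numerically by comparing the usual closed-form expression for the Cramér-von Mises statistic with a truncated version of the series (sample size and truncation level below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
u = np.sort(rng.uniform(size=n))

# closed form for the Cramer-von Mises statistic: int_0^1 G_n(t)^2 dt
i = np.arange(1, n + 1)
cvm = 1 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)

# series representation: sum_j (pi*j)^{-2} * G_n(sqrt(2) cos(pi j .))^2,
# where G_n(sqrt(2) cos(pi j .)) = n^{-1/2} sum_k sqrt(2) cos(pi j U_k)
J = 5000
j = np.arange(1, J + 1)
gn = (np.sqrt(2) * np.cos(np.pi * np.outer(j, u))).sum(axis=1) / np.sqrt(n)
series = np.sum(gn ** 2 / (np.pi ** 2 * j ** 2))

assert abs(cvm - series) < 1e-3   # equal up to the series truncation error
```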

2.13.4 Example (Watson). In the situation of the preceding example, let F be the elliptical class

{ Σ_{j=1}^∞ ( c_{2j−1} √2 cos(2πjt) + c_{2j} √2 sin(2πjt) ) : Σ_{j=1}^∞ (c²_{2j−1} + c²_{2j}) 4π² j² ≤ 1 }.

The series are again uniformly convergent. The Watson statistic equals the square of the Kolmogorov-Smirnov statistic over the elliptical class F:

∫₀¹ ( Gn(t) − ∫₀¹ Gn(s) ds )² dt = ‖Gn‖²_F.

This can be seen by a similar argument. The Watson statistic is the square of the L2[0, 1]-norm of the projection of the function t ↦ Gn(t) onto the mean-zero functions. The functions √2 sin(2πjt), √2 cos(2πjt), j = 1, 2, . . ., form an orthonormal base of the mean-zero functions. Application of Parseval's formula followed by partial integration as in the preceding example yields the result.
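As a numerical check (with arbitrary sample size and truncation level), the closed-form expression U² = W² − n(ū − 1/2)² for the Watson statistic can be compared with the truncated series obtained from partial integration as in the preceding example, which produces the weight (2πj)^{−2} for both the sine and the cosine term of frequency j:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
u = np.sort(rng.uniform(size=n))
i = np.arange(1, n + 1)

# Watson statistic: int (G_n(t) - int G_n(s) ds)^2 dt = W^2 - n*(mean(u) - 1/2)^2
w2 = 1 / (12 * n) + np.sum(((2 * i - 1) / (2 * n) - u) ** 2)
watson = w2 - n * (u.mean() - 0.5) ** 2

# series with weights 1/(4 pi^2 j^2) on both trigonometric components
J = 5000
j = np.arange(1, J + 1)
gc = (np.sqrt(2) * np.cos(2 * np.pi * np.outer(j, u))).sum(axis=1) / np.sqrt(n)
gs = (np.sqrt(2) * np.sin(2 * np.pi * np.outer(j, u))).sum(axis=1) / np.sqrt(n)
series = np.sum((gc ** 2 + gs ** 2) / (4 * np.pi ** 2 * j ** 2))

assert abs(watson - series) < 1e-3
```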

2.13.5 Example (Anderson-Darling). In the situation of the preceding example, let F be the elliptical class

{ Σ_{j=1}^∞ cj √2 pj(2t − 1) : Σ_{j=1}^∞ cj² j(j + 1) ≤ 1 and pointwise convergence },

where the functions p0(u) = (1/2)√2, p1(u) = (1/2)√6 u, p2(u) = (1/4)√10 (3u² − 1), p3(u) = (1/4)√14 (5u³ − 3u), . . . are the orthonormalized Legendre polynomials in L2[−1, 1]. (This is the orthonormal system obtained by applying the Gram-Schmidt procedure to the functions 1, u, u², . . ..) The Anderson-Darling statistic equals the square of the Kolmogorov-Smirnov statistic over the elliptical class F:

∫₀¹ Gn²(t)/(t(1 − t)) dt = ‖Gn‖²_F.
The argument is slightly more involved than in the preceding examples, but it is based on the same idea. The normalized Legendre polynomials satisfy the differential equations p″j(u)(1 − u²) − 2u p′j(u) = −j(j + 1) pj(u) for u ∈ [−1, 1] (Problem 2.13.1). By a change of variables and partial integration, it can be deduced that

2 ∫₀¹ p′i(2t − 1) p′j(2t − 1) t(1 − t) dt = −(1/4) ∫_{−1}^1 pi(u) ( p″j(u)(1 − u²) − 2u p′j(u) ) du = (1/4) j(j + 1) δij.

It follows that the functions 2√2 p′j(2t − 1) √(t(1 − t)) / √(j(j + 1)), with j ranging over {1, 2, . . .}, form an orthonormal base of L2[0, 1]. By Parseval's formula, the Anderson-Darling statistic equals

Σ_{j=1}^∞ (8/(j(j + 1))) ( ∫₀¹ Gn(t) p′j(2t − 1) dt )² = Σ_{j=1}^∞ (2/(j(j + 1))) Gn²( pj(2t − 1) ).

This equals ‖Gn‖²_F by the general series representation of ‖Gn‖²_F for elliptical classes.
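This representation can also be verified numerically, path by path: the sketch below (arbitrary sample size and truncation level J) compares the standard closed-form expression for the Anderson-Darling statistic with the truncated series Σ_j (2/(j(j + 1))) Gn²(pj(2t − 1)), evaluating the orthonormalized Legendre polynomials via numpy:

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(4)
n = 100
u = np.sort(rng.uniform(size=n))
i = np.arange(1, n + 1)

# closed form for the Anderson-Darling statistic int G_n(t)^2 / (t(1-t)) dt
ad = -n - np.mean((2 * i - 1) * (np.log(u) + np.log(1 - u[::-1])))

# series: sum_j 2/(j(j+1)) * G_n(p_j(2t-1))^2, p_j the orthonormal Legendre polys
J = 1000
V = legendre.legvander(2 * u - 1, J)           # columns: P_0, ..., P_J at 2u - 1
j = np.arange(1, J + 1)
norm = np.sqrt((2 * j + 1) / 2)                # orthonormalization on [-1, 1]
gn = norm * V[:, 1:].sum(axis=0) / np.sqrt(n)  # G_n(p_j(2 . - 1)), j = 1..J
series = np.sum(2 * gn ** 2 / (j * (j + 1)))

assert abs(ad - series) < 0.05                 # equal up to truncation error
```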

2.13.3 Classes of Sets


The preceding chapters give a wide variety of relatively easy-to-apply suf-
ficient conditions for the Donsker property. They have little to say about
necessary conditions beyond the most abstract level or in concrete cases.
For classes of sets, simple necessary conditions, up to measurability, are
known.
Recall from Chapter 2.6 that for a given collection C of sets and given
points, ∆n (C, X1 , . . . , Xn ) denotes the number of subsets of {X1 , . . . , Xn }
picked out by C. Also define Kn (C, X1 , . . . , Xn ) to be the cardinality of a
maximal subset of {X1 , . . . , Xn } shattered by C:
n o
Kn (C, X1 , . . . , Xn ) = max #A: ∆n (C, A) = 2#A ,

where A ranges over all possible subsets of {X1 , . . . , Xn }.

2.13.6 Theorem. For every pointwise-separable collection of measurable


sets, the following statements are√equivalent:
(i) log ∆n (C, X1 , . . . , Xn ) = o√∗P ( n) and C is P -pre-Gaussian;
(ii) Kn (C, X1 , . . . , Xn ) = o∗P ( n) and C is P -pre-Gaussian;
(iii) C is P -Donsker; √
(iv) log N εn−1/2 , C, L1 (Pn ) = o∗P ( n) for every ε > 0 and C is P -pre-


Gaussian.

Proof. Giné and Zinn (1984) and Talagrand (1988).


Problems and Complements

1. The Legendre polynomials on [−1, 1] satisfy the differential equation

p″j(u)(1 − u²) − 2u p′j(u) = −j(j + 1) pj(u).

[Hint: It suffices to prove the same relationship for the polynomials pj(u) defined as u^j minus the projection of u^j on the linear space spanned by 1, u, . . . , u^{j−1}. The left side of the differential equation is a polynomial of degree j with leading term −j(j + 1)u^j. By definition, the right side has the same property. It suffices to show that the left side is orthogonal to the functions 1, u, . . . , u^{j−1}. By partial integration, for k < j,

∫_{−1}^1 p′j(u) 2u u^k du = 2pj(1) ± 2pj(−1) − ∫_{−1}^1 pj(u) 2(k + 1) u^k du = 2pj(1) ± 2pj(−1),

where the sign ± is + if k is even and − otherwise, and the last integral vanishes because pj is orthogonal to u^k for k < j. By using partial integration twice, this can be seen to be equal to ∫_{−1}^1 p″j(u)(1 − u²) u^k du.]
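Since both sides of the differential equation are polynomials, the identity can be checked exactly, coefficient by coefficient, with numpy's polynomial classes (for the classical, unnormalized Legendre polynomials; the identity is insensitive to normalization):

```python
import numpy as np
from numpy.polynomial import Legendre, Polynomial

u = Polynomial([0, 1])                     # the monomial u
for j in range(0, 8):
    p = Legendre.basis(j).convert(kind=Polynomial)   # P_j in the monomial basis
    lhs = p.deriv(2) * (1 - u**2) - 2 * u * p.deriv(1)
    rhs = -j * (j + 1) * p
    # the two sides agree as polynomials, coefficient by coefficient
    assert np.allclose((lhs - rhs).coef, 0, atol=1e-9)
```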

2. The representation of the Watson statistic as the Kolmogorov-Smirnov statistic of an elliptical class is not unique. Consider the class

F = { Σ_{j=1}^∞ cj √2 sin(πjt) : Σ_{j=1}^∞ cj² π² j² ≤ 1 }.

Then the Watson statistic equals ‖Gn‖²_F. (The terms in the accompanying series representation are not asymptotically independent; hence this representation is of less interest.)
[Hint: The functions √2 cos(πjt), j = 1, 2, . . ., form an orthonormal base of the mean-zero functions in L2[0, 1].]
2.14
Maximal Inequalities and Tail Bounds

In this chapter we derive moment and tail bounds for the supremum ‖Gn‖_F of the empirical process. Throughout this chapter, Gn = √n (Pn − P) denotes the empirical process of an i.i.d. sample X1, . . . , Xn from a probability measure P, defined as the coordinate projections of a product probability space (X^∞, A^∞, P^∞).
In the first two sections we consider classes of functions such that
either the uniform-entropy integral or the bracketing integral converges.
Then both Lp -moments and ψp -moments can be bounded by a multiple
of the entropy integral times the corresponding moment of the envelope
function. In view of general results that bound the higher-order Lp- and ψp-norms in terms of the L1-norm, the main job is to derive upper bounds for the expectation E* ‖Gn‖_F. We derive such bounds also with a view toward statistical applications in Part 3.
For some of these applications it is important that the bound takes
account of the (small) variance of the individual functions. We present
useful bounds both under uniform entropy and bracketing.
Bounds on moments imply rates of decrease for the tail probabilities P*( ‖Gn‖_F > t ) as t → ∞. For uniformly bounded classes F, the tails of ‖Gn‖_F are exponentially small. In Section 2.14.5, we derive exponential bounds that give the correct constant in the exponent for such classes. Most of the tail bounds are uniform in n.
All bounds measure the size of the supremum ‖Gn‖_F. In Chapter 2.15 we consider the concentration of this variable around its mean. Unlike the bounds in the present chapter, the concentration bounds do not depend
(much) on the size of the class F, and hence have a different character.

2.14.1 Uniform Entropy Integrals

For a class F of measurable functions with measurable envelope function F, define the uniform entropy integral by

J(δ, F | F, L2) = sup_Q ∫₀^δ √( 1 + log N( ε‖F‖_{Q,2}, F, L2(Q) ) ) dε,

where the supremum is taken over all discrete probability measures Q with ‖F‖_{Q,2} > 0. The envelope function need not be the minimal envelope function. If the envelope function is clear from the context, we may also omit it from the notation and write J(δ, F, L2).
Certainly J(1, F, L2) < ∞ if F satisfies the uniform-entropy condition (2.5.1). For Vapnik-Červonenkis classes F, the function J(δ, F, L2) is of the order O( δ √(log(1/δ)) ) as δ ↓ 0.
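This order can be illustrated numerically. Under a VC-type bound N(ε‖F‖_{Q,2}, F, L2(Q)) ≤ K (1/ε)^V (the values of K and V below are arbitrary), the ratio J(δ)/(δ √log(1/δ)) stays bounded, tending slowly toward √V as δ ↓ 0:

```python
import numpy as np

K, V = 10.0, 3.0   # illustrative VC-type constants: N(eps) <= K * (1/eps)^V

def J(delta, num=100_000):
    # midpoint-rule quadrature of sqrt(1 + log N(eps)) over (0, delta);
    # the logarithmic singularity at 0 is integrable
    eps = (np.arange(num) + 0.5) * delta / num
    return np.sqrt(1 + np.log(K) + V * np.log(1 / eps)).sum() * delta / num

ratios = [J(d) / (d * np.sqrt(np.log(1 / d))) for d in (1e-2, 1e-3, 1e-4, 1e-5)]
assert all(np.sqrt(V) < r < 3 * np.sqrt(V) for r in ratios)   # bounded ratio
assert ratios[0] > ratios[-1]                                  # slowly decreasing
```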

2.14.1 Theorem. Let F be a P-measurable class of measurable functions with measurable envelope function F. Then, for p ≥ 1,

‖ ‖Gn‖*_F ‖_{P,p} ≲ ‖ J(θn, F | F, L2) ‖F‖_n ‖_{P,p} ≲ J(1, F | F, L2) ‖F‖_{P,2∨p}.

Here θn = sup_{f∈F} ‖f‖_n / ‖F‖_n, where ‖·‖_n is the L2(Pn)-seminorm, and the inequalities are valid up to constants depending only on the p involved in the statement.
Pn
Proof. Set Gon = n^{−1/2} Σ_{i=1}^n εi f(Xi) for the symmetrized empirical process. In view of the symmetrization lemma, Lemma 2.3.1, Orlicz norms of ‖Gn‖*_F are bounded by two times the corresponding Orlicz norms of ‖Gon‖_F.
Given X1, . . . , Xn, the process Gon is sub-Gaussian for the L2(Pn)-seminorm ‖·‖_n, by Hoeffding's inequality. The value ηn = sup_{f∈F} ‖f‖_n is an upper bound for the radius of F ∪ {0} with respect to this norm. The maximal inequality Theorem 2.2.4 gives

‖ ‖Gon‖_F ‖_{ψ2|X} ≲ ∫₀^{ηn} √( 1 + log N( ε, F, L2(Pn) ) ) dε,

where ‖·‖_{ψ2|X} is the conditional Orlicz norm for ψ2, given X1, X2, . . .. Make a change of variable and bound the random entropy by a supremum to see that the right side is bounded by J(θn, F | F, L2) ‖F‖_n. Every Lp-norm is bounded by a multiple of the ψ2-Orlicz norm. Hence

Eε ‖Gon‖^p_F ≲ J(θn, F | F, L2)^p ‖F‖^p_n.

Take the expectation over X1, . . . , Xn to obtain the left inequality of the theorem. Since θn ≤ 1, the right side of the preceding display is bounded by J(1, F | F, L2)^p ‖F‖^p_n. For p ≥ 2, this is further bounded by J(1, F | F, L2)^p n^{−1} Σ F^p(Xi), by Jensen's inequality. This gives the inequality on the right side of the theorem.
Because ‖Gn F‖_{P,2} ≤ ‖F‖_{P,2}, the far right bound of the theorem can be interpreted as saying that the supremum of the empirical process has the same order of magnitude as its value at the envelope, as soon as the uniform entropy integral is finite. This is a useful result, also for many statistical applications, but it has the drawback that it is insensitive to the sizes of the individual functions.
The next theorem remedies this. Its bound is not accurate when δ is very small (relative to n and the entropy integral), but it has exactly the right form for applications to obtain rates of convergence (in Chapter 3.4).

2.14.2 Theorem. Let F be a P-measurable class of measurable functions with envelope function F ≤ 1 and such that F² is P-measurable. If P f² < δ² P F² for every f and some δ ∈ (0, 1), then

E*_P ‖Gn‖_F ≲ J(δ, F, L2) ( 1 + J(δ, F, L2) / (δ² √n ‖F‖_{P,2}) ) ‖F‖_{P,2}.

Proof. Because δ ↦ J(δ, F, L2) is the integral of a nonincreasing nonnegative function, it is a concave function such that the map t ↦ J(t)/t, which is the average of its derivative over [0, t], is nonincreasing. The concavity shows that its perspective (x, t) ↦ t J(x/t, F, L2) is a concave function of its two arguments (cf. Boyd and Vandenberghe (2004), page 89). Furthermore, the "extended-value extension" of this function (which by definition is −∞ if x ≤ 0 or t ≤ 0) is obviously nondecreasing in its first argument and was noted to be nondecreasing in its second argument. Therefore, by the vector composition rules for concave functions (Boyd and Vandenberghe (2004), pages 83–87, especially the last two lines of page 86), the function (x, y) ↦ H(x, y) := J( √x/√y, F, L2 ) √y is concave. We have that E*_P Pn F² = ‖F‖²_{P,2}. Therefore, by an application of Jensen's inequality to the first inequality in Theorem 2.14.1, we obtain, for σn² = sup_f Pn f²,

(2.14.3)  E*_P ‖Gn‖_F ≲ J( √(E*_P σn²) / ‖F‖_{P,2}, F, L2 ) ‖F‖_{P,2}.

The application of Jensen's inequality with outer expectations can be justified here by the monotonicity of the function H, which shows that the measurable majorant of a variable H(U, V) is bounded above by H(U*, V*), for U* and V* measurable majorants of U and V. Thus E* H(U, V) ≤ E H(U*, V*), after which Jensen's inequality can be applied in its usual (measurable) form.
The second step of the proof is to bound E*_P σn². Because Pn f² = P f² + n^{−1/2} Gn f² and P f² ≤ δ² P F² for every f, we have

(2.14.4)  E*_P σn² ≤ δ² ‖F‖²_{P,2} + (1/√n) E*_P ‖Gn‖_{F²}.

Here the empirical process in the second term can be replaced by the symmetrized empirical process Gon at the cost of adding a multiplicative factor 2, by Lemma 2.3.1. The expectation can be factorized as the expectation on the Rademacher variables ε followed by the expectation on X1, . . . , Xn, and Eε ‖Gon‖_{F²} ≤ 2 Eε ‖Gon‖_F by the contraction principle for Rademacher variables, Proposition A.1.11, and the fact that F ≤ 1 by assumption. Taking the expectation on X1, . . . , Xn, we obtain that E*_P ‖Gn‖_{F²} ≤ 4 E*_P ‖Gon‖_F, which in turn is bounded above by 8 E*_P ‖Gn‖_F by the desymmetrization inequality, Lemma 2.3.6.
Thus F² in the last term of (2.14.4) can be replaced by F, at the cost of inserting a constant. Next we apply (2.14.3) to this term, and conclude that z² := E*_P σn² / ‖F‖²_{P,2} satisfies the inequality

(2.14.5)  z² ≲ δ² + J(z, F, L2) / (√n ‖F‖_{P,2}).

We apply Lemma 2.14.6 (below) with r = 1, A = δ and B² = 1/(√n ‖F‖_{P,2}) to see that

J(z, F, L2) ≲ J(δ, F, L2) + J²(δ, F, L2) / (δ² √n ‖F‖_{P,2}).

We insert this in (2.14.3) to complete the proof.

2.14.6 Lemma. Let J: (0, ∞) → R be a concave, nondecreasing function with J(0) = 0. If z² ≤ A² + B² J(z^r) for some r ∈ (0, 2) and A, B > 0, then

J(z) ≲ J(A) [ 1 + J(A^r) (B/A)² ]^{1/(2−r)}.

Proof. For t > s > 0 we can write s as the convex combination s = (s/t)t + (1 − s/t)0 of t and 0. Since J(0) = 0, the concavity of J gives that J(s) ≥ (s/t)J(t). Thus the function t ↦ J(t)/t is decreasing, which implies that J(Ct) ≤ C J(t) for C ≥ 1 and any t > 0.
By the monotonicity of J and the assumption on z, it follows that

J(z^r) ≤ J( (A² + B² J(z^r))^{r/2} ) ≤ J(A^r) ( 1 + (B/A)² J(z^r) )^{r/2}.

This implies that J(z^r) is bounded by a multiple of the maximum of J(A^r) and J(A^r) (B/A)^r J(z^r)^{r/2}. If it is bounded by the second one, then J(z^r)^{1−r/2} ≲ J(A^r) (B/A)^r. We conclude that

J(z^r) ≲ J(A^r) + J(A^r)^{2/(2−r)} (B/A)^{2r/(2−r)}.

Next, again by the monotonicity of J,

J(z) ≤ J( √(A² + B² J(z^r)) ) ≤ J(A) ( 1 + (B/A)² J(z^r) )^{1/2}
  ≲ J(A) [ 1 + (B/A)² ( J(A^r) + J(A^r)^{2/(2−r)} (B/A)^{2r/(2−r)} ) ]^{1/2}
  ≲ J(A) [ 1 + √(J(A^r)) (B/A) + J(A^r)^{1/(2−r)} (B/A)^{2/(2−r)} ].

The middle term on the right side is bounded by a multiple of the sum of the first and third terms, since x ≲ 1^p + x^q for any conjugate pair (p, q) and any x > 0, in particular for x = √(J(A^r)) B/A.

In the next two theorems we relax the assumption that the class F of functions is uniformly bounded, made in Theorem 2.14.2, to moment and exponential bounds, respectively, at the cost of a deterioration of the dependence of the bounds on δ.

2.14.7 Theorem. Let F be a P-measurable class of measurable functions with envelope function F such that P F^{(4p−2)/(p−1)} < ∞ for some p > 1 and such that F² and F⁴ are P-measurable. If P f² < δ² P F² for every f and some δ ∈ (0, 1), then

E*_P ‖Gn‖_F ≲ J(δ, F, L2) ( 1 + J(δ^{1/p}, F, L2) ‖F‖^{2−1/p}_{P,(4p−2)/(p−1)} / (δ² √n ‖F‖^{2−1/p}_{P,2}) )^{p/(2p−1)} ‖F‖_{P,2}.

2.14.8 Theorem. Let F be a P-measurable class of measurable functions with envelope function F such that P exp(F^{p+ρ}) < ∞ for some p, ρ > 0 and such that F² and F⁴ are P-measurable. If P f² < δ² P F² for every f and some δ ∈ (0, 1/2), then, for a constant c depending on p, P F², P F⁴ and P exp(F^{p+ρ}),

E*_P ‖Gn‖_F ≤ c J(δ, F, L2) ( 1 + J( δ (log(1/δ))^{1/p}, F, L2 ) / (δ² √n) ).

Proofs. Application of the first inequality of Theorem 2.14.1 to the functions f², forming the class F² with envelope function F², yields

(2.14.9)  E*_P ‖Gn‖_{F²} ≲ E*_P [ J( σ²_{n,4}/(Pn F⁴)^{1/2}, F² | F², L2 ) (Pn F⁴)^{1/2} ],

for σ_{n,r} the diameter of F in Lr(Pn), i.e.

(2.14.10)  σ^r_{n,r} = sup_f Pn |f|^r.

Preservation properties of uniform entropy (see Theorem 2.10.21, applied to φ(f) = f² with L = 2F) show that J(δ, F² | F², L2) ≲ J(δ, F | F, L2), for every δ > 0. Because Pn f² = P f² + n^{−1/2} Gn f² and P f² ≤ δ² P F² by assumption, we find that

(2.14.11)  E*_P σ²_{n,2} ≲ δ² P F² + (1/√n) E*_P [ J( σ²_{n,4}/(Pn F⁴)^{1/2}, F, L2 ) (Pn F⁴)^{1/2} ].

The next step is to bound σ_{n,4} in terms of σ_{n,2}.
By Hölder's inequality, for any conjugate pair (p, q) and any 0 < s < 4,

Pn f⁴ ≤ Pn |f|^{4−s} F^s ≤ ( Pn |f|^{(4−s)p} )^{1/p} ( Pn F^{sq} )^{1/q}.

Choosing s such that (4 − s)p = 2 (and hence sq = (4p − 2)/(p − 1)), we find that

σ⁴_{n,4} ≤ σ^{2/p}_{n,2} ( Pn F^{sq} )^{1/q}.

We insert this bound in (2.14.11). The function (x, y) ↦ x^{1/p} y^{1/q} is concave, and hence the function (x, y, z) ↦ J( √(x^{1/p} y^{1/q}) / √z, F, L2 ) √z can be seen to be concave by the same arguments as in the proof of Theorem 2.14.2. Therefore, we can apply Jensen's inequality to see that

E*_P σ²_{n,2} ≲ δ² P F² + (1/√n) J( (E*_P σ²_{n,2})^{1/(2p)} (P F^{sq})^{1/(2q)} / (P F⁴)^{1/2}, F, L2 ) (P F⁴)^{1/2}.

We conclude that z := (E*_P σ²_{n,2})^{1/2} / ‖F‖_{P,2} satisfies

z² ≲ δ² + (1/√n) J( z^{1/p} (P F²)^{1/(2p)} (P F^{sq})^{1/(2q)} / (P F⁴)^{1/2}, F, L2 ) (P F⁴)^{1/2} / (P F²)
  ≲ δ² + J(z^{1/p}, F, L2) (P F^{sq})^{1/(2q)} / ( √n (P F²)^{1−1/(2p)} ).

In the last step we use that J(Cδ, F, L2) ≤ C J(δ, F, L2) for C ≥ 1, and Hölder's inequality as previously to see that the present C satisfies this condition. We next apply Lemma 2.14.6 (with r = 1/p) to obtain a bound on J(z, F, L2), and conclude the proof by substituting this bound in (2.14.3).
For the proof of Theorem 2.14.8, fix r = 2/p. The functions ψ, ψ̄: [0, ∞) → [0, ∞) defined by

ψ(f) = log^r(1 + f),    ψ̄(f) = e^{f^{1/r}} − 1,

are each other's inverses, and are increasing from ψ(0) = ψ̄(0) = 0 to infinity. Thus their primitive functions Ψ(f) = ∫₀^f ψ(s) ds and Ψ̄(f) = ∫₀^f ψ̄(s) ds satisfy Young's inequality f g ≤ Ψ(f) + Ψ̄(g), for every f, g ≥ 0 (e.g. Boyd and Vandenberghe (2004), page 120, 3.38).
The function t ↦ t log^r(1/t) is concave in a neighborhood of 0 (specifically, on the interval (0, e^{1−r} ∧ 1)), with limit from the right equal to 0 at 0, and derivative tending to infinity at this point. Therefore, there exists a concave, increasing function k: (0, ∞) → (0, ∞) that is identical to t ↦ t log^r(1/t) near 0 and bounded below and above by a positive constant times the identity throughout its domain. (E.g. extend t ↦ t log^r(1/t) linearly with slope 1 from the point where the derivative of the latter function has decreased to 1.) Write k(t) = t ℓ^r(t), so that ℓ is bounded below by a constant and ℓ(t) = log(1/t) near 0. Then, for every t > 0,

(2.14.12)  log(2 + t/C) / ℓ(C) ≲ log(2 + t).
(The constant in ≲ may depend on r.) To see this, note that for C > c the left side is bounded by a multiple of log(2 + t/c), whereas for small C the left side is bounded by a multiple of ( log(2 + t) + log(1 + 1/C) ) / ℓ(C) ≲ log(2 + t) + 1.
From the inequality Ψ̄(f) ≤ f ψ̄(f), we obtain that, for f > 0,

Ψ̄( f / log^r(2 + f) ) ≲ f.

Therefore, by (2.14.12) followed by Young's inequality,

f⁴ / k(C²) = ( (f²/C²) / log^r(2 + f²/C²) ) · ( f² log^r(2 + f²/C²) / ℓ^r(C²) )
  ≲ f²/C² + Ψ( F² log^r(2 + F²) ).

On integrating this with respect to the empirical measure, with C² = Pn f², we see that, with G = Ψ( F² log^r(2 + F²) ),

Pn f⁴ ≲ k(Pn f²) ( 1 + Pn G ).

We take the supremum over f to bound σ⁴_{n,4} as in (2.14.10) in terms of k(σ²_{n,2}), and next substitute this bound in (2.14.11) to find that

E*_P σ²_{n,2} ≤ δ² P F² + (1/√n) E*_P J( √( k(σ²_{n,2}) (1 + Pn G) ) / (Pn F⁴)^{1/2}, F, L2 ) (Pn F⁴)^{1/2}
  ≤ δ² P F² + (1/√n) J( √( k(E*_P σ²_{n,2}) (1 + P G) ) / (P F⁴)^{1/2}, F, L2 ) (P F⁴)^{1/2},

where we have used the concavity of k, and the concavity of the other maps, as previously. By assumption the expected value P G is finite for r = 2/p. It follows that z² = E*_P σ²_{n,2} / P F² satisfies, for suitable constants a, b, c depending on r, P F², P F⁴ and P G,

z² ≲ δ² + (a/√n) J( √(k(z² b)) c, F, L2 ).

By concavity and the fact that k(0) = 0, we have k(Cz) ≤ C k(z), for C ≥ 1 and z > 0. The function z ↦ √(k(z² b)) c inherits this property. Therefore we can apply Lemma 2.14.13, with k of the lemma equal to the present function z ↦ √(k(z² b)) c, to obtain a bound on J(z, F, L2) in terms of J(δ, F, L2) and J( √(k(δ² b)) c, F, L2 ), which we substitute in (2.14.3). Here k(δ²) = δ² log^r(1/δ²) ≍ δ² log^r(1/δ) for sufficiently small δ > 0, and k(δ²) ≲ δ² ≲ δ² log^r(1/δ) for δ < 1/2 and bounded away from 0. Since r/2 = 1/p, it follows that √(k(δ²)) is comparable to δ (log(1/δ))^{1/p}. Thus we can simplify the bound to the one in the statement of the theorem, possibly after increasing the constants a, b, c to be at least 1, to complete the proof.
2.14.13 Lemma. Let J: (0, ∞) → R be a concave, nondecreasing function with J(0) = 0, and let k: (0, ∞) → (0, ∞) be nondecreasing and satisfy k(Cz) ≤ C k(z) for C ≥ 1 and z > 0. If z² ≤ A² + B² J(k(z)) for some A, B > 0, then

J(z) ≲ J(A) [ 1 + J(k(A)) (B/A)² ].

Proof. As noted in the proof of Lemma 2.14.6, the properties of J imply that J(Cz) ≤ C J(z) for C ≥ 1 and any z > 0. In view of the assumed property of k and the monotonicity of J, it follows that J∘k(Cz) ≤ C J∘k(z) for every C ≥ 1 and z > 0. Therefore, by the monotonicity of J and k, and the assumption on z,

J∘k(z) ≤ J∘k( √(A² + B² J∘k(z)) ) ≤ J∘k(A) √( 1 + (B/A)² J∘k(z) ).

As in the proof of Lemma 2.14.6, we can solve this for J∘k(z) to find that

J∘k(z) ≲ J∘k(A) + J∘k(A)² (B/A)².

Next, again by the monotonicity of J,

J(z) ≤ J( √(A² + B² J∘k(z)) ) ≤ J(A) √( 1 + (B/A)² J∘k(z) )
  ≲ J(A) [ 1 + (B/A) √(J∘k(A)) + (B/A)² J∘k(A) ].

The middle term on the right side is bounded by the sum of the first and third terms.

2.14.14 Example (Alternative proof of the Donsker theorem). The proofs of the preceding maximal inequalities are modelled after the proof of the uniform-entropy central limit theorem, Theorem 2.5.2. The inequalities are sufficiently strong to yield short proofs of these results.
Under the uniform-entropy bound (2.5.1), the uniform entropy integral J(δ, F | F, L2) is uniformly bounded and J(δ, F | F, L2) → 0 as δ ↓ 0. The entropy integral of the class F − F (with envelope 2F) is bounded by a multiple of J(δ, F | F, L2). Application of the L1-inequality of Theorem 2.14.1 followed by the Cauchy-Schwarz inequality yields, with θ²_{n,δ} = ‖Pn f²‖*_{F_δ} / Pn F² and F_δ = {f − g: ‖f − g‖_{P,2} < δ},

E ‖Gn‖_{F_δ} ≲ E* [ J(θ_{n,δ}, F | F, L2) ‖F‖_n ] ≤ ( E* J²(θ_{n,δ}, F | F, L2) )^{1/2} ( P F² )^{1/2}.

The sequence of empirical processes indexed by F is asymptotically tight if the right side converges to zero as n → ∞ followed by δ ↓ 0. In view of the dominated convergence theorem, this is true if θ_{n,δ} → 0 in probability.
Without loss of generality, assume that F ≥ 1, so that θ²_{n,δ} ≤ ‖Pn f²‖_{F_δ}. The desired conclusion follows if ‖Pn f² − P f²‖_{F_δ} converges in probability to zero, as n → ∞ followed by δ ↓ 0. This is certainly the case if the class G = (F − F)² is Glivenko-Cantelli in probability. Let G_M be the functions g 1{F ≤ M} when g ranges over G. Since N( 4εM ‖F‖_{Q,2}, G_M, L2(Q) ) ≤ N( ε‖F‖_{Q,2}, F, L2(Q) )², the uniform entropy integral J(δ, G_M | 4MF, L2) of G_M with envelope 4MF (!) is bounded by a multiple of J(δ, F | F, L2). A second application of Theorem 2.14.1 gives

E* ‖Pn f² − P f²‖_{F−F} ≲ (1/√n) J(1, F | F, L2) M ( P F² )^{1/2} + P F² {F > M},

for every M. This converges to zero as n → ∞ followed by M → ∞.

2.14.2 Bracketing Integrals

Define the bracketing integral of a class of functions F with measurable envelope function F relative to a given norm as

J_{[ ]}( δ, F | F, ‖·‖ ) = ∫₀^δ √( 1 + log N_{[ ]}( ε‖F‖, F, ‖·‖ ) ) dε.

The basic bracketing maximal inequality uses the L2(P)-norm.

2.14.15 Theorem. Let F be a class of measurable functions with measurable envelope function F. For a given η > 0, set

a(η) = η ‖F‖_{P,2} / √( 1 + log N_{[ ]}( η‖F‖_{P,2}, F, L2(P) ) ).

Then for every η > 0,

‖ ‖Gn‖*_F ‖_{P,1} ≲ J_{[ ]}( η, F | F, L2(P) ) ‖F‖_{P,2} + √n P F {F > √n a(η)}
  + sup_{f∈F} ‖f‖_{P,2} √( 1 + log N_{[ ]}( η‖F‖_{P,2}, F, L2(P) ) ).

Consequently, if ‖f‖_{P,2} < δ ‖F‖_{P,2} for every f in the class F, then

‖ ‖Gn‖*_F ‖_{P,1} ≲ J_{[ ]}( δ, F | F, L2(P) ) ‖F‖_{P,2} + √n P F {F > √n a(δ)}.

In particular, for any class F,

‖ ‖Gn‖*_F ‖_{P,1} ≲ J_{[ ]}( 1, F | F, L2(P) ) ‖F‖_{P,2}.

The constants in the inequalities are universal.

Although the bound of the preceding theorem shows some dependence on the variances of the functions in the class, the "remainder terms" preclude easy application. The following theorems, which are bracketing analogues of Theorems 2.14.2 and 2.14.8, remedy this, at least for not too small δ.
The first theorem applies to uniformly bounded classes of functions. The second can handle more general classes, at the cost of replacing the L2(P)-norm in the bracketing integral by the "norm"

‖f‖_{P,B} = ( 2P( e^{|f|} − 1 − |f| ) )^{1/2}.

This "Bernstein norm" is neither homogeneous nor does it satisfy the triangle inequality, but we shall nevertheless use it to measure the "size" of functions. It dominates the L2(P)-norm.
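Both claims about the Bernstein "norm" are easy to check numerically. The sketch below uses the illustrative choice P = Uniform[0, 1] and f(t) = t, approximating the integrals on a grid: it confirms that ‖f‖_{P,B} ≥ ‖f‖_{P,2} (which follows from e^a − 1 − a ≥ a²/2) and that ‖2f‖_{P,B} ≠ 2 ‖f‖_{P,B}:

```python
import numpy as np

x = np.linspace(0, 1, 200_001)   # fine grid for quadrature on [0, 1]

def bernstein_norm(f):
    # ||f||_{P,B} = (2 P(e^{|f|} - 1 - |f|))^{1/2}, with P = Uniform[0, 1]
    a = np.abs(f(x))
    return np.sqrt(2 * np.mean(np.exp(a) - 1 - a))

def l2_norm(f):
    return np.sqrt(np.mean(f(x) ** 2))

f = lambda t: t
assert bernstein_norm(f) >= l2_norm(f)   # the Bernstein norm dominates L2(P)
# not homogeneous: ||2f||_{P,B} differs visibly from 2 ||f||_{P,B}
assert abs(bernstein_norm(lambda t: 2 * t) - 2 * bernstein_norm(f)) > 0.05
```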

2.14.16 Theorem. Let F be a class of measurable functions such that P f² < δ² P F² and ‖f‖∞ ≤ 1 for every f in F. Then

E_P ‖Gn‖_F ≲ J_{[ ]}( δ, F | F, L2(P) ) ( 1 + J_{[ ]}( δ, F | F, L2(P) ) / (δ² √n ‖F‖_{P,2}) ) ‖F‖_{P,2}.

2.14.17 Theorem. Let F be a class of measurable functions such that ‖f‖_{P,B} ≤ δ ‖F‖_{P,B} for every f ∈ F. Then

E*_P ‖Gn‖_F ≲ J_{[ ]}( δ, F | F, ‖·‖_{P,B} ) ( 1 + J_{[ ]}( δ, F | F, ‖·‖_{P,B} ) / (δ² √n ‖F‖_{P,B}) ) ‖F‖_{P,B}.

Since the Bernstein "norm" is not homogeneous, it may actually be beneficial to apply the second theorem to a multiple σF of the class F, noting that the left side equals σ^{−1} E*_P ‖Gn‖_{σF} for every σ > 0. Even taking σ = 2 may be helpful!
The proofs of the theorems are based on the following lemma, which allows a general norm ‖·‖, or even any "monotone" measure of the size of functions: it suffices that |f| ≤ |g| implies ‖f‖ ≤ ‖g‖, for every pair of functions. In particular, the norm can be taken equal to the Bernstein "norm". The proof is based on a chaining argument, where the mean of the supremum of the chains is bounded by the integral on the right side of the lemma and the remainders at the ends of the chains give rise to two more terms.

2.14.18 Lemma. Let F be an arbitrary class of measurable functions and ‖·‖ a norm that dominates the L2(P)-norm. Then, for every δ > 3γ ≥ 0, there exist deterministic functions en: F → R with ‖en‖_F ≤ 8γ√n such that

‖ sup_f ( Gn f − en(f) )⁺ ‖*_{P,p} ≲ ∫_γ^δ √( 1 + log N_{[ ]}( ε, F, ‖·‖ ) ) dε
  + ‖ sup_i |Gn fi| ‖_{P,p} + ‖ sup_i Gn( sup_{f∈Fi} |f − fi| ) ‖_{P,p},

for a minimal covering F = ∪_{i=1}^m Fi by brackets of ‖·‖-size at most δ and any choice of fi ∈ Fi. The constants in the inequalities are universal.
334 2 Empirical Processes

Proofs. For the proof of the lemma, fix integers $q_0$ and $q_2$ such that $2^{-q_0} <
\delta \le 2^{-q_0+1}$ and $2^{-q_2-2} < \gamma \le 2^{-q_2-1}$. For $q \ge q_0$, construct a nested
sequence of partitions $\mathcal F = \cup_{i=1}^{N_q}\mathcal F_{qi}$ such that the $q_0$th partition is the
partition given in the statement of the lemma and, for each $q > q_0$,
\[
\Big\|\sup_{f,g\in\mathcal F_{qi}}|f-g|\Big\|^* < 2^{-q}.
\]

By the choice of $q_0$ the number of sets in the $q_0$th partition satisfies $N_{q_0} \le
N_{[\,]}\big(2^{-q_0},\mathcal F,\|\cdot\|\big)$ and the preceding display is valid for $q = q_0$ up to a factor
2. The number $N_q$ of subsets in the $q$th partition can be chosen to satisfy
\[
\log N_q \le \sum_{r=q_0}^{q}\log N_{[\,]}\big(2^{-r},\mathcal F,\|\cdot\|\big).
\]

Now adopt the same notation as in the proof of Theorem 2.5.6 (except define
$a_q$ with $1+\log$ rather than $\log$) and decompose, pointwise in $x$ (which is
suppressed in the notation),
\[
f - \pi_{q_0}f = (f - \pi_{q_0}f)B_{q_0}f + \sum_{q_0+1}^{q_2}(f - \pi_qf)B_qf
+ \sum_{q_0+1}^{q_2}(\pi_qf - \pi_{q-1}f)A_{q-1}f + (f - \pi_{q_2}f)A_{q_2}f.
\]

See the proof of Theorem 2.5.6 for the notation and motivation.
Apply the empirical process $\mathbb G_n$ to each of the four terms separately
and take suprema over $f \in \mathcal F$. The $L_p(P)$-norm of the first term resulting
from this procedure can be bounded by
\[
\big\|\,\|\mathbb G_n(f - \pi_{q_0}f)B_{q_0}f\|_{\mathcal F}\big\|^*_{P,p} \lesssim \big\|\,\|\mathbb G_n\Delta_{q_0}f\|_{\mathcal F}\big\|^*_{P,p} + \sqrt n\,\big\|P\Delta_{q_0}f\,B_{q_0}f\big\|_{\mathcal F}.
\]

The first term on the right arises as the last term in the upper bound of
the lemma. The second term on the right can be bounded by
\[
a_{q_0}^{-1}P\Delta^2_{q_0}f \lesssim \delta\sqrt{1+\log N_{[\,]}\big(\delta/2,\mathcal F,\|\cdot\|\big)}
\lesssim \int_{\delta/3}^{\delta}\sqrt{1+\log N_{[\,]}\big(\varepsilon,\mathcal F,\|\cdot\|\big)}\,d\varepsilon.
\]

This is bounded by the right side of the lemma.


The second and third terms in the decomposition can be bounded in
exactly the same manner as in the proof of Theorem 2.5.6, where it may be
noted that Lemma 2.2.10 bounds the ψ1 -norm, hence every Lp -norm up to
a constant. This shows that inequality (2.5.7) is also valid with the Lp -norm
of kGn kF on the left. Thus the second and third terms give contributions
to the Lp (P )-norm of kGn k∗F bounded up to a constant by
\[
\sum_{q=q_0+1}^{q_2+1}2^{-q}\sqrt{1+\log N_q} \lesssim \int_{2^{-q_2-1}}^{2^{-q_0}}\sqrt{1+\log N_{[\,]}\big(\varepsilon,\mathcal F,\|\cdot\|\big)}\,d\varepsilon.
\]

This is bounded by the right side of the lemma.
For the final term, define $e_n(f) = 2\sqrt n\,P\Delta_{q_2}f$. Then $\mathbb G_n(f -
\pi_{q_2}f)A_{q_2}f - e_n(f)$ is bounded above by $\mathbb G_n\Delta_{q_2}f\,A_{q_2}f$, and the fourth term
in the decomposition shifted by $e_n$ yields
\[
\Big\|\sup_f\big(\mathbb G_n(f - \pi_{q_2}f)A_{q_2}f - e_n(f)\big)^+\Big\|^*_{P,p} \le \big\|\,\|\mathbb G_n\Delta_{q_2}f\,A_{q_2}f\|_{\mathcal F}\big\|^*_{P,p}
\lesssim a_{q_2-1}(1+\log N_{q_2}) + 2^{-q_2}\sqrt{1+\log N_{q_2}}.
\]
This is bounded by the contribution of the second and third parts of the
decomposition. This concludes the proof of the lemma.
For the proof of Theorem 2.14.15, apply the preceding argument with
the L2 (P )-norm substituted for k · k and q2 = ∞. Then the fourth term
in the decomposition of f − πq0 f vanishes, and the contributions of the
second and third terms to E∗ kGn kF can be bounded as before. The first
term yields the contribution
\[
\mathrm E^*\big\|\mathbb G_n(f - \pi_{q_0}f)B_{q_0}f\big\|_{\mathcal F} \lesssim \sqrt n\,P^*F\,1\{2F > \sqrt n\,a_{q_0}\}.
\]
For $\delta = 8\eta\|F\|_{P,2}$, the right side is bounded by $\sqrt n\,P^*F\,1\{F > \sqrt n\,a(\eta)\}$.

Finally,
\[
\mathrm E^*\big\|\mathbb G_n\pi_{q_0}f\big\|_{\mathcal F} \lesssim \mathrm E^*\big\|\mathbb G_n\pi_{q_0}f\,1\{F \le \sqrt n\,a(\eta)\}\big\|_{\mathcal F} + \sqrt n\,P^*F\,1\{F > \sqrt n\,a(\eta)\}.
\]
Each function $\pi_{q_0}f\,1\{F \le \sqrt n\,a(\eta)\}$ is uniformly bounded by $\sqrt n\,a(\eta)$. In-
equality (2.5.7) yields a bound for the first term on the right equal to a
multiple of
\[
(1+\log N_{q_0})\,a(\eta) + \sqrt{1+\log N_{q_0}}\,\big\|\,\|f\|_{P,2}\big\|_{\mathcal F}.
\]

The first term is bounded by a multiple of $J_{[\,]}\big(\eta,\mathcal F,L_2(P)\big)\|F\|_{P,2}$. The
second arises as the last term in the first inequality of the theorem.
For the second inequality of Theorem 2.14.15, it suffices to note that
the integrand in the bracketing integral is decreasing, so that
\[
\delta\sqrt{1+\log N_{[\,]}\big(\delta\|F\|_{P,2},\mathcal F,L_2(P)\big)} \le J_{[\,]}\big(\delta,\mathcal F,L_2(P)\big).
\]

For the final inequality of the theorem, choose δ = 2 in the second. Since
the class F fits in the single bracket [−F, F ], it follows that a(2) = 2kF kP,2 .
For the proofs of Theorems 2.14.16 and 2.14.17, apply Lemma 2.14.18
with γ = 0 and p = 1. It suffices to bound the second and third terms on
the right appropriately.

For the proof of Theorem 2.14.16 use Bernstein’s inequality given by


Lemma 2.2.9 and Lemma 2.2.10 (see (2.5.7)) to bound both the second and
third terms by a multiple of
\[
\Big(1+\log N_{[\,]}\big(\delta\|F\|_{P,2},\mathcal F,L_2(P)\big)\Big)\frac{1}{\sqrt n}
+ \sqrt{1+\log N_{[\,]}\big(\delta\|F\|_{P,2},\mathcal F,L_2(P)\big)}\;\delta\|F\|_{P,2}.
\]

Since the entropy is decreasing in δ, the second term is smaller than the
entropy integral. By the same reasoning the part in brackets of the first
term is bounded by the square of the entropy integral divided by δ 2 .
Theorem 2.14.17 follows similarly, this time using the refined form
of Bernstein’s inequality, Lemma 2.2.11, in combination with the “norm”
k · kP,B .

2.14.19 Example (Alternative proof of the Donsker theorem). The sec-


ond maximal inequality given by Theorem 2.14.15 shows that $\mathrm E^*\|\mathbb G_n\|_{\mathcal F_\delta} \to 0$
as $n \to \infty$ followed by $\delta \downarrow 0$, for $\mathcal F_\delta = \{f - g\colon \|f-g\|_{P,2} < \delta\}$. This
immediately implies the simplified version of the bracketing central limit
Theorem 2.5.6, which asserts that $\mathcal F$ is Donsker if the bracketing integral
$J_{[\,]}\big(\delta,\mathcal F,L_2(P)\big)$ is finite.

2.14.3 VC-Classes of Sets


Combining Theorem 2.14.1 with the bounds on covering numbers of VC-
classes given by Theorem 2.6.4 or Theorem 2.6.7 shows that for every P -
measurable VC-class of functions
\[
\mathrm E^*\|\mathbb G_n\|_{\mathcal F} \lesssim \sqrt{V(\mathcal F)}\,\|F\|_{P,2}.
\]

The following classical inequality, for VC-classes of sets, adds an unneces-


sary logarithmic factor to this bound, but has the benefit of having explicit
and small constants.

2.14.20 Lemma (Vapnik–Červonenkis). For any $P$-measurable VC-class


$\mathcal C$ of sets and $2n \ge V(\mathcal C)$,
\[
\mathrm E^*\|\mathbb G_n\|_{\mathcal C} \le \sqrt{2\log\Big(2\max_{x\in\mathcal X^{2n}}\Delta_{2n}(\mathcal C,x)\Big)} \le 2\sqrt{2V(\mathcal C)\Big(1+\log\frac{n}{V(\mathcal C)}\Big)}.
\]

Proof. We first prove that the left side is smaller than the far right
side. By the symmetrization inequality, Lemma 2.3.1, the left side is
bounded by $2\mathrm E^*\|\mathbb G_n^\circ\|_{\mathcal C}$ for $\mathbb G_n^\circ$ the symmetrized empirical process, given
by $\mathbb G_n^\circ 1_C = n^{-1/2}\sum_{i=1}^n\varepsilon_i1_C(X_i)$. For given $X_1,\ldots,X_n$ the process $\mathbb G_n^\circ$ is
sub-Gaussian relative to the $L_2(\mathbb P_n)$-norm, and (by definition) there are
$\Delta_n(\mathcal C,X_1,\ldots,X_n)$ different vectors $\big(1_C(X_1),\ldots,1_C(X_n)\big)$ as $C$ ranges over
$\mathcal C$, all of $L_2(\mathbb P_n)$-norm bounded by 1. Therefore, by Lemma 2.2.2, or, in order
to obtain an optimal constant, direct integration of the tail bound, we have
$\mathrm E^*_\varepsilon\|\mathbb G_n^\circ\|_{\mathcal C} \le \sqrt{2\log 2\Delta_n(\mathcal C,X_1,\ldots,X_n)}$. Finally, the right side is bounded
as in Corollary 2.6.3 (see Problem 2.6.7).
To prove that the expression in the middle is between the other
expressions, we introduce independent copies of X1 , . . . , Xn , and insert
Rademacher variables as in the proof of Lemma 2.3.1. However, we do not
separate the original variables and their copies (which produces a factor
2), but work with 2n variables. Given these 2n variables the process is still
sub-Gaussian, and we bound its conditional expectation as before.
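As an illustrative sanity check (a Monte Carlo sketch, not part of the text): for $\mathcal C$ the left half-lines in $\mathbb R$ and $P$ uniform on $[0,1]$, $\|\mathbb G_n\|_{\mathcal C}$ is $\sqrt n$ times the Kolmogorov–Smirnov statistic, and half-lines pick out at most $2n+1$ subsets of any $2n$ points, so $\Delta_{2n}(\mathcal C,x) \le 2n+1$ and the entropy part of the bound can be evaluated without knowing $V(\mathcal C)$. The constant $\sqrt{2\log(2\Delta_{2n})}$ below follows one reading of the middle expression of the lemma; the simulated mean sits far below it either way:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000

# ||G_n||_C = sqrt(n) * sup_t |F_n(t) - t| for C = {(-inf, t]}, P = U[0, 1].
i = np.arange(1, n + 1)
sups = np.empty(reps)
for r in range(reps):
    u = np.sort(rng.random(n))
    # The KS supremum is attained at the order statistics.
    sups[r] = np.sqrt(n) * np.maximum(i / n - u, u - (i - 1) / n).max()
mean_sup = sups.mean()

# Half-lines pick out at most 2n + 1 subsets of any 2n points,
# so Delta_{2n}(C, x) <= 2n + 1 for every configuration x.
bound = np.sqrt(2 * np.log(2 * (2 * n + 1)))
print(mean_sup, bound)
```

The simulated mean is close to the limiting mean of the Kolmogorov distribution ($\approx 0.87$), illustrating that the entropy bound, while uniform in $n$, is far from tight for this one-dimensional class.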

2.14.4 Tail Bounds


In this section we derive bounds on tail probabilities of kGn kF for classes F
that possess a finite uniform-entropy or bracketing entropy integral. Such
classes permit tail bounds of the order $t^{-p}$ or $\exp(-Ct^p)$, uniformly in
n, depending on the envelope function F . While the methods used are too
simple to obtain sharp bounds, at least the powers of t in the bounds appear
correct.
The tail bounds will be implicitly expressed in terms of Orlicz norms.
By Markov’s inequality, a finite Lp -norm kXkp yields a polynomial bound

\[
\mathrm P\big(|X| > t\big) \le \frac{1}{t^p}\,\|X\|_p^p.
\]

Alternatively, a bound on the Orlicz norm $\|X\|_{\psi_p}$ for $\psi_p(x) = e^{x^p} - 1$
and $p \ge 1$ leads to an exponential bound
\[
\mathrm P\big(|X| > t\big) \le 2e^{-t^p/\|X\|_{\psi_p}^p}.
\]

It is useful to employ also exponential Orlicz norms for $0 < p < 1$. The fact
that the functions $x \mapsto e^{x^p} - 1$ are convex only on the interval $(c_p,\infty)$
for $c_p = (1/p - 1)^{1/p}$ leads to the inconvenience that $\|\cdot\|_{\psi_p}$ (if defined ex-
actly as before) does not satisfy the triangle inequality. The usual solution
is to define $\psi_p(x)$ as $e^{x^p} - 1$ for $x \ge c_p$ and to be linear and contin-
uous on $(0, c_p)$. Since we are interested in using these norms to measure
tail probabilities, the particular adaptation of the definition of $\psi_p$ is not
important.
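To make these Orlicz norms concrete, here is a small numerical sketch (standard library only) computing $\|X\|_{\psi_2} = \inf\{c > 0\colon \mathrm E\,\psi_2(|X|/c) \le 1\}$ for a standard normal $X$, for which $\mathrm E\,e^{X^2/c^2} = (1-2/c^2)^{-1/2}$ is available in closed form for $c^2 > 2$; solving $(1-2/c^2)^{-1/2} = 2$ gives $c = \sqrt{8/3}$:

```python
import math


def mgf_of_square(c):
    """E exp(X^2 / c^2) for X ~ N(0,1); finite exactly when c^2 > 2."""
    return (1.0 - 2.0 / c**2) ** -0.5


def psi2_norm_gaussian():
    """||X||_{psi_2} = inf{c : E exp(X^2/c^2) - 1 <= 1}, by bisection."""
    lo, hi = math.sqrt(2.0) + 1e-9, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) - 1.0 > 1.0:
            lo = mid  # Orlicz condition violated: c too small
        else:
            hi = mid
    return hi


c = psi2_norm_gaussian()
print(c, math.sqrt(8.0 / 3.0))  # both approximately 1.633

# The resulting sub-Gaussian tail bound P(|X| > t) <= 2 exp(-t^2/c^2)
# dominates the exact Gaussian tail, here checked at t = 2.
exact = math.erfc(2.0 / math.sqrt(2.0))   # P(|X| > 2)
bound = 2.0 * math.exp(-4.0 / c**2)
```

The bisection interval starts just above $\sqrt 2$, below which the moment generating function of $X^2$ is infinite.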
The following theorem shows that a general Orlicz norm of kGn k∗F is
bounded by its L1 -norm plus the corresponding Orlicz norm of the envelope
function F . In fact, Orlicz norms of sums of independent processes can in
general be bounded by their L1 -norm plus a maximum or sum of Orlicz
norms of the individual terms.

2.14.21 Theorem. Let F be a class of measurable functions with measur-


able envelope function F . Then

\[
\begin{aligned}
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,p} &\lesssim \big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,1} + n^{-1/2+1/p}\,\|F\|_{P,p} &&(p \ge 2),\\
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,\psi_p} &\lesssim \big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,1} + n^{-1/2}(1+\log n)^{1/p}\,\|F\|_{P,\psi_p} &&(0 < p \le 1),\\
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,\psi_p} &\lesssim \big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,1} + n^{-1/2+1/q}\,\|F\|_{P,\psi_p} &&(1 < p \le 2).
\end{aligned}
\]
Here $1/p + 1/q = 1$ and the constants in the inequalities $\lesssim$ depend only on
the type of norm involved in the statement.

Proof. Specialization of Proposition A.1.6 to the present situation gives


the inequalities

\[
\begin{aligned}
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,p} &\lesssim \big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,1} + n^{-1/2}\,\big\|\max_{1\le i\le n}F(X_i)\big\|_{P,p} &&(p \ge 1),\\
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,\psi_p} &\lesssim \big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,1} + n^{-1/2}\,\big\|\max_{1\le i\le n}F(X_i)\big\|_{P,\psi_p} &&(0 < p \le 1).
\end{aligned}
\]

Next, Lemma 2.2.2 can be used to further bound the maxima appearing
on the right. This gives the first two lines of the theorem. The third line is
immediate from Proposition A.1.6.

To obtain tail bounds for the random variable kGn k∗F , the last theorem
may be combined with the bounds on E∗ kGn kF given by the preceding
theorems. While the last theorem is valid for any class F of functions, the
bounds on E∗ kGn kF rely on the finiteness of an entropy integral. If either
the uniform-entropy integral or the bracketing entropy integral of the class
F is finite, then
\[
\mathrm E^*\|\mathbb G_n\|_{\mathcal F} \lesssim \|F\|_{P,2},
\]

where the constant depends on the entropy integral. The L2 (P )-norm of the
envelope function can in turn be bounded by an exponential Orlicz norm.
Combination with the last theorem shows that

\[
\big\|\,\|\mathbb G_n\|^*_{\mathcal F}\big\|_{P,\psi_p} \lesssim \|F\|_{P,\psi_p} \qquad (0 < p \le 2).
\]

Conclude that $\|\mathbb G_n\|^*_{\mathcal F}$ has tails of the order $\exp(-Ct^p)$ for some $0 < p \le 2$,
whenever the envelope function F has tails of this order. Similar reasoning
can be applied using the Lp (P )-norms. It may be noted that, for large n
and 0 < p < 2, the size of the ψp -norm of kGn k∗F is mostly determined
by the entropy integral and the L2 -norm of the envelope: according to the
preceding theorem, the ψp -norm of the envelope enters the upper bound
multiplied by a factor that converges to zero as n → ∞.
To see some of the strength of the preceding theorems, it is instruc-
tive to apply them to a class F consisting of a single function f . Then
$J(\delta,\{f\}) = \delta$ and the envelope function is $|f|$. For $\mathcal F = \{f\}$, Theorem 2.14.1
gives
\[
\|\mathbb G_nf\|_{P,p} \lesssim \bigg\|\Big(\frac1n\sum_{i=1}^nf^2(X_i)\Big)^{1/2}\bigg\|_{P,p} \le \|f\|_{P,p}, \qquad p \ge 2.
\]

This is the upper half of the Marcinkiewicz-Zygmund inequality, Proposi-


tion A.1.9. Theorem 2.14.21 yields a tightening of the inequality between
the far left and the right sides. More interestingly, it gives bounds for the
exponential Orlicz norms

\[
\begin{aligned}
\|\mathbb G_nf\|_{P,\psi_p} &\lesssim \|f\|_{P,2} + n^{-1/2}(1+\log n)^{1/p}\,\|f\|_{P,\psi_p} &&(0 < p \le 1),\\
\|\mathbb G_nf\|_{P,\psi_p} &\lesssim \|f\|_{P,2} + n^{-1/2+1/q}\,\|f\|_{P,\psi_p} &&(1 < p \le 2).
\end{aligned}
\]
The last line for p = q = 2 shows that the random variable Gn f is sub-
Gaussian whenever kf kP,ψ2 is finite. Apart from a universal constant, this
generalizes Hoeffding’s inequality, which makes the same statement for uni-
formly bounded functions f . Hoeffding’s inequality for linear combinations
of Rademacher variables is given by Lemma 2.2.7 and is the essential in-
gredient in the proof of Theorem 2.14.1.

2.14.5 Uniformly Bounded Classes


In this section we refine the tail bounds on $\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big)$ in the case of


uniformly bounded classes F. We present both Hoeffding- and Bernstein-


type bounds. Throughout this section it is assumed that 0 ≤ f ≤ 1 for
every f ∈ F. Without further mention, it is also assumed that X1 , X2 , . . .
are defined as the coordinate projections on a product space (X ∞ , A∞ , P ∞ )
and that the class F is pointwise separable.
The first theorem is valid for classes F with polynomial covering num-
bers or polynomial bracketing numbers: for some constants V and K,
\[
(2.14.22)\qquad \sup_Q N\big(\varepsilon,\mathcal F,L_2(Q)\big) \le \Big(\frac K\varepsilon\Big)^V, \qquad\text{for every } 0 < \varepsilon < K,
\]
or
\[
(2.14.23)\qquad N_{[\,]}\big(\varepsilon,\mathcal F,L_2(P)\big) \le \Big(\frac K\varepsilon\Big)^V, \qquad\text{for every } 0 < \varepsilon < K.
\]
In the first inequality, the supremum is taken over all probability measures
Q. In particular, the theorem is valid for Vapnik–Červonenkis classes: ac-
cording to Theorem 2.6.7, a VC-class of dimension V (F) with envelope
function F = 1 satisfies (2.14.22) for V = 2V (F) and a constant K that
depends on V only.
The second theorem applies to classes such that, for some constants
0 < W < 2 and K,
\[
(2.14.24)\qquad \sup_Q \log N\big(\varepsilon,\mathcal F,L_2(Q)\big) \le K\Big(\frac1\varepsilon\Big)^W, \qquad\text{for every } \varepsilon > 0.
\]
Again the supremum is taken over all probability measures Q.

2.14.25 Theorem. Let F be a class of measurable functions f : X → [0, 1]


that satisfies (2.14.22) or (2.14.23). Then, for every $t > 0$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big) \le \Big(\frac{Dt}{\sqrt V}\Big)^V e^{-2t^2},
\]
for a constant D that depends on K only.

2.14.26 Theorem. Let F be a class of measurable functions f : X → [0, 1]


that satisfies (2.14.24). Then, for every $\delta > 0$ and $t > 0$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big) \le Ce^{Dt^{U+\delta}}e^{-2t^2},
\]


where U = W (6 − W )/(2 + W ) and the constants C and D depend on K,


W , and δ only.

We note that the exponent U in the second theorem increases from 0


to 2 as W increases from 0 to 2.
While the constant 2 in the exponential $e^{-2t^2}$ in Theorem 2.14.25 is
sharp, the power of the additional term is not. For instance, the empirical
distribution function on the line satisfies the exponential bound $De^{-2t^2}$,
whereas the bound obtained from Theorem 2.14.25 is of the form $Dt^2e^{-2t^2}$,
since (2.14.22) holds with $V = 2$ in this case. We shall consider two im-
provements for $\mathcal F$ equal to a class of indicators of sets.
The exponential bounds for the suprema kGn kF are based on expo-
nential bounds for the individual variables Gn f . For sets C with probability
bounded away from zero and one, an improved exponential bound can be
derived from the exponential bound of Talagrand for the tail of a binomial
random variable
 K −2t2
P |Gn (C)| > t ≤ e , for every t > 0
t
(see Proposition A.6.4). This suggests that it might be possible to improve
on Theorem 2.14.25 in the case of sets. In fact, this improvement is possi-
ble. We replace (2.14.22) and (2.14.23) by their corresponding L1 versions.
Suppose that C is a class of sets such that, for given constants K and V ,
either
\[
(2.14.27)\qquad \sup_Q N\big(\varepsilon,\mathcal C,L_1(Q)\big) \le \Big(\frac K\varepsilon\Big)^V, \qquad\text{for every } 0 < \varepsilon < K,
\]
or
\[
(2.14.28)\qquad N_{[\,]}\big(\varepsilon,\mathcal C,L_1(P)\big) \le \Big(\frac K\varepsilon\Big)^V, \qquad\text{for every } 0 < \varepsilon < K.
\]
The supremum is taken over all probability measures Q. We note that the
present V is 1/2 times the exponent V in conditions (2.14.22) and (2.14.23).

2.14.29 Theorem. Let C be a class of sets that satisfies (2.14.27) or


(2.14.28). Then
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal C} > t\big) \le \frac Dt\Big(\frac{DKt^2}{V}\Big)^V e^{-2t^2},
\]
for every t > 0 and a constant D that depends on K only.

Considering the special case of the empirical distribution function on


the line once again, we see that Theorem 2.14.29 still does not yield the
2
known exponential bound De−2t . Since the collection of left half-lines sat-
isfies (2.14.27) with V = 1, Theorem 2.14.29 yields a bound of the form
2
Dt e−2t .
To obtain the “correct” power of t in the bounds requires consideration
of the size of the sets in a neighborhood of the collection {C ∈ C: P (C) =
1/2} for which the variance of Gn C is maximal. To this end, define
\[
\mathcal C_\delta = \big\{C \in \mathcal C\colon |P(C) - 1/2| \le \delta\big\}.
\]

It is the dimensionality of such neighborhoods that really governs which


power of t is possible. The following theorem gives a refinement of Theo-
rem 2.14.29 to clarify this dependence.

2.14.30 Theorem. Let C be a class of sets that satisfies either (2.14.27)


or (2.14.28), and suppose moreover that
\[
(2.14.31)\qquad N\big(\varepsilon,\mathcal C_\delta,L_1(P)\big) \le K'\delta^W\varepsilon^{-V'}, \qquad\text{for every } \delta \ge \varepsilon > 0,
\]

for some constant K 0 . Then


\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal C} > t\big) \le Dt^{2V'-2W}e^{-2t^2},
\]
for every $t > K_W$ and a constant $D$ that depends on $K$, $K'$, $W$, $V$, and
$V'$ only.

For C equal to the collection of left half-lines in R, inequality (2.14.31)


holds with $V' = 1$ and $W = 1$ (at worst) for any probability measure $P$.
Hence Theorem 2.14.30 gives a bound of the form $De^{-2t^2}$, in agreement
with the bound of Dvoretzky, Kiefer, and Wolfowitz (1956).
When $P$ is uniform on $[0,1]^d$ and $\mathcal C$ is the collection of lower-left sub-
rectangles of $[0,1]^d$, (2.14.31) holds with $V' = d$, $W = 1$, and the bound
given by Theorem 2.14.30 is of the form $Dt^{2(d-1)}e^{-2t^2}$. However, this bound
is not uniform in $P$. See Smith and Dudley (1992) and Adler and Brown
(1986).
The preceding theorems give tail bounds in terms of entropy and en-
velope but do not show the dependence on the variances of the functions
in F. In many applications, especially when an entire collection F is re-


placed by a subcollection, as in the study of oscillation moduli, it is useful
to introduce the maximal variance
\[
(2.14.32)\qquad \sigma^2_{\mathcal F} = \big\|P(f - Pf)^2\big\|_{\mathcal F}
\]

into the bound. If every f takes its values in the interval [0, 1], this variance
is maximally 1/4. Thus a bound of the order $\exp\big({-\tfrac12t^2/\sigma^2_{\mathcal F}}\big)$ could be
better than the bounds given by the preceding theorems. This bound is valid
in the limit as n → ∞, but for finite sample size a correction is necessary.

2.14.33 Theorem. Let F be a class of measurable functions f : X → [0, 1]


that satisfies (2.14.22). Then for every $\sigma^2_{\mathcal F} \le \sigma^2 \le 1$ and every $\delta > 0$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big) \le C\Big(1\vee\frac1\sigma\Big)^{2V}\Big(\frac t\sigma\Big)^{3V+\delta}\exp\bigg({-\frac12\,\frac{t^2}{\sigma^2 + (3+t)/\sqrt n}}\bigg),
\]
for every t > 0 and a constant C that depends on K, V , and δ only.

2.14.34 Theorem. Let F be a class of measurable functions f : X → [0, 1]


that satisfies (2.14.24). Then for every $\sigma^2_{\mathcal F} \le \sigma^2 \le 1$, every $\delta > 0$, $0 <
p(1-u) < 2$, and $0 < u < 1$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big) \le C\Big(\frac1\sigma\Big)^W\exp\bigg(D\Big(\frac t\sigma\Big)^{p(1-u)+\delta} + 5\Big(\frac t\sigma\Big)^{2u} - \frac12\,\frac{t^2}{\sigma^2 + \big(3(t/\sigma)^u + t\big)/\sqrt n}\bigg),
\]

for every t > 0, where p = W (6 − W )/(2 − W ) and the constants C and D


depend on K, W , and δ only.

We do not include proofs of all of Theorems 2.14.25 – 2.14.34. The-


orems 2.14.26, 2.14.33, and 2.14.34 are proved completely in this subsec-
tion. For Theorem 2.14.25, we give a proof assuming (2.14.22), but with the
power of t in the theorem equal to 3V + δ instead of V . After developing
more tools needed to prove the tighter bounds in the following subsection,
we give a complete proof of Theorem 2.14.29 under the assumption (2.14.27)
in Subsection 2.14.6. For complete proofs of Theorems 2.14.25, 2.14.29, and
2.14.30, we refer the interested reader to Talagrand (1994).
The proofs of the tail bounds stated in Theorems 2.14.26, 2.14.33, and
2.14.34 rely on two key lemmas. The first lemma bounds the tail probability
of interest by a corresponding tail probability for sampling n observations
without replacement from the empirical measure based on a sample of size
N = mn (with large m chosen dependent on t in the proof). This lemma
may be considered an alternative to the symmetrization inequalities involv-
ing Rademacher variables derived previously: the present bound is more
complicated, but it appears to be more economical.
Sampling without replacement can be described in terms of a ran-
dom permutation. Let (R1 , . . . , RN ) be uniformly distributed on the set of
permutations of $\{1,\ldots,N\}$, and let $X_1,\ldots,X_N$ be an i.i.d. sample from $P$,
independent of $(R_1,\ldots,R_N)$. Define $n_0 = N - n$ and
\[
\tilde{\mathbb P}_{n,N} = \frac1n\sum_{i=1}^n\delta_{X_{R_i}}; \qquad \mathbb P_N = \frac1N\sum_{i=1}^N\delta_{X_i}.
\]

Then the tail probabilities of the empirical process of n observations can


be bounded by the tail probabilities of P̃n,N − PN .

2.14.35 Lemma. For all $0 < a < 1$, any $\sigma^2 \ge \|P(f-Pf)^2\|_{\mathcal F}$, and every
$t > 0$,
\[
\mathrm P^*\big(\|\mathbb P_n - P\|_{\mathcal F} > t\big) \le \Big[1 - \frac{\sigma^2}{(1-a)^2t^2n_0}\Big]^{-1}\,\mathrm P^*\Big(\|\tilde{\mathbb P}_{n,N} - \mathbb P_N\|_{\mathcal F} > ta\,\frac{n_0}{N}\Big).
\]

Proof. See Devroye (1982) and Shorack and Wellner (1986), pages 829–
830.

The motivation for this step is the availability of several relatively


sharp bounds for the tail probabilities of the mean of a sample of size
n without replacement. In the proof of the main results, these are applied
conditionally on the sample X1 , . . . , XN . For given real numbers c1 , . . . , cN ,
set
\[
\bar c_N = \frac1N\sum_{i=1}^Nc_i; \qquad \sigma_N^2 = \frac1N\sum_{i=1}^N(c_i - \bar c_N)^2; \qquad \Delta_N = \max_{1\le i\le N}c_i - \min_{1\le i\le N}c_i.
\]

The following lemma collects four exponential bounds on the tail probabil-
ities of a mean of a sample without replacement.

2.14.36 Lemma. Let U1 , . . . , Un be a random sample without replacement


from the real numbers $\{c_1,\ldots,c_N\}$. Then for every $t > 0$,
\[
\mathrm P\big(|\bar U_n - \bar c_N| > t\big) \le
\begin{cases}
2\exp\Big({-\dfrac{2nt^2}{\Delta_N^2}}\Big) & \text{(Hoeffding)},\\[2ex]
2\exp\Big({-\dfrac{2nt^2}{\big(1-(n-1)/N\big)\Delta_N^2}}\Big) & \text{(Serfling)},\\[2ex]
2\exp\Big({-\dfrac{nt^2}{2\sigma_N^2 + t\Delta_N}}\Big) & \text{(Hoeffding--Bernstein)},\\[2ex]
2\exp\Big({-\dfrac{nt^2}{m\sigma_N^2}}\Big)\quad\text{if } N = mn & \text{(Massart)}.
\end{cases}
\]
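The first (Hoeffding) bound of the lemma is easy to probe by simulation (a sketch; the population, sample size, and threshold below are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, t, reps = 100, 30, 0.15, 20_000

c = rng.random(N)             # population c_1, ..., c_N in [0, 1]
c_bar = c.mean()
delta = c.max() - c.min()     # range Delta_N

# Empirical tail of |U-bar_n - c-bar_N| for samples without replacement.
hits = 0
for _ in range(reps):
    sample = rng.choice(c, size=n, replace=False)
    if abs(sample.mean() - c_bar) > t:
        hits += 1
emp = hits / reps

hoeffding = 2 * np.exp(-2 * n * t**2 / delta**2)
print(emp, hoeffding)
```

The empirical tail is far below the Hoeffding bound here, reflecting the slack absorbed by the later, variance-sensitive bounds in the lemma.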

Proof. For a proof of the first three inequalities, see Shorack and Wellner
(1986). We prove the fourth.
The sample without replacement can be generated in two steps. First,
partition the set of points {c1 , . . . , cN } randomly into n subsets J1 , . . . , Jn
of m elements each. Next, randomly choose one element of each subset.
Let E2 denote expectation with respect to the second step only, treating
the partition obtained in the first step as fixed. Given the partition, the
variables $U_1,\ldots,U_n$ are independent with means $\bar c_1,\ldots,\bar c_n$ equal to the
averages over the $c_i$ in each partitioning set. The average of these averages
is the grand average $\bar c_N$. Thus
\[
\mathrm E_2\,e^{s(\bar U_n - \bar c_N)} = \prod_{i=1}^n\mathrm E_2\,e^{s(U_i - \bar c_i)/n} \le \prod_{i=1}^ne^{s^2\Delta_i^2/8n^2},
\]

by Hoeffding's inequality (Problem 2.14.2), where the numbers $\Delta_i =
\max_{j\in J_i}c_j - \min_{j\in J_i}c_j$ are the ranges of the partitioning sets. Since
$\Delta_i^2 \le 2\sum_{j\in J_i}(c_j - \bar c_N)^2$, the sum of the $\Delta_i^2$ is bounded by $2N\sigma_N^2$. Conclude
that, for every $s$,
\[
\mathrm P\big(\bar U_n - \bar c_N > t\big) \le e^{-st}\,\mathrm E\,\mathrm E_2\,e^{s(\bar U_n - \bar c_N)} \le e^{-st + s^2m\sigma_N^2/4n}.
\]


The choice $s = 2nt/(m\sigma_N^2)$ yields an upper bound equal to half the upper
bound given by the lemma. The probability $\mathrm P\big({-\bar U_n + \bar c_N} > t\big)$ can be
bounded similarly.
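The two-step generation of the sample used in the proof can also be checked directly (a sketch with arbitrary small parameters): partitioning $\{c_1,\ldots,c_N\}$ at random into $n$ groups of $m$ and picking one element from each group gives every element inclusion probability $n/N$, as for sampling without replacement:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, reps = 12, 4, 50_000
m = N // n

counts = np.zeros(N)
for _ in range(reps):
    perm = rng.permutation(N)
    groups = perm.reshape(n, m)    # step 1: random partition into n groups of m
    picks = groups[np.arange(n), rng.integers(0, m, size=n)]  # step 2
    counts[picks] += 1
incl = counts / reps

# Each element should be included with probability n / N = 1/3.
print(incl.round(3))
```

This only verifies the marginal inclusion probabilities, but it makes the conditioning device of the proof (independence of the $U_i$ given the partition) easy to experiment with.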

Proof of Theorems 2.14.25, 2.14.26, 2.14.33, and 2.14.34. (Recall


that Theorem 2.14.25 will only be proved with 3V + δ instead of V .) It
suffices to prove the inequalities for sufficiently large t, since their validity
for small t can be ensured by choice of the constant C. Here “sufficiently
large” means larger than some value depending only on K, V , W , and δ as
in the statements of the theorems.
For a given N = mn and a sequence of positive constants εq ↓ 0, take
a minimal εq -net for F for the L2 (PN )-semimetric. For each f , let πq f be
a closest element in this net. Fix q0 . The series
\[
(\tilde{\mathbb P}_{n,N} - \mathbb P_N)\pi_{q_0}f + \sum_{q>q_0}(\tilde{\mathbb P}_{n,N} - \mathbb P_N)(\pi_q - \pi_{q-1})f
\]

is convergent with limit (P̃n,N − PN )f . To see this, note first that the qth
partial sum of the series telescopes out to $(\tilde{\mathbb P}_{n,N} - \mathbb P_N)\pi_qf$. Since $\tilde{\mathbb P}_{n,N}f \le
m\mathbb P_Nf$ for every nonnegative $f$, we have
\[
\big((\tilde{\mathbb P}_{n,N} - \mathbb P_N)(f - \pi_qf)\big)^2 \le 2(\tilde{\mathbb P}_{n,N} + \mathbb P_N)(f - \pi_qf)^2 \le 2(m+1)\varepsilon_q^2 \to 0,
\]

and the conclusion follows. The triangle inequality can now be applied to
obtain
\[
\|\tilde{\mathbb P}_{n,N} - \mathbb P_N\|_{\mathcal F} \le \big\|(\tilde{\mathbb P}_{n,N} - \mathbb P_N)\pi_{q_0}f\big\|_{\mathcal F} + \sum_{q>q_0}\big\|(\tilde{\mathbb P}_{n,N} - \mathbb P_N)(\pi_q - \pi_{q-1})f\big\|_{\mathcal F}.
\]

If the $\varepsilon_q$-net has $N_q$ elements, then the suprema are over at most $N_{q_0}$ elements in
the first term on the right and $N_qN_{q-1}$ elements in each of the terms of the
series, respectively. Conclude that for any positive numbers $\eta_q$ and $b$ with
$\sum_{q>q_0}\eta_q + b \le 1$:
\[
(2.14.37)\qquad
\begin{aligned}
\mathrm P_R\big(\|\tilde{\mathbb P}_{n,N} - \mathbb P_N\|_{\mathcal F} > t\big)
&\le N_{q_0}\,\mathrm P_R\Big(\big\|(\tilde{\mathbb P}_{n,N} - \mathbb P_N)\pi_{q_0}f\big\|_{\mathcal F} > tb\Big)\\
&\quad+ \sum_{q>q_0}N_q^2\,\mathrm P_R\Big(\big\|(\tilde{\mathbb P}_{n,N} - \mathbb P_N)(\pi_q - \pi_{q-1})f\big\|_{\mathcal F} > t\eta_q\Big).
\end{aligned}
\]

The probabilities on the right will be bounded with the help of the various
exponential bounds of Lemma 2.14.36. Lemma 2.14.35 turns the resulting
upper bounds into bounds on the tails of the distribution of kPn − P kF .
For appropriate choices of tuning constants, this gives the desired results.
In all four cases the second term on the right is negligible with respect to
the first.
For the proofs of Theorems 2.14.25 and 2.14.26 (with 3V + δ instead
of V ), use Hoeffding’s inequality on the first term in the right of (2.14.37)
and Massart’s inequality on each of the terms of the series. Thus, (2.14.37)
is bounded by
\[
(2.14.38)\qquad N_{q_0}\,2\exp\big({-2nt^2b^2}\big) + \sum_{q>q_0}N_q^2\,2\exp\Big({-\frac{nt^2\eta_q^2}{4m\varepsilon_{q-1}^2}}\Big).
\]
Note that $\mathbb P_N(\pi_qf - \pi_{q-1}f)^2 \le 2\varepsilon_q^2 + 2\varepsilon_{q-1}^2$ is bounded by $4\varepsilon_{q-1}^2$.
For the proof of Theorem 2.14.25, choose $\alpha$ large and
\[
a = 1 - t^{-2};\quad b = 1 - t^{-2};\quad m = \lfloor t^2\rfloor;\quad q_0 = 2 + \lfloor t^{2/(\alpha-1)}\rfloor;\quad
\sqrt m\,\varepsilon_q = q^{-\alpha-1};\quad \eta_q = (q-1)^{-\alpha}.
\]
Combination of Lemma 2.14.35 and (2.14.22) with (2.14.37) and (2.14.38)
yields as upper bound for $\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big)$ a constant times
\[
\Big(1 - \frac{n}{4(1-a)^2t^2n_0}\Big)^{-1}\Big[\big(tq_0^{\alpha+1}\big)^V\,2\exp\big({-2t^2b^2a^2n_0^2/N^2}\big)
+ \sum_{q>q_0}\big(tq^{\alpha+1}\big)^{2V}\,2\exp\big({-\tfrac14t^2(q-1)^2a^2n_0^2/N^2}\big)\Big].
\]

For sufficiently large $t$, the leading multiplicative factor outside the square
brackets is bounded by 2. The quotient $n_0/N = 1 - 1/m$ is bounded below
by $1 - 2t^{-2}$. Furthermore, the series is bounded by its $q_0$th term. To see
this, write it as $\sum_{q>q_0}\exp\big({-\psi(q)}\big)$ for $\psi(q) = \tfrac14t^2(q-1)^2a^2n_0^2/N^2 -
2V(\alpha+1)\log q$ and apply Problem 2.14.3. Thus, for sufficiently large $t$, the
last display is up to a constant bounded by
\[
t^V\big(3 + t^{2/(\alpha-1)}\big)^{V(\alpha+1)}\exp\big({-2t^2(1-t^{-2})^6}\big)
+ t^{2V}\big(2 + t^{2/(\alpha-1)}\big)^{2V(\alpha+1)}\exp\big({-\tfrac14t^2\,t^{4/(\alpha-1)}(1-2t^{-2})^4}\big),
\]

the second term being an upper bound for the q0 th term of the series. For
2
large t, the second term is certainly bounded by e−2t . For sufficiently large
3V +δ −2t2
α, the first term is bounded by a multiple of (1∨t) e . This concludes
the proof of the first theorem.
For the proof of Theorem 2.14.26, choose $\alpha$ large and
\[
a = 1 - t^{-2};\quad b = 1 - t^{-r};\quad m = \lfloor t^2\rfloor;\quad q_0 = 2 + \lfloor t^{r/(\alpha-1)}\rfloor;
\]
\[
\sqrt m\,\varepsilon_q = q^{-\alpha-\beta};\quad \eta_q = (q-1)^{-\alpha};\quad \beta = \frac{W\alpha}{2-W};\quad 2 - r = W + Wr\,\frac{\alpha+\beta}{\alpha-1}.
\]

Combination of Lemma 2.14.35 and (2.14.24) with (2.14.37) and (2.14.38)
yields that the probability $\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big)$ is bounded by a constant times
\[
(2.14.39)\qquad
\Big(1 - \frac{n}{4(1-a)^2t^2n_0}\Big)^{-1}\Big[\exp\big(K(tq_0^{\alpha+\beta})^W\big)\,2\exp\big({-2t^2a^2b^2n_0^2/N^2}\big)
+ \sum_{q>q_0}\exp\big(2K(tq^{\alpha+\beta})^W\big)\,2\exp\big({-\tfrac14t^2(q-1)^{2\beta}a^2n_0^2/N^2}\big)\Big].
\]

The multiplicative factor on the left is bounded by 2 for sufficiently large
$t$. The first term inside the square brackets is bounded by
\[
2\exp\Big(Kt^W\big(2 + t^{r/(\alpha-1)}\big)^{(\alpha+\beta)W} - 2t^2(1-2t^{-r})^6\Big).
\]
Since $2t^2(1-2t^{-r})^6 \ge 2t^2 - 24t^{2-r}$, this is further bounded by
\[
2\exp\Big(Dt^{W+Wr\frac{\alpha+\beta}{\alpha-1}} + 24t^{2-r} - 2t^2\Big).
\]
The constant $r$ has been chosen so that the first two terms are both of the
order $t^{2-r}$. For $\alpha \to \infty$ (and hence $\beta \to \infty$), the exponent $2 - r$ decreases
to $U$ as in the statement of the theorem.
The constant $\beta$ has been chosen so that $2\beta = W(\alpha+\beta)$, whence the
series in (2.14.39) can be written in the form $\sum_{q>q_0}e^{-\psi(q)}$ for a function
of the form $\psi(q) = c(q-1)^{2\beta} - dq^{2\beta}$. For sufficiently large $t$, we have $c > d$
and $\psi$ is increasing and convex for $q \ge q_0$. Then Problem 2.14.3 applies and
the series is bounded by its $q_0$th term. This is of much lower order than
$e^{-2t^2}$. The second theorem is proved.
For the proofs of Theorems 2.14.33 and 2.14.34, we replace the first
term of the bound (2.14.38) for (2.14.37) by a bound based on the Hoeffding--
Bernstein inequality. On the set $C_n$ where $\mathbb P_N(f - \mathbb P_Nf)^2$ is bounded by
$P(f-Pf)^2 + s$ for every $f$, the right side of (2.14.37) is bounded by
\[
(2.14.40)\qquad N_{q_0}\,2\exp\Big({-\frac{nt^2b^2}{2(\sigma^2+s)+tb}}\Big) + \sum_{q>q_0}N_q^2\,2\exp\Big({-\frac{nt^2\eta_q^2}{4m\varepsilon_{q-1}^2}}\Big).
\]

For the proof of Theorem 2.14.33, choose $\alpha$ large and
\[
a = 1 - (t/\sigma)^{-2};\quad b = 1 - (t/\sigma)^{-2};\quad m = \lfloor(t/\sigma)^2\rfloor;\quad q_0 = 2 + \lfloor(t/\sigma)^{2/(\alpha-1)}\rfloor;
\]
\[
\sqrt m\,\varepsilon_q = \sigma q^{-\alpha-1};\quad \eta_q = (q-1)^{-\alpha};\quad \sqrt{2N}\,s = 3t/\sigma.
\]

Combination of Lemma 2.14.35 and (2.14.22) with (2.14.37) and (2.14.40)
yields as upper bound for $\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big)$ a constant times
\[
\Big(1 - \frac{n\sigma^2}{(1-a)^2t^2n_0}\Big)^{-1}\bigg[\Big(\frac t\sigma q_0^{\alpha+1}\Big)^V2\exp\Big({-\frac{t^2b^2a^2n_0^2/N^2}{2(\sigma^2+s)+tban_0/(N\sqrt n)}}\Big)
+ \sum_{q>q_0}\Big(\frac t\sigma q^{\alpha+1}\Big)^{2V}2\exp\Big({-\frac14\,\frac{t^2}{\sigma^2}(q-1)^2a^2\frac{n_0^2}{N^2}}\Big)\bigg] + \mathrm P^*(C_n^c).
\]
0

Once again the multiplicative factor is bounded and the series is bounded by
its $q_0$th term. If $t$ is sufficiently large, then so is $t/\sigma \ge t$, $m \ge \tfrac12(t/\sigma)^2$,
and $s \le 3/\sqrt n$. The preceding expression is up to a constant bounded by
\[
\frac1\sigma\Big(\frac t\sigma\Big)^V\Big(2 + \Big(\frac t\sigma\Big)^{2/(\alpha-1)}\Big)^{V(\alpha+1)}\exp\Big({-\frac12\,\frac{t^2(1-2\sigma^2/t^2)^6}{\sigma^2+(3+t)/\sqrt n}}\Big)
+ \frac{\exp\big({-\tfrac12t^2/\sigma^2}\big)}{\sigma^{2V}} + \mathrm P^*(C_n^c).
\]
The first two terms are bounded as desired. The last term can be bounded
with the help of Theorems 2.14.25 and 2.14.26. Since each $|f|$ is bounded
by 1,
\[
s\,1\{C_n^c\} \le \big\|\mathbb P_N(f - \mathbb P_Nf)^2 - P(f-Pf)^2\big\|_{\mathcal F} \le \|\mathbb P_N - P\|_{\mathcal F^2} + 2\|\mathbb P_N - P\|_{\mathcal F}.
\]
The covering numbers $N\big(\varepsilon,\mathcal F^2,L_2(Q)\big)$ of the class of squares are bounded by
$N\big(\varepsilon/2,\mathcal F,L_2(Q)\big)$. Thus, under (2.14.24) the probability $\mathrm P^*(C_n^c)$ is bounded
by $\psi\big(s\sqrt N/3\big)\exp\big({-2Ns^2/9}\big)$ for a function $\psi$ of the form $\psi(s) = e^{Ds^U}$ for
some $U < 2$. For the present choice of $s$, this is bounded by a multiple of
$\exp\big({-t^2/(2\sigma^2)}\big)$. This concludes the proof of the third theorem.
For the proof of Theorem 2.14.34, choose $\alpha$ large and
\[
a = 1 - (t/\sigma)^{-r};\quad b = 1 - (t/\sigma)^{-r};\quad m = \lfloor(t/\sigma)^r\rfloor;\quad q_0 = 2 + \lfloor(t/\sigma)^{r/(\alpha-1)}\rfloor;
\]
\[
\beta = \frac{W\alpha}{2-W};\quad 2 - r = 2u;\quad \sqrt m\,\varepsilon_q = \sigma q^{-\alpha-\beta};\quad \eta_q = (q-1)^{-\alpha};\quad \sqrt{2N}\,s = 3t/\sigma.
\]
Combination of Lemma 2.14.35 and (2.14.24) with (2.14.37) and (2.14.40)
yields as upper bound for $\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big)$ a constant times
\[
\Big(1 - \frac{n\sigma^2}{(1-a)^2t^2n_0}\Big)^{-1}\bigg[\mathrm P^*(C_n^c)
+ \exp\Big(K\Big(\Big(\frac t\sigma\Big)^{r/2}q_0^{\alpha+\beta}\Big)^W\Big)\,2\exp\Big({-\frac{t^2b^2a^2n_0^2/N^2}{2(\sigma^2+s)+tban_0/(N\sqrt n)}}\Big)
+ \sum_{q>q_0}\exp\Big(2K\Big(\Big(\frac t\sigma\Big)^{r/2}q^{\alpha+\beta}\Big)^W\Big)\,2\exp\Big({-\frac14\,\frac{t^2}{\sigma^2}(q-1)^{2\beta}a^2\frac{n_0^2}{N^2}}\Big)\bigg].
\]

Since $(\alpha+\beta)W = 2\beta$, the series can be written in the form $\sum_{q>q_0}e^{-\psi(q)}$
for a function of the form $\psi(q) = c(q-1)^{2\beta} - dq^{2\beta}$. The constants $c$ and $d$
depend on t. Elementary calculations show that ψ is convex and increasing


for $q \ge q_0$ if $t^{2-Wr/2} \ge C\sigma^{2-Wr/2-W}$ for a constant $C$ depending only on
K. This is certainly the case for large t if 2 − W r/2 − W ≥ 0. In that case
the series can be bounded by its q0 th term. By similar arguments as before,
the preceding display can be bounded by a multiple of
\[
\begin{aligned}
&\exp\bigg(D\Big(\frac1\sigma\Big)^W\Big(\frac t\sigma\Big)^{\frac{Wr}2+Wr\frac{\alpha+\beta}{\alpha-1}} + 5\Big(\frac t\sigma\Big)^{2-r} - \frac12\,\frac{t^2}{\sigma^2+\big(3(t/\sigma)^{1-r/2}+t\big)/\sqrt n}\bigg)\\
&\quad+ \exp\bigg(2D\Big(\frac1\sigma\Big)^W\Big(\frac t\sigma\Big)^{\frac{Wr}2+Wr\frac{\alpha+\beta}{\alpha-1}} - \frac14\Big(\frac t\sigma\Big)^{2+\frac{2\beta r}{\alpha-1}}\Big(1-2\Big(\frac\sigma t\Big)^r\Big)^4\bigg).
\end{aligned}
\]
For $\alpha \to \infty$, the exponent
\[
\frac{Wr}2 + Wr\,\frac{\alpha+\beta}{\alpha-1} = \frac{Wr}2 + \frac{2Wr}{2-W}\,\frac{\alpha}{\alpha-1}
\]
converges to $rp/2 = (1-u)p$. The upper bound is valid if $2 - Wr/2 - W \ge 0$,
which is certainly the case if $rp/2 < 2$.

2.14.6 Proof of Theorem 2.14.29


A first step in the direction of the proof of Theorem 2.14.29 is the following
“basic inequality,” which controls the supremum of the empirical process
$\mathbb G_n$ over any class $\mathcal C$ containing a set $C_0$ with $\sigma_0^2 \le P(C_0) \le 1 - \sigma_0^2$ for the
constant $\sigma_0^2$ in Lemma 2.15.10.

2.14.41 Theorem. Let C be a separable class of measurable subsets of X .


Suppose that $C_0 \in \mathcal C$ has $\sigma_0^2 \le P(C_0) \le 1 - \sigma_0^2$, where $\sigma_0^2$ is the constant
determined in Lemma 2.15.10. Let $\mu_n^0 = \mathrm E^*\|\mathbb G_n\|_{\mathcal C\triangle C_0}$, $\bar\mu_n^0 = \mu_n^0 \vee n^{-1/2}$,
and $a = \sup_{C\in\mathcal C}P(C\triangle C_0)$. Then, for $t \ge 4/\sigma_0^2$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal C} > t\big) \le \frac Kt\,e^{-2t^2}\exp\Big(Kat^2 + K\bar\mu_n^0t - \frac{t^4}{4n}\Big).
\]
To prove this theorem, define variables $W_1$, $W_2$, and $W = W_1 + W_2$
through
\[
W_1 = n\mathbb P_n(C_0)\,\Big\|\frac{\mathbb P_n(C_0-C)}{\mathbb P_n(C_0)} - \frac{P(C_0-C)}{P(C_0)}\Big\|_{\mathcal C}, \qquad
W_2 = n\mathbb P_n(C_0^c)\,\Big\|\frac{\mathbb P_n(C-C_0)}{\mathbb P_n(C_0^c)} - \frac{P(C-C_0)}{P(C_0^c)}\Big\|_{\mathcal C}.
\]
Also, define a collection of functions by
\[
\varphi_r(t) = \frac{t^2}r\,1\{t \le r\} + t\Big(\log\frac{et}r\Big)^{1/2}1\{t \ge r\}.
\]
r r
Theorem 2.15.5 can be reformulated in terms of these functions and asserts
that there exist universal constants $C$, $D$ such that, with $S = \sqrt n\,\bar\mu_n + n\sigma_{\mathcal F}^2$,
\[
\mathrm P^*\big(\|\mathbb G_n\|_{\mathcal F} > t\big) \le D\exp\Big({-\varphi_S\Big(\frac{t\sqrt n}C\Big)}\Big), \qquad t \ge C\mu_n.
\]

2.14.42 Lemma. With the notation of Theorem 2.14.41, $S = na + \sqrt n\,\bar\mu_n^0$,
$h(u) = 2u^2 + \tfrac14u^4$, and $\varphi(w) = \varphi_S(w/C)\,1\{w \ge C\sqrt n\,\mu_n^0\}$,
\[
n\|\mathbb P_n - P\|_{\mathcal C} \le n\big|\mathbb P_n(C_0) - P(C_0)\big|\Big(1 + \frac{2a}{\sigma_0^2}\Big) + W \equiv nU\Big(1 + \frac{2a}{\sigma_0^2}\Big) + W,
\]
where for all $w \ge 0$, $u > t > 0$,
\[
\mathrm P^*\big(U \ge t,\,W \ge w\big) \le \frac K{\sqrt n\,u}\exp\big({-nh(u) - \varphi(w) + 5nu(u-t)}\big).
\]

Proof. For any set C ∈ C, we can decompose

\[
\mathbb P_n(C) = \mathbb P_n(C_0) + \mathbb P_n(C - C_0) - \mathbb P_n(C_0 - C),
\]



and similarly with $\mathbb P_n$ replaced by $P$. Consequently, the difference $\mathbb P_n(C) -
P(C)$ can be bounded by a sum of three terms. The first term directly
yields a term that involves $U$. The second term can be bounded by
\[
\mathbb P_n(C_0^c)\,\Big|\frac{\mathbb P_n(C-C_0)}{\mathbb P_n(C_0^c)} - \frac{P(C-C_0)}{P(C_0^c)}\Big| + \frac{P(C-C_0)}{P(C_0^c)}\,\big|\mathbb P_n(C_0^c) - P(C_0^c)\big|.
\]
This is bounded by $1/n$ times $W_2 + Ua/\sigma_0^2$. The third term can be handled
similarly.
Now consider the exponential bound. Since $nU$ is distributed as the absolute deviation from the mean of a binomial$(n,P(C_0))$ variable, Talagrand's inequality for the binomial tail (Proposition A.6.4) yields
$$\mathrm P(U\ge t)\le\frac{2K}{\sqrt{nu}}\exp\big(-nh(u)\big)\exp\big(5nu(u-t)\big).$$

For $I\subset\{1,\ldots,n\}$, let $\Omega_I$ be the set such that $X_i\in C_0$ for $i\in I$ and $X_i\notin C_0$ otherwise. Define $P_1$ on $\Omega_I$ by $P_1(A)=P(A\cap C_0)/P(C_0)$. Then for $\#I=k$, conditionally on $\Omega_I$, $W_1$ has the same distribution as $\sqrt k\,\|G^Y_k\|_{\mathcal C_1}$, where $Y_1,\ldots,Y_k$ are i.i.d. $P_1$ and $\mathcal C_1=\{C_0-C\colon C\in\mathcal C\}$. Similarly, if we define $P_2$ on $\Omega_I$ by $P_2(A)=P(A\cap C_0^c)/P(C_0^c)$, then conditionally on $\Omega_I$, $W_2$ has the same distribution as $\sqrt{n-k}\,\|G^Z_{n-k}\|_{\mathcal C_2}$, where $Z_1,\ldots,Z_{n-k}$ are i.i.d. $P_2$ and $\mathcal C_2=\{C-C_0\colon C\in\mathcal C\}$. Also, note that for any class $\mathcal F$ of functions,
$$\sqrt k\,\mathrm E^*\|G^Y_k\|_{\mathcal F}\le K\sqrt n\,\mathrm E^*\|G_n\|_{\mathcal F}$$
for an absolute constant $K$ (Problem 2.14.6). A similar inequality is valid for $\sqrt{n-k}\,\mathrm E^*\|G^Z_{n-k}\|_{\mathcal F}$. By Theorem 2.15.5 applied twice conditionally on $\Omega_I$, for $w\ge2CK\sqrt n\,\mu'_n$,
$$\mathrm P^*\big(W\ge w\,|\,\Omega_I\big)\le2D\exp\Big(-\varphi_{S'}\Big(\frac w{2C}\Big)\Big)\le2D\exp\Big(-\varphi_S\Big(\frac wK\Big)\Big),$$
350 2 Empirical Processes

with
$$S'=\Big[\sqrt k\,\mathrm E^*\|G^Y_k\|_{\mathcal C_1}\vee1+k\big\|P_1(f-P_1f)^2\big\|_{\mathcal C_1}\Big]\vee\Big[\sqrt{n-k}\,\mathrm E^*\|G^Z_{n-k}\|_{\mathcal C_2}\vee1+(n-k)\big\|P_2(f-P_2f)^2\big\|_{\mathcal C_2}\Big]$$
$$\le K\Big(\sqrt n\,\bar\mu'_n+n\frac{a}{\sigma_0^2}\Big)\le K'\big(na+\sqrt n\,\bar\mu'_n\big)\equiv K'S.$$
Therefore, it follows that, for $u>t>0$ and $w>0$,
$$\mathrm P^*(U\ge t,\,W\ge w)=\mathrm E^*\Big[\sum_I1\{\Omega_I\}\,1\{U\ge t\}\,\mathrm P^*(W\ge w\,|\,\Omega_I)\Big]\le2D\exp\big(-\varphi_S(w/K)\big)\,\mathrm P(U\ge t)\le\frac{2DK}{\sqrt{nu}}\exp\big(-nh(u)-\varphi(w)+5nu(u-t)\big).$$
This concludes the proof.

Proof of Theorem 2.14.41. Consider the function $\varphi$ defined in Lemma 2.14.42 and note that $\varphi(t)/t$ increases. Define $u$ by
$$u\Big(1+\frac{2a}{\sigma_0^2}\Big)=\frac{t}{\sqrt n}.$$
Then $u\le1$, since $t\le\sqrt n$ yields
$$u=\frac{t/\sqrt n}{1+2a/\sigma_0^2}\le\frac{1}{1+2a/\sigma_0^2}\le1.$$

Let $d\ge1/u$ be the smallest number satisfying $\varphi(d)/d\ge11u$. Suppose that $nU(1+2a/\sigma_0^2)+W\ge t\sqrt n$. Let $l\ge0$ be the smallest integer such that
$$nU\Big(1+\frac{2a}{\sigma_0^2}\Big)\ge t\sqrt n-(l+1)d.$$
Then if $l>0$,
$$nU\Big(1+\frac{2a}{\sigma_0^2}\Big)\le t\sqrt n-ld,$$
so $W\ge ld$. Thus, since $W\ge0$, we can find $l\ge0$ such that
$$W\ge ld\qquad\text{and}\qquad U\ge u-\frac{(l+1)d}{n}.$$
In other words,
$$\Big\{nU\Big(1+\frac{2a}{\sigma_0^2}\Big)+W\ge t\sqrt n\Big\}\subset\bigcup_{l=0}^\infty\Big\{W\ge ld,\ U\ge u-\frac{(l+1)d}{n}\Big\}.$$

By Lemma 2.14.42, this implies that
$$\mathrm P^*\big(\|G_n\|_{\mathcal C}>t\big)\le\sum_{l=0}^\infty\mathrm P\Big(U\ge u-\frac{(l+1)d}{n},\ W\ge ld\Big)\le\frac{K}{\sqrt{nu}}\sum_{l=0}^\infty\exp\Big(-nh(u)-\varphi(ld)+5nu\frac{(l+1)d}{n}\Big)=\frac{K}{\sqrt{nu}}\exp\big(-nh(u)\big)\sum_{l=0}^\infty\exp\big(5u(l+1)d-\varphi(ld)\big).$$
For $l\ge1$,
$$\varphi(ld)\ge l\varphi(d)\ge l\cdot11ud\ge5u(l+1)d+lud.$$
Hence $5u(l+1)d-\varphi(ld)\le-lud\le-l$ and
$$\sum_{l=1}^\infty e^{5u(l+1)d-\varphi(ld)}\le\sum_{l=1}^\infty e^{-l}\equiv K.$$

Thus
$$\mathrm P^*\big(\|G_n\|_{\mathcal C}>t\big)\le\frac{K}{\sqrt{nu}}\,e^{-nh(u)}e^{5ud}.$$
It remains to bound $ud$ and $u$ in terms of $t$. By Problem 2.14.5 with $t=K(C)u\sqrt S$, it follows that $d\le d_0\equiv K(C)uS$. Since $d$ is the smallest number with $d\ge1/u$, $d\ge C\sqrt n\,\mu'_n$, and $\varphi(d)/d\ge11u$, we deduce that $d\le\max\{1/u,\,C\sqrt n\,\mu'_n,\,K(C)uS\}$, or
$$ud\le\max\big\{1,\ C\sqrt n\,\mu'_nu,\ K(C)u^2S\big\}.$$
Hence, using $t/\sqrt n\le1$,
$$ud\le K\big\{1+\mu'_nt+n^{-1}t^2S\big\}\le\tilde K\big\{1+\bar\mu'_nt+at^2\big\}.$$

Finally, we bound $h(u)$. The function $h$ is convex with $h'(s)=4s+s^3\le5s$ for $0\le s\le1$, and
$$\frac{t}{\sqrt n}-u=\frac{t}{\sqrt n}\Big(1-\frac{1}{1+2a/\sigma_0^2}\Big).$$
It follows that
$$h(u)\ge h\Big(\frac{t}{\sqrt n}\Big)+\Big(u-\frac{t}{\sqrt n}\Big)h'\Big(\frac{t}{\sqrt n}\Big)\ge h\Big(\frac{t}{\sqrt n}\Big)-5\frac{t}{\sqrt n}\Big(\frac{t}{\sqrt n}-u\Big)=h\Big(\frac{t}{\sqrt n}\Big)-5\frac{t^2}{n}\,\frac{2a/\sigma_0^2}{1+2a/\sigma_0^2}.$$
Putting this together yields the claimed inequality.
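The convexity step above can be checked numerically. The sketch below assumes the form $h(u)=2u^2+\tfrac14u^4$ from Lemma 2.14.42; it verifies that $h'(u)\le5u$ on $[0,1]$ and the tangent-line inequality used in the proof:

```python
import numpy as np

# h as in Lemma 2.14.42 (an assumption of this sketch)
def h(u):
    return 2 * u**2 + 0.25 * u**4

def h_prime(u):
    return 4 * u + u**3

u = np.linspace(0.0, 1.0, 1001)
# h'(u) <= 5u on [0, 1], since 4u + u^3 <= 5u when u <= 1
assert np.all(h_prime(u) <= 5 * u + 1e-12)
# convexity: the tangent line at any s lies below the graph of h
s = 0.6
assert np.all(h(u) >= h(s) + (u - s) * h_prime(s) - 1e-12)
```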

We are now ready to start the proof of Theorem 2.14.29. We first decompose $\mathcal C$ as $\mathcal C=\mathcal C_0\cup\mathcal C_1$, where
$$\mathcal C_0=\big\{C\in\mathcal C\colon P(C)\le\sigma_0^2\text{ or }P(C)\ge1-\sigma_0^2\big\}$$
and $\mathcal C_1=\mathcal C-\mathcal C_0$. Then $\|G_n\|_{\mathcal C}=\|G_n\|_{\mathcal C_0}\vee\|G_n\|_{\mathcal C_1}$, so that
$$(2.14.43)\qquad\mathrm P^*\big(\|G_n\|_{\mathcal C}>t\big)\le\mathrm P^*\big(\|G_n\|_{\mathcal C_0}>t\big)+\mathrm P^*\big(\|G_n\|_{\mathcal C_1}>t\big).$$
The first term is bounded by $Ct^{-1}e^{-2t^2}$ for $t\ge\tilde K\sqrt V\log K$, by Lemma 2.15.10 (cf. Problem 2.14.9). Thus, it suffices to bound the second term in (2.14.43), where $\sigma_0^2\le P(C)\le1-\sigma_0^2$ for all $C\in\mathcal C_1$. Without loss of generality, we may assume that these inequalities hold for all $C\in\mathcal C$, and relabel $\mathcal C_1$ as $\mathcal C$.
To prove the desired inequality for the supremum over $\mathcal C_1$, we partition and then apply Theorem 2.14.41. The strategy is to decompose $\mathcal C_1$ into pieces $\mathcal D_j$, $j=1,\ldots,M$, satisfying
$$a_j\equiv\sup_{C,C'\in\mathcal D_j}P(C\triangle C')\le a\equiv\frac{V}{t^2}.$$

Next, Theorem 2.14.41 is applied to each $\mathcal D_j$ separately. The main work is to further bound
$$(2.14.44)\qquad\mu_n\equiv\max_{1\le j\le M}\mu'_{nj},$$
where
$$(2.14.45)\qquad\mu'_{nj}\equiv\mathrm E^*\|G_n\|_{\mathcal D_j\triangle D_j},\qquad j=1,\ldots,M.$$
Here $D_j$ is an arbitrary fixed element of $\mathcal D_j$ for each $j\le M$. To bound $\mu_n$, we will use the following partitioning lemma.

2.14.46 Lemma (Partitioning). Let $(T,\rho)$ be a metric space, and let $p, q$ be nonnegative integers with $p<q$. Consider a partition $\mathcal P_q$ of $T$ such that each set of $\mathcal P_q$ has diameter $\le4^{-q}$, and suppose that $k_l$, $p\le l\le q$, are given positive integers. Then there is an increasing sequence $\{\mathcal P_l\}_{p\le l\le q}$ of partitions of $T$ with the following properties:
(i) each set of $\mathcal P_l$ has diameter less than or equal to $4^{-l+1}$;
(ii) each atom of $\mathcal P_l$ contains at most $k_l$ atoms of $\mathcal P_{l+1}$;
(iii) for all $l<q$, $\#\mathcal P_l\le N(4^{-l},T,\rho)+\#\mathcal P_{l+1}/k_l$;
(iv) for $m\ge l$ and $T_i\in\mathcal P_l$, $N(4^{-m},T_i,\rho)\le\prod_{j=l+1}^m k_j$.
If $N(4^{-l},T,\rho)\le(K4^l)^V$ for $l\ge p$, $\#\mathcal P_q\le(K4^q)^V$, and if $k_l\ge2\cdot4^V$ for all $l$, then
$$\#\mathcal P_l\le2(K4^l)^V\qquad\text{for }p\le l\le q.$$

Proof. The construction proceeds by decreasing induction on $l$. Given $\mathcal P_{l+1}$, we will construct $\mathcal P_l$.
Set $N=N(4^{-l},T,\rho)$. Consider points $\{t_i\}_{1\le i\le N}$ of $T$ so that each point of $T$ is within $\rho$-distance $4^{-l}$ of at least one $t_i$, $i=1,\ldots,N$. Define
$$A_i=\bigcup\big\{T_{l+1,j}\colon T_{l+1,j}\cap B(t_i,4^{-l})\ne\emptyset,\ T_{l+1,j}\in\mathcal P_{l+1}\big\}.$$
Since each set $T_{l+1,j}$ has diameter $\le4^{-l}$, the set $A_i$ has diameter at most $2\cdot4^{-l}+2\cdot4^{-l}=4^{-l+1}$. Now disjointify the sets $\{A_i\}$ by defining $C_i=A_i-\cup_{j<i}A_j$ for every $i=1,\ldots,N$. The collection $\{C_i\}$ forms a partition $\mathcal Q$ of $T$ that is coarser than $\mathcal P_{l+1}$. Certain elements of $\mathcal Q$ might contain more than $k_l$ elements of $\mathcal P_{l+1}$; partition any such element of $\mathcal Q$ into sets, all but one of which contain exactly $k_l$ elements of $\mathcal P_{l+1}$, the remaining one being the union of at most $k_l$ sets of $\mathcal P_{l+1}$. This constructs $\mathcal P_l$ satisfying (i) and (ii); (iii) follows easily from the construction.
To prove the last assertion, note that by (iii) and the hypotheses we have
$$\#\mathcal P_l\le(K4^l)^V+\frac{\#\mathcal P_{l+1}}{2\cdot4^V}.$$
Induction then yields the claim.

Now we return to the proof of Theorem 2.14.29. Given $a=V/t^2$, let $p$ be the smallest integer such that $4^{-p+1}\le a$. Let $q\ge p$; the value of $q$ will be determined later in the proof. The partitions $\{\mathcal P_l\}_{l=p}^q$ of $T=\mathcal C_1$ given by Lemma 2.14.46 will be constructed with $k_l=\lfloor3\cdot4^V\rfloor\ge2\cdot4^V$. Thus
$$M=\#\mathcal P_p\le2\Big(\frac{Kt^2}{V}\Big)^V.$$
Consider the atoms $\{\mathcal D_j\}_{j=1}^M$ of $\mathcal P_p$. To bound $\mu_n$, we first show that if $3(q-p)V\le n4^{-q}$, then
$$(2.14.47)\qquad\mu_n\le\tilde K\Big[\sqrt V\Big(\sqrt{4^{-p}}+\sqrt{q4^{-q}}+\sqrt{4^{-q}\log K}\Big)+n^{-1/2}\big(qV+V\log K\big)\Big].$$

To prove (2.14.47), first recall (2.14.44) and (2.14.45). By a chaining argument based on the partitions $\mathcal P_l$, we find that
$$(2.14.48)\qquad\mu'_{nj}\le\sum_{p<l\le q}\mathrm E^*\|G_n\|_{\mathcal F_l}+\mathrm E^*\sup_{1\le k\le m}\|G_n\|_{\mathcal G_k}.$$
Here each $f\in\mathcal F_l$ is of the form $1_C-1_{C'}$ where $P(C\triangle C')\le4^{-l+1}=a_l$ and $\#\mathcal F_l\le(3\cdot4^V)^{l-p}$. Moreover, for some fixed sets $C_1,\ldots,C_m$ in $\mathcal C_1$,
$$\mathcal C_k=\big\{C\in\mathcal C_1\colon P(C\triangle C_k)\le4^{-q+1}\big\},\qquad\mathcal G_k=\{1_C-1_{C_k}\colon C\in\mathcal C_k\},\qquad k=1,\ldots,m,$$

and $m\le(3\cdot4^V)^{q-p}$. To bound the terms involving a supremum over $\mathcal F_l$, we use Theorem 2.15.5 (in the form given in the proof of Lemma 2.15.10; or even Bernstein's inequality) to show that for $f\in\mathcal F_l$,
$$\mathrm P\big(|\sqrt n\,G_nf|\ge t\big)\le\exp\big(-\varphi_S(t/C)\big)\qquad\text{for }t\ge D,$$
with $C$ as in Theorem 2.15.5, $D\le K\sqrt{na_l}$, and $S\le na_l+\sqrt{na_l}$. Since $V\ge1$, we have $\log\#\mathcal F_l\le3(l-p)V$. Hence if
$$\log\#\mathcal F_l\le3(l-p)V\le na_l\le na_l+\sqrt{na_l}\equiv S$$
and $na_l\ge1$ (so that $\sqrt{na_l}\le na_l$), then we have (see Problem 2.14.10)
$$(2.14.49)\qquad\mathrm E\|G_n\|_{\mathcal F_l}\le\frac{K}{\sqrt n}\big(\sqrt{na_l}+\sqrt{na_l\log\#\mathcal F_l}\big)\le K\sqrt{a_l(l-p)V}.$$
Hence the contribution of the first term of (2.14.48) is bounded by a constant times $\sqrt V\sum_{l\ge p+1}2^{-l}\sqrt{l-p}\le K\sqrt{V4^{-p}}$.
To bound the term involving suprema over $\mathcal G_k$, let $a_q=4^{-q+1}$, and note that
$$\Big\|\sum_{i=1}^n\varepsilon_i\big(1_C(X_i)-1_{C_k}(X_i)\big)\Big\|_{\mathcal C_k}\sim\Big\|\sum_{i=1}^n\varepsilon_i1_{C\triangle C_k}(X_i)\Big\|_{\mathcal C_k}.$$
Hence by Lemma 2.3.6, Theorem 2.15.5, and Problem 2.14.8, the variables
$$Z_k=\Big\|\sum_{i=1}^n(1_C-1_{C_k})(X_i)-n\big(P(C)-P(C_k)\big)\Big\|_{\mathcal C_k}$$
satisfy the bound of Problem 2.14.10 with $C$ as in Theorem 2.15.5, $r=na_q+D$, and
$$D=K\sqrt{nV}\Big[\Big(a_q+\frac Vn\log\frac{K}{a_q}\Big)\log\frac{K}{a_q}\Big]^{1/2}.$$
Thus, by Problem 2.14.10, it follows that
$$\mathrm E\sup_{1\le k\le m}\|G_n\|_{\mathcal G_k}\le\frac{K}{\sqrt n}\Big[D+C\sqrt{(na_q+D)(q-p)V}\Big]\le K\Big[\frac{D}{\sqrt n}+\sqrt{\frac{V(q-p)D}{n}}+\sqrt{V(q-p)a_q}\Big],$$
provided that $3(q-p)V\le na_q+D$. Since $a_q\le4\cdot4^{p-q}$, we have $\log(K/a_q)\ge(q-p)/K_0$, and hence
$$V(q-p)\ \vee\ \sqrt{nV(q-p)a_q}\ \le\ K_1D.$$
n

Therefore, it follows that
$$(2.14.50)\qquad\mathrm E^*\sup_{1\le k\le m}\|G_n\|_{\mathcal G_k}\le K_2\frac{D}{\sqrt n}\le K_3\Big[\sqrt{Vq4^{-q}}+\sqrt{V4^{-q}\log K}+n^{-1/2}\big(Vq+V\log K\big)\Big].$$

Combining the inequalities (2.14.49) and (2.14.50) with (2.14.48) completes the proof of (2.14.47).
Now we need to choose $q\ge p$ so that
$$(2.14.51)\qquad3(q-p)V\le n4^{-q}$$
and also bound the sum of the second and third terms in the exponential in the conclusion of Theorem 2.14.41. Namely, after using (2.14.47) and $4^{-p}\le V/t^2$, we need to bound
$$Q=\tilde Kt\Big[\sqrt{Vq4^{-q}}+\sqrt{V4^{-q}\log K}+n^{-1/2}\big(Vq+V\log K\big)\Big]-\frac{t^4}{4n}.$$
To do this, let $q$ be the largest integer so that (2.14.51) holds. Then
$$3(q-p)4^{q-p}\le\frac nV4^{-p}\le\frac nV\,\frac{V}{t^2}=\frac{n}{t^2},$$
and hence $q-p\le K_4\log(K_4n/t^2)$ for some constant $K_4$. Furthermore, by the definition of $q$,
$$n4^{-q-1}\le3(q+1-p)V\le3qV,$$
and hence, using $\sqrt{q\log K}\le2(q+\log K)$, it follows that
$$Q\le\tilde K\frac{t}{\sqrt n}\Big[2Vq+3\sqrt{V^2q\log K}+V\log K\Big]-\frac{t^4}{4n}\le K_5\frac{tV}{\sqrt n}\big[q+\log K\big]-\frac{t^4}{4n}=K_5\frac{tV}{\sqrt n}q-\frac{t^4}{8n}+K_5\frac{tV}{\sqrt n}\log K-\frac{t^4}{8n}\equiv Q_1+Q_2.$$
But from the definition of $p$ it follows that $4^{-(p-1)+1}\ge a=V/t^2$, and hence $4^{-p}\ge V/(16t^2)$, or $p\le K_6\log(16t^2/V)$. Then, by writing $q=p+(q-p)$, it follows that, for $t\ge\tilde K\sqrt V\log K$,
$$q\le K_6\log\Big(\frac{16t^2}{V}\Big)+K_4\log\Big(\frac{K_4n}{t^2}\Big)\le K_7\log\Big(\frac{K_7n}{V}\Big),$$

and this yields
$$Q_1\le K_8\frac{tV}{\sqrt n}\log\Big(\frac{K_7n}{V}\Big)-\frac{t^4}{8n}\le\sup_{0\le t\le\sqrt n}\Big[K_8\frac{tV}{\sqrt n}\log\Big(\frac{K_7n}{V}\Big)-\frac{t^4}{8n}\Big]\le K_9V.$$
Furthermore, if $t\le\sqrt n/\log K$, then $Q_2\le K_5V$, while
$$\sup_{n\ge1}Q_2\le K_{10}\frac{V^2}{t^2}(\log K)^2,$$
so that $Q\le K_{11}V$ if $t\le\sqrt n/\log K$, and
$$Q\le K_{12}\Big(V+\frac{V^2}{t^2}(\log K)^2\Big)$$
in any case.

in any case. Thus, it follows from Theorem 2.14.41 and our choices for a,
p, and q that

 
P∗ kGn kC1 ≥ t ≤ P∗ max kGn kDj ≥ t

1≤j≤M
M
X
P∗ kGn kDj ≥ t


j=1
t4
 
K −2t2
≤M e exp Kat2 + Kµn t −
t 4n
 2 V
V2
 
Kt K −2t2 2
≤2 e exp KV + K12 V + K12 2 (log K)
V t t
 2 V
Kt 1 −2t 2
≤ K13 e exp(3K14 V )
V t
V
D DKt2

2
≤ e−2t
t V


if t ≥ V log K or n ≥ √t2 (log K)2 . Choosing D = D(K) so large that the
last bound ≥ 1 for t ≤ V log K and combining with (2.14.43) completes
the proof of Theorem 2.14.29 when (2.14.27) holds.
2.14 Problems and Complements 357

Problems and Complements


1. (Alternative definition of $\|\cdot\|_{\psi_p}$ for small $p$) For $0<p<1$, define
$$\psi_p(x)=e^{x^p}-1;\qquad\tilde\psi_p(x)=\big(\psi_p(x)-\psi_p(c_p)\big)1\{x\ge c_p\}.$$
Then $\tilde\psi_p$ is convex and there exists a constant $C_p$ such that $\|X\|_{\tilde\psi_p}\le\|X\|_{\psi_p}\le C_p\|X\|_{\tilde\psi_p}$. Thus $\|X\|_{\tilde\psi_p}$ defines a true Orlicz norm and can be used interchangeably with $\|X\|_{\psi_p}$ up to constants. A bound on the latter pseudonorm leads to the tail bound
$$\mathrm P\big(|X|>t\big)\le\big(1+e^{c_p^p}\big)e^{-t^p/\|X\|_{\psi_p}^p}.$$
[Hint: Note that $\tilde\psi_p\le\psi_p\le\tilde\psi_p+\psi_p(c_p)$. Also $\mathrm E\tilde\psi_p(Y/C)\le\mathrm E\tilde\psi_p(Y)/C$ for every $C>1$ and $Y$, by the convexity of $\tilde\psi_p$.]

2. (Hoeffding’s inequality) If the random variable X has zero mean and


takes its values in the interval [u, v], then E exp(sX) ≤ exp(s2 (v − u)2 /8) for
every s ∈ R.
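For a concrete instance (a hypothetical numeric check, not part of the text), take $X=\pm1$ with probability $1/2$ each, so $u=-1$, $v=1$, $\mathrm EX=0$ and $\mathrm E\exp(sX)=\cosh s$; Hoeffding's bound then reads $\cosh s\le\exp(s^2/2)$:

```python
import math

u, v = -1.0, 1.0
for k in range(-50, 51):
    s = k / 10.0
    mgf = math.cosh(s)                       # E exp(sX) for X = +-1 w.p. 1/2
    bound = math.exp(s**2 * (v - u)**2 / 8)  # Hoeffding: exp(s^2 (v-u)^2 / 8)
    assert mgf <= bound + 1e-12
```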
3. If $\psi\colon[q_0,\infty)\to\mathbb R$ is convex and increasing, then
$$\sum_{q>q_0}\exp\big(-\psi(q)\big)\le\psi'(q_0)^{-1}\exp\big(-\psi(q_0)\big).$$
Here $\psi'$ is the right derivative of $\psi$.
[Hint: The tail series can be bounded by $\psi'(q_0)^{-1}\int_{q_0}^\infty\exp(-\psi(x))\,d\psi(x)$.]
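A quick numerical illustration (with a hypothetical choice of $\psi$): for $\psi(q)=q^2$, which is convex and increasing, and $q$ ranging over the integers above $q_0=1$, the bound gives $\sum_{q>1}e^{-q^2}\le(2q_0)^{-1}e^{-q_0^2}$:

```python
import math

q0 = 1
tail = sum(math.exp(-q**2) for q in range(q0 + 1, 200))  # integers q > q0
bound = math.exp(-q0**2) / (2 * q0)                      # psi'(q0) = 2 q0
assert tail <= bound
```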

4. Let $\mathcal F$ be a class of measurable functions $f\colon\mathcal X\to[l,u]$ satisfying one of the following two conditions:
$$\sup_QN\big(\varepsilon(u-l),\mathcal F,L_2(Q)\big)\le\Big(\frac K\varepsilon\Big)^V,\qquad\sup_Q\log N\big(\varepsilon(u-l),\mathcal F,L_2(Q)\big)\le K\Big(\frac1\varepsilon\Big)^W,$$
for every $\varepsilon>0$, where the supremum is taken over all discrete probability measures $Q$. Then the tail probabilities $\mathrm P^*\big(\|G_n\|_{\mathcal F}>t\big)$ can be bounded as in the theorems, but with $t$ replaced by $t/(u-l)$.
5. For the function $\varphi_r$ as in Lemma 2.15.10, there is an absolute constant $K=K(C)$ such that for all $t\le K(C)\sqrt r$,
$$\varphi_r\Big(\frac{K(C)\sqrt r\,t}{C}\Big)\ge11t^2.$$
[Hint: Consider the two regions $K(C)\sqrt r\,t\ge rC$ and $K(C)\sqrt r\,t\le rC$. On the first region, $11C^2\exp(11C^2-1)$ works for $K^2(C)$, while $11C^2$ works on the second region. Hence $K^2(C)$ can be chosen to be the larger of these two constants.]

6. Let $Y_1,\ldots,Y_k$ and $X_1,\ldots,X_n$ be i.i.d. samples from $P_1$ and $P$, respectively, that are related by $P_1(C)=P(C\cap C_0)/P(C_0)$. If $\sigma_0^2\le P(C_0)\le1-\sigma_0^2$, then for any class of functions $\mathcal F$,
$$\mathrm E^*\Big\|\sum_{i=1}^k\big(f(Y_i)-P_1f\big)\Big\|_{\mathcal F}\le K\,\mathrm E^*\Big\|\sum_{i=1}^n\big(f(X_i)-Pf\big)\Big\|_{\mathcal F},$$
for a constant $K$ that depends only on $\sigma_0^2$.


7. Suppose that $x_1,\ldots,x_n$ are points in a set $\mathcal X$ and $\varepsilon_1,\ldots,\varepsilon_n$ are i.i.d. Rademacher variables. If $\mathcal C$ is a class of subsets of $\mathcal X$ satisfying (2.14.27), then there exists a constant $\tilde K$ such that, with $N=\big\|\sum_{i=1}^n1_C(x_i)\big\|_{\mathcal C}$,
$$\mathrm E\Big\|\sum_{i=1}^n\varepsilon_i1_C(x_i)\Big\|_{\mathcal C}\le\tilde K\sqrt{NV\log(Kn/N)}.$$
[Hint: Use Corollary 2.2.8 with $d(C,D)=\sqrt{n\|1_C-1_D\|_{L_1(Q_n)}}$, where $Q_n$ is the empirical measure on the points $x_i$. The diameter of $\mathcal C$ for $d$ is at most $\sqrt{2N}$.]

8. Suppose that $X_1,\ldots,X_n$ are i.i.d. random elements with distribution $P$ on $(\mathcal X,\mathcal A)$, and let $\varepsilon_1,\ldots,\varepsilon_n$ be i.i.d. Rademacher variables. If $\mathcal C$ is a measurable class of measurable subsets of $\mathcal X$ satisfying (2.14.27), then there exists a constant $\tilde K$ such that, with $a=\sup_{C\in\mathcal C}P(C)$,
$$\mathrm E^*\Big\|\sum_{i=1}^n1_C(X_i)\Big\|_{\mathcal C}\le2na+\tilde KV\log\frac Ka$$
and
$$\mathrm E^*\Big\|\sum_{i=1}^n\varepsilon_i1_C(X_i)\Big\|_{\mathcal C}\le\tilde K\sqrt V\sqrt n\,\Big[\Big(a+\frac Vn\log\frac Ka\Big)\log\frac Ka\Big]^{1/2}.$$

9. Suppose that $\mathcal C$ is a class of sets satisfying (2.14.27) and $\sup_{C\in\mathcal C}P(C)\le\sigma_0^2$. Then for some constants $D$ and $\tilde K$,
$$\mathrm P^*\big(\|G_n\|_{\mathcal C}>t\big)\le\frac Dt\,e^{-2t^2},$$
for all $t\ge\tilde K\sqrt V\log K$ and $n\ge1$.
[Hint: Use Lemma 2.15.10 and the second result proved in Problem 2.14.8.]

10. Suppose that $Z_1,\ldots,Z_m$ are random variables satisfying
$$\mathrm P\big(|Z_j|\ge t\big)\le\exp\big(-\varphi_r(t/C)\big),$$
for all $t\ge D$, where $\varphi_r$ is the function defined in the proof of Lemma 2.15.10. Then there exists a constant $K$ such that, for $\log m\le r$,
$$\mathrm E\max_{1\le j\le m}|Z_j|\le K\big(D+C\sqrt{r\log m}\big).$$
[Hint: Recall the proof of Lemma 2.2.2; compute the left side as the integral $\int_0^\infty\mathrm P\big(\max_{1\le j\le m}|Z_j|>t\big)\,dt$, and split the region of integration at $\tau\equiv D\vee C\sqrt{r\log m}$.]

11. Suppose that $\mathcal C$ is a class of subsets of a given set $C_0$ satisfying (2.14.28). There exists a constant $K_1$ such that if $P(C_0)=d\in(0,1)$, then
$$\mathrm E^*\|G_n\|_{\mathcal C}\le K_1\sqrt{Vd\log\frac Kd}.$$
[Hint: Relate $N_{[\,]}\big(\varepsilon,\mathcal C,L_1(P)\big)$ to $N_{[\,]}\big(\varepsilon,\mathcal C,L_2(P)\big)$, and then use Theorem 2.14.15; compare with Problem 2.14.7.]

12. For values of $\delta$ such that $\delta\|F\|_{P,2}\ge1/\sqrt n$, Theorem 2.14.2 can be improved to [see Van der Vaart and Wellner (2011)]
$$\mathrm E_P^*\|G_n\|_{\mathcal F}\lesssim J\big(\delta,\mathcal F,L_2\big)\|F\|_{P,2}+J^2\Big(\frac1{\sqrt n\,\|F\|_{P,2}},\mathcal F,L_2\Big)\sqrt n\,\|F\|_{P,2}^2.$$

13. The upper bound in Theorem 2.14.16 can (with the same proof) be slightly improved to
$$J_{[\,]}\big(\delta,\mathcal F,L_2(P)\big)\,\|F\|_{P,2}+\frac{\log N_{[\,]}\big(\delta\|F\|_{P,2},\mathcal F,L_2(P)\big)}{\sqrt n}.$$
A similar improvement is possible for Theorem 2.14.17.
2.15
Concentration

The location of the distribution of the norm $\|G_n\|_{\mathcal F}$ of the empirical process, as measured for instance by its mean value $\mathrm E^*\|G_n\|_{\mathcal F}$, depends of course strongly on the size of the class of functions $\mathcal F$. Chapter 2.14 gives various methods to estimate this mean value. It turns out that the spread or "concentration" of the distribution around its center hardly depends on the complexity of the class $\mathcal F$.
In this chapter we present two bounds on the concentration. The first
“bounded difference inequality” is an exponential inequality of Hoeffding
type, and gives a bound expressed in terms of the range of the functions
f . The second, deeper result is of Bennett-Bernstein type, and incorpo-
rates also the variance of the functions f . The bounds can be viewed as
uniform versions of the Hoeffding and Bernstein inequalities, Lemmas 2.2.7
and 2.2.9.
A similar phenomenon of concentration is known for suprema of cen-
tered Gaussian processes, for which Borell’s inequality Proposition A.2.1
gives a subGaussian bound for probabilities of deviations from the mean
that is valid without conditions.
As usual for measurability purposes the observations X1 , . . . , Xn are
assumed to be defined as the coordinate projections on a product probabil-
ity space.

2.15.1 Bounded Difference Inequality


By Hoeffding’s inequality the tail probabilities of the centered variable εn f ,
2
for a function f with range 1 are bounded by e−2t . The deviation proba-
2.15 Concentration 361

bilities of the empirical process are no bigger.

2.15.1 Theorem. If $\mathcal F$ is a class of measurable functions $f\colon\mathcal X\to\mathbb R$ such that $|f(x)-f(y)|\le1$ for every $f\in\mathcal F$ and every $x,y\in\mathcal X$, then, for all $t\ge0$,
$$\mathrm P^*\Big(\big|\,\|G_n\|_{\mathcal F}-\mathrm E^*\|G_n\|_{\mathcal F}\big|\ge t\Big)\le2\exp(-2t^2),$$
$$\mathrm P^*\Big(\Big|\sup_fG_nf-\mathrm E^*\sup_fG_nf\Big|\ge t\Big)\le2\exp(-2t^2).$$

This theorem is a special case of a result that applies to general statistics $T=T(X_1,\ldots,X_n)$ that satisfy the bounded difference condition: for all $i=1,\ldots,n$, given constants $c_1,\ldots,c_n$, and every $x_1,\ldots,x_n,y_i\in\mathcal X$,
$$(2.15.2)\qquad\big|T(x_1,\ldots,x_n)-T(x_1,\ldots,x_{i-1},y_i,x_{i+1},\ldots,x_n)\big|\le c_i.$$
Both the norm $\|G_n\|_{\mathcal F}$ and the supremum $\sup_{f\in\mathcal F}G_nf$ satisfy this inequality, with $\sqrt n\,c_i$ equal to the supremum over $f$ of the range of $f$. (Therefore, so do their measurable cover functions, almost surely under the product law $P^{n+1}$, if these variables are not measurable; see Problem 2.15.1.) In the application to the empirical process the variables $X_1,\ldots,X_n$ are i.i.d., but in the following theorem it suffices that they are independent. The variable $Y_i$ (used in the statement of the theorem to relax assumption (2.15.2) to an almost sure inequality) is an independent copy of $X_i$.

2.15.3 Proposition (Bounded difference inequality). If $T$ is a measurable function of independent variables $X_1,\ldots,X_n$ that satisfies (2.15.2) for almost every $(x_1,\ldots,x_n,y_i)$ relative to the law of $(X_1,\ldots,X_n,Y_i)$, then, for all $t>0$,
$$\mathrm P\big(T-\mathrm ET\ge t\big)\le\exp\Big(-\frac{2t^2}{\sum_{i=1}^nc_i^2}\Big),$$
$$\mathrm P\big(T-\mathrm ET\le-t\big)\le\exp\Big(-\frac{2t^2}{\sum_{i=1}^nc_i^2}\Big).$$

Proof. Let $\mathrm E(T\,|\,X_1,\ldots,X_{i-1},x)$ be an abbreviation of the conditional expectation $\mathrm E(T\,|\,X_1,\ldots,X_{i-1},X_i=x)$, and set, for $i=1,\ldots,n$,
$$T_i=\mathrm E(T\,|\,X_1,\ldots,X_i)-\mathrm E(T\,|\,X_1,\ldots,X_{i-1}),$$
$$A_i=\inf_x\big(\mathrm E(T\,|\,X_1,\ldots,X_{i-1},x)-\mathrm E(T\,|\,X_1,\ldots,X_{i-1})\big),$$
$$B_i=\sup_x\big(\mathrm E(T\,|\,X_1,\ldots,X_{i-1},x)-\mathrm E(T\,|\,X_1,\ldots,X_{i-1})\big).$$
(If (2.15.2) only holds almost everywhere, use the essential infimum and supremum relative to the law of $X_i$ in the definitions of $A_i$ and $B_i$ instead.) With the understanding that $\mathrm E(T\,|\,\emptyset)=\mathrm ET$, we then have $T-\mathrm ET=\sum_{i=1}^nT_i$. Furthermore, $A_i\le T_i\le B_i$ and $B_i-A_i\le c_i$, by integration of the bounded difference assumption (2.15.2) relative to $x_{i+1},\ldots,x_n$. Therefore, for all $\lambda>0$,
$$\mathrm Ee^{\lambda\sum_{i=1}^nT_i}=\mathrm E\Big(e^{\lambda\sum_{i=1}^{n-1}T_i}\,\mathrm E\big(e^{\lambda T_n}\,|\,X_1,\ldots,X_{n-1}\big)\Big)\le\mathrm Ee^{\lambda\sum_{i=1}^{n-1}T_i}\,e^{\lambda^2c_n^2/8},$$
by Problem 2.15.2, where we use that $\mathrm E(T_n\,|\,X_1,\ldots,X_{n-1})=0$, almost surely. By repeating the argument we see that the right side is bounded by $\exp\big(\lambda^2\sum_{i=1}^nc_i^2/8\big)$.
By Markov's inequality we turn this bound on the Laplace transform into the bound $e^{-\lambda t}\exp\big(\lambda^2\sum_ic_i^2/8\big)$ on the probability in the first display of the proposition, valid for every $\lambda>0$. Finally, we choose $\lambda=4t/\sum_ic_i^2$ to minimize the latter expression over $\lambda$. This completes the proof of the first inequality. The second inequality follows by applying the first to $-T$.
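As an illustration of Proposition 2.15.3 (a hypothetical simulation, not part of the text), take $T$ the sample mean of $n$ independent uniform$(0,1)$ variables; then (2.15.2) holds with $c_i=1/n$ and the bound reduces to Hoeffding's classical tail $\exp(-2nt^2)$:

```python
import math
import random

random.seed(0)
n, reps, t = 50, 20000, 0.15
# changing one X_i moves the sample mean by at most c_i = 1/n
hits = sum(
    sum(random.random() for _ in range(n)) / n - 0.5 >= t
    for _ in range(reps)
)
empirical = hits / reps
bound = math.exp(-2 * t**2 / (n * (1 / n) ** 2))  # exp(-2 t^2 / sum c_i^2) = exp(-2 n t^2)
assert empirical <= bound
```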

2.15.2 Talagrand’s Inequality


2
The second bound of this chapter incorporates the maximal variance σF of
the functions in the class F, defined in (2.14.32). There are several varia-
tions of this theorem. We first state the theorem in its strongest form using
the function h: (0, ∞) → (0, ∞) defined by

(2.15.4) h(t) = t(log t − 1) + 1.

2.15.5 Theorem (Bousquet-Talagrand). For any countable class F of


measurable functions f : X → R with f (x) − P f ≤ 1 for every x ∈ X and
2
wn : = σF + 2n−1/2 E supf Gn f finite, for every t ≥ 0,
 
   t 
P sup Gn f − E sup Gn f ≥ t ≤ exp −nwn h 1 + √ .
f f nwn

If, moreover, f (x) − P f ≤ 1 for every x ∈ X , then this is also true with
supf Gn f replaced throughout by kGn kF .

Proof. The theorem is a corollary of Proposition 2.15.12 (below). It is no loss of generality to assume that $Pf=0$ for every $f$. To overcome possible measurability issues in the following proof, the theorem can also be reduced first to the case of a finite collection $\mathcal F$.
Given $T=\sup_{f\in\mathcal F}\sum_if(X_i)$, we choose
$$T_i=\sup_{f\in\mathcal F}\sum_{j\ne i}f(X_j),\qquad S_i=f_i(X_i;\,X_j\colon j\ne i),$$
for $f_i$ the function in $\mathcal F$ at which the supremum involved in the definition of $T_i$ is attained, for given $(X_j\colon j\ne i)$. Then $S_i\le1$, $\mathrm E(S_i\,|\,X_j\colon j\ne i)=Pf_i=0$, and $\sum_{i=1}^n\mathrm E(S_i^2\,|\,X_j\colon j\ne i)=\sum_{i=1}^nPf_i^2\le n\sigma^2_{\mathcal F}$. (Here $Pf_i$ and $Pf_i^2$ are the integrals of $f_i$ and $f_i^2$ with respect to their first argument, the arguments $(X_j\colon j\ne i)$ being fixed.) Finally, for given $X_1,\ldots,X_n$ and $f_0$ the function at which the supremum in the definition of $T$ is attained,
$$(n-1)T=\sum_{i=1}^n\sum_{j\ne i}f_0(X_j)\le\sum_{i=1}^n\sum_{j\ne i}f_i(X_j)=\sum_{i=1}^nT_i.$$
Rearranging this inequality gives $\sum_{i=1}^n(T-T_i)\le T$. The first assertion of the theorem follows from Proposition 2.15.12 with the choice $u=1$.
The second assertion follows by applying the first to the class of functions $\mathcal F\cup(-\mathcal F)$.

For many applications the sequence $w_n$ in the theorem will be bounded in $n$ (although this is not required by the theorem). Then, for fixed or moderate $t$, the function $t\mapsto h(1+t)$ is mostly important in a neighborhood of 0. Two functions with a similar behavior are the functions $h_0$ and $h_1$ given by
$$h_0(t)=\frac{t^2}{2(1+t)},\qquad h_1(t)=1+t-\sqrt{1+2t}.$$
The three functions and their first derivatives all vanish at $t=0$, and their second derivatives at 0 are all 1. More importantly, one can show that, for any $t\ge0$,
$$(2.15.6)\qquad h(1+t)\ge9h_1\Big(\frac t3\Big)\ge9h_0\Big(\frac t3\Big).$$
Thus the inequality in the theorem remains true if the function $t\mapsto h(1+t)$ is replaced by a scaled version of $h_0$ or $h_1$.
The bound resulting from $h_0$ can be written in the form
$$(2.15.7)\qquad\mathrm P\Big(\sup_fG_nf-\mathrm E\sup_fG_nf\ge t\Big)\le\exp\Big(-\frac12\,\frac{t^2}{w_n+\frac13t/\sqrt n}\Big).$$

This has the same form as Bernstein's inequality (Lemma 2.2.9), with $w_n$ playing the role of the variance. If $\mathcal F$ consists of a single function $f$, then $w_n$ is indeed the variance of $f$, and hence the theorem exactly recovers Bernstein's inequality. In the general case the (positive) constant $n^{-1/2}\mathrm E^*\sup_fG_nf$ must be added to the variance, increasing the bound. For a Donsker class this added term is small and tends to zero, as does the term $t/3\sqrt n$ in the denominator of the exponent. In the limit as $n\to\infty$ the bound takes the form $\exp(-\tfrac12t^2/\sigma^2_{\mathcal F})$, and hence we can interpret these additional terms as finite-sample corrections to the Gaussian bound. On the other hand, for $t$ near $\sqrt n$ or for very large classes $\mathcal F$ the correction terms dominate the variance term.
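The comparison (2.15.6) between $h$, $h_0$, and $h_1$ is easy to confirm on a grid (a numerical sketch of the stated inequalities, not a proof):

```python
import math

h = lambda t: t * (math.log(t) - 1) + 1       # (2.15.4)
h0 = lambda t: t**2 / (2 * (1 + t))
h1 = lambda t: 1 + t - math.sqrt(1 + 2 * t)

for k in range(1, 2001):
    t = k / 100.0
    # (2.15.6): h(1 + t) >= 9 h1(t/3) >= 9 h0(t/3)
    assert h(1 + t) >= 9 * h1(t / 3) - 1e-12
    assert h1(t / 3) >= h0(t / 3) - 1e-12
```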

Replacing $h$ by the scaled version of $h_1$ gives yet another bound, which has a convenient "inverse" formulation. As the inverse function of $h_1$ is given by $h_1^{-1}(t)=t+\sqrt{2t}$, it follows that, for every $t>0$,
$$(2.15.8)\qquad\mathrm P\Big(\sup_fG_nf-\mathrm E\sup_fG_nf\ge\sqrt{2tw_n}+\frac{t}{3\sqrt n}\Big)\le e^{-t}.$$
Similar bounds can be obtained for the modulus of the empirical process. In fact, these are equivalent to bounds on the supremum over the class of functions $\mathcal F\cup(-\mathcal F)$.
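The inverse formula behind (2.15.8) can be verified directly, since $h_1(t+\sqrt{2t})=1+t+\sqrt{2t}-\sqrt{(1+\sqrt{2t})^2}=t$. A numeric spot-check:

```python
import math

h1 = lambda t: 1 + t - math.sqrt(1 + 2 * t)
h1_inv = lambda t: t + math.sqrt(2 * t)  # claimed inverse of h1 on [0, infinity)

for k in range(1, 1001):
    t = k / 50.0
    assert abs(h1(h1_inv(t)) - t) < 1e-9
```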
For easy use we state a simplified version (with easy but suboptimal
constants) of (2.15.8) as a lemma.

2.15.9 Lemma. Let $\mathcal F$ be a countable class of measurable functions $f\colon\mathcal X\to\mathbb R$ such that $\|f\|_\infty\le M$ and $Pf^2\le\delta^2$ for every $f$ in $\mathcal F$. Then, for every $t>0$, with probability at least $1-e^{-t}$,
$$\forall f\in\mathcal F\colon\qquad\tfrac12G_nf\le\mathrm E\sup_{f\in\mathcal F}G_nf+\frac{Mt}{\sqrt n}+\delta\sqrt t.$$

Proof. We apply (2.15.8) to the functions $f/M$, and use the inequality
$$\sqrt{2t(\sigma^2+2\mu/\sqrt n)}\le\sqrt{2t}\,\sigma+t/\sqrt n+\mu.$$

The following lemma is an example of application of the preceding theorem. It can be regarded as a uniform version of Kiefer's inequality for a binomial variable, Corollary A.6.3. The lemma is a key tool in the proof of Theorem 2.14.29.
Set $\mu_{n,\mathcal F}=\mathrm E^*\|G_n\|_{\mathcal F}$.

2.15.10 Lemma (Uniform small variance exponential bound). There exist universal constants $D$, $K_0$, and $\sigma_0$ such that, for every class $\mathcal F$ of measurable functions $f\colon\mathcal X\to[0,1]$ with $\sigma^2_{\mathcal F}\le\sigma_0^2$ and $K_0\mu_{n,\mathcal F}\le\sqrt n$ and such that $\mathcal F$ and $\mathcal F^2$ are $P$-measurable,
$$\mathrm P^*\big(\|G_n\|_{\mathcal F}>t\big)\le D\exp\big(-11t^2\big),$$
for every $t\ge K_0\mu_{n,\mathcal F}$.

Proof. Since $|f(x)-Pf|\le1$ for all $x$, the probability on the left side of the above display is zero for all $t\ge\sqrt n$. Hence any nonnegative bound will be trivially true for $t>\sqrt n$, and we may restrict attention to $t\le\sqrt n$.
For $t\ge K_0\mu_{n,\mathcal F}$ the left side is bounded above by $\mathrm P\big(\|G_n\|_{\mathcal F}-\mu_{n,\mathcal F}>t(1-K_0^{-1})\big)$. Hence the lemma follows if we can show the bound with (e.g.) 12 instead of 11 in the exponent for the centered norm of the empirical process.
By the assumptions, $w_n=\sigma^2_{\mathcal F}+2\mu_n/\sqrt n\le\sigma_0^2+2K_0^{-1}$. Therefore, for $t\le\varepsilon\sqrt n$ we have $w_n+t/(3\sqrt n)\le\sigma_0^2+2K_0^{-1}+\varepsilon/3$, which is smaller than $1/24$ if $\sigma_0$, $K_0^{-1}$ and $\varepsilon$ are chosen sufficiently small. The bound then follows from (2.15.7).
Finally, if $\varepsilon\sqrt n\le t\le\sqrt n$, then $t/(\sqrt n\,w_n)\ge\varepsilon/(\sigma_0^2+2K_0^{-1})$, which can be made arbitrarily large by first choosing $\varepsilon$ small enough to satisfy the bound of the preceding paragraph and next choosing $\sigma_0$ and $K_0^{-1}$ still smaller. Because $h(t)\sim t\log t$ as $t\to\infty$, and hence $h(t)\ge12t$ for sufficiently large $t$, we then have
$$nw_n\,h\Big(1+\frac{t}{\sqrt n\,w_n}\Big)\ge12nw_n\frac{t}{\sqrt n\,w_n}=12\sqrt n\,t\ge12t^2.$$
Thus the bound follows from Theorem 2.15.5.

Unlike the bounded difference inequality, Theorem 2.15.5 and its consequences (2.15.7) and (2.15.8) concern upper tail probabilities only, and do not show the size of the deviations of the supremum below its mean. Although these are of the same order (up to constants), it appears that their estimation requires a separate argument.
The function $h$ in the following theorem is the same as in the previous theorem and is defined in (2.15.4).

2.15.11 Theorem. For any countable class $\mathcal F$ of measurable functions $f\colon\mathcal X\to\mathbb R$ with $f(x)-Pf\le1$ for every $x\in\mathcal X$ and $w_n:=\sigma^2_{\mathcal F}+2n^{-1/2}\mathrm E\sup_fG_nf$ finite, for every $t\ge0$,
$$\mathrm P\Big(\sup_{f\in\mathcal F}G_nf-\mathrm E\sup_{f\in\mathcal F}G_nf\le-t\Big)\le\exp\Big(-\frac{nw_n}{9}\,h\Big(1+\frac{3t}{\sqrt n\,w_n}\Big)\Big).$$

The proofs of Theorems 2.15.5 and 2.15.11 rely upon the entropy
method. The entropy of a nonnegative random variable Y is defined as
Ent(Y ) = EY log Y − EY log EY . Problems 2.15.3 and 2.15.4 derive the
relevant properties of entropy.
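For a discrete nonnegative variable the entropy can be computed directly; the sketch below (a hypothetical example) checks two properties that the proofs use: $\mathrm{Ent}(Y)\ge0$ (Jensen applied to the convex function $x\log x$) and homogeneity, $\mathrm{Ent}(cY)=c\,\mathrm{Ent}(Y)$:

```python
import math

def ent(values, probs):
    """Ent(Y) = E Y log Y - E Y log E Y for a discrete Y >= 0."""
    ey = sum(p * v for v, p in zip(values, probs))
    ey_log_y = sum(p * v * math.log(v) for v, p in zip(values, probs) if v > 0)
    return ey_log_y - ey * math.log(ey)

vals, probs = [0.5, 1.0, 4.0], [0.2, 0.5, 0.3]
assert ent(vals, probs) >= 0           # Jensen: x log x is convex
c = 2.7
assert abs(ent([c * v for v in vals], probs) - c * ent(vals, probs)) < 1e-10
```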
Theorem 2.15.5 is a special case of a theorem on more general functions
T = T (X1 , . . . , Xn ) of independent variables X1 , . . . , Xn . The abstract ver-
sion is somewhat technical, but its proof is relatively straightforward.

2.15.12 Proposition. Let $T$ be a measurable function of independent variables $X_1,\ldots,X_n$. Suppose that there exist a constant $u\ge0$ and random variables $S_i$ and $T_i$, for $i=1,\ldots,n$, such that $T_i$ is a measurable function of $(X_j\colon j\ne i)$, and
$$S_i\le T-T_i\le1,\qquad\mathrm E\big(S_i\,|\,X_j\colon j\ne i\big)\ge0,\qquad S_i\le u,\qquad\sum_{i=1}^n(T-T_i)\le T.$$
If $\sigma^2$ is a constant satisfying $\sigma^2\ge\sum_{i=1}^n\mathrm E(S_i^2\,|\,X_j\colon j\ne i)$, and $v=(1+u)\mathrm ET+\sigma^2$, then, for all $t\ge0$,
$$\mathrm P\big(T-\mathrm ET\ge t\big)\le\exp\Big(-v\,h\Big(1+\frac tv\Big)\Big).$$

Proof. Assume without loss of generality that $u\le1$, so that $\alpha=1/(1+u)\in[1/2,1]$. Throughout the proof let $\lambda>0$.
The tensorization inequality for entropy (Problem 2.15.4) gives that
$$\mathrm{Ent}(e^{\lambda T})\le\mathrm E\sum_{i=1}^n\mathrm{Ent}_i(e^{\lambda T})=\mathrm E\sum_{i=1}^ne^{\lambda T_i}\,\mathrm{Ent}_i\big(e^{\lambda(T-T_i)}\big),$$
because $T_i$ is measurable relative to $(X_j\colon j\ne i)$ and entropy is homogeneous (Problem 2.15.3(iii)). The variational characterization of entropy (Problem 2.15.3(ii) with $r=1$) gives $\mathrm{Ent}(e^Y)\le\mathrm E\big(1-e^Y+Ye^Y\big)$, for any random variable $Y$. Applying this to each of the entropies $\mathrm{Ent}_i(e^{\lambda(T-T_i)})$ conditionally given $(X_j\colon j\ne i)$ and substituting these bounds in the preceding display, we obtain that
$$(2.15.13)\qquad\mathrm{Ent}(e^{\lambda T})\le\sum_{i=1}^n\mathrm E\big(e^{\lambda T_i}-e^{\lambda T}+\lambda(T-T_i)e^{\lambda T}\big).$$
We shall later use the assumption $\sum_i(T-T_i)\le T$ to bound the right side further, but first obtain an alternative upper bound using the other assumptions.
The $i$th term of the sum on the right side of (2.15.13) is algebraically identical to
$$g_{\lambda,\alpha}(T-T_i)\big(e^{\lambda T}-e^{\lambda T_i}-\lambda(T-T_i)e^{\lambda T_i}+\lambda\alpha(T-T_i)^2e^{\lambda T_i}\big),$$
for the function $g_{\lambda,\alpha}\colon\mathbb R\to\mathbb R$ given by $g_{\lambda,\alpha}(t)=e^{\lambda t}(e^{-\lambda t}-1+\lambda t)/(e^{\lambda t}-1-\lambda t+\lambda\alpha t^2)$. Because $T-T_i\le1$ by assumption, the function $g_{\lambda,\alpha}$ is increasing on $(-\infty,1]$ (see Problem 2.15.5), and the term in brackets is nonnegative, the expression in the preceding display increases if $g_{\lambda,\alpha}(T-T_i)$ is replaced by $g_{\lambda,\alpha}(1)$. Because $g_{\lambda,\alpha}(1)>0$ and $-t+\alpha t^2\le-s+\alpha s^2$ if both $s\le u\le1$ and $s\le t\le1$ (remember that $\alpha=1/(1+u)$; the parabola is symmetric about $(u+1)/2\ge u\ge s$, hence decreasing on $[0,(1+u)/2]$, and takes equal values at $u$ and $1$), the preceding display is further bounded by
$$g_{\lambda,\alpha}(1)\big(e^{\lambda T}-e^{\lambda T_i}-\lambda S_ie^{\lambda T_i}+\lambda\alpha S_i^2e^{\lambda T_i}\big).$$

We take the expected value and sum over $i$ to obtain an upper bound on (2.15.13). Next we use that $\mathrm E(-\lambda S_ie^{\lambda T_i})\le0$, because $\mathrm E(S_i\,|\,X_j\colon j\ne i)\ge0$ and $T_i$ is measurable relative to $(X_j\colon j\ne i)$. We also use that
$$\sum_i\mathrm ES_i^2e^{\lambda T_i}\le\sum_i\mathrm E\,\mathrm E\big(S_i^2\,|\,X_j\colon j\ne i\big)e^{\lambda T}\le\sigma^2\,\mathrm Ee^{\lambda T},$$
where the last inequality is by assumption and the first can be argued as follows. Because $T-T_i\ge S_i$, the conditional mean of $T-T_i$ is also nonnegative, so that by Jensen's inequality applied with the exponential function $\mathrm E\big(e^{\lambda(T-T_i)}\,|\,X_j\colon j\ne i\big)\ge1$, or equivalently $e^{\lambda T_i}\le\mathrm E(e^{\lambda T}\,|\,X_j\colon j\ne i)$. Thus $\mathrm ES_i^2e^{\lambda T_i}\le\mathrm ES_i^2\,\mathrm E(e^{\lambda T}\,|\,X_j\colon j\ne i)$, where on the right side the conditioning can be transferred from $e^{\lambda T}$ to $S_i^2$.
We thus have proved that
$$\frac{\mathrm{Ent}(e^{\lambda T})}{g_{\lambda,\alpha}(1)}\le\sum_{i=1}^n\mathrm E\big(e^{\lambda T}-e^{\lambda T_i}\big)+\lambda\alpha\sigma^2\,\mathrm Ee^{\lambda T}.$$

The exponentials $e^{\lambda T}$ and $e^{\lambda T_i}$ appear on the right side in the reverse order relative to (2.15.13). Therefore, adding the present inequality and (2.15.13) gives
$$\mathrm{Ent}(e^{\lambda T})\Big(1+\frac{1}{g_{\lambda,\alpha}(1)}\Big)\le\lambda\alpha\sigma^2\,\mathrm Ee^{\lambda T}+\sum_{i=1}^n\mathrm E\,\lambda(T-T_i)e^{\lambda T}.$$
The last term on the right can be further bounded by $\lambda\mathrm E\,Te^{\lambda T}$, since $\sum_{i=1}^n(T-T_i)\le T$ by assumption. Next we multiply across by $e^{-\lambda\mathrm ET}$ to rewrite the inequality as (cf. Problem 2.15.3(iii))
$$\mathrm{Ent}\big(e^{\lambda(T-\mathrm ET)}\big)\Big(1+\frac{1}{g_{\lambda,\alpha}(1)}\Big)\le\lambda\alpha\sigma^2\,\mathrm Ee^{\lambda(T-\mathrm ET)}+\mathrm E\,\lambda Te^{\lambda(T-\mathrm ET)}=\lambda\alpha v\,\mathrm Ee^{\lambda(T-\mathrm ET)}+\mathrm E\,\lambda(T-\mathrm ET)e^{\lambda(T-\mathrm ET)}.$$
The entropy $\mathrm{Ent}(e^{\lambda(T-\mathrm ET)})$ can be expressed in terms of the Laplace transform $F(\lambda)=\mathrm Ee^{\lambda(T-\mathrm ET)}$ as $\mathrm{Ent}(e^{\lambda(T-\mathrm ET)})=\lambda F'(\lambda)-F(\lambda)\log F(\lambda)$, and $\mathrm E(T-\mathrm ET)e^{\lambda(T-\mathrm ET)}=F'(\lambda)$. After also inserting the explicit form of $g_{\lambda,\alpha}(1)$, the preceding inequality can be simplified to
$$(\log F)'(\lambda)-\Big(\frac{e^\lambda-1+\alpha}{e^\lambda-1-\lambda+\lambda\alpha}\Big)\log F(\lambda)\le\Big(\frac{e^\lambda(e^{-\lambda}-1+\lambda)}{e^\lambda-1-\lambda+\lambda\alpha}\Big)v\alpha.$$
Some algebra shows that the corresponding equality is solved (in $\log F$) by the function $\lambda\mapsto v\psi(\lambda)$, for $\psi(\lambda)=e^\lambda-1-\lambda$. Consequently, we can apply Problem 2.15.6 to see that $\log F(\lambda)\le v\psi(\lambda)$.
Combining this with Markov's inequality, we obtain $\mathrm P\big(T-\mathrm ET\ge t\big)\le e^{-\lambda t}F(\lambda)\le e^{-(\lambda t-v\psi(\lambda))}$. The (optimal) choice $\lambda=\log(1+t/v)$ yields the proposition.
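The final optimization can be spot-checked numerically: with $\psi(\lambda)=e^\lambda-1-\lambda$ and $h$ as in (2.15.4), the choice $\lambda=\log(1+t/v)$ indeed turns $\lambda t-v\psi(\lambda)$ into $v\,h(1+t/v)$:

```python
import math

h = lambda x: x * (math.log(x) - 1) + 1   # (2.15.4)
psi = lambda lam: math.exp(lam) - 1 - lam

for v in [0.5, 1.0, 3.0]:
    for t in [0.1, 1.0, 10.0]:
        lam = math.log(1 + t / v)         # the optimal choice in the proof
        assert abs((lam * t - v * psi(lam)) - v * h(1 + t / v)) < 1e-9
```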

A special case of Proposition 2.15.12, applicable to sums of nonnegative variables, arises if the hypotheses of Proposition 2.15.12 hold with every $S_i$ equal to 0, so that both $u$ and $\sigma^2$ can be chosen equal to 0. Then
$$\mathrm P\big(T-\mathrm ET\ge t\big)\le\exp\Big(-\mathrm ET\,h\Big(1+\frac{t}{\mathrm ET}\Big)\Big).$$
Boucheron et al. (2000) show that under the same hypotheses, for nonnegative and bounded variables $T$ and for $0<t\le\mathrm ET$,
$$\mathrm P\big(T-\mathrm ET\le-t\big)\le\exp\Big(-\mathrm ET\,h\Big(1-\frac{t}{\mathrm ET}\Big)\Big).$$

Proof of Theorem 2.15.11. It suffices to prove the theorem for a finite collection $\mathcal F$ of functions with $Pf=0$. For fixed $\lambda>0$ let $M_f(\lambda)=\mathrm Ee^{-\lambda f(X_1)}$ be the Laplace transform of the variables $f(X_i)$ at $-\lambda$, and define
$$S_\lambda=\min_f\frac{e^{-\lambda\sum_{i=1}^nf(X_i)}}{M_f(\lambda)^n}.$$
The main part of the proof is to bound the expectation $\mathrm ES_\lambda$. The theorem will next be derived from a bound on the Laplace transform of $T=\max_f\sum_{i=1}^nf(X_i)$ in terms of $\mathrm ES_\lambda$.
By the tensorization inequality (Problem 2.15.4),

(2.15.14)    Ent(S_λ) ≤ Σ_{i=1}^n E Ent_i(S_λ)

(2.15.15)             ≤ Σ_{i=1}^n E Ent_i(S_{i,λ}) + Σ_{i=1}^n E(S_{i,λ} − S_λ) log(E_i S_λ/S_λ),

by Problem 2.15.3(vi), for any random variables S_{i,λ}, where E_i means the
expectation with respect to X_i with the other variables (X_j: j ≠ i) fixed.
For f̂_λ a function f that attains the minimum in the definition of S_λ, we
choose

    S_{i,λ} = Σ_f P_i(f̂_λ = f) e^{−λ Σ_{j=1}^n f(X_j)} / M_f(λ)^n.

Because S_{i,λ} is a convex combination of the same variables whose minimum
is S_λ, we have S_{i,λ} − S_λ ≥ 0. Moreover, at the end of the proof we show
that

(2.15.16)    E_i S_λ / S_λ ≤ e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) ≤ ν(λ) := e^λ cosh λ,

(2.15.17)    Σ_{i=1}^n E(S_{i,λ} − S_λ) ≤ ν(λ)( Σ_{i=1}^n E Ent_i(S_λ) − ES_λ log S_λ ),

(2.15.18)    Σ_{i=1}^n E Ent_i(S_{i,λ}) ≤ nν(λ) ES_λ( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) ).

Here L_f = log M_f is the cumulant generating function of the variables
f(X_i) at −λ. Substituting inequalities (2.15.16)–(2.15.18) in (2.15.15) and
rearranging, we find that, with γ = ν log ν,

(2.15.19)    (1 − γ(λ)) Σ_{i=1}^n E Ent_i(S_λ)
                 ≤ nν(λ) ES_λ( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) ) − γ(λ) ES_λ log S_λ.

For λ such that 1 − γ(λ) > 0 (i.e. λ ≤ λ₀ ≈ 0.46) this is a bound on
(1 − γ(λ)) Ent(S_λ), by (2.15.14). The next step is to express Ent(S_λ) and
the quantities on the right side in terms of the function F(λ) = ES_λ and
its derivative.
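The numerical value λ₀ ≈ 0.46 can be checked directly (an editorial sketch): λ₀ is the point where γ(λ) = ν(λ) log ν(λ), with ν(λ) = e^λ cosh λ = (e^{2λ} + 1)/2, crosses 1:

```python
import math

def nu(lam):
    # ν(λ) = e^λ cosh λ = (e^{2λ} + 1)/2
    return math.exp(lam) * math.cosh(lam)

def gamma(lam):
    # γ(λ) = ν(λ) log ν(λ), increasing in λ
    return nu(lam) * math.log(nu(lam))

# bisection for the root of γ(λ) = 1 on (0.1, 1.0)
lo, hi = 0.1, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if gamma(mid) < 1:
        lo = mid
    else:
        hi = mid
lam0 = (lo + hi) / 2

assert 0.45 < lam0 < 0.48          # consistent with λ₀ ≈ 0.46
assert gamma(lam0 - 0.01) < 1 < gamma(lam0 + 0.01)
```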
Because each of the entries exp(−λ Σ f(X_i))/M_f(λ)^n in the minimum
that defines S_λ is an analytic function of λ, each pair of entries is either
identical or possesses graphs (as functions of λ) that agree at most on a set
without limit points. It follows that [0, ∞) can be partitioned into a
countable number of intervals such that on each interval S_λ (as function of
λ ≥ 0) coincides with one of the entries. Consequently, λ ↦ S_λ is
differentiable on the interior of each of these intervals, with derivative

    S′_λ = (∂/∂λ)( e^{−λ Σ_{i=1}^n f(X_i)} / M_f(λ)^n )|_{f=f̂_λ}
         = S_λ( −Σ_{i=1}^n f̂_λ(X_i) − nL′_{f̂_λ}(λ) ).

The expression inside brackets on the right side can also be written as
λ^{−1}( log S_λ + nL_{f̂_λ}(λ) ) − nL′_{f̂_λ}(λ). We conclude that F(λ) = ES_λ
satisfies

    λF′(λ) = λES′_λ = ES_λ log S_λ − nES_λ( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) ).

We solve this for ES_λ log S_λ, and use the definition of Ent(S_λ) and
inequalities (2.15.14) and (2.15.19) to infer that, for λ ≤ λ₀,

    λF′(λ) − (1 − γ(λ)) F(λ) log F(λ) ≤ n( ν(λ) − 1 ) ES_λ( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) )
                                      ≤ n( ν(λ) − 1 ) F(λ) max_f ( λL′_f(λ) − L_f(λ) ).

Here λL′_f(λ) − L_f(λ) = Ent(e^{−λf(X_1)})/M_f(λ) ≤ e^λ(e^{−λ} − 1 + λ)σ_F², by
Problem 2.15.3(v) and because M_f(λ) ≥ 1, by Jensen's inequality and the
assumption that Pf = 0. We conclude that, for λ ≤ λ₀,

    λ(log F)′(λ) − (1 − γ(λ)) log F(λ) ≤ n ½(e^{2λ} − 1) e^λ(e^{−λ} − 1 + λ) σ_F².
Because M_f(λ) ≥ 1, we have that F(λ) = ES_λ ≤ Ee^{−λT}, for T =
max_f Σ_{i=1}^n f(X_i), and hence log F(λ)/λ ≤ (log Ee^{−λT})/λ → −ET, as λ ↓ 0.
Under this side condition the differential inequality of the preceding display
implies that, for λ ≤ λ₀,

(2.15.20)    log F(λ) ≤ −ET λe^{−I(λ)} + nσ_F² λe^{−I(λ)} J(λ),

for I and J the functions defined in Problems 2.15.7 and 2.15.8.



Because min_f e_f ≤ (min_f e_f/c_f)(max_f c_f), for any positive constants
c_f, e_f, we have Ee^{−λT} ≤ ES_λ max_f M_f(λ)^n. Combining this with Bennett's
bound log M_f(λ) ≤ Pf²(e^λ − 1 − λ) and (2.15.20), we conclude that

    log Ee^{−λ(T−ET)} ≤ λET + log F(λ) + nσ_F²(e^λ − 1 − λ)
                      ≤ [λ(1 − e^{−I(λ)})/2] 2ET + nσ_F²( λe^{−I(λ)} J(λ) + e^λ − 1 − λ )
                      ≤ (1/9) w_n (e^{3λ} − 1 − 3λ),

by the definition of w_n, in view of Problem 2.15.8, for λ ≤ λ₀.


For λ > λ₀ we use instead of (2.15.20) the trivial bound log F(λ) ≤ 0,
which is valid for every λ, because the variables in the minimum defining S_λ
have mean 1. It follows that log Ee^{−λ(T−ET)} is bounded by
w_n max(λ/2, e^λ − 1 − λ), which is bounded by the right side of the preceding
display for λ ≥ λ₁ ≈ 0.53. Also there exists λ₂ > λ₁ so that the line segment
connecting the points (λ₀, (e^{3λ₀} − 1 − 3λ₀)/9) and
(λ₂, max(λ₂/2, e^{λ₂} − 1 − λ₂)) is below the graph of the function
λ ↦ (e^{3λ} − 1 − 3λ)/9. By the convexity of the function λ ↦ log Ee^{−λ(T−ET)},
we conclude that this function is below the function λ ↦ w_n(e^{3λ} − 1 − 3λ)/9
for every λ ≥ 0. This bound on the cumulant generating function can be turned
into the tail bound of the theorem using Markov's inequality.
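The threshold λ₁ ≈ 0.53 can likewise be located numerically (an editorial sketch): it is where (e^{3λ} − 1 − 3λ)/9 overtakes max(λ/2, e^λ − 1 − λ), and near that point the maximum is attained by λ/2:

```python
import math

def rhs(lam):
    # (e^{3λ} − 1 − 3λ)/9
    return (math.exp(3 * lam) - 1 - 3 * lam) / 9

def trivial(lam):
    # max(λ/2, e^λ − 1 − λ)
    return max(lam / 2, math.exp(lam) - 1 - lam)

# bisection for the crossing point of the two bounds on (0.3, 1.0)
lo, hi = 0.3, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if rhs(mid) < trivial(mid):
        lo = mid
    else:
        hi = mid
lam1 = (lo + hi) / 2

assert 0.52 < lam1 < 0.56                      # consistent with λ₁ ≈ 0.53
# beyond λ₁ the cubic-exponential bound dominates the trivial bound
assert all(rhs(lam1 + k / 100) >= trivial(lam1 + k / 100) - 1e-9
           for k in range(0, 300))
```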
Finally we prove the validity of inequalities (2.15.16)–(2.15.18).
    For f̂_{i,λ} the minimizer of f ↦ e^{−λ Σ_{j≠i} f(X_j)}/M_f(λ)^{n−1}, which
depends on (X_j: j ≠ i) only, the left side of (2.15.16) is bounded by

    ( E_i e^{−λ Σ_{j=1}^n f̂_{i,λ}(X_j)} / M_{f̂_{i,λ}}(λ)^n ) / ( e^{−λ Σ_{j=1}^n f̂_λ(X_j)} / M_{f̂_λ}(λ)^n )
      = ( e^{−λ Σ_{j≠i} f̂_{i,λ}(X_j)} / M_{f̂_{i,λ}}(λ)^{n−1} ) / ( e^{−λ Σ_{j=1}^n f̂_λ(X_j)} / M_{f̂_λ}(λ)^n ).

By the definition of f̂_{i,λ}, the right side increases if f̂_{i,λ} is replaced
by f̂_λ. This yields the middle inequality of (2.15.16). The second inequality
follows because e^{λf} ≤ e^λ for every f by the assumption that f ≤ 1, while
M_f(λ) ≤ cosh λ by Problem 2.15.9.
To prove (2.15.17) we first write

    E_i S_{i,λ} = Σ_f P_i(f̂_λ = f) e^{−λ Σ_{j≠i} f(X_j)} / M_f(λ)^{n−1}
                = E_i S_λ e^{λf̂_λ(X_i)} M_{f̂_λ}(λ).

It follows that the left side of (2.15.17) is equal to

    Σ_{i=1}^n ES_λ( e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) − 1 )
      = Σ_{i=1}^n ES_λ( e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) − 1 − ν(λ) log( e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) ) )
        − ν(λ) Σ_{i=1}^n ES_λ log( e^{−λf̂_λ(X_i)} / M_{f̂_λ}(λ) ).

The first term on the right increases if e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) is replaced by
E_i S_λ/S_λ, by (2.15.16) and the fact that the function x ↦ x − 1 − ν log x is
decreasing for x ≤ ν. The sum in the second term on the right is equal to
ES_λ log S_λ. It follows that the right side is bounded above by

    Σ_{i=1}^n ES_λ( E_i S_λ/S_λ − 1 − ν(λ) log(E_i S_λ/S_λ) ) − ν(λ) ES_λ log S_λ.

This can be seen to be identical to the right side of (2.15.17).


The convexity and homogeneity of entropy (Problem 2.15.3(iv),(iii))
yield

    Ent_i(S_{i,λ}) ≤ Σ_f P_i(f̂_λ = f) ( e^{−λ Σ_{j≠i} f(X_j)} / M_f(λ)^n ) Ent_i( e^{−λf(X_i)} ).

Here Ent_i(e^{−λf(X_i)}) = M_f(λ)( λL′_f(λ) − L_f(λ) ) and the order of summation
and expectation can be exchanged to rewrite the right side as

    E_i ( e^{−λ Σ_{j≠i} f̂_λ(X_j)} / M_{f̂_λ}(λ)^{n−1} )( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) )
      = E_i S_λ e^{λf̂_λ(X_i)} M_{f̂_λ}(λ)( λL′_{f̂_λ}(λ) − L_{f̂_λ}(λ) ).

We bound e^{λf̂_λ(X_i)} M_{f̂_λ}(λ) by ν(λ) using (2.15.16), next take the
expectation over the other variables and sum over i to arrive at (2.15.18).

Problems and Complements


1. Let T : X × Y → R be an arbitrary map defined on the product probability
space (X × Y, A × B, P × Q) such that |T (x, y) − T (x, z)| ≤ c for a constant c
and every x ∈ X and y, z ∈ Y. Then the measurable cover T ∗ : X × Y → R for
P × Q satisfies |T ∗ (x, y) − T ∗ (x, z)| ≤ c for P × Q × Q almost every (x, y, z).
[Hint: By perfectness of a coordinate projection T ∗ is also a version of the
measurable cover for T viewed as a map T : X × Y × Y → R relative to
P × Q × Q. Furthermore, the measurable covers of T1 : (x, y, z) 7→ T (x, y) and
T2 : (x, y, z) 7→ T (x, z) relative to P × Q × Q satisfy |T1∗ − T2∗ | ≤ |T1 − T2 |∗
almost surely.]

2. If X is a random variable taking values in the interval [a, b] ⊂ R with
EX = 0, then Ee^{λX} ≤ exp( λ²(b − a)²/8 ), for all λ > 0.
[Hint: Convexity of the function x ↦ e^{λx} gives, for any x ∈ [a, b],

    e^{λx} ≤ ((x − a)/(b − a)) e^{λb} + ((b − x)/(b − a)) e^{λa}.

This implies that Ee^{λX} ≤ e^{φ(λ(b−a))} for φ defined by φ(x) = −px + log(1 − p +
pe^x) and p = −a/(b − a). Then note that φ(0) = 0 = φ′(0) while φ″(x) ≤ 1/4
for all p, x.]
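A quick numerical illustration of this bound (Hoeffding's lemma), not part of the problem itself: for a two-point variable with mean zero the left side can be computed exactly and compared with exp(λ²(b − a)²/8); the endpoints a = −1, b = 2 are arbitrary:

```python
import math

# two-point distribution on {a, b} with mean zero: P(X = b) = p, P(X = a) = 1 − p
a, b = -1.0, 2.0
p = -a / (b - a)                 # p = 1/3 makes EX = 0
assert abs(p * b + (1 - p) * a) < 1e-12

def mgf(lam):
    # E e^{λX} for the two-point law
    return p * math.exp(lam * b) + (1 - p) * math.exp(lam * a)

def hoeffding(lam):
    # exp(λ²(b − a)²/8)
    return math.exp(lam ** 2 * (b - a) ** 2 / 8)

for k in range(1, 301):
    lam = k / 100
    assert mgf(lam) <= hoeffding(lam) + 1e-12
```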

3. (Entropy) The entropy of a nonnegative variable Y is defined by

    Ent(Y) = EΦ(Y) − Φ(EY),

where Φ(y) = y log y (defined by continuity at y = 0). It satisfies:
(i) Ent(Y) = sup{EYU: Ee^U = 1}, where the supremum is attained at
U = log Y − log EY.
(i′) Ent(Y) = sup{EY log(V/EV): V > 0}, where the supremum is attained
at V = cY.
(ii) Ent(Y) = inf_{r>0} E( Φ(Y) − (1 + log r)Y + r ) = inf_{r>0} { r Eh(Y/r) } with
h(y) = y(log y − 1) + 1, where the infimum is assumed at r = EY.
(iii) Ent(γY) = γ Ent(Y), for any γ ≥ 0.
(iv) Y ↦ Ent(Y) is convex.
(v) If X ≤ 1, then Ent(e^{λX}) ≤ EX² e^λ(e^{−λ} − 1 + λ), for every λ > 0.
(vi) Ent(Y) ≤ Ent(Z) + E(Y − Z) log(Y/EY), for every positive random
variable Z.
[Hint: For (i) use that sup_{y>0}(yu − Φ(y)) = e^{u−1} for u ∈ R. For (ii) use that
the function g: (0, ∞) → R defined by g(r) = E( Φ(Y) − (1 + log r)Y + r )
is convex, with g(EY) = Ent(Y). For (v) use (ii) with r = 1 and Y =
e^{λX} to see that Ent(e^{λX}) ≤ Eh(e^{λX}) ≤ EX² h(e^λ), by the monotonicity of
the function λ ↦ λ^{−2} h(e^λ). Inequality (vi) follows by writing Ent(Y) as
EZ log(Y/EY) + E(Y − Z) log(Y/EY) and next applying (i′) to see that the
first term is smaller than Ent(Z).]
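Part (ii) can be illustrated numerically for a simple discrete Y (an editorial sketch; the two-point law below is arbitrary): the infimum over r is attained at r = EY, where r·Eh(Y/r) equals Ent(Y):

```python
import math

# discrete nonnegative Y with P(Y = 1) = P(Y = 3) = 1/2
vals, probs = [1.0, 3.0], [0.5, 0.5]

def E(func):
    return sum(p * func(v) for v, p in zip(vals, probs))

Phi = lambda y: y * math.log(y) if y > 0 else 0.0   # Φ(y) = y log y
h = lambda y: y * (math.log(y) - 1) + 1             # h(y) = y(log y − 1) + 1

EY = E(lambda y: y)
ent = E(Phi) - Phi(EY)          # Ent(Y) = EΦ(Y) − Φ(EY)

def g(r):
    # r·E h(Y/r), the objective in (ii)
    return r * E(lambda y: h(y / r))

# the infimum over r > 0 is attained at r = EY and equals Ent(Y)
assert abs(g(EY) - ent) < 1e-12
assert all(g(r / 10) >= ent - 1e-12 for r in range(1, 100))
```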

4. (Tensorization of entropy) If T is a nonnegative measurable transformation
of independent variables X_1, . . . , X_n, then Ent(T) ≤ Σ_{i=1}^n E Ent_i(T), for
Ent_i(T) the entropy of T considered as a function of X_i only with the other
variables (X_j: j ≠ i) fixed, i.e. Ent_i(T) = E( Φ(T) | X_j: j ≠ i ) −
Φ( E(T | X_j: j ≠ i) ).
[Hint: Let E_I T denote the expectation of T relative to (X_i: i ∈ I) given fixed
values of the other variables (X_i: i ∉ I), with E_∅ T = T. Then

    T log T − T log ET = Σ_{i=1}^n T( log E_{1,...,i−1} T − log E_{1,...,i} T ).

Problem 2.15.3(i′) applied to T viewed as a function of X_i with (X_j: j ≠ i)
fixed and V = E_{1,...,i−1} T for fixed (X_j: j > i) gives that
E_i T( log E_{1,...,i−1} T − log E_{1,...,i} T ) ≤ Ent_i(T). Inserting this in the
preceding equation and taking the expectation yields the result.]
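The tensorization inequality can be verified exactly on a toy example with two independent fair bits, enumerating the four atoms (an illustration with an arbitrary positive T, not a proof):

```python
import math

Phi = lambda y: y * math.log(y) if y > 0 else 0.0   # Φ(y) = y log y

# T(x1, x2) for independent fair bits x1, x2 ∈ {0, 1}; any positive T works
T = lambda x1, x2: 1.0 + 3.0 * x1 * x2

atoms = [(x1, x2) for x1 in (0, 1) for x2 in (0, 1)]   # each has probability 1/4

ET = sum(T(*a) for a in atoms) / 4
ent = sum(Phi(T(*a)) for a in atoms) / 4 - Phi(ET)     # Ent(T)

def ent_i(i):
    # E Ent_i(T): entropy in x_i with the other bit fixed, averaged over it
    total = 0.0
    for fixed in (0, 1):
        vals = [T(x, fixed) if i == 1 else T(fixed, x) for x in (0, 1)]
        m = sum(vals) / 2
        total += 0.5 * (sum(Phi(v) for v in vals) / 2 - Phi(m))
    return total

assert ent <= ent_i(1) + ent_i(2) + 1e-12    # Ent(T) ≤ Σ E Ent_i(T)
assert ent > 0
```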

5. The function t ↦ e^{λt}(e^{−λt} − 1 + λt)/(e^{λt} − 1 − λt + λαt²) is nondecreasing
for t ≤ 1, for all λ ≥ 0 and all α ∈ [1/2, 1].
[Hint: See Lemma A.3, page 30, of ??]
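A grid check of the claimed monotonicity, for a few sample values of λ and α (illustration only, not a proof):

```python
import math

def q(t, lam, alpha):
    # e^{λt}(e^{−λt} − 1 + λt) / (e^{λt} − 1 − λt + λαt²)
    num = math.exp(lam * t) * (math.expm1(-lam * t) + lam * t)
    den = math.expm1(lam * t) - lam * t + lam * alpha * t * t
    return num / den

for lam in (0.5, 1.0, 2.0):
    for alpha in (0.5, 0.75, 1.0):
        grid = [k / 100 for k in range(1, 101)]       # t ∈ (0, 1]
        vals = [q(t, lam, alpha) for t in grid]
        # successive values never decrease (up to float jitter)
        assert all(v2 >= v1 - 1e-9 for v1, v2 in zip(vals, vals[1:]))
```

`math.expm1` is used to avoid cancellation in e^{−λt} − 1 for small t.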

6. (Differential inequality) Let L, L₀: [0, 1] → R be continuously
differentiable functions that satisfy the inequality fL′ − gL ≤ fL₀′ − gL₀, for
continuous, nonnegative functions f, g: [0, 1] → R such that g/f is integrable.
If L(0) = L₀(0), then L ≤ L₀ on [0, 1].
[Hint: For A(λ) = exp( −∫₀^λ (g/f)(s) ds ) we obtain that (AL)′ = A(L′ −
(g/f)L) ≤ (AL₀)′.]
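The comparison can be illustrated numerically (all concrete choices here are illustrative): with f ≡ 1 and g(s) = s, the function L₀(λ) = e^{λ²/2} solves the corresponding equality, and L with L′ = sL − 1 satisfies fL′ − gL = −1 ≤ 0 and L(0) = L₀(0) = 1, so L must stay below L₀:

```python
import math

# forward Euler for L' = sL − 1 on [0, 1], starting from L(0) = 1
n = 100_000
ds = 1.0 / n
L, s = 1.0, 0.0
for _ in range(n):
    L += ds * (s * L - 1.0)
    s += ds
    # the comparison principle: L(s) ≤ L0(s) = exp(s²/2)
    assert L <= math.exp(s * s / 2) + 1e-6
```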

7. Let α(λ) = (1 − γ(λ))/λ for γ = ν log ν and ν(λ) = (e^{2λ} + 1)/2, and
let Λ, β: [0, 1] → R be functions such that Λ is absolutely continuous with
Λ′ − αΛ ≤ β. Then

    Λ(λ) ≤ μλe^{−I(λ)} + λe^{−I(λ)} ∫₀^λ s^{−1} e^{I(s)} β(s) ds,

for μ = lim sup_{ε↓0} Λ(ε)/ε and I(λ) = ∫₀^λ γ(s)/s ds.
[Hint: The function I is well defined (as γ(s) = O(s) as s → 0), but the
“integrating factor” exp( −∫₀^λ α(s) ds ) for the linear differential equation is
not. This motivates consideration of the functions first on the interval [ε, 1]
for ε ∈ (0, 1). The corresponding integrating factor is A_ε(λ) =
exp( −∫_ε^λ α(s) ds ) = (ε/λ) e^{I(λ)−I(ε)}. From the inequality (A_εΛ)′ =
A_ε(Λ′ − αΛ) ≤ A_εβ we obtain that (A_εΛ)(λ) ≤ (ΛA_ε)(ε) + ∫_ε^λ (A_εβ)(s) ds.
Dividing by A_ε(λ) and taking the limit as ε ↓ 0 gives the desired result.]

8. The functions I, J: [0, ∞) → R defined by

    I(λ) = ∫₀^λ s^{−1} ½(e^{2s} + 1) log( ½(e^{2s} + 1) ) ds,

    J(λ) = ½ ∫₀^λ s^{−2} e^{I(s)} (e^{2s} − 1) e^s (e^{−s} − 1 + s) ds,

satisfy
(i) λe^{−I(λ)} J(λ) + e^λ − 1 − λ ≤ (e^{3λ} − 1 − 3λ)/9;
(ii) λ(1 − e^{−I(λ)}) ≤ 2(e^{3λ} − 1 − 3λ)/9.
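These two inequalities, which calibrate the constants in the proof of Theorem 2.15.11, can be checked by numerical quadrature (an editorial sketch; the trapezoidal grids and the λ-range (0, 2] are ad hoc):

```python
import math

def integrand_I(s):
    # s^{−1} ν(s) log ν(s), with ν(s) = (e^{2s} + 1)/2; tends to 1 as s → 0
    nu = (math.exp(2 * s) + 1) / 2
    return nu * math.log(nu) / s

def trapezoid(fn, a, b, n):
    step = (b - a) / n
    total = 0.5 * (fn(a) + fn(b))
    for k in range(1, n):
        total += fn(a + k * step)
    return total * step

def I(lam):
    return trapezoid(integrand_I, 1e-8, lam, 300)

def J(lam):
    fn = lambda s: (math.exp(I(s)) * (math.exp(2 * s) - 1) * math.exp(s)
                    * (math.expm1(-s) + s) / (s * s))
    return 0.5 * trapezoid(fn, 1e-8, lam, 150)

for k in range(1, 21):
    lam = k / 10
    cubic = (math.exp(3 * lam) - 1 - 3 * lam) / 9
    # (i):  λ e^{−I(λ)} J(λ) + e^λ − 1 − λ ≤ (e^{3λ} − 1 − 3λ)/9
    assert lam * math.exp(-I(lam)) * J(lam) + math.exp(lam) - 1 - lam <= cubic * 1.001
    # (ii): λ(1 − e^{−I(λ)}) ≤ 2(e^{3λ} − 1 − 3λ)/9
    assert lam * (1 - math.exp(-I(lam))) <= 2 * cubic * 1.001
```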
9. Any random variable X with EX = 0 and |X| ≤ 1 satisfies Ee^{λX} ≤ cosh λ,
for λ ≥ 0.
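The extreme case is X = ±1 with probability 1/2 each, for which Ee^{λX} = cosh λ exactly; any other mean-zero X supported on [−1, 1] gives something smaller. A quick check with an arbitrary asymmetric example:

```python
import math

# symmetric Bernoulli on {−1, +1}: equality E e^{λX} = cosh λ
mgf_pm1 = lambda lam: 0.5 * math.exp(lam) + 0.5 * math.exp(-lam)

# an asymmetric mean-zero example on [−1, 1]: X = 1 w.p. 1/3, X = −1/2 w.p. 2/3
mgf_asym = lambda lam: math.exp(lam) / 3 + 2 * math.exp(-lam / 2) / 3
assert abs(1 / 3 - (1 / 2) * (2 / 3)) < 1e-12        # mean zero

for k in range(0, 301):
    lam = k / 100
    assert abs(mgf_pm1(lam) - math.cosh(lam)) < 1e-12
    assert mgf_asym(lam) <= math.cosh(lam) + 1e-12
```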
2
Notes

2.2. The maximal inequality Theorem 2.2.4 uses the special properties of
the Orlicz norm only through the application of Lemma 2.2.2. Hence an
identical result can be proved using other norms provided the analogue of
the lemma is available. For instance, Ledoux and Talagrand (1991) show
that the lemma holds for weak L_p-norms with ψ taken equal to x^p. Second,
it may be remarked that often the full strength of Theorem 2.2.4 is not
needed and the weakened version of this theorem, where the Orlicz norm
on the left is replaced by E sups,t |Xs − Xt |, suffices. This weakened version
can be proved along exactly the same lines, provided a similarly weakened
version of Lemma 2.2.2 is available. The latter is contained in Problem 2.2.8
for any Orlicz norm.
The method of chaining to prove continuity or obtain bounds on
suprema of stochastic processes is due to “Kolmogorov and his school”,
and goes back to the 1930s and 40s. The combination of chaining with “en-
tropy” in the study of the continuity of Gaussian processes goes back to
Dudley (1967b) and Sudakov (1969). Dudley (1967b) credits V. Strassen
with the idea and basic results, but see the review of Sudakov (1976) by
Dudley (1978b). Strassen and Dudley (1969) apparently made the first use
of entropy in connection with empirical processes. Maximal inequalities for-
mulated as entropy bounds for general Young functions were obtained in
Kono (1980) and Pisier (1983).
Upper bounds for suprema of Gaussian processes in terms of majoriz-
ing measures were obtained by Fernique in Fernique (1974), and shown
to be sharp in Talagrand (1987c). After giving simplified proofs in Talagrand
(1992) and Talagrand (1996b), Talagrand (2001) finally introduced


under the title “majorizing measures without measures” generic chaining
as a more fundamental approach, and developed it further in the books
Talagrand (2005, 2014). Upper bounds in terms of majorizing measures
for non-Gaussian processes using general Young functions were obtained in
Bednorz (2006), after initial work by Talagrand (1990) and Kono (1980).
Bednorz also extended the method of generic chaining to the setting of gen-
eral Young functions. Theorems 2.2.13 and 2.2.15 and Proposition 2.2.16 were
obtained in Bednorz (2008, 2010, 2015). Among the rare applications of
generic chaining in statistics is the paper Van de Geer (2013).
The original papers by Bernstein are apparently unavailable. Bennett
(1962) discusses the history of Bernstein’s inequality, provides proofs, and
compares Bernstein’s inequality to a number of other inequalities.

2.3. The method of symmetrization goes back to Kahane (1968) and


Hoffmann-Jørgensen (1973, 1976). The construction of separable versions
using liftings is adapted from Talagrand (1987a).
2.4. Generalizations of the classical Glivenko (1933) and Cantelli (1933)
theorem for sets under a bracketing condition were first obtained by Blum
(1955) and Dehardt (1971). Dudley (1984) gives Theorem 2.4.1 in its present
formulation. Vapnik and Červonenkis (1968, 1971, 1981) give generaliza-
tions with conditions in terms of the logarithm of the L1 -covering numbers.
The proof of the converse part of Theorem 2.4.3 is adapted from Giné
and Zinn (1984), pages 981–982. Pollard (1982) formulates the reverse sub-
martingale property of ‖Pn − P‖_F. Measurability difficulties regarding the
choice of filtration are pointed out by Strobl (1992). Theorem 2.4.7 is due
to Van Handel (2013), after initial work by Adams and Nobel (2010, 2012),
on extensions of the Glivenko-Cantelli property to general, not identically
distributed or independent, observations.
For more on Glivenko-Cantelli theorems, see Chapter 2.8 and Dudley,
Giné, and Zinn (1991).
2.5. The first uniform entropy central limit theorems were due to Pollard
(1982) and Kolčinskiĭ (1981), who, among other things, established the uni-
form central limit theorem for the empirical process indexed by a (suitably
measurable) VC-class of sets. Bracketing empirical central limit theorems
were obtained by Dudley (1978a, 1984), Ossiander (1987) and Andersen,
Giné, Ossiander, and Zinn (1988). Ossiander (1987) obtained the result un-
der the assumption that the L2 (P )-bracketing integral is finite. The result
presented in this section is more general than this, but less general than the
theorems obtained by Andersen et al. These authors relax finiteness of the
two integrals to continuous majorizing measure conditions on L2 (P )-balls
and L2,∞ (P )-brackets, respectively. See Chapter 2.11 for a statement.
The chaining argument in the proof of Theorem 2.5.6 may be refined
so that it is not necessary to construct a nested sequence of partitions.

See Pollard (1989a). The present organization and notation of the chaining
argument is borrowed from Arcones and Giné (1993), who chain by ex-
ponential bounds for probabilities. The use of means in combination with
Lemma 2.2.10 in the proof is new.

2.6. The study of VC-classes apparently originated in Vapnik and


Červonenkis (1971), who were motivated by problems in pattern recog-
nition. See also Vapnik and Červonenkis (1981) and Vapnik (1982). Sauer’s
lemma and its generalizations can be found in Sauer (1972) and Frankl
(1983). Theorem 2.6.4 is due to Haussler (1995), who improved an earlier
result of Dudley (1978a). Our rendering of the proof owes its origins to
conversations with David Pollard. Pollard (1984) showed how to obtain the
result about functions from the bound for sets.
The behavior of entropies under the formation of convex hulls as de-
scribed in Theorem 2.6.9 is an extension of results of Dudley (1987), who
obtained the bound up to an additional δ, and Ball and Pajor (1990), who
obtained the bound under the assumption that F is a sequence of func-
tions decreasing at the rate kfi kQ,2 ≤ i−1/V . The present formulation for
r = 2 and its induction proof were new in the first edition of this work.
The extensions to Lr -norms and to classes of polynomial entropy, as in
Theorem 2.6.13, proved here by direct generalization of the proof for r = 2,
were obtained in Carl et al. (1999) and Gao (2001). See Song and Wellner
(2002), Creutzig and Steinwart (2002), Mendelson (2002) and Gao (2004)
for different proofs, extensions and applications. Lemma 2.6.11 is given by
Pisier (1981), but is apparently due to B. Maurey.
Many results in Section 2.6.5 are due to Assouad (1981, 1983), Pol-
lard (1984), and Dudley (1984). Assouad (1981, 1983) gives explicit bounds
for the VC-dimension of many of the classes in Lemma 2.6.18 and formu-
lates a number of results concerning “dual density,” which we have not
included here (see especially Assouad (1983), pages 245–247). Bounds on
the VC-dimension of unions and intersections can be found in Van der
Vaart and Wellner (2009). Lemma 2.6.23 and Example 2.6.24 improve
upon an example of Dudley (1985), who shows that the class of functions
x 7→ tk−1 F (x)1{F (x) ≥ t} is VC-hull. We learned Exercise 2.6.21 from A.
Quiroz: see Quiroz, Nakamura, and Perez (1996). Lemma 2.6.14 was proved
by Giné and Koltchinskii (2006).
Fat-shattering numbers were introduced by Vapnik and Červonenkis.
The discussion and results in Section 2.6.6 summarizes the findings in
Mendelson and Vershynin (2003) and Rudelson and Vershynin (2006).
Notable among the results not included in this section are those of
Stengle and Yukich (1989) and Pisier (1984). Stengle and Yukich show how
new VC classes can be generated by “semialgebraic sets.” Pisier (1984) links
up VC theory with probability in Banach spaces by showing that a class
of sets is VC if and only if a certain operator is type-2. See Ledoux and
Talagrand (1991) for an exposition.

2.7. Bounds on entropies for classes of smooth functions are obtained


by Kolmogorov and Tikhomirov (1961), Lorentz (1966), and Birman and
Solomjak (1967). In particular, Theorem 2.7.1 on Hölder classes and Theo-
rem 2.7.14 are explicit in Kolmogorov and Tikhomirov (1961). The entropy
bound on Besov classes can be obtained from Birman and Solomjak (1967)
by interpolation. The proof given here is based on representation in wavelet
bases as developed by Birgé and Massart (2000), Cohen, Dahmen, and De-
Vore (2000) and Cohen, Dahmen, Daubechies, and DeVore (2001). The
representation by Besov bodies and the resulting entropy estimates used
here follows the replicant and universal coding schemes by Kerkyacharian
and Picard (2003a, 2003b).
Sets with smooth boundaries were considered by Sun and Pyke (1982)
and Dudley (1974). Dudley (1984), Chapter 7, and Dudley (1999), Chap-
ter 8, give useful introductions to this literature. We adapted the proof of
the bound for entropy with bracketing for monotone functions on R from
Van de Geer (1991), who adapts techniques from Birman and Solomjak
(1967). The results for convex sets are due to Bronštein (1976), who im-
proved on results of Dudley (1974). These bounds were first exploited in
the context of empirical processes by Bolthausen (1978). Theorem 2.7.12
was obtained by Gao and Wellner (2017), who improved on Dryanov (2009)
and Guntuboyina and Sen (2013) and Guntuboyina (2016), who considered
convex functions on intervals and rectangles. The restriction r < p can-
not be removed as for r ≥ p, the class of convex functions is not totally
bounded. Gao and Wellner (2017) also give a sharper bound for a general
convex set involving the accuracy of “simplicial approximation” of the set.
Van der Vaart (1994b) gives improvements of Corollary 2.7.3 and shows
that the classes of functions considered there are P-Glivenko-Cantelli if
and only if Σ_j M_j P(I_j) < ∞. He also shows that the bracketing integral
converges if the constants M_j P(I_j)^{1/2} are regularly varying at infinity and
Σ_j M_j P(I_j)^{1/2} < ∞. Giné and Zinn (1986b) first proved that the class
C_1^1(R) is Donsker under the latter condition. Van der Vaart (1993) gener-
alized their result to the classes of Corollary 2.7.3 in general. Bracketing
numbers on Gaussian mixtures were obtained by Ghosal and Van der Vaart
(2001) and Ghosal and Van der Vaart (2007), who also give slightly better
bounds on ordinary covering numbers.
For other work on entropy numbers, see Carl and Stephani (1990) and
Ball and Pajor (1990) and the references stated therein. These authors
study the entropy of the image of the unit ball under a (compact) operator
T between Banach spaces. More precisely, they study entropy numbers
denoted εn (T ) and en (T ), which as functions of n are roughly the inverses
of the covering numbers and entropy numbers as in the present manuscript.
For instance, en (T ) is defined as the smallest ε such that the image T (U )
can be covered by 2n−1 balls of radius ε.

2.8. Theorem 2.8.1 is due to Dudley, Giné, and Zinn (1991), who give a

complete treatment of the conditions under which a class of functions is


universal or uniform Glivenko-Cantelli.
The development of uniform in P Glivenko-Cantelli and Donsker the-
orems began with the work of Dvoretzky, Kiefer, and Wolfowitz (1956),
Kiefer and Wolfowitz (1959), and Kiefer (1961). The exponential bounds
developed in these papers yield uniform in P -Glivenko-Cantelli theorems
for the classical empirical distribution function of random variables in Rd .
Moreover, these authors essentially establish the uniform Donsker prop-
erty for the classical distribution function in the course of proving the
asymptotic minimaxity of the empirical distribution function as an es-
timator of the true distribution function. Uniformity in P is implicit in
the work of Vapnik and Červonenkis (1971) on the Glivenko-Cantelli the-
orems. The introduction of uniform entropy conditions by Pollard (1982)
and Kolčinskiĭ (1981), Theorem 5.10, page 4.11, establishes rates of con-
vergence for weak-approximation versions of Theorem 2.8.3 under growth
rate hypotheses on the uniform entropy and additional hypotheses on the
envelope function of the class. (It is plausible that his rates are in fact uni-
form over the collection P for which the moment hypotheses on F hold
uniformly.) Dudley (1987) unifies and extends results for the “universal
Donsker property,” which entail that F is Donsker for all probability mea-
sures P on the sample space. A universal Donsker class F is essentially
bounded: sup_f diam f < ∞ (where diam f = sup_{x,y} (f(x) − f(y))). For classes
with this property, Giné and Zinn (1991) use Gaussian comparison methods
to characterize classes that are uniformly Donsker in all possible underlying
measures.
Theorem 2.8.2 is contained in Sheehy and Wellner (1992) along with
other equivalences; the present proof is new.

2.9. The multiplier central limit theorem, Theorem 2.9.2, is implicit in


Giné and Zinn (1984) and is stated explicitly in Giné and Zinn (1986a),
who attribute the multiplier inequality to Pisier and Fernique. Alexan-
der (1985), solving a problem posed by Hoffmann-Jørgensen, shows that
no “universal multiplier moment” exists: there is no function ψ: R2 →
R so that ξZ satisfies the central limit theorem whenever EξZ = 0,
Eψ(|ξ|, ‖Z‖) < ∞ and Z satisfies the central limit theorem, for inde-
pendent real- and Banach-space-valued random elements ξ and Z. On the
other hand, Ledoux and Talagrand (1986) show that the L_{2,1}-hypothesis
on the multipliers ξ_i cannot be relaxed in the sense that, for every ξ
with ‖ξ‖_{2,1} = ∞, there exists a Banach-space-valued Z that satisfies the
central limit theorem, but ξZ does not satisfy the central limit theorem.
Ledoux and Talagrand (1986) and Ledoux and Talagrand (1991), Proposi-
tion 10.4 on page 279, give a different proof of the basic multiplier inequal-
ity.
The almost sure conditional multiplier central limit theorem is due to
Ledoux and Talagrand (1988), with contributions by J. Zinn. The present

proof, based on isoperimetric methods, is adapted from Ledoux and Ta-


lagrand (1991), Theorem 10.14, on page 293. The proof using martingale
difference methods originating in Yurinskii (1974), given in the “Problems
and Complements” section, is taken from Ledoux and Talagrand (1988).

2.10. Theorem 2.10.1 is due to Van der Vaart and Wellner (2000).
Alexander (1987c) gives Theorem 2.10.3 and derives that F +G and F ∪
G are Donsker if F and G are Donsker from his more general Proposition 2.6.
Theorem 2.10.4 and Theorem 2.10.5 are due to Dudley (1985).
Versions of Theorem 2.10.8 are established by Giné and Zinn (1986a)
(one-dimensional φ under measurability conditions) and Talagrand (1987a)
(uniformly bounded classes). Lemma 2.10.16 is established by Giné and
Zinn (1986a); the proof given here is new. Section 2.10.5 is based on Van
der Vaart (1993) and Van der Vaart and Wellner (2000). The result of Ex-
ample 2.10.28 specialized to the class C_1^1(R) was first obtained by Giné
and Zinn (1986b) by a different method. Pollard (1990), Section 5, gives a
variety of results similar to those in Section 2.10.4.

2.11. The limit theory in this section has its antecedents in Koul (1970),
Shorack (1973, 1979), and Van Zuijlen (1978). Also see Shorack and Beir-
lant (1986), Marcus and Zinn (1984), and Shorack and Wellner (1986),
Section 3.3, pages 108–109.
The first subsection, based on random and uniform-entropy, follows
Alexander (1987b), but with a simplified proof for the main Theorem 2.11.1.
See Alexander’s paper for a discussion of the minimal moment conditions
on the envelope within the context of this theorem. The improvement over
Theorem 2.5.2 in the i.i.d. case was already obtained by Giné and Zinn
(1984). Pollard (1990) develops a combinatorial theory for the non-i.i.d.
case paralleling the VC-theory of Chapter 2.6.
The bracketing central limit theorem, Theorem 2.11.11, and its corol-
lary are due to Andersen, Giné, Ossiander, and Zinn (1988), building on
Ossiander (1987) and work on majorizing measures and Gaussian processes
due to Fernique (1974) and Talagrand (1987c). Theorem 2.11.9 does not
seem to be a corollary of their result, but appears to be the natural gen-
eralization of Ossiander’s result to the case of non-i.i.d. observations. The
proof of Theorem 2.11.11 shows that the theorem remains true if the sin-
gle majorizing measure, which is guaranteed to exist by the assumption of
Gaussian domination, is replaced by certain sequences of majorizing mea-
sures (depending on n). The proof given here is greatly simplified in com-
parison to the original proof.
Jain and Marcus proved their theorem, as given here, in Jain and
Marcus (1975). Example 2.11.14 is taken from Giné and Zinn (1986a). Dud-
ley (1985), Section 6, gives the first satisfactory integration of the famous
“Chibisov-O’Reilly theorem” treated in Example 2.11.15 with modern em-
pirical process theory.

2.12. Partial-sum processes have a long and honorable history in proba-


bility theory and statistics, with rapid development of the classical theory
occurring in the late 1940s and early 1950s. The classical work of Erdös
and Kac (1946), Doob (1949), and Donsker (1951, 1952) was beautifully
summarized by Billingsley (1968).
Study of the partial-sum or sequential empirical process seems to have
begun with the work of Müller (1968) on the distributional invariance prin-
ciple for the Glivenko-Cantelli theorem, Pyke (1968) on random-sample-
size limit theorems for empirical processes, and Kiefer (1969) on embed-
ding questions. The Hungarian school’s work on embedding theorems in
the 1970s was followed by embedding theorems (explicit construction) for
general empirical processes in Dudley and Philipp (1983). The present form
seems to have been first stated by Sheehy and Wellner (1992), along with
other equivalences of Dudley and Philipp (1983) and Dudley (1984). The
present proof is analogous to the proof by Billingsley (1968) of a classical
partial-sum convergence theorem: Theorem 10.1, pages 68–70.
The subsection on partial-sum processes on lattices follows Alexander
and Pyke (1986), Bass and Pyke (1985), and Pyke (1983).

2.13. Theorem 2.13.1 is taken from Dudley (1984), who also shows that,
in a certain sense, the condition that the sequence converges is sharp. Giné
and Zinn (1986a) give further results for sequences of functions bounded in
uniform norm.
Dudley (1987) notes that the square of the supremum of the empirical
process over an elliptical class F equals a weighted sum of the squares of
the empirical process acting on the basis functions defining the elliptical
class, and he points out the practical importance of such classes.
The equivalence of (iii) and (iv) in Theorem 2.13.6 is due to Giné and
Zinn (1984); the equivalence of (i), (ii), and (iii) was proved by Talagrand
(1988).

2.14. Bounds for expectations of suprema of Gaussian processes were


developed by Dudley (1967b). In the case of empirical processes, Pollard
(1989b) was apparently the first to recognize the usefulness of moment
bounds for suprema of the form given in Theorems 2.14.1 and 2.14.15; also
see Pollard (1990) and Kim and Pollard (1990). Ledoux and Talagrand
(1991) systematically developed the use of Orlicz norms; see their Notation
and Chapters 6 and 11. Theorem 2.14.21 is a straightforward consequence
of the general domination of Orlicz norms of sums of independent random
elements by L_1-norms plus the Orlicz norm of the maximal summand proved
by Talagrand (1989); see Ledoux and Talagrand (1991), Chapter 6.
Maximal inequalities that include the size of the functions in the
bound are especially useful when applied to the modulus of continuity
to obtain a rate of convergence of an M-estimator, as in Chapter 3.4. Theo-
rems 2.14.16 and 2.14.17 were obtained in Chapter 3.4 of the first edition of

this book. Theorems 2.14.2, 2.14.7, and 2.14.8 were obtained more recently
in Van der Vaart and Wellner (2011).
Dvoretzky, Kiefer, and Wolfowitz (1956) proved the first exponential
bound for the supremum distance between the empirical distribution func-
tion and the true distribution function in the case of X = R. Massart (1990)
found the best constant D multiplying the exponential for this classical case.
Related bounds for the more difficult case X = Rk were established by
Kiefer (1961). Vapnik and Červonenkis (1971) proved exponential bounds
of the same form as those given in Theorem 2.14.29, but with no powers of t
in front and the constant D depending on (and increasing with) n. Alexan-
der (1984) gives bounds of the same type as those in Theorems 2.14.25,
2.14.29, and 2.14.33 and also has sharper bounds than the bound given in
2.14.33.
Theorems 2.14.25, 2.14.29, and 2.14.30 are due to Talagrand (1994),
while Theorems 2.14.26, 2.14.33, and 2.14.34 are from Massart (1986). For
still more bounds, see Smith and Dudley (1992) and Adler and Brown
(1986).

2.15. The bounded difference inequality is due to McDiarmid (1989) fol-


lowing the method set out by Maurey (1979). Theorem 2.15.5 of Bennett-
Bernstein type has its roots in Talagrand (1994), with further develop-
ments in Talagrand (1996), Ledoux (1996), Massart (2000b), Rio (2001)
and Rio (2002). While the first edition of the present book presented the
early results, the present chapter is based on the definitive formulation,
with optimal constants, proved in Bousquet (2002) and Bousquet (2003).
The bounds on the lower tail were obtained in Klein (2002) and Klein and
Rio (2005).
