Density Estimation
10/36-702
1 Introduction
Density estimation is used for regression, classification, clustering and unsupervised prediction. For example, if $\hat p(x, y)$ is an estimate of $p(x, y)$ then we get the following estimate of the regression function:
$$\hat m(x) = \int y\, \hat p(y|x)\, dy$$
where $\hat p(y|x) = \hat p(y, x)/\hat p(x)$. For classification, recall that the Bayes rule is
Example 1 (Bart Simpson) The top left plot in Figure 1 shows the density
$$p(x) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{10}\sum_{j=0}^{4}\phi(x; (j/2) - 1, 1/10) \tag{1}$$
where φ( x; µ, σ) denotes a Normal density with mean µ and standard deviation σ. Marron
and Wand (1992) call this density “the claw” although we will call it the Bart Simpson den-
sity. Based on 1,000 draws from p, we computed a kernel density estimator, described later.
The estimator depends on a tuning parameter called the bandwidth. The top right plot is
based on a small bandwidth h which leads to undersmoothing. The bottom right plot is
based on a large bandwidth h which leads to oversmoothing. The bottom left plot is based
on a bandwidth h which was chosen to minimize estimated risk. This leads to a much more
reasonable density estimate.
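The density in (1) is easy to evaluate and sample from, which makes it convenient for experiments like the one in Example 1. The following is a minimal sketch, assuming NumPy; the function names `claw_density` and `sample_claw` are illustrative and not from the notes.

```python
import numpy as np

def claw_density(x):
    """Evaluate density (1): 0.5*N(0,1) plus five narrow bumps of weight 0.1."""
    x = np.asarray(x, dtype=float)

    def phi(x, mu, sigma):
        # Normal density with mean mu and standard deviation sigma
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    dens = 0.5 * phi(x, 0.0, 1.0)
    for j in range(5):
        dens = dens + 0.1 * phi(x, j / 2 - 1, 0.1)
    return dens

def sample_claw(n, rng):
    """Draw n points: choose a mixture component by weight, then sample from it."""
    means = np.array([0.0, -1.0, -0.5, 0.0, 0.5, 1.0])
    sds = np.array([1.0, 0.1, 0.1, 0.1, 0.1, 0.1])
    weights = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
    comp = rng.choice(6, size=n, p=weights)
    return rng.normal(means[comp], sds[comp])

draws = sample_claw(1000, np.random.default_rng(0))
```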
[Figure 1 panels: True Density (top left), Undersmoothed (top right), Just Right (bottom left), Oversmoothed (bottom right); horizontal axes run from −3 to 3 and vertical axes from 0 to 1.]
Figure 1: The Bart Simpson density from Example 1. Top left: true density. The other plots
are kernel estimators based on n = 1,000 draws. Bottom left: bandwidth h = 0.05 chosen by
leave-one-out cross-validation. Top right: bandwidth h/10. Bottom right: bandwidth 10 h.
2 Loss Functions
Devroye and Györfi (1985) make a strong case for using the $L_1$ norm
$$\|\hat p - p\|_1 \equiv \int |\hat p(x) - p(x)|\, dx$$
as the loss instead of $L_2$. The $L_1$ loss has the following nice interpretation. If $P$ and $Q$ are distributions define the total variation metric
$$d_{TV}(P, Q) = \sup_A |P(A) - Q(A)|$$
where the supremum is over all measurable sets. Now if $P$ and $Q$ have densities $p$ and $q$ then
$$d_{TV}(P, Q) = \frac{1}{2}\int |p - q| = \frac{1}{2}\|p - q\|_1.$$
Thus, if $\int |p - q| < \delta$ then we know that $|P(A) - Q(A)| < \delta/2$ for all $A$. Also, the $L_1$ norm is
3 Histograms
Perhaps the simplest density estimators are histograms. For convenience, assume that the data $X_1, \dots, X_n$ are contained in the unit cube $\mathcal{X} = [0, 1]^d$ (although this assumption is not crucial). Divide $\mathcal{X}$ into bins, or sub-cubes, of size $h$. We discuss methods for choosing $h$ later. There are $N \approx (1/h)^d$ such bins and each has volume $h^d$. Denote the bins by $B_1, \dots, B_N$. The histogram density estimator is
$$\hat p_h(x) = \sum_{j=1}^{N} \frac{\hat\theta_j}{h^d} I(x \in B_j) \tag{2}$$
where
$$\hat\theta_j = \frac{1}{n}\sum_{i=1}^{n} I(X_i \in B_j)$$
is the fraction of data points in bin $B_j$. Now we bound the bias and variance of $\hat p_h$. We will assume that $p \in \mathcal{P}(L)$ where
$$\mathcal{P}(L) = \Big\{ p : |p(x) - p(y)| \le L\|x - y\|, \text{ for all } x, y \Big\}. \tag{3}$$
Now we bound the variance. Since $p$ is Lipschitz on a compact set, it is bounded. Hence, $\theta_j = \int_{B_j} p(u)\, du \le C \int_{B_j} du = C h^d$ for some $C$. Thus, the variance is
$$\mathrm{Var}(\hat p_h(x)) = \frac{1}{h^{2d}} \mathrm{Var}(\hat\theta_j) = \frac{\theta_j(1 - \theta_j)}{n h^{2d}} \le \frac{\theta_j}{n h^{2d}} \le \frac{C}{n h^d}.$$
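The histogram estimator (2) can be sketched in a few lines. This is a minimal illustration, assuming NumPy, data in $[0,1]^d$, and a number of bins per axis `m` so that $h = 1/m$; the name `histogram_density` is hypothetical.

```python
import numpy as np

def histogram_density(X, m):
    """Return a function x -> p_hat_h(x) for data X in [0,1]^d, with h = 1/m."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, d = X.shape
    h = 1.0 / m

    def bin_index(Z):
        # Map each coordinate to its bin; points equal to 1.0 go in the last bin.
        return np.minimum((Z * m).astype(int), m - 1)

    counts = np.zeros((m,) * d)
    np.add.at(counts, tuple(bin_index(X).T), 1)
    theta_hat = counts / n           # fraction of points in each bin

    def p_hat(x):
        x = np.asarray(x, dtype=float).reshape(-1, d)
        return theta_hat[tuple(bin_index(x).T)] / h ** d

    return p_hat

p_hat = histogram_density([0.05, 0.15, 0.25, 0.95], m=2)
values = p_hat(np.array([0.1, 0.9]))
```

With two bins, three of the four points fall in $[0, 0.5)$, so the estimate is $0.75/0.5$ on the left bin and $0.25/0.5$ on the right bin.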
Later, we will prove the following theorem which shows that this upper bound is tight.
Specifically:
4 Kernel Density Estimation
A one-dimensional smoothing kernel is any smooth function $K$ such that $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\sigma_K^2 \equiv \int x^2 K(x)\, dx > 0$. Smoothing kernels should not be confused with Mercer kernels which we discuss later. Some commonly used kernels are the following:
Boxcar: $K(x) = \frac{1}{2} I(x)$
Gaussian: $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Epanechnikov: $K(x) = \frac{3}{4}(1 - x^2) I(x)$
Tricube: $K(x) = \frac{70}{81}(1 - |x|^3)^3 I(x)$
where $I(x) = 1$ if $|x| \le 1$ and $I(x) = 0$ otherwise.
Figure 2: Examples of smoothing kernels: boxcar (top left), Gaussian (top right), Epanech-
nikov (bottom left), and tricube (bottom right).
Suppose that $X \in \mathbb{R}^d$. Given a kernel $K$ and a positive number $h$, called the bandwidth, the kernel density estimator is defined to be
$$\hat p(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d} K\left(\frac{\|x - X_i\|}{h}\right). \tag{7}$$
More generally, we define
$$\hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} K_H(x - X_i)$$
where $H$ is a positive definite bandwidth matrix and $K_H(x) = |H|^{-1/2} K(H^{-1/2} x)$. For simplicity, we will take $H = h^2 I$ and we get back the previous formula.
Figure 3: A kernel density estimator $\hat p$. At each point $x$, $\hat p(x)$ is the average of the kernels centered over the data points $X_i$. The data points are indicated by short vertical bars. The kernels are not drawn to scale.
Sometimes we write the estimator as $\hat p_h$ to emphasize the dependence on $h$. In the multivariate case the coordinates of $X_i$ should be standardized so that each has the same variance, since the norm $\|x - X_i\|$ treats all coordinates as if they are on the same scale.
The kernel estimator places a smoothed out lump of mass of size 1/ n over each data point
X i ; see Figure 3. The choice of kernel K is not crucial, but the choice of bandwidth h
is important. Small bandwidths give very rough estimates while larger bandwidths give
smoother estimates.
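As a concrete illustration of (7), here is a minimal sketch of the estimator with a Gaussian kernel, assuming NumPy. The radial kernel is normalized as $K(t) = (2\pi)^{-d/2} e^{-t^2/2}$ so that the estimate integrates to one in $d$ dimensions; the function name `kde` is illustrative.

```python
import numpy as np

def kde(x, X, h):
    """Kernel density estimate (7) at query points x, Gaussian kernel, bandwidth h."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, d = X.shape
    x = np.asarray(x, dtype=float).reshape(-1, d)
    # Pairwise distances ||x - X_i||, shape (number of queries, n)
    dist = np.linalg.norm(x[:, None, :] - X[None, :, :], axis=2)
    K = np.exp(-0.5 * (dist / h) ** 2) / (2 * np.pi) ** (d / 2)
    # Average of the kernels, each scaled by 1/h^d
    return K.mean(axis=1) / h ** d

grid = np.linspace(-5, 6, 11001)
estimate = kde(grid, [0.0, 1.0], h=0.5)
```

Rerunning with a smaller `h` gives a spikier curve and a larger `h` a flatter one, which is the undersmoothing and oversmoothing behavior described above.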
In this section we examine the accuracy of kernel density estimation. We will first need a
few definitions.
For example, if d = 1 and β = 2 this means that
The most common case is β = 2; roughly speaking, this means that the functions have bounded
second derivatives.
where
$$g_{x,\beta}(u) = \sum_{|s| \le \lfloor\beta\rfloor} \frac{(u - x)^s}{s!} D^s g(x). \tag{10}$$
In the common case of β = 2, this means that
Let $p_h(x) = E[\hat p_h(x)]$. The next lemma provides a bound on the bias $p_h(x) - p(x)$.
for some c.
Proof. We have
$$|p_h(x) - p(x)| = \left| \frac{1}{h^d}\int K(\|u - x\|/h)\, p(u)\, du - p(x) \right|$$
$$= \left| \int K(\|v\|)\big(p(x + hv) - p(x)\big)\, dv \right|$$
$$\le \left| \int K(\|v\|)\big(p(x + hv) - p_{x,\beta}(x + hv)\big)\, dv \right| + \left| \int K(\|v\|)\big(p_{x,\beta}(x + hv) - p(x)\big)\, dv \right|.$$
The first term is bounded by $L h^\beta \int K(s)|s|^\beta$ since $p \in \Sigma(\beta, L)$. The second term is 0 from the properties on $K$ since $p_{x,\beta}(x + hv) - p(x)$ is a polynomial of degree $\lfloor\beta\rfloor$ (with no constant term).
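Lemma 4 The variance of $\hat p_h$ satisfies:
$$\sup_{p \in \Sigma(\beta, L)} \mathrm{Var}(\hat p_h(x)) \le \frac{c}{n h^d} \tag{12}$$
Proof. We can write $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where $Z_i = \frac{1}{h^d} K\left(\frac{\|x - X_i\|}{h}\right)$. Then,
$$\mathrm{Var}(Z_i) \le E(Z_i^2) = \frac{1}{h^{2d}} \int K^2\left(\frac{\|x - u\|}{h}\right) p(u)\, du = \frac{h^d}{h^{2d}} \int K^2(\|v\|)\, p(x + hv)\, dv$$
$$\le \frac{\sup_x p(x)}{h^d} \int K^2(\|v\|)\, dv \le \frac{c}{h^d}$$
for some $c$ since the densities in $\Sigma(\beta, L)$ are uniformly bounded. The result follows.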
Since the mean squared error is equal to the variance plus the bias squared we have:
According to the next theorem, there does not exist an estimator that converges faster than $O(n^{-2\beta/(2\beta+d)})$. We state the result for integrated $L_2$ loss although similar results hold for other loss functions and other function spaces. We will prove this later in the course.
Theorem 6 together with (13) imply that kernel estimators are rate minimax.
4.3 Boundary Bias
We have ignored what happens near the boundary of the sample space. If $x$ is $O(h)$ close to the boundary, the bias is $O(h)$ instead of $O(h^2)$. There are a variety of fixes including: data reflection, transformations, boundary kernels, local likelihood.
$$Z_n(x) \equiv \frac{\hat p_h(x) - p_h(x)}{s_n(x)} \rightsquigarrow N(0, \tau^2(x))$$
for some $\tau(x)$. This is true even if $h = h_n$ is decreasing. Specifically, suppose that $h_n \to 0$ and $n h_n \to \infty$. Note that $Z_n(x) = \sum_{i=1}^{n} L_{ni}$, say. According to Lyapounov's CLT, $\sum_{i=1}^{n} L_{ni} \rightsquigarrow N(0, 1)$ as long as
$$\lim_{n \to \infty} \sum_{i=1}^{n} E|L_{n,i}|^{2+\delta} = 0$$
for some $\delta > 0$. But this does not yield a confidence interval for $p(x)$. To see why, let us write
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} = \frac{\hat p_h(x) - p_h(x)}{s_n(x)} + \frac{p_h(x) - p(x)}{s_n(x)} = Z_n(x) + \frac{\text{bias}}{\sqrt{\text{var}(x)}}.$$
Assuming that we optimize the risk by balancing the bias and the variance, the second term is some constant $c$. So
$$\frac{\hat p_h(x) - p(x)}{s_n(x)} \rightsquigarrow N(c, \tau^2(x)).$$
This means that the usual confidence interval $\hat p_h(x) \pm z_{\alpha/2}\, s(x)$ will not cover $p(x)$ with probability tending to $1 - \alpha$. One fix for this is to undersmooth the estimator. (We sacrifice risk for coverage.) An easier approach is just to interpret $\hat p_h(x) \pm z_{\alpha/2}\, s(x)$ as a confidence interval for the smoothed density $p_h(x)$ instead of $p(x)$.
But this only gives an interval at one point. To get a confidence band we use the bootstrap. Let $P_n$ be the empirical distribution of $X_1, \dots, X_n$. The idea is to estimate the distribution
$$F_n(t) = P\left(\sqrt{n h^d}\, \|\hat p_h(x) - p_h(x)\|_\infty \le t\right)$$
[Figure 4: panels labeled h = 1, h = 2 and h = 3.]
1. Let $P_n$ be the empirical distribution that puts mass $1/n$ at each data point $X_i$.
2. Draw $X_1^*, \dots, X_n^* \sim P_n$. This is called a bootstrap sample.
3. Compute the density estimator $\hat p_h^*$ based on the bootstrap sample.
4. Compute $R = \sqrt{n h^d}\, \|\hat p_h^* - \hat p_h\|_\infty$.
5. Repeat steps 2-4 $B$ times. This gives $R_1, \dots, R_B$.
6. Let $z_\alpha$ be the upper $\alpha$ quantile of the $R_j$'s. Thus
$$\frac{1}{B}\sum_{j=1}^{B} I(R_j > z_\alpha) \approx \alpha.$$
7. Let
$$\ell_n(x) = \hat p_h(x) - \frac{z_\alpha}{\sqrt{n h^d}}, \qquad u_n(x) = \hat p_h(x) + \frac{z_\alpha}{\sqrt{n h^d}}.$$
See Figure 4.
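The bootstrap steps above can be sketched as follows for $d = 1$ with a Gaussian kernel. This is an illustration under stated assumptions, not a definitive implementation: the supremum is approximated by a maximum over a finite grid, and the names `gauss_kde` and `bootstrap_band` are hypothetical.

```python
import numpy as np

def gauss_kde(grid, X, h):
    """One-dimensional Gaussian kernel density estimate evaluated on grid."""
    z = (grid[:, None] - X[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def bootstrap_band(X, h, grid, alpha=0.05, B=200, seed=0):
    """Return (lower, upper) band for the smoothed density p_h on grid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    p_hat = gauss_kde(grid, X, h)
    scale = np.sqrt(n * h)                      # sqrt(n h^d) with d = 1
    R = np.empty(B)
    for b in range(B):
        X_star = rng.choice(X, size=n, replace=True)      # step 2
        p_star = gauss_kde(grid, X_star, h)               # step 3
        R[b] = scale * np.max(np.abs(p_star - p_hat))     # step 4, sup over grid
    z_alpha = np.quantile(R, 1 - alpha)                   # step 6
    return p_hat - z_alpha / scale, p_hat + z_alpha / scale   # step 7

X = np.random.default_rng(1).normal(size=50)
grid = np.linspace(-3, 3, 61)
lower, upper = bootstrap_band(X, h=0.4, grid=grid, B=50)
```

The band has constant width $2 z_\alpha / \sqrt{nh}$, as in step 7.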
5 Cross-Validation
In practice we need a data-based method for choosing the bandwidth h. To do this, we will
need to estimate the risk of the estimator and minimize the estimated risk over h. Here, we
describe two cross-validation methods.
A common method for estimating risk is leave-one-out cross-validation. Recall that the loss function is
$$\int (\hat p(x) - p(x))^2\, dx = \int \hat p^2(x)\, dx - 2\int \hat p(x) p(x)\, dx + \int p^2(x)\, dx.$$
The last term does not involve $\hat p$ so we can drop it. Thus, we now define the loss to be
$$L(h) = \int \hat p^2(x)\, dx - 2\int \hat p(x) p(x)\, dx.$$
where $\hat p_{(-i)}$ is the density estimator obtained after removing the $i$th observation.
It is easy to check that $E[\hat R(h)] = R(h)$.
When the kernel is Gaussian, the cross-validation score can be written, after some tedious algebra, as follows. Let $\phi(z; \sigma)$ denote a Normal density with mean 0 and variance $\sigma^2$. Then,
$$\hat R(h) = \frac{\phi^d(0; \sqrt{2} h)}{n - 1} + \frac{n - 2}{n(n - 1)^2} \sum_{i \ne j} \prod_{\ell=1}^{d} \phi(X_{i\ell} - X_{j\ell}; \sqrt{2} h) \tag{16}$$
$$\qquad - \frac{2}{n(n - 1)} \sum_{i \ne j} \prod_{\ell=1}^{d} \phi(X_{i\ell} - X_{j\ell}; h). \tag{17}$$
The estimator pb and the cross-validation score can be computed quickly using the fast
Fourier transform; see pages 61–66 of Silverman (1986).
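For small samples the Gaussian score in (16) and (17) can also be evaluated directly with an $O(n^2)$ computation. The sketch below assumes NumPy and transcribes the formula term by term; `gaussian_cv_score` is an illustrative name, and the bandwidth grid in the usage lines is an arbitrary choice.

```python
import numpy as np

def gaussian_cv_score(X, h):
    """Cross-validation score (16)-(17) for a product Gaussian kernel."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]          # X_i - X_j, shape (n, n, d)
    off_diag = ~np.eye(n, dtype=bool)             # selects pairs with i != j

    def prod_phi(sigma):
        # Product over coordinates of Normal(0, sigma^2) densities
        dens = np.exp(-0.5 * (diff / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
        return dens.prod(axis=2)

    s2 = np.sqrt(2.0) * h
    phi0 = (1.0 / (np.sqrt(2 * np.pi) * s2)) ** d  # phi^d(0; sqrt(2) h)
    term1 = phi0 / (n - 1)
    term2 = (n - 2) / (n * (n - 1) ** 2) * prod_phi(s2)[off_diag].sum()
    term3 = 2.0 / (n * (n - 1)) * prod_phi(h)[off_diag].sum()
    return term1 + term2 - term3

data = np.random.default_rng(0).normal(size=100)
bandwidths = np.linspace(0.05, 1.0, 20)
scores = [gaussian_cv_score(data, h) for h in bandwidths]
h_cv = bandwidths[int(np.argmin(scores))]
```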
For histograms, it is easy to work out the leave-one-out cross-validation in closed form:
$$\hat R(h) = \frac{2}{(n - 1) h} - \frac{n + 1}{(n - 1) h} \sum_j \hat\theta_j^2.$$
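The closed-form histogram score is cheap to compute, so the number of bins can be chosen by scanning a range of values. A minimal sketch for $d = 1$ with data in $[0, 1]$, assuming NumPy; `histogram_cv_score` is an illustrative name and `m` denotes the number of bins, so $h = 1/m$.

```python
import numpy as np

def histogram_cv_score(X, m):
    """Leave-one-out CV score for a histogram on [0,1] with m equal bins."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    h = 1.0 / m
    counts, _ = np.histogram(X, bins=m, range=(0.0, 1.0))
    theta_hat = counts / n
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(theta_hat ** 2)

score = histogram_cv_score([0.1, 0.2, 0.8, 0.9], m=2)
```

In this toy case $\hat\theta = (1/2, 1/2)$, $n = 4$ and $h = 1/2$, so the score is $4/3 - (10/3)(1/2) = -1/3$.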
A further justification for cross-validation is given by the following theorem due to Stone
(1984).
Theorem 9 (Stone's theorem) Suppose that $p$ is bounded. Let $\hat p_h$ denote the kernel estimator with bandwidth $h$ and let $\hat h$ denote the bandwidth chosen by cross-validation. Then,
$$\frac{\int \big(p(x) - \hat p_{\hat h}(x)\big)^2\, dx}{\inf_h \int \big(p(x) - \hat p_h(x)\big)^2\, dx} \xrightarrow{a.s.} 1. \tag{18}$$
The bandwidth for the density estimator in the bottom left panel of Figure 1 is based on
cross-validation. In this case it worked well but of course there are lots of examples where
there are problems. Do not assume that, if the estimator $\hat p$ is wiggly, then cross-validation
has let you down. The eye is not a good judge of risk.
There are cases when cross-validation can seriously break down. In particular, if there are
ties in the data then cross-validation chooses a bandwidth of 0.
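$$\hat L_j \equiv \hat L(p, \hat p_j) = \int \hat p_j^2(x) - \frac{2}{n}\sum_{i=1}^{n} \hat p_j(Z_i). \tag{19}$$
$$X = (X_1, \dots, X_{2n}) \;\overset{\text{split}}{\Longrightarrow}\; \begin{cases} Y \to \{\hat p_1, \dots, \hat p_N\} = \mathcal{P} \\ Z \to \{\hat L_1, \dots, \hat L_N\} \end{cases}$$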
Footnote 1: It is not necessary to split the data into two sets of equal size. We use the equal split version for simplicity.
Theorem 10 (Wegkamp 1999) There exists a $C > 0$ such that
$$E(\|\hat p - p\|^2) \le 2 \min_{g \in \mathcal{P}} E(\|g - p\|^2) + \frac{C \log N}{n}.$$
This theorem can be proved using concentration of measure techniques that we discuss later
in class. A similar result can be proved for V -fold cross-validation.
6 High Dimensions
The rate of convergence $n^{-2\beta/(2\beta+d)}$ is slow when the dimension $d$ is large. In this case it is hopeless to try to estimate the true density $p$ precisely in the $L_2$ norm (or any similar norm).
We need to change our notion of what it means to estimate p in a high-dimensional problem.
Instead of estimating p precisely we have to settle for finding an adequate approximation
of p. Any estimator that finds the regions where p puts large amounts of mass should be
considered an adequate approximation. Let us consider a few ways to implement this type
of thinking.
$$P_h = P \oplus K_h$$
where $\oplus$ denotes convolution (see footnote 2) and $K_h$ is the distribution with density $h^{-d} K(\|u\|/h)$. In other words, if $X \sim P_h$ then $X = Y + Z$ where $Y \sim P$ and $Z \sim K_h$. This is just another way to say that $P_h$ is a blurred or smoothed version of $P$. The density $p_h$ need not be close in $L_2$ to $p$ but still could preserve most of the important shape information about $p$. Consider then choosing a fixed $h > 0$ and estimating $p_h$ instead of $p$. This corresponds to ignoring the bias in the density estimator. From Theorem 15 we conclude:
Theorem 11 Let $h > 0$ be fixed. Then $P(\|\hat p_h - p_h\|_\infty > \epsilon) \le C e^{-nc\epsilon^2}$. Hence,
$$\|\hat p_h - p_h\|_\infty = O_P\left(\sqrt{\frac{\log n}{n}}\right).$$
Footnote 2: If $X \sim P$ and $Y \sim Q$ are independent, then the distribution of $X + Y$ is denoted by $P \star Q$ and is called the convolution of $P$ and $Q$.
The rate of convergence is fast and is independent of dimension. How to choose h is not
clear.
Independence Based Methods. If we can live with some bias, we can reduce the dimen-
sionality by imposing some independence assumptions. The simplest example is to treat the
components ( X 1 , . . . , X d ) as if they are independent. In that case
$$p(x_1, \dots, x_d) = \prod_{j=1}^{d} p_j(x_j)$$
To estimate the density therefore only requires that we estimate one- and two-dimensional marginals. But how do we find the edge set $E$? Some methods are discussed in Liu et al (2011) under the name "Forest Density Estimation." A simple approach is to connect pairs greedily using some measure of correlation.
Density Trees. Ram and Gray (2011) suggest a recursive partitioning scheme similar to decision trees. They split each coordinate dyadically, in a greedy fashion. The density estimator is taken to be piecewise constant. They use an $L_2$ risk estimator to decide when to split. The idea seems to have been re-discovered in Yang and Wong (arXiv:1404.1425) and Liu and Wong (arXiv:1401.2597). Density trees seem very promising. It would be nice if there were an R package to do this and if there were more theoretical results.
7 Example
Figure 5 shows a synthetic two-dimensional data set, the cross-validation function and two kernel density estimators. The data are 100 points generated as follows. We select a point randomly on the unit circle then add Normal noise with standard deviation 0.1. The first estimator (lower left) uses the bandwidth that minimizes the leave-one-out cross-validation score. The second uses twice that bandwidth. The cross-validation curve is very sharply peaked with a clear minimum. The resulting density estimate is somewhat lumpy. This is because cross-validation is aiming to minimize $L_2$ error which does not guarantee that the estimate is smooth. Also, the dataset is small so this effect is more noticeable. The estimator with the larger bandwidth is noticeably smoother. However, the lumpiness of the estimator is not necessarily a bad thing.
Figure 5: Synthetic two-dimensional data set. Top left: data. Top right: cross-validation function (risk plotted against bandwidth). Bottom left: kernel estimator based on the bandwidth that minimizes the cross-validation score. Bottom right: kernel estimator based on twice the bandwidth that minimizes the cross-validation score.
8 Derivatives
Kernel estimators can also be used to estimate the derivatives of a density. Let $D^{\otimes r} p$ denote the $r$th derivative of $p$. We are using Kronecker notation. Let $D^{\otimes 0} p = p$, $D^{\otimes 1} p$ is the gradient of $p$, and $D^{\otimes 2} p = \mathrm{vec}\, H$ where $H$ is the Hessian. We also write this as $p^{(r)}$ when convenient.
$$\widehat{p^{(r)}}(x) = D^{\otimes r} \hat p_H(x) = \frac{1}{n}\sum_{i=1}^{n} D^{\otimes r} K_H(x - X_i).$$
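The asymptotic mean squared error is derived in Chacón, Duong and Wand (2011) and is given by
$$\frac{1}{n} |H|^{-1/2}\, \mathrm{tr}\big((H^{-1})^{\otimes r} R(D^{\otimes r}(K))\big) + \frac{m_2^2(K)}{4}\, \mathrm{tr}\big((I_{d^r} \otimes \mathrm{vec}^T H)\, R(D^{\otimes(r+2)} p)\, (I_{d^r} \otimes \mathrm{vec}(H))\big)$$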
$n^{-2/(d+2r+4)}$ which yield an asymptotic mean squared error of order $n^{-4/(d+2r+4)}$. In dimension $d = 1$, the risk looks like this as a function of $r$:
r    risk
0    $n^{-4/5}$
1    $n^{-4/7}$
2    $n^{-4/9}$
We see that estimating derivatives is harder than estimating the density itself.
where
$$G_n = \frac{1}{n^2}\sum_{i,j} D^{\otimes 2r} K(H^{-1/2}(X_i - X_j)) - \frac{2}{n(n - 1)}\sum_{i \ne j} D^{\otimes 2r} K(H^{-1/2}(X_i - X_j))$$
We can use density estimation to do unsupervised prediction and anomaly detection. The
basic idea is due to Vovk, and was developed in a statistical framework in Jing, Robins and
Wasserman (2014).
$$P(Y_{n+1} \in C_n) \ge 1 - \alpha.$$
Fix a value $y$. Let $A = (Y_1, \dots, Y_n, y)$ be the augmented dataset. That is, we set $Y_{n+1} = y$. Let $\hat p_A$ be a density estimate based on $A$. Consider the vector
$$\hat p_A(Y_1), \dots, \hat p_A(Y_{n+1}).$$
Under $H_0$, the rank of these values is uniformly distributed. That is, for each $i$,
$$P(\hat p_A(Y_i) \le \hat p_A(y)) = \frac{1}{n + 1}.$$
A p-value for the test is
$$\pi(y) = \frac{1}{n + 1}\sum_{i=1}^{n+1} I(\hat p_A(Y_i) \le \hat p_A(y)).$$
The prediction set is
$$C_n = \big\{ y : \pi(y) \ge \alpha \big\}.$$
Computing $C_n$ is tedious. Fortunately, Jing, Robins and Wasserman (2014) show that there is a simpler set that still has the correct coverage (but is slightly larger). The set is constructed as follows. Let $Z_i = \hat p(Y_i)$. Order these observations
$$Z_{(1)} \le \cdots \le Z_{(n)}.$$
$$P(Y_{n+1} \in C_n) \ge 1 - \alpha.$$
Finally, we note that any Yi with a small p-value can be regarded as an outlier (anomaly).
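The p-value $\pi(y)$ and the prediction set $C_n$ can be computed by brute force over a grid of candidate values $y$, which also shows why the exact set is tedious: the density estimate is refit for every $y$. The sketch below assumes NumPy, one-dimensional data and a Gaussian kernel density estimate with a fixed bandwidth `h`; all function names are illustrative.

```python
import numpy as np

def gauss_kde(x, data, h):
    """One-dimensional Gaussian kernel density estimate at the points x."""
    z = (np.atleast_1d(x)[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

def conformal_pvalue(y, Y, h):
    """pi(y): fraction of augmented points whose density is at most that of y."""
    A = np.append(np.asarray(Y, dtype=float), y)   # augmented data, Y_{n+1} = y
    dens = gauss_kde(A, A, h)                      # p_hat_A at Y_1, ..., Y_n, y
    return np.mean(dens <= dens[-1])

def prediction_set(Y, h, alpha, grid):
    """Grid points y with pi(y) >= alpha."""
    pvals = np.array([conformal_pvalue(y, Y, h) for y in grid])
    return grid[pvals >= alpha]

Y = np.linspace(-2, 2, 50)
C = prediction_set(Y, h=0.3, alpha=0.1, grid=np.linspace(-5, 5, 101))
```

A candidate far from the data, such as $y = 10$ here, gets the smallest possible p-value $1/(n+1)$ and would be flagged as an anomaly.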
$P(A) > 0$ even though $\mu(A) = 0$. Effectively, this means that the density is infinite. To see this, consider a point $x$ on the circle. Let $B(x, \epsilon)$ be a ball of radius $\epsilon$ centered at $x$. Then
$$p(x) = \lim_{\epsilon \to 0} \frac{P(B(x, \epsilon))}{\mu(B(x, \epsilon))} = \infty.$$
Note also that the $L_2$ loss does not make any sense. If you tried to use cross-validation, you would find that the estimated risk is minimized at $h = 0$.
11 Series Methods
We have emphasized kernel density estimation. There are many other density estimation
methods. Let us briefly mention a method based on basis functions. For simplicity, suppose
that X i ∈ [0, 1] and let φ1 , φ2 , . . . be an orthonormal basis for
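$$\mathcal{F} = \left\{ f : [0, 1] \to \mathbb{R},\ \int_0^1 f^2(x)\, dx < \infty \right\}.$$
Thus
$$\int \phi_j^2(x)\, dx = 1, \qquad \int \phi_j(x)\phi_k(x)\, dx = 0 \ \text{ for } j \ne k.$$
If $p \in \mathcal{F}$ then
$$p(x) = \sum_{j=1}^{\infty} \beta_j \phi_j(x)$$
where $\beta_j = \int_0^1 p(x)\phi_j(x)\, dx$. An estimate of $p$ is $\hat p(x) = \sum_{j=1}^{k} \hat\beta_j \phi_j(x)$ where
$$\hat\beta_j = \frac{1}{n}\sum_{i=1}^{n} \phi_j(X_i).$$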
The number of terms k is the smoothing parameter and can be chosen using cross-validation.
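To make the series estimator concrete, the sketch below uses the cosine basis $\phi_1(x) = 1$, $\phi_j(x) = \sqrt{2}\cos((j-1)\pi x)$ on $[0, 1]$. The choice of basis is an assumption made for illustration (the notes do not fix one), and the function names are hypothetical; NumPy is assumed.

```python
import numpy as np

def cosine_basis(x, j):
    """Orthonormal cosine basis on [0,1]: phi_1 = 1, phi_j = sqrt(2) cos((j-1) pi x)."""
    x = np.asarray(x, dtype=float)
    if j == 1:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos((j - 1) * np.pi * x)

def series_density(X, k):
    """Return (p_hat, beta_hat) for the k-term series estimator."""
    X = np.asarray(X, dtype=float)
    # beta_hat_j is the sample mean of phi_j(X_i)
    beta_hat = np.array([cosine_basis(X, j).mean() for j in range(1, k + 1)])

    def p_hat(x):
        x = np.asarray(x, dtype=float)
        return sum(b * cosine_basis(x, j) for j, b in enumerate(beta_hat, start=1))

    return p_hat, beta_hat

X = np.random.default_rng(0).beta(2, 5, size=200)
p_hat, beta_hat = series_density(X, k=6)
```

Because $\phi_1 = 1$ and the other basis functions integrate to zero, the estimate always integrates to one, although it can take negative values for some $x$.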
The first term is of order $O(k/n)$. To bound the second term (the bias) one usually assumes that $p$ is in a Sobolev space of order $q$ which means that
$$\sum_{j=1}^{\infty} \beta_j^2 j^{2q} < \infty.$$
$$p(x) = \sum_{j=1}^{k} \eta_j \phi(x; \mu_j, \Sigma_j).$$
These mixtures form a flexible family. But they are part of the Twilight Zone of statistics.
They are difficult to deal with and have lots of problems.
$$p(x) = (1 - \eta)\phi(x; \mu_1, \sigma^2) + \eta\phi(x; \mu_2, \sigma^2).$$
You would expect $p(x)$ to be multimodal but this is not necessarily true. The density $p(x)$ is unimodal when $|\mu_1 - \mu_2| \le 2\sigma$ and bimodal when $|\mu_1 - \mu_2| > 2\sigma$. One might expect that the maximum number of modes of a mixture of $k$ Gaussians would be $k$. However, there are examples where a mixture of $k$ Gaussians has more than $k$ modes. Figure 6 provides a simple example from Carreira-Perpinan and Williams (2003).
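The $2\sigma$ threshold for a two-component mixture can be checked numerically by counting local maxima on a grid. This is a rough sketch assuming NumPy and equal weights $\eta = 1/2$; the function name `count_modes` and the grid settings are illustrative.

```python
import numpy as np

def count_modes(mu1, mu2, sigma, eta=0.5, num=2001):
    """Count strict local maxima of the two-component mixture on a grid."""
    lo = min(mu1, mu2) - 5 * sigma
    hi = max(mu1, mu2) + 5 * sigma
    x = np.linspace(lo, hi, num)

    def phi(m):
        return np.exp(-0.5 * ((x - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    p = (1 - eta) * phi(mu1) + eta * phi(mu2)
    interior = p[1:-1]
    is_peak = (interior > p[:-2]) & (interior > p[2:])
    return int(is_peak.sum())

modes_close = count_modes(0.0, 1.5, 1.0)   # separation 1.5 sigma
modes_far = count_modes(0.0, 3.0, 1.0)     # separation 3 sigma
```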
Figure 6: An illustrative example that shows mixture of $k$ Gaussians models may asymptotically have more than $k$ modes. Left plot: each black dot corresponds to the center of one Gaussian. Right plot: for adjacent Gaussians, new maxima appear in the centers of the triangles, so there are roughly $\frac{5}{3} K$ maxima (neglecting boundary effects). These two pictures are from Carreira-Perpinan and Williams (2003).
In fact, Edelsbrunner, Fasy and Rote (2012) show that the relationship between the number
of modes of p and the number of components in the mixture is very complex.
where $P_\theta$ is the distribution corresponding to the density $p_\theta$. Mixture models are nonidentifiable in two different ways. First, there is nonidentifiability due to permutation of labels. For example, consider a mixture of two univariate Gaussians,
$$p_{\psi_1}(x) = 0.3\phi(x; 0, 1) + 0.7\phi(x; 2, 1)$$
and
$$p_{\psi_2}(x) = 0.7\phi(x; 2, 1) + 0.3\phi(x; 0, 1),$$
then $p_{\psi_1}(x) = p_{\psi_2}(x)$ even though $\psi_1 = (0.3, 0.7, 0, 2, 1)^T \ne (0.7, 0.3, 2, 0, 1)^T = \psi_2$. This is not a serious problem although it does contribute to the multimodality of the likelihood.
A more serious problem is local nonidentifiability. Suppose that
Irregularity. The parametric models we considered earlier were regular, meaning that
they satisfied many nice conditions. As we saw, these conditions ensure that estimators
for the parameters of the model have good behavior. For example, we usually have that the Fisher information is non-zero and that $\hat\theta - \theta = O_P(n^{-1/2})$ where $\hat\theta$ is the maximum likelihood estimator. Mixture models are, in general, irregular: they do not satisfy the usual regularity
conditions that make parametric models so easy to deal with. Here is an example from Chen
(1995).
Nonintuitive Group Membership. One motivation for using mixtures is clustering. But one should be aware that mixtures can exhibit unexpected behavior with respect to clustering. Let
$$p(x) = (1 - \eta)\phi(x; \mu_1, \sigma_1^2) + \eta\phi(x; \mu_2, \sigma_2^2).$$
Suppose that $\mu_1 < \mu_2$. We can classify an observation as being from cluster 1 or cluster 2 by computing the probability of being from the first or second component, denoted $Z = 0$ and $Z = 1$. We get
$$P(Z = 0 | X = x) = \frac{(1 - \eta)\phi(x; \mu_1, \sigma_1^2)}{(1 - \eta)\phi(x; \mu_1, \sigma_1^2) + \eta\phi(x; \mu_2, \sigma_2^2)}.$$
Define $Z(x) = 0$ if $P(Z = 0 | X = x) > 1/2$ and $Z(x) = 1$ otherwise. When $\sigma_1$ is larger than $\sigma_2$, we end up classifying all the observations with large $x$ to the leftmost component. Technically this is correct, yet it seems to be an unintended consequence of the model and does not capture what we mean by a cluster.
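A short numerical check makes this behavior visible. The sketch assumes NumPy; the parameter values ($\eta = 0.5$, $\mu_1 = 0$, $\sigma_1 = 3$, $\mu_2 = 2$, $\sigma_2 = 0.5$) are arbitrary illustrative choices, and `prob_first_component` is a hypothetical name.

```python
import numpy as np

def prob_first_component(x, eta, mu1, sigma1, mu2, sigma2):
    """P(Z = 0 | X = x) for a two-component Gaussian mixture."""
    def phi(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    a = (1 - eta) * phi(x, mu1, sigma1)
    b = eta * phi(x, mu2, sigma2)
    return a / (a + b)

# Wide left component (sd 3) and narrow right component (sd 0.5)
near_right_mean = prob_first_component(2.0, 0.5, 0.0, 3.0, 2.0, 0.5)
far_right = prob_first_component(10.0, 0.5, 0.0, 3.0, 2.0, 0.5)
```

At $x = 2$ the observation is assigned to the right component, but at $x = 10$, far to the right of both means, it is assigned to the leftmost component because the wider Normal has the heavier tail.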
Improper Posteriors. In the Bayesian approach, inferences are based on the posterior distribution defined by $p(\psi | X_1, \dots, X_n) \propto L(\psi)\pi(\psi)$. Here, $\pi(\psi)$ is the prior distribution that represents our knowledge of $\psi$ before seeing the data. Often, the prior is improper, meaning that it does not have a finite integral. For example, suppose that $X_1, \dots, X_n \sim N(\mu, 1)$. It is common to use an improper prior $\pi(\mu) = 1$. This is improper because
$$\int \pi(\mu)\, d\mu = \infty.$$
Nevertheless, the posterior $p(\mu | D_n) \propto L(\mu)\pi(\mu)$ is a proper distribution, where $L(\mu)$ is the data likelihood of $\mu$. In fact, the posterior for $\mu$ is $N(\bar x, 1/\sqrt{n})$ where $\bar x$ is the sample mean. The posterior inferences in this case coincide exactly with the frequentist inferences. In many parametric models, the posterior inferences are well defined even if the prior is improper and usually they approximate the frequentist inferences. Not so with mixtures. Let
$$p(x; \mu) = \frac{1}{2}\phi(x; 0, 1) + \frac{1}{2}\phi(x; \mu, 1). \tag{21}$$
If π(µ) is improper then so is the posterior. Moreover, Wasserman (2000) shows that the only
priors that yield posteriors in close agreement to frequentist methods are data-dependent
priors.
Finding the Parameter Estimates. To this day, there is no practical method for finding the parameter estimates that comes with any guarantees. The most common approach is EM but there is no guarantee that you will find the MLE.
Choosing k. There are many heuristics for choosing $k$ but very little in the way of theoretical results.
What Does All This Mean? Mixture models can have very unusual and unexpected behavior. This does not mean that we should not use mixture models. Compare this to kernel density estimators which are simple and very well understood. If you are going to use mixture models, I advise you to remember the words of Rod Serling:
13 Miscellanea
Another method for selecting $h$ which is sometimes used when $p$ is thought to be very smooth is the plug-in method. The idea is to take the formula for the mean squared error (equation 22), insert a guess of $p''$ and then solve for the optimal bandwidth $h$. For example, if $d = 1$ and under the idealized assumption that $p$ is a univariate Normal this yields $h_* = 1.06\,\sigma n^{-1/5}$. Usually, $\sigma$ is estimated by $\min\{s, Q/1.34\}$ where $s$ is the sample standard deviation and $Q$ is the interquartile range (see footnote 4). This choice of $h_*$ works well if the true density is very smooth and is called the Normal reference rule.
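The Normal reference rule is a one-line computation. A minimal sketch, assuming NumPy; the function name is illustrative, and the sample standard deviation is computed with the usual $n - 1$ denominator.

```python
import numpy as np

def normal_reference_bandwidth(X):
    """h = 1.06 * sigma_hat * n^(-1/5), with sigma_hat = min(s, Q / 1.34)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    s = X.std(ddof=1)                        # sample standard deviation
    q75, q25 = np.percentile(X, [75, 25])    # Q is the interquartile range
    sigma_hat = min(s, (q75 - q25) / 1.34)
    return 1.06 * sigma_hat * n ** (-1 / 5)

h_ref = normal_reference_bandwidth(np.arange(1.0, 101.0))
```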
Since we don’t want to necessarily assume that p is very smooth, it is usually better to
estimate h using cross-validation. See Loader (1999) for an interesting comparison between
cross-validation and plugin methods.
A generalization of the kernel method is to use adaptive kernels where one uses a different
bandwidth h( x) for each point x. One can also use a different bandwidth h( x i ) for each data
point. This makes the estimator more flexible and allows it to adapt to regions of varying
smoothness. But now we have the very difficult task of choosing many bandwidths instead
of just one.
Density estimation is sometimes used to find unusual observations or outliers. These are
observations for which pb( X i ) is very small.
14 Summary
1. A commonly used nonparametric density estimator is the kernel estimator
$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h^d} K\left(\frac{\|x - X_i\|}{h}\right).$$
15 Appendix
In this section we consider some asymptotic expansions that describe the behavior of the
kernel estimator. We focus on the case d = 1.
Footnote 4: Recall that the interquartile range is the 75th percentile minus the 25th percentile. The reason for dividing by 1.34 is that $Q/1.34$ is a consistent estimate of $\sigma$ if the data are from a $N(\mu, \sigma^2)$.
Theorem 13 Let $R_x = E(p(x) - \hat p(x))^2$ and let $R = \int R_x\, dx$. Assume that $p''$ is absolutely
$$R_x = \frac{1}{4}\sigma_K^4 h_n^4\, p''(x)^2 + \frac{p(x)\int K^2(x)\, dx}{n h_n} + O\left(\frac{1}{n}\right) + O(h_n^6)$$
and
$$R = \frac{1}{4}\sigma_K^4 h_n^4 \int p''(x)^2\, dx + \frac{\int K^2(x)\, dx}{n h_n} + O\left(\frac{1}{n}\right) + O(h_n^6) \tag{22}$$
where $\sigma_K^2 = \int x^2 K(x)\, dx$.
Proof. Write $K_h(x, X) = h^{-1} K((x - X)/h)$ and $\hat p(x) = n^{-1}\sum_i K_h(x, X_i)$. Thus, $E[\hat p(x)] = E[K_h(x, X)]$
$$E[K_h(x, X)] = \int \frac{1}{h} K\left(\frac{x - t}{h}\right) p(t)\, dt$$
$$= \int K(u)\, p(x - hu)\, du$$
$$= \int K(u)\left[ p(x) - hu\, p'(x) + \frac{h^2 u^2}{2} p''(x) + \cdots \right] du$$
$$= p(x) + \frac{1}{2} h^2 p''(x) \int u^2 K(u)\, du\ \cdots$$
since $\int K(x)\, dx = 1$ and $\int x K(x)\, dx = 0$. The bias is
$$E[K_{h_n}(x, X)] - p(x) = \frac{1}{2}\sigma_K^2 h_n^2\, p''(x) + O(h_n^4).$$
By a similar calculation,
$$\mathrm{Var}[\hat p(x)] = \frac{p(x)\int K^2(x)\, dx}{n h_n} + O\left(\frac{1}{n}\right).$$
The first result then follows since the risk is the squared bias plus variance. The second result follows from integrating the first result.
If we differentiate (22) with respect to $h$ and set it equal to 0, we see that the asymptotically optimal bandwidth is
$$h_* = \left(\frac{c_2}{c_1^2 A(f) n}\right)^{1/5} \tag{23}$$
where $c_1 = \int x^2 K(x)\, dx$, $c_2 = \int K(x)^2\, dx$ and $A(f) = \int f''(x)^2\, dx$. This is informative because it tells us that the best bandwidth decreases at rate $n^{-1/5}$. Plugging $h_*$ into (22), we see that if the optimal bandwidth is used then $R = O(n^{-4/5})$.
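For completeness, the calculation behind (23) can be written out as follows. This is a sketch that drops the $O(1/n)$ and $O(h^6)$ remainder terms of (22) and uses the notation $c_1$, $c_2$, $A(f)$ defined above, with $f$ playing the role of the density $p$.

```latex
\begin{align*}
R(h) &\approx \tfrac{1}{4} c_1^2 A(f)\, h^4 + \frac{c_2}{n h}, \\
\frac{dR}{dh} &= c_1^2 A(f)\, h^3 - \frac{c_2}{n h^2} = 0
  \quad\Longrightarrow\quad h_*^5 = \frac{c_2}{c_1^2 A(f)\, n}, \\
R(h_*) &\approx \tfrac{1}{4} c_1^2 A(f)\, h_*^4 + c_1^2 A(f)\, h_*^4
  = \tfrac{5}{4}\, c_1^{2/5} c_2^{4/5} A(f)^{1/5}\, n^{-4/5}.
\end{align*}
```

The second line of the last display uses $c_2/(n h_*) = c_1^2 A(f) h_*^4$, which is the first-order condition rearranged.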
15.2 Concentration Analysis For Histograms
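$$P(|\hat\theta_j - \theta_j| > h^d \epsilon) \le 2\exp\left(-\frac{1}{2}\frac{n\epsilon^2 h^{2d}}{\theta_j(1 - \theta_j) + \epsilon h^d/3}\right)$$
$$\le 2\exp\left(-\frac{1}{2}\frac{n\epsilon^2 h^{2d}}{C h^d + \epsilon h^d/3}\right)$$
$$\le 2\exp\left(-c n \epsilon^2 h^d\right)$$
where $c = 1/(2(C + 1/3))$ and the last step uses $\epsilon \le 1$. By the union bound and the fact that $N \le (1/h)^d$,
$$P\left(\max_j |\hat\theta_j - \theta_j| > h^d \epsilon\right) \le 2 h^{-d}\exp\left(-c n \epsilon^2 h^d\right) \equiv \pi_n.$$
Earlier we saw that $\sup_x |p(x) - p_h(x)| \le L\sqrt{d}\, h$. Hence, with probability at least $1 - \pi_n$,
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \epsilon + L\sqrt{d}\, h. \tag{24}$$
Now set
$$\epsilon = \sqrt{\frac{1}{c n h^d}\log\left(\frac{2}{\delta h^d}\right)}.$$
Then, with probability at least $1 - \delta$,
$$\|\hat p_h - p\|_\infty \le \sqrt{\frac{1}{c n h^d}\log\left(\frac{2}{\delta h^d}\right)} + L\sqrt{d}\, h. \tag{25}$$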
15.3 Concentration Analysis of Kernel Density Estimator
Now we state a result which says how fast $\hat p(x)$ concentrates around $p(x)$. First, recall Bernstein's inequality: Suppose that $Y_1, \dots, Y_n$ are iid with mean $\mu$, $\mathrm{Var}(Y_i) \le \sigma^2$ and $|Y_i| \le M$. Then
$$P(|\bar Y - \mu| > \epsilon) \le 2\exp\left\{-\frac{n\epsilon^2}{2\sigma^2 + 2M\epsilon/3}\right\}. \tag{27}$$
Note that the last statement follows from the bias-variance calculation followed by Markov’s
inequality. The first statement does not.
$$|\hat p(x) - p(x)| \le |\hat p(x) - p_h(x)| + |p_h(x) - p(x)| \tag{30}$$
where $p_h(x) = E(\hat p(x))$. From Lemma 3, $|p_h(x) - p(x)| \le c h^\beta$ for some $c$. Now $\hat p(x) = n^{-1}\sum_{i=1}^{n} Z_i$ where
$$Z_i = \frac{1}{h^d} K\left(\frac{\|x - X_i\|}{h}\right).$$
Note that $|Z_i| \le c_1/h^d$ where $c_1 = K(0)$. Also, $\mathrm{Var}(Z_i) \le c_2/h^d$ from Lemma 4. Hence, by Bernstein's inequality,
$$P(|\hat p(x) - p_h(x)| > \epsilon) \le 2\exp\left\{-\frac{n\epsilon^2}{2c_2 h^{-d} + 2c_1 h^{-d}\epsilon/3}\right\} \le 2\exp\left\{-\frac{n h^d \epsilon^2}{4c_2}\right\}$$
whenever $\epsilon \le 3c_2/c_1$. If we choose $\epsilon = \sqrt{C\log(2/\delta)/(n h^d)}$ where $C = 4c_2$ then
$$P\left(|\hat p(x) - p_h(x)| > \sqrt{\frac{C\log(2/\delta)}{n h^d}}\right) \le \delta.$$
15.4 Concentration in L ∞
Theorem 14 shows that, for each $x$, $\hat p(x)$ is close to $p(x)$ with high probability. We would like a version of this result that holds uniformly over all $x$. That is, we want a concentration result for
$$\|\hat p - p\|_\infty = \sup_x |\hat p(x) - p(x)|.$$
We can write
$$\|\hat p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + \|p_h - p\|_\infty \le \|\hat p_h - p_h\|_\infty + c h^\beta.$$
We bounded the first term when we discussed concentration of measure where we used bracketing arguments together with Bernstein's theorem to prove that,
$$P(\|\hat p_h - p_h\|_\infty > \epsilon) \le 4\left(\frac{C}{h^{d+1}\epsilon}\right)^d \exp\left(-\frac{3n\epsilon^2 h^d}{28K(0)}\right). \tag{31}$$
where $N(T, d, \epsilon)$ denotes the $\epsilon$-covering number of the metric space $(T, d)$, $F$ is the envelope function of $\mathcal{F}$ and the supremum is taken over the set of all probability measures on $\mathbb{R}^d$. The quantities $A$ and $v$ are called the VC characteristics of $\mathcal{F}_h$.
Theorem 15 (Giné and Guillou 2002) Assume that the kernel satisfies the above property.
1. Let $h > 0$ be fixed. Then, there exist constants $c_1 > 0$ and $c_2 > 0$ such that, for all small $\epsilon > 0$ and all large $n$,
$$P\left\{\sup_{x \in \mathbb{R}^d} |\hat p_h(x) - p_h(x)| > \epsilon\right\} \le c_1\exp\left\{-c_2 n h^d \epsilon^2\right\}. \tag{33}$$
2. Let $h_n \to 0$ as $n \to \infty$ in such a way that $\frac{n h_n^d}{|\log h_n^d|} \to \infty$. Let
$$\epsilon_n \ge \sqrt{\frac{|\log h_n|}{n h_n^d}}. \tag{34}$$
Then, for all $n$ large enough, (33) holds with $h$ and $\epsilon$ replaced by $h_n$ and $\epsilon_n$, respectively.
The above theorem imposes minimal assumptions on the kernel K and, more importantly,
on the probability distribution P , whose density is not required to be bounded or smooth,
and, in fact, may not even exist. Combining the above theorem with Lemma 3 we have the
following result.
for some constants $C$ and $c$ where $C$ depends on $\delta$. Choosing $h \asymp (\log n/n)^{1/(2\beta+d)}$ we have
$$P\left(\sup_x |\hat p(x) - p(x)|^2 > \frac{C\log n}{n^{2\beta/(2\beta+d)}}\right) < \delta.$$
15.5 L 1 Methods
VC Classes. Let $\mathcal{A}$ be a class of sets with VC dimension $\nu$. As in section 5.2, split the data $X$ into $Y$ and $Z$ with $\mathcal{P} = \{\hat p_1, \dots, \hat p_N\}$ constructed from $Y$. For $g \in \mathcal{P}$ define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right|$$
$$P\left(\Delta(\hat p) > \min_j \Delta(\hat p_j) + 2c\sqrt{\frac{\nu}{n}}\right) < \delta.$$
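Hence, except on an event of probability at most $\delta$, we have that
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P_n(A) \right| \le \sup_{A \in \mathcal{A}} \left| \int_A g(x)\, dx - P(A) \right| + \sup_{A \in \mathcal{A}} \left| P_n(A) - P(A) \right| \le \Delta(g) + c\sqrt{\frac{\nu}{n}}.$$
By a similar argument, $\Delta(g) \le \Delta_n(g) + c\sqrt{\frac{\nu}{n}}$. Hence, $|\Delta(g) - \Delta_n(g)| \le c\sqrt{\frac{\nu}{n}}$ for all $g$. Let $p_* = \mathrm{argmin}_{g \in \mathcal{P}} \Delta(g)$. Then,
$$\Delta(p) \le \Delta(\hat p) \le \Delta_n(\hat p) + c\sqrt{\frac{\nu}{n}} \le \Delta_n(p_*) + c\sqrt{\frac{\nu}{n}} \le \Delta(p_*) + 2c\sqrt{\frac{\nu}{n}}.$$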
The difficulty in implementing this idea is computing and minimizing ∆n ( g). Hjort and
Walker (2001) presented a similar method which can be practically implemented when d = 1.
Yatracos Classes. Devroye and Györfi (2001) use a class of sets called a Yatracos class which leads to estimators with some remarkable properties. Let $\mathcal{P} = \{p_1, \dots, p_N\}$ be a set of densities and define the Yatracos class of sets $\mathcal{A} = \{A(i, j) : i \ne j\}$ where $A(i, j) = \{x : p_i(x) > p_j(x)\}$. Let
$$\hat p = \mathrm{argmin}_{g \in \mathcal{G}} \Delta(g)$$
where
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
and $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on a sample $Z_1, \dots, Z_n \sim p$.
The term $\min_j \int |p_j - p|$ is like a bias while term $\Delta$ is like the variance.
Proof. Let $i$ be such that $\hat p = p_i$ and let $s$ be such that $\int |p_s - p| = \min_j \int |p_j - p|$. Let $B = \{p_i > p_s\}$ and $C = \{p_s > p_i\}$. Now,
$$\int |\hat p - p| \le \int |p_s - p| + \int |p_s - p_i|. \tag{37}$$
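Let $\mathcal{B}$ denote all measurable sets. Then,
$$\int |p_s - p_i| = 2\max_{A \in \{B, C\}} \left| \int_A p_i - \int_A p_s \right| \le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - \int_A p_s \right|$$
$$\le 2\sup_{A \in \mathcal{A}} \left| \int_A p_i - P_n(A) \right| + 2\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right|$$
$$\le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - P_n(A) \right|$$
$$\le 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - \int_A p \right| + 4\sup_{A \in \mathcal{A}} \left| \int_A p - P_n(A) \right|$$
$$= 4\sup_{A \in \mathcal{A}} \left| \int_A p_s - \int_A p \right| + 4\Delta \le 4\sup_{A \in \mathcal{B}} \left| \int_A p_s - \int_A p \right| + 4\Delta$$
$$= 2\int |p_s - p| + 4\Delta.$$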
Now we apply this to kernel estimators. Again we split the data X into two halves Y =
(Y1 , . . . , Yn ) and Z = ( Z1 , . . . , Z n ). For each h let
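$$\hat p_h(x) = \frac{1}{n}\sum_{i=1}^{n} K\left(\frac{\|x - Y_i\|}{h}\right).$$
Let
$$\mathcal{A} = \big\{ A(h, \nu) : h, \nu > 0,\ h \ne \nu \big\}$$
where $A(h, \nu) = \{x : \hat p_h(x) > \hat p_\nu(x)\}$. Define
$$\Delta_n(g) = \sup_{A \in \mathcal{A}} \left| \int_A g(u)\, du - P_n(A) \right|$$
where $P_n(A) = n^{-1}\sum_{i=1}^{n} I(Z_i \in A)$ is the empirical measure based on $Z$. Let
$$\hat p = \mathrm{argmin}_{g \in \mathcal{G}} \Delta(g).$$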
Under some regularity conditions on the kernel, we have the following result.
The proof involves showing that the terms on the right hand side of (36) are small. We refer
the reader to Devroye and Györfi (2001) for the details.
Recall that $d_{TV}(P, Q) = \sup_A |P(A) - Q(A)| = (1/2)\int |p(x) - q(x)|\, dx$ where the supremum is over all measurable sets. The above theorem says that the estimator does well in the total variation metric, even though the method only used the Yatracos class of sets. Finding computationally efficient methods to implement this approach remains an open question.
15.6 Functional Data and Quasi-Densities
In some problems, $X$ is not just high dimensional, it is infinite dimensional. For example suppose that each $X_i$ is a curve. An immediate problem is that the concept of a density is no longer well defined. On a Euclidean space, the density $p$ for a probability measure is the function that satisfies $P(A) = \int_A p(u)\, d\mu(u)$ for all measurable $A$ where $\mu$ is Lebesgue measure. Formally, we say that $p$ is the Radon-Nikodym derivative of $P$ with respect to the dominating measure $\mu$. Geometrically, we can think of $p$ as
$$p(x) = \lim_{\epsilon \to 0} \frac{P(\|X - x\| \le \epsilon)}{V(\epsilon)}$$
where $V(\epsilon) = \epsilon^d \pi^{d/2}/\Gamma(d/2 + 1)$ is the volume of a sphere of radius $\epsilon$. Under appropriate conditions, these two notions of density agree. (This is the Lebesgue density theorem.)
When the outcome space $\mathcal{X}$ is a set of curves, there is no dominating measure and hence there is no density. Instead, we define the density geometrically by
$$q_\epsilon(x) = P(\xi(x, X) \le \epsilon)$$
for a small $\epsilon$ where $\xi$ is some metric on $\mathcal{X}$. However we cannot divide by $V(\epsilon)$ and let $\epsilon$ tend to 0 since the dimension $d$ is infinite.
One way around this is to use a fixed $\epsilon$ and work with the unnormalized density $q_\epsilon$. For the purpose of finding high-density regions this may be adequate. An estimate of $q_\epsilon$ is
$$\hat q_\epsilon(x) = \frac{1}{n}\sum_{i=1}^{n} I(\xi(x, X_i) \le \epsilon).$$
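The estimate $\hat q_\epsilon$ is just the fraction of sample curves within distance $\epsilon$ of $x$. The sketch below assumes NumPy and curves stored as arrays sampled on a common grid; the sup-distance used for $\xi$ is an illustrative assumption, and any other metric on curves can be passed in instead.

```python
import numpy as np

def sup_distance(f, g):
    """Sup-norm distance between two curves sampled on a common grid (assumed metric)."""
    return float(np.max(np.abs(np.asarray(f) - np.asarray(g))))

def quasi_density(x, curves, eps, metric=sup_distance):
    """q_hat_eps(x): fraction of sample curves within distance eps of x."""
    return float(np.mean([metric(x, c) <= eps for c in curves]))

curves = [np.zeros(5), np.full(5, 0.1), np.full(5, 3.0)]
q = quasi_density(np.zeros(5), curves, eps=0.5)
```

Here two of the three curves lie within $\epsilon = 0.5$ of the zero curve, so $\hat q_\epsilon = 2/3$.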
Example 20 Figure 7 shows the tracks (or paths) of 40 North Atlantic tropical cyclones (TC). The full dataset, consisting of 608 tracks from 1950 to 2006, is shown in Figure 8. Buchman, Lee and Schafer (2009) provide a thorough analysis of the data. We refer the reader to their paper for the full details.
Each data point— that is, each TC track— is a curve in R2 . Various questions are of interest:
Where are the tracks most common? Is the density of tracks changing over time? Is the track
related to other variables such as windspeed and pressure?
Figure 7: Paths of 40 tropical cyclones in the North Atlantic.
Figure 9 shows the 10 TC's with highest local density and the 10 TC's with lowest local density using $\epsilon = 16.38$. This choice of $\epsilon$ corresponds to the 10th percentile of the values $\{d_H(X_i, X_j) : i \ne j\}$. The high density tracks correspond to TC's in the Gulf of Mexico with short paths. The low density tracks correspond to TC's in the Atlantic with long paths.
Figure 8: Paths of 608 tropical cyclones in the North Atlantic.
Figure 9: 10 highest density paths (black) and 10 lowest density paths (blue).