
Mass Volume Curves and Anomaly Ranking

Stéphan Clémençon
UMR LTCI No. 5141
Telecom ParisTech/CNRS
Institut Mines-Telecom
Paris, 75013, France
July 30, 2013

Abstract
This paper aims at formulating the issue of ranking multivari-
ate (unlabeled) observations depending on their degree of abnormal-
ity/novelty as an unsupervised statistical learning task. In the 1-d
situation, this problem is usually tackled by means of tail estimation
techniques: univariate observations are viewed as all the more ”abnor-
mal” as they are located far in the tail(s) of the underlying probability
distribution. In a wide variety of applications, it would be desirable as
well to have at one's disposal a scalar-valued ”scoring” function allowing for
comparison of the degrees of abnormality of multivariate observations. Here
we formulate the issue of scoring anomalies as an M-estimation prob-
lem by means of a novel (functional) performance criterion, referred
to as the Mass Volume curve (MV curve in short), whose optimal
elements are, as expected, nondecreasing transforms of the density.
The question of statistical estimation of the MV curve of a scoring
function is tackled in this paper, as well as that of empirical optimiza-
tion of this criterion, interpreted as a continuum of minimum volume
set estimation problems. A discretization approach is proposed here,
consisting of estimating a sequence of N ≥ 1 minimum volume sets
and next constructing a piecewise constant scoring function sN (x)
based on the latter. Viewing its MV curve as a statistical version of
a stepwise approximant of the optimal MV curve, rate bounds are
established, describing the generalization ability of the methodology

promoted. The algorithm also provides, as a byproduct, an estimate
of the optimal MV curve.

Keywords. Anomaly detection, M-estimation, minimum volume set, ranking, scoring function, unsupervised learning.

1 INTRODUCTION
In a wide variety of applications, ranging from the monitoring of aircraft en-
gines in aeronautics to non-destructive quality control in industry, through
fraud detection, network intrusion surveillance or system management in data
centers (e.g. see [22]), anomaly detection is of crucial importance. In most
common situations, anomalies correspond to ”rare” observations and must
be detected in the absence of any feedback (e.g. expert’s opinion) based on an
unlabeled dataset. In practice, the very purpose of anomaly detection tech-
niques is to rank observations by degree of abnormality/novelty, rather than
simply assigning them a binary label, ”abnormal” vs ”normal”. In the case
of univariate observations, abnormal values are generally those which are ex-
tremes, i.e. ”too large” or ”too small” in regard to central quantities such
as the mean or the median, and anomaly detection may then derive from
standard tail distribution analysis: the farther in the tail the observation
lies, the more ”abnormal” it is considered. In contrast, it is far from easy
to formulate the issue in a multivariate context. In the present paper, moti-
vated by applications such as those aforementioned, we place ourselves in the
situation where (unlabeled) observations take their values in a possibly very
high-dimensional space, X ⊂ Rd with d ≥ 1 say, making approaches based
on nonparametric density estimation or multivariate (heavy-) tail modelling
hardly feasible, if not unfeasible, due to the curse of dimensionality. In this
framework, a variety of statistical techniques, relying on the concept of mini-
mum volume set investigated in the seminal contributions of [9] and [16] (see
section 2), have been developed in order to split the feature space X into two
halves and decide whether observations should be considered as ”normal”
(namely when lying in the minimum volume set Ωα ⊂ X estimated on the
basis of the dataset available) or not (when lying in the complementary set
X \ Ωα ) with a given confidence level 1 − α ∈ (0, 1). One may also refer to
[18] and to [14] for closely related notions. The problem considered here is of
different nature, the goal pursued is not to assign to all possible observations

a label ”normal” vs ”abnormal”, but to rank them according to their level
of ”abnormality”. The most natural way to define a preorder on the fea-
ture space X is to transport the natural order on the real line through some
(measurable) scoring function s : X → R+ : the ”smaller” the score s(X), the
more likely the observation X is viewed as ”abnormal”. This problem shall
be here referred to as anomaly scoring. It can be somehow related to the
literature dedicated to statistical depth functions in nonparametric statistics
and operations research, see [23] and the references therein. Such parametric
functions are generally proposed ad hoc in order to define a ”center” for the
probability distribution of interest and a notion of distance to the latter, e.g.
the concept of ”center-outward ordering” induced by halfspace depth in [21]
or [6]. The angle embraced in this paper is quite different; its objective is
indeed twofold: 1) propose a performance criterion for the anomaly scoring
problem so as to formulate it in terms of M-estimation; 2) investigate the
accuracy of scoring rules which optimize empirical estimates of the criterion
thus tailored.
Due to the global nature of the ranking problem, the criterion we pro-
mote is functional, just like the Receiver Operating Characteristic (ROC) and
Precision-Recall curves in the supervised ranking setup (i.e. when a binary
label, e.g. ”normal” vs ”abnormal”, is assigned to the sampling data), and
shall be referred to as the Mass Volume curve (MV curve in abbreviated
form). The latter induces a partial preorder on the set of scoring functions:
the collection of optimal elements is defined as the set of scoring functions
whose MV-curve is minimum everywhere and is proved to coincide, as ex-
pected, with the set of increasing transforms of the underlying probability density. In the
unsupervised setting, MV curve analysis is shown to play a role quite compa-
rable to that of ROC curve analysis for supervised anomaly detection. The
issue of estimating the MV curve of a given scoring function based on sam-
pling data is then tackled and a smooth bootstrap method for constructing
confidence regions is analyzed. A statistical methodology to build a nearly
optimal scoring function is next described, which works as follows: first,
discretize the problem; second, solve a few minimum volume set estimation
problems where confidence levels are fixed in advance (see [18]); and third,
combine the estimated anomaly/novelty detection rules in order to build a
piecewise constant scoring function. The MV curve of the scoring rule thus
produced can be related to a stepwise approximant of the (unknown) opti-
mal MV curve, which makes it possible to establish the generalization ability of the
algorithm through rate bounds in terms of sup norm in the MV space. We

also show how to get, as a byproduct, a piecewise constant estimate of the
optimal MV curve.
The rest of the article is structured as follows. Section 2 describes the
mathematical framework, sets out the main notations and recalls the crucial
notions related to anomaly/novelty detection on which the analysis carried
out in the paper relies. Section 3 first provides an informal description of
the anomaly scoring problem and then introduces the MV curve criterion,
dedicated to evaluating the performance of any scoring function. The set of
optimal elements is described and statistical results related to the estimation
of the MV curve of a given scoring function are stated in section 4. Statistical
learning of an anomaly scoring function is then formulated as a functional M -
estimation problem in section 5, while section 6 is devoted to the study of a
specific algorithm for the design of nearly optimal anomaly scoring functions.
Technical proofs are deferred to the Appendix section. We finally point
out that a very preliminary version of this work has been outlined in the
conference paper [1].

2 BACKGROUND AND PRELIMINARIES


As a first go, we start off with describing the mathematical setup and recalling
key concepts in anomaly detection involved in the subsequent analysis.

2.1 Framework and Notations


Here and throughout, we suppose that we observe independent and identi-
cally distributed realisations X1 , . . . , Xn of an unknown continuous prob-
ability distribution F (dx), copies of a generic random variable X, taking
their values in a (possibly very high dimensional) feature space X ⊂ Rd ,
with d ≥ 1. The density of the random variable X with respect to λ(dx),
Lebesgue measure on Rd , is denoted by f (x), its support by suppF and the
indicator function of any event E by I{E}. The generalized inverse of any
cdf function G(t) on R is defined by G−1 (u) = inf{t ∈ R : G(t) ≥ u}.
The essential supremum of any nonnegative r.v. Z is denoted by ||Z||∞ ,
the Dirac mass at any point a by δa . The notation OP (1) is used to mean
boundedness in probability. A natural way of defining preorders on X is to
map its elements onto R+ and use the natural order on the real half-line. For
any measurable function s : X → R+ , we denote by ⪯s the preorder on X

defined by

∀(x, x′) ∈ X², x ⪯s x′ iff s(x) ≤ s(x′).

We denote the level sets of any scoring function s by:

Ωs,t = {x ∈ X : s(x) ≥ t}, t ∈ [−∞, +∞].

Observe that the collection (Ωs,t) is decreasing (for the inclusion, as t increases from
−∞ to +∞):
∀(t, t′) ∈ R², t ≥ t′ ⇒ Ωs,t ⊂ Ωs,t′
and that limt→+∞ Ωs,t = ∅ and limt→−∞ Ωs,t = X.¹ When the function
s(x) is additionally integrable w.r.t. Lebesgue measure, it is called a scoring
function. The set of all scoring functions is denoted by S.

¹ Recall that a sequence (An)n≥1 of subsets of an ensemble E is said to converge iff
lim sup An = lim inf An. In such a case, one defines its limit, denoted by lim An, as
lim An = lim sup An = lim inf An.
The following quantities shall also be used in the sequel. For any scoring
function s and threshold level t ≥ 0, define:

αs (t) = P{s(X) ≥ t} and λs (t) = λ ({x ∈ X : s(x) ≥ t}) .

The quantity αs (t) is referred to as the mass of the level set Ωs,t , while
λs(t) is generally termed the volume (w.r.t. Lebesgue measure). Notice
that the volumes are finite on R*+ : ∀t > 0, λs(t) ≤ (1/t) ∫_X s(x) dx < +∞.
Reciprocally, for any (total) preorder ⪯ on X, one may define a scoring
function s such that ⪯ coincides with ⪯s. Indeed, for all x ∈ X, consider
the (supposedly measurable) set Ω⪯(x) = {x′ ∈ X : x′ ⪯ x}. Then, for any
finite positive measure µ(dx) with X as support, define the scoring function
s_{µ,⪯}(x) = ∫_{x′∈X} I{x′ ∈ Ω⪯(x)} µ(dx′). It is immediate to check that s_{µ,⪯}
induces the preorder ⪯ on X.


For any α ∈ (0, 1) and any scoring function s, Q(s, α) = inf{u ∈ R :
P{s(X) ≤ u} ≥ 1 − α} denotes the quantile at level 1 − α of s(X)’s distri-
bution throughout the paper. We also set Q∗ (α) = Q(f, α) for all α ∈ (0, 1).
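For illustration purposes only, the quantities αs(t), λs(t) and Q(s, α) introduced above can be approximated numerically from a sample and a candidate scoring function. The sketch below is ours and not part of the methodology analyzed in the paper: it estimates the mass by an empirical frequency, the volume by Monte Carlo integration over a user-supplied bounding box, and the quantile by an empirical quantile; the function `score`, the box and all helper names are assumptions made for the example.

```python
import numpy as np

def mass_alpha(score, X, t):
    """Empirical counterpart of alpha_s(t) = P{s(X) >= t}."""
    return np.mean(score(X) >= t)

def volume_lambda(score, t, box_low, box_high, n_mc=100_000, rng=None):
    """Monte Carlo estimate of lambda_s(t) = Leb({x : s(x) >= t}),
    assuming the level set is contained in the box [box_low, box_high]^d."""
    rng = np.random.default_rng(rng)
    d = len(box_low)
    U = rng.uniform(box_low, box_high, size=(n_mc, d))
    box_vol = np.prod(np.asarray(box_high, float) - np.asarray(box_low, float))
    return box_vol * np.mean(score(U) >= t)

def quantile_Q(score, X, alpha):
    """Empirical version of Q(s, alpha), the (1 - alpha)-quantile of s(X)'s distribution."""
    return np.quantile(score(X), 1.0 - alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 2))                          # toy 2-d Gaussian sample
    score = lambda Z: np.exp(-0.5 * (Z ** 2).sum(axis=1))   # a density-like scoring function
    t = quantile_Q(score, X, alpha=0.9)
    print("alpha_s(t) ~", mass_alpha(score, X, t))          # should be close to 0.9
    print("lambda_s(t) ~", volume_lambda(score, t, [-5, -5], [5, 5], rng=1))
```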

2.2 Minimum Volume Sets


The notion of minimum volume sets has been introduced and investigated
in the seminal contributions [9] and [16] in order to describe regions where
a multivariate r.v. X takes its values with highest/smallest probability. Let
α ∈ (0, 1); a minimum volume set Ω∗α of mass at least α is any solution of
the constrained minimization problem

min λ(Ω) subject to P{X ∈ Ω} ≥ α, (1)


where the minimum is taken over all measurable subsets Ω of X. Applications
of this concept include in particular novelty/anomaly detection: for large
values of α, abnormal observations (outliers) are those which belong to the
complementary set X \ Ω∗α . In the continuous setting, it can be shown that
there exists a threshold value t∗α := Q(f, α) ≥ 0 such that the level set Ωf,t∗α is
a solution of the constrained optimization problem above. The (generalized)
quantile function is then defined by:

∀α ∈ (0, 1), λ∗(α) := λ(Ω∗α).

The following assumptions shall be used in the subsequent analysis.


A1 : the density f is bounded, i.e. ||f (X)||∞ < +∞.

A2 : the density f has no flat parts, i.e. for any constant c ≥ 0,

P{f (X) = c} = 0.

Under the hypotheses above, for any α ∈ (0, 1), there exists a unique mini-
mum volume set Ω∗α (equal to Ωf,t∗α up to subsets of null F -measure), whose
mass is equal to α exactly. Additionally, the mapping λ∗ is continuous on
(0, 1) and uniformly continuous on [0, 1 − ε] for all ε ∈ (0, 1) (when the sup-
port of F (dx) is compact, uniform continuity holds on the whole interval
[0, 1]).
From a statistical perspective, estimates Ω̂∗α of minimum volume sets are
built by replacing the unknown probability distribution F by its empirical
version Fn = (1/n) ∑_{i=1}^{n} δ_{Xi} and restricting optimization to a collection A of
borelian subsets of X , supposed rich enough to include all density level sets
(or reasonable approximations of the latter). In [16], functional limit results
are derived for the generalized empirical quantile process {λ(Ω̂∗α) − λ∗(α)}
under certain assumptions for the class A (stipulating in particular that A
is a Glivenko-Cantelli class for F (dx)). In [18], it is proposed to replace the
level α by α − φn where φn plays the role of tolerance parameter. The latter

should be chosen of the same order as the supremum supΩ∈A |Fn (Ω) − F (Ω)|
roughly, complexity of the class A being controlled by the VC dimension
or by means of the concept of Rademacher averages, so as to establish rate
bounds at n < +∞ fixed.
Alternatively, so-termed plug-in techniques, consisting in computing first
an estimate f̂ of the density f and considering next level sets Ω_{f̂,t} of the
resulting estimator, have been investigated in several papers, among which
[20] or [17] for instance. Such an approach however yields significant com-
putational issues even for moderate values of the dimension, inherent to the
curse of dimensionality phenomenon.

3 RANKING ANOMALIES
In this section, the issue of scoring observations depending on their level
of novelty/abnormality is first described in an informal manner and next
formulated quantitatively, as a functional optimization problem, by means
of a novel concept, termed the Mass Volume curve.

3.1 Overall Objective


The problem considered in this article is to learn a scoring function s, based
on training data X1 , . . . , Xn , so as to describe the extremal behavior of
the (high dimensional) random vector X by that of the univariate variable
s(X), which can be summarized by its tail behavior near 0: hopefully, the
smaller the score s(X), the more abnormal/rare the observation X should
be considered. Hence, an optimal scoring function should naturally make it possible
to rank observations X by increasing order of magnitude of f (X). The set
of optimal scoring functions is then given by:
S ∗ = {T ◦ f ∈ S with T : Imf (X) → R+ strictly increasing},
denoting by Imf (X) the image of the mapping f (X).
In the framework we develop, anomaly scoring boils down to recovering
the decreasing collection of all level sets of the density function f (x), {Ω∗α :
α ∈ (0, 1)}, without necessarily knowing the corresponding levels. Indeed,
one may easily check that any scoring function of the form

s(x) = ∫₀¹ I{x ∈ Ω∗α} dµ(α),    (2)

where µ(dα) is an arbitrary finite positive measure dominating the Lebesgue
measure on (0, 1), belongs to S ∗ . Observe that H(f (x)), where H denotes
the cdf of the r.v. f (X), corresponds to the case where µ is chosen to be
the Lebesgue measure on (0, 1) in Eq. (2). The anomaly scoring problem
can be thus cast as an overlaid collection of minimum volume set estimation
problems. This observation shall turn out to be very useful when designing
practical statistical learning strategies in section 6.
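As an illustration of this observation, the particular optimal scoring function H(f(x)) can be approximated by the empirical cdf of the density evaluated on a sample. The sketch below is a toy example under our own assumptions (a known density for a 2-d Gaussian, helper names of our choosing); it only uses the ordering induced by f, which is the point of the remark above.

```python
import numpy as np

def optimal_score_Hf(density, X):
    """Return a callable approximating H(f(x)), where H is the cdf of the r.v. f(X),
    estimated here by the empirical cdf of density(X_i), i = 1, ..., n."""
    fvals = np.sort(density(X))
    n = len(fvals)
    def score(x):
        # fraction of sample points whose density is <= f(x): an increasing transform of f
        return np.searchsorted(fvals, density(np.atleast_2d(x))[0], side="right") / n
    return score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 2))
    f = lambda Z: np.exp(-0.5 * (np.atleast_2d(Z) ** 2).sum(axis=1)) / (2.0 * np.pi)
    s = optimal_score_Hf(f, X)
    print([round(s(np.array([r, 0.0])), 3) for r in (0.0, 1.0, 2.0, 3.0)])  # decreasing in |x|
```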

3.2 A Functional Criterion: the Mass Volume Curve


We now introduce the concept of Mass Volume curve and show that it
is a natural criterion to evaluate the accuracy of decision rules in regard to
anomaly scoring.
Definition 1. (True Mass Volume curve) Let s ∈ S. Its Mass Vol-
ume curve (MV curve in abbreviated form) with respect to the probability
distribution of the r.v. X is the parametrized curve:
t ∈ R+ ↦ (αs(t), λs(t)) ∈ [0, 1] × [0, +∞).
In addition, if αs has no flat parts, the MV curve can also be defined as the
plot of the mapping

MVs : α ∈ (0, 1) ↦ MVs(α) := λs ∘ αs⁻¹(α).

By convention, the MV curve of s(x) is identified as the graph of the mapping
MVs on [0, 1). One sets MVs(1) = +∞ if suppF is of infinite Lebesgue measure,
and MVs(1) = λ(suppF) otherwise.
Equipped with this definition, observe that the MV curve of a scoring
function s may have jumps, just like the graph of the function αs⁻¹. In
particular, the MV curve of a piecewise constant scoring function is stepwise.
Remark 1. (Connection to ROC analysis) We point out that the curve
α ∈ (0, 1) ↦ 1 − λs ∘ αs⁻¹(1 − α) resembles a receiver operating characteristic
(ROC) curve of the test/diagnostic function s(x) (see [8]), except that the
distribution under the alternative hypothesis is not a probability measure but
the image of Lebesgue measure on X by the function s, while that under the
null hypothesis is the probability distribution of the r.v. s(X). Hence, the
curvature of the MV graph somehow measures the extent to which the spread
of the distribution of s(X) differs from that of a uniform distribution.

Figure 1: Mass Volume curves (mass on the horizontal axis, from 0 near the modes to 1 in the extremal region; volume on the vertical axis; three curves are displayed, labeled Optimal, Good and Bad)

This functional criterion induces a partial order over the set of all scoring
functions. Let s1 and s2 be two scoring functions on X; the ordering provided
by s1 is better than that induced by s2 when

∀α ∈ (0, 1), MVs1 (α) ≤ MVs2 (α).

Typical MV curves are illustrated in Fig. 1. A desirable MV curve increases


slowly and rises near 1, just like the lowest curves of Fig. 1. This corresponds
to the situation where the distribution of the r.v. X is highly concentrated
around its modes and the highest values (respectively, the lowest values)
taken by s are located near the modes of F (dx) (respectively, in the tail
region of F (dx)). The MV curve of the scoring function s(x) is then close
to the right lower corner of the Mass Volume space. We point out that, in
certain situations, only some parts of the MV curve may be of interest,
corresponding to large values of α when focus is on extremal observations
(the tail region of the r.v. X) and to small values of α when modes of the
underlying distribution are investigated.
The result below shows that optimal scoring functions are those whose
MV curves are minimum everywhere.

Proposition 1. (Optimal MV curve) Let assumptions A1 − A2 be ful-
filled. The elements of the class S ∗ have the same MV curve and provide the
best possible ordering of X ’s elements in regard to the MV curve criterion:

∀(s, α) ∈ S × (0, 1), MV∗ (α) ≤ MVs (α), (3)

where MV∗ (α) = MVf (α) for all α ∈ (0, 1).


In addition, we have: ∀(s, α) ∈ S × (0, 1),

0 ≤ MVs(α) − MV∗(α) ≤ λ(Ω∗α ∆ Ωs,Q(s,α)),    (4)

where ∆ denotes the symmetric difference.

The proof is immediate based on the results recalled in subsection 2.2 and
is left to the reader. Incidentally, notice that, equipped with the notations
introduced in 2.2, we have λ∗ (α) = MV∗ (α) for all α ∈ (0, 1). We also
point out that bound (4) reveals that the pointwise difference between the
optimal MV curve and that of a scoring function candidate s(x) is controlled
by the error made in recovering the specific minimum volume set Ω∗α through
Ωs,Q(s,α) .

Example 1. (Gaussian distribution) In the case when X is a Gaussian
variable N(0, 1), we have MV∗(α) = 2Φ⁻¹((1 + α)/2), where Φ(x) =
(2π)^{-1/2} ∫_{−∞}^{x} exp(−u²/2) du.
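As a quick numerical check of this closed form (a sketch for illustration only; the function names and the use of scipy are our own), one can compare 2Φ⁻¹((1 + α)/2) with an empirical estimate of the length of the symmetric interval capturing a mass α of a standard Gaussian sample.

```python
import numpy as np
from scipy.stats import norm

def mv_star_gaussian(alpha):
    """Closed-form optimal MV curve of the standard Gaussian: 2 * Phi^{-1}((1 + alpha)/2)."""
    return 2.0 * norm.ppf((1.0 + alpha) / 2.0)

def mv_empirical_gaussian(alpha, sample):
    """Empirical counterpart: the minimum volume set of mass alpha is the symmetric
    interval {|x| <= q}, whose length is estimated by the empirical alpha-quantile of |X|."""
    return 2.0 * np.quantile(np.abs(sample), alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=100_000)
    for alpha in (0.5, 0.9, 0.99):
        print(alpha, mv_star_gaussian(alpha), mv_empirical_gaussian(alpha, sample))
```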

The following result reveals that the optimal MV curve is convex and
provides a closed analytical form for its derivative.

Proposition 2. (Convexity and Derivative) Suppose that hypotheses


A1 − A2 are satisfied. Then, the optimal curve α ∈ (0, 1) ↦ MV∗(α) is con-
vex. In addition, if f is differentiable with a gradient taking nonzero values
on the boundary ∂Ω∗α = {x ∈ X : f(x) = Q∗(α)}, then MV∗ is differentiable at
α ∈ [0, 1[ and:

MV∗′(α) = (1/f(Q∗(α))) ∫_{x∈∂Ω∗α} µ(dx)/||∇f(x)||,

where µ denotes the Hausdorff measure on ∂Ω∗α.

See the Appendix section for the technical proof. Elementary properties
of MV curves are summarized in the following proposition.

Proposition 3. (Properties of MV curves) For any s ∈ S, the follow-
ing assertions hold true.
1. Limit values. We have MVs(0) = 0 and lim_{α→1} MVs(α) = λ(suppF).

2. Invariance. For any strictly increasing function ψ : R+ → R+, we
have MVs = MVψ◦s.

3. Monotonicity. The mapping α ∈ (0, 1) ↦ MVs(α) is strictly increasing.

4. Differentiability. Suppose that the distribution of s(X) is continuous
and has no plateau. If, in addition, t ∈ (0, +∞) ↦ λs(t) is differentiable and s(x) is
Lipschitz, then MVs is differentiable on (0, 1) and:

MVs′(α) = − (1/αs′(αs⁻¹(α))) ∫_{s⁻¹({αs⁻¹(α)})} µ(dy)/|∇f(y)|,

for all α ∈ (0, 1), denoting by µ(dy) the Hausdorff measure on the
boundary s⁻¹({αs⁻¹(α)}).
These assertions straightforwardly result from Definition 1; details are omitted,
except those related to the derivative formula, which are given in the Appendix.

4 Statistical Estimation
In practice, MV curves are unknown, just like the probability distribution of
the r.v. X, and must be estimated based on the observed sample X1 , . . . , Xn .
Replacing the mass of each level set by its statistical counterpart in Definition
1 leads to the notion of the empirical MV curve. We set, for all t ≥ 0,

α̂s(t) := (1/n) ∑_{i=1}^{n} I{s(Xi) > t} = ∫ I{u > t} F̂s(du),    (5)

where F̂s = (1/n) ∑_{i≤n} δ_{s(Xi)} denotes the empirical distribution of the s(Xi)’s.
Notice that α̂s takes its values in the set {k/n : k = 0, . . . , n}.
Definition 2. (Empirical MV curve) Let s ∈ S. By definition, the
empirical MV curve of s is the graph of the (piecewise constant) function

MV̂s : α ∈ [0, 1) ↦ λs ∘ α̂s⁻¹(α).
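In practice, the empirical MV curve is a step function that can be evaluated from the observed scores s(X₁), . . . , s(Xₙ) once the volume λs(·) of the level sets of s is available (in closed form or by numerical integration). The sketch below is a minimal illustration under these assumptions; the helper names, the quantile convention and the 2-d Gaussian demo are ours.

```python
import numpy as np

def empirical_mv_curve(scores, volume, alphas):
    """Evaluate the empirical MV curve of a scoring function at mass levels `alphas`.

    scores : 1-d array of observed scores s(X_1), ..., s(X_n)
    volume : callable t -> lambda_s(t) = Leb({x : s(x) >= t})
    For each alpha, the threshold t is the empirical (1 - alpha)-quantile of the
    scores, so that a fraction of about alpha of the scores lies above t; the
    curve value is then the volume of the corresponding level set.
    """
    scores = np.asarray(scores)
    return np.array([volume(np.quantile(scores, 1.0 - a)) for a in alphas])

# Demo: 2-d standard Gaussian scored by its own density; the level set
# {s >= t} is then a disk whose area is known in closed form (assumed setting).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20_000, 2))
    s = lambda Z: np.exp(-0.5 * (Z ** 2).sum(axis=1)) / (2.0 * np.pi)
    disk_area = lambda t: np.pi * max(-2.0 * np.log(2.0 * np.pi * t), 0.0)
    alphas = np.linspace(0.05, 0.95, 10)
    print(empirical_mv_curve(s(X), disk_area, alphas))
```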

In order to obtain a smoothed version α̃s(t), a typical strategy consists of
replacing the empirical distribution estimate F̂s involved in Eq. (5) by the
continuous distribution F̃s with density f̃s(u) = (1/n) ∑_{i=1}^{n} Kh(s(Xi) − u),
with Kh(u) = h⁻¹K(h⁻¹·u), where K ≥ 0 is a regularizing Parzen-Rosenblatt
kernel (i.e. a bounded square integrable function such that ∫ K(v)dv = 1)
and h > 0 is the smoothing bandwidth; see the discussion in subsection 4.2
for a practical view of smoothing.
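A minimal sketch of this smoothing step with a Gaussian kernel (our choice; any Parzen-Rosenblatt kernel would do) is given below: integrating the kernel shows that F̃s is an average of normal cdf's centered at the observed scores, and α̃s = 1 − F̃s. The bandwidth choice in the demo only echoes the n^{-1/5} order discussed in subsection 4.2.

```python
import numpy as np
from scipy.stats import norm

def smoothed_mass(scores, t, h):
    """Smoothed version alpha_tilde_s(t) = 1 - F_tilde_s(t), where F_tilde_s has density
    (1/n) sum_i K_h(s(X_i) - u) with a Gaussian kernel K_h of bandwidth h; integrating
    the kernel gives an average of normal cdf's."""
    scores = np.asarray(scores)
    return 1.0 - np.mean(norm.cdf((t - scores) / h))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=2000)
    h = len(scores) ** (-1.0 / 5.0)
    t = 0.5
    print("empirical mass :", np.mean(scores > t))
    print("smoothed mass  :", smoothed_mass(scores, t, h))
```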

4.1 Consistency and Asymptotic Normality


The theorem below reveals that, under mild assumptions, the empirical MV
curve is a consistent and asymptotically Gaussian estimate of the MV curve,
uniformly over any subinterval of [0, 1[. It involves the assumptions listed
below. Let ε ∈ (0, 1) be fixed.
A3 The r.v. s(X) is bounded, i.e. ||s(X)||∞ < +∞, and has a continuous
distribution Fs (dt) with differentiable density fs (t) such that:

∀α ∈ [0, 1 − ε], fs(αs⁻¹(α)) > 0

and, for some γ > 0,

sup_{α∈[0,1−ε]} | (d/dα) log(fs ∘ αs⁻¹)(α) | ≤ γ < +∞.

A4 The mapping λs is of class C 2 .


Theorem 1. Let 0 < ε ≤ 1 and s ∈ S. Assume that Assumptions A3 − A4
are fulfilled. The following assertions hold true.
(i) (Consistency) With probability one, we have, uniformly over [0, 1 − ε]:

lim_{n→+∞} MV̂s(α) = MVs(α).

(ii) (Strong approximation) There exists a sequence of Brownian bridges
(W⁽ⁿ⁾(α))_{α∈(0,1)} such that we almost surely have, uniformly over the
compact interval [0, 1 − ε], as n → ∞:

√n ( MV̂s(α) − MVs(α) ) = Z⁽ⁿ⁾(α) + o(vn),

with vn = (log log n)^{ρ1(γ)} (log n)^{ρ2(γ)} / √n for all n ≥ 1, where

(ρ1(γ), ρ2(γ)) = (0, 1) if γ < 1,  (0, 2) if γ = 1,  (γ, γ − 1 + η) if γ > 1,

the parameter η > 0 being arbitrary, and

Z⁽ⁿ⁾(α) = ( λs′(αs⁻¹(α)) / fs(αs⁻¹(α)) ) W⁽ⁿ⁾(α), for α ∈ (0, 1).

See the Appendix for the technical proof, which relies on standard strong
approximation results for the quantile process, see [3]. Assertion (ii) means
that the fluctuation process {√n(MV̂s(α) − MVs(α))}_{α∈[0,1−ε]} converges, in
the space of càd-làg functions on [0, 1 − ε] equipped with the sup norm, to
the law of a Gaussian stochastic process {Z⁽¹⁾(α)}_{α∈[0,1−ε]}.
Remark 2. (Asymptotic normality) It results from Assertion (ii) in
Theorem 1 that, for any α ∈ (0, 1), the pointwise estimator MV̂s(α) is asymp-
totically Gaussian under Assumptions A3 − A4. For all α ∈ (0, 1), we have
the convergence in distribution: √n{MV̂s(α) − MVs(α)} ⇒ N(0, σs²), as
n → +∞, with σs² = α(1 − α)(λs′(αs⁻¹(α))/fs(αs⁻¹(α)))².

4.2 Confidence regions in the Mass Volume space


Beyond consistency of the empirical curve in sup norm and the asymptotic
normality of the fluctuation process, we now tackle the question of con-
structing confidence bands for the true MV curve of a given scoring function
s(x). In practice, it is difficult to construct confidence regions from simu-
lated Brownian bridges, based on the approximation stated in Theorem 1.
From a computational perspective, a bootstrap approach (see [7]) should be
preferred. Following in the footsteps of [19], it is recommended to imple-
ment a smoothed bootstrap procedure. The asymptotic validity of such a
resampling method immediately derives from the strong approximation re-
sult previously stated, by standard coupling arguments. The latter suggests
to consider, as an estimate of the law of the fluctuation process rn, with
rn(α) = √n(MV̂s(α) − MVs(α)) for α ∈ [0, 1), the conditional law given the
original sample Dn = {X1, . . . , Xn} of the bootstrapped fluctuation process

rn* = {√n(MVs^Boot(α) − MV̂s(α))}_{α∈[0,1)},    (6)

where MVs^Boot is the empirical MV curve of the scoring function s based
on a sample of i.i.d. random variables with a common distribution F̃s close
to F̂s = (1/n) ∑_{i≤n} δ_{s(Xi)}. The difficulty is twofold. First, the target is
a distribution on a path space, namely a subspace of the Skorohod space
D([0, 1 − ε]) of càd-làg real-valued functions on [0, 1 − ε] with ε ∈ (0, 1),
equipped with the sup norm. Second, rn is a functional of the quantile process
{F̂s⁻¹(α)}_{α∈(0,1)}. The naive bootstrap, i.e. resampling from the raw empirical
distribution, generally provides bad approximations of the distribution of
empirical quantiles: the rate of convergence for a given quantile is indeed of
order OP (n−1/4 ), see [10], whereas the rate of the Gaussian approximation is
O(n−1/2 ). The same phenomenon may be naturally observed for MV curves.
In a similar manner to what is usually recommended for empirical quantiles,
a smoothed version of the bootstrap algorithm shall be implemented in order
to improve the approximation rate of the distribution of sup_{α∈[0,1−ε]} |rn(α)|,
namely by resampling the data from a smoothed version of the
empirical distribution F̂n.
Here we describe the algorithm for building a confidence band at level
1 − η in the MV space from sampling data Dn = {Xi : i = 1, . . . , n}. It is
performed in four steps as follows.

Algorithm - Smoothed MV Curve Bootstrap

1. Based on the sample Dn, compute the empirical estimate α̂s = 1 − F̂s,
as well as its smoothed version α̃s = 1 − F̃s. Plot the MV curve estimate
{(α, MV̂s(α)) : α ∈ [0, 1)}.

2. From the smooth distribution estimate F̃s, draw a bootstrap sample
Dn^Boot conditioned on the original sample Dn.

3. Based on Dn^Boot, compute the bootstrap version αs^Boot of the empirical
estimate α̂s. Plot the bootstrap MV curve, i.e. the graph of the mapping

α ∈ [0, 1) ↦ MVs^Boot(α) = λs ∘ (αs^Boot)⁻¹(α).

4. Finally, get the bootstrap confidence band at level 1 − η defined by
the ball of center MV̂s and radius δη/√n in D([0, 1 − ε]), where δη is
defined by P*( sup_{α∈[0,1−ε]} |rn*(α)| ≤ δη ) = 1 − η, denoting by P*(·) the
conditional probability given the original data Dn.
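For concreteness, the four steps translate into the sketch below. It is our own illustration, not a reference implementation: smoothing is realized through the Gaussian-perturbation device discussed after the algorithm, the sup deviation is evaluated on a user-chosen grid of mass levels rather than on the pooled jump points, and the volume λs is assumed to be available as a callable.

```python
import numpy as np

def mv_curve(scores, volume, alphas):
    """Step-function MV curve: threshold = empirical (1 - alpha)-quantile of the scores."""
    return np.array([volume(np.quantile(scores, 1.0 - a)) for a in alphas])

def smoothed_mv_bootstrap(scores, volume, alphas, B=200, h=None, eta=0.05, rng=None):
    """Sketch of the smoothed MV curve bootstrap: resample the scores with replacement,
    perturb them with N(0, h^2) noise (equivalent to sampling from a Gaussian-kernel
    smoothing of F_hat_s), recompute the MV curve and record the sup deviation."""
    rng = np.random.default_rng(rng)
    scores = np.asarray(scores)
    n = len(scores)
    if h is None:
        h = np.std(scores) * n ** (-1.0 / 5.0)     # bandwidth of order n^{-1/5} (our choice)
    mv_hat = mv_curve(scores, volume, alphas)
    sups = np.empty(B)
    for b in range(B):
        boot = rng.choice(scores, size=n, replace=True) + h * rng.normal(size=n)
        sups[b] = np.sqrt(n) * np.max(np.abs(mv_curve(boot, volume, alphas) - mv_hat))
    delta_eta = np.quantile(sups, 1.0 - eta)
    band = delta_eta / np.sqrt(n)
    return mv_hat - band, mv_hat + band            # confidence band around the empirical curve

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 2))
    s = lambda Z: np.exp(-0.5 * (Z ** 2).sum(axis=1)) / (2.0 * np.pi)

    def disk_area(t):
        # area of {x : (2*pi)^{-1} exp(-|x|^2/2) >= t} for the 2-d standard Gaussian,
        # with t clipped to the range of the density to keep the log well defined
        t = float(np.clip(t, 1e-12, 1.0 / (2.0 * np.pi)))
        return np.pi * (-2.0 * np.log(2.0 * np.pi * t))

    alphas = np.linspace(0.1, 0.9, 9)
    lo, hi = smoothed_mv_bootstrap(s(X), disk_area, alphas, B=100, rng=1)
    print(np.round(lo, 2))
    print(np.round(hi, 2))
```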

Before turning to the theoretical analysis of this algorithm, its descrip-
tion calls for a few comments. From a computational perspective, the smoothed
bootstrap distribution must be approximated in its turn, by means of a
Monte-Carlo approximation scheme. As shown in [19], an effective way of
doing this in practice, while reproducing theoretical advantages of smooth-
ing, consists of drawing B ≥ 1 bootstrap samples of size n, with replacement,
in the original dataset and next perturbing each drawn data point by indepen-
dent centered Gaussian random variables of variance h² (this procedure is
equivalent to drawing bootstrap data from a smooth estimate F̃s computed
using a Gaussian kernel Kh(u) = (2πh²)^{-1/2} exp(−u²/(2h²))). Concerning
the number of bootstrap replications, picking B = n does not modify the
rate of convergence. However, choosing B of magnitude comparable to n so
that η(1 + B) is an integer may be more appropriate: the η-quantile of the ap-
proximate bootstrap distribution is then uniquely defined, and this does not
impact the rate of convergence either, see [13]. The primary tuning parame-
ters of the algorithm previously described concern the smoothing step. When
using a Gaussian regularizing kernel, one should typically choose a bandwidth
hn of order n−1/5 so as to minimize the MSE. In addition, although it would
be fairly equivalent, regarding asymptotic analysis, to recenter by a smoothed
version MṼs = λs ∘ α̃s⁻¹ of the original empirical MV curve in the computation
of the bootstrap fluctuation process, the numerical computation of the sup
norm of the estimate (6) is much more tractable: it only requires evaluating
the distance between piecewise constant curves over the pooled set of jump
points. It should also be noticed that smoothing the original curve should
also be avoided in practice, since it hides the jump locations, which constitute
the essential part of the information.
In the subsequent analysis, the kernel K used in the smoothing step is
assumed to be ”pyramidal” (e.g. Gaussian or of the form I{u ∈ [−1, +1]}).
The next theorem reveals the asymptotic validity of the bootstrap estimate
proposed above. The proof straightforwardly results from the strong approx-
imation theorem for the quantile process stated in [2] (see also [3]), details
can be found in the Appendix section.

Theorem 2. (Asymptotic validity) Suppose that the hypotheses of The-


orem 1 are fulfilled. Assume in addition that a smoothed version of the cdf
Fs is computed at step 1 using a scaled kernel Khn(u) with hn ↓ 0 as n → ∞
in a way that n·hn³ → ∞ and n·hn⁵ log² n → 0. Then, the distribution estimate

output by the Smooth MV Curve Bootstrap Algorithm is such that

sup_{t∈R} | P*( sup_{α∈[0,1−ε]} |rn*(α)| ≤ t ) − P( sup_{α∈[0,1−ε]} |rn(α)| ≤ t ) | = oP( log(hn⁻¹) / √(n·hn) ).

The result above shows in particular that, up to logarithmic factors,


choosing hn ∼ 1/(n^{1/5} log^{2+η} n) with η > 0 leads to an approximation error
of order OP(n^{−2/5}) for the bootstrap estimate. Although this rate is slower
than that of the Gaussian approximation (cf. assertion (ii) in Theorem 1),
the smoothed bootstrap method remains very appealing from a computa-
tional perspective: it is indeed very difficult to build confidence bands from
simulated Brownian bridges in practice. Finally, it should be noticed that a
non-smoothed bootstrap of the MV curve would lead to much worse approx-
imation rates, of the order OP(n^{−1/4}) namely, see [10].

5 A M-ESTIMATION APPROACH
Now that we are equipped with the concept of Mass Volume curve, the
anomaly scoring task can be formulated as the building of a scoring function
s(x), based on the training set X1 , . . . , Xn , such that MVs is as close as
possible to the optimum MV∗. Due to the functional nature of the performance
criterion, there are many ways of measuring how close the MV curve of
a scoring function candidate is to the optimal one. The Lp-distances, for
1 ≤ p ≤ +∞, provide a relevant collection of risk measures. Let ε ∈ (0, 1) be
fixed (take ε = 0 if λ(suppF) < +∞) and consider the loss related to the
sup-norm and that related to the L1-distance:

d1(s, f) = ∫₀^{1−ε} |MVs(α) − MV∗(α)| dα,

d∞(s, f) = sup_{α∈[0,1−ε]} {MVs(α) − MV∗(α)}.

Observe that, by virtue of Proposition 1, the ”excess-risk” decomposition ap-


plies in the L1 case and the learning problem can be directly tackled through
standard M -estimation arguments:
d1(s, f) = ∫₀^{1−ε} MVs(α) dα − ∫₀^{1−ε} MV∗(α) dα.
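For curves tabulated on a grid of mass levels, these two losses are straightforward to approximate numerically. The sketch below is purely illustrative (the function names and the toy curves are ours): d1 is evaluated by the trapezoidal rule and d∞ by a maximum over a grid of [0, 1 − ε].

```python
import numpy as np

def d1_loss(mv_s, mv_star, alphas):
    """L1-type loss: integral over [0, 1 - eps] of |MV_s - MV*|, trapezoidal rule."""
    gap = np.abs(mv_s - mv_star)
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(alphas)))

def dinf_loss(mv_s, mv_star):
    """Sup-norm loss over the grid: max of MV_s - MV* (nonnegative for the optimal curve)."""
    return float(np.max(mv_s - mv_star))

if __name__ == "__main__":
    alphas = np.linspace(0.0, 0.95, 200)        # grid of [0, 1 - eps] with eps = 0.05
    mv_star = -np.log(1.0 - alphas)             # a toy convex "optimal" curve
    mv_s = mv_star + 0.1 * np.sqrt(alphas)      # a suboptimal curve lying above it
    print("d1   =", d1_loss(mv_s, mv_star, alphas))
    print("dinf =", dinf_loss(mv_s, mv_star))
```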

Hence, possible learning techniques could be based on the minimization, over
a set S0 ⊂ S of candidates, of empirical counterparts of the area under the
MV curve, such as ∫₀^{1−ε} MV̂s(α) dα. In contrast, the approach cannot be
straightforwardly extended to the sup-norm situation. A possible strategy
would be to combine M -estimation with approximation methods, so as to
”discretize” the optimization task. This would lead to replace the unknown
curve MV∗ by an approximation, a piecewise constant approximant MṼ∗
related to a subdivision ∆ : 0 < α1 < · · · < αK = 1 − ε say, and decompose
the L∞-risk as

d∞(s, f) ≤ sup_{α∈[0,1−ε]} {MVs∗∆(α) − MV∗(α)} + sup_{α∈[0,1−ε]} {MVs(α) − MVs∗∆(α)},

the first term on the right hand side of the bound above being viewed as
the bias of the statistical method. If one restricts optimization to the set of
piecewise constant scoring functions taking K + 1 values, the problem thus
boils down to recovering the bilevel sets R∗k = Ω∗αk \ Ω∗αk−1 for k = 1, . . . , K.
This simple observation paves the way for designing scoring strategies relying
on the estimation of a finite number of minimum volume sets, just like the
approach described in the next section.

6 The A-Rank algorithm


Now that the anomaly scoring problem has been rigorously formulated, we
propose a statistical method to solve it and establish learning rates in sup
norm for the latter.

6.1 Piecewise Constant Scoring Functions


We now focus on scoring functions of the simplest form, piecewise constant
functions. Let K ≥ 1 and consider an ordered collection W of K pairwise
disjoint subsets of the feature space of finite Lebesgue measure: C1 , . . . , CK .
When suppF is compact, one may suppose X of finite Lebesgue measure
and take X = ∪k≤K Ck . Then, define the piecewise constant scoring function
given by: ∀x ∈ X ,
sW(x) = ∑_{k=1}^{K} (K − k + 1) · I{x ∈ Ck}.

Its piecewise constant MV curve is given by: ∀α ∈ [0, 1[,

MVsW(α) = ∑_{k=0}^{K} λs,k · I{α ∈ [αs,k, αs,k+1[},

where αs,0 = 0 = 1 − αs,K+1, λs,0 = 0, αs,k = F(∪l≤k Cl) and λs,k = λ(∪l≤k Cl)
for all k ∈ {1, . . . , K}.
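The construction above translates directly into code. The sketch below is ours: it builds sW from an ordered list of pairwise disjoint regions given as indicator callables, and evaluates the step MV curve from the cumulative masses and volumes, which are assumed to be available (in closed form or previously estimated).

```python
import numpy as np

def make_sW(regions):
    """Piecewise constant scoring function s_W(x) = sum_k (K - k + 1) I{x in C_k},
    where `regions` = [C_1, ..., C_K] is an ordered list of indicator callables
    over pairwise disjoint subsets of the feature space."""
    K = len(regions)
    def sW(x):
        for k, indicator in enumerate(regions, start=1):
            if indicator(x):
                return K - k + 1
        return 0
    return sW

def mv_curve_sW(masses, volumes, alpha):
    """Step MV curve of s_W, where masses[k-1] = F(C_1 U ... U C_k) and
    volumes[k-1] = Leb(C_1 U ... U C_k) are the cumulative quantities, k = 1..K.
    Returns lambda_{s,k} for alpha in [alpha_{s,k}, alpha_{s,k+1})."""
    cum_mass = np.concatenate(([0.0], np.asarray(masses, float), [1.0]))
    cum_vol = np.concatenate(([0.0], np.asarray(volumes, float)))
    k = np.searchsorted(cum_mass, alpha, side="right") - 1
    return cum_vol[min(k, len(cum_vol) - 1)]

if __name__ == "__main__":
    # toy 1-d example: nested intervals peeled into disjoint rings around 0
    C = [lambda x: abs(x) < 1.0,
         lambda x: 1.0 <= abs(x) < 2.0,
         lambda x: 2.0 <= abs(x) < 3.0]
    sW = make_sW(C)
    print([sW(x) for x in (0.5, 1.5, 2.5, 4.0)])   # -> [3, 2, 1, 0]
```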
Approximation of the optimal MV curve. Let ε ∈ (0, 1) (take ε = 0
when F(dx) has compact support) and ∆ : α0 = 0 < α1 < · · · < αK =
1 − ε < αK+1 = 1 be a subdivision of the unit interval. As illustrated by Fig.
2, the MV curve of the piecewise constant scoring function

s∗∆(x) = ∑_{k=1}^{K} (K − k + 1) · I{x ∈ Ω∗αk \ Ω∗αk−1}    (7)

is a piecewise constant approximant of MV∗ related to the meshgrid ∆.
Hence, if MV∗ is Lipschitz on [0, 1 − ε] with constant κ (i.e. ∀(α, α′) ∈
[0, 1 − ε]², |MV∗(α) − MV∗(α′)| ≤ κ |α − α′|), which is naturally true when
MV∗ is of class C¹ (with κ = MV∗′(1 − ε)), we classically have the error
estimate, see [4]:

∀α ∈ [0, 1 − ε], MVs∗∆(α) − MV∗(α) ≤ κ · m∆,    (8)

with m∆ = max_{1≤k≤K} {αk − αk−1}.

6.2 Ranking by overlaying minimum volume sets


We shall now propose a statistical method to build an empirical version of the
discrete scoring function (7), which can be viewed as a discrete version of (2),
where µ is taken as the point measure ∑_{k=1}^{K} δ_{αk}. It is based on the estimation

of minimum volume sets Ωα for the specific values of the mass level α forming
the grid ∆. Sharpness of the meshgrid (the maximum spacing m∆ ) is one of
the tuning parameters that control the complexity of the method: precisely,
it governs the bias due to the approximation stage, see (8).
Let G be a class of measurable subsets of the feature space X of finite
Lebesgue measure and α ∈ (0, 1). Recall that the empirical minimum volume
set related to the class G and level α is the solution of the optimization
problem:
inf_{Ω∈G} λ(Ω) subject to Fn(Ω) ≥ α − φ,    (9)

Figure 2: Discrete collection of minimum volume sets (on the left) and MV
curve of the related piecewise constant function (on the right); the mass axis
is graduated at 0, α1, α2, . . . , 1 − ε = αK, 1.

where φ is a penalty term related to the complexity of the class, referred to as


the tolerance parameter. The accuracy of the solution depends on the choices
for the class G and for the tolerance parameter, see [18]. In this paper, we do
not address the issue of designing algorithms for empirical minimum volume
set estimation; we refer to section 7 of [18] for a description of off-the-shelf
methods documented in the literature, including partition-based techniques.
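To give a flavor of such partition-based techniques, the sketch below is one simple illustrative variant of ours (it is not the procedure of [18]): the bounding box is split into a regular grid of equal-volume cells and cells are added by decreasing empirical count until the empirical mass constraint is met; for equal-volume cells this greedy rule solves the discretized version of problem (9) over unions of grid cells. All names and defaults are assumptions made for the example.

```python
import numpy as np

def histogram_mv_set(X, alpha, phi=0.0, bins=20, box=None):
    """Sketch of a partition-based empirical minimum volume set estimator.

    Splits a bounding box into `bins`^d equal-volume cells, then adds cells by
    decreasing empirical count until the empirical mass reaches alpha - phi.
    Returns (indicator, volume): a callable x -> bool and the Lebesgue measure
    of the selected union of cells.
    """
    X = np.asarray(X, float)
    n, d = X.shape
    if box is None:
        box = (X.min(axis=0), X.max(axis=0) + 1e-9)
    lo, hi = np.asarray(box[0], float), np.asarray(box[1], float)
    edges = [np.linspace(lo[j], hi[j], bins + 1) for j in range(d)]
    counts, _ = np.histogramdd(X, bins=edges)
    cell_vol = np.prod((hi - lo) / bins)
    order = np.argsort(counts, axis=None)[::-1]           # cells by decreasing count
    cum_mass = np.cumsum(counts.ravel()[order]) / n
    m = int(np.searchsorted(cum_mass, alpha - phi) + 1)   # number of cells retained
    selected = np.zeros(counts.size, dtype=bool)
    selected[order[:m]] = True
    selected = selected.reshape(counts.shape)

    def indicator(x):
        idx = tuple(
            int(np.clip(np.searchsorted(edges[j], x[j], side="right") - 1, 0, bins - 1))
            for j in range(d)
        )
        return bool(selected[idx])

    return indicator, m * cell_vol

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 2))
    ind, vol = histogram_mv_set(X, alpha=0.9)
    print("estimated volume of the 90% minimum volume set:", round(vol, 2))
    print("empirical mass of the estimated set:", np.mean([ind(x) for x in X]))
```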
The A-Rank algorithm is implemented in two stages. Stage 1 consists
in estimating minimum volume sets by solving the problem (9) for mass
levels α corresponding to the grid ∆, targeting successively the ”most abnor-
mal” (100α1 )% among all possible observations x, then the ”most abnormal”
(100α2 )%, etc. In contrast to the target sequence of density level sets, the
collection of empirical minimum volume sets is not necessarily monotone. It
is the goal of the second stage to build a non decreasing sequence of subsets
based on the estimates produced at stage 1.

The A-Rank algorithm

Input. Meshgrid ∆ : 0 = α0 < α1 < . . . < αK = 1 − ε, tolerance φ, class G.

Stage 1. For k = 1, . . . , K, solve the problem

inf_{Ω∈G} λ(Ω) subject to Fn(Ω) ≥ αk − φ,

producing the empirical minimum volume set Ω̂k.

Stage 2. Build recursively an increasing sequence of subsets by forming
Ω̃1 = Ω̂1 and Ω̃k+1 = Ω̂k+1 ∪ Ω̃k for 1 ≤ k < K.

Output. The piecewise constant scoring function obtained by overlaying the
indicator functions of the Ω̃k's,

s̃∆(x) = ∑_{k=1}^{K} I{x ∈ Ω̃k} = ∑_{k=1}^{K} (K − k + 1) I{x ∈ Ω̃k \ Ω̃k−1},

with Ω̃0 = ∅ by convention, as well as the piecewise constant estimate of MV∗
given by

∀α ∈ (0, 1), MV̂∗∆(α) = ∑_{k=1}^{K} λ̂k · I{α ∈ [αk, αk+1[},

where λ̂k = λ(Ω̂k).
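For concreteness, the two stages can be sketched as follows. This is our own illustration, under the assumption that an empirical minimum volume set solver is available as a black box (for instance the partition-based sketch of subsection 6.2); it is not claimed to reproduce exactly the procedure analyzed in Theorem 3.

```python
import numpy as np

def a_rank(X, alphas, phi, mv_set_estimator):
    """Sketch of the A-Rank algorithm.

    X                : (n, d) sample
    alphas           : increasing grid alpha_1 < ... < alpha_K = 1 - eps
    phi              : tolerance parameter
    mv_set_estimator : callable (X, alpha, phi) -> (indicator, volume) solving the
                       empirical minimum volume set problem (9)

    Returns the piecewise constant scoring function s_tilde and the stepwise
    estimate of the optimal MV curve (grid, volumes of the Omega_hat_k's).
    """
    # Stage 1: empirical minimum volume sets at each mass level of the grid
    estimates = [mv_set_estimator(X, a, phi) for a in alphas]
    indicators = [ind for ind, _ in estimates]
    volumes = np.array([vol for _, vol in estimates])

    # Stage 2: overlay the (not necessarily nested) estimates; the score of x is the
    # number of sets Omega_tilde_k = union of the first k estimates that contain x
    def s_tilde(x):
        score, seen = 0, False
        for ind in indicators:
            seen = seen or ind(x)
            score += int(seen)
        return score

    return s_tilde, (np.asarray(alphas), volumes)
```

Plugged together with a solver such as the histogram sketch given earlier, a call like `a_rank(X, np.linspace(0.1, 0.95, 10), phi, histogram_mv_set)` would return a scoring function ranking points by decreasing estimated density level, together with the stepwise MV curve estimate; again, this is only meant to illustrate the two-stage structure.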

Before analyzing the statistical performance of the algorithm above, a few


remarks are in order.
Notice first that we could alternatively build a monotone sequence of
subsets (Ω̄k)1≤k≤K, recursively through: Ω̄K = Ω̂K and Ω̄k = Ω̂k ∩ Ω̄k+1 for
k = K − 1, . . . , 1. The results established in the sequel straightforwardly
extend to this construction.
One obtains as a byproduct of the algorithm a rough piecewise constant
estimator of the (optimal scoring) function (H ◦ f)(x), namely ∑_{k=1}^{K} (αk+1 −
αk) · I{x ∈ Ω̃k}, the Riemann sum approximating the integral (2) when µ is
the Lebesgue measure.

6.3 Performance bounds for the A-Rank algorithm
We now prove a result describing the performance of the scoring function pro-
duced by the algorithm proposed in the previous section to solve the anomaly
scoring problem, as formulated in section 5. The following assumptions are
required.

A5 ∀α ∈ [0, 1 − ε], Ω∗α ∈ G.

A6 The Rademacher average given by

Rn = E[ sup_{Ω∈G} (1/n) ∑_{i=1}^{n} εi I{Xi ∈ Ω} ],    (10)

where (εi)i≥1 is a Rademacher chaos independent from the Xi's, is of
the order OP(n^{−1/2}).

Assumption A6 is very general; it is satisfied in particular when the class


G is of finite VC dimension, see [15]. Assumption A5 is in contrast very
restrictive; it could however be relaxed at the price of a much more technical
analysis, involving the study of the bias in empirical minimum volume set
estimation under adequate assumptions on the smoothness of the boundaries
of the Ω∗α ’s, like in [20] for instance.

Theorem 3. Suppose that assumptions A1 -A2 and A5 -A6 hold true. Let
δ ∈ (0, 1). Consider the mesh ∆K made of gridpoints αk = (1 − ε)k/K,
0 ≤ k ≤ K, and take as tolerance parameter

φ = φn(δ) := 2Rn + √(log(1/δ)/(2n)).

Then, if Kn ∼ n^{1/4} as n → +∞, there exists a constant c = c(δ) such that,
with probability larger than 1 − δ, we have for all n ≥ 1:

sup_{α∈[0,1−ε]} { MVs̃∆K(α) − MV∗(α) } ≤ c n^{−1/4},    (11)

denoting by s̃∆K the scoring function output by A-Rank.

The proof of the result stated above can be found in the Appendix; it
essentially relies on Theorems 3 and 11 in [18] combined with bound (8). As
a careful examination of the argument shows, the condition Kn ∼ n^{1/4} corre-
sponds to a balance between the approximation error (of the order O(1/K),
see Eq. (8)) and the accumulation of the errors made by overlaying empirical mini-
mum volume sets at Kn different mass levels. The rate is not proved to be
optimal here; lower bounds will be investigated in a future work.
The general approach described above can be extended in several ways.
First, the class G over which minimum volume set estimation is performed
(and the tolerance parameter as well) could vary depending on the mass
target α. Additionally, a model selection step could be incorporated, so
as to select an adequate class G (refer to section 4 in [18]). Secondly, the
subdivision ∆ is known in advance in the A-rank algorithm. A natural
extension would be to consider a flexible grid of the unit interval, which
could be possibly selected in an adaptive nonlinear manner, depending on
the local smoothness properties of the optimal curve MV∗, see [5]. Due
to the convexity of MV∗, the spacings αk+1 − αk should be picked as non-
increasing. The assumption that grid points are equispaced in Theorem 3
aims at simplifying the analysis only and is by no means necessary: the sole
restriction is actually that supk {αk+1 − αk } = O(1/K). Such extensions are
far beyond the scope of the paper and shall be the subject of a future work.
Remark 3. (Plug-in) A possible strategy would be to estimate first the un-
known density function f (x) by means of (non-) parametric techniques and
next use the resulting estimator as a scoring function. Beyond the compu-
tational difficulties one would be confronted with for large or even moderate
values of the dimension, we point out that the goal pursued in this paper is by
nature very different from density estimation: the local properties of the den-
sity function are useless here; only the ordering of the possible observations
x ∈ X that it induces is of importance, see Proposition 1.
The next proposition provides a bound in sup norm for the curve MV̂∗∆

as an estimate of the optimum MV∗ .
Proposition 4. Suppose that the assumptions of Theorem 3 are fulfilled and
take Kn ∼ n^{1/2} as n → +∞. Then, there exists a constant c′ = c′(δ) such
that, with probability at least 1 − δ,

sup_{α∈[0,1−ε]} |MV̂∗∆K(α) − MV∗(α)| ≤ c′ (log n / n)^{1/2}.    (12)

One could also consider the alternative estimator MṼ∗(α) = ∑_{k=1}^{K} λ(Ω̃k) ·
I{α ∈ [αk , αk+1 [}. However, the rate would be much slower, due to the error
stacking phenomenon resulting from stage 2 (monotonicity).
In the present analysis, we have restricted our attention to a truncated
part of the MV space, corresponding to mass levels in the interval [0, 1 − ε],
thus avoiding out-of-sample tail regions (in the case where suppF is
of infinite Lebesgue measure). An interesting direction for further research
could consist in investigating the accuracy of anomaly scoring algorithms
when letting ε = εn slowly decay to zero as n → +∞, under adequate
assumptions on the asymptotic behavior of MV∗ .

7 CONCLUSION
Motivated by a wide variety of applications, we have formulated the issue of
learning how to rank observations in the same order as that induced by the
density function, which we called anomaly scoring here. For this problem,
much less ambitious than estimation of the local values taken by the density,
a functional performance criterion, namely the Mass Volume curve, is pro-
posed. What MV curve analysis achieves for unsupervised anomaly detection
is quite akin to what ROC curve analysis accomplishes in the supervised set-
ting. The statistical estimation of MV curves has been investigated from an
asymptotic perspective and we have provided a strategy, where the feature
space is overlaid with a few well-chosen empirical minimum volume sets, to
build a scoring function with statistical guarantees in terms of rate of con-
vergence for the sup norm in the MV space. Its analysis suggests a number
of novel and important issues for unsupervised statistical learning, such as
extension of the approach promoted here to the case where the mass levels
of the overlaid empirical minimum volume sets are chosen adaptively, so as
to capture the geometry of the optimal MV curve.

APPENDIX - TECHNICAL PROOFS


Proof of Proposition 2
Assume that convexity does not hold for MV∗. It means that there exist α
and ε in (0, 1) such that MV∗(α) − MV∗(α − ε) > MV∗(α + ε) − MV∗(α).
This is in contradiction with Proposition 1. Indeed, consider then the scoring
function f̃(x) equal to f(x) on Ω∗α−ε ∪ (X \ Ω∗α+ε), to

[(Q∗(α + ε) − Q∗(α)) / (Q∗(α) − Q∗(α − ε))] · f(x) + [Q∗(α − ε)Q∗(α + ε) − Q∗(α)²] / [Q∗(α − ε) − Q∗(α)]

on Ω∗α \ Ω∗α−ε and to

[(Q∗(α) − Q∗(α − ε)) / (Q∗(α + ε) − Q∗(α))] · f(x) + [Q∗(α)² − Q∗(α − ε)Q∗(α + ε)] / [Q∗(α) − Q∗(α + ε)]

on Ω∗α+ε \ Ω∗α. One may then check easily that Eq. (3) is not fulfilled at α for
s = f̃. The formula for MV∗′ is a particular case of that given in Proposition
3.

Proof of Proposition 3
Applying the co-area formula (see [11, p.249, th.3.2.12]) for Lipschitz func-
tions to g(x) = (1/|∇f(x)|) I{x : |∇f(x)| > 0, s(x) ≥ t} yields λs(t) =
∫_t^{+∞} du ∫_{s⁻¹(u)} (1/|∇f(y)|) µ(dy), and thus λs′(t) = − ∫_{s⁻¹({t})} (1/|∇f(y)|) µ(dy).
And the desired formula follows from the composite differentiation rule.

Proof of Theorem 1 (sketch of )


By virtue of Theorem 3 in [2], under Assumption A3, there exists a sequence
of independent Brownian bridges {W⁽ⁿ⁾(α)}_{α∈(0,1)} such that we a.s. have:

√n ( α̂s⁻¹(α) − αs⁻¹(α) ) = W⁽ⁿ⁾(α) / fs(αs⁻¹(α)) + o(vn),

uniformly over [0, 1 − ε], as n → +∞. Now the desired results can be imme-
diately derived from the Law of Iterated Logarithm for the Brownian Bridge,
combined with a Taylor expansion of λs .

Proof of Theorem 2
The argument is based on the strong approximation stated in Theorem
1, combined with a standard coupling argument. The main steps of the
argument are as follows. Let P* denote the conditional probability given the original

data Dn . Note firstly that, conditioned on Dn , by applying Theorem 1’s
part (ii) with MṼs as target curve (instead of MVs), one gets that, with
probability 1, rn*(α) is equivalent to a stochastic process with same law as

Z⁽ⁿ⁾*(α) = ( λs′(α̃s⁻¹(α)) / f̃s(α̃s⁻¹(α)) ) W⁽ⁿ⁾(α),

with a remainder of order o( (log log n)^{ρ1(γ)} (log n)^{ρ2(γ)} / √n ), uniformly over [0, 1 − ε]. In
addition, we a.s. have

λs′(αs⁻¹(α)) / fs(αs⁻¹(α)) − λs′(α̃s⁻¹(α)) / f̃s(α̃s⁻¹(α)) = O( log(hn⁻¹) / √(n·hn) ),

under the stipulated conditions, see [12]. Hence, we almost surely have,
uniformly over [0, 1 − ε],

Z⁽ⁿ⁾*(α) = Z⁽ⁿ⁾(α) + O( log(hn⁻¹) / √(n·hn) ).

Since this result holds uniformly in α ∈ [0, 1 − ε], by virtue of the continuous
mapping theorem applied to the function sup_{α∈[0,1−ε]}(·), the result also holds
in distribution up to the same order.

Proof of Theorem 3
For clarity, we start off by recalling the following result, which straightforwardly
derives from Theorems 3 and 11 in [18].

Proposition 5. ([18]) Suppose that the assumptions of Theorem 3 are ful-
filled. Let α, δ in (0, 1) and let Ω̂ ∈ G be a solution of the constrained opti-
mization problem:

min_{Ω∈G} λ(Ω) subject to F̂(Ω) ≥ α − φn(δ).

Then, we have with probability at least 1 − δ: ∀n ≥ 1,

F(Ω̂) ≥ α − 2φn(δ) and MV∗(α) − 2φn(δ)/Q∗(α) ≤ λ(Ω̂) ≤ MV∗(α).    (13)

Additional notations shall be required. We set ᾱk = F(Ω̂k), λ̄k = λ(Ω̂k),
α̃k = F(Ω̃k) and λ̃k = λ(Ω̃k) for all k ∈ {1, . . . , K}. Equipped with these
notations, we may write:

∀α ∈ (0, 1), MVs̃∆(α) = ∑_{k=1}^{K} λ̃k · I{α ∈ [α̃k, α̃k+1[}.

We decompose the pointwise deviation between the optimal MV curve and
that of the scoring function s̃∆ as follows: ∀α ∈ (0, 1),

MVs̃∆(α) − MV∗(α) = ( MVs̃∆(α) − MVs∗∆̃(α) ) + ( MVs∗∆̃(α) − MV∗(α) ),    (14)

where ∆̃ is the subdivision of the unit interval 0 = α̃0 < . . . < α̃K < α̃K+1 = 1.
The first term on the right hand side of Eq. (14) can be written as

MVs̃∆(α) − MVs∗∆̃(α) = ∑_{k=1}^{K} ( λ̃k − MV∗(α̃k) ) · I{α ∈ [α̃k, α̃k+1[}

and bound (8) applied to the subdivision ∆̃ makes it possible to control the second
term:

MVs∗∆̃(α) − MV∗(α) ≤ κ · max_{1≤k≤K} {α̃k+1 − α̃k}.

Dealing with the first term, we write:

λ̃k − MV∗(α̃k) ≤ |λ̃k − MV∗(αk)| + κ |αk − α̃k|.

By the triangle inequality, we also have: |αk − α̃k| ≤ |αk − ᾱk| + |ᾱk − α̃k|. We
prove the following intermediary result.
Lemma 1. For all k ∈ {1, . . . , K}, we have:

α̃k = ᾱk + (k − 1)OP(φn(δ)),
λ̃k = λ̄k + (k − 1)OP(φn(δ)).

Proof. Using the additivity of the probability measure F(dx), we have
F(Ω̃2) = F(Ω̂1 \ Ω̂2) + F(Ω̂2). As Ω∗α1 ⊂ Ω∗α2 and

F(Ω̂1 \ Ω̂2) = F( ((Ω̂1 \ Ω∗α1) ∪ (Ω̂1 ∩ Ω∗α1)) \ ((Ω̂2 \ Ω∗α2) ∪ (Ω̂2 ∩ Ω∗α2)) ),

the additivity of F(dx) again yields: F(Ω̃2) = F(Ω̂2) + OP(φn(δ)). Using the
bound stated in Proposition 5, the general result is then obtained by re-
cursion. In a quite similar way, the same inequalities hold true for the λ's.

The lemma above implies that max_{1≤k≤K} |α̃k − ᾱk| = OP(Kφn(δ)). Using
Proposition 5 (replacing δ by δ/K) combined with the union bound, we
obtain that max_{1≤k≤K} |ᾱk − αk| = OP(φn(δ/K)). Then, we may write

|α̃k+1 − α̃k| ≤ max_{1≤k≤K} |ᾱk − ᾱk−1| + K·OP(φn(δ)).

Observe also that

max_{1≤k≤K} |ᾱk − ᾱk−1| ≤ 2 max_{1≤k≤K} |ᾱk − αk| + max_{1≤k≤K} {αk − αk−1}.

Hence, max_{1≤k≤K} {αk − αk−1} = O(1/K) by construction and, by virtue of
Proposition 5, 2 max_{1≤k≤K} |ᾱk − αk| = OP(φn(δ/K)). This makes it possible to control
max_k |α̃k+1 − α̃k| and consequently the first term on the right hand side of
Eq. (14).
Finally, since Rn is assumed to be of the order n^{−1/2}, choosing K so as
to optimize the resulting bound, asymptotically equivalent to 1/√n + 1/K +
K/√n + √((log K)/n), leads to K = Kn ∼ n^{1/4}. This yields the desired rate
bound.

Proof of Proposition 4
Re-using the notations introduced in the proof above, we decompose the
pointwise deviation as follows: ∀α ∈ [0, 1 − ε],

MV̂∗∆(α) − MV∗(α) = ∑_{k=1}^{K} ( MV̂∗∆(α) − MV∗(αk) ) · I{α ∈ [αk, αk+1[}
+ MVs∗∆(α) − MV∗(α).    (15)

Bound (8) applied to the second term of the decomposition above proves
it is of order O(1/K). Proposition 5 implies that, with probability at least
1 − δ, the first term in the decomposition is bounded by 2φn(δ/K)/Q∗(1 − ε).
Taking Kn ∼ n^{1/2} yields the rate bound stated in Proposition 4.

References
[1] S. Clémençon and J. Jakubowicz. Scoring anomalies: a M-estimation
formulation. In Proceedings of the 16-th International Conference on
Artificial Intelligence and Statistics, Scottsdale, USA, 2013.
[2] M. Csörgő and P. Révész. Quantile process approximations. The Annals
of Statistics, 6(4):882–894, 1978.
[3] M. Csörgő and P. Révész. Strong Approximation Theorems in Probability
and Statistics. Academic Press, 1981.
[4] C. de Boor. A practical guide to splines. Springer, 2001.
[5] R. Devore and G. Lorentz. Constructive Approximation. Springer, 1993.
[6] D. Donoho and M. Gasko. Breakdown properties of location estimates
based on half space depth and projected outlyingness. The Annals of
Statistics, 20:1803–1827, 1992.
[7] B. Efron. Bootstrap methods: another look at the jackknife. Annals of
Statistics, 7:1–26, 1979.
[8] J.P. Egan. Signal Detection Theory and ROC Analysis. Academic Press,
1975.
[9] J.H.J. Einmahl and D.M. Mason. Generalized quantile process. The
Annals of Statistics, 20:1062–1078, 1992.
[10] M. Falk and R. Reiss. Weak convergence of smoothed and nonsmoothed
bootstrap quantile estimates. Annals of Probability, 17:362–371, 1989.
[11] H. Federer. Geometric Measure Theory. Springer, 1969.
[12] E. Giné and A. Guillou. Rates of strong uniform consistency for multi-
variate kernel density estimators. Ann. Inst. Poincaré (B), Probabilités
et Statistiques, 38:907–921, 2002.
[13] P. Hall. On the number of bootstrap simulations required to construct
a confidence interval. Annals of Statistics, 14:1453–1462, 1986.
[14] V. Koltchinskii. M-estimation, convexity and quantiles. The Annals of
Statistics, 25(2):435–477, 1997.

[15] V. Koltchinskii. Local Rademacher complexities and oracle inequali-
ties in risk minimization (with discussion). The Annals of Statistics,
34:2593–2706, 2006.

[16] W. Polonik. Minimum volume sets and generalized quantile processes.


Stochastic Processes and their Applications, 69(1):1–24, 1997.

[17] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level
sets. Bernoulli, 14(4):1154–1178, 2009.

[18] C. Scott and R. Nowak. Learning Minimum Volume Sets. Journal of


Machine Learning Research, 7:665–704, 2006.

[19] B. Silverman and G. Young. The bootstrap: to smooth or not to smooth.


Biometrika, 74:469–479, 1987.

[20] A.B. Tsybakov. On nonparametric estimation of density level sets. The


Annals of Statistics, 25(3):948–960, 1997.

[21] J. Tukey. Mathematics and picturing data. pages 523–531. Canadian


Math. Congress, 1975.

[22] K. Viswanathan, L. Choudur, V. Talwar, C. Wang, G. Macdonald, and


W. Satterfield. Ranking anomalies in data centers. In R.D.James, editor,
Network Operations and System Management, pages 79–87. IEEE, 2012.

[23] B.Y. Zuo and R. Serfling. General notions of statistical depth function.
The Annals of Statistics, 28(2):461–482, 2000.
