Mass Volume Curve
Stéphan Clémençon
UMR LTCI No. 5141
Telecom ParisTech/CNRS
Institut Mines-Telecom
Paris, 75013, France
July 30, 2013
Abstract
This paper aims at formulating the issue of ranking multivariate (unlabeled) observations depending on their degree of abnormality/novelty as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled by means of tail estimation techniques: univariate observations are viewed as all the more "abnormal" as they are located far in the tail(s) of the underlying probability distribution. In a wide variety of applications, it would be desirable as well to dispose of a scalar-valued "scoring" function allowing for comparing the degree of abnormality of multivariate observations. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel (functional) performance criterion, referred to as the Mass Volume curve (MV curve in short), whose optimal elements are, as expected, nondecreasing transforms of the density. The question of statistical estimation of the MV curve of a scoring function is tackled in this paper, as well as that of empirical optimization of this criterion, interpreted as a continuum of minimum volume set estimation problems. A discretization approach is proposed here, consisting of estimating a sequence of N ≥ 1 minimum volume sets and next constructing a piecewise constant scoring function s_N(x) based on the latter. Viewing its MV curve as a statistical version of a stepwise approximant of the optimal MV curve, rate bounds are established, describing the generalization ability of the methodology promoted. The algorithm also provides, as a byproduct, an estimate of the optimal MV curve.
1 INTRODUCTION
In a wide variety of applications, ranging from the monitoring of aircraft engines in aeronautics to nondestructive quality control in industry through fraud detection, network intrusion surveillance or system management in data centers (e.g. see [22]), anomaly detection is of crucial importance. In most common situations, anomalies correspond to "rare" observations and must be detected in the absence of any feedback (e.g. an expert's opinion), based on an unlabeled dataset. In practice, the very purpose of anomaly detection techniques is to rank observations by degree of abnormality/novelty, rather than simply assigning them a binary label, "abnormal" vs. "normal". In the case of univariate observations, abnormal values are generally those which are extreme, i.e. "too large" or "too small" with regard to central quantities such as the mean or the median, and anomaly detection may then derive from standard tail distribution analysis: the farther in the tail the observation lies, the more "abnormal" it is considered. In contrast, it is far from easy to formulate the issue in a multivariate context. In the present paper, motivated by applications such as those aforementioned, we place ourselves in the situation where (unlabeled) observations take their values in a possibly very high-dimensional space, X ⊂ R^d with d ≥ 1 say, making approaches based on nonparametric density estimation or multivariate (heavy-) tail modelling hardly feasible, if not infeasible, due to the curse of dimensionality. In this framework, a variety of statistical techniques, relying on the concept of minimum volume set investigated in the seminal contributions of [9] and [16] (see section 2), have been developed in order to split the feature space X into two halves and decide whether observations should be considered as "normal" (namely, when lying in the minimum volume set Ω_α ⊂ X estimated on the basis of the dataset available) or not (when lying in the complementary set X \ Ω_α), with a given confidence level 1 − α ∈ (0, 1). One may also refer to [18] and to [14] for closely related notions. The problem considered here is of a different nature: the goal pursued is not to assign to all possible observations
a label "normal" vs. "abnormal", but to rank them according to their level of "abnormality". The most natural way to define a preorder on the feature space X is to transport the natural order on the real line through some (measurable) scoring function s : X → R_+: the "smaller" the score s(X), the more likely the observation X is viewed as "abnormal". This problem shall here be referred to as anomaly scoring. It can be related to the literature dedicated to statistical depth functions in nonparametric statistics and operations research, see [23] and the references therein. Such depth functions are generally proposed ad hoc in order to define a "center" for the probability distribution of interest and a notion of distance to the latter, e.g. the concept of "center-outward ordering" induced by halfspace depth in [21] or [6]. The angle taken in this paper is quite different; its objective is twofold: 1) propose a performance criterion for the anomaly scoring problem so as to formulate it in terms of M-estimation, and 2) investigate the accuracy of scoring rules which optimize empirical estimates of the criterion thus tailored.
Due to the global nature of the ranking problem, the criterion we promote is functional, just like the Receiver Operating Characteristic (ROC) and Precision-Recall curves in the supervised ranking setup (i.e. when a binary label, e.g. "normal" vs. "abnormal", is assigned to the sampling data), and shall be referred to as the Mass Volume curve (MV curve in abbreviated form). The latter induces a partial preorder on the set of scoring functions: the collection of optimal elements is defined as the set of scoring functions whose MV curve is minimum everywhere, and is proved to coincide, as expected, with the set of increasing transforms of the underlying probability density. In the unsupervised setting, MV curve analysis is shown to play a role quite comparable to that of ROC curve analysis for supervised anomaly detection. The issue of estimating the MV curve of a given scoring function based on sampling data is then tackled, and a smooth bootstrap method for constructing confidence regions is analyzed. A statistical methodology to build a nearly optimal scoring function is next described, which works as follows: first, discretize the problem; second, solve a few minimum volume set estimation problems where confidence levels are fixed in advance (see [18]); and third, combine the estimated anomaly/novelty detection rules in order to build a piecewise constant scoring function. The MV curve of the scoring rule thus produced can be related to a stepwise approximant of the (unknown) optimal MV curve, which makes it possible to establish the generalization ability of the algorithm through rate bounds in terms of the sup norm in the MV space. We
also show how to get, as a byproduct, a piecewise constant estimate of the
optimal MV curve.
The rest of the article is structured as follows. Section 2 describes the mathematical framework, sets out the main notation and recalls the crucial notions related to anomaly/novelty detection on which the analysis carried out in the paper relies. Section 3 first provides an informal description of the anomaly scoring problem and then introduces the MV curve criterion, dedicated to evaluating the performance of any scoring function. The set of optimal elements is described, and statistical results related to the estimation of the MV curve of a given scoring function are stated in section 4. Statistical learning of an anomaly scoring function is then formulated as a functional M-estimation problem in section 5, while section 6 is devoted to the study of a specific algorithm for the design of nearly optimal anomaly scoring functions. Technical proofs are deferred to the Appendix section. We finally point out that a very preliminary version of this work has been outlined in the conference paper [1].
For any measurable function s : X → R_+ and any threshold level t ∈ R, consider the level set defined by

Ω_{s,t} = {x ∈ X : s(x) ≥ t}.

Observe that the sequence is decreasing (for the inclusion, as t increases from −∞ to +∞):

∀(t, t′) ∈ R², t ≥ t′ ⇒ Ω_{s,t} ⊂ Ω_{s,t′},

and that lim_{t→+∞} Ω_{s,t} = ∅ and lim_{t→−∞} Ω_{s,t} = X. When the function s(x) is additionally integrable w.r.t. Lebesgue measure, it is called a scoring function. The set of all scoring functions is denoted by S.
The following quantities shall also be used in the sequel. For any scoring function s and threshold level t ≥ 0, define:

α_s(t) = P{s(X) ≥ t}  and  λ_s(t) = λ(Ω_{s,t}),

where λ denotes the Lebesgue measure on X. The quantity α_s(t) is referred to as the mass of the level set Ω_{s,t}, while λ_s(t) is generally termed the volume (w.r.t. Lebesgue measure). Notice that the volumes are finite on R*_+: ∀t > 0, λ_s(t) ≤ ∫_X s(x) dx / t < +∞.
Reciprocally, for any (total) preorder ≼ on X, one may define a scoring function s such that ≼ coincides with ≼_s. Indeed, for all x ∈ X, consider the (supposedly measurable) set Ω_≼(x) = {x′ ∈ X : x′ ≼ x}. Then, for any finite positive measure µ(dx) with X as support, define the scoring function

s_{µ,≼}(x) = ∫_{x′∈X} I{x′ ∈ Ω_≼(x)} µ(dx′).

It is immediate to check that s_{µ,≼} induces the preorder ≼ on X.
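To fix ideas, the construction can be checked numerically. The short sketch below (not part of the original text; the reference function r, the Gaussian choice of µ and all names are ours) builds s_{µ,≼} by Monte Carlo for a preorder induced by a reference function on R² and verifies that it ranks points exactly as the preorder does.

```python
import numpy as np

rng = np.random.default_rng(0)

def r(x):
    # hypothetical reference function inducing the preorder on X = R^2:
    # x' is "less abnormal" than x when r(x') <= r(x)
    return -np.linalg.norm(x, axis=-1)

# Monte Carlo approximation of mu (here: standard Gaussian measure on R^2)
mu_draws = rng.normal(size=(100_000, 2))

def s_mu(x):
    # s_{mu,≼}(x) = mu(Omega_≼(x)), the mu-mass of {x' : x' ≼ x},
    # estimated by the fraction of draws falling in that set
    return np.mean(r(mu_draws) <= r(x))

# sanity check: s_mu ranks a test sample exactly as r does
xs = rng.normal(size=(50, 2))
scores = np.array([s_mu(x) for x in xs])
order = np.argsort(r(xs))
assert np.all(np.diff(scores[order]) >= 0)  # nondecreasing along the preorder
```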
Minimum volume sets extend the notion of quantiles to multivariate distributions and describe the regions where a multivariate r.v. X takes its values with highest/smallest probability. Let α ∈ (0, 1); a minimum volume set Ω*_α of mass at least α is any solution of the constrained minimization problem

min_Ω λ(Ω)  subject to  F(Ω) ≥ α,

where the minimum is taken over measurable subsets Ω of X. Suppose that F(dx) has a bounded density f(x) w.r.t. Lebesgue measure with no flat parts, i.e. for all c ≥ 0,

P{f(X) = c} = 0.

Under the hypotheses above, for any α ∈ (0, 1), there exists a unique minimum volume set Ω*_α (equal to the density level set Ω_{f,t*_α} for a suitable threshold t*_α > 0, up to subsets of null F-measure), whose mass is exactly equal to α. Additionally, the mapping λ* : α ↦ λ(Ω*_α) is continuous on (0, 1) and uniformly continuous on [0, 1 − ε] for all ε ∈ (0, 1) (when the support of F(dx) is compact, uniform continuity holds on the whole interval [0, 1]).
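For instance, in the univariate standard Gaussian case the minimum volume sets are the symmetric intervals [−q_α, q_α] with q_α = Φ^{-1}((1 + α)/2), so that λ*(α) = 2q_α. A minimal numerical check (ours, using SciPy):

```python
import numpy as np
from scipy.stats import norm

# Standard Gaussian: Omega*_alpha = [-q, q] with q = Phi^{-1}((1 + alpha)/2),
# hence lambda*(alpha) = 2q. Check the mass constraint by Monte Carlo.
alpha = 0.9
q = norm.ppf((1 + alpha) / 2)
print("lambda*(0.9) =", 2 * q)                     # volume of the MV set

rng = np.random.default_rng(0)
X = rng.normal(size=1_000_000)
print("empirical mass:", np.mean(np.abs(X) <= q))  # close to 0.9
```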
From a statistical perspective, estimates Ω̂*_α of minimum volume sets are built by replacing the unknown probability distribution F by its empirical version F_n = (1/n) Σ_{i=1}^n δ_{X_i} and restricting optimization to a collection A of borelian subsets of X, supposed rich enough to include all density level sets (or reasonable approximations of the latter). In [16], functional limit results are derived for the generalized empirical quantile process {λ(Ω̂*_α) − λ*(α)} under certain assumptions for the class A (stipulating in particular that A is a Glivenko-Cantelli class for F(dx)). In [18], it is proposed to replace the level α by α − φ_n, where φ_n plays the role of a tolerance parameter. The latter
should be chosen of the same order as the supremum sup_{Ω∈A} |F_n(Ω) − F(Ω)|, roughly, the complexity of the class A being controlled by the VC dimension or by means of the concept of Rademacher averages, so as to establish rate bounds for fixed n < +∞.
Alternatively, so-termed plug-in techniques, consisting of first computing an estimate f̂ of the density f and next considering the level sets Ω_{f̂,t} of the resulting estimator, have been investigated in several papers, among which [20] or [17] for instance. Such an approach however raises significant computational issues, even for moderate values of the dimension, inherent to the curse of dimensionality phenomenon.
3 RANKING ANOMALIES
In this section, the issue of scoring observations depending on their level
of novelty/abnormality is first described in an informal manner and next
formulated quantitatively, as a functional optimization problem, by means
of a novel concept, termed the Mass Volume curve.
In particular, under the assumptions of subsection 2.2, any scoring function of the form

s*(x) = ∫_{α∈(0,1)} I{x ∈ Ω*_α} µ(dα),    (2)

where µ(dα) is an arbitrary finite positive measure dominating the Lebesgue measure on (0, 1), belongs to S*. Observe that H(f(x)), where H denotes the cdf of the r.v. f(X), corresponds to the case where µ is chosen to be the Lebesgue measure on (0, 1) in Eq. (2). The anomaly scoring problem can thus be cast as an overlaid collection of minimum volume set estimation problems. This observation shall turn out to be very useful when designing practical statistical learning strategies in section 6.
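The overlay formula (2) can be verified numerically in the Gaussian case, where both sides are available in closed form; the sketch below (ours; the grid over mass levels is an arbitrary discretization of µ = Lebesgue) compares the overlay s*(x) with H(f(x)) = 2(1 − Φ(|x|)).

```python
import numpy as np
from scipy.stats import norm

# Grid discretization of mu = Lebesgue on (0, 1) (grid size is arbitrary)
alphas = np.linspace(1e-3, 1 - 1e-3, 2000)
q = norm.ppf((1 + alphas) / 2)          # Omega*_alpha = [-q_alpha, q_alpha]

def s_star(x):
    # overlay: fraction of mass levels whose minimum volume set contains x
    return np.mean(np.abs(x) <= q)

for x in (0.0, 0.5, 1.0, 2.0):
    # the overlay agrees with H(f(x)) = P{f(X) <= f(x)} = 2(1 - Phi(|x|))
    print(x, s_star(x), 2 * (1 - norm.cdf(abs(x))))
```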
Figure 1: MV curves of three scoring functions (optimal, good and bad), plotting volume against mass over [0, 1].
This functional criterion induces a partial order over the set of all scoring functions. Let s1 and s2 be two scoring functions on X; the ordering provided by s1 is better than that induced by s2 when

∀α ∈ (0, 1), MV_{s1}(α) ≤ MV_{s2}(α).
Proposition 1. (Optimal MV curve) Let assumptions A1 − A2 be fulfilled. The elements of the class S* have the same MV curve, denoted by MV*, and provide the best possible ordering of X's elements in regard to the MV curve criterion: for all s ∈ S and α ∈ (0, 1),

MV*(α) ≤ MV_s(α),    (3)

and, in addition,

MV_s(α) − MV*(α) ≤ λ(Ω_{s,Q(s,α)} ∆ Ω*_α),    (4)

where Q(s, α) = α_s^{-1}(α) and ∆ stands for the symmetric difference between sets.

The proof is immediate based on the results recalled in subsection 2.2 and is left to the reader. Incidentally, notice that, equipped with the notation introduced in 2.2, we have λ*(α) = MV*(α) for all α ∈ (0, 1). We also point out that bound (4) reveals that the pointwise difference between the optimal MV curve and that of a scoring function candidate s(x) is controlled by the error made in recovering the specific minimum volume set Ω*_α through Ω_{s,Q(s,α)}.
The following result reveals that the optimal MV curve is convex and provides a closed analytical form for its derivative: under assumptions A1 − A2, the curve MV* is convex on (0, 1) and differentiable, with MV*′(α) = 1/t*_α for all α ∈ (0, 1) (Proposition 2). See the Appendix section for the technical proof. Elementary properties of MV curves are summarized in the following proposition.
Proposition 3. (Properties of MV curves) For any s ∈ S, the following assertions hold true.

1. Limit values. We have MV_s(0) = 0 and lim_{α→1} MV_s(α) = λ(supp F).

2. Invariance. For any strictly increasing function ψ : R_+ → R_+, we have MV_s = MV_{ψ∘s}.

3. Monotonicity. The mapping α ∈ (0, 1) ↦ MV_s(α) is strictly increasing.

4. Differentiability. Suppose that the distribution of s(X) is continuous and has no plateau. Then, if t > 0 ↦ λ_s(t) is differentiable and s(x) is Lipschitz, MV_s is differentiable on (0, 1) and

MV_s′(α) = − (1 / α_s′(α_s^{-1}(α))) ∫_{s^{-1}({α_s^{-1}(α)})} µ(dy) / |∇s(y)|,

for all α ∈ (0, 1), denoting by µ(dy) the Hausdorff measure on the boundary s^{-1}({α_s^{-1}(α)}).
These assertions straightforwardly result from Definition 1; except for the derivative formula, whose proof is detailed in the Appendix, details are omitted.
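As a sanity check of the derivative formula (ours, not from the original text), in the 1-d standard Gaussian case with s = f the formula reduces to MV*′(α) = 1/t*_α, which can be compared against a finite-difference derivative of MV*(α) = 2Φ^{-1}((1 + α)/2):

```python
import numpy as np
from scipy.stats import norm

# 1-d standard Gaussian, s = f: MV*(alpha) = 2 * Phi^{-1}((1 + alpha)/2) and
# t*_alpha = phi(q(alpha)); the formula then gives MV*'(alpha) = 1/t*_alpha
q = lambda a: norm.ppf((1 + a) / 2)
mv = lambda a: 2 * q(a)

alpha, eps = 0.8, 1e-6
fd = (mv(alpha + eps) - mv(alpha - eps)) / (2 * eps)  # finite difference
print(fd, 1 / norm.pdf(q(alpha)))                     # the two values agree
```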
4 STATISTICAL ESTIMATION
In practice, MV curves are unknown, just like the probability distribution of the r.v. X, and must be estimated based on the observed sample X_1, . . . , X_n. Replacing the mass of each level set by its statistical counterpart in Definition 1 leads to the notion of empirical MV curve. We set, for all t ≥ 0,

α̂_s(t) = (1/n) Σ_{i=1}^n I{s(X_i) > t} = ∫ I{u > t} F̂_s(du),    (5)

where F̂_s = (1/n) Σ_{i≤n} δ_{s(X_i)} denotes the empirical distribution of the s(X_i)'s. Notice that α̂_s takes its values in the set {k/n : k = 0, . . . , n}.
Definition 2. (Empirical MV curve) Let s ∈ S. By definition, the empirical MV curve of s is the graph of the (piecewise constant) function

MV̂_s : α ∈ [0, 1) ↦ λ_s ∘ α̂_s^{-1}(α).
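For illustration purposes, the following sketch (ours; the bivariate Gaussian example, the bounding box and the Monte Carlo volume estimator are arbitrary choices) computes an empirical MV curve by pairing empirical score quantiles with Monte Carlo estimates of the level set volumes, and compares it to the closed-form optimal curve MV*(α) = −2π log(1 − α) valid for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def s(x):
    # scoring function under study: an increasing transform of the N(0, I_2)
    # density, so that MV_s coincides with the optimal curve MV*
    return np.exp(-0.5 * np.sum(x**2, axis=-1))

X = rng.normal(size=(5000, 2))          # observed sample
scores = s(X)

# lambda_s(t) = Lebesgue volume of {x : s(x) >= t}, estimated by uniform
# sampling over the bounding box [-B, B]^2
B = 6.0
U = rng.uniform(-B, B, size=(200_000, 2))
sU = s(U)
box_volume = (2 * B) ** 2
def lambda_s(t):
    return box_volume * np.mean(sU >= t)

# empirical MV curve: hat alpha_s^{-1}(alpha) is the empirical (1 - alpha)-
# quantile of the scores, whose level set volume is then estimated
alphas = np.linspace(0.01, 0.95, 40)
thresholds = np.quantile(scores, 1 - alphas)
mv_curve = np.array([lambda_s(t) for t in thresholds])

mv_star = -2 * np.pi * np.log(1 - alphas)   # closed form for this example
print(np.max(np.abs(mv_curve - mv_star)))   # small, up to Monte Carlo error
```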
In order to obtain a smoothed version α̃_s(t), a typical strategy consists of replacing the empirical distribution estimate F̂_s involved in Eq. (5) by the continuous distribution F̃_s with density

f̃_s(u) = (1/n) Σ_{i=1}^n K_h(s(X_i) − u),

with K_h(u) = h^{-1} K(h^{-1} u), where K ≥ 0 is a regularizing Parzen-Rosenblatt kernel (i.e. a bounded square integrable function such that ∫ K(v) dv = 1) and h > 0 is the smoothing bandwidth; see the discussion in subsection 4.2 for a practical view of smoothing.
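With a Gaussian kernel, the smoothed mass estimate α̃_s(t) = ∫ I{u > t} F̃_s(du) admits a closed form as a mixture of Gaussian survival functions; a minimal sketch (ours):

```python
import numpy as np
from scipy.stats import norm

def alpha_smooth(t, scores, h):
    # smoothed mass estimate with a Gaussian kernel K_h: integrating the
    # kernel density estimate of the s(X_i)'s over (t, +infty) yields a
    # mixture of Gaussian survival functions (h: user-chosen bandwidth)
    return np.mean(norm.sf((t - np.asarray(scores)) / h))
```

Numerically inverting t ↦ α̃_s(t) then provides a smoothed quantile function α̃_s^{-1}, hence a smoothed MV curve λ_s ∘ α̃_s^{-1}.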
In particular, the empirical MV curve is pointwise consistent: for all α ∈ (0, 1),

lim_{n→+∞} MV̂_s(α) = MV_s(α).
with v_n = (log log n)^{ρ_1(γ)} log^{ρ_2(γ)}(n) / √n for all n ≥ 1, where

(ρ_1(γ), ρ_2(γ)) = (0, 1) if γ < 1,  (0, 2) if γ = 1,  (γ, γ − 1 + η) if γ > 1.

See the Appendix for the technical proof, which relies on standard strong approximation results for the quantile process, see [3]. Assertion (ii) means that the fluctuation process {√n (MV̂_s(α) − MV_s(α))}_{α∈[0,1−ε]} converges, in the space of càd-làg functions on [0, 1 − ε] equipped with the sup norm, to the law of a Gaussian stochastic process {Z^{(1)}(α)}_{α∈[0,1−ε]}.
Remark 2. (Asymptotic normality) It results from Assertion (ii) in Theorem 1 that, for any α ∈ (0, 1), the pointwise estimator MV̂_s(α) is asymptotically Gaussian under Assumptions A3 − A4. For all α ∈ (0, 1), we have the convergence in distribution

√n {MV̂_s(α) − MV_s(α)} ⇒ N(0, σ_s²), as n → +∞,

with σ_s² = α(1 − α) (λ_s′(α_s^{-1}(α)) / f_s(α_s^{-1}(α)))².
where MV_s^{Boot} is the empirical MV curve of the scoring function s based on a sample of i.i.d. random variables with a common distribution F̃_s close to F̂_s = (1/n) Σ_{i≤n} δ_{s(X_i)}. The difficulty is twofold. First, the target is a distribution on a path space, namely a subspace of the Skorohod space D([0, 1 − ε]) of càd-làg real-valued functions on [0, 1 − ε] with ε ∈ (0, 1), equipped with the sup norm. Second, r_n is a functional of the quantile process {F̂_s^{-1}(α)}_{α∈(0,1)}. The naive bootstrap, i.e. resampling from the raw empirical distribution, generally provides bad approximations of the distribution of empirical quantiles: the rate of convergence for a given quantile is indeed of order O_P(n^{-1/4}), see [10], whereas the rate of the Gaussian approximation is O(n^{-1/2}). The same phenomenon may naturally be observed for MV curves. In a similar manner to what is usually recommended for empirical quantiles, a smoothed version of the bootstrap algorithm shall be implemented in order to improve the approximation rate of the distribution of sup_{α∈[0,1−ε]} |r_n(α)|, namely, by resampling the data from a smoothed version of the empirical distribution F̂_n.
Here we describe the algorithm for building a confidence band at level 1 − η in the MV space from sampling data D_n = {X_i : i = 1, . . . , n}. It is performed in four steps, building from a smoothed resample of the data the bootstrap MV curve

α ∈ [0, 1) ↦ MV_s^{Boot}(α) = λ_s ∘ (α_s^{Boot})^{-1}(α).
Before turning to the theoretical analysis of this algorithm, its description calls for a few comments. From a computational perspective, the smoothed bootstrap distribution must be approximated in its turn, by means of a Monte-Carlo approximation scheme. As shown in [19], an effective way of doing this in practice, while reproducing the theoretical advantages of smoothing, consists of drawing B ≥ 1 bootstrap samples of size n, with replacement, from the original dataset and next perturbing each drawn datum by independent centered Gaussian random variables of variance h² (this procedure is equivalent to drawing bootstrap data from a smooth estimate F̃_s computed using a Gaussian kernel K_h(u) = (2πh²)^{-1/2} exp(−u²/(2h²))). Concerning the number of bootstrap replications, picking B = n does not modify the rate of convergence. However, choosing B of magnitude comparable to n so that (1 + B)η is an integer may be more appropriate: the η-quantile of the approximate bootstrap distribution is then uniquely defined, and this does not impact the rate of convergence either, see [13]. The primary tuning parameters of the algorithm previously described concern the smoothing step. When using a Gaussian regularizing kernel, one should typically choose a bandwidth h_n of order n^{-1/5} so as to minimize the MSE. In addition, although it would be fairly equivalent, regarding asymptotic analysis, to recenter by a smoothed version MṼ_s = λ_s ∘ α̃_s^{-1} of the original empirical MV curve in the computation of the bootstrap fluctuation process, the numerical computation of the sup norm of the estimate (6) is much more tractable: it only requires evaluating the distance between piecewise constant curves over the pooled set of jump points. It should also be noticed that smoothing the original curve should be avoided in practice, since it hides the jump locations, which constitute the essential part of the information.
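For concreteness, here is a minimal sketch of the resulting procedure (ours; helper names are hypothetical, the level set volume estimator lambda_s is assumed available, e.g. as in the earlier snippet, and for simplicity the Gaussian perturbation is applied directly to the scores rather than to the data X_i):

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_mv(scores, alphas, lambda_s):
    # empirical MV curve evaluated on a grid of mass levels
    return np.array([lambda_s(t) for t in np.quantile(scores, 1 - alphas)])

def smooth_bootstrap_band(scores, alphas, lambda_s, h, B=999, eta=0.05):
    # B = 999 so that (1 + B) * eta is an integer, echoing the remark above
    n = len(scores)
    mv_hat = empirical_mv(scores, alphas, lambda_s)
    sup_devs = np.empty(B)
    for b in range(B):
        # resample with replacement, then perturb with N(0, h^2) noise:
        # equivalent to sampling from the Gaussian-kernel smoothed estimate
        boot = rng.choice(scores, size=n, replace=True) + h * rng.normal(size=n)
        sup_devs[b] = np.max(np.abs(empirical_mv(boot, alphas, lambda_s) - mv_hat))
    # the (1 - eta)-quantile of the sup deviations gives the half-width of a
    # level 1 - eta confidence band around the empirical MV curve
    delta = np.quantile(sup_devs, 1 - eta)
    return mv_hat - delta, mv_hat + delta
```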
In the subsequent analysis, the kernel K used in the smoothing step is assumed to be "pyramidal" (e.g. Gaussian or of the form I{u ∈ [−1, +1]}). The next theorem establishes the asymptotic validity of the bootstrap estimate proposed above. The proof straightforwardly results from the strong approximation theorem for the quantile process stated in [2] (see also [3]); details can be found in the Appendix section.
output by the Smooth MV Curve Bootstrap Algorithm is such that

sup_{t∈R} | P*( sup_{α∈[0,1−ε]} |r_n*(α)| ≤ t ) − P( sup_{α∈[0,1−ε]} |r_n(α)| ≤ t ) | = o_P( log(h_n^{-1}) / √(n h_n) ).
5 AN M-ESTIMATION APPROACH

Now that we are equipped with the concept of Mass Volume curve, the anomaly scoring task can be formulated as the building of a scoring function s(x), based on the training set X_1, . . . , X_n, such that MV_s is as close as possible to the optimum MV*. Due to the functional nature of the performance criterion, there are many ways of measuring how close the MV curve of a scoring function candidate and the optimal one are. The L_p-distances, for 1 ≤ p ≤ +∞, provide a relevant collection of risk measures. Let ε ∈ (0, 1) be fixed (take ε = 0 if λ(supp F) < +∞) and consider the loss related to the L_1-distance and that related to the sup norm:

d_1(s, f) = ∫_0^{1−ε} |MV_s(α) − MV*(α)| dα,

d_∞(s, f) = sup_{α∈[0,1−ε]} {MV_s(α) − MV*(α)}.
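Given MV curves evaluated on a common grid of mass levels, both risks are straightforward to evaluate numerically; a small sketch (ours, reusing the arrays from the earlier illustration):

```python
import numpy as np

def d1(alphas, mv_s, mv_opt):
    # trapezoidal rule for the L1-distance between the two curves
    diff = np.abs(mv_s - mv_opt)
    return np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(alphas))

def d_inf(mv_s, mv_opt):
    # sup-norm excess of MV_s over MV* on the grid
    return np.max(mv_s - mv_opt)
```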
Hence, possible learning techniques could be based on the minimization, over a set S_0 ⊂ S of candidates, of empirical counterparts of the area under the MV curve, such as ∫_0^{1−ε} MV̂_s(α) dα. In contrast, the approach cannot be straightforwardly extended to the sup-norm situation. A possible strategy would be to combine M-estimation with approximation methods, so as to "discretize" the optimization task. This would lead to replacing the unknown curve MV* by an approximation, a piecewise constant approximant MṼ* related to a subdivision ∆ : 0 < α_1 < · · · < α_K = 1 − ε say, and to decomposing the L_∞-risk as

d_∞(s, f) ≤ sup_{α∈[0,1−ε]} |MV*(α) − MṼ*(α)| + sup_{α∈[0,1−ε]} |MṼ*(α) − MV_s(α)|,

the first term on the right hand side of the bound above being viewed as the bias of the statistical method. If one restricts optimization to the set of piecewise constant scoring functions taking K + 1 values, the problem thus boils down to recovering the bilevel sets R*_k = Ω*_{α_k} \ Ω*_{α_{k−1}} for k = 1, . . . , K. This simple observation paves the way for designing scoring strategies relying on the estimation of a finite number of minimum volume sets, just like the approach described in the next section.
Its piecewise constant MV curve is given by: ∀α ∈ [0, 1),

MV_{s_W}(α) = Σ_{k=0}^{K} λ_{s,k} · I{α ∈ [α_{s,k}, α_{s,k+1})}.
Figure 2: Discrete collection of minimum volume sets (on the left) and MV curve of the related piecewise constant scoring function (on the right).
The A-Rank algorithm. Given the mass levels α_1 < · · · < α_K and the estimated minimum volume sets Ω̂_1, . . . , Ω̂_K, make the collection monotone by setting

Ω̃_1 = Ω̂_1 and Ω̃_{k+1} = Ω̂_{k+1} ∪ Ω̃_k for 1 ≤ k < K,

and output the piecewise constant scoring function (with Ω̃_0 = ∅ by convention)

s̃_∆(x) = Σ_{k=1}^{K} I{x ∈ Ω̃_k} = Σ_{k=1}^{K} (K − k + 1) · I{x ∈ Ω̃_k \ Ω̃_{k−1}},

together with the piecewise constant curve estimate MV̂*_∆(α) = Σ_{k=1}^{K} λ̂_k · I{α ∈ [α_k, α_{k+1})}, where λ̂_k = λ(Ω̂_k).
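A compact illustration of the procedure is sketched below (ours). For brevity, the minimum volume set estimates Ω̂_k are instantiated as level sets of a kernel density estimate, which is merely a convenient shortcut for the demo (see Remark 3 for the paper's caveats on plug-in density estimation); any minimum volume set estimator over a class G can be substituted.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 2))            # unlabeled training data

# For the demo only, Omega_hat_k is a level set of a kernel density estimate;
# the union step making the collection nested is then automatic, since
# density level sets are already nested.
kde = gaussian_kde(X.T)
dens_train = kde(X.T)
K = 10
alphas = np.arange(1, K + 1) / (K + 1)      # mass levels alpha_1 < ... < alpha_K
taus = np.quantile(dens_train, 1 - alphas)  # decreasing density thresholds

def score(x):
    # s_tilde(x) = sum_k I{x in Omega_tilde_k} = number of thresholds cleared
    d = kde(np.atleast_2d(x).T)
    return (d[:, None] >= taus[None, :]).sum(axis=1)

print(score(np.array([0.0, 0.0])))        # central point: high score (= K)
print(score(np.array([4.0, 4.0])))        # outlying point: low score (= 0)
```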
6.3 Performance bounds for the A-Rank algorithm

We now prove a result describing the performance of the scoring function produced by the algorithm proposed in the previous section to solve the anomaly scoring problem, as formulated in section 5. The following assumptions are required.
A5 ∀α ∈ [0, 1 − ε], Ω*_α ∈ G.
Theorem 3. Suppose that assumptions A1 − A2 and A5 − A6 hold true. Let δ ∈ (0, 1). Consider the mesh ∆_K made of gridpoints α_k = (1 − ε)k/K, 0 ≤ k ≤ K, and take as tolerance parameter

φ = φ_n(δ) := 2R_n + √(log(1/δ) / (2n)).

Then, if K_n ∼ n^{1/4} as n → +∞, there exists a constant c = c(δ) such that, with probability larger than 1 − δ, we have for all n ≥ 1:

sup_{α∈[0,1−ε]} { MV_{s̃_∆K}(α) − MV*(α) } ≤ c n^{-1/4},    (11)
The proof of the result stated above can be found in the Appendix; it essentially relies on Theorems 3 and 11 in [18] combined with bound (8). As a careful examination of the argument shows, the condition K_n ∼ n^{1/4} corresponds to a balance between the approximation error (of the order O(1/K), see Eq. (8)) and the stack of the errors made by overlaying empirical minimum volume sets at K_n different mass levels. The rate is not proved to be optimal here; lower bounds will be investigated in future work.

The general approach described above can be extended in several ways. First, the class G over which minimum volume set estimation is performed (and the tolerance parameter as well) could vary depending on the mass target α. Additionally, a model selection step could be incorporated, so as to select an adequate class G (refer to section 4 in [18]). Secondly, the subdivision ∆ is known in advance in the A-Rank algorithm. A natural extension would be to consider a flexible grid of the unit interval, possibly selected in an adaptive nonlinear manner, depending on the local smoothness properties of the optimal curve MV*, see [5]. Due to the convexity of MV*, the spacings α_{k+1} − α_k should be picked as nonincreasing. The assumption that grid points are equispaced in Theorem 3 aims at simplifying the analysis only and is by no means necessary: the sole restriction is actually that sup_k {α_{k+1} − α_k} = O(1/K). Such extensions are far beyond the scope of the paper and shall be the subject of future work.
Remark 3. (Plug-in) A possible strategy would be to first estimate the unknown density function f(x) by means of (non-)parametric techniques and next use the resulting estimator as a scoring function. Beyond the computational difficulties one would be confronted with for large or even moderate values of the dimension, we point out that the goal pursued in this paper is by nature very different from density estimation: the local properties of the density function are useless here, only the ordering of the possible observations x ∈ X it induces is of importance, see Proposition 1.
The next proposition provides a bound in sup norm for the curve MV̂*_∆ as an estimate of the optimum MV*.

Proposition 4. Suppose that the assumptions of Theorem 3 are fulfilled and take K_n ∼ n^{1/2} as n → +∞. Then, there exists a constant c′ = c′(δ) such that, with probability at least 1 − δ,

sup_{α∈[0,1−ε]} |MV̂*_{∆_K}(α) − MV*(α)| ≤ c′ (log n / n)^{1/2}.    (12)
One could also consider the alternative estimator MṼ*(α) = Σ_{k=1}^{K} λ(Ω̃_k) · I{α ∈ [α_k, α_{k+1})}. However, the rate would be much slower, due to the error stacking phenomenon resulting from stage 2 (monotonicity).
In the present analysis, we have restricted our attention to a truncated part of the MV space, corresponding to mass levels in the interval [0, 1 − ε] and thus avoiding out-of-sample tail regions (in the case where supp F is of infinite Lebesgue measure). An interesting direction for further research could consist in investigating the accuracy of anomaly scoring algorithms when letting ε = ε_n slowly decay to zero as n → +∞, under adequate assumptions on the asymptotic behavior of MV*.
7 CONCLUSION
Motivated by a wide variety of applications, we have formulated the issue of learning how to rank observations in the same order as that induced by the density function, which we called anomaly scoring here. For this problem, much less ambitious than estimating the local values taken by the density, a functional performance criterion, namely the Mass Volume curve, has been proposed. What MV curve analysis achieves for unsupervised anomaly detection is quite akin to what ROC curve analysis accomplishes in the supervised setting. The statistical estimation of MV curves has been investigated from an asymptotic perspective, and we have provided a strategy, where the feature space is overlaid with a few well-chosen empirical minimum volume sets, to build a scoring function with statistical guarantees in terms of rate of convergence for the sup norm in the MV space. Its analysis suggests a number of novel and important issues for unsupervised statistical learning, such as the extension of the approach promoted here to the case where the mass levels of the overlaid empirical minimum volume sets are chosen adaptively, so as to capture the geometry of the optimal MV curve.
This is in contradiction with Proposition 1. Indeed, consider then the scoring function f̃(x) equal to f(x) on Ω*_{α−} ∪ (X \ Ω*_{α+}), to

on Ω*_{α+} \ Ω*_α. One may then check easily that Eq. (3) is not fulfilled at α for s = f̃. The formula for MV*′ is a particular case of that given in Proposition 3.
Proof of Proposition 3

Applying the co-area formula for Lipschitz functions (see [11, p. 249, th. 3.2.12]) to g(x) = (1/|∇s(x)|) I{x : |∇s(x)| > 0, s(x) ≥ t} yields

λ_s(t) = ∫_t^{+∞} du ∫_{s^{-1}(u)} (1/|∇s(y)|) µ(dy),

and thus λ_s′(t) = − ∫_{s^{-1}({t})} (1/|∇s(y)|) µ(dy). The desired formula then follows from the composite differentiation rule.
Proof of Theorem 1

We have

α̂_s^{-1}(α) − α_s^{-1}(α) = (1/√n) W^{(n)}(α) / f_s(α_s^{-1}(α)) + o(v_n),

uniformly over [0, 1 − ε], as n → +∞. Now the desired results can be immediately derived from the Law of the Iterated Logarithm for the Brownian Bridge, combined with a Taylor expansion of λ_s.
Proof of Theorem 2

The argument is based on the strong approximation stated in Theorem 1, combined with a standard coupling argument. The main steps of the argument are as follows. Let P* be the conditional probability given the original
data D_n. Note firstly that, conditioned on D_n, by applying Theorem 1's part (ii) with MṼ_s as target curve (instead of MV_s), one gets that, with probability 1, r_n*(α) is equivalent to a stochastic process with the same law as

Z^{(n)*}(α) = ( λ_s′(α̃_s^{-1}(α)) / f̃_s(α̃_s^{-1}(α)) ) W^{(n)}(α),

under the stipulated conditions, see [12]. Hence, we almost surely have, uniformly over [0, 1 − ε],

Z^{(n)*}(α) = Z^{(n)}(α) + O( log(h_n^{-1}) / √(n h_n) ).

Since this result holds uniformly in α ∈ [0, 1 − ε], by virtue of the continuous mapping theorem applied to the function sup_{α∈[0,1−ε]}(·), the result also holds in distribution up to the same order.
Proof of Theorem 3

For clarity, we start off with recalling the following result, which straightforwardly derives from Theorems 3 and 11 in [18].
Additional notation will be required. We set ᾱ_k = F(Ω̂_k), λ̄_k = λ(Ω̂_k), α̃_k = F(Ω̃_k) and λ̃_k = λ(Ω̃_k) for all k ∈ {1, . . . , K}. Equipped with this notation, we may write:

∀α ∈ (0, 1), MV_{s̃_∆}(α) = Σ_{k=1}^{K} λ̃_k · I{α ∈ [α̃_k, α̃_{k+1})}.
the additivity of F(dx) again yields: F(Ω̃_2) = F(Ω̂_2) + φ_n(δ). Using the bound stated in Proposition 5, the general result is then obtained by recursion. In a quite similar way, the same inequalities hold true for the λ's. The lemma above implies that max_{1≤k≤K} |α̃_k − ᾱ_k| = O_P(K φ_n(δ)). Using Proposition 5 (replacing δ by δ/K) combined with the union bound, we obtain that max_{1≤k≤K} |ᾱ_k − α_k| = O_P(φ_n(δ/K)). Then, we may write

|α̃_{k+1} − α̃_k| ≤ max_{1≤k≤K} |ᾱ_k − ᾱ_{k−1}| + K O_P(φ_n(δ)).
Proof of Proposition 4

Re-using the notation introduced in the proof above, we decompose the pointwise deviation as follows: ∀α ∈ [0, 1 − ε],

MV̂*_∆(α) − MV*(α) = Σ_{k=1}^{K} ( MV̂*_∆(α) − MV*(α_k) ) · I{α ∈ [α_k, α_{k+1})} + MV*_∆(α) − MV*(α).    (15)

Bound (8) applied to the second term of the decomposition above proves it is of order O(1/K). Proposition 5 implies that, with probability at least 1 − δ, the first term in the decomposition is bounded by 2φ_n(δ/K)/Q*(1 − ε). Taking K_n ∼ n^{1/2} yields the rate bound stated in Proposition 4.
References

[1] S. Clémençon and J. Jakubowicz. Scoring anomalies: a M-estimation formulation. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, Scottsdale, USA, 2013.

[2] M. Csörgő and P. Révész. Quantile process approximations. The Annals of Statistics, 6(4):882–894, 1978.

[3] M. Csörgő and P. Révész. Strong Approximation Theorems in Probability and Statistics. Academic Press, 1981.

[4] C. de Boor. A Practical Guide to Splines. Springer, 2001.

[5] R. DeVore and G. Lorentz. Constructive Approximation. Springer, 1993.

[6] D. Donoho and M. Gasko. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. The Annals of Statistics, 20:1803–1827, 1992.

[7] B. Efron. Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7:1–26, 1979.

[8] J.P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

[9] J.H.J. Einmahl and D.M. Mason. Generalized quantile processes. The Annals of Statistics, 20:1062–1078, 1992.

[10] M. Falk and R. Reiss. Weak convergence of smoothed and nonsmoothed bootstrap quantile estimates. The Annals of Probability, 17:362–371, 1989.

[11] H. Federer. Geometric Measure Theory. Springer, 1969.

[12] E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré (B), Probabilités et Statistiques, 38:907–921, 2002.

[13] P. Hall. On the number of bootstrap simulations required to construct a confidence interval. The Annals of Statistics, 14:1453–1462, 1986.

[14] V. Koltchinskii. M-estimation, convexity and quantiles. The Annals of Statistics, 25(2):435–477, 1997.

[15] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). The Annals of Statistics, 34:2593–2706, 2006.

[17] P. Rigollet and R. Vert. Fast rates for plug-in estimators of density level sets. Bernoulli, 14(4):1154–1178, 2009.

[23] B.Y. Zuo and R. Serfling. General notions of statistical depth function. The Annals of Statistics, 28(2):461–482, 2000.