
A New Distance for Probability Measures

Based on the Estimation of Level Sets


Alberto Muñoz, Gabriel Martos, Javier Arriero, and Javier González

Department of Statistics, University Carlos III, Madrid, and Johann Bernoulli
Institute for Mathematics and Computer Science, University of Groningen
{alberto.munoz,gabrielalejandro.martos,javier.arriero}@uc3m.es,
j.gonzalez.hernandez@rug.nl

Abstract. In this paper we propose to consider Probability Measures (PMs) as generalized functions belonging to some functional space endowed with an inner product. This approach allows us to introduce a new family of distances for PMs. We propose a particular (non-parametric) metric for PMs belonging to this class, based on the estimation of density level sets. Some real and simulated data sets are used for a first exploration of its performance.

1 Introduction

The study of distances between probability measures/distributions (PMs) is increasingly attracting attention in the fields of Data Analysis and Pattern Recognition. Classical examples are homogeneity, independence and goodness-of-fit test problems between populations, where the general aim is to determine whether data generated from different random samples come from the same population or not. These problems can be solved by choosing an appropriate distance between PMs. Further examples can be found in Clustering [5,25], Image Analysis [9,10], Time Series Analysis [18,13], Econometrics [24,17] and Text Mining [15,7], just to name a few.
For a review of interesting distances between probability distributions and related theoretical results, see for instance [26,4] and references therein. Non-parametric estimators often play a role in estimating such distances. In practical situations, however, only a data sample of moderate size is usually available, and the use of purely non-parametric estimators often results in poor performance [11].
An appealing point of view, initiated by Fisher and Rao [6,1,3] and continued with the recent development of Functional Data Analysis and Information Geometry methods (e.g. [20,2]), is to consider probability distributions as points belonging to some manifold, and then take advantage of the manifold structure to derive appropriate metrics for distributions.
In this work we elaborate on the idea of considering PMs as points in a functional space endowed with an inner product, and then obtain different distances for PMs from the metric structure derived from the ambient inner product. We propose a particular instance of such metrics for generalized functions based on
the estimation of density level set regions. The article is organized as follows: in Section 2 we present probability measures as generalized functions and define general distances acting on the Schwartz distribution space. Section 3 presents a new distance built according to this point of view. Section 4 illustrates the theory with some simulated and real data sets.

2 Probability Measures as Schwartz Distributions

Consider a measure space $(X, \mathcal{F}, \mu)$, where $X$ is a sample space (here a compact set of a real vector space), $\mathcal{F}$ a $\sigma$-algebra of measurable subsets of $X$ and $\mu: \mathcal{F} \to \mathbb{R}^{+}$ the ambient $\sigma$-additive measure. A probability measure $\mathbb{P}$ is a $\sigma$-additive finite measure absolutely continuous w.r.t. $\mu$ that satisfies the three Kolmogorov axioms. By the Radon--Nikodym theorem, there exists a measurable function $f: X \to \mathbb{R}^{+}$ (the density function) such that $\mathbb{P}(A) = \int_{A} f \, d\mu$, and $f = \frac{d\mathbb{P}}{d\mu}$ is the Radon--Nikodym derivative.
A PM can be regarded as a Schwartz distribution (generalized function) (see [22] for an introduction to Distribution Theory): we consider a vector space $\mathcal{D}$ of test functions. The usual choice is to consider all the functions in $C^{\infty}(X)$ having compact support. A distribution (also named generalized function) is a continuous linear functional on $\mathcal{D}$. A probability measure can be regarded as a Schwartz distribution $\mathbb{P}: \mathcal{D} \to \mathbb{R}$ by defining
$$\mathbb{P}(\phi) = \langle \mathbb{P}, \phi \rangle = \int \phi \, d\mathbb{P} = \int \phi(x) f(x) \, d\mu(x) = \langle \phi, f \rangle.$$
Thus, the probability distribution identifies with a linear functional obeying the Riesz representer theorem: the representer for $\mathbb{P}$ is its density function $f$: $\mathbb{P}(\phi) = \langle \phi, f \rangle$.
In particular, the familiar condition $\mathbb{P}(X) = 1$ is equivalent to $\langle \mathbb{P}, \mathbb{1}_{[X]} \rangle = 1$, where the function $\mathbb{1}_{[X]}$ belongs to $\mathcal{D}$, $X$ being compact. Note that we do not need $f \in \mathcal{D}$; only the integral $\langle \phi, f \rangle$ should be properly defined for every $\phi \in \mathcal{D}$.
A probability measure/distribution is thus a continuous linear functional acting on a given function space. Two given linear functionals $\mathbb{P}_1$ and $\mathbb{P}_2$ will be the same (or similar) if they act identically (or similarly) on every $\phi \in \mathcal{D}$. For instance, if we choose $\phi = \mathrm{Id}$, then $\mathbb{P}_1(\phi) = \langle \mathbb{P}_1, x \rangle = \int x \, d\mathbb{P}_1 = E_{\mathbb{P}_1}[X]$, and if $\mathbb{P}_1$ and $\mathbb{P}_2$ are similar then $\mu_1 = E_{\mathbb{P}_1}[X] \approx E_{\mathbb{P}_2}[X] = \mu_2$ because $\mathbb{P}_1$ and $\mathbb{P}_2$ are continuous. Similar arguments apply for the variance (take $\phi(x) = (x - E[X])^2$) and, in general, for higher-order moments. For $\phi(x) = e^{itx}$, $t \in \mathbb{R}$, we obtain the Fourier transform of the probability measure (called the characteristic function in Statistics), given by $\widehat{\mathbb{P}}(t) = \langle \mathbb{P}, e^{itx} \rangle = \int e^{itx} \, d\mathbb{P}(x)$.
Thus, two PMs can be identified with their action as functionals on the test functions and, hence, distances between two distributions can be defined from the differences between functional evaluations for appropriately chosen test functions.
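
This functional point of view can be checked empirically: each $\langle \mathbb{P}, \phi \rangle$ is estimated by a sample mean. Below is a minimal sketch (ours, not from the paper) that compares two samples through a handful of test functions: the identity, a centred square and two characteristic-function frequencies. The sample sizes, distributions and frequencies are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # sample from P1
y = rng.normal(0.2, 1.0, size=1000)   # sample from P2

def functional_gap(x, y, phi):
    """|<P1, phi> - <P2, phi>|, with each functional evaluation
    estimated by the corresponding sample mean (Monte Carlo)."""
    return abs(phi(x).mean() - phi(y).mean())

tests = {
    "mean":     lambda s: s,                     # phi = Id
    "variance": lambda s: (s - s.mean()) ** 2,   # phi(x) = (x - E[X])^2
    "cf(t=1)":  lambda s: np.exp(1j * 1.0 * s),  # characteristic function
    "cf(t=2)":  lambda s: np.exp(1j * 2.0 * s),
}
for name, phi in tests.items():
    print(f"{name:9s} gap = {functional_gap(x, y, phi):.4f}")
```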

3 A Metric Based on the Estimation of Level Sets

Given two PMs $\mathbb{P}$ and $\mathbb{Q}$, we consider a family of test functions $\{\phi_i\}_{i \in I} \subseteq \mathcal{D}$ and then define distances between $\mathbb{P}$ and $\mathbb{Q}$ by weighting terms of the type $d(\langle \mathbb{P}, \phi_i \rangle, \langle \mathbb{Q}, \phi_i \rangle)$ for $i \in I$, where $d$ is some distance function. Our test functions will be indicator functions of $\alpha$-level sets, introduced below.
Given a PM $\mathbb{P}$ with density function $f_{\mathbb{P}}$, minimum volume sets (or $\alpha$-level sets) are defined by $S_{\alpha}(f_{\mathbb{P}}) = \{x \in X \,|\, f_{\mathbb{P}}(x) \geq \alpha\}$, such that $\mathbb{P}(S_{\alpha}(f_{\mathbb{P}})) = 1 - \nu$, where $0 < \nu < 1$. If we consider an ordered sequence $\alpha_1 < \cdots < \alpha_n$, $\alpha_i \in (0, 1)$, then $S_{\alpha_{i+1}}(f_{\mathbb{P}}) \subseteq S_{\alpha_i}(f_{\mathbb{P}})$. Let us define $A_i(\mathbb{P}) = S_{\alpha_i}(f_{\mathbb{P}}) - S_{\alpha_{i+1}}(f_{\mathbb{P}})$, $i \in \{1, \ldots, n-1\}$. We can choose $\alpha_1 \approx 0$ and $\alpha_n \approx \max_{x \in X} f_{\mathbb{P}}(x)$ (which exists, given that $X$ is compact and $f_{\mathbb{P}}$ continuous); then $\bigcup_i A_i(\mathbb{P}) \approx \mathrm{Supp}(\mathbb{P}) = \{x \in X \,|\, f_{\mathbb{P}}(x) \neq 0\}$ (equality takes place in the limit $n \to \infty$, $\alpha_1 \to 0$ and $\alpha_n \to \max f_{\mathbb{P}}$). Given the definition of the $A_i$, if $A_i(\mathbb{P}) = A_i(\mathbb{Q})$ for every $i$ when $n \to \infty$, then $\mathbb{P} = \mathbb{Q}$. Thus, taking $\phi_i = \mathbb{1}_{[A_i]}$, our choice is
$$d(\langle \mathbb{P}, \phi_i \rangle, \langle \mathbb{Q}, \phi_i \rangle) = \left| \langle \mathbb{P}, \phi_i \rangle - \langle \mathbb{Q}, \phi_i \rangle \right| = \left| \int \mathbb{1}_{[A_i]} \, d\mathbb{P} - \int \mathbb{1}_{[B_i]} \, d\mathbb{Q} \right| \approx \left| \int \left( \mathbb{1}_{[A_i]} - \mathbb{1}_{[B_i]} \right) d\mu \right|,$$
with $\mu$ the ambient measure and $B_i = A_i(\mathbb{Q})$. Indeed, given the definition of level set and the choice of the $A_i$, both $\mathbb{P}$ and $\mathbb{Q}$ are approximately constant on $A_i$ and $B_i$, respectively, and so we are using the counting (ambient) measure.
Denote by $\triangle$ the symmetric difference operator: $A \,\triangle\, B = (A - B) \cup (B - A)$. Consider $\phi_{1i} = \mathbb{1}_{[A_i(\mathbb{P}) - A_i(\mathbb{Q})]}$ and $\phi_{2i} = \mathbb{1}_{[A_i(\mathbb{Q}) - A_i(\mathbb{P})]}$, and let $d_i(\mathbb{P}, \mathbb{Q}) = | \langle \mathbb{P}, \phi_{1i} \rangle - \langle \mathbb{Q}, \phi_{1i} \rangle | + | \langle \mathbb{P}, \phi_{2i} \rangle - \langle \mathbb{Q}, \phi_{2i} \rangle |$. From the previous discussion, $d_i(\mathbb{P}, \mathbb{Q}) \approx \mu(A_i(\mathbb{P}) \,\triangle\, A_i(\mathbb{Q}))$, which motivates the following:
Definition 1. Weighted level-set distance. Given a set of weights $w = \{w_i\}_{i=1}^{n-1}$, we define
$$d_w(\mathbb{P}, \mathbb{Q}) = \sum_{i=1}^{n-1} w_i \, \frac{\mu\left(A_i(\mathbb{P}) \,\triangle\, A_i(\mathbb{Q})\right)}{\mu\left(A_i(\mathbb{P}) \cup A_i(\mathbb{Q})\right)}, \qquad (1)$$
where $\mu$ is the ambient measure. We use $\mu(A_i(\mathbb{P}) \,\triangle\, A_i(\mathbb{Q}))$ in the numerator instead of $d_i(\mathbb{P}, \mathbb{Q})$ for compactness; when $n \to \infty$ both expressions are equivalent. Of course, we can calculate $d_w$ in eq. (1) only when we know the distribution function of both PMs. In practice two data samples generated from $\mathbb{P}$ and $\mathbb{Q}$ will be available, and we need to define a plug-in estimator: consider the estimators $\hat{A}_i(\mathbb{P}) = \hat{S}_{\alpha_i}(f_{\mathbb{P}}) - \hat{S}_{\alpha_{i+1}}(f_{\mathbb{P}})$; then we can estimate $d_w(\mathbb{P}, \mathbb{Q})$ by
$$\hat{d}_w(\mathbb{P}, \mathbb{Q}) = \sum_{i=1}^{n-1} w_i \, \frac{\#\left(\hat{A}_i(\mathbb{P}) \,\triangle_S\, \hat{A}_i(\mathbb{Q})\right)}{\#\left(\hat{A}_i(\mathbb{P}) \cup \hat{A}_i(\mathbb{Q})\right)}, \qquad (2)$$
where $\#A$ indicates the number of points in $A$ and $\triangle_S$ indicates the set estimate of the symmetric difference, defined below.
Both $d_w(\mathbb{P}, \mathbb{Q})$ and $\hat{d}_w(\mathbb{P}, \mathbb{Q})$, as currently defined, are semimetrics. The construction of proper Euclidean metrics will be addressed in work immediately following the present one.
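
Before moving to the sample-based estimator, eq. (1) can be illustrated numerically when both densities are known, replacing the ambient measure by counts over a fine grid (the constant cell width cancels in the ratio). The following sketch is our illustration, not code from the paper; the densities, levels and uniform weights are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

# Grid approximation of d_w in eq. (1) for two known 1-D densities:
# mu(set) is replaced by the number of grid cells in the set.
grid = np.linspace(-6, 6, 4001)
f_P = norm(loc=0.0, scale=1.0).pdf(grid)
f_Q = norm(loc=0.5, scale=1.0).pdf(grid)

alphas = np.linspace(0.01, 0.38, 8)   # levels alpha_1 < ... < alpha_n
w = np.ones(len(alphas) - 1)          # uniform weights w_i

d_w = 0.0
for i in range(len(alphas) - 1):
    # A_i = S_{alpha_i} - S_{alpha_{i+1}}, as boolean masks over the grid
    A_iP = (f_P >= alphas[i]) & ~(f_P >= alphas[i + 1])
    A_iQ = (f_Q >= alphas[i]) & ~(f_Q >= alphas[i + 1])
    union = np.logical_or(A_iP, A_iQ).sum()
    if union > 0:   # xor counts the cells in the symmetric difference
        d_w += w[i] * np.logical_xor(A_iP, A_iQ).sum() / union
print("approximate d_w(P, Q) =", round(d_w, 4))
```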
3.1 Estimation of Level Sets

To estimate level sets from a data sample we present the following definitions and theorems, concerning the One-Class Neighbor Machine [19].

Definition 2. Neighbourhood Measures. Consider a random variable $X$ with density function $f(x)$ defined on $\mathbb{R}^d$. Let $S_n$ denote the set of random independent identically distributed (iid) samples of size $n$ (drawn from $f$). The elements of $S_n$ take the form $s_n = (x_1, \ldots, x_n)$, where $x_i \in \mathbb{R}^d$. Let $M: \mathbb{R}^d \times S_n \to \mathbb{R}$ be a real-valued function defined for all $n \in \mathbb{N}$. (a) If $f(x) < f(y)$ implies $\lim_{n \to \infty} P(M(x, s_n) > M(y, s_n)) = 1$, then $M$ is a sparsity measure. (b) If $f(x) < f(y)$ implies $\lim_{n \to \infty} P(M(x, s_n) < M(y, s_n)) = 1$, then $M$ is a concentration measure.

Example: consider the distance from a point $x \in \mathbb{R}^d$ to its $k$-th nearest neighbour $x_{(k)}$ in $s_n$: $M(x, s_n) = d_k(x, s_n) = d(x, x_{(k)})$. This is a sparsity measure. Note that $d_k$ is neither a density estimator nor one-to-one related to a density estimator; thus, the definition of a sparsity measure is not trivial.
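
A minimal sketch of this $k$-NN sparsity measure, assuming SciPy; the sample, the value $k = 10$ and the query points are illustrative choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_sparsity(x, sample, k=10):
    """Sparsity measure M(x, s_n): distance from x to its k-th nearest
    neighbour in the sample; larger values indicate lower density."""
    dist, _ = cKDTree(sample).query(np.atleast_2d(x), k=k)
    return dist[:, -1]   # distance to the k-th neighbour

rng = np.random.default_rng(1)
s_n = rng.normal(size=(1000, 2))
# A point near the mode gets a smaller sparsity value than a remote one.
print(knn_sparsity(np.array([[0.0, 0.0], [3.0, 3.0]]), s_n, k=10))
```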
The Support Neighbour Machine [19] solves the following optimization problem:
$$\max_{\rho, \, \xi} \;\; \nu n \rho - \sum_{i=1}^{n} \xi_i \qquad \text{s.t.} \quad g(x_i) \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n, \qquad (3)$$
where $g(x) = M(x, s_n)$ is a sparsity measure, $\nu \in [0, 1]$ is a predefined constant and the $\xi_i$, $i = 1, \ldots, n$, are slack variables.
Theorem 1. The set $R_n = \{x : h_n(x) = \mathrm{sign}(\rho_n - g_n(x)) \geq 0\}$ converges to a region of the form $S_{\alpha}(f) = \{x \,|\, f(x) \geq \alpha\}$, such that $P(S_{\alpha}(f)) = \nu$. Therefore, the Support Neighbour Machine estimates a density contour cluster $S_{\alpha}(f)$ (around the mode).

Hence, we take $\hat{A}_i(\mathbb{P}) = \hat{S}_{\alpha_i}(f_{\mathbb{P}}) - \hat{S}_{\alpha_{i+1}}(f_{\mathbb{P}})$, where $\hat{S}_{\alpha_i}(f_{\mathbb{P}})$ is estimated by the $R_n$ defined above.
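
The optimization (3) is solved exactly in [19]; as a hedged simplification (an assumption on our part, not the paper's algorithm), its effect for the $k$-NN sparsity measure can be emulated by thresholding $g$ at its empirical $\nu$-quantile, which keeps the fraction $\nu$ of sample points with the highest estimated density. The function names and $k = 10$ below are ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_level_set(sample, nu, k=10):
    """Rough stand-in for the Support Neighbour Machine region R_n:
    keep the points whose k-NN sparsity measure g is below its
    empirical nu-quantile (the highest-density fraction nu of points).
    k+1 neighbours are queried because each point's nearest
    neighbour in its own sample is itself."""
    g = cKDTree(sample).query(sample, k=k + 1)[0][:, -1]
    return sample[g <= np.quantile(g, nu)]   # estimated S_alpha

def estimate_bands(sample, nus, k=10):
    """Band estimates A_i = S_{alpha_i} - S_{alpha_{i+1}} as point sets,
    for a decreasing sequence nu_1 > ... > nu_n of coverage levels
    (larger nu <-> smaller level alpha <-> bigger level set)."""
    g = cKDTree(sample).query(sample, k=k + 1)[0][:, -1]
    masks = [g <= np.quantile(g, nu) for nu in nus]   # nested sets
    return [sample[masks[i] & ~masks[i + 1]] for i in range(len(nus) - 1)]
```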
3.2 Estimation of the Symmetric Difference between Sets

We should not estimate the region $A \,\triangle\, B$ by $\hat{A} \,\triangle\, \hat{B} = (\hat{A} - \hat{B}) \cup (\hat{B} - \hat{A})$, given that there will probably be no points in common between $\hat{A}$ and $\hat{B}$ (which implies $\hat{A} \,\triangle\, \hat{B} = \hat{A} \cup \hat{B}$). Given that $\hat{A}$ is a set of points estimating the spatial region $A$, we will estimate the region $A$ by a covering of the type $\bar{A} = \bigcup_{i}^{n} B(x_i, r)$, where the $B(x_i, r)$ are closed balls with centres at $x_i \in \hat{A}$ and radius $r$ [8]. The radius is chosen to be constant because we can assume the density to be approximately constant inside the region $A_i(\mathbb{P})$ if the partition $\{\alpha_i\}$ of the set $(0, 1)$ is fine enough. The problem of calculating $\hat{A} \,\triangle\, \hat{B}$ therefore reduces to estimating the points in $\hat{B}$ not belonging to the covering estimate of $A$, plus the points in $\hat{A}$ not belonging to the covering estimate of $B$. This will be denoted by $\hat{A} \,\triangle_S\, \hat{B}$. Figure 1 illustrates the previous discussion. Notice that $\triangle_S$ implicitly gives rise to kernels for sets, for example $K(A, B) = 1 - \mu(A \,\triangle_S\, B)/\mu(A \cup B)$, which allows one to consider distances for distributions in the context of kernel methods.
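
A minimal sketch of the covering-based count and of the plug-in estimator (2), assuming the `estimate_bands` sketch above; the ball radius `r` and all names are our illustrative choices:

```python
import numpy as np
from scipy.spatial import cKDTree

def sym_diff_count(A, B, r):
    """#(A triangle_S B): points of B farther than r from every point
    of A (outside the ball covering of A), plus points of A farther
    than r from every point of B."""
    if len(A) == 0 or len(B) == 0:
        return len(A) + len(B)
    b_out = (cKDTree(A).query(B, k=1)[0] > r).sum()
    a_out = (cKDTree(B).query(A, k=1)[0] > r).sum()
    return int(a_out + b_out)

def level_distance(bands_P, bands_Q, weights, r):
    """Plug-in estimator (2). The two point sets come from different
    samples and share no points, so #(A_i union B_i) = len(A) + len(B)."""
    d = 0.0
    for A, B, w in zip(bands_P, bands_Q, weights):
        if len(A) + len(B) > 0:
            d += w * sym_diff_count(A, B, r) / (len(A) + len(B))
    return d
```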

Fig. 1. Set estimate of the symmetric difference. (a) Data sets $\hat{A}$ and $\hat{B}$. (b) $\hat{A} - \hat{B}$. (c) $\hat{B} - \hat{A}$.

4 Experimental Work

Since the proposed distance is intrinsically non-parametric, there are no simple parameters (like mean and variance) on which we can concentrate our attention to do exhaustive benchmarking. The strategy will be to compare the proposed distance with other classical PM distances for some well-known (and parametrized) distributions, to get a first impression of its performance. Here we consider the Kullback-Leibler (KL) divergence, the Kolmogorov-Smirnov (KS) distance and the t-test (T) measure (Hotelling test in the multivariate case).
We begin by testing our Level Distance (LD) in the case most favourable to the classical PM metrics: normal distributions. We consider two mixtures of normally distributed populations, $\delta\, N(\mu = d^{-1/2} \mathbf{1}; \Sigma = .75\, I) + (1 - \delta)\, N(\mu = -d^{-1/2} \mathbf{1}; \Sigma = I)$ and $(1 - \delta)\, N(\mu = d^{-1/2} \mathbf{1}; \Sigma = I) + \delta\, N(\mu = -d^{-1/2} \mathbf{1}; \Sigma = .75\, I)$, with $\delta = .6$ and $d$ the dimension considered, in order to compare the discrimination performance of the proposed distance relative to other classical multivariate distances: the Kullback-Leibler (KL) divergence and the t-test (T) measure (Hotelling test in the multivariate case). We found the minimum sample size $n$ for which the PM metrics are able to discriminate between both samples. In all cases we choose a type I error $\alpha = .05$ and a type II error $\beta = .1$. Table 1 reports the results; we can see that the Level Set Distance (LD) measure is more efficient (in terms of sample size) in all the dimensions considered.
Table 1. Minimum sample size for a 5% type I and 10% type II error

Metric   d=1    d=2    d=3    d=4    d=5    d=10   d=20    d=50    d=100
KL       1300   1700   1800   1900   2000   2700   >5000   >5000   >5000
T         750    800    900   1000   1100   1400    1500    2100    2800
LD        200    380    650    750    880   1350    1400    1800    2200
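
The paper does not detail the testing procedure behind Table 1; one standard way to decide whether a distance discriminates two samples, sketched below as an assumption on our part, is a permutation test: the observed distance is compared with its null distribution over random re-splits of the pooled sample. Here `statistic` could be, for instance, the `level_distance` pipeline sketched above.

```python
import numpy as np

def permutation_pvalue(x, y, statistic, n_perm=500, seed=0):
    """Two-sample permutation test for samples x, y of shape (n, d):
    p-value of the observed distance under the null hypothesis that
    both samples come from the same PM."""
    rng = np.random.default_rng(seed)
    observed = statistic(x, y)
    pooled = np.vstack([x, y])
    n_x = len(x)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        # distance between a random re-split of the pooled sample
        exceed += statistic(pooled[idx[:n_x]], pooled[idx[n_x:]]) >= observed
    return (exceed + 1) / (n_perm + 1)
```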

Fig. 2. MDS plot for texture groups. A representer for each class is plotted in the map.

Fig. 3. Dendrogram with shaded image texture groups

To conclude, we show an application of the LD measure to evaluating distances between data sets. To this aim, we consider 9 data sets from the Kylberg texture data set [14]: blanket, canvas, seat, oatmeal, rice, lentils, linseeds, stone1, stone2, belonging to 3 mean types. There are 160 × 11 = 1760 texture images with a resolution of 576 × 576 pixels. We represent each image using the 32 parameters of the wavelet coefficient histogram proposed in [16]. Next we calculate the (between-sets) distance matrix with the LD measure and obtain (by multidimensional scaling, MDS) the representation shown in Figure 2, which turns out to be a sort of MDS plot for data sets; Figure 3 shows the corresponding dendrogram. It is apparent that the textures are organized in a way highly coherent with human criteria, which seems to indicate that the proposed Level Distance is appropriate for real pattern recognition problems (high-dimensional, with a small number of data samples).
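
A sketch of this last step, assuming scikit-learn and a precomputed LD distance matrix between the texture data sets (the file name and the matrix itself are placeholders):

```python
import numpy as np
from sklearn.manifold import MDS

# D: symmetric matrix of pairwise LD distances between the data sets,
# e.g. D[i, j] = level_distance(bands[i], bands[j], weights, r).
D = np.loadtxt("ld_distance_matrix.txt")   # hypothetical precomputed file

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # 2-D map of the data sets, as in Fig. 2
```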

5 Future Work

In the near future we will address the study of the geometry induced by the proposed measure and of its asymptotic properties. Exhaustive testing on a variety of data sets following different distributions is needed. We are also working on a variation of the LD distance that satisfies the Euclidean metric conditions.
Acknowledgments. This work was partially supported by projects DGUCM
2008/00058/002 and MEC 2007/04438/001 (Spain).

References

1. Amari, S.-I., Barndorff-Nielsen, O.E., Kass, R.E., Lauritzen, S.L., Rao, C.R.: Differential Geometry in Statistical Inference. Lecture Notes–Monograph Series, vol. 10 (1987)
2. Amari, S., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society (2007)
3. Atkinson, C., Mitchell, A.F.S.: Rao's Distance Measure. The Indian Journal of Statistics, Series A 43, 345–365 (1981)
4. Müller, A.: Integral Probability Metrics and Their Generating Classes of Functions. Advances in Applied Probability 29(2), 429–443 (1997)
5. Banerjee, A., Merugu, S., Dhillon, I., Ghosh, J.: Clustering with Bregman Divergences. Journal of Machine Learning Research 6, 1705–1749 (2005)
6. Burbea, J., Rao, C.R.: Entropy differential metric, distance and divergence measures in probability spaces: a unified approach. Journal of Multivariate Analysis 12, 575–596 (1982)
7. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A Comparison of String Distance Metrics for Name-matching Tasks. In: Proceedings of IJCAI 2003, pp. 73–78 (2003)
8. Devroye, L., Wise, G.L.: Detection of abnormal behavior via nonparametric estimation of the support. SIAM J. Appl. Math. 38, 480–488 (1980)
9. Dryden, I.L., Koloydenko, A., Zhou, D.: Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. The Annals of Applied Statistics 3, 1102–1123 (2009)
10. Rubner, Y., Tomasi, C., Guibas, L.J.: The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision 40, 99–121 (2000)
11. Gretton, A., Borgwardt, K., Rasch, M., Schölkopf, B., Smola, A.: A kernel method for the two-sample problem. In: Advances in Neural Information Processing Systems, pp. 513–520 (2007)
12. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. Springer (2009)
13. Hayashi, A., Mizuhara, Y., Suematsu, N.: Embedding Time Series Data for Classification. In: Perner, P., Imiya, A. (eds.) MLDM 2005. LNCS (LNAI), vol. 3587, pp. 356–365. Springer, Heidelberg (2005)
14. Kylberg, G.: The Kylberg Texture Dataset v. 1.0. Centre for Image Analysis, Swedish University of Agricultural Sciences and Uppsala University, Uppsala, Sweden, http://www.cb.uu.se/~gustaf/texture/
15. Lebanon, G.: Metric Learning for Text Documents. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(4), 497–508 (2006)
16. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989)
17. Marriott, P., Salmon, M.: Applications of Differential Geometry to Econometrics. Cambridge University Press (2000)
18. Moon, Y.I., Rajagopalan, B., Lall, U.: Estimation of mutual information using kernel density estimators. Physical Review E 52(3), 2318–2321 (1995)
19. Muñoz, A., Moguerza, J.M.: Estimation of High-Density Regions using One-Class Neighbor Machines. IEEE Trans. on Pattern Analysis and Machine Intelligence 28(3), 476–480 (2006)
20. Ramsay, J.O., Silverman, B.W.: Applied Functional Data Analysis. Springer, New York (2005)
21. Sriperumbudur, B.K., Fukumizu, K., Gretton, A., Schölkopf, B., Lanckriet, G.R.G.: Non-parametric estimation of integral probability metrics. In: International Symposium on Information Theory (2010)
22. Strichartz, R.S.: A Guide to Distribution Theory and Fourier Transforms. World Scientific (1994)
23. Székely, G.J., Rizzo, M.L.: Testing for Equal Distributions in High Dimension. InterStat (2004)
24. Ullah, A.: Entropy, divergence and distance measures with econometric applications. Journal of Statistical Planning and Inference 49, 137–162 (1996)
25. Xing, E.P., Ng, A.Y., Jordan, M.I., Russell, S.: Distance Metric Learning, with Application to Clustering with Side-information. In: Advances in Neural Information Processing Systems, pp. 505–512 (2002)
26. Zolotarev, V.M.: Probability metrics. Teor. Veroyatnost. i Primenen. 28(2), 264–287 (1983)
