Robust Mixture Modelling Using the t Distribution
Peel and McLachlan · Statistics and Computing (2000) 10, 339–348
Normal mixture models are being increasingly used to model the distributions of a wide variety of
random phenomena and to cluster sets of continuous multivariate data. However, for a set of data
containing a group or groups of observations with longer than normal tails or atypical observations,
the use of normal components may unduly affect the fit of the mixture model. In this paper, we consider
a more robust approach by modelling the data by a mixture of t distributions. The use of the ECM
algorithm to fit this t mixture model is described and examples of its use are given in the context of
clustering multivariate data in the presence of atypical observations in the form of background noise.
Keywords: finite mixture models, normal components, multivariate t components, maximum likelihood, EM algorithm, cluster analysis
nature. The use of a mixture model of t distributions provides a sound mathematical basis for a robust method of mixture estimation and hence clustering. We shall illustrate its usefulness in the latter context by a cluster analysis of a simulated data set with background noise added and of an actual data set.

2. Previous work

Robust estimation in the context of mixture models has been considered in the past by Campbell (1984), McLachlan and Basford (1988, Chapter 3), and De Veaux and Kreiger (1990), among others, using M-estimates of the means and covariance matrices of the normal components of the mixture model. This line of approach is to be discussed in Section 9.

Recently, Markatou (1998) has provided a formal approach to robust mixture estimation by applying weighted likelihood methodology in the context of mixture models. With this methodology, an estimate of the vector of unknown parameters is obtained as a solution of the equation

$$\sum_{j=1}^{n} w(y_j)\, \partial \log f(y_j; \Psi)/\partial \Psi = 0, \qquad (1)$$

where f(y; Ψ) denotes the specified parametric form for the probability density function (p.d.f.) of the random vector Y on which y_1, ..., y_n have been observed independently. The weight function w(y) is defined in terms of the Pearson residuals; see Markatou, Basu and Lindsay (1998) and the previous work of Green (1984). The weighted likelihood methodology provides robust and first-order efficient estimators, and Markatou (1998) has established these results in the context of univariate mixture models.

One useful application of normal mixture models has been in the important field of cluster analysis. Besides having a sound mathematical basis, this approach is not confined to the production of spherical clusters, such as with k-means-type algorithms that use Euclidean distance rather than the Mahalanobis distance metric, which allows for within-cluster correlations between the variables in the feature vector Y. Moreover, unlike clustering methods defined solely in terms of the Mahalanobis distance, normal mixture-based clustering takes into account the normalizing term |Σ_i|^{−1/2} in the estimate of the multivariate normal density adopted for the component distribution of Y corresponding to the ith cluster. This term can make an important contribution in the case of disparate group-covariance matrices (McLachlan 1992, Chapter 2).

Although even a crude estimate of the within-cluster covariance matrix Σ_i often suffices for clustering purposes (Gnanadesikan, Harvey and Kettenring 1993), it can be severely affected by outliers. Hence it is highly desirable for methods of cluster analysis to provide robust clustering procedures. The problem of making clustering algorithms more robust has received much attention recently as, for example, in Smith, Bailey and Munford (1995), Davé and Krishnapuram (1996), Frigui and Krishnapuram (1996), Jolion, Meer and Bataouche (1996), Kharin (1996), Rousseeuw, Kaufman and Trauwaert (1996) and Zhuang et al. (1996).

3. Multivariate t distribution

We let y_1, ..., y_n denote an observed p-dimensional random sample of size n. With a normal mixture model-based approach to drawing inferences from these data, each data point is assumed to be a realization of the random p-dimensional vector Y with the g-component normal mixture probability density function (p.d.f.),

$$f(y; \Psi) = \sum_{i=1}^{g} \pi_i\, \phi(y; \mu_i, \Sigma_i),$$

where the mixing proportions π_i are nonnegative and sum to one, and where

$$\phi(y; \mu_i, \Sigma_i) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\left\{ -\tfrac{1}{2}(y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i) \right\} \qquad (2)$$

denotes the p-variate multivariate normal p.d.f. with mean μ_i and covariance matrix Σ_i (i = 1, ..., g). Here Ψ = (π_1, ..., π_{g−1}, θ^T)^T, where θ consists of the elements of the μ_i and the distinct elements of the Σ_i (i = 1, ..., g). In the above and sequel, we are using f as a generic symbol for a p.d.f.
One way to broaden this parametric family for potential outliers or data with longer than normal tails is to adopt the two-component normal mixture p.d.f.

$$(1 - \epsilon)\,\phi(y; \mu, \Sigma) + \epsilon\,\phi(y; \mu, c\Sigma), \qquad (3)$$

where c is large and ε is small, representing the small proportion of observations that have a relatively large variance. Huber (1964) subsequently considered more general forms of contamination of the normal distribution in the development of his robust M-estimators of a location parameter, as discussed further in Section 9.

The normal scale mixture model (3) can be written as

$$\int \phi(y; \mu, \Sigma/u)\, dH(u), \qquad (4)$$

where H is the probability distribution that places mass (1 − ε) at the point u = 1 and mass ε at the point u = 1/c. Suppose we now replace H by the p.d.f. of a chi-squared random variable on its degrees of freedom ν; that is, by the random variable U distributed as

$$U \sim \mathrm{gamma}\!\left(\tfrac{1}{2}\nu, \tfrac{1}{2}\nu\right),$$

where the gamma(α, β) density function f(u; α, β) is given by

$$f(u; \alpha, \beta) = \{\beta^{\alpha} u^{\alpha - 1}/\Gamma(\alpha)\}\, \exp(-\beta u)\, I_{(0,\infty)}(u) \qquad (\alpha, \beta > 0),$$

and the indicator function I_{(0,∞)}(u) = 1 for u > 0 and is zero elsewhere. We then obtain the t distribution with location parameter μ, positive definite inner product matrix Σ, and ν degrees of freedom,

$$f(y; \mu, \Sigma, \nu) = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2}\, \Gamma\!\left(\frac{\nu}{2}\right) \{1 + \delta(y, \mu; \Sigma)/\nu\}^{(\nu + p)/2}}, \qquad (5)$$

where δ(y, μ; Σ) = (y − μ)^T Σ^{−1} (y − μ) denotes the Mahalanobis squared distance between y and μ (with Σ as the covariance matrix).
The application of the EM algorithm for ML estimation in the case of a single component t distribution has been described in McLachlan and Krishnan (1997, Sections 2.6 and 5.8). The results there can be extended to cover the present case of a g-component mixture of multivariate t distributions.

In the EM framework, the complete-data vector is given by

$$x_c = \left(x_o^T, z_1^T, \ldots, z_n^T, u_1, \ldots, u_n\right)^T. \qquad (8)$$

Now the E-step on the (k + 1)th iteration of the EM algorithm requires the calculation of Q(Ψ; Ψ^{(k)}), the current conditional expectation of the complete-data log likelihood function log L_c(Ψ). This E-step can be effected by first taking the expectation of log L_c(Ψ) conditional also on z_1, ..., z_n, as well as x_o, and then finally over the z_j given x_o. It can be seen from (12) to (14) that in order to do this, we need to calculate

$$E_{\Psi^{(k)}}(Z_{ij} \mid y_j), \quad E_{\Psi^{(k)}}(U_j \mid y_j, z_j), \quad \text{and} \quad E_{\Psi^{(k)}}(\log U_j \mid y_j, z_j) \qquad (15)$$

for i = 1, ..., g; j = 1, ..., n.

It follows that

$$E_{\Psi^{(k)}}(Z_{ij} \mid y_j) = \tau_{ij}^{(k)},$$

where

$$\tau_{ij}^{(k)} = \frac{\pi_i^{(k)}\, f\!\left(y_j; \mu_i^{(k)}, \Sigma_i^{(k)}, \nu_i^{(k)}\right)}{f\!\left(y_j; \Psi^{(k)}\right)} \qquad (16)$$

is the posterior probability that y_j belongs to the ith component of the mixture, using the current fit Ψ^{(k)} for Ψ (i = 1, ..., g; j = 1, ..., n).

Since the gamma distribution is the conjugate prior distribution for U_j, it is not difficult to show that the conditional distribution of U_j given Y_j = y_j and Z_{ij} = 1 is

$$U_j \mid y_j, z_{ij} = 1 \sim \mathrm{gamma}(m_{1i}, m_{2i}), \qquad (17)$$

where

$$m_{1i} = \tfrac{1}{2}(\nu_i + p)$$

and

$$m_{2i} = \tfrac{1}{2}\{\nu_i + \delta(y_j, \mu_i; \Sigma_i)\}. \qquad (18)$$

From (17), we have that

$$E(U_j \mid y_j, z_{ij} = 1) = \frac{\nu_i + p}{\nu_i + \delta(y_j, \mu_i; \Sigma_i)}. \qquad (19)$$

Thus from (19),

$$E_{\Psi^{(k)}}(U_j \mid y_j, z_{ij} = 1) = u_{ij}^{(k)},$$

where

$$u_{ij}^{(k)} = \frac{\nu_i^{(k)} + p}{\nu_i^{(k)} + \delta\!\left(y_j, \mu_i^{(k)}; \Sigma_i^{(k)}\right)}. \qquad (20)$$
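Equations (16) and (20) give the two sets of E-step quantities that drive the algorithm. A minimal Python sketch of this step, reusing mvt_pdf from the earlier sketch; e_step and its argument names are ours, not the authors'.

```python
import numpy as np

def e_step(y, pis, mus, Sigmas, nus):
    """E-step quantities: responsibilities tau_ij of (16) and precision
    weights u_ij of (20). Assumes mvt_pdf from the earlier sketch."""
    n, p, g = y.shape[0], y.shape[1], len(pis)
    tau, u = np.empty((n, g)), np.empty((n, g))
    for i in range(g):
        diff = y - mus[i]
        # Mahalanobis squared distances delta(y_j, mu_i; Sigma_i)
        delta = np.einsum('ij,ij->i', diff @ np.linalg.inv(Sigmas[i]), diff)
        tau[:, i] = pis[i] * mvt_pdf(y, mus[i], Sigmas[i], nus[i])
        u[:, i] = (nus[i] + p) / (nus[i] + delta)
    tau /= tau.sum(axis=1, keepdims=True)  # divide by f(y_j; Psi) as in (16)
    return tau, u
```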
To calculate the conditional expectation (15), we need the result that if a random variable R has a gamma(α, β) distribution, then

$$E(\log R) = \psi(\alpha) - \log \beta, \qquad (21)$$

where

$$\psi(s) = \{\partial \Gamma(s)/\partial s\}/\Gamma(s)$$

is the Digamma function. Applying the result (21) to the conditional density of U_j given y_j and z_{ij} = 1, as specified by (10), it follows that

$$E_{\Psi^{(k)}}(\log U_j \mid y_j, z_{ij} = 1) = \psi\!\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\!\left[\tfrac{1}{2}\left\{\nu_i^{(k)} + \delta\!\left(y_j, \mu_i^{(k)}; \Sigma_i^{(k)}\right)\right\}\right] = \log u_{ij}^{(k)} + \left\{\psi\!\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\!\left(\frac{\nu_i^{(k)} + p}{2}\right)\right\} \qquad (22)$$

for j = 1, ..., n. The last term on the right-hand side of (22),

$$\psi\!\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\!\left(\frac{\nu_i^{(k)} + p}{2}\right),$$

can be interpreted as the correction for just imputing the conditional mean value u_{ij}^{(k)} for u_j in log u_j.

On using the results (16), (19) and (22) to calculate the conditional expectation of the complete-data log likelihood from (11), we have that Q(Ψ; Ψ^{(k)}) is given by

$$Q(\Psi; \Psi^{(k)}) = Q_1(\pi; \Psi^{(k)}) + Q_2(\nu; \Psi^{(k)}) + Q_3(\theta; \Psi^{(k)}), \qquad (23)$$

where

$$Q_1(\pi; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(k)} \log \pi_i, \qquad (24)$$

$$Q_2(\nu; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(k)}\, Q_{2j}(\nu_i; \Psi^{(k)}), \qquad (25)$$

and

$$Q_3(\theta; \Psi^{(k)}) = \sum_{i=1}^{g} \sum_{j=1}^{n} \tau_{ij}^{(k)}\, Q_{3j}(\theta_i; \Psi^{(k)}), \qquad (26)$$

and where, on ignoring terms not involving the ν_i,

$$Q_{2j}(\nu_i; \Psi^{(k)}) = -\log \Gamma\!\left(\tfrac{1}{2}\nu_i\right) + \tfrac{1}{2}\nu_i \log\!\left(\tfrac{1}{2}\nu_i\right) + \tfrac{1}{2}\nu_i \left\{ \log u_{ij}^{(k)} - u_{ij}^{(k)} + \psi\!\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\!\left(\frac{\nu_i^{(k)} + p}{2}\right) \right\} \qquad (27)$$

and

$$Q_{3j}(\theta_i; \Psi^{(k)}) = -\tfrac{1}{2} p \log(2\pi) - \tfrac{1}{2} \log |\Sigma_i| + \tfrac{1}{2} p \log u_{ij}^{(k)} - \tfrac{1}{2} u_{ij}^{(k)} (y_j - \mu_i)^T \Sigma_i^{-1} (y_j - \mu_i). \qquad (28)$$
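Maximizing (25) over ν_i amounts, on differentiating (27) as reconstructed above and setting the τ_ij^{(k)}-weighted sum over j to zero, to solving the one-dimensional equation

$$-\psi\!\left(\tfrac{1}{2}\nu_i\right) + \log\!\left(\tfrac{1}{2}\nu_i\right) + 1 + \frac{\sum_{j} \tau_{ij}^{(k)} \left(\log u_{ij}^{(k)} - u_{ij}^{(k)}\right)}{\sum_{j} \tau_{ij}^{(k)}} + \psi\!\left(\frac{\nu_i^{(k)} + p}{2}\right) - \log\!\left(\frac{\nu_i^{(k)} + p}{2}\right) = 0$$

by root search. The sketch below is our illustration of that computation, not code reproduced from the paper:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_nu(tau_i, u_i, nu_old, p):
    """Degrees-of-freedom update for component i by one-dimensional root
    search; tau_i and u_i hold tau_ij^(k) and u_ij^(k) for j = 1, ..., n."""
    corr = digamma((nu_old + p) / 2) - np.log((nu_old + p) / 2)
    c = (tau_i * (np.log(u_i) - u_i)).sum() / tau_i.sum() + corr
    # f decreases monotonically in nu towards 1 + c, so a root exists
    # whenever c < -1 (which holds unless all weights u_ij equal 1).
    f = lambda nu: -digamma(nu / 2) + np.log(nu / 2) + 1 + c
    return brentq(f, 1e-3, 1e3)  # bracket is an assumption; widen if needed
```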
Following the proposal of Kent, Tyler and Vardi (1994) in the case of a single component t distribution, we can replace the divisor $\sum_{j=1}^{n} \tau_{ij}^{(k)}$ in (31) by

$$\sum_{j=1}^{n} \tau_{ij}^{(k)} u_{ij}^{(k)}.$$
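The CM-step updates for μ_i and Σ_i (equations (29)–(31), not reproduced in this excerpt) take the standard weighted form, with the modified divisor above as an option. A sketch under that assumption; m_step_theta and its argument names are illustrative, not the authors'.

```python
import numpy as np

def m_step_theta(y, tau_i, u_i, kent_divisor=True):
    """Weighted updates for mu_i and Sigma_i, a sketch of a (29)-(31)-style
    M-step; the exact equations are not reproduced in this excerpt."""
    w = tau_i * u_i                                # combined weights tau_ij * u_ij
    mu = (w[:, None] * y).sum(axis=0) / w.sum()    # weighted component mean
    diff = y - mu
    S = (w[:, None, None] * np.einsum('ij,ik->ijk', diff, diff)).sum(axis=0)
    # Divisor swap per Kent, Tyler and Vardi (1994): sum tau*u instead of sum tau.
    div = w.sum() if kent_divisor else tau_i.sum()
    return mu, S / div
```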
$$\mu_1 = (0\ \ 3)^T, \quad \mu_2 = (3\ \ 0)^T, \quad \mu_3 = (-3\ \ 0)^T,$$

$$\Sigma_1 = \begin{pmatrix} 2 & 0.5 \\ 0.5 & 0.5 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 1 & 0 \\ 0 & 0.1 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 2 & -0.5 \\ -0.5 & 0.5 \end{pmatrix},$$

with mixing proportions π_1 = π_2 = π_3 = 1/3. The true grouping is shown in Fig. 1. We now consider the clustering obtained by fitting a mixture of three t components with unequal scale matrices but equal degrees of freedom (ν_1 = ν_2 = ν_3 = ν). The values of the weights u_{ij}^{(k)} at convergence, û_{ij}, were examined. The noise points (points 101–150) generally produced much lower û_{ij} values. In this application, an observation y_j is treated as an outlier (background noise) if $\sum_{i=1}^{g} \hat z_{ij} \hat u_{ij}$ is sufficiently small, or equivalently, if

$$\sum_{i=1}^{g} \hat z_{ij}\, \delta(y_j, \hat\mu_i; \hat\Sigma_i) \qquad (35)$$

is sufficiently large.

Fig. 2. Plot of the result of fitting a mixture of t distributions with a classification of noise at a significance level of 5% to the simulated noisy data
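A sketch of the noise-classification rule just described (cf. Fig. 2). The χ²_p cut-off mirrors the 5% significance level mentioned in the caption of Fig. 2, but the exact threshold used in the paper is not reproduced in this excerpt, so treat it as an assumption; delta_hat holds the δ(y_j, μ̂_i; Σ̂_i) and z_hat the outright component assignments ẑ_ij.

```python
import numpy as np
from scipy.stats import chi2

def flag_noise(delta_hat, z_hat, p, level=0.05):
    """Treat y_j as background noise when the assigned-component Mahalanobis
    distance (35) is large; the chi-square threshold is an assumption,
    not a rule quoted from the excerpt."""
    d = (z_hat * delta_hat).sum(axis=1)   # criterion (35) over n x g arrays
    return d > chi2.ppf(1 - level, df=p)  # e.g. 95th percentile of chi^2_p
```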
Fig. 4. Plot of the result of fitting a four component normal mixture model to the simulated noisy data

Fig. 5. Scatter plot of the second and third variates of the Blue Crab data set with their true group of origin noted

It is of interest to note that if the number of groups is treated as unknown and a normal mixture fitted, then the number of groups selected via AIC, BIC, and bootstrapping of −2 log λ is four for each of these criteria. The result of fitting a mixture of four normal components is displayed in Fig. 4. Obviously, the additional fourth component is attempting to model the background noise. However, it can be seen from Fig. 4 that this normal mixture-based clustering is still affected by the noise.

11. Example 2: Blue crab data set

To further illustrate the t mixture model-based approach to clustering, we consider now the crab data set of Campbell and Mahon (1974) on the genus Leptograpsus. Attention is focussed on the sample of n = 100 blue crabs, there being n_1 = 50 males and n_2 = 50 females, which we shall refer to as groups G_1 and G_2 respectively. Each specimen has measurements on the width of the frontal lip FL, the rear width RW, the length along the midline CL, the maximum width CW of the carapace, and the body depth BD in mm. In Fig. 5, we give the scatter plot of the second and third variates with their group of origin noted.

Hawkins' (1981) simultaneous test for multivariate normality and equal covariance matrices (homoscedasticity) suggests it is reasonable to assume that the group-conditional distributions are normal with a common covariance matrix. Consistent with this, fitting a mixture of two t components (with equal scale matrices and equal degrees of freedom) gives only a slightly improved outright clustering over that obtained using a mixture of two normal homoscedastic components. The t mixture model-based clustering results in one cluster containing 32 observations from G_1 and another containing all 50 observations from G_2, along with the remaining 18 observations from G_1; the normal mixture model leads to one additional member of G_1 being assigned to the cluster corresponding to G_2. We note in passing that, although the groups are homoscedastic, a much improved clustering is obtained without restrictions on the scale matrices, with the t and normal mixture model-based clusterings both resulting in 17 fewer misallocations.

In this example, where the normal model for the components appears to be a reasonable assumption, the estimated degrees of freedom for the t components should be large, which they are. The estimate of their common value ν in the case of equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν) is ν̂ = 22.5; the estimates of ν_1 and ν_2 in the case of unequal scale matrices and unequal degrees of freedom are ν̂_1 = 23.0 and ν̂_2 = 120.3.

The likelihood function can be fairly flat near the maximum likelihood estimates of the degrees of freedom of the t components. To illustrate this, we have plotted in Fig. 6 the profile likelihood function in the case of equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν), while in Fig. 7, we have plotted the profile likelihood function in the case of unequal scale matrices and unequal degrees of freedom.

Fig. 6. Plot of the profile log likelihood for various values of ν for the Blue Crab data set with equal scale matrices and equal degrees of freedom (ν_1 = ν_2 = ν)
Table 1. Summary of comparison of error rates when fitting normal and t-distributions with the modification of a single point

Finally, we now compare the t and normal mixture model-based clusterings after some outliers are introduced into the original data set. This was done by adding various values to the second variate of the 25th point. In Table 1, we report the overall misallocation rate of the normal and t mixture-based clusterings for each perturbed version of the original data set. It can be seen that the t mixture-based clustering is robust to those perturbations, unlike the normal mixture-based clustering. It should be noted in the cases where the constant equals −15, −10, and 20, that fitting a normal mixture model results in an outright classification of the outlier into one cluster. The remaining points are allocated to the second cluster, giving an error rate of 49%. In practice, the user should identify this situation when interpreting the results and hence remove the outlier, giving an error rate of 20%. However, when the fitting is part of an automatic procedure, this would not be the case.

Concerning the effect of outliers by working with the logs of these data under the normal mixture model (as raised by a referee), it was found that the consequent clustering is still very sensitive to atypical observations when introduced as above. However, the assumption of normality is slightly more tenable for the logged data as assessed by Hawkins' (1981) test, and the clustering of the logged data via either the normal or t mixture models does result in fewer misallocations.

Fig. 7. Plot of the profile likelihood for ν_1 and ν_2 for the Blue Crab data set with unequal scale matrices and unequal degrees of freedom ν_1 and ν_2

References

Aitchison J. and Dunsmore I.R. 1975. Statistical Prediction Analysis. Cambridge University Press, Cambridge.

Böhning D. 1999. Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease Mapping and Others. Chapman & Hall/CRC, New York.

Campbell N.A. 1984. Mixture models and atypical values. Mathematical Geology 16: 465–477.

Campbell N.A. and Mahon R.J. 1974. A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22: 417–425.

Davé R.N. and Krishnapuram R. 1995. Robust clustering methods: A unified view. IEEE Transactions on Fuzzy Systems 5: 270–293.

Dempster A.P., Laird N.M., and Rubin D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society B 39: 1–38.

De Veaux R.D. and Kreiger A.M. 1990. Robust estimation of a normal mixture. Statistics & Probability Letters 10: 1–7.
Everitt B.S. and Hand D.J. 1981. Finite Mixture Distributions. Chapman & Hall, London.

Frigui H. and Krishnapuram R. 1996. A robust algorithm for automatic extraction of an unknown number of clusters from noisy data. Pattern Recognition Letters 17: 1223–1232.

Gnanadesikan R., Harvey J.W., and Kettenring J.R. 1993. Sankhyā A 55: 494–505.

Green P.J. 1984. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society B 46: 149–192.

Hampel F.R. 1973. Robust estimation: A condensed partial survey. Z. Wahrscheinlichkeitstheorie verw. Gebiete 27: 87–104.

Hawkins D.M. and McLachlan G.J. 1997. High-breakdown linear discriminant analysis. Journal of the American Statistical Association 92: 136–143.

Huber P.J. 1964. Robust estimation of a location parameter. Annals of Mathematical Statistics 35: 73–101.

Jolion J.-M., Meer P., and Bataouche S. 1995. Robust clustering with applications in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 13: 791–802.

Kent J.T., Tyler D.E., and Vardi Y. 1994. A curious likelihood identity for the multivariate t-distribution. Communications in Statistics – Simulation and Computation 23: 441–453.

Kharin Y. 1996. Robustness in Statistical Pattern Recognition. Kluwer, Dordrecht.

Kosinski A. 1999. A procedure for the detection of multivariate outliers. Computational Statistics and Data Analysis 29: 145–161.

Kowalski J., Tu X.M., Day R.S., and Mendoza-Blanco J.R. 1997. On the rate of convergence of the ECME algorithm for multiple regression models with t-distributed errors. Biometrika 84: 269–281.

Lange K., Little R.J.A., and Taylor J.M.G. 1989. Robust statistical modeling using the t distribution. Journal of the American Statistical Association 84: 881–896.

Lindsay B.G. 1995. Mixture Models: Theory, Geometry and Applications, NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Institute of Mathematical Statistics and the American Statistical Association, Alexandria, VA.

Liu C. 1997. ML estimation of the multivariate t distribution and the EM algorithm. Journal of Multivariate Analysis 63: 296–312.

Liu C. and Rubin D.B. 1994. The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81: 633–648.

Liu C. and Rubin D.B. 1995. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica 5: 19–39.

Liu C., Rubin D.B., and Wu Y.N. 1998. Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika 85: 755–770.

Markatou M. 1998. Mixture models, robustness and the weighted likelihood methodology. Technical Report No. 1998-9. Department of Statistics, Stanford University, Stanford.

Markatou M., Basu A., and Lindsay B.G. 1998. Weighted likelihood equations with bootstrap root search. Journal of the American Statistical Association 93: 740–750.

McLachlan G.J. and Basford K.E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

McLachlan G.J. and Peel D. 1998. Robust cluster analysis via mixtures of multivariate t-distributions. In: Amin A., Dori D., Pudil P., and Freeman H. (Eds.), Lecture Notes in Computer Science Vol. 1451. Springer-Verlag, Berlin, pp. 658–666.

McLachlan G.J., Peel D., Basford K.E., and Adams P. 1999. Fitting of mixtures of normal and t-components. Journal of Statistical Software 4(2). (http://www.stat.ucla.edu/journals/jss/).

Meng X.L. and van Dyk D. 1997. The EM algorithm – an old folk song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society B 59: 511–567.

Rocke D.M. and Woodruff D.L. 1997. Robust estimation of multivariate location and shape. Journal of Statistical Planning and Inference 57: 245–255.

Rousseeuw P.J., Kaufman L., and Trauwaert E. 1996. Fuzzy clustering using scatter matrices. Computational Statistics and Data Analysis 23: 135–151.

Rubin D.B. 1983. Iteratively reweighted least squares. In: Kotz S., Johnson N.L., and Read C.B. (Eds.), Encyclopedia of Statistical Sciences Vol. 4. Wiley, New York, pp. 272–275.

Schroeter P., Vesin J.-M., Langenberger T., and Meuli R. 1998. Robust parameter estimation of intensity distributions for brain magnetic resonance images. IEEE Transactions on Medical Imaging 17: 172–186.

Smith D.J., Bailey T.C., and Munford G. 1993. Robust classification of high-dimensional data using artificial neural networks. Statistics and Computing 3: 71–81.

Sutradhar B. and Ali M.M. 1986. Estimation of parameters of a regression model with a multivariate t error variable. Communications in Statistics – Theory and Methods 15: 429–450.

Titterington D.M., Smith A.F.M., and Makov U.E. 1985. Statistical Analysis of Finite Mixture Distributions. Wiley, New York.

Zhuang X., Huang Y., Palaniappan K., and Zhao Y. 1996. Gaussian density mixture modeling, decomposition and applications. IEEE Transactions on Image Processing 5: 1293–1302.