A Tutorial on MM Algorithms

David R. Hunter and Kenneth Lange

The American Statistician, February 2004, Vol. 58, No. 1, pp. 30-37. DOI: 10.1198/0003130042836

David R. Hunter is Assistant Professor, Department of Statistics, Penn State University, University Park, PA 16802-2111 (E-mail: dhunter@stat.psu.edu). Kenneth Lange is Professor, Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766. Research supported in part by USPHS grants GM53275 and MH59490.

Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. This article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition to surveying previous work on MM algorithms, the article introduces some new material on constrained optimization and standard error estimation.

KEY WORDS: Constrained optimization; EM algorithm; Majorization; Minorization; Newton-Raphson.
1. INTRODUCTION

Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. This article discusses an optimization method that typically relies on convexity arguments and is a generalization of the well-known EM algorithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). We call any algorithm based on this iterative method an MM algorithm.

To our knowledge, the general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of line search methods. De Leeuw and Heiser (1977) presented an MM algorithm for multidimensional scaling contemporary with the classic Dempster et al. (1977) article on EM algorithms. Although the work of de Leeuw and Heiser did not spark the same explosion of interest from the statistical community set off by the Dempster et al. (1977) article, steady development of MM algorithms has continued. The MM principle reappears, among other places, in robust regression (Huber 1981), in correspondence analysis (Heiser 1987), in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw 1991; Kiers and Ten Berge 1992), and in medical imaging (De Pierro 1995; Lange and Fessler 1995). The recent survey articles of de Leeuw (1994), Heiser (1995), Becker, Yang, and Lange (1997), and Lange, Hunter, and Yang (2000) deal with the general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that the acronym MM first appears. This acronym pays homage to the earlier names "majorization" and "iterative majorization" of the MM principle, emphasizes its crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct subject in mathematics known as majorization (Marshall and Olkin 1979). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange 2000b), survival analysis (Hunter and Lange 2002), paired and multiple comparisons (Hunter 2004), variable selection (Hunter and Li 2002), and DNA sequence analysis (Sabatti and Lange 2002).

One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. (We define the terms "majorize" and "minorize" in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.

In our view, MM algorithms are easier to understand and sometimes easier to apply than EM algorithms. Although we have no intention of detracting from EM algorithms, their dominance over MM algorithms is a historical accident. An EM algorithm operates by identifying a theoretical complete data space. In the E-step of the algorithm, the conditional expectation of the complete data log-likelihood is calculated with respect to the observed data. The surrogate function created by the E-step is, up to a constant, a minorizing function. In the M-step, this minorizing function is maximized with respect to the parameters of the underlying model; thus, every EM algorithm is an example of an MM algorithm. Construction of an EM algorithm sometimes demands creativity in identifying the complete data and technical skill in calculating an often complicated conditional expectation and then maximizing it analytically.

In contrast, typical applications of MM revolve around careful inspection of a log-likelihood or other objective function to be optimized, with particular attention paid to convexity and inequalities. Thus, success with MM algorithms and success with EM algorithms hinge on somewhat different mathematical maneuvers. However, the skills required by most MM algorithms are no harder to master than the skills required by most EM algorithms. The purpose of this article is to present some strategies for constructing MM algorithms and to illustrate various aspects of these algorithms through the study of specific examples.

We conclude this section with a note on nomenclature. Just as EM is more a prescription for creating algorithms than an actual algorithm, MM refers not to a single algorithm but to a class of algorithms. Thus, this article refers to specific EM and MM algorithms but never to "the MM algorithm" or "the EM algorithm."

2. THE MM PHILOSOPHY

Let \theta^{(m)} represent a fixed value of the parameter \theta, and let g(\theta \mid \theta^{(m)}) denote a real-valued function of \theta whose form depends on \theta^{(m)}. The function g(\theta \mid \theta^{(m)}) is said to majorize a real-valued function f(\theta) at the point \theta^{(m)} provided

    g(\theta \mid \theta^{(m)}) \ge f(\theta)  for all \theta,
    g(\theta^{(m)} \mid \theta^{(m)}) = f(\theta^{(m)}).     (1)

In other words, the surface \theta \mapsto g(\theta \mid \theta^{(m)}) lies above the surface f(\theta) and is tangent to it at the point \theta = \theta^{(m)}. The function g(\theta \mid \theta^{(m)}) is said to minorize f(\theta) at \theta^{(m)} if -g(\theta \mid \theta^{(m)}) majorizes -f(\theta) at \theta^{(m)}.

Ordinarily, \theta^{(m)} represents the current iterate in a search of the surface f(\theta). In a majorize-minimize MM algorithm, we minimize the majorizing function g(\theta \mid \theta^{(m)}) rather than the actual function f(\theta). If \theta^{(m+1)} denotes the minimizer of g(\theta \mid \theta^{(m)}), then we can show that the MM procedure forces f(\theta) downhill. Indeed, the inequality

    f(\theta^{(m+1)}) \le g(\theta^{(m+1)} \mid \theta^{(m)}) \le g(\theta^{(m)} \mid \theta^{(m)}) = f(\theta^{(m)})     (2)

follows directly from the fact that g(\theta^{(m+1)} \mid \theta^{(m)}) \le g(\theta^{(m)} \mid \theta^{(m)}) and definition (1). The descent property (2) lends an MM algorithm remarkable numerical stability. With straightforward changes, the MM recipe also applies to maximization rather than minimization: To maximize a function f(\theta), we minorize it by a surrogate function g(\theta \mid \theta^{(m)}) and maximize g(\theta \mid \theta^{(m)}) to produce the next iterate \theta^{(m+1)}.

2.1 Calculation of Sample Quantiles

As a one-dimensional example, consider the problem of computing a sample quantile from a sample x_1, ..., x_n of n real numbers. One can readily prove (Hunter and Lange 2000b) that for q \in (0, 1), a qth sample quantile of x_1, ..., x_n minimizes the function

    f(\theta) = \sum_{i=1}^n \rho_q(x_i - \theta),     (3)

where \rho_q(\theta) is the "vee" function

    \rho_q(\theta) = \begin{cases} q\theta, & \theta \ge 0 \\ -(1 - q)\theta, & \theta < 0. \end{cases}

When q = 1/2, this function is proportional to the absolute value function; for q \ne 1/2, the "vee" is tilted to one side or the other. As seen in Figure 1(a), it is possible to majorize the "vee" function at any nonzero point by a simple quadratic function.

[Figure 1. For q = 0.8, (a) depicts the "vee" function \rho_q(\theta) and its quadratic majorizing function for \theta^{(m)} = -0.75; (b) shows the objective function f(\theta) that is minimized by the 0.8 quantile of the sample 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, along with its quadratic majorizer, for \theta^{(m)} = 2.5.]

Specifically, for a given \theta^{(m)} \ne 0, \rho_q(\theta) is majorized at \pm\theta^{(m)} by

    \zeta_q(\theta \mid \theta^{(m)}) = \frac{1}{4} \left\{ \frac{\theta^2}{|\theta^{(m)}|} + (4q - 2)\theta + |\theta^{(m)}| \right\}.

Fortunately, the majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. These rules permit us to work piecemeal in simplifying complicated objective functions. Thus, the function f(\theta) of Equation (3) is majorized at the point \theta^{(m)} by

    g(\theta \mid \theta^{(m)}) = \sum_{i=1}^n \zeta_q(x_i - \theta \mid x_i - \theta^{(m)}).     (4)

The function f(\theta) and its majorizer g(\theta \mid \theta^{(m)}) are shown in Figure 1(b) for a particular sample of size n = 12.

Setting the first derivative of g(\theta \mid \theta^{(m)}) equal to zero gives the minimum point

    \theta^{(m+1)} = \frac{n(2q - 1) + \sum_{i=1}^n w_i^{(m)} x_i}{\sum_{i=1}^n w_i^{(m)}},     (5)

where the weight w_i^{(m)} = |x_i - \theta^{(m)}|^{-1} depends on \theta^{(m)}. A flaw of algorithm (5) is that the weight w_i^{(m)} is undefined whenever \theta^{(m)} = x_i. In mending this flaw, Hunter and Lange (2000b) also discussed the broader technique of quantile regression introduced by Koenker and Bassett (1978). From a computational perspective, the most fascinating thing about the quantile-finding algorithm is that it avoids sorting and relies entirely on arithmetic and iteration. For the case of the sample median (q = 1/2), algorithm (5) is found in Schlossmacher (1973) and is shown to be an MM algorithm by Lange and Sinsheimer (1993) and Heiser (1995).

Because g(\theta \mid \theta^{(m)}) in Equation (4) is a quadratic function of \theta, expression (5) coincides with the more general Newton-Raphson update

    \theta^{(m+1)} = \theta^{(m)} - \left[ \nabla^2 g(\theta^{(m)} \mid \theta^{(m)}) \right]^{-1} \nabla g(\theta^{(m)} \mid \theta^{(m)}),     (6)

where \nabla g(\theta^{(m)} \mid \theta^{(m)}) and \nabla^2 g(\theta^{(m)} \mid \theta^{(m)}) denote the gradient vector and the Hessian matrix of g(\theta \mid \theta^{(m)}) evaluated at \theta^{(m)}. Because the descent property (2) depends only on decreasing g(\theta \mid \theta^{(m)}) and not on minimizing it, the update (6) can serve in cases where g(\theta \mid \theta^{(m)}) lacks a closed form minimizer, provided this update decreases the value of g(\theta \mid \theta^{(m)}). In the context of EM algorithms, Dempster et al. (1977) called an algorithm that reduces g(\theta \mid \theta^{(m)}) without actually minimizing it a generalized EM (GEM) algorithm. The specific case of Equation (6), which we call a gradient MM algorithm, was studied in the EM context by Lange (1995a), who pointed out that update (6) saves us from performing iterations within iterations and yet still displays the same local rate of convergence as a full MM algorithm that minimizes g(\theta \mid \theta^{(m)}) at each iteration.
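As an illustration of how update (5) can be coded, the following short Python sketch iterates the quantile-finding MM algorithm on a small sample. It is not part of the original article; the starting value, tolerance, and the simple safeguard against the undefined weights mentioned above (nudging an iterate that lands exactly on a data point) are our own illustrative choices.

```python
import numpy as np

def mm_quantile(x, q, theta0, tol=1e-10, max_iter=1000):
    """Approximate a qth sample quantile by iterating the MM update (5)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = float(theta0)
    for _ in range(max_iter):
        # Nudge the iterate if it coincides with a data point, where the
        # weights |x_i - theta|^{-1} of update (5) would be undefined.
        if np.any(x == theta):
            theta += 1e-9
        w = 1.0 / np.abs(x - theta)          # weights w_i^(m)
        new_theta = (n * (2 * q - 1) + np.sum(w * x)) / np.sum(w)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Example: the sample of size 12 used in Figure 1.
sample = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(mm_quantile(sample, q=0.8, theta0=2.5))  # close to the 0.8 sample quantile
```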
3. TRICKS OF THE TRADE

In the quantile example of Section 2.1, the "vee" function admits a quadratic majorizer as depicted in Figure 1(a). In general, many majorizing or minorizing relationships may be derived from various inequalities stemming from convexity or concavity. This section outlines some common inequalities used to construct majorizing or minorizing functions for various types of objective functions.

3.1 Jensen's Inequality

Jensen's inequality states for a convex function \kappa(x) and any random variable X that \kappa[E(X)] \le E[\kappa(X)]. Since -\ln(x) is a convex function, we conclude for probability densities a(x) and b(x) that

    -\ln \left\{ E\left[ \frac{a(X)}{b(X)} \right] \right\} \le E\left[ -\ln \frac{a(X)}{b(X)} \right].

If X has the density b(x), then E[a(X)/b(X)] = 1, so the left-hand side above vanishes and we obtain

    E[\ln a(X)] \le E[\ln b(X)],

which is sometimes known as the information inequality. It is this inequality that guarantees that a minorizing function is constructed in the E-step of any EM algorithm (de Leeuw 1994; Heiser 1995), making every EM algorithm an MM algorithm.
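To make the connection between the information inequality and the E-step explicit, the following sketch (not part of the original article, and written in the standard EM notation with observed data x, missing data Z, and Q-function Q) shows how the E-step surrogate minorizes the observed-data log-likelihood.

```latex
% Sketch: why the E-step surrogate minorizes the log-likelihood.
% Write the complete-data density as f(x, z \mid \theta) and the conditional
% density of the missing data as k(z \mid x, \theta) = f(x, z \mid \theta) / L(\theta),
% where L(\theta) is the observed-data likelihood.  Then
%   \ln L(\theta) = Q(\theta \mid \theta^{(m)}) - H(\theta \mid \theta^{(m)}),
% with
%   Q(\theta \mid \theta^{(m)}) = E[\ln f(x, Z \mid \theta) \mid x, \theta^{(m)}],
%   H(\theta \mid \theta^{(m)}) = E[\ln k(Z \mid x, \theta) \mid x, \theta^{(m)}].
% The information inequality with a = k(\cdot \mid x, \theta) and
% b = k(\cdot \mid x, \theta^{(m)}) gives
% H(\theta \mid \theta^{(m)}) \le H(\theta^{(m)} \mid \theta^{(m)}), so
\begin{align*}
g(\theta \mid \theta^{(m)})
  &:= Q(\theta \mid \theta^{(m)}) - H(\theta^{(m)} \mid \theta^{(m)})
      \le \ln L(\theta) \quad \text{for all } \theta, \\
g(\theta^{(m)} \mid \theta^{(m)})
  &= \ln L(\theta^{(m)}).
\end{align*}
% Hence g minorizes \ln L(\theta) at \theta^{(m)}, and maximizing Q in the
% M-step is exactly an MM (minorize-maximize) step.
```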
3.2 Minorization via Supporting Hyperplanes

Jensen's inequality is easily derived from the supporting hyperplane property of a convex function: Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Thus, if \kappa(\theta) is convex and differentiable, then

    \kappa(\theta) \ge \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)}),     (7)

with equality when \theta = \theta^{(m)}. This inequality is illustrated by the example of Section 7 involving constrained optimization.

3.3 Majorization via the Definition of Convexity

If we wish to majorize a convex function instead of minorizing it, then we can use the standard definition of convexity; namely, \kappa(t) is convex if and only if

    \kappa\left( \sum_i \alpha_i t_i \right) \le \sum_i \alpha_i \kappa(t_i)     (8)

for any finite collection of points t_i and corresponding multipliers \alpha_i with \alpha_i \ge 0 and \sum_i \alpha_i = 1. Application of definition (8) is particularly effective when \kappa(t) is composed with a linear function x^t\theta. For instance, suppose for vectors x, \theta, and \theta^{(m)} that we make the substitution t_i = x_i(\theta_i - \theta_i^{(m)})/\alpha_i + x^t\theta^{(m)}. Inequality (8) then becomes

    \kappa(x^t\theta) \le \sum_i \alpha_i \kappa\left( \frac{x_i(\theta_i - \theta_i^{(m)})}{\alpha_i} + x^t\theta^{(m)} \right).     (9)

Alternatively, if all components of x, \theta, and \theta^{(m)} are positive, then we may take t_i = x^t\theta^{(m)} \theta_i / \theta_i^{(m)} and \alpha_i = x_i \theta_i^{(m)} / x^t\theta^{(m)}. Now inequality (8) becomes

    \kappa(x^t\theta) \le \sum_i \frac{x_i \theta_i^{(m)}}{x^t\theta^{(m)}} \kappa\left( \frac{x^t\theta^{(m)} \theta_i}{\theta_i^{(m)}} \right).     (10)

Inequalities (9) and (10) have been used to construct MM algorithms in the contexts of medical imaging (De Pierro 1995; Lange and Fessler 1995) and least-squares estimation without matrix inversion (Becker et al. 1997).

3.4 Majorization via a Quadratic Upper Bound

If a convex function \kappa(\theta) is twice differentiable and has bounded curvature, then we can majorize \kappa(\theta) by a quadratic function with sufficiently high curvature and tangent to \kappa(\theta) at \theta^{(m)} (Böhning and Lindsay 1988). In algebraic terms, if we can find a positive definite matrix M such that M - \nabla^2\kappa(\theta) is nonnegative definite for all \theta, then

    \kappa(\theta) \le \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)}) + \frac{1}{2} (\theta - \theta^{(m)})^t M (\theta - \theta^{(m)})

provides a quadratic upper bound. For example, Heiser (1995) noted in the unidimensional case that

    \frac{1}{\theta} \le \frac{1}{\theta^{(m)}} - \frac{\theta - \theta^{(m)}}{(\theta^{(m)})^2} + \frac{(\theta - \theta^{(m)})^2}{c^3}

for 0 < c \le \min\{\theta, \theta^{(m)}\}. The corresponding quadratic lower bound principle for minorization is the basis for the logistic regression example of Section 6.

3.5 The Arithmetic-Geometric Mean Inequality

The arithmetic-geometric mean inequality is a special case of inequality (8). Taking \kappa(t) = e^t and \alpha_i = 1/m yields

    \exp\left( \frac{1}{m} \sum_{i=1}^m t_i \right) \le \frac{1}{m} \sum_{i=1}^m e^{t_i}.

If we let x_i = e^{t_i}, then we obtain the standard form

    \left( \prod_{i=1}^m x_i \right)^{1/m} \le \frac{1}{m} \sum_{i=1}^m x_i     (11)

of the arithmetic-geometric mean inequality. Because the exponential function is strictly convex, equality holds if and only if all of the x_i are equal. Inequality (11) is helpful in constructing the majorizer

    x_1 x_2 \le \frac{x_1^2}{2} \frac{x_2^{(m)}}{x_1^{(m)}} + \frac{x_2^2}{2} \frac{x_1^{(m)}}{x_2^{(m)}}     (12)

of the product of two positive numbers. This inequality is used in the sports contest model of Section 4.

3.6 The Cauchy-Schwartz Inequality

The Cauchy-Schwartz inequality for the Euclidean norm is a special case of inequality (7). The function \kappa(\theta) = \|\theta\| is convex because it satisfies the triangle inequality and the homogeneity condition \|\alpha\theta\| = |\alpha| \|\theta\|. Since \kappa(\theta) = \sqrt{\sum_j \theta_j^2}, we see that \nabla\kappa(\theta) = \theta / \|\theta\|, and therefore inequality (7) gives

    \|\theta\| \ge \|\theta^{(m)}\| + \frac{(\theta - \theta^{(m)})^t \theta^{(m)}}{\|\theta^{(m)}\|} = \frac{\theta^t \theta^{(m)}}{\|\theta^{(m)}\|},     (13)

which is the Cauchy-Schwartz inequality. De Leeuw and Heiser (1977) and Groenen (1993) used inequality (13) to derive MM algorithms for multidimensional scaling.

4. SEPARATION OF PARAMETERS AND CYCLIC MM

One of the key criteria in judging minorizing or majorizing functions is their ease of optimization. Successful MM algorithms in high-dimensional parameter spaces often rely on surrogate functions in which the individual parameter components are separated. In other words, the surrogate function mapping \theta \in U \subset R^d \to R reduces to the sum of d real-valued functions taking the real-valued arguments \theta_1 through \theta_d. Because the d univariate functions may be optimized one by one, this makes the surrogate function easier to optimize at each iteration.

4.1 Poisson Sports Model

Consider a simplified version of a model proposed by Maher (1982) for a sports contest between two individuals or teams in which the number of points scored by team i against team j follows a Poisson process with intensity e^{o_i - d_j}, where o_i is an "offensive strength" parameter for team i and d_j is a "defensive strength" parameter for team j. If t_{ij} is the length of time that i plays j and p_{ij} is the number of points that i scores against j, then the corresponding Poisson log-likelihood function is

    \ell_{ij}(\theta) = p_{ij}(o_i - d_j) - t_{ij} e^{o_i - d_j} + p_{ij} \ln t_{ij} - \ln p_{ij}!,     (14)

where \theta = (o, d) is the parameter vector. Note that the parameters should satisfy a linear constraint, such as \sum_i o_i + \sum_j d_j = 0, in order for the model to be identifiable; otherwise, it is clearly possible to add the same constant to each o_i and d_j without altering the likelihood. We make two simplifying assumptions. First, different games are independent of each other. Second, each team's point total within a single game is independent of its opponent's point total. The second assumption is more suspect than the first because it implies that a team's offensive and defensive performances are somehow unrelated to one another; nonetheless the model gives an interesting first approximation to reality. Under these assumptions, the full data log-likelihood is obtained by summing \ell_{ij}(\theta) over all pairs (i, j). Setting the partial derivatives of the log-likelihood equal to zero leads to the equations

    \sum_j p_{ij} = \sum_j t_{ij} e^{\hat{o}_i - \hat{d}_j}  and  \sum_i p_{ij} = \sum_i t_{ij} e^{\hat{o}_i - \hat{d}_j}

satisfied by the maximum likelihood estimate (\hat{o}, \hat{d}). These equations do not admit a closed form solution, so we turn to an MM algorithm. Because the task is to maximize the log-likelihood, we need a minorizing function. Focusing on the term -t_{ij} e^{o_i - d_j}, we may use inequality (12) to show that

    -e^{o_i - d_j} \ge -\frac{e^{2 o_i}}{2} e^{-o_i^{(m)} - d_j^{(m)}} - \frac{e^{-2 d_j}}{2} e^{o_i^{(m)} + d_j^{(m)}}.     (15)

Although the right side of the above inequality may appear more complicated than the left side, it is actually simpler in one important respect: the parameter components o_i and d_j are separated on the right side but not on the left. Summing the log-likelihood (14) over all pairs (i, j) and invoking inequality (15) yields the function

    g(\theta \mid \theta^{(m)}) = \sum_i \sum_j \left[ p_{ij}(o_i - d_j) - \frac{t_{ij}}{2} \left( e^{2 o_i} e^{-o_i^{(m)} - d_j^{(m)}} + e^{-2 d_j} e^{o_i^{(m)} + d_j^{(m)}} \right) \right]

minorizing the full log-likelihood up to an additive constant independent of \theta. The fact that the components of \theta are separated by g(\theta \mid \theta^{(m)}) permits us to update parameters one by one and substantially reduces computational costs. Setting the partial derivatives of g(\theta \mid \theta^{(m)}) equal to zero yields the updates

    o_i^{(m+1)} = \frac{1}{2} \ln \left( \frac{\sum_j p_{ij}}{\sum_j t_{ij} e^{-o_i^{(m)} - d_j^{(m)}}} \right),
    d_j^{(m+1)} = -\frac{1}{2} \ln \left( \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m)} + d_j^{(m)}}} \right).     (16)

In practice, an MM algorithm often takes fewer iterations when we cycle through the parameters updating one at a time than when we update the whole vector at once as in algorithm (16). We call such versions of MM algorithms cyclic MM algorithms; they generalize the ECM algorithms of Meng and Rubin (1993). A cyclic MM algorithm always drives the objective function in the right direction; indeed, every iteration of a cyclic MM algorithm is simply an MM iteration on a reduced parameter set. For example, if we update the o vector before the d vector in each iteration of algorithm (16), we could replace the formula for d_j^{(m+1)} above by

    d_j^{(m+1)} = -\frac{1}{2} \ln \left( \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m+1)} + d_j^{(m)}}} \right).     (17)
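The updates (16) and their cyclic variant (17) are easy to implement. The following Python sketch is ours rather than the authors'; the data layout (square matrices p and t of points and playing times with zero diagonal), the initialization at zero, the convergence tolerance, and the final re-centering step are illustrative assumptions.

```python
import numpy as np

def poisson_sports_mm(p, t, cyclic=False, tol=1e-8, max_iter=10000):
    """MM updates (16), optionally the cyclic variant (17), for the Poisson sports model.

    p[i, j] = points scored by team i against team j (summed over all games)
    t[i, j] = total playing time of team i against team j
    Returns offensive strengths o and defensive strengths d, re-centered at the
    end so that sum(o) + sum(d) = 0 (the identifiability constraint in the text).
    """
    n = p.shape[0]
    o, d = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        o_old, d_old = o.copy(), d.copy()
        # Offensive update in (16); the right-hand side uses the old o and d.
        o = 0.5 * np.log(p.sum(axis=1)
                         / (t * np.exp(-o_old[:, None] - d_old[None, :])).sum(axis=1))
        # Defensive update: the cyclic variant (17) reuses the new o, (16) does not.
        o_used = o if cyclic else o_old
        d = -0.5 * np.log(p.sum(axis=0)
                          / (t * np.exp(o_used[:, None] + d_old[None, :])).sum(axis=0))
        if max(np.abs(o - o_old).max(), np.abs(d - d_old).max()) < tol:
            break
    shift = (o.sum() + d.sum()) / (2 * n)   # re-center once after convergence
    return o - shift, d - shift
```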

4.2 Application to National Basketball Association Results

Table 1 summarizes our application of the Poisson sports model to the results of the 2002-2003 regular season of the National Basketball Association. In these data, t_{ij} is measured in minutes. A regular game lasts 48 minutes, and each overtime period, if necessary, adds five minutes. Thus, team i is expected to score 48 e^{\hat o_i - \hat d_j} points against team j when the two teams meet and do not tie. Team i is ranked higher than team j if \hat o_i - \hat d_j > \hat o_j - \hat d_i, which is equivalent to \hat o_i + \hat d_i > \hat o_j + \hat d_j.

Table 1. Ranking of all 29 NBA Teams on the Basis of the 2002-2003 Regular Season According to Their Estimated Offensive Plus Defensive Strength. Each Team Played 82 Games.

    Team            o_i + d_i   Wins       Team            o_i + d_i   Wins
    Cleveland        -0.0994     17        Phoenix          0.0166      44
    Denver           -0.0845     17        New Orleans      0.0169      47
    Toronto          -0.0647     24        Philadelphia     0.0187      48
    Miami            -0.0581     25        Houston          0.0205      43
    Chicago          -0.0544     30        Minnesota        0.0259      51
    Atlanta          -0.0402     35        LA Lakers        0.0277      50
    LA Clippers      -0.0355     27        Indiana          0.0296      48
    Memphis          -0.0255     28        Utah             0.0299      47
    New York         -0.0164     37        Portland         0.0320      50
    Washington       -0.0153     37        Detroit          0.0336      50
    Boston           -0.0077     44        New Jersey       0.0481      49
    Golden State     -0.0051     38        San Antonio      0.0611      60
    Orlando          -0.0039     42        Sacramento       0.0686      59
    Milwaukee        -0.0027     42        Dallas           0.0804      60
    Seattle           0.0039     40

It is worth emphasizing some of the virtues of the model. First, the ranking of the 29 NBA teams on the basis of the estimated sums \hat o_i + \hat d_i for the 2002-2003 regular season is not perfectly consistent with their cumulative wins; strength of schedule and margins of victory are reflected in the model. Second, the model gives the point-spread function for a particular game as the difference of two independent Poisson random variables. Third, one can easily amend the model to rank individual players rather than teams by assigning to each player an offensive and defensive intensity parameter. If each game is divided into time segments punctuated by substitutions, then the MM algorithm can be adapted to estimate the assigned player intensities. This might provide a rational basis for salary negotiations that takes into account subtle differences between players not reflected in traditional sports statistics.

Finally, the NBA dataset sheds light on the comparative speeds of the original MM algorithm (16) and its cyclic modification (17). The cyclic MM algorithm converged in fewer iterations (25 instead of 28). However, because of the additional work required to recompute the denominators in Equation (17), the cyclic version required slightly more floating-point operations as counted by MATLAB (301,157 instead of 289,998).

5. SPEED OF CONVERGENCE

MM algorithms and Newton-Raphson algorithms have complementary strengths. On one hand, Newton-Raphson algorithms boast a quadratic rate of convergence as they near a local optimum point \theta^*. In other words, under certain general conditions,

    \lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|^2} = c

for some constant c. This quadratic rate of convergence is much faster than the linear rate of convergence

    \lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|} = c < 1     (18)

displayed by typical MM algorithms. Hence, Newton-Raphson algorithms tend to require fewer iterations than MM algorithms. On the other hand, an iteration of a Newton-Raphson algorithm can be far more onerous computationally than an iteration of an MM algorithm.

Examination of the form

    \theta^{(m+1)} = \theta^{(m)} - \nabla^2 f(\theta^{(m)})^{-1} \nabla f(\theta^{(m)})

of a Newton-Raphson iteration reveals that it requires evaluation and inversion of the Hessian matrix \nabla^2 f(\theta^{(m)}). If \theta has p components, then the number of calculations needed to invert the p x p matrix \nabla^2 f(\theta) is roughly proportional to p^3. By contrast, an MM algorithm that separates parameters usually takes on the order of p or p^2 arithmetic operations per iteration. Thus, well-designed MM algorithms tend to require more iterations but simpler iterations than Newton-Raphson, and they sometimes enjoy an advantage in computational speed.

For example, the Poisson sports model for the NBA dataset of Section 4 has 57 parameters (two for each of 29 teams minus one for the linear constraint). A single matrix inversion of a 57 x 57 matrix requires roughly 387,000 floating point operations according to MATLAB. Thus, even a single Newton-Raphson iteration requires more computation in this example than the 300,000 floating point operations required for the MM algorithm to converge completely in 28 iterations. Numerical stability also enters the balance sheet. A Newton-Raphson algorithm can behave poorly if started too far from an optimum point. By contrast, MM algorithms are guaranteed to appropriately increase or decrease the value of the objective function at every iteration.

Other types of deterministic optimization algorithms, such as Fisher scoring, quasi-Newton methods, and gradient-free methods like Nelder-Mead, occupy a kind of middle ground. Although none of them can match Newton-Raphson in required iterations until convergence, each has its own merits. The expected information matrix used in Fisher scoring is sometimes easier to evaluate than the observed information matrix of Newton-Raphson. Scoring does not automatically lead to an increase in the log-likelihood, but at least (unlike Newton-Raphson) it can always be made to do so if some form of backtracking is incorporated. Quasi-Newton methods mitigate or even eliminate the need for matrix inversion. The Nelder-Mead approach is applicable in situations where the objective function is nondifferentiable. Because of the complexities of practical problems, it is impossible to declare any optimization algorithm best overall. In our experience, however, MM algorithms are often difficult to beat in terms of stability and computational simplicity.

6. STANDARD ERROR ESTIMATES

In most cases, a maximum likelihood estimator has asymptotic covariance matrix equal to the inverse of the expected information matrix. In practice, the expected information matrix is often well-approximated by the observed information matrix -\nabla^2 \ell(\hat\theta) computed by differentiating the log-likelihood \ell(\theta) twice. Thus, after the MLE \hat\theta has been found, a standard error of \hat\theta can be obtained by taking square roots of the diagonal entries of the inverse of -\nabla^2 \ell(\hat\theta). In some problems, however, direct calculation of \nabla^2 \ell(\hat\theta) is difficult. Here we propose two numerical approximations to this matrix that exploit quantities readily obtained by running an MM algorithm. Let g(\theta \mid \theta^{(m)}) denote a minorizing function of the log-likelihood \ell(\theta) at the point \theta^{(m)}, and define

    M(\vartheta) = \arg\max_\theta \, g(\theta \mid \vartheta)

to be the MM algorithm map taking \theta^{(m)} to \theta^{(m+1)}.

6.1 Numerical Differentiation

The two numerical approximations to -\nabla^2 \ell(\hat\theta) are based on the formulas

    \nabla^2 \ell(\hat\theta) = \nabla^2 g(\hat\theta \mid \hat\theta) \left[ I - \nabla M(\hat\theta) \right]     (19)

and

    \nabla^2 \ell(\hat\theta) = \nabla^2 g(\hat\theta \mid \hat\theta) + \left. \frac{\partial}{\partial\vartheta} \nabla g(\hat\theta \mid \vartheta) \right|_{\vartheta = \hat\theta},     (20)

where I denotes the identity matrix. These formulas were derived by Lange (1999) using two simple facts: First, the tangency of \ell(\theta) and its minorizer imply that their gradient vectors are equal at the point of minorization; and second, the gradient of g(\theta \mid \theta^{(m)}) at its maximizer M(\theta^{(m)}) is zero. Alternative derivations of formulas (19) and (20) were given by Meng and Rubin (1991) and Oakes (1999), respectively. Although these formulas have been applied to standard error estimation in the EM algorithm literature, where Meng and Rubin (1991) base their SEM idea on formula (19), to our knowledge neither has been applied in the broader context of MM algorithms.

Approximation of \nabla^2 \ell(\hat\theta) using Equation (19) requires a numerical approximation of the Jacobian matrix \nabla M(\hat\theta), whose i, j entry equals

    \frac{\partial}{\partial \theta_j} M_i(\hat\theta) = \lim_{\delta \to 0} \frac{M_i(\hat\theta + \delta e_j) - M_i(\hat\theta)}{\delta},     (21)

where the vector e_j is the jth standard basis vector having a one in its jth component and zeros elsewhere. Because M(\hat\theta) = \hat\theta, the jth column of \nabla M(\hat\theta) may be approximated using only output from the corresponding MM algorithm by (a) iterating until \hat\theta is found, (b) altering the jth component of \hat\theta by a small amount \delta_j, (c) applying the MM algorithm map to this altered \hat\theta, (d) subtracting \hat\theta from the result, and (e) dividing by \delta_j. Approximation of \nabla^2 \ell(\hat\theta) using Equation (20) is analogous except that it involves numerically approximating the Jacobian of h(\vartheta) = \nabla g(\hat\theta \mid \vartheta). In this case one may exploit the fact that h(\hat\theta) is zero.
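The recipe in steps (a) through (e) translates directly into code. The Python sketch below is our own illustration, not code from the article; it assumes you already have the MM map as a function mm_map(theta) returning one MM update, the MLE theta_hat, and a function hess_g returning the Hessian of the surrogate at theta_hat, and it uses the relative increment \delta_j = \theta_j / 1000 mentioned in Section 6.3.

```python
import numpy as np

def observed_information_via_mm(mm_map, hess_g, theta_hat, rel_step=1e-3):
    """Approximate the observed information -hessian of the log-likelihood via formula (19).

    mm_map(theta) : one application of the MM algorithm map M
    hess_g(theta) : Hessian of the surrogate g(. | theta_hat) evaluated at theta
    theta_hat     : the maximum likelihood estimate (a fixed point of mm_map)
    """
    p = len(theta_hat)
    dM = np.empty((p, p))
    for j in range(p):
        # step (b): perturb component j (assumes theta_hat[j] != 0;
        # otherwise an absolute step would be needed)
        delta = rel_step * theta_hat[j]
        perturbed = theta_hat.copy()
        perturbed[j] += delta
        # steps (c)-(e): apply the MM map, subtract theta_hat, divide by delta
        dM[:, j] = (mm_map(perturbed) - theta_hat) / delta
    hessian_loglik = hess_g(theta_hat) @ (np.eye(p) - dM)   # formula (19)
    # Invert the result and take square roots of the diagonal for standard errors.
    return -hessian_loglik
```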
6.2 An MM Algorithm for Logistic Regression

To illustrate these ideas and facilitate comparison of the various numerical methods, we consider an example in which the exact Hessian \nabla^2 \ell(\theta) of the log-likelihood is easy to compute. Böhning and Lindsay (1988) apply the quadratic bound principle of Section 3.4 to the case of logistic regression, in which we have an n x 1 vector Y of binary responses and an n x p matrix X of predictors. The model stipulates that the probability \pi_i(\theta) that Y_i = 1 equals \exp\{\theta^t x_i\} / (1 + \exp\{\theta^t x_i\}). Straightforward differentiation of the resulting log-likelihood function shows that

    \nabla^2 \ell(\theta) = -\sum_{i=1}^n \pi_i(\theta) [1 - \pi_i(\theta)] x_i x_i^t.

Since \pi_i(\theta)[1 - \pi_i(\theta)] is bounded above by 1/4, we may define the negative definite matrix B = -\frac{1}{4} X^t X and conclude that \nabla^2 \ell(\theta) - B is nonnegative definite as desired. Therefore, the quadratic function

    g(\theta \mid \theta^{(m)}) = \ell(\theta^{(m)}) + \nabla\ell(\theta^{(m)})^t (\theta - \theta^{(m)}) + \frac{1}{2} (\theta - \theta^{(m)})^t B (\theta - \theta^{(m)})

minorizes \ell(\theta) at \theta^{(m)}. The MM algorithm proceeds by maximizing this quadratic, giving

    \theta^{(m+1)} = \theta^{(m)} - B^{-1} \nabla\ell(\theta^{(m)})
                   = \theta^{(m)} + 4 (X^t X)^{-1} X^t [Y - \pi(\theta^{(m)})].     (22)

Since the MM algorithm of Equation (22) needs to invert the matrix X^t X only once, it enjoys an increasing computational advantage over Newton-Raphson as the number of predictors increases (Böhning and Lindsay 1988).
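A direct implementation of update (22) is short. The following Python sketch is ours rather than the authors'; the zero starting value and the convergence tolerance are illustrative choices, and an intercept column must be supplied by the caller.

```python
import numpy as np

def logistic_mm(X, y, tol=1e-10, max_iter=5000):
    """Logistic regression via the MM update (22) of Bohning and Lindsay (1988).

    X : n x p design matrix (include a column of ones for an intercept)
    y : length-n vector of 0/1 responses
    """
    n, p = X.shape
    theta = np.zeros(p)
    # X^t X is inverted only once, the key advantage of update (22).
    xtx_inv = np.linalg.inv(X.T @ X)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ theta))              # pi_i(theta)
        new_theta = theta + 4.0 * xtx_inv @ (X.T @ (y - pi))
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta
```

Passing this update map and the surrogate Hessian B = -X^t X / 4 to the numerical differentiation sketch after Section 6.1 yields standard errors of the kind compared in Table 2.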

6.3 Application to Low Birth Weight Data

We now test the standard error approximations based on Equations (19) and (20) on the low birth weight dataset of Hosmer and Lemeshow (1989). This dataset involves 189 observations and eight maternal predictors. The response is 0 or 1 according to whether an infant is born underweight, defined as less than 2.5 kilograms. The predictors include mother's age in years (AGE), weight at last menstrual period (LWT), race (RACE2 and RACE3), smoking status during pregnancy (SMOKE), number of previous premature labors (PTL), presence of hypertension history (HT), presence of uterine irritability (UI), and number of physician visits during the first trimester (FTV). Each of these predictors is quantitative except for race, which is a three-level factor with level 1 for whites, level 2 for blacks, and level 3 for other races. Table 2 shows the maximum likelihood estimates and asymptotic standard errors for the 10 parameters. The differentiation increment \delta_j was \theta_j/1000 for each parameter \theta_j. The standard error approximations in the two rightmost columns turn out to be the same in this example, but in other models they will differ. The close agreement of the approximations with the "gold standard" based on the exact value of \nabla^2 \ell(\hat\theta) is clearly good enough for practical purposes.

Table 2. Estimated Coefficients and Standard Errors for the Low Birth Weight Logistic Regression Example.

                                 Standard errors based on:
    Variable    Coefficient    Exact \nabla^2 \ell    Equation (19)    Equation (20)
    Constant       0.48062        1.1969                1.1984           1.1984
    AGE           -0.029549       0.037031              0.037081         0.037081
    LWT           -0.015424       0.0069194             0.0069336        0.0069336
    RACE2          1.2723         0.52736               0.52753          0.52753
    RACE3          0.8805         0.44079               0.44076          0.44076
    SMOKE          0.93885        0.40215               0.40219          0.40219
    PTL            0.54334        0.34541               0.34545          0.34545
    HT             1.8633         0.69754               0.69811          0.69811
    UI             0.76765        0.45932               0.45933          0.45933
    FTV            0.065302       0.17240               0.17251          0.17251

7. HANDLING CONSTRAINTS

Many optimization problems impose constraints on parameters. For example, parameters are often required to be nonnegative. Here we discuss a majorization technique that in a sense eliminates inequality constraints. For this adaptive barrier method (Censor and Zenios 1992; Lange 1994) to work, an initial point \theta^{(0)} must be selected with all inequality constraints strictly satisfied. The barrier method confines subsequent iterates to the interior of the parameter space but allows strict inequalities to become equalities in the limit.

Consider the problem of minimizing f(\theta) subject to the constraints v_j(\theta) > 0 for 1 \le j \le q, where each v_j(\theta) is a concave, differentiable function. Since -v_j(\theta) is convex, we know from inequality (7) that

    -v_j(\theta) \ge -v_j(\theta^{(m)}) - (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}).

Application of the similar inequality \ln s - \ln t \ge -s^{-1}(t - s), with s = v_j(\theta^{(m)}) and t = v_j(\theta), implies that

    v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] \ge v_j(\theta^{(m)}) - v_j(\theta).

Adding the last two inequalities, we see that

    v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] + (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}) \ge 0,     (23)

with equality when \theta = \theta^{(m)}. Summing over j and multiplying by \omega, we construct the function

    g(\theta \mid \theta^{(m)}) = f(\theta) + \omega \sum_j \left\{ v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] + (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}) \right\}

majorizing f(\theta) at \theta^{(m)}. Here \omega > 0 is a tuning parameter. The presence of the term -\ln v_j(\theta) in Equation (23) prevents v_j(\theta^{(m+1)}) \le 0 from occurring. The multiplier v_j(\theta^{(m)}) of -\ln v_j(\theta) gradually adapts and allows v_j(\theta^{(m+1)}) to tend to 0 if it is inclined to do so. When there are equality constraints A\theta = b in addition to the inequality constraints v_j(\theta) > 0, these should be enforced during the minimization of g(\theta \mid \theta^{(m)}).

7.1 Multinomial Sampling

To gain a feel for how these ideas work in practice, consider the problem of maximum likelihood estimation given a random sample of size n from a multinomial distribution. If there are q categories and n_i observations fall in category i, then the log-likelihood reduces to \sum_i n_i \ln \theta_i plus a constant. The components of the parameter vector \theta satisfy \theta_i \ge 0 and \sum_i \theta_i = 1. Although it is well known that the maximum likelihood estimates are given by \hat\theta_i = n_i/n, this example is instructive because it is explicitly solvable and demonstrates the linear rate of convergence of the proposed MM algorithm.

To minimize the negative log-likelihood f(\theta) = -\sum_i n_i \ln \theta_i subject to the q inequality constraints v_i(\theta) = \theta_i \ge 0 and the equality constraint \sum_i \theta_i = 1, we construct the majorizing function

    g(\theta \mid \theta^{(m)}) = -\sum_i n_i \ln \theta_i + \omega \sum_i \left( \theta_i - \theta_i^{(m)} \ln \theta_i \right)

suggested in Equation (23), omitting irrelevant constants.

We minimize g(\theta \mid \theta^{(m)}) while enforcing \sum_i \theta_i = 1 by introducing a Lagrange multiplier and looking for a stationary point of the Lagrangian

    h(\theta) = g(\theta \mid \theta^{(m)}) + \lambda \left( \sum_i \theta_i - 1 \right).

Setting \partial h(\theta)/\partial \theta_i equal to zero and multiplying by \theta_i gives

    -n_i - \omega \theta_i^{(m)} + \omega \theta_i + \lambda \theta_i = 0.

Summing on i reveals that \lambda = n and yields the update

    \theta_i^{(m+1)} = \frac{n_i + \omega \theta_i^{(m)}}{n + \omega}.

Hence, all iterates have positive components if they start with positive components. The final rearrangement

    \theta_i^{(m+1)} - \frac{n_i}{n} = \frac{\omega}{n + \omega} \left( \theta_i^{(m)} - \frac{n_i}{n} \right)

demonstrates that \theta^{(m)} approaches the estimate \hat\theta at the linear rate \omega/(n + \omega), regardless of whether \hat\theta occurs on the boundary of the parameter space where one or more of its components \hat\theta_i equal zero.
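The explicit update above makes the linear convergence rate \omega/(n + \omega) easy to verify numerically. The short Python sketch below is our own illustration, with arbitrary counts and tuning parameter chosen only for demonstration.

```python
import numpy as np

# Multinomial counts and the adaptive-barrier MM update
# theta_i^(m+1) = (n_i + omega * theta_i^(m)) / (n + omega).
counts = np.array([5.0, 3.0, 2.0, 0.0])      # note one empty category (boundary MLE)
n, omega = counts.sum(), 2.0
mle = counts / n                              # closed-form MLE n_i / n
theta = np.full(len(counts), 1.0 / len(counts))   # strictly interior starting point

prev_err = np.max(np.abs(theta - mle))
for m in range(15):
    theta = (counts + omega * theta) / (n + omega)
    err = np.max(np.abs(theta - mle))
    # The error ratio settles at omega / (n + omega), the linear rate in the text.
    print(f"iteration {m + 1}: error = {err:.3e}, ratio = {err / prev_err:.4f}")
    prev_err = err
```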
8. DISCUSSION

This article is meant to whet readers' appetites, not satiate them. We have omitted much. For instance, there is a great deal known about the convergence properties of MM algorithms that is too mathematically demanding to present here. Fortunately, almost all results from the EM algorithm literature (Wu 1983; Lange 1995a; McLachlan and Krishnan 1997; Lange 1999) carry over without change to MM algorithms. Furthermore, there are several methods for accelerating EM algorithms that are also applicable to accelerating MM algorithms (Heiser 1995; Lange 1995b; Jamshidian and Jennrich 1997; Lange et al. 2000).

Although this survey article necessarily reports much that is already known, there are some new results here. Our MM treatment of constrained optimization in Section 7 is more general than previous versions in the literature (Censor and Zenios 1992; Lange 1994). The application of Equation (20) to the estimation of standard errors in MM algorithms is new, as is the extension of the SEM idea of Meng and Rubin (1991) to the MM case.

There are so many examples of MM algorithms in the literature that we are unable to cite them all. Readers should be on the lookout for these and for known EM algorithms that can be explained more simply as MM algorithms. Even more importantly, we hope this article will stimulate readers to discover new MM algorithms.

[Received June 2003. Revised September 2003.]

REFERENCES

Becker, M. P., Yang, I., and Lange, K. (1997), "EM Algorithms Without Missing Data," Statistical Methods in Medical Research, 6, 38-54.

Bijleveld, C. C. J. H., and de Leeuw, J. (1991), "Fitting Longitudinal Reduced Rank Regression Models by Alternating Least Squares," Psychometrika, 56, 433-447.

Böhning, D., and Lindsay, B. G. (1988), "Monotonicity of Quadratic Approximation Algorithms," Annals of the Institute of Statistical Mathematics, 40, 641-663.

Censor, Y., and Zenios, S. A. (1992), "Proximal Minimization With D-Functions," Journal of Optimization Theory and Applications, 73, 451-464.

de Leeuw, J. (1994), "Block Relaxation Algorithms in Statistics," in Information Systems and Data Analysis, eds. H. H. Bock, W. Lenski, and M. M. Richter, Berlin: Springer-Verlag, pp. 308-325.

de Leeuw, J., and Heiser, W. J. (1977), "Convergence of Correction Matrix Algorithms for Multidimensional Scaling," in Geometric Representations of Relational Data, eds. J. C. Lingoes, E. Roskam, and I. Borg, Ann Arbor, MI: Mathesis Press, pp. 735-752.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.

De Pierro, A. R. (1995), "A Modified Expectation Maximization Algorithm for Penalized Likelihood Estimation in Emission Tomography," IEEE Transactions on Medical Imaging, 14, 132-137.

Groenen, P. J. F. (1993), The Majorization Approach to Multidimensional Scaling: Some Problems and Extensions, Leiden, The Netherlands: DSWO Press.

Heiser, W. J. (1987), "Correspondence Analysis with Least Absolute Residuals," Computational Statistics & Data Analysis, 5, 337-356.

Heiser, W. J. (1995), "Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis," in Recent Advances in Descriptive Multivariate Analysis, ed. W. J. Krzanowski, Oxford: Clarendon Press, pp. 157-189.

Hosmer, D. W., and Lemeshow, S. (1989), Applied Logistic Regression, New York: Wiley.

Huber, P. J. (1981), Robust Statistics, New York: Wiley.

Hunter, D. R. (2004), "MM Algorithms for Generalized Bradley-Terry Models," The Annals of Statistics, 32, 386-408.

Hunter, D. R., and Lange, K. (2000a), Rejoinder to discussion of "Optimization Transfer Using Surrogate Objective Functions," Journal of Computational and Graphical Statistics, 9, 52-59.

Hunter, D. R., and Lange, K. (2000b), "Quantile Regression via an MM Algorithm," Journal of Computational and Graphical Statistics, 9, 60-77.

Hunter, D. R., and Lange, K. (2002), "Computing Estimates in the Proportional Odds Model," Annals of the Institute of Statistical Mathematics, 54, 155-168.

Hunter, D. R., and Li, R. (2002), "A Connection Between Variable Selection and EM-Type Algorithms," Technical Report 0201, Dept. of Statistics, Pennsylvania State University.

Jamshidian, M., and Jennrich, R. I. (1997), "Acceleration of the EM Algorithm by Using Quasi-Newton Methods," Journal of the Royal Statistical Society, Ser. B, 59, 569-587.

Kiers, H. A. L., and Ten Berge, J. M. F. (1992), "Minimization of a Class of Matrix Trace Functions by Means of Refined Majorization," Psychometrika, 57, 371-382.

Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33-50.

Lange, K. (1994), "An Adaptive Barrier Method for Convex Programming," Methods and Applications of Analysis, 1, 392-402.

Lange, K. (1995a), "A Gradient Algorithm Locally Equivalent to the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 57, 425-437.

Lange, K. (1995b), "A Quasi-Newton Acceleration of the EM Algorithm," Statistica Sinica, 5, 1-18.

Lange, K. (1999), Numerical Analysis for Statisticians, New York: Springer-Verlag.

Lange, K., and Fessler, J. A. (1995), "Globally Convergent Algorithms for Maximum A Posteriori Transmission Tomography," IEEE Transactions on Image Processing, 4, 1430-1438.

Lange, K., Hunter, D. R., and Yang, I. (2000), "Optimization Transfer Using Surrogate Objective Functions" (with discussion), Journal of Computational and Graphical Statistics, 9, 1-20.

Lange, K., and Sinsheimer, J. S. (1993), "Normal/Independent Distributions and Their Applications in Robust Regression," Journal of Computational and Graphical Statistics, 2, 175-198.

Luenberger, D. G. (1984), Linear and Nonlinear Programming (2nd ed.), Reading, MA: Addison-Wesley.

Maher, M. J. (1982), "Modelling Association Football Scores," Statistica Neerlandica, 36, 109-118.

Marshall, A. W., and Olkin, I. (1979), Inequalities: Theory of Majorization and Its Applications, San Diego: Academic.

McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley.

Meng, X.-L., and Rubin, D. B. (1991), "Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm," Journal of the American Statistical Association, 86, 899-909.

Meng, X.-L., and Rubin, D. B. (1993), "Maximum Likelihood Estimation via the ECM Algorithm: A General Framework," Biometrika, 80, 267-278.

Oakes, D. (1999), "Direct Calculation of the Information Matrix via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 61, 479-482.

Ortega, J. M., and Rheinboldt, W. C. (1970), Iterative Solutions of Nonlinear Equations in Several Variables, New York: Academic, pp. 253-255.

Sabatti, C., and Lange, K. (2002), "Genomewide Motif Identification Using a Dictionary Model," Proceedings of the IEEE, 90, 1803-1810.

Schlossmacher, E. J. (1973), "An Iterative Technique for Absolute Deviations Curve Fitting," Journal of the American Statistical Association, 68, 857-859.

Wu, C. F. J. (1983), "On the Convergence Properties of the EM Algorithm," The Annals of Statistics, 11, 95-103.