A Tutorial on MM Algorithms

David R. Hunter and Kenneth Lange

The American Statistician, February 2004, Vol. 58, No. 1, pp. 30-37. DOI: 10.1198/0003130042836

David R. Hunter is Assistant Professor, Department of Statistics, Penn State University, University Park, PA 16802-2111 (E-mail: dhunter@stat.psu.edu). Kenneth Lange is Professor, Departments of Biomathematics and Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1766. Research supported in part by USPHS grants GM53275 and MH59490.

Most problems in frequentist statistics involve optimization of a function such as a likelihood or a sum of squares. EM algorithms are among the most effective algorithms for maximum likelihood estimation because they consistently drive the likelihood uphill by maximizing a simple surrogate function for the log-likelihood. Iterative optimization of a surrogate function as exemplified by an EM algorithm does not necessarily require missing data. Indeed, every EM algorithm is a special case of the more general class of MM optimization algorithms, which typically exploit convexity rather than missing data in majorizing or minorizing an objective function. In our opinion, MM algorithms deserve to be part of the standard toolkit of professional statisticians. This article explains the principle behind MM algorithms, suggests some methods for constructing them, and discusses some of their attractive features. We include numerous examples throughout the article to illustrate the concepts described. In addition to surveying previous work on MM algorithms, the article introduces some new material on constrained optimization and standard error estimation.

KEY WORDS: Constrained optimization; EM algorithm; Majorization; Minorization; Newton-Raphson.
1. INTRODUCTION

Maximum likelihood and least squares are the dominant forms of estimation in frequentist statistics. Toy optimization problems designed for classroom presentation can be solved analytically, but most practical maximum likelihood and least squares estimation problems must be solved numerically. This article discusses an optimization method that typically relies on convexity arguments and is a generalization of the well-known EM algorithm (Dempster, Laird, and Rubin 1977; McLachlan and Krishnan 1997). We call any algorithm based on this iterative method an MM algorithm.

To our knowledge, the general principle behind MM algorithms was first enunciated by the numerical analysts Ortega and Rheinboldt (1970) in the context of line search methods. De Leeuw and Heiser (1977) presented an MM algorithm for multidimensional scaling contemporary with the classic Dempster et al. (1977) article on EM algorithms. Although the work of de Leeuw and Heiser did not spark the same explosion of interest from the statistical community set off by the Dempster et al. (1977) article, steady development of MM algorithms has continued. The MM principle reappears, among other places, in robust regression (Huber 1981), in correspondence analysis (Heiser 1987), in the quadratic lower bound principle of Böhning and Lindsay (1988), in the psychometrics literature on least squares (Bijleveld and de Leeuw 1991; Kiers and Ten Berge 1992), and in medical imaging (De Pierro 1995; Lange and Fessler 1995). The recent survey articles of de Leeuw (1994), Heiser (1995), Becker, Yang, and Lange (1997), and Lange, Hunter, and Yang (2000) deal with the general principle, but it is not until the rejoinder of Hunter and Lange (2000a) that the acronym MM first appears. This acronym pays homage to the earlier names "majorization" and "iterative majorization" of the MM principle, emphasizes its crucial link to the better-known EM principle, and diminishes the possibility of confusion with the distinct subject in mathematics known as majorization (Marshall and Olkin 1979). Recent work has demonstrated the utility of MM algorithms in a broad range of statistical contexts, including quantile regression (Hunter and Lange 2000b), survival analysis (Hunter and Lange 2002), paired and multiple comparisons (Hunter 2004), variable selection (Hunter and Li 2002), and DNA sequence analysis (Sabatti and Lange 2002).

One of the virtues of the MM acronym is that it does double duty. In minimization problems, the first M of MM stands for majorize and the second M for minimize. In maximization problems, the first M stands for minorize and the second M for maximize. (We define the terms "majorize" and "minorize" in Section 2.) A successful MM algorithm substitutes a simple optimization problem for a difficult optimization problem. Simplicity can be attained by (a) avoiding large matrix inversions, (b) linearizing an optimization problem, (c) separating the parameters of an optimization problem, (d) dealing with equality and inequality constraints gracefully, or (e) turning a nondifferentiable problem into a smooth problem. Iteration is the price we pay for simplifying the original problem.

In our view, MM algorithms are easier to understand and sometimes easier to apply than EM algorithms. Although we have no intention of detracting from EM algorithms, their dominance over MM algorithms is a historical accident. An EM algorithm operates by identifying a theoretical complete data space. In the E-step of the algorithm, the conditional expectation of the complete data log-likelihood is calculated with respect to the observed data. The surrogate function created by the E-step is, up to a constant, a minorizing function. In the M-step, this minorizing function is maximized with respect to the parameters of the underlying model; thus, every EM algorithm is an example of an MM algorithm. Construction of an EM algorithm sometimes demands creativity in identifying the complete data and technical skill in calculating an often complicated conditional expectation and then maximizing it analytically.

In contrast, typical applications of MM revolve around careful inspection of a log-likelihood or other objective function to be optimized, with particular attention paid to convexity and inequalities. Thus, success with MM algorithms and success with EM algorithms hinge on somewhat different mathematical maneuvers. However, the skills required by most MM algorithms are no harder to master than the skills required by most EM algorithms. The purpose of this article is to present some strategies for constructing MM algorithms and to illustrate various aspects of these algorithms through the study of specific examples.

We conclude this section with a note on nomenclature. Just as EM is more a prescription for creating algorithms than an actual algorithm, MM refers not to a single algorithm but to a class of algorithms. Thus, this article refers to specific EM and MM algorithms but never to "the MM algorithm" or "the EM algorithm."

2. THE MM PHILOSOPHY

Let \theta^{(m)} represent a fixed value of the parameter \theta, and let g(\theta \mid \theta^{(m)}) denote a real-valued function of \theta whose form depends on \theta^{(m)}. The function g(\theta \mid \theta^{(m)}) is said to majorize a real-valued function f(\theta) at the point \theta^{(m)} provided

    g(\theta \mid \theta^{(m)}) \ge f(\theta)  for all \theta,
    g(\theta^{(m)} \mid \theta^{(m)}) = f(\theta^{(m)}).     (1)

In other words, the surface \theta \mapsto g(\theta \mid \theta^{(m)}) lies above the surface f(\theta) and is tangent to it at the point \theta = \theta^{(m)}. The function g(\theta \mid \theta^{(m)}) is said to minorize f(\theta) at \theta^{(m)} if -g(\theta \mid \theta^{(m)}) majorizes -f(\theta) at \theta^{(m)}.

Ordinarily, \theta^{(m)} represents the current iterate in a search of the surface f(\theta). In a majorize-minimize MM algorithm, we minimize the majorizing function g(\theta \mid \theta^{(m)}) rather than the actual function f(\theta). If \theta^{(m+1)} denotes the minimizer of g(\theta \mid \theta^{(m)}), then we can show that the MM procedure forces f(\theta) downhill. Indeed, the inequality

    f(\theta^{(m+1)}) \le g(\theta^{(m+1)} \mid \theta^{(m)}) \le g(\theta^{(m)} \mid \theta^{(m)}) = f(\theta^{(m)})     (2)

follows directly from the fact that g(\theta^{(m+1)} \mid \theta^{(m)}) \le g(\theta^{(m)} \mid \theta^{(m)}) and definition (1). The descent property (2) lends an MM algorithm remarkable numerical stability. With straightforward changes, the MM recipe also applies to maximization rather than minimization: To maximize a function f(\theta), we minorize it by a surrogate function g(\theta \mid \theta^{(m)}) and maximize g(\theta \mid \theta^{(m)}) to produce the next iterate \theta^{(m+1)}.

2.1 Calculation of Sample Quantiles

As a one-dimensional example, consider the problem of computing a sample quantile from a sample x_1, ..., x_n of n real numbers. One can readily prove (Hunter and Lange 2000b) that for q \in (0, 1), a qth sample quantile of x_1, ..., x_n minimizes the function

    f(\theta) = \sum_{i=1}^n \rho_q(x_i - \theta),     (3)

where \rho_q(\theta) is the "vee" function

    \rho_q(\theta) = \begin{cases} q\theta, & \theta \ge 0 \\ -(1 - q)\theta, & \theta < 0. \end{cases}

When q = 1/2, this function is proportional to the absolute value function; for q \ne 1/2, the "vee" is tilted to one side or the other. As seen in Figure 1(a), it is possible to majorize the "vee" function at any nonzero point by a simple quadratic function.

[Figure 1. For q = 0.8, (a) depicts the "vee" function \rho_q(\theta) and its quadratic majorizing function for \theta^{(m)} = -0.75; (b) shows the objective function f(\theta) that is minimized by the 0.8 quantile of the sample 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, along with its quadratic majorizer, for \theta^{(m)} = 2.5.]

Specifically, for a given \theta^{(m)} \ne 0, \rho_q(\theta) is majorized at \pm\theta^{(m)} by

    \zeta_q(\theta \mid \theta^{(m)}) = \frac{1}{4} \left\{ \frac{\theta^2}{|\theta^{(m)}|} + (4q - 2)\theta + |\theta^{(m)}| \right\}.

Fortunately, the majorization relation between functions is closed under the formation of sums, nonnegative products, limits, and composition with an increasing function. These rules permit us to work piecemeal in simplifying complicated objective functions. Thus, the function f(\theta) of Equation (3) is majorized at the point \theta^{(m)} by

    g(\theta \mid \theta^{(m)}) = \sum_{i=1}^n \zeta_q(x_i - \theta \mid x_i - \theta^{(m)}).     (4)

The function f(\theta) and its majorizer g(\theta \mid \theta^{(m)}) are shown in Figure 1(b) for a particular sample of size n = 12.

Setting the first derivative of g(\theta \mid \theta^{(m)}) equal to zero gives the minimum point

    \theta^{(m+1)} = \frac{n(2q - 1) + \sum_{i=1}^n w_i^{(m)} x_i}{\sum_{i=1}^n w_i^{(m)}},     (5)

where the weight w_i^{(m)} = |x_i - \theta^{(m)}|^{-1} depends on \theta^{(m)}. A flaw of algorithm (5) is that the weight w_i^{(m)} is undefined whenever \theta^{(m)} = x_i. In mending this flaw, Hunter and Lange (2000b) also discussed the broader technique of quantile regression introduced by Koenker and Bassett (1978). From a computational perspective, the most fascinating thing about the quantile-finding algorithm is that it avoids sorting and relies entirely on arithmetic and iteration. For the case of the sample median (q = 1/2), algorithm (5) is found in Schlossmacher (1973) and is shown to be an MM algorithm by Lange and Sinsheimer (1993) and Heiser (1995).

Because g(\theta \mid \theta^{(m)}) in Equation (4) is a quadratic function of \theta, expression (5) coincides with the more general Newton-Raphson update

    \theta^{(m+1)} = \theta^{(m)} - \left[ \nabla^2 g(\theta^{(m)} \mid \theta^{(m)}) \right]^{-1} \nabla g(\theta^{(m)} \mid \theta^{(m)}),     (6)

where \nabla g(\theta^{(m)} \mid \theta^{(m)}) and \nabla^2 g(\theta^{(m)} \mid \theta^{(m)}) denote the gradient vector and the Hessian matrix of g(\theta \mid \theta^{(m)}) evaluated at \theta^{(m)}. Because the descent property (2) depends only on decreasing g(\theta \mid \theta^{(m)}) and not on minimizing it, the update (6) can serve in cases where g(\theta \mid \theta^{(m)}) lacks a closed form minimizer, provided this update decreases the value of g(\theta \mid \theta^{(m)}). In the context of EM algorithms, Dempster et al. (1977) called an algorithm that reduces g(\theta \mid \theta^{(m)}) without actually minimizing it a generalized EM (GEM) algorithm. The specific case of Equation (6), which we call a gradient MM algorithm, was studied in the EM context by Lange (1995a), who pointed out that update (6) saves us from performing iterations within iterations and yet still displays the same local rate of convergence as a full MM algorithm that minimizes g(\theta \mid \theta^{(m)}) at each iteration.
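As an illustration of how update (5) can be coded, the following short Python sketch iterates the quantile-finding MM algorithm on a small sample. It is not part of the original article; the starting value, tolerance, and the simple safeguard against the undefined weights mentioned above (nudging an iterate that lands exactly on a data point) are our own illustrative choices.

```python
import numpy as np

def mm_quantile(x, q, theta0, tol=1e-10, max_iter=1000):
    """Approximate a qth sample quantile by iterating the MM update (5)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    theta = float(theta0)
    for _ in range(max_iter):
        # Nudge the iterate if it coincides with a data point, where the
        # weights |x_i - theta|^{-1} of update (5) would be undefined.
        if np.any(x == theta):
            theta += 1e-9
        w = 1.0 / np.abs(x - theta)          # weights w_i^(m)
        new_theta = (n * (2 * q - 1) + np.sum(w * x)) / np.sum(w)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Example: the sample of size 12 used in Figure 1.
sample = [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(mm_quantile(sample, q=0.8, theta0=2.5))  # close to the 0.8 sample quantile
```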
3. TRICKS OF THE TRADE

In the quantile example of Section 2.1, the "vee" function admits a quadratic majorizer as depicted in Figure 1(a). In general, many majorizing or minorizing relationships may be derived from various inequalities stemming from convexity or concavity. This section outlines some common inequalities used to construct majorizing or minorizing functions for various types of objective functions.

3.1 Jensen's Inequality

Jensen's inequality states for a convex function \kappa(x) and any random variable X that \kappa[E(X)] \le E[\kappa(X)]. Since -\ln(x) is a convex function, we conclude for probability densities a(x) and b(x) that

    -\ln \left\{ E\left[ \frac{a(X)}{b(X)} \right] \right\} \le E\left[ -\ln \frac{a(X)}{b(X)} \right].

If X has the density b(x), then E[a(X)/b(X)] = 1, so the left-hand side above vanishes and we obtain

    E[\ln a(X)] \le E[\ln b(X)],

which is sometimes known as the information inequality. It is this inequality that guarantees that a minorizing function is constructed in the E-step of any EM algorithm (de Leeuw 1994; Heiser 1995), making every EM algorithm an MM algorithm.
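To make the connection between the information inequality and the E-step explicit, the following sketch (not part of the original article, and written in the standard EM notation with observed data x, missing data Z, and Q-function Q) shows how the E-step surrogate minorizes the observed-data log-likelihood.

```latex
% Sketch: why the E-step surrogate minorizes the log-likelihood.
% Write the complete-data density as f(x, z \mid \theta) and the conditional
% density of the missing data as k(z \mid x, \theta) = f(x, z \mid \theta) / L(\theta),
% where L(\theta) is the observed-data likelihood.  Then
%   \ln L(\theta) = Q(\theta \mid \theta^{(m)}) - H(\theta \mid \theta^{(m)}),
% with
%   Q(\theta \mid \theta^{(m)}) = E[\ln f(x, Z \mid \theta) \mid x, \theta^{(m)}],
%   H(\theta \mid \theta^{(m)}) = E[\ln k(Z \mid x, \theta) \mid x, \theta^{(m)}].
% The information inequality with a = k(\cdot \mid x, \theta) and
% b = k(\cdot \mid x, \theta^{(m)}) gives
% H(\theta \mid \theta^{(m)}) \le H(\theta^{(m)} \mid \theta^{(m)}), so
\begin{align*}
g(\theta \mid \theta^{(m)})
  &:= Q(\theta \mid \theta^{(m)}) - H(\theta^{(m)} \mid \theta^{(m)})
      \le \ln L(\theta) \quad \text{for all } \theta, \\
g(\theta^{(m)} \mid \theta^{(m)})
  &= \ln L(\theta^{(m)}).
\end{align*}
% Hence g minorizes \ln L(\theta) at \theta^{(m)}, and maximizing Q in the
% M-step is exactly an MM (minorize-maximize) step.
```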
3.2 Minorization via Supporting Hyperplanes

Jensen's inequality is easily derived from the supporting hyperplane property of a convex function: Any linear function tangent to the graph of a convex function is a minorizer at the point of tangency. Thus, if \kappa(\theta) is convex and differentiable, then

    \kappa(\theta) \ge \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)}),     (7)

with equality when \theta = \theta^{(m)}. This inequality is illustrated by the example of Section 7 involving constrained optimization.

3.3 Majorization via the Definition of Convexity

If we wish to majorize a convex function instead of minorizing it, then we can use the standard definition of convexity; namely, \kappa(t) is convex if and only if

    \kappa\left( \sum_i \alpha_i t_i \right) \le \sum_i \alpha_i \kappa(t_i)     (8)

for any finite collection of points t_i and corresponding multipliers \alpha_i with \alpha_i \ge 0 and \sum_i \alpha_i = 1. Application of definition (8) is particularly effective when \kappa(t) is composed with a linear function x^t\theta. For instance, suppose for vectors x, \theta, and \theta^{(m)} that we make the substitution t_i = x_i(\theta_i - \theta_i^{(m)})/\alpha_i + x^t\theta^{(m)}. Inequality (8) then becomes

    \kappa(x^t\theta) \le \sum_i \alpha_i \kappa\left( \frac{x_i(\theta_i - \theta_i^{(m)})}{\alpha_i} + x^t\theta^{(m)} \right).     (9)

Alternatively, if all components of x, \theta, and \theta^{(m)} are positive, then we may take t_i = x^t\theta^{(m)} \theta_i / \theta_i^{(m)} and \alpha_i = x_i \theta_i^{(m)} / x^t\theta^{(m)}. Now inequality (8) becomes

    \kappa(x^t\theta) \le \sum_i \frac{x_i \theta_i^{(m)}}{x^t\theta^{(m)}} \kappa\left( \frac{x^t\theta^{(m)} \theta_i}{\theta_i^{(m)}} \right).     (10)

Inequalities (9) and (10) have been used to construct MM algorithms in the contexts of medical imaging (De Pierro 1995; Lange and Fessler 1995) and least-squares estimation without matrix inversion (Becker et al. 1997).

3.4 Majorization via a Quadratic Upper Bound

If a convex function \kappa(\theta) is twice differentiable and has bounded curvature, then we can majorize \kappa(\theta) by a quadratic function with sufficiently high curvature and tangent to \kappa(\theta) at \theta^{(m)} (Böhning and Lindsay 1988). In algebraic terms, if we can find a positive definite matrix M such that M - \nabla^2\kappa(\theta) is nonnegative definite for all \theta, then

    \kappa(\theta) \le \kappa(\theta^{(m)}) + \nabla\kappa(\theta^{(m)})^t (\theta - \theta^{(m)}) + \frac{1}{2} (\theta - \theta^{(m)})^t M (\theta - \theta^{(m)})

provides a quadratic upper bound. For example, Heiser (1995) noted in the unidimensional case that

    \frac{1}{\theta} \le \frac{1}{\theta^{(m)}} - \frac{\theta - \theta^{(m)}}{(\theta^{(m)})^2} + \frac{(\theta - \theta^{(m)})^2}{c^3}

for 0 < c \le \min\{\theta, \theta^{(m)}\}. The corresponding quadratic lower bound principle for minorization is the basis for the logistic regression example of Section 6.

3.5 The Arithmetic-Geometric Mean Inequality

The arithmetic-geometric mean inequality is a special case of inequality (8). Taking \kappa(t) = e^t and \alpha_i = 1/m yields

    \exp\left( \frac{1}{m} \sum_{i=1}^m t_i \right) \le \frac{1}{m} \sum_{i=1}^m e^{t_i}.

If we let x_i = e^{t_i}, then we obtain the standard form

    \left( \prod_{i=1}^m x_i \right)^{1/m} \le \frac{1}{m} \sum_{i=1}^m x_i     (11)

of the arithmetic-geometric mean inequality. Because the exponential function is strictly convex, equality holds if and only if all of the x_i are equal. Inequality (11) is helpful in constructing the majorizer

    x_1 x_2 \le \frac{x_1^2}{2} \frac{x_2^{(m)}}{x_1^{(m)}} + \frac{x_2^2}{2} \frac{x_1^{(m)}}{x_2^{(m)}}     (12)

of the product of two positive numbers. This inequality is used in the sports contest model of Section 4.

3.6 The Cauchy-Schwartz Inequality

The Cauchy-Schwartz inequality for the Euclidean norm is a special case of inequality (7). The function \kappa(\theta) = \|\theta\| is convex because it satisfies the triangle inequality and the homogeneity condition \|\alpha\theta\| = |\alpha| \|\theta\|. Since \kappa(\theta) = \sqrt{\sum_j \theta_j^2}, we see that \nabla\kappa(\theta) = \theta / \|\theta\|, and therefore inequality (7) gives

    \|\theta\| \ge \|\theta^{(m)}\| + \frac{(\theta - \theta^{(m)})^t \theta^{(m)}}{\|\theta^{(m)}\|} = \frac{\theta^t \theta^{(m)}}{\|\theta^{(m)}\|},     (13)

which is the Cauchy-Schwartz inequality. De Leeuw and Heiser (1977) and Groenen (1993) used inequality (13) to derive MM algorithms for multidimensional scaling.

4. SEPARATION OF PARAMETERS AND CYCLIC MM

One of the key criteria in judging minorizing or majorizing functions is their ease of optimization. Successful MM algorithms in high-dimensional parameter spaces often rely on surrogate functions in which the individual parameter components are separated. In other words, the surrogate function mapping \theta \in U \subset R^d \to R reduces to the sum of d real-valued functions taking the real-valued arguments \theta_1 through \theta_d. Because the d univariate functions may be optimized one by one, this makes the surrogate function easier to optimize at each iteration.

4.1 Poisson Sports Model

Consider a simplified version of a model proposed by Maher (1982) for a sports contest between two individuals or teams in which the number of points scored by team i against team j follows a Poisson process with intensity e^{o_i - d_j}, where o_i is an "offensive strength" parameter for team i and d_j is a "defensive strength" parameter for team j. If t_{ij} is the length of time that i plays j and p_{ij} is the number of points that i scores against j, then the corresponding Poisson log-likelihood function is

    \ell_{ij}(\theta) = p_{ij}(o_i - d_j) - t_{ij} e^{o_i - d_j} + p_{ij} \ln t_{ij} - \ln p_{ij}!,     (14)

where \theta = (o, d) is the parameter vector. Note that the parameters should satisfy a linear constraint, such as \sum_i o_i + \sum_j d_j = 0, in order for the model to be identifiable; otherwise, it is clearly possible to add the same constant to each o_i and d_j without altering the likelihood. We make two simplifying assumptions. First, different games are independent of each other. Second, each team's point total within a single game is independent of its opponent's point total. The second assumption is more suspect than the first because it implies that a team's offensive and defensive performances are somehow unrelated to one another; nonetheless the model gives an interesting first approximation to reality. Under these assumptions, the full data log-likelihood is obtained by summing \ell_{ij}(\theta) over all pairs (i, j). Setting the partial derivatives of the log-likelihood equal to zero leads to the equations

    \sum_j p_{ij} = \sum_j t_{ij} e^{\hat{o}_i - \hat{d}_j}  and  \sum_i p_{ij} = \sum_i t_{ij} e^{\hat{o}_i - \hat{d}_j}

satisfied by the maximum likelihood estimate (\hat{o}, \hat{d}). These equations do not admit a closed form solution, so we turn to an MM algorithm. Because the task is to maximize the log-likelihood, we need a minorizing function. Focusing on the term -t_{ij} e^{o_i - d_j}, we may use inequality (12) to show that

    -e^{o_i - d_j} \ge -\frac{e^{2 o_i}}{2} e^{-o_i^{(m)} - d_j^{(m)}} - \frac{e^{-2 d_j}}{2} e^{o_i^{(m)} + d_j^{(m)}}.     (15)

Although the right side of the above inequality may appear more complicated than the left side, it is actually simpler in one important respect: the parameter components o_i and d_j are separated on the right side but not on the left. Summing the log-likelihood (14) over all pairs (i, j) and invoking inequality (15) yields the function

    g(\theta \mid \theta^{(m)}) = \sum_i \sum_j \left[ p_{ij}(o_i - d_j) - \frac{t_{ij}}{2} \left( e^{2 o_i} e^{-o_i^{(m)} - d_j^{(m)}} + e^{-2 d_j} e^{o_i^{(m)} + d_j^{(m)}} \right) \right]

minorizing the full log-likelihood up to an additive constant independent of \theta. The fact that the components of \theta are separated by g(\theta \mid \theta^{(m)}) permits us to update parameters one by one and substantially reduces computational costs. Setting the partial derivatives of g(\theta \mid \theta^{(m)}) equal to zero yields the updates

    o_i^{(m+1)} = \frac{1}{2} \ln \left( \frac{\sum_j p_{ij}}{\sum_j t_{ij} e^{-o_i^{(m)} - d_j^{(m)}}} \right),
    d_j^{(m+1)} = -\frac{1}{2} \ln \left( \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m)} + d_j^{(m)}}} \right).     (16)

In practice, an MM algorithm often takes fewer iterations when we cycle through the parameters updating one at a time than when we update the whole vector at once as in algorithm (16). We call such versions of MM algorithms cyclic MM algorithms; they generalize the ECM algorithms of Meng and Rubin (1993). A cyclic MM algorithm always drives the objective function in the right direction; indeed, every iteration of a cyclic MM algorithm is simply an MM iteration on a reduced parameter set. For example, if we update the o vector before the d vector in each iteration of algorithm (16), we could replace the formula for d_j^{(m+1)} above by

    d_j^{(m+1)} = -\frac{1}{2} \ln \left( \frac{\sum_i p_{ij}}{\sum_i t_{ij} e^{o_i^{(m+1)} + d_j^{(m)}}} \right).     (17)
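The updates (16) and their cyclic variant (17) are easy to implement. The following Python sketch is ours rather than the authors'; the data layout (square matrices p and t of points and playing times with zero diagonal), the initialization at zero, the convergence tolerance, and the final re-centering step are illustrative assumptions.

```python
import numpy as np

def poisson_sports_mm(p, t, cyclic=False, tol=1e-8, max_iter=10000):
    """MM updates (16), optionally the cyclic variant (17), for the Poisson sports model.

    p[i, j] = points scored by team i against team j (summed over all games)
    t[i, j] = total playing time of team i against team j
    Returns offensive strengths o and defensive strengths d, re-centered at the
    end so that sum(o) + sum(d) = 0 (the identifiability constraint in the text).
    """
    n = p.shape[0]
    o, d = np.zeros(n), np.zeros(n)
    for _ in range(max_iter):
        o_old, d_old = o.copy(), d.copy()
        # Offensive update in (16); the right-hand side uses the old o and d.
        o = 0.5 * np.log(p.sum(axis=1)
                         / (t * np.exp(-o_old[:, None] - d_old[None, :])).sum(axis=1))
        # Defensive update: the cyclic variant (17) reuses the new o, (16) does not.
        o_used = o if cyclic else o_old
        d = -0.5 * np.log(p.sum(axis=0)
                          / (t * np.exp(o_used[:, None] + d_old[None, :])).sum(axis=0))
        if max(np.abs(o - o_old).max(), np.abs(d - d_old).max()) < tol:
            break
    shift = (o.sum() + d.sum()) / (2 * n)   # re-center once after convergence
    return o - shift, d - shift
```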

4.2 Application to National Basketball Association Results

Table 1 summarizes our application of the Poisson sports model to the results of the 2002-2003 regular season of the National Basketball Association. In these data, t_{ij} is measured in minutes. A regular game lasts 48 minutes, and each overtime period, if necessary, adds five minutes. Thus, team i is expected to score 48 e^{\hat o_i - \hat d_j} points against team j when the two teams meet and do not tie. Team i is ranked higher than team j if \hat o_i - \hat d_j > \hat o_j - \hat d_i, which is equivalent to \hat o_i + \hat d_i > \hat o_j + \hat d_j.

Table 1. Ranking of all 29 NBA Teams on the Basis of the 2002-2003 Regular Season According to Their Estimated Offensive Plus Defensive Strength. Each Team Played 82 Games.

    Team            o_i + d_i   Wins       Team            o_i + d_i   Wins
    Cleveland        -0.0994     17        Phoenix          0.0166      44
    Denver           -0.0845     17        New Orleans      0.0169      47
    Toronto          -0.0647     24        Philadelphia     0.0187      48
    Miami            -0.0581     25        Houston          0.0205      43
    Chicago          -0.0544     30        Minnesota        0.0259      51
    Atlanta          -0.0402     35        LA Lakers        0.0277      50
    LA Clippers      -0.0355     27        Indiana          0.0296      48
    Memphis          -0.0255     28        Utah             0.0299      47
    New York         -0.0164     37        Portland         0.0320      50
    Washington       -0.0153     37        Detroit          0.0336      50
    Boston           -0.0077     44        New Jersey       0.0481      49
    Golden State     -0.0051     38        San Antonio      0.0611      60
    Orlando          -0.0039     42        Sacramento       0.0686      59
    Milwaukee        -0.0027     42        Dallas           0.0804      60
    Seattle           0.0039     40

It is worth emphasizing some of the virtues of the model. First, the ranking of the 29 NBA teams on the basis of the estimated sums \hat o_i + \hat d_i for the 2002-2003 regular season is not perfectly consistent with their cumulative wins; strength of schedule and margins of victory are reflected in the model. Second, the model gives the point-spread function for a particular game as the difference of two independent Poisson random variables. Third, one can easily amend the model to rank individual players rather than teams by assigning to each player an offensive and defensive intensity parameter. If each game is divided into time segments punctuated by substitutions, then the MM algorithm can be adapted to estimate the assigned player intensities. This might provide a rational basis for salary negotiations that takes into account subtle differences between players not reflected in traditional sports statistics.

Finally, the NBA dataset sheds light on the comparative speeds of the original MM algorithm (16) and its cyclic modification (17). The cyclic MM algorithm converged in fewer iterations (25 instead of 28). However, because of the additional work required to recompute the denominators in Equation (17), the cyclic version required slightly more floating-point operations as counted by MATLAB (301,157 instead of 289,998).

5. SPEED OF CONVERGENCE

MM algorithms and Newton-Raphson algorithms have complementary strengths. On one hand, Newton-Raphson algorithms boast a quadratic rate of convergence as they near a local optimum point \theta^*. In other words, under certain general conditions,

    \lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|^2} = c

for some constant c. This quadratic rate of convergence is much faster than the linear rate of convergence

    \lim_{m \to \infty} \frac{\|\theta^{(m+1)} - \theta^*\|}{\|\theta^{(m)} - \theta^*\|} = c < 1     (18)

displayed by typical MM algorithms. Hence, Newton-Raphson algorithms tend to require fewer iterations than MM algorithms. On the other hand, an iteration of a Newton-Raphson algorithm can be far more onerous computationally than an iteration of an MM algorithm.

Examination of the form

    \theta^{(m+1)} = \theta^{(m)} - \nabla^2 f(\theta^{(m)})^{-1} \nabla f(\theta^{(m)})

of a Newton-Raphson iteration reveals that it requires evaluation and inversion of the Hessian matrix \nabla^2 f(\theta^{(m)}). If \theta has p components, then the number of calculations needed to invert the p x p matrix \nabla^2 f(\theta) is roughly proportional to p^3. By contrast, an MM algorithm that separates parameters usually takes on the order of p or p^2 arithmetic operations per iteration. Thus, well-designed MM algorithms tend to require more iterations but simpler iterations than Newton-Raphson, and they sometimes enjoy an advantage in computational speed.

For example, the Poisson sports model for the NBA dataset of Section 4 has 57 parameters (two for each of 29 teams minus one for the linear constraint). A single matrix inversion of a 57 x 57 matrix requires roughly 387,000 floating point operations according to MATLAB. Thus, even a single Newton-Raphson iteration requires more computation in this example than the 300,000 floating point operations required for the MM algorithm to converge completely in 28 iterations. Numerical stability also enters the balance sheet. A Newton-Raphson algorithm can behave poorly if started too far from an optimum point. By contrast, MM algorithms are guaranteed to appropriately increase or decrease the value of the objective function at every iteration.

Other types of deterministic optimization algorithms, such as Fisher scoring, quasi-Newton methods, and gradient-free methods like Nelder-Mead, occupy a kind of middle ground. Although none of them can match Newton-Raphson in required iterations until convergence, each has its own merits. The expected information matrix used in Fisher scoring is sometimes easier to evaluate than the observed information matrix of Newton-Raphson. Scoring does not automatically lead to an increase in the log-likelihood, but at least (unlike Newton-Raphson) it can always be made to do so if some form of backtracking is incorporated. Quasi-Newton methods mitigate or even eliminate the need for matrix inversion. The Nelder-Mead approach is applicable in situations where the objective function is nondifferentiable. Because of the complexities of practical problems, it is impossible to declare any optimization algorithm best overall. In our experience, however, MM algorithms are often difficult to beat in terms of stability and computational simplicity.

6. STANDARD ERROR ESTIMATES

In most cases, a maximum likelihood estimator has asymptotic covariance matrix equal to the inverse of the expected information matrix. In practice, the expected information matrix is often well-approximated by the observed information matrix -\nabla^2 \ell(\hat\theta) computed by differentiating the log-likelihood \ell(\theta) twice. Thus, after the MLE \hat\theta has been found, a standard error of \hat\theta can be obtained by taking square roots of the diagonal entries of the inverse of -\nabla^2 \ell(\hat\theta). In some problems, however, direct calculation of \nabla^2 \ell(\hat\theta) is difficult. Here we propose two numerical approximations to this matrix that exploit quantities readily obtained by running an MM algorithm. Let g(\theta \mid \theta^{(m)}) denote a minorizing function of the log-likelihood \ell(\theta) at the point \theta^{(m)}, and define

    M(\vartheta) = \arg\max_\theta \, g(\theta \mid \vartheta)

to be the MM algorithm map taking \theta^{(m)} to \theta^{(m+1)}.

6.1 Numerical Differentiation

The two numerical approximations to -\nabla^2 \ell(\hat\theta) are based on the formulas

    \nabla^2 \ell(\hat\theta) = \nabla^2 g(\hat\theta \mid \hat\theta) \left[ I - \nabla M(\hat\theta) \right]     (19)

and

    \nabla^2 \ell(\hat\theta) = \nabla^2 g(\hat\theta \mid \hat\theta) + \left. \frac{\partial}{\partial\vartheta} \nabla g(\hat\theta \mid \vartheta) \right|_{\vartheta = \hat\theta},     (20)

where I denotes the identity matrix. These formulas were derived by Lange (1999) using two simple facts: First, the tangency of \ell(\theta) and its minorizer imply that their gradient vectors are equal at the point of minorization; and second, the gradient of g(\theta \mid \theta^{(m)}) at its maximizer M(\theta^{(m)}) is zero. Alternative derivations of formulas (19) and (20) were given by Meng and Rubin (1991) and Oakes (1999), respectively. Although these formulas have been applied to standard error estimation in the EM algorithm literature, where Meng and Rubin (1991) base their SEM idea on formula (19), to our knowledge neither has been applied in the broader context of MM algorithms.

Approximation of \nabla^2 \ell(\hat\theta) using Equation (19) requires a numerical approximation of the Jacobian matrix \nabla M(\hat\theta), whose i, j entry equals

    \frac{\partial}{\partial \theta_j} M_i(\hat\theta) = \lim_{\delta \to 0} \frac{M_i(\hat\theta + \delta e_j) - M_i(\hat\theta)}{\delta},     (21)

where the vector e_j is the jth standard basis vector having a one in its jth component and zeros elsewhere. Because M(\hat\theta) = \hat\theta, the jth column of \nabla M(\hat\theta) may be approximated using only output from the corresponding MM algorithm by (a) iterating until \hat\theta is found, (b) altering the jth component of \hat\theta by a small amount \delta_j, (c) applying the MM algorithm map to this altered \hat\theta, (d) subtracting \hat\theta from the result, and (e) dividing by \delta_j. Approximation of \nabla^2 \ell(\hat\theta) using Equation (20) is analogous except that it involves numerically approximating the Jacobian of h(\vartheta) = \nabla g(\hat\theta \mid \vartheta). In this case one may exploit the fact that h(\hat\theta) is zero.
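The recipe in steps (a) through (e) translates directly into code. The Python sketch below is our own illustration, not code from the article; it assumes you already have the MM map as a function mm_map(theta) returning one MM update, the MLE theta_hat, and a function hess_g returning the Hessian of the surrogate at theta_hat, and it uses the relative increment \delta_j = \theta_j / 1000 mentioned in Section 6.3.

```python
import numpy as np

def observed_information_via_mm(mm_map, hess_g, theta_hat, rel_step=1e-3):
    """Approximate the observed information -hessian of the log-likelihood via formula (19).

    mm_map(theta) : one application of the MM algorithm map M
    hess_g(theta) : Hessian of the surrogate g(. | theta_hat) evaluated at theta
    theta_hat     : the maximum likelihood estimate (a fixed point of mm_map)
    """
    p = len(theta_hat)
    dM = np.empty((p, p))
    for j in range(p):
        # step (b): perturb component j (assumes theta_hat[j] != 0;
        # otherwise an absolute step would be needed)
        delta = rel_step * theta_hat[j]
        perturbed = theta_hat.copy()
        perturbed[j] += delta
        # steps (c)-(e): apply the MM map, subtract theta_hat, divide by delta
        dM[:, j] = (mm_map(perturbed) - theta_hat) / delta
    hessian_loglik = hess_g(theta_hat) @ (np.eye(p) - dM)   # formula (19)
    # Invert the result and take square roots of the diagonal for standard errors.
    return -hessian_loglik
```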
6.2 An MM Algorithm for Logistic Regression

To illustrate these ideas and facilitate comparison of the various numerical methods, we consider an example in which the exact Hessian \nabla^2 \ell(\theta) of the log-likelihood is easy to compute. Böhning and Lindsay (1988) apply the quadratic bound principle of Section 3.4 to the case of logistic regression, in which we have an n x 1 vector Y of binary responses and an n x p matrix X of predictors. The model stipulates that the probability \pi_i(\theta) that Y_i = 1 equals \exp\{\theta^t x_i\} / (1 + \exp\{\theta^t x_i\}). Straightforward differentiation of the resulting log-likelihood function shows that

    \nabla^2 \ell(\theta) = -\sum_{i=1}^n \pi_i(\theta) [1 - \pi_i(\theta)] x_i x_i^t.

Since \pi_i(\theta)[1 - \pi_i(\theta)] is bounded above by 1/4, we may define the negative definite matrix B = -\frac{1}{4} X^t X and conclude that \nabla^2 \ell(\theta) - B is nonnegative definite as desired. Therefore, the quadratic function

    g(\theta \mid \theta^{(m)}) = \ell(\theta^{(m)}) + \nabla\ell(\theta^{(m)})^t (\theta - \theta^{(m)}) + \frac{1}{2} (\theta - \theta^{(m)})^t B (\theta - \theta^{(m)})

minorizes \ell(\theta) at \theta^{(m)}. The MM algorithm proceeds by maximizing this quadratic, giving

    \theta^{(m+1)} = \theta^{(m)} - B^{-1} \nabla\ell(\theta^{(m)})
                   = \theta^{(m)} + 4 (X^t X)^{-1} X^t [Y - \pi(\theta^{(m)})].     (22)

Since the MM algorithm of Equation (22) needs to invert the matrix X^t X only once, it enjoys an increasing computational advantage over Newton-Raphson as the number of predictors increases (Böhning and Lindsay 1988).
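A direct implementation of update (22) is short. The following Python sketch is ours rather than the authors'; the zero starting value and the convergence tolerance are illustrative choices, and an intercept column must be supplied by the caller.

```python
import numpy as np

def logistic_mm(X, y, tol=1e-10, max_iter=5000):
    """Logistic regression via the MM update (22) of Bohning and Lindsay (1988).

    X : n x p design matrix (include a column of ones for an intercept)
    y : length-n vector of 0/1 responses
    """
    n, p = X.shape
    theta = np.zeros(p)
    # X^t X is inverted only once, the key advantage of update (22).
    xtx_inv = np.linalg.inv(X.T @ X)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ theta))              # pi_i(theta)
        new_theta = theta + 4.0 * xtx_inv @ (X.T @ (y - pi))
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta
```

Passing this update map and the surrogate Hessian B = -X^t X / 4 to the numerical differentiation sketch after Section 6.1 yields standard errors of the kind compared in Table 2.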

6.3 Application to Low Birth Weight Data

We now test the standard error approximations based on Equations (19) and (20) on the low birth weight dataset of Hosmer and Lemeshow (1989). This dataset involves 189 observations and eight maternal predictors. The response is 0 or 1 according to whether an infant is born underweight, defined as less than 2.5 kilograms. The predictors include mother's age in years (AGE), weight at last menstrual period (LWT), race (RACE2 and RACE3), smoking status during pregnancy (SMOKE), number of previous premature labors (PTL), presence of hypertension history (HT), presence of uterine irritability (UI), and number of physician visits during the first trimester (FTV). Each of these predictors is quantitative except for race, which is a three-level factor with level 1 for whites, level 2 for blacks, and level 3 for other races. Table 2 shows the maximum likelihood estimates and asymptotic standard errors for the 10 parameters. The differentiation increment \delta_j was \theta_j/1000 for each parameter \theta_j. The standard error approximations in the two rightmost columns turn out to be the same in this example, but in other models they will differ. The close agreement of the approximations with the "gold standard" based on the exact value of \nabla^2 \ell(\hat\theta) is clearly good enough for practical purposes.

Table 2. Estimated Coefficients and Standard Errors for the Low Birth Weight Logistic Regression Example.

                                 Standard errors based on:
    Variable    Coefficient    Exact \nabla^2 \ell    Equation (19)    Equation (20)
    Constant       0.48062        1.1969                1.1984           1.1984
    AGE           -0.029549       0.037031              0.037081         0.037081
    LWT           -0.015424       0.0069194             0.0069336        0.0069336
    RACE2          1.2723         0.52736               0.52753          0.52753
    RACE3          0.8805         0.44079               0.44076          0.44076
    SMOKE          0.93885        0.40215               0.40219          0.40219
    PTL            0.54334        0.34541               0.34545          0.34545
    HT             1.8633         0.69754               0.69811          0.69811
    UI             0.76765        0.45932               0.45933          0.45933
    FTV            0.065302       0.17240               0.17251          0.17251

7. HANDLING CONSTRAINTS

Many optimization problems impose constraints on parameters. For example, parameters are often required to be nonnegative. Here we discuss a majorization technique that in a sense eliminates inequality constraints. For this adaptive barrier method (Censor and Zenios 1992; Lange 1994) to work, an initial point \theta^{(0)} must be selected with all inequality constraints strictly satisfied. The barrier method confines subsequent iterates to the interior of the parameter space but allows strict inequalities to become equalities in the limit.

Consider the problem of minimizing f(\theta) subject to the constraints v_j(\theta) > 0 for 1 \le j \le q, where each v_j(\theta) is a concave, differentiable function. Since -v_j(\theta) is convex, we know from inequality (7) that

    -v_j(\theta) \ge -v_j(\theta^{(m)}) - (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}).

Application of the similar inequality \ln s - \ln t \ge -s^{-1}(t - s), with s = v_j(\theta^{(m)}) and t = v_j(\theta), implies that

    v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] \ge v_j(\theta^{(m)}) - v_j(\theta).

Adding the last two inequalities, we see that

    v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] + (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}) \ge 0,     (23)

with equality when \theta = \theta^{(m)}. Summing over j and multiplying by \omega, we construct the function

    g(\theta \mid \theta^{(m)}) = f(\theta) + \omega \sum_j \left\{ v_j(\theta^{(m)}) \left[ \ln v_j(\theta^{(m)}) - \ln v_j(\theta) \right] + (\theta - \theta^{(m)})^t \nabla v_j(\theta^{(m)}) \right\}

majorizing f(\theta) at \theta^{(m)}. Here \omega > 0 is a tuning parameter. The presence of the term -\ln v_j(\theta) in Equation (23) prevents v_j(\theta^{(m+1)}) \le 0 from occurring. The multiplier v_j(\theta^{(m)}) of -\ln v_j(\theta) gradually adapts and allows v_j(\theta^{(m+1)}) to tend to 0 if it is inclined to do so. When there are equality constraints A\theta = b in addition to the inequality constraints v_j(\theta) > 0, these should be enforced during the minimization of g(\theta \mid \theta^{(m)}).

7.1 Multinomial Sampling

To gain a feel for how these ideas work in practice, consider the problem of maximum likelihood estimation given a random sample of size n from a multinomial distribution. If there are q categories and n_i observations fall in category i, then the log-likelihood reduces to \sum_i n_i \ln \theta_i plus a constant. The components of the parameter vector \theta satisfy \theta_i \ge 0 and \sum_i \theta_i = 1. Although it is well known that the maximum likelihood estimates are given by \hat\theta_i = n_i/n, this example is instructive because it is explicitly solvable and demonstrates the linear rate of convergence of the proposed MM algorithm.

To minimize the negative log-likelihood f(\theta) = -\sum_i n_i \ln \theta_i subject to the q inequality constraints v_i(\theta) = \theta_i \ge 0 and the equality constraint \sum_i \theta_i = 1, we construct the majorizing function

    g(\theta \mid \theta^{(m)}) = -\sum_i n_i \ln \theta_i + \omega \sum_i \left( \theta_i - \theta_i^{(m)} \ln \theta_i \right)

suggested in Equation (23), omitting irrelevant constants.

We minimize g(\theta \mid \theta^{(m)}) while enforcing \sum_i \theta_i = 1 by introducing a Lagrange multiplier and looking for a stationary point of the Lagrangian

    h(\theta) = g(\theta \mid \theta^{(m)}) + \lambda \left( \sum_i \theta_i - 1 \right).

Setting \partial h(\theta)/\partial \theta_i equal to zero and multiplying by \theta_i gives

    -n_i - \omega \theta_i^{(m)} + \omega \theta_i + \lambda \theta_i = 0.

Summing on i reveals that \lambda = n and yields the update

    \theta_i^{(m+1)} = \frac{n_i + \omega \theta_i^{(m)}}{n + \omega}.

Hence, all iterates have positive components if they start with positive components. The final rearrangement

    \theta_i^{(m+1)} - \frac{n_i}{n} = \frac{\omega}{n + \omega} \left( \theta_i^{(m)} - \frac{n_i}{n} \right)

demonstrates that \theta^{(m)} approaches the estimate \hat\theta at the linear rate \omega/(n + \omega), regardless of whether \hat\theta occurs on the boundary of the parameter space where one or more of its components \hat\theta_i equal zero.
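The explicit update above makes the linear convergence rate \omega/(n + \omega) easy to verify numerically. The short Python sketch below is our own illustration, with arbitrary counts and tuning parameter chosen only for demonstration.

```python
import numpy as np

# Multinomial counts and the adaptive-barrier MM update
# theta_i^(m+1) = (n_i + omega * theta_i^(m)) / (n + omega).
counts = np.array([5.0, 3.0, 2.0, 0.0])      # note one empty category (boundary MLE)
n, omega = counts.sum(), 2.0
mle = counts / n                              # closed-form MLE n_i / n
theta = np.full(len(counts), 1.0 / len(counts))   # strictly interior starting point

prev_err = np.max(np.abs(theta - mle))
for m in range(15):
    theta = (counts + omega * theta) / (n + omega)
    err = np.max(np.abs(theta - mle))
    # The error ratio settles at omega / (n + omega), the linear rate in the text.
    print(f"iteration {m + 1}: error = {err:.3e}, ratio = {err / prev_err:.4f}")
    prev_err = err
```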
8. DISCUSSION

This article is meant to whet readers' appetites, not satiate them. We have omitted much. For instance, there is a great deal known about the convergence properties of MM algorithms that is too mathematically demanding to present here. Fortunately, almost all results from the EM algorithm literature (Wu 1983; Lange 1995a; McLachlan and Krishnan 1997; Lange 1999) carry over without change to MM algorithms. Furthermore, there are several methods for accelerating EM algorithms that are also applicable to accelerating MM algorithms (Heiser 1995; Lange 1995b; Jamshidian and Jennrich 1997; Lange et al. 2000).

Although this survey article necessarily reports much that is already known, there are some new results here. Our MM treatment of constrained optimization in Section 7 is more general than previous versions in the literature (Censor and Zenios 1992; Lange 1994). The application of Equation (20) to the estimation of standard errors in MM algorithms is new, as is the extension of the SEM idea of Meng and Rubin (1991) to the MM case.

There are so many examples of MM algorithms in the literature that we are unable to cite them all. Readers should be on the lookout for these and for known EM algorithms that can be explained more simply as MM algorithms. Even more importantly, we hope this article will stimulate readers to discover new MM algorithms.

[Received June 2003. Revised September 2003.]

REFERENCES

Becker, M. P., Yang, I., and Lange, K. (1997), "EM Algorithms Without Missing Data," Statistical Methods in Medical Research, 6, 38-54.

Bijleveld, C. C. J. H., and de Leeuw, J. (1991), "Fitting Longitudinal Reduced Rank Regression Models by Alternating Least Squares," Psychometrika, 56, 433-447.

Böhning, D., and Lindsay, B. G. (1988), "Monotonicity of Quadratic Approximation Algorithms," Annals of the Institute of Statistical Mathematics, 40, 641-663.

Censor, Y., and Zenios, S. A. (1992), "Proximal Minimization With D-Functions," Journal of Optimization Theory and Applications, 73, 451-464.

de Leeuw, J. (1994), "Block Relaxation Algorithms in Statistics," in Information Systems and Data Analysis, eds. H. H. Bock, W. Lenski, and M. M. Richter, Berlin: Springer-Verlag, pp. 308-325.

de Leeuw, J., and Heiser, W. J. (1977), "Convergence of Correction Matrix Algorithms for Multidimensional Scaling," in Geometric Representations of Relational Data, eds. J. C. Lingoes, E. Roskam, and I. Borg, Ann Arbor, MI: Mathesis Press, pp. 735-752.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 39, 1-38.

De Pierro, A. R. (1995), "A Modified Expectation Maximization Algorithm for Penalized Likelihood Estimation in Emission Tomography," IEEE Transactions on Medical Imaging, 14, 132-137.

Groenen, P. J. F. (1993), The Majorization Approach to Multidimensional Scaling: Some Problems and Extensions, Leiden, The Netherlands: DSWO Press.

Heiser, W. J. (1987), "Correspondence Analysis with Least Absolute Residuals," Computational Statistics & Data Analysis, 5, 337-356.

Heiser, W. J. (1995), "Convergent Computing by Iterative Majorization: Theory and Applications in Multidimensional Data Analysis," in Recent Advances in Descriptive Multivariate Analysis, ed. W. J. Krzanowski, Oxford: Clarendon Press, pp. 157-189.

Hosmer, D. W., and Lemeshow, S. (1989), Applied Logistic Regression, New York: Wiley.

Huber, P. J. (1981), Robust Statistics, New York: Wiley.

Hunter, D. R. (2004), "MM Algorithms for Generalized Bradley-Terry Models," The Annals of Statistics, 32, 386-408.

Hunter, D. R., and Lange, K. (2000a), Rejoinder to discussion of "Optimization Transfer Using Surrogate Objective Functions," Journal of Computational and Graphical Statistics, 9, 52-59.

Hunter, D. R., and Lange, K. (2000b), "Quantile Regression via an MM Algorithm," Journal of Computational and Graphical Statistics, 9, 60-77.

Hunter, D. R., and Lange, K. (2002), "Computing Estimates in the Proportional Odds Model," Annals of the Institute of Statistical Mathematics, 54, 155-168.

Hunter, D. R., and Li, R. (2002), "A Connection Between Variable Selection and EM-Type Algorithms," Technical Report 0201, Dept. of Statistics, Pennsylvania State University.

Jamshidian, M., and Jennrich, R. I. (1997), "Acceleration of the EM Algorithm by Using Quasi-Newton Methods," Journal of the Royal Statistical Society, Ser. B, 59, 569-587.

Kiers, H. A. L., and Ten Berge, J. M. F. (1992), "Minimization of a Class of Matrix Trace Functions by Means of Refined Majorization," Psychometrika, 57, 371-382.

Koenker, R., and Bassett, G. (1978), "Regression Quantiles," Econometrica, 46, 33-50.

Lange, K. (1994), "An Adaptive Barrier Method for Convex Programming," Methods and Applications of Analysis, 1, 392-402.

Lange, K. (1995a), "A Gradient Algorithm Locally Equivalent to the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 57, 425-437.

Lange, K. (1995b), "A Quasi-Newton Acceleration of the EM Algorithm," Statistica Sinica, 5, 1-18.

Lange, K. (1999), Numerical Analysis for Statisticians, New York: Springer-Verlag.

Lange, K., and Fessler, J. A. (1995), "Globally Convergent Algorithms for Maximum A Posteriori Transmission Tomography," IEEE Transactions on Image Processing, 4, 1430-1438.

Lange, K., Hunter, D. R., and Yang, I. (2000), "Optimization Transfer Using Surrogate Objective Functions" (with discussion), Journal of Computational and Graphical Statistics, 9, 1-20.

Lange, K., and Sinsheimer, J. S. (1993), "Normal/Independent Distributions and Their Applications in Robust Regression," Journal of Computational and Graphical Statistics, 2, 175-198.

Luenberger, D. G. (1984), Linear and Nonlinear Programming (2nd ed.), Reading, MA: Addison-Wesley.

Maher, M. J. (1982), "Modelling Association Football Scores," Statistica Neerlandica, 36, 109-118.

Marshall, A. W., and Olkin, I. (1979), Inequalities: Theory of Majorization and Its Applications, San Diego: Academic.

McLachlan, G. J., and Krishnan, T. (1997), The EM Algorithm and Extensions, New York: Wiley.

Meng, X.-L., and Rubin, D. B. (1991), "Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm," Journal of the American Statistical Association, 86, 899-909.

Meng, X.-L., and Rubin, D. B. (1993), "Maximum Likelihood Estimation via the ECM Algorithm: A General Framework," Biometrika, 80, 267-278.

Oakes, D. (1999), "Direct Calculation of the Information Matrix via the EM Algorithm," Journal of the Royal Statistical Society, Ser. B, 61, 479-482.

Ortega, J. M., and Rheinboldt, W. C. (1970), Iterative Solutions of Nonlinear Equations in Several Variables, New York: Academic, pp. 253-255.

Sabatti, C., and Lange, K. (2002), "Genomewide Motif Identification Using a Dictionary Model," Proceedings of the IEEE, 90, 1803-1810.

Schlossmacher, E. J. (1973), "An Iterative Technique for Absolute Deviations Curve Fitting," Journal of the American Statistical Association, 68, 857-859.

Wu, C. F. J. (1983), "On the Convergence Properties of the EM Algorithm," The Annals of Statistics, 11, 95-103.