Oral Text


Exploring the EM Algorithm: Goals and Key Insights
1 Introduction
— Context: Dealing with incomplete data, such as missing values and latent variables, presents significant computational challenges for maximum likelihood estimation (MLE).
— EM Algorithm: The expectation-maximization (EM) algorithm provides a powerful approach to such problems, with a rich literature supporting its application.
— Challenge: While MLEs have desirable statistical properties, the EM algorithm often converges to local optima, raising concerns about its practical utility.
— Paper's Objective: This paper aims to bridge the gap between statistical and computational guarantees when applying the EM algorithm.
— Insight: Proper initialization can lead the EM algorithm to converge to statistically useful estimates, even in non-convex settings.
— Approach: Through a series of theoretical and empirical analyses, the paper characterizes the conditions under which the EM algorithm and its variants, such as gradient EM, converge effectively.
— Structure: The paper provides a comprehensive introduction, detailed model examples, and general convergence results, culminating in practical implications and supporting simulations.

2 EM algorithm and its relatives


3 Background on the EM Algorithm
The EM (Expectation-Maximization) algorithm is a foundational tool
for maximum likelihood estimation in the presence of latent variables. This
section provides an overview of the EM algorithm’s framework and its appli-
cation to various statistical models.

3.1 The EM Algorithm and Variants


In the context of the EM algorithm, the observed data log likelihood for a given parameter θ′ is bounded from below by a functional Q(θ′|θ), as shown in the following inequality:

log g_θ′(y) ≥ Q(θ′|θ) − ∫_Z k_θ(z|y) log k_θ(z|y) dz,   (1)

where Q(θ′|θ) is defined as

Q(θ′|θ) = ∫_Z k_θ(z|y) log f_θ′(y, z) dz.   (2)

This Q function is the expected complete-data log likelihood: the log likelihood of the observed and latent variables together, averaged under the current estimate k_θ(z|y) of the latent variable distribution. The EM algorithm iteratively maximizes Q(θ′|θ), which serves as a surrogate for the actual log likelihood, making the maximization step tractable when direct optimization of the observed-data log likelihood is not.

3.1.1 Standard EM Updates


The EM (Expectation-Maximization) algorithm is an iterative procedure that alternates between an expectation step (E-step) and a maximization step (M-step). The E-step computes the expected complete-data log-likelihood, averaging over the conditional distribution of the latent variables given the current parameter estimate. This expectation is formalized through the Q function. The M-step then finds the parameter that maximizes this Q function, thus updating the parameter estimate
to increase the likelihood of the observed data. This iterative process is for-
malized by the mapping M , which takes the current parameter estimate and
returns the maximizer of the Q function over the parameter space Θ. The ite-
rations continue until convergence to a local maximum, with each step either
increasing the log-likelihood of the observed data or leaving it unchanged.
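
As a concrete illustration of this E-step/M-step alternation, here is a minimal schematic sketch in Python for a model with a discrete latent variable z ∈ {0, ..., K−1}. The callables posterior and log_complete are hypothetical placeholders that a specific model would have to supply; this is a sketch of the scheme described above, not an implementation taken from the paper.

```python
# Schematic EM iteration: E-step weights from k_theta(z | y), M-step maximizes Q.
# `posterior(y, theta)` should return an (n, K) array with entries k_theta(z = k | y_i);
# `log_complete(y, theta)` should return an (n, K) array with entries log f_theta(y_i, z = k).
# Both callables are hypothetical placeholders for a concrete model.
import numpy as np
from scipy.optimize import minimize


def q_hat(theta_new, theta_old, y, posterior, log_complete):
    """Sample surrogate Q_n(theta_new | theta_old): the complete-data log likelihood,
    averaged over observations and weighted by the posterior computed at theta_old."""
    w = posterior(y, theta_old)                      # E-step weights, shape (n, K)
    return np.mean(np.sum(w * log_complete(y, theta_new), axis=1))


def em_update(theta_old, y, posterior, log_complete):
    """M-step: M_n(theta_old) = argmax_theta Q_n(theta | theta_old)."""
    obj = lambda th: -q_hat(th, theta_old, y, posterior, log_complete)
    return minimize(obj, x0=np.atleast_1d(theta_old)).x


def em(theta0, y, posterior, log_complete, max_iter=200, tol=1e-8):
    """Iterate the EM map until the updates stall; each pass never decreases the likelihood."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        theta_new = em_update(theta, y, posterior, log_complete)
        if np.linalg.norm(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta
```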

3.1.2 Generalized EM Updates


In the Generalized EM framework, we relax the requirement of finding the
exact maximum in the M-step. Instead of aiming for the precise optimum, we
only need to choose a new parameter estimate that improves the Q function.
Formally, this requirement is expressed as follows:

Q(θ^(t+1) | θ^t) ≥ Q(θ^t | θ^t).

This flexibility in selecting θ^(t+1) allows for a range of EM algorithm variants, each with its characteristics and benefits, thus making Generalized EM not
a single method but rather a family of adaptive approaches with a shared
principle of iterative improvement.

3.1.3 Gradient EM Updates
We now consider Gradient EM updates, a natural extension of the Ge-
neralized EM updates suitable when the Q function is differentiable. The
updates are performed according to a gradient ascent step: θ^(t+1) is obtained by adding to θ^t the product of a fixed step size α and the gradient of Q(·|θ^t) evaluated at θ^t. This process is embodied by the mapping G, which simplifies the
iteration expression. We maintain the setting for unconstrained problems, al-
though the method can be extended to include constraints. A careful choice
of α ensures that each update increases the Q function, consistent with the
continuous improvement strategy inherent in Generalized EM.
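
A minimal sketch of this update rule, assuming a callable grad_q(θ′, θ) that returns ∇Q(θ′|θ) for the model at hand (a hypothetical placeholder), looks as follows:

```python
# Gradient EM update G(theta) = theta + alpha * grad Q(theta | theta), as described above.
# `grad_q(theta_prime, theta)` is a hypothetical placeholder returning the gradient of
# Q(. | theta) evaluated at theta_prime; alpha is the fixed step size.
import numpy as np


def gradient_em_step(theta, grad_q, alpha):
    theta = np.asarray(theta, dtype=float)
    return theta + alpha * np.asarray(grad_q(theta, theta))


def gradient_em(theta0, grad_q, alpha, n_iter=500):
    """Run a fixed number of gradient EM updates with constant step size alpha."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        theta = gradient_em_step(theta, grad_q, alpha)
    return theta
```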

3.1.4 Population versus Sample Updates


In the EM algorithm, there’s a distinction between population-level up-
dates, assuming an infinite sample size, and sample-based updates with a
finite number of observations. The population function Q integrates across
the entire observation space, while the sample function Q̂n uses an empirical
average of the observed data. These perspectives lead to distinct update ope-
rators, Mn and Gn , for the sample-based EM and gradient EM approaches.
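
In symbols, and consistent with the notation of this section (a reconstruction from the description above, not a quotation of the paper), the two versions can be written as:

```latex
% Sample and population surrogate functions (reconstruction from the description above).
\widehat{Q}_n(\theta' \mid \theta)
  = \frac{1}{n}\sum_{i=1}^{n} \int_{\mathcal{Z}} k_{\theta}(z \mid y_i)\,\log f_{\theta'}(y_i, z)\, dz,
\qquad
Q(\theta' \mid \theta)
  = \mathbb{E}_{Y}\!\left[ \int_{\mathcal{Z}} k_{\theta}(z \mid Y)\,\log f_{\theta'}(Y, z)\, dz \right].
```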

3.2 Illustrative Examples


3.2.1 Gaussian Mixture Models
The balanced isotropic Gaussian mixture model is a classic example where
EM is often applied. The density of this model is an average of two Gaussian
components with opposite means and known variance. Our goal is to estimate
the unknown mean vector from n i.i.d. samples. The function Q̂, constructed
from sampled data, is critical for EM updates and integrates the probability
of each observation belonging to either component of the mixture. This pro-
bability is calculated by the weighting function wθ (y), which is a function of
the distance between the observation and the current estimated mean.
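
For this symmetric two-component model, the weighting function has a simple closed form, obtained here from the ratio of the two Gaussian densities (σ is assumed known); a short sketch:

```python
# Weighting function w_theta(y) for the symmetric two-component Gaussian mixture:
# the posterior probability, under the current estimate theta, that y was drawn
# from the component with mean +theta.  sigma is assumed known.
import numpy as np
from scipy.special import expit  # numerically stable sigmoid


def w_theta(Y, theta, sigma):
    """Posterior weight of the +theta component for each row of Y (shape (n, d))."""
    # exp(-||y - theta||^2 / (2 sigma^2)) /
    #   (exp(-||y - theta||^2 / (2 sigma^2)) + exp(-||y + theta||^2 / (2 sigma^2)))
    # simplifies to sigmoid(2 <y, theta> / sigma^2).
    return expit(2.0 * (Y @ theta) / sigma**2)
```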

EM Updates For Gaussian mixture models, the sample-based EM operator, denoted Mn, can be expressed explicitly: it recombines the observations using the weights, amounting to twice their weighted mean minus their simple mean. This expression is driven by the weighting function wθ, which reflects the probability that an observation belongs to a specific component of the mixture. For the population version, the M operator uses the theoretical expectation rather than the empirical mean, thus reflecting the expectation under the full mixture distribution.
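
A sketch of the resulting sample operator, using the same weighting function as in the previous snippet; the closed form below is our own derivation from maximizing Q̂_n(·|θ), with σ assumed known:

```python
# Sample EM operator M_n(theta) for the symmetric Gaussian mixture:
# M_n(theta) = (1/n) sum_i (2 w_theta(y_i) - 1) y_i.
import numpy as np
from scipy.special import expit


def em_operator_gmm(Y, theta, sigma):
    """One EM update for Y of shape (n, d), current estimate theta, known sigma."""
    w = expit(2.0 * (Y @ theta) / sigma**2)   # posterior weight of the +theta component
    return np.mean((2.0 * w - 1.0)[:, None] * Y, axis=0)

# The population operator M(theta) replaces the empirical average by an expectation
# over the true mixture; numerically it can be approximated by applying the same
# formula to a very large sample drawn from that mixture.
```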

Gradient EM Updates On the other hand, the gradient EM operators based on the sample and on the population, with a step size α > 0, are given below.
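
Up to the normalization of Q̂_n adopted above, and writing Mn for the EM operator just described, these operators take the following form (a reconstruction from the definitions rather than a quoted display):

```latex
% Gradient EM operators for the symmetric Gaussian mixture, step size alpha > 0
% (reconstruction from the definitions above).
\begin{align*}
G_n(\theta) &= \theta + \alpha\Bigl( \tfrac{1}{n}\sum_{i=1}^{n} \bigl(2\,w_{\theta}(y_i) - 1\bigr)\, y_i - \theta \Bigr)
             = (1-\alpha)\,\theta + \alpha\, M_n(\theta), \\
G(\theta)   &= \theta + \alpha\,\Bigl( \mathbb{E}\bigl[\bigl(2\,w_{\theta}(Y) - 1\bigr)\,Y\bigr] - \theta \Bigr).
\end{align*}
```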

3.2.2 Mixture of Regressions


The mixture of regressions model we address enriches classic linear re-
gression by introducing bimodality into regression vectors : each observation
can emanate from either the vector θ∗ or its opposite −θ∗ . Concretely, this
means that for each observation pair (yi , xi ), with xi following a centered
normal distribution, the outcome yi is the dot product of xi with either θ∗
or −θ∗ , added to independent Gaussian noise. This duality is determined by
hidden variables zi , which mark each observation’s affiliation with one of the
two regression processes.
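
As a small illustration of this generative mechanism (the parameter values below are arbitrary choices for the example, not values from the paper):

```python
# Simulate data from the mixture-of-regressions model described above: each response
# comes from +theta_star or -theta_star with equal probability, plus Gaussian noise.
import numpy as np


def simulate_mixture_of_regressions(n, theta_star, sigma, rng=None):
    rng = np.random.default_rng(rng)
    d = theta_star.shape[0]
    X = rng.standard_normal((n, d))          # covariates x_i ~ N(0, I_d)
    z = rng.choice([-1.0, 1.0], size=n)      # hidden labels z_i
    y = z * (X @ theta_star) + sigma * rng.standard_normal(n)
    return X, y, z


X, y, z = simulate_mixture_of_regressions(n=1000, theta_star=np.array([1.0, -2.0]), sigma=0.5)
```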

EM Updates We address the EM update in the mixture of regressions, where each observation results either from the classic linear regression or its
opposite. The key here is the weighting function wθ (x, y), which assesses the
probability that an observation comes from the positive regression versus the
negative. This function is used to weight our observations during the EM
update. For the sample, the EM update maximizes a weighted cost function,
leading to a closed-form solution that is a linear function of the weighted ob-
servations. Similarly, the population update uses the expectation of weighted
observations over the joint distribution of (Y, X). These updates allow us to
iteratively estimate the unknown regression vector in this complex model.
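
A sketch of this update, with the weight simplified to a sigmoid and the weighted least-squares solution written in closed form; the closed form is our own derivation from the weighted objective described above, with σ assumed known:

```python
# Sample EM update for the mixture of regressions.
import numpy as np
from scipy.special import expit


def weights_mor(X, y, theta, sigma):
    """w_theta(x, y): posterior probability that (x, y) comes from the +theta regression.
    Equals sigmoid(2 * y * <x, theta> / sigma^2) after simplifying the Gaussian ratio."""
    return expit(2.0 * y * (X @ theta) / sigma**2)


def em_operator_mor(X, y, theta, sigma):
    """M_n(theta) = (sum_i x_i x_i^T)^{-1} sum_i (2 w_i - 1) y_i x_i."""
    w = weights_mor(X, y, theta, sigma)
    rhs = X.T @ ((2.0 * w - 1.0) * y)        # sum_i (2 w_i - 1) y_i x_i
    gram = X.T @ X                           # sum_i x_i x_i^T
    return np.linalg.solve(gram, rhs)
```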

Gradient EM Updates In the gradient EM updates for the mixture of regressions model, we adjust our parameter estimates iteratively. The sample-
based operator Gn (θ) applies a step size α to the weighted combination of
the product of observations and their residuals. The population-based ope-
rator G(θ) uses the expected value of the weighted observations to adjust θ.
Both operators aim to steer the parameter estimates in the direction of the
gradient of our objective function, ensuring progressive improvement towards
the optimal solution.
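
In symbols, and using the same weighting function as above, these operators can be written as follows (a reconstruction from the weighted objective, with step size α > 0):

```latex
% Gradient EM operators for the mixture of regressions (reconstruction from the
% weighted objective underlying the EM update above).
G_n(\theta) = \theta + \frac{\alpha}{n}\sum_{i=1}^{n}
  \Bigl[ \bigl(2\,w_{\theta}(x_i, y_i) - 1\bigr)\, y_i\, x_i - x_i x_i^{\top}\theta \Bigr],
\qquad
G(\theta) = \theta + \alpha\,\mathbb{E}\Bigl[ \bigl(2\,w_{\theta}(X, Y) - 1\bigr)\, Y X - X X^{\top}\theta \Bigr].
```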

3.2.3 Linear Regression with Missing Covariates


1. Linear Regression with Missing Covariates

In the realm of linear regression, we often encounter datasets with incomplete information. One such scenario is when our covariates are missing completely at random. Instead of having access to the entire covariate vector xi, we observe a version x̃i where each component may be missing with a certain probability ρ. This probability reflects the likelihood of each covariate being unobserved. Our goal is to adapt the linear regression model to effectively handle and account for this randomness in data observation, maintaining the integrity of our parameter estimation despite missing information.

2. E-step Notation for Linear Regression with Missing Covariates

In the context of handling missing data in linear regression, we first define the observed and missing components of our data and parameters. The E-
step calculates the expected values of the missing data given what we observe.
Specifically, we compute the conditional mean of the missing covariates, µmis ,
using the current estimate of θ and the observed data. This step relies on
the properties of joint Gaussian distributions and allows us to express the
conditional mean and covariance in a closed form. These expectations are
crucial for the subsequent M-step.
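
A sketch of this computation for a single observation, using standard Gaussian conditioning under the assumptions stated above (x ~ N(0, I_d), known noise variance σ²); the closed forms are our own working rather than a quoted display:

```python
# E-step for linear regression with covariates missing completely at random.
import numpy as np


def e_step_missing(x_tilde, y, theta, sigma):
    """Conditional mean and covariance of the missing coordinates of one observation.

    x_tilde : length-d array with np.nan marking missing coordinates.
    Returns (mu_mis, Sigma_mis) for the missing block, given (x_obs, y) and theta.
    """
    mis = np.isnan(x_tilde)
    obs = ~mis
    theta_mis, theta_obs = theta[mis], theta[obs]
    denom = sigma**2 + theta_mis @ theta_mis          # Var(y | x_obs) under theta
    resid = y - x_tilde[obs] @ theta_obs              # y - <x_obs, theta_obs>
    mu_mis = theta_mis * resid / denom                # E[x_mis | x_obs, y]
    Sigma_mis = np.eye(mis.sum()) - np.outer(theta_mis, theta_mis) / denom
    return mu_mis, Sigma_mis
```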

3. EM Updates for Linear Regression with Missing Covariates

For the M-step of the EM algorithm, we aim to maximize the function Q, which is based on the conditional mean and covariance we computed during
the E-step. The EM operator for the sample, Mn (θ), has an explicit solu-
tion that combines these conditional statistics. The equivalent population
EM operator involves expectations over the joint distribution of the data,
providing a theoretical counterpart to the sample-based update.
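
Writing E_θ[· | obs_i] for the conditional expectations computed in the E-step for observation i, the sample operator can be written as follows (a reconstruction consistent with the description above):

```latex
% Sample EM operator for linear regression with missing covariates.
M_n(\theta) =
\Bigl( \tfrac{1}{n}\sum_{i=1}^{n} \mathbb{E}_{\theta}\bigl[x_i x_i^{\top} \,\big|\, \mathrm{obs}_i\bigr] \Bigr)^{-1}
\Bigl( \tfrac{1}{n}\sum_{i=1}^{n} y_i\, \mathbb{E}_{\theta}\bigl[x_i \,\big|\, \mathrm{obs}_i\bigr] \Bigr).
```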

4. Gradient EM Updates for Missing Covariates

In the gradient EM approach, we update the parameter θ by taking a step proportional to the gradient of the expected complete-data log-likelihood.
The sample-based gradient EM operator, Gn (θ), applies the step size α to
adjust θ in the direction suggested by the data. The population-based version
uses expectations to adjust θ in a similar fashion, ensuring our updates are
informed by the overall data distribution.
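
In symbols, and with the same conditional expectations as above, the two operators take roughly the following form (again a reconstruction, with step size α > 0):

```latex
% Gradient EM operators for linear regression with missing covariates.
G_n(\theta) = \theta + \frac{\alpha}{n}\sum_{i=1}^{n}
  \Bigl( y_i\,\mathbb{E}_{\theta}\bigl[x_i \,\big|\, \mathrm{obs}_i\bigr]
        - \mathbb{E}_{\theta}\bigl[x_i x_i^{\top} \,\big|\, \mathrm{obs}_i\bigr]\,\theta \Bigr),
\qquad
G(\theta) = \theta + \alpha\,\mathbb{E}\Bigl[ Y\,\mathbb{E}_{\theta}[X \mid \mathrm{obs}]
        - \mathbb{E}_{\theta}\bigl[X X^{\top} \mid \mathrm{obs}\bigr]\,\theta \Bigr].
```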

5. EM Update for Linear Regression with Missing Covariates

In cases of linear regression with missing covariates, the EM algorithm seeks to maximize a function Q that measures the fit of our model given the observed data. By iterating between calculating expected values for the missing data and updating our parameter estimates, we refine our model. The EM operator for the sample provides us with an exact formula to update the
parameter θ by balancing the fit against the observed responses.

6. Gradient EM Update for Missing Covariates

Complementing the EM algorithm, the gradient EM approach provides a method to iteratively update our parameter estimates using a step size α.
By adjusting θ in the direction indicated by both the observed data and the
current parameter estimates, the gradient EM operator continually improves
the parameter estimates, ensuring each step is informed by a combination of
the data and the model’s predictions.

4 General Convergence Results


In this section, we present an overview of the convergence results for the
EM and gradient EM algorithms. We’ll discuss the key conditions that allow
the population versions of these algorithms to reach θ∗, the point that maximizes the population likelihood. Additionally, we’ll explore how the sample-based versions aim for a region close to θ∗, offering a probabilistic take on convergence which is crucial for practical applications.

4.1 Analysis of the EM Algorithm


4.1.1 Guarantees for Population-Level EM
In our analysis of the EM algorithm at the population level, we concen-
trate on the maximizer of the population likelihood, denoted as θ∗ . This point
is characterized by the self-consistency condition, which is a classical property
ensuring that θ∗ maximizes the expected complete-data log-likelihood. To es-
tablish the convergence properties, we assume that the function Q at θ∗ is
strongly concave, a condition that guarantees a unique and stable solution.
The EM update and the fixed point θ∗ are both described by gradient in-
equalities, which are standard in optimization theory. By utilizing the strong
concavity of Q and the self-consistency at θ∗, we can show that each EM update moves the iterate strictly closer to θ∗, establishing convergence of the algorithm to the population likelihood maximizer.
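
Schematically, and anticipating the stability parameter γ introduced in Section 4.3 (with λ the strong-concavity parameter), the resulting guarantee has the following shape; this is a sketch of the bound's form, not a verbatim statement:

```latex
% Contraction of the population EM operator (assuming gamma < lambda):
\|M(\theta) - \theta^*\|_2 \;\le\; \frac{\gamma}{\lambda}\,\|\theta - \theta^*\|_2
\qquad \text{for all } \theta \in B_2(r;\theta^*).
```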

4.1.2 Key Conditions for Population-Level EM

Here we spell out the conditions underlying the population-level guarantees of the EM algorithm, starting with the self-consistency of the maximizer θ∗. By assumption, θ∗ maximizes the population likelihood, leading to the self-consistency condition: θ∗ is the argument at which Q(·|θ∗) is maximized.
Furthermore, we delve into the λ-strong concavity of the function q, a property critical to our theoretical framework, especially around θ∗.
We also address the fundamental role of gradient mappings in characterizing the fixed point θ∗, as indicated by the first-order optimality condition. It is through this lens that we understand the behavior of θ∗ within the parameter space Ω.
Finally, we consider the characterization of the EM update: for any θ, the update M(θ) is the maximizer of Q(·|θ) over Ω.
Together, these conditions form the cornerstone of our analysis, ensuring that the EM algorithm’s population-level behavior is not only predictable but also guaranteed to converge to the desired fixed point.
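
Written out in the notation of this section (a transcription of the four conditions just described, with q(θ) denoting Q(θ|θ∗)):

```latex
% The four conditions just described (reconstruction from the prose above).
\begin{align*}
  \theta^* &= \arg\max_{\theta' \in \Omega} Q(\theta' \mid \theta^*)          && \text{(self-consistency)} \\
  q &\;\text{is $\lambda$-strongly concave in a neighborhood of } \theta^*    && \text{(strong concavity)} \\
  \nabla Q(\theta^* \mid \theta^*) &= 0                                       && \text{(first-order optimality)} \\
  M(\theta) &= \arg\max_{\theta' \in \Omega} Q(\theta' \mid \theta)           && \text{(EM update characterization)}
\end{align*}
```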

4.2 Analysis of the Gradient EM Algorithm


4.2.1 Guarantees for Population-Level Gradient EM
The population-level analysis of the gradient EM algorithm parallels that of standard EM. The maximizer θ∗ again satisfies the self-consistency and first-order optimality conditions, and the function q is assumed to be λ-strongly concave in a neighborhood of θ∗.

CHANGE SLIDE

In addition to strong concavity, the analysis relies on a smoothness condition on Q and on the gradient stability condition introduced in Section 4.4. Together, and for a suitably chosen step size α, these conditions ensure that the gradient operator G(θ) = θ + α∇Q(θ|θ) is contractive in a ball around θ∗, so that the population-level gradient EM iterates converge geometrically to the maximizer θ∗.

4.3 Regularity Conditions in EM Convergence


In this section, we delve into the conditions under which the Expectation-Maximization (EM) algorithm converges to the true parameters. A key aspect of our analysis is the First-Order Stability (FOS) condition, which restricts the behavior of the gradient of the Q-function within a specific region around the true parameter, denoted as B2(r; θ∗).
— First-Order Stability: This condition requires that, within the ball B2(r; θ∗), the gradients of Q(·|θ∗) and Q(·|θ), both evaluated at the EM update M(θ), differ by at most a factor γ times the distance from θ to the true parameter θ∗ (written out in the display after this list). This gradient difference is a crucial component in quantifying the EM algorithm’s ability to improve the parameter estimate in each iteration.
— Gaussian Mixture Model Example: In the context of Gaussian
mixture models, we express the FOS condition in terms of the expec-
ted difference of the weights assigned to each Gaussian component,
weighted by the observation Y . The smoothness of the weight func-
tion wθ (y) with respect to θ suggests that the FOS condition is likely
to hold near the true parameter θ∗ , underpinning the local convergence
of the EM algorithm.
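
Written out in symbols (a transcription of the condition described in the first bullet, with γ and the radius r as in the text):

```latex
% First-order stability FOS(gamma) over the ball B_2(r; theta^*).
\bigl\| \nabla Q\bigl(M(\theta) \mid \theta^*\bigr) - \nabla Q\bigl(M(\theta) \mid \theta\bigr) \bigr\|_2
\;\le\; \gamma\,\|\theta - \theta^*\|_2
\qquad \text{for all } \theta \in B_2(r;\theta^*).
```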

4.3.1 Guarantees for Sample-Based EM


In this part of the presentation, we’re focusing on the sample-based EM
algorithm’s convergence guarantees. We’re looking at two specific variants:
the standard EM, where we apply the same operation repeatedly, and the
sample-splitting EM, which uses different data subsets in each iteration. Our
main goal is to understand how these methods behave as we change the
sample size and the tolerance level.
The key element in our analysis is the convergence rate. For any parameter within a specified range, we’ve defined ϵ_M(n, δ); this rate shows us how close the sample-based EM’s output will be to the true parameter value, with high probability. For more robustness, we also have a uniform rate, ϵ_M^unif(n, δ), that covers the entire parameter space.
Lastly, we present a theorem giving us the conditions for convergence: if
our EM operator is contractive within the parameter space and we start with
an initial estimate within that space, we can be confident that our algorithm
will converge according to the rates we’ve established.
In summary, these results provide valuable insights into the convergence
properties of sample-based EM algorithms, shedding light on their practical
utility.
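
In rough form, writing κ < 1 for the contraction parameter of the EM operator over the relevant ball, the guarantee for the variant that reuses all n samples has the following shape (a sketch of the bound, not a verbatim statement of the theorem):

```latex
% Sample-based EM: with probability at least 1 - delta, for every iteration t,
\|\theta^{t} - \theta^*\|_2 \;\le\; \kappa^{t}\,\|\theta^{0} - \theta^*\|_2
\;+\; \frac{1}{1-\kappa}\,\epsilon^{\mathrm{unif}}_{M}(n,\delta).
```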

CHANGE SLIDE

The Gradient EM algorithm refines parameter estimates iteratively, leve-
raging the gradient of the Q-function evaluated at the current estimate. The
update rule, defined by G(θ) := θ + α∇Q(θ|θ), incorporates a step size para-
meter α that controls the extent of each update. The choice of α affects the
convergence rate and stability of the algorithm.

4.4 Gradient Stability (GS) Definition


This condition measures the closeness, at any point θ near the optimum, of ∇Q(θ|θ) to ∇q(θ), the gradient of the surrogate based at the optimal point θ∗, requiring their difference to be bounded by a factor γ times the distance from θ to θ∗. It is crucial for ensuring that the gradient EM algorithm remains stable and converges properly within a specified neighborhood of the optimum.
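
Written out (consistent with the graphical description in the next subsection, with q(θ) denoting Q(θ|θ∗)), the condition reads:

```latex
% Gradient stability GS(gamma) over the ball B_2(r; theta^*).
\bigl\| \nabla q(\theta) - \nabla Q(\theta \mid \theta) \bigr\|_2 \;\le\; \gamma\,\|\theta - \theta^*\|_2
\qquad \text{for all } \theta \in B_2(r;\theta^*).
```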

4.5 Graphical Illustration of Gradient Stability
The figure illustrates the GS condition: for a point θ1 close to the population optimum θ∗, the gradients ∇Q(θ1|θ1) and ∇q(θ1) must be close, whereas for a point θ2 distant from θ∗, the gradients ∇Q(θ2|θ2) and ∇q(θ2) can be quite different.

4.6 Guarantees for Sample-based Gradient EM


In the domain of sample-based gradient EM algorithms, we investigate the
performance of two distinct variants. The first uses all available samples to compute the update operator Gn and applies it at every iteration. The second
variant employs a sample-splitting approach, which may provide advantages
in certain contexts.
The core analysis involves quantifying the deviations of the sample operator from the ideal, population-level operator. This is captured by ϵ_G(n, δ), the smallest scalar bounding the deviation, with probability at least 1 − δ, for any fixed vector within a specified radius of the optimal point.
Furthermore, we consider the uniform deviation across the entire parameter space, denoted by ϵ_G^unif(n, δ). This quantity serves a similar purpose but is defined through a supremum over the parameter space, offering a global guarantee
of the algorithm’s performance. Both measures are pivotal in assessing the
reliability and convergence of the gradient EM algorithm in sample-based
settings.
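
In symbols, the uniform deviation just described can be written as follows (a transcription from the prose above):

```latex
% Uniform deviation of the sample gradient EM operator from its population version:
\mathbb{P}\Bigl[ \sup_{\theta \in B_2(r;\theta^*)} \bigl\| G_n(\theta) - G(\theta) \bigr\|_2
  \;\le\; \epsilon^{\mathrm{unif}}_{G}(n,\delta) \Bigr] \;\ge\; 1-\delta.
```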

4.7 Consequences for specific models


We provided a number of general theorems on the behavior of the EM algorithm as well as the gradient EM algorithm, at both the population and sample levels. Now we will develop some concrete consequences of this general theory for the two specific model classes previously introduced.

4.8 Analysis of Gaussian Mixture Models


In our exploration of Gaussian mixture models, we focus on a scenario
with two components equally weighted and centered symmetrically at θ∗ and
−θ∗, each with covariance σ²I. A pivotal aspect of our analysis is the signal-to-noise ratio (SNR), which we define as the ratio of the norm of θ∗ to the standard deviation σ. We impose a lower bound on this ratio, expressed by the inequality ‖θ∗‖₂/σ > η, where η is a constant that must be sufficiently large to ensure accurate estimation.

The significance of a high SNR is rooted in the literature, where it has been shown that low SNR can lead to a high variance in the maximum li-
kelihood (ML) solutions and slow convergence of the EM algorithm. This
substantiates the necessity of our assumption: a high SNR is not just a
theoretical convenience but a practical requirement for the reliability of the
EM algorithm in Gaussian mixtures.

In our framework, we demonstrate that the population EM operator exhibits contractive properties within a certain radius around θ∗. This contracti-
vity ensures that the EM updates will converge towards the true parameters,
with a rate of convergence that is exponentially faster as η² increases. In es-
sence, the larger the SNR (captured by η), the more rapid the convergence,
reinforcing the idea that high SNR is conducive to the successful application
of the EM algorithm in Gaussian mixture models.
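
Schematically, this contraction statement has the following shape, where c is a universal constant and the radius r of the ball is proportional to ‖θ∗‖₂ (a sketch of the corollary's form rather than its exact constants):

```latex
% Population EM contraction for the symmetric Gaussian mixture, high-SNR regime:
\|M(\theta) - \theta^*\|_2 \;\le\; \kappa\,\|\theta - \theta^*\|_2
\qquad \text{for all } \theta \in B_2(r;\theta^*), \quad \text{with } \kappa \le e^{-c\,\eta^2}.
```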

5 Consequences for Specific Models
5.1 Gaussian Mixture Models
5.2 Corollary 1
5.3 Corollary 2
5.4 Comparison of EM and Gradient EM for Gaussian Mixtures
— EM Algorithm for Gaussian Mixtures (Plot a):
The optimization error (Opt. error) depicted in blue sharply declines, indicating rapid convergence of the algorithm towards a minimum.
The statistical error (Stat. error) in red decreases to a level and then stabilizes, signaling that the estimated parameters have reached the proximity of the true model parameters.
— Gradient EM Algorithm for Gaussian Mixtures (Plot b):
The optimization error (Opt. error) decreases over a more extensive
range of iterations, suggesting a slower convergence rate compared to
the standard EM algorithm.
The statistical error (Stat. error) exhibits a similar trend of decrease
followed by a plateau, but requires more iterations to stabilize than
the standard EM.
In conjunction with Corollaries 1 and 2, the theory also predicts that the convergence rate should increase as the signal-to-noise ratio ‖θ∗‖₂/σ is increased.
Both algorithms ultimately converge to the true parameters of the Gaus-
sian mixture model, with the standard EM algorithm demonstrating a faster
convergence in terms of the number of iterations.

5.5 Mixtures of Regressions


The plot in the figure demonstrates the impact of the signal-to-noise ratio on
the convergence behavior of the EM algorithm when applied to Gaussian
mixture models. As we can observe, higher SNRs (shown in red) lead to a
steeper descent in the log-optimization error, indicating a more rapid conver-
gence towards the model’s true parameters. Conversely, lower SNRs (illus-
trated in green and blue) exhibit a more gradual convergence. This trend
is aligned with theoretical expectations, where the precision of the parame-
ter estimation in Gaussian mixture models is closely tied to the SNR. The
graph is instrumental in visualizing the correlation between SNR and convergence efficiency, emphasizing the importance of SNR in the EM algorithm’s
performance.

Conclusion and Perspectives


In the article we discussed, the authors provide a detailed examination of
the EM and gradient EM algorithms, delving into their behavior at both the
population and sample levels. While the focus is on these specific algorithms,
the approaches presented could potentially be extended to analyze other
algorithms addressing non-convex problems.
Future research could expand on these findings in several ways. The cur-
rent analysis assumes a correctly specified model and i.i.d. sampling, which
might not always align with real-world data. Understanding how the EM al-
gorithm performs under model mis-specification is a compelling direction for
future study, given the robust properties of maximum likelihood estimation.
The paper also underscores the importance of suitable initialization for
the successful application of EM and gradient EM algorithms. For the models
considered, such as Gaussian mixtures and mixtures of regressions, simple
pilot estimators can be utilized to achieve these initial conditions. It would
be interesting to see further exploration of EM’s performance on a broader
range of latent variable models, particularly in light of recent advances in
moment-based estimators and initializations.
