Training Conditional Random Fields With Natural Gradient Descent
Yuan Cao
Center for Language & Speech Processing
The Johns Hopkins University
Baltimore, MD, USA, 21218
yuan.cao@jhu.edu
Abstract

We propose a novel training algorithm for conditional random fields (CRF). This algorithm is an extension of maximum likelihood estimation (MLE), using loss functions defined by Bregman divergences, which measure the proximity between the model expectation and the empirical mean of the feature vectors. This leads to a flexible training framework from which multiple update strategies can be derived using natural gradient descent (NGD). We carefully choose the convex function inducing the Bregman divergence so that the types of updates are reduced, while making the optimization procedure more effective by transforming the gradients of the log-likelihood loss function. The derived algorithms are very simple, can be easily implemented on top of an existing stochastic gradient descent (SGD) optimization procedure, and are very effective, as illustrated by experimental results.

1 Introduction

Graphical models are used extensively to solve NLP problems. One of the most popular models is the conditional random field (CRF) (Lafferty, 2001), which has been successfully applied to tasks like shallow parsing (Sha and Pereira, 2003), named entity recognition (McCallum and Li, 2003), and word segmentation (Peng et al., 2004), just to name a few.

While the modeling power demonstrated by CRF is critical for performance improvement, accurate parameter estimation of the model is equally important. As a general structured prediction problem, multiple training methods exist for CRF, depending on the choice of loss function. The most common choice is the (negative) log-likelihood of the training data, which can be optimized by algorithms like L-BFGS (Liu and Nocedal, 1989), stochastic gradient descent (SGD) (Bottou, 1998), stochastic meta-descent (SMD) (Schraudolph, 1999; Vishwanathan, 2006), etc. If the structured hinge loss is chosen instead, then the Passive-Aggressive (PA) algorithm (Crammer et al., 2006), the structured Perceptron (Collins, 2002) (corresponding to a hinge loss with zero margin), etc. can be applied for learning.

In this paper, we propose a novel CRF training procedure. Our loss functions are defined by the Bregman divergence (Bregman, 1967) between the model expectation and the empirical mean of the feature vectors, and can be treated as a generalization of the log-likelihood loss. We then use natural gradient descent (NGD) (Amari, 1998) to optimize the loss. Since stochastic optimization is usually a better choice than batch optimization for large-scale training (Bottou, 2008), we focus on the stochastic versions of the algorithms. The proposed framework is very flexible, allowing us to choose convex functions inducing Bregman divergences that lead to better training performance.

In Section 2, we briefly review background material relevant to the subsequent discussion; Section 3 gives a step-by-step introduction to the proposed algorithms; experimental results are given in Section 4, followed by discussion and conclusions.
2 Background

2.1 MLE for Graphical Models

Graphical models can be naturally viewed as exponential families (Wainwright and Jordan, 2008). For example, for a data sample (x, y), where x is the input sequence and y is the label sequence, the conditional distribution p(y|x) modeled by CRF can be written as […]

2.2 Natural Gradient Descent

[…] its most popular applications. Conventional gradient descent assumes the underlying parameter space to be Euclidean; nevertheless, this is often not the case (say, when the space is a statistical manifold). By contrast, NGD takes the geometry of the parameter space into consideration, giving the following update strategy:

    θ_{t+1} = θ_t − λ_t M_{θ_t}^{-1} ∇ℓ(θ_t)    (2)
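To make the NGD update in Eq. 2 concrete, here is a minimal NumPy sketch (not from the paper; the quadratic toy loss and all names are illustrative). When the metric M_θ matches the curvature of the objective, a single natural-gradient step can correct the ill-conditioning that plain gradient descent suffers from:

```python
import numpy as np

def gd_step(theta, grad, lr):
    # Plain gradient descent: implicitly assumes a Euclidean parameter space.
    return theta - lr * grad

def ngd_step(theta, grad, metric, lr):
    # Natural gradient descent (Eq. 2): premultiply the gradient by the
    # inverse of the Riemannian metric M_theta (e.g. the Fisher matrix).
    return theta - lr * np.linalg.solve(metric, grad)

# Toy example: ill-conditioned quadratic loss l(theta) = 0.5 * theta^T A theta.
# If the metric equals the curvature A, one NGD step with lr = 1 jumps
# straight to the minimizer, whereas plain GD crawls along the steep axis.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
grad = A @ theta
print(ngd_step(theta, grad, A, lr=1.0))   # -> [0. 0.]
print(gd_step(theta, grad, lr=0.01))
```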
3.2 Applying Natural Gradient Descent

The gradients of the loss functions B1-B4 with respect to θ are given in Table 1.

    Loss | Gradient w.r.t. θ                    | γ
    B1   | ∇_θE_θ [∇G(E_θ) − ∇G(Ẽ)]             | 1
    B1   | ∇_θE_θ ∇²G(Ẽ − E_θ)(E_θ − Ẽ)         | 0
    B2   | ∇_θE_θ ∇²G(E_θ)(E_θ − Ẽ)             | 1
    B2   | ∇_θE_θ [∇G(0) − ∇G(Ẽ − E_θ)]         | 0
    B3   | ∇_θE_θ ∇²G(E_θ − γẼ)(E_θ − Ẽ)        | {0, 1}
    B4   | ∇_θE_θ [∇G(E_θ − γẼ) − ∇G(ρẼ)]       | {0, 1}

    Table 1: Gradients of the loss functions B1-B4.

It is in general difficult to compute the gradients of the loss functions, as all of them contain the term ∇_θE_θ, whose computation is usually non-trivial. However, it turns out that the natural gradients of the loss functions can be handled easily. To see this, note that for distributions in the exponential family, ∇_θE_θ = ∇²_θA_θ = M_θ, which is the Fisher information matrix. Therefore the NGD update (Eq. 2) becomes

    θ_{t+1} = θ_t − λ_t (∇_θE_{θ_t})^{-1} ∇ℓ(θ_t)    (3)

Now if we plug, for example, ∇_θB1 with γ = 0 into Eq. 3, we have

    θ_{t+1} = θ_t − λ_t (∇_θE_{θ_t})^{-1} ∇_θB1
            = θ_t − λ_t ∇²G(Ẽ − E_{θ_t})(E_{θ_t} − Ẽ)    (4)

Thus the step of computing the Fisher information can be circumvented, making the optimization more efficient.²

² In (Hoffman et al., 2013), a similar trick was also used. However, their technique was developed for a specific variational inference problem setting, whereas the proposed approach derived from the Bregman divergence is more general.

3.3 Reducing the Types of Updates

Although the framework described so far is very flexible and multiple update strategies can be derived, it is not a good idea to try them all in turn for a given task. Therefore, it is necessary to reduce the types of updates and simplify the problem.

We first remove U4 and U6 from the list, since they can be recovered from U3 and U5 respectively by choosing G′(u) = G(u + Ẽ). To further reduce the update types, we impose the following constraints on the convex function G:

1. G is symmetric: G(u) = G(−u).

2. ∇G(u) is an element-wise function, namely ∇G(u)_i = g_i(u_i), ∀i ∈ 1, …, d, where g_i is a uni-variate scalar function.

For example, G(u) = ½‖u‖² is a typical function satisfying the constraints, since ½‖u‖² = ½‖−u‖² and ∇G(u) = [u_1, …, u_d]^T, where g_i(u) = u, ∀i ∈ 1, …, d. It is also worth mentioning that by choosing G(u) = ½‖u‖², all updates U1-U6 become equivalent:

    θ_{t+1} = θ_t − λ_t (E_{θ_t} − Ẽ)

which recovers GD for the log-likelihood function.
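The claim that the quadratic choice G(u) = ½‖u‖² collapses the U1-style and U2-style updates onto the same plain gradient step can be checked numerically. A small sketch (the expectation vectors are illustrative stand-ins, not computed from a real CRF):

```python
import numpy as np

# For G(u) = 0.5 * ||u||^2 we have grad G(u) = u and hess G(u) = I, so a
# U1-style step (Hessian times gradient) and a U2-style step (plain
# gradient map) coincide: theta_{t+1} = theta_t - lr * (E_theta - E_tilde).
def grad_G(u):       # gradient of 0.5*||u||^2: element-wise g_i(u_i) = u_i
    return u

def hess_G(u):       # Hessian of 0.5*||u||^2 is the identity
    return np.eye(len(u))

E_theta = np.array([0.3, 0.7, 0.1])   # model expectation (stand-in values)
E_tilde = np.array([0.0, 1.0, 0.0])   # empirical feature mean (stand-in)
lr = 0.5
theta = np.zeros(3)

u1 = theta - lr * hess_G(E_tilde - E_theta) @ (E_theta - E_tilde)  # U1-style
u2 = theta - lr * grad_G(E_theta - E_tilde)                        # U2-style
print(np.allclose(u1, u2))   # -> True
```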
When a twice-differentiable function G satisfies constraints 1 and 2, we have

    ∇G(u) = −∇G(−u)      (5)
    ∇G(0) = 0            (6)
    ∇²G(u) = ∇²G(−u)     (7)

where Eq. 7 holds since ∇²G is a diagonal matrix. Given these conditions, we see immediately that U1 is equivalent to U3, and U2 is equivalent to U5. This way, the types of updates are eventually narrowed down to U1 and U2.

To see the relationship between U1 and U2, note that the Taylor expansion of ∇G(0) at the point Ẽ − E_{θ_t} is given by

    ∇G(0) = ∇G(Ẽ − E_{θ_t}) + ∇²G(Ẽ − E_{θ_t})(E_{θ_t} − Ẽ) + O(‖E_{θ_t} − Ẽ‖²)

Therefore U1 and U2 can be regarded as approximations of each other.

Since stochastic optimization is preferred for large-scale training, we replace Ẽ and E_{θ_t} with their stochastic estimates Ẽ_t and E_{θ_t,t}, where Ẽ_t ≜ Φ(x_t, y_t) and E_{θ_t,t} ≜ E_{p_{θ_t}(Y|x_t)}[Φ(x_t, Y)]. Assuming G satisfies the constraints, we re-write U1 and U2 as

    θ_{t+1} = θ_t − λ_t ∇²G(E_{θ_t,t} − Ẽ_t)(E_{θ_t,t} − Ẽ_t)    (U1*)
    θ_{t+1} = θ_t − λ_t ∇G(E_{θ_t,t} − Ẽ_t)                      (U2*)

which will be the focus for the rest of the paper.

3.4 Choosing the Convex Function G

We now proceed to choose the actual functional forms of G, aiming to make the training procedure more efficient.

A naïve approach would be to choose G from a parameterized convex function family (say the vector p-norm, where p ≥ 1 is the hyper-parameter), and tune the hyper-parameter on a held-out set, hoping to find a value that works best for the task at hand. However, this approach is very inefficient, and we would like to choose G in a more principled way.

Although we derived U1* and U2* from NGD, they can be treated as SGD on two surrogate loss functions S1(θ) and S2(θ) respectively (we do not even need to know what the actual functional forms of S1 and S2 are), whose gradients ∇_θS1 = ∇²G(E_θ − Ẽ)(E_θ − Ẽ) and ∇_θS2 = ∇G(E_θ − Ẽ) are transformations of the gradient of the log-likelihood (Eq. 1). Since the performance of SGD is sensitive to the condition number of the objective function (Bottou, 2008), one heuristic for the selection of G is to make the condition numbers of S1 and S2 smaller than that of the log-likelihood. However, this is hard to analyze, since the condition number is in general difficult to compute. Alternatively, we may select a G such that the second-order information of the log-likelihood can be (approximately) incorporated, as second-order stochastic optimization methods usually converge faster and are insensitive to the condition number of the objective function (Bottou, 2008). This is the guideline we follow in this section.

The first convex function we consider is as follows:

    G1(u) = Σ_{i=1}^d [ (u_i/√ε) arctan(u_i/√ε) − ½ log(1 + u_i²/ε) ]

where ε > 0 is a small constant free to choose. It can be easily checked that G1 satisfies the constraints imposed in Section 3.3. The gradients of G1 are given by

    ∇G1(u) = (1/√ε) [arctan(u_1/√ε), …, arctan(u_d/√ε)]^T
    ∇²G1(u) = diag( 1/(u_1² + ε), …, 1/(u_d² + ε) )

In this case, the U1* update has the following form (named U1*.G1):

    θ^i_{t+1} = θ^i_t − λ_t ((E^i_{θ_t,t} − Ẽ^i_t)² + ε)^{-1} (E^i_{θ_t,t} − Ẽ^i_t)

where i = 1, …, d. This update can be treated as a stochastic second-order optimization procedure, as it scales the gradient (Eq. 1) by the inverse of its variance in each dimension, and it reminds us of the online Newton step (ONS) (Hazan et al., 2007) algorithm, which has a similar update step. However, in contrast to ONS, where the full inverse covariance matrix is used, here we only use the diagonal of the covariance matrix to scale the gradient vector. Diagonal matrix approximation is often used in optimization algorithms incorporating second-order information (for example SGD-QN (Bordes et al., 2009), AdaGrad (Duchi et al., 2011), etc.), as it can be computed orders of magnitude faster than using the full matrix, without sacrificing much performance.

The U2* update corresponding to the choice of G1 has the following form (named U2*.G1):

    θ^i_{t+1} = θ^i_t − λ_t arctan( (E^i_{θ_t,t} − Ẽ^i_t) / √ε )
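As a concrete illustration of the two per-coordinate updates derived from G1, here is a hedged NumPy sketch. The expectation vectors are stand-ins; in a real CRF they would be computed by forward-backward on one sentence, and all names are illustrative:

```python
import numpy as np

EPS = 0.1  # the small constant eps > 0 in G1 (the value used later in the paper)

def u1_g1_step(theta, E_model, E_emp, lr):
    # U1*.G1: scale each gradient coordinate by the inverse of
    # (E^i - Etilde^i)^2 + eps, a diagonal second-order correction.
    g = E_model - E_emp
    return theta - lr * g / (g ** 2 + EPS)

def u2_g1_step(theta, E_model, E_emp, lr):
    # U2*.G1: pass each gradient coordinate through arctan(. / sqrt(eps)),
    # whose slope at zero is 1/sqrt(eps) > 1 when eps < 1, so small
    # gradient coordinates get boosted.
    g = E_model - E_emp
    return theta - lr * np.arctan(g / np.sqrt(EPS))

# Stand-in stochastic expectations for one training sample.
E_model = np.array([0.9, 0.05])
E_emp   = np.array([1.0, 0.0])
theta = np.zeros(2)
print(u1_g1_step(theta, E_model, E_emp, lr=0.1))
print(u2_g1_step(theta, E_model, E_emp, lr=0.1))
```

Both transformations leave the sign of each gradient coordinate unchanged while amplifying coordinates with small absolute value, which is the behavior discussed around Figure 1.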
[Figure 1: Comparison of the functions u/(u² + 0.1), arctan, erf, and gd. The sigmoid functions arctan, erf and gd are normalized so that they have the same gradient as u/(u² + 0.1) at u = 0, and their values are bounded by −1 and 1.]

Note that the constant 1/√ε in ∇G1 has been folded into the learning rate λ_t. Although not apparent at first sight, this update is in some sense similar to U1*.G1. From U1*.G1 we see that in each dimension, gradients E^i_{θ_t} − Ẽ^i_t with smaller absolute values get boosted more dramatically than those with larger values (as long as |E^i_{θ_t} − Ẽ^i_t| is not too close to zero). A similar property also holds for U2*.G1, since arctan is a sigmoid function, and as long as we choose ε < 1, so that

    d/du arctan(u/√ε) |_{u=0} = 1/√ε > 1,

the magnitude of an E^i_{θ_t} − Ẽ^i_t with small absolute value will also get boosted. This is illustrated in Figure 1 by comparing the functions u/(u² + ε) (where we select ε = 0.1) and arctan(u/√ε). Note that for many NLP tasks modeled by CRF, only indicator features are defined. Therefore the value of E^i_{θ_t} − Ẽ^i_t is bounded by −1 and 1, and we only care about function values on this interval.

Since arctan belongs to the sigmoid function family, it is not the only candidate whose corresponding update mimics the behavior of U1*.G1 while G still satisfies the constraints given in Section 3.3. We therefore consider two more convex functions G2 and G3, whose gradients are the erf and Gudermannian (gd) functions from the sigmoid family respectively:

    ∇G2(αu)_i = (2/√π) ∫₀^{αu_i} exp{−x²} dx
    ∇G3(βu)_i = 2 arctan(exp{βu_i}) − π/2

where α, β ∈ ℝ are hyper-parameters. In this case, we do not even know the functional forms of G2 and G3; however, it can be checked that both of them satisfy the constraints. The reason why we select these two functions from the sigmoid family is that when the gradients of erf and gd are made the same as that of arctan at zero, both of them stay on top of arctan. Therefore, erf and gd are able to give stronger boosts to small gradients. This is also illustrated in Figure 1.

Applying ∇G2 and ∇G3 to U2*, we get two updates named U2*.G2 and U2*.G3 respectively. We do not consider further applying ∇²G2 and ∇²G3 to U1*, as the meaning of the resulting updates becomes less clear.

3.5 Adding Regularization Term

So far we have not considered adding a regularization term to the loss functions, which is crucial for better generalization. A common choice of the regularization term is the 2-norm (Euclidean metric) of the parameter vector, (C/2)‖θ‖², where C is a hyper-parameter specifying the strength of regularization. In our case, however, we derived our algorithms by applying NGD to the Bregman loss functions, which assumes the underlying parameter space to be non-Euclidean. Therefore, the regularization term we add has the form (C/2) θᵀM_θθ, and the regularized loss to be minimized is

    (C/2) θᵀM_θθ + B_i    (8)

where B_i, i ∈ {1, 2, 3, 4}, are the loss functions B1-B4.

However, the Riemannian metric M_θ is itself a function of θ, and the gradient of the objective at time t is difficult to compute. Instead, we use an approximation by keeping the Riemannian metric fixed at each time stamp, so that the gradient is given by C M_{θ_t}θ_t + ∇_{θ_t}B_i. Now if we apply NGD to this objective, the resulting updates are no different from SGD on the L2-regularized surrogate loss functions S1, S2:

    θ_{t+1} = (1 − Cλ_t) θ_t − λ_t ∇_θS_{i,t}(θ_t)

where i ∈ {1, 2}. It is well known that SGD for L2-regularized loss functions has an equivalent but more efficient sparse update (Shalev-Shwartz et al., 2007) (Bottou, 2012):

    θ̄_t = θ_t / z_t
    θ̄_{t+1} = θ̄_t − λ_t / ((1 − Cλ_t) z_t) · ∇_θS_{i,t}(θ_t)
    z_{t+1} = (1 − Cλ_t) z_t

where z_t is a scaling factor and z_0 = 1. We then modify U1* and U2* accordingly, simply by changing the step size and maintaining a scaling factor. As for the choice of the learning rate λ_t, we follow the recommendation given by (Bottou, 2012) and set λ_t = λ̂(1 + λ̂Ct)^{-1}, where λ̂ is calibrated on a small subset of the training data before training starts.

The final versions of the algorithms developed so far are summarized in Figure 2.

3.6 Computational Complexity

The proposed algorithms simply transform the gradient of the log-likelihood, and each transformation function can be computed in constant time; therefore all of them have the same time complexity as SGD (O(d) per update). In cases where only indicator features are defined, the training process can be accelerated by pre-computing the values of ∇²G or ∇G for variables within the range [−1, 1] and keeping them in a table. During the actual training, the transformed gradient values can then be found simply by looking up the table after rounding to the nearest entry. This way we do not even need to compute the function values on the fly, which significantly reduces the computational cost.

    Initialization:
      Choose hyper-parameters C and ε/α/β (depending on which convex
      function is used: G1, G2, G3).
      Set z_0 = 1, and calibrate λ̂ on a small training subset.
    Algorithm:
      for e = 1 … num_epoch do
        for t = 1 … T do
          Receive training sample (x_t, y_t)
          λ_t = λ̂(1 + λ̂Ct)^{-1}, θ_t = z_t θ̄_t
          Depending on the update strategy selected, update the parameters:
            θ̄_{t+1} = θ̄_t − λ_t / ((1 − Cλ_t) z_t) · ∇_θS_t(θ_t)
          where ∇_θS_t(θ_t)_i, i = 1, …, d, is given by
            ((E^i_{θ_t,t} − Ẽ^i_t)² + ε)^{-1} (E^i_{θ_t,t} − Ẽ^i_t)   (U1*.G1)
            arctan((E^i_{θ_t,t} − Ẽ^i_t) / √ε)                        (U2*.G1)
            erf(α(E^i_{θ_t,t} − Ẽ^i_t))                               (U2*.G2)
            gd(β(E^i_{θ_t,t} − Ẽ^i_t))                                (U2*.G3)
          z_{t+1} = (1 − Cλ_t) z_t
          if no more improvement on training set then
            exit
          end if
        end for
      end for

Figure 2: Summary of the proposed algorithms.
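The scaled-parameter trick of Section 3.5 (updating θ̄ and a scalar z instead of shrinking every coordinate of θ) can be checked numerically. A minimal sketch, under the assumption that the shrink factor is (1 − Cλ_t); all values and names are illustrative:

```python
import numpy as np

# Keeping theta = z * theta_bar lets the L2 shrink step (1 - C*lr) touch
# only the scalar z, while the (typically sparse) gradient touches a few
# entries of theta_bar.
def regularized_sparse_step(theta_bar, z, grad_S, lr, C):
    theta_bar = theta_bar - lr / ((1.0 - C * lr) * z) * grad_S
    z = (1.0 - C * lr) * z
    return theta_bar, z

# Check equivalence with the dense update theta' = (1-C*lr)*theta - lr*grad.
theta = np.array([0.5, -0.2, 0.0])
grad_S = np.array([0.1, 0.0, -0.3])
lr, C = 0.1, 0.5
theta_bar, z = theta.copy(), 1.0
theta_bar, z = regularized_sparse_step(theta_bar, z, grad_S, lr, C)
dense = (1.0 - C * lr) * theta - lr * grad_S
print(np.allclose(z * theta_bar, dense))   # -> True
```

The design point is that the dense update rescales every parameter each step, while the sparse form costs only O(nnz(grad)) per step plus one scalar multiply.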
4 Experiments

4.1 Settings

We conduct our experiments with the following settings:

Implementation: We implemented our algorithms based on the CRFsuite toolkit (Okazaki, 2007), in which SGD for the log-likelihood loss with L2 regularization is already implemented. This can be done easily, since our algorithms only require modifying the original gradients; other parts of the code remain unchanged.

Task: Our experiments are conducted on the widely used CoNLL 2000 chunking shared task (Sang and Buchholz, 2000). The training and test data contain 8,939 and 2,012 sentences respectively. For fair comparison, we ran our experiments with the two standard linear-chain CRF feature sets implemented in CRFsuite. The smaller feature set contains 452,755 features, whereas the larger one contains 7,385,312 features.

Baseline algorithms: We compare the performance of our algorithms summarized in Figure 2 with SGD, L-BFGS, and the Passive-Aggressive (PA) algorithm. Except for L-BFGS, which is a second-order batch-learning algorithm, we randomly shuffled the data and repeated each experiment five times.

Hyper-parameter selection: For the convex function G1 we choose ε = 0.1; correspondingly, the arctan function in U2*.G1 is arctan(3.16u). We have also experimented with the function arctan(10u) for this update, following the heuristic that the transformation function u/(u² + 0.1) given by ∇²G1 and arctan(10u) have consistent gradients at zero (the two settings are denoted by arctan.1 and arctan.2 respectively). Following the same heuristic, we choose α = 5√π ≈ 8.86 for the erf function and β = 10 for the gd function.

[Figure 3: F-scores of training and test sets given by the baseline and proposed algorithms, using the small feature set.]

Comparison with the baseline: We compare the performance of the proposed and baseline algorithms on the training and test sets in Figure 3 and Figure 4, corresponding to the small and large feature sets respectively. The plots show the F-scores on the training and test sets for the first 50 epochs. To keep the plots neat, we only show the average F-scores of the repeated experiments after each epoch, and omit the standard-deviation error bars. For the U2*.G1 update, only the arctan.1 function is reported here. From the figures we observe that:

1. The strongest baseline is given by SGD. […] the training data, while L-BFGS converges very slowly, although it eventually catches up.

2. Although the SGD baseline is already very […] both of the proposed algorithms U1*.G1 and U2*.G1 […]

Since arctan, erf and gd are all sigmoid functions, it is interesting to see how their behaviors differ. We compare the updates U2*.G1 (for both arctan.1 and arctan.2), U2*.G2 and U2*.G3 in Figure 5 and Figure 6. The strongest baseline, SGD, is also included for comparison. From the figures we have the following observations:

1. As expected, the sigmoid functions demonstrate similar behaviors, and the performances of their corresponding updates are almost indistinguishable. Similar to U2*.G1, both U2*.G2 and U2*.G3 outperform the SGD baseline.

2. The performances of U2*.G1 given by the arctan functions are insensitive to the choice of the hyper-parameters. Although we did not run similar experiments for the erf and gd functions, similar properties can be expected from their corresponding updates.

Finally, we report in Table 2 the F-scores on the test set given by all algorithms after convergence.

[Figure 6: F-scores of training and test sets given by U2*.G1, U2*.G2 and U2*.G3, using the large feature set.]

5 Conclusion

We have proposed a novel parameter estimation framework for CRF. By defining loss functions using Bregman divergences, we are given the opportunity to select convex functions that transform the gradient of the log-likelihood loss, which leads to more effective parameter learning if the function is properly chosen. Minimization of the Bregman loss function is made possible by natural gradient descent.
[Figure 4: F-scores of training and test sets given by the baseline and proposed algorithms, using the large feature set.]

[Figure 5: F-scores of training and test sets given by U2*.G1, U2*.G2 and U2*.G3, using the small feature set.]