Training Conditional Random Fields With Natural Gradient Descent
Yuan Cao
Center for Language & Speech Processing
The Johns Hopkins University
Baltimore, MD, USA, 21218
yuan.cao@jhu.edu
Abstract

We propose a novel training algorithm for conditional random fields (CRF). This algorithm is an extension of maximum likelihood estimation (MLE), using loss functions defined by Bregman divergences, which measure the proximity between the model expectation and the empirical mean of the feature vectors. This leads to a flexible training framework from which multiple update strategies can be derived using natural gradient descent (NGD). We carefully choose the convex function inducing the Bregman divergence so that the types of updates are reduced, while making the optimization procedure more effective by transforming the gradients of the log-likelihood loss function. The derived algorithms are very simple, can be easily implemented on top of an existing stochastic gradient descent (SGD) optimization procedure, and are very effective, as illustrated by experimental results.

1 Introduction

Graphical models are used extensively to solve NLP problems. One of the most popular models is the conditional random field (CRF) (Lafferty, 2001), which has been successfully applied to tasks like shallow parsing (Sha and Pereira, 2003), named entity recognition (McCallum and Li, 2003), and word segmentation (Peng et al., 2004), just to name a few.

While the modeling power demonstrated by CRF is critical for performance improvement, accurate parameter estimation of the model is equally important. As a general structured prediction problem, multiple training methods exist for CRF, depending on the choice of loss function. The most common choice is the (negative) log-likelihood of the training data, which can be optimized by algorithms like L-BFGS (Liu and Nocedal, 1989), stochastic gradient descent (SGD) (Bottou, 1998), stochastic meta-descent (SMD) (Schraudolph, 1999; Vishwanathan, 2006), etc. If the structured hinge loss is chosen instead, then the Passive-Aggressive (PA) algorithm (Crammer et al., 2006), the structured Perceptron (Collins, 2002) (corresponding to a hinge loss with zero margin), etc. can be applied for learning.

In this paper, we propose a novel CRF training procedure. Our loss functions are defined by the Bregman divergence (Bregman, 1967) between the model expectation and the empirical mean of the feature vectors, and can be treated as a generalization of the log-likelihood loss. We then use natural gradient descent (NGD) (Amari, 1998) to optimize the loss. Since stochastic optimization is usually a better choice than batch optimization for large-scale training (Bottou, 2008), we focus on the stochastic versions of the algorithms. The proposed framework is very flexible, allowing us to choose convex functions inducing Bregman divergences that lead to better training performance.

In Section 2, we briefly review background material relevant to the subsequent discussion; Section 3 gives a step-by-step introduction to the proposed algorithms; experimental results are given in Section 4, followed by discussion and conclusions.
2 Background

2.1 MLE for Graphical Models

Graphical models can be naturally viewed as exponential families (Wainwright and Jordan, 2008). For example, for a data sample (x, y), where x is the input sequence and y is the label sequence, the conditional distribution p(y|x) modeled by CRF can be written as […]

2.2 Natural Gradient Descent

[…] its most popular applications. Conventional gradient descent assumes the underlying parameter space to be Euclidean; nevertheless, this is often not the case (say, when the space is a statistical manifold). By contrast, NGD takes the geometry of the parameter space into consideration, giving the following update strategy:

    θ_{t+1} = θ_t − λ_t M_{θ_t}^{-1} ∇ℓ(θ_t)    (2)
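To make the NGD update in Eq. 2 concrete, here is a minimal NumPy sketch (not from the paper; the quadratic toy loss and all names are illustrative). When the metric M_θ matches the curvature of the objective, a single natural-gradient step can correct the ill-conditioning that plain gradient descent suffers from:

```python
import numpy as np

def gd_step(theta, grad, lr):
    # Plain gradient descent: implicitly assumes a Euclidean parameter space.
    return theta - lr * grad

def ngd_step(theta, grad, metric, lr):
    # Natural gradient descent (Eq. 2): premultiply the gradient by the
    # inverse of the Riemannian metric M_theta (e.g. the Fisher matrix).
    return theta - lr * np.linalg.solve(metric, grad)

# Toy example: ill-conditioned quadratic loss l(theta) = 0.5 * theta^T A theta.
# If the metric equals the curvature A, one NGD step with lr = 1 jumps
# straight to the minimizer, whereas plain GD crawls along the steep axis.
A = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
grad = A @ theta
print(ngd_step(theta, grad, A, lr=1.0))   # -> [0. 0.]
print(gd_step(theta, grad, lr=0.01))
```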
3.2 Applying Natural Gradient Descent

The gradients of the loss functions B1-B4 with respect to θ are given in Table 1.

    Loss | Gradient w.r.t. θ                    | γ
    B1   | ∇_θE_θ [∇G(E_θ) − ∇G(Ẽ)]             | 1
    B1   | ∇_θE_θ ∇²G(Ẽ − E_θ)(E_θ − Ẽ)         | 0
    B2   | ∇_θE_θ ∇²G(E_θ)(E_θ − Ẽ)             | 1
    B2   | ∇_θE_θ [∇G(0) − ∇G(Ẽ − E_θ)]         | 0
    B3   | ∇_θE_θ ∇²G(E_θ − γẼ)(E_θ − Ẽ)        | {0, 1}
    B4   | ∇_θE_θ [∇G(E_θ − γẼ) − ∇G(ρẼ)]       | {0, 1}

    Table 1: Gradients of the loss functions B1-B4.

It is in general difficult to compute the gradients of the loss functions, as all of them contain the term ∇_θE_θ, whose computation is usually non-trivial. However, it turns out that the natural gradients of the loss functions can be handled easily. To see this, note that for distributions in the exponential family, ∇_θE_θ = ∇²_θA_θ = M_θ, which is the Fisher information matrix. Therefore the NGD update (Eq. 2) becomes

    θ_{t+1} = θ_t − λ_t (∇_θE_{θ_t})^{-1} ∇ℓ(θ_t)    (3)

Now if we plug, for example, ∇_θB1 with γ = 0 into Eq. 3, we have

    θ_{t+1} = θ_t − λ_t (∇_θE_{θ_t})^{-1} ∇_θB1
            = θ_t − λ_t ∇²G(Ẽ − E_{θ_t})(E_{θ_t} − Ẽ)    (4)

Thus the step of computing the Fisher information can be circumvented, making the optimization more efficient.²

² In (Hoffman et al., 2013), a similar trick was also used. However, their technique was developed for a specific variational inference problem setting, whereas the proposed approach derived from the Bregman divergence is more general.

3.3 Reducing the Types of Updates

Although the framework described so far is very flexible and multiple update strategies can be derived, it is not a good idea to try them all in turn for a given task. Therefore, it is necessary to reduce the types of updates and simplify the problem.

We first remove U4 and U6 from the list, since they can be recovered from U3 and U5 respectively by choosing G′(u) = G(u + Ẽ). To further reduce the update types, we impose the following constraints on the convex function G:

1. G is symmetric: G(u) = G(−u).

2. ∇G(u) is an element-wise function, namely ∇G(u)_i = g_i(u_i), ∀i ∈ 1, …, d, where g_i is a uni-variate scalar function.

For example, G(u) = ½‖u‖² is a typical function satisfying the constraints, since ½‖u‖² = ½‖−u‖² and ∇G(u) = [u_1, …, u_d]^T, where g_i(u) = u, ∀i ∈ 1, …, d. It is also worth mentioning that by choosing G(u) = ½‖u‖², all updates U1-U6 become equivalent:

    θ_{t+1} = θ_t − λ_t (E_{θ_t} − Ẽ)

which recovers GD for the log-likelihood function.
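The claim that the quadratic choice G(u) = ½‖u‖² collapses the U1-style and U2-style updates onto the same plain gradient step can be checked numerically. A small sketch (the expectation vectors are illustrative stand-ins, not computed from a real CRF):

```python
import numpy as np

# For G(u) = 0.5 * ||u||^2 we have grad G(u) = u and hess G(u) = I, so a
# U1-style step (Hessian times gradient) and a U2-style step (plain
# gradient map) coincide: theta_{t+1} = theta_t - lr * (E_theta - E_tilde).
def grad_G(u):       # gradient of 0.5*||u||^2: element-wise g_i(u_i) = u_i
    return u

def hess_G(u):       # Hessian of 0.5*||u||^2 is the identity
    return np.eye(len(u))

E_theta = np.array([0.3, 0.7, 0.1])   # model expectation (stand-in values)
E_tilde = np.array([0.0, 1.0, 0.0])   # empirical feature mean (stand-in)
lr = 0.5
theta = np.zeros(3)

u1 = theta - lr * hess_G(E_tilde - E_theta) @ (E_theta - E_tilde)  # U1-style
u2 = theta - lr * grad_G(E_theta - E_tilde)                        # U2-style
print(np.allclose(u1, u2))   # -> True
```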
When a twice-differentiable function G satisfies constraints 1 and 2, we have

    ∇G(u) = −∇G(−u)      (5)
    ∇G(0) = 0            (6)
    ∇²G(u) = ∇²G(−u)     (7)

where Eq. 7 holds since ∇²G is a diagonal matrix. Given these conditions, we see immediately that U1 is equivalent to U3, and U2 is equivalent to U5. This way, the types of updates are eventually narrowed down to U1 and U2.

To see the relationship between U1 and U2, note that the Taylor expansion of ∇G(0) at the point Ẽ − E_{θ_t} is given by

    ∇G(0) = ∇G(Ẽ − E_{θ_t}) + ∇²G(Ẽ − E_{θ_t})(E_{θ_t} − Ẽ) + O(‖E_{θ_t} − Ẽ‖²)

Therefore U1 and U2 can be regarded as approximations of each other.

Since stochastic optimization is preferred for large-scale training, we replace Ẽ and E_{θ_t} with their stochastic estimates Ẽ_t and E_{θ_t,t}, where Ẽ_t ≜ Φ(x_t, y_t) and E_{θ_t,t} ≜ E_{p_{θ_t}(Y|x_t)}[Φ(x_t, Y)]. Assuming G satisfies the constraints, we re-write U1 and U2 as

    θ_{t+1} = θ_t − λ_t ∇²G(E_{θ_t,t} − Ẽ_t)(E_{θ_t,t} − Ẽ_t)    (U1*)
    θ_{t+1} = θ_t − λ_t ∇G(E_{θ_t,t} − Ẽ_t)                      (U2*)

which will be the focus for the rest of the paper.

3.4 Choosing the Convex Function G

We now proceed to choose the actual functional forms of G, aiming to make the training procedure more efficient.

A naïve approach would be to choose G from a parameterized convex function family (say the vector p-norm, where p ≥ 1 is the hyper-parameter), and tune the hyper-parameter on a held-out set, hoping to find a value that works best for the task at hand. However, this approach is very inefficient, and we would like to choose G in a more principled way.

Although we derived U1* and U2* from NGD, they can be treated as SGD on two surrogate loss functions S1(θ) and S2(θ) respectively (we do not even need to know what the actual functional forms of S1 and S2 are), whose gradients ∇_θS1 = ∇²G(E_θ − Ẽ)(E_θ − Ẽ) and ∇_θS2 = ∇G(E_θ − Ẽ) are transformations of the gradient of the log-likelihood (Eq. 1). Since the performance of SGD is sensitive to the condition number of the objective function (Bottou, 2008), one heuristic for the selection of G is to make the condition numbers of S1 and S2 smaller than that of the log-likelihood. However, this is hard to analyze, since the condition number is in general difficult to compute. Alternatively, we may select a G such that the second-order information of the log-likelihood can be (approximately) incorporated, as second-order stochastic optimization methods usually converge faster and are insensitive to the condition number of the objective function (Bottou, 2008). This is the guideline we follow in this section.

The first convex function we consider is as follows:

    G1(u) = Σ_{i=1}^d [ (u_i/√ε) arctan(u_i/√ε) − ½ log(1 + u_i²/ε) ]

where ε > 0 is a small constant free to choose. It can be easily checked that G1 satisfies the constraints imposed in Section 3.3. The gradients of G1 are given by

    ∇G1(u) = (1/√ε) [arctan(u_1/√ε), …, arctan(u_d/√ε)]^T
    ∇²G1(u) = diag( 1/(u_1² + ε), …, 1/(u_d² + ε) )

In this case, the U1* update has the following form (named U1*.G1):

    θ^i_{t+1} = θ^i_t − λ_t ((E^i_{θ_t,t} − Ẽ^i_t)² + ε)^{-1} (E^i_{θ_t,t} − Ẽ^i_t)

where i = 1, …, d. This update can be treated as a stochastic second-order optimization procedure, as it scales the gradient (Eq. 1) by the inverse of its variance in each dimension, and it reminds us of the online Newton step (ONS) (Hazan et al., 2007) algorithm, which has a similar update step. However, in contrast to ONS, where the full inverse covariance matrix is used, here we only use the diagonal of the covariance matrix to scale the gradient vector. Diagonal matrix approximation is often used in optimization algorithms incorporating second-order information (for example SGD-QN (Bordes et al., 2009), AdaGrad (Duchi et al., 2011), etc.), as it can be computed orders of magnitude faster than using the full matrix, without sacrificing much performance.

The U2* update corresponding to the choice of G1 has the following form (named U2*.G1):

    θ^i_{t+1} = θ^i_t − λ_t arctan( (E^i_{θ_t,t} − Ẽ^i_t) / √ε )
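As a concrete illustration of the two per-coordinate updates derived from G1, here is a hedged NumPy sketch. The expectation vectors are stand-ins; in a real CRF they would be computed by forward-backward on one sentence, and all names are illustrative:

```python
import numpy as np

EPS = 0.1  # the small constant eps > 0 in G1 (the value used later in the paper)

def u1_g1_step(theta, E_model, E_emp, lr):
    # U1*.G1: scale each gradient coordinate by the inverse of
    # (E^i - Etilde^i)^2 + eps, a diagonal second-order correction.
    g = E_model - E_emp
    return theta - lr * g / (g ** 2 + EPS)

def u2_g1_step(theta, E_model, E_emp, lr):
    # U2*.G1: pass each gradient coordinate through arctan(. / sqrt(eps)),
    # whose slope at zero is 1/sqrt(eps) > 1 when eps < 1, so small
    # gradient coordinates get boosted.
    g = E_model - E_emp
    return theta - lr * np.arctan(g / np.sqrt(EPS))

# Stand-in stochastic expectations for one training sample.
E_model = np.array([0.9, 0.05])
E_emp   = np.array([1.0, 0.0])
theta = np.zeros(2)
print(u1_g1_step(theta, E_model, E_emp, lr=0.1))
print(u2_g1_step(theta, E_model, E_emp, lr=0.1))
```

Both transformations leave the sign of each gradient coordinate unchanged while amplifying coordinates with small absolute value, which is the behavior discussed around Figure 1.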
[Figure 1: Comparison of the functions u/(u² + 0.1), arctan, erf, and gd. The sigmoid functions arctan, erf and gd are normalized so that they have the same gradient as u/(u² + 0.1) at u = 0, and their values are bounded by −1 and 1.]

Note that the constant 1/√ε in ∇G1 has been folded into the learning rate λ_t. Although not apparent at first sight, this update is in some sense similar to U1*.G1. From U1*.G1 we see that in each dimension, gradients E^i_{θ_t} − Ẽ^i_t with smaller absolute values get boosted more dramatically than those with larger values (as long as |E^i_{θ_t} − Ẽ^i_t| is not too close to zero). A similar property also holds for U2*.G1, since arctan is a sigmoid function, and as long as we choose ε < 1, so that

    d/du arctan(u/√ε) |_{u=0} = 1/√ε > 1,

the magnitude of an E^i_{θ_t} − Ẽ^i_t with small absolute value will also get boosted. This is illustrated in Figure 1 by comparing the functions u/(u² + ε) (where we select ε = 0.1) and arctan(u/√ε). Note that for many NLP tasks modeled by CRF, only indicator features are defined. Therefore the value of E^i_{θ_t} − Ẽ^i_t is bounded by −1 and 1, and we only care about function values on this interval.

Since arctan belongs to the sigmoid function family, it is not the only candidate whose corresponding update mimics the behavior of U1*.G1 while G still satisfies the constraints given in Section 3.3. We therefore consider two more convex functions G2 and G3, whose gradients are the erf and Gudermannian (gd) functions from the sigmoid family respectively:

    ∇G2(αu)_i = (2/√π) ∫₀^{αu_i} exp{−x²} dx
    ∇G3(βu)_i = 2 arctan(exp{βu_i}) − π/2

where α, β ∈ ℝ are hyper-parameters. In this case, we do not even know the functional forms of G2 and G3; however, it can be checked that both of them satisfy the constraints. The reason why we select these two functions from the sigmoid family is that when the gradients of erf and gd are made the same as that of arctan at zero, both of them stay on top of arctan. Therefore, erf and gd are able to give stronger boosts to small gradients. This is also illustrated in Figure 1.

Applying ∇G2 and ∇G3 to U2*, we get two updates named U2*.G2 and U2*.G3 respectively. We do not consider further applying ∇²G2 and ∇²G3 to U1*, as the meaning of the resulting updates becomes less clear.

3.5 Adding Regularization Term

So far we have not considered adding a regularization term to the loss functions, which is crucial for better generalization. A common choice of the regularization term is the 2-norm (Euclidean metric) of the parameter vector, (C/2)‖θ‖², where C is a hyper-parameter specifying the strength of regularization. In our case, however, we derived our algorithms by applying NGD to the Bregman loss functions, which assumes the underlying parameter space to be non-Euclidean. Therefore, the regularization term we add has the form (C/2) θᵀM_θθ, and the regularized loss to be minimized is

    (C/2) θᵀM_θθ + B_i    (8)

where B_i, i ∈ {1, 2, 3, 4}, are the loss functions B1-B4.

However, the Riemannian metric M_θ is itself a function of θ, and the gradient of the objective at time t is difficult to compute. Instead, we use an approximation by keeping the Riemannian metric fixed at each time stamp, so that the gradient is given by C M_{θ_t}θ_t + ∇_{θ_t}B_i. Now if we apply NGD to this objective, the resulting updates are no different from SGD on the L2-regularized surrogate loss functions S1, S2:

    θ_{t+1} = (1 − Cλ_t) θ_t − λ_t ∇_θS_{i,t}(θ_t)

where i ∈ {1, 2}. It is well known that SGD for L2-regularized loss functions has an equivalent but more efficient sparse update (Shalev-Shwartz et al., 2007) (Bottou, 2012):

    θ̄_t = θ_t / z_t
    θ̄_{t+1} = θ̄_t − λ_t / ((1 − Cλ_t) z_t) · ∇_θS_{i,t}(θ_t)
    z_{t+1} = (1 − Cλ_t) z_t

where z_t is a scaling factor and z_0 = 1. We then modify U1* and U2* accordingly, simply by changing the step size and maintaining a scaling factor. As for the choice of the learning rate λ_t, we follow the recommendation given by (Bottou, 2012) and set λ_t = λ̂(1 + λ̂Ct)^{-1}, where λ̂ is calibrated on a small subset of the training data before training starts.

The final versions of the algorithms developed so far are summarized in Figure 2.

3.6 Computational Complexity

The proposed algorithms simply transform the gradient of the log-likelihood, and each transformation function can be computed in constant time; therefore all of them have the same time complexity as SGD (O(d) per update). In cases where only indicator features are defined, the training process can be accelerated by pre-computing the values of ∇²G or ∇G for variables within the range [−1, 1] and keeping them in a table. During the actual training, the transformed gradient values can then be found simply by looking up the table after rounding to the nearest entry. This way we do not even need to compute the function values on the fly, which significantly reduces the computational cost.

    Initialization:
      Choose hyper-parameters C and ε/α/β (depending on which convex
      function is used: G1, G2, G3).
      Set z_0 = 1, and calibrate λ̂ on a small training subset.
    Algorithm:
      for e = 1 … num_epoch do
        for t = 1 … T do
          Receive training sample (x_t, y_t)
          λ_t = λ̂(1 + λ̂Ct)^{-1}, θ_t = z_t θ̄_t
          Depending on the update strategy selected, update the parameters:
            θ̄_{t+1} = θ̄_t − λ_t / ((1 − Cλ_t) z_t) · ∇_θS_t(θ_t)
          where ∇_θS_t(θ_t)_i, i = 1, …, d, is given by
            ((E^i_{θ_t,t} − Ẽ^i_t)² + ε)^{-1} (E^i_{θ_t,t} − Ẽ^i_t)   (U1*.G1)
            arctan((E^i_{θ_t,t} − Ẽ^i_t) / √ε)                        (U2*.G1)
            erf(α(E^i_{θ_t,t} − Ẽ^i_t))                               (U2*.G2)
            gd(β(E^i_{θ_t,t} − Ẽ^i_t))                                (U2*.G3)
          z_{t+1} = (1 − Cλ_t) z_t
          if no more improvement on training set then
            exit
          end if
        end for
      end for

Figure 2: Summary of the proposed algorithms.
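The scaled-parameter trick of Section 3.5 (updating θ̄ and a scalar z instead of shrinking every coordinate of θ) can be checked numerically. A minimal sketch, under the assumption that the shrink factor is (1 − Cλ_t); all values and names are illustrative:

```python
import numpy as np

# Keeping theta = z * theta_bar lets the L2 shrink step (1 - C*lr) touch
# only the scalar z, while the (typically sparse) gradient touches a few
# entries of theta_bar.
def regularized_sparse_step(theta_bar, z, grad_S, lr, C):
    theta_bar = theta_bar - lr / ((1.0 - C * lr) * z) * grad_S
    z = (1.0 - C * lr) * z
    return theta_bar, z

# Check equivalence with the dense update theta' = (1-C*lr)*theta - lr*grad.
theta = np.array([0.5, -0.2, 0.0])
grad_S = np.array([0.1, 0.0, -0.3])
lr, C = 0.1, 0.5
theta_bar, z = theta.copy(), 1.0
theta_bar, z = regularized_sparse_step(theta_bar, z, grad_S, lr, C)
dense = (1.0 - C * lr) * theta - lr * grad_S
print(np.allclose(z * theta_bar, dense))   # -> True
```

The design point is that the dense update rescales every parameter each step, while the sparse form costs only O(nnz(grad)) per step plus one scalar multiply.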
4 Experiments

4.1 Settings

We conduct our experiments with the following settings:

Implementation: We implemented our algorithms based on the CRFsuite toolkit (Okazaki, 2007), in which SGD for the log-likelihood loss with L2 regularization is already implemented. This can be done easily, since our algorithms only require modifying the original gradients; other parts of the code remain unchanged.

Task: Our experiments are conducted on the widely used CoNLL 2000 chunking shared task (Sang and Buchholz, 2000). The training and test data contain 8,939 and 2,012 sentences respectively. For fair comparison, we ran our experiments with the two standard linear-chain CRF feature sets implemented in CRFsuite. The smaller feature set contains 452,755 features, whereas the larger one contains 7,385,312 features.

Baseline algorithms: We compare the performance of our algorithms summarized in Figure 2 with SGD, L-BFGS, and the Passive-Aggressive (PA) algorithm. Except for L-BFGS, which is a second-order batch-learning algorithm, we randomly shuffled the data and repeated each experiment five times.

Hyper-parameter selection: For the convex function G1 we choose ε = 0.1; correspondingly, the arctan function in U2*.G1 is arctan(3.16u). We have also experimented with the function arctan(10u) for this update, following the heuristic that the transformation function u/(u² + 0.1) given by ∇²G1 and arctan(10u) have consistent gradients at zero (the two settings are denoted by arctan.1 and arctan.2 respectively). Following the same heuristic, we choose α = 5√π ≈ 8.86 for the erf function and β = 10 for the gd function.

[Figure 3: F-scores of training and test sets given by the baseline and proposed algorithms, using the small feature set.]

Comparison with the baseline: We compare the performance of the proposed and baseline algorithms on the training and test sets in Figure 3 and Figure 4, corresponding to the small and large feature sets respectively. The plots show the F-scores on the training and test sets for the first 50 epochs. To keep the plots neat, we only show the average F-scores of the repeated experiments after each epoch, and omit the standard-deviation error bars. For the U2*.G1 update, only the arctan.1 function is reported here. From the figures we observe that:

1. The strongest baseline is given by SGD. […] the training data, while L-BFGS converges very slowly, although it eventually catches up.

2. Although the SGD baseline is already very […] both of the proposed algorithms U1*.G1 and U2*.G1 […]

Since arctan, erf and gd are all sigmoid functions, it is interesting to see how their behaviors differ. We compare the updates U2*.G1 (for both arctan.1 and arctan.2), U2*.G2 and U2*.G3 in Figure 5 and Figure 6. The strongest baseline, SGD, is also included for comparison. From the figures we have the following observations:

1. As expected, the sigmoid functions demonstrate similar behaviors, and the performances of their corresponding updates are almost indistinguishable. Similar to U2*.G1, both U2*.G2 and U2*.G3 outperform the SGD baseline.

2. The performances of U2*.G1 given by the arctan functions are insensitive to the choice of the hyper-parameters. Although we did not run similar experiments for the erf and gd functions, similar properties can be expected from their corresponding updates.

Finally, we report in Table 2 the F-scores on the test set given by all algorithms after convergence.

[Figure 6: F-scores of training and test sets given by U2*.G1, U2*.G2 and U2*.G3, using the large feature set.]

5 Conclusion

We have proposed a novel parameter estimation framework for CRF. By defining loss functions using Bregman divergences, we are given the opportunity to select convex functions that transform the gradient of the log-likelihood loss, which leads to more effective parameter learning if the function is properly chosen. Minimization of the Bregman loss function is made possible by natural gradient descent.
[Figure 4: F-scores of training and test sets given by the baseline and proposed algorithms, using the large feature set.]

[Figure 5: F-scores of training and test sets given by U2*.G1, U2*.G2 and U2*.G3, using the small feature set.]