Figure 1: An example of transition-based dependency parsing. Above left: a desired dependency tree; above right: an intermediate configuration; bottom: a transition sequence of the arc-standard system.
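The arc-standard system sketched in Figure 1 can be illustrated with a minimal toy implementation. This is only a sketch; the class and method names below are ours, not the paper's:

```python
# Minimal sketch of the arc-standard transition system from Figure 1.
# A configuration holds a stack, a buffer, and the set of arcs built so far;
# SHIFT moves a word from the buffer to the stack, while LEFT-ARC and
# RIGHT-ARC attach the two topmost stack words (s1 = top of stack).

class Configuration:
    def __init__(self, words):
        self.stack = []            # partially processed words
        self.buffer = list(words)  # words still to be read
        self.arcs = []             # (head, dependent, label) triples

    def shift(self):
        self.stack.append(self.buffer.pop(0))

    def left_arc(self, label):
        # s2 becomes a dependent of s1 (top of stack); s2 is removed
        s1, s2 = self.stack[-1], self.stack.pop(-2)
        self.arcs.append((s1, s2, label))

    def right_arc(self, label):
        # s1 (top of stack) becomes a dependent of s2; s1 is removed
        s1 = self.stack.pop()
        s2 = self.stack[-1]
        self.arcs.append((s2, s1, label))

    def is_terminal(self):
        return not self.buffer and len(self.stack) == 1
```

For a toy sentence "He has control", the sequence SHIFT, SHIFT, LEFT-ARC(nsubj), SHIFT, RIGHT-ARC(dobj) leaves "has" alone on the stack with two arcs built, mirroring the style of the transition sequence at the bottom of Figure 1.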
… commonly used feature templates; however, it could indicate that we cannot perform a RIGHT-ARC action if there is an arc from s_1 to b_2.

• Expensive feature computation. The feature generation of indicator features is generally expensive — we have to concatenate some words, POS tags, or arc labels to generate feature strings, and look them up in a huge table containing several millions of features. In our experiments, more than 95% of the time is consumed by feature computation during the parsing process.

So far, we have discussed preliminaries of transition-based dependency parsing.

… and the full embedding matrix is E^w ∈ R^{d×N_w}, where N_w is the dictionary size. Meanwhile, we also map POS tags and arc labels to a d-dimensional vector space, where e^t_i, e^l_j ∈ R^d are the representations of the i-th POS tag and j-th arc label. Correspondingly, the POS and label embedding matrices are E^t ∈ R^{d×N_t} and E^l ∈ R^{d×N_l}, where N_t and N_l are the number of distinct POS tags and arc labels.

We choose a set of elements based on the stack / buffer positions for each type of information (word, POS or label) which might be useful for our predictions. We denote the sets as S^w, S^t, S^l respectively. For example, given the configuration in Figure 2 and S^t = {lc_1(s_2).t, s_2.t, rc_1(s_2).t, s_1.t}, we will extract the POS tags of these four elements in order.

[Figure 2: Our neural network architecture. Input layer [x^w, x^t, x^l] → Hidden layer h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3 (cube activation function) → Softmax layer p = softmax(W_2 h).]

where W_1^w ∈ R^{d_h×(d·n_w)}, W_1^t ∈ R^{d_h×(d·n_t)}, W_1^l ∈ R^{d_h×(d·n_l)}, and b_1 ∈ R^{d_h} is the bias. A softmax layer is finally added on top of the hidden layer for modeling multi-class probabilities p = softmax(W_2 h), where W_2 ∈ R^{|T|×d_h}.

POS and label embeddings

To the best of our knowledge, this is the first attempt to introduce POS tag and arc label embeddings instead of discrete representations. Although the POS tags P = {NN, NNP, NNS, DT, JJ, …} (for English) and arc labels L = {amod, tmod, nsubj, csubj, dobj, …} (for Stanford Dependencies on English) are relatively small discrete sets, they still exhibit many semantic similarities, like words. For example, NN (singular noun) should be closer to NNS (plural noun) than to DT (determiner).

Cube activation function

As stated above, we introduce a novel activation function, cube g(x) = x^3, in our model instead of the commonly used tanh or sigmoid functions (Figure 3).

Intuitively, every hidden unit is computed by a (non-linear) mapping on a weighted sum of input units plus a bias. Using g(x) = x^3 can model the product terms x_i x_j x_k for any three different elements at the input layer directly:

g(w_1 x_1 + … + w_m x_m + b) = Σ_{i,j,k} (w_i w_j w_k) x_i x_j x_k + b Σ_{i,j} (w_i w_j) x_i x_j + …

In our case, x_i, x_j, x_k could come from different dimensions of three embeddings. We believe that this better captures the interaction of three elements, which is a very desired property of dependency parsing.

Experimental results also verify the success of the cube activation function empirically (see more comparisons in Section 4). However, the expressive power of this activation function is still open to investigate theoretically.

The choice of S^w, S^t, S^l

Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, S^w contains n_w = 18 elements: (1) the top 3 words on the stack and buffer: s_1, s_2, s_3, b_1, b_2, b_3; (2) the first and second leftmost / rightmost children of the top two words on the stack: lc_1(s_i), rc_1(s_i), lc_2(s_i), rc_2(s_i), i = 1, 2; (3) the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc_1(lc_1(s_i)), rc_1(rc_1(s_i)), i = 1, 2.

We use the corresponding POS tags for S^t (n_t = 18), and the corresponding arc labels of words excluding those 6 words on the stack/buffer for S^l (n_l = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.

3.2 Training

We first generate training examples {(c_i, t_i)}_{i=1}^m from the training sentences and their gold parse trees using a "shortest stack" oracle which always prefers LEFT-ARC_l over SHIFT, where c_i is a configuration and t_i ∈ T is the oracle transition.

The final training objective is to minimize the cross-entropy loss, plus an ℓ2-regularization term:

L(θ) = − Σ_i log p_{t_i} + (λ/2) ‖θ‖²

where θ is the set of all parameters {W_1^w, W_1^t, W_1^l, b_1, W_2, E^w, E^t, E^l}. A slight variation is that in practice we compute the softmax probabilities only among the feasible transitions.

For initialization of parameters, we use pre-trained word embeddings to initialize E^w and use random initialization within (−0.01, 0.01) for E^t and E^l. Concretely, we use the pre-trained word embeddings from (Collobert et al., 2011) for English (#dictionary = 130,000, coverage = 72.7%), and our trained 50-dimensional word2vec embeddings (Mikolov et al., 2013) on the Wikipedia and Gigaword corpus for Chinese (#dictionary = 285,791, coverage = 79.0%). We will also compare with random initialization of E^w in Section 4. The training error derivatives will be back-propagated to these embeddings during the training process.

We use mini-batched AdaGrad (Duchi et al., 2011) for optimization and also apply dropout (Hinton et al., 2012) with a 0.5 rate. The parameters which achieve the best unlabeled attachment score on the development set will be chosen for final evaluation.

3.3 Parsing

We perform greedy decoding in parsing. At each step, we extract all the corresponding word, POS and label embeddings from the current configuration c, compute the hidden layer h(c) ∈ R^{d_h}, and pick the transition with the highest score: t = arg max_{t feasible} W_2(t, ·) h(c), and then execute c → t(c).

Compared with indicator features, our parser does not need to compute conjunction features and look them up in a huge feature table, and thus greatly reduces feature generation time. Instead, it involves many matrix addition and multiplication operations. To further speed up parsing, we apply a pre-computation trick similar to (Devlin et al., 2014). For each position chosen from S^w, we pre-compute the matrix multiplications for the 10,000 most frequent words. Computing the hidden layer then only requires looking up the table for these frequent words and adding the d_h-dimensional vectors. Similarly, we also pre-compute the matrix computations for all positions and all POS tags and arc labels. We only use this optimization in the neural network parser; it is only feasible for a parser like ours which uses a small number of features. In practice, this pre-computation step increases the speed of our parser 8 ∼ 10 times.

4 Experiments

4.1 Datasets

We conduct our experiments on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB) datasets.

For English, we follow the standard splits of PTB3, using sections 2-21 for training, section 22 as development set and 23 as test set. We adopt two different dependency representations: CoNLL Syntactic Dependencies (CD) (Johansson
Dataset #Train #Dev #Test #words (Nw ) #POS (Nt ) #labels (Nl ) projective (%)
PTB: CD 39,832 1,700 2,416 44,352 45 17 99.4
PTB: SD 39,832 1,700 2,416 44,389 45 45 99.9
CTB 16,091 803 1,910 34,577 35 12 100.0
Table 3: Data Statistics. “Projective” is the percentage of projective trees on the training set.
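As a rough illustration of the model described in Section 3 (concatenated embedding inputs, a hidden layer with the cube activation g(x) = x^3, and a softmax over transitions), the forward pass can be sketched in pure Python. Dimensions and function names here are illustrative only, not the paper's implementation:

```python
# Toy forward pass of the scoring network: three input blocks (word, POS,
# label embeddings) are mapped through a cube-activation hidden layer,
# then a softmax over transitions. Matrices are lists of rows.
import math

def affine(W, x, b=None):
    """Matrix-vector product W x (+ b)."""
    out = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    if b is not None:
        out = [o + b_i for o, b_i in zip(out, b)]
    return out

def cube_forward(W1w, xw, W1t, xt, W1l, xl, b1, W2):
    """Return transition probabilities p = softmax(W2 h), h = (sum + b1)^3."""
    pre = [a + b + c + d for a, b, c, d in
           zip(affine(W1w, xw), affine(W1t, xt), affine(W1l, xl), b1)]
    h = [v ** 3 for v in pre]                 # cube activation g(x) = x^3
    scores = affine(W2, h)
    m = max(scores)                           # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Greedy parsing (Section 3.3) would then pick the arg-max among the feasible transitions instead of the full softmax.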
and Nugues, 2007) using the LTH Constituent-to-Dependency Conversion Tool[3] and Stanford Basic Dependencies (SD) (de Marneffe et al., 2006) using the Stanford parser v3.3.0.[4] The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data (accuracy ≈ 97.3%).

For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Dependencies are converted using the Penn2Malt tool[5] with the head-finding rules of (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.

Table 3 gives statistics of the three datasets.[6] In particular, over 99% of the trees are projective in all datasets.

4.2 Results

The following hyper-parameters are used in all experiments: embedding size d = 50, hidden layer size h = 200, regularization parameter λ = 10^{−8}, and initial learning rate of AdaGrad α = 0.01.

To situate the performance of our parser, we first make a comparison with our own implementation of greedy arc-eager and arc-standard parsers. These parsers are trained with the structured averaged perceptron using the "early-update" strategy. The feature templates of (Zhang and Nivre, 2011) are used for the arc-eager system, and they are also adapted to the arc-standard system.[7]

Furthermore, we also compare our parser with two popular, off-the-shelf parsers: MaltParser — a greedy transition-based dependency parser (Nivre et al., 2006)[8] — and MSTParser — a first-order graph-based parser (McDonald and Pereira, 2006).[9] In this comparison, for MaltParser, we select stackproj (arc-standard) and nivreeager (arc-eager) as parsing algorithms, and liblinear (Fan et al., 2008) for optimization.[10] For MSTParser, we use default options.

On all datasets, we report unlabeled attachment scores (UAS) and labeled attachment scores (LAS); punctuation is excluded from all evaluation metrics.[11] Our parser and the baseline arc-standard and arc-eager parsers are all implemented in Java. The parsing speeds are measured on an Intel Core i7 2.7GHz CPU with 16GB RAM, and the runtime does not include pre-computation or parameter loading time.

Table 4, Table 5 and Table 6 show the comparison of accuracy and parsing speed on PTB (CoNLL dependencies), PTB (Stanford dependencies) and CTB respectively.

Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     89.9     88.7     89.7      88.3        51
eager        90.3     89.2     89.9      88.6        63
Malt:sp      90.0     88.8     89.9      88.5       560
Malt:eager   90.1     88.9     90.1      88.7       535
MSTParser    92.1     90.8     92.0      90.5        12
Our parser   92.2     91.0     92.0      90.7      1013

Table 4: Accuracy and parsing speed on PTB + CoNLL dependencies.

Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     90.2     87.8     89.4      87.3        26
eager        89.8     87.4     89.6      87.4        34
Malt:sp      89.8     87.2     89.3      86.9       469
Malt:eager   89.6     86.9     89.4      86.8       448
MSTParser    91.4     88.1     90.7      87.6        10
Our parser   92.0     89.7     91.8      89.6       654

Table 5: Accuracy and parsing speed on PTB + Stanford dependencies.

Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     82.4     80.9     82.7      81.2        72
eager        81.1     79.7     80.3      78.7        80
Malt:sp      82.4     80.5     82.4      80.6       420
Malt:eager   81.2     79.3     80.2      78.4       393
MSTParser    84.0     82.1     83.0      81.2         6
Our parser   84.0     82.4     83.9      82.4       936

Table 6: Accuracy and parsing speed on CTB.

Clearly, our parser is superior in terms of both accuracy and speed. Compared with the baseline arc-eager and arc-standard parsers, our parser achieves around 2% improvement in UAS and LAS on all datasets, while running about 20 times faster.

It is worth noting that the efficiency of our parser even surpasses MaltParser using liblinear, which is known to be highly optimized, while our parser achieves much better accuracy.

Also, despite the fact that the graph-based MSTParser achieves a similar result to ours on PTB (CoNLL dependencies), our parser is nearly 100 times faster. In particular, our transition-based parser has a great advantage in LAS, especially for the fine-grained label set of Stanford dependencies.

4.3 Effects of Parser Components

Herein, we examine the components that account for the performance of our parser.

Cube activation function

We compare our cube activation function (x^3) with two widely used non-linear functions: tanh ((e^x − e^{−x}) / (e^x + e^{−x})) and sigmoid (1 / (1 + e^{−x})), and also the identity function (x), as shown in Figure 4 (left).

In short, cube outperforms all other activation functions significantly, and identity works the worst. Concretely, cube can achieve 0.8% ∼ 1.2% improvement in UAS over tanh and the other functions, thus verifying the effectiveness of the cube activation function empirically.

Initialization of pre-trained word embeddings

We further analyze the influence of using pre-trained word embeddings for initialization. Figure 4 (middle) shows that using pre-trained word embeddings can obtain around 0.7% improvement on PTB and 1.7% improvement on CTB, compared with using random initialization within (−0.01, 0.01). On the one hand, the pre-trained word embeddings of Chinese appear more useful than those of English; on the other hand, our model is still able to achieve comparable accuracy without the help of pre-trained word embeddings.

POS tag and arc label embeddings

As shown in Figure 4 (right), POS embeddings yield around 1.7% improvement on PTB and nearly 10% improvement on CTB, and the label embeddings yield a much smaller 0.3% and 1.4% improvement respectively.

However, we can obtain little gain from label embeddings when the POS embeddings are present. This may be because the POS tags of two tokens already capture most of the label information between them.

4.4 Model Analysis

Last but not least, we will examine the parameters we have learned, and hope to investigate what these dense features capture. We use the weights learned from the English Penn Treebank using Stanford dependencies for analysis.

What do E^t, E^l capture?

We first introduced E^t and E^l as the dense representations of all POS tags and arc labels, and we wonder whether these embeddings could carry some semantic information.

Figure 5 presents t-SNE visualizations (van der Maaten and Hinton, 2008) of these embeddings. It clearly shows that these embeddings effectively exhibit the similarities between POS tags or arc labels. For instance, the three adjective POS tags JJ, JJR, JJS have very close embeddings, and also the three labels representing clausal complements acomp, ccomp, xcomp are grouped together.

Since these embeddings can effectively encode the semantic regularities, we believe that they can also be used as alternative features of POS tags (or arc labels) in other NLP tasks, and help boost the performance.

Footnotes:
[3] http://nlp.cs.lth.se/software/treebank_converter/
[4] http://nlp.stanford.edu/software/lex-parser.shtml
[5] http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
[6] Pennconverter and Stanford dependencies generate slightly different tokenization, e.g., Pennconverter splits the token WCRS\/Boston_NNP into three tokens WCRS_NNP /_CC Boston_NNP.
[7] Since arc-standard is bottom-up, we remove all features using the head of stack elements, and also add the right child features of the first stack element.
[8] http://www.maltparser.org/
[9] http://www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html
[10] We do not compare with libsvm optimization, which is known to be slightly more accurate, but orders of magnitude slower (Kong and Smith, 2014).
[11] A token is a punctuation if its gold POS tag is {“ ” : , .} for English and PU for Chinese.
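The UAS/LAS metrics reported in Section 4.2 can be sketched as follows. This is an illustrative implementation, not the paper's evaluation script; the representation of tokens and punctuation is an assumption:

```python
# Hedged sketch of the attachment-score metrics: UAS is the fraction of
# (non-punctuation) tokens assigned the correct head, and LAS additionally
# requires the correct arc label. Tokens are (head_index, label) pairs.
def attachment_scores(gold, pred, punct=()):
    """gold, pred: per-token (head, label) lists; punct: indices to skip."""
    total = uas = las = 0
    for i, ((gh, gl), (ph, pl)) in enumerate(zip(gold, pred)):
        if i in punct:
            continue  # punctuation excluded from evaluation
        total += 1
        if gh == ph:
            uas += 1
            if gl == pl:
                las += 1
    return uas / total, las / total
```

For example, with one wrong head among three scored tokens, both UAS and LAS come out to 2/3.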
What do W_1^w, W_1^t, W_1^l capture?

Knowing that E^t and E^l (as well as the word embeddings E^w) can capture semantic information very well, we next hope to investigate what each feature in the hidden layer has really learned.

Since we currently only have h = 200 learned dense features, we wonder whether this is sufficient to learn the word conjunctions of sparse indicator features, or even more. We examine the weights W_1^w(k, ·) ∈ R^{d·n_w}, W_1^t(k, ·) ∈ R^{d·n_t}, W_1^l(k, ·) ∈ R^{d·n_l} for each hidden unit k, and reshape them to d × n_w, d × n_t, d × n_l matrices, such that the weights of each column correspond to the embeddings of one specific element (e.g., s_1.t).

We pick the weights with absolute value > 0.2, and visualize them for each feature. Figure 6 gives the visualization of three sampled features, and it exhibits many interesting phenomena:

• Different features have varied distributions of the weights. However, most of the discriminative weights come from W_1^t (the middle zone in Figure 6), and this further justifies the importance of POS tags in dependency parsing.

• We carefully examine many of the h = 200 features, and find that they actually encode very different views of information. For the three sampled features in Figure 6, the largest weights are dominated by:

  – Feature 1: s_1.t, s_2.t, lc(s_1).t.
  – Feature 2: rc(s_1).t, s_1.t, b_1.t.
  – Feature 3: s_1.t, s_1.w, lc(s_1).t, lc(s_1).l.

  These features all seem very plausible, as observed in the experiments on indicator feature systems. Thus our model is able to automatically identify the most useful information for predictions, instead of hand-crafting them as indicator features.

• More importantly, we can easily extract features regarding the conjunctions of more than 3 elements, including those not present in the indicator feature systems. For example, the 3rd feature above captures the conjunction of the word and POS tag of s_1, the tag of its leftmost child, and also the label between them, while this information is not encoded in the original feature templates of (Zhang and Nivre, 2011).

5 Related Work

There have been several lines of earlier work in using neural networks for parsing which have points of overlap with, but also major differences from, our work here. One big difference is that much early work uses localist one-hot word representations rather than the distributed representations of modern work. (Mayberry III and Miikkulainen, 1999) explored a shift-reduce constituency parser with one-hot word representations, and did subsequent parsing work in (Mayberry III and Miikkulainen, 2005).

(Henderson, 2004) was the first to attempt to use neural networks in a broad-coverage Penn Treebank parser, using a simple synchrony network to predict parse decisions in a constituency parser. More recently, (Titov and Henderson, 2007) applied Incremental Sigmoid Belief Networks to constituency parsing, and then (Garg and Henderson, 2011) extended this work to transition-based dependency parsers using a Temporal Restricted Boltzmann Machine. These are very different neural network architectures, are much less scalable, and in practice a restricted vocabulary was used to make the architecture practical.

There have been a number of recent uses of deep learning for constituency parsing (Collobert, 2011; Socher et al., 2013). (Socher et al., 2014) has also built models over dependency representations, but this work has not attempted to learn neural networks for dependency parsing.

Most recently, (Stenetorp, 2013) attempted to build recursive neural networks for transition-based dependency parsing; however, the empirical performance of his model is still unsatisfactory.

6 Conclusion

We have presented a novel dependency parser using neural networks. Experimental evaluations show that our parser outperforms other greedy parsers using sparse indicator features in both accuracy and speed. This is achieved by representing all words, POS tags and arc labels as dense vectors, and modeling their interactions through a novel cube activation function. Our model only relies on dense features, and is able to automatically learn the most useful feature conjunctions for making predictions.

An interesting line of future work is to combine our neural network based classifier with search-based models to further improve accuracy. Also,
[Three panels of UAS scores; plot data omitted.]

Figure 4: Effects of different parser components. Left: comparison of different activation functions. Middle: comparison of pre-trained word vectors and random initialization. Right: effects of POS and label embeddings.
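The initialization contrast in Figure 4 (middle) — pre-trained vectors where available, uniform (−0.01, 0.01) noise elsewhere — can be sketched as follows; the function and argument names are hypothetical:

```python
# Illustrative sketch of word-embedding initialization: every dictionary
# word starts from uniform(-0.01, 0.01) noise, and words found in a
# pre-trained table have their rows overwritten. "Coverage" is the
# fraction of dictionary words that had a pre-trained vector.
import random

def init_word_embeddings(dictionary, pretrained, d, seed=0):
    rng = random.Random(seed)
    E, hits = {}, 0
    for w in dictionary:
        if w in pretrained:
            E[w] = list(pretrained[w])   # copy the pre-trained vector
            hits += 1
        else:
            E[w] = [rng.uniform(-0.01, 0.01) for _ in range(d)]
    return E, hits / len(dictionary)     # embeddings, coverage
```

With this framing, the paper's reported figures (72.7% coverage for English, 79.0% for Chinese) are simply the second return value on the respective dictionaries.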
[Figure 5: t-SNE visualizations of the POS tag embeddings (left) and arc label embeddings (right); plot data omitted.]
Figure 6: Three sampled features. In each feature, each row denotes a dimension of embeddings and each column denotes a chosen element, e.g., s_1.t or lc(s_1).w, and the parameters are divided into 3 zones, corresponding to W_1^w(k, :) (left), W_1^t(k, :) (middle) and W_1^l(k, :) (right). White and black dots denote the most positive and the most negative weights, respectively.
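The inspection behind Figure 6 — reshaping one hidden unit's weight row so that each column aligns with one chosen element, then keeping weights with absolute value > 0.2 — might look like the sketch below. The element-major memory layout is our assumption, and the names are illustrative:

```python
# Sketch of per-hidden-unit weight inspection: a flat weight row of length
# d * n is split into n columns of d weights (one column per chosen element,
# assuming element-major concatenation), and elements with any weight above
# the threshold in absolute value are flagged as salient.
def inspect_hidden_unit(row, d, n, threshold=0.2):
    cols = [[row[j * d + i] for i in range(d)] for j in range(n)]
    strong = {j for j, col in enumerate(cols)
              if any(abs(w) > threshold for w in col)}
    return cols, strong  # per-element weight columns, salient element indices
```

Running this over all h = 200 hidden units and the three weight blocks W_1^w, W_1^t, W_1^l would reproduce the kind of thresholded maps shown in Figure 6.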
there is still room for improvement in our architecture, such as better capturing word conjunctions, or adding richer features (e.g., distance, valency).

Acknowledgments

Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040 and the Defense Threat Reduction Agency (DTRA) under Air Force Research Laboratory (AFRL) contract no. FA8650-10-C-7020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.

References

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Coling.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In AISTATS.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. The Journal of Machine Learning Research.

Nikhil Garg and James Henderson. 2011. Temporal restricted boltzmann machines for dependency parsing. In ACL-HLT.

He He, Hal Daumé III, and Jason Eisner. 2013. Dynamic feature selection for dependency parsing. In EMNLP.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In ACL.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Liang Huang, Wenbin Jiang, and Qun Liu. 2009. Bilingually-constrained (monolingual) shift-reduce parsing. In EMNLP.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA, Tartu, Estonia.

Lingpeng Kong and Noah A. Smith. 2014. An empirical comparison of parsing methods for Stanford dependencies. CoRR, abs/1404.4314.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In ACL.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

Marshall R. Mayberry III and Risto Miikkulainen. 1999. Sardsrn: A neural network shift-reduce parser. In IJCAI.

Marshall R. Mayberry III and Risto Miikkulainen. 2005. Broad-coverage parsing with neural networks. Neural Processing Letters.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. Maltparser: A data-driven parser-generator for dependency parsing. In LREC.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL.

Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In NIPS Workshop on Deep Learning.

Ivan Titov and James Henderson. 2007. Fast and robust multilingual dependency parsing with a generative latent variable model. In EMNLP-CoNLL.
Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. The Journal of Machine Learning Research.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In ACL.