
Orthographic Features for Bilingual Lexicon Induction

Parker Riley and Daniel Gildea


Department of Computer Science
University of Rochester
Rochester, NY 14627

Abstract

Recent embedding-based methods in bilingual lexicon induction show good results, but do not take advantage of orthographic features, such as edit distance, which can be helpful for pairs of related languages. This work extends embedding-based methods to incorporate these features, resulting in significant accuracy gains for related languages.

1 Introduction

Over the past few years, new methods for bilingual lexicon induction have been proposed that are applicable to low-resource language pairs, for which very little sentence-aligned parallel data is available. Parallel data can be very expensive to create, so methods that require less of it or that can utilize more readily available data are desirable.

One prevalent strategy involves creating multilingual word embeddings, where each language's vocabulary is embedded in the same latent space (Vulić and Moens, 2013; Mikolov et al., 2013a; Artetxe et al., 2016); however, many of these methods still require a strong cross-lingual signal in the form of a large seed dictionary.

More recent work has focused on reducing that constraint. Vulić and Moens (2016) and Vulić and Korhonen (2016) use document-aligned data to learn bilingual embeddings instead of a seed dictionary. Artetxe et al. (2017) use a very small, automatically-generated seed lexicon of identical numerals as the initialization in an iterative self-learning framework to learn a linear mapping between monolingual embedding spaces; Zhang et al. (2017) use an adversarial training method to learn a similar mapping. Lample et al. (2018a) use a series of techniques to align monolingual embedding spaces in a completely unsupervised way; their method is used by Lample et al. (2018b) as the initialization for a completely unsupervised machine translation system.

These recent advances in unsupervised bilingual lexicon induction show promise for use in low-resource contexts. However, none of them make use of linguistic features of the languages themselves (with the arguable exception of syntactic/semantic information encoded in the word embeddings). This is in contrast to earlier work, predating many of these embedding-based methods, that leveraged linguistic features such as edit distance and orthographic similarity: Dyer et al. (2011) and Berg-Kirkpatrick et al. (2010) investigate using linguistic features for word alignment, and Haghighi et al. (2008) use linguistic features for unsupervised bilingual lexicon induction. These features can help identify words with common ancestry (such as the English-Italian pair agile-agile) and borrowed words (macaroni-maccheroni).

The addition of linguistic features led to increased performance in these earlier models, especially for related languages, yet these features have not been applied to more modern methods. In this work, we extend the modern embedding-based approach of Artetxe et al. (2017) with orthographic information in order to leverage similarities between related languages for increased accuracy in bilingual lexicon induction.

2 Background

This work is directly based on that of Artetxe et al. (2017). Following their work, let X ∈ R^{|Vs|×d} and Z ∈ R^{|Vt|×d} be the word embedding matrices of two distinct languages, referred to respectively as the source and target, such that each row corresponds to the d-dimensional embedding of a single word. We refer to the ith row of one of

390
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 390–394
Melbourne, Australia, July 15–20, 2018. © 2018 Association for Computational Linguistics
these matrices as Xi∗ or Zi∗. The vocabularies for each language are Vs and Vt, respectively. Also let D ∈ {0, 1}^{|Vs|×|Vt|} be a binary matrix representing a dictionary such that Dij = 1 if the ith word in the source language is aligned with the jth word in the target language. We wish to find a mapping matrix W ∈ R^{d×d} that maps source embeddings onto their aligned target embeddings. Artetxe et al. (2017) define the optimal mapping matrix W∗ with the following equation,

W∗ = arg min_W Σi Σj Dij ‖Xi∗W − Zj∗‖²

which minimizes the sum of the squared Euclidean distances between mapped source embeddings and their aligned target embeddings.

By normalizing and mean-centering X and Z, and enforcing that W be an orthogonal matrix (WᵀW = I), the above formulation becomes equivalent to maximizing the dot product between the mapped source embeddings and target embeddings, such that

W∗ = arg max_W Tr(XWZᵀDᵀ)

where Tr(·) is the trace operator, the sum of all diagonal entries. The optimal solution to this equation is W∗ = UVᵀ, where XᵀDZ = UΣVᵀ is the singular value decomposition of XᵀDZ.

This formulation requires a seed dictionary. To reduce the need for a large seed dictionary, Artetxe et al. (2017) propose an iterative, self-learning framework that determines W as above, uses it to calculate a new dictionary D, and then iterates until convergence. In the dictionary induction step, they set Dij = 1 if j = arg max_k (Xi∗W) · Zk∗, and Dij = 0 otherwise.

We propose two methods for extending this system using orthographic information, described in the following two sections.

3 Orthographic Extension of Word Embeddings

This method augments the embeddings for all words in both languages before using them in the self-learning framework of Artetxe et al. (2017). To do this, we append to each word's embedding a vector of length equal to the size of the union of the two languages' alphabets. Each position in this vector corresponds to a single letter, and its value is set to the count of that letter within the spelling of the word. This letter count vector is then scaled by a constant before being appended to the base word embedding. After appending, the resulting augmented vector is normalized to have magnitude 1.

Mathematically, let A be an ordered set of characters (an alphabet), containing all characters appearing in both languages' alphabets:

A = Asource ∪ Atarget

Let Osource and Otarget be the orthographic extension matrices for each language, containing counts of the characters appearing in each word wi, scaled by a constant factor ce:

Oij = ce · count(Aj, wi), O ∈ {Osource, Otarget}

Then, we concatenate the embedding matrices and extension matrices:

X′ = [X; Osource], Z′ = [Z; Otarget]

Finally, in the normalized embedding matrices X″ and Z″, each row has magnitude 1:

X″i∗ = X′i∗ / ‖X′i∗‖, Z″i∗ = Z′i∗ / ‖Z′i∗‖

These new matrices are used in place of X and Z in the self-learning process.

4 Orthographic Similarity Adjustment

This method modifies the similarity score for each word pair during the dictionary induction phase of the self-learning framework of Artetxe et al. (2017), which uses the dot product of two words' embeddings to quantify similarity. We modify this similarity score by adding a measure of orthographic similarity, which is a function of the normalized string edit distance of the two words. The normalized edit distance is defined as the Levenshtein distance (L(·, ·)) (Levenshtein, 1966) divided by the length of the longer word. The Levenshtein distance represents the minimum number of insertions, deletions, and substitutions required to transform one word into the other. The normalized edit distance function is denoted as NL(·, ·).

NL(w₁, w₂) = L(w₁, w₂) / max(|w₁|, |w₂|)

We define the orthographic similarity of two words w₁ and w₂ as log(2.0 − NL(w₁, w₂)).

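Similarly, the letter-count extension of Section 3 can be sketched as follows (a sketch under our own naming; `scale` plays the role of ce, and the alphabet A is passed in explicitly):

```python
import numpy as np

def extend_embeddings(emb: np.ndarray, words: list, alphabet: list,
                      scale: float) -> np.ndarray:
    """Append a scaled letter-count vector to each embedding row,
    then renormalize every augmented row to magnitude 1."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    counts = np.zeros((emb.shape[0], len(alphabet)))
    for row, word in enumerate(words):
        for ch in word:
            if ch in index:
                counts[row, index[ch]] += 1
    augmented = np.hstack([emb, scale * counts])        # X' = [X; O]
    norms = np.linalg.norm(augmented, axis=1, keepdims=True)
    return augmented / norms                            # X'': unit-length rows
```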
[Figure 1: two line plots of Development Translation Accuracy (%) vs. Scaling Factor for English-German, English-Italian, and English-Finnish: (a) Embedding Extension, (b) Similarity Adjustment]

Figure 1: Performance on development data vs. scaling factors ce and cs. The lowest tested value for both was 10⁻⁶.

These similarity scores are used to form an orthographic similarity matrix S, where each entry corresponds to a source-target word pair. Each entry is first scaled by a constant factor cs. This matrix is added to the standard similarity matrix, XWZᵀ.

Sij = cs · log(2.0 − NL(wi, wj)), wi ∈ Vs, wj ∈ Vt

The vocabulary for each language is 200,000 words, so computing a similarity score for each pair would involve 40 billion edit distance calculations. Also, the vast majority of word pairs are orthographically very dissimilar, resulting in a normalized edit distance close to 1 and an orthographic similarity close to 0, having little to no effect on the overall estimated similarity. Therefore, we only calculate the edit distance for a subset of possible word pairs.

Thus, the actual orthographic similarity matrix that we use is as follows:

S′ij = Sij if ⟨wi, wj⟩ ∈ symDelete(Vt, Vs, k), and S′ij = 0 otherwise.

This subset of word pairs was chosen using an adaptation of the Symmetric Delete spelling correction algorithm described by Garbe (2012), which we denote as symDelete(·, ·, ·). This algorithm takes as arguments the target vocabulary, source vocabulary, and a constant k, and identifies all source-target word pairs that are identical after k or fewer deletions from each word; that is, all pairs where each is reachable from the other with no more than k insertions and k deletions. For example, the Italian-English pair moderno-modern will be identified with k = 1, and the pair tollerante-tolerant will be identified with k = 2.

The algorithm works by computing all strings formed by k or fewer deletions from each target word and storing them in a hash table; it then does the same for each source word and generates source-target pairs that share an entry in the hash table. The complexity of this algorithm can be expressed as O(|V|·l^k), where V = Vt ∪ Vs is the combined vocabulary and l is the length of the longest word in V. This is linear with respect to the vocabulary size, as opposed to the quadratic complexity required for computing the entire matrix. However, the algorithm is sensitive to both word length and the choice of k. In our experiments, we found that ignoring all words of length greater than 30 allowed the algorithm to complete very quickly while skipping less than 0.1% of the data. We also used small values of k (0 < k < 4), and used k = 1 for our final results, finding no significant benefit from using a larger value.

5 Experiments

We use the datasets used by Artetxe et al. (2017), consisting of three language pairs: English-Italian, English-German, and English-Finnish. The English-Italian dataset was introduced in Dinu and Baroni (2014); the other datasets were created by Artetxe et al. (2017). Each dataset includes monolingual word embeddings (trained with word2vec (Mikolov et al., 2013b)) for both languages and a bilingual dictionary, separated into a training and test set. We do not use the training set as the input dictionary to the system, instead using an automatically-generated dictionary consisting only of numeral identity translations (such as 2-2, 3-3, et cetera) as in Artetxe et al. (2017).¹ However, because the methods presented in this work feature tunable hyperparameters, we use a portion of the training set as development data.²

¹ https://github.com/artetxem/vecmap

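The symDelete candidate generation described in Section 4 can be sketched as follows (an illustrative sketch with our own function names, not the authors' code):

```python
from itertools import combinations

def deletion_variants(word: str, k: int) -> set:
    """All strings formed by deleting k or fewer characters from word
    (including the word itself, i.e. zero deletions)."""
    variants = {word}
    for d in range(1, min(k, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            variants.add("".join(word[i] for i in keep))
    return variants

def sym_delete_pairs(target_vocab, source_vocab, k: int) -> set:
    """Source-target pairs sharing a string reachable by <= k deletions
    from each word: the candidate set symDelete(Vt, Vs, k)."""
    table = {}  # deletion variant -> set of target words producing it
    for t in target_vocab:
        for v in deletion_variants(t, k):
            table.setdefault(v, set()).add(t)
    pairs = set()
    for s in source_vocab:
        for v in deletion_variants(s, k):
            for t in table.get(v, ()):
                pairs.add((s, t))
    return pairs
```

With k = 1, the pair (moderno, modern) is found because deleting the final o of moderno yields modern, which modern itself also produces with zero deletions.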
Method                                      English-German  English-Italian  English-Finnish
Artetxe et al. (2017)                       40.27           39.40            26.47
Artetxe et al. (2017) + identity            51.73           44.07            42.63
Embedding extension, ce = 1/8               50.33           48.40            29.63
Embedding extension + identity, ce = 1/8    55.40           47.13            43.54
Similarity adjustment, cs = 1               43.73           39.93            28.16
Similarity adjustment + identity, cs = 1    52.20           44.27            41.99
Combined, ce = 1/8, cs = 1                  53.53           49.13            32.51
Combined + identity, ce = 1/8, cs = 1       55.53           46.27            41.78

Table 1: Comparison of methods on test data. Scaling constants ce and cs were selected based on performance on development data over all three language pairs. The last two rows report the results of using both methods together.

Source Word Our Prediction (Language) Incorrect Baseline Prediction (Translation)


caesium cäsium (German) isotope (isotope)
unevenly ungleichmäßig (German) gleichmäßig (evenly)
Ethiopians Äthiopier (German) Afrikaner (Africans)
autumn autunno (Italian) primavera (spring)
Brueghel Bruegel (Italian) Dürer (Dürer)
Latvians latvialaiset (Finnish) ukrainalaiset (Ukrainians)

Table 2: Examples of pairs correctly identified by our embedding extension method that were incorrectly
translated by the system of Artetxe et al. (2017). Our system can disambiguate semantic clusters created
by word2vec.

In all experiments, a single target word is predicted for each source word, and full points are awarded if it is one of the listed correct translations. On average, the number of translations for each source (non-English) word was 1.2 for English-Italian, 1.3 for English-German, and 1.4 for English-Finnish.

6 Results and Discussion

For our experiments with orthographic extension of word embeddings, each embedding was extended by the size of the union of the alphabets of both languages. The size of this union was 199 for English-Italian, 200 for English-German, and 287 for English-Finnish.

These numbers are perhaps unintuitively high. However, the corpora include many other characters, including diacritical markings and various symbols (%, [, !, etc.), which is an indication that tokenization of the data could be improved. We did not filter these characters in this work.

For our experiments with orthographic similarity adjustment, the heuristic identified approximately 2 million word pairs for each language pair out of a possible 40 billion, resulting in significant computation savings.

Figure 1 shows the results on the development data. Based on these results, we selected ce = 1/8 and cs = 1 as our hyperparameters. The local optima were not identical for all three languages, but we felt that these values struck the best compromise among them.

Table 1 compares our methods against the system of Artetxe et al. (2017), using scaling factors selected based on development data results. Because approximately 20% of source-target pairs in the dictionary were identical, we also extended all systems to guess the identity translation if the source word appeared in the target vocabulary. This improved accuracy in most cases, with some exceptions for English-Italian. We also experimented with both methods together, and found that this was the best of the settings that did not include the identity translation component; with the identity component included, however, the embedding extension method alone was best for English-Finnish. The fact that Finnish is the only language here that is not in the Indo-European family (and has fewer words borrowed from English or its ancestors) may explain why the performance trends for English-Finnish were different from those of the other two language pairs.

In addition to identifying orthographically similar words, the extension method is capable of

² We use all source-target pairs containing one of 1,000 randomly-selected target words.

learning a mapping between source and target letters, which could partially explain its improved performance over our edit distance method.

Table 2 shows some correct translations from our system that were missed by the baseline.

7 Conclusion and Future Work

In this work, we presented two techniques (which can be combined) for improving embedding-based bilingual lexicon induction for related languages using orthographic information and no parallel data, allowing their use with low-resource language pairs. These methods increased accuracy in our experiments, with both the combined and embedding extension methods providing significant gains over the baseline system.

In the future, we want to extend this work to related languages with different alphabets (experimenting with transliteration or phonetic transcription) and to extend other unsupervised bilingual lexicon induction systems, such as that of Lample et al. (2018a).

Acknowledgments

We are grateful to the anonymous reviewers for suggesting useful additions. This research was supported by NSF grant number 1449828.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), pages 2289–2294, Austin, Texas.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-17), pages 451–462, Vancouver, Canada.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the 2010 Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-10), pages 582–590.

Georgiana Dinu and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. CoRR, abs/1412.6568.

Chris Dyer, Jonathan Clark, Alon Lavie, and Noah A. Smith. 2011. Unsupervised word alignment with arbitrary features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-11), pages 409–419.

Wolf Garbe. 2012. 1000x faster spelling correction algorithm. http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/. Accessed: 2018-02-12.

Aria Haghighi, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-08), pages 771–779.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018a. Word translation without parallel data. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8):707–710. Original in Doklady Akademii Nauk SSSR 163(4): 845–848 (1965).

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16), pages 247–257, Berlin, Germany.

Ivan Vulić and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP-13), pages 1613–1624, Seattle, Washington, USA.

Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. J. Artif. Int. Res., 55(1):953–994.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-17), pages 1959–1970, Vancouver, Canada.
