Orthographic Features for Bilingual Lexicon Induction
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 390–394
Melbourne, Australia, July 15–20, 2018. ©2018 Association for Computational Linguistics
these matrices as Xi∗ or Zi∗. The vocabularies for each language are Vs and Vt, respectively. Also let D ∈ {0, 1}^(|Vs|×|Vt|) be a binary matrix representing a dictionary such that Dij = 1 if the ith word in the source language is aligned with the jth word in the target language. We wish to find a mapping matrix W ∈ R^(d×d) that maps source embeddings onto their aligned target embeddings. Artetxe et al. (2017) define the optimal mapping matrix W∗ with the following equation:

W∗ = argmin_W Σi Σj Dij ‖Xi∗W − Zj∗‖²

spelling of the word. This letter count vector is then scaled by a constant before being appended to the base word embedding. After appending, the resulting augmented vector is normalized to have magnitude 1.

Mathematically, let A be an ordered set of characters (an alphabet) containing all characters appearing in both languages' alphabets:

A = Asource ∪ Atarget
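The extension step described above can be sketched in a few lines of plain Python. This is an illustrative reading of the text, not the authors' code: `extend_embedding`, the toy alphabet, and the scaling constant value are all hypothetical.

```python
from math import sqrt

def extend_embedding(embedding, word, alphabet, c_e=0.125):
    # Count occurrences of each alphabet character in the word's spelling.
    counts = [word.count(ch) for ch in alphabet]
    # Scale the letter-count block by the constant c_e and append it.
    augmented = list(embedding) + [c_e * c for c in counts]
    # Renormalize the augmented vector to magnitude 1.
    norm = sqrt(sum(x * x for x in augmented))
    return [x / norm for x in augmented]

# Toy example: a 4-dimensional base embedding over a tiny alphabet.
alphabet = sorted(set("moderno"))  # stand-in for A = Asource ∪ Atarget
vec = extend_embedding([1.0, 1.0, 1.0, 1.0], "moderno", alphabet)
```

The appended block has one dimension per character of A, so the augmented embedding has dimension d + |A|.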
Figure 1: Performance on development data vs. scaling factors ce and cs. The lowest tested value for both was 10⁻⁶.
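The similarity scores discussed next are built from a normalized edit distance NL(·,·). A minimal Python sketch (an illustration, not the authors' implementation; `similarity_entry` is a hypothetical name) of NL and a single similarity entry:

```python
from math import log

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (Levenshtein, 1966).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def nl(a, b):
    # Edit distance normalized by the longer word, so NL lies in [0, 1].
    return levenshtein(a, b) / max(len(a), len(b))

def similarity_entry(a, b, c_s=1.0):
    # One entry of the orthographic similarity matrix:
    # Sij = cs · log(2.0 − NL(wi, wj))
    return c_s * log(2.0 - nl(a, b))
```

Identical words get NL = 0 and similarity cs · log 2; maximally dissimilar words get NL = 1 and similarity 0.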
similarity scores are used to form an orthographic similarity matrix S, where each entry corresponds to a source-target word pair. Each entry is first scaled by a constant factor cs. This matrix is added to the standard similarity matrix, XWZᵀ.

Sij = cs · log(2.0 − NL(wi, wj)), wi ∈ Vs, wj ∈ Vt

The vocabulary for each language is 200,000 words, so computing a similarity score for each pair would involve 40 billion edit distance calculations. Also, the vast majority of word pairs are orthographically very dissimilar, resulting in a normalized edit distance close to 1 and an orthographic similarity close to 0, having little to no effect on the overall estimated similarity. Therefore, we only calculate the edit distance for a subset of possible word pairs. Thus, the actual orthographic similarity matrix that we use is as follows:

S′ij = { Sij   if ⟨wi, wj⟩ ∈ symDelete(Vt, Vs, k)
       { 0     otherwise

This subset of word pairs was chosen using an adaptation of the Symmetric Delete spelling correction algorithm described by Garbe (2012), which we denote as symDelete(·,·,·). This algorithm takes as arguments the target vocabulary, source vocabulary, and a constant k, and identifies all source-target word pairs that are identical after k or fewer deletions from each word; that is, all pairs where each is reachable from the other with no more than k insertions and k deletions. For example, the Italian-English pair moderno-modern will be identified with k = 1, and the pair tollerante-tolerant will be identified with k = 2. The algorithm works by computing all strings formed by k or fewer deletions from each target word, storing them in a hash table, then doing the same for each source word and generating source-target pairs that share an entry in the hash table. The complexity of this algorithm can be expressed as O(|V|lᵏ), where V = Vt ∪ Vs is the combined vocabulary and l is the length of the longest word in V. This is linear with respect to the vocabulary size, as opposed to the quadratic complexity required for computing the entire matrix. However, the algorithm is sensitive to both word length and the choice of k. In our experiments, we found that ignoring all words of length greater than 30 allowed the algorithm to complete very quickly while skipping less than 0.1% of the data. We also used small values of k (0 < k < 4), and used k = 1 for our final results, finding no significant benefit from using a larger value.

5 Experiments

We use the datasets used by Artetxe et al. (2017), consisting of three language pairs: English-Italian, English-German, and English-Finnish. The English-Italian dataset was introduced in Dinu and Baroni (2014); the other datasets were created by Artetxe et al. (2017). Each dataset includes monolingual word embeddings (trained with word2vec (Mikolov et al., 2013b)) for both languages and a bilingual dictionary, separated into a training and test set. We do not use the training set as the input dictionary to the system, instead using an automatically-generated dictionary consisting only of numeral identity translations (such as 2-2, 3-3, et cetera) as in Artetxe et al. (2017).¹ However, because the methods presented in this work feature tunable hyperparameters, we use a portion of the training set as development data.² In all experiments, a single target word is predicted for each source word, and full points are awarded if it is one of the listed correct translations. On average, the number of translations for each source (non-English) word was 1.2 for English-Italian, 1.3 for English-German, and 1.4 for English-Finnish.

¹ https://github.com/artetxem/vecmap

Method                                      English-German   English-Italian   English-Finnish
Artetxe et al. (2017)                                40.27             39.40             26.47
Artetxe et al. (2017) + identity                     51.73             44.07             42.63
Embedding extension, ce = 1/8                        50.33             48.40             29.63
Embedding extension + identity, ce = 1/8             55.40             47.13             43.54
Similarity adjustment, cs = 1                        43.73             39.93             28.16
Similarity adjustment + identity, cs = 1             52.20             44.27             41.99
Combined, ce = 1/8, cs = 1                           53.53             49.13             32.51
Combined + identity, ce = 1/8, cs = 1                55.53             46.27             41.78

Table 1: Comparison of methods on test data (translation accuracy). Scaling constants ce and cs were selected based on performance on development data over all three language pairs. The last two rows report the results of using both methods together.

Table 2: Examples of pairs correctly identified by our embedding extension method that were incorrectly translated by the system of Artetxe et al. (2017). Our system can disambiguate semantic clusters created by word2vec.

6 Results and Discussion

For our experiments with orthographic extension of word embeddings, each embedding was extended by the size of the union of the alphabets of both languages. The size of this union was 199 for English-Italian, 200 for English-German, and 287 for English-Finnish. These numbers are perhaps unintuitively high. However, the corpora include many other characters, including diacritical markings and various symbols (%, [, !, etc.), an indication that tokenization of the data could be improved. We did not filter these characters in this work.

For our experiments with orthographic similarity adjustment, the heuristic identified approximately 2 million word pairs for each language pair out of a possible 40 billion, resulting in significant computation savings.

Figure 1 shows the results on the development data. Based on these results, we selected ce = 1/8 and cs = 1 as our hyperparameters. The local optima were not identical for all three languages, but we felt that these values struck the best compromise among them.

Table 1 compares our methods against the system of Artetxe et al. (2017), using scaling factors selected based on development data results. Because approximately 20% of source-target pairs in the dictionary were identical, we also extended all systems to guess the identity translation if the source word appeared in the target vocabulary. This improved accuracy in most cases, with some exceptions for English-Italian. We also experimented with both methods together, and found that this was the best of the settings that did not include the identity translation component; with the identity component included, however, the embedding extension method alone was best for English-Finnish. The fact that Finnish is the only language here that is not in the Indo-European family (and has fewer words borrowed from English or its ancestors) may explain why the performance trends for English-Finnish were different from those of the other two language pairs.
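The deletion-hashing heuristic behind symDelete can be sketched as follows. This is an illustrative Python version under the description above (the function names `deletions` and `sym_delete` are hypothetical, not from the authors' code):

```python
from itertools import combinations

def deletions(word, k):
    # All strings formed by deleting k or fewer characters from `word`.
    out = {word}
    for d in range(1, k + 1):
        for idx in combinations(range(len(word)), d):
            drop = set(idx)
            out.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return out

def sym_delete(target_vocab, source_vocab, k):
    # Hash every <=k-deletion variant of each target word, then emit
    # (source, target) pairs that share a variant (adapted from Garbe, 2012).
    table = {}
    for t in target_vocab:
        for v in deletions(t, k):
            table.setdefault(v, set()).add(t)
    pairs = set()
    for s in source_vocab:
        for v in deletions(s, k):
            for t in table.get(v, ()):
                pairs.add((s, t))
    return pairs

# moderno and modern share the variant "modern" after one deletion:
pairs = sym_delete({"modern"}, {"moderno"}, k=1)
```

Each word of length l contributes at most O(lᵏ) variants, which gives the O(|V|lᵏ) cost quoted above and motivates skipping very long words.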
² We use all source-target pairs containing one of 1,000 randomly-selected target words.

In addition to identifying orthographically similar words, the extension method is capable of learning a mapping between source and target letters, which could partially explain its improved performance over our edit distance method. Table 2 shows some correct translations from our system that were missed by the baseline.

7 Conclusion and Future Work

In this work, we presented two techniques (which can be combined) for improving embedding-based bilingual lexicon induction for related languages using orthographic information and no parallel data, allowing their use with low-resource language pairs. These methods increased accuracy in our experiments, with both the combined and embedding extension methods providing significant gains over the baseline system.

In the future, we want to extend this work to related languages with different alphabets (experimenting with transliteration or phonetic transcription) and to extend other unsupervised bilingual lexicon induction systems, such as that of Lample et al. (2018a).

Acknowledgments

We are grateful to the anonymous reviewers for suggesting useful additions. This research was supported by NSF grant number 1449828.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), pages 2289–2294, Austin, Texas.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-17), pages 451–462, Vancouver, Canada.

Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. 2010. Painless unsupervised learning with features. In Proceedings of the 2010 Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-10), pages 582–590.

Georgiana Dinu and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. CoRR, abs/1412.6568.

Chris Dyer, Jonathan Clark, Alon Lavie, and Noah A. Smith. 2011. Unsupervised word alignment with arbitrary features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL-11), pages 409–419.

Wolf Garbe. 2012. 1000x faster spelling correction algorithm. http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/. Accessed: 2018-02-12.

Aria Haghighi, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL-08), pages 771–779.

Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018a. Word translation without parallel data. In International Conference on Learning Representations (ICLR).

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018b. Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations (ICLR).

V. I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10(8):707–710. Original in Doklady Akademii Nauk SSSR 163(4):845–848 (1965).

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL-16), pages 247–257, Berlin, Germany.

Ivan Vulić and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP-13), pages 1613–1624, Seattle, Washington, USA.

Ivan Vulić and Marie-Francine Moens. 2016. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research, 55(1):953–994.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL-17), pages 1959–1970, Vancouver, Canada.