
A Sub-Character Architecture for Korean Language Processing

Karl Stratos
Toyota Technological Institute at Chicago
stratos@ttic.edu

Abstract

We introduce a novel sub-character architecture that exploits a unique compositional structure of the Korean language. Our method decomposes each character into a small set of primitive phonetic units called jamo letters from which character- and word-level representations are induced. The jamo letters divulge syntactic and semantic information that is difficult to access with conventional character-level units. They greatly alleviate the data sparsity problem, reducing the observation space to 1.6% of the original while increasing accuracy in our experiments. We apply our architecture to dependency parsing and achieve dramatic improvement over strong lexical baselines.

[Figure 1: Korean sentence "산을 갔다" (I went to the mountain) decomposed to words, characters, and jamos: ㅅ ㅏ ㄴ ㅇ ㅡ ㄹ ㄱ ㅏ ㅆ ㄷ ㅏ ∅.]

1 Introduction

Korean is generally recognized as a language isolate: that is, it has no apparent genealogical relationship with other languages (Song, 2006; Campbell and Mixco, 2007). A unique feature of the language is that each character is composed of a small, fixed set of basic phonetic units called jamo letters. Despite the important role jamo plays in encoding syntactic and semantic information of words, it has been neglected in existing modern Korean processing algorithms. In this paper, we bridge this gap by introducing a novel compositional neural architecture that explicitly leverages the sub-character information.

Specifically, we perform Unicode decomposition on each Korean character to recover its underlying jamo letters and construct character- and word-level representations from these letters. See Figure 1 for an illustration of the decomposition. The decomposition is deterministic; this is a crucial departure from previous work that uses language-specific sub-character information such as radicals (a graphical component of a Chinese character). The radical structure of a Chinese character does not follow any systematic process, requiring an incomplete dictionary mapping between characters and radicals to take advantage of this information (Sun et al., 2014; Yin et al., 2016). In contrast, our Unicode decomposition does not need any supervision and can extract correct jamo letters for all possible Korean characters.

Our jamo architecture is fully general and can be plugged into any Korean processing network. For a concrete demonstration of its utility, in this work we focus on dependency parsing. McDonald et al. (2013) note that "Korean emerges as a very clear outlier" in their cross-lingual parsing experiments on the universal treebank, implying a need to tailor a model for this language isolate. Because of the compositional morphology, Korean suffers extreme data sparsity at the word level: 2,703 out of 4,698 word types (> 57%) in the held-out portion of our treebank are OOV. This makes the language challenging for simple lexical parsers even when augmented with a large set of pre-trained word representations.

While such data sparsity can also be alleviated by incorporating more conventional character-level information, we show that incorporating jamo is an effective and economical new approach to combating the sparsity problem for Korean. In experiments, we decisively improve the LAS of the lexical BiLSTM parser of Kiperwasser and Goldberg (2016) from 82.77 to 91.46 while reducing the size of the input space by 98.4% when we replace words with jamos. As a point of reference, a strong feature-rich parser using gold POS tags obtains 88.61.

To summarize, we make the following contributions.

• To our knowledge, this is the first work that leverages jamo in end-to-end neural Korean processing. To this end, we develop a novel sub-character architecture based on deterministic Unicode decomposition.

• We perform extensive experiments on dependency parsing to verify the utility of the approach. We show a clear performance boost with a drastically smaller set of parameters. Our final model outperforms strong baselines by a large margin.

• We release an implementation of our jamo architecture, which can be plugged into any Korean processing network (https://github.com/karlstratos/koreannet).

2 Related Work

We make a few additional remarks on related work to better situate our work. Our work follows the successful line of work on incorporating sub-lexical information into neural models. Various character-based architectures have been proposed. For instance, Ma and Hovy (2016) and Kim et al. (2016) use CNNs over characters whereas Lample et al. (2016) and Ballesteros et al. (2015) use bidirectional LSTMs (BiLSTMs). Both approaches have been shown to be profitable; we employ a BiLSTM-based approach.

Many previous works have also considered morphemes to augment lexical models (Luong et al., 2013; Botha and Blunsom, 2014; Cotterell et al., 2016). Sub-character models are substantially rarer; an extreme case is considered by Gillick et al. (2016), who process text as a sequence of bytes. We believe that such byte-level models are too general and that there are opportunities to exploit natural sub-character structure for certain languages such as Korean and Chinese.

There exists a line of work on exploiting graphical components of Chinese characters called radicals (Sun et al., 2014; Yin et al., 2016). For instance, 足 (foot) is the radical of 跑 (run). While related, our work on Korean is distinguished in critical ways and should not be thought of as just an extension to another language. First, as mentioned earlier, the compositional structure is fundamentally different between Chinese and Korean. The mapping between radicals and characters in Chinese is nondeterministic and can only be loosely approximated by an incomplete dictionary. In contrast, the mapping between jamos and Korean characters is deterministic (Section 3.1), allowing for systematic decomposition of all possible Korean characters. Second, the previous work on Chinese radicals was concerned with learning word embeddings. We develop an end-to-end compositional model for a downstream task: parsing.

3 Method

3.1 Jamo Structure of the Korean Language

Let W denote the set of word types and C the set of character types. In many languages, c ∈ C is the most basic unit that is meaningful. In Korean, each character is further composed of a small fixed set of phonetic units called jamo letters J, where |J| = 51. The jamo letters are categorized as head consonants J_h, vowels J_v, or tail consonants J_t. The composition is completely systematic. Given any character c ∈ C, there exist c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t such that their composition yields c. Conversely, any c_h ∈ J_h, c_v ∈ J_v, and c_t ∈ J_t can be composed to yield a valid character c ∈ C.

As an example, consider the word 갔다 (went). It is composed of two characters, 갔, 다 ∈ C. Each character is furthermore composed of three jamo letters as follows:

• 갔 ∈ C is composed of ㄱ ∈ J_h, ㅏ ∈ J_v, and ㅆ ∈ J_t.

• 다 ∈ C is composed of ㄷ ∈ J_h, ㅏ ∈ J_v, and an empty letter ∅ ∈ J_t.

The tail consonant can be empty; we assume a special symbol ∅ ∈ J_t to denote an empty letter.
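
To make the systematic composition concrete, the following minimal Python sketch (not part of the paper's released code) composes modern jamo letters into a precomposed Hangul syllable using standard Unicode arithmetic; the function name compose and the use of codepoint literals are illustrative assumptions.

```python
def compose(ch, cv, ct=None):
    """Compose head consonant, vowel, and optional tail jamo into one Hangul syllable.

    ch: head consonant (choseong, U+1100..U+1112)
    cv: vowel (jungseong, U+1161..U+1175)
    ct: tail consonant (jongseong, U+11A8..U+11C2), or None for the empty letter.
    """
    lead = ord(ch) - 0x1100                # head index in [0, 18]
    vowel = ord(cv) - 0x1161               # vowel index in [0, 20]
    tail = ord(ct) - 0x11A7 if ct else 0   # tail index in [0, 27]; 0 means empty
    return chr(0xAC00 + (lead * 21 + vowel) * 28 + tail)

# ㄷ + ㅏ + ∅ -> 다, and ㄱ + ㅏ + ㅆ -> 갔, matching the example above.
assert compose('\u1103', '\u1161') == '다'
assert compose('\u1100', '\u1161', '\u11BB') == '갔'
```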

Figure 1 illustrates the decomposition of a Korean sentence down to jamo letters.

Note that the number of possible characters is combinatorial in the number of jamo letters, loosely upper bounded by 51^3 = 132,651. This upper bound is loose because certain combinations are invalid. For instance, ㅁ ∈ J_h ∩ J_t but ㅁ ∉ J_v, whereas ㅏ ∈ J_v but ㅏ ∉ J_h ∪ J_t.

The combinatorial nature of Korean characters motivates the compositional architecture below. For completeness, we describe the entire forward pass of the transition-based BiLSTM parser of Kiperwasser and Goldberg (2016) that we use in our experiments.

3.2 Jamo Architecture

The parameters associated with the jamo layer are

• Embedding e_l ∈ R^d for each letter l ∈ J

• U^J, V^J, W^J ∈ R^{d×d} and b^J ∈ R^d

Given a Korean character c ∈ C, we perform Unicode decomposition (Section 3.3) to recover the underlying jamo letters c_h, c_v, c_t ∈ J. We compose the letters to induce a representation of c as

    h^c = tanh(U^J e_{c_h} + V^J e_{c_v} + W^J e_{c_t} + b^J)

This representation is then concatenated with a character-level lookup embedding, and the result is fed into an LSTM to produce a word representation. We use an LSTM (Hochreiter and Schmidhuber, 1997) simply as a mapping φ : R^{d1} × R^{d2} → R^{d2} that takes an input vector x and a state vector h to output a new state vector h' = φ(x, h). The parameters associated with this layer are

• Embedding e^c ∈ R^{d'} for each c ∈ C

• Forward LSTM φ^f : R^{d+d'} × R^d → R^d

• Backward LSTM φ^b : R^{d+d'} × R^d → R^d

• U^C ∈ R^{d×2d} and b^C ∈ R^d

Given a word w ∈ W and its character sequence c_1 . . . c_m ∈ C, we compute

    f_i^c = φ^f([h^{c_i}; e^{c_i}], f_{i-1}^c)    for i = 1 . . . m
    b_i^c = φ^b([h^{c_i}; e^{c_i}], b_{i+1}^c)    for i = m . . . 1

and induce a representation of w as

    h^w = tanh(U^C [f_m^c; b_1^c] + b^C)

Lastly, this representation is concatenated with a word-level lookup embedding (which can be initialized with pre-trained word embeddings), and the result is fed into a BiLSTM network. The parameters associated with this layer are

• Embedding e^w ∈ R^{d_W} for each w ∈ W

• Two-layer BiLSTM Φ that maps h_1 . . . h_n ∈ R^{d+d_W} to z_1 . . . z_n ∈ R^{d*}

• Feedforward network for predicting transitions

Given a sentence w_1 . . . w_n ∈ W, the final d*-dimensional word representations are given by

    (z_1 . . . z_n) = Φ([h^{w_1}; e^{w_1}] . . . [h^{w_n}; e^{w_n}])

The parser then uses the feedforward network to greedily predict transitions based on words that are active in the system. The model is trained end-to-end by optimizing a max-margin objective. Since this part is not a contribution of this paper, we refer to Kiperwasser and Goldberg (2016) for details.

By setting the embedding dimension of jamos d, characters d', or words d_W to zero, we can configure the network to use any combination of these units. We report these experiments in Section 4.

3.3 Unicode Decomposition

Our architecture requires dynamically extracting jamo letters given any Korean character. This is achieved by simple Unicode manipulation. For any Korean character c ∈ C with Unicode value U(c), let Ū(c) = U(c) − 44032 and T(c) = Ū(c) mod 28. Then the Unicode values U(c_h), U(c_v), and U(c_t) corresponding to the head consonant, vowel, and tail consonant are obtained by

    U(c_h) = 1 + ⌊Ū(c)/588⌋ + 0x10ff
    U(c_v) = 1 + ⌊((Ū(c) − T(c)) mod 588)/28⌋ + 0x1160
    U(c_t) = T(c) + 0x11a7

where c_t is set to ∅ if T(c) = 0.
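
This arithmetic is straightforward to implement. The following self-contained Python sketch mirrors the formulas above; it is an illustration under the stated formulas rather than the toolkit actually used in our experiments (Section 4), and the function name decompose and the use of None for the empty letter are assumptions.

```python
def decompose(c):
    """Map a precomposed Hangul syllable to its (head, vowel, tail) jamo letters.

    Implements Ū(c) = U(c) - 44032 and T(c) = Ū(c) mod 28; the tail is returned
    as None when T(c) = 0, i.e., the empty letter ∅.
    """
    u = ord(c) - 44032                     # Ū(c); precomposed syllables give 0 <= u < 11172
    if not 0 <= u < 11172:
        raise ValueError(f"{c!r} is not a precomposed Hangul syllable")
    t = u % 28                             # T(c)
    head = chr(1 + u // 588 + 0x10FF)      # U(c_h)
    vowel = chr(1 + ((u - t) % 588) // 28 + 0x1160)  # U(c_v)
    tail = chr(t + 0x11A7) if t else None  # U(c_t); ∅ when T(c) = 0
    return head, vowel, tail

# 갔 -> (ㄱ, ㅏ, ㅆ) and 다 -> (ㄷ, ㅏ, ∅), inverting the composition of Section 3.1.
assert decompose('갔') == ('\u1100', '\u1161', '\u11BB')
assert decompose('다') == ('\u1103', '\u1161', None)
```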

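Putting Sections 3.2 and 3.3 together, a minimal NumPy sketch of the jamo composition h^c = tanh(U^J e_{c_h} + V^J e_{c_v} + W^J e_{c_t} + b^J) is shown below. The dimension d = 100 and the random initialization are illustrative assumptions, and it reuses the decompose function sketched above; the actual model additionally stacks the character LSTM and the BiLSTM parser on top and is trained end-to-end in DyNet.

```python
import numpy as np

d = 100                                    # jamo embedding dimension (illustrative choice)
rng = np.random.default_rng(0)

# One embedding per jamo letter (including the empty letter ∅) and the
# composition parameters U^J, V^J, W^J, b^J of Section 3.2.
EMPTY = "∅"
jamo_embedding = {}                        # filled lazily for the letters we encounter
U_J, V_J, W_J = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
b_J = np.zeros(d)

def embed(letter):
    letter = letter if letter is not None else EMPTY
    if letter not in jamo_embedding:
        jamo_embedding[letter] = rng.normal(scale=0.1, size=d)
    return jamo_embedding[letter]

def char_representation(c):
    """h^c = tanh(U^J e_{c_h} + V^J e_{c_v} + W^J e_{c_t} + b^J)."""
    c_h, c_v, c_t = decompose(c)           # decompose() from the sketch above
    return np.tanh(U_J @ embed(c_h) + V_J @ embed(c_v) + W_J @ embed(c_t) + b_J)

h = char_representation("갔")              # one d-dimensional character representation
```
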
                       | Training | Development | Test
# projective trees     | 5,425    | 603         | 299
# non-projective trees | 12       | 0           | 0

     | #      | # Ko  | Examples
word | 31,060 | –     | … booz
char | 1,772  | 1,315 | … @正a
jamo | 500    | 48    | ㄱㄳㄼㅏㅠㅢ@正a

Table 1: Treebank statistics. Upper: Number of trees in the split. Lower: Number of unit types in the training portion. For simplicity, we include non-Korean symbols (e.g., @, 正, a) as characters/jamos.

3.4 Why Use Jamo Letters?

The most obvious benefit of using jamo letters is alleviating data sparsity by flattening the combinatorial space of Korean characters. We discuss some additional explicit benefits. First, jamo letters often indicate syntactic properties of words. For example, a tail consonant ㅆ strongly implies that the word is a past tense verb as in 갔다 (went), 왔다 (came), and 했다 (did). Thus a jamo-level model can identify unseen verbs more effectively than word- or character-level models.

Second, jamo letters dictate the sound of a character. For example, 갔 is pronounced as gat because the head consonant ㄱ is associated with the sound g, the vowel ㅏ with a, and the tail consonant ㅆ with t. This is clearly critical for speech recognition/synthesis and indeed has been investigated in the speech community (Lee et al., 1994; Sakti et al., 2010). While speech processing is not our focus, the phonetic signals can capture useful lexical correlation (e.g., for onomatopoeic words).

4 Experiments

Data We use the publicly available Korean treebank in the universal treebank version 2.0 (McDonald et al., 2013; https://github.com/ryanmcd/uni-dep-tb). The dataset comes with a train/development/test split; data statistics are shown in Table 1. Since the test portion is significantly smaller than the dev portion, we report performance on both.

As expected, we observe severe data sparsity with words: 24,814 out of 31,060 elements in the vocabulary appear only once in the training data. On the dev set, about 57% of word types and 3% of character types are OOV. Upon Unicode decomposition, we obtain the following 48 jamo types: ㄱㄳㄲㄵㄴㄷㄶㄹㄸㄻㄺㄼㅁㅀㅃㅂㅅㅄㅇㅆㅉㅈㅋㅊㅍㅌㅏㅎㅑㅐㅓㅒㅕㅔㅗㅖㅙㅘㅛㅚㅝㅜㅟㅞㅡㅠㅣㅢ, none of which is OOV in the dev set.

Implementation and baselines We implement our jamo architecture using the DyNet library (Neubig et al., 2017) and plug it into the BiLSTM parser of Kiperwasser and Goldberg (2016) (https://github.com/elikip/bist-parser). For Korean syllable manipulation, we use the freely available toolkit by Joshua Dong (https://github.com/JDongian/python-jamo). We train the parser for 30 epochs and use the dev portion for model selection. We compare our approach to the following baselines:

• McDonald13: A cross-lingual parser originally reported in McDonald et al. (2013).

• Yara: A beam-search transition-based parser of Rasooli and Tetreault (2015) based on the rich non-local features in Zhang and Nivre (2011). We use beam width 64. We use 5-fold jackknifing on the training portion to provide POS tag features. We also report on using gold POS tags.

• K&G16: The basic BiLSTM parser of Kiperwasser and Goldberg (2016) without the sub-lexical architecture introduced in this work.

• Stack LSTM: A greedy transition-based parser based on stack LSTM representations. Dyer15 denotes the word-level variant (Dyer et al., 2015). Ballesteros15 denotes the character-level variant (Ballesteros et al., 2015).

For pre-trained word embeddings, we apply the spectral algorithm of Stratos et al. (2015) on a 2015 Korean Wikipedia dump to induce 285,933 embeddings of dimension 100.

Parsing accuracy Table 2 shows the main result. The baseline test LAS of the original cross-lingual parser of McDonald13 is 55.85. Yara achieves 85.17 with predicted POS tags and 88.61 with gold POS tags. The basic BiLSTM model of K&G16 obtains 82.77 with pre-trained word embeddings (78.95 without). The stack LSTM parser is comparable to K&G16 at the word level (Dyer15), but it performs significantly better at the character level (Ballesteros15), reaching 86.25 test LAS.

System         | Features               | Feature Representation | Emb | POS  | Dev UAS | Dev LAS | Test UAS | Test LAS
McDonald13     | cross-lingual features | large sparse matrix    | –   | PRED | –       | –       | 71.22    | 55.85
Yara (beam 64) | features in Z&N11      | large sparse matrix    | –   | PRED | 76.31   | 62.83   | 91.19    | 85.17
               |                        |                        |     | GOLD | 79.08   | 68.85   | 92.93    | 88.61
K&G16          | word                   | 31060 × 100 matrix     | –   | –    | 68.87   | 48.25   | 88.61    | 78.95
               |                        | 298115 × 100 matrix    | YES |      | 76.30   | 60.88   | 90.00    | 82.77
Dyer15         | word, transition       | 31067 × 100 matrix     | –   | –    | 69.40   | 48.46   | 88.41    | 78.22
               |                        | 298122 × 100 matrix    | YES |      | 75.99   | 59.38   | 90.73    | 83.89
Ballesteros15  | char, transition       | 1779 × 100 matrix      | –   | –    | 84.22   | 76.41   | 91.27    | 86.25
KoreanNet      | char                   | 1772 × 100 matrix      | –   | –    | 84.76   | 76.95   | 94.75    | 90.81
               |                        | 1772 × 200 matrix      |     |      | 84.83   | 77.29   | 94.55    | 91.04
               | jamo                   | 500 × 100 matrix       | –   | –    | 84.27   | 76.07   | 94.59    | 90.77
               |                        | 500 × 200 matrix       |     |      | 84.68   | 77.27   | 94.86    | 91.46
               | char, jamo             | 2272 × 100 matrix      | –   | –    | 85.35   | 78.18   | 94.79    | 91.19
               |                        | 2272 × 200 matrix      |     |      | 85.74   | 78.76   | 94.55    | 91.31
               | word, char, jamo       | 302339 × 200 matrix    | YES | –    | 86.39   | 79.68   | 95.17    | 92.31

Table 2: Main result. Upper: Accuracy with baseline models. Lower: Accuracy with different configurations of our parser network (word-only is identical to K&G16).

We observe decisive improvement when we incorporate sub-lexical information into the parser of K&G16. In fact, a strictly sub-lexical parser using only jamos or characters clearly outperforms its lexical counterpart despite the fact that the model is drastically smaller (e.g., 90.77 with 500 × 100 jamo embeddings vs. 82.77 with 298115 × 100 word embeddings). Notably, jamos alone achieve 91.46, which is not far behind the best result of 92.31 obtained by using word, character, and jamo units in conjunction. This demonstrates that our compositional architecture learns to build effective representations of Korean characters and words for parsing from a minuscule set of jamo letters.

5 Discussion of Future Work

We have presented a natural sub-character architecture to model the unique compositional orthography of the Korean language. The architecture induces word-/sentence-level representations from a small set of phonetic units called jamo letters. This is enabled by efficient and deterministic Unicode decomposition of characters.

We have focused on dependency parsing to demonstrate the utility of our approach as an economical and effective way to combat data sparsity. However, we believe that the true benefit of this architecture will be more evident in speech processing, as jamo letters are definitions of sound in the language. Another potentially interesting application is informal text on the internet. Ill-formed words such as ㅎㅎㅎ (shorthand for 하하하, an onomatopoeic expression of laughter) and ㄴㄴ (shorthand for 노노, a transcription of no no) are omnipresent in social media. The jamo architecture can be useful in this scenario, for instance by correlating ㅎㅎㅎ and 하하하, which might otherwise be treated as independent.

Acknowledgments

The author would like to thank Lingpeng Kong for his help with using the DyNet library, Mohammad Rasooli for his help with using the Yara parser, and Karen Livescu for helpful comments.

References

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proc. EMNLP.

Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In ICML, pages 1899–1907.

Lyle Campbell and Mauricio J. Mixco. 2007. A glossary of historical linguistics. Edinburgh University Press.

Ryan Cotterell, Hinrich Schütze, and Jason Eisner. 2016. Morphological smoothing and extrapolation of word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1651–1660.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. ACL.
words such as ㅎㅎㅎ (shorthand for ᄒ ᅡᄒ ᅡᄒ ᅡ, an term memory. In Proc. ACL.

Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya. 2016. Multilingual language processing from bytes. In Proceedings of NAACL.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of NAACL.

Geunbae Lee, Jong-Hyeok Lee, and Kyunghee Kim. 1994. Phoneme-level speech and natural language integration for agglutinative languages.

Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, pages 104–113.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074, Berlin, Germany. Association for Computational Linguistics.

Ryan T. McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith B. Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013. Universal dependency annotation for multilingual parsing. In ACL (2), pages 92–97.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Mohammad Sadegh Rasooli and Joel R. Tetreault. 2015. Yara parser: A fast and accurate dependency parser. CoRR, abs/1503.06733.

Sakriani Sakti, Andrew Finch, Ryosuke Isotani, Hisashi Kawai, and Satoshi Nakamura. 2010. Korean pronunciation variation modeling with probabilistic Bayesian networks. In Universal Communication Symposium (IUCS), 2010 4th International, pages 52–57. IEEE.

Jae Jung Song. 2006. The Korean language: Structure, use and context. Routledge.

Karl Stratos, Michael Collins, and Daniel Hsu. 2015. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1282–1291, Beijing, China. Association for Computational Linguistics.

Yaming Sun, Lei Lin, Nan Yang, Zhenzhou Ji, and Xiaolong Wang. 2014. Radical-enhanced Chinese character embedding. In International Conference on Neural Information Processing, pages 279–286. Springer.

Rongchao Yin, Quan Wang, Rui Li, Peng Li, and Bin Wang. 2016. Multi-granularity Chinese word embedding. In Proceedings of Empirical Methods in Natural Language Processing.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 188–193. Association for Computational Linguistics.
