A 500 Million Word POS-Tagged Icelandic Corpus


Thomas Eckart (1), Erla Hallsteinsdóttir (2), Sigrún Helgadóttir (3), Uwe Quasthoff (1), Dirk Goldhahn (1)

(1) Natural Language Processing Group, University of Leipzig, Germany
(2) Department of Language and Communication, University of Southern Denmark, Odense, Denmark
(3) The Árni Magnússon Institute for Icelandic Studies, Reykjavík, Iceland
Email: {teckart, quasthoff, dgoldhahn}@informatik.uni-leipzig.de, erla@sdu.dk, sigruhel@hi.is

Abstract
The new POS-tagged Icelandic corpus of the Leipzig Corpora Collection is an extensive resource for the analysis of the Icelandic language. As it contains a large share of all Web documents hosted under the .is top-level domain, it is especially valuable for investigations of modern Icelandic and non-standard language varieties. The corpus is accessible via a dedicated web portal, and large shares are available for download. This paper focuses on the description of the tagging process and on the evaluation of statistical properties such as word form frequencies and part-of-speech tag distributions. The latter are compared in particular with values from the Icelandic Frequency Dictionary (IFD) Corpus.
Keywords: Corpus Creation, Part-of-Speech Tagging, Grammar and Syntax

1. History of the Icelandic Corpus
Larger Icelandic corpora have been part of the Leipzig Corpora Collection (LCC) since 2005. The aim of the project is to generate large monolingual corpora based on varied material of different genres. The biggest resources are Web texts provided by the National and University Library of Iceland from autumn 2005 and autumn 2010 (approx. 33 million sentences). In addition, newspaper texts (2 million sentences) and the complete Icelandic Wikipedia are included. To obtain a very large mixed-genre corpus, all these resources were combined, yielding a corpus of more than 550 million running words. For details of the processing, see (Goldhahn et al., 2012). In 2012, these sentences were POS-tagged as described below.
The generated data can be browsed at a dedicated web portal [1] that provides a Web interface focusing on word-form-based statistical information. As an example, Figures 1 and 2 show word co-occurrence graphs for the word skipti. This word is ambiguous: it has two meanings and appears with two possible word classes, as a noun ("time/occasion/separation/change/exchange") and as a verb ("separate/change/exchange"). Both graphs are based on sentence co-occurrences. Note that the differing co-occurrences illustrate the two different meanings and contexts.

[1] http://wortschatz.uni-leipzig.de/ws_isl/

Figure 1: Sentence co-occurrences for the noun skipti
Figure 2: Sentence co-occurrences for the verb skipti

2. Icelandic Part-of-Speech Tagging
2.1. Combined Tagging
The Icelandic corpus of the Leipzig Corpora Collection was annotated using the same procedures and software as the Tagged Icelandic Corpus, MÍM (Helgadóttir et al., 2012). The annotation consists of sentence segmentation, tokenisation and morphosyntactic tagging. The LCC Icelandic corpus was not lemmatized. A special program, CorpusTagger, was developed for these tasks during the development of MÍM-GOLD, a new gold standard for tagging Icelandic (Loftsson et al., 2010). The program uses IceNLP (Loftsson and Rögnvaldsson, 2007) for tokenisation and sentence segmentation. The text was then tagged with four different taggers, after which CombiTagger (Henrich et al., 2009) was applied to select a single tag by simple majority voting. In the original work (Loftsson et al., 2010), the text was tagged with five different taggers (listed in descending order of accuracy when tagging Icelandic text): IceTagger (Loftsson, 2008), Bidir (Dredze and Wallenberg, 2008), TnT (Brants, 2000), fnTBL (Ngai and Florian, 2001), and MXPOST (Ratnaparkhi, 1996). The Bidir tagger had to be dropped from the procedure since it did not seem to be able to handle large quantities of text. The TnT tagger was replaced with TriTagger, which is Hrafn Loftsson's re-implementation of TnT (Loftsson et al., 2011).
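
The voting step can be illustrated with a minimal sketch. It is not CombiTagger's actual code: the function name, the example proposals and in particular the tie-breaking rule (prefer the tagger listed first, here in the descending accuracy order given above) are assumptions made for illustration only.

```python
from collections import Counter

def majority_vote(proposals, priority):
    """Return the tag proposed by most taggers.

    Ties are broken by `priority` (an assumption of this sketch): the first
    tagger in that list whose proposal is among the tied tags wins.
    """
    counts = Counter(proposals.values())
    top = max(counts.values())
    tied = {tag for tag, n in counts.items() if n == top}
    for tagger in priority:
        if proposals.get(tagger) in tied:
            return proposals[tagger]
    raise ValueError("no tagger in `priority` proposed one of the tied tags")

# Hypothetical proposals for a single token from the four taggers used here.
priority = ["IceTagger", "TriTagger", "fnTBL", "MXPOST"]
proposals = {"IceTagger": "nken", "TriTagger": "nken",
             "fnTBL": "nkeo", "MXPOST": "nken"}
print(majority_vote(proposals, priority))
```

With four taggers a 2:2 split is possible, which is why the sketch needs some tie-breaking rule at all; the paper does not state which rule was used.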
The tagset used for the corpus was developed for the making of the IFD corpus (Pind et al., 1991). The tags are character strings in which each character has a particular function. The first character denotes the word class and the remaining characters (up to 5) denote various morphological features, such as gender, number and case. The IFD tagset has about 700 tags. The IFD corpus was tagged with a program that used a combination of grammatical rules and frequency information, and then all tags were corrected manually. The IFD corpus has been used for training the data-driven taggers (TriTagger, MXPOST and fnTBL) as well as for developing the rule-based tagger IceTagger.
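
To make the positional encoding concrete, the following sketch decodes noun tags only. The letter tables are limited to values that can be read off the examples in this paper (such as nken, nven, nhen, nveo, nveþ and the proper-name suffix -s); reading the final e of nvee as genitive is an assumption, and the full IFD tagset covers far more classes and features than shown here.

```python
# Letter tables restricted to values attested in this paper's examples.
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter"}
NUMBER = {"e": "singular"}
# "e" = genitive is an assumption based on the example konu=nvee.
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}

def decode_noun_tag(tag):
    """Split a noun tag such as 'nkeþ-s' into its positional features."""
    base, _, suffix = tag.partition("-")
    if len(base) != 4 or base[0] != "n":
        raise ValueError(f"not a four-character noun tag: {tag!r}")
    return {
        "word_class": "noun",
        "gender": GENDER.get(base[1], "unknown"),
        "number": NUMBER.get(base[2], "unknown"),
        "case": CASE.get(base[3], "unknown"),
        "proper_name": suffix == "s",
    }

print(decode_noun_tag("nkeþ-s"))
```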
2.2. Tag Frequencies
The tagged corpus contains around 3.9 million tagged types with around 5.6 million different type-tag combinations. Hence, on average every type was tagged with 1.4 different tags. 575 different tags were actually found in the corpus. Table 4 shows the most frequent tags in the corpus with their absolute and relative frequencies. All values are based on tagged types; token frequencies are not taken into account.

The most frequent POS tags that do not describe nouns or numerals are "e" (16th most frequent tag, denoting a foreign word) and "lkensf" (34th; adjective, Masc., Sg., Nom., strong declension, positive). The most frequent POS tag denoting a verb ("sng") is in 40th place with an absolute frequency of 45,744.

For several reasons, a word might be tagged (correctly or not) with different tags in different sentences. Table 5 shows the number of types that were assigned different numbers of tags. If all combinations of word type and tag that occur less frequently than a certain minimum are removed, a higher percentage of the remaining types are assigned multiple tags.
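
The effect of such a frequency threshold can be reproduced on toy data. The sketch below is only an illustration of the counting behind Table 5: the function names and all counts are made up, and "hestr" stands for a hypothetical misspelling that occurs once.

```python
from collections import Counter, defaultdict

def tag_count_distribution(combo_freq, min_freq=1):
    """Map 'number of distinct tags' -> 'number of types', keeping only
    (type, tag) combinations seen at least `min_freq` times."""
    tags_by_type = defaultdict(set)
    for (word, tag), freq in combo_freq.items():
        if freq >= min_freq:
            tags_by_type[word].add(tag)
    return Counter(len(tags) for tags in tags_by_type.values())

def share_with_multiple_tags(distribution):
    """Fraction of types that carry more than one tag."""
    total = sum(distribution.values())
    multi = sum(n for k, n in distribution.items() if k > 1)
    return multi / total

# Invented (type, tag) -> frequency counts, for illustration only.
combo_freq = {
    ("kona", "nven"): 40, ("konu", "nveo"): 25, ("konu", "nveþ"): 12,
    ("konu", "nvee"): 3, ("hestur", "nken"): 30, ("hestr", "nken"): 1,
}
unfiltered = tag_count_distribution(combo_freq)
filtered = tag_count_distribution(combo_freq, min_freq=10)
print(share_with_multiple_tags(unfiltered), share_with_multiple_tags(filtered))
```

In this toy example the threshold removes the rare single-tag type entirely, so the share of multi-tag types rises from one in four to one in three, which mirrors the trend across the columns of Table 5.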
In the complete corpus, 24,758 different combinations of two POS tags were seen that were assigned to the same type. Only 9,637 of these combinations occur more than 10 times (i.e. more than 10 types were tagged with both tags). The following table shows the most frequent assignments of two POS tags to the same type.

Tag 1 | Tag 2 | Frequency
nken-s | nkeo-s | 63,090
nken-s | nkeþ-s | 59,785
nveo | nveþ | 55,495
nhen | nheo | 51,774
nkeo-s | nkeþ-s | 40,564
Table 1: Typical combinations of POS tags for words having multiple tags
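
Counts like those in Table 1 can be obtained by enumerating, for every type, all unordered pairs of its tags. The following sketch shows the idea on a hypothetical mapping from types to tag sets; the function name and the data are assumptions for illustration, not the tooling used for the paper.

```python
from collections import Counter
from itertools import combinations

def tag_pair_counts(tags_by_type):
    """Count, for each unordered pair of tags, how many types carry both."""
    pairs = Counter()
    for tags in tags_by_type.values():
        # sorted() makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(tags), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical tag sets per type.
tags_by_type = {
    "konu": {"nveo", "nveþ", "nvee"},
    "sögu": {"nveo", "nveþ"},
    "kona": {"nven"},
}
for pair, n in tag_pair_counts(tags_by_type).most_common(3):
    print(pair, n)
```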
As expected, the high morphological variety of Icelandic leads to multiple assignments. One reason for these assignments is that different grammatical categories of the same word share identical word forms (e.g. woman: kona=nven, konu=nveo, konu=nveþ, konu=nvee). Furthermore, there are multiple assignments due to identical forms of words from different word classes that differ in meaning and syntactic characteristics.

In addition, results based on word classes were generated (similar to Petrov's universal tagset (Petrov et al., 2011), containing only 10 word classes). They can be used to simplify the results of the tagging process or to compare results of different taggers that use different sets of POS tags. The following tables show the distribution of these word classes in the corpus. Because of the diverse input material used, the tagged sentences contained non-standard Icelandic sentences (as often found in message board entries) and, in some rare cases, non-Icelandic material that was included due to errors in the preprocessing. To reduce the impact of these problematic parts, the same statistics were also generated using only type-word-class combinations that occurred at least 30 times. The distribution of word classes in the LCC corpus based on word types shows a strong bias towards nouns when compared with values based on the Icelandic Frequency Dictionary (Pind et al., 1991) [2]. These differences almost disappear if the distributions for tokens are compared (cf. Table 6).

[2] Higher numbers than the overall number of types are due to ambiguous words.

As before, the frequency of types having multiple word class tags is shown. Table 2 gives the results.

Number of assigned word class tags | Number of types
1 | 3,707,666
2 | 152,947
3 | 25,339
4 | 4,229
5 | 654
6 | 125
7 | 19
8 | 3
Table 2: Number of types having multiple word class tags

Accordingly, the following table takes a closer look at typical combinations of tags for words having multiple word classes.

Word class 1 | Word class 2 | Absolute frequency
ADJ | NOUN | 387,542
NOUN | VERB | 218,354
ADJ | VERB | 114,388
FOREIGN | NOUN | 110,298
ADV | NOUN | 31,871
ADV | ADJ | 20,534
FOREIGN | ADJ | 13,034
FOREIGN | VERB | 10,306
NOUN | NUM | 9,683
ADV | VERB | 6,650
Table 3: Typical combinations of POS tags for words having multiple word class tags

3. Applications
3.1. Frequencies for Word Forms and Lemmas
To obtain the frequency of a lemma, the frequencies of all of its inflected forms have to be summed up. For this application, some POS taggers provide the lemma for each word form. Unfortunately, there are two sources of counting errors. First, the lemma provided is sometimes wrong; this can be the case for ambiguous word forms belonging to multiple lemmas, or result from errors in lemmatisation. Second, some POS taggers use the word form itself as the lemma, or give no lemma at all, if the lemmatisation fails. Hence, frequencies for lemmas generated by POS taggers should be used with care.

This is the main reason why frequencies for word forms are used in the Frequency Dictionary Icelandic (Quasthoff et al., 2012), which was created using the Icelandic corpus described here.
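
The summation and its failure mode can be sketched as follows. The data and the function name are hypothetical; word forms for which no lemma is available are collected separately instead of being counted under themselves, which is one possible way (an assumption of this sketch) to keep the second error source visible.

```python
from collections import Counter

def lemma_frequencies(form_freq, lemma_of):
    """Sum word-form frequencies per lemma.

    Forms without a lemma in `lemma_of` are returned separately so that
    failed lemmatisation does not silently distort the lemma counts.
    """
    per_lemma = Counter()
    unresolved = Counter()
    for form, freq in form_freq.items():
        lemma = lemma_of.get(form)
        if lemma is None:
            unresolved[form] += freq
        else:
            per_lemma[lemma] += freq
    return per_lemma, unresolved

# Invented frequencies; "skipti" is left without a lemma because it is
# ambiguous between a noun and a verb reading.
form_freq = {"kona": 40, "konu": 40, "konur": 15, "skipti": 20}
lemma_of = {"kona": "kona", "konu": "kona", "konur": "kona"}
per_lemma, unresolved = lemma_frequencies(form_freq, lemma_of)
print(per_lemma, unresolved)
```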
3.2. Text-To-Speech
In autumn 2012 the Icelandic organization of the visually impaired (Blindrafélagið [3]) introduced new Icelandic text-to-speech software. The voice software, comprising the male voice Karl and the female voice Dóra, was developed by the Polish company Ivona. The linguistic material in the text-to-speech software was provided by this Icelandic corpus: the recording corpus compiled by Ivona for building the Icelandic voices was created using sentences from the corpus. Furthermore, the Icelandic language model used in the text-to-speech software was based on systematic analyses of structures (intonation structures, grammar, syntax and other text structures) in the Icelandic corpus.
3.3. Spellchecking
Skrambi, an Icelandic spellchecker which is currently under development, uses a language model derived, in part, from word frequencies from the corpus. The spellchecker, which is based on the noisy channel model approach to spelling correction (Brill and Moore, 2000), uses a language model as well as an error model in order to estimate the probability that a given suggestion is correct. The error model is trained on 5,000 of the most common nonword errors found in the corpus.
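
The noisy channel idea is to rank each candidate correction c for a typed string w by P(c) * P(w | c), where the language model supplies P(c) and the error model supplies P(w | c). The sketch below is not Skrambi's implementation: it uses a unigram language model built from invented word frequencies and a plain lookup table as a stand-in for a trained error model such as that of Brill and Moore (2000). All names and numbers are assumptions.

```python
def best_correction(typed, candidates, word_freq, error_prob):
    """Pick the candidate c that maximises P(c) * P(typed | c)."""
    total = sum(word_freq.values())

    def score(candidate):
        prior = word_freq.get(candidate, 0) / total           # language model P(c)
        likelihood = error_prob.get((typed, candidate), 0.0)  # error model P(typed | c)
        return prior * likelihood

    return max(candidates, key=score)

# Invented corpus frequencies and error probabilities.
word_freq = {"skipti": 900, "skiptir": 100}
error_prob = {("skifti", "skipti"): 0.02, ("skifti", "skiptir"): 0.001}
print(best_correction("skifti", ["skipti", "skiptir"], word_freq, error_prob))
```

Because the two factors are multiplied, a rare word can still win if the error model considers the observed misspelling much more likely for it than for a frequent word.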
4. References
Brants, T. (2000). TnT: a statistical part-of-speech tagger. In Proceedings of the sixth conference on Applied natural language processing (ANLC '00). Association for Computational Linguistics, Stroudsburg, PA, USA, 224-231. DOI=10.3115/974147.974178 http://dx.doi.org/10.3115/974147.974178
Brill, E., Moore, R. C. (2000). An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong.
Dredze, M., Wallenberg, J. (2008). Icelandic Data Driven Part of Speech Tagging. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT, Columbus, OH, USA.
Goldhahn, D., Eckart, T., Quasthoff, U. (2012). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012.

[3] Web page: http://www.blind.is/

Hallsteinsdóttir, E., Eckart, T., Biemann, C., Quasthoff, U., Richter, M. (2007). Íslenskur orðasjóður - Building a Large Icelandic Corpus. In: Proceedings of NODALIDA-07, Tartu, Estonia, 2007.
Helgadóttir, S., Svavarsdóttir, Á., Rögnvaldsson, E., Bjarnadóttir, K., Loftsson, H. (2012). The Tagged Icelandic Corpus (MÍM). Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages - SaLTMiL 8 - AfLaT2012, pp. 67-72. Istanbul, Turkey.
Henrich, V., Reuter, T., Loftsson, H. (2009). CombiTagger: A System for Developing Combined Taggers. In Proceedings of the 22nd International FLAIRS Conference, Special Track: Applied Natural Language Processing, Sanibel Island, Florida, USA.
Loftsson, H. (2008). Tagging Icelandic text: A linguistic rule-based approach. In Nordic Journal of Linguistics, 31(1), 47-72. © 2008 Cambridge University Press.
Loftsson, H., Rögnvaldsson, E. (2007). IceNLP: A Natural Language Processing Toolkit for Icelandic. In Proceedings of InterSpeech 2007, Special session: "Speech and language technology for less-resourced languages". Antwerp, Belgium.
Loftsson, H., Yngvason, J. H., Helgadóttir, S., Rögnvaldsson, E. (2010). Developing a PoS-tagged corpus using existing tools. In Proceedings of Creation and use of basic lexical resources for less-resourced languages, workshop at the 7th International Conference on Language Resources and Evaluation, LREC 2010. Valletta, Malta.
Loftsson, H., Helgadóttir, S., Rögnvaldsson, E. (2011). Using a morphological database to increase the accuracy in PoS tagging. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2011). Hissar, Bulgaria.
Ngai, G., Florian, R. (2001). Transformation-based learning in the fast lane. In Proceedings of North American ACL 2001, pages 40-47, June 2001.
Petrov, S., Das, D., & McDonald, R. (2011). A universal part-of-speech tagset. arXiv preprint arXiv:1104.2086.
Pind, J. (ed.); Magnússon, F., Briem, S. (1991). Íslensk orðtíðnibók. Orðabók Háskólans, Reykjavík.
Quasthoff, U., Fiedler, S., Hallsteinsdóttir, E. (eds.) (2012). Frequency Dictionary Icelandic. Leipziger Universitätsverlag.
Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference, University of Pennsylvania, Philadelphia, USA.

POS tag | Description | Absolute frequency of types | Relative frequency of types in %
nken-s | Masculine proper name (Nom., Sg.) | 353,439 | 6.4784
ta | Numeral | 273,074 | 5.0054
nken | Masculine noun (Nom., Sg.) | 171,016 | 3.1347
nven | Feminine noun (Nom., Sg.) | 152,936 | 2.8033
nhen | Neuter noun (Nom., Sg.) | 148,149 | 2.7155
nkeþ-s | Masculine proper name (Dat., Sg.) | 140,736 | 2.5796
nveo | Feminine noun (Acc., Sg.) | 132,370 | 2.4263
nveþ | Feminine noun (Dat., Sg.) | 131,259 | 2.4059
nkeo | Masculine noun (Acc., Sg.) | 130,846 | 2.3984
nheo | Neuter noun (Acc., Sg.) | 130,294 | 2.3882
Table 4: Most frequent POS tags

Number of assigned POS tags | Number of types (absolute number, percentage) | After removal of all combinations that occurred less than 2 times | After removal of all combinations that occurred less than 10 times
1 | 3,114,693 (80.05%) | 712,536 (72.70%) | 277,115 (71.91%)
2 | 446,511 (11.48%) | 151,161 (15.42%) | 63,948 (16.59%)
3 | 165,388 (4.25%) | 61,559 (6.28%) | 25,693 (6.67%)
4 | 71,074 (1.83%) | 24,937 (2.54%) | 9,149 (2.37%)
5 | 36,009 (0.93%) | 12,173 (1.24%) | 4,103 (1.06%)
6 | 19,880 (0.51%) | 6,306 (0.64%) | 2,050 (0.53%)
7 | 11,635 (0.30%) | 3,602 (0.37%) | 1,055 (0.27%)
8 | 7,349 (0.20%) | 2,265 (0.23%) | 691 (0.18%)
9 | 4,986 (0.13%) | 1,534 (0.16%) | 400 (0.10%)
Sum | 3,891,025 (100%) | 980,098 (100%) | 385,362 (100%)
Table 5: Number of types having multiple POS tags

Word class tag | For types | For token | For types (freq >= 30) | For token (freq >= 30) | Frequency Dictionary types | Frequency Dictionary token
Noun | 4,208,385 (77.14%) | 82,210,473 (29.33%) | 185,567 (70.24%) | 71,139,798 (28.74%) | 67.7% | 23.6%
Adjective | 594,693 (10.90%) | 17,810,552 (6.35%) | 38,478 (14.56%) | 60,516,649 (6.5%) | 15.9% | 6.9%
Numeral | 273,976 (5.02%) | 8,385,095 (2.99%) | 9,960 (3.77%) | 7,763,044 (3.14%) | 3.4% | 1.1%
Verb | 253,325 (4.64%) | 52,572,305 (18.75%) | 23,427 (8.87%) | 51,845,267 (20.95%) | 7.4% | 19.9%
Other (foreign words etc.) | 91,592 (1.68%) | 938,745 (0.33%) | 2,750 (1.04%) | 714,604 (0.29%) | 0.9% | 0.1%
Adverb and preposition | 28,711 (0.53%) | 60,601,142 (21.62%) | 2,707 (1.03%) | 60,516,649 (24.45%) | 4.2% | 22.4%
Conjunction | 2,145 (0.039%) | 32,049,776 (11.43%) | 89 (0.03%) | 31,986,948 (12.92%) | 0.1% | 11.6%
Pronoun | 2,078 (0.038%) | 25,404,021 (9.06%) | 949 (0.36%) | 25,398,212 (10.26%) | 0.1% | 14.3%
Unanalyzed word | 628 (0.012%) | 134,850 (0.05%) | 68 (0.03%) | 121,872 (0.09%) | 0.2% | 0.0%
Determiners | 59 (0.001%) | 234,760 (0.08%) | 34 (0.01%) | 234,617 (0.09%) | 0.0% | 0.1%
Table 6: Distribution of word classes in the LCC Icelandic corpus compared with the IFD corpus
