Professional Documents
Culture Documents
C94-1033 - Morphological Analysis
C94-1033 - Morphological Analysis
208
mar is shown in Figure 2. Currently our gram- with "~{'.1." in the dictionary. The start, lug posi-
mar has about 4,300 rules and 400 nonterminal tion is l, hen set I,o 3 and t.rled again, and this
symbols. ,.i,,~ th,. words "al~:~ (,:,lnlp,,Ce,.y and "i}t~';
k)~;{'~{i[i(comput.ing facilit,y)" are obtained. 'e
2.2 Dictionary Lookup The problem here ix (,hal;, even though we
know that, 1,here is "TQ){l!}i[.~])~ (ma.in[rarne)"
While the flmction words (particles, auxiliary
al, posit;ion I, we look up "}}[~{:t~ (computer)"
verbs, and so on, totaling several hundred) are
again. Since "iil~:~.~: (computer)" is a snhstring
encoded in the g r a m m a r rules, the content words
of "9'4~{~iI'~;)1~ (n-lainframe)," we know that, t;he
(nouns, verbs, and so on) are stored in st sepa-
word "~,i]~,i~ (compul:er)" exists at, posit;ion 3 as
rate dictionary. Since content words may appear
soon as we lind "X~{~}~[~3,,i~(lnainframe)" at i)o-.
at any position in tile input sentence, dictionary
sit;ion I. Therefore, going back l;o 1;he root node
access is tried from all the positions n.
at position 3 and trying mat;citing all over again
For example, in the sentence fragment: ill Fig-
means duplicatAng our efforts unnecessarily.
ure 3,
209
J[~i~/~ -> EAGYOU-SDAN [pos=l,kow=doushi] ~ f 5 f~;
YJ~ 5 [~ -> °'~'" [pos=26,kow=v_infl] ~,[iJ5 ~ k , ~ 4 ~x,~ cost=300;
~ ] 5 ~-~,~)- 4~(ff6 -> "J'" [pos=64,kow=jodoushi,fe={negative}]
"~" [pos=78, kow=set suzoku_j oshi]
~ I t J j ~ ~'f cost=500;
~l~J~J~,~Y/ -> J[~)~ cost=999;
~ 1 -> "" [pos=48,kow=jodoushi] J~J~-~Y~,#~ cost=S00;
.... ~ ~)~J~ cost=300;
~)j~y~t~ _> "fZ o " [pos=45,kow=aux_infl] '"~J[J~ cost=a00;
t ~ 3 fl" g- 6 '?
4large -~ = c o m p u t e r - "
facility
4 mainframe
* computing laiSility 4~
Fig. 4: T R I E index
210
~¢" G% t.-:
-I I
node accesses dominates the performance of the 3 Constructing TRIE with fail
overall system.
pointers
A TI{IF, index with fail pointers is created in
the following two steps:
211
If tile node corresponds to the end of some traditional T R I E and was 27% faster in CPU
word, we record the length l of the word in the time. The CPU time was measured with all the
node. For example, at the node that corresponds nodes in the main memory.
to the end of the word " ~ : ~ t ' ~ : ~ (mainframe)", For the computer manuals, the reduction rate
I = 5 and l = 3 are recorded because it is the end was a little larger. This is attributable to the fact
of both of the words " ~ ] . ~ : ~ (mainframe, that computer manuals tend to contain longer,
l = 5)" and "~l'-~-~ (computer, l = 3)." 3 more technical terms than newspaper artMes.
Figure 6 shows the complete T R I E with tile Our method is more effective if there are a large
fail pointers. number of long words in a text.
272
l,'ig. 6: TI{II,~ index with fail pointers
213