Description of The Tangora System
ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992 940 PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
language model. Indeed, since the number of homophones is quite high (2.6) in an inflected language such as French, it is clear that no acoustic model, however perfect, can produce a satisfactory decoding without the support of a language model.

Power and limitations of probabilistic language models

Probabilistic language models are powerful enough to considerably reduce ambiguities that the acoustic model alone cannot solve. However, they suffer from local imperfections that are inherent in their formulation. This is clearly shown by testing a probabilistic model on the lattice formed by the set of the homophones of the words of every sentence. The decoding obtained by searching for the maximum likelihood path (Cerf, 91) gives an error rate close to 3%, thus showing some of the inadequacies of the probabilistic language models.

Besides, and again for reliability reasons, statistics need to be gathered from large learning corpora (tens or even hundreds of millions of words). In spite of all the preliminary cleaning that may be done (automatic correction of typos, tripled consonants for instance), such a huge corpus contains a certain number of grammatical errors, which introduce noise in the model.

Probabilistic estimations are produced by [...]

Appositions and interpolated clauses increase the distance between elements which must agree:

Plusieurs PARTIS d'opposition de gauche, notamment le parti communiste, PARTAGENT ce point de vue.

Predicting a word from the preceding words does not allow the system to appropriately control person agreement when the subject follows the verb. Example:

Que sont DEVENUS les principaux PROTAGONISTES de la victoire du onze novembre?

Moreover, some confusions due to homophony induce changes of grammatical category, which require a complete interpretation of the sentence to be properly diagnosed, as in "et"/"est" (conjunction/verb) or "à"/"a" (preposition/verb).

Coupling the ADS with the grammar checker

To bring a solution to the problems described above, we propose to perform a grammatical analysis after the decoding operation. The grammatical analysis applies to the best of the hypotheses selected by the ADS. It serves as a basis to diagnose grammatical errors and to suggest corrections (2).

(2) A similar approach was tested in English, but only to detect grammatically incorrect sentences (Bellegarda 92).
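The maximum likelihood search over a homophone lattice, mentioned in the discussion of probabilistic language models, can be illustrated with a small dynamic-programming sketch. This is not the Tangora decoder: it assumes a simple bigram model for brevity, and the function name `best_homophone_path`, the toy probability table `TOY_BIGRAMS` and the example sentence are all hypothetical.

```python
import math

def best_homophone_path(lattice, bigram_logprob):
    """Return the most likely spelling sequence through a homophone lattice.

    lattice: list of candidate lists, one list of homophones per spoken word.
    bigram_logprob(prev, word): log P(word | prev); prev is None at sentence start.
    """
    # best maps a word to (log score of the best path ending in it, that path).
    best = {None: (0.0, [])}
    for candidates in lattice:
        new_best = {}
        for word in candidates:
            # Viterbi step: keep only the best predecessor for each candidate.
            score, path = max(
                ((s + bigram_logprob(prev, word), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Hypothetical toy probabilities, invented for illustration only.
TOY_BIGRAMS = {
    (None, "il"): 0.5,
    ("il", "est"): 0.4,
    ("il", "et"): 0.01,
    ("est", "à"): 0.2,
    ("est", "a"): 0.01,
    ("et", "à"): 0.05,
    ("et", "a"): 0.05,
    ("à", "Paris"): 0.3,
    ("a", "Paris"): 0.02,
}

def toy_logprob(prev, word):
    # Unseen bigrams get a small floor probability instead of zero.
    return math.log(TOY_BIGRAMS.get((prev, word), 1e-4))

lattice = [["il"], ["et", "est"], ["a", "à"], ["Paris"]]
print(best_homophone_path(lattice, toy_logprob))
```

With these toy numbers the search prefers "est" and "à" over their homophones. A model of this kind only looks at a short window of preceding words, which is why long-distance agreement of the sort shown in the French examples escapes it.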
computes a syntactic sketch, which represents the likeliest syntactic surface structure of the sentence; at this stage, such phenomena as coordinations, ellipses and interpolated clauses, if not totally resolved, do not block the parsing. The analysis is based on the so-called relaxed approach, which consists in rejecting linguistic constraints which, as pertinent as they may be in descriptive linguistics, are rarely satisfied stricto sensu in the surface structures of free texts. This strategy broadens the coverage of the grammar and also allows the parser to deal with erroneous texts.

Architecture of the parser:

The system is written in PLNLP (Programming Language for Natural Language Processing, G. Heidorn, 72). It includes:

• A morphologic dictionary (50,000 lemmas plus their inflection tables),
• A morpho-syntactic dictionary, which describes the sub-categorizations attached to each lemma,
• A set of more than 300 PLNLP production rules, which produce the syntactic sketches,
• A set of procedures built to re-interpret the syntactic sketches and to diagnose errors,
• A form generator, which provides corrected forms.

[...] (Jensen, 83) and multiple parses are ranked thanks to specific procedures (Heidorn, 76). This last point allows the system to automatically select the strongest hypothesis, according to the linguistic features (including the grammar errors) of the syntactic trees.

Adaptation of the parser to the ADS

As mentioned above, many grammatical errors in written French are actually caused by homophones (gender, number agreement, confusion between infinitive and past participle, "chantez/chanter", "et/est", etc.). The parser, initially built for written French, is thus well prepared to detect errors produced by an ADS.

It can however be adapted to the specific needs of the ADS, by adding specific procedures (detection of ill-recognized frozen phrases, etc.), and by filtering out non-homophonic corrections, or corrections which do not belong to the list of candidates initially proposed by the ADS.

Indeed, post-processing procedures are largely used to diagnose errors after the syntactic tree has been computed. This offers the immense advantage of making the system evolutionary: it can be easily modified, in order to improve the scope of the detections. This made the adaptation of the grammar checker to the ADS quite straightforward.
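The filtering step described for the adaptation to the ADS (dropping corrections that are not among the candidates the ADS itself proposed) can be sketched as follows. This is only an illustration under stated assumptions: the paper's system is written in PLNLP, and the function `filter_corrections`, the data layout and the candidate sets below are hypothetical.

```python
def filter_corrections(suggestions, ads_candidates):
    """Keep only the corrections that the ADS itself listed as candidates.

    suggestions: list of (position, proposed_word) pairs from the grammar checker.
    ads_candidates: one set of candidate words per word position in the sentence.
    """
    kept = []
    for position, proposed in suggestions:
        # Compare case-insensitively: the checker may upper-case its corrections.
        allowed = {word.lower() for word in ads_candidates[position]}
        if proposed.lower() in allowed:
            kept.append((position, proposed))
    return kept

# Hypothetical candidate sets for "Ce document est à faire signé".
ads_candidates = [
    {"ce", "se"},
    {"document", "documents"},
    {"est", "et"},
    {"à", "a"},
    {"faire", "fer"},
    {"signé", "signer", "signés"},
]

# Two hypothetical checker suggestions: one homophonic, one not.
suggestions = [(5, "SIGNER"), (2, "était")]
print(filter_corrections(suggestions, ads_candidates))
```

Here the suggestion "SIGNER" survives because it is one of the candidates for that position, while "était" is discarded because the ADS never proposed it.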
corp1   AFP dispatches   (1000 words)
corp2   AFP dispatches   (3221 words)
corp3   e-mail notes     (1909 words)
corp4   grammar books    (1337 words)

Only the CORP1 file was obtained through a real decoding; the other corpora were processed by automatically generating their homophones.

Results

The experiments were made at an early stage of the coupling. They could certainly be improved with more extensive tests, as the adaptation of the grammar checker to the ADS would gain in accuracy.

Percentage of erroneous words left uncorrected

(4) The bad results of the CORP3 file are due in great part to the difficulties of e-mail, which make parsing less accurate.

Ce document est à faire signé recto et verso par le propriétaire et par le gestionnaire.

After parsing, the suggested correction is:

Ce document est à faire SIGNER recto et verso par le propriétaire et par le gestionnaire.

Conclusion

Coupling the ADS and the syntactic parser meets the initially assigned objectives quite satisfactorily: broad coverage of the texts parsed by the grammar, meaningful percentage of justified corrections, adequacy of the syntactic parser to the types of errors specifically generated by the decoder.
L'âge de la MER le plus fréquent à l'accouchement est de vingt-six ans.

A subsidiary advantage of the coupling would be to detect errors that would not be produced by the ADS but by the speaker him/herself (punctuation, stylistic infelicities, mood of subordinate clauses, etc.). Not only may we contemplate transcribing the words of a speaker as accurately as possible, but also offering him/her a stylistic aid.

References

Averbuch A. et al., 1987: Experiments with the TANGORA 20,000 word Speech Recognizer, Proceedings of ICASSP, Dallas, pp. 701-704.

Bellegarda J., Braden-Harder L., Jensen K., Kanevsky D., Zadrozny W., 1992: "Post-recognizer language processing: applications to speech, handwriting", submitted to EUSIPCO'92.

Cerf-Danon H., de La Noue P., Diringer L., El-Bèze M., Marcadet J.C., 1990: "A 20,000 words, automatic speech recognizer. Adaptation to French of the US TANGORA system", Nato 1990.

Cerf-Danon H., El-Bèze M., 1991: "Three different Probabilistic Language Models: Comparison and Combination", ICASSP 1991.

Chanod J-P., 1991: Analyse automatique d'erreurs: stratégie linguistique et computationnelle, Colloque Informatique et Langue naturelle, 23-24 janvier 91, Liana, Univ. de Nantes.

DeGennaro S., Cerf-Danon H., Ferretti M., Gonzales J., Keppel E., 1991: "Tangora - a large vocabulary speech recognition system for five languages", EuroSpeech 1991, Genoa.

Derouault A-M., Mérialdo B., 1984: "Language modeling at the syntactic level", 7th International Conference on Pattern Recognition, August 1984, Montreal.

Derouault A-M., El-Bèze M., 1990: "A Morphological Model for Large Vocabulary Speech Recognition", ICASSP 1990.

Heidorn G.E., 1972: Natural Language Inputs to a Simulation Programming System, Ph.D. dissertation, Yale University.

Heidorn G.E., Jensen K., Miller L.A., Byrd R.J., Chodorow M.S., 1982: "The EPISTLE Text-Critiquing System", IBM Systems Journal, vol. 21, n°3.

Heidorn G.E., 1976: "An Easily Computed Metric for Ranking Alternative Parses", presented at the Fourteenth Annual Meeting of the ACL, San Francisco, October 1976.

Jelinek F., 1976: "Continuous Speech Recognition by Statistical Methods", Proceedings of the IEEE, Vol. 64, April 1976.

Jensen K., Heidorn G.E., 1983: "The Fitted Parse: 100% Parsing Capability in a Syntactic Grammar of English", Proc. Conf. on Applied Natural Language Processing, Santa Monica, California, pp. 93-98.

Jensen K., 1986: "A Broad-Coverage Computational Syntax of English", unpublished documents, IBM T.J. Watson Research Center, Yorktown Heights, N.Y.