Professional Documents
Culture Documents
Text Normalization Is The Process of
Text Normalization Is The Process of
Watch our Africa community live stream @ 16:00 UTC, January 23. Join
the celebration.
Text normalization
Applications
Text normalization is frequently used
when converting text to speech. Numbers,
dates, acronyms, and abbreviations are
non-standard "words" that need to be
pronounced differently depending on
context.[2] For example:
"$200" would be pronounced as "two
hundred dollars" in English, but as "lua
selau tālā" in Samoan.[3]
"vi" could be pronounced as "vie," "vee,"
or "the sixth" depending on the
surrounding words.[4]
Techniques
For simple, context-independent
normalization, such as removing non-
alphanumeric characters or diacritical
marks, regular expressions would suffice.
For example, the sed script
sed ‑e "s/\s+/ /g" inputfile
would normalize runs of whitespace
characters into a single space. More
complex normalization requires
correspondingly complicated algorithms,
including domain knowledge of the
language and vocabulary being
normalized. Among other approaches, text
normalization has been modeled as a
problem of tokenizing and tagging
streams of text[5] and as a special case of
machine translation.[6][7]
See also
Automated paraphrasing
Canonicalization
Text simplification
Unicode equivalence
References
1. Richard Sproat and Steven Bedrick
(September 2011). "CS506/606: Txt
Nrmlztn" . Retrieved October 2, 2012.
2. Sproat, R.; Black, A.; Chen, S.; Kumar,
S.; Ostendorfk, M.; Richards, C. (2001).
"Normalization of non-standard
words." Computer Speech and
Language 15; 287–333.
doi:10.1006/csla.2001.0169 .
3. "Samoan Numbers" .
MyLanguages.org. Retrieved
October 2, 2012.
4. "Text-to-Speech Engines Text
Normalization" . MSDN. Retrieved
October 2, 2012.
5. Zhu, C.; Tang, J.; Li, H.; Ng , H.; Zhao, T.
(2007). "A Unified Tagging Approach to
Text Normalization." Proceedings of
the 45th Annual Meeting of the
Association of Computational
Linguistics; 688–695.
doi:10.1.1.72.8138 .
. Filip, G.; Krzysztof, J.; Agnieszka, W.;
Mikołaj, W. (2006). "Text Normalization
as a Special Case of Machine
Translation." Proceedings of the
International Multiconference on
Computer Science and Information
Technology 1; 51–56.
7. Mosquera, A.; Lloret, E.; Moreda, P.
(2012). "Towards Facilitating the
Accessibility of Web 2.0 Texts through
Text Normalisation" Proceedings of
the LREC workshop: Natural Language
Processing for Improving Textual
Accessibility (NLP4ITA); 9-14
Retrieved from
"https://en.wikipedia.org/w/index.php?
title=Text_normalization&oldid=992614319"