Text Normalization Is The Process of

Uploaded by

Sileshi Terefe

0% found this document useful (0 votes)

27 views9 pages

Original Title

sile

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

27 views9 pages

Text Normalization Is The Process of

Uploaded by

Sileshi Terefe

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 9

Search inside document

It is Wikipedia's birthday!

Watch our Africa community live stream @ 16:00 UTC, January 23. Join
the celebration.

Text normalization

Text normalization is the process of

transforming text into a single canonical
form that it might not have had before.
Normalizing text before storing or
processing it allows for separation of
concerns, since input is guaranteed to be
consistent before operations are
performed on it. Text normalization
requires being aware of what type of text
is to be normalized and how it is to be
processed afterwards; there is no all-
purpose normalization procedure.[1]

Applications
Text normalization is frequently used
when converting text to speech. Numbers,
dates, acronyms, and abbreviations are
non-standard "words" that need to be
pronounced differently depending on
context.[2] For example:
"$200" would be pronounced as "two
hundred dollars" in English, but as "lua
selau tālā" in Samoan.[3]
"vi" could be pronounced as "vie," "vee,"
or "the sixth" depending on the
surrounding words.[4]

Text can also be normalized for storing

and searching in a database. For instance,
if a search for "resume" is to match the
word "résumé," then the text would be
normalized by removing diacritical marks;
and if "john" is to match "John", the text
would be converted to a single case. To
prepare text for searching, it might also be
stemmed (e.g. converting "flew" and
"flying" both into "fly"), canonicalized (e.g.
consistently using American or British
English spelling), or have stop words
removed.

Techniques
For simple, context-independent
normalization, such as removing non-
alphanumeric characters or diacritical
marks, regular expressions would suﬃce.
For example, the sed script
sed ‑e "s/\s+/ /g" inputfile
would normalize runs of whitespace
characters into a single space. More
complex normalization requires
correspondingly complicated algorithms,
including domain knowledge of the
language and vocabulary being
normalized. Among other approaches, text
normalization has been modeled as a
problem of tokenizing and tagging
streams of text[5] and as a special case of
machine translation.[6][7]

See also
Automated paraphrasing
Canonicalization
Text simpliﬁcation
Unicode equivalence

References
1. Richard Sproat and Steven Bedrick
(September 2011). "CS506/606: Txt
Nrmlztn" . Retrieved October 2, 2012.
2. Sproat, R.; Black, A.; Chen, S.; Kumar,
S.; Ostendorfk, M.; Richards, C. (2001).
"Normalization of non-standard
words." Computer Speech and
Language 15; 287–333.
doi:10.1006/csla.2001.0169 .
3. "Samoan Numbers" .
MyLanguages.org. Retrieved
October 2, 2012.
4. "Text-to-Speech Engines Text
Normalization" . MSDN. Retrieved
October 2, 2012.
5. Zhu, C.; Tang, J.; Li, H.; Ng , H.; Zhao, T.
(2007). "A Uniﬁed Tagging Approach to
Text Normalization." Proceedings of
the 45th Annual Meeting of the
Association of Computational
Linguistics; 688–695.
doi:10.1.1.72.8138 .
. Filip, G.; Krzysztof, J.; Agnieszka, W.;
Mikołaj, W. (2006). "Text Normalization
as a Special Case of Machine
Translation." Proceedings of the
International Multiconference on
Computer Science and Information
Technology 1; 51–56.
7. Mosquera, A.; Lloret, E.; Moreda, P.
(2012). "Towards Facilitating the
Accessibility of Web 2.0 Texts through
Text Normalisation" Proceedings of
the LREC workshop: Natural Language
Processing for Improving Textual
Accessibility (NLP4ITA); 9-14
Retrieved from
"https://en.wikipedia.org/w/index.php?
title=Text_normalization&oldid=992614319"

Last edited 2 months ago by DocFreeman24

Content is available under CC BY-SA 3.0 unless

otherwise noted.

VHDL Digital Full ADDER Logic Using NAND Gate Program
Document49 pages
VHDL Digital Full ADDER Logic Using NAND Gate Program
Preeti Budhiraja
100% (1)
Generator Basics
Document100 pages
Generator Basics
Travis Chesna
100% (6)
Parcel Delivery Management System
Document28 pages
Parcel Delivery Management System
Hemal Karmakar
No ratings yet
(Robert Mailloux) Electronically Scanned Arrays
Document92 pages
(Robert Mailloux) Electronically Scanned Arrays
Manikanta Lalkota
No ratings yet
K20 Engine Control Module X3 (Lt4) Document ID# 4739106
Document3 pages
K20 Engine Control Module X3 (Lt4) Document ID# 4739106
Data Técnica
No ratings yet
TESCO Systems Integration Brochure
Document2 pages
TESCO Systems Integration Brochure
davev2005
No ratings yet
DSB- Unit4-representing and miniing text-decision-analytic-think-II
Document46 pages
DSB- Unit4-representing and miniing text-decision-analytic-think-II
Bhavya sree machineni
No ratings yet
Linguistic Data Management
Document36 pages
Linguistic Data Management
AGUINALDO GOMES DE SOUZA
No ratings yet
Multilingual Information Retrieval
Document18 pages
Multilingual Information Retrieval
Harshitha
No ratings yet
Unraveling The Power of Natural Language Processing
Document11 pages
Unraveling The Power of Natural Language Processing
suranifaizan52
No ratings yet
Sample
Document8 pages
Sample
Sasi Dhar
No ratings yet
Normalization of Informal Text
Document22 pages
Normalization of Informal Text
mimaktermaya12233
No ratings yet
Thesis Sahib - Order Now On
Document11 pages
Thesis Sahib - Order Now On
rredknlpd
100% (1)
Tasks in NLP
Document7 pages
Tasks in NLP
A K
No ratings yet
Week 8-Module 7 NLP
Document52 pages
Week 8-Module 7 NLP
funcrusherr
No ratings yet
Ai DP 2
Document3 pages
Ai DP 2
Harsh Naraini
No ratings yet
2 Text Indexing Storage Retrieval
Document65 pages
2 Text Indexing Storage Retrieval
Tushar Shah
No ratings yet
02 Text Operation
Document52 pages
02 Text Operation
Mikiyas Abate
No ratings yet
NLP Manual (1-12)
Document55 pages
NLP Manual (1-12)
sj120cp
No ratings yet
NLP Manual (1-12) 1
Document56 pages
NLP Manual (1-12) 1
sj120cp
No ratings yet
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
Document14 pages
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
aderaj getinet
No ratings yet
NLP Manual (1-12)
Document54 pages
NLP Manual (1-12)
sj120cp
No ratings yet
Unit 5
Document16 pages
Unit 5
Sanju Shree
No ratings yet
Text Corpus
Document3 pages
Text Corpus
linda976
No ratings yet
Chapter-1 Introduction To NLP
Document12 pages
Chapter-1 Introduction To NLP
Sruja Koshti
No ratings yet
2 Text Operations
Document32 pages
2 Text Operations
halal.army07
No ratings yet
161 Paper
Document6 pages
161 Paper
acouillault
No ratings yet
Ch1 B NLP Introduction
Document43 pages
Ch1 B NLP Introduction
fefwefe
No ratings yet
Controlled LG For Human Communication
Document4 pages
Controlled LG For Human Communication
shadi sabouri
No ratings yet
Text Normalization in Social Media: Progress, Problems and Applications For A Pre-Processing System of Casual English
Document10 pages
Text Normalization in Social Media: Progress, Problems and Applications For A Pre-Processing System of Casual English
23520053 I Putu Eka Surya Aditya
No ratings yet
Basics of Bag of Words Model
Document32 pages
Basics of Bag of Words Model
Ganesh Chandrasekaran
No ratings yet
5365-Article Text-8590-1-10-20200508
Document8 pages
5365-Article Text-8590-1-10-20200508
江澜
No ratings yet
Word Count
Document3 pages
Word Count
Leo Lonardelli
No ratings yet
Unit 1
Document4 pages
Unit 1
Shiv M
No ratings yet
Tagging Spoken Swedish
Document32 pages
Tagging Spoken Swedish
ramlohani
No ratings yet
Raslan Skell
Document9 pages
Raslan Skell
أبو طارق ابن الحوشبي
No ratings yet
2 - Text Operation
Document29 pages
2 - Text Operation
maki ababi
No ratings yet
Chapter 2 Part 1 & 2
Document58 pages
Chapter 2 Part 1 & 2
S J
No ratings yet
Interlingua in Machine Translation
Document5 pages
Interlingua in Machine Translation
foronica
No ratings yet
Sms Test
Document11 pages
Sms Test
Vaishali Kashyap
No ratings yet
Lecture 2 Serialization Basics 1.5 Hours
Document10 pages
Lecture 2 Serialization Basics 1.5 Hours
shortsforyou62
No ratings yet
Encoder Decoder Models
Document31 pages
Encoder Decoder Models
belwalkarvarad
No ratings yet
Text Operation Assingnmet
Document33 pages
Text Operation Assingnmet
beshahashenafe20
No ratings yet
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
Document54 pages
Natural Language Processing CS 1462: Some Slides Borrows From Carl Sable
Hamad Abdullah
No ratings yet
Natural Language Processing
Document9 pages
Natural Language Processing
api-3806739
100% (2)
Assignment of AI Finished
Document16 pages
Assignment of AI Finished
Aga Chimdesa
No ratings yet
CSDM2-Text Preprocessing For NL Data - 011050
Document6 pages
CSDM2-Text Preprocessing For NL Data - 011050
ignaciojudyann596
No ratings yet
Bag of Words
Document32 pages
Bag of Words
ravinyse
No ratings yet
Thesis 2.1 User Guide
Document4 pages
Thesis 2.1 User Guide
evkrjniig
100% (2)
Enhancing Automatic Acquisition of Thematic Structure in A Large-Scale Lexicon For Mandarin Chinese
Document9 pages
Enhancing Automatic Acquisition of Thematic Structure in A Large-Scale Lexicon For Mandarin Chinese
Alen Zahorjević
No ratings yet
NLP Manual
Document15 pages
NLP Manual
Chennakesavareddy Appireddy
No ratings yet
Session 14 - Computaional Linguistics
Document23 pages
Session 14 - Computaional Linguistics
ebonchill7.0.0
No ratings yet
Seven Dimensions of Portability For Language Documentation and Description
Document26 pages
Seven Dimensions of Portability For Language Documentation and Description
Arul Dayanand
No ratings yet
SMC Learning Module For Students in NLP001 1
Document44 pages
SMC Learning Module For Students in NLP001 1
Ladylen Dee
No ratings yet
Pauses in Writin, 17 R&W Journal
Document19 pages
Pauses in Writin, 17 R&W Journal
Mutahar AL-Nahari
No ratings yet
Evolving English
Document2 pages
Evolving English
alierfanreza
No ratings yet
Heterogeneous Linguistic Data-Generic XML-based Representation and Flexible Visualization-2005
Document5 pages
Heterogeneous Linguistic Data-Generic XML-based Representation and Flexible Visualization-2005
driss ou
No ratings yet
NLP Insem Notes
Document13 pages
NLP Insem Notes
taslimarif makandar
No ratings yet
NLP Notes
Document71 pages
NLP Notes
softb0774
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
Document26 pages
Lecture 3-Term Vocabulary and Posting Lists
Yash Gupta
No ratings yet
Information Processing and Management: Danushka Bollegala, Naoaki Okazaki, Mitsuru Ishizuka
Document21 pages
Information Processing and Management: Danushka Bollegala, Naoaki Okazaki, Mitsuru Ishizuka
indorayabt
No ratings yet
Terminology in The Age of Multilingual Corpora
Document23 pages
Terminology in The Age of Multilingual Corpora
VascoSobreiro
No ratings yet
A Method For Vietnamese Text Normalization To Improve The Quality of Speech Synthesis
Document9 pages
A Method For Vietnamese Text Normalization To Improve The Quality of Speech Synthesis
rain1024
No ratings yet
Project Report
Document12 pages
Project Report
Sasi Dhar
No ratings yet
Application of Computational Linguistics
Document19 pages
Application of Computational Linguistics
abeeraafzal45
No ratings yet
The Art of Style and Design For Editors and Authors
From Everand
The Art of Style and Design For Editors and Authors
Steve Taylor
No ratings yet
SS11 Heavy Duty Speed Sensor Drawing 21184200
Document1 page
SS11 Heavy Duty Speed Sensor Drawing 21184200
Arturo Narvaez
No ratings yet
Structural Design of RCC CWR KL Capacity At-Dist - : 550 Abc Nagaur
Document30 pages
Structural Design of RCC CWR KL Capacity At-Dist - : 550 Abc Nagaur
ARSE
No ratings yet
G G G G B: Bolts of No Dia Bolt D
Document1 page
G G G G B: Bolts of No Dia Bolt D
Anonymous Nayak
No ratings yet
(Tex Ebook) - Advanced Latex
Document23 pages
(Tex Ebook) - Advanced Latex
enricosirola
No ratings yet
Starrett Decimal Equivalent Card (Bulletin 1317)
Document2 pages
Starrett Decimal Equivalent Card (Bulletin 1317)
Franz Coras
No ratings yet
11
Document9 pages
11
Saharah Pundug
No ratings yet
FL Response Crpe PDF
Document10 pages
FL Response Crpe PDF
National Education Policy Center
No ratings yet
Campbell Dudek Smith 1970
Document10 pages
Campbell Dudek Smith 1970
richardvas12
No ratings yet
Read Me
Document1 page
Read Me
Dileep Mishra
No ratings yet
The Clinical Relevance of The Relation Between Maximum Flow Declination Rate and Sound Pressure Level in Predicting Vocal Fatigue
Document10 pages
The Clinical Relevance of The Relation Between Maximum Flow Declination Rate and Sound Pressure Level in Predicting Vocal Fatigue
Chris
No ratings yet
El Ácido Clorogénico
Document10 pages
El Ácido Clorogénico
Jonatan Velez
No ratings yet
What Is Flow?: Turbulent Flow Laminar Flow
Document3 pages
What Is Flow?: Turbulent Flow Laminar Flow
Aurelia Alexandra
No ratings yet
Sand Compaction Method
Document124 pages
Sand Compaction Method
isaych33ze
100% (1)
Transcript PDF
Document1 page
Transcript PDF
api-486098227
No ratings yet
Square D Accusine Pcs Active Harmonic Filter
Document2 pages
Square D Accusine Pcs Active Harmonic Filter
Mushfiqur Rahman
No ratings yet
Cache Memory Scheme
Document7 pages
Cache Memory Scheme
jainvidisha
No ratings yet
Introduction To Internal Combustion Engines: Asst. Prof. Vishnu Sankar
Document57 pages
Introduction To Internal Combustion Engines: Asst. Prof. Vishnu Sankar
Mohammad Ishfaq Bhat
No ratings yet
9 Cu Sinif Riyaziyyat Imtahan Testleri
Document72 pages
9 Cu Sinif Riyaziyyat Imtahan Testleri
Seymur Akbarov
0% (2)
University of Cagayan Valley School of Criminology 23
Document4 pages
University of Cagayan Valley School of Criminology 23
gaea lou
No ratings yet
X-Ways Forensics v14.3 Manual
Document101 pages
X-Ways Forensics v14.3 Manual
kjadhav2000
No ratings yet
Time Value of Money-Powerpoint
Document83 pages
Time Value of Money-Powerpoint
haljordan313
No ratings yet
Distribution of Experiments (RU-23)
Document2 pages
Distribution of Experiments (RU-23)
alifarafatbiploby
No ratings yet
CAD of Radial and Mixed Flow Turbomachinery
Document2 pages
CAD of Radial and Mixed Flow Turbomachinery
james
No ratings yet
CT Abdomen
Document2 pages
CT Abdomen
Vlog Anak Dusun
No ratings yet