
TEXT

PREPROCESSING
Information Retrieval
Role
Preprocessing is a critical first step in text mining,
Natural Language Processing (NLP), and
information retrieval (IR).
In text mining, preprocessing is used to extract
interesting and non-trivial knowledge from
unstructured text data.
Information Retrieval (IR) is essentially a matter of
deciding which documents in a collection should be
retrieved to satisfy a user's information need,
expressed as a query.
Techniques
Tokenization
Stop Words (Common Words)
Removal
Normalization
Stemming
Lemmatization
Acronym Expansion
Tokenization
Breaks the input into words.
Converts a character stream -> a token stream.
The component is called a tokenizer / lexer / scanner.
The aim of tokenization is to identify the words in a
sentence.
The tokenizer then hooks up to a parser.
A parser defines grammar rules over these tokens
and determines whether statements are
well-formed.
The tokenizer then feeds the tokens to the retrieval
system for further processing.
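The character-stream-to-token-stream step can be sketched with Python's standard library; a minimal regex-based tokenizer (an illustrative sketch, not a production lexer) that keeps punctuation as separate tokens:

```python
import re

def tokenize(text):
    # \w+ matches runs of letters/digits (words);
    # [^\w\s] matches a single punctuation character;
    # whitespace matches neither pattern and is discarded.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Break the input into words."))
# ['Break', 'the', 'input', 'into', 'words', '.']
```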
Identifying Tokens
Divide on whitespace and throw away the
punctuation?
Many documents we see in day-to-day life are
structured using markup tags:
HTML, XML, ePub, etc.
What is the best way to deal with these tags?
Use them as delimiters / tokens.
Filter them out entirely.
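Filtering tags out entirely can be sketched with Python's built-in html.parser, which treats tags as delimiters and keeps only the text content (a minimal sketch for simple markup, not a full HTML cleaner):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text content; tags act as delimiters."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags.
        self.chunks.append(data)

def strip_tags(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    # Re-join and collapse the whitespace left where tags were.
    return " ".join(" ".join(stripper.chunks).split())

print(strip_tags("<p>Hello <b>world</b></p>"))
# Hello world
```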
Challenges in Tokenization
Challenges in tokenization depend on the type of
language.
Languages such as English and French are referred to as
space-delimited.
Languages such as Chinese and Thai are referred to as
unsegmented, as their words do not have clear boundaries.
Tokenizing unsegmented language sentences requires
additional lexical and morphological information.
Advanced Tokenization
Convert the character stream to tokens using any
programming language.
The libraries of a programming language use
specific grammars to tokenize text.
They can easily identify comments and literals.
Java has tokenization libraries:
java.util.Scanner
java.lang.String.split()
java.util.StringTokenizer
Python provides the nltk.tokenize package, which
tokenizes input text.
Tweet Tokenizer
This tokenizer tokenizes tweets according to their writing conventions,
e.g. links are kept as single, complete tokens.

Input:
RT @Edourdoo: Shocking picture of the earthquake in Nepal, http://t.co/2N09Jz96lq
RT @NewEarthquake: 4.7 earthquake, 22km W of Kodari, Nepal. Apr 25 14:05 at epicenter (31m ago, depth 10km). http://t.co/5I5vVQnCQe

Output:
['RT', ':', 'Shocking', 'picture', 'of', 'the', 'earthquake', 'in', 'Nepal', ',',
'http://t.co/2N09Jz96lq']
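nltk ships a ready-made nltk.tokenize.TweetTokenizer; the core idea — try tweet-specific patterns before generic word patterns so URLs and @mentions survive as whole tokens — can be sketched with a plain regex (a simplified sketch; unlike the output above, it keeps the @mention rather than stripping handles):

```python
import re

# Alternatives are tried left to right: URLs first, then
# @mentions/#hashtags, then ordinary words, then punctuation.
TWEET_TOKEN = re.compile(
    r"https?://\S+"   # URLs kept as one complete token
    r"|[@#]\w+"       # @mentions and #hashtags
    r"|\w+"           # ordinary words
    r"|[^\w\s]"       # single punctuation marks
)

def tweet_tokenize(tweet):
    return TWEET_TOKEN.findall(tweet)

print(tweet_tokenize("RT @Edourdoo: Shocking picture http://t.co/2N09Jz96lq"))
# ['RT', '@Edourdoo', ':', 'Shocking', 'picture', 'http://t.co/2N09Jz96lq']
```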
Dropping Common Terms (Stop Words)
Stop words are insignificant or very common words in a
language.
In English: articles, prepositions, etc.
Removing them reduces the vast amount of unnecessary information
retrieved for a search query.
Some tools specifically avoid removing these stop words to
support phrase search.
The general trend in IR systems over time has been from standard
use of quite large stop lists (200-300 terms), to very small stop lists
(7-12 terms), to no stop list whatsoever.
Some stop words: Accordingly, Across, Actually, After, Afterwards.
Dropping Common Terms (Stop Words)

Input:
RT @hendhunu: #NepalQuakes More than 100 killed in powerful Nepal #earthquake, say govt. officials and police http://t.co/Bk50ubiZPB

Output:
RT @hendhunu: #NepalQuakes 100 killed powerful Nepal #earthquake, say govt. official police http://t.co/Bk50ubiZPB

Removed words: More, than, in, and (nltk.corpus stopwords)
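Stop word removal is a simple set lookup; a minimal sketch with a tiny inline stop list (nltk.corpus.stopwords supplies a full list in practice):

```python
# A tiny illustrative stop list; nltk.corpus.stopwords
# provides the full English list used in the slide above.
STOP_WORDS = {"more", "than", "in", "and", "the", "of", "a", "to"}

def remove_stop_words(tokens):
    # Compare case-insensitively, keep the survivors' original casing.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "More than 100 killed in powerful Nepal earthquake".split()
print(remove_stop_words(tokens))
# ['100', 'killed', 'powerful', 'Nepal', 'earthquake']
```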
Normalization
Transforming text into a single canonical form.
Text normalization requires:
awareness of the type of text to be normalized;
knowledge of how it is to be processed afterwards.
It is mostly used in text-to-speech, for:
numbers and dates;
acronyms and abbreviations;
non-standard "words" that need to be pronounced differently depending
on context.

The expansion depends on the language:
$200 -> two hundred dollars in English
$200 -> lua selau tālā in Samoan
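The canonical-form idea can be sketched as Unicode normalization plus a lookup table of expansions; the table below is a hypothetical, illustrative sample, not a real text-to-speech lexicon:

```python
import unicodedata

# Hypothetical expansion table for text-to-speech style
# normalization; a real system would cover numbers, dates,
# acronyms, and abbreviations systematically.
EXPANSIONS = {"$200": "two hundred dollars", "Dr.": "doctor"}

def normalize(text):
    # First map the text to a canonical Unicode form (NFKC),
    # then expand known non-standard words token by token.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(EXPANSIONS.get(tok, tok) for tok in text.split())

print(normalize("Dr. Smith paid $200"))
# doctor Smith paid two hundred dollars
```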
Lemmatization
Words appear in several inflected forms: e.g. the verb to
walk may appear as walk, walked, walks, walking.
Lemmatization is the process of grouping together the
different inflected forms of a word so they can be
analyzed as a single item.
It reduces inflectional and derivationally related
forms of a word to a common base form.
The base form of the word is called the lemma.
It can match different forms of verbs, adjectives, and adverbs, and
even supports synonyms and wildcard matching.
This is sometimes referred to as fuzzy matching.
Lemmatization
Example:
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be differ color

Using nltk.stem.WordNetLemmatizer:
Input: 24 hour Control Room for queries regarding the Nepal Earthquake.
Output: 24 hour Control Room for query regard the Nepal Earthquake.


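The WordNet lemmatizer relies on a large morphological dictionary; the idea can be sketched with a toy lookup table (the table below is illustrative, not real WordNet data):

```python
# Toy lemma table standing in for a real morphological
# dictionary such as WordNet (nltk.stem.WordNetLemmatizer).
LEMMAS = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car",
    "walked": "walk", "walks": "walk", "walking": "walk",
}

def lemmatize(word):
    # Look up the inflected form; fall back to the word itself.
    word = word.lower()
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["am", "are", "is"]])
# ['be', 'be', 'be']
```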
Stemming vs. Lemmatization
Stemming usually refers to a crude heuristic process that chops
off the ends of words.
It aims to reach the base word by chopping, and often includes the removal of
derivational affixes.
Lemmatization usually refers to doing things properly with the
use of a vocabulary and morphological analysis of words.
It aims to remove inflectional endings only and to return the base or
dictionary form of a word.

saw -> s by stemming
saw -> see or saw by lemmatization
Depends on whether the word was used as a verb or a noun.
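The chop-off-the-ends heuristic can be illustrated with a deliberately crude suffix stripper (far simpler than a real stemmer such as Porter's algorithm; the suffix list is illustrative):

```python
def crude_stem(word):
    # Crude heuristic: chop the first matching suffix, longest first,
    # but only if enough of the word would remain.
    for suffix in ("ational", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("walking"), crude_stem("saws"))
# walk saw
```

Because there is no vocabulary lookup, such a stemmer happily produces non-words; that is exactly the difference from lemmatization described above.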
Acronym Expansion
A word or name formed as an abbreviation from the initial
components of a phrase or a word:
individual letters (as in NATO or laser) or syllables (as in Benelux).
Words can have multiple abbreviations depending on context:
Thursday could be abbreviated as any of Thurs, Thur, Thr, Th, or T.
The Abbreviate library (Python) contains a dictionary of known
abbreviations.
Each word has a preferred abbreviation (Thr for Thursday).
The basic `abbreviate` method applies only preferred
abbreviations, with no heuristics.
For the best possible match, a target length and an effort level are given.
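The preferred-abbreviation lookup (with no heuristics) can be sketched as a dictionary mapping each word to its one preferred form; the table here is a hypothetical sample, not the library's actual data:

```python
# Hypothetical preferred-abbreviation table: one preferred
# form per word, as described for the basic method above.
PREFERRED = {"thursday": "Thr", "department": "Dept"}

def abbreviate(word):
    # Apply only the preferred abbreviation; unknown words
    # pass through unchanged (no heuristics).
    return PREFERRED.get(word.lower(), word)

print(abbreviate("Thursday"))
# Thr
```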
Example: cleaning a tweet step by step.
Input: Go SF Giants! Such an amaazzzing feelin!!!! \m/ :D
After stop word removal: SF Giants! amaazzzing feelin!!!! \/ :D
After special-character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After a second stop word pass: SF Giants amazing feel
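A few of these stages can be chained into one small function; a minimal sketch (the stop list is illustrative, and the "spell check" stage only collapses letters repeated three or more times — a real spell checker would then map "amaazing" to "amazing"):

```python
import re

# Illustrative stop list for this example only.
STOP_WORDS = {"go", "such", "an", "me"}

def pipeline(text):
    # 1. Tokenize on whitespace and drop stop words (case-insensitive).
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    # 2. Strip special characters from each token, drop empty results.
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # 3. Crude spell-check stand-in: collapse a letter repeated
    #    three or more times down to a single letter.
    tokens = [re.sub(r"(\w)\1{2,}", r"\1", t) for t in tokens]
    return tokens

print(pipeline("Go SF Giants! Such an amaazzzing feelin!!!!"))
# ['SF', 'Giants', 'amaazing', 'feelin']
```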
