Information Retrieval: Text Processing


Information Retrieval

Text Processing

Dr. Bassel ALKHATIB


Text Processing

• Text operations in IR systems


– Tokenization
– Stopword removal
– Lemmatization
– Stemming
– Identification of phrases and collocation (optional)

Slide 2
IR System Architecture

[Architecture diagram: the user expresses an information need through the
User Interface; Text Operations transform documents and queries into their
logical views; the DB Manager's Indexing module builds an inverted file over
the text database; Query Operations (refined by user feedback) transform the
query; Searching retrieves candidate docs via the index; Ranking orders the
retrieved docs before they are returned to the user.]

Slide 3
Tokenization: Text → {word}

A token is a string of characters extracted from a document, e.g.,


The discussion classes on Wednesday evenings are
from 7:30 to 8:30 p.m.

Slide 4
Simple Tokenization: Text → {word}
• Analyze text into a sequence of discrete tokens (words).
• Most Languages
– Word = string of characters separated by white spaces and/or
punctuation
• Difficulties:
– Abbreviations (etc., …)
• Transformed to original format using MRD (Machine Readable
Dictionary)
– Hyphenated terms (_, -)
– Apostrophes
– Numbers

Slide 5
Simple Tokenization: Text → {word}
• Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation
and use only case-insensitive unbroken strings of alphabetic
characters as tokens.
• More careful approach:
– Separate ? ! ; : “ ‘ [ ] ( ) < >
– Take care with . (periods also end abbreviations)
– Take care with … (ellipses)

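A minimal Python sketch of the simplest approach above (the function name tokenize and the regex are our own, illustrative choices):

import re

def tokenize(text):
    # Simplest approach: case-fold, then keep only unbroken strings
    # of alphabetic characters; numbers and punctuation are ignored.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("The discussion classes on Wednesday evenings are from 7:30 to 8:30 p.m."))
# ['the', 'discussion', 'classes', 'on', 'wednesday', 'evenings',
#  'are', 'from', 'to', 'p', 'm']

Note how "p.m." falls apart into the meaningless tokens p and m, illustrating the abbreviation difficulty listed above.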
Slide 6
Punctuation

• Children’s: use language-specific mappings to normalize.

• State-of-the-art: break up hyphenated sequences.
• U.S.A. vs. USA
• a.out

Slide 7
Numbers

• 3/12/91
• Mar. 12, 1991
• 55 B.C.
• B-52
• 100.2.86.144
– Generally, don’t index numbers as text
– Exception: creation dates for docs

Slide 8
Case folding

• Reduce all letters to lower case


– exception: upper case in mid-sentence
• e.g., General Motors

Slide 9
Tokenizing HTML

• Should text in HTML commands not typically seen by the user be
included as tokens?
• Simplest approach is to exclude all HTML tag information
(between "<" and ">") from tokenization.

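A minimal sketch of this exclusion step (regex-based; strip_tags is our own helper, and a production system would use a real HTML parser):

import re

def strip_tags(html):
    # Drop everything between '<' and '>' before tokenization.
    return re.sub(r"<[^>]*>", " ", html)

print(strip_tags("<b>Information</b> Retrieval"))  # ' Information  Retrieval'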
Slide 10
Stop Word Removal
• Many of the most frequently used words in English are
worthless for indexing – these words are called stop
words.
– the, of, and, to, …
– Typically about 400 to 500 such words

• Why do we need to remove stop words?


– Reduce indexing file size
• stop words account for 20–30% of total word counts
– Improve efficiency
• stop words are not useful for searching

Slide 11
Stopwords
• Stopwords are language dependent
• For efficiency, store strings for stopwords in a hashtable to
recognize them in constant time.

• How to determine a list of stopwords?


– For English: may use existing lists of stopwords
• E.g. WordNet stopword list

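A minimal sketch of constant-time stop word recognition using Python's hash-based set type (the word list is a tiny illustrative sample, not a real stopword list):

STOPWORDS = {"a", "about", "above", "the", "of", "and", "to"}

def remove_stopwords(tokens):
    # Set membership is a hash lookup: constant expected time per token.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "boy", "and", "the", "car"]))  # ['boy', 'car']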
Slide 12
Stop Word Removal

• Potential problems of removing stop words


– a small stop list does not improve indexing much
– a large stop list may eliminate some words that might be useful
for someone or for some purposes
– stop words might be part of phrases (e.g., put on, take off, …)
– removal needs to be applied to both indexing and queries

Slide 13
Some English Stop words
a, about, above, across, after, afterwards, again, against, all, almost,
alone, along, already, also, although, always, am, among, amongst, amount,
an, and, another, any, anyhow

Slide 14
Lemmatization
• Reduce inflectional/variant forms to base form
• Direct impact on VOCABULARY size
• E.g.,
– am, are, is → be
– car, cars, car's, cars' → car

• the boy's cars are different colors → the boy car be different color

• How to do this?
– Need a list of grammatical rules + a list of irregular words
– Children → child, spoken → speak, …

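A short sketch using NLTK's WordNet lemmatizer, one off-the-shelf combination of grammatical rules and an irregular-word list (assumes the WordNet data has been fetched, e.g. via nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # car
print(lemmatizer.lemmatize("children"))       # child (irregular noun)
print(lemmatizer.lemmatize("are", pos="v"))   # be    (irregular verb)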
Slide 15
Stemming
Morphological variants of a word are similar
terms derived from a common stem:
engineer, engineered, engineering
use, user, users, used, using
Stemming in Information Retrieval. Words with a common
stem are mapped into the same term.
For example, read, reads, reading, and readable are mapped onto
the term read.

Slide 16
Advantages of stemming

• improving effectiveness
– matching similar words
• reducing indexing size
– combining words with same roots may reduce indexing size by as much as 40–50%.

• Criteria for stemming


– correctness
– retrieval effectiveness
– compression performance

Slide 17
Categories of Stemmer

The following diagram illustrates the various categories of stemmer.

Conflation (Stemming) Methods
• Manual
• Automatic (stemmers)
  – Affix removal
    • Longest match (e.g., Porter)
    • Simple removal
  – Successor variety
  – Table lookup
  – n-gram
Slide 18
Table Lookup
• Store a table of all index terms and their stems
• Terms from queries and indexes could then be stemmed
via table lookup
• Problems
– No such complete table exists for English
– Domain-dependent vocabulary may not use standard English
– Storage overhead

Term Stem
Engineering Engineer
Engineered Engineer
Engineer Engineer

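A minimal sketch of table-lookup stemming (the table is hand-built for illustration; unknown or domain-specific terms simply pass through):

STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    # Fall back to the term itself when it is not in the table.
    return STEM_TABLE.get(term.lower(), term)

print(lookup_stem("Engineered"))  # engineer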
Slide 19
Affix Removal Stemmers

• Affix removal algorithms remove suffixes and/or prefixes from
terms, leaving a stem
• Most stemmers are iterative longest match stemmers
– Remove the longest possible string of characters from a word
according to a set of rules
– This process is repeated until no more characters can be removed

• Porter algorithm is an affix removal stemmer


– Consists of a set of condition/action rules
• Conditions on the stem, conditions on the suffix,
and conditions on the rules

Slide 20
Porter Stemmer
• Simple procedure for removing known affixes in
English without using a dictionary.
• Can produce unusual stems that are not English words:
– “computer”, “computational”, “computation” all reduced to
same token “comput”

• May conflate (reduce to the same token) words that are actually distinct.
– organization, organ → organ
– police, policy → polic
– arm, army → arm

• Does not recognize all morphological derivations.

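A quick sketch of the conflation behaviour described above, using NLTK's implementation of the Porter stemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computer", "computational", "computation", "organization", "organ"]:
    print(word, "->", stemmer.stem(word))
# computer -> comput, computational -> comput, computation -> comput,
# organization -> organ, organ -> organ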
Slide 21
Basic stemming methods
Use tables and rules

• Affix removal algorithms (suffixes, prefixes)


– remove ending
• if a word ends with a consonant other than s,
followed by an s, then delete s.
• if a word ends in es, drop the s.
• if a word ends in ing, delete the ing unless the remaining word consists
only of one letter or of th.
• If a word ends with ed, preceded by a consonant, delete the ed unless
this leaves only a single letter.
• …...
– transform the remaining word
• if a word ends with “ies” but not “eies” or “aies” then “ies --> y.”

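A rough sketch of the ending-removal and transformation rules above (illustrative only: the rule list is incomplete and the order of checks matters):

import re

def simple_stem(word):
    word = word.lower()
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # ies -> y (ponies -> pony)
    if word.endswith("es"):
        return word[:-1]                # ends in es: drop the s
    if re.search(r"[^aeious]s$", word):
        return word[:-1]                # consonant (not s) + s: drop s
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]                # drop ing unless 1 letter or "th" remains
    if re.search(r"[^aeiou]ed$", word) and len(word) > 3:
        return word[:-2]                # consonant + ed: drop ed
    return word

print(simple_stem("ponies"), simple_stem("cats"), simple_stem("engineering"))
# pony cat engineer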
Slide 22
Typical rules in Porter

• sses → ss
• ies → i
• ational → ate
• tional → tion

Slide 23
Porter Stemmer
A multi-step, longest-match stemmer.
M. F. Porter, An algorithm for suffix stripping. (Originally
published in Program, 14 no. 3, pp 130-137, July 1980.)
http://www.tartarus.org/~martin/PorterStemmer/def.txt
Notation
v       vowel(s)
c       consonant(s)
(vc)^m  vowel(s) followed by consonant(s), repeated m times
Any word can be written: [c](vc)^m[v]
m is called the measure of the word

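A small sketch of computing the measure m (simplified: y is always treated as a consonant here, whereas Porter's definition is context-dependent):

import re

def measure(stem):
    # Map letters to v/c, then count the (vc) alternations in [c](vc)^m[v].
    forms = "".join("v" if ch in "aeiou" else "c" for ch in stem.lower())
    return len(re.findall(r"v+c+", forms))

print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2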
Slide 24
Porter's Stemmer
Multi-Step Stemming Algorithm
Complex suffixes
Complex suffixes are removed bit by bit in the different
steps. Thus:
GENERALIZATIONS
becomes GENERALIZATION (Step 1)
becomes GENERALIZE (Step 2)
becomes GENERAL (Step 3)
becomes GENER (Step 4)
[In this example, note that Steps 3 and 4 appear to be
unhelpful for information retrieval.]
Slide 25
Porter Stemmer: Step 1a

Suffix   Replacement   Examples
sses     ss            caresses -> caress
ies      i             ponies -> poni, ties -> ti
ss       ss            caress -> caress
s        null          cats -> cat

Note: the stem may not be an actual word (e.g., poni).
At each step, carry out the longest match only.

Slide 26
Porter Stemmer: Step 1b

Conditions   Suffix   Replacement   Examples
(m > 0)      eed      ee            feed -> feed, agreed -> agree
(*v*)        ed       null          plastered -> plaster, bled -> bled
(*v*)        ing      null          motoring -> motor, sing -> sing

Notation
m – the measure of the stem
*v* – the stem contains a vowel

Slide 27
Porter Stemmer: Step 5a

Some of the steps are based on peculiarities of English, e.g.,

(m > 1)              e -> null     probate -> probat, rate -> rate
(m = 1 and not *o)   e -> null     cease -> ceas

*o – the stem ends cvc, where the second c is not w, x or y
(e.g. -wil, -hop).

Slide 28
Porter Stemmer: Results
Suffix stripping of a vocabulary of 10,000 words
Number of words reduced in step 1: 3597
step 2: 766
step 3: 327
step 4: 2424
step 5: 1373
Number of words not reduced: 3650
The resulting vocabulary of stems contained 6370
distinct entries. Thus the suffix stripping process
reduced the size of the vocabulary by about one third.

Slide 29
Successor Variety

• Definition (successor variety of a string)
the number of different characters that follow it in words in
some body of text

• Example
a body of text: able, axle, accident, ape, about
successor varieties of apple:
1st (a): 4 (b, x, c, p)
2nd (ap): 1 (e)

Slide 30
Successor Variety (Continued)
• Idea
The successor variety of substrings of a term will decrease as more characters
are added, until a segment boundary is reached; at the boundary, the successor
variety sharply increases.
• Example
Test word: READABLE
Corpus: ABLE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE

Prefix     Successor Variety   Letters
R          3                   E, O, I
RE         2                   A, D
REA        1                   D
READ       3                   A, I, S
READA      1                   B
READAB     1                   L
READABL    1                   E
READABLE   1                   (blank = end of word)

Slide 31
The successor variety stemming process

• Determine the successor variety for a word.


• Use this information to segment the word.
– cutoff method
a boundary is identified whenever the cutoff value is reached
– peak and plateau method
a boundary is placed after a character whose successor variety exceeds that of
the character immediately preceding it and that of the character immediately following it
– complete word method
a boundary is placed where the segment forms a complete word

• Select one of the segments as the stem.

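A sketch of the successor variety computation (our own helper; the end-of-word "blank" is counted as the sole successor only when no letters follow, which reproduces the READABLE table above):

def successor_variety(prefix, corpus):
    # Distinct letters that follow `prefix` in the corpus words.
    letters = {w[len(prefix)] for w in corpus
               if w.startswith(prefix) and len(w) > len(prefix)}
    if letters:
        return len(letters)
    # Nothing follows: the end-of-word blank is the only successor.
    return 1 if prefix in corpus else 0

corpus = ["ABLE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
for i in range(1, len("READABLE") + 1):
    prefix = "READABLE"[:i]
    print(prefix, successor_variety(prefix, corpus))
# R 3, RE 2, REA 1, READ 3, READA 1, READAB 1, READABL 1, READABLE 1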
Slide 32
Basic stemming methods
N-gram stemmers
• An n-gram is a sequence of n consecutive letters
– Conflates terms based on number of n-grams (= sequences of
n consecutive letters) that they share
– Often use of bigrams or trigrams
– Terms that are strongly related by the number of shared n-
grams are clustered
– Heuristics help in detecting the root form
– Language-independent technique
– Example:
All digrams of the word “statistics” are
st ta at ti is st ti ic cs
All digrams of “statistical” are
st ta at ti is st ti ic ca al

Slide 33
n-gram stemmers

• Digram
a pair of consecutive letters
• Shared digram method (Adamson and Boreham, 1974)
association measures are calculated between pairs of terms

S = 2C / (A + B)

• where A: the number of unique digrams in the first word,
B: the number of unique digrams in the second,
C: the number of unique digrams shared by the two words

Slide 34
n-gram stemmers (Continued)

• Example
statistics => st ta at ti is st ti ic cs
unique digrams => at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams => al at ca ic is st ta ti

S = 2C / (A + B) = (2 × 6) / (7 + 8) = 12 / 15 = 0.80

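A minimal sketch of the shared digram measure (function names are ours):

def digrams(word):
    # Set of unique digrams (pairs of consecutive letters) in a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))   # S = 2C / (A + B)

print(round(dice_similarity("statistics", "statistical"), 2))  # 0.8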
Slide 35
n-gram stemmers (Continued)

• similarity matrix
determine the similarity measures for all pairs of terms in the database

         word1   word2   word3   ...   wordn-1
word2    S21
word3    S31     S32
...
wordn    Sn1     Sn2     Sn3     ...   Sn(n-1)

• terms are clustered using a single-link clustering method
– most pairwise similarity measures were 0
– using a cutoff similarity value of .6

Slide 36
Stemmers are not perfect:

Organization → organ
University → universe
Policy → police

Slide 37
Identification of phrases and collocation

• Collocations
Expression consisting of two or more words that correspond to
some conventional way of saying things
– Good indicators of a text’s content (especially
noun and prepositional phrases)
– Important concepts in the subject domain:
• e.g. joint venture, make up
– Less ambiguous than the single words they are
composed of

Slide 38
Identification of phrases and collocation

Recognition of phrases
• Use of dictionary with phrases:
• Only practical in restricted subject domains
• Statistical approach:
• Assumption: words that often co-occur might
denote a phrase
• For phrases: not always correct and meaningful

• Linguistic (language-dependent) approach

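A crude sketch of the statistical approach above: count adjacent word pairs and keep the frequent ones as phrase candidates (raw frequency only; real systems use stronger association measures):

from collections import Counter

def candidate_phrases(tokens, min_count=2):
    # Adjacent pairs that co-occur at least min_count times.
    pairs = Counter(zip(tokens, tokens[1:]))
    return [pair for pair, n in pairs.items() if n >= min_count]

tokens = "joint venture deals and another joint venture".split()
print(candidate_phrases(tokens))  # [('joint', 'venture')]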
Slide 39
Identification of phrases and collocation

• Normalization of phrases
Mapping of equivalent phrases to standard single phrases

• Approaches for the normalization of phrases


1. Use of MRD with equivalent phrases
2. Omission of function words and possible neglect of the order of content
words
3. Language-dependent rules applied to the output of a syntactic parser

• Possibly in combination with word stemming

Slide 40
Identification of phrases and collocation

• Recognition of proper names


Names of persons, companies, institutions, product brands,
locations, …
• Recognition of names - Approaches
1. Use of a dictionary of names, variant spellings
2. Recognition with special rules that express typical features (e.g.
capitalization) or linguistic content (e.g. indicator words)
3. Recognition of variant spellings (e.g. based on shared letter sequences – n-
grams)

Slide 41
On Metadata

• Metadata is often included in Web pages
– Hidden from the browser, but useful for indexing
• Information about a document that may not be a part of the
document itself (data about data).
• Descriptive metadata is external to the meaning of the document:
– Author
– Title
– Source (book, magazine, newspaper, journal)
– Date
– ISBN
– Publisher
– Length

Slide 42
Web Metadata

• META tag in HTML

– <META NAME="keywords" CONTENT="pets, cats, dogs">

Slide 43
