Chapter 2 Part II
Representation
Chapter 2 - Part two
Sub topics
Overview
Manual Vs Automatic indexing
Indexing Languages and Methods
Term selection process
Lexical analysis (tokenization)
Use of stopwords list
Conflation
Conflation
The act of combining
Conflation
Conflation is the act of fusing or combining.
Term conflation is a general term used for the
processes of matching morphological term variants.
It is one way of improving IR performance
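Conflation can be broadly divided into two main
classes
Stemming: language dependent and basically
used to handle morphological variants (most
widely used)
String similarity: compares the character strings of
words directly, without producing a stem (e.g. n-gram)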
Conflation
  Stemming
    Table lookup
    Affix removal
      Iterative (simple removal)
      Longest match
    Successor variety
  String similarity
    N-gram
Motivation
Free text IR systems are effective in operation only
if it is possible to map the particular word forms in
a user's query to the forms that occur in the
document database.
Example
If a searcher enters the term stemming as part
of a query, it is likely that he or she will also be
interested in such variants as stemmed and
stem.
Why variations in word form? (Need for
term conflation)
In dealing with NL free text, it is common that there are
variations in word forms. Common reasons could be:
1. Grammatical requirements (morphology)
Ex. The words Agriculture, Agricultural and
Agriculturalist are all semantically related words
formed from a common form to suit grammatical
requirements
2. National usage (alternative spellings)
Different spellings of the same word to suit national
usage (American and British)
Labor and Labour
Sulphur and Sulfur
3. Historical changes
Arithmetic and Aerithmaticke
Stemming/Conflation
Assumption in term conflation or stemming
Words with the same stem are semantically related
and have the same meaning to the user of the text
The chance of matching increases when the index
terms are reduced to their word stems, because it is
more usual to search using "retrieve" than "retrieving"
The algorithm (i.e., the stemmer) reduces all words
with the same root (or stem) to a single form.
Stemming in the field of IR aims at improving the
chance of a match between the index terms
of the query and those of the document text
Stemming is thus a recall-enhancing device: it broadens
an index term in a text search by identifying words with
similar meaning.
But it can decrease precision (exactness)
Stemming is more useful in the case of morphologically
rich languages (like Hungarian and Hebrew).
How?
It is generally agreed that stemming either has a
significant positive effect on retrieval effectiveness or
does no harm to retrieval performance
Accordingly there are three automatic approaches (methods)
to stemming
Table lookup
Affix removal algorithms
Letter successor variety stemmers
Table lookup Approach to Stemming
Term          Stem
Connection    Connect
Connections   Connect
Connective    Connect
Connected     Connect
Connecting    Connect
Connect       Connect
Engineering   Engineer
Engineered    Engineer
Engineer      Engineer
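The table lookup approach can be sketched in a few lines of Python. This is only an illustration: the dictionary holds just the term/stem pairs listed in the table, and the names STEM_TABLE and lookup_stem are made up for this sketch. A term that is not in the table is returned unchanged, which shows why new words are a problem for this method.

```python
# Hypothetical lookup table holding only the term/stem pairs from the slide.
STEM_TABLE = {
    "connection": "connect",
    "connections": "connect",
    "connective": "connect",
    "connected": "connect",
    "connecting": "connect",
    "connect": "connect",
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lower-cased term if it is not in the table."""
    key = term.lower()
    return table.get(key, key)


if __name__ == "__main__":
    for word in ["Connections", "Engineered", "Retrieval"]:
        print(word, "->", lookup_stem(word))
```

A Python dict is a hash table, so each lookup is fast, but every variant of every word still has to be stored.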
Advantage of this method
The stemming results are generally correct
Disadvantages of the method
The table becomes large
when it covers both the terms of the standard
language and the terms of the specialized
subject domain of the text corpus
Has storage overhead
Large tables require large storage space and
efficient search algorithms (e.g. binary search, hash table)
Difficult to construct the table
so that it captures all possibilities (e.g. new words)
Affix Removal Stemmers
The most commonly used approach, and the one that gives the
stemmer its name: it removes suffixes and/or prefixes from
terms, leaving a stem.
Usually performed by stripping off affixes from each
word variant and checking whether any context-sensitive
rules apply.
These affix removal stemmers are language dependent
and need manual involvement.
A simple example is a stemmer that removes the
plurals from terms
Two manners of operation:
Iterative approach
Longest match approach
Iterative approach
Example
If the word PEACEFULNESS is considered, NESS will
be removed at the first iteration, and FUL will then
be removed to leave PEACE as the final stem
If the word ALARMINGLY is considered, LY will be
removed at the first iteration, and ING will then be
removed to leave ALARM as the final stem
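The two iterations above can be reproduced with a small sketch. The suffix list and the minimum stem length below are assumptions chosen for these two words only, not the rules of any published stemmer, and iterative_stem is a hypothetical name.

```python
# Illustrative suffix list; a real stemmer uses a much larger dictionary.
SUFFIXES = ["NESS", "FUL", "ING", "LY"]
MIN_STEM_LEN = 3  # assumed guard so that short words are not stripped away


def iterative_stem(word):
    """Strip one suffix per pass and repeat until no suffix applies."""
    stem = word.upper()
    removed = []
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
                stem = stem[: -len(suffix)]
                removed.append(suffix)
                changed = True
                break  # start a new pass on the shortened stem
    return stem, removed


if __name__ == "__main__":
    print(iterative_stem("PEACEFULNESS"))  # ('PEACE', ['NESS', 'FUL'])
    print(iterative_stem("ALARMINGLY"))    # ('ALARM', ['LY', 'ING'])
```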
Longest Match approach
Inflectional Vs Derivational
Inflectional Stemming
Inflectional stemming is just removing inflectional
morphemes
Inflectional morphemes are normally due to the difference in :
number (singular and plural)
tense (present, past, future)
aspect (progressive, punctual, perfective, iterative)
Examples:
reach, reaches, reached, reaching
big, bigger, biggest
Removal of inflectional morphemes usually has little impact
upon a word's meaning.
Thus, it can be safely done (e.g. mapping the singular and plural
of the same word to a single stem)
Derivational Stemming
Here tokens and stems do not have the same part
of speech
Example:
Introduce, introduction, introductions, introducer,
introducing
Category, categorize, categorization
Use, Useful, Usable, Unusable
Removal of such derivational morphemes may
change a word's meaning.
Thus, in addition to removal of the suffix, it needs some
transformation of letters
Components of a stemmer
Dictionary of affixes (suffix and prefix dictionary)
Can be created manually, automatically or by a
combination of both
Example: -ation, -ing, -ses, -ed, -s, -ly, -able, -ful,
-ness, ...
Search of each word in an input text against this
dictionary
Longest match search
Iterative search
Context-sensitive rules
In addition to the removal of affixes, they consider
other characteristics of the word (token).
Many exception rules need to be handled.
Example
Minimum stem conditions are used as a common
restriction. Invalid stems can be detected by
means of a minimum stem length.
-ATE and -ING would not be removed from
CREATE and SING
RE- would not be removed from 'READ', nor -AL
from 'METAL'
The stem ending must satisfy a certain condition
E.g. it does not end with Q
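These restrictions can be combined with a longest match search, as in the following sketch. The suffix list is illustrative and handles suffixes only. The threshold of four letters is an assumption, picked so that CREATE, SING and METAL are left alone. The name longest_match_stem is hypothetical.

```python
SUFFIXES = ["ATION", "ATE", "ING", "AL", "ED", "S"]  # illustrative only
MIN_STEM_LEN = 4  # assumed threshold; chosen so CREATE, SING and METAL survive


def longest_match_stem(word):
    """Try suffixes longest first; remove the first one that leaves a valid stem."""
    word = word.upper()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Context-sensitive checks: minimum stem length and no stem ending in Q.
        if len(stem) >= MIN_STEM_LEN and not stem.endswith("Q"):
            return stem
    return word  # no suffix could be removed safely


if __name__ == "__main__":
    for w in ["CONNECTING", "CREATE", "SING", "METAL"]:
        print(w, "->", longest_match_stem(w))
```

Only CONNECTING is shortened (to CONNECT). The other three words fail the minimum stem length check and are returned unchanged.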
Recoding Rules
Recoding is an important issue to consider and sometimes
a cure for spelling exceptions.
Recoding is a context-sensitive transformation of
the form
AxC to AyC, where A and C specify
the context of the transformation, x is the input
string and y is the transformed string.
Example
If a stem ends in 'i' following a 't', then change the 'i'
to 'y'
beautiful -> beauti -> beauty
Eliminating doubled terminal consonants
forgetting -> forgett -> forget
Changing 'rpt' into 'rb' to conflate terms like
absorb, absorbing and absorption
absorption -> absorpt -> absorb
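The three recoding rules in this example can be written as a small function applied to a stem after affix removal. This is a sketch: the function name recode is hypothetical, and the rules are applied without the extra conditions a real stemmer would add. As written, the doubled-consonant rule would also shorten a word such as "fall".

```python
VOWELS = "aeiou"


def recode(stem):
    """Apply the three recoding rules from the example to a raw stem."""
    # Rule 1: a final 'i' that follows 't' becomes 'y' (beauti -> beauty).
    if stem.endswith("ti"):
        return stem[:-1] + "y"
    # Rule 2: a final 'rpt' becomes 'rb' (absorpt -> absorb).
    if stem.endswith("rpt"):
        return stem[:-3] + "rb"
    # Rule 3: a doubled terminal consonant is reduced to one (forgett -> forget).
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in VOWELS:
        return stem[:-1]
    return stem


if __name__ == "__main__":
    for raw in ["beauti", "forgett", "absorpt", "connect"]:
        print(raw, "->", recode(raw))
```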
Example of stemmed query
Consider the following natural language statement of an
information need and convert it into a stemmed query
representative
(just follow the basic lexical analysis, stopword
removal and affix removal stemming procedures)
"I am interested in novel matching algorithms for
document-retrieval systems. Anything on best-match or
nearest-neighbor retrieval, and partial-match searching
might also be of interest."
Stemmed query representative
algorithm, best, docum, match, near, neighb,
novel, part, retriev, search, system
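The three steps (lexical analysis, stopword removal, affix removal) can be chained as in the sketch below. Both the stopword list and the suffix list are assumptions, kept just large enough to cover this one query. They are not the lists of any published stemmer, and the function names are hypothetical.

```python
import re

# Assumed stopword list: just enough for this query, including the
# content-free words (interested, anything, also, interest) dropped above.
STOPWORDS = {
    "i", "am", "interested", "in", "for", "anything", "on", "or",
    "and", "might", "also", "be", "of", "interest",
}
# Assumed suffix list, tried longest first.
SUFFIXES = ["ing", "est", "ial", "ent", "al", "or", "s"]
MIN_STEM_LEN = 3  # keeps "best" from being reduced to "b"


def stem(token):
    """Remove the longest suffix that leaves at least MIN_STEM_LEN letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= MIN_STEM_LEN:
            return token[: -len(suffix)]
    return token


def stemmed_query(text):
    """Tokenize, drop stopwords, stem, and return the sorted unique stems."""
    tokens = re.findall(r"[a-z]+", text.lower())  # also splits hyphenated words
    content = [t for t in tokens if t not in STOPWORDS]
    return sorted({stem(t) for t in content})


QUERY = ("I am interested in novel matching algorithms for "
         "document-retrieval systems. Anything on best-match or "
         "nearest-neighbor retrieval, and partial-match searching "
         "might also be of interest.")

if __name__ == "__main__":
    print(", ".join(stemmed_query(QUERY)))
```

With these assumed lists the sketch prints the same eleven stems as the query representative above.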
Possible Problems with Affix Removal
Stemmers
There are two major ways in which stemming can be
incorrect
Overstemming
(removing too much)
Understemming
(removing too little)
Understemming
Refers to the removal of too little of a term's affixes.
It prevents related terms from being conflated,
as a result of which a concept may be spread
over numerous stems,
which will affect retrieval performance.
Words that should be grouped by stemming are
not.
Example:
removing only -ness from the term peacefulness
Effect of understemming on IR performance will be
Some Common English Stemming
Algorithms
Dawson
Lovins
MARS
MORPHS
Porter
RADCOL
SMART
Paice/Husk
Features of some common English Language
text Affix removal stemmers (algorithms)
Longest Suffix Stem Context
Algorithm Iterative Recoding
match Dictionary dictionary sensitive
Dawson X X X X
Lovins X X X X X
MARS X X X X X
MORPHS X X X X X
Porter X X X X
RADCOL X X X X
SMART X X X X X
Porter Stemming Algorithm
By Martin Porter (1980)
Try the following to get
Original paper by Martin Porter
Source code for porter stemmer using C, Java, etc.
http://www.tartarus.org/%7Emartin/PorterStemmer/
http://www.utilitymill.com/utility/Porter_Stemming_Algorithm
http://snowball.tartarus.org/demo.php
Porter Stemming Algorithm
Was developed by Martin Porter for reducing English
words to their stem
There has been also similar interest in stemming
algorithms for other languages including
Effectiveness of Stemming
Manual evaluation suggests that about 95% of words
are processed to give reasonable stems (but
understemming and overstemming remain possible)
Stemming also gives a significant reduction
in dictionary size.
Criteria for judging stemmers
Correctness: judged by the number of
errors (mistakes), i.e. the number of overstemmed and
understemmed words.
Advantage of affix removal stemmers
Reduce the total number of distinct terms (Dictionary
size) significantly
Increase retrieval performance by bringing similar
words, having similar meanings, to a common form
Putting it together: Exercises
“The purpose of the course is to teach theory and
practice underlying the construction of the web based
information systems. As such, the course will devote
equal time to information retrieval and software
engineering topics. The theory will be put into
practice through a semester long team programming
project”
Lexical analysis
structure
Letter Successor variety (LSV) Approach
to stemming
It is a stemming method that is based on the use of letters
and a body of text
It attempts to determine word and morpheme boundaries
The successor variety (SV) of a string is the number of
different characters that follow it (i.e. the string) in words
in some body of text.
The technique simply
segments the term into two parts, stem and suffix,
based on the successor varieties of its prefixes
Consider a body of text consisting of the following
words, for example.
able, axle, accident, ape, about.
To determine the successor varieties for "apple," for
example, the following process would be used.
The first letter of apple is "a." "a" is followed in the text
body by four characters: "b," "x," "c," and "p."
Thus, the successor variety of "a" is four.
The next successor variety for apple would be one,
since only "e" follows "ap" in the text body, and so on.
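The counting described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that only letters count as successors (the end of a word is not counted as a successor).

```python
def successor_variety(prefix, corpus):
    """Count the distinct letters that follow `prefix` in the corpus words."""
    followers = {
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


def successor_varieties(word, corpus):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [(word[:i], successor_variety(word[:i], corpus))
            for i in range(1, len(word) + 1)]


CORPUS = ["able", "axle", "accident", "ape", "about"]

if __name__ == "__main__":
    for prefix, sv in successor_varieties("apple", CORPUS):
        print(prefix, sv)
```

For "apple" this gives 4 for "a" and 1 for "ap", as in the text. The longer prefixes "app", "appl" and "apple" do not occur in the text body, so their successor variety is 0.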
Basic steps include
For a given term at hand
Determine the successor varieties (the number of distinct
characters following a given string)
Segment the term into two segments (parts)
Decide which one of the segments is the stem.
Example:
The test word: - READABLE
Final decision - Remark
Based on a series of experiments and observations,
the following rule (algorithm) is usually used
d o x p u v i n g, d o x p u v e r, d o x p u v e d,
d o x x o n, d o x i n g, d o x p l e x e n v
Solution
In summary, the successor variety stemming process
has three parts:
determine the successor varieties for a word,
use this information to segment the word using one
of the methods above, and
select one of the segments as the stem.
Test
What are the three methods of segmentation in LSV?
N-gram Approach to Term conflation
It is confusing to call it a stemming method, since no
stem is produced by this method.
It is basically a string similarity approach to
conflation.
It is an approach that tries to find word similarity by
comparing the string structure of words.
It is a technique that creates
sets of strings of n characters from each word and
then compares the elements of the two sets to measure
similarity.
Definition (working definition)
An n-gram is a set of n consecutive characters
(letters) extracted from a word
It represents a word by its sequence (set) of consecutive
n-grams (n-sized groups of letters)
N-gram
The approach assumes no prior linguistic knowledge
about the text being processed, and is thus relatively
immune to spelling problems
N-gram
Example: Consider the word BILGISAYAR (computer)
This results in the generation of the digrams
*B, BI, IL, LG, GI, IS, SA, AY, YA, AR, R*
And the trigrams
**B, *BI, BIL, ILG, LGI, GIS, ISA, SAY, AYA, YAR, AR*, R**
where '*' denotes a padding space
There are n + 1 such digrams and n + 2 such trigrams in a
word containing n characters
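The padded digrams and trigrams of BILGISAYAR can be generated with a short function. This is a sketch: the name padded_ngrams and the choice of n - 1 padding symbols on each side are assumptions made to match the listing above.

```python
def padded_ngrams(word, n, pad="*"):
    """Return the n-grams of `word` after padding both ends with n-1 pad symbols."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    word = "BILGISAYAR"
    digrams = padded_ngrams(word, 2)
    trigrams = padded_ngrams(word, 3)
    print(len(word), len(digrams), len(trigrams))  # 10 11 12
    print(", ".join(digrams))
    print(", ".join(trigrams))
```

For this 10-letter word the function yields 11 digrams and 12 trigrams, in line with the n + 1 and n + 2 counts stated above.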
Procedures to compute similarity between two
words
Create sets of strings of n characters from each word
Terms are represented using n-grams
Identify the set of unique n-grams (e.g., bigrams) for
each word
Identify the n-grams that the two words share
A similarity measure based on them is computed
Compute the similarity of the two terms using Dice's or
the overlap coefficient
Terms that are strongly related by their number of shared
n-grams are considered similar and hence conflated or
clustered into groups of related words
N-gram
Dice and overlap coefficients
Let A be the number of unique digrams in the first word
Let B be the number of unique digrams in the second word
Let C be the number of unique digrams shared by the two words
(number of common digrams)
Dice's similarity coefficient is given by
\[ \mathrm{sim}(w_i, w_j) = \frac{2C}{A + B} \]
N-gram
Example 1 (cont'd)
The two sets have quite a significant number of
identical elements
Once the unique digrams for the word pair have been
identified and counted, a similarity measure based on
them is computed
The similarity measure used to quantify the degree of
similarity is Dice's coefficient, defined by
\[ \mathrm{sim}(w_i, w_j) = \frac{2C}{A + B} = \frac{2 \times 10}{12 + 13} = 0.80 \]
N-gram
Example 2 (n-gram matching)
Consider the terms “statistics” and “statistical”
Represent the terms using n-grams
statistics ==> {st ta at ti is st ti ic cs}
statistical ==> {st ta at ti is st ti ic ca al}
Unique digrams
statistics ==> {AT CS IC IS ST TA TI}
statistical ==>{AL AT CA IC IS ST TA TI}
N-gram
Example 2 (cont'd)
\[ S(w_i, w_j) = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80 \]
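Example 2 can be checked with a short sketch. It assumes unpadded digrams, as in the example, and the function names digrams and dice are made up for illustration.

```python
def digrams(word):
    """Set of unique, unpadded digrams of a word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(word1, word2):
    """Dice coefficient 2C / (A + B) over the unique digrams of two words."""
    a, b = digrams(word1), digrams(word2)
    if not a and not b:
        return 0.0  # assumed convention for words too short to have digrams
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    a, b = digrams("statistics"), digrams("statistical")
    print(len(a), len(b), len(a & b))                   # 7 8 6
    print(round(dice("statistics", "statistical"), 2))  # 0.8
```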
End of Chapter 2