Introduction - Types of Stemming Algorithms

Chapter 4
Stemming Algorithms
• Introduction
• Types of stemming algorithms
Stemming
Introduction
One technique for improving IR performance is to provide
searchers with ways of finding morphological variants of
search terms.
If, for example, a searcher enters the term stemming as part of
a query, it is likely that he or she will also be interested in
such variants as stemmed and stem.
We use the term conflation, meaning the act of fusing or
combining, as the general term for the process of matching
morphological term variants.
Conflation can be either manual, using some kind of regular
expressions, or automatic, via programs called stemmers.
Stemming is also used in IR to reduce the size of index files,
since a single stem typically corresponds to several full terms.
[Figure: example of the need for stemming]
Stemming in Searching Process
Terms are stemmed at search time rather than at indexing time.
• Example
Look for: system users
The system takes each term in the query and tries to
determine which other terms in the database might have the
same stem.
If any possibly related terms are found, the system presents
them to the user for selection.
The user selects the terms he or she wants by entering their
numbers.
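The search-time process above can be sketched roughly as follows. This is only an illustration: `toy_stem`, `related_terms`, and the small `index_terms` set are hypothetical names and data, and the toy stemmer only strips a trailing "s".

```python
def toy_stem(term):
    """Hypothetical stand-in stemmer: strips a single trailing 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def related_terms(query, index_terms, stem=toy_stem):
    """For each query term, list the other index terms that share its stem."""
    result = {}
    for q in query.lower().split():
        target = stem(q)
        result[q] = sorted(t for t in index_terms
                           if t != q and stem(t) == target)
    return result


# Hypothetical set of terms found in the database.
index_terms = {"system", "systems", "user", "users", "using", "stem"}

candidates = related_terms("system users", index_terms)
# Present the candidates as a numbered list for the user to choose from.
for q, terms in candidates.items():
    for number, term in enumerate(terms, start=1):
        print(f"{q}: {number}. {term}")
```

A real system would replace `toy_stem` with one of the stemmers described in the rest of the chapter.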
[Figure: stemming in the searching process, continued]
Stemming Algorithms
Types of Stemming Algorithms:
1. Store a table of all index terms
2. n-gram stemmers
3. Successor variety stemmers
4. Affix Removal Stemmers
Store a table of all index terms
One way to do stemming is to store a table of all index
terms and their stems.
There are problems with this approach:
– There is no such data for English. Even if there were, many
terms found in databases would not be represented.
– Another problem is the storage overhead for such a table.
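A minimal sketch of the table approach, assuming a small hand-made table (`STEM_TABLE` and `lookup_stem` are hypothetical names, and the entries are illustrative only). It also shows the first problem listed above: a term missing from the table is returned unchanged.

```python
# Hypothetical term-to-stem table; a real one would have to cover
# the whole vocabulary of the database.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the term itself if it is not in the table."""
    term = term.lower()
    return table.get(term, term)


print(lookup_stem("Engineering"))
print(lookup_stem("stemming"))  # not in the table, so returned as-is
```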
n-gram stemmers
• An n-gram of size 1 is referred to as a "unigram"; size 2 is
a "bigram" (or, less commonly, a "digram"); size 3 is a
"trigram"; and size 4 or more is simply called an "n-gram".
In this approach, association measures are calculated
between pairs of terms based on their shared unique digrams.
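For example, the terms statistics and statistical can be
broken into digrams as follows:
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
The two words share six unique digrams:
at, ic, is, st, ta, ti.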
Once the unique digrams for the word pair have been
identified and counted, a similarity measure based on them is
computed as follows:
$$ S = \frac{2C}{A + B} $$
• Where:
A: the number of unique digrams in the first word
B: the number of unique digrams in the second word
C: the number of unique digrams shared by the two words
This similarity measure is called Dice's coefficient.
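For the example above, Dice's coefficient would equal
$$ S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80 $$
• Such similarity measures are determined for all pairs of terms in
the database, forming a similarity matrix.
Terms are then clustered on this matrix, and each cluster is
treated as a group of conflated terms.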
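The calculation above can be sketched in a few lines. The function names (`unique_digrams`, `dice`, `similarity_matrix`) are illustrative, and the matrix is stored as a dictionary keyed by term pairs, which is just one possible layout.

```python
from itertools import combinations


def unique_digrams(word):
    """Set of distinct adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a, b = unique_digrams(first), unique_digrams(second)
    if not a and not b:
        return 0.0  # neither word has a digram (single-character words)
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_matrix(terms):
    """Dice's coefficient for every unordered pair of terms."""
    return {(s, t): dice(s, t) for s, t in combinations(sorted(terms), 2)}


print(round(dice("statistics", "statistical"), 2))
matrix = similarity_matrix(["statistics", "statistical", "stemming"])
for pair, score in matrix.items():
    print(pair, round(score, 2))
```

Because every pair of terms is compared, the matrix grows quadratically with the vocabulary, which is worth keeping in mind for large databases.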
Exercise:
Use an n-gram stemmer to determine which of these words
share a stem with the word "طفل":
• اطفالهم
• متطفل
• متطبع
• فلفل
Successor Variety
The successor variety of a string is the number of different
characters that follow it in words in some body of text.
• Consider a body of text consisting of the following words,
for example:
able, axle, accident, ape, about.
To determine the successor varieties for "apple," for
example:
• The successor variety of "a" is four, since "a" is followed
in the corpus by "b," "x," "c," and "p."
• The next successor variety for apple, that of "ap," would be
one, since only "e" follows "ap" in the corpus.
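As a rough illustration of the definition, the sketch below counts the distinct characters that follow a string in the example corpus. The function name is hypothetical, and it assumes that the end of a word is not counted as a successor.

```python
def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in the corpus words.

    Assumption: reaching the end of a word does not count as a successor.
    """
    return len({word[len(prefix)] for word in corpus
                if word.startswith(prefix) and len(word) > len(prefix)})


corpus = ["able", "axle", "accident", "ape", "about"]

# Successor varieties for the first two prefixes of "apple".
print(successor_variety("a", corpus))
print(successor_variety("ap", corpus))
```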
After a word has been segmented, the segment to be used as the
stem must be selected.
• In the complete word method, a break is made after a segment
if the segment is a complete word in the corpus.
• Hafer and Weiss used the following rule:
if (first segment occurs in <= 12 words in corpus)
first segment is stem
else
second segment is stem
• The successor variety of substrings of a term will decrease as
more characters are added until a segment boundary is reached,
at which point it rises sharply.
Example:
Test Word: READABLE
• Corpus: ABLE, APE, BEATABLE, FIXABLE, READ,
READABLE, READING, READS, RED, ROPE, RIPE.
• Using successor variety, determine the stem of the word
READABLE.
Solution:
Using the complete word segmentation method, the test
word "READABLE" will be segmented into "READ" and
"ABLE," since READ appears as a word in the corpus.
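The READABLE example can be checked with a short sketch. The function names are illustrative, and two assumptions are made: the end of a word is not counted as a successor, and a break is placed after any segment that is itself a word in the corpus.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_variety(prefix, corpus):
    """Distinct characters following `prefix` (end of word not counted)."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})


def complete_word_segments(word, corpus):
    """Break after each segment that is a complete word in the corpus."""
    vocabulary = set(corpus)
    segments, start = [], 0
    for end in range(1, len(word)):
        if word[start:end] in vocabulary:
            segments.append(word[start:end])
            start = end
    segments.append(word[start:])
    return segments


word = "READABLE"
varieties = [successor_variety(word[:end], CORPUS)
             for end in range(1, len(word) + 1)]
for end, variety in enumerate(varieties, start=1):
    print(word[:end], variety)
print(complete_word_segments(word, CORPUS))
```

The successor variety rises again at READ (followed by A, I, and S), which is the same boundary the complete word method finds.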
Successor variety is computed by processing a word's prefixes,
from left to right; predecessor variety is computed in the same
way from its suffixes, from right to left.
• Peak and plateau method
– A segment break is made after a character whose
successor variety exceeds that of the characters
immediately preceding and following it.
• The successor variety stemming process has three parts:
1. Determine the successor varieties for a word.
2. Segment the word using one of the methods.
3. Select one of the segments as the stem.
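The three parts can be put together in a small sketch using the peak and plateau method and the Hafer and Weiss selection rule. All names are illustrative. It assumes that the end of a word is not counted as a successor and that "occurs in a word" means the word starts with the segment.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]


def successor_variety(prefix, corpus):
    """Distinct characters following `prefix` (end of word not counted)."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})


def peak_and_plateau_segments(word, corpus):
    # Part 1: successor variety of every prefix of the word.
    varieties = [successor_variety(word[:i], corpus)
                 for i in range(1, len(word) + 1)]
    # Part 2: break after a character whose variety exceeds both neighbours.
    segments, start = [], 0
    for i in range(1, len(word) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


def select_stem(segments, corpus, limit=12):
    # Part 3: keep the first segment if it begins at most `limit` corpus
    # words; otherwise take the second segment.
    first = segments[0]
    occurrences = sum(1 for w in corpus if w.startswith(first))
    if len(segments) == 1 or occurrences <= limit:
        return first
    return segments[1]


segments = peak_and_plateau_segments("READABLE", CORPUS)
print(segments)
print(select_stem(segments, CORPUS))
```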
Exercise:
• Given: Test Word: Connection
• Corpus:
connect, connected, connecting, concatenation, corrupt,
connection, connections, connectable, connectedness
• Use successor variety to determine the stem of the word
Connection.
Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or
prefixes from terms, leaving a stem.
A simple example of an affix removal stemmer is one
that removes the plurals from terms.
A set of rules for such a stemmer is as follows (only the
first applicable rule is used):
– If a word ends in "ies" but not "eies" or "aies",
then "ies" -> "y"
– If a word ends in "es" but not "aes", "ees", or "oes",
then "es" -> "e"
– If a word ends in "s" but not "us" or "ss",
then "s" -> NULL
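The three plural rules translate almost directly into code. The sketch below assumes the rules are tried in the order listed and that only the first one whose condition holds is applied; `remove_plural` is a hypothetical name.

```python
def remove_plural(word):
    """Apply the first plural rule whose condition holds, if any."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"   # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL
    return word


for term in ["queries", "houses", "cats", "class", "corpus"]:
    print(term, "->", remove_plural(term))
```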
Most stemmers currently in use are iterative longest
match stemmers.
An iterative longest match stemmer removes the longest
possible string of characters from a word according to a
set of rules.
This process is repeated until no more characters can be
removed.
Even after all such characters have been removed, stems may
not be correctly conflated.
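A minimal sketch of the iterative longest match idea follows. It is not any particular published stemmer: the suffix list, the minimum stem length of 3, and the function name are all assumptions made for illustration.

```python
# Hypothetical suffix list, chosen only for illustration.
SUFFIXES = ("fulness", "less", "ness", "ful", "ing", "ion", "ed", "ly", "s")


def iterative_longest_match(word, suffixes=SUFFIXES, min_stem=3):
    """Repeatedly strip the longest matching suffix until none applies."""
    ordered = sorted(suffixes, key=len, reverse=True)  # longest first
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            # Assumed rule: never leave fewer than `min_stem` characters.
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]
                changed = True
                break
    return word


for term in ["hopefulness", "carelessly", "connections", "sing"]:
    print(term, "->", iterative_longest_match(term))
```

With this toy list, "connections" loses "s" and then "ion" in two passes, which is the iterative part of the method.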
Arabic Affix Removal Stemmer:
Some examples
Prefixes
سيست ستست فال بال فلل كال سيس مست
تست مست ال با كا سا فكال فبال
سا اف و ي است ستت سيت يست
لل فب فس سي فك ست فل اس
Suffixes
تما نا ني كن كم ها هن هم
تن يه ا و ي ه كما هما
ون ين ان نه ات وا تم
Arabic Affix Removal Stemmer:
The Algorithm
Input word W
Input word length N
Prefixes matrix S
Suffixes matrix P
The prefix Li
The suffix Mi
Step (1):
Read the input word length and store the output in (N).
If the word length is less than or equal to 3, mark R = W and go to Step 4.
If the word length is greater than 3, go to Step 2.
Step (2):
Read the next prefix (Li) from the matrix (S) and do the following:
A - If the prefix (Li) matches the beginning of the word (W), do the
following:
a. Delete the prefix (Li) from the beginning of the word and store the output
in (temp).
b. If the length of (temp) is less than 3, cancel the deletion, move
to the next prefix in the prefixes matrix, and repeat Step 2.
c. If the length of (temp) is greater than or equal to 3, let R = temp and go to Step 3.
B - If the prefix (Li) from the matrix (S) does not match the beginning of the
word (W), move to the next prefix and repeat Step 2.
Step (3):
If the length of (R) = 4, go to Step 4.
Read the next suffix (Mi) from the matrix (P) and do the
following:
A - If the suffix (Mi) matches the ending of the word (R):
a. Delete the suffix (Mi) from the ending of the word (R) and store
the output in (temp).
b. If the length of (temp) is less than 3, cancel the deletion,
move to the next suffix, and repeat Step 3.
c. If the length of (temp) is greater than or equal to 3, mark R = temp and
go to Step 4.
B - If the suffix (Mi) does not match the ending of the word (R), move
to the next suffix and repeat Step 3.
Step (4):
Return R
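The four steps can be sketched as below, using the prefix and suffix lists given earlier. Several points are assumptions rather than part of the slides: affixes are tried longest first, a result of exactly 3 characters is accepted in both steps, the suffix is removed from the word left after prefix removal, and a word with no matching prefix goes to the suffix step unchanged. The function names are hypothetical.

```python
PREFIXES = ("سيست ستست فال بال فلل كال سيس مست "
            "تست ال با كا سا فكال فبال "
            "اف و ي است ستت سيت يست "
            "لل فب فس سي فك ست فل اس").split()
SUFFIXES = ("تما نا ني كن كم ها هن هم "
            "تن يه ا و ي ه كما هما "
            "ون ين ان نه ات وا تم").split()


def strip_affix(word, affixes, at_start):
    """Remove the first matching affix that leaves at least 3 characters."""
    # Assumption: affixes are tried longest first.
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if at_start else word.endswith(affix)
        if matches:
            temp = word[len(affix):] if at_start else word[:-len(affix)]
            if len(temp) >= 3:
                return temp
            # Otherwise cancel the deletion and try the next affix.
    return word


def stem(word):
    if len(word) <= 3:                                     # Step 1
        return word
    result = strip_affix(word, PREFIXES, at_start=True)    # Step 2
    if len(result) == 4:                                   # Step 3 guard
        return result
    return strip_affix(result, SUFFIXES, at_start=False)   # Steps 3 and 4


for term in ["المعلمون", "بالمدرسة", "طفل"]:
    print(term, "->", stem(term))
```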