Introduction - Types of Stemming Algorithms


Chapter 4

Stemming Algorithms

• Introduction
• Types of stemming algorithms

Stemming

Introduction
One technique for improving IR performance is to provide
searchers with ways of finding morphological variants of
search terms.

For example, if a searcher enters the term stemming as part of
a query, he or she is likely to also be interested in
such variants as stemmed and stem.

We use the term conflation, meaning the act of fusing or
combining, as the general term for the process of matching
morphological term variants.

Conflation can be either manual, using some kind of regular
expressions, or automatic, via programs called stemmers.

Stemming is also used in IR to reduce the size of index files,
since a single stem typically corresponds to several full terms.

[Figure: example of the need for stemming]

Stemming in Searching Process

Terms are stemmed at search time rather than at indexing time.
• Example
Look for: system users
The system takes each term in the query and tries to
determine which other terms in the database might have the
same stem.
If any possibly related terms are found, the system presents
them to the user for selection.
The user selects the terms he or she wants by entering their
numbers.

Stemming Algorithms

Types of Stemming Algorithms:

1. Store a table of all index terms
2. n-gram stemmers
3. Successor variety stemmers
4. Affix removal stemmers

Store a table of all index terms

One way to do stemming is to store a table of all index
terms and their stems.

There are problems with this approach:

– There is no such data for English. Even if there were, many
terms found in databases would not be represented.

– Another problem is the storage overhead for such a table.
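
A minimal sketch of the table approach, assuming a tiny hand-built table. The entries and the names STEM_TABLE and lookup_stem are made up for illustration. Terms missing from the table fall through unchanged, which is the coverage problem noted above.

```python
# Hypothetical term-to-stem table; a real one would need an entry
# for every index term, which is where the storage overhead comes from.
STEM_TABLE = {
    "stemming": "stem",
    "stemmed": "stem",
    "stems": "stem",
    "engineering": "engineer",
    "engineered": "engineer",
}


def lookup_stem(term):
    """Return the stored stem, or the term itself if it is not in the table."""
    key = term.lower()
    return STEM_TABLE.get(key, key)


print(lookup_stem("Stemmed"))   # found in the table
print(lookup_stem("stemmer"))   # not in the table, returned unchanged
```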

n-gram stemmers

• An n-gram of size 1 is referred to as a "unigram"; size 2 is
a "bigram" (or, less commonly, a "digram"); size 3 is a
"trigram"; and size 4 or more is simply called an "n-gram".

In this approach, association measures are calculated
between pairs of terms based on shared unique digrams.

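For example, the terms statistics and statistical can be
broken into digrams as follows.

statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti

Thus statistics has nine digrams, seven of which are unique,
and statistical has ten digrams, eight of which are unique.
The two words share six unique digrams:
at, ic, is, st, ta, ti.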

Once the unique digrams for the word pair have been
identified and counted, a similarity measure based on them is
computed as follows:

$$S = \frac{2C}{A + B}$$

• Where:
A: the number of unique digrams in the first word
B: the number of unique digrams in the second word
C: the number of unique digrams shared by the two words

This similarity measure is called Dice's coefficient.

For the example above, Dice's coefficient would equal

$$S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80$$

• Such similarity measures are determined for all pairs of terms in
the database, forming a similarity matrix.
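
The computation above can be sketched in a few lines. This is an illustration only; the function names unique_digrams and dice_coefficient are not from the slides, and returning 0.0 for words too short to have digrams is an assumption.

```python
def unique_digrams(word):
    """Return the set of distinct adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_coefficient(first, second):
    """Dice's coefficient S = 2C / (A + B) over unique digrams."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # assumption: no digrams at all means no similarity
    shared = a & b
    return 2 * len(shared) / (len(a) + len(b))


a = unique_digrams("statistics")
b = unique_digrams("statistical")
print(len(a), len(b), sorted(a & b))
print(dice_coefficient("statistics", "statistical"))
```

Computing this value for every pair of terms gives the entries of the similarity matrix.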

Exercise:

Use an n-gram stemmer to determine which of these words
share the stem of the word "طفل":
• اطفالهم
• متطفل
• متطبع
• فلفل

Successor Variety

The successor variety of a string is the number of different
characters that follow it in words in some body of text.

• Consider a body of text consisting of the following words,
for example:
able, axle, accident, ape, about.
To determine the successor varieties for "apple," for
example:
• The successor variety of "a" is four, since "a" is followed in
the text by "b," "x," "c," and "p."
• The next successor variety for apple would be one, since only
"e" follows "ap" in the text.
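
A small sketch of the counting step, using the five-word body of text above. The function name successor_variety is chosen here for illustration, and the sketch does not count end-of-word as a successor, which is an assumption.

```python
def successor_variety(prefix, corpus):
    """Count the distinct characters that follow `prefix` in the corpus words."""
    followers = set()
    for word in corpus:
        # Only words that start with the prefix and continue past it contribute.
        if word.startswith(prefix) and len(word) > len(prefix):
            followers.add(word[len(prefix)])
    return len(followers)


corpus = ["able", "axle", "accident", "ape", "about"]
test_word = "apple"
for i in range(1, len(test_word) + 1):
    prefix = test_word[:i]
    print(prefix, successor_variety(prefix, corpus))
```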

After a word has been segmented, the segment to be used as the
stem must be selected.
• In the complete word method, a break is made after a segment
if the segment is a complete word in the corpus.

• Hafer and Weiss used the following rule:

if (first segment occurs in <= 12 words in corpus)
first segment is stem
else (second segment is stem)

• The successor variety of substrings of a term will decrease as
more characters are added until a segment boundary is reached.

Example:

Test Word: READABLE

• Corpus: ABLE, APE, BEATABLE, FIXABLE, READ,
READABLE, READING, READS, RED, ROPE, RIPE.

• Using successor variety, determine the stem of the word
READABLE.

Solution:
Using the complete word segmentation method, the test
word "READABLE" will be segmented into "READ" and
"ABLE," since READ appears as a word in the corpus.
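
The complete word method can be sketched as follows. This is one possible reading, assuming that a break is placed after every proper prefix of the test word that is itself a corpus word; the function name complete_word_segments is illustrative.

```python
def complete_word_segments(word, corpus):
    """Split `word` after each proper prefix that is a complete corpus word."""
    vocabulary = set(corpus)
    segments = []
    start = 0
    for end in range(1, len(word)):
        if word[:end] in vocabulary:
            segments.append(word[start:end])
            start = end
    segments.append(word[start:])  # whatever remains after the last break
    return segments


corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
print(complete_word_segments("READABLE", corpus))
```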
Successor variety is obtained by processing the prefixes of a word, and
predecessor variety by processing its suffixes.

• Peak and plateau method
– A segment break is made after a character whose
successor variety exceeds that of the characters
immediately preceding and following it.

• The successor variety stemming process has three parts:

1. Determine the successor varieties for a word.
2. Segment the word using one of the methods.
3. Select one of the segments as the stem.
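
The three parts can be sketched together on the READABLE example. This is an illustration under stated assumptions: end-of-word is not counted as a successor, the first and last characters cannot be peaks because they lack a neighbour on one side, and "occurs in" in the Hafer and Weiss rule is read as "appears as a substring of". All function names are illustrative.

```python
def successor_variety(prefix, corpus):
    """Count the distinct characters that follow `prefix` in the corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})


def peak_and_plateau(word, corpus):
    """Parts 1 and 2: compute successor varieties, then break after each peak."""
    sv = [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]
    segments, start = [], 0
    for i in range(1, len(word) - 1):
        # sv[i] belongs to the prefix ending at character i.
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return sv, segments


def select_stem(segments, corpus, threshold=12):
    """Part 3: first segment is the stem unless it occurs in too many words."""
    if len(segments) < 2:
        return segments[0]
    occurrences = sum(1 for w in corpus if segments[0] in w)
    return segments[0] if occurrences <= threshold else segments[1]


corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
sv, segments = peak_and_plateau("READABLE", corpus)
print(sv)
print(segments)
print(select_stem(segments, corpus))
```

On this corpus the successor variety peaks after READ, so the peak and plateau method agrees with the complete word method.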

Exercise:
• Given: Test Word: Connection
• Corpus:
connect, connected, connecting, concatenation, corrupt,
connection, connections, connectable, connectedness

• Use successor variety to determine the stem of the word
Connection.

Affix Removal Stemmers

Affix removal algorithms remove suffixes and/or
prefixes from terms, leaving a stem.
A simple example of an affix removal stemmer is one
that removes the plurals from terms.
A set of rules for such a stemmer is as follows:
– If a word ends in "ies" but not "eies" or "aies"
Then "ies" -> "y"
– If a word ends in "es" but not "aes", "ees", or "oes"
Then "es" -> "e"
– If a word ends in "s" but not "us" or "ss"
Then "s" -> NULL
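
The three rules can be sketched as below. The sketch assumes the rules are tried in the order given and that only the first rule whose full condition holds is applied; the function name remove_plural is illustrative.

```python
def remove_plural(word):
    """Apply the first plural-removal rule whose condition holds."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"      # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-2] + "e"      # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]            # "s" -> NULL
    return word                     # no rule applies


for term in ["ponies", "houses", "stems", "corpus", "class"]:
    print(term, "->", remove_plural(term))
```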
Most stemmers currently in use are iterative longest
match stemmers.

An iterative longest match stemmer removes the longest
possible string of characters from a word according to a
set of rules.

This process is repeated until no more characters can be
removed.

Even after all removable characters have been stripped, stems may
not be correctly conflated.
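
A minimal sketch of the iterative longest match idea. The suffix list and the minimum stem length of 3 are made up for illustration and are not taken from any published stemmer.

```python
# Hypothetical suffix list and minimum stem length, for illustration only.
SUFFIXES = ["ations", "ation", "ness", "ing", "ed", "es", "s"]
MIN_STEM_LENGTH = 3


def strip_longest_suffix(word):
    """Remove the longest listed suffix that leaves a long enough stem."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


def iterative_longest_match(word):
    """Repeat the longest-match removal until nothing more can be removed."""
    while True:
        shorter = strip_longest_suffix(word)
        if shorter == word:
            return word
        word = shorter


print(iterative_longest_match("connectedness"))  # "ness" then "ed" are removed
print(iterative_longest_match("skies"), iterative_longest_match("sky"))
```

With this list, "skies" and "sky" end up with different stems, which is an instance of stems not being correctly conflated.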
Arabic Affix Removal Stemmer:

Some examples

Prefixes
سيست ستست فال بال فلل كال سيس مست
تست مست ال با كا سا فكال فبال
سا اف و ي است ستت سيت يست
لل فب فس سي فك ست فل اس

Suffixes
تما نا ني كن كم ها هن هم
تن يه ا و ي ه كما هما
ون ين ان نه ات وا تم

The Algorithm
Input word: W
Input word length: N
Prefixes matrix: S
Suffixes matrix: P
The prefix: Li
The suffix: Mi

Step (1):
• Read the input word length and store it in (N).
• If the word length is less than or equal to 3, mark R = W and go to Step 4.
• If the word length is greater than 3, go to Step 2.
Step (2):
Read the prefix (Li) from the matrix (S) and do the following:

A - If the prefix (Li) matches the beginning of the word (W), do the
following:
a. Delete the prefix (Li) from the beginning of the word and store the output
in (temp).
b. If the length of (temp) is less than 3, cancel the deletion, move
to the next prefix in the prefixes matrix, and repeat Step 2.
c. If the length of (temp) is greater than or equal to 3, let R = temp and go to Step 3.

B - If the prefix (Li) from the matrix (S) does not match the beginning of the
word (W), move to the next prefix and repeat Step 2.

Step (3):
If the length of (temp) = 4, go to Step 4.
Read the suffix (Mi) from the matrix (P) and do the
following:
A - If the suffix (Mi) matches the ending of the word (W), do the following:
a. Delete the suffix (Mi) from the ending of the word (W) and store
the output in (temp).
b. If the length of (temp) is less than 3, cancel the deletion,
move to the next suffix, and repeat Step 3.
c. If the length of (temp) is greater than or equal to 3, mark R = temp and
go to Step 4.
B - If the suffix (Mi) does not match the ending of the word (W), move
to the next suffix and repeat Step 3.

Step (4):
Return R
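
Steps 1 to 4 can be sketched as below. This is one reading of the steps, not a definitive implementation, and it rests on several assumptions: affixes are tried longest first (the slides do not state the order of the matrices), a remainder of exactly 3 letters is accepted, a word with no usable prefix passes to Step 3 unchanged, and the suffix is removed from the prefix-stripped form. The function name arabic_stem is illustrative.

```python
# Prefix and suffix lists as given in the examples, duplicates dropped.
PREFIXES = (
    "سيست ستست فال بال فلل كال سيس مست "
    "تست ال با كا سا فكال فبال "
    "اف و ي است ستت سيت يست "
    "لل فب فس سي فك ست فل اس"
).split()
SUFFIXES = (
    "تما نا ني كن كم ها هن هم "
    "تن يه ا و ي ه كما هما "
    "ون ين ان نه ات وا تم"
).split()


def longest_first(affixes):
    """Assumed ordering: try longer affixes before shorter ones."""
    return sorted(dict.fromkeys(affixes), key=len, reverse=True)


def arabic_stem(word):
    # Step 1: words of 3 letters or fewer are returned as they are.
    if len(word) <= 3:
        return word
    result = word
    # Step 2: remove the first matching prefix that leaves at least 3 letters.
    for prefix in longest_first(PREFIXES):
        if word.startswith(prefix):
            temp = word[len(prefix):]
            if len(temp) >= 3:
                result = temp
                break
    # Step 3: a 4-letter form skips suffix removal.
    if len(result) == 4:
        return result
    for suffix in longest_first(SUFFIXES):
        if result.endswith(suffix):
            temp = result[:-len(suffix)]
            if len(temp) >= 3:
                result = temp
                break
    # Step 4: return R.
    return result


for term in ["المعلمون", "الكتاب", "كتب"]:
    print(term, "->", arabic_stem(term))
```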