Fundamentals of Information Retrieval and Web Search
Week 3

Porter Stemming Algorithm

Seyed Mohsen Hosseini
Introduction
In linguistics (the study of language and its structure), a stem is the part of a
word that is common to all of its inflected variants.
Terms with a common stem will usually have similar meanings, for example:

CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

The performance of an IR system will often be improved if term groups such
as this are conflated into a single term. This may be done by removal of the
various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT.
In addition, the suffix stripping process reduces the total number of
terms in the IR system, and hence the size and complexity of
the data in the system, which is always advantageous.

The Porter Stemming algorithm (or Porter Stemmer) removes the suffixes
from an English word to obtain its stem, which is very useful
in the field of Information Retrieval (IR).
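
As a rough illustration of conflation, the CONNECT family above can be collapsed by stripping the four listed suffixes. This is only a minimal sketch and not the Porter algorithm: the names SUFFIXES and strip_suffix are made up for this example, and no conditions are checked before a suffix is removed.

```python
# Naive suffix stripping for the CONNECT example (NOT the Porter algorithm).
# SUFFIXES and strip_suffix are hypothetical names used only in this sketch.
SUFFIXES = ("ions", "ion", "ing", "ed")  # longest suffix first


def strip_suffix(word):
    """Remove the first listed suffix that the word ends with, if any."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return lower[: len(lower) - len(suffix)]
    return lower


words = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
stems = {strip_suffix(w) for w in words}
print(stems)
```

All five terms conflate to the single stem connect. The same crude rule would also reduce SING to S, which is why the real algorithm attaches conditions to its rules.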
Martin Porter
Martin F. Porter is the inventor of the Porter Stemmer,
one of the most common algorithms for stemming
English, and the Snowball programming framework.
His 1980 paper "An algorithm for suffix stripping",
proposing the stemming algorithm, has been cited
over 12000 times (Google Scholar).

Porter read mathematics at St John's College, Cambridge (1963–66),
and went on to take a Diploma in Computer Science (1967) and a PhD
at the Cambridge Computer Laboratory. He worked at the University of
Leeds for a year before returning to Cambridge, first at the Literary and
Linguistic Computing Centre (1971–1974) and then at the Sedgwick
Museum as a programmer (1974–1976). In 1977, he became the
Director of the Museum Documentation Advisory Unit (MDA).


History
The original stemming algorithm paper was written in 1979 in the Computer
Laboratory, Cambridge (England), as part of a larger IR project, and appeared
as Chapter 6 of the final project report,

C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in
probabilistic information retrieval. London: British Library. (British Library Research
and Development Report, no. 5587).

With van Rijsbergen’s encouragement, it was also published in,

M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3) pp 130−137.

And since then it has been reprinted in

Karen Sparck Jones and Peter Willett, 1997, Readings in Information Retrieval, San
Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.
Various implementations
https://tartarus.org/martin/PorterStemmer/index.html
To test the programs out, here is a sample vocabulary (0.19 megabytes),
and the corresponding output.
23531 words

Word              Stem
impossibilities   imposs
fortifications    fortif
excommunication   excommun
prognostication   prognost
principalities    princip
unthankfulness    unthank
voluptuousness    voluptu
communication     commun
deliciousness     delici
forgetfulness     forget
fortification     fortif
impossibility     imposs
justification     justif
mollification     mollif
qualification     qualif
covetousness      covet
excellencies      excel
fruitfulness      fruit
thankfulness      thank
bashfulness       bash
fearfulness       fear


Consonants and Vowels
A consonant is a letter other than the vowels (A, E, I, O or U) and other than the letter
“Y” preceded by a consonant. So in “TOY” the consonants are “T” and “Y”, and in
“SYZYGY” they are “S”, “Z” and “G”.

$regex_consonant = '(?:[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y)';

If a letter is not a consonant it is a vowel.

$regex_vowel = '(?:[aeiou]|(?<![aeiou])y)';
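
The definition can also be checked letter by letter instead of with regular expressions. The sketch below is one possible reading of the rule, assuming plain ASCII words; the function names is_consonant and consonants are made up for this example.

```python
def is_consonant(word, i):
    """True if the letter at index i of a lowercase word is a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def consonants(word):
    """List the consonant letters of a word, keeping the original case."""
    lower = word.lower()
    return [word[i] for i in range(len(lower)) if is_consonant(lower, i)]


print(consonants("TOY"))
print(consonants("SYZYGY"))
```

For TOY this lists T and Y, and for SYZYGY it lists S, Z and G, matching the two examples in the definition.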

A consonant will be denoted by c and a vowel by v.
A list ccc... of length greater than 0 will be denoted by C,
A list vvv... of length greater than 0 will be denoted by V.

Any word, or part of a word, therefore has one of the four forms given below.

Form          Example
CVCV ... C    collection, management
CVCV ... V    conclude, revise
VCVC ... C    entertainment, illumination
VCVC ... V    illustrate, abundance
What is m?
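The four forms (CVCV ... C, CVCV ... V, VCVC ... C and VCVC ... V) may all be
represented by the single form

[C]VCVC ... [V]

where the square brackets denote optional presence of their contents.
Writing (VC)^m for VC repeated m times, this becomes

[C](VC)^m[V]

m is called the measure of the word or word part written in this form.
m = 0: TR, EE, TREE, Y, BY
m = 1: TROUBLE, OATS, TREES, IVY
m = 2: TROUBLES, PRIVATE, OATEN, ORRERY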
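
A small sketch of how m can be computed: collapse the word into its alternating C/V pattern and count the VC pairs. The function names pattern and measure are assumptions of this example, and input is assumed to be a plain ASCII word.

```python
def is_consonant(word, i):
    """True if the letter at index i of a lowercase word is a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def pattern(word):
    """Collapse a word into its alternating form, e.g. 'CVCVC'."""
    lower = word.lower()
    out = ""
    for i in range(len(lower)):
        symbol = "C" if is_consonant(lower, i) else "V"
        if not out or out[-1] != symbol:  # merge runs such as ccc into one C
            out += symbol
    return out


def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs in the collapsed form."""
    return pattern(word).count("VC")


for w in ["collection", "conclude", "entertainment", "illustrate"]:
    print(w, pattern(w), measure(w))
```

Because the collapsed pattern strictly alternates between C and V, counting the substring VC gives m directly.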

Rules
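(condition) S1 → S2

If a word ends with the suffix S1, and the stem before S1 satisfies the
condition, S1 is replaced by S2. The condition is usually given in terms of m.

(m > 1) EMENT →

Here S1 is EMENT and S2 is null.

REPLACEMENT → REPLAC
REPLAC = cvccvc = CVCVC
m(REPLAC) = 2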
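
A rule of this shape can be sketched as a function that takes S1, S2 and a condition on the stem. The name apply_rule and its signature are assumptions of this example, not part of any published implementation; words are assumed to be lowercase.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the VC pairs in the collapsed C/V form of a lowercase word."""
    collapsed = ""
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not collapsed or collapsed[-1] != symbol:
            collapsed += symbol
    return collapsed.count("VC")


def apply_rule(word, s1, s2, condition):
    """Apply '(condition) S1 -> S2'; return the word unchanged if it fails."""
    if not word.endswith(s1):
        return word
    stem = word[: len(word) - len(s1)]  # the part of the word before S1
    return stem + s2 if condition(stem) else word


# (m > 1) EMENT ->
print(apply_rule("replacement", "ement", "", lambda stem: measure(stem) > 1))
print(apply_rule("cement", "ement", "", lambda stem: measure(stem) > 1))
```

REPLACEMENT loses its suffix because m(REPLAC) = 2, while CEMENT is left alone because the stem C has m = 0.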

Conditions
*S – the stem ends with S (and similarly for the other letters)

*v* – the stem contains a vowel

*d – the stem ends with a double consonant (e.g. -TT, -SS)

*o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)

The condition part may also contain expressions with and, or and not.
(m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T.

(*d and not (*L or *S or *Z)) tests for a stem that ends with a double consonant
other than L, S or Z.
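
Each condition can be read as a small predicate on the stem. The sketch below is one way to write them, assuming lowercase ASCII stems; all function names are made up for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_with(stem, letter):
    """*S: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)


def ends_cvc(stem):
    """*o: the stem ends cvc, where the second c is not W, X or Y."""
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def double_but_not_lsz(stem):
    """(*d and not (*L or *S or *Z))."""
    return ends_double_consonant(stem) and not any(
        ends_with(stem, letter) for letter in "lsz"
    )


print(ends_cvc("hop"), ends_cvc("wil"), ends_cvc("snow"))
print(double_but_not_lsz("hopp"), double_but_not_lsz("fall"))
```

Compound conditions are then ordinary boolean expressions over these predicates, as in double_but_not_lsz.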

How are rules obeyed?
In a set of rules written beneath each other, only one is obeyed, and this will be the
one with the longest matching S1 for the given word. For example, with the
following rules,

S1 → S2
SSES → SS
IES → I
SS → SS
S →

(Here the conditions are all null.)

CARESSES maps to CARESS since SSES is the longest match for S1.

Equally CARESS maps to CARESS (since S1 = "SS") and CARES to CARE (since S1 = "S").
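
The longest-match selection can be sketched as follows for the four rules above, whose conditions are all null. The names RULES and apply_longest_match are assumptions of this example; words are assumed to be lowercase.

```python
# The four rules from the slide, as (S1, S2) pairs with null conditions.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest_match(word, rules):
    """Obey only the rule whose S1 is the longest suffix of the word."""
    matching = [(s1, s2) for s1, s2 in rules if word.endswith(s1)]
    if not matching:
        return word  # no rule applies
    s1, s2 = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(s1)] + s2


for w in ["caresses", "caress", "cares"]:
    print(w, "->", apply_longest_match(w, RULES))
```

CARESS matches both SS and S, but only the longer SS rule is obeyed, so the word is left unchanged instead of losing its final S.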

The Algorithm
The Algorithm (Step 1)
The Algorithm (Step 2)
The Algorithm (Step 3)
The Algorithm (Step 4)
The Algorithm (Step 5)
Example 1 (MULTIDIMENSIONAL)

1. The word does not match any of the suffixes in steps 1, 2 and 3.
2. So it moves on to step 4.
3. The word ends with “AL”, and the stem before it (MULTIDIMENSION)
has m > 1 (since m = 5).
4. Hence in step 4, “AL” is deleted (replaced with null).
5. Step 5 does not change the stem further.
6. Finally the output is MULTIDIMENSION.

MULTIDIMENSIONAL → MULTIDIMENSION
Example 2 (CHARACTERIZATION)
1. The word does not match any of the suffixes in step 1.
2. So it moves on to step 2.
3. The word ends with “IZATION”, and the stem before it (CHARACTER)
has m > 0 (since m = 3).
4. Hence in step 2, “IZATION” is replaced with “IZE”.
5. The word is now CHARACTERIZE.
6. Step 3 does not match any of the suffixes, so the word moves
on to step 4.
7. The word ends with “IZE”, and the stem before it (CHARACTER) has m > 1 (since m = 3).
8. So in step 4, “IZE” is deleted (replaced with null).
9. Step 5 does not change the stem.
10. Finally the output is CHARACTER.

CHARACTERIZATION → CHARACTERIZE → CHARACTER
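
Both worked examples can be replayed with a small sketch. It contains only the rules named in the examples, (m > 0) IZATION → IZE from step 2 and (m > 1) AL → and (m > 1) IZE → from step 4, so it is not a full implementation of either step; the names STEP2_RULES, STEP4_RULES and apply_step are assumptions of this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the VC pairs in the collapsed C/V form of a lowercase word."""
    collapsed = ""
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not collapsed or collapsed[-1] != symbol:
            collapsed += symbol
    return collapsed.count("VC")


# Only the rules used in the two examples, NOT the complete rule lists.
STEP2_RULES = [("ization", "ize")]     # each with condition (m > 0)
STEP4_RULES = [("al", ""), ("ize", "")]  # each with condition (m > 1)


def apply_step(word, rules, min_m):
    """Pick the longest matching S1, then require measure(stem) > min_m."""
    matching = [(s1, s2) for s1, s2 in rules if word.endswith(s1)]
    if not matching:
        return word
    s1, s2 = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(s1)]
    return stem + s2 if measure(stem) > min_m else word


for word in ["multidimensional", "characterization"]:
    after_step2 = apply_step(word, STEP2_RULES, 0)
    after_step4 = apply_step(after_step2, STEP4_RULES, 1)
    print(word, "->", after_step2, "->", after_step4)
```

The intermediate values reproduce the chains in the examples: MULTIDIMENSIONAL is untouched by the step 2 rule and loses AL in step 4, while CHARACTERIZATION passes through CHARACTERIZE before reaching CHARACTER.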
