IR 003 Porter Stemmer
Web Search
Week 3
Seyed Mohsen Hosseini
Introduction
In linguistics (the study of language and its structure), a stem is the part of a word that is
common to all of its inflected variants.
Terms with a common stem will usually have similar meanings, for example:
CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS
The performance of an IR system is often improved if term groups such
as this are conflated into a single term.
This may be done by removal of the various suffixes -ED, -ING, -ION,
-IONS to leave the single term CONNECT.
In addition, the suffix stripping process will reduce the total number of
terms in the IR system, and hence reduce the size and complexity of
the data in the system, which is advantageous.
The Porter stemming algorithm (or Porter Stemmer) removes the
suffixes from an English word to obtain its stem, a step that is very useful
in the field of Information Retrieval (IR).
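The conflation of the CONNECT family described above can be illustrated with a deliberately naive sketch. It only strips the four suffixes named in the text, with no conditions at all; the names SUFFIXES and strip_suffix are hypothetical and this is not the Porter algorithm itself.

```python
# Naive illustration only: remove one of the four suffixes named in the text.
# Longer suffixes are listed first so that -IONS is tried before -ION.
SUFFIXES = ("IONS", "ION", "ING", "ED")


def strip_suffix(word: str) -> str:
    """Return the word with the first matching suffix removed, if any."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


terms = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
conflated = {strip_suffix(term) for term in terms}
print(conflated)  # {'CONNECT'}
```

Unconditional stripping also damages short words: strip_suffix("SING") returns "S". Avoiding this kind of over-stripping is the reason the rules of the Porter Stemmer carry conditions on the remaining stem.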
Seyed Mohsen Hosseini, Fundamentals of Information Retrieval and Web Search
Martin Porter
Martin F. Porter is the inventor of the Porter Stemmer,
one of the most common algorithms for stemming
English, and the Snowball programming framework.
His 1980 paper "An algorithm for suffix stripping",
proposing the stemming algorithm, has been cited
over 12000 times (Google Scholar).
C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587).
M.F. Porter, 1980. An algorithm for suffix stripping. Program, 14(3), pp. 130-137.
Karen Sparck Jones and Peter Willett, 1997. Readings in Information Retrieval. San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4.
Various implementations
https://tartarus.org/martin/PorterStemmer/index.html
To test the programs out, here is a sample vocabulary (0.19 megabytes, 23531 words), and the
corresponding output.
Word              Stem
impossibilities   imposs
fortifications    fortif
excommunication   excommun
prognostication   prognost
principalities    princip
unthankfulness    unthank
voluptuousness    voluptu
communication     commun
deliciousness     delici
forgetfulness     forget
fortification     fortif
impossibility     imposs
justification     justif
mollification     mollif
qualification     qualif
covetousness      covet
excellencies      excel
fruitfulness      fruit
thankfulness      thank
bashfulness       bash
fearfulness       fear
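// A consonant: any letter other than a, e, i, o, u, and y only when it follows a vowel or starts the word.
$regex_consonant = '(?:[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y)';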
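As a rough illustration, the same consonant test can be ported to Python's re module and used to label every letter of a word as c (consonant) or v (vowel). The names CONSONANT and cv_pattern are hypothetical helpers, and the sketch assumes words made of plain ASCII letters.

```python
import re

# Direct port of the consonant pattern above: an ordinary consonant letter,
# or y when it follows a, e, i, o, u, or y at the start of the word.
CONSONANT = re.compile(r"[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y")


def cv_pattern(word: str) -> str:
    """Label each letter of the word as 'c' (consonant) or 'v' (vowel)."""
    word = word.lower()
    consonant_positions = {match.start() for match in CONSONANT.finditer(word)}
    return "".join(
        "c" if index in consonant_positions else "v" for index in range(len(word))
    )


print(cv_pattern("TOY"))     # cvc
print(cv_pattern("SYZYGY"))  # cvcvcv
```

In TOY the letter Y follows a vowel and counts as a consonant, while in SYZYGY every Y follows a consonant and counts as a vowel.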
Rules
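(condition) S1 → S2
If a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2.
The condition is usually given in terms of m, the measure of the stem: the number of VC pairs when the stem is written as [C](VC)^m[V].
(m > 1) EMENT →
Here S1 is EMENT and S2 is null, so:
REPLACEMENT → REPLAC
The stem REPLAC has the letter pattern cvccvc, which collapses to CVCVC = C(VC)(VC), so
m(REPLAC) = 2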
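The measure and the rule format above can be sketched in Python as follows. This is a minimal sketch, not a full stemmer: measure and apply_rule are hypothetical helper names, the consonant test reuses the regular expression shown earlier, and only a condition of the form m > k is supported.

```python
import re

# Same consonant test as the regular expression shown earlier.
CONSONANT = re.compile(r"[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y")


def measure(stem: str) -> int:
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    stem = stem.lower()
    consonants = {match.start() for match in CONSONANT.finditer(stem)}
    pattern = "".join("C" if i in consonants else "V" for i in range(len(stem)))
    # Collapse runs of the same class, e.g. CVCCVC -> CVCVC, then count VC pairs.
    collapsed = re.sub(r"C+", "C", re.sub(r"V+", "V", pattern))
    return collapsed.count("VC")


def apply_rule(word: str, s1: str, s2: str, min_m: int) -> str:
    """Apply the rule (m > min_m) S1 -> S2; return the word unchanged otherwise."""
    if word.endswith(s1):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_m:
            return stem + s2
    return word


print(measure("REPLAC"))                          # 2
print(apply_rule("REPLACEMENT", "EMENT", "", 1))  # REPLAC
print(apply_rule("CEMENT", "EMENT", "", 1))       # CEMENT
```

CEMENT is left alone because its stem C has m = 0, which fails the condition m > 1.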
Conditions
*S – the stem ends with S (and similarly for the other letters)
*v* – the stem contains a vowel
*d – the stem ends with a double consonant (e.g. -TT, -SS)
*o – the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP)
The condition part may also contain expressions with and, or and not.
(m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T.
(*d and not (*L or *S or *Z)) tests for a stem that ends with a double consonant
other than L, S or Z.
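The conditions above translate naturally into small predicate functions. The sketch below is one possible way to write them, assuming ASCII words; every function name is hypothetical, and the two compound conditions are the ones quoted in the text.

```python
import re

CONSONANT = re.compile(r"[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y")


def cv_pattern(stem: str) -> str:
    """Label each letter as 'c' (consonant) or 'v' (vowel)."""
    stem = stem.lower()
    consonants = {match.start() for match in CONSONANT.finditer(stem)}
    return "".join("c" if i in consonants else "v" for i in range(len(stem)))


def measure(stem: str) -> int:
    """Number of vc pairs after collapsing runs of c and of v."""
    collapsed = re.sub(r"c+", "c", re.sub(r"v+", "v", cv_pattern(stem)))
    return collapsed.count("vc")


def ends_with(stem: str, letter: str) -> bool:
    """*S: the stem ends with the given letter."""
    return stem.lower().endswith(letter.lower())


def ends_double_consonant(stem: str) -> bool:
    """*d: the stem ends with the same consonant twice."""
    s = stem.lower()
    return len(s) >= 2 and s[-1] == s[-2] and cv_pattern(s)[-1] == "c"


def ends_cvc(stem: str) -> bool:
    """*o: the stem ends cvc and the last consonant is not w, x or y."""
    s = stem.lower()
    return cv_pattern(s).endswith("cvc") and s[-1] not in "wxy"


def m_gt_1_and_s_or_t(stem: str) -> bool:
    """(m>1 and (*S or *T))"""
    return measure(stem) > 1 and (ends_with(stem, "S") or ends_with(stem, "T"))


def double_but_not_l_s_z(stem: str) -> bool:
    """(*d and not (*L or *S or *Z))"""
    return ends_double_consonant(stem) and not (
        ends_with(stem, "L") or ends_with(stem, "S") or ends_with(stem, "Z")
    )


print(ends_cvc("HOP"), ends_cvc("BOX"))
print(double_but_not_l_s_z("HOPP"), double_but_not_l_s_z("FALL"))
print(m_gt_1_and_s_or_t("ADOPT"))
```

Keeping each basic condition as its own function lets compound conditions be written with ordinary and, or and not, mirroring the notation on the slide.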
How are rules obeyed?
In a set of rules written beneath each other, only one is obeyed, and this will be the
one with the longest matching S1 for the given word. For example, with the
following rules,
S1 → S2
SSES → SS
IES → I
SS → SS
S →
(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1.
Equally CARESS maps to CARESS (since S1="SS") and CARES to CARE (since S1="S").
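The longest-match selection described above can be sketched as follows. The rule list holds exactly the four rules from the example; the names RULES and apply_longest_match are hypothetical, and conditions are omitted because they are all null here.

```python
# The four rules from the example, as (S1, S2) pairs with null conditions.
RULES = [("SSES", "SS"), ("IES", "I"), ("SS", "SS"), ("S", "")]


def apply_longest_match(word: str, rules: list[tuple[str, str]]) -> str:
    """Apply only the rule whose S1 is the longest suffix of the word."""
    matching = [(s1, s2) for s1, s2 in rules if word.endswith(s1)]
    if not matching:
        return word
    s1, s2 = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(s1)] + s2


for word in ["CARESSES", "CARESS", "CARES", "PONIES"]:
    print(word, "->", apply_longest_match(word, RULES))
```

The rule SS → SS changes nothing, but it matters: without it, CARESS would match the shorter rule S → and lose its final S.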
The Algorithm
The Algorithm (Step 1)
The Algorithm (Step 2)
The Algorithm (Step 3)
The Algorithm (Step 4)
The Algorithm (Step 5)
Example 1 (MULTIDIMENSIONAL)
MULTIDIMENSIONAL → MULTIDIMENSION
Example 2 (CHARACTERIZATION)
1. The word does not match any of the suffixes in step 1.
2. So it moves on to step 2.
3. The word ends with “IZATION”, and the stem before it (CHARACTER) has m > 0 (since m = 3).
4. Hence in step 2, “IZATION” is replaced with “IZE”.
5. The word is now CHARACTERIZE.
6. Step 3 does not match any suffix, so the word moves on to step 4.
7. The word ends with “IZE”, and the stem before it (CHARACTER) has m > 1 (since m = 3).
8. So in step 4, “IZE” is deleted (replaced with null).
9. The remaining steps make no change.
10. Finally the output is CHARACTER.
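The trace above can be reproduced with a small sketch that applies only the two rules the example actually uses: (m > 0) IZATION → IZE from step 2 and (m > 1) IZE → null from step 4. It is not a complete stemmer; steps 1, 3 and 5 are left out because the example says they make no change, and measure and apply_rule are hypothetical helper names.

```python
import re

CONSONANT = re.compile(r"[bcdfghjklmnpqrstvwxz]|(?<=[aeiou])y|^y")


def measure(stem: str) -> int:
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    stem = stem.lower()
    consonants = {match.start() for match in CONSONANT.finditer(stem)}
    pattern = "".join("C" if i in consonants else "V" for i in range(len(stem)))
    collapsed = re.sub(r"C+", "C", re.sub(r"V+", "V", pattern))
    return collapsed.count("VC")


def apply_rule(word: str, s1: str, s2: str, min_m: int) -> str:
    """Apply the rule (m > min_m) S1 -> S2; return the word unchanged otherwise."""
    if word.endswith(s1):
        stem = word[: len(word) - len(s1)]
        if measure(stem) > min_m:
            return stem + s2
    return word


word = "CHARACTERIZATION"
print(measure("CHARACTER"))                            # 3
after_step2 = apply_rule(word, "IZATION", "IZE", 0)    # step 2 rule
after_step4 = apply_rule(after_step2, "IZE", "", 1)    # step 4 rule
print(after_step2)  # CHARACTERIZE
print(after_step4)  # CHARACTER
```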