
Content/subject Analysis and

Representation
Chapter 2 - Part two

1
Sub topics
 Overview
 Manual Vs Automatic indexing
 Indexing Languages and Methods
 Term selection process
 Lexical analysis (tokenization)
 Use of stopwords list
 Conflation

2
Conflation
The act of combining

3
Conflation
 Conflation is the act of fusing or combining.
 Term conflation is a general term for the process of matching morphological term variants.
 It is one way of improving IR performance.
 It provides searchers with ways of finding morphological variants of search terms.
But HOW?

4
Cont…
 Conflation can be broadly divided into two main classes:
 Stemming: language dependent; basically used to handle morphological variants (most widely used)
 String similarity: usually language independent; designed to handle all types of variants.

5
Conflation
 String similarity: N-gram
 Stemming:
   Table lookup
   Affix removal: Iterative (simple removal), Longest match
   Successor variety
6
Motivation
 Free-text IR systems are effective in operation only if it is possible to map the particular word forms in a user's query to the forms that occur in the document database.

 Example
  If a searcher enters the term stemming as part of a query, it is likely that he or she will also be interested in such variants as stemmed and stem.

7
Why variations in word form? (Need for
term conflation)
 In dealing with NL free text, it is common that there are variations in word forms. Common reasons are:
1. Grammatical requirements (morphology)
  Ex. the words Agriculture, Agricultural, and Agriculturalist are all semantically related words formed from a common root to suit grammatical requirements
2. National usage (alternative spellings)
  Different spellings of the same word to suit national usage (American and British)
Labor and Labour
Sulphur and sulfur
8
Cont…
3. Historical changes
 Arithmetic and Aerithmaticke

 Remark: Morphology (the form of the word) is the main source of word variation in NL texts. Considering grammatical requirements, suffixing and prefixing are the most common ways of creating word variants.

9
Stemming/Conflation
 Assumptions in term conflation or stemming
  Words with the same stem are semantically related and have the same meaning to the user of the text
  The chance of matching increases when index terms are reduced to their word stems, because it is more natural to search using "retrieve" than "retrieving"
  The algorithm (i.e., the stemmer) reduces all words with the same root (or stem) to a single form.
 Stemming in the field of IR aims at improving the matching process between the index terms of query and document text
10
Cont…
 Stemming, thus, is a recall-enhancing means of broadening an index term in a text search: it identifies words with similar meaning (to enhance recall).
  But it decreases precision (exactness)
 Stemming, like the use of a stopword list, also has the advantage of
  decreasing the size of the representation
  reducing the number of distinct forms (a smaller index size, which is an advantage in storage)

11
Cont…
 Stemming is more useful in the case of morphologically rich languages (like Hungarian and Hebrew).
  How?
 It is generally agreed upon that stemming either has a significant positive effect on retrieval effectiveness or creates no problems for retrieval performance
 Accordingly, there are three automatic approaches (methods) to stemming
 Table lookup
 Affix removal algorithms
 Letter successor variety stemmers
12
Table lookup Approach to Stemming

 Is the simplest method; it requires storing a table (creating a machine-readable dictionary) of all index terms and their corresponding stems.
 Terms from queries and indexes can then be stemmed (via table lookup).
  That is, stemming is done via lookups in the table
 Remark:
  The table can be indexed to speed up the search
  The space overhead may be high
  The stemming results are generally correct (adv.)
13
Example

Term Stem
Connection Connect
Connections Connect
Connective Connect
Connected Connect
Connecting Connect
Connect Connect
Engineering Engineer
Engineered Engineer
Engineer Engineer

14
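The table above can be implemented as a plain dictionary lookup. The following is a minimal sketch, assuming a toy table built from the example and a fall-through to the unchanged term for unknown words:

```python
# Table-lookup stemming: every known term maps to its stored stem.
# The table below is a toy dictionary built from the example above.
STEM_TABLE = {
    "connection": "connect", "connections": "connect",
    "connective": "connect", "connected": "connect",
    "connecting": "connect", "connect": "connect",
    "engineering": "engineer", "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    # Unknown terms fall through unchanged (an assumption of this sketch).
    return STEM_TABLE.get(term.lower(), term.lower())
```

A hash table gives constant-time lookups, but the table itself is the storage overhead noted on this slide.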
Cont…
 Advantage of this method
  The stemming results are generally correct
 Disadvantages of the method
  The table becomes large
   when it takes into account terms in the standard language and possibly terms in the specialized subject domain of the text corpus
  Has storage overhead
   Large tables require large storage space and efficient algorithms (e.g. binary search, hash table)
  Difficult to construct the table
   to capture all possibilities (e.g., new words)
15
Affix Removal Stemmers
 The most commonly used approach, and the one that gives the concept "stemmer" its name (it removes suffixes and/or prefixes from terms, leaving a stem).
 Usually performed by stripping off affixes from each word variant and checking whether any context-sensitive rules apply.
 These affix removal stemmers are language dependent and need manual involvement.
 A simple example could be one that removes the plurals from terms
 Two manners of operation:
  Iterative approach
  Longest match approach
16
Iterative approach

 Involves removal of the shortest possible suffix from the word; the resulting word is then considered for stemming again
 Example
  If the word PEACEFULNESS is considered, NESS will be removed in the first iteration, and FUL will then be removed to leave PEACE as the final stem
  If the word ALARMINGLY is considered, LY will be removed in the first iteration, and ING will then be removed to leave ALARM as the final stem

17
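The iterative behaviour above can be sketched as follows. The suffix list, its checking order, and the minimum stem length are illustrative assumptions, not a published algorithm:

```python
# Iterative stemming sketch: repeatedly strip one matching suffix and
# re-examine the result until nothing more can be removed.
SUFFIX_ORDER = ["ness", "ful", "ing", "ly", "s"]

def iterative_stem(word, min_stem=3):
    word = word.lower()
    while True:
        for suffix in SUFFIX_ORDER:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[:-len(suffix)]  # strip one suffix, then loop again
                break
        else:
            return word  # no suffix matched: the stem is final
```

With this list, PEACEFULNESS loses NESS, then FUL; ALARMINGLY loses LY, then ING, matching the example.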
Longest Match approach

 Involves removal of the longest possible suffix.
 That is, it removes the longest possible string of characters from a word according to a set of rules.
 With the above examples, FULNESS and INGLY will each be removed by the algorithm in a single step.
 The problem with this type of suffix removal is that the list of affixes is very large compared to the iterative method

18
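A minimal sketch of the longest-match variant, assuming an illustrative compound-suffix list (a real longest-match stemmer such as Lovins uses a far larger list):

```python
# Longest-match stemming sketch: strip the single longest matching
# compound suffix in one pass, with a minimum stem length guard.
COMPOUND_SUFFIXES = ["fulness", "ingly", "ness", "ful", "ing", "ly", "s"]

def longest_match_stem(word, min_stem=3):
    word = word.lower()
    # Try suffixes from longest to shortest and stop at the first match.
    for suffix in sorted(COMPOUND_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word
```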
Inflectional Vs Derivational

 As said before, affix removal addresses variations due to many reasons such as number, tense, and aspect, where tokens or words retain the part of speech of the base form
 Accordingly, affix removal stemmers can be either inflectional or derivational.
 That is, a stemmer either simply removes the suffixes attached to a stem or also checks other features

19
Inflectional Stemming
 Inflectional stemming is just removing inflectional morphemes
 Inflectional morphemes normally reflect differences in:
  number (singular and plural)
  tense (present, past, future)
  aspect (progressive, punctual, perfective, iterative)
 Examples:
  reach, reaches, reached, reaching
  big, bigger, biggest
 Removal of inflectional morphemes usually has little impact upon a word's meaning.
 Thus, it can be safely done (e.g. mapping singular and plural forms of the same word to a single stem)
20
Derivational Stemming
 Here tokens and stems do not have the same part of speech
 Example:
  Introduce, introduction, introductions, introducer, introducing
  Category, categorize, categorization
  Use, Useful, Usable, Unusable
 Removal of such derivational morphemes may change a word's meaning.
 Thus, in addition to removal of the suffix, it needs some transformation of letters

21
Component of a stemmer
 Dictionary of affixes (suffix and prefix dictionary)
  Can be created manually, automatically, or by a combination of both
  Example: ation, ing, ses, ed, s, ly, able, ful, ness, ...
 Search of each word in an input text against this dictionary
  Longest match search
  Iterative
 Context-sensitive/free rules
 Recoding rules
 Stem dictionary (sometimes)
22
Context sensitive vs. context free stemming
 Context free
  Remove the suffixes according to the suffix list (or set of rules).
  That means without considering any context
 Context sensitive
  In addition to the removal of affixes, it considers other characteristics of the word (token).
  Many exception rules need to be handled.

23
Cont…
Example
 Minimum stem conditions are used as a common restriction: invalid stems can be detected by means of a minimum stem length.
  -ATE and -ING would not be removed from CREATE and SING
  RE- would not be removed from 'READ', nor -AL from 'METAL'
 The stem ending must satisfy a certain condition
  E.g. does not end with Q

24
Recoding Rules
 It is an important issue to consider and is sometimes a cure for spelling exceptions.
 Recoding is a context-sensitive transformation of the form AxC -> AyC, where A and C specify the context of the transformation, x is the input string, and y is the transformed string.

25
Cont…
 Example
  If a stem ends in 'i' following 't', then change 'i' to 'y'
   beautiful -> beauti -> beauty
  Eliminating doubled terminal consonants
   forgetting -> forgett -> forget
  Changing 'rpt' into 'rb' to conflate terms like absorb, absorbing & absorption
   absorption -> absorpt -> absorb

26
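The three example rules above can be sketched as a small recoding function. The rule conditions are simplified assumptions that only mirror the slide's examples and would misfire on many real words:

```python
# Recoding sketch: context-sensitive respelling applied to a stem after
# suffix removal. Each rule mirrors one slide example.
def recode(stem):
    if stem.endswith("ti"):                      # beauti -> beauty
        return stem[:-1] + "y"
    if len(stem) >= 2 and stem[-1] == stem[-2]:  # forgett -> forget
        return stem[:-1]
    if stem.endswith("rpt"):                     # absorpt -> absorb
        return stem[:-3] + "rb"
    return stem
```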
Example of stemmed query
 Consider the following natural language (information need) and convert it into a stemmed query representative
  (just follow the basic lexical analysis, stopword removal, and affix removal stemming procedures)
"I am interested in novel matching algorithms for document-retrieval systems. Anything on best-match or nearest-neighbor retrieval, and partial-match searching might also be of interest."
 Stemmed query representative
  algorithm, best, docum, match, near, neighb, novel, part, retriev, search, system
27
Possible Problems with Affix Removal Stemmers
 There are some major reasons stemming can be incorrect
  Overstemming (removing too much)
  Understemming (removing too little)
 It sometimes may miss good conflations.
  For example, the terms matrices and matrix may not be conflated together
 Level of manual involvement
  Such stemmers are often difficult to understand and modify
28
Over stemming
 Refers to the removal of too much of a word, which may produce stems that are not words and are difficult for a user to interpret.
  With Porter, ITERATION produces ITER and GENERAL produces GENER
 This causes the meaning of the stems to be diluted,
  affecting the precision of IR
 In addition, it causes unrelated terms to be conflated to the same stem:
  words that should not be grouped together by stemming, but are.
 Example:
  stripping -able from Table, and -ing from Sing
29
Cont…
 The effect of overstemming on IR performance is
  the retrieval of non-relevant documents, which affects the precision of the system

30
Under stemming
 Refers to the removal of too little affix from a term.
 It prevents related terms from being conflated together,
  as a result of which a concept may be spread over numerous stems,
  which affects retrieval performance.
 Words that should be grouped by stemming, but are not.
 Example:
  removing only -ness from the term peacefulness (leaving peaceful rather than peace)

31
Cont…
 The effect of understemming on IR performance is
  that some relevant documents will not be retrieved,
  which affects recall

32
Some Common English Stemming
Algorithms
 Dawson
 Lovins
 MARS
 MORPHS
 Porter
 RADCOL
 SMART
 Paice/Husk

33
Features of some common English Language
text Affix removal stemmers (algorithms)
Algorithm | Longest match | Iterative | Suffix dictionary | Stem dictionary | Context sensitive | Recoding
Dawson X X X X

Lovins X X X X X

MARS X X X X X

MORPHS X X X X X

Porter X X X X

RADCOL X X X X

SMART X X X X X
34
Porter Stemming Algorithm
By Martin Porter (1980)
Try the following to get
Original paper by Martin Porter
Source code for porter stemmer using C, Java, etc.
http://www.tartarus.org/%7Emartin/PorterStemmer/

Demo: try the following URLs for Porter online

http://www.utilitymill.com/utility/Porter_Stemming_Algorithm

http://snowball.tartarus.org/demo.php
35
Porter Stemming Algorithm
 Was developed by Martin Porter for reducing English words to their stems
 An effective, simple, and popular English stemmer
 Removes the common morphological and inflectional endings from words in English
 Involves multiple passes that successively remove short suffixes, rather than removing the longest suffix in a single step
 The algorithm is very careful not to remove a suffix when the stem is too short, thus helping to improve the performance of the resultant stemmer
36
Cont…
 Removes around 60 different suffixes
 Is quite aggressive when creating stems and does overstem
  But it still performs well in precision/recall
  Results suggest it is at least as good as other stemming options
 Its main use is as part of the term normalization process that is usually done when setting up IR systems

37
Cont…
 There has also been similar interest in stemming algorithms for other languages, including
  French (Savoy, 1993)
  Turkish (Solak and Oflazer, 1993)
  Amharic (Alemayehu, 1999)
 Think of it for other languages

38
Effectiveness of Stemming
 Manual evaluation suggests that about 95% of words are processed to give reasonable stems (but there is a possibility of under- and overstemming)
 It is also seen that there is a significant reduction in dictionary size.
 Experiments suggest that stemming is never less effective than the use of unstemmed words, but
  it will give better results in languages that are morphologically more complex (rich).

39
Criteria for judging stemmers
 Correctness: judged by considering the number of errors (mistakes) in relation to over-stemmed and under-stemmed words.
 Effectiveness of stemming: considering the recall and precision of the retrieval process, in addition to speed.
 Compression performance: to what extent it reduces the size of the index terms

40
Advantages of affix removal stemmers
 Reduce the total number of distinct terms (dictionary size) significantly
 Increase retrieval performance by bringing similar words, having similar meanings, to a common form

Disadvantages of affix removal stemmers
 The possibility of overstemming and understemming
 Difficult to understand and modify (need a better understanding of the language at hand)

41
Putting it together: Exercises

 Consider the following document and come up with the best possible index terms to be used in a given indexing system.
 Follow the basic steps
  Lexical analysis, stopword removal, identifying only nouns, stemming, and alphabetizing (show the results of each step, both by typing and by counting words and characters)
 Indicate your assumptions in every step.

42
Cont…
 “The purpose of the course is to teach theory and
practice underlying the construction of the web based
information systems. As such, the course will devote
equal time to information retrieval and software
engineering topics. The theory will be put into
practice through a semester long team programming
project”

43
Lexical analysis

the purpose of the course is to teach theory and practice underlying the construction of web based information systems as such the course will devote equal time to information retrieval and software engineering topics the theory will be put into practice through a semester long team programming project

48 words, 307 characters


44
Stop Word Removal

purpose course teach theory practice underlying construction web based information course devote equal time information retrieval software engineering topics theory practice semester long team programming project

26 words, 213 chars


45
Only nouns

purpose course theory practice construction web information course equal time information retrieval software engineering topics theory practice semester team programming project

21 words, 179 chars


46
Stemming & Alphabetizing

construct course course engineer equal informat informat practice practice program project purpose retrieve semester software team theory theory time topic web

21 words, 161 chars


47
Indexing- file structure
 Terms remaining after document processing must be
stored to facilitate retrieval.

 Typically, they are stored in an inverted index. More on that later...

Docs -> accents/spacing -> stopwords -> noun groups -> stemming -> automatic or manual indexing -> index structure
48
Letter Successor variety (LSV) Approach
to stemming
 It is a stemming method based on the use of letters and a body of text
 It attempts to determine word and morpheme boundaries
 The successor variety (SV) of a string is the number of different characters that follow it (i.e. the string) in words in some body of text.
 The technique is simply
  to segment the term into two parts, stem and suffix, considering the successor variety of a string

49
Cont…
 Consider a body of text consisting of the following
words, for example.
able, axle, accident, ape, about.
 To determine the successor varieties for "apple," for
example, the following process would be used.
 The first letter of apple is "a." "a" is followed in the text
body by four characters: "b," "x," "c," and "p."
 Thus, the successor variety of "a" is four.
 The next successor variety for apple would be one,
since only "e" follows "ap" in the text body, and so on.

50
Basic steps include
 For a given term at hand
  Determine successor varieties (distinct characters following a given string)
  Segment the term into two segments (parts)
  Decide which one of the segments is the stem.

51
Example:
The test word: - READABLE

The Corpus (related words): ABLE, APE, BEATABLE, FIXABLE, READ,


READABLE, READING, READS, RED, ROPE, RIPE.

Task: Determining the stem of the word READABLE

Prefix | Successor variety | Letters
R | 3 | E, I, O
RE | 2 | A, D
REA | 1 | D
READ | 3 | A, I, S
READA | 1 | B
READAB | 1 | L
READABL | 1 | E
READABLE | 0 | (blank)
52
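The successor-variety counts in the table above can be reproduced with a short function over the given corpus:

```python
# Successor variety: the number of distinct characters that follow a
# prefix across the corpus. Corpus and test word come from the example.
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ",
          "READABLE", "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus=CORPUS):
    successors = {w[len(prefix)] for w in corpus
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors)
```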
Cont…
 For a given test word, we determine, against a dictionary (corpus), how many successors exist for each possible stem of the word.
 The successor variety number decreases as the string gets longer, until a segment boundary is reached (in most cases).
 These boundaries indicate the stems
 There are some basic suggestions for doing the segmentation
  Cut off method
  Peak and plateau method
  Complete word method
53
Cut off method
 The first, easy method
  separate (break) if the character has a successor variety greater than some cutoff value
  That means a boundary is identified whenever the cutoff value is reached.
 The problem with this type of segmentation is:
  how to select the cutoff value (incorrect cuts will be made and correct cuts will be missed)
  For example, in the above case, if we set the cutoff value as > 2, both R and READ will be considered stems.
54
Peak and Plateau method
 A segment break (segmentation) is made after a character whose successor variety exceeds that of the character immediately preceding it and the character immediately following it
 Thus it has the advantage of removing the need for a cutoff value to be selected.
 Example
  Segmenting (breaking) at the character whose successor variety is greater than both its preceding and following characters results in breaking READABLE after D, which gives us READ as a stem and -ABLE as a suffix
55
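The peak-and-plateau rule can be sketched as follows, reusing the successor-variety idea over the example corpus (both are illustrative assumptions):

```python
# Peak-and-plateau segmentation sketch: compute the successor variety of
# every prefix, then break after the first position whose variety exceeds
# both of its neighbours.
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ",
          "READABLE", "READING", "READS", "RED", "ROPE", "RIPE"]

def sv(prefix):
    return len({w[len(prefix)] for w in CORPUS
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_plateau_segment(word):
    varieties = [sv(word[:i + 1]) for i in range(len(word))]
    for i in range(1, len(word) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            return word[:i + 1], word[i + 1:]
    return word, ""  # no peak found: leave the word unsegmented
```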
Complete Word Method
 A break is made after a segment if the segment is a complete word in the corpus
 For example, in the above case, READ is in the dictionary (corpus) as a complete word
56
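A minimal sketch of the complete-word rule, again using the example corpus; breaking at the first prefix that is itself a corpus word is an assumption of this sketch:

```python
# Complete-word segmentation sketch: break after the first prefix that is
# itself a complete word in the corpus.
CORPUS = {"ABLE", "APE", "BEATABLE", "FIXABLE", "READ",
          "READABLE", "READING", "READS", "RED", "ROPE", "RIPE"}

def complete_word_segment(word, corpus=CORPUS):
    for i in range(1, len(word)):
        if word[:i] in corpus:
            return word[:i], word[i:]
    return word, ""  # no prefix is a complete corpus word
```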
Final decision - Remark
 Based on a series of experiments and observations, the following rule (algorithm) is usually used

If (first segment occurs in <= 12 words in corpus)
Then
    first segment is the stem
Else
    second segment is the stem
End if

 In other words, if the first segment occurs in more than 12 words in the corpus, it is probably an affix
57
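The stem-selection rule above can be sketched directly; the threshold of 12 comes from the slide, while the substring-containment test is an assumption of this sketch:

```python
# Stem selection: the first segment is the stem only if it occurs in at
# most 12 corpus words; otherwise it is probably an affix and the second
# segment is chosen instead.
def choose_stem(first, second, corpus):
    count = sum(1 for w in corpus if first in w)
    return first if count <= 12 else second
```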
Exercise
 Consider the following corpus and give the successor
variety of each prefix (d, do, dox, … etc) of the word “d
o x p u v e d ” and then segment the word using peak
and plateau method.
 Suggest the possible stem from the two segments with a
reason.

d o x p u v i n g, d o x p u v e r, d o x p u v e d,
d o x x o n, d o x i n g, d o x p l e x e n v

58
Solution

59
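One way to work through the exercise is to reuse the successor-variety and peak-and-plateau logic from the previous slides over the exercise corpus (the helper code repeats the earlier illustrative sketch):

```python
# Successor varieties and peak-and-plateau segmentation for the exercise
# word "doxpuved" over the exercise corpus.
CORPUS = ["doxpuving", "doxpuver", "doxpuved",
          "doxxon", "doxing", "doxplexenv"]

def sv(prefix):
    return len({w[len(prefix)] for w in CORPUS
                if w.startswith(prefix) and len(w) > len(prefix)})

def peak_plateau_segment(word):
    varieties = [sv(word[:i + 1]) for i in range(len(word))]
    for i in range(1, len(word) - 1):
        if varieties[i] > varieties[i - 1] and varieties[i] > varieties[i + 1]:
            return word[:i + 1], word[i + 1:]
    return word, ""
```

Running this gives the prefix varieties 1, 1, 3, 2, 1, 2, 2, 0 for d, do, dox, ..., doxpuved, so the only peak is after "dox", segmenting the word as dox + puved; "dox" occurs in all six corpus words (at most 12), so it would be chosen as the stem.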
Cont…
 In summary, the successor variety stemming process has three parts:
  determine the successor varieties for a word,
  use this information to segment the word using one of the methods above, and
  select one of the segments as the stem.
 It is pointed out that while affix removal stemmers work well, they require human preparation of suffix lists and removal rules
60
A string similarity approach
to Term conflation
N-gram Approach

61
Test
 What are the three methods of segmentation in LSV?
 Mention the two approaches to affix removal stemming.

62
N-gram Approach to Term conflation
 It is confusing to call this a stemming method, since no stem is produced.
 It is basically a string-similarity approach to conflation:
  an approach that tries to find word similarity by comparing the string structure of words.
 It is a technique that creates
  sets of strings of n characters from each word and
  then compares the elements of the two sets to find similarity.
 Definition (working definition)
  An n-gram is a set of n consecutive characters (letters) extracted from a word
63
Cont…
 It is a representation of a word by its (set of) consecutive n-grams (n-sized groups of letters)
  to determine similarity to other words
  (n could be 1, 2, 3, 4, ...)
 Typical values for n are 2 or 3, corresponding to the use of digrams or trigrams respectively
 Assumptions
  Word variants are expected to have similar sets of n-grams, or
  similar words will have a high proportion of n-grams in common
64
N-gram
 The N-gram of a string is any substring of some fixed length
 N-grams are N-sized substrings of a specific string (e.g., TUMBA!)
 Unigrams
 -, T, U, M, B, A, !, -
 Bigrams (digrams)
 -T, TU, UM, MB, BA, A!, !-
 Trigrams
 -TU, TUM, UMB, MBA, BA!, A!-, !--
 Quadgrams
 -TUM, TUMB, UMBA, MBA!, BA!-, A!--,!---
 Quintgrams
 -TUMB, TUMBA, UMBA!, MBA!-, BA!--, A!---, !----

65
N-gram
 The approach assumes no prior linguistic knowledge about the text being processed, and is thus relatively immune to spelling problems
 Furthermore, no language-specific information is used in the n-gram approach, which qualifies this method as a language-independent one
 N-gram stemmers conflate terms based on the number of n-grams that are shared by the terms, and are language independent

66
N-gram
 Example: Consider the word BILGISAYAR (computer)
 This results in the generation of the digrams
*B, BI, IL, LG, GI, IS, SA, AY, YA, AR, R*
And the trigrams
**B, *BI, BIL, ILG, LGI, GIS, ISA, SAY, AYA, YAR, AR*, R**
 Where '*' denotes a padding space
 There are n + 1 such digrams and n + 2 such trigrams in a
word containing n characters

67
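The padded n-gram generation above can be sketched as a one-liner; with n - 1 padding characters on each side, a word of k characters yields k + n - 1 n-grams (k + 1 digrams, k + 2 trigrams, as stated above):

```python
# Padded n-gram extraction, following the '*'-padding convention of the
# BILGISAYAR example.
def ngrams(word, n):
    padded = "*" * (n - 1) + word + "*" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```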
Procedures to compute similarity between two
words
 Create sets of strings of n characters from each word
  Terms are represented using n-grams
 Identify the set of unique n-grams (e.g., bigrams) for each word
 Identify the common n-sized character strings that they share
 A similarity measure based on them is computed
  Computing the similarity of two terms using Dice's or the overlap coefficient
 Terms that are strongly related by their number of shared n-grams are considered similar and hence conflated or clustered into groups of related words
68
N-gram
 Dice and overlap coefficients
  Let A be the number of unique digrams in the first word
  Let B be the number of unique digrams in the second word
  Let C be the number of unique digrams shared by A and B (the number of common digrams)
  The Dice similarity coefficient is given by

    sim(wi, wj) = 2C / (A + B)

  An overlap coefficient could also be used to measure the similarity; it is given by

    sim(wi, wj) = C / min{A, B}
69
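The two coefficients above can be sketched directly over sets of unique digrams; padding is omitted here, as in the "statistics"/"statistical" example below:

```python
# Dice and overlap coefficients over unique (unpadded) digrams.
def digrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(w1, w2):
    a, b = digrams(w1), digrams(w2)
    return 2 * len(a & b) / (len(a) + len(b))

def overlap(w1, w2):
    a, b = digrams(w1), digrams(w2)
    return len(a & b) / min(len(a), len(b))
```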
N-gram
 Example 1(n - gram matching)
 Consider the two words “AGRICULTURE” and “AGRICULTURAL”
 If we consider n to be 2, the two words can be broken into
digrams as follows
AGRICULTURE
{_A AG GR RI IC CU UL LT TU UR RE E_}
AGRICULTURAL
{_A AG GR RI IC CU UL LT TU UR RA AL L_}
 AGRICULTURE has 12 digrams, of which all are distinct
 AGRICULTURAL has 13 digrams, of which all are distinct

70
N-gram
 Example 1 (cont'd)
  The two sets have quite a significant number of identical elements
  Once the unique digrams for the word pair have been identified and counted, a similarity measure based on them is computed
  The similarity measure used to quantify the degree of similarity is Dice's coefficient:

    sim(wi, wj) = 2C / (A + B) = (2 * 10) / (12 + 13) = 20/25 = 0.80

71
N-gram
 Example 2 (n-gram matching)
 Consider the terms “statistics” and “statistical”
 Represent the terms using n-grams
 statistics ==> {st ta at ti is st ti ic cs}
 statistical ==> {st ta at ti is st ti ic ca al}
 Unique digrams
 statistics ==> {AT CS IC IS ST TA TI}
 statistical ==>{AL AT CA IC IS ST TA TI}

72
N-gram
 Example 2 (cont'd)
 Common unique digrams
   {AT, IC, IS, ST, TA, TI}
 Similarity of the two terms is measured by Dice's coefficient:

    S(wi, wj) = 2C / (A + B) = (2 * 6) / (7 + 8) = 12/15 = 0.80

 where A and B are the numbers of unique bigrams in the two terms, respectively, and C is the number of unique bigrams shared by the two terms.
73
Exercise

 Compare (identify the similarity of) the following three terms and determine whether they can be conflated together:
  photography, photographic, phonetic

74
End of Chapter 2

75
