
UNIT II – INGEST AND INDEXING

• Introduction
• Item Receipt
• Duplicate Detection
• Item Normalization
• Zoning and Creation of Processing Tokens
• Stemming
• Entity Processing
• Categorization
• Citational Metadata
• Manual Indexing Process
• Automatic Indexing of Text and Multimedia
Introduction to Ingest
• Ingest is the initial process in an information retrieval system.
• It is the process that receives the items to be stored and indexed in the
system and performs the initial processing of them.
• The ingest process can be broken down into a number of subprocesses.
• Each subprocess can add value to the final index and improve the
capability of the system to get better results.
• An ingest subprocess can be “pull” or “catch”
– A pull process has the system going out to other locations and
retrieving items from those locations to ingest (e.g., web/local
network crawling or RSS feeds).
– Catching is where other systems deliver the items by sending them
to the ingest process.
• One internal system writes files into an ingest directory or uses
the “web services” application interface of the ingest process to
deliver new items.
• Once the item is received it goes through:
– A normalization process for processing purposes.
– The original item is usually kept as part of the document repository to
ensure no information is lost in the normalization process.
– One of the first checks in many systems is to validate whether the item has
already been ingested into the system.
– Duplicate detection can save a lot of redundant processing.
– Generation of processing tokens.
– Entity identification - some normalization can be done on the processing
tokens to map them into a canonical value.
– The result of these steps is the data that is ready to be passed to the
Indexing process.
– One final process called Categorization can also be applied to expand
the processing tokens for an item.
Item Receipt

• Item receipt is the process associated with first getting items into the
system.
Methods used:
• Pulling
– Crawling internet/local area network
• Depth First
• Breadth First
– Getting items from an RSS feed
• Catch
– This process is usually associated with internal systems.
– Within the internal generation of new information there is a step where new items are
written to a queue that is monitored by the indexing process
• Crawling network
• Seed list
– Starts with a list of addresses that can be used as a starting point for the
crawling
– Defines the maximum subset of the network from which items can be part
of the information system.
– Each item (e.g., web page) pointed to by the seed list is retrieved from its
web site to be indexed.
• URLs (Uniform Resource Locators)
– When a page is retrieved to be indexed, all the links (URLs) on that page
that link to other pages on that site or a different site are extracted and
added to the list of crawlable locations.
– The data structure used to store the new URLs to be crawled is called the URL
Frontier.
– As pages are crawled from the URL Frontier they are marked as crawled
along with the date/time of the crawl.
• The list of all items to retrieve is continually growing.
• The term “black web” is used to describe that subset of the Internet that is not
retrieved and indexed.
• Depth First
– Depth first says you select pages from the URL Frontier following
pages to the same site to retrieve as much of the site as you can.
– Has the possible negative aspect of hitting the same site often in a
limited interval, which could cause the site administrator to lock your
crawler out.
– The mitigation is called “politeness” of crawling: the general rule is to wait at
least two seconds before the system retrieves another page from the same site.
– Software that takes the URLs from the Seed list or URL Frontier and
retrieves the linked pages is called a web crawler, spider or robot.
– The crawler software should be scalable to the volume of pages being retrieved.
• Breadth First
– Breadth first says you significantly limit the number of pages retrieved
from a site, moving on to other sites that you have not yet retrieved.
– Gives you a better sample across many sites versus greater depth on a
smaller number of sites (a minimal crawler sketch follows this list).
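A minimal breadth-first crawler sketch (illustrative only, not a production implementation): it keeps the URL Frontier as a FIFO queue, records crawled URLs with a date/time, enforces the two-second politeness delay per site, and adds extracted links back to the frontier. The seed list, page limit, and timeout values are assumed parameters.

import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags on a retrieved page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_list, max_pages=50, politeness_seconds=2.0):
    frontier = deque(seed_list)      # URL Frontier (FIFO gives breadth-first order)
    crawled = {}                     # url -> date/time of crawl
    last_visit = {}                  # host -> time of last retrieval (politeness)
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        if url in crawled:
            continue
        host = urlparse(url).netloc
        wait = politeness_seconds - (time.time() - last_visit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)         # at least two seconds between hits on a site
        try:
            with urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # skip unreachable or unreadable pages
        last_visit[host] = time.time()
        crawled[url] = time.strftime("%Y-%m-%d %H:%M:%S")
        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links: # extracted URLs go back into the frontier
            frontier.append(urljoin(url, link))
    return crawled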
• Challenges of crawling
• Design and optimization of crawlers and URL frontier
• Crawler cannot retrieve from the same web site too frequently or it will
alert a system administrator who could block that crawler from future
access to its site
• Website updates need to be tracked and re-crawled frequently
• Amount of data that is kept in databases at web sites
– Static web pages - preconstructed
– Dynamic web pages - generated from databases (the “hidden web”)
• The first decision that affects an information retrieval system’s performance
in terms of precision and recall is what data will be ingested and indexed.
• Subscribing to RSS Feed
• RSS stands for either Really Simple Syndication or Rich Site Summary and
includes a specification of how a web site can publish information to a
user.
• Delivers documents in XML format.
• Used to publish frequently updated works such as blog entries, news
headlines, audio, and video in a standardized format.
• RSS document includes the item plus metadata such as dates and
authorship.
• RSS feeds are read using software called an RSS reader or an aggregator.
• Once subscribed to a feed, an aggregator is able to check for new content at
user-determined intervals and retrieve the update.
• Example of XML
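A minimal illustration of the XML delivered by an RSS feed and of how an aggregator might read it. The feed content below is a made-up sample (not a real feed), and the parsing uses only the Python standard library.

import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example News Feed</title>
    <link>http://www.example.com/</link>
    <description>Frequently updated headlines</description>
    <item>
      <title>Sample headline</title>
      <link>http://www.example.com/story1</link>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
      <author>editor@example.com</author>
      <description>Body of the news item to be ingested and indexed.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
for item in root.iter("item"):
    # Each RSS item carries the content plus metadata such as dates and authorship.
    print(item.findtext("title"), "|", item.findtext("pubDate"))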
Duplicate Detection
• The first process that can make search and retrieval more effective is to
eliminate duplicate information.
• The duplicates cause a lot of wasted system overhead in indexing the same
item.
• The standard approach for duplicate detection is to create a signature (a
unique key) that represents the contents of an item.
• The most common methodology is to create a hash for the complete file
(e.g., Message Digest Algorithm 2 (MD2) or MD5), as sketched below.
• The problem with this approach is that it is a hash over every character in the
item.
• If there are just a few characters that differ (e.g., a copy is made of an article
but a system assigns its own unique ID to the article or places the
current date on it), you will get two different hash values and not
detect it as a duplicate.
• This problem can be addressed by using heuristic algorithms.
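A minimal sketch of whole-file signature duplicate detection with an MD5 hash (hashlib is in the Python standard library). Because the hash covers every byte, any one-character difference yields a different signature, which is exactly the limitation noted above.

import hashlib

def file_signature(path):
    """Return an MD5 hex digest computed over every byte of the file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_duplicate(path, seen_signatures):
    """True if this exact content has already been ingested; otherwise record it."""
    signature = file_signature(path)
    if signature in seen_signatures:
        return True
    seen_signatures.add(signature)
    return False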
• Near Duplicate Detection
– Automated algorithms in the ingest system detect near duplicates.
• There are two kinds of near duplicates to detect:
– One is where the text in the two items is almost identical.
– The second is where the semantics of both items is identical but the
text used to express the semantics is different.
Resemblance between two documents A and B is defined as:

    R(A, B) = |F(A) ∩ F(B)| / |F(A) ∪ F(B)|

where F(X) is the set of features extracted from document X.
• The formula specifies that the resemblance is measured as the size of the
intersection of the features divided by the size of the union of the features.
• E.g., if two documents are each 50 words in length and 20 words are
common to both documents, the resemblance would be
20/(50 + 50 − 20) = 20/70.
• The “resemblance” formula is similar to the Jaccard similarity formula (a
small sketch follows this list).
• The problems encountered when trying to use this approach have been
– the determination of which features to use, and
– whether the algorithms are scalable to large collections of items.
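A minimal sketch of the resemblance (Jaccard) computation using a simple "bag of words" feature set. Which features to use in practice is the open question noted above; lower-cased words are just an assumption for the sketch.

def resemblance(text_a, text_b):
    """Resemblance = |intersection of features| / |union of features|."""
    features_a = set(text_a.lower().split())
    features_b = set(text_b.lower().split())
    if not features_a and not features_b:
        return 1.0                      # two empty items are trivially identical
    common = features_a & features_b    # intersection of the features
    union = features_a | features_b     # union of the features
    return len(common) / len(union)

# Items whose resemblance exceeds a chosen threshold (e.g., 0.9) could be
# treated as near duplicates.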
• Other methods
• “bag of words”
• “shingles”
– break the items up into shorter segments and create signatures for each
of the segments.
– shorter segments are called “shingles”.
– The issue with this approach arises if the segments do not start on the
same boundary.
– If one item is off by one word from the other, then all the segments will
be off by one word.
– The only way this could work is if a logical semantic boundary is
defined for the segments.
– This requires more sophisticated parsing to accurately define the segments,
and the process would vary by language.
• An alternative is to define each segment as a fixed number of words but
overlap the segments to ensure the majority of them are in both items.
• This is similar to the N-Gram process
• Instead of using a fixed number of characters for the “n”, a fixed number of
words will be used.
• If “n” is selected as 4 words, then the process would be the following:
– the first four words of the item would be the first signature,
– words 2–5 would be the second signature,
– words 3–6 would be the third signature, and so on.
• There will be nearly as many signatures created as there are words in the item.
• This process is called “sliding window shingling”.
• To compare items efficiently, only the lowest few (e.g., 4) signature values from
each item are kept and compared (as sketched at the end of this section).
• Item 1 signatures: 12, 18, 22 and 24; for
• Item 2 they would be 12, 24, 33, and 55.
• Notice also that “12” comes from close to the end of the phrase (w7 w4 w5) and
• “24” comes from the start (w3 w1 w7), which shows the effectively random
selection produced by hashing.
• Broder’s formula for resemblance has the number of signatures in
common in the numerator and the total number of unique signatures in the
denominator:

    resemblance(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

where S(X) is the set of shingle signatures kept for item X.
• The biggest issue with this technique is that it gets poorer results for
shorter items that do not have many shingles to combine into bigger
shingles (a sliding-window shingling sketch follows).
• To avoid the problem that comes from defining word segments, another
approach is to use a similarity measure between documents to decide whether
they are near duplicates.
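A minimal sketch of sliding window shingling combined with the "keep only the lowest signatures" comparison described above. The window size n, the number of retained signatures k, and the use of MD5 to produce signature values are illustrative assumptions.

import hashlib

def lowest_signatures(text, n=4, k=4):
    """Hash every n-word sliding window and keep the k lowest signature values."""
    words = text.lower().split()
    signatures = set()
    for i in range(max(len(words) - n + 1, 1)):
        shingle = " ".join(words[i:i + n])
        digest = hashlib.md5(shingle.encode("utf-8")).hexdigest()
        signatures.add(int(digest[:8], 16))   # a stable, effectively random value
    return set(sorted(signatures)[:k])

def shingle_resemblance(text_a, text_b, n=4, k=4):
    sigs_a = lowest_signatures(text_a, n, k)
    sigs_b = lowest_signatures(text_b, n, k)
    # Broder's resemblance: signatures in common over total unique signatures.
    return len(sigs_a & sigs_b) / len(sigs_a | sigs_b)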
Item Normalization
• As you retrieve items from multiple sources, it’s possible that the encoding formats
from different sources can vary.
• The first step in processing the items is to normalize the format to a standard
format
• All the items will have a standard encoding after normalization.
• The first step in processing a textual item is to detect the language(s) that the item is
in.
• Language should be known to perform the follow-on processing (e.g.,
morphological rules, stemming rules, dictionary look-up, etc. are all language
dependent)
• Once the language is determined the text can be converted to Unicode
(e.g., UTF-8 or UTF-16) so that all of the different character formats are
normalized to a single encoding, as sketched at the end of this list.
• Next step after standard format representation is character normalization
• For multimedia objects you may also need to normalize to a standard format to
make the development of algorithms more efficient.
• For example you may decide to use MPEG-2 or MPEG-4 as your standard
processing format for video.
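A minimal text normalization sketch: decode the raw bytes from a detected source encoding into Unicode, apply Unicode character normalization, and re-encode to a single standard format (UTF-8 here). The "latin-1" source encoding is only an assumed example; in practice it would come from the language/encoding detection step.

import unicodedata

def normalize_item(raw_bytes, source_encoding="latin-1"):
    """Convert an item to the system's single standard encoding."""
    text = raw_bytes.decode(source_encoding, errors="replace")
    text = unicodedata.normalize("NFC", text)   # character normalization
    return text.encode("utf-8")                 # standard encoding for indexing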
Text Normalization Process
Zoning and Creation of Processing Tokens
• Next step in the process is to zone the document and identify
processing tokens for indexing.
• Zoning
– Parsing the item into logical sub-divisions that have meaning to the user
– Title, Author, Abstract, Main Text, Conclusion, References, Country,
Keyword…
• Zoning used to increase precision of search and optimize display
• The zoning information is passed to the processing token
identification operation to store the zone location information,
allowing searches to be restricted to a specific zone.
• Once the standardization and zoning has been completed, information (i.e., words)
that are used in creating the index to be searched needs to be identified in the item.
• The term “processing token” is used because a “word” is not the most efficient unit
on which to base search structures.
• The first step in identification of a processing token consists of determining a word
• Systems determine words by dividing input symbols into three classes: valid word
symbols, inter-word symbols, and special processing symbols
• Examples of possible inter-word symbols are blanks, periods and semicolons.
• Some languages that are glyph or ideogram based (e.g., CJK—Chinese, Japanese,
and Korean) do not have any interword symbols between the characters.
• An ideogram is a character or symbol representing an idea or a thing without
expressing the pronunciation of a particular word
• Processing tokens for multimedia items also exist
– Image
– Audio
– Video
• The next step in defining processing tokens is identification of any specific
word characteristics
• Morphological analysis of the processing token’s part of speech is included
here. Thus, for a word such as “plane,” the system understands that it could
mean “level or flat” as an adjective, “aircraft or facet” as a noun
• After identifying the potential list of processing tokens, stop words can be
removed by a Stop algorithm.
• The objective of the Stop function is to save system resources by
eliminating from the set of searchable processing tokens those that have
little value to the system or the user (e.g., “the”); a small token/stop sketch
appears at the end of this section.
• Zipf (Zipf-49) postulated that, looking at the frequency of occurrence of
the unique words across a corpus of items, the majority of unique words are
found to occur only a few times. The rank-frequency law of Zipf is:

    frequency × rank ≈ constant

i.e., the frequency of a word multiplied by its rank in the frequency-ordered
vocabulary is approximately a constant.
• Examples of Stop algorithms are:
• Stop all numbers greater than “999,999” (this was selected to allow
dates to be searchable)
• Stop any processing token that has numbers and characters
intermixed
• At this point the textual processing tokens have been identified and
stemming may be applied.
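A minimal sketch of processing-token creation followed by a Stop function. Characters are divided into word symbols and inter-word symbols (blanks, periods, semicolons, etc.), and the example stop rules above are applied: drop common stop words, numbers greater than 999,999, and tokens that intermix digits and letters. The stop-word list is an illustrative assumption.

import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}   # assumed sample list

def processing_tokens(text):
    # Split on runs of inter-word symbols (anything other than a word symbol).
    raw_tokens = re.split(r"[ \t\n.,;:!?]+", text.lower())
    tokens = []
    for token in raw_tokens:
        if not token:
            continue
        if token in STOP_WORDS:
            continue                                    # little search value
        if token.isdigit() and int(token) > 999_999:
            continue                                    # stop very large numbers
        if any(c.isdigit() for c in token) and any(c.isalpha() for c in token):
            continue                                    # numbers and letters intermixed
        tokens.append(token)
    return tokens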
Stemming
• Porter Stemming algorithm
• Dictionary Look-up Stemmers
• Successor Stemmers
• One of the last transformations often applied to data before placing it in the
searchable data structure is stemming.
• Stemming reduces the diversity of representations of a concept (word)
• Stemming has the potential to improve recall.
• Reduces precision if not applied properly
• The most common stemming algorithm removes suffixes and prefixes,
sometimes recursively, to derive the final stem.
• Other techniques such as table lookup and successor stemming provide
alternatives that require additional overheads.
• A related operation is called lemmatization. Lemmatization is typically
accomplished via dictionary look-up
– E.g., it could map “ate” to “eat” or “teeth” to “tooth”.
• Conflation (the process or result of fusing items into one entity; fusion;
amalgamation) is the term used to refer to mapping multiple morphological
variants to a single representation (the stem).
• The stem carries the meaning of the concept associated with the word, and the
affixes (endings) introduce subtle (slight) modifications of the concept.
• Terms with a common stem will usually have similar meanings, for example:
CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS
• Frequently, the performance of an IR system will be improved if term groups such
as this are conflated into a single term. This may be done by removal of the various
suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT.
• In addition, the suffix stripping process will reduce the total number of terms in the
IR system, and hence reduce the size and complexity of the data in the system,
which is always advantageous.
 Major usage of stemming is to improve recall.
 It is important for a system to categorize a word prior to making the
decision to stem.
 Proper names and acronyms (a word formed from the initial letters
of a name, say JNTU) should not have stemming applied.
 Stemming can also cause problems for natural language processing
(NLP) systems by causing loss of information.
Porter stemming algorithm
• Based on a set of conditions on the stem.
• A consonant in a word is a letter other than A, E, I, O or U. Some
important stem conditions are:
1. The measure m of a stem is a function of sequences of vowels (V)
followed by sequences of consonants (C). A stem has the form

    [C] (VC)^m [V]

where the initial C and final V are optional and m is the number of VC repeats.

free – CCVV (m = 0)
frees – CCVVC (m = 1)
prologue – CCVCVCVV (m = 2)
The case m = 0 covers the null word.
2. *<X> - the stem ends with the letter X.
3. *v* - the stem contains a vowel.
4. *d - the stem ends in a double consonant (e.g. -TT, -SS).
5. *o - the stem ends in a consonant-vowel-consonant sequence, where the final
consonant is not W, X or Y (e.g. -WIL, -HOP).
• Suffix conditions take the form: current_suffix == pattern.
• Actions are of the form: old_suffix -> new_suffix.
• Rules are divided into steps to define the order for applying the rules.
Examples of the rules:

Step  Condition      Suffix   Replacement     Example
1a    Null           SSES     SS              stresses -> stress
1b    *v*            ING      Null            making -> mak
1b1   Null           AT       ATE             inflat(ed) -> inflate
1c    *v*            Y        I               happy -> happi
2     m > 0          ALITI    AL              formaliti -> formal
3     m > 0          ICATE    IC              duplicate -> duplic
4     m > 1          ABLE     Null            adjustable -> adjust
5a    m > 1          E        Null            inflate -> inflat
5b    m > 1 and *d   Null     single letter   controll -> control

A small sketch of applying a few of these rules follows.
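A minimal sketch of how a few of the listed rules could be applied. It computes the measure m over the [C](VC)^m[V] form (treating y as a consonant) and applies a handful of the example rules in step order; it is not the full Porter algorithm.

import re

def measure(stem):
    """Porter's measure m: the number of VC repeats in the stem."""
    cv = "".join("v" if ch in "aeiou" else "c" for ch in stem.lower())
    cv = re.sub(r"(.)\1+", r"\1", cv)      # collapse runs, e.g. ccvv -> cv
    return cv.count("vc")

def contains_vowel(stem):
    return any(ch in "aeiou" for ch in stem.lower())

def porter_examples(word):
    """Apply a few of the rules from the table above, in step order."""
    w = word.lower()
    if w.endswith("sses"):                              # 1a: SSES -> SS
        w = w[:-4] + "ss"
    if w.endswith("ing") and contains_vowel(w[:-3]):    # 1b: (*v*) ING -> null
        w = w[:-3]
    if w.endswith("y") and contains_vowel(w[:-1]):      # 1c: (*v*) Y -> I
        w = w[:-1] + "i"
    if w.endswith("aliti") and measure(w[:-5]) > 0:     # 2: (m>0) ALITI -> AL
        w = w[:-5] + "al"
    if w.endswith("able") and measure(w[:-4]) > 1:      # 4: (m>1) ABLE -> null
        w = w[:-4]
    return w

# e.g. porter_examples("stresses") -> "stress",
#      porter_examples("adjustable") -> "adjust",
#      porter_examples("formaliti") -> "formal"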
Dictionary look-up stemmers
 Use of dictionary look-up.
 The original term or a stemmed version of the term is looked up in a
dictionary and replaced by the stem that best represents it.
 This technique has been implemented in the INQUERY and other retrieval
systems.
 The INQUERY system uses a technique called Kstem.
 Kstem is a morphological analyzer that conflates word variants to
a root form (a small sketch follows this list).
 It requires a word to be in the dictionary.
 Kstem uses 6 major data files to control and limit the stemming
process:
1. Dictionary of words (lexicon)
2. Supplemental list of words for dictionary
3. Exception list of words that should retain an ‘e’ at the end (e.g.,
“suites” maps to “suite” but “suited” to “suit”).
4. Direct_conflation - word pairs that override the stemming algorithm.
5. Country_nationality_conflation (e.g., British maps to Britain).
6. Proper nouns -- that should not be stemmed

 New words that are not special forms (e.g., dates, phone numbers) are
located in the dictionary to determine simpler forms by stripping off
suffixes and respelling plurals as defined in the dictionary.
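A minimal sketch of a dictionary look-up stemmer in the spirit of Kstem (not the real Kstem data files or algorithm): a word is only conflated when the candidate root is itself in the dictionary, a direct-conflation table overrides the rules, and proper nouns are never stemmed. All the word lists are illustrative assumptions.

DICTIONARY = {"suite", "suit", "connect", "memory", "britain"}
DIRECT_CONFLATION = {"memorial": "memory"}      # word pairs that override the rules
PROPER_NOUNS = {"Bush", "JNTU"}                 # never stemmed

def dictionary_stem(word):
    if word in PROPER_NOUNS:
        return word
    lower = word.lower()
    if lower in DIRECT_CONFLATION:
        return DIRECT_CONFLATION[lower]
    if lower in DICTIONARY:
        return lower
    # Strip simple suffixes only when the resulting root is in the dictionary.
    for suffix in ("s", "es", "ed", "ing"):
        if lower.endswith(suffix):
            candidate = lower[: -len(suffix)]
            if candidate in DICTIONARY:
                return candidate            # e.g. "suites" -> "suite", "suited" -> "suit"
            if candidate + "e" in DICTIONARY:
                return candidate + "e"      # restore a trailing 'e' dictionary form
    return lower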
Successor stemmers:
Based on the length of prefixes.
The smallest unit of speech that distinguishes one word from another.
The process uses the successor varieties for a word.
It uses this information to divide a word into segments and selects one of the
segments as the stem.
(Figure: symbol tree for the terms “bag” and “barn”)
(Figure: symbol tree for the terms bag, barn, bring, box, bottle, both)
Successor varieties of words are used to segment a word by applying one of the
following four methods:
1. Cutoff method: a cutoff value is selected to define the stem length.
2. Peak and plateau: a segment break is made after a character whose
successor variety exceeds that of the characters immediately preceding and
following it.
3. Complete word method: breaks are made on the boundaries of complete words.
4. Entropy method: uses the distribution of successor variety letters.
1. Let |D_ak| be the number of words beginning with the k-length sequence of
letters a.
2. Let |D_akj| be the number of words in D_ak whose successor letter is j.
3. The probability that a member of D_ak has the successor j is given as
|D_akj| / |D_ak|.
After a word has been segmented, the segment to be used as the stem must
be selected (a sketch follows).
Hafer and Weiss selected the following rule:
    if (the first segment occurs in <= 12 words in the database)
        the first segment is the stem
    else
        the second segment is the stem
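A minimal sketch of successor-variety stemming against a small corpus: compute successor varieties, segment a word with the complete word method, and apply the Hafer and Weiss selection rule quoted above. The corpus is an illustrative assumption.

CORPUS = {"bag", "barn", "bring", "box", "bottle", "both", "boxes", "bagging"}

def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` among the corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

def segment_complete_word(word, corpus):
    """Complete word method: break after any prefix that is itself a corpus word."""
    segments, start = [], 0
    for i in range(1, len(word)):
        if word[:i] in corpus:
            segments.append(word[start:i])
            start = i
    segments.append(word[start:])
    return [s for s in segments if s]

def select_stem(word, corpus=CORPUS, max_words=12):
    """Hafer and Weiss rule: first segment unless it occurs in > 12 corpus words."""
    segments = segment_complete_word(word, corpus)
    if len(segments) == 1:
        return segments[0]
    occurrences = sum(1 for w in corpus if segments[0] in w)
    return segments[0] if occurrences <= max_words else segments[1]

# e.g. successor_variety("b", CORPUS) == 3 (the letters a, r, o can follow "b");
#      segment_complete_word("boxes", CORPUS) == ["box", "es"];
#      select_stem("boxes") == "box"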

• Frakes Summary (Ref. Frakes-92)
• Stemming can affect retrieval (recall), and where effects were identified they
were positive.
• There is little difference between the retrieval effectiveness of different full
stemmers, with the exception of the Hafer and Weiss stemmer.
• Stemming is as effective as manual conflation.
• Stemming is dependent upon the nature of the vocabulary.
Entity Processing
• Entity Identification
• Entity Normalization
• Entity Resolution
• Information Extraction

• A core problem in getting good search results is the mismatch between the
vocabulary of the author and the vocabulary of the user.
• There is an issue with the many variants by which an entity can be specified.
• There are many different ways a person’s name may be expressed, and
transliteration issues are also possible.
• For example, the following are just a subset of the ways of expressing the
Libyan leader’s name: Muammar el Qaddafi, Ghaddafi, Kaddafi,
Muammar al-Gathafi, Col. Mu’ammar al-Qadhafi.
• Yet they all refer to the same person (entity).
Entity Identification

• Entity identification can be broken into 3 processes:
– The first process is identifying whether a word or words belong to an
entity class.
– The second is entity normalization.
– The third is entity resolution.
• The first process is identifying if a word or words belongs to an entity
class.
• For example you can have entity classes such as people, places,
organizations, telephone numbers, Internet URLs, etc.,
• This process would associate the processing token with one or more of
those classes.
• This in a sense “tags” some of the processing tokens with additional
metadata to help define its semantics.
• E.g., consider the term “bush”
– Can represent a plant
– Can represent a person’s name
• The technical approaches to entity identification and normalization
fall into three major classes (a rule-based sketch follows this list):
– Matching/enumeration approach
– Rule-based approach
– Linguistic approach
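A minimal sketch of the rule-based and enumeration approaches to entity identification: regular-expression rules tag tokens that belong to entity classes such as Internet URLs and telephone numbers, and an enumerated list handles a few person names. The patterns and the name list are illustrative assumptions.

import re

ENTITY_RULES = {
    "URL": re.compile(r"https?://\S+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}
PERSON_NAMES = {"George Bush", "Muammar el Qaddafi"}   # enumeration method

def identify_entities(text):
    """Return (text span, entity class) pairs found in the item."""
    entities = []
    for entity_class, pattern in ENTITY_RULES.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), entity_class))
    for name in PERSON_NAMES:
        if name in text:
            entities.append((name, "PERSON"))
    return entities

# identify_entities("Call 555-123-4567 or see http://example.com")
#   -> [("http://example.com", "URL"), ("555-123-4567", "PHONE")]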

• A special class of entity identification is for geographic references.
Entity Normalization
• Different variants of the same entity instance are mapped to a common name.
• For example George Bush, President Bush, George W. Bush, and Bush might
all be mapped to a single value that represents that one person (that instance).
• A more complex extension is associating pronouns with specific entities.
• This is called coreference which means the identification of an anaphoric
relation.
• An anaphoric relation indicates the relation between two textual elements that
denote the same object.
• For example:
George Bush is the president. He is from Texas.
“George Bush” and “He” refer to the same object.
• You can create an additional inversion list, represented by the single canonical
value, that has all of the instances of all of the variants in it (as sketched below).
• An enumeration method (listing the variants) or linguistic rules can be used to
overcome the coreference issue.
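A minimal sketch of entity normalization by enumeration, assuming the name phrases have already been identified as single processing tokens: known variants map to one canonical value, and an additional inversion list keyed by that value collects every position where any variant occurs. The variant table is an illustrative assumption.

VARIANTS = {
    "george bush": "George W. Bush",
    "president bush": "George W. Bush",
    "george w. bush": "George W. Bush",
    "bush": "George W. Bush",
}

def normalize_entities(tokens):
    """Return {canonical name: [token positions]} for a tokenized item."""
    inversion_list = {}
    for position, token in enumerate(tokens):
        canonical = VARIANTS.get(token.lower())
        if canonical:
            inversion_list.setdefault(canonical, []).append(position)
    return inversion_list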
Entity Resolution
• As processing tokens are associated with an entity instance, it’s
possible they will be assigned to the wrong entity instance.
• E.g., the first time the token George Bush is encountered it is
mapped to George W. Bush.
• As the item continues processing, it might turn out that all the other
processing tokens are referring to George H. Bush and thus the first
processing token should also refer to George H. Bush.
• When the entities with their attributes are merged into the searchable
database, it’s possible that there are inconsistencies between the
entities and their attributes.
• The merging and resolution of these conflicts is also called entity
resolution.
Information Extraction
• There are two processes associated with information extraction:
– Determination of facts to go into structured fields in a database
• only a subset of the important facts in an item may be identified
and extracted
– Extraction of text that can be used to summarize an item.
• All of the major concepts in the item should be represented in the
summary.
• The process of extracting facts to go into indexes is called Automatic File
Build.
• Its goal is to process incoming items and extract index terms that will go
into a structured database
• The objective of the data extraction is in most cases to update a structured
database with additional facts
• Performance measures used
• Precision
– Refers to how much information was extracted accurately versus the
total information extracted
• Recall
– Refers to how much information was extracted from an item versus
how much should have been extracted from the item
• Overgeneration
– Measures the amount of irrelevant information that is extracted
• Fallout
– Measures how much a system assigns incorrect slot fillers as the
number of potential incorrect slot fillers increases
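A minimal sketch of the precision, recall, and overgeneration measures as described above, computed over sets of extracted facts versus the facts that should have been extracted. (Fallout, which depends on the pool of potential incorrect slot fillers, is not computed here.)

def extraction_measures(extracted, should_have_extracted):
    extracted = set(extracted)
    expected = set(should_have_extracted)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    overgeneration = len(extracted - expected) / len(extracted) if extracted else 0.0
    return precision, recall, overgeneration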
Categorization
• The categorization process is focused on finding additional descriptors for
the content of an item.
• Processing tokens for an item are expanded by the terms associated with
each category found for an item
• For example, there may be a category for Environment Protection.
• When an item comes in that discusses oil spills, it will also have the term
Environment Protection assigned to it, even though neither those words nor
any variants of those words were included in the item.
• The result of the categorization process is a “confidence” value that the
item should be part of that category.
• This is typically a value between 0 and 1 (or 0 to 100).
• Learning algorithms such as neural networks and SVMs can be used.
• Categorization using Naïve Bayes Approach
• The classification problem can be defined as:

    P(C | I) = P(I | C) × P(C) / P(I)

• The probability of a particular Category given a particular Item equals the
probability of the Item given the Category, times the probability the Category
occurs, divided by the probability the Item occurs.
• The next step is to determine how to calculate the factor P(I | C).
• An item is a set of processing tokens (words) that are meaningful (i.e., stop
words need to be eliminated).
• Thus the factor can be estimated as P(the set of words in I | Category).
• The concept of using a training set is that you will have a set of Items
identified as those in the Category and those not in the Category.
• The log of the probabilities is typically used so that the product of many word
probabilities that are close to or equal to zero does not underflow (the product
becomes a sum of logs).

• The P(tk | C) factor is the number of times that the word tk is found in the training
data set for a particular Category divided by the total number of words for that
Category:

    P(tk | C) = count(tk, C) / (total words in C)

• This approach can lead to problems if certain terms are not found in the
training set, so Laplace smoothing can be added to eliminate zeros:

    P(tk | C) = (count(tk, C) + 1) / (total words in C + N)

• where “N” is the number of unique processing tokens in the training set.

• Given the following 6 Items in the training data set, the factors for
each of two categories are calculated.
• P(computer category) is 4/6 = 0.667 and P(not computer) is
2/6 = 0.333.
• If a new Item is to be categorized and the system is calculating whether it
should be in the “computer” category, then the data above would be
used. For example, if the new item is:
• Item 7 = computer physics computer mathematics
• P(computer | Item 7) = (0.667) × (0.526 × 0.263 × 0.526 × 0.105) ≈ 0.0051
• P(not computer | Item 7) = (0.333) × (0.182 × 0.546 × 0.182 × 0.091) ≈ 0.0005
• Thus Item 7 would be assigned the category “computer” (the arithmetic is
checked in the sketch below).
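A minimal check of the Item 7 arithmetic, treating the per-word probabilities and the category priors quoted above as given values from the example.

import math

p_computer, p_not_computer = 0.667, 0.333
# P(word | category) values quoted for the words of Item 7:
# "computer physics computer mathematics"
word_probs_computer = [0.526, 0.263, 0.526, 0.105]
word_probs_not_computer = [0.182, 0.546, 0.182, 0.091]

score_computer = p_computer * math.prod(word_probs_computer)
score_not_computer = p_not_computer * math.prod(word_probs_not_computer)

print(round(score_computer, 4))       # 0.0051
print(round(score_not_computer, 4))   # 0.0005 -> Item 7 is assigned "computer"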
Citational Metadata

• There is additional indexable data that is associated with where the
item came from.
• This information can help limit the user’s search to a subset of the
items in the database, thereby improving precision.
• Examples of citational information are the date the item was
ingested by the search system, the date the item was created by the
user, the source of the item (e.g., web site name, news agency
for RSS feeds, etc.), the author of the item, and the language the item is in.
• The advantage of this data is that it is additional information that can
help narrow a search down if it is an aspect of the user’s information
need.
