UNIT II-INGEST
• Introduction
• Item Receipt
• Duplicate Detection
• Item Normalization
• Zoning and Creation of Processing Tokens
• Stemming
• Entity Processing
• Categorization
• Citational Metadata
• Manual Indexing Process
• Automatic Indexing of Text and Multimedia
Introduction to Ingest
• Ingest is the initial process in an information retrieval system.
• It is the process that receives the items to be stored and indexed in the
system and performs the initial processing of them.
• The ingest process can be broken down into a number of subprocesses.
• Each subprocess can add value to the final index and improve the
capability of the system to return better results.
• An ingest subprocess can be “pull” or “catch”
– A pull process has the system going out to other locations and
retrieving items from those locations to ingest (e.g., web/local
network crawling or RSS feeds).
– Catching is where other systems deliver the items by sending them
to the ingest process.
• One internal system writes files into an ingest directory or uses
the “web services” application interface of the ingest process to
deliver new items.
• Once the item is received it goes through:
– A normalization process for processing purposes.
– Original item is usually kept as part of the document repository to
ensure no information is lost in the normalization process
– One of the first checks in many systems is to validate if this item has
already been ingested into their system.
– Duplicate detection can save a lot of redundant processing.
– Generation of processing tokens.
– Entity identification -some normalization can be done on the processing
tokens to map them into a canonical value.
– The result of these steps is the data that is ready to be passed to the
Indexing process.
– One final process called Categorization can also be applied to expand
the processing tokens for an item.
Item Receipt
• Item receipt is the process associated with first getting items into the
system.
Methods used:
• Pulling
– Crawling internet/local area network
• Depth First
• Breadth First
– Getting item from RSS feed
• Catch
– This process is usually associated with internal systems.
– Within the internal generation of new information there is a step where new items are
written to a queue that is monitored by the indexing process
• Crawling network
• Seed list
– Starts with a list of addresses that can be used as a starting point for the
crawling
– Defines the maximum subset of the network from which items can be part
of the information system.
– Each item (e.g.,web page) pointed to by the seed list is retrieved from its
web site to be indexed.
• URL (Uniform Resource Locators)
– When the page is retrieved to be indexed, all the links(URLs) on that page
that link to other pages on that site or a different site are extracted and
added to the list of crawlable locations.
– Data structure used to store the new URLs to be crawled is called the URL
Frontier.
– As pages are crawled from the URL frontier they are marked as crawled
along with a date time of crawl.
• The list of all items to retrieve is continually growing.
• The term “black web” is used to describe that subset of the Internet that is not
retrieved and indexed.
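The pull process described above (a seed list feeding a URL Frontier, with crawled pages marked by date and time of crawl) can be sketched as follows. The link graph, URLs, and function names are hypothetical stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque
from datetime import datetime, timezone

# Hypothetical in-memory link graph standing in for real page fetches:
# each page maps to the URLs extracted from it.
LINK_GRAPH = {
    "http://a.example/": ["http://a.example/p1", "http://b.example/"],
    "http://a.example/p1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/q"],
    "http://b.example/q": [],
}

def crawl(seed_list):
    """Seed list feeds the URL Frontier; each URL is marked as crawled
    with a timestamp, and its extracted links are added to the frontier."""
    frontier = deque(seed_list)      # the URL Frontier
    crawled = {}                     # url -> datetime of crawl
    while frontier:
        url = frontier.popleft()
        if url in crawled:           # already ingested, skip
            continue
        crawled[url] = datetime.now(timezone.utc)
        for link in LINK_GRAPH.get(url, []):   # links extracted from page
            if link not in crawled:
                frontier.append(link)
    return crawled

pages = crawl(["http://a.example/"])
```

Taking URLs from the left of the frontier gives breadth-first crawling; taking them from the right end instead would approximate depth-first.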
• Depth First
– Depth first says you select pages from the URL Frontier following
pages to the same site to retrieve as much of the site as you can.
– Has the possible negative aspect of hitting the same site often in a
limited interval, which could cause the site administrator to lock your
crawler out.
– Avoiding this is a matter of the “politeness” of crawling, and the general
rule is to wait at least two seconds before the system retrieves another
page from the same site.
– Software that takes the URLs from the Seed list or URL Frontier and
retrieves the linked page is called a web crawler, spider or robot.
– Software should be scalable
• Breadth First
– Breadth first says you significantly limit the number of pages retrieved
from a site, going on to other sites that you have not retrieved.
– Gives you a better sample of many sites versus depth on a fewer
number of sites.
• Challenges of crawling
• Design and optimization of crawlers and URL frontier
• A crawler cannot retrieve from the same web site too frequently or it will
alert a system administrator, who could block that crawler from future
access to the site.
• Website updates must be tracked frequently.
• Amount of data that is kept in databases at web sites
– Static web pages-preconstructed
– Dynamic web pages – hidden web
• The first decision that affects an information retrieval system’s performance
in terms of precision and recall is what data will be ingested and indexed.
• Subscribing to RSS Feed
• RSS stands for either Really Simple Syndication or Rich Site Summary and
includes a specification of how a web site can publish information to a
user.
• Documents are delivered in XML format.
• used to publish frequently updated works such as blog entries, news
headlines, audio, and video in a standardized format.
• RSS document includes the item plus metadata such as dates and
authorship.
• RSS feeds are read using software called an RSS reader or an aggregator.
• Once subscribed to a feed, an aggregator is able to check for new content at
user-determined intervals and retrieve the update.
• Example of XML
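As a stand-in for the XML example above, here is a minimal, hypothetical RSS 2.0 document and a sketch of how an aggregator might extract each item plus its metadata (title, link, publication date) using Python's standard library:

```python
import xml.etree.ElementTree as ET

# A minimal, hypothetical RSS 2.0 feed: the item plus metadata
# such as the publication date.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
      <pubDate>Mon, 06 Sep 2010 00:01:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def read_items(rss_text):
    """Return (title, link, pubDate) for every <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [(i.findtext("title"), i.findtext("link"), i.findtext("pubDate"))
            for i in root.iter("item")]

items = read_items(RSS)
```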
Duplicate Detection
• The first process that can make search and retrieval more effective is to
eliminate duplicate information.
• The duplicates cause a lot of wasted system overhead in indexing the same
item.
• The standard approach for duplicate detection is to create a signature, a
unique key that represents the contents of an item.
• The most common methodology is to create a hash for the complete file
(e.g., Message Digest Algorithm: MD2 or MD5).
• The problem with this approach is that the hash is computed over every
character in the item.
• If there are just a few characters different (e.g., a copy is made of an article
but a system assigns a unique ID it places on the article or it places a
current date on the article) you will get two different hash values and not
detect it as a duplicate.
• This problem can be addressed by using a heuristic algorithm.
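A minimal sketch of the signature approach, showing why a whole-item hash catches exact copies but misses a copy that differs by only a few characters (the appended document ID is a hypothetical example of a system-assigned change):

```python
import hashlib

def signature(text):
    """MD5 signature computed over every character of the item."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

a = "The quick brown fox jumps over the lazy dog."
b = a                      # exact copy -> identical signature
c = a + " [doc-id 42]"     # hypothetical system-assigned ID appended

exact_dup = signature(a) == signature(b)   # detected as a duplicate
near_dup  = signature(a) == signature(c)   # missed: hashes differ
```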
• Near Duplicate Detection
– An automated algorithm in the ingest system for detecting near duplicates.
• There are two kinds of near duplicates to detect:
– One is where the text in the two items is almost identical.
– The second is where the semantics of both items is identical but the
text used to express the semantics is different.
Resemblance between two documents A and B, with shingle sets S(A) and S(B), is defined as:
r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
• The biggest issue with this technique is that it will get poorer results for
shorter items that do not have that many shingles to combine into bigger
shingles.
• To avoid the problem that comes from defining word segments, another
approach is to use a similarity measure between documents to decide if
they are near duplicates.
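The shingle-based resemblance measure can be sketched as follows; the shingle size k = 3 and the sample sentences are illustrative choices, not prescribed settings:

```python
def shingles(text, k=3):
    """Set of k-word shingles (contiguous word windows) from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """r(A, B): size of the intersection of the two shingle sets
    divided by the size of their union."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the sleepy dog"
r = resemblance(doc1, doc2)   # high, but below 1.0: a near duplicate
```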
Item Normalization
• As you retrieve items from multiple sources, it’s possible that the encoding formats
from different sources can vary.
• The first step in processing the items is to normalize the format to a standard
format
• All the items will have a standard encoding after normalization.
• The first step in processing a textual item is to detect the language(s) that the item is
in.
• Language should be known to perform the follow-on processing (e.g.,
morphological rules, stemming rules, dictionary look-up, etc. are all language
dependent)
• Once the language is determined the text can be put into Unicode
(UTF-8, UTF-16); then all of the different formats will be normalized
to a single format.
• Next step after standard format representation is character normalization
• For multimedia objects you may also need to normalize to a standard format to
make the development of algorithms more efficient.
• For example, you may decide to use MPEG-2 or MPEG-4 as your standard
processing format for video.
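A small sketch of character normalization into a single Unicode form: two spellings of the same visible word differ at the byte level until both are normalized (NFC is used here; a real system might instead choose NFKC):

```python
import unicodedata

# Two byte-different encodings of the same visible string "café":
composed   = "caf\u00e9"     # é as a single code point
decomposed = "cafe\u0301"    # e followed by a combining acute accent

def normalize(text):
    """Map the item's text to a single canonical Unicode form (NFC here)."""
    return unicodedata.normalize("NFC", text)

same_before = composed == decomposed                         # False
same_after  = normalize(composed) == normalize(decomposed)   # True
```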
Text Normalization Process
Zoning and Creation of Processing Tokens
• Next step in the process is to zone the document and identify
processing tokens for indexing.
• Zoning
– Parsing the item into logical sub-divisions that have meaning to the user
– Title, Author, Abstract, Main Text, Conclusion, References, Country,
Keyword…
• Zoning is used to increase precision of search and optimize display.
• The zoning information is passed to the processing token
identification operation to store the zone location information,
allowing searches to be restricted to a specific zone.
• Once the standardization and zoning has been completed, information (i.e., words)
that are used in creating the index to be searched needs to be identified in the item.
• The term “processing token” is used because a “word” is not the most efficient unit
on which to base search structures.
• The first step in identification of a processing token consists of determining a word
• Systems determine words by dividing input symbols into three classes: valid word
symbols, inter-word symbols, and special processing symbols
• Examples of possible inter-word symbols are blanks, periods and semicolons.
• Some languages that are glyph or ideogram based (e.g., CJK—Chinese, Japanese,
and Korean) do not have any interword symbols between the characters.
• An ideogram is a character or symbol representing an idea or a thing without
expressing the pronunciation of a particular word
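The symbol classes above can be sketched as a simple tokenizer; the particular set of inter-word symbols, and the policy of keeping special symbols such as “-” inside tokens, are illustrative assumptions:

```python
INTER_WORD = set(" .,;:!?\n\t")   # inter-word symbols (illustrative set)

def tokenize(text):
    """Class each input symbol: inter-word symbols end the current token;
    everything else (valid word symbols, plus special processing symbols
    such as '-', which this sketch simply keeps inside tokens) extends it."""
    tokens, current = [], []
    for ch in text:
        if ch in INTER_WORD:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

toks = tokenize("Item receipt; item normalization.")
```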
• Processing tokens for multimedia items also exist
– Image
– Audio
– Video
• The next step in defining processing tokens is identification of any specific
word characteristics
• Morphological analysis of the processing token’s part of speech is included
here. Thus, for a word such as “plane,” the system understands that it could
mean “level or flat” as an adjective or “aircraft or facet” as a noun.
• After identifying the potential list of processing tokens, stop words can be
removed by a stop algorithm.
• The objective of the stop function is to save system resources by
eliminating from the set of searchable processing tokens those that have
little value to the system or the user (e.g., “the”).
• Zipf (Zipf-49) postulated that, looking at the frequency of occurrence of
the unique words across a corpus of items, the majority of unique words are
found to occur only a few times. The rank-frequency law of Zipf is:
frequency × rank ≈ constant
• Examples of Stop algorithms are:
• Stop all numbers greater than “999,999” (this was selected to allow
dates to be searchable)
• Stop any processing token that has numbers and characters
intermixed
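The stop rules listed above (a stop list, numbers greater than 999,999, and tokens that intermix digits and letters) can be sketched as a filter; the stop list here is a tiny illustrative one:

```python
STOP_WORDS = {"the", "a", "an", "of", "and"}   # illustrative stop list

def keep_token(tok):
    """Apply the stop rules: drop stop-list words, drop numbers greater
    than 999,999 (so dates stay searchable), and drop tokens that
    intermix digits and letters."""
    if tok.lower() in STOP_WORDS:
        return False
    if tok.isdigit() and int(tok) > 999999:
        return False
    if any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok):
        return False
    return True

tokens = ["the", "ingest", "961231", "12345678", "ab123", "index"]
kept = [t for t in tokens if keep_token(t)]
```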
• At this point the textual processing tokens have been identified and
stemming may be applied.
Stemming
• One of the last transformations often applied to data before placing it in the
searchable data structure is stemming.
• Stemming reduces the diversity of representations of a concept (word)
• Stemming has the potential to improve recall.
• Reduces precision if not applied properly
• The most common stemming algorithm removes suffixes and prefixes,
sometimes recursively, to derive the final stem.
• Other techniques such as table lookup and successor stemming provide
alternatives that require additional overheads.
• A related operation is called lemmatization. Lemmatization is typically
accomplished via dictionary lookup.
– E.g., it maps “ate” to “eat” and “teeth” to “tooth”.
• Conflation (the process or result of fusing items into one entity; fusion;
amalgamation) is a term used to refer to mapping multiple morphological
variants to a single representation (stem).
• The stem carries the meaning of the concept associated with the word and the
affixes (endings) introduce subtle (slight) modifications of the concept.
• Terms with a common stem will usually have similar meanings, for example:
CONNECT
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS
• Frequently, the performance of an IR system will be improved if term groups such
as this are conflated into a single term. This may be done by removal of the various
suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT.
• In addition, the suffix stripping process will reduce the total number of terms in the
IR system, and hence reduce the size and complexity of the data in the system,
which is always advantageous.
• A major usage of stemming is to improve recall.
• It is important for a system to categorize a word prior to making the
decision to stem.
• Proper names and acronyms (a word formed from the initial letters
of a name, say JNTU) should not have stemming applied.
• Stemming can also cause problems for natural language processing
(NLP) systems by causing loss of information.
Porter stemming algorithm
• Porter represents any word as [C](VC)^m [V], where C is a sequence of
consonants, V is a sequence of vowels, and m is the “measure” of the word:
free – CCVV (m = 0)
frees – CCVVC (m = 1)
prologue – CCVCVCVV (m = 2)
• The case m = 0 covers the null word.
• Conditions on the stem:
1. m – the measure of the stem
2. *<X> – stem ends with the letter X
3. *v* – stem contains a vowel
4. *d – stem ends in a double consonant (e.g. -TT, -SS)
5. *o – stem ends in a consonant-vowel-consonant sequence where the final
consonant is not w, x or y (e.g. -WIL, -HOP)
• Suffix conditions take the form: current_suffix == pattern
• Actions take the form: old_suffix -> new_suffix
• Rules are divided into steps to define the order for applying the
rules.
Examples of the rules:

Step  Condition  Suffix  Replacement  Example
1a    null       SSES    SS           stresses -> stress
1c    *v*        Y       I            happy -> happi
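The two table rows can be sketched directly in code. This is only a fragment of Porter's algorithm, and the *v* vowel test is simplified here to the letters a, e, i, o, u:

```python
def has_vowel(stem):
    """Condition *v*: the stem contains a vowel (simplified to aeiou;
    Porter also treats y after a consonant as a vowel)."""
    return any(c in "aeiou" for c in stem)

def porter_step1a_1c(word):
    """Sketch of two Porter rules from the table above:
    1a: SSES -> SS (null condition); 1c: Y -> I if the stem has a vowel."""
    if word.endswith("sses"):
        word = word[:-4] + "ss"        # stresses -> stress
    if word.endswith("y") and has_vowel(word[:-1]):
        word = word[:-1] + "i"         # happy -> happi
    return word
```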
New words that are not special forms (e.g., dates, phone numbers) are
located in the dictionary to determine simpler forms by stripping off
suffixes and respelling plurals as defined in the dictionary.
Successor stemmers:
• Based on the length of prefixes.
• A phoneme is the smallest unit of speech that distinguishes one word from another.
• The process uses successor varieties for a word: the successor variety of a
prefix is the number of distinct letters that follow it in the words of the corpus.
• Uses this information to divide a word into segments and selects one of the
segments as the stem.
Symbol tree for the terms bag, barn (figure)
Symbol tree for the terms bag, barn, bring, box, bottle, both (figure)
Successor varieties of words are used to segment a word by applying one of the
following four methods:
1. Cutoff method: a cutoff value is selected to define the stem length.
2. Peak and plateau: a segment break is made after a character whose
successor variety exceeds that of the characters immediately before and
after it.
3. Complete word method: break on boundaries of complete words.
4. Entropy method: uses the distribution of successor variety letters.
1. Let |Dak| be the number of words beginning with the k-length sequence of
letters a.
2. Let |Dakj| be the number of words in Dak with successor j.
3. The probability that a member of Dak has the successor j is given as
|Dakj| / |Dak|, and the entropy of the distribution is
H = -Σj (|Dakj| / |Dak|) log2 (|Dakj| / |Dak|).
After a word has been segmented the segment to be used as stem must
be selected.
Hafer and Weiss selected the following rule:
if the first segment occurs in <= 12 words in the database:
    the first segment is the stem
else:
    the second segment is the stem
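Successor variety over the six-term corpus from the symbol-tree example can be computed directly; the corpus and prefixes come from the text above:

```python
CORPUS = ["bag", "barn", "bring", "box", "bottle", "both"]

def successor_variety(prefix, corpus=CORPUS):
    """Number of distinct letters that follow `prefix` among the corpus
    words that begin with it."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

sv_b  = successor_variety("b")    # "b" is followed by a, r, o -> 3
sv_bo = successor_variety("bo")   # "bo" is followed by x, t   -> 2
```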
• Core problem in getting good search results is the mismatch between the
vocabulary of the author and the vocabulary of the user.
• There is an issue with the many variants on how an entity can be specified.
• There are many different ways that a person’s name may be expressed, and
transliteration issues are also possible.
• For example, the following are just a subset of the ways of expressing the
Libyan leader’s name: Muammar el Qaddafi, Ghaddafi, Kaddafi,
Muammar al-Gathafi, Col. Mu’ammar al-Qadhafi.
• Yet they all refer to the same person (entity).
Entity Identification
• The p(tk|C) factor is the number of times that a word tk is found in the training
data set for a particular category C divided by the total number of words.
• This approach can lead to problems if certain terms are not found in the
training set, so Laplace smoothing can be added to eliminate zeros.
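A sketch of the p(tk|C) estimate with add-one (Laplace) smoothing: the count of tk in category C plus one, divided by the total words in C plus the vocabulary size. The two categories and training documents are hypothetical:

```python
from collections import Counter

def train(docs_by_category):
    """Estimate p(tk|C) with Laplace (add-one) smoothing:
    (count(tk, C) + 1) / (total words in C + |vocabulary|)."""
    vocab = {t for docs in docs_by_category.values()
             for d in docs for t in d.split()}
    model = {}
    for cat, docs in docs_by_category.items():
        counts = Counter(t for d in docs for t in d.split())
        total = sum(counts.values())
        model[cat] = {t: (counts[t] + 1) / (total + len(vocab))
                      for t in vocab}
    return model

# Hypothetical two-category training set.
model = train({
    "sports": ["goal match team", "team win"],
    "tech":   ["code compile bug"],
})

p_unseen = model["tech"]["goal"]   # nonzero thanks to smoothing
```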