2T-Inverted Index
nkerzazi@gmail.com
Goals: Understanding How Search Works
1. Know of the document’s existence
2. Index the document for lookup
3. Know how relevant the document is
4. Retrieve, ranked by relevance
Goal: the most relevant documents for the search terms
Search is not only for the web
• The Web
• E-commerce (catalogues, items, etc.)
• Video
• Logs
• Documents
• ….
The Three Major Components
1. A crawler explores the world, looking for documents to index.
2. The indexer takes the set of documents found by the crawler
and produces an inverted index from them.
3. A query processor takes a query from the user and processes it
against the inverted index to find the list of documents that
match the query.
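Components 2 and 3 can be sketched in a few lines of Python. The three toy documents below are illustrative stand-ins for a crawler's output, and the AND-only query function is a deliberate simplification with no ranking:

```python
# Hypothetical document collection (stand-in for the crawler's output).
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Indexer: map each term to the sorted list of documents containing it.
inverted_index = {}
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        inverted_index.setdefault(term, []).append(doc_id)

# Query processor: answer an AND query by intersecting postings lists.
def and_query(*terms):
    postings = [set(inverted_index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(and_query("home", "sales", "july"))  # -> [2, 3]
```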
How Google infrastructure works
Crawling
• First model:
– Crawl the web for 4 weeks,
– Index for 1 week,
– Push data out for 1 week
First assignment: Data Crawling
Tokenize Text into Words
• Split into words
• Lowercase
• Remove punctuation
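The three steps above can be sketched as one small function; stripping only leading and trailing punctuation is one of several possible choices here:

```python
import string

def tokenize(text):
    """Split on whitespace, lowercase, strip surrounding punctuation."""
    words = text.lower().split()
    stripped = (w.strip(string.punctuation) for w in words)
    return [w for w in stripped if w]  # drop pure-punctuation tokens

print(tokenize("Friends, Romans and Countrymen!"))
# -> ['friends', 'romans', 'and', 'countrymen']
```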
The frequency
Dictionary
Postings list
• Sorted by document ID, so lookup is easy
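Sorted postings are what make lookup cheap: two sorted lists can be intersected in a single linear pass. A minimal sketch, with hypothetical postings for two terms:

```python
def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # doc contains both terms
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:         # advance the pointer on the smaller ID
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # -> [2, 31]
```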
Steps to build an inverted index
• Fetch the Document
• Remove stop words
Stop words are the most frequent and least informative words in a document,
like “I”, “the”, “we”, “is”, “an”.
• Stem to the root word
Suppose we search for “cat”, but the document contains “cats” or “catty”
instead. Reducing each word to its root form lets these variants match. There
are standard tools for this, such as Porter’s stemmer.
• Record document IDs
If the word is already present in the index, add a reference to the document to
its entry; otherwise create a new entry. Additional information, such as the
word’s frequency and location, can be recorded as well.
• Repeat for all documents and sort the words.
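The steps above can be sketched end to end. The stop list and the crude suffix-stripping "stemmer" below are illustrative stand-ins, not Porter's algorithm:

```python
# Illustrative stop list (real systems use a longer one).
STOP_WORDS = {"i", "the", "we", "is", "an", "a", "and", "in"}

def crude_stem(word):
    """Toy stemmer: strip a plural-like suffix from long-enough words."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    index = {}
    for doc_id in sorted(docs):
        for pos, raw in enumerate(docs[doc_id].lower().split()):
            term = crude_stem(raw)
            if term in STOP_WORDS:
                continue
            # Record document ID, frequency, and position for the term.
            postings = index.setdefault(term, {})
            entry = postings.setdefault(doc_id, {"freq": 0, "positions": []})
            entry["freq"] += 1
            entry["positions"].append(pos)
    return index

index = build_index({1: "the cats sat", 2: "a cat is here"})
print(sorted(index))         # -> ['cat', 'here', 'sat']
print(sorted(index["cat"]))  # -> [1, 2]
```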
Sec. 2.2.1
Tokenization
▪ Input: “Friends, Romans and Countrymen”
▪ Output: Tokens
▪ Friends
▪ Romans
▪ Countrymen
▪ A token is an instance of a sequence of characters
▪ Each such token is now a candidate for an index entry, after further
processing
▪ Described below
▪ But what are valid tokens to emit?
Tokenization
• Issues in tokenization:
– Finland’s capital →
Finland AND s? Finlands? Finland’s?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.
• co-education
• lowercase, lower-case, lower case ?
• It can be effective to get the user to put in possible hyphens
– San Francisco: one token or two?
• How do you decide it is one token?
Numbers
• 3/20/91 Mar. 20, 1991 20/3/91
• 55 B.C.
• B-52
• My PGP key is 324a3df234cb23e
• (800) 234-2333
– Often have embedded spaces
– Older IR systems may not index numbers
• But often very useful: think about things like looking up error
codes/stacktraces on the web
• (One answer is using n-grams: IIR ch. 3)
– Will often index “meta-data” separately
• Creation date, format, etc.
• Bidirectional text: ‘Algeria achieved its independence in 1962 after 132 years of
French occupation.’ In Arabic script, this sentence reads right to left, while the
embedded numbers read left to right
• With Unicode, the surface presentation is complex, but the stored form is
straightforward
Sec. 2.2.2
Stop words
• With a stop list, you exclude from the dictionary entirely the
commonest words. Intuition:
– They have little semantic content: the, a, and, to, be
– There are a lot of them: ~30% of postings for top 30 words
Normalization to terms
• We may need to “normalize” words in indexed text as well as
query words into the same form
– We want to match U.S.A. and USA
• Result is terms: a term is a (normalized) word type, which is an
entry in our IR system dictionary
• We most commonly implicitly define equivalence classes of
terms by, e.g.,
– deleting periods to form a term
• U.S.A., USA → USA
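A minimal sketch of this equivalence classing: deleting periods (and folding case) maps both surface forms to the same dictionary term:

```python
def normalize(token):
    """Equivalence-class a token by deleting periods and folding case."""
    return token.replace(".", "").lower()

print(normalize("U.S.A."))  # -> 'usa'
print(normalize("USA"))     # -> 'usa'
```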
Case folding
Normalization to terms
Lemmatization
Stemming
Porter’s algorithm
• sses → ss
• ies → i
• ational → ate
• tional → tion
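The four rules above can be sketched as plain suffix rewrites. This is only the flavor of the algorithm: the real Porter stemmer also checks the "measure" of the remaining stem before applying each rule, and runs several rule phases. Note the rule order matters ("ational" must fire before "tional"):

```python
# Longest-match-first ordering of the sample rules.
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word):
    """Apply the first matching suffix rewrite, if any."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

print(apply_rules("caresses"))    # -> 'caress'
print(apply_rules("ponies"))      # -> 'poni'
print(apply_rules("relational"))  # -> 'relate'
```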
Other stemmers
Language-specificity
• English: very mixed results. Helps recall for some queries but
harms precision on others
– E.g., operative (dentistry) ⇒ oper
• Definitely useful for Spanish, German, Finnish, …
– 30% performance gains for Finnish!
Ch. 3
This lecture
Sec. 3.1
A naïve dictionary
• An array of structs: each entry stores the term, its document frequency, and a
pointer to its postings list
Hashtables
• Each vocabulary term is hashed to an integer
– (We assume you’ve seen hashtables before)
• Pros:
– Lookup is faster than for a tree: O(1)
• Cons:
– No easy way to find minor variants:
• judgment/judgement
– No prefix search [tolerant retrieval]
– If vocabulary keeps growing, need to occasionally do the expensive
operation of rehashing everything
Tree: binary tree
• The root splits the vocabulary into ranges, e.g. a-m and n-z
Tree: B-tree
• Each internal node splits the vocabulary into several ranges, e.g. a-hu, hy-m, n-z
Trees
Wild-card queries: *
• mon*: find all docs containing any word beginning with “mon”.
• Easy with binary tree (or B-tree) lexicon: retrieve all words in
range: mon ≤ w < moo
• *mon: find words ending in “mon”: harder
– Maintain an additional B-tree for terms backwards.
Can retrieve all words in range: nom ≤ w < non.
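Both range lookups can be sketched with binary search over a sorted list, standing in for the B-tree; the lexicon below is a made-up example, and the successor computation assumes the boundary character is not the last in the alphabet:

```python
import bisect

lexicon = sorted(["moon", "month", "monday", "salmon", "lemon", "moose"])
reversed_lexicon = sorted(w[::-1] for w in lexicon)

def range_scan(sorted_words, prefix):
    """All words w with prefix <= w < successor(prefix)."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)  # e.g. 'mon' -> 'moo'
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, upper)
    return sorted_words[lo:hi]

def prefix_query(prefix):            # mon* : direct range scan
    return range_scan(lexicon, prefix)

def suffix_query(suffix):            # *mon : prefix scan on reversed terms
    matches = range_scan(reversed_lexicon, suffix[::-1])
    return sorted(w[::-1] for w in matches)

print(prefix_query("mon"))  # -> ['monday', 'month']
print(suffix_query("mon"))  # -> ['lemon', 'salmon']
```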
Query processing
Sec. 3.2
Permuterm index
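As a preview of the permuterm idea: index every rotation of term + "$" (an end marker), so that a general wildcard query X*Y becomes a prefix lookup for Y$X. A minimal sketch over a made-up vocabulary, with a linear scan standing in for the B-tree prefix lookup:

```python
def rotations(term):
    """All rotations of term + '$', e.g. hello$ -> ello$h, llo$he, ..."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]

# Permuterm index: every rotation points back to its source term.
permuterm = {}
for term in ["hello", "help", "moon"]:
    for rot in rotations(term):
        permuterm[rot] = term

def wildcard(query):
    """Answer X*Y by rotating it to Y$X and prefix-matching rotations."""
    x, y = query.split("*")
    prefix = y + "$" + x
    return sorted({t for rot, t in permuterm.items() if rot.startswith(prefix)})

print(wildcard("hel*o"))  # -> ['hello']
print(wildcard("m*n"))    # -> ['moon']
```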