Welcome to Scribd!

0% found this document useful (0 votes)

12 views

Inverted Index

Uploaded by

The document discusses the key aspects of constructing an inverted index for information retrieval systems. It describes how an inverted index stores a list of documents associated with each term, and how postings lists identify documents using document IDs. It also covers the initial stages of text processing like tokenization, normalization, stemming, removing stop words and handling numbers, dates, abbreviations and multi-word expressions. Further, it discusses language-specific issues and challenges in tokenization for languages like French, German, Chinese, Japanese and Arabic.

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Cape 2004 Communication Studies Module 2 Model Answer
Document2 pages
Cape 2004 Communication Studies Module 2 Model Answer
edlinrochford8533
71% (7)
Curso de Inglés para Programadores
Document19 pages
Curso de Inglés para Programadores
Sendy Esthefany
No ratings yet
PPP
Document9 pages
PPP
Juampi Duboué Iuvaro
No ratings yet
IRS Chapter 2
Document57 pages
IRS Chapter 2
deccancollege
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
Document38 pages
Lecture 3-Term Vocabulary and Posting Lists
Prateek Sharma
No ratings yet
Introduction To: Information Retrieval
Document34 pages
Introduction To: Information Retrieval
arvind arvind meena
No ratings yet
Lec 19
Document60 pages
Lec 19
hancocker
No ratings yet
Lecture2 Dictionary
Document62 pages
Lecture2 Dictionary
PRAVINSABARIBALA
No ratings yet
Information Retrieval Systems Chap 2
Document60 pages
Information Retrieval Systems Chap 2
saivishu1011_6936113
67% (3)
Introduction To: Information Retrieval
Document48 pages
Introduction To: Information Retrieval
putri_dIYANi
No ratings yet
AI6122 Topic 1.2 - WordLevel
Document63 pages
AI6122 Topic 1.2 - WordLevel
Yujia Tian
No ratings yet
Introduction To: Information Retrieval
Document48 pages
Introduction To: Information Retrieval
sklgoel
No ratings yet
Introduction To: Information Retrieval
Document44 pages
Introduction To: Information Retrieval
Hussain Saeed
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
Document26 pages
Lecture 3-Term Vocabulary and Posting Lists
Yash Gupta
No ratings yet
IR Lec03 Vocabulary Postings List
Document28 pages
IR Lec03 Vocabulary Postings List
Arpita Gupta
No ratings yet
CSE 435/535 Information Retrieval: Chapter 2: Tokenization, Stemming, Lemmatization
Document48 pages
CSE 435/535 Information Retrieval: Chapter 2: Tokenization, Stemming, Lemmatization
Rahul Kumar
No ratings yet
Chap 2 Part 1
Document16 pages
Chap 2 Part 1
Ahmed gamal ebied
No ratings yet
Lesson Plan Data Representation Characters
Document3 pages
Lesson Plan Data Representation Characters
api-169562716
No ratings yet
2 - Text Operation
Document29 pages
2 - Text Operation
maki ababi
No ratings yet
Lecture 1 Text Preprocessing PDF
Document29 pages
Lecture 1 Text Preprocessing PDF
Ammad Asif
No ratings yet
Text Processing, Tokenization & Characteristics
Document89 pages
Text Processing, Tokenization & Characteristics
Gunik Maliwal
No ratings yet
Apex Institute of Technology Natural Language Processing (20CST354)
Document43 pages
Apex Institute of Technology Natural Language Processing (20CST354)
trinidhisingh15
No ratings yet
2T-Inverted Index
Document54 pages
2T-Inverted Index
Mariem El Mechry
No ratings yet
Information Retrieval: Text Processing
Document43 pages
Information Retrieval: Text Processing
firas
No ratings yet
Fundamentals of Assembly Language: Julius Bancud
Document47 pages
Fundamentals of Assembly Language: Julius Bancud
Red Streak
No ratings yet
Fundamentals of Assembly Language: Julius Bancud
Document47 pages
Fundamentals of Assembly Language: Julius Bancud
Red Streak
No ratings yet
FALLSEM2020-21 CSE4022 ETH VL2020210104471 Reference Material I 25-Jul-2020 NLP2-Lecture 1 3
Document35 pages
FALLSEM2020-21 CSE4022 ETH VL2020210104471 Reference Material I 25-Jul-2020 NLP2-Lecture 1 3
Sushan
No ratings yet
The Fine Art of Commenting
Document19 pages
The Fine Art of Commenting
Edgardo Veliz Limas
No ratings yet
Lecture 3
Document70 pages
Lecture 3
Abdullah Khan Qadri
No ratings yet
Numerals and Tables: Use Old-Style Numerals in Running Text
Document26 pages
Numerals and Tables: Use Old-Style Numerals in Running Text
Rakesh
No ratings yet
Chapter 1: Introduction To BASIC Programming: Programming Language Purpose Difficulty
Document28 pages
Chapter 1: Introduction To BASIC Programming: Programming Language Purpose Difficulty
Jubril Akinwande
100% (1)
33-International Considerations in SQL Server
Document10 pages
33-International Considerations in SQL Server
divandann
No ratings yet
10.2005.5 Unicode
Document4 pages
10.2005.5 Unicode
Jonas Singbo
No ratings yet
I18n Java
Document52 pages
I18n Java
api-3715867
No ratings yet
Chapter 2 Part 1 & 2
Document58 pages
Chapter 2 Part 1 & 2
S J
No ratings yet
Dscript PDF
Document9 pages
Dscript PDF
saironwe
No ratings yet
Introduction To L Tex: A Document Preparation System: Produced With L TEX by G Baker and G Moloney
Document34 pages
Introduction To L Tex: A Document Preparation System: Produced With L TEX by G Baker and G Moloney
bizhanj
No ratings yet
A Shallow Introduction To The K Programming Language - Kuro5hin
Document22 pages
A Shallow Introduction To The K Programming Language - Kuro5hin
lycancapital
No ratings yet
TechnicalCompound Nouns
Document3 pages
TechnicalCompound Nouns
arun.dhanapalan11
No ratings yet
Schmid Tokenising - CorpusLinguistics IntHbk
Document17 pages
Schmid Tokenising - CorpusLinguistics IntHbk
Luca Boetti
No ratings yet
Algol W: Jump To Navigationjump To Search
Document4 pages
Algol W: Jump To Navigationjump To Search
Claude Vinyard
No ratings yet
Lecture 2 o
Document79 pages
Lecture 2 o
furkangokaypk26
No ratings yet
LLODREAM Presentation
Document18 pages
LLODREAM Presentation
Anas Fahad Khan
No ratings yet
oldEnglighAlphabet Accentcodes
Document8 pages
oldEnglighAlphabet Accentcodes
Margaret Kitty
No ratings yet
1 The Mongolian Language: ??) Can Also Be Typeset by Using The ??)
Document13 pages
1 The Mongolian Language: ??) Can Also Be Typeset by Using The ??)
Gilmer Calderon Quispe
No ratings yet
CH-02 Text
Document54 pages
CH-02 Text
محمد صباح
No ratings yet
Regular Expression - Sentence Segment
Document46 pages
Regular Expression - Sentence Segment
naruto sasuke
No ratings yet
Phonetic
Document2 pages
Phonetic
altoquez
No ratings yet
Regular Expressions: Exceptions in A Character Set
Document10 pages
Regular Expressions: Exceptions in A Character Set
deartest
No ratings yet
Qbasic
Document32 pages
Qbasic
Haliu Hakorede
No ratings yet
Cdrfonts PDF
Document0 pages
Cdrfonts PDF
Bertrand Tchuente
No ratings yet
An Introduction To Unicode - The Trainer's Friend
Document52 pages
An Introduction To Unicode - The Trainer's Friend
Leandro Gabriel López
No ratings yet
Haskell and Yesod
Document265 pages
Haskell and Yesod
g_teodorescu
100% (1)
15-583:algorithms in The Real World: Data Compression I - Introduction - Information Theory - Probability Coding
Document33 pages
15-583:algorithms in The Real World: Data Compression I - Introduction - Information Theory - Probability Coding
Sanjeev Gupta
No ratings yet
LINGUA1767 With Cover Page v2
Document21 pages
LINGUA1767 With Cover Page v2
zzzzz zzzzz zzzzz
No ratings yet
E-Science E-Business E-Government and Their Technologies: Core XML
Document195 pages
E-Science E-Business E-Government and Their Technologies: Core XML
to_alk
No ratings yet
Howto Unicode
Document9 pages
Howto Unicode
domiel
No ratings yet
Text Preprocessing
Document59 pages
Text Preprocessing
saisuraj1510
No ratings yet
JB Guidelines Manuscript Submission Chicago
Document5 pages
JB Guidelines Manuscript Submission Chicago
Camila Lopez
No ratings yet
Building Blocks
Document13 pages
Building Blocks
Shreya Sahni
No ratings yet
Making Secret Codes
From Everand
Making Secret Codes
Jillian Gregory
No ratings yet
DB2 11 for z/OS: Intermediate Training for Application Developers
From Everand
DB2 11 for z/OS: Intermediate Training for Application Developers
Robert Wingate
No ratings yet
TA10-Unit 1
Document59 pages
TA10-Unit 1
Phuong Phan
No ratings yet
Stylistics
Document17 pages
Stylistics
Teodora Alexandra
100% (1)
Answer The Given Worksheets.: The Learner Demonstrates Understanding of
Document6 pages
Answer The Given Worksheets.: The Learner Demonstrates Understanding of
Tane MB
No ratings yet
MKT Ic4e Sampler 20191029
Document54 pages
MKT Ic4e Sampler 20191029
tyler_dennis_20
0% (1)
Word Order
Document15 pages
Word Order
Mariana Pasat
No ratings yet
CL 12
Document5 pages
CL 12
samwelntunga74
No ratings yet
Context Clues
Document25 pages
Context Clues
raffy capio
No ratings yet
Stages/ Time Content Activities Resource Set Induction (5 Minutes)
Document2 pages
Stages/ Time Content Activities Resource Set Induction (5 Minutes)
Zechrist Zechariah
No ratings yet
Phraseology and Style in Subgenres of The Novel A Synthesis of Corpus and Literary Perspectives 1St Ed 2020 Edition Iva Novakova All Chapter
Document68 pages
Phraseology and Style in Subgenres of The Novel A Synthesis of Corpus and Literary Perspectives 1St Ed 2020 Edition Iva Novakova All Chapter
tamara.aguinaga853
100% (9)
The Syntax of The Simple Sentence
Document34 pages
The Syntax of The Simple Sentence
Marian Roman
No ratings yet
Philosophy Now
Document68 pages
Philosophy Now
charid
100% (1)
Theory and Social Research - Neuman - Chapter2
Document24 pages
Theory and Social Research - Neuman - Chapter2
Madie Chua
50% (2)
EAPPG11 - q1 - Mod1 - Reading For Acadtext - v2
Document33 pages
EAPPG11 - q1 - Mod1 - Reading For Acadtext - v2
Emer Perez
100% (1)
English Day 1
Document4 pages
English Day 1
joann
No ratings yet
English Yt Sem 1 Final
Document8 pages
English Yt Sem 1 Final
Gaming with Devil
No ratings yet
DOC-baiyyinah Notes
Document150 pages
DOC-baiyyinah Notes
ruqayya muhammed
No ratings yet
Neurolinguis 5
Document345 pages
Neurolinguis 5
Melandes Tamiris
No ratings yet
Speech Competition Rubric
Document4 pages
Speech Competition Rubric
Whitley Patadungan
No ratings yet
Improving Students' Speaking Skill Through Debate Technique (Classroom Action Research at Sman 19 Kab Tangerang) PDF
Document214 pages
Improving Students' Speaking Skill Through Debate Technique (Classroom Action Research at Sman 19 Kab Tangerang) PDF
Laesa Oktavia
No ratings yet
On Adaptation of English Nominal Compounds in The Serbian Language With Reference To Word-Formation Types
Document8 pages
On Adaptation of English Nominal Compounds in The Serbian Language With Reference To Word-Formation Types
Ijahss Journal
No ratings yet
Modal Auxiliaries in Phraseology: A Contrastive Study of Learner English and NS English
Document27 pages
Modal Auxiliaries in Phraseology: A Contrastive Study of Learner English and NS English
Sidra Khawaja
No ratings yet
A слова
Document297 pages
A слова
ailisha libra
No ratings yet
2AC: AT: Speed Kritik: Psychology Today
Document2 pages
2AC: AT: Speed Kritik: Psychology Today
BLAGMISH
No ratings yet
San Pablo City Integrated HS Grade 10 Shiela P. Bagsit English April 11, 2022 Fourth Quarter 8:00am-9:00pm 1 HR
Document6 pages
San Pablo City Integrated HS Grade 10 Shiela P. Bagsit English April 11, 2022 Fourth Quarter 8:00am-9:00pm 1 HR
Shiela Peregrin Bagsit
No ratings yet
中国人写英文文章最常犯的错误总结
Document24 pages
中国人写英文文章最常犯的错误总结
Thomas
No ratings yet
Family Life A. Aims 1. Knowledge: - Grammar: Future Forms - Pronunciation: Sentence Stress
Document8 pages
Family Life A. Aims 1. Knowledge: - Grammar: Future Forms - Pronunciation: Sentence Stress
Viktoria Hrankina
100% (1)
CBLM 1 Participate in Workplace Communication
Document96 pages
CBLM 1 Participate in Workplace Communication
Orlando Najera
No ratings yet
Semantics (R-E Anul Iv) : Lexical Relations
Document44 pages
Semantics (R-E Anul Iv) : Lexical Relations
irinachircev
No ratings yet

Inverted Index

Uploaded by

qwrr rewq

0% found this document useful (0 votes)

12 views5 pages

Original Description:

Original Title

Untitled

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

0% found this document useful (0 votes)

12 views5 pages

Inverted Index

Uploaded by

qwrr rewq

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Download as pdf or txt

Jump to Page

You are on page 1of 5

Search inside document

Inverted index

Done: View
Inverted index

 The key data structure underlying modern IR

 For each term t, we must store a list of all documents that contain
term (t).

 Identify each doc by a docID, a document serial number.

 We need variable-size postings lists

 In memory, can use linked lists or variable length arrays

1 2 4 31 45 173 174 X

1 2 4 5 6 16 57 132

2 31 54 101 Y

Inverted index construction

Initial stages of text processing
• Tokenization
– Cut character sequence into word tokens
-- Deal with “John‟s”, a state-of-the-art solution

Input: “Friends, Romans and Countrymen”

Output: Tokens Friends Romans Countrymen

- A token is an instance of a sequence of characters

- Each such token is now a candidate for an index entry, after further
processing

Issues in tokenization:

 Finland‟s capital : Finland AND s? Finlands? Finland‟s?

 Hewlett-Packard: Hewlett and Packard as two tokens?

 state-of-the-art: break up hyphenated sequence.
 co-education
 lowercase, lower-case, lower case ?
 It can be effective to get the user to put in possible hyphens
 San Francisco: one token or two?

How do you decide it is one token?

Numbers

 3/20/91 Mar. 12, 1991 20/3/91

 55 B.C.
 B-52
 My PGP key is 324a3df234cb23e
 (800) 234-2333

Often have embedded spaces, Older IR systems may not index numbers

Tokenization: language issues

French

L'ensembleone token or two?

L ? L‟ ? Le ?

Want l‟ensemble to match with un ensemble

Until at least 2003, it didn‟t on Google Internationalization!

German noun compounds are not segmented

Lebensversicherungsgesellschaftsangestellter

„life insurance company employee‟

German retrieval systems benefit greatly from a compound splitter module

Can give a 15% performance boost for German

Chinese and Japanese have no spaces between

words:

莎拉波娃现在居住在美国东南部的佛罗里达。

Not always guaranteed a unique tokenization

Further complicated in Japanese, with multiple alphabets intermingled

Dates/amounts in multiple formats

フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!

Arabic (or Hebrew) is basically written right to left,

but with certain items like numbers written left to right

Words are separated, but letter forms within a word form complex ligatures
←→←
→ ← start

„Algeria achieved its independence in 1962 after 132 years of French

occupation.‟

With Unicode, the surface presentation is complex, but the stored form is
straightforward

• Normalization
– Map text and query term to same form
-- You want U.S.A. and USA to match
• Stemming
– We may wish different forms of a root to match
-- authorize, authorization

-- Reduce terms to their “roots” before indexing

-“Stemming” suggests crude affix chopping-Language dependent

e.g., automate(s), automatic, automation all reduced to automat

• Stop words
– We may omit very common words (or not)
-- the, a, to, of You need them for:

- Phrase queries: “King of Denmark”

- Various song titles, etc.: “Let it be”, “To be or not to

be”

- “Relational” queries: “flights to London”

 Case folding

Reduce all letters to lower case

exception: upper case in mid-sentence? e.g., General

Motors
Fed vs. fed SAIL vs. sail

Often best to lower case everything, since users will use lowercase
regardless of „correct‟ capitalization…

 Lemmatization

- Reduce inflectional/variant forms to base form

- E.g., am, are, is ===be

- car, cars, car's, cars'===car

- the boy's cars are different colors===the boy car be different

color

- Lemmatization implies doing “proper” reduction to dictionary

headword form

 Thesauri and soundex

Do we handle synonyms and homonyms?

E.g., by hand-constructed equivalence classes

synonyms=== car = automobile

homonyms=== color = colour

We can rewrite to form equivalence-class terms

When the document contains automobile, index it under car

automobile (and vice-versa)

Or we can expand a query When the query contains automobile, look

under car as well

synonyms==="good" and "excellent"

homonyms==="to", "too"

Cape 2004 Communication Studies Module 2 Model Answer
Document2 pages
Cape 2004 Communication Studies Module 2 Model Answer
edlinrochford8533
71% (7)
Curso de Inglés para Programadores
Document19 pages
Curso de Inglés para Programadores
Sendy Esthefany
No ratings yet
PPP
Document9 pages
PPP
Juampi Duboué Iuvaro
No ratings yet
IRS Chapter 2
Document57 pages
IRS Chapter 2
deccancollege
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
Document38 pages
Lecture 3-Term Vocabulary and Posting Lists
Prateek Sharma
No ratings yet
Introduction To: Information Retrieval
Document34 pages
Introduction To: Information Retrieval
arvind arvind meena
No ratings yet
Lec 19
Document60 pages
Lec 19
hancocker
No ratings yet
Lecture2 Dictionary
Document62 pages
Lecture2 Dictionary
PRAVINSABARIBALA
No ratings yet
Information Retrieval Systems Chap 2
Document60 pages
Information Retrieval Systems Chap 2
saivishu1011_6936113
67% (3)
Introduction To: Information Retrieval
Document48 pages
Introduction To: Information Retrieval
putri_dIYANi
No ratings yet
AI6122 Topic 1.2 - WordLevel
Document63 pages
AI6122 Topic 1.2 - WordLevel
Yujia Tian
No ratings yet
Introduction To: Information Retrieval
Document48 pages
Introduction To: Information Retrieval
sklgoel
No ratings yet
Introduction To: Information Retrieval
Document44 pages
Introduction To: Information Retrieval
Hussain Saeed
No ratings yet
Lecture 3-Term Vocabulary and Posting Lists
Document26 pages
Lecture 3-Term Vocabulary and Posting Lists
Yash Gupta
No ratings yet
IR Lec03 Vocabulary Postings List
Document28 pages
IR Lec03 Vocabulary Postings List
Arpita Gupta
No ratings yet
CSE 435/535 Information Retrieval: Chapter 2: Tokenization, Stemming, Lemmatization
Document48 pages
CSE 435/535 Information Retrieval: Chapter 2: Tokenization, Stemming, Lemmatization
Rahul Kumar
No ratings yet
Chap 2 Part 1
Document16 pages
Chap 2 Part 1
Ahmed gamal ebied
No ratings yet
Lesson Plan Data Representation Characters
Document3 pages
Lesson Plan Data Representation Characters
api-169562716
No ratings yet
2 - Text Operation
Document29 pages
2 - Text Operation
maki ababi
No ratings yet
Lecture 1 Text Preprocessing PDF
Document29 pages
Lecture 1 Text Preprocessing PDF
Ammad Asif
No ratings yet
Text Processing, Tokenization & Characteristics
Document89 pages
Text Processing, Tokenization & Characteristics
Gunik Maliwal
No ratings yet
Apex Institute of Technology Natural Language Processing (20CST354)
Document43 pages
Apex Institute of Technology Natural Language Processing (20CST354)
trinidhisingh15
No ratings yet
2T-Inverted Index
Document54 pages
2T-Inverted Index
Mariem El Mechry
No ratings yet
Information Retrieval: Text Processing
Document43 pages
Information Retrieval: Text Processing
firas
No ratings yet
Fundamentals of Assembly Language: Julius Bancud
Document47 pages
Fundamentals of Assembly Language: Julius Bancud
Red Streak
No ratings yet
Fundamentals of Assembly Language: Julius Bancud
Document47 pages
Fundamentals of Assembly Language: Julius Bancud
Red Streak
No ratings yet
FALLSEM2020-21 CSE4022 ETH VL2020210104471 Reference Material I 25-Jul-2020 NLP2-Lecture 1 3
Document35 pages
FALLSEM2020-21 CSE4022 ETH VL2020210104471 Reference Material I 25-Jul-2020 NLP2-Lecture 1 3
Sushan
No ratings yet
The Fine Art of Commenting
Document19 pages
The Fine Art of Commenting
Edgardo Veliz Limas
No ratings yet
Lecture 3
Document70 pages
Lecture 3
Abdullah Khan Qadri
No ratings yet
Numerals and Tables: Use Old-Style Numerals in Running Text
Document26 pages
Numerals and Tables: Use Old-Style Numerals in Running Text
Rakesh
No ratings yet
Chapter 1: Introduction To BASIC Programming: Programming Language Purpose Difficulty
Document28 pages
Chapter 1: Introduction To BASIC Programming: Programming Language Purpose Difficulty
Jubril Akinwande
100% (1)
33-International Considerations in SQL Server
Document10 pages
33-International Considerations in SQL Server
divandann
No ratings yet
10.2005.5 Unicode
Document4 pages
10.2005.5 Unicode
Jonas Singbo
No ratings yet
I18n Java
Document52 pages
I18n Java
api-3715867
No ratings yet
Chapter 2 Part 1 & 2
Document58 pages
Chapter 2 Part 1 & 2
S J
No ratings yet
Dscript PDF
Document9 pages
Dscript PDF
saironwe
No ratings yet
Introduction To L Tex: A Document Preparation System: Produced With L TEX by G Baker and G Moloney
Document34 pages
Introduction To L Tex: A Document Preparation System: Produced With L TEX by G Baker and G Moloney
bizhanj
No ratings yet
A Shallow Introduction To The K Programming Language - Kuro5hin
Document22 pages
A Shallow Introduction To The K Programming Language - Kuro5hin
lycancapital
No ratings yet
TechnicalCompound Nouns
Document3 pages
TechnicalCompound Nouns
arun.dhanapalan11
No ratings yet
Schmid Tokenising - CorpusLinguistics IntHbk
Document17 pages
Schmid Tokenising - CorpusLinguistics IntHbk
Luca Boetti
No ratings yet
Algol W: Jump To Navigationjump To Search
Document4 pages
Algol W: Jump To Navigationjump To Search
Claude Vinyard
No ratings yet
Lecture 2 o
Document79 pages
Lecture 2 o
furkangokaypk26
No ratings yet
LLODREAM Presentation
Document18 pages
LLODREAM Presentation
Anas Fahad Khan
No ratings yet
oldEnglighAlphabet Accentcodes
Document8 pages
oldEnglighAlphabet Accentcodes
Margaret Kitty
No ratings yet
1 The Mongolian Language: ??) Can Also Be Typeset by Using The ??)
Document13 pages
1 The Mongolian Language: ??) Can Also Be Typeset by Using The ??)
Gilmer Calderon Quispe
No ratings yet
CH-02 Text
Document54 pages
CH-02 Text
محمد صباح
No ratings yet
Regular Expression - Sentence Segment
Document46 pages
Regular Expression - Sentence Segment
naruto sasuke
No ratings yet
Phonetic
Document2 pages
Phonetic
altoquez
No ratings yet
Regular Expressions: Exceptions in A Character Set
Document10 pages
Regular Expressions: Exceptions in A Character Set
deartest
No ratings yet
Qbasic
Document32 pages
Qbasic
Haliu Hakorede
No ratings yet
Cdrfonts PDF
Document0 pages
Cdrfonts PDF
Bertrand Tchuente
No ratings yet
An Introduction To Unicode - The Trainer's Friend
Document52 pages
An Introduction To Unicode - The Trainer's Friend
Leandro Gabriel López
No ratings yet
Haskell and Yesod
Document265 pages
Haskell and Yesod
g_teodorescu
100% (1)
15-583:algorithms in The Real World: Data Compression I - Introduction - Information Theory - Probability Coding
Document33 pages
15-583:algorithms in The Real World: Data Compression I - Introduction - Information Theory - Probability Coding
Sanjeev Gupta
No ratings yet
LINGUA1767 With Cover Page v2
Document21 pages
LINGUA1767 With Cover Page v2
zzzzz zzzzz zzzzz
No ratings yet
E-Science E-Business E-Government and Their Technologies: Core XML
Document195 pages
E-Science E-Business E-Government and Their Technologies: Core XML
to_alk
No ratings yet
Howto Unicode
Document9 pages
Howto Unicode
domiel
No ratings yet
Text Preprocessing
Document59 pages
Text Preprocessing
saisuraj1510
No ratings yet
JB Guidelines Manuscript Submission Chicago
Document5 pages
JB Guidelines Manuscript Submission Chicago
Camila Lopez
No ratings yet
Building Blocks
Document13 pages
Building Blocks
Shreya Sahni
No ratings yet
Making Secret Codes
From Everand
Making Secret Codes
Jillian Gregory
No ratings yet
DB2 11 for z/OS: Intermediate Training for Application Developers
From Everand
DB2 11 for z/OS: Intermediate Training for Application Developers
Robert Wingate
No ratings yet
TA10-Unit 1
Document59 pages
TA10-Unit 1
Phuong Phan
No ratings yet
Stylistics
Document17 pages
Stylistics
Teodora Alexandra
100% (1)
Answer The Given Worksheets.: The Learner Demonstrates Understanding of
Document6 pages
Answer The Given Worksheets.: The Learner Demonstrates Understanding of
Tane MB
No ratings yet
MKT Ic4e Sampler 20191029
Document54 pages
MKT Ic4e Sampler 20191029
tyler_dennis_20
0% (1)
Word Order
Document15 pages
Word Order
Mariana Pasat
No ratings yet
CL 12
Document5 pages
CL 12
samwelntunga74
No ratings yet
Context Clues
Document25 pages
Context Clues
raffy capio
No ratings yet
Stages/ Time Content Activities Resource Set Induction (5 Minutes)
Document2 pages
Stages/ Time Content Activities Resource Set Induction (5 Minutes)
Zechrist Zechariah
No ratings yet
Phraseology and Style in Subgenres of The Novel A Synthesis of Corpus and Literary Perspectives 1St Ed 2020 Edition Iva Novakova All Chapter
Document68 pages
Phraseology and Style in Subgenres of The Novel A Synthesis of Corpus and Literary Perspectives 1St Ed 2020 Edition Iva Novakova All Chapter
tamara.aguinaga853
100% (9)
The Syntax of The Simple Sentence
Document34 pages
The Syntax of The Simple Sentence
Marian Roman
No ratings yet
Philosophy Now
Document68 pages
Philosophy Now
charid
100% (1)
Theory and Social Research - Neuman - Chapter2
Document24 pages
Theory and Social Research - Neuman - Chapter2
Madie Chua
50% (2)
EAPPG11 - q1 - Mod1 - Reading For Acadtext - v2
Document33 pages
EAPPG11 - q1 - Mod1 - Reading For Acadtext - v2
Emer Perez
100% (1)
English Day 1
Document4 pages
English Day 1
joann
No ratings yet
English Yt Sem 1 Final
Document8 pages
English Yt Sem 1 Final
Gaming with Devil
No ratings yet
DOC-baiyyinah Notes
Document150 pages
DOC-baiyyinah Notes
ruqayya muhammed
No ratings yet
Neurolinguis 5
Document345 pages
Neurolinguis 5
Melandes Tamiris
No ratings yet
Speech Competition Rubric
Document4 pages
Speech Competition Rubric
Whitley Patadungan
No ratings yet
Improving Students' Speaking Skill Through Debate Technique (Classroom Action Research at Sman 19 Kab Tangerang) PDF
Document214 pages
Improving Students' Speaking Skill Through Debate Technique (Classroom Action Research at Sman 19 Kab Tangerang) PDF
Laesa Oktavia
No ratings yet
On Adaptation of English Nominal Compounds in The Serbian Language With Reference To Word-Formation Types
Document8 pages
On Adaptation of English Nominal Compounds in The Serbian Language With Reference To Word-Formation Types
Ijahss Journal
No ratings yet
Modal Auxiliaries in Phraseology: A Contrastive Study of Learner English and NS English
Document27 pages
Modal Auxiliaries in Phraseology: A Contrastive Study of Learner English and NS English
Sidra Khawaja
No ratings yet
A слова
Document297 pages
A слова
ailisha libra
No ratings yet
2AC: AT: Speed Kritik: Psychology Today
Document2 pages
2AC: AT: Speed Kritik: Psychology Today
BLAGMISH
No ratings yet
San Pablo City Integrated HS Grade 10 Shiela P. Bagsit English April 11, 2022 Fourth Quarter 8:00am-9:00pm 1 HR
Document6 pages
San Pablo City Integrated HS Grade 10 Shiela P. Bagsit English April 11, 2022 Fourth Quarter 8:00am-9:00pm 1 HR
Shiela Peregrin Bagsit
No ratings yet
中国人写英文文章最常犯的错误总结
Document24 pages
中国人写英文文章最常犯的错误总结
Thomas
No ratings yet
Family Life A. Aims 1. Knowledge: - Grammar: Future Forms - Pronunciation: Sentence Stress
Document8 pages
Family Life A. Aims 1. Knowledge: - Grammar: Future Forms - Pronunciation: Sentence Stress
Viktoria Hrankina
100% (1)
CBLM 1 Participate in Workplace Communication
Document96 pages
CBLM 1 Participate in Workplace Communication
Orlando Najera
No ratings yet
Semantics (R-E Anul Iv) : Lexical Relations
Document44 pages
Semantics (R-E Anul Iv) : Lexical Relations
irinachircev
No ratings yet

Inverted Index

Uploaded by

Copyright:

Available Formats

You might also like

Inverted Index

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Inverted Index

Uploaded by

Copyright:

Available Formats

Inverted index

 The key data structure underlying modern IR

 Identify each doc by a docID, a document serial number.

 We need variable-size postings lists

 In memory, can use linked lists or variable length arrays

Inverted index construction

Input: “Friends, Romans and Countrymen”

Output: Tokens Friends Romans Countrymen

- A token is an instance of a sequence of characters

 Finland‟s capital : Finland AND s? Finlands? Finland‟s?

 Hewlett-Packard: Hewlett and Packard as two tokens?

How do you decide it is one token?

 3/20/91 Mar. 12, 1991 20/3/91

Tokenization: language issues

L'ensembleone token or two?

Want l‟ensemble to match with un ensemble

Until at least 2003, it didn‟t on Google Internationalization!

German noun compounds are not segmented

„life insurance company employee‟

German retrieval systems benefit greatly from a compound splitter module

Can give a 15% performance boost for German

Chinese and Japanese have no spaces between

Not always guaranteed a unique tokenization

Further complicated in Japanese, with multiple alphabets intermingled

Dates/amounts in multiple formats

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!

Arabic (or Hebrew) is basically written right to left,

but with certain items like numbers written left to right

„Algeria achieved its independence in 1962 after 132 years of French

-- Reduce terms to their “roots” before indexing

-“Stemming” suggests crude affix chopping-Language dependent

e.g., automate(s), automatic, automation all reduced to automat

- Phrase queries: “King of Denmark”

- Various song titles, etc.: “Let it be”, “To be or not to

- “Relational” queries: “flights to London”

Reduce all letters to lower case

exception: upper case in mid-sentence? e.g., General

- Reduce inflectional/variant forms to base form

- E.g., am, are, is ===be

- car, cars, car's, cars'===car

- the boy's cars are different colors===the boy car be different

- Lemmatization implies doing “proper” reduction to dictionary

 Thesauri and soundex

Do we handle synonyms and homonyms?

E.g., by hand-constructed equivalence classes

synonyms=== car = automobile

homonyms=== color = colour

We can rewrite to form equivalence-class terms

When the document contains automobile, index it under car

Or we can expand a query When the query contains automobile, look

synonyms==="good" and "excellent"

You might also like