02 Information Retrieval - Boolean Retrieval

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 193

Ghislain Fourny

Information Retrieval
2. Boolean Retrieval
Data Shapes

2
Data Shapes: Tables

3
Data Shapes: Trees

4
Data Shapes: Graphs

5
Data Shapes: Cubes

6
Data Shapes: Unstructured text
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam vel erat nec dui
aliquet vulputate sed quis nulla. Donec eget ultricies magna, eu dignissim elit.
Nullam sed urna nec nisl rhoncus ullamcorper placerat et enim. Integer varius
ornare libero quis consequat. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Aenean eu efficitur orci. Aenean ac posuere tellus. Ut id
commodo turpis.

Praesent nec libero metus. Praesent at turpis placerat, congue ipsum eget,
scelerisque justo. Ut volutpat, massa ac lacinia cursus, nisl dui volutpat arcu,
quis interdum sapien turpis in tellus. Suspendisse potenti. Vestibulum pharetra
justo massa, ac venenatis mi condimentum nec. Proin viverra tortor non orci
suscipit rutrum. Phasellus sit amet euismod diam. Nullam convallis nunc sit
amet diam suscipit dapibus. Integer porta hendrerit nunc. Quisque pharetra
congue porta. Suspendisse vestibulum sed mi in euismod. Etiam a purus
suscipit, accumsan nibh vel, posuere ipsum. Nulla nec tempor nibh, id
venenatis lectus. Duis lobortis id urna eget tincidunt.
7
Information Retrieval

Finding unstructed data

8
Information Retrieval

Finding unstructed data

that satisfies an information need

9
Information Retrieval

Finding unstructed data

that satisfies an information need

from within large collections


10
First, naive approach

11
Example of corpus
CHAPTER I holmes.txt (3 MB)
Mr. Sherlock Holmes
Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was up all night, was
seated at the breakfast table. I stood upon the hearth-rug and picked
up the stick which our visitor had left behind him the night before.
It was a fine, thick piece of wood, bulbous-headed, of the sort which
is known as a "Penang lawyer." Just under the head was a broad silver
band nearly an inch across. "To James Mortimer, M.R.C.S., from his
friends of the C.C.H.," was engraved upon it, with the date "1884."
It was just such a stick as the old-fashioned family practitioner
used to carry--dignified, solid, and reassuring.
"Well, Watson, what do you make of it?"
Holmes was sitting with his back to me, and I had given him no sign
of my occupation.
"How did you know what I was doing? I believe you have eyes in the
back of your head."
12
Example

Q: lawyer

13
Naive approach

Q: lawyer

14
Example of corpus
CHAPTER I holmes.txt (3 MB)
Mr. Sherlock Holmes
Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was up all night, was
seated at the breakfast table. I stood upon the hearth-rug and picked
up the stick which our visitor had left behind him the night before.
It was a fine, thick piece of wood, bulbous-headed, of the sort which
is known as a "Penang lawyer." Just under the head was a broad silver
band nearly an inch across. "To James Mortimer, M.R.C.S., from his
friends of the C.C.H.," was engraved upon it, with the date "1884."
It was just such a stick as the old-fashioned family practitioner
used to carry--dignified, solid, and reassuring.
"Well, Watson, what do you make of it?"
Holmes was sitting with his back to me, and I had given him no sign
of my occupation.
"How did you know what I was doing? I believe you have eyes in the
back of your head."
15
Result
Q: lawyer
CHAPTER I
Mr. Sherlock Holmes
Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was up all night, was
seated at the breakfast table. I stood upon the hearth-rug and picked
up the stick which our visitor had left behind him the night before.
It was a fine, thick piece of wood, bulbous-headed, of the sort which
is known as a "Penang lawyer." Just under the head was a broad silver
band nearly an inch across. "To James Mortimer, M.R.C.S., from his
friends of the C.C.H.," was engraved upon it, with the date "1884."
It was just such a stick as the old-fashioned family practitioner
used to carry--dignified, solid, and reassuring.
"Well, Watson, what do you make of it?"
Holmes was sitting with his back to me, and I had given him no sign
of my occupation.
"How did you know what I was doing? I believe you have eyes in the
back of your head."
16
Result
Q: lawyer
CHAPTER I
Mr. Sherlock Holmes
Mr. Sherlock Holmes, who was usually very late in the mornings, save
upon those not infrequent occasions when he was up all night, was
seated at the breakfast table. I stood upon the hearth-rug and picked
up the stick which our visitor had left behind him the night before.
It was a fine, thick piece of wood, bulbous-headed, of the sort which
is known as a "Penang lawyer." Just under the head was a broad silver
band nearly an inch across. "To James Mortimer, M.R.C.S., from his
friends of the C.C.H.," was engraved upon it, with the date "1884."
It was just such a stick as the old-fashioned family practitioner
used to carry--dignified, solid, and reassuring.
"Well, Watson, what do you make of it?"
Holmes was sitting with his back to me, and I had given him no sign
of my occupation.
"How did you know what I was doing? I believe you have eyes in the
back of your head."
17
Naive approach
$ grep lawyer corpus.txt

18
Naive approach
$ grep lawyer corpus.txt

William Whyte was. Some pragmatical seventeenth century lawyer, I


He was a lawyer. That sounded ominous. What was the relation between
"Her banker or her lawyer. There is that double possibility. But I am
"To an English lawyer named Norton."
and send down to Fordham, the Horsham lawyer.'
"I did as he ordered, and when the lawyer arrived I was asked to step
"I signed the paper as directed, and the lawyer took it away with
violin-player, boxer, swordsman, lawyer, and self-poisoner by cocaine
which was a Penang-lawyer weighted with lead, was just such a weapon
a lawyer with a good practice. They had one child, but the yellow
has some claim on half Cunningham's estate, and the lawyers have been
lawyer whose name was given in the paper. There we met two gentlemen,
as he followed the argument of the lawyer. At least, I thought, when
is known as a "Penang lawyer." Just under the head was a broad silver
Holmes to the porter. "A lawyer, is he not, gray-headed, and walks
Reilly the lawyer and take the defense upon myself. Take my word for
"Exactly," said I. "A plausible lawyer could make it out as an act of
daughter of Augusto Barelli, who was the chief lawyer and once the
I showed it to Mr. Sutro, my lawyer, who lives in Harrow. He said to
lawyer of yours a capable man?"
This Sutro, of course, is her lawyer. I made a mistake, I fear, in
gentleman, who introduced himself as the lawyer, together with a
"I only heard of it this morning," the lawyer explained.
I have been recommended to you by my lawyers, but indeed the matter
to the American lawyer. ... Very good. Good-bye!"
lawyer burst excitedly into the room.
"I can only suppose that this American lawyer put it in himself. What
say nothing, as her lawyer had advised her to reserve her defence. We

19
Naive approach

Q: lawyer AND Penang

20
Naive approach
$ grep lawyer corpus.txt | grep Penang

21
Naive approach
$ grep lawyer corpus.txt | grep Penang

which was a Penang-lawyer weighted with lead, was just such a weapon
is known as a "Penang lawyer." Just under the head was a broad silver

22
Naive approach

Q: lawyer AND
Penang AND
NOT silver

23
Naive approach
$ grep lawyer corpus.txt | grep Penang | grep –v silver

24
Naive approach
$ grep lawyer corpus.txt | grep Penang | grep –v silver

which was a Penang-lawyer weighted with lead, was just such a weapon

25
Grepping

Regular expressions

foo|bar[a-z]+

Wildcards

.*

26
Grepping: shortcomings (1)

27
Grepping: shortcomings (1)

What if we want
to search
an extremely large
collection:

The Web

?
28
Picture: 123RF / shtanzman
Grepping: shortcomings

To do it right, he told Page, you'd really have to capture a significant chunk


of the World Wide Web's link structure.

Steven Levy
In The Plex: How Google Thinks, Works, and Shapes Our Lives

29
Grepping: shortcomings

To do it right, he told Page, you'd really have to capture a significant chunk


of the World Wide Web's link structure.

Page said, sure, he'd go and download the web and get the structure.

Steven Levy
In The Plex: How Google Thinks, Works, and Shapes Our Lives

30
Grepping: shortcomings

To do it right, he told Page, you'd really have to capture a significant chunk


of the World Wide Web's link structure.

Page said, sure, he'd go and download the web and get the structure.

He figured it would take a week or something.

Steven Levy
In The Plex: How Google Thinks, Works, and Shapes Our Lives

31
Grepping: shortcomings

To do it right, he told Page, you'd really have to capture a significant chunk


of the World Wide Web's link structure.

Page said, sure, he'd go and download the web and get the structure.

He figured it would take a week or something.

“And of course,” he later recalled, “it took, like, years.”

Steven Levy
In The Plex: How Google Thinks, Works, and Shapes Our Lives

32
Grepping: shortcomings (2)

33
Grepping: shortcomings (2)

Proximity search?

Zurich NEAR Switzerland

34
Grepping: shortcomings (3)

35
Grepping: shortcomings (3)
How do we rank?

36
Indices

37
Usefulness of indices

Indices are everywhere

Look at the last pages of books

Steven Levy
In The Plex: How Google Thinks, Works,
and Shapes Our Lives 38
Document

39
Document

Documents
40
Document

Documents
41
Term

42
Term

Sherlock
lawyer
Switzerland
Unterwalden nid dem Wald
ETH Zürich
person
watch
run
paper
book
...

43
Incidence Matrix

Documents

contain (or not)

Terms
44
Incidence Matrix

Documents

1 contain (or not) 0


Terms
45
Incidence Matrix

46
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

47
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

48
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

49
Boolean Query

50
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

51
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

52
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

53
Boolean Query

NOT u

54
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

55
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

56
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

NOT

NOT u

57
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

NOT u

58
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

NOT u

59
Boolean Query

u AND x

60
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

61
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u
Terms

62
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u
Terms

AND

u AND x
63
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u AND x
64
Boolean Query

u OR x

65
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

66
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u
Terms

67
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u
Terms

OR

u OR x
68
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10

u AND x
69
Effectiveness of an IR system

70
Information need
Impact of global warming on the
development of butterfly ecosystems in
Southern part of the Canton of Uri
between 2011 and 2013

71
Information need
Impact of global warming on the
development of butterfly ecosystems in
Southern part of the Canton of Uri
between 2011 and 2013

$ butterfly AND climate AND uri AND south

$ lepidoptera AND ur AND NOT winter

$ (central NEAR (switzerland OR sweden))


AND NOT (lu OR nw OR ow OR stockholm)
AND warmth

72
Results

73
Results

74
Results
Relevant Not relevant
Returned results

75
Results
Relevant Not relevant
Returned results

Precision =

76
Results
Relevant Not relevant
Returned results

true positives

Precision = returned

77
Results
Relevant Not relevant
Returned results

Precision =

78
Results
Relevant Not relevant
Returned results

Precision = 50%

79
Results
Relevant Not relevant
Returned results

Precision =

80
Results
Relevant Not relevant
Returned results

Precision = 100%

81
Results
Relevant Not relevant
Returned results

Precision =

82
Results
Relevant Not relevant
Returned results

Precision = 0%

83
Results
Relevant Not relevant
Returned results

Precision =

84
Results
Relevant Not relevant
Returned results

Precision = 80%

85
Results
Relevant Not relevant
Positives

86
Results
Relevant Not relevant
Positives
Negatives

87
Results
Relevant
Positives

Recall =
Negatives

88
Results
Relevant
Positives

true positives
Recall =
relevant
Negatives

89
Results
Relevant
Positives

Recall = 50%
Negatives

90
Results
Relevant
Positives

Recall = 100%
Negatives

91
Results
Relevant
Positives

Recall = 0%
Negatives

92
Results
Relevant
Positives

Recall = 67%
Negatives

93
Type I vs. Type II error
Relevant Not relevant

Type I error
Positives

(false positives)

Type II error
Negatives

(false negatives)

94
High precision
Relevant Not relevant
Positives
Negatives

95
High recall
Relevant Not relevant
Positives
Negatives

96
Compromise

Compromise

High precision High recall

97
Inverted indices

98
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

99
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

w Typically 500,000

100
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

w Typically 500,000

101
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

w Typically 500,000

x 500 billion booleans!


y

102
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

w Typically 500,000

x 500 billion booleans!


y
How many zeros?
103
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

Typically 1,000 terms


Typically 500,000 per document
w

y
How many zeros?
104
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

one billion 1s
w Typically 500,000

y
How many zeros?
105
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

one billion 1s
w Typically 500,000
499 billion 0s

y
How many zeros?
106
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

w Typically 500,000

x 99.8% empty!
y

107
Incidence Matrix
Documents
1 2 3 4 5 6 7 8 9 10
t
Typically one million
u

v
Terms

w Typically 500,000

x 99.8% empty!
y

This is space-inefficient.
108
Can we store this in a better way?
Documents

01 2 3 4 5 6 7 8 9 10
1
t 1 0 0 1 0 0 0 0 1 1
u B0 0 0 0 1 1 1 1 0 0C
B C
v B0 1 0 1 0 1 0 1 0 1C
Terms

B C
w B0 0 0 0 1 0 0 0 0 0C
B C
x @1 0 1 1 0 0 1 0 0 0A
y 0 0 0 0 1 0 0 1 0 1

109
Can we store this in a better way?
Documents
1 2 3 4 5 6 7 8 9 10
t

v
Terms

110
Can we store this in a better way?
Documents
1 2 3 4 5 6 7 8 9 10
t 1, 4, 9, 10

u 5, 6, 7

v
Terms

2, 4, 6, 8, 10

w 5

x 1, 3, 4, 7

y 5, 8, 10

111
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7

v 2, 4, 6, 8, 10
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

112
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7

v 2, 4, 6, 8, 10
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

Vocabulary
113
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7 Posting
v 2, 4, 6, 8, 10
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

114
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7 Posting Assumption:
Each document has a Document ID
v 2, 4, 6, 8, 10 (docID)
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

115
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7

v 2, 4, 6, 8, 10 Postings list
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

116
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7

v 2, 4, 6, 8, 10 Postings
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

117
Inverted index (inverted file)

t 1, 4, 9, 10

u 5, 6, 7

v 2, 4, 6, 8, 10 Dictionary
Terms

w 5

x 1, 3, 4, 7

y 5, 8, 10

118
Document frequency

t 1, 4, 9, 10 4
u 5, 6, 7 3
v 2, 4, 6, 8, 10 5
Terms

Count
w 5 1
x 1, 3, 4, 7 4
y 5, 8, 10 3

119
Document frequency

t 1, 4, 9, 10 4
u 5, 6, 7 3
v 2, 4, 6, 8, 10 5
Terms

Most frequent term


Count
w 5 1
x 1, 3, 4, 7 4
y 5, 8, 10 3

120
Building an inverted index

Document 1
You come most carefully upon your hour.

Document 2
Take thy fair hour, Laertes; time be thine,

Document 3
My hour is almost come,

Document 4
Possess it merely. That it should come to this!

121
Tokenize
You come most carefully upon your hour

Take fair thy hour Laertes time be thine

My hour is almost come

Possess it merely That it should come to this

122
Linguistic pre-processing
you come most carefully upon your hour

take fair thy hour Laertes time be thine

my hour is almost come

Possess it merely that it should come to this

123
Assign document IDs
you 1 take 2 my 3 possess 4
come 1 fair 2 hour 3 it 4
most 1 thy 2 is 3 merely 4
carefully 1 hour 2 almost 3 that 4
upon 1 Laertes 2 come 3 it 4
your 1 time 2 should 4
hour 1 be 2
come 4
thine 2
to 4
this 4

124
Sort
almost 3 is 3 take 2

be 2 it 4 that 4
carefully 1 it 4 thine 2

come 1 this 4
Laertes 2
come 4 thy 2
merely 4
come 3 time 2
most 1 to 4
fair 2
my 3 upon 1
hour 3
possess 4 you 1
hour 1
should 4 your 1
hour 2

125
Merge
almost 3 is 3 take 2

be 2 that 4
carefully 1 it 4 thine 2

come 1 3 4 this 4
Laertes 2
thy 2
merely 4
time 2
most 1 to 4
fair 2
my 3 upon 1
hour 1 2 3
possess 4 you 1
should 4 your 1

126
Merge
almost 3 Laertes 2
thine 2
be 2 merely 4 this 4
carefully 1 most 1 thy 2
come 1 3 4 my 3 time 2
fair 2 possess 4 to 4
hour 1 2 3 upon 1
should 4
is 3 you 1
take 2
your 1
it 4 that 4

127
Add document frequency
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

128
Single word query

hour

129
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

130
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

131
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4
Document 1
You come most carefully upon your hour.
Results
Document 2
Take thy fair hour, Laertes; time be thine,
Document 3
132
My hour is almost come,
Boolean (conjunctive) query

hour AND come

133
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

134
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

135
Single word query

come 3 1 3 4

hour 3 1 2 3

136
Single word query

come 3 1 3 4

Intersect 1 3
hour 3 1 2 3

137
Single word query

come 3 1 3 4

Intersect 1 3
hour 3 1 2 3

Document 1
You come most carefully upon your hour.
Results
Document 3
My hour is almost come,
138
Boolean (conjunctive) query

hour OR this

139
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

140
Single word query
almost 1 3 Laertes 1 2
thine 1 2
be 1 2 merely 1 4 this 1 4
carefully 1 1 most 1 1 thy 1 2
come 3 1 3 4 my 1 3 time 1 2

fair 1 2 possess 1 to 1 4
4
hour 3 2 3 upon 1 1
1 should 1 4
is 1 3 you 1 1
take 1 2
your 1 1
it 1 4 that 1 4

141
Single word query

hour 3 1 2 3

this 1 4

142
Single word query

hour 3 1 2 3
Union
this 1 4

143
Single word query

hour 3 1 2 3
Union 1 2 3 4
this 1 4

144
Single word query

hour 3 1 2 3
Union 1 2 3 4
this 1 4

Document 1
You come most carefully upon your hour.
Document 2
Take thy fair hour, Laertes; time be thine,
Results Document 3
My hour is almost come,
Document 4
Possess it merely. That it should come to this! 145
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

146
Intersection algorithm

List A 1 2 4 5 8 9 10 12

Initialization:
two pointers
List B 1 3 4 6 7 8 11 12

147
Intersection algorithm

List A 1 2 4 5 8 9 10 12

Initialization:
two pointers
List B 1 3 4 6 7 8 11 12

Intersection of A and B

148
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B

149
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

150
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

151
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

152
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

153
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

154
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1

155
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4

156
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4

157
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4

158
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4

159
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8

160
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8

161
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8

162
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8

163
Intersection algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8 12

164
Intersection algorithm

List A 1 2 4 5 8 9 10 12

End
List B 1 3 4 6 7 8 11 12

Intersection of A and B 1 4 8 12

165
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

166
Union algorithm

List A 1 2 4 5 8 9 10 12

Initialization:
two pointers
List B 1 3 4 6 7 8 11 12

167
Union algorithm

List A 1 2 4 5 8 9 10 12

Initialization:
two pointers
List B 1 3 4 6 7 8 11 12

Union of A and B

168
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B

169
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1

170
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1

171
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2

172
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2

173
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3

174
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3

175
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4

176
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5

177
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6

178
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

179
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

8 180
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

9 8 181
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

10 9 8 182
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

11 10 9 8 183
Union algorithm

List A 1 2 4 5 8 9 10 12

List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

12 11 10 9 8 184
Union algorithm

List A 1 2 4 5 8 9 10 12

End
List B 1 3 4 6 7 8 11 12

Union of A and B 1 2 3 4 5 6 7

12 11 10 9 8 185
Optimizing

eth AND zurich AND bibliopole

The inverted index has the document frequencies!

186
Optimizing

eth AND zurich AND bibliopole

10,000 docs 10 docs

100,000 docs

187
Optimizing

bibliopole AND eth AND zurich


10 docs
10,000 docs

100,000 docs

Sort query terms by increasing document frequency

188
Optimizing

bibliopole AND eth AND zurich


10 docs
10,000 docs

100,000 docs
Maintain intermediate
result in memory

189
Optimizing

bibliopole AND eth AND zurich


10 docs
10,000 docs

100,000 docs
Maintain intermediate
result in memory

190
Optimizing

bibliopole AND eth AND zurich


10 docs
10,000 docs

100,000 docs
Maintain intermediate
result in memory

191
More complex queries

(epf OR eth) AND zurich AND bibliopole

We can approximate the (estimated) size of the postings list


by summing the two original lists

192
More complex queries

(epf OR eth) AND NOT zurich AND bibliopole

We can adapt the intersection algorithm


to "virtually" walk down the negated list

193

You might also like