Chapter 5: Retrieval Evaluation


By Abdo Ababor

July, 2021
• Types of Evaluation Strategies
• Major evaluation criteria
• Difficulties in Evaluating IR System
• Measuring retrieval effectiveness
Why System Evaluation?
• It provides the ability to measure the difference between IR systems.
  – How well do our search engines work?
  – Is system A better than system B?
  – Under what conditions (relevancy, access time)?
• Evaluation drives what to research and what to improve in existing IR systems.
  – Identify techniques that work and techniques that do not.
• There are many retrieval models/algorithms/systems: which one is the best?
• What is the best component for:
  – Similarity measures (dot-product, cosine, …)
  – Index term selection (stop-word removal, stemming, …)
  – Term weighting (TF, TF-IDF, …)
Types of Evaluation Strategies
A. System-centered evaluation
  • Given documents, queries, and relevance judgments
  • Try several variations of the system
  • Measure which system returns the "best" matching list of documents
B. User-centered evaluation
  • Given several users and at least two IR systems
  • Have each user try the same task on each system
  • Measure which system works best for the users' information needs
  • How do we measure user satisfaction?
Major Evaluation Criteria
What are the main measures for evaluating an IR system's performance?
I. Efficiency: time and space
  • Speed in terms of retrieval time and indexing time
  • Speed of query processing
  • The space taken by the corpus vs. the index
    – Is there a need for compression?
    – Index size: index/corpus size ratio
II. Effectiveness
  • How capable is the system of retrieving relevant documents from the collection?
  • Is one system better than another?
  • User satisfaction: how "good" are the documents returned in response to the user's query?
  • "Relevance" of the results to the user's information need
Difficulties in Evaluating IR System
IR systems essentially facilitate communication between a user and a document collection.
• Relevance is a measure of the effectiveness of that communication.
  – Effectiveness is related to the relevancy of the retrieved items.
  – Relevance relates an information need (query) to a document or surrogate.
• Relevancy is typically not binary but continuous.
  – Even if relevancy is treated as binary, it is a difficult judgment to make.
  – Relevance is the degree of correspondence between a document and a query, as determined by the requester, an information specialist, an external judge, or other users.
Difficulties in Evaluating IR System (cont.)
• Relevance judgments may be made by:
  – the user who posed the retrieval problem, or
  – an external judge, an information specialist, or the system developer.
  – Are the judgments made by users, information specialists, and external judges the same? Why?
• A relevance judgment is usually:
  a. Subjective: depends upon a specific user's judgment.
  b. Situational: relates to the user's current needs.
  c. Cognitive: depends on human perception and behavior.
  d. Dynamic: changes over time.
Retrieval scenario
• Consider a scenario in which 13 results are retrieved by different systems for a given query.
[Figure: 13 ranked result positions (1-13); the markers showing which results are relevant are not recoverable from the text.]
Measuring Retrieval Effectiveness
• Retrieval of documents may result in:
  – False negatives (false drops): some relevant documents are not retrieved.
  – False positives: some irrelevant documents are retrieved.
• For many applications a good index should not permit any false drops, but may permit a few false positives.

                  relevant    irrelevant
  retrieved          A            B        B: "false positives" (Type I errors, errors of commission)
  not retrieved      C            D        C: "false negatives" (Type II errors, errors of omission)

• These counts underlie the metrics most often used to evaluate the effectiveness of the system.
Measuring Retrieval Effectiveness (cont.)

                  Relevant    Not relevant
  Retrieved          A             B          Collection size = A + B + C + D
  Not retrieved      C             D          Relevant = A + C;  Retrieved = A + B

  Recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|  = A / (A + C)
  Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}| = A / (A + B)

• When is precision important?
• When is recall important?
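As a minimal sketch (not part of the original slides), both measures follow directly from the A/B/C/D counts above; the sample counts at the bottom are invented for illustration:

```python
# Recall and precision from the contingency counts defined above.

def recall(a: int, c: int) -> float:
    """a = relevant and retrieved, c = relevant but not retrieved."""
    return a / (a + c) if (a + c) else 0.0

def precision(a: int, b: int) -> float:
    """a = relevant and retrieved, b = irrelevant but retrieved."""
    return a / (a + b) if (a + b) else 0.0

print(recall(4, 10))     # 4 / (4 + 10) ~ 0.286
print(precision(4, 6))   # 4 / (4 + 6)  = 0.4
```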
Example 1
Assume there are 14 relevant documents in the corpus. Compute precision and recall at the given cutoff points.

Hits 1-10
Recall      1/14  1/14  1/14  1/14  2/14  3/14  3/14  4/14  4/14  4/14
Precision   1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10

Hits 11-20
Recall      5/14  5/14  5/14  5/14  5/14  6/14  6/14  6/14  6/14  6/14
Precision   5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20

(In the original figure each hit is marked as a relevant or an irrelevant document; from the recall values, the relevant hits are at ranks 1, 5, 6, 8, 11, and 16.)
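A sketch of how the Example 1 table can be generated (the helper name is mine, not from the slides); the relevance flags are the ones reconstructed from the recall row:

```python
# Precision/recall at each cutoff k, given per-rank relevance flags.
# Relevant hits at ranks 1, 5, 6, 8, 11, 16; 14 relevant docs in the corpus.

def pr_at_cutoffs(rel_flags, total_relevant):
    hits, rows = 0, []
    for k, rel in enumerate(rel_flags, start=1):
        hits += rel
        rows.append((k, hits / total_relevant, hits / k))  # (cutoff, recall, precision)
    return rows

flags = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
for k, r, p in pr_at_cutoffs(flags, total_relevant=14):
    print(f"k={k:2d}  recall={r:.3f}  precision={p:.3f}")
```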
Example 2
Let the total number of relevant documents be 6. Compute recall and precision at each cutoff point n:

  n   doc #   relevant   Recall   Precision
  1   588        x       0.167      1
  2   589        x       0.333      1
  3   576
  4   590        x       0.5        0.75
  5   986
  6   592        x       0.667      0.667
  7   984
  8   988
  9   578
 10   985
 11   103
 12   591
 13   772        x       0.833      0.38
 14   990
Precision/Recall tradeoff
• Recall can be increased by retrieving many documents (down to a low level of relevance ranking), but many irrelevant documents would also be fetched, reducing precision.
• High recall (but low precision) can be obtained by simply retrieving all documents for all queries.

[Figure: precision (y-axis, 0 to 1) vs. recall (x-axis, 0 to 1). The ideal is both high recall and high precision; a high-precision/low-recall system returns relevant documents but misses many useful ones; a high-recall/low-precision system returns most relevant documents but includes lots of junk.]
Compare Two or More Systems
• The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: precision-recall curves for a system without stemming ("NoStem") and with stemming ("Stem"); precision on the y-axis (0 to 1), recall on the x-axis (0.1 to 1.0).]
Calculating Precision at Standard Recall Levels
Assume that there are a total of 10 relevant documents.

  Ranking        Relevant?   Recall   Precision
  1.  Doc. 50    Rel          10%      100%
  2.  Doc. 34    Not rel       ?         ?
  3.  Doc. 45    Rel          20%       67%
  4.  Doc. 8     Not rel       ?         ?
  5.  Doc. 23    Not rel       ?         ?
  6.  Doc. 16    Rel          30%       50%
  7.  Doc. 63    Not rel       ?         ?
  8.  Doc. 119   Rel          40%       50%
  9.  Doc. 10    Not rel       ?         ?
 10.  Doc. 2     Not rel       ?         ?
 11.  Doc. 9     Rel          50%       40%
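The slide leaves the "?" rows open; one common convention (an assumption here, not stated in the slides) is interpolation, where precision at a standard recall level is the maximum precision observed at any recall greater than or equal to that level:

```python
# Interpolated precision at chosen standard recall levels.

def interpolated_precision(points, levels):
    """points: (recall, precision) pairs observed where relevant docs appear."""
    return {lvl: max((p for r, p in points if r >= lvl), default=0.0)
            for lvl in levels}

# (recall, precision) at the ranks where a relevant document appears,
# taken from the table above (10 relevant documents in total):
observed = [(0.1, 1.00), (0.2, 0.67), (0.3, 0.50), (0.4, 0.50), (0.5, 0.40)]
print(interpolated_precision(observed, levels=[0.1, 0.2, 0.3, 0.4, 0.5]))
```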
Single-valued measures
• We may want a single value per query to evaluate performance.
• Such single-valued measures include:
  – Average precision: calculated by averaging precision at the points where recall increases.
  – Mean average precision (MAP)
  – R-precision, etc.
Average precision
• Average the precision values obtained at each retrieved relevant document.
• Relevant documents that are not retrieved contribute zero to the score.
• Example: assume a total of 16 relevant documents and compute the average precision.

Hits 1-10
Precision   1/1   1/2   1/3   1/4   2/5   3/6   3/7   4/8   4/9   4/10
Hits 11-20
Precision   5/11  5/12  5/13  5/14  5/15  6/16  6/17  6/18  6/19  6/20

(Relevant documents are retrieved at ranks 1, 5, 6, 8, 11, and 16.)

• Sum of precisions at the relevant hits: (1/1) + (2/5) + (3/6) + (4/8) + (5/11) + (6/16) = 1 + 0.4 + 0.5 + 0.5 + 0.4545 + 0.375 ≈ 3.23
• AP = 3.23 / 16 ≈ 0.202
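A small sketch (function name is mine, not from the slides) that reproduces the worked example above:

```python
# Average precision for one query: relevant hits at ranks 1, 5, 6, 8, 11, 16,
# with 16 relevant documents in the collection overall.

def average_precision(rel_flags, total_relevant):
    hits, ap_sum = 0, 0.0
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            ap_sum += hits / rank        # precision at each relevant hit
    return ap_sum / total_relevant       # unretrieved relevant docs add zero

flags = [1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(round(average_precision(flags, 16), 3))   # ~0.202
```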
MAP (Mean Average Precision)
• Computes the mean of the average precision over a set of test queries:

  MAP = (1/n) Σ over queries Qi [ (1/|Ri|) Σ over dj ∈ Ri ( j / rij ) ]

  where rij = rank of the j-th relevant document for Qi, |Ri| = number of relevant documents for Qi, and n = number of test queries.

• Example: assume that for queries 1 and 2 there are 3 and 2 relevant documents in the collection, respectively, retrieved at the following ranks:

  Relevant docs retrieved   Query 1   Query 2
  1st rel. doc.                1         4
  2nd rel. doc.                5         8
  3rd rel. doc.               10         -

  MAP = (1/2) [ (1/3)(1/1 + 2/5 + 3/10) + (1/2)(1/4 + 2/8) ] ≈ 0.41
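A sketch (not from the slides) of the MAP formula above, fed with the ranks of the relevant documents per query; it reproduces the two-query example:

```python
# MAP over several queries: Query 1 has relevant docs at ranks 1, 5, 10;
# Query 2 at ranks 4 and 8.

def mean_average_precision(rel_ranks_per_query):
    """rel_ranks_per_query: one list of 1-based ranks per test query."""
    ap_values = []
    for ranks in rel_ranks_per_query:
        ranks = sorted(ranks)
        ap = sum(j / r for j, r in enumerate(ranks, start=1)) / len(ranks)
        ap_values.append(ap)
    return sum(ap_values) / len(ap_values)

print(round(mean_average_precision([[1, 5, 10], [4, 8]]), 3))   # ~0.408
```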
R-Precision
• Precision at the R-th position in the ranking of results for a query, where R is the total number of relevant documents.
• Calculate precision after R documents are seen.
• Can be averaged over all queries.
• Example (R = number of relevant docs = 6):

  n   doc #   relevant
  1   588        x
  2   589        x
  3   576
  4   590        x
  5   986
  6   592        x
  7   984
  8   988
  9   578
 10   985
 11   103
 12   591
 13   772        x
 14   990

  R-Precision = 4/6 ≈ 0.67  (4 of the first 6 results are relevant)
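A sketch (not from the slides) of R-precision as defined above, reproducing the example (R = 6, relevant results at ranks 1, 2, 4, 6, and 13):

```python
# R-precision: precision over the first R results, where R is the number of
# relevant documents for the query.

def r_precision(rel_flags, total_relevant):
    return sum(rel_flags[:total_relevant]) / total_relevant

flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
print(round(r_precision(flags, 6), 2))   # 0.67
```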
More Examples
• Average precision is calculated by averaging precision at the points where recall increases.
[Figure: two example rankings, Ranking #1 and Ranking #2, for the same query; the per-rank details are not recoverable from the text.]
• Average precision at these recall values is 62.2% for Ranking #1 and 52.0% for Ranking #2. Using this measure, we can say that Ranking #1 is better than Ranking #2.
• Mean Average Precision (MAP): often we have a number of queries with which to evaluate a given system. For each query we can calculate average precision; taking the mean of those averages gives the Mean Average Precision (MAP), a very popular measure for comparing two systems.
• R-precision: defined as precision after R documents have been retrieved, where R is the total number of relevant documents for a given query.
• Average precision and R-precision are shown to be highly correlated. In the previous example, since the number of relevant documents is R = 5, the R-precision for both rankings is 0.4 (the precision after 5 documents are retrieved), but the average precision is 0.622 for Ranking #1 and 0.52 for Ranking #2.
F-Measure
• A single measure of performance that takes both recall and precision into account.
• Harmonic mean of recall and precision:

  F = 2PR / (P + R) = 2 / (1/R + 1/P)

• Compared to the arithmetic mean, both P and R need to be high for the harmonic mean to be high.
• What if no relevant documents exist?
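A minimal sketch (not from the slides) of the harmonic-mean F measure; returning 0 when P + R = 0 is one common convention, which also covers the "no relevant documents" question above:

```python
def f_measure(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.5, 0.4), 3))   # 0.444
```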
E-Measure
• Associated with van Rijsbergen.
• Allows the user to specify the relative importance of recall and precision.
• It is a parameterized F measure: a variant that allows weighting the emphasis on precision versus recall:

  E = (1 + β²) PR / (β²P + R) = (1 + β²) / (β²/R + 1/P)

• The value of β controls the trade-off:
  – β = 1: equal weight for precision and recall (E = F).
  – β > 1: weights recall more heavily.
  – β < 1: weights precision more heavily.
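A sketch (not from the slides) of the parameterized measure above; with beta = 1 it reduces to the plain F measure, larger beta leans toward recall, smaller beta toward precision:

```python
def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    # (1 + beta^2) * P * R / (beta^2 * P + R)
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

print(round(e_measure(0.5, 0.4, beta=1.0), 3))   # 0.444, same as F
print(round(e_measure(0.5, 0.4, beta=2.0), 3))   # 0.417, pulled toward recall (0.4)
```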
Problems with both precision and recall
• The number of irrelevant documents in the collection is not taken into account.
• Recall is undefined when there is no relevant document in the collection.
• Precision is undefined when no document is retrieved.

Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence (Miss) = non-retrieved relevant docs / relevant docs
• Noise = 1 − Precision;  Silence = 1 − Recall

  Miss    = |{Relevant} ∩ {NotRetrieved}| / |{Relevant}|
  Fallout = |{Retrieved} ∩ {NotRelevant}| / |{NotRelevant}|
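A final sketch (not from the slides): noise, silence/miss, and fallout expressed with the same A/B/C/D counts used earlier; the sample counts are invented:

```python
# A = relevant retrieved, B = irrelevant retrieved,
# C = relevant not retrieved, D = irrelevant not retrieved.

def noise(a, b):            # 1 - precision
    return b / (a + b) if (a + b) else 0.0

def silence(a, c):          # 1 - recall (also called miss)
    return c / (a + c) if (a + c) else 0.0

def fallout(b, d):          # irrelevant retrieved / all irrelevant
    return b / (b + d) if (b + d) else 0.0

print(noise(4, 6), round(silence(4, 10), 3), round(fallout(6, 980), 4))
```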
