Introduction to Telecom
Technologies (Telecom)
Getachew Mamo
Department of Information Technology
College of Engineering and Technology
Jimma University
E-mail: jissolution@yahoo.com
1
Chapter 1:
Understanding the Telecommunications
Revolution
2
Outline
Introduction
Changes in Telecommunications
3
Introduction
4
What is Telecommunications?
• The word telecommunications has its roots in
Greek and Latin:
– tele (Greek) means "over a distance," and
– communicare (Latin) means "the ability to share."
• Hence,
– telecommunications literally means "the sharing of
information over a distance."
5
TR vs. Database Retrieval
• Information
– Unstructured/free text vs. structured data
– Ambiguous vs. well-defined semantics
• Query
– Ambiguous vs. well-defined semantics
– Incomplete vs. complete specification
• Answers
– Relevant documents vs. matched records
7
TR is “Easy”!
• TR CAN be easy in a particular case
– Ambiguity in query/document is RELATIVE to the
database
– So, if the query is SPECIFIC enough, just one keyword
may get all the relevant documents
8
History of TR on One Slide
• Birth of TR
– 1945: V. Bush’s article “As We May Think”
– 1957: H. P. Luhn’s idea of word counting and matching
9
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application example: library search, Web search
10
Importance of Ad hoc Retrieval
• Directly manages any existing large collection of
information
• There are many, many “ad hoc” information needs
• A long-term information need can be satisfied
through frequent ad hoc retrieval
• Basic techniques of ad hoc retrieval can be used for
filtering and other “non-retrieval” tasks, such as
automatic summarization.
11
Formal Formulation of TR
• Vocabulary V = {w1, w2, …, wN} of a language
• Collection C = {d1, …, dk}
12
Computing R(q)
• Strategy 1: Document selection
– R(q) = {d ∈ C | f(d,q) = 1}, where f(d,q) ∈ {0,1} is an
indicator function or classifier
– System must decide if a doc is relevant or not
(“absolute relevance”)
13
Document Selection vs. Ranking
[Figure: the true relevant set R(q) vs. the system’s answer R’(q). Document selection applies a binary classifier f(d,q) ∈ {0,1} and returns an unordered set of documents; document ranking applies a scoring function f(d,q) and returns documents ordered by score (e.g., 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, …), letting the user decide how far down the list to read.]
14
Problems of Doc Selection
• The classifier is unlikely to be accurate
– “Over-constrained” query (terms are too specific): no
relevant documents found
– “Under-constrained” query (terms are too general):
over delivery
– It is extremely hard to find the right position between
these two extremes
• Theoretical justification: Probability Ranking Principle
[Robertson 77]
16
Probability Ranking Principle
[Robertson 77]
• As stated by Cooper
“If a reference retrieval system’s response to each request is a ranking of the
documents in the collection in order of decreasing probability of usefulness
to the user who submitted the request, where the probabilities are estimated
as accurately as possible on the basis of whatever data have been made available
to the system for this purpose, then the overall effectiveness of the system to its
users will be the best that is obtainable on the basis of that data.”
17
According to the PRP, all we need is a relevance measure
function f(q,d) which satisfies: for all q, d1, d2,
f(q,d1) > f(q,d2) iff p(R=1|q,d1) > p(R=1|q,d2)
18
Evaluation in Information
Retrieval
19
Evaluation Criteria
• Effectiveness/Accuracy
– Precision, Recall
• Efficiency
– Space and time complexity
• Usability
– How useful for real user tasks?
20
Methodology: Cranfield Tradition
• Laboratory testing of system components
– Precision, Recall
– Comparative testing
• Test collections
– Set of documents
– Set of questions
– Relevance judgments
21
The Contingency Table
Documents are split by relevance (relevant / not relevant) and by the
system’s action (retrieved / not retrieved):

Precision = Relevant Retrieved / Retrieved
Recall = Relevant Retrieved / Relevant
22
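A minimal sketch (not from the slides) of how precision and recall could be computed for an unordered retrieved set; the document ids and example sets are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for an unordered retrieved set."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = retrieved & relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 5 docs are relevant overall.
print(precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d4", "d5"}))
# -> (0.75, 0.6)
```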
How to measure a ranking?
• Compute the precision at every recall point
• Plot a precision-recall (PR) curve
[Figure: two precision-recall curves (precision on the y-axis, recall on the x-axis). Which ranking is better?]
23
Summarize a Ranking: MAP
• Given that n docs are retrieved
– Compute the precision (at rank) where each (new) relevant document
is retrieved => p(1),…,p(k), if we have k rel. docs
– E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2.
– If a relevant document never gets retrieved, we assume the precision
corresponding to that rel. doc to be zero
• Compute the average over all the relevant documents
– Average precision = (p(1)+…+p(k))/k
• This gives us (non-interpolated) average precision, which captures
both precision and recall and is sensitive to the rank of each relevant
document
• Mean Average Precision (MAP)
– MAP = arithmetic mean average precision over a set of topics
– gMAP = geometric mean average precision over a set of topics (more
affected by difficult topics)
24
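A small sketch of (non-interpolated) average precision and MAP as described above; the helper names and the example ranking are illustrative, not from the slides:

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one query.

    Precision is taken at the rank of each relevant document; relevant docs
    that are never retrieved contribute 0, as stated on the slide.
    """
    relevant = set(relevant)
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # p(i) at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: arithmetic mean of average precision over a set of topics.

    `runs` is a list of (ranked_docs, relevant_docs) pairs, one per topic.
    """
    aps = [average_precision(r, rel) for r, rel in runs]
    return sum(aps) / len(aps) if aps else 0.0

# First relevant doc at rank 2 -> p(1) = 1/2, as in the slide's example.
print(average_precision(["d9", "d1", "d4", "d2"], {"d1", "d2", "d3"}))
```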
Summarize a Ranking: NDCG
• What if relevance judgments are in a scale of [1,r]? r>2
• Cumulative Gain (CG) at rank n
– Let the ratings of the n documents be r1, r2, …rn (in ranked order)
– CG = r1+r2+…rn
• Discounted Cumulative Gain (DCG) at rank n
– DCG = r1 + r2/log2(2) + r3/log2(3) + … + rn/log2(n)
– We may use any base for the logarithm, e.g., base = b
– For rank positions at or below b, do not discount
• Normalized Discounted Cumulative Gain (NDCG) at rank n
– Normalize DCG at rank n by the DCG value at rank n of the ideal
ranking
– The ideal ranking would first return the documents with the highest
relevance level, then the next highest relevance level, etc
• NDCG is now quite popular in evaluating Web search
25
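A sketch of DCG/NDCG following the base-2 formula above; the ideal ranking is obtained by sorting the ratings in decreasing order, and the example ratings are made up:

```python
import math

def dcg(ratings):
    """DCG with base-2 discounting: r1 + r2/log2(2) + r3/log2(3) + ..."""
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(ratings, start=1))

def ndcg(ratings, all_ratings=None):
    """Normalize DCG by the DCG of the ideal ranking (highest ratings first)."""
    ideal = sorted(all_ratings if all_ratings is not None else ratings, reverse=True)
    ideal_dcg = dcg(ideal[:len(ratings)])
    return dcg(ratings) / ideal_dcg if ideal_dcg > 0 else 0.0

# Ratings of the top-5 returned docs on a 1..3 scale (3 = most relevant).
print(ndcg([3, 2, 3, 1, 2]))
```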
When There’s only 1 Relevant Document
• Scenarios:
– known-item search
– navigational queries
26
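The measure usually used in this single-answer setting is the reciprocal rank of the answer, averaged over queries (MRR, listed later under "What You Should Know"); a small sketch with made-up runs:

```python
def reciprocal_rank(ranked_docs, answer):
    """1/rank of the single relevant ('known item') document, 0 if not retrieved."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc == answer:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR: average reciprocal rank over a set of known-item queries."""
    return sum(reciprocal_rank(r, a) for r, a in runs) / len(runs)

print(mean_reciprocal_rank([(["d3", "d7", "d1"], "d7"),   # answer at rank 2 -> 0.5
                            (["d2", "d5", "d9"], "d2")])) # answer at rank 1 -> 1.0
```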
Precision-Recall Curve
Out of 4728 rel docs, we’ve got 3212 => Recall = 3212/4728
Precision@10docs: about 5.5 docs in the top 10 docs are relevant
Breakeven Point (prec = recall)
[Figure: a precision-recall curve, with precision on the y-axis and recall (0 to 1) on the x-axis.]
Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation
28
The Pooling Strategy
• When the test collection is very large, it’s impossible to
completely judge all the documents
• TREC’s strategy: pooling
– Appropriate for relative comparison of different systems
– Given N systems, take top-K from the result of each, combine them
to form a “pool”
– Users judge all the documents in the pool; unjudged documents
are assumed to be non-relevant
• Advantage: less human effort
• Potential problem:
– bias due to incomplete judgments (okay for relative comparison)
– Favors systems that contributed to the pool; when the collection is
reused, a new system’s performance may be under-estimated
• Reuse the data set with caution!
29
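A minimal sketch of pool construction as described above (the value of K and the example runs are illustrative):

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

system_a = ["d1", "d2", "d3", "d4"]
system_b = ["d3", "d5", "d1", "d6"]
print(build_pool([system_a, system_b], k=3))
# Only the pooled docs get judged; everything outside the pool is assumed non-relevant.
```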
User Studies
• Limitations of Cranfield evaluation strategy:
– How do we evaluate a technique for improving the interface of a search
engine?
– How do we evaluate the overall utility of a system?
• User studies are needed
• General user study procedure:
– Experimental systems are developed
– Subjects are recruited as users
– Variation can be in the system or the users
– Users use the system and user behavior is logged
– User information is collected (before: background, after: experience with
the system)
• Clickthrough-based real-time user studies:
– Assume clicked documents to be relevant
– Mix results from multiple methods and compare their clickthroughs
30
Common Components in a
TR System
31
Typical TR System Architecture
[Figure: documents and the query are processed by a tokenizer; the system returns results, and judgments on those results drive feedback that modifies the query.]
32
Text Representation/Indexing
• Making it easier to match a query with a document
• Query and document should be represented using
the same units/terms
• Controlled vocabulary vs. full text indexing
• Full-text indexing is more practically useful and has
proven to be as effective as manual indexing with
controlled vocabulary
33
What is a good indexing term?
• Specific (phrases) or general (single word)?
• Luhn found that words with middle frequency are
most useful
– Not too specific (low utility, but still useful!)
– Not too general (lack of discrimination, stop words)
– Stop word removal is common, but rare words are kept
34
Tokenization
• Word segmentation is needed for some languages
– Is it really needed?
35
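A rough sketch of the preprocessing implied here and on the previous slide (lower-casing, stop-word removal, and a crude suffix-stripping stand-in for a real stemmer such as Porter's); the stop-word list and stemming rules are illustrative only:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in"}  # tiny illustrative list

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lower-case, split on non-alphanumerics, drop stop words, stem what's left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

print(tokenize("Tokenization of the documents is needed for indexing"))
# -> ['tokenization', 'document', 'need', 'for', 'index']
```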
Relevance Feedback
[Figure: relevance feedback loop. The query goes to the retrieval engine, which scores the document collection and returns ranked results (d1 3.5, d2 2.4, …, dk 0.5). The user judges the results (d1 +, d2 -, d3 +, …, dk -), and the feedback component uses these judgments to produce an updated query.]
36
Pseudo/Blind/Automatic
Feedback
[Figure: pseudo/blind feedback loop. Same architecture as relevance feedback, but the top 10 results are simply assumed to be relevant (d1 +, d2 +, d3 +, …); the feedback component uses these automatic "judgments" to update the query, with no user involvement.]
37
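A rough sketch of the idea behind pseudo feedback: treat the top-k results as relevant and expand the query with their most frequent terms. The expansion size, k, and the toy documents are illustrative choices, not prescribed by the slide:

```python
from collections import Counter

def pseudo_feedback_query(query_terms, ranked_docs, k=10, n_expansion_terms=5):
    """Expand the query with frequent terms from the top-k (assumed relevant) docs.

    `ranked_docs` is a list of token lists, ordered by the initial retrieval score.
    """
    term_counts = Counter()
    for doc in ranked_docs[:k]:
        term_counts.update(doc)
    expansion = [t for t, _ in term_counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:n_expansion_terms]

docs = [["car", "used", "price", "dealer"], ["car", "mileage", "price"], ["bike", "price"]]
print(pseudo_feedback_query(["used", "car"], docs, k=2, n_expansion_terms=2))
# -> ['used', 'car', 'price', 'dealer']
```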
What You Should Know
• How TR is different from DB retrieval
• Why ranking is generally preferred to document selection
(justified by PRP)
• How to compute the major evaluation measure (precision,
recall, precision-recall curve, MAP, gMAP, breakeven
precision, NDCG, MRR)
• What is pooling
• What is tokenization (word segmentation, stemming, stop
word removal)
• What is relevance feedback; what is pseudo relevance
feedback
38
Overview of Retrieval Models
[Figure: overview diagram of retrieval models, organized around the notion of relevance.]
39
Retrieval Models: Vector Space
40
The Basic Question
Given a query, how do we know if document A
is more relevant than B?
41
Relevance = Similarity
• Assumptions
– Query and document are represented similarly
– A query can be regarded as a “document”
– Relevance(d,q) ≈ similarity(d,q)
42
Vector Space Model
• Represent a doc/query by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a high-dimensional space
– Element of vector corresponds to term weight
– E.g., d=(x1,…,xN), xi is “importance” of term i
43
VS Model: illustration
[Figure: documents D1–D11 and the query plotted in a three-dimensional term space with axes "Java", "Starbucks", and "Microsoft". Which documents are closest to the query?]
44
What the VS model doesn’t say
• How to define/select the “basic concept”
– Concepts are assumed to be orthogonal
45
What’s a good “basic concept”?
• Orthogonal
– Linearly independent basis vectors
– “Non-overlapping” in meaning
• No ambiguity
• Weights can be assigned automatically and hopefully
accurately
• Many possibilities: Words, stemmed words, phrases,
“latent concept”, …
46
How to Assign Weights?
• Very very important!
• Why weighting
– Query side: Not all terms are equally important
– Doc side: Some terms carry more information about contents
• How?
– Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
– TF normalization
47
TF Weighting
• Idea: A term is more important if it occurs more frequently in a
document
• Some formulas: Let f(t,d) be the frequency count of term t in
doc d
– Raw TF: TF(t,d) = f(t,d)
– Log TF: TF(t,d) = log f(t,d)
– Maximum frequency normalization: TF(t,d) = 0.5 + 0.5*f(t,d)/MaxFreq(d)
– “Okapi/BM25 TF”: TF(t,d) = k*f(t,d) / (f(t,d) + k*(1 - b + b*doclen/avgdoclen))
48
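These TF variants could be written as small functions like the sketch below; the parameter values k and b are illustrative (values around k ≈ 1.2–2.0 and b ≈ 0.75 are common):

```python
import math

def raw_tf(f):
    return f

def log_tf(f):
    return math.log(f) if f > 0 else 0.0

def max_freq_norm_tf(f, max_freq_in_doc):
    return 0.5 + 0.5 * f / max_freq_in_doc

def bm25_tf(f, doclen, avgdoclen, k=1.2, b=0.75):
    """Okapi/BM25-style TF: saturates with f and penalizes long documents."""
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))

# A term occurring 3 times in a document that is 1.5x the average length:
print(bm25_tf(3, doclen=150, avgdoclen=100))
```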
TF Normalization
• Why?
– Document length variation
– “Repeated occurrences” are less informative than the
“first occurrence”
49
TF Normalization (cont.)
[Figure: normalized TF (y-axis) as a function of raw TF (x-axis).]
50
Regularized/“Pivoted”
Length Normalization
[Figure: normalized TF (y-axis) vs. raw TF (x-axis) under pivoted length normalization.]
“Pivoted normalization”: Using avg. doc length to regularize normalization
51
IDF Weighting
• Idea: A term is more discriminative if it occurs in fewer
documents
• Formula:
IDF(t) = 1 + log(n/k)
n = total number of docs
k = # docs with term t (doc freq)
52
TF-IDF Weighting
• TF-IDF weighting : weight(t,d)=TF(t,d)*IDF(t)
– Common in doc => high TF => high weight
– Rare in collection => high IDF => high weight
53
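A sketch of TF-IDF weighting over a toy collection, combining raw TF with the IDF formula from the previous slide (IDF(t) = 1 + log(n/k)); the documents are made up:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    doc_freq = Counter()                      # k = number of docs containing the term
    for doc in docs:
        doc_freq.update(set(doc))
    idf = {t: 1 + math.log(n / k) for t, k in doc_freq.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)                     # raw TF
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vectors

docs = [["information", "retrieval", "information"],
        ["travel", "information", "map"],
        ["car", "dealer"]]
for v in tf_idf_vectors(docs):
    print(v)
```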
How to Measure Similarity?
Di = (wi1, …, wiN); Q = (wq1, …, wqN); w = 0 if a term is absent

Dot product similarity: sim(Q, Di) = sum over j=1..N of wqj * wij

Cosine: sim(Q, Di) = (sum over j=1..N of wqj * wij) /
(sqrt(sum of wqj^2) * sqrt(sum of wij^2))
54
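A sketch of the dot-product and cosine measures over sparse {term: weight} vectors like the ones produced above (the example vectors are made up):

```python
import math

def dot_product(q, d):
    """q, d: {term: weight} dicts; missing terms have weight 0."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def cosine(q, d):
    """Dot product divided by the product of the two vector lengths."""
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot_product(q, d) / (norm_q * norm_d)

q = {"information": 1.0, "retrieval": 1.0}
d = {"information": 2.4, "retrieval": 4.5, "search": 1.0}
print(dot_product(q, d), cosine(q, d))
```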
VS Example: Raw TF & Dot Product
query = "information retrieval"
[Figure: three example documents. doc1 contains the query terms "information" and "retrieval" (along with "search", "engine"): Sim(q,doc1) = 4.8*2.4 + 4.5*4.5. doc2 contains "information" together with "travel" and "map": Sim(q,doc2) = 2.4*2.4. doc3 contains no query terms: Sim(q,doc3) = 0.]
55
What Works the Best?
(Singhal 2001)
56
Relevance Feedback in VS
• Basic setting: Learn from examples
– Positive examples: docs known to be relevant
– Negative examples: docs known to be non-relevant
– How do you learn from this to improve performance?
57
Rocchio Feedback: Illustration
[Figure: Rocchio feedback in the vector space. The original query vector q is moved toward the centroid of the relevant (+) documents and away from the non-relevant (-) documents, yielding a modified query vector.]
58
Rocchio Feedback: Formula
qm = α*q + (β/|Dr|) * (sum of d over d in Dr) - (γ/|Dn|) * (sum of d over d in Dn)
where qm is the new query, α, β, γ are parameters, Dr is the set of known
relevant docs, and Dn is the set of known non-relevant docs
59
Rocchio in Practice
• Negative (non-relevant) examples are not very important
(why?)
• Often project the vector onto a lower dimension (i.e., consider
only a small number of words that have high weights in the
centroid vector) (efficiency concern)
• Avoid “training bias” (keep relatively high weight on the
original query weights) (why?)
• Can be used for relevance feedback and pseudo feedback
• Usually robust and effective
60
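A sketch of the Rocchio update on sparse term-weight vectors. The parameter values (alpha = 1, beta = 0.75, gamma = 0.15) are illustrative; in line with the slide, negative examples get a small weight, and terms whose weight drops to zero or below are pruned:

```python
from collections import defaultdict

def rocchio(query, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """q_new = alpha*q + beta*centroid(relevant) - gamma*centroid(non-relevant).

    All vectors are {term: weight} dicts.
    """
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant_docs:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant_docs)
    # Drop terms whose weight has become non-positive.
    return {t: w for t, w in new_q.items() if w > 0}

q = {"used": 1.0, "car": 1.0}
rel = [{"car": 2.0, "price": 1.0, "dealer": 1.0}]
nonrel = [{"car": 1.0, "railroad": 2.0}]
print(rocchio(q, rel, nonrel))
```

The same update can serve pseudo feedback by passing the top-ranked documents as `relevant_docs`.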
“Extension” of VS Model
• Alternative similarity measures
– Many other choices (tend not to be very effective)
– P-norm (Extended Boolean): matching a Boolean query
with a TF-IDF document vector
• Alternative representation
– Many choices (performance varies a lot)
– Latent Semantic Indexing (LSI) [TREC performance
tends to be average]
61
Advantages of VS Model
63
What You Should Know
• What is Vector Space Model (a family of models)
• What is TF-IDF weighting
• What is pivoted normalization weighting
• How Rocchio works
64
Roadmap
• This lecture
– Basic concepts of TR
– Evaluation
– Common components
– Vector space model
65