Lecture 7


Natural Language Processing Applications

Lecture 7, Fabienne Venant, Université Nancy 2 / Loria

Information Retrieval

What is Information Retrieval?


Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Applications:
Library search: many universities and public libraries use IR systems to provide access to books, journals and other documents.
Web search: large volumes of unstable, unstructured data; speed is important.

Cross-language IR
Finding documents written in another language; touches on machine translation

....

Concerns
The set of texts can be very large, hence efficiency is a concern. Textual data is noisy, incomplete and untrustworthy, hence robustness is a concern. Information may be hidden:
Need to derive information from raw data
Need to derive information from vaguely expressed needs

IR Basic concepts
Information needs: queries and relevance
Indexing: helps speed up retrieval
Retrieval models: describe how to search and recover relevant documents
Evaluation: IR systems are large and convincing evaluation is tricky

Information needs

Information needs
INFORMATION NEED: the topic about which the user desires to know more
QUERY: what the user conveys to the computer in an attempt to communicate the information need
RELEVANCE: a document is relevant if it is one that the user perceives as containing information of value with respect to their personal information need
Ex: for the topic 'pipeline leaks', it does not matter whether relevant documents use those exact words or express the concept with other words such as 'pipeline rupture'.

Capturing information needs


Information needs can be hard to capture. One possibility: use natural language.
Advantage: expressive enough to allow all needs to be described
Drawbacks:
Semantic analysis of arbitrary NL is very hard
Users may not want to type full-blown sentences into a search engine

Queries

Queries
Information needs are typically expressed as a query:
Information need: Where shall I go on holiday? Query: holiday destinations

Two main types of possible queries


Information need: How much blood does the human heart pump in one minute?
Boolean query:
heart AND blood AND minutes
Web-type query:
human biology

Remarks
A query :
is usually quite short and incomplete;
may contain misspelled or poorly selected words;
may contain too many or too few words

The information need :


may be difficult to describe precisely, especially when the user isn't familiar with the topic

Precise understanding of the document content is difficult.

Persistent vs one-off Queries


Queries may or may not evolve over time.
Persistent queries:
predefined and routinely performed, e.g. 'Top ten performing shares today'
Continuous queries: persistent queries that allow users to receive new results when they become available

typical of Information extraction and News Routing systems

One-off (or ad-hoc) queries


created to obtain information as the need arises; typical of Web searching

Relevance
Relevance is subjective
python: ambiguous in general, but not for the user who issues it
Topicality vs. utility: a document is relevant with respect to a specific goal
A document is relevant if it addresses the stated information need, not because it just happens to contain all the words in the query.

Relevance is a gradual concept (a document is not just relevant or not; it is more or less relevant to a query). IR systems usually rank retrieved documents by relevance,
but many algorithms use a binary decision of relevance.

The big picture

Terminology
An IR system looks for data matching some criteria defined by the users in their queries. The language used to ask a question is called the query language. These queries use keywords (atomic items characterizing some data). The basic unit of data is a document (which can be a file, an article, a paragraph, etc.). A document corresponds to free text (it may be unstructured). All the documents are gathered into a collection (or corpus).

Searching for a given word in a document


One way to do that is to start at the beginning and to read through all the text
Pattern matching (regular expressions) plus the speed of modern computers makes grepping through text very effective

Enough for simple querying of modest collections (millions of words). But for many purposes, you do need more:
To process large document collections (billions or trillions of words) quickly.
To allow more flexible matching operations. For example, it is impractical to perform the query Romans NEAR countrymen with grep, where NEAR might be defined as 'within 5 words' or 'within the same sentence'.
To allow ranked retrieval: in many cases you want the best answer to an information need among many documents that contain certain words.

--> You need an index

Index

Motivation for Indexing


Extremely large datasets
Only a tiny fraction of the dataset is relevant to a given query
Speed is essential (0.25 seconds for web searching)
Indexing helps speed up retrieval

Indexing documents
How to relate the user's information need to some document's content? Idea: use an index to refer to documents. Usually an index is a list of terms that appear in a document; it can be represented mathematically as:
index : doc_i → ∪_j {keyword_j}
Here, the kind of index we use maps keywords to the list of documents they appear in:
index : keyword_j → ∪_i {doc_i}
We call this an inverted index.

Indexing documents

The set of keywords is usually called the dictionary (or vocabulary) A document identifier appearing in the list associated with a keyword is called a posting The list of document identifiers associated with a given keyword is called a posting list
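As a minimal sketch (in Python; the helper name and toy data are illustrative, not from the lecture), an inverted index is just a mapping from each term to a sorted postings list:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs it occurs in (its postings list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

docs = {1: "breakthrough drug for schizophrenia",
        2: "new schizophrenia drug"}
print(build_inverted_index(docs)["schizophrenia"])  # [1, 2]
```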

Inverted files
The most common indexing technique Source file: collection organised by documents Inverted file: collection organised by terms

Inverted Index
Given a dictionary of terms (also called a vocabulary or lexicon)
For each term, record in a list which documents the term occurs in
Each item in the list:
records that a term appeared in a document (and, later, often the positions in the document)
is conventionally called a posting

The list is then called a postings list (or inverted list)

Inverted Index

From Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schütze

Exercise
Draw the inverted index that would be built for the following document collection:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
For this document collection, what are the returned results for these queries?
1. schizophrenia AND drug
2. schizophrenia AND NOT(drug OR approach)

Indexing documents
Arising questions: how to build an index automatically? What are the relevant keywords? Some additional desiderata:
fast processing of large collections of documents,
flexible matching operations (robust retrieval),
the possibility to rank the retrieved documents in terms of relevance.

To ensure these requirements (especially fast processing) are fulfilled, the indexes are computed in advance. Note that the format of the index has a huge impact on the performance of the system.

Indexing documents
NB: an index is built in 4 steps:
1. Gathering of the collection (each document is given a unique identifier)
2. Segmentation of each document into a list of atomic tokens (tokenization)
3. Linguistic processing of the tokens in order to normalize them (lemmatization)
4. Indexing the documents by computing the dictionary and the lists of postings

Manual indexing
Advantages
Human judgements are the most reliable
Retrieval is better

Drawbacks
Time consuming
Not always consistent:
different people build different indexes for the same document.

Automatic indexing
Using NLU?
Not fast enough in real-world settings (e.g., web search)
Not robust enough (low coverage)
Difficulty: what to include and what to exclude.
Indexes should not contain headings for topics for which there is no information in the document. Can a machine parse full sentences of ideas and recognize the core ideas, the important terms, and the relationships between related concepts throughout the entire text?

Building the vocabulary

Stop list
A list of words whose members are discarded during indexing:
some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely.

These words are called STOP WORDS. Collection strategy:

Sort the terms by collection frequency (the total number of times each term appears in the document collection),
take the most frequent terms,
often hand-filtered for their semantic content relative to the domain of the documents being indexed.

What counts as a stop word depends on the collection:

in a collection of legal articles, 'law' can be considered a stop word

Ex:
a an and are as at be by for from has he in is it its of on that the to was were will with

Why eliminate stop words?


Efficiency
Eliminating stop words reduces the size of the index considerably Eliminating stop words reduces retrieval time considerably

Quality of results
Most of the time, not indexing stop words does little harm:
keyword searches with terms like 'the' and 'by' don't seem very useful

BUT, this is not true for phrase searches.


The phrase query 'President of the United States' is more precise than 'President AND United States'. The meaning of 'flights to London' is likely to be lost if the word 'to' is stopped out. ...

Building the vocabulary


Processing a stream of characters to extract keywords. First task: tokenization. Main difficulties:
token delimiters (ex: Chinese)
apostrophes (ex: O'Neill, Finland's capital)
hyphens (ex: Hewlett-Packard, state-of-the-art)
segmented compound nouns (ex: Los Angeles)
unsegmented compound nouns (ex: icecream, breadknife)
numerical data (dates, IP addresses)
word order (ex: Arabic wrt nouns and numbers)

Solutions for tokenization issues:


Using a pre-defined dictionary with largest matches and heuristics for unknown words
Using learning algorithms trained over hand-segmented words
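A hedged sketch of the regex approach for English (it sidesteps the harder cases above, such as unsegmented languages, which need dictionaries or trained segmenters):

```python
import re

# Keep hyphenated compounds and apostrophe forms together as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*")

def tokenize(text):
    """Return the list of tokens found in the text."""
    return TOKEN_RE.findall(text)

print(tokenize("Hewlett-Packard announced O'Neill's state-of-the-art plan."))
# ['Hewlett-Packard', 'announced', "O'Neill's", 'state-of-the-art', 'plan']
```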

Choosing keywords
Selecting the words that are most likely to appear in a query:
these words characterize the documents they appear in. Which are they?

The bag of words approach


An extreme interpretation of the principle of compositional semantics: the meaning of documents resides solely in the words that are contained within them. The exact ordering of the terms in a document is ignored, but the number of occurrences of each term is material.

BoW

'Not the same thing a bit!' said the Hatter. 'You might just as well say that "I see what I eat" is the same thing as "I eat what I see"!' 'You might just as well say,' added the March Hare, 'that "I like what I get" is the same thing as "I get what I like"!' 'You might just as well say,' added the Dormouse, who seemed to be talking in its sleep, 'that "I breathe when I sleep" is the same thing as "I sleep when I breathe"!'

Bags of words
Nevertheless, it seems intuitive that two documents with similar bag-of-words representations are similar in content.

Whats in a bag of words?


Are all words in a document equally important?
stop words do not contribute in any way to retrieval and scoring; BoW contain terms
What should count as a term?
Words Phrases (e.g., president of the US)

Morphological normalization
Should index terms be word forms, lemmas or stems?
Matching morphological variants increases recall. Example morphological variants:
anticipate, anticipating, anticipated, anticipation
company/companies, sell/sold
USA vs U.S.A., 22/10/2007 vs 10/22/2007 vs 2007/10/22
university vs University, Opel vs opel

Idea: using equivalence classes of terms,


ex: { Opel, OPEL, opel }

Two techniques:
Stemming: refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time.
Lemmatisation: refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return a dictionary form of a word, which is known as the lemma.

NB: documents and queries have to be processed using the same tokenization process!
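To see the difference in practice, here is a small illustration assuming the NLTK toolkit and its WordNet data are installed (NLTK is not mentioned in the lecture; it is just one readily available implementation):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # needs the WordNet corpus: nltk.download('wordnet')

for word in ["anticipating", "companies", "women"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# anticipating -> anticip / anticipating  (lemmatizer defaults to nouns; pos='v' gives 'anticipate')
# companies    -> compani / company
# women        -> women / woman           (the stemmer cannot handle irregular plurals)
```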

Stemming and Lemmatization


Role: reducing inflectional forms to common base forms. Example: car, cars, car's, cars' → car; am, are, is → be. Stemming removes suffixes (surface markers) to produce root forms. Lemmatization reduces a word to a canonical form (using a dictionary and a morphological analyser). Illustration of the difficulty: plurals (woman/women, crisis/crises), derivational morphology (automatize/automate)

Porter stemming algorithm (University of Cambridge, UK, 1980), for English

Porter stemmer
Algorithm based on a set of context-sensitive rewriting rules
http://tartarus.org/~martin/PorterStemmer/index.html
http://tartarus.org/~martin/PorterStemmer/def.txt
Rules are composed of a pattern (left-hand side) and a string (right-hand side), examples:
(.*)sses → \1ss : caresses → caress
(.*[aeiou].*)ies → \1i : ponies → poni, ties → ti
(.*[aeiou].*)ss → \1ss : caress → caress
Rules may be constrained by conditions on the word's measure, examples:
(m > 1) (.*)ement → \1 : replacement → replac, but not cement → c
(m > 0) (.*)eed → \1ee : feed → feed, but agreed → agree
(*v*) ed → \1 : plastered → plaster, but bled → bled
(*v*) ing → \1 : motoring → motor, but sing → sing
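As a hedged sketch, two of the suffix rules above can be written directly with Python regular expressions (real implementations, such as the reference one linked above, also encode the measure conditions and many more rules):

```python
import re

def step1a(word):
    """Apply two of Porter's step-1a suffix rules."""
    if word.endswith("sses"):
        return re.sub(r"sses$", "ss", word)  # caresses -> caress
    if word.endswith("ies"):
        return re.sub(r"ies$", "i", word)    # ponies -> poni, ties -> ti
    return word

print(step1a("caresses"), step1a("ponies"), step1a("ties"))  # caress poni ti
```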

Porter Stemmer Word measure


Assume that consonants are denoted by C and vowels by V. Any word, or part of a word, has one of the four forms:
CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form


[C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents.

Using (VC)^m to denote VC repeated m times, this may again be written as

[C](VC)^m [V].

m will be called the measure of any word or word part when represented in this form. Here are some examples:
m=0: TR, EE, TREE, Y, BY
m=1: TROUBLE, OATS, TREES, IVY
m=2: TROUBLES, PRIVATE, OATEN, ORRERY
Example rule: (m > 1) EMENT →
This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.
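A minimal sketch of computing m: collapse the word to a consonant/vowel pattern and count the V-to-C transitions, each of which closes one VC group (for simplicity this treats 'y' as a consonant everywhere, whereas Porter's definition is context-dependent):

```python
def porter_measure(word):
    """Count m in the decomposition [C](VC)^m[V] of a word."""
    pattern = ["V" if ch in "aeiou" else "C" for ch in word.lower()]
    # Each vowel immediately followed by a consonant closes one VC group.
    return sum(1 for prev, cur in zip(pattern, pattern[1:])
               if prev == "V" and cur == "C")

for w in ["tree", "trouble", "private", "crepuscular"]:
    print(w, porter_measure(w))  # tree 0, trouble 1, private 2, crepuscular 4
```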

Exercise
What is the Porter measure of the following words (give your computation)?
crepuscular, rigorous, placement

crepuscular: cr ep usc ul ar = C (VC)(VC)(VC)(VC), m = 4
rigorous: r ig or ous = C (VC)(VC)(VC), m = 3
placement: pl ac em ent = C (VC)(VC)(VC), m = 3

Stemming
Most stemmers also remove suffixes such as -ed, -ing, -ational, -ation, -able, -ism...
Relational → relate

Most stemmers don't use lexical lookup. There are shortcomings:


Stemming can result in non-words
Organization → Organ, Doing → Doe

Unrelated words can be reduced to the same stem


police, policy → polic

Stemming
Popular stemmers
Porter's, Lovins, Iterated Lovins, KSTEM

Lemmatization
Exceptions need to be handled:
sought → seek, sheep → sheep, feet → foot

Computationally more expensive than stemming, as it looks words up in a dictionary. Lemmatizers for French:
http://bach.arts.kuleuven.be/pmertens/morlex/
FLEMM (F. Namer)

POS taggers with lemmatization: TreeTagger, LT-POS

What is actually used?


Most retrieval systems use stemming/lemmatising and stop word lists
Stemming increases recall while harming precision

Most web search engines do use stop word lists but not stemming/lemmatising because
the text collection is extremely large, so the chance of matching morphological variants is higher and recall is not an issue;
stemming is imperfect, and the size and diversity of the web increase the chance of a mismatch;
stemming/tokenising tools are available for few languages

Example Text Representations


Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

Web search: scientists, found, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystals, meteorite, fell, earth, red, planet, NASA, announced, Monday
Information service or library search: scientist, find, compelling, new, evidence, possible, ancient, microscopic, life, mars, derived, magnetic, crystal, meteorite, fall, earth, red, planet, NASA, announce, Monday

Granularity
Document unit :
An index can map terms
... to documents
... to paragraphs in documents
... to sentences in documents
... to positions in documents

An IR system should be designed to offer choices of granularity. For now, we will henceforth assume that a suitable size document unit has been chosen, together with an appropriate way of dividing or aggregating files, if needed.

Index Content
The index usually stores some or all of the following information:
For each term:
Document count: how many documents the term occurs in. Total frequency count: how many times the term occurs across all documents (a popularity measure)

For each term and for each document:


Frequency: how often the term occurs in that document. Position: the offsets at which the term occurs in that document.

Retrieval model

What is a retrieval model


A model is an abstraction of a process (here: retrieval). Conclusions derived by the model are good if the model provides a good approximation of the retrieval process. IR model variables: queries, documents, terms, relevance, users, information needs. Existing types of retrieval models:
Boolean models
Vector space models
Probabilistic models
Models based on belief nets
Models based on language models

Retrieval Models: the general intuition


Documents and user information needs are represented using index terms
Index terms serve as links to documents
Queries consist of index terms

Relevance can be measured in terms of a match between queries and document index

Exact vs. Best Match


Exact Match
A query specifies precise retrieval criteria Each document either matches or fails to match the query The result is a set of documents (no ranking)

Best match
A query describes good or best matching documents The result is a ranked list of documents

Statistical retrieval

Statistical Models
A document is typically represented by a bag of words (unordered words with frequencies) User specifies a set of desired terms with optional weights: Weighted query terms: Q = < database 0.5; text 0.8; information 0.2 > Unweighted query terms: Q = < database; text; information > No Boolean conditions specified in the query.


Statistical Retrieval
Retrieval based on similarity between query and documents. Output documents are ranked according to similarity to query Similarity based on occurrence frequencies of keywords in query and document Automatic relevance feedback can be supported
The user issues a (short, simple) query. The system returns an initial set of retrieval results. The user marks some returned documents as relevant or non-relevant. The system computes a better representation of the information need based on the user feedback. The system displays a revised set of retrieval results.

Boolean model

The boolean model


Most common exact-match model Basic assumptions:
An index term is either present or absent in a document
All index terms provide equal evidence with respect to information needs

Queries are boolean combinations of index terms


x AND y: documents that contain both x and y (intersection of addresses)
x OR y: documents that contain x, y or both (union of addresses)
NOT x: documents that do not contain x (complement set of addresses)

Additionally:
proximity operators
simple regular expressions
spelling variants
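Conceptually, the three AND/OR/NOT operators above are just set operations on postings; a minimal sketch on a toy collection (document IDs and postings invented for illustration):

```python
# Toy postings: term -> set of docIDs; all_docs is the whole collection.
all_docs = {1, 2, 3, 4}
postings = {"brutus": {1, 2, 4}, "caesar": {1, 2, 3}, "capitol": {1}}

brutus, caesar, capitol = postings["brutus"], postings["caesar"], postings["capitol"]
print(brutus & caesar)              # x AND y -> {1, 2}
print(brutus | caesar)              # x OR y  -> {1, 2, 3, 4}
print(all_docs - capitol)           # NOT x   -> {2, 3, 4}
print((brutus & caesar) - capitol)  # x AND y AND NOT z -> {2}
```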

Boolean queries Example


User information need:
interested in learning about vitamins that are antioxidant

User boolean query:


antioxidant AND vitamin

The boolean model


Example input collection (Shakespeare's plays):
Doc1: I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.
Doc2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

The boolean model index construction


First we build the list of pairs (keyword, docID):

The boolean model index construction


Then the lists are sorted by keyword, and frequency information is added:

The boolean model index construction


Multiple occurrences of keywords are then merged to create a dictionary file and a postings file:

Processing Boolean queries


User boolean query: Brutus AND Calpurnia
over the inverted index :
Locate Brutus in the Dictionary Retrieve its postings Locate Calpurnia in the Dictionary Retrieve its postings Intersect the two postings lists

The intersection operation is the crucial one. It has to be efficient so as to quickly find documents that contain both terms.
It is sometimes referred to as merging postings lists because it uses a merge algorithm.
Merge algorithm: a general family of algorithms that combine multiple sorted lists by interleaved advancing of pointers through each list.
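A sketch of this merge in Python, in the spirit of the algorithm in Manning et al.: one pointer per sorted postings list, advancing the pointer that sits on the smaller docID, so the cost is linear in the combined list length:

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:        # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:       # advance the pointer on the smaller docID
            i += 1
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31, 45], [2, 31, 54]))  # [2, 31]
```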

Intersection

Extended boolean queries


Merging algorithm (from Manning et al., 07)

NB: the posting lists HAVE to be sorted.

Extended boolean queries


Generalisation of the merging process:
Imagine more than 2 keywords appear in the query:
(Brutus AND Caesar) AND NOT (Capitol) Brutus AND Caesar AND Capitol (Brutus OR Caesar) AND (Capitol ...

Ideas:
consider keywords with shorter postings lists first (to reduce the number of operations);
use the frequency information stored in the dictionary.
See Manning et al., 07 for the algorithm.

Extended boolean queries

retrieved docs : D7, D5, D2

Exercise
How would you process the following query (main steps)? Brutus AND NOT Caesar. Try your algorithm on

Exercise
How would you process the following query (main steps)? Brutus OR NOT Caesar

Remarks on the boolean model


The boolean model allows users to express precise queries (you know what you get), BUT there is no flexibility (exact matches only). Boolean queries can be processed efficiently (the time complexity of the merge algorithm is linear in the sum of the lengths of the lists to be merged). It has been a reference model in IR for a long time.

Advantages of exact-match retrieval


Predictable, easy to explain Structured queries Works well when information need is clear and precise

Drawbacks of exact-match retrieval


Unintuitive for non-experts: adequate query formulation is difficult for most users
No ranking of retrieved documents
Exact matching may lead to too few or too many retrieved documents:
too few if not using synonyms; difficulty increases with collection size
large result sets need to be compensated by interactive query refinement

No notion of partial relevance (useful if query is overrestrictive)

All terms have equal importance (no term weighting). Ranking models perform consistently better.

Boolean model The story so far


An inverted index associates keywords with postings lists. The postings lists contain document identifiers (and other useful information, such as total frequencies, number of documents, etc.). Boolean queries are processed by merging postings lists in order to find the documents satisfying the query. The cost of this list merging is linear in the total number of document IDs: O(m + n). Question: how to process phrase queries (i.e. taking the words' context into account)?

Dealing with phrase queries


Many complex or technical concepts and many organization and product names are multiword compounds or phrases.
Stanford University Graph Theory Natural Language Processing ...

The user wants documents where the whole phrase appears, and not only some parts of it (i.e. 'The inventor Stanford Ovshinsky never went to university' is not a match). About 10% of web queries are phrase queries (song names, institutions...). Such queries need either more complex dictionary terms or a more complex index (critical parameter: the size of the index).

Biword indexes
Use key-phrases of length 2. Example:
Text: Natural Language Processing
Dictionary:
Natural Language
Language Processing
The dictionary is made of biwords (notion of context)

Query: Information retrieval in Natural Language Processing

(Information retrieval) AND (retrieval Natural) AND (Natural Language) AND (Language Processing). It might seem a better query to omit the middle biword. Better results can be obtained by using more precise part-of-speech patterns that define which extended biwords should be indexed.

Positional indexes

Store positions in the inverted index. Example:
termID ::= doc1: position1, position2, ...; doc2: position1, position2, ...
Processing then corresponds to an extension of the merging algorithm (additional checks while traversing the lists). NB: such indexes can be used to process proximity queries (i.e. using constraints on the proximity between words).

Positional indexes need an entry per occurrence (classic inverted indexes need an entry per document ID), so they grow much faster with document size. The size of a positional index depends on the language being indexed and the type of document (books, articles, etc.). On average, a positional index is 2-4 times bigger than an inverted index; it can reach 35 to 50% of the size of the original text (for English). Positional indexes can be used in combination with classic indexes to save time and space (see [Williams et al., 2005]).
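A minimal sketch of phrase search over such an index (each index maps docID to a sorted position list, in the style of the exercises below): a two-word phrase matches where the second word occurs exactly one position after the first:

```python
def phrase_match(first, second):
    """Return docIDs where some position p of 'first' has p + 1 in 'second'."""
    matches = []
    for doc_id in first.keys() & second.keys():       # docs containing both words
        positions = set(second[doc_id])
        if any(p + 1 in positions for p in first[doc_id]):
            matches.append(doc_id)
    return sorted(matches)

# Toy positional indexes: {docID: [positions]}
to = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be = {1: [7, 18, 33, 72, 86, 231], 2: [3, 149], 4: [17, 191, 291, 430, 434]}
print(phrase_match(to, be))  # [4]  ('to be' at 16-17, 190-191, 429-430, 433-434)
```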

Exercise
Which documents can contain the sentence 'to be or not to be', considering the following (incomplete) indexes?
be ::= 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367
to ::= 2: 1, 17, 74, 222, 551; 4: 8, 16, 190, 429, 433; 7: 13, 23, 191

Exercise
Given the following positional indexes, give the document IDs corresponding to the query 'world wide web':
world ::= 1: 7, 18, 33, 70, 85, 131; 2: 3, 149; 4: 17, 190, 291, 430, 434
wide ::= 1: 12, 19, 40, 72, 86, 231; 2: 2, 17, 74, 150, 551; 3: 8, 16, 191, 429, 435
web ::= 1: 20, 22, 41, 75, 87, 200; 2: 18, 32, 45, 56, 77, 151; 4: 25, 192, 300, 332, 440

The postings lists to access are: to, be, or, not. Consider intersecting the positional lists for to and be. We first look for documents that contain both terms. Then we look for places in the lists where there is an occurrence of be with a token index one higher than a position of to, and then we look for another occurrence of each word with token index 4 higher than the first occurrence. In the above lists, the pattern of occurrences that is a possible match is:
to: <...; 4: <..., 429, 433> ...>
be: <...; 4: <..., 430, 434> ...>

Exercise
Consider the following index:
Language: <d1,12><d2,23-32-43><d3,53><d5,36-42-48> Loria: <d1,25> <d2,34-40> <d5,38-51>

where d_i refers to document i, the other numbers being positions. The infix operator NEAR/x requires a proximity of at most x between two terms.
Give the solutions to the query: language NEAR/2 Loria
Give the pairs (x, docIDs) for each x such that language NEAR/x Loria has at least one solution
Propose an algorithm for retrieving matching documents for this operator

Example: WESTLAW
Large commercial system that has served the legal and professional market since 1974:
legal materials (court opinions, statutes, regulations, ...) news (newspapers, magazines, journals, ...) financial (stock quotes, financial analyses, ...)

Total collection size: 5-7 terabytes. 700,000 users (they claimed 56% of legal searchers as of 2002). Best match added in 1992.

WESTLAW query language features


Boolean and proximity operators
Phrases: "West Publishing"
Word proximity: West /5 Publishing
Same sentence: Massachusetts /s technology
Same paragraph: information retrieval /p

Restrictions : DATE(AFTER 1992 & BEFORE 1995) Term expansion


wildcards (THOM*SON); truncation (THOM!); automatic expansion of plurals and possessives

Document structure (fields)

WESTLAW query example


Information need: information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
Query: "trade secret" /s disclos! /s prevent /s employe!
Information need: requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)
Information need: cases about a host's responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

Boolean query languages are not dead


Exact match is still prevalent in the commercial market (but now includes some type of ranking). Many users prefer Boolean. For some queries/collections, Boolean may work better. Boolean and free-text queries find different documents. We need retrieval models that support both.

The Vector Space Model

Best-Match retrieval
Boolean retrieval is the archetypal example of exact-match retrieval. Best-match (ranking) models are now more common. Advantages:
easier to use
similar efficiency
provides a ranking: the most relevant documents appear at the top
generally has better retrieval performance

But: comparing best-match and exact-match is difficult

Boolean model: all documents matching the query are retrieved. The matching is binary: yes or no. Extreme cases: the list of retrieved documents can be empty, or huge. A ranking of the documents matching a query is needed: a score is computed for each pair (query, document).

Vector-space Retrieval
By far the most common type of retrieval system. Key idea: everything (documents, queries) is a vector in a high-dimensional space. The vector coefficients for an object (document, query, term) represent the degree to which this object embodies each of the basic dimensions. Relevance is measured using vector similarity: a document is relevant to a query if their representing vectors are similar.

Vector-space Representation
Documents are vectors of terms Terms are vectors of documents A query is a vector of terms

Graphic Representation
[Figure: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3 and Q = 0T1 + 0T2 + 2T3 plotted as vectors in the three-dimensional term space spanned by T1, T2, T3]

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

Similarity in the Vector-space


Vectors can contain binary or weighted terms:
Binary term vector: 1 = term present, 0 = term absent
Weighted term vector: indicates the relative importance of terms in a document

Vector similarity can be measured in several ways:


Inner product (measure of overlap)
Cosine coefficient
Jaccard coefficient
Dice coefficient
Minkowski metric (dissimilarity)
Euclidean distance (dissimilarity)

Using the inner product similarity measure


Given a query vector q and a document vector d, both of length n, the similarity between q and d is defined by the inner product q · d:

sim(q, d) = q · d = Σ_{i=1..n} q_i × d_i

where q_i (d_i) is the value of the i-th position of q (d). With binary values this amounts to counting the matching terms between q and d.
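Using the example vectors D1, D2 and Q from the graphic representation above, a sketch of this score:

```python
def inner_product(q, d):
    """Dot product of two equal-length term-weight vectors."""
    return sum(qi * di for qi, di in zip(q, d))

D1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]  # Q  = 0T1 + 0T2 + 2T3

print(inner_product(Q, D1))  # 10 -> D1 ranks above D2
print(inner_product(Q, D2))  # 2
```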

Similarity: an example in the Vector-space

The effect of varying document lengths


Problem :
Longer documents will be represented with longer vectors, but that does not mean they are more important If two documents have the same score, the shorter one should be preferred

Solution : the length of a document must be taken into account when computing the similarity score

Document length normalization


The length of a document is its Euclidean length: if d = (x1, x2, ..., xn) then ‖d‖ = √(x1² + x2² + ... + xn²). To normalize a document, we divide it by its own length: d/‖d‖. Similarity is then given by the cosine measure between normalized vectors: q · (d/‖d‖). One problem is solved: shorter, more focused documents receive a higher score than longer documents with the same matching terms. But shorter documents are then generally preferred over longer ones! More sophisticated weighting schemes are generally used.
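Continuing the same example, a sketch of the normalized (cosine) score; note how D1, which is dominated by T3 like the query, scores much higher than D2:

```python
import math

def cosine(q, d):
    """Inner product of length-normalized vectors: the cosine measure."""
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_d = math.sqrt(sum(x * x for x in d))
    return sum(qi * di for qi, di in zip(q, d)) / (norm_q * norm_d)

print(cosine([0, 0, 2], [2, 3, 5]))  # ~0.81
print(cosine([0, 0, 2], [3, 7, 1]))  # ~0.13
```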

Term weights
q_i is the weight of term i in q. Up to now, we only considered binary term weights:
0: term absent 1: term present

Two shortcomings:
Does not reflect how often a term occurs All terms are equally important (president vs. the)

Remedy: use non binary term weights


tf-score: store the frequency of a term in the vector (e.g., 4 if the term occurs 4 times in the document)
idf-score: to distinguish meaningful terms, i.e. terms that occur only in a few documents

Term frequency
A document is treated as a set of words. Each word characterizes that document to some extent. When we have eliminated stop words, the most frequent words tend to be what the document is about. Therefore f_k,d (the number of occurrences of word k in document d) will be an important measure. It is also called the term frequency (tf).

Document frequency
What makes this document distinct from others in the corpus? The terms which discriminate best are not those which occur with high document frequency! Therefore d_k (the number of documents in which word k occurs) will also be an important measure. It is also called the document frequency (df).

TF.IDF
This can all be summarized as follows. Words are the best discriminators when:
they occur often in this document (high term frequency)
they do not occur in a lot of documents (low document frequency)

One very common measure of the importance of a word to a document is :


TF.IDF: term frequency x inverse document frequency

There are multiple formulas for actually computing this. The underlying concept is the same in all of them.

Term weights
tf-score: tf_i,j = frequency of term i in document j
idf-score: idf_i = inverse document frequency of term i
idf_i = log(N/n_i) with
N, the size of the document collection (number of documents)
n_i, the number of documents in which term i occurs

idf_i is high when term i occurs in only a small proportion of the document collection

Term weight of term i in document j (TF-IDF):


tf_i,j × idf_i, where the idf factor captures the rarity of a term in the document collection
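A sketch implementing exactly these formulas on the small collection from the inverted-index exercise (tokenized by hand):

```python
import math

def tf_idf(term, doc, collection):
    """tf_ij * idf_i with idf_i = log(N / n_i), as defined above."""
    tf = doc.count(term)                      # occurrences in this document
    n_i = sum(term in d for d in collection)  # documents containing the term
    return tf * math.log(len(collection) / n_i) if n_i else 0.0

collection = [["breakthrough", "drug", "for", "schizophrenia"],
              ["new", "schizophrenia", "drug"],
              ["new", "approach", "for", "treatment", "of", "schizophrenia"],
              ["new", "hopes", "for", "schizophrenia", "patients"]]
print(tf_idf("drug", collection[0], collection))           # 1 * log(4/2) ~ 0.69
print(tf_idf("schizophrenia", collection[0], collection))  # 1 * log(4/4) = 0.0
```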

Boolean retrieval vs. Vector Space Retrieval


Boolean retrieval
Documents are not ranked
Boolean queries are not easy to manipulate

Vector space retrieval


Documents can be ranked
Issue 1: choice of comparison function, usually cosine comparison
Issue 2: choice of weighting scheme, usually variations on tf_i,j × idf_i

Evaluation

Evaluation
Issues
User-based evaluation
System-based evaluation
TREC
Precision and recall

Evaluation methods
Two types of evaluation methods:
User-based: measures user satisfaction
System-based: focuses on how well the system ranks the documents

User based evaluation


More direct, but:
Expensive
Difficult to do correctly
Needs a sufficiently large, representative sample of users
The compared systems must be equally well developed (complete with fully functional user interfaces)
Each user must be trained, to control learning effects
Information, information needs and relevance are intangible concepts

System based evaluation


Good system performance = good document rankings. Allows for fair comparative testing. Less expensive; test collections can be reused. Test collection = topics, documents, relevance judgments. System-based evaluation goes back to the Cranfield experiments (1960):
rate the relevance of retrieved bibliographic references on a scale from 1 to 4

Recall and Precision


Three important performance metrics:
Precision: proportion of retrieved documents that are relevant
No penalty for selecting too few items
Recall: proportion of relevant documents that have been retrieved
No penalty for selecting too many items (e.g., everything)

F-Measure
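The slide's formula is not reproduced in this transcript; the standard definition is the weighted harmonic mean of precision and recall, F_β = (1 + β²) × P × R / (β² × P + R), with β = 1 giving the balanced F1. A sketch with invented counts:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Say 20 documents retrieved, 8 of them relevant, 16 relevant overall:
p, r = 8 / 20, 8 / 16
print(p, r, round(f_measure(p, r), 3))  # 0.4 0.5 0.444
```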

Standard Text Collections


Relevant documents must be identified. Given a document collection D and a set of queries Q, REL_q is the set of documents relevant to q. Whether a document d is relevant to a query q is decided by human judgement.

Standard Text Collections


CACM (computer science): 3024 abstracts, 64 queries CF (medicine): 1239 abstracts, 100 queries CISI (library science): 1460 abstracts, 112 queries CRANFIELD (aeronautics): 1400 abstracts, 225 queries LISA (library science): 6004 abstracts, 35 queries TIME (newspaper): 423 abstracts, 83 queries Ohsumed (medicine): 348 566 abstracts, 106 queries

Building Test Collections


How to identify relevant documents? How to assess relevance (binary or finer-grained)? One vs. several judges?

TREC
Text REtrieval Conference. Proceedings at http://trec.nist.gov/. Established in 1991 to evaluate large-scale IR (retrieving documents from a gigabyte collection). Organised by NIST and run continuously since 1991. Best-known IR evaluation setting:
25 participants in 1992
109 participants from 4 continents in 2004
European (CLEF) and Asian (NTCIR) counterparts

TREC Format
Several IR research tracks
ad-hoc retrieval
routing/filtering
cross-language
scanned documents
spoken documents
video
Web
question answering
...

TREC notion of relevance


If you were writing a report on the subject of the topic and would use the information contained in the document in the report, then the document is relevant. Pooling is used for identifying relevant documents:
A set of possibly relevant documents is created automatically for each information need
The top 100 documents returned by each system are kept and inspected by judges, who determine which documents are relevant
Inter-judge agreement is about 80%

Improving Recall and Precision


The two big problems with short queries are:
Synonymy: poor recall results from missing documents that contain synonyms of the search terms, but not the terms themselves
Polysemy/homonymy: poor precision results from search terms that have multiple meanings, leading to the retrieval of non-relevant documents

Query Expansion
Find a way to expand a user's query to automatically include relevant terms (that they should have included themselves), in an effort to improve recall:
Use a dictionary/thesaurus Use relevance feedback

Thesauri
A thesaurus contains information about words (e.g., violin) such as :
Synonyms: similar words e.g., fiddle Hyperonyms: more general words e.g., instrument Hyponyms: more specific words e.g., Stradivari Meronyms: parts, e.g., strings

A very popular machine-readable thesaurus is WordNet

Problems of Thesauri
Language dependent Available only for a couple of languages

Cooccurrence models
Semantically or syntactically related terms. Cooccurrence vs. thesauri:
Easy to adapt to other languages/domains
Also covers relations not expressed in thesauri
Not as reliable as manually edited thesauri
Can introduce considerable noise

Selection criteria: mutual information, expected mutual information

Relevance feedback
Ask the user to identify a few documents which appear to be related to their information need. Extract terms from those documents and add them to the original query. Run the new query and present the results to the user. Typically converges quickly.
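The slides do not name a specific algorithm for this term-extraction step; a classic choice is the Rocchio update (shown here as an assumption, not as the lecture's method), which moves the query vector toward the centroid of the documents marked relevant and away from the non-relevant ones:

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio: q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant)."""
    terms = set(query) | {t for d in relevant + nonrelevant for t in d}
    def centroid(docs, term):
        return sum(d.get(term, 0.0) for d in docs) / len(docs) if docs else 0.0
    return {t: max(0.0, alpha * query.get(t, 0.0)
                        + beta * centroid(relevant, t)
                        - gamma * centroid(nonrelevant, t))
            for t in terms}

query = {"pipeline": 1.0, "leak": 1.0}
marked_relevant = [{"pipeline": 0.8, "rupture": 0.6}]  # tf-idf weights, invented
print(rocchio(query, marked_relevant, []))  # 'rupture' now enters the query
```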

Blind feedback
Assume that the first few documents returned are the most relevant, rather than having users identify them. Proceed as for relevance feedback. Tends to improve recall at the expense of precision.

Post-Hoc Analysis
When a set of documents has been returned, they can be analyzed to improve their usefulness in addressing the information need:
Grouped by meaning for polysemic queries (using n-gram-type approaches)
Grouped by extracted information (named entities, for instance)
Grouped into an existing hierarchy if structured fields are available
Filtering (e.g., eliminating spam)

References
Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze. To appear at Cambridge University Press (chapters available at the book website).

Information Retrieval, Second Edition, by C.J. van Rijsbergen, Butterworths, London, 1979.
