
Information Retrieval (3170718)

Prepared by:
Ms. Twinkle P. Kosambia
Computer Engineering Department
C. K. Pithawalla College of Engineering and
Technology
Subject Name: Information Retrieval
Subject Code: 3170718
Reference Books:
1. Introduction to Information Retrieval. Christopher D. Manning, Prabhakar
Raghavan, and Hinrich Schuetze. Cambridge University Press, 2007.
2. Search Engines: Information Retrieval in Practice. Bruce Croft, Donald
Metzler, and Trevor Strohman. Pearson Education, 2009.
3. Modern Information Retrieval. Ricardo Baeza-Yates and Berthier
Ribeiro-Neto. 2nd edition, Addison-Wesley, 2011.
4. Information Retrieval: Implementing and Evaluating Search Engines.
Stefan Buttcher, Charles L. A. Clarke, and Gordon Cormack. MIT Press, 2010.
Chapter 7: Advanced Topics
● Summarization
● Topic detection and tracking
● Personalization
● Question answering
● Cross language information retrieval
What is summarization

To reduce documents to a few key paragraphs, sentences, or phrases
describing their content.

Snippet generation is an example of text summarization:
1. Query-independent
2. Query-dependent
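As a rough, illustrative sketch (not taken from the slides), a query-dependent snippet can be produced by keeping the document sentences that contain the most query terms; the function name and scoring rule below are assumptions:

import re

def query_dependent_snippet(document, query, max_sentences=2):
    # Toy query-dependent snippet generator (illustrative only): keep the
    # sentences that contain the most distinct query terms.
    query_terms = set(query.lower().split())
    sentences = re.split(r'(?<=[.!?])\s+', document)
    scored = []
    for i, sent in enumerate(sentences):
        words = set(re.findall(r'\w+', sent.lower()))
        scored.append((len(words & query_terms), i, sent))
    # Keep the best-scoring sentences, then restore document order.
    best = sorted(scored, reverse=True)[:max_sentences]
    return ' ... '.join(s for _, _, s in sorted(best, key=lambda t: t[1]))

doc = ("Extractive summaries reuse sentences from the source. "
       "Search engines generate snippets from webpages. "
       "Abstractive summaries rephrase the source text.")
print(query_dependent_snippet(doc, "search engine snippets"))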
Summarization

Summaries may be classified as:
● Extractive
● Abstractive
Extractive summaries

● Extractive summaries are created by reusing portions (words,
sentences, etc.) of the input text verbatim.
● For example, search engines typically generate extractive
summaries from webpages.
● Most of the summarization research today is on extractive
summarization.
Abstractive summaries
● In abstractive summarization, information from the source text is
rephrased.
● Human beings generally write abstractive summaries (except when
they do their assignments).
● Example of an abstractive summary (a book review):
“An innocent hobbit of the Shire journeys with eight companions to
the fires of Mount Doom to destroy the One Ring and the dark lord
Sauron forever.”
Summarization techniques

● Summarization techniques can be:
○ Supervised
○ Unsupervised
Supervised techniques

● Features of sentences (e.g. position of the sentence, number of words
in the sentence, etc.) that make them good candidates for inclusion in
the summary are learnt (a minimal sketch follows after this list).
● Sentences in an original training document can be labeled as “in
summary” or “not in summary”.
● The main drawback of supervised techniques is that training data is
expensive to produce and relatively sparse.
● Also, most readily available human-generated summaries are
abstractive in nature.
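A minimal sketch of the supervised idea, assuming scikit-learn is available; the two features (sentence position and length), the training sentences, and the labels are made up for illustration:

from sklearn.linear_model import LogisticRegression

def sentence_features(sentences):
    # Two toy features per sentence: relative position in the document
    # and length in words (real systems use many more features).
    n = len(sentences)
    return [[i / n, len(s.split())] for i, s in enumerate(sentences)]

# Hypothetical training data: sentences labeled by annotators as
# "in summary" (1) or "not in summary" (0).
train_sentences = [
    "The company reported record profits this quarter.",
    "Coffee was served in the lobby before the meeting.",
    "Revenue grew by twenty percent year over year.",
    "The CEO wore a blue suit.",
]
labels = [1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(sentence_features(train_sentences), labels)

# For a new document, rank sentences by the predicted probability of
# being "in summary" and keep the top one as the extractive summary.
new_doc = [
    "Quarterly earnings beat analyst expectations.",
    "Lunch was catered by a local restaurant.",
]
probs = clf.predict_proba(sentence_features(new_doc))[:, 1]
ranked = sorted(zip(new_doc, probs), key=lambda t: -t[1])
print(ranked[0][0])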
Major supervised approaches

● Lin and Hovy use the concept of topic signatures to rank
sentences.
● TS = {topic, signature}
○ = {topic, <t1, w1>, …, <tn, wn>}
● TS = {restaurant-visit, <food, 0.5>, <menu, 0.2>, <waiter,
0.15>, …}
● Topic signatures are learnt using a set of documents pre-classified
as relevant or non-relevant for each topic.
Example
● If a document is classified as relevant to “restaurant visit”, we
know the important words of this document will form the topic
signature of the topic “restaurant visit”. This is a supervised
process.
● Extraction of the important words from a document is an
unsupervised process.
● E.g. “food” occurs many times in a cookbook and is an important
word in that document (a small scoring sketch follows below).
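An illustrative sketch of scoring sentences against a topic signature; the weights are the example values from the slide, and the scoring function itself is an assumption:

# Assumed example topic signature (illustrative weights, not learned values).
topic_signature = {"food": 0.5, "menu": 0.2, "waiter": 0.15}

def score_sentence(sentence, signature):
    # Sum the signature weights of the terms appearing in the sentence.
    return sum(signature.get(w.strip(".,"), 0.0)
               for w in sentence.lower().split())

sentences = [
    "The waiter brought the menu quickly.",
    "Parking nearby was difficult to find.",
]
for s in sentences:
    print(round(score_sentence(s, topic_signature), 2), s)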
Unsupervised techniques: TextRank

● This approach models the document as a graph and uses an
algorithm similar to Google's PageRank algorithm to find
top-ranked sentences.
● The key intuition is the notion of centrality or prestige in social
networks, i.e. a sentence should be highly ranked if it is
recommended by many other highly ranked sentences (a rough
sketch follows below).
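A rough TextRank-style sketch in plain Python; the word-overlap similarity, damping factor, and iteration count are assumed illustrative choices, not the exact published formulation:

def overlap_similarity(s1, s2):
    # Simple word-overlap similarity between two sentences.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def textrank(sentences, damping=0.85, iterations=30):
    # Build the sentence-similarity graph, then run a PageRank-style
    # power iteration so that a sentence is ranked highly when it is
    # "recommended by" (similar to) other highly ranked sentences.
    n = len(sentences)
    sim = [[0.0 if i == j else overlap_similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])  # total edge weight leaving node j
                if sim[j][i] > 0 and out_weight > 0:
                    rank += scores[j] * sim[j][i] / out_weight
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return sorted(zip(scores, sentences), reverse=True)

sents = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat near the door.",
    "Stock prices fell sharply on Monday.",
]
for score, s in textrank(sents):
    print(round(score, 3), s)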
Topic detection
and tracking
Introduction

● TDT (Topic Detection and Tracking) is intended to explore techniques
for detecting the appearance of new topics and for tracking their
reappearance and evolution.
● The notion of a “topic” was modified and sharpened to be an
“event”, meaning some unique thing that happens at some
point in time.
● Sources of information include, for example, various newswires and
various news broadcast programs.
● The information flowing from each source is assumed to be
divided into a sequence of stories, which may provide
information on one or more events.
Introduction
● The general task is to identify the events being discussed in
these stories, in terms of the stories that discuss them.
● Stories that discuss unexpected events will of course follow the
event whereas stories on expected events can both precede and
follow the event.
Purpose: To develop technologies for retrieval and automatic
organization of broadcast news and newswire stories and to evaluate
their performance.
Corpus: TDT processing addresses multiple sources of information,
including newswire (text) and broadcast news (speech).
The information is modeled as a sequence of stories. These stories
provide information on many topics.
Google News
Terminology in TDT
● A story is a single document / newswire story conveying some
information to the user.
● An event is something that happens at a particular place and at a
particular time.
● A topic is an important event considered together with all related
events.
● For example, the eruption of Mount Pinatubo on June 15th, 1991 is
considered to be an event, whereas a volcanic eruption in general is
considered to be a class of events.
Google Alert
RSS Feed

● What is RSS: RSS is an XML-based format that allows the
syndication of lists of hyperlinks, along with other information,
or metadata, that helps viewers decide whether they want to follow
the link.
○ RSS stands for Really Simple Syndication.
○ RSS is written in XML, and RSS files are updated automatically.
● Web content providers can easily create and disseminate feeds
of data that include new links, headlines, and summaries (a small
parsing sketch follows below).
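A small sketch of reading an RSS 2.0 feed with the Python standard library; the feed URL is a hypothetical placeholder:

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/news/rss.xml"  # hypothetical feed URL

with urllib.request.urlopen(FEED_URL) as response:
    root = ET.fromstring(response.read())

# RSS 2.0 nests items under <channel>; each item carries a headline,
# a hyperlink, and a short summary.
for item in root.findall("./channel/item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    description = item.findtext("description", default="")
    print(title, link, description[:80])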
Corpus
The TDT2 Corpus contains data from
➔ Newswire: Associated Press WorldStream, New York Times
News Services
➔ Radio: Voice of America World News, Public Radio
International's The World
➔ Television: CNN Headline News, ABC World News Tonight

● There are 300 stories/day, 5 hours of digital recordings/day, 54,000
stories, and 630 hours of audio in total.
● For newswire sources, each story is clearly delimited by the
newswire format.
Organization of the TDT2 Corpus

The TDT2 Corpus was divided into three parts for research
management purposes:
➔ Training set: the data may be used without limit for research
purposes.
➔ Development test set: the data is available for testing
TDT algorithms.
➔ Evaluation test set: the data is reserved for the final formal
evaluation of performance.
The three tasks

The input to the TDT2 project is a stream of stories.
This stream may not be pre-segmented into stories, and the topics
may not be known to the system.
The three technical tasks are:
1. Segmentation of a news source into stories
2. The tracking of known topics
3. The detection of unknown topics
Segmentation

Segmenting the stream of data into constituent stories; this applies to
audio (radio and TV) sources. A rough sketch of one way to do this
follows below.
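A rough segmentation sketch in the spirit of TextTiling (an assumed approach, not the official TDT segmenter): compare adjacent windows of words and place a story boundary where their vocabulary similarity drops; the window size and threshold are illustrative:

from collections import Counter
from math import sqrt

def cosine(c1, c2):
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def boundaries(words, window=50, threshold=0.3):
    # Slide through the token stream and place a boundary wherever the
    # word windows on either side look dissimilar (window size and
    # threshold are illustrative, not tuned values).
    cuts = []
    for i in range(window, len(words) - window + 1, window):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        if cosine(left, right) < threshold:
            cuts.append(i)
    return cuts

# Usage: pretend this is an unsegmented broadcast-news transcript with
# a budget story followed by a flooding story.
transcript = ("senate passes the budget bill " * 20 +
              "heavy rain floods the coast " * 20).split()
print(boundaries(transcript))  # a boundary near word 100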
Tracking
● Associating incoming stories with topics that are known to the system.
● A topic is “known” by its association with the stories that discuss it.
● A set of training stories is identified for each topic. The system may
train on the target topic by using all of the stories in the corpus.
● Find all the stories that discuss a given target topic:
○ Training: given N sample stories that discuss a given target topic,
○ Test: find all subsequent stories that discuss the target topic.
The tracking task

● The system is given training documents Dj for each topic.
● Stories come in a sequence S1, S2, …
● Stories whose similarity to the training stories is above a threshold
are marked YES (a small sketch follows below).
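A minimal tracking sketch (an assumed scoring scheme, not the official TDT tracker): build a term-frequency centroid from the training stories of a topic and mark an incoming story YES when its cosine similarity to the centroid exceeds a threshold; the threshold and example stories are made up:

from collections import Counter
from math import sqrt

def tf(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def track(training_stories, incoming_stories, threshold=0.3):
    # Build one term-frequency centroid for the topic from its training
    # stories, then say YES/NO for each subsequent story in arrival order.
    centroid = Counter()
    for story in training_stories:
        centroid += tf(story)
    return [(cosine(tf(story), centroid) >= threshold, story)
            for story in incoming_stories]

training = ["volcano erupts near the city", "lava flows from the volcano"]
stream = ["the volcano eruption continues", "parliament debates the budget"]
for decision, story in track(training, stream):
    print("YES" if decision else "NO", story)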
The tracking task

● Misses and false alarms


Tracking task - adaptation
Evaluating tracking

● A perfect tracker says YES to on-topic stories and NO to all other
stories.
● In reality, the system emits a confidence score for the topic.
Evaluating tracking

● At every score threshold, there is a miss rate and a false alarm rate:
○ Any on-topic stories below the score are misses.
○ Any off-topic stories above the score are false alarms.
● Plot a (false alarm, miss) pair for every score (a computation sketch
follows below).
○ The result is an ROC curve.
○ TDT uses a modification called the “DET curve” or
“DET plot”.
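An illustrative sketch of computing the miss and false-alarm rates at every score threshold, i.e. the points that make up a DET (or ROC) curve; the scored stories are made-up examples:

def det_points(scored_stories):
    # scored_stories: list of (confidence_score, is_on_topic) pairs.
    on_topic = [s for s, rel in scored_stories if rel]
    off_topic = [s for s, rel in scored_stories if not rel]
    points = []
    for threshold, _ in sorted(scored_stories):
        # Misses: on-topic stories the system scored below the threshold.
        miss_rate = sum(s < threshold for s in on_topic) / len(on_topic)
        # False alarms: off-topic stories scored at or above the threshold.
        fa_rate = sum(s >= threshold for s in off_topic) / len(off_topic)
        points.append((fa_rate, miss_rate))
    return points

# Made-up (score, on-topic?) pairs for five tracked stories.
scored = [(0.9, True), (0.7, False), (0.6, True), (0.3, False), (0.1, True)]
for fa, miss in det_points(scored):
    print(f"false alarm rate={fa:.2f}  miss rate={miss:.2f}")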
DET plots
What is personalization
Limitations of a typical IR system
General approach for mitigating challenges
Types of personalization
Question answering
In information retrieval
The task of Question Answering
Question answering system architecture
