
Information Retrieval (3170718)

Prepared by:
Ms. Twinkle P. Kosambia
Computer Engineering Department
C. K. Pithawalla College of Engineering and
Technology
Subject Name: Information Retrieval
Subject Code: 3170718
Reference Books:
1. Introduction to Information Retrieval. Christopher D. Manning, Prabhakar
Raghavan, and Hinrich Schuetze. Cambridge University Press, 2007.
2. Search Engines: Information Retrieval in Practice. Bruce Croft, Donald
Metzler, and Trevor Strohman. Pearson Education, 2009.
3. Modern Information Retrieval. Ricardo Baeza-Yates and Berthier
Ribeiro-Neto. 2nd edition, Addison-Wesley, 2011.
4. Information Retrieval: Implementing and Evaluating Search Engines.
Stefan Buttcher, Charles L. A. Clarke, and Gordon Cormack. MIT Press, 2010.
Chapter 7: Advanced Topics
● Summarization
● Topic detection and tracking
● Personalization
● Question answering
● Cross language information retrieval
What is summarization

To reduce documents to a few key paragraphs, sentences, or phrases
describing their content.

Snippet generation is an example of text summarization:
1. Query-independent
2. Query-dependent
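As a rough, illustrative sketch (not taken from the slides), a query-dependent snippet can be produced by keeping the document sentences that contain the most query terms; the function name and scoring rule below are assumptions:

import re

def query_dependent_snippet(document, query, max_sentences=2):
    # Toy query-dependent snippet generator (illustrative only): keep the
    # sentences that contain the most distinct query terms.
    query_terms = set(query.lower().split())
    sentences = re.split(r'(?<=[.!?])\s+', document)
    scored = []
    for i, sent in enumerate(sentences):
        words = set(re.findall(r'\w+', sent.lower()))
        scored.append((len(words & query_terms), i, sent))
    # Keep the best-scoring sentences, then restore document order.
    best = sorted(scored, reverse=True)[:max_sentences]
    return ' ... '.join(s for _, _, s in sorted(best, key=lambda t: t[1]))

doc = ("Extractive summaries reuse sentences from the source. "
       "Search engines generate snippets from webpages. "
       "Abstractive summaries rephrase the source text.")
print(query_dependent_snippet(doc, "search engine snippets"))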
Summarization

Summaries may be classified as:
● Extractive
● Abstractive
Extractive summaries

● Extractive summaries are created by reusing portions (words,
sentences, etc.) of the input text verbatim.
● For example, search engines typically generate extractive
summaries from webpages.
● Most of the summarization research today is on extractive
summarization.
Abstractive summaries
● In abstractive summarization, information from the source text is
rephrased.
● Human beings generally write abstractive summaries (except when
they do their assignments).
● Example of an abstractive summary (a book review):
“An innocent hobbit of the Shire journeys with eight companions to
the fires of Mount Doom to destroy the One Ring and the dark lord
Sauron forever.”
Summarization techniques

● Summarization techniques can be:
○ Supervised
○ Unsupervised
Supervised techniques

● Features of sentences (e.g. position of the sentence, number of words
in the sentence, etc.) that make them good candidates for inclusion in
the summary are learnt (a minimal sketch follows after this list).
● Sentences in an original training document can be labeled as “in
summary” or “not in summary”.
● The main drawback of supervised techniques is that training data is
expensive to produce and relatively sparse.
● Also, most readily available human-generated summaries are
abstractive in nature.
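A minimal sketch of the supervised idea, assuming scikit-learn is available; the two features (sentence position and length), the training sentences, and the labels are made up for illustration:

from sklearn.linear_model import LogisticRegression

def sentence_features(sentences):
    # Two toy features per sentence: relative position in the document
    # and length in words (real systems use many more features).
    n = len(sentences)
    return [[i / n, len(s.split())] for i, s in enumerate(sentences)]

# Hypothetical training data: sentences labeled by annotators as
# "in summary" (1) or "not in summary" (0).
train_sentences = [
    "The company reported record profits this quarter.",
    "Coffee was served in the lobby before the meeting.",
    "Revenue grew by twenty percent year over year.",
    "The CEO wore a blue suit.",
]
labels = [1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(sentence_features(train_sentences), labels)

# For a new document, rank sentences by the predicted probability of
# being "in summary" and keep the top one as the extractive summary.
new_doc = [
    "Quarterly earnings beat analyst expectations.",
    "Lunch was catered by a local restaurant.",
]
probs = clf.predict_proba(sentence_features(new_doc))[:, 1]
ranked = sorted(zip(new_doc, probs), key=lambda t: -t[1])
print(ranked[0][0])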
Major supervised approaches

● Lin and Hovy use the concept of topic signatures to rank
sentences.
● TS = {topic, signature}
○ = {topic, <t1, w1>, …, <tn, wn>}
● TS = {restaurant-visit, <food, 0.5>, <menu, 0.2>, <waiter,
0.15>, …}
● Topic signatures are learnt using a set of documents pre-classified
as relevant or non-relevant for each topic.
Example
● If a document is classified as relevant to “restaurant visit”, we
know the important words of this document will form the topic
signature of the topic “restaurant visit”. This is a supervised
process.
● Extraction of the important words from a document is an
unsupervised process.
● E.g. “food” occurs many times in a cookbook and is an important
word in that document (a small scoring sketch follows below).
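An illustrative sketch of scoring sentences against a topic signature; the weights are the example values from the slide, and the scoring function itself is an assumption:

# Assumed example topic signature (illustrative weights, not learned values).
topic_signature = {"food": 0.5, "menu": 0.2, "waiter": 0.15}

def score_sentence(sentence, signature):
    # Sum the signature weights of the terms appearing in the sentence.
    return sum(signature.get(w.strip(".,"), 0.0)
               for w in sentence.lower().split())

sentences = [
    "The waiter brought the menu quickly.",
    "Parking nearby was difficult to find.",
]
for s in sentences:
    print(round(score_sentence(s, topic_signature), 2), s)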
Unsupervised techniques: TextRank

● This approach models the document as a graph and uses an
algorithm similar to Google's PageRank algorithm to find
top-ranked sentences.
● The key intuition is the notion of centrality or prestige in social
networks, i.e. a sentence should be highly ranked if it is
recommended by many other highly ranked sentences (a rough
sketch follows below).
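A rough TextRank-style sketch in plain Python; the word-overlap similarity, damping factor, and iteration count are assumed illustrative choices, not the exact published formulation:

def overlap_similarity(s1, s2):
    # Simple word-overlap similarity between two sentences.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def textrank(sentences, damping=0.85, iterations=30):
    # Build the sentence-similarity graph, then run a PageRank-style
    # power iteration so that a sentence is ranked highly when it is
    # "recommended by" (similar to) other highly ranked sentences.
    n = len(sentences)
    sim = [[0.0 if i == j else overlap_similarity(sentences[i], sentences[j])
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_weight = sum(sim[j])  # total edge weight leaving node j
                if sim[j][i] > 0 and out_weight > 0:
                    rank += scores[j] * sim[j][i] / out_weight
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return sorted(zip(scores, sentences), reverse=True)

sents = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat near the door.",
    "Stock prices fell sharply on Monday.",
]
for score, s in textrank(sents):
    print(round(score, 3), s)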
Topic detection
and tracking
Introduction

● TDT (Topic Detection and Tracking) is intended to explore techniques
for detecting the appearance of new topics and for tracking their
reappearance and evolution.
● The notion of a “topic” was modified and sharpened to be an
“event”, meaning some unique thing that happens at some
point in time.
● Sources of information include, for example, various newswires and
various news broadcast programs.
● The information flowing from each source is assumed to be
divided into a sequence of stories, which may provide
information on one or more events.
Introduction
● The general task is to identify the events being discussed in
these stories, in terms of the stories that discuss them.
● Stories that discuss unexpected events will of course follow the
event whereas stories on expected events can both precede and
follow the event.
Purpose: To develop technologies for retrieval and automatic
organization of broadcast news and newswire stories and to evaluate
their performance.
Corpus: TDT processing addresses multiple sources of information,
including newswire (text) and broadcast news (speech).
The information is modeled as a sequence of stories. These stories
provide information on many topics.
Google News
Terminology in TDT
● A story is a single document / newswire story conveying some
information to the user.
● An event is something that happens at a particular place and at a
particular time.
● A topic is an important event considered together with all related
events.
● For example, the eruption of Mount Pinatubo on June 15th, 1991 is
considered to be an event, whereas a volcanic eruption in general is
considered to be a class of events.
Google Alert
RSS Feed

● What is RSS: RSS is an XML-based format that allows the
syndication of lists of hyperlinks, along with other information,
or metadata, that helps viewers decide whether they want to follow
the link.
○ RSS stands for Really Simple Syndication.
○ RSS is written in XML, and RSS files are updated automatically.
● Web content providers can easily create and disseminate feeds
of data that include new links, headlines, and summaries (a small
parsing sketch follows below).
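A small sketch of reading an RSS 2.0 feed with the Python standard library; the feed URL is a hypothetical placeholder:

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/news/rss.xml"  # hypothetical feed URL

with urllib.request.urlopen(FEED_URL) as response:
    root = ET.fromstring(response.read())

# RSS 2.0 nests items under <channel>; each item carries a headline,
# a hyperlink, and a short summary.
for item in root.findall("./channel/item"):
    title = item.findtext("title", default="")
    link = item.findtext("link", default="")
    description = item.findtext("description", default="")
    print(title, link, description[:80])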
Corpus
The TDT2 Corpus contains data from
➔ Newswire: Associated Press WorldStream, New York Times
News Services
➔ Radio: Voice of America World News, Public Radio
International's The World
➔ Television: CNN Headline News, ABC World News Tonight

● There are 300 stories/day, 5 hours of digital recordings/day, 54,000
stories, and 630 hours of audio in total.
● For newswire sources, each story is clearly delimited by the
newswire format.
Organization of the TDT2 Corpus

The TDT2 Corpus was divided into three parts for research
management purposes:
➔ Training set: the data may be used without limit for research
purposes.
➔ Development test set: the data is available for testing
TDT algorithms.
➔ Evaluation test set: the data is reserved for the final formal
evaluation of performance.
The three tasks

The input to the TDT2 project is a stream of stories.
This stream may not be pre-segmented into stories, and the topics
may not be known to the system.
The three technical tasks are:
1. Segmentation of a news source into stories
2. The tracking of known topics
3. The detection of unknown topics
Segmentation

Segmenting the stream of data into constituent stories; this applies to
audio (radio and TV) sources. A rough sketch of one way to do this
follows below.
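A rough segmentation sketch in the spirit of TextTiling (an assumed approach, not the official TDT segmenter): compare adjacent windows of words and place a story boundary where their vocabulary similarity drops; the window size and threshold are illustrative:

from collections import Counter
from math import sqrt

def cosine(c1, c2):
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def boundaries(words, window=50, threshold=0.3):
    # Slide through the token stream and place a boundary wherever the
    # word windows on either side look dissimilar (window size and
    # threshold are illustrative, not tuned values).
    cuts = []
    for i in range(window, len(words) - window + 1, window):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        if cosine(left, right) < threshold:
            cuts.append(i)
    return cuts

# Usage: pretend this is an unsegmented broadcast-news transcript with
# a budget story followed by a flooding story.
transcript = ("senate passes the budget bill " * 20 +
              "heavy rain floods the coast " * 20).split()
print(boundaries(transcript))  # a boundary near word 100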
Tracking
● Associating incoming stories with topics that are known to the system.
● A topic is “known” by its association with the stories that discuss it.
● A set of training stories is identified for each topic. The system may
train on the target topic by using all of the stories in the corpus.
● Find all the stories that discuss a given target topic:
○ Training: given N sample stories that discuss a given target topic,
○ Test: find all subsequent stories that discuss the target topic.
The tracking task

● The system is given training documents Dj for each topic.
● Stories come in a sequence S1, S2, …
● Stories whose similarity to the training stories is above a threshold
are marked YES (a small sketch follows below).
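A minimal tracking sketch (an assumed scoring scheme, not the official TDT tracker): build a term-frequency centroid from the training stories of a topic and mark an incoming story YES when its cosine similarity to the centroid exceeds a threshold; the threshold and example stories are made up:

from collections import Counter
from math import sqrt

def tf(text):
    return Counter(text.lower().split())

def cosine(c1, c2):
    common = set(c1) & set(c2)
    dot = sum(c1[w] * c2[w] for w in common)
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def track(training_stories, incoming_stories, threshold=0.3):
    # Build one term-frequency centroid for the topic from its training
    # stories, then say YES/NO for each subsequent story in arrival order.
    centroid = Counter()
    for story in training_stories:
        centroid += tf(story)
    return [(cosine(tf(story), centroid) >= threshold, story)
            for story in incoming_stories]

training = ["volcano erupts near the city", "lava flows from the volcano"]
stream = ["the volcano eruption continues", "parliament debates the budget"]
for decision, story in track(training, stream):
    print("YES" if decision else "NO", story)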
The tracking task

● Misses and false alarms


Tracking task - adaptation
Evaluating tracking

● A perfect tracker says YES to on-topic stories and NO to all other
stories.
● In reality, the system emits a confidence score for the topic.
Evaluating tracking

● At every score threshold, there is a miss rate and a false alarm rate:
○ Any on-topic stories below the score are misses.
○ Any off-topic stories above the score are false alarms.
● Plot a (false alarm, miss) pair for every score (a computation sketch
follows below).
○ The result is an ROC curve.
○ TDT uses a modification called the “DET curve” or
“DET plot”.
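An illustrative sketch of computing the miss and false-alarm rates at every score threshold, i.e. the points that make up a DET (or ROC) curve; the scored stories are made-up examples:

def det_points(scored_stories):
    # scored_stories: list of (confidence_score, is_on_topic) pairs.
    on_topic = [s for s, rel in scored_stories if rel]
    off_topic = [s for s, rel in scored_stories if not rel]
    points = []
    for threshold, _ in sorted(scored_stories):
        # Misses: on-topic stories the system scored below the threshold.
        miss_rate = sum(s < threshold for s in on_topic) / len(on_topic)
        # False alarms: off-topic stories scored at or above the threshold.
        fa_rate = sum(s >= threshold for s in off_topic) / len(off_topic)
        points.append((fa_rate, miss_rate))
    return points

# Made-up (score, on-topic?) pairs for five tracked stories.
scored = [(0.9, True), (0.7, False), (0.6, True), (0.3, False), (0.1, True)]
for fa, miss in det_points(scored):
    print(f"false alarm rate={fa:.2f}  miss rate={miss:.2f}")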
DET plots
What is personalization
Limitations of a typical IR system
General approach for mitigating challenges
Types of personalization
Question answering
In information retrieval
The task of Question Answering
Question answering system architecture
