Text Mining
E-CSE(II)
Text Databases
Consists of large collections of documents from various
Information Retrieval (IR)
It is a field that has been developing in parallel with
database systems.
Database systems focused on query and transaction
Cont
Some IR problems are usually not present in
database systems, such as unstructured documents, search based on keywords, and the notion of relevance.
Cont
The problem in IR is to locate relevant documents in a document
to push any newly arrived information item to a user if the item is relevant to the user's information need.
This type of information access is called information
filtering.
Recall: Percentage of documents that are relevant to the query and were, in fact, retrieved.
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
[Figure: the Relevant and Retrieved document sets shown within All Documents]
F-score: A trade-off measure, since a system can trade off recall for precision and vice versa. It is the harmonic mean of precision and recall. It discourages a system that sacrifices one measure for another.
F-score=(recall*precision)/((recall + precision)/2)
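As a minimal sketch, the measures above can be computed from two sets of document identifiers. The function name and the sample sets are illustrative assumptions. Precision is taken in its standard form (the fraction of retrieved documents that are relevant), since its definition does not appear in the text above.

```python
def precision_recall_f(relevant: set, retrieved: set):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Standard definition of precision (assumed; not stated in the slides).
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Same form as the slide: (recall*precision)/((recall + precision)/2)
    f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
precision, recall, f_score = precision_recall_f({1, 2, 3, 4}, {3, 4, 5})
```

With these sample sets, recall is 2/4 and precision is 2/3, so the F-score lies between the two values and closer to the smaller one.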
Document Selection
A query is used to specify constraints for selecting
relevant documents.
Boolean Model: A document is represented as a set of keywords, and the user provides a Boolean expression of keywords.
E.g., "tea or coffee", "database systems but not DB2".
The retrieval system takes such a Boolean query and returns the documents that satisfy it.
Works well when the user knows a lot about the document collection.
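The Boolean model can be sketched with plain set operations. The tiny collection and the helper name `containing` below are hypothetical, and treating "database systems" as the conjunction of two keywords is an assumption made for illustration.

```python
# Hypothetical collection: each document is a set of keywords.
docs = {
    "d1": {"tea", "milk"},
    "d2": {"coffee", "sugar"},
    "d3": {"database", "systems", "db2"},
    "d4": {"database", "systems", "oracle"},
}


def containing(term: str) -> set:
    """Return the ids of documents whose keyword set contains the term."""
    return {doc_id for doc_id, keywords in docs.items() if term in keywords}


# "tea or coffee": OR maps to set union.
tea_or_coffee = containing("tea") | containing("coffee")

# "database systems but not DB2": AND maps to intersection, NOT to difference.
db_not_db2 = (containing("database") & containing("systems")) - containing("db2")
```

Each document either satisfies the expression or does not, so the result is an unordered set rather than a ranked list.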
Document Ranking
Use the query to rank all documents in order of
relevance. Retrieval systems present a ranked list of documents in response to a user's query. Different ranking methods are based on logic, algebra, probability, and statistics. Match the keywords in the query with those in the documents and score each document based on how well it matches the query. The degree of relevance of a document is a score computed from information such as the frequency of words in the document and in the whole collection.
Stop list: A set of words that are deemed irrelevant. Usually they are prepositions, articles, and one-letter words. E.g., a, for, the, an, with, etc. A stop list is maintained with a set of documents to avoid indexing useless words.
Word stem: A group of words may share the same word stem. E.g., running -> run, runner -> run.
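A minimal sketch of applying a stop list and word stems before indexing. The stop list uses the example words above. The `STEMS` lookup table is a hypothetical stand-in for a real stemming algorithm and covers only the two example words.

```python
# Stop list taken from the examples in the text.
STOP_LIST = {"a", "for", "the", "an", "with"}

# Hypothetical stem lookup; a real system would use a stemming algorithm.
STEMS = {"running": "run", "runner": "run"}


def index_terms(text: str) -> list:
    """Lowercase the text, drop stop words, and map words to their stems."""
    terms = []
    for word in text.lower().split():
        if word in STOP_LIST:
            continue  # do not index useless words
        terms.append(STEMS.get(word, word))  # unknown words are kept unchanged
    return terms


example_terms = index_terms("The runner with a dog")
```

Both "running" and "runner" end up indexed under the single term "run", so a query for either form matches documents containing the other.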
$$
TF(d,t) =
\begin{cases}
0 & \text{if } freq(d,t) = 0 \\
1 + \log(1 + \log(freq(d,t))) & \text{otherwise}
\end{cases}
$$
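The piecewise term-frequency formula above (0 when freq(d,t) = 0, and 1 + log(1 + log(freq(d,t))) otherwise) can be sketched as a small function. The function name is illustrative, and the natural logarithm is an assumption, since the base is not stated.

```python
import math


def term_frequency(freq: int) -> float:
    """Damped term frequency for a raw count freq(d, t)."""
    if freq == 0:
        return 0.0
    # Natural log is assumed; the slides do not state the base.
    return 1.0 + math.log(1.0 + math.log(freq))
```

The double logarithm damps large counts: a term occurring once scores exactly 1, and further occurrences raise the score only slowly.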
important if it occurs in many research papers in a database system conference.
IDF(t) = log((1 + |d|) / |dt|)
d: the document collection
dt: the set of documents containing term t
If |dt| << |d|, term t has a large IDF scaling factor, and vice versa.
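A minimal sketch of the IDF formula, assuming each document is represented as a set of terms. The function name and the sample collection are hypothetical, and the natural logarithm is again an assumption.

```python
import math


def inverse_document_frequency(term: str, documents: list) -> float:
    """IDF(t) = log((1 + |d|) / |dt|) over a list of term sets."""
    n_docs = len(documents)  # |d|
    n_containing = sum(1 for doc in documents if term in doc)  # |dt|
    if n_containing == 0:
        raise ValueError("term does not occur in the collection")
    return math.log((1 + n_docs) / n_containing)


# Hypothetical collection of four documents.
collection = [
    {"database", "systems"},
    {"database", "mining"},
    {"text", "mining"},
    {"database"},
]
idf_text = inverse_document_frequency("text", collection)          # rare term
idf_database = inverse_document_frequency("database", collection)  # common term
```

Here "text" occurs in one document and "database" in three, so "text" receives the larger scaling factor.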
$$\mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
where $v_1 \cdot v_2 = \sum_{i=1}^{t} v_{1i} v_{2i}$ and $|v_1| = \sqrt{v_1 \cdot v_1}$.
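The similarity measure above is the cosine of the angle between two term vectors. A minimal sketch, assuming both vectors are plain lists of equal length (the function names are illustrative):

```python
import math


def dot(v1: list, v2: list) -> float:
    """Inner product: the sum of v1[i] * v2[i] over all t dimensions."""
    return sum(a * b for a, b in zip(v1, v2))


def cosine_similarity(v1: list, v2: list) -> float:
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|), with |v| = sqrt(v . v)."""
    norm_product = math.sqrt(dot(v1, v1)) * math.sqrt(dot(v2, v2))
    if norm_product == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot(v1, v2) / norm_product
```

Vectors pointing in the same direction score 1 regardless of their length, and vectors that share no terms score 0.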
Inverted Indices
It is an index structure that maintains two hash
Term table: consists of term records with the fields
term_id -> identifier for the term
posting_list -> list of document identifiers in which the term appears.
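A minimal sketch of building the term table, assuming each document is given as a list of terms keyed by a document identifier. A Python dictionary stands in for the hash table, and the term string itself serves as the term_id. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def build_term_table(documents: dict) -> dict:
    """Map each term to its posting list of document identifiers."""
    term_table = defaultdict(list)
    for doc_id in sorted(documents):
        # set() ensures a document appears once per term in the posting list.
        for term in set(documents[doc_id]):
            term_table[term].append(doc_id)
    return dict(term_table)


# Hypothetical documents keyed by document identifier.
sample_docs = {
    1: ["text", "mining"],
    2: ["text", "retrieval", "text"],
    3: ["mining", "index"],
}
term_table = build_term_table(sample_docs)
```

Looking up a term is then a single dictionary access that returns all documents containing it, for example `term_table["text"]`.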
Signature File
It is a file that stores a signature record for each
terms.
collection, a retrieval system can answer a given query by looking up which documents contain the query keywords.
When examples of relevant documents are available,
the system can learn from such examples to improve retrieval performance, called relevance feedback.
Cont
When we do not have such relevant examples, a system
can assume the top few retrieved documents in some initial retrieval results to be relevant and extract keywords from them to expand the query. This is called pseudo-feedback or blind feedback.
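Pseudo-feedback can be sketched as follows, assuming the initial ranked results are available as lists of terms. Choosing the most frequent new terms is one simple selection rule picked for illustration, not the only option, and the function name and parameters are hypothetical.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_new_terms=2):
    """Expand a query with frequent terms from the top-k retrieved documents."""
    pseudo_relevant = ranked_docs[:top_k]  # assumed relevant, without user input
    counts = Counter(
        term
        for doc in pseudo_relevant
        for term in doc
        if term not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms


# Hypothetical initial ranking for the query ["text"].
ranked = [
    ["text", "mining", "retrieval", "index"],
    ["text", "retrieval", "index", "index"],
    ["cooking", "recipes"],
]
expanded = expand_query(["text"], ranked)
```

Only the top two documents contribute terms here, so nothing from the third document enters the expanded query.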
Thank you