Text Mining

By M. Karthikeyan (10MCS103), M.E-CSE (II)
Text Databases
A text database consists of a large collection of documents from various
sources, e.g. articles, books, research papers, and digital libraries.

Semistructured data
A document contains a few structured fields, such as title and authors, and
unstructured text components, such as abstract and contents.
Information retrieval techniques, such as indexing methods,
have been developed to handle unstructured documents.

Information Retrieval (IR)
IR is a field that has been developing in parallel with
database systems.
Database systems have focused on query and transaction
processing of structured data.

Information retrieval has focused on the organization and
retrieval of information from a large number of text-based documents.

Information Retrieval (IR), cont.
Some database system problems are usually not present in
IR, such as concurrency control, recovery, transaction management, and update.

Some IR problems are usually not encountered in traditional
database systems, such as unstructured documents, search based on keywords, and the notion of relevance.
Information Retrieval (IR), cont.
The typical problem in IR is to locate relevant documents in a document
collection based on a user's query.

For a short-term information need, the user takes the initiative to pull the
information out of the collection.

For a long-term information need, the retrieval system may also take the initiative
to push any newly arrived information item to a user if the item is relevant to the user's information need.
This type of information access is called information
filtering.
Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses).
$$precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$$
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.
$$recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$$
[Figure: Venn diagram over All Documents showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved).]
F-score
The F-score captures the trade-off between recall and precision.
It is the harmonic mean of precision and recall.
It discourages a system that sacrifices one measure for the other.
$$F\text{-score} = \frac{recall \times precision}{(recall + precision)/2}$$
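The three measures above can be computed directly from two sets of document identifiers. The following is a minimal sketch. The function name, the document ids, and the choice to return 0 for empty sets are assumptions made for illustration, not part of the slides.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Assumption: an empty set yields 0.0 instead of a division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean, written in the same form as the slide's formula.
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d4", "d7", "d9"}
p, r, f = precision_recall_f(relevant_docs, retrieved_docs)
print(p, r, f)
```

Here precision is 3/5 and recall is 3/4, so the F-score falls between the two, closer to the smaller value.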
Text Retrieval Methods
Document Selection
Document Ranking

Document Selection
The query is used to specify constraints for selecting
relevant documents.
Boolean Model
A document is represented as a set of keywords, and the user provides a Boolean expression of keywords. E.g.: "tea or coffee", "database systems but not DB2".
The retrieval system takes such a Boolean query and returns the documents that satisfy it.
Works well when the user knows a lot about the document collection.
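As a rough illustration of the Boolean model, the sketch below stores each document as a set of keywords and expresses the two example queries as Python predicates. The document ids, the keyword sets, and the `select` helper are hypothetical.

```python
# Hypothetical collection: each document is represented as a set of keywords.
docs = {
    "d1": {"tea", "milk"},
    "d2": {"coffee", "sugar"},
    "d3": {"database systems", "DB2"},
    "d4": {"database systems", "indexing"},
}


def select(predicate):
    """Return the ids of documents whose keyword set satisfies the predicate."""
    return sorted(doc_id for doc_id, keywords in docs.items() if predicate(keywords))


# "tea or coffee"
tea_or_coffee = select(lambda kw: "tea" in kw or "coffee" in kw)
# "database systems but not DB2"
db_not_db2 = select(lambda kw: "database systems" in kw and "DB2" not in kw)
print(tea_or_coffee, db_not_db2)
```

A document either satisfies the expression or it does not, so the result is an unranked set. This is the main contrast with document ranking.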
Document Ranking
The query is used to rank all documents in order of
relevance.
Retrieval systems present a ranked list of documents in response to a user's query.
Different ranking methods are based on logic, algebra, probability, and statistics.
Match the keywords in the query with those in the documents and score each document based on how well it matches the query.
The degree of relevance of a document is a score computed from information such as the frequency of words in the document and in the whole collection.
Vector Space Model

Represent a document and a query both as vectors in a
high-dimensional space corresponding to all the keywords.

Use an appropriate similarity measure to compute the
similarity between the query and the document vector.

Similarity values are used to rank the documents.
Stop list
A set of words that are deemed irrelevant. Usually they are prepositions, articles, and one-letter words. E.g.: a, for, the, an, with, etc.
A retrieval system maintains a stop list for a set of documents to avoid indexing useless words.
Word stem
A group of words may share the same word stem. E.g.: running -> run, runner -> run.
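A small sketch of how a stop list and word stems might be applied before indexing. The stop list and the stem table contain only the examples given above. A real system would use a rule-based stemmer rather than a lookup table, and the function name `index_terms` is an assumption.

```python
# Stop list built from the examples on the slide.
STOP_LIST = {"a", "for", "the", "an", "with"}
# Toy stem table taken from the slide's examples (assumption: a lookup table
# stands in for a real stemming algorithm).
STEMS = {"running": "run", "runner": "run"}


def index_terms(text):
    """Lower-case the text, drop stop words, and map each word to its stem."""
    words = text.lower().split()
    return [STEMS.get(word, word) for word in words if word not in STOP_LIST]


terms = index_terms("The runner with a running shoe")
print(terms)  # ['run', 'run', 'shoe']
```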
Weighted Term Frequency

Term frequency: the number of occurrences of term t in
document d, i.e. freq(d,t).

(Weighted) term frequency matrix TF(d,t): measures the association of a term t with the given document d. It is zero if d does not contain t, and nonzero otherwise.
$$TF(d,t) = \begin{cases} 0 & \text{if } freq(d,t) = 0 \\ 1 + \log(1 + \log(freq(d,t))) & \text{otherwise} \end{cases}$$
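The weighted term frequency can be sketched in a few lines. The slides do not state the base of the logarithm, so the natural logarithm is assumed here. The helper names and the sample document are illustrative.

```python
import math


def freq(doc_terms, term):
    """Raw term frequency freq(d, t): occurrences of term in a list of terms."""
    return doc_terms.count(term)


def tf(doc_terms, term):
    """Weighted term frequency TF(d, t) as defined above (natural log assumed)."""
    f = freq(doc_terms, term)
    if f == 0:
        return 0.0
    return 1 + math.log(1 + math.log(f))


# Hypothetical document, already split into terms.
doc = ["text", "mining", "text", "retrieval", "text"]
print(tf(doc, "text"), tf(doc, "mining"), tf(doc, "index"))
```

A term that occurs exactly once gets weight 1, and the double logarithm makes the weight grow slowly as the raw count increases.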
Inverse Document Frequency (IDF)

It represents the importance of the term t: a term that occurs in many documents is scaled down because it discriminates poorly between them.
E.g.: the term "database systems" is likely to be less
important if it occurs in many research papers at a database systems conference.
IDF(t) = log((1+|d|)/|dt|)
d: the document collection
dt: the set of documents containing term t
If |dt| << |d|, the term t has a large IDF scaling factor, and vice versa.
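A minimal sketch of the IDF formula over a toy collection. The natural logarithm, the sample documents, and the choice to return 0 for a term that appears in no document (where the formula is undefined) are assumptions.

```python
import math


def idf(term, collection):
    """IDF(t) = log((1 + |d|) / |dt|), where collection is a list of term sets."""
    containing = sum(1 for doc_terms in collection if term in doc_terms)
    if containing == 0:
        # Assumption: the formula is undefined for unseen terms, so return 0.
        return 0.0
    return math.log((1 + len(collection)) / containing)


# Hypothetical collection of four documents, each a set of terms.
collection = [
    {"database", "index"},
    {"database", "query"},
    {"database", "signature"},
    {"database", "transaction"},
]
common = idf("database", collection)   # occurs in every document
rare = idf("signature", collection)    # occurs in one document
print(common, rare)
```

The term that appears in every document receives a much smaller scaling factor than the term that appears in only one.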
In the complete vector-space model, TF and IDF are
combined, which forms the TF-IDF measure.

TF-IDF(d,t)=TF(d,t) * IDF(t)
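Similarity
v1 and v2 are two document vectors. Their cosine similarity is
$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
where the inner product is $v_1 \cdot v_2 = \sum_{i=1}^{t} v_{1i} v_{2i}$ and $|v_1| = \sqrt{v_1 \cdot v_1}$.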
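The cosine measure and its use for ranking can be sketched as follows. The vectors are made-up term weights over three keywords, and returning 0 for an all-zero vector is an assumption.

```python
import math


def cosine_similarity(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|) for equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical weights over three keywords.
query = [1.0, 1.0, 0.0]
documents = {
    "doc_a": [2.0, 2.0, 0.0],  # same direction as the query
    "doc_b": [0.0, 0.0, 3.0],  # shares no keyword with the query
    "doc_c": [1.0, 0.0, 1.0],  # shares one keyword
}
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```

Because the measure depends only on the angle between vectors, doc_a scores as a perfect match even though its weights are twice those of the query.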
Text Indexing Techniques

Several text retrieval indexing techniques exist, including:
Inverted indices
Signature files
Inverted Indices
An inverted index is an index structure that maintains two hash-indexed
or B+-tree-indexed tables.

Document table: a set of document records.
doc_id -> identifier of the document
posting_list -> list of terms in the document, stored according to some relevance measure
Term table: a set of term records.
term_id -> identifier of the term
posting_list -> list of identifiers of the documents in which the term appears
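The two tables can be sketched with Python dictionaries, which are hash tables and so stand in for the hash-indexed variant. Several simplifications are assumptions: the term string itself serves as the term identifier, terms in a document's posting list are sorted alphabetically rather than by a relevance measure, and the sample documents are made up.

```python
from collections import defaultdict

# Hypothetical collection: doc_id -> text.
documents = {
    1: "text mining and retrieval",
    2: "text indexing",
    3: "signature files",
}

document_table = {}             # doc_id -> posting list of terms
term_table = defaultdict(list)  # term   -> posting list of doc_ids

for doc_id, text in sorted(documents.items()):
    terms = sorted(set(text.split()))
    document_table[doc_id] = terms
    for term in terms:
        term_table[term].append(doc_id)


def docs_containing_all(*terms):
    """Intersect the posting lists of the given terms."""
    postings = [set(term_table.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


print(term_table["text"])                     # [1, 2]
print(docs_containing_all("text", "mining"))  # [1]
```

Answering a keyword query becomes a lookup in the term table followed by a set intersection, without scanning the documents.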
Signature File
A signature file stores a signature record for each
document in the database.

Each signature has a fixed size of b bits representing
terms.
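The slide only states that a signature is b bits wide. One common way to build such a signature, assumed here and not taken from the slides, is to hash each term to a bit position and set that bit. The value b = 64, the use of CRC32 as the hash, and all function names are illustrative choices.

```python
import zlib

B = 64  # signature size in bits (the slide's b); 64 is an arbitrary choice


def term_bit(term):
    """Map a term to one of the B bit positions (assumption: CRC32 mod B)."""
    return zlib.crc32(term.encode("utf-8")) % B


def signature(terms):
    """Build a B-bit signature by setting the bit of every term."""
    sig = 0
    for term in terms:
        sig |= 1 << term_bit(term)
    return sig


def may_contain(doc_sig, query_sig):
    """True if every bit set in the query signature is set in the document's."""
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["text", "mining", "index"])
print(may_contain(doc_sig, signature(["text", "mining"])))  # True
```

Under this scheme several terms can map to the same bit, so a match only means the document may contain the terms. A non-match is definite, so the signature acts as a cheap filter before the documents themselves are checked.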
Query Processing Technique

Once an inverted index has been created for a document
collection, a retrieval system can answer a keyword query by looking up which documents contain the query keywords.
When examples of relevant documents are available,
the system can learn from such examples to improve retrieval performance. This is called relevance feedback.
Query Processing Technique, cont.
When no such relevant examples are available, a system
can assume that the top few documents in some initial retrieval results are relevant and extract keywords from them to expand the query. This is called pseudo-feedback or blind feedback.
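One simple way to picture pseudo-feedback is shown below. The slides do not specify how keywords are extracted, so picking the most frequent non-query terms from the top-ranked documents is an assumption, as are the function name, its parameters, and the sample data.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, extra_terms=2):
    """Add the most frequent new terms from the top_k ranked documents."""
    counts = Counter()
    for doc_terms in ranked_docs[:top_k]:
        counts.update(t for t in doc_terms if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(extra_terms)]


# Hypothetical initial result list, best match first, each document as terms.
ranked_docs = [
    ["text", "mining", "retrieval", "index"],
    ["text", "retrieval", "index", "retrieval"],
    ["cooking", "recipes"],
]
expanded = expand_query(["text"], ranked_docs)
print(expanded)  # ['text', 'retrieval', 'index']
```

The third document is ignored because only the top two are assumed relevant. If that assumption is wrong, the expanded query can drift away from the user's intent.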
Limitations in Query Processing

Synonymy problem
Two words with identical meanings may have very different surface forms. E.g.: a query may use the word "automobile" but a relevant document may use "vehicle" instead.
Polysemy problem
One word may mean different things in different contexts. E.g.: mining.
Thank you