Text Mining


By M. Karthikeyan, 10MCS103, M.E-CSE(II)

Text Databases
Consist of large collections of documents from various sources, e.g. articles, books, research papers, digital libraries, etc.

Semistructured data
A document contains a few structured fields, such as title and authors, and unstructured text components, such as abstract and contents.

Information retrieval techniques, such as indexing methods, have been developed to handle unstructured documents.

Information Retrieval (IR)
It is a field that has been developing in parallel with database systems.

Database systems have focused on query and transaction processing on structured data.

Information retrieval has focused on the organization and retrieval of information from a large number of text-based documents.

Cont.
Typical database system problems, such as concurrency control, recovery, transaction management and update, are usually not present in IR.

Typical IR problems, such as unstructured documents, search based on keywords and the notion of relevance, are usually not encountered in traditional database systems.

Cont.
The problem in IR is to locate relevant documents in a document collection based on a user's query.

For a short-term information need, the user takes the initiative to pull the information out of the collection.

For a long-term information need, the retrieval system may also take the initiative to push any newly arrived information item to a user if the item is relevant to the user's information need. This type of information access is called information filtering.

Basic Measures for Text Retrieval

Precision: Percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses).

$$precision = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Retrieved\}|}$$

Recall: Percentage of documents that are relevant to the query and were, in fact, retrieved.

$$recall = \frac{|\{Relevant\} \cap \{Retrieved\}|}{|\{Relevant\}|}$$

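[Figure: Venn diagram within the set of all documents, showing the relevant documents, the retrieved documents, and their overlap (relevant and retrieved).]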

F-score
A retrieval system often trades off recall for precision, or vice versa. The F-score is the harmonic mean of precision and recall. It discourages a system that sacrifices one measure for the other.

$$F\text{-}score = \frac{recall \times precision}{(recall + precision)/2}$$
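The three measures can be checked on a small example. The sketch below assumes documents are identified by made-up ids held in Python sets; the function name and the sample ids are illustrative only.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        # Harmonic mean is undefined here; report 0 by convention (assumption).
        return precision, recall, 0.0
    f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical collection: four relevant documents, three retrieved.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
precision, recall, f_score = precision_recall_f(relevant, retrieved)
print(precision, recall, f_score)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, so the F-score falls between the two values but closer to the lower one.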

Text Retrieval Methods

Document Selection
Document Ranking

Document Selection
The query is used to specify constraints for selecting relevant documents.

Boolean Model
A document is represented as a set of keywords, and the user provides a Boolean expression of keywords. E.g. "tea or coffee", "database systems but not DB2".
The retrieval system takes such a Boolean query and returns the documents that satisfy it.
Works well when the user knows a lot about the document collection.
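The two example queries can be expressed with set operations. This is a minimal sketch, assuming a tiny made-up collection in which each document is a set of lowercase keywords, and assuming that "database systems" means both keywords must be present.

```python
# Hypothetical collection: document id -> set of keywords.
docs = {
    1: {"tea", "milk"},
    2: {"coffee", "database", "systems"},
    3: {"database", "systems", "db2"},
    4: {"juice"},
}


def containing(term):
    """Ids of all documents whose keyword set contains the term."""
    return {doc_id for doc_id, keywords in docs.items() if term in keywords}


# "tea or coffee": union of the two result sets.
tea_or_coffee = containing("tea") | containing("coffee")

# "database systems but not DB2": intersection, then set difference.
db_not_db2 = (containing("database") & containing("systems")) - containing("db2")

print(tea_or_coffee, db_not_db2)
```

A document is either selected or not; there is no notion of one match being better than another, which is what document ranking adds.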

Document Ranking
Use the query to rank all documents in order of relevance.
Retrieval systems present a ranked list of documents in response to a user's query.
Different ranking methods are based on logic, algebra, probability and statistics.
Match the keywords in the query with those in the documents and score each document based on how well it matches the query.
The degree of relevance of a document is a score computed from information such as the frequency of words in the document and in the whole collection.

Vector Space Model


Represent a document and a query both as vectors in a high-dimensional space corresponding to all the keywords.

Use an appropriate similarity measure to compute the similarity between the query vector and the document vector.

Similarity values are used to rank the documents.

Stop list
A set of words that are deemed irrelevant. Usually they are prepositions, articles and one-letter words. E.g. a, for, the, an, with, etc.
A stop list is maintained for a set of documents to avoid indexing useless words.

Word stem
A group of words may share the same word stem. E.g. running -> run, runner -> run
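Both steps can be sketched as a small preprocessing function. The stop list and the stem table below contain only the examples from the slide; a real system would use a full stop list and a stemming algorithm, so the lookup table is purely an illustrative assumption.

```python
# Stop words and stems taken from the examples above (toy data).
STOP_LIST = {"a", "for", "the", "an", "with"}
STEMS = {"running": "run", "runner": "run"}


def index_terms(text):
    """Lowercase, drop stop words, and map each remaining word to its stem."""
    tokens = text.lower().split()
    return [STEMS.get(tok, tok) for tok in tokens if tok not in STOP_LIST]


terms = index_terms("The runner with a plan for running")
print(terms)
```

After this step "runner" and "running" are indexed under the same term, so a query for either word matches both.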

Weighted Term Frequency


Term frequency: the number of occurrences of term t in document d, i.e. freq(d,t).

(Weighted) term frequency matrix TF(d,t): measures the association of a term t with respect to the given document d. It is zero if d does not contain t, and nonzero otherwise.

$$TF(d,t) = \begin{cases} 0 & \text{if } freq(d,t) = 0 \\ 1 + \log(1 + \log(freq(d,t))) & \text{otherwise} \end{cases}$$

Inverse Document Frequency (IDF)

It represents the scaling factor, or importance, of the term t.
E.g. the term "database systems" is likely to be less important if it occurs in many research papers at a database systems conference.

$$IDF(t) = \log \frac{1 + |d|}{|d_t|}$$

d: the document collection
d_t: the set of documents containing term t
If |d_t| << |d|, the term t has a large IDF scaling factor, and vice versa.

In the complete vector-space model, TF and IDF are combined, which forms the TF-IDF measure.

TF-IDF(d,t) = TF(d,t) * IDF(t)
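The TF, IDF and TF-IDF formulas can be tried on a toy collection. This sketch assumes the natural logarithm (the slides do not state a base) and represents each document as a list of tokens; the sample documents are made up.

```python
import math


def tf(freq):
    """Weighted term frequency: 0 if the term is absent, damped otherwise."""
    return 0.0 if freq == 0 else 1 + math.log(1 + math.log(freq))


def idf(num_docs, num_docs_with_term):
    """IDF(t) = log((1 + |d|) / |d_t|)."""
    return math.log((1 + num_docs) / num_docs_with_term)


def tf_idf(docs, doc_index, term):
    """TF-IDF of a term in docs[doc_index]; docs is a list of token lists."""
    freq = docs[doc_index].count(term)
    if freq == 0:
        return 0.0  # TF is zero, and |d_t| could be zero as well
    docs_with_term = sum(1 for doc in docs if term in doc)
    return tf(freq) * idf(len(docs), docs_with_term)


# Hypothetical collection of three tokenized documents.
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["database", "systems"],
]
print(tf_idf(docs, 0, "text"), tf_idf(docs, 1, "mining"))
```

"mining" appears in two of the three documents and "database" in only one, so with equal term frequency "database" receives the larger weight.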

Similarity
Let v1 and v2 be two document vectors. Their cosine similarity is

$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|}$$

where $v_1 \cdot v_2 = \sum_{i=1}^{t} v_{1i} v_{2i}$ and $|v_1| = \sqrt{v_1 \cdot v_1}$.
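A direct transcription of the cosine formula, used to rank documents against a query, might look as follows. The vectors are made-up term weights over three keywords, and returning 0 for a zero vector is an assumption, since the formula is undefined in that case.

```python
import math


def dot(v1, v2):
    """Inner product: sum over i of v1[i] * v2[i]."""
    return sum(a * b for a, b in zip(v1, v2))


def cosine_similarity(v1, v2):
    norm_product = math.sqrt(dot(v1, v1)) * math.sqrt(dot(v2, v2))
    # Assumption: treat similarity with a zero vector as 0.
    return dot(v1, v2) / norm_product if norm_product else 0.0


# Hypothetical query and document vectors over the same three keywords.
query = [1.0, 1.0, 0.0]
documents = {"d1": [2.0, 2.0, 0.0], "d2": [1.0, 0.0, 1.0], "d3": [0.0, 0.0, 3.0]}

# Rank document ids by decreasing similarity to the query.
ranking = sorted(documents, key=lambda d: cosine_similarity(query, documents[d]), reverse=True)
print(ranking)
```

Because the measure depends only on the angle between vectors, d1 scores highest even though it is twice as long as the query.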

Text Indexing Techniques


Several text retrieval indexing techniques:
Inverted indices
Signature files

Inverted Indices
It is an index structure that maintains two hash-indexed or B+-tree-indexed tables.

The document table consists of a set of document records:
doc_id -> identifier of the document
posting_list -> list of terms in the document, sorted according to some relevance measure

The term table consists of a set of term records:
term_id -> identifier of the term
posting_list -> list of identifiers of the documents in which the term appears
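The two tables can be sketched with Python dictionaries standing in for the hash-indexed tables. Several simplifications are assumptions made for illustration: the term string itself serves as term_id, posting lists in the document table are sorted alphabetically rather than by a relevance measure, and the sample documents are made up.

```python
from collections import defaultdict


def build_tables(docs):
    """Build (document_table, term_table) from a dict of doc_id -> text."""
    document_table = {}
    term_table = defaultdict(list)
    for doc_id, text in docs.items():
        terms = sorted(set(text.lower().split()))
        document_table[doc_id] = terms        # doc_id -> terms in the document
        for term in terms:
            term_table[term].append(doc_id)   # term -> documents containing it
    return document_table, dict(term_table)


def lookup(term_table, keywords):
    """Ids of documents that contain every query keyword."""
    postings = [set(term_table.get(k, [])) for k in keywords]
    return set.intersection(*postings) if postings else set()


docs = {1: "text mining", 2: "data mining", 3: "text retrieval"}
document_table, term_table = build_tables(docs)
print(term_table["mining"], lookup(term_table, ["text", "mining"]))
```

Answering a keyword query then needs only the posting lists of the query terms, not a scan of the documents themselves.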

Signature File
It is a file that stores a signature record for each document in the database.

Each signature has a fixed size of b bits representing terms.
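One possible way to build such signatures is sketched below. The slides do not say how terms are mapped to bits, so hashing each term to a single bit position with CRC32 is an assumption, as are the signature size of 16 bits and the function names.

```python
import zlib

B = 16  # signature size in bits (illustrative choice)


def bit_for(term):
    """Map a term to one of the B bit positions (assumed hashing scheme)."""
    return 1 << (zlib.crc32(term.encode("utf-8")) % B)


def signature(terms):
    """OR together the bits of all terms; starts from all zeros."""
    sig = 0
    for term in terms:
        sig |= bit_for(term)
    return sig


def may_contain(doc_signature, query_terms):
    """True if every query bit is set; different terms can share a bit."""
    query_sig = signature(query_terms)
    return (doc_signature & query_sig) == query_sig


doc_sig = signature(["text", "mining", "retrieval"])
print(format(doc_sig, "016b"), may_contain(doc_sig, ["mining"]))
```

Since several terms may map to the same bit, a match only says the document may contain the terms; candidates still need to be checked against the actual text.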

Query Processing Technique


Once an inverted index is created for a document collection, a retrieval system can answer a keyword query by looking up which documents contain the query keywords.

When examples of relevant documents are available, the system can learn from such examples to improve retrieval performance. This is called relevance feedback.

Cont.
When we do not have such relevant examples, a system can assume the top few retrieved documents in some initial retrieval results to be relevant and extract keywords from them to expand the query. This is called pseudo-feedback or blind feedback.
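A minimal sketch of pseudo-feedback, assuming an initial ranked list of document texts is already available. Choosing the most frequent non-query words as expansion keywords is an illustrative assumption, as are the function name, the parameters and the sample data.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, num_new_terms=1):
    """Add the most frequent new words from the top_k documents to the query."""
    counts = Counter()
    for doc in ranked_docs[:top_k]:  # assume these documents are relevant
        counts.update(tok for tok in doc.lower().split() if tok not in query_terms)
    new_terms = [term for term, _ in counts.most_common(num_new_terms)]
    return list(query_terms) + new_terms


# Hypothetical initial retrieval result, best match first.
ranked_docs = [
    "text mining finds patterns in text",
    "mining text with retrieval of text",
    "unrelated cooking recipe",
]
expanded = expand_query(["mining"], ranked_docs)
print(expanded)
```

In practice stop words would be removed before counting, otherwise words such as "the" would dominate the expansion.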

Limitations in Query Processing

Synonymy problem
Two words with identical meanings may have very different surface forms. E.g. a query may use the word "automobile", but a relevant document may use "vehicle" instead of "automobile".

Polysemy problem
One word may mean different things in different contexts. E.g. "mining".

Thank you
