Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

K Srividhyasaradha

312216205105

Ex. No.: TEXT MINING ON COMMERCIAL APPLICATION


Date:

MOTIVATION:
Extracting patterns from text documents using text mining( String to word vector) provides
crucial information which could be as useful as extracting data from table. Most of the reports
and events were recorded in text previously. Thus there is a need for a way to mine data from
text documents.

MINING LOGIC:
 Converts String attributes into a set of attributes representing word occurrence
(depending on tokenizer) information from the text contained in the strings.
 The set of words (attributes) is determined by the first batch filtered (typically training
data).

INPUT DATASET:
 Set of documents (String).
 Class attribute that is used for mining.

PATTERN MINING:
 The first step in most retrieval systems is to identify keywords for representing
documents, a preprocessing step often called tokenization.
 To avoid indexing useless words, a text retrieval system often associates a stop list
with a set of documents.
 A stop list is a set of words that are deemed “irrelevant”. For example, a, the, of, for,
with, and so on are the stop words, even though they may appear frequently.
 A group of different words may share the same word stem. For example, the group of
words drug, drugged, and drugs, share a common word stem, drug, and can be viewed
as different as different occurrences of the same word.
 Starting with a set of d documents and a set of t terms, we can model each document
as a vector v in the t dimensional space Rt, which is why this method is called the
vector space model.
 Let the term frequency be the number of occurrences of the term t in the document d,
that is , freq (d; t). The (weighted) term-frequency matrix TF(d; t) measures the
association of a term t with respect to the given document d: it is generally defined as
0 if the document does not contain the term, and nonzero otherwise.
TF(d; t) = 0, if freq(d; t) = 0

1+log(1+log(freq(d; t))), otherwise.


K Srividhyasaradha
312216205105

 Inverse document frequency (IDF), represents the scaling factor, or the importance, of
a term t. If a term t occurs in many documents, its importance will be scaled down due
to its reduced discriminative power.
IDF(t) = log (1_d)/dt

Where, d is the document collection, and dt is the set of documents containing term t.
 In a complete vector-space model, TF and IDF are combined together, which forms
the TF-IDF are combined together, which forms the TF-IDF measure:

TF-IDF(d; t) = TF(d; t)_IDF(t)

 Let us examine how to compute similarity among a set of documents based on the
notions of term frequency and inverse document frequency.
 The term frequency and inverse document frequency. Table shows a term frequency
matrix where each row represents a document vector, each column represents a term,
and each entry registers freq(d; t), the number of occurrences of term tj in document
di.
 Based on this table we can calculate the TF-IDF value of a term in a document.
 For example, for t6 in dA,

TF(d4, t6) = 1 + log(1+log(15)) = 1.3377

IDF(t6)=log(1+5)/3 = 0.301

TF-IDF(d4,t6) = 1.3377 X 0.301 = 0.403

 A term frequency matrix showing the frequency of terms per document is given
below:
Document/term T1 T2 T3 T4 T5 T6 T7
D1 0 4 10 8 0 5 0
D2 5 19 7 16 0 0 32
D3 15 0 0 4 9 0 17
D4 22 3 12 0 5 15 0
D5 0 7 0 9 2 4 12
K Srividhyasaradha
312216205105

INPUT DATASETS:

Names of the datasets: ReutersCorn-train.arff, ReutersCorn-test.arff


No. of attributes: 2 (Text,class)
Target attribute: class {0,1}

The datasets contain records of text which has been classified as either 0 or 1. The class 0
signifies that the text is ‘not related to corn’ and class 1 signifies that the text is ‘related to
corn’.
K Srividhyasaradha
312216205105

OUTPUT:
K Srividhyasaradha
312216205105
K Srividhyasaradha
312216205105
K Srividhyasaradha
312216205105
K Srividhyasaradha
312216205105
K Srividhyasaradha
312216205105

RESULT:
Hence, the text mining has been performed on the document ReutersCorn-train.arff and the
output has been verified for a test dataset to check if the classification of ‘related to corn’ and
‘not related to corn’ is performed properly.

You might also like