International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459, Volume 2, Issue 5, May 2012)

Sequential Sequence Mining Technique in Mammographic Information Analysis Database

Kiran Amin 1, J. S. Shah 2

1 Head, Department of Computer Engineering & Information Technology, U. V. Patel College of Engineering, Kherva, Gujarat. E-mail: kiran.amin@ganpatuniversity.ac.in
2 Head, Department of Computer Engineering, L. D. College of Engineering, Ahmedabad. E-mail: jssld@yahoo.co.in

Abstract: Sequential Sequence Mining produces large sequences of biomedical data and provides opportunities for data analysis and knowledge discovery. It offers efficient and scalable methods to extract the sequences of interest from datasets. A synthetic dataset of mammographic medical images was taken. The Sequential Sequence Mining technique motivates us to discover sequences of interest in the existing dataset. Many algorithms have been developed to mine associations in the huge mammographic data sets. This paper focuses on a Sequential Sequence Mining algorithm for analyzing mammographic data sets.

Keywords: Sequential Sequence Mining, Association Rule Mining, Biomedical Data, Mammographic Image Analysis

I. INTRODUCTION
Mammography enables the detection and classification of breast abnormalities, which may be benign or malignant. The synthetic dataset contains 150 malignant, 60 benign, and 70 normal cases. Mammographic images were collected together with clinical and pathology reports, and the masses in the mammograms were extracted. Shape factors [2] representing compactness, fractional concavity, and spiculation index were computed, and the association-rule mining method was applied to the data set. The Sequential Sequence technique is used to develop and analyze the mammographic sequences; Sequential Sequence Mining makes it possible to identify the long sequences that are responsible for the anomalies occurring in the functioning of the breast.

II. ASSOCIATION RULE MINING

Association-rule mining is a data-mining task that is used to discover interesting relationships among items in a transactional database [7], [9]. In brief, an association rule is an expression X ⇒ Y, where X and Y are sets of items. The meaning of such rules is quite intuitive: given a database D of transactions, where each transaction T ∈ D is a set of items, X ⇒ Y expresses that whenever a transaction T contains X, then T probably contains Y also. A rule holds if it satisfies the minimum support and minimum confidence thresholds [6]. These two key parameters, used to evaluate the generated association rules, are discussed here.

2.1. Support: The support of an association rule is the ratio (in percent) of the records that contain {X, Y} to the total number of records in the database: support(X ⇒ Y) = Prob{X ∪ Y} with respect to the total number of records.

2.2. Confidence: The confidence is the ratio of the number of records that contain {X, Y} to the number of records that contain X: confidence(X ⇒ Y) = Prob{Y | X} = support(X ∪ Y) / support(X).

2.3. Strong Association Rules: For every frequent itemset A, if B ⊂ A, B ≠ ∅, and support(A)/support(B) ≥ minconf, then we have the association rule B ⇒ (A − B). Rules that satisfy both a minimum support threshold (min-sup) and a minimum confidence threshold (min-conf) are called strong rules. Strong rules are the key elements obtained from an analysis of all possible rules.
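
As a minimal illustration of these two measures (ours, not the authors' code), assuming each record is stored as a Python set of items:

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # confidence(X => Y) = support(X u Y) / support(X)
    return support(x | y, transactions) / support(x, transactions)

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, transactions))       # 0.5
print(confidence({"a"}, {"b"}, transactions))  # 0.666...
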
Association-rule mining divides the problem into two parts [6]. First, it finds the frequent itemsets: an itemset is frequent if it satisfies the minimum support [7]. Second, it finds the strong association rules from the frequent itemsets: these rules must satisfy the minimum support and minimum confidence [7]. Many scans of the database are required to find the frequent sequences by association-rule mining; Apriori [6], FP-growth [9], and hashing-based methods may be applied. Such methods have also been applied to gene sequences, where association-rule mining finds useful patterns in the biological sequences; using this approach, one gene sequence is used for the induction of a series of target gene sequences.
Among the best-known algorithms for association-rule induction is the Apriori algorithm [9]. In this paper, we used the Apriori algorithm in order to discover association rules among the shape features extracted from the mammographic mass regions.

Association-rule mining algorithms are also used to find gene expressions in data: several gene expressions are scanned, and association-rule mining algorithms are used to find the related sequences. One of them is Apriori; we discuss the Apriori algorithm [5] next.

2.4. Apriori Algorithm: The Apriori algorithm [7] uses prior knowledge of frequent-itemset properties: the frequent k-itemsets are used to find the (k+1)-itemsets. It uses a join step and a prune step to find the frequent sequences. First it finds L1, the frequent 1-itemsets; the L2 itemsets are found by using the L1 itemsets, the L3 itemsets by using the L2 itemsets, and so on. Finally, the frequent sequence is found by using the Apriori principle.
Apriori algorithm [6]:
(1) C1 = {candidate 1-itemsets};
(2) L1 = {c ∈ C1 | c.count ≥ minsup};
(3) for (k = 2; Lk-1 ≠ ∅; k++) do begin
(4)   Ck = apriori-gen(Lk-1);
(5)   for all transactions t ∈ D do begin
(6)     Ct = subset(Ck, t);
(7)     for all candidates c ∈ Ct do
(8)       c.count++;
(9)   end
(10)  Lk = {c ∈ Ck | c.count ≥ minsup};
(11) end
(12) Answer = ∪k Lk;
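
The following is a minimal, runnable Python sketch of this join/prune scheme (our illustration, not the authors' implementation); here minsup is an absolute count:

from itertools import combinations

def apriori(transactions, minsup):
    def count(c):
        return sum(1 for t in transactions if c <= t)
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
    frequent, k = set(Lk), 2
    while Lk:
        # join step: merge (k-1)-itemsets that differ in exactly one item
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in candidates if count(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(data, minsup=3))  # all 1- and 2-itemsets survive; {a, b, c} does not
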
III. FREQUENT PATTERN GROWTH TREE

The frequent-pattern growth (FP-growth) tree mines the frequent itemsets using a divide-and-conquer method. It performs the mining on a tree and finds the frequent sets. The nodes are arranged in such a way that more frequently occurring items share nodes more than less frequently occurring ones do.
It first finds the frequent 1-itemsets, then the frequent 2-itemsets, and likewise may generate the frequent n-itemsets. It finds conditional frequent trees based on subsets of the database. The algorithm is divided into two phases; the first phase builds the FP-tree. The FP-growth method converts the problem of finding long frequent patterns into searching for shorter ones recursively and then merging the suffix. The method uses the least frequent items as suffixes, which reduces the search costs. When the database is large, it is sometimes unrealistic to construct a main-memory-based FP-tree; still, performance studies on the FP-tree show that it is efficient and scalable for mining both long and short frequent patterns.
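
As an illustrative sketch (ours, not the authors' code), the first phase, constructing the FP-tree together with its header table, can be written as follows; the recursive mining of conditional trees is omitted for brevity:

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fptree(transactions, minsup):
    # first scan: count the support of every item
    counts = {}
    for t in transactions:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}
    root = FPNode(None, None)
    header = {i: [] for i in frequent}  # item -> list of tree nodes
    # second scan: insert transactions with items in descending frequency order
    for t in transactions:
        node = root
        for i in sorted((i for i in t if i in frequent),
                        key=lambda i: (-frequent[i], i)):
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = FPNode(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
    return root, header

root, header = build_fptree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], minsup=2)
print({i: sum(n.count for n in header[i]) for i in header})  # per-item supports
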

IV. PARTITION

The Partition algorithm is based on the observation that the frequent sets are normally very few in number among the set of all itemsets. Accordingly, the database is divided into several partitions such that each partition fits in main memory. The algorithm reduces the number of database scans: it brings one partition into memory while scanning and counts the items in that partition. The algorithm is implemented in two phases. The first phase logically divides the database into a number of non-overlapping partitions. The partitions are considered one at a time, and all frequent itemsets for that partition are generated; hence, if there are n partitions, Phase I of the algorithm takes n iterations. At the end of this phase, the frequent itemsets are merged to generate the set of all potentially frequent itemsets: the local frequent itemsets of the same length from all n partitions are combined to generate the global candidate itemsets. In the second phase, the actual support of these candidates is counted in one more scan of the database, and the globally frequent itemsets are identified [8].
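
A compact sketch (ours) of this two-phase scheme, reusing the apriori function from the sketch in Section II as the local miner:

def partition_frequent(transactions, n_partitions, minsup_ratio):
    # Phase I: mine each memory-sized partition for its local frequent itemsets
    size = -(-len(transactions) // n_partitions)  # ceiling division
    candidates = set()
    for p in range(n_partitions):
        part = transactions[p * size:(p + 1) * size]
        if part:
            local_min = max(1, int(minsup_ratio * len(part)))
            candidates |= apriori(part, local_min)
    # Phase II: one more full scan counts the global support of each candidate
    global_min = minsup_ratio * len(transactions)
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= global_min}

data = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}, {"a", "b"}, {"c"}]
print(partition_frequent(data, n_partitions=2, minsup_ratio=0.5))
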

V. FEATURE EXTRACTION
Three shape measures are used: compactness (C), fractional concavity (Fcc), and spiculation index (SI). These measures have been found to be effective in the discrimination of masses as being benign or malignant.

1. The normalized form of compactness, C, is a simple measure of shape complexity, and is computed as

C = 1 - 4πA / P²,

where A is the area and P is the perimeter of the contour.
2. Fractional concavity, Fcc, is the ratio of the cumulative length of the concave portions of the contour to the total length of the contour [3].
3. Spiculation index, SI, represents the degree of spicularity of a mass contour.
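
For instance, a minimal sketch (ours) of the compactness measure for a polygonal contour:

import math

def compactness(contour):
    # contour: list of (x, y) vertices of a closed polygon
    n = len(contour)
    area = 0.5 * abs(sum(contour[k][0] * contour[(k + 1) % n][1]
                         - contour[(k + 1) % n][0] * contour[k][1]
                         for k in range(n)))  # shoelace formula
    perimeter = sum(math.dist(contour[k], contour[(k + 1) % n]) for k in range(n))
    return 1.0 - 4.0 * math.pi * area / perimeter ** 2

# a near-circular contour yields C close to 0; rough, spiculated contours
# approach 1
circle = [(math.cos(2 * math.pi * k / 100), math.sin(2 * math.pi * k / 100))
          for k in range(100)]
print(compactness(circle))
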
VI. QUANTITATIVE ASSOCIATION-RULE MINING
6.1. Data preparation
Here, quantitative (real) values are used in the association-rule and Sequential Sequence Mining techniques, so the quantitative association rules are mapped into a Boolean association-rule problem. Each shape feature's value is extracted numerically in the range [0.0, 1.0]. Each feature is split into 10 ranges, [0.0, 0.1], [0.1, 0.2], [0.2, 0.3], ..., [0.9, 1.0], and intervals are defined for each feature Fi as Rangemin ≤ Fi < Rangemax. For example, the feature value SI = 0.82 belongs to the new feature interval [Rangemin = 0.8, Rangemax = 0.9]. These new feature intervals are organized in the form of transactions, which are used as the input for the data-mining and classification algorithms.

The transactions are of the form {Class Label, F1, F2, ..., Fn}, where F1, F2, ..., Fn are the features extracted for a given mammographic mass image, and the class label of the mass is either benign or malignant. Sample feature-interval transactions are given below (here, F = Fcc and S = SI):
{Benign, C0.1-0.2, F0.0-0.1, S0.0-0.1}
{Benign, C0.1-0.2, F0.0-0.1, S0.0-0.1}
{Benign, C0.0-0.1, F0.1-0.2, S0.2-0.3}
{Malignant, C0.3-0.4, F0.2-0.3, S0.1-0.2}
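
A sketch (ours) of this mapping from the real-valued features to interval items; the helper name is our own:

def to_transaction(label, c, fcc, si):
    # map each feature value in [0.0, 1.0] to one of ten equal-width intervals
    def interval(prefix, v):
        lo = min(int(v * 10), 9) / 10  # clamp v = 1.0 into [0.9, 1.0]
        return "%s%.1f-%.1f" % (prefix, lo, lo + 0.1)
    return {label, interval("C", c), interval("F", fcc), interval("S", si)}

print(to_transaction("Benign", 0.15, 0.07, 0.02))
# -> {'Benign', 'C0.1-0.2', 'F0.0-0.1', 'S0.0-0.1'} (set order may vary)
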

6.2. Pruning methods
After the association-rule mining algorithm is applied, many rules are generated. However, we might not be interested in all of the generated rules, so pruning methods need to be employed to identify the rules that represent knowledge in a useful manner.
1. The major constraint is that only those rules that can be used further for classification are considered. Given the transaction model X ⇒ Y, we are interested in rules where Y is Ci (the classification type: benign or malignant). Such rules are called interesting rules.
2. The probabilities and joint probabilities of the items and combinations of interest are evaluated to provide thresholds based on their support or confidence values. In this paper, the term strong rules is applied to interesting rules that have support ≥ 5% and confidence ≥ 80%. Single-feature rules for fractional concavity indicate that if the value is less than 0.4, the mass is most likely benign, whereas if the value is greater than 0.4, the mass is malignant with 100% confidence. For the spiculation index, if the value is less than 0.2, the mass is most likely benign, and if the value is greater than 0.2, the mass is malignant with 100% confidence.

VII. USING SEQUENTIAL SEQUENCE MINING

Long gene sequences may be found using Sequential Sequence Mining techniques. This technique uses a vertical fragment representation of the database with efficient support counting. It finds long sequences by generating Sequence-A and Sequence-B sequences. The various sequences are generated by adding an item at the end of a sequence: when the item is added as a new itemset at the end of the sequence, the result is a Sequence-A, while when the item is added to the last itemset, such that its index is greater than that of every item already in the last itemset, the result is a Sequence-B.
In this method the customers' transactions are represented by fragments (bit vectors). The corresponding bit is set to 1 if the transaction contains the last itemset in the sequence and the previous transactions contain all previous itemsets in the sequence (i.e., the customer contains the sequence of itemsets); a parent's fragment is transformed when generating a Sequence-A or a Sequence-B. A Sequence-A requires that we set the first 1 in the current sequence's fragment slice to 0 and all bits afterward to 1, to indicate that the new item can only come in a transaction after the last transaction in the current sequence.


7.1. The Absolute Support: The absolute support of a sequence Kp in the sequence representation of a database D is defined as the number of sequences k ∈ D that contain Kp, and the relative support is defined as the percentage of sequences k ∈ D that contain Kp.
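
A direct rendering (ours) of these two definitions for sequences of itemsets:

def contains(seq, pattern):
    # True if the pattern's itemsets occur, in order, within the sequence
    k = 0
    for itemset in seq:
        if k < len(pattern) and pattern[k] <= itemset:
            k += 1
    return k == len(pattern)

def absolute_support(pattern, database):
    return sum(contains(s, pattern) for s in database)

def relative_support(pattern, database):
    return absolute_support(pattern, database) / len(database)

db = [[{"p"}, {"p", "q"}, {"q"}], [{"p"}, {"q"}], [{"q"}, {"p"}]]
print(absolute_support([{"p"}, {"q"}], db))  # 2
print(relative_support([{"p"}, {"q"}], db))  # 0.666...
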


7.2. Generation of Sequences: First, the customers' data are sorted by customer ID and then by transaction ID. Then the various Sequence-A and Sequence-B sequences are generated, and the nodes are tested using them: at each node n, the support of each Sequence-A and each Sequence-B is tested. If the support of a generated sequence s is greater than or equal to minSup, then this sequence is useful and is stored as a frequent sequence. If the support of s is less than minSup, then we do not need to repeat the process on s, since by the Apriori principle any child sequence generated from s will not be frequent.
If we create the sequences by considering a tree, then each sequence in the tree generates Sequence-A sequences and Sequence-B sequences. Thus we can associate with each sequence n in the tree two sets: the set of items that are considered for Sequence-A extensions of sequence n (k-extensions), Kn, and the set of items that are considered for Sequence-B extensions (i-extensions), In.
If the sequence elements are considered as nodes of the tree, then each element in the tree is generated by only one extension. For example, the sequence ({p, q}, {q}) is generated from the sequence ({p, q}) with the Sequence-A item {q}; it cannot be generated from the sequence ({p}, {q}) or ({q}, {q}).


7.3. Pruning: We can prune the candidate k-extensions and i-extensions [10] of a node n in the tree. We use pruning techniques that are based on the Apriori principle and aimed at minimizing the sizes of Kn and In at each node n.
7.4. Fragment Representation: We have used various fragments (bit vectors) to represent the data. In the fragment map, each item's bit for a transaction is set to 1 if the item appears in that transaction; otherwise the corresponding value is set to 0. Thus, if item i appears in transaction x, the bit for x in the fragment of {i} is set to 1. If item i and item j appear in one transaction, then to find {i, j} we perform a bitwise AND operation between the fragments of {i} and {j}.
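
A sketch (ours) of this fragment representation, using Python integers as bit vectors over one customer's transaction list (bit k corresponds to transaction k), together with the Sequence-A transformation described above:

def fragment(item, transactions):
    # bit k is 1 if the item appears in transaction k
    bits = 0
    for k, t in enumerate(transactions):
        if item in t:
            bits |= 1 << k
    return bits

def sequence_a_transform(bits, width):
    # set the first 1 to 0 and every later bit to 1: the next itemset may only
    # occur strictly after the current sequence's first matching transaction
    if bits == 0:
        return 0
    first = (bits & -bits).bit_length()  # 1-based index of the lowest set bit
    return ((1 << width) - 1) & ~((1 << first) - 1)

transactions = [{"i"}, {"i", "j"}, {"j"}, {"i", "j"}]
fi, fj = fragment("i", transactions), fragment("j", transactions)
print(bin(fi & fj))  # fragment of the itemset {i, j}: bitwise AND
print(bin(sequence_a_transform(fi, len(transactions)) & fj))  # S-extension by j
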
7.5. Algorithm:
Input: D, a database of transactions.
Output: the frequent sequences of the database.
Method:

1. Collect the information on the customers' transactions from an input data file and store it into an array.
2. Initialize the numbers of customers, transactions, and items.
3. Store the customers, transactions, and items into the array.
4. Increment the customer count for a new customer.
5. Increment the customer's transaction count for the same customer with a different transaction, and increment the item count for the same customer and a different item.
6. Read the information about CID, TID, and IID from the array and set the appropriate transaction bit to 1.
7. Read the data and fill in the transaction bits.
8. Read the input file and find the frequent-1 itemsets.
9. Find the maximum numbers of transactions and customers, and set the minimum support.
10. Find the frequent itemsets.
11. Do the Sequence-A and Sequence-B process on the current node.
12. Find the Sequence-A sequence for the next node.
13. Create the Sequence-B for the next node whose index is higher than the current node's; check whether the item is frequent, and if it is frequent, store it. Output, index-wise, the Sequence-A sequences whose support is greater than the minimum-support threshold.

VIII. PERFORMANCE
Figures 1 and 2 show the analysis of Sequential Sequence Mining: Figure 1 shows the analysis for various numbers of customers with different support values, and Figure 2 shows the mining time taken at different support values. Here we made comparisons with association-rule mining, and the results were found to be better for the gene sequences.

IX. CONCLUSION
In this paper, the frequent sequences in mammographic sequences are found and used as transactions by means of Sequence-A and Sequence-B sequences. The algorithm finds the sequential sequences and performs better, generating sequences of various lengths with efficient support counting.

REFERENCES
[1] Alberta Cancer Board, Screen Test: Alberta Program for the Early Detection of Breast Cancer, 1999/01 Biennial Report, Edmonton, Alberta, Canada, 2001.
[2] H. Alto, R. M. Rangayyan, R. B. Paranjape, J. E. L. Desautels, and H. Bryant, "An indexed atlas of digital mammograms for computer-aided diagnosis of breast cancer," vol. 58, no. 5-6, pp. 820-835, 2003.
[3] R. M. Rangayyan, N. R. Mudigonda, and J. E. L. Desautels, "Boundary modelling and shape analysis methods for classification of mammographic masses," Medical and Biological Engineering and Computing, vol. 38, pp. 487-495, 2000.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Washington, D.C., USA, 1993, pp. 207-216.
[5] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. of the 20th Int'l Conf. on VLDB, 1994, pp. 487-499.
[6] C. Gyorodi and R. Gyorodi, "Mining association rules in large databases," in Proc. of Oradea EMES'02, Oradea, Romania, 2002, pp. 45-50.
[7] J. Pei, J. Han, and W. Wang, "Mining sequential patterns with constraints in large databases," in Proc. of CIKM'02, 2002, pp. 18-25.
[8] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. of ACM SIGMOD, 2000.
[9] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, Tucson, Arizona, USA, 1997, pp. 255-264.
[10] C. Gyorodi, R. Gyorodi, T. Coffey, and S. Holban, "Mining association rules using Dynamic FP-trees," in Proc. of the Irish Signals and Systems Conference, University of Limerick, Limerick, Ireland, 30 June - 2 July 2003, ISBN 0-9542973-1-8, pp. 76-82.
