Finalmain PRJCT PPT 24-3-11


Guide
Mr. D. S. Sharma, M.Tech., (Ph.D.)
Associate Professor, Dept. of CSIT
Sri Sivani College of Engineering

Team
K. Kishore 07W61A0522
CH. Sainath 07W61A0510
Y. Ramesh 07W61A0542
K. Navya 07W61A0525
B. Manasa 07W61A0529
Traditionally, business analysts performed the task of
extracting useful information from recorded data, but the
increasing volume of data in modern business and science calls
for computer-based approaches.

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.

A typical and widely used example of association rule mining is Market Basket Analysis.

This project is built on the Boolean matrix algorithm, a technique that implements association rules under bidirectional search.

Our main objective is to make a comparative analysis between the Apriori and the Boolean matrix algorithms, so as to show that the Boolean matrix algorithm performs better.
Hardware Requirements
• Processor: Intel Pentium 4 or equivalent, or above.
(Greater clock and FSB speeds, with single-core mainstream compatibility.)
• RAM: minimum of 256 MB or higher.
(Oracle 8i alone requires a minimum of 256 MB of RAM.)
• Video card: 32-bit on average; 64-bit recommended.
(Makes the output screens, such as graphs, bar charts, and pie charts, appear sharper.)
• HDD: 20 GB or higher.
(The software installed along with the database takes a minimum of 20 GB of disk space.)
• Monitor: 15" or 17" color monitor; screen resolution 1024 by 768 pixels; color quality 32-bit; Windows Standard color scheme.
(This is the standard configuration of most PCs, with optimized viewable space.)
• Mouse: scroll mouse or optical mouse.


Software Requirements
• Operating system: Windows XP (almost all versions), Unix, or any operating system that supports Java.
• Front-end: Java, displayed in the form of applets and frames. (Almost all versions of Java can run the code.)
• Back-end: Oracle 10g. (This is used to design the database.)


Existing System
There are many systems already developed that use the Apriori algorithm to implement association rules.

The Apriori algorithm operates in a bottom-up, breadth-first search manner.

The computation starts from the smallest frequent itemsets and moves upwards until it reaches the largest frequent itemset.

The number of database passes equals the size of the largest frequent itemset.
Apriori Algorithm
Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.
It generates candidate itemsets of length k from itemsets of length k - 1.
Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent k-length itemsets.
After that, it scans the transaction database to determine the frequent itemsets among the candidates.
To determine frequent itemsets quickly, the algorithm stores the candidate itemsets in a hash tree.
Apriori Itemset Generation

Pass 1
Generate the candidate itemsets in C1.
Save the frequent itemsets in L1.

Pass k
Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1.
Join Lk-1 p with Lk-1 q, as follows:
insert into Ck
select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Generate all (k-1)-subsets of each candidate itemset in Ck.
Prune from Ck every candidate itemset that has some (k-1)-subset not in the frequent itemset Lk-1.
Scan the transaction database to determine the support for each candidate itemset in Ck.
Save the frequent itemsets in Lk.
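As an illustration only (the class and method names are hypothetical, not the project's actual code), the pass-k join and prune steps above can be sketched in Java:

```java
import java.util.*;

// Sketch of one Apriori pass: join L(k-1) with itself, then prune by
// the downward closure property. Itemsets are sorted lists of item labels.
public class AprioriPass {

    // Join: merge two frequent (k-1)-itemsets sharing their first k-2 items,
    // keeping p.last < q.last so each candidate is generated exactly once.
    static List<List<Character>> join(List<List<Character>> lPrev) {
        List<List<Character>> ck = new ArrayList<>();
        for (List<Character> p : lPrev)
            for (List<Character> q : lPrev) {
                int n = p.size();
                if (p.subList(0, n - 1).equals(q.subList(0, n - 1))
                        && p.get(n - 1) < q.get(n - 1)) {
                    List<Character> c = new ArrayList<>(p);
                    c.add(q.get(n - 1));
                    ck.add(c);
                }
            }
        return ck;
    }

    // Prune: drop any candidate with a (k-1)-subset that is not frequent.
    static List<List<Character>> prune(List<List<Character>> ck,
                                       List<List<Character>> lPrev) {
        List<List<Character>> out = new ArrayList<>();
        for (List<Character> c : ck) {
            boolean allFrequent = true;
            for (int drop = 0; drop < c.size(); drop++) {
                List<Character> sub = new ArrayList<>(c);
                sub.remove(drop);
                if (!lPrev.contains(sub)) { allFrequent = false; break; }
            }
            if (allFrequent) out.add(c);
        }
        return out;
    }
}
```

For example, with L2 = {AC, AE, AF, CE, CF, EF}, the join produces ACE, ACF, AEF, and CEF, and pruning keeps all four because every 2-subset of each candidate is frequent.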
Example
Assume the user-specified minimum support is 40%; generate all frequent itemsets.
Given: the transaction database shown below.

TID  A B C D E F G
T1   1 0 1 0 0 1 0
T2   0 0 1 1 0 0 1
T3   1 1 0 0 1 0 0
T4   0 0 1 0 1 0 1
T5   1 0 1 0 1 1 0
T6   0 0 0 1 1 0 0
T7   0 0 1 0 1 1 1
T8   1 0 0 0 0 1 0
T9   0 1 1 1 0 0 0
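Reading the table above as bit rows, the support of a single item is just the count of 1s in its column. A minimal sketch (the class and method names are illustrative):

```java
// Counting single-item supports from the binary transaction table above.
// Each string is one transaction row; bit positions follow columns A..G.
public class SupportCount {
    static final String[] ROWS = {
        "1010010", "0011001", "1100100", "0010101", "1010110",
        "0001100", "0010111", "1000010", "0111000"
    };

    // Support of the item in column j = number of rows with a 1 there.
    static int support(int j) {
        int count = 0;
        for (String row : ROWS)
            if (row.charAt(j) == '1') count++;
        return count;
    }

    public static void main(String[] args) {
        for (int j = 0; j < 7; j++)
            System.out.println((char) ('A' + j) + " -> " + support(j));
        // A -> 4, B -> 2, C -> 6, D -> 3, E -> 5, F -> 4, G -> 3
    }
}
```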
Pass 1

C1
Itemset X   supp(X)
A           4
B           2
C           6
D           3
E           5
F           4
G           3

L1 after pruning: A, B, C, D, E, F, G
Pass 2

C2
Itemset X   supp(X)
A,B         1
A,C         2
A,D         0
A,E         2
A,F         3
A,G         0
B,C         1
B,D         1
B,E         1
B,F         0
B,G         0
C,D         2
C,E         3
C,F         3
C,G         3
D,E         1
D,F         1
D,G         1
E,F         2
E,G         2
F,G         1

L2 after pruning: (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G)
Pass 3

C3
join AC with AE    A,C,E   supp 1
join AC with AF    A,C,F   supp 2
join AE with AF    A,E,F   supp 1
join CD with CE    C,D,E   supp 0
join CD with CF    C,D,F   supp 0
join CD with CG    C,D,G   supp 1
join CE with CF    C,E,F   supp 2
join CE with CG    C,E,G   supp 2
join CF with CG    C,F,G   supp 1
join EF with EG    E,F,G   supp 1

L3 after pruning: (A,C,F), (C,E,F), (C,E,G)
Pass 4

C4
Itemset X   supp(X)
C,E,F,G     1

Pass 5

For pass 5 we cannot form any candidates, because there are not two frequent 4-itemsets beginning with the same 3 items.
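The candidate supports in the passes above can be checked mechanically against the transaction table. A small sketch (the names are hypothetical), where an itemset is given by its column indices (A = 0 ... G = 6):

```java
// Support of an arbitrary itemset: count the transactions whose row has
// a 1 in every column of the itemset. Rows are the example table T1..T9.
public class ItemsetSupport {
    static final String[] ROWS = {
        "1010010", "0011001", "1100100", "0010101", "1010110",
        "0001100", "0010111", "1000010", "0111000"
    };

    static int support(int... cols) {
        int count = 0;
        for (String row : ROWS) {
            boolean all = true;
            for (int c : cols)
                if (row.charAt(c) != '1') { all = false; break; }
            if (all) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(support(0, 2, 5));    // {A,C,F} -> 2
        System.out.println(support(2, 4, 5, 6)); // {C,E,F,G} -> 1
    }
}
```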
Disadvantages:

• When any one of the frequent itemsets becomes longer, the algorithm has to go through many iterations, and as a result the performance decreases in terms of response time.

• During the execution every frequent itemset is explicitly considered, so a large number of database scans is required.
In our proposed system, we implement association rules using the Boolean matrix algorithm.

• It operates in a top-down manner.

• It scans the database only once.

• So the search (response) time becomes quicker.

• It finds the maximum frequent itemsets which meet the minimum support in a short time, through vector and matrix operations.

• It works better than most of the currently existing algorithms, such as Apriori.
Boolean matrix
• The Boolean matrix algorithm needs to scan the database only once, and as the structure of the Boolean matrix is simple, it can be understood easily and is easy to compute without generating plenty of candidate itemsets.

• Because the database is translated into matrix files, and these files are very small, the algorithm saves much of the time otherwise spent scanning the database. So the algorithm is efficient.

• This is better and faster than the Apriori approach; thus the technique evaluates the required data much more quickly than most of the currently existing algorithms.
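The single-scan idea can be illustrated with a short sketch (hypothetical names, not the project's code): load the database once into a 0/1 matrix R, then compute S = R x R-transpose in memory. S[i][j] is the number of items shared by transactions i and j, and the diagonal S[i][i] is the size of transaction i.

```java
// S = R * R^T for a 0/1 transaction matrix R (rows = transactions,
// columns = items). R here is the 9x7 matrix of the running example.
public class BooleanMatrix {
    static final int[][] R = {
        {1,0,1,0,0,1,0}, {0,0,1,1,0,0,1}, {1,1,0,0,1,0,0},
        {0,0,1,0,1,0,1}, {1,0,1,0,1,1,0}, {0,0,0,1,1,0,0},
        {0,0,1,0,1,1,1}, {1,0,0,0,0,1,0}, {0,1,1,1,0,0,0}
    };

    static int[][] multiplyByTranspose(int[][] r) {
        int n = r.length, m = r[0].length;
        int[][] s = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < m; k++)
                    s[i][j] += r[i][k] * r[j][k];
        return s;
    }

    public static void main(String[] args) {
        int[][] s = multiplyByTranspose(R);
        System.out.println(s[0][4]); // transactions T1 and T5 share 3 items
        System.out.println(s[6][6]); // transaction T7 contains 4 items
    }
}
```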
For the same example, consider the original matrix R and its transpose T:

    1 0 1 0 0 1 0           1 0 1 0 1 0 0 1 0
    0 0 1 1 0 0 1           0 0 1 0 0 0 0 0 1
    1 1 0 0 1 0 0           1 1 0 1 1 0 1 0 1
    0 0 1 0 1 0 1     T =   0 1 0 0 0 1 0 0 1
R = 1 0 1 0 1 1 0           0 0 1 1 1 1 1 0 0
    0 0 0 1 1 0 0           1 0 0 0 1 0 1 1 0
    0 0 1 0 1 1 1           0 1 0 1 0 0 1 0 0
    1 0 0 0 0 1 0
    0 1 1 1 0 0 0

Now multiply these matrices; we get S = R x T:

    3 1 1 1 3 0 2 2 1
    1 3 0 2 1 1 2 0 2
    1 0 3 1 2 1 1 1 1
    1 2 1 3 2 1 3 0 1
S = 3 1 2 2 4 1 3 2 1
    0 1 1 1 1 2 1 0 1
    2 2 1 3 3 1 4 1 1
    2 0 1 0 2 0 1 2 0
    1 2 1 1 1 1 1 0 2
Consider the upper triangular matrix:

     3 1 1 1 3 0 2 2 1
       3 0 2 1 1 2 0 2
         3 1 2 1 1 1 1
           3 2 1 3 0 1
Su =         4 1 3 2 1
               2 1 0 1
                 4 1 1
                   2 0
                     2
Consider the maximum element on the main diagonal and find in which rows that element occurs.

The maximum element is 4: t55 = 4 and t77 = 4.
B5(5,1); B7(7,1): each occurs in only one transaction, which is less than the min supp value, so there are no 4-frequent itemsets.

The next maximum element is 3: t11 = 3, t22 = 3, t33 = 3, t44 = 3.
The off-diagonal 3s give B1(1,5,2); B4(4,7,2); B5(5,7,2).

First consider B1(1,5,2):
the logical AND of a1 and a5 = 1010010, which indicates B1 = {A,C,F}.
Similarly, B4 and B5 are obtained by performing the logical AND of a4 and a7, and of a5 and a7, respectively: B4 = {C,E,G}, B5 = {C,E,F}.

So the 3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }.

Similarly, the 2-frequent itemsets are generated:
{ (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }

Therefore, both algorithms give the same output:
2-frequent itemsets: { (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }
3-frequent itemsets: { (A,C,F), (C,E,F), (C,E,G) }
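The AND step used in the example can be checked in a few lines (a sketch; the names are illustrative):

```java
// Bitwise AND of two transaction rows recovers the itemset they share.
// Bit positions follow column order A..G.
public class RowAnd {
    static String and(String a, String b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++)
            sb.append(a.charAt(i) == '1' && b.charAt(i) == '1' ? '1' : '0');
        return sb.toString();
    }

    // Translate a bit row back into item labels.
    static String items(String bits) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits.length(); i++)
            if (bits.charAt(i) == '1') sb.append((char) ('A' + i));
        return sb.toString();
    }

    public static void main(String[] args) {
        String r = and("1010010", "1010110"); // a1 AND a5
        System.out.println(r + " -> " + items(r)); // 1010010 -> ACF
    }
}
```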
Advantages:
• Helps in improving marketing tactics.

• Used for collaborative filtering in setting business trends.

• Comprehensive analysis of customer choices.

• Comprehensive analysis of product demand.

• Helps to minimize losses and maximize profits.


Performance Analysis:

Assumptions
1. Apriori's best case == Boolean's best case.
2. Apriori's worst case == Boolean's average case.

Axiom
Let i = 1, and let k be the total number of candidate itemsets.
1. Best case: the most frequent itemset is evaluated in pass i.
2. Average case: the most frequent itemset is evaluated in a pass > i and < k.
3. Worst case: the most frequent itemset is evaluated in pass k.
Proof by Example:

Best Case
T1={bread} T2={bread} T3={bread} T4={bread}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }; stop!

Boolean
Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }
MFCS = { {bread} - 100% } not pruned; stop!

    1
R = 1     T = 1 1 1 1
    1
    1

    1 1 1 1         1 1 1 1
U = 1 1 1 1     C =   1 1 1
    1 1 1 1             1 1
    1 1 1 1               1
Average Case
T1={bread, jam, sugar, cheese} T2={jam, sugar, cheese} T3={sugar, cheese} T4={cheese}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 25% (pruned); jam - 50% (not pruned); sugar - 75% (not pruned); cheese - 100% (not pruned)
MFS = { {jam}, {sugar}, {cheese} }
Pass 2:
C2: {jam, sugar} - 50% (not pruned); {jam, cheese} - 50% (not pruned); {sugar, cheese} - 75% (not pruned)
MFS = { {jam, sugar}, {jam, cheese}, {sugar, cheese} }
Pass 3:
C3: {jam, sugar, cheese} - 50% (not pruned)
MFS = { {jam, sugar, cheese} }

Boolean
    1 1 1 1         1 0 0 0         4 3 2 1         4 3 2 1
R = 0 1 1 1     T = 1 1 0 0     C = 3 3 2 1     U =   3 2 1
    0 0 1 1         1 1 1 0         2 2 2 1             2 1
    0 0 0 1         1 1 1 1         1 1 1 1               1

Pass 1:
Max element = 4. It occurs for only one transaction, so there are no 4-frequent itemsets.
Next element = 3: U(1,1) = 4; U(1,2) = 3.
B(1,2,2): a1 AND a2 = 0 1 1 1, i.e. {jam, sugar, cheese} with support 2.
So the 3-frequent itemsets are { {jam, sugar, cheese} }.
Pass 2:
Next max element = 2.
B(1,2,3,3) = { {sugar, cheese}, {jam, sugar}, {jam, cheese} }
B(2,3,2) = { {sugar, cheese} }
So the 2-frequent itemsets are { {sugar, cheese}, {jam, sugar}, {jam, cheese} }.
Thus the algorithm terminates in two passes.
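The average-case matrices above can be reproduced mechanically with the same product step (a sketch; the names are illustrative):

```java
// C = R * R^T for the average-case transactions: C[i][j] is the number
// of items transactions i and j share; the diagonal gives each size.
public class AverageCase {
    static final int[][] R = {
        {1,1,1,1},  // T1 = {bread, jam, sugar, cheese}
        {0,1,1,1},  // T2 = {jam, sugar, cheese}
        {0,0,1,1},  // T3 = {sugar, cheese}
        {0,0,0,1}   // T4 = {cheese}
    };

    static int[][] product() {
        int n = R.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < R[0].length; k++)
                    c[i][j] += R[i][k] * R[j][k];
        return c;
    }

    public static void main(String[] args) {
        int[][] c = product();
        System.out.println(c[0][0] + " " + c[0][1]); // 4 3
    }
}
```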
Worst Case
T1={bread, jam, sugar, cheese} T2={bread, jam, sugar, cheese}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 100%, jam - 100%, sugar - 100%, cheese - 100% (nothing pruned)
MFS = { {bread}, {jam}, {sugar}, {cheese} }
Pass 2:
C2: {bread, jam} - 100%, {bread, sugar} - 100%, {bread, cheese} - 100%, {sugar, jam} - 100%, {cheese, jam} - 100%, {sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }
Pass 3:
C3: {bread, jam, sugar} - 100%, {bread, jam, cheese} - 100%, {bread, sugar, cheese} - 100%, {jam, sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam, sugar}, {bread, jam, cheese}, {bread, sugar, cheese}, {jam, sugar, cheese} }
Pass 4:
C4: {bread, jam, sugar, cheese} - 100% (not pruned)
MFS = { {bread, jam, sugar, cheese} }
Boolean
    1 1 1 1         1 1 1 1         4 4 4 4         4 4 4 4
R = 1 1 1 1     T = 1 1 1 1     C = 4 4 4 4     U =   4 4 4
    1 1 1 1         1 1 1 1         4 4 4 4             4 4
    1 1 1 1         1 1 1 1         4 4 4 4               4

Pass 1:
Max element = 4.
So the 4-frequent itemsets are { {bread, jam, sugar, cheese} }.
Hence the 3-frequent itemsets are { {bread, jam, sugar}, {bread, jam, cheese}, {bread, sugar, cheese}, {jam, sugar, cheese} },
and the 2-frequent itemsets are { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }.
Apriori Time Complexity

• Best case: one pass, so O(1).
• Worst case: all items are considered, so O(n).
• Average case: only about half of the itemsets are considered, so O(n - k).

Boolean Matrix Time Complexity

• Best case: O(1).
• Average case: O(n).
• Worst case: O(n).
Dataflow diagram
UML Diagrams

Sequence diagram for Apriori
Sequence diagram for Boolean
Sequence diagram for pruning the itemsets
Testing:
Testing is the process of executing a program with the intent of finding an error.

Usability Test:
Test Case Id   Input Format                                              Input data   Expected   Obtained
UT_1.1         Time string ('morning', 'afternoon', 'evening', 'night')  morning      No Error   No Error
UT_1.2         Hrs:mins:secs                                             11:35:65     No Error   Error
UT_1.3         Hrs:mins                                                  03:75        No Error   Error
UT_1.4         Hrs                                                       32           No Error   Error
UT_1.5         Random character string                                   H2           Error      Error
UT_1.6         Integer number                                            15           Error      Error
UT_1.7         Floating point                                            12.9         Error      Error
UT_1.8         Alphanumeric                                              @w           Error      Error
UT_1.9         Special characters                                        9,0          Error      Error
UT_1.10        Empty string                                                           Error      Error

Path Testing:
Test Case Id   Input Module                                     Next Module (Expected)                          Next Module (Obtained)
PT_1           Receive user inputs                              Retrieve raw candidate transaction table        Raw candidate transaction table retrieved
PT_2           Retrieve raw candidate transaction table         Construct binary candidate transaction table    Binary candidate transaction table constructed
PT_3           Binary candidate transaction table constructed   Algorithm begins                                Algorithm began
PT_4           Algorithm began                                  Subset selection                                Subset selected
PT_5           Subset selected                                  Support evaluation                              Support evaluated
PT_6           Support evaluated                                Prune                                           Pruned
PT_7           Pruned                                           Most frequent itemset evaluation                Most frequent itemset evaluated
PT_8           Most frequent itemsets evaluated                 Frame association rules                         Association rules framed
PT_9           Association rules framed                         Pictorial representation using paint method     Pictorial representation began
PT_10          Pictorial representation began                   Draw bar chart                                  Bar chart drawn
PT_11          Bar chart drawn                                  Draw pie chart                                  Pie chart drawn
PT_12          Pie chart drawn                                  Draw support-for-single-pass graph              Support-for-single-pass graph drawn
PT_13          Support-for-single-pass graph drawn              Draw time complexity graph                      Time complexity graph drawn
Conclusion:
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules.

The Boolean matrix algorithm overcomes this difficulty more efficiently than the currently existing algorithms, and performs better than Apriori.

We evaluated the performance of the algorithm using well-known synthetic benchmark databases and real-life census and stock market databases.
Further Developments:

Parallelizing the Boolean matrix algorithm: this is a way to minimize duplicate calculations and to maximize the use of the available processors.
