Finalmain PRJCT PPT 24-3-11


Guide
Mr. D. S. Sharma, M.Tech., (Ph.D.)
Associate Professor, Dept. of CSIT
Sri Sivani College of Engineering

Team
K. Kishore 07W61A0522
CH. Sainath 07W61A0510
Y. Ramesh 07W61A0542
K. Navya 07W61A0525
B. Manasa 07W61A0529
Traditionally, business analysts performed the task of
extracting useful information from recorded data, but the
increasing volume of data in modern business and science calls
for computer-based approaches.

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.

A typical and widely used example of association rule mining is Market Basket Analysis.

This project is built on the Boolean matrix algorithm, a technique that implements association rules under bidirectional search.

Our main objective is to make a comparative analysis between the Apriori and the Boolean matrix algorithms, so as to show that the Boolean matrix algorithm performs better.
Hardware Requirements
• Processor: Intel Pentium 4 or equivalent, or above.
(Greater clock and FSB speeds, with single-core mainstream compatibility.)
• RAM: minimum of 256 MB or higher.
(Oracle 8i alone requires a minimum of 256 MB of RAM.)
• Video card: 32-bit on average; 64-bit recommended.
(Makes the output screens, such as graphs, bar charts, and pie charts, appear sharper.)
• HDD: 20 GB or higher.
(The software installed along with the database takes a minimum of 20 GB of disk space.)
• Monitor: 15" or 17" color monitor; screen resolution 1024 by 768 pixels; color quality 32-bit; Windows Standard color scheme.
(This is the standard configuration of most PCs, with optimized viewable space.)
• Mouse: scroll mouse or optical mouse.


Software Requirements
• Operating system: Windows XP (almost all versions), Unix, or any operating system that supports Java.
• Front-end: Java, displayed in the form of applets and frames. (Almost all versions of Java can run the code.)
• Back-end: Oracle 10g. (This is used to design the database.)


Existing System
There are many systems already developed that use the Apriori algorithm to implement association rules.

The Apriori algorithm operates in a bottom-up, breadth-first search manner.

The computation starts from the smallest frequent itemsets and moves upwards until it reaches the largest frequent itemset.

The number of database passes equals the size of the largest frequent itemset.
Apriori Algorithm
Apriori uses breadth-first search and a hash tree structure to count candidate itemsets efficiently.
It generates candidate itemsets of length k from itemsets of length k - 1.
Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent k-length itemsets.
After that, it scans the transaction database to determine the frequent itemsets among the candidates.
To determine frequent itemsets quickly, the algorithm stores the candidate itemsets in a hash tree.
Apriori Itemset Generation

Pass 1
Generate the candidate itemsets in C1.
Save the frequent itemsets in L1.

Pass k
Generate the candidate itemsets in Ck from the frequent itemsets in Lk-1.
Join Lk-1 p with Lk-1 q, as follows:
insert into Ck
select p.item1, p.item2, . . . , p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, . . . , p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Generate all (k-1)-subsets of each candidate itemset in Ck.
Prune from Ck every candidate itemset that has some (k-1)-subset not in the frequent itemset Lk-1.
Scan the transaction database to determine the support for each candidate itemset in Ck.
Save the frequent itemsets in Lk.
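As an illustration only (the class and method names are hypothetical, not the project's actual code), the pass-k join and prune steps above can be sketched in Java:

```java
import java.util.*;

// Sketch of one Apriori pass: join L(k-1) with itself, then prune by
// the downward closure property. Itemsets are sorted lists of item labels.
public class AprioriPass {

    // Join: merge two frequent (k-1)-itemsets sharing their first k-2 items,
    // keeping p.last < q.last so each candidate is generated exactly once.
    static List<List<Character>> join(List<List<Character>> lPrev) {
        List<List<Character>> ck = new ArrayList<>();
        for (List<Character> p : lPrev)
            for (List<Character> q : lPrev) {
                int n = p.size();
                if (p.subList(0, n - 1).equals(q.subList(0, n - 1))
                        && p.get(n - 1) < q.get(n - 1)) {
                    List<Character> c = new ArrayList<>(p);
                    c.add(q.get(n - 1));
                    ck.add(c);
                }
            }
        return ck;
    }

    // Prune: drop any candidate with a (k-1)-subset that is not frequent.
    static List<List<Character>> prune(List<List<Character>> ck,
                                       List<List<Character>> lPrev) {
        List<List<Character>> out = new ArrayList<>();
        for (List<Character> c : ck) {
            boolean allFrequent = true;
            for (int drop = 0; drop < c.size(); drop++) {
                List<Character> sub = new ArrayList<>(c);
                sub.remove(drop);
                if (!lPrev.contains(sub)) { allFrequent = false; break; }
            }
            if (allFrequent) out.add(c);
        }
        return out;
    }
}
```

For example, with L2 = {AC, AE, AF, CE, CF, EF}, the join produces ACE, ACF, AEF, and CEF, and pruning keeps all four because every 2-subset of each candidate is frequent.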
Example
Assume the user-specified minimum support is 40%; generate all frequent itemsets.
Given: the transaction database shown below.

TID  A B C D E F G
T1   1 0 1 0 0 1 0
T2   0 0 1 1 0 0 1
T3   1 1 0 0 1 0 0
T4   0 0 1 0 1 0 1
T5   1 0 1 0 1 1 0
T6   0 0 0 1 1 0 0
T7   0 0 1 0 1 1 1
T8   1 0 0 0 0 1 0
T9   0 1 1 1 0 0 0
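Reading the table above as bit rows, the support of a single item is just the count of 1s in its column. A minimal sketch (the class and method names are illustrative):

```java
// Counting single-item supports from the binary transaction table above.
// Each string is one transaction row; bit positions follow columns A..G.
public class SupportCount {
    static final String[] ROWS = {
        "1010010", "0011001", "1100100", "0010101", "1010110",
        "0001100", "0010111", "1000010", "0111000"
    };

    // Support of the item in column j = number of rows with a 1 there.
    static int support(int j) {
        int count = 0;
        for (String row : ROWS)
            if (row.charAt(j) == '1') count++;
        return count;
    }

    public static void main(String[] args) {
        for (int j = 0; j < 7; j++)
            System.out.println((char) ('A' + j) + " -> " + support(j));
        // A -> 4, B -> 2, C -> 6, D -> 3, E -> 5, F -> 4, G -> 3
    }
}
```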
Pass 1

C1
Itemset X   supp(X)
A           4
B           2
C           6
D           3
E           5
F           4
G           3

L1 after pruning: A, B, C, D, E, F, G
Pass 2

C2
Itemset X   supp(X)
A,B         1
A,C         2
A,D         0
A,E         2
A,F         3
A,G         0
B,C         1
B,D         1
B,E         1
B,F         0
B,G         0
C,D         2
C,E         3
C,F         3
C,G         3
D,E         1
D,F         1
D,G         1
E,F         2
E,G         2
F,G         1

L2 after pruning: (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G)
Pass 3

C3
join AC with AE    A,C,E   supp 1
join AC with AF    A,C,F   supp 2
join AE with AF    A,E,F   supp 1
join CD with CE    C,D,E   supp 0
join CD with CF    C,D,F   supp 0
join CD with CG    C,D,G   supp 1
join CE with CF    C,E,F   supp 2
join CE with CG    C,E,G   supp 2
join CF with CG    C,F,G   supp 1
join EF with EG    E,F,G   supp 1

L3 after pruning: (A,C,F), (C,E,F), (C,E,G)
Pass 4

C4
Itemset X   supp(X)
C,E,F,G     1

Pass 5

For pass 5 we cannot form any candidates, because there are not two frequent 4-itemsets beginning with the same 3 items.
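The candidate supports in the passes above can be checked mechanically against the transaction table. A small sketch (the names are hypothetical), where an itemset is given by its column indices (A = 0 ... G = 6):

```java
// Support of an arbitrary itemset: count the transactions whose row has
// a 1 in every column of the itemset. Rows are the example table T1..T9.
public class ItemsetSupport {
    static final String[] ROWS = {
        "1010010", "0011001", "1100100", "0010101", "1010110",
        "0001100", "0010111", "1000010", "0111000"
    };

    static int support(int... cols) {
        int count = 0;
        for (String row : ROWS) {
            boolean all = true;
            for (int c : cols)
                if (row.charAt(c) != '1') { all = false; break; }
            if (all) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(support(0, 2, 5));    // {A,C,F} -> 2
        System.out.println(support(2, 4, 5, 6)); // {C,E,F,G} -> 1
    }
}
```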
Disadvantages:

• When any one of the frequent itemsets becomes longer, the algorithm has to go through many iterations, and as a result the performance decreases in terms of response time.

• During the execution every frequent itemset is explicitly considered, so a large number of database scans is required.
In our proposed system, we implement association rules using the Boolean matrix algorithm.

• It operates in a top-down manner.

• It scans the database only once.

• So the search (response) time becomes quicker.

• It finds the maximum frequent itemsets which meet the minimum support in a short time, through vector and matrix operations.

• It works better than most of the currently existing algorithms, such as Apriori.
Boolean matrix
• The Boolean matrix algorithm needs to scan the database only once, and as the structure of the Boolean matrix is simple, it can be understood easily and is easy to compute without generating plenty of candidate itemsets.

• Because the database is translated into matrix files, and these files are very small, the algorithm saves much of the time otherwise spent scanning the database. So the algorithm is efficient.

• This is better and faster than the Apriori approach; thus the technique evaluates the required data much more quickly than most of the currently existing algorithms.
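The single-scan idea can be illustrated with a short sketch (hypothetical names, not the project's code): load the database once into a 0/1 matrix R, then compute S = R x R-transpose in memory. S[i][j] is the number of items shared by transactions i and j, and the diagonal S[i][i] is the size of transaction i.

```java
// S = R * R^T for a 0/1 transaction matrix R (rows = transactions,
// columns = items). R here is the 9x7 matrix of the running example.
public class BooleanMatrix {
    static final int[][] R = {
        {1,0,1,0,0,1,0}, {0,0,1,1,0,0,1}, {1,1,0,0,1,0,0},
        {0,0,1,0,1,0,1}, {1,0,1,0,1,1,0}, {0,0,0,1,1,0,0},
        {0,0,1,0,1,1,1}, {1,0,0,0,0,1,0}, {0,1,1,1,0,0,0}
    };

    static int[][] multiplyByTranspose(int[][] r) {
        int n = r.length, m = r[0].length;
        int[][] s = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < m; k++)
                    s[i][j] += r[i][k] * r[j][k];
        return s;
    }

    public static void main(String[] args) {
        int[][] s = multiplyByTranspose(R);
        System.out.println(s[0][4]); // transactions T1 and T5 share 3 items
        System.out.println(s[6][6]); // transaction T7 contains 4 items
    }
}
```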
For the same example, consider the original matrix R and its transpose T:

    1 0 1 0 0 1 0           1 0 1 0 1 0 0 1 0
    0 0 1 1 0 0 1           0 0 1 0 0 0 0 0 1
    1 1 0 0 1 0 0           1 1 0 1 1 0 1 0 1
    0 0 1 0 1 0 1     T =   0 1 0 0 0 1 0 0 1
R = 1 0 1 0 1 1 0           0 0 1 1 1 1 1 0 0
    0 0 0 1 1 0 0           1 0 0 0 1 0 1 1 0
    0 0 1 0 1 1 1           0 1 0 1 0 0 1 0 0
    1 0 0 0 0 1 0
    0 1 1 1 0 0 0

Now multiply these matrices; we get S = R x T:

    3 1 1 1 3 0 2 2 1
    1 3 0 2 1 1 2 0 2
    1 0 3 1 2 1 1 1 1
    1 2 1 3 2 1 3 0 1
S = 3 1 2 2 4 1 3 2 1
    0 1 1 1 1 2 1 0 1
    2 2 1 3 3 1 4 1 1
    2 0 1 0 2 0 1 2 0
    1 2 1 1 1 1 1 0 2
Consider the upper triangular matrix:

     3 1 1 1 3 0 2 2 1
       3 0 2 1 1 2 0 2
         3 1 2 1 1 1 1
           3 2 1 3 0 1
Su =         4 1 3 2 1
               2 1 0 1
                 4 1 1
                   2 0
                     2
Consider the maximum element on the main diagonal and find in which rows that element occurs.

The maximum element is 4: t55 = 4 and t77 = 4.
B5(5,1); B7(7,1): each occurs in only one transaction, which is less than the min supp value, so there are no 4-frequent itemsets.

The next maximum element is 3: t11 = 3, t22 = 3, t33 = 3, t44 = 3.
The off-diagonal 3s give B1(1,5,2); B4(4,7,2); B5(5,7,2).

First consider B1(1,5,2):
the logical AND of a1 and a5 = 1010010, which indicates B1 = {A,C,F}.
Similarly, B4 and B5 are obtained by performing the logical AND of a4 and a7, and of a5 and a7, respectively: B4 = {C,E,G}, B5 = {C,E,F}.

So the 3-frequent itemsets are { (A,C,F), (C,E,F), (C,E,G) }.

Similarly, the 2-frequent itemsets are generated:
{ (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }

Therefore, both algorithms give the same output:
2-frequent itemsets: { (A,C), (A,E), (A,F), (C,D), (C,E), (C,F), (C,G), (E,F), (E,G) }
3-frequent itemsets: { (A,C,F), (C,E,F), (C,E,G) }
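The AND step used in the example can be checked in a few lines (a sketch; the names are illustrative):

```java
// Bitwise AND of two transaction rows recovers the itemset they share.
// Bit positions follow column order A..G.
public class RowAnd {
    static String and(String a, String b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++)
            sb.append(a.charAt(i) == '1' && b.charAt(i) == '1' ? '1' : '0');
        return sb.toString();
    }

    // Translate a bit row back into item labels.
    static String items(String bits) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits.length(); i++)
            if (bits.charAt(i) == '1') sb.append((char) ('A' + i));
        return sb.toString();
    }

    public static void main(String[] args) {
        String r = and("1010010", "1010110"); // a1 AND a5
        System.out.println(r + " -> " + items(r)); // 1010010 -> ACF
    }
}
```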
Advantages:
• Helps in improving marketing tactics.

• Used for collaborative filtering in setting business trends.

• Comprehensive analysis of customer choices.

• Comprehensive analysis of product demand.

• Helps to minimize losses and maximize profits.


Performance Analysis:

Assumptions
1. Apriori's best case == Boolean's best case.
2. Apriori's worst case == Boolean's average case.

Axiom
Let i = 1, and let k be the total number of candidate itemsets.
1. Best case: the most frequent itemset is evaluated in pass i.
2. Average case: the most frequent itemset is evaluated in a pass > i and < k.
3. Worst case: the most frequent itemset is evaluated in pass k.
Proof by Example:

Best Case
T1={bread} T2={bread} T3={bread} T4={bread}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }; stop!

Boolean
Pass 1:
C1: bread - 100% (not pruned)
MFS = { {bread} }
MFCS = { {bread} - 100% } not pruned; stop!

    1
R = 1     T = 1 1 1 1
    1
    1

    1 1 1 1         1 1 1 1
U = 1 1 1 1     C =   1 1 1
    1 1 1 1             1 1
    1 1 1 1               1
Average Case
T1={bread, jam, sugar, cheese} T2={jam, sugar, cheese} T3={sugar, cheese} T4={cheese}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 25% (pruned); jam - 50% (not pruned); sugar - 75% (not pruned); cheese - 100% (not pruned)
MFS = { {jam}, {sugar}, {cheese} }
Pass 2:
C2: {jam, sugar} - 50% (not pruned); {jam, cheese} - 50% (not pruned); {sugar, cheese} - 75% (not pruned)
MFS = { {jam, sugar}, {jam, cheese}, {sugar, cheese} }
Pass 3:
C3: {jam, sugar, cheese} - 50% (not pruned)
MFS = { {jam, sugar, cheese} }

Boolean
    1 1 1 1         1 0 0 0         4 3 2 1         4 3 2 1
R = 0 1 1 1     T = 1 1 0 0     C = 3 3 2 1     U =   3 2 1
    0 0 1 1         1 1 1 0         2 2 2 1             2 1
    0 0 0 1         1 1 1 1         1 1 1 1               1

Pass 1:
Max element = 4. It occurs for only one transaction, so there are no 4-frequent itemsets.
Next element = 3: U(1,1) = 4; U(1,2) = 3.
B(1,2,2): a1 AND a2 = 0 1 1 1, i.e. {jam, sugar, cheese} with support 2.
So the 3-frequent itemsets are { {jam, sugar, cheese} }.
Pass 2:
Next max element = 2.
B(1,2,3,3) = { {sugar, cheese}, {jam, sugar}, {jam, cheese} }
B(2,3,2) = { {sugar, cheese} }
So the 2-frequent itemsets are { {sugar, cheese}, {jam, sugar}, {jam, cheese} }.
Thus the algorithm terminates in two passes.
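The average-case matrices above can be reproduced mechanically with the same product step (a sketch; the names are illustrative):

```java
// C = R * R^T for the average-case transactions: C[i][j] is the number
// of items transactions i and j share; the diagonal gives each size.
public class AverageCase {
    static final int[][] R = {
        {1,1,1,1},  // T1 = {bread, jam, sugar, cheese}
        {0,1,1,1},  // T2 = {jam, sugar, cheese}
        {0,0,1,1},  // T3 = {sugar, cheese}
        {0,0,0,1}   // T4 = {cheese}
    };

    static int[][] product() {
        int n = R.length;
        int[][] c = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < R[0].length; k++)
                    c[i][j] += R[i][k] * R[j][k];
        return c;
    }

    public static void main(String[] args) {
        int[][] c = product();
        System.out.println(c[0][0] + " " + c[0][1]); // 4 3
    }
}
```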
Worst Case
T1={bread, jam, sugar, cheese} T2={bread, jam, sugar, cheese}; let min supp = 50%.

Apriori
Pass 1:
C1: bread - 100%, jam - 100%, sugar - 100%, cheese - 100% (nothing pruned)
MFS = { {bread}, {jam}, {sugar}, {cheese} }
Pass 2:
C2: {bread, jam} - 100%, {bread, sugar} - 100%, {bread, cheese} - 100%, {sugar, jam} - 100%, {cheese, jam} - 100%, {sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }
Pass 3:
C3: {bread, jam, sugar} - 100%, {bread, jam, cheese} - 100%, {bread, sugar, cheese} - 100%, {jam, sugar, cheese} - 100% (nothing pruned)
MFS = { {bread, jam, sugar}, {bread, jam, cheese}, {bread, sugar, cheese}, {jam, sugar, cheese} }
Pass 4:
C4: {bread, jam, sugar, cheese} - 100% (not pruned)
MFS = { {bread, jam, sugar, cheese} }
Boolean
    1 1 1 1         1 1 1 1         4 4 4 4         4 4 4 4
R = 1 1 1 1     T = 1 1 1 1     C = 4 4 4 4     U =   4 4 4
    1 1 1 1         1 1 1 1         4 4 4 4             4 4
    1 1 1 1         1 1 1 1         4 4 4 4               4

Pass 1:
Max element = 4.
So the 4-frequent itemsets are { {bread, jam, sugar, cheese} }.
Hence the 3-frequent itemsets are { {bread, jam, sugar}, {bread, jam, cheese}, {bread, sugar, cheese}, {jam, sugar, cheese} },
and the 2-frequent itemsets are { {bread, jam}, {bread, sugar}, {bread, cheese}, {sugar, jam}, {cheese, jam}, {sugar, cheese} }.
Apriori Time Complexity

• Best case: one pass, so O(1).
• Worst case: all items are considered, so O(n).
• Average case: only about half of the itemsets are considered, so O(n - k).

Boolean Matrix Time Complexity

• Best case: O(1).
• Average case: O(n).
• Worst case: O(n).
Dataflow diagram
UML Diagrams

Sequence diagram for Apriori
Sequence diagram for Boolean
Sequence diagram for pruning the itemsets
Testing:
Testing is the process of executing a program with the intent of finding an error.

Usability Test:
Test Case Id   Input Format                                              Input data   Expected   Obtained
UT_1.1         Time string ('morning', 'afternoon', 'evening', 'night')  morning      No Error   No Error
UT_1.2         Hrs:mins:secs                                             11:35:65     No Error   Error
UT_1.3         Hrs:mins                                                  03:75        No Error   Error
UT_1.4         Hrs                                                       32           No Error   Error
UT_1.5         Random character string                                   H2           Error      Error
UT_1.6         Integer number                                            15           Error      Error
UT_1.7         Floating point                                            12.9         Error      Error
UT_1.8         Alphanumeric                                              @w           Error      Error
UT_1.9         Special characters                                        9,0          Error      Error
UT_1.10        Empty string                                                           Error      Error

Path Testing:
Test Case Id   Input Module                                     Next Module (Expected)                          Next Module (Obtained)
PT_1           Receive user inputs                              Retrieve raw candidate transaction table        Raw candidate transaction table retrieved
PT_2           Retrieve raw candidate transaction table         Construct binary candidate transaction table    Binary candidate transaction table constructed
PT_3           Binary candidate transaction table constructed   Algorithm begins                                Algorithm began
PT_4           Algorithm began                                  Subset selection                                Subset selected
PT_5           Subset selected                                  Support evaluation                              Support evaluated
PT_6           Support evaluated                                Prune                                           Pruned
PT_7           Pruned                                           Most frequent itemset evaluation                Most frequent itemset evaluated
PT_8           Most frequent itemsets evaluated                 Frame association rules                         Association rules framed
PT_9           Association rules framed                         Pictorial representation using paint method     Pictorial representation began
PT_10          Pictorial representation began                   Draw bar chart                                  Bar chart drawn
PT_11          Bar chart drawn                                  Draw pie chart                                  Pie chart drawn
PT_12          Pie chart drawn                                  Draw support-for-single-pass graph              Support-for-single-pass graph drawn
PT_13          Support-for-single-pass graph drawn              Draw time complexity graph                      Time complexity graph drawn
Conclusion:
Discovering frequent itemsets is a key problem in important data mining applications, such as the discovery of association rules.

The Boolean matrix algorithm overcomes this difficulty more efficiently than the currently existing algorithms, and performs better than Apriori.

We evaluated the performance of the algorithm using well-known synthetic benchmark databases and real-life census and stock market databases.
Further Developments:

Parallelizing the Boolean matrix algorithm: this is a way to minimize duplicate calculations and to maximize the use of the available processors.
