
Chapter-3

Flexible Uniform Framework for Mining Association Rules

3.1 Introduction

In this chapter, a new framework for data mining, called the Flexible Uniform Framework (FUF), is proposed to discover various types of association rules. This framework provides an environment in which the user can invoke independent knowledge discovery operations with individual perspectives. It offers flexibility and reusability during the mining process and forms the basis of a new model for association rule mining.

Section 3.2 gives the motivation and an overview of the Flexible Uniform
Framework for mining association rules. Sections 3.3 and 3.4 discuss different phases
of this framework and its salient features respectively. In Section 3.5 a new model for
the KDD process is proposed.

3.2 Flexible Uniform Framework

3.2.1 Motivation

The Flexible Uniform Framework (FUF) has been proposed for mining different kinds of association rules and their extensions, keeping in view the following:

• In most cases pertaining to association rule mining, only a few 1-frequent items are present in the data for a given minimum support threshold, compared to the total number of items present in the data. In [91], it has been observed that the largest frequent itemsets generally do not contain more than 8-10 items. Further, the user may be interested in only a few items in the 1-frequent items set, depending on his/her logical view.

• Association rule mining is an iterative process. Most of the time users have to run the association rule mining algorithm of their choice at different values of support, confidence, and abstraction level (at different concept hierarchies) to discover hidden, previously unknown, and ultimately useful knowledge from huge volumes of data. A mining algorithm may take a long time before giving the desired results, and repeated execution of a mining algorithm with varied constraints incurs prohibitive costs. The following statement of Anand [116] is also important in this context:

“The objective of most commercial knowledge discovery endeavors is the development of knowledge discovery applications rather than one time discovery of some interesting insight.”

• Depending on the application and consequent requirements, the user may be interested in either discovery-driven or verification-driven association rule mining.

The above discussion suggests that before adopting any algorithm(s)/framework for designing a mining system, the designer/architect of the system needs to analyze them in totality, including the flexibility, reusability (of subcomponents), and ease of use they offer. An algorithm may not appear efficient in a single run; however, its reusability may be appreciated in subsequent runs that reuse the computations of previous runs. Further, if the selected algorithm does not offer flexibility and reusability, an entirely different algorithm is required for each kind of association rule mining, and for every execution the user will have to start from the beginning.

To address the above, a Flexible Uniform Framework for mining different kinds of association rules is proposed in this work. The ultimate goal of creating this framework is to provide an environment that enhances productivity, flexibility, and ease of use and comprehension at the user level.

3.2.2 Phases of FUF

In this work the uniform data mining framework proposed in [38] has been modified. Some new phases (Phases 3 to 5, given below) are incorporated, and the association graph construction phase of [38] has been dropped. All the remaining phases of [38] have been retained in the different phases of the present FUF. The new phases make the present FUF more flexible and reusable during mining, reduce the number of candidate itemsets generated and tested, and carry out a partial analysis of the database, which helps in selecting a suitable algorithm for mining frequent itemsets. The phases of the Flexible Uniform Framework are as follows:
Phase 1: Encoding and Counting
During this phase all distinct items present in the data are encoded (to a smaller code)
and counted individually.

Phase 2: Frequent Items Generation


During this phase frequent items present in the data are identified at a given minimum
support.

Phase 3: Construction of the Concentrator
This phase constructs the Concentrator by scanning the data to be mined and using the information obtained during Phases 1 and 2. The Concentrator is a pre-defined structure containing mostly significant items (along with a few false frequent items) as well as the transactions of the data, in encoded form. Its construction (described in Section 3.3.3) requires one complete scan of the entire data to be mined. After this phase, the mining algorithms will not scan the original data during mining performed at a support equal to or greater than that at which the Concentrator is constructed¹. The support at which the Concentrator is constructed is called the primary support throughout this thesis.

Phase 4: Detection and Filtering False Frequent Items

False frequent items are items that appear statistically frequent but are not actually frequent. All such false frequent items are to be identified and removed. Removal of false frequent items reduces the generation and testing of candidate itemsets as well as the search space of the Concentrator. This task is performed during this phase (described in Section 3.3.4).

¹ The Concentrator can also be constructed without user bias; it then provides an unrestricted view of all associations to the users during mining.
Phase 5: Prediction of Maximum Size of Frequent Itemset
During this phase the maximum size of frequent itemsets that may be present in the data at a given minimum support is predicted. This information is useful in reducing the generation and testing of candidate itemsets and the number of scans of the Concentrator, and in certain other tasks described in Chapter 4. Prediction of the maximum size of frequent itemsets present in the data is discussed in Section 3.3.5.

Phase 6: Association Pattern Generation


This phase generates all frequent association patterns by scanning the Concentrator.

Phase 7: Association Rule Generation
During this phase association rules are generated directly from the corresponding frequent patterns obtained in Phase 6.

3.3 Description of Different Phases of FUF

3.3.1 Encoding and Counting

As mentioned earlier, during this phase all items present in the pre-mined data are encoded and individually counted. An algorithm has been developed for this task, wherein information regarding encoding and counting is recorded in encode_decode_table and count_table respectively for future reference. encode_decode_table has two attributes: item_name and decoded_name (of the items), while count_table has the attributes decoded_name (of the items) and count_value. count_table is referred to during the frequent items generation phase, while encode_decode_table is referred to when mining results are presented to the user. All further mining processes are performed only on this encoded data.
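A minimal sketch of this phase in Python is given below, assuming the pre-mined data is available as an iterable of item lists; the helper name encode_and_count, the use of integer codes as decoded names, and the dictionary/Counter representations of the two tables are illustrative assumptions rather than the thesis implementation.

from collections import Counter

def encode_and_count(transactions):
    # Phase 1 sketch: assign a compact integer code to every distinct item
    # and count each item's occurrences (names and layout are assumptions).
    encode_decode_table = {}   # item_name -> decoded_name (integer code)
    count_table = Counter()    # decoded_name -> count_value
    for transaction in transactions:
        for item in transaction:
            code = encode_decode_table.setdefault(item, len(encode_decode_table))
            count_table[code] += 1
    return encode_decode_table, count_table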

3.3.2 Frequent Items Generation

Frequent items present in the data, at the given minimum support are identified by
calling the function identify_frequent_item(). The input parameters of this function
are the list of encoded items along with their count value (count_table) and the
minimum user support (min_supp) while it returns L1 (1-frequent items). The structure
of L1 is same as that of count_table. The function that incorporates the idea is given
below.

//Function to identify frequent items present in the data

function identify_frequent_item(count_table, min_supp)

L1 = φ; // L1 is the set of 1-frequent items

forall tuple t ∈ count_table do
    if (t.count_value ≥ min_supp) then
        L1 = L1 ∪ t.decoded_name;
    endif;
endfor;
return L1;

This function can identify the frequent items at any given minimum support by referring to the count_table. Note that it does not refer to the original data or the Concentrator.
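For reference, a small Python equivalent of the above pseudocode might look as follows; the dictionary form of count_table (as returned by the encode_and_count sketch above) and the use of an absolute transaction count for min_supp are assumptions made for illustration.

def identify_frequent_item(count_table, min_supp):
    # Phase 2 sketch: return the set of 1-frequent items (L1).
    # count_table maps an encoded item to its count_value; min_supp is an
    # absolute count here (the thesis also works with percentage supports).
    return {item for item, count in count_table.items() if count >= min_supp}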

3.3.3 Construction of the Concentrator

The Concentrator contains mostly statistically significant items and transactions in the encoded form. Hence, its overall size is reduced as compared to the original data to be mined. All proposed algorithms are built as processes running on this structure. Mannila [93] has also reflected the usefulness of such a structure for data mining operations. In the present work, the Concentrator is constructed by invoking the function make_concentrator(). The input parameters of this function are the data D and the 1-frequent items L1.

//Function for constructing the Concentrator

function make_concentrator(D, L1)

create table ξ(tid, L1) as attributes;
i = 1;
forall transaction t ∈ D do
items = t ∩ L1;
if |items| ≥ 2 then
forall j ∈ items do
ξ(i, j) = 1;
endfor;
i = i +1;
else
forall j ∈ items do
j.count = j.count - 1;
endfor;
endif;
endfor;
return ξ;

Brief description of this function is given below.

When function make_concentrator() is invoked, it creates a table (ξ) with tid and the 1-frequent items (each member of L1 is treated as an independent attribute) as attributes. It then reads all the transactions of the data file one by one. For each transaction containing at least two frequent items, '1' is entered in location (i, j) of the Concentrator for each frequent item present in the transaction, where i and j are the row and column in the Concentrator corresponding to a tid and a frequent item of the database respectively. If a transaction contains fewer than two frequent items, it is not entered in the Concentrator as a row; such transactions are statistically insignificant. Thus, the Concentrator contains (|L1|+1) columns, each column (except the first) being a bit vector corresponding to a specific element of L1, and the number of rows in the Concentrator is equal to or less than |D|. The bit vector associated with item i is denoted as βi, and the number of 1s in a bit vector βi is denoted by βi(1). The resulting Concentrator is preserved in secondary storage for future use in the mining process. Computations done during construction of the Concentrator are shared and reused every time the user requests mining of association rules at a support equal to or higher than that at which the Concentrator is constructed.
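A minimal Python sketch of this construction is given below, assuming the data is available as (tid, itemset) pairs and representing each Concentrator row as a dictionary of 0/1 values; the row layout and the decision to keep the original tids are illustrative assumptions rather than the thesis implementation.

def make_concentrator(D, L1):
    # Phase 3 sketch: build the Concentrator as a list of (tid, bit-vector row)
    # pairs. D is an iterable of (tid, itemset) pairs; L1 is the set of
    # 1-frequent items returned by the identify_frequent_item sketch above.
    columns = sorted(L1)                 # one bit-vector column per frequent item
    concentrator = []
    for tid, itemset in D:
        frequent_in_t = set(itemset) & set(L1)
        if len(frequent_in_t) >= 2:      # keep only statistically significant transactions
            row = {item: (1 if item in frequent_in_t else 0) for item in columns}
            concentrator.append((tid, row))
    return concentrator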

The extra space that is required to store the Concentrator is apparently an overhead. However, the benefits in terms of faster response time, flexibility, and reusability outweigh the expense.

3.3.4 Detection and Filtering False Frequent Items

Due to the pruning of insignificant transactions during construction of the


Concentrator, it may be possible that the support of some of the items in the
Concentrator may fall below the minimum required support. Such items have been
designated as false frequent items in this thesis. Presence of such items in the
Concentrator will require unnecessary generation and testing of candidate itemsets.
Thus, such items needed to be filtered out before starting the actual mining process.
This also results in the reduction of the size of the Concentrator. The false frequent
items present in the Concentrator can be detected by using the following lemma.

Lemma 3.1:

If, for any item i of the Concentrator, the value of βi(1), that is, the number of 1s in the column corresponding to item i, is less than the minimum support, i.e., βi(1) < min_supp, then item i is a false frequent item.

The real support of an item i is obtained by counting the number of 1s in βi, i.e. the value of βi(1). If the value of βi(1) is less than the primary_support (or the minimum support at which mining is to be performed), then item i is a false frequent item. Columns corresponding to such items are filtered out from the Concentrator and the items are also removed from L1. The remaining items in L1 are actual frequent items. The support of an item calculated before making the Concentrator (during Phase 1) is its apparent support. The value of the apparent support is always greater than or equal to the real support.
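The detection and filtering step could be sketched in Python as follows, working on the row layout used in the make_concentrator sketch above; treating min_supp as an absolute count and returning both the pruned L1 and the detected false frequent items are assumptions made for illustration.

def filter_false_frequent(concentrator, L1, min_supp):
    # Phase 4 sketch, applying Lemma 3.1: an item whose real support beta_i(1)
    # (the number of 1s in its column) is below min_supp is false frequent.
    real_support = {item: sum(row[item] for _, row in concentrator) for item in L1}
    false_frequent = {item for item, support in real_support.items() if support < min_supp}
    # drop the corresponding columns from every row of the Concentrator
    for _, row in concentrator:
        for item in false_frequent:
            row.pop(item, None)
    return set(L1) - false_frequent, false_frequent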

Examples for the construction of the Concentrator and detection of false frequent
items are illustrated below.

Consider the data given in Table 3.1 where each record is a <Tid, Itemset> pair. The
Tid is the transaction identifier and the itemset is a list of items purchased in the
transaction. Assume that the value of primary_support as well as min_supp is 50%.

Table 3.1: Data

Tid Itemset
1 A, B, C, D, E, K
2 C, F, G
3 D, E, F
4 A, B, C, D, E, H
5 C, H
6 B, E, H, I
7 A, B, C, D, E, H, J
8 A, B, E, H
9 A, B, C, D, E
10 H, I, J
11 B, I, J, K
12 A, C, D, F, I

Table 3.2 shows the number of occurrences of each item present in the data.

Table 3.2: Number of occurrences of items in given data

Item   A   B   C   D   E   F   G   H   I   J   K
Count  06  07  07  06  07  03  01  06  04  03  02

Thus, the set of apparent 1-frequent items (aL1) at the assumed minimum support of 50% is as given below.

aL1 = {A, B, C, D, E, H}

Now function make_concentrator() is called. The output of this function is shown in


Table 3.3. For making the presentation understandable, encoding of the items is not
done and actual names of items are used.

Table 3.3: The Concentrator for data of Table 3.1

Tid   βA   βB   βC   βD   βE   βH
 1     1    1    1    1    1    0
 3     0    0    0    1    1    0
 4     1    1    1    1    1    1
 5     0    0    1    0    0    1
 6     0    1    0    0    1    1
 7     1    1    1    1    1    1
 8     1    1    0    0    1    1
 9     1    1    1    1    1    0
12     1    0    1    1    0    0

βA(1) = 06   βB(1) = 06   βC(1) = 06   βD(1) = 06   βE(1) = 07   βH(1) = 05   (|ξ| = 09 rows)

In the Concentrator (Table 3.3) the count of 1s in column βH is 5, which is less than the minimum support (that is, 50%, or 6 transactions); thus item 'H' is a false frequent item according to Lemma 3.1 and can be filtered out from aL1 without losing any frequent itemset. Thus, the set of real frequent items (rL1) at the assumed minimum support is as given below.

rL1 = {A, B, C, D, E}

To discover the 2-frequent itemsets, algorithms [1,2,3,4,38,50] generate 2-candidate itemsets from the apparent 1-frequent items (by joining aL1 with aL1) and then test whether each generated candidate itemset is frequent or not. In such algorithms the number of 2-candidate itemsets generated and tested is given by

|C2| = |aL1| (|aL1| - 1)/2

Thus, the more items there are in aL1, the higher the processing cost of finding L2. For the data of Table 3.1, |aL1| = 6, so fifteen 2-candidate itemsets are generated and tested, as shown in Table 3.4. For the same database, if C2 is generated by using rL1 (|rL1| = 5), the total number of candidate itemsets generated and tested is ten, as shown in Table 3.5. However, the 2-frequent itemsets generated are the same with both approaches. In all future discussion the 1-frequent items set will be designated by L1.

Table 3.4: Generation of C2 and L2 from aL1


2-candidate itemset (C2) Count value 2-frequent itemset (L2)
AB 05 ---
AC 05 ---
AD 05 ---
AE 05 ---
AH 03 ---
BC 04 ---
BD 04 ---
BE 06 BE
BH 04 ---
CD 05 ---
CE 04 ---
CH 03 ---
DE 05 ---
DH 02 ---
EH 04 ---

Table 3.5: Generation of C2 and L2 from rL1

2-candidate itemset (C2) Count value 2-frequent itemset (L2)
AB 05 ---
AC 05 ---
AD 05 ---
AE 05 ---
BC 04 ---
BD 04 ---
BE 06 BE
CD 05 ---
CE 04 ---
DE 05 ---

3.3.5 Prediction of Maximum Size of Frequent Itemset

Currently available association rule mining algorithms continue to generate and test candidate itemsets until the set of frequent itemsets found in a pass becomes null. Due to this, many insignificant candidate itemsets are generated and tested. Testing of such insignificant candidate itemsets may also require one more database scan.

To overcome this problem, a mechanism that can predict the maximum size of the frequent itemsets present in the database at a given minimum support is needed. The function predict_size(), given below, has been developed in the present work to address this problem.

function predict_size(ξ, min_supp, D)

//Input-output parameters of this function are given below
//ξ: The Concentrator containing only real frequent items
//min_supp: User specified minimum support (in percent)
//D: Number of transactions in the data to be mined
//γ: Upper limit for size of the largest frequent itemset(s)

γ = φ;
i = 1;
forall tuple t ∈ ξ do
    Ψ(i) = count(t);
    i = i + 1;
endfor;
Ψ = sort_descending(Ψ);
j = ⌈(D × min_supp)/100⌉;
γ = Ψ(j);
return γ;

Brief description of the function predict_size() is given below:

First, the number of frequent items present in each tuple t of the Concentrator ξ is counted by calling the function count() and stored in the ith location of array Ψ. The function sort_descending() is then called with array Ψ as input; it sorts Ψ in descending order. Next, the value of ⌈(D × min_supp)/100⌉ is computed and assigned to j. Thereafter the content of location j of array Ψ is read and assigned to γ, which represents the largest size of frequent itemset(s) that can be present in the data at the given minimum support. The size of the largest frequent itemset(s) cannot exceed γ. Thus, by using the function predict_size(), the size γ of the largest itemset having the potential to become frequent (the worst case) can be predicted in advance.

The working of this function has been illustrated by using the data of Table 3.6.

Table 3.6: Transaction data

Tid Itemset
T1 Bread, Eggs
T2 Bread, Milk
T3 Bread, Eggs
T4 Butter, Eggs, Milk
T5 Bread, Butter, Eggs, Milk
T6 Eggs, Milk
T7 Bread, Butter
T8 Bread, Milk

The transactional data of Table 3.6 contains four distinct items i.e. Bread, Butter,
Eggs, and Milk. At the assumed minimum support of 26% all these items are frequent
items.

Frequent item count (Ψi) of each transaction is given below.

Ψ1 Ψ2 Ψ3 Ψ4 Ψ5 Ψ6 Ψ7 Ψ8
2 2 2 3 4 2 2 2

Frequent item counts arranged in descending order are,

4 3 2 2 2 2 2 2

Now location ⌈Total Number of Transactions × Minimum Support⌉, that is ⌈8 × 0.26⌉ = ⌈2.08⌉ = 3, is read. The content of this location is 2. Thus, two is the predicted largest size of frequent itemsets (γ) that may be present in the given data at the given minimum support. Similarly, for assumed minimum supports of 20% and 30%, the predicted maximum sizes of frequent itemsets are 3 and 2 respectively.
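The prediction step could be sketched in Python as follows, using the Concentrator row layout assumed in the earlier sketches; treating min_supp as a percentage and assuming that j never exceeds the number of Concentrator rows are simplifications made for illustration.

import math

def predict_size(concentrator, min_supp_percent, num_transactions):
    # Phase 5 sketch: upper bound on the size of the largest frequent itemset.
    counts = sorted((sum(row.values()) for _, row in concentrator), reverse=True)
    j = math.ceil(num_transactions * min_supp_percent / 100)
    return counts[j - 1]   # pseudocode location j corresponds to counts[j-1]

# For Table 3.6 the per-transaction counts are [2, 2, 2, 3, 4, 2, 2, 2];
# at 26% support j = ceil(8 * 0.26) = 3, and the 3rd largest count is 2,
# matching the worked example above.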

3.3.6 Association Pattern Generation

This phase i.e. Association Pattern Generation of the proposed Flexible Uniform
Framework is the heart of the mining process and one of the core objectives of the
present work. It has been researched exhaustively and new algorithms have been
developed for the generation of association patterns. These algorithms and other
associated issues have been presented in Chapter 4 of this thesis.

3.3.7 Association Rule Generation

As in Section 3.3.6, the different algorithms developed during this research work for the generation of association rules, along with other related issues, are described in Chapter 5 of this thesis.

3.4 Salient Features of the Flexible Uniform Framework

The following are some of the distinct features of the Flexible Uniform Framework:

3.4.1 Reusability of Computations

Non-repetitive and repetitive phases of FUF are shown in Figure 3.1.

[Figure 3.1: Repetitive and non-repetitive phases of FUF (repetitive phases are shaded). The non-repetitive phases Encoding and Counting, Frequent Items Generation, and Construction of the Concentrator are followed, for each user view (User-1 … User-n), by the repetitive phases Detection and Filtering of False Frequent Items, Prediction of Maximum Size of Frequent Itemset, Association Pattern Generation, and Association Rule Generation.]

Non-repetitive phases do the encoding and counting, construct the Concentrator, and also determine some other statistical information. The repetitive phases (shaded) process the Concentrator² to discover knowledge according to the parameters and constraints assigned by the end user. For every new mining schema only these phases are repeated, and the computations of the non-repetitive phases are reused. Further, the computations done during Phase 5 (Prediction of Maximum Size of Frequent Itemset) may also be reused for different tasks during the mining process, as described in Chapter 4.

It may also be mentioned that the Detection and Filtering False Frequent Items phase
will be repeated only when mining is to be performed at a higher support than the
primary support at which the Concentrator is constructed.

3.4.2 Flexibility During Mining

The design of the Concentrator is not biased to any particular application. It is


generic and application independent. User has the flexibility to select the entire
Concentrator or its representative sample with all or selected attributes for mining,
according to his/her logical view.

Items present in the Concentrator can be generalized easily to higher concept levels,
according to the concept hierarchy defined for the items. Generalization of the items
to higher concept levels requires only the scanning of corresponding bit vectors of the
items. Mining can be performed on these generalized items. Generalization of the
items is illustrated below.

65
…………………………………………………………………………………………………………………………………..
2
It may be processed asynchronously
Suppose the user wants to generalize the items 'desktop computer' (say A) and 'laptop computer' (say B) to the higher concept level 'computer' (say C), according to the concept hierarchy defined for these items in Figure 3.2(a). The bit vectors βA and βB corresponding to items A and B are shown in Figure 3.2(b). These items can be generalized to the higher concept level computer by computing βA ∨ βB, where '∨' is the logical OR operator. The resultant generalized bit vector βC is shown in Figure 3.2(c).

(a) Concept hierarchy: Computer → {Desktop computer, Laptop computer}
(b) βA = (0, 0, 0, 1, 0, 0, 0, 0)   βB = (0, 0, 0, 0, 0, 0, 1, 0)
(c) βC = βA ∨ βB = (0, 0, 0, 1, 0, 0, 1, 0)

Figure 3.2: Generalization of items
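A minimal Python sketch of this generalization, assuming the bit vectors are stored as lists of 0/1 values matching Figure 3.2, is given below; the helper name generalize is an illustrative assumption.

def generalize(beta_a, beta_b):
    # OR the bit vectors of the items being merged to obtain the bit vector
    # of the higher-level concept (here: desktop computer OR laptop computer).
    return [a | b for a, b in zip(beta_a, beta_b)]

beta_A = [0, 0, 0, 1, 0, 0, 0, 0]    # desktop computer
beta_B = [0, 0, 0, 0, 0, 0, 1, 0]    # laptop computer
beta_C = generalize(beta_A, beta_B)  # computer: [0, 0, 0, 1, 0, 0, 1, 0]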

Further, due to the structure of the Concentrator it is easy to add or delete tuples (rows) or attributes (columns) in the existing Concentrator. Thus association rules can be maintained for updated data. This aspect is dealt with in Chapter 4 of this thesis.

The user may also impose Boolean constraints on the items as per his/her needs, and these constraints may be integrated with the mining algorithm. When mining is performed from scratch, integration of constraints with the mining algorithm will improve the efficiency of the mining process.

It is reiterated that in the proposed framework mining algorithms operate on the Concentrator instead of the original data. Thus the knowledge discovery process can be undertaken without interfering with day-to-day database operations.

3.5 AR-Model: A New Model for the KDD Process

A new model for KDD process, called AR-Model is being proposed in Figure 3.3.
This model is downward compatible and provides full functionality of the traditional
KDD process model presented in Appendix- A of this thesis.

Step AR1 of Figure 3.3 corresponds to step S1 of traditional KDD process model.
During this step discovery goals are specified in terms of Knowledge Discovery
Schema (KDS). The schema is compiled and the resulting meta-data is stored for
future reference.

The next step, AR2, is Pre-processing. This is a compound step containing sub-steps (AR2a, AR2b, and AR2c), which correspond to steps (S2 - S4) of the traditional KDD process model. The outcome of this step is the Concentrator (discussed in Section 3.3.3).

Step AR3 is novel and is responsible for partial data analysis. The outcome of this analysis is used to select proper mining algorithm(s) during mining. The computation done to select the mining algorithm is also used for reducing the number of candidate itemsets generated and tested, and for reducing the number of scans and the size of the data scanned in the Concentrator.

Step AR4 signifies initiation of the mining phase. The user initiates this step asynchronously; it commences with either the formulation of a mining query or the execution of an application. The query or application is vetted against the corresponding schema and compiled during this step.

Step AR5, the Data Mining step, corresponds to S5 of the traditional KDD process model. During this step suitable association rule mining algorithms are selected on the basis of the information obtained during Step AR3. The resulting knowledge is presented and interpreted during steps (AR6 - AR7), which correspond to steps (S6 - S7) of the traditional KDD process model.

[Figure 3.3: AR-Model for KDD Process. The model comprises Domain Understanding and Application Requirements (AR1); the pre-processing sub-steps Data Selection (AR2a), Data Cleaning (AR2b), and Data Transformation (AR2c); Partial Data Analysis (AR3); Mining Query (AR4); Data Mining (AR5); Knowledge Presentation (AR6); and Interpretation (AR7), with data flow and optional repetition indicated between the steps.]


By varying the mining perspective, the user can repeat steps (AR3 - AR4 - AR5 - AR6).

The proposed model for mining association rules inherently supports experimentation and monitoring. These functionalities are desirable in business and scientific data explorations. Development of the Concentrator provides flexibility and promotes experimentation and monitoring of desired subsets of the database.

An interesting contrast between the two models is that in the traditional KDD process the functionality is defined at the beginning of the process by the end user, while in the AR-Model it is decided dynamically by the end user at the time of actual mining of the data.

A KDD system based on the proposed AR-Model is designed and developed in Chapter 6; it is called AR-Miner in all future discussion in this thesis.

