Chapter-3 Flexible Uniform Framework For Mining Association Rules
3.1 Introduction
In this chapter a new framework for data mining, called the Flexible Uniform Framework (FUF), is proposed for discovering various types of association rules. This framework provides an environment where the user can invoke independent knowledge discovery operations with individual perspectives. It offers flexibility and reusability during the mining process and forms the basis of a new model for association rule mining.
Section 3.2 gives the motivation and an overview of the Flexible Uniform Framework for mining association rules. Sections 3.3 and 3.4 discuss the different phases of this framework and its salient features, respectively. In Section 3.5 a new model for the KDD process is proposed.
3.2.1 Motivation
The Flexible Uniform Framework (FUF) has been proposed for mining different kinds of association rules and their extensions, keeping in view the following:
The above discussion suggests that before adopting any algorithm or framework for designing a mining system, the designer/architect of the system needs to analyze it in totality, including the flexibility, reusability (of its subcomponents), and ease it offers. An algorithm may not appear efficient in a single run; however, its reusability may pay off in subsequent runs that reuse the computations of previous runs. Further, if the selected algorithm does not offer flexibility and reusability, an entirely different algorithm is required for each kind of association rule mining, and the user will have to start from the beginning for every execution.
To address the above, a Flexible Uniform Framework for mining different kinds of association rules is proposed in this work. The ultimate goal of this framework is to provide an environment that enhances productivity, flexibility, and ease of use and comprehension at the user level.
In this work the uniform data mining framework proposed in [38] has been modified. Some new phases (Phases 3 to 5, given below) are incorporated, and the association graph construction phase of [38] has been dropped; all the remaining phases of [38] are incorporated into the phases of the present FUF. The new phases make the present FUF more flexible and reusable during mining, reduce the number of candidate itemsets generated and tested, and carry out a partial analysis of the database that helps in selecting a suitable algorithm for mining frequent itemsets. The following are the different phases of the Flexible Uniform Framework:
Phase 1: Encoding and Counting
During this phase all distinct items present in the data are encoded (to a smaller code)
and counted individually.
Phase 3: Construction of the Concentrator
This phase constructs the Concentrator by scanning the data to be mined and using
the information obtained during Phase 1 and 2. The Concentrator is a pre-defined
structure containing mostly significant items, except a few false frequent items, as
well as transactions of the data in the encoded form. Its Construction (described in
Section 3.3.3) requires one complete scan of the entire data to be mined. After this
phase, the mining algorithms will not scan the original data during mining performed
at the support equal to or greater than, at which the Concentrator is constructed1. The
support at which concentrator is constructed has been called as primary support
throughout this thesis.
________________
1 The Concentrator can also be constructed without user bias; it will then provide an unrestricted view of all associations to the users during mining.
Phase 5: Prediction of Maximum Size of Frequent Itemset
During this phase maximum size of the frequent itemset, which may be present in the
data at a given minimum support is predicted. This information is useful in reducing
the generation and testing of candidate itemsets and the number of scans of the
Concentrator and certain other tasks described in Chapter 4. Prediction of maximum
size of frequent itemset present in the data is discussed in Section 3.3.5.
Phase 7: Association Rule Generation
During this phase association rules are generated directly from the corresponding frequent patterns obtained in Phase 6.
As mentioned earlier, during this phase all items present in the pre-mined data are encoded and individually counted. An algorithm has been developed for performing this task, wherein the information regarding encoding and counting is recorded in encode_decode_table and count_table respectively for future reference. encode_decode_table has two attributes, item_name and decoded_name (of the items), while count_table has the attributes decoded_name (of the items) and count_value. count_table is referred to during the frequent item generation phase, while encode_decode_table is referred to when mining results are presented to the user. Further mining processes are performed only on this encoded data.
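The encoding and counting step can be sketched as follows. The table names (encode_decode_table, count_table) follow the text, but the dict-based layout and integer codes are illustrative assumptions, not the thesis's own implementation.

```python
# A minimal sketch of the encoding-and-counting phase (Phase 1), assuming
# transactions are given as lists of item-name strings.

def encode_and_count(transactions):
    """Assign each distinct item a small integer code and count its occurrences."""
    encode_decode_table = {}   # item_name -> decoded_name (integer code)
    count_table = {}           # decoded_name -> count_value
    encoded_data = []
    for t in transactions:
        encoded_t = []
        for item in t:
            if item not in encode_decode_table:
                encode_decode_table[item] = len(encode_decode_table)
            code = encode_decode_table[item]
            count_table[code] = count_table.get(code, 0) + 1
            encoded_t.append(code)
        encoded_data.append(encoded_t)
    return encode_decode_table, count_table, encoded_data
```

All subsequent mining then operates on encoded_data; encode_decode_table is consulted only when results are presented back to the user.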
Frequent items present in the data at the given minimum support are identified by calling the function identify_frequent_item(). The input parameters of this function are the list of encoded items along with their count values (count_table) and the minimum user support (min_supp), while it returns L1 (the 1-frequent items). The structure of L1 is the same as that of count_table. The function that incorporates this idea is given below.
function identify_frequent_item(count_table, min_supp)
This function identifies the frequent items, at any given minimum support, by referring to count_table. Note that it does not refer to the original data or the Concentrator.
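A direct reading of identify_frequent_item() as a sketch: it scans only count_table, never the original data or the Concentrator. min_supp is taken here as an absolute count; converting a percentage support into a count is assumed to happen before the call.

```python
# identify_frequent_item(): filter count_table by the minimum support.

def identify_frequent_item(count_table, min_supp):
    """Return L1, with the same structure as count_table (decoded_name -> count)."""
    return {code: count for code, count in count_table.items() if count >= min_supp}
```

For the counts of Table 3.2 and min_supp = 6 (50% of 12 transactions), this returns the six apparent frequent items {A, B, C, D, E, H}.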
function make_concentrator(D, L1)
    create table ξ(tid, L1) as attributes;
    i = 1;
    forall transactions t ∈ D do
        items = t ∩ L1;
        if |items| ≥ 2 then
            forall j ∈ items do
                ξ(i, j) = 1;
            endfor;
            i = i + 1;
        else
            forall j ∈ items do
                j.count = j.count − 1;
            endfor;
        endif;
    endfor;
    return ξ;
When the function make_concentrator() is invoked, it creates a table (ξ) with tid and the 1-frequent items (each member of L1 treated as an independent attribute) as attributes. It then reads all the transactions of the data file one by one. For each transaction containing at least two frequent items, a '1' is entered in location (i, j) of the Concentrator for each frequent item present in the transaction, where i and j are the row and column of the Concentrator corresponding to a tid and a frequent item of the database respectively. Any transaction containing fewer than two frequent items is not entered in the Concentrator as a row; such transactions are statistically insignificant. Thus, the Concentrator contains (|L1| + 1) columns, each column (except the first) being a bit vector corresponding to a specific element of L1, and the number of rows in the Concentrator is equal to or less than |D|. The bit vector associated with item i is denoted βi, and the number of 1s in a bit vector βi by βi(1). The resulting Concentrator is preserved in secondary storage for future use in the mining process. Computations done during construction of the Concentrator are shared and reused every time the user requests mining of association rules at the same or a higher support than that at which the Concentrator was constructed.
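The construction can be sketched in Python as follows. The row layout (a list of (tid, {item: bit}) pairs) is an illustrative choice, not the thesis's own storage format; the count decrement for dropped transactions mirrors the pseudocode so that apparent supports can be corrected later.

```python
# Sketch of make_concentrator(): one scan of D, one Concentrator row per
# transaction containing at least two frequent items. D is a list of item
# sets; L1 maps frequent items to their (apparent) counts and is updated
# in place when a transaction is dropped.

def make_concentrator(D, L1):
    concentrator = []                          # list of (tid, {item: bit}) rows
    for tid, t in enumerate(D, start=1):
        items = set(t) & set(L1)
        if len(items) >= 2:
            row = {item: (1 if item in items else 0) for item in L1}
            concentrator.append((tid, row))
        else:
            for item in items:                 # statistically insignificant row:
                L1[item] -= 1                  # reduce the apparent counts
    return concentrator
```

After the single scan, all further mining reads the returned structure rather than the original data.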
Lemma 3.1: If for any item i of the Concentrator the value of βi(1), that is, the number of 1s in the column corresponding to item i, is less than the minimum support, i.e., βi(1) < min_supp, then item i is a false frequent item.
The real support of an item i is obtained by counting the number of 1s in βi, i.e., the value of βi(1). If the value of βi(1) is less than the primary support (or the minimum support at which mining is to be performed), then item i is a false frequent item. Columns corresponding to such items are filtered out from the Concentrator, and the items are also removed from L1. The remaining items in L1 are the actual frequent items. The support of an item calculated before making the Concentrator (during Phase 1) is its apparent support; because transactions with fewer than two frequent items are dropped, the apparent support of an item is always greater than or equal to its real support.
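Lemma 3.1 can be applied as in the following sketch, assuming the Concentrator is stored as a list of (tid, {item: bit}) rows; the real supports are the column sums βi(1), and columns falling below min_supp are dropped.

```python
# Detection and filtering of false frequent items (a sketch of Lemma 3.1).

def filter_false_frequent(concentrator, L1, min_supp):
    """Drop columns with beta_i(1) < min_supp; return the real frequent items."""
    real_support = {item: sum(row[item] for _, row in concentrator) for item in L1}
    false_items = {i for i, s in real_support.items() if s < min_supp}
    for _, row in concentrator:               # filter those columns out
        for i in false_items:
            del row[i]
    return [i for i in L1 if i not in false_items]
```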
Examples for the construction of the Concentrator and detection of false frequent
items are illustrated below.
Consider the data given in Table 3.1, where each record is a <Tid, Itemset> pair. The Tid is the transaction identifier and the itemset is the list of items purchased in the transaction. Assume that the value of the primary_support as well as min_supp is 50%.
Tid Itemset
1 A, B, C, D, E, K
2 C, F, G
3 D, E, F
4 A, B, C, D, E, H
5 C, H
6 B, E, H, I
7 A, B, C, D, E, H, J
8 A, B, E, H
9 A, B, C, D, E
10 H, I, J
11 B, I, J, K
12 A, C, D, F, I
Table 3.2 shows the number of occurrences of each item present in the data.

Item   A   B   C   D   E   F   G   H   I   J   K
Count  06  07  07  06  07  03  01  06  04  03  02
Thus, the set of apparent 1-frequent items (aL1) at the assumed minimum support of 50% is as given below.

aL1 = {A, B, C, D, E, H}
Tid  βA  βB  βC  βD  βE  βH
1    1   1   1   1   1   0
3    0   0   0   1   1   0
4    1   1   1   1   1   1
5    0   0   1   0   0   1
6    0   1   0   0   1   1
7    1   1   1   1   1   1
8    1   1   0   0   1   1
9    1   1   1   1   1   0
12   1   0   1   1   0   0
In the Concentrator (Table 3.3) the count of 1s in column βH is 5, which is less than the minimum support (that is, 50%, or 6 transactions); thus item H is a false frequent item according to Lemma 3.1 and can be filtered out from aL1 without losing any frequent itemset. Thus, the set of real frequent items (rL1) at the assumed minimum support will be as given below.

rL1 = {A, B, C, D, E}
The more items there are in aL1, the higher the processing cost of finding L2. For the given data of Table 3.1, fifteen candidate 2-itemsets are generated and tested, as shown in Table 3.4, while for the same database, if C2 is generated using rL1, the total number of candidate itemsets generated and tested is ten, as shown in Table 3.5. The 2-frequent itemsets generated by both approaches are, however, the same. In all future discussion the set of 1-frequent items will be designated L1.
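The saving can be checked directly: candidate 2-itemsets are the unordered pairs of 1-frequent items, so their number is |L1| choose 2. A quick sketch (the variable names aL1 and rL1 follow the text; the sets are the ones derived above):

```python
# Candidate 2-itemset generation (C2) from a set of 1-frequent items.
from itertools import combinations

def candidate_2_itemsets(L1):
    return list(combinations(sorted(L1), 2))

aL1 = {"A", "B", "C", "D", "E", "H"}   # apparent frequent items
rL1 = {"A", "B", "C", "D", "E"}        # real frequent items, after filtering H
```

candidate_2_itemsets(aL1) yields fifteen pairs, while candidate_2_itemsets(rL1) yields only ten, as stated above.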
[Tables 3.4 and 3.5: candidate 2-itemsets with their counts; the recoverable entries CH (03), DE (05), DH (02), and EH (04) all fall below the minimum support.]
Currently available association rule mining algorithms continue to generate and test candidate itemsets until the set of frequent itemsets found in a pass becomes null. Due to this, many insignificant candidate itemsets are generated and tested, and testing them may require one more database scan.
To overcome this problem, a mechanism that can predict the maximum size of the frequent itemsets present in the database at a given minimum support is needed. The function predict_size, given below, has been developed in the present work to address this problem.
function predict_size(ξ, D, min_supp)
    γ = φ;
    i = 1;
    forall tuples t ∈ ξ do
        Ψ(i) = count(t);
        i = i + 1;
    endfor;
    Ψ = sort_descending(Ψ);
    j = (|D| * min_supp) / 100;
    γ = Ψ(j);
    return γ;
First, the number of frequent items present in each tuple t of the Concentrator (ξ) is counted by calling the function count() and stored in the i-th location of the array Ψ. The function sort_descending() is then called with the array Ψ as input; it sorts Ψ in descending order. The value of (|D| * min_supp)/100 is then computed and assigned to j. Thereafter the content of location j of the array Ψ is read and assigned to γ, which represents the largest possible size of the frequent itemset(s) present in the data at the given minimum support: a frequent itemset of size k must be contained in at least j transactions, so the size of the largest frequent itemset cannot exceed γ. Thus, by using the function predict_size(), the size γ of the largest itemset having the potential to become frequent (the worst case) can be predicted in advance.
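The function translates directly to Python; the sketch below assumes the per-row frequent item counts have already been collected, and the only change from the pseudocode is the shift from 1-based to 0-based indexing.

```python
# Predict the maximum possible size of a frequent itemset: a frequent
# k-itemset must occur in at least j = (|D| * min_supp)/100 Concentrator
# rows, so k is bounded by the j-th largest row length.

def predict_size(row_lengths, num_transactions, min_supp_percent):
    psi = sorted(row_lengths, reverse=True)              # Psi, descending
    j = int(num_transactions * min_supp_percent / 100)   # rank of the bound
    return psi[j - 1]                                    # Psi(j), 1-based in the thesis
```

For the data of Table 3.6 (|D| = 8, min_supp = 26%), the per-transaction counts (2, 2, 2, 3, 4, 2, 2, 2) give j = 2 and γ = 3.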
The working of this function has been illustrated by using the data of Table 3.6.
Tid Itemset
T1 Bread, Eggs
T2 Bread, Milk
T3 Bread, Eggs
T4 Butter, Eggs, Milk
T5 Bread, Butter, Eggs, Milk
T6 Eggs, Milk
T7 Bread, Butter
T8 Bread, Milk
The transactional data of Table 3.6 contains four distinct items, i.e., Bread, Butter, Eggs, and Milk. At the assumed minimum support of 26%, all these items are frequent items.
The frequent item count (Ψi) of each transaction is given below.

Ψ1  Ψ2  Ψ3  Ψ4  Ψ5  Ψ6  Ψ7  Ψ8
2   2   2   3   4   2   2   2

After sorting in descending order, Ψ = (4, 3, 2, 2, 2, 2, 2, 2). With |D| = 8 and min_supp = 26%, j = (8 × 26)/100 ≈ 2, so γ = Ψ(2) = 3; no frequent itemset larger than three items can exist in this data.
This phase, i.e., Association Pattern Generation, of the proposed Flexible Uniform Framework is the heart of the mining process and one of the core objectives of the present work. It has been researched exhaustively, and new algorithms have been developed for the generation of association patterns. These algorithms and other associated issues are presented in Chapter 4 of this thesis.
As in Section 3.3.6, the different algorithms developed during this research work for the generation of association rules, along with other related issues, are described in Chapter 5 of this thesis.
The following are some of the distinct features of the Flexible Uniform Framework:
[Figure: Phases of the Flexible Uniform Framework; repetitive phases are shaded.]
The non-repetitive phases perform the encoding and counting, construct the Concentrator, and determine some other statistical information. The repetitive phases (shaded) process the Concentrator2 to discover knowledge according to the parameters and constraints assigned by the end user. For every new mining schema only these phases are repeated, and the computations of the non-repetitive phases are reused. Further, the computations done during Phase 5 (Prediction of Maximum Size of Frequent Itemset) may also be reused for different tasks during the mining process, as described in Chapter 4.
It may also be mentioned that the Detection and Filtering of False Frequent Items phase is repeated only when mining is to be performed at a higher support than the primary support at which the Concentrator was constructed.
Items present in the Concentrator can easily be generalized to higher concept levels, according to the concept hierarchy defined for the items. Generalization of the items to higher concept levels requires only the scanning of the corresponding bit vectors of the items, and mining can then be performed on these generalized items. Generalization of the items is illustrated below.
________________
2 It may be processed asynchronously.
Suppose the user wants to generalize the items 'desktop computer' (say A) and 'laptop computer' (say B) to the higher concept level 'computer' (say C), according to the concept hierarchy defined for these items in Figure 3.2(a). The bit vectors βA and βB corresponding to items A and B are shown in Figure 3.2(b). These items can be generalized to the higher concept level 'computer' by computing βA ∨ βB, where '∨' is the logical OR operator. The resultant generalized bit vector βC is shown in Figure 3.2(c).
(a) Concept hierarchy: 'computer' is the parent of 'desktop computer' and 'laptop computer'.

(b), (c) Bit vectors:

βA  βB  βC
0   0   0
0   0   0
0   0   0
1   0   1
0   0   0
0   0   0
0   1   1
0   0   0

Figure 3.2: Generalization of items
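The OR-based generalization of Figure 3.2 is straightforward to code; bit vectors are modeled here as Python lists of 0/1 for illustration, though any bitset representation would serve equally well.

```python
# Generalize two items to their parent concept by OR-ing their bit vectors.

def generalize(beta_a, beta_b):
    """Bit vector of the higher-level concept: logical OR per transaction."""
    return [a | b for a, b in zip(beta_a, beta_b)]

beta_A = [0, 0, 0, 1, 0, 0, 0, 0]    # desktop computer (Figure 3.2(b))
beta_B = [0, 0, 0, 0, 0, 0, 1, 0]    # laptop computer
beta_C = generalize(beta_A, beta_B)  # computer (Figure 3.2(c))
```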
Further, due to the structure of the Concentrator it is easy to add or delete tuples (rows) or attributes (columns) in the existing Concentrator; thus association rules can be maintained for updated data. This aspect is dealt with in Chapter 4 of this thesis.
The user may also impose Boolean constraints on the items as per his/her needs, and these constraints may be integrated with the mining algorithm. When mining is performed from scratch, the integration of constraints with the mining algorithm will improve the efficiency of the mining process.
It is reiterated that in the proposed framework the mining algorithms operate on the Concentrator instead of the original data. Thus the knowledge discovery process can be undertaken without any interference with day-to-day database operations.
A new model for the KDD process, called the AR-Model, is proposed in Figure 3.3. This model is downward compatible with, and provides the full functionality of, the traditional KDD process model presented in Appendix A of this thesis.
Step AR1 of Figure 3.3 corresponds to step S1 of the traditional KDD process model. During this step the discovery goals are specified in terms of a Knowledge Discovery Schema (KDS). The schema is compiled and the resulting meta-data is stored for future reference.
The next step, AR2, is Pre-processing. This is a compound step containing sub-steps (AR2a, AR2b, and AR2c), which correspond to steps S2–S4 of the traditional KDD process model. The outcome of this step is the Concentrator (discussed in Section 3.4.3).
Step AR3 is novel and responsible for partial data analysis. The outcome of this analysis is used to select the proper mining algorithm(s) during mining. The computation done to select the mining algorithm is also used to reduce the number of candidate itemsets generated and tested, as well as the number of scans and the size of the scanned data in the Concentrator.
Step AR4 signifies the initiation of the mining phase. The user initiates this step asynchronously; it commences with either the formulation of a mining query or the execution of an application. The query or application is vetted against the corresponding schema and compiled during this step.
Step AR5, the Data Mining step, corresponds to S5 of the traditional KDD process model. During this step suitable association rule mining algorithms are selected on the basis of the information obtained during Step AR3. The resulting knowledge is presented and interpreted during steps AR6 and AR7, which correspond to steps S6 and S7 of the traditional KDD process model.
[Figure 3.3: The AR-Model for the KDD process. AR1: Domain Understanding and Application Requirements; AR2: Pre-processing (Data Selection, Data Cleaning, Data Transformation); AR3: Partial Data Analysis; AR4: Mining Query; AR5: Data Mining; AR6: Knowledge Presentation; AR7: Interpretation. Solid arrows denote data flow; dashed arrows denote optional repetition.]
The proposed model for mining association rules inherently supports experimentation and monitoring. These functionalities are desirable in business and scientific data explorations. Development of the Concentrator provides flexibility and promotes experimentation and monitoring of desired subsets of the database.
An interesting contrast between the two models is that in the traditional KDD process the functionality is defined at the beginning of the process by the end user, while in the AR-Model it is decided dynamically by the end user at the time of actual mining of the data.