
International Journal of Computer Science and Management Research Vol 1 Issue 2 September 2012 ISSN 2278-733X

SURVEY ON AIS, APRIORI AND FP-TREE ALGORITHMS


R. Divya* S. Vinod Kumar**
*Research Scholar in Computer Science, Sree Saraswathi Thyagaraja College, Pollachi, Tamil Nadu. **Assistant Professor, PG Department of Computer Science, Sree Saraswathi Thyagaraja College, Pollachi, Tamil Nadu.

Abstract---Several association rule mining algorithms have been proposed to generate association rules from a given set of data. Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. This paper presents a survey of three association rule mining algorithms, AIS, Apriori and FP-Tree, and of their drawbacks, which would be helpful in finding new solutions for the problems found in these algorithms.

Keywords---Association rule, if/then statements, relationships, relational database, Apriori.

1. INTRODUCTION
The term data mining, or knowledge discovery in databases, has been adopted for a field of research dealing with the automatic discovery of implicit information or knowledge within databases. The implicit information within databases, mainly the interesting association relationships among sets of objects that lead to association rules, may disclose useful patterns for decision support, financial forecasting, marketing policies, even medical diagnosis, and many other applications.

Various data mining techniques are applied to the data source, and different knowledge comes out as the mining result. That knowledge is evaluated against certain rules, such as domain knowledge or concepts. Among these mining techniques, association rule mining, one of the most important and well-researched techniques of data mining, was first introduced in [Agrawal et al. 1993]. It aims to extract interesting correlations, frequent patterns, associations or causal structures among sets of items in transaction databases or other data repositories. Association rules are widely used in various areas such as telecommunication networks, market and risk management, and inventory control.

Fig: Knowledge Discovery in Database processes [1]. (Components of the figure: Graphical User Interface, Pattern Evaluation, Data Mining Tools, Knowledge Base, Data Cleaning & Integration, and the data repositories: Database, Data Warehouse, Other Repositories.)

Association rule mining is to find the association rules that satisfy a predefined minimum support and confidence in a given database. The problem

R.Divya et.al.

194

www.ijcsmr.org

is usually decomposed into two sub-problems. The first is to find those item sets whose occurrence in the database exceeds a predefined threshold; such item sets are called frequent or large item sets. The second is to generate association rules from those large item sets under the constraint of minimal confidence.

II. ASSOCIATION RULES
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository. An association rule has two parts: an antecedent, which represents the "if" part, and a consequent, which represents the "then" part. An antecedent is an item found in the data; a consequent is an item found in combination with the antecedent.

Support (S) of an association rule X → Y is defined as the percentage/fraction of records that contain X ∪ Y out of the total number of records in the database. If the support of an item is 0.1%, only 0.1 percent of the transactions contain a purchase of this item.

    Support (X → Y) = support count of (X ∪ Y) / total number of transactions in D

Confidence (C) of an association rule X → Y is defined as the percentage/fraction of transactions that contain X ∪ Y out of the total number of records that contain X. Confidence is a measure of the strength of an association rule: if the confidence of the rule X → Y is 80%, then 80% of the transactions that contain X also contain Y.

    Confidence (X → Y) = Support (X ∪ Y) / Support (X)

Fig: Association rule [9]

Association rule mining uses these two criteria, support and confidence, to identify relationships, and rules are generated by analysing the data for frequent if/then patterns. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is therefore split into two separate steps: first, minimum support is applied to find all frequent item sets in the database; second, these frequent item sets and the minimum confidence constraint are used to form rules. While the second step is straightforward, the first needs more attention. In general, an association rule mining algorithm contains the following steps: the set of candidate k-item sets is generated by 1-extensions of the large (k-1)-item sets generated in the previous iteration; supports for the candidate k-item sets are counted by a pass over the database; item sets that do not reach the minimum support are discarded, and the remaining item sets are called large k-item sets.

Various algorithms used for association rule mining
Several algorithms have been proposed for mining association rules. This paper presents three of them, which face a common drawback.

III. AIS ALGORITHM
The AIS (Agrawal, Imielinski, Swami) algorithm was the first algorithm proposed for mining association rules [2]. It focuses on improving the quality of databases, together with the necessary functionality to process decision support queries. In
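As an illustration of the two measures defined above, the following is a minimal sketch (my own code, not from the paper) that computes support and confidence of the hypothetical rule {I2} → {I3} over the ten-transaction example database used later in this survey:

```python
# Example database from the survey (ten transactions over items I1..I6).
transactions = [
    {"I1", "I2"},
    {"I2", "I3", "I4", "I6"},
    {"I2", "I3", "I6"},
    {"I1", "I2", "I4"},
    {"I1", "I3", "I5"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I6"},
    {"I1", "I2", "I3"},
    {"I1", "I2", "I5", "I6"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Support(X ∪ Y) / Support(X) for the rule X → Y."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"I2", "I3"}, transactions))       # 0.5   (5 of 10 transactions)
print(confidence({"I2"}, {"I3"}, transactions))  # 0.625 (5 of the 8 with I2)
```

So {I2, I3} has support 50%, and the rule {I2} → {I3} has confidence 62.5%: of the eight transactions containing I2, five also contain I3.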

this algorithm, only one-item-consequent association rules are generated; that is, the consequent of each rule contains only one item. For example, rules like X ∪ Y → Z are generated, but not rules like X → Y ∪ Z. In AIS, the database is scanned many times to obtain the frequent item sets. To make the algorithm more efficient, an estimation method was introduced to prune candidate item sets that have no hope of being large, so the unnecessary effort of counting them can be avoided. Since all candidate item sets and frequent item sets are assumed to be stored in main memory, memory management is also proposed for AIS for when memory is not enough. One approach is to delete candidate item sets that have never been extended. Another is to delete candidate item sets that have the maximal number of items together with their siblings, and to store their parent item sets on disk as seeds for the next pass.

A. AIS Algorithm
The steps followed in the AIS algorithm are [9]:
(1) Candidate item sets are generated and counted on-the-fly as the database is scanned.
(2) For each transaction, it is determined which of the large item sets of the previous pass are contained in this transaction.
(3) New candidate item sets are generated by extending these large item sets with other items in this transaction.

AIS mining process

(i) Original database
TID    ITEMS
T01    I1, I2
T02    I2, I3, I4, I6
T03    I2, I3, I6
T04    I1, I2, I4
T05    I1, I3, I5
T06    I2, I3
T07    I1, I3
T08    I1, I2, I3, I6
T09    I1, I2, I3
T10    I1, I2, I5, I6

(ii) C1
ITEM    COUNT
I1      7
I2      8
I3      7
I4      2
I5      2
I6      4

(iii) L1
LARGE 1-ITEMS: I1, I2, I3, I6

(iv) C2
ITEMS     COUNT
I1, I2    5
I1, I3    4
I2, I3    5
I2, I4    2
I2, I5    1
I2, I6    4

(v) L2
LARGE 2-ITEMS: (I1, I2), (I1, I3), (I2, I3), (I2, I6)

(vi) C3
ITEMS         COUNT
I1, I2, I3    2
I1, I2, I5    1
I2, I3, I4    1
I2, I3, I6    3
I1, I2, I6    2
I1, I2, I4    1

In AIS [5], the frequent item sets are generated by scanning the database several times. The support count of each individual item is accumulated during the first pass over the database; based on the minimum support count, items whose support count is less than the minimum are eliminated from the list of items. Candidate 2-itemsets are then generated by extending the frequent 1-itemsets with other items in the same transaction. During the second pass over the database, the support counts of these candidate 2-itemsets are accumulated and checked against the support threshold. Similarly, candidate (k+1)-item sets are generated by extending frequent k-item sets with items in the same transaction. The candidate generation and frequent item set generation processes iterate until either becomes empty. The resulting frequent
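A single AIS-style pass can be sketched as follows. This is my own simplified illustration, not the original algorithm's pseudocode: candidates are generated and counted on-the-fly during the scan by 1-extending each large itemset found in a transaction, and the ordering rule (`item > max(itemset)`) used to avoid duplicate candidates is an assumption of this sketch. The minimum support count of 4 is also assumed, since it reproduces the L2 table above.

```python
# Hedged sketch of one AIS pass: extend each large (k-1)-itemset found
# in a transaction with later items of that transaction, counting the
# resulting candidates while scanning.
from collections import defaultdict

def ais_pass(db, large_prev, min_count):
    counts = defaultdict(int)
    for t in db:
        for itemset in large_prev:
            if itemset <= t:                     # itemset occurs in t
                for item in t - itemset:
                    if item > max(itemset):      # 1-extension, no duplicates
                        counts[frozenset(itemset | {item})] += 1
    return {c for c, n in counts.items() if n >= min_count}

db = [frozenset(t) for t in [
    {"I1", "I2"}, {"I2", "I3", "I4", "I6"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3", "I5"}, {"I2", "I3"}, {"I1", "I3"},
    {"I1", "I2", "I3", "I6"}, {"I1", "I2", "I3"}, {"I1", "I2", "I5", "I6"},
]]
L1 = {frozenset({i}) for i in ("I1", "I2", "I3", "I6")}  # from the first pass
L2 = ais_pass(db, L1, min_count=4)
print(sorted(sorted(s) for s in L2))
# [['I1', 'I2'], ['I1', 'I3'], ['I2', 'I3'], ['I2', 'I6']]
```

The result matches the L2 table: (I1, I2), (I1, I3), (I2, I3) and (I2, I6) survive the threshold.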

item sets include only one large 3-itemset: (I2, I3, I6) [1].

B. Drawbacks
The main drawbacks of the AIS algorithm are:
(1) Too many candidate item sets that finally turn out to be small are generated and counted unnecessarily, which requires more space and wastes effort that turns out to be useless.
(2) The algorithm requires too many passes over the whole database.

IV. APRIORI ALGORITHM
The Apriori algorithm was first proposed by Agrawal in [3]. Apriori is more efficient during the candidate generation process [2]. It uses a breadth-first search strategy [8] to count the supports of item sets, and a candidate generation function that exploits the downward closure property of support. Apriori uses pruning techniques to avoid measuring certain item sets while guaranteeing completeness; these are the item sets that the algorithm can prove will not turn out to be large. The first pass of the algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases: first, the large item sets Lk-1 found in the (k-1)th pass are used to generate the candidate item sets Ck, using the Apriori-gen function; next, the database is scanned and the support of the candidates in Ck is counted. Apriori is designed to operate on databases containing transactions; other algorithms are designed for finding association rules in data having no transactions or no timestamps. The Apriori principle [7] states that if an item set is frequent, then all of its subsets must also be frequent; equivalently, a superset of an item set X can never be large if X itself is not large.
Based on this principle, the Apriori algorithm generates the set of candidate large item sets of length (k+1) from the large k-item sets (for k ≥ 1) and eliminates those candidates that contain a subset which is not large. Then, among the remaining candidates, only those with support above the minimum support threshold are taken as large (k+1)-item sets. Apriori generates item sets using only the large item sets found in the previous pass, without considering the transactions. The algorithm takes advantage of the fact that any subset of a frequent item set is also frequent, and can therefore reduce the number of candidates being considered by exploring only the item sets whose support count is greater than the minimum support count; any item set with an infrequent subset can be pruned. In the process of finding frequent item sets, Apriori thus avoids wasting effort counting candidate item sets that are known to be infrequent. Candidates are generated by joining the frequent item sets level-wise, and are pruned according to the Apriori property. As a result, the number of remaining candidate item sets ready for further support checking becomes much smaller, which dramatically reduces the computation, I/O cost and memory requirement.

A. Apriori Algorithm
The steps involved in the Apriori algorithm are [9]:
(1) Candidate item sets are generated using only the large item sets of the previous pass, without considering the transactions in the database.
(2) The large item set of the previous pass is joined with itself to generate all item sets whose size is larger by one.
(3) Each generated item set that has a subset which is not large is deleted; the remaining item sets are the candidates.
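The join-and-prune steps (2) and (3) above can be sketched roughly as follows (my own simplified version, not the paper's code). Note that strict subset pruning also removes (I1, I2, I6), since its subset (I1, I6) is not a large 2-itemset:

```python
# Sketch of Apriori-gen: join L(k-1) with itself, then prune every
# candidate that has an infrequent (k-1)-subset.
from itertools import combinations

def apriori_gen(large_prev):
    """Generate candidate k-itemsets from the large (k-1)-itemsets."""
    k = len(next(iter(large_prev))) + 1
    # Join step: unions of pairs that differ in exactly one item.
    candidates = {a | b for a in large_prev for b in large_prev
                  if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be large.
    return {c for c in candidates
            if all(frozenset(s) in large_prev for s in combinations(c, k - 1))}

L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"),
                             ("I2", "I3"), ("I2", "I6")]}
print(apriori_gen(L2))   # only {I1, I2, I3} survives the subset check
```

The join produces (I1, I2, I3), (I1, I2, I6) and (I2, I3, I6); pruning on infrequent subsets leaves only (I1, I2, I3).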

(i) C1
ITEM    COUNT
I1      7
I2      8
I3      7
I4      2
I5      2
I6      4

(ii) L1
LARGE 1-ITEMS: I1, I2, I3, I6
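The C1 counts above can be checked with one pass over the example database. This is my own sketch; the minimum support count of 4 is an assumption that reproduces the L1 table:

```python
# One scan of the example database: count item occurrences (C1), then
# keep the items at or above the assumed minimum support count (L1).
from collections import Counter

db = [
    {"I1", "I2"}, {"I2", "I3", "I4", "I6"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3", "I5"}, {"I2", "I3"}, {"I1", "I3"},
    {"I1", "I2", "I3", "I6"}, {"I1", "I2", "I3"}, {"I1", "I2", "I5", "I6"},
]
c1 = Counter(item for t in db for item in t)
print(sorted(c1.items()))
# [('I1', 7), ('I2', 8), ('I3', 7), ('I4', 2), ('I5', 2), ('I6', 4)]

min_count = 4                      # assumed threshold
L1 = sorted(i for i, n in c1.items() if n >= min_count)
print(L1)                          # ['I1', 'I2', 'I3', 'I6']
```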

(iii) C2
ITEMS     COUNT
I1, I2    5
I1, I3    4
I2, I3    5
I2, I6    4
I2, I4    2
I2, I5    1

(iv) L2
LARGE 2-ITEMS: (I1, I2), (I1, I3), (I2, I3), (I2, I6)

(v) C3
ITEMS         COUNT
I1, I2, I3    2
I1, I2, I6    2

Apriori mining process
The Apriori algorithm involves two processes for finding large item sets from the same original database used in Section III. First, after the candidate item sets are generated, the database is scanned to check their support counts. The support count of each item is calculated during the first scan of the database, and the item sets whose supports are below the predefined threshold are pruned to generate the large 1-item sets. In every pass, the candidate item sets that include the same specified number of items are generated and checked. The candidate k-item sets are generated after the (k-1)th pass over the database by joining the frequent (k-1)-item sets. The Apriori property says that every (k-1)-sub-item set of a frequent k-item set must be frequent. To generate the 3-itemsets, the frequent 2-itemsets are joined to obtain the candidate 3-itemsets, which include (I1, I2, I3), (I1, I2, I6) and (I2, I3, I6). These item sets are then checked against their sub-item sets; since (I3, I6) is not a frequent 2-itemset, the last item set is eliminated from the list of candidate 3-itemsets. This process is repeated until the candidate item sets become empty.

B. Drawbacks
The main drawbacks of the Apriori algorithm are:
(1) It takes more time, space and memory for the candidate generation process.
(2) It requires multiple scans over the database to generate the candidate sets.

V. FP-TREE ALGORITHM
FP-Tree [4] frequent pattern mining is used in the development of association rule mining. The FP-Tree algorithm overcomes the problems found in the Apriori algorithm: the frequent item set generation process requires only two passes over the database, and there is no need for a candidate generation process. By avoiding candidate generation and making fewer passes over the database, FP-Tree is found to be faster than the Apriori algorithm. An FP-Tree is a prefix tree for transactions [11]. Every node in the tree represents one item, and each path represents the set of transactions that involve the particular item. All nodes referring to the same item are linked together in a list, so that all the transactions containing the same item can be easily found and counted.

A. FP-Tree Algorithm
The FP-Tree algorithm generates frequent patterns through a process that includes two sub-processes: constructing the FP-Tree, and generating frequent patterns from the FP-Tree.
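The tree-construction sub-process can be sketched roughly as follows. This is my own simplified illustration, not the authors' pseudocode: the minimum support count of 4 is assumed from the example tables, and ties in item frequency are broken by item name, which reproduces the transformed transaction order shown later (I2 before I1 before I3 before I6).

```python
# Simplified FP-Tree construction: keep only frequent items, order each
# transaction by descending frequency, and collapse shared prefixes into
# shared paths with incremented counts.
from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(db, min_count):
    freq = Counter(i for t in db for i in t)
    order = {i: n for i, n in freq.items() if n >= min_count}
    root = FPNode(None, None)
    header = defaultdict(list)           # item -> node list ("head table" links)
    for t in db:
        # frequent items only, most frequent first (ties broken by name)
        path = sorted((i for i in t if i in order),
                      key=lambda i: (-order[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header

db = [
    {"I1", "I2"}, {"I2", "I3", "I4", "I6"}, {"I2", "I3", "I6"},
    {"I1", "I2", "I4"}, {"I1", "I3", "I5"}, {"I2", "I3"}, {"I1", "I3"},
    {"I1", "I2", "I3", "I6"}, {"I1", "I2", "I3"}, {"I1", "I2", "I5", "I6"},
]
root, header = build_fp_tree(db, min_count=4)
print(root.children["I2"].count)   # 8: all transactions containing I2 share this prefix
```

The eight transactions containing I2 (the most frequent item) all start their paths at the same child of the root, illustrating the compression the algorithm relies on.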

The process of constructing the FP-Tree is as follows [10]:
(1) The database is scanned for the first time; during this scan, the support count of each item is collected. The frequent 1-item sets are generated, as in the Apriori algorithm, and are sorted in descending order of their supports. The head table of the ordered frequent 1-item sets is also created.
(2) The root node of the FP-Tree T is created with the label "Root". The database is scanned again to construct the FP-Tree with the head table; for each transaction, the frequent items are re-sorted according to the head table.
(3) The function Insert([p|P], T) works as follows: if T has a child N such that N.item-name = p.item-name, then the count of N is increased by 1; otherwise a new node N is created with N.item-name = p.item-name and a support count of 1, its parent link is set to T, and its node link is linked to the node with the same item-name via a sub-link. Insert(P, T) is called recursively until P becomes empty.

(ii) L1
ITEM    COUNT
I1      7
I2      8
I3      7
I6      4

(iii) Transformed database
TID    ITEMS
T01    I2, I1
T02    I2, I3, I6
T03    I2, I3, I6
T04    I2, I1
T05    I1, I3
T06    I2, I3
T07    I1, I3
T08    I2, I3, I6

The efficiency of the FP-Tree algorithm is accounted for by three reasons [1]:
(1) The FP-Tree is a compressed representation of the original database, because only the frequent items are used to construct the tree and other irrelevant information is pruned. By ordering the items according to their supports, the overlapping parts appear only once, with different support counts.
(2) The algorithm scans the database only twice. The frequent patterns are generated by the FP-growth procedure, which constructs conditional FP-Trees containing the patterns with a specified suffix, so the frequent patterns can be found easily and the computation cost is decreased dramatically.
(3) FP-Tree uses a divide-and-conquer method that considerably reduces the size of the subsequent conditional FP-Trees; longer frequent patterns are generated by adding suffixes to the shorter frequent patterns.

B. Drawbacks
The disadvantages of the FP-Tree algorithm are:
(1) It requires multiple scans over the database for the construction of the FP-Tree.
(2) The whole mining process must be repeated whenever the support value is changed, as well as when a new dataset is inserted into the database.

Conclusion
Various algorithms have been proposed for mining association rules, but every algorithm shares a common drawback: multiple scans over the database. This drawback can be overcome by introducing a new technique, the transaction pattern base, which is found to be efficient for searching frequent patterns in the database.

References
[1] Association Rule Mining: A Survey. http://sci2s.ugr.es/keel/pdf/specific/report/zhao03ars.pdf
[2] Agrawal et al. 1993.
[3] Agrawal, R. and Srikant, R. 1995. Mining sequential patterns. P. S. Yu and A. S. P. Chen, Eds. IEEE Computer Society Press, Taipei, Taiwan, 3-14.
[4] Han et al. 2000.
[5] Han, J. and Kamber, M. 2000. Data Mining: Concepts and Techniques. Morgan Kaufmann.
[6] Agrawal, R., Imielinski, T., and Swami, A. N. 1993. Mining association rules between sets of items in large databases.

[7] Association Analysis: Basic Concepts and Algorithms. http://www.users.cs.umn.edu/~kumar/dmbook/ch6.pdf
[8] Agrawal, Rakesh and Srikant, Ramakrishnan. Fast Algorithms for Mining Association Rules in Large Databases.
[9] http://chemeng.utoronto.ca/~datamining/dmc/association_rules.htm
[10] http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm#FP-Tree_structure
[11] http://software.intel.com/en-s/articles/Multicore-enabling-FP-tree-Algorithm-for-Frequent-Pattern-Mining
