
CHAPTER 5

CONCEPT DESCRIPTION AND ASSOCIATION RULE MINING

SUBJECT: DMBI          CODE: 2170715
PREPARED BY: ASST. PROF. NENSI KANSAGARA (CSE DEPARTMENT, ACET)
WHAT IS CONCEPT DESCRIPTION?
❖ Descriptive vs. predictive data mining
➢ Descriptive mining: describes concepts or task-relevant data sets in concise,
summarized, informative, and discriminative forms.
➢ Predictive mining: based on data and analysis, constructs models for the
database and predicts the trends and properties of unknown data.
❖ Concept description:
➢ Characterization: provides a concise and succinct summarization of the
given collection of data
➢ Comparison: provides descriptions comparing two or more collections of
data
WHAT IS CONCEPT DESCRIPTION?
❖ A concept usually refers to a collection of data such as
frequent_buyers, graduate_students, etc.
❖ As a data mining task, concept description is not a simple
enumeration (a one-by-one listing) of the data.
❖ Concept description generates descriptions for the characterization and
comparison of the data; it is also called class description.
WHAT IS CONCEPT DESCRIPTION?
❖ Characterization provides a concise and brief summarization of the data.
❖ Concept or class comparison (also known as discrimination) provides descriptions
that compare and distinguish two or more collections of data.
❖ Example
➢ Given the ABC Company database, for example, instead of examining individual customer
transactions,
➢ sales managers may prefer to view the data generalized to higher levels, such as
summarized by customer group according to geographic region, frequency of
purchases per group, and customer income.
DATA GENERALIZATION AND
SUMMARIZATION
❖ Data generalization
➢ A process that abstracts a large set of task-relevant data in a
database from low conceptual levels to higher ones.
❖ Approaches:
➢ Data cube approach (OLAP approach)
➢ Attribute-oriented induction approach
DATA GENERALIZATION
PRESENTATION OF GENERALIZED RESULTS
❖ Generalized relation:
➢ Relations where some or all attributes are generalized, with counts or other
aggregation values accumulated.
❖ Cross tabulation:
➢ Mapping results into cross tabulation form (similar to contingency tables).
❖ Visualization techniques:
➢ Pie charts, bar charts, curves, cubes, and other visual forms.
❖ Quantitative characteristic rules:
➢ Mapping generalized results into characteristic rules with the quantitative
information associated with them.
ATTRIBUTE RELEVANCE
❖ Why?
➢ Which dimensions should be included?
➢ How high a level of generalization?
➢ Automatic vs. interactive
➢ Reducing the number of attributes makes patterns easier to understand
❖ What?
➢ a statistical method for preprocessing data
■ filter out irrelevant or weakly relevant attributes
■ retain or rank the relevant attributes
➢ relevance is related to dimensions and levels
➢ analytical characterization, analytical comparison
❖ How?
❖ Data Collection
❖ Preliminary Relevance Analysis
➢ Perform AOI to remove or generalize attributes.
❖ Relevance Analysis
➢ Use a relevance analysis measure, e.g. information gain, to identify highly
relevant dimensions and levels.
➢ Sort and select the most relevant dimensions and levels.
❖ Attribute-oriented Induction for class description
➢ on the selected dimensions/levels
❖ OLAP operations (e.g. drilling, slicing) on the relevance rules
CLASS COMPARISON
ASSOCIATION RULE MINING
MARKET BASKET ANALYSIS

❖ Market Basket Analysis is a modelling technique based upon the theory that if
you buy a certain group of items, you are more (or less) likely to buy another
group of items. For example, if you are in an English pub and you buy a pint of
beer and don't buy a bar meal, you are more likely to buy crisps (US: chips) at the
same time than somebody who didn't buy beer.
❖ The set of items a customer buys is referred to as an itemset, and market basket
analysis seeks to find relationships between purchases.
❖ Typically the relationship will be in the form of a rule:
➢ IF {beer, no bar meal} THEN {crisps}.
MARKET BASKET ANALYSIS
❖ The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is
true) is referred to as the support for the rule. The conditional probability that a customer
who buys beer without a bar meal will also purchase crisps is referred to as the confidence;
a minimal computation of both measures is sketched after this list.
❖ The algorithms for performing market basket analysis are fairly straightforward (Berry and
Linoff is a reasonable introductory resource for this). The complexities mainly arise in
exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock
10,000 or more line items), and dealing with the large amounts of transaction data that may
be available.
❖ A major difficulty is that a large number of the rules found may be trivial for anyone
familiar with the business. Although the volume of data has been reduced, we are still
asking the user to find a needle in a haystack. Requiring rules to have a high minimum
support level and a high confidence level risks missing any exploitable result we might
have found.
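To make those two measures concrete, here is a minimal sketch in plain Python; the transactions and item names are hypothetical, invented only for illustration:

```python
# Minimal sketch: support and confidence for the rule
# IF {beer, no bar meal} THEN {crisps}, over hypothetical transactions.

transactions = [
    {"beer", "crisps"},
    {"beer", "bar meal"},
    {"beer", "crisps", "nuts"},
    {"bar meal", "crisps"},
    {"beer"},
]

# Antecedent: transactions with beer and without a bar meal.
antecedent = [t for t in transactions if "beer" in t and "bar meal" not in t]

# Support of the rule (as defined above): P(antecedent).
support = len(antecedent) / len(transactions)

# Confidence: P(crisps | antecedent).
confidence = sum("crisps" in t for t in antecedent) / len(antecedent)

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
# -> support = 0.60, confidence = 0.67
```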
MARKET BASKET ANALYSIS
EXAMPLE OF MARKET BASKET ANALYSIS
ASSOCIATION RULE MINING
❖ Association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large databases. It is
intended to identify strong rules discovered in databases using some
measures of interestingness.
ASSOCIATION RULE MINING
❖ Learning of association rules is used to find relationships between attributes in
large databases. An association rule, A => B, is of the form "for a set of
transactions, some value of itemset A determines the values of itemset B under the
condition that minimum support and confidence are met".
❖ Support and confidence can be illustrated by the following example rule:

➢ bread => butter [support = 2%, confidence = 60%]

❖ The above statement is an example of an association rule. It means that 2% of all
transactions contain bread and butter together, and 60% of the customers who bought
bread also bought butter. The sketch below shows how these two figures arise.
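A rough sketch of the computation behind those figures; the transaction counts below are hypothetical, chosen only to reproduce the 2% / 60% values:

```python
# Hypothetical counts chosen to reproduce the figures above.
total_transactions = 1500
bread = 50              # transactions containing bread
bread_and_butter = 30   # transactions containing bread AND butter

support = bread_and_butter / total_transactions   # P(bread and butter)
confidence = bread_and_butter / bread             # P(butter | bread)

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# -> support = 2%, confidence = 60%
```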
ASSOCIATION RULE MINING

Association rule mining consists of two steps:

1. Find all the frequent itemsets.
2. Generate association rules from the above frequent itemsets.
EXAMPLE OF ASSOCIATION RULE
FINDING FREQUENT ITEM SETS:
APRIORI ALGORITHM
❖ With the quick growth of e-commerce applications, vast quantities of data accumulate in
months, not years. Data mining, also known as Knowledge Discovery in Databases (KDD),
is used to find anomalies, correlations, patterns, and trends to predict outcomes.
❖ The Apriori algorithm is a classical algorithm in data mining. It is used for mining frequent
itemsets and the relevant association rules. It is devised to operate on a database containing
a lot of transactions, for instance, items bought by customers in a store.
❖ It is very important for effective market basket analysis, and it helps customers purchase
their items with more ease, which increases the sales of the markets. It has also been used
in the field of healthcare for the detection of adverse drug reactions (ADRs): it produces
association rules that indicate which combinations of medications and patient
characteristics lead to ADRs.


WHAT IS AN ITEMSET?
❖ A set of items together is called an itemset. An itemset that contains k items
is called a k-itemset. An itemset that occurs frequently is called a frequent
itemset. Thus frequent itemset mining is a data mining technique for
identifying the items that often occur together.
➢ For example: bread and butter, laptop and antivirus software,
etc.
WHAT IS A FREQUENT ITEMSET?
❖ A set of items is called frequent if it satisfies a minimum threshold value for
support and confidence. Support indicates how frequently the items are purchased
together, i.e. the fraction of transactions containing the itemset. Confidence indicates
how often a rule holds, i.e. the fraction of transactions containing the antecedent
that also contain the consequent.
❖ For the frequent itemset mining method, we consider only those transactions which
meet the minimum threshold support and confidence requirements. Insights from these
mining algorithms offer many benefits, including cost-cutting and improved competitive
advantage.
❖ There is a tradeoff between the time taken to mine the data and the volume of data for
frequent mining. A good frequent mining algorithm mines the hidden itemset patterns
within a short time and with low memory consumption.
FREQUENT PATTERN MINING
❖ The frequent pattern mining algorithm is one of the most important techniques of data
mining for discovering relationships between different items in a dataset. These relationships
are represented in the form of association rules. It helps to find the regularities in data.
❖ FPM has many applications in the field of data analysis, software bug detection, cross-marketing,
sales campaign analysis, market basket analysis, etc.
❖ Frequent itemsets discovered through Apriori have many applications in data mining
tasks. Tasks such as finding interesting patterns in the database, finding sequences,
and mining association rules are the most important of them.
❖ Association rules apply to supermarket transaction data, that is, to examine the customer
behavior in terms of the purchased products. Association rules describe how often the
items are purchased together.
WHY FREQUENT ITEMSET MINING?
❖ Frequent itemset or pattern mining is broadly used because of its wide applications in mining
association rules, correlations, and graph-pattern constraints based on frequent patterns,
sequential patterns, and many other data mining tasks.
❖ The Apriori property says:
➢ If P(I) < minimum support threshold, then I is not frequent.
➢ If P(I + A) < minimum support threshold, then I + A is also not frequent, where A also
belongs to the itemset.
➢ If an itemset has a support value less than the minimum support, then all of its supersets
will also fall below minimum support and can thus be ignored. This property is called the
antimonotone property; a sketch of the pruning it enables follows below.
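A minimal sketch of how the antimonotone property is used for pruning; the function and variable names are illustrative, not from any particular library:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of the k-itemset `candidate` is not
    frequent; by the antimonotone property the candidate itself can
    then be pruned without ever counting its support."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))

# {A,B} and {B,C} are frequent, but {A,C} is not, so the candidate
# {A,B,C} is pruned immediately.
frequent_2 = {frozenset({"A", "B"}), frozenset({"B", "C"})}
print(has_infrequent_subset(frozenset({"A", "B", "C"}), frequent_2))  # True
```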
STEPS TO FOLLOW IN APRIORI ALGORITHM
1. Join Step: This step generates (k+1)-itemset candidates from the frequent
k-itemsets by joining them with each other.
2. Prune Step: This step scans the database and counts each candidate itemset.
If a candidate does not meet the minimum support, it is regarded as
infrequent and is removed. This step is performed to reduce the size of the
candidate itemsets. Both steps are sketched in code below.
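Both steps can be sketched in a few lines of Python; itemsets are represented as frozensets and min_sup is an absolute support count (an illustrative sketch, not the canonical formulation):

```python
def join_step(frequent_k, k):
    """Generate (k+1)-itemset candidates by joining the frequent
    k-itemsets with each other."""
    return {a | b for a in frequent_k for b in frequent_k
            if len(a | b) == k + 1}

def prune_step(candidates, transactions, min_sup):
    """Count each candidate in the database and drop those whose
    count falls below the minimum support."""
    return {c for c in candidates
            if sum(c <= t for t in transactions) >= min_sup}

# Tiny usage example with three transactions and min_sup = 2.
db = [frozenset("ABC"), frozenset("AB"), frozenset("BC")]
f1 = prune_step({frozenset(i) for t in db for i in t}, db, min_sup=2)
print(join_step(f1, 1))  # candidate 2-itemsets
```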
STEPS IN APRIORI ALGORITHM
The Apriori algorithm is a sequence of steps to be followed to find the most frequent itemsets in a given
database. This data mining technique follows the join and the prune steps iteratively until the most
frequent itemset is achieved. A minimum support threshold is given in the problem or is assumed by the
user.

#1) In the first iteration of the algorithm, each item is taken as a 1-itemset candidate. The algorithm
counts the occurrences of each item.

#2) Let there be some minimum support, min_sup (e.g. 2). The set of 1-itemsets whose occurrence
satisfies min_sup is determined. Only those candidates whose count is greater than or equal to min_sup
are taken ahead to the next iteration; the others are pruned.

#3) Next, 2-itemset frequent items with min_sup are discovered. For this, in the join step, the 2-itemset
candidates are generated by joining the frequent 1-itemsets with each other.
STEPS IN APRIORI ALGORITHM
#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the table contains
only the 2-itemsets with min_sup.

#5) The next iteration will form 3-itemsets using the join and prune steps. This iteration follows the
antimonotone property: the 2-itemset subsets of each candidate 3-itemset must all meet min_sup. If all
2-itemset subsets are frequent, then the candidate superset is kept; otherwise it is pruned.

#6) The next step forms 4-itemsets by joining the 3-itemsets with each other and pruning any candidate
whose subsets do not meet the min_sup criteria. The algorithm stops when the most frequent itemset is
achieved.
Example of Apriori: Support threshold=50%, Confidence= 60%
6. Generate Association Rules: From the frequent itemset discovered above, the association rules could be:

{I1, I2} => {I3}    Confidence = support{I1, I2, I3} / support{I1, I2} = (3/4) * 100 = 75%

{I1, I3} => {I2}    Confidence = support{I1, I2, I3} / support{I1, I3} = (3/3) * 100 = 100%

{I2, I3} => {I1}    Confidence = support{I1, I2, I3} / support{I2, I3} = (3/4) * 100 = 75%

{I1} => {I2, I3}    Confidence = support{I1, I2, I3} / support{I1} = (3/4) * 100 = 75%

{I2} => {I1, I3}    Confidence = support{I1, I2, I3} / support{I2} = (3/5) * 100 = 60%

{I3} => {I1, I2}    Confidence = support{I1, I2, I3} / support{I3} = (3/4) * 100 = 75%

This shows that all of the above association rules are strong if the minimum confidence threshold is
60%. These calculations can be reproduced mechanically, as sketched below.
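The six confidence values above can be reproduced from the support counts of the worked example; a short sketch (support counts copied from the example):

```python
from itertools import combinations

# Support counts taken from the worked example above.
support = {
    frozenset({"I1"}): 4, frozenset({"I2"}): 5, frozenset({"I3"}): 4,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 3,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 3,
}

itemset = frozenset({"I1", "I2", "I3"})
min_conf = 60  # percent

# Every non-empty proper subset of the frequent itemset is a candidate
# antecedent; confidence = support(itemset) / support(antecedent).
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        conf = support[itemset] / support[antecedent] * 100
        if conf >= min_conf:
            print(f"{sorted(antecedent)} => {sorted(itemset - antecedent)}: "
                  f"{conf:.0f}%")
```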
PSEUDO CODE OF APRIORI ALGORITHM
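The pseudocode appears as an image on the original slide. As a substitute, here is a minimal runnable Python sketch of the same join-and-prune loop; it omits the subset-based candidate pruning of the full algorithm, so treat it as an illustration rather than the canonical pseudocode:

```python
def apriori(transactions, min_sup):
    """Minimal Apriori sketch: find all frequent itemsets by
    iterating the join and prune steps until no candidates survive.
    `transactions` is a list of sets; `min_sup` is an absolute count."""
    # Frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    freq = {frozenset({i}) for i in items
            if sum(i in t for t in transactions) >= min_sup}
    all_frequent, k = set(freq), 1
    while freq:
        # Join step: candidate (k+1)-itemsets.
        candidates = {a | b for a in freq for b in freq
                      if len(a | b) == k + 1}
        # Prune step: keep candidates meeting minimum support.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= min_sup}
        all_frequent |= freq
        k += 1
    return all_frequent

# Small usage example with min_sup = 2.
db = [{"I1", "I2", "I3"}, {"I1", "I2"}, {"I2", "I3"}, {"I1", "I2", "I3"}]
for s in sorted(apriori(db, 2), key=len):
    print(sorted(s))
```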
Advantages
1. The algorithm is easy to understand.
2. The join and prune steps are easy to implement on large itemsets in large databases.

Disadvantages
1. It requires heavy computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned repeatedly.
IMPROVED APRIORI ALGORITHM
Many methods are available for improving the efficiency of the algorithm.

1. Hash-Based Technique: This method uses a hash-based structure called a hash table for generating
the k-itemsets and their corresponding counts. It uses a hash function for generating the table.
2. Transaction Reduction: This method reduces the number of transactions scanned in later iterations.
Transactions which do not contain any frequent items are marked or removed (see the sketch after
this list).
3. Partitioning: This method requires only two database scans to mine the frequent itemsets. It relies
on the fact that for any itemset to be potentially frequent in the database, it must be frequent in at
least one of the partitions of the database.
4. Sampling: This method picks a random sample S from database D and then searches for frequent
itemsets in S. Some globally frequent itemsets may be missed; this risk can be reduced by lowering
min_sup.
5. Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start
point of the database during the scanning of the database.
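As an illustration of technique 2, transaction reduction: a transaction containing no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be dropped before the next scan. A rough sketch with hypothetical data:

```python
def reduce_transactions(transactions, frequent_k):
    """Keep only transactions that contain at least one frequent
    k-itemset; the rest cannot contribute to any frequent
    (k+1)-itemset in later iterations."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k)]

# Hypothetical example: the last transaction holds no frequent
# 2-itemset and is removed before the next database scan.
frequent_2 = {frozenset({"bread", "butter"})}
db = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"milk", "eggs"}]
print(reduce_transactions(db, frequent_2))  # drops {"milk", "eggs"}
```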
APPLICATION OF ALGORITHM
Some fields where Apriori is used:

1. In the education field: extracting association rules from data on admitted
students based on their characteristics and specialties.
2. In the medical field: for example, analysis of the patients' database.
3. In forestry: analysis of the probability and intensity of forest fires with forest fire
data.
4. Apriori is used by many companies, like Amazon in its recommender system and
Google for the auto-complete feature.
INCREMENTAL ARM
❖ It is noted that analysis of past transaction data can provide very valuable information
on customer buying behavior, and thus improve the quality of business decisions.
❖ Record-based databases, whose data is continuously added, updated, and
deleted, are increasingly used.
❖ Examples of such applications include Web log records, stock market data, grocery
sales data, transactions in e-commerce, and daily weather/traffic records etc.
❖ In many applications, we would like to mine the transaction database for a fixed
amount of most recent data (say, data in the last 12 months).
INCREMENTAL ARM
❖ Mining is not a one-time operation. A naive approach to the
incremental mining problem is to re-run the mining algorithm on the
updated database, but this repeats all the work already done on the old data.
ASSOCIATIVE CLASSIFICATION
❖ Associative classification (AC) is a branch of a wide area of scientific study known
as data mining. Associative classification makes use of association rule mining for
extracting efficient rules, which can precisely generalize the training data set, in the
rule discovery process.
❖ An associative classifier (AC) is a kind of supervised learning model that uses
association rules to assign a target value. The term associative classification was
coined by Bing Liu et al., in which the authors defined a model made of rules "whose
right-hand side are restricted to the classification class attribute".
DATA FLOW DIAGRAM OF ASSOCIATIVE
CLASSIFICATION
FP GROWTH ALGORITHM
❖ The FP-Growth Algorithm, proposed by Han, is an efficient
and scalable method for mining the complete set of frequent patterns by
pattern fragment growth, using an extended prefix-tree structure for
storing compressed and crucial information about frequent patterns,
called the frequent-pattern tree (FP-tree).
FREQUENT PATTERN GROWTH ALGORITHM
❖ This algorithm is an improvement to the Apriori method. A frequent pattern is
generated without the need for candidate generation. FP growth algorithm
represents the database in the form of a tree called a frequent pattern tree or FP
tree.
❖ This tree structure will maintain the association between the itemsets. The
database is fragmented using one frequent item. This fragmented part is called
“pattern fragment”. The itemsets of these fragmented patterns are analyzed.
Thus with this method, the search for frequent itemsets is reduced
comparatively.
FP TREE
❖ Frequent Pattern Tree is a tree-like structure that is made with the
initial itemsets of the database. The purpose of the FP tree is to mine
the most frequent pattern. Each node of the FP tree represents an
item of the itemset.
❖ The root node represents null, while the lower nodes represent the
itemsets. The associations of the nodes with the lower nodes, that is,
of the itemsets with the other itemsets, are maintained while forming
the tree. A minimal sketch of such a node structure follows.
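A minimal sketch of such a node structure in Python; the class and field names are illustrative:

```python
class FPNode:
    """One node of an FP tree: an item, its count, a link to the
    parent, and children keyed by item (the root holds item=None)."""
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode

    def insert(self, items):
        """Insert a transaction's items (already sorted in descending
        order of support count) below this node, sharing any common
        prefix with previously inserted transactions."""
        if not items:
            return
        first, rest = items[0], items[1:]
        if first not in self.children:
            self.children[first] = FPNode(first, parent=self)
        child = self.children[first]
        child.count += 1
        child.insert(rest)
```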
FP GROWTH STEPS
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the same as
the first step of Apriori. The count of each 1-itemset in the database is called its support count or frequency.

#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by null.

#3) The next step is to scan the database again and examine the transactions. Examine the first transaction and find
the itemset in it. The item with the maximum count is taken at the top, then the next item with a lower count, and so
on. This means that the branch of the tree is constructed with the transaction's items in descending order of count.

#4) The next transaction in the database is examined. Its items are ordered in descending order of count. If any
itemset of this transaction is already present in another branch (for example from the 1st transaction), then this
transaction's branch shares a common prefix starting at the root.

This means that the common itemset is linked to the new node of another itemset in this transaction.
#5) Also, the count of each itemset is incremented as it occurs in the transactions: the counts of both common
nodes and new nodes are increased by 1 as transactions are inserted and linked.

#6) The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the
links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From it, traverse the
paths in the FP tree. These paths are called the conditional pattern base.

The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the
lowest node (the suffix).

#7) Construct a conditional FP tree, which is formed from the counts of the itemsets in the paths. The itemsets
meeting the threshold support are considered in the conditional FP tree.

#8) Frequent Patterns are generated from the Conditional FP Tree.


Example Of FP-Growth Algorithm
Support threshold=50%, Confidence= 60%
STEP 1
STEP 2
STEP 3 :BUILD FP TREE
1. Consider the root node null.
2. The first scan of transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1},
{I3:1}, where I2 is linked as a child to the root, I1 is linked to I2, and I3 is linked
to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to the root, I3 is linked to
I2, and I4 is linked to I3. But this branch shares the I2 node, as it is already
used in T1.
4. Increment the count of I2 by 1; I3 is linked as a child to I2, and I4 is linked as
a child to I3. The counts are {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch is created in which I5 is linked to I4 as a child.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
root node, hence its count is incremented by 1. Similarly, the count of I1 is
incremented by 1, as it is already linked to I2 from T1; thus {I2:3}, {I1:2}, {I4:1}.
7. T5: I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3},
{I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4},
{I3:3}, {I4:1}. A code sketch reproducing this tree follows.
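The tree built in this walkthrough can be reproduced with a short self-contained sketch; it sorts each transaction by descending item count (ties broken by item name, matching the ordering used above) and inserts it while sharing common prefixes. The header-table links used for mining are omitted:

```python
from collections import Counter

class Node:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions):
    """Sketch of FP-tree construction (steps #1-#5 above): count the
    items, order each transaction by descending count, and insert it
    into the tree, sharing common prefixes."""
    counts = Counter(i for t in transactions for i in t)
    root = Node()
    for t in transactions:
        node = root
        for item in sorted(t, key=lambda i: (-counts[i], i)):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# The six transactions T1..T6 of the example above.
db = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
      {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3", "I4"}]
show(build_fp_tree(db))
# Prints I2:5 with I1:4 below it (I3:3 under I1, etc.), plus the
# separate I4:1 -> I5:1 branch from T3, matching the walkthrough.
```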
STEP 4:MINING FP TREE IS SUMMARIZED
BELOW:
1. The lowest node item, I5, is not considered, as it does not have the minimum support count; hence it
is deleted.
2. The next lowest node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and {I2,I3,I4:1}. Therefore,
considering I4 as the suffix, the prefix paths will be {I2, I1, I3:1} and {I2, I3:1}. This forms the
conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP tree is constructed from
it. This will contain {I2:2, I3:2}; I1 is not considered, as it does not meet the minimum support count.
4. This path will generate all combinations of frequent patterns: {I2,I4:2}, {I3,I4:2}, {I2,I3,I4:2}.
5. For I3, the prefix paths would be {I2,I1:3} and {I2:1}; this will generate a 2-node FP tree: {I2:4,
I1:3}, and the frequent patterns generated are {I2,I3:4}, {I1,I3:3}, {I2,I1,I3:3}.
6. For I1, the prefix path would be {I2:4}; this will generate a single-node FP tree: {I2:4}, and the
frequent pattern generated is {I2, I1:4}.
Advantages Of FP Growth Algorithm
1. This algorithm needs to scan the database only twice when compared to Apriori which
scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages Of FP-Growth Algorithm


1. FP Tree is more cumbersome and difficult to build than Apriori.
2. It may be expensive.
3. When the database is large, the FP tree may not fit in main memory.
FP GROWTH VS APRIORI ALGORITHM
