
FARAZ

TOPIC: Association Rule Mining

 Association rule mining is a procedure used to discover frequent patterns, correlations, associations, or causal structures in data sets stored in various types of databases such as relational databases, transactional databases, and other types of data repositories.
 The goal of association rule mining, given a set of transactions, is to find the rules that
allow us to predict the occurrence of a specific item based on the occurrences of the
other items in the transaction. The data mining process of discovering the rules that
govern associations and causal relationships between sets of items is known as association
rule mining.
 So, in a given transaction involving multiple items, it attempts to identify the rules
that govern how or why such items are frequently purchased together. For example,
peanut butter and jelly are frequently purchased together because many people enjoy
making PB&J sandwiches.
 Association Rule Mining is a Data Mining technique for discovering patterns in data.
Association Rule Mining patterns represent relationships between items. When
combined with sales data, this is known as Market Basket Analysis.
 Fast-food restaurants, for example, discovered early on that people who eat fast food
tend to be thirsty due to the high salt content and end up buying Coke. They took
advantage of this by creating combo meals that combine food that is sure to make you
thirsty with Coke as part of the meal.
 Maybe these chains didn't use data mining to make this business decision, but maybe
they did. In any case, it has contributed to their increased profits. The purpose of this
example is to demonstrate that Association Rules represent relationships, which must
be interpreted before they can be used in strategies.
The Association Rules
 Association rules are used in data science to discover correlations and co-occurrences
between data sets. They are best suited for explaining patterns in data from seemingly
unrelated information repositories, such as relational and transactional databases.
An association rule has 2 parts:
 an antecedent (if) and
 a consequent (then)
An antecedent is something that’s found in data, and a consequent is an item that is found in
combination with the antecedent. Have a look at this rule, for instance:
“If a customer buys bread, he is 70% likely to also buy milk.”
In the above association rule, bread is the antecedent and milk is the consequent. Simply put,
it can be understood as a retail store’s association rule to target their customers better. If the
above rule is a result of a thorough analysis of some data sets, it can be used to not only
improve customer service but also improve the company’s revenue.
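To make the two parts concrete, here is a minimal sketch of an association rule as a data structure. The `Rule` class is our own illustration (not a library API); the bread/milk rule and the 70% figure reuse the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """An association rule: antecedent (if) -> consequent (then)."""
    antecedent: frozenset  # items observed in the data
    consequent: frozenset  # items found in combination with the antecedent
    confidence: float      # likelihood of the consequent given the antecedent

# The bread -> milk rule from the example above.
rule = Rule(frozenset({"bread"}), frozenset({"milk"}), 0.70)
print(f"If {set(rule.antecedent)} then {set(rule.consequent)} "
      f"({rule.confidence:.0%} likely)")
```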
Applications
Some of the applications of Association Rule Mining are as follows:
1) Market-Basket Analysis
In most supermarkets, data is collected using barcode scanners. This database, known as the
“market basket” database, contains a large number of past transaction records, where every
record lists all the items a customer purchased in a single transaction. From this data, the
stores come to know the inclinations and item choices of their customers. Knowing which
groups are inclined toward which sets of items allows the stores to adjust the store layout
and optimize the cataloging of different items, placing related items optimally next to one
another.

2) Medical Diagnosis
Association rules in medical diagnosis can help physicians diagnose and treat patients.
Diagnosis is a difficult process with many potential errors that can lead to unreliable results.
You can use relational association rule mining to determine the likelihood of illness based on
various factors and symptoms. This application can be further expanded using some learning
techniques on the basis of symptoms and their relationships in accordance with diseases.

3) Census Data
The concept of Association Rule Mining is also used in dealing with the massive amount of
census data. If properly utilized, this information can be used in planning efficient public
services and businesses.
Advantages and Disadvantages of Association Rule Mining Algorithms

Apriori
Advantages:
1. This algorithm has the least memory consumption.
2. It is easy to implement.
3. It uses the Apriori property for pruning; therefore, the itemsets left for further support checking remain few.
Disadvantages:
1. It requires many scans of the database.
2. It allows only a single minimum support threshold.
3. It is favourable only for small databases.
4. It explains only the presence or absence of an item in the database.

FP-growth
Advantages:
1. It is faster than other association rule mining algorithms.
2. It uses a compressed representation of the original database.
3. Repeated database scans are eliminated.
Disadvantages:
1. Its memory consumption is higher.
2. It cannot be used for interactive mining and incremental mining.
3. The resulting FP-tree is not unique for the same logical database.
Conceptualization & Steps to find association with example
Steps in Association Rule Mining
Association Rules are based on if/then statements. These statements aid in the discovery of
associations between independent data in a database, relational database, or other data
repository. These rules are used to determine the relationships between objects that are
commonly used together.
 
Support and confidence are the two primary measures used by association rules. The method
decomposes the data to search for commonly occurring if/then patterns and the rules they
form. Association rules typically must simultaneously satisfy a user-specified minimum
support and a user-specified minimum confidence. To implement association rule learning,
various algorithms are used.
 
Association Rule Mining can be described as a two-step process.
 
Step 1: Locate all frequently occurring itemsets
 An itemset is a collection of items found in a shopping basket. It can include many
products. For example, [bread, butter, eggs] is a supermarket database itemset.
 A frequent itemset is one that appears often in the database. This raises the question
of how frequency is defined, which is where support comes into play. The support count of
an itemset is the number of transactions in the dataset that contain it.
 The support count only captures the absolute frequency of an itemset. It does not
consider the relative frequency, that is, the frequency in relation to the total number of
transactions. The frequency of an itemset relative to the number of transactions is
referred to as its support.

Step 2: Create strong association rules using the frequently used itemsets
 Association rules are created by constructing associations from the frequent itemsets
created in step 1. To find strong associations, this employs a metric known as
confidence.
 The Apriori algorithm is one of the most fundamental Association Rule Mining
algorithms. It is based on the idea that "having prior knowledge of frequent itemsets
can generate strong association rules." The term Apriori refers to prior knowledge.
 Apriori discovers frequent itemsets through a process known as candidate itemset
generation. This is an iterative approach that uses k-itemsets to explore (k+1)-
itemsets. The set of frequent 1-itemsets is found first, followed by the set of frequent
2-itemsets, and so on until no more frequent k-itemsets can be found.
 An important property known as the Apriori property is used to reduce the search
space to improve the efficiency of the level-wise generation of frequent itemsets.
According to the Apriori Property, "all non-empty subsets of a frequent itemset must
also be frequent."
 This means that if an itemset is frequent, its subsets will also be frequent. For example,
if [Bread, Butter] is a frequent itemset, [Bread] and [Butter] must be frequent
individually as well, as the sketch below illustrates.
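The candidate generation and Apriori-property pruning just described can be sketched in a few lines of Python. This is a minimal illustration; the function name and the toy 1-itemsets are our own, not from a specific library.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join items of frequent k-itemsets into (k+1)-itemset candidates and
    prune any candidate that has an infrequent k-subset (Apriori property)."""
    items = sorted(set().union(*frequent_k))
    candidates = []
    for cand in combinations(items, k + 1):
        # Apriori property: every k-subset of a frequent (k+1)-itemset
        # must itself be frequent; otherwise discard the candidate early.
        if all(frozenset(sub) in frequent_k for sub in combinations(cand, k)):
            candidates.append(frozenset(cand))
    return candidates

# Frequent 1-itemsets, echoing the [Bread, Butter] example above.
frequent_1 = {frozenset({"Bread"}), frozenset({"Butter"}), frozenset({"Milk"})}
print(generate_candidates(frequent_1, 1))
# -> candidate 2-itemsets such as {Bread, Butter}, {Bread, Milk}, {Butter, Milk}
```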
Define: Support, Confidence, Lift
Support

Support refers to an item's baseline popularity and can be calculated by dividing the
number of transactions containing a specific item by the total number of transactions.
Suppose we want to find the support for item B. This can be calculated as follows:

Support(B) = (Transactions containing B) / (Total transactions)

Confidence

If item A is purchased, confidence refers to the likelihood that item B will be purchased
as well. It can be calculated by dividing the number of transactions in which A and B
are purchased together by the total number of transactions in which A is purchased. It
can be expressed mathematically as:

Confidence(A -> B) = (Transactions containing both A and B) / (Transactions containing A)

 
Lift 

Lift(A -> B) denotes the increase in the sale ratio of B when A is sold. Lift(A -> B) is
computed by dividing Confidence(A -> B) by Support(B). It can be expressed
mathematically as:

Lift(A -> B) = Confidence(A -> B) / Support(B)
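The three measures translate directly into code, following the formulas above. A minimal sketch; the five toy transactions are invented here purely to exercise the definitions.

```python
def support(itemset, transactions):
    """Support: fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """Confidence(A -> B) = Support(A and B) / Support(A)."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence(a, b, transactions) / support(b, transactions)

# Hypothetical transactions, for illustration only.
txns = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
        {"bread", "milk", "butter"}, {"bread", "milk"}]
print(support({"milk"}, txns))                # 4/5 = 0.8
print(confidence({"bread"}, {"milk"}, txns))  # 3/4 = 0.75
print(lift({"bread"}, {"milk"}, txns))        # 0.75 / 0.8 = 0.9375
```

A lift below 1, as in this toy data, means bread buyers are slightly less likely than average to buy milk; a lift above 1 would indicate that A genuinely boosts sales of B.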


Types of Algorithms (Apriori Algorithm, Eclat Algorithm, FP-Growth Algorithm, with examples)
1) Apriori Algorithm
It works by identifying the most frequent individual items in the database and extending
them to larger and larger itemsets, as long as those itemsets appear sufficiently often in
the database.
The frequent itemsets found by Apriori are also used to determine association rules that
highlight trends in the database. It counts the support of itemsets using a breadth-first
search strategy and a candidate generation function that takes advantage of the downward
closure property of support.

We will understand this algorithm with the help of an example

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product and 0
represents the absence of the product.

The Apriori Algorithm makes the following assumptions:

o All subsets of a frequent itemset must be frequent.
o The subsets of an infrequent itemset must be infrequent.
o Fix a threshold support level. In our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in all the transactions. Now, filter
the frequency table to keep only those products with a support level of over 50 percent.
We find the given frequency table.
The resulting table indicates the products frequently bought by the customers.

Step 2

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table.

Step 3

Apply the same threshold support of 50 percent and keep only the pairs of products whose
support exceeds it. In our case, that means pairs occurring more than 3 times.

Thus, we get RP, RO, PO, and PM

Step 4

Now, look for sets of three products that the customers buy together. We get the given
combinations.

1. RP and RO give RPO
2. PO and PM give POM

Step 5
Calculate the frequency of these two itemsets, and you will get the given frequency table.

If you implement the threshold assumption, you can figure out that the customers' set of three
products is RPO.

We have considered a simple example to discuss the Apriori algorithm in data mining. In
reality, you will find thousands of such combinations.
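Because the transaction tables referred to above were figures in the original document, here is a self-contained Apriori sketch on hypothetical data over the same product set P = {Rice, Pulse, Oil, Milk, Apple}. The six transactions are invented, so the surviving itemsets will not necessarily match the tables above.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori: count candidates, keep the frequent ones, generate
    (k+1)-candidates pruned by the Apriori property, and repeat."""
    min_count = min_support * len(transactions)
    items = sorted(set().union(*transactions))
    frequent, k = {}, 1
    current = [frozenset({i}) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        k += 1
        # Keep only candidates whose every (k-1)-subset survived the last level.
        current = [frozenset(c) for c in combinations(items, k)
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))]
    return frequent

# Hypothetical six transactions over P = {Rice, Pulse, Oil, Milk, Apple}.
txns = [frozenset(t) for t in (
    {"Rice", "Pulse", "Oil"}, {"Rice", "Pulse", "Milk"}, {"Rice", "Oil"},
    {"Pulse", "Oil", "Milk"}, {"Rice", "Pulse", "Oil"}, {"Oil", "Milk"})]
for itemset, count in apriori(txns, 0.5).items():
    print(sorted(itemset), count)
```

With min_support = 0.5 and six transactions, an itemset must appear in at least three of them to survive, mirroring the threshold used in the worked example above.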

2) Eclat Algorithm
Eclat stands for Equivalence Class Transformation. It is a depth-first search algorithm
based on set intersection, suitable for both sequential and parallel execution, with
locality-enhancing properties. It is an algorithm for frequent pattern mining based on a
depth-first traversal of the itemset lattice.
 It is a DFS traversal of a prefix tree rather than of the lattice itself.
 For stopping, a branch-and-bound technique is used.
Let us now understand the above stated working with an example:-
Consider the following transactions record:-

The above-given data is a boolean matrix where for each cell (i, j), the value denotes whether
the j’th item is included in the i’th transaction or not. 1 means true while 0 means false.
We now call the function for the first time and arrange each item with its tidset in a
tabular fashion:-
k = 1, minimum support = 2

We now recursively call the function till no more item-tidset pairs can be combined:-
k=2

k=3
k=4

We stop at k = 4 because there are no more item-tidset pairs to combine.


Since minimum support = 2, we conclude the following rules from the given dataset:-
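Since the boolean matrix and the k = 1..4 tables above were figures in the original, here is a minimal recursive Eclat sketch on an invented four-transaction dataset; the tidset intersection is the heart of the method, and all names here are our own.

```python
def eclat(prefix, item_tidsets, min_support, results):
    """Depth-first Eclat: extend `prefix` one item at a time, intersecting
    tidsets; a combined itemset is frequent if its tidset is large enough."""
    while item_tidsets:
        item, tids = item_tidsets.pop()
        if len(tids) >= min_support:
            results[frozenset(prefix | {item})] = len(tids)
            # Conditional database: intersect this tidset with the rest.
            suffix = [(o, tids & otids) for o, otids in item_tidsets
                      if len(tids & otids) >= min_support]
            eclat(prefix | {item}, suffix, min_support, results)

# Hypothetical transactions; invert them into item -> tidset form.
txns = {1: {"A", "B", "D"}, 2: {"A", "C"}, 3: {"A", "B", "C"}, 4: {"B", "C"}}
tidsets = {}
for tid, basket in txns.items():
    for item in basket:
        tidsets.setdefault(item, set()).add(tid)

results = {}
eclat(set(), sorted(tidsets.items()), 2, results)  # minimum support = 2
for itemset, count in sorted(results.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

Each recursive call works on a smaller conditional database of tidsets, which is what makes the depth-first traversal memory-light compared with Apriori's repeated full scans of the transaction database.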
3) FP-growth Algorithm
This algorithm is also known as the frequent pattern (FP) algorithm. The FP-growth
algorithm is used for finding frequent itemsets in transaction data without candidate
generation.
It was primarily designed to compress the database into a structure that yields the
frequent sets, and then to divide this compressed data into conditional database sets.
Each conditional database is associated with one frequent item. Each such database then
undergoes the process of data mining.
The data source is compressed using the FP-tree data structure.
This algorithm operates in two stages. These are as follows:
 FP-tree construction
 Extract frequently used itemsets
Example

Support threshold=50%, Confidence= 60%

Table: Table 1

Solution: Support threshold=50% => 0.5*6= 3 => min_sup=3

1: Count of each item

Table 2

2: Sort the itemsets in descending order

Table 3
3. Build FP Tree

1. Consider the root node to be null.


2. The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1},
{I3:1}, where I2 is linked as a child to root, I1 is linked to I2 and I3 is linked
to I1.
3. T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to
I2 and I4 is linked to I3. But this branch would share I2 node as common as it
is already used in T1.
4. Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a
child to I3. The count is {I2:2}, {I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6. T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the
root node, hence it will be incremented by 1. Similarly I1 will be incremented
by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7. T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3},
{I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4},
{I3:3}, {I4:1}.
4. Mining of FP-tree is summarized below:
1. The lowest node, item I5, is not considered as it does not meet the min support
count, hence it is deleted.
2. The next lower node is I4. I4 occurs in 2 branches: {I2,I1,I3,I4:1} and
{I2,I3,I4:1}. Therefore, considering I4 as the suffix, the prefix paths will be {I2,
I1, I3:1} and {I2, I3:1}. This forms the conditional pattern base.
3. The conditional pattern base is treated as a transaction database, and an FP-tree is
constructed from it. This tree will contain {I2:2, I3:2}; I1 is not considered as it
does not meet the min support count.
4. This path will generate all combinations of frequent patterns : {I2,I4:2},
{I3,I4:2},{I2,I3,I4:2}
5. For I3, the prefix paths would be: {I2,I1:3} and {I2:1}. This will generate a 2-node
FP-tree: {I2:4, I1:3}, and frequent patterns are generated: {I2,I3:4}, {I1,I3:3},
{I2,I1,I3:3}.
6. For I1, the prefix path would be: {I2:4} this will generate a single node FP-
tree: {I2:4} and frequent patterns are generated: {I2, I1:4}.

The diagram given below depicts the conditional FP tree associated with the conditional node
I3.
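Stage 1 of the algorithm (FP-tree construction) from the walk-through above can be sketched directly. The six transactions are those of Table 1; sorting items by descending count with ties broken by item name reproduces the I2, I1, I3, I4, I5 ordering used in the text. The `Node` class and function names are our own choices, not a library API.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions):
    """Insert each transaction, items ordered by descending support,
    along a shared-prefix path from the root, incrementing node counts."""
    freq = Counter(item for t in transactions for item in t)
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted(t, key=lambda i: (-freq[i], i)):
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1
            node = child
    return root

def show(node, depth=0):
    """Print the tree, indenting children under their parents."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# The six transactions from Table 1 (T1..T6).
txns = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
        {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
show(build_fp_tree(txns))
```

The printed counts match the walk-through: {I2:5} with {I1:4} beneath it, plus the separate I4:1 -> I5:1 branch created by T3. Stage 2 (mining) then proceeds bottom-up over these paths as described in step 4.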
References
https://www.analyticssteps.com/blogs/association-rule-mining-importance-and-steps
https://hevodata.com/learn/association-rule-mining/#intro2
https://www.javatpoint.com/fp-growth-algorithm-in-data-mining
https://www.geeksforgeeks.org/ml-frequent-pattern-growth-algorithm/
https://www.javatpoint.com/apriori-algorithm
https://www.geeksforgeeks.org/ml-eclat-algorithm/
