
CS402 Data Mining & Warehousing

MODULE 5

April 18, 2020

CS402 Data Mining & Warehousing April 18, 20201 / 126


Outline:

1 Association Rules Mining


Concepts
Apriori Algorithm
FP-Growth Algorithm

2 Cluster Analysis
Concepts
Types of Data in Cluster Analysis
Categorization of Clustering Methods
1. Partitioning methods: K-Means and K-Medoid Clustering
2. Hierarchical Clustering method: BIRCH - (Module 6)
3. Density-Based Clustering methods: DBSCAN and OPTICS - (Module 6)

CS402 Data Mining & Warehousing April 18, 20202 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsI
● Frequent patterns are patterns (such as itemsets, subsequences, or
substructures) that appear in a data set frequently
● A frequent itemset typically refers to a set of items that frequently
appear together in a transactional data set, such as milk and bread
● Association rules are rules that capture how frequently itemsets occur
together in transactions
● An example of such a rule, mined from the AllElectronics
transactional database
buys(X, "computer") =⇒ buys(X, "antivirus software") [support = 1%, confidence = 50%]
The rule indicates the relationship between the purchase of a computer and
of antivirus software by a customer
● Support:
● 1% support means: 1% of all of the transactions under analysis showed
that computer and software were purchased together.
CS402 Data Mining & Warehousing April 18, 20203 / 126
Association Rules Mining Concepts

Association Rules Mining -


ConceptsII
● i.e., suppose we have 9 transactions under analysis; the support is the
percentage of those transactions in which computer and software were
purchased together.
● Suppose computer and software were purchased together in 2 of the
transactions. Then the support count is 2, and the
support will be 2/9 ≈ 22%.
● The support represented in percentage is called relative support. If we
take the support in terms of support count then we call it as absolute
support.
● Confidence:
● A confidence, or certainty, of 50% means that if a customer buys a
computer, there is a 50% chance that he will also buy software.
● Suppose computer and software were purchased together in 2 transactions,
and the computer was purchased in a total of 4 transactions (out of 9).
Then the confidence is 2/4 = 50%.

CS402 Data Mining & Warehousing April 18, 20204 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsIII
● This association rule involves a single attribute or predicate (i.e.,
buys) that repeats.
● Association rules that contain a single predicate are referred to as
single-dimensional association rules.
● The above rule can be written simply as:
computer =⇒ antivirus software [1%, 50%]
● Another example:
age(X, "20...29") ∧ income(X, "20K...29K") =⇒ buys(X, "CD player")
[support = 2%, confidence = 60%]
● The rule indicates that, of the AllElectronics customers under study:
● 2% are 20 to 29 years of age with an income of 20,000 to 29,000 and
have purchased a CD player
● There is a 60% probability that a customer in this age and income
group will purchase a CD player.

CS402 Data Mining & Warehousing April 18, 20205 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsIV
● Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys)
● Association Rule Mining: It is the process of mining the frequent
patterns, associations, and correlations in a transactional database that
satisfy a minimum support threshold and a minimum confidence
threshold
● There can be a number of association rules mined from a
transactional database
● Typically, association rules are discarded as uninteresting if they do
not satisfy both a minimum support threshold and a minimum
confidence threshold

CS402 Data Mining & Warehousing April 18, 20206 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsV
Need for finding patterns: Market Basket Analysis

CS402 Data Mining & Warehousing April 18, 20207 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsVI
● In example we analyze customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”
● This will help in many business decision-making processes, such as
catalog design, cross-marketing, and customer shopping behavior
analysis.
● Help retailers develop marketing strategies by gaining insight into
which items are frequently purchased together by customers.
● For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip to the supermarket?
● Such information can lead to increased sales by helping retailers do
selective marketing and plan their shelf space.
● Also, market basket analysis may help you design different store
layouts.

CS402 Data Mining & Warehousing April 18, 20208 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsVII

● Items that are frequently purchased together can be placed in proximity in
order to further encourage the sale of such items together.
● If customers who purchase computers also tend to buy antivirus
software at the same time, then placing the hardware display close to the
software display may help increase the sales of both items.
● Placing hardware and software at opposite ends of the store may entice
customers who purchase such items to pick up other items along the way.
● If customers tend to purchase computers and printers together, then
having a sale on printers may encourage the sale of printers as well as
computers.

CS402 Data Mining & Warehousing April 18, 20209 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsVIII
● Let I = {I1, I2, ..., Im} be a set of items, e.g.,
I = {milk, bread, butter, beer, jelly}
● Let D be a set of transactions, where each transaction Ti is a set of items such that Ti ⊆ I
● Each transaction is associated with an identifier, called TID

CS402 Data Mining & Warehousing April 18, 202010 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsIX
● Let A be a set of items. A transaction T is said to contain A if and
only if A ⊆ T
● e.g., let A = {Bread, Milk}; clearly A ⊆ T3
● An association rule is an implication of the form
A =⇒ B [support = s, confidence = c], where A ⊂ I, B ⊂ I,
and A ∩ B = φ.
● The support s is the percentage of transactions in D that contain
A ∪ B (i.e., the union of sets A and B, or say, both A and B)
● The confidence c is the percentage of transactions in D containing A
that also contain B.
● e.g., Bread =⇒ Butter [support = 2%, confidence = 60%]
● Here, support 2% is the percentage of transactions in D that contain
both Bread and Butter

CS402 Data Mining & Warehousing April 18, 202011 / 126


Association Rules Mining Concepts

Association Rules Mining -


ConceptsX
● In other words A support of 2% means that 2% of all the transactions
under analysis show that Bread and Butter are purchased together
● Where confidence 60% is the percentage of transactions in D
containing Bread that also contain Butter
● A confidence of 60% means that 60% of the customers who
purchased a Bread also bought the Butter
● The support and confidence are calculated as below:

support(A =⇒ B) = P(A ∪ B) = (# transactions containing both A and B) / (total # of transactions in D)

confidence(A =⇒ B) = P(B | A) = (# transactions containing both A and B) / (# transactions containing A)
                   = support(A ∪ B) / support(A) = support_count(A ∪ B) / support_count(A)
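As a minimal sketch of these two measures in Python (the transactions and item names below are illustrative, not the AllElectronics data), relative support and confidence can be computed directly from a list of transactions:

```python
# Minimal sketch: computing support and confidence of A => B from raw transactions.
# The transaction contents below are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "beer"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "jelly"},
]

def support_count(itemset, transactions):
    """Absolute support: number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(A, B, transactions):
    """Confidence of A => B: support_count(A ∪ B) / support_count(A)."""
    return support_count(A | B, transactions) / support_count(A, transactions)

A, B = {"bread"}, {"butter"}
print(f"support = {support(A | B, transactions):.0%}")       # 2/5 = 40%
print(f"confidence = {confidence(A, B, transactions):.0%}")  # 2/4 = 50%
```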

CS402 Data Mining & Warehousing


Association Rules Mining Concepts

Association Rules Mining -


ConceptsXI

● Rules that satisfy both a minimum support threshold (min sup) and a
minimum confidence threshold (min conf ) are called strong.
● By convention, we write support and confidence values so as to occur
between 0% and 100%

CS402 Data Mining & Warehousing


Association Rules Mining Concepts

Association Rules Mining -


ConceptsXII
Itemset, Support Count, Frequent Itemset
A set of items is referred to as an itemset
● k-itemset: An itemset that contains k items
eg; {bread, butter} is a 2-itemset.
● Frequency of an itemset: the number of transactions that contain
the itemset
● This is also known, simply, as the frequency, support count, or count
of the itemset.
● Support represented as a percentage is referred to as relative support
● Support represented as a frequency (count) is the absolute support

CS402 Data Mining & Warehousing


Association Rules Mining Concepts

Association Rules Mining -


ConceptsXIII

Itemset, Support Count, Frequent Itemset


● Frequent itemset: An itemset whose relative support satisfies a
prespecified minimum support threshold
● or equivalently, an itemset whose absolute support (count)
satisfies a prespecified minimum support count threshold
● The set of frequent k-itemsets is commonly denoted by Lk
● Thus the problem of mining association rules can be reduced to that
of mining frequent itemsets.

CS402 Data Mining & Warehousing


Association Rules Mining Concepts

Association Rules Mining -


ConceptsXIV

● In general, association rule mining can be viewed as a two-step


process:
1. Find all frequent itemsets: By definition, each of these itemsets will
occur at least as frequently as a predetermined minimum support count,
min sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence.
● Algorithms to find frequent itemsets (strong rules are then generated from them):
● Apriori Algorithm: finds frequent itemsets only
● FP-growth Algorithm: finds frequent itemsets only

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmI

Apriori Algorithm: Finding frequent itemsets- Example

● Based on the AllElectronics


transaction database, D, of
Table, illustrate the Apriori
algorithm for finding frequent
itemsets in D.
● There are nine transactions in
this database, that is, |D| = 9.

Let the minimum support count
required be 2, that is,
min sup = 2

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmII
Apriori Algorithm: Finding frequent itemsets- Example

(1) Generate candidate 1-itemsets, C1 , and scan D to count the number of


occurrences of each item.
(2) Determine frequent 1-itemsets, L 1 . Select the candidate 1-itemsets
satisfying minimum support(sup. count ≥2). Here In our example, all of
the candidates in C1 satisfy minimum support

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmIII
Apriori Algorithm: Finding frequent itemsets- Example

(3) Generate the candidate set of 2-itemsets, C2, by applying the join operation
C2 = L1 ⋈ L1.
Note: L1 ⋈ L1 is equivalent to L1 × L1
Now check whether C2 follows the Apriori property to prune C2
● Apriori property: All nonempty subsets of a frequent itemset must
also be frequent
● Here we find the candidates in C2 any of whose subsets is not
frequent and remove such candidates
● In this example: no candidates are removed from C2 during the prune
step because each subset of the candidates is also frequent.
(4) Now we scan D and find the support count of each candidate itemset in C2

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmIV

Apriori Algorithm: Finding frequent itemsets- Example

(5)Find L 2 , the set of frequent 2-itemsets which statisfy min sup

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmV
Apriori Algorithm: Finding frequent itemsets- Example

(6) Generate C3, the set of candidate 3-itemsets, using the join operation, i.e.,
C3 = L2 ⋈ L2
● Now perform pruning
● Based on the Apriori property that all subsets of a frequent itemset
must also be frequent, we can determine that four of the
candidates cannot possibly be frequent.
● We therefore remove them from C3
● Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after pruning

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmVI

Figure:Pruning C3 .Do any of the candidates have a subset that is not frequent?

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmVII
Apriori Algorithm: Finding frequent itemsets- Example

(7) Find L3, the set of frequent 3-itemsets which satisfy min sup

(8) Generate C4 = L3 ⋈ L3 and prune it.
Although the join results in {{I1, I2, I3, I5}},
this itemset is pruned because its subset {I2, I3, I5} is not
frequent
● Thus, C4 = φ, and the algorithm terminates, having found all of the
frequent itemsets.
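The level-wise generate, count, and prune loop traced above can be sketched in Python as below. This is a simplified illustration (itemsets as frozensets, plain dictionary counting), not the formal algorithm listing that follows; the nine transactions are the ones used in this example.

```python
from itertools import combinations

def apriori(transactions, min_sup_count):
    """Simplified Apriori sketch: returns {itemset: support_count} for all frequent itemsets."""
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
    frequent = dict(Lk)
    k = 2
    while Lk:
        # Join step: L(k-1) ⋈ L(k-1), keeping only k-itemsets
        prev = list(Lk)
        Ck = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (Apriori property)
        Ck = {c for c in Ck
              if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))}
        # Scan D and count each surviving candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
        frequent.update(Lk)
        k += 1
    return frequent

D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
print(apriori(D, min_sup_count=2))
```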

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmVIII

Note:
● If Lk = φ at any stage, the algorithm terminates; Lk−1 will be the
largest set of frequent itemsets found.
Algorithm: Apriori. Find frequent itemsets using an iterative level-wise
approach based on candidate generation.

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmIX

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmX

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmXI

Generating Association Rules from Frequent Itemsets -Example


● Once the frequent itemsets from transactions in a database D have
been found, it is straightforward to generate strong association rules
from them
● strong association rules satisfy both minimum support and minimum
confidence
(1)For each frequent itemset l, generate all nonempty subsets of l
Let the frequent itemset be l = {I1, I2, I5}
● The nonempty subsets of l are
{I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and
{I5}

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmXII
Generating Association Rules from Frequent Itemsets -Example

(2) For every nonempty subset s of l, output the rule s =⇒ (l − s) and
find the confidence of each rule.

● Confidence(A =⇒ B) = support_count(A ∪ B) / support_count(A)

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmXIII

Generating Association Rules from Frequent Itemsets -Example


● Example:
Confidence(I1 ∧ I2 =⇒ I5) = support_count({I1, I2} ∪ {I5}) / support_count({I1, I2})
                           = support_count({I1, I2, I5}) / support_count({I1, I2}) = 2/4 = 50%
● If the minimum confidence threshold is, say, 70%, then only the
second, third, and last rules will be the strong association rules
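A small Python sketch of this rule-generation step, using the support counts from the worked example and a 70% minimum confidence as mentioned above:

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """For each frequent itemset l and nonempty proper subset s, emit s => (l - s)
    when confidence = count(l) / count(s) meets the threshold."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in map(frozenset, combinations(l, r)):
                conf = count_l / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Support counts for l = {I1, I2, I5} and its subsets, as in the worked example.
frequent = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
            frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
            frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2}
for antecedent, consequent, conf in generate_rules(frequent, min_conf=0.70):
    print(antecedent, "=>", consequent, f"confidence = {conf:.0%}")
```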

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori AlgorithmXIV
How can we further improve the efficiency of Apriori-based mining?
1. Hash-based technique (hashing itemsets into corresponding buckets)
● A hash-based technique can be used to reduce the size of the set of
candidate k-itemsets, Ck, for k > 1.
● Example: While scanning each transaction to generate the frequent
1-itemsets, L1, from C1, we can also generate all 2-itemsets of each
transaction and hash (map) them into the buckets of a hash table.
● Ex: Let T100 = {I1, I2, I5}; its 2-itemsets are
{I1, I2}, {I2, I5}, {I1, I5}
For {I1, I5} we compute the hash (bucket) using the function:
h(x, y) = ((order of(x) × 10 + order of(y)) mod 7
h(I1, I5) = ((order of(I1) × 10 + order of(I5)) mod 7
h(I1, I5) = ((1 × 10 + 5) mod 7 = 1
//so {I1, I5} is put into bucket #1
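A short Python sketch of the same bucket assignment, where the "order" of an item is taken to be its index (I1 → 1, ..., I5 → 5), as in the example:

```python
from itertools import combinations

def order_of(item):
    # "order" of an item Ik is its index k (I1 -> 1, I2 -> 2, ...)
    return int(item.lstrip("I"))

def bucket(x, y):
    # h(x, y) = (order_of(x) * 10 + order_of(y)) mod 7
    return (order_of(x) * 10 + order_of(y)) % 7

T100 = ["I1", "I2", "I5"]
for x, y in combinations(sorted(T100, key=order_of), 2):
    print(f"{{{x}, {y}}} -> bucket #{bucket(x, y)}")
# {I1, I2} -> bucket #5, {I1, I5} -> bucket #1, {I2, I5} -> bucket #4
```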

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmXV
● Similarly, the other 2-itemsets in T100 can be mapped to buckets using the
hash function
● mod 7 indicates the number of buckets (we have 7)
● The process is repeated for each transaction in D while
determining L1 from C1
● Clearly, a bucket count is an upper bound on the support count of the
2-itemsets hashed into it; thus if min support = 3, the itemsets in buckets
0, 1, 3, and 4 cannot be frequent and so they should not be included in C2
CS402 Data Mining & Warehousing
Association Rules Mining Apriori Algorithm

Apriori
AlgorithmXVI
2. Transaction reduction (reducing the number of transactions
scanned in future iterations)
● A transaction that does not contain any frequent k-itemsets cannot
contain any frequent (k + 1)-itemsets
● Therefore, such a transaction can be marked or removed from further
consideration because subsequent scans of the database for j-itemsets,
where j > k, will not require it.
3. Partitioning (partitioning the data to find candidate itemsets):
● A partitioning technique uses just two database scans to mine the
frequent itemsets

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmXVII
● It consists of two phases. In Phase I, the database is divided into partitions and, for each
partition, all frequent itemsets within the partition are found (local frequent
itemsets).
● In Phase II, a second scan of D is conducted in which the actual support
of each candidate is assessed in order to determine the global frequent
itemsets.
● Any itemset that is potentially frequent with respect to D must occur as
a frequent itemset in at least one of the partitions
4. Sampling (mining on a subset of the given data)
● Pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D
● Since we search for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets. To
lessen this possibility, we use a support threshold lower than the
minimum support to find the frequent itemsets local to S

CS402 Data Mining & Warehousing


Association Rules Mining Apriori Algorithm

Apriori
AlgorithmXVIII

● If L_S (the frequent itemsets local to S) actually contains all of the frequent itemsets in D, then only one
scan of D is required. Otherwise, a second pass can be done in order to
find the frequent itemsets that were missed in the first pass.
5. Dynamic itemset counting (adding candidate itemsets at different
points during a scan)
The database is partitioned into blocks marked by start points
● So that new candidate itemsets can be added at any start point, while in
Apriori, new candidate itemsets are determined only immediately before
each complete database scan.
● The resulting algorithm requires fewer database scans than Apriori.

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.I

Drawbacks of Apriori
● It may need to generate a huge number of candidate sets. If there are 10^4
frequent 1-itemsets, the Apriori algorithm will need to generate more
than 10^7 candidate 2-itemsets.
● It may need to repeatedly scan the database and check a large set of
candidates by pattern matching
Can we design a method that mines the complete set of frequent itemsets
without candidate generation?

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.II

FP-Growth Algorithm (frequent-pattern growth)


● An algorithm for mining frequent itemsets without candidate
generation
● Adopts a divide-and-conquer strategy: decompose mining tasks into
smaller ones
● Compresses the database representing frequent items into a
frequent-pattern tree, or FP-tree
● Then divides the compressed database into a set of conditional
databases (a special kind of projected database), each associated with
one frequent item or “pattern fragment,” and mines each such database
separately

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.III

FP-growth Algorithm: Example

● Based on the AllElectronics
transaction database, D, of
Table, illustrate the FP-growth
algorithm for finding frequent
itemsets in D.
● There are nine transactions in
this database, that is, |D| = 9.

Let the minimum support count
required be 2, that is,
min sup = 2

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.IV

FP-growth Algorithm: Example


(1) Scan the database (same as in Apriori) and derive the set of frequent
1-itemsets and their support counts (frequencies)

(2) The set of frequent items is sorted in the order of descending support
count. This resulting set or list is denoted L
L = {{I2 ∶ 7}, {I1 ∶ 6}, {I3 ∶ 6}, {I4 ∶ 2}, {I5 ∶ 2}}

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.V

FP-growth Algorithm: Example


(3) Construct FP-Tree
● Create the root of the tree, labeled with "null".
● Scan the database D again and process each transaction in L order.
● For each transaction a branch is created, with each item carrying its
support count separated by a colon.
● Whenever the same node is encountered in another transaction, we
just increment the support count of the common node or prefix.
● Eg; the first transaction, T100 ∶ I1, I2, I5,
contains three items, taken in L order as (I2, I1, I5)
● Construct the first branch of the tree (from the root) with three nodes:
(null{}) → (I2 ∶ 1) → (I1 ∶ 1) → (I5 ∶ 1)
● Shown in Fig 3.(a)

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.VI

Transactions taken in L order:
T100 ∶ I2, I1, I5
T200 ∶ I2, I4
T300 ∶ I2, I3
T400 ∶ I2, I1, I4
T500 ∶ I1, I3
T600 ∶ I2, I3
T700 ∶ I1, I3
T800 ∶ I2, I1, I3, I5
T900 ∶ I2, I1, I3

Figure: 3.(a) FP-tree after reading T100: the branch (null{}) → (I2 ∶ 1) → (I1 ∶ 1) → (I5 ∶ 1)
Figure: 3.(b) FP-tree after reading T200: the count of I2 becomes 2 and a new child node (I4 ∶ 1) is added

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


VII
FP-growth Algorithm: Example
(3)Construct FP-Tree (contd...)
● For T200 ∶ I2, I4,
the two items are taken in L order as (I2, I4),
which would result in the branch: (null{}) → (I2 ∶ 1) → (I4 ∶ 1)
● But the (null{}) → (I2 ∶ 1) path was already
created for T100
● So increment the count of the I2 node by 1, and create a new node,
(I4 ∶ 1), as its child
● The resulting branching is shown in Fig 3.(b)
● This process is repeated for each transaction to construct the FP-tree
● Also, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links, as shown in the figure
below
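A compact Python sketch of this insertion procedure; the node structure and header-table handling are simplified for illustration and are not the formal FP-growth listing shown later:

```python
class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item -> FPNode
        self.node_link = None       # next node in the tree carrying the same item

def insert(root, transaction, header):
    """Insert one transaction (already sorted in L order) into the FP-tree."""
    node = root
    for item in transaction:
        if item in node.children:
            node.children[item].count += 1          # shared prefix: just increment
        else:
            child = FPNode(item, node)              # new node with count 1
            node.children[item] = child
            # link the new node into the header table's chain of node-links
            child.node_link, header[item] = header.get(item), child
        node = node.children[item]

root, header = FPNode(None, None), {}
L_order = ["I2", "I1", "I3", "I4", "I5"]
D = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
     ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
for t in D:
    insert(root, sorted(t, key=L_order.index), header)
print(root.children["I2"].count)   # 7: transactions whose L-order branch starts with I2
```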

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


VIII

Figure:The FP-tree

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.IX

FP-growth Algorithm: Example


(4) Mine FP-tree to construct Conditional (sub) pattern bases and
Conditional FP-tree for each Item
● Construct the Pattern Base for I5
● Select the last item in L, which is I5
● Now look into the FP-tree and find the number of branches in which I5 occurs, which is 2.
To count the occurrences of I5, just follow its chain of node-links
● Now find the paths formed by I5, which are: (I2, I1, I5 ∶ 1) and
(I2, I1, I3, I5 ∶ 1)
● Now take I5 as the suffix and extract the remaining prefix paths, which are:
(I2, I1 ∶ 1) and (I2, I1, I3 ∶ 1)
● The (I2, I1 ∶ 1) and (I2, I1, I3 ∶ 1) are called the Conditional Pattern Base
for I5

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.X

Transactions containing I5 (in L order): T100 ∶ I2, I1, I5 and T800 ∶ I2, I1, I3, I5

Figure: Paths containing node I5, and the Conditional FP-tree for I5
CPB = {{I2, I1 ∶ 1}, {I2, I1, I3 ∶ 1}}
Conditional FP-tree for I5: (I2 ∶ 2, I1 ∶ 2)

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.XI

Figure:The Conditional Pattern Base

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XII

FP-growth Algorithm: Example


(4) Mine the FP-tree to construct Conditional (sub) pattern bases and a
Conditional FP-tree for each item (contd...)
● Construct the Conditional FP-tree for I5 (fig. above)
● Next calculate the support counts of the items present in the conditional pattern base
● Support count(I1) in the conditional pattern base is 1 + 1 = 2
● Support count(I2) in the conditional pattern base is 1 + 1 = 2
● For I3, the support count in the conditional pattern base is 1
● Take items with support count ≥ 2, which are {I2, I1}
● Thus the Conditional FP-tree for I5 will be the path (I2 ∶ 2, I1 ∶ 2); the ∶ 2
represents the support count along the prefix path

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XIII

Figure:The Conditional Pattern Base, Conditional FP-tree

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XIV
FP-growth Algorithm: Example
(5) Generate Frequent Patterns
● To generate the frequent patterns for suffix I5, take all possible
combinations of I5 with the items in its Conditional FP-tree
● Thus the frequent patterns for I5 will be: {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
● Similarly, repeat steps (4) and (5) for each item in the descending
order of L
i.e., the same procedure is applied to suffixes I4, I3 and I1.
● Note: I2 is not taken as a suffix because it doesn't
have any prefix at all.

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XV
FP-growth Algorithm: Example

● For I4:
● The two paths are: {I2, I1, I4 ∶ 1}, {I2, I4 ∶ 1}
● Extracting the prefix paths gives the Conditional Pattern Base (CPB)
as: {{I2, I1 ∶ 1}, {I2 ∶ 1}}
● Compute support counts in the CPB:
support count(I1) = 1, support count(I2) = 1 + 1 = 2
● The Conditional FP-tree will be: (I2 ∶ 2);
here ∶ 2 represents the support count for I2 in the CPB
● Now take all possible combinations between I4 and the Conditional
FP-tree and get the Frequent Patterns
● Frequent Patterns Generated will be: {I2, I4 ∶ 2}

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XVI
Transactions containing I4 (in L order): T200 ∶ I2, I4 and T400 ∶ I2, I1, I4

Figure: Paths containing node I4, and the Conditional FP-tree for I4
CPB = {{I2, I1 ∶ 1}, {I2 ∶ 1}}
Conditional FP-tree for I4: (I2 ∶ 2)

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XVII
FP-growth Algorithm: Example
● For I3,
● The paths are : {{I2, I1, I3 ∶ 2}, {I2, I3 ∶ 2}, {I1, I3 ∶ 2}}
● Conditional Pattern Base: {{I2, I1 ∶ 2}, {I2 ∶ 2}, {I1 ∶ 2}}
Compute support counts in CPB:
support count(I1) = 2 + 2 = 4, support count(I2) =2 + 2 =4
Conditional FP-tree will be: (I2 ∶ 4, I1 ∶ 2), (I1 ∶ 2)
● (I2 ∶ 4, I1 ∶ 2) is obtained by combining paths (I2, I1 ∶ 2), (I2 ∶ 2 ) since
the prefix of first path also covers second path
● Now take all possible combinations between I3 and Conditional
FP-tree and get the Frequent Patterns
● Frequent Patterns Generated will be:{I2,
I3 ∶ 4}, {I1, I3 ∶ 4}, {I2, I1, I3 ∶ 2}

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XVIII
Figure: Paths containing node I3, and the Conditional FP-tree for I3
CPB = {{I2, I1 ∶ 2}, {I2 ∶ 2}, {I1 ∶ 2}}
Conditional FP-tree for I3: (I2 ∶ 4, I1 ∶ 2), (I1 ∶ 2)

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XIX

FP-growth Algorithm: Example


● For I1,
● Paths are : {{I2, I1 ∶ 4}, {I1 ∶ 2}}
● Conditional Pattern Base: {{I2 ∶ 4}}
Compute support counts in CPB: support count(I2) =4
Conditional FP-tree will be: (I2 ∶ 4)
● Now take all possible combinations between I1 and Conditional
FP-tree and get the Frequent Patterns
● Frequent Patterns Generated will be:{I2, I1 ∶ 4}

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XX
Figure: Paths containing node I1, and the Conditional FP-tree for I1
CPB = {{I2 ∶ 4}}
Conditional FP-tree for I1: (I2 ∶ 4)

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XXI

Figure:The Frequent Patterns Generated

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XXII

The FP-growth algorithm for discovering frequent itemsets without


candidate generation.

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XXIII

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XXIV

CS402 Data Mining & Warehousing


Association Rules Mining FP-Growth Algorithm

FP-Growth : Frequent Itemsets without Candidate Gen.


XXV
Advantages of the FP-Growth Algorithm
● The algorithm needs to scan the database only twice, whereas
Apriori scans the transactions once for each iteration.
● No candidate generation, no candidate test
● Explicit pairwise candidate generation is avoided, which makes it
faster.
● The database is stored in a compact form in memory. It is efficient
and scalable for mining both long and short frequent patterns.
Disadvantages of the FP-Growth Algorithm
● The FP-tree is more cumbersome and difficult to build than the structures used by Apriori.
● It may be expensive.
● When the database is large, the FP-tree may not fit in main
memory.
CS402 Data Mining & Warehousing
Cluster Analysis Concepts

Cluster Analysis-
ConceptsI

Figure:A sample of 3 Clusters in n-dimensional space

CS402 Data Mining & Warehousing


Cluster Analysis Concepts

Cluster Analysis-
ConceptsII
● Clustering: The process of grouping a set of physical or abstract
objects into classes of similar objects
● In other words: It is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in
some sense) to each other than to those in other groups (clusters).
● A cluster is a collection of data objects that are similar to one
another within the same cluster and are dissimilar to the objects in
other clusters.
● In cluster analysis, we first partition the set of data into groups based
on data similarity, and then assign labels to the
relatively small number of groups
● Unlike classification, clustering and unsupervised learning do not rely
on predefined classes and class-labeled training examples.
● Clustering is a form of learning by observation, rather than learning by
examples.
CS402 Data Mining & Warehousing
Cluster Analysis Concepts

Cluster Analysis-
ConceptsIII
● Applications: market research, pattern recognition, data analysis,
image processing, machine learning, information retrieval,
bioinformatics, data compression, and computer graphics
● In business:help marketers discover distinct groups in their customer
bases and characterize customer groups based on purchasing patterns.
● In biology: used to derive plant and animal taxonomies, categorize
genes with similar functionality, and gain insight into structures
inherent in populations
Helps to classify documents on the Web for information discovery.
● Help in the identification of areas of similar land use in an earth
observation database etc...
● Outlier detection: can detect values that are “far away” from any
cluster

CS402 Data Mining & Warehousing


Cluster Analysis Concepts

Cluster Analysis- ConceptsIV

Typical requirements of clustering algorithms in data mining:


● Scalability:
● Algorithms must be scalable over large database that may contain
millions of data objects.
● Ability to deal with different types of attributes:
● Should support not only the interval-based (numerical) data, but also
other types of data, such as binary, categorical (nominal), and ordinal
data, or mixtures of these data types.
● Discovery of clusters with arbitrary shape:
● Algorithms based on distance measures like Euclidean or Manhattan,
tend to find spherical clusters with similar size and density.
● It is important to develop algorithms that can detect clusters of
arbitrary shape.

CS402 Data Mining & Warehousing


Cluster Analysis Concepts

Cluster Analysis- ConceptsV


● Minimal requirements for domain knowledge to determine input
parameters
● Many clustering algorithms require users to input certain parameters in
cluster analysis (such as the number of desired clusters), which are often
difficult to determine.
● The algorithms must have minimal requirements to avoid burden to
users
● Ability to deal with noisy data:
● Some clustering algorithms are sensitive to outliers or missing, unknown,
or erroneous data. and may lead to clusters of poor quality.
● So the algorithms should have a mechanism to deal such data to get
clusters of better quality
Incremental clustering and insensitivity to the order of input
● records:
● Algorithms must incorporate newly inserted data (i.e., database
updates) into existing clustering structures, as well as they should be
insensitive to the order of input.
CS402 Data Mining & Warehousing
Cluster Analysis Concepts

Cluster Analysis-
ConceptsVI

● High dimensionality
● A database or a data warehouse can contain several dimensions or
attributes
● Algorithms should support data objects in high dimensional space
● Constraint-based clustering:
● Real-world applications may need to perform clustering under various
kinds of constraints.
● The algorithms must be capable of satisfying user specified constraints.
Interpretability and usability:

● The clustering results to be interpretable, comprehensible, and usable.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisI


There are mainly two data structures main memory-based clustering
algorithms typically operate on:Data matrix, Dissimilarity matrix
Data matrix (or object-by-variable structure)

● Represents n objects, such as per- sons,


with p variables (also called
measurements or attributes) such as
age, height, weight, gender, and so on
● The structure is in the form of a
relational table, or n-by-p matrix (n
objects × p variables)
● The data set to be clustered contains n
objects, which may represent persons,
houses, documents, countries, and so on
● x_if represents the value of attribute f of the i-th object

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisII


Dissimilarity matrix (or object-by-object structure)

● Stores a collection of proximities that


are available for all pairs of n objects.
● Often represented by an n-by-n table
● where d(i, j) is a nonnegative number that
represents the measured difference or
dissimilarity between objects i and j.
● If d(i, j) is close to 0, then objects i and j
are highly similar or "near" each other.
● d(i, j) becomes larger the more the
objects differ
● Also we have d(i, j) = d(j, i), and
d(i, i) = 0

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisIII

The various types of data in Cluster Analysis are:


Interval-Scaled Variables
● Binary Variables
Categorical Variables
● Ordinal Variables
Ratio-Scaled Variables
● Variables of Mixed Types
● Vector Objects
We also address how object dissimilarity can be computed for each type of
variable

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisIV

Interval-Scaled Variables

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisV


● Interval-scaled variables are continuous measurements of a roughly
linear scale
● Eg; weight and height, latitude and longitude coordinates (e.g., when
clustering houses), and weather temperature.
Here the measurement unit used can affect the clustering analysis
● Ex; changing measurement units from meter(m) to inches(in) for
height, or from kilograms(kg) to pounds(lbs) for weight, may lead
to a very different clustering structure.
● In general, expressing a variable in smaller units will lead to a larger
range for that variable, and thus a larger effect on the resulting
clustering structure.
● Standardization: help avoid dependence on the choice of
measurement units.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisVI


Interval-Scaled Variables (contd...)
Standardization: attempts to give all variables an equal weight
To standardize measurements, one choice is to convert the original
measurements to unitless variables
1. Calculate the mean absolute deviation, s_f:
● if x_1f, ..., x_nf are n measurements of f, and m_f is the mean value of
f, then
s_f = (1/n)(|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|)
2. Calculate the standardized measurement, or z-score:
z_if = (x_if − m_f) / s_f
● The mean absolute deviation, s_f, is more robust to outliers than the
standard deviation σ_f
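A minimal Python sketch of this standardization, assuming a plain list of measurements for one variable f (the height values are illustrative):

```python
def standardize(x):
    """Convert measurements of one variable to unitless z-scores using the
    mean absolute deviation s_f (more robust to outliers than the std. dev.)."""
    n = len(x)
    m_f = sum(x) / n                                   # mean value of f
    s_f = sum(abs(v - m_f) for v in x) / n             # mean absolute deviation
    return [(v - m_f) / s_f for v in x]                # z-scores

heights_cm = [150.0, 160.0, 170.0, 180.0]              # illustrative measurements
print(standardize(heights_cm))                          # [-1.5, -0.5, 0.5, 1.5]
```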

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisVII

Interval-Scaled Variables (contd...)
Computing dissimilarity (or similarity) between objects described by
interval-scaled variables:
● Dissimilarity is computed based on the distance between each pair of
objects
● Can use various distance measures: Euclidean distance, Manhattan
(or city block) distance, Minkowski distance
● If i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two
n-dimensional data objects
● Euclidean Distance:
d(i, j) = sqrt((x_i1 − x_j1)² + (x_i2 − x_j2)² + ... + (x_in − x_jn)²)
CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisVIII


Interval-Scaled Variables (contd...)
● Manhattan (or city block) distance:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_in − x_jn|
● Minkowski distance: a generalization of both Euclidean distance and
Manhattan distance,
d(i, j) = (|x_i1 − x_j1|^p + |x_i2 − x_j2|^p + ... + |x_in − x_jn|^p)^(1/p)
● If p = 1, it is the Manhattan distance (L1 norm); if p = 2, it is the Euclidean
distance (L2 norm)
● Weighted Euclidean distance: used when each variable is assigned a
weight according to its perceived importance

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisIX


Interval-Scaled Variables(contd...)
Both the Euclidean distance and Manhattan distance satisfy the following
mathematic requirements of a distance function:
d(i, j ) ≥0: Distance is a nonnegative number.
d(i, i) =0: The distance of an object to itself is 0.
● d(i, j ) = d(j, i): Distance is a symmetric function.
● d(i, j ) ≤ d(i, h) + d(h, j): Going directly from object i to object j in
space is no more than making a detour over any other object h
(triangular inequality).
Example:
Let x1 = (1, 2), x2 = (3, 5)
Euclidean Distance = √(2² + 3²) = 3.61
Manhattan Distance = 2 + 3 = 5
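The example can be checked with a short Python sketch of the Minkowski family of distances (p = 1 for Manhattan, p = 2 for Euclidean):

```python
def minkowski(x, y, p):
    """Minkowski distance; p = 1 gives Manhattan (L1), p = 2 gives Euclidean (L2)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x1, x2 = (1, 2), (3, 5)
print(round(minkowski(x1, x2, p=2), 2))   # Euclidean: sqrt(2^2 + 3^2) = 3.61
print(minkowski(x1, x2, p=1))             # Manhattan: 2 + 3 = 5.0
```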

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisX


Binary Variables
A binary variable has only two states: 0 or 1
● where 0 means that the variable is absent, and 1 means that it is
present
Eg: smoker variable- 1(patient smokes), 0(patient does not smoke)
● Dissimilarity between two objects i and j, using a 2 − by − 2
contingency table(assume all p variables are binary)

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXI


Binary Variables(contd...)
Out of the p variables of both objects i and j:
● q ∶ Number of variables that equal 1 for both objects i and j
● r ∶ Number of variables that equal 1 for object i but that are 0 for
object j
● s ∶ Number of variables that equal 0 for object i but equal 1 for object j
● t ∶ Number of variables that equal 0 for both objects i and j
● The total number of variables is p, where p = q + r + s + t
● From contingency table we can compute dissimilarity.
● But dissimilarity of symmetric and asymmetric binary variables is
calculated in different ways.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXII

Binary Variables (contd...)
Symmetric binary dissimilarity
● A binary variable is symmetric if both of its states are equally valuable
and carry the same weight
● i.e., there is no preference on which outcome should be coded as 0 or 1
● Eg; gender: male, female
● The dissimilarity can be computed as:
d(i, j) = (r + s) / (q + r + s + t)

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXIII


Binary Variables(contd...)
Asymmetric binary dissimilarity
● A binary variable is asymmetric if the outcomes of the states are not
equally important
Eg; the positive and negative outcomes of a disease test
● Given two asymmetric binary variables, the agreement of two 1s (a
positive match) is considered more significant than that of two 0s
(a negative match). So the number of negative matches, t, is ignored
● Therefore, such binary variables are often considered "monary" (as if
having one state).
● The dissimilarity can be computed as:
d(i, j) = (r + s) / (q + r + s)

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXIV

Binary Variables (contd...)
Asymmetric binary similarity
● We can also compute the asymmetric binary similarity
● The asymmetric binary similarity between the objects i and j, or
sim(i, j), can be computed as:
sim(i, j) = q / (q + r + s) = 1 − d(i, j)

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXV

Binary Variables - Problem


Suppose that a patient record table contains the attributes name, gender,
fever, cough, test-1, test-2, test-3, and test-4, where name is an object
identifier, gender is a symmetric attribute, and the remaining attributes are
asymmetric binary.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXVI


Binary Variables - Problem (contd...)
● For asymmetric attribute values, let the values Y (yes) and P (positive)
be set to 1, and the value N (no or negative) be set to 0
● Suppose that the distance between objects (patients) is computed
based only on the asymmetric variables
● Distance between each pair of the three patients, Jack, Mary, and Jim is:

● Mary and Jim: unlikely to have a similar disease, due to the higher dissimilarity value
● Jack and Mary: most likely to have a similar disease
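A minimal Python sketch of the contingency counts and the two dissimilarity formulas; the 0/1 vectors below are illustrative, not the actual patient table:

```python
def binary_dissimilarity(i, j, symmetric=True):
    """Dissimilarity between two 0/1 vectors via the contingency counts q, r, s, t."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    if symmetric:
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)        # asymmetric: negative matches t are ignored

# Illustrative asymmetric attributes (1 = Y/P, 0 = N) for two hypothetical patients
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 1, 0, 0, 0, 0]
print(round(binary_dissimilarity(patient_a, patient_b, symmetric=False), 2))  # 2/3 ≈ 0.67
```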

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXVII


Categorical Variables
● A categorical variable is a generalization of the binary variable in that it
can take on more than two states
Eg; map color with 5 states: red, yellow, green, pink, and blue
● If M is number of states of a categorical variable,the states can be
denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M
● Computing dissimilarity between objects i and j described by
categorical variables:
d(i, j) = (p − m) / p
● m ∶ number of matches (i.e., the number of variables for which i and
j are in the same state)
● p ∶ the total number of variables.
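A one-function Python sketch of this simple matching dissimilarity (the state values are illustrative):

```python
def categorical_dissimilarity(i, j):
    """d(i, j) = (p - m) / p, where m is the number of matching states."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

# With a single categorical variable (p = 1), the distance is 0 on a match, 1 otherwise
print(categorical_dissimilarity(["code-A"], ["code-B"]))  # 1.0
print(categorical_dissimilarity(["code-A"], ["code-A"]))  # 0.0
```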

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXVIII
Categorical Variables- Example

Suppose that we have the sample data of


Table with only the object-identifier and the
variable (or attribute) test-1 are available,
where test-1 is categorical.
● here we have one categorical variable,
test-1, we set p = 1

d(i, j ) evaluates to 0 if objects i and j
match, and 1 if the objects differ.
● The dissimilarity matrix will be

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXIX


Ordinal Variables
There are two types of Ordinal variables : Discrete, Continuous
● Discrete ordinal variable: resembles a categorical variable, except that
the M states of the ordinal value are ordered in a meaningful sequence
Eg; professional ranks: assistant, associate, and full for professors
● Continuous ordinal variable: a set of continuous data of an unknown
scale; that is, the relative ordering of the values is essential but their
actual magnitude is not.
Eg;relative ranking in a particular sport: gold, silver, bronze
Ordinal variables may also be obtained from the discretization of
● interval-scaled quantities by splitting the value range into a finite
number of classes.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXX


Ordinal Variables (contd...)
Dissimilarity between ordinal variables
Let f be a variable from a set of ordinal variables describing n objects,
which has M_f states. These states define the ranking 1, ..., M_f
1. Suppose the value of f for the i-th object is x_if. Replace each x_if by
its corresponding rank, r_if ∈ {1, ..., M_f}.
2. Map the range of each variable onto [0.0, 1.0] so that each variable
has equal weight:
z_if = (r_if − 1) / (M_f − 1)
3. Compute the dissimilarity using the distance measures for interval-scaled
variables, using z_if to represent the f value for the i-th object.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXXI


Ordinal Variables - Example
Suppose that we have the sample data of the
Table, except that this time only the object-
identifier and the continuous ordinal
variable, test-2, are available.
● There are three states for test-2, namely
fair, good, and excellent, that is, M_f = 3
1. Replace each value for test-2 by its
rank: 3, 1, 2, and 3, respectively
2. Normalize the ranking by mapping
rank 1 to 0.0, 2 to 0.5, and 3 to 1.0
3. Using the Euclidean distance results in
the dissimilarity matrix given

CS402 Data Mining & Warehousing April 18, 202086 / 126


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXXII


Ratio-Scaled Variables
● A ratio-scaled variable makes a positive measurement on a nonlinear
scale, such as an exponential scale, approximately following the formula
Ae^(Bt) or Ae^(−Bt)
where A and B are positive constants, and t typically represents time
● Eg; growth of a bacteria population, decay of a radioactive element.
● Dissimilarity between ratio-scaled variables: three methods
● Treat ratio-scaled variables like interval-scaled variables
● Apply a logarithmic transformation to a ratio-scaled variable f having
value x_if for object i by using the formula y_if = log(x_if). The y_if
values can be treated as interval-valued
● Treat x_if as continuous ordinal data and treat their ranks as
interval-valued.

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXIII
Ratio-scaled Variables- Example

We have the sample data of Table, except


that only the object-identifier and the ratio-
scaled variable, test-3, are available
Try a logarithmic transformation.
● Taking the log of test-3 results in the
values 2.65, 1.34, 2.21, and 3.08 for the
objects 1 to 4, respectively
● Use the Euclidean distance on the
transformed values, we obtain the
dissimilarity matrix

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXIV
Variables of Mixed Types
● Finds the dissimilarity between objects described by variables of mixed
types, where these types may be either interval-scaled, symmetric
binary, asymmetric binary, categorical, ordinal, or ratio-scaled.
● Approach 1: Perform a separate cluster analysis for each variable
type
● Approach 2: Process all variable types together, performing a single
cluster analysis
● In the second approach, we combine the different variables into a single
dissimilarity matrix, bringing all of the meaningful variables onto a
common scale of the interval [0.0, 1.0]

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster AnalysisXXV

Variables of Mixed Types (contd...)

● Suppose p is the number of mixed-type variables; the dissimilarity d(i, j)
between objects i and j is defined as:
d(i, j) = Σ_{f=1}^{p} δ_ij^(f) d_ij^(f) / Σ_{f=1}^{p} δ_ij^(f)
● where the indicator δ_ij^(f) = 0 if either
● x_if or x_jf is missing (there is no measurement of variable f for object i
or object j), or
● x_if = x_jf = 0 and variable f is asymmetric binary;
● otherwise, δ_ij^(f) = 1

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXVI
Variables of Mixed Types(contd...)
The contribution d_ij^(f) of variable f is computed depending on its type:

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXVII
Mixed Type variables- Example
Let’s compute a dissimilarity matrix for the objects of Table. Here all of
the variables, which are of different type

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXVIII
Mixed Type variables - Example
● For test-1 (categorical) and test-2 (ordinal), we worked out the dissimilarity
matrices for each of the individual variables
● For test-3, the log transformation results in the values 2.65, 1.34, 2.21, and 3.08 for
objects 1 to 4, respectively, which are then normalized
● We can now use the dissimilarity matrices for the three variables in
our computation
● d(2, 1) = (1(1) + 1(1) + 1(0.75)) / 3 = 0.92
● The dissimilarity matrices for test-1, test-2, test-3 and the resulting
dissimilarity matrix are given below
● The matrix indicates that objects 4 and 2 are the least similar.
● The object pair 4 and 1 is more similar

CS402 Data Mining & Warehousing


Cluster Analysis Types of Data in Cluster Analysis

Types of Data in Cluster


AnalysisXXIX
Mixed Type variables- Example

Figure:Dissimilarity matrix for test-3 Figure:Dissimilarity matrix for test-1

Figure:Final Dissimilarity matrix for all Figure:Dissimilarity matrix for test-2

CS402 Data Mining & Warehousing


Cluster Analysis Categorization of Clustering Methods

Categorization of Clustering MethodsI


● Partitioning methods:
● A partitioning method constructs k partitions from a given
database of n objects or data tuples, where each partition represents a
cluster and k ≤ n.
● ie; it classifies the data into k groups which together satisfy the
following requirements:
(1)each group must contain at least one object
(2)each object must belong to exactly one group
● Given k, the number of partitions to construct, a partitioning method
creates an initial partitioning
● Then uses an iterative relocation technique that attempts to improve the
partitioning by moving objects from one group to another.
● General criterion of a good partitioning: objects in the same cluster are
“close” or related to each other, whereas objects of different clusters are
“far apart” or very different
● There are various kinds of other criteria for judging the quality of
partitions.

CS402 Data Mining & Warehousing


Cluster Analysis Categorization of Clustering Methods

Categorization of Clustering MethodsII


● There are two types of clustering methods based on partioning
1. k-means algorithm: where each cluster is represented by the mean
value of the objects in the cluster
2. k-medoids algorithm:where each cluster is represented by one of the
objects located near the center of the cluster.
● Hierarchical methods:
● A hierarchical method creates a hierarchical decomposition of the given
set of data objects
● A hierarchical method can be classified as being either agglomerative
or divisive, based on how the hierarchical decomposition is formed.
● Agglomerative methods:
● also called the bottom-up approach, start with each object forming a
separate group
● They successively merge the objects or groups that are close to one another,
until all of the groups are merged into one (the topmost level
of the hierarchy), or until a termination condition holds.
● Divisive methods:

CS402 Data Mining & Warehousing


Cluster Analysis Categorization of Clustering Methods

Categorization of Clustering
MethodsIII
● also called the top-down approach, starts with all of the objects in the same
cluster.
● In each successive iteration, a cluster is split up into smaller clusters, until
eventually each object is in one cluster, or until a termination condition
holds.
● Hierarchical methods suffer from the fact that once a step (merge or
split) is done, it can never be undone.
● Heirachical Clustering Methods: BIRCH
● Density-based methods:
● Clustering methods have been developed based on the notion of
density(number of objects or data points)
● The general idea is to continue growing the given cluster as long as the
density (number of objects or data points) in the “neighborhood” exceeds
some threshold
● ie; for each data point within a given cluster, the neighborhood of a
given radius has to contain at least a minimum number of points

CS402 Data Mining & Warehousing April 18, 202097 / 126


Cluster Analysis Categorization of Clustering Methods

Categorization of Clustering MethodsIV


● Such a method can be used to filter out noise (outliers) and discover
clusters of arbitrary shape.
● Density based clustering methods: DBSCAN and its extension,
OPTICS
● Grid-based methods
● Quantize the object space into a finite number of cells that form a grid
structure.
● All of the clustering operations are performed on the grid structure
(i.e., on the quantized space).
● The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only
on the number of cells in each dimension in the quantized space.
● Example: STING
● Model-based methods
● Hypothesize a model for each of the clusters and find the best fit of
the data to the given model

CS402 Data Mining & Warehousing


Cluster Analysis Categorization of Clustering Methods

Categorization of Clustering MethodsV

● A model-based algorithm may locate clusters by constructing a density


function that reflects the spatial distribution of the data points
● It also leads to a way of automatically determining the number of
clusters based on standard statistics
● Takes “noise” or outliers into account and thus yielding robust
clustering methods.

CS402 Data Mining & Warehousing


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringI

Steps:
1 Choose k, the number of clusters
2 Select at random k points, the centroids (not necessarily from your
dataset)
3 Assign each data point to the closest centroid, which forms k
clusters
4 Compute and place the new centroids
● The new centroid of each cluster will be the mean value of
the objects in that cluster
5 Reassign each data point to the new closest centroid (mean value).
● If any reassignment took place, repeat from step 4
● Otherwise STOP

The process of iteratively reassigning objects to clusters to improve the
partitioning is referred to as iterative relocation
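A minimal Python sketch of these steps for 2-D points (random initial centroids, squared-Euclidean assignment, mean update); it illustrates the iterative relocation loop, not any particular library implementation, and the data points are illustrative:

```python
import random

def kmeans(points, k, max_iter=100):
    """Minimal k-means for 2-D points: returns (centroids, cluster label per point)."""
    centroids = random.sample(points, k)                      # step 2: random centroids
    for _ in range(max_iter):
        # steps 3/5: assign each point to its closest centroid
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        # step 4: recompute each centroid as the mean of its cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:                        # no reassignment possible: stop
            break
        centroids = new_centroids
    return centroids, labels

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]       # illustrative points
print(kmeans(data, k=2))
```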

CS402 Data Mining & Warehousing April 18, 2020100 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringII

CS402 Data Mining & Warehousing April 18, 2020101 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringIII

CS402 Data Mining & Warehousing April 18, 2020102 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringIV

CS402 Data Mining & Warehousing April 18, 2020103 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringV

CS402 Data Mining & Warehousing April 18, 2020104 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringVI

CS402 Data Mining & Warehousing April 18, 2020105 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringVII

CS402 Data Mining & Warehousing April 18, 2020106 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringVIII

CS402 Data Mining & Warehousing April 18, 2020107 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringIX

CS402 Data Mining & Warehousing April 18, 2020108 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringX

CS402 Data Mining & Warehousing April 18, 2020109 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXI

CS402 Data Mining & Warehousing April 18, 2020110 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXII

CS402 Data Mining & Warehousing April 18, 2020111 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXIII

CS402 Data Mining & Warehousing April 18, 2020112 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXIV

k means - Problem
Use the k-means clustering algorithm to divide the following data into two
clusters and also compute the representative data points for the
clusters.

CS402 Data Mining & Warehousing April 18, 2020113 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXV

Solution
(1) No. of clusters, we have k =2
(2) Choose two points arbitrarily as the initial cluster centroids.

(3) To assign each data point to the closest centroid compute distances

CS402 Data Mining & Warehousing April 18, 2020114 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXVI

CS402 Data Mining & Warehousing April 18, 2020115 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means


ClusteringXVII
(4)Recalculate new centroids

CS402 Data Mining & Warehousing April 18, 2020116 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means


ClusteringXVIII
(5)Compute distances and Reassign data points to new centroids

CS402 Data Mining & Warehousing April 18, 2020117 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXIX

(6)Recalculate new centroids

CS402 Data Mining & Warehousing April 18, 2020118 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXX


(7)Compute distances and Reassign data points to new centroids

CS402 Data Mining & Warehousing April 18, 2020119 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Means ClusteringXXI


(8)Recalculate new centroids
● We note that these are identical to the cluster centres calculated in
Step 6.
● So there will be no reassignment of data points to different clusters and
hence the computations are stopped here.

(9)Conclusion
● The k means clustering algorithm with k =2 applied to the dataset in
Table yields the following clusters and the associated cluster centres

CS402 Data Mining & Warehousing April 18, 2020120 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Medoid ClusteringI


● In k-medoid clustering, each cluster is represented by one of the
objects located near the center of the cluster.
Steps:
1 Choose k, the number of clusters
2 Select at random k objects in D as the initial nearest representatives or
seeds
3 Assign each data point(object) to the closest representative object,
which forms k clusters
4 Randomly select a nonrepresentative object, O′
5 Compute the cost change, S, of swapping a representative object, Oj,
with O′
6 If S < 0, then swap Oj with O′ to form the new set of k
representative objects
7 Repeat steps 3 to 6 until there is no change, then STOP
CS402 Data Mining & Warehousing April 18, 2020121 / 126
Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Medoid ClusteringII

K-medoid Algorithm- Example

Use the k-medoid clustering algorithm to
divide the given data into two clusters
and also compute the
representative data points for the
clusters.

CS402 Data Mining & Warehousing April 18, 2020122 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Medoid


ClusteringIII
Solution:
(1) We have k = 2, the number of clusters
(2) Initialize random medoids c1 = (3, 4) and c2 = (7, 4)
(3) Calculate the distance (cost) so as to associate each data object with its
nearest medoid

CS402 Data Mining & Warehousing April 18, 2020123 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Medoid ClusteringIV

● The distance (cost) between c_i = (a, b) and X_i = (c, d) is calculated as
cost = |a − c| + |b − d|, which is the Manhattan distance
● Calculate the total cost (S1) as the summation of the cost of each data object
from the medoid of its cluster, which gives S1 = 20
● This divides the data into two clusters: Cluster 1: {X1, X2, X3, X4}
and Cluster 2: {X5, X6, X7, X8, X9, X10}
(4) Select another random medoid O′ = (7, 3). Now the medoids are
c1 = (3, 4) and O′ = (7, 3); calculate the new total cost (S′)

CS402 Data Mining & Warehousing April 18, 2020124 / 126


Cluster Analysis Categorization of Clustering Methods

Partitioning methods: K-Medoid ClusteringV

● We have S′ = 22. The cost of swapping the medoid from
the old (c2) to the new (O′) is:
S = S′ − S1 = 22 − 20 = 2 > 0
● The positive value indicates a higher cost if we swap to the new medoid.
So moving to O′ would be a bad idea; hence the previous choice was
good and the algorithm terminates here.
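The swap-cost computation in this example can be sketched in Python as below, using Manhattan distance. The ten data points are taken from the commonly used version of this example (the document's own table is in a figure), and they reproduce S1 = 20 and S′ = 22:

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(points, medoids):
    """Sum, over all points, of the Manhattan distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

# Assumed data points for this example (not reproduced in the text itself).
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
S1 = total_cost(points, [(3, 4), (7, 4)])        # current medoids c1, c2
S_new = total_cost(points, [(3, 4), (7, 3)])     # swap c2 with candidate O' = (7, 3)
print(S1, S_new, S_new - S1)                     # 20 22 2: swap only if the change is < 0
```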

CS402 Data Mining & Warehousing April 18, 2020125 / 126


Cluster Analysis Categorization of Clustering Methods

CS402 Data Mining & Warehousing April 18, 2020126 / 126
