
**Sampling:**

- **Objective:** Sampling mines frequent itemsets on a subset (sample) of the given data rather than the entire dataset. Trading a small amount of accuracy for efficiency allows much faster processing, particularly when dealing with large datasets.

- **Approach:**

- **Random Sample:** A random sample \( S \) of the original dataset \( D \) is selected.

- **Frequent Itemset Search:** Frequent itemsets are then mined within the sample \( S \) instead
of the entire dataset \( D \).

- **Sample Size:** The size of \( S \) is chosen such that the search for frequent itemsets can be performed efficiently in main memory, typically requiring only one scan of the transactions in \( S \).

- **Local Frequent Itemsets:** To account for the possibility of missing some global frequent
itemsets due to sampling, a lower support threshold than the minimum support is used to identify
frequent itemsets local to \( S \) (denoted as \( L_S \)).

- **Actual Frequency Computation:** The remaining portion of the dataset \( D \) is then used to
compute the actual frequencies of each itemset in \( L_S \).

- **Verification Mechanism:** A mechanism is employed to determine whether all global frequent itemsets are included in \( L_S \). If so, only one scan of \( D \) is required; otherwise, a second pass may be conducted to find the missed frequent itemsets.

- **Benefits:** Sampling is especially advantageous for computationally intensive applications where efficiency is crucial, as it reduces processing time and memory requirements while still providing meaningful insights from the data. A minimal code sketch of this approach follows below.
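
Below is a minimal sketch of the sample-then-verify idea in Python. The helper names (`frequent_itemsets`, `sample_then_verify`), the naive enumeration of candidate itemsets, and the chosen parameter values are illustrative assumptions, not a reference implementation.

```python
import random
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_count, max_len=3):
    """Naive frequent-itemset search: count all item combinations up to max_len."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    return {itemset for itemset, c in counts.items() if c >= min_count}

def sample_then_verify(D, min_sup=0.05, sample_frac=0.1, lowered_factor=0.75, seed=0):
    """Mine a random sample S with a lowered threshold, then scan D once to
    compute the actual frequencies of the locally frequent itemsets L_S."""
    rng = random.Random(seed)
    S = [t for t in D if rng.random() < sample_frac]
    # A lowered support threshold reduces the risk of missing global frequent itemsets.
    local = frequent_itemsets(S, max(1, int(lowered_factor * min_sup * len(S))))
    # One scan of D to obtain the actual count of every itemset in L_S.
    actual = Counter()
    for t in D:
        items = set(t)
        for itemset in local:
            if set(itemset) <= items:
                actual[itemset] += 1
    return {i for i in local if actual[i] >= min_sup * len(D)}
```

A verification step would then check whether any global frequent itemset could have been missed and, if so, trigger a second pass over \( D \).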

**Dynamic Itemset Counting:**

- **Objective:** Dynamic itemset counting aims to optimize the process of identifying frequent
itemsets by adding candidate itemsets dynamically at different points during a database scan.

- **Approach:**

- **Database Partitioning:** The database is partitioned into blocks marked by start points.

- **Candidate Itemset Addition:** Unlike the Apriori algorithm, which determines new candidate
itemsets only before each complete database scan, dynamic itemset counting allows for the addition
of new candidate itemsets at any start point during the scan.

- **Count-so-far:** The count-so-far (the count accumulated up to the current point in the scan) is
used as the lower bound of the actual count of an itemset.

- **Minimum Support Check:** If the count-so-far exceeds the minimum support threshold, the
itemset is considered frequent and added to the collection of frequent itemsets. It can then be used
to generate longer candidate itemsets.

- **Advantages:** This technique reduces the number of database scans compared to the Apriori algorithm, improving the efficiency of finding all frequent itemsets in the dataset; a simplified sketch of the idea follows below.
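
A heavily simplified sketch of dynamic itemset counting follows: block boundaries act as start points, counts-so-far are accumulated during the scan, and new candidates are added mid-scan. The function name and the promotion rule are illustrative assumptions, and the full DIC bookkeeping (re-counting newly added candidates over the blocks they missed) is omitted.

```python
from collections import defaultdict

def dynamic_itemset_counting(transactions, min_count, block_size=100):
    """Sketch of DIC-style counting: candidates can be added at any start point."""
    counts = defaultdict(int)                                       # count-so-far per candidate
    candidates = {frozenset([i]) for t in transactions for i in t}  # begin with 1-itemsets
    frequent = set()
    for pos, t in enumerate(transactions):
        items = frozenset(t)
        for c in candidates:
            if c <= items:
                counts[c] += 1                                      # lower bound of the true count
        at_start_point = (pos + 1) % block_size == 0 or pos == len(transactions) - 1
        if at_start_point:
            # Promote candidates whose count-so-far already reaches min_count and
            # immediately generate longer candidates from them, without waiting
            # for the scan to finish (unlike Apriori).
            newly_frequent = {c for c in candidates if counts[c] >= min_count} - frequent
            frequent |= newly_frequent
            for a in newly_frequent:
                for b in frequent:
                    union = a | b
                    if len(union) == len(a) + 1:
                        candidates.add(union)
    return frequent
```
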
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is a clustering algorithm
designed for efficiently clustering large volumes of numeric data. It integrates hierarchical clustering
at the microclustering stage and iterative partitioning at the macroclustering stage to overcome
scalability issues and the inability to undo previous clustering steps commonly encountered in
agglomerative clustering methods.

#### Clustering Feature Trees (CF-trees):

- BIRCH uses the concept of clustering features and CF-trees to represent cluster hierarchies
efficiently.

- A clustering feature (CF) summarizes information about a cluster of objects, typically represented as a 3-D vector ⟨n, LS, SS⟩ containing the number of data points (n), their linear sum (LS), and their square sum (SS).

- Key statistics of a cluster, such as centroid, radius, and diameter, can be derived from its clustering
feature.

- Clustering features enable the summarization of clusters without storing detailed information about
individual objects, resulting in significant space savings.

- Clustering features are additive, allowing the feature of a merged cluster to be computed by simply summing the features of its parts, as shown in the sketch below.
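
As a concrete illustration, here is a minimal Python sketch of a clustering feature and the statistics derivable from it; the function names are assumptions made for illustration, not BIRCH's actual code.

```python
import numpy as np

def clustering_feature(points):
    """CF = <n, LS, SS>: the count, linear sum, and square sum of the points."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), float((X ** 2).sum())

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    # Average distance of members from the centroid, derived from the CF alone.
    n, ls, ss = cf
    return np.sqrt(max(0.0, ss / n - float(ls @ ls) / n ** 2))

def merge(cf1, cf2):
    # Additivity: the CF of a merged cluster is the component-wise sum of the CFs.
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]
```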

#### CF-tree Structure:

- A CF-tree is a height-balanced tree used to store clustering features for hierarchical clustering.

- Non-leaf nodes in the tree summarize clustering information about their children by storing sums of
the clustering features.

- CF-trees have two parameters: branching factor (B) and threshold (T), which control the tree's size
and structure.

- The threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes, and thus controls the granularity of the clustering (see the insertion sketch below).
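
The sketch below illustrates, under simplified assumptions, how a leaf entry absorbs a new point only while its diameter stays within the threshold T; the class and function names are illustrative, and a real CF-tree would split a leaf that exceeds the branching factor B rather than merely assert.

```python
import numpy as np

class CFEntry:
    """Clustering feature <n, LS, SS> for one subcluster stored at a leaf."""
    def __init__(self, x):
        x = np.asarray(x, dtype=float)
        self.n, self.ls, self.ss = 1, x.copy(), float(x @ x)

    def diameter_if_added(self, x):
        # Diameter (average pairwise distance) after adding x, from CF statistics only.
        n, ls, ss = self.n + 1, self.ls + x, self.ss + float(x @ x)
        return np.sqrt(max(0.0, (2 * n * ss - 2 * float(ls @ ls)) / (n * (n - 1))))

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

def insert_into_leaf(leaf, x, T, B):
    """Absorb x into the closest entry if the diameter stays within T,
    otherwise start a new entry (a full CF-tree would split the leaf past B)."""
    x = np.asarray(x, dtype=float)
    if leaf:
        closest = min(leaf, key=lambda e: np.linalg.norm(e.ls / e.n - x))
        if closest.diameter_if_added(x) <= T:
            closest.add(x)
            return leaf
    leaf.append(CFEntry(x))
    assert len(leaf) <= B, "leaf overflow: a real CF-tree would split this node"
    return leaf
```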

#### Multiphase Clustering Technique:

- BIRCH employs a multiphase clustering technique to efficiently cluster large datasets while
minimizing input/output (I/O) operations.

- Phase 1 involves building an initial in-memory CF-tree by scanning the database, dynamically
updating the tree as objects are inserted.

- Phase 2 applies a selected clustering algorithm to cluster the leaf nodes of the CF-tree, removing
sparse clusters as outliers and grouping dense clusters into larger ones.

- The method is incremental, allowing new objects to be inserted and the size of the CF-tree to be adjusted dynamically; a brief usage sketch follows below.
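
For a sense of how the two phases look in practice, the sketch below uses scikit-learn's Birch estimator (assuming scikit-learn is installed), which exposes the threshold and branching factor and applies a global clustering step to the leaf subclusters; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(100, 2)) for loc in (0.0, 5.0, 10.0)])

# Phase 1: a CF-tree is built incrementally as points are inserted (threshold T,
# branching factor B). Phase 2: the leaf subclusters are grouped into n_clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```
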
#### Effectiveness and Limitations:

- BIRCH exhibits linear scalability with respect to the number of objects to be clustered.

- While experiments have shown good quality clustering results, BIRCH may not perform well for
non-spherical clusters due to its reliance on radius or diameter-based clustering boundaries.

- Each node in a CF-tree can only hold a limited number of entries, which may not always correspond
to natural clusters perceived by users.

- The ideas of clustering features and CF-trees have been widely adopted beyond BIRCH for clustering
streaming and dynamic data.

In summary, BIRCH efficiently addresses the challenges of clustering large numeric datasets by
integrating hierarchical and iterative clustering techniques, utilizing clustering features and CF-trees
to represent cluster hierarchies, and applying a multiphase clustering approach for scalability and
quality clustering results.

Chameleon is a hierarchical clustering algorithm that utilizes dynamic modeling to determine the
similarity between pairs of clusters. Unlike traditional clustering methods, Chameleon assesses
cluster similarity based on both the internal connectivity of objects within a cluster and the proximity
of clusters to each other. This dynamic approach allows Chameleon to automatically adapt to the
characteristics of the data being clustered without relying on a static user-supplied model.

### Algorithm Overview:

1. **Cluster Similarity Assessment**: Chameleon determines the similarity between clusters based
on their internal connectivity and proximity.

2. **Graph Construction**: It constructs a k-nearest-neighbor graph, where each vertex represents a data object and edges between vertices are weighted based on object similarity (a sketch of this construction follows the list).

3. **Graph Partitioning**: A graph partitioning algorithm is applied to partition the k-nearest-neighbor graph into relatively small subclusters, minimizing edge cuts.

4. **Hierarchical Merging**: Chameleon uses an agglomerative hierarchical clustering algorithm to iteratively merge subclusters based on their similarity.
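
A minimal sketch of the k-nearest-neighbor graph construction referenced in step 2; the weighting scheme (an inverse-distance similarity) is an assumption made for illustration.

```python
import numpy as np

def knn_similarity_graph(X, k=5):
    """Each object keeps edges only to its k closest objects, weighted by similarity."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)                 # no self-edges
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[:k]
        W[i, nbrs] = 1.0 / (1.0 + dist[i, nbrs])   # higher weight = more similar
    return np.maximum(W, W.T)                      # symmetrize the adjacency matrix
```

A graph partitioning step (e.g., a min-cut partitioner such as METIS) would then split this graph into the small subclusters that the merging phase operates on.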

### Cluster Similarity Metrics:

- **Relative Interconnectivity (RI)**: Measures the absolute interconnectivity between two clusters,
normalized by their internal interconnectivity.

- **Relative Closeness (RC)**: Measures the absolute closeness between two clusters, normalized by
their internal closeness.

### Relative Interconnectivity (RI):


\[ RI(C_i, C_j) = \frac{\sum_{e \in EC_{(C_i, C_j)}} w(e)}{\frac{1}{2} \left( ECC_i + ECC_j \right)} \]

- \( EC_{(C_i, C_j)} \): Edge cut between clusters \( C_i \) and \( C_j \).

- \( ECC_i \) (or \( ECC_j \)): Minimum sum of cut edges partitioning \( C_i \) (or \( C_j \)) into two
roughly equal parts.

- \( w(e) \): Weight of edge \( e \).

### Relative Closeness (RC):

\[ RC(C_i, C_j) = \frac{SEC_{(C_i, C_j)}}{\frac{|C_i|}{|C_i| + |C_j|} \, SECC_i + \frac{|C_j|}{|C_i| + |C_j|} \, SECC_j} \]

- \( SEC_{(C_i, C_j)} \): Average weight of edges connecting vertices in \( C_i \) to vertices in \( C_j \).

- \( SECC_i \) (or \( SECC_j \)): Average weight of edges belonging to the mincut bisector of \( C_i \) (or \( C_j \)). Both measures are computed in the sketch below.
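
A small arithmetic sketch of both measures, taking the aggregate edge-cut and bisector statistics as precomputed inputs; the function names are illustrative assumptions.

```python
def relative_interconnectivity(ec_weight, ecc_i, ecc_j):
    """RI: total weight of the edge cut between C_i and C_j, normalized by the
    average weight of their internal min-cut bisectors."""
    return ec_weight / (0.5 * (ecc_i + ecc_j))

def relative_closeness(sec_ij, secc_i, secc_j, n_i, n_j):
    """RC: average weight of the connecting edges, normalized by the size-weighted
    average of the internal min-cut bisector edge weights."""
    total = n_i + n_j
    return sec_ij / ((n_i / total) * secc_i + (n_j / total) * secc_j)

# Chameleon merges the pair of subclusters that maximizes a combination of the two,
# such as RI(C_i, C_j) * RC(C_i, C_j) ** alpha for a user-specified alpha.
```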

### Advantages and Limitations:

- **Advantages**:

- Capable of discovering arbitrarily shaped clusters of high quality.

- Outperforms algorithms like BIRCH and density-based DBSCAN in certain scenarios.

- **Limitations**:

- High processing cost for high-dimensional data, potentially requiring \( O(n^2) \) time for \( n \)
objects in the worst case.

In summary, Chameleon offers a dynamic approach to hierarchical clustering, leveraging internal connectivity and proximity metrics to determine cluster similarity. While it excels at discovering diverse clusters, it may suffer from high computational costs for high-dimensional data.
