CZ4032 Data Analytics & Mining Notes
Underfitting – when the model is too simple, both training and test errors are large
Insufficient Examples:
o Lack of data points in the lower half of the diagram makes it difficult to
predict correctly the class labels of that region
o Insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant
to the classification task
Overfitting – decision trees that are more complex than necessary
The learning algorithm has access only to the training set during model building. It
has no knowledge of the test set
Solution 1: Pre-Pruning
o Stop the algorithm before it becomes a fully-grown tree
o Typical stop conditions:
All instances belong to the same class
Attribute values are the same
o Restrictive conditions:
Number of instances is less than some user-specified threshold
Class distribution of instances is independent of the available
features
Expanding the current node does not improve impurity measures
Solution 2: Post-pruning
o Grow decision tree to its entirety
o Trim the nodes of decision tree in a bottom-up fashion
o If generalization error improves after trimming, replace sub-tree by a leaf
node
o Class label of a leaf node is determined from majority class of instances in the
sub-tree
o Can use Minimum Description Length (MDL) for post pruning
Estimating Generalization Errors
Occam’s Razor: Given two models of similar generalization errors, one should prefer the
simpler model over the more complex model
For complex models, there is a greater chance that the model was fitted accidentally to
errors in the data
Minimum Description Length:
o Cost (Model, Data) = Cost (Data|Model) + Cost (Model)
Cost: number of bits needed for encoding
Search for the least costly model
o Cost (Data|Model): misclassification errors
o Cost (Model): node encoding (number of children) + splitting condition
encoding
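The MDL cost above can be made concrete with a small sketch. The specific encoding costs below (bits per error, per split attribute, per leaf) are illustrative assumptions, not from the notes:

```python
import math

def mdl_cost(num_errors, num_records, num_internal_nodes, num_leaves, num_attributes):
    """Total description length of a decision tree plus its misclassifications.
    Illustrative encoding (an assumption, not the course's exact scheme):
    - each misclassified record costs log2(n) bits to identify,
    - each internal node costs log2(m) bits to name its split attribute,
    - each leaf costs 1 bit for its class label (two-class case).
    """
    cost_data_given_model = num_errors * math.log2(num_records)
    cost_model = num_internal_nodes * math.log2(num_attributes) + num_leaves * 1
    return cost_data_given_model + cost_model

# Prefer the tree with the smaller total cost: here the simpler tree wins
# even though it makes a few more errors.
simple_tree = mdl_cost(num_errors=10, num_records=1000,
                       num_internal_nodes=3, num_leaves=4, num_attributes=16)
complex_tree = mdl_cost(num_errors=6, num_records=1000,
                        num_internal_nodes=15, num_leaves=16, num_attributes=16)
```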
Instance Based Classifiers:
Things Required:
o Set of stored training records
o Distance metric to compute distance between records
o Value of k, the number of nearest neighbours to retrieve
If k is too small, sensitive to noise points
If k is too large, neighbourhood may include points from other classes
To classify an unknown record:
o Compute distance to other training records
o Identify k nearest neighbours
o The class labels of the nearest neighbours to determine the class label of the
unknown vote
Computation:
o Euclidean distance: d(p, q) = √(Σᵢ (pᵢ − qᵢ)²)
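The classification steps above can be sketched as a minimal k-NN classifier (plain Python; the function names and the toy training set are illustrative):

```python
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, test_point, k):
    """train: list of (features, label) pairs.
    Compute distances, take the k nearest records, return the majority label."""
    neighbours = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
         ((5.0, 5.0), 'B'), ((4.8, 5.2), 'B')]
knn_classify(train, (1.1, 0.9), 3)  # two of the 3 nearest are 'A'
```

Note how k trades off the two failure modes above: k = 1 follows noise points, a very large k pulls in records from other classes.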
Naive Bayes: if one of the conditional probabilities is zero, then the entire
product becomes zero
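The standard remedy for the zero-probability problem is Laplace (add-one) smoothing; a minimal sketch (the function name and counts are illustrative):

```python
def conditional_prob(count_attr_and_class, count_class, num_attr_values, smoothed=True):
    """Estimate P(attribute value | class).
    With Laplace smoothing, no estimate is exactly zero, so the Naive Bayes
    product of conditional probabilities cannot collapse to zero."""
    if smoothed:
        return (count_attr_and_class + 1) / (count_class + num_attr_values)
    return count_attr_and_class / count_class

conditional_prob(0, 10, 3, smoothed=False)  # 0.0 – zeroes out the whole product
conditional_prob(0, 10, 3)                  # (0 + 1) / (10 + 3), small but nonzero
```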
To know which line is better, we want to find a hyperplane that maximizes the
margin:
Types of Clustering:
o Partitional Clustering: A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
o Hierarchical Clustering: A set of nested clusters organized as a hierarchical
tree
o Contiguous Cluster (Nearest neighbour or Transitive): A cluster is a set of
points such that a point in a cluster is closer to one or more other points in
the cluster than to any point not in the cluster
o Density-based: A cluster is a dense region of points, separated from other
regions of high density by low-density regions
Used when clusters are irregular or intertwined, and when noise and
outliers are present
o Conceptual Cluster: Finds clusters that share some common property or
represent a particular concept
Other Distinctions:
o Exclusive versus non-exclusive
o Fuzzy vs non-fuzzy
o Partial versus complete
o Heterogeneous versus homogeneous
K-means Clustering:
o Partitional clustering approach
o Each cluster is associated with a centroid
o Each point is assigned to the cluster with the closest centroid (Euclidean
distance, cosine similarity, etc.)
o Number of clusters, K, must be specified
o Complexity: O(n * K * I * d), where n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
o Choice of initial centroids is extremely important
o Sum of Squared Error (SSE): for every point, the error is its distance to the
nearest centroid; SSE sums these squared errors over all points
o Solution:
Multiple runs
Sample and use hierarchical clustering to determine initial centroids
Select more than k initial centroids and then select among these initial
centroids: Select more widely separated
Postprocessing
Bisecting K-means: Not as susceptible to initialization issues
o Handling Empty Clusters:
Choose the point that contributes most to SSE
Choose a point from the cluster with the highest SSE
If there are several empty clusters, the above can be repeated several
times
o Updating Centres Incrementally after each assignment:
Each assignment updates zero or two centroids
More expensive
Introduces an order dependency
Never get an empty cluster
Can use “weights” to change the impact
o Pre-Processing:
Normalize the data
Eliminate outliers
o Post-processing:
Eliminate small clusters that may represent outliers
Split ‘loose’ clusters, i.e. clusters with relatively high SSE
Merge clusters that are ‘close’ and that have relatively low SSE
Can use these steps during the clustering process: ISODATA (Iterative
Self-Organizing Data Analysis)
Bisecting K-means algorithm: Variant of K-means that can produce a partitional or a
hierarchical clustering
Limitations:
o Problems when clusters are of differing sizes, densities, non-globular shapes
o Data contains outliers
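The K-means loop described above (assign each point to the closest centroid, recompute centroids, repeat) can be sketched as follows. This is a minimal sketch in plain Python; the names are illustrative and initial centroids are simply sampled at random, which is exactly the initialization issue discussed above:

```python
import random

def kmeans(points, k, iterations=100):
    """Basic K-means: assign each point to the closest centroid (squared
    Euclidean distance), then recompute each centroid as the mean of its
    assigned points. Complexity is O(n * K * I * d)."""
    centroids = random.sample(points, k)  # random initial centroids (choice matters!)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # An empty cluster keeps its old centroid (one simple handling strategy)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

def sse(clusters, centroids):
    # Sum of squared distances of each point to its own cluster's centroid
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, c))
        for cluster, c in zip(clusters, centroids) for p in cluster
    )
```

Running it several times and keeping the result with the lowest SSE is the "multiple runs" fix listed above.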
Hierarchical Clustering
Strengths:
o Do not have to assume any particular number of clusters: Any desired
number of clusters can be obtained by ‘cutting’ the dendrogram at the
proper level
o Correspond to meaningful taxonomies
Types of hierarchical clustering:
o Agglomerative:
Start with points as individual clusters
At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left
o Divisive:
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point
Inter-Cluster Similarity
o MIN (single link):
Strength: can handle non-elliptical shapes
Limitation: sensitive to noise and outliers
o MAX (complete link):
Strength: less susceptible to noise and outliers
Limitations: tends to break large clusters; biased towards globular clusters
o Group Average:
Strength: less susceptible to noise and outliers
Limitation: biased towards globular clusters
o Distance between Centroids
Cluster Similarity: Ward’s Method
o Based on the increase in squared error when two clusters are merged:
Δ = SSE(Ci ∪ Cj) − SSE(Ci) − SSE(Cj)
o Less susceptible to noise and outliers
o Biased towards globular clusters
o Hierarchical analogue of K-means
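The agglomerative procedure above (start with singleton clusters, repeatedly merge the closest pair) can be sketched with MIN/single-link as the inter-cluster similarity. A minimal sketch in plain Python; names are illustrative, and the O(n³)-ish pair search is kept deliberately simple:

```python
def agglomerate(points, target_k, dist):
    """Agglomerative (MIN / single-link) clustering: start with every point as
    its own cluster and repeatedly merge the closest pair of clusters until
    target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Single-link distance between two clusters = closest pair of member points
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(p, q)
                               for p in clusters[ij[0]] for q in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters
```

Swapping the inner `min` for `max` gives MAX (complete link); averaging gives Group Average.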
Build MST
o Start with a tree that consists of any point
o In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
o Add q to the tree and put an edge between p and q
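The three steps above are Prim's algorithm; a minimal sketch (plain Python, illustrative names):

```python
def build_mst(points, dist):
    """Prim's algorithm: grow the tree one point at a time, always attaching
    the outside point q closest to some tree point p."""
    in_tree = [points[0]]            # start with a tree consisting of any point
    outside = list(points[1:])
    edges = []
    while outside:
        p, q = min(
            ((p, q) for p in in_tree for q in outside),
            key=lambda pq: dist(pq[0], pq[1]),
        )
        edges.append((p, q))         # add q to the tree via edge (p, q)
        in_tree.append(q)
        outside.remove(q)
    return edges
```

For MST-based clustering, cutting the longest edges of the finished tree splits the points into clusters.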
DBSCAN Algorithm
o Eliminate noise points
o Perform clustering on the remaining points:
Two core points within a specific radius are put into the same cluster
Border points are added
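The steps above can be sketched as a compact DBSCAN. This is a minimal sketch, not the full original algorithm's bookkeeping; `eps`/`min_pts` correspond to Eps and MinPts, and a label of `None` marks a noise point:

```python
def dbscan(points, eps, min_pts, dist):
    """Sketch of DBSCAN: core points (>= min_pts neighbours within eps,
    counting themselves) seed clusters; core points within eps of each other
    end up in the same cluster; border points join a nearby core point's
    cluster; everything else stays noise (label None)."""
    neighbours = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbours[p]) >= min_pts}
    labels = {p: None for p in points}   # None = noise (eliminated from clustering)
    cluster_id = 0
    for p in core:
        if labels[p] is not None:
            continue
        labels[p] = cluster_id
        frontier = [p]
        while frontier:
            q = frontier.pop()
            for r in neighbours[q]:
                if labels[r] is None:
                    labels[r] = cluster_id
                    if r in core:
                        frontier.append(r)  # only core points expand the cluster
        cluster_id += 1
    return labels
```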
To determine EPS and Min Pts:
o For points in a cluster, their kth nearest neighbours are at roughly the same
distance
o Noise points have the kth nearest neighbour at farther distance
o So, plot sorted distance of every point to its kth nearest neighbour
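Computing the curve to plot is straightforward; a minimal sketch (illustrative names):

```python
def k_distance(points, k, dist):
    """Sorted distance of every point to its k-th nearest neighbour.
    Plotted, the sharp 'knee' in this curve suggests a value for Eps
    (with MinPts = k): cluster points sit below it, noise points above."""
    dists = []
    for p in points:
        others = sorted(dist(p, q) for q in points if q != p)
        dists.append(others[k - 1])  # distance to the k-th nearest neighbour
    return sorted(dists)
```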
Graph-Based Clustering
Graph-Based clustering uses the proximity graph
o Start with the proximity matrix
o Considering each point as a node in a graph
o Each edge between two nodes has a weight which is the proximity between
the two points
A graph, G = (V, E) representing a set of objects (V) and their relations
(E)
Initially the proximity graph is fully connected
MIN (single-link) and MAX (complete-link) can be viewed as starting
with this graph
o Clusters are connected components in the graph
Sparsification: clustering may work better
o Sparsification techniques keep the connections to the most similar (nearest)
neighbour of a point while breaking the connections to less similar points
o The nearest neighbours of a point tend to belong to the same class as the
point itself
o This reduces the impact of noise and outliers and sharpens the distinction
between clusters
Sparsification facilitates the use of graph partitioning algorithms (Chameleon &
Hypergraph-based Clustering)
Shared Nearest Neighbour (SNN) graph: the weight of an edge is the number of shared
neighbours between the vertices, given that the vertices are connected
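The SNN graph definition above can be sketched directly. This version keeps an edge only when the connection is mutual (each endpoint is in the other's k-NN list), which is one common sparsification choice and an assumption here:

```python
def snn_graph(points, k, dist):
    """Shared Nearest Neighbour graph: connect two points only if each is in
    the other's k-nearest-neighbour list; the edge weight is the number of
    neighbours the two points share."""
    knn = {
        p: set(sorted((q for q in points if q != p), key=lambda q: dist(p, q))[:k])
        for p in points
    }
    edges = {}
    for p in points:
        for q in knn[p]:
            if p in knn[q] and (q, p) not in edges:   # mutual, not yet recorded
                edges[(p, q)] = len(knn[p] & knn[q])  # count of shared neighbours
    return edges
```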
Cluster Validation:
Determining the clustering tendency of a set of data, i.e. distinguishing whether non-
random structure actually exists in the data
Compare the results of a cluster analysis to externally known results
Evaluate how well the results of a cluster analysis fit the data without reference to
external information
Comparing the results of two different sets of cluster analyses to determine which is
better
Determining the ‘correct’ number of clusters
Measures of Cluster Validity:
o External Index: extent that cluster labels match externally supplied class
labels (Entropy)
o Internal Index: goodness of a clustering structure without respect to external
information (Sum of Squared Error)
Cluster Cohesion: Measures how closely related objects in a
cluster are (e.g. SSE) – sum of weights of all links within a cluster
Cluster Separation: Measures how distinct or well-separated a cluster
is from other clusters (e.g. Squared Error) – sum of weights between
nodes in the cluster and nodes outside the cluster
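The two internal measures can be sketched in their squared-error forms. A minimal sketch, assuming cohesion = within-cluster SSE around the cluster centroid and separation = size-weighted squared distance from the cluster centroid to the overall centroid (the between-group sum-of-squares contribution):

```python
def cohesion(cluster, dist):
    """Within-cluster sum of squared distances to the cluster centroid (SSE)."""
    centroid = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    return sum(dist(p, centroid) ** 2 for p in cluster)

def separation(cluster, all_points, dist):
    """Between-group measure: squared distance from the cluster centroid to
    the overall centroid, weighted by cluster size."""
    centroid = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    overall = tuple(sum(d) / len(all_points) for d in zip(*all_points))
    return len(cluster) * dist(centroid, overall) ** 2
```

A good clustering has low cohesion values (tight clusters) and high separation values (well-spread centroids).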