
Lecture 4 – Classification: Alternative Techniques

Underfitting – when the model is too simple, both training and test errors are large

 Insufficient Examples:
o A lack of data points in a region (the lower half of the diagram in the lecture’s
example) makes it difficult to correctly predict the class labels of that region
o An insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant
to the classification task

Overfitting – decision trees that are more complex than necessary

 The learning algorithm has access only to the training set during model building. It
has no knowledge of the test set
 Solution 1: Pre-Pruning
o Stop the algorithm before it becomes a fully-grown tree
o Typical stop conditions:
 All instances belong to the same class
 Attribute values are the same
o Restrictive conditions:
 Number of instances is less than some user-specified threshold
 Class distribution of instances is independent of the available features
 Expanding the current node does not improve impurity measures
 Solution 2: Post-pruning
o Grow decision tree to its entirety
o Trim the nodes of decision tree in a bottom-up fashion
o If generalization error improves after trimming, replace sub-tree by a leaf
node
o Class label of a leaf node is determined from majority class of instances in the
sub-tree
o Can use Minimum Description Length (MDL) for post pruning
Estimating Generalization Errors

Occam’s Razor: Given two models of similar generalization errors, one should prefer the
simpler model over the more complex model

 For a complex model, there is a greater chance that it was fitted accidentally to errors
(noise) in the data
 Minimum Description Length:
o Cost (Model, Data) = Cost (Data|Model) + Cost (Model)
 Cost: number of bits needed for encoding
 Search for the least costly model
o Cost (Data|Model): misclassification errors
o Cost (Model): node encoding (number of children) + splitting condition
encoding
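A toy sketch of the MDL comparison above. The per-item costs are an illustrative assumption (each misclassified record costs log2(N) bits to identify, each internal node costs log2(m) bits to name its splitting attribute); only the additive cost structure comes from the notes.

```python
import math

# Toy comparison of two hypothetical decision trees under an assumed,
# simplistic encoding.  Only the additive structure
#   Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
# is taken from the notes; the per-item bit costs are illustrative.

def mdl_cost(num_internal_nodes, num_errors, N, m):
    cost_model = num_internal_nodes * math.log2(m)      # Cost(Model): encode the tree
    cost_data_given_model = num_errors * math.log2(N)   # Cost(Data | Model): encode the errors
    return cost_model + cost_data_given_model

N, m = 1000, 16                                           # records, candidate attributes
print(mdl_cost(num_internal_nodes=3, num_errors=40, N=N, m=m))   # small tree, more errors
print(mdl_cost(num_internal_nodes=25, num_errors=35, N=N, m=m))  # large tree, fewer errors
# Prefer whichever tree has the lower total description length.
```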
Instance Based Classifiers:

 Store the training records (without building an explicit model)


 Use training records directly to predict the class label of unseen cases
 Rote Learner: memorizes entire training data and performs classification only if
attributes of record match one of the training examples exactly
 K-Nearest Neighbours (k-NN): Uses k “closest” points (nearest neighbours) for
performing classification
Nearest-Neighbour Classifier:

 Requirements:
o Set of stored training records
o Distance metric to compute distance between records
o Value of k, the number of nearest neighbours to retrieve
 If k is too small, sensitive to noise points
 If k is too large, neighbourhood may include points from other classes
 To classify an unknown record:
o Compute distance to other training records
o Identify k nearest neighbours
o The class labels of the nearest neighbours to determine the class label of the
unknown vote
 Computation:
o Euclidean distance: d(p, q) = sqrt( Σi (pi − qi)² )
o Determine the class from nearest neighbour list:


 Take majority vote of class labels among the k-nearest neighbours
 Weigh the vote according to distance (weight factor w = 1/d²)
 1 nearest-neighbour:
o Voronoi Diagram: boundaries denote 1-neighborhoods
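A minimal k-NN sketch in Python (illustrative, not the lecture’s code): Euclidean distance, majority vote, and optional 1/d² weighting; the toy data is made up.

```python
import numpy as np
from collections import Counter

# Minimal k-NN classifier sketch: compute distances to all training records,
# take the k closest, then vote (optionally weighted by 1/d^2).

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))     # Euclidean distances
    nn = np.argsort(dists)[:k]                            # k closest training records
    if not weighted:
        return Counter(y_train[nn]).most_common(1)[0][0]  # plain majority vote
    votes = {}
    for i in nn:                                          # weight each vote by 1/d^2
        w = 1.0 / (dists[i] ** 2 + 1e-12)                 # avoid division by zero
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([7.5, 8.1]), k=3))  # -> "B"
```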

Bayesian Classifier: A probabilistic framework for classification problems


Naïve Bayes Classifier

 If one of the conditional probabilities is zero, then the entire expression becomes
zero (a common remedy is to smooth the estimates, e.g. with a Laplace correction)

Estimate Probabilities from Data:

 For continuous attributes:


o Discretize the range into bins
 One ordinal attribute per bin
 Violates independence assumption
o Two-way split: (A<v) or (A>v)
 Choose only one of the two splits as new attribute
o Probability density estimation:
 Assume attribute follows a normal distribution
 Use data to estimate parameters of distribution (e.g. mean and
standard deviation)
 Once probability distribution is known, can use it to estimate the
conditional probability P(Ai|c)
 Normal Distribution: P(Ai | cj) = (1 / sqrt(2π σij²)) · exp( −(Ai − μij)² / (2 σij²) ),
where μij and σij² are the sample mean and variance of attribute Ai for records of class cj
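A short sketch of the normal-distribution estimate above (Python; the income values are hypothetical).

```python
import math

# Estimate P(Ai | c) for a continuous attribute under the normal-distribution
# assumption: fit mean and variance from the class's training records, then
# evaluate the Gaussian density at the test value.

def gaussian_pdf(x, mean, std):
    # density of N(mean, std^2) evaluated at x
    return (1.0 / (math.sqrt(2 * math.pi) * std)) * math.exp(-((x - mean) ** 2) / (2 * std ** 2))

# incomes of the training records belonging to class "No" (hypothetical data)
incomes_no = [125, 100, 70, 120, 60, 220, 75, 90]
mean = sum(incomes_no) / len(incomes_no)
var = sum((x - mean) ** 2 for x in incomes_no) / (len(incomes_no) - 1)  # sample variance
std = math.sqrt(var)

# estimated class-conditional density used in place of P(Income = 120 | No)
print(gaussian_pdf(120, mean, std))
```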
Support Vector Machines

 To decide which separating boundary is better, find the hyperplane that maximizes the
margin between the two classes:

 If the problem is not linearly separable: introduce slack variables that allow some
training errors but are penalized in the objective (soft margin)

Nonlinear Support Vector Machines

 Transform data into higher dimensional space


 Using the “kernel trick”, the actual transformation need not be known explicitly
 The kernel just computes the similarity between two vectors in the original space
(equivalent to an inner product in the transformed space)
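An illustrative comparison of a linear and an RBF-kernel SVM on data that is not linearly separable, assuming scikit-learn is available (not part of the lecture material).

```python
import numpy as np
from sklearn.svm import SVC

# The RBF kernel computes similarities in the original space, so the
# high-dimensional transformation is never materialised explicitly.

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # not linearly separable: a ring

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # struggles on this data
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # typically much higher
```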
Artificial Neural Networks (ANN)

 Perceptron Model is an assembly of inter-connected nodes and weighted links


 The output node sums its input values, weighted by the corresponding links
 The weighted sum is compared against some threshold t to produce the output

 Algorithm for learning ANN:


o Initialize the weights (w0, w1, w2, …, wk)
o Adjust the weights in such a way that the output of ANN is consistent with
class labels of training examples
 Objective function: E = Σi [ Yi − f(wi, Xi) ]²
 Find the weights wi that minimize the above objective function


e.g. backpropagation algorithm
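A minimal sketch of gradient descent on the squared-error objective above for a single sigmoid unit (illustrative data and learning rate; real networks apply the same idea layer by layer via backpropagation).

```python
import numpy as np

# Train one sigmoid unit to compute logical AND by minimising
# E = sum_i (Y_i - f(w, X_i))^2 with gradient descent.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([0.0, 0.0, 0.0, 1.0])          # logical AND
X = np.hstack([np.ones((len(X), 1)), X])    # prepend bias input x0 = 1

w = np.zeros(3)
lr = 0.5
for _ in range(5000):
    out = sigmoid(X @ w)                          # f(w, X_i) for every record
    err = Y - out                                 # residuals in the objective
    grad = -2 * X.T @ (err * out * (1 - out))     # dE/dw for the sigmoid unit
    w -= lr * grad                                # move against the gradient

print(np.round(sigmoid(X @ w)))   # approximately [0, 0, 0, 1] after training
```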
Ensemble Methods:

 Construct a set of classifiers from the training data


 Predict the class label of previously unseen records by aggregating the predictions
made by multiple classifiers
 Assumption: Individual classifiers (voters) could be lousy (stupid), but the aggregate
(electorate) can usually classify (decide) correctly
 Bagging:
o Sampling with replacement
o Build classifier on each bootstrap sample
o Each record has a probability of 1 − (1 − 1/N)^N of being selected for a given
bootstrap sample of size N
 Approaches 1 − 1/e ≈ 0.632 for large N (see the sketch after this list)
 Boosting: An iterative procedure to adaptively change distribution of training data by
focusing more on previously misclassified records
o Initially, all N records are assigned equal weights
o Unlike bagging, the sampling weights may change at the end of each boosting round
o Records that are wrongly classified will have their weights increased
o Records that are correctly classified will have their weights decreased
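A quick illustrative check of the bagging selection probability mentioned above (Python/NumPy; the simulation parameters are arbitrary).

```python
import numpy as np

# The probability that a given record appears in a bootstrap sample of size N
# is 1 - (1 - 1/N)^N, which approaches 1 - 1/e ~ 0.632 for large N.

N = 1000
print(1 - (1 - 1 / N) ** N)               # analytic value, ~0.632

rng = np.random.default_rng(0)
hits = 0
trials = 2000
for _ in range(trials):                   # empirical check by simulation
    sample = rng.integers(0, N, size=N)   # sampling with replacement
    hits += 0 in sample                   # was record 0 selected?
print(hits / trials)                      # also ~0.632
```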
Lecture 6 – Cluster Analysis
Cluster Analysis: Finding groups of objects such that the objects in a group are similar to
one another and different from the objects in other groups.

 Types of Clustering:
o Partitional Clustering: A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
o Hierarchical Clustering: A set of nested clusters organized as a hierarchical
tree
o Contiguous Cluster (Nearest neighbour or Transitive): A cluster is a set of
points such that a point in a cluster is closer to one or more other points in
the cluster than to any point not in the cluster
o Density-based: A cluster is a dense region of points, separated from other
regions of high density by regions of low density
 Used when clusters are irregular or intertwined, and when noise and
outliers are present
o Conceptual Cluster: Finds clusters that share some common property or
represent a particular concept
 Other Distinctions:
o Exclusive versus non-exclusive
o Fuzzy vs non-fuzzy
o Partial versus complete
o Heterogeneous versus homogeneous

Partitional Clustering vs. Hierarchical Clustering (example diagrams)

 K-means Clustering:
o Partitional clustering approach
o Each cluster is associated with a centroid
o Each point is assigned to the cluster with the closest centroid (Euclidean
distance, cosine similarity, etc.)
o Number of clusters, K, must be specified
o Complexity: O(n * K * I * d), where n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
o The choice of initial centroids is extremely important
o Sum of Squared Error (SSE): for every point, the error is its distance to the
nearest cluster centroid (a minimal K-means sketch follows this list)

o One easy way to reduce SSE is to increase K, the number of clusters


 A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
o Problem with Selecting Initial Points:
 If there are K ‘real’ clusters then the chance of selecting one centroid
from each cluster is small
 Chance is relatively small when K is large
 If the clusters are the same size, n, then the probability of selecting one centroid
from each cluster is P = K! n^K / (Kn)^K = K! / K^K, which is very small for large K
o Solution:
 Multiple runs
 Sample and use hierarchical clustering to determine initial centroids
 Select more than k initial centroids and then choose among these the ones
that are most widely separated
 Postprocessing
 Bisecting K-means: Not as susceptible to initialization issues
o Handling Empty Clusters:
 Choose the point that contributes most to SSE
 Choose a point from the cluster with the highest SSE
 If there are several empty clusters, the above can be repeated several
times
o Updating Centres Incrementally after each assignment:
 Each assignment updates zero or two centroids
 More expensive
 Introduces an order dependency
 Never get an empty cluster
 Can use “weights” to change the impact
o Pre-Processing:
 Normalize the data
 Eliminate outliers
o Post-processing:
 Eliminate small clusters that may represent outliers
 Split ‘loose’ clusters, i.e. clusters with relatively high SSE
 Merge clusters that are ‘close’ and that have relatively low SSE
 Can use these steps during the clustering process: ISODATA (Iterative
Self-Organizing Data Analysis)
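A bare-bones K-means sketch matching the steps above (illustrative, not a reference implementation): assign each point to the closest centroid, recompute the centroids, repeat, and report the SSE.

```python
import numpy as np

# Simple K-means: random initial centroids, assignment + update loop, SSE.

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                   # closest centroid per point
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()      # sum of squared error
    return labels, centroids, sse

X = np.vstack([np.random.default_rng(1).normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
labels, centroids, sse = kmeans(X, k=3)
print("SSE:", sse)   # runs with different seeds (initial centroids) can give different SSE
```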
Bisecting K-means algorithm: Variant of K-means that can produce a partitional or a
hierarchical clustering

 Limitations:
o Problems when clusters are of differing sizes, densities, non-globular shapes
o Data contains outliers

Hierarchical Clustering

 Produces a set of nested clusters organized as a hierarchical tree


 Space: O(N²), where N is the number of points; Time: O(N³), since there are N steps and
at each step the proximity matrix, of size N², must be updated and searched
 Can be visualized as a dendrogram
o A tree-like diagram that records the sequences of merges or splits

 Strengths:
o Do not have to assume any particular number of clusters: Any desired
number of clusters can be obtained by ‘cutting’ the dendrogram at the
proper level
o Correspond to meaningful taxonomies
 Types of hierarchical clustering:
o Agglomerative:
 Start with points as individual clusters
 At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

o Divisive:
 Start with one, all-inclusive cluster
 At each step, split a cluster until each cluster contains a single point (or
until k clusters remain)
 Inter-Cluster Similarity (a SciPy sketch follows Ward’s method below):
o MIN (single link)
 Strength: can handle non-elliptical shapes
 Limitation: sensitive to noise and outliers
o MAX (complete link)
 Strength: less susceptible to noise and outliers
 Limitations: tends to break large clusters; biased towards globular clusters
o Group Average
 Strength: less susceptible to noise and outliers
 Limitation: biased towards globular clusters
o Distance between Centroids
 Cluster Similarity: Ward’s Method
o Based on the increase in squared error when two clusters are merged:
o Less susceptible to noise and outliers
o Biased towards globular clusters
o Hierarchical analogue of K-means
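An illustrative agglomerative run, assuming SciPy is available; the method argument selects the inter-cluster similarity discussed above ('single' = MIN, 'complete' = MAX, 'average' = group average, 'centroid' = distance between centroids, 'ward' = Ward's method).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Agglomerative clustering on two toy blobs, then 'cut' the dendrogram.

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 2)) for c in ([0, 0], [4, 4])])

Z = linkage(X, method="ward")                      # the merge history (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
print(labels[:5], labels[-5:])                     # the two blobs get different labels
```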

Minimum Spanning Tree (MST): Divisive Hierarchical Clustering

 Build MST
o Start with a tree that consists of any point
o In successive steps, look for the closest pair of points (p, q) such that one
point (p) is in the current tree but the other (q) is not
o Add q to the tree and put an edge between p and q

 Use MST for constructing hierarchy of clusters
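A Prim-style sketch of the MST construction above (illustrative; the naive O(N²) search per step is not optimised).

```python
import numpy as np

# Grow the MST one point at a time: repeatedly add the closest outside
# point q to some tree point p, with an edge (p, q).

def build_mst(X):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    in_tree = {0}                            # start with an arbitrary point
    edges = []
    while len(in_tree) < n:
        best = None
        for p in in_tree:                    # closest pair (p in tree, q outside)
            for q in range(n):
                if q not in in_tree and (best is None or dist[p, q] < dist[best[0], best[1]]):
                    best = (p, q)
        edges.append(best)                   # add the edge (p, q) and the point q
        in_tree.add(best[1])
    return edges                             # removing the longest edges yields clusters

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], dtype=float)
print(build_mst(X))
```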


Lecture 7 – Cluster Analysis: Alternative Methods

DBSCAN: density-based algorithm


 Density = number of points within a specified radius (Eps)
 A point is a core point if it has more than a specified number of points (MinPts)
within Eps (including itself)
o These are points that are at the interior of a cluster
 A border point is not a core point but is in the neighbourhood of a core point
 A noise point is any point that is not a core point or a border point

 DBSCAN Algorithm (see the sketch below):
o Eliminate noise points
o Perform clustering on the remaining points:
 Two core points within Eps of each other are put into the same cluster
 Border points are assigned to the cluster of one of their associated core points
 To determine Eps and MinPts:
o For points in a cluster, their kth nearest neighbours are at roughly the same
distance
o Noise points have their kth nearest neighbour at a farther distance
o So, plot the sorted distance of every point to its kth nearest neighbour and look
for a sharp increase (the “knee”)
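An illustrative DBSCAN run, assuming scikit-learn; eps plays the role of Eps, min_samples the role of MinPts, and points labelled -1 are noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus scattered noise; DBSCAN should recover the blobs
# and mark most of the scattered points as noise.

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.2, size=(50, 2)),   # dense cluster
    rng.normal(loc=[3, 3], scale=0.2, size=(50, 2)),   # second dense cluster
    rng.uniform(low=-2, high=5, size=(10, 2)),         # scattered points
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))
print("noise points:  ", int(np.sum(labels == -1)))
```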

Clustering Using REpresentatives (CURE)

 Representative points are found by selecting a constant number of points from a
cluster and then “shrinking” them toward the centre of the cluster
 Cluster similarity is the similarity of the closest pair of representative points from
different clusters
 Shrinking representative points toward the centre helps avoid problems with noise
and outliers
 CURE is better able to handle clusters of arbitrary shapes and sizes

Graph-Based Clustering
 Graph-Based clustering uses the proximity graph
o Start with the proximity matrix
o Consider each point as a node in the graph
o Each edge between two nodes has a weight which is the proximity between
the two points
 A graph, G = (V, E) representing a set of objects (V) and their relations
(E)
 Initially the proximity graph is fully connected
 MIN (single-link) and MAX (complete-link) can be viewed as starting
with this graph
o Clusters are connected components in the graph
 Clustering may work better after sparsifying the proximity graph:
o Sparsification techniques keep the connections to the most similar (nearest)
neighbour of a point while breaking the connections to less similar points
o The nearest neighbours of a point tend to belong to the same class as the
point itself
o This reduces the impact of noise and outliers and sharpens the distinction
between clusters
 Sparsification facilitates the use of graph partitioning algorithms (Chameleon &
Hypergraph-based Clustering)

o Chameleon: Clustering using Dynamic Modelling


 Adapt to the characteristics of the data set to find the natural clusters
 Use a dynamic model to measure the similarity between clusters
 Relative closeness & relative inter-connectivity of the clusters
 Two clusters are combined if the resulting cluster shares
certain properties with the constituent clusters
 Merging scheme preserves self-similarity
 Used in spatial data:
 Clusters are defined as densely populated regions of
space
 Clusters have arbitrary shapes, orientation, and non-
uniform sizes
 Difference in densities across clusters and variation in
density within clusters
 Existence of special artefacts and noise
 Steps:
 Pre-processing Step: Represent the Data by a Graph
 Given a set of points, construct the k-nearest-
neighbour graph to capture the relationship between a
point and its k nearest neighbours
 Concept of neighbourhood is captured dynamically
 Phase 1: Use a multilevel graph partitioning algorithm on the
graph to find a large number of clusters of well-connected
vertices
 Each cluster should contain mostly points from one
“true” cluster
 Phase 2: Use Hierarchical Agglomerative Clustering to merge
sub-clusters: Two clusters are combined if the resulting cluster
shares certain properties with the constituent clusters
 Relative Interconnectivity: Absolute interconnectivity
of two clusters normalized by the internal connectivity
of the clusters
 Relative Closeness: Absolute closeness of two clusters
normalized by the internal closeness of the clusters

Shared Nearest Neighbour (SNN) graph: the weight of an edge is the number of shared
neighbours between the two vertices, given that the vertices are connected

 Jarvis Patrick Clustering:


o K-nearest neighbours of all points are found (breaking all but the k strongest
links from a point to other points in the proximity graph)
o A pair of points is put in the same cluster if:
 The two points share more than T neighbours, and
 The two points are in each other’s k-nearest-neighbour lists
 SNN Density-based Clustering Algorithm:
o Compute the similarity matrix
o Sparsify the similarity matrix by keeping only the k most similar neighbours
o Construct the shared nearest neighbour graph from the sparsified similarity
matrix
o Find the core points
o Form clusters from the core points
o Discard all the noise points
o Assign all non-noise, non-core points to clusters
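A small sketch of building the SNN similarity (edge weights) described above (illustrative Python; k and the data are arbitrary).

```python
import numpy as np

# SNN similarity: edge weight = number of shared k-nearest neighbours,
# kept only if the two points appear in each other's k-NN lists.

def snn_similarity(X, k=5):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbours of each point (excluding the point itself)
    knn = [set(np.argsort(dist[i])[1:k + 1]) for i in range(n)]
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if j in knn[i] and i in knn[j]:                   # mutual neighbours only
                snn[i, j] = snn[j, i] = len(knn[i] & knn[j])  # shared neighbours
    return snn

X = np.random.default_rng(0).normal(size=(20, 2))
print(snn_similarity(X, k=5))
```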

Cluster Validation:

 Determining the clustering tendency of a set of data, i.e. distinguishing whether non-
random structure actually exists in the data
 Compare the results of a cluster analysis to externally known results
 Evaluate how well the results of a cluster analysis fit the data without reference to
external information
 Comparing the results of two different sets of cluster analyses to determine which is
better
 Determining the ‘correct’ number of clusters
 Measures of Cluster Validity:
o External Index: extent that cluster labels match externally supplied class
labels (Entropy)
o Internal Index: goodness of a clustering structure without respect to external
information (Sum of Squared Error)
 Cluster Cohesion: measures how closely related the objects in a
cluster are (e.g. SSE); in graph terms, the sum of the weights of all links within a cluster
 Cluster Separation: measures how distinct or well-separated a cluster
is from other clusters (e.g. squared error between clusters); in graph terms, the sum of
the weights of links between nodes in the cluster and nodes outside the cluster

 Silhouette Coefficient: combines ideas of both cohesion and separation:
s = (b − a) / max(a, b), where a = average distance of a point to the other points in its
own cluster and b = minimum, over the other clusters, of its average distance to that
cluster’s points

o Relative Index: used to compare two different clusterings or sets of clusters


o Correlation:
 Proximity Matrix
 Incidence Matrix: entry = 1 if associated pair of points belong to the
same cluster; else entry = 0
 Compute the correlation between the two matrices
 High correlation indicates that points that belong to the same cluster
are close to each other
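A sketch of the correlation check above (illustrative): correlate the off-diagonal entries of the proximity (distance) matrix and the incidence matrix; when distances are used, a strong negative correlation indicates a good clustering.

```python
import numpy as np

# Build the proximity and incidence matrices for a toy clustering and
# correlate their off-diagonal entries.

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(25, 2)) for c in ([0, 0], [4, 4])])
labels = np.array([0] * 25 + [1] * 25)           # a clustering to validate

proximity = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
incidence = (labels[:, None] == labels[None, :]).astype(float)   # 1 if same cluster

iu = np.triu_indices(len(X), k=1)                # use each pair of points once
corr = np.corrcoef(proximity[iu], incidence[iu])[0, 1]
print(corr)   # close to -1 here: same-cluster pairs are exactly the close pairs
```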

Lecture 8 – Association Rule Mining


Association Rule Mining: Given a set of transactions, find rules that will predict the occurrence
of an item based on the occurrences of other items in the transaction
Frequent Itemset: An itemset whose support is greater than or equal to a minsup threshold
 Itemset: A collection of one or more items (e.g. {Milk, Bread, Diaper})
 K-itemset: An itemset that contains k items
Support count: Frequency of occurrence of an itemset
Support: Fraction of transactions that contain an itemset
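A tiny worked example of support and the minsup test (the transactions are made up).

```python
# Support count and support for an itemset over a toy market-basket dataset.

transactions = [
    {"Milk", "Bread"},
    {"Bread", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper"},
    {"Milk", "Diaper", "Beer"},
    {"Milk", "Bread", "Diaper", "Beer"},
]

itemset = {"Milk", "Bread"}
support_count = sum(itemset <= t for t in transactions)   # transactions containing the itemset
support = support_count / len(transactions)               # fraction of transactions

print(support_count, support)   # 3 and 0.6
minsup = 0.5
print("frequent" if support >= minsup else "infrequent")  # frequent, since 0.6 >= 0.5
```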
