Advanced Data Analytics: Simon Scheidegger
Lecture 4
Simon Scheidegger
Today’s Roadmap
1. Classification – Motivation
2. Linear Classification
3. Logistic Regression
4. Evaluating Classifiers

1. Classification – Motivation
[Figure: test images]
Recall: “our example”
Classification aims at predicting a nominal target feature
based on one or multiple other (numerical) features.
Example: Predict origin (U.S. vs Non-U.S.) of a car based
on its power (in horsepower) and weight (in pounds).
Since our predictor f(x) takes values in a discrete set G, we can
always divide the input space into a collection of regions labeled
according to the classification.
[Figure: an example car. Which country?]
Data: https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Classification Boundaries
Given an input vector x, assign it to one of K discrete classes Ck, k = 1, …, K.
Assumption: classes are disjoint, i.e., input vectors are assigned
to exactly one class.
Idea: Divide input space into decision regions whose boundaries
are called decision boundaries/surfaces.
Decision Theory
We want a framework that allows us to make optimal decisions
(in situations involving uncertainty).
Consider, for example, a medical diagnosis problem in which we
have taken an X-ray image of a patient, and we wish to
determine whether the patient has cancer or not.
In this case, the input vector x is the set of pixel intensities in the
image, and output variable t will represent the presence of
cancer, which we denote by the class C1, or the absence of
cancer, which we denote by the class C2.
We might, for instance, choose t to be a binary variable such that
t = 0 corresponds to class C1 and t = 1 corresponds to class C2.
Decision Theory – more abstract
Ideas to do so:
1. Maximum Likelihood
The prior probability p(C1) represents the probability that a person has cancer,
before we take the X-ray measurement.
Want to avoid misclassification
Note that any of the quantities appearing in Bayes’ theorem can be
obtained from the joint distribution p(x, Ck ) by either marginalizing or
conditioning with respect to the appropriate variables.
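For reference, Bayes' theorem as used here (a standard identity, not reproduced from the slides):
$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)}, \qquad p(x) = \sum_{k} p(x \mid C_k)\, p(C_k).$$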
Such a rule will divide the input space into regions Rk called
decision regions, one for each class, such that all points in Rk
are assigned to class Ck .
The boundaries between decision regions are called decision
boundaries or decision surfaces.
Minimizing the misclassification rate
In order to find the optimal decision rule, consider first of all the case
of two classes, as in the cancer problem. A mistake occurs when an input
belonging to class C1 is assigned to R2, or vice versa.
The probability of this occurring is given by
$$p(\text{mistake}) = p(x \in R_1, C_2) + p(x \in R_2, C_1) = \int_{R_1} p(x, C_2)\,dx + \int_{R_2} p(x, C_1)\,dx.$$
To minimize it, each x should be assigned to the class with the larger posterior p(Ck|x).
Decision: Maximum Likelihood
Inference step: determine the statistics from the training data.
Decision step: determine the optimal t for a test input x.
Decision: Minimum Misclassification Rate
Decision: Minimum Loss/Cost
Maximize the probability of being correct
For the more general case of K classes, it is slightly easier to
maximize the probability of being correct, which is given by
$$p(\text{correct}) = \sum_{k=1}^{K} p(x \in R_k, C_k) = \sum_{k=1}^{K} \int_{R_k} p(x, C_k)\,dx,$$
and is maximized when each x is assigned to the class with the largest posterior probability p(Ck|x).
This provides an a posteriori belief for the estimation, rather
than a single point estimate.
It can utilize a priori information in the decision.
The Rejection option
There are regions where we are relatively uncertain
about class membership.
In some applications, it will be appropriate to
avoid making decisions on the difficult cases
in anticipation of a lower error rate on those
examples for which a classification decision is made.
This is known as the reject option.
For example, in our hypothetical medical illustration,
it may be appropriate to use an automatic
system to classify those X-ray images for
which there is little doubt as to the correct class,
while leaving a human expert to classify the more ambiguous cases.
We can achieve this by introducing a threshold θ and rejecting those
inputs x for which the largest of the
posterior probabilities p(Ck |x) is less than or equal to θ.
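A minimal sketch of this reject rule in Python (the array names and the NumPy-based formulation are illustrative, not from the lecture code):

import numpy as np

def classify_with_reject(posteriors, theta=0.9):
    # posteriors: (n_samples, K) array with entries p(Ck|x).
    # Returns the index of the predicted class, or -1 ('reject')
    # when the largest posterior is <= the threshold theta.
    best = posteriors.argmax(axis=1)
    confidence = posteriors.max(axis=1)
    return np.where(confidence > theta, best, -1)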
2. Linear Classification
See, e.g., Bishop Chapters 1 and 4
Focus on linear classification model, i.e., the decision boundary is
a linear function of x.
Implicit assumption: Classes can be modeled well by Gaussians.
We consider models of the form $y(x) = f(w^{T}x + w_0)$. This is called a
generalized linear model; $f(\cdot)$ is a fixed non-linear function, e.g. a step function.
Decision boundary between classes will be linear function of x
A linear discriminant between two classes separates with a
hyperplane
How to use this for multiple classes?
One-versus-the-rest method: build K − 1 classifiers, each separating Ck from all other classes.
One-versus-one method: build K(K − 1)/2 classifiers, between
all pairs.
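Both strategies are available off the shelf in scikit-learn; a small illustrative example (the iris data is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # 3-class toy problem

# One-versus-the-rest: one classifier per class (scikit-learn fits K of them)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# One-versus-one: one classifier per pair of classes, K(K-1)/2 in total
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(ovr.score(X, y), ovo.score(X, y))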
Multiple Classes – Better idea
Each class Ck is described by its own linear model, so that
$$y_k(x) = w_k^{T} x + w_{k0}, \qquad k = 1, \dots, K.$$
After some algebra, we get a solution using the pseudo-inverse, as in
regression; some details on how to compute it follow.
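Following Bishop (2006), Ch. 4.1.3: collect the one-of-K coded targets as rows of a matrix $T$ and the augmented inputs $\tilde{x} = (1, x^{T})^{T}$ as rows of $\tilde{X}$; the least-squares solution is then
$$\tilde{W} = (\tilde{X}^{T}\tilde{X})^{-1}\tilde{X}^{T} T = \tilde{X}^{\dagger} T,$$
and a new input x is assigned to the class k with the largest $y_k(x)$.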
The least-squares decision boundary looks ok at first, and is similar to the
logistic regression decision boundary (more later). However, it gets worse
by adding “easy” points!
Why? If the target value is 1, points far from the boundary will have a
high predicted value, say 10; this is a large error, so the boundary is moved.
More Problems with Least Squares
The classes are easily separated by hyperplanes, but these are not found
using least squares!
We’ll address these problems later with better models
First, a look at a different criterion for linear discriminant
Fisher’s Linear Discriminant (FLD)
The two-class linear discriminant acts as a projection $y = w^{T}x$,
followed by a threshold (classify as C1 if $y \ge -w_0$).
In which direction w should we project?
One which separates classes “well”
Fisher’s Linear Discriminant
A natural idea would be to project in the direction of the line connecting
class means.
However, problematic if classes have variance in this direction.
Fisher criterion: maximize the ratio of inter-class separation (between) to
intra-class variance (within).
Math time - FLD
With class means $m_1, m_2$ and per-class scatters $s_1^2, s_2^2$ of the projected data, the Fisher criterion is
$$J(w) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} = \frac{w^{T} S_B\, w}{w^{T} S_W\, w},$$
where the between-class separation (numerator) is ideally large and the
within-class variance (denominator) is ideally small. Maximizing $J$ yields
$w \propto S_W^{-1}(m_2 - m_1)$.
Implementation of FLD
1. Standardize the dataset (zero mean, standard deviation of 1).
The remaining steps are sketched in the code below.
FLD is a dimensionality reduction technique (more later in the course).
The criterion for choosing the projection is based on class labels.
It still suffers from outliers (cf. the earlier least-squares example).
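A minimal two-class FLD sketch in NumPy, using the standard closed form $w \propto S_W^{-1}(m_2 - m_1)$ (function and variable names are illustrative):

import numpy as np

def fld_direction(X, y):
    # X: (n_samples, n_features); y: binary labels in {0, 1}.
    X = (X - X.mean(axis=0)) / X.std(axis=0)    # 1. standardize
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)   # 2. class means
    # 3. within-class scatter matrix
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # 4. projection direction that maximizes the Fisher criterion
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)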
Implementation of LDA
We use the UCI wine dataset (https://archive.ics.uci.edu/ml/datasets/wine),
which has 13 dimensions.
We want to find the transformation which makes the three different classes best
linearly separable and plot this transformation in 2 dimensional space.
We split the data into 70% training and 30% test data.
In a classification context, this is a well posed problem with "well behaved" class structures.
LDA in Sklearn
demo/sklearn_lda.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

style.use('fivethirtyeight')

# 0. Load the data and split off the descriptive and the target features
df = pd.read_csv('wine.txt', sep=',',
                 names=['target','Alcohol','Malic_acid','Ash','Alcalinity','Magnesium',
                        'Total_phenols','Flavanoids','Nonflavanoids',
                        'Proanthocyanins','Color_intensity','Hue','OD280','Proline'])
X = df.iloc[:, 1:].copy()
target = df['target'].copy()

# 1. 70% training / 30% test split
X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=0)

# 2. Project the 13-dimensional data onto the 2 most discriminative LDA axes
LDA = LinearDiscriminantAnalysis(n_components=2)
data_projected = LDA.fit_transform(X_train, y_train)
print(data_projected.shape)   # (n_train_samples, 2)
Perceptrons
“Perceptron” is used to refer to many neural network structures
(more in later lectures).
The classic type is a fixed non-linear transformation of input, one
layer of adaptive weights, and a threshold:
$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
Developed by Frank Rosenblatt in the 50s.
The main difference compared to the methods we’ve seen so far
is the learning algorithm.
(Photo: Frank Rosenblatt.) See Bishop (2006).
Perceptron Learning
Two class problem
We saw that squared error was problematic
Instead, we’d like to minimize the number of misclassified
examples
An example is mis-classified if $w^{T}\phi(x_n)\, t_n < 0$ (with targets $t_n \in \{-1, +1\}$).
Perceptron criterion:
$$E_P(w) = -\sum_{n \in \mathcal{M}} w^{T}\phi(x_n)\, t_n, \qquad \mathcal{M} = \text{set of mis-classified examples}.$$
Iterate over all training examples; only change w if the example
is mis-classified, using the update $w \leftarrow w + \eta\, \phi(x_n)\, t_n$.
Guaranteed to converge if data are linearly separable
Will not converge if not
May take many iterations
Sensitive to initialization
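A minimal NumPy sketch of this learning rule (identity features $\phi(x) = x$ plus a bias, learning rate 1; names are illustrative):

import numpy as np

def perceptron_train(X, t, max_epochs=100):
    # X: (n, d) inputs; t: labels in {-1, +1}.
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias feature
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_n, t_n in zip(X, t):
            if t_n * (w @ x_n) <= 0:   # mis-classified (or on the boundary)
                w += t_n * x_n         # perceptron update
                mistakes += 1
        if mistakes == 0:              # converged (data linearly separable)
            break
    return w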
Perceptron Learning Illustration: see Bishop (2006), Ch. 4.
Limitations of Perceptrons
Perceptrons can only solve linearly separable problems in feature
space.
A canonical example of a non-separable problem is XOR.
Real data sets can look like this too
Probabilistic Generative Models
Up to now we’ve looked at learning classification by choosing
parameters to minimize an error function
We’ll now develop a probabilistic approach
Generative model: model the class-conditional densities p(x|Ck) and the
priors p(Ck), and use Bayes' theorem to obtain the posterior
$$p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)} = \sigma(a), \qquad a = \ln\frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}.$$
This looks like gratuitous math, but if a takes a simple form, this
is another generalized linear model of the kind we have been studying.
Of course, we will see how such a simple form arises naturally.
Logistic Sigmoid
The function $\sigma(a) = 1/(1 + e^{-a})$ is known as the logistic sigmoid.
It squashes the real axis down to [0, 1]
It is continuous and differentiable
It avoids the problems encountered with least-squares fitting, where
“too correct” points are penalized (more later).
Multi-class Extension
There is a generalization of the logistic sigmoid to K > 2 classes:
$$p(C_k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)},$$
a.k.a. the softmax function.
If some $a_k \gg a_j$ for all $j \neq k$, then p(Ck|x) goes to 1.
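A numerically stable softmax in NumPy (subtracting the maximum activation is a standard trick that leaves the result unchanged):

import numpy as np

def softmax(a):
    # a: 1-D array of activations a_k.
    e = np.exp(a - np.max(a))  # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66, 0.24, 0.10]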
Gaussian Class-Conditional Densities
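The slide's derivation is not reproduced here; the key result, following Bishop (2006), Ch. 4.2: for Gaussian class-conditional densities with a shared covariance Σ, the posterior is $p(C_1 \mid x) = \sigma(a)$ with a linear in x,
$$a = w^{T}x + w_0, \qquad w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad w_0 = -\tfrac{1}{2}\mu_1^{T}\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^{T}\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}.$$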
Maximum Likelihood Learning
We can fit the parameters to this model using maximum
likelihood
The parameters are the class prior and the Gaussian means and (shared)
covariance; refer to them collectively as θ.
Fitting a Gaussian using the ML criterion is sensitive to outliers.
The simple linear form for a in the logistic sigmoid occurs for more than
just Gaussian distributions: it arises for any distribution in the
exponential family, a large class of distributions.
Probabilistic Discriminative Models
Generative model made assumptions about form of
class-conditional distributions (e.g. Gaussian)
Resulted in logistic sigmoid of linear function of x
Discriminative model: use this functional form explicitly and determine its parameters directly.
Initially, we focus on the case of binary classification, i.e.,
our target vector contains only values in {0, 1}.
Logistic Regression (I)
Logistic regression predicts values in [0, 1], each of which
can be interpreted as the probability that the corresponding
data point belongs to class 1.
To determine the predicted class in {0, 1} of a data point,
we can threshold this probability, e.g. predict class 1 if the
predicted probability is at least 0.5 and class 0 otherwise.
Logistic Regression (II)
Logistic regression assumes that there is a linear relationship
between the log odds of a data point belonging to class 1
and the numerical features:
$$\log\frac{p^{(i)}}{1 - p^{(i)}} = w^{T}x^{(i)},$$
with the right-hand side a linear combination of the features of
data point $x^{(i)}$, with coefficients $w_i$ that we need to determine.
Logistic Regression (III)
When we solve for the predicted value, we obtain
$$p^{(i)} = \frac{1}{1 + e^{-w^{T}x^{(i)}}}.$$
The logistic function (also known as the sigmoid function)
maps values from (−∞, +∞) onto [0, 1].
[Figure: the logistic function]
Logit Regression
How can we determine the coefficient vector w?
Different choices of the coefficients correspond to different
separating hyperplanes (e.g., straight lines in two dimensions),
so that data points from the two classes lie on different
sides of the hyperplane.
Loss Function
We need a loss function to measure how well a specific
choice of the coefficients fits our data
Recall that our predicted values can be interpreted as
probabilities that the data point belongs to class 1
Loss Function (II)
Intuitively, we want to choose the coefficients in such a way
that we assign high probability to data points from class 1
and low probability to data points from class 0.
Here, the value
$$G(w) = \prod_{i:\ y^{(i)} = 1} p^{(i)} \prod_{i:\ y^{(i)} = 0} \left(1 - p^{(i)}\right)$$
is the likelihood (probability) that our regression model predicts the
classes of our data points correctly for a specific choice of coefficients w.
Loss Function (III)
In practice, one considers the negative log-likelihood
$$-\log G(w) = -\sum_{i} \left[\, y^{(i)} \log p^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - p^{(i)}\right) \right]$$
as a loss function, which can then be minimized using (stochastic)
gradient descent.
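A minimal gradient-descent sketch for this loss in NumPy (function names and hyperparameters are illustrative, not taken from the lecture code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_fit(X, y, lr=0.1, n_iter=1000):
    # X: (n, d) features; y: labels in {0, 1}.
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ w)              # predicted probabilities
        grad = X.T @ (p - y) / len(y)   # gradient of the mean neg. log-likelihood
        w -= lr * grad
    return w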
Predicting Origin from Power and Weight
See the code here: demo/auto_class.py
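demo/auto_class.py is not reproduced here; a minimal sketch of the same workflow with scikit-learn, assuming Auto MPG columns named 'horsepower', 'weight', and 'origin' (with origin 1 = U.S.):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed file name and column layout; the actual demo may differ.
df = pd.read_csv('auto-mpg.csv').dropna(subset=['horsepower', 'weight', 'origin'])
X = df[['horsepower', 'weight']]
y = (df['origin'] == 1).astype(int)   # 1 = U.S., 0 = Non-U.S.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on the test data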
Plotting the Decision Boundary (Training Data)
Plotting the Decision Boundary (Test Data)
4. Evaluating Classifiers
How can we assess the prediction quality of a classifier?
Initially, we’ll consider the case of binary classification
(and extend it later to multi-class classification).
A confusion matrix shows the performance of a classifier.
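The matrix itself appeared as a figure; one common 2×2 layout (rows: actual class, columns: predicted class) is:

                     Predicted 0 (No)      Predicted 1 (Yes)
Actual 0 (No)        TN (true negative)    FP (false positive)
Actual 1 (Yes)       FN (false negative)   TP (true positive)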
Accuracy and Error Rate
Accuracy measures the classifier’s ability to put data points into
the right class.
Error rate, as the counterpart to accuracy, reflects to what
extent the classifier puts data points into the wrong class.
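In terms of these counts, the standard definitions are
$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{error rate} = 1 - \text{accuracy}.$$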
False-Positive Rate and True Positive Rate
False-positive rate is the fraction of negative (0 / No)
data points that is falsely classified as positive (1 / Yes)
True-positive rate is the fraction of positive (1 / Yes)
points that is correctly classified as positive (1 / Yes)
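As formulas:
$$\text{FPR} = \frac{FP}{FP + TN}, \qquad \text{TPR} = \frac{TP}{TP + FN}.$$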
Precision and Recall
Precision reflects how many of the data points classified as
positive (1 / Yes) actually are positive.
Recall reflects the classifier’s ability to detect all
positive (1 / Yes) data points
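As formulas:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$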
F1-Measure
Precision and Recall are widely used in Information Retrieval
F1-Measure as the harmonic mean of precision and recall
combines both measures in a single measure
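As a formula:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$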
Accuracy and error rate, the false-/true-positive rates, and precision and
recall can all be read off the confusion matrix for our binary car
classifier (Non-U.S. = 0 vs. U.S. = 1) using logistic regression on power
and weight.
Computing Quality Measures
See the code here: demo/CompQualityMeasures.py
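demo/CompQualityMeasures.py is not reproduced here; the same quantities can be computed with scikit-learn, e.g. (assuming label arrays y_test and y_pred in {0, 1}):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))
print('F1       :', f1_score(y_test, y_pred))
print('FPR      :', fp / (fp + tn))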
Micro- and Macro-Averages
When dealing with more than two classes, the confusion
matrix has one column and one row per class
For each class i, we can now determine the numbers TNi, FPi,
FNi, TPi, assuming that it is the positive class, and all others are
treated as negative.
Micro- and Macro-Averages (continued)
Micro-averages plug the summed per-class numbers into the definitions.
Macro-averages average the per-class quality assessments.
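For example, for precision over K classes:
$$\text{micro-precision} = \frac{\sum_{i=1}^{K} TP_i}{\sum_{i=1}^{K} (TP_i + FP_i)}, \qquad \text{macro-precision} = \frac{1}{K}\sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i}.$$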