
Advanced Data Analytics

Lecture 4

Simon Scheidegger
Today’s Roadmap

1. Recall: Supervised Machine Learning – Classification

2. Linear Models for Classification


2.1. Discriminant Functions
2.2. Generative Models
2.3. Discriminative Models

3. Another detailed Example: Logistic Regression

4. Evaluating Classifiers
1. Classification – Motivation
Test Images
Recall: “our example”


Classification aims at predicting a nominal target feature
based on one or multiple other (numerical) features.

Example: Predict the origin (U.S. vs. Non-U.S.) of a car based
on its power (in horsepower) and weight (in pounds).

Since our predictor f(x) takes values in a discrete set G, we can
always divide the input space into a collection of regions labeled
according to the classification.
Which country?
*https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Classification Boundaries


Given an input vector x, assign it to one of K discrete classes Ck, k = 1, …, K.


Assumption: classes are disjoint, i.e., input vectors are assigned
to exactly one class.

Idea: Divide input space into decision regions whose boundaries
are called decision boundaries/surfaces.
Decision Theory


We want a framework that allows us to make optimal decisions
(in situations involving uncertainty).

Consider, for example, a medical diagnosis problem in which we
have taken an X-ray image of a patient, and we wish to
determine whether the patient has cancer or not.

In this case, the input vector x is the set of pixel intensities in the
image, and the output variable t will represent the presence of
cancer, which we denote by the class C1, or the absence of
cancer, which we denote by the class C2.


We might, for instance, choose t to be a binary variable such that
t = 0 corresponds to class C1 and t = 1 corresponds to class C2.
Decision Theory – more abstract

 For a sample x, decide which class Ck it belongs to.


Ideas to do so:

1. Maximum Likelihood

2. Minimum Loss/Cost (e.g. misclassification rate)

3. Maximum A Posteriori (MAP)


Decision Theory (II)

Although the posterior probability p(Ck|x) can be a very useful and informative
quantity, in the end we must decide either to give treatment to the patient or not.

We would like this choice to be optimal in some appropriate sense.

This is the decision step, and it is the subject of decision theory to tell
us how to make optimal decisions given the appropriate probabilities.

Example: When we obtain the X-ray image x for a new patient, our
goal is to decide which of the two classes to assign to the image.

We are interested in the probabilities of the two classes given
the image, which are given by p(Ck |x).
Thomas Bayes and his Terminology
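
In the notation used here, Bayes' theorem reads

p(Ck | x) = p(x | Ck) p(Ck) / p(x),   with   p(x) = Σk p(x | Ck) p(Ck)

where p(Ck) is called the prior, p(x | Ck) the likelihood, p(x) the evidence, and p(Ck | x) the posterior.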
Want to avoid misclassification

Note that any of the quantities appearing in Bayes’ theorem can be
obtained from the joint distribution p(x, Ck ) by either marginalizing or
conditioning with respect to the appropriate variables.

 We can now interpret p(Ck) as the prior probability for the class Ck,
and p(Ck|x) as the corresponding posterior probability.

 Thus p(C1) represents the probability that a person has cancer, before
we take the X-ray measurement.

 Similarly, p(C1|x) is the corresponding probability, revised using Bayes'
theorem in light of the information contained in the X-ray.

If our aim is to minimize the chance of assigning x to the wrong
class, then intuitively we would choose the class having the higher
posterior probability.
Minimizing the misclassification rate

Suppose that our goal is simply to make as few mis-
classifications as possible.

We need a rule that assigns each value of x to one of the
available classes.

 Such a rule will divide the input space into regions Rk called
decision regions, one for each class, such that all points in Rk
are assigned to class Ck .


The boundaries between decision regions are called decision
boundaries or decision surfaces.
Minimizing the misclassification rate

In order to find the optimal decision rule, consider first of all the case
of two classes, as in the cancer problem.

 A mistake occurs when an input vector belonging to class C1 is assigned
to class C2, or vice versa.

The probability of this occurring is given by

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
Decision: Maximum Likelihood


Inference step: Determine the class-conditional densities p(x|Ck) from the training data.

Decision step: Determine the optimal class label t for a test input x by maximizing the likelihood:

t* = arg max_k p(x|Ck)
Decision: Minimum Misclassification Rate

Decision: Minimum Loss/Cost
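
For reference, the two criteria in Bishop's notation:

Minimum misclassification rate: assign each x to the class Ck with the largest posterior p(Ck | x), equivalently the largest joint p(x, Ck).

Minimum loss/cost: given a loss matrix Lkj (the cost of assigning a point of true class Ck to class Cj), minimize the expected loss

E[L] = Σk Σj ∫_{Rj} Lkj p(x, Ck) dx

which is achieved by assigning each x to the class Cj that minimizes Σk Lkj p(Ck | x).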
Maximize the probability of being correct

For the more general case of K classes, it is slightly easier to
maximize the probability of being correct, which is given by

p(correct) = Σk p(x ∈ Rk, Ck) = Σk ∫_{Rk} p(x, Ck) dx

→ which is maximized when the regions Rk are chosen such that
each x is assigned to the class for which p(x, Ck) is largest.
Decision: Maximum A Posteriori (MAP)


Provides a posterior belief for the estimate, rather
than a single point estimate.

Can utilize prior information in the decision.
The Rejection option

There are regions where we are relatively uncertain
about class membership.

In some applications, it will be appropriate to
avoid making decisions on the difficult cases
in anticipation of a lower error rate on those
examples for which a classification decision is made.

This is known as the reject option.

For example, in our hypothetical medical illustration,
it may be appropriate to use an automatic
system to classify those X-ray images for
which there is little doubt as to the correct class,
while leaving a human expert to classify the more ambiguous cases.

We can achieve this by introducing a threshold θ and rejecting those
inputs x for which the largest of the
posterior probabilities p(Ck |x) is less than or equal to θ.
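
A minimal sketch of this reject option in Python; the posterior values and the threshold used below are invented for illustration.

import numpy as np

def classify_with_reject(posteriors, theta=0.8):
    # Return the index of the most probable class, or None (reject) if the
    # largest posterior p(Ck|x) is less than or equal to the threshold theta.
    k = int(np.argmax(posteriors))
    if posteriors[k] <= theta:
        return None          # reject: leave this case to a human expert
    return k

# Posterior probabilities p(C1|x), p(C2|x) for two hypothetical X-ray images
print(classify_with_reject(np.array([0.55, 0.45])))   # -> None (rejected)
print(classify_with_reject(np.array([0.97, 0.03])))   # -> 0 (class C1)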
2. Linear Classification
See, e.g., Bishop Chapters 1 and 4

Why Linear Classifiers?



Statistical simplicity: We often cannot estimate more complex
classifiers than linear discriminant functions due to
lack of data and high dimensionality of the feature
space.

Computational efficiency: Linear classifiers have efficient
learning algorithms and are efficient inference machines.

Historical: Perceptrons were invented by Rosenblatt (1957)
and rigorously analyzed by Novikov (1962) who proved
convergence of the perceptron learning algorithm.
Linear Classifier


Focus on linear classification models, i.e., models whose decision boundary is
a linear function of x.

→ The boundary is defined by a (D−1)-dimensional hyperplane.



If the data can be separated exactly by linear decision surfaces,
they are called linearly separable.


Implicit assumption: Classes can be modeled well by Gaussians.

→ Here: Treat classification as a projection problem.


Generalized Linear Models

Similar to the previous chapter on linear models for regression, we
will use a “linear” model for classification:

y(x) = f(w^T x + w0)

This is called a generalized linear model; f(·) is a fixed non-linear
function, e.g. a threshold function.

The decision boundary between classes will be a linear function of x.

 We can also apply a non-linearity to x, as in the basis functions φi(x) used for regression.


Generalized Linear Model
Discriminant Functions with Two Classes
Multiple Classes – Bad idea


A linear discriminant between two classes separates with a
hyperplane

How can we use this for multiple classes?

 One-versus-the-rest method: build K − 1 classifiers, each between Ck
and all the others.

One-versus-one method: build K(K − 1)/2 classifiers, one between
each pair of classes.
Multiple Classes – Better idea

A solution is to build K linear functions:

yk(x) = wk^T x + wk0

Assign a point x to class Ck if yk(x) > yj(x) for all j ≠ k.

This gives connected, convex decision regions.


Least Squares for Classification

How do we learn the decision boundaries?

One approach is to use least squares, similar to regression

Find W to minimize squared error over all examples and all
components of the label vector:


After some algebra, we get a solution using the pseudo-inverse
as in regression
Some details on how to compute W

Each class Ck is described by its own linear model, so that

yk(x) = wk^T x + wk0,   where k = 1, . . . , K.

We can conveniently group these together using vector notation, so that

y(x) = W^T x

where W is the parameter matrix whose k-th column holds (wk0, wk^T)^T, and x is augmented with a leading 1.

We now determine the parameter matrix W by minimizing a sum-of-squares
error function, as we did for regression.

The sum-of-squares error function can then be written as

E(W) = ½ Tr{ (XW − T)^T (XW − T) }

where the n-th rows of X and T hold the n-th (augmented) input and its 1-of-K target vector.

Setting the derivative with respect to W to zero, and rearranging, we then
obtain the solution for W in the form

W = (X^T X)^{-1} X^T T = X^† T

where X^† denotes the pseudo-inverse of X.
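
A minimal NumPy sketch of this pseudo-inverse solution, with toy data and 1-of-K targets invented for illustration.

import numpy as np

# Toy data: n = 4 points in 2-D, K = 2 classes with 1-of-K target vectors
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1 for the bias term
W = np.linalg.pinv(X_aug) @ T                      # W = X^dagger T (pseudo-inverse solution)

x_new = np.array([1.0, 6.5, 5.5])                  # new (augmented) input
y = x_new @ W                                      # evaluate all K linear functions
print('predicted class:', int(np.argmax(y)))       # assign to the class with the largest yk(x)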
Problems with Least Squares


The least-squares decision boundary looks OK, and is similar to the
logistic regression decision boundary (more later).

However, it gets worse by adding “easy” points! Why?

→ If the target value is 1, points far from the boundary will have a high
predicted value, say 10; this is a large squared error, so the boundary is moved.
More Problems with Least Squares


Easily separated by hyperplanes, but not found using least
squares!

We’ll address these problems later with better models

First, a look at a different criterion for a linear discriminant
Fisher’s Linear Discriminant (FLD)


The two-class linear discriminant acts as a projection

y = w^T x

followed by a threshold.

In which direction w should we project?

One which separates the classes “well”.
Fisher’s Linear Discriminant


A natural idea would be to project in the direction of the line connecting
class means.

However, problematic if classes have variance in this direction.

Fisher criterion: maximize ratio of inter-class separation (between) to
intra-class variance (inside)
Math time – FLD

Between-class separation (ideally large): (m2 − m1)^2, where mk = w^T μk is the projected mean of class k.

Within-class variance (ideally small): s1^2 + s2^2, the summed variances of the projected classes.

Fisher criterion: maximize

J(w) = (m2 − m1)^2 / (s1^2 + s2^2) = (w^T SB w) / (w^T SW w)

which is maximized by w ∝ SW^{-1} (μ2 − μ1).
Implementation of FLD

1. Standardize the dataset (zero mean, standard deviation of 1).

2. Compute the total mean vector μ as well as the mean vectors per class μc.

3. Compute the within-class and between-class scatter matrices SW and SB.

4. Compute the eigenvalues and eigenvectors of SW^{-1} SB to find the w which
maximizes the Fisher criterion.

5. Select the eigenvectors corresponding to the k largest eigenvalues to create a
d×k-dimensional transformation matrix W, with these eigenvectors as its columns.

6. Use W to transform the original n×d-dimensional dataset X into a
lower-dimensional, n×k-dimensional dataset Y.
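
A compact NumPy sketch of these six steps; the random toy data at the bottom is invented for illustration.

import numpy as np

def fld_transform(X, y, k=2):
    # A sketch of the steps above (not production code).
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # 1. standardize
    mu = X.mean(axis=0)                               # 2. total mean vector
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)                       # 2. per-class mean vector
        S_W += (X_c - mu_c).T @ (X_c - mu_c)          # 3. within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)             # 3. between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)   # 4. eigen-decomposition
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                    # 5. d x k transformation matrix
    return X @ W                                      # 6. n x k projected dataset

# Toy usage: three classes in 4 dimensions, projected onto 2 discriminative directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)
print(fld_transform(X, y, k=2).shape)                 # -> (60, 2)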
Fisher’s Linear Discriminant – Summary


FLD is a dimensionality reduction technique (more on this later in the course).

Criterion for choosing projection based on class labels.

Still suffers from outliers (e.g. earlier least squares example).
Implementation of LDA

We use the [UCI] wine dataset (https://archive.ics.uci.edu/ml/datasets/wine) which
has 13 dimensions.

We want to find the transformation which makes the three different classes best
linearly separable and plot this transformation in two-dimensional space.

We split the data into 70% training and 30% test data.

Data Set Information:


These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from
three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three
types of wines.

The attributes are


1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

In a classification context, this is a well posed problem with "well behaved" class structures.
LDA in Sklearn
demo/sklearn_lda.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from matplotlib import style
from sklearn.model_selection import train_test_split
style.use('fivethirtyeight')
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 0. Load in the data and split the descriptive and the target feature
df = pd.read_csv('wine.txt',sep=',',
names=['target','Alcohol','Malic_acid','Ash','Akcakinity','Magnesium',
'Total_pheonols','Flavanoids','Nonflavanoids',
'Proanthocyanins','Color_intensity','Hue','OD280','Proline'])

X = df.iloc[:,1:].copy()
target = df['target'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,target,test_size=0.3,random_state=0)

# 1. Instantiate the method and fit_transform the algorithm


LDA = LinearDiscriminantAnalysis(n_components=2)
"""The n_components key word gives us the projection to the
n most discriminative directions in the dataset.
We set this parameter to two to get a transformation in two dimensional space. """

data_projected = LDA.fit_transform(X_train,y_train)
print(data_projected.shape)
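
The snippet imports KNeighborsClassifier but does not use it; a possible continuation (an assumption on my part, not necessarily part of the original demo) is to classify in the 2-D projected space and check the test accuracy:

# Hypothetical continuation of the snippet above:
# classify in the 2-D LDA space with a k-nearest-neighbour classifier.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data_projected, y_train)
X_test_projected = LDA.transform(X_test)     # project the held-out test data
print('Test accuracy:', knn.score(X_test_projected, y_test))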
Perceptrons

The term “perceptron” is used to refer to many neural network structures
(more in later lectures).

The classic type is a fixed non-linear transformation of the input, one
layer of adaptive weights, and a threshold:

f(a) = +1 if a ≥ 0, and −1 if a < 0

Developed by Frank Rosenblatt in the 50s.

The main difference compared to the methods we’ve seen so far
is the learning algorithm.
Frank Rosenblatt
see Bishop (2006)
Perceptron Learning

Two-class problem

 For ease of notation, we will use t = +1 for class C1 and t = −1 for class C2.


We saw that squared error was problematic.

Instead, we’d like to minimize the number of misclassified examples.

An example xn is misclassified if tn w^T φ(xn) ≤ 0.

Perceptron criterion: EP(w) = − Σ_{n ∈ M} tn w^T φ(xn)

→ the sum runs over the set M of misclassified examples only


Perceptron Learning Algorithm

Minimize the error function using stochastic gradient descent
(gradient descent per example):

w ← w + η φ(xn) tn   (applied whenever example xn is misclassified)


Iterate over all training examples, only change w if the example
is mis-classified

Guaranteed to converge if data are linearly separable

Will not converge if not

May take many iterations

Sensitive to initialization
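
A minimal NumPy sketch of this learning rule; the linearly separable toy data, learning rate, and epoch limit are invented for illustration.

import numpy as np

# Toy linearly separable data; labels t in {+1, -1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # fixed "feature map": just add a bias term

w = np.zeros(Phi.shape[1])
eta = 1.0                                        # learning rate
for epoch in range(100):
    errors = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:               # misclassified example
            w = w + eta * phi_n * t_n            # stochastic gradient step
            errors += 1
    if errors == 0:                              # converged (guaranteed only if separable)
        break

print('weights:', w, 'epochs:', epoch + 1)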
Perceptron Learning Illustration
(Bishop 2006, Chp. 4)
Limitations of Perceptrons


Perceptrons can only solve linearly separable problems in feature
space.

A canonical example of a non-separable problem is XOR.

Real data sets can look like this too
Probabilistic Generative Models

Up to now we’ve looked at learning classification by choosing
parameters to minimize an error function

We’ll now develop a probabilistic approach

 With 2 classes, C1 and C2, the posterior is

p(C1|x) = p(x|C1) p(C1) / ( p(x|C1) p(C1) + p(x|C2) p(C2) )

In generative models we specify the distribution p(x|Ck) which
generates the data for each class.
Probabilistic Generative Models - Example

Let’s say we observe x which is the current temperature

 Determine if we are in Zürich (C1) or Lausanne(C2)


Generative model:

 p(x|C1 ) is distribution over typical temperatures in Zürich


e.g. p(x|C1 ) = N (x; 10, 5)
 p(x|C2 ) is distribution over typical temperatures in Lausanne

 Class priors p(C1 ) = 0.1, p(C2) = 0.9
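
A small Python sketch of this generative model; interpreting N(x; 10, 5) as mean 10 with standard deviation 5, and the Lausanne distribution N(12, 5), are my assumptions for illustration.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a Gaussian with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = 8.0                                                   # observed temperature
joint_zurich   = gaussian_pdf(x, 10.0, 5.0) * 0.1         # p(x|C1) p(C1)
joint_lausanne = gaussian_pdf(x, 12.0, 5.0) * 0.9         # p(x|C2) p(C2) (assumed distribution)
evidence = joint_zurich + joint_lausanne                  # p(x)

print('p(Zurich | x)   =', joint_zurich / evidence)
print('p(Lausanne | x) =', joint_lausanne / evidence)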


Generalized Linear Models

We can write the classifier in another form:

p(C1|x) = 1 / (1 + exp(−a)) = σ(a),   with   a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ]

This looks like gratuitous math, but if a takes a simple form, this
is another generalized linear model of the kind we have been studying.

Of course, we will see how such a simple form for a arises naturally.
Logistic Sigmoid


The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid.

It squashes the real axis down to [0, 1].

It is continuous and differentiable.

It avoids the problems encountered with least-squares fitting, where
points that are “too correct” are penalized (more later).
Multi-class Extension

There is a generalization of the logistic sigmoid to K > 2 classes:

p(Ck|x) = exp(ak) / Σj exp(aj)

a.k.a. the softmax function.

 If some ak ≫ aj (for all j ≠ k), p(Ck|x) goes to 1.
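
A small, numerically stable NumPy sketch of the softmax; the activation values below are invented for illustration.

import numpy as np

def softmax(a):
    a = a - np.max(a)                 # shift for numerical stability (does not change the result)
    e = np.exp(a)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))    # mild differences -> soft assignment
print(softmax(np.array([1.0, 2.0, 30.0])))   # one a_k >> a_j -> p(Ck|x) close to 1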
Gaussian Class-Conditional Densities
Maximum Likelihood Learning

We can fit the parameters of this model using maximum likelihood.

The parameters are the class prior π = p(C1), the class means μ1, μ2, and the
shared covariance Σ; refer to them collectively as θ.

 For a data point xn from class C1 (tn = 1):  p(xn, C1) = p(C1) p(xn|C1) = π N(xn | μ1, Σ)

 For a data point xn from class C2 (tn = 0):  p(xn, C2) = p(C2) p(xn|C2) = (1 − π) N(xn | μ2, Σ)


Maximum Likelihood Learning
Maximum Likelihood Learning - Class Priors
Maximum Likelihood Learning - Gaussian
Parameters
Gaussian with Different Covariances
Probabilistic Generative Models:
Summary


Fitting Gaussian using ML criterion is sensitive to outliers

Simple linear form for a in logistic sigmoid occurs for more than
just Gaussian distributions

Arises for any distribution in the exponential family, a large
class of distributions
Probabilistic Discriminative Models

Generative model made assumptions about form of
class-conditional distributions (e.g. Gaussian)

Resulted in logistic sigmoid of linear function of x

Discriminative model: explicitly use the functional form

p(C1|x) = σ(w^T φ)

and find w directly.



For the generative model we had 2M + M (M + 1)/2 + 1
parameters

M is dimensionality of x

Discriminative model will have M + 1 parameters
Generative vs. Discriminative

Generative models
 Can generate synthetic example data
 Perhaps accurate classification is equivalent to accurate synthesis, e.g. vision and graphics
 Tend to have more parameters
 Require a good model of the class distributions

Discriminative models
 Only usable for classification
 Don’t solve a harder problem than you need to
 Tend to have fewer parameters
 Require a good model of the decision boundary
Maximum Likelihood Learning -
Discriminative Model
Iterative Reweighted Least Squares
Conclusion
3. A detailed example –
Logistic Regression

Logistic regression is a simple-yet-popular classification method
that builds on multiple linear regression.

Data matrix X encodes n data points each of which has
values for m numerical features.


Initially, we focus on the case of binary classification, i.e.,
our target vector contains only values in {0, 1}.
Logistic Regression (I)


Logistic regression predicts values in [0, 1], each of which
can be interpreted as the probability that the corresponding
data point belongs to class 1.

To determine the predicted class in {0, 1} of a data point, we can threshold
based on this probability, e.g., predict class 1 if the predicted probability
is at least 0.5 and class 0 otherwise.
Logistic Regression (II)

Logistic regression assumes that there is a linear relationship
between the log odds of the data point belonging to class 1
and the numerical features:

ln( ŷ(i) / (1 − ŷ(i)) ) = w0 + w1 x1(i) + … + wm xm(i)

with the right-hand side as a linear combination of the features of data point x(i),
 with coefficients wj that we need to determine.
Logistic Regression (III)

When we solve for the predicted value, we obtain

ŷ(i) = 1 / ( 1 + exp( −(w0 + w1 x1(i) + … + wm xm(i)) ) )

which is guaranteed to be in [0, 1].

The logistic function (also known as the sigmoid function)
maps values from (−∞, +∞) onto [0, 1].
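
A small sketch of this prediction rule in Python; the coefficients and the two [horsepower, weight] rows below are toy values, not fitted to the car data.

import numpy as np

def predict_proba(X, w):
    # Predicted probability of class 1: 1 / (1 + exp(-(w0 + w1*x1 + ... + wm*xm)))
    z = w[0] + X @ w[1:]
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-4.0, 0.02, 0.001])                    # toy coefficients: intercept, power, weight
X = np.array([[130.0, 3500.0], [60.0, 2000.0]])      # [horsepower, weight] for two cars
p = predict_proba(X, w)
print(p, (p >= 0.5).astype(int))                     # probabilities and thresholded classes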
Logistic Function
Logit Regression

How can we determine the coefficient vector w = (w0, w1, …, wm)?

Different choices of the coefficients correspond to different
separating hyperplanes (e.g., straight lines), such that data
points from the two classes lie on different sides of the hyperplane.
Logit Regression
Loss Function


We need a loss function to measure how well a specific
choice of the coefficients fits our data

Recall that our predicted values can be interpreted as
probabilities that the data point belongs to class 1
Loss Function (II)

Intuitively, we want to choose the coefficients in such a way
that we assign high probability to data points from class 1
and low probability to data points from class 0:

G(w) = Π_{i : y(i) = 1} ŷ(i) · Π_{i : y(i) = 0} ( 1 − ŷ(i) )

Here, the value G(w) is the likelihood (probability) that our
regression model predicts the classes of our data points
correctly for a specific choice of coefficients w.
Loss Function (III)


In practice, one considers the negative log-likelihood as
a loss function, which can then be minimized using (stochastic)
gradient descent.
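
A minimal NumPy sketch of minimizing the negative log-likelihood with plain gradient descent; the toy data, learning rate, and iteration count are invented for illustration (demo/auto_class.py may well use a library routine instead).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: a constant column for the intercept plus two features
X = np.array([[1, 0.5, 1.0], [1, 1.0, 1.5], [1, 3.0, 3.5], [1, 4.0, 4.0]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)

w = np.zeros(X.shape[1])
eta = 0.1                                   # learning rate
for step in range(500):
    grad = X.T @ (sigmoid(X @ w) - y)       # gradient of the negative log-likelihood
    w -= eta * grad

print('coefficients:', w, 'final loss:', neg_log_likelihood(w, X, y))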
Predicting Origin from Power and Weight
See the code here: demo/auto_class.py
Predicting Origin from Power and Weight
Plotting the Decision Boundary
(Training Data)
Plotting the Decision Boundary
(Training Data)
Plotting the Decision Boundary
(Test Data)
4. Evaluating Classifiers

How can we assess the prediction quality of a classifier?

Initially, we’ll consider the case of binary classification
(and extend it later to multi-class classification).

A confusion matrix shows the performance of a classifier by tabulating
true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Accuracy and Error Rate

Accuracy measures the classifier’s ability to put data points into the right class:

accuracy = (TP + TN) / (TP + TN + FP + FN)

The error rate, as the counterpart to accuracy, reflects to what
extent the classifier puts data points into the wrong class:

error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − accuracy
False-Positive Rate and True Positive Rate


The false-positive rate is the fraction of negative (0 / No)
data points that are falsely classified as positive (1 / Yes):

FPR = FP / (FP + TN)

The true-positive rate is the fraction of positive (1 / Yes)
data points that are correctly classified as positive (1 / Yes):

TPR = TP / (TP + FN)
Precision and Recall


Precision reflects the classifier’s ability to correctly
detect positive (1 / Yes) data points:

precision = TP / (TP + FP)

Recall reflects the classifier’s ability to detect all
positive (1 / Yes) data points:

recall = TP / (TP + FN)
F1-Measure


Precision and recall are widely used in information retrieval.

The F1-measure, as the harmonic mean of precision and recall,
combines both in a single measure:

F1 = 2 · precision · recall / (precision + recall)
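
A small Python sketch computing these measures from the four confusion-matrix counts; the counts passed in at the bottom are made up, not the ones from the car example.

def classification_metrics(tp, fp, fn, tn):
    # Quality measures for a binary classifier, as defined above
    accuracy   = (tp + tn) / (tp + tn + fp + fn)
    error_rate = 1.0 - accuracy
    fpr        = fp / (fp + tn)            # false-positive rate
    tpr        = tp / (tp + fn)            # true-positive rate (= recall)
    precision  = tp / (tp + fp)
    recall     = tpr
    f1         = 2 * precision * recall / (precision + recall)
    return accuracy, error_rate, fpr, tpr, precision, recall, f1

# Invented counts for illustration
print(classification_metrics(tp=50, fp=10, fn=5, tn=35))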
Accuracy and Error Rate


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight
False-Positive Rate and True Positive Rate


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight.
Precision and Recall


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight.
Computing Quality Measures
See the code here: demo/CompQualityMeasures.py
Micro- and Macro-Averages

When dealing with more than two classes, the confusion
matrix has one column and one row per class

 For each class i, we can now determine the numbers TNi, FPi,
FNi, TPi, assuming that it is the positive class, and all others are
treated as negative.
Micro- and Macro-Averages (III)

Micro-averages plug the summed per-class numbers into the definitions, e.g.

micro-averaged precision = Σi TPi / Σi (TPi + FPi)

Macro-averages average the per-class quality assessments, e.g.

macro-averaged precision = (1/K) Σi precisioni
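
A Python sketch of micro- vs. macro-averaging from a multi-class confusion matrix; the matrix values and the rows-are-true / columns-are-predicted convention are assumptions for illustration.

import numpy as np

# Invented 3-class confusion matrix: rows = true class, columns = predicted class
C = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

tp = np.diag(C).astype(float)          # TP_i
fp = C.sum(axis=0) - tp                # FP_i: predicted as class i, but actually another class
fn = C.sum(axis=1) - tp                # FN_i: actually class i, but predicted as another class

micro_precision = tp.sum() / (tp.sum() + fp.sum())    # plug summed counts into the definition
macro_precision = np.mean(tp / (tp + fp))             # average the per-class precisions
micro_recall    = tp.sum() / (tp.sum() + fn.sum())
macro_recall    = np.mean(tp / (tp + fn))

print('precision (micro, macro):', micro_precision, macro_precision)
print('recall    (micro, macro):', micro_recall, macro_recall)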
