
Advanced Data Analytics

Lecture 4

Simon Scheidegger
Today’s Roadmap

1. Recall: Supervised Machine Learning – Classification

2. Linear Models for Classification


2.1. Discriminant Functions
2.2. Generative Models
2.3. Discriminative Models

3. Another detailed Example: Logistic Regression

4. Evaluating Classifiers
1. Classification – Motivation
Test Images
Recall: “our example”


Classification aims at predicting a nominal target feature
based on one or multiple other (numerical) features.

Example: Predict the origin (U.S. vs. Non-U.S.) of a car based
on its power (in horsepower) and weight (in pounds).

Since our predictor f(x) takes values in a discrete set G, we can
always divide the input space into a collection of regions labeled
according to the classification.
Which country?
*https://archive.ics.uci.edu/ml/datasets/Auto+MPG
Classification Boundaries


Given an input vector x, assign it to one of K discrete classes Ck, k = 1, …, K.


Assumption: classes are disjoint, i.e., input vectors are assigned
to exactly one class.

Idea: Divide input space into decision regions whose boundaries
are called decision boundaries/surfaces.
Decision Theory


We want a framework that allows us to make optimal decisions
(in situations involving uncertainty).

Consider, for example, a medical diagnosis problem in which we
have taken an X-ray image of a patient, and we wish to
determine whether the patient has cancer or not.

In this case, the input vector x is the set of pixel intensities in the
image, and the output variable t will represent the presence of
cancer, which we denote by the class C1, or the absence of
cancer, which we denote by the class C2.


We might, for instance, choose t to be a binary variable such that
t = 0 corresponds to class C1 and t = 1 corresponds to class C2.
Decision Theory – more abstract

 For a sample x, decide which class Ck it belongs to.


Ideas to do so:

1. Maximum Likelihood

2. Minimum Loss/Cost (e.g. misclassification rate)

3. Maximum A Posteriori (MAP)


Decision Theory (II)

Although the posterior probability p(Ck|x) can be a very useful and informative
quantity, in the end we must decide either to give treatment to the patient or not.

We would like this choice to be optimal in some appropriate sense.

This is the decision step, and it is the subject of decision theory to tell
us how to make optimal decisions given the appropriate probabilities.

Example: When we obtain the X-ray image x for a new patient, our
goal is to decide which of the two classes to assign to the image.

We are interested in the probabilities of the two classes given
the image, which are given by p(Ck |x).
Thomas Bayes and his Terminology
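
In the notation used here, Bayes' theorem reads

p(Ck | x) = p(x | Ck) p(Ck) / p(x),   with   p(x) = Σk p(x | Ck) p(Ck)

where p(Ck) is called the prior, p(x | Ck) the likelihood, p(x) the evidence, and p(Ck | x) the posterior.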
Want to avoid misclassification

Note that any of the quantities appearing in Bayes’ theorem can be
obtained from the joint distribution p(x, Ck ) by either marginalizing or
conditioning with respect to the appropriate variables.

 We can now interpret p(Ck) as the prior probability for the class Ck,
and p(Ck|x) as the corresponding posterior probability.

 Thus p(C1) represents the probability that a person has cancer, before
we take the X-ray measurement.

 Similarly, p(C1|x) is the corresponding probability, revised using Bayes'
theorem in light of the information contained in the X-ray.

If our aim is to minimize the chance of assigning x to the wrong
class, then intuitively we would choose the class having the higher
posterior probability.
Minimizing the misclassification rate

Suppose that our goal is simply to make as few mis-
classifications as possible.

We need a rule that assigns each value of x to one of the
available classes.

 Such a rule will divide the input space into regions Rk called
decision regions, one for each class, such that all points in Rk
are assigned to class Ck .


The boundaries between decision regions are called decision
boundaries or decision surfaces.
Minimizing the misclassification rate

In order to find the optimal decision rule, consider first of all the case
of two classes, as in the cancer problem.

 A mistake occurs when an input vector belonging to class C1 is assigned
to class C2, or vice versa.

The probability of this occurring is given by

p(mistake) = p(x ∈ R1, C2) + p(x ∈ R2, C1) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
Decision: Maximum Likelihood


Inference step: Determine the class-conditional densities p(x|Ck) from the training data.

Decision step: Determine the optimal class label t for a test input x by maximizing the likelihood:

t* = arg max_k p(x|Ck)
Decision: Minimum Misclassification Rate

Decision: Minimum Loss/Cost
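
For reference, the two criteria in Bishop's notation:

Minimum misclassification rate: assign each x to the class Ck with the largest posterior p(Ck | x), equivalently the largest joint p(x, Ck).

Minimum loss/cost: given a loss matrix Lkj (the cost of assigning a point of true class Ck to class Cj), minimize the expected loss

E[L] = Σk Σj ∫_{Rj} Lkj p(x, Ck) dx

which is achieved by assigning each x to the class Cj that minimizes Σk Lkj p(Ck | x).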
Maximize the probability of being correct

For the more general case of K classes, it is slightly easier to
maximize the probability of being correct, which is given by

p(correct) = Σk p(x ∈ Rk, Ck) = Σk ∫_{Rk} p(x, Ck) dx

→ which is maximized when the regions Rk are chosen such that
each x is assigned to the class for which p(x, Ck) is largest.
Decision: Maximum A Posteriori (MAP)


Provides a posterior belief for the estimate, rather
than a single point estimate.

Can utilize prior information in the decision.
The Rejection option

There are regions where we are relatively uncertain
about class membership.

In some applications, it will be appropriate to
avoid making decisions on the difficult cases
in anticipation of a lower error rate on those
examples for which a classification decision is made.

This is known as the reject option.

For example, in our hypothetical medical illustration,
it may be appropriate to use an automatic
system to classify those X-ray images for
which there is little doubt as to the correct class,
while leaving a human expert to classify the more ambiguous cases.

We can achieve this by introducing a threshold θ and rejecting those
inputs x for which the largest of the
posterior probabilities p(Ck |x) is less than or equal to θ.
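
A minimal sketch of this reject option in Python; the posterior values and the threshold used below are invented for illustration.

import numpy as np

def classify_with_reject(posteriors, theta=0.8):
    # Return the index of the most probable class, or None (reject) if the
    # largest posterior p(Ck|x) is less than or equal to the threshold theta.
    k = int(np.argmax(posteriors))
    if posteriors[k] <= theta:
        return None          # reject: leave this case to a human expert
    return k

# Posterior probabilities p(C1|x), p(C2|x) for two hypothetical X-ray images
print(classify_with_reject(np.array([0.55, 0.45])))   # -> None (rejected)
print(classify_with_reject(np.array([0.97, 0.03])))   # -> 0 (class C1)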
2. Linear Classification
See, e.g., Bishop Chapters 1 and 4

Why Linear Classifiers?



Statistical simplicity: We often cannot estimate more complex
classifiers than linear discriminant functions due to
lack of data and high dimensionality of the feature
space.

Computational efficiency: Linear classifiers have efficient
learning algorithms and are efficient inference machines.

Historical: Perceptrons were invented by Rosenblatt (1957)
and rigorously analyzed by Novikov (1962) who proved
convergence of the perceptron learning algorithm.
Linear Classifier


Focus on linear classification models, i.e., models whose decision boundary is
a linear function of x.

→ The boundary is defined by a (D−1)-dimensional hyperplane.



If the data can be separated exactly by linear decision surfaces,
they are called linearly separable.


Implicit assumption: Classes can be modeled well by Gaussians.

→ Here: Treat classification as a projection problem.


Generalized Linear Models

Similar to the previous chapter on linear models for regression, we
will use a “linear” model for classification:

y(x) = f(w^T x + w0)

This is called a generalized linear model; f(·) is a fixed non-linear
function, e.g. a threshold function.

The decision boundary between classes will be a linear function of x.

 We can also apply a non-linearity to x, as in the basis functions φi(x) used for regression.


Generalized Linear Model
Discriminant Functions with Two Classes
Multiple Classes – Bad idea


A linear discriminant between two classes separates with a
hyperplane

How can we use this for multiple classes?

 One-versus-the-rest method: build K − 1 classifiers, each between Ck
and all the others.

One-versus-one method: build K(K − 1)/2 classifiers, one between
each pair of classes.
Multiple Classes – Better idea

A solution is to build K linear functions:

yk(x) = wk^T x + wk0

Assign a point x to class Ck if yk(x) > yj(x) for all j ≠ k.

This gives connected, convex decision regions.


Least Squares for Classification

How do we learn the decision boundaries?

One approach is to use least squares, similar to regression

Find W to minimize squared error over all examples and all
components of the label vector:


After some algebra, we get a solution using the pseudo-inverse
as in regression
Some details on how to compute W

Each class Ck is described by its own linear model, so that

yk(x) = wk^T x + wk0,   where k = 1, . . . , K.

We can conveniently group these together using vector notation, so that

y(x) = W^T x

where W is the parameter matrix whose k-th column holds (wk0, wk^T)^T, and x is augmented with a leading 1.

We now determine the parameter matrix W by minimizing a sum-of-squares
error function, as we did for regression.

The sum-of-squares error function can then be written as

E(W) = ½ Tr{ (XW − T)^T (XW − T) }

where the n-th rows of X and T hold the n-th (augmented) input and its 1-of-K target vector.

Setting the derivative with respect to W to zero, and rearranging, we then
obtain the solution for W in the form

W = (X^T X)^{-1} X^T T = X^† T

where X^† denotes the pseudo-inverse of X.
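
A minimal NumPy sketch of this pseudo-inverse solution, with toy data and 1-of-K targets invented for illustration.

import numpy as np

# Toy data: n = 4 points in 2-D, K = 2 classes with 1-of-K target vectors
X = np.array([[1.0, 2.0], [2.0, 1.0], [6.0, 5.0], [7.0, 6.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1 for the bias term
W = np.linalg.pinv(X_aug) @ T                      # W = X^dagger T (pseudo-inverse solution)

x_new = np.array([1.0, 6.5, 5.5])                  # new (augmented) input
y = x_new @ W                                      # evaluate all K linear functions
print('predicted class:', int(np.argmax(y)))       # assign to the class with the largest yk(x)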
Problems with Least Squares


The least-squares decision boundary looks OK, and is similar to the
logistic regression decision boundary (more later).

However, it gets worse by adding “easy” points! Why?

→ If the target value is 1, points far from the boundary will have a high
predicted value, say 10; this is a large squared error, so the boundary is moved.
More Problems with Least Squares


Easily separated by hyperplanes, but not found using least
squares!

We’ll address these problems later with better models

First, a look at a different criterion for a linear discriminant
Fisher’s Linear Discriminant (FLD)


The two-class linear discriminant acts as a projection

y = w^T x

followed by a threshold.

In which direction w should we project?

One which separates the classes “well”.
Fisher’s Linear Discriminant


A natural idea would be to project in the direction of the line connecting
class means.

However, problematic if classes have variance in this direction.

Fisher criterion: maximize ratio of inter-class separation (between) to
intra-class variance (inside)
Math time – FLD

Between-class separation (ideally large): (m2 − m1)^2, where mk = w^T μk is the projected mean of class k.

Within-class variance (ideally small): s1^2 + s2^2, the summed variances of the projected classes.

Fisher criterion: maximize

J(w) = (m2 − m1)^2 / (s1^2 + s2^2) = (w^T SB w) / (w^T SW w)

which is maximized by w ∝ SW^{-1} (μ2 − μ1).
Implementation of FLD

1. Standardize the dataset (zero mean, standard deviation of 1).

2. Compute the total mean vector μ as well as the mean vectors per class μc.

3. Compute the within-class and between-class scatter matrices SW and SB.

4. Compute the eigenvalues and eigenvectors of SW^{-1} SB to find the w which
maximizes the Fisher criterion.

5. Select the eigenvectors corresponding to the k largest eigenvalues to create a
d×k-dimensional transformation matrix W, with these eigenvectors as its columns.

6. Use W to transform the original n×d-dimensional dataset X into a
lower-dimensional, n×k-dimensional dataset Y.
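
A compact NumPy sketch of these six steps; the random toy data at the bottom is invented for illustration.

import numpy as np

def fld_transform(X, y, k=2):
    # A sketch of the steps above (not production code).
    X = (X - X.mean(axis=0)) / X.std(axis=0)          # 1. standardize
    mu = X.mean(axis=0)                               # 2. total mean vector
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        X_c = X[y == c]
        mu_c = X_c.mean(axis=0)                       # 2. per-class mean vector
        S_W += (X_c - mu_c).T @ (X_c - mu_c)          # 3. within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(X_c) * (diff @ diff.T)             # 3. between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)   # 4. eigen-decomposition
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:k]].real                    # 5. d x k transformation matrix
    return X @ W                                      # 6. n x k projected dataset

# Toy usage: three classes in 4 dimensions, projected onto 2 discriminative directions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 20)
print(fld_transform(X, y, k=2).shape)                 # -> (60, 2)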
Fisher’s Linear Discriminant – Summary


FLD is a dimensionality reduction technique (more on this later in the course).

Criterion for choosing projection based on class labels.

Still suffers from outliers (e.g. earlier least squares example).
Implementation of LDA

We use the [UCI] wine dataset (https://archive.ics.uci.edu/ml/datasets/wine) which
has 13 dimensions.

We want to find the transformation which makes the three different classes best
linearly separable and plot this transformation in two-dimensional space.

We split the data into 70% training and 30% test data.

Data Set Information:


These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from
three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three
types of wines.

The attributes are


1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

In a classification context, this is a well posed problem with "well behaved" class structures.
LDA in Sklearn
demo/sklearn_lda.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from matplotlib import style
from sklearn.model_selection import train_test_split
style.use('fivethirtyeight')
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 0. Load in the data and split the descriptive and the target feature
df = pd.read_csv('wine.txt',sep=',',
names=['target','Alcohol','Malic_acid','Ash','Akcakinity','Magnesium',
'Total_pheonols','Flavanoids','Nonflavanoids',
'Proanthocyanins','Color_intensity','Hue','OD280','Proline'])

X = df.iloc[:,1:].copy()
target = df['target'].copy()
X_train, X_test, y_train, y_test = train_test_split(X,target,test_size=0.3,random_state=0)

# 1. Instantiate the method and fit_transform the algorithm


LDA = LinearDiscriminantAnalysis(n_components=2)
"""The n_components key word gives us the projection to the
n most discriminative directions in the dataset.
We set this parameter to two to get a transformation in two dimensional space. """

data_projected = LDA.fit_transform(X_train,y_train)
print(data_projected.shape)
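
The snippet imports KNeighborsClassifier but does not use it; a possible continuation (an assumption on my part, not necessarily part of the original demo) is to classify in the 2-D projected space and check the test accuracy:

# Hypothetical continuation of the snippet above:
# classify in the 2-D LDA space with a k-nearest-neighbour classifier.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(data_projected, y_train)
X_test_projected = LDA.transform(X_test)     # project the held-out test data
print('Test accuracy:', knn.score(X_test_projected, y_test))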
Perceptrons

The term “perceptron” is used to refer to many neural network structures
(more in later lectures).

The classic type is a fixed non-linear transformation of the input, one
layer of adaptive weights, and a threshold:

f(a) = +1 if a ≥ 0, and −1 if a < 0

Developed by Frank Rosenblatt in the 50s.

The main difference compared to the methods we’ve seen so far
is the learning algorithm.
Frank Rosenblatt
see Bishop (2006)
Perceptron Learning

Two-class problem

 For ease of notation, we will use t = +1 for class C1 and t = −1 for class C2.


We saw that squared error was problematic.

Instead, we’d like to minimize the number of misclassified examples.

An example xn is misclassified if tn w^T φ(xn) ≤ 0.

Perceptron criterion: EP(w) = − Σ_{n ∈ M} tn w^T φ(xn)

→ the sum runs over the set M of misclassified examples only


Perceptron Learning Algorithm

Minimize the error function using stochastic gradient descent
(gradient descent per example):

w ← w + η φ(xn) tn   (applied whenever example xn is misclassified)


Iterate over all training examples, only change w if the example
is mis-classified

Guaranteed to converge if data are linearly separable

Will not converge if not

May take many iterations

Sensitive to initialization
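
A minimal NumPy sketch of this learning rule; the linearly separable toy data, learning rate, and epoch limit are invented for illustration.

import numpy as np

# Toy linearly separable data; labels t in {+1, -1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
t = np.array([1, 1, -1, -1])
Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # fixed "feature map": just add a bias term

w = np.zeros(Phi.shape[1])
eta = 1.0                                        # learning rate
for epoch in range(100):
    errors = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:               # misclassified example
            w = w + eta * phi_n * t_n            # stochastic gradient step
            errors += 1
    if errors == 0:                              # converged (guaranteed only if separable)
        break

print('weights:', w, 'epochs:', epoch + 1)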
Perceptron Learning Illustration
(Bishop 2006, Chp. 4)
Limitations of Perceptrons


Perceptrons can only solve linearly separable problems in feature
space.

A canonical example of a non-separable problem is XOR.

Real data sets can look like this too
Probabilistic Generative Models

Up to now we’ve looked at learning classification by choosing
parameters to minimize an error function

We’ll now develop a probabilistic approach

 With 2 classes, C1 and C2, the posterior is

p(C1|x) = p(x|C1) p(C1) / ( p(x|C1) p(C1) + p(x|C2) p(C2) )

In generative models we specify the distribution p(x|Ck) which
generates the data for each class.
Probabilistic Generative Models - Example

Let’s say we observe x which is the current temperature

 Determine if we are in Zürich (C1) or Lausanne(C2)


Generative model:

 p(x|C1 ) is distribution over typical temperatures in Zürich


e.g. p(x|C1 ) = N (x; 10, 5)
 p(x|C2 ) is distribution over typical temperatures in Lausanne

 Class priors p(C1 ) = 0.1, p(C2) = 0.9
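
A small Python sketch of this generative model; interpreting N(x; 10, 5) as mean 10 with standard deviation 5, and the Lausanne distribution N(12, 5), are my assumptions for illustration.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of a Gaussian with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

x = 8.0                                                   # observed temperature
joint_zurich   = gaussian_pdf(x, 10.0, 5.0) * 0.1         # p(x|C1) p(C1)
joint_lausanne = gaussian_pdf(x, 12.0, 5.0) * 0.9         # p(x|C2) p(C2) (assumed distribution)
evidence = joint_zurich + joint_lausanne                  # p(x)

print('p(Zurich | x)   =', joint_zurich / evidence)
print('p(Lausanne | x) =', joint_lausanne / evidence)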


Generalized Linear Models

We can write the classifier in another form:

p(C1|x) = 1 / (1 + exp(−a)) = σ(a),   with   a = ln [ p(x|C1) p(C1) / ( p(x|C2) p(C2) ) ]

This looks like gratuitous math, but if a takes a simple form, this
is another generalized linear model of the kind we have been studying.

Of course, we will see how such a simple form for a arises naturally.
Logistic Sigmoid


The function σ(a) = 1 / (1 + exp(−a)) is known as the logistic sigmoid.

It squashes the real axis down to [0, 1].

It is continuous and differentiable.

It avoids the problems encountered with least-squares fitting, where
points that are “too correct” are penalized (more later).
Multi-class Extension

There is a generalization of the logistic sigmoid to K > 2 classes:

p(Ck|x) = exp(ak) / Σj exp(aj)

a.k.a. the softmax function.

 If some ak ≫ aj (for all j ≠ k), p(Ck|x) goes to 1.
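
A small, numerically stable NumPy sketch of the softmax; the activation values below are invented for illustration.

import numpy as np

def softmax(a):
    a = a - np.max(a)                 # shift for numerical stability (does not change the result)
    e = np.exp(a)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))    # mild differences -> soft assignment
print(softmax(np.array([1.0, 2.0, 30.0])))   # one a_k >> a_j -> p(Ck|x) close to 1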
Gaussian Class-Conditional Densities
Maximum Likelihood Learning

We can fit the parameters of this model using maximum likelihood.

The parameters are the class prior π = p(C1), the class means μ1, μ2, and the
shared covariance Σ; refer to them collectively as θ.

 For a data point xn from class C1 (tn = 1):  p(xn, C1) = p(C1) p(xn|C1) = π N(xn | μ1, Σ)

 For a data point xn from class C2 (tn = 0):  p(xn, C2) = p(C2) p(xn|C2) = (1 − π) N(xn | μ2, Σ)


Maximum Likelihood Learning
Maximum Likelihood Learning - Class Priors
Maximum Likelihood Learning - Gaussian
Parameters
Gaussian with Different Covariances
Probabilistic Generative Models:
Summary


Fitting Gaussian using ML criterion is sensitive to outliers

Simple linear form for a in logistic sigmoid occurs for more than
just Gaussian distributions

Arises for any distribution in the exponential family, a large
class of distributions
Probabilistic Discriminative Models

Generative model made assumptions about form of
class-conditional distributions (e.g. Gaussian)

Resulted in logistic sigmoid of linear function of x

Discriminative model: explicitly use the functional form

p(C1|x) = σ(w^T φ)

and find w directly.



For the generative model we had 2M + M (M + 1)/2 + 1
parameters

M is dimensionality of x

Discriminative model will have M + 1 parameters
Generative vs. Discriminative

Generative models
 Can generate synthetic example data
 Perhaps accurate classification is equivalent to accurate synthesis, e.g. vision and graphics
 Tend to have more parameters
 Require a good model of the class distributions

Discriminative models
 Only usable for classification
 Don’t solve a harder problem than you need to
 Tend to have fewer parameters
 Require a good model of the decision boundary
Maximum Likelihood Learning -
Discriminative Model
Iterative Reweighted Least Squares
Conclusion
3. A detailed example –
Logistic Regression

Logistic regression is a simple-yet-popular classification method
that builds on multiple linear regression.

Data matrix X encodes n data points each of which has
values for m numerical features.


Initially, we focus on the case of binary classification, i.e.,
our target vector contains only values in {0, 1}.
Logistic Regression (I)


Logistic regression predicts values in [0, 1], each of which
can be interpreted as the probability that the corresponding
data point belongs to class 1.

To determine the predicted class in {0, 1} of a data point, we can threshold
based on this probability, e.g., predict class 1 if the predicted probability
is at least 0.5 and class 0 otherwise.
Logistic Regression (II)

Logistic regression assumes that there is a linear relationship
between the log odds of the data point belonging to class 1
and the numerical features:

ln( ŷ(i) / (1 − ŷ(i)) ) = w0 + w1 x1(i) + … + wm xm(i)

with the right-hand side as a linear combination of the features of data point x(i),
 with coefficients wj that we need to determine.
Logistic Regression (III)

When we solve for the predicted value, we obtain

ŷ(i) = 1 / ( 1 + exp( −(w0 + w1 x1(i) + … + wm xm(i)) ) )

which is guaranteed to be in [0, 1].

The logistic function (also known as the sigmoid function)
maps values from (−∞, +∞) onto [0, 1].
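
A small sketch of this prediction rule in Python; the coefficients and the two [horsepower, weight] rows below are toy values, not fitted to the car data.

import numpy as np

def predict_proba(X, w):
    # Predicted probability of class 1: 1 / (1 + exp(-(w0 + w1*x1 + ... + wm*xm)))
    z = w[0] + X @ w[1:]
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-4.0, 0.02, 0.001])                    # toy coefficients: intercept, power, weight
X = np.array([[130.0, 3500.0], [60.0, 2000.0]])      # [horsepower, weight] for two cars
p = predict_proba(X, w)
print(p, (p >= 0.5).astype(int))                     # probabilities and thresholded classes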
Logistic Function
Logit Regression

How can we determine the coefficient vector w = (w0, w1, …, wm)?

Different choices of the coefficients correspond to different
separating hyperplanes (e.g., straight lines), such that data
points from the two classes lie on different sides of the hyperplane.
Logit Regression
Loss Function


We need a loss function to measure how well a specific
choice of the coefficients fits our data

Recall that our predicted values can be interpreted as
probabilities that the data point belongs to class 1
Loss Function (II)

Intuitively, we want to choose the coefficients in such a way
that we assign high probability to data points from class 1
and low probability to data points from class 0:

G(w) = Π_{i : y(i) = 1} ŷ(i) · Π_{i : y(i) = 0} ( 1 − ŷ(i) )

Here, the value G(w) is the likelihood (probability) that our
regression model predicts the classes of our data points
correctly for a specific choice of coefficients w.
Loss Function (III)


In practice, one considers the negative log-likelihood as
a loss function, which can then be minimized using (stochastic)
gradient descent.
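
A minimal NumPy sketch of minimizing the negative log-likelihood with plain gradient descent; the toy data, learning rate, and iteration count are invented for illustration (demo/auto_class.py may well use a library routine instead).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: a constant column for the intercept plus two features
X = np.array([[1, 0.5, 1.0], [1, 1.0, 1.5], [1, 3.0, 3.5], [1, 4.0, 4.0]], dtype=float)
y = np.array([0, 0, 1, 1], dtype=float)

w = np.zeros(X.shape[1])
eta = 0.1                                   # learning rate
for step in range(500):
    grad = X.T @ (sigmoid(X @ w) - y)       # gradient of the negative log-likelihood
    w -= eta * grad

print('coefficients:', w, 'final loss:', neg_log_likelihood(w, X, y))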
Predicting Origin from Power and Weight
See the code here: demo/auto_class.py
Predicting Origin from Power and Weight
Plotting the Decision Boundary
(Training Data)
Plotting the Decision Boundary
(Training Data)
Plotting the Decision Boundary
(Test Data)
4. Evaluating Classifiers

How can we assess the prediction quality of a classifier?

Initially, we’ll consider the case of binary classification
(and extend it later to multi-class classification).

A confusion matrix shows the performance of a classifier by tabulating
true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN).
Accuracy and Error Rate

Accuracy measures the classifier’s ability to put data points into the right class:

accuracy = (TP + TN) / (TP + TN + FP + FN)

The error rate, as the counterpart to accuracy, reflects to what
extent the classifier puts data points into the wrong class:

error rate = (FP + FN) / (TP + TN + FP + FN) = 1 − accuracy
False-Positive Rate and True Positive Rate


The false-positive rate is the fraction of negative (0 / No)
data points that are falsely classified as positive (1 / Yes):

FPR = FP / (FP + TN)

The true-positive rate is the fraction of positive (1 / Yes)
data points that are correctly classified as positive (1 / Yes):

TPR = TP / (TP + FN)
Precision and Recall


Precision reflects the classifier’s ability to correctly
detect positive (1 / Yes) data points:

precision = TP / (TP + FP)

Recall reflects the classifier’s ability to detect all
positive (1 / Yes) data points:

recall = TP / (TP + FN)
F1-Measure


Precision and recall are widely used in information retrieval.

The F1-measure, as the harmonic mean of precision and recall,
combines both in a single measure:

F1 = 2 · precision · recall / (precision + recall)
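
A small Python sketch computing these measures from the four confusion-matrix counts; the counts passed in at the bottom are made up, not the ones from the car example.

def classification_metrics(tp, fp, fn, tn):
    # Quality measures for a binary classifier, as defined above
    accuracy   = (tp + tn) / (tp + tn + fp + fn)
    error_rate = 1.0 - accuracy
    fpr        = fp / (fp + tn)            # false-positive rate
    tpr        = tp / (tp + fn)            # true-positive rate (= recall)
    precision  = tp / (tp + fp)
    recall     = tpr
    f1         = 2 * precision * recall / (precision + recall)
    return accuracy, error_rate, fpr, tpr, precision, recall, f1

# Invented counts for illustration
print(classification_metrics(tp=50, fp=10, fn=5, tn=35))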
Accuracy and Error Rate


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight
False-Positive Rate and True Positive Rate


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight.
Precision and Recall


Confusion matrix for our binary car classifier (Non-U.S. = 0
vs. U.S. =1) using logistic regression on power and weight.
Computing Quality Measures
See the code here: demo/CompQualityMeasures.py
Micro- and Macro-Averages

When dealing with more than two classes, the confusion
matrix has one column and one row per class

 For each class i, we can now determine the numbers TNi, FPi,
FNi, TPi, assuming that it is the positive class, and all others are
treated as negative.
Micro- and Macro-Averages (III)

Micro-averages plug the summed per-class numbers into the definitions, e.g.

micro-averaged precision = Σi TPi / Σi (TPi + FPi)

Macro-averages average the per-class quality assessments, e.g.

macro-averaged precision = (1/K) Σi precisioni
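
A Python sketch of micro- vs. macro-averaging from a multi-class confusion matrix; the matrix values and the rows-are-true / columns-are-predicted convention are assumptions for illustration.

import numpy as np

# Invented 3-class confusion matrix: rows = true class, columns = predicted class
C = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

tp = np.diag(C).astype(float)          # TP_i
fp = C.sum(axis=0) - tp                # FP_i: predicted as class i, but actually another class
fn = C.sum(axis=1) - tp                # FN_i: actually class i, but predicted as another class

micro_precision = tp.sum() / (tp.sum() + fp.sum())    # plug summed counts into the definition
macro_precision = np.mean(tp / (tp + fp))             # average the per-class precisions
micro_recall    = tp.sum() / (tp.sum() + fn.sum())
macro_recall    = np.mean(tp / (tp + fn))

print('precision (micro, macro):', micro_precision, macro_precision)
print('recall    (micro, macro):', micro_recall, macro_recall)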
