CIS 700 Advanced Machine Learning For NLP: Dan Roth


CIS 700

Advanced Machine Learning for NLP

Multiclass classification: Local and Global Views

Dan Roth
Department of Computer and Information Science
University of Pennsylvania
Augmented and modified by Vivek Srikumar
Page 1
Admin Stuff
• Class web page
http://www.cis.upenn.edu/~danroth/Teaching/CIS-700-006/

• Piazza

• HW0 ???

2
Multiclass classification

• Introduction

• Combining binary classifiers


– One-vs-all
– All-vs-all
– Error correcting codes

• Training a single classifier


– Multiclass SVM
– Constraint classification

3
What is multiclass classification?
• An input can belong to one of K classes

• Training data: Input associated with class label (a number from 1


to K)
• Prediction: Given a new input, predict the class label

Each input belongs to exactly one class. Not more, not less.
• Otherwise, the problem is not multiclass classification
• If an input can be assigned multiple labels (think tags for emails
rather than folders), it is called multi-label classification

4
Example applications: Images

– Input: hand-written character; Output: which character?

all map to the letter A

– Input: a photograph of an object; Output: which of a set of


categories of objects is it?
• Eg: the Caltech 256 dataset

[Images labeled: car tire, car tire, duck, laptop]

5
Example applications: Language

• Input: a news article; Output: which section of the


newspaper should it belong to?

• Input: an email; Output: which folder should an email


be placed into?

• Input: an audio command given to a car; Output:


which of a set of actions should be executed?

6
Where are we?

• Introduction

• Combining binary classifiers


– One-vs-all
– All-vs-all
– Error correcting codes

• Training a single classifier


– Multiclass SVM
– Constraint classification

7
Binary to multiclass

• Can we use a binary classifier to construct a


multiclass classifier?
– Decompose the prediction into multiple binary decisions

• How to decompose?
– One-vs-all
– All-vs-all
– Error correcting codes

8
General setting
• Input x ∈ R^n
– The inputs are represented by their feature vectors
• Output y ∈ {1, 2, …, K}
– These classes represent domain-specific labels

• Learning: Given a dataset D = {<x_i, y_i>}


– Need to specify a learning algorithm that uses D to construct a function that can
predict y given x
– Goal: find a predictor that does well on the training data and has low generalization
error

• Prediction/Inference: Given an example x and the learned function (often


also called the model)
– Using the learned function, compute the class label for x

9
1. One-vs-all classification
• Assumption: Each class is individually separable from all the others

• Learning: Given a dataset D = {<x_i, y_i>},
Note: x_i ∈ R^n, y_i ∈ {1, 2, …, K}
– Decompose into K binary classification tasks
– For class k, construct a binary classification task as:
• Positive examples: Elements of D with label k
• Negative examples: All other elements of D
– Train K binary classifiers w_1, w_2, …, w_K using any learning algorithm we
have seen
• Prediction: "Winner Takes All"
argmax_i w_i^T x
Question: What is the dimensionality of each w_i?

10
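A minimal sketch of this recipe in Python/NumPy; the helper train_binary (any binary learner that returns an n-dimensional weight vector for a {+1, -1} problem) is an assumed placeholder, not something defined on these slides:

    import numpy as np

    def train_one_vs_all(X, y, K, train_binary):
        # X: (m, n) feature matrix; y: (m,) labels in {0, ..., K-1}
        W = np.zeros((K, X.shape[1]))
        for k in range(K):
            y_k = np.where(y == k, 1.0, -1.0)   # class k vs. all the rest
            W[k] = train_binary(X, y_k)         # assumed binary learner
        return W

    def predict_one_vs_all(W, x):
        return int(np.argmax(W @ x))            # winner takes all: argmax_i w_i^T x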
Visualizing One-vs-all
From the full dataset, construct three
binary classifiers, one for each class

w_blue^T x > 0 for blue inputs
w_red^T x > 0 for red inputs
w_green^T x > 0 for green inputs

Notation: score for the blue label
Winner Take All will predict the right answer: only the correct label will have a positive score.

11
One-vs-all may not always work
Black points are not separable with a single binary
classifier

The decomposition will not work for these cases!

w_blue^T x > 0 for blue inputs
w_red^T x > 0 for red inputs
w_green^T x > 0 for green inputs
??? for the black inputs

12
One-vs-all classification: Summary

• Easy to learn
– Use any binary classifier learning algorithm
• Problems
– No theoretical justification
– Calibration issues
• We are comparing scores produced by K classifiers trained
independently. No reason for the scores to be in the same
numerical range!
– Might not always work
• Yet, works fairly well in many cases, especially if the underlying
binary classifiers are tuned and regularized

13
2. All-vs-all classification
Sometimes called one-vs-one

• Assumption: Every pair of classes is separable


• Learning: Given a dataset D = {<x_i, y_i>},
Note: x_i ∈ R^n, y_i ∈ {1, 2, …, K}
– For every pair of labels (j, k), create a binary classifier with:
• Positive examples: All examples with label j
• Negative examples: All examples with label k

– Train C(K, 2) = K(K-1)/2 classifiers in all

• Prediction: More complex; each label gets K-1 votes


– How to combine the votes? Many methods
• Majority: Pick the label with maximum votes
• Organize a tournament between the labels

14
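A matching sketch of all-vs-all training and majority voting, again assuming the same hypothetical train_binary helper:

    import numpy as np
    from itertools import combinations

    def train_all_vs_all(X, y, K, train_binary):
        classifiers = {}
        for j, k in combinations(range(K), 2):        # K(K-1)/2 label pairs
            mask = (y == j) | (y == k)
            y_jk = np.where(y[mask] == j, 1.0, -1.0)  # label j positive, label k negative
            classifiers[(j, k)] = train_binary(X[mask], y_jk)
        return classifiers

    def predict_all_vs_all(classifiers, K, x):
        votes = np.zeros(K)
        for (j, k), w in classifiers.items():
            votes[j if w @ x > 0 else k] += 1         # each pairwise classifier casts one vote
        return int(np.argmax(votes))                  # majority vote; ties broken arbitrarily here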
All-vs-all classification

• Every pair of labels is linearly separable here


– When a pair of labels is considered, all others are ignored
• Problems
1. O(K^2) weight vectors to train and store
2. Size of training set for a pair of labels could be very small,
leading to overfitting
3. Prediction is often ad-hoc and might be unstable
Eg: What if two classes get the same number of votes? For a tournament,
what is the sequence in which the labels compete?

15
Summary: One-vs-All vs. All vs. All
• Assume m examples, k class labels.
– For simplicity, say m/k in each class.
• One vs. All:
– classifier f_i: m/k (+) and (k-1)m/k (-)
– Decision:
– Evaluate k linear classifiers and do Winner Takes All (WTA):
– f(x) = argmax_i f_i(x) = argmax_i w_i^T x
• All vs. All:
– Classifier f_ij: m/k (+) and m/k (-)
– More expressivity, but fewer examples to learn from.
– Decision:
– Evaluate O(k^2) linear classifiers; decision sometimes unstable.
• What type of learning methods would prefer All vs. All
(efficiency-wise)?
(Think about Dual/Primal)
16
A Challenge: One-vs-All vs. All vs. All
• You are asked to implement a generic multiclass linear
classification box.
• Which method will you use, One vs All or All vs All?

• Dual or Primal?
– Are we going to use kernels or not?
• Primal: algorithms are O(# of examples)
• Dual: algorithms are O((# of examples)^2)
– Why?

         | One vs All | All vs All
  Primal | k·m        | k^2 · (m/k) = k·m
  Dual   | k·m^2      | k^2 · (m/k)^2 = m^2

17
Error Correcting Codes Decomposition
• 1-vs-all uses k classifiers for k labels; can you use only log_2 k?
• Reduce the multi-class classification to random binary problems.
– Choose a "code word" for each label.
– K=8: all we need is 3 bits, i.e., three classifiers
• Rows: an encoding of each class (k rows)
• Columns: L dichotomies of the data, each corresponds to a new classification problem
• Extreme cases:
– 1-vs-all: k rows, k columns
– k rows, log_2 k columns

  Label | P1 | P2 | P3 | P4
    1   |  - |  + |  - |  +
    2   |  - |  + |  + |  -
    3   |  + |  - |  - |  +
    4   |  + |  - |  + |  +
   ...  |    |    |    |
    k   |  - |  + |  - |  -

• Each training example is mapped to one example per column
– (x,3) → {(x,P1), +; (x,P2), -; (x,P3), -; (x,P4), +}
• To classify a new example x:
– Evaluate the hypothesis on the 4 binary problems {(x,P1), (x,P2), (x,P3), (x,P4)}
– Choose the label that is most consistent with the results.
• Use Hamming distance (bit-wise distance)
• Nice theoretical results as a function of the performance of the P_i's (depending on code & size)
• Potential problems?
Can you separate any dichotomy?
18
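A small sketch of ECOC training and Hamming-distance decoding; the code matrix M (one +1/-1 row per label, one column per dichotomy such as P1..P4 above) and the train_binary helper are assumptions for illustration:

    import numpy as np

    def train_ecoc(X, y, M, train_binary):
        # M: (K, L) code matrix; row r is the code word of label r
        K, L = M.shape
        W = np.zeros((L, X.shape[1]))
        for l in range(L):
            y_l = M[y, l].astype(float)      # relabel every example by column (dichotomy) l
            W[l] = train_binary(X, y_l)
        return W

    def predict_ecoc(W, M, x):
        bits = np.sign(W @ x)                # L binary predictions in {+1, -1}
        hamming = np.sum(M != bits, axis=1)  # bit-wise distance to each code word
        return int(np.argmin(hamming))       # label most consistent with the results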
Questions?

Decomposition methods: Summary


• General idea
– Decompose the multiclass problem into many binary problems
– We know how to train binary classifiers
– Prediction depends on the decomposition
• Constructs the multiclass label from the output of the binary classifiers
• Learning optimizes local correctness
– Each binary classifier does not need to be globally correct
• That is, the classifiers do not need to agree with each other
• E.g., two classifiers might think that a given example “belongs” to them.
– The learning algorithm is not even aware of the prediction procedure!
• Poor decomposition gives poor performance
– Difficult local problems, can be “unnatural”
• Eg. for ECOC, why should the binary problems be separable?
• Not clear how to generalize multi-class to problems with a very large # of
output variables.
– Examples? What’s unique about these problems?
19
Where are we?

• Introduction

• Combining binary classifiers


– One-vs-all
– All-vs-all
– Error correcting codes

• Training a single classifier


– Multiclass SVM
– Constraint classification

20
Motivation

• Decomposition methods
– Do not account for how the final predictor will be used
– Do not optimize any global measure of correctness

• Goal: To train a multiclass classifier that is “global”

21
Recall: Margin for binary classifiers

• The margin of a hyperplane for a dataset is the


distance between the hyperplane and the data point
nearest to it.

[Figure: positive (+) and negative (-) points separated by a hyperplane; the margin with respect to this hyperplane is the distance to the nearest point]

22
Multiclass margin

Defined as the score difference between the highest


scoring label and the second one
[Figure: Multiclass margin. Bar chart of the score w_label^T x for each label (Blue, Red, Green, Black); the margin is the gap between the two highest-scoring labels]

23
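A small sketch of how this margin can be computed from the label scores (names here are illustrative); when y is the highest-scoring label, this is exactly the gap between the top two scores in the figure:

    import numpy as np

    def multiclass_margin(W, x, y):
        # W: (K, n), one weight vector per label; y: index of the reference (e.g., true) label
        scores = W @ x
        runner_up = np.max(np.delete(scores, y))  # best score among the other labels
        return scores[y] - runner_up              # positive iff label y is the strict winner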
Multiclass SVM (Intuition)

• Recall: Binary SVM


– Maximize margin
– Equivalently,
Minimize the norm of the weights such that the closest points to the hyperplane
have a score of ±1

• Multiclass SVM
– Each label has a different weight vector (like one-vs-all)
– Maximize multiclass margin
– Equivalently,
Minimize total norm of the weights such that the true label is scored at least
1 more than the second best one

24
Multiclass SVM in the separable case
Recall hard binary SVM
Size of the weights. Effectively, a regularizer.

The score for the true label is higher than the score for any other label by 1.

25
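A standard way to write the separable-case objective that matches the annotations above (a sketch; the slide's own formula may use different notation):

    \min_{w_1,\dots,w_K} \; \frac{1}{2}\sum_{k=1}^{K}\|w_k\|^2
    \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \ge 1 \qquad \forall i,\; \forall k \ne y_i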
Multiclass SVM: General case
Size of the weights. Effectively, a regularizer.
Total slack. Effectively, don't allow too many examples to violate the margin constraint.

The score for the true label is higher than the score for any other label by 1.
Slack variables: not all examples need to satisfy the margin constraint.

Slack variables can


only be positive
26
Multiclass SVM: General case
Size of the weights. Effectively, a regularizer.
Total slack. Effectively, don't allow too many examples to violate the margin constraint.

The score for the true label is higher than the score for any other label by 1 - ξ_i.
Slack variables: not all examples need to satisfy the margin constraint.

Slack variables can


only be positive
27
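A standard general-case objective that matches the annotations above (regularizer, total slack, margin of 1 - ξ_i, non-negative slack), again as a sketch rather than the slide's exact notation:

    \min_{w,\,\xi} \; \frac{1}{2}\sum_{k=1}^{K}\|w_k\|^2 + C\sum_i \xi_i
    \quad \text{s.t.} \quad w_{y_i}^T x_i - w_k^T x_i \ge 1 - \xi_i \quad \forall i,\; \forall k \ne y_i,
    \qquad \xi_i \ge 0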
Multiclass SVM

• Generalizes binary SVM algorithm


– If we have only two classes, this reduces to the binary SVM (up to scale)

• Comes with similar generalization guarantees as the


binary SVM

• Can be trained using different optimization methods


– Stochastic sub-gradient descent can be generalized
• Try as exercise

28
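One possible stochastic sub-gradient step for the objective above, in the spirit of the suggested exercise; lam and eta are assumed regularization and step-size hyperparameters:

    import numpy as np

    def multiclass_svm_sgd_step(W, x, y, lam, eta):
        # W: (K, n), one weight vector per label; (x, y) is one training example
        scores = W @ x
        scores_aug = scores + 1.0
        scores_aug[y] -= 1.0                 # require a margin of 1 against every other label
        y_hat = int(np.argmax(scores_aug))   # most violating label
        W *= (1.0 - eta * lam)               # sub-gradient of the regularizer
        if y_hat != y:                       # hinge active: promote the true label, demote the violator
            W[y] += eta * x
            W[y_hat] -= eta * x
        return W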
Multiclass SVM: Summary
• Training:
– Optimize the “global” SVM objective

• Prediction:
– Winner takes all
argmax_i w_i^T x

• With K labels and inputs in R^n, we have nK weights in all


– Same as one-vs-all

• Why does it work?


– Why is this the “right” definition of multiclass margin?

• A theoretical justification, along with extensions to other algorithms beyond SVM, is given by "Constraint Classification"
– Applies also to multi-label problems, ranking problems, etc.
– [Dav Zimak; with D. Roth and S. Har-Peled]

29
Where are we?

• Introduction

• Combining binary classifiers


– One-vs-all
– All-vs-all
– Error correcting codes

• Training a single classifier


– Multiclass SVM
– Constraint classification

30
One-vs-all again

• Training:
– Create K binary classifiers w_1, w_2, …, w_K
– w_i separates class i from all others
• Prediction: argmax_i w_i^T x
• Observations:
1. At training time, we require w_i^T x to be positive for examples of
class i.

2. Really, all we need is for w_i^T x to be larger than all other scores,
not larger than zero

31
1 Vs All: Learning Architecture
• k label nodes; n input features, nk weights.
• Evaluation: Winner Take All
• Training: Each set of n weights, corresponding to the i-th label, is trained
– Independently, given its performance on example x, and
– Independently of the performance of label j on x.
• Hence: local learning; only the final decision is global (Winner Takes All (WTA)).
• However, this architecture allows multiple learning algorithms; e.g., see the
implementation in the SNoW Multi-class Classifier
[Figure: a two-layer architecture with target nodes (each an LTU) connected to the feature nodes by weighted edges (the weight vectors)]
Recall: Winnow’s Extensions
[Figure: two target nodes, positive (w^+) and negative (w^-)]
• Winnow learns monotone Boolean functions
• We extended to general Boolean functions via
• “Balanced Winnow”
– 2 weights per variable;
– Decision: using the “effective weight”,
the difference between w+ and w-
– This is equivalent to the Winner take all decision
– Learning: In principle, it is possible to use the 1-vs-all rule and update each set
of n weights separately, but we suggested the "balanced" update rule that
takes into account how both sets of n weights predict on example x

If [(w^+ - w^-) · x ≥ θ] ≠ y, then  w_i^+ ← w_i^+ · r^(y·x_i),  w_i^- ← w_i^- · r^(-y·x_i)

Can this be generalized to the case of k labels, k > 2?
We need a "global" learning approach.
Extending Balanced Winnow
• In a 1-vs-all training you have a target node that represents positive
examples and a target node that represents negative examples.
• Typically, we train each node separately (mine/not-mine example).
• Rather, given an example we could say: this is more a + example than a
– example.

If [(w^+ - w^-) · x ≥ θ] ≠ y, then  w_i^+ ← w_i^+ · r^(y·x_i),  w_i^- ← w_i^- · r^(-y·x_i)
• We compared the activation of the different target nodes (classifiers)
on a given example. (This example is more class + than class -)

• Can this be generalized to the case of k labels, k >2?


Constraint Classification
• The examples we give the learner are pairs (x,y), y ∈ {1,…,k}
• The “black box learner” we described might be thought of as a function
of x only but, actually, we made use of the labels y
• How is y being used?
– y decides what to do with the example x; that is, which of the k classifiers
should take the example as a positive example (making it a negative to all the
others).
• How do we make decisions: Is it better in any well defined way?
– Let: f_y(x) = w_y^T · x

– Then, we predict using: y* = argmax_{y=1,…,k} f_y(x)


• Equivalently, we can say that we predict as follows:
– Predict y iff
– ∀ y' ∈ {1,…,k}, y' ≠ y: (w_y^T – w_{y'}^T) · x ≥ 0 (**)
• So far, we did not say how we learn the k weight vectors w_y (y = 1,…,k)
– Can we train in a way that better fits the way we make decisions?
– What does it mean?
We showed: if pairs of labels are separable (a reasonable assumption), then, in some
higher-dimensional space, the problem is linearly separable.

Linear Separability for Multiclass


• We are learning k n-dimensional weight vectors, so we can concatenate them into
w = (w_1, w_2, …, w_k) ∈ R^{nk}
Notice: this is just a representational trick. We did not say how to learn the weight vectors.
• Key Construction: (Kesler Construction; Zimak's Constraint Classification)
– We will represent each example (x,y) as an nk-dimensional vector, x_y, with x embedded in the y-th
part of it (y = 1,2,…,k) and the other coordinates set to 0.
• E.g., x_y = (0,x,0,0) ∈ R^{kn} (here k=4, y=2)
• Now we can understand the n-dimensional decision rule:
Predict y iff ∀ y' ∈ {1,…,k}, y' ≠ y: (w_y^T – w_{y'}^T) · x ≥ 0 (**)
• In the nk-dimensional space:
Predict y iff ∀ y' ∈ {1,…,k}, y' ≠ y: w^T · (x_y – x_{y'}) ≡ w^T · x_{yy'} ≥ 0
• Conclusion: The set (x_{yy'}, +) ≡ (x_y – x_{y'}, +) is linearly separable from the set (−x_{yy'}, −) using the
linear separator w ∈ R^{kn}
– We solved the Voronoi diagram challenge.
Constraint Classification

• Training:
– Given a data set {(x,y)} (m examples) with x ∈ R^n, y ∈ {1,2,…,k},
create a binary classification task:
(x_y - x_{y'}, +), (x_{y'} - x_y, -), for all y' ≠ y  (2m(k-1) examples)
Here x_y ∈ R^{kn}
– Use your favorite linear learning algorithm to train a binary
classifier.
• Prediction:
– Given an nk-dimensional weight vector w and a new
example x, predict: argmax_y w^T x_y

37
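A minimal sketch of this reduction: each labeled example (x, y) is expanded into 2(k-1) binary examples in R^{kn}; the function names are illustrative only:

    import numpy as np

    def embed(x, y, K):
        # place x in the y-th block of a K*n-dimensional vector (x_y in the slides)
        n = x.shape[0]
        v = np.zeros(K * n)
        v[y * n:(y + 1) * n] = x
        return v

    def kesler_expand(X, y, K):
        examples, signs = [], []
        for x_i, y_i in zip(X, y):
            for y_prime in range(K):
                if y_prime == y_i:
                    continue
                d = embed(x_i, y_i, K) - embed(x_i, y_prime, K)
                examples += [d, -d]          # (x_y - x_y', +) and (x_y' - x_y, -)
                signs += [+1, -1]
        return np.array(examples), np.array(signs)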
Linear Separability with multiple classes
(1/3)

For examples with label i, we want w_i^T x > w_j^T x for all j ≠ i
Rewrite inputs and weight vector
• Stack all weight vectors into an
nK-dimensional vector

• Define a feature vector for label i being associated to input x:

x in the ith block, zeros


everywhere else

38
Linear Separability with multiple classes
(2/3)

For examples with label i, we want w_i^T x > w_j^T x for all j ≠ i

x in the ith block, zeros


everywhere else

Equivalent requirement:

This is called the Kesler construction

39
Linear Separability with multiple classes
(3/3)
For examples with label i, we want w_i^T x > w_j^T x for all j ≠ i

Equivalently, the following binary task in nK dimensions that


should be linearly separable

[Figure: positive and negative examples in the nK-dimensional space; for each example (x, i) in the dataset and every other label j, the positive example has x in the i-th block and -x in the j-th block, and the negative example is its negation]

40
Constraint Classification
• Training:
– Given a data set {<x, y>}, create a binary classification task as
<Φ(x, y) - Φ(x, y'), +1>, <Φ(x, y') - Φ(x, y), -1> for all y' ≠ y
– Use your favorite algorithm to train a binary classifier
• Exercise: What do the perceptron update rules look like in terms of the Φs?

• Prediction: Given an nK-dimensional weight vector w and a new example x


argmax_y w^T Φ(x, y)

• Note: The binary classification task expresses preferences over label


assignments
– The approach extends to training a ranker and can use partial preferences too; more on this
later…

41
Perceptron in Kesler Construction
• A perceptron update rule applied in the nk-dimensional space due to a
mistake on the constraint w^T · x_ij ≥ 0
• Or, equivalently, on (w_i^T – w_j^T) · x ≥ 0 (in the n-dimensional space)
• Implies the following update:

• Given example (x,i) (example x ∈ R^n, labeled i)


– ∀ (i,j), i,j = 1,…,k, i ≠ j (***)
– If (w_i^T - w_j^T) · x < 0 (mistaken prediction; equivalent to w^T · x_ij < 0)
– w_i ← w_i + x (promotion) and w_j ← w_j – x (demotion)

• Note that this is a generalization of the balanced Winnow rule.

• Note that we promote w_i and demote (up to) k-1 weight vectors w_j
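A small sketch of this update in the n-dimensional view (one promotion/demotion per violated pair, following (***); under this reading w_i may be promoted up to k-1 times on a single example):

    import numpy as np

    def kesler_perceptron_update(W, x, i):
        # W: (k, n), one weight vector per label; (x, i) is an example labeled i
        for j in range(W.shape[0]):
            if j != i and (W[i] - W[j]) @ x < 0:  # mistake: equivalent to w^T x_ij < 0
                W[i] += x                          # promotion
                W[j] -= x                          # demotion
        return W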
Conservative update
• The general scheme suggests:
• Given example (x,i) (example x ∈ R^n, labeled i)
– ∀ (i,j), i,j = 1,…,k, i ≠ j (***)
– If (w_i^T - w_j^T) · x < 0 (mistaken prediction; equivalent to w^T · x_ij < 0)
– w_i ← w_i + x (promotion) and w_j ← w_j – x (demotion)
– Promote w_i and demote (up to) k-1 weight vectors w_j
• This can also be written, in the nk-dimensional space, as:
– w ← w + Φ(x, y_i) - Φ(x, y_j)
• A conservative update (LBJava's implementation):
– In case of a mistake: only the weights corresponding to the target node i and the closest
node j are updated.
– Let: j* = argmax_{j=1,…,k} w_j^T · x
(highest activation among the competing labels)
– If (w_i^T – w_{j*}^T) · x < 0 (mistaken prediction)

– w_i ← w_i + x (promotion) and w_{j*} ← w_{j*} – x (demotion)


– Other weight vectors are not being updated.
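And a matching sketch of the conservative variant, where only w_i and the single highest-scoring competitor are touched:

    import numpy as np

    def conservative_update(W, x, i):
        scores = (W @ x).astype(float)
        scores[i] = -np.inf                  # consider competing labels only
        j_star = int(np.argmax(scores))      # highest activation among the competitors
        if (W[i] - W[j_star]) @ x < 0:       # mistaken prediction
            W[i] += x                        # promotion
            W[j_star] -= x                   # demotion
        return W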
Significance
• The hypothesis learned above is more expressive than when the OvA
assumption is used.
• Any linear learning algorithm can be used, and algorithm-specific
properties are maintained (e.g., attribute efficiency if using Winnow).
• E.g., the multiclass support vector machine can be implemented by
learning a hyperplane to separate P(S) with maximal margin.

• As a byproduct of the linear separability observation, we get a natural


notion of a margin in the multi-class case, inherited from the binary
separability in the nk-dimensional space.
– Given examples x_ij ∈ R^{nk}, margin(x_ij, w) = min_{ij} w^T · x_ij
– Consequently, given x ∈ R^n, labeled i:
margin(x, w) = min_j (w_i^T - w_j^T) · x
A second look at the multiclass margin
Defined as the score difference between the highest
scoring label and the second one
[Figure: Multiclass margin, in terms of the Kesler construction. The same bar chart of scores for the labels Blue, Red, Green, Black; here y is the label that has the highest score]

45
Discussion
• What is the number of weights for multiclass SVM and constraint
classification?
– nK. Same as One-vs-all, and much less than the K(K-1)/2 weight vectors of All-vs-all

• But both still account for all pairwise label preferences


– Multiclass SVM via the definition of the learning objective

– Constraint classification by constructing a binary classification problem

• Both come with theoretical guarantees for generalization

• Important idea that is applicable when we move to arbitrary structures


Questions?
46
Constraint Classification

• The scheme presented can be generalized to provide a uniform view


for multiple types of problems: multi-class, multi-label, category-
ranking
• Reduces learning to a single binary learning task
• Captures theoretical properties of binary algorithm
• Experimentally verified
• Naturally extends Perceptron, SVM, etc...

• It is called “constraint classification” since it does it all by representing


labels as a set of constraints or preferences among output labels.

47
Multi-category with Constraint Classification
• The unified formulation is clear from the following examples:
• Multiclass
– (x, A) → (x, (A>B, A>C, A>D))
• Multilabel
– (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
• Label Ranking
– (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))

Just like in the multiclass case we learn one w_i ∈ R^n for each label; the same is done for multi-label and ranking. The weight vectors are updated according to the requirements from y ∈ S_k.

Research question:
• Can this be generalized to general Boolean formulae on the labels?
• What are natural examples?

• In all cases, we have examples (x,y) with y ∈ S_k
• Where S_k: a partial order over the class labels {1,...,k}
– defines a "preference" relation (>) for class labeling
• Consequently, the Constraint Classifier is h: X → S_k
– h(x) is a partial order
– h(x) is consistent with y if (i<j) ∈ y ⇒ (i<j) ∈ h(x)

48
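A tiny sketch of how the three cases above turn a label y into pairwise preference constraints; the label sets and helper names are illustrative only:

    def multiclass_constraints(y, labels):
        # (x, A) -> {A > B, A > C, A > D}
        return [(y, other) for other in labels if other != y]

    def multilabel_constraints(positives, labels):
        # (x, {A, B}) -> {A > C, A > D, B > C, B > D}
        return [(p, n) for p in positives for n in labels if n not in positives]

    def ranking_constraints(ordered):
        # (x, 5 > 4 > 3 > 2 > 1) -> {5 > 4, 4 > 3, 3 > 2, 2 > 1}
        return list(zip(ordered, ordered[1:]))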
Training multiclass classifiers: Wrap-up

• Label belongs to a set that has more than two elements

• Methods
– Decomposition into a collection of binary (local) decisions
• One-vs-all
• All-vs-all
• Error correcting codes
– Training a single (global) classifier
• Multiclass SVM
• Constraint classification

• What if the size of the label space is huge?


• What if there is an interesting structure on the label space?
Questions?
49
Next steps…

• Build up to structured prediction


– Multiclass is really a simple structure
– But the framework we developed is really structured prediction

• Different aspects of structured prediction


– Deciding the structure, training, inference
• The functions Φ(x, y) will be more involved
• The prediction step (inference; the argmax) will be more involved

• Next: Sequence models


50
