Optimization of the Accuracy and Calibration of Binary and Multiclass Pattern Recognizers, for Wide Ranges of Applications

Niko Brümmer
Spescom DataVoice
[Title cartoon: "El Ingenioso Hidalgo Don Quijote de la Ciudad del Cabo" ("The Ingenious Gentleman Don Quixote of Cape Town"), with Sancho and Rocinante el Tiburón ("Rocinante the Shark").]
Contents
Part I
1. Introduction
2. Good old error-rate
3. Binary case:
Error-Rate → ROC → Cdet → Cllr
[Diagram: input (e.g. speech recording, image, etc.) → pattern recognizer.]
[Diagram: an application-dependent pattern recognizer absorbs the class prior and the misclassification costs internally; it combines them with the class likelihoods to minimize expected cost and output hard decisions.]
[Diagram: input → application-independent pattern recognizer → class likelihoods. Application: Bayes' rule (standard implementation, nothing to tune) combines the likelihoods with the application's class prior to give the class posterior; hard decisions are then made by minimizing expected cost.]
Agenda: optimize the class likelihoods, so that standard Bayes decisions made from them are cost-effective over a wide range of applications, with different priors and costs.
[Diagram as above: input → application-independent pattern recognizer → class likelihoods → Bayes' rule with the application's prior → class posterior → minimum-expected-cost hard decisions.]
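As a minimal sketch of the "standard implementation, nothing to tune" stage, here is Bayes' rule in the log domain (our illustration, not code from the talk; the function name and the NumPy/SciPy choice are ours):

```python
import numpy as np
from scipy.special import logsumexp

def log_posterior(log_likelihoods, log_prior):
    """Bayes' rule in the log domain: posterior is proportional to likelihood times prior."""
    unnorm = log_likelihoods + log_prior   # element-wise over the N classes
    return unnorm - logsumexp(unnorm)      # normalize so the posteriors sum to 1
```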
3. Evaluation and Optimization
Before we can optimize performance, we need
to be able to evaluate performance!
[Figure: target and non-target score distributions. A threshold on the score axis turns targets below it into misses and non-targets above it into false-alarms.]
[Animation: sweeping the threshold across the score axis trades Pmiss against Pfa.]
Example
[DET plot "DET1: 1conv4w-1conv4w": STBU sub-systems and their fusion, NIST SRE 2006. Axes: miss probability vs. false-alarm probability (in %).]
• EER = Equal-Error-Rate
– works for both ROC and DET
– popular in speaker recognition
EER: Equal-Error-Rate
[Figure: ROC curve in the (Pfa, Pmiss) unit square; the EER is the operating point where Pmiss = Pfa.]
AUC: Area-Under-Curve
[Figure: the same ROC curve, with the area under it shaded.]
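A minimal empirical sketch (ours, not from the talk) of how ROC points and the EER can be computed from target and non-target scores; production evaluation tools typically work on the ROC convex hull instead:

```python
import numpy as np

def roc_points(tar, non):
    """Empirical (Pmiss, Pfa) at every candidate threshold.
    Decision rule: accept iff score >= threshold."""
    thresholds = np.unique(np.concatenate([tar, non]))
    pmiss = np.array([np.mean(tar < t) for t in thresholds])
    pfa = np.array([np.mean(non >= t) for t in thresholds])
    return pmiss, pfa

def eer(tar, non):
    """Equal-error-rate: the ROC point where Pmiss is closest to Pfa."""
    pmiss, pfa = roc_points(tar, non)
    i = np.argmin(np.abs(pmiss - pfa))
    return 0.5 * (pmiss[i] + pfa[i])
```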
DET/ROC Advantages
• Very useful summary of the potential of
recognizers to discriminate between two
classes over a wide spectrum of
applications involving different priors
and/or misclassification costs.
Cdet: geometric interpretation
[Figure: ROC curve with a straight cost line touching it.]
• Cdet is a weighted linear combination of Pmiss and Pfa.
• The relative weighting (the slope of the line) depends on the priors and costs of a specific application.
Cdet
[Figure: ROC curve with a cost line of fixed slope.]
1. The evaluator chooses the application (cost, prior), which determines the slope.
2. The evaluee chooses the threshold, which determines the operating point.
Cdet
[Figure: ROC curve; the evaluee's operating point lies off the optimal cost line.]
1. The evaluator chooses the slope.
2. The evaluee chooses the operating point.
• The threshold may be suboptimal.
• The difference is the calibration loss.
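The calibration loss can be measured as the gap between the Cdet the evaluee actually achieves at its chosen threshold and the minimum Cdet over all thresholds. A sketch (ours), reusing roc_points from the EER example above:

```python
import numpy as np

def cdet(pmiss, pfa, p_tar, c_miss=1.0, c_fa=1.0):
    """Weighted linear combination of Pmiss and Pfa."""
    return p_tar * c_miss * pmiss + (1.0 - p_tar) * c_fa * pfa

def calibration_loss(tar, non, threshold, p_tar=0.5):
    """Actual Cdet at the evaluee's threshold, minus the minimum over all thresholds."""
    pmiss, pfa = roc_points(tar, non)
    actual = cdet(np.mean(tar < threshold), np.mean(non >= threshold), p_tar)
    return actual - np.min(cdet(pmiss, pfa, p_tar))
```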
Cdet does evaluate calibration, but it is application-dependent:

                          ROC    Cdet
application-independent   yes    no
evaluates calibration     no     yes
Part I
1. Introduction
2. Good old error-rate
3. Binary case:
Error-Rate → ROC → Cdet → Cllr
New approach: Cllr
– Application-independent
– Evaluates Calibration
New approach: Cllr
ROC and Cdet are combined by integrating Cdet over the whole ROC curve:

$$C_{llr} = \int_{ROC} C_{det}$$
[Animation: the Cdet cost line sweeps along the entire ROC curve, touching it at every operating point; Cllr accumulates Cdet over the whole sweep.]
Cllr vs. Cdet
Discrimination loss: the integrated decision cost at thresholds optimized by the evaluator.
Questions
[Diagram: class posterior → minimize expected cost → hard decisions.]
In the binary (detection) case, this can be simplified: the application-independent detector outputs a likelihood-ratio,

$$LR = \frac{P(\text{score} \mid \text{target})}{P(\text{score} \mid \text{non-target})},$$

which the application simply compares to a Bayes threshold.
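A sketch of that comparison (our illustration): the Bayes threshold on the log-likelihood-ratio follows from the prior and the two costs, and minimizes expected cost when the llr is well calibrated:

```python
import numpy as np

def bayes_decision(llr, p_tar, c_miss=1.0, c_fa=1.0):
    """Accept iff llr >= eta, the Bayes threshold implied by prior and costs."""
    eta = np.log((1.0 - p_tar) * c_fa) - np.log(p_tar * c_miss)
    return llr >= eta
```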
What the detector (evaluee) does:
[Diagram: input → generative (e.g. GMM) or discriminative (e.g. SVM) modeling → (uncalibrated) score → calibration transformation → (calibrated) likelihood-ratio → to application or evaluator.]
What the evaluator does:
[Diagram: likelihood-ratio → Bayes accept/reject decision → count errors → Pmiss, Pfa → weighted by prior(θ) and costs(θ) → Cdet(θ) → integrate:]

$$C_{llr} = \int C_{det}(\theta)\, d\theta$$
Here is another view …
[Animation, five frames: a DET plot (miss probability vs. false-alarm probability, in %). A score in log(LR) form is thresholded to make decisions, and the errors are counted.]
For each application, the parameters $P_T$, $C_{miss}$, $C_{fa}$ set the Bayes threshold

$$\eta = -\log\frac{P_T\, C_{miss}}{(1 - P_T)\, C_{fa}}$$

and the weights $\alpha, \beta$ in

$$C_{det} = \alpha P_{miss} + \beta P_{fa};$$

integrating over applications gives $C_{llr} = \int C_{det}$.
Questions
How does the evaluator perform the integral

$$C_{llr} = \int C_{det}(\theta)\, d\theta\,?$$

Choice of functions: prior(θ) and costs(θ)
There exist choices for these functions such that:
• the integral is solved analytically (easy to compute);
• Cllr represents recognizer performance over a wide range of applications;
• Cllr has an intuitive information-theoretic interpretation (cross-entropy);
• Cllr serves as a good numerical optimization objective (logistic regression).
The magic formula:

$$P_{tar}\, C_{miss} = \frac{1}{\theta}, \qquad (1 - P_{tar})\, C_{fa} = \frac{1}{1 - \theta}, \qquad 0 \le \theta \le 1$$

Notice the infinities at θ = 0 and θ = 1. This is good! We want Cllr to represent a wide range of applications, including those with very high misclassification costs.
[Figure: log-scale plot of the weights 1/θ and 1/(1−θ) over 0 ≤ θ ≤ 1. The weights reach infinity at the edges, but …]
[Figure: the same plot, with Pmiss(θ) and Pfa(θ) overlaid.]
For a well-calibrated recognizer, the respective error-rates vanish at θ = 0 and θ = 1, so that Cdet remains small.
[Figure: the same plot once more, now combined as:]

$$C_{det}(\theta) = \frac{1}{\theta}\, P_{miss}(\theta) + \frac{1}{1 - \theta}\, P_{fa}(\theta)$$
This gives:

$$C_{llr} = k \int_0^1 \left[ \frac{1}{\theta}\, P_{miss}(\theta) + \frac{1}{1 - \theta}\, P_{fa}(\theta) \right] d\theta
= \frac{1}{2\,|S_T|} \sum_{t \in S_T} \log_2\!\big(1 + \exp(-llr_t)\big) + \frac{1}{2\,|S_N|} \sum_{t \in S_N} \log_2\!\big(1 + \exp(llr_t)\big)$$

where $S_T$ and $S_N$ are the sets of target and non-target trials, and $llr_t$ is the detector's log-likelihood-ratio for trial $t$.
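In code, the closed form is a few lines (our sketch; np.logaddexp is used so that large llr magnitudes stay numerically stable):

```python
import numpy as np

def cllr(tar_llrs, non_llrs):
    """Cllr in bits: log2(1 + exp(x)) computed as logaddexp(0, x) / log(2)."""
    c_tar = np.mean(np.logaddexp(0.0, -np.asarray(tar_llrs))) / np.log(2.0)
    c_non = np.mean(np.logaddexp(0.0, np.asarray(non_llrs))) / np.log(2.0)
    return 0.5 * (c_tar + c_non)
```

A well-calibrated but useless detector (llr = 0 on every trial) scores Cllr = 1 bit; a perfect, well-calibrated one approaches 0.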
[Diagram: input → detector → llr → minimize expected cost → decisions. The llr contributes additional information beyond the prior, which is applied in the decisions.]
Information-theoretic
interpretation
The complement, 1 - Cllr measures the average
amount of additional information (in bits per
trial) contributed by the llr scores of the
detector, when there is least prior information:
Ptar = 0.5.
[DET plot repeated: sub-systems and their fusion.]
See: www.dsp.sun.ac.za/~nbrummer/focal
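FoCal trains the calibration transformation by logistic regression, i.e. by minimizing Cllr. A sketch of that idea for an affine transformation llr = a·score + b, reusing the cllr function above (our Python illustration of the idea, not FoCal's own code; scores are NumPy arrays):

```python
import numpy as np
from scipy.optimize import minimize

def train_calibration(tar_scores, non_scores):
    """Fit llr = a*score + b by minimizing Cllr (binary logistic regression)."""
    def objective(w):
        a, b = w
        return cllr(a * tar_scores + b, a * non_scores + b)
    return minimize(objective, x0=[1.0, 0.0], method="Nelder-Mead").x
```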
Contents
Part I
1. Introduction
2. Good old error-rate
3. Binary case:
Error-rate → ROC → Cdet → Cllr
Part II: Multiclass
Error-rate → ROC → Cmiss → Cmxe
Part II: Multiclass
1. What we want to do
2. Why cost and error-rate don’t work.
3. How NIST did it.
4. How we propose to do it.
5. Experimental demonstration of our
proposal.
1. What we want to do
To create an evaluation criterion that is:
• application-independent,
• sensitive to calibration, and
• useful as a numerical optimization objective.
2. Why cost and error-rate
don’t work.
• Average error-rate is
– application-dependent
– increases with N
• Conditional error-rate analysis
– ROC is ill-defined and computationally problematic
– does not evaluate calibration
• Misclassification Cost Functions
– application-dependent
– complexity increases as N²
• None of the above give good numerical
optimization objectives.
3. How NIST did it.
• NIST’s Language Recognition Evaluation
(LRE) is a multiclass pattern recognition
problem (14 languages in 2007).
• NIST presented it as 14 different, one-
against-the-rest detection tasks, with the
evaluation criterion being average detection
cost over all 14.
• This approach has both good and bad
consequences:
Advantages of LRE strategy
1. What we want to do
2. Why cost and error-rate don’t work.
3. How NIST did it.
4. How we propose to do it.
5. Experimental demonstration of our
proposal.
Let’s recapitulate our
application-independent
recipe.
[Diagram repeated: input → application-independent pattern recognizer → class likelihoods → Bayes' rule (standard implementation, nothing to tune) with the application's prior → class posterior → minimum-expected-cost hard decisions.]
Agenda: optimize the likelihoods to make cost-effective standard Bayes decisions over a wide range of applications, with different priors and costs.
To optimize, we need to evaluate.
[Diagram as above.]
[Diagram: the application side of the recipe is now replaced by the evaluator.]
Replace Bayes' rule with an evaluation method that represents a wide range of Bayes decision applications, with different costs and priors.
Evaluation over different priors and costs

$$P_i = P(\text{class } i), \qquad \sum_{i=1}^{N} P_i = 1$$

The full cost matrix, with $C_{ij}$ the cost of deciding class $j$ when the true class is $i$ (zero on the diagonal):

                 estimated class
                 1      2      …      N
true class  1    0      C12    …      C1N
            2    C21    0      …      C2N
            …    …      …      …      …
            N    CN1    CN2    …      0
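Given a class posterior, the minimum-expected-cost decision under such a cost matrix is (a sketch, ours):

```python
import numpy as np

def min_expected_cost_decision(posterior, cost):
    """posterior: (N,) class probabilities; cost[i, j]: cost of deciding j
    when the true class is i. Returns the decision with least expected cost."""
    expected = posterior @ cost   # expected cost of each candidate decision
    return int(np.argmin(expected))
```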
We will instead use a
simplified cost function:
Simplified cost function
Every misclassification of true class i costs the same, Cmiss(i):

                 estimated class
                 1           2           …      N
true class  1    0           Cmiss(1)    …      Cmiss(1)
            2    Cmiss(2)    0           …      Cmiss(2)
            …    …           …           …      …
            N    Cmiss(N)    Cmiss(N)    …      0

Simplified cost function: expected miss cost

$$C_{miss} = \sum_{i=1}^{N} P_i\, C_{miss}(i)\, P_{miss}(i)$$
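As a one-line check of the formula (our sketch):

```python
import numpy as np

def expected_miss_cost(priors, miss_costs, pmiss):
    """Cmiss = sum_i P_i * Cmiss(i) * Pmiss(i) over the N classes."""
    return float(np.sum(np.asarray(priors) * np.asarray(miss_costs) * np.asarray(pmiss)))
```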
What the detector does:
[Diagram: an (uncalibrated) N-component score-vector → calibration transformation → (calibrated) class log-likelihoods, log P(score-vector | class i), i = 1, 2, …, N → to application or evaluator.]

What the evaluator does:
[Diagram: class log-likelihoods → Bayes decisions → error-rates Pmiss(1), …, Pmiss(N) → weighted by prior(θ) and costs(θ) → Cmiss(θ) → integrate over the parameter vector θ:]

$$C_{mxe} = \int\!\!\!\int \cdots \int C_{miss}(\boldsymbol{\theta})\, d\boldsymbol{\theta}$$
The magic formula:

$$\boldsymbol{\theta}: \quad 0 \le \theta_i \le 1, \quad \sum_{i=1}^{N} \theta_i = 1, \qquad P_i\, C_{miss}(i) = \frac{1}{\theta_i}$$

Note 1: θ is an N-vector of parameters, which has the form of a probability distribution.

Note 2: We don't need to vary cost and prior separately, because in expected-cost calculations they always act together as prior-cost products.

Note 3: Notice again the infinities at the edges of the parameter simplex (at θi = 0). This ensures that we include applications with arbitrarily large cost in our evaluation.
which gives our new evaluation objective:

$$C_{mxe} = k \int_0^1\!\!\int_0^1 \cdots \int_0^1 \; \sum_{i=1}^{N} \frac{1}{\theta_i}\, P_{miss}(i) \; d\theta_{N-1} \cdots d\theta_2\, d\theta_1, \qquad \theta_N = 1 - \sum_{i=1}^{N-1} \theta_i$$
$$C_{mxe} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|S_i|} \sum_{t \in S_i} \log_2 \frac{\sum_{j=1}^{N} \exp(ll_{jt})}{\exp(ll_{it})}$$

where $S_i$ is the set of trials of true class $i$, and $ll_{jt}$ is the calibrated log-likelihood for class $j$ on trial $t$.
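In code (a sketch, ours; SciPy's logsumexp keeps the normalization numerically stable):

```python
import numpy as np
from scipy.special import logsumexp

def cmxe(loglh, labels):
    """Multiclass cross-entropy in bits, averaged per class and then over classes.
    loglh: (T, N) class log-likelihoods; labels: (T,) true class indices."""
    rows = np.arange(len(labels))
    # log2 of sum_j exp(ll_jt) / exp(ll_it), per trial
    bits = (logsumexp(loglh, axis=1) - loglh[rows, labels]) / np.log(2.0)
    per_class = [bits[labels == i].mean() for i in range(loglh.shape[1])]
    return float(np.mean(per_class))
```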
1. What we want to do
2. Why cost and error-rate don’t work.
3. How NIST did it.
4. How we propose to do it.
5. Experimental demonstration of
our proposal.
Experimental demonstration
• Here N = 14 languages.
Experimental demonstration
[Bar chart, three frames: Cavg [%] for the seven LRE'07 systems (atvs, tss4, ing, iir, but, ptt, mit; numbered 1–7), comparing submitted, projected, and re-calibrated scores. Condition: general LR, closed-set, all 14 languages, 30s.]
Now all 16369 subsets:
[Scatter plots of performance over all 16369 language subsets, one panel per system: System 1 (projected scores), System 2 (original scores), System 3 (projected, re-calibrated), System 4 (projected, re-calibrated), System 5 (original scores), System 6 (re-calibrated), System 7 (projected, re-calibrated). The {English, Hindustani} subset is highlighted.]
LRE Subsets Experiment: Conclusions
1. We used multiclass logistic regression (optimization of Cmxe) to improve the performance of systems that had been designed specifically for LRE'07.
See:
http://niko.brummer.googlepages.com/focalmulticlass
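A sketch of the re-calibration used here: fit calibrated log-likelihoods a·loglh + b (scalar a, offset vector b) by minimizing Cmxe, i.e. multiclass logistic regression. This reuses the cmxe function above and is our illustration of the idea, not the FoCal multiclass code:

```python
import numpy as np
from scipy.optimize import minimize

def train_multiclass_calibration(loglh, labels):
    """Multiclass logistic regression: minimize Cmxe over (a, b)."""
    n_classes = loglh.shape[1]
    def objective(w):
        return cmxe(w[0] * loglh + w[1:], labels)   # b broadcasts over trials
    w0 = np.concatenate([[1.0], np.zeros(n_classes)])
    return minimize(objective, w0, method="Nelder-Mead").x
```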
Contents
Part I
Part II
Conclusion
Bibliography
Conclusion
The following are all equivalent (and all are
good things to do):
• Calibrating pattern recognition outputs as
likelihoods.
• Optimizing the amount of effective information
delivered by pattern recognizers.
• Optimizing recognition error-rates over wide
ranges of the priors.
• Optimizing recognizer decision cost over wide
ranges of cost and prior parameters.
Bibliography
1. Cllr references
2. FSR references
3. Cmxe references
Cllr references
1. Niko Brümmer, "Application-Independent Evaluation of Speaker Detection", Odyssey 2004: The Speaker and Language Recognition Workshop, Toledo.
2. Niko Brümmer and Johan du Preez, "Application-Independent Evaluation of Speaker Detection", Computer Speech and Language, 2006, pp. 230–275.
3. David van Leeuwen and Niko Brümmer, "An Introduction to Application-Independent Evaluation of Speaker Recognition Systems", in Speaker Classification vol. 1, ed. Christian Müller, Springer, 2007, pp. 330–353.
FSR References