
Principled Asymmetric Boosting Approaches

to Rapid Training and Classification

in Face Detection

presented by

Minh-Tri Pham
Ph.D. Candidate and Research Associate
Nanyang Technological University, Singapore
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Problem
Application
Application

Face recognition
Application

3D face reconstruction
Application

Camera auto-focusing
Application
Windows face logon
• Lenovo Veriface Technology
Appearance-based Approach
• Scan the image with a probe window patch (x,y,s)
  – at different positions and scales
  – Binary classify each patch into
    • face, or
    • non-face
• Desired output state:
  – (x,y,s) containing a face

Most popular approach
• Viola-Jones '01-'04, Li et al. '02, Wu et al. '04, Brubaker et al. '04, Liu et al. '04, Xiao et al. '04,
• Bourdev-Brandt '05, Mita et al. '05, Huang et al. '05-'07, Wu et al. '05, Grabner et al. '05-'07,
• and many more
Appearance-based Approach
• Statistics:
  – 6,950,440 patches in a 320x240 image
  – P(face) < 10^-5

• Key requirement:
  – A very fast classifier
A very fast classifier
• Cascade of non-face rejectors:

  F1 --pass--> F2 --pass--> … --pass--> FN --pass--> face
  (reject at any stage --> non-face)

A very fast classifier
• Cascade of non-face rejectors:

  F1 --pass--> F2 --pass--> … --pass--> FN --pass--> face
  (reject at any stage --> non-face)

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)
Non-face Rejector
• A strong combination of weak classifiers:

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

(A code sketch of this cascade/rejector structure follows below.)
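As an illustration only (not the authors' implementation), the cascade of non-face rejectors and the thresholded sum of weak scores inside each rejector can be sketched as follows; the stage layout, weak-classifier callables, and thresholds are assumptions made for the example.

    # Minimal sketch of cascade evaluation (illustrative only).
    # Each stage Fk is a list of weak classifiers (callables: patch -> score)
    # together with a threshold theta_k.

    def evaluate_rejector(weak_classifiers, theta, patch):
        """Non-face rejector: sum the weak scores and compare with the threshold."""
        score = sum(f(patch) for f in weak_classifiers)
        return score > theta              # True = pass, False = reject

    def evaluate_cascade(stages, patch):
        """stages: list of (weak_classifiers, theta) pairs F1..FN."""
        for weak_classifiers, theta in stages:
            if not evaluate_rejector(weak_classifiers, theta, patch):
                return "non-face"         # rejected by some Fk
        return "face"                     # passed all stages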
Boosting
[Figure: two boosting stages. Weak Classifier Learner 1 (Stage 1) classifies the
 training set; wrongly classified examples receive larger weights and correctly
 classified examples smaller weights before Weak Classifier Learner 2 (Stage 2).
 Legend: negative example, positive example.]
Asymmetric Boosting
• Weight positives γ times more than negatives (see the sketch below)

[Figure: the same two-stage boosting illustration, with positive examples
 weighted γ times more than negatives. Legend: negative example, positive example.]
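A hedged sketch of the weighting idea: positive examples start with γ times the weight of negatives, and a generic AdaBoost-style step (assumed here, not necessarily the exact update used in the thesis) then re-weights misclassified examples between stages.

    import numpy as np

    def init_asymmetric_weights(labels, gamma):
        """labels: +1 for faces, -1 for non-faces; positives weighted gamma times more."""
        w = np.where(labels == +1, gamma, 1.0)
        return w / w.sum()

    def reweight(w, labels, predictions):
        """Generic AdaBoost-style step (assumption): raise the weights of wrongly
        classified examples and lower those of correctly classified ones."""
        err = np.clip(np.sum(w[predictions != labels]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * labels * predictions)
        return w / w.sum(), alpha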
Non-face Rejector
• A strong combination of weak classifiers:

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Weak classifier
• Classify a Haar-like feature value

  input patch --> feature value v --> classify v --> score
Main issues
• Requires too much intervention from experts
A very fast classifier
• Cascade of non-face rejectors:

  F1 --pass--> F2 --pass--> … --pass--> FN --pass--> face
  (reject at any stage --> non-face)

• F1, F2, …, FN : asymmetric classifiers
  – FRR(Fk) ≈ 0
  – FAR(Fk) as small as possible (e.g. 0.5 – 0.8)

How to choose the bounds for FRR(Fk) and FAR(Fk)?
Asymmetric Boosting
• Weight positives γ times more than negatives

How to choose γ?

[Figure: two-stage boosting illustration with Weak Classifier Learner 1 (Stage 1)
 and Weak Classifier Learner 2 (Stage 2). Legend: negative example, positive example.]
Non-face Rejector
• A strong combination of weak classifiers:

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold

How to choose θ?
Main issues
• Requires too much intervention from experts

• Very long learning time


Weak classifier
• Classify a Haar-like feature value

  input patch --> feature value v --> classify v --> score

  → 10 minutes to learn a weak classifier

Main issues
• Requires too much intervention from experts

• Very long learning time


– To learn a face detector (≈ 4,000 weak classifiers):
• 4,000 × 10 minutes ≈ 1 month

• Only suitable for objects with small shape variance


Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Detection with Multi-exit
Asymmetric Boosting

CVPR’08 poster paper:


Minh-Tri Pham, Viet-Dung D. Hoang, and Tat-Jen Cham. Detection with Multi-exit Asymmetric
Boosting. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), Anchorage, Alaska, 2008.
• Won Travel Grant Award
Problem overview
• Common appearance-based approach:
  F1 --pass--> F2 --pass--> … --pass--> FN --pass--> object
  (reject at any stage --> non-object)

  – F1, F2, …, FN : boosted classifiers

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  – f1,1, f1,2, …, f1,K : weak classifiers
  – θ : threshold
Objective

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• Find f1,1, f1,2, …, f1,K, and θ such that:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – K is minimized (K is proportional to F1's evaluation time)
Existing trends (1)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn a new weak classifier f1,k(x):
      f̂1,k = argmin_{f1,k} [ FAR(F1) + FRR(F1) ]
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) − θ )
  – Adjust θ to see if we can achieve FAR(F1) ≤ α0 and FRR(F1) ≤ β0
    (a sketch of this threshold scan follows below):
    • Break the loop if such a θ exists

Issues
• Weak classifiers are sub-optimal w.r.t. the training goal.
• Too many weak classifiers are required in practice.
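A small sketch of the threshold-adjustment step above (the array names and helper are assumptions, not the thesis code): scan candidate values of θ over the boosted scores and report one that meets both bounds, if any.

    import numpy as np

    def find_feasible_threshold(scores, labels, alpha0, beta0):
        """scores: boosted sums sum_i f_{1,i}(x); labels: +1 face, -1 non-face.
        Assumes both classes are present. Returns a threshold theta with
        FAR(F1) <= alpha0 and FRR(F1) <= beta0, or None if no such theta exists."""
        pos = np.sort(scores[labels == +1])
        neg = np.sort(scores[labels == -1])
        for theta in np.unique(scores):
            frr = np.searchsorted(pos, theta, side='right') / len(pos)        # positives scoring <= theta
            far = 1.0 - np.searchsorted(neg, theta, side='right') / len(neg)  # negatives scoring > theta
            if far <= alpha0 and frr <= beta0:
                return theta
        return None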
Existing trends (2)

Idea
• For k from 1 until convergence:
  – Let F1(x) = sign( Σ_{i=1..k} f1,i(x) )
  – Learn a new weak classifier f1,k(x):
      f̂1,k = argmin_{f1,k} [ FAR(F1) + γ·FRR(F1) ]
  – Break the loop if FAR(F1) ≤ α0 and FRR(F1) ≤ β0

Pros
• Reduces FRR at the cost of increasing FAR
  – acceptable for cascades
• Fewer weak classifiers

Cons
• How to choose γ?
• Much longer training time

Solution to con
• Trial and error: choose γ such that K is minimized.
Our solution

Learn every weak classifier f1,k(x) using the same asymmetric goal:

  f̂1,k = argmin_{f1,k} [ FAR(F1) + γ·FRR(F1) ],  where γ = α0/β0.

Why?
Because…
• Consider two desired bounds (or targets) for learning a boosted classifier FM(x):
  – Exact bound:        FAR(FM) ≤ α0 and FRR(FM) ≤ β0        (1)
  – Conservative bound: FAR(FM) + (α0/β0)·FRR(FM) ≤ α0       (2)
• (2) is more conservative than (1) because (2) => (1).

[Figure: two ROC plots (FAR vs. FRR) with the exact and conservative bounds marked.
 With γ = 1 (left), the operating point H1, H2, … needs roughly 200 weak classifiers
 to enter the conservative bound; with γ = α0/β0 (right), the operating point
 Q1, Q2, … needs roughly 40.]

At γ = α0/β0, for every new weak classifier learned, the ROC operating point
moves the fastest toward the conservative bound.
Implication

  F1: f1,1 + f1,2 + … + f1,K > θ ?   yes --> pass
                                     no  --> reject

  F1(x) = sign( Σ_{i=1..K} f1,i(x) − θ )

• When the ROC operating point lies inside the conservative bound:
  – FAR(F1) ≤ α0
  – FRR(F1) ≤ β0
  – Both conditions are met, so we can simply take θ = 0
    (a sketch of this stopping test follows below).
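A minimal sketch of that stopping test, under the reconstruction of the conservative bound used above (FAR + γ·FRR ≤ α0 with γ = α0/β0); the function names are illustrative.

    def asymmetric_goal(alpha0, beta0):
        """Asymmetric goal gamma = alpha0 / beta0, e.g. 0.8 / 0.01 = 80."""
        return alpha0 / beta0

    def inside_conservative_bound(far, frr, alpha0, beta0):
        """Reconstructed conservative bound: FAR + gamma*FRR <= alpha0,
        which implies FAR <= alpha0 and FRR <= beta0 (the exact bounds)."""
        return far + asymmetric_goal(alpha0, beta0) * frr <= alpha0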
Multi-exit Boosting
A method to train a single boosted classifier with multiple exit nodes:

  f1 → f2 → f3 → f4 → f5 → f6 → f7 → f8 → object
  (exit nodes along the chain each make a pass/reject decision on the accumulated
   score; F1, F2, F3 are the boosted classifiers ending at those exits;
   reject at any exit node → non-object)

  – a weak classifier: fi
  – an exit node: a weak classifier fi followed by a decision to continue or reject

• Features:
  • Weak classifiers are trained with the same goal: γ = α0/β0.
  • Every pass/reject decision is guaranteed with FAR ≤ α0 and FRR ≤ β0.
  • The classifier is a cascade.
  • The score is propagated from one node to the next (see the sketch below).

• Main advantages:
  • Weak classifiers are learned (approximately) optimally.
  • No training of multiple boosted classifiers.
  • Far fewer weak classifiers are needed than in traditional cascades.
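A minimal sketch of multi-exit evaluation, assuming the weak classifiers are callables and exit nodes are indexed by the position of their last weak classifier; unlike a traditional cascade, the accumulated score is carried across exit nodes rather than reset.

    def evaluate_multi_exit(weak_classifiers, exit_thresholds, patch):
        """weak_classifiers: list of callables patch -> score.
        exit_thresholds: dict {index of the weak classifier at an exit node: threshold}.
        The score is propagated from node to node; reject at the first failed exit."""
        score = 0.0
        for i, f in enumerate(weak_classifiers):
            score += f(patch)
            if i in exit_thresholds and score <= exit_thresholds[i]:
                return "non-object"       # rejected at this exit node
        return "object"                   # passed the final exit node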
Results
Goal (γ) vs. number of weak classifiers (K)

• Toy problem: learn a (single-exit) boosted classifier F for classifying
  face/non-face patches such that FAR(F) < 0.8 and FRR(F) < 0.01
  – Empirically best goal: γ ∈ [10, 100].
  – Our method chooses: γ = 0.8/0.01 = 80.

• Similar results were obtained in tests with other desired error rates.
Ours vs. Others (in Face Detection)

• Fast StatBoost is used as the base method for fast-training a weak classifier.

  Method                           No. of weak classifiers   No. of exit nodes   Total training time
  Viola-Jones [3]                  4,297                     32                  6h20m
  Viola-Jones [4]                  3,502                     29                  4h30m
  Boosting chain [7]               959                       22                  2h10m
  Nested cascade [5]               894                       20                  2h
  Soft cascade [1]                 4,871                     4,871               6h40m
  Dynamic cascade [6]              1,172                     1,172               2h50m
  Multi-exit Asymmetric Boosting   575                       24                  1h20m
Ours vs. Others (in Face Detection)
• MIT+CMU Frontal Face Test set:
Conclusion

• Multi-exit Asymmetric Boosting trains every weak


classifier approximately optimally.

– Better accuracy

– Much fewer weak classifiers

– Significantly reduces training time


• No more trial-and-error for training a boosted classifier
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Fast Training and Selection of
Haar-like Features using Statistics

ICCV’07 oral paper:


Minh-Tri Pham and Tat-Jen Cham. Fast Training and Selection of Haar Features using Statistics in
Boosting-based Face Detection. In Proc. International Conference on Computer Vision (ICCV), Rio de
Janeiro, Brazil, 2007.
• Won Travel Grant Award
• Won Second Prize, Best Student Paper in Year 2007 Award, Pattern Recognition and Machine
Intelligence Association (PREMIA), Singapore
Motivation

• Face detectors today


– Real-time detection
speed

…but…

– Weeks of training time


Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      update weights                    – O(N)
      for feature t from 1 to T:
          compute N feature values      – O(N)
          sort N feature values         – O(N log N)
          train feature classifier      – O(N)
      select best feature classifier    – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 minutes to train a weak classifier
  – 27 days to train a face detector
  (a sketch of the dominant per-feature step follows below)

  Factor   Description                           Common value
  N        number of examples                    10,000
  M        number of weak classifiers in total   4,000 - 6,000
  T        number of Haar-like features          40,000
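For reference, a minimal sketch of the per-feature training step that dominates this cost (the decision-stump form and names are assumptions made for illustration): each Haar-like feature requires sorting the N weighted feature values, O(N log N), before scanning thresholds.

    import numpy as np

    def train_stump_for_feature(values, labels, weights):
        """Weighted decision stump on one feature: predict face when value > threshold.
        O(N log N) for the sort plus O(N) for the threshold scan."""
        order = np.argsort(values)                           # O(N log N)
        v, y, w = values[order], labels[order], weights[order]
        pos_below = np.cumsum(w * (y == +1))                 # positives wrongly rejected
        neg_above = np.cumsum((w * (y == -1))[::-1])[::-1]   # negatives wrongly accepted
        err = pos_below[:-1] + neg_above[1:]                 # error of threshold between v[i] and v[i+1]
        best = int(np.argmin(err))
        return 0.5 * (v[best] + v[best + 1]), err[best]      # (threshold, weighted error)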
Why Should the Training Time be
Improved?
• Tradeoff between time and generalization
– E.g. training 100 times slower if we increase both N and T by 10 times

• Trial and error to find key parameters for training


– Much longer training time needed

• Online-learning face detectors have the same problem


Existing Approaches to Reduce the
Training Time
• Sub-sample Haar-like feature set
– Simple but loses generalization

• Use histograms and real-valued boosting (B. Wu et al. '04)


– Pro: Reduce from O(MNT log N) to O(MNT)
– Con: Raise overfitting concerns:
• Real AdaBoost not known to be overfitting resistant
• Weak classifier may overfit if too many histogram bins are used

• Pre-compute feature values' sorting orders (J. Wu et al. '07)


– Pro: Reduce from O(MNT log N) to O(MNT)
– Con: Require huge memory storage
• For N = 10,000 and T = 40,000, a total of 800MB is needed.
Why is Training so Slow?
A view of a face detector training algorithm

  for weak classifier m from 1 to M:
      …
      update weights                    – O(N)
      for feature t from 1 to T:
          compute N feature values      – O(N)
          sort N feature values         – O(N log N)
          train feature classifier      – O(N)
      select best feature classifier    – O(T)

• Time complexity: O(MNT log N)
  – 15 ms to train a feature classifier
  – 10 min to train a weak classifier
  – 27 days to train a face detector

• Bottleneck:
  – At least O(NT) to train a weak classifier

• Can we avoid O(NT)?

  Factor   Description                           Common value
  N        number of examples                    10,000
  M        number of weak classifiers in total   4,000 - 6,000
  T        number of Haar-like features          40,000
Our Proposal

• Fast StatBoost: train feature classifiers using statistics rather than the input data
  – Con:
    • Less accurate
      … but not critical for a feature classifier
  – Pro:
    • Much faster training time:
      → constant time instead of linear time
Fast StatBoost
• Training feature classifiers using statistics:
  – Assumption: the feature value v(t) is normally distributed given the face class c
  – Closed-form solution for the optimal threshold
  [Figure: non-face and face class-conditional densities over the feature value,
   with the optimal threshold between them]

• Fast linear projection of the statistics of a window's integral image into the
  1D statistics of a feature value (see the sketch below):

    μ(t) = m_J^T g(t),    σ(t)² = g(t)^T Σ_J g(t)

  μ(t), σ(t)² : mean and variance of the feature value v(t)
  J : random vector representing a window's integral image
  m_J, Σ_J : mean vector and covariance matrix of J
  g(t) : Haar-like feature, a sparse vector with fewer than 20 non-zero elements

  ⇒ constant time to train a feature classifier
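A sketch of the projection, assuming a dense mean vector m_J and covariance Σ_J and a sparse Haar-like feature given by its non-zero indices and values; the two-Gaussian threshold routine shown solves the standard density-intersection quadratic and is an illustrative simplification, not necessarily the exact closed form used in the paper.

    import numpy as np

    def feature_stats(m_J, Sigma_J, g_idx, g_val):
        """Project integral-image statistics onto a sparse Haar-like feature g(t):
        mu(t) = m_J^T g(t), var(t) = g(t)^T Sigma_J g(t).
        Constant time, since g(t) has fewer than 20 non-zero entries."""
        mu = m_J[g_idx] @ g_val
        var = g_val @ Sigma_J[np.ix_(g_idx, g_idx)] @ g_val
        return mu, var

    def gaussian_threshold(mu0, var0, mu1, var1, prior0=0.5, prior1=0.5):
        """Illustrative closed form: the point where the two weighted Gaussian
        densities intersect (roots of a quadratic; a real root is picked naively)."""
        a = 0.5 * (1.0 / var0 - 1.0 / var1)
        b = mu1 / var1 - mu0 / var0
        c = (0.5 * (mu0 ** 2 / var0 - mu1 ** 2 / var1)
             + 0.5 * np.log(var0 / var1) + np.log(prior1 / prior0))
        if abs(a) < 1e-12:                       # equal variances: linear case
            return -c / b
        roots = np.roots([a, b, c])
        return float(roots[np.isreal(roots)].real[0])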
Fast StatBoost
• The integral image's statistics are obtained directly from the weighted input data
  – Input: N training integral images and their current weights w(m):
      { (w1(m), J1, c1), (w2(m), J2, c2), …, (wN(m), JN, cN) }
  – We compute, for each class c (see the sketch below):
    • Sample total weight:       ẑc = Σ_{n: cn = c} wn(m)
    • Sample mean vector:        m̂c = ẑc⁻¹ Σ_{n: cn = c} wn(m) Jn
    • Sample covariance matrix:  Σ̂c = ẑc⁻¹ Σ_{n: cn = c} wn(m) Jn Jn^T − m̂c m̂c^T
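A sketch of these weighted class-conditional statistics, assuming the flattened integral images are stacked as the rows of a matrix (the names are illustrative):

    import numpy as np

    def class_statistics(J, w, labels, c):
        """J: (N, d) flattened integral images; w: (N,) current boosting weights;
        labels: (N,) class of each example; c: class of interest.
        Returns the sample total weight, weighted mean vector, weighted covariance."""
        mask = labels == c
        Jc, wc = J[mask], w[mask]
        z = wc.sum()                                         # sample total weight
        m = (wc[:, None] * Jc).sum(axis=0) / z               # sample mean vector
        S = (wc[:, None] * Jc).T @ Jc / z - np.outer(m, m)   # sample covariance matrix
        return z, m, S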
Fast StatBoost
A view of our face detector training algorithm
  for weak classifier m from 1 to M:
      …
      update weights                          – O(N)
      extract statistics of integral image    – O(Nd²)
      for feature t from 1 to T:
          project statistics into 1D          – O(1)
          train feature classifier            – O(1)
      select best feature classifier          – O(T)
      …

• To train a weak classifier:
  – Extract the class-conditional integral-image statistics
    • Time complexity: O(Nd²)
    • The factor d² is negligible because fast algorithms exist, hence in practice: O(N)
  – Train T feature classifiers by projecting the statistics into 1D
    • Time complexity: O(T)
  – Select the best feature classifier
    • Time complexity: O(T)

• Time complexity per weak classifier: O(N+T)

  Factor   Description                           Common value
  N        number of examples                    10,000
  M        number of weak classifiers in total   4,000 - 6,000
  T        number of Haar-like features          40,000
  d        number of pixels in a window          300-500
Experimental Results
• Setup:
  – Intel Pentium IV 2.8GHz
  – 19 feature types ⇒ 295,920 Haar-like features
  [Figure: the nineteen feature types used in our experiments – edge, corner,
   diagonal line, line, and center-surround features]

• Time for extracting the statistics:
  – Main factor: covariance matrices
    • GotoBLAS: 0.49 seconds per matrix

• Time for training T features:
  – 2.1 seconds

⇒ Total training time: 3.1 seconds per weak classifier with 300K features
  • Existing methods: up to 10 minutes with 40K features or fewer
Experimental Results
• Comparison with Fast AdaBoost (J. Wu et al. '07), the fastest known
  implementation of Viola-Jones' framework:

  [Chart: training time of a weak classifier (seconds) vs. number of features T
   (0 to 300,000), for Fast AdaBoost and Fast StatBoost]
Experimental Results
• Performance of a cascade:

  Method                     Total training time   Memory requirement
  Fast AdaBoost (T=40K)      13h 20m               800 MB
  Fast StatBoost (T=40K)     02h 13m               30 MB
  Fast StatBoost (T=300K)    03h 02m               30 MB

  [Figure: ROC curves of the final cascades for face detection]
Conclusions

• Fast StatBoost: use of statistics instead of input data to train feature


classifiers

• Time:
– Reduction of the face detector training time from up to a month to 3 hours
– Significant gain in both N and T with little increase in training time
• Due to O(N+T) per weak classifier

• Accuracy:
– Even better accuracy for the face detector
• Due to many more Haar-like features being explored
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Weak classifier
• Cascade of non-face rejectors:
Outline

• Motivation
• Contributions
– Automatic Selection of Asymmetric Goal
– Fast Weak Classifier Learning
– Online Asymmetric Boosting
– Generalization Bounds on the Asymmetric Error

• Future Work
• Summary
Summary

• Online Asymmetric Boosting


– Integrates Asymmetric Boosting with Online Learning

• Fast Training and Selection of Haar-like Features using Statistics


– Dramatically reduces training time from weeks to a few hours

• Multi-exit Asymmetric Boosting


– Approximately minimizes the number of weak classifiers
Thank You
