

Provably robust deep learning

J. Zico Kolter
Carnegie Mellon University and Bosch Center for AI

1
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

2
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

3
The AI breakthrough (some recent history)

[Figures: Karras et al., 2018; Radford et al., 2019; Vinyals et al., 2019]


4
…but the stakes are low


5
Adversarial attacks

[Figures: Madry et al.; Sharif et al., 2016; Athalye et al., 2017; Evtimov et al., 2017]

6
… and some recent work

[Lee and Kolter, 2019], https://arxiv.org/abs/1906.11897
7
Why should we care?
…you probably don’t have an adversary changing inputs to your classifier at a
pixel level (or if you do, you have bigger problems)

1. Genuine security implications for deep networks (e.g., with physical attacks)

2. Says something fundamental about the representation of deep classifiers, smooth
decision boundaries, sensitivity to distribution shift (within threat model), etc.

8
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

9
Adversarial attacks as optimization

𝐄",$ Loss 𝑓/ (𝑥), 𝑦

𝐄",$ max Loss 𝑓/ (𝑥 + 𝛿), 𝑦


(∈∆
10
The adversarial optimization problem
How do we solve the “inner” optimization problem
\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

Key insight: the same process that enabled us to learn the model parameters via
gradient descent also allows us to create an adversarial example via gradient
descent
\frac{\partial}{\partial \delta} \mathrm{Loss}(f_\theta(x + \delta), y)

11
Solving with projected gradient descent
Since we are trying to maximize the loss when creating an adversarial example,
we repeatedly move in the direction of the positive gradient

Since we also need to ensure that 𝛿 ∈ Δ, we also project back into this set after
each step, a process known as projected gradient descent (PGD)
\delta := \mathrm{Proj}_\Delta\!\left(\delta + \alpha \frac{\partial}{\partial \delta} \mathrm{Loss}(f_\theta(x + \delta), y)\right)

Example: for \Delta = \{\delta : \|\delta\|_\infty \le \epsilon\} (called the ℓ∞ ball), the projection operator just
clips each coordinate of \delta to [−𝜖, 𝜖]

12
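To make the update above concrete, here is a minimal PyTorch sketch of PGD for the ℓ∞ ball (PyTorch is the framework used later in the talk; the function and parameter names are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    """PGD for the l_inf ball: ascend the loss in delta, then clip back to [-eps, eps]."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        # Gradient ascent step followed by projection onto the l_inf ball,
        # which for this set is just coordinate-wise clipping.
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()

# Usage (illustrative): delta = pgd_linf(model, x, y); x_adv = x + delta
```

A common variant takes the sign of the gradient at each step rather than the raw gradient; the next slide's FGSM is the one-step version of that idea.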
The Fast Gradient Sign Method
The Fast Gradient Sign Method (FGSM) takes a single PGD step with step size \alpha \to \infty,
which corresponds exactly to just taking a step in the signs of the gradient terms

[Illustration: a single gradient step from \delta = 0, followed by projection onto \Delta]

Creates weaker attacks than running full PGD, but substantially faster
13
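The corresponding FGSM sketch, under the same illustrative assumptions: with an effectively infinite step size, projection onto the ℓ∞ ball leaves only the sign of the gradient, scaled by ε.

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, epsilon=0.1):
    """FGSM: a single step of size epsilon in the sign of the loss gradient."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss = nn.CrossEntropyLoss()(model(x + delta), y)
    loss.backward()
    # Taking alpha -> infinity and projecting onto {|delta|_inf <= epsilon}
    # reduces to epsilon * sign(gradient).
    return epsilon * delta.grad.detach().sign()
```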
Illustration of adversarial examples
We will demonstrate adversarial attacks on the MNIST data set, using two different
architectures:

• 2-layer fully connected MLP: FC-200, FC-10
• 6-layer ConvNet: Conv-32x28x28, Conv-32x28x28, Conv-64x14x14, Conv-64x14x14, FC-100, FC-10

14
Illustrations of FGSM/PGD
[Bar chart: test error at 𝜖 = 0.1 for the MLP and the ConvNet under clean inputs, FGSM, and PGD attacks; values shown include 96.4%, 92.6%, 74.3%, 41.7%, 2.9%, and 1.1%]

15
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

16
Adversarial robustness
[Figure: “pig” example]

\min_\theta \mathbf{E}_{x,y}\big[\mathrm{Loss}(f_\theta(x), y)\big] \;\Longrightarrow\; \min_\theta \mathbf{E}_{x,y}\big[\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)\big]

1. Adversarial training: Take model SGD steps at (approximate) worst-case


perturbations [Goodfellow et al., 2015, Kurakin et al., 2016; Madry et al., 2017]

2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter,
2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
17
Adversarial training
How do we optimize the objective

\min_\theta \sum_{(x,y) \in D} \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

We would like to solve it with gradient descent, but how do we compute the
gradient of the objective with the max term inside?

18
Danskin’s Theorem
A fundamental result in optimization:
\frac{\partial}{\partial \theta} \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y) = \frac{\partial}{\partial \theta} \mathrm{Loss}(f_\theta(x + \delta^\star), y)

where \delta^\star = \mathrm{argmax}_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

Seems “obvious,” but it is a very subtle result; it means we can optimize through the
max by just finding its maximizing value

Note, however, that it only applies when the max is performed exactly

19
Adversarial training
Repeat:
1. Select minibatch B
2. For each (x, y) \in B, compute adversarial example \delta^\star(x)
3. Update parameters
   \theta := \theta - \frac{\alpha}{|B|} \sum_{(x,y) \in B} \frac{\partial}{\partial \theta} \mathrm{Loss}(f_\theta(x + \delta^\star(x)), y)

Common to also mix robust/standard updates (not done in our case)

[Bar chart: test error at 𝜖 = 0.1 for the ConvNet and the Robust ConvNet under clean inputs, FGSM, and PGD attacks; values shown include 74.4%, 41.7%, 2.8%, 2.6%, 1.1%, and 0.9%]

20
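A minimal PyTorch sketch of the training loop above, reusing the PGD attack sketched earlier for the inner maximization (all names are illustrative; per Danskin's theorem, differentiating the loss at the attack point gives the gradient we need, at least when the inner max is solved exactly):

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    # Approximate inner maximization (same PGD sketch as before).
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss_fn(model(x + delta), y).backward()
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()

def adversarial_train_epoch(model, loader, opt, epsilon=0.1):
    """One epoch of adversarial training: SGD steps at (approximate) worst-case perturbations."""
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        delta = pgd_linf(model, x, y, epsilon)   # step 2: find adversarial example
        opt.zero_grad()                          # clear gradients accumulated by the attack
        loss = loss_fn(model(x + delta), y)      # loss at the perturbed point
        loss.backward()                          # step 3: gradient through the fixed delta*
        opt.step()
```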
Evaluating robust models
Our model looks good, but we should be careful declaring success

Need to evaluate against different attacks: PGD attacks run for longer, with
random restarts, etc.

Note: it is not particularly informative to evaluate against a different type of attack,
e.g., evaluating an ℓ∞ robust model against ℓ1 or ℓ2 attacks

21
Adversarial robustness
[Figure: “pig” example]

\min_\theta \mathbf{E}_{x,y}\big[\mathrm{Loss}(f_\theta(x), y)\big] \;\Longrightarrow\; \min_\theta \mathbf{E}_{x,y}\big[\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)\big]

1. Adversarial training: Take model SGD steps at (approximate) worst-case


perturbations [Goodfellow et al., 2015, Kurakin et al., 2016; Madry et al., 2017]

2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter,
2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
22
Provable defenses
\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y) \;\le\; \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta^{\mathrm{rel}}(x + \delta), y) \;\le\; \mathrm{Loss}(f_\theta^{\mathrm{dual}}(x, \Delta), y)

[Figures: the ReLU z = \max(0, \hat{z}) with pre-activation bounds [\ell, u], and its convex relaxations]

Maximization problem is now a convex linear program [Wong and Kolter, 2018]

Dual from [Wong and Kolter, 2018], also independently derived via hybrid zonotope
[Mirman et al., 2018] and forward Lipschitz arguments [Weng et al., 2018]

23
[Wong and Kolter, 2018], https://arxiv.org/abs/1711.00851
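The LP relaxation and its dual are too involved to reproduce here; as a much cruder illustration of the same idea of provably bounding the network's outputs over the whole perturbation set, here is a sketch of simple interval bound propagation for a fully connected ReLU network (this is not the method of [Wong and Kolter, 2018], but the looser interval-style bound; all names are illustrative):

```python
import torch
import torch.nn as nn

def interval_bounds(layers, x, epsilon):
    """Elementwise lower/upper bounds on the logits over ||delta||_inf <= epsilon,
    propagated through Linear/ReLU layers (interval bound propagation)."""
    lower, upper = x - epsilon, x + epsilon
    for layer in layers:
        if isinstance(layer, nn.Linear):
            mid, rad = (lower + upper) / 2, (upper - lower) / 2
            mid = layer(mid)                      # W * mid + b
            rad = rad @ layer.weight.abs().t()    # |W| * rad
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    return lower, upper

def certified(layers, x, y, epsilon):
    """True where the true-class logit's lower bound beats every other logit's upper bound."""
    lower, upper = interval_bounds(layers, x, epsilon)
    margin = lower.gather(1, y.view(-1, 1)) - upper
    margin.scatter_(1, y.view(-1, 1), float("inf"))   # ignore the true class itself
    return (margin > 0).all(dim=1)

# Usage (illustrative):
# net = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
# mask = certified(net, x.view(x.shape[0], -1), y, epsilon=0.1)
```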
Robust optimization: putting it all together
In the end, instead of minimizing the traditional loss…
\underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \ell(h_\theta(x_i), y_i)

…we just minimize our computed bound on loss, implemented in an auto-differentiation
framework (PyTorch), and we get a guaranteed bound on worst-case loss (or error) for
any norm-bounded adversarial attack

\underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \ell(J_{\epsilon,\theta}(x_i), y_i) \;\ge\; \underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \max_{\delta \in \Delta} \ell(h_\theta(x_i + \delta), y_i)

Full code available at https://github.com/locuslab/convex_adversarial

24
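The actual LP/dual bound is implemented in the linked repository; to illustrate only the training recipe, here is a sketch that minimizes the cruder interval-propagation bound from the previous sketch (worst-case logits plugged into cross-entropy give an upper bound on the robust loss; all names are illustrative and this is not the repository's API):

```python
import torch
import torch.nn as nn

def worst_case_logits(layers, x, y, epsilon):
    """Lower-bound the true-class logit and upper-bound every other logit
    over ||delta||_inf <= epsilon, via interval bound propagation."""
    lower, upper = x - epsilon, x + epsilon
    for layer in layers:
        if isinstance(layer, nn.Linear):
            mid, rad = (lower + upper) / 2, (upper - lower) / 2
            mid, rad = layer(mid), rad @ layer.weight.abs().t()
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    # Use the upper bound for every wrong class and the lower bound for the true class.
    one_hot = torch.zeros_like(upper).scatter(1, y.view(-1, 1), 1.0)
    return upper * (1 - one_hot) + lower * one_hot

def robust_train_epoch(layers, loader, opt, epsilon=0.1):
    """Minimize cross-entropy on the worst-case logits: an upper bound on the robust loss."""
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(worst_case_logits(layers, x, y, epsilon), y)
        loss.backward()
        opt.step()
```

Looser bounds make the training more conservative, which is part of why tighter relaxations such as the LP/dual bound are preferred in practice.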
2D Toy Example
Simple 2D toy problem, 2-100-100-100-2 MLP network, trained with Adam
(learning rate = 0.001, no hyperparameter tuning)

[Figures: standard training; robust convex training]


25
Standard and robust errors on MNIST 𝜖 = 0.1
[Bar chart: clean error and guaranteed robust error bound for a standard CNN, a robust linear classifier, and our method (CNN); values shown include 100%, 44%, 17%, 3.7%, and 1.1%]


26
MNIST Attacks
We can also look at how well real attacks perform at 𝜖 = 0.1
[Bar chart: test error at 𝜖 = 0.1 for standard training and our method, under no attack, FGSM, PGD, and the certified robust bound; values shown include 100%, 82%, 50%, 3.7%, 2.8%, 2.1%, and 1.1%]

27
What causes adversarial examples?
Adversarial examples are caused (informally) by small regions of adversarial class
“jutting” into an otherwise “nice” decision region (see also, e.g., [Roth et al., 2019])

[Illustration: data point; correct class region; incorrect class region]
28
Randomization as a defense?
We can “smooth” this decision region by adding Gaussian noise to the input and
picking the majority class of the classifier over this noise

[Illustration: decision regions of f(x) and of the smoothed classifier g(x)]

g(x) = \mathrm{argmax}_y \; \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]

This was proposed (in many different ways) as a heuristic defense, but [Lecuyer et
al., 2018] and later [Li et al., 2018] demonstrated that it gives certified bounds; we
simplify and tighten this analysis in [Cohen et al., 2019]
29
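A sketch of how the smoothed prediction g(x) can be estimated by Monte Carlo sampling (majority vote over noisy copies; names and sample counts are illustrative, and this is only the prediction step, not yet the certification):

```python
import torch

def smoothed_predict(f, x, sigma=0.25, num_samples=1000, batch_size=100):
    """Estimate g(x) = argmax_y P_{eps ~ N(0, sigma^2 I)}[f(x + eps) = y] by majority vote."""
    counts, remaining = None, num_samples
    with torch.no_grad():
        while remaining > 0:
            n = min(batch_size, remaining)
            noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)  # n noisy copies of x
            logits = f(noisy)
            votes = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
            counts = votes if counts is None else counts + votes
            remaining -= n
    return counts.argmax().item()
```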
Visual intuition of randomized smoothing
To classify panda images, classify a bunch of versions perturbed by random noise,
take the majority vote

Note that this requires that our “base” classifier 𝑓 be able to classify noisy images
well (in practice, means we also need to train on these noisy images)

30
The randomized smoothing guarantee
Theorem (binary case):
• Given some input x, let \hat{y} = g(x) be the prediction of the smoothed classifier,
and let p > 1/2 be the associated probability of this class under the smoothing
distribution,
p = \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = \hat{y}\big]

• Then g(x + \delta) = \hat{y} (i.e., the smoothed classifier is robust) for any \delta such that
\|\delta\|_2 \le \sigma \Phi^{-1}(p)
where \Phi^{-1} is the Gaussian inverse CDF

31
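In practice p is not known exactly, so the certificate is computed from a high-confidence lower bound on p estimated by sampling. A sketch of that computation, following the theorem (the split into separate selection and estimation samples and other details of the full procedure in [Cohen et al., 2019] are omitted; names are illustrative):

```python
import torch
from scipy.stats import beta, norm

def certify(f, x, sigma=0.25, num_samples=10000, alpha=0.001):
    """Return (predicted class, certified l2 radius); radius 0.0 means we abstain."""
    with torch.no_grad():
        # Sample noisy copies and record the base classifier's votes (batching omitted).
        noisy = x.unsqueeze(0) + sigma * torch.randn(num_samples, *x.shape)
        preds = f(noisy).argmax(dim=1)
    top_class = preds.mode().values.item()
    k = int((preds == top_class).sum())
    # One-sided Clopper-Pearson lower confidence bound on p = P[f(x + eps) = top_class],
    # holding with probability at least 1 - alpha over the sampling.
    p_lower = beta.ppf(alpha, k, num_samples - k + 1)
    if p_lower <= 0.5:
        return top_class, 0.0
    return top_class, sigma * norm.ppf(p_lower)   # R = sigma * Phi^{-1}(p_lower)
```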
Proof of certified robustness
Reasonable question: why can performance on random noise tell us anything
about performance under adversarial noise?

Proof of theorem (informal):


• Suppose I have two points x and x + \delta, and you, an adversary, want to craft
a decision boundary for the underlying classifier f(x) such that:
1. 𝑥 is classified one way by smoothed classifier 𝑔(𝑥)
2. 𝑥 + 𝛿 is classified differently by smoothed classifier 𝑔(𝑥)

32
Proof of certified robustness (cont)
(Follows from the Neyman-Pearson lemma in hypothesis testing)

See also [Li and Kuelbs, 1998] (thanks to Ludwig Schmidt for pointing out the reference)

[Illustration: the points x and x + \delta, the radius R, and the decision regions of f(x) and g(x)]

For a linear classifier, we can compute the \ell_2 distance to the worst-case boundary exactly:
R = \sigma \Phi^{-1}(p)
where p is the probability of the majority class; this implies any perturbation with \|\delta\|_2 \le R
cannot change the class label ∎
33
Caveats (a.k.a. the fine print)
The procedure here only guarantees robustness for the smoothed classifier g, not
for the underlying classifier f

The probability p of correct classification under smoothing cannot be computed
exactly (the exact convolution of a Gaussian with a neural network is intractable)
• In practice, we need to resort to Monte Carlo estimates to compute a lower
bound on p and certify the prediction (we need a lot of samples to compute the
certified radius, though far fewer to just compute the prediction)
• Bounds hold with high probability over the (internal) randomness of sampling

We are certifying a tiny radius compared to the noise distribution

34
Comparison to previous SOTA on CIFAR10

For identical networks, mostly outperforms previous SOTA for ℓ2 robustness, but also
scales to much larger networks (where it uniformly outperforms duality-based approaches)
35
Performance on ImageNet

Example: we can certify that the smoothed classifier has top-1 accuracy of 37% under any
perturbation with \|\delta\|_2 \le 1 (in normalized pixels, i.e., RGB values in [0,1])
36
Future and ongoing work
Extension to other perturbation norms besides ℓ2?
• Seems extremely challenging (possibly impossible under certain
assumptions), e.g., we can't do better than naive d^{1/2} scaling for the ℓ∞ norm

A strange property:
• Previous work on LP bounds was extremely specific to neural networks
• Smoothing work never uses the fact that the base classifier is a neural network

My best guess for a way forward: we need to use model information to extract
properties of the base classifier beyond the single probability p, and use these to get
better bounds

37
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

38
Robust artificial intelligence
Deep learning is making amazing strides, but we have a long way to go before
we can build deep learning systems that achieve even "small" degrees of
robustness/adaptability compared to what humans take for granted

Resources:
• http://zicokolter.com – Web page with all papers
• http://github.com/locuslab – Code associated with all papers
• http://adversarial-ml-tutorial.org – Tutorial/code on adversarial robustness
• http://locuslab.github.io – Group blog

39
