

Provably robust deep learning

J. Zico Kolter
Carnegie Mellon University and Bosch Center for AI

1
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

2
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

3
The AI breakthrough (some recent history)

[Figures: Karras et al., 2018; Radford et al., 2019; Vinyals et al., 2019]


4
…but the stakes are low


5
Adversarial attacks

[Figures: Madry et al.; Sharif et al., 2016; Athalye et al., 2017; Evtimov et al., 2017]

6
… and some recent work

[Lee and Kolter, 2019], https://arxiv.org/abs/1906.11897
7
Why should we care?
…you probably don’t have an adversary changing inputs to your classifier at a
pixel level (or if you do, you have bigger problems)

1. Genuine security implications for deep networks (e.g., with physical attacks)

2. Says something fundamental about the representation of deep classifiers, smooth
decision boundaries, sensitivity to distribution shift (within threat model), etc.

8
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

9
Adversarial attacks as optimization

𝐄",$ Loss 𝑓/ (𝑥), 𝑦

𝐄",$ max Loss 𝑓/ (𝑥 + 𝛿), 𝑦


(∈∆
10
The adversarial optimization problem
How do we solve the “inner” optimization problem
\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

Key insight: the same process that enabled us to learn the model parameters via
gradient descent also allows us to create an adversarial example via gradient
descent
\frac{\partial}{\partial \delta} \mathrm{Loss}(f_\theta(x + \delta), y)

11
Solving with projected gradient descent
Since we are trying to maximize the loss when creating an adversarial example,
we repeatedly move in the direction of the positive gradient

Since we also need to ensure that 𝛿 ∈ Δ, we also project back into this set after
each step, a process known as projected gradient descent (PGD)
\delta := \mathrm{Proj}_\Delta\!\left(\delta + \alpha \frac{\partial}{\partial \delta} \mathrm{Loss}(f_\theta(x + \delta), y)\right)

Example: for \Delta = \{\delta : \|\delta\|_\infty \le \epsilon\} (called the ℓ∞ ball), the projection operator just
clips each coordinate of \delta to [−𝜖, 𝜖]

12
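To make the update above concrete, here is a minimal PyTorch sketch of PGD for the ℓ∞ ball (PyTorch is the framework used later in the talk; the function and parameter names are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    """PGD for the l_inf ball: ascend the loss in delta, then clip back to [-eps, eps]."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        # Gradient ascent step followed by projection onto the l_inf ball,
        # which for this set is just coordinate-wise clipping.
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()

# Usage (illustrative): delta = pgd_linf(model, x, y); x_adv = x + delta
```

A common variant takes the sign of the gradient at each step rather than the raw gradient; the next slide's FGSM is the one-step version of that idea.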
The Fast Gradient Sign Method
The Fast Gradient Sign Method (FGSM) takes a single PGD step with step size \alpha \to \infty,
which corresponds exactly to just taking a step in the signs of the gradient terms

[Illustration: a single gradient step from \delta = 0, followed by projection onto \Delta]

Creates weaker attacks than running full PGD, but substantially faster
13
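The corresponding FGSM sketch, under the same illustrative assumptions: with an effectively infinite step size, projection onto the ℓ∞ ball leaves only the sign of the gradient, scaled by ε.

```python
import torch
import torch.nn as nn

def fgsm(model, x, y, epsilon=0.1):
    """FGSM: a single step of size epsilon in the sign of the loss gradient."""
    delta = torch.zeros_like(x, requires_grad=True)
    loss = nn.CrossEntropyLoss()(model(x + delta), y)
    loss.backward()
    # Taking alpha -> infinity and projecting onto {|delta|_inf <= epsilon}
    # reduces to epsilon * sign(gradient).
    return epsilon * delta.grad.detach().sign()
```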
Illustration of adversarial examples
We will demonstrate adversarial attacks on the MNIST data set, using two different
architectures:

• 2-layer fully connected MLP: FC-200, FC-10
• 6-layer ConvNet: Conv-32x28x28, Conv-32x28x28, Conv-64x14x14, Conv-64x14x14, FC-100, FC-10

14
Illustrations of FGSM/PGD
[Bar chart: test error at 𝜖 = 0.1 for the MLP and the ConvNet under clean inputs, FGSM, and PGD attacks; values shown include 96.4%, 92.6%, 74.3%, 41.7%, 2.9%, and 1.1%]

15
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

16
Adversarial robustness
[Figure: “pig” example]

\min_\theta \mathbf{E}_{x,y}\big[\mathrm{Loss}(f_\theta(x), y)\big] \;\Longrightarrow\; \min_\theta \mathbf{E}_{x,y}\big[\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)\big]

1. Adversarial training: Take model SGD steps at (approximate) worst-case


perturbations [Goodfellow et al., 2015, Kurakin et al., 2016; Madry et al., 2017]

2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter,
2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
17
Adversarial training
How do we optimize the objective

\min_\theta \sum_{(x,y) \in D} \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

We would like to solve it with gradient descent, but how do we compute the
gradient of the objective with the max term inside?

18
Danskin’s Theorem
A fundamental result in optimization:
\frac{\partial}{\partial \theta} \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y) = \frac{\partial}{\partial \theta} \mathrm{Loss}(f_\theta(x + \delta^\star), y)

where \delta^\star = \mathrm{argmax}_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)

Seems “obvious,” but it is a very subtle result; it means we can optimize through the
max by just finding its maximizing value

Note, however, that it only applies when the max is performed exactly

19
Adversarial training
Repeat:
1. Select minibatch B
2. For each (x, y) \in B, compute adversarial example \delta^\star(x)
3. Update parameters
   \theta := \theta - \frac{\alpha}{|B|} \sum_{(x,y) \in B} \frac{\partial}{\partial \theta} \mathrm{Loss}(f_\theta(x + \delta^\star(x)), y)

Common to also mix robust/standard updates (not done in our case)

[Bar chart: test error at 𝜖 = 0.1 for the ConvNet and the Robust ConvNet under clean inputs, FGSM, and PGD attacks; values shown include 74.4%, 41.7%, 2.8%, 2.6%, 1.1%, and 0.9%]

20
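A minimal PyTorch sketch of the training loop above, reusing the PGD attack sketched earlier for the inner maximization (all names are illustrative; per Danskin's theorem, differentiating the loss at the attack point gives the gradient we need, at least when the inner max is solved exactly):

```python
import torch
import torch.nn as nn

def pgd_linf(model, x, y, epsilon=0.1, alpha=0.01, num_steps=40):
    # Approximate inner maximization (same PGD sketch as before).
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(num_steps):
        loss_fn(model(x + delta), y).backward()
        delta.data = (delta + alpha * delta.grad.detach()).clamp(-epsilon, epsilon)
        delta.grad.zero_()
    return delta.detach()

def adversarial_train_epoch(model, loader, opt, epsilon=0.1):
    """One epoch of adversarial training: SGD steps at (approximate) worst-case perturbations."""
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        delta = pgd_linf(model, x, y, epsilon)   # step 2: find adversarial example
        opt.zero_grad()                          # clear gradients accumulated by the attack
        loss = loss_fn(model(x + delta), y)      # loss at the perturbed point
        loss.backward()                          # step 3: gradient through the fixed delta*
        opt.step()
```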
Evaluating robust models
Our model looks good, but we should be careful declaring success

Need to evaluate against different attacks: PGD attacks run for longer, with
random restarts, etc.

Note: it is not particularly informative to evaluate against a different type of attack,
e.g., evaluating an ℓ∞ robust model against ℓ1 or ℓ2 attacks

21
Adversarial robustness
[Figure: “pig” example]

\min_\theta \mathbf{E}_{x,y}\big[\mathrm{Loss}(f_\theta(x), y)\big] \;\Longrightarrow\; \min_\theta \mathbf{E}_{x,y}\big[\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y)\big]

1. Adversarial training: Take model SGD steps at (approximate) worst-case


perturbations [Goodfellow et al., 2015, Kurakin et al., 2016; Madry et al., 2017]

2. Certified defenses: Provably upper bound the inner maximization [Wong and Kolter,
2018; Raghunathan et al., 2018; Mirman et al., 2018; Cohen et al., 2019]
22
Provable defenses
\max_{\delta \in \Delta} \mathrm{Loss}(f_\theta(x + \delta), y) \;\le\; \max_{\delta \in \Delta} \mathrm{Loss}(f_\theta^{\mathrm{rel}}(x + \delta), y) \;\le\; \mathrm{Loss}(f_\theta^{\mathrm{dual}}(x, \Delta), y)

[Figures: the ReLU z = \max(0, \hat{z}) with pre-activation bounds [\ell, u], and its convex relaxations]

Maximization problem is now a convex linear program [Wong and Kolter, 2018]

Dual from [Wong and Kolter, 2018], also independently derived via hybrid zonotope
[Mirman et al., 2018] and forward Lipschitz arguments [Weng et al., 2018]

23
[Wong and Kolter, 2018], https://arxiv.org/abs/1711.00851
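The LP relaxation and its dual are too involved to reproduce here; as a much cruder illustration of the same idea of provably bounding the network's outputs over the whole perturbation set, here is a sketch of simple interval bound propagation for a fully connected ReLU network (this is not the method of [Wong and Kolter, 2018], but the looser interval-style bound; all names are illustrative):

```python
import torch
import torch.nn as nn

def interval_bounds(layers, x, epsilon):
    """Elementwise lower/upper bounds on the logits over ||delta||_inf <= epsilon,
    propagated through Linear/ReLU layers (interval bound propagation)."""
    lower, upper = x - epsilon, x + epsilon
    for layer in layers:
        if isinstance(layer, nn.Linear):
            mid, rad = (lower + upper) / 2, (upper - lower) / 2
            mid = layer(mid)                      # W * mid + b
            rad = rad @ layer.weight.abs().t()    # |W| * rad
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    return lower, upper

def certified(layers, x, y, epsilon):
    """True where the true-class logit's lower bound beats every other logit's upper bound."""
    lower, upper = interval_bounds(layers, x, epsilon)
    margin = lower.gather(1, y.view(-1, 1)) - upper
    margin.scatter_(1, y.view(-1, 1), float("inf"))   # ignore the true class itself
    return (margin > 0).all(dim=1)

# Usage (illustrative):
# net = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
# mask = certified(net, x.view(x.shape[0], -1), y, epsilon=0.1)
```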
Robust optimization: putting it all together
In the end, instead of minimizing the traditional loss…
\underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \ell(h_\theta(x_i), y_i)

…we just minimize our computed bound on loss, implemented in an auto-differentiation
framework (PyTorch), and we get a guaranteed bound on worst-case loss (or error) for
any norm-bounded adversarial attack

\underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \ell(J_{\epsilon,\theta}(x_i), y_i) \;\ge\; \underset{\theta}{\mathrm{minimize}} \; \sum_{i=1}^{m} \max_{\delta \in \Delta} \ell(h_\theta(x_i + \delta), y_i)

Full code available at https://github.com/locuslab/convex_adversarial

24
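The actual LP/dual bound is implemented in the linked repository; to illustrate only the training recipe, here is a sketch that minimizes the cruder interval-propagation bound from the previous sketch (worst-case logits plugged into cross-entropy give an upper bound on the robust loss; all names are illustrative and this is not the repository's API):

```python
import torch
import torch.nn as nn

def worst_case_logits(layers, x, y, epsilon):
    """Lower-bound the true-class logit and upper-bound every other logit
    over ||delta||_inf <= epsilon, via interval bound propagation."""
    lower, upper = x - epsilon, x + epsilon
    for layer in layers:
        if isinstance(layer, nn.Linear):
            mid, rad = (lower + upper) / 2, (upper - lower) / 2
            mid, rad = layer(mid), rad @ layer.weight.abs().t()
            lower, upper = mid - rad, mid + rad
        elif isinstance(layer, nn.ReLU):
            lower, upper = lower.clamp(min=0), upper.clamp(min=0)
    # Use the upper bound for every wrong class and the lower bound for the true class.
    one_hot = torch.zeros_like(upper).scatter(1, y.view(-1, 1), 1.0)
    return upper * (1 - one_hot) + lower * one_hot

def robust_train_epoch(layers, loader, opt, epsilon=0.1):
    """Minimize cross-entropy on the worst-case logits: an upper bound on the robust loss."""
    loss_fn = nn.CrossEntropyLoss()
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(worst_case_logits(layers, x, y, epsilon), y)
        loss.backward()
        opt.step()
```

Looser bounds make the training more conservative, which is part of why tighter relaxations such as the LP/dual bound are preferred in practice.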
2D Toy Example
Simple 2D toy problem, 2-100-100-100-2 MLP network, trained with Adam
(learning rate = 0.001, no hyperparameter tuning)

[Figures: standard training; robust convex training]


25
Standard and robust errors on MNIST 𝜖 = 0.1
[Bar chart: clean error and guaranteed robust error bound for a standard CNN, a robust linear classifier, and our method (CNN); values shown include 100%, 44%, 17%, 3.7%, and 1.1%]


26
MNIST Attacks
We can also look at how well real attacks perform at 𝜖 = 0.1
[Bar chart: test error at 𝜖 = 0.1 for standard training and our method, under no attack, FGSM, PGD, and the certified robust bound; values shown include 100%, 82%, 50%, 3.7%, 2.8%, 2.1%, and 1.1%]

27
What causes adversarial examples?
Adversarial examples are caused (informally) by small regions of adversarial class
“jutting” into an otherwise “nice” decision region (see also, e.g., [Roth et al., 2019])

[Illustration: data point; correct class region; incorrect class region]
28
Randomization as a defense?
We can “smooth” this decision region by adding Gaussian noise to the input and
picking the majority class of the classifier over this noise

[Illustration: decision regions of f(x) and of the smoothed classifier g(x)]

g(x) = \mathrm{argmax}_y \; \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = y\big]

This was proposed (in many different ways) as a heuristic defense, but [Lecuyer et
al., 2018] and later [Li et al., 2018] demonstrated that it gives certified bounds; we
simplify and tighten this analysis in [Cohen et al., 2019]
29
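A sketch of how the smoothed prediction g(x) can be estimated by Monte Carlo sampling (majority vote over noisy copies; names and sample counts are illustrative, and this is only the prediction step, not yet the certification):

```python
import torch

def smoothed_predict(f, x, sigma=0.25, num_samples=1000, batch_size=100):
    """Estimate g(x) = argmax_y P_{eps ~ N(0, sigma^2 I)}[f(x + eps) = y] by majority vote."""
    counts, remaining = None, num_samples
    with torch.no_grad():
        while remaining > 0:
            n = min(batch_size, remaining)
            noisy = x.unsqueeze(0) + sigma * torch.randn(n, *x.shape)  # n noisy copies of x
            logits = f(noisy)
            votes = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
            counts = votes if counts is None else counts + votes
            remaining -= n
    return counts.argmax().item()
```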
Visual intuition of randomized smoothing
To classify panda images, classify a bunch of versions perturbed by random noise,
take the majority vote

Note that this requires that our “base” classifier 𝑓 be able to classify noisy images
well (in practice, means we also need to train on these noisy images)

30
The randomized smoothing guarantee
Theorem (binary case):
• Given some input x, let \hat{y} = g(x) be the prediction of the smoothed classifier,
and let p > 1/2 be the associated probability of this class under the smoothing
distribution,
p = \mathbf{P}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\big[f(x + \epsilon) = \hat{y}\big]

• Then g(x + \delta) = \hat{y} (i.e., the smoothed classifier is robust) for any \delta such that
\|\delta\|_2 \le \sigma \Phi^{-1}(p)
where \Phi^{-1} is the Gaussian inverse CDF

31
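In practice p is not known exactly, so the certificate is computed from a high-confidence lower bound on p estimated by sampling. A sketch of that computation, following the theorem (the split into separate selection and estimation samples and other details of the full procedure in [Cohen et al., 2019] are omitted; names are illustrative):

```python
import torch
from scipy.stats import beta, norm

def certify(f, x, sigma=0.25, num_samples=10000, alpha=0.001):
    """Return (predicted class, certified l2 radius); radius 0.0 means we abstain."""
    with torch.no_grad():
        # Sample noisy copies and record the base classifier's votes (batching omitted).
        noisy = x.unsqueeze(0) + sigma * torch.randn(num_samples, *x.shape)
        preds = f(noisy).argmax(dim=1)
    top_class = preds.mode().values.item()
    k = int((preds == top_class).sum())
    # One-sided Clopper-Pearson lower confidence bound on p = P[f(x + eps) = top_class],
    # holding with probability at least 1 - alpha over the sampling.
    p_lower = beta.ppf(alpha, k, num_samples - k + 1)
    if p_lower <= 0.5:
        return top_class, 0.0
    return top_class, sigma * norm.ppf(p_lower)   # R = sigma * Phi^{-1}(p_lower)
```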
Proof of certified robustness
Reasonable question: why can performance on random noise tell us anything
about performance under adversarial noise?

Proof of theorem (informal):


• Suppose I have two points x and x + \delta, and you, an adversary, want to craft
a decision boundary for the underlying classifier f(x) such that:
1. 𝑥 is classified one way by smoothed classifier 𝑔(𝑥)
2. 𝑥 + 𝛿 is classified differently by smoothed classifier 𝑔(𝑥)

32
Proof of certified robustness (cont)
(Follows from the Neyman-Pearson lemma in hypothesis testing)

See also [Li and Kuelbs, 1998] (thanks to Ludwig Schmidt for pointing out the reference)

[Illustration: the points x and x + \delta, the radius R, and the decision regions of f(x) and g(x)]

For a linear classifier, we can compute the \ell_2 distance to the worst-case boundary exactly:
R = \sigma \Phi^{-1}(p)
where p is the probability of the majority class; this implies any perturbation with \|\delta\|_2 \le R
cannot change the class label ∎
33
Caveats (a.k.a. the fine print)
The procedure here only guarantees robustness for the smoothed classifier g, not
for the underlying classifier f

The probability p of correct classification under smoothing cannot be computed
exactly (the exact convolution of a Gaussian with a neural network is intractable)
• In practice, we need to resort to Monte Carlo estimates to compute a lower
bound on p and certify the prediction (we need a lot of samples to compute the
certified radius, though far fewer to just compute the prediction)
• Bounds hold with high probability over the (internal) randomness of sampling

We are certifying a tiny radius compared to the noise distribution

34
Comparison to previous SOTA on CIFAR10

For identical networks, mostly outperforms previous SOTA for ℓ2 robustness, but also
scales to much larger networks (where it uniformly outperforms duality-based approaches)
35
Performance on ImageNet

Example: we can certify that the smoothed classifier has top-1 accuracy of 37% under any
perturbation with \|\delta\|_2 \le 1 (in normalized pixels, i.e., RGB values in [0,1])
36
Future and ongoing work
Extension to other perturbation norms besides ℓ2?
• Seems extremely challenging (possibly impossible under certain
assumptions), e.g., we can't do better than naive d^{1/2} scaling for the ℓ∞ norm

A strange property:
• Previous work on LP bounds was extremely specific to neural networks
• Smoothing work never uses the fact that the base classifier is a neural network

My best guess for a way forward: we need to use model information to extract
properties of the base classifier beyond the single probability p, and use these to get
better bounds

37
Outline
Introduction

Attacking machine learning algorithms

Defending against adversarial attacks

Final thoughts

38
Robust artificial intelligence
Deep learning is making amazing strides, but we have a long way to go before
we can build deep learning systems that achieve even "small" degrees of
robustness/adaptability compared to what humans take for granted

Resources:
• http://zicokolter.com – Web page with all papers
• http://github.com/locuslab – Code associated with all papers
• http://adversarial-ml-tutorial.org – Tutorial/code on adversarial robustness
• http://locuslab.github.io – Group blog

39
