Cryptography in the Presence of Physical Attacks
Lauren DE MEYER
September 2020
© 2020 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Lauren De Meyer, Kasteelpark Arenberg 10 box 2452, B-3001 Leuven (Belgium)
Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden
door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande
schriftelijke toestemming van de uitgever.
All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm,
electronic or any other means without written permission from the publisher.
Preface
Four years does not seem like such a long time, but four years working on a
PhD is life-changing. Naturally, there are many people I have to acknowledge
for shaping this experience.
I would like to express my gratitude to the members of my examination
committee for taking the time to read and help me improve my work. I
also thank the chairman for leading my virtual private defense and real-life
public defense. Thank you to the FWO for funding me, which allowed me to
devote 3.5 years entirely to my research and achieve results I am very happy
with.
Dear Vincent, thank you for supervising me throughout this journey. Every
time I came knocking unexpectedly on your office door, you made time for me.
You tracked my progress in your notebook, you proofread everything I sent you
and never let a sunny Friday go by without telling me to go home early. It was
a privilege to have you as my supervisor.
Dear Bart, thank you for introducing me to the world of cryptography. When
you gave me a summer internship in 2012, I had no idea how much this would
end up shaping my life. I am very grateful for the time you took to teach me
new concepts on the blackboard in the office and also, for appointing Begül as
my internship supervisor.
Which brings me to my PhD guardian angel, Begül. Not a single day of my
PhD went by that I didn’t think of myself as the luckiest student in COSIC for
having you as my daily supervisor. In the office, you showed me by example
how to become an independent researcher with integrity and a critical mind.
Also outside the office, you were always there for me. Over four years, we had
so much fun working on midnight deadlines, eating sushi and künefe, going to
Ikea and saving butterflies from dying. I cannot thank you enough!
Dear Victor, we started out as colleagues and then became co-authors and then
friends. It didn’t take long for us to become really really really good friends
and eventually also office mates. I couldn’t have made it through this PhD
without your support. Thank you for being my rock both inside and outside
the office. Whether in Leuven, Taipei or Washington DC, we had the absolute
best conversations, in which you taught me about life, people and myself. I
look forward to sharing the next stages in our professional and personal lives.
One of the most important aspects of a PhD is collaboration and I want to thank
all my co-authors for the papers we worked on together. Special thanks are also
due to the people who helped me make my way around the COSIC lab, especially
Lennert and Arthur. To all the COSICs I did not get to collaborate with, you
were also wonderful. I could not imagine a better work environment. How many
people have the luxury to say that going to work feels like going to hang out with
friends? That is not to say that we do not work, as is hopefully evidenced by
this dissertation. But in between the work, we had amazing Halloween parties,
Friday beers at Metafoor, COSIC weekends and other memorable evenings
in the feestzaal. Especially to my Barracks crew, thank you for the skiing,
camping and Disney trips and all the crazy unforgettable stuff in between. Dear
Péla, thank you for all your help and for letting me talk to you about basically
anything. We are so lucky to have you at COSIC. I am also eternally grateful
to my dear family for giving me so many opportunities in life and for always
helping me pursue my dreams.
Finally, to the one who drives me a little bit crazy. You did not want many words, so I will borrow some from Márquez: "She awaited him with such anxiety that his smile alone gave her back her breath." Thank you <4
Abstract
Cryptographic primitives are designed such that they can resist black-box attacks
(cryptanalysis). For their implementations, extra measures must be taken to also
provide security against physical attacks. One class of physical attacks is that
of side-channel analysis (SCA), a non-invasive attack that exploits the physical
leakages emanating from a device (power consumption or electromagnetic
radiation among others) to retrieve its secret data. One particularly powerful
attack, differential power analysis (first-order DPA), was introduced in 1999 by
Kocher et al. Today, we aim to provide security against higher-order DPA.
In a dth-order DPA attack, the attacker exploits any statistical moment of the
power consumption up to order d. A popular and established countermeasure
is masking, a method based on secret sharing in which intermediate variables
are stochastically split into multiple shares to make the side-channel-leaked
information independent of sensitive data. Many different types of masking
schemes have been proposed, often accompanied by a formal proof of security.
Another class of physical attacks has been gaining attention in recent years.
In fault attacks such as differential fault analysis (DFA), an attacker
induces logical errors in the computation by for example under-powering the
device or by careful illumination of certain areas in the silicon die. The result of
a faulty computation can reveal a wealth of secret information. These attacks
can be executed either separately or combined with side-channel analysis.
The application of countermeasures to cryptographic primitives significantly
increases their implementation cost, especially of the nonlinear components.
Moreover, they require the generation of a large number of random bits, which
is expensive in practice.
The work described in this book can be divided into four categories.
Our first research direction looks at the design of countermeasures against
side-channel attacks and their application to existing ciphers, such as the
Advanced Encryption Standard (AES). Because of its wide deployment in
Beknopte samenvatting
The work described in this book can be divided into four categories.
Our first research direction looks at the design of countermeasures against
side-channel attacks and their application to existing ciphers, such as the
Advanced Encryption Standard (AES). Because of its widespread use in industry,
designing such implementations that provide security against physical attacks
at minimal area, latency and randomness cost is an important task. Moreover,
the trade-off between these different cost metrics is difficult to navigate,
since optimizing for one is usually detrimental to another.
Our second objective is the analysis and verification of the devised
countermeasures. Designing masked circuits is not trivial and many of the
countermeasures proposed in recent years have turned out to be vulnerable
relatively soon after publication. This has led to a new wave of works on the
verification of masking schemes and their implementations. With multiple
security notions in the literature and differences between the theoretical and
practical approaches, there is still uncertainty about how the security of a
masked implementation should be evaluated.
Thirdly, we start exploring the design of countermeasures that protect not
only against side-channel attacks, but also against fault attacks and combined
attacks. This is a very young branch of research and, in contrast to masking,
the state of the art in countermeasures against combined attacks (before our
research) was mostly heuristic and lacked a formal background.
Finally, we consider the challenge of designing cryptographic primitives with
the cost of these countermeasures in mind. Most cryptographic algorithms that
are widely accepted and used today were designed at a time when physical
attacks were not a consideration. Components were mostly chosen for their
mathematical and cryptographic properties. Implementation cost was sometimes
taken into account, but outside the context of masking.
By looking at these four very different aspects of cryptography for embedded
systems, we not only create a broad and multidisciplinary knowledge base, but
our experience in each aspect also improves our understanding of the others. It
allows us to identify common trends in the existing literature across different
topics, as well as differences and contrasts.
List of Abbreviations
AB almost bent.
AE affine equivalence.
FA fault analysis.
FPGA Field Programmable Gate Array.
NI non-interference.
NIST National Institute of Standards and Technology.
TI threshold implementations.
TVLA test vector leakage assessment.
Contents
Abstract iii
Beknopte samenvatting v
List of Tables xv
2.3.2 In Hardware . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Pushing the Limits . . . . . . . . . . . . . . . . . . . . . 36
2.3.4 Where the Randomness Comes From . . . . . . . . . . . 39
2.4 My Contributions in this Context . . . . . . . . . . . . . . . . . 40
2.4.1 Multiplicative Masking for AES in Hardware . . . . . . 40
2.4.2 Rotational Symmetry for FPGA-specific Advanced Encryption Standard (AES) . . . . . . . . . . . . . . 41
2.4.3 Masking the AES with only Two Random Bits . . . . . 42
2.4.4 Recovering the CTR_DRBG state in 256 traces . . . . 43
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Side-Channel Analysis 47
3.1 Side-Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Differential Power Analysis . . . . . . . . . . . . . . . . 50
3.1.2 Higher-Order Attacks . . . . . . . . . . . . . . . . . . . 54
3.2 Verifying Masked Implementations . . . . . . . . . . . . . . . . 56
3.2.1 Leakage Assessment . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Adversary Models . . . . . . . . . . . . . . . . . . . . . 63
3.2.3 Provable Security . . . . . . . . . . . . . . . . . . . . . . 66
3.2.4 Flaw Detection . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 My Contributions in this Context . . . . . . . . . . . . . . . . . 73
3.3.1 Consolidating Security Notions in Hardware Masking . . 73
3.3.2 Recovering the CTR_DRBG state in 256 traces . . . . 74
3.3.3 On the Effect of the (Micro)Architecture on the Development of Side-Channel Resistant Software . . . . . . . 74
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Combined Physical Attacks and Countermeasures 79
4.1 Fault Attacks and Countermeasures . . . . . . . . . . . . . . . 79
4.1.1 Fault Attacks . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 Ineffective Faults and Safe-Errors. . . . . . . . . . . . . 84
4.2 Combined Attacks and Countermeasures . . . . . . . . . . . . . 85
4.2.1 Attacks in the Literature . . . . . . . . . . . . . . . . . 85
4.2.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . 87
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 My Contributions in this Context . . . . . . . . . . . . . . . . . 94
4.3.1 CAPA: The Spirit of Beaver against Physical Attacks . 94
4.3.2 M&M: Masks and Macs against Physical Attacks . . . . 95
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
II Publications 157
List of Publications 159
Multiplicative Masking for AES in Hardware 161
List of Figures
3.10 The gap between theory and practice: provable vs. practical security . . . . . . . . . . . . . . . . . . . . . . . . 77
3.11 The gap between theory and practice: attacks. . . . . . . . . . 78
Part I
Chapter 1
Introduction
hand, a public key is used for encryption and a secret key for decryption.
This can be compared to the operation of a padlock, which can be locked by
anyone, but only unlocked by the person with the key. The topic of this work
applies to any cryptographic primitive, but for simplicity, our descriptions will
mostly consider keyed primitives and encryption specifically. There are other
cryptographic primitives such as hash functions, for which a similar treatment
holds.
[Figure: a mode of operation encrypting successive counter values IV, IV + 1, …, IV + i, combined with plaintext blocks P to produce ciphertext blocks C.]
AddRoundKey is the field addition of the 128-bit state with a 128-bit round
key. Since the Galois field has characteristic two, the addition is equivalent
to a bitwise XOR.
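Since field addition in characteristic two is a bitwise XOR, AddRoundKey can be sketched in a few lines (the function name and byte-array representation below are illustrative, not from the AES specification):

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """Field addition of a 128-bit state and round key = byte-wise XOR."""
    assert len(state) == 16 and len(round_key) == 16
    return bytes(s ^ k for s, k in zip(state, round_key))

state = bytes(range(16))
rk = bytes([0xAA] * 16)
masked = add_round_key(state, rk)
# XOR is its own inverse: adding the same round key twice restores the state
assert add_round_key(masked, rk) == state
```

This self-inverse property is also why the same operation serves both encryption and decryption.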
The 128-bit round keys are derived from a master key using the key schedule.
The key schedule consists mostly of linear operations, apart from using the
same S-box S as in SubBytes on four bytes of the key state in each round.
In the models above, the capabilities of the adversary are limited to the
knowledge or choice of inputs and outputs of the cryptographic primitive.
The encryption/decryption itself is considered as a black box. The black-box
model was the conventional model in cryptography until the 1990s, when
physical attacks started to emerge and cryptographers realised that in reality,
the cryptographic function is more like a grey box.
An attacker can exploit physical access to a device in several ways to get more
than just input and output information. We distinguish passive and active
physical attacks. In the first case, one passively observes the physical device
outputs. These attacks are called side-channel attacks. In the second case, an
attacker actively disturbs the device and introduces faults into the computations.
We refer to them as fault attacks. In this introduction, we cover the former,
which is also the main theme of this work, but fault attacks will play a role
in one of the chapters as well.
consumption, we would see that there is also a difference between the power
consumed to charge B and the power consumed to discharge B [MOP07, Fig.
3.3]. In this work, we only consider the dynamic power consumption. For an
investigation of the static power consumption, we refer to the works of Moos et
al. [Moo19, MMR20].
[Figure: signal transitions on wires A and B: 1 → 0, 0 → 1, 0 → 1, 1 → 0.]
1.2.2 Countermeasures
[Figure content: side-channel information; implementation intermediates/operations (link II); encryption intermediates/operations (link III); secrets/key.]
Figure 1.3: Links between secrets and side-channels (extended from [MOP07]).
We will complete this introductory chapter with our research questions and
guiding comments about the structure of the chapters.
4. How can we design new symmetric primitives such that the masking
overhead is minimized? Efficiently masking cryptographic primitives is a
challenging task. The properties that make cryptographic components good for
confusion, make them expensive for masked implementations. Many primitives
currently in use have been designed without considering the cost of masking.
The next step in optimizing cryptography for embedded systems is to take this
cost into account from the very beginning, in the design of the primitive itself. It
is however not always clear which properties result in efficient implementations
and many trade-offs must be made. Furthermore, a lot is still unknown about
the search space of S-boxes. We improve and extend the state-of-the-art on
S-box classification. We also investigate how some design decisions influence
the cost of masked implementations and how recent proposals in the literature
compare in this aspect.
This work is divided into four chapters, each of which deals with one of the above
research questions. We follow a publication-based model, which means that a
selection of our contributions to the field can be found as they were published
in Part II of this book. We chose this subset of publications based on their
significance in the field and our contribution to the work itself. Additionally,
we aimed to minimize the overlap of content with the chapters of Part I. The
full list of publications can be found on page 159. The chapters of Part I are
not only meant as introductions to the respective publications, but also provide
a comprehensive evaluation of the state-of-the-art, including our contributions.
In each chapter, we give an extensive background of the topic and an overview
of recent developments and we include a brief description of each publication.
This format allows us to critically evaluate both our own contributions in the
field and those of the research community as a whole. For clarity, we refer to
our own works with numeric citations (e.g. [3]) and to other references with
name-year-based citations (e.g. [Bil15]).
Chapter 2 considers the masking countermeasure in detail (i.e. research question
1). We selected our work from TCHES 2018 [7] (p. 161) as representative
publication for this topic. Next, Chapter 3 explains side-channel analysis
and most importantly investigates the verification of masked implementations
(research question 2). We included our work from TCHES 2020 [3] (p. 185) as
contribution to both research questions 1 and 2, as it constitutes an improvement
on an existing side-channel attack as well as an investigation into the randomness
requirements of masking. Our most important contribution towards research
question 2 is a work from TCHES 2019 [4], which was not included, as its
contents were incorporated into Chapter 3 itself. These first two chapters are
long compared to the last two. On the one hand, their topic represents the
bulk of the work performed during this PhD. On the other hand, there is an
overwhelming amount of existing research to consider. Chapter 4 deals with
combined attacks and countermeasures (research question 3) and is a slightly
shorter chapter. While it represents a very important contribution of our PhD,
this topic is relatively new in research, which means the existing literature
is limited. Representative publications for this chapter are our works from
CRYPTO 2018 [13] (p. 205) and from TCHES 2019 [6] (p. 229). Finally, in
Chapter 5 we look at the design of symmetric cryptographic primitives for
embedded systems (i.e. research question 4). This research direction is also less
established with very little literature available. With respect to this topic, we
included our work from ToSC 2019 [5] (p. 253).
Together, these chapters symbolize the different stages in the development of
embedded cryptography:
In practice, these stages follow each other in a circular rather than linear way
(see Figure 1.4). For example, the successful analysis of an implementation
can lead to corrections and improvements of the masking scheme. Also, as
exemplified by this work, experience with countermeasures such as masking
leads to new insights on primitive design.
Masking against
Side-Channel Attacks
Notation. Let F be some finite field. In the context of this work, the field has
characteristic two. Specific Galois fields of size 2^k are denoted F_{2^k}. A vector or
matrix over the field F is written in bold font: x ∈ F^n. The vector x_I contains
only the elements x_i for i ∈ I. Addition over the field is denoted by ⊕. We use
× for multiplication over the field, but sometimes omit it for ease of notation.
This means that the number of shares n in a dth -order masking must always
be strictly larger than d. However, in many cases, more than the minimal
number of d + 1 shares are required to preserve this security. The security of the
masking countermeasure very much relies on the so-called independent leakage
assumption (ILA) [CJRR99, PR13]. That is, it is generally assumed that the
leakages of different intermediate values occurring in distinct calculations are
independent of each other and that the power consumption is a linear function
of the individual data variables. While this assumption was validated with
experiments by Chari et al. [CJRR99], it has recently become clear that it
does not always hold. Extra care is required on platforms where the power
consumption may not follow a linear model due to for example coupling or
microarchitectural effects. We treat this issue in more detail in the next chapter.
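The sharing step itself is straightforward to sketch in software: draw n − 1 shares uniformly at random and compute the last share as the XOR correction. This is a minimal illustration of the encoding, not a full masked implementation:

```python
import secrets
from functools import reduce
from operator import xor

def boolean_mask(x: int, n: int) -> list[int]:
    """Split byte x into n shares with x = x0 ^ x1 ^ ... ^ x(n-1)."""
    shares = [secrets.randbits(8) for _ in range(n - 1)]
    shares.append(reduce(xor, shares, x))   # correction share
    return shares

def recombine(shares: list[int]) -> int:
    return reduce(xor, shares)

x = 0x2A
assert recombine(boolean_mask(x, 4)) == x   # n = 4 shares, i.e. n > d for d = 3
```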
Giraud [AG01]. However, this first proposal was insecure due to an inherent flaw
of multiplicative masking that was uncovered by Golić and Tymen [GT02]. It is
known as the zero-value problem, which refers to the fact that, no matter how
many shares the representation uses, multiplicative masking cannot securely
encode the value 0. For (x_0, x_1, …, x_{n−1}) to encode zero, we need that x_0 ×
x_1 × ⋯ × x_{n−1} = 0 and thus that at least one share x_i is equal to zero itself.
Hence, a single probe on that share x_i would reveal the secret. Alternatively, the
mean of a single share (and by extension its power consumption) also depends
on the secret: ∀i : E[x_i | x = 0] ≠ E[x_i | x ≠ 0]. Recent works on multiplicative
masking avoid the zero-value problem and have extended and optimized the
original methodology [GPQ10, GPQ11, 7].
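The zero-value bias is easy to reproduce in simulation. The sketch below uses two multiplicative shares over the AES field GF(2^8) (a field choice assumed here for illustration): the average Hamming weight of a share is exactly zero when x = 0, but around four otherwise, so even the first statistical moment leaks whether x = 0.

```python
import random

def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def gf_inv(a: int) -> int:
    """a^254 = a^(-1) for nonzero a, by square-and-multiply."""
    r, e = 1, 254
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def mult_share(x: int) -> tuple[int, int]:
    x1 = random.randrange(1, 256)        # random nonzero mask
    return gf_mul(x, gf_inv(x1)), x1     # x = x0 * x1; x == 0 forces x0 == 0

random.seed(1)
hw = lambda v: bin(v).count("1")
mean_zero = sum(hw(mult_share(0x00)[0]) for _ in range(2000)) / 2000
mean_rand = sum(hw(mult_share(0x53)[0]) for _ in range(2000)) / 2000
assert mean_zero == 0.0 and mean_rand > 3.5   # first-moment leakage of x == 0
```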
Other Masking Representations. There are other types of masking for which,
like Boolean masking, linear operations are trivial, but their representations
cannot be defined with a single operation. They split the secret x into
shares x = (x_0, …, x_{n−1}) such that reconstruction requires a more generic
function f : F^n → F with x = f(x) = f(x_0, x_1, …, x_{n−1}). One example is that
of polynomial masking, based on the secret sharing scheme by Shamir [Sha79],
which is used in multi-party computation (MPC). The first masking schemes to
use this type of masking were those of Prouff and Roche [PR11] and Goubin
and Martinelli [GM11]. One masks a secret variable x by first constructing a
dth-degree polynomial p_x(y), for which x is the constant coefficient:

p_x(y) = x ⊕ ⨁_{i=1}^{d} a_i y^i = x ⊕ (a_1 y) ⊕ (a_2 y^2) ⊕ … ⊕ (a_d y^d)    (2.3)
The coefficients a_i are drawn randomly and kept secret. The shares of x are
calculated as points on the polynomial, evaluated in n different nonzero elements
α_i: x_i = p_x(α_i). The elements α_i are public. The secret x can be reconstructed
from the shares using the reconstruction function f(x):

f(x) = ⨁_{i=0}^{n−1} x_i L_i = (x_0 L_0) ⊕ (x_1 L_1) ⊕ … ⊕ (x_{n−1} L_{n−1})    (2.4)
where the coefficients Li can be derived from the public elements αi , using
Lagrange interpolation.
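A small sketch of this polynomial masking, with GF(2^8) arithmetic over the AES polynomial assumed purely for illustration: shares are evaluations of a random degree-d polynomial with constant term x, and reconstruction XORs the shares weighted by the Lagrange coefficients L_i evaluated at zero (subtraction in characteristic two is XOR).

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def gf_inv(a):
    r, e = 1, 254                       # a^254 = a^(-1)
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def poly_eval(coeffs, y):
    """Evaluate p(y) = coeffs[0] + coeffs[1] y + ... in GF(2^8)."""
    acc, power = 0, 1
    for c in coeffs:
        acc ^= gf_mul(c, power)
        power = gf_mul(power, y)
    return acc

def shamir_share(x, alphas, d):
    coeffs = [x] + [secrets.randbits(8) for _ in range(d)]   # a_i random, secret
    return [poly_eval(coeffs, a) for a in alphas]            # x_i = p_x(alpha_i)

def lagrange_at_zero(alphas):
    """L_i = prod_{j != i} alpha_j / (alpha_i + alpha_j)."""
    Ls = []
    for i, ai in enumerate(alphas):
        L = 1
        for j, aj in enumerate(alphas):
            if i != j:
                L = gf_mul(L, gf_mul(aj, gf_inv(ai ^ aj)))
        Ls.append(L)
    return Ls

def reconstruct(shares, alphas):
    out = 0
    for s, L in zip(shares, lagrange_at_zero(alphas)):
        out ^= gf_mul(s, L)
    return out

alphas = [0x01, 0x02, 0x03]             # public, distinct, nonzero points
x = 0xC3
assert reconstruct(shamir_share(x, alphas, d=1), alphas) == x
```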
A slightly different proposal, called inner product masking, came from
Dziembowski and Faust [DF12] and was first applied by Balasch et al. [BFGV12].
A secret x is represented by shares (x_0, …, x_{n−1}), with (x_i)_{i=1}^{n−1} random masks
and x_0 chosen such that x = ⟨x, L⟩ for a public vector L with L_0 = 1.
Trichina proposes to introduce a new random mask r and construct the shares
of z as follows:
z_0 = r    (2.8)
z_1 = ((((r ⊕ x_0 y_0) ⊕ x_0 y_1) ⊕ x_1 y_0) ⊕ x_1 y_1)    (2.9)
Note that it is important to perform the additions in eq. (2.9) from left to
right, indicated by the parentheses and in Figure 2.1. For example, if the
crossterms x0 y0 and x0 y1 are combined without the random mask r, the resulting
intermediate x0 (y0 ⊕ y1 ) = x0 y depends on the unmasked secret y.
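In software the bracketing can be mimicked directly, and the gate's correctness checked exhaustively. Note this only illustrates the algebra: a compiler or interpreter gives no guarantee about evaluation order at the hardware level, so this is not a secure implementation.

```python
def trichina_and(x0, x1, y0, y1, r):
    """Trichina's masked AND: z0 ^ z1 = (x0 ^ x1) & (y0 ^ y1).
    The left-to-right bracketing keeps every intermediate masked by r."""
    z0 = r
    z1 = ((((r ^ (x0 & y0)) ^ (x0 & y1)) ^ (x1 & y0)) ^ (x1 & y1))
    return z0, z1

for bits in range(32):                  # all combinations of x0, x1, y0, y1, r
    x0, x1, y0, y1, r = [(bits >> i) & 1 for i in range(5)]
    z0, z1 = trichina_and(x0, x1, y0, y1, r)
    assert z0 ^ z1 == (x0 ^ x1) & (y0 ^ y1)
```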
[Figure 2.1: The Trichina gate, computing the output shares (z_0, z_1) from the input shares (x_0, x_1), (y_0, y_1) and the random mask r.]
z_ij = r_ij
z_ji = (r_ij ⊕ x_i y_j) ⊕ x_j y_i    (2.11)

with r_ij fresh random masks and z_ii = x_i y_i. This multiplication is dubbed the
ISW multiplication and remains today an essential building block for masked
implementations. It requires n(n − 1)/2 fresh random masks rij ∈ F. We note
that ISW initially deemed this multiplication with n = d + 1 shares to be secure
against attacks of order d/2 and lower, but Rivain and Prouff [RP10] proved
that it actually provides dth -order security with d + 1 shares.
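A software sketch of the ISW multiplication on bit sharings, transcribing eq. (2.11) and the row-wise compression (again an illustration of the algebra only, without the ordering guarantees a secure implementation needs):

```python
import secrets
from functools import reduce
from operator import xor

def isw_and(xs, ys):
    """ISW multiplication of two n-share bit sharings, following eq. (2.11):
    z_ii = x_i y_i, z_ij = r_ij and z_ji = (r_ij ^ x_i y_j) ^ x_j y_i, i < j."""
    n = len(xs)
    z = [[xs[i] & ys[i] if i == j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbits(1)          # one of the n(n-1)/2 fresh masks
            z[i][j] = r
            z[j][i] = (r ^ (xs[i] & ys[j])) ^ (xs[j] & ys[i])
    return [reduce(xor, row) for row in z]   # compress rows into n output shares

def share(x, n):
    s = [secrets.randbits(1) for _ in range(n - 1)]
    return s + [reduce(xor, s, x)]

for x in (0, 1):
    for y in (0, 1):
        assert reduce(xor, isw_and(share(x, 3), share(y, 3))) == x & y
```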
Figure 2.2: Example of a glitch in the signal xy because its inputs do not switch
exactly at the same time after the positive clock edge.
Their effect is that hardware circuits do not fit in the “ideal circuit” model
of ISW where the internal state of the device only depends on the exact
intermediates of a calculation and where the order of operations is respected.
Take for example the Trichina gate (see Figure 2.1). It was noted earlier that the
order of operations in the calculation of the output shares is vital for security.
Consider the case where the delay of the random mask r is larger than the
delay of the shares of x and y. The circuit would then temporarily compute the
value z1 = x0 y0 ⊕ x0 y1 ⊕ x1 y0 ⊕ x1 y1 = xy, which is the unmasked output and
hence sensitive. Similarly, the ISW multiplication is not suitable for hardware
implementations. Since the works of Mangard et al., software and hardware
masking have diverged into separate directions. Glitches are difficult to predict
and control. Protection against the vulnerabilities arising from them is typically
Registers. The main problem of glitches is that the signal on a wire transitions
more than once. Before calculating the intended function value, the wire carries
the value of different unintended intermediates. One way to stop glitches from
propagating is to synchronize wires with registers. Registers (or flip-flops at bit
level) are circuit elements of which the output signal transitions exactly once
per cycle. Moreover, these transitions are more or less synchronized as they
always happen on the positive edge of a periodic clock signal (such as x and y in
Figure 2.2). We will note the stabilization of a variable x with square brackets:
[x]. For example, a naive (but expensive) method to secure the Trichina gate
for hardware implementations is to fix the order of calculations by storing each
intermediate value in a register:
z_0 = [r]
z_1 = [[[[r ⊕ x_0 y_0] ⊕ x_0 y_1] ⊕ x_1 y_0] ⊕ x_1 y_1]    (2.12)
Correctness: The masked function f is correct for f if, for any x ∈ F^n such
that ⨁_{i=0}^{n−1} x_i = x ∈ F, the function f outputs a sharing y ∈ G^m such
that y_i = f_i(x) and ⨁_{i=0}^{m−1} y_i = y = f(x); that is,
⨁_{i=0}^{m−1} f_i(x) = f(⨁_{i=0}^{n−1} x_i).
z_0 = x_0 y_0 ⊕ x_0 y_1 ⊕ x_1 y_0 ⊕ r_0
z_1 = x_1 y_1 ⊕ x_1 y_2 ⊕ x_2 y_1 ⊕ r_1    (2.13)
z_2 = x_2 y_2 ⊕ x_2 y_0 ⊕ x_0 y_2 ⊕ r_0 ⊕ r_1
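A direct transcription of eq. (2.13) makes both properties visible: each output share omits one input share index (non-completeness), and the three output shares recombine to xy.

```python
import secrets
from functools import reduce
from operator import xor

def shared_mult(x, y, r0, r1):
    """Three-share multiplication of eq. (2.13); z0 avoids share index 2,
    z1 avoids index 0 and z2 avoids index 1 (first-order non-completeness)."""
    z0 = (x[0] & y[0]) ^ (x[0] & y[1]) ^ (x[1] & y[0]) ^ r0
    z1 = (x[1] & y[1]) ^ (x[1] & y[2]) ^ (x[2] & y[1]) ^ r1
    z2 = (x[2] & y[2]) ^ (x[2] & y[0]) ^ (x[0] & y[2]) ^ r0 ^ r1
    return z0, z1, z2

def share3(v):
    s = [secrets.randbits(1), secrets.randbits(1)]
    return s + [v ^ s[0] ^ s[1]]

for a in (0, 1):
    for b in (0, 1):
        r0, r1 = secrets.randbits(1), secrets.randbits(1)
        assert reduce(xor, shared_mult(share3(a), share3(b), r0, r1)) == a & b
```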
Also, it was determined that at least td + 1 input shares are required to obtain
dth -order protection against SCA for a function of algebraic degree t. However,
it was noted by Reparaz [Rep15] that the higher-order TI constructions of
Bilgin et al. [BGN+ 14a] do not provide the claimed security. The vulnerability
lies in the higher-order adversary’s ability to observe d > 1 intermediates or
the dth -order statistical moment of the power consumption, which includes
combinations of different variables at different time instants. While the proof
of security based on uniformity and higher-order non-completeness remains
valid for a univariate higher-order side-channel attack, it does not extend to the
multivariate adversary. For example, the variance of the power consumption at
a single point of these threshold implementations is independent of the secret,
but the covariance of the power consumption combining two different time
instants is not. The solution to this problem is to ensure the independence of
different time instants by refreshing the shares of intermediate variables after
every computation with new random masks. This way, higher-order threshold
implementations lose the low-randomness advantage of first-order threshold
implementations.
\begin{pmatrix}
z_{00} & z_{01} & \cdots & z_{0,n-1} \\
z_{10} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
z_{n-1,0} & \cdots & \cdots & z_{n-1,n-1}
\end{pmatrix}
=
\begin{pmatrix}
x_0 y_0 & x_0 y_1 & \cdots & x_0 y_{n-1} \\
x_1 y_0 & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
x_{n-1} y_0 & \cdots & \cdots & x_{n-1} y_{n-1}
\end{pmatrix}
\oplus
\begin{pmatrix}
0 & r_{01} & \cdots & r_{0,n-1} \\
r_{01} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
r_{0,n-1} & \cdots & \cdots & 0
\end{pmatrix}
\qquad (2.14)
The remasking matrix in Eq. (2.14) was proposed by Gross et al. [GMK16] and
is similar to that of Ishai et al. [ISW03]. The idea is that only cross products of
different shares (xi yj for i 6= j) need remasking and that the same mask can be
used for cross products xi yj and xj yi (i.e. the remasking matrix is symmetric).
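As a software-level sketch over bits (glitches have no software analogue, so the registers that would sit between remasking and compression are only noted in a comment), the remasking of eq. (2.14) followed by compression looks like:

```python
import secrets
from functools import reduce
from operator import xor

def masked_mult(xs, ys):
    """Eq. (2.14): z_ij = x_i y_j ^ r_ij with a symmetric remasking matrix
    (zero diagonal, r_ij = r_ji), then row-wise compression into n shares.
    In hardware each z_ij would be registered before the compression XORs."""
    n = len(xs)
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = secrets.randbits(1)
    z = [[(xs[i] & ys[j]) ^ r[i][j] for j in range(n)] for i in range(n)]
    return [reduce(xor, row) for row in z]

def share(v, n):
    s = [secrets.randbits(1) for _ in range(n - 1)]
    return s + [reduce(xor, s, v)]

for a in (0, 1):
    for b in (0, 1):
        assert reduce(xor, masked_mult(share(a, 3), share(b, 3))) == a & b
```

The symmetric masks cancel pairwise on recombination, which is why a triangular set of n(n − 1)/2 fresh masks suffices.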
Next, as in the ISW multiplication, the n^2 shares z_ij can be compressed back
into n shares z_i. Before this, synchronization is required to stop glitches from
Reparaz et al. [RBN+ 15] further noted that with this construction, it is possible
to achieve dth -order non-completeness with only d + 1 input shares, regardless
of the algebraic degree of a function. An important caveat is that this is only
valid if the input sharings of x and y are independent. Consider for example
the cross products in the first-order case when y = x^2 with y = (x_0^2, x_1^2):

x_0 y_1 = x_0 x_1^2    (2.16)
\begin{pmatrix}
z_{00} & z_{01} & \cdots & z_{0,n-1} \\
z_{10} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
z_{n-1,0} & \cdots & \cdots & z_{n-1,n-1}
\end{pmatrix}
=
\begin{pmatrix}
x_0 y_0 L_0 & x_0 y_1 L_1 & \cdots & x_0 y_{n-1} L_{n-1} \\
x_1 y_0 L_0 & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
x_{n-1} y_0 L_0 & \cdots & \cdots & x_{n-1} y_{n-1} L_{n-1}
\end{pmatrix}
\oplus
\begin{pmatrix}
0 & r_{01} L_0^{-1} & \cdots & r_{0,n-1} L_0^{-1} \\
r_{01} L_1^{-1} & \ddots & & \vdots \\
\vdots & & \ddots & \vdots \\
r_{0,n-1} L_{n-1}^{-1} & \cdots & \cdots & 0
\end{pmatrix}
\qquad (2.17)
Note that the random matrix in Eq. (2.17) is not symmetric, although it still
holds that rij = rji . The correctness of the above is easily verified:
\begin{aligned}
z = \langle \mathbf{z}, \mathbf{L} \rangle
&= \bigoplus_{i=0}^{n-1} z_i L_i
 = \bigoplus_{i=0}^{n-1} \bigoplus_{j=0}^{n-1} z_{ij} L_i
 = \bigoplus_{i=0}^{n-1} \bigoplus_{j=0}^{n-1} \left( x_i y_j L_j \oplus r_{ij} L_i^{-1} \right) L_i \\
&= \bigoplus_{i=0}^{n-1} \bigoplus_{j=0}^{n-1} x_i L_i \, y_j L_j \;\oplus\; \bigoplus_{i=0}^{n-1} \bigoplus_{j=0}^{n-1} r_{ij} \\
&= \bigoplus_{i=0}^{n-1} x_i L_i \;\times\; \bigoplus_{j=0}^{n-1} y_j L_j
 = \langle \mathbf{x}, \mathbf{L} \rangle \langle \mathbf{y}, \mathbf{L} \rangle = xy
\end{aligned}
\qquad (2.18)
De Cnudde [DeC18] noted that the ISW multiplication can also be used to
optimize the glitch-resistant polynomial masking multiplication of Prouff and
Roche [PR11]. The resulting multiplication is essentially identical to that of
Eq. (2.17). As a result, Boolean masking, inner product masking and polynomial
masking are all equivalent in their methods for masked addition and masked
multiplication. The only difference in multiplication stems from vector L, which
is determined by the masking scheme. Boolean masking schemes use Li = 1, ∀i,
inner product masking uses a vector L with L0 = 1 and in the case of polynomial
masking, the vector L consists of Lagrange interpolation coefficients.
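This unified view can be checked numerically: the multiplication of eq. (2.17) works for any vector L of nonzero elements, with Boolean masking as the special case L_i = 1. The GF(2^8) arithmetic below (AES polynomial) is an illustrative choice.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def gf_inv(a):
    r, e = 1, 254                       # a^254 = a^(-1)
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def share(x, L):
    """Shares x such that x = <x, L>, with L[0] = 1 and random other shares."""
    xs = [0] + [secrets.randbits(8) for _ in L[1:]]
    xs[0] = x ^ reduce(xor, (gf_mul(xi, Li) for xi, Li in zip(xs[1:], L[1:])), 0)
    return xs

def mult(xs, ys, L):
    """Eq. (2.17): z_ij = x_i y_j L_j ^ r_ij L_i^(-1), r_ij = r_ji, zero diagonal."""
    n = len(L)
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = secrets.randbits(8)
    z = [[gf_mul(gf_mul(xs[i], ys[j]), L[j]) ^ gf_mul(r[i][j], gf_inv(L[i]))
          for j in range(n)] for i in range(n)]
    return [reduce(xor, row) for row in z]

def recombine(zs, L):
    return reduce(xor, (gf_mul(zi, Li) for zi, Li in zip(zs, L)))

L = [0x01, 0x03, 0x07]        # inner-product style; L = [1, 1, 1] is Boolean
x, y = 0x53, 0xCA
assert recombine(mult(share(x, L), share(y, L), L), L) == gf_mul(x, y)
```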
Relatively few works also consider the question of how to share more generic
Boolean functions of algebraic degree t [BNN+ 12]. A tth-degree monomial can
be shared following the same methodology as for the multiplication: the d + 1
input shares of the t input variables are expanded into (d + 1)^t intermediate
shares, which are remasked, synchronized and then compressed back into d + 1
shares. Consider for example the first-order sharing (d = 1) of the cubic
monomial z = wxy. In the first step, one computes (d + 1)^3 = 8 intermediate
shares z_ijk from the partial products w_i x_j y_k and a random mask r_ijk.
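A sketch of this expansion for z = wxy follows. The remasking pattern used here, eight masks chosen to XOR to zero, is one simple illustrative choice, not necessarily the scheme of the cited work:

```python
import secrets
from functools import reduce
from operator import xor
from itertools import product

def share2(v):
    r = secrets.randbits(1)
    return [r, v ^ r]

def cubic(w, x, y):
    """First-order sharing of z = wxy: (d+1)^3 = 8 partial products w_i x_j y_k,
    each remasked, then compressed back into d+1 = 2 output shares."""
    idx = list(product((0, 1), repeat=3))
    r = {t: secrets.randbits(1) for t in idx[:-1]}
    r[idx[-1]] = reduce(xor, r.values())     # all masks cancel on recombination
    z = {t: (w[t[0]] & x[t[1]] & y[t[2]]) ^ r[t] for t in idx}
    # compression (registers would separate remasking and compression in hardware)
    return [reduce(xor, (z[(i, j, k)] for j, k in product((0, 1), repeat=2)))
            for i in (0, 1)]

for a, b, c in product((0, 1), repeat=3):
    zs = cubic(share2(a), share2(b), share2(c))
    assert zs[0] ^ zs[1] == a & b & c
```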
Now that we know how to mask the most basic nonlinear building block, we can
use it to mask cryptographic primitives, such as the AES. The main challenge
in masking the AES is the implementation of its only nonlinear component: the
S-box. Whether using Boolean masking, inner product masking or polynomial
masking, the linear/affine operations AddRoundKey, ShiftRows and MixColumns
can simply be applied independently to each of the d + 1 shares of the AES state.
Hence, descriptions of masked AES implementations are typically descriptions
of the masked S-box.
The literature is divided into implementations for software on the one hand and
those for hardware on the other. Apart from the fact that one needs to deal with
glitches on hardware platforms, there is also a distinction between architectures
suitable for one or the other and the resources available to the designer. We
note that it is common to describe only the smallest version of AES: AES-128,
which has a 128-bit key. The difference with AES-192 and AES-256 is mostly
in the key schedule and the number of rounds. In the following, we speak only
of AES-128, unless otherwise mentioned.
2.3.1 In Software
The addition chain used by Rivain and Prouff [RP10] computes successively the
exponentiations x, x^2, x^3, x^12, x^15, x^240, x^252, x^254 (see Figure 2.3).
Figure 2.3: Addition chain used by Rivain and Prouff [RP10] to compute x254 .
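The chain of Figure 2.3 is easy to check directly in F_2^8. A minimal sketch, assuming the AES field polynomial 0x11B (function names are illustrative); it uses four multiplications plus squarings, which are linear over F_2 and therefore cheap to mask:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) with the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_square(a):
    return gf_mul(a, a)

def x254(x):
    """Rivain-Prouff addition chain for x^254 = x^-1 in GF(2^8)."""
    x2 = gf_square(x)                  # x^2
    x3 = gf_mul(x2, x)                 # x^3   (multiplication 1)
    x12 = gf_square(gf_square(x3))     # x^12
    x15 = gf_mul(x12, x3)              # x^15  (multiplication 2)
    x240 = x15
    for _ in range(4):
        x240 = gf_square(x240)         # x^240
    x252 = gf_mul(x240, x12)           # x^252 (multiplication 3)
    return gf_mul(x252, x2)            # x^254 (multiplication 4)
```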
The resulting chain uses the mapping x → x^5 as an atomic operation and computes
the exponentiations x, x^2, x^5, x^25, x^125, x^127, x^254 (see Figure 2.4).
It was noted by Duc et al. [DDF14] that a provably secure method of refreshing
x is to use an ISW multiplication with a sharing (1, 0, . . . , 0). This is equivalent
to adding d(d + 1)/2 fresh random masks r_ij to the shares x_i and x_j for 0 ≤ i < j ≤ d.
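As a sketch in F_2^8 (XOR as addition), this refreshing adds one fresh mask to each pair of shares; each mask is added twice and therefore cancels in the XOR of all shares:

```python
import random

def refresh(shares):
    """ISW-style refresh: add a fresh mask r_ij to each pair (x_i, x_j)."""
    xs = list(shares)
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            r = random.randrange(256)   # fresh mask r_ij
            xs[i] ^= r                  # cancels in the XOR of all shares
            xs[j] ^= r
    return xs
```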
As a result, the proposal by Grosso et al. [GPS14] is more efficient in computation
Figure 2.4: Addition chain used by Grosso et al. [GPS14] to compute x254 .
time but the original proposal by Rivain and Prouff [RP10] is less costly in
terms of randomness (see Table 2.2).
Belaid et al. [BBP+ 16, BGR18] optimized the randomness complexity by noting
that the requirements of some multiplications in the addition chain can be
relaxed on the one hand and by reducing the number of refreshing gadgets on
the other. The resulting S-boxes are among the first to come with a global
security proof rather than relying on the security of the separate building blocks
(i.e. multiplications, refreshings, . . . ). This method was first introduced by
Barthe et al. [BBD+ 16] and will be further discussed in Chapter 3.
Bitsliced AES implementations are often based on Boyar and Peralta's [BP12]
bit-level descriptions of the AES S-box, which have been optimized for circuit
depth.
Figure 2.5: Structure of the AES inversion with Boolean and multiplicative
masking.
An alternative line of works by Balasch et al. [BFGV12, BFG15, BFG+ 17] uses
inner product masking to implement AES. The methodology in their most
recent work [BFG+ 17] strongly resembles that of ISW-based schemes. They
use the same addition chain as Rivain and Prouff [RP10], with the optimization
by Coron et al. [CPRR13]. Their implementations also come with a global
32 MASKING AGAINST SIDE-CHANNEL ATTACKS
security proof using the method of Barthe et al. [BBD+ 16]. Although these
implementations remain more expensive than those with Boolean masking in
terms of memory and speed, they benefit from security order amplification,
which means an increased resistance against SCA in practice. Notably, the
work of Balasch et al. [BFG+ 17] is among the few in software masking that
demonstrates its practical security with empirical measurements.
2.3.2 In Hardware
Boolean Masking and The Tower Field Construction. Almost all Boolean
masked hardware implementations use a construction for the power map x^254
which is commonly known as the tower-field implementation. The construction
was suggested by Rijmen [Rij00] and first implemented by Satoh et al. [SMTM01].
It exploits the fact that the inversion in F_2^8 can be written as several operations
in the subfield F_2^4. Likewise, the inversion in F_2^4 can be written in terms of
operations in F_2^2 (see Figure 2.6).
The approach was optimized by Canright [Can05] and remains to date among
the smallest constructions of the unmasked AES S-box. For this reason, it has
also been used in a plethora of masked AES implementations. The results are
summarized in Table 2.4.
The first application of TI to this design is due to Moradi et al. [MPL+ 11].
It was improved on by Bilgin et al. [BGN+ 14b, BGN+ 15], though the serial
AES architecture remained the same. De Cnudde et al. [DBR+ 15] presented
the first higher-order secure AES S-box implementation using td + 1 shares.
With the advent of (d + 1)-masking, they significantly improved both first- and
second-order implementations in terms of area, but at the cost of more fresh
randomness [DRB+ 16]. Gross et al. [GMK17] improved the randomness cost and
also introduced a more efficient serial architecture for AES. Ueno et al. [UHA17]
once again reduced the area footprint, but with a significantly increased
randomness cost. Finally, Sugawara [Sug19] was the first to create a TI of the
tower-field S-box that requires no fresh randomness. These implementations all
use a byte-serial architecture such as that of Moradi et al. [MPL+ 11]. As a result,
their latencies are very similar. Note again that different works sometimes use
different logic libraries and thus that their area results are difficult to compare.
The addition chains used in software masking are not popular in hardware
masking. This might be explained by the need for refreshing in the chain of
Rivain and Prouff [RP10], which incurs a high latency cost, since the refreshed
shares must be synchronized in registers, before they can be used in the next
multiplication. The solution by Coron et al. [CPRR13] has not been explored
in hardware masking either, presumably because of its latency cost as well.
Many more AES implementations can be found in the literature, other than
those discussed in § 2.3.1 and § 2.3.2. Especially for first-order security, there is
a trend to explore the absolute limits of the area-randomness-latency trade-off.
We illustrate the trade-off in Figure 2.7 using the costs of the implementations
in the literature.
AES, which results in an encryption latency of only 20 clock cycles. Their area
cost is just below 60kGE. This result dates back to 2014, which shows that
low-latency AES implementations have not received a lot of attention in the
recent literature.
So far in this chapter, we have several times assumed the availability of some
fresh random masks for each masked multiplication or S-box. As noted in the
previous subsection, this requires that some random number generator (RNG)
operates in parallel to the masked encryption, with sufficiently high throughput
to supply the required number of random bits. Since true random number
generators (TRNG) achieve only moderate throughput [YRG+ 18], we typically
instantiate a PRNG to provide a continuous stream of random bits to masked
implementations. For implementations without online randomness costs, a
TRNG could suffice. We generally care about three properties of the PRNG: (1)
its cost, (2) the quality of its randomness and (3) its side-channel resistance. To
date, many questions and considerable uncertainty surround each of these requirements.
Quality. Proofs of security for masking schemes assume that the fresh random
masks are uniformly random in F and mutually independent. PRNGs are
deterministic functions, which compute a stream of pseudo-random numbers
from a single seed. In reality, these random values hence do not have full
entropy. A good PRNG produces a stream of numbers with quality as close
as possible to true randomness. Cryptographically secure PRNGs are typically
based on cryptographic primitives such as stream ciphers or block ciphers in
a mode of operation (e.g. CTR mode) [BK15]. These PRNGs come with
particularly strong properties as they are used for cryptographic purposes such
as key generation. In the context of randomness for masked implementations,
it is not entirely clear whether this high quality is required. Popular choices of
PRNG in experiments with masked implementations are linear feedback shift
registers (LFSRs) [8] or unrolled implementations of block ciphers [DRB+ 16].
The two are wildly different in cost, cryptographic strength and quality of
randomness. The effect of their quality on the practical security of masked
implementations has not been investigated with experiments. The results of
such experiments would highly depend on the used platform and measurement
setup. Hence the question of whether an AES-based PRNG is “overkill” and
whether LFSRs are too weak remains unanswered.
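On the low-cost end of that spectrum, an LFSR is only a few gates. The following minimal sketch uses a 16-bit Galois LFSR with the textbook maximal-length tap mask 0xB400 (period 2^16 − 1); a deployed design would use a much larger state, and, as discussed above, the cryptographic strength of such a generator is an open concern:

```python
def lfsr_step(state):
    """One step of a 16-bit maximal-length Galois LFSR (tap mask 0xB400)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state

def lfsr_bits(seed, nbits):
    """Generate nbits pseudo-random bits from a nonzero 16-bit seed."""
    out, s = [], seed
    for _ in range(nbits):
        s = lfsr_step(s)
        out.append(s & 1)
    return out
```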
Context. Section 2.3.2 showed that hardware masked AES designs have
often relied on Boolean masking and used the tower-field construction of
Canright [Can05] to construct the masked S-box [BGN+ 14b, DRB+ 16, GMK17]
(see Table 2.4). On the other hand, Akkar and Giraud [AG01] noted that
splitting sensitive variables in a multiplicative way is more amenable to the
computation of the AES S-box, since the Galois field inversion then becomes
a local operation. Before our work, sound higher-order multiplicative masking
schemes had been implemented only in software by Genelle et al. [GPQ11].
Context. The effort in reducing the area of AES implementations has largely
been focused on ASICs, in which a tower-field construction is a popular method
to achieve small designs of the AES S-box [SMTM01, Can05]. In contrast, a
naive LUT-based implementation of the AES S-box has been the status quo on
FPGAs. A similar discrepancy holds for masking schemes, which are commonly
optimized to achieve minimal area in ASICs [BGN+ 14b, DRB+ 16, GMK17].
Extension. An extended version of this work has been published in the Journal
of Cryptology [2]. In this work, we formalize our masking methodology with
new theoretical concepts and proofs and also optimize our heuristic algorithms
for masking Boolean functions. We referred to this methodology in Section 2.2.3.
We prove that a first-order sharing with the minimal number of intermediate shares
(d + 1)^t exists for any degree-t Boolean function with t + 1 variables. We also
detail a generic methodology for masking Boolean functions of any degree t
with any number of variables (> t + 1). The improvements of our method over
the work at CHES 2018 allow us to optimize also our AES implementation and
reduce the number of FPGA LUTs by 21%, the number of FPGA flip-flops by
25% and the number of slices by 33%.
However, the distinction between the randomness cost to produce the initial masking
and the randomness to maintain security during computation (online) is not
meaningful. Sugawara [Sug19] succeeded in implementing AES without an
online randomness cost, but still requires 776 initial random bits for each
encryption. Faust et al. [FPS17] were the first to prove that masking any cipher
with only 2 random bits in total is possible.
Context. Constructions for PRNGs for masking are not often made explicit,
since a lot of uncertainty surrounds their requirements. A recurring question
is whether it makes sense to use an unmasked PRNG to protect a masked
design. One of the NIST documents prescribes how to construct a PRNG from
a block cipher [BK15]. A popular method is to use AES in CTR mode. It
was already shown by Jaffe [Jaf07] that AES-CTR can be attacked with side-channel
analysis, even without knowledge of the nonce. His attack requires 2^16
power measurements. The NIST CTR_DRBG specification [BK15] prescribes
a maximum size on each random number request, limiting the number of
encryptions in CTR mode with the same key to 4 096. As a result, it is not
vulnerable to Jaffe’s attack.
Contribution. In this work from CHES 2020 [3], included on page 185, we
adapt this attack so that it requires only 256 traces, which is well within
the NIST limits. With this work, we essentially demonstrate that the NIST
recommendation for the CTR_DRBG allows too large requests. However, we
provide several recommendations for the implementation of a CTR_DRBG
such that the attack can be avoided without requiring a countermeasure as
expensive as masking. We use this opportunity to start a discussion on how to
protect PRNGs for masked implementations against SCA.
2.5 Conclusion
One Multiplication to Rule Them All. Another consolidation has taken place
at the level of masking representations. Boolean masking, inner product masking
and polynomial masking always had in common that linear operations are local.
Initially, their multiplication procedures were quite different, but today, thanks
to the works of Balasch et al. and De Cnudde, it is clear that they are all
essentially the same. Boolean masking enjoys a large popularity because it
incurs a smaller overhead compared to the other two. However, inner product
and polynomial masking have decreased leakage from an information-theoretic
point-of-view as an important advantage. Especially now that it is clear that
the ILA does not hold in CPU datapaths, more attention should be devoted to
these types of masking.
Where Are My Random Bits? Finally, the topic of random number generation
for masked implementations requires a thorough examination. We need to
determine what properties must be fulfilled by the generated randomness and
devise efficient PRNG constructions. These constructions must be such that their
randomness cannot be recovered from side-channel information, but without
requiring countermeasures such as masking. Only then will we know how to
express the cost of a fresh random bit or byte and will comparisons between
the many masked implementations of AES make sense. However, in the search
for the most “efficient” PRNG, we should keep in mind that area is not as
constrained as it used to be and that the smaller the PRNG overhead, the
smaller its contribution to the noise of the side-channel measurements.
Chapter 3
Side-Channel Analysis
The mixed moment is m-variate if m coefficients di are nonzero. For the first-
and second-order central moments, we also use the following notations:
Side-channel attacks first became known in 1996 with the seminal work
of Kocher [Koc96]. There are many different types of side-channels,
such as timing [Koc96], power consumption [KJJ99], electromagnetic (EM)
radiation [QS01] and even sound [GST14]. In this work, we only consider power
analysis, but EM analysis follows the same principles, since both side-channels leak
similar information.
Figure 3.1: Power measurements of AES. Left: one encryption. Right: One
round.
Strategy
We will explain the attack procedure for a single byte of the AES key below. A
full DPA attack simply repeats this process 16 times to recover the entire key.
The AES S-box, on the other hand, has been specifically designed to ensure minimal
dependency between the output difference S(x) ⊕ S(y) and the input difference x ⊕ y.
Hypotheses. Our unknown constant k has 8 bits. Hence, there are 2^8 = 256
possible candidates k_j. For each candidate k_j, we hypothesize that k = k_j and
compute the corresponding intermediate values at the point of interest f(k, m)
for each encryption n. We thus construct a set of hypotheses H = {h_{n,j}} with
h_{n,j} = f(k_j, m_n) = S(k_j ⊕ m_n).
Leakage Models. Next, we try to predict how our hypotheses will appear in the
power traces, by simulating their power consumption using some leakage model
L(x). The power consumption of an intermediate x is assumed to be proportional
to L(x). There are many known leakage models to choose from. For example,
the Hamming weight model assumes that the leaked power consumption of an
intermediate depends on its Hamming weight: L(x) = HW (x). In the Hamming
distance model, the power consumption is proportional to the Hamming weight
of the difference between two intermediates: L(x, y) = HD(x, y). The more closely
the leakage model resembles the true power consumption behaviour, the more
effective the DPA attack is. There are also binary models, i.e. with L(x) ∈ {0, 1}.
In the original work on DPA, Kocher et al. [KJJ99] proposed a bit model, where
L(x) corresponds to a single bit of x. Popular choices are the least significant
bit (LSB) or most significant bit (MSB), but also others are possible. Finally,
in the zero value model, L(x) = 1 if x = 0, else L(x) = 0.
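These models are one-liners; a sketch for 8-bit intermediates (the MSB position assumes 8-bit values):

```python
def hamming_weight(x):
    """HW model: number of set bits in the intermediate."""
    return bin(x).count("1")

def hamming_distance(x, y):
    """HD model: Hamming weight of the difference of two intermediates."""
    return hamming_weight(x ^ y)

def lsb(x):
    """Bit model: least significant bit."""
    return x & 1

def msb(x):
    """Bit model: most significant bit of an 8-bit intermediate."""
    return (x >> 7) & 1

def zero_value(x):
    """Zero-value model: 1 if and only if the intermediate is zero."""
    return 1 if x == 0 else 0
```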
Distinguisher. Next, we measure the difference between the two sets S0 and
S1 . For the wrong key guess, the partition is approximately random and
uncorrelated with the power consumption and hence, the difference should be
minimal. For the correct key guess on the other hand, the partition creates two
sets such that at some time samples q*, the power consumption t_{n,q*} depends
on a bit that equals 0 if t_n ∈ S_0 and 1 if t_n ∈ S_1. Kocher et al. [KJJ99] compute the
difference of means of the two sets:

∆_{j,q} = (1/|S_1|) Σ_{t_n ∈ S_1} t_{n,q} − (1/|S_0|) Σ_{t_n ∈ S_0} t_{n,q}
For a wrong key guess k_j, we expect lim_{N→∞} ∆_{j,q} = 0 for all trace samples q.
For the correct key guess k_{j*}, the time samples q* where the power consumption
depends on our hypothesized intermediate are correlated with our partition
function and hence ∆_{j*,q*} differs significantly from zero (see Figure 3.2).
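The whole single-byte attack fits in a few lines. The sketch below simulates noisy Hamming-weight leakage of S(k ⊕ m) at a single point of interest, using the field inversion x → x^254 as a stand-in for the full AES S-box (which adds an affine layer), and recovers the key byte with the MSB difference-of-means distinguisher. The key, trace count and noise level are illustrative assumptions.

```python
import random

def gf_mul(a, b):
    """Multiplication in GF(2^8) with the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

# Inversion x -> x^254: a nonlinear stand-in for the AES S-box.
S = [gf_pow(v, 254) for v in range(256)]

def dpa_dom(traces, msgs):
    """Return the key guess with the largest MSB difference of means."""
    best_k, best_dom = 0, -1.0
    for k in range(256):
        s0, s1 = [], []
        for t, m in zip(traces, msgs):
            (s1 if S[k ^ m] >> 7 else s0).append(t)
        if not s0 or not s1:
            continue
        dom = abs(sum(s1) / len(s1) - sum(s0) / len(s0))
        if dom > best_dom:
            best_k, best_dom = k, dom
    return best_k

# Simulated campaign: Hamming-weight leakage at one point + Gaussian noise.
random.seed(2020)
key = 0x42
msgs = [random.randrange(256) for _ in range(3000)]
traces = [bin(S[key ^ m]).count("1") + random.gauss(0, 1) for m in msgs]
```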
Figure 3.2: DPA results of AES with MSB model. Left: Difference of means
for all keys at all time samples with 1 000 measurements. Right: Maximum
Difference of means for each key as a function of the number of measurements.
The correct key is indicated in black.
Variations on a Theme
The above description corresponds to the original DPA of Kocher et al. [KJJ99].
Many variants have been proposed later in the literature. They all have in
common that the traces are partitioned or categorized into subsets based on
hypothesized leakages. From that point, any statistical method or distinguisher
may be used to verify the meaningfulness of that partition. We discuss a few of
them below. While each type of attack can be designated with its own name,
they may all be considered DPA attacks.
Correlation Power Analysis (CPA). One of the most popular versions of DPA
was introduced by Brier et al. [BCO04], with the proposal to use Pearson’s
correlation as the statistical method. The Pearson correlation is a measure of the
linear relation between two variables. It thus assumes a linear relationship
between the hypothesized leakage L(x) and the power consumption, modelled as αL(x) + β.
Let T_q = {t_{n,q}}_{n=1}^N be the power consumption at sample q and
L_j = {L(h_{n,j})}_{n=1}^N the corresponding leakage hypotheses for key guess k_j.
A CPA attack computes for each key guess k_j and trace sample q the linear
correlation ρ_{j,q} between the two:

ρ_{j,q} = Cov(T_q, L_j) / (σ(T_q) σ(L_j))
        = [ (1/N) Σ_{n=1}^N (t_{n,q} − µ(T_q)) (L(h_{n,j}) − µ(L_j)) ] /
          sqrt( (1/N) Σ_{n=1}^N (t_{n,q} − µ(T_q))² · (1/N) Σ_{n=1}^N (L(h_{n,j}) − µ(L_j))² )   (3.6)
It is assumed to reach its maximum for the correct guess kj ∗ at the time samples
q ∗ where the power consumption depends on the hypothesized intermediates
(see Figure 3.3).
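The correlation computation itself is short. The sketch below pairs a plain Pearson estimator with an assumed setup: noisy Hamming-weight leakage of the field inversion x → x^254, a stand-in for the full AES S-box; key, trace count and noise level are illustrative.

```python
import random

def gf_mul(a, b):
    """Multiplication in GF(2^8) with the AES polynomial 0x11B."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

S = [gf_pow(v, 254) for v in range(256)]      # stand-in S-box (inversion)
HW = [bin(v).count("1") for v in range(256)]  # Hamming-weight leakage model

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def cpa(traces, msgs):
    """Key guess whose predicted leakages correlate best with the traces."""
    return max(range(256),
               key=lambda k: abs(pearson(traces, [HW[S[k ^ m]] for m in msgs])))

random.seed(3)
key = 0xA7
msgs = [random.randrange(256) for _ in range(2000)]
traces = [HW[S[key ^ m]] + random.gauss(0, 1) for m in msgs]
```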
Figure 3.3: CPA results of AES with HW model. Left: Pearson Correlation
for all keys at all time samples with 1 000 measurements. Right: Maximum
Pearson Correlation for each key as a function of the number of measurements.
The correct key is indicated in black.
Figure 3.4: DDLA results of AES with MSB model. Training accuracy for all
keys as a function of the epoch with 1 000 measurements. The correct key is
indicated in black.
DDLA is typically not more efficient than CPA, since it must perform the
training of a neural network 256 times and its performance depends significantly
on the parameters of the network. For example, it is possible that the network
overfits and reaches high accuracy for any key guess. However, convolutional
neural networks have been shown to perform well when the power traces are
misaligned or exhibit jitter. Also in this attack, any leakage model may be used,
but binary models are preferable, as they require a less complex neural network.
Masked implementations can still be vulnerable to SCA. Recall that a dth-order masked
implementation splits the secrets into at least d + 1 shares with the goal of
protecting against attacks that target d intermediate values or the dth -order
statistical moment of the power consumption. First-order masking hence
protects against attacks such as those described in Section 3.1.1, but an attack
that would use the combination of two intermediates (i.e. a second-order SCA)
would defeat the countermeasure. Generally, a dth-order masking scheme is
always vulnerable to (d + 1)th-order DPA.
Noise. Another problem for higher-order SCA in practice is the noise in the
power measurements. Chari et al. [CJRR99] assume that the power consumption
for an intermediate variable x is proportional to L(x) + e with e ∼ N (0, σ 2 ),
where σ 2 is the variance of the noise. In this model, they prove that the number
of power measurements required to distinguish a secret grows asymptotically as
σ^d, with d the masking order. Hence, with realistic noisy power measurements,
the complexity of a dth-order attack grows exponentially with d.
The higher-order masked S-box of Schramm and Paar [SP06] was shown to exhibit a
third-order flaw by Coron et al. [CPR07]. The refreshing methodology of Rivain and Prouff [RP10] was
corrected by Coron et al. [CPRR13]. Reparaz [Rep15] noted that the extension
of threshold implementations to higher-order security by Bilgin et al. [BGN+ 14a]
was insecure. More fundamentally, the fact that early masking schemes such
as those from Trichina [Tri03] and ISW [ISW03] were unsuitable for hardware,
was brought to light by Mangard et al. [MPO05, MPG05].
This history of trial and error has engendered a new branch of research on the
verification of masking schemes, with many proposals at different abstraction
levels and with varying scopes. One approach is to define an abstract model of
the adversary’s abilities in a so-called adversary model and verify a scheme’s
claims in that model. If a model is well-defined, a verification procedure can be
derived from it naturally. A trend in the recent literature is to automate the
verification using carefully crafted tools and eliminate in this way the risks of
human error. There is typically a trade-off between the efficiency and scope
of such tools. Some tools can create exact proofs for small masked gadgets,
whereas others aim to detect flaws in larger designs. However, a theoretical
model is seldom an exact representation of reality. Hence, it is also important
to verify the practical security of implementations on actual platforms. We
explain the different methodologies of verification in the following subsections.
will not be rejected with a larger sample size. In the side-channel community,
the two most commonly used hypothesis tests are Welch’s t-test [GJJR11] and
Pearson’s χ2 -test [MRSS18].
Large absolute values of this t-statistic indicate that the null hypothesis can
be rejected with a high degree of confidence. It is common to reject the null
hypothesis when the t-statistic exceeds some critical value. This critical value
depends on the required confidence level.
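Welch's t-statistic can be sketched directly from the sample means and variances; in a TVLA setting, its absolute value is compared against a critical value such as 4.5:

```python
from statistics import mean, variance

def welch_t(s0, s1):
    """Welch's t-statistic for two sets of power measurements."""
    return (mean(s0) - mean(s1)) / (
        variance(s0) / len(s0) + variance(s1) / len(s1)) ** 0.5
```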
χ² = Σ_{i=0}^{r−1} Σ_{j=0}^{c−1} (F_{i,j} − E_{i,j})² / E_{i,j}   (3.8)

ν = (r − 1) · (c − 1)   (3.9)

E_{i,j} = (1/N) · (Σ_{k=0}^{c−1} F_{i,k}) · (Σ_{k=0}^{r−1} F_{k,j})   (3.10)
The null hypothesis is rejected when the χ2 -statistic exceeds a critical value.
The critical value depends on the required confidence level and the degrees of
freedom ν.
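Equations (3.8)-(3.10) translate directly to code for an r × c contingency table of counts:

```python
def chi2_stat(F):
    """Chi-squared statistic and degrees of freedom for a table of counts F."""
    r, c = len(F), len(F[0])
    N = sum(sum(row) for row in F)
    row_sums = [sum(row) for row in F]                          # Σ_k F_{i,k}
    col_sums = [sum(F[i][j] for i in range(r)) for j in range(c)]  # Σ_k F_{k,j}
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row_sums[i] * col_sums[j] / N   # expected frequency E_{i,j}
            chi2 += (F[i][j] - e) ** 2 / e
    return chi2, (r - 1) * (c - 1)              # statistic and ν
```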
TVLA
Fixed and Random Plaintexts. The plaintexts of the two sets S0 and S1 can
be chosen in four different ways:
Fix vs. Fix: All measurements in set S0 are acquired with the same plaintext
p0 and all measurements in set S1 are acquired with a different fixed
plaintext p1 .
Fix vs. Random: All measurements in set S0 are acquired with the same
plaintext p0 and for each measurement in set S1 , a new random plaintext
is drawn and used.
Semi-fix vs. Random: The measurements in set S0 are acquired with a
varying set of plaintexts that are determined such that some intermediate
state in the algorithm is fixed. Each measurement in set S1 uses a new
random plaintext.
Random vs. Random: All measurements (in either set) use a fresh randomly
drawn plaintext.
In the first two cases, the rejection of the null hypothesis indicates that different
intermediate variables are distinguishable. In the case of “Fix vs. Fix”, the
presence or absence of leakage strongly depends on the choice of the plaintexts
p0 and p1 . The “Fix vs. Random” test is more powerful in that sense, as it
60 SIDE-CHANNEL ANALYSIS
Critical Values. The critical cut-off value C of a statistic for deciding whether
to reject the null hypothesis, depends on the degrees of freedom ν and the
required confidence level (see for example [NIS03]). However, in the case of
the t-test, the degrees of freedom parameter is often ignored. Based on the
work of Goodwill et al. [GJJR11], it is common to choose C = 4.5, which
implies a confidence of > 99.999% if one uses enough measurements. For a
smaller number of traces, one should take the degrees of freedom into account.
Figure 3.5 illustrates some t-test results. For an unmasked implementation, the
t-statistic surpasses the critical threshold C at numerous time samples with
only 12 000 traces, indicating that the null hypothesis should be rejected and
thus that sensitive information leaks from the power measurements. For a
masked implementation on the other hand, the t-statistic remains smaller than
C in absolute value for all sample points, with up to 50 million power traces.
The null hypothesis can therefore not be rejected and we can deduce with high
confidence that with 50 million traces, no sensitive information can be extracted
from the measurements. We note again that further preprocessing of the traces
may change this.
Figure 3.5: T-statistic for a fix vs. random test of an unmasked AES encryption
in hardware with 12k traces (left) and a masked AES encryption in hardware
with 50 million traces (right).
When studying the t-statistic (Eq. 3.7), it is clear that a consistent difference
combined with a growing number of measurements, must result in a growing
t-statistic. As long as the t-value does not show a definite growing trend, we do
not conclude that leakage is present (see Figure 3.6).
d-Probing Model. One of the first adversary models defined in the context of
masking remains today the most used in the literature: the d-probing model
by Ishai et al. [ISW03]. In this model, the adversary is assumed to have the
ability to probe up to d intermediate values of the calculation within a certain
period (e.g. a cycle). Only the exact intermediates are observed and nothing
more. We will denote by I the set of all intermediates in the calculation. The
ISW multiplication with d + 1 shares is provably secure in this model because
any subset Q ⊂ I with |Q| ≤ d is independent of the unmasked secret inputs.
This model is convenient for devising theoretical proofs, but not a very realistic
representation of an actual attacker targeting noisy side-channel measurements.
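For small gadgets this independence check is mechanical. The sketch below enumerates, for a 2-share ISW multiplication of two bits, the distribution of every single wire over the uniform masks and confirms it does not depend on the secrets (first-order probing security). The wire names and the exact wire list are an assumption for illustration.

```python
import itertools
from collections import Counter

def wires(x, y, x1, y1, r):
    """All wires of a 2-share ISW multiplication of bits x and y."""
    x0, y0 = x ^ x1, y ^ y1            # Boolean sharings of the secrets
    p00, p01 = x0 & y0, x0 & y1        # partial products
    p10, p11 = x1 & y0, x1 & y1
    t = p01 ^ r                        # refreshed cross product
    z0 = p00 ^ r                       # output shares
    z1 = p11 ^ t ^ p10
    assert z0 ^ z1 == x & y            # correctness of the sharing
    return {"x0": x0, "x1": x1, "y0": y0, "y1": y1,
            "p00": p00, "p01": p01, "p10": p10, "p11": p11,
            "r": r, "t": t, "z0": z0, "z1": z1}

def first_order_probing_secure():
    """Check that every single wire's distribution is secret-independent."""
    for name in wires(0, 0, 0, 0, 0):
        dists = set()
        for x, y in itertools.product((0, 1), repeat=2):
            c = Counter(wires(x, y, x1, y1, r)[name]
                        for x1, y1, r in itertools.product((0, 1), repeat=3))
            dists.add(tuple(sorted(c.items())))
        if len(dists) != 1:
            return False
    return True
```

For higher orders, the same check must run over every subset of up to d wires, which is where the combinatorial cost discussed later comes from.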
Noisy Leakage Model. Chari et al. [CJRR99] proved the security of their first
masking scheme in the noisy leakage model, in which the adversary observes
the leakage of the intermediates superposed with a noise function, rather than
exactly. This corresponds to the leakage models used in side-channel attacks
L(x) + e, for x ∈ I and e ∼ N (0, σ 2 ). An important result from this work
is a proof that the number of measurements required to distinguish a secret,
grows exponentially with the masking order d. However, they only consider
the masking of variables at bit level and independent of computations. This
was generalized by Prouff and Rivain [PR13]. While more realistic than the
probing model, it is not straightforward to prove the security of a masking
scheme against this adversary. Thankfully, the two models were united by
Duc et al. [DDF14] in a seminal work, which proved that security in the probing
model implies security in the noisy leakage model. As a result, security proofs
in the probing model are not only convenient but also practically relevant.
Bounded Moment Model. Barthe et al. [BDF+17] introduced another model
of an adversary that observes the dth-order statistical moment of the leakages.
Figure 3.7: Circuit model with combinational and sequential logic and glitch-
extended probes Ri for wires xi .
Recall that sequential logic elements such as registers serve as a boundary for
glitches by means of synchronization. It is assumed that an adversary probing
some wire in a combinational function, automatically obtains all inputs to
that function up to the previous glitch boundary. This set of inputs is called
the glitch-extended probe and is denoted Ri for each intermediate xi ∈ I (see
Figure 3.7). Following the independent leakage assumption (ILA), a glitch
function on the wire x_i depends only on the set of inputs R_i. As a result, this
model includes the worst possible glitch. This worst-case scenario does not
necessarily reflect reality. In other words, this model is over-conservative in
some sense. However, the model allows working at an abstract and theoretical
level, without requiring details about the underlying technology or platform.
Moreover, since the glitch-extended probing model is like the original ISW model
up to a redefinition of probes, we intuitively expect the reduction of Duc et
al. [DDF14] to hold. Specifically, we would expect that security in this model
implies security in a more realistic glitch-extended noisy leakage model, where
the adversary can observe the leakage of glitch-extended probes superposed
with a noise function.
Robust Probing Model. The idea to extend regular probes with additional
information, based on physical defaults was further developed by Faust et
al. [FGP+ 18]. Glitches are not the only cause for the gap between the probing
model and reality. In recent literature, various cases have been demonstrated
where the ILA does not hold. On hardware platforms such as FPGAs and
ASICs, wires carrying different variables (such as shares) are driven by the same
voltage line and the same clock signal drives all sequential logic in a masked
gadget. As a result, it has been shown by De Cnudde et al. [DEM18] that
coupling effects exist between “independent” wires due to capacitances or shared
voltage/clock lines. These effects are potentially dangerous for the recombination
of shares of the same variable. On software platforms, the combinations of
intermediates in the CPU datapath have been studied by Papagiannopoulos
and Veshchikov [PV17] among others. Balasch et al. [BGG+ 14] suggested a
transitional leakage model, in which not only each intermediate xi ∈ I is probed,
but also the XOR of any two intermediates xi ⊕ xj for xi , xj ∈ I. In the robust
probing model, Faust et al. [FGP+ 18] propose to replace the exact probes in
the probing model, not only with glitch-extended probes, but also with extended
probes for memory transitions or coupling. This is an attractive approach for
closing the gap between theory and practice. However, Levi et al. [LBS19]
demonstrated that defining extended probes for coupling is not straightforward.
Also, modelling the CPU effects in this way is nontrivial, as argued by De
Meyer et al. [14].
While these two notions were initially thought to be equivalent, there is a subtle
difference. Consider for example the intermediate (x0 ⊕ y0 )x1 . It is easy to
verify that this variable is independent of the secrets x and y and hence probing
secure. The share y0 acts like a one-time pad on x0 , which means that the
multiplication does not reveal any information on x. However, to simulate this
single value, one requires 2 shares of x, which means that non-interference does
not hold. The important take-away is that d-NI is a stronger notion. It implies
d-probing security, but not vice versa. De Meyer et al. [4] also clarified this
with an information-theoretic definition of d-NI, to mirror that of Gammel and
Mangard.
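The difference can also be checked by brute force. The following sketch (our own illustration, not part of the thesis tooling) enumerates all uniform sharings and confirms that the probe (x0 ⊕ y0 )x1 has the same distribution for every secret, i.e. it is 1-probing secure, even though simulating it requires both shares of x:

```python
from itertools import product
from collections import Counter

def probe_dist(x, y):
    """Distribution of the probe v = (x0 ^ y0) & x1 over all uniform
    sharings x = x0 ^ x1, y = y0 ^ y1 of the secret bits x and y."""
    counts = Counter()
    for x0, y0 in product((0, 1), repeat=2):
        x1 = x0 ^ x                    # second share is fixed by the secret
        counts[(x0 ^ y0) & x1] += 1    # y1 never appears in the probe
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()}

# Identical distribution for every secret pair, i.e. MI(probe; x, y) = 0.
reference = probe_dist(0, 0)
assert all(probe_dist(x, y) == reference
           for x, y in product((0, 1), repeat=2))

# Yet with x0 = 0 and y0 = 1 the probe equals x1 itself, so a simulator
# needs two shares of x: 1-NI fails although 1-probing security holds.
assert (0 ^ 1) & 0 == 0 and (0 ^ 1) & 1 == 1
```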
Definition 3.4 (d-non-interference): A gadget with d + 1 input shares x
is d-non-interferent if and only if for any observation set Q of at most d
probes, it holds that ∃i : MI(Q; xi | xī) = 0.
The above definitions can be seen as more detailed specifications of the probing
adversary model, as they imply a verification methodology for the security of
a masked gadget, either by simulation or by comparing joint distributions for
different secrets or by some equivalent methodology. Note that the verification
MI(Q; ·|·) = 0 must be done for every possible set of d probes Q.
Even if each gadget in a design is individually probing secure, it does not
follow that the entire design is probing secure, i.e. probing security is not composable.
This is for example why the AES of Rivain and Prouff [RP10] was flawed, even
though their refreshing gadgets and multiplication gadgets (separately) were
provably secure. NI, while stronger than probing security, is also not generically
composable.
Composability is a desirable property because it is not always feasible to verify
probing security for complex designs, since one should verify the independence
or simulatability of every observation set Q of d probes. In a circuit/algorithm
with w wires/intermediates, there are (w choose d) possible sets Q. The complexity
of verification thus grows considerably with the verification order d (assuming
d ≪ w).
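To get a feel for this growth, a one-line computation of the number of observation sets (the wire count w = 1000 is an arbitrary example of ours):

```python
from math import comb

# Number of d-probe observation sets Q in a circuit with w wires.
w = 1000
counts = {d: comb(w, d) for d in (1, 2, 3)}
# Already at third order, a modest circuit yields over 10^8 sets to check.
assert counts == {1: 1000, 2: 499500, 3: 166167000}
```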
Barthe et al. [BBD+ 16] introduced the concept of strong non-interference (SNI)
to alleviate this problem.
Definition 3.6 (d-strong non-interference): A gadget is d-strong non-
interferent (d-SNI) if and only if for every set QI of t1 probes on
intermediate variables (i.e. no outputs) and every set QO of t2 probes on
output shares such that t1 + t2 ≤ d, the set Q = QI ∪ QO can be simulated
using at most t1 shares of each input.
The difference between SNI and NI lies only in the number of input shares
which may be used for simulation. By making this number independent of the
number of output probes, SNI ensures a separation between the output and
input shares, which enables generic composition with other blocks. Belaïd et
al. [BBP+ 16] demonstrate how gadgets satisfying NI and SNI can be combined
in a global proof of security. The main idea is to determine the number of
shares that are required at the inputs of each gadget to simulate the rest of the
circuit and to propagate this backwards to the outputs of the previous gadgets.
De Meyer et al. [4] suggested a mathematical description of SNI based on mutual
information rather than simulation.
Definition 3.7 (d-strong non-interference): A gadget with d + 1 input
shares x is d-strong non-interferent if and only if for any observation set
Q of at most d probes, of which t1 are intermediates and t2 are output
probes such that t1 + t2 ≤ d, it holds that ∃ I ⊂ {0, . . . , d} with |I| = t1
such that MI(Q; xĪ | xI ) = 0.
It was noted by Faust et al. [FGP+ 18] that another synchronization stage at the
outputs is required for this purpose. Without glitches, it does satisfy SNI.
[Figure: implications between the security notions, without and with glitch-extended
probes: d-SNI ⇒ d-NI ⇒ d-probing (⇒ d-NC), where non-completeness together with
uniformity relates back to first-order (d = 1) security.]
Automated Tools. Several tools for the provable verification of gadgets have
been proposed in the literature. Coron [Cor18] introduced a tool to verify
simulatability (NI and SNI) based on symbolic manipulation. It is suitable
for both Boolean and arithmetically masked functions in software, but suffers
from false negatives (e.g. SNI gadgets which are evaluated as not SNI). The
first formal verification tool for probing security in the presence of glitches
was developed by Bloem et al. [BGI+ 18]. Barthe et al. [BBC+ 19] created
MaskVerif, a very versatile tool for the verification of probing security, NI, and
SNI, either with or without glitches. MaskVerif achieves very good efficiency
even for probing security. For each observation set Q, the tool first takes a
symbolic approach, similar to the method of Coron. Only if the result of this
test is negative are the properties of Definitions 3.2 or 3.5 verified by exhaustive
computation of joint probability distributions. Sadly, its applicability is limited
to Boolean masking.
Formal proofs of global security are either only applicable to small designs or are
very demanding on the gadgets (i.e. SNI). Moreover, even if a masking scheme
In the Probing Model. Reparaz [Rep16] first proposed a flaw detection tool
for software masking. The traces for TVLA are generated according to the
d-probing model, by including for each intermediate variable one sample in the
trace. It is possible to use leakage models other than the identity model, such
as Hamming weight or LSB. In some cases, one can even optimize the efficiency
by “downscaling” the scheme, for example from the field F_{2^8} to F_{2^4}. The t-test
is applied to these traces, as in regular TVLA. The test can be specific or
non-specific, the secret can be chosen in a fixed vs. fixed or fixed vs. random
manner, and higher-order verification is possible by preprocessing the traces. Though
operating in the probing model, this tool was also able to detect the flaw of
higher-order threshold implementations [Rep15].
[Figure: per time sample, the tool compares the distribution of simulated leakage
values obtained with a fixed secret against the distribution obtained with a random
secret and tests whether the two are (approximately) equal.]
With Simulation of Glitches. Another tool, by Sijacic et al. [SBY+ 18], targets
masked hardware implementations and performs a post-place-and-route simulation
of the circuit. Since this simulation uses actual propagation delays, it can also
accurately simulate the occurrence of glitches. This tool thus operates in a
model that is more realistic and less worst-case than the glitch-extended probing
model. On the other hand, the results are more dependent on implementation-
specific characteristics such as the logic library and the placement and routing.
Bertoni and Martinoli [BM16] perform a simulation of so-called transients to
enumerate all possible transitions and thus intermediates that may occur on a
wire. This method is again independent of implementation specifics, yet more
optimistic than the glitch-extended probing model. The difference between
transients and glitch-extended probes is that apart from including all possible
intermediate results, the latter also includes other combinations of inputs, such
as transitions between these intermediates.
On the other hand, this does not mean that flaw detection
tools can replace regular TVLA. With any verification method, it is important
to keep its limits in mind. Simulation-based tools (whether exhaustive or not)
can only be as accurate as the adversary model they are based on. In the gap
between theory and practice, they can provide no guarantees. Hence, performing
TVLA on real measurement traces (while keeping its limitations in mind as
well) remains an important step in the evaluation of masked implementations.
We demonstrate this in the next paragraph.
Contribution. In our work from CHES 2019 [4], we describe a new, succinct,
information-theoretic security condition, as described in Definition 3.5. This
is the first formal condition for d-probing security in the presence of glitches
which is both necessary and sufficient. This single condition includes, but
is not limited to, previous security notions such as those used in threshold
implementations. As a consequence, we can prove that non-completeness is
indeed necessary and demonstrate that uniformity is not, despite being enforced
in most works on masking. Furthermore, we also treat the notion of (strong)
non-interference from an information-theoretic point-of-view (see Definitions 3.4
and 3.7). We unify the different security concepts and pave the way to the
verification of composability in the presence of glitches.
We consolidate all existing and new security notions into a single framework
based on mutual information. All notions in Section 3.2.3 can be verified by
some form of the property MI(A; B|C) = 0, where A depends on the type of
probing (with/without glitches, transitions, . . . ) and B, C determine whether
one verifies NI, SNI or probing security. This paper is not included in this book,
but Section 3.2.3 is essentially a summary of its results (excluding our proofs).
Finally, we use this framework in a tool that efficiently tests and validates
the resistance of masked implementations against DPA. We described this
tool in Section 3.2.4. The tool is an extension of the flaw detection tool from
Reparaz [Rep16], but can also be used for the provable security of small gadgets.
We demonstrate the adaptability of the framework to, for example, different
types of mask representations and point out important features that are not
yet included in state-of-the-art tools such as MaskVerif [BBC+ 19]. For example,
it was used to validate the security and optimize the randomness use of the
multiplicative masked AES, published at CHES 2018 [7].
The new security notions (Definitions 3.4, 3.5 and 3.7) were also adopted in a
very efficient tool by Knichel et al. [KSM20].
3.4 Conclusion
use of a single random mask. What we get in return is the flexibility of being
able to compose the gadget with any other block. However, a gadget in a specific
design (e.g. AES) does not need to be composable with any other gadget, but
only with those that it is actually composed with. A specific example is that of
first-order threshold implementations, which are neither 1-NI nor 1-SNI, but
can still be (serially) composed thanks to uniformity and non-completeness. It
is thus not necessarily true that a gadget that satisfies neither NI nor SNI does
not provide the required security. Furthermore, it is possible for a complex
block to consist of such gadgets and still be probing secure. The problem is that
the literature currently lacks formal methods to prove the security of a large
design that does not consist of strong non-interferent gadgets. In a recent effort,
the current simulatability notions are being refined to account for multiple
inputs and outputs [CS19]. As long as the verification of probing security itself
does not become more efficient, we need to come up with tighter security
requirements. We also need these new notions to allow for randomness recycling.
The Boolean Bias. Over the last years, a large number of verification tools
have been introduced. Their functionalities range from the verification of
provable security for small gadgets to the validation of practical security for
larger designs. This is a positive development, since hand-written proofs are
prone to human error and publicly available tools can be scrutinized by a larger
community. However, similar to the popularity of Boolean masking in the
previous chapter, we see here a bias towards tools compatible with this type of
masking. To the best of our knowledge, there is no publicly available tool that
allows verification of inner product or polynomially masked implementations.
This is regrettable, given that none of the security notions in Section 3.2.3
make any assumptions on the masking representation and especially given
the increased security of inner product and polynomial masking over Boolean
masking in practice.
Modelling the Adversary. Tools decrease the dangers of human errors and
increase the convenience of verifying masked implementations. However, lack of
security is not always due to lack of verification, but rather due to misconceptions
of leakage behaviour. Consider for example the scheme of Ishai et al. [ISW03],
which was introduced for implementation in hardware. Though it came with a
proof of security, it did not provide the claimed security, because their model
did not take glitches into account. Today, Boolean masked implementations
for software are based on ISW and come with global proofs of security, yet still
exhibit leakage on a real platform. Hence, more important than the actual
verification of security, is the accurate modelling of the adversary. If the model
is inaccurate, it does not matter how many proofs or tools one uses, since not
all leakage can be found.
Figure 3.10: The gap between theory and practice: provable vs. practical
security.
We can try to make our models correspond to reality as closely as possible, but
need to keep in mind that more complicated models engender more expensive
gadgets (e.g. SNI). The glitch-extended probing model includes the worst
possible glitches, because it cannot predict the glitches that will actually occur.
The transitional leakage model assumes an adversary can probe any XOR
combination of intermediates, because it is unknown at the theoretical level
which combinations are possible. More fundamentally, the probing model
assumes that exact values in a calculation can be observed. It is common in
theory to give more power to the adversary than strictly necessary to keep the
models conceptually simple, but with strong security guarantees. As a result,
the masked gadgets in these models may be more expensive than required.
Moreover, the gap is a double-edged sword, as it can equivalently be seen from
an attacker’s point-of-view (see Figure 3.11). Hence, at some point, it does
not make sense to keep increasing the cost and complexity of masking gadgets
so that they are secure in an even stronger model. We must find a balance
between the effort spent on making implementations secure in some model and
the effort spent on making them secure in practice.
Bottoms Up: From Provable to Practical Security. Apart from the gap
between provable and practical security, there is also a gap between the literature
on software masking and hardware masking. Recall that in the previous chapter
80 COMBINED PHYSICAL ATTACKS AND COUNTERMEASURES
Countermeasures against fault attacks are largely heuristic and do not boast a
formal background like that of masking against DPA. In this section, we give a
summary of important fault attack mechanisms and countermeasures.
A fault attack consists of two stages. First, a fault is injected into the
cryptographic computation and then, the effect of the fault is exploited using
fault analysis. Below, we list different methods for physically inserting the fault
onto a computing chip. Next, we describe how the faults can be characterized
from a theoretical point-of-view. Finally, we describe DFA, the most important
threat against cryptographic implementations in this context.
Fault Injection. There are several ways to induce faults in a computation, with
varying degrees of invasiveness and precision [Tun17]. For example, modifying
the temperature is a non-invasive technique for tampering with an embedded
device. Other non-invasive approaches are varying the supply voltage or inducing
glitches in an external clock. Clock glitches are a very popular means of fault
injection because they are relatively easy and cheap to perform. Since a glitch
in the clock signal essentially shortens the clock period (the rising edge comes
too soon), it introduces faults by causing wrong intermediates to be stored in the
registers. A clock glitch can also cause the next operation to start executing
before the current one finishes.
Semi-invasive techniques require the chip surface to be exposed. The
computation can then be disturbed with a laser. Laser injections are more
powerful than clock glitches since the attacker has more control, but they are
more complex and expensive to achieve. The attacker should, for example,
have some knowledge about the layout of the chip. Moreover, with advancing
technologies, transistors decrease in size, making it more difficult to hit individual
bits or bytes. The improvements in technology also reduce the size of the laser
spot, but this is limited by the wavelength and can thus not be made arbitrarily
small. The exact outcome of a laser injection can therefore be hard to predict.
Dutertre et al. [DBC+ 18] recently showed that it is still possible to target
single bits in 28nm CMOS technology. On microcontrollers, the storage is a
popular target for laser injections, since it is easy to distinguish from other
components [KBB+ 18].
More invasive methods exist, which actually alter the chip itself, such as focused
ion beams (FIB). These are relatively expensive and out of scope for this work.
FAULT ATTACKS AND COUNTERMEASURES 81
Fault Models. In the theory of SCA, the side-channel observations are typically
characterized by probes on intermediate values. Faults are more complicated
to model because they come in so many variations. We can distinguish them
according to the following characteristics [KSV13]:
• Granularity: Does the fault affect bits or bytes or words? How many of
them are affected?
• Type of Modification: The fault resets bits to 0 or sets them to 1,
causes a bit flip, or produces a random value.
• Degree of Control: How precisely can the location and timing of the
fault be controlled?
• Duration: The fault is transient, permanent or destructive.
The behaviours of the different types of fault injections fall into different fault
model categories, but not necessarily a single one. For example, with a clock
glitch, one has good control over the timing, but not necessarily over the
location, since this is determined by the critical path delays. Laser injection
gives strong control over both the timing and the location of faults (see for
example Figure 4.1). The modification effect of such a fault is typically a reset
to 0 or a set to 1. Given the large variety of fault injection methods, it is difficult
to create a single fault model that considers all types of faulting adversaries.
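These characteristics can be captured in a small simulation helper. The sketch below is our own toy model for reasoning about fault types, not a standard tool; the `model` names mirror the list above:

```python
import random

def inject(value, model, bits, width=8, rng=random):
    """Apply a transient fault to `value` under a simple fault model.
    model: 'reset' (bit -> 0), 'set' (bit -> 1), 'flip', or 'random',
    applied to the selected bit positions; granularity = len(bits)."""
    out = value
    for b in bits:
        mask = 1 << b
        if model == "reset":
            out &= ~mask
        elif model == "set":
            out |= mask
        elif model == "flip":
            out ^= mask
        elif model == "random":
            out = (out & ~mask) | (rng.getrandbits(1) << b)
    return out & ((1 << width) - 1)

assert inject(0b1010_1010, "reset", [1, 3]) == 0b1010_0000
assert inject(0b1010_1010, "set",   [0])    == 0b1010_1011
assert inject(0b1010_1010, "flip",  [7])    == 0b0010_1010
```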
Figure 4.1: Sensitive regions on flash memory. Laser injections in each region
result in a reset of one or two (neighbouring) bits of an instruction opcode.
From top to bottom, the affected bit positions move from least significant to
most significant. [KBB+ 18]
Differential Fault Analysis. Biham and Shamir introduced differential fault
analysis (DFA) against the Data Encryption Standard (DES). They consider random
transient faults, i.e. during each encryption, one (or a few) bits are flipped
with some small probability.
The fault position is unknown to the attacker. The attack requires about 50
to 200 pairs of ciphertexts (C, C 0 ), with C a correct ciphertext and C 0 a faulty
ciphertext for the same plaintext. In other words, one needs to be able to
encrypt each plaintext twice, disturbing only the second encryption. Using
the difference C ⊕ C 0 and techniques from differential cryptanalysis [BS90], it
is possible to recover the key. Biham and Shamir further specify that, if the
attacker can choose the location of the faults, only 3 ciphertext pairs (C, C 0 )
are required. Piret and Quisquater [PQ03] applied DFA to the AES, assuming
a uniformly random fault is injected on a single byte in the last or next-to-last
encryption round. This attack requires two faulty encryptions to recover the
key. Tunstall et al. [TMA11] showed that it is possible to attack AES with a
single faulty ciphertext, in which the fault is injected in the eighth round.
It is notable that the fault injections for these attacks are supposed to affect a
small rather than a large number of bits. Most attacks also benefit from fault
injections in the last rounds of encryption, since these are easier to exploit.
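The core DFA idea can be shown on a toy one-round cipher C = S[x] ⊕ k with the 4-bit PRESENT S-box. This is our own illustrative construction, far simpler than the published attacks on DES and AES: the attacker keeps only the key guesses for which the implied S-box input difference matches the single-bit fault model, and intersects the candidates over a few correct/faulty pairs:

```python
S = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
     0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # PRESENT S-box, as a toy
Sinv = [S.index(i) for i in range(16)]

def last_round(x, k):            # toy last round: C = S[x] ^ k
    return S[x] ^ k

def dfa_candidates(c, c_faulty):
    """Key guesses consistent with a single-bit fault on the S-box input."""
    keys = set()
    for k in range(16):
        delta = Sinv[c ^ k] ^ Sinv[c_faulty ^ k]
        if bin(delta).count("1") == 1:   # fault model: one flipped bit
            keys.add(k)
    return keys

key = 0xB
cands = set(range(16))
for x, bit in [(0x3, 0), (0x9, 2), (0x6, 1)]:    # correct/faulty pairs
    c = last_round(x, key)
    cf = last_round(x ^ (1 << bit), key)
    cands &= dfa_candidates(c, cf)
# The true key always survives; each pair shrinks the candidate set.
assert key in cands
```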
4.1.2 Countermeasures
Countermeasures against fault attacks have been devised at various levels. Some
low-level physical countermeasures target the fault injection itself, using active
shields or light detectors or filters in the clock to mitigate glitches [BCN+ 06].
Most countermeasures intend to obstruct the successful analysis of faults. Similar
to hiding techniques against SCA, random delays and shuffling can make it
more difficult for the attacker to inject precise faults, but they cannot prevent
the attacks completely. At a higher level, one could prevent the attacker from
collecting sufficient encryptions with the same key. Fresh re-keying is therefore
a popular mechanism for protecting implementations against both DPA and
DFA [MSGR10].
Within the scope of this work, we only consider countermeasures at the
algorithmic level, i.e. not at the physical or protocol level. The protection of
implementations against faults long precedes the publications on fault attacks.
For example, parity bits or checksums have been used in data transmissions
since the 1950s to detect and correct unintentional errors [Ham50]. Hence, fault
attack countermeasures have a lot in common with techniques from coding
theory. The fundamental requirement is the same: we need redundancy. A
countermeasure is defined, on the one hand, by the type of redundancy it uses
and, on the other hand, by how this information is employed to protect the
implementation against malicious faults.
Using the Redundancy. Lomné et al. [LRT12] distinguish two ways to use the
redundancy: detection and infection. Detection is conceptually straightforward.
In the case of repeated encryptions, one compares the two ciphertexts; with
EDC, one verifies the check bits. If the check fails and thus a fault is detected,
the computation needs to stop in a secure way and the faulty ciphertext should
not be released.
An alternative to stopping the computation is ensuring that any injected fault
results in a random ciphertext, from which the attacker cannot obtain any
information. This concept is called infective computation. Since the success of
DFA relies on the information in the difference C ⊕ C 0 , a uniformly random
C 0 cannot be used to recover the key. Infective computation was introduced
by Yen et al. [YKLM01] for public-key systems. Several proposals were made
for symmetric-key systems as well [LRT12, GST12], but all of them have been
shown to be flawed [BG13].
Additionally, we note that correction is a third method for using the redundancy,
for example using error-correcting codes (ECC) or majority voting.
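The detection and infection strategies can be contrasted in a toy sketch. The stand-in "cipher" below is our own placeholder, and real infective schemes are far more intricate (as noted above, the published symmetric-key proposals were broken):

```python
import secrets

def enc(pt, k):                  # stand-in for a block cipher call
    return (pt + k) % 256

def protected_enc(pt, k, inject_fault=False):
    """Detect-and-abort: compute twice, compare, suppress faulty output."""
    c1 = enc(pt, k)
    c2 = enc(pt, k) ^ (0x04 if inject_fault else 0)  # fault in redundant run
    if c1 != c2:
        return None              # detection: the faulty ciphertext is withheld
    return c1

def infective_enc(pt, k, inject_fault=False):
    """Infection: a detected difference randomises the output instead."""
    c1 = enc(pt, k)
    c2 = enc(pt, k) ^ (0x04 if inject_fault else 0)
    if c1 != c2:
        return secrets.randbelow(256)  # C' carries no exploitable difference
    return c1

assert protected_enc(5, 7) == 12
assert protected_enc(5, 7, inject_fault=True) is None
assert infective_enc(5, 7) == 12
```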
One strategy is to react to faults that get detected, preventing the attacker
from obtaining sufficient faulty encryptions to mount the attack. Take SIFA for
example. While the ineffective faults cannot be detected, the effective faults
can be detected and used at the protocol level. Another strategy is to increase
the number of encryptions that
appear to have ineffective faults. Faults injected in dummy rounds are, for
example, automatically ineffective, regardless of sensitive data. Error correction
can ensure that even effective faults result in correct ciphertexts.
Safe-errors are also difficult to detect. Countermeasures must ensure that for any
effective fault, one of the following holds: (1) Either the faulty value will not be
combined with sensitive data. Hence, if it becomes a safe-error and disappears,
no secret information is revealed. (2) Or the faulty value is propagated to the
next error check, which means it cannot become a safe-error. For example,
Ishai et al. [IPSW06] describe gates that ensure that any detectable error is
propagated.
Setting the Stage. An early example of an attack that combines power analysis
with laser light is by Skorobogatov [Sko06]. Skorobogatov uses laser light to
enhance the power traces and isolate the leakage of individual transistors. Since
the laser light is carefully adjusted so as not to interfere with the operations,
no faults are injected and the attack may be classified as a semi-invasive power
analysis rather than a combined attack. Moreover, Skorobogatov notes that the
attack would not work with modern submicron technologies.
PACA on Public-Key Systems. Amiel et al. [AVFM07] exploit the fact that
fault countermeasures typically only act at the end of the computation, before
the ciphertext is released at the output. Hence, power analysis can be used
to avoid the error check mechanism and extract information about the faulty
ciphertext. Their attack is applied to a public-key cryptosystem, which is
protected against both SPA and DPA. They use actual power measurements
and techniques from SPA to extract the secrets. They refer to the attack as
PACA, which stands for Passive and Active Combined Attack.
In Practice. Of the above attacks, none have been used against symmetric-key
ciphers such as AES in practice. Patranabis et al. [PBMB17] performed an attack
on a microcontroller running the symmetric cipher PRESENT [BKL+ 07].
While this is a side-channel assisted fault attack, they still target an unprotected
implementation of the cipher and require the knowledge of faulty ciphertexts.
Such attacks use side-channel information to facilitate FA; they are side-channel
assisted fault attacks. On the other hand, Clavier et al. [CFGR10] use fault
injections to reduce the DPA protection order and facilitate SCA, i.e. they
perform a fault-assisted side-channel attack.
4.2.2 Countermeasures
Private Circuits II. Ishai et al. [IPSW06] extended their work on ISW
multiplications with fault protection against two types of adversaries. In the first
version, the adversary can inject an unlimited number of faults, but the effect
of each fault is a reset to 0. Starting from a masked circuit based on [ISW03],
they encode every bit using a Manchester encoding (0 → 01, 1 → 10). As a
result, 00 and 11 are invalid encodings, which only occur as the result of a
fault. They introduce gadgets for the AND and XOR operations, which ensure
that any detectable fault is propagated to an “error cascading stage”. The
error cascading stages act as a self-destruction mechanism and ensure that any
invalid encoding is spread to all the wires in the circuit output or memory.
The second type of adversary can also do bit flips and set bits to 1, but is
limited in the number of faults injected. In that case, every bit is encoded
using 2nf wires, with n the number of shares and f the number of allowed
faults per clock cycle. De Cnudde and Nikova [DN16] applied this approach to
the PRESENT block cipher and implemented it on an FPGA. However, since
the ISW countermeasure is not secure in the presence of glitches, they used
threshold implementations (TI) [NRR06] as countermeasure against SCA. The
overhead of the fault protection over the SCA-only countermeasure is a factor
of approximately 8.8 for the adversary that is only allowed reset faults. The
countermeasure is thus very expensive for a model that is not particularly realistic.
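The encoding-based detection from Private Circuits II can be sketched as follows. This is a minimal model of ours for the 0 → 01, 1 → 10 encoding only; the real construction additionally propagates detected errors through its AND/XOR gadgets:

```python
ENC = {0: (0, 1), 1: (1, 0)}     # Manchester encoding: 0 -> 01, 1 -> 10

def encode(bits):
    return [w for b in bits for w in ENC[b]]

def decode(wires):
    """Decode a Manchester-encoded word; 00/11 pairs signal a fault."""
    bits = []
    for hi, lo in zip(wires[0::2], wires[1::2]):
        if hi == lo:             # 00 or 11 is only reachable via a fault
            raise ValueError("fault detected")
        bits.append(hi)
    return bits

word = [1, 0, 1, 1]
coded = encode(word)
assert decode(coded) == word
coded[0] ^= 1                    # reset/flip fault on a single wire
try:
    decode(coded)
    assert False
except ValueError:
    pass                         # the invalid pair 00 is detected
```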
CAPA. Reparaz et al. [13] took a radically different approach by not starting
from an existing masking scheme. Instead, they exploited the link between
masking as a countermeasure against passive SCA on the one hand and secret
sharing methods for multi-party computation (MPC) on the other.
State-of-the-art MPC protocols consider malicious parties that not only observe
a shared computation but also actively deviate from it. By tailoring such active MPC
protocols [DPSZ12] to the embedded systems context, they obtain a combined
countermeasure (named CAPA) with very strong security guarantees against
combined attacks. Reparaz et al. [13] introduce the tile-probe-and-fault model
to formalize their adversary, which is strongly based on the MPC model. In this
model, an embedded system can be seen as consisting of multiple tiles, each
representing one party in the MPC protocol (see Figure 4.2). The adversary
can obtain full control over d of the d + 1 tiles. This models well a combined
adversary who can combine knowledge of the intermediates in those tiles with
the capability of changing them. Moreover, the authors explain how
CAPA is even secure against safe-error attacks.
The redundancy in this countermeasure is based on information-theoretic
message authentication code (MAC) tags. Given a fixed secret key α ∈ F,
each variable x ∈ F is accompanied by a tag τ_x = αx ∈ F. The variables
x, the tags τ_x and the key α are all manipulated in shared form only. The
key α is fresh for each encryption. The fault coverage of the methodology
depends on the size of the MAC tags, i.e. |F|. If the field F is not large enough,
multiple keys αi can be used to attribute multiple tags to each variable. As a
result, the countermeasure is scalable, but to obtain a good fault coverage, the
overhead becomes prohibitively large in hardware. For the small block cipher
KATAN [DDK09], they obtain an overhead factor of approximately 1 + m for a
fault detection probability of 1 − 2^{-m}.
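The tag mechanism itself is easy to sketch. The field GF(2^8) and the AES reduction polynomial below are our assumptions for illustration (the construction works over any finite field), and the real scheme manipulates α, x and τ_x in shared form only:

```python
def gf256_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1
    (the AES polynomial, chosen here purely for illustration)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

alpha = 0x53                     # secret MAC key, fresh per encryption
x = 0xCA
tag = gf256_mul(alpha, x)        # tau_x = alpha * x

# An additive fault e on x goes undetected only if the adversary also
# shifts the tag by exactly alpha * e: probability 1/|F| = 2^-8 when
# alpha is unknown, since multiplication is linear over XOR.
e = 0x21
assert gf256_mul(alpha, x ^ e) == tag ^ gf256_mul(alpha, e)
```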
M&M. With M&M, which stands for Masks and MACs, De Meyer et al. [6]
introduce a family of countermeasures based on the same framework as CAPA,
but with a relaxed adversary model. They show how any masking scheme can
be extended with information-theoretic MAC tags, without using the expensive
MPC calculations. As a result, instead of the tile-probe-and-fault model, their
adversary model is more similar to that of ParTI [SMG16], with the exception
that faults are not limited in Hamming weight and can affect any number of bits.
Instead of using an error check mechanism which is easily vulnerable to combined
attacks, they use infective computation to ensure that detectable faults result in
random ciphertexts. As with CAPA, the fault coverage depends on the size of the
MAC tags and is scalable. By stepping away from MPC protocols, M&M loses
the provable security against combined attacks. In contrast with ParTI [SMG16],
there are no intermediate error checks and the tags are only used at the end of
each encryption. Nevertheless, none of the attacks described in Section 4.2.1
are effective against it. Moreover, the countermeasure is much more efficient
than CAPA. The authors are able to provide example implementations for AES,
achieving an overhead factor of 2.53–2.63 for a coverage of 1 − 2^{-8}.
4.2.3 Discussion
MAC tags vs. EDC. The information-theoretic MAC tags were inherited
from MPC protocols, where the malicious attacker has very strong control over
the errors injected. The MPC adversary has the ability to exactly change one
share of an intermediate to another known value with success probability one.
For fault injections with clock glitches or a laser, such assumptions are too
strong, as the precise effect of a fault is difficult to control and predict. Hence,
Context. Both SCA and FA are real threats to embedded cryptography and
countermeasures against both attacks have mostly been studied separately.
Recent attacks have shown that fault injections can facilitate side-channel
attacks and that power measurements can be used to circumvent fault
countermeasures [AVFM07, CFGR10, RLK11]. This calls for new designs
of countermeasures which do not only separately protect against SCA and FA,
but also against possible combined attacks, in which side-channel measurements
and faults are jointly exploited. One of the first proposals in this area is due
to Ishai et al. [IPSW06]. However, the adversary model in this work is either
limited to reset faults, or bounded in the number of bits that are affected.
Neither corresponds to realistic fault injections. Moreover, the scheme results
in a very large overhead. A much more efficient countermeasure by Schneider et
al. [SMG16] combines the masking countermeasure of TI [NRR06] with EDC.
However, as a superposition of masking with redundancy, this scheme does not
provide security against combined attacks.
The tile-probe-and-fault model leads one to naturally look (by analogy)
at actively secure multi-party computation protocols. Indeed, CAPA draws much
inspiration from the MPC protocol SPDZ [DPSZ12]. To demonstrate that the
model, and the CAPA countermeasure, are not just theoretical constructions,
but could also serve to build practical countermeasures, we present initial
experiments of proof-of-concept designs using the CAPA methodology: namely,
hardware implementations of the KATAN and AES block ciphers, as well as a
bitsliced software AES S-box implementation. We demonstrate experimentally
that the design can resist second-order DPA attacks, even when the attacker
is presented with many hundreds of thousands of traces. In addition, our
proof-of-concept can also detect faults within our model with high probability
in accordance with the methodology. This work can be found on page 205.
M&M can be instantiated from any d-th order secure masking scheme, and hence
achieves generic order of protection for SCA. The combination with MAC tags
then ensures generic order of protection against DFA and the combination of SCA
and DFA. As opposed to EDC, the MAC mapping is perfectly unpredictable,
eliminating the possibility of smart undetectable faults. This also makes M&M
secure against faults that affect any number of bits. It thus works in a stronger
adversary model than the existing scheme ParTI, yet is a lot less costly to
implement than the provably secure MPC-based scheme CAPA. We demonstrate
M&M with first- and second-order secure implementations of the AES cipher.
This example shows that M&M can be very efficient in area with an overhead
factor of merely 2.53 compared to an implementation that protects only against
SCA. We perform a SCA evaluation of our implementations where no leakage
is found with up to 100 million traces. Additionally, we design and perform
a fault evaluation to confirm our theoretically claimed fault coverage. This
methodology was later extended into a generic fault detection tool [AWMN19].
We include the paper on page 229.
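The principle behind such multiplicative MAC tags can be illustrated with a small standalone sketch (the key value ALPHA, the field GF(2^8), and all variable names are illustrative choices, not the actual M&M parameters): a tag is the product of the secret MAC key with the value, so any fault on the value alone shifts the tag by a nonzero, key-dependent amount and is always caught unless the attacker also forges the tag, which requires guessing the key.

```python
def gf_mul(a, b, poly=0x11B):
    """Carry-less multiplication modulo an irreducible polynomial over GF(2).
    0x11B is the AES polynomial x^8 + x^4 + x^3 + x + 1 (illustrative choice)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

ALPHA = 0x5B  # secret MAC key (hypothetical value); must be nonzero

def tag(x):
    """Information-theoretic MAC tag: alpha * x in GF(2^8)."""
    return gf_mul(ALPHA, x)

def check(x, t):
    return tag(x) == t

x = 0x3A
t = tag(x)
assert check(x, t)
# tag is GF(2)-linear: tag(x ^ e) = tag(x) ^ tag(e), and tag(e) != 0 for
# e != 0, so every nonzero fault on the value alone is detected.
assert all(not check(x ^ e, t) for e in range(1, 256))
```

A successful undetected fault requires adjusting the tag consistently, i.e. guessing alpha, which succeeds with probability 2^-8 per trial in this field; this is exactly the "perfectly unpredictable" property contrasted with EDC above.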
4.4 Conclusion
Prepare for the Real World. In the search for proper models of combined
adversaries, we should avoid as much as possible a disconnect between theory
and practice, such as can be witnessed in the world of masking. Naturally,
as argued in the previous chapter, a gap will always exist between the two.
However, we can bring them as close together as possible. Before moving forward
with overly expensive models, we need to see evidence of the practical feasibility
of a combined attack on a block cipher implementation with combined masking
and fault protection. To choose the right type of redundancy (e.g. MAC tags
vs. EDC), we should investigate the fault distributions that can be produced
by different fault injection methods in practice. As much as possible, we should
use practical results to design and justify theoretical adversary models that are
representative of the real-life threats.
Design of Symmetric
Cryptographic Primitives
and F2 : F_2^n → F_2^m : F1 ◦ F2(x) = F1(F2(x)). We also consider the inner product
of two bit-vectors as ⟨x, y⟩ = Σ_i x_i y_i.
The algebraic degree (and more generally complexity of the algebraic description)
plays a role in the resistance against algebraic attacks [CP02], which target a
cryptosystem by considering it as a system of equations. For an n-bit bijective
S-box, the largest possible algebraic degree is n − 1.
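As an illustration (this sketch is ours, not part of the original text), the algebraic degree of an S-box can be computed from the algebraic normal form of its coordinate functions via the fast Möbius transform. We use the 5-bit Keccak χ mapping, y_i = x_i ⊕ (¬x_{i+1} ∧ x_{i+2}), which is a quadratic permutation:

```python
def mobius(tt):
    """Fast Möbius transform: truth table -> ANF coefficient vector."""
    a = tt[:]
    n = (len(a) - 1).bit_length()
    for i in range(n):
        for x in range(len(a)):
            if x & (1 << i):
                a[x] ^= a[x ^ (1 << i)]
    return a

def degree(sbox, n):
    """Algebraic degree: max Hamming weight of a monomial with a nonzero
    ANF coefficient, over all output coordinates."""
    d = 0
    for bit in range(n):
        anf = mobius([(sbox[x] >> bit) & 1 for x in range(1 << n)])
        for u, c in enumerate(anf):
            if c:
                d = max(d, bin(u).count("1"))
    return d

def chi(x, n=5):
    """Keccak chi step on an n-bit row: y_i = x_i XOR (NOT x_{i+1} AND x_{i+2})."""
    y = 0
    for i in range(n):
        xi = (x >> i) & 1
        x1 = (x >> ((i + 1) % n)) & 1
        x2 = (x >> ((i + 2) % n)) & 1
        y |= (xi ^ ((x1 ^ 1) & x2)) << i
    return y

CHI = [chi(x) for x in range(32)]
assert sorted(CHI) == list(range(32))  # chi is a permutation for odd n
assert degree(CHI, 5) == 2             # quadratic, far below the bound n - 1 = 4
```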
S-BOX PROPERTIES AND AFFINE EQUIVALENCE 101
The differential uniformity [Nyb93] is the largest value in the DDT for α ≠ 0:
δ(S) = max_{α ≠ 0, β} #{x ∈ F_2^n : S(x ⊕ α) ⊕ S(x) = β}.
The AES S-box. To illustrate these properties, we consider again the main
example of this work, the AES S-box. While we originally defined it as a
function over the field F_{2^8} in § 1.1, it can also be represented as a vectorial Boolean
function over F_2^8. This function has the maximum algebraic degree of 7. Its
differential uniformity and linearity are respectively 4 and 32. While not AB
nor APN, this S-box remains to this day the best 8-bit S-box in the literature in
terms of cryptographic properties. No S-boxes with lower differential uniformity
or linearity have been found and it is not clear whether they even exist. The
main reason that we are still unsure about this is the magnitude of the search
space.
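These numbers are easy to reproduce. The sketch below (ours, with helper names chosen freely) regenerates the AES S-box as inversion in F_{2^8} followed by the standard affine map, then computes the differential uniformity as the maximum DDT entry and the linearity as the maximum absolute Walsh coefficient, using a fast Walsh–Hadamard transform per output mask:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a):
    """a^254 = a^(-1) in GF(2^8); 0 maps to 0 by the AES convention."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def affine(b):
    """The AES affine layer: b'_i = b_i + b_{i+4} + b_{i+5} + b_{i+6} + b_{i+7} + c_i."""
    c, out = 0x63, 0
    for i in range(8):
        bit = ((b >> i) ^ (b >> ((i + 4) % 8)) ^ (b >> ((i + 5) % 8))
               ^ (b >> ((i + 6) % 8)) ^ (b >> ((i + 7) % 8)) ^ (c >> i)) & 1
        out |= bit << i
    return out

SBOX = [affine(gf_inv(x)) for x in range(256)]
assert SBOX[0x00] == 0x63 and SBOX[0x01] == 0x7C  # known entries

# Differential uniformity: largest DDT entry over all nonzero input differences.
ddt_max = 0
for a in range(1, 256):
    counts = [0] * 256
    for x in range(256):
        counts[SBOX[x] ^ SBOX[x ^ a]] += 1
    ddt_max = max(ddt_max, max(counts))
assert ddt_max == 4

def fwht(f):
    """In-place fast Walsh-Hadamard transform of a +/-1 sequence."""
    f = f[:]
    h = 1
    while h < len(f):
        for i in range(0, len(f), h * 2):
            for j in range(i, i + h):
                f[j], f[j + h] = f[j] + f[j + h], f[j] - f[j + h]
        h *= 2
    return f

# Linearity: max |W(a, b)| over all a and all nonzero output masks b.
lin = 0
for b in range(1, 256):
    f = [(-1) ** bin(SBOX[x] & b).count("1") for x in range(256)]
    lin = max(lin, max(abs(v) for v in fwht(f)))
assert lin == 32
```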
5.1.2 Classifications
When looking for S-boxes with good properties, we deal with a dimensionality
problem. The number of possible bijections on n bits is 2^n!, which prohibits
exhaustive search for n > 3. To manage the enormous search spaces of S-boxes,
we divide them into classes, defined based on an equivalence property.
Circuit Properties. The cost of an S-box circuit can be expressed with many
different metrics. We typically count the number of gates (gate complexity)
or look at the circuit depth. Depending on which gates we consider, we can
obtain different cost estimations. Stoffelen [Sto16] for example distinguishes
gate complexity for hardware implementations and bitslice gate complexity for
software implementations. The former considers all types of gates which can
be found in typical CMOS libraries (AND, OR, NOT, XOR, NAND, NOR,
XNOR), while the latter only considers those for which a CPU instruction exists
in most processors (AND, OR, NOT, XOR). The bitslice gate complexity can
be used as an indicator of the speed of a software implementation, since each
gate should map to one instruction. For hardware implementations, the gate
complexity is related to the area of a circuit. For the latency of a circuit, we
look at the circuit depth, which is the maximum number of gates on any path
from an input to an output. Note that we typically only consider 2-input gates
in these metrics for genericity and ease of comparison.
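As a toy illustration of these metrics, a circuit can be represented as a topologically ordered netlist of 2-input gates, from which the gate complexity and depth follow directly. The full-adder circuit below is a generic textbook example chosen by us, not one taken from this work:

```python
# A circuit as a list of (output_wire, gate, input_a, input_b) tuples,
# in topological order. Full adder: s = a XOR b XOR cin,
# cout = (a AND b) OR (cin AND (a XOR b)).
CIRCUIT = [
    ("t1",   "XOR", "a",   "b"),
    ("s",    "XOR", "t1",  "cin"),
    ("t2",   "AND", "a",   "b"),
    ("t3",   "AND", "cin", "t1"),
    ("cout", "OR",  "t2",  "t3"),
]

def gate_complexity(circuit):
    """Bitslice gate complexity: simply the number of 2-input gates."""
    return len(circuit)

def circuit_depth(circuit, inputs):
    """Circuit depth: maximum number of gates on any input-to-output path."""
    depth = {w: 0 for w in inputs}
    for out, _gate, a, b in circuit:  # assumes topological order
        depth[out] = 1 + max(depth[a], depth[b])
    return max(depth[out] for out, *_ in circuit)

assert gate_complexity(CIRCUIT) == 5
assert circuit_depth(CIRCUIT, ["a", "b", "cin"]) == 3  # critical path: t1 -> t3 -> cout
```

Restricting the gate set to AND/OR/NOT/XOR turns the same count into the bitslice gate complexity, one instruction per gate.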
XOR vs. AND. Any function can be represented in terms of AND, XOR
and NOT gates only, because these gates form a functionally complete set of
operators. It is therefore common to consider only these gate complexities.
Naturally, area or latency estimates based on gate counts are not exact, since
each different type of instruction or gate has a different area or delay. Exact cost
metrics can be obtained using gate-specific costs from a logic library, combined
with distinct gate counts (AND gate complexity, XOR gate complexity, . . . ). In
CMOS technology, a NAND gate consists of 4 transistors, while an XOR gate
requires as many as 8 (assuming the input complements are not yet available).
A linear function can thus be more expensive than a nonlinear function (in
hardware). Traditionally, circuits and S-boxes have been optimized according
to that philosophy. However, if we want efficient circuits for embedded systems
mind. The S-box is based on an inversion operation, which means that hardware
implementations for encryption and decryption can share the same inversion
block.
However, as we move towards a world with more and more embedded devices,
where side-channel attacks are a constant threat, we must shift our understanding
of implementation cost to one that takes SCA countermeasures into account.
In fact, since those countermeasures come with such large overheads, the
consideration of implementation cost in the design process becomes even
more important than before. The ongoing NIST lightweight cryptography
standardization contest even explicitly lists this as a requirement for candidates.
The same is stated for resistance against fault attacks. To achieve this, we need
designers to become familiar with how their decisions influence the cost.
In this section, we will first identify important goals for the designer and
properties to optimize based on the knowledge we have gathered from the
previous chapters. Next, we will discuss recent trends in the state-of-the-art on
cipher design and, in particular, assess how the NIST lightweight candidates
comply with the SCA requirement.
this specification in mind, often need more than that to keep the AND gate
complexity within reasonable bounds. For example, most masked AES S-boxes
in Chapter 2 require at least four instead of three quadratic steps. The number
of register stages mostly influences the latency of hardware implementations,
because it directly determines the number of clock cycles. However, the area
footprint is also affected by those registers, which have relatively high cost
compared to combinational logic on ASICs.
Nb. of Rounds
the cost of a large multiplicative depth, even if area is more important than
latency. Hence, in this case, the goal is to find S-boxes that have low level-D
multiplicative complexity, where D is ideally the minimal multiplicative depth
MD = ⌈log2(deg(S))⌉.
The Inverse. Often, when an encryption uses the S-box S, its inverse S⁻¹
is required for decryption. The cost of the inverse is not always considered,
because the cryptographic properties Diff and Lin are the same for S and S⁻¹.
The algebraic degree and multiplicative complexity, on the other hand, are not,
which means considering only the implementation cost of S may result in an
expensive S⁻¹. In the survey of S-boxes of Bilgin et al. [1], the cryptographic
and implementation properties of S-boxes and their inverses are investigated.
Moreover, they also consider the possibility of sharing resources between
encryption and decryption. The AES S-box, for example, uses an inversion,
which is naturally an involution. Hence both encryption and decryption can
use the same hardware components, which reduces the area footprint on a
device that needs to be able to do both. Other than involutions, Bilgin et al. [1]
identify several ways to minimize the combined area of S and S⁻¹ and propose a
selection of S-boxes that perform well in this regard, as well as cryptanalytically.
Bit Sizes: Large vs. Small. AES is one of the few block ciphers that uses
an 8-bit S-box. Most block ciphers use a 4-bit S-box. There are two ways
to look at the choice of S-box size: from a cryptanalytic point-of-view and
from a SCA point-of-view. The trade-off between cryptographic strength and
implementation cost of small and large S-boxes is again complicated by the
involvement of the linear layers. Hence, we leave it to the cryptanalysts to
investigate it. Nevertheless, it is probable that the popularity of small bit
sizes (e.g. 4) is more due to the lack of knowledge on the search space of
larger S-boxes than due to a qualitative advantage. In addition, the success
of AES has suppressed other ciphers that use an 8-bit S-box. From a SCA
point-of-view however, larger S-boxes may enjoy some benefit in LUT-based
implementations. Recall that differential power analysis (DPA) on AES requires
2^8 = 256 hypotheses to be made on each 8-bit subkey. This number is directly
determined by the size of the S-box (or more specifically, the number of input
bits that each output bit depends on). In a similar cipher with 4-bit S-boxes,
only 2^4 = 16 hypotheses would have to be made per subkey. More generally, in a
state of B bits with n-bit S-boxes, a DPA attack requires 2^(n − log2 n) · B = (B/n) · 2^n hypotheses
to recover the entire round key. Hence, very large S-box sizes could interfere
with the divide and conquer strategy of SCA. The problem is that their search
spaces are too large to explore.
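The divide-and-conquer strategy described above can be sketched as a toy DPA on a single 4-bit subkey with simulated Hamming-weight leakage, ranking all 2^4 = 16 hypotheses by correlation. The PRESENT S-box, the noise level, and all parameter names are illustrative choices of ours:

```python
import random

# PRESENT S-box, a standard 4-bit S-box (example choice).
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
HW = [bin(v).count("1") for v in range(16)]

def simulate_traces(key_nibble, n_traces, sigma=0.5, seed=1):
    """Hamming-weight leakage of the S-box output plus Gaussian noise."""
    rng = random.Random(seed)
    pts = [rng.randrange(16) for _ in range(n_traces)]
    leak = [HW[SBOX[p ^ key_nibble]] + rng.gauss(0, sigma) for p in pts]
    return pts, leak

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def dpa_recover(pts, leak):
    """Rank all 2^4 = 16 subkey hypotheses; return the best-correlating one."""
    scores = [(abs(pearson([HW[SBOX[p ^ k]] for p in pts], leak)), k)
              for k in range(16)]
    return max(scores)[1]

pts, leak = simulate_traces(key_nibble=0xA, n_traces=500)
assert dpa_recover(pts, leak) == 0xA
```

A full round key is then recovered one subkey at a time, which is where the (B/n) · 2^n hypothesis count comes from.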
Bit Sizes: Odd vs. Even. Another contrast in S-box sizes is that between
odd and even. Traditionally, S-boxes whose size is a power of two were often chosen,
because of the datapath width in processors. For hardware implementations
or bitsliced software implementations, this restriction does not make sense,
but still, it is challenging to fit an odd-sized S-box into a block cipher with
state size a power of two (e.g. 128 or 256). As a result, even-sized S-boxes
(mostly 4) dominate in the literature. However, both from a cryptanalytic and
implementation perspective, odd-sized S-boxes show an advantage over even-
sized ones. The results of Bilgin et al. [1] show that S-boxes of odd size n achieve
the same cryptographic strength as S-boxes of even size n + 1, but at lower
cost. They are especially interesting when it comes to low latency applications,
since for every odd size n, there exists at least one AE class of quadratic APN
S-boxes. These are S-boxes with optimal cryptographic properties, that can be
implemented at minimal latency.
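The claim about quadratic APN permutations of odd size is easy to check for n = 5: the cube map x ↦ x³ over F_{2^5} (a Gold function) is a quadratic APN permutation. A small sketch of ours, assuming the irreducible polynomial x^5 + x^2 + 1:

```python
def gf32_mul(a, b, poly=0x25):
    """Multiplication in GF(2^5) modulo x^5 + x^2 + 1 (0x25)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x20:
            a ^= poly
        b >>= 1
    return r

# The Gold function x -> x^3 over GF(2^5).
CUBE = [gf32_mul(x, gf32_mul(x, x)) for x in range(32)]

# Bijective, since gcd(3, 2^5 - 1) = gcd(3, 31) = 1.
assert sorted(CUBE) == list(range(32))

# APN: every nonzero input difference yields each output difference
# at most twice, i.e. the differential uniformity is 2.
ddt_max = max(
    max(sum(1 for x in range(32) if CUBE[x] ^ CUBE[x ^ a] == b)
        for b in range(32))
    for a in range(1, 32)
)
assert ddt_max == 2
```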
We will now look at the literature from recent years and show that some
first steps have been taken towards the above goals. We will also critically
assess some candidates from the NIST lightweight competition that claim
to have taken the cost of side-channel countermeasures into account in the
design process. Note that our expertise does not extend to cryptanalysis and
that many of the discussed ciphers are relatively new, i.e. not as scrutinized
and established as AES. We therefore limit our treatment to an evaluation of
the implementation properties only and say nothing about the cryptographic
strength.
Multi-party Computation. The link between masking and the field of multi-
party computation (MPC) has appeared multiple times in this work. Both areas
use secret sharing, which causes nonlinear operations to be more expensive
than linear ones. As a result, we can see recent efforts into the design of
cryptographic primitives with low multiplicative complexity. Albrecht et
al. [ARS+ 15] introduced a family of ciphers, called LowMC, which is intended
to minimize both its multiplicative complexity and depth. The design is based
on a substitution-permutation network (SPN) with 3-bit S-boxes of M D = 1.
For AES-like security parameters, they repeat the SPN for 12 rounds, which
results in a total multiplicative depth of 12. Albrecht et al. [AGR+16] also
introduce MiMC, a very simple construction consisting only of key additions
and the quadratic map x ↦ x^3 in a finite field F_q with q prime or a power of
two. This latter cipher focuses more on multiplicative complexity than depth.
They need to repeat the round function 82 times to achieve AES-like security
parameters, so their total multiplicative depth is even worse than that of AES.
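To make the structure concrete, here is a heavily simplified MiMC-style toy of our own: the prime and round constants are tiny illustrative values, not real MiMC parameters. Cubing is a bijection because p ≡ 2 (mod 3), so decryption applies the inverse exponent of the cube map:

```python
# Toy MiMC-style permutation over F_p. Illustrative parameters only.
P = 101                      # small prime with P mod 3 == 2, so x -> x^3 is a bijection
D = pow(3, -1, P - 1)        # inverse exponent: (x^3)^D = x in F_p (Python 3.8+)
CONSTANTS = [0, 17, 42, 73]  # made-up round constants; c_0 = 0 as in MiMC

def encrypt(x, k):
    """Each round: add key and round constant, then cube; final key addition."""
    for c in CONSTANTS:
        x = pow((x + k + c) % P, 3, P)
    return (x + k) % P

def decrypt(y, k):
    y = (y - k) % P
    for c in reversed(CONSTANTS):
        y = (pow(y, D, P) - k - c) % P
    return y

for x in range(P):
    assert decrypt(encrypt(x, k=29), k=29) == x
```

The multiplicative depth per round is 2 (one cubing), which is why many rounds drive the total depth above that of AES.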
decomposable, since they are assembled from quadratic building blocks. Note
however that their increased bit size does not increase the complexity of a DPA
attack, because the hypotheses can be made about the smaller (e.g. 4-bit)
subcomponents. Moreover, most of the S-boxes obtained in this way have quite
a large multiplicative depth and none achieve cryptographic properties as good
as the AES S-box.
Given the above-acquired knowledge on primitive design and given that the NIST
lightweight competition explicitly states that the cost of SCA countermeasures
should be taken into account, we now take a look at some of the Round 2
candidates.2 This investigation constitutes a new contribution of this thesis.
Side-Channel Claims. Many (not all) candidates make a note about having
considered side-channel attacks. However, this claim is often not very well-
argued. In some cases, it is justified by the fact that the design uses
“easy-to-mask” operations such as bitwise functions. While this is more
convenient for the masking designer, it gives no guarantees about the total
cost. Some proposals use existing primitives and use their lightweight property
as justification. However, these primitives were not necessarily designed with
SCA in mind. Other candidates use AES and argue that a lot of research
exists on masking the AES. The many schemes presented in Section 2.3 indeed
confirm that there is an abundance of available literature on the subject, but the
existence of a lot of research does not imply that its results are most efficient.
This holds especially for mask conversions between Boolean and arithmetic
masking [BCZ18, CGTV15], which are required for ARX ciphers. In the NIST
2 Descriptions can all be found at https://csrc.nist.gov/projects/lightweight-cryptography/round-2-candidates
initialization phase of a mode. Since asymptotically only the rounds per plaintext block
matter, we will not consider initialization rounds here. Note however that for short messages,
the initialization rounds will be dominant.
4 When a primitive is used in a sponge construction, we divide by the rate r, since this
A Note on Leakage Resilience. We note that our analysis considers only the
internal building blocks of the NIST proposals, regardless of whether they are
used in a leakage resilient mode or not. We see this as a necessary first step
for comparison. Moreover, since different candidates rely on different types of
leakage resilience, making a more detailed comparison is challenging. For an
investigation into the leakage resilience of several candidates, we refer to the
work of Bellizia et al. [BBC+ 20].
Observations
The existence of Tables 5.2 and 5.3 is immediately justified by the large variability
in some of their columns. We make some interesting observations here.
Table 5.2: Comparison of NIST candidates for Hardware. (n = S-box size, B = block size or permutation state size, r = rate for Sponge)

| Primitive | n | B | r | # Rnds | S-box MC | MC/bit | S-box MD | Tot MD | Tot MD/bit | Candidates |
|---|---|---|---|---|---|---|---|---|---|---|
| XOODOO | 3 | 384 | 128 | 12 | 3 | 1 | 1 | 12 | 0.09375 | Xoodyak |
| Pyjamask | 3/4 | 96/128 | – | 14 | 3/4 | 1 | 1/2 | 14/28 | ≥ 0.146 | Pyjamask |
| Clyde | 4 | 128 | – | 12 | 4 | 1 | 2 | 24 | 0.1875 | Spook |
| GIFT (I) | 4 | 64/128 | – | 28/40 | 5 | 1.25 | 2 | 56/80 | ≥ 0.625 | ESTATE, GIFT-COFB, HYENA, LOTUS/LOCUS, SUNDAE-GIFT |
| GIFT (II) | 4 | 64/128 | – | 28/40 | 4 | 1 | 4 | 112/160 | ≥ 1.25 | idem |
| KNOT | 4 | 256 | 64 | 28* | 4 | 1 | 2 | 56* | 0.875* | KNOT |
| PHOTON | 4 | 256 | 32/128 | 12 | 4 | 1 | 2 | 24 | ≥ 0.1875 | PHOTON-Beetle |
| Shadow | 4 | 512 | 256 | 12 | 4 | 1 | 2 | 24 | 0.09375 | Spook |
| Spongent | 4 | 160/176 | – | 80/90 | 5 | 1.25 | 2 | 160/180 | ≥ 1 | Elephant |
| ForkSkinny | 4/8 | 64/128 | – | 40/48 | 4/8 | 1 | 2/4 | 80/192 | ≥ 1.25 | ForkAE |
| ASCON | 5 | 320 | 64/128 | 6*/8* | 5 | 1 | 1 | 6*/8* | ≥ 0.0625* | Ascon, ISAP |
| GASCON | 5 | 320 | 128 | 7* | 5 | 1 | 1 | 7* | 0.055* | DryGASCON |
| Keccak | 5 | 200 | – | 18 | 5 | 1 | 1 | 18 | 0.09 | Elephant |
| Keccak | 5 | 400 | 144 | 8 | 5 | 1 | 1 | 8 | 0.056 | ISAP |
| AES | 8 | 128 | – | 10 | 32 | 4 | 4 | 40 | 0.3125 | ESTATE, mixFEED, SAEAES |
| Skinny | 8 | 128 | – | 48/56 | 8 | 1 | 4 | 192/224 | ≥ 1.5 | Romulus, SKINNY-AEAD |
| GIMLI | 96 | 384 | 128 | 24 | 96 | 1 | 1 | 24 | 0.1875 | Gimli |
| Subterranean 2.0 | 257 | 257 | 32 | 1* | 257 | 1 | 1 | 1* | 0.03125* | Subterranean 2.0 |

* Given a larger number of initialization rounds
Table 5.3: Comparison of NIST candidates for Software. (n = S-box size, B = block size or permutation state size, r = rate for Sponge)

| Primitive | n | B | r | # Rnds | S-box MC | Tot MC/bit (no bitslice) | Tot MC/bit (16-bit) | Tot MC/bit (32-bit) | Tot MC/bit (64-bit) | Candidates |
|---|---|---|---|---|---|---|---|---|---|---|
| XOODOO | 3 | 384 | 128 | 12 | 3 | 36 | 2.25 | 1.125 | 0.5625 | Xoodyak |
| Pyjamask | 3/4 | 96/128 | – | 14 | 3/4 | 14 | 0.875 | 0.4375 | 0.4375 | Pyjamask |
| Clyde | 4 | 128 | – | 12 | 4 | 12 | 0.75 | 0.375 | 0.375 | Spook |
| GIFT | 4 | 64/128 | – | 28/40 | 4 | 28/40 | 1.75/2.5 | 1.75/1.25 | 1.75/1.25 | ESTATE, GIFT-COFB, HYENA, LOTUS/LOCUS, SUNDAE-GIFT |
| KNOT | 4 | 256 | 64 | 28* | 4 | 112* | 7* | 3.5* | 1.75* | KNOT |
| PHOTON | 4 | 256 | 32/128 | 12 | 4 | 96/24 | 6/1.5 | 3/0.75 | 1.5/0.38 | PHOTON-Beetle |
| Shadow | 4 | 512 | 256 | 12 | 4 | 24 | 1.5 | 0.75 | 0.375 | Spook |
| Spongent | 4 | 160/176 | – | 80/90 | 5 | 100/112.5 | 7.5/7.67 | 5/5.11 | 2.5/2.56 | Elephant |
| ForkSkinny | 4/8 | 64/128 | – | 40/48 | 4/8 | 40/96 | 2.5/6 | 2.5/3 | 2.5/3 | ForkAE |
| ASCON | 5 | 320 | 64/128 | 6*/8* | 5 | 30/20* | 1.88/1.25* | 0.94/0.63* | 0.47/0.31* | Ascon, ISAP |
| GASCON | 5 | 320 | 128 | 7* | 5 | 17.5* | 1.09* | 0.55* | 0.27* | DryGASCON |
| Keccak | 5 | 200 | – | 18 | 5 | 18 | 1.35 | 0.9 | 0.45 | Elephant |
| Keccak | 5 | 400 | 144 | 8 | 5 | 22.22 | 1.39 | 0.83 | 0.56 | ISAP |
| AES | 8 | 128 | – | 10 | 32 | 40 | 2.5 | 2.5 | 2.5 | ESTATE, mixFEED, SAEAES |
| Skinny | 8 | 128 | – | 48/56 | 8 | 48/56 | 3/3.5 | 3/3.5 | 3/3.5 | Romulus, SKINNY-AEAD |
| GIMLI | 96 | 384 | 128 | 24 | 96 | 72 | 4.5 | 2.25 | 1.125 | Gimli |
| Subterranean 2.0 | 257 | 257 | 32 | 1* | 257 | 8.03* | 0.53* | 0.28* | 0.16* | Subterranean 2.0 |

* Given a larger number of initialization rounds
MC/bit. The MC/bit is almost identical for all non-AES proposals and there
is little to no need for improvement in that aspect.
Number of Rounds. The largest contrasts arise from differences in the number
of rounds, which plays an important role when speed is a priority. With respect
to the metrics Tot MC/bit and Tot MD/bit, we see that several primitives are
not competitive with AES (e.g. Spongent and Skinny, among others) due to a large
number of rounds. We note that this design parameter is highly dependent on
the designer’s choice of security margin.
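The normalisation behind these metrics, as we read it off the tables (total multiplicative depth divided by the rate for sponge constructions, otherwise by the block size; this reading is our assumption), can be spelled out as a one-line computation:

```python
def tot_md_per_bit(rounds, sbox_md, block, rate=None):
    """Total multiplicative depth per processed bit: rounds times the per-round
    S-box depth, normalised by the sponge rate if there is one, else by the
    block size (assumption: this matches the convention of Tables 5.2/5.3)."""
    return rounds * sbox_md / (rate if rate is not None else block)

assert tot_md_per_bit(12, 1, 384, rate=128) == 0.09375  # Xoodoo
assert tot_md_per_bit(10, 4, 128) == 0.3125             # AES
assert tot_md_per_bit(48, 4, 128) == 1.5                # Skinny, 48 rounds
```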
and Clyde. It also shows that various proposals offer little advantage over AES
in this regard.
Contribution. In our work from FSE 2020 [5], we introduce an extension of the
algorithm of Biryukov et al. [BDBP03], which allows finding the representative
of an AE class for non-bijective n × m functions with m < n. Thanks to this
new algorithm, we can adapt the search methodology of Bozilov et al. [BBS17]
and reduce the computation time to classify 5-bit quadratic permutations from
several hours (using 16 threads) to a mere six minutes (using 4 threads).5
This optimization makes it possible to also classify the 6-bit quadratic Boolean
functions for the first time. In addition, it enables the classification of (balanced)
non-bijective Boolean functions from n bits to m < n bits for n ≤ 6. We also
provide a second tool for finding length-two decompositions of higher-degree
permutations, which can be useful to create efficient masked implementations.
We demonstrate it by decomposing the 5-bit AB and APN permutations. We
can also use this tool to generate new high-quality S-boxes, which can be
decomposed. This work can be found on page 253.
5 On a Linux machine with an Intel Core i5-6500 processor at 3.20GHz.
MY CONTRIBUTIONS IN THIS CONTEXT 117
5.4 Conclusion
The Odd One Out. S-boxes of odd size achieve better cryptographic properties
at lower depth than S-boxes of even size. Nevertheless, even and in particular
4-bit S-boxes dominate in the literature. The exception is the 5-bit Keccak
S-box, which after being adopted in SHA-3, has been included into many other
primitives. Yet, there are many more possibilities which have not been used,
such as 7- or 9-bit APNs.
Conclusion
In this work, we have looked at four different but closely related aspects
of embedded systems security. In Chapter 2, we explained the masking
countermeasure in detail and considered the community's developments of
recent years, in particular with regard to the masked multiplication building
block and trade-offs in implementing the AES. Chapter 3 dealt with two types
of analysis of masked implementations. On the one hand, we presented the
state-of-the-art on side-channel attacks and more specifically differential power
analysis. On the other hand, we investigated the problem of verifying masked
implementations and looked at this challenge from both a theoretical and
practical perspective, uniting the two as much as possible. In Chapter 4, we
extended the issue of side-channel attacks to more general physical attacks
including fault and combined attacks. We studied the state-of-the-art on
both attacks and countermeasures. Chapter 5 looked at how we can optimize
embedded system design by incorporating the cost of countermeasures into
the design of cryptographic primitives. We collected a set of criteria and
contemplated recent proposals from the literature. In this final chapter, we
will summarize our contributions in each of these areas and recollect a few
conclusions from each chapter.
Contributions
We now recall the research questions from Chapter 1 and consider our
contributions to each.
Throughout the previous chapters, there were recurring themes, which we use
now to recall the most important conclusions and directions for future work.
Low Latency. In various places, we have seen that low latency is often not
a priority. The majority of the masked implementations in Chapter 2 have
been designed for optimal area or randomness use, using a serial architecture.
The use of dual-rail logic dates back more than five years, but can
result in a masked AES using very few clock cycles. It is an interesting topic
for further research, also for combined countermeasures (Chapter 4), since
it exhibits inherent redundancy. We saw the same trend in the design of
cryptographic primitives in Chapter 5. While the last few years have seen an
increase of designs for low multiplicative complexity, the same cannot be said
for multiplicative depth. Future research should consider the latency as more
important, since modern technologies with increasingly smaller transistor sizes
make area relatively less limited.
Randomness. Our work is based on the assumption that random bits are
readily available for masking and combined countermeasures. In practice, we
noted that this is achieved with a PRNG. This critical component requires
a lot more attention in the literature. In Chapter 2, we noted that a PRNG
for masking probably does not need to be masked itself, but that specific
constructions should be proposed and analysed for clarity on the cost of
randomness. Furthermore, in Chapter 4, it became clear that there is a need
for PRNGs that are robust in the presence of faults. This means on the one
hand, that we need to add fault detection mechanisms, but on the other hand,
that we need to work out proper procedures for when an error is detected. At
this point, this issue might be more important to investigate than combined
countermeasures themselves, since a failure of randomness allows one to trivially
bypass the countermeasures.
Bibliography
[ABB+14] Andreeva, E., Bilgin, B., Bogdanov, A., Luykx, A.,
Mendel, F., Mennink, B., Mouha, N., Wang, Q., and
Yasuda, K. PRIMATEs: Submission to the CAESAR compe-
tition. https://competitions.cr.yp.to/round1/primatesv1.pdf,
March 2014.
[Abe10] Abe, M. (ed.). Advances in Cryptology - ASIACRYPT 2010 -
16th International Conference on the Theory and Application of
Cryptology and Information Security, Singapore, December 5-9,
2010. Proceedings, Lecture Notes in Computer Science, vol. 6477.
Springer, 2010.
[AG01] Akkar, M. and Giraud, C. An implementation of DES and
AES, secure against some attacks. In Koç et al. [KNP01], 309–318.
[AGR+ 16] Albrecht, M.R., Grassi, L., Rechberger, C., Roy, A.,
and Tiessen, T. MiMC: Efficient encryption and cryptographic
hashing with minimal multiplicative complexity. In Cheon and
Takagi [CT16], 191–219.
[AH17] Avanzi, R. and Heys, H.M. (eds.). Selected Areas in
Cryptography - SAC 2016 - 23rd International Conference, St.
John’s, NL, Canada, August 10-12, 2016, Revised Selected Papers,
Lecture Notes in Computer Science, vol. 10532. Springer, 2017.
[AJ01] Attali, I. and Jensen, T.P. (eds.). Smart Card Programming
and Security, International Conference on Research in Smart Cards,
E-smart 2001, Cannes, France, September 19-21, 2001, Proceedings,
Lecture Notes in Computer Science, vol. 2140. Springer, 2001.
[ARS+ 15] Albrecht, M.R., Rechberger, C., Schneider, T., Tiessen,
T., and Zohner, M. Ciphers for MPC and FHE. In Oswald and
Fischlin [OF15], 430–454.
[BBP+ 16] Belaïd, S., Benhamouda, F., Passelègue, A., Prouff, E.,
Thillard, A., and Vergnaud, D. Randomness complexity of
private circuits for multiplication. In Fischlin and Coron [FC16],
616–648.
[BBS17] Bozilov, D., Bilgin, B., and Sahin, H.A. A note on 5-bit
quadratic permutations’ classification. IACR Trans. Symmetric
Cryptol., 2017(2017)(1), 398–404.
[BC13] Bertoni, G. and Coron, J. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2013 - 15th International Workshop,
Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, Lecture
Notes in Computer Science, vol. 8086. Springer, 2013.
[BCG+ 12] Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B.,
Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V.,
Paar, C., Rechberger, C., Rombouts, P., Thomsen, S.S.,
and Yalçin, T. PRINCE - A low-latency block cipher for pervasive
computing applications - extended abstract. In Wang and Sako
[WS12], 208–225.
[BCN+ 06] Bar-El, H., Choukri, H., Naccache, D., Tunstall, M.,
and Whelan, C. The sorcerer’s apprentice guide to fault attacks.
Proceedings of the IEEE, 94(2006)(2), 370–382.
[BDF+ 17] Barthe, G., Dupressoir, F., Faust, S., Grégoire, B.,
Standaert, F., and Strub, P. Parallel implementations of
masking schemes and the bounded moment leakage model. In
Coron and Nielsen [CN17], 535–566.
[BDPV11] Bertoni, G., Daemen, J., Peeters, M., and Van Assche, G.
The Keccak reference. http://keccak.noekeon.org/, 2011.
[BF19] Bilgin, B. and Fischer, J. (eds.). Smart Card Research and
Advanced Applications, 17th International Conference, CARDIS
2018, Montpellier, France, November 12-14, 2018, Revised Selected
Papers, Lecture Notes in Computer Science, vol. 11389. Springer,
2019.
[BFG15] Balasch, J., Faust, S., and Gierlichs, B. Inner product
masking revisited. In Oswald and Fischlin [OF15], 486–510.
[BFG+ 17] Balasch, J., Faust, S., Gierlichs, B., Paglialonga, C., and
Standaert, F. Consolidating inner product masking. In Takagi
and Peyrin [TP17], 724–754.
[BFGV12] Balasch, J., Faust, S., Gierlichs, B., and Verbauwhede,
I. Theory and practice of a leakage resilient masking scheme. In
Wang and Sako [WS12], 758–775.
[BG12] Bertoni, G. and Gierlichs, B. (eds.). 2012 Workshop on
Fault Diagnosis and Tolerance in Cryptography, Leuven, Belgium,
September 9, 2012. IEEE Computer Society, 2012.
[BG13] Battistello, A. and Giraud, C. Fault analysis of infective
AES computations. In Fischer and Schmidt [FS13], 101–107.
[BGG+ 14] Balasch, J., Gierlichs, B., Grosso, V., Reparaz, O., and
Standaert, F. On the cost of lazy engineering for masked software
implementations. In Joye and Moradi [JM15], 64–81.
[BGG+ 16] Boss, E., Grosso, V., Güneysu, T., Leander, G., Moradi,
A., and Schneider, T. Strong 8-bit sboxes with efficient masking
in hardware. In Gierlichs and Poschmann [GP16], 171–193.
[BGI+ 18] Bloem, R., Groß, H., Iusupov, R., Könighofer, B.,
Mangard, S., and Winter, J. Formal verification of masked
hardware implementations in the presence of glitches. In Nielsen
and Rijmen [NR18], 321–353.
[BGK04] Blömer, J., Guajardo, J., and Krummel, V. Provably secure
masking of AES. In Handschuh and Hasan [HH04], 69–83.
[BGK+ 07] Breveglieri, L., Gueron, S., Koren, I., Naccache, D.,
and Seifert, J. (eds.). Fourth International Workshop on Fault
Diagnosis and Tolerance in Cryptography, 2007, FDTC 2007:
Vienna, Austria, 10 September 2007. IEEE Computer Society, 2007.
[BGN+ 14a] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. Higher-order threshold implementations. In Sarkar
and Iwata [SI14], 326–343.
[BGN+ 14b] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. A more efficient AES threshold implementation. In
Pointcheval and Vergnaud [PV14], 267–284.
[BGN+ 15] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. Trade-offs for threshold implementations illustrated
on AES. IEEE Trans. on CAD of Integrated Circuits and Systems,
34(2015)(7), 1188–1200.
[BGR18] Belaïd, S., Goudarzi, D., and Rivain, M. Tight private
circuits: Achieving probing security with the least refreshing. In
Peyrin and Galbraith [PG18], 343–372.
[BJK+ 16] Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi,
A., Peyrin, T., Sasaki, Y., Sasdrich, P., and Sim, S.M.
The SKINNY family of block ciphers and its low-latency variant
MANTIS. In Robshaw and Katz [RK16], 123–153.
[Can05] Canright, D. A very compact S-box for AES. In Rao and Sunar
[RS05], 441–455.
[CCZ98] Carlet, C., Charpin, P., and Zinoviev, V.A. Codes, bent
functions and permutations suitable for DES-like cryptosystems.
Des. Codes Cryptogr., 15(1998)(2), 125–156.
[CDL15] Canteaut, A., Duval, S., and Leurent, G. Construction
of lightweight S-boxes using Feistel and MISTY structures. In
Dunkelman and Keliher [DK16], 373–393.
[CFGR10] Clavier, C., Feix, B., Gagnerot, G., and Roussellet, M.
Passive and active combined attacks on AES: Combining fault
attacks and side channel analysis. In Breveglieri et al. [BJK+ 10],
10–19.
[CG09] Clavier, C. and Gaj, K. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2009, 11th International Workshop,
Lausanne, Switzerland, September 6-9, 2009, Proceedings, Lecture
Notes in Computer Science, vol. 5747. Springer, 2009.
[CGTV15] Coron, J., Großschädl, J., Tibouchi, M., and Vadnala,
P.K. Conversion from arithmetic to Boolean masking with
logarithmic complexity. In Leander [Lea15], 130–149.
[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., and Rohatgi, P. Towards
sound approaches to counteract power-analysis attacks. In Wiener
[Wie99], 398–412.
[CN17] Coron, J. and Nielsen, J.B. (eds.). Advances in Cryptology -
EUROCRYPT 2017 - 36th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Paris,
France, April 30 - May 4, 2017, Proceedings, Part I, Lecture Notes
in Computer Science, vol. 10210. 2017.
[EKM+08] Eisenbarth, T., Kasper, T., Moradi, A., Paar, C.,
Salmasizadeh, M., and Shalmani, M.T.M. On the power
of power analysis in the real world: A complete break of the
KeeLoq code hopping scheme. In Wagner [Wag08], 203–220.
[FC16] Fischlin, M. and Coron, J. (eds.). Advances in Cryptology -
EUROCRYPT 2016 - 35th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Vienna,
Austria, May 8-12, 2016, Proceedings, Part II, Lecture Notes in
Computer Science, vol. 9666. Springer, 2016.
[GJJR11] Goodwill, G., Jun, B., Jaffe, J., and Rohatgi, P. A testing
methodology for side-channel resistance validation. (2011).
[GLM16] Güneysu, T., Leander, G., and Moradi, A. (eds.). Lightweight
Cryptography for Security and Privacy - 4th International
Workshop, LightSec 2015, Bochum, Germany, September 10-11,
2015, Revised Selected Papers, Lecture Notes in Computer Science,
vol. 9542. Springer, 2016.
[GM06] Goubin, L. and Matsui, M. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2006, 8th International Workshop,
Yokohama, Japan, October 10-13, 2006, Proceedings, Lecture Notes
in Computer Science, vol. 4249. Springer, 2006.
[Ham50] Hamming, R.W. Error detecting and error correcting codes. The
Bell System Technical Journal, 29(1950)(2), 147–160.
[MMR20] Moos, T., Moradi, A., and Richter, B. Static power side-
channel analysis - an investigation of measurement factors. IEEE
Trans. Very Large Scale Integr. Syst., 28(2020)(2), 376–389.
[Moo19] Moos, T. Static power SCA of sub-100 nm CMOS asics and the
insecurity of masking schemes in low-noise environments. IACR
Trans. Cryptogr. Hardw. Embed. Syst., 2019(2019)(3), 202–232.
[MOP07] Mangard, S., Oswald, E., and Popp, T. Power analysis
attacks - revealing the secrets of smart cards. Springer, 2007.
[Mor89] Mora, T. (ed.). Applied Algebra, Algebraic Algorithms and Error-
Correcting Codes, 6th International Conference, AAECC-6, Rome,
Italy, July 4-8, 1988, Proceedings, Lecture Notes in Computer
Science, vol. 357. Springer, 1989.
[PYR+16] Picek, S., Yang, B., Rozic, V., Vliegen, J., Winderickx,
J., De Cnudde, T., and Mentens, N. PRNGs for masking
applications and their mapping to evolvable hardware. In Lemke-
Rust and Tunstall [LT17], 209–227.
[QPDK04] Quisquater, J., Paradinas, P., Deswarte, Y., and
Kalam, A.A.E. (eds.). Smart Card Research and Advanced
Applications VI, IFIP 18th World Computer Congress, TC8/WG8.8
& TC11/WG11.2 Sixth International Conference on Smart Card
Research and Advanced Applications (CARDIS), 22-27 August 2004,
Toulouse, France, IFIP, vol. 153. Kluwer/Springer, 2004.
[QS00] Quisquater, J. and Schneier, B. (eds.). Smart Card Research
and Applications, This International Conference, CARDIS ’98,
Louvain-la-Neuve, Belgium, September 14-16, 1998, Proceedings,
Lecture Notes in Computer Science, vol. 1820. Springer, 2000.
[QS01] Quisquater, J. and Samyde, D. Electromagnetic analysis
(EMA): measures and counter-measures for smart cards. In Attali
and Jensen [AJ01], 200–210.
[RBN+15] Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., and
Verbauwhede, I. Consolidating masking schemes. In Gennaro
and Robshaw [GR15], 764–783.
[Rep15] Reparaz, O. A note on the security of higher-order threshold
implementations. IACR Cryptology ePrint Archive, 2015(2015), 1.
[SSA+07] Shirai, T., Shibutani, K., Akishita, T., Moriai, S., and
Iwata, T. The 128-bit blockcipher CLEFIA (extended abstract).
In Biryukov [Bir07], 181–195.
[SSR19] Sako, K., Schneider, S., and Ryan, P.Y.A. (eds.). Computer
Security - ESORICS 2019 - 24th European Symposium on Research
in Computer Security, Luxembourg, September 23-27, 2019,
Proceedings, Part I, Lecture Notes in Computer Science, vol. 11735.
Springer, 2019.
[WSY+16] Wang, W., Standaert, F., Yu, Y., Pu, S., Liu, J., Guo,
Z., and Gu, D. Inner product masking for bitslice ciphers and
security order amplification for linear leakages. In Lemke-Rust and
Tunstall [LT17], 174–191.
[WVGX15] Wang, J., Vadnala, P.K., Großschädl, J., and Xu, Q.
Higher-order masking in practice: A vector implementation of
masked AES for ARM NEON. In Nyberg [Nyb15], 181–198.
[YYP+18] Yao, Y., Yang, M., Patrick, C., Yuce, B., and Schaumont,
P. Fault-assisted side-channel analysis of masked implementations.
In 2018 IEEE International Symposium on Hardware Oriented
Security and Trust, HOST 2018, Washington, DC, USA, April 30
- May 4, 2018, 57–64. IEEE Computer Society, 2018.
At first sight, it looks like the computational complexity is higher than that of
the original algorithm, but the inverses Li⁻¹Lj⁻¹ can easily be precomputed.
The Boolean masked multiplication of Ishai et al. [ISW03] with d + 1 shares
was proven to be d-SNI by Barthe et al. [BBD+16]. It differs slightly from the
description in Eq. 2.14, but the proof for the latter is equivalent. Balasch et
al. [BFG+17] also proved that their inner product masking multiplication is
d-SNI with d + 1 shares. We list the intermediates of Algorithm 1 with d + 1
shares below:

    xi , yi                               0 ≤ i ≤ d
    rij , Uij = Uji = rij Li⁻¹Lj⁻¹        0 ≤ j < i ≤ d
    xi yj , Tij = xi yj ⊕ Uij             0 ≤ i, j ≤ d
    Vij = Tij Lj                          0 ≤ i, j ≤ d
    Vi0 ⊕ · · · ⊕ Vik                     0 ≤ i, k ≤ d
One must prove that any set of t1 intermediates and t2 outputs can be simulated
using at most t1 shares of each input. For reference, the remaining steps of
Algorithm 1 are:

     6:     Uji = Uij
     7:   end for
     8: end for
     9: for i = 0 to n − 1 do
    10:   for j = 0 to n − 1 do
    11:     Tij = xi yj
    12:     Tij = Tij ⊕ Uij
    13:     Vij = Tij Lj
    14:   end for
    15: end for
    16: for i = 0 to n − 1 do
    17:   zi = Vi0 ⊕ · · · ⊕ Vi,n−1
    18: end for

Let Ix and Iy be the sets of shares
of respectively x and y required for simulation. We will show that for each
intermediate, at most one share index is added to Ix and Iy and for each output,
no share index is added. As a result, any t1 internal probes and t2 output
probes can be simulated with xIx and yIy for which |Ix | ≤ t1 and |Iy | ≤ t1 and
t1 + t2 ≤ d. In addition, we record the list L of all probes which are simulatable
with xIx and yIy .
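Before turning to the proof, the multiplication being analysed can be exercised concretely. The following is a purely functional Python model of Algorithm 1 over GF(2⁸): it ignores registers, glitches and the probing model entirely, and the field, the public vector L and all helper names are our own illustrative assumptions.

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254 (gf_inv(0) = 0 by convention)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def ip_share(x, L):
    """Inner-product sharing: x = L0*x0 XOR ... XOR Ld*xd, public non-zero Li."""
    xs = [secrets.randbelow(256) for _ in L[:-1]]
    acc = x
    for Li, xi in zip(L, xs):
        acc ^= gf_mul(Li, xi)
    xs.append(gf_mul(gf_inv(L[-1]), acc))  # fix the last share
    return xs

def ip_unshare(xs, L):
    acc = 0
    for Li, xi in zip(L, xs):
        acc ^= gf_mul(Li, xi)
    return acc

def ip_mul(xs, ys, L):
    """Algorithm 1: U_ij = U_ji = r_ij * Li^-1 * Lj^-1 (U_ii = 0),
    T_ij = x_i y_j XOR U_ij, V_ij = T_ij * L_j, z_i = XOR_j V_ij."""
    n = len(L)
    U = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            r = secrets.randbelow(256)
            U[i][j] = U[j][i] = gf_mul(r, gf_mul(gf_inv(L[i]), gf_inv(L[j])))
    zs = []
    for i in range(n):
        zi = 0
        for j in range(n):
            zi ^= gf_mul(gf_mul(xs[i], ys[j]) ^ U[i][j], L[j])
        zs.append(zi)
    return zs

L = (0x01, 0x02, 0x03)   # d = 2, i.e. three shares
x, y = 0x57, 0x83
zs = ip_mul(ip_share(x, L), ip_share(y, L), L)
assert ip_unshare(zs, L) == gf_mul(x, y)
```

Each pairwise mask rij cancels twice in the recombination (once through Uij and once through Uji), which is why the symmetric U matrix with a zero diagonal reproduces x ⊗ y.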
Proof.

(6) Vi0 ⊕ · · · ⊕ Vik : Ix ← Ix ∪ {i}, Iy ← Iy ∪ {i}, L ← L ∪ {xi , yi }. For all j = 0
to k − 1, compute Vij as follows:

(a) If Vij ∈ L: ok.

(b) Else if Uij ∈ L and yj ∈ L: Compute Tij = xi yj ⊕ Uij and Vij = Tij Lj .
L ← L ∪ {xi yj , Tij , Vij }.
(c) Else if Uij ∈ L and yj ∉ L: Consider how Uij was added to L:

• It was added by (3), which did not add any index to Iy . Hence
we can do Iy ← Iy ∪ {j} and go back to (b).

• It was added by (5) for a probe Tij or Vij , hence contradiction,
see (a).

• It was added by (5) for a probe Tji or Vji , which means i was
added to Iy twice. Hence we can do Iy ← Iy ∪ {j} and go back
to (b).

• It was added by (6) for a probe Vj0 ⊕ · · · ⊕ Vjk , hence yj ∈ L. This is
a contradiction, see (b).
(d) Else: Uij ∉ L and Uji ∉ L:

• If Tji ∉ L and yj ∈ L: Simulate rij , Uij , Uji as in (3) and
compute Tij = xi yj ⊕ Uij and Vij = Tij Lj . L ← L ∪
{xi yj , rij , Uij , Uji , Tij , Vij }.

• Else if Tji ∉ L and yj ∉ L: Assign a random value to Tij and
compute Vij = Tij Lj . L ← L ∪ {Tij , Vij }.

• Else: Tji ∈ L. With Uji ∉ L, this occurs only if Tji was
simulated randomly as above, which means yj ∈ L and xj ∈ L.
Simulate rij , Uij , Uji as in (3) and compute Tij = xi yj ⊕ Uij ,
Tji = xj yi ⊕ Uji , Vij = Tij Lj and Vji = Tji Li . L ← L ∪
{xi yj , xj yi , rij , Uij , Uji , Tij , Vij }. Replace Tji and Vji in L and
recompute the probe depending on them.

L ← L ∪ {Vi0 ⊕ · · · ⊕ Vik }.
(7) zi (output probe): We distinguish three cases:

• All Vij have been simulated: ∀j : Vij ∈ L. Compute zi .

• A partial sum has been simulated: Vi0 ⊕ · · · ⊕ Vik ∈ L for some k. Simulate
the remaining Vij as in (6).
• No partial sum, but some or no Vij have been simulated. The output
share zi depends on d random values, of which it has exactly one
(rij ) in common with each other output zj . There are at most d − 1
other probes (each may involve either one Vij or another output
share zj ). Hence, at least one random value of zi has not been used
in any observed wire. We can thus assign a random value to zi .
Part II
Publications
List of Publications
International Journals
[1] Bilgin, B., De Meyer, L., Duval, S., Levi, I., and Standaert, F.
Low AND depth and efficient inverses: a guide on S-boxes for low-latency
masking. Accepted for publication in IACR Transactions on Symmetric
Cryptology 2020(1), 2020.
[2] Wegener, F., De Meyer, L., and Moradi, A. Spin me right round:
rotational symmetry for FPGA-specific AES (extended version). Journal
of Cryptology (Jan 2020).
[3] De Meyer, L. Recovering the CTR_DRBG state in 256 traces. IACR
Trans. Cryptogr. Hardw. Embed. Syst. 2020, 1 (2020), 37–65.
[6] De Meyer, L., Arribas, V., Nikova, S., Nikov, V., and Rijmen, V.
M&M: Masks and MACs against physical attacks. IACR Trans. Cryptogr.
Hardw. Embed. Syst. 2019, 1 (2019), 25–50.
[7] De Meyer, L., Reparaz, O., and Bilgin, B. Multiplicative masking
for AES in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018,
3 (2018), 431–468.
[8] De Meyer, L., Moradi, A., and Wegener, F. Spin me right round:
rotational symmetry for FPGA-specific AES. IACR Trans. Cryptogr.
Hardw. Embed. Syst. 2018, 3 (2018), 596–626.
International Conferences
[10] Delpech de Saint Guilhem, C., De Meyer, L., Orsini, E., and
Smart, N. P. BBQ: using AES in picnic signatures. In Selected Areas in
Cryptography - SAC 2019 - 26th International Conference, Waterloo, ON,
Canada, August 12-16, 2019, Revised Selected Papers (2019), vol. 11959
of Lecture Notes in Computer Science, Springer, pp. 669–692.
[11] Groß, H., Stoffelen, K., De Meyer, L., Krenn, M., and
Mangard, S. First-order masking with only two random bits. In
Proceedings of ACM Workshop on Theory of Implementation Security
Workshop, TIS@CCS 2019, London, UK, November 11, 2019 (2019),
ACM, pp. 10–23.
[12] Purnal, A., Arribas, V., and De Meyer, L. Trade-offs in protecting
Keccak against combined side-channel and fault attacks. In Constructive
Side-Channel Analysis and Secure Design - 10th International Workshop,
COSADE 2019, Darmstadt, Germany, April 3-5, 2019, Proceedings (2019),
vol. 11421 of Lecture Notes in Computer Science, Springer, pp. 285–302.
[13] Reparaz, O., De Meyer, L., Bilgin, B., Arribas, V., Nikova, S.,
Nikov, V., and Smart, N. P. CAPA: the spirit of Beaver against
physical attacks. In Advances in Cryptology - CRYPTO 2018 - 38th
Annual International Cryptology Conference, Santa Barbara, CA, USA,
August 19-23, 2018, Proceedings, Part I (2018), vol. 10991 of Lecture
Notes in Computer Science, Springer, pp. 121–151.
Unpublished Manuscripts
Publication Data
My Contribution
Principal author.
Notes
Multiplicative Masking for AES in Hardware
Abstract. Hardware masked AES designs usually rely on Boolean masking and perform the
computation of the S-box using the tower-field decomposition. On the other hand, splitting
sensitive variables in a multiplicative way is more amenable for the computation of the AES S-
box, as noted by Akkar and Giraud. However, multiplicative masking needs to be implemented
carefully not to be vulnerable to first-order DPA with a zero-value power model. Up to now,
sound higher-order multiplicative masking schemes have been implemented only in software. In
this work, we demonstrate the first hardware implementation of AES using multiplicative masks.
The method is tailored to be secure even if the underlying gates are not ideal and glitches
occur in the circuit. We detail the design process of first- and second-order secure AES-128
cores, which result in the smallest die area to date among previous state-of-the-art masked AES
implementations with comparable randomness cost and latency. The first- and second-order
masked implementations improve on these designs by 29% and 18%, respectively. We deploy our
construction on a Spartan-6 FPGA and perform a side-channel evaluation. No leakage is
detected with up to 50 million traces for both our first- and second-order implementation. For
the latter, this holds both for univariate and bivariate analysis.
Keywords: DPA · Masking · Glitches · Sharing · Adaptive · Boolean · Multiplicative · AES ·
S-box · Side-channel
1 Introduction
Cryptographic primitives are designed to resist mathematical attacks such as linear or differential
cryptanalysis. The designer typically assumes a classic adversarial model, where encryption is
treated as a black box, only revealing inputs and outputs to adversaries. When these primitives are
deployed in embedded devices, unintended signals such as the instantaneous power consumption or
electromagnetic radiation leak sensitive information, effectively turning the black box into a gray
box. Side-channel analysis is a cheap and scalable technique that allows the adversary to exploit
these signals and extract secret keys or passwords. Hence, cryptography deployed into embedded
devices needs not only mathematical but also physical security.
One particularly powerful attack, differential power analysis (DPA), was introduced in 1999
by Kocher et al. [KJJ99]. In this type of attack, the adversary feeds different plaintexts to an
encryption algorithm using the same key and extracts sensitive information from the power traces
he collects. Today, we aim at providing security against dth-order DPA. In a dth-order DPA
attack, the adversary exploits any statistical moment of the power consumption up to order d.
Since, given sufficient noise, statistical moments become exponentially harder to estimate as the
order d grows (both in the number of samples and in computational time), a moderate security
target d = 1, 2 often suffices in practice, especially when used in conjunction with complementary
countermeasures [HOM06, CCD00].
In a side-channel secure implementation, the goal is to make the leakages of the values handled
in the implementation independent of the sensitive inputs and sensitive intermediate variables. At
the architectural level this is typically achieved by masking, which means the processed data is
probabilistically split into multiple shares in such a way that one can only recover the sensitive
data if all of its shares are known. Recovering secrets from the shares becomes exponentially
more difficult as the noise increases, since it corresponds to estimating higher-order statistical
moments [CJRR99, GP99].
Previous Work. The earliest masking schemes [GP99, Tri03, ISW03] were shown to be unsuitable
for hardware implementations by Mangard et al. [MPG05, MPO05]. The vulnerability arises when
Multiplicative Masking for AES in Hardware 163
unintended transitions of a signal or glitches occur, caused by non-idealities such as logic gates with
non-zero propagation delays or routing imbalances. The glitches problem can be addressed at many
levels: either by equalizing signal paths (which normally requires manual access to low-level routing
details and a careful characterization of the logic library), by adding synchronization elements (such
as registers or signal gating) or by using a masking scheme that is inherently secure under glitches.
Extensive research has been done on countermeasures based on secret sharing and multi-party
computation that are provably secure even in the presence of glitches. The prevailing schemes are
those of Prouff and Roche [PR11] and Threshold Implementations (TI) by Nikova et al. [NRS11]
which use polynomial and Boolean masking respectively. The latter was extended to higher-order
security by Bilgin et al. (higher-order TI) [BGN+14a]. The similarities and differences between TI
and the Private Circuits scheme [ISW03], which provides provable security if the circuit behaves
ideally (no glitches), were analysed by Reparaz et al. (Consolidated Masking Schemes) [RBN+15].
Reparaz et al. also discuss how ISW can be implemented to provide security on hardware. More
recently, Gross et al. presented Domain Oriented Masking [GMK16], which is also related to the
original Private Circuits scheme [ISW03], with additional registers against glitches and a different
randomness consumption. These masking schemes have all been applied to Canright’s tower-field
AES S-box [Can05] due to its small footprint and structure, resulting in a multitude of masked AES
implementations [MPL+11, BGN+14b, CRB+16, GMK17, UHA17]. Those of Ueno et al. [UHA17],
De Cnudde et al. [CRB+16] and Gross et al. [GMK17] are the smallest to date, with the latter
requiring much less randomness.
In this paper we follow a different avenue. We do not apply Boolean masking to Canright’s
tower-field decomposition, but instead, we revisit the well-known concept of switching between
different types of masking. Boolean masking schemes are compatible with linear operations but
difficult to work out for non-linear functions. Akkar and Giraud [AG01] were the first to propose an
adaptive masking scheme for AES at CHES 2001. The idea is to use Boolean masks for the affine
operations and multiplicatively masked values for multiplications (or in the case of AES, inversion)
and convert between the two types when necessary. At CHES 2002 [TSG02, GT02] an inherent
weakness of multiplicative masking was presented, namely that it is vulnerable to first-order DPA
because the zero element cannot be effectively masked multiplicatively. As a solution to this zero
problem, they proposed to map each zero element to a non-zero element. The adaptive masking
scheme was studied in depth and extended to higher-order security by Genelle et al. [GPQ11b]. So
far, it has only been used in software implementations.
Our Contribution. We present the first hardware implementation of an adaptively masked AES.
We describe glitch-resistant modules that convert between Boolean and multiplicative masking
and that attend to the zero problem, based on the algorithmic descriptions provided for software
in [GPQ10, GPQ11a, GPQ11b]. While this work focuses on the AES S-box, the methodology can
be used to mask any inverse or power map-based S-box [AGR+16]. We optimize the number of
inversions used and the randomness cost for first-order and second-order resistant AES, which both
achieve a smaller area than the current state-of-the-art masked hardware AES implementations
of [CRB+16] and [GMK17], while having comparable randomness and latency requirements. We
formally discuss the security of our S-box and its components up to the level current state-of-the-art
tools and methods allow. We also deploy our implementations into an FPGA for side-channel
evaluation using a non-specific leakage assessment test to analyse practical security in a lab
environment with low noise. No leakage is detected with up to 50 million traces, confirming that
the security claims hold empirically.
2 Preliminaries
Notation. Multiplication and addition in the field Fq = GF(2^k) are denoted by ⊗ and ⊕ respectively.
We use & for multiplication in the field GF(2) (i.e. the AND operation). For ease of notation,
we sometimes omit ⊗ and &. Square brackets [·] in formulas indicate where synchronization via
registers or memory elements is used. An element r ∈ Fq drawn uniformly at random from Fq is
denoted r ←$ Fq . We denote F*q = Fq \ {0}. The expected value of x is denoted E[x].
Boolean masking. A sensitive variable x is split into d + 1 Boolean shares bxi such that

x = (bx0 , . . . , bxd ) ⇔ x = bx0 ⊕ · · · ⊕ bxd
In this paper we also use multiplicative sharing, which in a side-channel context is typically defined
as

x = (px0 , . . . , pxd ) ⇔ x = (px0 )⁻¹ ⊗ · · · ⊗ (pxd−1 )⁻¹ ⊗ pxd
We refer to this sharing as a type-I multiplicative sharing. We further define a type-II multiplicative
sharing:

x = (qx0 , . . . , qxd ) ⇔ x = qx0 ⊗ · · · ⊗ qxd
This notation is more common in secret-sharing. We omit the superscript x when it is clear from
context.
Masked operations. In Boolean masking, linear operations can trivially be applied locally on
each share:
x ⊕ y = (bx0 , . . . , bxd ) ⊕ (by0 , . . . , byd ) = (bx0 ⊕ by0 , . . . , bxd ⊕ byd )
Non-linear operations such as a multiplication on the other hand are less straightforward and much
more costly to implement. The opposite situation arises if one uses multiplicative masking. In that
case, linear operations are non-trivial but multiplication is local, e.g. for type-II sharings
x ⊗ y = (qx0 ⊗ qy0 , . . . , qxd ⊗ qyd ).
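Both behaviours are easy to reproduce in Python over GF(2⁸). This is an illustrative sketch only: the field, the helper names and the test values are our own assumptions, and multiplicative shares are drawn only for non-zero secrets.

```python
import secrets
from functools import reduce

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def bool_share(x, d):
    """d+1 Boolean shares: x = b0 XOR ... XOR bd."""
    bs = [secrets.randbelow(256) for _ in range(d)]
    bs.append(reduce(lambda a, b: a ^ b, bs, x))
    return bs

def mult2_share(x, d):
    """d+1 type-II multiplicative shares of a non-zero x: x = q0 * ... * qd."""
    qs = [1 + secrets.randbelow(255) for _ in range(d)]
    qs.append(reduce(gf_mul, map(gf_inv, qs), x))
    return qs

d, x, y = 2, 0x57, 0x83

# XOR is share-wise (local) on Boolean shares ...
bx, by = bool_share(x, d), bool_share(y, d)
assert reduce(lambda a, b: a ^ b, (s ^ t for s, t in zip(bx, by))) == x ^ y

# ... while multiplication is share-wise on type-II multiplicative shares.
qx, qy = mult2_share(x, d), mult2_share(y, d)
assert reduce(gf_mul, (gf_mul(s, t) for s, t in zip(qx, qy))) == gf_mul(x, y)
```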
Finding an efficient but glitch-resistant way to process Boolean shares in a non-linear operation
has been a hot topic in the last years. A natural strategy is to switch back and forth between
masked representations and perform each operation in its most compatible setting.
The zero-value problem. The fundamental security flaw of multiplicative masking was first
pointed out by Trichina [TSG02] and Golić and Tymen [GT02]. Multiplicative masking cannot
securely encode the value 0. The mean power consumption of a single share pxi reveals whether the
underlying secret is zero or non-zero, since E[pxi | x = 0] ≠ E[pxi | x ≠ 0] for any share index i. This
means that for any number of shares, the original multiplicative masking scheme is vulnerable to
first-order DPA.
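The flaw is easy to observe in simulation. In a first-order type-I sharing x = p0⁻¹ ⊗ p1, the share p1 = p0 ⊗ x is the constant 0 whenever x = 0, so its average immediately distinguishes zero from non-zero secrets. The Python sketch below uses our own field helpers and arbitrary sample counts:

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def type1_share(x):
    """First-order type-I multiplicative sharing: x = p0^{-1} * p1."""
    p0 = 1 + secrets.randbelow(255)   # p0 uniform in F_q*
    return p0, gf_mul(p0, x)

# The second share is 0 for every masking of x = 0 ...
assert {type1_share(0)[1] for _ in range(200)} == {0}
# ... and never 0 for a non-zero secret: E[p1 | x = 0] != E[p1 | x != 0].
assert 0 not in {type1_share(0x2A)[1] for _ in range(200)}
```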
Non-completeness. The concept of non-completeness appears in the work of Nikova et al. [NRS11]
and follow-up works on higher-order security [BGN+14a, RBN+15]. Non-completeness between
register stages has become a fundamental property for constructing provably secure hardware
implementations even if the underlying logic gates glitch. We recall the definition of non-
completeness: a shared implementation f operating on a shared input x satisfies dth-order
non-completeness if any combination of up to d of its component functions is independent of at
least one input share.
Masked Multiplier. Reparaz et al. [RBN+15] showed that a dth-order masked multiplication in
hardware can be constructed using only d + 1 shares if the sharings of the inputs are independent
(so as not to break non-completeness). One approach is detailed in [GMK16] and is
referred to as Domain Oriented Masking (DOM).
Our work uses as a masked AND gate the DOM-indep multiplier from [GMK16]. Let x = (bx0 , bx1 )
and y = (by0 , by1 ) be first-order Boolean sharings of bits x and y. A sharing of the multiplication
result z = x&y is obtained by first calculating four partial products tij = bxi &byj , i, j ∈ {0, 1} as
in [ISW03]. When i ≠ j, tij is called a cross-domain term and must be refreshed with a randomly
drawn bit r ←$ GF(2). After a register stage for synchronization, the shares (bz0 , bz1 ) are computed.
Note that we employ the special version of the DOM-indep multiplier where only the cross-
domain terms are synchronized in registers. For efficiency, these registers are clocked on the negative
edge as is done in [GSM17]. This is illustrated for the first-order multiplier in Figure 1.
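In software, the gate can be modeled as below. This is a bit-level functional sketch only: the register stages are merely marked by comments and the function name is our own.

```python
def dom_and(x0, x1, y0, y1, r):
    """First-order DOM-indep multiplier for bits x = x0^x1, y = y0^y1.
    The two cross-domain terms are refreshed with r and, in hardware,
    stored in (negative-edge) registers before the final compression."""
    t00 = x0 & y0          # inner-domain term
    t01 = (x0 & y1) ^ r    # cross-domain term  -> register
    t10 = (x1 & y0) ^ r    # cross-domain term  -> register
    t11 = x1 & y1          # inner-domain term
    return t00 ^ t01, t11 ^ t10   # shares of z = x & y

# Exhaustive correctness check over all inputs, sharings and masks.
for v in range(32):
    x, y, x0, y0, r = v & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1, (v >> 4) & 1
    z0, z1 = dom_and(x0, x ^ x0, y0, y ^ y0, r)
    assert z0 ^ z1 == x & y
```

The fresh bit r cancels between the two output shares, which is what makes the recombination z0 ⊕ z1 equal x & y for every mask value.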
Figure 1: First-order DOM-indep multiplier; the refreshed cross-domain terms are registered
before compression.
Boolean to Multiplicative. More specifically, consider a conversion from Boolean to type-I multi-
plicative shares. After k iterations of the above steps, we have an intermediate sharing
x = (p0 , . . . , pk−1 , bk , . . . , bd ) where x = (p0⁻¹ ⊗ · · · ⊗ pk−1⁻¹) ⊗ (bk ⊕ · · · ⊕ bd )
The number of target (multiplicative) shares is k and the number of source (Boolean) shares is
d + 1 − k. In the expansion phase, we add a new multiplicative share by drawing a random pk and
multiplying it with all Boolean shares:
b′i = pk ⊗ bi for i = k, . . . , d        (2)
We now obtain a d + 2 sharing

x = (p0 , . . . , pk , b′k , . . . , b′d ) where x = (p0⁻¹ ⊗ · · · ⊗ pk⁻¹) ⊗ (b′k ⊕ · · · ⊕ b′d )
In the compression phase, we remove Boolean share b′k by adding it to another Boolean share b′k+1 :
bk+1 = b′k ⊕ b′k+1        (3)
Multiplicative to Boolean. For the opposite conversion from multiplicative to Boolean shares, we
consider a type-II multiplicative sharing, but the procedure for type-I is identical, apart from d
additional inversions. Note that the first iteration starts with k = 1 and b″d = qd . In iteration k, we
have the intermediate sharing

x = (q0 , . . . , qd−k , b′d−k+1 , . . . , b′d−1 , b″d )
with k target (Boolean) shares and d + 1 − k source (multiplicative) shares. In the expansion phase,
a new Boolean share b′d−k is added by splitting b″d into b′d ⊕ b′d−k with b′d−k randomly drawn. The
d + 2 shares of x are then

x = (q0 , . . . , qd−k , b′d−k , . . . , b′d ) where x = (q0 ⊗ · · · ⊗ qd−k ) ⊗ (b′d−k ⊕ · · · ⊕ b′d )
In the compression phase, multiplicative share qd−k is removed by multiplication with all Boolean
shares:
bi = qd−k ⊗ b′i for i = d − k, . . . , d

x = (q0 , . . . , qd−k−1 , bd−k , . . . , bd ) where x = (q0 ⊗ · · · ⊗ qd−k−1 ) ⊗ (bd−k ⊕ · · · ⊕ bd )
Conversions in Hardware: Dealing with glitches. The register stage between the expansion and
compression phases is necessary because of the presence of glitches in hardware circuits. Without
this register, the non-completeness of the conversion is broken and we have no security guarantees.
Consider for example equations (2) and (3). Together, they compute bk+1 = [pk ⊗ bk ] ⊕ [pk ⊗ bk+1 ].
Without a register, the signal pk might arrive late to the multiplication. As a result, two of the
shares of x are combined on one wire bk ⊕ bk+1 and the security is reduced by one order.
Indeed, by only locally inverting the last share pxd of a type-I multiplicative masking of x, we
obtain a type-II multiplicative sharing of its inverse x⁻¹:

x⁻¹ = ((px0 )⁻¹ ⊗ · · · ⊗ (pxd−1 )⁻¹ ⊗ pxd )⁻¹ = px0 ⊗ · · · ⊗ pxd−1 ⊗ (pxd )⁻¹
Note that regardless of the security order d, only one unshared inverter is required this way.
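The share-local inversion is easily checked in Python for, e.g., d = 2 over GF(2⁸). The helper names and the test value 0x53 (whose AES field inverse is 0xCA) are our own choices:

```python
import secrets
from functools import reduce

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

x, d = 0x53, 2
# Type-I sharing: x = p0^{-1} * p1^{-1} * p2, hence p2 = x * p0 * p1.
p = [1 + secrets.randbelow(255) for _ in range(d)]
p.append(reduce(gf_mul, p, x))

# Inverting only the last share yields a type-II sharing of x^{-1}:
q = p[:d] + [gf_inv(p[d])]
assert reduce(gf_mul, q) == gf_inv(x) == 0xCA
```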
We now look in more detail at the first- and second-order implementations of the conversions.
First-order. The complete first-order masked inversion including the resulting circuits for first-
order conversions between Boolean and multiplicative masking is shown in Figure 2. The left side
of the figure converts a Boolean sharing x = (b0 , b1 ) to a type-I multiplicative sharing (p0 , p1 ) such
that x = p0⁻¹ ⊗ p1 . With a non-zero r0 ←$ F*q , the multiplicative shares are calculated as

p0 = r0
p1 = [b0 r0 ] ⊕ [b1 r0 ]
The right side of the circuit converts a type-II multiplicative masking of x⁻¹ into a Boolean masking.
This requires another random r1 ←$ Fq :

b′0 = r1 q0
b′1 = [q1 ⊕ r1 ] q0
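Ignoring the register stages, the two first-order conversions can be replayed in Python over GF(2⁸). In this functional sketch the bracketed terms [·] become plain multiplications, and the function names are our own:

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def bool_to_mult1(b0, b1, r0):
    """Boolean (b0, b1) -> type-I (p0, p1): p0 = r0, p1 = [b0 r0] ^ [b1 r0]."""
    return r0, gf_mul(b0, r0) ^ gf_mul(b1, r0)

def mult2_to_bool(q0, q1, r1):
    """Type-II (q0, q1) -> Boolean: b0' = r1 q0, b1' = [q1 ^ r1] q0."""
    return gf_mul(r1, q0), gf_mul(q1 ^ r1, q0)

x = 0x63
b0 = secrets.randbelow(256)
p0, p1 = bool_to_mult1(b0, b0 ^ x, 1 + secrets.randbelow(255))
assert gf_mul(gf_inv(p0), p1) == x        # p0^{-1} * p1 recombines to x

q0 = 1 + secrets.randbelow(255)           # type-II sharing (q0, q1) of x
b0p, b1p = mult2_to_bool(q0, gf_mul(x, gf_inv(q0)), secrets.randbelow(256))
assert b0p ^ b1p == x                     # b0' ^ b1' recombines to x
```

The first conversion works because p1 = r0 (b0 ⊕ b1) = r0 x, and the second because b0′ ⊕ b1′ = (r1 ⊕ q1 ⊕ r1) q0 = q0 q1, so the refreshing values drop out of the recombined result.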
Figure 2: First-order shared implementation of an inversion in Fq . The dashed lines depict registers.
Second-order. Adopting the same algorithms for d + 1 = 3 shares does not provide second-order
secure conversions (see Appendix A). We require an extra refreshing of additive shares. Figure 3
depicts our circuit for the second-order shared inversion in Fq . The conversion from a Boolean to a
type-I multiplicative sharing is depicted on the left side of the figure. The conversion requires three
units of randomness: r0 , r1 ←$ F*q and the extra refreshing u ←$ Fq . The multiplicative shares are as
follows:

p0 = r0
p1 = r1
p2 = r1 [[r0 b0 ] ⊕ [r0 b1 ⊕ u]] ⊕ r1 [[r0 b2 ] ⊕ u]
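Replaying this conversion in Python over GF(2⁸) confirms both that the refreshing mask u cancels and that (p0, p1, p2) recombines to x under the type-I rule. The exact bracket placement in p2 is our reading of the circuit, and the helper names are our own:

```python
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def bool_to_mult1_order2(b0, b1, b2, r0, r1, u):
    """Second-order Boolean -> type-I conversion:
    p2 = r1*([r0 b0] ^ [r0 b1 ^ u]) ^ r1*([r0 b2] ^ u)."""
    p2 = gf_mul(r1, gf_mul(r0, b0) ^ gf_mul(r0, b1) ^ u) \
       ^ gf_mul(r1, gf_mul(r0, b2) ^ u)
    return r0, r1, p2

x = 0xB7
b0, b1 = secrets.randbelow(256), secrets.randbelow(256)
r0, r1 = 1 + secrets.randbelow(255), 1 + secrets.randbelow(255)
u = secrets.randbelow(256)
p0, p1, p2 = bool_to_mult1_order2(b0, b1, b0 ^ b1 ^ x, r0, r1, u)
assert gf_mul(gf_mul(gf_inv(p0), gf_inv(p1)), p2) == x   # p0^-1 p1^-1 p2 = x
```

Algebraically, p2 = r1 r0 (b0 ⊕ b1 ⊕ b2) ⊕ r1 u ⊕ r1 u = r0 r1 x, so the two u-terms cancel and the recombination yields x.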
For the opposite conversion (shown on the right side of Figure 3), we start from a type-II
multiplicative masking. This means we only need to invert the last share, p2 . We calculate the
Boolean shares of x⁻¹ analogously to the first-order conversion.
The conversion again uses three units of randomness, r2 , r3 , u ←$ Fq , although we can recycle the
refreshing mask u from the Boolean to multiplicative conversion. Each conversion thus uses only
2.5 units of randomness.
Our procedures differ slightly from those of Genelle et al. [GPQ11b], especially in the smaller
use of randomness (we expand on this in Appendix A). For a general randomness strategy for
higher-order conversions, we refer to [GPQ11b], but we note that their randomness cost is not
necessarily optimal for each target security order d. A custom approach can result in a lower cost.
Figure 3: Second-order shared implementation of an inversion in Fq .
We now describe how to circumvent the zero problem of multiplicative masking. Both in MPC
literature [DK10] and in software masking [GPQ10], it has been proposed to map each zero element
in Fq to a non-zero element in F∗q using a Kronecker Delta function before converting to multiplicative
masks.
In the AES S-box, we need to do an inversion in Fq . Both the zero and unit element of Fq are
their own inverses: 0⁻¹ = 0 (by convention) and 1⁻¹ = 1.
It is therefore sufficient to replace each zero element by a “one” before the inversion and change it
back afterwards. Consider a Kronecker delta function δ(x):
δ(x) = 1 if x = 0, and δ(x) = 0 if x ≠ 0
We thus require a circuit that computes a shared Kronecker delta function δ(x). Its output (a
sharing of “zero” or a sharing of “one”) is to be added to the input of the conversion from Boolean to
multiplicative masking and to the output of the conversion from multiplicative to Boolean masking
(see Figure 5). This way, any zero element goes through the Fq inversion as a “one” and is thus
never shared multiplicatively.
The Kronecker delta function δ(x) can be calculated with an n-input AND or, equivalently, a
log2(n)-level 2-input AND tree with the inverted bits of x as input: δ(x) = x̄0 & x̄1 & · · · & x̄n−1 .
The circuit is shown for n = 8 in Figure 4 with xi a sharing of the ith bit of x. In software, it has
been realized using masked table lookups [GPQ10] and bit-slicing [GPQ11a]. We implement each
AND gate with a DOM-indep multiplier [GMK16]. We denote by rj the randomness needed for
each gate. As each multiplier requires one register stage, the entire circuit of Figure 4 takes three
clock cycles (regardless of the number of shares).
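Unmasked, the delta function and the resulting zero bypass around the field inversion look as follows in Python (GF(2⁸) inversion via a²⁵⁴; the function names are our own):

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial 0x11B."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a):
    """Inverse in GF(2^8)* computed as a^254, with gf_inv(0) = 0."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def delta(x, n=8):
    """Kronecker delta as an AND over the n inverted bits of x: 1 iff x == 0."""
    acc = 1
    for i in range(n):
        acc &= ~(x >> i) & 1
    return acc

def zero_free_inv(x):
    """Add delta(x) before and after the inversion, so the value that is
    shared multiplicatively is never 0 (0 -> 1 -> 1 -> 0)."""
    d = delta(x)
    return gf_inv(x ^ d) ^ d

assert zero_free_inv(0) == 0
assert all(zero_free_inv(v) == gf_inv(v) for v in range(1, 256))
```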
Figure 4: Circuit for the shared Kronecker delta function δ(x) for n = 8
We note that a trade-off can be made here between latency and area. It is possible to reduce
the depth of the tree (and thus the number of clock cycles) at the cost of a larger fan-in for the
AND gates, which results in a considerable increase in area for shared implementations. In this
paper, we choose to work only with 2-input AND gates in order to minimize circuit area.
The DOM gate thus uses its inputs somewhat asymmetrically since the output shares depend only
on the unmasked second input y and not on its sharing. This means that any randomness that has
been used to mask y before arriving at this gate, disappears from its output sharing z. Hence, we
can reuse this randomness in the next layer. In our case, we use the more significant bit (depicted
as the lower input to an AND gate in Fig. 4) as the “second input” and we conclude that the
second layer of the Kronecker implementation removes any dependence of the data on r2 and r4 . In
contrast, reusing r1 (or r3 ) in layer two is not advisable. Moreover, for a first-order implementation
(only univariate matters), the upper and lower two gates in the first layer have independent inputs
and outputs, and can therefore use the same randomness as long as layer two does not.
We propose the following use of randomness:
r1 = r3 ←$ GF(2)    r5 ←$ GF(2)    r7 = r1
r2 = r4 ←$ GF(2)    r6 = [r5 ⊕ r2 ]
We are thus able to reduce the randomness consumption of the first-order Kronecker delta
implementation from 7 to only 3 bits. We refer to Appendix C for the probability distributions of
intermediate and output wires of this circuit with our randomness optimization. We verified that
these probability distributions are independent of the secret input. Moreover, we note that these
probability distributions are the same as in the circuit without randomness optimization.
To further reduce the randomness in the circuit, we propose a recycling of the bits. Following the framework of [FPS17]
would require five groups of three fresh random bits, i.e. 15 bits. Our customization is more
restricted in the higher-order case because of the possibility of multivariate leakage. We still have
the special composability property of the DOM gates, but the gates in the first layer can no longer
be considered independent. We propose the following:
r1, r2, r3, r4 ←$ (GF(2))^3
r5^0 = r3^0,   r5^1 = r4^1,   r5^2 = [r3^2 ⊕ r4^2]
r6^0 = r1^0,   r6^1 = r2^1,   r6^2 = [r1^2 ⊕ r2^2]
r7^0 = [r1^1 ⊕ r3^1],   r7^1 = [r2^0 ⊕ r4^0],   r7^2 ←$ GF(2)
We thus reduce the randomness consumption of the second-order Kronecker delta implementation
from 21 to 13 bits. The probability distributions of relevant (pairs of) wires can again be found in
Appendix C.
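The bit counts can be double-checked with a small tally (bookkeeping only; share j of ri is written as the pair (i, j)):

```python
# Without recycling, each of r1..r7 needs d + 1 = 3 fresh bits at second order.
assert 7 * 3 == 21

# With the recycling above, only r1..r4 (three shares each) and the single
# share r7^2 are sampled fresh; all other shares of r5, r6, r7 are recycled.
fresh = {(i, j) for i in (1, 2, 3, 4) for j in (0, 1, 2)} | {(7, 2)}
assert len(fresh) == 13
```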
Figure 5: First-order adaptive masking implementation of the AES S-box. The dotted grey lines
depict registers.
172 Multiplicative Masking for AES in Hardware
registers that is the input of the MixColumns operation is indicated by a red striped frame, whereas
the registers receiving the output of MixColumns one cycle later are specified by a full red frame.
The S-box input is taken from State 00, while the Kronecker delta input starts computing
three cycles beforehand on State 30. In order to have State 30 ready for the Kronecker function,
we have to put the MixColumns operation in the second column (instead of the first column as
in [GMK16]). ShiftRows is performed when the sixteenth and last S-box output enters the state.
We also adapt the ShiftRows connections such that all bytes end up one column to the right of the
actual ShiftRows result. This means that the normally first column is the first MixColumns input
(state bytes 01,11,21,31) and the normally last column now occupies state bytes 00,10,20 and 30.
During the next four clock cycles, we rotate the state by returning byte 00 to the state input (33)
untouched. After those four cycles, the state columns are restored to their correct order and the
first S-box input is ready in State 00. Moreover, its output to the Kronecker function is also ready
at this point. The key schedule is synchronized with the state in a way that the partial Round Key
to be used in that clock cycle corresponds to State 30. The AddRoundKey stage is embedded in
the connection between State 30 and State 20 and its output is the input to the Kronecker delta
function.
The result is fed back into the key state as Key 33.
4.3 Control
We now go into more detail on the scheduling of the 24 clock cycles (0 to 23) that make up one
encryption round when the S-box latency is four cycles (as in our second-order implementation).
Table 1 details the control of the register movement and Table 2 shows how various inputs to the
states and the S-box change.
The 16 bytes of the state register are fed to the S-box in cycles 3 to 18 of each round of
encryption. This means the Kronecker delta function receives the same 16 bytes three cycles before
[Figure: state register array, marking the positions of the Kronecker delta input, S-box input, and MixColumns input/output during normal operation and ShiftRows.]
that: in cycles 0 to 15. During these cycles, the key state follows its meandering movement and
Key 00 is used to construct the Round Key byte. In the remaining clock cycles (from cycle 16 until
cycle 23), the key array is rotating. The last column of the array is fed through the Kronecker delta
function in cycles 17 to 20 and through the S-box in cycles 20 to 23, which means their outputs are
ready for the first four Round Key calculations four cycles later: in cycles 0 to 3.
The state receives its S-box outputs in cycles 7 to 22. In the last cycle (22), we do the adapted
ShiftRows that puts each state byte one extra column to the right. The first MixColumns operation
is in the next cycle (23), which means the first input byte to the Kronecker delta function (in State
30) is ready in cycle 0. During cycles 23 to 2, State 00 holds bytes of the last column and is thus
fed back into State 33. The MixColumns operation occurs four times every four cycles, i.e. in
cycles 23, 3, 7 and 11 (except in the last round of encryption).
The first round of encryption (loading of the inputs) starts in cycle 0 with the data and key
inputs replacing respectively State 30 and the Round Key. In total, one AES encryption is obtained
in 10 × 24 + 16 = 256 cycles. Our first-order AES implementation has the same latency in spite
of the S-box requiring only two cycles. Given the AES design, it is difficult to exploit an S-box
latency below four cycles.
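The cycle accounting in this section can be sanity-checked in a few lines (constants taken from the text; the modulo-24 round schedule is an assumption of the sketch):

```python
CYCLES = 24    # clock cycles per encryption round (second-order design)
SBOX_LAT = 4   # S-box latency in clock cycles

# State bytes: Kronecker delta input in cycles 0-15, S-box input three cycles
# later (cycles 3-18), S-box outputs written back four cycles after that (7-22).
assert [(c + 3) % CYCLES for c in range(0, 16)] == list(range(3, 19))
assert [(c + SBOX_LAT) % CYCLES for c in range(3, 19)] == list(range(7, 23))

# Key column: S-box input in cycles 20-23, outputs ready in cycles 0-3.
assert [(c + SBOX_LAT) % CYCLES for c in (20, 21, 22, 23)] == [0, 1, 2, 3]

# Total latency: 16 cycles of loading plus 10 rounds of 24 cycles.
assert 16 + 10 * CYCLES == 256
```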
5 Security Evaluation
In this section, we elaborate on the security of the first- and second-order AES constructions against
a probing adversary in the presence of glitches. Neither formal proofs in a particular security model
nor empirical leakage detection tools can on their own provide full evidence of security. A
security evaluation is incomplete without complementary analyses following both methodologies.
Therefore, our approach consists of three stages: first in § 5.1, we address the security of the S-box
under the ideal circuit assumption using the notion of strong non-interference [BBD+ 16, BBP+ 16].
Next in § 5.2, we evaluate the security of the S-box in the presence of glitches, using leakage
detection tools available in literature. Finally in § 5.3 we complete the evaluation by analyzing our
whole circuit on a physical device.
Now, consider our S-box in Figure 7, consisting of six parts: A1 , A3 and A5 are affine (only
computing share wise) and A2 , A4 and A6 are d-SNI as proven in Appendices D and E. The proof
starts from the output and backtracks to the input. We denote by Ii the set of intermediate
probes in gadget Ai and by O the set of output probes on S(x). The sets are constrained by
Table 2: State and key inputs during one round of encryption (except during loading)
[Figure 7: schematic of the S-box circuit, consisting of the six blocks A1 to A6.]
|O| + |I1| + · · · + |I6| ≤ d. We further define Si as the set of shares that are required at the input of
block Ai in order to be able to simulate the probes in the remainder of the circuit, i.e. I1 ∪ · · · ∪ Ii ∪ O.
We subsequently treat this set as a set of probes that needs to be simulated using input shares
from a previous block Ai−1. This way, we gradually move towards the input and try to show that
the number of input shares of x required to simulate all probes I1 ∪ · · · ∪ I6 ∪ O is at most
|I1| + · · · + |I6|.
Consider for example block A4 in Table 3. This block has output z and input y. The set of
shares of z, S3, is constrained by |S3| ≤ |S2| + |I3|. Since A4 is d-SNI and since |S3 ∪ I4| ≤ d, the
number of shares of y required to simulate S3 ∪ I4 is at most |I4|. We call this set of shares S4.
Now, since we are able to simulate S3 using S4, and since S3 is able to simulate the remaining
probes I1 ∪ I2 ∪ I3 ∪ O, we know that the set of shares S4 is sufficient to simulate I1 ∪ · · · ∪ I4 ∪ O.
Table 3 shows that we need |S5,1 ∪ S6| ≤ |S4| + |I5| + |I6| ≤ |I4| + |I5| + |I6| shares of the input
to simulate all d-tuples of probes in the circuit, proving that the S-box is d-SNI.
property from register to register. By applying this tool directly to the RTL HDL descriptions
of our gadgets, we confirm that each stage is non-complete and therefore secure in the univariate
setting in the presence of glitches if the shared input does not have a secret dependent bias. We
verify this condition on the input sharing independently (Appendix C).
We note that it has been implied in [FGMDP+ 18] that verifying glitch security and strong non-
interference separately does not guarantee composability in a glitchy environment. In section 5.1,
we have given security proofs for the S-box as best as we could with the tools at our disposal. In this
section, we consider glitches. The combined theoretical verification of “glitchy” SNI is an interesting
direction for future research. However, note that SNI is not a necessary condition for the S-box
to be secure. As an example, consider our first-order S-box. Not every glitch-extended probe in
the subcircuit shown in Figure 2 is simulatable with only t1 shares of the input. However, we have
exhaustively verified that every glitch-extended probe in the entire S-box circuit is independent of the
secret. The S-box is thus 1-probing secure, even though one of its subcircuits is not (1, 0, 0)-robust
1-SNI [FGMDP+ 18]. We further evaluate the security of the entire S-box using state-of-the-art
tools.
We use the simulation tool of [Rep16], in which we exhaustively probe the S-box and create
power traces using an identity leakage model. These traces do not only contain explicit intermediates
(stabilized values on wires) but also values that could be observed in a glitch (transient values on
wires). We exhaustively probe the S-box in this way in a completely noiseless setting and create
up to 100 million simulated traces. For more details, we refer to [Rep16]. We detect no univariate
leakage with up to 100 million traces nor bivariate in the case of our second-order gadgets. We
draw the same conclusions when using the tool described in [DBR18]. This tool essentially exhausts
every possible glitch in the computation by verifying that there is no mutual information between
the secret and all possible (pairs of) glitch-extended probes.
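A minimal illustration of that criterion (a toy two-share example, not the tool of [DBR18] itself): the mutual information between the secret and a probe on one share is zero, while a glitch-extended probe that sees both shares leaks everything.

```python
from itertools import product
from collections import Counter
from math import log2

def mutual_information(samples):
    """I(secret; observation) from a list of (secret, observation) pairs,
    assuming each pair in the list is equally likely."""
    n = len(samples)
    joint = Counter(samples)
    ps = Counter(s for s, _ in samples)
    po = Counter(o for _, o in samples)
    return sum(c / n * log2((c / n) / ((ps[s] / n) * (po[o] / n)))
               for (s, o), c in joint.items())

# A probe on one share of x (where x = x0 ^ x1) reveals nothing ...
masked = [(x, x0) for x, x0 in product(range(2), repeat=2)]
assert mutual_information(masked) == 0.0

# ... whereas a glitch-extended probe covering both shares reveals x.
extended = [(x, (x0, x0 ^ x)) for x, x0 in product(range(2), repeat=2)]
assert mutual_information(extended) == 1.0
```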
While the theoretical possibility of a very weak bias remains, we would need more than 100
million traces to detect it, so the practical implications are limited: if the leak is not even
detected with 100 million traces in a noiseless scenario, exploiting it (i.e. performing key
recovery) in a realistic noisy scenario would require considerably more traces.
Setup. We program a Xilinx Spartan6 FPGA with both our first- and second-order design on
a SAKURA-G board, specifically designed for side-channel evaluation. For the synthesis, we use
the Xilinx ISE option KEEP_HIERARCHY to prevent optimization across modules (and in particular
across shares). To minimize platform noise, we split the implementation over a crypto FPGA, which
handles the AES encryption and a control FPGA, which communicates with the host computer and
supplies masked data to the crypto FPGA. The FPGAs are clocked at 3.072 MHz and sampled at
1 GS/s.
The crypto FPGA is also equipped with a PRNG to generate the randomness required in every
clock cycle. This PRNG is loaded with a fresh seed for every encryption. In contrast with other
state-of-the-art masked implementations, we have to be able to generate one or two non-zero bytes
for the multiplicative masks. We refer to Appendix F for a description of how we achieve this in
practice, without stalling the pipeline.
Univariate. We perform a non-specific leakage detection test [BCD+ 13] using the methodology
from [RGV17]. This means we gather power traces in two sets: the first corresponding to encryptions
of a fixed plaintext and the other to encryptions of random plaintexts. We choose the fixed plaintext
equal to the key in order to test the special case of zero inputs to the S-box in the first round.
Nonzero S-box inputs then occur in encryption round two and are thus naturally also tested. The
two sets of measurements are compared using the t-test statistic. When the t-statistic at order
d crosses the threshold T = ±4.5, the null hypothesis “The design has no dth-order leakage” is
rejected with confidence > 99.999%. On the other hand, when the t-statistic remains below this
threshold, we corroborate that side-channel information is not distinguishable at order d.
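For reference, the statistic itself is just Welch's two-sample t-test; a minimal sketch (synthetic deterministic data, not our measurements) shows both sides of the ±4.5 threshold:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t-statistic, as used in non-specific (TVLA) tests."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

# Two "trace" sets drawn from the same value distribution: no detectable difference.
fixed = [float(i % 7) for i in range(5000)]
rand = [float((3 * i) % 7) for i in range(5000)]
assert abs(welch_t(fixed, rand)) < 4.5

# A mean offset (a first-order leak) pushes |t| past the threshold.
leaky = [x + 0.5 for x in fixed]
assert abs(welch_t(leaky, rand)) > 4.5
```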
The results for our first-order design are shown in Figure 8. Each trace consists of 64 clock
cycles, comprising about two and a half rounds of encryption. An example of such a trace is shown
Figure 8: Non-specific leakage detection test on 2.5 rounds of encryption of a first-order protected
AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom):
exemplary power trace, first-order, second-order t-value.
in Figure 8, top. To verify the soundness of our setup, we first perform the leakage detection test
with the PRNG turned off (i.e. unmasked implementation). This is shown in the left column of
the figure and as expected, the design presents severe leakage at only 12 000 traces. On the right
side, we do the leakage detection test with the PRNG turned on. We do not observe evidence for
first-order leakage with up to 50 million power traces. The design does leak in the second order, as
anticipated.
Similarly, we show the test results for our second-order design in Figure 9. The leakage when
the PRNG is turned off (left column) is clear. The masked implementation (right column) does not
present evidence for first- nor second-order leakage with up to 50 million power traces. While we
would expect the third-order t-statistic to surpass the threshold, this is not yet the case due to
platform noise.
We also track the evolution of the maximum absolute t-test value as a function of the number
of traces taken. This is shown in Figure 10 for the first-order (left) and second-order (right)
protected AES implementations. On the left, we clearly see an increase in the absolute t-value of
Figure 9: Non-specific leakage detection test on 2.5 rounds of encryption of a second-order protected
AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom):
exemplary power trace, first-order, second-order, third-order t-value.
the second- and third-order moment, while the statistic for first order is stable. For our second-order
implementation, the noise of the platform prevents us from seeing evidence for third-order leakage.
[Figure panels: max(|t-value|) versus number of traces in millions, for orders d = 1, 2, 3.]
Figure 10: Evolution of the maximum absolute t-value across the measurements. Left: First order.
Right: Second order.
Bivariate. In order to do a bivariate leakage detection test, we reduce the length of the power
traces to 15 clock cycles and the sample rate of the oscilloscope to 200 MS/s. Each trace then
consists of 1 000 time samples. To improve the signal-to-noise ratio, we make the traces DC
free. We then combine the measurements at different time samples by doing an outer product of the
centered traces with themselves. The resulting symmetric matrices are the samples for our t-test.
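The preprocessing can be sketched in a few lines (centering by subtracting the mean trace is our assumption of what making the traces DC-free means here):

```python
def bivariate_samples(traces):
    """Center each trace by the mean trace, then take the outer product of the
    centered trace with itself; the symmetric matrices feed the bivariate t-test."""
    n, m = len(traces), len(traces[0])
    mean = [sum(t[j] for t in traces) / n for j in range(m)]
    out = []
    for t in traces:
        c = [t[j] - mean[j] for j in range(m)]          # DC-free trace
        out.append([[c[i] * c[j] for j in range(m)] for i in range(m)])
    return out

mats = bivariate_samples([[1.0, 3.0], [3.0, 1.0]])
assert mats[0] == [[1.0, -1.0], [-1.0, 1.0]]            # symmetric per trace
```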
We first perform this experiment on the first-order protected AES implementation to verify that
we can indeed detect bivariate leakage. The resulting t-statistic after 1 and 45 million traces is
shown in Figure 11 and confirms that our method is sound.
Next, we do the same for the second-order masked AES implementation. We collect 50 million
traces and show the resulting t-statistic in Figure 12. The result shows clearly that no bivariate
leakage can be detected with 50 million traces.
6 Implementation Cost
We presented first- and second-order secure constructions for AES and evaluated their security. In
this section we investigate the implementation cost and compare it to the state-of-the-art AES
designs of [CRB+ 16] and [GMK17]. All area measures were obtained with the Synopsys Design
Compiler v.2013.12, using the Open Cell Nangate 45nm library [NAN] and are expressed in 2-input
NAND gate equivalents¹. We use compile option -exact_map to prevent optimization across
modules. For a fair comparison, we also synthesize the implementations of [CRB+ 16] and [GMK17]
with the same library and toolchain. From the latter, we picked the options for smallest area, i.e. not
perfectly-interleaved and the eight-stage S-box. Both these works create a shared implementation
¹One NAND gate is 0.798 µm².
Figure 11: Non-specific bivariate leakage detection test on 15 clock cycles of a first-order protected
AES. Left: 1 million traces. Right: 45 million traces.
Figure 12: Non-specific bivariate leakage detection test on 15 clock cycles of a second-order
protected AES with 50 million traces.
from Canright’s compact AES S-box [Can05] using the tower-field method. Our approach is thus
radically different. We cannot compare easily with [UHA17] because of different synthesis libraries,
though they seem to have a similar area footprint with a larger randomness requirement (64 bits per
S-box). Also, they only provide a first-order implementation. We first detail the cost of the S-box
only in § 6.1 and then look at the entire AES encryption in § 6.2.
Table 4: Implementation results for the AES S-box with Nangate 45nm Library
6.2 AES
Table 5 shows the implementation results of our entire AES implementations in comparison with
those of De Cnudde et al. [CRB+ 16] and Gross et al. [GMK17]. Our S-box area reduction results
in an overall improvement of around 10% over the state-of-the-art with comparable or even better
randomness consumption and latency.
7 Conclusion
We have ported the well-known concept of adaptively masking ciphers such as AES to hardware.
The idea has been extensively studied in software, but had not been applied in hardware until
now. We show that this methodology is a very competitive alternative to state-of-the-art masked
AES designs. Our approach is conceptually simple, yet incorporates modern countermeasures to
mitigate the effect of glitches in hardware.
Specifically, we present secure circuits for converting between Boolean and multiplicative
masking and for circumventing the well-known zero problem of multiplicative masking. We apply
the methodology to the AES cipher for first- and second-order security and show with experiments
that our implementations do not exhibit univariate or multivariate leakage with up to 50 million
traces. Our AES S-box implementations require comparable randomness and latency to state-of-
the-art implementations and yet achieve an 18 to 29% smaller chip area. We believe this is an
interesting addition to the hardware designer’s toolbox.
Acknowledgements
This work was supported in part by the NIST Research Grant 60NANB15D346. Oscar Reparaz
and Begül Bilgin are postdoctoral fellows of the Fund for Scientific Research - Flanders (FWO) and
Lauren De Meyer is funded by a PhD fellowship of the FWO. The authors would like to thank
François-Xavier Standaert and Vincent Rijmen for helpful discussions.
References
[AG01] Mehdi-Laurent Akkar and Christophe Giraud. An implementation of DES and AES,
secure against some attacks. In Çetin Kaya Koç, David Naccache, and Christof
Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2001, Third
International Workshop, Paris, France, May 14-16, 2001, Proceedings, volume 2162
of Lecture Notes in Computer Science, pages 309–318. Springer, 2001.
[AGR+ 16] Martin R. Albrecht, Lorenzo Grassi, Christian Rechberger, Arnab Roy, and Tyge
Tiessen. MiMC: Efficient encryption and cryptographic hashing with minimal
multiplicative complexity. In Jung Hee Cheon and Tsuyoshi Takagi, editors, Advances
in Cryptology - ASIACRYPT 2016 - 22nd International Conference on the Theory
and Application of Cryptology and Information Security, Hanoi, Vietnam, December
4-8, 2016, Proceedings, Part I, volume 10031 of Lecture Notes in Computer Science,
pages 191–219, 2016.
[ANR17] Victor Arribas, Svetla Nikova, and Vincent Rijmen. VerMI: Verification tool for
masked implementations. Cryptology ePrint Archive, Report 2017/1227, 2017.
[BBD+ 16] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, Pierre-Yves Strub, and Rébecca Zucchini. Strong non-interference and
type-directed higher-order masking. In Edgar R. Weippl, Stefan Katzenbeisser,
Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security,
Vienna, Austria, October 24-28, 2016, pages 116–129. ACM, 2016.
[BBP+ 16] Sonia Belaïd, Fabrice Benhamouda, Alain Passelègue, Emmanuel Prouff, Adrian
Thillard, and Damien Vergnaud. Randomness complexity of private circuits for
multiplication. In Advances in Cryptology - EUROCRYPT 2016, Part II, Lecture
Notes in Computer Science. Springer, 2016.
[DBR18] Lauren De Meyer, Begül Bilgin, and Oscar Reparaz. Consolidating security notions
in hardware masking. IACR Cryptology ePrint Archive, 2018:597, 2018.
[DDF14] Alexandre Duc, Stefan Dziembowski, and Sebastian Faust. Unifying leakage models:
From probing attacks to noisy leakage. In Phong Q. Nguyen and Elisabeth Oswald,
editors, Advances in Cryptology - EUROCRYPT 2014 - 33rd Annual International
Conference on the Theory and Applications of Cryptographic Techniques, Copenhagen,
Denmark, May 11-15, 2014. Proceedings, volume 8441 of Lecture Notes in Computer
Science, pages 423–440. Springer, 2014.
[DK10] Ivan Damgård and Marcel Keller. Secure multiparty AES. In Radu Sion, editor,
Financial Cryptography and Data Security, 14th International Conference, FC 2010,
Tenerife, Canary Islands, January 25-28, 2010, Revised Selected Papers, volume
6052 of Lecture Notes in Computer Science, pages 367–374. Springer, 2010.
[FGMDP+ 18] Sebastian Faust, Vincent Grosso, Santos Merino Del Pozo, Clara Paglialonga, and
François-Xavier Standaert. Composable masking schemes in the presence of physical
defaults & the robust probing model. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2018(3):89–120, Aug. 2018.
[FPS17] Sebastian Faust, Clara Paglialonga, and Tobias Schneider. Amortizing randomness
complexity in private circuits. In Tsuyoshi Takagi and Thomas Peyrin, editors,
Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on
the Theory and Applications of Cryptology and Information Security, Hong Kong,
China, December 3-7, 2017, Proceedings, Part I, volume 10624 of Lecture Notes in
Computer Science, pages 781–810. Springer, 2017.
[FRR+ 10] Sebastian Faust, Tal Rabin, Leonid Reyzin, Eran Tromer, and Vinod Vaikuntanathan.
Protecting circuits from leakage: the computationally-bounded and noisy cases. In
Henri Gilbert, editor, Advances in Cryptology - EUROCRYPT 2010, 29th Annual
International Conference on the Theory and Applications of Cryptographic Tech-
niques, French Riviera, May 30 - June 3, 2010. Proceedings, volume 6110 of Lecture
Notes in Computer Science, pages 135–156. Springer, 2010.
[GMK16] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking:
Compact masked hardware implementations with arbitrary protection order. IACR
Cryptology ePrint Archive, 2016:486, 2016.
[GMK17] Hannes Groß, Stefan Mangard, and Thomas Korak. An efficient side-channel
protected AES implementation with arbitrary protection order. In Helena Handschuh,
editor, Topics in Cryptology - CT-RSA 2017 - The Cryptographers’ Track at the
RSA Conference 2017, San Francisco, CA, USA, February 14-17, 2017, Proceedings,
volume 10159 of Lecture Notes in Computer Science, pages 95–112. Springer, 2017.
[GP99] Louis Goubin and Jacques Patarin. DES and differential power analysis (the "du-
plication" method). In Çetin Kaya Koç and Christof Paar, editors, Cryptographic
Hardware and Embedded Systems, First International Workshop, CHES’99, Worces-
ter, MA, USA, August 12-13, 1999, Proceedings, volume 1717 of Lecture Notes in
Computer Science, pages 158–172. Springer, 1999.
[GPQ10] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Secure multiplicative
masking of power functions. In Jianying Zhou and Moti Yung, editors, Applied
Cryptography and Network Security, 8th International Conference, ACNS 2010,
Beijing, China, June 22-25, 2010. Proceedings, volume 6123 of Lecture Notes in
Computer Science, pages 200–217, 2010.
[GPQ11a] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Montgomery’s trick and
fast implementation of masked AES. In Abderrahmane Nitaj and David Pointcheval,
editors, Progress in Cryptology - AFRICACRYPT 2011 - 4th International Confer-
ence on Cryptology in Africa, Dakar, Senegal, July 5-7, 2011. Proceedings, volume
6737 of Lecture Notes in Computer Science, pages 153–169. Springer, 2011.
[GPQ11b] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Thwarting higher-order
side channel analysis with additive and multiplicative maskings. In Preneel and
Takagi [PT11], pages 240–255.
[GSM17] Hannes Groß, David Schaffenrath, and Stefan Mangard. Higher-order side-channel
protected implementations of KECCAK. In Hana Kubátová, Martin Novotný, and
Amund Skavhaug, editors, Euromicro Conference on Digital System Design, DSD
2017, Vienna, Austria, August 30 - Sept. 1, 2017, pages 205–212. IEEE, 2017.
[GT02] Jovan Dj. Golic and Christophe Tymen. Multiplicative masking and power analysis
of AES. In Jr. et al. [JKP03], pages 198–212.
[HOM06] Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An AES smart card
implementation resistant to power analysis attacks. In Jianying Zhou, Moti Yung,
and Feng Bao, editors, Applied Cryptography and Network Security, 4th International
Conference, ACNS 2006, Singapore, June 6-9, 2006, Proceedings, volume 3989 of
Lecture Notes in Computer Science, pages 239–252, 2006.
[ISW03] Yuval Ishai, Amit Sahai, and David A. Wagner. Private circuits: Securing hardware
against probing attacks. In Dan Boneh, editor, Advances in Cryptology - CRYPTO
2003, 23rd Annual International Cryptology Conference, Santa Barbara, California,
USA, August 17-21, 2003, Proceedings, volume 2729 of Lecture Notes in Computer
Science, pages 463–481. Springer, 2003.
[JKP03] Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors. Cryptographic
Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood
Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes
in Computer Science. Springer, 2003.
[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In
Wiener [Wie99], pages 388–397.
[MPG05] Stefan Mangard, Thomas Popp, and Berndt M. Gammel. Side-channel leakage of
masked CMOS gates. In Alfred Menezes, editor, Topics in Cryptology - CT-RSA
2005, The Cryptographers’ Track at the RSA Conference 2005, San Francisco, CA,
USA, February 14-18, 2005, Proceedings, volume 3376 of Lecture Notes in Computer
Science, pages 351–365. Springer, 2005.
[MPL+ 11] Amir Moradi, Axel Poschmann, San Ling, Christof Paar, and Huaxiong Wang.
Pushing the limits: A very compact and a threshold implementation of AES. In
Kenneth G. Paterson, editor, Advances in Cryptology - EUROCRYPT 2011 - 30th
Annual International Conference on the Theory and Applications of Cryptographic
Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings, volume 6632 of Lecture
Notes in Computer Science, pages 69–88. Springer, 2011.
[MPO05] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Successfully attacking
masked AES hardware implementations. In Rao and Sunar [RS05], pages 157–171.
[NAN] NANGATE. The NanGate 45nm Open Cell Library. Available at http://www.
nangate.com.
[NRS11] Svetla Nikova, Vincent Rijmen, and Martin Schläffer. Secure hardware implementa-
tion of nonlinear functions in the presence of glitches. J. Cryptology, 24(2):292–321,
2011.
[PR11] Emmanuel Prouff and Thomas Roche. Higher-order glitches free implementation of
the AES using secure multi-party computation protocols. In Preneel and Takagi
[PT11], pages 63–78.
[PT11] Bart Preneel and Tsuyoshi Takagi, editors. Cryptographic Hardware and Embedded
Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 -
October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science.
Springer, 2011.
[RBN+ 15] Oscar Reparaz, Begül Bilgin, Svetla Nikova, Benedikt Gierlichs, and Ingrid Ver-
bauwhede. Consolidating masking schemes. In Rosario Gennaro and Matthew
Robshaw, editors, Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptol-
ogy Conference, Santa Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I,
volume 9215 of Lecture Notes in Computer Science, pages 764–783. Springer, 2015.
[Rep16] Oscar Reparaz. Detecting flawed masking schemes with leakage detection tests. In
Thomas Peyrin, editor, Fast Software Encryption - 23rd International Conference,
FSE 2016, Bochum, Germany, March 20-23, 2016, Revised Selected Papers, volume
9783 of Lecture Notes in Computer Science, pages 204–222. Springer, 2016.
[RGV17] Oscar Reparaz, Benedikt Gierlichs, and Ingrid Verbauwhede. Fast leakage assessment.
In Wieland Fischer and Naofumi Homma, editors, Cryptographic Hardware and
Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan,
September 25-28, 2017, Proceedings, volume 10529 of Lecture Notes in Computer
Science, pages 387–399. Springer, 2017.
[RP10] Matthieu Rivain and Emmanuel Prouff. Provably secure higher-order masking of
AES. In Stefan Mangard and François-Xavier Standaert, editors, Cryptographic
Hardware and Embedded Systems, CHES 2010, 12th International Workshop, Santa
Barbara, CA, USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes
in Computer Science, pages 413–427. Springer, 2010.
[RS05] Josyula R. Rao and Berk Sunar, editors. Cryptographic Hardware and Embedded
Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 -
September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science.
Springer, 2005.
[Tri03] Elena Trichina. Combinational logic design for AES subbyte transformation on
masked data. IACR Cryptology ePrint Archive, 2003:236, 2003.
[TSG02] Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified adaptive multi-
plicative masking for AES. In Jr. et al. [JKP03], pages 187–197.
[UHA17] Rei Ueno, Naofumi Homma, and Takafumi Aoki. Toward more efficient DPA-resistant
AES hardware architecture based on threshold implementation. In Sylvain Guilley,
editor, Constructive Side-Channel Analysis and Secure Design - 8th International
Workshop, COSADE 2017, Paris, France, April 13-14, 2017, Revised Selected Papers,
volume 10348 of Lecture Notes in Computer Science, pages 50–64. Springer, 2017.
[Wie99] Michael J. Wiener, editor. Advances in Cryptology - CRYPTO ’99, 19th Annual
International Cryptology Conference, Santa Barbara, California, USA, August 15-19,
1999, Proceedings, volume 1666 of Lecture Notes in Computer Science. Springer,
1999.
Recovering the CTR_DRBG
state in 256 traces
Publication Data
Notes
Abstract. The NIST CTR_DRBG specification prescribes a maximum size on each random
number request, limiting the number of encryptions in CTR mode with the same key to
4 096. Jaffe’s attack on AES in CTR mode without knowledge of the nonce from CHES 2007
requires 2¹⁶ traces, which is safely above this recommendation. In this work, we exhibit an
attack that requires only 256 traces, which is well within the NIST limits. We use simulated
traces to investigate the success probability as a function of the signal-to-noise ratio. We also
demonstrate its success in practice by attacking an AES-CTR implementation on a Cortex-M4
among others and recovering both the key and nonce. Our traces and code are made openly
available for reproducibility.
Keywords: DPA · SCA · CPA · AES · CTR · PRNG · NIST · DRBG · DDLA
1 Introduction
Cryptographic implementations in embedded devices are vulnerable to side-channel attacks (SCA)
such as differential power analysis (DPA), which was first introduced by Kocher et al. in 1999 [KJJ99].
In the following years, many variations of this attack have been proposed, such as correlation
power analysis (CPA) by Brier et al. [BCO04], mutual information analysis (MIA) by Gierlichs et
al. [GBTP08] and very recently, differential deep learning analysis (DDLA) by Timon [Tim19].
The success of DPA and its variations lies in the ability to divide-and-conquer, because the
power consumption at some instants depends on a (constant) small part of the secret combined
with variable known data (e.g. plaintext bytes). In most cases, side-channel attacks are performed
under the assumption that the adversary knows the plaintext and/or ciphertext, which allows him
to hypothesize on and recover chunks of the secret key.
In some scenarios, this assumption does not hold. Consider for example a pseudo-random
number generator (PRNG) that is used for key generation or for the supply of fresh randomness to
masked implementations (to protect against SCA). In such cases, neither the plaintext (i.e. the
state of the PRNG) nor the ciphertext (i.e. the output of the PRNG) are considered public. The
adversary is then assumed to only have knowledge of the power consumption or electromagnetic
radiation emanating from the device.
At CHES 2007, Jaffe [Jaf07] presented an attack on AES in Counter mode (AES-CTR) [Dwo01]
in this adversary model. He showed that the sequential nature of the counter mode enables one to
attack AES-CTR with only knowledge of the power traces and without knowledge of the initial
counter (the nonce). Another line of works that consider the same adversary model is that of
blind side-channel attacks, originally by Linge et al. [LDL14] and recently improved by Clavier
et al. [CR17]. In these works, the joint distribution of leakage points is exploited to extract keys
without knowledge of the plaintext or ciphertext.
In this work, we focus on the case of PRNGs. The NIST recommendations for random
number generation include one type of PRNG which is based on a cipher in CTR mode, denoted
CTR_DRBG [BK15]. AES being an important standardized cipher, many PRNGs naturally
use AES-CTR at their core, which means they are vulnerable to Jaffe’s attack. However, NIST
recommends to limit the size of randomness requests to the CTR_DRBG to 2^19 bits. Generating
such a request thus takes at most 4 096 AES encryptions in CTR mode. The NIST CTR_DRBG
also calls an Update function, which changes the PRNG state (nonce and key) between every
request. Since Jaffe’s attack requires 2^16 encryption traces, it actually does not pose a threat to
the NIST CTR_DRBG. In his conclusion [Jaf07], he does allude to the possibility of using only 2^8
traces.
1.1 Contribution
In this work, we demonstrate an adaptation of Jaffe’s attack, which requires only 256 power
measurements. We explain the methodology and investigate the success probability of the attack
as a function of the signal-to-noise ratio. Interestingly, our attack’s success depends on the nonce it
is trying to recover and we show that in some cases, using fewer traces actually improves the success
probability.
We demonstrate the feasibility of the attack on multiple real devices, essentially showing that
the NIST recommendation for the CTR_DRBG allows for too large requests. We also explore
blind SCA [CR17] as an alternative attack methodology and demonstrate the recently introduced
DDLA [Tim19] in a variation of the attack for misaligned traces.
In the context of masked implementations against SCA, PRNGs are usually required to provide a
constant stream of fresh randomness during the computation. Having that randomness compromised
would nullify the protection offered by the masking countermeasure. To this day, very little research
is publicly available on specific constructions for this PRNG. The question of whether this PRNG
should be protected against side-channel analysis itself is largely avoided. We use our attack as
a starting point for the discussion on how to protect PRNGs against adversaries who only have
access to side-channel information and not the plaintexts/ciphertexts.
2 Preliminaries
In Section 2.1, we give a brief overview of AES and introduce our notation for the rest of the paper.
Section 2.2 describes the NIST recommendations for the CTR_DRBG.
2.1 AES
The Advanced Encryption Standard (AES) is a 128-bit block cipher based on a substitution-
permutation network. The master key can be 128, 192 or 256 bits long and the corresponding
number of rounds is respectively 10, 12 or 14. Each round i (except the last round) consists of 4
transformations (AddRoundKey, SubBytes, ShiftRows and MixColumns), which we explain briefly
below. The 128-bit state is considered as a matrix of 4 by 4 bytes (see Figure 1). Each round also
receives a 128-bit round key Ki , which is derived from the master key using the key schedule. The
details of the key schedule are not relevant here.
AddRoundKey is a linear transformation, which performs a 128-bit exclusive or (⊕) between the
state Xi and the round key Ki :
Yi = Xi ⊕ Ki
SubBytes is the only nonlinear transformation in the round function. It takes each of the 16
bytes of the state and substitutes it for another:
Zi,j = S(Yi,j ) = S(Xi,j ⊕ Ki,j ) j = 0 . . . 15
A typical DPA attack targets the output of this function and exploits the fact that X1,j (a plaintext
byte) is known and variable and K1,j (a master key byte) is unknown and fixed over the acquired
traces.
ShiftRows is simply a permutation of the state bytes, obtained by rotating row j of the state
matrix by j bytes to the left (see Figure 2).
MixColumns is a linear transformation of the AES state, which multiplies each column of the state
with a matrix M over F_2^8. This is the last transformation of each round, except the final round,
where this step is skipped. The matrix is

M =
2 3 1 1
1 2 3 1
1 1 2 3
3 1 1 2
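As an illustration, ShiftRows and MixColumns can be sketched in a few lines of Python. This is a minimal sketch operating on a flat 16-byte state in the column-major byte order of Figure 1; the helper names are ours, not part of any library:

```python
def xtime(a):
    # multiply by 2 in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1
    return ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else a << 1

def gmul(a, f):
    # multiply by one of the MixColumns factors 1, 2 or 3
    return {1: a, 2: xtime(a), 3: xtime(a) ^ a}[f]

M = [[2, 3, 1, 1],
     [1, 2, 3, 1],
     [1, 1, 2, 3],
     [3, 1, 1, 2]]

def shift_rows(s):
    # byte 4*c + r holds row r of column c; row r is rotated left by r bytes,
    # so output column 0 consists of input bytes 0, 5, 10, 15
    return [s[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]

def mix_columns(s):
    out = []
    for c in range(4):
        col = s[4 * c:4 * c + 4]
        out += [gmul(col[0], M[r][0]) ^ gmul(col[1], M[r][1])
                ^ gmul(col[2], M[r][2]) ^ gmul(col[3], M[r][3])
                for r in range(4)]
    return out
```

The well-known MixColumns test vector (0xDB, 0x13, 0x53, 0x45) maps to (0x8E, 0x4D, 0xA1, 0xBC), which gives a quick sanity check for the field arithmetic.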
2.2 CTR-DRBG
The NIST publication SP 800-90A [BK15] describes recommendations for random number generation
using Deterministic Random Bit Generators (DRBG). One of these is based on block ciphers in
CTR mode and is therefore referred to as CTR_DRBG. For a detailed description of the operation
of the CTR_DRBG, we refer to [BK15]. A simplified pseudocode of the functions relevant for this
work is given in Algorithms 1 and 2.
Figure 3: Operation of the NIST CTR_DRBG Update function (left) and random bit stream
generation (right) [BK15]
Random Number Generation. In the context of this paper, it is important to know that the
internal state of the CTR_DRBG contains a Key and a value V , as shown in Figure 3. The value
of V at the beginning of a randomness request is what we refer to as the nonce N . The value V is
incremented by a counter after every use of the block cipher AES, as in CTR mode. This is shown
in Figure 3 on the right and in Algorithm 2. While the block cipher performs in CTR mode, the
output blocks are concatenated until the requested output length is obtained.
Updating the State. At the end of the random bit generation in Algorithm 2, a new key and value
V are generated by the CTR_DRBG Update function, which is shown in Algorithm 1. This
essentially means that performing a DPA attack across various requests is not possible, because the
secret key changes. Any DPA attack would have to be performed during a single request to the
DRBG (Algorithm 2, lines 2-5). However, the maximum number of bits per request is limited [BK15,
Table 3]. If the counter field occupies at least 13 bits of the block, then the maximum number of
bits per randomness request is 2^19. In the case of AES, which has a block length of 16 bytes, this is
equivalent to 2^12 = 4 096 encryptions. If the counter field length (“ctr_len”) is smaller, the number
of performed encryptions in CTR mode is 2^ctr_len − 4. Not specified here in these algorithms is the
reseed counter, which makes sure that the PRNG is reseeded when the number of requests exceeds
a threshold. According to the NIST specifications, this threshold must be at most 2^48.
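The request-and-update flow described above can be sketched as follows. This is a simplified illustration of the data flow only, not the full specification: the function names are ours and a hash-based stand-in replaces AES-128 as the block cipher.

```python
import hashlib

BLOCKLEN = 16

def block_encrypt(key, block):
    # stand-in PRF for AES-128; any 16-byte PRF illustrates the data flow
    return hashlib.sha256(key + block).digest()[:BLOCKLEN]

def inc(v):
    # increment the 128-bit value V, as in CTR mode
    return ((int.from_bytes(v, "big") + 1) % (1 << 128)).to_bytes(BLOCKLEN, "big")

def ctr_drbg_update(key, v, provided_data=bytes(32)):
    # derive seedlen = keylen + blocklen = 32 fresh bytes, XOR in provided_data
    temp = b""
    while len(temp) < 32:
        v = inc(v)
        temp += block_encrypt(key, v)
    temp = bytes(a ^ b for a, b in zip(temp[:32], provided_data))
    return temp[:16], temp[16:]          # new Key, new V

def ctr_drbg_generate(key, v, nbits):
    assert nbits <= 2**19                # NIST limit per request [BK15]
    out = b""
    while len(out) < nbits // 8:
        v = inc(v)                       # V incremented before each encryption
        out += block_encrypt(key, v)
    key, v = ctr_drbg_update(key, v)     # state update after every request
    return out[:nbits // 8], key, v
```

After every generate call the returned key and V differ from the old state, which is exactly why a DPA attack cannot combine traces across requests.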
Forward/Backward Secrecy. The concepts of forward and backward secrecy evaluate the security
of PRNGs when their state is compromised (i.e. known by an adversary). The CTR_DRBG
provides backward secrecy because recovering the state (key and nonce) during one request does
not allow an adversary to compute the previous states. The explanation for this is simply that
the current state is the result of an AES-CTR computation with the previous (unknown) key (see
Algorithm 1). On the other hand, as long as the DRBG is not reseeded with a fresh seed, it does
not provide forward secrecy, since the knowledge of the current state allows one to perfectly predict
the following states.
3 The Attack
In this attack, as in [Jaf07], we perform DPA on four rounds of AES-CTR. In the first rounds, we
assume a large part of the state is constant and we recover information about a few variable bytes.
By propagating them through the ShiftRows and MixColumns transformations, we obtain enough
information to perform DPA in the next round, until finally, we can recover the entire round key in
round four. In this work, we choose CPA as our attack methodology.
Simulated traces. For the remainder of this section, we apply the steps of the attack to simulated
traces and explore the success rate as a function of the signal-to-noise ratio (SNR). We will apply
the attack to traces from real devices in Section 4. To generate the simulated traces, we perform
AES-CTR and after each round transformation, we collect the Hamming weights of the 16 bytes of
the state and add them to the trace. Each time sample in a simulated trace thus corresponds to
the Hamming weight of one state byte in one round. We then add Gaussian noise to the trace with
some standard deviation σ. The variance σ² is calculated as the variance of the collected Hamming
weights divided by the desired SNR. For example, the relationship between the actual Hamming
weight and some simulated leakages is shown in Figure 4. For each experiment, we add new noise
to the original Hamming weights and we measure the success as the proportion of correct bytes
recovered. We repeat each experiment ten times for each SNR.
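This trace simulation can be sketched using only the Python standard library; simulate_traces and its signature are our own choice:

```python
import random
from statistics import pvariance

def hamming_weight(x):
    return bin(x).count("1")

def simulate_traces(intermediates, snr, seed=0):
    # intermediates: one list of intermediate state bytes per encryption
    rng = random.Random(seed)
    hws = [[hamming_weight(b) for b in enc] for enc in intermediates]
    flat = [h for enc in hws for h in enc]
    # Var(noise) = Var(signal) / SNR, as described above
    sigma = (pvariance(flat) / snr) ** 0.5
    return [[h + rng.gauss(0, sigma) for h in enc] for enc in hws]
```

Each simulated sample is the Hamming weight of one state byte plus Gaussian noise whose variance is scaled to the desired SNR.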
Setup. The input to the first round is constructed by the addition of a counter T with an unknown
nonce N: X1 = (N + T) mod 2^128. We assume for simplicity that the counter starts at the least
significant byte of the state. It is trivial to adapt the attack if this is not the case. We thus assume
that X1,15 = (N15 + T) mod 256, with N15 constant and unknown and T the counter starting from 0.
Further, since we will only use 256 traces, we can consider the 14 most significant bytes completely
constant: X1,j = Nj for j < 14. Byte 14 is a special case, since it is not constant, but will only
assume two values: N14 and (N14 + 1) mod 256. We visualize this in Figure 5, where white squares
signify fixed values, black squares are varying continuously and the byte in the grey square toggles
at most once in the set of traces.
As in [Jaf07], let N15 = N15,hi|N15,lo and K1,15 = K1,15,hi|K1,15,lo, where hi denotes the most
significant bit and lo the other 7 bits, and let b = N15,hi ⊕ K1,15,hi. Then we can write Z1,15 as

Z1,15 = S((b ≪ 7) ⊕ K1,15,lo ⊕ ((N15,lo + T) mod 256)) [Jaf07]
We then perform CPA, where we hypothesize on the 15 bits (b, K1,15,lo and N15,lo ) and compute
the correlation between our Z1,15 and the traces. The winning hypothesis (with the largest absolute
correlation) does not tell us the most significant bits of K1,15 and N15 , but this is of no importance
for the remainder of the attack. With these 15 bits, we know Z1,15 completely.
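The CPA step reduces to computing, for each hypothesis, the Pearson correlation between the predicted leakages and the measurements at a time sample, and keeping the hypothesis with the largest absolute correlation. A generic sketch (the helper names are ours):

```python
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den if den else 0.0

def cpa_best_hypothesis(samples, predictions):
    # samples: one measurement per trace at a single time sample
    # predictions: {hypothesis: [predicted leakage per trace]}
    return max(predictions,
               key=lambda h: abs(pearson(predictions[h], samples)))
```

In step 1, predictions[h] would hold the Hamming weight of the hypothesized Z1,15 for every counter value T, for all 2^15 values of (b, K1,15,lo, N15,lo).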
Figure 6 shows the success rate of this step, which is 1.0 for reasonably low SNR levels. Below
the threshold of SNR=0.2, the success rate decreases dramatically and becomes 0.0 as of SNR=0.01.
Figure 6: Success Rate of Step 1 with 256 traces as a function of the SNR.
Figure 7: AES state after the first Shiftrows (left) and MixColumns (right) transformations.
By treating K2,0 ⊕ 2Z1,0 ⊕ 3Z1,5 ⊕ Z1,10 as one unknown 8-bit constant C2,0, we can recover this
constant using CPA with only 256 hypotheses and thus recover Z2,0 = S(Z1,15 ⊕ C2,0). The same is
true for the other three bytes in the first column:

Z2,1 = S(Z1,15 ⊕ C2,1)
Z2,2 = S(3 · Z1,15 ⊕ C2,2)
Z2,3 = S(2 · Z1,15 ⊕ C2,3)
We note that performing CPA for Z2,0 and Z2,1 is identical, since in both cases the S-box input
is the sum of 1Z1,15 with a constant. Indeed, the example in Figure 8 shows that there is not one
but there are two prevailing hypotheses: 0xAC and 0x94. Since each byte corresponds to only one
time sample in the simulated traces, the correlation peaks are very close to each other in Figure 8.
The separation is more clear in real power traces. If the S-box evaluations are not randomly shuffled,
it is trivial to decide which constant belongs to which state byte. In this case, C2,0 = 0x94 and
C2,1 = 0xAC.
Figure 9 shows the success rate of recovering all four bytes of the first column. Again, the
threshold for reaching 100% success lies at SNR=0.2. The cutoff is still quite steep, with 0 success
for SNR=0.001 and below.
Figure 8: Pearson Correlation coefficients in Step 2 with SNR=1.0, with 256 traces as a function of
the time samples (left) and their maximum as a function of the number of traces (right).
Figure 9: Success Rate of the second step of the attack with 256 traces.
As indicated by the grey squares in Figure 10, each column now has an additional byte that is
non-constant. Because we only have 256 traces and did not follow the approach from [Jaf07], the
grey bytes are also unknown. However, keep in mind that the grey bytes only assume two distinct
values throughout all the traces. The number of traces for each value depends on the carry of the
addition X1,15 = (N15 + T) mod 256, which makes X1,14 toggle from N14 to (N14 + 1) mod 256.
Best Case. Assume for simplicity that the least significant byte of the nonce N15 is 0x00. In that
case, X1,14 = N14 never toggles and all grey squares in Figure 10 are constant, just like the white
squares. This means that in each column, we can apply the same method as we did in round 2.
Each byte of the SubBytes output can be written as

Z3,j = S(fj · Z2,kj ⊕ C3,j)    (2)

where the factors fj are easily derived from the MixColumns matrix M and kj refers to the known
byte (the black squares) in each column (see Appendix A). As in round 2, each column again has
two bytes for which the hypotheses are identical (when fj = 1) and the correct constants can be
derived by comparing the time samples where the maximum correlation occurs.
Now, assume instead that N15 is 0xFF and X1,14 toggles immediately, leading to the grey squares
in Figure 10 being identical in all but one of the traces. When performing the same CPA, we now
recover different constants C′3,j, corresponding to the case X1,14 = (N14 + 1) mod 256.
Average Case. In all other cases, the constants in the computation will be C3 for the first portion
of traces and C′3 for the second portion, after X1,14 has toggled. Interestingly, the same approach
as before, with 256 traces, still works. The winning hypotheses are those constants that occur most
often in the set of traces. The traces that correspond to the other (non-winning) constants act as
noise. The attack is successful if the 16 recovered bytes are either all C3 or all C′3, but not a mix of
both. Clearly, this depends on the least significant byte of the nonce (N15), since this byte decides
when X1,14 toggles from N14 to N14 + 1 and the constants from C3 to C′3. This is demonstrated in
Figure 13, where we show the success rate for various values of N15.
Worst Case. The worst case scenario is when the toggle occurs approximately halfway, i.e. when
N15 ≈ 0x80. In that case, the constants C3,j and C′3,j are in a close race (see Figure 11, left). This
results in recovering some bytes from C3 and some from C′3, which is a problem for the next and
last stage of the attack. This is clearly reflected in the results in Figure 13, since the success rate
only converges to approximately 0.6 for nonce 0x80. Figure 13 also shows the success rate of the
attack with N15 = 0x80 when we use only half of the traces, indicated by 0x80∗. This is a rare and
interesting case in which using fewer traces actually improves the performance of the attack, though
this is not surprising, since we know that the traces we are removing act as noise.
Figure 11: Pearson Correlation coefficients in Step 3 with N15 = 0x80 and SNR=1.0 with 256
traces (left) and 128 traces (right).
In Figure 12, we depict the maximum correlation coefficient for each hypothesis as a function of
the number of traces used for the best and worst case. It demonstrates again very clearly that with
nonce N15 = 0x80, using more than 128 traces only deteriorates the success of key recovery.
The attacker only knows the least significant 7 bits of the nonce, so is unable to distinguish 0x80
from 0x00. However, seeing a close race as in Figure 11, left is a good clue, especially if performing
the CPA again with only half the traces results in a clear winner (Figure 11, right).
Also in other cases, the knowledge of the 7 least significant nonce bits can be used to calculate
exactly how many traces to remove (either at the beginning or the end of the acquired set) to have
a pure subset of traces using only one constant. There are two possible sets of traces, depending on
whether the most significant bit of N15 is 0 or 1. We can try out both possibilities and detect as
in Figure 11, which option gives the best results. We will demonstrate this in the examples in
Section 4.
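Computing the two candidate subsets from the 7 recovered nonce bits can be sketched as follows (the helper name and return convention are ours):

```python
def candidate_splits(n15_lo):
    # X1,15 = (N15 + T) mod 256 wraps at T = 256 - N15; at that point the
    # grey byte X1,14 toggles. Before the split the round-3 constants are a
    # pure C3 set, after it a pure C3' set. The top bit of N15 is unknown,
    # so both guesses are returned and tried.
    splits = []
    for hi in (0, 1):
        n15 = (hi << 7) | n15_lo
        t = (256 - n15) % 256          # toggle position (0 means no toggle)
        splits.append((hi, range(0, t or 256), range(t, 256)))
    return splits
```

For the nonce byte recovered in Section 4 (N15,lo = 0x0D), the guess hi = 1 gives a split after 115 traces, matching the close race observed there.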
Figure 12: Maximum Correlation coefficients as a function of the number of traces in Step 3 with
SNR = 1.0 and N15 = 0x00 (left) or N15 = 0x80 (right).
Figure 13: Success Rates of the third step of the attack for various nonces with 256 traces (except
0x80∗ with 128 traces).
As in step 3, using only part of the traces can still help to improve the success probability. This is
again indicated in Figure 14, right and in Figure 15 by 0x80∗. From now on, we always perform the
fourth step of the attack
with the same selection of traces as step 3. With the recovery of the round key K4 , it is trivial to
reverse the key schedule and calculate the master key K1 . Next, we can calculate the nonce N by
performing the AES rounds backward from the state Z3 .
The success probabilities in Figures 9 to 15 were each obtained in experiments using the correct
information from the previous steps. They are thus actually conditional probabilities, conditioned
on the success of the previous step of the attack. Hence, by multiplying these success rates, we
obtain the success rate of the entire attack. This is shown in Figure 16.
3.5 Discussion
Jaffe’s Original Attack. In the original attack by Jaffe [Jaf07], the first step that recovers Z1,15
by hypothesizing on 15 bits is identical. The difference with this paper is that Jaffe uses 2^16
power traces and can therefore also recover Z1,14 with this approach. This requires hypothesizing
on 16 bits and is thus more complex. In our case, with only 256 traces, byte Z1,14 is almost
constant, hence we must follow a different approach. This way, we also avoid the hypothesis on
16 bits. After retrieving Z1,14 and Z1,15 , Jaffe selects a subset of traces in which the remaining
bytes Z1,0 , . . . , Z1,13 are constant. In the second round of encryption, the attack follows the same
approach as described in § 3.2. With both Z1,15 and Z1,14 known, it is possible to recover the first
two columns: Z2,0 , . . . , Z2,7 . Finally, in round three, two bytes per column are known and variable,
so the same approach as in round two allows retrieval of the entire state Z3 . The last step of the
attack is again analogous to ours.
Figure 14: Maximum Correlation coefficients as a function of the number of traces in Step 4 with
SNR=1.0 and N15 = 0x00 (left) or N15 = 0x80 (right).
Figure 15: Success Rates of the fourth step of the attack for various nonces with 256 traces (except
0x80∗ with 128 traces).
More traces available? The NIST recommendations currently allow an adversary to obtain up to
4 096 traces of AES-CTR, which is well above 256. What happens to the attack success probability
when we can actually use this full number of traces? A general understanding in side-channel
analysis is that increasing the number of traces always increases the success probability of an attack.
This is certainly also true for the first step of the attack, since the least significant byte of the
counter is not affected by a carry from a previous byte. With up to 4 096 = 2^12 encryptions in
CTR mode, the same can be said for the second step of the attack, since a counter up to 2^12 is not
enough to invalidate the assumption that three bytes in the first column are constant. In the third
and fourth step however, the success very much relies on the assumption that three bytes in each
column are (quasi-)constant, which means increasing the number of traces would only increase the
“noise”. However, having more than 256 traces available can certainly help, since one can select
from them the perfect subset of traces. For example, if the attacker suspects from the first 256
traces that the least significant byte of the nonce is near 0x80, (s)he only has to throw away the
first 128 traces and use the next 256 traces to turn a worst case scenario into a best case scenario
(nonce 0x00). Similarly, with any other nonce N15 the adversary can compute exactly how many
traces to throw away (256 − N15 ) to obtain the subset with nonce 0x00. It is important that step 3
and 4 of the attack are still performed with only 256 traces, in order for the assumptions on the
white squares to hold.
Hence, if an adversary has 512 traces at his disposal, the success rate of the attack will always
follow the best case in Figure 16, or even a bit better, since the first two steps can use the full amount
of 512 traces (see Figure 17). We will illustrate this method in an application in Appendix C.
The Rippling Carry. There is one more case we did not consider in the above description of the
attack. We mention it here, since it does not significantly affect the attack. In Figure 5, we assume
Figure 16: Success Rates of the attack for various nonces with 256 traces (except 0x80∗ with 128
traces).
Figure 17: Success Rate of the attack with 256 traces (best case) or 512 traces (any case)
that the white squares are completely constant throughout a set of 256 traces and that only the
grey square can toggle once. However, if X1,14 = 0xFF, its toggle to 0x00 will actually create a
non-zero carry which affects X1,13 and makes it increment as well. If that byte is 0xFF as well, the
carry propagates to the next byte, and so on.
While this is something to keep an eye on when recovering the nonce N from X1 , it should not
affect the first four steps of the attack. The toggling of any other byte from one value to another
will happen at the same time as the toggling of X1,14 . Hence, the situation in round 3 and 4 of the
attack remains the same: a part of the traces corresponds to one constant (C3) and another part
uses the constant C′3.
Step two of the attack is affected if the carry ripples all the way to byte X1,10 , which affects the
first column of the state in round 2. This would mean that N11 = N12 = N13 = N14 = 0xFF and is
thus a very special case.
4 Experimental Validation
To test our attack on a real device, we program a Cortex-M4 CPU with an AES-CTR implementation.
For this, we use the ChipWhisperer CW308T-STM32F3 target mounted on the CW308 UFO board.
The UFO board is connected to the ChipWhisperer-Lite board. We use the ChipWhisperer Capture
software for programming the device, communicating with the device and for collecting power
measurements. The clock frequency of the target and sample rate of the scope are set to the
ChipWhisperer defaults.
We collect exactly 256 traces of 12 000 samples each, consisting of approximately the first four
rounds of AES. The nonce and key are chosen randomly by the ChipWhisperer Capture software.
Thanks to the ChipWhisperer measurement setup, the traces are well aligned. An example trace
is shown in Figure 18. For efficiency, we will use only the SubBytes region of each round in the
corresponding steps of the attack.
Figure 18: Example trace of the first four rounds of AES-CTR on a Cortex-M4.
Round 1. In the first step of the attack, the winning hypothesis achieves almost double the
correlation of the others. We learn that (b, K1,15,lo , N15,lo ) = (0, 0x57, 0x0D) (see Figure 19). This
means that (K1,15 , N15 ) is either (0x57, 0x0D) or (0xD7, 0x8D). We already have here an example
of a possible worst-case scenario.
Figure 19: Pearson Correlation coefficients in Step 1, with 256 traces as a function of the time
samples (left) and their maximum as a function of the number of traces (right).
Round 2. In Round 2, we recover the constants C2 = [0x65, 0x22, 0x52, 0x52] (see Figure 20).
Figure 20: Pearson Correlation coefficients in Step 2 (bytes 0 and 1), with 256 traces as a function
of the time samples (left) and their maximum as a function of the number of traces (right).
Round 3. In round 3, we start with the attack to recover constant C3,0 . The result is shown in
Figure 21, left and gives a strong suspicion that the least significant byte of the nonce is actually
0x8D, since we see a close race between two constants. This means that the least significant key
byte should be 0xD7.
Figure 21: Pearson Correlation coefficients in Step 3 with 256 traces (left) and 128 traces (right)
(byte 0).
If we perform the same attack with only half the traces (see Figure 21, right), we obtain a clear
winner. In Figure 22, we show the maximum correlation coefficients as a function of the number of
traces used. We thus suspect that N15 = 0x8D and continue the attack with only half the traces.
We recover the following constants in round 3:
C3 = [0x76, 0x23, 0x3D, 0xCE, 0x70, 0xB9, 0xCB, 0xA4, 0x46, 0x32, 0x6E, 0x84, 0xA0, 0x64, 0x68, 0x09]
Figure 22: Maximum Correlation coefficients as a function of the number of traces in Step 3 (byte
0).
Round 4. Finally, still using half the traces (see Figure 23), we recover the following round key in
Round 4:
K4 = [0x7B, 0xFF, 0x7A, 0xD7, 0x0D, 0x28, 0x2E, 0xE3, 0x00, 0x3E, 0xD1, 0x58, 0xCB, 0x87, 0x0B, 0xBB]
If we perform the Key Schedule backward, we find that the Master Key is
K1 = [0xCC, 0x8E, 0x0F, 0x06, 0x0D, 0xE8, 0x3E, 0x80, 0x24, 0xBE, 0x94, 0x73, 0xBD, 0x6E, 0x8E, 0xD7]
Looking at the least significant byte, we can now confirm that (K1,15 , N15 ) = (0xD7, 0x8D) and that
we probably performed the attack correctly. Indeed, when we perform AES backward from Z3 , we
obtain
X1 = [0xE6, 0x10, 0x3B, 0x22, 0x55, 0x62, 0x7E, 0xE6, 0xBE, 0x93, 0x18, 0xBD, 0x71, 0xB7, 0xBA, 0x8D]
which is equal to the nonce N, since we used the first half of the traces. Pay attention when using
the second half or when the majority of the traces use the constant C′3: in that case, we recover
X1,14 = (N14 + 1) mod 256.
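The backward key-schedule computation can be sketched in pure Python; as a sanity check, inverting three rounds of the key schedule from the recovered K4 must reproduce the master key K1 above. We generate the S-box from the GF(2^8) inverse and affine map instead of hard-coding the table (the helper names are ours):

```python
def xtime(a):
    # multiply by 2 in GF(2^8) with the AES reduction polynomial
    return ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else a << 1

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def sbox(x):
    # multiplicative inverse in GF(2^8) (x^254), then the AES affine map
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    s = inv
    for i in range(1, 5):
        s ^= ((inv << i) | (inv >> (8 - i))) & 0xFF
    return s ^ 0x63

RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def prev_round_key(rk, r):
    # rk holds the four words V0..V3 of round key r; returns round key r-1
    v = [rk[4 * i:4 * i + 4] for i in range(4)]
    u3 = [a ^ b for a, b in zip(v[3], v[2])]
    u2 = [a ^ b for a, b in zip(v[2], v[1])]
    u1 = [a ^ b for a, b in zip(v[1], v[0])]
    t = [sbox(u3[(i + 1) % 4]) for i in range(4)]  # SubWord(RotWord(U3))
    t[0] ^= RCON[r - 1]
    u0 = [a ^ b for a, b in zip(v[0], t)]
    return u0 + u1 + u2 + u3
```

Since K4 is the round key of round 4 and K1 the master key, calling prev_round_key three times with Rcon indices 3, 2, 1 on the K4 recovered above yields exactly the K1 reported above.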
Figure 23: Pearson Correlation coefficients in Step 4 (byte 0), with 128 traces as a function of the
time samples (left) and their maximum as a function of the number of traces (right)
Device Behaviour. Now that we know the key and nonce, we can investigate the relation between
Hamming weights and their leakage on the Cortex-M4. In Figure 24, we plot the Hamming weight
of one point of interest (byte 15 in the first SubBytes) across 256 encryptions on the x-axis and the
measurements of the corresponding sample in the power traces on the y-axis. The corresponding
trace sample is chosen as the time sample where the power measurements have the largest Pearson
correlation with this array of Hamming weights. We also estimate the SNR at this time sample as

SNR = Var(signal) / Var(noise)

The signal is constructed by replacing each measurement with the average of all measurements for
that Hamming weight, as is done in [MOP07]. The noise is approximated by subtracting these
averages from the actual measurements. This way, we obtain SNR ≈ 2.18 at the time sample
corresponding to state byte 15 after the S-box in the first round.
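This SNR estimation can be sketched as follows (standard library only; estimate_snr is our own helper name):

```python
from collections import defaultdict
from statistics import mean, pvariance

def estimate_snr(hws, leakages):
    # group the measurements by the Hamming weight of the target byte
    by_hw = defaultdict(list)
    for h, l in zip(hws, leakages):
        by_hw[h].append(l)
    means = {h: mean(ls) for h, ls in by_hw.items()}
    # signal: each measurement replaced by its class average [MOP07]
    signal = [means[h] for h in hws]
    # noise: deviation of each measurement from its class average
    noise = [l - means[h] for h, l in zip(hws, leakages)]
    return pvariance(signal) / pvariance(noise)
```

We use the population variance, since the signal and noise arrays here are treated as complete populations rather than samples.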
Figure 24: Measured leakages vs. actual Hamming weights on the Cortex-M4.
Other devices. We additionally successfully performed the attack on an Arduino Uno and the
ChipWhisperer-Lite XMEGA target. For the results, we refer to Appendices B and C. The trace
files and a Jupyter notebook performing the above attack can be found online1.
5 Discussion
5.1 The Worst Case Nonce
Figure 16 gives a bleak impression of the attack’s sensitivity to the least significant byte of the
nonce. However, the worst case scenario is not as bad as it seems.
1 https://github.com/LaurenDM/AttackAESCTR
Firstly, it does not imply the existence of a protection mechanism since biasing the nonces
towards 0x80 would only reduce the search space of the attacker.
Secondly, we have shown that removing part of the traces may improve the chance of success.
This trick is not limited to the worst case, as the attacker has knowledge of N15,lo and can thus
always compute the right number of traces to throw away. Without knowing N15,hi , there are
two possible ways to do it, but one will clearly improve results, while the other will make them
worse. However, depending on the amount of noise in the traces, extra traces may still improve the
performance. As stated in § 3, in the worst case, we can say the attack requires 512 traces, which
is still far less than 4 096. With 512 traces available, step 3 and 4 always achieve the best success
rate and the dependency on the nonce thus disappears.
Finally, PRNGs tend to be used for applications that need a continuous supply of randomness.
If the attacker really has access to at most 256 traces per PRNG request and the worst case scenario
occurs, the next PRNG request will have a different nonce. NIST prescribes that the PRNG be
reseeded after at most 2^48 requests. As soon as the adversary manages to recover the key and nonce
for one request, the internal state of the PRNG is known and the future random outputs can be
calculated as long as the PRNG is not reseeded.
Application to AES-CTR. In the original DDLA paper [Tim19], this methodology uses around
3 000 traces. With only 256 traces available, it is more likely that the network “memorizes” the data
and starts to overfit. Choosing a suitable network architecture is therefore somewhat more challenging
in our case.
We used the same traces as in Section 4, but as in [Tim19], created a misalignment by shifting each
trace by a random offset between -25 and 25. This is not a large offset, but it is sufficient to make
regular CPA fail. Our neural network starts from the CNNexp architecture from Timon [Tim19], but we use 8
filters of size 100 in the first convolutional layer and we replace the second convolutional layer with
a 10-neuron dense layer. We also use the most significant bit of the S-box output in our hypotheses.
It is standard to randomly initialize a neural network’s weights. The initial weights have some
influence on the accuracy of the training, which is why training the network for the same hypothesis
twice can give different results in accuracy. We therefore found that, when training the same
network for different hypotheses and comparing their accuracies, it is better to always use the same
initial weights in the neural network.
Figure 25 shows the resulting accuracies for the first step of the attack. Even with only 256
traces, this methodology works. It takes a lot of computation time, since we need to train the
network 2^15 times, but it succeeds where regular CPA does not.
Distinguishing time samples. In the second step of the attack, it is important to know the most
defining time samples in the trace in order to distinguish the winning hypotheses of two bytes in
one column. For this purpose, we use the sensitivity analysis as described in [Tim19, §3.2.2]. The
results are shown in Figure 26 with the accuracies on the left and the sensitivities on the right.
They show clearly which of the two constants appears first in the trace. The results correspond to
those of Section 4.
We can thus conclude that other variants of DPA methods can be applied in the attack. DDLA
is a good choice if the traces are misaligned, but does take quite some computation time.
Recovering the CTR_DRBG state in 256 traces 201
Figure 25: Performing the first step of the attack with DDLA, using 256 traces.
Figure 26: Performing the second step of the attack (byte 0 and 1) with DDLA, using 256 traces.
Accuracies (left) and sensitivity analysis (right).
Application to AES-CTR. Since not all plaintext bytes vary in CTR mode, it makes more sense
to apply the blind attack to the last round, where the ciphertext bytes are constantly changing.
We noticed a number of drawbacks to blind SCA compared to regular CPA in this application. For
example, the blind attack requires a precise estimation of the location of the two points of interest:
X_{i,j} and Z_{i,j}, corresponding to a byte X_{i,j} at the input of AddRoundKey and the byte
Z_{i,j} = S(K_{i,j} ⊕ X_{i,j}) at the output of SubBytes. Even if one manages to pinpoint the correct samples in the traces,
the attack also requires the leakages at these points to be converted to Hamming weight estimations.
In the work of [CR17], this is done by estimating coefficients α and β such that the leakage is
approximately αHW + β. However, when we compare the measured leakages on a Cortex-M4
device with the actual Hamming weights in Figure 24, we see that this conversion easily yields
wrong Hamming weight estimates.
In contrast, for a regular CPA attack, it suffices to identify only an approximate region of
interest, since the Pearson correlation coefficient can be computed for many time samples. Moreover,
it is not required to estimate the Hamming weights, since CPA can be applied directly to the
measurements obtained from the oscilloscope (no matter the leakage unit). The same can thus be
said for the CPA-based attack of this work.
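To make this contrast concrete, here is a minimal CPA sketch on simulated traces (all data, amplitudes and the leakage position are synthetic illustrations, not our measurement setup). The Pearson correlation is computed for every time sample, directly on the raw leakage units, so neither an exact point of interest nor a conversion to Hamming weights is needed:

```python
import numpy as np

rng = np.random.default_rng(1)
n_traces, n_samples = 256, 50
ct = rng.integers(0, 256, n_traces)                       # known, varying ciphertext byte
hyp = np.array([bin(b).count("1") for b in ct], float)    # Hamming-weight hypothesis
traces = rng.normal(0.0, 1.0, (n_traces, n_samples))      # raw oscilloscope units
traces[:, 30] += 0.8 * hyp                                # leakage with unknown alpha/beta

# Pearson correlation of the hypothesis against every time sample.
tc = traces - traces.mean(axis=0)
hc = hyp - hyp.mean()
rho = tc.T @ hc / (np.sqrt((tc ** 2).sum(axis=0)) * np.sqrt((hc ** 2).sum()))
print(int(np.argmax(np.abs(rho))))  # prints 30: the leaking sample stands out
```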
Experiments. We tried a simplified attack, where the points of interest and α, β are given to the
adversary: We collected power measurements both from an Arduino Uno and from a Cortex-M4.
We computed the actual Hamming weight values using a simulation of AES-CTR with the same
nonce and key as was sent to the device. We determined the points of interest in the real power
traces by computing the correlation of the trace points with the real Hamming weights. We then
used the least squares method to determine the coefficients α, β in the relationship between the
real Hamming weights and the leakage units of the trace. We used the maximum number of traces
available according to the NIST recommendations: 4 096. Even then, the blind SCA was only able
to recover 11 of the 16 key bytes on the Arduino Uno device and 8 bytes on the Cortex-M4. We
show figures for each key byte in Appendix D.
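The α, β estimation step described above can be sketched on simulated data as follows (the model coefficients, noise level and trace count are illustrative assumptions, not our measured values). Note how, even with an accurate least-squares fit, inverting the model and rounding often yields the wrong Hamming weight, mirroring the problem observed in Figure 24:

```python
import numpy as np

rng = np.random.default_rng(2)
hw_true = rng.integers(0, 9, 4096).astype(float)          # HWs known from simulation
alpha_true, beta_true = 3.7, -12.0                        # illustrative device model
leak = alpha_true * hw_true + beta_true + rng.normal(0.0, 2.0, hw_true.size)

# Least-squares fit of leak ~ alpha * HW + beta.
A = np.vstack([hw_true, np.ones_like(hw_true)]).T
(alpha, beta), *_ = np.linalg.lstsq(A, leak, rcond=None)

# Inverting the fitted model gives noisy HW estimates; rounding is often wrong.
hw_est = np.rint((leak - beta) / alpha)
err_rate = float(np.mean(hw_est != hw_true))
print(f"alpha={alpha:.2f} beta={beta:.2f} wrong-HW rate={err_rate:.2f}")
```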
5.4.1 Observations
Hiding. A common hiding technique against side-channel attacks is to randomize the order of the
16 S-box calculations during SubBytes. We saw in § 3.2 and § 3.3 that the order of execution is
important to distinguish two of the constants in each state column. The hiding countermeasure
therefore does not increase the number of traces required but can increase the complexity of the
attack by increasing the number of possibilities to try in step 2 and step 3. However, it does not
completely prevent our attack.
Hardware. Related to the shuffling of S-box calculations, a hardware PRNG implementation that
performs all 16 S-boxes in parallel does not allow the two equal-hypothesis constants in each column
to be distinguished. More importantly, when the 128-bit state is being operated on in parallel, the
signal-to-noise ratio is a lot smaller, since the leakage of one byte (the signal) only corresponds to
approximately one sixteenth of the measurement (not including the noise) [MOP07]. Furthermore,
in an unrolled implementation, it is difficult to separate the power measurements of the 128-bit
states of different rounds, even if the device is sampled at a very high rate. In other words, the SNR
of such a hardware implementation would be much worse than for our software implementations
and it is unlikely that even a regular CPA attack (with knowledge of the plaintext) would succeed
with only 256 traces.
Real-world Crypto. Although the NIST recommendations permit such large requests, we found two
commercial CTR_DRBG implementations which do put a proper limit on the request size. In the open-source mbed TLS
library [arm], we can see that the maximum number of requested bytes per CTR_DRBG call is
1024, which is equivalent to only 64 encryptions with AES-128 in CTR mode. Even more secure is
a CTR_DRBG implementation by Texas Instruments [Ins], which puts the limit at 2^11 bits, or
equivalently only 16 AES-CTR encryptions.
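These limits translate into traces per key as follows (a small sketch; the 2^19-bit NIST per-request maximum is the value that corresponds to the 4 096 traces mentioned earlier):

```python
# Traces per key implied by different per-request limits (AES-128: 16-byte blocks).
def blocks_from_bytes(n_bytes):
    return n_bytes // 16

def blocks_from_bits(n_bits):
    return n_bits // 128

print(blocks_from_bits(2 ** 19))   # NIST maximum per request -> 4096 traces
print(blocks_from_bytes(1024))     # mbed TLS limit           -> 64 traces
print(blocks_from_bits(2 ** 11))   # Texas Instruments limit  -> 16 traces
```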
5.4.2 Recommendations
Types of counters. As previously mentioned, the attack is not prevented if the counter field
starts in a byte X_{1,j*} other than the least significant byte X_{1,15}. The first step of the attack then
simply recovers Z_{1,j*}, which propagates to a different column in step 2, but does not change the
overall approach or complexity. On the other hand, the NIST document on modes of operation
also suggests the possibility to use an LFSR as incrementing function in AES-CTR, as long as its
period is sufficiently long [Dwo01]. A good choice of LFSR would update bits which are spread over
the entire 128-bit AES state, rather than just one byte, thereby preventing the divide-and-conquer
approach that enables targeting key bytes in side-channel attacks. In contrast with a normal
incrementing counter, it is thus possible for a CTR_DRBG based on an LFSR to resist our attack,
even if 4 096 traces are available.
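For illustration, one step of a Galois LFSR over a 128-bit state might look as follows in Python. The tap mask here is an arbitrary placeholder, not a primitive polynomial; a real design must choose taps that guarantee a sufficiently long period, as required by [Dwo01]:

```python
# Hypothetical 128-bit Galois LFSR as the incrementing function. TAPS is an
# arbitrary placeholder for illustration, NOT a primitive polynomial.
MASK = (1 << 128) - 1
TAPS = (1 << 127) | (1 << 125) | (1 << 100) | (1 << 63) | 1

def lfsr_step(state):
    feedback = state & 1
    state >>= 1
    if feedback:
        state ^= TAPS
    return state & MASK

s = 0x0123456789ABCDEF0123456789ABCDEF
t = lfsr_step(s)
# Unlike state+1, one step changes bits spread across the whole 128-bit state,
# defeating the byte-wise divide-and-conquer that the attack relies on.
print(bin(s ^ t).count("1") > 8)  # prints True
```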
Size of Requests. Each new request to the NIST CTR_DRBG results in the computation of
AES-CTR with a different nonce and key, because of the CTR_DRBG_Update function, which
is performed with every randomness request. Hence, when using such a PRNG for a masked
implementation, it would clearly be less secure to perform a single (large) request for all the
random bits needed in a masked AES than to perform multiple (small) requests, such as one for
each masked S-box evaluation separately. Even worse would of course be to not follow the NIST
recommendations and to keep using the same key across PRNG requests.
Conclusion for masked implementations. Through this section, we also want to start the discus-
sion on whether PRNGs for masked implementations need their own side-channel protection, a
question which is often sidestepped due to its “chicken-and-egg” character. Indeed, protecting a
PRNG used for masked implementations against side-channel attacks would require its own fresh
randomness source in return. However, by keeping these observations and recommendations in
mind and ensuring that request sizes are sufficiently limited, this attack against AES-CTR can
be avoided. For other modes of operation, it appears that blind SCA can also be avoided on real
devices if the number of available traces is limited. Under the assumption that the state (nonce and
key) of the CTR_DRBG is not known to the side-channel adversary, it does not seem like masking
the PRNG is necessary. An investigation of other PRNG constructions and attacks against them is
an interesting direction for future research.
6 Conclusion
In this work, we demonstrated an attack on AES-CTR mode with unknown key and nonce in only
256 traces, a significant improvement over the previous attack by Jaffe [Jaf07]. Most importantly,
this number of traces shows that a CTR_DRBG following the NIST specification can be vulnerable
to this attack, as it currently allows adversaries to obtain as much as 4 096 traces of a CTR_DRBG
performing AES-CTR. We demonstrated the feasibility of our attack on several real devices such as
a Cortex-M4, and we make our implementations openly available for reproducibility.
We explored alternative methods such as DDLA for misaligned traces and blind SCA, which
does not require the CTR-mode assumption.
We start the discussion on PRNGs for masked implementations (i.e. a PRNG for which an
adversary can only observe the power consumption), a topic for which very little research is available.
Using the observations from this attack, we can conclude that masking should not be necessary
for such a PRNG, provided it is used correctly, for example by limiting the request size and
updating the state between requests. The question remains whether a construction as large as AES-CTR
is necessary in this adversary model. An investigation of various PRNG constructions and their
security in this context is an interesting direction for future work.
Acknowledgements
The author would like to thank Josep Balasch, Arthur Beckers, Begül Bilgin, Vincent Rijmen and
Lennert Wouters. The author is funded by a PhD fellowship of the Fund for Scientific Research -
Flanders (FWO).
References
[arm] Arm. Mbed TLS. https://tls.mbed.org/api/ctr__drbg_8h.html#a5b787e6157d91055d7c07d40f519cf52.
[BCO04] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation power analysis with
a leakage model. In Marc Joye and Jean-Jacques Quisquater, editors, Cryptographic
Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge,
MA, USA, August 11-13, 2004. Proceedings, volume 3156 of Lecture Notes in Computer
Science, pages 16–29. Springer, 2004.
[BK15] Elaine Barker and John Kelsey. Recommendations for random number generation using
deterministic random bit generators. NIST SP 800-90A Rev. 1, June 2015.
[CR17] Christophe Clavier and Léo Reynaud. Improved blind side-channel analysis by exploita-
tion of joint distributions of leakages. In Wieland Fischer and Naofumi Homma, editors,
Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th International Con-
ference, Taipei, Taiwan, September 25-28, 2017, Proceedings, volume 10529 of Lecture
Notes in Computer Science, pages 24–44. Springer, 2017.
[Dwo01] Morris Dworkin. Recommendation for block cipher modes of operation: Methods and
techniques. NIST SP 800-38A, December 2001.
[GBTP08] Benedikt Gierlichs, Lejla Batina, Pim Tuyls, and Bart Preneel. Mutual information
analysis. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and
Embedded Systems - CHES 2008, 10th International Workshop, Washington, D.C., USA,
August 10-13, 2008. Proceedings, volume 5154 of Lecture Notes in Computer Science,
pages 426–442. Springer, 2008.
[Ins] Texas Instruments. Random Number Generation Using MSP430FR59xx and MSP430FR69xx
Microcontrollers. http://www.ti.com/lit/an/slaa725/slaa725.pdf.
[Jaf07] Joshua Jaffe. A first-order DPA attack against AES in counter mode with unknown
initial counter. In Pascal Paillier and Ingrid Verbauwhede, editors, Cryptographic
Hardware and Embedded Systems - CHES 2007, 9th International Workshop, Vienna,
Austria, September 10-13, 2007, Proceedings, volume 4727 of Lecture Notes in Computer
Science, pages 1–13. Springer, 2007.
[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In
Michael J. Wiener, editor, Advances in Cryptology - CRYPTO ’99, 19th Annual Inter-
national Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999,
Proceedings, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer,
1999.
[LDL14] Yanis Linge, Cécile Dumas, and Sophie Lambert-Lacroix. Using the joint distributions
of a cryptographic function in side channel analysis. In Emmanuel Prouff, editor,
Constructive Side-Channel Analysis and Secure Design - 5th International Workshop,
COSADE 2014, Paris, France, April 13-15, 2014. Revised Selected Papers, volume 8622
of Lecture Notes in Computer Science, pages 199–213. Springer, 2014.
[MOP07] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power analysis attacks - revealing
the secrets of smart cards. Springer, 2007.
[Tim19] Benjamin Timon. Non-profiled deep learning-based side-channel attacks with sensitivity
analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst., 2019(2):107–131, 2019.
CAPA: The Spirit of Beaver
against Physical Attacks
Publication Data
Oscar Reparaz, Lauren De Meyer, Begül Bilgin, Victor Arribas, Svetla Nikova,
Ventzislav Nikov, and Nigel P. Smart. CAPA: The Spirit of Beaver against
Physical Attacks. In Advances in Cryptology - CRYPTO 2018 - 38th Annual
International Cryptology Conference, Santa Barbara, CA, USA, August 19-23,
2018, Proceedings Part I (eds. H. Shacham and A. Boldyreva), Lecture Notes
in Computer Science, vol. 10991, pages 121-151. Springer, 2018.
My Contribution
One of the main authors. The software implementation was done by Oscar Reparaz.
CAPA:
The Spirit of Beaver against Physical Attacks
Oscar Reparaz1,2, Lauren De Meyer1, Begül Bilgin1, Victor Arribas1,
Svetla Nikova1, Ventzislav Nikov3 and Nigel Smart1,4
1 KU Leuven, imec - COSIC, Leuven, Belgium (firstname.lastname@esat.kuleuven.be)
2 Square Inc., San Francisco, USA (oreparaz@gmail.com)
3 NXP Semiconductors, Leuven, Belgium (venci.nikov@gmail.com)
4 University of Bristol, Bristol, UK
Abstract. In this paper we introduce two things. First, we introduce the tile-probe-and-fault
model, which generalises the wire-probe model of Ishai et al., extending it to cover both more
realistic side-channel leakage scenarios on a chip and fault and combined attacks. Second, we
introduce CAPA: a combined Countermeasure Against Physical
Attacks. Our countermeasure is motivated by our model, and aims to provide security against
higher-order SCA, multiple-shot FA and combined attacks. The tile-probe-and-fault model
leads one to naturally look (by analogy) at actively secure multi-party computation protocols.
Indeed, CAPA draws much inspiration from the MPC protocol SPDZ. So as to demonstrate
that the model, and the CAPA countermeasure, are not just theoretical constructions, but could
also serve to build practical countermeasures, we present initial experiments of proof-of-concept
designs using the CAPA methodology. Namely, a hardware implementation of the KATAN and
AES block ciphers, as well as a software bitsliced AES S-box implementation. We demonstrate
experimentally that the design can resist second-order DPA attacks, even when the attacker is
presented with many hundreds of thousands of traces. In addition our proof-of-concept can
also detect faults within our model with high probability in accordance to the methodology.
1 Introduction
Side-channel analysis attacks (SCA) [41] are cheap and scalable methods to extract secrets, such
as cryptographic keys or passwords, from embedded electronic devices. They exploit unintended
signals (such as the instantaneous power consumption [42] or the electromagnetic radiation [24])
stemming from a cryptographic implementation. In the last twenty years, plenty of countermeasures
to mitigate the impact of side-channel information have been developed. Masking [15, 26] is an
established solution that stands out as a provably secure yet practically useful countermeasure.
Fault analysis (FA) is another relevant attack vector for embedded cryptography. The basic
principle is to disturb the cryptographic computation somehow (for example, by under-powering
the cryptographic device, or by careful illumination of certain areas in the silicon die). The result
of a faulty computation can reveal a wealth of secret information: in the case of RSA or AES, a
single faulty ciphertext pair makes key recovery possible [10, 48]. Countermeasures are essentially
based on adding some redundancy to the computation (in space or time). In contrast to masking,
the countermeasures for fault analysis are mostly heuristic and lack a formal background.
However, there is a tension between side-channel countermeasures and fault analysis counter-
measures. On the one hand, fault analysis countermeasures require redundancy, which can give out
more leakage information to an adversary. On the other hand, a device that implements first-order
masking offers an adversary double the attack surface to insert a fault in the computation. A duality
relation between SCA and FA was pointed out in [23]. There is clearly a need for a combined
countermeasure that tackles both problems simultaneously.
In this work we introduce a new attack model to capture this combined attack surface which we
call the tile-probe-and-fault model. This model naturally extends the wire-probe model of [34]. In
the wire-probe model, individual wires of a circuit may be targeted for probing. The goal is then
to protect against a certain fixed set of wire-probes. In our model, inspired by modern processor
designs, we allow whole areas (or tiles) to be probed, and in addition we add the possibility of the
attacker inducing faults on such tiles.
Protection against attacks in the wire-probe model is usually done via masking; which is in many
cases the extension of ideas from passively secure secret sharing based Multi-Party Computation
(MPC) to the side-channel domain. It is then natural to look at actively secure MPC protocols
for the extension to fault attacks. The most successful modern actively secure MPC protocols are
in the SPDZ family [20]. These use a pre-processing or preparation phase to produce so called
Beaver triples, named after Beaver [6]. These auxiliary data values, which will be explained later,
are prepared either before a computation, or in a just-in-time manner, so as to enable an efficient
protocol to be executed. This use of prepared Beaver triples also explains, partially, the naming of
our system, CAPA (a Combined countermeasure Against Physical Attacks), since Capa is also the
beaver spirit in Lakota mythology. In this mythology, Capa is the lord of domesticity, labour and
preparation.
Side-Channel Attack Models and Countermeasures: A side-channel adversary typically uses the
noisy leakage model [55], where side-channel analysis (SCA) attacks are bounded by the statistical
moment of the attack due to a limited number of traces and noisy leakages. Given enough noise
and an independent leakage assumption for each wire, this model, when limited to the t-th order
statistical moment, is shown to be comparable to the t-probing model introduced in [34], where an
attacker is allowed to probe, receive and combine the noiseless information about t wires within a
time period [21]. Finally, it has been shown in [4] that a (semi-)parallel implementation is secure in
the t-th order bounded moment model if its complete serialization is secure in the t-probing model.
While the countermeasures against fault attacks are limited to resist only a small subset of the
real-world adversaries and attack models, protection against side-channel attacks stands on much
more rigorous grounds and generally scales well with the attacker’s powers. A traditional solution
is to use masking schemes [9, 29, 34, 51, 56, 58, 59] to implement a function in a manner in which
higher-order SCA is needed to extract any secret information, i.e. the attacker must exploit the
joint leakage of several intermediate values. Masking schemes are analogues of the passively secure
threshold MPC protocols based on secret sharing. One can thus justify their defence by appealing
to the standard MPC literature. In MPC, a number of parties can evaluate a function on shared
data, even in the presence of adversaries amongst the computing parties. The maximum number of
dishonest parties which can be tolerated is called the threshold. In an embedded scenario, the basic
idea is that different parts of a chip simulate the parties in an MPC protocol.
Combining Faults and Side-Channels: Models and Countermeasures. The importance of combined
countermeasures becomes more apparent as attacks such as [2] show the feasibility of combined
attacks. Being a relatively new threat, combined adversarial models lack a joint description and are
typically limited to the combination of a certain side-channel model and a fault model independently.
One possible countermeasure against combined attacks is found in leakage resilient schemes [45],
although none of these constructions provide provable security against FA. Typical leakage resilient
schemes rely on a relatively simple and easy to protect key derivation function in order to update
the key that is used by the cryptographic algorithm within short periods. That is, a leakage resilient
scheme acts as a specific “mode of operation”. Thus, it cannot be a drop-in replacement for a
standard primitive such as the AES block cipher. The aforementioned period can be as short as
one encryption per key in order to eliminate fault attacks completely. However, the synchronization
burden this countermeasure brings makes it difficult to integrate with deployed protocols.
There are a couple of alternative countermeasures proposed for embedded systems in recent years.
In private circuits II [16, 33], the authors use redundancy on top of a circuit that already resists SCA
(private circuits I [34]) to add protection against FA. In ParTI [62], threshold implementations (TI)
are combined with concurrent error detection techniques. ParTI naturally inherits the drawbacks
of using an error correction/detection code. Moreover, the detectable faults are limited in Hamming
weight due to the choice of the code. Finally, in [63], infective computation is combined with error-
preserving computation to obtain a side-channel and fault resistant scheme. However, combined
attacks are not taken into account.
Given the above introduction, it is clear that both combined attack models and countermeasures
are not mature enough to cover a significant part of the attack surface.
Actively Secure MPC. Modern MPC protocols obtain active security, i.e. security against
malicious parties which can actively deviate from the protocol. By mapping such protocols to
the on-chip side-channel countermeasures, we would be able to protect against an eavesdropping
adversary that inserts faults into a subset of the simulated parties. An example of a practical attack
that fits this model is the combined attack of Amiel et al. [2]. We place defences against faults on
the same theoretical basis as defences against side-channels.
To obtain maliciously secure MPC protocols in the secret-sharing model, there are a number of
approaches. The traditional approach is to use Verifiable Secret Sharing (VSS), which works in the
information theoretic model and requires that strictly less than n/3 parties can be corrupt. The
modern approach, adopted by protocols such as BDOZ, SPDZ, Tiny-OT, MASCOT etc. [7, 20,
40, 50], is to work in a full threshold setting (i.e. all but one party can be corrupted) and attach
information theoretic MACs to each data item. This approach turns out to be very efficient in the
MPC setting, apart from its usage of public-key primitives. The computational efficiency of the use
of information theoretic MACs and the active adversarial model of SPDZ lead us to adopt this
philosophy.
Tile-probe-and-fault model. We introduce a new adversary model that expands on the wire-probe
model and brings it closer to real-world processor designs. Our model is set in an architecture
that mimics the actively secure MPC setting that inspires our countermeasures (see Figure 1).
Instead of individual wires at the foundation of the model, we visualize a separation of the chip
(integrated circuit) into areas or tiles, consisting of not only many wires, but also complete blocks
of combinational and sequential logic. Such tiled designs are inherent in many modern processor
architectures, where the tiles correspond to “cores” and the wires correspond to the on-chip
interconnect. This can easily be related to a standard MPC architecture where each tile behaves
like a separate party. The main difference between our architecture and the MPC setting is that in
the latter, parties are assumed to be connected by a complete network of authenticated channels.
In our architecture, we know exactly how the wires are connected in the circuit.
Figure 1: Partition of the integrated circuit area into tiles, implementing MPC “parties”
The tile architecture satisfies the independent leakage assumption [21] amongst tiles. That
is, leakage is local and thus observing the behaviour of a tile by means of probing, faulting or
observing its side-channel leakage, does not give unintended information about another tile through,
for example, coupling.
As the name implies, the adversary in our model exploits side-channels and introduces faults.
We stress that our goal is to detect faults as opposed to tolerate or correct them. That is, if an
adversary interjects a fault, we want our system to abort without revealing any of the underlying
secrets.
Experimental Results. We provide examples of CAPA designs in hardware of the KATAN and
AES block ciphers as well as a software bitsliced implementation of the AES S-box. Our designs
show that our methodology is feasible to implement, and in addition our attack experiments
confirm our theoretical claims. For example, we implemented a second-order secure hardware
implementation of KATAN onto a Spartan-6 FPGA and perform a non-specific leakage detection
test, which does not show evidence of first- or second-order leakage with up to 100 million traces.
Furthermore, we deploy a second-order secure software based CAPA implementation of the AES
S-box on an ARM Cortex-M4 and take electromagnetic measurements; for this implementation
neither first-nor second-order leakage is detected with up to 200 000 traces. Using toy parameters,
we verify our claimed fault detection probability for the AES S-box software implementation. It
should be noted that our experimental implementations are to be considered only proof-of-concept;
they are currently too expensive to be used in practice. But the designs demonstrate that the
overall methodology can provide significant side-channel and fault protection, and they provide a
benchmark against which future improvements can be measured.
Tile Architecture. Consider a partition of the chip in a number of tiles Ti , with wires running
between each pair of tiles as shown in Figure 1. We call the set of all tiles T . Each tile Ti ∈ T
possesses its own combinational logic, control logic (or program code) and pseudo-random number
generator needed for the calculations of one share. In the abstract setting, we consider each tile
as the set composed of all input and intermediate values on the wires and memory elements of
those blocks. A probe-and-fault attacker may obtain, for a given subset of tiles, all the internal
information at given time intervals on this set of tiles. He may also inject faults (known or random)
into each tile in this set.
In our model, each sensitive variable is split into d shares through secret sharing. Without loss of
generality, we use Boolean sharing in this paper.
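As a minimal sketch of this d-sharing (illustrative Python for byte-sized values; each tile would store exactly one of the returned shares):

```python
import secrets
from functools import reduce
from operator import xor

def boolean_share(x, d):
    """Split byte x into d Boolean (XOR) shares; each tile stores one share."""
    shares = [secrets.randbelow(256) for _ in range(d - 1)]
    shares.append(reduce(xor, shares, x))  # last share fixes the XOR-sum to x
    return shares

def unshare(shares):
    return reduce(xor, shares)

shares = boolean_share(0xA7, 3)
print(hex(unshare(shares)))  # prints 0xa7
```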
We define each tile such that it stores and manipulates at most one share of each intermediate
variable. Any wire running from one tile to another carries only blinded versions of a sensitive
variable's share used by Ti. We make minimal assumptions on the security of these wires. Instead,
we include all the information on the unidirectional wires in Figure 1 in the tile on the receiving
and not the sending end. We thus assume only one tile is affected by an integrity failure of a wire.
We assume, without loss of generality, that shared calculations are performed in parallel. The
redundancy of intermediate variables and logic makes the tiles completely independent apart from
the communication through wires.
Probes. Throughout this work, we assume a powerful d_p-probing adversary, where we give the
attacker information about all intermediate values possessed by d_p specified tiles, i.e.
∪_{i ∈ {i_1,...,i_{d_p}}} T_i. The attacker obtains all the intermediate values on the tile (such as
internal wire and register values) with probability one and obtains these values from the start of
the computation until the end. Note that this is stronger than both the standard t-probing adversary,
which gives access to only t intermediate values within a certain amount of time [34], and the
ε-probing adversary, where the information about t intermediate values is gained with a certain
probability. In our d_p-probing model, the adversary gets information on n intermediate values from
d_p tiles, where n ≫ d_p. Therefore, our d_p-probing model is more generic and covers realistic
scenarios, including an attacker with a limited number of EM probes which enable observation of
multiple intermediate values simultaneously within arbitrarily close proximity on the chip.
Faults. We also consider two types of fault models. Firstly, a d_f-faulting adversary, which can
induce chosen-value faults in any number of intermediate bits/values within d_f tiles, i.e. from the
set ∪_{i ∈ {i_1,...,i_{d_f}}} T_i. These faults can either flip the intermediate values with a
pre-calculated (adversarially chosen) offset or set the intermediate values to a chosen fixed value.
In particular, the faults are not limited in Hamming weight. One can relate this type of fault to,
for example, very accurate laser injections.
Secondly, we consider an ε-faulting adversary, which is able to insert a random-value fault in any
variable belonging to any party. This is a somewhat new MPC model, and essentially means that
all parties are randomly corrupted. The ε-adversary may inject the random-value fault according
to some distribution (for example, flip each bit with a certain probability), but he cannot set all
intermediates to a chosen fixed value. This adversary is different from the d_f-faulting adversary.
One can relate the ε-faulting adversary to a certain class of non-localised EM attacks.
Time periods. We assume a notion of time periods, where the period length is at least one clock
cycle. We require that a d_f-fault to an adversarially chosen value cannot be preceded by a probe
within the same time period. Thus adversarial faults can only depend on values from previous time
periods. This time restriction is justified by practical experimental constraints, where the time
period is naturally upper bounded by the time it takes to set up such a specific laser injection.
Adversarial Models. Given the aforementioned definitions, we consider on the one hand an active
adversary A1 with both d_p-probing and d_f-faulting capabilities simultaneously. We define P1 as the
set of up to d_p tiles that can be probed and F1 as the set of up to d_f tiles that can be faulted by
A1. Since each tile potentially sees a different share of a variable and we use a d-sharing for each
variable, we constrain the attack surface (the sets of adversarially probed and potentially modified
tiles) as follows:

(F1 ∪ P1) ⊆ ∪_{j=1}^{d−1} T_{i_j}

The constraint implies that at least one share remains unaccessed/honest and thus |F1 ∪ P1| ≤ d − 1.
Within those d − 1 tiles, the adversary can probe and fault arbitrarily many wires, including the
wires arriving at each tile. The adversary’s d_f-faulting capabilities are limited in time by our
definition of time periods, which implies that any d_f-fault cannot be preceded by a probe
within the same time period.
We also consider an active adversary A2 that has d_p-probing and ε-faulting capabilities simulta-
neously. In this case, the constraint on the set of probed tiles P2 remains the same, while the
faulted tiles F2 may cover the whole chip:

P2 ⊆ ∪_{j=1}^{d−1} T_{i_j}, F2 ⊆ T

Moreover, as ε-faults do not require the same set-up time as d_f-faults, they are not limited in
time. Note that ε-faults do not correspond to a standard adversary model in the MPC literature;
thus this part of our model is very much an aspect of our side-channel and fault analysis focus.
A rough equivalent model in the MPC literature would be for an honest-but-curious adversary
who is able to replace the share or MAC values of honest players with values selected from a given
random distribution. Whilst such an attack makes sense in the hardware model we consider, in
the traditional MPC literature this model is of no interest due to the supposed isolated nature of
computing parties.
As our constructions are based on MPC protocols which are statically secure, we make the same
assumptions in our tile-probe-and-fault model, i.e. the selection of tiles attacked must be fixed
beforehand and cannot depend on information gathered during the computation. This model reflects
realistic attackers, since it is infeasible to move a probe or a laser during a computation with today's
resources. We thus assume that both adversaries A1 and A2 are static.
Notation. Although our techniques generalize to any finite field, in this paper we work over a field Fq
of characteristic 2, for example GF(2^k) for a given k, as this is sufficient for application to most
symmetric ciphers. We use · and + to denote multiplication and addition in Fq respectively. We
use upper-case letters for constants. The lower-case letters x, y, z are reserved for the variables
used only in the evaluation stage (e.g. sensitive variables), whereas a, b, c, . . . represent auxiliary
variables generated from randomness in the preprocessing stage. The Kronecker delta function is
denoted by δi,j . We use L(.) to denote an additively homomorphic function and A(.) = C + L(.)
with C some constant.
Information Theoretic MAC Tags and the MAC Key α. We represent a value a ∈ Fq (similarly
x ∈ Fq ) as a pair ⟨~a⟩ = (~a, τ~a) of data and tag shares in the masked domain. The data shares
~a = (a1, . . . , ad) satisfy Σ_{i=1}^{d} ai = a. For each a ∈ Fq, there exists a corresponding MAC
tag τ^a ∈ Fq^m. The tag shares τ~a = (τ1a, . . . , τda) satisfy Σ_{i=1}^{d} τia[j] = τ^a[j] = α[j] · a,
∀j ∈ {1, . . . , m}. Further, we assume m = 1 unless otherwise mentioned.
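As a concrete toy illustration (not the CAPA reference implementation), the following Python sketch shares a value a ∈ GF(2) into d additive data shares and m shared MAC tags τ^a[j] = α[j] · a. For brevity it reconstructs the shared key α to generate the tags; in CAPA itself this step is computed on shares without ever reconstructing α.

```python
import secrets

def xor(xs):
    """Additive (XOR) reconstruction over GF(2)."""
    out = 0
    for v in xs:
        out ^= v
    return out

def share(value, d):
    """Split a GF(2) element into d additive shares."""
    s = [secrets.randbelow(2) for _ in range(d - 1)]
    return s + [value ^ xor(s)]

d, m = 3, 4                                                  # d data shares, m MAC tags
alpha = [share(secrets.randbelow(2), d) for _ in range(m)]   # m shared MAC keys alpha[j]

a = 1
a_shares = share(a, d)            # data shares: sum_i a_i = a
# tag shares: sum_i tau_i^a[j] = alpha[j] * a (field multiplication is AND in GF(2));
# reconstructing alpha here is for illustration only
tau_shares = [share(xor(alpha[j]) & a, d) for j in range(m)]

assert xor(a_shares) == a
assert all(xor(tau_shares[j]) == (xor(alpha[j]) & a) for j in range(m))
```

Each tile Ti would hold exactly one data share a_i, one tag share per key, and one key share of each α[j].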
Addition. To compute the addition (~z, τ~z) of (~x, τ~x) and (~y, τ~y), each tile performs local addition
of its data shares zi = xi + yi and its tag shares τiz = τix + τiy. When one operand is public
(for example, a cipher constant C ∈ Fq), the sum can be computed locally as zi = xi + C · δi,1 for
the value shares and τiz = τix + C · αi for the tag shares.
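The two local addition rules can be sketched as follows in GF(2), where + is XOR and · is AND. This is an illustrative toy (with m = 1 and, for readable tags, the MAC key fixed to α = 1), not the CAPA reference code.

```python
import secrets

def xor(xs):
    out = 0
    for v in xs:
        out ^= v
    return out

def share(value, d):
    s = [secrets.randbelow(2) for _ in range(d - 1)]
    return s + [value ^ xor(s)]

d = 3
alpha_sh = share(1, d)          # shared MAC key; alpha = 1 so toy tags are non-trivial
alpha = xor(alpha_sh)

x, y, C = 1, 0, 1
x_sh, y_sh = share(x, d), share(y, d)
tx_sh = share(alpha & x, d)     # tag shares of tau^x = alpha*x (toy generation)
ty_sh = share(alpha & y, d)

# addition of two shared values: z_i = x_i + y_i, tau_i^z = tau_i^x + tau_i^y
z_sh  = [x_sh[i] ^ y_sh[i] for i in range(d)]
tz_sh = [tx_sh[i] ^ ty_sh[i] for i in range(d)]
assert xor(z_sh) == x ^ y and xor(tz_sh) == alpha & (x ^ y)

# addition of a public constant C: only tile 1 adds C (the delta_{i,1} term),
# while every tile corrects its tag share with C * alpha_i
w_sh  = [x_sh[i] ^ (C if i == 0 else 0) for i in range(d)]
tw_sh = [tx_sh[i] ^ (C & alpha_sh[i]) for i in range(d)]
assert xor(w_sh) == x ^ C and xor(tw_sh) == alpha & (x ^ C)
```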
Multiplication by a Public Constant. Given a public constant C ∈ Fq , the multiplication (~z, τ~z )
of (~x, τ~x ) and C is obtained locally by setting zi = C · xi and τiz = C · τix .
The following operations, on the other hand, require auxiliary data generated in a preprocessing
stage and also communication between the tiles.
Multiplication. Multiplication of (~x, τ~x) and (~y, τ~y) requires as auxiliary data a Beaver triple
(⟨~a⟩, ⟨~b⟩, ⟨~c⟩), which satisfies c = a · b, for random a and b. The multiplication itself is performed in
four steps.
• Step A. In the blinding step, each tile Ti computes locally a randomized version of its share
of the secret: εi = xi + ai and ηi = yi + bi .
• Step B. In the partial unmasking step, each tile Ti broadcasts its own shares εi and ηi to the other
tiles, such that each tile can construct and store locally the values ε = Σ_{i=1}^{d} εi and η = Σ_{i=1}^{d} ηi.
The value ε (resp. η) is the partial unmasking of (~ε, τ~ε) (resp. (~η, τ~η)), i.e. the value ε (resp.
η) is unmasked but its tag τ~ε (τ~η) remains shared. These values are blinded versions of the
secrets x and y and can therefore be made public.
• Step C. In the MAC-tag checking step, the tiles check whether the tags τ~ε (τ~η) are consistent
with the public values ε and η, using a method which we will explain later in this section.
• Step D. In the recombination step, each tile Ti locally computes its share of the product and its tag:
zi = ci + ε · bi + η · ai + ε · η · δi,1
τiz = τic + ε · τib + η · τia + ε · η · αi .
It can easily be seen that the sharing (~z, τ~z) corresponds to z = x · y unless faults occurred. Steps B
and C are the only steps that require communication among tiles; Steps A and D are completely
local. Note that to avoid leaking information on the sensitive data x and y, the shares εi and ηi
must be synchronized using memory elements after Step A, before being released to other tiles in
Step B. Moreover, we remark that Step D does not require the result of Step C and can thus be
performed in parallel with the MAC-tag check.
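The four steps can be sketched end-to-end in GF(2) (multiplication is AND, addition is XOR). This is an illustrative toy with m = 1 and the MAC key fixed to α = 1 so the tags are non-trivial; the tag generation here reconstructs α, which the real preprocessing stage would never do.

```python
import secrets

def xor(xs):
    r = 0
    for v in xs:
        r ^= v
    return r

def share(v, d):
    s = [secrets.randbelow(2) for _ in range(d - 1)]
    return s + [v ^ xor(s)]

d = 3
alpha_sh = share(1, d)              # shared MAC key, alpha = 1
alpha = xor(alpha_sh)

def mac_share(v, d):                # toy tag generation (CAPA computes this on shares)
    return share(alpha & v, d)

x, y = 1, 1
x_sh, y_sh = share(x, d), share(y, d)
tx, ty = mac_share(x, d), mac_share(y, d)

# preprocessing: Beaver triple c = a*b with tags
a, b = secrets.randbelow(2), secrets.randbelow(2)
a_sh, b_sh, c_sh = share(a, d), share(b, d), share(a & b, d)
ta, tb, tc = mac_share(a, d), mac_share(b, d), mac_share(a & b, d)

# Step A (local blinding) and Step B (broadcast and partial unmasking)
eps = xor(x_sh[i] ^ a_sh[i] for i in range(d))
eta = xor(y_sh[i] ^ b_sh[i] for i in range(d))

# Step C: MAC-tag check of the public values eps and eta
t_eps = [tx[i] ^ ta[i] for i in range(d)]
assert xor((eps & alpha_sh[i]) ^ t_eps[i] for i in range(d)) == 0
t_eta = [ty[i] ^ tb[i] for i in range(d)]
assert xor((eta & alpha_sh[i]) ^ t_eta[i] for i in range(d)) == 0

# Step D: local recombination of value and tag shares
z_sh = [c_sh[i] ^ (eps & b_sh[i]) ^ (eta & a_sh[i]) ^ ((eps & eta) if i == 0 else 0)
        for i in range(d)]
tz_sh = [tc[i] ^ (eps & tb[i]) ^ (eta & ta[i]) ^ (eps & eta & alpha_sh[i])
         for i in range(d)]

assert xor(z_sh) == (x & y)
assert xor(tz_sh) == alpha & (x & y)
```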
Squaring. Squaring is a linear operation in fields of characteristic 2. Hence, the output shares
of a squaring operation can be computed locally from the input shares. However, obtaining
the corresponding tag shares is non-trivial. To square (~x, τ~x) into (~z, τ~z), we therefore require
an auxiliary tuple (⟨~a⟩, ⟨~b⟩) such that b = a^2. The procedure to obtain (~z, τ~z) mimics that of
multiplication with some modifications: there is only one partially unmasked value ε = x + a, whose
tag needs to be checked, and each tile calculates zi = bi + ε^2 · δi,1 and τiz = τib + ε^2 · αi.
Following the same spirit, we can also perform the following operations.
Affine Transformation. Provided that we have access to a tuple (⟨~a⟩, ⟨~b⟩) such that b = A(a), we
can compute (~z, τ~z) satisfying z = A(x) = C + L(x), where L(x) is an additively homomorphic
function over the finite field, by computing the output sharing as zi = bi + L(ε) · δi,1 and τiz =
τib + L(ε) · αi.
Multiplication following Linear Transformations. The technique used for the above additively
homomorphic operations can be generalized even further to compute z = L1(x) · L2(y) in shared
form, where L1 and L2 are additively homomorphic functions. A trivial methodology would
require two tuples (⟨~ai⟩, ⟨~bi⟩) with bi = Li(ai) for i ∈ {1, 2}, plus a standard Beaver triple (i.e.
requiring seven pre-processed data items). We see that we can do the same operation with five
pre-processed items (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩), such that c = L1(a), d = L2(b) and e = L1(a) · L2(b).
The tiles partially unmask ~x + ~a (resp. ~y + ~b) to obtain ε (resp. η) and verify them. Each tile
computes its value share and tag share of z as zi = ei + L1(ε) · di + L2(η) · ci + L1(ε) · L2(η) · δi,1 and
τiz = τie + L1(ε) · τid + L2(η) · τic + L1(ε) · L2(η) · αi, respectively. We refer to (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩)
as a quintuple.
Proof. For the value shares, we have

Σ_{i=1}^{d} zi = Σ_{i=1}^{d} (ei + L1(ε) · di + L2(η) · ci) + L1(ε) · L2(η)
= e + L1(ε) · d + L2(η) · c + L1(ε) · L2(η)
= L1(a) · L2(b) + L1(x + a) · L2(b) + L2(y + b) · L1(a) + L1(x + a) · L2(y + b)
= L1(a) · L2(b) + L1(x) · L2(b) + L1(a) · L2(b) + L1(a) · L2(y) + L1(a) · L2(b)
  + L1(x) · L2(y) + L1(x) · L2(b) + L1(a) · L2(y) + L1(a) · L2(b)
= L1(x) · L2(y)

and for the tag shares,

Σ_{i=1}^{d} τiz = Σ_{i=1}^{d} (τie + L1(ε) · τid + L2(η) · τic + L1(ε) · L2(η) · αi)
= α · e + L1(ε) · α · d + L2(η) · α · c + L1(ε) · L2(η) · α
= α · (e + L1(ε) · d + L2(η) · c + L1(ε) · L2(η))
= α · L1(x) · L2(y)
MAC-Tag Checking of Public Values. Consider a public value ε = x + a, calculated in the partial
unmasking step of the Beaver multiplication operation. Recall that we obtain its MAC-tag shares as
τiε = τia + τix. During the MAC-tag checking step of the Beaver operation, the authenticity of τ^ε
corresponding to ε is tested. As ε is public, each tile can calculate and broadcast the value ε · αi + τiε.
For a correct tag, we expect Σ_{i=1}^{d} τiε = α · ε, thus each tile computes Σ_{i=1}^{d} (ε · αi + τiε)
and proceeds if the result is zero. Recall that the broadcasting must be preceded by a synchronization
of the shares.
There are several components in a cipher which do not need to be protected against SCA (i.e.
masked), because their specific values are not sensitive. One prominent example is the control unit
which decides what operations should be performed (e.g. the round counter). Other examples are
constants such as the AES affine constant 0x63 or public values such as ε in a Beaver calculation
and the difference ε · α + τ ε during the MAC-tag checking phase.
While these public components are not sensitive in an SCA context, they can be targeted in a
fault attack. It is therefore important to introduce some redundancy. Each tile should have its own
control logic and keep a local copy of all public values to avoid single points of attack. The shares
εi are distributed to all tiles so that ε can be unmasked by each tile separately and any subsequent
computation performed on these public values is repeated by each tile. Finally, each tile also keeps
its own copy of the abort status. This is in fact completely analogous to the MPC scenario.
Auxiliary Data Generation. To generate a triple (h~ai, h~bi, h~ci) satisfying c = a · b, we draw random
shares ~a = (a1 , . . . , ad ) and ~b = (b1 , . . . , bd ) and use a passively secure shared multiplier to compute
~c s.t. c = a · b. We then use another such multiplication with the shared MAC key α ~ to generate
tag shares τ~a, τ~b, τ~c. We note that the shares ai, bi are randomly generated by tile Ti. There are
thus d separate PRNGs on d distinct tiles.
Why We Need Randomization. The sacrificing step involves two random values r1 and r2. We present
the following attack to illustrate why this randomization is needed. Again, we elaborate on triples,
but the same holds for tuples and quintuples. As the security does not rely on the secrecy of
r1 and r2, we assume for simplicity that they are known to the attacker. We only stress that they
are different: r1 ≠ r2.
Consider two triples (⟨~a⟩, ⟨~b⟩, ⟨~c′⟩) and (⟨~d⟩, ⟨~e⟩, ⟨~f′⟩) at the input of the sacrificing stage. We
assume that the adversary has introduced an additive difference into one share of c′ and f′ such
that c′ = a · b + ∆c and f′ = d · e + ∆f. This fault is injected before the MAC tag calculation,
so that τ^{c′} and τ^{f′} are valid tags for the faulted values c′ and f′ respectively.
The sacrificing step calculates the following four differences (for rj = r1 and rj = r2) and only
succeeds if all are zero:

∆j = Σ_{i=1}^{d} (rj · c′i + f′i + ε · ei + η · di) + ε · η = rj · ∆c + ∆f ≟ 0

τ^{∆j} = Σ_{i=1}^{d} (rj · τic′ + τif′ + ε · τie + η · τid + ε · η · αi) = rj · α · ∆c + α · ∆f ≟ 0
Without randomization (i.e. r1 = r2 = 1), the attacker only has to match the differences ∆f = ∆c
to pass verification. With a random r1, the attacker can fix ∆f = r1 · ∆c to automatically force
∆1 and τ^{∆1} to zero. Even if he does not know r1, he guesses it correctly with probability as high
as 2^{-k}.
It is only thanks to the repetition of the relation verification with r2 that the adversary is detected with
probability 1 − 2^{-km}. Assuming he fixed ∆f = r1 · ∆c, it is impossible to also achieve ∆f = r2 · ∆c.
Even if the attacker manages to force ∆2 to zero with an additive injection (since he knows all
components r2, ∆c and ∆f), he cannot get rid of the difference τ^{∆2} = r2 · α · ∆c + α · ∆f without
knowing the MAC key. Since α remains secret, the attacker succeeds only with probability 2^{-km}.
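The effect of the two distinct randomizers can be checked numerically in GF(2^8). The sketch below (illustrative; the concrete values of r1, r2 and ∆c are arbitrary) shows that matching ∆f = r1 · ∆c fools the first difference but never the second, since (r1 + r2) · ∆c ≠ 0 in a field.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

dc = 0x2A                      # additive fault injected into c'
r1, r2 = 0x57, 0x83            # two distinct sacrificing randomizers, r1 != r2

# attacker's best strategy against a single check: match Df = r1 * Dc
df = gf_mul(r1, dc)
delta1 = gf_mul(r1, dc) ^ df   # first difference  r1*Dc + Df
delta2 = gf_mul(r2, dc) ^ df   # second difference r2*Dc + Df

assert delta1 == 0             # the first verification is fooled...
assert delta2 != 0             # ...but the second one always fails (no zero divisors)
```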
4 Discussion
4.1 Security Claims
With both described adversaries A1 and A2, our design CAPA claims provable security against the
following types of attacks, as well as a combined attack of the two:
1. Side-Channel Analysis (i.e. an adversary probing up to d − 1 tiles).
2. Fault Attacks (i.e. an adversary introducing either known faults into d − 1 tiles or random
faults everywhere).
Side-channel Analysis. One can check that no union of d − 1 tiles ∪_{j∈{j1,...,jd−1}} Tj holds all the shares
of a sensitive value. Very briefly, we can reason to this (d − 1)th-order non-completeness as follows.
All computations are local, with the exception of the unmasking of public values such as ε. However,
the broadcasting of all shares of ε does not break non-completeness, since ε = x + a is not sensitive
itself but rather a blinded version of a sensitive value x, using a random a that is shared across
all tiles. Unmasking the public value ε therefore gives each tile Ti only one share ε + ai of a new
sharing of the secret x:

x = (ε + a1) + a2 + · · · + ad

In this sharing, no union of d − 1 shares suffices to recover the secret. Our architecture thus provides
non-completeness for all sensitive values. As a result, our d-share implementation is secure against
(d − 1)-probing attacks. Any number of probes following the adversaries' restrictions leaks no sensitive
data. Our model is related to the wire-probe model, but with wires replaced by entire tiles. We
can thus at least claim security against (d − 1)th-order SCA.
Fault Attacks. A fault goes undetected only if both the value and MAC tag shares are modified such
that they are consistent. Adversary A1 can fault at most df < d tiles, which means he requires
knowledge of the MAC key α ∈ GF(2^{km}) to forge a valid tag for a faulty value. Since α is secret,
his best option is to guess the MAC key. This guess is correct with probability 2^{-km}. Adversary
A2 has ε̃-faulting abilities only and will therefore only avoid detection if the induced faults in value
and tag shares happen to be consistent. This is the case with probability 2^{-km}. We can therefore
claim an error detection probability (EDP) of 1 − 2^{-km}. The EDP does not depend on the number
of faulty bits (or the Hamming weight of the injected fault).
Combined Attacks. In a combined attack, an adversary with df -faulting capabilities can mount
an attack in which he uses the knowledge obtained from probing tiles in P1 to carefully forge the
faults. In SPDZ, commitments are used to avoid the so-called "rushing adversary". CAPA does not
need commitments, as the timing limitation on the A1 adversary ensures that a df -fault cannot be
preceded by a probe in the same time period. As a result, we inherit the security claims of SPDZ and
the claimed EDP is not affected by probing or SCA. Also, the injection of a fault in CAPA does not
change the side-channel security. Performing a side-channel attack on a perturbed execution does
not reveal any additional information, because the Beaver operations do not allow injected faults to
propagate through a calculation into a difference that depends on sensitive information. We can
claim this security because of the aspects inherited from MPC. CAPA is essentially secure against
a very powerful adversary that has complete control (hence combined attacks) over all but one of
the tiles.
What Does Our MAC Security Mean? We stress that CAPA provides significantly stronger protection
against faults than existing approaches. An adversary that injects errors into up to df tiles cannot
succeed with probability higher than the claimed detection probability. This means that our design
can withstand any number of fault-injection shots, as long as each shot affects at most df tiles. This
is the case even if those df tiles leak their entire state; hence our resistance against combined attacks.
The underlying reason is that to forge values, an attacker needs to know the MAC key; but since this
key is also shared, the attacker gains no information on it, and his best strategy is to insert a random
fault, which is detected with probability 1 − 2^{-km}. Moreover, our solution scales far better than, for
example, error detection code solutions.
How Much Do Tags Leak? The tag shares τia form a Boolean masking of a variable τ a . This
variable τ a itself is an information theoretic MAC tag of the underlying value a and can be seen as
a multiplicative share of a. We therefore require the MAC key to change for each execution. Hence
MAC tag shares are a Boolean masking of a multiplicative share and are expected to leak very
little information in comparison with the value shares themselves.
Forbidding the All-0s MAC Key. If the MAC key size km is small, we should forbid the all-0
MAC key. This ensures that tags are injective: if an attacker changes a value share, he must also
change the tag share. We only pay with a slight decrease in the claimed detection probability: by
excluding 1 of the 2^{km} MAC key possibilities, we reduce the fault detection probability to 1 − 2^{-κ},
where κ = log2(2^{km} − 1).
4.2 Attacks
Power Supply or Clock Glitch Attacks. The solution presented in this paper critically depends
on the fact that there is no single point where an attacker can insert a fault that affects all d tiles
deterministically. An attacker may try to glitch the chip's clock line, which is shared among all tiles.
In this case, the attacker could carefully insert a glitch so that writing to the abort register
is skipped or a test instruction is skipped. Since all tiles share the same clock, the attacker can
bypass the tag verification step in this way. Similar comments apply, for example, to glitches in the
power line. The bottom line is that one should design the hardware architecture accordingly, that
is, deploy low-level circuit countermeasures that detect or avoid this attack vector.
Skipping Instructions. In software, when each tile is a separate processor (with its own program
counter, program memory and RAM memory), skipping one instruction in up to d − 1 shares would
be detected. The unaffected tiles will detect this misbehavior when checking partially unmasked
values.
Safe Error Attack. We point out a specific attack that targets any countermeasure against a
probing and faulting adversary. In a safe error attack [65], the attacker perturbs the implementation
in a way that the output is only affected if a sensitive variable has a certain value. The attacker
learns partial secret information by merely observing whether or not the computation succeeds
(i.e. does not abort). Consider for example a shared multiplication of a variable x and a secret
y, and call the resulting product z = x · y. The adversary faults one of the inputs with a nonzero
additive difference, such that the multiplication is actually performed on x′ = x + ∆ instead of
x. Such an additive fault can be achieved by affecting only one share/tile. The multiplication
results in the faulty product z′ = z + ∆ · y. The injected fault has propagated into a difference that
depends on sensitive data (y). As a result, the success or failure of any integrity check following
this multiplication depends on y. In particular, if nothing happens (all checks pass), the attacker
learns that y must be 0.
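The core of the safe error attack can be demonstrated in two lines of GF(2) arithmetic. In this toy sketch (illustrative only, with ∆ = 1), the faulted product z′ = z + ∆ · y differs from z exactly when y = 1, so merely observing whether anything changed leaks the secret bit.

```python
def leak_bit_via_safe_error(x, y):
    """Toy safe-error attack in GF(2): fault x additively, observe if the result changes."""
    z_good = x & y
    x_faulted = x ^ 1          # additive fault Delta = 1 on one input share/wire
    z_bad = x_faulted & y      # z' = z + Delta * y in GF(2)
    return z_good != z_bad     # a detectable difference occurs exactly when y == 1

assert leak_bit_via_safe_error(0, 1) is True    # computation perturbed: attacker learns y = 1
assert leak_bit_via_safe_error(1, 0) is False   # "nothing happens": attacker learns y = 0
```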
Among existing countermeasures against combined attacks, none provide protection against
this kind of selective failure attack as they cannot detect the initial fault ∆. The attacker can
always target the wire running from the last integrity check on x to the multiplication with y.
We believe CAPA is currently unique in preventing this type of attack. One can verify that the
MAC-tag checking step in a Beaver operation successfully prevents ∆ from propagating to the
output. This integrity check only passes if all tiles have a correct copy of the public value ε. Any
faults injected after this check have a limited impact as the calculation finishes locally. That is,
once the correct public values are established between the tiles, the shares of the multiplication
output z are calculated without further communication among tiles. The adversary is thus unable
to elicit a fault that depends on sensitive data.
PACA. We claim security against the passive and active combined attack (PACA) on masked
AES described in [2] because CAPA does not output faulty ciphertexts. A second attack in this
work uses another type of safe errors (or ineffective faults as they are called in this work) which
are impossible to detect. The attacker fixes a specific wire to the value zero (this requires the
df -faulting capability) and collects power traces of the executions that succeed. This means the
attacker only collects traces of encryptions in which that specific wire/share was already zero. The
key is then extracted using (d − 1)th-order SCA on the remaining d − 1 shares. This safe error attack,
however, falls outside our model, since the adversary gets access (either by fault or SCA) to all d
shares and thus (F1 ∪ P1) = T.
Advanced Physical Attacks. In our description we are assuming that during the broadcast phase
there are no “races” between tiles: by design, each tile sets its share to be broadcasted at clock
cycle t and captures other tiles’ share in the same clock cycle t. We are implicitly assuming that
tiles cannot do much work between these two events. If this assumption is violated (for example,
using advanced circuit editing tools), a powerful adversary could bypass any verification. This is
why in the original SPDZ protocol there are commitments prior to broadcasting operations; if this
kind of attack is a concern one could adapt the same principles of commitments to CAPA. This is
a very strong adversarial model that we consider out of scope for this paper.
MAC Tag Checking. SPDZ delays the tag checking of public values until the very end of the
encryption by using commitments. For this, each party keeps track of publicly opened values.
This is to avoid a slowdown of the computation and because in the MPC setting, local memory is
cheaper than communication. In an embedded scenario the situation is the opposite, so we check
the opened values on the fly at the cost of additional dedicated circuitry. In hardware, we "simulate"
the broadcast channel by wiring between all tiles. Each tile keeps a local copy of those broadcasted
values.

Table 1: Overview of the number of Fq multiplications (·), Fq additions (+) and linear operations
in GF(2) (L(.)) required to calculate all building blocks with d shares and m tags
Adversary. Although MPC considers mainly the “synchronous” communication model, the SPDZ
adversary model also includes the so-called “rushing” adversary, which first collects all inputs from
the other parties and only then decides what to send in reply. In our embedded setting, as already
pointed out, the "rushing" adversary is impossible. Due to the nature of the implementation, the
computational environment and storage are very restricted. On the other hand, communication
channels are very efficient and can be assumed to be automatically synchronous, with all tiles
progressing in-step in the computation.
Complexity for the Passive Attacker Scenario. It is remarkable that if active attackers are ruled
out and only SCA is a concern, the complexity of the principal computation is linear in d.
This may seem like a significant improvement over previous masking schemes, which have complexity
quadratic in the security order [18, 34, 58]. However, this complexity is again pushed into the
preprocessing stage. Nevertheless, this can be interesting, especially for software implementations on
platforms where a large amount of RAM is available to store the auxiliary data generated in §3.2.
The same comment applies to FPGAs with plenty of BlockRAM.
5 Proof-of-Concept
In this section we detail a proof-of-concept implementation of the CAPA methodology in both a
hardware and a software environment. We emphasize specific concepts for hardware and software
implementations and provide case studies of KATAN-32 [14] and AES [1], which cover operations in
different fields, possibility of bitsliced implementations, specific timing and memory optimizations,
and performance results.
Table 2: Area (GE) of 2-share KATAN-32 implementations with m MAC keys α[j] ∈ Fq
Library. For synthesis, we use Synopsys Design Compiler Version I-2013.12 with the NanGate 45nm
Open Cell library [49] for ease of future comparison. We choose the compile option -exact_map to
prevent optimization across tiles. The area results are provided in 2-input NAND-gate equivalents
(GE).
KATAN-32 is a shift-register-based block cipher with an 80-bit key and a 32-bit plaintext input.
It is designed specifically for efficient hardware implementation and performs 254 rounds of four
AND-XOR operations. Hence, its natural shared data representation is in the field Fq = GF(2),
which makes the mapping onto CAPA operations relatively straightforward. However, the small
finite field means that we need a vectorized MAC-tag operation (m > 1) to ensure a good probability
of detecting errors. Our implementation is round-based, as in [14], with three AND-XOR Beaver
operations and one constant AND-XOR calculated in parallel. Each Beaver AND-XOR operation
requires two cycles and is implemented in a pipelined fashion, such that the latency of the whole
computation increases by only one clock cycle.
Implementation Cost. Tables 2 and 3 summarize the area of our KATAN implementations.
Naturally, compared to a shared implementation without MAC tags, the state registers grow by
a factor m + 1 as the MAC-key size increases. In the last columns, we extrapolate the area results
for any m.
Each Beaver multiplication in GF(2) requires one triple, and each triple needs 2d random bits
for generating ~a and ~b. A d-share DOM multiplication requires d(d − 1)/2 units of randomness. The
construction of one triple requires 1 + 3m masked multiplications: one to obtain the multiplication
~c of ~a and ~b, and 3m to obtain the m tags τ~a, τ~b and τ~c. Due to the relation verification through
the sacrificing of another triple, the randomness must be doubled. Hence, the total number of
random bits required per round of KATAN is 3 · 2 · (2d + (1 + 3m) · d(d − 1)/2).
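The per-round randomness count can be evaluated directly; the small helper below (an illustrative sketch of the formula as we read it, with the DOM multiplication cost taken as C(d, 2) bits) tabulates a few parameter choices.

```python
from math import comb

def katan_random_bits_per_round(d, m):
    """Random bits per KATAN round: 3 Beaver AND gates, doubled for the sacrificing.
    Each triple needs 2d bits for a and b, plus (1 + 3m) DOM multiplications
    costing C(d, 2) bits each."""
    return 3 * 2 * (2 * d + (1 + 3 * m) * comb(d, 2))

# e.g. first-order (d = 2) with a single tag (m = 1):
assert katan_random_bits_per_round(2, 1) == 48
# second-order (d = 3) with a single tag:
assert katan_random_bits_per_round(3, 1) == 108
```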
Table 3: Area (GE) of 3-share KATAN-32 implementations with m MAC keys α[j] ∈ Fq
Figure 2: Non-specific leakage detection on the first 31 rounds of first-order KATAN. Left column:
PRNG off (24K traces). Right column: PRNG on (100M traces). Rows (top to bottom): exemplary
power trace; first-order t-test; second-order t-test
Figure 3: Non-specific leakage detection on the first 31 rounds of second-order KATAN. Left
column: PRNG off (24K traces). Right column: PRNG on (100M traces). Rows (top to bottom):
exemplary power trace; first-order t-test; second-order t-test; third-order t-test
https://drive.google.com/file/d/0B19mBnPrtz4hQllQMVNTeTBpSGM/view?usp=sharing
Figure 4: The AES S-box pipeline computing x ↦ S(x): exponentiation stages, the x^4 · y^2 stage
and the affine stage A(x), each followed by a verification step and a pipeline register.
Design choices. The AES S-box consists of an inversion in GF(2^8), followed by an affine transfor-
mation over bits. We distinguish two methodologies for the S-box implementation. It is well known
that the combination of the two operations can be expressed by the following polynomial in GF(2^8)
[19]:

S-box(x) = 0x63 + 0x8F · x^127 + 0xB5 · x^191 + 0x01 · x^223 + 0xF4 · x^239
         + 0x25 · x^247 + 0xF9 · x^251 + 0x09 · x^253 + 0x05 · x^254          (1)
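The polynomial (1) can be checked numerically. The sketch below (illustrative only) evaluates it with square-and-multiply exponentiation in GF(2^8) and verifies it against two well-known AES S-box values.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def gf_pow(x, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

# coefficients of the S-box polynomial (1)
COEFFS = {127: 0x8F, 191: 0xB5, 223: 0x01, 239: 0xF4,
          247: 0x25, 251: 0xF9, 253: 0x09, 254: 0x05}

def sbox_poly(x):
    out = 0x63
    for e, c in COEFFS.items():
        out ^= gf_mul(c, gf_pow(x, e))
    return out

assert sbox_poly(0x00) == 0x63   # AES S-box of 0x00
assert sbox_poly(0x01) == 0x7C   # AES S-box of 0x01
```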
This polynomial can be implemented using 6 squarings and 7 multiplications in GF(2^8) with a latency
of 13 clock cycles. A second approach is to evaluate the inversion x ↦ x^254 using the following
multiplication chain from [30]:

x^254 = x^4 · (((x^5)^5)^5)^2

Since the AES affine transform A(x) is linear over GF(2), we can then use the Beaver operation
described in §3.1 to evaluate it in one cycle, using auxiliary affine tuples (⟨~a⟩, ⟨~b⟩) such that b = A(a).
Initial estimates reveal that the former method is more expensive than the latter, so we adopt the
latter technique.
Multiplication Chain. Our implementation of the proposed multiplication chain uses two types
of operations: x^5 and x^4 · y^2, which can both be computed as described in §3.1 (Multiplication
following Linear Transformations). Given an input ⟨~x⟩ and a triple (⟨~a⟩, ⟨~b⟩, ⟨~c⟩) such that b = a^4
and c = a^5, we calculate the CAPA exponentiation to the power five. Likewise, we perform the
map x^4 · y^2 (with y = x^125) in one cycle, using quintuples (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩) such that c = a^4,
d = b^2 and e = c · d = a^4 · b^2. As a result, an inversion in GF(2^8) costs only 4 cycles, using 3
exponentiation triples and 1 quintuple. Combined with the affine stage, we obtain the S-box output
in 5 cycles (see Figure 4). This approach optimizes not only the number of cycles but also the
amount of required randomness. The S-box is implemented as a five-stage pipeline.
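The chain as we read it (three exponentiations to the fifth power giving x^125, then one x^4 · y^2 step) can be verified exhaustively over GF(2^8); the sketch below is illustrative only.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def gf_pow(x, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

for x in range(256):
    y = gf_pow(gf_pow(gf_pow(x, 5), 5), 5)     # ((x^5)^5)^5 = x^125
    inv = gf_mul(gf_pow(x, 4), gf_mul(y, y))   # x^4 * y^2 = x^254
    assert inv == gf_pow(x, 254)
    if x != 0:
        assert gf_mul(inv, x) == 1             # x^254 is indeed x^(-1) for x != 0
```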
Implementation Cost. We use a serialized AES architecture, based on that in [28]. One round
of the cipher requires 21 clock cycles, making the latency of one complete encryption 226 clock
cycles. Since the unprotected serialised implementation of [47] also requires 226 cycles, the timing
performance is very good.
Table 4 presents the area of the different blocks that make up our AES implementation. We
can see a significant difference between the preprocessing and evaluation stages, i.e. the efficient
calculation phase comes at the cost of expensive resource generation machinery.
Table 5 summarizes the number of random bytes required to generate the triples/tuples
for the AES S-box as a function of the number of MAC keys m and the number of shares d. Recall
that the S-box needs three exponentiation triples, one quintuple and one affine tuple per evaluation
(doubled for the sacrificing). Each of these uses d initial bytes of randomness per input for the
shares of a (and b). Furthermore, recall that each masked multiplication requires d(d − 1)/2 bytes of
randomness. That is, for d = 3 and m = 1, we need 156 bytes of randomness per S-box evaluation.
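The 156-byte figure can be reproduced under one plausible accounting (our own reconstruction, not stated explicitly in the text): exponentiation triples and affine tuples have one random input a, quintuples have two (a and b), and each structure needs one shared multiplication per non-linear value plus one per tag component.

```python
from math import comb

def sbox_random_bytes(d, m):
    """Random bytes per AES S-box evaluation; the per-structure multiplication
    counts (1+3m, 1+5m, 2m) are an assumed reconstruction."""
    triples, quintuples, affines = 2 * 3, 2 * 1, 2 * 1   # doubled for sacrificing
    # initial sharing: d bytes per random input
    initial = triples * d + quintuples * 2 * d + affines * d
    # shared multiplications at C(d, 2) bytes each:
    # triple: c = a*a^4 plus 3m tags; quintuple: e = c*d plus 5m tags; affine: 2m tags
    mults = triples * (1 + 3 * m) + quintuples * (1 + 5 * m) + affines * (2 * m)
    return initial + mults * comb(d, 2)

assert sbox_random_bytes(3, 1) == 156   # matches the figure quoted above
```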
Table 4: Area of first- and second-order AES implementations with m = 1, in 2-input NAND gate
equivalents (GE)
Table 5: Randomness (in bytes) for the initial sharing, shared multiplication and sacrificing
required for the AES S-box
Figure 5: Non-specific leakage detection on second-order SubBytes. Left column: masks off. Right
column: masks on (200K traces). Rows (top to bottom): one exemplary EM trace, first-order t-test;
second-order t-test; third-order t-test
When we plug in the PRNG, no leakage is detected with up to 200 000 traces (the statistic
does not surpass the threshold C = ±4.5). This serves to confirm that the implementation effectively
masks all intermediates, and that neither first- nor second-order DPA is possible on this implementation.
SPA features within an electromagnetic trace are better visible in the cross-correlation matrix
shown in Figure 6.
Figure 6: Cross-correlation for second-order SubBytes. One can identify the 34 AND gates in the
SubBytes circuit of Boyar et al. [11].
Experimental Validation of DFA Security. To validate our theoretical security claims on CAPA's
protection against fault attacks, we scale down our software AES SubBytes implementation, reducing
the MAC key size to m = 2 and scaling down words to bits (k = 1). Note that this parameter choice
lowers the detection probability; the point of using these toy parameters is only to verify more
comfortably that the detection probability behaves as expected: it is easier to verify that the detection
probability is 1 − 2^{-2} than 1 − 2^{-40}.
6 Conclusion
In this paper, we introduced the first adversary model that jointly considers side-channels and faults
in a unified and formal way. The tile-probe-and-fault security model extends the more traditional
wire-probe model and accounts for a more realistic and comprehensive adversarial behavior. Within
this model, we developed the methodology CAPA: a new combined countermeasure against physical
attacks. CAPA provides security against higher-order DPA, multiple-shot DFA and combined
attacks. CAPA scales to arbitrary security orders and borrows concepts from SPDZ, an MPC
protocol. We showed the feasibility of implementing CAPA in embedded hardware and software
by providing prototype implementations of established block ciphers. We hope CAPA provides
an interesting addition to the embedded designer’s toolbox, and stimulates further research on
combined countermeasures grounded on more formal principles.
Acknowledgements.
This work was supported in part by the Research Council KU Leuven: C16/15/058 and OT/13/071,
by the NIST Research Grant 60NANB15D346 and the EU H2020 project FENTEC. Oscar Reparaz
and Begül Bilgin are postdoctoral fellows of the Fund for Scientific Research - Flanders (FWO)
and Lauren De Meyer is funded by a PhD fellowship of the FWO. The work of Nigel Smart
has been supported in part by ERC Advanced Grant ERC-2015-AdG-IMPaCT, by the Defense
Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center,
Pacific (SSC Pacific) under contract No. N66001-15-C-4070, and by EPSRC via grants EP/M012824
and EP/N021940/1.
References
[1] Advanced Encryption Standard (AES). National Institute of Standards and Technology (NIST),
FIPS PUB 197, U.S. Department of Commerce, Nov. 2001.
[2] F. Amiel, K. Villegas, B. Feix, and L. Marcel. Passive and active combined attacks: Combining
fault attacks and side channel analysis. In L. Breveglieri, S. Gueron, I. Koren, D. Naccache,
and J. Seifert, editors, FDTC 2007, pages 92–102. IEEE Computer Society, 2007.
[3] J. Balasch, B. Gierlichs, O. Reparaz, and I. Verbauwhede. DPA, bitslicing and masking at 1
GHz. In Güneysu and Handschuh [31], pages 599–619.
[4] G. Barthe, F. Dupressoir, S. Faust, B. Grégoire, F. Standaert, and P. Strub. Parallel imple-
mentations of masking schemes and the bounded moment leakage model. In J. Coron and J. B.
Nielsen, editors, EUROCRYPT 2017, Part I, volume 10210 of LNCS, pages 535–566, 2017.
[5] A. Battistello and C. Giraud. Fault analysis of infective AES computations. In W. Fischer
and J. Schmidt, editors, FDTC 2013, pages 101–107. IEEE Computer Society, 2013.
[8] G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, and V. Piuri. Error analysis and detection
procedures for a hardware implementation of the advanced encryption standard. IEEE Trans.
Computers, 52(4):492–505, 2003.
[9] B. Bilgin, B. Gierlichs, S. Nikova, V. Nikov, and V. Rijmen. Higher-order threshold implemen-
tations. In P. Sarkar and T. Iwata, editors, ASIACRYPT 2014, Part II, volume 8874 of LNCS,
pages 326–343. Springer, Heidelberg, Dec. 2014.
[11] J. Boyar, P. Matthews, and R. Peralta. Logic minimization techniques with applications to
cryptology. Journal of Cryptology, 26(2):280–312, Apr. 2013.
[12] J. Bringer, C. Carlet, H. Chabanne, S. Guilley, and H. Maghrebi. Orthogonal direct sum
masking - A smartcard friendly computation paradigm in a code, with builtin protection
against side-channel and fault attacks. In D. Naccache and D. Sauveron, editors, WISTP 2014.
Proceedings, volume 8501 of LNCS, pages 40–56. Springer, 2014.
[13] J. Bringer, H. Chabanne, and T. Le. Protecting AES against side-channel analysis using
wire-tap codes. J. Cryptographic Engineering, 2(2):129–141, 2012.
[14] C. D. Cannière, O. Dunkelman, and M. Knežević. KATAN and KTANTAN - a family of small
and efficient hardware-oriented block ciphers. In C. Clavier and K. Gaj, editors, CHES 2009,
volume 5747 of LNCS, pages 272–288. Springer, Heidelberg, Sept. 2009.
[15] S. Chari, C. S. Jutla, J. R. Rao, and P. Rohatgi. Towards sound approaches to counteract
power-analysis attacks. In Wiener [64], pages 398–412.
[16] T. D. Cnudde and S. Nikova. More efficient private circuits II through threshold implementations.
In FDTC 2016, pages 114–124. IEEE Computer Society, 2016.
[17] J. Cooper, E. DeMulder, G. Goodwill, J. Jaffe, G. Kenworthy, and P. Rohatgi. Test Vector
Leakage Assessment (TVLA) methodology in practice. International Cryptographic Module
Conference, 2013.
[18] J.-S. Coron. Higher order masking of look-up tables. In P. Q. Nguyen and E. Oswald, editors,
EUROCRYPT 2014, volume 8441 of LNCS, pages 441–458. Springer, Heidelberg, May 2014.
[19] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard.
Information Security and Cryptography. Springer, 2002.
[20] I. Damgård, V. Pastro, N. P. Smart, and S. Zakarias. Multiparty computation from somewhat
homomorphic encryption. In Safavi-Naini and Canetti [60], pages 643–662.
[21] A. Duc, S. Faust, and F.-X. Standaert. Making masking security proofs concrete - or how
to evaluate the security of any leaking device. In E. Oswald and M. Fischlin, editors, EU-
ROCRYPT 2015, Part I, volume 9056 of LNCS, pages 401–429. Springer, Heidelberg, Apr.
2015.
[22] W. Fischer and N. Homma, editors. Cryptographic Hardware and Embedded Systems - CHES
2017 - 19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings,
volume 10529 of Lecture Notes in Computer Science. Springer, 2017.
[23] B. M. Gammel and S. Mangard. On the duality of probing and fault attacks. J. Electronic
Testing, 26(4):483–493, 2010.
[24] K. Gandolfi, C. Mourtel, and F. Olivier. Electromagnetic analysis: Concrete results. In
Ç. K. Koç, D. Naccache, and C. Paar, editors, CHES 2001, volume 2162 of LNCS, pages
251–261. Springer, Heidelberg, May 2001.
[25] B. Gierlichs, J.-M. Schmidt, and M. Tunstall. Infective computation and dummy rounds: Fault
protection for block ciphers without check-before-output. In A. Hevia and G. Neven, editors,
LATINCRYPT 2012, volume 7533 of LNCS, pages 305–321. Springer, Heidelberg, Oct. 2012.
[26] L. Goubin and J. Patarin. DES and differential power analysis (the “duplication” method).
In Ç. K. Koç and C. Paar, editors, CHES’99, volume 1717 of LNCS, pages 158–172.
Springer, Heidelberg, Aug. 1999.
[27] H. Groß and S. Mangard. Reconciling d+1 masking in hardware and software. In Fischer and
Homma [22], pages 115–136.
[28] H. Groß, S. Mangard, and T. Korak. Domain-oriented masking: Compact masked hardware
implementations with arbitrary protection order. IACR Cryptology ePrint Archive, 2016:486,
2016.
[29] H. Groß, S. Mangard, and T. Korak. An efficient side-channel protected AES implementation
with arbitrary protection order. In H. Handschuh, editor, Topics in Cryptology - CT-RSA
2017 - The Cryptographers’ Track at the RSA Conference 2017, San Francisco, CA, USA,
February 14-17, 2017, Proceedings, volume 10159 of LNCS, pages 95–112. Springer, 2017.
[30] V. Grosso, E. Prouff, and F.-X. Standaert. Efficient masked S-boxes processing - A step
forward -. In D. Pointcheval and D. Vergnaud, editors, AFRICACRYPT 14, volume 8469 of
LNCS, pages 251–266. Springer, Heidelberg, May 2014.
[31] T. Güneysu and H. Handschuh, editors. CHES 2015, volume 9293 of LNCS. Springer,
Heidelberg, Sept. 2015.
[32] X. Guo, D. Mukhopadhyay, C. Jin, and R. Karri. Security analysis of concurrent error detection
against differential fault analysis. J. Cryptographic Engineering, 5(3):153–169, 2015.
[33] Y. Ishai, M. Prabhakaran, A. Sahai, and D. Wagner. Private circuits II: Keeping secrets in
tamperable circuits. In S. Vaudenay, editor, EUROCRYPT 2006, volume 4004 of LNCS, pages
308–327. Springer, Heidelberg, May / June 2006.
[34] Y. Ishai, A. Sahai, and D. Wagner. Private circuits: Securing hardware against probing
attacks. In D. Boneh, editor, CRYPTO 2003, volume 2729 of LNCS, pages 463–481. Springer,
Heidelberg, Aug. 2003.
[35] N. Joshi, K. Wu, and R. Karri. Concurrent error detection schemes for involution ciphers.
In M. Joye and J.-J. Quisquater, editors, CHES 2004, volume 3156 of LNCS, pages 400–412.
Springer, Heidelberg, Aug. 2004.
[36] M. Joye, P. Manet, and J. Rigaud. Strengthening hardware AES implementations against
fault attacks. IET Information Security, 1(3):106–110, 2007.
[37] M. G. Karpovsky, K. J. Kulikowski, and A. Taubin. Differential fault analysis attack resistant
architectures for the advanced encryption standard. In J. Quisquater, P. Paradinas, Y. Deswarte,
and A. A. E. Kalam, editors, CARDIS 2004, 22-27 August 2004, Toulouse, France, volume
153 of IFIP, pages 177–192. Kluwer/Springer, 2004.
[38] R. Karri, G. Kuznetsov, and M. Gössel. Parity-based concurrent error detection of
substitution-permutation network block ciphers. In C. D. Walter, Ç. K. Koç, and C. Paar,
editors, CHES 2003, volume 2779 of LNCS, pages 113–124. Springer, Heidelberg, Sept. 2003.
[39] R. Karri, K. Wu, P. Mishra, and Y. Kim. Concurrent error detection schemes for fault-based
side-channel cryptanalysis of symmetric block ciphers. IEEE Trans. on CAD of Integrated
Circuits and Systems, 21(12):1509–1517, 2002.
[40] M. Keller, E. Orsini, and P. Scholl. MASCOT: Faster malicious arithmetic secure computation
with oblivious transfer. In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, and
S. Halevi, editors, ACM CCS 16, pages 830–842. ACM Press, Oct. 2016.
[41] P. C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other
systems. In N. Koblitz, editor, CRYPTO’96, volume 1109 of LNCS, pages 104–113. Springer,
Heidelberg, Aug. 1996.
[42] P. C. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In Wiener [64], pages 388–397.
[43] V. Lomné, T. Roche, and A. Thillard. On the need of randomness in fault attack countermea-
sures - application to AES. In G. Bertoni and B. Gierlichs, editors, FDTC 2012, pages 85–94.
IEEE Computer Society, 2012.
[44] T. Malkin, F. Standaert, and M. Yung. A comparative cost/security analysis of fault attack
countermeasures. In L. Breveglieri, I. Koren, D. Naccache, and J. Seifert, editors, FDTC 2006,
volume 4236 of LNCS, pages 159–172. Springer, 2006.
[45] M. Medwed, F.-X. Standaert, J. Großschädl, and F. Regazzoni. Fresh re-keying: Security
against side-channel and fault attacks for low-cost devices. In D. J. Bernstein and T. Lange,
editors, AFRICACRYPT 10, volume 6055 of LNCS, pages 279–296. Springer, Heidelberg, May
2010.
[46] S. Mitra and E. J. McCluskey. Which concurrent error detection scheme to choose? In
Proceedings IEEE International Test Conference 2000, Atlantic City, NJ, USA, October 2000,
pages 985–994. IEEE Computer Society, 2000.
[47] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang. Pushing the limits: A very compact
and a threshold implementation of AES. In Paterson [53], pages 69–88.
[48] D. Mukhopadhyay. An improved fault based attack of the advanced encryption standard.
In B. Preneel, editor, AFRICACRYPT 09, volume 5580 of LNCS, pages 421–434. Springer,
Heidelberg, June 2009.
[49] NANGATE. The NanGate 45nm Open Cell Library. Available at http://www.nangate.com.
[52] S. Nikova, V. Rijmen, and M. Schläffer. Secure hardware implementation of non-linear functions
in the presence of glitches. In P. J. Lee and J. H. Cheon, editors, ICISC 08, volume 5461 of
LNCS, pages 218–234. Springer, Heidelberg, Dec. 2009.
[53] K. G. Paterson, editor. EUROCRYPT 2011, volume 6632 of LNCS. Springer, Heidelberg, May
2011.
[55] E. Prouff and M. Rivain. Masking against side-channel attacks: A formal security proof. In
T. Johansson and P. Q. Nguyen, editors, EUROCRYPT 2013, volume 7881 of LNCS, pages
142–159. Springer, Heidelberg, May 2013.
[57] O. Reparaz, B. Gierlichs, and I. Verbauwhede. Fast leakage assessment. In Fischer and Homma
[22], pages 387–399.
[58] M. Rivain and E. Prouff. Provably secure higher-order masking of AES. In S. Mangard
and F.-X. Standaert, editors, CHES 2010, volume 6225 of LNCS, pages 413–427. Springer,
Heidelberg, Aug. 2010.
[59] T. Roche and E. Prouff. Higher-order glitch free implementation of the AES using secure multi-
party computation protocols - extended version. J. Cryptographic Engineering, 2(2):111–127,
2012.
[60] R. Safavi-Naini and R. Canetti, editors. CRYPTO 2012, volume 7417 of LNCS. Springer,
Heidelberg, Aug. 2012.
[61] T. Schneider and A. Moradi. Leakage assessment methodology - A clear roadmap for side-
channel evaluations. In Güneysu and Handschuh [31], pages 495–513.
[62] T. Schneider, A. Moradi, and T. Güneysu. ParTI – towards combined hardware countermea-
sures against side-channel and fault-injection attacks. In M. Robshaw and J. Katz, editors,
CRYPTO 2016, Part II, volume 9815 of LNCS, pages 302–332. Springer, Heidelberg, Aug.
2016.
[63] O. Seker, T. Eisenbarth, and R. Steinwandt. Extending glitch-free multiparty protocols to
resist fault injection attacks. IACR Cryptology ePrint Archive, 2017:269, 2017.
[64] M. J. Wiener, editor. CRYPTO’99, volume 1666 of LNCS. Springer, Heidelberg, Aug. 1999.
[65] S. Yen and M. Joye. Checking before output may not be enough against fault-based cryptanal-
ysis. IEEE Trans. Computers, 49(9):967–970, 2000.
M&M: Masks and Macs
against Physical Attacks
Publication Data
Lauren De Meyer, Victor Arribas, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. M&M: Masks and Macs against Physical Attacks. IACR Transactions
on Cryptographic Hardware and Embedded Systems, 2019(1), pages 25-50.
My Contribution
230 M&M: Masks and Macs against Physical Attacks
1 Introduction
The implementation of cryptographic algorithms in embedded systems should be done with extreme
care. Physical attacks are proliferating and becoming easier and cheaper to perform. The two most
important classes of physical attacks are Side-Channel Analysis (SCA), a non-invasive attack
that exploits the physical leakages emanating from the device (power consumption or
electromagnetic radiation, among others), and Fault Attacks (FA), in which an adversary induces
and exploits logical errors in the computation. These attacks are commonly used to retrieve secret
data from the embedded device and can be executed either separately or combined. The most
threatening attacks are differential power analysis (DPA) [KJJ99] for SCA, and differential fault
analysis (DFA) [BS97] and fault sensitivity analysis (FSA) [LSG+10] for FA.
In the case of SCA, a popular and established countermeasure is masking [ISW03, NRR06, PR11,
NRS11, BGN+ 14, RBN+ 15, GMK16, GM17], a secret sharing-based method in which intermediate
variables are stochastically split into multiple shares in order to make the side-channel-leaked
information independent of sensitive data. To protect against fault injections there are two major
countermeasures, as noted in [LRT12]: The first, Detection, checks whether the algorithm was
faulted during the execution by using either area or time redundancy (e.g. duplication [BECN+ 06],
concurrent error detection [BBK+03, KKG03, KKT04], ...). The problem with duplication is
that it does not provide security when the faults are duplicated as well. Even with error-detecting
codes, a powerful attacker can avoid detection if the injected faults result in valid codewords. The
second approach, Infection, prevents an adversary from extracting secret information from a faulty
ciphertext by ensuring that any induced fault results in a garbage output [GST12]. So far, all
infective computation schemes have been broken [BG13].
The research direction of combined countermeasures - that is, countermeasures against both SCA
and FA - is quite young and experimental. A popular methodology is to superpose two techniques
that separately resist one family of attacks. Examples of schemes that combine masking against
SCA with redundancy against FA are ParTI [SMG16] and Private Circuits II [IPSW06, CN16].
These countermeasures naturally inherit the drawbacks of redundancy, that is, they are vulnerable
against the injection of smart undetectable faults. Moreover, implementing a checking mechanism
that does not reveal sensitive information under combined attacks is a difficult task. More recently,
an actively secure multi-party computation protocol was adapted to the context of embedded
systems in order to provide security against combined attacks [RDB+ 18]. The resulting combined
countermeasure benefits from very strong formal security guarantees, but is extremely expensive to
implement in hardware. A combination of duplication and infection is explored in [LRT12], but this
scheme was broken in [BG13]. Infective computation is also combined with polynomial masking
in [SFRES18]. These schemes alleviate the need for a checking mechanism, but as a result cannot
give an honest user any indication of whether or not the chip has been tampered with.
Our Contribution In this work, we describe M&M, a new family of countermeasures that extends
any SCA-secure masking scheme with information-theoretic MAC tags against DFA (i.e. Masks &
MACs) and combines them with an infective computation mechanism. By instantiating M&M with
a d-th-order secure masking scheme, one achieves a generic order of protection for SCA. The M&M
construction then ensures a generic order of protection against DFA and against the combination of
SCA and DFA. As opposed to error detecting codes, the MAC mapping is perfectly unpredictable,
eliminating the possibility of smart undetectable faults. This makes M&M secure against stronger adversaries
than when error detecting codes are used. We demonstrate M&M with first- and second-order
secure implementations of the AES cipher. This example shows that M&M can be very efficient in
area with an overhead factor of merely 2.53 compared to an implementation that protects only
against SCA. We perform an SCA evaluation of our implementations where no leakage is found with
up to 100 million traces. Additionally, we design and perform a fault evaluation to confirm our
theoretically claimed fault coverage.
Scheme Overview We revisit the infective computation scheme of [LRT12], which encrypts the
plaintext redundantly and uses the difference between the two ciphertexts to infect the output.
That is, if the ciphertexts match, the output is exactly that ciphertext. If the ciphertexts do not
match, the output is randomized so the attacker cannot get any information from it. The general
idea is illustrated in Figure 1. For more details, we refer to the original work.
Figure 1: The infective computation scheme of [LRT12]: two instances of Enc produce c and c′, whose difference drives Infect to yield the output c̃.
This scheme was broken in [BG13] because of a bias on the randomized output. We make two
important changes. First, instead of using redundancy, which is vulnerable to the injection of
identical faults, we replace the second instantiation of the cipher with a computation on information-
theoretic MAC tags of the plaintext. If faults occur anywhere in the computation, the output of this
block does not correspond to a valid MAC tag of the ciphertext with arbitrarily high probability.
We use the difference between what the MAC tag should be and what it actually is to randomize
the ciphertext without any bias. This is illustrated in Figure 2. We also ensure that one can find
out whether the ciphertext is correct or not. In a way, we thus combine the advantages of detection
and infection.
The computation on masks and MACs resembles the approach of [RDB+ 18]. However, instead
of using expensive MPC machinery, we devise new constructions for generic field operations using
existing SCA-secure gadgets.
In Section 2, we introduce our adversarial model. Section 3 presents our framework of shared
data and information-theoretic MAC tags and the basic M&M building blocks that are subsequently
Figure 2: Infection with MAC tags: Enc computes the ciphertext c from the plaintext p, EncMac computes the tag Mac(c) from Mac(p), and Infect combines them into the output c̃.
used in Section 4 to do more complex computations. In particular, we describe M&M blocks for
elementary Galois field operations, which can be used to construct the encryption blocks Enc and
EncMac. In Section 5 we describe how the shared ciphertext and MAC tags are used in an infective
computation. This is followed by a discussion of the security in Section 6. Finally, in Section 7, we
demonstrate our scheme with an implementation and practical evaluation of the AES cipher.
2 Adversarial Model
In this work, we consider a semi-invasive adversary with probing and faulting capabilities. On the
one hand, we work under the d-probing model introduced in [ISW03] for SCA, providing security
against d-th-order side-channel analysis attacks under the independent leakage assumption. The
model can include or exclude hardware glitches, but in this work, we specifically instantiate M&M
considering glitches.
On the other hand, we consider two types of faults. We model faults as stochastic additive
errors: this means the effect of a fault is the XOR of the current state with an error variable
following some random distribution. This adversary model is very similar to the one described
in [SMG16]. However, in this work, we do not limit the adversary in the number of bits he can
alter, since we present a scheme which can tolerate multiple faults with any Hamming weight.
In addition, we allow the attacker to inject non-stochastic faults (for example very precise laser
injections or stuck-at faults). In that case however, the faults must be restricted to affect at most d
of the d + 1 shares. We can justify this limitation by a proper placement of the circuit on the chip
and the more complex setup of these kinds of faults.
Data representation against SCA and DFA. With security against both side-channel analysis and
faults in mind, we port the information-theoretic MAC tags to the shared domain. This means that
every intermediate x ∈ GF(2^k) is represented by ⟨x⟩ = (x, τ^x) with value shares x = (x_0, ..., x_d)
such that x_0 + ... + x_d = x and tag shares τ^x = (τ^x_0, ..., τ^x_d) such that τ^x_0 + ... + τ^x_d = τ^x. The
MAC key itself is also shared: α = (α_0, ..., α_d). Note that the MAC key α authenticates the
sensitive value x itself and not merely its shares x_i. Hence, the tag shares τ^x_i are not tags of the
value shares x_i but rather shares of the tag τ^x:

τ^x_i ≠ α · x_i,   but   Σ_i τ^x_i = α · Σ_i x_i
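To make this representation concrete, the following toy sketch (our own, not from the paper) fixes k = 8 and uses the AES field GF(2^8) for the field arithmetic; it builds a (d + 1)-sharing of a value together with a sharing of its tag τ^x = α · x:

```python
import random

def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES field).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def xor_sum(shares):
    # Recombine an additive (XOR) sharing over GF(2^8).
    acc = 0
    for s in shares:
        acc ^= s
    return acc

def share(x, d):
    # Split x into d+1 shares: d uniformly random shares plus one correction share.
    s = [random.randrange(256) for _ in range(d)]
    return s + [x ^ xor_sum(s)]

d, alpha, x = 2, 0x1B, 0x57
x_shares = share(x, d)                    # value shares x_0..x_d
tag_shares = share(gf_mul(alpha, x), d)   # shares of the tag tau^x = alpha * x

# The tag shares recombine to alpha*x, even though no individual tag share
# tau_i^x equals alpha * x_i.
assert xor_sum(x_shares) == x
assert xor_sum(tag_shares) == gf_mul(alpha, x)
```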
Shared multiplication. In what follows, we describe how M&M extends an existing SCA-secure
masking scheme with protection against faults. Literature provides us with many secure Boolean
masking schemes to choose from [ISW03, NRR06, PR11, NRS11, BGN+14, RBN+15, GMK16].
Each of those is defined by how it performs nonlinear operations (i.e. a multiplication) on Boolean
shares. A specific instantiation of M&M thus depends on the choice of how to implement the shared
multiplication operation x ⊙ y = z. We assume that this operation transforms (d + 1)-sharings of
two variables x and y into a (d + 1)-sharing of their product z = xy. The latency and randomness
cost of this operation depends on the choice of SCA countermeasure. However, we assume for now
that the latency is one clock cycle, since this is the case in most schemes. For further discussion on
the latency of the multiplication gadgets, see Section 7.1.
Applying the shared multiplication directly to the tag shares does not yield a valid tag for z = xy:

τ^x ⊙ τ^y = αx ⊙ αy = α^2 · z ≠ τ^z

However, if we use the result above in another shared multiplication with a sharing of α^(-1), we can
obtain the correct tag shares:

α^(-1) ⊙ (τ^x ⊙ τ^y) = αz = τ^z
Note that we could also obtain these tag shares by either x ⊙ τ^y or τ^x ⊙ y, but we want to avoid
crossing the datapaths of value and tag shares, such that faults introduced in the values cannot
automatically propagate to the tags. Consider for example a fault injected on input x, resulting in
x̃. Then x̃ ⊙ τ^y is a valid tag for x̃ ⊙ y.
The M&M multiplication is summarized in the left side of Figure 3. We assume the operation
includes one register stage, since this is the case for most state-of-the-art d-secure multipliers.
The value datapath of an M&M multiplication thus requires one clock cycle whereas that of the tags
requires two.
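The tag-correction identity can be checked numerically at the unshared level. The sketch below is illustrative only: it works on recombined values in the AES field GF(2^8), whereas the actual scheme performs these multiplications share-wise with a d-probing-secure gadget.

```python
def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES field).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    # Square-and-multiply exponentiation in GF(2^8).
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

alpha, x, y = 0x03, 0x57, 0x83
z = gf_mul(x, y)
tau_x, tau_y = gf_mul(alpha, x), gf_mul(alpha, y)

raw = gf_mul(tau_x, tau_y)          # = alpha^2 * z: NOT a valid tag for z
assert raw == gf_mul(gf_mul(alpha, alpha), z)

alpha_inv = gf_pow(alpha, 254)      # alpha^-1, since x^254 = x^-1 in GF(2^8)
assert gf_mul(alpha_inv, raw) == gf_mul(alpha, z)   # corrected: tau^z = alpha * z
```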
234 M&M: Masks and Macs against Physical Attacks
Figure 3: M&M nonlinear operations: obtaining the value and tag shares of z = xy (left) and
z = x^2 (right)
This multiplication uses a sharing of the inverse of the MAC key, α^(-1). We assume this is made
available together with the sharing of the MAC key itself. If this is not the case, α^(-1) can be
precomputed and stored.
M&M Squaring. Note that squaring in M&M follows the same procedure. That is, to obtain
⟨z⟩ from ⟨x⟩ such that z = x^2, we first square the value shares x and tag shares τ^x. Since a
characteristic-two finite field allows (a + b)^2 = a^2 + b^2, squaring in the shared domain is a local
operation that requires no registers: x^2 = (x^2_0, ..., x^2_d) and (τ^x)^2 = ((τ^x_0)^2, ..., (τ^x_d)^2). We then
again calculate the tag shares τ^z by a shared multiplication with the inverse of the MAC key α^(-1).
The M&M squaring operation therefore takes one clock cycle and is depicted in the right side of
Figure 3. The local squaring operation of shared data is depicted as ⋆^2. Extending this,
exponentiations by a power of two (x^(2^l)) take l clock cycles for the tags.
The Field GF(2). Note that in the case of bits (the field GF(2)), no correction of the MAC tag
is needed, since we then have α^2 = α, i.e. τ^x ⊙ τ^y = τ^z. As a result, the M&M multiplication
in GF(2) has the same latency as the SCA-secure multiplication, and M&M squaring (as well as
exponentiation by a power of two) is local.
Since x^5 = (x^2)^2 · x, this inversion requires seven M&M squares and four M&M multiplications.
Using the above squaring and multiplication blocks, obtaining the inversion output thus requires 15
clock cycles.
Optimization 1. The calculation can be sped up using a specialized block for the exponentiation
to the power five, shown in Figure 4. This is also done in [GPS14] and justifies the choice of
multiplication chain. In our case however, it is not trivial to do the same optimization for the
tag calculation. The operation ⋆^5 raises the shares of x to the power five in one clock cycle. This
requires a local computation of x^4 = (x^4_0, ..., x^4_d), followed by a shared multiplication x ⊙ x^4 = x^5.
This must be done with care, since x and x^4 are essentially the same variable and multiplying them
may break non-completeness due to the dependencies between these two variables. We therefore
precompute and refresh x^4 one cycle before it is used.
After this first stage, which takes one clock cycle, we thus obtain a sharing of x^5 in the value
datapath:

(x^4_0, ..., x^4_d) = x^4
x ⊙ x^4 = x^5

The analogous computation on the tag shares does not yet yield a valid tag:

((τ^x_0)^4, ..., (τ^x_d)^4) = (τ^x)^4
τ^x ⊙ (τ^x)^4 = (τ^x)^5 ≠ τ^(x^5)

A valid tag for x^5 is obtained through one more shared multiplication with α^(-4), which is easily
obtained locally from α^(-1):

α^(-4) ⊙ (τ^x ⊙ (τ^x)^4) = τ^(x^5)
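The α^(-4) correction admits the same kind of unshared sanity check as the multiplication above (again an illustrative sketch in the AES field GF(2^8), not the shared implementation):

```python
def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES field).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    # Square-and-multiply exponentiation in GF(2^8).
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

alpha, x = 0x47, 0x3C
tau_x = gf_mul(alpha, x)

t_raw = gf_mul(tau_x, gf_pow(tau_x, 4))      # (tau^x)^5 = alpha^5 * x^5: invalid tag
alpha_inv4 = gf_pow(gf_pow(alpha, 254), 4)   # alpha^-4, local powers of alpha^-1
assert gf_mul(alpha_inv4, t_raw) == gf_mul(alpha, gf_pow(x, 5))  # = tau^(x^5)
```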
Figure 4: M&M exponentiation to the power five: obtaining the value and tag shares of z = x^5
With this specialized block, one obtains the exponentiation to the power five of ⟨x⟩ in two clock
cycles, if (x^4, (τ^x)^4) is already refreshed beforehand. In those two clock cycles, we can obtain both
the output ⟨z⟩ and a refreshed (z^4, (τ^z)^4) to be ready for the next block. The value shares can be
refreshed using the second register stage (i.e. while the tag shares are being multiplied with α^(-4)).
For the tag shares, there is no spare register stage for refreshing. In the shared multiplication with
α^(-4), we therefore raise each crossproduct to the power four. Before the register stage, we thus
create a (d + 1)^2-sharing of both τ^z and of (τ^z)^4, each using its own randomness. As a result, the
inversion result is available in only ten clock cycles.
Optimization 2. By merging the last two operations into one step of two clock cycles, we reduce
the total latency of the M&M inversion to nine cycles. This is possible because x^254 = f(x^4, x^125)
with f(a, b) = a · b^2. We apply the same methodology as above. For the value shares:

(b^2_0, ..., b^2_d) = b^2
a ⊙ b^2 = f(a, b)

For the tag shares:

((τ^b_0)^2, ..., (τ^b_d)^2) = (τ^b)^2
τ^a ⊙ (τ^b)^2 = α^3 · f(a, b) ≠ τ^(f(a,b))
α^(-2) ⊙ (τ^a ⊙ (τ^b)^2) = τ^(f(a,b))
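Both identities used here, τ^a ⊙ (τ^b)^2 = α^3 · f(a, b) and x^254 = f(x^4, x^125), can be verified at the unshared level in GF(2^8) (illustrative sketch only; the scheme itself operates on shares):

```python
def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES field).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    # Square-and-multiply exponentiation in GF(2^8).
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

alpha, a, b = 0x9D, 0x12, 0x34
f = gf_mul(a, gf_mul(b, b))                  # f(a, b) = a * b^2
tau_a, tau_b = gf_mul(alpha, a), gf_mul(alpha, b)

raw = gf_mul(tau_a, gf_mul(tau_b, tau_b))    # = alpha^3 * f(a, b)
alpha_inv2 = gf_pow(gf_pow(alpha, 254), 2)   # alpha^-2
assert gf_mul(alpha_inv2, raw) == gf_mul(alpha, f)   # corrected tag

# The inverse x^254 = x^-1 indeed factors as f(x^4, x^125) = x^4 * (x^125)^2.
x = 0x57
assert gf_pow(x, 254) == gf_mul(gf_pow(x, 4), gf_pow(gf_pow(x, 125), 2))
```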
Figure 5 summarizes the nine-stage pipeline that calculates the value and tag shares for an
inversion of ⟨x⟩.
Figure 5: Inversion pipeline. (Register stages are depicted by red dotted lines.)
the correct tag shares of x^(-1). This is illustrated in Figure 6. Again, it is easy to obtain α^2 by
locally squaring the shares of α.
Figure 6: M&M inversion: obtaining the value and tag shares of z = x^(-1)
A sharing of L(x) is trivially obtained by applying the transform locally to each share of x. The same
cannot be said for the tag shares of x. In this section, we describe how to obtain the tag shares for
any linear transform of this type.
Isomorphisms. Consider the isomorphism φ between the finite field GF(2^k) and the vector space
(GF(2))^k that maps each element to its bit-representation vector, i.e. φ(2^i) = e_i with e_i the i-th
unit vector. We denote the bitvector of x as φ(x) = x. For the linear transform, we thus have

φ(L(x)) = L φ(x) = L x

with L ∈ (GF(2))^(k×k) the matrix that defines the linear transformation. One of the consequences of
this isomorphism is that multiplication by a field element α ∈ GF(2^k) also becomes a matrix-vector
product: there exists a matrix M_α ∈ (GF(2))^(k×k) such that φ(α · x) = M_α φ(x) for all x.
Note that the opposite direction does not work: not for every M ∈ (GF(2))^(k×k) does there exist an
α ∈ GF(2^k) such that this relation holds.
Given a value x ∈ GF(2^k) and a corresponding tag τ^x = αx ∈ GF(2^k), we wish to obtain a tag
τ^(L(x)) satisfying τ^(L(x)) = αL(x). We denote the bitvector of τ^x as φ(τ^x) = t = M_α x. We have

φ(τ^(L(x))) = φ(α · L(x)) = M_α L x = M_α L M_α^(-1) t

Hence, we can go from (x, τ^x) to (L(x), τ^(L(x))) by applying L to the bitvector of x and M_α L M_α^(-1) to
the bitvector of τ^x. A similar approach is used in error detecting/correcting code schemes [BCC+14].
In our case however, the code depends on α and thus the matrix M_α L M_α^(-1) is secret and different
in every encryption.
Since the columns of M_α^(-1) are φ(128α^(-1)), φ(64α^(-1)), ..., φ(α^(-1)) (for k = 8), we get

L M_α^(-1) = ( φ(L(128α^(-1)))  φ(L(64α^(-1)))  ...  φ(L(α^(-1))) )

and thus

M_α L M_α^(-1) = ( φ(αL(128α^(-1)))  φ(αL(64α^(-1)))  ...  φ(αL(α^(-1))) )
For each MAC key α, this matrix can be precomputed and stored in d + 1 shares, similar to
the precomputed sharing of α^(-1). In the linear transformation, we obtain the tag shares τ^(L(x))
by a shared matrix-vector multiplication with the sharing of M_α L M_α^(-1). A shared matrix-vector
multiplication can use the same equations as the SCA-secure multiplier, but with one of the
inputs a matrix and with the field multiplication ‘·’ replaced by matrix-vector products. Because of
the register stage in the shared matrix-vector multiplication, the affine transformation requires one
clock cycle.
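A small sketch can make the matrix view concrete. Below we build M_α column by column as φ(α · 2^i) in the AES field GF(2^8), then check both φ(α · x) = M_α φ(x) and the tag transform M_α L M_α^(-1) for the GF(2)-linear map L(x) = x^2. The helper names are our own, and bitvectors are packed into integers (bit i is the coefficient of 2^i):

```python
def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES field).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_pow(a, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

K = 8
def phi_matrix(f):
    # GF(2) matrix of a GF(2)-linear map f on GF(2^8): column i is phi(f(2^i)),
    # each column vector packed into one int.
    return [f(1 << i) for i in range(K)]

def mat_vec(cols, v):
    # Matrix-vector product over GF(2): XOR the columns selected by the bits of v.
    r = 0
    for i, c in enumerate(cols):
        if (v >> i) & 1:
            r ^= c
    return r

alpha = 0x2F
alpha_inv = gf_pow(alpha, 254)
M_alpha = phi_matrix(lambda u: gf_mul(alpha, u))
x = 0x6E
assert mat_vec(M_alpha, x) == gf_mul(alpha, x)       # phi(alpha*x) = M_alpha phi(x)

L = lambda u: gf_mul(u, u)                           # squaring: GF(2)-linear
T = phi_matrix(lambda u: gf_mul(alpha, L(gf_mul(alpha_inv, u))))  # M_a L M_a^-1
t = gf_mul(alpha, x)                                 # bitvector of tau^x
assert mat_vec(T, t) == gf_mul(alpha, L(x))          # a valid tag for L(x)
```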
Latency Optimization. Note that in cases such as AES, where the affine transformation follows
a power map, a little trick can ensure that the affine transformation does not increase the total
latency of an S-box evaluation. The last clock cycle of the inversion in §4.1.1 (resp. §4.1.2) is spent
on a shared multiplication of intermediate tag shares with α^(-2) (resp. α^2). We can incorporate this
tag correction in the affine transformation by replacing the matrix M_α L M_α^(-1) with M_α L M_α^(-3)
(resp. M_α L M_α).
5 Infective Computation
We have described above how to implement two encryption blocks: one that calculates ciphertext
shares, given plaintext shares, and another that calculates ciphertext tag shares from plaintext tag
shares and MAC key shares α_i. We now consider the ciphertext block per block with block size
k. Let c_i ∈ GF(2^k) be the shares of one ciphertext block and τ^c_i ∈ GF(2^k) the shares of the
corresponding tag block. If the tags are consistent with the data, then Σ_i τ^c_i = α · Σ_i c_i.
Check(c, τ^c, α)
  Let θ ← α ⊙ c
  for all shares i do
    Let E_i ← θ_i + τ^c_i
  end for
  E = Σ_i E_i
  Output (E == 0)
This checking algorithm is not secure in a combined adversary model with probing and faulting.
When an attacker manages to insert a known fault ∆ in one share of the shared multiplication such
that the check is performed with α′ = α + ∆, the unshared (zero) error E is replaced by

E′ = τ^c + α′ · c = τ^c + α · c + ∆ · c = ∆ · c

A single probe on the unshared E′ thus reveals the (faulty) ciphertext c. This attack defeats
the very purpose of the error check, which is to stop a faulty ciphertext from being released to the
adversary.
Infect(c, τ^c, α)
  Let θ ← α ⊙ c
  Draw R ←$ GF(2^k) \ {0}
  for all shares i do
    Let c̃_i ← c_i + R · (θ_i + τ^c_i)
  end for
  Output c̃
The scheme outputs a sharing of the adapted ciphertext block c̃ = ∑_i c̃_i = c + R · (α · c + τ^c).
Thus, if the tags are consistent (α · c + τ^c = 0), the scheme outputs a sharing of the computed block
c. On the other hand, if the tags do not match (α · c + τ^c ≠ 0), the unshared output c̃ is random.
One may note that generating a nonzero mask R is nontrivial. However, the implementation must
in any case include a PRNG with enough throughput to supply all the randomness for the computation
of the S-box. The number of random bits available for the routine Infect is thus much higher than
the amount required, and from these it is easy to generate one nonzero byte.
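The infective output phase can be sketched with a two-share byte-level model. This is our own Python illustration under stated assumptions (two Boolean shares, GF(2^8) arithmetic via a gf_mul helper); it shows that consistent tags release c while a tag fault Δ_τ turns the output into c + R · Δ_τ.

```python
import secrets

def gf_mul(a, b, mod=0x11B, k=8):
    """Carry-less multiplication in GF(2^k), reduced modulo the given polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= mod
    return r

def infect(c_sh, tau_sh, theta_sh, R):
    """c~_i = c_i + R * (theta_i + tau_i), share-wise in GF(2^8)."""
    return [ci ^ gf_mul(R, th ^ ti) for ci, th, ti in zip(c_sh, theta_sh, tau_sh)]

alpha, c = secrets.randbelow(256), secrets.randbelow(256)
tau = gf_mul(alpha, c)                      # consistent tag
m1, m2, m3 = (secrets.randbelow(256) for _ in range(3))
c_sh     = [m1, c ^ m1]                     # two-share splitting of c
tau_sh   = [m2, tau ^ m2]                   # two-share splitting of tau
theta_sh = [m3, gf_mul(alpha, c) ^ m3]      # shares of theta = alpha * c
R = 1 + secrets.randbelow(255)              # uniform in GF(2^8) \ {0}

good = infect(c_sh, tau_sh, theta_sh, R)
assert good[0] ^ good[1] == c               # consistent tags: c is released

tau_sh[0] ^= 0x55                           # inject a tag fault delta_tau
bad = infect(c_sh, tau_sh, theta_sh, R)
assert bad[0] ^ bad[1] == c ^ gf_mul(R, 0x55)  # c~ = c + R * delta_tau
```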
Unbiased Randomization We verify that the infected ciphertexts are uniformly distributed, so
the attacker cannot obtain any information from them. Consider the case where the computation
of Enc (and EncMac) is disturbed by faults ∆_c (resp. ∆_τ), resulting in the unshared infected
ciphertext block

c̃ = c + ∆_c + R · (α · (c + ∆_c) + τ^c + ∆_τ)

We may assume ∆_c is nonzero and unknown to the attacker. In fact, knowing the faulty ciphertext,
or indeed ∆_c, is the goal of the adversary in a DFA attack. Furthermore, we assume a strong
probing adversary can know the value of the mask R, which is why we do not allow it to be zero.
However, the MAC key α is always secret due to sharing. The introduced faults (∆_c, ∆_τ) remain
undetected with probability 2^{-km}, corresponding to the case ∆_τ = α · ∆_c. We therefore
claim an error detection probability (EDP) of 1 − 2^{-km} and focus now on the case ∆_τ ≠ α · ∆_c.
Using the fact that α · c + τ^c = 0, we rewrite c̃ = c + ∆_c · (1 + R · α) + ∆_τ · R. Clearly, the
unbiased randomization of the output depends on the uniformity of the mask (1 + R · α). It can be
verified that this mask is uniformly random in GF(2^k) when R is uniformly random in GF(2^k) \ {0}
and α is uniformly random in GF(2^k). As a result, c̃ is uniformly random in GF(2^k) when ∆_c ≠ 0.
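The uniformity of the mask (1 + R · α) can be checked exhaustively in a small field. The sketch below is our own verification script for GF(2^8): for every nonzero R, the map α ↦ 1 + R · α is a bijection, so a uniform α yields a uniform mask.

```python
def gf_mul(a, b, mod=0x11B, k=8):
    """Carry-less multiplication in GF(2^k), reduced modulo the given polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= mod
    return r

# For every nonzero R, alpha -> 1 + R * alpha hits all 256 field elements,
# i.e. it is a bijection on GF(2^8).
for R in range(1, 256):
    masks = {1 ^ gf_mul(R, a) for a in range(256)}
    assert len(masks) == 256
```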
6 Security Analysis
In this section, we discuss the security of the M&M scheme. Note that M&M can be based on
several different Boolean masking schemes providing SCA-secure multiplication, inversion and
refreshing gadgets in the chosen adversary model; the security of any instantiation thus depends
heavily on those choices. In particular, if the shared multiplication and inversion mechanisms used
are also secure in the presence of hardware glitches, M&M inherits this property.
The computation of the tag shares follows the same design principles as the value share
calculations. The two datapaths operate completely independently of each other and receive their
own distinct fresh randomness (see Figure 7). It is important to note that the input sharings p and
τ^p must be independent as well, which is easily achieved if the initial maskings of p and τ^p are
obtained separately. The independence of the two datapaths ensures that their merging in the Infect
block does not induce leakage on p or τ^p.
Figure 7: The Enc and EncMac datapaths, each fed by its own RNG, merging in the Infect block, which draws the mask R from a third RNG.
The Refreshing Gadget. It is important that any refreshing mechanism used (cf. § 4.1.1) ensures
the same security as is provided by the used masking scheme. The kind of refreshing thus depends
on the targeted security order d [BBP+ 16] and the considered attacker model. In general, one can
always make use of the multiplication-based refresh gadget of Ishai et al. [ISW03]. It has been
shown in [BBD+ 16, Gadget 4b] that this refreshing ensures composability at any order. For a
specific target security level, randomness can be consumed more efficiently. For example, the ring
refreshing approach of [CRB+ 16] uses d + 1 fresh masks in a circular manner to refresh d + 1 shares.
This method suffices for second- and third-order security. At certain higher orders, one can use its
variant, offset refreshing, which still uses only d + 1 units of fresh randomness but rotates with an
offset of more than 1 [BBD+ 18, Alg. 2]. Finally, additive refreshing using only d fresh masks is
sufficient when first-order security is targeted. For a more detailed treatment of refreshing gadgets,
we refer to [BBD+ 18].
remains true. Since the MAC key α ∈ GF(2^{km}) is unknown to the adversary, any number of
stochastic additive errors results in this relation with probability at most 2^{-km}. If the attacker
has the ability to inject non-stochastic faults, our model restricts these to affect at most d of the
d + 1 shares. In that case, the success probability still depends on the probability of guessing the
secret MAC key α ∈ GF(2^{km}) correctly. We therefore claim an error detection probability (EDP)
of 1 − 2^{-km}.
As stated in the adversarial model in §2, the faults we consider are limited neither in Hamming
weight nor in quantity. The worst-case probability that (1) is satisfied is 2^{-km}, regardless of the
Hamming weight of a single fault. The accumulated effect of multiple faults also does not change
this probability. The adversary obtains no additional information after injecting faults; hence,
shooting at random or guessing α remains the best strategy for subsequent faults. In the end, the same
equation (1) in GF(2^{km}) must hold for the faults to remain undetected, which happens with probability
2^{-km}. We experimentally investigate the effect of multiple faults in Section 7.3.2.
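The claimed EDP can also be estimated with a quick Monte Carlo experiment over random fault pairs. The sketch below is our own model, assuming km = 8 and additive faults (Δ_c, Δ_τ): a fault pair goes undetected only when Δ_τ = α · Δ_c, so the detection rate should approach 1 − 2^{-8}.

```python
import random

def gf_mul(a, b, mod=0x11B, k=8):
    """Carry-less multiplication in GF(2^k), reduced modulo the given polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= mod
    return r

random.seed(0)
trials, undetected = 100_000, 0
for _ in range(trials):
    alpha = random.randrange(256)        # secret MAC key (km = 8)
    d_c = 1 + random.randrange(255)      # nonzero fault on the value path
    d_t = random.randrange(256)          # fault on the tag path
    if d_t == gf_mul(alpha, d_c):        # undetected iff delta_tau = alpha * delta_c
        undetected += 1
detection_rate = 1 - undetected / trials  # close to 1 - 2^-8 = 0.99609...
```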
The zero MAC key. The event that the MAC key is zero occurs with probability 2^{-km}. Since α
is secret and shared, the adversary cannot know when the tags are zero and must still guess α to
determine what fault to inject in the tag computation. An adversary strategy of not injecting faults
in the tag computation, i.e. injecting faults only on the value computations, corresponds to guessing
that α = 0 and succeeds with probability 2^{-km}. This is completely analogous to guessing, for
example, that α = 1 and injecting identical faults in both the tag and value computation accordingly.
Either by guessing α or by injecting a random fault, the adversary hits the correct value with
probability 2^{-km}, corresponding to our claimed EDP of 1 − 2^{-km}. Hence, in theory, the case α = 0
is equivalent to any other nonzero MAC key. In practice, however, the strategy corresponding to
guessing α = 0 is easier, since it requires fault injections only in the value datapath and not in the tag
datapath. To avoid it, one could exclude the zero MAC key and reduce the EDP to 1 − (2^{km} − 1)^{-1}.
This difference is negligible if km is sufficiently large. Note, however, that in that case the infective
computation output phase can no longer be used, since it requires α to be uniformly random in
GF(2^{km}).
Ineffective Faults. Apart from DFA and FSA, there is also an interesting branch of fault attacks
that exploits so-called ineffective faults. For example, a stuck-at-zero fault on a wire or set of
wires is ineffective when those wires already carry the zero value. This type of fault is naturally
undetectable at the algorithm level, which makes it immune to both detection and infection
countermeasures. A flavour of Ineffective Fault Analysis (IFA) [Cla07] called Statistical Ineffective
Fault Analysis (SIFA) [DEK+ 18] has recently been proposed. SIFA collects the subset of correct
ciphertexts from a large number of faulted encryptions and exploits the fact that the intermediate
state of the algorithm is not uniformly distributed in this subset. This attack has been extended
to masked implementations in [DEG+ 18]. Ineffective faults (and thus SIFA) fall outside our
adversary model, since they are impossible to detect. Protection against such attacks can be provided
at a different level, for example using a protocol that erases the key as soon as a certain threshold
of faulty ciphertexts has been detected.
constructions. We then empirically validate our SCA claims using univariate and bivariate test
vector leakage assessment and we perform a simulation-based verification of the DFA security.
t_00 = x_0 · y_0          t_01 = x_0 · y_1 + r_0      t_02 = x_0 · y_2 + r_1
t_10 = x_1 · y_0 + r_0    t_11 = x_1 · y_1            t_12 = x_1 · y_2 + r_2
t_20 = x_2 · y_0 + r_1    t_21 = x_2 · y_1 + r_2      t_22 = x_2 · y_2

The corresponding first-order construction is given in [GMK16]. Each multiplication requires
d(d+1)/2 fresh units of randomness. It has been shown in [FGMDP+ 18] that such a multiplication
gadget is composable if the result is stored in a register. As shown in Figure 5, we present in this
work a construction with such registers in the value-share datapath but without extra registers in
the tag-share datapath. Note that the intermediate registers of [FGMDP+ 18] are a requirement
for SNI schemes in hardware, but the SNI property is not a prerequisite for a scheme to be secure, and we
customized our design for better performance. We currently see no formal way to prove the security
of our construction, but because the tag shares can be seen as Boolean shares of a multiplicative
share of the secret, we judge that the tag shares do not need to be stored in registers in order to
obtain security. We verify our approach empirically using TVLA and do not detect any leakage (cf.
§7.3.1). Provable security can be achieved by adding registers to each of the tag-share and value-share
datapaths.
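The cross-product table above can be checked with a small three-share model in GF(2^8). This is our own sketch (the share3 splitting and gf_mul helper are illustrative assumptions): each fresh mask r_j appears in exactly two terms, so it cancels and the output shares recombine to x · y.

```python
import secrets

def gf_mul(a, b, mod=0x11B, k=8):
    """Carry-less multiplication in GF(2^k), reduced modulo the given polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> k) & 1:
            a ^= mod
    return r

def share3(v):
    """Random 3-share Boolean splitting of a byte."""
    a, b = secrets.randbelow(256), secrets.randbelow(256)
    return [a, b, v ^ a ^ b]

x, y = 0x57, 0x83
xs, ys = share3(x), share3(y)
r0, r1, r2 = (secrets.randbelow(256) for _ in range(3))

# Cross-domain terms masked pairwise: the same fresh mask on t_ij and t_ji.
t = [[gf_mul(xs[i], ys[j]) for j in range(3)] for i in range(3)]
t[0][1] ^= r0; t[1][0] ^= r0
t[0][2] ^= r1; t[2][0] ^= r1
t[1][2] ^= r2; t[2][1] ^= r2

z = [t[i][0] ^ t[i][1] ^ t[i][2] for i in range(3)]   # output shares
assert z[0] ^ z[1] ^ z[2] == gf_mul(x, y)             # shares recombine to x * y
```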
Refreshing gadgets. When d = 1, a (d + 1)-share variable x can be refreshed with d fresh random
units using additive refreshing:

(x_0, x_1, . . . , x_d) → (x_0 + r_0, x_1 + r_1, . . . , x_{d−1} + r_{d−1}, x_d + ∑_{i=0}^{d−1} r_i)
For second-order security, we use ring refreshing as in [CRB+ 16], which consumes d + 1 fresh
random units.
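Both refreshings preserve the unshared value, since every fresh mask is added an even number of times. The following byte-level sketch is our own model (byte-sized shares are an assumption):

```python
import secrets

def additive_refresh(shares):
    """d fresh masks for d+1 shares; the last share absorbs the sum of all masks."""
    r = [secrets.randbelow(256) for _ in shares[:-1]]
    out = [s ^ m for s, m in zip(shares, r)]
    acc = 0
    for m in r:
        acc ^= m
    return out + [shares[-1] ^ acc]

def ring_refresh(shares):
    """d+1 fresh masks applied circularly: x_i + r_i + r_{i-1}."""
    n = len(shares)
    r = [secrets.randbelow(256) for _ in range(n)]
    return [shares[i] ^ r[i] ^ r[i - 1] for i in range(n)]

x = 0xAB
sh = [0x11, 0x22, x ^ 0x11 ^ 0x22]          # a 3-share splitting of x (d = 2)
for refresh in (additive_refresh, ring_refresh):
    new = refresh(sh)
    assert new[0] ^ new[1] ^ new[2] == x    # the unshared value is preserved
```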
We use these refreshings for the shares of x^4, x^{20} and x^{100} in Figure 5 as well as for the shares
of (τ^x)^4. We let r(d) be the corresponding randomness cost for refreshing d + 1 shares with security
order d, i.e. r(1) = 1 and r(2) = 3. The shares of (τ^x)^{20} and (τ^x)^{100} are refreshed using d(d+1)/2
units during the last multiplication in the M&M power-five block (cf. Figure 4). For d ∈ {1, 2}, we
actually have that r(d) = d(d+1)/2, so the cost of each refreshing is r(d). We note again that these
types of refreshing gadgets are only to be used for first- and second-order security, respectively, and
are not secure for higher orders.
Inversion gadget. For the shared inversion in GF(2^8), we can use De Cnudde et al.'s d + 1-share
[CRB+ 16] or Groß et al.'s Domain-Oriented Masking [GMK17] AES S-box implementations.
Both are based on Canright's compact S-box [Can05] using the tower-field approach. We opt for
the former, which requires five register stages. Together with the final shared multiplication with α^2,
the latency of the M&M inversion in Figure 6 is thus six cycles.
Latency The first inversion from §4.1 requires nine clock cycles. In version 2, we use the shared
inversion implementation from [CRB+ 16], which results in an M&M inversion of six clock cycles. Recall from
§4.2 that the AES affine transformation in M&M requires no additional cycles. In total, the AES
S-box output is thus obtained in nine clock cycles with version 1 and six clock cycles with version 2.
The byte-serialized architecture from [GMK17] is very efficient, as it performs the MixColumns,
ShiftRows and AddRoundKey stages in parallel with the SubBytes stage. As a result, when the
S-box latency is C ≥ 4 cycles, one round of encryption (including key schedule) requires exactly
16 + C clock cycles: during the first 16 cycles, all the state bytes are fed into the S-box pipeline.
The MixColumns operation is performed in parallel every four cycles. The remaining C cycles are
spent waiting for the last S-box output; during the first four of these, the S-box can be used by
the key schedule. In the last cycle, the last S-box output is shifted into the state at the same time
as ShiftRows is performed.
Our two versions of the AES Encryption therefore require respectively 25 and 22 clock cycles
per encryption round.
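The round latencies follow directly from the 16 + C formula; a minimal sketch (the function name is our own):

```python
def round_cycles(C):
    """Cycles for one AES round (incl. key schedule) in the byte-serial
    architecture: 16 cycles to feed the state bytes through the S-box
    pipeline, plus C cycles of drain (valid for S-box latency C >= 4)."""
    assert C >= 4
    return 16 + C

assert round_cycles(9) == 25  # version 1: 9-cycle S-box
assert round_cycles(6) == 22  # version 2: 6-cycle S-box
```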
Area We report our area results for first- and second-order security in Table 2 together with
latency and randomness cost. As expected, the more customized version of the S-box results in a
much more efficient implementation. Note that many more tradeoffs between area, latency and
randomness cost are possible depending on design choices (e.g. multiplication chain) and used
building blocks (e.g. shared inversion block ?−1 ).
Comparison to State-of-the-art. In Table 3, we report our area results next to other state-of-the-
art schemes. Some of these protect only against SCA [CRB+ 16, GMK17] and some are combined
countermeasures, such as ParTI [SMG16] and CAPA [RDB+ 18]. For the latter, it is not easy to
compare the results, given the differences in the ciphers implemented and the synthesis libraries used. We try
to overcome these differences by also reporting the overhead factor of the combined countermeasure
compared to an implementation that provides protection against SCA only.
The ParTI countermeasure is applied to the LED cipher in [SMG16]. The authors report an
area of 20.2 kGE obtained with a UMC 0.18µm library [Inc04], compared to 7.9 kGE for the
SCA-only first-order secure LED implementation. This signifies an area overhead factor of
20.2/7.9 ≈ 2.56. For the combined countermeasure CAPA, we can compute the overhead over an
SCA-secure KATAN implementation [RDB+ 18, Table 2]. Finally, we compare our first-order
M&M AES (V2) with De Cnudde's [CRB+ 16] first-order implementation against SCA only.
Table 3 reports the area of the implementation that is only secure against SCA in the fourth
column and the area of the combined countermeasure in the fifth column. The overhead factor
for those can be found in the last column.
We note that the dependency on the synthesis library cannot completely be eliminated this way.
Furthermore, all schemes consider very different adversary models.
Table 4 does the same for second-order secure implementations.
7.3 Evaluation
Evaluation of our M&M implementations is done separately for SCA and DFA, since no comprehensive
method for verifying against combined attacks has been published, to our knowledge. For
the former, we program both versions of a second-order protected AES on an FPGA and evaluate
the leakage coming from the power consumption with a non-specific t-test. The state of the art
on evaluating fault countermeasures is less advanced. We evaluate the scheme's EDP through a
simulation of the circuits, in which we model additive faults in the RTL.
TVLA. We perform a non-specific test vector leakage assessment (TVLA) [BCD+ 13] using the
methodology described in [RGV17]. This assessment is not used to mount an attack but to detect
correlations of the instantaneous power consumption with the secret. We gather power traces for
two distinct plaintext classes (one fixed and one random) and compare the two sets using the t-test
statistic. When the t-statistic exceeds the threshold 4.5 in absolute value, one can conclude with
confidence 99.9995% that the two sets of power traces follow different distributions and thus, that
the design leaks. This is a necessary but not sufficient condition for a successful attack to exist.
When the t-statistic remains below this threshold, the designer can conclude with high confidence
that the design is secure. We choose the fixed plaintext equal to the key so that all S-box inputs in
the first round are zero.
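The statistic behind this assessment is Welch's two-sample t-test. The sketch below is our own toy model (the synthetic 'fixed' and 'random' sample sets are fabricated leakage values, not measured traces): a small constant offset in the fixed class pushes |t| well beyond the 4.5 threshold.

```python
import math
import random

def t_statistic(set_a, set_b):
    """Welch's t-statistic between two sets of leakage samples."""
    na, nb = len(set_a), len(set_b)
    ma, mb = sum(set_a) / na, sum(set_b) / nb
    va = sum((x - ma) ** 2 for x in set_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in set_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

random.seed(1)
# Toy traces: the 'fixed' class leaks a small constant offset on top of noise.
fixed = [random.gauss(0.10, 1.0) for _ in range(100_000)]
rand  = [random.gauss(0.00, 1.0) for _ in range(100_000)]
t = abs(t_statistic(fixed, rand))
leaks = t > 4.5   # |t| > 4.5: the two classes are distinguishable -> design leaks
```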
Footnote 2: This number differs from the one reported in [CRB+ 16]. We contacted the authors and obtained their code in
order to synthesize with the same software and library as our M&M implementation.
We first perform the t-test on an unprotected AES implementation to verify that our setup
is sound and able to detect leakage. We emulate the unprotected implementation by disabling
the PRNG. Leakage is then expected in every order. When we turn the PRNG on and activate
the countermeasures, we expect only third-order leakage since the implementation is second-order
secure.
Figure 9: Non-specific t-test on second-order secure M&M AES implementation, V1. Left: PRNG
off (24K traces); Right: PRNG on (100M traces). Rows (top to bottom): one exemplary power
trace; first-order t-test; second-order t-test; third-order t-test.
Figure 10: Non-specific t-test on second-order secure M&M AES implementation, V2. Left: PRNG
off (24K traces); Right: PRNG on (100M traces). Rows (top to bottom): one exemplary power
trace; first-order t-test; second-order t-test; third-order t-test.
The t-test results for version 1 and 2 of our AES implementation are shown in Figures 9 and 10
respectively. In both cases, we see clear evidence of leakage at only 24 000 traces when the PRNG is
turned off. When we enable the PRNG, neither the first- nor second-order t-test statistics surpass
the threshold 4.5 with up to 100 million power traces.
In addition, we perform a bivariate analysis by combining time samples. For memory efficiency,
we reduce the sampling rate of the oscilloscope to 100 MS/s, resulting in power traces of 1 000 time
samples each. We then perform the t-test described above on 1 000 × 1 000 matrices, formed by
the centered product of the traces. The results are shown in Figure 11 and confirm that there is no
bivariate leakage with up to 50 million traces.
7.3.2 FA evaluation
In this section we evaluate the behaviour of our design when multiple stochastic faults are injected.
We describe our experiment, which aims to measure M&M’s detection rate of faults. For simplicity,
we evaluate our first-order secure AES implementations.
Figure 11: Bivariate t-test on second-order secure M&M AES implementation, V1 (left) and V2
(right). Below diagonal: PRNG off (20K traces); Above diagonal: PRNG on (50M traces).
Fault modeling. Traditionally, fault modeling theory distinguishes between faults that affect the
logic function on the one hand and delay faults on the other. Moreover, faults can be classified
as structural (modifying the interconnections among components in the circuit) or functional
(modifying the functionality of certain parts of the circuit) [ABF94].
Faults in cryptographic devices are typically injected with a laser or introduced by clock or
power-line glitches. In our experiments, we consider functional faults in the logic functions to model
the adversary of §2 (i.e. additive errors). We model faults as XOR additions, which allows us not
only to flip one specific bit but also to XOR entire offsets onto k-bit words.
Fault injection. We enable a fault injection on a wire by extending the original VHDL code with
an additional fault gate on that wire. Such a gate is simply an XOR with a fault selector, indicating
whether or not we want to inject a fault.
In each design to be tested, we select a number of critical bytes where an attacker is most likely
to inject a fault. These points are: the input to the state register; the state and key byte before
AddRoundKey; the SubBytes input from the key schedule and four different points inside SubBytes.
Faults can be inserted in every data share and every tag share of those bytes.
This means that 256 fault gates are installed in the first-order implementations. We collect the
corresponding fault selectors in a fault vector, which is controlled by the testbench. Each bit set to
'1' in the fault vector corresponds to a fault on a single bit. By setting multiple bits in the vector,
we enable multiple faults in the implementation. When several faults are activated in the same
byte, this amounts to XORing an offset onto that variable.
We want to be able to randomly draw fault vectors with a chosen Hamming weight H. For this,
we draw inspiration from the basic principles of address decoding. We draw one random bit; if it is
‘zero’, a fault occurs in the first half of the vector and if it is ‘one’, in the second half. We draw a
second bit and follow the same procedure to decide which of the two quarters. We continue this
way until a single fault bit is selected. Thus, log2 (256) = 8 random bits are needed to set a one-bit
fault in a 256-bit vector. By repeating this method H times, we can draw random fault vectors
with Hamming weight H. In our experiments we choose H = 128.
For each selected bit, we flip one more coin that decides whether the selected fault is activated
or not. This means that of the 128 selected faults, approximately half will be active. Since most of
the fault attacks in literature target one of the last rounds of AES, we similarly “inject” our faults
in the last round of encryption.
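The address-decoding draw described above can be sketched as follows. This is our own Python model of the procedure (the testbench itself is VHDL): each of the H = 128 draws consumes log2(256) = 8 bits to pick a position plus one activation coin flip.

```python
import random

def draw_fault_vector(n_bits=256, H=128, rng=random):
    """Select H positions by binary address decoding (log2(n_bits) coin flips
    narrow the vector down by halves), then flip one more coin per position
    to decide whether that fault is activated. n_bits must be a power of two."""
    vector = 0
    for _ in range(H):
        pos = 0
        for _ in range(n_bits.bit_length() - 1):   # 8 flips for n_bits = 256
            pos = (pos << 1) | rng.randrange(2)    # halves, quarters, ...
        if rng.randrange(2):                       # activation coin flip
            vector |= 1 << pos
    return vector

rng = random.Random(42)
v = draw_fault_vector(rng=rng)
active = bin(v).count("1")   # roughly H/2 = 64 active bits (positions may collide)
assert 0 < active <= 128
```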
Results. We simulate the fault-augmented VHDL code with Xilinx ISIM for 50 000 iterations
and measure the fault detection rate. In each experiment, approximately 64 bits are altered in
the computation; we are thus faulting one or more bits in multiple bytes. We consider the faults
detected if the returned ciphertext is infected. In version 1 of our M&M AES, the experiment shows
that 210 faulty ciphertexts are not infected. This means that the experimental detection rate of
our M&M implementation is 0.9958, compared to the theoretical 1 − 2^{-8} ≈ 0.9961. In our second
AES version, 189 faulty ciphertexts are not infected, which means the experimental detection
probability is 0.9962.
8 Conclusion
We introduce a new family of countermeasures to provide security against both SCA and DFA.
M&M can extend any masking countermeasure with information-theoretic MAC tags and infective
computation. We demonstrate how to construct basic M&M building blocks and how to build a
secure implementation of any cipher. We illustrate our proposal with first- and second-order secure
implementations of AES and we experimentally verify the SCA and DFA security. We show that
M&M implementations can be very efficient while providing resistance against both SCA and DFA
in a strong but realistic adversary model.
Acknowledgements The authors would like to thank Dusan Bozilov, Begül Bilgin and Nigel
Smart for fruitful discussions and also the CHES reviewers for their helpful comments. This work
was supported in part by the Research Council KU Leuven: C16/15/058 and OT/13/071, by the
NIST Research Grant 60NANB15D346 and the EU H2020 project FENTEC. Lauren De Meyer is
funded by a PhD fellowship of the Fund for Scientific Research - Flanders (FWO).
References
[ABF94] M. Abramovici, M. Breuer, and A. Friedman. Digital systems testing and testable
design. Wiley-IEEE Press, September 1994.
[AVFM07] F. Amiel, K. Villegas, B. Feix, and L. Marcel. Passive and active combined attacks:
Combining fault attacks and side channel analysis. In Workshop on Fault Diagnosis
and Tolerance in Cryptography (FDTC 2007), pages 92–102, Sept 2007.
[BBD+ 15] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin Gré-
goire, and Pierre-Yves Strub. Verified proofs of higher-order masking. EUROCRYPT,
IACR Cryptology ePrint Archive, 2015:060, 2015.
[BBD+ 16] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, Pierre-Yves Strub, and Rébecca Zucchini. Strong non-interference and
type-directed higher-order masking. In Edgar R. Weippl, Stefan Katzenbeisser,
Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security,
Vienna, Austria, October 24-28, 2016, pages 116–129. ACM, 2016.
[BBD+ 18] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, François-Xavier Standaert, and Pierre-Yves Strub. Improved parallel
mask refreshing algorithms - generic solutions with parametrized non-interference &
automated optimizations. IACR Cryptology ePrint Archive, 2018:505, 2018.
[BBK+ 03] Guido Bertoni, Luca Breveglieri, Israel Koren, Paolo Maistri, and Vincenzo Piuri.
Error analysis and detection procedures for a hardware implementation of the
advanced encryption standard. IEEE Trans. Computers, 52(4):492–505, 2003.
[BBP+ 16] Sonia Belaïd, Fabrice Benhamouda, Alain Passelègue, Emmanuel Prouff, Adrian
Thillard, and Damien Vergnaud. Randomness complexity of private circuits for
multiplication. In Marc Fischlin and Jean-Sébastien Coron, editors, Advances in
Cryptology - EUROCRYPT 2016 - 35th Annual International Conference on the
Theory and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12,
2016, Proceedings, Part II, volume 9666 of Lecture Notes in Computer Science, pages
616–648. Springer, 2016.
[BCC+ 14] Julien Bringer, Claude Carlet, Hervé Chabanne, Sylvain Guilley, and Houssem
Maghrebi. Orthogonal direct sum masking - A smartcard friendly computation
paradigm in a code, with builtin protection against side-channel and fault attacks.
In David Naccache and Damien Sauveron, editors, Information Security Theory and
Practice. Securing the Internet of Things - 8th IFIP WG 11.2 International Workshop,
WISTP 2014, Heraklion, Crete, Greece, June 30 - July 2, 2014. Proceedings, volume
8501 of Lecture Notes in Computer Science, pages 40–56. Springer, 2014.
[BECN+ 06] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan. The sorcerer’s
apprentice guide to fault attacks. Proceedings of the IEEE, 94(2):370–382, Feb 2006.
[BG13] Alberto Battistello and Christophe Giraud. Fault analysis of infective AES compu-
tations. In Wieland Fischer and Jörn-Marc Schmidt, editors, 2013 Workshop on
Fault Diagnosis and Tolerance in Cryptography, Los Alamitos, CA, USA, August 20,
2013, pages 101–107. IEEE Computer Society, 2013.
[BGN+ 14] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. Higher-order threshold implementations. In Palash Sarkar and Tetsu Iwata,
editors, Advances in Cryptology - ASIACRYPT 2014 - 20th International Conference
on the Theory and Application of Cryptology and Information Security, Kaoshiung,
Taiwan, R.O.C., December 7-11, 2014, Proceedings, Part II, volume 8874 of Lecture
Notes in Computer Science, pages 326–343. Springer, 2014.
[BS97] Eli Biham and Adi Shamir. Differential fault analysis of secret key cryptosystems,
pages 513–525. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
[Can05] David Canright. A very compact S-box for AES. In Josyula R. Rao and Berk
Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th
International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings,
volume 3659 of Lecture Notes in Computer Science, pages 441–455. Springer, 2005.
[CFGR10] C. Clavier, B. Feix, G. Gagnerot, and M. Roussellet. Passive and active combined
attacks on AES combining fault attacks and side channel analysis. In 2010 Workshop
on Fault Diagnosis and Tolerance in Cryptography, pages 10–19, Aug 2010.
[Cla07] Christophe Clavier. Secret external encodings do not prevent transient fault analysis.
In Pascal Paillier and Ingrid Verbauwhede, editors, Cryptographic Hardware and
Embedded Systems - CHES 2007, 9th International Workshop, Vienna, Austria,
September 10-13, 2007, Proceedings, volume 4727 of Lecture Notes in Computer
Science, pages 181–194. Springer, 2007.
[CN16] Thomas De Cnudde and Svetla Nikova. More efficient private circuits II through
threshold implementations. In 2016 Workshop on Fault Diagnosis and Tolerance
in Cryptography, FDTC 2016, Santa Barbara, CA, USA, August 16, 2016, pages
114–124. IEEE Computer Society, 2016.
[CRB+ 16] Thomas De Cnudde, Oscar Reparaz, Begül Bilgin, Svetla Nikova, Ventzislav Nikov,
and Vincent Rijmen. Masking AES with d+1 shares in hardware. In Benedikt
Gierlichs and Axel Y. Poschmann, editors, Cryptographic Hardware and Embedded
Systems - CHES 2016 - 18th International Conference, Santa Barbara, CA, USA,
August 17-19, 2016, Proceedings, volume 9813 of Lecture Notes in Computer Science,
pages 194–212. Springer, 2016.
[DEG+ 18] Christoph Dobraunig, Maria Eichlseder, Hannes Groß, Stefan Mangard, Florian
Mendel, and Robert Primas. Statistical ineffective fault attacks on masked AES
with fault countermeasures. IACR Cryptology ePrint Archive, 2018:357, 2018.
[DEK+ 18] Christoph Dobraunig, Maria Eichlseder, Thomas Korak, Stefan Mangard, Florian
Mendel, and Robert Primas. SIFA: Exploiting ineffective fault inductions on symmetric
cryptography. IACR Transactions on Cryptographic Hardware and Embedded
Systems, 2018(3):547–572, Aug. 2018.
[DV12] F. Dassance and A. Venelli. Combined fault and side-channel attacks on the AES
key schedule. In 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography,
pages 63–71, Sept 2012.
[FGMDP+ 18] Sebastian Faust, Vincent Grosso, Santos Merino Del Pozo, Clara Paglialonga, and
François-Xavier Standaert. Composable masking schemes in the presence of physical
defaults & the robust probing model. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2018(3):89–120, Aug. 2018.
[FH17] Wieland Fischer and Naofumi Homma, editors. Cryptographic Hardware and Em-
bedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan,
September 25-28, 2017, Proceedings, volume 10529 of Lecture Notes in Computer
Science. Springer, 2017.
[GM17] Hannes Groß and Stefan Mangard. Reconciling d+1 masking in hardware and
software. In Fischer and Homma [FH17], pages 115–136.
[GMK16] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking:
Compact masked hardware implementations with arbitrary protection order. IACR
Cryptology ePrint Archive, 2016:486, 2016.
[GMK17] Hannes Groß, Stefan Mangard, and Thomas Korak. An efficient side-channel
protected AES implementation with arbitrary protection order. In Helena Handschuh,
editor, Topics in Cryptology - CT-RSA 2017 - The Cryptographers’ Track at the
RSA Conference 2017, San Francisco, CA, USA, February 14-17, 2017, Proceedings,
Classification of Balanced
Quadratic Functions
Publication Data
My Contribution
Principal author.
Abstract. S-boxes, typically the only nonlinear part of a block cipher, are the heart of
symmetric cryptographic primitives. They significantly impact the cryptographic strength and
the implementation characteristics of an algorithm. Due to their simplicity, quadratic vectorial
Boolean functions are preferred when efficient implementations for a variety of applications are
of concern. Many characteristics of a function stay invariant under affine equivalence. So far, all
6-bit Boolean functions, 3- and 4-bit permutations have been classified up to affine equivalence.
At FSE 2017, Bozilov et al. presented the first classification of 5-bit quadratic permutations.
In this work, we propose an adaptation of their work resulting in a highly efficient algorithm
to classify n × m functions for n ≥ m. Our algorithm enables for the first time a complete
classification of 6-bit quadratic permutations as well as all balanced quadratic functions for
n ≤ 6. These functions can be valuable for new cryptographic algorithm designs that target efficient
multi-party computation or side-channel analysis resistance. In addition, we provide a
second tool for finding decompositions of length two. We demonstrate its use by decomposing
existing higher degree S-boxes and constructing new S-boxes with good cryptographic and
implementation properties.
Keywords: Affine Equivalence · S-box · Boolean functions · Classification · Decomposition
1 Introduction
For a variety of applications, such as multi-party computation, homomorphic encryption and
zero-knowledge proofs, linear operations are considered to have minimal cost. Nonlinear operations
on the other hand cause a rapid growth of implementation requirements. Therefore, it becomes
important to create cryptographically strong algorithms with minimal nonlinear components. A
recent design in this direction, MiMC [AGR+ 16], which is based on some relatively old
observations [NK95], uses the simple quadratic power function x^3 in different fields as the only nonlinear
block of the algorithm. Another work that minimizes the number of multiplications is the LowMC
design [ARS+ 15], where a quadratic 3-bit permutation is used as the only nonlinear component of
a Substitution-Permutation-Network (SPN).
We also see the importance of minimizing the nonlinear components in the field of secure
implementations against side-channel analysis. Efforts to decompose the S-boxes of existing
algorithms, such as the DES and AES S-boxes, into a minimum number of lower degree nonlin-
ear components (AND-gates, field multiplications or other quadratic or cubic functions), have
produced more than a handful of papers. Some of these decomposition tools are generic and
work heuristically [CGP+ 12, RV13, CRV14, CPRR15, GR16, PV16] whereas others focus on enu-
merating decompositions of all permutations for a certain size [BNN+ 12, KNP13]. In general,
they all make it clear that there is a significant advantage in considering side-channel security
during the design process and hence using low degree nonlinear components. As a reaction to
this line of research, a variety of novel symmetric-key designs use simply a quadratic permuta-
tion [ABB+ 14, BDP+ 14, BDP+ 15, DEMS15]. Examples include Keccak [BDPA13], one instance
of which is the new hash function standard, and several candidates of the CAESAR competition.
Generating strong, higher degree S-boxes using quadratic functions has also been shown useful
in [BGG+ 16]. These works demonstrate the relevance of our research, which focuses on enumerating
quadratic n × m functions for n < 7.
A valuable tool for the analysis of vectorial Boolean functions, which are typically used as
S-boxes, is the concept of affine equivalence (AE). AE allows the entire space of n × m functions
to be classified into groups with the same cryptographic properties. These properties include
the algebraic degree, the differential uniformity and the linearity of both the function and its
possible inverse in addition to multiplicative complexity. Moreover, the randomness cost of a first-
order masked implementation is also invariant within a class if countermeasures such as threshold
implementations are used [Bil15]. With similar concerns in mind, our research relies on this affine
equivalence classification.
Our algorithm classifies all quadratic vectorial Boolean functions with five inputs in merely six minutes, which makes the
search for even 6-bit quadratic functions feasible. We also provide the cryptographic properties of
these functions and their inverses if possible.
Our work focuses on quadratic functions, since they tend to have low area requirements in
hardware, especially for masked implementations. We also introduce a tool for finding length-two
quadratic decompositions of higher degree permutations and we use it to decompose the 5-bit AB
and APN permutations. Furthermore, we find a set of high quality 5-bit permutations of degree 4
with small decomposition length that can be efficiently implemented.
Our list of quadratic 6-bit permutations is an important step towards decomposing the only
known 6-bit APN permutation class as an alternative to [PUB16].
2 Preliminaries
We consider an n × m (vectorial) Boolean function F(x) = y from F_2^n to F_2^m. The bits of x and
the coordinate functions of F are denoted by small-letter subscripts, i.e. x = (x0, …, x_{n−1}) where
x_i ∈ F_2 and F(x) = (f0(x), …, f_{m−1}(x)) where f_i(x) maps F_2^n to F_2. We use '∘' to denote
the composition of two or more functions, e.g. F1 ∘ F2(x) = F1(F2(x)) where F1: F_2^m → F_2^l and
F2: F_2^n → F_2^m. We use |·| and · for absolute value and inner product respectively.
The algebraic degree of a function F = (f0, f1, …, f_{m−1}) is simply the largest degree of its
coordinate functions, i.e. Degr(F) = max_{0≤i<m} Degr(f_i).
Definition 1 (Component [NK95]). The components of a vectorial Boolean function F are the
nonzero linear combinations β · F of the coordinate functions of F, with β ∈ F_2^m \ {0}.
Definition 2 (DDT [BS90, Nyb93]). We define the Difference Distribution Table (DDT) δ_F of F
with its entries

    δ_F(α, β) = #{x ∈ F_2^n : F(x ⊕ α) = F(x) ⊕ β}

for α ∈ F_2^n and β ∈ F_2^m. The differential uniformity Diff(F) is the largest value in the DDT for
α ≠ 0 (and any β):

    Diff(F) = max_{α≠0, β} δ_F(α, β)
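As a sanity check of this definition, the DDT and Diff(F) can be computed directly from a lookup table. The sketch below is illustrative (not the paper's code) and uses the 4-bit example S-box that appears later in the Figure 1 example:

```python
def ddt(sbox, n, m):
    """delta_F(alpha, beta) = #{x : F(x ^ alpha) = F(x) ^ beta}."""
    table = [[0] * (1 << m) for _ in range(1 << n)]
    for alpha in range(1 << n):
        for x in range(1 << n):
            table[alpha][sbox[x ^ alpha] ^ sbox[x]] += 1
    return table

def diff_uniformity(sbox, n, m):
    """Diff(F): the largest DDT entry over alpha != 0 (and all beta)."""
    t = ddt(sbox, n, m)
    return max(t[a][b] for a in range(1, 1 << n) for b in range(1 << m))

S = [0x1, 0xB, 0x9, 0xC, 0xD, 0x6, 0xF, 0x3,
     0xE, 0x8, 0x7, 0x4, 0xA, 0x2, 0x5, 0x0]
t = ddt(S, 4, 4)
assert t[0][0] == 16                     # trivial row: alpha = 0 maps to beta = 0
assert all(sum(row) == 16 for row in t)  # every row sums to 2^n
print(diff_uniformity(S, 4, 4))
```

Note that every DDT entry is even, since solutions x come in pairs {x, x ⊕ α}.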
Definition 3 (LAT and Walsh Spectrum [O'C94, CV94]). We define the Linear Approximation
Table (LAT) λ_F of F through the Walsh transform. The Walsh transform of a Boolean function
f: F_2^n → F_2 is

    f̂(ω) = Σ_{x ∈ F_2^n} (−1)^{f(x)} · (−1)^{ω·x}.

A function's LAT is directly related to its two-dimensional Walsh transform
F̂(α, β) = Σ_{x ∈ F_2^n} (−1)^{α·x} · (−1)^{β·F(x)} as follows:

    λ_F(α, β) = F̂(α, β) / 2.

Any column in a function's LAT (λ_F(α, β̄) for β̄ fixed) is thus the scaled Walsh spectrum of a
component of F. The linearity Lin(F) is the largest absolute value in the LAT for β ≠ 0:

    Lin(F) = max_{β≠0, α} |λ_F(α, β)|.

An n-bit permutation F is said to be almost bent (AB) if ∀β ≠ 0 and α ∈ F_2^n, the LAT element
λ_F(α, β) is equal to either 0 or ±2^{(n−1)/2}. It is known that all AB permutations are also APN. The
LAT frequency distribution Λ_F of F is the histogram of the absolute values occurring in the LAT.
Remark 1. In some works, the linearity is expressed in terms of the Walsh spectrum instead of
the LAT as L(F) = max_{β≠0, α} |F̂(α, β)|. The two definitions differ by a factor of two, i.e.
L(F) = 2 · Lin(F).
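These quantities are cheap to compute with a fast Walsh-Hadamard transform. The sketch below (illustrative, not the paper's code) derives LAT columns, Lin(F) and the AB test directly from the definitions above; the lookup table of x^3 over GF(2^3), a Gold AB permutation, was computed by hand for this example with reduction polynomial x^3 + x + 1:

```python
def wht(seq):
    """Fast Walsh-Hadamard transform of a +/-1 sequence."""
    w = list(seq)
    h = 1
    while h < len(w):
        for i in range(0, len(w), 2 * h):
            for j in range(i, i + h):
                w[j], w[j + h] = w[j] + w[j + h], w[j] - w[j + h]
        h *= 2
    return w

def lat_column(sbox, beta):
    """lambda_F(., beta): scaled Walsh spectrum of the component beta . F."""
    comp = [(-1) ** bin(beta & s).count("1") for s in sbox]
    return [v // 2 for v in wht(comp)]

def linearity(sbox, m):
    """Lin(F) = max_{beta != 0, alpha} |lambda_F(alpha, beta)|."""
    return max(abs(v) for b in range(1, 1 << m) for v in lat_column(sbox, b))

def is_almost_bent(sbox, n):
    """AB: every LAT entry with beta != 0 is 0 or +/- 2^((n-1)/2)."""
    target = 1 << ((n - 1) // 2)
    return all(abs(v) in (0, target)
               for b in range(1, 1 << n) for v in lat_column(sbox, b))

cube = [0, 1, 3, 4, 5, 6, 7, 2]                # x^3 in GF(2^3), mod x^3 + x + 1
assert is_almost_bent(cube, 3)
assert not is_almost_bent(list(range(8)), 3)   # identity: Lin = 2^(n-1)
```

For the identity permutation the only nonzero LAT entries are ±2^{n−1}, which is why it fails the AB test.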
At the start of Algorithm 2, x is typically a power of 2 which means A(x) cannot be determined
from linear combinations and can be chosen freely. If the BackwardSweep is successful (i.e.
it finds a suitable A(x) such that S ◦ A(x) = B ◦ R(x)), we recurse on the ForwardSweep. If
the BackwardSweep fails, we need to guess A(x). This is for example the case in the very first
iteration when nB = 0.
Algorithm 1: ForwardSweep(x, y, n_A, n_B)
    while x < 2^{n_A} do
        Determine y′ s.t. B(y′) = S ∘ A(x);
        if y′ not yet defined then
            Pick y′ = 2^{n_B};
            Set B(y′) = S ∘ A(x);
            n_B = n_B + 1;
        end
        if SetR(x, y′) then
            x = x + 1;
        else
            Dead end: stop forward sweep;
        end
    end
    if x < 2^n then
        BackwardSweep(x, y, n_A, n_B);
    end
The Guess function is described by Algorithm 3. It fixes R(x) using Algorithm 4 to the smallest
unused y and then loops over all available assignments of A(x). For each guess, we try recursion
on the ForwardSweep. We need to try them all, because any guess can result in a lexicographically
smaller representative R.
Algorithm 4 builds the representative R and only changes previously determined outputs if they
are smaller than the current one.
[Figure 1: example of finding the representative R of the permutation S by building the chains x → A(x) → S → B(y) ← y.]

    a     0 1 2 3 4 5 6 7 8 9 A B C D E F
    S(a)  1 B 9 C D 6 F 3 E 8 7 4 A 2 5 0

    x     0 1 2 3 4 5 6 7 8 9 A B C D E F
    R(x)  0 1 2 3 4 6 8   (partially determined)
Figure 2 depicts how the predicted complexity (for fixed n = 5) increases monotonically as m
decreases.
In what follows, we describe an extension of the algorithm in Section 2.3 which has a non-monotonic
complexity behavior as m decreases, as can be observed in Figure 3. Note that Figure 3
depicts experimental runtimes whereas Figure 2 depicts an asymptotic complexity estimation. Their
scales are thus very different and should not be compared in magnitude; we consider
only the difference in trends. For m = n, the algorithm is identical to [BCBP03]. The runtimes
are calculated using a random selection of 500 5 × m functions for each m. Note that since no
pseudo-code is provided in [BCBP03] and the description is very brief, we cannot conclude whether
the difference is due to a complexity estimation error or to a slightly different algorithm. Moreover, the
real runtimes might approximate the asymptotic complexity better for n → ∞.
One of the changes caused by the non-invertibility of S is that we can no longer compute the
inverse S^{−1} and thus we cannot obtain x′ in Algorithm 2. We propose Algorithm 5 as an alternative,
in which we loop over all possible x′ for which S ∘ A(x′) = B(y).
A second change is that we do not immediately increase y after each BackwardSweep, but only when it runs out of candidates
x′ for which S ∘ A(x′) = B(y). The complete procedure for finding the representative of a balanced
non-injective function is illustrated in Figure 4.
This second feature actually makes the new algorithm very efficient in finding the smallest
representative when n − m is not too large. Instead of guessing A(x), which implies a loop over
approximately 2^n guesses, the list of 2^{n−m} candidates x′ now immediately gives us the guesses A(x′)
that result in the smallest output value R(x). The more often we can reuse an output value y, the
less often we need to guess. This can also be observed by comparing the examples in Figures 1 and 4.
As a result, the algorithm to find a representative becomes more efficient for n × m functions with
m < n. If m becomes very small, the complexity increases again, since the enumeration of 2^{n−m}
candidates, which is also used in [BCBP03], becomes the dominant factor. That the complexity
first decreases and then increases with m corresponds to our initial observation in Figure 3.
[Figure 4: example of finding the representative R of a balanced non-injective function S.]

    a     0 1 2 3 4 5 6 7 8 9 A B C D E F
    S(a)  1 3 1 0 1 2 3 3 2 0 3 0 2 2 1 0

    x     0 1 2 3 4 5 6 7 8 9 A B C D E F
    R(x)  0 0 0 1 0 2 1   (partially determined)
Comparing this timing with the couple of hours (using 16 threads) reported in [BBS17] shows the impact of
using an iterative approach, made possible by the new AE algorithm of Section 3. Nevertheless, in
this section we describe two ways to further optimize the complexity.
    Class       Representative           #{ω : |f̂(ω)| = ξ}
                                         ξ = 32   ξ = 16   ξ = 8
    Q0^(5,1)    x0                          1        0       0
    Q1^(5,1)    x0 ⊕ x1x2                   0        4       0
    Q2^(5,1)    x0 ⊕ x1x2 ⊕ x3x4            0        0      16
Since this property is unique for each class of Boolean functions, we will use it to fix the order
of coordinate functions during the classification algorithm. Using Lemma 1, in each intermediate
step m, we will only allow new coordinate functions f_m for which the linearity is not smaller than
the linearity of f_{m−1}.
Table 3: Average runtimes of the AE algorithm [BCBP03] for some 5-bit representatives
We introduce the following definition of a linear extension in order to define our optimization
for the classification.
Definition 6 (Linear Extension). An n-bit permutation F = (f0, …, f_{n−1}) is called the linear
extension of an n × m function G = (f0, …, f_{m−1}) if ∀m ≤ i < n, f_i is linear.
Any balanced n × m function can be linearly extended with n − m linear coordinate functions into a
balanced n-bit permutation. Correspondingly, each balanced n-bit permutation with 2^{n−m} − 1
linear components can be generated as a linear extension of some balanced n × m function with
zero linear components. We therefore initially eliminate all linear coordinate functions from our
search, generating 5 × m functions with only nonlinear coordinates in each step. In the very last
stage, we obtain a list of 5-bit bijections without linear components. Finally, we add to this list all
the linear extensions of the 5 × m representatives found so far (for m = 1, …, 4) to also obtain the
5-bit bijections with 2^{n−m} − 1 linear components. This optimization increases the efficiency of the
search in three ways. Firstly, it reduces the number of f_i candidates inserted in each stage (|F| decreases).
Secondly, it discards functions for which finding the AE representative is slow. Finally, it reduces
the number of n × m representatives that each stage starts from (|R| decreases).
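To make the notion of a linear extension concrete, here is a toy sketch (ours, not the paper's tool) that extends the balanced 3 × 2 quadratic function G = (x0 ⊕ x1x2, x1) into a 3-bit permutation by brute-forcing one linear third coordinate a · x:

```python
def dot(a, x):
    """Inner product a . x over F_2."""
    return bin(a & x).count("1") & 1

def G(x):
    """Balanced 3 x 2 function (x0 + x1*x2, x1), x encoded lsb-first."""
    x0, x1, x2 = x & 1, (x >> 1) & 1, (x >> 2) & 1
    return (x0 ^ (x1 & x2)) | (x1 << 1)

# all masks a for which (G, a.x) is a bijection on F_2^3
extensions = [a for a in range(1, 8)
              if len({G(x) | (dot(a, x) << 2) for x in range(8)}) == 8]
assert extensions        # a balanced function always admits such an extension
print(extensions)
```

Here only the linear functions that separate the two preimages of each output of G survive, which is exactly the balancedness argument above in miniature.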
Table 4: Number of affine equivalence classes for 5 × i functions, i = 1, …, 5, with 2^{i−m} − 1
linear components.

                              m = 0    1    2    3    4    5   Tot. #
    # 5 × 1 representatives       1    2    -    -    -    -      3
    # 5 × 2 representatives       1    3    8    -    -    -     12
    # 5 × 3 representatives       1    5   19   55    -    -     80
    # 5 × 4 representatives       1    3   17   52   93    -    166
    # 5 × 5 representatives       1    2    6   22   23   22     76
    # linear components          31   15    7    3    1    0
Note that the number of classes obtained from linearly extending all 5 × m functions can be
much smaller than the number of 5 × m classes itself (for example, 23 versus 93 for m = 4). This can be
explained by the fact that linearly extending two extended-affine but not affine equivalent functions
can result in affine equivalent permutations (i.e. a collision in the linear extension). Consider
for example the following two 5 × 3 functions that are extended-affine equivalent but not affine
equivalent:

    ( x0 ⊕ x1x2 )         ( x0 ⊕ x3 ⊕ x1x2 )
    ( x1 ⊕ x2x3 )    ≁    ( x1 ⊕ x3 ⊕ x2x3 )
    ( x4 ⊕ x0x1 )         ( x4 ⊕ x0x1 )

It is straightforward to verify that linearly extending both functions with coordinate functions x2
and x3 results in two affine equivalent 5-bit permutations:
    ( x0 ⊕ x1x2 )         ( x0 ⊕ x3 ⊕ x1x2 )
    ( x1 ⊕ x2x3 )         ( x1 ⊕ x3 ⊕ x2x3 )
    ( x4 ⊕ x0x1 )    ∼    ( x4 ⊕ x0x1 )
    ( x2 )                ( x2 )
    ( x3 )                ( x3 )
and masking, where the area grows exponentially with the degree of a function. Below we describe
length-two decompositions and constructions of cryptographically interesting permutations.
    H ∼ R1 ∘ A ∘ R2.    (1)

As with the classification of Boolean functions, we perform this search iteratively, starting from
n × 1 Boolean functions f for which f ∘ R2: F_2^n → F_2 is extended-affine equivalent to a component
function of H. We thus select the candidates for f using the following criteria:
(C1) f is balanced
(C2) ∃β ∈ F_2^m \ {0} s.t. Degr(f ∘ R2) = Degr(β · H)
(C3) ∃β ∈ F_2^m \ {0} s.t. (Δ_{f∘R2}, Λ_{f∘R2}) = (Δ_{β·H}, Λ_{β·H})
Then, we can describe H′ as L ∘ H for some L ∈ L(F_2^n → F_2^m). In order to eliminate false candidates
The quadratic function classification algorithm (Algorithm 6) is very efficient because it reduces
the lists of intermediate functions to their affine equivalence representatives at each step m. However,
we cannot do that in this case, as this would change the affine transformation A in the decomposition
(see Eqn. (1)). Let S1 = B ∘ R1 ∘ A be a candidate for which S1 ∘ R2 ∼ H and let R1 be its affine
representative. Reducing S1 to R1 would discard the affine transformation A. In that case, we
would only be able to decompose functions that are affine equivalent to the composition of two
representatives: R1 ∘ R2. In other words, if there is another candidate S1′ affine equivalent to S1,
we do not want to discard it, as it will not necessarily result in affine equivalent compositions:

    S1′ ∼ S1  ⇏  S1′ ∘ R2 ∼ S1 ∘ R2

However, without any reductions in the intermediate steps of the algorithm, the search becomes
very inefficient, as the list of candidate functions grows exponentially. There is still a redundancy in
our search because of the affine output transformation B that is included in S1. If S1′ is only left
affine equivalent to S1, then their compositions are affine equivalent:

    S1′ = B′ ∘ S1  ⇒  S1′ ∘ R2 ∼ S1 ∘ R2

We therefore adapt the AE algorithm to find the lexicographically smallest function R1^L that
is left affine equivalent to S1: R1^L = B^{−1} ∘ S1 = R1 ∘ A. We call this function R1^L the left
affine representative of S1. The algorithm to find R1^L is identical to finding the affine equivalence
representative, with the input affine transformation constrained to the identity function. This
constraint removes the need for guesses and makes the algorithm very efficient. An example is
shown in Figure 6. Algorithm 7 summarizes the resulting decomposition method.
    a       0 1 2 3 4 5 6 7 8 9 A B C D E F
    S(a)    1 B 9 C D 6 F 3 E 8 7 4 A 2 5 0

    x       0 1 2 3 4 5 6 7 8 9 A B C D E F
    R^L(x)  0 1 2 4 8 5 7   (partially determined)

Figure 6: Example for finding the left representative R^L of S. Input transformation A is fixed to
the identity function: A(x) = x.
the same algebraic degree (=3). We also know that the DDT of an AB function contains only
zeros and twos and its LAT contains only zeros and elements with absolute value 4. It immediately
follows that also the Walsh transform of each coordinate function of the AB is equal to either 0 or
±8.
Moreover, when we look at all 5 × m subfunctions H′ = L ∘ H, ∀L ∈ L(F_2^n → F_2^m), there is
only one permitted differential spectrum and LAT frequency distribution for each m. It is indeed
known that all coordinate functions of the AB are (extended-)affine equivalent.
We enumerate all 75 candidates for R2 and perform the search for S1 = R1 ∘ A using Algorithm 7.
When R2 is the representative of one of the classes Q1^(5,5) to Q74^(5,5), the algorithm finds no 5-bit bijections
that compose with R2 to a cubic AB. The search only ends with non-empty R when we perform
it with R2 the representative of Q75^(5,5), which is itself the quadratic AB permutation x^5. The
resulting R1 is equal to R2 and their composition forms the AB class that holds the inverse of
Q75^(5,5) (corresponding to the power map x^7). This decomposition is easily verified using power maps:
indeed, x^5 ∘ x^5 ∘ x^5 = x^125 = x^{125 mod 31} = x.
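This check is easy to reproduce. The sketch below is illustrative; it assumes the irreducible polynomial x^5 + x^2 + 1 for GF(2^5), though any field representation gives the same conclusion:

```python
def gf_mul(a, b, poly=0b100101, n=5):
    """Carry-less multiplication in GF(2^5), reduced modulo x^5 + x^2 + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> n:
            a ^= poly
    return r

def power_map(e, n=5):
    """Lookup table of x -> x^e over GF(2^n), for e >= 1."""
    lut = []
    for x in range(1 << n):
        r = 1 if x else 0          # 0^e = 0
        for _ in range(e):
            r = gf_mul(r, x)
        lut.append(r)
    return lut

F5 = power_map(5)                  # the quadratic AB permutation x^5
assert len(set(F5)) == 32          # gcd(5, 31) = 1, so x^5 is a bijection
# x^5 o x^5 o x^5 = x^125 = x, since 125 mod 31 = 1
assert [F5[F5[F5[x]]] for x in range(32)] == list(range(32))
```

Composing power maps multiplies exponents modulo 2^n − 1, which is all the identity uses.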
Without the constraint that the AB needs to be cubic, we also find a decomposition for class
Q75^(5,5) itself, with R1 = R2 = Q74^(5,5). A length-two decomposition for the odd cubic AB permutations
(x^11) is not found. Since the algorithm is exhaustive, this means it does not exist. Indeed, it is
shown in [NNR19] that the shortest decomposition of x^11 has length three.
Table 5: Look-up-tables for the even cubic AB function F and its decomposition F = S1 ◦ R2 with
S1 ∼ R2
F 0,1,2,8,4,17,30,13,10,18,5,19,6,20,11,26,16,15,9,23,3,7,29,21,14,12,25,31,28,27,22,24
S1 0,1,2,4,8,10,16,21,17,28,18,24,23,25,14,7,30,6,19,12,20,15,3,31,9,29,5,22,13,26,27,11
R2 0,1,2,4,3,8,16,28,5,10,26,18,17,20,31,29,6,21,24,12,22,15,25,7,14,19,13,23,9,30,27,11
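The decomposition in Table 5 can be verified mechanically; this short sketch replays the composition from the lookup tables copied above:

```python
# lookup tables copied from Table 5
F  = [0,1,2,8,4,17,30,13,10,18,5,19,6,20,11,26,
      16,15,9,23,3,7,29,21,14,12,25,31,28,27,22,24]
S1 = [0,1,2,4,8,10,16,21,17,28,18,24,23,25,14,7,
      30,6,19,12,20,15,3,31,9,29,5,22,13,26,27,11]
R2 = [0,1,2,4,3,8,16,28,5,10,26,18,17,20,31,29,
      6,21,24,12,22,15,25,7,14,19,13,23,9,30,27,11]

assert all(len(set(t)) == 32 for t in (F, S1, R2))   # all three are 5-bit bijections
assert [S1[R2[x]] for x in range(32)] == F           # F = S1 o R2
```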
Table 6: Look-up-tables for the Keccak permutation χ and its inverse χ−1
χ 0,9,18,11,5,12,22,15,10,3,24,1,13,4,30,7,20,21,6,23,17,16,2,19,26,27,8,25,29,28,14,31
χ−1 0,11,22,9,13,4,18,15,26,1,8,3,5,12,30,7,21,20,2,23,16,17,6,19,10,27,24,25,29,28,14,31
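Table 6 can likewise be checked against the standard 5-bit Keccak χ formula, χ(a)_i = a_i ⊕ (¬a_{i+1} · a_{i+2}) with indices modulo 5 (a quick consistency sketch, not part of the paper):

```python
def chi_map(a, n=5):
    """Keccak chi on an n-bit row: b_i = a_i XOR (NOT a_{i+1} AND a_{i+2})."""
    bit = lambda v, i: (v >> (i % n)) & 1
    return sum((bit(a, i) ^ ((1 ^ bit(a, i + 1)) & bit(a, i + 2))) << i
               for i in range(n))

# lookup tables copied from Table 6
chi     = [0,9,18,11,5,12,22,15,10,3,24,1,13,4,30,7,
           20,21,6,23,17,16,2,19,26,27,8,25,29,28,14,31]
chi_inv = [0,11,22,9,13,4,18,15,26,1,8,3,5,12,30,7,
           21,20,2,23,16,17,6,19,10,27,24,25,29,28,14,31]

assert [chi_map(a) for a in range(32)] == chi           # table matches the formula
assert all(chi[chi_inv[y]] == y for y in range(32))     # chi o chi^-1 = identity
```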
The Keccak inverse does not have the same strong properties as the AB permutations. Each
coordinate function is still cubic but the differential and linear properties are naturally weaker.
Firstly, apart from zeros we find both ±4 and ±8 in the LAT. For the DDT, there are multiple
differential spectra for the intermediate 5 × m subfunctions. As explained above, we generate the
list of possible DDT and LAT frequency distributions for each m = 1, …, 4 and feed this as input
to the search algorithm. We filter out all intermediate functions F: F_2^n → F_2^m for which the DDT
and LAT frequency distributions of F ∘ R2 do not occur in this list.
While the search finds many classes with the same cryptographic properties as the Keccak
inverse, a decomposition of length two for χ^{−1} itself does not appear to exist.
Theorem 1 ([Nyb94, Thm. 12]). Let S = (f0, f1, …, f_{n−1}): F_2^n → F_2^n be an n-bit bijection with
Diff(S) the maximal value in its DDT. Then, for any function F: F_2^n → F_2^m with m < n, composed
from a subset of the coordinate functions of S, F = (f_{i1}, f_{i2}, …, f_{im}) with i1, …, im ∈ {0, …, n−1},
the values in its DDT are upper bounded by Diff(S) · 2^{n−m}.
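Theorem 1 is easy to check empirically on a small example. This sketch (ours, for illustration) takes the 4-bit S-box from the Figure 1 example and verifies the bound for every coordinate subset:

```python
from itertools import combinations

def diff_uniformity(lut, n, m):
    """Largest DDT entry over alpha != 0."""
    best = 0
    for alpha in range(1, 1 << n):
        counts = [0] * (1 << m)
        for x in range(1 << n):
            counts[lut[x ^ alpha] ^ lut[x]] += 1
        best = max(best, max(counts))
    return best

S = [0x1, 0xB, 0x9, 0xC, 0xD, 0x6, 0xF, 0x3,
     0xE, 0x8, 0x7, 0x4, 0xA, 0x2, 0x5, 0x0]
n = 4
diff_s = diff_uniformity(S, n, n)

for m in range(1, n):
    for idx in combinations(range(n), m):
        # F = (f_{i1}, ..., f_{im}): keep only the chosen coordinate functions
        F = [sum(((s >> i) & 1) << j for j, i in enumerate(idx)) for s in S]
        assert diff_uniformity(F, n, m) <= diff_s * 2 ** (n - m)
```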
Our search delivers 17 quartic affine equivalence classes with very good cryptographic properties,
shown in Table 7. One of those is the APN class, which contains the permutation formed by the
inversion x^{−1} in F_{2^5}. Indeed, it was shown in [NNR19] that this permutation has decomposition
length two. Each of these very strong 5-bit S-boxes has an efficient masked implementation, as
they can be decomposed into only two quadratic components. This list is only a sample of the
functions that can be found using this method.
Table 7: Strong quartic (degree 4) 5-bit permutations with decomposition length two
Table 8: Number of affine equivalence classes with/without linear components for quadratic 6 × m
functions for m = 1, . . . , 5
Table 10: Number of quadratic 6 × 3 classes with cryptographic properties (Diff, Lin)
Table 11: Number of quadratic 6 × 4 classes with cryptographic properties (Diff, Lin)
Table 12: Number of quadratic 6 × 5 classes with cryptographic properties (Diff, Lin)
2 The exact listing of the representatives and their cryptographic properties can be found on http://homes.esat.
In order to complete the final stage of the search for all 6-bit quadratic permutations, we
generate the list of candidates for the AE algorithm by extending the 6 × 5 representatives with the
Boolean function candidates fi ∈ F and we add the linear extensions of all other 6 × m functions.
We split this list into 100 parts and complete the rest of the algorithm on 100 cores. In the end,
we find 2 263 classes of 6-bit quadratic permutations (not including the one linear permutation)3 .
Table 13 shows how these classes are distributed among even and odd permutations, and how many
of them have quadratic or cubic inverses. Table 14 depicts the histogram of cryptographic properties.
There are eight classes with Diff = 4 and Lin = 8. These are shown in Table 15. One of those
permutations is odd. Finally, Figure 8 shows the total number of affine equivalence classes of
quadratic n × n permutations. While it was already clear that this number grows fast with n, the
figure demonstrates how difficult it was before this work to predict just how fast.
Table 13: Distribution of the 2 263 quadratic 6-bit permutation classes
    Even / odd permutation:        2258 / 5
    Quadratic / cubic inverse:       70 / 2193
Table 14: Number of quadratic 6 × 6 classes with cryptographic properties (Diff, Lin)
Conclusion
This work studies the classification of quadratic vectorial Boolean functions under affine equivalence.
It extends Biryukov’s Affine Equivalence algorithm to non-bijective functions for use in a new
classification tool that provides us with the complete classification of balanced n × m quadratic
vectorial Boolean functions for m ≤ n and n < 7. We also introduce a tool for finding length-two
quadratic decompositions of higher degree functions.
New cryptographic algorithms should be designed with resistance against side-channel attacks
in mind. When it comes to choosing S-boxes, designers can use our classification to pick quadratic
3 The exact listing of the representatives and their cryptographic properties can be found on http://homes.esat.
components and use our (de)composition tool to create cryptographically strong S-boxes with
efficient masked implementations. After the classifications of 4- and 5-bit permutations in previous
works, this work expands the knowledge base on both classification and decomposition, bringing us
one step closer to classifying 8-bit functions and decomposing the AES S-box using permutations
instead of tower field or square-and-multiply approaches.
Acknowledgements
The authors thank Dusan Bozilov for the insights into his algorithm and Prof. Vincent Rijmen for
fruitful discussions and helpful comments. This work was supported by the Research Council KU
Leuven: C16/15/058. Lauren De Meyer is funded by a PhD fellowship (aspirant) of the Fund for
Scientific Research - Flanders (FWO). Begül Bilgin was a postdoctoral fellow of the FWO during
this research. Currently, she is working at Rambus Cryptography Research.
References
[ABB+ 14] Elena Andreeva, Begül Bilgin, Andrey Bogdanov, Atul Luykx, Florian Mendel, Bart Men-
nink, Nicky Mouha, Qingju Wang, and Kan Yasuda. CAESAR submission: PRIMATEs
v1.02, March 2014. http://primates.ae/wp-content/uploads/primatesv1.02.pdf.
[AGR+ 16] Martin R. Albrecht, Lorenzo Grassi, Christian Rechberger, Arnab Roy, and Tyge Tiessen.
MiMC: Efficient encryption and cryptographic hashing with minimal multiplicative
complexity. In Jung Hee Cheon and Tsuyoshi Takagi, editors, Advances in Cryptology -
ASIACRYPT 2016 - 22nd International Conference on the Theory and Application of
Cryptology and Information Security, Hanoi, Vietnam, December 4-8, 2016, Proceedings,
Part I, volume 10031 of Lecture Notes in Computer Science, pages 191–219, 2016.
[ARS+ 15] Martin R. Albrecht, Christian Rechberger, Thomas Schneider, Tyge Tiessen, and Michael
Zohner. Ciphers for MPC and FHE. In Elisabeth Oswald and Marc Fischlin, editors,
Advances in Cryptology - EUROCRYPT 2015 - 34th Annual International Conference
on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April
26-30, 2015, Proceedings, Part I, volume 9056 of Lecture Notes in Computer Science,
pages 430–454. Springer, 2015.
[BBS17] Dusan Bozilov, Begül Bilgin, and Haci Ali Sahin. A note on 5-bit quadratic permutations’
classification. IACR Trans. Symmetric Cryptol., 2017(1):398–404, 2017.
[BCBP03] Alex Biryukov, Christophe De Cannière, An Braeken, and Bart Preneel. A toolbox for
cryptanalysis: Linear and affine equivalence algorithms. In Eli Biham, editor, Advances
in Cryptology - EUROCRYPT 2003, International Conference on the Theory and
Applications of Cryptographic Techniques, Warsaw, Poland, May 4-8, 2003, Proceedings,
volume 2656 of Lecture Notes in Computer Science, pages 33–50. Springer, 2003.
[BDP+ 14] Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van
Keer. CAESAR submission: Ketje v1, March 2014. https://competitions.cr.yp.
to/round1/ketjev1.pdf.
[BDP+ 15] Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van
Keer. CAESAR submission: Keyak v2, August 2015. https://competitions.cr.yp.
to/round2/keyakv2.pdf.
[BDPA13] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Keccak. In
Thomas Johansson and Phong Q. Nguyen, editors, EUROCRYPT 2013, Athens, Greece,
May 26-30, 2013. Proceedings, volume 7881 of Lecture Notes in Computer Science, pages 313–314. Springer, 2013.
[BGG+ 16] Erik Boss, Vincent Grosso, Tim Güneysu, Gregor Leander, Amir Moradi, and Tobias
Schneider. Strong 8-bit S-boxes with efficient masking in hardware. In Gierlichs and
Poschmann [GP16], pages 171–193.
[BKNN15] Begül Bilgin, Miroslav Knezevic, Ventzislav Nikov, and Svetla Nikova. Compact
implementations of multi-S-box designs. In Naofumi Homma and Marcel Medwed,
editors, Smart Card Research and Advanced Applications - 14th International Conference,
CARDIS 2015, Bochum, Germany, November 4-6, 2015. Revised Selected Papers, volume
9514 of Lecture Notes in Computer Science, pages 273–285. Springer, 2015.
[BL08] Marcus Brinkmann and Gregor Leander. On the classification of APN functions up to
dimension five. Des. Codes Cryptography, 49(1-3):273–288, 2008.
[BNN+ 12] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, and Georg Stütz.
Threshold implementations of all 3×3 and 4×4 S-boxes. In Emmanuel Prouff and Patrick
Schaumont, editors, Cryptographic Hardware and Embedded Systems - CHES 2012
- 14th International Workshop, Leuven, Belgium, September 9-12, 2012. Proceedings,
volume 7428 of Lecture Notes in Computer Science, pages 76–91. Springer, 2012.
[BNN+ 15] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, Natalia N. Tokareva,
and Valeriya Vitkup. Threshold implementations of small S-boxes. Cryptography and
Communications, 7(1):3–33, 2015.
[BP10] Joan Boyar and René Peralta. A new combinational logic minimization technique
with applications to cryptology. In Paola Festa, editor, Experimental Algorithms, 9th
International Symposium, SEA 2010, Ischia Island, Naples, Italy, May 20-22, 2010.
Proceedings, volume 6049 of Lecture Notes in Computer Science, pages 178–189. Springer,
2010.
[BS90] Eli Biham and Adi Shamir. Differential cryptanalysis of DES-like cryptosystems. In
Alfred Menezes and Scott A. Vanstone, editors, Advances in Cryptology - CRYPTO
’90, 10th Annual International Cryptology Conference, Santa Barbara, California, USA,
August 11-15, 1990, Proceedings, volume 537 of Lecture Notes in Computer Science,
pages 2–21. Springer, 1990.
[BW72] Elwyn R. Berlekamp and Lloyd R. Welch. Weight distributions of the cosets of the (32,
6) Reed-Muller code. IEEE Trans. Information Theory, 18(1):203–207, 1972.
[CCZ98] Claude Carlet, Pascale Charpin, and Victor A. Zinoviev. Codes, bent functions and
permutations suitable for DES-like cryptosystems. Des. Codes Cryptography, 15(2):125–
156, 1998.
[CGP+ 12] Claude Carlet, Louis Goubin, Emmanuel Prouff, Michaël Quisquater, and Matthieu
Rivain. Higher-order masking schemes for S-boxes. In Anne Canteaut, editor, Fast
Software Encryption - 19th International Workshop, FSE 2012, Washington, DC, USA,
March 19-21, 2012. Revised Selected Papers, volume 7549 of Lecture Notes in Computer
Science, pages 366–384. Springer, 2012.
[CPRR15] Claude Carlet, Emmanuel Prouff, Matthieu Rivain, and Thomas Roche. Algebraic
decomposition for probing security. In Rosario Gennaro and Matthew Robshaw, editors,
Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptology Conference, Santa
Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I, volume 9215 of Lecture
Notes in Computer Science, pages 742–763. Springer, 2015.
[CRV14] Jean-Sébastien Coron, Arnab Roy, and Srinivas Vivek. Fast evaluation of polynomials
over binary finite fields and application to side-channel countermeasures. In Lejla
Batina and Matthew Robshaw, editors, Cryptographic Hardware and Embedded Systems
- CHES 2014 - 16th International Workshop, Busan, South Korea, September 23-26,
2014. Proceedings, volume 8731 of Lecture Notes in Computer Science, pages 170–187.
Springer, 2014.
[CV94] Florent Chabaud and Serge Vaudenay. Links between differential and linear cryptanalysis.
In Alfredo De Santis, editor, Advances in Cryptology - EUROCRYPT ’94, Workshop
on the Theory and Application of Cryptographic Techniques, Perugia, Italy, May 9-12,
1994, Proceedings, volume 950 of Lecture Notes in Computer Science, pages 356–365.
Springer, 1994.
[DEM15] Christoph Dobraunig, Maria Eichlseder, and Florian Mendel. Higher-order cryptanalysis
of LowMC. In Soonhak Kwon and Aaram Yun, editors, Information Security and
Cryptology - ICISC 2015 - 18th International Conference, Seoul, South Korea, November
25-27, 2015, Revised Selected Papers, volume 9558 of Lecture Notes in Computer Science,
pages 87–101. Springer, 2015.
[DEMS15] Christoph Dobraunig, Maria Eichlseder, Florian Mendel, and Martin Schläffer. CAESAR
submission: ASCON v1.1, August 2015. https://competitions.cr.yp.to/round2/
asconv11.pdf.
[Ful03] Joanne Elizabeth Fuller. Analysis of affine equivalent Boolean functions for cryptography.
PhD thesis, Queensland University of Technology, 2003.
[Gol59] Solomon W. Golomb. On the classification of Boolean functions. IRE Trans. Information
Theory, 5(5):176–186, 1959.
[GP16] Benedikt Gierlichs and Axel Y. Poschmann, editors. Cryptographic Hardware and
Embedded Systems - CHES 2016 - 18th International Conference, Santa Barbara, CA,
USA, August 17-19, 2016, Proceedings, volume 9813 of Lecture Notes in Computer
Science. Springer, 2016.
[GR16] Dahmun Goudarzi and Matthieu Rivain. On the multiplicative complexity of Boolean
functions and bitsliced higher-order masking. In Gierlichs and Poschmann [GP16], pages
457–478.
[KNP13] Sebastian Kutzner, Phuong Ha Nguyen, and Axel Poschmann. Enabling 3-share
threshold implementations for all 4-bit S-boxes. In Hyang-Sook Lee and Dong-Guk
Han, editors, Information Security and Cryptology - ICISC 2013 - 16th International
Conference, Seoul, Korea, November 27-29, 2013, Revised Selected Papers, volume 8565
of Lecture Notes in Computer Science, pages 91–108. Springer, 2013.
[Mai91] James A. Maiorana. A classification of the cosets of the Reed-Muller Code R(1, 6).
Mathematics of Computation, 57(195):403–414, 1991.
[NK95] Kaisa Nyberg and Lars R. Knudsen. Provable security against a differential attack. J.
Cryptology, 8(1):27–37, 1995.
[NNR19] Svetla Nikova, Ventzislav Nikov, and Vincent Rijmen. Decomposition of permutations
in a finite field. Cryptography and Communications, 11(3):379–384, 2019.
[Nyb93] Kaisa Nyberg. Differentially uniform mappings for cryptography. In Tor Helleseth, editor,
Advances in Cryptology - EUROCRYPT ’93, Workshop on the Theory and Application
of Cryptographic Techniques, Lofthus, Norway, May 23-27, 1993, Proceedings, volume
765 of Lecture Notes in Computer Science, pages 55–64. Springer, 1993.
[Nyb94] Kaisa Nyberg. S-boxes and round functions with controllable linearity and differential
uniformity. In Preneel [Pre95], pages 111–130.
[O’C94] Luke O’Connor. Properties of linear approximation tables. In Preneel [Pre95], pages
131–136.
[Pre95] Bart Preneel, editor. Fast Software Encryption: Second International Workshop. Leuven,
Belgium, 14-16 December 1994, Proceedings, volume 1008 of Lecture Notes in Computer
Science. Springer, 1995.
[PUB16] Léo Perrin, Aleksei Udovenko, and Alex Biryukov. Cryptanalysis of a theorem:
Decomposing the only known solution to the big APN problem. In Matthew Robshaw
and Jonathan Katz, editors, Advances in Cryptology - CRYPTO 2016 - 36th Annual
International Cryptology Conference, Santa Barbara, CA, USA, August 14-18, 2016,
Proceedings, Part II, volume 9815 of Lecture Notes in Computer Science, pages 93–122.
Springer, 2016.
[PV16] Jürgen Pulkus and Srinivas Vivek. Reducing the number of non-linear multiplications
in masking schemes. In Gierlichs and Poschmann [GP16], pages 479–497.
[RV13] Arnab Roy and Srinivas Vivek. Analysis and improvement of the generic higher-order
masking scheme of FSE 2012. In Guido Bertoni and Jean-Sébastien Coron, editors,
Cryptographic Hardware and Embedded Systems - CHES 2013 - 15th International
Workshop, Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, volume 8086 of
Lecture Notes in Computer Science, pages 417–434. Springer, 2013.
[Saa11] Markku-Juhani O. Saarinen. Cryptographic analysis of all 4 × 4-bit S-boxes. In Ali
Miri and Serge Vaudenay, editors, Selected Areas in Cryptography - 18th International
Workshop, SAC 2011, Toronto, ON, Canada, August 11-12, 2011, Revised Selected
Papers, volume 7118 of Lecture Notes in Computer Science, pages 118–133. Springer,
2011.
Curriculum Vitae
FACULTY OF ENGINEERING SCIENCE
DEPARTMENT OF ELECTRICAL ENGINEERING
COSIC
Kasteelpark Arenberg 10 box 2452
B-3001 Leuven
lauren.demeyer@esat.kuleuven.be
http://www.cosic.esat.kuleuven.be