
ARENBERG DOCTORAL SCHOOL
Faculty of Engineering Science

Cryptography in the Presence of Physical Attacks
Design, Implementation and Analysis

Lauren De Meyer

Supervisors:
Prof. dr. ir. V. Rijmen
dr. B. Bilgin

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Electrical Engineering

September 2020
Cryptography in the Presence of Physical Attacks

Design, Implementation and Analysis

Lauren DE MEYER

Examination committee:
Prof. dr. ir. P. Sas, chair
Prof. dr. ir. V. Rijmen, supervisor
dr. B. Bilgin, supervisor
Prof. dr. ir. B. Preneel
Prof. dr. ir. M. Verhelst
dr. S. Belaïd (CryptoExperts)
Prof. dr. DI. S. Mangard (TU Graz)

Dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Engineering Science (PhD): Electrical Engineering

September 2020
© 2020 KU Leuven – Faculty of Engineering Science
Uitgegeven in eigen beheer, Lauren De Meyer, Kasteelpark Arenberg 10 box 2452, B-3001 Leuven (Belgium)

Alle rechten voorbehouden. Niets uit deze uitgave mag worden vermenigvuldigd en/of openbaar gemaakt worden
door middel van druk, fotokopie, microfilm, elektronisch of op welke andere wijze ook zonder voorafgaande
schriftelijke toestemming van de uitgever.

All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm,
electronic or any other means without written permission from the publisher.
Preface

Four years does not seem like such a long time, but four years working on a
PhD is life-changing. Naturally, there are many people I have to acknowledge
for shaping this experience.
I would like to express my gratitude to the members of my examination
committee for taking the time to read and help me improve my work. I
also thank the chairman for leading my virtual private defense and real-life
public defense. Thank you to the FWO for funding me, which allowed me to
devote 3.5 years entirely to my research and achieve results I am very happy
with.
Dear Vincent, thank you for supervising me throughout this journey. Every
time I came knocking unexpectedly on your office door, you made time for me.
You tracked my progress in your notebook, you proofread everything I sent you
and never let a sunny Friday go by without telling me to go home early. It was
a privilege to have you as my supervisor.
Dear Bart, thank you for introducing me to the world of cryptography. When
you gave me a summer internship in 2012, I had no idea how much this would
end up shaping my life. I am very grateful for the time you took to teach me
new concepts on the blackboard in the office and also, for appointing Begül as
my internship supervisor.
Which brings me to my PhD guardian angel, Begül. Not a single day of my
PhD went by that I didn’t think of myself as the luckiest student in COSIC for
having you as my daily supervisor. In the office, you showed me by example
how to become an independent researcher with integrity and a critical mind.
Also outside the office, you were always there for me. Over four years, we had
so much fun working on midnight deadlines, eating sushi and künefe, going to
Ikea and saving butterflies from dying. I cannot thank you enough!
Dear Victor, we started out as colleagues and then became co-authors and then
friends. It didn’t take long for us to become really really really good friends
and eventually also office mates. I couldn’t have made it through this PhD
without your support. Thank you for being my rock both inside and outside
the office. Whether in Leuven, Taipei or Washington DC, we had the absolute
best conversations, in which you taught me about life, people and myself. I
look forward to sharing the next stages in our professional and personal lives.
One of the most important aspects of a PhD is collaboration and I want to thank
all my co-authors for the papers we worked on together. Special thanks are also
due to the people who helped me make my way around the COSIC lab, especially
Lennert and Arthur. To all the COSICs I did not get to collaborate with, you
were also wonderful. I could not imagine a better work environment. How many
people have the luxury to say that going to work feels like going to hang out with
friends? That is not to say that we do not work, as is hopefully evidenced by
this dissertation. But in between the work, we had amazing Halloween parties,
Friday beers at Metafoor, COSIC weekends and other memorable evenings
in the feestzaal. Especially to my Barracks crew, thank you for the skiing,
camping and Disney trips and all the crazy unforgettable stuff in between. Dear
Péla, thank you for all your help and for letting me talk to you about basically
anything. We are so lucky to have you at COSIC. I am also eternally grateful
to my dear family for giving me so many opportunities in life and for always
helping me pursue my dreams.

Por fin, el que me trae un poquititito loco. No querías muchas palabras, así
que tomaré prestadas algunas de Márquez: “Ella lo esperaba con tal ansiedad
que la sola sonrisa de él le devolvía el aliento.” Gracias <4
Abstract

Cryptographic primitives are designed such that they can resist black-box attacks
(cryptanalysis). For their implementations, extra measures must be taken to also
provide security against physical attacks. One class of physical attacks is that
of side-channel analysis (SCA), a non-invasive attack that exploits the physical
leakages emanating from a device (such as its power consumption or electromagnetic
radiation) to retrieve its secret data. One particularly powerful
attack, differential power analysis (first-order DPA), was introduced in 1999 by
Kocher et al. Today, we aim to provide security against higher-order DPA.
In a d-th-order DPA attack, the attacker exploits statistical moments of the
power consumption up to order d. A popular and established countermeasure
is masking, a method based on secret sharing in which intermediate variables
are stochastically split into multiple shares to make the side-channel leakage
independent of sensitive data. Many different types of masking
schemes have been proposed, often accompanied by a formal proof of security.
Another class of physical attacks has been gaining attention in recent
years. In fault attacks such as differential fault analysis (DFA), an attacker
induces logical errors in the computation, for example by under-powering the
device or by carefully illuminating certain areas of the silicon die. The result of
a faulty computation can reveal a wealth of secret information. These attacks
can be executed either separately or in combination with side-channel analysis.
The application of countermeasures to cryptographic primitives significantly
increases their implementation cost, especially that of the nonlinear components.
Moreover, these countermeasures require the generation of a large number of random bits, which
is expensive in practice.
The work described in this book can be divided into four categories.
Our first research direction looks at the design of countermeasures against
side-channel attacks and their application to existing ciphers, such as the
Advanced Encryption Standard (AES). Because of its wide deployment in
real-life applications, designing implementations that provide security against
physical attacks at minimal area, latency and randomness cost is an important
task. Moreover, the trade-off between these different cost metrics is difficult to
navigate, since optimizing for one is typically adverse to another.
As a second objective, we look into the analysis and verification of the devised
countermeasures. Constructing masked circuits is non-trivial, and many of the
countermeasures proposed in recent years have been shown to be vulnerable
relatively quickly after publication. This history of trial and error has
given rise to a new wave of works on the verification of masking schemes and
their implementations. With several different security notions and disparities
between theoretical and practical approaches, there is still debate on how
the security of a masked implementation should be evaluated.
Thirdly, we start exploring the design of countermeasures that protect not
only against side-channel attacks, but also against fault attacks and combined
attacks. This is a very young branch of research; in contrast to masking, the
state of the art on countermeasures for combined attacks (before our research)
was mostly heuristic and lacked a formal foundation.
Finally, we consider the challenge of designing cryptographic primitives with the
cost of these countermeasures in mind. Most cryptographic algorithms that
are widely accepted and used today were designed at a time when physical
attacks were not on the radar. Components were chosen mostly based on
their mathematical and cryptographic properties. Implementation cost was
sometimes taken into account, but outside the context of masking.
By looking at these four very different aspects of cryptography for embedded
systems, we not only create a comprehensive and multidisciplinary knowledge
base, but our experience in each aspect also improves our understanding of the
others. It allows us to identify common trends in the existing literature on
different topics, as well as disparities and contrasts.
Beknopte samenvatting

Cryptografische primitieven zijn zo ontworpen dat ze zwarte-doos aanvallen
(cryptanalyse) kunnen weerstaan. Voor hun implementatie moeten extra
maatregelen worden genomen om ook beveiliging te bieden tegen fysieke
aanvallen. Een belangrijke categorie van fysieke aanvallen is die van
nevenkanaalanalyse, een niet-invasieve aanval die de fysieke karakteristieken
van een apparaat (onder meer stroomverbruik of elektromagnetische straling)
gebruikt om zijn geheimen te achterhalen. Een bijzonder krachtige aanval,
differentiële vermogensanalyse (eerste-orde DPA) werd in 1999 geïntroduceerd
door Kocher et al. Tegenwoordig willen we beveiliging bieden tegen DPA van
een hogere orde. Bij een DPA aanval van orde d gebruikt de aanvaller elk
statistisch moment van het stroomverbruik tot aan orde d. Een populaire en
gevestigde tegenmaatregel is maskering, een methode op basis van geheim-delen
waarbij tussenvariabelen stochastisch worden opgesplitst in meerdere delen om
de uit het nevenkanaal gelekte informatie onafhankelijk te maken van gevoelige
informatie. Er zijn veel verschillende soorten maskeerschema’s voorgesteld, vaak
vergezeld van een formeel bewijs van beveiliging.
Een andere klasse van fysieke aanvallen begint de laatste jaren meer aandacht
te krijgen. Bij foutaanvallen zoals differentiële foutanalyse (DFA), veroorzaakt
een aanvaller logische fouten in de berekening door bijvoorbeeld het apparaat
onvoldoende stroom te voeden of door bepaalde gebieden in de siliciummatrijs
zorgvuldig te verlichten. Het resultaat van een foutieve berekening kan een
schat aan geheime informatie opleveren. Deze aanvallen kunnen afzonderlijk
worden uitgevoerd of in combinatie met nevenkanaalanalyse.
De toepassing van tegenmaatregelen op cryptografische primitieven verhoogt
hun implementatiekosten aanzienlijk, vooral van de niet-lineaire componenten.
Bovendien vereisen ze het genereren van een groot aantal willekeurige bits, wat
in de praktijk duur is.


Het werk dat in dit boek wordt beschreven, kan in vier categorieën worden
verdeeld.
Onze eerste onderzoeksrichting kijkt naar het ontwerp van tegenmaatregelen
tegen nevenkanaal aanvallen en hun toepassing op bestaande cijfers, zoals
de Advanced Encryption Standard (AES). Vanwege zijn uitgebreid gebruik
in de industrie, is het ontwerpen van zulke implementaties die beveiliging
bieden tegen fysieke aanvallen tegen minimale oppervlakte, reactietijd en
willekeurigheidskosten een belangrijke taak. Ook is de afweging tussen deze
verschillende kostenstatistieken moeilijk te manoeuvreren, omdat optimaliseren
voor de ene meestal nadelig is voor de andere.
Als tweede doelstelling kijken we naar de analyse en verificatie van de bedachte
tegenmaatregelen. Het ontwerpen van gemaskeerde circuits is niet triviaal
en veel van de voorgestelde tegenmaatregelen van de afgelopen jaren zijn na
publicatie relatief snel kwetsbaar gebleken. Dit heeft geleid tot een nieuwe golf
van werken over de verificatie van maskeerschema’s en hun implementaties. Met
meerdere beveiligingsnoties in de literatuur en verschillen tussen de theoretische
en praktische aanpak, is er nog steeds onzekerheid over hoe de beveiliging van
een gemaskeerde implementatie moet worden geëvalueerd.
Ten derde beginnen we met het verkennen van het ontwerp van tegenmaatregelen
die niet alleen beschermen tegen nevenkanaal aanvallen, maar ook tegen
foutaanvallen en gecombineerde aanvallen. Dit is een zeer jonge tak van
onderzoek en in tegenstelling tot maskering was de state-of-the-art op het
gebied van tegenmaatregelen voor gecombineerde aanvallen (vóór ons onderzoek)
meestal heuristisch en miste een formele achtergrond.
Ten slotte beschouwen we de uitdaging om cryptografische primitieven te
ontwerpen met de kosten van deze tegenmaatregelen in gedachten. De meeste
cryptografische algoritmen die tegenwoordig algemeen worden geaccepteerd en
gebruikt, zijn ontworpen in een tijd waarin fysieke aanvallen niet beschouwd
werden. Componenten werden meestal gekozen op basis van hun wiskundige
en cryptografische eigenschappen. Soms werd rekening gehouden met de
implementatiekosten, maar dan buiten de context van maskering.
Door naar deze vier zeer verschillende aspecten van cryptografie voor
geïntegreerde systemen te kijken, creëren we niet alleen een uitgebreide en
multidisciplinaire kennisbasis, maar onze ervaring in elk aspect verbetert ook
ons begrip van de anderen. Het stelt ons in staat om gemeenschappelijke
trends in de bestaande literatuur over verschillende onderwerpen te identificeren,
evenals verschillen en contrasten.
List of Abbreviations

AB almost bent.
AE affine equivalence.
AES Advanced Encryption Standard.
ANF algebraic normal form.
APN almost perfect nonlinear.
ASIC Application-Specific Integrated Circuit.
CPA correlation power analysis.
DDLA differential deep learning analysis.
DDT difference distribution table.
DES Data Encryption Standard.
DFA differential fault analysis.
DPA differential power analysis.
ECC error-correcting codes.
EDC error-detecting codes.
EM electromagnetic.
FA fault analysis.
FPGA Field Programmable Gate Array.
GCM Galois counter mode.
ILA independent leakage assumption.
LFSR linear feedback shift register.
LSB least significant bit.
MAC message authentication code.
MIA mutual information analysis.
MPC multi-party computation.
MSB most significant bit.
NI non-interference.
NIST National Institute of Standards and Technology.
OTP one-time pad.
PRNG pseudo-random number generator.
RNG random number generator.
SCA side-channel analysis.
SEA safe-error analysis.
SIFA statistical ineffective fault analysis.
SNI strong non-interference.
SNR signal-to-noise ratio.
SPA simple power analysis.
SPN substitution-permutation network.
TI threshold implementations.
TVLA test vector leakage assessment.
Contents

Abstract iii
Beknopte samenvatting v

List of Abbreviations viii


Contents ix
List of Figures xiii

List of Tables xv

I Design, Implementation and Analysis of Cryptography in the Presence of Physical Attacks 1
1 Introduction 3
1.1 Introduction to Symmetric Cryptography . . . . . . . . . . . . 3
1.2 Introduction to Side-Channel Attacks and Countermeasures . . 7
1.2.1 Side-Channel Analysis . . . . . . . . . . . . . . . . . . . 7
1.2.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.1 Research Questions . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 About This Doctoral Thesis . . . . . . . . . . . . . . . . 12
2 Masking against Side-Channel Attacks 15
2.1 Masking: The Basics . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.1 Masking Representations . . . . . . . . . . . . . . . . . 16
2.2 Masking Constructions: A Roadmap . . . . . . . . . . . . . . . 19
2.2.1 The First Private Circuits . . . . . . . . . . . . . . . . . 19
2.2.2 Glitch-Resistant Masking Schemes . . . . . . . . . . . . 21
2.2.3 Consolidation and Generalization . . . . . . . . . . . . . 24
2.3 Masking the Advanced Encryption Standard . . . . . . . . . . . 28
2.3.1 In Software . . . . . . . . . . . . . . . . . . . . . . . . . 28


2.3.2 In Hardware . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 Pushing the Limits . . . . . . . . . . . . . . . . . . . . . 36
2.3.4 Where the Randomness Comes From . . . . . . . . . . . 39
2.4 My Contributions in this Context . . . . . . . . . . . . . . . . . 40
2.4.1 Multiplicative Masking for AES in Hardware . . . . . . 40
2.4.2 Rotational Symmetry for FPGA-specific Advanced Encryption Standard (AES) . . . . . . . . . . . . . . . 41
2.4.3 Masking the AES with only Two Random Bits . . . . . 42
2.4.4 Recovering the CTR_DRBG state in 256 traces . . . . 43
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3 Side-Channel Analysis 47
3.1 Side-Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Differential Power Analysis . . . . . . . . . . . . . . . . 50
3.1.2 Higher-Order Attacks . . . . . . . . . . . . . . . . . . . 54
3.2 Verifying Masked Implementations . . . . . . . . . . . . . . . . 56
3.2.1 Leakage Assessment . . . . . . . . . . . . . . . . . . . . 57
3.2.2 Adversary Models . . . . . . . . . . . . . . . . . . . . . 63
3.2.3 Provable Security . . . . . . . . . . . . . . . . . . . . . . 66
3.2.4 Flaw Detection . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 My Contributions in this Context . . . . . . . . . . . . . . . . . 73
3.3.1 Consolidating Security Notions in Hardware Masking . . 73
3.3.2 Recovering the CTR_DRBG state in 256 traces . . . . 74
3.3.3 On the Effect of the (Micro)Architecture on the Development of Side-Channel Resistant Software . . . . . . . . 74
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4 Combined Physical Attacks and Countermeasures 79
4.1 Fault Attacks and Countermeasures . . . . . . . . . . . . . . . 79
4.1.1 Fault Attacks . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . 82
4.1.3 Ineffective Faults and Safe-Errors. . . . . . . . . . . . . 84
4.2 Combined Attacks and Countermeasures . . . . . . . . . . . . . 85
4.2.1 Attacks in the Literature . . . . . . . . . . . . . . . . . 85
4.2.2 Countermeasures . . . . . . . . . . . . . . . . . . . . . . 87
4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 My Contributions in this Context . . . . . . . . . . . . . . . . . 94
4.3.1 CAPA: The Spirit of Beaver against Physical Attacks . 94
4.3.2 M&M: Masks and Macs against Physical Attacks . . . . 95
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

5 Design of Symmetric Cryptographic Primitives 99


5.1 S-box Properties and Affine Equivalence . . . . . . . . . . . . . 100
5.1.1 Cryptographic Properties . . . . . . . . . . . . . . . . . 100
5.1.2 Classifications . . . . . . . . . . . . . . . . . . . . . . . . 102
5.1.3 Implementation Properties . . . . . . . . . . . . . . . . 103
5.2 Towards Cryptography Design for Masking . . . . . . . . . . . 104
5.2.1 Goals and Trade-offs . . . . . . . . . . . . . . . . . . . . 105
5.2.2 Discussion of the State-of-the-Art. . . . . . . . . . . . . 108
5.2.3 NIST Lightweight Competition. . . . . . . . . . . . . . . 110
5.3 My Contributions in this Context . . . . . . . . . . . . . . . . . 116
5.3.1 Classification of Balanced Quadratic Functions . . . . . 116
5.3.2 Low AND Depth and Efficient Inverses: an S-box Portfolio for Low-latency Masking . . . . . . . . . . . . . . . . 117
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6 Conclusion 119
Bibliography 123

A A New Inner Product Masking Algorithm 153

II Publications 157
List of Publications 159
Multiplicative Masking for AES in Hardware 161

Recovering the CTR_DRBG state in 256 traces 185

CAPA: The Spirit of Beaver against Physical Attacks 205

M&M: Masks and Macs against Physical Attacks 229

Classification of Balanced Quadratic Functions 253


Curriculum Vitae 275
List of Figures

1.1 CTR mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


1.2 CMOS inverter operation (simplified) . . . . . . . . . . . . . . 8
1.3 Links between secrets and side-channels (extended from [MOP07]). 9
1.4 The circle of (embedded system) life . . . . . . . . . . . . . . . 13

2.1 Trichina AND gate. . . . . . . . . . . . . . . . . . . . . . . . . 20


2.2 Example of a glitch. . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Addition chain used by Rivain and Prouff [RP10] to compute x254 . 29
2.4 Addition chain used by Grosso et al. [GPS14] to compute x254 . 30
2.5 AES inversion with Boolean and multiplicative masking. . . . . . 31
2.6 AES S-box in the tower-field construction [DRB+ 16]. . . . . . . 34
2.7 The area-randomness-latency trade-off. . . . . . . . . . . . . . . 36

3.1 Power measurements of AES. . . . . . . . . . . . . . . . . . . . 49


3.2 DPA results of AES with MSB model. . . . . . . . . . . . . . . 52
3.3 CPA results of AES with HW model. . . . . . . . . . . . . . . . 53
3.4 DDLA results of AES with MSB model. . . . . . . . . . . . . . 54
3.5 T-statistic for a fix vs. random test of an unmasked AES encryption. 61
3.6 Maximum t-statistic as a function of the number of power measurements. . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.7 Circuit model with combinational and sequential logic. . . . . . 64
3.8 Summary of implication relationships between security notions. 69
3.9 Illustration of the tool of [4]. . . . . . . . . . . . . . . . . . . . . 71

xiii
xiv LIST OF FIGURES

3.10 The gap between theory and practice: provable vs. practical security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.11 The gap between theory and practice: attacks. . . . . . . . . . 78

4.1 Sensitive regions on flash memory. . . . . . . . . . . . . . . . . . 81


4.2 The tile-probe-and-fault model. . . . . . . . . . . . . . . . . . . 89

5.1 Trade-offs between cryptographic strength and latency. . . . . . 106


List of Tables

2.1 Cost of masked building blocks [GPS14]. . . . . . . . . . . . . . 29


2.2 Cost of masked AES S-box implementations. . . . . . . . . . . 30
2.3 Area-latency trade-off for AES in hardware. . . . . . . . . . . . 33
2.4 Byte-serial masked AES implementations. . . . . . . . . . . . . 35

4.1 Comparing combined countermeasures in hardware . . . . . . . 90

5.1 Comparison of primitives in the state-of-the-art. . . . . . . . . 110


5.2 Comparison of National Institute of Standards and Technology (NIST) candidates for Hardware. (n = S-box size, B = block size or permutation state size, r = rate for Sponge) . . . . . . 113
5.3 Comparison of NIST candidates for Software. (n = S-box size, B = block size or permutation state size, r = rate for Sponge) 114

Part I

Design, Implementation and Analysis of Cryptography in the Presence of Physical Attacks
Chapter 1

Introduction

The presence of cryptography, e.g. the transformation of a message into
something incomprehensible for confidentiality, can be traced back through
the history of humanity. During the Second World War, it played a vital role in the
communication and strategic decision-making of the fighting parties. With the
conception of the internet in the 1960s and the world wide web (WWW) in the
nineties, the degree of interconnectivity and the need for secure communication grew
rapidly. While cryptography was initially only required by critical organizations
such as banks and governments and deployed on large computers that took up
an entire room, today we count multiple tiny devices per person in developed
countries, all needing various types of cryptography. As devices become smaller
and more interconnected, it becomes easier for adversaries to get physical access
to cryptographic computations. This has opened up a new world of physical
attacks, with adversaries whose capabilities far outreach those of the earlier
days of modern cryptography.

1.1 Introduction to Symmetric Cryptography

One of the most important functions of cryptography is encryption, i.e. the
transformation of a message (plaintext) into a concealed message (ciphertext)
using a key. The field of cryptography is typically divided into two groups,
based on the use of keys. In this work, we will only consider symmetric
cryptography, in which the same key is used for encryption and decryption,
similar to how a classic door lock functions. This key must be shared among the
communicating parties beforehand. In asymmetric cryptography on the other
hand, a public key is used for encryption and a secret key for decryption.
This can be compared to the operation of a padlock, which can be locked by
anyone, but only unlocked by the person with the key. The topic of this work
applies to any cryptographic primitive, but for simplicity, our descriptions will
mostly consider keyed primitives and encryption specifically. There are other
cryptographic primitives, such as hash functions, for which a similar treatment
holds.

An Example. The most simple example of symmetric encryption is the one-time
pad (OTP). Consider for example the encrypted message “NKZMZ”. It is
impossible to decipher it without knowledge of the key that was used to shift
the letters (where “A” represents a shift by one letter, “B” by two and so on).
Only someone who possesses the key “MYKEY” can decrypt it to the message “ALOHA”.
This scheme is provably secure but rather inefficient. Firstly, it requires a key
of the same length as that of the message. Secondly, each key can only be used
once. In modern cryptography, it is possible to encrypt multiple messages of
any length with the same relatively small key, thanks to block ciphers which
are used in a mode of operation.
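To make this concrete, the following is a minimal Python sketch of the letter-shift scheme above (an illustration, not code from this thesis). Note that under the stated convention (“A” shifts by one), the plaintext “ALOHA” with key “MYKEY” encrypts to “NKZMZ”.

```python
# Letter-shift one-time pad: key letter "A" shifts by one, "B" by two, etc.
def shift_otp(text: str, key: str, decrypt: bool = False) -> str:
    out = []
    for p, k in zip(text, key):          # the key must be as long as the message
        shift = ord(k) - ord('A') + 1    # "A" -> 1, "B" -> 2, ..., "Z" -> 26
        if decrypt:
            shift = -shift               # decryption undoes the shift
        out.append(chr((ord(p) - ord('A') + shift) % 26 + ord('A')))
    return ''.join(out)

print(shift_otp("ALOHA", "MYKEY"))                 # -> NKZMZ
print(shift_otp("NKZMZ", "MYKEY", decrypt=True))   # -> ALOHA
```

The two inefficiencies mentioned above are visible here: the key must be at least as long as the message, and reusing it would let an adversary correlate ciphertexts.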

Block Ciphers. A block cipher is a function E_K(P) that encrypts a fixed-size
plaintext P with the fixed-size key K, which can be re-used across multiple
encryptions. The goal of the function E_K is to conceal the relationship between
its inputs and outputs. For example, given the ciphertext C = E_K(P), an
adversary should only be able to obtain information on the plaintext P with
very small probability. Also, given one or more plaintext-ciphertext pairs (P, C),
an adversary should not be able to recover the key K. A block cipher typically
repeats a complex round function multiple times to achieve this. The round
function is designed based on the principles of confusion and diffusion, identified
by Shannon [Sha45]. Diffusion refers to the requirement that changing one bit of the
plaintext should affect many bits of the ciphertext. This is typically achieved
with linear layers that combine many bits of the state. Confusion ensures a
complex relationship between the ciphertext and the key and is often obtained by
applying small, highly nonlinear functions to parts of the state. The first encryption
standard was chosen by the US National Bureau of Standards (the predecessor of the
National Institute of Standards and Technology, NIST) in the 1970s and named
the Data Encryption Standard (DES). It was
superseded by the Advanced Encryption Standard (AES) in 2001.

Mode of Operation. Since block ciphers only operate on plaintexts of a limited
size B, special care is required when encrypting longer messages. It has been
shown that dividing each message into blocks P_i of size B and computing each
ciphertext block as C_i = E_K(P_i) is insecure, because it allows an adversary to
find repeating patterns. A mode of operation describes how the block cipher can
be applied repeatedly with the same key to obtain the encryption of messages
of any length, without such patterns. The most simple example is
Counter (CTR) mode, shown in Figure 1.1. The IV (or nonce) is the initial
value of a counter, which increments with each block cipher call. The
outputs of the block cipher can be seen as a key stream, which allows encryption
of the plaintext blocks P_i as in the OTP method. The IV should be different for
each message, such that no patterns arise when the same plaintext is encrypted
multiple times with the same key.

Figure 1.1: CTR mode. (The counter values IV, IV + 1, . . . , IV + i are each encrypted under E_K; the resulting key-stream blocks are XORed with the plaintext blocks P_i to produce the ciphertext blocks C_i.)
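The CTR construction of Figure 1.1 can be sketched in a few lines of Python. This is an illustration only: the `toy_block_cipher` stand-in for E_K is built from HMAC-SHA256 rather than a real block cipher such as AES, and the 16-byte block size and key values are arbitrary choices.

```python
import hmac
import hashlib

BLOCK = 16  # block size B in bytes (arbitrary choice for this sketch)

def toy_block_cipher(key: bytes, block: bytes) -> bytes:
    # Stand-in for E_K: a keyed pseudorandom function from HMAC-SHA256,
    # truncated to one block. A real deployment would use a block cipher here.
    return hmac.new(key, block, hashlib.sha256).digest()[:BLOCK]

def ctr_encrypt(key: bytes, iv: int, plaintext: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(plaintext), BLOCK):
        # Encrypt the counter value IV + i to obtain one key-stream block ...
        counter = (iv + i // BLOCK).to_bytes(BLOCK, 'big')
        keystream = toy_block_cipher(key, counter)
        # ... and XOR it with the plaintext block, as in the OTP method.
        out.extend(p ^ k for p, k in zip(plaintext[i:i + BLOCK], keystream))
    return bytes(out)

# Decryption is the same operation: XORing the key stream again removes it.
ctr_decrypt = ctr_encrypt

msg = b"attack at dawn -- attack at dusk"
ct = ctr_encrypt(b"my secret key", 42, msg)
assert ctr_decrypt(b"my secret key", 42, ct) == msg
```

Because only the counter enters the block cipher, encryption and decryption use the identical routine, and blocks can be processed in parallel; this also makes clear why the IV must never repeat under the same key.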

Many other modes of operation exist. In particular, recent modes of operation
such as Galois counter mode (GCM) [MV04] are designed to provide not only
secrecy but also authentication.

The Advanced Encryption Standard. In 2001, NIST selected the Rijndael
block cipher [DR98] as the national standard for symmetric encryption in the
US and named it the Advanced Encryption Standard (AES). AES-κ operates
on a 128-bit state and is based on a substitution-permutation network (SPN). It
allows key lengths κ of 128, 192 or 256 bits, resulting in varying levels of security.
The 128-bit state can be considered as a 4 × 4 matrix of elements in GF(2^8). In
each encryption or decryption round, the state is transformed by four operations,
of which all but one are linear/affine. The number of repetitions of the round
function depends on the key length κ. For AES-128, one encryption/decryption
consists of ten rounds.

AddRoundKey is the field addition of the 128-bit state with a 128-bit round
key. Since the Galois field has characteristic two, the addition is equivalent
to a bitwise XOR.

SubBytes is the only nonlinear transformation in the round function. It
consists of applying an 8 × 8 substitution function S (the S-box) to each of
the 16 bytes of the state. The S-box function is equivalent to an inversion
in the field GF(2^8) followed by an invertible affine transformation over
GF(2)^8:

S : GF(2^8) → GF(2^8) : x ↦ φ^{-1}(A·φ(x^{-1}) ⊕ b)

with A ∈ GF(2)^{8×8} and b ∈ GF(2)^8 defining the affine transform and
φ : GF(2^8) → GF(2)^8 the isomorphic mapping from bytes in GF(2^8) to
bits in GF(2)^8.
ShiftRows is a permutation of the state bytes by means of rotating its rows.
The i-th row of the state is rotated by i bytes to the left. Hence, the first
row (row 0) remains unchanged and the last row (row 3) rotates 3 bytes
to the left, or equivalently 1 byte to the right.
MixColumns is a linear operation that multiplies each column of the state
with a fixed matrix M ∈ GF(2^8)^{4×4}.

The 128-bit round keys are derived from a master key using the key schedule.
The key schedule consists mostly of linear operations, apart from using the
same S-box S as in SubBytes on four bytes of the key state in each round.
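As an illustration of the SubBytes description above, the following unoptimized Python sketch (not from the thesis) computes the AES S-box. The field inversion x^{-1} is realized as the exponentiation x^254 in GF(2^8) (with 0 mapped to 0, as in the standard), and the affine layer uses the standard AES constants, including b = 0x63.

```python
# AES S-box: inversion in GF(2^8) followed by the affine transform over GF(2)^8.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce modulo the AES polynomial
        b >>= 1
    return p

def gf_inv(x: int) -> int:
    """Field inversion as x^254 (square-and-multiply); 0 is mapped to 0."""
    result, exp = 1, 254
    while exp:
        if exp & 1:
            result = gf_mul(result, x)
        x = gf_mul(x, x)
        exp >>= 1
    return result

def sbox(x: int) -> int:
    y = gf_inv(x)
    out = 0
    for i in range(8):  # affine layer: A * phi(y) XOR b, computed bit by bit
        bit = ((y >> i) ^ (y >> ((i + 4) % 8)) ^ (y >> ((i + 5) % 8))
               ^ (y >> ((i + 6) % 8)) ^ (y >> ((i + 7) % 8)) ^ (0x63 >> i)) & 1
        out |= bit << i
    return out

assert sbox(0x00) == 0x63 and sbox(0x53) == 0xED  # matches the standard table
```

Realizing the inversion as x^254 is not only a convenient specification device: addition chains for x^254 are exactly what masked software implementations of the AES S-box compute, as discussed in Chapter 2.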

Adversary Model. Many different models can be distinguished based on the
capabilities and goals of the adversary. As noted above, some possible goals can
be plaintext or key recovery, among others. For the capabilities, we list some
common models here in order of increasing power:

• Known Ciphertext (or Ciphertext Only)
• Known Plaintext
• Chosen Plaintext
• Chosen Ciphertext

The field of cryptanalysis deals with the analysis of cryptographic primitives
in these models. Important examples of statistical attacks are differential
cryptanalysis [BS90], a chosen-plaintext attack, and linear cryptanalysis [Mat93],
a known-plaintext attack. For the remainder of this work, we will only consider
an adversary whose goal is to recover a key from a device; the adversary's
capabilities will be described in the next section.

1.2 Introduction to Side-Channel Attacks and Countermeasures

In the models above, the capabilities of the adversary are limited to the
knowledge or choice of inputs and outputs of the cryptographic primitive.
The encryption/decryption itself is considered as a black box. The black-box
model was the conventional model in cryptography until the 1990s, when
physical attacks started to emerge and cryptographers realised that in reality,
the cryptographic function is more like a grey box.
An attacker can exploit physical access to a device in several ways to get more
than just input and output information. We distinguish passive and active
physical attacks. In the first case, one passively observes the physical device
outputs. These attacks are called side-channel attacks. In the second case, an
attacker actively disturbs the device and introduces faults into the computations.
We refer to them as fault attacks. In this introduction, we focus on the former,
which is also the main theme of this work, although fault attacks will also play
a role in one of the chapters.

1.2.1 Side-Channel Analysis

Side-channel attacks came to light with a series of works by Kocher [Koc96,


KJJ99]. In these attacks, an attacker obtains side-channel information from
any physically observable feature such as the timing, power consumption or
electromagnetic (EM) radiation of a device and exploits this to extract secrets
from it. We will start with an example of why this is possible.

From Secrets to Side-Channels. CMOS is the prevailing technology in


computing devices today, because of its low static power consumption. More
specifically, the power consumed by a CMOS gate when its inputs are constant
(static) is very small compared to the power consumed when the inputs change
(dynamic). Consider for example the simplified CMOS inverter gate in Figure 1.2.
The input signal A controls two transistors (or switches) in a way that only
one is conducting at a time. If the voltage of A goes low (signal “0”), the
upper transistor closes, which allows the output voltage B to be charged by
the supply voltage, i.e. its signal becomes “1”. As long as A stays “0”, there is
no current flow, since B is already at the supply voltage. When A goes high,
the lower transistor closes and discharges B. While A stays “1”, there is no
current flow, because B is already discharged. Hence, power is only consumed
when the signals change. If we were to look into more detail at the power
consumption, we would see that there is also a difference between the power
consumed to charge B and the power consumed to discharge B [MOP07, Fig.
3.3]. In this work, we only consider the dynamic power consumption. For an
investigation of the static power consumption, we refer to the works of Moos et
al. [Moo19, MMR20].

Figure 1.2: CMOS inverter operation (simplified): input A falling (1→0) drives
output B rising (0→1), and A rising (0→1) drives B falling (1→0).
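The data dependence of dynamic power is commonly approximated with a Hamming-distance leakage model: the consumed power is proportional to the number of bit flips in a register update. A minimal sketch of this idea (the function names are ours, not from the literature):

```python
import random

def hamming_distance(a: int, b: int) -> int:
    # Number of bits that toggle when a value changes from a to b.
    return bin(a ^ b).count("1")

def simulated_power(prev_state: int, next_state: int, noise_sd: float = 0.0) -> float:
    # Dynamic power is modeled as proportional to the number of 0->1 and
    # 1->0 transitions, plus optional Gaussian measurement noise.
    return hamming_distance(prev_state, next_state) + random.gauss(0.0, noise_sd)

# Updating a 4-bit register from 1010 to 0110 toggles two bits:
assert simulated_power(0b1010, 0b0110) == 2.0
```

Such simplified models are the basis of the power-analysis attacks discussed below: the attacker predicts the transition count for each key guess and checks which prediction correlates with measurements.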

Types of Side-Channels. Because of this behaviour, the power consumption


of any device depends on the data it computes on. Information leakage through
power consumption is only one type of side-channel. A similar reasoning can
be made for the EM radiation of a device, which also depends on the current
flowing through the CMOS gates. Computation time can also provide a rich
source of information. Just as an average human would require more time to
divide 5555 by 7 than to divide 7000 by 7, computers complete their calculations
slower or faster depending on, among other things, the complexity of the
operations and the length of the numbers.
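The timing channel is easy to reproduce in code. The sketch below (our own toy example, with hypothetical function names) shows a comparison routine that returns as soon as a byte mismatches, so its running time reveals how many leading bytes of a guess are correct:

```python
def insecure_compare(secret: bytes, guess: bytes) -> bool:
    # Returns at the first mismatching byte, so the running time depends on
    # how many leading bytes of the guess are correct -- a timing side-channel.
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True

def iterations_until_reject(secret: bytes, guess: bytes) -> int:
    # Proxy for running time: number of loop iterations before rejection.
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s != g:
            return i + 1
    return len(secret)

# A guess with a correct first byte takes measurably longer to reject:
assert iterations_until_reject(b"7000", b"7999") > iterations_until_reject(b"7000", b"1999")
```

An attacker can recover the secret byte by byte by maximizing the observed response time, which is why cryptographic code is written to run in constant time.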

Adversary Model. In this work, we consider an adversary with the capability


of observing all these device characteristics, although our main focus lies on the
power side-channel. In addition, the adversary is assumed to know the ciphertext
and plaintext, unless otherwise mentioned. For simplicity, we only consider
the goal of key recovery. This model is generally relevant for cryptographic
applications on so-called “embedded devices”, which we consider to be any
device to which an attacker can have physical access. An example is a remote
keyless entry system such as KeeLoq. Using power analysis, Eisenbarth et
al. [EKM+ 08] were able to recover the secret cryptographic keys of both the
transponder (e.g. car key) and receiver (e.g. car). Cryptography for embedded
systems can be implemented either in software running on a microcontroller
or in dedicated hardware on a Field Programmable Gate Array (FPGA) or
Application-Specific Integrated Circuit (ASIC).

1.2.2 Countermeasures

Several countermeasures have been devised to protect cryptographic applications


against side-channel attacks. Consider the illustration of the dependency of side-
channels on sensitive key information in Figure 1.3. We distinguish different
countermeasures depending on which dependency they target. In practice,
it is recommended to protect implementations with at least two of these
methodologies.

Side-Channel Information
   | (link I)
Implementation intermediates/operations
   | (link II)
Encryption intermediates/operations
   | (link III)
Secrets/Key

Figure 1.3: Links between secrets and side-channels (extended from [MOP07]).

Hiding. The goal of hiding countermeasures is to decouple the power


consumption (or other side-channels) from the data that a device computes
on and the operations it performs (link I), by decreasing the signal-to-noise
ratio (SNR). At the physical level, this can for instance be achieved by noise
generation and at the algorithm level by randomizing the order of operations.
Hiding countermeasures typically increase the complexity of side-channel attacks,
but do not offer provable security [MOP07].

Re-keying. A different approach is to accept that the power consumption


depends on the computed data and operations, but avoid that those depend on
the secret key (link III). For example, instead of using the master key K itself
for encryption, one could use a temporary key K1 derived from the master key
K. In the next encryption, we use another temporary key K2 and so on. This
countermeasure works at the protocol level and is called re-keying. Some modes
of operations are based on this principle and are called leakage-resilient modes
of operation [Pie09].
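As a toy illustration of the principle (not a concrete scheme from the literature), a temporary key K_i can be derived from the master key and a public counter; SHA-256 is only a stand-in here for a proper key derivation function:

```python
import hashlib

def session_key(master_key: bytes, counter: int) -> bytes:
    # Derive temporary key K_i from the master key K and a public counter i.
    # SHA-256 serves as a placeholder for a dedicated key derivation function.
    return hashlib.sha256(master_key + counter.to_bytes(4, "big")).digest()

# Each encryption uses a fresh temporary key; K itself never enters the cipher.
k1 = session_key(b"master-secret", 1)
k2 = session_key(b"master-secret", 2)
assert k1 != k2 and len(k1) == 32
```

Because every key K_i is used only once (or a few times), an attacker cannot accumulate enough power traces under a single key to mount a statistical attack on it.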

Masking. Finally, masking is a very popular countermeasure at the algorithmic


level that breaks the correlation between the data of an algorithm and the data
computed on in its implementation (link II). In contrast with hiding, masking
comes with a large formal background and security proofs. It is also the principal
countermeasure we address in this work. Its main concept is to split variables
into multiple shares, such that all shares are required to reconstruct the variable.
This way, the power consumption depends on the randomized shares and not
directly on the intermediates of the algorithm themselves. The first masking
proposals split variables into only two shares to protect implementations against
side-channel attacks that exploit the mean power consumption. However, since
higher-order side-channel attacks can exploit higher-order moments of the
power consumption, more shares are generally required. Masking is therefore a
parametrizable countermeasure, with a parameter d that determines how many
shares should be used and against what order of side-channel attacks it protects.

1.3 Thesis Overview

We will complete this introductory chapter with our research questions and
guiding comments about the structure of the chapters.

1.3.1 Research Questions

1. How can we improve the area-randomness-latency trade-off for masked


implementations? The masking countermeasure comes with a very large
overhead in terms of implementation cost, such as the area footprint or latency.
Apart from increasing the resources required to compute an encryption, masking
also introduces a new cost factor: randomness. Different design decisions target
different aspects to optimize and the optimization of one cost metric often comes
at the expense of another. Making masked implementations more efficient is an
ongoing research topic. We continue in this direction and further optimize the
state-of-the-art implementations. In particular, we investigate if it is possible
to eliminate the cost of randomness in first-order masking. We also take a step
back and contemplate the results in the literature to discern the domains in
the area-randomness-latency trade-off that have been extensively explored and
those that require more research.

2. How can we efficiently and conclusively verify the security of masked


implementations? In practice, masked implementations do not always offer
the security we expect from them in theory. The causes for this problem
range from inaccuracies in the initial assumptions or models to human errors
or bugs in the implementations. For this reason, recent years have seen
an increasing amount of research into the verification of these schemes and
implementations. These studies take either a theoretical perspective, relying
on formal proofs, or a practical perspective, using experimental measurements.
A lot of different security notions can be found in the literature and different
methodologies typically result in different conclusions about the security of
an instance. Our work consolidates these notions into a single framework and
unifies the theoretical and practical techniques into an all-inclusive bottom-up
approach.

3. How can we transform masking countermeasures into a more general


countermeasure against both side-channel and fault attacks? Apart from
side-channel analysis, also fault analysis forms a threat to embedded systems.
Countermeasures against these two types of attacks have mostly been
researched separately, even though many embedded applications would require
both combined. Moreover, the mere superposition of two independent
countermeasures may not be sufficient to protect against an attacker that
combines physical observations with fault injections. The state-of-the-art on
combined countermeasures is quite limited, since it has only recently become a
popular research direction. At the outset of this work, not a single combined
countermeasure existed that can provide provable security against a combined
attack. Our work extends and improves on the state-of-the-art by considering
more generic adversary models.

4. How can we design new symmetric primitives such that the masking
overhead is minimized? Efficiently masking cryptographic primitives is a
challenging task. The properties that make cryptographic components good for
confusion, make them expensive for masked implementations. Many primitives
currently in use have been designed without considering the cost of masking.
The next step in optimizing cryptography for embedded systems is to take this
cost into account from the very beginning, in the design of the primitive itself. It
is however not always clear which properties result in efficient implementations
and many trade-offs must be made. Furthermore, a lot is still unknown about
the search space of S-boxes. We improve and extend the state-of-the-art on
S-box classification. We also investigate how some design decisions influence
the cost of masked implementations and how recent proposals in the literature
compare in this aspect.

1.3.2 About This Doctoral Thesis

This work is divided into four chapters, each of which deals with one of the above
research questions. We follow a publication-based model, which means that a
selection of our contributions to the field can be found as they were published
in Part II of this book. We chose this subset of publications based on their
significance in the field and our contribution to the work itself. Additionally,
we aimed to minimize the overlap of content with the chapters of Part I. The
full list of publications can be found on page 159. The chapters of Part I are
not only meant as introductions to the respective publications, but also provide
a comprehensive evaluation of the state-of-the-art, including our contributions.
In each chapter, we give an extensive background of the topic and an overview
of recent developments and we include a brief description of each publication.
This format allows us to critically evaluate both our own contributions in the
field and those of the research community as a whole. For clarity, we refer to
our own works with numeric citations (e.g. [3]) and to other references with
name-year-based citations (e.g. [Bil15]).
Chapter 2 considers the masking countermeasure in detail (i.e. research question
1). We selected our work from TCHES 2018 [7] (p. 161) as representative
publication for this topic. Next, Chapter 3 explains side-channel analysis
and most importantly investigates the verification of masked implementations
(research question 2). We included our work from TCHES 2020 [3] (p. 185) as
contribution to both research questions 1 and 2, as it constitutes an improvement
on an existing side-channel attack as well as an investigation into the randomness
requirements of masking. Our most important contribution towards research
question 2 is a work from TCHES 2019 [4], which was not included, as its
contents were incorporated into Chapter 3 itself. These first two chapters are
long compared to the last two. On the one hand, their topic represents the
bulk of the work performed during this PhD. On the other hand, there is an
overwhelming amount of existing research to consider. Chapter 4 deals with
combined attacks and countermeasures (research question 3) and is a slightly
shorter chapter. While it represents a very important contribution of our PhD,
this topic is relatively new in research, which means the existing literature
is limited. Representative publications for this chapter are our works from
CRYPTO 2018 [13] (p. 205) and from TCHES 2019 [6] (p. 229). Finally, in
Chapter 5 we look at the design of symmetric cryptographic primitives for
embedded systems (i.e. research question 4). This research direction is also less
established with very little literature available. With respect to this topic, we
included our work from ToSC 2019 [5] (p. 253).
Together, these chapters symbolize the different stages in the development of
embedded cryptography:

1. Conception of the primitive


2. Secure implementation of the primitive
3. Analysis and verification of its protection against physical attacks

In practice, these stages follow each other in a circular rather than linear way
(see Figure 1.4). For example, the successful analysis of an implementation
can lead to corrections and improvements of the masking scheme. Also, as
exemplified by this work, experience with countermeasures such as masking
leads to new insights on primitive design.

(a cycle linking Primitive Design, Masking & Combined Countermeasures and
Analysis/Attack)

Figure 1.4: The circle of (embedded system) life


Chapter 2

Masking against Side-Channel Attacks

In this chapter, we provide a detailed overview of the masking countermeasure.


In Section 2.1, we introduce the basic concepts of masking and the different
types of masking representations. Section 2.2 elaborates on the most essential
building block: the masked multiplication. In that section, we give a
chronological overview of important developments in the history of masking until
today. In Section 2.3, we describe how the masked multiplication is used for
implementations of the Advanced Encryption Standard (AES), both in software
and hardware. We discuss the trade-offs between area, speed and randomness
costs and also briefly touch upon the issue of random number generation
for masked implementations. Section 2.4 identifies our contributions in this
field. This chapter aims to provide all necessary information for the embedded
designer to implement a masked cryptographic primitive. The overview of the
state-of-the-art also allows us to critically assess the trends of the literature and
open problems for future research in Section 2.5.

Notation. Let F be some finite field. In the context of this work, the field has
characteristic two. Specific Galois fields of size 2^k are denoted F_{2^k}. A vector or
matrix over the field F is written in bold font: x ∈ F^n. The vector x_I contains
only the elements x_i for i ∈ I. Addition over the field is denoted by ⊕. We use
× for multiplication over the field, but sometimes omit it for ease of notation.


2.1 Masking: The Basics

The goal of masking is to decouple the internal state of a cryptographic device


from the sensitive variables in a cryptographic algorithm. Typically, any
intermediate variable in an encryption algorithm depends on the secret key
and is thus a sensitive variable. Since the power consumption depends on the
internal state of the device, one must ensure that the intermediate values of
the calculation are independent of the (sensitive) intermediate variables of the
algorithm. This is achieved by splitting the variables into multiple shares. The
first proposals of masking (or then called “data splitting”) came from Chari et
al. [CJRR99] and Goubin and Patarin [GP99].

Masking. Let x ∈ F be our sensitive intermediate variable and let ◦ be some
splitting operation. Then the masked representation of x is a stochastically
created n-tuple x = (x_0, ..., x_{n-1}) ∈ F^n with x_0 ◦ x_1 ◦ ... ◦ x_{n-1} = x. We
distinguish different types of masking based on the splitting operation ◦. A
masked implementation only handles the shares x_0, ..., x_{n-1} as intermediates
instead of x itself. A dth-order masking of x is a masking representation x that
offers the following security property:

For every proper subset I ⊂ {0, ..., n-1} with |I| ≤ d, x_I is independent of
the sensitive variable x.

This means that the number of shares n in a dth -order masking must always
be strictly larger than d. However, in many cases, more than the minimal
number of d + 1 shares are required to preserve this security. The security of the
masking countermeasure very much relies on the so-called independent leakage
assumption (ILA) [CJRR99, PR13]. That is, it is generally assumed that the
leakages of different intermediate values occurring in distinct calculations are
independent of each other and that the power consumption is a linear function
of the individual data variables. While this assumption was validated with
experiments by Chari et al. [CJRR99], it has recently become clear that it
does not always hold. Extra care is required on platforms where the power
consumption may not follow a linear model due to for example coupling or
microarchitectural effects. We treat this issue in more detail in the next chapter.

2.1.1 Masking Representations

Boolean Masking. The most popular type of masking is Boolean masking,


which uses the exclusive-OR (XOR) ⊕ as the splitting operation ◦. Its popularity
is easily explained by the multitude of linear operations used in symmetric-key


cryptographic primitives, such as bit-permutations, multiplications with MDS
matrices and XOR additions themselves. These functions are all linear in a
characteristic-two field F2k and are therefore trivial to mask (with Boolean
representation). Consider for example a linear function L, which is applied to
the secret x and results in another sensitive variable y: L(x) = y. When x is
split into its Boolean masked representation, we obtain

L(x_0 ⊕ x_1 ⊕ ... ⊕ x_{n-1}) = L(x_0) ⊕ L(x_1) ⊕ ... ⊕ L(x_{n-1})
                             = y_0 ⊕ y_1 ⊕ ... ⊕ y_{n-1}            (2.1)

Hence, linear functions can be applied to each share independently: y_i = L(x_i).
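As a toy illustration (our own example, not from the cited works), an 8-bit left rotation is a GF(2)-linear bit-permutation, so applying it share-wise to a Boolean masking yields a valid masking of the rotated secret:

```python
def rotl8(v: int, r: int) -> int:
    # Left rotation of an 8-bit value: a GF(2)-linear bit-permutation.
    return ((v << r) | (v >> (8 - r))) & 0xFF

x = 0b10110001
shares = [0x5C, 0x13, x ^ 0x5C ^ 0x13]      # 3-share Boolean masking of x
rotated = [rotl8(s, 3) for s in shares]     # apply L to each share locally
assert rotated[0] ^ rotated[1] ^ rotated[2] == rotl8(x, 3)
```

No randomness and no interaction between shares is needed, which is why linear layers are essentially free to mask.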


Cryptographic algorithms cannot rely on linear functions alone. Nonlinearity
is vital for achieving confusion. In masking, there are two alternative ways to
deal with nonlinearities. The first is to find a secure way to perform nonlinear
functions such as multiplications on Boolean shares. The alternative method
is to transform into a different masked representation, more suitable for the
targeted operations. The main problem to overcome is then to transform
between the two masked representations.

Arithmetic Masking. Instead of using the XOR operation as ◦, it is, for
example, possible to use modular addition: x = x_0 + x_1 + ... + x_{n-1} (mod 2^k).
This type of masking is called arithmetic masking. This can be useful
for ARX ciphers, a family of symmetric-key algorithms which are based on
three operations: modular Addition, Rotation and XOR. Also, in public-key
cryptography, arithmetic masking is common. Transformations between Boolean
and arithmetic representations were introduced by Goubin [Gou01]. They have
been extended and optimized in various recent works [CGTV15, BCZ18, HT19].
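A small sketch of arithmetic masking (our own example, with k = 8): modular addition of two masked values is a purely local, share-wise operation, just like XOR is for Boolean masking:

```python
import secrets

MOD = 1 << 8  # working modulo 2^k with k = 8

def arith_share(x: int, n: int) -> list[int]:
    # n-1 random masks plus one share fixing the modular sum to x.
    masks = [secrets.randbelow(MOD) for _ in range(n - 1)]
    return masks + [(x - sum(masks)) % MOD]

x, y = 0x3D, 0xC7
xs, ys = arith_share(x, 3), arith_share(y, 3)
zs = [(a + b) % MOD for a, b in zip(xs, ys)]   # addition is share-wise (local)
assert sum(zs) % MOD == (x + y) % MOD
```

The expensive part for ARX ciphers is the conversion between this representation and Boolean masking, which is the subject of the conversion works cited above.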

Multiplicative Masking. Alternatively, if the splitting operation ◦ is a
multiplication × in the field F, then the multiplication becomes a local operation,
which can be applied to each share separately:

x × y = (x_0 × x_1 × ... × x_{n-1}) × (y_0 × y_1 × ... × y_{n-1})
      = (x_0 × y_0) × (x_1 × y_1) × ... × (x_{n-1} × y_{n-1})        (2.2)

Also other multiplicative operations, such as inversion, can be calculated locally:
x^{-1} = (x_0^{-1}, ..., x_{n-1}^{-1}). Multiplicative masking was first applied by Akkar and
Giraud [AG01]. However, this first proposal was insecure due to an inherent flaw
of multiplicative masking that was uncovered by Golić and Tymen [GT02]. It is
known as the zero-value problem, which refers to the fact that, no matter how
many shares the representation uses, multiplicative masking cannot securely
encode the value 0. For (x_0, x_1, ..., x_{n-1}) to encode zero, we need that
x_0 × x_1 × ... × x_{n-1} = 0 and thus that at least one share x_i is equal to zero itself.
Hence, a single probe on that share x_i would reveal the secret. Moreover, the
mean of a single share (and by extension its power consumption) also depends
on the secret: ∀i : E[x_i | x = 0] ≠ E[x_i | x ≠ 0]. Recent works on multiplicative
masking avoid the zero-value problem and have extended and optimized the
original methodology [GPQ10, GPQ11, 7].
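This can be made concrete with a short sketch over F_{2^8} (helper names are ours; gf_mul reduces by the AES polynomial 0x11B): sharing a nonzero value works share-wise, while any multiplicative sharing of 0 necessarily contains a zero share, since a field has no zero divisors.

```python
import secrets

def gf_mul(a: int, b: int) -> int:
    # Carry-less multiplication in GF(2^8) modulo the AES polynomial 0x11B.
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def gf_inv(a: int) -> int:
    # a^254 = a^-1 in GF(2^8), via square-and-multiply.
    r, e = 1, 254
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def mult_share(x: int, n: int) -> list[int]:
    # n-1 random *nonzero* masks; the last share fixes the product to x.
    masks = [1 + secrets.randbelow(255) for _ in range(n - 1)]
    prod = 1
    for m in masks:
        prod = gf_mul(prod, m)
    return masks + [gf_mul(x, gf_inv(prod))]

xs = mult_share(0x53, 3)
prod = 1
for s in xs:
    prod = gf_mul(prod, s)
assert prod == 0x53 and all(s != 0 for s in xs)
# Zero-value problem: a sharing of x = 0 would force the last share to be 0,
# so a single probe on that share reveals the secret.
```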

Other Masking Representations. There are other types of masking for which,
like Boolean masking, linear operations are trivial, but their representations
cannot be defined with a single operation ◦. They split the secret x into
shares x = (x_0, ..., x_{n-1}) such that reconstruction requires a more generic
function f : F^n → F with x = f(x) = f(x_0, x_1, ..., x_{n-1}). One example is that
of polynomial masking, based on the secret sharing scheme by Shamir [Sha79],
which is used in multi-party computation (MPC). The first masking schemes to
use this type of masking were those of Prouff and Roche [PR11] and Goubin
and Martinelli [GM11]. One masks a secret variable x by first constructing a
dth-degree polynomial p_x(y), for which x is the constant coefficient:

p_x(y) = x ⊕ ⨁_{i=1}^{d} a_i y^i = x ⊕ (a_1 y) ⊕ (a_2 y^2) ⊕ ... ⊕ (a_d y^d)    (2.3)

The coefficients a_i are drawn randomly and kept secret. The shares of x are
calculated as points on the polynomial, evaluated in n different nonzero elements
α_i: x_i = p_x(α_i). The elements α_i are public. The secret x can be reconstructed
from the shares using the reconstruction function f(x):

f(x) = ⨁_{i=0}^{n-1} x_i L_i = (x_0 L_0) ⊕ (x_1 L_1) ⊕ ... ⊕ (x_{n-1} L_{n-1})    (2.4)

where the coefficients L_i can be derived from the public elements α_i using
Lagrange interpolation.
A slightly different proposal, called inner product masking, came from
Dziembowski and Faust [DF12] and was first applied by Balasch et al. [BFGV12].
A secret x is represented by shares (x_0, ..., x_{n-1}) with (x_i)_{i=1}^{n-1} random masks
and

x_0 = x ⊕ (L_1 x_1) ⊕ (L_2 x_2) ⊕ ... ⊕ (L_{n-1} x_{n-1}).    (2.5)



The vector L ∈ F^n is a public parameter of the masking scheme with
L_0 = 1 [BFG15]. The secret can be reconstructed by the inner product of x and L:

x = ⟨x, L⟩ = ⨁_{i=0}^{n-1} x_i L_i = (x_0 L_0) ⊕ (x_1 L_1) ⊕ ... ⊕ (x_{n-1} L_{n-1})    (2.6)

The scheme bears resemblance to polynomial masking, since the reconstruction


function (2.4) is also essentially an inner product.
These representations tend to be more expensive than Boolean masking for
the same protection order d, and are therefore slightly less popular. While not
considered very practical in their early days, recent works [BFG15, BFG+ 17,
DeC18] have improved their efficiency considerably. Moreover, their higher
algebraic complexity also comes with an advantage. Both for polynomial
masking [PR11] and inner product masking [BFG+ 17], it has been shown that
from an information-theoretic perspective, their shares leak significantly less
information than Boolean shares with the same security order d and noise level.
This is called Security Order Amplification [WSY+ 16].
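The sharing of eq. (2.5) and reconstruction of eq. (2.6) can be sketched over F_{2^8} (helper and parameter names are ours; gf_mul again uses the AES reduction polynomial):

```python
import secrets

def gf_mul(a: int, b: int) -> int:
    # GF(2^8) multiplication modulo the AES polynomial 0x11B.
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return p

def ip_share(x: int, L: list[int]) -> list[int]:
    # Inner product masking: shares x_1..x_{n-1} are random; x_0 is chosen so
    # that <shares, L> = x, with the public parameter L_0 = 1.
    assert L[0] == 1
    tail = [secrets.randbelow(256) for _ in L[1:]]
    x0 = x
    for li, xi in zip(L[1:], tail):
        x0 ^= gf_mul(li, xi)
    return [x0] + tail

def ip_reconstruct(shares: list[int], L: list[int]) -> int:
    acc = 0
    for xi, li in zip(shares, L):
        acc ^= gf_mul(xi, li)
    return acc

L = [1, 0x03, 0x07]                    # public, nonzero parameters L_i
x = 0xB5
assert ip_reconstruct(ip_share(x, L), L) == x
```

With L = (1, 1, ..., 1) this degenerates to plain Boolean masking; nontrivial L_i give the higher algebraic complexity behind security order amplification.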

2.2 Masking Constructions: A Roadmap

In this section, we will take a chronological walk through the history of


(mostly Boolean) masking and its most central question of how to implement a
multiplication. A lot of these descriptions originally only considered masked
bit-level multiplications (i.e. AND gates). However, when the description
does not make any assumptions about the bit length, it also serves for masked
multiplications in a characteristic-two Galois Field F2k . The importance of this
building block stems from the fact that any complex function can be decomposed
into a collection of adders and multipliers.

2.2.1 The First Private Circuits

Trichina gate. One of the first proposals for a masked multiplication in F2


came from Trichina [Tri03] and is therefore often dubbed the Trichina gate. Like
most Boolean masked multiplications today, it is founded on the distributive
property of the multiplication over the addition:

z = xy = (x_0 ⊕ x_1)(y_0 ⊕ y_1) = x_0y_0 ⊕ x_0y_1 ⊕ x_1y_0 ⊕ x_1y_1    (2.7)



Trichina proposes to introduce a new random mask r and construct the shares
of z as follows:

z_0 = r                                                  (2.8)
z_1 = ((((r ⊕ x_0y_0) ⊕ x_0y_1) ⊕ x_1y_0) ⊕ x_1y_1)      (2.9)

Note that it is important to perform the additions in eq. (2.9) from left to
right, as indicated by the parentheses and in Figure 2.1. For example, if the
cross terms x_0y_0 and x_0y_1 are combined without the random mask r, the
resulting intermediate x_0(y_0 ⊕ y_1) = x_0y depends on the unmasked secret y.

Figure 2.1: Trichina AND gate.
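The Trichina gate translates directly into code. The sketch below (bit-level, our own naming) accumulates the cross products strictly from left to right, mirroring the bracketing of eq. (2.9):

```python
def trichina_and(x0: int, x1: int, y0: int, y1: int, r: int) -> tuple[int, int]:
    # Masked AND on 2-share bit values. The mask r enters the accumulation
    # first, so no intermediate ever combines two cross products unmasked.
    z0 = r
    z1 = r
    z1 ^= x0 & y0
    z1 ^= x0 & y1
    z1 ^= x1 & y0
    z1 ^= x1 & y1
    return z0, z1

# Exhaustive correctness check over all secrets, maskings and refresh bits:
for x in (0, 1):
    for y in (0, 1):
        for x0 in (0, 1):
            for y0 in (0, 1):
                for r in (0, 1):
                    z0, z1 = trichina_and(x0, x ^ x0, y0, y ^ y0, r)
                    assert z0 ^ z1 == x & y
```

Note that this check only verifies correctness; the side-channel security additionally depends on the evaluation order, which an optimizing compiler or synthesis tool is free to break.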

Ishai-Sahai-Wagner. Ishai, Sahai and Wagner (frequently referred to as


ISW) were the first to create a theoretical framework for a side-channel adversary
with scalable abilities in their seminal work [ISW03]. In this work, they make
the connection between the masking countermeasure and MPC protocols and
they define a generic methodology for masking a multiplier at any security
order d. Another important contribution is their simulation-based proof for the
security of that multiplier in their adversary model. As in the Trichina gate,
the first step is to compute all the partial products x_i y_j. What defines the ISW
method is how and in which order these partial products are carefully added
to form the output shares z_i. We can describe the method in two steps. In the
first step, one creates a matrix of n^2 shares (z_ij)_{i,j=0}^{n-1}. In the second step, these
shares are compressed to n shares z_i by combining all the elements in the same
row:

z_i = z_ii ⊕ ⨁_{j=0, j≠i}^{n-1} z_ij    (2.10)

The matrix z_ij is computed as follows for 0 ≤ i < j ≤ n-1:

z_ij = r_ij
z_ji = (r_ij ⊕ x_i y_j) ⊕ x_j y_i    (2.11)

with r_ij fresh random masks and z_ii = x_i y_i. This multiplication is dubbed the
ISW multiplication and remains today an essential building block for masked
implementations. It requires n(n-1)/2 fresh random masks r_ij ∈ F. We note
that ISW initially deemed this multiplication with n = d + 1 shares to be secure
against attacks of order d/2 and lower, but Rivain and Prouff [RP10] proved
that it actually provides dth-order security with d + 1 shares.
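The two ISW steps translate into a short routine. The sketch below (our own naming, bit-level) folds the compression of eq. (2.10) into the loop that generates the n(n−1)/2 masks r_ij of eq. (2.11):

```python
import secrets

def isw_and(x: list[int], y: list[int]) -> list[int]:
    # ISW masked AND for n shares (bits), consuming n(n-1)/2 random bits r_ij.
    n = len(x)
    z = [x[i] & y[i] for i in range(n)]                  # diagonal terms z_ii
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(2)                     # fresh r_ij
            z[i] ^= r                                    # z_ij = r_ij
            z[j] ^= (r ^ (x[i] & y[j])) ^ (x[j] & y[i])  # z_ji; bracketing matters
    return z

def unshare(shares: list[int]) -> int:
    v = 0
    for s in shares:
        v ^= s
    return v

# Random 3-share maskings of x = 1 and y = 1; the result must unmask to x & y.
xs = [secrets.randbelow(2), secrets.randbelow(2)]
xs.append(1 ^ xs[0] ^ xs[1])
ys = [secrets.randbelow(2), secrets.randbelow(2)]
ys.append(1 ^ ys[0] ^ ys[1])
assert unshare(isw_and(xs, ys)) == 1
```

All r_ij cancel in the XOR of the output shares, which is why the result is correct for every choice of randomness.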

2.2.2 Glitch-Resistant Masking Schemes

Glitches. A turning point for masking came in 2005 when Mangard et


al. [MPO05, MPG05] demonstrated that the existing schemes did not provide
the claimed security when implemented in hardware. The vulnerability can be
traced back to the occurrence of unintended signal transitions or glitches. They
are caused by, among other things, the varying delays of logic gates and routing
imbalances (see for example Figure 2.2).

Figure 2.2: Example of a glitch in the signal xy because its inputs do not switch
exactly at the same time after the positive clock edge.

Their effect is that hardware circuits do not fit in the “ideal circuit” model
of ISW where the internal state of the device only depends on the exact
intermediates of a calculation and where the order of operations is respected.
Take for example the Trichina gate (see Figure 2.1). It was noted earlier that the
order of operations in the calculation of the output shares is vital for security.
Consider the case where the delay of the random mask r is larger than the
delay of the shares of x and y. The circuit would then temporarily compute the
value z1 = x0 y0 ⊕ x0 y1 ⊕ x1 y0 ⊕ x1 y1 = xy, which is the unmasked output and
hence sensitive. Similarly, the ISW multiplication is not suitable for hardware
implementations. Since the works of Mangard et al., software and hardware
masking have diverged into separate directions. Glitches are difficult to predict
and control. Protection against the vulnerabilities arising from them is typically
provided at the algorithmic level, embedded in masking schemes, which are
dedicated to hardware implementations.

Registers. The main problem of glitches is that the signal on a wire transitions
more than once. Before calculating the intended function value, the wire carries
the value of different unintended intermediates. One way to stop glitches from
propagating is to synchronize wires with registers. Registers (or flip-flops at bit
level) are circuit elements of which the output signal transitions exactly once
per cycle. Moreover, these transitions are more or less synchronized as they
always happen on the positive edge of a periodic clock signal (such as x and y in
Figure 2.2). We will note the stabilization of a variable x with square brackets:
[x]. For example, a naive (but expensive) method to secure the Trichina gate
for hardware implementations is to fix the order of calculations by storing each
intermediate value in a register:
z_0 = [r]
z_1 = [[[[r ⊕ x_0y_0] ⊕ x_0y_1] ⊕ x_1y_0] ⊕ x_1y_1]    (2.12)

Threshold Implementations. The first masking scheme to come with a proof


of security in the presence of glitches is that of Nikova et al. [NRR06, NRS11],
which is called threshold implementations (TI) and provides security against first-
order side-channel analysis (SCA). We define a masked function f : F^n → G^m
as an m-tuple of functions (f_0(x), f_1(x), ..., f_{m-1}(x)) with each component
function f_i : F^n → G. A first natural requirement for any masking scheme is
that it correctly implements the intended unmasked function f : F → G.

Correctness: The masked function f is correct for f if, for any x ∈ F^n such
that ⨁_{i=0}^{n-1} x_i = x ∈ F, the function f outputs a sharing y ∈ G^m such
that y_i = f_i(x) and ⨁_{i=0}^{m-1} y_i = y = f(x), i.e.
⨁_{i=0}^{m-1} f_i(x) = f(⨁_{i=0}^{n-1} x_i).

A threshold implementation is a masking scheme which, apart from correctness,
satisfies two additional properties:

Non-Completeness: The masked function f is non-complete if, for any
component function f_i(x), there exists an input share x_j such that f_i(x)
is independent of x_j, i.e. f_i(x) = f_i(x_j̄).

Uniformity: The masked function f is uniform if it maps a uniform sharing
of the input x to a uniform sharing of the output y, where a sharing
of x is uniform if all its shared representations (x_0, ..., x_{n-1}) such that
⨁_{i=0}^{n-1} x_i = x are equiprobable.

Together, these two properties ensure security in the presence of glitches.


Because of non-completeness, no glitch in the calculation of fi (x) can reveal
any information on xj . The uniformity of the sharing of x guarantees that
xj̄ is independent of x itself. The preservation of uniformity through the
masked function f ensures that threshold implementations may be cascaded.
An important note here is that non-completeness is only effective for one single
TI stage. The outputs of the TI function f described here must be synchronized
before they can be used in the next TI stage [NRS11].
When sharing a function of algebraic degree t, its threshold implementation must
use at least t + 1 shares [NRR06]. Quadratic functions such as multiplications
xy are thus implemented with n = 3 shares. As with Trichina and ISW, one
must first compute the cross shares xi yj . The essence of TI is then to find
a distribution of these cross products over the m output shares such that
correctness is satisfied and the sharing is non-complete. An example is the
following multiplier:

z0 = x0 y0 ⊕ x0 y1 ⊕ x1 y0 ⊕ r0

z1 = x1 y1 ⊕ x1 y2 ⊕ x2 y1 ⊕ r1 (2.13)

z2 = x2 y2 ⊕ x2 y0 ⊕ x0 y2 ⊕ r0 ⊕ r1

The random masks r0 , r1 are required to preserve the uniformity. It is clear in
this case that z0 , z1 and z2 are independent of input share 2, 0 and 1 respectively.
A very detailed study of threshold implementations can be found in the PhD
thesis of Bilgin [Bil15]. A remarkable feature of TI is that it can often avoid the
use of fresh random masks ri , because it uses more than the minimal number
of shares.
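The sharing of Eq. (2.13) is small enough to check exhaustively. The following sketch (ours, for illustration) verifies correctness for every input and mask combination; non-completeness can be read directly off the code, since each output share omits one input share index.

```python
from itertools import product

def ti_mult(x, y, r0, r1):
    """3-share threshold implementation of z = xy for bits, Eq. (2.13)."""
    x0, x1, x2 = x
    y0, y1, y2 = y
    z0 = (x0 & y0) ^ (x0 & y1) ^ (x1 & y0) ^ r0        # no share index 2
    z1 = (x1 & y1) ^ (x1 & y2) ^ (x2 & y1) ^ r1        # no share index 0
    z2 = (x2 & y2) ^ (x2 & y0) ^ (x0 & y2) ^ r0 ^ r1   # no share index 1
    return z0, z1, z2

# correctness: the output shares recombine to xy for every input sharing
for x0, x1, x2, y0, y1, y2, r0, r1 in product((0, 1), repeat=8):
    z0, z1, z2 = ti_mult((x0, x1, x2), (y0, y1, y2), r0, r1)
    assert z0 ^ z1 ^ z2 == (x0 ^ x1 ^ x2) & (y0 ^ y1 ^ y2)
```

Note that the masks r0, r1 cancel in the recombination, as each appears in exactly two output shares.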

Higher-Order Threshold Implementations. Threshold implementations were
extended to higher-order security by Bilgin et al. [BGN+ 14a] through the
definition of higher-order non-completeness:

Higher-Order Non-Completeness: The masked function f is dth -order non-complete
if, for any set of functions {fi (x)}i∈I with |I| ≤ d, there exists
an input share xj such that {fi (x)}i∈I is independent of xj .

Also, it was determined that at least td + 1 input shares are required to obtain
dth -order protection against SCA for a function of algebraic degree t. However,
it was noted by Reparaz [Rep15] that the higher-order TI constructions of
Bilgin et al. [BGN+ 14a] do not provide the claimed security. The vulnerability
lies in the higher-order adversary’s ability to observe d > 1 intermediates or
the dth -order statistical moment of the power consumption, which includes
combinations of different variables at different time instants. While the proof
of security based on uniformity and higher-order non-completeness remains
valid for a univariate higher-order side-channel attack, it does not extend to the
multivariate adversary. For example, the variance of the power consumption at
a single point of these threshold implementations is independent of the secret,
but the covariance of the power consumption combining two different time
instants is not. The solution to this problem is to ensure the independence of
different time instants by refreshing the shares of intermediate variables after
every computation with new random masks. This way, higher-order threshold
implementations lose the low-randomness advantage of first-order threshold
implementations.

2.2.3 Consolidation and Generalization

Consolidating Boolean Masking Schemes. Reparaz et al. [RBN+ 15] further
noted that a TI multiplication together with a refreshing layer bears a strong
resemblance to the ISW multiplication. Based on this observation, they proposed
a general construction for multiplications, in which the shares of x and y are
expanded into n2 cross products xi yj and remasked to form the matrix of shares
zij , which can be compressed back to n shares, as in an ISW multiplication:

 
( z00      z01     ...  z0,n−1   )     ( x0 y0     x0 y1   ...  x0 yn−1   )     ( 0        r01    ...  r0,n−1 )
( z10      z11     ...  z1,n−1   )  =  ( x1 y0     x1 y1   ...  x1 yn−1   )  ⊕  ( r01      0      ...  r1,n−1 )
( ...      ...     ...  ...      )     ( ...       ...     ...  ...       )     ( ...      ...    ...  ...    )
( zn−1,0   ...     ...  zn−1,n−1 )     ( xn−1 y0   ...     ...  xn−1 yn−1 )     ( r0,n−1   ...    ...  0      )
                                                                                                        (2.14)

The remasking matrix in Eq. (2.14) was proposed by Gross et al. [GMK16] and
is similar to that of Ishai et al. [ISW03]. The idea is that only cross products of
different shares (xi yj for i ≠ j) need remasking and that the same mask can be
used for cross products xi yj and xj yi (i.e. the remasking matrix is symmetric).
Next, as in the ISW multiplication, the n2 shares zij can be compressed back
into n shares zi . Before this, synchronization is required to stop glitches from
combining unrefreshed cross products:
zi = [zii ] ⊕ ⊕j≠i [zij ]    (2.15)

Reparaz et al. [RBN+ 15] further noted that with this construction, it is possible
to achieve dth -order non-completeness with only d + 1 input shares, regardless
of the algebraic degree of a function. An important caveat is that this is only
valid if the input sharings of x and y are independent. Consider for example
the cross products in the first-order case when y = x² with y = (x0², x1²):

x0 y1 = x0 x1²    (2.16)

Since both shares of x are combined in a single cross product, non-completeness
is clearly not satisfied.
The result of this consolidating work [RBN+ 15] is that the distinction between
masking schemes for software and hardware fades. In both cases, an established
methodology for masking the multiplication of two variables with dth -order
security against SCA can be described as follows:
1. Expand d + 1 input shares of x and y into (d + 1)2 cross products xi yj .
2. Refresh the cross products xi yj for i ≠ j as in Eq. (2.14).
3. Synchronize the refreshed cross products to stop glitches from propagating.
(This is implicit in software).
4. Compress the (d + 1)2 shares zij back into d + 1 shares zi as in Eq. (2.15).
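The four steps above can be modelled in software with single-bit shares. The sketch below is ours and is not any particular cited implementation; in hardware, a register stage would sit between the remasking and the compression.

```python
import secrets
from functools import reduce
from operator import xor

def share(x, n):
    """Split bit x into n uniformly random shares."""
    s = [secrets.randbits(1) for _ in range(n - 1)]
    return s + [x ^ reduce(xor, s, 0)]

def masked_mult(xs, ys):
    """dth-order multiplication of two (d+1)-share bits, following steps 1-4."""
    n = len(xs)
    # step 2's symmetric remasking matrix of Eq. (2.14): r[i][j] = r[j][i], zero diagonal
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = secrets.randbits(1)
    # steps 1+2: expand into n^2 cross products and remask the off-diagonal ones
    z = [[(xs[i] & ys[j]) ^ r[i][j] for j in range(n)] for i in range(n)]
    # step 3 (register synchronization) is implicit in software;
    # step 4: compress each row back into one share, as in Eq. (2.15)
    return [reduce(xor, z[i]) for i in range(n)]

# correctness check for d = 2 (three shares)
for _ in range(1000):
    x, y = secrets.randbits(1), secrets.randbits(1)
    zs = masked_mult(share(x, 3), share(y, 3))
    assert reduce(xor, zs) == x & y
```

Since each mask r[i][j] enters the matrix twice, all masks cancel in the recombination, which gives correctness for any number of shares.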

Consolidating All Types of Masking. Interestingly, the optimization of inner
product masking schemes also followed from the exploitation of similarities to
the ISW multiplication. The inner product masking multiplication of Balasch et
al. [BFG+ 17] can also be described using a matrix (zij ), 0 ≤ i, j ≤ n−1, constructed from
the addition of cross products and random masks. The compressing of these
shares is then analogous to that of ISW, either following Eq. (2.10) without
synchronization or following Eq. (2.15) with registers against glitches. Only the
matrix construction differs from that of Eq. (2.14). Each intermediate share zij
is computed as the sum of the cross product xi yj Lj and the (scaled) random
mask rij Li⁻¹, with L the public inner product masking parameter.

   
( z00      z01  ...  z0,n−1   )     ( x0 y0 L0     x0 y1 L1   ...  x0 yn−1 Ln−1   )
( z10      z11  ...  z1,n−1   )  =  ( x1 y0 L0     x1 y1 L1   ...  x1 yn−1 Ln−1   )
( ...      ...  ...  ...      )     ( ...          ...        ...  ...            )
( zn−1,0   ...  ...  zn−1,n−1 )     ( xn−1 y0 L0   ...        ...  xn−1 yn−1 Ln−1 )

                                    ( 0              r01 L0⁻¹   ...  r0,n−1 L0⁻¹  )
                                 ⊕  ( r01 L1⁻¹       0          ...  r1,n−1 L1⁻¹  )
                                    ( ...            ...        ...  ...          )
                                    ( r0,n−1 Ln−1⁻¹  ...        ...  0            )
                                                                             (2.17)

Note that the random matrix in Eq. (2.17) is not symmetric, although it still
holds that rij = rji . The correctness of the above is easily verified:
z = ⟨z, L⟩ = ⊕i zi Li

  = ⊕i ⊕j zij Li = ⊕i ⊕j ( xi yj Lj ⊕ rij Li⁻¹ ) Li

  = ⊕i ⊕j xi Li yj Lj ⊕ ⊕i ⊕j rij                              (2.18)

  = ( ⊕i xi Li ) × ( ⊕j yj Lj ) = ⟨x, L⟩ ⟨y, L⟩ = xy

where all sums range over 0, . . . , n−1; the mask term ⊕i ⊕j rij vanishes
since rii = 0 and rij = rji .

De Cnudde [DeC18] noted that the ISW multiplication can also be used to
optimize the glitch-resistant polynomial masking multiplication of Prouff and
Roche [PR11]. The resulting multiplication is essentially identical to that of
Eq. (2.17). As a result, Boolean masking, inner product masking and polynomial
masking are all equivalent in their methods for masked addition and masked
multiplication. The only difference in multiplication stems from vector L, which
is determined by the masking scheme. Boolean masking schemes use Li = 1, ∀i,
inner product masking uses a vector L with L0 = 1 and in the case of polynomial
masking, the vector L consists of Lagrange interpolation coefficients.
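The equivalence can be checked numerically. The sketch below (ours) implements the multiplication of Eq. (2.17) in F28 for a small, arbitrarily chosen vector L with L0 = 1, and verifies that the compressed shares recombine to xy under ⟨·, L⟩; setting Li = 1 for all i recovers the Boolean case.

```python
import secrets

def gf_mult(a, b, poly=0x11B):
    """Multiplication in F2^8 with the AES reduction polynomial."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return p

def gf_inv(a):
    """Field inverse by exhaustive search (0 maps to 0)."""
    return next(b for b in range(256) if gf_mult(a, b) == 1) if a else 0

L = [0x01, 0x05, 0x13]            # arbitrary nonzero Li with L0 = 1 (our choice)
n = len(L)
Linv = [gf_inv(Li) for Li in L]

def ip_share(x):
    """Produce shares with <x, L> = x; since L0 = 1, solve for x0."""
    xs = [0] + [secrets.randbits(8) for _ in range(n - 1)]
    xs[0] = x
    for i in range(1, n):
        xs[0] ^= gf_mult(xs[i], L[i])
    return xs

def ip_unmask(xs):
    out = 0
    for xi, Li in zip(xs, L):
        out ^= gf_mult(xi, Li)
    return out

def ip_mult(xs, ys):
    """Multiplication via Eq. (2.17): zij = xi yj Lj xor rij Li^-1."""
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = r[j][i] = secrets.randbits(8)
    z = [[gf_mult(gf_mult(xs[i], ys[j]), L[j]) ^ gf_mult(r[i][j], Linv[i])
          for j in range(n)] for i in range(n)]
    out = []                      # compress each row into one share (Eq. 2.15)
    for row in z:
        acc = 0
        for zij in row:
            acc ^= zij
        out.append(acc)
    return out

for _ in range(200):
    x, y = secrets.randbits(8), secrets.randbits(8)
    assert ip_unmask(ip_mult(ip_share(x), ip_share(y))) == gf_mult(x, y)
```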

A Generic Masking Methodology. So far, we have only looked at the masking
of a multiplication, since the majority of the literature is devoted to this block.
Relatively few works also consider the question of how to share more generic
Boolean functions of algebraic degree t [BNN+ 12]. A tth -degree monomial can
be shared following the same methodology as for the multiplication: The d + 1
input shares of the t input variables are expanded into (d + 1)t intermediate
shares, which are remasked, synchronized and then compressed back into d + 1
shares. Consider for example the first-order sharing (d = 1) of the cubic
monomial z = wxy. In the first step, one computes (d + 1)3 = 8 intermediate
shares zijk from the partial product wi xj yk and a random mask rijk .

z000 = w0 x0 y0          z100 = w1 x0 y0 ⊕ r011
z001 = w0 x0 y1 ⊕ r001   z101 = w1 x0 y1 ⊕ r010
z010 = w0 x1 y0 ⊕ r010   z110 = w1 x1 y0 ⊕ r001    (2.19)
z011 = w0 x1 y1 ⊕ r011   z111 = w1 x1 y1

After synchronization with registers, the shares are compressed as follows:

z0 = [z000 ] ⊕ [z001 ] ⊕ [z010 ] ⊕ [z011 ]
z1 = [z100 ] ⊕ [z101 ] ⊕ [z110 ] ⊕ [z111 ]    (2.20)
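The sharing of Eqs. (2.19) and (2.20) can again be verified exhaustively. The following sketch (ours) checks correctness over all share and mask values; the three masks appear once in each output half and therefore cancel in the recombination.

```python
from itertools import product

def cubic_sharing(w, x, y, r):
    """First-order sharing of z = wxy with the 8 intermediate shares of
    Eq. (2.19); w, x, y are 2-tuples of bit shares, r = (r001, r010, r011)."""
    (w0, w1), (x0, x1), (y0, y1) = w, x, y
    r001, r010, r011 = r
    z000 = w0 & x0 & y0
    z001 = (w0 & x0 & y1) ^ r001
    z010 = (w0 & x1 & y0) ^ r010
    z011 = (w0 & x1 & y1) ^ r011
    z100 = (w1 & x0 & y0) ^ r011
    z101 = (w1 & x0 & y1) ^ r010
    z110 = (w1 & x1 & y0) ^ r001
    z111 = w1 & x1 & y1
    # after register synchronization, compress as in Eq. (2.20)
    z0 = z000 ^ z001 ^ z010 ^ z011
    z1 = z100 ^ z101 ^ z110 ^ z111
    return z0, z1

for bits in product((0, 1), repeat=9):
    w0, w1, x0, x1, y0, y1, r001, r010, r011 = bits
    z0, z1 = cubic_sharing((w0, w1), (x0, x1), (y0, y1), (r001, r010, r011))
    assert z0 ^ z1 == (w0 ^ w1) & (x0 ^ x1) & (y0 ^ y1)
```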

A Boolean function of degree t is a sum of monomials of degree t or less. Sharing
such a function by expanding each monomial separately into (d + 1)t shares
is very inefficient. Some monomials are compatible and can be grouped and treated
together. Consider for example the function z = vxy ⊕ wxy, which can be
expanded into intermediate shares zijk = vi xj yk ⊕ wi xj yk ⊕ rijk . An important
advantage of grouping monomials together is that fewer registers are required
to synchronize the intermediate shares and fewer fresh random masks rijk
need to be generated. Ideally, the entire Boolean function can be grouped as
one. This approach was first applied for first-order masking with d + 1 shares
by Ueno et al. [UHA17]. It is however not generally possible to implement any
Boolean function with the minimal number of intermediate shares (d + 1)t . The
designer’s goal is to minimize the number of shares to expand to. Bozilov et
al. [BKN18] first noted and Wegener et al. [2] proved that a first-order sharing
with minimal number of intermediates shares (d + 1)t exists for any tth -degree
Boolean function with t + 1 input variables. A generic methodology for masking
tth -degree Boolean functions with any number of variables (> t + 1) and at
any security order d is introduced in the works of De Meyer and Wegener et
al. [8, 2]. They introduce an algorithm for finding the minimal s ∈ N such that
the Boolean function can be expanded into s(d + 1)t shares.

2.3 Masking the Advanced Encryption Standard

Now that we know how to mask the most basic nonlinear building block, we can
use it to mask cryptographic primitives, such as the AES. The main challenge
in masking the AES is the implementation of its only nonlinear component: the
S-box. Whether using Boolean masking, inner product masking or polynomial
masking, the linear/affine operations AddRoundKey, ShiftRows and MixColumns
can simply be applied independently to each of d + 1 shares of the AES state.
Hence, descriptions of masked AES implementations are typically descriptions
of the masked S-box.
The literature is divided into implementations for software on the one hand and
those for hardware on the other. Apart from the fact that one needs to deal with
glitches on hardware platforms, there is also a distinction between architectures
suitable for one or the other and the resources available to the designer. We
note that it is common to describe only the smallest version of AES: AES-128,
which has a 128-bit key. The difference with AES-192 and AES-256 is mostly
in the key schedule and the number of rounds. In the following, we speak only
of AES-128, unless otherwise mentioned.

2.3.1 In Software

An unmasked software implementation of AES typically stores the S-box as a
lookup table and reads the outputs S(x) from memory for each input x. The first
works on masking [CJRR99, Mes00a] proposed the use of lookup tables also for
masked S-box implementations. At first and second order, masked table lookups
can be very efficient, with the tables stored in RAM. However, the overhead
in terms of time, memory and randomness complexity grows considerably for
higher orders. In this thesis, we direct our attention only to implementations of
the S-box as an arithmetic circuit, rather than a table lookup.

ISW-based Circuits. When introducing the ISW multiplication, Ishai et
al. [ISW03] originally only considered the multiplication in F2 , i.e. AND
gates. It was noted by Rivain and Prouff [RP10] that the scheme is also valid
for finite fields F2k , which is much more convenient for masking the AES. Since
the affine transformation in the AES S-box is trivial to mask, only the field
inversion x → x−1 in F28 poses a challenge. The inversion can be built from
a so-called addition chain exponentiation, exploiting the fact that x−1 = x254
in the field F28 . The idea of such addition chains is that the power map x254
can be obtained by calculating a chain of exponentiations, in which each can be
obtained by multiplying or squaring previous elements in the chain. For example,
Table 2.1: Cost of masked building blocks [GPS14].


Function     # XOR        # multiplications    # Table lookups
l(x)         –            –                    d + 1
x × y        2d(d + 1)    (d + 1)2             –
x × l(x)     5d(d + 1)    –                    2d(d + 1) + d + 1

the addition chain used by Rivain and Prouff [RP10] computes successively the
exponentiations (x), x2 , x3 , x12 , x15 , x240 , x252 , x254 (see Figure 2.3).


Figure 2.3: Addition chain used by Rivain and Prouff [RP10] to compute x254 .
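The chain is easily reproduced in software. The sketch below (ours) implements it over F28 with the AES reduction polynomial x8 + x4 + x3 + x + 1 and checks that the result is indeed the field inverse; only four of the steps are genuine multiplications, the rest being (linear) squarings.

```python
def gf_mult(a, b, poly=0x11B):
    """Carry-less multiplication in F2^8, reduced by x^8+x^4+x^3+x+1."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return p

def square_k(a, k):
    """k successive squarings (linear in characteristic two)."""
    for _ in range(k):
        a = gf_mult(a, a)
    return a

def inv_chain(x):
    """x^254 via the chain x, x2, x3, x12, x15, x240, x252, x254."""
    x2 = gf_mult(x, x)          # squaring (linear)
    x3 = gf_mult(x2, x)         # multiplication 1
    x12 = square_k(x3, 2)       # two squarings
    x15 = gf_mult(x12, x3)      # multiplication 2
    x240 = square_k(x15, 4)     # four squarings
    x252 = gf_mult(x240, x12)   # multiplication 3
    return gf_mult(x252, x2)    # multiplication 4: x^254

# x^254 equals x^-1 in F2^8 (with 0 mapped to 0)
assert inv_chain(0) == 0
for x in range(1, 256):
    assert gf_mult(x, inv_chain(x)) == 1
```

In a masked implementation, only the four multiplications would need an ISW-style gadget; the squarings are applied share-wise.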

Note that squaring in characteristic-two fields is a linear operation, which can
be applied to each share independently: (x0 ⊕ x1 )2 = x20 ⊕ x21 . Hence, the
cost of masking the above addition chain comes down to masking just four
field multiplications. However, additional care is required when implementing
addition chains, since the inputs of the multiplication (and most importantly
their sharings) are not necessarily independent. Consider the mapping f (x) =
x × l(x) with l(x) some linear function on x. For example, the cubic power
x3 is obtained with l(x) = x2 . Either x or l(x) must be remasked with fresh
randomness before the multiplication. A security flaw in the refreshing of [RP10]
was noted by Coron et al. [CPRR13]. The latter work solves the problem by
avoiding refreshing altogether and introducing a dedicated masking methodology
for the mapping f (x). Interestingly, the masked block implementing f (x) is
more efficient than regular multiplications, since it can benefit from unmasked
table lookups (see Table 2.1) [CPRR13].
On the other hand, the randomness cost of masking f (x) is double that of a
masked multiplication. Grosso et al. [GPS14] exploited this specialized block
by searching for addition chains with intermediates of the form x^(1+2^s). The
resulting chain uses the mapping x → x5 as an atomic operation and computes
the exponentiations (x), x2 , x5 , x25 , x125 , x127 , x254 (see Figure 2.4).
It was noted by Duc et al. [DDF14] that a provably secure method of refreshing
x is to use an ISW multiplication with a sharing (1, 0, . . . , 0). This is equivalent
to adding d(d+1)/2 fresh random masks rij to share xi and xj for 0 ≤ i < j ≤ d.
As a result, the proposal by Grosso et al. [GPS14] is more efficient in computation

Figure 2.4: Addition chain used by Grosso et al. [GPS14] to compute x254 .

Table 2.2: Cost of masked AES S-box implementations.


            # x × y    # Refresh    # x × l(x)    # Random Bytes
[RP10]      4          2            0             6 d(d + 1)/2
[GPS14]     1          0            3             7 d(d + 1)/2

time but the original proposal by Rivain and Prouff [RP10] is less costly in
terms of randomness (see Table 2.2).
Belaïd et al. [BBP+ 16, BGR18] optimized the randomness complexity by noting
that the requirements of some multiplications in the addition chain can be
relaxed on the one hand and by reducing the number of refreshing gadgets on
the other. The resulting S-boxes are among the first to come with a global
security proof rather than relying on the security of the separate building blocks
(i.e. multiplications, refreshings, . . . ). This method was first introduced by
Barthe et al. [BBD+ 16] and will be further discussed in Chapter 3.

Bitslicing. A popular technique in software masked implementations is
bitslicing. Its use in cryptographic implementations dates back to Biham’s
fast DES implementation [Bih97a]. In this parallelization method, one treats a
general-purpose CPU with n-bit datapath as a SIMD computer, consisting of n
1-bit processors. For AES, it is typically used to parallelize the computation
of the S-box on 16 different bytes of the state. Instead of considering the
AES S-box in F28 and implementing it using multiplications in that field, the
S-box is implemented using bit-level operations. Given a proper ordering of
the state bytes in the CPU registers (where the same register holds the same
bit of all bytes), each bit-level operation can be applied to the whole state
using a single instruction. This technique has been very successful at increasing
the throughput of the SubBytes step or the entire encryption. Schwabe and
Stoffelen [SS16] for example used a 32-bit CPU datapath to parallelize the 16
S-box calculations of two AES encryptions in parallel. Balasch et al. [BGRV15]
first combined bitslicing with gate-level masking using Trichina’s gate. They
noted that bitslicing reduces the signal-to-noise ratio (SNR) and hence increases
the complexity of a DPA attack, but that it does not prevent it. Today, masked
bitsliced AES implementations are often based on Boyar and Peralta’s [BP12]
bit-level descriptions of the AES S-box, which have been optimized for circuit
depth.
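The idea can be illustrated with machine words. The toy sketch below is ours and not taken from any cited implementation: each bit position of a 32-bit word carries an independent instance of a masked AND, so one sequence of word-level instructions evaluates 32 masked gates at once.

```python
import secrets

W = 32                        # width of the (hypothetical) CPU datapath

def bitsliced_masked_and(x0, x1, y0, y1, r):
    """32 independent first-order masked ANDs in one pass.

    Each of the 32 bit positions of the words holds its own 1-bit
    computation; the word-wise & and ^ instructions act as 32 parallel
    1-bit processors (Trichina's gate, applied slice-wise).
    """
    z1 = r ^ (x0 & y0) ^ (x0 & y1) ^ (x1 & y0) ^ (x1 & y1)
    return r, z1

x, y = secrets.randbits(W), secrets.randbits(W)
x0, y0, r = (secrets.randbits(W) for _ in range(3))
z0, z1 = bitsliced_masked_and(x0, x ^ x0, y0, y ^ y0, r)
assert z0 ^ z1 == x & y       # all 32 masked ANDs are correct at once
```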

Non-Boolean Masking. The bulk of the literature on masking in software
uses only Boolean masking. Nevertheless, one of the first masked AES
implementations used Boolean masking for the linear transformations and
multiplicative masking for the nonlinear (i.e. the field inversion). Since obtaining
x−1 from x is then a local operation, the main challenge consists of converting
between the two types of sharings. Genelle et al. [GPQ11] proposed the first
higher-order secure conversions for software. They also observed that there is
a conceptually simple solution for the zero problem of multiplicative masking
in the case of power map S-boxes such as that of AES [GPQ10]. The solution
exploits the fact that both the zero and unit element are mapped to themselves
in the field inversion or power map. Hence, it is sufficient to replace each zero
element “0” by a “1” before the inversion and change it back afterwards (see
Figure 2.5). Indeed, the inversion x−1 can be rewritten as follows:

x−1 = (x ⊕ δ(x))−1 ⊕ δ(x) (2.21)

with δ(x) a Kronecker delta function:


δ(x) = 1 if x = 0,  0 if x ≠ 0    (2.22)


Figure 2.5: Structure of the AES inversion with Boolean and multiplicative
masking.
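The identity of Eq. (2.21) is easy to verify over the whole field. The sketch below (ours) uses a brute-force inversion with the convention that 0 maps to 0:

```python
def gf_mult(a, b, poly=0x11B):
    """Multiplication in F2^8 with the AES reduction polynomial."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return p

def gf_inv(a):
    """AES field inversion with the convention 0^-1 = 0."""
    return next(b for b in range(256) if gf_mult(a, b) == 1) if a else 0

def delta(x):
    """Kronecker delta of Eq. (2.22)."""
    return 1 if x == 0 else 0

# Eq. (2.21): patch the zero element to 1 before inverting, undo afterwards
for x in range(256):
    assert gf_inv(x) == gf_inv(x ^ delta(x)) ^ delta(x)
```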

An alternative line of works by Balasch et al. [BFGV12, BFG15, BFG+ 17] uses
inner product masking to implement AES. The methodology in their most
recent work [BFG+ 17] strongly resembles that of ISW-based schemes. They
use the same addition chain as Rivain and Prouff [RP10], with the optimization
by Coron et al. [CPRR13]. Their implementations also come with a global
security proof using the method of Barthe et al. [BBD+ 16]. Although these
implementations remain more expensive than those with Boolean masking in
terms of memory and speed, they benefit from security order amplification,
which means an increased resistance against SCA in practice. Notably, the
work of Balasch et al. [BFG+ 17] is among the few in software masking that
demonstrates its practical security with empirical measurements.

Challenges in Software Masking. Originally, it was believed that hardware
masking is more challenging than software masking because of glitches and
that the ILA can be assumed to hold in software. In fact, it has recently
become clear that it is not straightforward to implement the provably secure
Boolean masking schemes such that they truly offer the claimed side-channel
resistance in practice. The datapath of a CPU is an intricate and complex
network of components, forming a pipeline in which multiple instructions are
processed at the same time. As a result, many intermediates, which were thought
to leak independently, are combined in the datapath of the CPU. Balasch et
al. [BGG+ 14] first made this observation and suggested that to obtain dth -
order security in practice, one should implement a theoretically 2dth -order
secure masking scheme. Papagiannopoulos and Veshchikov [PV17] performed a
detailed study of an AVR microcontroller to identify specific CPU effects that
violate the ILA. De Meyer et al. [14] showed that these effects vary significantly
over different platforms and that it is almost impossible to model all the CPU
combinations. They demonstrate that even a theoretically second-order secure
AES implementation exhibits first-order leakage in practice. This is where the
security order amplification of inner product and polynomial masking entails
an advantage over Boolean masking. Indeed, Balasch et al. [BFG+ 17] were
able to empirically validate the security of their inner product masked AES
implementations. We return to this issue in the next chapter.

2.3.2 In Hardware

In contrast with software implementations, works on hardware masking typically
target specific security orders rather than providing generic higher-order
constructions with a global proof of security. Indeed, there is a trade-off between
genericity and optimal efficiency. Another trade-off exists between the area and
speed (latency or throughput) of hardware implementations. Before presenting
masked AES designs, we first list some hardware architectures.

Hardware Architectures. We distinguish four types of architectures, based on
the degree of parallelism.

Table 2.3: Area-latency trade-off for AES in hardware.


Architecture Area [GE] #clock cycles
Unrolled [MV15] 126 571 1
Round-based [SMTM01] 12 454 11
Byte-serial [MPL+ 11] 2 601 266
Bit-serial [JMPS17] 1 982 1776

A round-based AES implementation consists of all the combinational logic
required to perform exactly one round function (i.e. AddRoundKey,
SubBytes, ShiftRows and MixColumns) in each clock cycle. Sequential
logic stores the intermediate state between two rounds and feeds it back
to the input of the round function.
Alternatively, for minimal latency, one typically creates an unrolled implemen-
tation, which performs the entire cipher in combinational logic in a single
cycle.
On the other side of the area-latency spectrum, serial implementations target
minimal area cost by performing the round function on only a part of the
state (e.g. a bit or byte) in each clock cycle.
Finally, it is also possible to implement AES with maximum parallelism as
with an unrolled implementation, but with additional sequential logic to
store intermediate results. Such an implementation is called a pipeline
and achieves high throughput.
The area-latency trade-off is demonstrated in Table 2.3. We typically express
the area in terms of NAND gate equivalents (GE). This number is obtained
by dividing the total area of the implementation on an Application-Specific
Integrated Circuit (ASIC) by the area of a single NAND gate. Note that the
number of gate equivalents strongly depends on the used logic library and
hence numbers from different libraries should not be compared too rigorously.
Table 2.3 demonstrates at least the differences in order of magnitude between
the footprints of the different architectures and how they trade off with the
latency.
Masking introduces a significant area overhead, especially for the nonlinear S-
box. As a result, byte-serial designs, which require only one copy of the masked
S-box, are quite popular. The masked S-box itself often requires multiple cycles
and is typically implemented as a pipeline. On the one hand, several register
stages are required anyway because of glitches and on the other, a pipelined
implementation increases the throughput of the S-box, which must be used for
16 different bytes of the state and 4 more bytes in the key state.

Boolean Masking and The Tower Field Construction. Almost all Boolean
masked hardware implementations use a construction for the power map x254
which is commonly known as the tower-field implementation. The construction
was suggested by Rijmen [Rij00] and first implemented by Satoh et al. [SMTM01].
It exploits the fact that the inversion in F28 can be written as several operations
in the subfield F24 . Likewise, the inversion in F24 can be written in function of
operations in F22 (see Figure 2.6).

Figure 2.6: AES S-box in the tower-field construction [DRB+ 16].

The approach was optimized by Canright [Can05] and remains to date among
the smallest constructions of the unmasked AES S-box. For this reason, it has
also been used in a plethora of masked AES implementations. The results are
summarized in Table 2.4.
The first application of TI to this design is due to Moradi et al. [MPL+ 11].
It was improved on by Bilgin et al. [BGN+ 14b, BGN+ 15], though the serial
AES architecture remained the same. De Cnudde et al. [DBR+ 15] presented
the first higher-order secure AES S-box implementation using td + 1 shares.
With the advent of d + 1-masking, they significantly improved both first- and
second-order implementations in terms of area, but at the cost of more fresh
randomness [DRB+ 16]. Gross et al. [GMK17] improved the randomness cost and
also introduced a more efficient serial architecture for AES. Ueno et al. [UHA17]
once again reduced the area footprint, but with a significantly increased
randomness cost. Finally, Sugawara [Sug19] was the first to create a TI of the
tower-field S-box that requires no fresh randomness. These implementations all
use a byte-serial architecture such as that of Moradi et al. [MPL+ 11]. As a result,
their latencies are very similar. Note again that different works sometimes use
different logic libraries and thus that their area results are difficult to compare.
The addition chains used in software masking are not popular in hardware
masking. This might be explained by the need for refreshing in the chain of
Rivain and Prouff [RP10], which incurs a high latency cost, since the refreshed
shares must be synchronized in registers, before they can be used in the next
Table 2.4: Byte-serial masked AES implementations.


Area [GE] Random* Latency Library
S-box AES [bits/cycle] [clk cycl.]
First-order (d = 1)
[MPL+ 11] 4 244 11 114 48 266 UMC 0.18µm
[BGN+ 14b] 3 708 9 102 44 246 UMC 0.18µm
[BGN+ 15] 2 835 8 119 32 246 UMC 0.18µm
[DRB+ 16] 2 348 7 682 54 276 Nangate 45nm
[GMK17] 2 432 7 337 18 246 Nangate 45nm
[UHA17] 1 425 6 321 64 219 TSMC 65nm
[7]† 1 685 6 557 19 256 Nangate 45nm
[Sug19] 3 500 17 100 0 266 Nangate 45nm
Second-order (d = 2)
[DBR+ 15] 11 174 - 126 - Nangate 45nm
[DRB+ 16] 4 744 12 640 162 276 Nangate 45nm
[GMK17] 4 759 12 024 54 246 Nangate 45nm
[7]† 3 891 10 931 53 256 Nangate 45nm
* online randomness, excluding the initial masking
† not with the tower-field construction

multiplication. The solution by Coron et al. [CPRR13] has not been explored
in hardware masking either, presumably because of its latency cost as well.

Non-Boolean Masking. Also in hardware implementations, Boolean masking
is dominant in the literature, yet other representations have obtained good
results as well. De Meyer et al. [7] revisited the idea to use multiplicative
masking for hardware implementations of AES. Recall that this means the
masked inversion becomes a local operation, which can be separately applied
to each share (d + 1 times). They adapted the conversions of Genelle et
al. [GPQ11] to be glitch-resistant and corrected a flaw in their second-order
conversion. They further noted that it is possible to implement the AES S-box
with only one (unmasked) inversion in the field F28 , regardless of the security
order d. The resulting implementations improved upon the tower-field-based
AES implementations of De Cnudde et al. [DRB+ 16] in terms of area, while
keeping the randomness cost similar to that of Gross et al. [GMK17] (see
Table 2.4).
It is important to mention also the first generic glitch-resistant higher-order
masking scheme due to Prouff and Roche [PR11]. It uses polynomial instead
of Boolean masking, as it is based on MPC’s secret sharing methods. While
proposing to implement the AES S-box with an addition chain as done in [RP10],
this work does not provide actual hardware implementations. Moradi and
Mischke [MM13] implemented this S-box on an FPGA device. The ISW-like
multiplication methodology for polynomial masking of De Cnudde [DeC18], has
not yet been applied to a hardware AES implementation.

2.3.3 Pushing the Limits

Many more AES implementations can be found in the literature, other than
those discussed in § 2.3.1 and § 2.3.2. Especially for first-order security, there is
a trend to explore the absolute limits of the area-randomness-latency trade-off.
We illustrate the trade-off in Figure 2.7 using the costs of the implementations
in the literature.

Figure 2.7: The area-randomness-latency trade-off (masked logic style in red).

Area. At one extreme of the spectrum of masked hardware implementations
are those that target minimal area footprint. As a result, the overhead of the
masking countermeasure is typically pushed to the latency. An example is the
area record achieved by Wegener and Moradi [WM18b]. They were able to
implement the first-order AES S-box for an ASIC with 1 378 GE by decomposing
the power map as x254 = ((x13 )2 )49 and computing the cubic functions x13 and
x49 sequentially with a single F28 multiplier. However, a single S-box evaluation
requires as much as 36 clock cycles, which means one encryption has a latency
of more than 5000 cycles. This is in stark contrast with the S-boxes of Table 2.4,
which spend approximately 3 to 6 clock cycles on the S-box and where the
S-box can be implemented as a pipeline, resulting in encryption latencies of
approximately 250 cycles.
Another example is the work of De Meyer, Wegener and Moradi [8, 2], which
achieves the smallest masked AES footprint on Field Programmable Gate Array
(FPGA). They reduce the required number of LUTs (resp. slices) by 61% (resp.
70%) compared to the state-of-the-art but with a latency that is almost 28
times that of previous works (6 852 clock cycles).

Randomness. Another important overhead cost of masking is that of the

for every multiplication or S-box calculation and is distinguished from the
offline randomness, which is the fixed cost of the initial input masking. The
use of a continuous supply of fresh random masks implies the need for a
pseudo-random number generator (PRNG), working in parallel with the masked
implementation. For first-order security, recent works were able to cut this
cost to its minimum: zero. Wegener and Moradi [WM18a] were the first to
eliminate online randomness in a first-order AES by applying the “Changing
of the Guards”-trick of Daemen [Dae17] to their (x26 )49 decomposition of the
AES S-box. This optimization comes at the cost of both area and latency. The
cubic functions x26 and x49 are implemented with td + 1 = 4 shares, resulting in
a larger S-box footprint of 4.2kGE. Furthermore, 2 804 clock cycles are required
to evaluate one AES encryption. Sugawara [Sug19] was the first to implement
the AES S-box with the tower-field approach (implying reasonable latency,
see Table 2.4) and zero online randomness. What made his implementation
possible is, on the one hand, the use of td + 1 = 3 shares as in TI and, on the
other hand, a new uniformity trick similar to that of Daemen. As a result,
their S-box of 3.5kGE is larger than those with the minimal number of shares
(d + 1). Moreover, since they implement also the linear components of AES
with 3 shares, their entire first-order secure AES implementation has a very
large footprint of 17.1kGE.
However, Sugawara’s solution does not consume “zero” randomness (nor does any
other). It is common to ignore the offline randomness cost (required
to obtain the initial sharings of the inputs), since it pales in comparison with
the online cost. Consider for example the first-order secure AES of Gross et
al. [GMK17]. To generate the initial masking of the 128-bit plaintext and 128-bit
key, one requires 2 × 128 = 256 bits of randomness. During the calculation of
AES, each S-box evaluation consumes 18 bits of randomness. Since one AES-128
encryption (with key schedule) involves 20 S-box evaluations per round, during
10 rounds, the total amount of online randomness is 3.6k bits. Indeed, the
offline cost of 256 bits is rather negligible in comparison. However, when the
online randomness is eliminated as in Sugawara’s work [Sug19], the offline cost
becomes dominant. With 3 shares of the plaintext and key and some additional
setup cost, Sugawara’s AES requires 776 random bits.
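The accounting above is easily reproduced. The sketch below simply restates the figures quoted in the text for the design of Gross et al. [GMK17]: 18 fresh bits per S-box evaluation, 20 S-box evaluations per round, 10 rounds.

```python
# Offline vs. online randomness budget of the first-order masked AES-128
# of Gross et al. [GMK17], using the figures quoted in the text.

offline = 2 * 128            # initial masking of 128-bit plaintext and key

BITS_PER_SBOX = 18           # fresh random bits per S-box evaluation
SBOXES_PER_ROUND = 20        # 16 state bytes + 4 key-schedule bytes
ROUNDS = 10
online = BITS_PER_SBOX * SBOXES_PER_ROUND * ROUNDS

print(offline, online)       # 256 3600
```

The offline cost of 256 bits is indeed negligible next to the 3.6k bits of online randomness, until the latter is eliminated.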
The first work to consider also the cost of the initial masking is by Gross et
al. [11]. They show that a first-order secure AES can be implemented with only
2 random bits in total (including both online and offline randomness). Their
solution is not very suitable for hardware (it would exhibit a very high latency),
but achieves good efficiency in software.

Latency. In software masking, the speed of the implementation is a dominant
cost factor, apart from randomness requirements and bytes stored in ROM or
RAM. Naturally, the number of clock cycles of any particular implementation
highly depends on the used platform. A multitude of works have optimized
implementations for specific microcontrollers. The used masking schemes
are typically those discussed in Section 2.3.1. For example, Schwabe and
Stoffelen [SS16] achieve fast AES-CTR encryption on a Cortex-M4 by bitslicing
and processing two blocks in parallel. Goudarzi and Rivain [GR17] investigated
the speed of bitsliced AES encryptions up to security order 10 on an ARM7TDMI.
Wang et al. [WVGX15] optimize up to fourth-order secure AES implementations
on ARM NEON processors by using vector instructions.
In hardware masked implementations, low latency is a difficult objective, because
of the many registers that stop glitches from propagating. There is currently
no implementation of the masked S-box which completes within a single cycle.
Gross et al. [GIB18] explore the possibility of having only one register stage by
expanding (d + 1) = 2 shares to 2(d + 1)7 shares. This entails a very large area of
60kGE for the S-box only. Even if it were possible to implement a masked S-box
without registers, a round-based implementation taking only 10 clock cycles
would require at least 16 copies of such an S-box in parallel. Thinking further
than that, an unrolled implementation that finishes a masked encryption within
a single cycle seems impossible. As a result, serial implementations prevail in
the literature. When comparing the latencies in Table 2.4 with that of the
unmasked byte-serial implementation in Table 2.3, one sees that the number
of clock cycles is almost the same. Yet, the latency of these implementations
is at least 246 clock cycles, hence low-latency masked AES implementations
are rather rare. An interesting research direction which has enjoyed relatively
little popularity in the masking community recently is that of masked logic
styles. Masked Dual-rail with Precharge Logic (MDPL) was first introduced
by Popp and Mangard [PM05] and later amended by Leiserson et al. [LMW14].
Their technique does not only enable a masked S-box implementation in only
2 clock cycles, but they also choose a round-based architecture for the entire
AES, which results in an encryption latency of only 20 clock cycles. Their area
cost is just below 60kGE. This result dates back to 2014, which shows that
low-latency AES implementations have not received a lot of attention in the
recent literature.

2.3.4 Where the Randomness Comes From

So far in this chapter, we have several times assumed the availability of some
fresh random masks for each masked multiplication or S-box. As noted in the
previous subsection, this requires that some random number generator (RNG)
operates in parallel to the masked encryption, with sufficiently high throughput
to supply the required number of random bits. Since true random number
generators (TRNG) achieve only moderate throughput [YRG+ 18], we typically
instantiate a PRNG to provide a continuous stream of random bits to masked
implementations. For implementations without online randomness costs, a
TRNG could suffice. We generally care about three properties of the PRNG: (1)
its cost, (2) the quality of randomness and (3) its side-channel resistance. Up to
today, a lot of questions and uncertainty surround each of these requirements.

Cost. Cost figures for masked implementations are typically expressed as in
Table 2.2 and 2.4, i.e. the randomness cost is expressed in the number of bits
or bytes and ignored when it comes to the area or speed. The reason for this
distinction is that there is no generally accepted cost per random bit. This
makes it difficult to compare different AES designs. For example, the AES of
De Cnudde et al. [DRB+ 16] improved upon the area of Bilgin et al. [BGN+ 15]
by using d + 1 instead of td + 1 masking. As a result, the required randomness
increased. The question remains of how the area costs including a suitable
PRNG would compare. To answer this question, we first need to know what
constitutes a “suitable” PRNG.

Quality. Proofs of security for masking schemes assume that the fresh random
masks are uniformly random in F and mutually independent. PRNGs are
deterministic functions, which compute a stream of pseudo-random numbers
from a single seed. In reality, these random values hence do not have full
entropy. A good PRNG produces a stream of numbers with quality as close
as possible to true randomness. Cryptographically secure PRNGs are typically
based on cryptographic primitives such as stream ciphers or block ciphers in
a mode of operation (e.g. CTR mode) [BK15]. These PRNGs come with
particularly strong properties as they are used for cryptographic purposes such
as key generation. In the context of randomness for masked implementations,
it is not entirely clear whether this high quality is required. Popular choices of
PRNG in experiments with masked implementations are linear feedback shift
registers (LFSRs) [8] or unrolled implementations of block ciphers [DRB+ 16].
The two are wildly different in cost, cryptographic strength and quality of
randomness. The effect of their quality on the practical security of masked
implementations has not been investigated with experiments. The results of
such experiments would highly depend on the used platform and measurement
setup. Hence the question of whether an AES-based PRNG is “overkill” and
whether LFSRs are too weak remains unanswered.
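To make the lightweight end of this spectrum concrete, the sketch below implements a 32-bit Galois LFSR. The feedback taps (32, 22, 2, 1), i.e. mask 0x80200003, are one well-known maximal-length choice; the cited works may use different (and longer) LFSRs.

```python
# A 32-bit Galois LFSR as a minimal PRNG sketch. The tap mask 0x80200003
# encodes the feedback polynomial x^32 + x^22 + x^2 + x + 1, a known
# maximal-length choice (period 2^32 - 1).

def lfsr32(state: int):
    """Yield an endless stream of pseudo-random bits from a nonzero seed."""
    assert state != 0, "the all-zero state is a fixed point"
    while True:
        bit = state & 1
        state >>= 1
        if bit:
            state ^= 0x80200003  # apply the feedback taps
        yield bit

def random_bits(seed: int, n: int) -> list:
    """Draw n bits from the LFSR, e.g. to refresh masks."""
    gen = lfsr32(seed)
    return [next(gen) for _ in range(n)]
```

Such a generator costs a few dozen gate equivalents in hardware, orders of magnitude below an unrolled block cipher, which illustrates how wide the cost gap between the two popular choices really is.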

Side-Channel Resistance. Consider a masked AES implementation, with for
example an AES-based PRNG providing the required randomness. A common
question in the literature is whether or not the AES-based PRNG needs its
own masking countermeasure. This question is often ignored in works on
masking, as it constitutes a “chicken-and-egg” problem: a masked PRNG would
require its own source of randomness in return. An attack by Jaffe [Jaf07]
and later optimized by De Meyer [3] shows that the key and nonce of AES in
CTR mode (a National Institute of Standards and Technology (NIST) PRNG
recommendation [BK15]) can indeed be recovered with only 256 power traces.
However, the discussion in [3] states that such an attack can be avoided by
following some practical guidelines, such as limiting the size of the randomness
requests and frequently updating the PRNG state. One wonders whether similar
conclusions cannot be drawn for different PRNG constructions. In particular,
LFSRs, being based on linear functions, provide less interesting points of attack
for SCA than, for example, an AES with its highly nonlinear S-box. Investigating
specific PRNG constructions for masking and possible side-channel attacks
against them is an important open problem. An interesting work in this respect
is that of Picek et al. [PYR+ 16], where new constructions for reconfigurable
hardware PRNGs are developed to provide resistance against modelling attacks.
The proposed PRNGs explore the trade-off between reconfigurability, throughput
and area footprint, but have not been widely adopted in the literature. These
PRNGs naturally come at a high cost.

2.4 My Contributions in this Context

2.4.1 Multiplicative Masking for AES in Hardware

Context. Section 2.3.2 showed that hardware masked AES designs have
often relied on Boolean masking and used the tower-field construction of
Canright [Can05] to construct the masked S-box [BGN+ 14b, DRB+ 16, GMK17]
(see Table 2.4). On the other hand, Akkar and Giraud [AG01] noted that
splitting sensitive variables in a multiplicative way is more amenable for the
computation of the AES S-box, since the Galois field inversion then becomes
a local operation. Before our work, sound higher-order multiplicative masking
schemes had been implemented only in software by Genelle et al. [GPQ11].

Contribution. In our paper at CHES 2018 [7], included on page 161,
we demonstrate the first glitch-resistant implementations of AES using
multiplicative masks. The conversions between Boolean and multiplicative
masks are based on those of Genelle et al. [GPQ11], but we detect and correct
a flaw in their second-order conversion. Moreover, instead of requiring d + 1
copies of the unmasked field inversion, we demonstrate that the AES S-box can
be implemented with only one inversion block, for any protection order d. To
alleviate the zero problem, we create a glitch-resistant shared implementation
of the Kronecker function δ(x). We exploit a special property of the masked
multiplications to recycle random bits and reduce the randomness cost.
For the AES encryption, we use the byte-serialized architecture of Gross et
al. [GMK17], since it is more efficient than that of Moradi et al. [MPL+ 11]
in terms of latency. We optimize the latency overhead by precomputing the
Kronecker function of the S-box input, while it moves through the state array.
We verify the security of our scheme in two ways. On the one hand, we use
the global security proof methodology of Barthe et al. [BBD+ 16]. On the
other hand, we deploy our construction on a Spartan-6 FPGA and perform a
side-channel evaluation. No leakage is detected with up to 50 million power
traces for both our first- and second-order implementation. For the latter, this
holds both for univariate and bivariate analysis.
Our first- and second-order masked implementations improve resp. 29% and
18% over previous designs for comparable randomness and latency cost (see
Table 2.4).

2.4.2 Rotational Symmetry for FPGA-specific AES

Context. The effort in reducing the area of AES implementations has largely
been focused on ASICs, in which a tower-field construction is a popular method
to achieve small designs of the AES S-box [SMTM01, Can05]. In contrast, a
naive LUT-based implementation of the AES S-box has been the status-quo on
FPGAs. A similar discrepancy holds for masking schemes, which are commonly
optimized to achieve minimal area in ASICs [BGN+ 14b, DRB+ 16, GMK17].

Contribution. In a collaboration with Prof. Moradi from Ruhr-Universität
Bochum, published at CHES 2018 [8], we have looked into FPGA-specific AES
implementations in which we exploit the rotational symmetry of power maps
such as the AES S-box, which leads to a 50% reduction of its area footprint on
FPGA devices.
We present new AES implementations which improve on the state of the art
and explore various trade-offs between area and latency. For instance, at the
cost of a 4.5-fold increase in latency, one of our design variants requires 25%
fewer lookup tables (LUTs) than the smallest known AES on Xilinx FPGAs by
Sasdrich and Güneysu [SG16].
We further explore the protection of such implementations against first-order
SCA. On the one hand, we split the AES inversion x^254 into the cubic power
maps x^26 and x^49. On the other, we exploit the rotational symmetry of these
power maps. Recall from Section 2.2 that masking methodologies often describe
masked multiplications only. In this work, we introduced a heuristic methodology
for masking t-degree Boolean functions with the minimal number of shares
d + 1. Its application to our new construction of the AES S-box allowed us to
introduce the smallest masked AES implementation on Xilinx FPGAs, to date.
Finally, we verify the practical side-channel security of our implementation with
power traces from a Spartan-6 FPGA.

Extension. An extended version of this work has been published in the Journal
of Cryptology [2]. In this work, we formalize our masking methodology with
new theoretical concepts and proofs and also optimize our heuristic algorithms
for masking Boolean functions. We referred to this methodology in Section 2.2.3.
We prove that a first-order sharing with minimal number of intermediate shares
(d + 1)^t exists for any t-degree Boolean function with t + 1 variables. We also
detail a generic methodology for masking Boolean functions of any degree t
with any number of variables (> t + 1). The improvements of our method over
the work at CHES 2018 allow us to also optimize our AES implementation and
reduce the number of FPGA LUTs by 21%, the number of FPGA flip-flops by
25% and the number of slices by 33%.

2.4.3 Masking the AES with only Two Random Bits

Context. Many of the existing works on masking focus on the reduction of
randomness requirements since the production of fresh random bits with high
entropy is very costly in practice. Most of these works rely on the assumption
that only so-called online randomness results in additional costs. In practice,
however, the distinction between randomness costs to produce the initial masking
and the randomness to maintain security during computation (online) is not
meaningful. Sugawara [Sug19] succeeded in implementing AES without an
online randomness cost, but still requires 776 initial random bits for each
encryption. Faust et al. [FPS17] were the first to prove that masking any cipher
with only 2 random bits in total is possible.

Contribution. In a collaboration with Dr. Gross from TU Graz and Dr.
Stoffelen from Radboud University, we study the question of minimum
randomness requirements for first-order Boolean masking when taking the
costs for initial randomness into account. This work is published in the
ACM workshop on Theory of Implementation Security (TIS 2019) [11]. We
demonstrate that first-order masking can in theory always be performed by
just using two fresh random bits and without requiring online randomness. We
first show that two random bits are enough to mask linear transformations and
then discuss prerequisites under which nonlinear transformations are securely
performed likewise. Subsequently, we introduce a new masked AND gate that
fulfils these requirements and which forms the basis for our synthesis tool that
automatically transforms an unmasked implementation into a first-order secure
masked implementation. We demonstrate the feasibility of this approach by
implementing AES in software with only two bits of randomness, including the
initial masking. Our implementation is optimized for speed and shows that the
reduction of randomness need not imply a significant slowdown in software.

2.4.4 Recovering the CTR_DRBG state in 256 traces

Context. Constructions for PRNGs for masking are not often made explicit,
since a lot of uncertainty surrounds their requirements. A recurring question
is whether it makes sense to use an unmasked PRNG to protect a masked
design. One of the NIST documents prescribes how to construct a PRNG from
a block cipher [BK15]. A popular method is to use AES in CTR mode. It
was already shown by Jaffe [Jaf07] that AES-CTR can be attacked with side-
channel analysis, even without knowledge of the nonce. His attack requires 216
power measurements. The NIST CTR_DRBG specification [BK15] prescribes
a maximum size on each random number request, limiting the number of
encryptions in CTR mode with the same key to 4 096. As a result, it is not
vulnerable to Jaffe’s attack.
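The essence of this limit can be sketched as follows. The class, the stand-in block function and the simplified key update below are illustrative only; they do not follow the NIST pseudocode, and a real CTR_DRBG would of course use AES.

```python
# Illustrative sketch of a CTR-mode generator with a per-request limit,
# as imposed by CTR_DRBG. All names and the update step are stand-ins.

MAX_BLOCKS_PER_REQUEST = 4096  # 2^19 bits / 128-bit blocks

def toy_block(key: int, block: int) -> int:
    """Stand-in 128-bit 'block cipher' (NOT AES, illustration only)."""
    return ((block * 0x9E3779B97F4A7C15 + 1) ^ key) & ((1 << 128) - 1)

class CtrGenerator:
    def __init__(self, key: int):
        self.key = key
        self.counter = 0

    def generate(self, n_blocks: int) -> list:
        # CTR_DRBG rejects any request larger than the maximum ...
        if n_blocks > MAX_BLOCKS_PER_REQUEST:
            raise ValueError("request exceeds the per-request limit")
        out = [toy_block(self.key, self.counter + i) for i in range(n_blocks)]
        self.counter += n_blocks
        # ... and refreshes its key after every request, so at most
        # 4096 blocks are ever produced under any single key.
        self.key = toy_block(self.key, self.counter)
        self.counter += 1
        return out
```

The attacker thus never sees more than 4 096 encryptions under one key, which is exactly the window our attack must (and does) fit into.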

Contribution. In this work from CHES 2020 [3], included on page 185, we
adapt this attack so that it requires only 256 traces, which is well within
the NIST limits. With this work, we essentially demonstrate that the NIST
recommendation for the CTR_DRBG allows too large requests. However, we
provide several recommendations for the implementation of a CTR_DRBG
such that the attack can be avoided without requiring a countermeasure as
expensive as masking. We use this opportunity to start a discussion on how to
protect PRNGs for masked implementations against SCA.

2.5 Conclusion

In this chapter we provided an overview of important developments in masking,
starting from its conception until today. We specified the necessary tools
and background for a designer to start constructing masked implementations.
We now conclude this chapter with some observations and assess gaps in the
literature which form interesting directions for future research.

On the Diverging and Converging of Software and Hardware Masking
Schemes. In the early days of masking, no distinction was made between
schemes designed for software and those designed for hardware. Masked AND
constructions such as those from Trichina and ISW were thought suitable for
hardware, until Mangard et al. called attention to the glitch problem. At
that point, the literature on software and hardware masking started to diverge,
but would interestingly converge again approximately 10 years later, with the
consolidating work of Reparaz et al. Initially, specialized glitch-resistant masking
schemes were introduced by for example Nikova et al. and Prouff and Roche.
Many optimizations were proposed to make these schemes more practical, until
eventually, they resembled the software masking methods. The main difference
now lies in the careful addition of register stages as synchronization mechanism
and glitch boundary. This way, any software masking scheme can, in theory, be
converted to one for hardware. However, problems are being exposed for masked
software implementations, related to the recombinations of shares in processors.
This is not exactly a new problem, but has only recently been given attention
in the literature. Proposed solutions so far are implementation-specific [14] and
do not change the masking scheme. In the future, we might see the distinction
between schemes designed for software and for hardware reinstated.

One Multiplication to Rule Them All. Another consolidation has taken place
at the level of masking representations. Boolean masking, inner product masking
and polynomial masking always had in common that linear operations are local.
Initially, their multiplication procedures were quite different, but today, thanks
to the works of Balasch et al. and De Cnudde, it is clear that they are all
essentially the same. Boolean masking enjoys a large popularity because it
incurs a smaller overhead compared to the other two. However, inner product
and polynomial masking have decreased leakage from an information-theoretic
point-of-view as an important advantage. Especially now that it is clear that
the ILA does not hold in CPU datapaths, more attention should be devoted to
these types of masking.

Opposite Trends in Software and Hardware Implementations. While the
distinction between masking schemes dedicated to software or hardware has
largely disappeared, we can identify an interesting disparity in the literature
when they are applied to for example AES.
In software, AES implementations are often proposed with a generic construction
that can provide security at any order d. In recent years, these proposals come
with a global proof of security according to the method of Barthe et al. However,
practical verifications of the side-channel security with real power traces are
rather rare in this branch of the literature. This may explain why the leakage
problem of masked software implementations stayed under the radar for so long.
On the other hand, in hardware, there is currently no AES implementation
that has been proven globally secure at any order d. Specific constructions
with optimizations for a particular order d are more common. The security of
these implementations is not always proven theoretically, but is almost always
empirically verified with power measurements.
We thus see quite opposite approaches when it comes to the verification of the
security of an implementation. Ideally, we should, of course, combine both.
Theoretical proofs can induce trust in the generality of an approach, but only
practical experiments can validate the security on a realistic platform. We will
return to this issue in the next chapter.

The Area-Randomness-Latency Trade-off. When we survey the many
available masked implementations of AES in the literature, it is clear that
a lot of effort has been put into minimizing its area and randomness cost. We
identify two gaps in the literature, which should receive more focus in the future.
First, with advancing technologies, area constraints become less and less tight
and low latency grows in importance. The scarceness of low latency masked
AES implementations for hardware becomes more obvious with every work that
further reduces the area footprint. It is high time that this direction is explored
in more detail, even if the area cost grows considerably. More specifically,
rather than minimizing the area at the cost of latency, we should optimize the
area of low-latency masked implementations. This includes looking more at
round-based instead of serial AES architectures.
Secondly, the community has made great progress when it comes to optimizing
randomness requirements for first-order masking. We have achieved the
theoretical minimum of 2 bits for an entire AES encryption. Higher-order
implementations are conspicuously absent in these lines of work, resulting in a
growing gap between first-order masking and second-order masking.

Where Are My Random Bits? Finally, the topic of random number generation
for masked implementations requires a thorough examination. We need to
determine what properties must be fulfilled by the generated randomness and
devise efficient PRNG constructions. These constructions must be so that their
randomness cannot be recovered from side-channel information, but without
requiring countermeasures such as masking. Only then will we know how to
express the cost of a fresh random bit or byte and will comparisons between
the many masked implementations of AES make sense. However, in the search
for the most “efficient” PRNG, we should keep in mind that area is not as
constrained as it used to be and that the smaller the PRNG overhead, the
smaller its contribution to the noise of the side-channel measurements.
Chapter 3

Side-Channel Analysis

In this chapter, we introduce the reader to the analysis of implementations
based on side-channel information. We first consider the point-of-view of an
attacker in Section 3.1 and explain how side-channel analysis (SCA) can retrieve
secrets from cryptographic implementations on embedded devices. After this
chapter, the reader will know how to launch a successful side-channel attack,
which recovers the secret key from a set of power traces of an AES encryption.
Next, we consider the designer’s perspective in Section 3.2 with a survey of
various methods to evaluate the security of masked implementations against
SCA. We distinguish verification of provable security on the one hand and
of practical security on the other and amply discuss the need to investigate
both. In Section 3.3, we enumerate our contributions in this field and finally, in
Section 3.4 we conclude with an overview of our observations and suggestions
for future work.

Notation. HW(x) is the Hamming weight of x. If [x] is the binary
representation of x, then x = Σ_i [x]_i 2^i and HW(x) = Σ_i [x]_i. The Hamming
distance denotes the Hamming weight of the difference between two variables
x and y: HD(x, y) = HW(x ⊕ y). We denote a Gaussian distribution with
mean µ and standard deviation σ as N(µ, σ^2). E[X] is the expected value of a
random variable X. The d-th-order mixed statistical moment of a set X = {X_i}
of length |X| is denoted µ^d with

µ^d(X) = E[ ∏_{i=0}^{|X|−1} X_i^{d_i} ]   with d_i ∈ ℕ, Σ_i d_i = d        (3.1)


The mixed moment is m-variate if m coefficients d_i are nonzero. For the first-
and second-order central moments, we also use the following notations:

µ(X) = µ^1(X) = E[X]        (3.2)

Var(X) = σ(X)^2 = E[(X − µ(X))^2]        (3.3)

Cov(X, Y) = E[(X − µ(X))(Y − µ(Y))]        (3.4)
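For concreteness, the central moments (3.2)–(3.4) can be evaluated directly from these definitions; the toy data below is purely illustrative.

```python
# Direct evaluation of the first- and second-order central moments,
# following the definitions above (toy data, no external libraries).

def mean(xs):                      # Eq. (3.2): mu(X) = E[X]
    return sum(xs) / len(xs)

def var(xs):                       # Eq. (3.3): E[(X - mu(X))^2]
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cov(xs, ys):                   # Eq. (3.4): E[(X - mu(X))(Y - mu(Y))]
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(mean(xs), var(xs), cov(xs, ys))  # 2.5 1.25 2.5
```
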

For a random variable X and p(X) its probability distribution, the Shannon
entropy is H(X) = −Σ_{X*} p(X*) log_2(p(X*)). The mutual information between
X and Y is computed as MI(X; Y) = H(X) − H(X|Y). Note that the
mutual information is symmetric: MI(X; Y) = MI(Y; X). The property
MI(X; Y) = 0 is equivalent to H(X) = H(X|Y), which is also equivalent
to p(X) = p(X|Y). It means that the entropy of X does not decrease when
conditioned on Y and vice versa, i.e. X and Y are statistically independent.
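This independence property is easy to check numerically for discrete toy distributions; the sketch below computes MI via the equivalent identity MI(X; Y) = H(X) + H(Y) − H(X, Y).

```python
# Shannon entropy and mutual information for discrete joint distributions,
# following the definitions above. A distribution is a dict mapping
# outcomes to probabilities.
from math import log2

def entropy(p):
    return -sum(q * log2(q) for q in p.values() if q > 0)

def mutual_information(joint):
    """joint maps pairs (x, y) to probabilities."""
    px, py = {}, {}
    for (x, y), q in joint.items():
        px[x] = px.get(x, 0) + q
        py[y] = py.get(y, 0) + q
    # MI(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to H(X) - H(X|Y)
    return entropy(px) + entropy(py) - entropy(joint)

# Y = X (fully dependent): MI = H(X) = 1 bit.
dependent = {(0, 0): 0.5, (1, 1): 0.5}
# X, Y independent uniform bits: MI = 0.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(dependent), mutual_information(independent))  # 1.0 0.0
```
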

3.1 Side-Channel Attacks

Side-channel attacks first became known in 1996 with the seminal work
of Kocher [Koc96]. There are many different types of side-channels,
such as timing [Koc96], power consumption [KJJ99], electromagnetic (EM)
radiation [QS01] and even sound [GST14]. In this work, we only consider power
analysis, but EM analysis follows the same principles as both side-channels leak
similar information.

Simple vs. Differential Power Analysis. Kocher et al. [KJJ99] distinguish
two variants of power analysis: simple power analysis (SPA) and differential
power analysis (DPA). In SPA (which is anything but “simple”), an attacker
uses a single power trace (or a small number) and mostly exploits the fact that
the power consumption depends on the executed instructions/operations. In
contrast, a DPA attack requires a large number of power traces and mainly
exploits the dependency on the data. Consider for example the power trace of a
single AES encryption in Figure 3.1.
One can clearly identify a repeating pattern of 9 rounds and a 10th shorter
round. Within each round, it is furthermore possible to distinguish the different
round operations. The AddRoundKey and SubBytes transformations are each
recognizable by 16 peaks which perform the same operation on 16 different bytes
of the AES state. Next comes ShiftRows and then MixColumns shows a pattern
that is repeated four times, once for each state column. This example trace

Figure 3.1: Power measurements of AES (Power Measurement vs. Time
[samples]). Left: one encryption. Right: one round.

has been acquired from a low-noise platform. In practice, it can be difficult to
recognize the functions in a noisy power trace.
Apart from operation dependencies, it is also possible to target data dependencies
in SPA. The same operation may exhibit different leakage depending on the
data it is performed on. However, such differences tend to be extremely subtle
and require more than a few traces to exploit. In the remainder of this work,
we will not consider SPA.

Template Attacks. Specific power analysis attacks are typically known-plaintext
(or -ciphertext) attacks, in which the knowledge of the plaintext
and side-channel information from a device are combined to derive its secret
key. The attacker thus needs some kind of physical access to the device to
obtain the side-channel traces. Chari et al. [CRR02] introduced a category of
SCA with a more powerful adversary model, called template attacks. In the
first phase, the attacker characterizes the leakage behaviour of the sensitive
device, for example using a copy of the device over which (s)he has full control.
Access to such a device is thus a main assumption for template attacks, which
makes the adversary a lot more powerful. The attacker could create a model
of the power consumption’s dependency on the secret key or find the most
informative points in the power traces or precisely model the noise of the device.
In the second phase, the information learned in the first phase is exploited
to recover the key of the actual device in only a few traces. In recent years,
there is a trend to do template attacks with the help of machine learning. The
characterization or profiling phase of the attack then corresponds to the training
of a classifier. Initially, support vector machines were used [HGD+ 11], but later,
neural networks [MPP16] became very common. In the remainder of this work,
we will not consider template attacks.

3.1.1 Differential Power Analysis

Differential power analysis (DPA) [KJJ99] exploits data dependencies across a
large number of power traces. In contrast with SPA, it requires little knowledge
of the sensitive device and is more resilient against noise in the measurements.
Apart from the knowledge of the cryptographic algorithm, the device can be
considered a black box to the attacker.
The power of SCA lies in the ability to “divide and conquer”. A cryptographic
primitive is designed such that exhaustive search over the secret key is the best
attack strategy in the black-box model. In the case of AES-128 for example,
such a search has a prohibitive complexity of 2^128. However, the information
leaked through side-channels allows the attacker to target only small parts of
that key. It is for example possible to recover an AES key in 16 chunks of 8
bits each, which reduces the attack complexity to 16 × 2^8 = 2^12.
The main idea of DPA is to target the power consumption of an intermediate
result that depends on a small part of the secret key (e.g. 8 bits) combined with
some known data. Across the large number of traces, the key remains constant,
but the known data (i.e. the plaintext or ciphertext) varies constantly. The
key schedule is hence typically not a target for DPA, since it only operates on
the constant key and does not involve variable known data.

Strategy

We will explain the attack procedure for a single byte of the AES key below. A
full DPA attack simply repeats this process 16 times to recover the entire key.

Point of Interest. The first step is to choose a suitable intermediate in the
cryptographic algorithm that depends on a small part of the key k and some
non-constant data m. We denote the intermediate as f (k, m), where f is the
targeted operation. In AES, one typically targets a byte in the output of the
first SubBytes operation, i.e. f (k, m) = S(k ⊕ m) with m a byte of the plaintext
and k the corresponding byte of the master key. It is also possible to target the
input of the SubBytes in the last round, i.e. f (k, m) = S^{-1}(k ⊕ m) with m
a byte of the ciphertext and k a byte of the last round key. We remark that
the intermediate k ⊕ m also depends on the key and data, but the output of a
linear function is not a beneficial point of interest. To see this, note that the
difference between the intermediate k ⊕ m for two different data values m1 and
m2 does not depend on the key: (k ⊕ m1 ) ⊕ (k ⊕ m2 ) = m1 ⊕ m2 . The S-box,
on the other hand, has been specifically designed to ensure minimal dependency
between the output difference S(x) ⊕ S(y) and the input difference x ⊕ y.

Measurements. Measure the power consumption of N AES encryptions and
collect the set of traces T = {tn }. Each trace has length Q and tnq is the q th
sample in the nth trace. If the attack targets the first SubBytes, it is sufficient
for the traces to contain only the first round. This is a known-plaintext attack,
which means the plaintexts are not chosen by the attacker, but they are known.
For each encryption, a new plaintext is drawn. The targeted plaintext byte in
the nth encryption is denoted with mn . Often, some preprocessing of the traces
is required, for example to correct misalignment.

Hypotheses. Our unknown constant k has 8 bits. Hence, there are 2^8 = 256
possible candidates kj . For each candidate kj , we hypothesize that k = kj and
compute the corresponding intermediate values at the point of interest f (k, m)
for each encryption n. We thus construct a set of hypotheses H = {hnj } with
hnj = f (kj , mn ) = S(kj ⊕ mn ).

Leakage Models. Next, we try to predict how our hypotheses will appear in the
power traces, by simulating their power consumption using some leakage model
L(x). The power consumption of an intermediate x is assumed to be proportional
to L(x). There are many known leakage models to choose from. For example,
the Hamming weight model assumes that the leaked power consumption of an
intermediate depends on its Hamming weight: L(x) = HW (x). In the Hamming
distance model, the power consumption is proportional to the Hamming weight
of the difference between two intermediates: L(x, y) = HD(x, y). The closer
the leakage model resembles the true power consumption behaviour, the more
effective the DPA attack. There are also binary models, i.e. with L(x) ∈ {0, 1}.
In the original work on DPA, Kocher et al. [KJJ99] proposed a bit model, where
L(x) corresponds to a single bit of x. Popular choices are the least significant
bit (LSB) or most significant bit (MSB), but also others are possible. Finally,
in the zero value model, L(x) = 1 if x = 0, else L(x) = 0.
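As a small illustration, these leakage models can be written as simple helper functions (a minimal sketch; the function names are our own choices):

```python
def hamming_weight(x):
    """Hamming weight model: L(x) = HW(x), the number of set bits."""
    return bin(x).count("1")

def hamming_distance(x, y):
    """Hamming distance model: L(x, y) = HD(x, y) = HW(x XOR y)."""
    return hamming_weight(x ^ y)

def bit_model(x, bit):
    """Bit model of Kocher et al.: L(x) is a single bit of x
    (bit=0 gives the LSB, bit=7 the MSB of a byte)."""
    return (x >> bit) & 1

def zero_value(x):
    """Zero value model: L(x) = 1 if x == 0, else 0."""
    return 1 if x == 0 else 0
```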

Partitioning. The hypothesized leakages based on the leakage model can be
used as selection function to partition the N collected traces into subsets. For
each key guess kj , we obtain a different partition with Sb = {tn s.t. L(hnj ) = b}.
For binary leakage models, there are two subsets S0 and S1 .

Distinguisher. Next, we measure the difference between the two sets S0 and
S1 . For the wrong key guess, the partition is approximately random and
uncorrelated with the power consumption and hence, the difference should be
minimal. For the correct key guess on the other hand, the partition creates two
sets such that at some time samples q ∗ , the power consumption tnq∗ corresponds
to a leakage bit of 0 if tn ∈ S0 and of 1 if tn ∈ S1 . Kocher et al. [KJJ99] compute the
difference of means of the two sets:

\Delta_{jq} = \mu(S_1) - \mu(S_0)
            = \frac{\sum_{n=1}^{N} L(h_{nj})\, t_{nq}}{\sum_{n=1}^{N} L(h_{nj})} - \frac{\sum_{n=1}^{N} (1 - L(h_{nj}))\, t_{nq}}{\sum_{n=1}^{N} (1 - L(h_{nj}))}    (3.5)

For the wrong key guess kj , we expect limN →∞ ∆jq = 0 for all trace samples q.
For the correct key guess kj ∗ , the time samples q ∗ where the power consumption
depends on our hypothesized intermediate should be correlated to our partition
function and hence ∆j ∗ q∗ differs significantly from zero (see Figure 3.2).

Figure 3.2: DPA results of AES with MSB model. Left: Difference of means
for all keys at all time samples with 1 000 measurements. Right: Maximum
difference of means for each key as a function of the number of measurements.
The correct key is indicated in black.
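The full procedure (point of interest, hypotheses, binary partitioning and the difference-of-means distinguisher of Eq. 3.5) can be sketched in a few lines of code. This is an illustrative simplification: the function name and the interface are our own, and any targeted operation f and binary leakage model can be plugged in.

```python
import numpy as np

def dpa_difference_of_means(traces, data, f, leak_bit):
    """Difference-of-means DPA for one key byte.
    traces: (N, Q) array of power traces; data: N known plaintext bytes;
    f(k, m): targeted intermediate; leak_bit: binary leakage model L."""
    N, Q = traces.shape
    deltas = np.zeros((256, Q))
    for kj in range(256):                       # hypothesis for each key guess
        labels = np.array([leak_bit(f(kj, m)) for m in data])
        if labels.min() == labels.max():
            continue                            # degenerate partition, skip
        # partition traces into S0/S1 and compute the difference of means
        deltas[kj] = traces[labels == 1].mean(0) - traces[labels == 0].mean(0)
    return deltas
```

The key guess whose row of deltas shows the largest absolute peak over all time samples is then the DPA candidate for the key byte.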

Variations on a Theme

The above description corresponds to the original DPA of Kocher et al. [KJJ99].
Many variants have been proposed later in the literature. They all have in
common that the traces are partitioned or categorized into subsets based on
hypothesized leakages. From that point, any statistical method or distinguisher
may be used to verify the meaningfulness of that partition. We discuss a few of
them below. While each type of attack can be designated with its own name,
they may all be considered DPA attacks.

Correlation Power Analysis (CPA). One of the most popular versions of DPA
was introduced by Brier et al. [BCO04], with the proposal to use Pearson’s
correlation as statistical method. The Pearson correlation is a measure of the
linear relation between two variables. It thus assumes a linear relationship
between the intermediate leakage L(x) and the power consumption ∼ αL(x) + β.
Let Tq = {tnq} for n = 1, . . . , N be the power consumption at sample q and
Lj = {L(hnj)} the corresponding leakage hypotheses for key guess kj . A CPA
attack computes for each key guess kj and trace sample q the linear correlation
ρjq between the two:

\rho_{jq} = \frac{\mathrm{Cov}(T_q, L_j)}{\sigma(T_q)\,\sigma(L_j)}
          = \frac{\frac{1}{N}\sum_{n=1}^{N} (t_{nq} - \mu(T_q))(L(h_{nj}) - \mu(L_j))}{\sqrt{\frac{1}{N}\sum_{n=1}^{N} (t_{nq} - \mu(T_q))^2 \cdot \frac{1}{N}\sum_{n=1}^{N} (L(h_{nj}) - \mu(L_j))^2}}    (3.6)

It is assumed to reach its maximum for the correct guess kj ∗ at the time samples
q ∗ where the power consumption depends on the hypothesized intermediates
(see Figure 3.3).

Figure 3.3: CPA results of AES with HW model. Left: Pearson correlation
for all keys at all time samples with 1 000 measurements. Right: Maximum
Pearson correlation for each key as a function of the number of measurements.
The correct key is indicated in black.

Originally, Brier et al. [BCO04] suggested to use a Hamming distance model,
but the attack also works with Hamming weight or binary models.
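A CPA attack following Eq. 3.6 can be sketched as follows (a minimal, vectorized illustration; the function name and interface are our own):

```python
import numpy as np

def cpa(traces, data, f, leak):
    """Pearson correlation between hypothesized leakages and every trace
    sample, for each of the 256 key guesses (Eq. 3.6)."""
    N, Q = traces.shape
    T = traces - traces.mean(axis=0)           # center each sample column
    rho = np.zeros((256, Q))
    for kj in range(256):
        L = np.array([leak(f(kj, m)) for m in data], dtype=float)
        L -= L.mean()                          # center the hypotheses
        denom = np.sqrt((T ** 2).sum(axis=0) * (L ** 2).sum())
        rho[kj] = (L[:, None] * T).sum(axis=0) / denom
    return rho
```

The correct key guess is expected to produce the highest absolute correlation at the leaking time samples.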

Mutual Information Analysis (MIA). Gierlichs et al. [GBTP08] proposed a
distinguisher based on mutual information. Given the power measurements
Tq at sample q and the corresponding hypotheses Lj for key guess kj , they
compute the mutual information between the two as
M I(Tq ; Lj ) = H(Tq ) − H(Tq |Lj )
where the entropies are estimated based on the experimental probability
distributions. This attack methodology is less efficient but more generic than
CPA, since it does not assume a linear relationship between the intermediates
and the leakage. Moreover, MIA considers the entire probability distributions,
rather than only a single statistical moment.
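The mutual information can be estimated from the empirical distributions via histograms. The following sketch (our own minimal illustration, not the exact estimator of [GBTP08]) computes it from the joint distribution of two sample vectors:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based estimate of MI(X; Y) in bits, computed from the
    joint empirical distribution of the observations x and y."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                     # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

In a MIA attack, x would hold the measurements Tq at one sample and y the leakage hypotheses Lj; the key guess maximizing the estimate is selected.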

Differential Deep Learning Analysis (DDLA). The use of deep learning in
side-channel attacks was for a long time limited to template attacks. Recently,
Timon [Tim19] proposed for the first time to use a neural network in a non-
profiled setting. In a DPA attack, each key guess leads to a different partition or
classification of the power traces. The idea is to train a neural network, which
receives the power traces tn as input and which outputs the corresponding
label L(hnj ). When training the network for the correct key hypothesis, its
accuracy is expected to increase a lot faster than when training for a wrong key
hypothesis (see Figure 3.4).

Figure 3.4: DDLA results of AES with MSB model. Training accuracy for all
keys as a function of the epoch with 1 000 measurements. The correct key is
indicated in black.

DDLA is typically not more efficient than CPA, since it must perform the
training of a neural network 256 times and its performance depends significantly
on the parameters of the network. For example, it is possible that the network
overfits and reaches high accuracy for any key guess. However, convolutional
neural networks have been shown to perform well when the power traces are
misaligned or exhibit jitter. Also in this attack, any leakage model may be used,
but binary models are preferable, as they require a less complex neural network.
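To illustrate the idea, the following sketch replaces the deep network with a minimal single-layer model (logistic regression trained by gradient descent). This is our own drastic simplification of the approach of [Tim19]; the function name and all parameters are illustrative.

```python
import numpy as np

def ddla_accuracy(traces, data, f, leak_bit, kj, epochs=50, lr=0.5):
    """Train a single-layer classifier to predict the binary label
    L(f(kj, m)) from each trace; return the final training accuracy.
    In DDLA, this is repeated for all key guesses kj and the guess
    whose training succeeds best is selected."""
    X = (traces - traces.mean(0)) / (traces.std(0) + 1e-9)  # standardize
    y = np.array([leak_bit(f(kj, m)) for m in data], dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid prediction
        w -= lr * X.T @ (p - y) / len(y)         # gradient descent step
        b -= lr * (p - y).mean()
    return float(((p > 0.5) == (y > 0.5)).mean())
```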

3.1.2 Higher-Order Attacks

In response to the work of Kocher et al. [KJJ99], masking was introduced
and has grown into one of the most popular countermeasures. However, also
masked implementations are vulnerable to SCA. Recall that a dth -order masked
implementation splits the secrets into at least d + 1 shares with the goal of
protecting against attacks that target d intermediate values or the dth -order
statistical moment of the power consumption. First-order masking hence
protects against attacks such as those described in Section 3.1.1, but an attack
that would use the combination of two intermediates (i.e. a second-order SCA)
would defeat the countermeasure. Generally, a dth -order masking scheme is
always vulnerable to (d + 1)th -order DPA.

Preprocessing. Higher-order DPA attacks are conceptually not very different
from first-order attacks. Rather than exploiting the leakage from a single point
tnq in the trace, we now mix multiple points with some combination function
C(tnq1 , tnq2 , . . .). This can be achieved by preprocessing the power traces and
effectively creating new traces in which every single point is a combination
of d points in the original traces. The other steps of DPA remain the same.
Preprocessing of power traces is not always necessary. In a DDLA attack for
example, the neural network itself is able to combine samples in the original
traces. With MIA, it is possible to directly use the joint distribution of two
intermediates, instead of combining them with some combination function.

Combination Functions. Chari et al. [CJRR99] proposed to combine multiple
points in a product function: C(tnq1 , tnq2 , . . .) = tnq1 × tnq2 × . . ..
Messerges [Mes00b] proposed a second-order combination function based on
the absolute value of the difference: C(tnq1 , tnq2 ) = |tnq1 − tnq2 |. Prouff et
al. [PRB09] suggested a centered product: C(tnq1 , tnq2 ) = (tnq1 − µ(Tq1 ))(tnq2 −
µ(Tq2 )). Note that, when combined samples of this type are averaged as in a
DPA attack, one essentially computes the second-order central moment of the
traces.
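These combination functions are straightforward to implement; a sketch (the function names are ours):

```python
import numpy as np

def product_comb(traces, q1, q2):
    """Product combination of Chari et al.: C = t_q1 * t_q2."""
    return traces[:, q1] * traces[:, q2]

def abs_diff_comb(traces, q1, q2):
    """Absolute-difference combination of Messerges: C = |t_q1 - t_q2|."""
    return np.abs(traces[:, q1] - traces[:, q2])

def centered_product_comb(traces, q1, q2):
    """Centered product of Prouff et al.; averaging it in a DPA attack
    amounts to estimating a second-order central moment."""
    return (traces[:, q1] - traces[:, q1].mean()) * \
           (traces[:, q2] - traces[:, q2].mean())
```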

Points of Interest. In practice, higher-order SCA is a lot more difficult than
first-order attacks for two reasons. Firstly, noise plays an important role, which
we discuss in the next paragraph. Secondly, to preprocess the power traces
with some combination function, the attacker must know which trace samples
to combine. In masked hardware implementations, different shares tend to be
computed on in parallel. This means that the univariate dth -order moment
of the traces (e.g. the variance) is likely to reveal the secret and thus that
preprocessing only needs to combine trace samples with themselves. In other
words, the q th sample in the nth preprocessed trace may be computed using
C(tnq , tnq , . . .) for tnq the q th sample in the nth original trace. In masked
software implementations, on the other hand, different shares are computed on
serially and a multivariate attack is required. The preprocessed traces must
combine different samples from the original traces and it is very difficult for
the attacker to know which combinations will reveal the secret. The number of
possible combinations in a dth -order attack grows combinatorially as the
binomial coefficient \binom{Q}{d}, with Q
the length of the traces. Hence, the complexity of exhaustive search is typically
prohibitive. Reparaz et al. [RGV12] propose a more efficient methodology
and apply it to perform second-order DPA with significant speedup. Still, the
selection of points of interest makes higher-order attacks (d > 2) very challenging
in practice.

Noise. Another problem for higher-order SCA in practice is the noise in the
power measurements. Chari et al. [CJRR99] assume that the power consumption
for an intermediate variable x is proportional to L(x) + e with e ∼ N(0, σ^2),
where σ^2 is the variance of the noise. In this model, they prove that the number
of power measurements required to distinguish a secret grows asymptotically as
σ^d with d the masking order. Hence, with realistic noisy power measurements,
the complexity of a dth -order attack grows exponentially with d.

On Masking vs. Attacks. The goal of masking is to sufficiently increase
the difficulty of successful key recovery with a dth -order attack, such that a
(d + 1)th -order attack becomes the “best” strategy. Such higher-order attacks
become combinatorially more complex in terms of preprocessing and require
exponentially more traces. Moreover, masking countermeasures can be combined
with hiding countermeasures that increase the noise of the traces. In fact, it is
advisable to do this [SVO+ 10].
So, while dth -order masking in theory only provides security against dth -order
SCA, in practice also a (d + 1)th -order attack may be infeasible with a restricted
number of measurements. On the other hand, even a dth -order attack could
succeed given an infinite amount of traces. There is a gap between the theoretical
properties of masking and the security it exhibits in practice. In the next
section, we survey various methodologies to evaluate the security of masked
implementations, both in theory and in practice.

3.2 Verifying Masked Implementations

An important component in the design of masking schemes is the verification
that they provide security against the attacks described above. The history
of masking shows a myriad of examples of schemes that did not offer their
claimed security. For example, the higher-order masked table look-ups by
Schramm and Paar [SP06] were shown to exhibit a third-order flaw by Coron et
al. [CPR07]. The refreshing methodology of Rivain and Prouff [RP10] was
corrected by Coron et al. [CPRR13]. Reparaz [Rep15] noted that the extension
of threshold implementations to higher-order security by Bilgin et al. [BGN+ 14a]
was insecure. More fundamentally, the fact that early masking schemes such
as those from Trichina [Tri03] and ISW [ISW03] were unsuitable for hardware,
was brought to light by Mangard et al. [MPO05, MPG05].
This history of trial and error has engendered a new branch of research on the
verification of masking schemes, with many proposals at different abstraction
levels and with varying scopes. One approach is to define an abstract model of
the adversary’s abilities in a so-called adversary model and verify a scheme’s
claims in that model. If a model is well-defined, a verification procedure can be
derived in a straightforward manner. A trend in the recent literature is to automate the
verification using carefully crafted tools and eliminate in this way the risks of
human error. There is typically a trade-off between the efficiency and scope
of such tools. Some tools can create exact proofs for small masked gadgets,
whereas others aim to detect flaws in larger designs. However, a theoretical
model is seldom an exact representation of reality. Hence, it is also important
to verify the practical security of implementations on actual platforms. We
explain the different methodologies of verification in the following subsections.

3.2.1 Leakage Assessment

It is infeasible to verify whether an implementation withstands every known
side-channel attack. Instead, it is common to perform a test vector leakage assessment
(TVLA) as described by Goodwill et al. [GJJR11]. The goal of leakage
assessment is to detect any presence of sensitive information leakage. The
contrast with performing a specific attack is that no keys are extracted. Rather,
a statistical hypothesis test is used to detect whether different intermediates
can be distinguished in the side-channel traces. The traces are partitioned into
subsets based on algorithmic information.

Statistical Hypothesis tests

Hypothesis testing is the practice of using the statistical properties of a relatively
small sample to test the validity of some (null) hypothesis. It is often used
to assess the statistical difference between two subsets (S0 and S1 ). The null
hypothesis H0 is then the assumption that the two subsets are (statistically)
the same. It is important to remember that the null hypothesis can be rejected
but never accepted. The fact that it is not rejected, gives no guarantees that it
will not be rejected with a larger sample size. In the side-channel community,
the two most commonly used hypothesis tests are Welch’s t-test [GJJR11] and
Pearson’s χ2 -test [MRSS18].

Welch’s t-test. Welch’s t-test is considered a moment-based test as it verifies
whether two populations have the same first-order moment (or mean), i.e. its
null hypothesis H0 states that “the sets S0 and S1 are drawn from populations
with the same mean.” The hypothesis can be verified by calculating the t-
statistic:
t = \frac{\mu(S_0) - \mu(S_1)}{\sqrt{\frac{\sigma^2(S_0)}{|S_0|} + \frac{\sigma^2(S_1)}{|S_1|}}}    (3.7)

Large absolute values of this t-statistic indicate that the null hypothesis can
be rejected with a high degree of confidence. It is common to reject the null
hypothesis when the t-statistic exceeds some critical value. This critical value
depends on the required confidence level.
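Computing the t-statistic of Eq. 3.7 is straightforward. A sketch (we use the unbiased sample variance, which is the common choice in practice):

```python
import numpy as np

def welch_t(s0, s1):
    """Welch's t-statistic between two sets of observations (Eq. 3.7)."""
    s0, s1 = np.asarray(s0, float), np.asarray(s1, float)
    num = s0.mean() - s1.mean()
    den = np.sqrt(s0.var(ddof=1) / len(s0) + s1.var(ddof=1) / len(s1))
    return float(num / den)
```

In TVLA, this statistic is evaluated independently at every trace sample, and an absolute value exceeding the critical value at some sample is taken as evidence that the two sets are distinguishable.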

Pearson’s χ2 -test. In contrast to Welch’s t-test, Pearson’s χ2 -test is not
moment-based. Moreover, the number of subsets Si is not limited to two. It
tests the null hypothesis H0 that “the observations in the sets S0 , . . . , Sr−1 are
independent of each other.” The null hypothesis is verified by calculating the
χ2 -statistic and the degrees of freedom ν. This calculation requires that the
frequency distributions of the observations are collected in a contingency table
(F_{i,j})_{i=0,j=0}^{r-1,c-1}, where Fi,j holds the number of observations of value j in subset
Si . Let N be the total number of observations:
N = \sum_{i=0}^{r-1} |S_i| = \sum_{i=0}^{r-1} \sum_{j=0}^{c-1} F_{i,j}

The statistics can then be calculated as follows:

\chi^2 = \sum_{i=0}^{r-1} \sum_{j=0}^{c-1} \frac{(F_{i,j} - E_{i,j})^2}{E_{i,j}}    (3.8)

\nu = (r - 1) \cdot (c - 1)    (3.9)

with Ei,j the expected frequency of value j in subset Si :

E_{i,j} = \frac{1}{N} \left( \sum_{k=0}^{c-1} F_{i,k} \right) \cdot \left( \sum_{k=0}^{r-1} F_{k,j} \right)    (3.10)

The null hypothesis is rejected when the χ2 -statistic exceeds a critical value.
The critical value depends on the required confidence level and the degrees of
freedom ν.
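From a contingency table of observed frequencies, the statistics of Eqs. 3.8–3.10 can be computed as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def chi2_statistic(F):
    """Pearson chi2-statistic and degrees of freedom from an r x c
    contingency table F of observed frequencies (Eqs. 3.8-3.10)."""
    F = np.asarray(F, float)
    N = F.sum()
    # expected frequencies: (row sum) * (column sum) / N
    E = F.sum(axis=1, keepdims=True) @ F.sum(axis=0, keepdims=True) / N
    chi2 = float(((F - E) ** 2 / E).sum())
    dof = (F.shape[0] - 1) * (F.shape[1] - 1)
    return chi2, dof
```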

TVLA

Specific vs. Non-specific. In the context of leakage detection, the different
subsets S0 and S1 correspond to subsets of traces with different sensitive
intermediates. Goodwill et al. [GJJR11] introduced different types of TVLA
depending on the method with which the traces are partitioned. In a specific
test, the traces are divided into subsets based on the value of a specific point of
interest, e.g. the output of an S-box calculation. This is similar to the procedure
used for DPA, except in this case, the sensitive algorithm information is known
and not guessed. However, the results depend strongly on the assumptions
made and the chosen point of interest. When one targets specific intermediates
and does not detect leakage, it is not possible to make a generic conclusion
about the security of the implementation. For this reason, the non-specific
test has gained considerable popularity. By carefully selecting the test inputs,
this test aims to detect any distinction in the measurements coming from any
sensitive intermediate variable in the calculation. The most common approach
is to fix the secret key for all the acquisitions and vary only the plaintexts.

Fixed and Random Plaintexts. The plaintexts of the two sets S0 and S1 can
be chosen in four different ways:
Fix vs. Fix: All measurements in set S0 are acquired with the same plaintext
p0 and all measurements in set S1 are acquired with a different fixed
plaintext p1 .
Fix vs. Random: All measurements in set S0 are acquired with the same
plaintext p0 and for each measurement in set S1 , a new random plaintext
is drawn and used.
Semi-fix vs. Random: The measurements in set S0 are acquired with a
varying set of plaintexts that are determined such that some intermediate
state in the algorithm is fixed. Each measurement in set S1 uses a new
random plaintext.
Random vs. Random: All measurements (in either set) use a fresh randomly
drawn plaintext.
In the first two cases, the rejection of the null hypothesis indicates that different
intermediate variables are distinguishable. In the case of “Fix vs. Fix”, the
presence or absence of leakage strongly depends on the choice of the plaintexts
p0 and p1 . The “Fix vs. Random” test is more powerful in that sense, as it
verifies the distinguishability of p0 from any other plaintext. On the other
hand, if leakage can be detected for some particular pair p0 and p1 , it is more
pronounced in a “Fix vs. Fix” test than in a “Fix vs. Random” test and more
measurements are required to detect it in the latter case [MRSS18]. There is
thus a trade-off between the genericity and the power of the test. Recall also
that a χ2 -test is not limited to two subsets and can thus be used with more than
two fixed plaintexts. The “Semi-fix vs. Random” test is essentially equivalent
to a “Fix vs. Random” test, performed at an internal round. It is used to avoid
false-positives caused by non-cryptographic operations on the inputs/outputs at
the beginning or end of the algorithm. Finally, a “Random vs. Random” test
cannot be used for leakage assessment. It is still an important tool for verifying
the validity of the experiments. Since it is known with certainty that leakage
should not be detectable in this case, the presence of leakage indicates that the
experiment is void and must be reconsidered.

Higher-Order TVLA. Exactly as in higher-order DPA, the power traces can
be preprocessed for a t-test, to compare the higher-order moments of the subsets.
An efficient methodology for computing higher-order moments on-the-fly was
presented by Schneider and Moradi [SM15]. It was later optimized by Reparaz et
al. [RGV17], by storing the full histograms of the observations. This approach
also enables leakage assessment by χ2 -test [MRSS18].
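For a univariate higher-order test, the preprocessing amounts to centering the traces and raising them to the d-th power, so that a first-order t-test on the result compares d-th-order central moments. A sketch (this naive two-pass version is what the on-the-fly formulas of [SM15] avoid):

```python
import numpy as np

def preprocess_dth_order(traces, d):
    """Center each sample column and raise it to the d-th power; a
    first-order t-test on the output compares d-th-order central moments."""
    return (traces - traces.mean(axis=0)) ** d
```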

On TVLA vs. Attacks. TVLA is assumed to detect side-channel vulnerabilities
without the need to perform every existing attack. Note however, that the
results of TVLA cannot be used to draw conclusions about the success of just
any type of side-channel attack. For example, when the null hypothesis in the
t-test is not rejected with a particular number of traces, it only implies that
a DPA attack with the same number of traces on the same device would fail.
However, no conclusion can be made on the success of other attacks that first
transform the traces using preprocessing. The required preprocessing can be
alignment of the traces or the combination of samples for higher-order attacks
or some other methodology corresponding to a specialized attack strategy (e.g.
horizontal attacks). Hence, the t-test can only be used to judge the exploitability
of the traces in the exact shape that they are used. For a specific type of SCA,
TVLA still holds the advantage of being more general and more easily applicable.
For example, compared to a DPA attack, TVLA does not require the knowledge
of the implemented cipher, a choice of point of interest nor the estimation of a
leakage model.

Critical Values. The critical cut-off value C of a statistic for deciding whether
to reject the null hypothesis, depends on the degrees of freedom ν and the
required confidence level (see for example [NIS03]). However, in the case of
the t-test, the degrees of freedom parameter is often ignored. Based on the
work of Goodwill et al. [GJJR11], it is common to choose C = 4.5, which
implies a confidence of > 99.999% if one uses enough measurements. For a
smaller number of traces, one should take the degrees of freedom into account.
Figure 3.5 illustrates some t-test results. For an unmasked implementation, the
t-statistic surpasses the critical threshold C at numerous time samples with
only 12 000 traces, indicating that the null hypothesis should be rejected and
thus that sensitive information leaks from the power measurements. For a
masked implementation on the other hand, the t-statistic remains smaller than
C in absolute value for all sample points, with up to 50 million power traces.
The null hypothesis can therefore not be rejected and we can deduce with high
confidence that with 50 million traces, no sensitive information can be extracted
from the measurements. We note again that further preprocessing of the traces
may change this.

Figure 3.5: T-statistic for a fix vs. random test of an unmasked AES encryption
in hardware with 12k traces (left) and a masked AES encryption in hardware
with 50 million traces (right).

An important remark in the original work by Goodwill et al. [GJJR11] is often
forgotten. It is not uncommon for the t-statistic to slightly exceed the critical
value, even when leakage is not present. As a result, it is advisable to never
judge the t-test based on a single snapshot for a fixed sample size. The null
hypothesis should be rejected only if the t-statistic exceeds the critical value
in two independent experiments, at the same points. If the t-statistic exceeds
the critical value only by chance, it is unlikely to do so a second time in an
independent experiment in exactly the same point. In more recent works, the
importance of the threshold C = 4.5 has diminished. Instead, we look at the
evolution of the t-statistic as a function of the total number of measurements N .
When studying the t-statistic (Eq. 3.7), it is clear that a consistent difference
combined with a growing number of measurements, must result in a growing
t-statistic. As long as the t-value does not show a definite growing trend, we do
not conclude that leakage is present (see Figure 3.6).
Figure 3.6: Maximum t-statistic as a function of the number of power
measurements for several implementations (leaking implementations I–III and
a non-leaking implementation IV).

Practical Considerations and Limitations. When performing TVLA with real
devices, the results are significantly influenced by the measurement setup. It is
therefore important for the assessor to follow certain guidelines, to make sure
the experiment is valid.
Firstly, it is noted by Goodwill et al. [GJJR11], that the measurements for
different subsets S0 , S1 should be randomly interleaved, to make the TVLA
results independent from gradual changes in the measurement environment,
such as for example the temperature.
The assessor must always test the boundaries of the experiment. One must verify
that the setup can detect leakage in an insecure (e.g. unmasked) implementation.
On the other hand, the null hypothesis in a “Random vs. Random” test or “Fix
vs. Fix” test with equal fixed plaintexts should never be rejected.
Even when following these guidelines, different measurement setups can give
very different results. In some cases, high noise levels effectively hide the leakage
with the considered number of traces. It is important to keep in mind that
TVLA cannot be used to draw general conclusions about a masking scheme.
The absence of leakage on one device with a particular number of measurements
offers no guarantees for the security of the implementation on a different device,
with potentially more traces. On the other hand, the presence of leakage
does not imply that this leakage is exploitable and that an attack against the
implementation exists.

In the next subsections, we consider methodologies which are independent of
the platform and measurement setup and hence offer more generic conclusions.

3.2.2 Adversary Models

In this subsection, we introduce various ways to model a side-channel adversary.
These models are used to argue about the security of masked implementations,
without the consideration of actual measurements.

d-Probing Model. One of the first adversary models defined in the context of
masking remains today the most used in the literature: the d-probing model
by Ishai et al. [ISW03]. In this model, the adversary is assumed to have the
ability to probe up to d intermediate values of the calculation within a certain
period (e.g. a cycle). Only the exact intermediates are observed and nothing
more. We will denote by I the set of all intermediates in the calculation. The
ISW multiplication with d + 1 shares is provably secure in this model because
any subset Q ⊂ I with |Q| ≤ d is independent of the unmasked secret inputs.
This model is convenient for devising theoretical proofs, but not a very realistic
representation of an actual attacker targeting noisy side-channel measurements.

Noisy Leakage Model. Chari et al. [CJRR99] proved the security of their first
masking scheme in the noisy leakage model, in which the adversary observes
the leakage of the intermediates superposed with a noise function, rather than
exactly. This corresponds to the leakage models used in side-channel attacks
L(x) + e, for x ∈ I and e ∼ N(0, σ^2). An important result from this work
is a proof that the number of measurements required to distinguish a secret
grows exponentially with the masking order d. However, they only consider
the masking of variables at bit level and independent of computations. This
was generalized by Prouff and Rivain [PR13]. While more realistic than the
probing model, it is not straightforward to prove the security of a masking
scheme against this adversary. Thankfully, the two models were united by
Duc et al. [DDF14] in a seminal work, which proved that security in the probing
model implies security in the noisy leakage model. As a result, security proofs
in the probing model are not only convenient but also practically relevant.

Bounded Moment Model. Barthe et al. [BDF+ 17] introduced another model
of an adversary that observes the dth -order statistical moment of the leakages,
i.e. the adversary observes

\mathbb{E}\Big[ \prod_{i=0}^{|I|-1} L(x_i)^{d_i} \Big]

with \sum_i d_i = d and L the leakage function on the intermediates xi ∈ I. They
further prove that security in the d-probing model for a serial implementation
implies security in the d-bounded moment model for parallel implementations,
which makes it more relevant for hardware masking.
Note that this model corresponds nicely to the TVLA methodology of
Goodwill et al. [GJJR11] with a moment-based t-test on preprocessed traces
consisting of the product of d power samples. If the null hypothesis is rejected,
the dth -order moment is not independent of the sensitive information, which
implies that the implementation is not d-probing secure.

Glitch-Extended Probing Model. Today, the d-probing model remains the
prevailing adversary model for proving the security of masking schemes. However,
as shown by Mangard et al. [MPO05, MPG05], ISW did not account for glitchy
circuits. With glitches, the set of intermediates in the calculation is essentially
extended beyond the exact and stabilized values in the masking algorithm I. In
the abstract sense, a circuit may compute some unknown glitch function before
stabilizing at the intended value xi ∈ I. Trying to predict the exact behaviour
of this glitch function is difficult and probably infeasible, as it depends not
only on the implemented algorithm, but also on the platform, the execution
environment, and many other unknown factors. One of the first models to
include glitches at the theoretical level came from Reparaz et al. [RBN+ 15]
with the introduction of glitch-extended probes.


Figure 3.7: Circuit model with combinational and sequential logic and glitch-
extended probes Ri for wires xi .

Recall that sequential logic elements such as registers serve as a boundary for
glitches by means of synchronization. It is assumed that an adversary probing
some wire in a combinational function automatically obtains all inputs to
that function up to the previous glitch boundary. This set of inputs is called
the glitch-extended probe and is denoted Ri for each intermediate xi ∈ I (see
Figure 3.7). Following the independent leakage assumption (ILA), a glitch
function on the wire xi depends only on the set of inputs Ri . As a result, this
model includes the worst possible glitch. This worst-case scenario does not
necessarily reflect reality. In other words, this model is over-conservative in
some sense. However, the model allows working at an abstract and theoretical
level, without requiring details about the underlying technology or platform.
Moreover, since the glitch-extended probing model is like the original ISW model
up to a redefinition of probes, we intuitively expect the reduction of Duc et
al. [DDF14] to hold. Specifically, we would expect that security in this model
implies security in a more realistic glitch-extended noisy leakage model, where
the adversary can observe the leakage of glitch-extended probes superposed
with a noise function.
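The mechanics of glitch-extended probes can be made concrete with a small sketch. The toy netlist, the wire names and the convention that primary inputs and register outputs act as glitch boundaries are our own illustrative assumptions; the function simply collects, for a probed wire, every boundary signal in its combinational fan-in.

```python
# Toy netlist: each wire maps to (operation, operands). Wires named "in_*"
# are primary inputs and "reg_*" are register outputs; both are glitch
# boundaries at which the backward traversal stops.
netlist = {
    "t0": ("AND", ["in_x0", "in_y0"]),
    "t1": ("AND", ["in_x0", "in_y1"]),
    "u": ("XOR", ["t1", "in_r"]),
    "reg_u": ("REG", ["u"]),
    "w": ("XOR", ["t0", "reg_u"]),
}

def glitch_extended_probe(wire):
    """All glitch-boundary signals a probe on `wire` may observe."""
    if wire.startswith(("in_", "reg_")):
        return {wire}
    probe = set()
    for operand in netlist[wire][1]:
        probe |= glitch_extended_probe(operand)
    return probe

print(sorted(glitch_extended_probe("w")))  # ['in_x0', 'in_y0', 'reg_u']
```

Note that probing w does not reveal anything behind the register reg_u: the synchronization stage truncates the extended probe, exactly as in Figure 3.7.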

Robust Probing Model. The idea to extend regular probes with additional
information, based on physical defaults was further developed by Faust et
al. [FGP+ 18]. Glitches are not the only cause for the gap between the probing
model and reality. In recent literature, various cases have been demonstrated
where the ILA does not hold. On hardware platforms such as FPGAs and
ASICs, wires carrying different variables (such as shares) are driven by the same
voltage line and the same clock signal drives all sequential logic in a masked
gadget. As a result, it has been shown by De Cnudde et al. [DEM18] that
coupling effects exist between “independent” wires due to capacitances or shared
voltage/clock lines. These effects are potentially dangerous for the recombination
of shares of the same variable. On software platforms, the combinations of
intermediates in the CPU datapath have been studied by Papagiannopoulos
and Veshchikov [PV17] among others. Balasch et al. [BGG+ 14] suggested a
transitional leakage model, in which not only each intermediate xi ∈ I is probed,
but also the XOR of any two intermediates xi ⊕ xj for xi , xj ∈ I. In the robust
probing model, Faust et al. [FGP+ 18] propose to replace the exact probes in
the probing model, not only with glitch-extended probes, but also with extended
probes for memory transitions or coupling. This is an attractive approach for
closing the gap between theory and practice. However, Levi et al. [LBS19]
demonstrated that defining extended probes for coupling is not straightforward.
Also, modelling the CPU effects in this way is nontrivial, as argued by De
Meyer et al. [14].

3.2.3 Provable Security

The concept of masking is considered provably secure against SCA, because
the stochastic splitting into multiple shares destroys the dependency of the
power consumption on the sensitive variables. Indeed, its security has been
proven in the above-described models. However, it is not trivial to perform
computations on the masked variables in a way that does not re-introduce
sensitive dependencies, for example by accidental combinations of shares. As
such, theoretical proofs of security for masked designs are a common component
in works on masking. Since these proofs are often simulation-based, we will
first define simulatability.
Definition 3.1 (Simulatability (extended from [BBP+ 16])): Consider a
masked gadget with shared input x. A set of l probes Q = {q1 , . . . , ql }
can be simulated with at most t shares of the input, if there exists a set
of indices I with |I| ≤ t and a random function S : F^t → F^l such that
for any fixed input shares x, the distributions of Q = {q1 , . . . , ql } and
{S(x_I)} are identical.
If the gadget has multiple shared inputs x^1 , . . . , x^n , then the gadget is
simulatable with at most t shares of each input, if there exists a set of
indices I_j for each j = 1 . . . n with |I_j| ≤ t and a function S : F^{nt} → F^l
such that the distributions of Q and {S(x^1_{I_1} , . . . , x^n_{I_n})} are identical.

Local Probing Security Definitions. With the introduction of the d-probing
model, Ishai et al. [ISW03] used a simulation-based proof of security for their
multiplication gadget in that model. They demonstrate that the view of a d-
probing adversary can be perfectly simulated without knowledge of the sensitive
inputs. Blömer et al. [BGK04] call an algorithm perfectly masked if the joint
distribution of any d intermediate results is independent of the secret key and
plaintext. Gammel and Mangard [GM10] provide an information-theoretic
definition of d-probing security, which aligns with that of Blömer et al.:
Definition 3.2 (d-probing security): A gadget with secret x is d-probing
secure if and only if for any observation set Q = {q1 , . . . , qt } of t ≤ d
probes, it holds that MI(Q; x) = 0.

However, Carlet et al. [CPRR15] introduce a different definition, which they
refer to as perfect d-probing security. We will refer to it as d-non-interference
(as done in [BBP+ 16]).
Definition 3.3 (d-non-interference): A gadget is d-non-interferent (d-
NI) if and only if every observation set Q of t probes with t ≤ d can be
simulated with at most t shares of each input.

While these two notions were initially thought to be equivalent, there is a subtle
difference. Consider for example the intermediate (x0 ⊕ y0 )x1 . It is easy to
verify that this variable is independent of the secrets x and y and hence probing
secure. The share y0 acts like a one-time pad on x0 , which means that the
multiplication does not reveal any information on x. However, to simulate this
single value, one requires 2 shares of x, which means that non-interference does
not hold. The important take-away is that d-NI is a stronger notion. It implies
d-probing security, but not vice versa. De Meyer et al. [4] also clarified this
with an information-theoretic definition of d-NI, to mirror that of Gammel and
Mangard.
Definition 3.4 (d-non-interference): A gadget with d + 1 input shares x
is d-non-interferent if and only if for any observation set Q of at most d
probes, it holds that ∃i : MI(Q; xi | xī) = 0.

The above definitions can be seen as more detailed specifications of the probing
adversary model, as they imply a verification methodology for the security of
a masked gadget, either by simulation or by comparing joint distributions for
different secrets or by some equivalent methodology. Note that the verification
MI(Q; · | ·) = 0 must be done for every possible set of d probes Q.
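For small gadgets, this verification can be done by exhaustive enumeration over all maskings. The sketch below (our own illustration, not one of the published tools) checks Definition 3.2 for the intermediate (x0 ⊕ y0)x1 from the discussion above: the probe's exact distribution is identical for every pair of secrets, hence MI(Q; x) = 0, even though simulating the probe would require two shares of x.

```python
from collections import Counter
from itertools import product

def probe_distribution(secret_x, secret_y):
    """Exact distribution of the probe (x0 ^ y0) & x1 over all maskings."""
    dist = Counter()
    for x0, y0 in product((0, 1), repeat=2):  # uniform masks
        # y0 is itself a uniform share of secret_y; the probe never
        # touches y1 = secret_y ^ y0, so secret_y cannot influence it.
        x1 = secret_x ^ x0                    # second share of x
        q = (x0 ^ y0) & x1                    # the probed intermediate
        dist[q] += 1
    return dist

# MI(Q; secrets) = 0 iff the probe's distribution is the same for all secrets.
dists = {s: probe_distribution(*s) for s in product((0, 1), repeat=2)}
assert len({tuple(sorted(d.items())) for d in dists.values()}) == 1
print("1-probing secure, distribution:", dict(dists[(0, 0)]))  # {0: 3, 1: 1}
```

The same loop, with the simulatability condition of Definition 3.3 in place of the distribution comparison, separates probing security from non-interference on this very example.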

Glitch-Extended Probing Security. It was noted by De Meyer et al. [4] that
these mathematical model descriptions can be combined with the glitch-extended
probes of Reparaz et al. [RBN+ 15] to obtain descriptions and verification
methods of security in the presence of glitches.
Definition 3.5 (d-glitch-extended probing security): A gadget with secret
x is d-glitch-extended probing secure if and only if for any observation
set Q = {q1 , . . . , qt } of t ≤ d probes, with respective glitch-extended probes
R = R1 ∪ . . . ∪ Rt, it holds that MI(R; x) = 0.

With this description, it was shown that the non-completeness property of
threshold implementations is indeed a necessary requirement for glitch-extended
probing security, though it is not sufficient. Uniformity, on the other hand, is
neither sufficient nor necessary. Together, non-completeness and uniformity are
sufficient for 1-glitch-extended probing security, but as was already noted by
Reparaz et al. [RBN+ 15], not for higher-order security. By replacing the regular
probes Q with glitch-extended probes R in Definition 3.4, one can similarly
verify d-NI in the presence of glitches.
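The role of non-completeness can be verified in the same exhaustive manner. The sketch below (our own toy check) takes the first component function of a 3-share threshold AND, f0 = x1 y1 ⊕ x1 y2 ⊕ x2 y1, whose glitch-extended probe observes only {x1, x2, y1, y2}, and compares it with a hypothetical complete function whose probe would observe all six shares.

```python
from collections import Counter
from itertools import product

def glitch_probe_dist(secret_x, secret_y, observed):
    """Distribution of the share tuple seen by a glitch-extended probe,
    over all uniform 3-share maskings of the secrets x and y."""
    dist = Counter()
    for x1, x2, y1, y2 in product((0, 1), repeat=4):
        shares = {"x0": secret_x ^ x1 ^ x2, "x1": x1, "x2": x2,
                  "y0": secret_y ^ y1 ^ y2, "y1": y1, "y2": y2}
        dist[tuple(shares[w] for w in observed)] += 1
    return dist

def leaks(observed):
    dists = [glitch_probe_dist(x, y, observed)
             for x, y in product((0, 1), repeat=2)]
    return any(d != dists[0] for d in dists)

# Non-complete f0 = x1y1 ^ x1y2 ^ x2y1: the extended probe misses one share
# of each input, so the observation is independent of the secrets.
assert not leaks(("x1", "x2", "y1", "y2"))
# A complete function touches all shares; its extended probe determines x, y.
assert leaks(("x0", "x1", "x2", "y0", "y1", "y2"))
print("non-completeness => 1-glitch-extended probing security")
```

Since the extended probe of a non-complete function misses at least one share of each input, its joint distribution is uniform regardless of the secrets, which is exactly the MI(R; x) = 0 condition of Definition 3.5 for d = 1.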

Composability and Global Probing Security. Many masked gadgets in the
literature were proven secure using one of the methods above. However, their
use as building blocks in larger masked implementations does not imply that

the entire design is probing secure, i.e. probing security is not composable.
This is for example why the AES of Rivain and Prouff [RP10] was flawed, even
though their refreshing gadgets and multiplication gadgets (separately) were
provably secure. NI, while stronger than probing security, is also not generically
composable.
Composability is a desirable property because it is not always feasible to verify
probing security for complex designs, since one should verify the independence
or simulatability of every observation set Q of d probes. In a circuit/algorithm
with w wires/intermediates, there are (w choose d) possible sets Q. The complexity
of verification thus grows considerably with the verification order d (assuming
d ≪ w).
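For concreteness, the number of observation sets to check already explodes for designs of modest size (w = 1000 is an arbitrary example):

```python
import math

w = 1000  # wires/intermediates in a modest masked design
for d in range(1, 5):
    print(f"d = {d}: {math.comb(w, d):>14,} observation sets")
# d = 1:          1,000
# d = 2:        499,500
# d = 3:    166,167,000
# d = 4: 41,417,124,750
```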
Barthe et al. [BBD+ 16] introduced the concept of strong non-interference (SNI)
to alleviate this problem.
Definition 3.6 (d-strong non-interference): A gadget is d-strong non-
interferent (d-SNI) if and only if for every set QI of t1 probes on
intermediate variables (i.e. no outputs) and every set QO of t2 probes on
output shares such that t1 + t2 ≤ d, the set Q = QI ∪ QO can be simulated
using at most t1 shares of each input.

The difference between SNI and NI lies only in the number of input shares
which may be used for simulation. By making this number independent of the
number of output probes, SNI ensures a separation between the output and
input shares, which enables generic composition with other blocks. Belaïd et
al. [BBP+ 16] demonstrate how gadgets satisfying NI and SNI can be combined
in a global proof of security. The main idea is to determine the number of
shares that are required at the inputs of each gadget to simulate the rest of the
circuit and to propagate this backwards to the outputs of the previous gadgets.
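This backward propagation can be illustrated with a deliberately naive share-counting sketch. It is our own simplification of the composition argument: real proofs track sets of share indices per input and handle fan-out, whereas here each gadget in a simple pipeline has a single input fed by the previous gadget's output, and only counts are propagated.

```python
def required_input_shares(kind, internal_probes, output_observations):
    """Shares of each input needed to simulate a gadget's observations.
    NI: t1 + t2 input shares; SNI: t1 only, regardless of output probes."""
    if kind == "SNI":
        return internal_probes
    return internal_probes + output_observations

def chain(kinds, internal_probes):
    """Naive backward propagation through a pipeline g1 -> g2 -> ... -> gn:
    the input shares gadget i needs become output observations of g(i-1).
    Returns the number of shares of g1's input needed for the simulation."""
    need = 0
    for kind, t1 in zip(reversed(kinds), reversed(internal_probes)):
        need = required_input_shares(kind, t1, need)
    return need

# One internal probe in each of three gadgets (d = 3 probes in total):
print(chain(["NI", "NI", "NI"], [1, 1, 1]))   # 3: requirements accumulate
print(chain(["NI", "SNI", "NI"], [1, 1, 1]))  # 2: SNI stops the accumulation
```

With only NI gadgets, the required share count grows with the length of the chain and can eventually exceed d, at which point the simulation argument breaks down; an SNI gadget resets the count at its boundary, which is precisely why SNI enables generic composition.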
De Meyer et al. [4] suggested a mathematical description of SNI based on mutual
information rather than simulation.
Definition 3.7 (d-strong non-interference): A gadget with d + 1 input
shares x is d-strong non-interferent if and only if for any observation set
Q of at most d probes, of which t1 are intermediates and t2 are output
probes such that t1 + t2 ≤ d, it holds that ∃ I ⊂ {0, . . . , d} with |I| = t1
such that MI(Q; xĪ | xI) = 0.

Composability in the Presence of Glitches. As before, it is possible to replace
the regular probes Q with glitch-extended probes R in Definition 3.7 to obtain a
description of d-SNI in the presence of glitches. Remark that the multiplication
gadget in Eq. (2.14) with d + 1 shares is d-probing secure and d-non-interferent
in the presence of glitches, but not d-glitch-extended strong non-interferent. It

was noted by Faust et al. [FGP+ 18] that another synchronization stage at the
outputs is required for this purpose. Without glitches, it does satisfy SNI.

Summary of Associations. In Figure 3.8, we summarize which of the above
security notions are implied by each other. The relation A ⇒ B means that A
is stronger than and implies B. Gadgets that satisfy A are thus typically more
expensive than gadgets that satisfy B.

    d-SNI ⇒ d-NI ⇒ d-probing

    d-SNI^g ⇒ d-NI^g ⇒ d-probing^g ⇒ d-NC

    d-NC + uniformity ⇒ 1-probing^g (d = 1)

Figure 3.8: Summary of implication relationships between security notions. The
notation g represents the glitch-extended version of the notions. NC is the
abbreviation for non-completeness.

Automated Tools. Several tools for the provable verification of gadgets have
been proposed in the literature. Coron [Cor18] introduced a tool to verify
simulatability (NI and SNI) based on symbolic manipulation. It is suitable
for both Boolean and arithmetically masked functions in software, but suffers
from false negatives (e.g. SNI gadgets which are evaluated as not SNI). The
first formal verification tool for probing security in the presence of glitches
was developed by Bloem et al. [BGI+ 18]. Barthe et al. [BBC+ 19] created
MaskVerif, a very versatile tool for the verification of probing security, NI, and
SNI, either with or without glitches. MaskVerif achieves very good efficiency
even for probing security. For each observation set Q, the tool first takes a
symbolic approach, similar to the method of Coron. Only if the result of this
test is negative, are the properties of Definitions 3.2 or 3.5 verified by exhaustive
computation of joint probability distributions. Sadly, its applicability is limited
to Boolean masking.

3.2.4 Flaw Detection

Formal proofs of global security are either only applicable to small designs or are
very demanding on the gadgets (i.e. SNI). Moreover, even if a masking scheme

is provably secure, its implementation may be flawed due to human error. On
the other hand, results from TVLA on actual devices are platform-specific and
highly dependent on the measurement setup. Noise in the measurements may
hide flaws in the implementation or even in the masking scheme itself.
Flaw detection tools form a compromise between these two extremes. They
are less exhaustive than provable security tools, but also more efficient, and
can therefore be applied to larger implementations. They often resemble
TVLA methods, but are applied to simulated power traces rather than real
measurements. This way, the design can be verified in a completely noiseless
setting and independently of the implementation-specific parameters such as
the platform, the layout, the synthesis process, etc.

In the Probing Model. Reparaz [Rep16] first proposed a flaw detection tool
for software masking. The traces for TVLA are generated according to the
d-probing model, by including for each intermediate variable one sample in the
trace. It is possible to use leakage models other than the identity model, such
as Hamming weight or LSB. In some cases, one can even optimize the efficiency
by “downscaling” the scheme, for example from the field F28 to F24 . The t-test
is applied to these traces, as in regular TVLA. The test can be specific or
non-specific, the secret can be chosen in a fix vs. fix or fix vs. random manner,
and higher-order verification is possible by preprocessing the traces. Though
operating in the probing model, this tool was also able to detect the flaw of
higher-order threshold implementations [Rep15].
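A miniature version of this idea (our own toy, not Reparaz's actual tool) fits in a few lines: we simulate one sample per intermediate of a trivially flawed first-order gadget under an identity leakage model and run a fix vs. random t-test on every sample.

```python
import math
import random

random.seed(2)

def welch_t(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    s = math.sqrt(va / len(a) + vb / len(b))
    return 0.0 if s == 0.0 else (ma - mb) / s

def simulate_trace(x):
    """One sample per intermediate (identity leakage model, no noise).
    The last intermediate is a planted flaw: it recombines both shares."""
    x0 = random.getrandbits(1)
    x1 = x ^ x0
    return [x0, x1, x0 ^ x1]  # x0 ^ x1 equals the secret x

N = 2000
fixed = [simulate_trace(1) for _ in range(N)]
rnd = [simulate_trace(random.getrandbits(1)) for _ in range(N)]

t_stats = [welch_t([t[i] for t in fixed], [t[i] for t in rnd])
           for i in range(3)]
leaky = [i for i, t in enumerate(t_stats) if abs(t) > 4.5]
print("flagged samples:", leaky)  # expect only the flawed sample, index 2
```

The individual shares (samples 0 and 1) pass the test, while the recombined intermediate is flagged immediately even in this completely noiseless setting.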

In the Glitch-Extended Probing Model. De Meyer et al. [4] extended the
work of Reparaz with verification in the glitch-extended probing model. The
tool follows from the observation that one can verify Definition 3.5 statistically
using estimated probability distributions of the probes R (see Figure 3.9). The
mutual information cannot be zero if the probability distributions are not exact,
but the significance of the difference between the histograms is measured with
Pearson’s χ2 test.
Each sample i in the simulated traces corresponds to a combination of d glitch-
extended probes Ri . Hence, different simulated traces are constructed for each
verification order d and the χ2 test considers the entire probability distributions
of the trace points. In that sense, the tool is different from that of Reparaz
even for software masking, since it verifies the property of Definition 3.2 instead
of preprocessing the traces for higher orders with some combination function
(e.g. a product) and using the moment-based t-test.
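The distribution-based comparison can be sketched as follows. This is our own simplified illustration: 1-bit shares, a hand-coded Pearson χ2 statistic on the fix and random histograms, and a single critical value (α ≈ 0.001); the "glitch-extended" probe here models a flawed gadget whose extended probe sees both shares.

```python
import random
from collections import Counter

random.seed(3)

def chi2_stat(h1, h2):
    """Pearson chi-squared statistic of a 2-row contingency table of counts."""
    bins = set(h1) | set(h2)
    n1, n2 = sum(h1.values()), sum(h2.values())
    stat = 0.0
    for b in bins:
        col = h1[b] + h2[b]
        for h, n in ((h1, n1), (h2, n2)):
            exp = col * n / (n1 + n2)
            stat += (h[b] - exp) ** 2 / exp
    return stat

def probe_histogram(x, n, extended):
    """Histogram of what a probe sees over n random maskings of bit x.
    `extended` models a glitch-extended probe observing both shares."""
    hist = Counter()
    for _ in range(n):
        x0 = random.getrandbits(1)
        x1 = x ^ x0
        hist[(x0, x1) if extended else (x1,)] += 1
    return hist

N, CRIT = 5000, 16.27  # one threshold for simplicity (chi2, 3 dof, ~0.001)
safe = chi2_stat(probe_histogram(0, N, False), probe_histogram(1, N, False))
flaw = chi2_stat(probe_histogram(0, N, True), probe_histogram(1, N, True))
print(f"share-only probe: {safe:.1f}   glitch-extended probe: {flaw:.1f}")
```

The histograms of a single share are statistically indistinguishable for the two secrets, whereas the extended probe's histograms have disjoint support and the χ2 statistic rejects the null hypothesis decisively.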


Figure 3.9: Illustration of the tool of [4].

With Simulation of Glitches. Another tool by Sijacic et al. [SBY+ 18] for
masked hardware implementations performs a post-place-and-route simulation
of a circuit. Since this simulation uses actual propagation delays, it can also
accurately simulate the occurrence of glitches. This tool thus operates in a
model that is more realistic and less worst-case than the glitch-extended probing
model. On the other hand, the results are more dependent on implementation-
specific characteristics such as the logic library and the placement and routing.
Bertoni and Martinoli [BM16] perform a simulation of so-called transients to
enumerate all possible transitions and thus intermediates that may occur on a
wire. This method is again independent of implementation specifics, yet more
optimistic than the glitch-extended probing model. The difference between
transients and glitch-extended probes is that apart from including all possible
intermediate results, the latter also includes other combinations of inputs, such
as transitions between these intermediates.

Discussion. Flaw detection tools that do not exhaustively simulate an
implementation for all possible inputs cannot be used to prove security. As with
TVLA, the presence of leaks or flaws can be asserted by rejection of the null
hypothesis, but the absence of leakage is never conclusive. Nevertheless, they
can be very informative and important in the judgement of practical security. If
secrets cannot be distinguished with X traces in a completely noiseless setting,
an attacker will need at least as many traces to recover them with DPA in
a realistic noisy environment. This brings us back to the fact that dth -order
masking may not prevent dth -order attacks, but at least sufficiently increases

their complexity. On the other hand, this does not mean that flaw detection
tools can replace regular TVLA. With any verification method, it is important
to keep its limits in mind. Simulation-based tools (whether exhaustive or not)
can only be as accurate as the adversary model they are based on. In the gap
between theory and practice, they can provide no guarantees. Hence, performing
TVLA on real measurement traces (while keeping its limitations in mind as
well) remains an important step in the evaluation of masked implementations.
We demonstrate this in the next paragraph.

Inner Product and Boolean Masking. In the previous chapter, we introduced
the representations of Boolean and inner product masking and described their
multiplication in respectively Eq. (2.14) and Eq. (2.17). Both schemes have been
proven secure (and even SNI) in the probing model. Yet in practice, they do not
offer the same security. Consider Figure 3.6, where measurement I represents a
Boolean masked ISW multiplication with d = 1 and measurement II the inner
product masking multiplication from Balasch et al. [BFG+ 17, Alg. 3] with d = 1.
These results show two things. On the one hand, they demonstrate that indeed
the theoretical security of the two schemes does not extend to practice. On the
other hand, we see that the multiplication with inner product masking exhibits
less leakage. This could be explained by the transitional leakages in a CPU,
which combine intermediates in a Boolean XOR [BGG+ 14, 14]. In fact, we can
reduce the leakage even further. The multiplication of Balasch et al. [BFG+ 17]
still contains a quasi-Boolean-shared variable T with T_ij = x_i y_j L_j, which can
also be seen in Eq. (2.17). It is clear that ⊕_j T_ij = x_i y, which means that
Boolean combinations of these shares reduce the security order of variable y.
We adapted the algorithm to avoid intermediate Boolean shares. We refer to
Appendix A for the algorithm and its proof of security. This new multiplication
leads to measurement set III in Figure 3.6. Indeed, we see that more power
traces are required to detect leakage. However, only the final measurement set
IV does not exhibit leakage with 250k traces. This is a Boolean masked ISW
multiplication with d = 2. We note that increasing the security order does not
always resolve the issue. This was demonstrated by De Meyer et al. [14] with a
full AES implementation.
Our experiment shows that the different schemes may have the same security in
theory, but in practice, inner product masking shows a smaller vulnerability to
transitional leakage. Moreover, even different algorithms for the same scheme
result in different leakage behaviours. Finally, the adapted multiplication
algorithm is an original contribution of this thesis.
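The root cause of the Boolean scheme's vulnerability can be seen with a two-line enumeration (a simplified model of a register transition between the two shares; the 4-bit width is an arbitrary choice): for Boolean masking, the Hamming distance between the shares of x equals HW(x) for every mask.

```python
def hw(v):
    return bin(v).count("1")

def transition_leakage(x, bits=4):
    """Possible HW(x0 ^ x1) values over all masks x0, with x1 = x ^ x0."""
    return {hw(x0 ^ (x ^ x0)) for x0 in range(1 << bits)}

# The set is always a singleton: the transition deterministically leaks HW(x),
# independently of the mask, i.e. a first-order transitional leak.
for x in (0x0, 0x3, 0xF):
    print(f"x = {x:#x}: leaked Hamming weights {transition_leakage(x)}")
```

For inner product masking, by contrast, x0 ⊕ x1 does not equal x (the shares only combine to x through the public vector L and field multiplications), which is consistent with the reduced transitional leakage observed in measurement set II.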

3.3 My Contributions in this Context

3.3.1 Consolidating Security Notions in Hardware Masking

Context. Masked implementations for software rely on an extensive theory.


The probing model of ISW [ISW03] is well-defined and has served as the basis
for many proofs of security of masked gadgets. Since the introduction of (strong)
non-interference by Barthe et al. [BBD+ 16], it is possible to create a global
proof of security.
For masked hardware implementations, it has not always been like this.
The provable security of threshold implementations [NRR06] relies on the
property of non-completeness and the property of uniformity (cf. § 2.2.2).
Together, they form a sufficient condition for first-order security against SCA.
It was noted by Reparaz [Rep15] that this is not the case for higher-order
security. Many proposals for hardware implementations secure against side-
channel attacks [DRB+ 16, GMK17] have appeared since. All rely on the
non-completeness property, since it is a necessary requirement for security
against SCA in the presence of glitches. However, it was long unclear what
constitutes a sufficient condition, which means a lot of masked implementations
have actually been designed without a clear understanding of how particular
design decisions impact the resistance against SCA.

Contribution. In our work from CHES 2019 [4], we describe a new, succinct,
information-theoretic security condition, as described in Definition 3.5. This
is the first formal condition for d-probing security in the presence of glitches
which is both necessary and sufficient. This single condition includes, but
is not limited to, previous security notions such as those used in threshold
implementations. As a consequence, we can prove that non-completeness is
indeed necessary and demonstrate that uniformity is not, despite being enforced
in most works on masking. Furthermore, we also treat the notion of (strong)
non-interference from an information-theoretic point-of-view (see Definitions 3.4
and 3.7). We unify the different security concepts and pave the way to the
verification of composability in the presence of glitches.
We consolidate all existing and new security notions into a single framework
based on mutual information. All notions in Section 3.2.3 can be verified by
some form of the property MI(A; B|C) = 0, where A depends on the type of
probing (with/without glitches, transitions, . . . ) and B, C determine whether
one verifies NI, SNI or probing security. This paper is not included in this book,
but Section 3.2.3 is essentially a summary of its results (excluding our proofs).

Finally, we use this framework in a tool that efficiently tests and validates
the resistance of masked implementations against DPA. We described this
tool in Section 3.2.4. The tool is an extension of the flaw detection tool from
Reparaz [Rep16], but can also be used to prove the security of small gadgets.
We demonstrate the adaptability of the framework to for example different
types of mask representations and point out important features that are not
yet included in state-of-the-art tools such as MaskVerif [BBC+ 19]. For example,
it was used to validate the security and optimize the randomness use of the
multiplicative masked AES, published at CHES 2018 [7].
The new security notions (Definitions 3.4, 3.5 and 3.7) were also adopted in a
very efficient tool by Knichel et al. [KSM20].

3.3.2 Recovering the CTR_DRBG state in 256 traces

Context. In DPA attacks, it is typically assumed that the attacker has


knowledge of the plaintext or ciphertext. This assumption is not always valid.
For example, the state of a pseudo-random number generator (PRNG) that
is used for sensitive data such as keys or masks in a masked implementation,
is considered secret. The attacker then only has access to the side-channel
information leaking from its implementation.

Contribution. We already described our work from CHES 2020 [3] in
Section 2.4.4. We also consider this work a contribution in the context of
this chapter, since it improves on an attack from the state-of-the-art [Jaf07].
We use simulated traces to investigate our success probability as a function
of the signal-to-noise ratio (SNR). We also demonstrate its success in practice
by attacking an AES-CTR implementation on multiple real platforms such
as a Cortex-M4 and recovering both the key and nonce. In addition, we use
the alternative methodology of blind SCA [CR17], which also does not require
knowledge of the plaintexts. The comparison shows that our attack is a lot more
efficient. Our traces and code are made openly available for reproducibility.

3.3.3 On the Effect of the (Micro)Architecture on the Development of
Side-Channel Resistant Software

Context. Works on software masking use formal proofs to guarantee appropriate
security properties [BBD+ 16]. However, such proofs require a
thorough understanding of a given platform and the mechanisms that may
produce side-channel leakage. If one does not completely capture all possible

effects, the benefit of a proof is somewhat limited. Microprocessors do not
conform to the probing model that is typically used in these proofs. As a
result, masked software implementations in practice do not exhibit the security
we expect in theory. The micro-architectural specification is typically not
completely known when a given cryptographic algorithm is implemented, and it
is therefore difficult to predict all ways in which an attacker may be able to find
a side-channel. As a result, these features are not often considered in the context
of the masking countermeasure. Papagiannopoulos and Veshchikov [PV17] made
some observations showing how some micro-architectural features affect the
side-channel leakage of first-order masking on an AVR-based ATMega163.

Contribution. In a collaboration with Elke De Mulder and Mike Tunstall from
the company Rambus, we submitted a work to CHES 2020 [14]. In this work,
we generalize and extend work by Papagiannopoulos and Veshchikov to describe
how a microprocessor may leak. We show that the sources of leakage are far more
numerous than previously considered and highly dependent on the platform.
Balasch et al. [BGG+ 14] claim that a straightforward second-order masking
scheme will provide first-order resistance and can ignore micro-architectural
considerations. We demonstrate that this is false with a counterexample. We
further describe how to write high-level code in the C programming language that
allows one to work around common microarchitectural features. In particular,
we introduce implementation techniques to reduce sensitive combinations made
by the CPU and which are devised to be preserved through the optimizations
made by the compiler. We apply these techniques to two case studies (DES and
AES) and show that they can provide security on several platforms.

3.4 Conclusion

We have given an overview of SCA methods, based on attacks on the one
hand and verification on the other. Here, we will recall some disparities in the
literature and open problems for future work. We will also thoroughly discuss
the gap between theory and practice.

Composability. Verifying the d-probing security of an entire encryption is
infeasible. The introduction of SNI has made it possible to create global proofs
of security. By ensuring that small building blocks satisfy certain conditions, we
have a guarantee that the entire encryption is probing secure. Sadly, gadgets
that satisfy SNI are typically expensive. The cost of randomness is especially
high, since current simulatability notions are not compatible with the multiple

use of a single random mask. What we get in return is the flexibility of being
able to compose the gadget with any other block. However, a gadget in a specific
design (e.g. AES) does not need to be composable with any other gadget, but
only with those that it is actually composed with. A specific example is that of
first-order threshold implementations, which are neither 1-NI nor 1-SNI, but
can still be (serially) composed thanks to uniformity and non-completeness. It
is thus not necessarily true that a gadget that satisfies neither NI nor SNI, does
not provide the required security. Furthermore, it is possible for a complex
block to consist of such gadgets and still be probing secure. The problem is that
the literature currently lacks formal methods to prove the security of a large
design that does not consist of strong non-interferent gadgets. In a recent effort,
the current simulatability notions are being refined to account for multiple
inputs and outputs [CS19]. As long as the verification of probing security itself
does not become more efficient, we need to come up with tighter security
requirements. We also need these new notions to allow for randomness recycling.

The Boolean Bias. Over the last years, a large number of verification tools
have been introduced. Their functionalities range from the verification of
provable security for small gadgets to the validation of practical security for
larger designs. This is a positive development, since hand-written proofs are
prone to human error, and publicly available tools can be scrutinized by a larger
community. However, similar to the popularity of Boolean masking in the
previous chapter, we see here a bias towards tools compatible with this type of
masking. To the best of our knowledge, there is no publicly available tool that
allows verification of inner product or polynomially masked implementations.
This is regrettable, given that none of the security notions in Section 3.2.3
make any assumptions on the masking representation and especially given
the increased security of inner product and polynomial masking over Boolean
masking in practice.

Modelling the Adversary. Tools decrease the dangers of human errors and
increase the convenience of verifying masked implementations. However, lack of
security is not always due to lack of verification, but rather due to misconceptions
of leakage behaviour. Consider for example the scheme of Ishai et al. [ISW03],
which was introduced for implementation in hardware. Though it came with a
proof of security, it did not provide the claimed security, because their model
did not take glitches into account. Today, Boolean masked implementations
for software are based on ISW and come with global proofs of security, yet still
exhibit leakage on a real platform. Hence, more important than the actual
verification of security, is the accurate modelling of the adversary. If the model

is inaccurate, it does not matter how many proofs or tools one uses, since not
all leakage can be found.

Mind the Gap: Provable vs. Practical Security. It is infeasible for a
theoretical adversary model to exactly represent reality. There will always
be a gap between theory and practice, or equivalently between provable security
and practical security (see Figure 3.10).

        Effects not in model (e.g. glitches, transitions, …)
    Security in Model  ⇄  Security in Practice
        Noise, complexity of multivariate attacks, …

Figure 3.10: The gap between theory and practice: provable vs. practical
security.

We can try to make our models correspond to reality as closely as possible, but
need to keep in mind that more complicated models engender more expensive
gadgets (e.g. SNI). The glitch-extended probing model includes the worst
possible glitches, because it cannot predict the glitches that will actually occur.
The transitional leakage model assumes an adversary can probe any XOR
combination of intermediates, because it is unknown at the theoretical level
which combinations are possible. More fundamentally, the probing model
assumes that exact values in a calculation can be observed. It is common in
theory to give more power to the adversary than strictly necessary to keep the
models conceptually simple, but with strong security guarantees. As a result,
the masked gadgets in these models may be more expensive than required.
Moreover, the gap is a double-edged sword, as it can equivalently be seen from
an attacker’s point-of-view (see Figure 3.11). Hence, at some point, it does
not make sense to keep increasing the cost and complexity of masking gadgets
so that they are secure in an even stronger model. We must find a balance
between the effort spent on making implementations secure in some model and
the effort spent on making them secure in practice.

Bottoms Up: From Provable to Practical Security. Apart from the gap
between provable and practical security, there is also a gap between the literature
on software masking and hardware masking. Recall that in the previous chapter

[Figure: the same diagram seen from the attacker’s side. “Attack in Model” and
“Attack in Practice” are separated by effects not in the model (e.g. glitches,
transitions, …) and by noise and the complexity of multivariate attacks.]

Figure 3.11: The gap between theory and practice: attacks.

we observed that software masking schemes typically come hand-in-hand with
a strong theoretical proof but no practical evaluation, and that the situation is
reversed for hardware masking schemes. However, the “correct” methodology
is not one or the other. This chapter has shown that we have at our disposal
methodologies for verifying security at every level, from bottom to top. We
need to use basic masked building blocks which come with strong and generic
security guarantees, so that they can be used for any cipher and any platform.
When building a masked implementation, we can use simulation tools to detect
flaws at an early stage, to try out optimizations and to verify leakage in a
noiseless environment. Finally, we must always validate that reality meets
the expectations derived from our model, by deploying the implementation
on an actual platform and performing leakage assessment with real power
measurements.
Chapter 4

Combined Physical Attacks and Countermeasures

After the thorough study of masking against side-channel attacks in Chapter 2,
this chapter looks at more powerful adversaries which, apart from passively
analysing the side-channel emanations of a device, can also actively disturb
the computations on the device through fault injections. In Section 4.1, we
briefly summarize the background on fault injections, how they are exploited
and how implementations can be protected against them. Next, Section 4.2
gives an overview of recent proposals in the state-of-the-art on combined attacks
and countermeasures. We critically evaluate the schemes and the adversary
models that they are defined for. Section 4.3 highlights our contributions in this
field. Since this is a very new branch of research, there are a large number of
questions for future research, which we consider in the conclusion in Section 4.4.

4.1 Fault Attacks and Countermeasures

The active counterpart of side-channel analysis (SCA) is fault analysis (FA).
Rather than observing a cryptographic computation through side-channels,
the attacker disturbs the computation and obtains secret information from
the outputs. Boneh et al. [BDL97] were the first to show how public-
key cryptosystems can be attacked when the computation is faulty. For
symmetric-key systems, Biham and Shamir introduced differential fault analysis
(DFA) [BS97]. While this work even predates the work on differential power
analysis (DPA) [KJJ99], countermeasures against fault analysis are more


heuristic and lack the formal foundation that masking enjoys against DPA. In
this section, we give a summary of the most important fault attack mechanisms
and countermeasures.

4.1.1 Fault Attacks

A fault attack consists of two stages. First, a fault is injected into the
cryptographic computation and then, the effect of the fault is exploited using
fault analysis. Below, we list different methods for physically inserting the fault
onto a computing chip. Next, we describe how the faults can be characterized
from a theoretical point-of-view. Finally, we describe DFA, the most important
threat against cryptographic implementations in this context.

Fault Injection. There are several ways to induce faults in a computation, with
varying degrees of invasiveness and precision [Tun17]. For example, modifying
the temperature is a non-invasive technique for tampering with an embedded
device. Other non-invasive approaches are varying the supply voltage or inducing
glitches in an external clock. Clock glitches are a very popular way for fault
injection because they are relatively easy and cheap to perform. Since a glitch
in the clock signal essentially shortens the clock period (the rising edge comes
too soon), it allows introducing a fault by storing wrong intermediates in the
registers. A clock glitch can cause the next operation to start executing before
the current one finishes.
Semi-invasive techniques require the chip surface to be exposed. The
computation can then be disturbed with a laser. Laser injections are more
powerful than clock glitches since the attacker has more control, but they are
more complex and expensive to achieve. The attacker should, for example,
have some knowledge about the layout of the chip. Moreover, with advancing
technologies, transistors decrease in size, making it more difficult to hit individual
bits or bytes. The improvements in technology also reduce the size of the laser
spot, but this is limited by the wavelength and can thus not be made arbitrarily
small. The exact outcome of a laser injection can therefore be hard to predict.
Dutertre et al. [DBC+ 18] recently showed that it is still possible to target
single bits in 28nm CMOS technology. On microcontrollers, the storage is a
popular target for laser injections, since it is easy to distinguish from other
components [KBB+ 18].
More invasive methods exist, which actually alter the chip itself, such as focused
ion beams (FIB). These are relatively expensive and out of scope for this work.

Fault Models. In the theory of SCA, the side-channel observations are typically
characterized by probes on intermediate values. Faults are more complicated
to model because they come in so many variations. We can distinguish them
according to the following characteristics [KSV13]:
• Granularity: Does the fault affect bits, bytes or words? How many of
  them are affected?
• Type of Modification: The fault resets bits to 0, sets them to 1, flips
  them, or replaces them with a random value.
• Degree of Control: How precisely can the location and timing of the
  fault be controlled?
• Duration: The fault is transient, permanent or destructive.
The behaviours of the different types of fault injections fall into different fault
model categories, but not necessarily a single one. For example, with a clock
glitch, one has good control over the timing, but not necessarily over the
location, since this is determined by the critical path delays. Laser injection
gives strong control over both the timing and the location of faults (see for
example Figure 4.1). The modification effect of such a fault is typically a reset
to 0 or a set to 1. Given the large variety of fault injection methods, it is difficult
to create a single fault model that considers all types of faulting adversaries.
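The four modification types above can be made concrete as operations on a machine word. The following sketch (purely illustrative, not taken from any specific tool) applies each type to an 8-bit value; note that a reset fault on a bit that is already 0 leaves the word unchanged, i.e. it is ineffective.

```python
# Illustrative sketch of the four fault modification types on an 8-bit word.
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1

def reset_bits(x, positions):
    """Stuck-at-0 fault: clear the targeted bit positions."""
    for p in positions:
        x &= ~(1 << p)
    return x & MASK

def set_bits(x, positions):
    """Stuck-at-1 fault: set the targeted bit positions."""
    for p in positions:
        x |= 1 << p
    return x & MASK

def flip_bits(x, positions):
    """Bit-flip fault: toggle the targeted bit positions."""
    for p in positions:
        x ^= 1 << p
    return x & MASK

def randomize(x, rng=random):
    """Random fault: the original value is replaced entirely."""
    return rng.randrange(1 << WIDTH)

# A reset fault on bit 0 is effective only if that bit was 1.
assert reset_bits(0b0001, [0]) == 0b0000   # effective
assert reset_bits(0b0010, [0]) == 0b0010   # ineffective
```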

Figure 4.1: Sensitive regions on flash memory. Laser injections in each region
result in a reset of one or two (neighbouring) bits of an instruction opcode.
From top to bottom, the affected bit positions move from least significant to
most significant. [KBB+ 18]

Differential Fault Analysis. Biham and Shamir [BS97] introduced DFA, a


generic fault attack for secret-key cryptosystems and applied it to the Data

Encryption Standard (DES). They consider random transient faults, i.e. during
each encryption, one (or a few) bits are flipped with some small probability.
The fault position is unknown to the attacker. The attack requires about 50
to 200 pairs of ciphertexts (C, C′), with C a correct ciphertext and C′ a faulty
ciphertext for the same plaintext. In other words, one needs to be able to
encrypt each plaintext twice, disturbing only the second encryption. Using
the difference C ⊕ C′ and techniques from differential cryptanalysis [BS90], it
is possible to recover the key. Biham and Shamir further specify that, if the
attacker can choose the location of the faults, only 3 ciphertext pairs (C, C′)
are required. Piret and Quisquater [PQ03] applied DFA to the AES, assuming
a uniformly random fault is injected on a single byte in the last or next-to-last
encryption round. This attack requires two faulty encryptions to recover the
key. Tunstall et al. [TMA11] showed that it is possible to attack AES with a
single faulty ciphertext, in which the fault is injected in the eighth round.
It is notable that the fault injections for these attacks are supposed to affect a
small rather than a large number of bits. Most attacks also benefit from fault
injections in the last rounds of encryption, since these are easier to exploit.
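To make the principle concrete, the sketch below mounts a last-round DFA on a toy one-S-box "cipher" C = S(x) ⊕ k, using the real 4-bit PRESENT S-box. This is an illustrative construction, not one of the attacks cited above: key guesses for which the recovered input difference is inconsistent with a single-bit fault are eliminated.

```python
# Toy last-round DFA sketch (illustrative): one S-box layer C = S(x) ^ k.
# A single-bit fault on x gives C' = S(x ^ delta) ^ k; only key guesses g
# for which INV[C ^ g] ^ INV[C' ^ g] has Hamming weight 1 survive.
import random

SBOX = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]  # PRESENT
INV = [SBOX.index(i) for i in range(16)]

def survivors(pairs):
    """Key candidates consistent with a 1-bit fault for every (C, C') pair."""
    cands = set(range(16))
    for c, c_f in pairs:
        cands = {g for g in cands
                 if bin(INV[c ^ g] ^ INV[c_f ^ g]).count("1") == 1}
    return cands

rng = random.Random(1)
key = 0xA
pairs = []
for _ in range(4):
    x = rng.randrange(16)
    delta = 1 << rng.randrange(4)          # single-bit fault on the input
    pairs.append((SBOX[x] ^ key, SBOX[x ^ delta] ^ key))

assert key in survivors(pairs)             # the true key always survives
```

With each additional pair, the candidate set shrinks towards the true key; a real attack works analogously on the last-round S-boxes of a full cipher.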

4.1.2 Countermeasures

Countermeasures against fault attacks have been devised at various levels. Some
low-level physical countermeasures target the fault injection itself, using active
shields or light detectors or filters in the clock to mitigate glitches [BCN+ 06].
Most countermeasures intend to obstruct the successful analysis of faults. Similar
to hiding techniques against SCA, random delays and shuffling can make it
more difficult for the attacker to inject precise faults, but they cannot prevent
the attacks completely. At a higher level, one could prevent the attacker from
collecting sufficient encryptions with the same key. Fresh re-keying is therefore
a popular mechanism for protecting implementations against both DPA and
DFA [MSGR10].
Within the scope of this work, we only consider countermeasures at the
algorithmic level, i.e. not at the physical or protocol level. The protection of
implementations against faults long precedes the publications on fault attacks.
For example, parity bits or checksums have been used in data transmissions
since the 1950s to detect and correct unintentional errors [Ham50]. Hence, fault
attack countermeasures have a lot in common with techniques from coding
theory. The fundamental requirement is the same: we need redundancy. A
countermeasure is defined, on the one hand, by the type of redundancy it uses
and, on the other hand, by how this information is employed to protect the
implementation against malicious faults.

Redundancy. The simplest form of redundancy is repetition, either in time
(e.g. sequential encryptions of the same plaintext) or in space (e.g. parallel
encryptions of the same plaintext). A drawback of these approaches is that
two identical faults defeat parallel repetition and permanent faults defeat time-
redundancy [SM12]. Alternatively, it is possible to use the inverse operation.
Instead of performing the encryption twice, one can decrypt the obtained
ciphertext and compare the result to the original plaintext.
Again, drawing inspiration from the field of coding theory, among the
most popular forms of redundancy are linear codes for error detection
or correction [KWMK02, KKG03, BBK+ 03, BBKM04, JWK04, KKT04b,
KKT04a]. A study by Malkin et al. [MSY06] showed that error-detecting
codes (EDC) are not necessarily more efficient than simple duplication methods.
Finally, we note that some masking schemes, designed to protect against SCA,
already exhibit redundancy. One example is dual-rail masked logic styles such
as LMDPL [LMW14], where for every bit b there are two wires: one carrying
b and the other its complement b̄. Polynomial masking [PR11] is also amenable
to redundant representations, since the n shares must be points on a polynomial
of degree d. There is redundancy when n > d + 1, which makes it more efficient
than duplication, since we can have n < 2(d + 1).
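The redundancy of a polynomial sharing can be sketched as follows. The example below works over a small prime field for readability (the actual scheme of Prouff and Roche uses binary extension fields) and checks whether all n shares lie on a single polynomial of degree at most d; an additive fault on one share breaks this consistency.

```python
# Illustrative Shamir-style sharing over GF(P) for a small prime P.
# With n > d+1 shares on a degree-d polynomial, a faulted share is
# inconsistent with the polynomial interpolated from the others.
import random

P = 251  # small prime field, chosen only for this sketch

def share(secret, d, n, rng):
    """n points (x, f(x)) on a random degree-d polynomial with f(0)=secret."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(d)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def interpolate_at(points, x0):
    """Lagrange interpolation of `points`, evaluated at x0 modulo P."""
    acc = 0
    for xj, yj in points:
        num, den = 1, 1
        for xm, _ in points:
            if xm != xj:
                num = num * (x0 - xm) % P
                den = den * (xj - xm) % P
        acc = (acc + yj * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return acc

def consistent(shares, d):
    """True iff all shares lie on one polynomial of degree at most d."""
    base = shares[:d + 1]
    return all(interpolate_at(base, x) == y for x, y in shares[d + 1:])

rng = random.Random(0)
d, n = 2, 5                        # 2nd-order sharing with n = 5 > d+1 shares
shares = share(secret=42, d=d, n=n, rng=rng)
assert consistent(shares, d)       # fault-free sharing passes the check

x, y = shares[-1]
shares[-1] = (x, (y + 1) % P)      # additive fault on a single share
assert not consistent(shares, d)   # detected: points no longer interpolate
```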

Using the Redundancy. Lomné et al. [LRT12] distinguish two ways to use the
redundancy: detection and infection. Detection is conceptually straightforward.
In the case of repeated encryptions, one compares the two ciphertexts and with
EDC, one verifies the check bits. If the check fails and thus a fault is detected,
the computation needs to stop in a secure way and the faulty ciphertext should
not be released.
An alternative to stopping the computation is ensuring that any injected fault
results in a random ciphertext, from which the attacker cannot obtain any
information. This concept is called infective computation. Since the success of
DFA relies on the information in the difference C ⊕ C′, a uniformly random
C′ cannot be used to recover the key. Infective computation was introduced
by Yen et al. [YKLM01] for public-key systems. Several proposals were made
for symmetric-key systems as well [LRT12, GST12], but all of them have been
shown to be flawed [BG13].
Additionally, we note that correction is a third method for using the redundancy,
for example using error-correcting codes (ECC) or majority voting.
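Detection and infection can be contrasted on a single redundantly computed byte. The sketch below is illustrative only: real infective schemes such as those cited above operate on full states and are hardened against attacks on the comparison itself. Detection suppresses the output, while the branchless infection step multiplies the difference of the two results by a fresh random field element, so a faulty byte is released as a uniformly random one.

```python
# Illustrative sketch: detection vs. infection on one byte of a
# redundantly computed result (c and c_red should be equal).
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def detect(c, c_red):
    """Detection: suppress the output when the redundant copies differ."""
    return None if c != c_red else c

def infect(c, c_red):
    """Branchless infection: a zero difference leaves c untouched; any
    nonzero difference times a uniform field element is uniform, so a
    faulty result is released as a random byte."""
    diff = c ^ c_red
    return c ^ gf_mul(diff, secrets.randbits(8))

assert detect(0x5A, 0x5A) == 0x5A
assert detect(0x5A, 0x5B) is None
assert infect(0x5A, 0x5A) == 0x5A   # fault-free computation is unaffected
```

The design point of the branchless variant is that there is no comparison instruction for a second fault to skip; the randomization is woven into the data path itself.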

4.1.3 Ineffective Faults and Safe-Errors

In this last subsection, we describe a special category of faults and corresponding
attacks. Yen and Joye [YJ00] were the first to demonstrate that fault injections
can be exploited even without knowledge of faulty ciphertexts. The mere
knowledge of whether a fault injection changes the ciphertext or not can be
sufficient for an attacker to recover the key. This means that these attacks even
benefit from the presence of a fault detecting countermeasure.

Safe-Error Analysis (SEA). A safe-error refers to a fault that still results
in the correct output. For example, if the faulted value is not used or if it is
multiplied with a zero, it does not affect the ciphertext. This can be exploited
to derive secret information [YJ00]. Consider the multiplication z = x × y and
the result when x is corrupted: z′ = x′ × y with x′ ≠ x. Clearly, if z = z′, then
y must be zero.
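The multiplication example above, as a minimal sketch (plain integer arithmetic, purely illustrative):

```python
# Safe-error sketch: corrupt x in z = x * y and compare results.
# Equal results reveal that y must be zero, without any faulty output.
def leaks_y_is_zero(x, y, fault=1):
    z = x * y
    z_faulty = (x ^ fault) * y      # x' != x, since fault is nonzero
    return z == z_faulty            # safe-error  =>  y == 0

assert leaks_y_is_zero(x=13, y=0) is True    # fault disappears: y is zero
assert leaks_y_is_zero(x=13, y=7) is False   # fault propagates: y is nonzero
```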

Ineffective Fault Analysis (IFA). When a zero bit is reset to 0, nothing
happens, i.e. the fault is ineffective. The same holds for a one bit that is set
to 1. Detecting whether or not the fault has any effect allows the attacker
to derive the original state of that bit. Blömer and Seifert [BS03] exploit such
reset faults to recover the AES key with 128 faulty ciphertexts, where each time
a single bit is faulted. These attacks exploit ineffective faults. The difference
between an ineffective fault and a safe-error is that the latter is a nonzero error
that disappears, while the former never changes any value.

Statistical Ineffective Fault Analysis (SIFA). The attack of Blömer and
Seifert requires quite precise laser injections to ensure that the fault is ineffective.
Dobraunig et al. [DEK+ 18] introduced statistical ineffective fault analysis
(SIFA), which relaxes the requirements on the fault injections. They corrupt
several encryptions and keep only the ones with correct ciphertexts (i.e. where
the faults are ineffective). They then exploit that the intermediates in the
computations where the fault was ineffective are non-uniformly distributed.
SIFA needs to fault many (O(1000)) encryptions to obtain a small subset
(O(100)) with ineffective faults.
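The filtering step of SIFA can be simulated in a few lines. The sketch below is illustrative (a real attack targets an intermediate deep inside the cipher and exploits the bias statistically): a stuck-at-0 fault on one bit of a uniform intermediate is injected in every run, and only runs whose result is unchanged are kept. The surviving values are strongly biased.

```python
# SIFA filtering sketch: stuck-at-0 fault on bit 0 of a uniform 4-bit
# intermediate. The runs with correct results form a biased subset.
import random

rng = random.Random(7)
kept = []
for _ in range(1000):
    v = rng.randrange(16)          # intermediate value inside the cipher
    v_faulty = v & ~1              # reset bit 0 (stuck-at-0 fault)
    if v_faulty == v:              # fault ineffective -> ciphertext correct
        kept.append(v)

assert len(kept) > 0
assert all(v & 1 == 0 for v in kept)   # bit 0 is always zero in the subset
```

As the sketch shows, roughly half of the 1000 faulted runs survive the filter, and among those the targeted bit is perfectly biased, which is exactly the non-uniformity SIFA exploits.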

Countermeasures. Since an ineffective fault does not change anything, it
cannot be detected at the algorithmic level. Physical countermeasures can be
used to detect the laser injection or mitigate the clock glitch. Alternatively,
at the protocol level, it is possible to keep track of the number of effective

faults that get detected and prevent the attacker from obtaining sufficient faulty
encryptions to mount the attack. Take SIFA for example. While the ineffective
faults cannot be detected, the effective faults can be detected and used at the
protocol level. Another strategy is to increase the number of encryptions that
appear to have ineffective faults. Faults injected in dummy rounds are, for
example, automatically ineffective, regardless of sensitive data. Error correction
can ensure that even effective faults result in correct ciphertexts.
Safe-errors are also difficult to detect. Countermeasures must ensure that for any
effective fault, one of the following holds: (1) Either the faulty value will not be
combined with sensitive data. Hence, if it becomes a safe-error and disappears,
no secret information is revealed. (2) Or the faulty value is propagated to the
next error check, which means it cannot become a safe-error. For example,
Ishai et al. [IPSW06] describe gates that ensure that any detectable error is
propagated.

4.2 Combined Attacks and Countermeasures

A masked implementation is vulnerable to FA. Vice versa, the power traces of
an implementation with protection against faults still reveal secret information.
Furthermore, the two countermeasures can be in conflict. On the one hand,
redundancy against FA may increase the signal-to-noise ratio (SNR) and
facilitate SCA. On the other hand, masking increases the attack surface for FA.
Hence, it is natural that the two types of countermeasures should be combined.
Moreover, an attacker that can do both SCA and FA might also combine these
capabilities to launch a potentially more efficient and powerful attack. Combined
countermeasures must therefore protect against more than just SCA and FA
separately. Research on both combined attacks and combined countermeasures
is relatively new. In this section, we first give an overview of attacks from the
literature that combine power analysis and fault injections. Next, we outline
recent proposals for countermeasures. We finish with a discussion.

4.2.1 Attacks in the Literature

Setting the Stage. An early example of an attack that combines power analysis
with laser light is by Skorobogatov [Sko06]. Skorobogatov uses laser light to
enhance the power traces and isolate the leakage of individual transistors. Since
the laser light is carefully adjusted so as not to interfere with the operations,
no faults are injected and the attack may be classified as a semi-invasive power

analysis rather than a combined attack. Moreover, Skorobogatov notes that the
attack would not work with modern submicron technologies.

Differential Behavioral Analysis. Robisson and Manet [RM07] combined SEA
with DPA. Like SEA, the attack only exploits whether the computation
completes correctly or not. It targets implementations without protection
against SCA and requires repeated stuck-at faults (identical in each encryption)
on a small number of bits. They only performed the attack on simulated traces.

PACA on Public-Key Systems. Amiel et al. [AVFM07] exploit the fact that
fault countermeasures typically only act at the end of the computation, before
the ciphertext is released at the output. Hence, power analysis can be used
to avoid the error check mechanism and extract information about the faulty
ciphertext. Their attack is applied to a public-key cryptosystem, which is
protected against both SPA and DPA. They use actual power measurements
and techniques from SPA to extract the secrets. They refer to the attack as
PACA, which stands for Passive and Active Combined Attack.

PACA on Secret-Key Systems. Clavier et al. [CFGR10] introduce a chosen-plaintext
and known-ciphertext attack on AES, which exploits both power traces
and faulty ciphertexts. The attack is able to target masked implementations.
However, fault countermeasures which avoid releasing faulty ciphertexts can
prevent the attack. In an alternative attack, Clavier et al. [CFGR10] use
safe-errors and therefore do not require faulty ciphertexts. This attack even
benefits from the presence of fault countermeasures. Both attacks are only
performed in simulation. Roche et al. [RLK11] present an attack similar to that
of Amiel et al. [AVFM07] and target the error detection mechanism, assuming
that it manipulates unmasked ciphertexts. Their attack is also performed in
simulation only.

In Practice. Of the above attacks, none have been used against symmetric-key
ciphers such as AES in practice. Patranabis et al. [PBMB17] performed an attack
on a microcontroller performing the symmetric cipher PRESENT [BKL+ 07].
While this is a side-channel assisted fault attack, they still target an unprotected
implementation of the cipher and require the knowledge of faulty ciphertexts.

Reflection. The combined attacks in the literature can be categorized into
two classes. On the one hand, attacks like those of Amiel et al. [AVFM07] and
Roche et al. [RLK11] use power analysis to sidestep a fault countermeasure and

facilitate FA. These are side-channel assisted fault attacks. On the other hand,
Clavier et al. [CFGR10] use fault injections to reduce the DPA protection order
and facilitate SCA, i.e. they perform a fault assisted side-channel attack.

4.2.2 Countermeasures

Most of the combined countermeasures in the literature are an extension of
a masking scheme with redundancy for fault protection. We describe recent
proposals below. While this approach ensures security against both SCA and
FA separately, their theoretical background does not cover combined attacks.
First, we make a note about fault models for masked implementations.

Fault Models. When an implementation is protected against SCA using
masking, there are two different ways to consider the faults. On the one
hand, we can consider the faults on the intermediates of the implementations,
i.e. on the shares. Their effect can be modelled as described in Section 4.1.1,
i.e. by a reset to 0, a set to 1, a bit flip or random fault. On the other hand,
we can consider the faults on the unmasked intermediates of the algorithm. In
that case, it is typically assumed that a reset to 0 or set to 1 is not possible,
since this requires the same fault on all shares of that intermediate. Hence at
the algorithm level, only additive faults are considered [RLK11]. These can be
modelled as x0 = x ⊕ ∆ with ∆ the injected fault on variable x. Moreover, the
specific value ∆ is more difficult to predict. In unmasked implementations, it is
already a strong assumption that an attacker knows exactly the value of a fault.
In a masked implementation, even if the attacker knows precisely the fault
injected in one share of x, (s)he does not automatically have knowledge about
the values of the remaining shares. Hence, while masking does not prevent fault
attacks, the effect of a fault injection is typically more difficult to predict.

Private Circuits II. Ishai et al. [IPSW06] extended their work on ISW
multiplications with fault protection against two types of adversaries. In the first
version, the adversary can inject an unlimited number of faults, but the effect
of each fault is a reset to 0. Starting from a masked circuit based on [ISW03],
they encode every bit using a Manchester encoding (0 → 01, 1 → 10). As a
result, 00 and 11 are invalid encodings, which only occur as the result of a
fault. They introduce gadgets for the AND and XOR operations, which ensure
that any detectable fault is propagated to an “error cascading stage”. The
error cascading stages act as self-destruction mechanism and ensure that any
invalid encoding is spread to all the wires in the circuit output or memory.
The second type of adversary can also do bit flips and set bits to 1, but is

limited in the number of faults injected. In that case, every bit is encoded
using 2nf wires, with n the number of shares and f the number of allowed
faults per clock cycle. De Cnudde and Nikova [DN16] applied this approach to
the PRESENT block cipher and implemented it on an FPGA. However, since
the ISW countermeasure is not secure in the presence of glitches, they used
threshold implementations (TI) [NRR06] as countermeasure against SCA. The
overhead of the fault protection over the SCA-only countermeasure is a factor
of approximately 8.8 for the adversary that is only allowed reset faults. The
countermeasure is thus very expensive for a not too realistic model.
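The encoding and cascading mechanism can be sketched as follows. This is an illustrative model of the idea, not the actual gate-level construction of [IPSW06]: bits are Manchester-encoded, the invalid pairs (0,0) and (1,1) only arise from faults, and the cascading stage spreads any invalid encoding to every output.

```python
# Illustrative Manchester encoding with an error-cascading stage.
ENC = {0: (0, 1), 1: (1, 0)}    # 0 -> (0,1), 1 -> (1,0)

def is_valid(pair):
    """A valid encoding has exactly one wire high; (0,0)/(1,1) are faults."""
    a, b = pair
    return a ^ b == 1

def cascade(pairs):
    """Error-cascading stage: one invalid encoding infects every output."""
    if all(is_valid(p) for p in pairs):
        return list(pairs)
    return [(0, 0)] * len(pairs)    # self-destruct: all wires invalid

word = [ENC[b] for b in (1, 0, 1, 1)]
assert cascade(word) == word              # fault-free word passes through

word[2] = (0, 0)                          # reset fault on both wires of bit 2
assert all(not is_valid(p) for p in cascade(word))
```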

Error Detecting Codes. Schneider et al. [SMG16] introduced a very efficient
hardware-oriented combined countermeasure, based on TI against SCA and
EDC against FA. They carefully investigated linear codes to ensure that their
scheme provides more protection than simple duplication. They consider an
adversary that can both perform d-th order SCA and inject faults into the
computation. The faults are modelled as additive errors over the entire state:
state′ = state ⊕ Δ, where Δ can follow either a uniform or a biased distribution. A
fault is undetected if it is a valid codeword. Hence the exact fault coverage of an
implementation depends on the used code. A disadvantage of this methodology
is that a specific code is only guaranteed to detect errors of limited Hamming
weight. For example, in the implementation of [SMG16], it is assumed that the
adversary alters at most 3 bits of a codeword. This is an assumption that we
cannot rely on in practice, since, for example, clock glitches and laser injections
may affect more bits. This implementation achieves an overhead of factor 2.56
over the SCA-only protected implementation. The error check is performed in
every clock cycle and hence defeats the combined attacks that assume the error
check only happens at the end of an encryption. On the other hand, the check
mechanism combines shares and is not secure against combined attacks that
target the check itself, as shown in [6].

CAPA. Reparaz et al. [13] took a radically different approach by not starting
from an existing masking scheme. Instead, they exploited the link between
masking as a countermeasure against passive SCA on the one hand and secret
sharing methods for multi-party computation (MPC) on the other. State-of-the-
art MPC protocols consider malicious parties that do not only observe a shared
computation, but also actively deviate from it. By tailoring such active MPC
protocols [DPSZ12] to the embedded systems context, they obtain a combined
countermeasure (named CAPA) with very strong security guarantees against
combined attacks. Reparaz et al. [13] introduce the tile-probe-and-fault model
to formalize their adversary, which is strongly based on the MPC model. In this
model, an embedded system can be seen as consisting of multiple tiles, each

[Figure: four tiles, each with its own RNG and control logic, jointly performing
the shared computation.]

Figure 4.2: Conceptual example of the tile-probe-and-fault model with 4 tiles.

representing one party in the MPC protocol (see Figure 4.2). The adversary
can obtain full control over d of the d + 1 tiles. This represents very well a
combined adversary that can combine knowledge of the intermediates in those
tiles with the capability of changing them. Moreover, the authors explain how
CAPA is even secure against safe-error attacks.
The redundancy in this countermeasure is based on information-theoretic
message authentication code (MAC) tags. Given a secret key α ∈ F, which
is fresh for each encryption, each variable x ∈ F is accompanied by a tag
τ_x = αx ∈ F. The variables x, the tags τ_x and the key α are all manipulated
in shared form only. The fault coverage of the methodology depends on the size
of the MAC tags, i.e. |F|. If the field F is not large enough, multiple keys α_i
can be used to attribute multiple tags to each variable. As a
result, the countermeasure is scalable, but to obtain a good fault coverage, the
overhead becomes prohibitively large in hardware. For the small block cipher
KATAN [DDK09], they obtain an overhead factor of approximately 1 + m for a
fault detection probability of 1 − 2^−m.
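The tag check itself can be sketched in a few lines (illustrative, with all sharing omitted; CAPA never manipulates x, τ or α in unshared form). An additive fault Δ on x escapes the check only if the matching tag offset αΔ is injected as well, which an attacker ignorant of α can only guess with probability 1/|F|.

```python
# Illustrative MAC-tag check in GF(2^8): tau = alpha * x.
import secrets

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

alpha = secrets.randbelow(255) + 1     # fresh nonzero MAC key for this run
x = 0x3A
tau = gf_mul(alpha, x)                 # tag computed alongside the value

assert gf_mul(alpha, x) == tau         # unfaulted pair passes the check

x_f = x ^ 0x01                         # additive fault on the value only
assert gf_mul(alpha, x_f) != tau       # detected: the tag no longer matches

tau_f = tau ^ gf_mul(alpha, 0x01)      # the attacker would need alpha for this
assert gf_mul(alpha, x_f) == tau_f     # only then does the fault go unnoticed
```

Because the field multiplication is linear over GF(2), faulting x by Δ shifts the correct tag by exactly αΔ; without α, every nonzero tag offset is a blind guess among |F| possibilities.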

M&M. With M&M, which stands for Masks and MACs, De Meyer et al. [6]
introduce a family of countermeasures based on the same framework as CAPA,
but with a relaxed adversary model. They show how any masking scheme can
be extended with information-theoretic MAC tags, without using the expensive
MPC calculations. As a result, instead of the tile-probe-and-fault model, their
adversary model is more similar to that of ParTI [SMG16], with the exception
that faults are not limited in Hamming weight and can affect any number of bits.
Instead of using an error check mechanism which is easily vulnerable to combined
attacks, they use infective computation to ensure that detectable faults result in
random ciphertexts. As with CAPA, the fault coverage depends on the size of the

Table 4.1: Comparing combined countermeasures in hardware

Countermeasure              | Overhead Factor* | SCA model              | FA model            | Combined Attacks
Par TI [SMG16]              | 2.56             | d-probing              | additive, ≤ 3 bits  |
M&M [6]                     | ∼ 1 + 1.58m      | d-probing              | additive, O(d) bits |
Private Circuits II [DN16]  | 8.8              | d-probing              | reset, unlimited    | X
CAPA [13]                   | ∼ 1 + m          | d-tile-probe-and-fault | (same model)        | X

* for a serial implementation with one S-box.

MAC tags and is scalable. By stepping away from MPC protocols, M&M loses
the provable security against combined attacks. In contrast with ParTI [SMG16],
there are no intermediate error checks and the tags are only used at the end of
each encryption. Nevertheless, none of the attacks described in Section 4.2.1
are effective against it. Moreover, the countermeasure is much more efficient
than CAPA. The authors provide example implementations for AES, achieving
an overhead factor of 2.53–2.63 for a fault coverage of 1 − 2^−8.

Polynomial Masking. Seker et al. [SFES18] introduced a combined countermeasure
based on the polynomial masking scheme of Prouff and Roche [PR11].
They exploit the redundancy that is already present in the secret sharing scheme.
Since a d-th order sharing consists of n > d points on a polynomial of degree d,
any error in those points is likely to result in a polynomial of higher degree
after interpolation. The smaller the number of modified shares, the larger
their detection probability. Moreover, they design new gadgets for additions
and multiplications to ensure that every error propagates. At the end of each
encryption, they use infective computation to prevent faulty ciphertexts from
being released. Despite being based on a glitch-resistant masking scheme, this
countermeasure has not yet been implemented in hardware.

4.2.3 Discussion

MAC tags vs. EDC. The information-theoretic MAC tags were inherited
from MPC protocols, where the malicious attacker has very strong control over
the errors injected. The MPC adversary has the ability to exactly change one
share of an intermediate to another known value with success probability one.
For fault injections with clock glitches or a laser, such assumptions are too
strong, as the precise effect of a fault is difficult to control and predict. Hence,
the information-theoretic MAC tags might be considered more expensive than
strictly necessary. Moreover, to get a proper fault detection probability, the
tag size needs to be made quite large, which results in a large overhead. In
contrast, EDC are efficient and get very high fault coverage for faults within the
adversary model, but might not detect faults that exceed a particular Hamming
weight. With the MAC tags, the fault coverage does not depend on the number
of bits affected by a fault. Based on the state-of-the-art results, it is not yet
clear which of the two can achieve the best trade-off between fault coverage and
cost.
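The trade-off can be illustrated with a toy comparison between a linear EDC and a multiplicative MAC tag. The field, key byte and fault value below are arbitrary illustrative choices:

```python
# Sketch: why a linear EDC can miss faults that an information-theoretic
# MAC tag cannot. All concrete values here are illustrative only.

def gf256_mul(a, b, mod=0x11B):
    """Multiplication in GF(2^8), here with the AES reduction polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= mod
        b >>= 1
    return r

def parity(x):
    return bin(x).count("1") & 1

x = 0xA5
e = 0b00000011            # an additive fault flipping two bits

# A single parity bit (a linear EDC) misses every even-weight fault:
assert parity(x ^ e) == parity(x)

# A multiplicative MAC tag tau = alpha * x with secret alpha != 0 changes
# by alpha * e != 0 for every nonzero e, so the check always fails unless
# the attacker can also adjust the tag consistently, which requires alpha:
alpha = 0x37              # secret MAC key (illustrative value)
tag = gf256_mul(alpha, x)
assert gf256_mul(alpha, x ^ e) != tag
```

The MAC detection argument does not depend on the Hamming weight of e, which is exactly the property discussed above; its cost lies in the tag size needed to keep the attacker's guessing probability low.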

Adversary Models. The countermeasures in Section 4.2.2 are all profoundly
different in the type of adversaries they consider. Hence, it is very difficult to
make a fair comparison of their cost. Table 4.1 gives an overview of some
of their properties, listing the countermeasures approximately in order of
increasingly powerful adversary. The overhead cost factor refers to the overhead
of an implementation with combined protection over a SCA-only protected
implementation. These numbers are only approximations and have been based
on a single implementation for each countermeasure. At this preliminary stage
of the research, the adversary model is a more interesting point for comparison.
None of the proposed models are close representations of reality. On the one
hand, the model derived from EDC limits the Hamming weight of faults. On the
other hand, the security provided by the MAC tags might be stronger than necessary. M&M is much
more efficient than CAPA, but does not offer provable security against combined
attacks. CAPA achieves provable security in the very powerful tile-probe-and-
fault model, which is based on a much too powerful MPC adversary. Yet,
the assumption that an adversary has complete access to a tile is not entirely
excessive. Recall that the previous chapters remarked that the independent
leakage assumption (ILA) is not always valid, especially on software platforms.
The tile-probing model only requires independent leakage between the tiles, but
not within. Hence, this model may be useful, even in the context of side-channel
protection only.

Practical Considerations for the Tile Model. The tile-probe-and-fault
model [13] does not specify physical requirements. In practice, it should be
combined with some implementation-specific measures. It is assumed that a
uniformly random fault may affect all tiles, but that targeted (known) faults
are only injected into d of the d + 1 tiles. This assumption is only justified
if an implementation on FPGA or ASIC platforms uses proper placement to
physically separate the tiles. On a microcontroller, it is suggested that different
cores may fulfil the task of different tiles. This matches nicely with the fact
that the ILA is known not to be valid within a single core. Naturally, each tile
should be allocated a separate chunk in memory. The physical separation is
motivated by laser injections. For clock glitches on a hardware platform, we
should add additional constraints. Since the different tiles perform very similar
(if not identical) calculations on different shares, it is probable that their critical
path delays are comparable and that a clock glitch affects each tile in the same
way. Hence, to adhere to the tile model, one should introduce slight variations
in placement and routing when implementing the different tiles. Alternatively,
one could supply each tile with a separate clock line, but this would bring
additional synchronization issues.
The above is not only true for the tile-based model. In many masking schemes,
identical operations are performed on different shares and their implementations
are thus likely to have the same critical path delays. In most combined
countermeasures, an identical fault in all shares may be harmful. Hence,
variations in the placement and routing of different shares could be used as an
additional countermeasure.

Attacks vs. Countermeasures. Many of the attacks in Section 4.2.1 still
target unmasked implementations or assume the error check uses unmasked
ciphertexts. Hence, it looks like the combined countermeasures are still ahead of
the combined attacks. The ParTI error check may be vulnerable, but both M&M
and the polynomial masking based countermeasure use infective computation to
avoid any recombination of the shares of the ciphertext. Still, the redundancy
in these countermeasures is only used at the end of the encryption, so in theory,
a combined attack could bypass this and use power measurements to obtain
information about the effect of a fault. It remains to be investigated how to
exploit the leakage from a masked implementation which has been faulted,
without knowledge of the faulty ciphertext. To this day, such an attack has
not yet been described. Moreover, the existing attacks in the literature that
target symmetric-key ciphers have only been performed on simulated traces
and not on actual platforms. A demonstration of a practical attack on one of
the countermeasures in Section 4.2.2 could give clarity on the type of adversary
model we need. The SIFA attacks [DEK+ 18] may be a good starting point,
since they exploit the presence of a biased distribution and are effective on
masked implementations [DEG+ 18]. While it does not qualify as a combined
attack, even combined countermeasures such as M&M are vulnerable to it. We
refer to Section 4.1.3 for our discussion on countermeasures against ineffective
faults. However, a combined attack that uses power measurements to bypass
the fault countermeasure, would not need to rely on ineffective faults.

Verification. A large part of Chapter 3 was devoted to the issue of verifying
the security of SCA-protected implementations. Even though masking has
an established background, a lot of questions still surround its verification.
Nevertheless, test vector leakage assessment (TVLA) and the statistical t-test
have emerged as very useful tools, which allow analysing real measurements,
independently from specific attacks. The verification of fault detection
mechanisms is a subject that (like the mechanisms themselves) precedes the
publications on fault attacks. Any software or hardware component must
undergo thorough functional and robustness testing before being put to use,
since unintentional faults commonly appear. In contrast, in cryptography, we
need to be able to test the ability of a countermeasure to detect maliciously
injected faults, which tend to have a different characterization than accidental
faults. The large variability in fault models and ways to exploit them contributes
to the complexity of the verification problem. Arribas et al. [AWMN19]
recently introduced a formal model and tool for cryptographic fault diagnosis.
They applied it to many of the combined countermeasures of Section 4.2.2
and verified their claimed fault coverage. Their tool only verifies the fault
detection mechanism and is a recent addition to the state-of-the-art. This
shows that we are still far from having a generic verification method for
security against combined attacks. This is also evidenced by the lack of such
attacks that have been performed in practice. The literature now includes
several proposals for combined countermeasures, but no methods to verify
their security against combined attacks. As argued in the previous chapter,
security proofs in particular theoretical models alone would not suffice.
Any verification methodology for combined SCA- and FA-resistance should
also include experiments with practical power measurements and real fault
injections.

Randomness. The quite expensive countermeasures of Section 4.2.2 all ignore
one fundamental problem. The most obvious combined attack on a masked
implementation is one that injects faults into the random number generator
(RNG) and performs first-order DPA. In the worst case, the faults completely
disable the RNG and eliminate the randomness, but also biased randomness
can debilitate the masking countermeasure. Yao et al. [YYP+ 18] demonstrated
a practical attack of this type and successfully recovered the key of a masked
AES implementation. They first use power analysis to locate the transfer of
random numbers from the RNG to the masked AES. With clock glitching, they
can skip the mask transfer operation, which effectively disables the randomness.
As a result, first-order DPA can recover the key from the masked AES. In
theory, such an attack could target a masked implementation of any security
order with first-order DPA. Hence, as long as the RNG is unprotected against
faults, none of the combined countermeasures in the literature will prevent key
recovery by a combined adversary. We argued in Chapter 2 that the RNG
of a masked implementation does not require masking itself. The attack of
Yao et al. makes it clear that the RNG does require protection against fault
injections. Only effective faults that bias the randomness and affect the masking
countermeasure are a threat to the RNG, so any combination of countermeasures
from Section 4.1.2 is applicable here. Physical and algorithmic countermeasures
could be used to detect whether the RNG is under attack, but this raises the
question of how to handle this event. Possibly, error correction (for example
with ECC or majority voting) could be used to ensure a minimal level of
service. Robust RNG constructions for these scenarios are a very interesting
and important direction for future research.
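As a minimal sketch of the majority-voting idea, assuming three redundant randomness lanes (an illustrative setup, not a concrete RNG design from the literature):

```python
# Sketch: bitwise majority voting across three redundant randomness lanes,
# one possible way to keep masking randomness available when a fault hits
# a single lane. The three-lane setup itself is an illustrative assumption.

def majority3(a, b, c):
    """Bitwise majority of three words: each output bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)

fresh = 0b10110100                      # value all lanes should carry
faulted = fresh ^ 0b01000001            # one lane corrupted by a fault

# A single corrupted lane is outvoted by the two intact ones:
assert majority3(fresh, fresh, faulted) == fresh
assert majority3(fresh, faulted, fresh) == fresh
assert majority3(faulted, fresh, fresh) == fresh
```

A voter of this kind tolerates a fault in any single lane; it does not, of course, help against an adversary who can bias all lanes simultaneously.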

4.3 My Contributions in this Context

4.3.1 CAPA: The Spirit of Beaver against Physical Attacks

Context. Both SCA and FA are real threats to embedded cryptography and
countermeasures against both attacks have mostly been studied separately.
Recent attacks have shown that fault injections can facilitate side-channel
attacks and that power measurements can be used to circumvent fault
countermeasures [AVFM07, CFGR10, RLK11]. This calls for new designs
of countermeasures which do not only separately protect against SCA and FA,
but also against possible combined attacks, in which side-channel measurements
and faults are jointly exploited. One of the first proposals in this area is due
to Ishai et al. [IPSW06]. However, the adversary model in this work is either
limited to reset faults, or bounded in the number of bits that are affected.
Neither corresponds to realistic fault injections. Moreover, the scheme results
in a very large overhead. A much more efficient countermeasure by Schneider et
al. [SMG16], combines the masking countermeasure of TI [NRR06] with EDC.
However, as a superposition of masking with redundancy, this scheme does not
provide security against combined attacks.

Contribution. In a work published at Crypto 2018 [13], we introduce two
things, which were also briefly described in Section 4.2.2 of this chapter: Firstly,
we introduce the tile-probe-and-fault model, a model generalising the wire-
probing model of Ishai et al. [ISW03], extending it to cover both more realistic
side-channel leakage scenarios on a chip and also to cover fault and combined
attacks. Secondly, we introduce CAPA: a combined Countermeasure Against
Physical Attacks. Our countermeasure is motivated by our model and aims
to provide security against higher-order SCA, multiple-shot FA and combined
attacks. The tile-probe-and-fault model leads one to naturally look (by analogy)
at actively secure multi-party computation protocols. Indeed, CAPA draws much
inspiration from the MPC protocol SPDZ [DPSZ12]. To demonstrate that the
model, and the CAPA countermeasure, are not just theoretical constructions,
but could also serve to build practical countermeasures, we present initial
experiments of proof-of-concept designs using the CAPA methodology. Namely,
a hardware implementation of the KATAN and AES block ciphers, as well as a
software bitsliced AES S-box implementation. We demonstrate experimentally
that the design can resist second-order DPA attacks, even when the attacker
is presented with many hundreds of thousands of traces. In addition, our
proof-of-concept can also detect faults within our model with high probability
in accordance with the methodology. This work can be found on page 205.

Follow-up. At COSADE 2019 [12], we presented an application of CAPA to
the Keccak permutation. We systematically explore the speed-area trade-off
with four architectures and show that CAPA, in spite of its algorithmic overhead,
can be very fast or reasonably small. In fact, for the standardized Keccak-
f[1600] instance, our low-latency version is nearly twice as fast as the previous
implementations that only consider side-channel security, at the cost of area
and randomness consumption. For all four presented designs, the protection
level for side-channel and fault attacks can be scaled separately and to arbitrary
order. To evaluate the physical security, we assess the side-channel leakage
of a representative second-order secure implementation on FPGA. We also
experimentally validate the claimed fault detection probability.

4.3.2 M&M: Masks and Macs against Physical Attacks

Context. Recent proposals in the area of combined countermeasures have
shown that there is a significant trade-off between the implementation cost and the
strength of the adversary model. CAPA [13] is the first combined countermeasure
to provide provable security against combined attacks. However, the overhead
of this countermeasure is so large that an AES implementation does not fit on
a state-of-the-art FPGA. ParTI [SMG16] is a very efficient scheme, but its use
of EDC implies that only faults with limited Hamming weight are guaranteed
to be detected.

Contribution. In a work published at CHES 2019 [6], we introduce a new family
of combined countermeasures, M&M, which combine Masking with information-
theoretic MAC tags and infective computation. We describe how a combined
protected implementation can be built from any SCA-secure masking scheme.

M&M can be instantiated from any d-th-order secure masking scheme, and hence
achieves generic order of protection for SCA. The combination with MAC tags
then ensures generic order of protection against DFA and the combination of SCA
and DFA. As opposed to EDC, the MAC mapping is perfectly unpredictable,
eliminating the possibility of smart undetectable faults. This also makes M&M
secure against faults that affect any number of bits. It thus works in a stronger
adversary model than the existing scheme ParTI, yet is a lot less costly to
implement than the provably secure MPC-based scheme CAPA. We demonstrate
M&M with first- and second-order secure implementations of the AES cipher.
This example shows that M&M can be very efficient in area with an overhead
factor of merely 2.53 compared to an implementation that protects only against
SCA. We perform a SCA evaluation of our implementations where no leakage
is found with up to 100 million traces. Additionally, we design and perform
a fault evaluation to confirm our theoretically claimed fault coverage. This
methodology was later extended into a generic fault detection tool [AWMN19].
We include the paper on page 229.

4.4 Conclusion

In this chapter, we looked at the emerging field of combined countermeasures
against physical attacks, including side-channel and fault attacks. As a very
new research topic, it naturally produces a lot of open questions.

Battle of the Adversary Models. In the recent literature, a myriad of new
combined countermeasures has been introduced. It looks like each new
countermeasure comes with a different adversary model. As a result, it is
impossible to make a straightforward comparison of the costs of these new
schemes. While masking countermeasures are exploring the trade-off between
area, randomness and speed, the new combined countermeasures force us to
choose between overall cost and security level. Before moving forward, the
community should agree on one or a few unified models. Part of the reason
that we do not have this yet, is a lack of understanding of combined attacks
and what is essentially required in a combined countermeasure.

Prepare for the Real World. In the search for proper models of combined
adversaries, we should avoid as much as possible a disconnect between theory
and practice, such as can be witnessed in the world of masking. Naturally,
as argued in the previous chapter, a gap will always exist between the two.
However, we can bring them as close together as possible. Before moving forward
with too expensive models, we need to see evidence of the practical feasibility
of a combined attack on a block cipher implementation with combined masking
and fault protection. To choose the right type of redundancy (e.g. MAC tags
vs. EDC), we should investigate the fault distributions that can be produced
by different fault injection methods in practice. As much as possible, we should
use practical results to design and justify theoretical adversary models that are
representative of the real-life threats.

Optimisation. Once we have more clarity on the models, we can continue
with the design and optimization of combined countermeasures. Should the
tile-probe-and-fault model remain of interest (e.g. in software), recent advances
in the field of MPC will help reduce the cost of schemes such as CAPA. On
the other hand, we might find methodologies to get provable security against
combined attacks, without relying on expensive MPC machinery.

Verification. Further research on combined countermeasures is also constrained
by the lack of verification methodologies. Current proposals have been verified
to provide the claimed protection against SCA and FA separately. Any claims
made about combined attacks are difficult to confirm. In the best case, we
would obtain a TVLA-like methodology that aims to detect leakage in the
power traces of disturbed executions of an implementation. The problem is
again that there is a too large variability in the type of faults and how they can
be exploited.

Inspiration. While originally designed with combined attackers in mind, the
tile-probe-and-fault model comes with an interesting property regarding the
ILA. At the same time, the masking community is realizing that the ILA does
not hold in the data path of a modern CPU. An interesting direction for future
research would be to look at a passively secure version of CAPA (i.e. in a
tile-probing model without faults) and compare its cost and leakage behaviour
to state-of-the-art Boolean and inner product masking schemes.
On the combined countermeasure side, the use of inherent redundancies in some
masking schemes, such as done by Seker et al. with polynomial masking,
is an appealing approach. Dual-rail masked logic styles compute on two
complementary wires for each bit. This redundancy is reminiscent of that
used by Ishai et al. (the Manchester encoding). In Chapter 2, we identified
dual-rail masked logic styles as a promising direction for future research, since
it allows to achieve relatively low latency implementations. Given the inherent
redundancy, it would also be interesting to investigate their combination with a
fault protection mechanism.

Robust Randomness. Finally, as mentioned earlier in this chapter, none of
the combined adversary models or countermeasures will effectively protect an
implementation if the RNG can be faulted. Hence, the most important target
for future research is to design an RNG which continues to produce randomness
of sufficient quality, even when under attack by fault injections. As the attack
by Yao et al. demonstrated, also the transfer of the random masks to the
masked implementation is a critical point of attack.
Chapter 5

Design of Symmetric
Cryptographic Primitives

Cryptographic primitives have been designed to be secure against mathematical
attacks in a black-box model. In the previous chapters, it was shown how
such primitives can be implemented in a way that they are also secure against
physical attacks, in a grey-box model. The increased security always comes with
a high price tag in terms of implementation cost. In this final chapter, we look
at how the traditional design principles can be at odds with the optimization of
the implementations and how they can evolve to be more suitable for embedded
systems. Our treatment will focus mostly on the nonlinear components of
symmetric primitives, e.g. S-boxes, since the cost of implementations is
dominated by their overhead. However, keep in mind that the design of the
linear layers might also affect the cost of side-channel attack countermeasures.
We first delve into the topic of S-boxes in Section 5.1 and explain some of their
properties and methods of classification. We consider both mathematical and
implementation properties. Next, Section 5.2 considers the design of symmetric
cryptosystems (especially S-boxes) from the embedded systems engineer’s point-
of-view. We list optimization goals for hardware and software implementations
and discuss the state-of-the-art, including proposals in the recent National
Institute of Standards and Technology (NIST) lightweight competition.1 In
Section 5.3, we describe our contributions in this field and Section 5.4 concludes
the chapter.
1 https://csrc.nist.gov/Projects/Lightweight-Cryptography


5.1 S-box Properties and Affine Equivalence

We list here cryptographic properties, which indicate the S-box's strength
against mathematical attacks (i.e. cryptanalysis) and which were traditionally
considered the principal evaluation criteria in the choice of S-boxes for primitives.
Next, we describe S-box classifications, a popular method to simplify the
enormous search space of S-boxes and detail the most important results from
the literature in this context. Finally, we identify some properties, which give
information on the cost of implementing an S-box.

5.1.1 Cryptographic Properties

Notation. An S-box is typically a balanced vectorial Boolean function
F : F_2^n → F_2^m, where each output y = F(x) ∈ F_2^m is equiprobable over
the inputs x ∈ F_2^n. Often, n = m and thus F is bijective. We denote the bits
of x ∈ F_2^n by x_i for i = 0 . . . n − 1. An n × m vectorial Boolean function
can be split into m coordinate functions, each of which is a Boolean function
f_i : F_2^n → F_2 for i = 0 . . . m − 1. Let ◦ denote the composition of functions,
i.e. for F_1 : F_2^m → F_2^l and F_2 : F_2^n → F_2^m: F_1 ◦ F_2(x) = F_1(F_2(x)).
We also consider the inner product of two bit-vectors as ⟨x, y⟩ = Σ_i x_i y_i.

Algebraic Normal Form (ANF). The algebraic normal form is a unique
representation of a Boolean function f : F_2^n → F_2 as a multivariate polynomial:

    f(x) = \sum_{j \in \mathbb{F}_2^n} \alpha_j \prod_{i=0}^{n-1} x_i^{j_i}    (5.1)

Algebraic Degree. The algebraic degree of a Boolean function f : F_2^n → F_2 is
the highest degree that occurs in the ANF. It can be described as

    \mathrm{Degr}(f) = \max_{j \in \mathbb{F}_2^n,\ \alpha_j \neq 0} HW(j)    (5.2)

with HW(j) the Hamming weight of j. The algebraic degree of a vectorial
Boolean function F : F_2^n → F_2^m is the largest degree of its coordinate functions:

    \mathrm{Degr}(F) = \max_{0 \leq i < m} \mathrm{Degr}(f_i)    (5.3)

The algebraic degree (and more generally complexity of the algebraic description)
plays a role in the resistance against algebraic attacks [CP02], which target a
cryptosystem by considering it as a system of equations. For an n-bit bijective
S-box, the largest possible algebraic degree is n − 1.
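The ANF coefficients, and with them the algebraic degree of Eq. (5.2), can be computed from a truth table with the binary Möbius transform; a small sketch:

```python
# Computing ANF coefficients with the binary Moebius transform, and from
# them the algebraic degree defined in Eq. (5.2).

def anf_coeffs(tt):
    """tt: truth table of f over F_2^n, tt[x] in {0,1}, len(tt) = 2^n.
    Returns a with a[j] = alpha_j, i.e. the XOR of f(x) over all x that
    are bitwise subsets of j."""
    a = list(tt)
    step = 1
    while step < len(a):
        for x in range(len(a)):
            if x & step:
                a[x] ^= a[x ^ step]
        step <<= 1
    return a

def degree(tt):
    """Algebraic degree: largest Hamming weight of j with alpha_j = 1."""
    return max((bin(j).count("1") for j, c in enumerate(anf_coeffs(tt)) if c),
               default=0)

# majority(x0, x1, x2) = x0x1 + x0x2 + x1x2 has degree 2:
maj = [0, 0, 0, 1, 0, 1, 1, 1]
assert degree(maj) == 2
# x0 (a linear function) has degree 1; x0x1x2 reaches the maximum n = 3:
assert degree([x & 1 for x in range(8)]) == 1
assert degree([0] * 7 + [1]) == 3
```

The transform is its own inverse, so the same routine converts ANF coefficients back into a truth table.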

Differential Uniformity. Let F : F_2^n → F_2^m be a vectorial Boolean function.
We define its difference distribution table (DDT) [BS90] as δ_F with, for α ∈ F_2^n
and β ∈ F_2^m:

    \delta_F(\alpha, \beta) = \#\{x \in \mathbb{F}_2^n : F(x \oplus \alpha) = F(x) \oplus \beta\}    (5.4)

The differential uniformity [Nyb93] is the largest value in the DDT for α ≠ 0:

    \mathrm{Diff}(F) = \max_{\alpha \neq 0,\ \beta} \delta_F(\alpha, \beta)    (5.5)

This metric indicates the difficulty of differential cryptanalysis [BS90], a
statistical attack methodology which exploits the probability that some input
difference propagates to some output difference through the cipher. The larger
the value Diff, the less uniform the probabilities in δF are and thus, the less
resistant a function is against differential cryptanalysis. The lower bound for
the differential uniformity of bijective S-boxes is 2. The S-boxes that obtain this
limit (and thus have DDT with only values 0 and 2) are called almost perfect
nonlinear (APN).
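A direct computation of the DDT of Eq. (5.4) and the differential uniformity of Eq. (5.5), illustrated on the cube mapping x ↦ x³ over GF(2³) with reduction polynomial x³ + x + 1 (a Gold function, known to be an APN permutation):

```python
# Naive computation of the DDT (Eq. 5.4) and the differential uniformity
# (Eq. 5.5) of an n-bit S-box, given as a lookup table.

def ddt(sbox):
    size = len(sbox)
    table = [[0] * size for _ in range(size)]
    for a in range(size):
        for x in range(size):
            table[a][sbox[x] ^ sbox[x ^ a]] += 1
    return table

def diff_uniformity(sbox):
    table = ddt(sbox)
    return max(table[a][b] for a in range(1, len(sbox))
               for b in range(len(sbox)))

# Lookup table of x -> x^3 over GF(2^3), an APN permutation:
cube = [0, 1, 3, 4, 5, 6, 7, 2]
assert diff_uniformity(cube) == 2        # the APN bound

# The identity map is maximally weak: delta(a, a) = 2^n for every a.
assert diff_uniformity(list(range(8))) == 8

# Basic DDT invariants: every row sums to 2^n and all entries are even.
t = ddt(cube)
assert all(sum(row) == 8 for row in t)
assert all(v % 2 == 0 for row in t for v in row)
```

The even-entry invariant follows because x and x ⊕ α always contribute as a pair to the same DDT cell.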

Linearity. Another statistical cryptanalysis which is considered an important
threat to symmetric-key cryptosystems is linear cryptanalysis [Mat93]. Instead
of considering input- and output-differences of functions, this attack considers
linear combinations of the bits of inputs and outputs. Similarly, we can define a
property which measures the resistance of functions against this type of attack.
The two-dimensional Walsh spectrum of a function F : F_2^n → F_2^m is defined as:

    \hat{F}(\alpha, \beta) = \sum_{x \in \mathbb{F}_2^n} (-1)^{\langle \alpha, x \rangle} \cdot (-1)^{\langle \beta, F(x) \rangle}    (5.6)

for α ∈ F_2^n and β ∈ F_2^m. It can also be computed as a linear approximation
table (LAT) [CV94]:

    \hat{F}(\alpha, \beta) = 2\,\#\{x \in \mathbb{F}_2^n : \langle \alpha, x \rangle = \langle \beta, F(x) \rangle\} - 2^n    (5.7)

The linearity is the largest absolute value in the LAT for β ≠ 0:

    \mathrm{Lin}(F) = \max_{\beta \neq 0,\ \alpha} |\hat{F}(\alpha, \beta)|    (5.8)

In some sense, the linearity measures how easy it is to approximate a function
by a linear function. Naturally, the smaller this value, the better the resistance
against linear cryptanalysis. The lower bound for the linearity of bijective
S-boxes is 2^{(n+1)/2} and the S-boxes for which the linearity equals this limit
are called almost bent (AB). It was shown that every AB function is also
APN [CV94].
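The linearity can be computed naively from Eq. (5.6); a sketch, again using the cube permutation over GF(2³), which is almost bent:

```python
# Naive Walsh spectrum (Eq. 5.6) and linearity (Eq. 5.8). Fine for small
# S-boxes; larger sizes would use a fast Walsh-Hadamard transform instead.

def ip(a, x):
    """Inner product <a, x> over F_2: parity of the AND of the bit-vectors."""
    return bin(a & x).count("1") & 1

def walsh(sbox, a, b):
    return sum((-1) ** (ip(a, x) ^ ip(b, sbox[x])) for x in range(len(sbox)))

def linearity(sbox):
    size = len(sbox)
    return max(abs(walsh(sbox, a, b))
               for b in range(1, size) for a in range(size))

# The APN cube permutation x -> x^3 over GF(2^3) is also almost bent (AB),
# meeting the bound Lin = 2^((n+1)/2) = 4 for n = 3:
cube = [0, 1, 3, 4, 5, 6, 7, 2]
assert linearity(cube) == 4

# The identity permutation is perfectly linear: Lin = 2^n.
assert linearity(list(range(8))) == 8
```

The two examples mark the extremes: the AB bound on one end and a fully linear map, trivially broken by linear cryptanalysis, on the other.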

The AES S-box. To illustrate these properties, we consider again the main
example of this work, the AES S-box. While we originally defined it as a
function over F_{2^8} in § 1.1, it can also be represented as a vectorial Boolean
function over F_2^8. This function has the maximum algebraic degree of 7. Its
differential uniformity and linearity are respectively 4 and 32. While not AB
nor APN, this S-box remains to this day the best 8-bit S-box in the literature in
terms of cryptographic properties. No S-boxes with lower differential uniformity
or linearity have been found and it is not clear whether they even exist. The
main reason that we are still unsure about this is the magnitude of the search
space.
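These figures can be reproduced by rebuilding the S-box from its textbook definition, inversion in GF(2⁸) followed by the affine transformation of FIPS 197; a sketch:

```python
# Reconstruct the AES S-box (GF(2^8) inversion + affine map, FIPS 197)
# and verify the stated figures: differential uniformity 4, linearity 32.

def gf256_mul(a, b, mod=0x11B):          # AES polynomial x^8+x^4+x^3+x+1
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= mod
        b >>= 1
    return r

def gf256_inv(a):
    r = 1
    for _ in range(254):                  # a^254 = a^(-1); 0 maps to 0
        r = gf256_mul(r, a)
    return r

def rotl8(x, k):
    return ((x << k) | (x >> (8 - k))) & 0xFF

def aes_sbox(x):
    y = gf256_inv(x)
    return y ^ rotl8(y, 1) ^ rotl8(y, 2) ^ rotl8(y, 3) ^ rotl8(y, 4) ^ 0x63

S = [aes_sbox(x) for x in range(256)]
assert S[0x00] == 0x63 and S[0x53] == 0xED    # known values from FIPS 197

# Differential uniformity (Eq. 5.5):
diff = 0
for a in range(1, 256):
    row = [0] * 256
    for x in range(256):
        row[S[x] ^ S[x ^ a]] += 1
    diff = max(diff, max(row))
assert diff == 4

# Linearity (Eq. 5.8) via a fast Walsh-Hadamard transform per output mask:
lin = 0
for b in range(1, 256):
    f = [1 - 2 * (bin(b & s).count("1") & 1) for s in S]
    step = 1
    while step < 256:
        for i in range(0, 256, 2 * step):
            for j in range(i, i + step):
                f[j], f[j + step] = f[j] + f[j + step], f[j] - f[j + step]
        step <<= 1
    lin = max(lin, max(abs(v) for v in f))
assert lin == 32
```

The fast Walsh-Hadamard transform turns the 2²⁴ naive LAT evaluations into roughly 2⁸ · n · 2⁸ additions, which is why it is the standard tool for 8-bit S-boxes.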

5.1.2 Classifications

When looking for S-boxes with good properties, we deal with a dimensionality
problem. The number of possible bijections on n bits is 2^n!, which prohibits
exhaustive search for n > 3. To manage the enormous search spaces of S-boxes,
we divide them into classes, defined based on an equivalence property.

Affine Equivalence. It has been shown that transforming the inputs
and outputs of an S-box with an affine function preserves many of its
cryptographic properties, including the algebraic degree, differential uniformity
and linearity [CCZ98]. Following this observation, we can define an equivalence
relation based on such transformations. We call two functions F_1 : F_2^n → F_2^m
and F_2 : F_2^n → F_2^m affine equivalent [CCZ98] if and only if there exists a pair
of invertible n-bit and m-bit affine bijections A and B such that F_1 = B ◦ F_2 ◦ A.
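This invariance is easy to check empirically; the sketch below applies random invertible affine maps (generated with illustrative helper functions) to a 3-bit APN permutation and verifies that the differential uniformity is unchanged:

```python
# Empirical check that composing an S-box with invertible affine maps,
# F -> B o F o A, leaves the differential uniformity unchanged.

import random

def diff_uniformity(sbox):
    size = len(sbox)
    best = 0
    for a in range(1, size):
        row = [0] * size
        for x in range(size):
            row[sbox[x] ^ sbox[x ^ a]] += 1
        best = max(best, max(row))
    return best

def rank_gf2(rows):
    """Rank of a binary matrix whose rows are stored as integers."""
    rows, r = list(rows), 0
    for bit in reversed(range(len(rows))):
        pivot = next((i for i in range(r, len(rows))
                      if (rows[i] >> bit) & 1), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and (rows[i] >> bit) & 1:
                rows[i] ^= rows[r]
        r += 1
    return r

def random_affine(n, rng):
    """A random invertible n x n matrix (rows as ints) plus a constant."""
    while True:
        rows = [rng.getrandbits(n) for _ in range(n)]
        if rank_gf2(rows) == n:
            return rows, rng.getrandbits(n)

def apply_affine(rows, c, x):
    y = c
    for i, row in enumerate(rows):
        y ^= (bin(row & x).count("1") & 1) << i   # bit i = <row_i, x>
    return y

rng = random.Random(2020)
F = [0, 1, 3, 4, 5, 6, 7, 2]                  # a 3-bit APN permutation
A, a = random_affine(3, rng)
B, b = random_affine(3, rng)
G = [apply_affine(B, b, F[apply_affine(A, a, x)]) for x in range(8)]
assert diff_uniformity(G) == diff_uniformity(F)   # invariant under AE
```

The same experiment works for the linearity and the algebraic degree, which are likewise AE-invariant.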

Exhaustive Classifications. The affine equivalence (AE) property has enabled
the exhaustive classification of entire function spaces up to a certain size. For
these sizes, we therefore also have exhaustive knowledge of the cryptographic
properties that exist. The first classification of Boolean functions dates back
to 1959 [Gol59]. By 1972, all Boolean functions with up to five input bits
were classified up to AE by Berlekamp and Welch [BW72]. Maiorana [Mai91]
was the first to identify the AE classes of 6-bit Boolean functions. Using
an efficient algorithm for verifying AE by Biryukov et al. [BDBP03], De
Cannière [De 07] created an exhaustive classification of all 4-bit bijective S-
boxes. The dimensionality reduction is significant, since the classification allows
one to consider only 302 AE classes instead of 16! permutations. However, the
search for the classes themselves becomes too complex for larger sizes. To this
day, no exhaustive classification for vectorial Boolean functions over n bits with
n > 4 exists.

Partial Classifications. Classifications have been extended to S-box sizes n > 4
by restricting certain properties. Brinkmann and Leander [BL08] constrained
the search space to bijective S-boxes with optimal properties (APN) and were
able to classify them exhaustively up to dimension 5. Alternatively, Bozilov et
al. [BBS17] were able to exhaustively classify all quadratic 5-bit permutations
with a dedicated search method for functions of algebraic degree two. Following
an enhancement of the AE algorithm of Biryukov et al. [BDBP03], De Meyer and
Bilgin [5] were able to optimize the algorithm of Bozilov et al. [BBS17], which
led to the first classification of quadratic 6-bit functions, including balanced
non-bijective Boolean functions.

5.1.3 Implementation Properties

Circuit Properties. The cost of an S-box circuit can be expressed with many
different metrics. We typically count the number of gates (gate complexity)
or look at the circuit depth. Depending on which gates we consider, we can
obtain different cost estimations. Stoffelen [Sto16] for example distinguishes
gate complexity for hardware implementations and bitslice gate complexity for
software implementations. The former considers all types of gates which can
be found in typical CMOS libraries (AND, OR, NOT, XOR, NAND, NOR,
XNOR), while the latter only considers those for which a CPU instruction exists
in most processors (AND, OR, NOT, XOR). The bitslice gate complexity can
be used as an indicator of the speed of a software implementation, since each
gate should map to one instruction. For hardware implementations, the gate
complexity is related to the area of a circuit. For the latency of a circuit, we
look at the circuit depth, which is the maximum number of gates on any path
from an input to an output. Note that we typically only consider 2-input gates
in these metrics for genericity and ease of comparison.

XOR vs. AND. Any function can be represented in terms of AND, XOR
and NOT gates only, because these gates form a functionally complete set of
operators. It is therefore common to consider only these gate complexities.
Naturally, area or latency estimates based on gate counts are not exact, since
each different type of instruction or gate has a different area or delay. Exact cost
metrics can be obtained using gate-specific costs from a logic library, combined
with distinct gate counts (AND gate complexity, XOR gate complexity, . . . ). In
CMOS technology, a NAND gate consists of 4 transistors, while an XOR gate
requires as many as 8 (assuming the input complements are not yet available).
A linear function can thus be more expensive than a nonlinear function (in
hardware). Traditionally, circuits and S-boxes have been optimized according
to that philosophy. However, if we want efficient circuits for embedded systems
exposed to side-channel attacks, we need to consider the cost of countermeasures
such as masking. Recall from Section 2.2.3 that a masked XOR requires d + 1
regular XOR gates, whereas a masked AND requires about (d + 1)^2 AND gates
and 2d(d + 1) XOR gates. It is therefore common to regard the cost of XOR
negligible compared to that of AND.
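These counts can be checked against a functional sketch of the ISW multiplication gadget [ISW03], ignoring hardware-specific refresh and register details:

```python
# Functional sketch of the ISW AND gadget at order d with n = d + 1 shares.
# Counting its gates: n^2 = (d+1)^2 two-input ANDs and 2d(d+1) XORs (plus
# d(d+1)/2 fresh random bits), matching the figures quoted above.

import random

def share_bit(x, n, rng):
    sh = [rng.getrandbits(1) for _ in range(n - 1)]
    last = x
    for s in sh:
        last ^= s
    return sh + [last]                   # shares XOR to x

def unshare(sh):
    out = 0
    for s in sh:
        out ^= s
    return out

def isw_and(a, b, rng):
    """a, b: share vectors of two secret bits; returns shares of a AND b."""
    n = len(a)
    c = [a[i] & b[i] for i in range(n)]          # n AND gates
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.getrandbits(1)               # fresh random bit r_ij
            c[i] ^= r                            # 1 XOR
            # r_ji = (r_ij ^ a_i b_j) ^ a_j b_i: 2 ANDs and 2 XORs,
            # then 1 XOR to fold it into c_j -> 4 XORs and 2 ANDs per pair
            c[j] ^= (r ^ (a[i] & b[j])) ^ (a[j] & b[i])
    return c

rng = random.Random(7)
n = 3                                            # d = 2, second order
for x in (0, 1):
    for y in (0, 1):
        for _ in range(20):                      # several fresh maskings
            res = isw_and(share_bit(x, n, rng), share_bit(y, n, rng), rng)
            assert unshare(res) == x & y
```

The quadratic AND count and the 2d(d + 1) XOR count are exactly why the MC metric below charges for AND gates while treating XORs as nearly free.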

Multiplicative Complexity. As a result, recent works often consider the
multiplicative complexity (MC) [Sch88]. This is the minimal number of 2-input
AND gates required to evaluate a function over the basis (AND, XOR,
NOT). The M C is an important metric for the area of masked hardware
implementations and for the latency of masked software implementations. Note
that it is a property of a function, not of a circuit and that it corresponds to
the AND gate complexity of the most efficient implementation of that function,
with respect to AND gate count.

Multiplicative Depth. For the latency of masked hardware implementations,
we care about the circuit depth in terms of 2-input AND gates. Recall
from Section 2.2.3 that every layer of AND gates requires a register stage
for synchronization, which significantly affects the latency in terms of clock
cycles. Given a circuit over the basis (AND, XOR, NOT), the multiplicative
depth (MD) is the maximum number of 2-input AND gates on any path from
an input to an output. This is a circuit-specific property. For any S-box S,
the minimal achievable multiplicative depth follows directly from its algebraic
degree: MD ≥ ⌈log₂(Degr(S))⌉.
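The algebraic degree in this bound is easy to compute from an S-box truth table via the Möbius (ANF) transform. A small illustrative sketch (helper names are ours), using the 3-bit χ map as an example:

```python
from math import ceil, log2

def algebraic_degree(sbox, n):
    """Degree of an n-bit S-box: max Hamming weight of a monomial
    with a nonzero coefficient in the ANF of any output bit."""
    anf = list(sbox)
    for i in range(n):                 # Moebius transform, all outputs at once
        for x in range(1 << n):
            if x >> i & 1:
                anf[x] ^= anf[x ^ (1 << i)]
    return max(bin(x).count("1") for x in range(1 << n) if anf[x])

# 3-bit chi map: y_i = x_i XOR (NOT x_{i+1} AND x_{i+2}), indices mod 3
def chi3(x):
    b = [(x >> i) & 1 for i in range(3)]
    y = [b[i] ^ ((b[(i + 1) % 3] ^ 1) & b[(i + 2) % 3]) for i in range(3)]
    return y[0] | (y[1] << 1) | (y[2] << 2)

S = [chi3(x) for x in range(8)]
deg = algebraic_degree(S, 3)       # quadratic, so 2
min_md = ceil(log2(deg))           # hence MD >= 1
```

For this quadratic permutation the bound gives MD ≥ 1, and indeed a single AND layer suffices to implement it.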

Affine Equivalence. Interestingly and conveniently, the multiplicative complexity
MC is also invariant under AE, since affine transformations do not alter the
number of AND gates.

5.2 Towards Cryptography Design for Masking

The consideration of implementation cost in the design of cryptographic
components (and more specifically S-boxes) is not new. The S-boxes of the Data
Encryption Standard (DES) were chosen in the first place according to a list of
cryptographic criteria. Among those that fulfilled these criteria, the designers
chose the ones that would be most efficiently implemented in hardware [MM82].
Daemen and Rijmen [DR98] pointed out that the coefficients in the MixColumns
operation of AES were specifically chosen with implementation efficiency in
mind. The S-box is based on an inversion operation, which means that hardware
implementations for encryption and decryption can share the same inversion
block.
However, as we move towards a world with more and more embedded devices,
where side-channel attacks are a constant threat, we must shift our understanding
of implementation cost to one that takes SCA countermeasures into account.
In fact, since those countermeasures come with such large overheads, the
consideration of implementation cost in the design process becomes even
more important than before. The ongoing NIST lightweight cryptography
standardization contest even explicitly lists this as a requirement for candidates:

“While implementations will not be required to provide side-channel
resistance, the ability to provide it easily and at low cost is highly
desired.”

The same is stated for resistance against fault attacks. To achieve this, we need
designers to become familiar with how their decisions influence the cost.
In this section, we will first identify important goals for the designer and
properties to optimize based on the knowledge we have gathered from the
previous chapters. Next, we will discuss recent trends in the state-of-the-art on
cipher design and, in particular, assess how the NIST lightweight candidates
comply with the SCA requirement.

5.2.1 Goals and Trade-offs

Decomposability. In Chapter 2, we saw various masked implementations
of AES. In each case, the S-box of algebraic degree 7 was decomposed into
quadratic components, which could each be masked using, for example, the
ISW multiplication. This approach is very popular in the masking of S-boxes.
Bilgin et al. [BNN+12] were able to create threshold implementations (TI) for all
3- and 4-bit S-boxes up to AE by decomposing the cubic S-boxes into quadratic
ones. A beneficial property for an S-box is therefore to be easily decomposable
into quadratic or low-degree functions.

Minimize Multiplicative Depth. Recall that glitches in hardware masked
implementations require that all quadratic stages are separated by registers
for synchronization. The number of decomposition functions therefore plays
an important role. Ideally, it should be possible to implement the S-box
with the minimal multiplicative depth. S-boxes that are not designed with
this specification in mind often need more than that to keep the AND gate
complexity within reasonable bounds. For example, most masked AES S-boxes
in Chapter 2 require at least four instead of three quadratic steps. The number
of register stages mostly influences the latency of hardware implementations,
because it directly determines the number of clock cycles. However, the area
footprint is also affected by those registers, which are relatively expensive
compared to combinational logic on ASICs.

Algebraic Degree. If the S-box is indeed chosen so that it can be implemented
with minimal multiplicative depth MD = ⌈log₂(Degr(S))⌉, then the algebraic
degree becomes a direct indicator for the latency of the S-box. Naturally, there
is a trade-off with cryptographic quality. Quadratic functions tend to have large
differential uniformity and linearity. Finding the optimal trade-off is difficult.
For S-boxes only, the AE classifications help to find the cryptographically
strongest functions at the lowest cost. A larger investigation and comparison
for S-boxes of many more sizes were made by Bilgin et al. [1]. However, the
cryptographic strength of an S-box alone is not directly linked to that of the
cipher, since it depends also on the linear layers and the number of rounds.
Similarly, the latency of the entire cipher depends on the latency of the S-box,
the number of rounds and the architecture used (see Figure 5.1). We should
thus attempt to minimize the total multiplicative depth or algebraic degree of
the cipher.

[Diagram: S-box Latency / S-box Strength and the Nb. of Rounds jointly determine Cipher Latency / Cipher Strength.]

Figure 5.1: Illustration of the complicated trade-offs between cryptographic
strength and latency. Note that linear layers are not depicted here, but play an
important role for the cryptographic strength of the cipher as well.

Minimize Multiplicative Complexity. In software, the depth has little
importance since all operations are performed sequentially. The number of
instructions in a masked implementation grows most with the number of
AND operations. Hence, for software-oriented ciphers, it is most important
to design primitives with low multiplicative complexity. Also in hardware,
the multiplicative complexity is important for the area footprint of the S-box.
However, recall that a low multiplicative complexity should not be achieved at
the cost of a large multiplicative depth, even if area is more important than
latency. Hence, in this case, the goal is to find S-boxes that have low level-D
multiplicative complexity, where D is ideally the minimal multiplicative depth
MD = ⌈log₂(Degr(S))⌉.

The Inverse. Often, when an encryption uses the S-box S, its inverse S⁻¹
is required for decryption. The cost of the inverse is not always considered,
because the cryptographic properties Diff and Lin are the same for S and S⁻¹.
The algebraic degree and multiplicative complexity, on the other hand, are not,
which means considering only the implementation cost of S may result in an
expensive S⁻¹. In the survey of S-boxes of Bilgin et al. [1], the cryptographic
and implementation properties of S-boxes and their inverses are investigated.
Moreover, they also consider the possibility of sharing resources between
encryption and decryption. The AES S-box, for example, uses an inversion,
which is naturally an involution. Hence both encryption and decryption can
use the same hardware components, which reduces the area footprint on a
device that needs to be able to do both. Other than involutions, Bilgin et al. [1]
identify several ways to minimize the combined area of S and S⁻¹ and propose a
selection of S-boxes that perform well in this regard, as well as cryptanalytically.
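The invariance of Diff under inversion is easy to check empirically. The sketch below (helper names are ours; the well-known 4-bit PRESENT S-box serves as an example) computes the differential uniformity of an S-box and of its inverse:

```python
def diff_uniformity(sbox):
    """Max count of inputs x with S(x ^ a) ^ S(x) = b, over all a != 0 and b."""
    size = len(sbox)
    best = 0
    for a in range(1, size):
        counts = [0] * size
        for x in range(size):
            counts[sbox[x ^ a] ^ sbox[x]] += 1
        best = max(best, max(counts))
    return best

# The 4-bit PRESENT S-box and its inverse
S = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
     0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
S_inv = [S.index(y) for y in range(16)]
```

The difference distribution table of S⁻¹ is the transpose of that of S, so the maxima coincide (here, 4 for both), even though the implementation costs of the two directions may differ.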

Bit Sizes: Large vs. Small. AES is one of the few block ciphers that uses
an 8-bit S-box. Most block ciphers use a 4-bit S-box. There are two ways
to look at the choice of S-box size: from a cryptanalytic point-of-view and
from a SCA point-of-view. The trade-off between cryptographic strength and
implementation cost of small and large S-boxes is again complicated by the
involvement of the linear layers. Hence, we leave it to the cryptanalysts to
investigate it. Nevertheless, it is probable that the popularity of small bit
sizes (e.g. 4) is more due to the lack of knowledge on the search space of
larger S-boxes than due to a qualitative advantage. In addition, the success
of AES has suppressed other ciphers that use an 8-bit S-box. From a SCA
point-of-view however, larger S-boxes may enjoy some benefit in LUT-based
implementations. Recall that differential power analysis (DPA) on AES requires
2⁸ = 256 hypotheses to be made on each 8-bit subkey. This number is directly
determined by the size of the S-box (or more specifically, the number of input
bits that each output bit depends on). In a similar cipher with 4-bit S-boxes,
only 2⁴ = 16 hypotheses would have to be made per subkey. More generally, in a
state of B bits with n-bit S-boxes, a DPA attack requires 2^(n−log₂ n) · B
hypotheses to recover the entire round key. Hence, very large S-box sizes could
interfere with the divide and conquer strategy of SCA. The problem is that
their search spaces are too large to explore.
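This hypothesis count is simple arithmetic, sketched below (the helper is ours, for illustration only):

```python
def dpa_hypotheses(n, B):
    """Total key hypotheses to recover a B-bit round key via n-bit S-boxes:
    2^n guesses for each of the B/n subkeys, i.e. 2^(n - log2 n) * B."""
    assert B % n == 0
    return (2 ** n) * (B // n)

aes_like = dpa_hypotheses(8, 128)   # 256 hypotheses x 16 subkeys = 4096
four_bit = dpa_hypotheses(4, 128)   # 16 hypotheses x 32 subkeys = 512
```

Doubling the S-box size from 4 to 8 bits thus multiplies the attack effort by 8 for a 128-bit state, and the growth is exponential in n.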

Bit Sizes: Odd vs. Even. Another contrast in S-box sizes is that between
odd and even. Traditionally, often S-boxes of size a power of two were chosen,
because of the datapath width in processors. For hardware implementations
or bitsliced software implementations, this restriction does not make sense,
but still, it is challenging to fit an odd-sized S-box into a block cipher with
state size a power of two (e.g. 128 or 256). As a result, even-sized S-boxes
(mostly 4) dominate in the literature. However, both from a cryptanalytic and
implementation perspective, odd-sized S-boxes show an advantage over even-
sized ones. The results of Bilgin et al. [1] show that S-boxes of odd size n achieve
the same cryptographic strength as S-boxes of even size n + 1, but at lower
cost. They are especially interesting when it comes to low latency applications,
since for every odd size n, there exists at least one AE class of quadratic APN
S-boxes. These are S-boxes with optimal cryptographic properties, that can be
implemented at minimal latency.

Clarifying Example. Choosing an S-box according to these goals and
preferences is easier said than done. In the end, the cryptographic strength
of a cipher remains the most important decision factor. However, what these
guidelines aim to do is give more clarity about the impact of certain design
decisions. In some cases, there are many S-boxes that result in the same
security properties. It is exactly then that the masked implementation cost
should be taken into account. Let us look at the popular 4-bit S-boxes as an
example. Since the quadratic ones do not provide good cryptographic properties,
only cubic ones are used in block ciphers. They need a multiplicative depth
of (at least) two. In terms of latency, such a decision is somewhat wasteful,
since with MD = 2, it is possible to implement a fourth-degree S-box with
better cryptographic properties. In fact, Bilgin et al. [1] showed that even
with MD = 1, 5-bit S-boxes obtain better cryptographic properties than 4-bit
S-boxes with MD = 2. And with these better cryptographic properties, it is
possible that the number of rounds can be reduced, which even further optimizes
the latency of masked implementations.

5.2.2 Discussion of the State-of-the-Art.

We will now look at the literature from the last years and show that some
first steps have been taken towards the above goals. We will also critically
assess some candidates from the NIST lightweight competition that claim
to have taken the cost of side-channel countermeasures into account in the
design process. Note that our expertise does not extend to cryptanalysis and
that many of the discussed ciphers are relatively new, i.e. not as scrutinized
and established as AES. We therefore limit our treatment to an evaluation of
the implementation properties only and say nothing about the cryptographic
strength.

Multi-party Computation. The link between masking and the field of multi-
party computation (MPC) has appeared multiple times in this work. Both areas
use secret sharing, which causes nonlinear operations to be more expensive
than linear ones. As a result, we can see recent efforts into the design of
cryptographic primitives with low multiplicative complexity. Albrecht et
al. [ARS+ 15] introduced a family of ciphers, called LowMC, which is intended
to minimize both its multiplicative complexity and depth. The design is based
on a substitution-permutation network (SPN) with 3-bit S-boxes of MD = 1.
For AES-like security parameters, they repeat the SPN for 12 rounds, which
results in a total multiplicative depth of 12. Albrecht et al. [AGR+16] also
introduce MiMC, a very simple construction consisting only of key additions
and the quadratic map x → x³ in a finite field Fq with q prime or a power of
two. This latter cipher focuses more on multiplicative complexity than depth.
They need to repeat the round function 82 times to achieve AES-like security
parameters, so their total multiplicative depth is even worse than that of AES.
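The structure of such a construction can be sketched in a few lines. The parameters below are toy placeholders, not the real MiMC instantiation (which uses a large field, many more rounds and pseudo-randomly derived constants):

```python
# Toy MiMC-style permutation over F_p. Since gcd(3, P - 1) = 1,
# the map x -> x^3 is a bijection of F_P, so the whole cipher is too.
P = 101
CONSTANTS = [3, 1, 4, 1, 5, 9, 2, 6]   # placeholder round constants

def mimc_encrypt(x, k):
    """One key addition and one cubing per round, plus a final key addition."""
    for c in CONSTANTS:
        x = pow((x + k + c) % P, 3, P)
    return (x + k) % P
```

Each round contributes a single field multiplication chain (one squaring and one multiplication for the cube), which is what makes the per-round multiplicative complexity so low while the round count, and hence the total depth, grows.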

Keccak and PRIMATEs. In recent years, several primitives have been
introduced that explicitly mentioned side-channel attacks as motivation
for their S-box choice. Most prominent is the Keccak family of sponge
functions [BDPV11], which has been selected as the SHA-3 Cryptographic
Hash standard by NIST. They use quadratic 5-bit S-boxes with a very low
multiplicative complexity of 5. The round function is repeated 18 times or more
(depending on the state size), which means Keccak can have multiplicative
depth as low as 18. Another permutation that uses a 5-bit S-box is PRIMATES
by Andreeva et al. [ABB+ 14]. They chose an S-box from a quadratic 5-bit AB
class, which results in optimal cryptographic properties at only slightly higher
MC than the Keccak S-box. Moreover, the permutation only requires 6 or 12
iterations of the round function, which results in a very small multiplicative
depth. We summarize these results in Table 5.1.
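The Keccak S-box (the χ mapping) is compact enough to write out directly. The sketch below is our own Python formulation of the well-known map y_i = x_i ⊕ (¬x_{i+1} ∧ x_{i+2}); it uses exactly one AND per output bit, hence a multiplicative complexity of at most 5:

```python
def chi5(x):
    """Keccak chi on 5 bits: y_i = x_i XOR (NOT x_{i+1} AND x_{i+2}),
    indices mod 5. One 2-input AND per output bit."""
    b = [(x >> i) & 1 for i in range(5)]
    y = [b[i] ^ ((b[(i + 1) % 5] ^ 1) & b[(i + 2) % 5]) for i in range(5)]
    return sum(bit << i for i, bit in enumerate(y))

sbox = [chi5(x) for x in range(32)]
```

Because the map is quadratic, its multiplicative depth is 1; and since it operates on an odd number of bits, it is invertible, which is why χ-like maps only appear with odd sizes.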

Length-Increasing Structures. Another trend in the literature is to build large
S-boxes from smaller ones, using length-increasing structures which are inspired
by block cipher design. In particular, 8-bit S-boxes of this type have been used in
several cryptographic primitives, including CRYPTON [Lim98], Khazad [BR00],
Whirlpool [BR11], ICEBERG [SPR+ 04] and CLEFIA [SSA+ 07]. Comparative
studies of such S-boxes, including new proposals, were made by Canteaut et
al. [CDL15] and by Boss et al. [BGG+ 16]. By construction, these S-boxes are
Table 5.1: Comparison of primitives in the state-of-the-art. We denote with n
and B respectively the S-box size and block size.

Primitive            n/B/#Rnds     S-box MC  S-box MD  MC/bit  Tot MD  Tot MC/bit
AES [DR98]           8/128/10      32        4         4       40      40
LowMC [ARS+15]       3/196/14      3         1         0.96*   14      13.5
Keccak [BDPV11]      5/200/18      5         1         1       18      18
PRIMATEs [ABB+14]    5/120/[6/12]  7         1         1.4     6/12    8.4/16.8
* LowMC does not apply S-boxes to the entire state

decomposable, since they are assembled from quadratic building blocks. Note
however that their increased bit size does not increase the complexity of a DPA
attack, because the hypotheses can be made about the smaller (e.g. 4-bit)
subcomponents. Moreover, most of the S-boxes obtained in this way have quite
a large multiplicative depth and none achieve cryptographic properties as good
as the AES S-box.

5.2.3 NIST Lightweight Competition.

Given the above-acquired knowledge on primitive design and given that the NIST
lightweight competition explicitly states that the cost of SCA countermeasures
should be taken into account, we now take a look at some of the Round 2
candidates.² This investigation constitutes a new contribution of this thesis.

Side-Channel Claims. Many (not all) candidates make a note about having
considered side-channel attacks. However, this claim is often not very well-
argued. In some cases, it is justified by the fact that the design uses
“easy-to-mask” operations such as bitwise functions. While this is more
convenient for the masking designer, it gives no guarantees about the total
cost. Some proposals use existing primitives and use their lightweight property
as justification. However, these primitives were not necessarily designed with
SCA in mind. Other candidates use AES and argue that a lot of research
exists on masking the AES. The many schemes presented in Section 2.3 indeed
confirm that there is an abundance of available literature on the subject, but the
existence of a lot of research does not imply that its results are most efficient.
This holds especially for mask conversions between Boolean and arithmetic
masking [BCZ18, CGTV15], which are required for ARX ciphers. In the NIST
² Descriptions can all be found at https://csrc.nist.gov/projects/lightweight-cryptography/round-2-candidates
proposal SPARKLE, the argumentation for side-channels is again that a large
amount of research exists on this topic. However, whether these conversions are
considered efficient is highly disputable.
A few of the candidates stand out in their treatment of SCA. Firstly, Goudarzi et
al. describe and implement a masked version of their scheme Pyjamask in
software. Secondly, the proposal ISAP by Dobraunig et al. is based on the ISAP
mode of operation [DEM+ 17], which is a leakage-resilient mode of operation,
designed to provide security against DPA by a re-keying mechanism. In contrast
with masking as discussed in Chapter 2, this countermeasure acts at the protocol
level instead of at the algorithmic level. Other proposals that claim to use
some form of leakage resilience are Xoodyak, Spook, Ascon, DryGASCON and
Subterranean.

Implementation Properties. Since the claims on SCA are often badly
motivated and since we need to be able to correctly compare different candidates,
we collect some properties of their building blocks in Tables 5.2 and 5.3. We
limit our selection to primitives used in proposals that make some claim about
the consideration of SCA and list several properties related to the multiplicative
complexity and depth of the ciphers. Naturally, these properties should not be
considered by themselves, as the cost of implementations depends on several
of them jointly, and cryptographic strength is not taken into account here. In
Table 5.2, we look at properties for hardware implementations and in Table 5.3,
we consider software implementations. We recall the most important influences
on the cost for different cases:
Hardware with focus on low latency: The latency of a (serial or round-
based) masked implementation will depend strongly on the total
multiplicative depth (Tot MD). We calculate this as the multiplicative
depth of the S-box (S-box MD), multiplied with the number of rounds
(# Rnds) in the primitive.³ Since different primitives operate on different
state sizes, we also calculate the total multiplicative depth per bit (Tot
MD/bit) by dividing the total MD by the block size B.⁴
Hardware with focus on small area: The area cost of masked hardware
implementations comes from registers on the one hand and combinational
logic on the other. On ASIC devices, the registers are quite expensive,
whereas on FPGA devices, they are relatively cheap from being available in
³ We note that some sponge-based proposals use a higher number of rounds during the
initialization phase of a mode. Since asymptotically only the rounds per plaintext block
matter, we will not consider initialization rounds here. Note however that for short messages,
the initialization rounds will be dominant.
⁴ When a primitive is used in a sponge construction, we divide by the rate r, since this
indicates the number of plaintext bits being processed per iteration.


large quantities. The register cost is considerably affected by the block or
state size B, especially in a serial implementation. Furthermore, also the
multiplicative depth of the S-box (S-box MD) contributes to the registers,
but this is more prominent in round-based implementations. As for the
combinational logic, its area grows most with the multiplicative complexity
of the S-box (S-box MC). Again, to account for the scalability with the
number of bits being operated on, we also calculate the multiplicative
complexity per bit (MC/bit) as the S-box MC divided by the S-box size
n. The number of rounds is not important for the area of a serial or
round-based implementation.
Software with focus on low latency: In software, the speed can be approx-
imated by the number of instructions. For masked implementations,
this will be highly correlated with the total multiplicative complexity,
which naturally depends on the multiplicative complexity of the S-box
on the one hand and on the number of rounds on the other. Also the
number of S-boxes per round (= B/n) matters, but if bitslicing is used,
some of these can be calculated in parallel, rather than sequentially. We
therefore consider various degrees of bitslicing in Table 5.3, assuming p
S-boxes can be computed in parallel on a p-bit platform. The total
multiplicative complexity on such a platform is S-box MC × # Rnds × ⌈B/(np)⌉.
We compare this metric scaled per bit (Tot MC/bit) by division with the
number of bits processed per encryption (i.e. the block size B or rate r).
We note that linear operations are not entirely negligible, especially when the
masking order d is not very high, but since their cost is typically taken into
account in the design of unmasked primitives, we do not consider them here.
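These per-bit metrics can be recomputed mechanically. The sketch below (helper names are ours) checks two entries against the parameters of XOODOO and Keccak:

```python
from math import ceil

def tot_md_per_bit(sbox_md, rounds, bits):
    """Hardware latency metric: total multiplicative depth per processed bit."""
    return sbox_md * rounds / bits

def tot_mc_per_bit(sbox_mc, rounds, B, n, p, bits):
    """Software metric: S-box MC x rounds x ceil(B/(n*p)), per processed bit.
    p = S-boxes evaluated in parallel; bits = block size B or sponge rate r."""
    return sbox_mc * rounds * ceil(B / (n * p)) / bits

# XOODOO: n = 3, B = 384, r = 128, 12 rounds, S-box MC = 3, S-box MD = 1
xoodoo_md = tot_md_per_bit(1, 12, 128)                # Tot MD/bit
xoodoo_mc = tot_mc_per_bit(3, 12, 384, 3, 32, 128)    # 32-bit bitsliced
# Keccak-f[200] in Elephant: 18 rounds, MC = 5, no bitslicing (p = 1)
keccak_mc = tot_mc_per_bit(5, 18, 200, 5, 1, 200)
```

These reproduce the values 0.09375, 1.125 and 18 reported for XOODOO and Keccak in Tables 5.2 and 5.3.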

A Note on Leakage Resilience. We note that our analysis considers only the
internal building blocks of the NIST proposals, regardless of whether they are
used in a leakage-resilient mode or not. We see this as a necessary first step
leakage resilience, making a more detailed comparison is challenging. For an
investigation into the leakage resilience of several candidates, we refer to the
work of Bellizia et al. [BBC+ 20].

Observations

The existence of Tables 5.2 and 5.3 is immediately justified by the large
variability in some of their columns. We make some interesting observations here.
Table 5.2: Comparison of NIST candidates for Hardware. (n = S-box size, B =
block size or permutation state size, r = rate for Sponge)

Primitive         n    B        r       # Rnds  S-box MC  MC/bit  S-box MD  Tot MD   Tot MD/bit  Candidates
XOODOO            3    384      128     12      3         1       1         12       0.09375     Xoodyak
Pyjamask          3/4  96/128   -       14      3/4       1       1/2       14/28    ≥ 0.146     Pyjamask
Clyde             4    128      -       12      4         1       2         24       0.1875      Spook
GIFT (I)          4    64/128   -       28/40   5         1.25    2         56/80    ≥ 0.625     ESTATE, GIFT-COFB, HYENA, LOTUS/LOCUS, SUNDAE-GIFT
GIFT (II)         4    64/128   -       28/40   4         1       4         112/160  ≥ 1.25      idem
KNOT              4    256      64      28*     4         1       2         56*      0.875*      KNOT
PHOTON            4    256      32/128  12      4         1       2         24       ≥ 0.1875    PHOTON-Beetle
Shadow            4    512      256     12      4         1       2         24       0.09375     Spook
Spongent          4    160/176  -       80/90   5         1.25    2         160/180  ≥ 1         Elephant
ForkSkinny        4/8  64/128   -       40/48   4/8       1       2/4       80/192   ≥ 1.25      ForkAE
ASCON             5    320      64/128  6*/8*   5         1       1         6*/8*    ≥ 0.0625*   Ascon, ISAP
GASCON            5    320      128     7*      5         1       1         7*       0.055*      DryGASCON
Keccak            5    200      -       18      5         1       1         18       0.09        Elephant
Keccak            5    400      144     8       5         1       1         8        0.056       ISAP
AES               8    128      -       10      32        4       4         40       0.3125      ESTATE, mixFEED, SAEAES
Skinny            8    128      -       48/56   8         1       4         192/224  ≥ 1.5       Romulus, SKINNY-AEAD
GIMLI             96   384      128     24      96        1       1         24       0.1875      Gimli
Subterranean 2.0  257  257      32      1*      257       1       1         1*       0.03125*    Subterranean 2.0
* Given a larger number of initialization rounds

Table 5.3: Comparison of NIST candidates for Software. (n = S-box size, B =
block size or permutation state size, r = rate for Sponge)

Primitive         n    B        r       # Rnds  S-box MC  ------------ Tot MC/bit ------------   Candidates
                                                          No Bitslice  16-bit     32-bit      64-bit
XOODOO            3    384      128     12      3         36           2.25       1.125       0.5625      Xoodyak
Pyjamask          3/4  96/128   -       14      3/4       14           0.875      0.4375      0.4375      Pyjamask
Clyde             4    128      -       12      4         12           0.75       0.375       0.375       Spook
GIFT              4    64/128   -       28/40   4         28/40        1.75/2.5   1.75/1.25   1.75/1.25   ESTATE, GIFT-COFB, HYENA, LOTUS/LOCUS, SUNDAE-GIFT
KNOT              4    256      64      28*     4         112*         7*         3.5*        1.75*       KNOT
PHOTON            4    256      32/128  12      4         96/24        6/1.5      3/0.75      1.5/0.38    PHOTON-Beetle
Shadow            4    512      256     12      4         24           1.5        0.75        0.375       Spook
Spongent          4    160/176  -       80/90   5         100/112.5    7.5/7.67   5/5.11      2.5/2.56    Elephant
ForkSkinny        4/8  64/128   -       40/48   4/8       40/96        2.5/6      2.5/3       2.5/3       ForkAE
ASCON             5    320      64/128  6*/8*   5         30/20*       1.88/1.25* 0.94/0.63*  0.47/0.31*  Ascon, ISAP
GASCON            5    320      128     7*      5         17.5*        1.09*      0.55*       0.27*       DryGASCON
Keccak            5    200      -       18      5         18           1.35       0.9         0.45        Elephant
Keccak            5    400      144     8       5         22.22        1.39       0.83        0.56        ISAP
AES               8    128      -       10      32        40           2.5        2.5         2.5         ESTATE, mixFEED, SAEAES
Skinny            8    128      -       48/56   8         48/56        3/3.5      3/3.5       3/3.5       Romulus, SKINNY-AEAD
GIMLI             96   384      128     24      96        72           4.5        2.25        1.125       Gimli
Subterranean 2.0  257  257      32      1*      257       8.03*        0.53*      0.28*       0.16*       Subterranean 2.0
* Given a larger number of initialization rounds

4-bit vs. 5-bit S-boxes. The popularity of 4-bit S-boxes continues. It is
clear that they systematically result in S-box MD = 2. However, many other
proposals use odd-sized S-boxes, which achieve the minimal depth of one. By
extension, these proposals achieve a smaller Tot MD overall. Frontrunners in
terms of multiplicative depth per bit are Subterranean 2.0, (G)ASCON, Keccak
and XOODOO, which, interestingly, all use an S-box based on a very similar
structure.

MC/bit. The MC/bit is almost identical for all non-AES proposals, and there
is little to no need for improvement in that aspect.

Number of Rounds. The largest contrasts arise from differences in the number
of rounds, which plays an important role when speed is a priority. With respect
to the metrics Tot MC/bit and Tot MD/bit, we see that several primitives are
not competitive with AES (e.g. Spongent and Skinny, among others) due to a
large number of rounds. We note that this design parameter is highly dependent
on the designer's choice of security margin.

Sponge Constructions. On the one hand, sponge constructions often use
a larger number of initialization rounds than the number of rounds used
per plaintext block. This can be beneficial for the speed of a hardware
implementation if messages are not too short. On the other hand, permutations
in a sponge construction typically require a larger state size than block ciphers,
where the message is the entire state. Large state sizes are bad for the area
requirements on ASIC devices. We also see that the throughput of software
implementations (Tot MC/bit) is badly affected by the fact that only r of the
B bits of the state are processed. For example, the Tot MC/bit of PHOTON
and Shadow is worse than that of Clyde, despite having equivalent S-box
properties and number of rounds.

Bitslicing. Table 5.3 considers various degrees of bitslicing, under the
assumption that p S-boxes can be computed in parallel on a p-bit platform.
When a primitive has fewer than p S-boxes, bitslicing may offer less of an
advantage. For example, Subterranean 2.0 performs one large nonlinear
operation across the entire state, which can be seen as only one S-box. However,
this S-box has a highly repetitive structure (also known as a cellular automaton),
which could be exploited in bitslicing as well. Apart from Subterranean 2.0,
Table 5.3 shows two more proposals that perform better than any other in terms
of multiplicative complexity, regardless of the degree of bitslicing: Pyjamask
and Clyde. It also shows that various proposals offer little advantage over AES
in this regard.
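The idea of bitslicing itself is easy to demonstrate: pack the i-th input bit of p independent S-box evaluations into one machine word, so that each bitwise instruction acts on all p instances at once. A toy sketch (our own formulation) for a 3-bit χ-style S-box with p = 32:

```python
# Bitsliced evaluation of 32 parallel 3-bit chi S-boxes
# (y_i = x_i XOR (NOT x_{i+1} AND x_{i+2})): word xj holds input bit j
# of all 32 instances, so three word-level ANDs evaluate all of them.

MASK = (1 << 32) - 1   # p = 32 parallel S-box instances per word

def chi3_bitsliced(x0, x1, x2):
    y0 = x0 ^ (x2 & ~x1 & MASK)
    y1 = x1 ^ (x0 & ~x2 & MASK)
    y2 = x2 ^ (x1 & ~x0 & MASK)
    return y0, y1, y2
```

Masked software implementations benefit directly: every word-level AND above becomes one masked multiplication that covers 32 S-box instances, instead of 32 separate ones.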

5.3 My Contributions in this Context

5.3.1 Classification of Balanced Quadratic Functions

Context. S-boxes significantly impact the cryptographic strength and
implementation characteristics of an algorithm. Due to their simplicity, quadratic
vectorial Boolean functions are preferred when efficient implementations for
a variety of applications are of concern, such as resistance against SCA or
MPC. Many characteristics of a function stay invariant under AE, including
cryptographic and some implementation properties. AE classification is
therefore an important tool in the search for suitable Boolean functions for
cryptography. So far, all 6-bit Boolean functions [Mai91] and all 3- and 4-bit
permutations [De 07] have been classified up to AE. Current methods are not
able to classify the entire space of 5-bit permutations. At FSE 2017, Bozilov et
al. [BBS17] presented the first classification of 5-bit quadratic permutations. Like
many works on classification before, the authors use an algorithm by Biryukov et
al. [BDBP03] to find the representative of an AE class. The complexity of their
search does not allow an expansion to 6-bit functions.

Contribution. In our work from FSE 2020 [5], we introduce an extension of the
algorithm of Biryukov et al. [BDBP03], which allows finding the representative
of an AE class for non-bijective n × m functions with m < n. Thanks to this
new algorithm, we can adapt the search methodology of Bozilov et al. [BBS17]
and reduce the computation time to classify 5-bit quadratic permutations from
several hours (using 16 threads) to a mere six minutes (using 4 threads).⁵
This optimization makes it possible to also classify the 6-bit quadratic Boolean
functions for the first time. In addition, it enables the classification of (balanced)
non-bijective Boolean functions from n bits to m < n bits for n ≤ 6. We also
provide a second tool for finding length-two decompositions of higher-degree
permutations, which can be useful to create efficient masked implementations.
We demonstrate it by decomposing the 5-bit AB and APN permutations. We
can also use this tool to generate new high-quality S-boxes, which can be
decomposed. This work can be found on page 253.
⁵ On a Linux machine with an Intel Core i5-6500 processor at 3.20GHz.

5.3.2 Low AND Depth and Efficient Inverses: an S-box
Portfolio for Low-latency Masking

Context. Resistance against SCA has become a common requirement for
cryptographic implementations. The ongoing NIST Lightweight Cryptography
competition even states as one of its goals that countermeasures against SCA
competition even states as one of its goals that countermeasures against SCA
should be relatively easy to integrate. When ciphers are not designed with this
goal in mind, creating efficient masked implementations can be challenging. For
this reason, recent works have started to consider the multiplicative complexity
in the choice of nonlinear building blocks [BDPV11, ARS+ 15, AGR+ 16]. On the
other hand, low latency is gaining importance as advancing technologies reduce
area restrictions and the need for high-performance cryptography increases.
Only a few ciphers have been designed to achieve circuits of small logical
depth [BCG+ 12, BJK+ 16, Ava17]. However, optimizing the latency of an
unprotected design does not necessarily optimize the latency of a SCA-protected
design. In the latter case, the multiplicative depth of the cipher is a determining
factor. This property is rarely taken into account in the design of new primitives.

Contribution. In a collaboration with Prof. Standaert from UC Louvain,
published at FSE 2020 [1], we perform an extensive investigation and construct
a portfolio of S-boxes suitable for secure lightweight implementations, which
aligns well with the ongoing NIST Lightweight Cryptography competition. In
particular, we target good functional properties on the one hand and efficient
implementations in terms of AND depth and AND gate complexity on the
other. Moreover, we also consider the implementation of the inverse S-box
and the possibility for it to share resources with the forward S-box. We take
our exploration beyond the conventional small (and even) S-box sizes. Our
investigation is twofold: (1) we note that implementations of existing S-boxes
are not optimized for the criteria which define masking complexity (AND depth
and AND gate complexity) and improve a tool published at FSE 2016 by
Stoffelen [Sto16] to fill this gap. (2) We search for new S-box designs which
take these implementation properties into account from the start. We perform
a systematic search based on the properties of not only the S-box but also its
inverse as well as an exploration of larger S-box sizes using length-doubling
structures. The result of our investigation is not only a wide selection of
very good S-boxes, but we also provide complete descriptions of their circuits,
enabling their integration into future work.
5.4 Conclusion

We now conclude this chapter with a recapitulation of the most important observations in the state-of-the-art and suggestions for future work.

The Odd One Out. S-boxes of odd size achieve better cryptographic properties at lower depth than S-boxes of even size. Nevertheless, even and in particular 4-bit S-boxes dominate the literature. The exception is the 5-bit Keccak S-box, which, after being adopted in SHA-3, has been included in many other primitives. Yet, many more possibilities remain unused, such as 7- or 9-bit APN permutations.
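A function is APN exactly when its differential uniformity is 2, i.e. every nonzero input difference maps to each output difference at most twice. As a small stand-in for the 7- and 9-bit cases mentioned above, the sketch below verifies this for the cubing map x ↦ x³ in GF(2³) (a Gold function); the field arithmetic and example are illustrative choices, not taken from the thesis.

```python
def gf8_mul(a, b):
    """Multiplication in GF(2^3) with reduction polynomial x^3 + x + 1."""
    r = 0
    for i in range(3):                 # schoolbook carry-less multiply
        if (b >> i) & 1:
            r ^= a << i
    for i in (4, 3):                   # reduce bits above degree 2
        if (r >> i) & 1:
            r ^= 0b1011 << (i - 3)
    return r

def diff_uniformity(sbox):
    """Largest entry of the difference distribution table over a != 0."""
    size, worst = len(sbox), 0
    for a in range(1, size):
        counts = [0] * size
        for x in range(size):
            counts[sbox[x] ^ sbox[x ^ a]] += 1
        worst = max(worst, max(counts))
    return worst

cube = [gf8_mul(x, gf8_mul(x, x)) for x in range(8)]   # x -> x^3
```

The same `diff_uniformity` check applies unchanged to any S-box lookup table, whatever its size.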

Efficiency by Design. Cryptographic designers need to set their priorities. We have seen that there can be a large difference between designing for low area or
low latency, for software or hardware and even for FPGA or ASIC. The designer
should decide whether encryption and decryption are supposed to run with the
same hardware or not and choose S-boxes accordingly. The difference between
designing for unmasked or for masked applications is evident as well. Many
lightweight primitives of the past years involve a less complex round function
than that of AES, which is repeated for a higher number of rounds (e.g. GIFT,
Skinny, Spongent, . . . ). The rationale is that this reduces area requirements
and that a round-based encryption can run at a higher frequency. However,
we see that the increased number of rounds is detrimental to the speed of a
masked implementation. Hence, the targeted properties of the primitive must
be clear at the outset of the design process and ideally, an expert of side-channel
countermeasures should be involved.

Low Latency. Apart from misunderstandings about which design choices result in low-latency masked implementations, another problem is that it
is often not prioritised. We saw a similar trend in the optimization of masked
implementations themselves in Chapter 2. However, since area footprint becomes
less constrained with advancing technologies, low latency gains in importance.
For lightweight applications in particular, energy consumption is an important consideration, and it is closely tied to the latency of the implementation.
Chapter 6

Conclusion

In this work, we have looked at four different but closely related aspects
of embedded systems security. In Chapter 2, we explained the masking
countermeasure in detail and considered the community’s developments of
the last years, in particular with regards to the masked multiplication building
block and trade-offs in implementing the AES. Chapter 3 dealt with two types
of analysis of masked implementations. On the one hand, we presented the
state-of-the-art on side-channel attacks and more specifically differential power
analysis. On the other hand, we investigated the problem of verifying masked
implementations and looked at this challenge from both a theoretical and
practical perspective, uniting the two as much as possible. In Chapter 4, we
extended the issue of side-channel attacks to more general physical attacks
including fault and combined attacks. We studied the state-of-the-art on
both attacks and countermeasures. Chapter 5 looked at how we can optimize
embedded system design by incorporating the cost of countermeasures into
the design of cryptographic primitives. We collected a set of criteria and
contemplated recent proposals from the literature. In this final chapter, we
will summarize our contributions in each of these areas and recall a few conclusions from each chapter.

Contributions

We now recall the research questions from Chapter 1 and consider our
contributions to each.


Research Question 1: How can we improve the area-randomness-latency trade-off for masked implementations? We extended the state-of-the-art
on the masking of the AES in various ways. Our multiplicatively masked
AES S-box improved in area over tower-field implementations by 29% (first
order) and 18% (second order), without sacrifices in latency or randomness. The
second-order implementation remains to date the smallest and most randomness-
efficient design in hardware, with latency cost similar to others. We created the first masked AES implementation targeted at the FPGA platform, which achieves the smallest area footprint, at the cost of increased latency. This
implementation was enabled by a new methodology that we developed for the
masking of generic Boolean functions. We introduced the first masked AES
in software with only 2 bits of randomness, including for the offline masking.
We investigated whether PRNGs for masking require masking themselves by
successfully attacking the NIST CTR_DRBG and identifying ways to prevent
this. Finally, in this thesis (Chapter 3), we also explored the trade-off between
cost and security by comparing inner product and Boolean masking. We
introduced a new algorithm for multiplying inner product masked variables,
which was shown to exhibit less leakage in practice.

Research Question 2: How can we efficiently and conclusively verify the security of masked implementations? With our consolidating work, we have
unified all existing security notions for masked implementations into one
framework, with a mathematical description that includes various types of
masking, various adversary models and even both provable and practical security.
This description also directly suggests a tool for the provable verification of
small gadgets or flaw detection of larger components. It can potentially be used
for models that have not been defined yet. On the other hand, our work on
micro-architectural effects has shown that it might not be possible to define an
accurate adversary model for software masking.

Research Question 3: How can we transform masking countermeasures into a more general countermeasure against both side-channel and fault attacks?
We have introduced the first integrated adversary model for combined side-
channel and fault attacks and the first combined countermeasure with provable
security in this model: CAPA. This countermeasure was applied to the AES
cipher and the SHA-3 hash function. We also proposed a generic family of
countermeasures against side-channel and fault analysis based on the extension
of any masking scheme with MAC tags and infective computation: M&M. M&M
is secure in a weaker model than CAPA, but achieves far better efficiency.
Research Question 4: How can we design new symmetric primitives such that the masking overhead is minimized? By optimising existing algorithms,
we have enabled the first exhaustive classification of quadratic Boolean functions
of input size up to 6. In addition, we have performed an extensive investigation
into good S-boxes for masking applications of various sizes. In this thesis, we
also extended that work with an original contribution. For the first time, we
have considered and compared the NIST lightweight competition candidates in
terms of implementation criteria that matter for masking.

Conclusions and Future Work

Throughout the previous chapters, there were recurring themes, which we use
now to recall the most important conclusions and directions for future work.

The Boolean Bias. In Chapter 2, we showed that Boolean masking is much more popular than its counterparts, polynomial and inner product masking.
While initially, the latter two were considerably more expensive, this gap is
closing with the realisation that the ISW multiplication can easily be adapted
to the other masking types. Hence, as we move forward, we should not overlook
these alternatives, especially for software masking applications, where many
theoretically secure schemes demonstrate leakage in practice. It has been
shown that the increased algebraic complexity of polynomial and inner product
masking results in less information leakage in practice. Also in Chapter 3, we
noted that most verification tools for masking are tailored to Boolean masking,
even though our description based on mutual information does not rely on a
particular type of masking.
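To make the adaptability concrete: a first-order ISW-style multiplication on Boolean shares is only a few lines, and the same recipe carries over to polynomial and inner product masking by replacing the XOR/AND of GF(2) with the addition/multiplication of the underlying encoding. This is a minimal functional sketch for two shares only; the bracketing is what matters for security in hardware, not for correctness.

```python
import secrets

def mask_bit(x):
    """Split a bit into two Boolean shares with x0 XOR x1 == x."""
    r = secrets.randbits(1)
    return (r, x ^ r)

def isw_and(a, b):
    """First-order ISW multiplication: shares of (a AND b) from shares a, b.

    Consumes one fresh random bit r to refresh the cross products.
    """
    r = secrets.randbits(1)
    c0 = (a[0] & b[0]) ^ r
    # evaluate r ^ a0*b1 first, then add a1*b0, then the inner product a1*b1
    c1 = (a[1] & b[1]) ^ ((r ^ (a[0] & b[1])) ^ (a[1] & b[0]))
    return (c0, c1)
```

Unmasking `c0 ^ c1` recovers `(a0 ^ a1) & (b0 ^ b1)`, i.e. the AND of the unshared inputs, for every choice of randomness.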

Low Latency. In various places, we have seen that low latency is often not
a priority. The majority of the masked implementations in Chapter 2 have
been designed for optimal area or randomness use, using a serial architecture.
The use of dual-rail logic dates back more than five years, but it can yield a masked AES that uses very few clock cycles. It is an interesting topic
for further research, also for combined countermeasures (Chapter 4), since
it exhibits inherent redundancy. We saw the same trend in the design of
cryptographic primitives in Chapter 5. While the last few years have seen an
increase of designs for low multiplicative complexity, the same cannot be said
for multiplicative depth. Future research should give latency more weight, since modern technologies with increasingly smaller transistor sizes make area a less limiting constraint.
Adversary Models. Many open questions in the literature surround adversary models. In Chapter 3, we discussed the fact that the probing model
is not an accurate reflection of side-channel adversaries in software. While the
robust probing model described how we can extend probes according to physical
characteristics, it remains unknown how to do this for software platforms in a way
that guarantees security in practice, without requiring unnecessarily expensive
schemes. When considering combined side-channel and fault adversaries as in
Chapter 4, the issue of describing them mathematically is even more difficult.
Many different fault models exist and apart from the tile-probe-and-faulting
model, no combined adversaries have been formally described. Future work
is required on these models, joined with practical experiments to confirm the
models and minimize the gap between theory and practice.

Verification. In Chapter 2, we already noted that there is a gap between the literature on software and hardware masking in the way that schemes are
verified. In Chapter 3, we described these theoretical and practical verification
methodologies and proposed a bottom-up approach that includes verification at
every level. Most importantly, practical experiments with real devices should
always be a part of the verification-chain, since there will always be a gap
between theoretical adversary models and real-life attack scenarios. Further,
we need the verification methods to be extended for combined countermeasures
as described in Chapter 4. There, the state of the art is virtually nonexistent, since adversary
models have not been studied in depth. Also here, practical experiments should
play an important part. Currently, no TVLA-like approaches have been devised
yet and it is unlikely that a single experiment can account for the many different
types of fault attacks.
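As a reference point for the side-channel side, fixed-vs-random TVLA reduces at each trace sample to a Welch's t-test against the customary ±4.5 threshold; the sketch below is illustrative only (univariate, first-order), and a real evaluation would also cover higher-order and multivariate settings.

```python
import math

def welch_t(fixed, random):
    """Welch's t-statistic between two sets of leakage samples."""
    n1, n2 = len(fixed), len(random)
    m1, m2 = sum(fixed) / n1, sum(random) / n2
    v1 = sum((x - m1) ** 2 for x in fixed) / (n1 - 1)    # sample variances
    v2 = sum((x - m2) ** 2 for x in random) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def leaks(fixed, random, threshold=4.5):
    """Flag a sample point as leaky if |t| exceeds the usual TVLA threshold."""
    return abs(welch_t(fixed, random)) > threshold
```

As the paragraph argues, no analogous single pass/fail experiment exists yet for fault or combined attacks.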

Randomness. Our work is based on the assumption that random bits are
readily available for masking and combined countermeasures. In practice, we
noted that this is achieved with a PRNG. This critical component requires
a lot more attention in the literature. In Chapter 2, we noted that a PRNG
for masking probably does not need to be masked itself, but that specific
constructions should be proposed and analysed for clarity on the cost of
randomness. Furthermore, in Chapter 4, it became clear that there is a need
for PRNGs that are robust in the presence of faults. This means on the one
hand, that we need to add fault detection mechanisms, but on the other hand,
that we need to work out proper procedures for when an error is detected. At
this point, this issue might be more important to investigate than combined
countermeasures themselves, since a failure of randomness allows one to trivially
bypass the countermeasures.
Bibliography

[ABB+ 14] Andreeva, E., Bilgin, B., Bogdanov, A., Luykx, A.,
Mendel, F., Mennink, B., Mouha, N., Wang, Q., and
Yasuda, K. PRIMATEs: Submission to the CAESAR compe-
tition. https://competitions.cr.yp.to/round1/primatesv1.
pdf, March 2014.
[Abe10] Abe, M. (ed.). Advances in Cryptology - ASIACRYPT 2010 -
16th International Conference on the Theory and Application of
Cryptology and Information Security, Singapore, December 5-9,
2010. Proceedings, Lecture Notes in Computer Science, vol. 6477.
Springer, 2010.
[AG01] Akkar, M. and Giraud, C. An implementation of DES and
AES, secure against some attacks. In Koç et al. [KNP01], 309–318.
[AGR+ 16] Albrecht, M.R., Grassi, L., Rechberger, C., Roy, A.,
and Tiessen, T. MiMC: Efficient encryption and cryptographic
hashing with minimal multiplicative complexity. In Cheon and
Takagi [CT16], 191–219.
[AH17] Avanzi, R. and Heys, H.M. (eds.). Selected Areas in
Cryptography - SAC 2016 - 23rd International Conference, St.
John’s, NL, Canada, August 10-12, 2016, Revised Selected Papers,
Lecture Notes in Computer Science, vol. 10532. Springer, 2017.
[AJ01] Attali, I. and Jensen, T.P. (eds.). Smart Card Programming
and Security, International Conference on Research in Smart Cards,
E-smart 2001, Cannes, France, September 19-21, 2001, Proceedings,
Lecture Notes in Computer Science, vol. 2140. Springer, 2001.
[ARS+ 15] Albrecht, M.R., Rechberger, C., Schneider, T., Tiessen,
T., and Zohner, M. Ciphers for MPC and FHE. In Oswald and
Fischlin [OF15], 430–454.

[Ava17] Avanzi, R. The QARMA block cipher family. Almost MDS matrices over rings with zero divisors, nearly symmetric Even-Mansour constructions with non-involutory central rounds, and
search heuristics for low-latency S-boxes. IACR Trans. Symmetric
Cryptol., 2017(2017)(1), 4–44.
[AVFM07] Amiel, F., Villegas, K., Feix, B., and Marcel, L. Passive
and active combined attacks: Combining fault attacks and side
channel analysis. In Breveglieri et al. [BGK+ 07], 92–102.

[AWMN19] Arribas, V., Wegener, F., Moradi, A., and Nikova, S. Cryptographic fault diagnosis using VerFI. IACR Cryptology ePrint
Archive, 2019(2019), 1312.
[AZ11] Ardagna, C.A. and Zhou, J. (eds.). Information Security
Theory and Practice. Security and Privacy of Mobile Devices
in Wireless Communication - 5th IFIP WG 11.2 International
Workshop, WISTP 2011, Heraklion, Crete, Greece, June 1-3, 2011.
Proceedings, Lecture Notes in Computer Science, vol. 6633. Springer,
2011.
[BBC+ 19] Barthe, G., Belaïd, S., Cassiers, G., Fouque, P., Grégoire,
B., and Standaert, F. MaskVerif: Automated verification of
higher-order masking in presence of physical defaults. In Sako et al.
[SSR19], 300–318.
[BBC+ 20] Bellizia, D., Bronchain, O., Cassiers, G., Grosso, V., Guo,
C., Momin, C., Pereira, O., Peters, T., and Standaert, F.
Mode-level vs. implementation-level physical security in symmetric
cryptography: A practical guide through the leakage-resistance
jungle. IACR Cryptol. ePrint Arch., 2020(2020), 211.
[BBD+ 16] Barthe, G., Belaïd, S., Dupressoir, F., Fouque, P.,
Grégoire, B., Strub, P., and Zucchini, R. Strong non-
interference and type-directed higher-order masking. In Weippl
et al. [WKK+ 16], 116–129.
[BBK+ 03] Bertoni, G., Breveglieri, L., Koren, I., Maistri, P., and
Piuri, V. Error analysis and detection procedures for a hardware
implementation of the Advanced Encryption Standard. IEEE
Trans. Computers, 52(2003)(4), 492–505.

[BBKM04] Bertoni, G., Breveglieri, L., Koren, I., and Maistri, P. An efficient hardware-based fault diagnosis scheme for AES:
performances and cost. In 19th IEEE International Symposium
on Defect and Fault-Tolerance in VLSI Systems (DFT 2004), 10-13 October 2004, Cannes, France, Proceedings, 130–138. IEEE
Computer Society, 2004.

[BBP+ 16] Belaïd, S., Benhamouda, F., Passelègue, A., Prouff, E.,
Thillard, A., and Vergnaud, D. Randomness complexity of
private circuits for multiplication. In Fischlin and Coron [FC16],
616–648.

[BBS17] Bozilov, D., Bilgin, B., and Sahin, H.A. A note on 5-bit
quadratic permutations’ classification. IACR Trans. Symmetric
Cryptol., 2017(2017)(1), 398–404.
[BC13] Bertoni, G. and Coron, J. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2013 - 15th International Workshop,
Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, Lecture
Notes in Computer Science, vol. 8086. Springer, 2013.
[BCG+ 12] Borghoff, J., Canteaut, A., Güneysu, T., Kavun, E.B.,
Knezevic, M., Knudsen, L.R., Leander, G., Nikov, V.,
Paar, C., Rechberger, C., Rombouts, P., Thomsen, S.S.,
and Yalçin, T. PRINCE - A low-latency block cipher for pervasive
computing applications - extended abstract. In Wang and Sako
[WS12], 208–225.
[BCN+ 06] Bar-El, H., Choukri, H., Naccache, D., Tunstall, M.,
and Whelan, C. The sorcerer’s apprentice guide to fault attacks.
Proceedings of the IEEE, 94(2006)(2), 370–382.

[BCO04] Brier, E., Clavier, C., and Olivier, F. Correlation power analysis with a leakage model. In Joye and Quisquater [JQ04],
16–29.
[BCZ18] Bettale, L., Coron, J., and Zeitoun, R. Improved high-order
conversion from Boolean to arithmetic masking. IACR Trans.
Cryptogr. Hardw. Embed. Syst., 2018(2018)(2), 22–45.
[BDBP03] Biryukov, A., De Cannière, C., Braeken, A., and Preneel,
B. A toolbox for cryptanalysis: Linear and affine equivalence
algorithms. In Biham [Bih03], 33–50.

[BDF+ 17] Barthe, G., Dupressoir, F., Faust, S., Grégoire, B.,
Standaert, F., and Strub, P. Parallel implementations of
masking schemes and the bounded moment leakage model. In
Coron and Nielsen [CN17], 535–566.
[BDL97] Boneh, D., DeMillo, R.A., and Lipton, R.J. On the importance of checking cryptographic protocols for faults (extended
abstract). In Fumy [Fum97], 37–51.

[BDPV11] Bertoni, G., Daemen, J., Peeters, M., and Van Assche, G.
The Keccak reference. http://keccak.noekeon.org/, 2011.
[BF19] Bilgin, B. and Fischer, J. (eds.). Smart Card Research and
Advanced Applications, 17th International Conference, CARDIS
2018, Montpellier, France, November 12-14, 2018, Revised Selected
Papers, Lecture Notes in Computer Science, vol. 11389. Springer,
2019.
[BFG15] Balasch, J., Faust, S., and Gierlichs, B. Inner product
masking revisited. In Oswald and Fischlin [OF15], 486–510.

[BFG+ 17] Balasch, J., Faust, S., Gierlichs, B., Paglialonga, C., and
Standaert, F. Consolidating inner product masking. In Takagi
and Peyrin [TP17], 724–754.
[BFGV12] Balasch, J., Faust, S., Gierlichs, B., and Verbauwhede,
I. Theory and practice of a leakage resilient masking scheme. In
Wang and Sako [WS12], 758–775.
[BG12] Bertoni, G. and Gierlichs, B. (eds.). 2012 Workshop on
Fault Diagnosis and Tolerance in Cryptography, Leuven, Belgium,
September 9, 2012. IEEE Computer Society, 2012.
[BG13] Battistello, A. and Giraud, C. Fault analysis of infective
AES computations. In Fischer and Schmidt [FS13], 101–107.
[BGG+ 14] Balasch, J., Gierlichs, B., Grosso, V., Reparaz, O., and
Standaert, F. On the cost of lazy engineering for masked software
implementations. In Joye and Moradi [JM15], 64–81.

[BGG+ 16] Boss, E., Grosso, V., Güneysu, T., Leander, G., Moradi,
A., and Schneider, T. Strong 8-bit sboxes with efficient masking
in hardware. In Gierlichs and Poschmann [GP16], 171–193.
[BGI+ 18] Bloem, R., Groß, H., Iusupov, R., Könighofer, B.,
Mangard, S., and Winter, J. Formal verification of masked
hardware implementations in the presence of glitches. In Nielsen
and Rijmen [NR18], 321–353.
[BGK04] Blömer, J., Guajardo, J., and Krummel, V. Provably secure
masking of AES. In Handschuh and Hasan [HH04], 69–83.
[BGK+ 07] Breveglieri, L., Gueron, S., Koren, I., Naccache, D.,
and Seifert, J. (eds.). Fourth International Workshop on Fault
Diagnosis and Tolerance in Cryptography, 2007, FDTC 2007:
Vienna, Austria, 10 September 2007. IEEE Computer Society, 2007.
[BGN+ 14a] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. Higher-order threshold implementations. In Sarkar
and Iwata [SI14], 326–343.

[BGN+ 14b] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. A more efficient AES threshold implementation. In
Pointcheval and Vergnaud [PV14], 267–284.
[BGN+ 15] Bilgin, B., Gierlichs, B., Nikova, S., Nikov, V., and
Rijmen, V. Trade-offs for threshold implementations illustrated
on AES. IEEE Trans. on CAD of Integrated Circuits and Systems,
34(2015)(7), 1188–1200.
[BGR18] Belaïd, S., Goudarzi, D., and Rivain, M. Tight private
circuits: Achieving probing security with the least refreshing. In
Peyrin and Galbraith [PG18], 343–372.

[BGRV15] Balasch, J., Gierlichs, B., Reparaz, O., and Verbauwhede, I. DPA, bitslicing and masking at 1 GHz. In Güneysu and
Handschuh [GH15], 599–619.
[Bih97a] Biham, E. A fast new DES implementation in software. In Fast
Software Encryption, 4th International Workshop, FSE ’97, Haifa,
Israel, January 20-22, 1997, Proceedings [Bih97b], 260–272.
[Bih97b] Biham, E. (ed.). Fast Software Encryption, 4th International
Workshop, FSE ’97, Haifa, Israel, January 20-22, 1997,
Proceedings, Lecture Notes in Computer Science, vol. 1267. Springer,
1997.

[Bih03] Biham, E. (ed.). Advances in Cryptology - EUROCRYPT 2003, International Conference on the Theory and Applications
of Cryptographic Techniques, Warsaw, Poland, May 4-8, 2003,
Proceedings, Lecture Notes in Computer Science, vol. 2656. Springer,
2003.

[Bil15] Bilgin, B. Threshold implementations: as countermeasure against higher-order differential power analysis. Ph.D. thesis, University of
Twente, Enschede, Netherlands, 2015.
[Bir07] Biryukov, A. (ed.). Fast Software Encryption, 14th International Workshop, FSE 2007, Luxembourg, Luxembourg, March 26-28,
2007, Revised Selected Papers, Lecture Notes in Computer Science,
vol. 4593. Springer, 2007.
[BJK+ 10] Breveglieri, L., Joye, M., Koren, I., Naccache, D., and
Verbauwhede, I. (eds.). 2010 Workshop on Fault Diagnosis and
Tolerance in Cryptography, FDTC 2010, Santa Barbara, California,
USA, 21 August 2010. IEEE Computer Society, 2010.

[BJK+ 16] Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi,
A., Peyrin, T., Sasaki, Y., Sasdrich, P., and Sim, S.M.
The SKINNY family of block ciphers and its low-latency variant
MANTIS. In Robshaw and Katz [RK16], 123–153.

[BK15] Barker, E. and Kelsey, J. Recommendation for random number generation using deterministic random bit generators. NIST
SP 800-90A Rev. 1, June 2015.
[BKL+ 07] Bogdanov, A., Knudsen, L.R., Leander, G., Paar,
C., Poschmann, A., Robshaw, M.J.B., Seurin, Y., and
Vikkelsoe, C. PRESENT: an ultra-lightweight block cipher. In
Paillier and Verbauwhede [PV07], 450–466.
[BKM18] Batina, L., Kühne, U., and Mentens, N. (eds.). PROOFS
2018, 7th International Workshop on Security Proofs for Embedded
Systems, colocated with CHES 2018, Amsterdam, The Netherlands,
September 13, 2018, Kalpa Publications in Computing, vol. 7.
EasyChair, 2018.
[BKN18] Bozilov, D., Knezevic, M., and Nikov, V. Optimized
threshold implementations: Securing cryptographic accelerators for
low-energy and low-latency applications. IACR Cryptology ePrint
Archive, 2018(2018), 922.

[BKNS06] Breveglieri, L., Koren, I., Naccache, D., and Seifert, J. (eds.). Fault Diagnosis and Tolerance in Cryptography, Third
International Workshop, FDTC 2006, Yokohama, Japan, October
10, 2006, Proceedings, Lecture Notes in Computer Science, vol.
4236. Springer, 2006.

[BL08] Brinkmann, M. and Leander, G. On the classification of APN functions up to dimension five. Des. Codes Cryptogr., 49(2008)(1-3),
273–288.
[BL10] Bernstein, D.J. and Lange, T. (eds.). Progress in Cryptology - AFRICACRYPT 2010, Third International Conference on
Cryptology in Africa, Stellenbosch, South Africa, May 3-6, 2010.
Proceedings, Lecture Notes in Computer Science, vol. 6055. Springer,
2010.
[BM16] Bertoni, G. and Martinoli, M. A methodology for the
characterisation of leakages in combinatorial logic. In Carlet et al.
[CHS16], 363–382.
[BNN+ 12] Bilgin, B., Nikova, S., Nikov, V., Rijmen, V., and Stütz,
G. Threshold implementations of all 3×3 and 4×4 S-boxes. In
Prouff and Schaumont [PS12], 76–91.
[Bon03] Boneh, D. (ed.). Advances in Cryptology - CRYPTO 2003,
23rd Annual International Cryptology Conference, Santa Barbara,
California, USA, August 17-21, 2003, Proceedings, Lecture Notes
in Computer Science, vol. 2729. Springer, 2003.
[Boy01] Boyd, C. (ed.). Advances in Cryptology - ASIACRYPT 2001,
7th International Conference on the Theory and Application
of Cryptology and Information Security, Gold Coast, Australia,
December 9-13, 2001, Proceedings, Lecture Notes in Computer
Science, vol. 2248. Springer, 2001.
[BP12] Boyar, J. and Peralta, R. A small depth-16 circuit for the
AES S-box. In Gritzalis et al. [GFT12], 287–298.
[BR00] Barreto, P.S.L.M. and Rijmen, V. The Khazad legacy-level
block cipher, 2000. Submission to NESSIE.
[BR11] Barreto, P.S.L.M. and Rijmen, V. Whirlpool. In van Tilborg
and Jajodia [vTJ11], 1384–1385.
[BR14] Batina, L. and Robshaw, M. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2014 - 16th International Workshop,
Busan, South Korea, September 23-26, 2014. Proceedings, Lecture
Notes in Computer Science, vol. 8731. Springer, 2014.
[BS90] Biham, E. and Shamir, A. Differential cryptanalysis of DES-like
cryptosystems. In Menezes and Vanstone [MV91], 2–21.
[BS97] Biham, E. and Shamir, A. Differential fault analysis of secret
key cryptosystems. In Jr. [Jr.97], 513–525.
[BS03] Blömer, J. and Seifert, J. Fault based cryptanalysis of the
Advanced Encryption Standard (AES). In Wright [Wri03], 162–181.
[BW72] Berlekamp, E.R. and Welch, L.R. Weight distributions of the cosets of the (32, 6) Reed-Muller code. IEEE Trans. Information
Theory, 18(1972)(1), 203–207.

[Can05] Canright, D. A very compact S-box for AES. In Rao and Sunar
[RS05], 441–455.
[CCZ98] Carlet, C., Charpin, P., and Zinoviev, V.A. Codes, bent
functions and permutations suitable for DES-like cryptosystems.
Des. Codes Cryptogr., 15(1998)(2), 125–156.
[CDL15] Canteaut, A., Duval, S., and Leurent, G. Construction
of lightweight S-boxes using Feistel and MISTY structures. In
Dunkelman and Keliher [DK16], 373–393.
[CFGR10] Clavier, C., Feix, B., Gagnerot, G., and Roussellet, M.
Passive and active combined attacks on AES: Combining fault
attacks and side channel analysis. In Breveglieri et al. [BJK+ 10],
10–19.
[CG09] Clavier, C. and Gaj, K. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2009, 11th International Workshop,
Lausanne, Switzerland, September 6-9, 2009, Proceedings, Lecture
Notes in Computer Science, vol. 5747. Springer, 2009.
[CGTV15] Coron, J., Großschädl, J., Tibouchi, M., and Vadnala,
P.K. Conversion from arithmetic to Boolean masking with
logarithmic complexity. In Leander [Lea15], 130–149.

[CHS16] Carlet, C., Hasan, M.A., and Saraswat, V. (eds.). Security, Privacy, and Applied Cryptography Engineering - 6th International
Conference, SPACE 2016, Hyderabad, India, December 14-18,
2016, Proceedings, Lecture Notes in Computer Science, vol. 10076.
Springer, 2016.

[CJRR99] Chari, S., Jutla, C.S., Rao, J.R., and Rohatgi, P. Towards
sound approaches to counteract power-analysis attacks. In Wiener
[Wie99], 398–412.
[CN17] Coron, J. and Nielsen, J.B. (eds.). Advances in Cryptology -
EUROCRYPT 2017 - 36th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Paris,
France, April 30 - May 4, 2017, Proceedings, Part I, Lecture Notes
in Computer Science, vol. 10210. 2017.
[Cor18] Coron, J. Formal verification of side-channel countermeasures via elementary circuit transformations. In Preneel and Vercauteren
[PV18], 65–82.

[CP02] Courtois, N. and Pieprzyk, J. Cryptanalysis of block ciphers with overdefined systems of equations. In Zheng [Zhe02], 267–287.
[CPR07] Coron, J., Prouff, E., and Rivain, M. Side channel
cryptanalysis of a higher order masking scheme. In Paillier and
Verbauwhede [PV07], 28–44.
[CPRR13] Coron, J., Prouff, E., Rivain, M., and Roche, T. Higher-
order side channel security and mask refreshing. In Moriai [Mor14],
410–424.
[CPRR15] Carlet, C., Prouff, E., Rivain, M., and Roche, T. Algebraic
decomposition for probing security. In Gennaro and Robshaw
[GR15], 742–763.
[CR17] Clavier, C. and Reynaud, L. Improved blind side-channel
analysis by exploitation of joint distributions of leakages. In Fischer
and Homma [FH17], 24–44.

[Cra12] Cramer, R. (ed.). Theory of Cryptography - 9th Theory of Cryptography Conference, TCC 2012, Taormina, Sicily, Italy,
March 19-21, 2012. Proceedings, Lecture Notes in Computer
Science, vol. 7194. Springer, 2012.
[CRR02] Chari, S., Rao, J.R., and Rohatgi, P. Template attacks. In
Kaliski Jr. et al. [KKP03], 13–28.
[CS19] Cassiers, G. and Standaert, F. Towards globally optimized
masking: From low randomness to low noise rate or probe isolating
multiplications with reduced randomness and security against
horizontal attacks. IACR Trans. Cryptogr. Hardw. Embed. Syst.,
2019(2019)(2), 162–198.
[CT16] Cheon, J.H. and Takagi, T. (eds.). Advances in Cryptology -
ASIACRYPT 2016 - 22nd International Conference on the Theory
and Application of Cryptology and Information Security, Hanoi,
Vietnam, December 4-8, 2016, Proceedings, Part I, Lecture Notes
in Computer Science, vol. 10031. 2016.
[CV94] Chabaud, F. and Vaudenay, S. Links between differential and
linear cryptanalysis. In Santis [San95], 356–365.
[CV04] Canteaut, A. and Viswanathan, K. (eds.). Progress in Cryptology - INDOCRYPT 2004, 5th International Conference
on Cryptology in India, Chennai, India, December 20-22, 2004,
Proceedings, Lecture Notes in Computer Science, vol. 3348. Springer,
2004.
[Dae17] Daemen, J. Changing of the guards: A simple and efficient
method for achieving uniformity in threshold sharing. In Fischer
and Homma [FH17], 137–153.
[DBC+ 18] Dutertre, J., Beroulle, V., Candelier, P., Castro, S.D.,
Faber, L., Flottes, M., Gendrier, P., Hély, D., Leveugle,
R., Maistri, P., Natale, G.D., Papadimitriou, A., and
Rouzeyre, B. Laser fault injection at the CMOS 28 nm technology
node: an analysis of the fault model. In 2018 Workshop on Fault
Diagnosis and Tolerance in Cryptography, FDTC 2018, Amsterdam,
The Netherlands, September 13, 2018, 1–6. IEEE Computer Society,
2018.
[DBR+ 15] De Cnudde, T., Bilgin, B., Reparaz, O., Nikov, V., and
Nikova, S. Higher-order threshold implementation of the AES
S-box. In Homma and Medwed [HM16], 259–272.
[DDF14] Duc, A., Dziembowski, S., and Faust, S. Unifying leakage
models: From probing attacks to noisy leakage. In Nguyen and
Oswald [NO14], 423–440.
[DDK09] De Cannière, C., Dunkelman, O., and Knezevic, M.
KATAN and KTANTAN - A family of small and efficient hardware-
oriented block ciphers. In Clavier and Gaj [CG09], 272–288.
[De 07] De Cannière, C. Analysis and Design of Symmetric Encryption
Algorithms. Ph.D. thesis, Katholieke Universiteit Leuven, 2007.
[DeC18] De Cnudde, T. Cryptography Secured Against Side-Channel
Attacks. Ph.D. thesis, KU Leuven, Belgium, 2018.
[DEG+ 18] Dobraunig, C., Eichlseder, M., Groß, H., Mangard,
S., Mendel, F., and Primas, R. Statistical ineffective fault
attacks on masked AES with fault countermeasures. In Peyrin and
Galbraith [PG18], 315–342.
[DEK+ 18] Dobraunig, C., Eichlseder, M., Korak, T., Mangard, S.,
Mendel, F., and Primas, R. SIFA: exploiting ineffective fault
inductions on symmetric cryptography. IACR Trans. Cryptogr.
Hardw. Embed. Syst., 2018(2018)(3), 547–572.
[DEM+ 17] Dobraunig, C., Eichlseder, M., Mangard, S., Mendel,
F., and Unterluggauer, T. ISAP - towards side-channel
secure authenticated encryption. IACR Trans. Symmetric Cryptol.,
2017(2017)(1), 80–105.
[DEM18] De Cnudde, T., Ender, M., and Moradi, A. Hardware
masking, revisited. IACR Trans. Cryptogr. Hardw. Embed. Syst.,
2018(2018)(2), 123–148.
[DF12] Dziembowski, S. and Faust, S. Leakage-resilient circuits
without computational assumptions. In Cramer [Cra12], 230–247.
[DK16] Dunkelman, O. and Keliher, L. (eds.). Selected Areas
in Cryptography - SAC 2015 - 22nd International Conference,
Sackville, NB, Canada, August 12-14, 2015, Revised Selected
Papers, Lecture Notes in Computer Science, vol. 9566. Springer,
2016.
[DN16] De Cnudde, T. and Nikova, S. More efficient private circuits
II through threshold implementations. In 2016 Workshop on Fault
Diagnosis and Tolerance in Cryptography, FDTC 2016, Santa
Barbara, CA, USA, August 16, 2016, 114–124. IEEE Computer
Society, 2016.
[DPSZ12] Damgård, I., Pastro, V., Smart, N.P., and Zakarias, S.
Multiparty computation from somewhat homomorphic encryption.
In Safavi-Naini and Canetti [SC12], 643–662.
[DR98] Daemen, J. and Rijmen, V. The block cipher Rijndael. In
Quisquater and Schneier [QS00], 277–284.
[DRB+ 16] De Cnudde, T., Reparaz, O., Bilgin, B., Nikova, S., Nikov,
V., and Rijmen, V. Masking AES with d+1 shares in hardware.
In Gierlichs and Poschmann [GP16], 194–212.
[EKM+ 08] Eisenbarth, T., Kasper, T., Moradi, A., Paar, C.,
Salmasizadeh, M., and Shalmani, M.T.M. On the power
of power analysis in the real world: A complete break of the
KeeLoq code hopping scheme. In Wagner [Wag08], 203–220.
[FC16] Fischlin, M. and Coron, J. (eds.). Advances in Cryptology -
EUROCRYPT 2016 - 35th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Vienna,
Austria, May 8-12, 2016, Proceedings, Part II, Lecture Notes in
Computer Science, vol. 9666. Springer, 2016.
[FG18] Fan, J. and Gierlichs, B. (eds.). Constructive Side-
Channel Analysis and Secure Design - 9th International Workshop,
COSADE 2018, Singapore, April 23-24, 2018, Proceedings, Lecture
Notes in Computer Science, vol. 10815. Springer, 2018.
[FGP+ 18] Faust, S., Grosso, V., Pozo, S.M.D., Paglialonga, C., and
Standaert, F. Composable masking schemes in the presence
of physical defaults & the robust probing model. IACR Trans.
Cryptogr. Hardw. Embed. Syst., 2018(2018)(3), 89–120.
[FH17] Fischer, W. and Homma, N. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2017 - 19th International Conference,
Taipei, Taiwan, September 25-28, 2017, Proceedings, Lecture Notes
in Computer Science, vol. 10529. Springer, 2017.
[FPS17] Faust, S., Paglialonga, C., and Schneider, T. Amortizing
randomness complexity in private circuits. In Takagi and Peyrin
[TP17], 781–810.
[FS13] Fischer, W. and Schmidt, J. (eds.). 2013 Workshop on Fault
Diagnosis and Tolerance in Cryptography, Los Alamitos, CA, USA,
August 20, 2013. IEEE Computer Society, 2013.
[Fum97] Fumy, W. (ed.). Advances in Cryptology - EUROCRYPT
’97, International Conference on the Theory and Application of
Cryptographic Techniques, Konstanz, Germany, May 11-15, 1997,
Proceeding, Lecture Notes in Computer Science, vol. 1233. Springer,
1997.
[GBTP08] Gierlichs, B., Batina, L., Tuyls, P., and Preneel, B.
Mutual information analysis. In Oswald and Rohatgi [OR08],
426–442.
[GFT12] Gritzalis, D., Furnell, S., and Theoharidou, M. (eds.).
Information Security and Privacy Research - 27th IFIP TC
11 Information Security and Privacy Conference, SEC 2012,
Heraklion, Crete, Greece, June 4-6, 2012. Proceedings, IFIP
Advances in Information and Communication Technology, vol. 376.
Springer, 2012.
[GG14] Garay, J.A. and Gennaro, R. (eds.). Advances in Cryptology
- CRYPTO 2014 - 34th Annual Cryptology Conference, Santa
Barbara, CA, USA, August 17-21, 2014, Proceedings, Part I,
Lecture Notes in Computer Science, vol. 8616. Springer, 2014.
[GH15] Güneysu, T. and Handschuh, H. (eds.). Cryptographic
Hardware and Embedded Systems - CHES 2015 - 17th International
Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings,
Lecture Notes in Computer Science, vol. 9293. Springer, 2015.
[GIB18] Groß, H., Iusupov, R., and Bloem, R. Generic low-latency
masking in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst.,
2018(2018)(2), 1–21.
[GJJR11] Goodwill, G., Jun, B., Jaffe, J., and Rohatgi, P. A testing
methodology for side-channel resistance validation, 2011.
[GLM16] Güneysu, T., Leander, G., and Moradi, A. (eds.). Lightweight
Cryptography for Security and Privacy - 4th International
Workshop, LightSec 2015, Bochum, Germany, September 10-11,
2015, Revised Selected Papers, Lecture Notes in Computer Science,
vol. 9542. Springer, 2016.
[GM06] Goubin, L. and Matsui, M. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2006, 8th International Workshop,
Yokohama, Japan, October 10-13, 2006, Proceedings, Lecture Notes
in Computer Science, vol. 4249. Springer, 2006.
[GM10] Gammel, B.M. and Mangard, S. On the duality of probing
and fault attacks. J. Electronic Testing, 26(2010)(4), 483–493.
[GM11] Goubin, L. and Martinelli, A. Protecting AES with Shamir’s
secret sharing scheme. In Preneel and Takagi [PT11], 79–94.
[GMK16] Groß, H., Mangard, S., and Korak, T. Domain-oriented
masking: Compact masked hardware implementations with
arbitrary protection order. IACR Cryptology ePrint Archive,
2016(2016), 486.
[GMK17] Groß, H., Mangard, S., and Korak, T. An efficient side-
channel protected AES implementation with arbitrary protection
order. In Handschuh [Han17], 95–112.
[Gol59] Golomb, S.W. On the classification of Boolean functions. IRE
Trans. Information Theory, 5(1959)(5), 176–186.
[Gou01] Goubin, L. A sound method for switching between Boolean and
arithmetic masking. In Koç et al. [KNP01], 3–15.
[GP99] Goubin, L. and Patarin, J. DES and differential power analysis
(the "duplication" method). In Koç and Paar [KP99], 158–172.
[GP16] Gierlichs, B. and Poschmann, A.Y. (eds.). Cryptographic
Hardware and Embedded Systems - CHES 2016 - 18th International
Conference, Santa Barbara, CA, USA, August 17-19, 2016,
Proceedings, Lecture Notes in Computer Science, vol. 9813. Springer,
2016.
[GPQ10] Genelle, L., Prouff, E., and Quisquater, M. Secure
multiplicative masking of power functions. In Zhou and Yung
[ZY10], 200–217.
[GPQ11] Genelle, L., Prouff, E., and Quisquater, M. Thwarting
higher-order side channel analysis with additive and multiplicative
maskings. In Preneel and Takagi [PT11], 240–255.
[GPS14] Grosso, V., Prouff, E., and Standaert, F. Efficient masked
S-boxes processing - A step forward -. In Pointcheval and Vergnaud
[PV14], 251–266.
[GR15] Gennaro, R. and Robshaw, M. (eds.). Advances in Cryptology
- CRYPTO 2015 - 35th Annual Cryptology Conference, Santa
Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I,
Lecture Notes in Computer Science, vol. 9215. Springer, 2015.
[GR17] Goudarzi, D. and Rivain, M. How fast can higher-order masking
be in software? In Coron and Nielsen [CN17], 567–597.
[GST12] Gierlichs, B., Schmidt, J., and Tunstall, M. Infective
computation and dummy rounds: Fault protection for block ciphers
without check-before-output. In Hevia and Neven [HN12], 305–321.
[GST14] Genkin, D., Shamir, A., and Tromer, E. RSA key extraction
via low-bandwidth acoustic cryptanalysis. In Garay and Gennaro
[GG14], 444–461.
[GT02] Golic, J.D. and Tymen, C. Multiplicative masking and power
analysis of AES. In Kaliski Jr. et al. [KKP03], 198–212.
[Gui17] Guilley, S. (ed.). Constructive Side-Channel Analysis and
Secure Design - 8th International Workshop, COSADE 2017, Paris,
France, April 13-14, 2017, Revised Selected Papers, Lecture Notes
in Computer Science, vol. 10348. Springer, 2017.
[Ham50] Hamming, R.W. Error detecting and error correcting codes. The
Bell System Technical Journal, 29(1950)(2), 147–160.
[Han17] Handschuh, H. (ed.). Topics in Cryptology - CT-RSA 2017 - The
Cryptographers’ Track at the RSA Conference 2017, San Francisco,
CA, USA, February 14-17, 2017, Proceedings, Lecture Notes in
Computer Science, vol. 10159. Springer, 2017.
[Hel94] Helleseth, T. (ed.). Advances in Cryptology - EUROCRYPT
’93, Workshop on the Theory and Application of of Cryptographic
Techniques, Lofthus, Norway, May 23-27, 1993, Proceedings,
Lecture Notes in Computer Science, vol. 765. Springer, 1994.
[HGD+ 11] Hospodar, G., Gierlichs, B., De Mulder, E., Ver-
bauwhede, I., and Vandewalle, J. Machine learning in side-
channel analysis: a first study. J. Cryptographic Engineering,
1(2011)(4), 293–302.
[HH04] Handschuh, H. and Hasan, M.A. (eds.). Selected Areas in
Cryptography, 11th International Workshop, SAC 2004, Waterloo,
Canada, August 9-10, 2004, Revised Selected Papers, Lecture Notes
in Computer Science, vol. 3357. Springer, 2004.
[HM16] Homma, N. and Medwed, M. (eds.). Smart Card Research and
Advanced Applications - 14th International Conference, CARDIS
2015, Bochum, Germany, November 4-6, 2015. Revised Selected
Papers, Lecture Notes in Computer Science, vol. 9514. Springer,
2016.
[HN12] Hevia, A. and Neven, G. (eds.). Progress in Cryptology -
LATINCRYPT 2012 - 2nd International Conference on Cryptology
and Information Security in Latin America, Santiago, Chile,
October 7-10, 2012. Proceedings, Lecture Notes in Computer
Science, vol. 7533. Springer, 2012.
[HT19] Hutter, M. and Tunstall, M. Constant-time higher-order
Boolean-to-arithmetic masking. J. Cryptographic Engineering,
9(2019)(2), 173–184.
[IPSW06] Ishai, Y., Prabhakaran, M., Sahai, A., and Wagner, D.A.
Private circuits II: keeping secrets in tamperable circuits. In
Vaudenay [Vau06], 308–327.
[ISW03] Ishai, Y., Sahai, A., and Wagner, D.A. Private circuits:
Securing hardware against probing attacks. In Boneh [Bon03],
463–481.
[Jaf07] Jaffe, J. A first-order DPA attack against AES in counter mode
with unknown initial counter. In Paillier and Verbauwhede [PV07],
1–13.
[JM15] Joye, M. and Moradi, A. (eds.). Smart Card Research and
Advanced Applications - 13th International Conference, CARDIS
2014, Paris, France, November 5-7, 2014. Revised Selected Papers,
Lecture Notes in Computer Science, vol. 8968. Springer, 2015.
[JMPS17] Jean, J., Moradi, A., Peyrin, T., and Sasdrich, P. Bit-
sliding: A generic technique for bit-serial implementations of SPN-
based primitives - applications to AES, PRESENT and SKINNY.
In Fischer and Homma [FH17], 687–707.
[JN13] Johansson, T. and Nguyen, P.Q. (eds.). Advances in
Cryptology - EUROCRYPT 2013, 32nd Annual International
Conference on the Theory and Applications of Cryptographic
Techniques, Athens, Greece, May 26-30, 2013. Proceedings, Lecture
Notes in Computer Science, vol. 7881. Springer, 2013.
[Jou09] Joux, A. (ed.). Advances in Cryptology - EUROCRYPT 2009, 28th
Annual International Conference on the Theory and Applications
of Cryptographic Techniques, Cologne, Germany, April 26-30, 2009.
Proceedings, Lecture Notes in Computer Science, vol. 5479. Springer,
2009.
[JQ04] Joye, M. and Quisquater, J. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2004: 6th International Workshop
Cambridge, MA, USA, August 11-13, 2004. Proceedings, Lecture
Notes in Computer Science, vol. 3156. Springer, 2004.
[Jr.97] Kaliski Jr., B.S. (ed.). Advances in Cryptology - CRYPTO ’97,
17th Annual International Cryptology Conference, Santa Barbara,
California, USA, August 17-21, 1997, Proceedings, Lecture Notes
in Computer Science, vol. 1294. Springer, 1997.
[JT12] Joye, M. and Tunstall, M. (eds.). Fault Analysis in
Cryptography. Information Security and Cryptography. Springer,
2012.
[JWK04] Joshi, N., Wu, K., and Karri, R. Concurrent error detection
schemes for involution ciphers. In Joye and Quisquater [JQ04],
400–412.
[KBB+ 18] Kumar, D.S.V., Beckers, A., Balasch, J., Gierlichs,
B., and Verbauwhede, I. An in-depth and black-box
characterization of the effects of laser pulses on ATmega328P.
In Bilgin and Fischer [BF19], 156–170.
[Kim02] Kim, K. (ed.). Information Security and Cryptology - ICISC
2001, 4th International Conference Seoul, Korea, December 6-7,
2001, Proceedings, Lecture Notes in Computer Science, vol. 2288.
Springer, 2002.
[KJJ99] Kocher, P.C., Jaffe, J., and Jun, B. Differential power
analysis. In Wiener [Wie99], 388–397.
[KKG03] Karri, R., Kuznetsov, G., and Gössel, M. Parity-based
concurrent error detection of substitution-permutation network
block ciphers. In Walter et al. [WKP03], 113–124.
[KKP03] Kaliski Jr., B.S., Koç, Ç.K., and Paar, C. (eds.).
Cryptographic Hardware and Embedded Systems - CHES 2002,
4th International Workshop, Redwood Shores, CA, USA, August
13-15, 2002, Revised Papers, Lecture Notes in Computer Science,
vol. 2523. Springer, 2003.
[KKT04a] Karpovsky, M.G., Kulikowski, K.J., and Taubin, A.
Differential fault analysis attack resistant architectures for the
Advanced Encryption Standard. In Quisquater et al. [QPDK04],
177–192.
[KKT04b] Karpovsky, M.G., Kulikowski, K.J., and Taubin, A.
Robust protection against fault-injection attacks on smart cards
implementing the Advanced Encryption Standard. In 2004
International Conference on Dependable Systems and Networks
(DSN 2004), 28 June - 1 July 2004, Florence, Italy, Proceedings,
93–101. IEEE Computer Society, 2004.
[KNP01] Koç, Ç.K., Naccache, D., and Paar, C. (eds.). Cryptographic
Hardware and Embedded Systems - CHES 2001, Third International
Workshop, Paris, France, May 14-16, 2001, Proceedings, Lecture
Notes in Computer Science, vol. 2162. Springer, 2001.
[Kob96] Koblitz, N. (ed.). Advances in Cryptology - CRYPTO ’96,
16th Annual International Cryptology Conference, Santa Barbara,
California, USA, August 18-22, 1996, Proceedings, Lecture Notes
in Computer Science, vol. 1109. Springer, 1996.
[Koc96] Kocher, P.C. Timing attacks on implementations of Diffie-
Hellman, RSA, DSS, and other systems. In Koblitz [Kob96], 104–
113.
[KP99] Koç, Ç.K. and Paar, C. (eds.). Cryptographic Hardware
and Embedded Systems, First International Workshop, CHES’99,
Worcester, MA, USA, August 12-13, 1999, Proceedings, Lecture
Notes in Computer Science, vol. 1717. Springer, 1999.
[KP00] Koç, Ç.K. and Paar, C. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2000, Second International Workshop,
Worcester, MA, USA, August 17-18, 2000, Proceedings, Lecture
Notes in Computer Science, vol. 1965. Springer, 2000.
[KSM20] Knichel, D., Sasdrich, P., and Moradi, A. SILVER -
statistical independence and leakage verification. IACR Cryptology
ePrint Archive, 2020(2020), 634.
[KSV13] Karaklajic, D., Schmidt, J., and Verbauwhede, I. Hardware
designer’s guide to fault attacks. IEEE Trans. VLSI Syst.,
21(2013)(12), 2295–2306.
[KWMK02] Karri, R., Wu, K., Mishra, P., and Kim, Y. Concurrent
error detection schemes for fault-based side-channel cryptanalysis
of symmetric block ciphers. IEEE Trans. on CAD of Integrated
Circuits and Systems, 21(2002)(12), 1509–1517.
[LBS19] Levi, I., Bellizia, D., and Standaert, F. Reducing a masked
implementation’s effective security order with setup manipulations
and an explanation based on externally-amplified couplings. IACR
Trans. Cryptogr. Hardw. Embed. Syst., 2019(2019)(2), 293–317.
[Lea15] Leander, G. (ed.). Fast Software Encryption - 22nd International
Workshop, FSE 2015, Istanbul, Turkey, March 8-11, 2015, Revised
Selected Papers, Lecture Notes in Computer Science, vol. 9054.
Springer, 2015.
[Lim98] Lim, C.H. CRYPTON: A new 128-bit block cipher - specification
and analysis, 1998. AES Submission.
[LMW14] Leiserson, A.J., Marson, M.E., and Wachs, M.A. Gate-
level masking under a path-based leakage metric. In Batina and
Robshaw [BR14], 580–597.
[LRT12] Lomné, V., Roche, T., and Thillard, A. On the need of
randomness in fault attack countermeasures - application to AES.
In Bertoni and Gierlichs [BG12], 85–94.
[LT17] Lemke-Rust, K. and Tunstall, M. (eds.). Smart Card Research
and Advanced Applications - 15th International Conference,
CARDIS 2016, Cannes, France, November 7-9, 2016, Revised
Selected Papers, Lecture Notes in Computer Science, vol. 10146.
Springer, 2017.
[Mai91] Maiorana, J.A. A classification of the cosets of the Reed-Muller
Code R(1, 6). Mathematics of Computation, 57(1991)(195), 403–
414.
[Mat93] Matsui, M. Linear cryptanalysis method for DES cipher. In
Helleseth [Hel94], 386–397.
[Men05] Menezes, A. (ed.). Topics in Cryptology - CT-RSA 2005, The
Cryptographers’ Track at the RSA Conference 2005, San Francisco,
CA, USA, February 14-18, 2005, Proceedings, Lecture Notes in
Computer Science, vol. 3376. Springer, 2005.
[Mes00a] Messerges, T.S. Securing the AES finalists against power analysis
attacks. In Schneier [Sch01], 150–164.
[Mes00b] Messerges, T.S. Using second-order power analysis to attack
DPA resistant software. In Koç and Paar [KP00], 238–251.
[MM82] Meyer, C.H. and Matyas, S.M. Cryptography: A New
Dimension in Computer Data Security, chap. Implementation
Considerations for the S-box Design, 163–165. John Wiley &
Sons, 1982.
[MM13] Moradi, A. and Mischke, O. On the simplicity of converting
leakages from multivariate to univariate - (case study of a glitch-
resistant masking scheme). In Bertoni and Coron [BC13], 1–20.
[MM17] Mayes, K. and Markantonakis, K. (eds.). Smart Cards,
Tokens, Security and Applications, Second Edition. Springer, 2017.
[MMR20] Moos, T., Moradi, A., and Richter, B. Static power side-
channel analysis - an investigation of measurement factors. IEEE
Trans. VLSI Syst., 28(2020)(2), 376–389.
[Moo19] Moos, T. Static power SCA of sub-100 nm CMOS ASICs and the
insecurity of masking schemes in low-noise environments. IACR
Trans. Cryptogr. Hardw. Embed. Syst., 2019(2019)(3), 202–232.
[MOP07] Mangard, S., Oswald, E., and Popp, T. Power analysis
attacks - revealing the secrets of smart cards. Springer, 2007.
[Mor89] Mora, T. (ed.). Applied Algebra, Algebraic Algorithms and Error-
Correcting Codes, 6th International Conference, AAECC-6, Rome,
Italy, July 4-8, 1988, Proceedings, Lecture Notes in Computer
Science, vol. 357. Springer, 1989.
[Mor14] Moriai, S. (ed.). Fast Software Encryption - 20th International
Workshop, FSE 2013, Singapore, March 11-13, 2013. Revised
Selected Papers, Lecture Notes in Computer Science, vol. 8424.
Springer, 2014.
[MPG05] Mangard, S., Popp, T., and Gammel, B.M. Side-channel
leakage of masked CMOS gates. In Menezes [Men05], 351–365.
[MPL+ 11] Moradi, A., Poschmann, A., Ling, S., Paar, C., and
Wang, H. Pushing the limits: A very compact and a threshold
implementation of AES. In Paterson [Pat11], 69–88.
[MPO05] Mangard, S., Pramstaller, N., and Oswald, E. Successfully
attacking masked AES hardware implementations. In Rao and
Sunar [RS05], 157–171.
[MPP16] Maghrebi, H., Portigliatti, T., and Prouff, E. Breaking
cryptographic implementations using deep learning techniques. In
Carlet et al. [CHS16], 3–26.
[MRSS18] Moradi, A., Richter, B., Schneider, T., and Standaert,
F. Leakage detection with the χ2-test. IACR Trans. Cryptogr.
Hardw. Embed. Syst., 2018(2018)(1), 209–237.
[MS10] Mangard, S. and Standaert, F. (eds.). Cryptographic Hardware
and Embedded Systems, CHES 2010, 12th International Workshop,
Santa Barbara, CA, USA, August 17-20, 2010. Proceedings, Lecture
Notes in Computer Science, vol. 6225. Springer, 2010.
[MSGR10] Medwed, M., Standaert, F., Großschädl, J., and
Regazzoni, F. Fresh re-keying: Security against side-channel and
fault attacks for low-cost devices. In Bernstein and Lange [BL10],
279–296.
[MSY06] Malkin, T., Standaert, F., and Yung, M. A comparative
cost/security analysis of fault attack countermeasures. In
Breveglieri et al. [BKNS06], 159–172.
[MV91] Menezes, A. and Vanstone, S.A. (eds.). Advances in Cryptology
- CRYPTO ’90, 10th Annual International Cryptology Conference,
Santa Barbara, California, USA, August 11-15, 1990, Proceedings,
Lecture Notes in Computer Science, vol. 537. Springer, 1991.
[MV04] McGrew, D.A. and Viega, J. The security and performance of
the Galois/counter mode (GCM) of operation. In Canteaut and
Viswanathan [CV04], 343–355.
[MV15] Maene, P. and Verbauwhede, I. Single-cycle implementations
of block ciphers. In Güneysu et al. [GLM16], 131–147.
[NIS03] NIST/SEMATECH. 1.3.6. Probability distributions. e-Handbook
of Statistical Methods, (2003).
https://www.itl.nist.gov/div898/handbook/eda/section3/eda367.htm.
[NO14] Nguyen, P.Q. and Oswald, E. (eds.). Advances in Cryptology
- EUROCRYPT 2014 - 33rd Annual International Conference
on the Theory and Applications of Cryptographic Techniques,
Copenhagen, Denmark, May 11-15, 2014. Proceedings, Lecture
Notes in Computer Science, vol. 8441. Springer, 2014.
[NQL06] Ning, P., Qing, S., and Li, N. (eds.). Information and
Communications Security, 8th International Conference, ICICS
2006, Raleigh, NC, USA, December 4-7, 2006, Proceedings, Lecture
Notes in Computer Science, vol. 4307. Springer, 2006.
[NR18] Nielsen, J.B. and Rijmen, V. (eds.). Advances in Cryptology
- EUROCRYPT 2018 - 37th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Tel Aviv,
Israel, April 29 - May 3, 2018 Proceedings, Part II, Lecture Notes
in Computer Science, vol. 10821. Springer, 2018.
[NRR06] Nikova, S., Rechberger, C., and Rijmen, V. Threshold
implementations against side-channel attacks and glitches. In Ning
et al. [NQL06], 529–545.
[NRS11] Nikova, S., Rijmen, V., and Schläffer, M. Secure hardware
implementation of nonlinear functions in the presence of glitches.
J. Cryptology, 24(2011)(2), 292–321.
[Nyb93] Nyberg, K. Differentially uniform mappings for cryptography. In
Helleseth [Hel94], 55–64.
[Nyb15] Nyberg, K. (ed.). Topics in Cryptology - CT-RSA 2015, The
Cryptographer’s Track at the RSA Conference 2015, San Francisco,
CA, USA, April 20-24, 2015. Proceedings, Lecture Notes in
Computer Science, vol. 9048. Springer, 2015.
[OF15] Oswald, E. and Fischlin, M. (eds.). Advances in Cryptology
- EUROCRYPT 2015 - 34th Annual International Conference on
the Theory and Applications of Cryptographic Techniques, Sofia,
Bulgaria, April 26-30, 2015, Proceedings, Part I, Lecture Notes in
Computer Science, vol. 9056. Springer, 2015.
[OR08] Oswald, E. and Rohatgi, P. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2008, 10th International Workshop,
Washington, D.C., USA, August 10-13, 2008. Proceedings, Lecture
Notes in Computer Science, vol. 5154. Springer, 2008.
[Pat11] Paterson, K.G. (ed.). Advances in Cryptology - EUROCRYPT
2011 - 30th Annual International Conference on the Theory and
Applications of Cryptographic Techniques, Tallinn, Estonia, May
15-19, 2011. Proceedings, Lecture Notes in Computer Science, vol.
6632. Springer, 2011.
[PBMB17] Patranabis, S., Breier, J., Mukhopadhyay, D., and Bhasin,
S. One plus one is more than two: A practical combination of
power and fault analysis attacks on PRESENT and PRESENT-like
block ciphers. In 2017 Workshop on Fault Diagnosis and Tolerance
in Cryptography, FDTC 2017, Taipei, Taiwan, September 25, 2017,
25–32. IEEE Computer Society, 2017.
[Pey16] Peyrin, T. (ed.). Fast Software Encryption - 23rd International
Conference, FSE 2016, Bochum, Germany, March 20-23, 2016,
Revised Selected Papers, Lecture Notes in Computer Science, vol.
9783. Springer, 2016.
[PG18] Peyrin, T. and Galbraith, S.D. (eds.). Advances in Cryptology
- ASIACRYPT 2018 - 24th International Conference on the Theory
and Application of Cryptology and Information Security, Brisbane,
QLD, Australia, December 2-6, 2018, Proceedings, Part II, Lecture
Notes in Computer Science, vol. 11273. Springer, 2018.
[Pie09] Pietrzak, K. A leakage-resilient mode of operation. In Joux
[Jou09], 462–482.
[PM05] Popp, T. and Mangard, S. Masked dual-rail pre-charge logic:
DPA-resistance without routing constraints. In Rao and Sunar
[RS05], 172–186.
[Poi06] Pointcheval, D. (ed.). Topics in Cryptology - CT-RSA 2006,
The Cryptographers’ Track at the RSA Conference 2006, San Jose,
CA, USA, February 13-17, 2006, Proceedings, Lecture Notes in
Computer Science, vol. 3860. Springer, 2006.
[PQ03] Piret, G. and Quisquater, J. A differential fault attack
technique against SPN structures, with application to the AES
and KHAZAD. In Walter et al. [WKP03], 77–88.
[PR11] Prouff, E. and Roche, T. Higher-order glitches free
implementation of the AES using secure multi-party computation
protocols. In Preneel and Takagi [PT11], 63–78.
[PR13] Prouff, E. and Rivain, M. Masking against side-channel attacks:
A formal security proof. In Johansson and Nguyen [JN13], 142–159.
[PRB09] Prouff, E., Rivain, M., and Bevan, R. Statistical analysis of
second order differential power analysis. IEEE Trans. Computers,
58(2009)(6), 799–811.
[Pro11] Prouff, E. (ed.). Smart Card Research and Advanced Applications
- 10th IFIP WG 8.8/11.2 International Conference, CARDIS 2011,
Leuven, Belgium, September 14-16, 2011, Revised Selected Papers,
Lecture Notes in Computer Science, vol. 7079. Springer, 2011.
[PS12] Prouff, E. and Schaumont, P. (eds.). Cryptographic Hardware
and Embedded Systems - CHES 2012 - 14th International Workshop,
Leuven, Belgium, September 9-12, 2012. Proceedings, Lecture Notes
in Computer Science, vol. 7428. Springer, 2012.
[PT11] Preneel, B. and Takagi, T. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2011 - 13th International Workshop,
Nara, Japan, September 28 - October 1, 2011. Proceedings, Lecture
Notes in Computer Science, vol. 6917. Springer, 2011.
[PV07] Paillier, P. and Verbauwhede, I. (eds.). Cryptographic
Hardware and Embedded Systems - CHES 2007, 9th International
Workshop, Vienna, Austria, September 10-13, 2007, Proceedings,
Lecture Notes in Computer Science, vol. 4727. Springer, 2007.
[PV14] Pointcheval, D. and Vergnaud, D. (eds.). Progress in
Cryptology - AFRICACRYPT 2014 - 7th International Conference
on Cryptology in Africa, Marrakesh, Morocco, May 28-30, 2014.
Proceedings, Lecture Notes in Computer Science, vol. 8469. Springer,
2014.
[PV17] Papagiannopoulos, K. and Veshchikov, N. Mind the gap:
Towards secure 1st-order masking in software. In Guilley [Gui17],
282–297.
[PV18] Preneel, B. and Vercauteren, F. (eds.). Applied Cryptography
and Network Security - 16th International Conference, ACNS 2018,
Leuven, Belgium, July 2-4, 2018, Proceedings, Lecture Notes in
Computer Science, vol. 10892. Springer, 2018.
[PYR+ 16] Picek, S., Yang, B., Rozic, V., Vliegen, J., Winderickx,
J., De Cnudde, T., and Mentens, N. PRNGs for masking
applications and their mapping to evolvable hardware. In Lemke-
Rust and Tunstall [LT17], 209–227.
[QPDK04] Quisquater, J., Paradinas, P., Deswarte, Y., and
Kalam, A.A.E. (eds.). Smart Card Research and Advanced
Applications VI, IFIP 18th World Computer Congress, TC8/WG8.8
& TC11/WG11.2 Sixth International Conference on Smart Card
Research and Advanced Applications (CARDIS), 22-27 August 2004,
Toulouse, France, IFIP, vol. 153. Kluwer/Springer, 2004.
[QS00] Quisquater, J. and Schneier, B. (eds.). Smart Card Research
and Applications, This International Conference, CARDIS ’98,
Louvain-la-Neuve, Belgium, September 14-16, 1998, Proceedings,
Lecture Notes in Computer Science, vol. 1820. Springer, 2000.
[QS01] Quisquater, J. and Samyde, D. Electromagnetic analysis
(EMA): measures and counter-measures for smart cards. In Attali
and Jensen [AJ01], 200–210.
[RBN+ 15] Reparaz, O., Bilgin, B., Nikova, S., Gierlichs, B., and
Verbauwhede, I. Consolidating masking schemes. In Gennaro
and Robshaw [GR15], 764–783.
[Rep15] Reparaz, O. A note on the security of higher-order threshold
implementations. IACR Cryptology ePrint Archive, 2015(2015), 1.
[Rep16] Reparaz, O. Detecting flawed masking schemes with leakage
detection tests. In Peyrin [Pey16], 204–222.
[RGV12] Reparaz, O., Gierlichs, B., and Verbauwhede, I. Selecting
time samples for multivariate DPA attacks. In Prouff and
Schaumont [PS12], 155–174.
[RGV17] Reparaz, O., Gierlichs, B., and Verbauwhede, I. Fast
leakage assessment. In Fischer and Homma [FH17], 387–399.
[Rij00] Rijmen, V. Efficient implementation of the Rijndael S-box, 2000.
[RK16] Robshaw, M. and Katz, J. (eds.). Advances in Cryptology -
CRYPTO 2016 - 36th Annual International Cryptology Conference,
Santa Barbara, CA, USA, August 14-18, 2016, Proceedings, Part
II, Lecture Notes in Computer Science, vol. 9815. Springer, 2016.
[RLK11] Roche, T., Lomné, V., and Khalfallah, K. Combined fault
and side-channel attack on protected implementations of AES. In
Prouff [Pro11], 65–83.
[RM04] Roy, B.K. and Meier, W. (eds.). Fast Software Encryption,
11th International Workshop, FSE 2004, Delhi, India, February
5-7, 2004, Revised Papers, Lecture Notes in Computer Science, vol.
3017. Springer, 2004.
[RM07] Robisson, B. and Manet, P. Differential behavioral analysis.
In Paillier and Verbauwhede [PV07], 413–426.
[RP10] Rivain, M. and Prouff, E. Provably secure higher-order masking
of AES. In Mangard and Standaert [MS10], 413–427.
[RS05] Rao, J.R. and Sunar, B. (eds.). Cryptographic Hardware and
Embedded Systems - CHES 2005, 7th International Workshop,
Edinburgh, UK, August 29 - September 1, 2005, Proceedings,
Lecture Notes in Computer Science, vol. 3659. Springer, 2005.
[San95] Santis, A.D. (ed.). Advances in Cryptology - EUROCRYPT
’94, Workshop on the Theory and Application of Cryptographic
Techniques, Perugia, Italy, May 9-12, 1994, Proceedings, Lecture
Notes in Computer Science, vol. 950. Springer, 1995.
[SBY+ 18] Sijacic, D., Balasch, J., Yang, B., Ghosh, S., and
Verbauwhede, I. Towards efficient and automated side channel
evaluations at design time. In Batina et al. [BKM18], 16–31.
[SC12] Safavi-Naini, R. and Canetti, R. (eds.). Advances in
Cryptology - CRYPTO 2012 - 32nd Annual Cryptology Conference,
Santa Barbara, CA, USA, August 19-23, 2012. Proceedings, Lecture
Notes in Computer Science, vol. 7417. Springer, 2012.
[Sch88] Schnorr, C. The multiplicative complexity of Boolean functions.
In Mora [Mor89], 45–58.
[Sch01] Schneier, B. (ed.). Fast Software Encryption, 7th International
Workshop, FSE 2000, New York, NY, USA, April 10-12, 2000,
Proceedings, Lecture Notes in Computer Science, vol. 1978. Springer,
2001.
[SFES18] Seker, O., Fernandez-Rubio, A., Eisenbarth, T., and
Steinwandt, R. Extending glitch-free multiparty protocols to
resist fault injection attacks. IACR Trans. Cryptogr. Hardw. Embed.
Syst., 2018(2018)(3), 394–430.
[SG16] Sasdrich, P. and Güneysu, T. A grain in the silicon: SCA-
protected AES in less than 30 slices. In 27th IEEE International
Conference on Application-specific Systems, Architectures and
Processors, ASAP 2016, London, United Kingdom, July 6-8, 2016,
25–32. IEEE Computer Society, 2016.
[Sha45] Shannon, C. A mathematical theory of cryptography, September
1945. https://www.iacr.org/museum/shannon/shannon45.pdf.
[Sha79] Shamir, A. How to share a secret. Commun. ACM, 22(1979)(11),
612–613.
[SI14] Sarkar, P. and Iwata, T. (eds.). Advances in Cryptology -
ASIACRYPT 2014 - 20th International Conference on the Theory
and Application of Cryptology and Information Security, Kaoshiung,
Taiwan, R.O.C., December 7-11, 2014, Proceedings, Part II, Lecture
Notes in Computer Science, vol. 8874. Springer, 2014.
[Sko06] Skorobogatov, S.P. Optically enhanced position-locked power
analysis. In Goubin and Matsui [GM06], 61–75.
[SM12] Schmidt, J. and Medwed, M. Countermeasures for symmetric
key ciphers. In Joye and Tunstall [JT12], 73–87.
[SM15] Schneider, T. and Moradi, A. Leakage assessment
methodology - A clear roadmap for side-channel evaluations. In
Güneysu and Handschuh [GH15], 495–513.
[SMG16] Schneider, T., Moradi, A., and Güneysu, T. ParTI - towards
combined hardware countermeasures against side-channel and fault-
injection attacks. In Robshaw and Katz [RK16], 302–332.
[SMTM01] Satoh, A., Morioka, S., Takano, K., and Munetoh, S. A
compact Rijndael hardware architecture with S-box optimization.
In Boyd [Boy01], 239–254.
[SP06] Schramm, K. and Paar, C. Higher order masking of the AES.
In Pointcheval [Poi06], 208–225.
[SPR+ 04] Standaert, F., Piret, G., Rouvroy, G., Quisquater, J.,
and Legat, J. ICEBERG : An involutional cipher efficient for
block encryption in reconfigurable hardware. In Roy and Meier
[RM04], 279–299.
[SS16] Schwabe, P. and Stoffelen, K. All the AES you need on
Cortex-M3 and M4. In Avanzi and Heys [AH17], 180–194.
[SSA+ 07] Shirai, T., Shibutani, K., Akishita, T., Moriai, S., and
Iwata, T. The 128-bit blockcipher CLEFIA (extended abstract).
In Biryukov [Bir07], 181–195.

[SSR19] Sako, K., Schneider, S., and Ryan, P.Y.A. (eds.). Computer
Security - ESORICS 2019 - 24th European Symposium on Research
in Computer Security, Luxembourg, September 23-27, 2019,
Proceedings, Part I, Lecture Notes in Computer Science, vol. 11735.
Springer, 2019.

[Sto16] Stoffelen, K. Optimizing S-box implementations for several
criteria using SAT solvers. In Peyrin [Pey16], 140–160.
[Sug19] Sugawara, T. 3-share threshold implementation of AES S-box
without fresh randomness. IACR Trans. Cryptogr. Hardw. Embed.
Syst., 2019(2019)(1), 123–145.
[SVO+ 10] Standaert, F., Veyrat-Charvillon, N., Oswald, E.,
Gierlichs, B., Medwed, M., Kasper, M., and Mangard, S.
The world is not enough: Another look on second-order DPA. In
Abe [Abe10], 112–129.

[Tim19] Timon, B. Non-profiled deep learning-based side-channel attacks
with sensitivity analysis. IACR Trans. Cryptogr. Hardw. Embed.
Syst., 2019(2019)(2), 107–131.
[TMA11] Tunstall, M., Mukhopadhyay, D., and Ali, S. Differential
fault analysis of the Advanced Encryption Standard using a single
fault. In Ardagna and Zhou [AZ11], 224–233.
[TP17] Takagi, T. and Peyrin, T. (eds.). Advances in Cryptology -
ASIACRYPT 2017 - 23rd International Conference on the Theory
and Applications of Cryptology and Information Security, Hong
Kong, China, December 3-7, 2017, Proceedings, Part I, Lecture
Notes in Computer Science, vol. 10624. Springer, 2017.

[Tri03] Trichina, E. Combinational logic design for AES SubByte
transformation on masked data. IACR Cryptology ePrint Archive,
2003(2003), 236.
[Tun17] Tunstall, M. Smart card security. In Mayes and Markantonakis
[MM17], 217–251.
[UHA17] Ueno, R., Homma, N., and Aoki, T. Toward more efficient
DPA-resistant AES hardware architecture based on threshold
implementation. In Guilley [Gui17], 50–64.
[Vau06] Vaudenay, S. (ed.). Advances in Cryptology - EUROCRYPT
2006, 25th Annual International Conference on the Theory and
Applications of Cryptographic Techniques, St. Petersburg, Russia,
May 28 - June 1, 2006, Proceedings, Lecture Notes in Computer
Science, vol. 4004. Springer, 2006.
[vTJ11] van Tilborg, H.C.A. and Jajodia, S. (eds.). Encyclopedia of
Cryptography and Security, 2nd Ed. Springer, 2011.
[Wag08] Wagner, D.A. (ed.). Advances in Cryptology - CRYPTO 2008,
28th Annual International Cryptology Conference, Santa Barbara,
CA, USA, August 17-21, 2008. Proceedings, Lecture Notes in
Computer Science, vol. 5157. Springer, 2008.
[Wie99] Wiener, M.J. (ed.). Advances in Cryptology - CRYPTO ’99,
19th Annual International Cryptology Conference, Santa Barbara,
California, USA, August 15-19, 1999, Proceedings, Lecture Notes
in Computer Science, vol. 1666. Springer, 1999.
[WKK+ 16] Weippl, E.R., Katzenbeisser, S., Kruegel, C., Myers, A.C.,
and Halevi, S. (eds.). Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, Vienna,
Austria, October 24-28, 2016. ACM, 2016.
[WKP03] Walter, C.D., Koç, Ç.K., and Paar, C. (eds.). Cryptographic
Hardware and Embedded Systems - CHES 2003, 5th International
Workshop, Cologne, Germany, September 8-10, 2003, Proceedings,
Lecture Notes in Computer Science, vol. 2779. Springer, 2003.
[WM18a] Wegener, F. and Moradi, A. A first-order SCA resistant AES
without fresh randomness. In Fan and Gierlichs [FG18], 245–262.
[WM18b] Wegener, F. and Moradi, A. Yet another size record for AES:
A first-order SCA secure AES S-box based on GF(2^8) multiplication.
In Bilgin and Fischer [BF19], 111–124.
[Wri03] Wright, R.N. (ed.). Financial Cryptography, 7th International
Conference, FC 2003, Guadeloupe, French West Indies, January
27-30, 2003, Revised Papers, Lecture Notes in Computer Science,
vol. 2742. Springer, 2003.
[WS12] Wang, X. and Sako, K. (eds.). Advances in Cryptology -
ASIACRYPT 2012 - 18th International Conference on the Theory
and Application of Cryptology and Information Security, Beijing,
China, December 2-6, 2012. Proceedings, Lecture Notes in Computer
Science, vol. 7658. Springer, 2012.
[WSY+ 16] Wang, W., Standaert, F., Yu, Y., Pu, S., Liu, J., Guo,
Z., and Gu, D. Inner product masking for bitslice ciphers and
security order amplification for linear leakages. In Lemke-Rust and
Tunstall [LT17], 174–191.
[WVGX15] Wang, J., Vadnala, P.K., Großschädl, J., and Xu, Q.
Higher-order masking in practice: A vector implementation of
masked AES for ARM NEON. In Nyberg [Nyb15], 181–198.

[YJ00] Yen, S. and Joye, M. Checking before output may not be
enough against fault-based cryptanalysis. IEEE Trans. Computers,
49(2000)(9), 967–970.
[YKLM01] Yen, S., Kim, S., Lim, S., and Moon, S. RSA speedup
with residue number system immune against hardware fault
cryptanalysis. In Kim [Kim02], 397–413.
[YRG+ 18] Yang, B., Rozic, V., Grujic, M., Mentens, N., and
Verbauwhede, I. ES-TRNG: A high-throughput, low-area true
random number generator based on edge sampling. IACR Trans.
Cryptogr. Hardw. Embed. Syst., 2018(2018)(3), 267–292.

[YYP+ 18] Yao, Y., Yang, M., Patrick, C., Yuce, B., and Schaumont,
P. Fault-assisted side-channel analysis of masked implementations.
In 2018 IEEE International Symposium on Hardware Oriented
Security and Trust, HOST 2018, Washington, DC, USA, April 30
- May 4, 2018, 57–64. IEEE Computer Society, 2018.

[Zhe02] Zheng, Y. (ed.). Advances in Cryptology - ASIACRYPT 2002,
8th International Conference on the Theory and Application of
Cryptology and Information Security, Queenstown, New Zealand,
December 1-5, 2002, Proceedings, Lecture Notes in Computer
Science, vol. 2501. Springer, 2002.
[ZY10] Zhou, J. and Yung, M. (eds.). Applied Cryptography and Network
Security, 8th International Conference, ACNS 2010, Beijing, China,
June 22-25, 2010. Proceedings, Lecture Notes in Computer Science,
vol. 6123. 2010.
Appendix A

A New Inner Product Masking Algorithm

Balasch et al. introduce an algorithm for multiplying inner product masked
values [BFG+ 17, Alg. 3]. We present an adaptation in Algorithm 1, which
avoids the combination xi yj Lj , for which Boolean combinations may reduce
the security order. It is easy to verify that our intermediates Vij correspond to
those of Balasch et al. [BFG+ 17], which implies correctness:

Vij = Tij Lj = (xi yj ⊕ rij Li⁻¹ Lj⁻¹ )Lj = xi yj Lj ⊕ rij Li⁻¹

At first sight, it looks like the computational complexity is higher than that of
the original algorithm, but the inverses Li⁻¹ Lj⁻¹ can easily be precomputed.
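This correctness argument can be checked concretely in software. The sketch below is a functional model only, not a side-channel secure implementation: it instantiates Algorithm 1 over GF(2^8) with the AES polynomial, and the field choice, the public vector L and the helpers gf_mul, gf_inv, ipm_encode and ipm_decode are assumptions made for the example.

```python
import random

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8+x^4+x^3+x+1 (assumed field)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def gf_inv(a):
    """Inverse via a^254; gf_inv(0) = 0 by convention."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def ipm_encode(x, L):
    """Random sharing with <shares, L> = x; requires all L[i] != 0."""
    shares = [random.randrange(256) for _ in L[:-1]]
    acc = 0
    for s, l in zip(shares, L[:-1]):
        acc ^= gf_mul(s, l)
    shares.append(gf_mul(x ^ acc, gf_inv(L[-1])))
    return shares

def ipm_decode(shares, L):
    acc = 0
    for s, l in zip(shares, L):
        acc ^= gf_mul(s, l)
    return acc

def ipm_mul(xs, ys, L):
    """Algorithm 1: share-wise products with symmetric correction terms U_ij."""
    n = len(L)
    Linv = [gf_inv(l) for l in L]        # precomputed inverses
    U = [[0] * n for _ in range(n)]      # U_ii = 0, U symmetric
    for i in range(n):
        for j in range(i):
            r = random.randrange(256)
            U[i][j] = U[j][i] = gf_mul(r, gf_mul(Linv[i], Linv[j]))
    z = []
    for i in range(n):
        zi = 0
        for j in range(n):
            T = gf_mul(xs[i], ys[j]) ^ U[i][j]   # T_ij = x_i y_j + U_ij
            zi ^= gf_mul(T, L[j])                # V_ij = T_ij L_j
        z.append(zi)
    return z
```

Since each rij enters the decoded sum twice, once via Uij and once via Uji, the refreshing terms cancel and the output decodes to the field product of the two secrets.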
The Boolean masked multiplication of Ishai et al. [ISW03] with d + 1 shares
was proven to be d-SNI by Barthe et al. [BBD+ 16]. It differs slightly from the
description in Eq. 2.14, but the proof for the latter is equivalent. Balasch et
al. [BFG+ 17] also proved that their inner product masking multiplication is
d-SNI with d + 1 shares. We list the intermediates of Algorithm 1 with d + 1
shares below:

    xi , yi                               0 ≤ i < d
    rij , Uij = Uji = rij Li⁻¹ Lj⁻¹       0 ≤ j < i < d
    xi yj , Tij = xi yj ⊕ Uij             0 ≤ i, j < d
    Vij = Tij Lj                          0 ≤ i, j < d
    ∑_{j=0}^{k} Vij                       0 ≤ i, k < d

One must prove that any set of t1 intermediates and t2 outputs can be simulated


Algorithm 1 Multiply inner product masked values z = xy, adapted from
Balasch et al. [BFG+ 17].
Input: n-sharings x, y
Output: n-sharing z such that ⟨z, L⟩ = ⟨x, L⟩ · ⟨y, L⟩
 1: for i = 0 to n − 1 do
 2:   Uii = 0
 3:   for j = 0 to i − 1 do
 4:     rij ←$ Fq
 5:     Uij = rij Li⁻¹ Lj⁻¹
 6:     Uji = Uij
 7:   end for
 8: end for
 9: for i = 0 to n − 1 do
10:   for j = 0 to n − 1 do
11:     Tij = xi yj
12:     Tij = Tij ⊕ Uij
13:     Vij = Tij Lj
14:   end for
15: end for
16: for i = 0 to n − 1 do
17:   zi = ∑j Vij
18: end for

using at most t1 shares of each input. Let Ix and Iy be the sets of shares
of respectively x and y required for simulation. We will show that for each
intermediate, at most one share index is added to Ix and Iy and for each output,
no share index is added. As a result, any t1 internal probes and t2 output
probes can be simulated with xIx and yIy for which |Ix | ≤ t1 and |Iy | ≤ t1 and
t1 + t2 ≤ d. In addition, we record the list L of all probes which are simulatable
with xIx and yIy .

Proof.

(1) xi : Ix ← Ix ∪ {i}, L ← L ∪ {xi }.


(2) yi : Iy ← Iy ∪ {i}, L ← L ∪ {yi }.
(3) rij , Uij , Uji : Assign a random value to rij . Compute Uij = Uji =
rij Li⁻¹ Lj⁻¹ since L is public. Ix ← Ix , Iy ← Iy , L ← L ∪ {rij , Uij , Uji }.
(4) xi yj : Ix ← Ix ∪ {i}, Iy ← Iy ∪ {j}, L ← L ∪ {xi , yj , xi yj }.
(5) Tij , Vij : Ix ← Ix ∪ {i}, Iy ← Iy ∪ {j}.
• If Uij ∈ L: We can compute Tij = xi yj ⊕ Uij and Vij = Tij Lj .
L ← L ∪ {xi , yj , xi yj , Tij , Vij }.
• Else: Simulate rij , Uij , Uji as in (3) and compute Tij = xi yj ⊕ Uij
and Vij = Tij Lj . L ← L ∪ {xi , yj , xi yj , rij , Uij , Uji , Tij , Vij }.
(6) ∑_{j=0}^{k} Vij : Ix ← Ix ∪ {i}, Iy ← Iy ∪ {i}, L ← L ∪ {xi , yi }. For all j = 0
to k − 1, compute Vij as follows:
(a) If Vij ∈ L: ok
(b) Else if Uij ∈ L and yj ∈ L: Compute Tij = xi yj ⊕Uij and Vij = Tij Lj .
L = L ∪ {xi yj , Tij , Vij }.
(c) Else if Uij ∈ L and yj ∉ L: Consider how Uij was added to L:
• It was added by (3), which did not add any index to Iy . Hence
we can do Iy ← Iy ∪ {j} and go back to (b).
• It was added by (5) for a probe Tij or Vij , hence contradiction,
see (a).
• It was added by (5) for a probe Tji or Vji , which means i was
added to Iy twice. Hence we can do Iy ← Iy ∪ {j} and go back
to (b).
• It was added by (6) for a probe ∑_{l=0}^{k} Vjl , hence yj ∈ L. This is
a contradiction, see (b).
(d) Else: Uij ∉ L and Uji ∉ L:
• If Tji ∉ L and yj ∈ L: Simulate rij , Uij , Uji as in (3) and
compute Tij = xi yj ⊕ Uij and Vij = Tij Lj . L ← L ∪
{xi yj , rij , Uij , Uji , Tij , Vij }.
• Else if Tji ∉ L and yj ∉ L: Assign a random value to Tij and
compute Vij = Tij Lj . L ← L ∪ {Tij , Vij }.
• Else: Tji ∈ L. With Uji ∉ L, this occurs only if Tji was
simulated randomly as above, which means yj ∈ L and xj ∈ L.
Simulate rij , Uij , Uji as in (3) and compute Tij = xi yj ⊕ Uij ,
Tji = xj yi ⊕ Uji , Vij = Tij Lj and Vji = Tji Li . L ← L ∪
{xi yj , xj yi , rij , Uij , Uji , Tij , Vij }. Replace Tji and Vji in L and
recompute the probe depending on them.
L ← L ∪ { ∑_{j=0}^{k} Vij }.
(7) zi (output probe): We distinguish three cases:
• All Vij have been simulated: ∀j : Vij ∈ L. Compute zi .
• A partial sum has been simulated: ∑_{j=0}^{k} Vij ∈ L for some k. Simulate
the remaining Vij as in (6).
• No partial sum, but some or no Vij have been simulated. The output
share zi depends on d random values, of which it has exactly one
(rij ) in common with each other output zj . There are at most d − 1
other probes (each may involve either one Vij or another output
share zj ). Hence, at least one random value of zi has not been used
in any observed wire. We can thus assign a random value to zi .
Part II

Publications

List of Publications

International Journals

[1] Bilgin, B., De Meyer, L., Duval, S., Levi, I., and Standaert, F.
Low AND depth and efficient inverses: a guide on S-boxes for low-latency
masking. Accepted for Publication in IACR Transactions on Symmetric
Cryptology 2020(1), 2020.
[2] Wegener, F., De Meyer, L., and Moradi, A. Spin me right round
rotational symmetry for FPGA-specific AES: Extended version. Journal
of Cryptology (Jan 2020).
[3] De Meyer, L. Recovering the CTR_DRBG state in 256 traces. IACR
Trans. Cryptogr. Hardw. Embed. Syst. 2020, 1 (2020), 37–65.

[4] De Meyer, L., Bilgin, B., and Reparaz, O. Consolidating security


notions in hardware masking. IACR Trans. Cryptogr. Hardw. Embed.
Syst. 2019, 3 (2019), 119–147.
[5] De Meyer, L., and Bilgin, B. Classification of balanced quadratic
functions. IACR Trans. Symmetric Cryptol. 2019, 2 (2019), 169–192.

[6] De Meyer, L., Arribas, V., Nikova, S., Nikov, V., and Rijmen, V.
M&M: Masks and MACs against physical attacks. IACR Trans. Cryptogr.
Hardw. Embed. Syst. 2019, 1 (2019), 25–50.
[7] De Meyer, L., Reparaz, O., and Bilgin, B. Multiplicative masking
for AES in hardware. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018,
3 (2018), 431–468.
[8] De Meyer, L., Moradi, A., and Wegener, F. Spin me right round
rotational symmetry for FPGA-specific AES. IACR Trans. Cryptogr.
Hardw. Embed. Syst. 2018, 3 (2018), 596–626.


[9] De Meyer, L., and Vaudenay, S. DES S-box generator. Cryptologia
41, 2 (2017), 153–171.

International Conferences

[10] Delpech de Saint Guilhem, C., De Meyer, L., Orsini, E., and
Smart, N. P. BBQ: using AES in picnic signatures. In Selected Areas in
Cryptography - SAC 2019 - 26th International Conference, Waterloo, ON,
Canada, August 12-16, 2019, Revised Selected Papers (2019), vol. 11959
of Lecture Notes in Computer Science, Springer, pp. 669–692.
[11] Groß, H., Stoffelen, K., De Meyer, L., Krenn, M., and
Mangard, S. First-order masking with only two random bits. In
Proceedings of ACM Workshop on Theory of Implementation Security
Workshop, TIS@CCS 2019, London, UK, November 11, 2019 (2019),
ACM, pp. 10–23.
[12] Purnal, A., Arribas, V., and De Meyer, L. Trade-offs in protecting
Keccak against combined side-channel and fault attacks. In Constructive
Side-Channel Analysis and Secure Design - 10th International Workshop,
COSADE 2019, Darmstadt, Germany, April 3-5, 2019, Proceedings (2019),
vol. 11421 of Lecture Notes in Computer Science, Springer, pp. 285–302.
[13] Reparaz, O., De Meyer, L., Bilgin, B., Arribas, V., Nikova, S.,
Nikov, V., and Smart, N. P. CAPA: the spirit of Beaver against
physical attacks. In Advances in Cryptology - CRYPTO 2018 - 38th
Annual International Cryptology Conference, Santa Barbara, CA, USA,
August 19-23, 2018, Proceedings, Part I (2018), vol. 10991 of Lecture
Notes in Computer Science, Springer, pp. 121–151.

Unpublished Manuscripts

[14] De Meyer, L., De Mulder, E., and Tunstall, M. On the effect
of the (micro)architecture on the development of side-channel resistant
software. In Submission, 2020.
[15] De Meyer, L., Wegener, F., and Moradi, A. A note on masking
generic Boolean functions. IACR Cryptology ePrint Archive 2019 (2019),
1247.
Multiplicative Masking for
AES in Hardware

Publication Data

Lauren De Meyer, Oscar Reparaz, and Begül Bilgin. Multiplicative Masking
for AES in Hardware. IACR Transactions on Cryptographic Hardware and
Embedded Systems, 2018(3), pages 431-468.

My Contribution

Principal author.

Notes

Appendices not included for brevity. Please refer to [7].


Multiplicative Masking for AES in Hardware


Lauren De Meyer¹, Oscar Reparaz¹,² and Begül Bilgin¹
¹ imec - COSIC, KU Leuven, Belgium, firstname.lastname@esat.kuleuven.be
² Square inc., San Francisco, USA

Abstract. Hardware masked AES designs usually rely on Boolean masking and perform the
computation of the S-box using the tower-field decomposition. On the other hand, splitting
sensitive variables in a multiplicative way is more amenable for the computation of the AES S-
box, as noted by Akkar and Giraud. However, multiplicative masking needs to be implemented
carefully not to be vulnerable to first-order DPA with a zero-value power model. Up to now,
sound higher-order multiplicative masking schemes have been implemented only in software. In
this work, we demonstrate the first hardware implementation of AES using multiplicative masks.
The method is tailored to be secure even if the underlying gates are not ideal and glitches
occur in the circuit. We detail the design process of first- and second-order secure AES-128
cores, which result in the smallest die area to date among previous state-of-the-art masked AES
implementations with comparable randomness cost and latency. The first- and second-order
masked implementations improve on these designs by 29% and 18%, respectively. We deploy our
construction on a Spartan-6 FPGA and perform a side-channel evaluation. No leakage is
detected with up to 50 million traces for both our first- and second-order implementation. For
the latter, this holds both for univariate and bivariate analysis.
Keywords: DPA · Masking · Glitches · Sharing · Adaptive · Boolean · Multiplicative · AES ·
S-box · Side-channel

1 Introduction
Cryptographic primitives are designed to resist mathematical attacks such as linear or differential
cryptanalysis. The designer typically assumes a classic adversarial model, where encryption is
treated as a black box, only revealing inputs and outputs to adversaries. When these primitives are
deployed in embedded devices, unintended signals such as the instantaneous power consumption or
electromagnetic radiation leak sensitive information, effectively turning the black box into a gray
box. Side-channel analysis is a cheap and scalable technique that allows the adversary to exploit
these signals and extract secret keys or passwords. Hence, cryptography deployed into embedded
devices needs not only mathematical but also physical security.
One particularly powerful attack, differential power analysis (DPA), was introduced in 1999
by Kocher et al. [KJJ99]. In this type of attack, the adversary feeds different plaintexts to an
encryption algorithm using the same key and extracts sensitive information from the power traces
he collects. Today, we aim at providing security against dth -order DPA. In a dth -order DPA
attack, the adversary exploits any statistical moment of the power consumption up to order d.
Since statistical moments are exponentially harder to estimate with the order d given sufficient
noise (both in terms of numbers of samples and computational time), having a moderate security
target d = 1, 2 often suffices in practice, especially when used in conjunction with complementary
countermeasures [HOM06, CCD00].
In a side-channel secure implementation, the goal is to make the leakages of the values handled
in the implementation independent of the sensitive inputs and sensitive intermediate variables. At
the architectural level this is typically achieved by masking, which means the processed data is
probabilistically split into multiple shares in such a way that one can only recover the sensitive
data if all of its shares are known. Recovering secrets from shares is exponentially more difficult as
noise increases, since this corresponds to estimating higher-order statistical moments with increasing
noise levels [CJRR99, GP99].

Previous Work. The earliest masking schemes [GP99, Tri03, ISW03] were shown to be unsuitable
for hardware implementations by Mangard et al. [MPG05, MPO05]. The vulnerability arises when

unintended transitions of a signal or glitches occur, caused by non-idealities such as logic gates with
non-zero propagation delays or routing imbalances. The glitches problem can be addressed at many
levels: either by equalizing signal paths (which normally requires manual access to low-level routing
details and a careful characterization of the logic library), by adding synchronization elements (such
as registers or signal gating) or by using a masking scheme that is inherently secure under glitches.
Extensive research has been done on countermeasures based on secret sharing and multi-party
computation that are provably secure even in the presence of glitches. The prevailing schemes are
those of Prouff and Roche [PR11] and Threshold Implementations (TI) by Nikova et al. [NRS11]
which use polynomial and Boolean masking respectively. The latter was extended to higher-order
security by Bilgin et al. (higher-order TI) [BGN+ 14a]. The similarities and differences between TI
and the Private Circuits scheme [ISW03], which provides provable security if the circuit behaves
ideally (no glitches), were analysed by Reparaz et al. (Consolidated Masking Schemes) [RBN+ 15].
Reparaz et al. also discuss how ISW can be implemented to provide security on hardware. More
recently, Gross et al. presented Domain Oriented Masking [GMK16], which is also related to the
original Private Circuits scheme [ISW03] with additional registers against glitches and a different
randomness consumption. These masking schemes have all been applied to Canright’s tower-field
AES S-box [Can05] due to its small foot-print and structure, resulting in a multitude of masked AES
implementations [MPL+ 11, BGN+ 14b, CRB+ 16, GMK17, UHA17]. Those of Ueno et al. [UHA17],
De Cnudde et al. [CRB+ 16] and Gross et al. [GMK17] are the smallest to date, with the latter
requiring much less randomness.

In this paper we follow a different avenue. We do not apply Boolean masking to Canright’s
tower-field decomposition, but instead, we revisit the well-known concept of switching between
different types of masking. Boolean masking schemes are compatible with linear operations but
difficult to work out for non-linear functions. Akkar and Giraud [AG01] were the first to propose an
adaptive masking scheme for AES at CHES 2001. The idea is to use Boolean masks for the affine
operations and multiplicatively masked values for multiplications (or in the case of AES, inversion)
and convert between the two types when necessary. At CHES 2002 [TSG02, GT02] an inherent
weakness of multiplicative masking was presented, namely that it is vulnerable to first-order DPA
because the zero element cannot be effectively masked multiplicatively. As a solution to this zero
problem, they proposed to map each zero element to a non-zero element. The adaptive masking
scheme was studied in depth and extended to higher-order security by Genelle et al. [GPQ11b]. So
far, it has only been used in software implementations.

Our Contribution. We present the first hardware implementation of an adaptively masked AES.
We describe glitch-resistant modules that convert between Boolean and multiplicative masking
and that attend to the zero problem, based on the algorithmic descriptions provided for software
in [GPQ10, GPQ11a, GPQ11b]. While this work focuses on the AES S-box, the methodology can
be used to mask any inverse or power map-based S-box [AGR+ 16]. We optimize the number of
inversions used and the randomness cost for first-order and second-order resistant AES, which both
achieve a smaller area than the current state-of-the-art masked hardware AES implementations
of [CRB+ 16] and [GMK17], while having comparable randomness and latency requirement. We
formally discuss the security of our S-box and its components up to the level current state-of-the-art
tools and methods allow. We also deploy our implementations into an FPGA for side-channel
evaluation using a non-specific leakage assessment test to analyse practical security in a lab
environment with low noise. No leakage is detected with up to 50 million traces, confirming that
the security claims hold empirically.

2 Preliminaries
Notation. Multiplication and addition in the field Fq = GF(2^k) are denoted by ⊗ and ⊕ respectively.
We use & for multiplication in the field GF(2) (i.e. the AND operation). For ease of notation,
we sometimes omit ⊗ and &. Square brackets [·] in formulas indicate where synchronization via
registers or memory elements is used. An element r ∈ Fq drawn uniformly at random from Fq is
written as r ←$ Fq . We denote F∗q = Fq \ {0}. The expected value of x is denoted E[x].

2.1 Adversarial Model


We consider a physical adversary model, in which an attacker can probe and observe up to d
intermediate wires in each time period. This model is known as the d-probing model [ISW03]. To
account for non-ideal (glitchy) circuits, we assume that any probed wire carrying a function output
also leaks information about all function inputs up to the last register [RBN+ 15]. It has been shown
in [FRR+ 10, RP10, DDF14] that security in the d-probing model implies security against dth -order
DPA as well given the independent leakage assumption of each share and its corresponding logic
from the others.

2.2 Boolean and Multiplicative Masking


A popular countermeasure against dth -order DPA is masking sensitive values by probabilistically
splitting them into d + 1 shares. Let ◦ be some group operation. Then for any x ∈ Fq we
process the sharing x = (s0 , . . . , sd ) with s0 ◦ s1 ◦ . . . ◦ sd = x instead of x itself. Similarly,
f (x) = (f0 (x), . . . , fd (x)) is a shared representation of a function f (x).

Masked representations. We can distinguish different masked representations based on the
splitting operation ◦. A common choice is the exclusive-or operation ⊕, resulting in a Boolean
sharing. We use bxi to denote Boolean shares of x, i.e.

x = (bx0 , . . . , bxd ) ⇔ x = bx0 ⊕ · · · ⊕ bxd

In this paper we also use multiplicative sharing, which in a side-channel context is typically defined
as
x = (px0 , . . . , pxd ) ⇔ x = (px0 )⁻¹ ⊗ · · · ⊗ (pxd−1 )⁻¹ ⊗ pxd

We refer to this sharing as a type-I multiplicative sharing. We further define a type-II multiplicative
sharing:
x = (q0x , . . . , qdx ) ⇔ x = q0x ⊗ · · · ⊗ qdx

This notation is more common in secret-sharing. We omit the superscript x when it is clear from
context.

Masked operations. In Boolean masking, linear operations can trivially be applied locally on
each share:
x ⊕ y = (bx0 , . . . , bxd ) ⊕ (by0 , . . . , byd ) = (bx0 ⊕ by0 , . . . , bxd ⊕ byd )
Non-linear operations such as a multiplication on the other hand are less straightforward and much
more costly to implement. The opposite situation arises if one uses multiplicative masking. In that
case, linear operations are non-trivial but multiplication is local:

x ⊗ y = (px0 , . . . , pxd ) ⊗ (py0 , . . . , pyd ) = (px0 ⊗ py0 , . . . , pxd ⊗ pyd )

Finding an efficient but glitch-resistant way to process Boolean shares in a non-linear operation
has been a hot topic in the last years. A natural strategy is to switch back and forth between
masked representations and perform each operation in its most compatible setting.
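The locality of each operation in its compatible representation can be sketched as follows. GF(2^8) with the AES polynomial and the helper functions are assumptions made for illustration; mult_share only handles non-zero secrets, which already hints at the zero-value problem discussed next.

```python
import random
from functools import reduce

def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed field)
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def gf_inv(a):
    r = 1
    for _ in range(254):   # a^254 = a^{-1} in GF(2^8)
        r = gf_mul(r, a)
    return r

def bool_share(x, d):
    # Boolean sharing: x = b_0 xor ... xor b_d
    s = [random.randrange(256) for _ in range(d)]
    return s + [reduce(lambda u, v: u ^ v, s, x)]

def mult_share(x, d):
    # Type-II multiplicative sharing: x = q_0 * ... * q_d (x != 0 only)
    s = [random.randrange(1, 256) for _ in range(d)]
    return s + [reduce(lambda u, v: gf_mul(u, gf_inv(v)), s, x)]

# Linear operations act locally on Boolean shares ...
bx, by = bool_share(0x53, 2), bool_share(0xCA, 2)
bz = [u ^ v for u, v in zip(bx, by)]          # shares of 0x53 xor 0xCA

# ... while field multiplication acts locally on multiplicative shares.
px, py = mult_share(0x53, 2), mult_share(0xCA, 2)
pz = [gf_mul(u, v) for u, v in zip(px, py)]   # shares of 0x53 * 0xCA
```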

The zero-value problem. The fundamental security flaw of multiplicative masking was first
pointed out by Trichina [TSG02] and Golić and Tymen [GT02]. Multiplicative masking cannot
securely encode the value 0. The mean power consumption of a single share pxi reveals whether the
underlying secret is zero or non-zero, since E[pxi | x = 0] ≠ E[pxi | x ≠ 0] for any share index i. This
means that for any number of shares, the original multiplicative masking scheme is vulnerable to
first-order DPA.
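A small simulation, illustrative only and not taken from the paper, makes this visible for a type-I sharing with d = 1: the share p1 = x ⊗ p0 is identically zero when x = 0, so its average alone distinguishes the two cases.

```python
import random

def gf_mul(a, b):
    # Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed field)
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return p

def last_share(x):
    # Type-I sharing with d = 1: p0 random non-zero, p1 = x * p0,
    # so that x = p0^{-1} * p1.
    p0 = random.randrange(1, 256)
    return gf_mul(x, p0)

def mean_last_share(x, trials=2000):
    # Average value of the share p1 over many fresh maskings
    return sum(last_share(x) for _ in range(trials)) / trials

# E[p1 | x = 0] = 0, while E[p1 | x != 0] is the average of a uniform
# non-zero field element: a first-order distinguisher.
```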

2.3 Masking in Hardware


Masking in hardware requires special care. The seminal work of Mangard et al. [MPG05, MPO05]
showed that glitches can reveal sensitive information in hardware masked implementations that
otherwise were expected to be secure.

Non-completeness. The concept of non-completeness appears in the work of Nikova et al. [NRS11]
and follow-up works on higher-order security [BGN+ 14a, RBN+ 15]. Non-completeness between
register stages has become a fundamental property for constructing provable-secure hardware
implementations even if the underlying logic gates glitch. We recall here the definition of non-
completeness: for any shared implementation f operating on a shared input x, dth -order non-
completeness is satisfied if any combination of up to d shares of f is independent of at least one
input share.

Masked Multiplier. Reparaz et al. [RBN+ 15] showed that a dth -order masked multiplication in
hardware can be constructed using only d + 1 shares if the sharings of the inputs are independent
(so as to not break non-completeness). One approach to do this is detailed in [GMK16] and is
referred to as Domain Oriented Masking (DOM).
Our work uses the DOM-indep multiplier from [GMK16] as a masked AND gate. Let x = (bx0 , bx1 )
and y = (by0 , by1 ) be first-order Boolean sharings of bits x and y. A sharing of the multiplication
result z = x&y is obtained by first calculating four partial products tij = bxi &byj , i, j ∈ {0, 1} as
in [ISW03]. When i ≠ j, tij is called a cross-domain term and must be refreshed with a randomly
drawn bit r ←$ GF(2). After a register stage for synchronization, the shares (bz0 , bz1 ) are computed.

bz0 = bx0 &by0 ⊕ [bx0 &by1 ⊕ r]
bz1 = bx1 &by1 ⊕ [bx1 &by0 ⊕ r]        (1)

The second-order multiplier uses three bits of randomness r ←$ (GF(2))^3. The inputs and
outputs have three shares and there are nine partial products tij .

bz0 = bx0 &by0 ⊕ [bx0 &by2 ⊕ r1 ] ⊕ [bx0 &by1 ⊕ r0 ]
bz1 = [bx1 &by0 ⊕ r0 ] ⊕ bx1 &by1 ⊕ [bx1 &by2 ⊕ r2 ]
bz2 = [bx2 &by0 ⊕ r1 ] ⊕ [bx2 &by1 ⊕ r2 ] ⊕ bx2 &by2

Note that we employ the special version of the DOM-indep multiplier where only the cross-
domain terms are synchronized in registers. For efficiency, these registers are clocked on the negative
edge as is done in [GSM17]. This is illustrated for the first-order multiplier in Figure 1.

Figure 1: First-order DOM-indep multiplier
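Eq. (1) is easy to check exhaustively in a functional model. The sketch below marks the register stage only in comments, since a software simulation cannot capture glitches; it is a functional check, not a claim about the hardware's side-channel behaviour.

```python
def dom_and_first_order(bx, by, r):
    """First-order DOM-indep AND following Eq. (1): bx, by are 2-share
    Boolean sharings of bits x and y, r is one fresh random bit."""
    # Inner-domain partial products
    t00 = bx[0] & by[0]
    t11 = bx[1] & by[1]
    # Cross-domain partial products, refreshed with r; in hardware these
    # two values pass through (negative-edge) registers before use.
    t01 = (bx[0] & by[1]) ^ r
    t10 = (bx[1] & by[0]) ^ r
    return (t00 ^ t01, t11 ^ t10)

# Exhaustive functional check over all inputs, masks and randomness:
# the output shares always recombine to x & y.
for x in (0, 1):
    for y in (0, 1):
        for mx in (0, 1):
            for my in (0, 1):
                for r in (0, 1):
                    z = dom_and_first_order((mx, x ^ mx), (my, y ^ my), r)
                    assert z[0] ^ z[1] == x & y
```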

3 Design of an Adaptively Masked AES S-box


The AES S-box is an inversion in GF(2^8), followed by an affine transformation over bits. We adopt
the idea of adaptive masking, where we use Boolean sharings for linear operations and multiplicative
masks for non-linear operations. We thus implement the inversion by first converting the input from
Boolean to multiplicative masking. The inversion then becomes a local operation on the individual
shares:
x = (p0 , . . . , pd ) ⇔ x⁻¹ = (p0⁻¹ , . . . , pd⁻¹ )

We convert back to a Boolean masking to do the affine transformation.


In what follows, we first describe the conversion circuits between Boolean and multiplicative
masking. We address the zero problem in § 3.3. An overview of the S-box can be found in Figure 5.
While this section is written with AES in mind, the methodology can be applied to any S-box
constructed from inversion or another power map in Fq .

3.1 Masking Conversions


Following the strategy of [GPQ11b], we intuitively describe a higher-order conversion between
Boolean and multiplicative shares with the following steps. Note that this description is not final
and we will deviate from it slightly in § 3.2.
(a) Expansion: extend the sharing x with a new share of the target masking type. The
number of target shares is augmented by one and the total number of shares is now
d + 2.
(b) Synchronize the shares in a register
(c) Compression: Remove one share from the source sharing by partially unmasking. The
number of source shares shrinks by one and the total number of shares is again d + 1.

Boolean to Multiplicative. More specifically, consider a conversion from Boolean to type-I multi-
plicative shares. After k iterations of the above steps, we have an intermediate sharing
x = (p0 , . . . , pk−1 , bk , . . . , bd ) where x = (p0 ⊗ · · · ⊗ pk−1 )⁻¹ ⊗ (bk ⊕ · · · ⊕ bd )

The number of target (multiplicative) shares is k and the number of source (Boolean) shares is
d + 1 − k. In the expansion phase, we add a new multiplicative share by drawing a random pk and
multiplying it with all Boolean shares:
b'i = pk ⊗ bi    for i = k, . . . , d    (2)

We now obtain a d + 2 sharing

x = (p0 , . . . , pk , b'k , . . . , b'd )  where  x = (⊗_{i=0}^{k} pi^{−1}) ⊗ (⊕_{i=k}^{d} b'i )

In the compression phase, we remove Boolean share b'k by adding it to another Boolean share b'k+1 :

b''k+1 = b'k ⊕ b'k+1    (3)


which brings us to a d + 1 sharing

x = (p0 , . . . , pk , b''k+1 , b'k+2 , . . . , b'd )

with k + 1 target (multiplicative) shares and d − k source (Boolean) shares. After d iterations,
the sharing has been converted to x = (p0 , . . . , pd−1 , bd ) such that x = (⊗_{i=0}^{d−1} pi^{−1}) ⊗ bd ,
which is equivalent to a type-I multiplicative sharing of x with pd = bd .

Multiplicative to Boolean. For the opposite conversion from multiplicative to Boolean shares, we
consider a type-II multiplicative sharing, but the procedure for type-I is identical, apart from d
additional inversions. Note that the first iteration starts with k = 1 and b''d = qd . In iteration k, we
have the intermediate sharing

x = (q0 , . . . , qd−k , b'd−k+1 , . . . , b'd−1 , b''d )

with k target (Boolean) shares and d + 1 − k source (multiplicative) shares. In the expansion phase,
a new Boolean share b'd−k is added by splitting b''d into b'd ⊕ b'd−k with b'd−k randomly drawn. The
d + 2 shares of x are then

x = (q0 , . . . , qd−k , b'd−k , . . . , b'd )  where  x = (⊗_{i=0}^{d−k} qi ) ⊗ (⊕_{i=d−k}^{d} b'i )

In the compression phase, multiplicative share qd−k is removed by multiplication with all Boolean
shares:
bi = qd−k ⊗ b'i    for i = d − k, . . . , d

resulting in the d + 1 sharing

x = (q0 , . . . , qd−k−1 , bd−k , . . . , bd )  where  x = (⊗_{i=0}^{d−k−1} qi ) ⊗ (⊕_{i=d−k}^{d} bi )

with k + 1 target (Boolean) shares and d − k source (multiplicative) shares.


We provide high-level descriptions for both conversions in pseudocode below. These pseudocodes
are slightly different from the higher-order generalizations in [GPQ11b] (Algorithms 1 and 2) but
representative of their first- and second-order descriptions.

Algorithm 1 Boolean to Multiplicative
Input: x = (b0 , . . . , bd )
Output: x = (p0 , . . . , pd )
for i = 0 to d − 1 do
    pi ←$ F∗q
    for j = i to d do
        bj ← bj ⊗ pi
    end for
    **Register Stage**
    bi+1 ← bi+1 ⊕ bi
end for
pd ← bd

Algorithm 2 Multiplicative to Boolean
Input: x = (q0 , . . . , qd )
Output: x = (b0 , . . . , bd )
bd ← qd
for i = d − 1 downto 0 do
    bi ←$ Fq
    bd ← bd ⊕ bi
    **Register Stage**
    for j = i to d do
        bj ← bj ⊗ qi
    end for
end for
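Both conversions are easy to prototype in software to check functional correctness (a sketch under our own naming, with register stages omitted and randomness drawn from Python's secrets module; gf_mul and gf_inv denote unshared GF(2^8) multiplication and inversion):

```python
import secrets

def gf_mul(a: int, b: int) -> int:
    """GF(2^8) multiplication with the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a: int) -> int:
    """Inversion as the power map a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def bool_to_mult(b):
    """Algorithm 1: Boolean shares (b0..bd) -> type-I multiplicative shares."""
    b = list(b)
    d = len(b) - 1
    p = [0] * (d + 1)
    for i in range(d):
        p[i] = 1 + secrets.randbelow(255)      # pi <-$ F*q (non-zero)
        for j in range(i, d + 1):
            b[j] = gf_mul(b[j], p[i])
        b[i + 1] ^= b[i]                       # compression, after the register
    p[d] = b[d]
    return p

def mult_to_bool(q):
    """Algorithm 2: type-II multiplicative shares (q0..qd) -> Boolean shares."""
    q = list(q)
    d = len(q) - 1
    b = [0] * (d + 1)
    b[d] = q[d]
    for i in range(d - 1, -1, -1):
        b[i] = secrets.randbelow(256)          # bi <-$ Fq
        b[d] ^= b[i]
        for j in range(i, d + 1):              # compression, after the register
            b[j] = gf_mul(b[j], q[i])
    return b

# Round trip for a non-zero secret with d + 1 = 3 shares:
x = 0xA5
b0, b1 = secrets.randbelow(256), secrets.randbelow(256)
p = bool_to_mult([b0, b1, x ^ b0 ^ b1])
acc = p[-1]                                    # type-I: x = p0^-1 ... p_{d-1}^-1 pd
for pi in p[:-1]:
    acc = gf_mul(acc, gf_inv(pi))
assert acc == x

# Inverting only the last share yields a type-II sharing of x^-1 (cf. § 3.2):
out = mult_to_bool(p[:-1] + [gf_inv(p[-1])])
assert out[0] ^ out[1] ^ out[2] == gf_inv(x)
```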

Conversions in Hardware: Dealing with glitches. The register stage between the expansion and
compression phases is necessary because of the presence of glitches in hardware circuits. Without
this register, the non-completeness of the conversion is broken and we have no security guarantees.
Consider for example equations (2) and (3). Together, they compute the following

b''k+1 = [pk bk ] ⊕ [pk bk+1 ]
       = pk (bk ⊕ bk+1 )

Without a register, the signal pk might arrive late to the multiplication. As a result, two of the
shares of x are combined on one wire bk ⊕ bk+1 and the security is reduced by one order.

3.2 Specific Inversion Circuits


Why we use two types of multiplicative masking: Consider a type-I multiplicative masking, i.e.
x = (p^x_0 , p^x_1 , . . . , p^x_d ) ⇔ x = (⊗_{i=0}^{d−1} (p^x_i)^{−1}) ⊗ p^x_d . To obtain a type-I masking of its
inverse x^{−1} , we can locally invert all shares p^x_i using d + 1 unshared Fq inverters. Converting back
to Boolean masking then requires d more Fq inverters. However, the following formula shows that we
can do the entire masked inversion with only one unshared Fq inverter:

x^{−1} = ((⊗_{i=0}^{d−1} (p^x_i)^{−1}) ⊗ p^x_d)^{−1} = (⊗_{i=0}^{d−1} p^x_i) ⊗ (p^x_d)^{−1}

Indeed, by only locally inverting the last share pxd of a type-I multiplicative masking of x, we
obtain a type-II multiplicative sharing of its inverse x−1 :

(x−1 ) (x−1 ) (x−1 )


x−1 = (q0 , q1 , . . . , qd ) = (px0 , px1 , . . . , (pxd )−1 )

Note that regardless of the security order d, only one unshared inverter is required this way.
We now look in more detail at the first- and second-order implementations of the conversions.

First-order. The complete first-order masked inversion including the resulting circuits for first-
order conversions between Boolean and multiplicative masking is shown in Figure 2. The left side
of the figure converts a Boolean sharing x = (b0 , b1 ) to a type-I multiplicative sharing (p0 , p1 ) such
that x = p0^{−1} ⊗ p1 . With a non-zero r0 ←$ Fq , the multiplicative shares are calculated as

p0 = r0
p1 = [b0 r0 ] ⊕ [b1 r0 ]

The right side of the circuit converts a type-II multiplicative masking of x−1 into a Boolean masking.
This requires another random r1 ←$ Fq :

b'0 = r1 q0
b'1 = [q1 ⊕ r1 ] q0

These procedures are identical to those described in Algorithms 1 and 2.

%" = '"
!"

%$ )$ %$)$ = '$
#" #$(

#$ !$ #"(

Figure 2: First-order shared implementation of an inversion in Fq . The dashed lines depict registers.

Second-order. Adopting the same algorithms for d + 1 = 3 shares does not provide second-order
secure conversions (see Appendix A). We require an extra refreshing of additive shares. Figure 3
depicts our circuit for the second-order shared inversion in Fq . The conversion from a Boolean to a
type-I multiplicative sharing is depicted on the left side of the figure. The conversion requires three
units of randomness: r0 , r1 ←$ F∗q and the extra refreshing u ←$ Fq . The multiplicative shares are as
follows:

p0 = r0
p1 = r1
p2 = r1 [[r0 b0 ] ⊕ [r0 b1 ⊕ u]] ⊕ r1 [[r0 b2 ] ⊕ u]

For the opposite conversion (shown on the right side of Figure 3), we start from a type-II
multiplicative masking. This means we only need to invert the last share, p2 . We calculate the
Boolean shares of x−1 as

b'0 = [r3 ⊕ u] q0
b'1 = [r2 q1 ⊕ u] q0
b'2 = [[q2 ⊕ r2 ] q1 ⊕ r3 ] q0

The conversion again uses three units of randomness, r2 , r3 , u ←$ Fq , although we can recycle the
refreshing mask u from the Boolean to multiplicative conversion. Each conversion thus uses only
2.5 units of randomness.
Our procedures differ slightly from those of Genelle et al. [GPQ11b], especially in the smaller
use of randomness (we expand on this in Appendix A). For a general randomness strategy for
higher-order conversions, we refer to [GPQ11b], but we note that their randomness cost is not
necessarily optimal for each target security order d. A custom approach can result in a lower cost.
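As a quick functional sanity check (software only; gf_mul and gf_inv are hypothetical helper names for unshared GF(2^8) multiplication and inversion, and register stages are ignored), one can verify that both sets of second-order formulas recombine correctly:

```python
import secrets

def gf_mul(a: int, b: int) -> int:
    """GF(2^8) multiplication with the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a: int) -> int:
    """Power map a^254 (0 maps to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

# Boolean sharing of a non-zero secret x and the required randomness
x = 0x3C
b0, b1 = secrets.randbelow(256), secrets.randbelow(256)
b2 = x ^ b0 ^ b1
r0, r1 = 1 + secrets.randbelow(255), 1 + secrets.randbelow(255)
u = secrets.randbelow(256)

# Boolean -> type-I multiplicative: p0 = r0, p1 = r1, and
p2 = gf_mul(r1, gf_mul(r0, b0) ^ gf_mul(r0, b1) ^ u) \
   ^ gf_mul(r1, gf_mul(r0, b2) ^ u)
assert gf_mul(gf_mul(gf_inv(r0), gf_inv(r1)), p2) == x

# Type-II sharing of x^-1 (only the last share is inverted) -> Boolean
q0, q1, q2 = r0, r1, gf_inv(p2)
r2, r3 = secrets.randbelow(256), secrets.randbelow(256)
bb0 = gf_mul(r3 ^ u, q0)
bb1 = gf_mul(gf_mul(r2, q1) ^ u, q0)
bb2 = gf_mul(gf_mul(q2 ^ r2, q1) ^ r3, q0)
assert bb0 ^ bb1 ^ bb2 == gf_inv(x)
```

In both checks, the refreshing mask u and the extra randoms cancel pairwise in the recombination, as the algebra in the text predicts.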

&" = )"
!"

&$ = )$
#" !$ !+ #",
*$
&% *$ &% = )% #%,
#$

#% !% #$,

' '

Figure 3: Second-order shared implementation of an inversion in Fq . The dashed lines depict


registers.

3.3 The Zero Problem

We now describe how to circumvent the zero problem of multiplicative masking. Both in MPC
literature [DK10] and in software masking [GPQ10], it has been proposed to map each zero element
in Fq to a non-zero element in F∗q using a Kronecker Delta function before converting to multiplicative
masks.
In the AES S-box, we need to do an inversion in Fq . Both the zero and unit element of Fq are
their own inverses:

x−1 = x for x ∈ {0, 1}

It is therefore sufficient to replace each zero element by a “one” before the inversion and change it
back afterwards. Consider a Kronecker delta function δ(x):

δ(x) = 1 if x = 0,  and  δ(x) = 0 if x ≠ 0

We can write the inversion of any x ∈ Fq as follows:

x−1 = (x ⊕ δ(x))−1 ⊕ δ(x)
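This identity is easy to check exhaustively in software (a sketch; gf_inv is our own helper denoting unshared GF(2^8) inversion with the AES convention 0^{−1} = 0):

```python
def gf_mul(a, b):
    """GF(2^8) multiplication with the AES polynomial 0x11B."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_inv(a):
    """Power map a^254; maps 0 to 0 (the AES convention)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def delta(x):
    """Kronecker delta: 1 iff x == 0."""
    return 1 if x == 0 else 0

# x XOR delta(x) is never zero, and the correction restores the right result:
assert all((x ^ delta(x)) != 0 for x in range(256))
assert all(gf_inv(x) == gf_inv(x ^ delta(x)) ^ delta(x) for x in range(256))
```

For x = 0 both sides are 0 (1 goes through the inversion and the output correction flips it back); for all other x the delta terms vanish.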

We thus require a circuit that computes a shared Kronecker delta function δ(x). Its output (a
sharing of “zero” or a sharing of “one”) is to be added to the input of the conversion from Boolean to
multiplicative masking and to the output of the conversion from multiplicative to Boolean masking
(see Figure 5). This way, any zero element goes through the Fq inversion as a “one” and is thus
never shared multiplicatively.
The Kronecker delta function δ(x) can be calculated with an n-input AND, or equivalently, a
log2 (n)-level 2-input AND tree with the inverted bits of x as input:

δ(x) = x̄0 &x̄1 &x̄2 & . . . &x̄n−1

The circuit is shown for n = 8 in Figure 4 with xi a sharing of the ith bit of x. In software, it has
been realized using masked table lookups [GPQ10] and bit-slicing [GPQ11a]. We implement each
AND gate with a DOM-indep multiplier [GMK16]. We denote by rj the randomness needed for
each gate. As each multiplier requires one register stage, the entire circuit of Figure 4 takes three
clock cycles (regardless of the number of shares).


Figure 4: Circuit for the shared Kronecker delta function δ(x) for n = 8

We note that a trade-off can be made here between latency and area. It is possible to reduce
the depth of the tree (and thus the number of clock cycles) at the cost of a larger fan-in for the
AND gates, which results in a considerable increase in area for shared implementations. In this
paper, we choose to work only with 2-input AND gates in order to minimize circuit area.
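The shared AND tree can be sketched as follows (a functional model only: every gate draws a fresh bit, so this is the un-optimized variant; register stages and the randomness recycling discussed next are not modeled):

```python
import secrets

def dom_and(x, y):
    """First-order DOM-indep AND on two-share bits (register stage omitted)."""
    r = secrets.randbelow(2)
    return ((x[0] & y[0]) ^ ((x[0] & y[1]) ^ r),
            (x[1] & y[1]) ^ ((x[1] & y[0]) ^ r))

def shared_delta(x_shares):
    """AND tree of DOM gates over the complemented shared bits of x.

    x_shares: eight 2-share bits (b0, b1); returns a 2-share delta(x)."""
    layer = [(b0 ^ 1, b1) for (b0, b1) in x_shares]   # complement one share
    while len(layer) > 1:                             # 3 levels for n = 8
        layer = [dom_and(layer[i], layer[i + 1])
                 for i in range(0, len(layer), 2)]
    return layer[0]

# delta(x) recombines correctly for every byte under random masks:
for x in range(256):
    m = secrets.randbits(8)
    sh = [(((x ^ m) >> i) & 1, (m >> i) & 1) for i in range(8)]
    d0, d1 = shared_delta(sh)
    assert (d0 ^ d1) == (1 if x == 0 else 0)
```

Complementing a shared bit only requires flipping one of its shares, so the inverters at the tree inputs are free in the masked circuit.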

First-order optimizations. In a straightforward first-order secure implementation of δ(x), each
input bit has two shares and each DOM-AND gate requires 1 extra random bit rj ←$ GF(2). The
circuit thus receives a total of 23 bits. That is a lot of entropy for a function that outputs only 2
bits. In order to bring down the randomness cost of the circuit, we decide to recycle some of the
bits across the multiplication gates. A theoretical framework for this was presented in [FPS17].
Following this would result in a total randomness cost of 5 units: one bit in each of the three layers
and one bit each for the refreshing after layer 1 and after layer 2. We now push the cost even
further by using custom optimizations.
We rewrite the DOM equations (1) and note that they have a special property:

b^z_i = b^x_i b^y_i ⊕ [b^x_i b^y_{i⊕1} ⊕ r]
      = b^x_i y ⊕ r

The DOM gate thus uses its inputs somewhat asymmetrically since the output shares depend only
on the unmasked second input y and not on its sharing. This means that any randomness that has
been used to mask y before arriving at this gate, disappears from its output sharing z. Hence, we
can reuse this randomness in the next layer. In our case, we use the more significant bit (depicted
as the lower input to an AND gate in Fig. 4) as the “second input” and we conclude that the
second layer of the Kronecker implementation removes any dependence of the data on r2 and r4 . In
contrast, reusing r1 (or r3 ) in layer two is not advisable. Moreover, for a first-order implementation
(only univariate matters), the upper and lower two gates in the first layer have independent inputs
and outputs, and can therefore use the same randomness as long as layer two does not.
We propose the following use of randomness:
r1 = r3 ←$ GF(2)    r5 ←$ GF(2)    r7 = r1
r2 = r4 ←$ GF(2)    r6 = [r5 ⊕ r2 ]

We are thus able to reduce the randomness consumption of the first-order Kronecker delta
implementation from 7 to only 3 bits. We refer to Appendix C for the probability distributions of
intermediate and output wires of this circuit with our randomness optimization. We verified that
these probability distributions are independent of the secret input. Moreover, we note that these
probability distributions are the same as in the circuit without randomness optimization.

Second-order optimizations. A second-order implementation uses three bits of randomness per
multiplication: rj = (r_j^0 , r_j^1 , r_j^2 ) ←$ (GF(2))^3 . Again, instead of consuming 21 bits of extra
randomness in the circuit, we propose a recycling of the bits. Following the framework of [FPS17]
would require five groups of three fresh random bits, i.e. 15 bits. Our customization is more
restricted in the higher-order case because of the possibility of multivariate leakage. We still have
the special composability property of the DOM gates, but the gates in the first layer can no longer
be considered independent. We propose the following:
r1 , r2 , r3 , r4 ←$ (GF(2))^3
r_5^0 = r_3^0 ,  r_5^1 = r_4^1 ,  r_5^2 = [r_3^2 ⊕ r_4^2 ]
r_6^0 = r_1^0 ,  r_6^1 = r_2^1 ,  r_6^2 = [r_1^2 ⊕ r_2^2 ]
r_7^0 = [r_1^1 ⊕ r_3^1 ],  r_7^1 = [r_2^0 ⊕ r_4^0 ],  r_7^2 ←$ GF(2)

We thus reduce the randomness consumption of the second-order Kronecker delta implementation
from 21 to 13 bits. The probability distributions of relevant (pairs of) wires can again be found in
Appendix C.

3.4 The S-box


We summarize the AES S-box circuit in Figure 5. The local inversion is based on the smallest
unshared AES S-box implementation to date by Boyar, Matthews and Peralta [BMP13]. More
details on our adaptation of this circuit are given in Appendix B. The registers are depicted with
grey dotted lines. In a first-order implementation each conversion has a latency of one cycle, whereas
in a second-order implementation, it is two clock cycles. The S-box input needs to be fed to the
δ(x) circuit three clock cycles before the first conversion. This could cost us three cycles of S-box
latency as well as three stages of 8 × (d + 1)-bit registers. Instead, we reorganize the state array
and key schedule such that the Kronecker delta function can be precomputed. We describe this in
the next Section.

4 AES Architecture and Control


The ShiftRows, MixColumns and AddRoundKey stages in AES are all linear and thus trivially
masked by instantiating d + 1 copies, one for each share of the state and key schedule. Following
previous masked AES implementations, we use a byte-serialized architecture with a pipelined
S-box as shown in Figure 5. Note that instead of the serialized architecture from [MPL+ 11],
we use a similar architecture to that of [GMK16, Fig. 5] since it exhibits a more compact and
efficient datapath. We adapt [GMK16] to accommodate for our S-box that needs a three-cycle
precomputation of the Kronecker delta function.

4.1 State Array


The byte-serialized architecture from [GMK16] is very efficient in terms of clock cycles, since
it performs the MixColumns, ShiftRows and AddRoundKey operations in parallel to SubBytes.
Figure 6 (left) shows the state array with its normal meandering movement during the SubBytes
operation in black full lines and the ShiftRows functionality in blue dotted lines. The column of


Figure 5: First-order adaptive masking implementation of the AES S-box. The dotted grey lines
depict registers.

registers that is the input of the MixColumns operation is indicated by a red striped frame, whereas
the registers receiving the output of MixColumns one cycle later are specified by a full red frame.
The S-box input is taken from State 00, while the Kronecker delta input starts computing
three cycles beforehand on State 30. In order to have State 30 ready for the Kronecker function,
we have to put the MixColumns operation in the second column (instead of the first column as
in [GMK16]). ShiftRows is performed when the sixteenth and last S-box output enters the state.
We also adapt the ShiftRows connections such that all bytes end up one column to the right of the
actual ShiftRows result. This means that the normally first column is the first MixColumns input
(state bytes 01,11,21,31) and the normally last column now occupies state bytes 00,10,20 and 30.
During the next four clock cycles, we rotate the state by returning byte 00 to the state input (33)
untouched. After those four cycles, the state columns are restored to their correct order and the
first S-box input is ready in State 00. Moreover, its output to the Kronecker function is also ready
at this point. The key schedule is synchronized with the state in a way that the partial Round Key
to be used in that clock cycle corresponds to State 30. The AddRoundKey stage is embedded in
the connection between State 30 and State 20 and its output is the input to the Kronecker delta
function.

4.2 Key Array


The key array is depicted in Figure 6 (right) and is identical to that of [GMK16, Fig. 5]. The
normal meandering operation is indicated in black full arrows, while the rotating movement is
illustrated by green dotted arrows. The key state rotates in order to put its last column through
the AES S-box. Note that this key array requires a lot fewer multiplexers than that of [MPL+ 11]
because the direction of the normal operation corresponds to that of the rotations. The Round Key
byte that is used in the AddRoundKey stage is constructed in three different ways, depending on
which state byte it is added to:

Key 00 ⊕ S-box Out ⊕ RCon for the first state byte


Key 00 ⊕ S-box Out for the next three bytes
Key 00 ⊕ Key 03 for the remaining 12 bytes

The result is fed back into the key state as Key 33.

4.3 Control
We now go into more detail on the scheduling of the 24 clock cycles (0 to 23) that make up one
encryption round when the S-box latency is four cycles (as in our second-order implementation).
Table 1 details the control of the register movement and Table 2 shows how various inputs to the
states and the S-box change.
The 16 bytes of the state register are fed to the S-box in cycles 3 to 18 of each round of
encryption. This means the Kronecker delta function receives the same 16 bytes three cycles before


Figure 6: State and Key Array



that: in cycles 0 to 15. During these cycles, the key state follows its meandering movement and
Key 00 is used to construct the Round Key byte. In the remaining clock cycles (from cycle 16 until
cycle 23), the key array is rotating. The last column of the array is fed through the Kronecker delta
function in cycles 17 to 20 and through the S-box in cycles 20 to 23, which means their outputs are
ready for the first four Round Key calculations four cycles later: in cycles 0 to 3.
The state receives its S-box outputs in cycles 7 to 22. In the last cycle (22), we do the adapted
ShiftRows that puts each state byte one extra column to the right. The first MixColumns operation
is in the next cycle (23), which means the first input byte to the Kronecker delta function (in State
30) is ready in cycle 0. During cycles 23 to 2, State 00 holds bytes of the last column and is thus
fed back into State 33. The MixColumns operation occurs four times per round, once every four cycles, i.e. in
cycles 23, 3, 7 and 11 (except in the last round of encryption).
The first round of encryption (loading of the inputs) starts in cycle 0 with the data and key
inputs replacing respectively State 30 and the Round Key. In total, one AES encryption is obtained
in 10 × 24 + 16 = 256 cycles. Our first-order AES implementation has the same latency in spite
of the S-box requiring only two cycles. Given the AES design, it is difficult to exploit an S-box
latency below four cycles.

5 Security Evaluation
In this section, we elaborate on the security of the first- and second-order AES constructions against
a probing adversary in the presence of glitches. Neither formal proofs in a particular security model
nor empirical leakage detecting tools can in their own capacity provide full evidence for security. A
security evaluation is incomplete without complementary analyses following both methodologies.
Therefore, our approach consists of three stages: first in § 5.1, we address the security of the S-box
under the ideal circuit assumption using the notion of strong non-interference [BBD+ 16, BBP+ 16].
Next in § 5.2, we evaluate the security of the S-box in the presence of glitches, using leakage
detection tools available in literature. Finally in § 5.3 we complete the evaluation by analyzing our
whole circuit on a physical device.

5.1 Security of the S-box in a theoretical framework


We now use the concept of Strong Non-Interference (SNI) [BBD+ 16] to prove that the S-box
construction is theoretically secure. We use the same methodology as the proof of [BBD+ 16, Fig.
4]. Recall the definition of SNI:

Definition 1 (Strong Non-Interference (SNI) [BBP+ 16]). An algorithm is d-strong non-interferent


(or d-SNI) if and only if for every set I of t1 probes on intermediate variables (i.e. no output wires
or shares) and every set O of t2 probes on output shares such that t1 + t2 ≤ d, the set I ∪ O can
be simulated by only t1 shares of each input.

Now, consider our S-box in Figure 7, consisting of six parts: A1 , A3 and A5 are affine (only
computing share wise) and A2 , A4 and A6 are d-SNI as proven in Appendices D and E. The proof
starts from the output and backtracks to the input. We denote by Ii the set of intermediate
probes in gadget Ai and by O the set of output probes on S(x). The sets are constrained by

Table 1: State and key control during one round of encryption

Cycle State Shift MixColumns Key Shift


0-2 Meander No Meander
3 Meander Yes Meander
4-6 Meander No Meander
7 Meander Yes Meander
8-10 Meander No Meander
11 Meander Yes Meander
12-15 Meander No Meander
16-21 Meander No Rotate
22 ShiftRows No Rotate
23 Meander Yes Rotate

Table 2: State and key inputs during one round of encryption (except during loading)

Cycle Round Key Kronecker In SBin State In S20


0 K00 ⊕ SBout ⊕ Rcon S30 ⊕ RndKey - S00 Krncker In
1-2 K00 ⊕ SBout S30 ⊕ RndKey - S00 Krncker In
3 K00 ⊕ SBout S30 ⊕ RndKey S00 - Krncker In
4-6 K00 ⊕ K03 S30 ⊕ RndKey S00 - Krncker In
7-15 K00 ⊕ K03 S30 ⊕ RndKey S00 SBout Krncker In
16 - - S00 SBout S30
17-18 - K03 S00 SBout S30
19 - K03 - SBout S30
20 - K03 K13 SBout S30
21 - - K13 SBout S30
22 - - K13 SBout S31
23 - - K13 S00 S30


Figure 7: AES S-box

|O| + Σ_{i=1}^{6} |Ii | ≤ d. We further define Si as the set of shares that are required at the input of
block Ai in order to be able to simulate the probes in the remainder of the circuit, i.e. ∪_{j=1}^{i} Ij ∪ O.
We subsequently treat this set as a set of probes that needs to be simulated using input shares
from a previous block Ai−1 . This way, we gradually move towards the input and try to show that
the number of input shares of x required to simulate all probes ∪_{i=1}^{6} Ii ∪ O is at most Σ_{i=1}^{6} |Ii |.
Consider for example block A4 in Table 3. This block has output z and input y. The set of
shares of z, S3 , is constrained by |S3 | ≤ |S2 | + |I3 |. Since A4 is d-SNI and since |S3 ∪ I4 | ≤ d, we
have that the number of shares of y required to simulate S3 ∪ I4 is at most |I4 |. We call this set
of shares S4 . Now, since we are able to simulate S3 using S4 and since S3 is able to simulate the
remaining probes ∪_{i=1}^{3} Ii ∪ O, we know that the set of shares S4 is sufficient to simulate ∪_{i=1}^{4} Ii ∪ O.
Table 3 shows that we need |S5,1 ∪ S6 | ≤ |S4 | + |I5 | + |I6 | ≤ |I4 | + |I5 | + |I6 | shares of the input
to simulate all d-tuples of probes in the circuit, proving that the S-box is d-SNI.

Table 3: Proof that the S-box in Figure 7 is d-SNI for d = 1, 2

Probes Constraints Details


S(x) : O                  |O| + Σ_{i=1}^{6} |Ii | ≤ d
A1 v : S1,1 ; w : S1,2 |S1,k | ≤ |I1 | + |O| Affine
A2 u : S2 ; w : S1,2 |S2 | ≤ |I2 | d-SNI
A3 z : S3 ; w : S1,2 |S3 | ≤ |I3 | + |S2 | Affine
A4 y : S4 ; w : S1,2 |S4 | ≤ |I4 | d-SNI
A5 x : S5,1 ; w : S5,2 |S5,1 | ≤ |I5 | + |S4 | Affine
|S5,2 | ≤ |I5 | + |S1,2 |
A6 x : S5,1 ∪ S6 |S6 | ≤ |I6 | d-SNI

5.2 Practical Evaluation of Glitch Security of the S-box


A useful property for the synthesis of secure circuits in the presence of glitches is non-completeness [NRS11].
We use the VerMI tool described in [ANR17] to verify the security of the gadgets that create the
S-box, i.e. the conversions and the Kronecker delta. This tool was designed specifically for masked
hardware implementations. In particular, it can verify if a circuit satisfies the non-completeness
property from register to register. By applying this tool directly to the RTL HDL descriptions
of our gadgets, we confirm that each stage is non-complete and therefore secure in the univariate
setting in the presence of glitches if the shared input does not have a secret dependent bias. We
verify this condition on the input sharing independently (Appendix C).
We note that it has been implied in [FGMDP+ 18] that verifying glitch security and strong non-
interference separately does not guarantee composability in a glitchy environment. In section 5.1,
we have given security proofs for the S-box as best as we could with the tools at our disposal. In this
section, we consider glitches. The combined theoretical verification of “glitchy” SNI is an interesting
direction for future research. However, note that SNI is not a necessary condition for the S-box
to be secure. As an example, consider our first-order S-box. Not every glitch-extended probe in
the subcircuit shown in Figure 2 is simulatable with only t1 shares of the input. However, we have
exhaustively verified that every glitch-extended probe in the entire S-box circuit is independent of the
secret. The S-box is thus 1-probing secure, even though one of its subcircuits is not (1, 0, 0)-robust
1-SNI [FGMDP+ 18]. We further evaluate the security of the entire S-box using state-of-the-art
tools.
We use the simulation tool of [Rep16], in which we exhaustively probe the S-box and create
power traces using an identity leakage model. These traces do not only contain explicit intermediates
(stabilized values on wires) but also values that could be observed in a glitch (transient values on
wires). We exhaustively probe the S-box in this way in a completely noiseless setting and create
up to 100 million simulated traces. For more details, we refer to [Rep16]. We detect no univariate
leakage with up to 100 million traces nor bivariate in the case of our second-order gadgets. We
draw the same conclusions when using the tool described in [DBR18]. This tool essentially exhausts
every possible glitch in the computation by verifying that there is no mutual information between
the secret and all possible (pairs of) glitch-extended probes.
While the theoretical possibility of a very weak bias still exists, we would need more than 100
million traces to detect it, so the practical implications are limited: if the leak is not even
detected with 100 million traces in a noiseless scenario, it would take considerably more traces
to exploit it (perform key recovery) in a realistic noisy scenario.

5.3 Physical Evaluation


After evaluating the S-box both theoretically and empirically in simulation, we finally put our entire
AES design to the test in a physical environment.

Setup. We program a Xilinx Spartan6 FPGA with both our first- and second-order design on
a SAKURA-G board, specifically designed for side-channel evaluation. For the synthesis, we use
the Xilinx ISE option KEEP_HIERARCHY to prevent optimization across modules (and in particular
across shares). To minimize platform noise, we split the implementation over a crypto FPGA, which
handles the AES encryption and a control FPGA, which communicates with the host computer and
supplies masked data to the crypto FPGA. The FPGAs are clocked at 3.072 MHz and sampled at
1 GS/s.
The crypto FPGA is also equipped with a PRNG to generate the randomness required in every
clock cycle. This PRNG is loaded with a fresh seed for every encryption. In contrast with other
state-of-the-art masked implementations, we have to be able to generate one or two non-zero bytes
for the multiplicative masks. We refer to Appendix F for a description of how we achieve this in
practice, without stalling the pipeline.

Univariate. We perform a non-specific leakage detection test [BCD+ 13] using the methodology
from [RGV17]. This means we gather power traces in two sets: the first corresponding to encryptions
of a fixed plaintext and the other to encryptions of random plaintexts. We choose the fixed plaintext
equal to the key in order to test the special case of zero inputs to the S-box in the first round.
Nonzero S-box inputs then occur in encryption round two and are thus naturally also tested. The
two sets of measurements are compared using the t-test statistic. When the t-statistic at order
d crosses the threshold T = ±4.5, the null hypothesis “The design has no dth -order leakage” is
rejected with confidence > 99.999%. On the other hand, when the t-statistic remains below this
threshold, we corroborate that side-channel information is not distinguishable at order d.
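The statistic in question is Welch's two-sample t-test, computed pointwise over the trace samples. A NumPy sketch (our own helper names, not the actual evaluation code):

```python
import numpy as np

def welch_t(fixed: np.ndarray, random: np.ndarray) -> np.ndarray:
    """Pointwise Welch t-statistic between two trace sets of shape
    (n_traces, n_samples); values beyond |t| = 4.5 indicate leakage."""
    m1, m2 = fixed.mean(axis=0), random.mean(axis=0)
    v1, v2 = fixed.var(axis=0, ddof=1), random.var(axis=0, ddof=1)
    n1, n2 = fixed.shape[0], random.shape[0]
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# A toy column: samples [0,1,0,1] vs [0,2,0,2] have means 0.5 and 1.0 and
# sample variances 1/3 and 4/3, giving t = -0.5 / sqrt(1/12 + 1/3).
a = np.array([[0.0], [1.0], [0.0], [1.0]])
b = np.array([[0.0], [2.0], [0.0], [2.0]])
assert np.isclose(welch_t(a, b)[0], -0.5 / np.sqrt(5.0 / 12.0))
```

Higher-order moments are tested the same way after centering and raising the traces to the corresponding power.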
The results for our first-order design are shown in Figure 8. Each trace consists of 64 clock
cycles, comprising about two and a half rounds of encryption. An example of such a trace is shown

Figure 8: Non-specific leakage detection test on 2.5 rounds of encryption of a first-order protected
AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom):
exemplary power trace, first-order, second-order t-value.

in Figure 8, top. To verify the soundness of our setup, we first perform the leakage detection test
with the PRNG turned off (i.e. unmasked implementation). This is shown in the left column of
the figure and as expected, the design presents severe leakage at only 12 000 traces. On the right
side, we do the leakage detection test with the PRNG turned on. We do not observe evidence for
first-order leakage with up to 50 million power traces. The design does leak in the second order, as
anticipated.
Similarly, we show the test results for our second-order design in Figure 9. The leakage when
the PRNG is turned off (left column) is clear. The masked implementation (right column) does not
present evidence for first- nor second-order leakage with up to 50 million power traces. While we
would expect the third-order t-statistic to surpass the threshold, this is not yet the case due to
platform noise.
We also track the evolution of the maximum absolute t-test value as a function of the number
of traces taken. This is shown in Figure 10 for the first-order (left) and second-order (right)
protected AES implementations. On the left, we clearly see an increase in the absolute t-value of

Figure 9: Non-specific leakage detection test on 2.5 rounds of encryption of a second-order protected
AES. Left: PRNG off; 12 000 traces. Right: PRNG on; 50 million traces. Rows (top to bottom):
exemplary power trace, first-order, second-order, third-order t-value.

the second- and third-order moment, while the statistic for first order is stable. For our second-order
implementation, the noise of the platform prevents us from seeing evidence for third-order leakage.


Figure 10: Evolution of the maximum absolute t-value across the measurements. Left: First order.
Right: Second order.

Bivariate. In order to do a bivariate leakage detection test, we reduce the length of the power
traces to 15 clock cycles and the sample rate of the oscilloscope to 200MS/s. Each trace then
consists of 1 000 time samples. In order to improve the signal-to-noise ratio, we make the traces
DC-free. We then combine the measurements at different time samples by taking the outer product of the
centered traces with themselves. The resulting symmetric matrices are the samples for our t-test.
We first perform this experiment on the first-order protected AES implementation to verify that
we can indeed detect bivariate leakage. The resulting t-statistics after 1 and 45 million traces are
shown in Figure 11 and confirm that our method is sound.
Next, we do the same for the second-order masked AES implementation. We collect 50 million
traces and show the resulting t-statistic in Figure 12. The result shows clearly that no bivariate
leakage can be detected with 50 million traces.

6 Implementation Cost
We presented first- and second-order secure constructions for AES and evaluated their security. In
this section we investigate the implementation cost and compare it to the state-of-the-art AES
designs of [CRB+ 16] and [GMK17]. All area measures were obtained with the Synopsys Design
Compiler v.2013.12, using the Open Cell Nangate 45nm library [NAN] and are expressed in 2-input
NAND gate equivalents¹. We use the compile option -exact_map to prevent optimization across
modules. For a fair comparison, we also synthesize the implementations of [CRB+ 16] and [GMK17]
with the same library and toolchain. From the latter, we picked the options for the smallest area, i.e. not
perfectly interleaved and with the eight-stage S-box. Both these works create a shared implementation
¹ One NAND gate is 0.798 µm².

[Two 1 000 × 1 000 bivariate t-statistic matrices; the color scale reaches |t| = 45.]

Figure 11: Non-specific bivariate leakage detection test on 15 clock cycles of a first-order protected
AES. Left: 1 million traces. Right: 45 million traces.

[One 1 000 × 1 000 bivariate t-statistic matrix; the color scale reaches |t| = 45.]

Figure 12: Non-specific bivariate leakage detection test on 15 clock cycles of a second-order
protected AES with 50 million traces.

from Canright’s compact AES S-box [Can05] using the tower-field method. Our approach is thus
radically different. We cannot easily compare with [UHA17] because of different synthesis libraries,
though they appear to have a similar area footprint with a larger randomness requirement (64 bits per
S-box). Moreover, they only provide a first-order implementation. We first detail the cost of the S-box
in § 6.1 and then look at the entire AES encryption in § 6.2.

Table 4: Implementation results for the AES S-box with Nangate 45nm Library

                      |     First-order secure      |     Second-order secure
  Variant /           | Area   Randomness   Latency | Area   Randomness   Latency
  Module              | [GE]   [bits/S-box] [cc]    | [GE]   [bits/S-box] [cc]
  This work           | 1 685      19       2 (+3)  | 3 891      53       4 (+3)
    Kronecker delta   |   259       3         (3)   |   629      13         (3)
    Bool to Mult.     |   538       8          1    | 1 434      20          2
    Inversion         |   226       -          -    |   226       -          -
    Mult. to Bool     |   538       8          1    | 1 388      20          2
    Others            |   124       -          -    |   214       -          -
  [CRB+ 16]           | 2 348      54          6    | 4 744     162          6
  [GMK17]             | 2 432      18          8    | 4 759      54          8

6.1 The S-box


Table 4 shows our implementation results for the S-box. Our S-box implementations are the smallest
to date among state-of-the-art schemes with similar randomness and latency, with an area reduction
of 29% at first order and 18% at second order.

6.2 AES
Table 5 shows the implementation results of our entire AES implementations in comparison with
those of De Cnudde et al. [CRB+ 16] and Groß et al. [GMK17]. Our S-box area reduction results
in an overall improvement of around 10% over the state of the art, with comparable or even better
randomness consumption and latency.

7 Conclusion
We have ported the well-known concept of adaptively masking ciphers such as AES to hardware.
The idea has been extensively studied in software, but had not been applied in hardware until
now. We show that this methodology is a very competitive alternative to state-of-the-art masked

Table 5: Implementation results for AES-128 with Nangate 45nm Library

                      |     First-order secure      |     Second-order secure
  Variant /           | Area   Randomness   Latency | Area    Randomness   Latency
  Module              | [GE]   [bits/S-box] [cc]    | [GE]    [bits/S-box] [cc]
  This work           |  6 557     19       256     | 10 931      53       256
    S-box             |  1 685      -         -     |  3 891       -         -
    State Array       |  2 509      -         -     |  3 728       -         -
    Key Array         |  1 579      -         -     |  2 368       -         -
    Control           |    208      -         -     |    199       -         -
    Others            |    576      -         -     |    745       -         -
  [CRB+ 16]           |  7 682     54       276     | 12 640     162       276
  [GMK17]             |  7 337     18       246     | 12 024      54       246

AES designs. Our approach is conceptually simple, yet incorporates modern countermeasures to
mitigate the effect of glitches in hardware.
Specifically, we present secure circuits for converting between Boolean and multiplicative
masking and for circumventing the well-known zero problem of multiplicative masking. We apply
the methodology to the AES cipher for first- and second-order security and show with experiments
that our implementations do not exhibit univariate or multivariate leakage with up to 50 million
traces. Our AES S-box implementations require comparable randomness and latency to state-of-
the-art implementations and yet achieve an 18 to 29% smaller chip area. We believe this is an
interesting addition to the hardware designer’s toolbox.

Acknowledgements
This work was supported in part by the NIST Research Grant 60NANB15D346. Oscar Reparaz
and Begül Bilgin are postdoctoral fellows of the Fund for Scientific Research - Flanders (FWO) and
Lauren De Meyer is funded by a PhD fellowship of the FWO. The authors would like to thank
François-Xavier Standaert and Vincent Rijmen for helpful discussions.

References
[AG01] Mehdi-Laurent Akkar and Christophe Giraud. An implementation of DES and AES,
secure against some attacks. In Çetin Kaya Koç, David Naccache, and Christof
Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2001, Third
International Workshop, Paris, France, May 14-16, 2001, Proceedings, volume 2162
of Lecture Notes in Computer Science, pages 309–318. Springer, 2001.

[AGR+ 16] Martin R. Albrecht, Lorenzo Grassi, Christian Rechberger, Arnab Roy, and Tyge
Tiessen. MiMC: Efficient encryption and cryptographic hashing with minimal
multiplicative complexity. In Jung Hee Cheon and Tsuyoshi Takagi, editors, Advances
in Cryptology - ASIACRYPT 2016 - 22nd International Conference on the Theory
and Application of Cryptology and Information Security, Hanoi, Vietnam, December
4-8, 2016, Proceedings, Part I, volume 10031 of Lecture Notes in Computer Science,
pages 191–219, 2016.

[ANR17] Victor Arribas, Svetla Nikova, and Vincent Rijmen. VerMI: Verification tool for
masked implementations. Cryptology ePrint Archive, Report 2017/1227, 2017.

[BBD+ 16] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, Pierre-Yves Strub, and Rébecca Zucchini. Strong non-interference and
type-directed higher-order masking. In Edgar R. Weippl, Stefan Katzenbeisser,
Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security,
Vienna, Austria, October 24-28, 2016, pages 116–129. ACM, 2016.

[BBP+ 16] Sonia Belaïd, Fabrice Benhamouda, Alain Passelègue, Emmanuel Prouff, Adrian
Thillard, and Damien Vergnaud. Randomness complexity of private circuits for

multiplication. In Marc Fischlin and Jean-Sébastien Coron, editors, Advances in


Cryptology - EUROCRYPT 2016 - 35th Annual International Conference on the
Theory and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12,
2016, Proceedings, Part II, volume 9666 of Lecture Notes in Computer Science, pages
616–648. Springer, 2016.
[BCD+ 13] G. Becker, J. Cooper, E. De Mulder, G. Goodwill, J. Jaffe, G. Kenworthy, T. Kouzmi-
nov, A. Leiserson, M. Marson, P. Rohatgi, and S. Saab. Test vector leakage assessment
(TVLA) methodology in practice. In International Cryptographic Module Conference,
volume 1001, page 13, 2013.
[BGN+ 14a] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. Higher-order threshold implementations. In Palash Sarkar and Tetsu Iwata,
editors, Advances in Cryptology - ASIACRYPT 2014 - 20th International Conference
on the Theory and Application of Cryptology and Information Security, Kaoshiung,
Taiwan, R.O.C., December 7-11, 2014, Proceedings, Part II, volume 8874 of Lecture
Notes in Computer Science, pages 326–343. Springer, 2014.
[BGN+ 14b] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. A more efficient AES threshold implementation. In David Pointcheval
and Damien Vergnaud, editors, Progress in Cryptology - AFRICACRYPT 2014 -
7th International Conference on Cryptology in Africa, Marrakesh, Morocco, May
28-30, 2014. Proceedings, volume 8469 of Lecture Notes in Computer Science, pages
267–284. Springer, 2014.
[BMP13] Joan Boyar, Philip Matthews, and René Peralta. Logic minimization techniques
with applications to cryptology. J. Cryptology, 26(2):280–312, 2013.
[Can05] David Canright. A very compact S-box for AES. In Rao and Sunar [RS05], pages
441–455.
[Can06] Christophe De Cannière. Trivium: A stream cipher construction inspired by block
cipher design principles. In Sokratis K. Katsikas, Javier Lopez, Michael Backes,
Stefanos Gritzalis, and Bart Preneel, editors, Information Security, 9th Interna-
tional Conference, ISC 2006, Samos Island, Greece, August 30 - September 2, 2006,
Proceedings, volume 4176 of Lecture Notes in Computer Science, pages 171–186.
Springer, 2006.
[CCD00] Christophe Clavier, Jean-Sébastien Coron, and Nora Dabbous. Differential power
analysis in the presence of hardware countermeasures. In Çetin Kaya Koç and
Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES
2000, Second International Workshop, Worcester, MA, USA, August 17-18, 2000,
Proceedings, volume 1965 of Lecture Notes in Computer Science, pages 252–263.
Springer, 2000.
[CJRR99] Suresh Chari, Charanjit S. Jutla, Josyula R. Rao, and Pankaj Rohatgi. Towards
sound approaches to counteract power-analysis attacks. In Wiener [Wie99], pages
398–412.
[Cor17] Jean-Sébastien Coron. Checkmasks: Formal verification of side-channel countermea-
sures. Publicly available at https://github.com/coron/checkmasks, 2017.
[Cor18] Jean-Sébastien Coron. Formal verification of side-channel countermeasures via
elementary circuit transformations. In Bart Preneel and Frederik Vercauteren,
editors, Applied Cryptography and Network Security - 16th International Conference,
ACNS 2018, Leuven, Belgium, July 2-4, 2018, Proceedings, volume 10892 of Lecture
Notes in Computer Science, pages 65–82. Springer, 2018.
[CRB+ 16] Thomas De Cnudde, Oscar Reparaz, Begül Bilgin, Svetla Nikova, Ventzislav Nikov,
and Vincent Rijmen. Masking AES with d+1 shares in hardware. In Benedikt
Gierlichs and Axel Y. Poschmann, editors, Cryptographic Hardware and Embedded
Systems - CHES 2016 - 18th International Conference, Santa Barbara, CA, USA,
August 17-19, 2016, Proceedings, volume 9813 of Lecture Notes in Computer Science,
pages 194–212. Springer, 2016.

[DBR18] Lauren De Meyer, Begül Bilgin, and Oscar Reparaz. Consolidating security notions
in hardware masking. IACR Cryptology ePrint Archive, 2018:597, 2018.

[DDF14] Alexandre Duc, Stefan Dziembowski, and Sebastian Faust. Unifying leakage models:
From probing attacks to noisy leakage. In Phong Q. Nguyen and Elisabeth Oswald,
editors, Advances in Cryptology - EUROCRYPT 2014 - 33rd Annual International
Conference on the Theory and Applications of Cryptographic Techniques, Copenhagen,
Denmark, May 11-15, 2014. Proceedings, volume 8441 of Lecture Notes in Computer
Science, pages 423–440. Springer, 2014.

[DK10] Ivan Damgård and Marcel Keller. Secure multiparty AES. In Radu Sion, editor,
Financial Cryptography and Data Security, 14th International Conference, FC 2010,
Tenerife, Canary Islands, January 25-28, 2010, Revised Selected Papers, volume
6052 of Lecture Notes in Computer Science, pages 367–374. Springer, 2010.

[FGMDP+ 18] Sebastian Faust, Vincent Grosso, Santos Merino Del Pozo, Clara Paglialonga, and
François-Xavier Standaert. Composable masking schemes in the presence of physical
defaults & the robust probing model. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2018(3):89–120, Aug. 2018.

[FPS17] Sebastian Faust, Clara Paglialonga, and Tobias Schneider. Amortizing randomness
complexity in private circuits. In Tsuyoshi Takagi and Thomas Peyrin, editors,
Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on
the Theory and Applications of Cryptology and Information Security, Hong Kong,
China, December 3-7, 2017, Proceedings, Part I, volume 10624 of Lecture Notes in
Computer Science, pages 781–810. Springer, 2017.

[FRR+ 10] Sebastian Faust, Tal Rabin, Leonid Reyzin, Eran Tromer, and Vinod Vaikuntanathan.
Protecting circuits from leakage: the computationally-bounded and noisy cases. In
Henri Gilbert, editor, Advances in Cryptology - EUROCRYPT 2010, 29th Annual
International Conference on the Theory and Applications of Cryptographic Tech-
niques, French Riviera, May 30 - June 3, 2010. Proceedings, volume 6110 of Lecture
Notes in Computer Science, pages 135–156. Springer, 2010.

[GMK16] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking:
Compact masked hardware implementations with arbitrary protection order. IACR
Cryptology ePrint Archive, 2016:486, 2016.

[GMK17] Hannes Groß, Stefan Mangard, and Thomas Korak. An efficient side-channel
protected AES implementation with arbitrary protection order. In Helena Handschuh,
editor, Topics in Cryptology - CT-RSA 2017 - The Cryptographers’ Track at the
RSA Conference 2017, San Francisco, CA, USA, February 14-17, 2017, Proceedings,
volume 10159 of Lecture Notes in Computer Science, pages 95–112. Springer, 2017.

[GP99] Louis Goubin and Jacques Patarin. DES and differential power analysis (the "du-
plication" method). In Çetin Kaya Koç and Christof Paar, editors, Cryptographic
Hardware and Embedded Systems, First International Workshop, CHES’99, Worces-
ter, MA, USA, August 12-13, 1999, Proceedings, volume 1717 of Lecture Notes in
Computer Science, pages 158–172. Springer, 1999.

[GPQ10] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Secure multiplicative
masking of power functions. In Jianying Zhou and Moti Yung, editors, Applied
Cryptography and Network Security, 8th International Conference, ACNS 2010,
Beijing, China, June 22-25, 2010. Proceedings, volume 6123 of Lecture Notes in
Computer Science, pages 200–217, 2010.

[GPQ11a] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Montgomery’s trick and
fast implementation of masked AES. In Abderrahmane Nitaj and David Pointcheval,
editors, Progress in Cryptology - AFRICACRYPT 2011 - 4th International Confer-
ence on Cryptology in Africa, Dakar, Senegal, July 5-7, 2011. Proceedings, volume
6737 of Lecture Notes in Computer Science, pages 153–169. Springer, 2011.

[GPQ11b] Laurie Genelle, Emmanuel Prouff, and Michaël Quisquater. Thwarting higher-order
side channel analysis with additive and multiplicative maskings. In Preneel and
Takagi [PT11], pages 240–255.

[GSM17] Hannes Groß, David Schaffenrath, and Stefan Mangard. Higher-order side-channel
protected implementations of KECCAK. In Hana Kubátová, Martin Novotný, and
Amund Skavhaug, editors, Euromicro Conference on Digital System Design, DSD
2017, Vienna, Austria, August 30 - Sept. 1, 2017, pages 205–212. IEEE, 2017.

[GT02] Jovan Dj. Golic and Christophe Tymen. Multiplicative masking and power analysis
of AES. In Jr. et al. [JKP03], pages 198–212.

[HOM06] Christoph Herbst, Elisabeth Oswald, and Stefan Mangard. An AES smart card
implementation resistant to power analysis attacks. In Jianying Zhou, Moti Yung,
and Feng Bao, editors, Applied Cryptography and Network Security, 4th International
Conference, ACNS 2006, Singapore, June 6-9, 2006, Proceedings, volume 3989 of
Lecture Notes in Computer Science, pages 239–252, 2006.

[ISW03] Yuval Ishai, Amit Sahai, and David A. Wagner. Private circuits: Securing hardware
against probing attacks. In Dan Boneh, editor, Advances in Cryptology - CRYPTO
2003, 23rd Annual International Cryptology Conference, Santa Barbara, California,
USA, August 17-21, 2003, Proceedings, volume 2729 of Lecture Notes in Computer
Science, pages 463–481. Springer, 2003.

[JKP03] Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors. Cryptographic
Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood
Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes
in Computer Science. Springer, 2003.

[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In
Wiener [Wie99], pages 388–397.

[MPG05] Stefan Mangard, Thomas Popp, and Berndt M. Gammel. Side-channel leakage of
masked CMOS gates. In Alfred Menezes, editor, Topics in Cryptology - CT-RSA
2005, The Cryptographers’ Track at the RSA Conference 2005, San Francisco, CA,
USA, February 14-18, 2005, Proceedings, volume 3376 of Lecture Notes in Computer
Science, pages 351–365. Springer, 2005.

[MPL+ 11] Amir Moradi, Axel Poschmann, San Ling, Christof Paar, and Huaxiong Wang.
Pushing the limits: A very compact and a threshold implementation of AES. In
Kenneth G. Paterson, editor, Advances in Cryptology - EUROCRYPT 2011 - 30th
Annual International Conference on the Theory and Applications of Cryptographic
Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings, volume 6632 of Lecture
Notes in Computer Science, pages 69–88. Springer, 2011.

[MPO05] Stefan Mangard, Norbert Pramstaller, and Elisabeth Oswald. Successfully attacking
masked AES hardware implementations. In Rao and Sunar [RS05], pages 157–171.

[NAN] NANGATE. The NanGate 45nm Open Cell Library. Available at http://www.
nangate.com.

[NRS11] Svetla Nikova, Vincent Rijmen, and Martin Schläffer. Secure hardware implementa-
tion of nonlinear functions in the presence of glitches. J. Cryptology, 24(2):292–321,
2011.

[PR11] Emmanuel Prouff and Thomas Roche. Higher-order glitches free implementation of
the AES using secure multi-party computation protocols. In Preneel and Takagi
[PT11], pages 63–78.

[PT11] Bart Preneel and Tsuyoshi Takagi, editors. Cryptographic Hardware and Embedded
Systems - CHES 2011 - 13th International Workshop, Nara, Japan, September 28 -
October 1, 2011. Proceedings, volume 6917 of Lecture Notes in Computer Science.
Springer, 2011.

[RBN+ 15] Oscar Reparaz, Begül Bilgin, Svetla Nikova, Benedikt Gierlichs, and Ingrid Ver-
bauwhede. Consolidating masking schemes. In Rosario Gennaro and Matthew
Robshaw, editors, Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptol-
ogy Conference, Santa Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I,
volume 9215 of Lecture Notes in Computer Science, pages 764–783. Springer, 2015.
[Rep16] Oscar Reparaz. Detecting flawed masking schemes with leakage detection tests. In
Thomas Peyrin, editor, Fast Software Encryption - 23rd International Conference,
FSE 2016, Bochum, Germany, March 20-23, 2016, Revised Selected Papers, volume
9783 of Lecture Notes in Computer Science, pages 204–222. Springer, 2016.
[RGV17] Oscar Reparaz, Benedikt Gierlichs, and Ingrid Verbauwhede. Fast leakage assessment.
In Wieland Fischer and Naofumi Homma, editors, Cryptographic Hardware and
Embedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan,
September 25-28, 2017, Proceedings, volume 10529 of Lecture Notes in Computer
Science, pages 387–399. Springer, 2017.
[RP10] Matthieu Rivain and Emmanuel Prouff. Provably secure higher-order masking of
AES. In Stefan Mangard and François-Xavier Standaert, editors, Cryptographic
Hardware and Embedded Systems, CHES 2010, 12th International Workshop, Santa
Barbara, CA, USA, August 17-20, 2010. Proceedings, volume 6225 of Lecture Notes
in Computer Science, pages 413–427. Springer, 2010.
[RS05] Josyula R. Rao and Berk Sunar, editors. Cryptographic Hardware and Embedded
Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 -
September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science.
Springer, 2005.

[Tri03] Elena Trichina. Combinational logic design for AES subbyte transformation on
masked data. IACR Cryptology ePrint Archive, 2003:236, 2003.
[TSG02] Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified adaptive multi-
plicative masking for AES. In Jr. et al. [JKP03], pages 187–197.

[UHA17] Rei Ueno, Naofumi Homma, and Takafumi Aoki. Toward more efficient DPA-resistant
AES hardware architecture based on threshold implementation. In Sylvain Guilley,
editor, Constructive Side-Channel Analysis and Secure Design - 8th International
Workshop, COSADE 2017, Paris, France, April 13-14, 2017, Revised Selected Papers,
volume 10348 of Lecture Notes in Computer Science, pages 50–64. Springer, 2017.
[Wie99] Michael J. Wiener, editor. Advances in Cryptology - CRYPTO ’99, 19th Annual
International Cryptology Conference, Santa Barbara, California, USA, August 15-19,
1999, Proceedings, volume 1666 of Lecture Notes in Computer Science. Springer,
1999.
Recovering the CTR_DRBG
state in 256 traces

Publication Data

Lauren De Meyer. Recovering the CTR_DRBG state in 256 traces. IACR
Transactions on Cryptographic Hardware and Embedded Systems, 2020(1), pages
37–65.

Notes

Appendices not included for brevity. Please refer to [3].


Recovering the CTR_DRBG state in 256 traces


Lauren De Meyer
KU Leuven, imec - COSIC
lauren.demeyer@esat.kuleuven.be

Abstract. The NIST CTR_DRBG specification prescribes a maximum size for each random
number request, limiting the number of encryptions in CTR mode with the same key to
4 096. Jaffe’s attack on AES in CTR mode without knowledge of the nonce from CHES 2007
requires 2^16 traces, which is safely above this recommendation. In this work, we exhibit an
attack that requires only 256 traces, which is well within the NIST limits. We use simulated
traces to investigate the success probability as a function of the signal-to-noise ratio. We also
demonstrate its success in practice by attacking an AES-CTR implementation on a Cortex-M4,
among other devices, and recovering both the key and nonce. Our traces and code are made openly
available for reproducibility.
Keywords: DPA · SCA · CPA · AES · CTR · PRNG · NIST · DRBG · DDLA

1 Introduction
Cryptographic implementations in embedded devices are vulnerable to side-channel attacks (SCA)
such as differential power analysis (DPA), which was first introduced by Kocher et al. in 1999 [KJJ99].
In the following years, many variations of this attack have been proposed, such as correlation
power analysis (CPA) by Brier et al. [BCO04], mutual information analysis (MIA) by Gierlichs et
al. [GBTP08] and very recently, differential deep learning analysis (DDLA) by Timon [Tim19].
The success of DPA and its variations lies in the ability to divide-and-conquer, because the
power consumption at some instants depends on a (constant) small part of the secret combined
with variable known data (e.g. plaintext bytes). In most cases, side-channel attacks are performed
under the assumption that the adversary knows the plaintext and/or ciphertext, which allows him
to hypothesize on and recover chunks of the secret key.
In some scenarios, this assumption does not hold. Consider for example a pseudo-random
number generator (PRNG) that is used for key generation or for the supply of fresh randomness to
masked implementations (to protect against SCA). In such cases, neither the plaintext (i.e. the
state of the PRNG) nor the ciphertext (i.e. the output of the PRNG) are considered public. The
adversary is then assumed to only have knowledge of the power consumption or electromagnetic
radiation emanating from the device.
At CHES 2007, Jaffe [Jaf07] presented an attack on AES in Counter mode (AES-CTR) [Dwo01]
in this adversary model. He showed that the sequential nature of the counter mode enables one to
attack AES-CTR with knowledge of only the power traces and without knowledge of the initial
counter (the nonce). Another line of work that considers the same adversary model is that of
blind side-channel attacks, originally by Linge et al. [LDL14] and recently improved by Clavier
et al. [CR17]. In these works, the joint distribution of leakage points is exploited to extract keys
without knowledge of the plaintext or ciphertext.
In this work, we focus on the case of PRNGs. The NIST recommendations for random
number generation include one type of PRNG which is based on a cipher in CTR mode, denoted
CTR_DRBG [BK15]. AES being an important standardized cipher, many PRNGs naturally
use AES-CTR at their core, which means they are vulnerable to Jaffe’s attack. However, NIST
recommends limiting the size of randomness requests to the CTR_DRBG to 2^19 bits. Serving
such a request thus takes at most 4 096 AES encryptions in CTR mode. The NIST CTR_DRBG
also calls an Update function, which changes the PRNG state (nonce and key) between every
request. Since Jaffe’s attack requires 2^16 encryption traces, it actually does not pose a threat to
the NIST CTR_DRBG. In his conclusion [Jaf07], he does allude to the possibility of using only 2^8
traces.
Recovering the CTR_DRBG state in 256 traces 187

1.1 Contribution
In this work, we demonstrate an adaptation of Jaffe’s attack, which requires only 256 power
measurements. We explain the methodology and investigate the success probability of the attack
as a function of the signal-to-noise ratio. Interestingly, our attack’s success depends on the nonce it
is trying to recover, and we show that in some cases, using fewer traces actually improves the success
probability.
We demonstrate the feasibility of the attack on multiple real devices, essentially showing that
the NIST recommendation for the CTR_DRBG allows for too large requests. We also explore
blind SCA [CR17] as an alternative attack methodology and demonstrate the recently introduced
DDLA [Tim19] in a variation of the attack for misaligned traces.
In the context of masked implementations against SCA, PRNGs are usually required to provide a
constant stream of fresh randomness during the computation. Having that randomness compromised
would nullify the protection offered by the masking countermeasure. To this day, very little research
is publicly available on specific constructions for this PRNG. The question of whether this PRNG
should be protected against side-channel analysis itself is largely avoided. We use our attack as
a starting point for the discussion on how to protect PRNGs against adversaries who only have
access to side-channel information and not the plaintexts/ciphertexts.

2 Preliminaries
In Section 2.1, we give a brief overview of AES and introduce our notation for the rest of the paper.
Section 2.2 describes the NIST recommendations for the CTR_DRBG.

2.1 AES
The Advanced Encryption Standard (AES) is a 128-bit block cipher based on a substitution-
permutation network. The master key can be 128, 192 or 256 bits long and the corresponding
number of rounds is 10, 12 or 14, respectively. Each round i consists of four transformations
(AddRoundKey, SubBytes, ShiftRows and MixColumns), which we explain briefly below; the last
round skips MixColumns. The 128-bit state is considered as a matrix of 4 by 4 bytes (see Figure 1).
Each round also receives a 128-bit round key Ki , which is derived from the master key using the
key schedule. The details of the key schedule are not relevant here.

       [ Xi,0  Xi,4  Xi,8   Xi,12 ]
  Xi = [ Xi,1  Xi,5  Xi,9   Xi,13 ]
       [ Xi,2  Xi,6  Xi,10  Xi,14 ]
       [ Xi,3  Xi,7  Xi,11  Xi,15 ]

Figure 1: AES state

AddRoundKey is a linear transformation, which performs a 128-bit exclusive or (⊕) between the
state Xi and the round key Ki :
Yi = Xi ⊕ Ki

SubBytes is the only nonlinear transformation in the round function. It takes each of the 16
bytes of the state and substitutes it for another:
Zi,j = S(Yi,j ) = S(Xi,j ⊕ Ki,j ) j = 0 . . . 15
A typical DPA attack targets the output of this function and exploits the fact that X1,j (a plaintext
byte) is known and variable while K1,j (a master key byte) is unknown and fixed over the acquired
traces.

ShiftRows is simply a permutation of the state bytes, obtained by rotating row j of the state
matrix by j bytes to the left (see Figure 2).

       [ Zi,0   Zi,4   Zi,8   Zi,12 ]        [ Zi,0   Zi,4   Zi,8   Zi,12 ]
  Zi = [ Zi,1   Zi,5   Zi,9   Zi,13 ]  -->   [ Zi,5   Zi,9   Zi,13  Zi,1  ] = Ui
       [ Zi,2   Zi,6   Zi,10  Zi,14 ]        [ Zi,10  Zi,14  Zi,2   Zi,6  ]
       [ Zi,3   Zi,7   Zi,11  Zi,15 ]        [ Zi,15  Zi,3   Zi,7   Zi,11 ]

Figure 2: AES ShiftRows

MixColumns is a linear transformation of the AES state, multiplying each column of the state
with a matrix M over F_2^8. This is the last transformation of each round, except the final round, where
this step is skipped.

Xi+1,[4j...4j+3] = M × Ui,[4j...4j+3] j = 0...3


 
           [ 2 3 1 1 ]
  with M = [ 1 2 3 1 ]
           [ 1 1 2 3 ]
           [ 3 1 1 2 ]
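As a concrete sketch (illustrative Python, not part of the paper), multiplication in F_2^8 uses the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B), and MixColumns applies the matrix M above to each 4-byte column:

```python
def xtime(a):
    """Multiply by x (i.e. by 2) in GF(2^8) with the AES polynomial 0x11B."""
    a <<= 1
    return (a ^ 0x1B) & 0xFF if a & 0x100 else a

def gmul(a, b):
    """GF(2^8) multiplication (only 1, 2 and 3 are needed for MixColumns)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a = xtime(a)
        b >>= 1
    return r

M = [[2, 3, 1, 1],
     [1, 2, 3, 1],
     [1, 1, 2, 3],
     [3, 1, 1, 2]]

def mix_column(col):
    """Apply the MixColumns matrix to one 4-byte column."""
    out = []
    for row in M:
        v = 0
        for m, c in zip(row, col):
            v ^= gmul(m, c)
        out.append(v)
    return out
```

The FIPS-197 test vector (db 13 53 45 -> 8e 4d a1 bc) can be used to check such an implementation.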

2.2 CTR_DRBG
The NIST publication SP 800-90A [BK15] describes recommendations for random number generation
using Deterministic Random Bit Generators (DRBG). One of these is based on block ciphers in
CTR mode and is therefore referred to as CTR_DRBG. For a detailed description of the operation
of the CTR_DRBG, we refer to [BK15]. A simplified pseudocode of the functions relevant for this
work is given in Algorithms 1 and 2.

Figure 3: Operation of the NIST CTR_DRBG_Update function (left) and random bit stream
generation (right) [BK15]

Random Number Generation. In the context of this paper, it is important to know that the
internal state of the CTR_DRBG contains a Key and a value V , as shown in Figure 3. The value
of V at the beginning of a randomness request is what we refer to as the nonce N. The value V is
incremented before every invocation of the block cipher AES, as in CTR mode. This is shown
in Figure 3 on the right and in Algorithm 2. While the block cipher runs in CTR mode, the
output blocks are concatenated until the requested output length is obtained.

Updating the State. At the end of the random bit generation in Algorithm 2, a new key and value
V are generated by the CTR_DRBG_Update function, which is shown in Algorithm 1. This
essentially means that performing a DPA attack across various requests is not possible, because the
secret key changes. Any DPA attack would have to be performed during a single request to the
DRBG (Algorithm 2, lines 2-5). However, the maximum number of bits per request is limited [BK15,
Table 3]. If the counter field occupies at least 13 bits of the block, then the maximum number of
bits per randomness request is 2^19. In the case of AES, which has a block length of 16 bytes, this is
equivalent to 2^12 = 4 096 encryptions. If the counter field length (“ctr_len”) is smaller, the number

Algorithm 1 NIST CTR_DRBG_Update (simplified)

Input: (Key, V), seed length
Output: (Key, V)
1: Init x = 0
2: while length(x) < seed length do
3:   V = V + 1
4:   x = x ∥ AES_Key(V)
5: end while
6: (Key, V) ← x

Algorithm 2 NIST CTR_DRBG_Generate (simplified)

Input: (Key, V), requested # bits
Output: x
1: Init x = 0
2: while length(x) < requested # bits do
3:   V = V + 1
4:   x = x ∥ AES_Key(V)
5: end while
6: Truncate x to requested # bits
7: (Key, V) ← CTR_DRBG_Update(Key, V)
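A toy Python model of Algorithms 1 and 2 may clarify the control flow. Note this uses a SHA-256-based stand-in for the AES block cipher purely for illustration; a real CTR_DRBG uses AES and handles seed material differently:

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def block_cipher(key, v):
    """Stand-in PRF for AES_Key(V); a real DRBG uses AES here."""
    return hashlib.sha256(key + v).digest()[:BLOCK]

def inc(v):
    """V = V + 1 mod 2^128."""
    return ((int.from_bytes(v, 'big') + 1) % 2 ** 128).to_bytes(BLOCK, 'big')

def ctr_drbg_update(key, v):
    """Algorithm 1: derive a fresh (Key, V) pair in CTR mode."""
    x = b''
    while len(x) < len(key) + BLOCK:        # seed length = keylen + blocklen
        v = inc(v)
        x += block_cipher(key, v)
    return x[:len(key)], x[len(key):len(key) + BLOCK]

def ctr_drbg_generate(key, v, n_bytes):
    """Algorithm 2: produce output bytes, then update the state."""
    x = b''
    while len(x) < n_bytes:
        v = inc(v)
        x += block_cipher(key, v)
    out = x[:n_bytes]
    key, v = ctr_drbg_update(key, v)        # state changes after every request
    return out, (key, v)
```

Since the update runs after every request, no two requests encrypt under the same key; a DPA attack must therefore complete within the blocks of a single request.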

of encryptions performed in CTR mode is 2^ctr_len − 4. Not specified in these algorithms is the
reseed counter, which makes sure that the PRNG is reseeded when the number of requests exceeds
a threshold. According to the NIST specifications, this threshold must be at most 2^48.
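The request-size arithmetic above can be checked directly:

```python
MAX_REQUEST_BITS = 2 ** 19   # NIST limit on one randomness request
AES_BLOCK_BITS = 128

blocks_per_request = MAX_REQUEST_BITS // AES_BLOCK_BITS
print(blocks_per_request)    # 4096, i.e. 2**12 AES encryptions under one key
```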

Forward/Backward Secrecy. The concepts of forward and backward secrecy evaluate the security
of PRNGs when their state is compromised (i.e. known by an adversary). The CTR_DRBG
provides backward secrecy because recovering the state (key and nonce) during one request does
not allow an adversary to compute the previous states. The explanation for this is simply that
the current state is the result of an AES-CTR computation with the previous (unknown) key (see
Algorithm 1). On the other hand, as long as the DRBG is not reseeded with a fresh seed, it does
not provide forward secrecy, since the knowledge of the current state allows one to perfectly predict
the following states.

3 The Attack
In this attack, as in [Jaf07], we perform DPA on four rounds of AES-CTR. In the first rounds, we
assume a large part of the state is constant and we recover information about a few variable bytes.
By propagating them through the ShiftRows and MixColumns transformations, we obtain enough
information to perform DPA in the next round, until finally, we can recover the entire round key in
round four. In this work, we choose CPA as our attack methodology.

Simulated traces. For the remainder of this section, we apply the steps of the attack to simulated
traces and explore the success rate as a function of the signal-to-noise ratio (SNR). We will apply
the attack to traces from real devices in Section 4. To generate the simulated traces, we perform
AES-CTR and after each round transformation, we collect the Hamming weights of the 16 bytes of
the state and add them to the trace. Each time sample in a simulated trace thus corresponds to
the Hamming weight of one state byte in one round. We then add Gaussian noise to the trace with
some standard deviation σ. The variance σ 2 is calculated as the variance of the collected Hamming
weights divided by the desired SNR. For example, the relationship between the actual Hamming
weight and some simulated leakages is shown in Figure 4. For each experiment, we add new noise
to the original Hamming weights and we measure the success as the proportion of correct bytes
recovered. We repeat each experiment ten times for each SNR.

Figure 4: Simulated leakages vs. actual Hamming weights for SNR=1.0

Setup. The input to the first round is constructed by the addition of a counter T with an unknown
nonce N : X1 = N + T mod 2128 . We assume for simplicity that the counter starts at the least
significant byte of the state. It is trivial to adapt the attack if this is not the case. We thus assume
that X1,15 = N15 + T mod 256, with N15 constant and unknown and T the counter starting from 0.
Further, since we will only use 256 traces, we can consider the 14 most significant bytes completely
constant: X1,j = Nj for j < 14. Byte 14 is a special case, since it is not constant, but will only
assume two values: N14 and (N14 + 1) mod 256. We visualize this in Figure 5, where white squares
signify fixed values, black squares are varying continuously and the byte in the grey square toggles
at most once in the set of traces.

Figure 5: Input to the first round of AES

3.1 Round 1: one byte


The attack on the least significant byte corresponds exactly to that described in [Jaf07]. This is the
most complex step in our attack, as it requires hypothesizing on 15 unknown bits (i.e. complexity
2^15). We target the output of SubBytes:

Z1,15 = S(K1,15 ⊕ X1,15) = S(K1,15 ⊕ ((N15 + T) mod 256))

As in [Jaf07], let N15 = N15,hi |N15,lo and K1,15 = K1,15,hi |K1,15,lo where hi denotes the most
significant bit and lo the other 7 bits, and let b = N15,hi ⊕ K1,15,hi. Then we can write Z1,15 as

Z1,15 = S((b << 7) ⊕ K1,15,lo ⊕ ((N15,lo + T) mod 256)) [Jaf07]

We then perform CPA, where we hypothesize on the 15 bits (b, K1,15,lo and N15,lo ) and compute
the correlation between our Z1,15 and the traces. The winning hypothesis (with the largest absolute
correlation) does not tell us the most significant bits of K1,15 and N15 , but this is of no importance
for the remainder of the attack. With these 15 bits, we know Z1,15 completely.
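The 15-bit model behind this step can be checked algebraically. The sketch below is an illustration, not the attack code: it builds the AES S-box from its definition (multiplicative inverse in GF(2^8) followed by the affine map), and `z1_15_prediction` computes the hypothesis value for a guess (b, K1,15,lo, N15,lo). By the carry argument above, S((b << 7) ⊕ K1,15,lo ⊕ ((N15,lo + T) mod 256)) equals S(K1,15 ⊕ ((N15 + T) mod 256)) for every T.

```python
def aes_sbox():
    """Build the AES S-box from its definition: inverse in GF(2^8), affine map."""
    def gmul(a, b):  # multiplication modulo x^8 + x^4 + x^3 + x + 1
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            hi = a & 0x80
            a = (a << 1) & 0xFF
            if hi:
                a ^= 0x1B
            b >>= 1
        return p
    inv = [0] * 256
    for x in range(1, 256):
        inv[x] = next(y for y in range(1, 256) if gmul(x, y) == 1)
    rotl = lambda v, r: ((v << r) | (v >> (8 - r))) & 0xFF
    return [inv[x] ^ rotl(inv[x], 1) ^ rotl(inv[x], 2) ^ rotl(inv[x], 3)
            ^ rotl(inv[x], 4) ^ 0x63 for x in range(256)]

SBOX = aes_sbox()

def z1_15_prediction(b, k_lo, n_lo, T):
    """Predicted SubBytes output for counter T under guess (b, K1,15,lo, N15,lo)."""
    return SBOX[(b << 7) ^ k_lo ^ ((n_lo + T) % 256)]
```

A CPA then correlates HW(z1_15_prediction(...)) for all 2^15 guesses against the traces.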
Figure 6 shows the success rate of this step, which is 1.0 for reasonably low SNR levels. Below
the threshold of SNR=0.2, the success rate decreases dramatically and becomes 0.0 as of SNR=0.01.

3.2 Round 2: four bytes


Figure 7 depicts the AES state after the ShiftRows and MixColumns operations in terms of
variability. We will use the continuously changing byte Z1,15 to recover the first column of the state

Figure 6: Success Rate of Step 1 with 256 traces as function of the SNR.

after SubBytes.

Figure 7: AES state after the first ShiftRows (left) and MixColumns (right) transformations.

Consider for example the SubBytes output Z2,0 :

Z2,0 = S(Y2,0) = S(K2,0 ⊕ X2,0)
     = S(K2,0 ⊕ 2Z1,0 ⊕ 3Z1,5 ⊕ 1Z1,10 ⊕ 1Z1,15)    (1)

where the first four terms inside the S-box are constant and unknown, while Z1,15 is variable and known.

By treating K2,0 ⊕ 2Z1,0 ⊕ 3Z1,5 ⊕ 1Z1,10 as one unknown 8-bit constant C2,0, we can recover this
constant using CPA with only 256 hypotheses and thus recover Z2,0 . The same is true for the other
three bytes in the first column:

Z2,0 = S(C2,0 ⊕ 1Z1,15)
Z2,1 = S(C2,1 ⊕ 1Z1,15)
Z2,2 = S(C2,2 ⊕ 3Z1,15)    (2)
Z2,3 = S(C2,3 ⊕ 2Z1,15)

We note that performing CPA for Z2,0 and Z2,1 is identical, since in both cases the S-box input
is the sum of 1Z1,15 with a constant. Indeed, the example in Figure 8 shows that there is not one
but there are two prevailing hypotheses: 0xAC and 0x94. Since each byte corresponds to only one
time sample in the simulated traces, the correlation peaks are very close to each other in Figure 8.
The separation is clearer in real power traces. If the S-box evaluations are not randomly shuffled,
it is trivial to decide which constant belongs to which state byte. In this case, C2,0 = 0x95 and
C2,1 = 0xAC.
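Under a Hamming-weight leakage model, the 256-hypothesis CPA of this step fits in a few lines. The following sketch is an illustration under the simulated-leakage assumptions of this section, not the attack code: the function names are ours, and the S-box is passed in as a 256-entry numpy lookup table.

```python
import numpy as np

def gmul(a, b):
    """Multiplication in GF(2^8) with the AES reduction polynomial."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def cpa_recover_constant(samples, z_known, f, sbox):
    """Return the 8-bit constant C maximizing |corr(HW(S(C ^ f*z)), samples)|.

    samples: one leakage sample per trace; z_known: the known byte per trace;
    f: the MixColumns factor (1, 2 or 3); sbox: length-256 numpy array."""
    hw = np.array([bin(x).count("1") for x in range(256)])
    fz = np.array([gmul(f, int(z)) for z in z_known])
    best_c, best_r = 0, -1.0
    for c in range(256):
        model = hw[sbox[c ^ fz]]
        r = abs(np.corrcoef(model, samples)[0, 1])
        if r > best_r:
            best_c, best_r = c, r
    return best_c
```

The same routine serves for step 3 with the appropriate factors fj.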
Figure 9 shows the success rate of recovering all four bytes of the first column. Again, the
threshold for reaching 100% success lies at SNR=0.2. The cutoff is still quite steep, with 0 success
for SNR=0.001 and below.

3.3 Round 3: sixteen bytes


The known bytes Z2,0 to Z2,3 are spread to all columns of the state by the ShiftRows transformation
as shown in Figure 10. The subsequent MixColumns operations will affect the entire state. As shown

Figure 8: Pearson Correlation coefficients in Step 2 with SNR=1.0, with 256 traces as a function of
the time samples (left) and their maximum as a function of the number of traces (right).

Figure 9: Success Rate of the second step of the attack with 256 traces.

by the grey squares in Figure 10, each column now has an additional byte that is non-constant.
Because we only have 256 traces and did not follow the approach from [Jaf07], the grey bytes are also
unknown. However, keep in mind that the grey bytes only assume two distinct values throughout all
the traces. The number of traces for each depends on the carry of the addition X1,15 = (N15 + T )
mod 256, which makes X1,14 toggle from N14 to (N14 + 1) mod 256.

Figure 10: AES state after the second ShiftRows transformation.

Best Case. Assume for simplicity that the least significant byte of the nonce N15 is 0x00. In that
case, X1,14 = N14 never toggles and all grey squares in Figure 10 are constant, just like the white
squares. This means that in each column, we can apply the same method as we did in round 2.
Each byte of the SubBytes output can be written as (2):

Z3,j = S(C3,j ⊕ fj Z2,kj ) (3)

where the factors fj are easily derived from MixColumns matrix M and kj refers to the known
byte (the black squares) in each column (see Appendix A). As in round 2, each column again has

two bytes for which the hypotheses are identical (when fj = 1) and the correct constants can be
derived by comparing the time samples where the maximum correlation occurs.
Now, assume instead that N15 = 0xFF and X1,14 toggles immediately, leading to the grey squares
in Figure 10 being identical in all but one of the traces. When performing the same CPA, we now
recover different constants C3,j′, corresponding to when X1,14 = (N14 + 1) mod 256.

Average Case. In all other cases, the constants in the computation will be C3 for the first portion
of traces and C3′ for the second portion, after X1,14 has toggled. Interestingly, the same approach
as before, with 256 traces, still works. The winning hypotheses are those constants that occur most
often in the set of traces. The traces that correspond to the other (non-winning) constants act as
noise. The attack is successful if the 16 recovered bytes are either all from C3 or all from C3′, but
not a mix of both. Clearly, this depends on the least significant byte of the nonce (N15), since this
byte decides when X1,14 toggles from N14 to N14 + 1 and the constants from C3 to C3′. This is
demonstrated in Figure 13, where we show the success rate for various values of N15.

Worst Case. The worst case scenario is when the toggle occurs approximately halfway, i.e. when
N15 ≈ 0x80. In that case, the constants C3,j and C3,j′ are in a close race (see Figure 11, left). This
results in recovering some bytes from C3 and some from C3′, which is a problem for the next and
last stage of the attack. This is clearly reflected in the results in Figure 13, since the success rate
only converges to approximately 0.6 for nonce 0x80. Figure 13 also shows the success rate of the
attack with N15 = 0x80 when we use only half of the traces, indicated by 0x80∗. This is a rare and
interesting case in which using fewer traces actually improves the performance of the attack; this
is not surprising, though, since we know that the traces we are removing act as noise.

Figure 11: Pearson Correlation coefficients in Step 3 with N15 = 0x80 and SNR=1.0 with 256
traces (left) and 128 traces (right).

In Figure 12, we depict the maximum correlation coefficient for each hypothesis as a function of
the amount of traces used for the best and worst case. It demonstrates again very clearly that with
nonce N15 = 0x80, using more than 128 traces only deteriorates the success of key recovery.
The attacker only knows the least significant 7 bits of the nonce, so is unable to distinguish 0x80
from 0x00. However, seeing a close race as in Figure 11, left is a good clue, especially if performing
the CPA again with only half the traces results in a clear winner (Figure 11, right).
Also in other cases, the knowledge of the 7 least significant nonce bits can be used to calculate
exactly how many traces to remove (either at the beginning or the end of the acquired set) to have
a pure subset of traces using only one constant. There are two possible sets of traces, depending on
whether the most significant bit of N15 is 0 or 1. We can try out both possibilities and detect as
in Figure 11, which option gives the best results. We will demonstrate this in the examples in
Section 4.

3.4 Round 4: recovering the round key


Whether the previous step recovered constants C3 or C3′, we now know exactly the state Z3 in most
of the traces, which after propagation through ShiftRows and MixColumns allows us to do a classic
CPA in the next round and recover round key K4 . As in § 3.3, the success of this stage depends on
the least significant byte of the nonce N15 . This is shown in Figure 15, although now, even the
worst case can lead to a successful attack if the SNR is sufficiently high (SNR ≥ 1). Removing part

Figure 12: Maximum Correlation coefficients as a function of the number of traces in Step 3 with
SNR = 1.0 and N15 = 0x00 (left) or N15 = 0x80 (right).

Figure 13: Success Rates of the third step of the attack for various nonces with 256 traces (except
0x80∗ with 128 traces).

of the traces can still help to improve the success probability. This is again indicated in Figure 14,
right and in Figure 15 by 0x80∗ . From now on, we always perform the fourth step of the attack
with the same selection of traces as step 3. With the recovery of the round key K4 , it is trivial to
reverse the key schedule and calculate the master key K1 . Next, we can calculate the nonce N by
performing the AES rounds backward from the state Z3 .
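Walking the AES-128 key schedule backward is mechanical. The sketch below is an illustration (not the attack code): the substitution table is passed in as a parameter, and the function assumes the given round key lies `rounds_back` key-schedule steps after the master key, so that the Rcon indices line up (K4 in the paper's numbering corresponds to rounds_back = 3).

```python
def invert_key_schedule(round_key, rounds_back, sbox):
    """Walk the AES-128 key schedule backward from a 16-byte round key
    that sits rounds_back expansion steps after the master key."""
    RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]
    w = [list(round_key[4 * i:4 * i + 4]) for i in range(4)]
    for r in range(rounds_back, 0, -1):
        prev = [None] * 4
        for j in (3, 2, 1):                 # w'[j] = w[j] ^ w[j-1]
            prev[j] = [a ^ b for a, b in zip(w[j], w[j - 1])]
        rot = prev[3][1:] + prev[3][:1]     # RotWord of the previous last word
        sub = [sbox[x] for x in rot]        # SubWord
        prev[0] = [a ^ b for a, b in zip(w[0], sub)]
        prev[0][0] ^= RCON[r - 1]           # undo the round constant
        w = prev
    return bytes(sum(w, []))
```

Calling this with the recovered K4, rounds_back = 3 and the AES S-box yields the master key K1.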
The success probabilities in Figures 9 to 15 were each obtained in experiments using the correct
information from the previous steps. They are thus actually conditional probabilities, conditioned
on the success of the previous step of the attack. Hence, by multiplying these success rates, we
obtain the success rate of the entire attack. This is shown in Figure 16.

3.5 Discussion
Jaffe’s Original Attack. In the original attack by Jaffe [Jaf07], the first step that recovers Z1,15
by hypothesizing on 15 bits is identical. The difference from this paper is that Jaffe uses 2^16
power traces and can therefore also recover Z1,14 with this approach. This requires hypothesizing
on 16 bits and is thus more complex. In our case, with only 256 traces, byte Z1,14 is almost
constant, hence we must follow a different approach. This way, we also avoid the hypothesis on
16 bits. After retrieving Z1,14 and Z1,15 , Jaffe selects a subset of traces in which the remaining
bytes Z1,0 , . . . , Z1,13 are constant. In the second round of encryption, the attack follows the same
approach as described in § 3.2. With both Z1,15 and Z1,14 known, it is possible to recover the first
two columns: Z2,0 , . . . , Z2,7 . Finally, in round three, two bytes per column are known and variable,
so the same approach as in round two allows retrieval of the entire state Z3 . The last step of the
attack is again analogous to ours.

Figure 14: Maximum Correlation coefficients as a function of the number of traces in Step 4 with
SNR=1.0 and N15 = 0x00 (left) or N15 = 0x80 (right).

Figure 15: Success Rates of the fourth step of the attack for various nonces with 256 traces (except
0x80∗ with 128 traces).

More traces available? The NIST recommendations currently allow an adversary to obtain up to
4 096 traces of AES-CTR, which is well above 256. What happens to the attack success probability
when we can actually use this full number of traces? A general understanding in side-channel
analysis is that increasing the number of traces always increases the success probability of an attack.
This is certainly also true for the first step of the attack, since the least significant byte of the
counter is not affected by a carry from a previous byte. With up to 4 096 = 2^12 encryptions in
CTR mode, the same can be said for the second step of the attack, since a counter up to 2^12 is not
enough to invalidate the assumption that three bytes in the first column are constant. In the third
and fourth step however, the success very much relies on the assumption that three bytes in each
column are (quasi-)constant, which means increasing the number of traces would only increase the
“noise”. However, having more than 256 traces available can certainly help, since one can select
from them the perfect subset of traces. For example, if the attacker suspects from the first 256
traces that the least significant byte of the nonce is near 0x80, (s)he only has to throw away the
first 128 traces and use the next 256 traces to turn a worst case scenario into a best case scenario
(nonce 0x00). Similarly, with any other nonce N15 the adversary can compute exactly how many
traces to throw away (256 − N15 ) to obtain the subset with nonce 0x00. It is important that step 3
and 4 of the attack are still performed with only 256 traces, in order for the assumptions on the
white squares to hold.
Hence, if an adversary has 512 traces at his disposal, the success rate of the attack will always
follow the best case in Figure 16, or even a bit better, since the first two steps can use the full amount
of 512 traces (see Figure 17). We will illustrate this method in an application in Appendix C.

The Rippling Carry. There is one more case we did not consider in the above description of the
attack. We mention it here, since it does not significantly affect the attack. In Figure 5, we assume

Figure 16: Success Rates of the attack for various nonces with 256 traces (except 0x80∗ with 128
traces).

Figure 17: Success Rate of the attack with 256 traces (best case) or 512 traces (any case)

that the white squares are completely constant throughout a set of 256 traces and that only the
grey square can toggle once. However, if X1,14 = 0xFF, its toggling to 0x00 will actually create a
non-zero carry which affects X1,13 and makes it increment as well. If that byte is 0xFF as well, the
carry propagates to the next byte, and so on.
While this is something to keep an eye on when recovering the nonce N from X1 , it should not
affect the first four steps of the attack. The toggling of any other byte from one value to another
will happen at the same time as the toggling of X1,14 . Hence, the situation in round 3 and 4 of the
attack remains the same: a part of the traces corresponds to one constant (C3) and another part
uses constant C3′.
Step two of the attack is affected if the carry ripples all the way to byte X1,10 , which affects the
first column of the state in round 2. This would mean that N11 = N12 = N13 = N14 = 0xFF and is
thus a very special case.

4 Experimental Validation
To test our attack on a real device, we program a Cortex-M4 CPU with an AES-CTR implementation.
For this, we use the ChipWhisperer CW308T-STM32F3 target mounted on the CW308 UFO board.
The UFO board is connected to the ChipWhisperer-Lite board. We use the ChipWhisperer Capture
software for programming the device, communicating with the device and for collecting power
measurements. The clock frequency of the target and sample rate of the scope are set to the
ChipWhisperer defaults.
We collect exactly 256 traces of 12 000 samples each, consisting of approximately the first four

rounds of AES. The nonce and key are chosen randomly by the ChipWhisperer Capture software.
Thanks to the ChipWhisperer measurement setup, the traces are well aligned. An example trace
is shown in Figure 18. For efficiency, we will use only the SubBytes region of each round in the
corresponding steps of the attack.

Figure 18: Example trace of the first four rounds of AES-CTR on a Cortex-M4.

Round 1. In the first step of the attack, the winning hypothesis achieves almost double the
correlation of the others. We learn that (b, K1,15,lo , N15,lo ) = (0, 0x57, 0x0D) (see Figure 19). This
means that (K1,15 , N15 ) is either (0x57, 0x0D) or (0xD7, 0x8D). We already have here an example
of a possible worst-case scenario.

Figure 19: Pearson Correlation coefficients in Step 1, with 256 traces as a function of the time
samples (left) and their maximum as a function of the number of traces (right).

Round 2. In Round 2, we recover the constants C2 = [0x65, 0x22, 0x52, 0x52] (see Figure 20).

Figure 20: Pearson Correlation coefficients in Step 2 (bytes 0 and 1), with 256 traces as a function
of the time samples (left) and their maximum as a function of the number of traces (right).

Round 3. In round 3, we start with the attack to recover constant C3,0 . The result is shown in
Figure 21, left and gives a strong suspicion that the least significant byte of the nonce is actually
0x8D, since we see a close race between two constants. This means that the least significant key
byte should be 0xD7.

Figure 21: Pearson Correlation coefficients in Step 3 with 256 traces (left) and 128 traces (right)
(byte 0).

If we perform the same attack with only half the traces (see Figure 21, right), we obtain a clear
winner. In Figure 22, we show the maximum correlation coefficients as a function of the number of
traces used. We thus suspect that N15 = 0x8D and continue the attack with only half the traces.
We recover the following constants in round 3:
C3 = [0x76, 0x23, 0x3D, 0xCE, 0x70, 0xB9, 0xCB, 0xA4, 0x46, 0x32, 0x6E, 0x84, 0xA0, 0x64, 0x68, 0x09]

Figure 22: Maximum Correlation coefficients as a function of the number of traces in Step 3 (byte
0).

Round 4. Finally, still using half the traces (see Figure 23), we recover the following round key in
Round 4:
K4 = [0x7B, 0xFF, 0x7A, 0xD7, 0x0D, 0x28, 0x2E, 0xE3, 0x00, 0x3E, 0xD1, 0x58, 0xCB, 0x87, 0x0B, 0xBB]
If we perform the Key Schedule backward, we find that the Master Key is
K1 = [0xCC, 0x8E, 0x0F, 0x06, 0x0D, 0xE8, 0x3E, 0x80, 0x24, 0xBE, 0x94, 0x73, 0xBD, 0x6E, 0x8E, 0xD7]
Looking at the least significant byte, we can now confirm that (K1,15 , N15 ) = (0xD7, 0x8D) and that
we probably performed the attack correctly. Indeed, when we perform AES backward from Z3 , we
obtain
X1 = [0xE6, 0x10, 0x3B, 0x22, 0x55, 0x62, 0x7E, 0xE6, 0xBE, 0x93, 0x18, 0xBD, 0x71, 0xB7, 0xBA, 0x8D]
which is equal to the nonce N, since we used the first half of the traces. Pay attention when using
the second half, or when the majority of the traces use the constant C3′: in that case, we recover
X1,14 = (N14 + 1) mod 256.

Figure 23: Pearson Correlation coefficients in Step 4 (byte 0), with 128 traces as a function of the
time samples (left) and their maximum as a function of the number of traces (right)

Device Behaviour. Now that we know the key and nonce, we can investigate the relation between
Hamming weights and their leakage on the Cortex-M4. In Figure 24, we plot the Hamming weight
of one point of interest (byte 15 in the first SubBytes) across 256 encryptions on the x-axis and the
measurements of the corresponding sample in the power traces on the y-axis. The corresponding
trace sample is chosen as the time sample where the power measurements have the largest Pearson
correlation with this array of Hamming weights. We also estimate the SNR at this time sample as

SNR = Var(signal) / Var(noise)

The signal is constructed by replacing each measurement with the average of all measurements for
that Hamming weight, as is done in [MOP07]. The noise is approximated by subtracting these
averages from the actual measurements. This way, we obtain SNR ≈ 2.18 at the time sample
corresponding to state byte 15 after the S-box in the first round.
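The SNR estimate described above, with the per-Hamming-weight mean as the signal and the residual as the noise, can be sketched as follows (a simple illustration; the function name is ours):

```python
import numpy as np

def estimate_snr(samples, hws):
    """SNR at one time sample, following [MOP07]: the signal replaces each
    measurement by the mean of all measurements sharing its Hamming weight;
    the noise is what remains after subtracting that signal."""
    samples = np.asarray(samples, dtype=float)
    hws = np.asarray(hws)
    signal = np.empty_like(samples)
    for w in np.unique(hws):
        signal[hws == w] = samples[hws == w].mean()
    noise = samples - signal
    return np.var(signal) / np.var(noise)
```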

Figure 24: Measured leakages vs. actual Hamming weights on the Cortex-M4.

Other devices. We additionally successfully performed the attack on an Arduino Uno and the
ChipWhisperer-lite XMEGA target. For the results, we refer to Appendices B and C. The trace
files and a JuPyter Notebook performing the above attack can be found online1 .

5 Discussion
5.1 The Worst Case Nonce
Figure 16 gives a bleak impression of the attack’s sensitivity to the least significant byte of the
nonce. However, the worst case scenario is not as bad as it seems.
1 https://github.com/LaurenDM/AttackAESCTR

Firstly, it does not imply the existence of a protection mechanism since biasing the nonces
towards 0x80 would only reduce the search space of the attacker.
Secondly, we have shown that removing part of the traces may improve the chance of success.
This trick is not limited to the worst case, as the attacker has knowledge of N15,lo and can thus
always compute the right number of traces to throw away. Without knowing N15,hi , there are
two possible ways to do it, but one will clearly improve results, while the other will make them
worse. However, depending on the amount of noise in the traces, extra traces may still improve the
performance. As stated in § 3, in the worst case, we can say the attack requires 512 traces, which
is still far less than 4 096. With 512 traces available, step 3 and 4 always achieve the best success
rate and the dependency on the nonce thus disappears.
Finally, PRNGs tend to be used for applications that need a continuous supply of randomness.
If the attacker really has access to at most 256 traces per PRNG request and the worst case scenario
occurs, the next PRNG request will have a different nonce. NIST prescribes the PRNG to be
reseeded after at most 2^48 requests. As soon as the adversary manages to recover the key and nonce
for one request, the internal state of the PRNG is known and the future random outputs can be
calculated as long as the PRNG is not reseeded.

5.2 Variations on a theme: DDLA


In the original work of Jaffe [Jaf07], the attack was not performed using CPA, but rather DPA
as introduced by Kocher et al. [KJJ99]. In this work, we opted for CPA, but any
similar SCA methodology can replace this. For example, recently, a non-profiled SCA using deep
learning (DDLA) was introduced by Timon [Tim19], which was shown to be more resilient in case
of misaligned traces.
The main idea of DDLA is to train a neural network for various key guesses. With each training,
the inputs to the network are the traces and the outputs are the corresponding leakage hypotheses
for a particular key guess. For the correct key guess, the network accuracy during training is
supposed to grow a lot faster than for wrong key guesses. The use of a convolutional neural network
in this method is more robust in the case of misaligned traces. For more details on DDLA, we refer
to [Tim19].

Application to AES-CTR. In the original paper, this methodology uses around 3 000 traces. With
only 256 traces available, it is more likely that the network “memorizes” the data and starts to
overfit. Choosing a suitable network architecture is therefore a bit more challenging in our case.
We used the same traces as in Section 4, but as in [Tim19], created a misalignment by shifting each
trace by a random offset between -25 and 25. This is not a large offset, but it is sufficient to make
regular CPA fail. Our neural network starts from the CNNexp architecture of Timon [Tim19], but we use 8
filters of size 100 in the first convolutional layer and we replace the second convolutional layer with
a 10-neuron dense layer. We also use the most significant bit of the S-box output in our hypotheses.
It is standard to randomly initialize a neural network’s weights. The initial weights have some
influence on the accuracy of the training, which is why training the network for the same hypothesis
twice can give different results in accuracy. We therefore found that, when training the same
network for different hypotheses and comparing their accuracies, it is better to always use the same
initial weights in the neural network.
Figure 25 shows the resulting accuracies for the first step of the attack. Even with only 256
traces, this methodology works. It takes a lot of computation time, since we need to train the
network 2^15 times, but it succeeds where regular CPA does not.

Distinguishing time samples. In the second step of the attack, it is important to know the most
defining time samples in the trace in order to distinguish the winning hypotheses of two bytes in
one column. For this purpose, we use the sensitivity analysis as described in [Tim19, §3.2.2]. The
results are shown in Figure 26 with the accuracies on the left and the sensitivities on the right.
They show clearly which of the two constants appears first in the trace. The results correspond to
those of Section 4.
We can thus conclude that other variants of DPA methods can be applied in the attack. DDLA
is a good choice if the traces are misaligned, but does take quite some computation time.

Figure 25: Performing the first step of the attack with DDLA, using 256 traces.

Figure 26: Performing the second step of the attack (byte 0 and 1) with DDLA, using 256 traces.
Accuracies (left) and sensitivity analysis (right).

5.3 Blind SCA


An alternative approach to attack a CTR mode with unknown nonce is to use a blind SCA as
described at CHES 2017 by Clavier et al. [CR17]. This methodology stems from the observation
that the joint probability distribution of (HW(Xi,j), HW(Zi,j)) with Zi,j = S(Xi,j ⊕ Ki,j) depends
on the secret key Ki,j. In [CR17], this is exploited by computing the maximum likelihood that the
leakages observed for Xi,j and Zi,j occur in the case of a specific key guess.
This method has the advantage that it does not even require CTR mode, as it does not need
specific knowledge of the plaintext X1. It can thus be applied to PRNGs based on different modes
of operation, but only to recover the key. However, we found that the methodology is very sensitive
to noise and less effective in this case than the ones described in this paper and Jaffe’s [Jaf07]. We
were not able to recover the secret key from our devices running AES in CTR mode, using 4 096
traces.
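The key dependence that blind SCA exploits is easy to verify numerically. The sketch below is an illustration only: it uses the small 4-bit PRESENT S-box as a stand-in for the AES S-box and tabulates the joint distribution of (HW(x), HW(S(x ⊕ k))) over uniform x for a key guess k. Across key guesses these distributions differ, which is what enables a maximum-likelihood key recovery from Hamming-weight leakage alone.

```python
from collections import Counter

# The 4-bit PRESENT S-box, used here only as a small stand-in for AES
SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
         0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def joint_hw_distribution(k, sbox=SBOX4):
    """Distribution of (HW(x), HW(S(x ^ k))) over uniform x, for key guess k."""
    n = len(sbox)
    counts = Counter((bin(x).count("1"), bin(sbox[x ^ k]).count("1"))
                     for x in range(n))
    return {pair: c / n for pair, c in counts.items()}
```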

Application to AES-CTR. Since not all plaintext bytes vary in CTR mode, it makes more sense
to apply the blind attack to the last round, where the ciphertext bytes are constantly changing.
We noticed a number of drawbacks to blind SCA compared to regular CPA in this application. For
example, the blind attack requires a precise estimation of the location of the two points of interest:
Xi,j , Zi,j corresponding to a byte Xi,j at the input of AddRoundKey and Zi,j = S(Ki,j ⊕ Xi,j ) the
byte at the output of SubBytes. Even if one manages to pinpoint the correct samples in the traces,
the attack also requires the leakages at these points to be converted to Hamming weight estimations.
In the work of [CR17], this is done by estimating coefficients α and β such that the leakage is
approximately αHW + β. However, when we compare the measured leakages on a Cortex-M4
device with the actual Hamming weights in Figure 24, we see that one easily estimates the wrong
Hamming weights from these.
In contrast, for a regular CPA attack, it suffices to identify only an approximate region of
interest, since the Pearson correlation coefficient can be computed for many time samples. Moreover,

it is not required to estimate the Hamming weights, since CPA can be applied directly to the
measurements obtained from the oscilloscope (no matter the leakage unit). The same can thus be
said for the CPA-based attack of this work.

Experiments. We tried a simplified attack, where the points of interest and α, β are given to the
adversary: we collected power measurements both from an Arduino Uno and from a Cortex-M4.
We computed the actual Hamming weight values using a simulation of AES-CTR with the same
nonce and key as was sent to the device. We determined the points of interest in the real power
traces by computing the correlation of the trace points with the real Hamming weights. We then
used the least squares method to determine the coefficients α, β in the relationship between the
real Hamming weights and the leakage units of the trace. We used the maximum number of traces
available according to the NIST recommendations: 4 096. Even then, the blind SCA was only able
to recover 11 of the 16 key bytes on the Arduino Uno device and 8 bytes on the Cortex-M4. We
show figures for each key byte in Appendix D.
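Fitting the linear leakage model L ≈ αHW + β by least squares, as in the simplified experiment above, can be sketched as follows (a synthetic illustration; the function names are ours):

```python
import numpy as np

def fit_leakage_model(hws, samples):
    """Least-squares fit of samples ~ alpha * HW + beta; returns (alpha, beta)."""
    A = np.vstack([np.asarray(hws, dtype=float), np.ones(len(hws))]).T
    (alpha, beta), *_ = np.linalg.lstsq(A, np.asarray(samples, dtype=float),
                                        rcond=None)
    return alpha, beta

def hw_estimate(sample, alpha, beta):
    """Invert the fitted model to get a rounded Hamming-weight estimate."""
    return round((sample - beta) / alpha)
```

As Figure 24 suggests, with real measurements the spread around the fitted line easily makes `hw_estimate` return a neighbouring Hamming weight, which is the sensitivity discussed above.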

5.4 How (not) to use CTR mode


It is clear that the presence of the counter in CTR mode gives more information to the adversary
than in the case of for example CBC mode. However, the possibility of the demonstrated attack
does not imply that using a CTR mode-based PRNG is always a bad idea. By following a few simple
guidelines when using the CTR_DRBG, the attack can be avoided. In this section we discuss
some observations and recommendations for the use of AES-CTR in a PRNG. As an example, we
consider the context of masked implementations against side-channel attacks, where online PRNGs
are usually required to provide a continuous stream of randomness. This is a very interesting use
case for the attack, since recovering the state of the CTR_DRBG once implies that the attacker
can derive any future PRNG output. The attacker can then compute all the masks used in the
masked implementation and perform a classic first-order DPA attack to recover the secret key.
It does not matter then whether the masked implementation is first-, second- or even fifth-order
secure. Recall that the CTR_DRBG does not provide forward secrecy as long as it is not reseeded,
and the recommended maximum number of requests between reseeds is 2^48 [BK15], which allows for
more than enough traces for a first-order attack.
While we keep this application in mind, the discussion is of course also relevant for any other
use of the CTR_DRBG, such as key generation or IV generation for protocols.

5.4.1 Observations
Hiding. A common hiding technique against side-channel attacks is to randomize the order of the
16 S-box calculations during SubBytes. We saw in § 3.2 and § 3.3 that the order of execution is
important to distinguish two of the constants in each state column. The hiding countermeasure
therefore does not increase the number of traces required but can increase the complexity of the
attack by increasing the number of possibilities to try in step 2 and step 3. However, it does not
completely prevent our attack.
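The shuffling countermeasure discussed above can be sketched as follows (an illustrative toy of our own, not the implementation attacked in this work): the 16 S-box look-ups are executed in an order drawn freshly from a cryptographic RNG.

```python
import secrets

def shuffled_subbytes(state, sbox):
    """Apply the S-box to all 16 state bytes in a fresh random order."""
    order = list(range(16))
    # Fisher-Yates shuffle driven by a cryptographic RNG
    for i in range(15, 0, -1):
        j = secrets.randbelow(i + 1)
        order[i], order[j] = order[j], order[i]
    out = bytearray(16)
    for idx in order:
        out[idx] = sbox[state[idx]]
    return bytes(out)
```

The result is independent of the order; only the time at which each byte leaks is randomized, which is exactly why shuffling raises the attack complexity without preventing the attack.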

Hardware. Related to the shuffling of S-box calculations, a hardware PRNG implementation that
performs all 16 S-boxes in parallel does not allow the two equal-hypothesis constants in each
column to be distinguished. More importantly, when the 128-bit state is being operated on in parallel, the
signal-to-noise ratio is a lot smaller, since the leakage of one byte (the signal) only corresponds to
approximately one sixteenth of the measurement (not including the noise) [MOP07]. Furthermore,
in an unrolled implementation, it is difficult to separate the power measurements of the 128-bit
states of different rounds, even if the device is sampled at a very high rate. In other words, the SNR
of such a hardware implementation would be much worse than for our software implementations
and it is unlikely that even a regular CPA attack (with knowledge of the plaintext) would succeed
with only 256 traces.

Real-world Crypto. Despite the NIST recommendations, we found two commercial CTR_DRBG
implementations which do put a proper limit on the request size. In the open-source mbed TLS
library [arm], we can see that the maximum number of requested bytes per CTR_DRBG call is
1024, which is equivalent to only 64 encryptions with AES-128 in CTR mode. Even more secure is

a CTR_DRBG implementation by Texas Instruments [Ins], which puts the limit at 2^11 bits, or
equivalently only 16 AES-CTR encryptions.

5.4.2 Recommendations
Types of counters. As previously mentioned, the attack is not prevented if the counter field
starts in a byte X_{1,j*} other than the least significant byte X_{1,15}. The first round of the attack then
simply recovers Z_{1,j*}, which propagates to a different column in step 2, but does not change the
overall approach or complexity. On the other hand, the NIST document on modes of operation
also suggests the possibility to use an LFSR as incrementing function in AES-CTR, as long as its
period is sufficiently long [Dwo01]. A good choice of LFSR would update bits which are spread over
the entire 128-bit AES state, rather than just one byte, thereby preventing the divide-and-conquer
approach that enables targeting key bytes in side-channel attacks. In contrast with a normal
incrementing counter, it is thus possible for a CTR_DRBG based on an LFSR to resist our attack,
even if 4 096 traces are available.
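For illustration, a minimal sketch of such an LFSR-based incrementing function (our own example; [Dwo01] does not mandate a specific LFSR): a 128-bit Galois LFSR with the feedback polynomial x^128 + x^7 + x^2 + x + 1, which is irreducible, so the state never collapses to zero; the period should still be verified to be sufficiently long before use.

```python
# 128-bit Galois LFSR step; feedback polynomial x^128 + x^7 + x^2 + x + 1
# (reduction constant 0x87, as used in GCM/XTS). Irreducible, hence a large
# period for any non-zero state; verify the exact period before real use.
MASK128 = (1 << 128) - 1

def lfsr128_step(state):
    """One LFSR step; state is an int in [1, 2^128)."""
    msb = state >> 127
    state = (state << 1) & MASK128
    if msb:
        state ^= 0x87
    return state
```

Because successive states differ in bits spread across the whole 128-bit block, the first-round divide-and-conquer on a single counter byte no longer applies.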

Size of Requests. Each new request to the NIST CTR_DRBG results in the computation of
AES-CTR with a different nonce and key, because of the CTR_DRBG_Update function, which
is performed with every randomness request. Hence, when using such a PRNG for a masked
implementation, it would clearly be less secure to perform a single (large) request for all the
random bits needed in a masked AES than to perform multiple (small) requests, such as one for
each masked S-box evaluation separately. Even worse would of course be to not follow the NIST
recommendations and to keep using the same key across PRNG requests.
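The effect of the update can be sketched as follows (a simplified model of the CTR_DRBG generate request; the block cipher is replaced by a hash-based stand-in, and the derivation function and reseed logic are omitted):

```python
import hashlib

BLOCK = 16  # AES block size in bytes

def _block(key, counter):
    # Stand-in for AES-128_key(counter); NOT a real block cipher.
    return hashlib.sha256(key + counter.to_bytes(BLOCK, "big")).digest()[:BLOCK]

def _update(key, v):
    """CTR_DRBG_Update (simplified): derive a fresh key and counter V."""
    new_key = _block(key, (v + 1) % (1 << 128))
    new_v = int.from_bytes(_block(key, (v + 2) % (1 << 128)), "big")
    return new_key, new_v

def generate(state, nbytes):
    """One request: run CTR mode for nbytes, then rekey via _update."""
    key, v = state
    out = b""
    while len(out) < nbytes:
        v = (v + 1) % (1 << 128)
        out += _block(key, v)
    return out[:nbytes], _update(key, v)

# Two small requests: each runs CTR mode under a different (key, V) pair,
# so an attacker collects at most a few traces per key.
state = (b"\x00" * 16, 0)
r1, state = generate(state, 16)
r2, state = generate(state, 16)
```

Since every request ends with an update, many small requests expose each (key, V) pair to only a handful of traces, while one large request exposes a single pair to all of them.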

Conclusion for masked implementations. Through this section, we also want to start the discus-
sion on whether PRNGs for masked implementations need their own side-channel protection, a
question which is often sidestepped due to its “chicken-and-egg” character. Indeed, protecting a
PRNG used for masked implementations against side-channel attacks, would require its own fresh
randomness source in return. However, by keeping these observations and recommendations in
mind and ensuring that the request sizes are sufficiently limited, this attack against AES-CTR can
be avoided. For other modes of operation, it appears that the blind SCA can also be avoided on real
devices if the number of available traces is limited. Under the assumption that the state (nonce and
key) of the CTR_DRBG is not known to the side-channel adversary, it does not seem like masking
the PRNG is necessary. An investigation of other PRNG constructions and attacks against them is
an interesting direction for future research.

6 Conclusion
In this work, we demonstrated an attack on AES-CTR mode with unknown key and nonce in only
256 traces, a significant improvement over the previous attack by Jaffe [Jaf07]. Most importantly,
this number of traces shows that a CTR_DRBG following the NIST specification can be vulnerable
to this attack, as it currently allows adversaries to obtain as many as 4 096 traces of a CTR_DRBG
performing AES-CTR. We demonstrated the feasibility of our attack on several real devices such as
a Cortex-M4 and make our implementations openly available for reproducibility.
We explored alternative methods such as DDLA for misaligned traces and blind SCA, which
does not require the CTR-mode assumption.
We start the discussion on PRNGs for masked implementations (i.e. a PRNG for which an
adversary can only observe the power consumption), a topic for which very little research is available.
Using the observations from this attack, we can conclude that masking should not be necessary
for such a PRNG, provided it is used correctly, e.g. by limiting the request size and updating the
state between requests. The question remains whether a construction as large as AES-CTR
is necessary in this adversary model. An investigation of various PRNG constructions and their
security in this context is an interesting direction for future work.

Acknowledgements
The author would like to thank Josep Balasch, Arthur Beckers, Begül Bilgin, Vincent Rijmen and
Lennert Wouters. The author is funded by a PhD fellowship of the Fund for Scientific Research -
Flanders (FWO).

References
[arm] Arm. Mbed TLS. https://tls.mbed.org/api/ctr__drbg_8h.html#
a5b787e6157d91055d7c07d40f519cf52.

[BCO04] Eric Brier, Christophe Clavier, and Francis Olivier. Correlation power analysis with
a leakage model. In Marc Joye and Jean-Jacques Quisquater, editors, Cryptographic
Hardware and Embedded Systems - CHES 2004: 6th International Workshop Cambridge,
MA, USA, August 11-13, 2004. Proceedings, volume 3156 of Lecture Notes in Computer
Science, pages 16–29. Springer, 2004.

[BK15] Elaine Barker and John Kelsey. Recommendations for random number generation using
deterministic random bit generators. NIST SP 800-90A Rev. 1, June 2015.
[CR17] Christophe Clavier and Léo Reynaud. Improved blind side-channel analysis by exploita-
tion of joint distributions of leakages. In Wieland Fischer and Naofumi Homma, editors,
Cryptographic Hardware and Embedded Systems - CHES 2017 - 19th International Con-
ference, Taipei, Taiwan, September 25-28, 2017, Proceedings, volume 10529 of Lecture
Notes in Computer Science, pages 24–44. Springer, 2017.
[Dwo01] Morris Dworkin. Recommendation for block cipher modes of operation: Methods and
techniques. NIST SP 800-38A, December 2001.
[GBTP08] Benedikt Gierlichs, Lejla Batina, Pim Tuyls, and Bart Preneel. Mutual information
analysis. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and
Embedded Systems - CHES 2008, 10th International Workshop, Washington, D.C., USA,
August 10-13, 2008. Proceedings, volume 5154 of Lecture Notes in Computer Science,
pages 426–442. Springer, 2008.
[Ins] Texas Instruments. Random number generation using MSP430FR59xx and MSP430FR69xx
microcontrollers. http://www.ti.com/lit/an/slaa725/slaa725.pdf.

[Jaf07] Joshua Jaffe. A first-order DPA attack against AES in counter mode with unknown
initial counter. In Pascal Paillier and Ingrid Verbauwhede, editors, Cryptographic
Hardware and Embedded Systems - CHES 2007, 9th International Workshop, Vienna,
Austria, September 10-13, 2007, Proceedings, volume 4727 of Lecture Notes in Computer
Science, pages 1–13. Springer, 2007.

[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential power analysis. In
Michael J. Wiener, editor, Advances in Cryptology - CRYPTO ’99, 19th Annual Inter-
national Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999,
Proceedings, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer,
1999.

[LDL14] Yanis Linge, Cécile Dumas, and Sophie Lambert-Lacroix. Using the joint distributions
of a cryptographic function in side channel analysis. In Emmanuel Prouff, editor,
Constructive Side-Channel Analysis and Secure Design - 5th International Workshop,
COSADE 2014, Paris, France, April 13-15, 2014. Revised Selected Papers, volume 8622
of Lecture Notes in Computer Science, pages 199–213. Springer, 2014.

[MOP07] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power analysis attacks - revealing
the secrets of smart cards. Springer, 2007.
[Tim19] Benjamin Timon. Non-profiled deep learning-based side-channel attacks with sensitivity
analysis. IACR Trans. Cryptogr. Hardw. Embed. Syst., 2019(2):107–131, 2019.
CAPA: The Spirit of Beaver
against Physical Attacks

Publication Data

Oscar Reparaz, Lauren De Meyer, Begül Bilgin, Victor Arribas, Svetla Nikova,
Ventzislav Nikov, and Nigel P. Smart. CAPA: The Spirit of Beaver against
Physical Attacks. In Advances in Cryptology - CRYPTO 2018 - 38th Annual
International Cryptology Conference, Santa Barbara, CA, USA, August 19-23,
2018, Proceedings Part I (eds. H. Shacham and A. Boldyreva), Lecture Notes
in Computer Science, vol. 10991, pages 121-151. Springer, 2018.

My Contribution

One of the main authors. The software implementation was done by Oscar Reparaz.


CAPA:
The Spirit of Beaver against Physical Attacks
Oscar Reparaz^{1,2}, Lauren De Meyer^1, Begül Bilgin^1, Victor Arribas^1,
Svetla Nikova^1, Ventzislav Nikov^3 and Nigel Smart^{1,4}

^1 KU Leuven, imec - COSIC, Leuven, Belgium, firstname.lastname@esat.kuleuven.be
^2 Square Inc., San Francisco, USA, oreparaz@gmail.com
^3 NXP Semiconductors, Leuven, Belgium, venci.nikov@gmail.com
^4 University of Bristol, Bristol, UK

Abstract. In this paper we introduce two things. First, we introduce the tile-probe-and-fault
model, a model generalising the wire-probe model of Ishai et al., extending it to cover both more
realistic side-channel leakage scenarios on a chip and fault and combined attacks. Second, we
introduce CAPA: a combined Countermeasure Against Physical Attacks. Our countermeasure is
motivated by our model and aims to provide security against higher-order SCA, multiple-shot FA
and combined attacks. The tile-probe-and-fault model leads one naturally to look (by analogy)
at actively secure multi-party computation protocols. Indeed, CAPA draws much inspiration
from the MPC protocol SPDZ. To demonstrate that the model and the CAPA countermeasure
are not just theoretical constructions, but could also serve to build practical countermeasures,
we present initial experiments of proof-of-concept designs using the CAPA methodology: namely,
a hardware implementation of the KATAN and AES block ciphers, as well as a software bitsliced
AES S-box implementation. We demonstrate experimentally that the design can resist second-order
DPA attacks, even when the attacker is presented with many hundreds of thousands of traces. In
addition, our proof-of-concept can also detect faults within our model with high probability, in
accordance with the methodology.

1 Introduction
Side-channel analysis attacks (SCA) [41] are cheap and scalable methods to extract secrets, such
as cryptographic keys or passwords, from embedded electronic devices. They exploit unintended
signals (such as the instantaneous power consumption [42] or the electromagnetic radiation [24])
stemming from a cryptographic implementation. In the last twenty years, plenty of countermeasures
to mitigate the impact of side-channel information have been developed. Masking [15, 26] is an
established solution that stands out as a provably secure yet practically useful countermeasure.
Fault analysis (FA) is another relevant attack vector for embedded cryptography. The basic
principle is to disturb the cryptographic computation somehow (for example, by under-powering
the cryptographic device, or by careful illumination of certain areas in the silicon die). The result
of a faulty computation can reveal a wealth of secret information: in the case of RSA or AES, a
single faulty ciphertext pair makes key recovery possible [10, 48]. Countermeasures are essentially
based on adding some redundancy to the computation (in space or time). In contrast to masking,
the countermeasures for fault analysis are mostly heuristic and lack a formal background.
However, there is a tension between side-channel countermeasures and fault analysis counter-
measures. On the one hand, fault analysis countermeasures require redundancy, which can give out
more leakage information to an adversary. On the other hand, a device that implements first-order
masking offers an adversary double the attack surface to insert a fault in the computation. A duality
relation between SCA and FA was pointed out in [23]. There is clearly a need for a combined
countermeasure that tackles both problems simultaneously.
In this work we introduce a new attack model to capture this combined attack surface which we
call the tile-probe-and-fault model. This model naturally extends the wire-probe model of [34]. In

the wire-probe model individual wires of a circuit may be targeted for probing. The goal is then
to protect against a certain fixed set of wire-probes. In our model, inspired by modern processor
designs, we allow whole areas (or tiles) to be probed, and in addition we add the possibility of the
attacker inducing faults on such tiles.
Protection against attacks in the wire-probe model is usually done via masking; which is in many
cases the extension of ideas from passively secure secret sharing based Multi-Party Computation
(MPC) to the side-channel domain. It is then natural to look at actively secure MPC protocols
for the extension to fault attacks. The most successful modern actively secure MPC protocols are
in the SPDZ family [20]. These use a pre-processing or preparation phase to produce so called
Beaver triples, named after Beaver [6]. These auxiliary data values, which will be explained later,
are prepared either before a computation, or in a just-in-time manner, so as to enable an efficient
protocol to be executed. This use of prepared Beaver triples also explains, partially, the naming of
our system, CAPA (a Combined countermeasure Against Physical Attacks), since Capa is also the
beaver spirit in Lakota mythology. In this mythology, Capa is the lord of domesticity, labour and
preparation.
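For illustration, a passively secure Beaver-triple multiplication over GF(2) can be sketched as follows (our own toy code; the actual SPDZ and CAPA constructions additionally attach MAC tags to every share):

```python
import secrets

def share(bit, d=3):
    """Split a bit into d Boolean shares whose XOR equals the bit."""
    sh = [secrets.randbelow(2) for _ in range(d - 1)]
    sh.append(bit ^ (sum(sh) % 2))
    return sh

def unshare(sh):
    return sum(sh) % 2

def fresh_triple(d=3):
    """Preprocessed Beaver triple: sharings of x, y and z = x AND y."""
    x, y = secrets.randbelow(2), secrets.randbelow(2)
    return share(x, d), share(y, d), share(x & y, d)

def beaver_and(a_sh, b_sh, triple):
    """Shared AND of a and b using one preprocessed Beaver triple."""
    x_sh, y_sh, z_sh = triple
    # eps = a^x and delta = b^y are blinded by the triple, so they may be
    # opened publicly; here "opening" is simulated by reconstructing them.
    eps = unshare(a_sh) ^ unshare(x_sh)
    delta = unshare(b_sh) ^ unshare(y_sh)
    c_sh = [zi ^ (eps & yi) ^ (delta & xi)
            for xi, yi, zi in zip(x_sh, y_sh, z_sh)]
    c_sh[0] ^= eps & delta  # public correction term, added to one share
    return c_sh
```

The identity behind it: ab = (x ⊕ ε)(y ⊕ δ) = z ⊕ εy ⊕ δx ⊕ εδ, so each party only ever combines its own shares with the public values ε and δ.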

1.1 Previous Work


Fault Attack Models and Countermeasures: Fault models typically describe the characterization
of an attacker's ability. That is, the fault model is constructed as a combination of the following:
the precision of the fault location and time, the number of affected bits which highly depends on the
architecture, the effect of the fault (flip/set/reset/random) and its duration (transient/permanent).
Moreover, the fault can target the clock or power line, storage units, combinational or control logic.
When it comes to countermeasures, one distinguishes between protection of the algorithm on the
one hand and protection of the device itself by using, for example, active or passive shields on the
other. No countermeasure provides perfect security at a finite cost; it is the designer’s responsibility
to strive for a balance between high-level (algorithmic) countermeasures and low-level ones that
work at the circuit level and complement each other. In this paper, we discuss the former.
One algorithmic technique is to replicate the calculation m times in either time or space and
only complete if all executions return the same result [54]. This countermeasure has the important
caveat that there are conceptually simple attacks, such as m identical fault injections in each
execution, that break the implementation with probability one. However, it should be stated that
these attacks are not trivial to mount in practice when the redundancy is in space.
A second method is to use an error correcting or detecting code [8, 12, 13, 32, 35, 36, 37, 38, 39, 46].
This means one performs all calculations on both data and checksum. A drawback is that error
correcting/detecting codes only work in environments in which errors are randomly generated, as
opposed to maliciously generated. Thus, a skilled attacker may be able to carefully craft a fault
that results in a valid codeword and is thus not detected. A detailed cost comparison between error
detection codes and doubling is given in [44].
Another approach is that of infective computation [25, 43], where any fault injected will affect
the ciphertext in a way that no secret information can be extracted from it. This method ensures
the ciphertext can always be returned without the need for integrity checks. While infective methods
are very efficient, the schemes proposed so far have all been broken [5].

Side-Channel Attack Models and Countermeasures: A side-channel adversary typically uses the
noisy leakage model [55], where side-channel analysis (SCA) attacks are bounded by the statistical
moment of the attack due to a limited number of traces and noisy leakages. Given enough noise
and an independent leakage assumption for each wire, this model, when limited to the t-th-order
statistical moment, is shown to be comparable to the t-probing model introduced in [34], where an
attacker is allowed to probe, receive and combine the noiseless information about t wires within a
time period [21]. Finally, it has been shown in [4] that a (semi-)parallel implementation is secure in
the t-th-order bounded moment model if its complete serialization is secure in the t-probing model.
While the countermeasures against fault attacks are limited to resist only a small subset of the
real-world adversaries and attack models, protection against side-channel attacks stands on much
more rigorous grounds and generally scales well with the attacker’s powers. A traditional solution
is to use masking schemes [9, 29, 34, 51, 56, 58, 59] to implement a function in a manner in which
higher-order SCA is needed to extract any secret information, i.e. the attacker must exploit the
joint leakage of several intermediate values. Masking schemes are analogues of the passively secure
threshold MPC protocols based on secret sharing. One can thus justify their defence by appealing

to the standard MPC literature. In MPC, a number of parties can evaluate a function on shared
data, even in the presence of adversaries amongst the computing parties. The maximum number of
dishonest parties which can be tolerated is called the threshold. In an embedded scenario, the basic
idea is that different parts of a chip simulate the parties in an MPC protocol.

Combining Faults and Side-Channels Models and Countermeasures. The importance of combined
countermeasures becomes more apparent as attacks such as [2] show the feasibility of combined
attacks. Being a relatively new threat, combined adversarial models lack a joint description and are
typically limited to the combination of a certain side-channel model and a fault model independently.
One possible countermeasure against combined attacks is found in leakage resilient schemes [45],
although none of these constructions provide provable security against FA. Typical leakage resilient
schemes rely on a relatively simple and easy to protect key derivation function in order to update
the key that is used by the cryptographic algorithm within short periods. That is, a leakage resilient
scheme acts as a specific “mode of operation”. Thus, it cannot be a drop-in replacement for a
standard primitive such as the AES block cipher. The aforementioned period can be as short as
one encryption per key in order to eliminate fault attacks completely. However, the synchronization
burden this countermeasure brings makes it difficult to integrate with deployed protocols.
There are a couple of alternative countermeasures proposed for embedded systems in recent years.
In private circuits II [16, 33], the authors use redundancy on top of a circuit that already resists SCA
(private circuits I [34]) to add protection against FA. In ParTI [62], threshold implementations (TI)
are combined with concurrent error detection techniques. ParTI naturally inherits the drawbacks
of using an error correction/detection code. Moreover, the detectable faults are limited in Hamming
weight due to the choice of the code. Finally, in [63], infective computation is combined with error-
preserving computation to obtain a side-channel and fault resistant scheme. However, combined
attacks are not taken into account.
Given the above introduction, it is clear that both combined attack models and countermeasures
are not mature enough to cover a significant part of the attack surface.

Actively Secure MPC. Modern MPC protocols obtain active security, i.e. security against
malicious parties which can actively deviate from the protocol. By mapping such protocols to
the on-chip side-channel countermeasures, we would be able to protect against an eavesdropping
adversary that inserts faults into a subset of the simulated parties. An example of a practical attack
that fits this model is the combined attack of Amiel et al. [2]. We place defences against faults on
the same theoretical basis as defences against side-channels.
To obtain maliciously secure MPC protocols in the secret-sharing model, there are a number of
approaches. The traditional approach is to use Verifiable Secret Sharing (VSS), which works in the
information theoretic model and requires that strictly less than n/3 parties can be corrupt. The
modern approach, adopted by protocols such as BDOZ, SPDZ, Tiny-OT, MASCOT etc. [7, 20,
40, 50], is to work in a full threshold setting (i.e. all but one party can be corrupted) and attach
information theoretic MACs to each data item. This approach turns out to be very efficient in the
MPC setting, apart from its usage of public-key primitives. The computational efficiency of the use
of information theoretic MACs and the active adversarial model of SPDZ lead us to adopt this
philosophy.

1.2 Our Contributions


Our contributions are threefold. We first introduce the tile-probe-and-fault model, a new adversary
model for physical attacks on embedded systems. We then use the analogy between masking and
MPC to provide a methodology, which we call CAPA, to protect against such a tile-probe-and-fault
attacker. Finally, we illustrate that the CAPA methodology can be prototyped by describing specific
instantiations of the CAPA methodology, and our experimental results.

Tile-probe-and-fault model. We introduce a new adversary model that expands on the wire-probe
model and brings it closer to real-world processor designs. Our model is set in an architecture
that mimics the actively secure MPC setting that inspires our countermeasures (see Figure 1).
Instead of individual wires at the foundation of the model, we visualize a separation of the chip
(integrated circuit) into areas or tiles, consisting of not only many wires, but also complete blocks
of combinational and sequential logic. Such tiled designs are inherent in many modern processor

architectures, where the tiles correspond to “cores” and the wires correspond to the on-chip
interconnect. This can easily be related to a standard MPC architecture where each tile behaves
like a separate party. The main difference between our architecture and the MPC setting is that in
the latter, parties are assumed to be connected by a complete network of authenticated channels.
In our architecture, we know exactly how the wires are connected in the circuit.

Figure 1: Partition of the integrated circuit area into tiles, implementing MPC “parties”

The tile architecture satisfies the independent leakage assumption [21] amongst tiles. That
is, leakage is local and thus observing the behaviour of a tile by means of probing, faulting or
observing its side-channel leakage, does not give unintended information about another tile through,
for example, coupling.
As the name implies, the adversary in our model exploits side-channels and introduces faults.
We stress that our goal is to detect faults as opposed to tolerate or correct them. That is, if an
adversary interjects a fault, we want our system to abort without revealing any of the underlying
secrets.

CAPA Methodology. We introduce CAPA, a countermeasure against the tile-probe-and-fault
attacker. CAPA inherits theoretical aspects of the MPC protocol SPDZ [20] by similarly computing
on shared values, along with corresponding shared MAC tags. The former prevents the adversary
from learning sensitive values, while the latter allows for detection of any faults introduced.
Moreover, having originated from the MPC protocol SPDZ, CAPA is the first countermeasure with
provable security against combined attacks. The methodology can be scaled to achieve an arbitrary
fault detection probability and is suitable for implementation in both hardware and software.
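The value-plus-tag idea can be sketched as follows (a simplified model of our own over a prime field; CAPA itself works with shares and tags in GF(2^k)): each sensitive value x is shared together with a sharing of the tag α·x for a secret MAC key α, and any additive fault on the shares is detected except with negligible probability.

```python
import secrets

P = (1 << 61) - 1  # a prime field as a stand-in for CAPA's GF(2^k)

def share(v, d=3):
    """Additive sharing of v over the field."""
    sh = [secrets.randbelow(P) for _ in range(d - 1)]
    sh.append((v - sum(sh)) % P)
    return sh

def mac_share(v, alpha, d=3):
    """Share the value v together with its information-theoretic tag alpha*v."""
    return share(v % P, d), share(alpha * v % P, d)

def check(v_sh, tag_sh, alpha):
    """Abort-on-fault check: the opened tag must equal alpha * opened value."""
    v = sum(v_sh) % P
    tag = sum(tag_sh) % P
    return tag == alpha * v % P
```

A fault that adds δ ≠ 0 to the shared value passes the check only if the attacker also adjusts the tag by α·δ, which requires knowing the secret α; the forgery probability is 1/P.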

Experimental Results. We provide examples of CAPA designs in hardware of the KATAN and
AES block ciphers as well as a software bitsliced implementation of the AES S-box. Our designs
show that our methodology is feasible to implement, and in addition our attack experiments
confirm our theoretical claims. For example, we implemented a second-order secure hardware
implementation of KATAN on a Spartan-6 FPGA and performed a non-specific leakage detection
test, which does not show evidence of first- or second-order leakage with up to 100 million traces.
Furthermore, we deploy a second-order secure software based CAPA implementation of the AES
S-box on an ARM Cortex-M4 and take electromagnetic measurements; for this implementation
neither first- nor second-order leakage is detected with up to 200 000 traces. Using toy parameters,
we verify our claimed fault detection probability for the AES S-box software implementation. It
should be noted that our experimental implementations are to be considered only proof-of-concept;
they are currently too expensive to be used in practice. But the designs demonstrate that the
overall methodology can provide significant side-channel and fault protection, and they provide a
benchmark against which future improvements can be measured.

2 The Tile-Probe-and-Fault Model


The purpose of this section is to introduce a new adversarial model on which our security guarantees
are based. This model is strictly more powerful than the traditional DPA or DFA models.

Tile Architecture. Consider a partition of the chip into a number of tiles Ti, with wires running
between each pair of tiles as shown in Figure 1. We call the set of all tiles T. Each tile Ti ∈ T
possesses its own combinational logic, control logic (or program code) and pseudo-random number
generator needed for the calculations of one share. In the abstract setting, we consider each tile
as the set composed of all input and intermediate values on the wires and memory elements of
those blocks. A probe-and-fault attacker may obtain, for a given subset of tiles, all the internal
information at given time intervals on this set of tiles. He may also inject faults (known or random)
into each tile in this set.
In our model, each sensitive variable is split into d shares through secret sharing. Without loss of
generality, we use Boolean sharing in this paper.
We define each tile such that it stores and manipulates at most one share of each intermediate
variable. Any wire running from one tile to another carries only blinded versions of a sensitive
variable's share used by Ti. We make minimal assumptions on the security of these wires. Instead,
we include all the information on the unidirectional wires in Figure 1 in the tile on the receiving
end and not the sending end. We thus assume only one tile is affected by an integrity failure of a wire.
We assume without loss of generality that shared calculations are performed in parallel. The
redundancy of intermediate variables and logic makes the tiles completely independent apart from
the communication through wires.

Probes. Throughout this work, we assume a powerful dp-probing adversary where we give an
attacker information about all intermediate values possessed by dp specified tiles, i.e. ∪_{i∈{i1,...,idp}} Ti.
The attacker obtains all the intermediate values on the tile (such as internal wire and register
values) with probability one and obtains these values from the start of the computation until the
end. Note that this is stronger than both the standard t-probing adversary, which gives access to
only t intermediate values within a certain amount of time [34], and the ε-probing adversary, where
the information about t intermediate values is gained with a certain probability. In our dp-probing
model, the adversary gets information from n intermediate values from dp tiles, where n ≫ dp.
Therefore, our dp-probing model is more generic and covers realistic scenarios, including an attacker
with a limited number of EM probes which enable observation of multiple intermediate values
simultaneously within arbitrarily close proximity on the chip.

Faults. We also consider two types of fault models. Firstly, a df-faulting adversary which can
induce chosen-value faults in any number of intermediate bits/values within df tiles, i.e. from the
set ∪_{i∈{i1,...,idf}} Ti. These faults either flip the intermediate values with a pre-calculated
(adversarially chosen) offset or set the intermediate values to a chosen fixed value. In particular,
the faults are not limited in Hamming weight. One can relate this type of fault to, for example,
very accurate laser injections.
Secondly, we consider an ε-faulting adversary which is able to insert a random-value fault in any
variable belonging to any party. This is a somewhat new MPC model, and essentially means that
all parties are randomly corrupted. The ε-adversary may inject the random-value fault according
to some distribution (for example, flip each bit with a certain probability), but he cannot set all
intermediates to a chosen fixed value. This adversary is different from the df-faulting adversary.
One can relate the ε-faulting adversary to a certain class of non-localised EM attacks.

Time periods. We assume a notion of time periods, where the period length is at least one clock
cycle. We require that a df-fault to an adversarially chosen value cannot be preceded by a probe
within the same time period. Thus adversarial faults can only depend on values from previous time
periods. This time restriction is justified by practical experimental constraints: the time period is
naturally upper bounded by the time it takes to set up such a specific laser injection.

Adversarial Models. Given the aforementioned definitions, we consider on the one hand an active
adversary A1 with both dp -probing and df -faulting capabilities simultaneously. We define P1 as the
set of up to dp tiles that can be probed and F1 as the set of up to df tiles that can be faulted by A1 .

CAPA: The Spirit of Beaver against Physical Attacks 211

Since each tile potentially sees a different share of a variable and we use a d-sharing for each
variable, we constrain the attack surface (the sets of adversarially probed and potentially modified
tiles) as follows:

(F1 ∪ P1 ) ⊆ ∪_{j=1}^{d−1} T_{i_j}

The constraint implies that at least one share remains unaccessed/honest and thus |F1 ∪ P1 | ≤ d − 1.
Within those d − 1 tiles, the adversary can probe and fault arbitrarily many wires, including the
wires arriving at each tile. The adversary’s df -faulting capabilities are limited in time by our
definition of time periods, which implies that any df -fault cannot be preceded by another probe
within the same time period.
We also consider an active adversary A2 that has dp -probing and -faulting capabilities simulta-
neously. In this case, the constraint on the set of probed tiles P2 remains the same:

P2 ⊆ ∪_{j=1}^{d−1} T_{i_j}

but the set of faulted tiles is no longer constrained:

F2 ⊆ T

Moreover, as -faults do not require the same set-up time as df -faults, they are not limited in
time. Note that -faults do not correspond to a standard adversary model in the MPC literature;
thus this part of our model is very much an aspect of our side-channel and fault analysis focus.
A rough equivalent model in the MPC literature would be for an honest-but-curious adversary
who is able to replace the share or MAC values of honest players with values selected from a given
random distribution. Whilst such an attack makes sense in the hardware model we consider, in
the traditional MPC literature this model is of no interest due to the supposed isolated nature of
computing parties.
As our constructions are based on MPC protocols which are statically secure, we make the same
assumptions in our tile-probe-and-fault model, i.e. the selection of tiles attacked must be fixed
beforehand and cannot depend on information gathered during the computation. This model reflects
realistic attackers, since it is infeasible to move a probe or a laser during a computation with today's
resources. We thus assume that both adversaries A1 and A2 are static.

3 The CAPA Design


The CAPA methodology consists of two stages. A preprocessing step generates auxiliary data,
which is used to perform the actual cryptographic operation in the evaluation step. We first present
some notation, then the building blocks for the main evaluation, and finally the preprocessing
components.

Notation. Although our approach generalizes to any finite field, in this paper we work over a field Fq
with characteristic 2, for example GF(2^k) for a given k, as this is sufficient for application to most
symmetric ciphers. We use · and + to describe multiplication and addition in Fq respectively. We
use upper case letters for constants. The lower case letters x, y, z are reserved for the variables
used only in the evaluation stage (e.g. sensitive variables) whereas a, b, c, . . . represent auxiliary
variables generated from randomness in the preprocessing stage. The Kronecker delta function is
denoted by δi,j . We use L(.) to denote an additively homomorphic function and A(.) = C + L(.)
with C some constant.

Information Theoretic MAC Tags and the MAC Key α. We represent a value a ∈ Fq (similarly
x ∈ Fq ) as a pair ⟨~a⟩ = (~a, τ~a ) of data and tag shares in the masked domain. The data shares
~a = (a1 , . . . , ad ) satisfy Σi ai = a. For each a ∈ Fq , there exists a corresponding MAC tag τ a
computed as τ a = α · a, where α is a MAC key, which is secret-shared amongst the tiles as α = Σi αi .
Analogously to the data, the MAC tag is shared as τ~a = (τ1a , . . . , τda ), such that it satisfies
Σi τia = τ a , but the MAC key itself does not carry a tag. Depending on a security parameter m,
there can be m independent MAC keys α[j] ∈ Fq for j ∈ {1, . . . , m}. In that case, α as well as τ a
are in Fq^m and the tag shares satisfy Σi τia [j] = τ a [j] = α[j] · a, ∀j ∈ {1, . . . , m}. Further we
assume m = 1 unless otherwise mentioned.
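To make this representation concrete, the following minimal Python sketch builds such a pair ⟨~a⟩ = (~a, τ~a) over GF(2^8) with d = 3 shares and m = 1 MAC key. All helper names are ours, and the reconstructions at the end exist only to check the invariants; a real CAPA implementation never assembles all shares in one place.

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial x^8+x^4+x^3+x+1; addition in GF(2^8) is XOR

def gf_mul(a, b):
    """Multiplication in GF(2^8), reducing modulo P."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    """Additive d-sharing: the shares XOR to v."""
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def authenticate(v, alpha_shares):
    """Pair of data shares and tag shares with tag tau = alpha * v."""
    d = len(alpha_shares)
    tau = gf_mul(xor_sum(alpha_shares), v)  # computed here for checking only
    return share(v, d), share(tau, d)

d = 3
alpha_shares = share(secrets.randbelow(256), d)  # shared MAC key alpha
a_shares, tau_shares = authenticate(0x57, alpha_shares)

# Invariants: data shares sum to a, tag shares sum to alpha * a.
assert xor_sum(a_shares) == 0x57
assert xor_sum(tau_shares) == gf_mul(xor_sum(alpha_shares), 0x57)
```

With m > 1 one would simply keep m independent `alpha_shares` lists and m tag sharings per value.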

3.1 Evaluation Stage


We let each tile Ti hold the ith share of each sensitive and auxiliary variable (xi , . . . , ai , . . .) and the
MAC key share αi . We first describe operations that do not require communication between tiles.

Addition. To compute the addition (~z, τ~z ) of (~x, τ~x ) and (~y , τ~y ), each tile performs local addition
of its data shares, zi = xi + yi , and its tag shares, τiz = τix + τiy . When one operand is public
(for example, a cipher constant C ∈ Fq ), the sum can be computed locally as zi = xi + C · δi,1 for
value shares and τiz = τix + C · αi for tag shares.

Multiplication by a Public Constant. Given a public constant C ∈ Fq , the multiplication (~z, τ~z )
of (~x, τ~x ) and C is obtained locally by setting zi = C · xi and τiz = C · τix .
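These local operations can be sketched as follows (Python, GF(2^8), m = 1; helper names are ours, and shares are recombined at the end only to verify that each operation preserves the data sharing and the tag relation τz = α · z):

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    """Multiplication in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def add(xs, txs, ys, tys):
    """zi = xi + yi, tau_i^z = tau_i^x + tau_i^y (all local)."""
    return [x ^ y for x, y in zip(xs, ys)], [tx ^ ty for tx, ty in zip(txs, tys)]

def add_public(xs, txs, C, alpha_shares):
    """zi = xi + C*delta_{i,1}, tau_i^z = tau_i^x + C*alpha_i."""
    return ([xs[0] ^ C] + xs[1:],
            [tx ^ gf_mul(C, ai) for tx, ai in zip(txs, alpha_shares)])

def mul_public(xs, txs, C):
    """zi = C*xi, tau_i^z = C*tau_i^x."""
    return [gf_mul(C, x) for x in xs], [gf_mul(C, tx) for tx in txs]

d = 3
alpha_shares = share(secrets.randbelow(256), d)
alpha = xor_sum(alpha_shares)          # reconstructed only to check invariants
x, y, C = 0x12, 0x34, 0x63
xs, txs = share(x, d), share(gf_mul(alpha, x), d)
ys, tys = share(y, d), share(gf_mul(alpha, y), d)

for zs, tzs, z in (add(xs, txs, ys, tys) + (x ^ y,),
                   add_public(xs, txs, C, alpha_shares) + (x ^ C,),
                   mul_public(xs, txs, C) + (gf_mul(C, x),)):
    assert xor_sum(zs) == z and xor_sum(tzs) == gf_mul(alpha, z)
```

The tag relations follow from linearity of the MAC: α · (x + C) = τx + C · α and α · (C · x) = C · τx.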

The following operations, on the other hand, require auxiliary data generated in a preprocessing
stage and also communication between the tiles.

Multiplication. Multiplication of (~x, τ~x ) and (~y , τ~y ) requires as auxiliary data a Beaver triple
(⟨~a⟩, ⟨~b⟩, ⟨~c⟩), which satisfies c = a · b, for random a and b. The multiplication itself is performed in
four steps.

• Step A. In the blinding step, each tile Ti computes locally a randomized version of its share
of the secret: εi = xi + ai and ηi = yi + bi .

• Step B. In the partial unmasking step, each tile Ti broadcasts its own shares εi and ηi to the
other tiles, such that each tile can construct and store locally the values ε = Σi εi and η = Σi ηi .
The value ε (resp. η) is the partial unmasking of (~ε, τ~ε ) (resp. (~η , τ~η )), i.e. the value ε (resp.
η) is unmasked but its tag τ~ε (τ~η ) remains shared. These values are blinded versions of the
secrets x and y and can therefore be made public.

• Step C. In the MAC-tag checking step, the tiles check whether the tags τ~ε (τ~η ) are consistent
with the public values ε and η, using a method which we will explain later in this section.

• Step D. In the Beaver computation step, each tile locally computes

zi = ci + ε · bi + η · ai + ε · η · δi,1
τiz = τic + ε · τib + η · τia + ε · η · αi .

It can be seen easily that the sharing (~z, τ~z ) corresponds to z = x · y unless faults occurred. Steps B
and C are the only steps that require communication among tiles. Steps A and D are completely
local. Note that to avoid leaking information on the sensitive data x and y, the shares εi and ηi
must be synchronized using memory elements after step A, before being released to other tiles in
step B. Moreover, we remark that step C does not require the result of step B and can thus be
performed in parallel.
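The four steps can be sketched end-to-end as follows (Python over GF(2^8), m = 1). A trusted "dealer" function stands in for the preprocessing stage, the Step C tag check is elided, and all names are ours; the final reconstruction exists only to check the result.

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def dealer_triple(alpha_shares, d):
    """Stand-in for preprocessing: Beaver triple c = a*b with MAC tags."""
    alpha = xor_sum(alpha_shares)
    a, b = secrets.randbelow(256), secrets.randbelow(256)
    c = gf_mul(a, b)
    return tuple((share(v, d), share(gf_mul(alpha, v), d)) for v in (a, b, c))

def beaver_mul(xs, txs, ys, tys, triple, alpha_shares):
    (a_s, ta_s), (b_s, tb_s), (c_s, tc_s) = triple
    d = len(xs)
    # Step A: blind locally
    eps_s = [x ^ a for x, a in zip(xs, a_s)]
    eta_s = [y ^ b for y, b in zip(ys, b_s)]
    # Step B: broadcast and partially unmask
    eps, eta = xor_sum(eps_s), xor_sum(eta_s)
    # Step C (elided here): check the tags of eps and eta
    # Step D: local Beaver computation
    zs = [c_s[i] ^ gf_mul(eps, b_s[i]) ^ gf_mul(eta, a_s[i])
          ^ (gf_mul(eps, eta) if i == 0 else 0) for i in range(d)]
    tzs = [tc_s[i] ^ gf_mul(eps, tb_s[i]) ^ gf_mul(eta, ta_s[i])
           ^ gf_mul(gf_mul(eps, eta), alpha_shares[i]) for i in range(d)]
    return zs, tzs

d = 3
alpha_shares = share(secrets.randbelow(256), d)
alpha = xor_sum(alpha_shares)
x, y = 0x53, 0xCA  # 0x53 * 0xCA = 0x01 in the AES field
xs, txs = share(x, d), share(gf_mul(alpha, x), d)
ys, tys = share(y, d), share(gf_mul(alpha, y), d)
zs, tzs = beaver_mul(xs, txs, ys, tys, dealer_triple(alpha_shares, d), alpha_shares)
assert xor_sum(zs) == 0x01
assert xor_sum(tzs) == alpha  # tag of z = 1 is alpha * 1
```

In characteristic 2, c + ε·b + η·a + ε·η = ab + (x+a)b + (y+b)a + (x+a)(y+b) = xy, which is exactly what the reconstruction confirms.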

Squaring. Squaring is a linear operation in characteristic-2 fields. Hence, the output shares
of a squaring operation can be computed locally using the input shares. However, obtaining
the corresponding tag shares is non-trivial. To square (~x, τ~x ) into (~z, τ~z ), we therefore require
an auxiliary tuple (⟨~a⟩, ⟨~b⟩) such that b = a^2. The procedure to obtain (~z, τ~z ) mimics that of
multiplication with some modifications: there is only one partially unmasked value ε = x + a, whose
tag needs to be checked, and each tile calculates zi = bi + ε^2 · δi,1 and τiz = τib + ε^2 · αi .
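A minimal sketch of this squaring procedure, under the same assumptions as the multiplication (a dealer-generated tuple, tag check on ε elided, helper names ours):

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def dealer_square_tuple(alpha_shares, d):
    """Stand-in for preprocessing: shares and tags of (a, b) with b = a^2."""
    alpha = xor_sum(alpha_shares)
    a = secrets.randbelow(256)
    b = gf_mul(a, a)
    return tuple((share(v, d), share(gf_mul(alpha, v), d)) for v in (a, b))

def square_shared(xs, txs, tup, alpha_shares):
    (a_s, _ta_s), (b_s, tb_s) = tup
    d = len(xs)
    eps = xor_sum(x ^ a for x, a in zip(xs, a_s))  # partially unmasked x + a
    # (the tag of eps would be checked here, as in the multiplication)
    e2 = gf_mul(eps, eps)
    zs = [b_s[i] ^ (e2 if i == 0 else 0) for i in range(d)]   # zi = bi + eps^2*delta_{i,1}
    tzs = [tb_s[i] ^ gf_mul(e2, alpha_shares[i]) for i in range(d)]
    return zs, tzs

d = 3
alpha_shares = share(secrets.randbelow(256), d)
alpha = xor_sum(alpha_shares)
x = 0x02
xs, txs = share(x, d), share(gf_mul(alpha, x), d)
zs, tzs = square_shared(xs, txs, dealer_square_tuple(alpha_shares, d), alpha_shares)
assert xor_sum(zs) == gf_mul(x, x)                 # z = x^2
assert xor_sum(tzs) == gf_mul(alpha, gf_mul(x, x))
```

Correctness follows from b + ε^2 = a^2 + (x + a)^2 = x^2, using the Frobenius identity (x + a)^2 = x^2 + a^2 in characteristic 2.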

Following the same spirit, we can also perform the following operations.

Affine Transformation. Provided that we have access to a tuple (⟨~a⟩, ⟨~b⟩) such that b = A(a), we
can compute (~z, τ~z ) satisfying z = A(x) = C + L(x), where L(x) is an additively homomorphic
function over the finite field, by computing the output sharing as zi = bi + L(ε) · δi,1 and τiz =
τib + L(ε) · αi .

Multiplication following Linear Transformations. The technique used for the above additively
homomorphic operations can be generalized even further to compute z = L1 (x) · L2 (y) in shared
form, where L1 and L2 are additively homomorphic functions. A trivial methodology would
require two tuples (⟨~ai ⟩, ⟨~bi ⟩) with bi = Li (ai ) for i ∈ {1, 2}, plus a standard Beaver triple (i.e.
requiring seven pre-processed data items). We see that we can do the same operation with five
pre-processed items (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩), such that c = L1 (a), d = L2 (b) and e = L1 (a) · L2 (b).
The tiles partially unmask ~x + ~a (resp. ~y + ~b) to obtain ε (resp. η) and verify them. Each tile
computes its value share and tag share of z as zi = ei + L1 (ε) · di + L2 (η) · ci + L1 (ε) · L2 (η) · δi,1 and
τiz = τie + L1 (ε) · τid + L2 (η) · τic + L1 (ε) · L2 (η) · αi , respectively. We refer to (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩)
as a quintuple.

Proof.

Σi zi = Σi ( ei + L1 (ε) · di + L2 (η) · ci ) + L1 (ε) · L2 (η)
      = Σi ei + L1 (ε) · Σi di + L2 (η) · Σi ci + L1 (ε) · L2 (η)
      = L1 (a) · L2 (b) + L1 (x + a) · L2 (b) + L2 (y + b) · L1 (a) + L1 (x + a) · L2 (y + b)
      = L1 (a) · L2 (b) + L1 (x) · L2 (b) + L1 (a) · L2 (b) + L1 (a) · L2 (y) + L1 (a) · L2 (b)
        + L1 (x) · L2 (y) + L1 (x) · L2 (b) + L1 (a) · L2 (y) + L1 (a) · L2 (b)
      = L1 (x) · L2 (y)

where the last equality holds because every term except L1 (x) · L2 (y) appears an even number of
times and thus cancels in characteristic 2. Similarly for the tags:

Σi τiz = Σi ( τie + L1 (ε) · τid + L2 (η) · τic + L1 (ε) · L2 (η) · αi )
       = Σi τie + L1 (ε) · Σi τid + L2 (η) · Σi τic + L1 (ε) · L2 (η) · Σi αi
       = α · e + L1 (ε) · α · d + L2 (η) · α · c + L1 (ε) · L2 (η) · α
       = α · ( e + L1 (ε) · d + L2 (η) · c + L1 (ε) · L2 (η) )
       = α · L1 (x) · L2 (y)

3.1.1 Checking the MAC Tag of Partially Unmasked Values.

Consider a public value ε = x + a, calculated in the partial unmasking step of the Beaver
multiplication operation. Recall that we obtain its MAC-tag shares as follows: τiε = τia + τix . During
the MAC-tag checking step of the Beaver operation, the authenticity of τ ε corresponding to ε is
tested. As ε is public, each tile can calculate and broadcast the value ε · αi + τiε . For a correct tag,
we expect Σi τiε = α · ε, thus each tile computes Σi (ε · αi + τiε ) and proceeds if the result is zero.
Recall that the broadcasting must be preceded by a synchronization of the shares.
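The check can be sketched as follows (Python, GF(2^8); each entry of `diffs` models the value one tile would broadcast; names are ours). The MAC key is forced to be nonzero so that the deliberately faulted check fails deterministically:

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def mac_check(eps, teps_shares, alpha_shares):
    """Each tile i broadcasts eps*alpha_i + tau_i^eps; the sum must be zero."""
    diffs = [gf_mul(eps, ai) ^ ti for ai, ti in zip(alpha_shares, teps_shares)]
    return xor_sum(diffs) == 0

d = 3
alpha = secrets.randbelow(255) + 1        # nonzero MAC key (for a deterministic demo)
alpha_shares = share(alpha, d)
eps = 0x3A                                # a partially unmasked value
teps_shares = share(gf_mul(alpha, eps), d)

assert mac_check(eps, teps_shares, alpha_shares)          # consistent tag passes
assert not mac_check(eps ^ 1, teps_shares, alpha_shares)  # faulted eps is caught
```

A fault ∆ on ε leaves a residual α · ∆ in the sum, so it goes unnoticed only if the adversary can also offset the tag by exactly α · ∆.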

3.1.2 Note on Unmasked Values/Calculations.

There are several components in a cipher which do not need to be protected against SCA (i.e.
masked), because their specific values are not sensitive. One prominent example is the control unit
which decides what operations should be performed (e.g. the round counter). Other examples are
constants such as the AES affine constant 0x63 or public values such as ε in a Beaver calculation
and the difference ε · α + τ ε during the MAC-tag checking phase.
While these public components are not sensitive in a SCA context, they can be targeted in a
fault attack. It is therefore important to introduce some redundancy. Each tile should have its own
control logic and keep a local copy of all public values to avoid single points of attack. The shares
εi are distributed to all tiles so that ε can be unmasked by each tile separately and any subsequent
computation performed on these public values is repeated by each tile. Finally, each tile also keeps
its own copy of the abort status. This is in fact completely analogous to the MPC scenario.

3.2 Preprocessing Stage


The auxiliary data (⟨~a⟩, ⟨~b⟩, . . .) required in the Beaver evaluations is generated in a preprocessing
stage. This preparation corresponds to the offline phase in SPDZ. However, CAPA's preprocessing
stage is lighter and does not require public-key machinery, due to the differences in adversary
model. As in SPDZ, this stage is completely independent from the sensitive data of the main
evaluation. Below, we describe the generation of a Beaver triple used in multiplication. This can
trivially be generalized to tuples and quintuples.

Auxiliary Data Generation. To generate a triple (⟨~a⟩, ⟨~b⟩, ⟨~c⟩) satisfying c = a · b, we draw random
shares ~a = (a1 , . . . , ad ) and ~b = (b1 , . . . , bd ) and use a passively secure shared multiplier to compute
~c s.t. c = a · b. We then use another such multiplication with the shared MAC key ~α to generate
the tag shares τ~a , τ~b , τ~c . We note that the shares ai , bi are randomly generated by tile Ti . There are
thus d separate PRNGs on d distinct tiles.
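A functional sketch of this generation step (Python, GF(2^8), m = 1), with a simple DOM-style cross-product multiplier standing in for the passively secure shared multiplier; the per-pair randomness r models what the Ti,j tiles would contribute, and all names are ours:

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def shared_mul(a_s, b_s):
    """Passively secure shared multiplier (DOM-style sketch): shares of a*b.
    Each cross product a_i*b_j (i != j) is blinded with fresh randomness,
    modelling the randomness handled by the T_{i,j} tiles."""
    d = len(a_s)
    c = [gf_mul(a_s[i], b_s[i]) for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            r = secrets.randbelow(256)
            c[i] ^= gf_mul(a_s[i], b_s[j]) ^ r
            c[j] ^= gf_mul(a_s[j], b_s[i]) ^ r
    return c

def gen_triple(alpha_shares):
    """Beaver triple with tags: each tile Ti draws its own a_i, b_i."""
    d = len(alpha_shares)
    a_s = [secrets.randbelow(256) for _ in range(d)]
    b_s = [secrets.randbelow(256) for _ in range(d)]
    c_s = shared_mul(a_s, b_s)
    ta_s = shared_mul(alpha_shares, a_s)   # tag shares via the same multiplier
    tb_s = shared_mul(alpha_shares, b_s)
    tc_s = shared_mul(alpha_shares, c_s)
    return (a_s, ta_s), (b_s, tb_s), (c_s, tc_s)

d = 3
alpha_shares = [secrets.randbelow(256) for _ in range(d)]
alpha = xor_sum(alpha_shares)              # reconstructed only to check the triple
(a_s, ta_s), (b_s, tb_s), (c_s, tc_s) = gen_triple(alpha_shares)
a, b, c = xor_sum(a_s), xor_sum(b_s), xor_sum(c_s)
assert c == gf_mul(a, b)
assert xor_sum(ta_s) == gf_mul(alpha, a)
assert xor_sum(tb_s) == gf_mul(alpha, b)
assert xor_sum(tc_s) == gf_mul(alpha, c)
```

The pairwise blinding terms cancel in the sum, so functional correctness is preserved while no single output share depends on all input shares.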

Passively Secure Shared Multiplier. For a secure implementation of a shared multiplication, no
subset of d − 1 tiles should have access to all shares of any variable. This concept, which is used in
the context of secure implementations against SCA on hardware, is precisely called (d − 1)th-order
non-completeness in [9, 52]. In the last decade, there has been significant improvement on passively
secure shared multipliers that can be used in both hardware and software [9, 27, 29, 51, 56]. In
principle, CAPA can use any such multiplier as long as the tile structure still holds.
A close inspection of existing multipliers shows that they require the calculation of the cross
products ai bj . In order to make these multipliers compatible with the CAPA tile architecture, we
define tiles Ti,j which receive ai from Ti and bj from Tj where i ≠ j, in order to handle the pair
(ai , bj ) to be used during tuple, triple and quintuple generation. This implies d(d − 1) smaller
tiles used only during auxiliary data generation, in addition to the d tiles used for both auxiliary
data generation and evaluation. The output wires from Ti,j are only connected to Ti and carry
randomized information.
The multipliers used in the preprocessing phase are only passively secure. Resistance against
active adversaries is nonetheless ensured: on the one hand, deterministic faults are limited to
d − 1 tiles; on the other hand, faulty auxiliary data is caught by a relation verification step, which
is explained in the next section.

3.3 Relation Verification of Auxiliary Data


The information theoretic MAC tags provide security against faults induced in the evaluation stage.
To detect faults in the preprocessing stage, we perform a relation verification of the auxiliary data.
This relation verification step is done for each generated triple that is passed from the preprocessing
to the evaluation stage and ensures that the triple is functionally correct (i.e. c = a · b) by sacrificing
another triple. That is, we take as input two triples (⟨~a⟩, ⟨~b⟩, ⟨~c⟩) and (⟨~d⟩, ⟨~e⟩, ⟨f~⟩) that should
satisfy the same relation, in this example c = a · b and f = d · e. The following Beaver computation
holds if and only if both relations are satisfied:

• Draw a random r1 ∈ Fq .

• Use triple (⟨~d⟩, ⟨~e⟩, ⟨f~⟩) to calculate the multiplication of r1 · ⟨~a⟩ and ⟨~b⟩ using a constant
multiplication with r1 , followed by the Beaver equation for multiplication described above.
The result ⟨~c̃⟩ is a shared representation of c̃ = r1 · a · b.

• For each share i, calculate the difference with the shares and tags of r1 · c: ∆i = r1 · ci + c̃i
and τi∆ = r1 · τic + τic̃ .

• Unmask the resulting differences ∆ and τ ∆ .

• If a difference is nonzero, reject (⟨~a⟩, ⟨~b⟩, ⟨~c⟩) as a valid triple.

• Pick another r2 ∈ Fq such that r2 ≠ r1 and repeat a second time.
Note that this relation verification ensures that the second triple is functionally correct too. However,
it is burnt (or “sacrificed”) in this process in order to ensure that the first triple can be used securely
further on. Note that this relation verification or “sacrificing” step is mandatory in each Beaver-like
operation.
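The sacrificing procedure can be sketched as follows (Python, GF(2^8); dealer-generated triples, tag checks on ε and η elided, names are ours). The demo also replays a matched fault ∆f = r1 · ∆c, which slips through the r1 round but is caught by the r2 round:

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def dealer_triple(alpha_shares, d):
    alpha = xor_sum(alpha_shares)
    a, b = secrets.randbelow(256), secrets.randbelow(256)
    c = gf_mul(a, b)
    return tuple((share(v, d), share(gf_mul(alpha, v), d)) for v in (a, b, c))

def beaver_mul(xs, txs, ys, tys, triple, al_s):
    """Beaver multiplication as in Section 3.1 (tag checks elided)."""
    (a_s, ta_s), (b_s, tb_s), (c_s, tc_s) = triple
    d = len(xs)
    eps = xor_sum(x ^ a for x, a in zip(xs, a_s))
    eta = xor_sum(y ^ b for y, b in zip(ys, b_s))
    zs = [c_s[i] ^ gf_mul(eps, b_s[i]) ^ gf_mul(eta, a_s[i])
          ^ (gf_mul(eps, eta) if i == 0 else 0) for i in range(d)]
    tzs = [tc_s[i] ^ gf_mul(eps, tb_s[i]) ^ gf_mul(eta, ta_s[i])
           ^ gf_mul(gf_mul(eps, eta), al_s[i]) for i in range(d)]
    return zs, tzs

def sacrifice_round(t1, t2, r, al_s):
    """Use t2 to multiply r*a by b; accept iff both differences with r*c vanish."""
    (a_s, ta_s), (b_s, tb_s), (c_s, tc_s) = t1
    ra_s = [gf_mul(r, s) for s in a_s]
    tra_s = [gf_mul(r, s) for s in ta_s]
    ct_s, tct_s = beaver_mul(ra_s, tra_s, b_s, tb_s, t2, al_s)
    delta = xor_sum(gf_mul(r, ci) ^ cti for ci, cti in zip(c_s, ct_s))
    tdelta = xor_sum(gf_mul(r, ti) ^ tti for ti, tti in zip(tc_s, tct_s))
    return delta == 0 and tdelta == 0

d = 3
al_s = share(secrets.randbelow(255) + 1, d)   # nonzero MAC key
alpha = xor_sum(al_s)
t1, t2 = dealer_triple(al_s, d), dealer_triple(al_s, d)
r1, r2 = 0x02, 0x03
assert sacrifice_round(t1, t2, r1, al_s) and sacrifice_round(t1, t2, r2, al_s)

def fault_last(pair, delta, alpha):
    """Additive fault on one share, with the tag recomputed for the faulted value."""
    v_s, t_s = pair
    return ([v_s[0] ^ delta] + v_s[1:], [t_s[0] ^ gf_mul(alpha, delta)] + t_s[1:])

Dc = 0x05
t1f = (t1[0], t1[1], fault_last(t1[2], Dc, alpha))
t2f = (t2[0], t2[1], fault_last(t2[2], gf_mul(r1, Dc), alpha))
assert sacrifice_round(t1f, t2f, r1, al_s)        # matched fault passes with r1 ...
assert not sacrifice_round(t1f, t2f, r2, al_s)    # ... but r2 != r1 catches it
```

The residual difference is r · ∆c + ∆f, so a fault pair matched to r1 necessarily fails for any r2 ≠ r1.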

Why We Need Randomization. This sacrificing step involves two values r1 and r2 . We present
the following attack to illustrate why this randomization is needed. Again, we elaborate on triples,
but the same can be said for tuples and quintuples. As the security does not rely on the secrecy of
r1 and r2 , we assume for simplicity that they are known to the attacker. We only stress that they
are different: r1 6= r2 .
Consider two triples (⟨~a⟩, ⟨~b⟩, ⟨~c′⟩) and (⟨~d⟩, ⟨~e⟩, ⟨f~′⟩) at the input of the sacrificing stage. We
assume that the adversary has introduced an additive difference into one share of c′ and f ′ such
that c′ = a · b + ∆c and f ′ = d · e + ∆f . This fault is injected before the MAC tag calculation,
so that τ c′ and τ f ′ are valid tags for the faulted values c′ and f ′ respectively. In particular, this
means we have τ c′ = τ c + α · ∆c and τ f ′ = τ f + α · ∆f .
The sacrificing step calculates the following four differences (for rj = r1 and r2 ) and only
succeeds if all are zero.

∆j = Σi ( rj · c′i + f ′i + ε · ei + η · di ) + ε · η = rj · ∆c + ∆f ≟ 0

τ ∆j = Σi ( rj · τic′ + τif ′ + ε · τie + η · τid + ε · η · αi ) = rj · α · ∆c + α · ∆f ≟ 0

Without randomization (i.e. r1 = r2 = 1), the attacker only has to match the differences ∆f = ∆c
to pass verification. With a random r1 , the attacker can fix ∆f = r1 · ∆c to automatically force
∆1 and τ ∆1 to zero. Even if he does not know r1 , he guesses it correctly with probability as high
as 2^{−k} .
Only thanks to the repetition of the relation verification with r2 is the adversary detected with
probability 1 − 2^{−km} . Assuming he fixed ∆f = r1 · ∆c , it is impossible to also achieve ∆f = r2 · ∆c .
Even if the attacker manages to force ∆2 to zero with an additive injection (since he knows all
components r2 , ∆c and ∆f ), he cannot get rid of the difference τ ∆2 = r2 · α · ∆c + α · ∆f without
knowing the MAC key. Since α remains secret, the attacker succeeds with probability only 2^{−km} .

4 Discussion
4.1 Security Claims
With both described adversaries A1 and A2 , our design CAPA claims provable security against the
following types of attacks, as well as against a combined attack of the two:

1. Side-Channel Analysis (i.e. against d − 1 tile probing adversary).

2. Fault Attacks (i.e. an adversary introducing either known faults into d − 1 tiles or random
faults everywhere).

Side-channel Analysis. One can check that no union of d − 1 tiles ∪_{j∈{j1,...,jd−1}} Tj holds all the shares
of a sensitive value. Very briefly, we can reason about this (d − 1)th-order non-completeness as follows.
All computations are local, with the exception of the unmasking of public values such as ε. However,
the broadcasting of all shares of ε does not break non-completeness, since ε = x + a is not sensitive
itself but rather a blinded version of a sensitive value x, using a random a that is shared across
all tiles. Unmasking the public value ε therefore gives each tile Ti only one share ε + ai of a new
sharing of the secret x:

~x = (a1 , . . . , ai−1 , ε + ai , ai+1 , . . . , ad )

In this sharing, no union of d − 1 shares suffices to recover the secret. Our architecture thus provides
non-completeness for all sensitive values. As a result, our d-share implementation is secure against
(d − 1)-probing attacks. Any number of probes following the adversaries' restrictions leaks no sensitive
data. Our model is related to the wire-probe model, but with wires replaced by entire tiles. We
can thus at least claim security against (d − 1)th-order SCA.

Fault Attacks. A fault is only undetected if both value and MAC tag shares are modified such
that they are consistent. Adversary A1 can fault at most df < d tiles, which means he requires
knowledge of the MAC key α ∈ GF(2^{km}) to forge a valid tag for a faulty value. Since α is secret,
his best option is to guess the MAC key. This guess is correct with probability 2^{−km} . Adversary
A2 has -faulting abilities only and will therefore only avoid detection if the induced faults in value
and tag shares happen to be consistent. This is the case with probability 2^{−km} . We can therefore
claim an error detection probability (EDP) of 1 − 2^{−km} . The EDP does not depend on the number
of faulty bits (or the Hamming weight of the injected fault).
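The role of the secret MAC key in this claim can be illustrated with a toy sketch (Python, GF(2^8), m = 1, all names ours): an additive fault ∆ on a value share is accepted only if the tag is also offset by exactly α · ∆, which an adversary who does not know α can only achieve by guessing, i.e. with probability 2^{−km}:

```python
import secrets
from functools import reduce

P = 0x11B  # AES polynomial; addition in GF(2^8) is XOR

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= P
        b >>= 1
    return r

def xor_sum(xs):
    return reduce(lambda u, v: u ^ v, xs, 0)

def share(v, d):
    s = [secrets.randbelow(256) for _ in range(d - 1)]
    return s + [v ^ xor_sum(s)]

def accepts(v_shares, t_shares, alpha_shares):
    """Abstract integrity check: tag shares must sum to alpha * value."""
    return xor_sum(t_shares) == gf_mul(xor_sum(alpha_shares), xor_sum(v_shares))

d = 3
alpha = secrets.randbelow(255) + 1        # nonzero key, so the demo is deterministic
alpha_shares = share(alpha, d)
vs, ts = share(0x42, d), share(gf_mul(alpha, 0x42), d)
assert accepts(vs, ts, alpha_shares)

delta = 0x07
vs_f = [vs[0] ^ delta] + vs[1:]             # fault a single value share ...
assert not accepts(vs_f, ts, alpha_shares)  # ... detected: tag offset is missing

ts_f = [ts[0] ^ gf_mul(alpha, delta)] + ts[1:]  # forging the offset requires alpha
assert accepts(vs_f, ts_f, alpha_shares)
```

Since α is itself shared and never reconstructed, the forged-offset branch is exactly what the probing bound on the adversary rules out.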

Combined Attacks. In a combined attack, an adversary with df -faulting capabilities can mount
an attack where he uses the knowledge obtained from probing some tiles ∈ P1 to carefully forge the
faults. In SPDZ, commitments are used to avoid the so-called “rushing adversary”. CAPA does not
need commitments, as the timing limitation on the A1 adversary ensures a df -fault cannot be preceded
by a probe in the same time period. As a result, we inherit the security claims of SPDZ and the
claimed EDP is not affected by probing or SCA. Also, the injection of a fault in CAPA does not
change the side-channel security. Performing a side-channel attack on a perturbed execution does
not reveal any additional information because the Beaver operations do not allow injected faults to
propagate through a calculation into a difference that depends on sensitive information. We can
claim this security, because of the aspects inherited from MPC. CAPA is essentially secure against
a very powerful adversary that has complete control (hence combined attacks) over all but one of
the tiles.

What Does Our MAC Security Mean? We stress that CAPA provides significantly higher security
than existing approaches against faults. An adversary that injects errors in up to df tiles cannot
succeed with more than the claimed detection probability allows. This means that our design can withstand
d′f ≫ df shots if they affect at most df tiles. This is the case even if those df tiles leak their entire
state; hence our resistance against combined attacks. The underlying reason for this is that to forge
values, an attacker needs to know the MAC key; but since this key is also shared, the attacker does not
gain any information on it, and his best strategy is to insert a random fault, which is
detected with probability 1 − 2^{−km} . Moreover, our solution is highly scalable compared to, for
example, error detection code solutions.

How Much Do Tags Leak? The tag shares τia form a Boolean masking of a variable τ a . This
variable τ a itself is an information theoretic MAC tag of the underlying value a and can be seen as
a multiplicative share of a. We therefore require the MAC key to change for each execution. Hence
MAC tag shares are a Boolean masking of a multiplicative share and are expected to leak very
little information in comparison with the value shares themselves.

Forbidding the All-0s MAC Key. If the MAC key size km is small, we should forbid the all-0
MAC key. This ensures that tags are injective: if an attacker changes a value share, he must change
the tag share. The only cost is a slight decrease in the claimed detection probability: by excluding
one of the 2^{km} MAC key possibilities, we reduce the fault detection probability to 1 − 2^{−κ} , where
κ = log2 (2^{km} − 1).

4.2 Attacks
The Glitch Power Supply or Clock Attack. The solution presented in this paper critically depends
on the fact that there is no single point where an attacker can insert a fault that affects all d tiles
deterministically. An attacker may try to glitch the chip clock line that is shared among all tiles.
In this case, the attacker could try to carefully insert a glitch so that writing to the abort register
is skipped or a test instruction is skipped. Since all tiles share the same clock, the attacker can
bypass in this way the tag verification step. Similar comments apply, for example, to glitches in the
power line. The bottom line is that one should design the hardware architecture accordingly, that
is, deploy low-level circuit countermeasures that detect or avoid this attack vector.

Skipping Instructions. In software, when each tile is a separate processor (with its own program
counter, program memory and RAM memory), skipping one instruction in up to d − 1 shares would
be detected. The unaffected tiles will detect this misbehavior when checking partially unmasked
values.

Safe Error Attack. We point out a specific attack that targets any countermeasure against a
probing and faulting adversary. In a safe error attack [65], the attacker perturbs the implementation
in a way that the output is only affected if a sensitive variable has a certain value. The attacker
learns partial secret information by merely observing whether or not the computation succeeds
(i.e. does not abort). Consider for example a shared multiplication of a variable x and a secret
y and call the resulting product z = x · y. The adversary faults one of the inputs with an additive
nonzero difference such that the multiplication is actually performed on x′ = x + ∆ instead of
x. Such an additive fault can be achieved by affecting only one share/tile. The multiplication
results in the faulty product z′ = z + ∆ · y. The injected fault has propagated into a difference that
this multiplication depends on y. In particular, if nothing happens (all checks pass), the attacker
learns that y must be 0.
Among existing countermeasures against combined attacks, none provide protection against
this kind of selective failure attack as they cannot detect the initial fault ∆. The attacker can
always target the wire running from the last integrity check on x to the multiplication with y.
We believe CAPA is currently unique in preventing this type of attack. One can verify that the
MAC-tag checking step in a Beaver operation successfully prevents ∆ from propagating to the
output. This integrity check only passes if all tiles have a correct copy of the public value ε. Any
faults injected after this check have a limited impact as the calculation finishes locally. That is,
once the correct public values are established between the tiles, the shares of the multiplication
output z are calculated without further communication among tiles. The adversary is thus unable
to elicit a fault that depends on sensitive data.

PACA. We claim security against the passive and active combined attack (PACA) on masked
AES described in [2] because CAPA does not output faulty ciphertexts. A second attack in this
work uses another type of safe errors (or ineffective faults as they are called in this work) which
are impossible to detect. The attacker fixes a specific wire to the value zero (this requires the
df -faulting capability) and collects power traces of the executions that succeed. This means the
attacker only collects traces of encryptions in which that specific wire/share was already zero. The
key is then extracted using (d − 1)th-order SCA on the remaining d − 1 shares. This safe error attack
however falls outside our model since the adversary gets access (either by fault or SCA) to all d
shares and thus (F1 ∪ P1 ) = T .

Advanced Physical Attacks. In our description we are assuming that during the broadcast phase
there are no “races” between tiles: by design, each tile sets its share to be broadcasted at clock
cycle t and captures other tiles’ share in the same clock cycle t. We are implicitly assuming that
tiles cannot do much work between these two events. If this assumption is violated (for example,
using advanced circuit editing tools), a powerful adversary could bypass any verification. This is
why in the original SPDZ protocol there are commitments prior to broadcasting operations; if this
kind of attack is a concern one could adapt the same principles of commitments to CAPA. This is
a very strong adversarial model that we consider out of scope for this paper.

4.3 Differences with SPDZ


Offline Phase. In SPDZ, the auxiliary data is generated using a somewhat homomorphic encryption
scheme. The mapping onto a chip environment thus seems prohibitive due to the need for this
expensive public-key machinery to obtain full threshold and the large storage required. We avoid
this by generating the Beaver triples using passively secure shared multipliers. Furthermore, to
avoid the large storage requirement, we produce the auxiliary data on the fly whenever required.

MAC Tag Checking. SPDZ delays the tag checking of public values until the very end of the
encryption by using commitments. For this, each party keeps track of publicly opened values.
This is to avoid a slowdown of the computation and because in the MPC setting, local memory is
cheaper than communication. In an embedded scenario the situation is the opposite, so we check
the opened values on the fly at the cost of additional dedicated circuitry. In hardware, we “simulate”

the broadcast channel by wiring between all tiles. Each tile keeps a local copy of those broadcasted
values.

Table 1: Overview of the number of Fq multiplications (·), Fq additions (+) and linear operations
in GF(2) (L(.)) required to calculate all building blocks with d shares and m tags

                 |   Public values     | Output: value | Output: tags |     MAC check
                 |  ·      +      L(.) |   ·     +     |   ·     +    |   ·      +
Add.             |                     |         d     |         dm   |
Add. with C      |                     |         1     |   dm    dm   |
Multip. with C   |                     |   d           |   dm         |
Multip.          |  d  2d+2(d−1)d      |   2d   2d+1   |  3dm   3dm   |  2dm   4dm+2(d−1)dm
Square/Affine    |     d+(d−1)d     d  |         1     |   dm    dm   |   dm   2dm+(d−1)dm
L1(x)·L2(y)      |  d  2d+2(d−1)d  d+d |   2d   2d+1   |  3dm   3dm   |  2dm   4dm+2(d−1)dm

Adversary. Although MPC considers mainly the “synchronous” communication model, the SPDZ
adversary model also includes the so-called “rushing” adversary, which first collects all inputs from
the other parties and only then decides what to send in reply. In our embedded setting, as already
pointed out, the “rushing” adversary is impossible. Due to the nature of the implementation, the
computational environment and storage is very much restricted. On the other hand, communication
channels are very efficient and can be assumed to be automatically synchronous with all tiles
progressing in-step in the computation.

4.4 Cost Analysis and Scalability


The computation as described in §3.1 scales nicely with the masking order d and the security
parameter m. For any fixed number of shares d, the circuit area scales linearly in m (see for
example Table 2). Storage increases with a factor (m + 1)d compared to a plain implementation.
We note that our implementations run in almost the same number of cycles as a plain
implementation. There is almost no loss in throughput and only a negligible one in latency. In software
as well, the timing scales linearly if tiles run in parallel.
This efficiency does not come for free. The complexity is shifted to the preprocessing stage;
indeed the generation of auxiliary triples is the most expensive part of the implementation. There
is a trade-off to be made here between the online and offline complexity. The more auxiliary data
we prepare “offline”, the more efficient the online computation.

Complexity for Passive Attacker Scenario. It is remarkable that if active attackers are ruled
out, and only SCA is a concern, then the complexity of the principal computation is linear in d.
This may seem like a significant improvement over previous masking schemes, which have quadratic
complexity in the security order [18, 34, 58]. However, this complexity is again pushed into the
preprocessing stage. Nevertheless, this can be interesting especially for software implementations in
platforms where a large amount of RAM is available to store the auxiliary data generated in §3.2.
The same comments apply to FPGAs with plenty of BlockRAM.

Optimization of Preprocessing. It may be beneficial to store the output of the preprocessing


stage §3.2 in a table for later usage. One could optimize this process by recycling auxiliary data
(sample elements with replacement from the table). Of course, this would void the provable security
claims; but if performed with care (with appropriate table shuffling and table elements refresh),
this can give rise to an implementation that is secure in practice.

5 Proof-of-Concept
In this section we detail a proof-of-concept implementation of the CAPA methodology in both a
hardware and a software environment. We emphasize specific concepts for hardware and software
implementations and provide case studies of KATAN-32 [14] and AES [1], which cover operations in
different fields, the possibility of bitsliced implementations, specific timing and memory optimizations,
and performance results.
CAPA: The Spirit of Beaver against Physical Attacks 219

Table 2: Area (GE) of 2-share KATAN-32 implementations with m MAC keys α[j] ∈ Fq

                            No tags    m = 1     m = 8    Any m
- Evaluation                  2 315    4 708    21 404    ≈ 2 315 + 2 390m
  * Shift Register              888    1 823     8 419    ≈   888 +   935m
  * Key Schedule              1 427    2 885    12 985    ≈ 1 427 + 1 455m
- Preprocessing (x3)            363      679     2 727    ≈   363 +   315m
  * Two triple generation       237      431     1 786    ≈   237 +   195m
  * Relation verification       126      248       941    ≈   126 +   120m
Total                         3 672    7 103    30 596    ≈ 3 672 + 3 430m

5.1 Hardware Implementations


We now describe two case studies for applying CAPA in hardware. Our implementations are somewhat optimized for latency rather than area, with d tiles spatially separated and operating in parallel, each with its own combinational and control logic and auxiliary data preparation module.
These preparation modules are equipped with a passively secure shared multiplication with higher-
order non-completeness. Literature provides us with a broad spectrum of multipliers to choose
from [9, 27, 29, 51, 56]. In order to minimize the randomness requirement, our implementation uses
the one from [29], hereafter referred to as DOM.

Library. For synthesis, we use Synopsys Design Compiler Version I-2013.12 with the NanGate 45nm Open Cell library [49] for ease of future comparisons. We choose the compile option -exact_map to prevent optimization across tiles. The area results are provided in 2-input NAND-gate equivalents (GE).

5.1.1 Case Study: KATAN-32.

KATAN-32 is a shift-register-based block cipher with an 80-bit key and a 32-bit plaintext input. It is designed specifically for efficient hardware implementations and performs 254 cycles of four AND-XOR operations. Hence, its natural shared data representation is in the field Fq = GF(2), which makes the mapping into CAPA operations relatively straightforward. However, the small finite field means that we need to utilize a vectorized MAC-tag operation (m > 1) to ensure a good probability of detecting errors. Our implementation is round-based, as in [14], with three AND-XOR Beaver operations and one constant AND-XOR calculated in parallel. Each Beaver AND-XOR operation requires two cycles and is implemented in a pipelined fashion, such that the latency of the whole computation increases by only one clock cycle.

Implementation Cost. Tables 2 and 3 summarize the area of our KATAN implementations.
Naturally, compared to a shared implementation without MAC tags, the state registers grow with
a factor m + 1 as the MAC-key size increases. In the last columns, we extrapolate the area results
for any m.
Each Beaver multiplication in GF(2) requires one triple, and each triple needs 2d random bits for generating ~a and ~b. A d-share DOM multiplication requires d(d−1)/2 units of randomness. The construction of one triple requires 1 + 3m masked multiplications: one to obtain the multiplication ~c of ~a and ~b, and 3m to obtain the m tags τ~a, τ~b and τ~c. Due to the relation verification through the sacrificing of another triple, the randomness must be doubled. Hence, the total required number of random bits per round of KATAN is 3 · 2 · (2d + (1 + 3m) · d(d−1)/2).
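For intuition, the Beaver multiplication underlying these AND-XOR operations can be sketched in a few lines of Python. This is an illustration of the principle only, not the round-based hardware described above, and the MAC tags are omitted; all names are ours:

```python
import secrets
from functools import reduce

def share(v, d):
    """Split a bit v into d uniformly random Boolean shares."""
    s = [secrets.randbelow(2) for _ in range(d - 1)]
    s.append(reduce(lambda x, y: x ^ y, s, v))
    return s

def unshare(s):
    return reduce(lambda x, y: x ^ y, s)

def beaver_and(x_sh, y_sh, triple):
    """Compute shares of x AND y from shares of x, y and a triple
    (a, b, c) with c = a*b, all in GF(2)."""
    a_sh, b_sh, c_sh = triple
    # epsilon = x + a and delta = y + b are opened publicly; they leak
    # nothing because a and b are uniformly random one-time masks
    eps = unshare(x_sh) ^ unshare(a_sh)
    dlt = unshare(y_sh) ^ unshare(b_sh)
    # z = c + eps*b + dlt*a + eps*dlt, computed share-wise per tile
    z_sh = [c ^ (eps & b) ^ (dlt & a)
            for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_sh[0] ^= eps & dlt  # the public constant is added to one share only
    return z_sh
```

The correctness follows from (a + ε)(b + δ) = ab + εb + δa + εδ; in the actual scheme every opened value and every share additionally carries its m MAC tags.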

Table 3: Area (GE) of 3-share KATAN-32 implementations with m MAC keys α[j] ∈ Fq

                            No tags    m = 1     m = 8    Any m
- Evaluation                  3 560    7 139    32 368    ≈ 3 560 + 3 580m
  * Shift Register            1 363    2 812    12 890    ≈ 1 363 + 1 450m
  * Key Schedule              2 197    4 327    19 478    ≈ 2 197 + 2 130m
- Preprocessing (x3)            638    1 468     7 124    ≈   638 +   830m
  * Two triple generation       428      952     4 694    ≈   428 +   524m
  * Relation verification       210      516     2 430    ≈   210 +   306m
Total                         5 971   12 083    55 254    ≈ 5 971 + 6 112m

Figure 2: Non-specific leakage detection on the first 31 rounds of first-order KATAN. Left column:
PRNG off (24K traces). Right column: PRNG on (100M traces). Rows (top to bottom): exemplary
power trace; first-order t-test; second-order t-test

Experimental Validation. The goal of this proof-of-concept implementation is to experimentally validate the protection against side-channel attacks offered by the CAPA methodology.
We deploy a first- and a second-order secure KATAN instance on a Xilinx Spartan-6 FPGA. Our platform is a Sakura-G board specifically designed for side-channel evaluation, with two FPGAs to minimize platform noise: a control FPGA handles I/O with the host computer and supplies masked data to the crypto FPGA, which implements both the preprocessing and evaluation. The KATAN implementations use d = 2 (resp. d = 3) shares and m = 2 MAC keys. The parameter m = 2 is insufficient in practice, but serves for this experiment since m has no influence on SCA security. The designs are clocked at 3 MHz and we sample power traces of 10 000 time samples each at 1 GS/s. Exemplary traces are shown in Figure 2, top.
We perform a non-specific leakage detection test [17] following the methodology from [57, 61].
First, we test the designs without masks to verify that our setup is indeed sound and able to detect
leakage. Then we switch on the PRNG and corroborate that the design does not leak with high
confidence.
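The statistic behind this fixed-vs-random test is a per-sample Welch t-test with the customary ±4.5 threshold. A minimal NumPy sketch (names are illustrative, not from the evaluation scripts used here):

```python
import numpy as np

def welch_t(fixed, rnd):
    """Per-sample Welch t-statistic between the fixed and random trace sets.
    fixed, rnd: 2-D arrays of shape (number of traces, time samples)."""
    m_f, m_r = fixed.mean(axis=0), rnd.mean(axis=0)
    v_f, v_r = fixed.var(axis=0, ddof=1), rnd.var(axis=0, ddof=1)
    return (m_f - m_r) / np.sqrt(v_f / len(fixed) + v_r / len(rnd))

def leaks(fixed, rnd, threshold=4.5):
    """Boolean mask of time samples where first-order leakage is reported."""
    return np.abs(welch_t(fixed, rnd)) > threshold
```

Higher-order tests apply the same statistic to centered and standardized powers of the traces, following [61].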
In Figure 2, we show the results for the first-order secure design (d = 2). In the left column, the
PRNG is turned off, emulating an unmasked design. Indeed, we see clear leakage at first order,
since the t-statistics cross the threshold 4.5. With the PRNG on (right column), no first-order
leakage is detected with up to 100 million traces. As expected, we do see second-order leakage.
Figure 3 exhibits the results for the second-order secure design (d = 3). The left column shows
clear leakage at first, second and third order when the PRNG is turned off. In the right column, we
repeat the procedure with PRNG on and no univariate leakage is detected with up to 100 million
traces.¹

¹ Since our implementation handles 3 shares, we expect to detect leakage in the third order. Due to platform noise, this is not visible.

Figure 3: Non-specific leakage detection on the first 31 rounds of second-order KATAN. Left
column: PRNG off (24K traces). Right column: PRNG on (100M traces). Rows (top to bottom):
exemplary power trace; first-order t-test; second-order t-test; third-order t-test

Figure 4: AES S-box pipeline. The input x passes through three x^5 stages, one x^4 · y^2 stage and the affine stage A(x) to produce S(x); each stage is followed by a verification step, and registers separate the pipeline stages.

5.1.2 Case Study: AES.


There has been a great deal of work on MPC and masked implementations of the basic AES operations. We take what has now become the traditional approach and work in the field GF(2^8) with m = 1 for AES, i.e. the MAC key, data and tag shares α_i, a_i and τ_i^a are elements of GF(2^8). The ShiftRows and MixColumns operations are linear over GF(2^8) and hence straightforward. Here, we only describe the S-box calculation.

Design choices. The AES S-box consists of an inversion in GF(2^8) followed by an affine transformation over bits. We distinguish two methodologies for the S-box implementation. First, it is well known that the combination of the two operations can be expressed by the following polynomial over GF(2^8) [19]:

S-box(x) = 0x63 + 0x8F · x^127 + 0xB5 · x^191 + 0x01 · x^223 + 0xF4 · x^239
         + 0x25 · x^247 + 0xF9 · x^251 + 0x09 · x^253 + 0x05 · x^254        (1)

This polynomial can be implemented using 6 squarings and 7 multiplications in GF(2^8), with a latency of 13 clock cycles. A second approach is to evaluate the inversion x → x^254 using the following multiplication chain from [30]:

x^254 = x^4 · (((x^5)^5)^5)^2

Since the AES affine transform A(x) is linear over GF(2), we can then use the Beaver operation described in §3.1 to evaluate it in one cycle, using auxiliary affine tuples (⟨~a⟩, ⟨~b⟩) such that b = A(a). Initial estimations reveal that the former method is more expensive than the latter, so we adopt the latter technique.

Multiplication Chain. Our implementation of the proposed multiplication chain uses two types of operations, x^5 and x^4 · y^2, which can both be computed as described in §3.1 (Multiplication following Linear Transformations). Given an input ⟨~x⟩ and a triple (⟨~a⟩, ⟨~b⟩, ⟨~c⟩) such that b = a^4 and c = a^5, we calculate the CAPA exponentiation to the power five. Likewise, we perform the map x^4 · y^2 (with y = x^125) in one cycle, using quintuples (⟨~a⟩, ⟨~b⟩, ⟨~c⟩, ⟨~d⟩, ⟨~e⟩) such that c = a^4, d = b^2 and e = c · d = a^4 · b^2. As a result, an inversion in GF(2^8) costs only 4 cycles, using 3 exponentiation triples and 1 quintuple. Combined with the affine stage, we obtain the S-box output in 5 cycles (see Figure 4). This approach optimizes not only the number of cycles but also the amount of required randomness. The S-box is implemented as a five-stage pipeline.
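Both design options can be checked numerically with plain GF(2^8) arithmetic. The sketch below (our own helper functions, not part of the design) verifies that three x^5 stages followed by x^4 · y^2 indeed compute x^254, and spot-checks the interpolation polynomial (1) against known S-box values:

```python
def gmul(a, b):
    """Multiplication in GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def gpow(x, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gmul(r, x)
        x = gmul(x, x)
        e >>= 1
    return r

# three x^5 stages give y = x^125; one x^4 * y^2 step yields x^254 = x^(-1)
for x in range(256):
    y = gpow(gpow(gpow(x, 5), 5), 5)  # x^125
    assert gmul(gpow(x, 4), gmul(y, y)) == gpow(x, 254)

# spot-check the interpolation polynomial: S(0x00) = 0x63, S(0x01) = 0x7C
coeffs = {127: 0x8F, 191: 0xB5, 223: 0x01, 239: 0xF4,
          247: 0x25, 251: 0xF9, 253: 0x09, 254: 0x05}

def sbox_poly(x):
    r = 0x63
    for e, c in coeffs.items():
        r ^= gmul(c, gpow(x, e))
    return r

assert sbox_poly(0x00) == 0x63 and sbox_poly(0x01) == 0x7C
```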

Implementation Cost. We use a serialized AES architecture, based on that of [28]. One round of the cipher requires 21 clock cycles, making the latency of one complete encryption 226 clock cycles. Since the unprotected serialized implementation of [47] also requires 226 cycles, the timing performance is very good.
Table 4 presents the area of the different blocks that make up our AES implementation. We can see a significant difference between the preprocessing and evaluation stages, i.e. the efficient calculation phase comes at the cost of expensive resource-generation machinery.
Table 5 summarizes the required number of random bytes for the generation of the triples/tuples
for the AES S-box as a function of the number of MAC keys m and the number of shares d. Recall

that the S-box needs three exponentiation triples, one quintuple and one affine tuple per cycle (doubled for the sacrificing). Each of these uses d initial bytes of randomness per input for the shares of a (and b). Furthermore, recall that each masked multiplication requires d(d−1)/2 bytes of randomness. That is, for d = 3 and m = 1, we need 156 bytes of randomness per S-box evaluation.
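The 156-byte figure can be reproduced directly from these counts. A small sketch (one field element = one byte; the factor 2 is the sacrifice; the function name is ours):

```python
def sbox_random_bytes(d, m):
    """Random bytes per AES S-box evaluation: three exponentiation triples,
    one quintuple and one affine tuple, each doubled for the sacrifice."""
    pairs = d * (d - 1) // 2  # randomness of one d-share masked multiplication
    exp_triple = 2 * (d + (1 + 3 * m) * pairs)      # shares of a, 1+3m mults
    quintuple  = 2 * (2 * d + (1 + 5 * m) * pairs)  # shares of a and b, 1+5m mults
    affine     = 2 * (d + 2 * m * pairs)            # shares of a, 2m mults
    return 3 * exp_triple + quintuple + affine

print(sbox_random_bytes(3, 1))  # prints 156
```

Summing the rows also recovers the closed form 12d + 2(4 + 16m) · d(d−1)/2 from Table 5.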

5.2 Software Implementation


CAPA is a suitable technique for software implementations if we map different tiles to different processors/cores. We do, however, need to place some constraints on the underlying hardware architecture: each processor should have an independent memory bank. Otherwise, a single affected tile (processor) could compromise the security of the whole system, for example by dumping the entire memory contents (including all shares of sensitive variables).
This model therefore does not perfectly fit commercial off-the-shelf multi-core architectures, but we think isolated memory regions are a reasonable assumption for future micro-processors. While we do not have access to such an architecture, as a proof of concept we emulate the proposed multi-processor architecture by time-sharing a 32-bit single-core ARM Cortex-M4 processor. This proof of concept does not provide resistance against attacks such as the memory-dump example above.

5.2.1 Case Study: AES S-box.


Even though it is possible to implement the AES S-box using GF(2^8) operations in software as well, we base our bitsliced software implementation on the principles of gate-level masking and use the depth-16 AES S-box circuit by Boyar et al. [11] in order to provide competitive throughput. Our high-level implementation processes 32 blocks simultaneously, which matches the word size of our processor and can naturally be reduced. As the circuit boils down to a series of XOR and AND operations over pairs of value and tag shares, we redefine these elementary operations in the same way as previous works [3, §4]. We note that this technique is independent of the concrete design, and one could apply the same principles to different ciphers.
We create a prototype implementation in C99. This is an unoptimized implementation meant for functionality and security testing. We compile with gcc-arm 4.8.4. The 32 parallel SubBytes operations are performed in 2.52 million cycles (15 ms) at 168 MHz with m = 8 MAC tags and d = 3 shares. The implementation holds 41 intermediate variables on the stack (but this can be optimized); each takes d · w bytes for value shares and m · d · w bytes for tag shares (w = 4 is the number of bytes per word).
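To give a feel for the redefined elementary operations, here is a hedged Python sketch of the linear one: a shared, tagged XOR over 32 bitsliced blocks is simply a lane-wise XOR of value shares and of tag shares, since the MAC relation τ[j] = α[j] · a is linear. The class name and data layout are illustrative, not those of the actual C99 code (the AND operation additionally requires a Beaver triple and is not shown):

```python
class Shared:
    """A shared, tagged bitsliced variable: d value shares and m rows of
    d tag shares, each a 32-bit word holding one bit of 32 parallel blocks."""
    def __init__(self, val_shares, tag_shares):
        self.v = list(val_shares)                   # d words
        self.t = [list(row) for row in tag_shares]  # m rows of d words

def xor(x, y):
    """Tagged XOR: value and tag shares combine lane-wise, per tile,
    with no communication between tiles."""
    v = [a ^ b for a, b in zip(x.v, y.v)]
    t = [[a ^ b for a, b in zip(rx, ry)] for rx, ry in zip(x.t, y.t)]
    return Shared(v, t)
```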

Experimental Validation of DPA Security. We use an STM32F407 32-bit ARM Cortex-M4


processor running the C99 implementation. We take EM measurements with an electromagnetic
probe on top of a decoupling capacitor. This platform is very low noise: a DPA attack on the
unprotected byte-oriented AES implementation succeeds with only 15 traces. Each trace is slightly
above 500 000 time samples long and covers the entire execution of SubBytes. An exemplary trace
is depicted at the top of Figure 5.
Following the same procedure as in §5.1.1, we first perform a non-specific leakage detection test
with the masking PRNG turned off. The results of the first-, second- and third-order leakage tests
are shown on the left side of Figure 5. Severe leakage is detected, which confirms that the setup is

Table 4: Areas for first- and second-order AES implementations with m = 1 in 2-NAND Gate Equivalents (GE)

Evaluation             d = 2    d = 3    Preprocessing     d = 2     d = 3
S-box                 18 810   28 234    Quintuples       29 147    53 212
 * Beaver x^5 (x3)     3 914    5 875     * Generation    15 092    32 241
 * Beaver x^4 y^2      4 944    7 427     * Sacrificing   14 055    20 971
 * Beaver Affine       1 563    2 344    Triples (x3)     19 106    34 954
State array            4 962    7 466     * Generation     9 804    21 112
 * MixColumns          1 056    1 584     * Sacrificing    9 302    13 842
Key array              3 225    4 835    Affine tuples     7 603    14 657
Others                 1 296    1 839     * Generation     4 821    10 444
                                          * Sacrificing    2 782     4 213
Total                 28 293   42 374    Total            94 068   172 731
TOTAL (both stages)                                      122 361   215 105

Table 5: Number of random bytes for the initial sharing, the shared multiplications and the sacrifice required for the AES S-box

               Initial sharing   Shared mult.   Total
Exp. triple          d              1 + 3m      2(d + (1 + 3m) · d(d−1)/2)
Quintuple            2d             1 + 5m      2(2d + (1 + 5m) · d(d−1)/2)
Affine tuple         d              2m          2(d + 2m · d(d−1)/2)
Total                                           12d + 2(4 + 16m) · d(d−1)/2


Figure 5: Non-specific leakage detection on second-order SubBytes. Left column: masks off. Right
column: masks on (200K traces). Rows (top to bottom): one exemplary EM trace, first-order t-test;
second-order t-test; third-order t-test

sound. When we plug in the PRNG, no leakage is detected with up to 200 000 traces (the statistic does not surpass the threshold C = ±4.5). This confirms that the implementation effectively masks all intermediates, and that neither first- nor second-order DPA is possible on this implementation. SPA features within an electromagnetic trace are better visible in the cross-correlation matrix shown in Figure 6.


Figure 6: Cross-correlation for second-order SubBytes. One can identify the 34 AND gates in the
SubBytes circuit of Boyar et al. [11].

Experimental Validation of DFA Security. To validate our theoretical security claims on CAPA's protection against fault attacks, we scale down our software AES SubBytes implementation, reducing the number of MAC keys to m = 2 and scaling down words to bits (k = 1). Note that this parameter choice lowers the detection probability; the point of using these toy parameters is only to verify more comfortably that the detection probability behaves as expected: it is easier to verify that the detection probability is 1 − 2^−2 rather than 1 − 2^−40. This concrete parameter

choice is naturally not to be used in a practical deployment.


When barring the all-zeroes key, we expect the attacker to succeed with probability at most 1/(2^mk − 1) = 1/(2^2 − 1) ≈ 33%. The instrumented implementation conditionally inserts faults in value and/or tag shares. We repeat the SubBytes execution 1000 times, each iteration with a fresh MAC key. Faults are inserted at a random location during the execution of the S-box.
We verify that single faults on only values or only tags are detected unconditionally when we
bar the all-0s key. When a single-bit offset (fault) is inserted in a single tile in both the value and
tag share, it is indeed detected in approximately 66% of the iterations. Inserting a single-bit offset
in value share and a random-bit offset in tag share is a worse attack strategy and is detected in
around 83% of the experiments. The same results hold when faults are inserted in up to d − 1 tiles.
When the value and tag shares in all d tiles are modified and fixed to a known value, the fault
escapes detection with probability one, as expected.
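The quoted rates follow from the MAC-check algebra: a value offset e with tag offsets e_τ escapes only if e_τ[j] = α[j] · e for every j. A toy Monte-Carlo sketch of the m = 2, k = 1 experiment (the names and structure are illustrative, not the instrumented C99 harness):

```python
import random
random.seed(2020)

def escape_rate(tag_offsets, trials=100_000):
    """Fraction of runs in which a value offset e = 1, injected together
    with the given tag offsets, passes the check tau[j] = alpha[j] * a
    for a fresh MAC key each run."""
    keys = [(0, 1), (1, 0), (1, 1)]  # the all-zero key is barred
    hits = 0
    for _ in range(trials):
        alpha = random.choice(keys)
        # with e = 1, the faulted tags stay consistent only if each
        # injected tag offset equals alpha[j] * e = alpha[j]
        if all(t == a for t, a in zip(tag_offsets, alpha)):
            hits += 1
    return hits / trials

print(escape_rate((1, 1)))  # ~1/3: best strategy, detected ~66% of the time
print(escape_rate((0, 0)))  # 0.0: faulting only the value is always detected
```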

6 Conclusion
In this paper, we introduced the first adversary model that jointly considers side-channels and faults
in a unified and formal way. The tile-probe-and-fault security model extends the more traditional
wire-probe model and accounts for a more realistic and comprehensive adversarial behavior. Within
this model, we developed the methodology CAPA: a new combined countermeasure against physical
attacks. CAPA provides security against higher-order DPA, multiple-shot DFA and combined
attacks. CAPA scales to arbitrary security orders and borrows concepts from SPDZ, an MPC
protocol. We showed the feasibility of implementing CAPA in embedded hardware and software
by providing prototype implementations of established block ciphers. We hope CAPA provides
an interesting addition to the embedded designer’s toolbox, and stimulates further research on
combined countermeasures grounded on more formal principles.

6.0.1 Acknowledgements.
This work was supported in part by the Research Council KU Leuven: C16/15/058 and OT/13/071,
by the NIST Research Grant 60NANB15D346 and the EU H2020 project FENTEC. Oscar Reparaz
and Begül Bilgin are postdoctoral fellows of the Fund for Scientific Research - Flanders (FWO)
and Lauren De Meyer is funded by a PhD fellowship of the FWO. The work of Nigel Smart
has been supported in part by ERC Advanced Grant ERC-2015-AdG-IMPaCT, by the Defense
Advanced Research Projects Agency (DARPA) and Space and Naval Warfare Systems Center,
Pacific (SSC Pacific) under contract No. N66001-15-C-4070, and by EPSRC via grants EP/M012824
and EP/N021940/1.

References
[1] Advanced Encryption Standard (AES). National Institute of Standards and Technology (NIST),
FIPS PUB 197, U.S. Department of Commerce, Nov. 2001.

[2] F. Amiel, K. Villegas, B. Feix, and L. Marcel. Passive and active combined attacks: Combining
fault attacks and side channel analysis. In L. Breveglieri, S. Gueron, I. Koren, D. Naccache,
and J. Seifert, editors, FDTC 2007, pages 92–102. IEEE Computer Society, 2007.

[3] J. Balasch, B. Gierlichs, O. Reparaz, and I. Verbauwhede. DPA, bitslicing and masking at 1
GHz. In Güneysu and Handschuh [31], pages 599–619.

[4] G. Barthe, F. Dupressoir, S. Faust, B. Grégoire, F. Standaert, and P. Strub. Parallel imple-
mentations of masking schemes and the bounded moment leakage model. In J. Coron and J. B.
Nielsen, editors, EUROCRYPT 2017, Part I, volume 10210 of LNCS, pages 535–566, 2017.

[5] A. Battistello and C. Giraud. Fault analysis of infective AES computations. In W. Fischer
and J. Schmidt, editors, FDTC 2013, pages 101–107. IEEE Computer Society, 2013.

[6] D. Beaver. Precomputing oblivious transfer. In D. Coppersmith, editor, CRYPTO’95, volume


963 of LNCS, pages 97–109. Springer, Heidelberg, Aug. 1995.

[7] R. Bendlin, I. Damgård, C. Orlandi, and S. Zakarias. Semi-homomorphic encryption and


multiparty computation. In Paterson [53], pages 169–188.

[8] G. Bertoni, L. Breveglieri, I. Koren, P. Maistri, and V. Piuri. Error analysis and detection
procedures for a hardware implementation of the advanced encryption standard. IEEE Trans.
Computers, 52(4):492–505, 2003.

[9] B. Bilgin, B. Gierlichs, S. Nikova, V. Nikov, and V. Rijmen. Higher-order threshold implemen-
tations. In P. Sarkar and T. Iwata, editors, ASIACRYPT 2014, Part II, volume 8874 of LNCS,
pages 326–343. Springer, Heidelberg, Dec. 2014.

[10] D. Boneh, R. A. DeMillo, and R. J. Lipton. On the importance of eliminating errors in


cryptographic computations. Journal of Cryptology, 14(2):101–119, 2001.

[11] J. Boyar, P. Matthews, and R. Peralta. Logic minimization techniques with applications to
cryptology. Journal of Cryptology, 26(2):280–312, Apr. 2013.

[12] J. Bringer, C. Carlet, H. Chabanne, S. Guilley, and H. Maghrebi. Orthogonal direct sum
masking - A smartcard friendly computation paradigm in a code, with builtin protection
against side-channel and fault attacks. In D. Naccache and D. Sauveron, editors, WISTP 2014.
Proceedings, volume 8501 of LNCS, pages 40–56. Springer, 2014.

[13] J. Bringer, H. Chabanne, and T. Le. Protecting AES against side-channel analysis using
wire-tap codes. J. Cryptographic Engineering, 2(2):129–141, 2012.

[14] C. D. Cannière, O. Dunkelman, and M. Knežević. KATAN and KTANTAN - a family of small
and efficient hardware-oriented block ciphers. In C. Clavier and K. Gaj, editors, CHES 2009,
volume 5747 of LNCS, pages 272–288. Springer, Heidelberg, Sept. 2009.

[15] S. Chari, C. S. Jutla, J. R. Rao, and P. Rohatgi. Towards sound approaches to counteract
power-analysis attacks. In Wiener [64], pages 398–412.

[16] T. D. Cnudde and S. Nikova. More efficient private circuits II through threshold implementations.
In FDTC 2016, pages 114–124. IEEE Computer Society, 2016.

[17] J. Cooper, E. DeMulder, G. Goodwill, J. Jaffe, G. Kenworthy, and P. Rohatgi. Test Vector
Leakage Assessment (TVLA) methodology in practice. International Cryptographic Module
Conference, 2013.

[18] J.-S. Coron. Higher order masking of look-up tables. In P. Q. Nguyen and E. Oswald, editors,
EUROCRYPT 2014, volume 8441 of LNCS, pages 441–458. Springer, Heidelberg, May 2014.

[19] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard.
Information Security and Cryptography. Springer, 2002.

[20] I. Damgård, V. Pastro, N. P. Smart, and S. Zakarias. Multiparty computation from somewhat
homomorphic encryption. In Safavi-Naini and Canetti [60], pages 643–662.

[21] A. Duc, S. Faust, and F.-X. Standaert. Making masking security proofs concrete - or how
to evaluate the security of any leaking device. In E. Oswald and M. Fischlin, editors, EU-
ROCRYPT 2015, Part I, volume 9056 of LNCS, pages 401–429. Springer, Heidelberg, Apr.
2015.

[22] W. Fischer and N. Homma, editors. Cryptographic Hardware and Embedded Systems - CHES
2017 - 19th International Conference, Taipei, Taiwan, September 25-28, 2017, Proceedings,
volume 10529 of Lecture Notes in Computer Science. Springer, 2017.

[23] B. M. Gammel and S. Mangard. On the duality of probing and fault attacks. J. Electronic
Testing, 26(4):483–493, 2010.

[24] K. Gandolfi, C. Mourtel, and F. Olivier. Electromagnetic analysis: Concrete results. In Ç. K. Koç, D. Naccache, and C. Paar, editors, CHES 2001, volume 2162 of LNCS, pages 251–261. Springer, Heidelberg, May 2001.

[25] B. Gierlichs, J.-M. Schmidt, and M. Tunstall. Infective computation and dummy rounds: Fault
protection for block ciphers without check-before-output. In A. Hevia and G. Neven, editors,
LATINCRYPT 2012, volume 7533 of LNCS, pages 305–321. Springer, Heidelberg, Oct. 2012.
[26] L. Goubin and J. Patarin. DES and differential power analysis (the “duplication” method). In Ç. K. Koç and C. Paar, editors, CHES’99, volume 1717 of LNCS, pages 158–172. Springer, Heidelberg, Aug. 1999.
[27] H. Groß and S. Mangard. Reconciling d+1 masking in hardware and software. In Fischer and
Homma [22], pages 115–136.
[28] H. Groß, S. Mangard, and T. Korak. Domain-oriented masking: Compact masked hardware
implementations with arbitrary protection order. IACR Cryptology ePrint Archive, 2016:486,
2016.
[29] H. Groß, S. Mangard, and T. Korak. An efficient side-channel protected AES implementation
with arbitrary protection order. In H. Handschuh, editor, Topics in Cryptology - CT-RSA
2017 - The Cryptographers’ Track at the RSA Conference 2017, San Francisco, CA, USA,
February 14-17, 2017, Proceedings, volume 10159 of LNCS, pages 95–112. Springer, 2017.
[30] V. Grosso, E. Prouff, and F.-X. Standaert. Efficient masked S-boxes processing - A step
forward -. In D. Pointcheval and D. Vergnaud, editors, AFRICACRYPT 14, volume 8469 of
LNCS, pages 251–266. Springer, Heidelberg, May 2014.
[31] T. Güneysu and H. Handschuh, editors. CHES 2015, volume 9293 of LNCS. Springer,
Heidelberg, Sept. 2015.
[32] X. Guo, D. Mukhopadhyay, C. Jin, and R. Karri. Security analysis of concurrent error detection
against differential fault analysis. J. Cryptographic Engineering, 5(3):153–169, 2015.
[33] Y. Ishai, M. Prabhakaran, A. Sahai, and D. Wagner. Private circuits II: Keeping secrets in
tamperable circuits. In S. Vaudenay, editor, EUROCRYPT 2006, volume 4004 of LNCS, pages
308–327. Springer, Heidelberg, May / June 2006.
[34] Y. Ishai, A. Sahai, and D. Wagner. Private circuits: Securing hardware against probing
attacks. In D. Boneh, editor, CRYPTO 2003, volume 2729 of LNCS, pages 463–481. Springer,
Heidelberg, Aug. 2003.
[35] N. Joshi, K. Wu, and R. Karri. Concurrent error detection schemes for involution ciphers.
In M. Joye and J.-J. Quisquater, editors, CHES 2004, volume 3156 of LNCS, pages 400–412.
Springer, Heidelberg, Aug. 2004.
[36] M. Joye, P. Manet, and J. Rigaud. Strengthening hardware AES implementations against
fault attacks. IET Information Security, 1(3):106–110, 2007.
[37] M. G. Karpovsky, K. J. Kulikowski, and A. Taubin. Differential fault analysis attack resistant
architectures for the advanced encryption standard. In J. Quisquater, P. Paradinas, Y. Deswarte,
and A. A. E. Kalam, editors, CARDIS 2004, 22-27 August 2004, Toulouse, France, volume
153 of IFIP, pages 177–192. Kluwer/Springer, 2004.
[38] R. Karri, G. Kuznetsov, and M. Gössel. Parity-based concurrent error detection of substitution-permutation network block ciphers. In C. D. Walter, Ç. K. Koç, and C. Paar, editors, CHES 2003, volume 2779 of LNCS, pages 113–124. Springer, Heidelberg, Sept. 2003.
[39] R. Karri, K. Wu, P. Mishra, and Y. Kim. Concurrent error detection schemes for fault-based
side-channel cryptanalysis of symmetric block ciphers. IEEE Trans. on CAD of Integrated
Circuits and Systems, 21(12):1509–1517, 2002.
[40] M. Keller, E. Orsini, and P. Scholl. MASCOT: Faster malicious arithmetic secure computation
with oblivious transfer. In E. R. Weippl, S. Katzenbeisser, C. Kruegel, A. C. Myers, and
S. Halevi, editors, ACM CCS 16, pages 830–842. ACM Press, Oct. 2016.
[41] P. C. Kocher. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other
systems. In N. Koblitz, editor, CRYPTO’96, volume 1109 of LNCS, pages 104–113. Springer,
Heidelberg, Aug. 1996.

[42] P. C. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In Wiener [64], pages 388–397.

[43] V. Lomné, T. Roche, and A. Thillard. On the need of randomness in fault attack countermea-
sures - application to AES. In G. Bertoni and B. Gierlichs, editors, FDTC 2012, pages 85–94.
IEEE Computer Society, 2012.

[44] T. Malkin, F. Standaert, and M. Yung. A comparative cost/security analysis of fault attack
countermeasures. In L. Breveglieri, I. Koren, D. Naccache, and J. Seifert, editors, FDTC 2006,
volume 4236 of LNCS, pages 159–172. Springer, 2006.

[45] M. Medwed, F.-X. Standaert, J. Großschädl, and F. Regazzoni. Fresh re-keying: Security
against side-channel and fault attacks for low-cost devices. In D. J. Bernstein and T. Lange,
editors, AFRICACRYPT 10, volume 6055 of LNCS, pages 279–296. Springer, Heidelberg, May
2010.

[46] S. Mitra and E. J. McCluskey. Which concurrent error detection scheme to choose? In Proceedings IEEE International Test Conference 2000, Atlantic City, NJ, USA, October 2000, pages 985–994. IEEE Computer Society, 2000.

[47] A. Moradi, A. Poschmann, S. Ling, C. Paar, and H. Wang. Pushing the limits: A very compact
and a threshold implementation of AES. In Paterson [53], pages 69–88.

[48] D. Mukhopadhyay. An improved fault based attack of the advanced encryption standard.
In B. Preneel, editor, AFRICACRYPT 09, volume 5580 of LNCS, pages 421–434. Springer,
Heidelberg, June 2009.

[49] NANGATE. The NanGate 45nm Open Cell Library. Available at http://www.nangate.com.

[50] J. B. Nielsen, P. S. Nordholt, C. Orlandi, and S. S. Burra. A new approach to practical


active-secure two-party computation. In Safavi-Naini and Canetti [60], pages 681–700.

[51] S. Nikova, C. Rechberger, and V. Rijmen. Threshold implementations against side-channel


attacks and glitches. In P. Ning, S. Qing, and N. Li, editors, ICICS 06, volume 4307 of LNCS,
pages 529–545. Springer, Heidelberg, Dec. 2006.

[52] S. Nikova, V. Rijmen, and M. Schläffer. Secure hardware implementation of non-linear functions
in the presence of glitches. In P. J. Lee and J. H. Cheon, editors, ICISC 08, volume 5461 of
LNCS, pages 218–234. Springer, Heidelberg, Dec. 2009.

[53] K. G. Paterson, editor. EUROCRYPT 2011, volume 6632 of LNCS. Springer, Heidelberg, May
2011.

[54] S. Patranabis, A. Chakraborty, P. H. Nguyen, and D. Mukhopadhyay. A biased fault attack on


the time redundancy countermeasure for AES. In S. Mangard and A. Y. Poschmann, editors,
COSADE 2015. Revised Selected Papers, volume 9064 of LNCS, pages 189–203. Springer, 2015.

[55] E. Prouff and M. Rivain. Masking against side-channel attacks: A formal security proof. In
T. Johansson and P. Q. Nguyen, editors, EUROCRYPT 2013, volume 7881 of LNCS, pages
142–159. Springer, Heidelberg, May 2013.

[56] O. Reparaz, B. Bilgin, S. Nikova, B. Gierlichs, and I. Verbauwhede. Consolidating masking


schemes. In R. Gennaro and M. J. B. Robshaw, editors, CRYPTO 2015, Part I, volume 9215
of LNCS, pages 764–783. Springer, Heidelberg, Aug. 2015.

[57] O. Reparaz, B. Gierlichs, and I. Verbauwhede. Fast leakage assessment. In Fischer and Homma
[22], pages 387–399.

[58] M. Rivain and E. Prouff. Provably secure higher-order masking of AES. In S. Mangard
and F.-X. Standaert, editors, CHES 2010, volume 6225 of LNCS, pages 413–427. Springer,
Heidelberg, Aug. 2010.

[59] T. Roche and E. Prouff. Higher-order glitch free implementation of the AES using secure multi-
party computation protocols - extended version. J. Cryptographic Engineering, 2(2):111–127,
2012.

[60] R. Safavi-Naini and R. Canetti, editors. CRYPTO 2012, volume 7417 of LNCS. Springer,
Heidelberg, Aug. 2012.

[61] T. Schneider and A. Moradi. Leakage assessment methodology - A clear roadmap for side-
channel evaluations. In Güneysu and Handschuh [31], pages 495–513.
[62] T. Schneider, A. Moradi, and T. Güneysu. ParTI – towards combined hardware countermea-
sures against side-channel and fault-injection attacks. In M. Robshaw and J. Katz, editors,
CRYPTO 2016, Part II, volume 9815 of LNCS, pages 302–332. Springer, Heidelberg, Aug.
2016.
[63] O. Seker, T. Eisenbarth, and R. Steinwandt. Extending glitch-free multiparty protocols to
resist fault injection attacks. IACR Cryptology ePrint Archive, 2017:269, 2017.
[64] M. J. Wiener, editor. CRYPTO’99, volume 1666 of LNCS. Springer, Heidelberg, Aug. 1999.

[65] S. Yen and M. Joye. Checking before output may not be enough against fault-based cryptanal-
ysis. IEEE Trans. Computers, 49(9):967–970, 2000.
M&M: Masks and Macs
against Physical Attacks

Publication Data

Lauren De Meyer, Victor Arribas, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. M&M: Masks and Macs against Physical Attacks. IACR Transactions
on Cryptographic Hardware and Embedded Systems, 2019(1), pages 25-50.

My Contribution

One of the main authors.


M&M: Masks and Macs against Physical Attacks


Lauren De Meyer¹, Victor Arribas¹, Svetla Nikova¹, Ventzislav Nikov² and Vincent Rijmen¹

¹ KU Leuven, imec - COSIC, Belgium, firstname.lastname@esat.kuleuven.be
² NXP Semiconductors, Belgium, venci.nikov@gmail.com

Abstract. Cryptographic implementations on embedded systems need to be protected against


physical attacks. Today, this means that apart from incorporating countermeasures against
side-channel analysis, implementations must also withstand fault attacks and combined attacks.
Recent proposals in this area have shown that there is a big tradeoff between the implementation
cost and the strength of the adversary model. In this work, we introduce a new combined
countermeasure M&M that combines Masking with information-theoretic MAC tags and
infective computation. It works in a stronger adversary model than the existing scheme ParTI,
yet is a lot less costly to implement than the provably secure MPC-based scheme CAPA. We
demonstrate M&M with a SCA- and DFA-secure implementation of the AES block cipher. We
evaluate the side-channel leakage of the second-order secure design with a non-specific t-test
and use simulation to validate the fault resistance.
Keywords: SCA, DFA, combined, countermeasure, masking, CAPA, ParTI, embedded, infective
computation

1 Introduction
The implementation of cryptographic algorithms in embedded systems should be done with extreme
care. Physical attacks are proliferating considerably and they are becoming easier and cheaper to
perform. The most important physical attacks are Side-Channel Analysis (SCA), a non-invasive
attack that exploits the physical leakages emanating from the device (power consumption or
electromagnetic radiation among others) and Fault Attacks (FA), in which an adversary induces
and exploits logical errors in the computation. These attacks are commonly used to retrieve secret
data from the embedded device and can be executed either separately or combined. The most
threatening attacks are differential power analysis (DPA) [KJJ99] for SCA and differential fault
analysis (DFA) [BS97] and fault sensitivity analysis (FSA) [LSG+ 10] for FA.
In the case of SCA, a popular and established countermeasure is masking [ISW03, NRR06, PR11,
NRS11, BGN+ 14, RBN+ 15, GMK16, GM17], a secret sharing-based method in which intermediate
variables are stochastically split into multiple shares in order to make the side-channel-leaked
information independent of sensitive data. To protect against fault injections there are two major
countermeasures, as noted in [LRT12]: The first, Detection, checks whether the algorithm was
faulted during the execution by using either area or time redundancy (e.g. duplication [BECN+ 06],
concurrent error detection [BBK+ 03, KKG03, KKT04], . . . ). The problem with duplication is
that it does not provide security when faults are duplicated as well. Even with error-detecting
codes, a powerful attacker can avoid detection if the injected faults result in valid codewords. The
second approach, Infection, prevents an adversary from extracting secret information from a faulty
ciphertext by ensuring that any induced fault results in a garbage output [GST12]. So far, all
infective computation schemes have been broken [BG13].
The research direction of combined countermeasures - that is, countermeasures against both SCA
and FA - is quite young and experimental. A popular methodology is to superpose two techniques
that separately resist one family of attacks. Examples of schemes that combine masking against
SCA with redundancy against FA are ParTI [SMG16] and Private Circuits II [IPSW06, CN16].
These countermeasures naturally inherit the drawbacks of redundancy, that is, they are vulnerable
against the injection of smart undetectable faults. Moreover, implementing a checking mechanism
that does not reveal sensitive information under combined attacks is a difficult task. More recently,

an actively secure multi-party computation protocol was adapted to the context of embedded
systems in order to provide security against combined attacks [RDB+ 18]. The resulting combined
countermeasure benefits from very strong formal security guarantees, but is extremely expensive to
implement in hardware. A combination of duplication and infection is explored in [LRT12], but this
scheme was broken in [BG13]. Infective computation is also combined with polynomial masking
in [SFRES18]. These schemes alleviate the need for a checking mechanism, but as a result cannot
give an honest user any indication on whether or not the chip has been tampered with.

Our Contribution In this work, we describe M&M, a new family of countermeasures that extends
any SCA-secure masking scheme with information-theoretic MAC tags against DFA (i.e. Masks &
MACs) and combines them with an infective computation mechanism. By instantiating M&M with
a dth -order secure masking scheme, one achieves generic order of protection for SCA. The M&M
construction then ensures generic order of protection against DFA and the combination of SCA and
DFA. As opposed to error detecting codes, the MAC mapping is perfectly unpredictable, eliminating
the possibility of smart undetectable faults. This makes M&M secure against stronger adversaries
than when error detecting codes are used. We demonstrate M&M with first- and second-order
secure implementations of the AES cipher. This example shows that M&M can be very efficient in
area with an overhead factor of merely 2.53 compared to an implementation that protects only
against SCA. We perform a SCA evaluation of our implementations where no leakage is found with
up to 100 million traces. Additionally, we design and perform a fault evaluation to confirm our
theoretically claimed fault coverage.

Scheme Overview We revisit the infective computation scheme of [LRT12], which uses a redundant
encryption of the plaintext and uses the difference between the two ciphertexts to infect the output.
That is, if the ciphertexts match, the output is exactly that ciphertext. If the ciphertexts do not
match, the output is randomized so the attacker cannot get any information from it. The general
idea is illustrated in Figure 1. For more details, we refer to the original work.

[Figure: two parallel Enc blocks produce c and c′ from the plaintext; an Infect block combines them into the output c̃]

Figure 1: Infective Computation Scheme such as that of [LRT12]

This scheme was broken in [BG13] because of a bias on the randomized output. We make two
important changes. First, instead of using redundancy, which is vulnerable to the injection of
identical faults, we replace the second instantiation of the cipher with a computation on information-
theoretic MAC tags of the plaintext. If faults occur anywhere in the computation, the output of this
block does not correspond to a valid MAC tag of the ciphertext with arbitrarily high probability.
We use the difference between what the MAC tag should be and what it actually is, to randomize
the ciphertext without any bias. This is illustrated in Figure 2. We also ensure that one can find
out whether the ciphertext is correct or not. In a way, we thus combine the advantages of detection
and infection.
The computation on masks and MACs resembles the approach of [RDB+ 18]. However, instead
of using expensive MPC machinery, we devise new constructions for generic field operations using
existing SCA-secure gadgets.
In Section 2, we introduce our adversarial model. Section 3 presents our framework of shared
data and information-theoretic MAC tags and the basic M&M building blocks that are subsequently

[Figure: the plaintext p enters Enc and its tag MAC(p) enters EncMac; the outputs c and MAC(c) feed the Infect block, which outputs c̃]

Figure 2: Our scheme

used in Section 4 to do more complex computations. In particular, we describe M&M blocks for
elementary Galois Field operations, which can be used to construct the encryption blocks Enc and
EncMac . In Section 5 we describe how the shared ciphertext and MAC tags are used in an infective
computation. This is followed by a discussion of the security in Section 6. Finally in Section 7, we
demonstrate our scheme with an implementation and practical evaluation of the AES cipher.

2 Adversarial Model
In this work, we consider a semi-invasive adversary with probing and faulting capabilities. On the
one hand, we work under the d-probing model introduced in [ISW03] for SCA, providing security
against dth -order side-channel analysis attacks under the independent leakage assumption. The
model can include or exclude hardware glitches, but in this work, we specifically instantiate M&M
considering glitches.
On the other hand, we consider two types of faults. We model faults as stochastic additive
errors: this means the effect of a fault is the XOR of the current state with an error variable
following some random distribution. This adversary model is very similar to the one described
in [SMG16]. However, in this work, we do not limit the adversary in the number of bits he can
alter, since we present a scheme which can tolerate multiple faults with any Hamming weight.
In addition, we allow the attacker to inject non-stochastic faults (for example very precise laser
injections or stuck-at faults). In that case however, the faults must be restricted to affect at most d
of the d + 1 shares. We can justify this limitation by a proper placement of the circuit on the chip
and the more complex setup of these kinds of faults.

3 M&M: The Basics


In this section, we describe the M&M framework and its most fundamental M&M building blocks,
i.e. field multiplication and squaring in GF(2k ). We omit descriptions of trivial linear operations
such as addition and scaling. Using these blocks, it is possible to secure any circuit against both
SCA and DFA. Indeed, one only has to replace each AND-gate (resp. XOR-gate) with a M&M
multiplication (resp. addition) in the field GF(2). We note that the proposed approach is only meant
for the datapath. The control logic and public constants do not require a combined countermeasure
as they are only vulnerable to FA.

3.1 The M&M Framework


Notation. x denotes a (d + 1)-sharing (x0, . . . , xd) of an element x ∈ GF(2^k) such that x = x0 + . . . + xd, with “+” denoting addition in the Galois field GF(2^k). Additionally, “·” is a field multiplication in GF(2^k) and “⊙” a shared (i.e. SCA-protected) field multiplication in GF(2^k): x ⊙ y = z ⇔ x · y = z. Upright bold font is used for bit vectors x ∈ (GF(2))^k and matrices M ∈ (GF(2))^{k×k}.

Information-theoretic MAC tags. Detection of faults in the computation is achieved by accompanying each intermediate variable x ∈ GF(2^k) with an information-theoretic MAC tag. Let
α ∈ GF(2k ) denote a MAC key, which must be fresh for every encryption. For each x ∈ GF(2k ),
we have a MAC tag τ x = α · x. Note that, if α were fixed and identical for all encryptions, the
values and tags would be equivalent to an error detecting code. Security against faults is based on
the fact that the MAC key α is secret. Without knowledge of α, an adversary cannot forge a valid
tag τ x̃ for a faulty x̃. Its best strategy is guessing α, which offers a success probability of 2−k . If
the field GF(2k ) is too small, one can assign to each intermediate x multiple MAC tags τ x [j], each
for a different MAC key α[j] for j = 1, . . . , m. The success probability of the adversary is then at
most 2−km . For readability, we will assume m = 1 unless otherwise mentioned.
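A few lines of Python make the detection argument concrete; the GF(2^8) routine below uses the AES field polynomial, and the chosen values are illustrative only, not part of the scheme.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

x = 0xC3                    # some sensitive intermediate value
x_tilde = x ^ 0x01          # the same value after an additive fault (delta = 0x01)

# Count the MAC keys alpha for which the fault goes unnoticed, i.e. for which
# the old tag alpha*x is still a valid tag of x_tilde.
undetected = sum(1 for alpha in range(256)
                 if gf_mul(alpha, x_tilde) == gf_mul(alpha, x))
assert undetected == 1      # only alpha = 0 misses it: detection probability 1 - 2^-8
```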

Data representation against SCA and DFA. With security against both side-channel analysis and faults in mind, we port the information-theoretic MAC tags to the shared domain. This means that every intermediate x ∈ GF(2^k) is represented by ⟨x⟩ = (x, τ^x) with value shares x = (x0, . . . , xd) such that x0 + . . . + xd = x and tag shares τ^x = (τ0^x, . . . , τd^x) such that τ0^x + . . . + τd^x = τ^x. The MAC key itself is also shared: α = (α0, . . . , αd). Note that the MAC key α authenticates the sensitive value x itself and not merely its shares x_i. Hence, the tag shares τ_i^x are not tags of the value shares x_i but rather a share of the tag τ^x:

τ_i^x ≠ α · x_i,    Σ_i τ_i^x = α · Σ_i x_i
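This sharing of values and tags can be sketched as follows; the `share` helper and the key value are illustrative, not part of the specification.

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def share(v, d):
    """Split v into d + 1 uniformly random XOR-shares."""
    s = [secrets.randbelow(256) for _ in range(d)]
    return s + [reduce(xor, s, v)]

d = 2
alpha = 0x4F                          # illustrative MAC key
x = 0x5A
x_sh = share(x, d)                    # value shares x_0, ..., x_d
tau_sh = share(gf_mul(alpha, x), d)   # shares of the tag tau^x = alpha * x

# The tag shares recombine to alpha times the recombined value shares ...
assert reduce(xor, tau_sh) == gf_mul(alpha, reduce(xor, x_sh))
# ... even though an individual tag share is, in general, not a tag of a value share.
```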

Shared multiplication. In what follows, we describe how M&M extends an existing SCA-secure
masking scheme with protection against faults. Literature provides us with many secure Boolean
masking schemes to choose from [ISW03, NRR06, PR11, NRS11, BGN+ 14, RBN+ 15, GMK16].
Each of those is defined by how it performs nonlinear operations (i.e. a multiplication) on Boolean
shares. A specific instantiation of M&M thus depends on the choice of how to implement the shared
multiplication operation x ⊙ y = z. We assume that this operation transforms (d + 1)-sharings of two variables x and y into a (d + 1)-sharing of their product z = xy. The latency and randomness
cost of this operation depends on the choice of SCA countermeasure. However, we assume for now
that the latency is one clock cycle, since this is the case in most schemes. For further discussion on
the latency of the multiplication gadgets, see Section 7.1.
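As a purely functional illustration (Python offers no side-channel guarantees, so this sketch only shows the recombination property), one classical instantiation of ⊙ is the ISW multiplication [ISW03]:

```python
import secrets
from functools import reduce
from operator import xor

def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def share(v, d):
    """Split v into d + 1 uniformly random XOR-shares."""
    s = [secrets.randbelow(256) for _ in range(d)]
    return s + [reduce(xor, s, v)]

def shared_mul(x_sh, y_sh):
    """ISW-style multiplication: maps (d+1)-sharings of x and y to one of x*y."""
    n = len(x_sh)
    z = [gf_mul(x_sh[i], y_sh[i]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = secrets.randbelow(256)   # fresh randomness per share pair
            z[i] ^= r
            z[j] ^= r ^ gf_mul(x_sh[i], y_sh[j]) ^ gf_mul(x_sh[j], y_sh[i])
    return z

x_sh, y_sh = share(0x57, 2), share(0x83, 2)
z_sh = shared_mul(x_sh, y_sh)
assert reduce(xor, z_sh) == gf_mul(0x57, 0x83)   # recombines to the product
```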

3.2 Basic M&M Building Blocks


M&M Multiplication. Given two operands ⟨x⟩ and ⟨y⟩, we want to compute the value shares z and tag shares τ^z of z = xy. The value shares z can naturally be obtained using a shared multiplier with x and y as inputs. From this point, for ease of notation, we use xy to denote a sharing of the product xy.

x ⊙ y = xy = z
Deploying the same multiplier for the tag shares does not result in a valid tag for z:

τ^x ⊙ τ^y = (αx)(αy) = α²z ≠ τ^z

However, if we use the result above in another shared multiplication with a sharing of α^{-1}, we can obtain the correct tag shares:

α^{-1} ⊙ (τ^x ⊙ τ^y) = αz = τ^z

Note that we could also obtain these tag shares by either x ⊙ τ^y or τ^x ⊙ y, but we want to avoid crossing the datapaths of value and tag shares, so that faults introduced in the values cannot automatically propagate to the tags. Consider for example a fault injected on input x, resulting in x̃. Then x̃ ⊙ τ^y is a valid tag for x̃ ⊙ y.
The M&M multiplication is summarized in the left side of Figure 3. We assume the operation
includes one register stage, since this is the case for most state-of-the-art d-secure multipliers.
The value-datapath of a M&M multiplication thus requires one clock cycle whereas that of the tags
requires two.

[Figure: left, value shares x and y enter ⊙ to give z, while tag shares τ^x and τ^y enter ⊙ followed by ⊙ with α^{-1} to give τ^z; right, x passes through ⋆² to give z, while τ^x passes through ⋆² followed by ⊙ with α^{-1} to give τ^z]

Figure 3: M&M nonlinear operations: Obtaining the value and tag shares of z = xy (left) and z = x² (right)

This multiplication uses a sharing of the inverse of the MAC key α−1 . We assume this is made
available together with the sharing of the MAC key itself. If this is not the case, α−1 can be
precomputed and stored.
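The tag-correction algebra of the M&M multiplication can be checked exhaustively on unshared values; the helpers below are illustrative:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_pow(x, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

x, y = 0x57, 0x83
z = gf_mul(x, y)
for alpha in range(1, 256):                  # every nonzero MAC key
    ainv = gf_pow(alpha, 254)                # alpha^-1 by Fermat's little theorem
    tx, ty = gf_mul(alpha, x), gf_mul(alpha, y)
    # tau^x * tau^y = alpha^2 * z is not a valid tag; one multiplication
    # by alpha^-1 corrects it to tau^z = alpha * z.
    assert gf_mul(ainv, gf_mul(tx, ty)) == gf_mul(alpha, z)
```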

M&M Squaring. Note that squaring in M&M follows the same procedure. That is, to obtain ⟨z⟩ from ⟨x⟩ such that z = x², we first square the value shares x and tag shares τ^x. Since a characteristic-two finite field allows (a + b)² = a² + b², squaring in the shared domain is a local operation that requires no registers: x² = (x0², . . . , xd²) and (τ^x)² = ((τ0^x)², . . . , (τd^x)²). We then again calculate the tag shares τ^z by a shared multiplication with the inverse of the MAC key α^{-1}. The M&M squaring operation therefore takes one clock cycle and is depicted in the right side of Figure 3. The local squaring operation of shared data is depicted as ⋆². Extending this, exponentiations by a power of two (x^{2^l}) take l clock cycles for the tags.

The Field GF(2). Note that in the case of bits (the field GF(2)), no correction of the MAC tag is needed, since we then have α² = α, i.e. τ^x ⊙ τ^y = τ^z. As a result, the M&M multiplication in GF(2) has the same latency as the SCA-secure multiplication, and M&M squaring (as well as exponentiation by a power of two) is local.

4 Building circuits with M&M


In this section we describe how to use the above building blocks to construct circuits for more complex operations in the M&M framework. Specifically, we demonstrate the methodology for an inversion in GF(2^k), k > 2 (see footnote 1), which is of course of particular interest for implementing the AES S-box. Furthermore, we introduce a method for processing an affine transformation over bits, as used as well in the AES S-box.

4.1 Galois Field Inversion


We discuss two methods to do an inversion in Galois Field GF(2k ). The first constructs a
multiplication chain from the M&M multiplication and squaring blocks described above. This
methodology is generic and can be applied to any S-box since any S-box can be presented as a
polynomial over the considered field. The second version uses a SCA-secure inversion implementation
from existing literature to build a secure M&M block. This approach is specific for the AES S-box,
but results in a more efficient implementation.

4.1.1 Version 1: “Generic”


For x ∈ GF(2^8), the inversion x^{-1} is equivalent to the power map x^254. We can obtain this function via the following power chain [GPS14]:

x^254 = x^4 · (((x^5)^5)^5)^2
1 In GF(2) and GF(2²), inversion is trivial as it corresponds respectively to the identity and the squaring function.

Since x^5 = (x²)² · x, this inversion requires seven M&M squarings and four M&M multiplications. Using the above squaring and multiplication blocks, obtaining the inversion output thus requires 15 clock cycles.
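The power chain can be verified directly in GF(2^8); `pow5` is an illustrative helper mirroring the (t²)²·t structure:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def pow5(t):
    """t^5 = (t^2)^2 * t, mirroring the structure of the specialized block."""
    t2 = gf_mul(t, t)
    return gf_mul(gf_mul(t2, t2), t)

for x in range(256):
    x2 = gf_mul(x, x)
    x4 = gf_mul(x2, x2)
    x125 = pow5(pow5(pow5(x)))              # ((x^5)^5)^5
    x254 = gf_mul(x4, gf_mul(x125, x125))   # x^4 * (x^125)^2
    # x^254 is the field inverse (and 0 maps to 0, as in the AES S-box)
    assert gf_mul(x, x254) == (1 if x else 0)
```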

Optimization 1. The calculation can be sped up using a specialized block for the exponentiation
to the power five, shown in Figure 4. This is also done in [GPS14] and justifies the choice of
multiplication chain. In our case however, it is not trivial to do the same optimization for the
tag calculation. The operation ⋆⁵ raises the shares of x to the power five in one clock cycle. This requires a local computation of x^4 = (x0^4, . . . , xd^4), followed by a shared multiplication x ⊙ x^4 = x^5.
This must be done with care since x and x4 are essentially the same variable and multiplying them
may break non-completeness due to the dependencies among these two variables. We therefore
precompute and refresh x4 one cycle before it is used.
After this first stage, which takes one clock cycle, we thus obtain a sharing of x5 in the value
datapath:

(x0^4, . . . , xd^4) = x^4
x ⊙ x^4 = x^5

and a sharing of (τ^x)^5 = α^5 x^5 in the tag datapath:

((τ0^x)^4, . . . , (τd^x)^4) = (τ^x)^4
τ^x ⊙ (τ^x)^4 = (τ^x)^5 ≠ τ^{x^5}

A valid tag for x^5 is obtained through one more shared multiplication with α^{-4}, which is easily obtained locally from α^{-1}:

α^{-4} ⊙ (τ^x ⊙ (τ^x)^4) = τ^{x^5}

[Figure: x passes through ⋆⁵ to give z; τ^x passes through ⋆⁵ followed by ⊙ with α^{-4} to give τ^z]

Figure 4: Obtaining the value and tag shares of z = x⁵

With this specialized block, one obtains the exponentiation to the power five of ⟨x⟩ in two clock cycles, if (x^4, (τ^x)^4) is already refreshed beforehand. In those two clock cycles, we can obtain both the output ⟨z⟩ and a refreshed (z^4, (τ^z)^4) to be ready for the next block. The value shares can be refreshed using the second register stage (i.e. while the tag shares are being multiplied with α^{-4}). For the tag shares, there is no spare register stage for refreshing. In the shared multiplication with α^{-4}, we therefore raise each crossproduct to the power four. Before the register stage, we thus create a (d + 1)²-sharing of both τ^z and of (τ^z)^4, each using its own randomness. As a result, the inversion result is available in only ten clock cycles:

• One cycle for the preparation of refreshed x^4 and (τ^x)^4.
• The calculation of ⟨x^125⟩ = ⟨((x^5)^5)^5⟩ requires six cycles.
• In the next cycle, we square ⟨x^125⟩.
• The last two cycles are spent on the multiplication of the result ⟨x^250⟩ with ⟨x^4⟩ to obtain ⟨x^254⟩.

Optimization 2. By merging the last two operations into one step of two clock cycles, we reduce the total latency of the M&M inversion to nine cycles. This is possible because x^254 = f(x^4, x^125) with f(a, b) = a · b². We apply the same methodology as above. For the value shares:

(b0², . . . , bd²) = b²
a ⊙ b² = f(a, b)

For the tag shares:

((τ0^b)², . . . , (τd^b)²) = (τ^b)²
τ^a ⊙ (τ^b)² = α³ f(a, b) ≠ τ^{f(a,b)}
α^{-2} ⊙ (τ^a ⊙ (τ^b)²) = τ^{f(a,b)}
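The tag algebra of this merged step can again be checked on unshared values; an illustrative sketch:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_pow(x, e):
    """Square-and-multiply exponentiation in GF(2^8)."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def f(a, b):
    return gf_mul(a, gf_mul(b, b))           # f(a, b) = a * b^2

a, b = 0x1B, 0xC6
for alpha in range(1, 256):
    ta, tb = gf_mul(alpha, a), gf_mul(alpha, b)
    t = gf_mul(ta, gf_mul(tb, tb))           # tau^a * (tau^b)^2 = alpha^3 * f(a,b)
    ainv2 = gf_pow(gf_pow(alpha, 254), 2)    # alpha^-2
    assert gf_mul(ainv2, t) == gf_mul(alpha, f(a, b))   # corrected tag of f(a,b)
```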
Figure 5 summarizes the nine-stage pipeline that calculates the value and tag shares for an inversion of ⟨x⟩.

[Figure: nine-stage pipeline computing x^254 and its tag from ⟨x⟩, built from the ⋆⁵ blocks, a squaring, and a final multiplication]

Figure 5: Inversion pipeline. (Register stages are depicted by red dotted lines.)

4.1.2 Version 2: “Custom”


The AES S-box has been extensively studied in literature and an abundance of SCA-protected
implementations has already been proposed. When implementing the AES S-box in the M&M
framework, it only makes sense to exploit the results from this research. We can take the above
optimizations even further by merging all stages together into one. Consider applying a d-th-order secure shared inversion in GF(2^k) (denoted ⋆^{-1}) on both the value and tag shares. One obtains x^{-1} and (τ^x)^{-1} = α^{-1}x^{-1} ≠ τ^{x^{-1}}. Only a shared multiplication with α² is required to calculate the correct tag shares of x^{-1}. This is illustrated in Figure 6. Again, it is easy to obtain α² by locally squaring the shares of α.

[Figure: x passes through ⋆⁻¹ to give z; τ^x passes through ⋆⁻¹ followed by ⊙ with α² to give τ^z]

Figure 6: M&M inversion: Obtaining the value and tag shares of z = x^{-1}
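The correction step can be checked on unshared values; `gf_inv` below is a Fermat-inverse helper used only for this illustrative sketch:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_pow(x, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def gf_inv(x):
    return gf_pow(x, 254)    # Fermat inverse; gf_inv(0) = 0 by convention

alpha = 0xB5                 # illustrative nonzero MAC key
a2 = gf_mul(alpha, alpha)
for x in range(1, 256):
    tx = gf_mul(alpha, x)    # tag of x
    # Inverting both paths gives x^-1 and (alpha*x)^-1 = alpha^-1 * x^-1;
    # one multiplication by alpha^2 turns the latter into the valid tag alpha * x^-1.
    assert gf_mul(a2, gf_inv(tx)) == gf_mul(alpha, gf_inv(x))
```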

4.2 Affine transformation over bits


The AES affine transformation at the end of the S-box, A(x) = L(x) + c, is affine over GF(2). The linear part of the transform, L(x), is a matrix multiplication operating on the bitvector of x. A sharing of L(x) is trivially obtained by applying the transform locally to each share of x. The same
cannot be said for the tag shares of x. In this section, we describe how to obtain the tag shares for
any linear transform of this type.

Isomorphisms. Consider the isomorphism φ between the finite field GF(2k ) and the vector space
(GF(2))k , that maps each element to its bitrepresentation vector, i.e. φ(2i ) = ei with ei the ith
unit vector. We denote the bitvector of x as φ(x) = x. For the linear transform, we thus have

φ(L(x)) = Lφ(x) = Lx

with L ∈ (GF(2))k×k the matrix that defines the linear transformation. One of the consequences of
this isomorphism is

∀α ∈ GF(2k ), ∃Mα ∈ (GF(2))k×k s.t. ∀x ∈ GF(2k ) : φ(αx) = Mα φ(x) = Mα x

Note that the opposite direction does not work: not for every M ∈ (GF(2))k×k there exists an
α ∈ GF(2k ) such that this relation holds.
Given a value x ∈ GF(2^k) and a corresponding tag τ^x = αx ∈ GF(2^k), we wish to obtain a tag τ^{L(x)} satisfying τ^{L(x)} = αL(x). We denote the bitvector of τ^x as φ(τ^x) = t = Mα x. We have

φ(τ^{L(x)}) = φ(αL(x)) = Mα φ(L(x)) = Mα L x = Mα L (Mα^{-1} t) = (Mα L Mα^{-1}) t

Hence, we can go from (x, τ x ) to (L(x), τ L(x) ) by applying L to the bitvector of x and Mα LMα−1 to
the bitvector of τ x . A similar approach is used in error detecting/correcting code schemes [BCC+ 14].
In our case however, the code depends on α and thus the matrix Mα LMα−1 is secret and different
in every encryption.

Calculating Mα. The matrix Mα is straightforward to find. The i-th column of Mα is Mα e_i = Mα φ(2^i) = φ(α·2^i). We therefore know that

Mα = [ φ(128α)  φ(64α)  . . .  φ(2α)  φ(α) ]

By seeing a matrix product as a compact way to describe k matrix-vector products

AB = [ Ab_{k−1}  Ab_{k−2}  . . .  Ab_0 ]

we get that

L Mα^{-1} = [ φ(L(128α^{-1}))  φ(L(64α^{-1}))  . . .  φ(L(α^{-1})) ]

and thus

Mα L Mα^{-1} = [ φ(αL(128α^{-1}))  φ(αL(64α^{-1}))  . . .  φ(αL(α^{-1})) ]
For each MAC key α, this matrix can be precomputed and stored in d + 1 shares, similar to the precomputed sharing of α^{-1}. In the linear transformation, we obtain the tag shares τ^{L(x)} by a shared matrix-vector multiplication with the sharing of Mα L Mα^{-1}. A shared matrix-vector multiplication can use the same equations as the SCA-secure multiplier ⊙, but with one of the inputs a matrix and with the field multiplication “·” replaced by matrix-vector products. Because of the register stage in the shared matrix-vector multiplication, the affine transformation requires one clock cycle.
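The matrix identities can be checked in a few lines; `L` below is the GF(2)-linear part of the AES affine map, and representing a matrix by its columns (one byte each) is an illustrative choice:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_pow(x, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def L(x):
    """GF(2)-linear part of the AES affine transformation."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4)

def mat_from_map(f):
    """8x8 GF(2) matrix of a linear map, stored as its columns f(2^i)."""
    return [f(1 << i) for i in range(8)]

def mat_vec(M, v):
    """Matrix-vector product over GF(2): XOR the columns selected by v's bits."""
    r = 0
    for i in range(8):
        if (v >> i) & 1:
            r ^= M[i]
    return r

alpha = 0x4D                                  # illustrative nonzero MAC key
ainv = gf_pow(alpha, 254)
M_alpha = mat_from_map(lambda v: gf_mul(alpha, v))
# M_alpha L M_alpha^-1, built column by column as phi(alpha * L(2^i * alpha^-1))
M_tag = mat_from_map(lambda v: gf_mul(alpha, L(gf_mul(ainv, v))))

for x in range(256):
    t = gf_mul(alpha, x)                      # tag of x
    assert mat_vec(M_alpha, x) == t           # phi(alpha*x) = M_alpha phi(x)
    assert mat_vec(M_tag, t) == gf_mul(alpha, L(x))   # valid tag of L(x)
```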

Latency Optimization Note that in cases such as AES, where the affine transformation follows
a power map, a little trick can ensure that the affine transformation does not increase the total
latency of an S-box evaluation. The last clock cycle of the inversion in §4.1.1 (resp. §4.1.2) is spent
on a shared multiplication of intermediate tag shares with α−2 (resp. α2 ). We can incorporate this
tag correction in the affine transformation by replacing the matrix Mα LMα−1 with Mα LMα−3
(resp. Mα LMα ).

5 Infective Computation
We have described above how to implement two encryption blocks: one that calculates ciphertext shares given plaintext shares, and another that calculates ciphertext tag shares from plaintext tag shares and MAC key shares α_i. We now consider the ciphertext block-per-block with blocksize k. Let c_i ∈ GF(2^k) be the shares of one ciphertext block and τ_i^c ∈ GF(2^k) the shares of the corresponding tag block. If the tags are consistent with the data, then Σ_i τ_i^c = α · Σ_i c_i.

5.1 The Problem with Error Checking


The use of infective computation in M&M is motivated by the difficulty of designing an error
checking mechanism that is secure against combined attacks. Consider the following algorithm that
computes the error on the tags E = τ c + α · c and verifies that it is zero. We assume sufficient
registers are in place to prevent glitch problems.

Check(c, τ^c, α)
  Let θ ← α ⊙ c
  for all shares i do
    Let E_i ← θ_i + τ_i^c
  end for
  E = Σ_i E_i
  Output (E == 0)
This checking algorithm is not secure in a combined adversary model with probing and faulting. When an attacker manages to insert a known fault ∆ in one share of the shared multiplication such that the check is performed with α′ = α + ∆, the unshared (zero) error E is replaced by

E′ = τ^c + α′ · c = τ^c + α · c + ∆ · c = ∆ · c

A single probe on the unshared E′ thus reveals the (faulty) ciphertext c. This attack defeats
the very purpose of the error check, which is to stop a faulty ciphertext from being released to the
adversary.
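The attack is easy to reproduce on unshared values; an illustrative sketch:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def gf_pow(x, e):
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

alpha = 0x3C                     # secret MAC key (shared in the real design)
c = 0xA7                         # faulty ciphertext byte the adversary is after
tau_c = gf_mul(alpha, c)         # tags consistent with c, so E would be 0

delta = 0x10                     # known fault injected into one share of alpha
E = tau_c ^ gf_mul(alpha ^ delta, c)   # unshared error computed by the faulted check

assert E == gf_mul(delta, c)     # E no longer depends on alpha ...
assert gf_mul(gf_pow(delta, 254), E) == c   # ... so one probe on E recovers c
```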

5.2 The Solution


We propose the following routine Infect, which is local except for one shared multiplication to
obtain a sharing of the correct MAC tag αc. Let R be a uniformly random mask ∈ GF(2k ) \ {0}.
Each share of the output ciphertext block is modified using this random mask and the difference of
the tags.

Infect(c, τ^c, α)
  Let θ ← α ⊙ c
  Draw R ←$ GF(2^k) \ {0}
  for all shares i do
    Let c̃_i ← c_i + R · (θ_i + τ_i^c)
  end for
  Output c̃
The scheme outputs a sharing of the adapted ciphertext block c̃ = Σ_i c̃_i = c + R · (α · c + τ^c). Thus, if the tags are consistent (α · c + τ^c = 0), the scheme outputs a sharing of the computed block c. On the other hand, if the tags do not match (α · c + τ^c ≠ 0), the unshared output c̃ is random.
One may note that generating a nonzero mask R is nontrivial. However, there must be a
PRNG with enough throughput to realize all the randomness for the computation of the S-box.
The number of random bits available for the routine Infect is thus much higher than the amount
required. From this, it is easy to generate one nonzero byte.
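An unshared functional sketch of Infect (the actual routine operates share-wise and uses a shared multiplier for α ⊙ c):

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def infect(c, tau_c, alpha, R):
    """Unshared sketch: c_tilde = c + R * (alpha*c + tau_c)."""
    assert R != 0
    return c ^ gf_mul(R, gf_mul(alpha, c) ^ tau_c)

alpha, c = 0x9E, 0x42
tau_c = gf_mul(alpha, c)
for R in range(1, 256):
    assert infect(c, tau_c, alpha, R) == c            # consistent tags pass through
    assert infect(c, tau_c ^ 0x01, alpha, R) != c     # any tag error randomizes the output
```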

Unbiased Randomization We verify that the infected ciphertexts are uniformly distributed, so
the attacker cannot obtain any information from them. Consider the case when the computation
of Enc (and EncMac ) is disturbed by faults ∆c (resp. ∆τ ), resulting in the unshared infected
ciphertext block
c̃ = c + ∆c + R · (α · (c + ∆c ) + τ c + ∆τ )
We may assume ∆c is non-zero and unknown to the attacker. In fact, knowing the faulty ciphertext,
or indeed ∆c , is the goal of the adversary in a DFA attack. Furthermore, we assume a strong
probing adversary can know the value of the mask R, which is why we do not allow it to be zero.
However, the MAC key α is always secret due to sharing. These introduced faults (∆c , ∆τ ) remain
undetected with a probability of 2−km , corresponding to the case when ∆τ = α · ∆c . We therefore
claim an error detection probability (EDP) of 1 − 2^{-km} and focus now on the case when ∆τ ≠ α · ∆c.
Using the fact that α · c + τ c = 0, we rewrite c̃ = c + ∆c · (1 + R · α) + ∆τ · R. Clearly, the
unbiased randomization of the output depends on the uniformity of the mask (1 + R · α). It can be
verified that this mask is uniformly random in GF(2k ) when R is uniformly random in GF(2k ) \ {0}
and α uniformly random in GF(2^k). As a result, c̃ is uniformly random in GF(2^k) when ∆c ≠ 0.
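The bijectivity underlying this uniformity claim can be checked exhaustively; an illustrative sketch:

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

# For every nonzero mask R, alpha -> 1 + R*alpha is a bijection of GF(2^8),
# so for a uniform secret alpha the factor (1 + R*alpha) is uniform as well.
for R in range(1, 256):
    masks = {1 ^ gf_mul(R, alpha) for alpha in range(256)}
    assert len(masks) == 256
```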

5.3 Combining Infection and Detection


In many applications, outputting garbage when the chip is under attack suffices. There are however
some use cases, where one might want to know whether the outputted ciphertext is correct. If that
is the case, we propose to do the infective computation twice, with different masks and with different
MAC keys. The two resulting ciphertexts can be compared; if the computation was corrupted, the ciphertexts are distinct and randomized. Otherwise, they are identical.
The above adaptation requires that the number of MAC keys m is at least two. If m = 1 suffices
for security, we propose the following solution to avoid duplicating the tagsize just for the sake of
outputting two ciphertexts. A second MAC key β is created only for the infective computation
part and not for the cipher evaluation. The procedure Infect2 below describes how to obtain the
second ciphertext block c̃′ apart from the original c̃ obtained with Infect.

Infect2(c, τ^c, α^{-1}, β, R)
  Let θ′ ← β ⊙ c
  Let τ^{c′} ← α^{-1} ⊙ (β ⊙ τ^c)
  Draw R′ ←$ GF(2^k) \ {0, R}
  for all shares i do
    Let c̃′_i ← c_i + R′ · (θ′_i + τ_i^{c′})
  end for
  Output c̃′

The two unshared outputs are thus

c̃ = Σ_i c̃_i = c + R · (α · c + τ^c)
c̃′ = Σ_i c̃′_i = c + R′ · (β · c + β · α^{-1} · τ^c)

6 Security Analysis
In this section, we discuss the security of the M&M scheme. Note that M&M can be based on
several different Boolean masking schemes providing SCA secure multiplication, inversion and
refreshing gadgets in the chosen adversary model and thus the security of any instantiation depends
heavily on those choices.

6.1 Security against SCA.


By adhering to security principles such as non-completeness [BGN+ 14] and proper refreshing everywhere, M&M inherits the SCA security of the shared multiplication and inversion mechanisms used. A Boolean masking scheme that is secure in the considered SCA attacker model thus provides security against d-th-order SCA. The model can include or exclude hardware glitches: if the shared multiplication and inversion mechanisms used are also secure in the presence of glitches, M&M inherits this.
The computation of the tag shares follows the same design principles as the value share
calculations. The two datapaths operate completely independently of each other and receive their
own distinct fresh randomness (see Figure 7). It is important to note that the input sharings p and
τ p must be independent as well, which is easily achieved if the initial maskings of p and τ p are
obtained separately. The independence of the two datapaths ensures that their merging in the Infect block does not induce leakage on p or τ^p.

[Figure: the plaintext shares p feed Enc and the tag shares τ^p feed EncMac, each with its own RNG; their outputs c and τ^c, together with the mask R, enter the Infect block, which outputs c̃]

Figure 7: Overview of the scheme

The Refreshing Gadget. It is important that any refreshing mechanism used (cf. § 4.1.1) ensures
the same security as is provided by the used masking scheme. The kind of refreshing thus depends
on the targeted security order d [BBP+ 16] and the considered attacker model. In general, one can
always make use of the multiplication-based refresh gadget of Ishai et al. [ISW03]. It has been
shown in [BBD+ 16, Gadget 4b] that this refreshing ensures composability at any order. For a
specific target security level, randomness can be consumed more efficiently. For example, the ring
refreshing approach of [CRB+ 16] uses d + 1 fresh masks in a circular manner to refresh d + 1 shares.
This method suffices for second- and third-order security. At certain higher orders, one can use its
variant, offset refreshing, which still uses only d + 1 units of fresh randomness but rotates with an
offset of more than 1 [BBD+ 18, Alg. 2]. Finally, additive refreshing using only d fresh masks is
sufficient when first-order security is targeted. For a more detailed treatment of refreshing gadgets,
we refer to [BBD+ 18].

6.2 Security against FA.


The introduction of MAC tags to the circuit results in resistance against FA, in which faults are limited neither in Hamming weight nor in quantity. Faults remain undetected only if both the value and MAC tag shares are modified in such a way that the relation
    Σ_i τ^x_i = α · Σ_i x_i                    (1)

remains true. Since the MAC key α ∈ GF(2^km) is unknown to the adversary, any number of stochastic additive errors results in this relation with probability at most 2^-km. If the attacker has the ability to inject non-stochastic faults, our model restricts these to affect at most d of the d + 1 shares. In that case, the success probability still depends on the probability of guessing the secret MAC key α ∈ GF(2^km) correctly. We therefore claim an error detection probability (EDP) of 1 − 2^-km.
As stated in the adversarial model in §2, the faults we consider are limited neither in Hamming weight nor in quantity. The worst-case probability that (1) is satisfied is 2^-km, regardless of the Hamming weight of a single fault. The accumulated effect of multiple faults does not change this probability either. The adversary obtains no additional information after injecting faults, hence random shooting or guessing α remains the best strategy for subsequent faults. In the end, the same equation (1) in GF(2^km) must hold for the faults to remain undetected, which happens with probability 2^-km. We experimentally investigate the effect of multiple faults in Section 7.3.2.

The zero MAC key. The event that the MAC key is zero occurs with probability 2^-km. Since α is secret and shared, the adversary cannot know when the tags are zero and must still guess α to determine what fault to inject in the tag computation. An adversary strategy of not injecting faults in the tag computation, i.e. injecting faults only on the value computations, corresponds to guessing that α = 0 and succeeds with probability 2^-km. This is completely analogous to guessing, for example, that α = 1 and injecting identical faults in both the tag and value computation accordingly. Either by guessing α or by injecting a random fault, the adversary hits the correct value with probability 2^-km, corresponding to our claimed EDP of 1 − 2^-km. Hence, in theory, the case α = 0 is equivalent to any nonzero MAC key. In practice however, the strategy corresponding to guessing α = 0 is easier, since it requires fault injections only in the value datapath and not in the tag datapath. To avoid it, one could exclude the zero MAC key and reduce the EDP to 1 − (2^km − 1)^-1. This difference is negligible if km is sufficiently large. Note, however, that in that case the infective computation output phase can no longer be used, since it requires α to be uniformly random in GF(2^km).

Ineffective Faults. Apart from DFA and FSA, there is also an interesting branch of fault attacks
that exploits so-called ineffective faults. For example, a stuck-at-zero fault on a wire or set of
wires is ineffective when those wires already carry the zero value. This type of fault is naturally undetectable at the algorithm level, which makes it immune to both detection and infection
countermeasures. A flavour of Ineffective Fault Analysis (IFA) [Cla07] called Statistical Ineffective
Fault Analysis (SIFA) [DEK+ 18] has recently been proposed. SIFA collects a subset of correct
ciphertexts from a large number of faulted encryptions and exploits the fact that the intermediate
state of the algorithm is not uniformly distributed in this subset. This attack has been extended
to masked implementations in [DEG+ 18]. Ineffective faults (and thus SIFA) fall outside of our
adversary model since they are impossible to detect. Protection against such attacks can be provided
at a different level, for example using a protocol that erases the key as soon as a certain threshold
of faulty ciphertexts has been detected.

6.3 Security against combined attacks.


Having brought the MAC tags to the shared domain, our scheme also provides security against
combined attacks. Thanks to the fact that the MAC key α is shared, it remains secret even to a
probing adversary. As a result, the detection probability of injected faults does not change, even
when the adversary combines them with SCA. We recall that our model limits the injection of deterministic faults to d of the d + 1 shares. In case of a combined attack, the total number of shares affected by either faults or probes should thus still not exceed d.
Modern combined attacks such as PACA [AVFM07, CFGR10], which require a faulty ciphertext to succeed, are prevented since faulty ciphertexts are only released with probability 2^-km. The effective complexity of such attacks thus increases by a factor of at least 2^km. Thanks to the infective
computation, M&M is also secure against combined attacks that target the checking mechanism or
that exploit correlations on the faulty ciphertext [RLK11, DV12].
Although we are not aware of any combined attack against M&M, we cannot formally prove its security as CAPA [RDB+ 18] does. The only provable approach against combined attacks known so far is to adapt an actively secure MPC protocol; no other formal techniques have yet been found.

7 AES Case Study


M&M has been designed in a way that allows it to use either provably secure (e.g. SNI [BBD+ 15]) gadgets or more efficient but non-provably secure blocks. We now present specific AES implementations confirming the latter. We target first- and second-order security, as those are the relevant orders for realistic attacks, and use TVLA for these specific orders to demonstrate the security.
Our implementations are mere examples of the many different ways to instantiate M&M. Any
existing or future SCA-secure gadgets can be used as the underlying building blocks. In this section,
we first detail our choice of gadgets and investigate the implementation cost of the resulting AES

constructions. We then empirically validate our SCA claims using univariate and bivariate test
vector leakage assessment and we perform a simulation-based verification of the DFA security.

7.1 Implementation Details


In Section 4, we described essentially all components that are needed to construct an AES SubBytes
implementation: the inversion in GF(28 ) and the affine transformation over bits. We distinguish two
versions of the S-box implementation. One follows a rather generic methodology using multiplication
chains and the other is customized for the inversion. We investigate the implementation cost of
both implementations for first- and second-order SCA security (d = 1 or 2). For our specific
instantiations of M&M, we only claim first- and second-order security. Nevertheless, it is extendable
to higher-orders given a suitable choice of building blocks.
The remaining AES blocks (ShiftRows, MixColumns and AddRoundKey) consist exclusively of
linear operations and are thus trivially implemented for the M&M framework. More specifically,
the datapaths in both the Enc and EncMac blocks each consist of d + 1 copies of the AES state and
key arrays. These arrays contain data shares in the encryption block Enc and tag shares in the EncMac block. In total, the area cost of these blocks increases by a factor of (d + 1)(m + 1) compared to an unprotected implementation. Our implementation uses the same byte-serialized architecture as [GMK17].
We now detail our choice of multiplication and refresh gadgets which are used in the multiplication
chain version of the inversion (cf. version 1, Figure 5) and the inversion gadget used in version 2
(cf. Figure 6).

Multiplication gadgets. We choose a (d + 1)-share multiplier as our SCA-secure multiplier. The following equations describe, as an example, a three-share, second-order (d = 2) secure multiplication of shared variables x = (x0, x1, x2) and y = (y0, y1, y2) into z = (z0, z1, z2), using three units of randomness r0, r1, r2 ∈ GF(2^k) as in [GMK16]. We first calculate nine intermediate values t_ij for i, j ∈ {0, 1, 2}. After a register stage for synchronization of the t_ij, we compute the output shares z = (z0, z1, z2). The latency of the operation is thus one clock cycle.

    t00 = x0 · y0
    t01 = x0 · y1 + r0
    t02 = x0 · y2 + r1
    t10 = x1 · y0 + r0
    t11 = x1 · y1
    t12 = x1 · y2 + r2
    t20 = x2 · y0 + r1
    t21 = x2 · y1 + r2
    t22 = x2 · y2

    z0 = [t00]reg + [t01]reg + [t02]reg
    z1 = [t10]reg + [t11]reg + [t12]reg
    z2 = [t20]reg + [t21]reg + [t22]reg

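The second-order multiplication above is easy to check functionally in software. The following sketch (Python rather than the VHDL of the actual implementation; `gf_mul` is an assumed helper for GF(2^8) multiplication) implements the t_ij and z_i equations for three shares and verifies that the output shares recombine to x · y; the register stage matters only for glitch security, not for functional correctness.

```python
import secrets

def gf_mul(a, b, poly=0x11b):
    # carry-less multiplication modulo the AES polynomial
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def share(v, n=3):
    # Boolean (XOR) sharing of v into n shares
    s = [secrets.randbelow(256) for _ in range(n - 1)]
    last = v
    for si in s:
        last ^= si
    s.append(last)
    return s

def dom_mul(x, y):
    # three-share multiplication with fresh masks r0, r1, r2, as in [GMK16]
    r0, r1, r2 = (secrets.randbelow(256) for _ in range(3))
    t = [[gf_mul(x[i], y[j]) for j in range(3)] for i in range(3)]
    t[0][1] ^= r0; t[1][0] ^= r0
    t[0][2] ^= r1; t[2][0] ^= r1
    t[1][2] ^= r2; t[2][1] ^= r2
    return [t[i][0] ^ t[i][1] ^ t[i][2] for i in range(3)]

x_val, y_val = 0x57, 0x83
z = dom_mul(share(x_val), share(y_val))
assert z[0] ^ z[1] ^ z[2] == gf_mul(x_val, y_val)   # shares recombine to x*y
```

Each fresh mask appears in exactly two of the t_ij, so the masks cancel when all output shares are summed and the unshared product is preserved.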
The corresponding first-order construction is given in [GMK16]. Each multiplication requires (d+1 choose 2) fresh units of randomness. It has been shown in [FGMDP+ 18] that such a multiplication gadget is composable if the result is stored in a register. As shown in Figure 5, we present in this work a construction with such registers in the value share datapath but without extra registers in
the tag share datapath. Note that the intermediate registers of [FGMDP+ 18] are a requirement for SNI schemes in hardware, but the SNI property is not a prerequisite for a scheme to be secure, and we customized our design for better performance. We currently see no formal way to prove the security
of our construction, but because the tag shares can be seen as Boolean shares of a multiplicative
share of the secret, we judge that the tag shares do not need to be stored in registers in order to
obtain security. We verify our approach empirically using TVLA and do not detect any leakage (cf.
§7.3.1). Provable security can be achieved by adding registers to each of the tag share and value
share datapaths.

Refreshing gadgets. When d = 1, a (d + 1)-share variable x can be refreshed with d fresh random units using additive refreshing:

    (x_0, x_1, . . . , x_d) → (x_0 + r_0, x_1 + r_1, . . . , x_{d−1} + r_{d−1}, x_d + Σ_{i=0}^{d−1} r_i)

For second-order security, we use ring refreshing as in [CRB+ 16], which consumes d + 1 fresh
random units.
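Both refreshings are re-maskings that leave the unshared value invariant, which the sketch below checks (Python; it verifies only unshared correctness and says nothing about the probing security of either gadget).

```python
import secrets

def additive_refresh(x):
    # d fresh masks; the last share absorbs the sum of all masks
    r = [secrets.randbelow(256) for _ in range(len(x) - 1)]
    out = [xi ^ ri for xi, ri in zip(x, r)]
    last = x[-1]
    for ri in r:
        last ^= ri
    out.append(last)
    return out

def ring_refresh(x):
    # d + 1 fresh masks used circularly: share i receives r_i and r_{i-1}
    n = len(x)
    r = [secrets.randbelow(256) for _ in range(n)]
    return [x[i] ^ r[i] ^ r[(i - 1) % n] for i in range(n)]

x = [secrets.randbelow(256) for _ in range(3)]   # a 3-share variable (d = 2)
v = x[0] ^ x[1] ^ x[2]
for refresh in (additive_refresh, ring_refresh):
    y = refresh(x)
    assert y[0] ^ y[1] ^ y[2] == v               # unshared value unchanged
```

In both variants every fresh mask is XORed into exactly two positions (or once into a share and once into the absorbing share), so the XOR of all output shares equals the original secret.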

Figure 8: Ring and Additive Refreshing [CRB+ 16]

We use these refreshings for the shares of x^4, x^20 and x^100 in Figure 5, as well as for the shares of (τ^x)^4. We let r(d) be the corresponding randomness cost for refreshing d + 1 shares with security order d, i.e. r(1) = 1 and r(2) = 3. The shares of ((τ^x)^4)^5 and ((τ^x)^4)^25 are refreshed using (d+1 choose 2) units during the last multiplication ⊗ in the M&M power five block (cf. Figure 4). For d ∈ {1, 2}, we actually have that r(d) = (d+1 choose 2), so the cost of each refreshing is r(d). We note again that these types of refreshing gadgets are only to be used for first- and second-order security respectively, and are not secure for higher orders.

Inversion gadget. For the shared inversion in GF(2^8), ⊛^-1, we can use De Cnudde et al.'s d + 1-share [CRB+ 16] or Groß et al.'s Domain Oriented Masking [GMK17] AES S-box implementations. Both are based on Canright's compact S-box [Can05] using the tower field approach. We opt for the first, which requires five register stages. Together with the final shared multiplication with α^2, the latency of the M&M inversion in Figure 6 is thus six cycles.

7.2 Implementation Cost


Randomness. We summarize the randomness cost in Table 1.
Recall that for d ∈ {1, 2}, each shared multiplication ⊗ and each refreshing consume (d+1 choose 2) units of randomness, where each unit is a byte in the case of AES. The inversion pipeline of Figure 5 performs four M&M nonlinear operations (three times a^5 and once ab^2). Each of these requires exactly three ⊗'s, although we count one less for the f(a, b) = ab^2 block (because this operation is included in the affine transform). We count two additional ⊗'s with α^-1 for the computation of the tag shares of x^4 and one for the tag shares in the affine transform. This brings the total number of shared multiplications to 14. In addition, we need 6 refreshings to preserve the non-completeness in the power five exponentiations ⊛^5.
The inversion circuit from [CRB+ 16] consumes 54 (resp. 162) bits of randomness at first (resp. second) order. The M&M inversion in Figure 6 uses this circuit twice. Furthermore, the affine transformation adds one additional shared (matrix) multiplication.
Finally, the infective computation uses randomness for one shared multiplication and an
additional unit for the mask R. However, since the infection takes place when all SubBytes
evaluations have finished, the total randomness does not increase.

Latency. The first inversion from §4.1 requires nine clock cycles. In version 2, we use the ⊛^-1 implementation from [CRB+ 16], which results in an M&M inversion of six clock cycles. Recall from §4.2 that the AES affine transformation in M&M requires no additional cycles. In total, the AES S-box output is thus obtained in nine clock cycles with version 1 and six clock cycles with version 2.

Table 1: Randomness Cost for the AES S-box implementations

                                  # ⊗    # ⊛^-1   # Fresh   # Random Bits
                                                            in d                  d=1    d=2
  Shared Mult. (⊗)/Refresh         1       -        -       8·(d+1 choose 2)        8     24
  Shared Inv. (⊛^-1) [CRB+ 16]     -       1        -       -                      54    162
  Fresh Mask                       -       -        1       8                       8      8
  S-box V1                       14+6      -        -       160·(d+1 choose 2)    160    480
  S-box V2                         1       2        -       -                     116    348
  Infective Computation            1       -        1       8·(d+1 choose 2)+8     16     32

The byte-serialized architecture from [GMK17] is very efficient as it performs the MixColumns,
ShiftRows and AddRoundKey stages in parallel with the SubBytes stage. As a result, when the
S-box latency is C ≥ 4 cycles, one round of encryption (including key schedule) requires exactly
16 + C clock cycles: During the first 16 cycles, all the state bytes are fed to the S-box pipeline
input. The MixColumns operation is done in parallel every four cycles. The remaining C cycles are
spent waiting for the last S-box output. During the first four of these, the S-box can be used by
the key schedule. In the last cycle, the last S-box output is shifted into the state at the same time
as ShiftRows is performed.
Our two versions of the AES Encryption therefore require respectively 25 and 22 clock cycles
per encryption round.

Area. We report our area results for first- and second-order security in Table 2, together with latency and randomness cost. As expected, the more customized version of the S-box results in a much more efficient implementation. Note that many more tradeoffs between area, latency and randomness cost are possible, depending on design choices (e.g. multiplication chain) and the building blocks used (e.g. the shared inversion block ⊛^-1).

Table 2: Cost of first- and second-order AES implementations

                      Area [kGE]       Latency       # Random bits/cycle
                      d=1     d=2      [# cycles]    d=1     d=2
  State Array          5.6     8.5         -           -       -
  Key Array            4.2     6.3         -           -       -
  S-box
  • V1                19.2    42.0         9         160     480
  • V2                 6.5    13.5         6         116     348
  Control              0.2     0.2         -           -       -
  Infective Comp.      1.7     3.3         -          16      32
  Other                1.0     1.4         -           -       -
  Total
  • V1                31.9    61.7       266         160     480
  • V2                19.2    33.2       236         116     348

Comparison to State-of-the-art. In Table 3, we report our area results next to other state-of-the-
art schemes. Some of these protect only against SCA [CRB+ 16, GMK17] and some are combined
countermeasures, such as ParTI [SMG16] and CAPA [RDB+ 18]. For the latter, it is not easy to compare the results, given the differences in the ciphers implemented and the synthesis libraries used. We try
to overcome these differences by also reporting the overhead factor of the combined countermeasure,
compared to an implementation that provides only protection against SCA.
The ParTI countermeasure is applied to the LED scheme in [SMG16]. The authors report an area of 20.2 kGE obtained with a UMC 0.18µm library [Inc04], compared to 7.9 kGE for the SCA-only first-order secure LED implementation. This signifies an area overhead factor of 20.2/7.9 = 2.56.
For the combined countermeasure CAPA, we can compute the overhead over a SCA-secure KATAN
implementation [RDB+ 18, Table 2]. Finally, we compare our first-order M&M AES (V2) with De
Cnudde’s [CRB+ 16] first-order implementation against SCA only. Table 3 reports the area of the
implementation that is only secure against SCA in the fourth column and the area of the combined countermeasure in the fifth column. The overhead factor for those can be found in the last column.
We note that the dependency on the synthesis library cannot completely be eliminated this way.
Furthermore, all schemes consider very different adversary models.
Table 4 does the same for second-order secure implementations.

Table 3: Area comparison for first-order secure implementations

  Countermeasure    Synthesis Library     Cipher    SCA-only    Combined    Overhead
                                                    [kGE]       [kGE]       factor
  [CRB+ 16]         NanGate 45nm [NAN]    AES        7.6²         -           -
  [GMK17]           UMC 90nm Low-K        AES        6.0          -           -
  CAPA [RDB+ 18]    NanGate 45nm [NAN]    KATAN      3.6        30.5        8.47
  ParTI [SMG16]     UMC 0.18µm [Inc04]    LED        7.9        20.2        2.56
  M&M               NanGate 45nm [NAN]    AES        7.6        19.2        2.53

Table 4: Area comparison for second-order secure implementations

  Countermeasure    Synthesis Library     Cipher    SCA-only    Combined    Overhead
                                                    [kGE]       [kGE]       factor
  [CRB+ 16]         NanGate 45nm [NAN]    AES       12.6²         -           -
  [GMK17]           UMC 90nm Low-K        AES       10.0          -           -
  CAPA [RDB+ 18]    NanGate 45nm [NAN]    KATAN      5.9        55.2        9.35
  M&M               NanGate 45nm [NAN]    AES       12.6        33.2        2.63

7.3 Evaluation
Evaluation of our M&M implementations is done separately for SCA and DFA, since to our knowledge no comprehensive method for verification against combined attacks has been published. For the former, we program both versions of a second-order protected AES on an FPGA and evaluate the leakage in the power consumption with a non-specific t-test. The state of the art in evaluating fault countermeasures is less advanced. We evaluate the scheme's EDP through a simulation of the circuits, in which we model additive faults at the RTL level.

7.3.1 SCA evaluation


Setup. To assess the security of our implementations, we use a SAKURA-G board, which is specifically designed for side-channel evaluation. The board carries two distinct Spartan-6 FPGAs. The control FPGA handles the communication with the host computer and generates the shares for the cryptographic FPGA. In the crypto FPGA, we deploy the actual encryption scheme. This way, we isolate the power consumption of the actual encryption, considerably reducing the noise in the experiment. We use a very slow 3 MHz clock to ensure clear power traces with minimal overlap between consecutive time samples. The synthesis of the design is done using Xilinx tools with the KEEP HIERARCHY constraint, in order to avoid optimizations across different shares. We sample the power consumption at 1.0 GS/s with 10 000 points per frame, which covers 30 clock cycles. This is equivalent to approximately 1.2 rounds of V1 and 1.4 rounds of V2.

TVLA. We perform a non-specific test vector leakage assessment (TVLA) [BCD+ 13] using the
methodology described in [RGV17]. This assessment is not used to mount an attack but to detect
correlations of the instantaneous power consumption with the secret. We gather power traces for
two distinct plaintext classes (one fixed and one random) and compare the two sets using the t-test
statistic. When the t-statistic exceeds the threshold 4.5 in absolute value, one can conclude with
confidence 99.9995% that the two sets of power traces follow different distributions and thus, that
the design leaks. This is a necessary but not sufficient condition for a successful attack to exist.
When the t-statistic remains below this threshold, the designer can conclude with high confidence
that the design is secure. We choose the fixed plaintext equal to the key so that all S-box inputs in
the first round are zero.
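The point-wise statistic behind this test is Welch's two-sample t-test. A minimal sketch (Python/NumPy, with synthetic Gaussian traces standing in for real power measurements) of the fixed-vs-random comparison is:

```python
import numpy as np

def welch_t(a, b):
    # point-wise Welch t-statistic between two trace sets (rows = traces)
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (ma - mb) / np.sqrt(va / len(a) + vb / len(b))

rng = np.random.default_rng(0)
n_traces, n_samples = 5000, 100
fixed_traces = rng.normal(0.0, 1.0, (n_traces, n_samples))
rand_traces = rng.normal(0.0, 1.0, (n_traces, n_samples))
rand_traces[:, 40] += 0.2       # synthetic first-order leak at time sample 40

t = welch_t(fixed_traces, rand_traces)
leaks = np.where(np.abs(t) > 4.5)[0]   # |t| > 4.5: leakage detected
assert 40 in leaks
```

With real measurements, the two trace sets come from the fixed and random plaintext classes, and higher-order tests apply the same statistic to centered powers or centered products of the traces.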
² This number differs from the one reported in [CRB+ 16]. We contacted the authors and obtained their code in order to synthesize with the same software and library as our M&M implementation.

We first perform the t-test on an unprotected AES implementation to verify that our setup
is sound and able to detect leakage. We emulate the unprotected implementation by disabling
the PRNG. Leakage is then expected in every order. When we turn the PRNG on and activate
the countermeasures, we expect only third-order leakage since the implementation is second-order
secure.

Figure 9: Non-specific t-test on second-order secure M&M AES implementation, V1. Left: PRNG
off (24K traces); Right: PRNG on (100M traces). Rows (top to bottom): one exemplary power
trace; first-order t-test; second-order t-test; third-order t-test.

Figure 10: Non-specific t-test on second-order secure M&M AES implementation, V2. Left: PRNG
off (24K traces); Right: PRNG on (100M traces). Rows (top to bottom): one exemplary power
trace; first-order t-test; second-order t-test; third-order t-test.

The t-test results for version 1 and 2 of our AES implementation are shown in Figures 9 and 10
respectively. In both cases, we see clear evidence of leakage at only 24 000 traces when the PRNG is
turned off. When we enable the PRNG, neither the first- nor second-order t-test statistics surpass
the threshold 4.5 with up to 100 million power traces.
In addition, we perform a bivariate analysis by combining time samples. For memory efficiency, we reduce the sampling rate of the oscilloscope to 100 MS/s, resulting in power traces of 1 000 time samples each. We then perform the t-test as described above on 1 000 × 1 000 matrices, formed by
a centered product of the traces. The results are shown in Figure 11 and confirm that there is no
bivariate leakage with up to 50 million traces.

7.3.2 FA evaluation

In this section we evaluate the behaviour of our design when multiple stochastic faults are injected.
We describe our experiment, which aims to measure M&M’s detection rate of faults. For simplicity,
we evaluate our first-order secure AES implementations.
Figure 11: Bivariate t-test on second-order secure M&M AES implementation, V1 (left) and V2 (right). Below diagonal: PRNG off (20K traces); Above diagonal: PRNG on (50M traces).

Fault modeling. Traditionally, fault modeling theory distinguishes between faults that affect the logic function on the one hand and delay faults on the other. Moreover, faults can be classified as structural (modifying the interconnections among components in the circuit) or functional (modifying the functionality of certain parts of the circuit) [ABF94].
Faults in cryptographic devices are typically injected with a laser or introduced by clock or power line glitches. In our experiments, we consider functional faults in the logic functions to model the adversary of §2 (i.e. additive errors). We model faults as XOR additions, which allows us not only to flip one specific bit but also to XOR entire offsets onto k-bit words.

Fault injection. We enable a fault injection on a wire by extending the original VHDL code with
an additional fault gate on that wire. Such a gate is simply an XOR with a fault selector, indicating
whether or not we want to inject a fault.
In each design to be tested, we select a number of critical bytes where an attacker is most likely
to inject a fault. These points are: the input to the state register; the state and key byte before
AddRoundKey; the SubBytes input from the key schedule and four different points inside SubBytes.
Faults can be inserted in every data share and every tag share of those bytes.
This means that 256 fault gates are installed in the first-order implementations. We collect the
corresponding fault selectors in a fault vector, which is controlled by the testbench. Each bit set to
’1’ in the fault vector corresponds to a fault on a single bit. By setting multiple bits in the vector,
we enable multiple faults in the implementation. When several faults are activated in the same
byte, it implies the XOR of an offset to that variable.
We want to be able to randomly draw fault vectors with a chosen Hamming weight H. For this,
we draw inspiration from the basic principles of address decoding. We draw one random bit; if it is
‘zero’, a fault occurs in the first half of the vector and if it is ‘one’, in the second half. We draw a
second bit and follow the same procedure to decide which of the two quarters. We continue this
way until a single fault bit is selected. Thus, log2 (256) = 8 random bits are needed to set a one-bit
fault in a 256-bit vector. By repeating this method H times, we can draw random fault vectors
with Hamming weight H. In our experiments we choose H = 128.
For each selected bit, we flip one more coin that decides whether the selected fault is activated
or not. This means that of the 128 selected faults, approximately half will be active. Since most of
the fault attacks in literature target one of the last rounds of AES, we similarly “inject” our faults
in the last round of encryption.

Results. We simulate the fault-augmented VHDL code with Xilinx ISIM for 50 000 iterations
and measure the fault detection rate. In each experiment, approximately 64 bits are altered in
the computation. We are thus faulting one or more bits in multiple bytes. We consider the faults
detected if the returned ciphertext is infected. In version 1 of our M&M AES, the experiment shows
that 210 faulty ciphertexts are not infected. This means that the experimental rate of detection of
our M&M implementation is 0.9958, compared to the theoretical 1 − 2−8 = 0.9961. In our second
AES version, 189 faulty ciphertexts are not infected, which means the experimental detection
probability is 0.9962.

8 Conclusion
We introduce a new family of countermeasures to provide security against both SCA and DFA.
M&M can extend any masking countermeasure with information-theoretic MAC tags and infective
computation. We demonstrate how to construct basic M&M building blocks and how to build a
secure implementation of any cipher. We illustrate our proposal with first- and second-order secure
implementations of AES and we experimentally verify the SCA and DFA security. We show that
M&M implementations can be very efficient while providing resistance against both SCA and DFA
in a strong but realistic adversary model.

Acknowledgements The authors would like to thank Dusan Bozilov, Begül Bilgin and Nigel
Smart for fruitful discussions and also the CHES reviewers for their helpful comments. This work
was supported in part by the Research Council KU Leuven: C16/15/058 and OT/13/071, by the
NIST Research Grant 60NANB15D346 and the EU H2020 project FENTEC. Lauren De Meyer is
funded by a PhD fellowship of the Fund for Scientific Research - Flanders (FWO).

References
[ABF94] M. Abramovici, M. Breuer, and A. Friedman. Digital systems testing and testable design. Wiley-IEEE Press, September 1994.

[AVFM07] F. Amiel, K. Villegas, B. Feix, and L. Marcel. Passive and active combined attacks:
Combining fault attacks and side channel analysis. In Workshop on Fault Diagnosis
and Tolerance in Cryptography (FDTC 2007), pages 92–102, Sept 2007.

[BBD+ 15] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin Grégoire, and Pierre-Yves Strub. Verified proofs of higher-order masking. EUROCRYPT, IACR Cryptology ePrint Archive, 2015:060, 2015.

[BBD+ 16] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, Pierre-Yves Strub, and Rébecca Zucchini. Strong non-interference and
type-directed higher-order masking. In Edgar R. Weippl, Stefan Katzenbeisser,
Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security,
Vienna, Austria, October 24-28, 2016, pages 116–129. ACM, 2016.

[BBD+ 18] Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin
Grégoire, François-Xavier Standaert, and Pierre-Yves Strub. Improved parallel
mask refreshing algorithms - generic solutions with parametrized non-interference &
automated optimizations. IACR Cryptology ePrint Archive, 2018:505, 2018.

[BBK+ 03] Guido Bertoni, Luca Breveglieri, Israel Koren, Paolo Maistri, and Vincenzo Piuri.
Error analysis and detection procedures for a hardware implementation of the
advanced encryption standard. IEEE Trans. Computers, 52(4):492–505, 2003.

[BBP+ 16] Sonia Belaïd, Fabrice Benhamouda, Alain Passelègue, Emmanuel Prouff, Adrian
Thillard, and Damien Vergnaud. Randomness complexity of private circuits for
multiplication. In Marc Fischlin and Jean-Sébastien Coron, editors, Advances in
Cryptology - EUROCRYPT 2016 - 35th Annual International Conference on the
Theory and Applications of Cryptographic Techniques, Vienna, Austria, May 8-12,
2016, Proceedings, Part II, volume 9666 of Lecture Notes in Computer Science, pages
616–648. Springer, 2016.

[BCC+ 14] Julien Bringer, Claude Carlet, Hervé Chabanne, Sylvain Guilley, and Houssem
Maghrebi. Orthogonal direct sum masking - A smartcard friendly computation
paradigm in a code, with builtin protection against side-channel and fault attacks.
In David Naccache and Damien Sauveron, editors, Information Security Theory and
Practice. Securing the Internet of Things - 8th IFIP WG 11.2 International Workshop,
WISTP 2014, Heraklion, Crete, Greece, June 30 - July 2, 2014. Proceedings, volume
8501 of Lecture Notes in Computer Science, pages 40–56. Springer, 2014.

[BCD+ 13] G. Becker, J. Cooper, E. De Mulder, G. Goodwill, J. Jaffe, G. Kenworthy, T. Kouzmi-


nov, A. Leiserson, M. Marson, P. Rohatgi, et al. Test vector leakage assessment
(tvla) methodology in practice. In International Cryptographic Module Conference,
volume 1001, page 13, 2013.

[BECN+ 06] H. Bar-El, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan. The sorcerer’s
apprentice guide to fault attacks. Proceedings of the IEEE, 94(2):370–382, Feb 2006.

[BG13] Alberto Battistello and Christophe Giraud. Fault analysis of infective AES compu-
tations. In Wieland Fischer and Jörn-Marc Schmidt, editors, 2013 Workshop on
Fault Diagnosis and Tolerance in Cryptography, Los Alamitos, CA, USA, August 20,
2013, pages 101–107. IEEE Computer Society, 2013.

[BGN+ 14] Begül Bilgin, Benedikt Gierlichs, Svetla Nikova, Ventzislav Nikov, and Vincent
Rijmen. Higher-order threshold implementations. In Palash Sarkar and Tetsu Iwata,
editors, Advances in Cryptology - ASIACRYPT 2014 - 20th International Conference
on the Theory and Application of Cryptology and Information Security, Kaoshiung,
Taiwan, R.O.C., December 7-11, 2014, Proceedings, Part II, volume 8874 of Lecture
Notes in Computer Science, pages 326–343. Springer, 2014.

[BS97] Eli Biham and Adi Shamir. Differential fault analysis of secret key cryptosystems,
pages 513–525. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.

[Can05] David Canright. A very compact s-box for AES. In Josyula R. Rao and Berk
Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th
International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings,
volume 3659 of Lecture Notes in Computer Science, pages 441–455. Springer, 2005.

[CFGR10] C. Clavier, B. Feix, G. Gagnerot, and M. Roussellet. Passive and active combined attacks on AES combining fault attacks and side channel analysis. In 2010 Workshop on Fault Diagnosis and Tolerance in Cryptography, pages 10–19, Aug 2010.

[Cla07] Christophe Clavier. Secret external encodings do not prevent transient fault analysis.
In Pascal Paillier and Ingrid Verbauwhede, editors, Cryptographic Hardware and
Embedded Systems - CHES 2007, 9th International Workshop, Vienna, Austria,
September 10-13, 2007, Proceedings, volume 4727 of Lecture Notes in Computer
Science, pages 181–194. Springer, 2007.

[CN16] Thomas De Cnudde and Svetla Nikova. More efficient private circuits II through
threshold implementations. In 2016 Workshop on Fault Diagnosis and Tolerance
in Cryptography, FDTC 2016, Santa Barbara, CA, USA, August 16, 2016, pages
114–124. IEEE Computer Society, 2016.

[CRB+ 16] Thomas De Cnudde, Oscar Reparaz, Begül Bilgin, Svetla Nikova, Ventzislav Nikov,
and Vincent Rijmen. Masking AES with d+1 shares in hardware. In Benedikt
Gierlichs and Axel Y. Poschmann, editors, Cryptographic Hardware and Embedded
Systems - CHES 2016 - 18th International Conference, Santa Barbara, CA, USA,
August 17-19, 2016, Proceedings, volume 9813 of Lecture Notes in Computer Science,
pages 194–212. Springer, 2016.

[DEG+ 18] Christoph Dobraunig, Maria Eichlseder, Hannes Groß, Stefan Mangard, Florian
Mendel, and Robert Primas. Statistical ineffective fault attacks on masked AES
with fault countermeasures. IACR Cryptology ePrint Archive, 2018:357, 2018.

[DEK+ 18] Christoph Dobraunig, Maria Eichlseder, Thomas Korak, Stefan Mangard, Florian Mendel, and Robert Primas. SIFA: Exploiting ineffective fault inductions on symmetric cryptography. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018(3):547–572, Aug. 2018.

[DV12] F. Dassance and A. Venelli. Combined fault and side-channel attacks on the AES key schedule. In 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography, pages 63–71, Sept 2012.

[FGMDP+ 18] Sebastian Faust, Vincent Grosso, Santos Merino Del Pozo, Clara Paglialonga, and
François-Xavier Standaert. Composable masking schemes in the presence of physical
defaults & the robust probing model. IACR Transactions on Cryptographic Hardware
and Embedded Systems, 2018(3):89–120, Aug. 2018.

[FH17] Wieland Fischer and Naofumi Homma, editors. Cryptographic Hardware and Em-
bedded Systems - CHES 2017 - 19th International Conference, Taipei, Taiwan,
September 25-28, 2017, Proceedings, volume 10529 of Lecture Notes in Computer
Science. Springer, 2017.

[GM17] Hannes Groß and Stefan Mangard. Reconciling d+1 masking in hardware and
software. In Fischer and Homma [FH17], pages 115–136.

[GMK16] Hannes Groß, Stefan Mangard, and Thomas Korak. Domain-oriented masking:
Compact masked hardware implementations with arbitrary protection order. IACR
Cryptology ePrint Archive, 2016:486, 2016.

[GMK17] Hannes Groß, Stefan Mangard, and Thomas Korak. An efficient side-channel
protected AES implementation with arbitrary protection order. In Helena Handschuh,
editor, Topics in Cryptology - CT-RSA 2017 - The Cryptographers’ Track at the
RSA Conference 2017, San Francisco, CA, USA, February 14-17, 2017, Proceedings,
volume 10159 of Lecture Notes in Computer Science, pages 95–112. Springer, 2017.

[GPS14] Vincent Grosso, Emmanuel Prouff, and François-Xavier Standaert. Efficient masked
s-boxes processing - A step forward -. In David Pointcheval and Damien Vergnaud,
editors, Progress in Cryptology - AFRICACRYPT 2014 - 7th International Confer-
ence on Cryptology in Africa, Marrakesh, Morocco, May 28-30, 2014. Proceedings,
volume 8469 of Lecture Notes in Computer Science, pages 251–266. Springer, 2014.

[GST12] Benedikt Gierlichs, Jörn-Marc Schmidt, and Michael Tunstall. Infective computation
and dummy rounds: Fault protection for block ciphers without check-before-output.
In Alejandro Hevia and Gregory Neven, editors, Progress in Cryptology - LAT-
INCRYPT 2012 - 2nd International Conference on Cryptology and Information
Security in Latin America, Santiago, Chile, October 7-10, 2012. Proceedings, volume
7533 of Lecture Notes in Computer Science, pages 305–321. Springer, 2012.

[Inc04] Virtual Silicon Inc. 0.18 µm VIP Standard cell library tapeout ready, partnumber:
UMCL18G212T3, process: UMC logic 0.18µm generic II technology: 0.18µm, July
2004.

[IPSW06] Yuval Ishai, Manoj Prabhakaran, Amit Sahai, and David A. Wagner. Private circuits
II: keeping secrets in tamperable circuits. In Serge Vaudenay, editor, Advances
in Cryptology - EUROCRYPT 2006, 25th Annual International Conference on the
Theory and Applications of Cryptographic Techniques, St. Petersburg, Russia, May
28 - June 1, 2006, Proceedings, volume 4004 of Lecture Notes in Computer Science,
pages 308–327. Springer, 2006.

[ISW03] Y. Ishai, A. Sahai, and D. Wagner. Private Circuits: Securing Hardware against
Probing Attacks, pages 463–481. Springer Berlin Heidelberg, Berlin, Heidelberg,
2003.

[KJJ99] P. C. Kocher, J. Jaffe, and B. Jun. Differential power analysis. In Advances in Cryptology - CRYPTO '99, 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999, Proceedings, pages 388–397, 1999.

[KKG03] Ramesh Karri, Grigori Kuznetsov, and Michael Gössel. Parity-based concurrent error
detection of substitution-permutation network block ciphers. In Colin D. Walter,
Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded
Systems - CHES 2003, 5th International Workshop, Cologne, Germany, September
8-10, 2003, Proceedings, volume 2779 of Lecture Notes in Computer Science, pages
113–124. Springer, 2003.

[KKT04] Mark G. Karpovsky, Konrad J. Kulikowski, and Alexander Taubin. Differential fault
analysis attack resistant architectures for the advanced encryption standard. In Jean-
Jacques Quisquater, Pierre Paradinas, Yves Deswarte, and Anas Abou El Kalam,
editors, Smart Card Research and Advanced Applications VI, IFIP 18th World
Computer Congress, TC8/WG8.8 & TC11/WG11.2 Sixth International Conference
on Smart Card Research and Advanced Applications (CARDIS), 22-27 August 2004,
Toulouse, France, volume 153 of IFIP, pages 177–192. Kluwer/Springer, 2004.

[LRT12] Victor Lomné, Thomas Roche, and Adrian Thillard. On the need of randomness in
fault attack countermeasures - application to AES. In Guido Bertoni and Benedikt
Gierlichs, editors, 2012 Workshop on Fault Diagnosis and Tolerance in Cryptography,
Leuven, Belgium, September 9, 2012, pages 85–94. IEEE Computer Society, 2012.

[LSG+ 10] Yang Li, Kazuo Sakiyama, Shigeto Gomisawa, Toshinori Fukunaga, Junko Taka-
hashi, and Kazuo Ohta. Fault Sensitivity Analysis, pages 320–334. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2010.

[NAN] NANGATE. The NanGate 45nm Open Cell Library. Available at http://www.nangate.com.

[NRR06] Svetla Nikova, Christian Rechberger, and Vincent Rijmen. Threshold Implemen-
tations Against Side-Channel Attacks and Glitches. In Information and Commu-
nications Security, 8th International Conference, ICICS 2006, Raleigh, NC, USA,
December 4-7, 2006, Proceedings, pages 529–545, 2006.

[NRS11] Svetla Nikova, Vincent Rijmen, and Martin Schläffer. Secure hardware implementa-
tion of nonlinear functions in the presence of glitches. J. Cryptology, 24(2):292–321,
2011.

[PR11] Emmanuel Prouff and Thomas Roche. Higher-order glitches free implementation
of the AES using secure multi-party computation protocols. In Bart Preneel
and Tsuyoshi Takagi, editors, Cryptographic Hardware and Embedded Systems -
CHES 2011 - 13th International Workshop, Nara, Japan, September 28 - October 1,
2011. Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 63–78.
Springer, 2011.

[RBN+ 15] Oscar Reparaz, Begül Bilgin, Svetla Nikova, Benedikt Gierlichs, and Ingrid Ver-
bauwhede. Consolidating masking schemes. In Rosario Gennaro and Matthew
Robshaw, editors, Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptol-
ogy Conference, Santa Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I,
volume 9215 of Lecture Notes in Computer Science, pages 764–783. Springer, 2015.

[RDB+ 18] Oscar Reparaz, Lauren De Meyer, Begül Bilgin, Victor Arribas, Svetla Nikova,
Ventzislav Nikov, and Nigel P. Smart. CAPA: the spirit of beaver against physical
attacks. In Hovav Shacham and Alexandra Boldyreva, editors, Advances in Cryptology
- CRYPTO 2018 - 38th Annual International Cryptology Conference, Santa Barbara,
CA, USA, August 19-23, 2018, Proceedings, Part I, volume 10991 of Lecture Notes
in Computer Science, pages 121–151. Springer, 2018.

[RGV17] Oscar Reparaz, Benedikt Gierlichs, and Ingrid Verbauwhede. Fast leakage assessment.
In Fischer and Homma [FH17], pages 387–399.

[RLK11] Thomas Roche, Victor Lomné, and Karim Khalfallah. Combined Fault and Side-
Channel Attack on Protected Implementations of AES, pages 65–83. Springer Berlin
Heidelberg, Berlin, Heidelberg, 2011.

[SFRES18] Okan Seker, Abraham Fernandez-Rubio, Thomas Eisenbarth, and Rainer Steinwandt.
Extending glitch-free multiparty protocols to resist fault injection attacks. IACR
Transactions on Cryptographic Hardware and Embedded Systems, 2018(3):394–430,
Aug. 2018.

[SMG16] Tobias Schneider, Amir Moradi, and Tim Güneysu. ParTI: Towards combined hardware countermeasures against side-channel and fault-injection attacks. In Begül Bilgin, Svetla Nikova, and Vincent Rijmen, editors, Proceedings of the ACM Workshop on Theory of Implementation Security, TIS@CCS 2016, Vienna, Austria, October 2016, page 39. ACM, 2016.
Classification of Balanced Quadratic Functions

Publication Data

Lauren De Meyer and Begül Bilgin. Classification of Balanced Quadratic Functions. IACR Transactions on Symmetric Cryptology, 2019(2), pages 169–192.

My Contribution

Principal author.


Classification of Balanced Quadratic Functions


Lauren De Meyer and Begül Bilgin
KU Leuven, imec - COSIC, Belgium
firstname.lastname@esat.kuleuven.be

Abstract. S-boxes, typically the only nonlinear part of a block cipher, are the heart of
symmetric cryptographic primitives. They significantly impact the cryptographic strength and
the implementation characteristics of an algorithm. Due to their simplicity, quadratic vectorial
Boolean functions are preferred when efficient implementations for a variety of applications are
of concern. Many characteristics of a function stay invariant under affine equivalence. So far, all
6-bit Boolean functions and all 3- and 4-bit permutations have been classified up to affine equivalence. At FSE 2017, Bozilov et al. presented the first classification of 5-bit quadratic permutations.
In this work, we propose an adaptation of their work resulting in a highly efficient algorithm
to classify n × m functions for n ≥ m. Our algorithm enables for the first time a complete
classification of 6-bit quadratic permutations as well as all balanced quadratic functions for
n ≤ 6. These functions can be valuable for new cryptographic algorithm designs that target efficient multi-party computation or resistance to side-channel analysis. In addition, we provide a
second tool for finding decompositions of length two. We demonstrate its use by decomposing
existing higher degree S-boxes and constructing new S-boxes with good cryptographic and
implementation properties.
Keywords: Affine Equivalence · S-box · Boolean functions · Classification · Decomposition

1 Introduction
For a variety of applications, such as multi-party computation, homomorphic encryption and
zero-knowledge proofs, linear operations are considered to have minimal cost. Nonlinear operations
on the other hand cause a rapid growth of implementation requirements. Therefore, it becomes
important to create cryptographically strong algorithms with minimal nonlinear components. A
recent study in this direction called MiMC [AGR+ 16], which is based on some relatively old
observations [NK95], uses the simple quadratic function x3 in different fields as the only nonlinear
block of the algorithm. Another work that minimizes the number of multiplications is the LowMC
design [ARS+ 15], where a quadratic 3-bit permutation is used as the only nonlinear component of
a Substitution-Permutation-Network (SPN).
We also see the importance of minimizing the nonlinear components in the field of secure
implementations against side-channel analysis. Efforts to decompose the S-boxes of existing
algorithms, such as the DES and AES S-boxes, into a minimum number of lower degree nonlin-
ear components (AND-gates, field multiplications or other quadratic or cubic functions), have
produced more than a handful of papers. Some of these decomposition tools are generic and
work heuristically [CGP+ 12, RV13, CRV14, CPRR15, GR16, PV16] whereas others focus on enu-
merating decompositions of all permutations for a certain size [BNN+ 12, KNP13]. In general,
they all make it clear that there is a significant advantage in considering side-channel security
during the design process and hence using low degree nonlinear components. As a reaction to
this line of research, a variety of novel symmetric-key designs simply use a quadratic permutation [ABB+ 14, BDP+ 14, BDP+ 15, DEMS15]. Examples include Keccak [BDPA13], one instance
of which is the new hash function standard, and several candidates of the CAESAR competition.
Generating strong, higher degree S-boxes using quadratic functions has also been shown useful
in [BGG+ 16]. These works demonstrate the relevance of our research, which focuses on enumerating
quadratic n × m functions for n < 7.
A valuable tool for the analysis of vectorial Boolean functions, which are typically used as
S-boxes, is the concept of affine equivalence (AE). AE allows the entire space of n × m functions
to be classified into groups with the same cryptographic properties. These properties include
the algebraic degree, the differential uniformity and the linearity of both the function and its

possible inverse in addition to multiplicative complexity. Moreover, the randomness cost of a first-
order masked implementation is also invariant within a class if countermeasures such as threshold
implementations are used [Bil15]. With similar concerns in mind, our research relies on this affine
equivalence classification.

1.1 Classification of (Vectorial) Boolean Functions


The classification of Boolean functions dates back to the fifties [Gol59]. The equivalence classes for
functions of up to five inputs were identified by 1972 [BW72] and Maiorana [Mai91] was the first to
classify all 6-bit Boolean functions in 1991. Fuller [Ful03] confirmed in 2003 that this classification
was complete.
For vectorial Boolean functions, only n-bit permutations for n ≤ 4 have been completely
classified so far [Can07, Saa11, BNN+ 15]. Most of these classifications use the affine equivalence
(AE) tool introduced by Biryukov et al. in [BCBP03]. This algorithm computes a representative
of the affine equivalence class for any n-bit permutation. In [Can07], De Cannière classifies all
4-bit permutations by traversing a graph of permutations connected by single transpositions and reducing them to their affine equivalence class representative. As this method is impractical for larger dimensions (n > 4), no classification of the complete space of 5-bit permutations
exists. A classification of the APN classes (which have the best cryptographic properties) does exist
by Brinkmann et al. [BL08]. The authors build a tree of LUTs of 5-bit permutations, in which each
level of the tree specifies one more output of the function. The tree is pruned using on the one
hand an APN filter function and on the other hand an affine equivalence filter, which is also based
on the algorithm of [BCBP03]. The quadratic 5-bit permutations have been classified by Bozilov et
al. [BBS17]. Their approach consists of two stages: first, they generate an exhaustive list of 5-bit permutations from quadratic ANFs. Then, they use the affine equivalence algorithm of Biryukov et al. [BCBP03] to find the affine representatives of all the candidates in this list. Eliminating the duplicates results in 75 quadratic classes. This approach uses the AE algorithm ≈ 2^23 times, resulting
in a runtime of a couple of hours, using 16 threads. Again, extending this approach to higher
dimensions is not feasible.
Vectorial Boolean functions from n to m < n bits have been used as S-boxes as well (e.g. the
6 × 4 DES S-boxes), yet their classification has been largely ignored. They are also used in the
construction of larger 8-bit S-boxes by Boss et al. [BGG+ 16].

1.2 Decompositions of Higher-Degree Functions


The authors of [BNN+ 12, KNP13] decompose all 4-bit permutations in order to provide efficient
implementations against side-channel analysis. The decompositions in both works benefit from the
affine equivalence classification of permutations. The main difference between them is that [BNN+ 12]
only focuses on decompositions using quadratic and cubic components. It is shown that not all cubic
4-bit permutations can be composed from quadratics. This work has been extended in [KNP13], in
which decomposition of all permutations is enabled by including additions and compositions with
non-bijective quadratic functions. The decompositions provided in both these papers have been
proven to have the smallest length with the given structure. A possible decomposition for all 6 × 4
DES S-boxes jointly using 4-bit permutations is also provided as an output of the aforementioned
research [BKNN15].
A complementary work which decomposes a function into other quadratic and cubic functions
is [CPRR15]. This work starts from a randomly chosen low-degree function. They iteratively enlarge
their set of functions using addition and composition. Finally, the generated set of functions is used
to get a decomposition for a target function. This approach is not unlike the logic minimization
technique of [BP10]. The tool is heuristic and the decompositions it provides do not necessarily have the smallest length; for larger sizes, the decomposition found for a randomly selected function does not necessarily reach the theoretical lower bounds. However, it performs well for small functions.

1.3 Our Contribution


In this work, we explore the extension of Biryukov’s AE algorithm to non-bijective n × m functions
with m < n and analyse its performance. We propose an algorithm that not only classifies all n-bit permutations, but also all balanced n × m functions for m ≤ n. Our complexity is
significantly lower than that of previous algorithms known to date. This allows us to generate


all quadratic vectorial Boolean functions with five inputs in merely six minutes, which makes the
search for even 6-bit quadratic functions feasible. We also provide the cryptographic properties of
these functions and their inverses if possible.
Our work focuses on quadratic functions, since they tend to have low area requirements in
hardware, especially for masked implementations. We also introduce a tool for finding length-two
quadratic decompositions of higher degree permutations and we use it to decompose the 5-bit AB
and APN permutations. Furthermore, we find a set of high quality 5-bit permutations of degree 4
with small decomposition length that can be efficiently implemented.
Our list of quadratic 6-bit permutations is an important step towards decomposing the only
known 6-bit APN permutation class as an alternative to [PUB16].

2 Preliminaries
We consider an n × m (vectorial) Boolean function F(x) = y from F_2^n to F_2^m. The bits of x and the coordinate functions of F are denoted by small letter subscripts, i.e. x = (x_0, . . . , x_{n−1}) where x_i ∈ F_2 and F(x) = (f_0(x), . . . , f_{m−1}(x)) where f_i(x) is from F_2^n to F_2. We use '◦' to denote the composition of two or more functions, e.g. F_1 ◦ F_2(x) = F_1(F_2(x)) where F_1 : F_2^m → F_2^l and F_2 : F_2^n → F_2^m. We use |·| and '·' for absolute value and inner product respectively.

2.1 (Vectorial) Boolean Function Properties


In this paper, we focus on balanced vectorial Boolean functions F(x) = y, i.e. each output y ∈ F_2^m is equiprobable over all inputs x ∈ F_2^n. When n = m, F is thus bijective and typically called an n-bit permutation.
A Boolean function f : F_2^n → F_2 can be uniquely represented by its algebraic normal form (ANF)

    f(x) = ⊕_{j ∈ F_2^n} α_j x^j,  where x^j = ∏_{i=0}^{n−1} x_i^{j_i}.

The algebraic degree of f is

    Degr(f) = max_{j ∈ F_2^n, α_j ≠ 0} HW(j),  with HW(j) = Σ_{i=0}^{n−1} j_i.

The algebraic degree of a function F = (f_0, f_1, . . . , f_{m−1}) is simply the largest degree of its coordinate functions, i.e. Degr(F) = max_{0 ≤ i < m} Degr(f_i).
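
For the small n considered in this work, both the ANF and the algebraic degree are cheap to compute: the coefficients α_j follow from the truth table by the binary Möbius transform, and the degree is the maximal Hamming weight of a j with α_j ≠ 0. A minimal Python sketch (the helper names are ours, not from this work):

```python
def anf(tt, n):
    """ANF coefficients a_j of a Boolean function via the binary Moebius transform."""
    a = list(tt)
    for i in range(n):                       # one butterfly pass per variable
        for j in range(1 << n):
            if j & (1 << i):
                a[j] ^= a[j ^ (1 << i)]
    return a

def degree(tt, n):
    """Algebraic degree: maximal Hamming weight HW(j) over all j with a_j != 0."""
    return max((bin(j).count("1") for j, c in enumerate(anf(tt, n)) if c), default=0)

# f(x) = x0*x1 XOR x2, with input index x = x0 + 2*x1 + 4*x2
f = [((x & 1) & ((x >> 1) & 1)) ^ ((x >> 2) & 1) for x in range(8)]
```

Here degree(f, 3) returns 2, as expected for a quadratic function.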

Definition 1 (Component [NK95]). The components of a vectorial Boolean function F are the nonzero linear combinations β · F of the coordinate functions of F, with β ∈ F_2^m \ {0}.

Definition 2 (DDT [BS90, Nyb93]). We define the Difference Distribution Table (DDT) δ_F of F with its entries

    δ_F(α, β) = #{x ∈ F_2^n : F(x ⊕ α) = F(x) ⊕ β}

for α ∈ F_2^n and β ∈ F_2^m. The differential uniformity Diff(F) is the largest value in the DDT for α ≠ 0:

    Diff(F) = max_{α ≠ 0, β} δ_F(α, β)

An n-bit permutation F is said to be almost perfect nonlinear (APN) if ∀α ≠ 0, β ∈ F_2^n, the DDT element δ_F(α, β) is equal to either 0 or 2. The DDT frequency distribution or differential spectrum Δ_F of F is a histogram of the elements occurring in the DDT:

    Δ_F(δ) = #{(α, β) ∈ F_2^n × F_2^m : δ_F(α, β) = δ}
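
For small n, the DDT and the differential uniformity can be computed directly from this definition. The sketch below (our own helper names, not code from this work) uses the cube map x ↦ x³ over F_2^3 with reduction polynomial x³ + x + 1, a classical APN permutation:

```python
def gf8_mul(a, b):
    """Carry-less multiplication in F_2[x], reduced modulo x^3 + x + 1."""
    r = 0
    for i in range(3):
        if (b >> i) & 1:
            r ^= a << i
    for i in (4, 3):                        # reduce the degree-4 and degree-3 terms
        if (r >> i) & 1:
            r ^= 0b1011 << (i - 3)          # 0b1011 encodes x^3 + x + 1
    return r

def ddt(F, n, m):
    """Difference Distribution Table of an n x m function given as a lookup table."""
    t = [[0] * (1 << m) for _ in range(1 << n)]
    for a in range(1 << n):
        for x in range(1 << n):
            t[a][F[x ^ a] ^ F[x]] += 1
    return t

cube = [gf8_mul(x, gf8_mul(x, x)) for x in range(8)]   # x -> x^3
D = ddt(cube, 3, 3)
diff = max(D[a][b] for a in range(1, 8) for b in range(8))
```

Since the cube map is APN, diff equals 2 and every DDT entry with α ≠ 0 is 0 or 2.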

Definition 3 (LAT and Walsh Spectrum [O'C94, CV94]). We define the Linear Approximation Table (LAT) λ_F of F with its entries

    λ_F(α, β) = #{x ∈ F_2^n : α · x = β · F(x)} − 2^{n−1}

for α ∈ F_2^n and β ∈ F_2^m. The Walsh spectrum of a Boolean function f : F_2^n → F_2 is defined as

    f̂(ω) = Σ_{x ∈ F_2^n} (−1)^{f(x)} · (−1)^{ω·x}.

A function's LAT is directly related to its two-dimensional Walsh transform F̂(α, β) = Σ_{x ∈ F_2^n} (−1)^{α·x} · (−1)^{β·F(x)} as follows:

    λ_F(α, β) = F̂(α, β) / 2

Any column in a function's LAT (λ_F(α, β̄) for β̄ fixed) is thus the scaled Walsh spectrum of a component of F. The linearity Lin(F) is the largest absolute value in the LAT for β ≠ 0:

    Lin(F) = max_{β ≠ 0, α} |λ_F(α, β)|

An n-bit permutation F is said to be almost bent (AB) if ∀β ≠ 0, α ∈ F_2^n, the LAT element λ_F(α, β) is equal to either 0 or ±2^{(n−1)/2}. It is known that all AB permutations are also APN. The LAT frequency distribution Λ_F of F is a histogram of the absolute values occurring in the LAT:

    Λ_F(λ) = #{(α, β) ∈ F_2^n × F_2^m : |λ_F(α, β)| = λ}

Remark 1. In some works, the linearity is expressed in terms of the Walsh spectrum instead of the LAT as L(F) = max_{β ≠ 0, α} |F̂(α, β)|. The two definitions differ by a factor of two, i.e. L(F) = 2 · Lin(F).
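
The LAT is equally direct to compute for small sizes. As a sanity check (helper names are ours), the cube map x ↦ x³ over F_2^3 used as an example above is almost bent, so all its LAT entries with β ≠ 0 lie in {0, ±2^{(n−1)/2}} = {0, ±2}:

```python
def lat(F, n, m):
    """LAT entries: lat[a][b] = #{x : a.x = b.F(x)} - 2^(n-1)."""
    dot = lambda u, v: bin(u & v).count("1") & 1          # inner product over F_2
    t = [[-(1 << (n - 1))] * (1 << m) for _ in range(1 << n)]
    for a in range(1 << n):
        for b in range(1 << m):
            t[a][b] += sum(dot(a, x) == dot(b, F[x]) for x in range(1 << n))
    return t

cube = [0, 1, 3, 4, 5, 6, 7, 2]       # x -> x^3 over F_2^3 mod x^3 + x + 1
L = lat(cube, 3, 3)
lin = max(abs(L[a][b]) for a in range(8) for b in range(1, 8))
```

Here lin evaluates to 2, matching the AB bound for n = 3.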

2.2 Affine Equivalence


Functions with algebraic degree 1 are called affine. We use them to define affine equivalence
relations that classify the space of all n × m functions.
Definition 4 (Extended Affine Equivalence [CCZ98]). Two n × m functions F1 (x) and F2 (x) are
extended affine equivalent if and only if there exists a pair of n-bit and m-bit invertible affine
permutations A and B and an n × m linear mapping L such that F1 = B ◦ F2 ◦ A ⊕ L.
The algebraic degree and DDT and LAT frequency distributions are invariant over extended
affine equivalence.
Definition 5 (Affine Equivalence [CCZ98]). Two n × m functions F1 (x) and F2 (x) are affine
equivalent (F1 ∼ F2 ) if and only if there exists a pair of n-bit and m-bit invertible affine permutations
A and B such that F1 = B ◦ F2 ◦ A.
Clearly, affine equivalent functions are always extended affine equivalent but not vice versa.
Note that the affine equivalence relation also covers linear equivalence, where A and B are linear
permutations (i.e. A(0) = B(0) = 0). Moreover, affine equivalence also preserves the algebraic degree and the DDT and LAT frequency distributions. In the case of Boolean functions (m = 1), affine
equivalence and extended affine equivalence are the same.
It is common practice to take the lexicographically smallest function in an affine equivalence class
as the representative, which we denote by R. An efficient algorithm for finding the affine equivalent
(AE) representative of any n-bit permutation S was proposed by Biryukov et al. in [BCBP03].
In short, it computes the linear representatives of S(x ⊕ a) ⊕ b for all a, b ∈ F_2^n and chooses the lexicographically smallest among them as the affine equivalent representative. Since we rely on this algorithm and modify it according to our needs, we provide a detailed description below.
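
These invariances are easy to confirm experimentally for small n: composing S with affine permutations A and B leaves the differential spectrum unchanged. A small self-contained check (the construction and helper names are ours, chosen for illustration only):

```python
from collections import Counter

def dspec(F, n):
    """Multiset of nonzero DDT entries over all input differences a != 0."""
    spec = Counter()
    for a in range(1, 1 << n):
        spec.update(Counter(F[x ^ a] ^ F[x] for x in range(1 << n)).values())
    return spec

rot = lambda x: ((x << 1) | (x >> 2)) & 7        # 3-bit rotation: linear, bijective
A = [rot(x) ^ 3 for x in range(8)]               # affine input permutation
B = [rot(rot(y)) ^ 5 for y in range(8)]          # affine output permutation
S = [0, 1, 3, 4, 5, 6, 7, 2]                     # x -> x^3 over F_2^3 (APN)
S2 = [B[S[A[x]]] for x in range(8)]              # affine equivalent: B o S o A
```

Although S2 looks nothing like S as a lookup table, dspec(S, 3) == dspec(S2, 3) holds.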

2.3 Finding the Representative of a Permutation


This recursive algorithm described in [BCBP03] finds for a given permutation S the smallest affine equivalent R = B^{−1} ◦ S ◦ A by guessing some of the output values of the affine permutations A and B and determining the others using the linearity property. Throughout the algorithm, the numbers n_A and n_B record logarithmically for how many input values the outputs of A and B have been defined. For example, A(x) is defined for all x < 2^{n_A−1}. It is possible to fix A(0) (resp. B(0)) at the beginning of the algorithm, which implies n_A (resp. n_B) being initialized to 1. The number of defined values for R(x) is N_R, i.e. R(x) will be defined for all x < N_R.


The computation starts with x = y = 0 from the ForwardSweep described in Algorithm 1, which serves as the outer loop of the algorithm. The ForwardSweep enumerates all inputs x for which the affine transformation A(x) has already been defined and determines the representative output y = R(x). Either there already exists an output y such that S ◦ A(x) = B(y) or we choose y as the next smallest unused power of 2. When the ForwardSweep is complete, we continue with the BackwardSweep in Algorithm 2. Note that when n_A = 0 (the very first iteration), there are no inputs to enumerate yet and the computation actually starts with a BackwardSweep.

At the start of Algorithm 2, x is typically a power of 2, which means A(x) cannot be determined from linear combinations and can be chosen freely. If the BackwardSweep is successful (i.e. it finds a suitable A(x) such that S ◦ A(x) = B ◦ R(x)), we recurse on the ForwardSweep. If the BackwardSweep fails, we need to guess A(x). This is for example the case in the very first iteration, when n_B = 0.

Algorithm 1: ForwardSweep(x, y, n_A, n_B)
while x < 2^{n_A−1} do
    Determine y' s.t. B(y') = S ◦ A(x);
    if y' not yet defined then
        Pick y' = 2^{n_B−1};
        Set B(y') = S ◦ A(x);
        n_B = n_B + 1;
    end
    if SetR(x, y') then
        x = x + 1;
    else
        Dead end: stop forward sweep;
    end
end
if x < 2^n then
    BackwardSweep(x, y, n_A, n_B);
end

Algorithm 2: BackwardSweep(x, y, n_A, n_B) for invertible S
while y < 2^{n_B−1} do
    Determine x' s.t. A(x') = S^{−1} ◦ B(y);
    if x' < x then
        y = y + 1;
    else
        if SetR(x, y) then
            Set A(x) = S^{−1} ◦ B(y);
            ForwardSweep(x, y + 1, n_A + 1, n_B);
            Return;
        end
    end
end
Guess(x, y, n_A, n_B);


Algorithm 3: Guess(x, y, n_A, n_B)
SetR(x, y);
for all guesses g for A(x) do
    Set A(x) = g;
    Set B(y) = S ◦ A(x);
    ForwardSweep(x, y, n_A + 1, n_B + 1);
end

Algorithm 4: SetR(x, y)
if R(x) already defined (i.e. x < N_R) then
    if y > R(x) then
        Return False;
    end
    if y = R(x) then
        Return True;
    end
end
Set R(x) = y and N_R = x + 1;
Return True;

The Guess function is described by Algorithm 3. It fixes R(x) to the smallest unused y using Algorithm 4 and then loops over all available assignments of A(x). For each guess, we try recursion on the ForwardSweep. We need to try all of them because any guess can result in a lexicographically smaller representative R.
Algorithm 4 builds the representative R; it only overwrites a previously determined output when the new candidate y is smaller.

a     0 1 2 3 4 5 6 7 8 9 A B C D E F
S(a)  1 B 9 C D 6 F 3 E 8 7 4 A 2 5 0

        x → A(x) → B(y) ← y            x → A(x) → B(y) ← y
Guess   0 →  0   →  1   ← 0
Guess   1 →  1   →  B   ← 1    or   Guess   1 →  5   →  6   ← 1
Guess   2 →  2   →  9   ← 2         Guess   2 →  A   →  7   ← 2
Fwd     3 →  3   →  C   ← 4         Fwd     3 →  F   →  0   ← 3
Bwd     4 →  7   ←  3   ← 3         Guess   4 →  4   →  D   ← 4
Fwd     5 →  6   →  F   ← 8         Fwd     5 →  1   →  B   ← 6
Fwd     6 →  5   →  6   ← 5         Fwd     6 →  E   →  5   ← 8

x     0 1 2 3 4 5 6 7 8 9 A B C D E F
R(x)  0 1 2 3 4 6 8

Figure 1: Example of finding the linear representative for a 4-bit bijective S. The middle arrow denotes the application of S.

This whole procedure of finding the representative of an n-bit permutation is exemplified in Figure 1 for clarification. Note that even though the S-box we use and the one in [BCBP03] are the same, the representative we obtain is different since we focus on the lexicographically smallest one by assigning, for example, R(0) = 0. Moreover, for the same reason, the representative on the right side of Figure 1 is favored over the left side.
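
For very small n, the recursion is not even needed and the representative can be cross-checked by brute force: enumerate all affine permutations A and B^{−1} of F_2^n (1344 of them for n = 3) and keep the lexicographically smallest B^{−1} ◦ S ◦ A. The naive reference sketch below (ours, not the algorithm of [BCBP03]) is far too slow beyond n = 3, but is handy for validating an implementation of Algorithms 1-4:

```python
from itertools import product

def affine_perms(n):
    """Lookup tables of all affine bijections x -> Mx + c over F_2^n."""
    perms = []
    for cols in product(range(1, 1 << n), repeat=n):     # candidate matrix columns
        lin = [0] * (1 << n)
        for x in range(1 << n):
            for i in range(n):
                if (x >> i) & 1:
                    lin[x] ^= cols[i]
        if len(set(lin)) == 1 << n:                      # keep invertible M only
            perms.extend([v ^ c for v in lin] for c in range(1 << n))
    return perms

def representative(S, n):
    """Lexicographically smallest B^-1 o S o A over all affine A, B."""
    perms = affine_perms(n)
    best = None
    for A in perms:
        s0 = S[A[0]]
        for Binv in perms:
            if Binv[s0] != 0:        # the minimum always maps 0 to 0: prune
                continue
            cand = tuple(Binv[S[A[x]]] for x in range(1 << n))
            if best is None or cand < best:
                best = cand
    return best
```

Affine equivalent S-boxes then share one representative, and the representative of any affine permutation is the identity.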

3 Finding the Representative of a Non-invertible Function


It has been suggested in [BCBP03] that the algorithm in Section 2.3 can be extended to find representatives for non-bijective functions S : F_2^n → F_2^m, but that this is only efficient when n − m is small. When S is not invertible (but still balanced), instead of one single solution to the equation S(x) = y, there are 2^{n−m} candidate inputs x for each y. The additional complexity of enumerating these candidates during the BackwardSweep grows larger as m decreases. Therefore, in [BCBP03] the total complexity of finding the affine representative for an n × m function where n > m is estimated as:

    n^3 · 2^n · (2^{n−m}!)^{2^{n−m}} / n


Figure 2 depicts how the predicted complexity (for fixed n = 5) increases monotonically as m decreases.

Figure 2: Asymptotic complexity from [BCBP03] for 5 × m functions.
Figure 3: Our experimental runtimes for random 5 × m functions.

In what follows, we describe an extension of the algorithm in Section 2.3 which has a non-monotonic complexity behavior as m decreases, as can be observed in Figure 3. Note that Figure 3 depicts experimental runtimes whereas Figure 2 depicts an asymptotic complexity estimation. Their scales are thus very different and should not be compared in magnitude. Instead, we consider only the difference in trends. For m = n, the algorithm is identical to [BCBP03]. The runtimes are calculated using a random selection of 500 5 × m functions for each m. Note that since no pseudo-code is provided in [BCBP03] and the description is very brief, we cannot conclude whether this is due to a complexity estimation error or to a slightly different algorithm. Moreover, the real runtimes might approximate the asymptotic complexity better for n → ∞.
One of the changes caused by the non-invertibility of S is that we can no longer compute the inverse S^{−1} and thus we cannot obtain x' in Algorithm 2. We propose Algorithm 5 as an alternative, in which we loop over all possible x' for which S ◦ A(x') = B(y).

Algorithm 5: BackwardSweep(x, y, n_A, n_B) for non-invertible S
while y < 2^{n_B−1} do
    for all x' s.t. S ◦ A(x') = B(y) do
        if x' < x then
            Try next x';
        else
            if SetR(x, y) then
                Set A(x) to the corresponding preimage of B(y) under S;
                ForwardSweep(x, y, n_A + 1, n_B);
            end
        end
    end
    if no x' found then
        y = y + 1;
    else
        Return;
    end
end
Guess(x, y, n_A, n_B);

Another difference is in the assignment of y, which is the smallest element in F_2^m that does not
yet have a corresponding input x such that R(x) = y. Note that y decides the representative output
R(x) in the BackwardSweep and Guess runs. The representative R of a balanced function
S has the same output distribution as S, which implies that each y = R(x) can occur only once in
a bijective permutation. This is why Algorithm 2 immediately increments y after using it. In a
non-bijective function, on the other hand, y can be reused 2^{n−m} times. Algorithm 5 therefore does

Classification of Balanced Quadratic Functions 261

not immediately increase y after each BackwardSweep, but only when it runs out of candidates
x′ for which S ◦ A(x′) = B(y). The complete procedure for finding the representative of a balanced
non-injective function is illustrated in Figure 4.
This second feature actually makes the new algorithm very efficient in finding the smallest
representative when n − m is not too large. Instead of guessing A(x), which implies a loop over
approximately 2^n guesses, the list of 2^{n−m} candidates x′ now immediately gives us the guesses A(x′)
that result in the smallest output value R(x). The more often we can reuse an output value y, the
less often we need to guess. This can also be observed by comparing the examples in Figures 1 and 4.
As a result, the algorithm to find a representative becomes more efficient for n × m functions with
m < n. If m becomes very small, the complexity increases again, since the enumeration of 2^{n−m}
candidates, which is also used in [BCBP03], becomes the dominant factor. This initial decrease and
subsequent increase of the complexity corresponds to our initial observation in Figure 3.

a    0 1 2 3 4 5 6 7 8 9 A B C D E F
S(a) 1 3 1 0 1 2 3 3 2 0 3 0 2 2 1 0

              S
       x → A(x) → B(y) ← y

Guess  0 → 0   →   1 ← 0
Bwd    1 → 2   ←   1 ← 0
Bwd    2 → 4   ←   1 ← 0
Fwd    3 → 6   →   3 ← 1
Bwd    4 → E   ←   1 ← 0
Fwd    5 → C   →   2 ← 2
Fwd    6 → A   →   3 ← 1

x    0 1 2 3 4 5 6 7 8 9 A B C D E F
R(x) 0 0 0 1 0 2 1

Figure 4: Example for a 4-bit non-bijective S
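As a quick sanity check, the balancedness of the example S above, and hence the 2^{n−m} = 4 possible reuses of each output value y, can be verified directly. The following is a small Python sketch (not part of the classification tool itself); the table is copied from Figure 4:

```python
from collections import Counter

# The 4 x 2 function S from Figure 4 (inputs 0..F, 2-bit outputs)
S = [1, 3, 1, 0, 1, 2, 3, 3, 2, 0, 3, 0, 2, 2, 1, 0]

counts = Counter(S)
# Balanced: every 2-bit output value occurs 2^(4-2) = 4 times,
# so each y can serve as the representative output R(x) four times.
assert all(c == 4 for c in counts.values())
```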

4 Classifying Balanced 5 × m Quadratic Functions


In this section, we first describe how all 5 × m balanced quadratic functions can be classified iteratively
using our algorithm. Even though all 5-bit quadratic Boolean functions and permutations have
already been classified in [BW72] and [BBS17] respectively, this is the first time such an analysis is
performed for m ∉ {1, 5}. Moreover, we introduce novel optimizations using the (non-)linearity of
the components to perform this classification much faster. We then compare the performance of our
method for finding all quadratic permutations with that of [BBS17].

4.1 Naive Iteration


There exist 2^{15} different 5-bit quadratic Boolean functions. Since we target balanced functions, we
consider only the 18 259 balanced ones out of these 2^{15} as candidate coordinate functions f_i : F_2^5 → F_2. In
iterative stages for m = 1 to 5, we systematically augment all balanced 5 × (m − 1) functions with
these 18 259 candidates to form a set of 5 × m functions. We then use the adapted AE algorithm
to reduce these functions to their affine equivalent representative. This reduction step is the key
feature of the classification algorithm, since it not only provides us with all 5 × m representatives,
but also significantly lowers the workload of the next stage. The search procedure is described by
Algorithm 6.
Table 1 shows the number of representatives we obtain for m = 1, . . . , 5¹. Our results for
m ∈ {1, 5} align with those from previous works and require 50 minutes of computation time, using
4 threads on a Linux machine with an Intel Core i5-6500 processor at 3.20 GHz. The comparison of
¹ The exact listing of the representatives and their cryptographic properties can be found on http://homes.esat.kuleuven.be/~ldemeyer/ > Miscellaneous.
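The count of 18 259 balanced quadratic coordinate candidates can be reproduced by brute force over all 2^{15} ANFs. The following is a small Python verification script (our own, not part of the classification tool):

```python
from itertools import combinations

n = 5
size = 2 ** n
# Truth table of each degree-1 and degree-2 monomial, packed into a 32-bit int
# (bit x holds the monomial's value at input x; the constant term is fixed to 0).
monomials = [sum(((x >> i) & 1) << x for x in range(size)) for i in range(n)]
monomials += [sum((((x >> i) & 1) & ((x >> j) & 1)) << x for x in range(size))
              for i, j in combinations(range(n), 2)]

balanced = 0
for coeffs in range(2 ** len(monomials)):   # every quadratic ANF
    tt = 0
    for k, mono in enumerate(monomials):
        if (coeffs >> k) & 1:
            tt ^= mono
    if bin(tt).count("1") == size // 2:     # balanced: 16 ones out of 32
        balanced += 1

print(balanced)  # 18259
```

The 15 monomials (5 linear, 10 quadratic) span the 2^{15} functions mentioned above.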


Algorithm 6: Generate Quadratic Functions

Initialize R = {0}, S = ∅ and m = 1;
Let F contain all balanced quadratic Boolean functions;
while m < 5 do
    for all S = (S_1, . . . , S_{m−1}) ∈ R do
        for all candidates f ∈ F do
            if S′ = (S, f) is balanced then
                S ← S ∪ {S′};
            end
        end
    end
    R ← ∅;
    for all S ∈ S do
        Find affine equivalent representative R of S;
        R ← R ∪ {R};
    end
    Sort and eliminate doubles from R;
    S ← ∅;
    m ← m + 1;
end

this timing with the couple of hours (using 16 threads) reported in [BBS17] shows the impact of
the iterative approach, made possible by the new AE algorithm of Section 3. Nevertheless, in
this section we describe two ways to further optimize the complexity.
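One stage of Algorithm 6, augmenting each 5 × (m − 1) representative with every candidate coordinate function and keeping the balanced results, can be sketched as follows. This is a minimal Python sketch with our own function names; the AE-reduction step, which is the core of the algorithm, is omitted:

```python
from collections import Counter

N = 5  # number of input bits

def is_balanced(table, m):
    """A 5 x m function is balanced iff every m-bit output occurs 2^(5-m) times."""
    counts = Counter(table)
    return len(counts) == 2 ** m and all(c == 2 ** (N - m) for c in counts.values())

def augment_stage(reps, candidates, m):
    """Extend every 5 x (m-1) representative with every candidate coordinate
    function (a truth table of 32 bits) and keep the balanced 5 x m results.
    The reduction to affine equivalence representatives is not shown."""
    out = []
    for S in reps:
        for f in candidates:
            S_new = tuple((s << 1) | b for s, b in zip(S, f))
            if is_balanced(S_new, m):
                out.append(S_new)
    return out

# Toy usage: start from the 5 x 1 function x0 and append the coordinate x1.
x0 = tuple(x & 1 for x in range(2 ** N))
x1 = tuple((x >> 1) & 1 for x in range(2 ** N))
stage2 = augment_stage([x0], [x1], 2)
```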

Table 1: Number of affine equivalence classes for 5 × m functions for m = 1, . . . , 5

            5×1   5×2   5×3   5×4   5×5
# classes     3    12    80   166    76

4.2 Impact of the Order of Coordinate Functions on Efficiency


Optimizing our classification algorithm comes down to reducing as much as possible the number of
functions to which the AE algorithm must be applied. Ideally, given two intermediate functions
F_1 and F_2 which are affine equivalent (F_1 ∼ F_2), we only want to find the representative of one
of them. Affine equivalence is not so easily detected in all cases. However, we can focus on a
simpler case: if F_1 and F_2 have the same coordinate functions, but in a different order, then they
are naturally affine equivalent. Hence, we will try to fix the order of the coordinate functions of
each intermediate function F and in that way eliminate all functions that are equal to F up to a
reordering of the coordinates. To do this, we need a property to base the ordering on. Consider the
three Boolean quadratic function classes Q_0^{(5,1)}, Q_1^{(5,1)} and Q_2^{(5,1)}, for which representative ANFs
and nonzero Walsh coefficient distributions are provided in Table 2.

Table 2: 5-bit Boolean functions

                                           #|ω : f̂(ω) = ξ|
Class         Representative           ξ = 32   ξ = 16   ξ = 8
Q_0^{(5,1)}   x_0                         1        0       0
Q_1^{(5,1)}   x_0 ⊕ x_1x_2                0        4       0
Q_2^{(5,1)}   x_0 ⊕ x_1x_2 ⊕ x_3x_4       0        0      16

Lemma 1. Every n × m function F = (f_0, f_1, . . . , f_{m−1}) is affine equivalent to an n × m function
G(x) = (g_0, g_1, . . . , g_{m−1}) with max ĝ_0(ω) ≤ max ĝ_1(ω) ≤ . . . ≤ max ĝ_{m−1}(ω), where ĝ_i(ω) is the
Walsh spectrum of g_i(x).


Since this property is unique for each class of Boolean functions, we will use it to fix the order
of coordinate functions during the classification algorithm. Using Lemma 1, in each intermediate
step m, we will only allow new coordinate functions f_m whose linearity is not smaller than
the linearity of f_{m−1}.
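The ordering property of Lemma 1 can be evaluated with a fast Walsh–Hadamard transform. The sketch below uses our own function names (`walsh_spectrum`, `linearity`); it is an illustration, not the published tool:

```python
def walsh_spectrum(truth_table):
    """Walsh coefficients f^(w) = sum_x (-1)^(f(x) XOR w.x) of a Boolean
    function given as a truth table of length 2^n, computed with the fast
    Walsh-Hadamard transform."""
    spec = [1 - 2 * bit for bit in truth_table]  # +1/-1 encoding
    h = 1
    while h < len(spec):
        for i in range(0, len(spec), 2 * h):
            for j in range(i, i + h):
                a, b = spec[j], spec[j + h]
                spec[j], spec[j + h] = a + b, a - b
        h *= 2
    return spec

def linearity(truth_table):
    """Max absolute Walsh coefficient; the quantity used to order the
    coordinate functions in Lemma 1."""
    return max(abs(v) for v in walsh_spectrum(truth_table))

# The Table 2 representatives: x0 has linearity 32, x0 XOR x1*x2 has 16.
f0 = [x & 1 for x in range(32)]
f1 = [(x & 1) ^ (((x >> 1) & 1) & ((x >> 2) & 1)) for x in range(32)]
```

For f1, four Walsh coefficients have magnitude 16, matching the distribution in Table 2.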

4.3 Impact of Linear Components on Efficiency


The runtime for finding the affine representative of a function depends on the accuracy of guesses.
That is, the algorithm searches for the smallest representative for each guess of A(x). As a result,
the more nonlinear components the function has, the more dead ends the algorithm
encounters and the more quickly it finishes. On the other hand, the more linear components the
function exhibits, the more valid solutions for the affine transforms exist, and thus the longer the algorithm
needs to search through them. The algorithm for finding the representative therefore becomes
less efficient as the number of linear components of the function increases.
In order to illustrate this significant difference, we choose five 5-bit representatives with a
different number of linear components. We use the same class enumeration as in [BBS17] and
represent the i-th quadratic permutation class with Q_i^{(5,5)}. From each class, we randomly choose 100
permutations and observe the average runtime of the AE algorithm. The results of this experiment
are shown in Table 3.

Table 3: Average runtimes of the AE algorithm [BCBP03] for some 5-bit representatives

Class          # Linear Components   Avg. Runtime (s)
Q_1^{(5,5)}            15                 1.36
Q_2^{(5,5)}             7                 0.39
Q_37^{(5,5)}            3                 0.017
Q_49^{(5,5)}            1                 0.0083
Q_75^{(5,5)}            0                 0.0053

Moreover, we further analyze the runtime of the AE algorithm by removing coordinates to
derive n × m functions with fewer linear components. The result is illustrated in Figure 5.

[Plot: average runtime in seconds (log scale, 10^{−3} to 10^0) versus m = 1, . . . , 5, with one curve per class Q_1, Q_2, Q_37, Q_49, Q_75]

Figure 5: Actual runtimes observed for some 5-bit functions

We introduce the following definition of a linear extension in order to define our optimization
for the classification.

Definition 6 (Linear Extension). An n-bit permutation F = (f_0, . . . , f_{n−1}) is called the linear
extension of an n × m function G = (f_0, . . . , f_{m−1}) if ∀ m ≤ i < n, f_i is linear.

Any balanced n × m function can be linearly extended with n − m linear coordinates into a
balanced n-bit permutation. Correspondingly, each balanced n-bit permutation with 2^{n−m} − 1
linear components can be generated as a linear extension of some balanced n × m function with


zero linear components. We therefore initially eliminate all linear coordinate functions from our
search, generating 5 × m functions with only nonlinear coordinates in each step. In the very last
stage, we obtain a list of 5-bit bijections without linear components. Finally, we add to this list all
the linear extensions of the 5 × m representatives found so far (for m = 1, . . . , 4) to also obtain the
5-bit bijections with 2^{n−m} − 1 linear components. This optimization increases the efficiency of the
search in three ways. Firstly, it reduces the number of f_i candidates inserted in each stage (|F| decreases).
Secondly, it discards functions for which finding the AE representative is slow. Finally, it reduces
the number of n × m representatives that each stage starts from (|R| decreases).
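Constructing a linear extension concretely is straightforward. The following Python sketch (bit ordering and helper names are our choice) extends the balanced 5 × 3 function (x_0 ⊕ x_1x_2, x_1 ⊕ x_2x_3, x_4 ⊕ x_0x_1) with the linear coordinates x_2 and x_3, which yields a 5-bit permutation:

```python
def parity(v):
    return bin(v).count("1") & 1

def linear_extension(coords, masks, n):
    """Extend the m coordinate truth tables in `coords` with the linear
    coordinates x -> parity(x & mask), one per mask, and return the lookup
    table of the resulting n-bit function (coordinate i = output bit i)."""
    tables = list(coords) + [[parity(x & mk) for x in range(2 ** n)] for mk in masks]
    return [sum(t[x] << i for i, t in enumerate(tables)) for x in range(2 ** n)]

bit = lambda x, i: (x >> i) & 1
# Balanced 5 x 3 function (x0 + x1x2, x1 + x2x3, x4 + x0x1)
f0 = [bit(x, 0) ^ (bit(x, 1) & bit(x, 2)) for x in range(32)]
f1 = [bit(x, 1) ^ (bit(x, 2) & bit(x, 3)) for x in range(32)]
f2 = [bit(x, 4) ^ (bit(x, 0) & bit(x, 1)) for x in range(32)]
# Extending with the linear coordinates x2 and x3 gives a 5-bit permutation.
perm = linear_extension([f0, f1, f2], [0b00100, 0b01000], 5)
```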

4.4 Performance Comparison


Table 4 summarizes the results of the optimized search, which takes a mere 6 minutes using 4
threads. This significant increase in performance enables us to classify for the first time also all
6-bit functions, which is described in Section 6. Note that the first column of Table 4 (m = 0)
corresponds to the classes of affine 5 × i functions. The last column holds the total number of 5 × i
classes, which is the sum of each row and corresponds to Table 1.
Each column starts with the number of “purely nonlinear” 5 × m representatives (only nonlinear
coordinates). The rows below the diagonal hold the number of classes that result from linearly
extending the classes in previous rows. The last row shows the number of linear components in the
corresponding 5 × 5 bijections and is equal to 25−m − 1. We find 22 quadratic 5-bit equivalence
classes without any linear components. Adding to this the linear extensions of smaller functions,
we obtain all the 75 quadratic and the one affine 5-bit representatives.

Table 4: Number of affine equivalence classes for 5 × i functions for i = 1, . . . , 5 with 2^{i−m} − 1
linear components.

# 5 × i representatives   m = 0    1    2    3    4    5   Tot. #
# 5×1                         1    2    -    -    -    -      3
# 5×2                         1    3    8    -    -    -     12
# 5×3                         1    5   19   55    -    -     80
# 5×4                         1    3   17   52   93    -    166
# 5×5                         1    2    6   22   23   22     76
# Linear Components:         31   15    7    3    1    0

Note that the number of classes obtained from linearly extending all 5 × m functions can be
much smaller than the number of 5 × m classes itself (for example 23 ≪ 93 for m = 4). This can be
explained by the fact that linearly extending two extended affine but not affine equivalent functions
can result in affine equivalent permutations (i.e. a collision in the linear extension). Consider
for example the following two 5 × 3 functions that are extended affine equivalent but not affine
equivalent:

    (x_0 ⊕ x_1x_2, x_1 ⊕ x_2x_3, x_4 ⊕ x_0x_1)  ≁  (x_0 ⊕ x_3 ⊕ x_1x_2, x_1 ⊕ x_3 ⊕ x_2x_3, x_4 ⊕ x_0x_1)
It is straightforward to verify that linearly extending both functions with coordinate functions x_2
and x_3 results in two affine equivalent 5-bit permutations:

    (x_0 ⊕ x_1x_2, x_1 ⊕ x_2x_3, x_4 ⊕ x_0x_1, x_2, x_3)  ∼  (x_0 ⊕ x_3 ⊕ x_1x_2, x_1 ⊕ x_3 ⊕ x_2x_3, x_4 ⊕ x_0x_1, x_2, x_3)

5 Decomposing and Generating Higher Degree Permutations


We now adapt our algorithm to (de)compose higher-degree functions into/from quadratics. This
leads to area-efficient implementations, especially in the context of side-channel countermeasures

and masking, where the area grows exponentially with the degree of a function. Below we describe
length-two decompositions and constructions of cryptographically interesting permutations.

5.1 Length-two Decomposition


We are trying to decompose a higher-degree function H : F_2^n → F_2^n. If a quadratic decomposition of
length two exists, then we can state that H = B ◦ R_1 ◦ A ◦ R_2 ◦ C with A, B, C affine permutations
and R_1, R_2 representatives of n-bit quadratic classes. Alternatively, we can state that H is affine
equivalent to R_1 ◦ A ◦ R_2. Suppose we fix R_2 (to one of the known n-bit representatives) and we
want to find the representative R_1 and the affine permutation A such that

    H ∼ R_1 ◦ A ◦ R_2.    (1)

As with the classification of Boolean functions, we perform this search iteratively, starting from
n × 1 Boolean functions f for which f ◦ R_2 : F_2^n → F_2 is extended affine equivalent to a component
function of H. We thus select the candidates for f using the following criteria:

(C1) f is balanced

(C2) ∃ β ∈ F_2^n \ {0} s.t. Degr(f ◦ R_2) = Degr(β · H)

(C3) ∃ β ∈ F_2^n \ {0} s.t. (Δ_{f◦R_2}, Λ_{f◦R_2}) = (Δ_{β·H}, Λ_{β·H})

Starting from this list of candidates, we proceed in a similar manner as in Algorithm 6, which
we only have to tweak slightly to obtain the decomposition Algorithm 7. Each time

Algorithm 7: Find decompositions of length two

for all quadratic n-bit representatives R_2 do
    Initialize R = {0}, S = ∅ and m = 1;
    F ← all quadratic Boolean functions f satisfying above criteria (C1–C3);
    while m < n do
        D ← {(Δ_{L◦H}, Λ_{L◦H})}_{L ∈ L(F_2^n → F_2^m)};
        for all S ∈ R do
            for all candidates f ∈ F do
                if S′ = (S, f) is balanced and (Δ_{S′◦R_2}, Λ_{S′◦R_2}) ∈ D then
                    S ← S ∪ {S′};
                end
            end
        end
        R ← ∅;
        for all S ∈ S do
            Find left affine equivalent representative R^L of S;
            R ← R ∪ {R^L};
        end
        Sort and eliminate doubles from R;
        S ← ∅;
        m ← m + 1;
    end
    if R ≠ ∅ then
        Decomposition of length 2 found;
    end
end

we augment an n × (m − 1) function with one of the Boolean function candidates f to a balanced
n × m function F, we verify that the DDT and LAT of F ◦ R_2 have the same frequency distributions
as those of some function H′ : F_2^n → F_2^m that has as its coordinate functions a subset of the
components of H. Let L(F_2^n → F_2^m) be the set of all balanced linear mappings from F_2^n to F_2^m.
Then, we can describe H′ as L ◦ H for some L ∈ L(F_2^n → F_2^m). In order to eliminate false candidates


as early as possible, we verify for each intermediate n × m candidate F whether the composition F ◦ R_2
can be affine equivalent to some subfunction H′. In particular, we check that

    (Δ_{F◦R_2}, Λ_{F◦R_2}) ∈ {(Δ_{L◦H}, Λ_{L◦H})}_{L ∈ L(F_2^n → F_2^m)}
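The DDT invariant Δ used in this filter can be computed directly from a lookup table. A minimal Python sketch (the function name is ours):

```python
from collections import Counter

def ddt_frequencies(table, n):
    """Frequency distribution of the difference distribution table of an
    n x m function given as a lookup table of 2^n outputs: how often each
    entry value occurs over all nonzero input differences."""
    freq = Counter()
    for a in range(1, 2 ** n):
        row = Counter(table[x] ^ table[x ^ a] for x in range(2 ** n))
        freq.update(row.values())
    return dict(freq)

# For the identity map, every nonzero input difference a maps to the single
# output difference a, so each of the 2^n - 1 rows contributes one entry 2^n.
print(ddt_frequencies(list(range(8)), 3))  # {8: 7}
```

The LAT invariant Λ can be tabulated analogously from the Walsh coefficients of all component functions.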

The quadratic function classification algorithm (Algorithm 6) is very efficient because it reduces
the lists of intermediate functions to their affine equivalent representatives at each step m. However,
we cannot do that in this case, as it would change the affine transformation A in the decomposition
(see Eqn. (1)). Let S_1 = B ◦ R_1 ◦ A be a candidate for which S_1 ◦ R_2 ∼ H and let R_1 be its affine
representative. Reducing S_1 to R_1 would discard the affine transformation A. In that case, we
would only be able to decompose functions that are affine equivalent to the composition of two
representatives: R_1 ◦ R_2. In other words, if there is another candidate S_1′ affine equivalent to S_1,
we do not want to discard it, as it will not necessarily result in an affine equivalent composition:

    S_1′ ∼ S_1  ⇏  S_1′ ◦ R_2 ∼ S_1 ◦ R_2

However, without any reductions in the intermediate steps of the algorithm, the search becomes
very inefficient, as the list of candidate functions grows exponentially. There is still a redundancy in
our search because of the affine output transformation B that is included in S_1. If S_1′ is only left
affine equivalent to S_1, then their compositions are affine equivalent:

    S_1′ = B′ ◦ S_1  ⇒  S_1′ ◦ R_2 ∼ S_1 ◦ R_2

We therefore adapt the AE algorithm to find the lexicographically smallest function R_1^L that
is left affine equivalent to S_1: R_1^L = B^{−1} ◦ S_1 = R_1 ◦ A. We call this function R_1^L the left
affine representative of S_1. The algorithm to find R_1^L is identical to finding the affine equivalent
representative with the input affine transformation constrained to the identity function. This
constraint removes the need for guesses and makes the algorithm very efficient. An example is
shown in Figure 6. Algorithm 7 summarizes the resulting decomposition method.
shown in Figure 6. Algorithm 7 summarizes the resulting decomposition method.

a    0 1 2 3 4 5 6 7 8 9 A B C D E F
S(a) 1 B 9 C D 6 F 3 E 8 7 4 A 2 5 0

              S
       x → A(x) → B(y) ← y

Fwd    0 → 0   →   1 ← 0
Fwd    1 → 1   →   B ← 1
Fwd    2 → 2   →   9 ← 2
Fwd    3 → 3   →   C ← 4
Fwd    4 → 4   →   D ← 8
Fwd    5 → 5   →   6 ← 5
Fwd    6 → 6   →   F ← 7

x      0 1 2 3 4 5 6 7 8 9 A B C D E F
R^L(x) 0 1 2 4 8 5 7

Figure 6: Example for finding the left representative R^L of S. The input transformation A is fixed to
the identity function: A(x) = x

5.2 Almost Bent Permutations


There are five APN 5 × 5 permutations up to affine equivalence [BL08], all corresponding to a
power map over F_{2^5}. Two of these are quadratic and AB, and correspond to classes Q_74^{(5,5)} (∼ x^3)
and Q_75^{(5,5)} (∼ x^5). We demonstrate Algorithm 7 by decomposing the inverses of these classes (resp.
x^{11} and x^7), which are cubic and (naturally) also AB. The algorithm is very efficient in this case
because the properties of ABs are so well defined: Firstly, all components of the cubic ABs have

the same algebraic degree (= 3). We also know that the DDT of an AB function contains only
zeros and twos, and its LAT contains only zeros and elements with absolute value 4. It immediately
follows that the Walsh transform of each coordinate function of the AB is equal to either 0 or
±8.
Moreover, when we look at all 5 × m subfunctions H′ = L ◦ H, ∀ L ∈ L(F_2^n → F_2^m), there is
only one permitted differential spectrum and LAT frequency distribution for each m. It is indeed
known that all coordinate functions of the AB are (extended) affine equivalent.
We enumerate all 75 candidates for R_2 and perform the search for S_1 = R_1 ◦ A using Algorithm 7.
When R_2 is the representative of classes Q_1^{(5,5)} to Q_74^{(5,5)}, the algorithm finds no 5-bit bijections
that compose with R_2 to a cubic AB. The search only ends with non-empty R when we perform
it with R_2 the representative of Q_75^{(5,5)}, which is itself the quadratic AB permutation x^5. The
resulting R_1 is equal to R_2 and their composition forms the AB class that holds the inverse of
Q_75^{(5,5)} (corresponding to the power map x^7). This decomposition is easily verified using power maps.
Indeed, x^5 ◦ x^5 ◦ x^5 = x^{125} = x^{125 mod 31} = x^1.
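This power map identity is easy to check computationally in F_{2^5}. The Python sketch below fixes one representation of the field, the irreducible polynomial x^5 + x^2 + 1 (the identity x^{125} = x is independent of this choice):

```python
POLY = 0b100101  # x^5 + x^2 + 1, irreducible over F_2

def gf32_mul(a, b):
    """Carry-less multiplication in F_{2^5} modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b100000:
            a ^= POLY
        b >>= 1
    return r

def power_map(e):
    """Lookup table of x -> x^e over F_{2^5}."""
    def p(x):
        r = 1
        for _ in range(e):
            r = gf32_mul(r, x)
        return r
    return [p(x) for x in range(32)]

p5 = power_map(5)
# x^5 o x^5 = x^25 ~ x^7, and composing a third x^5 gives x^125 = x^(125 mod 31) = x:
identity = [p5[p5[p5[x]]] for x in range(32)]
```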
Without the constraint that the AB needs to be cubic, we also find a decomposition for class
Q_75^{(5,5)} itself, with R_1 = R_2 = Q_74^{(5,5)}. A length-two decomposition for the odd cubic AB permutations
(x^{11}) is not found. Since the algorithm is exhaustive, this means it does not exist. Indeed, it is
shown in [NNR19] that the shortest decomposition of x^{11} has length three.

Table 5: Look-up-tables for the even cubic AB function F and its decomposition F = S1 ◦ R2 with
S1 ∼ R2

F 0,1,2,8,4,17,30,13,10,18,5,19,6,20,11,26,16,15,9,23,3,7,29,21,14,12,25,31,28,27,22,24
S1 0,1,2,4,8,10,16,21,17,28,18,24,23,25,14,7,30,6,19,12,20,15,3,31,9,29,5,22,13,26,27,11
R2 0,1,2,4,3,8,16,28,5,10,26,18,17,20,31,29,6,21,24,12,22,15,25,7,14,19,13,23,9,30,27,11

5.3 The Keccak χ Inverse


The nonlinear transformation χ used in the Keccak [BDPA13] sponge function family (Figure 7)
is a quadratic 5-bit permutation from class Q_68^{(5,5)} with a cubic inverse. With a view to
implementing an algorithm using χ^{−1}, we decompose this cubic permutation (see Table 6).

Figure 7: The nonlinear transformation χ from Keccak [BDPA13]

Table 6: Look-up-tables for the Keccak permutation χ and its inverse χ−1

χ 0,9,18,11,5,12,22,15,10,3,24,1,13,4,30,7,20,21,6,23,17,16,2,19,26,27,8,25,29,28,14,31
χ−1 0,11,22,9,13,4,18,15,26,1,8,3,5,12,30,7,21,20,2,23,16,17,6,19,10,27,24,25,29,28,14,31
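Table 6 can be reproduced from the well-known bitwise definition of χ, a_i ← a_i ⊕ (¬a_{i+1} ∧ a_{i+2}) with indices modulo 5. The Python sketch below takes bit 0 as the least significant bit of the table index (our convention, matching the table):

```python
CHI = [0, 9, 18, 11, 5, 12, 22, 15, 10, 3, 24, 1, 13, 4, 30, 7,
       20, 21, 6, 23, 17, 16, 2, 19, 26, 27, 8, 25, 29, 28, 14, 31]
CHI_INV = [0, 11, 22, 9, 13, 4, 18, 15, 26, 1, 8, 3, 5, 12, 30, 7,
           21, 20, 2, 23, 16, 17, 6, 19, 10, 27, 24, 25, 29, 28, 14, 31]

def chi(x):
    """One 5-bit row of Keccak's chi: a_i <- a_i XOR ((NOT a_{i+1}) AND a_{i+2})."""
    a = [(x >> i) & 1 for i in range(5)]
    return sum((a[i] ^ ((a[(i + 1) % 5] ^ 1) & a[(i + 2) % 5])) << i
               for i in range(5))
```

The formula reproduces the χ table, and the two tables of Table 6 compose to the identity.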

The Keccak χ inverse does not have the same strong properties as the AB permutations. Each
coordinate function is still cubic, but the differential and linear properties are naturally weaker.
Firstly, apart from zeros, we find both ±4 and ±8 in the LAT. For the DDT, there are multiple
differential spectra for the intermediate 5 × m subfunctions. As explained above, we generate the
list of possible DDT and LAT frequency distributions for each m = 1, . . . , 4 and feed this as input
to the search algorithm. We filter out all intermediate functions F : F_2^n → F_2^m for which the DDT
and LAT frequency distributions of F ◦ R_2 do not occur in this list.


While the search finds many classes with the same cryptographic properties as the Keccak χ
inverse, a decomposition of length two for χ^{−1} itself does not appear to exist.

5.4 Towards Higher-Degree Permutations


When it comes to choosing a nonlinear permutation for use in a cryptographic primitive, designers
will sooner go for those with higher degree, as they provide more resilience against higher-order
differential and algebraic attacks [DEM15]. With masked implementations in mind, we thus want to
find strong n-bit permutations with high algebraic degree for which a decomposition into quadratic
blocks exists. Our decomposition algorithm can be used for this purpose. If, instead of searching for
specific functions H, we define a set of more general but strong criteria, we can use Algorithm 7
to generate a list of favorable permutations. In particular, we use the following criteria to perform
a search for 5-bit permutations S with optimal algebraic degree and near-optimal cryptographic
properties:
- S is balanced
- Degr(S) = 4
- Lin(S) ≤ 6
- Diff(S) ≤ 4
The first three criteria are easily translated for intermediate 5 × m functions (the bound on
the LAT stays the same). As we are not looking for a known class, this is more difficult for
the bound on the DDT. We use the fact that the upper bound on the values in the DDT at most
doubles every time we discard one output bit (see Theorem 1). This upper bound is not tight, but
can be used to filter out some of the unusable intermediate functions F.

Theorem 1 ([Nyb94, Thm. 12]). Let S = (f_0, f_1, . . . , f_{n−1}) : F_2^n → F_2^n be an n-bit bijection with
Diff(S) the maximal value in its DDT. Then, for any function F : F_2^n → F_2^m with m < n, composed
from a subset of the coordinate functions of S, F = (f_{i_1}, f_{i_2}, . . . , f_{i_m}) with i_1, . . . , i_m ∈ {0, . . . , n − 1},
the values in its DDT are upper bounded by Diff(S) · 2^{n−m}.
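Theorem 1 gives a cheap necessary condition for pruning intermediate functions. A sketch of how such a filter could look (the function names are our own illustration, not the published tool):

```python
def max_ddt_entry(table, n):
    """Largest entry of the difference distribution table of a function
    given as a lookup table of 2^n outputs."""
    best = 0
    for a in range(1, 2 ** n):
        row = {}
        for x in range(2 ** n):
            d = table[x] ^ table[x ^ a]
            row[d] = row.get(d, 0) + 1
        best = max(best, max(row.values()))
    return best

def may_extend_to(table, n, m, diff_target):
    """Necessary condition from Theorem 1: an n x m restriction of an n-bit
    bijection S with Diff(S) <= diff_target can have DDT entries of at
    most diff_target * 2^(n-m). Candidates violating this are pruned."""
    return max_ddt_entry(table, n) <= diff_target * 2 ** (n - m)
```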

Our search delivers 17 quartic affine equivalence classes with very good cryptographic properties,
shown in Table 7. One of those is the APN class, which contains the permutation formed by the
inversion x^{−1} in F_{2^5}. Indeed, it was shown in [NNR19] that this permutation has decomposition
length two. Each of these very strong 5-bit S-boxes has an efficient masked implementation, as
it can be decomposed into only two quadratic components. This list is only a sample of the
functions that can be found using this method.

Table 7: Strong quartic (degree 4) 5-bit permutations with decomposition length two
(all R_1 and R_2 are 5-bit quadratic classes Q_i^{(5,5)})

Cl.  Representative                                                                          Diff  Lin  R_1   R_2
1:   0,1,2,3,4,6,7,8,5,9,16,12,21,26,29,30,10,18,24,13,27,17,20,31,14,11,23,19,22,28,15,25    4    6   Q_52  Q_57
2:   0,1,2,3,4,6,7,8,5,12,16,26,28,18,29,13,9,21,30,25,10,27,20,22,14,19,23,31,17,24,11,15    4    6   Q_71  Q_62
3:   0,1,2,3,4,6,7,8,5,16,18,29,9,15,28,26,10,30,20,19,23,31,24,11,12,22,27,17,13,21,14,25    4    6   Q_53  Q_67
4:   0,1,2,3,4,6,8,11,5,9,16,18,12,17,28,23,7,31,21,19,10,26,14,29,30,25,27,22,24,13,15,20    4    6   Q_71  Q_75
5:   0,1,2,3,4,6,8,11,5,12,16,24,15,21,17,20,7,23,9,18,14,19,25,30,31,10,28,22,13,26,27,29    4    6   Q_69  Q_71
6:   0,1,2,3,4,6,8,12,5,7,16,27,26,15,28,18,9,14,22,17,20,31,24,21,13,29,10,19,25,23,11,30    4    6   Q_33  Q_70
7:   0,1,2,3,4,6,8,12,5,9,16,24,31,10,17,20,7,21,28,22,15,29,14,27,25,19,11,23,13,26,30,18    4    6   Q_74  Q_74
8:   0,1,2,3,4,6,8,12,5,9,16,31,18,17,15,23,7,24,10,29,21,27,11,28,25,30,14,22,26,19,20,13    4    6   Q_74  Q_75
9:   0,1,2,3,4,6,8,12,5,10,16,27,25,19,22,11,7,18,30,13,24,21,28,15,31,9,26,29,23,17,20,14    4    6   Q_74  Q_68
10:  0,1,2,3,4,6,8,12,5,11,16,25,18,10,19,29,7,17,30,21,31,24,13,14,27,22,26,9,20,23,28,15    4    6   Q_75  Q_74
11:  0,1,2,3,4,6,8,12,5,11,16,24,22,26,9,19,7,23,10,13,31,18,20,29,27,30,28,15,14,17,21,25    4    6   Q_74  Q_68
12:  0,1,2,3,4,6,8,12,5,13,16,23,17,18,24,11,7,29,21,27,25,9,22,10,31,14,15,20,19,30,28,26    4    6   Q_72  Q_68
13:  0,1,2,3,4,6,8,12,5,14,16,26,10,27,23,31,7,24,11,28,20,17,9,18,25,21,13,30,15,22,29,19    4    6   Q_10  Q_72
14:  0,1,2,3,4,6,8,12,5,16,13,23,25,21,26,14,7,17,20,28,29,19,11,9,15,10,31,24,27,18,30,22    4    6   Q_72  Q_74
15:  0,1,2,3,4,6,8,12,5,16,21,26,31,22,18,10,7,24,17,13,30,14,19,27,20,9,23,25,11,29,15,28    4    6   Q_74  Q_74
16:  0,1,2,3,4,6,8,16,5,10,20,29,7,31,27,13,9,25,15,18,19,14,22,26,21,17,11,12,30,28,23,24    4    6   Q_10  Q_75
17:  0,1,2,4,3,6,8,16,5,10,15,27,19,29,31,20,7,18,25,21,12,14,24,28,26,11,23,13,30,9,17,22    2    6   Q_75  Q_74


6 Classifying 6 × m Quadratic Functions


The efficiency of Algorithm 6 makes it feasible to extend the search for quadratic permutations to
n = 6 bits for the first time. There are 2^{21} different 6-bit quadratic Boolean functions, of which
914 004 are nonlinear and balanced. This is our list F of candidate coordinate functions
f_i : F_2^6 → F_2. Generating all classes of 6 × m functions for m < 6 without linear components takes
8.5 hours on 24 cores. The total number of classes found for each m is shown in Table 8². Tables 9
to 12 show histograms of the classes' cryptographic properties. It is interesting to note that the two
best 6 × 5 classes in Table 12 correspond to the two AB 5 × 5 classes Q_74^{(5,5)} and Q_75^{(5,5)}, extended
with a sixth unused input bit.

Table 8: Number of affine equivalence classes with/without linear components for quadratic 6 × m
functions for m = 1, . . . , 5

                       6×1   6×2   6×3     6×4      6×5
# classes without        2    19   604   10 480    7 458
# classes with           3    24   670   11 891   12 647

Table 9: Number of quadratic 6 × 2 classes with cryptographic properties (Diff, Lin)

            Lin = 8   Lin = 16   Lin = 32
Diff = 32      5          3          0
Diff = 64      2          9          5

Table 10: Number of quadratic 6 × 3 classes with cryptographic properties (Diff, Lin)

            Lin = 8   Lin = 16   Lin = 32
Diff = 16     57          7          0
Diff = 32    128        252         19
Diff = 64     11        149         47

Table 11: Number of quadratic 6 × 4 classes with cryptographic properties (Diff, Lin)

            Lin = 8   Lin = 16   Lin = 32
Diff = 8      10          1          0
Diff = 16   1935        845         64
Diff = 32    618       5013        740
Diff = 64     42       2016        607

Table 12: Number of quadratic 6 × 5 classes with cryptographic properties (Diff, Lin)

            Lin = 8   Lin = 16   Lin = 32
Diff = 4       2          0          0
Diff = 8     111          3          4
Diff = 16    124       1028        424
Diff = 32      0       3343       2993
Diff = 64      4       2843       1768

² The exact listing of the representatives and their cryptographic properties can be found on http://homes.esat.kuleuven.be/~ldemeyer/ > Miscellaneous.


In order to complete the final stage of the search for all 6-bit quadratic permutations, we
generate the list of candidates for the AE algorithm by extending the 6 × 5 representatives with the
Boolean function candidates f_i ∈ F, and we add the linear extensions of all other 6 × m functions.
We split this list into 100 parts and complete the rest of the algorithm on 100 cores. In the end,
we find 2 263 classes of 6-bit quadratic permutations (not including the one linear permutation)³.
Table 13 shows how these classes are distributed among even and odd permutations and how many
of them have quadratic/cubic inverses. Table 14 depicts the histogram of cryptographic properties.
There are eight classes with Diff = 4 and Lin = 8; these are shown in Table 15. One of those
permutations is odd. Finally, Figure 8 shows the total number of affine equivalence classes of
quadratic n × n permutations. While it was already clear that this number grows fast with n, the
figure demonstrates how difficult it was before this work to predict just how fast.

Table 13: Number of quadratic 6 × 6 classes with certain properties

Even / Odd permutations       2258 / 5
Quadratic / cubic inverse       70 / 2193

Table 14: Number of quadratic 6 × 6 classes with cryptographic properties (Diff, Lin)

            Lin = 8   Lin = 16   Lin = 32
Diff = 4       8          0          0
Diff = 8       0          0         12
Diff = 16      0         49        100
Diff = 32      0         49       1067
Diff = 64      0        200        779

Table 15: Strong quadratic 6-bit permutations

Cl.    Diff  Lin  Parity  Representative
2256:   4     8   Even    0,1,2,3,4,6,7,5,8,12,16,20,32,39,57,62,9,17,21,13,40,51,53,46,50,47,52,41,63,33,56,38,10,45,
                          27,60,43,15,59,31,58,24,49,19,55,22,61,28,29,35,18,44,25,36,23,42,30,37,11,48,54,14,34,26
2257:   4     8   Even    0,1,2,3,4,6,7,5,8,12,16,20,32,39,57,62,9,17,21,13,41,50,52,47,40,53,46,51,36,58,35,61,10,25,
                          37,54,33,49,15,31,45,59,24,14,42,63,30,11,29,23,44,38,18,27,34,43,19,28,56,55,48,60,26,22
2258:   4     8   Even    0,1,2,3,4,6,7,5,8,12,16,20,32,39,57,62,9,17,21,13,41,50,52,47,55,42,49,44,59,37,60,34,10,25,
                          38,53,35,51,14,30,61,43,11,29,56,45,15,26,22,28,36,46,27,18,40,33,23,24,63,48,54,58,31,19
2259:   4     8   Even    0,1,2,3,4,6,8,10,5,11,16,30,32,45,59,54,7,20,41,58,47,63,15,31,48,44,9,21,57,38,14,17,12,25,
                          24,13,34,52,56,46,40,50,43,49,39,62,42,51,35,36,27,28,33,37,23,19,53,61,26,18,22,29,55,60
2260:   4     8   Even    0,1,2,3,4,6,8,10,5,11,16,30,32,45,59,54,7,24,40,55,48,44,17,13,9,25,49,33,31,12,41,58,14,27,
                          26,15,57,47,35,53,61,39,62,36,43,50,38,63,46,37,23,28,42,34,29,21,22,18,56,60,51,52,19,20
2261:   4     8   Even    0,1,2,3,4,6,8,10,5,11,16,30,32,45,59,54,7,34,21,48,13,43,17,55,56,18,61,23,19,58,24,49,9,52,
                          20,41,31,33,12,50,46,28,36,22,25,40,29,44,62,39,51,42,38,60,37,63,35,53,57,47,26,15,14,27
2262:   4     8   Even    0,1,2,3,4,6,8,10,5,12,16,25,32,42,59,49,7,20,14,29,52,36,51,35,53,46,43,48,39,63,55,47,9,58,
                          44,31,22,38,61,13,19,40,33,26,45,21,17,41,54,23,24,57,30,60,62,28,27,50,34,11,18,56,37,15
2263:   4     8   Odd     0,1,2,3,4,8,16,28,5,12,32,41,10,14,57,61,6,62,23,47,33,20,38,19,43,27,29,45,7,58,39,26,9,22,
                          55,40,11,25,35,49,44,59,53,34,37,63,42,48,21,51,56,30,52,31,15,36,24,54,18,60,50,17,46,13

Conclusion
This work studies the classification of quadratic vectorial Boolean functions under affine equivalence.
It extends Biryukov’s Affine Equivalence algorithm to non-bijective functions for use in a new
classification tool that provides us with the complete classification of balanced n × m quadratic
vectorial Boolean functions for m ≤ n and n < 7. We also introduce a tool for finding length-two
quadratic decompositions of higher degree functions.
New cryptographic algorithms should be designed with resistance against side-channel attacks
in mind. When it comes to choosing S-boxes, designers can use our classification to pick quadratic
³ The exact listing of the representatives and their cryptographic properties can be found on http://homes.esat.kuleuven.be/~ldemeyer/ > Miscellaneous.


Figure 8: Number of quadratic n × n classes for growing n

components and use our (de)composition tool to create cryptographically strong S-boxes with
efficient masked implementations. After the classifications of 4- and 5-bit permutations in previous
works, this work expands the knowledge base on both classification and decomposition, bringing us
one step closer to classifying 8-bit functions and decomposing the AES S-box using permutations
instead of tower field or square-and-multiply approaches.

Acknowledgements
The authors thank Dusan Bozilov for the insights into his algorithm and Prof. Vincent Rijmen for
fruitful discussion and helpful comments. This work was supported by the Research Council KU
Leuven: C16/15/058. Lauren De Meyer is funded by a PhD fellowship (aspirant) of the Fund for
Scientific Research - Flanders (FWO). Begül Bilgin was a postdoctoral fellow of the FWO during
this research. Currently, she is working at Rambus Cryptography Research.

References
[ABB+ 14] Elena Andreeva, Begül Bilgin, Andrey Bogdanov, Atul Luykx, Florian Mendel, Bart Men-
nink, Nicky Mouha, Qingju Wang, and Kan Yasuda. CAESAR submission: PRIMATEs
v1.02, March 2014. http://primates.ae/wp-content/uploads/primatesv1.02.pdf.

[AGR+16] Martin R. Albrecht, Lorenzo Grassi, Christian Rechberger, Arnab Roy, and Tyge Tiessen.
MiMC: Efficient encryption and cryptographic hashing with minimal multiplicative
complexity. In Jung Hee Cheon and Tsuyoshi Takagi, editors, Advances in Cryptology -
ASIACRYPT 2016 - 22nd International Conference on the Theory and Application of
Cryptology and Information Security, Hanoi, Vietnam, December 4-8, 2016, Proceedings,
Part I, volume 10031 of Lecture Notes in Computer Science, pages 191–219, 2016.

[ARS+15] Martin R. Albrecht, Christian Rechberger, Thomas Schneider, Tyge Tiessen, and Michael
Zohner. Ciphers for MPC and FHE. In Elisabeth Oswald and Marc Fischlin, editors,
Advances in Cryptology - EUROCRYPT 2015 - 34th Annual International Conference
on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April
26-30, 2015, Proceedings, Part I, volume 9056 of Lecture Notes in Computer Science,
pages 430–454. Springer, 2015.

[BBS17] Dusan Bozilov, Begül Bilgin, and Haci Ali Sahin. A note on 5-bit quadratic permutations’
classification. IACR Trans. Symmetric Cryptol., 2017(1):398–404, 2017.

[BCBP03] Alex Biryukov, Christophe De Cannière, An Braeken, and Bart Preneel. A toolbox for
cryptanalysis: Linear and affine equivalence algorithms. In Eli Biham, editor, Advances
in Cryptology - EUROCRYPT 2003, International Conference on the Theory and
Applications of Cryptographic Techniques, Warsaw, Poland, May 4-8, 2003, Proceedings,
volume 2656 of Lecture Notes in Computer Science, pages 33–50. Springer, 2003.

[BDP+14] Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van
Keer. CAESAR submission: Ketje v1, March 2014. https://competitions.cr.yp.
to/round1/ketjev1.pdf.


[BDP+15] Guido Bertoni, Joan Daemen, Michaël Peeters, Gilles Van Assche, and Ronny Van
Keer. CAESAR submission: Keyak v2, August 2015. https://competitions.cr.yp.
to/round2/keyakv2.pdf.

[BDPA13] Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Keccak. In
Thomas Johansson and Phong Q. Nguyen, editors, EUROCRYPT 2013, Athens, Greece,
May 26-30, 2013. Proceedings, volume 7881 of LNCS, pages 313–314. Springer, 2013.

[BGG+16] Erik Boss, Vincent Grosso, Tim Güneysu, Gregor Leander, Amir Moradi, and Tobias
Schneider. Strong 8-bit S-boxes with efficient masking in hardware. In Gierlichs and
Poschmann [GP16], pages 171–193.

[Bil15] Begül Bilgin. Threshold Implementations: As Countermeasure Against Higher-Order
Differential Power Analysis. PhD thesis, KU Leuven, Belgium & UTwente, The Netherlands,
2015.

[BKNN15] Begül Bilgin, Miroslav Knezevic, Ventzislav Nikov, and Svetla Nikova. Compact
implementations of multi-Sbox designs. In Naofumi Homma and Marcel Medwed,
editors, Smart Card Research and Advanced Applications - 14th International Conference,
CARDIS 2015, Bochum, Germany, November 4-6, 2015. Revised Selected Papers, volume
9514 of Lecture Notes in Computer Science, pages 273–285. Springer, 2015.

[BL08] Marcus Brinkmann and Gregor Leander. On the classification of APN functions up to
dimension five. Des. Codes Cryptography, 49(1-3):273–288, 2008.

[BNN+12] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, and Georg Stütz.
Threshold implementations of all 3 × 3 and 4 × 4 S-boxes. In Emmanuel Prouff and Patrick
Schaumont, editors, Cryptographic Hardware and Embedded Systems - CHES 2012
- 14th International Workshop, Leuven, Belgium, September 9-12, 2012. Proceedings,
volume 7428 of Lecture Notes in Computer Science, pages 76–91. Springer, 2012.

[BNN+15] Begül Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, Natalia N. Tokareva,
and Valeriya Vitkup. Threshold implementations of small S-boxes. Cryptography and
Communications, 7(1):3–33, 2015.

[BP10] Joan Boyar and René Peralta. A new combinational logic minimization technique
with applications to cryptology. In Paola Festa, editor, Experimental Algorithms, 9th
International Symposium, SEA 2010, Ischia Island, Naples, Italy, May 20-22, 2010.
Proceedings, volume 6049 of Lecture Notes in Computer Science, pages 178–189. Springer,
2010.

[BS90] Eli Biham and Adi Shamir. Differential cryptanalysis of des-like cryptosystems. In
Alfred Menezes and Scott A. Vanstone, editors, Advances in Cryptology - CRYPTO
’90, 10th Annual International Cryptology Conference, Santa Barbara, California, USA,
August 11-15, 1990, Proceedings, volume 537 of Lecture Notes in Computer Science,
pages 2–21. Springer, 1990.

[BW72] Elwyn R. Berlekamp and Lloyd R. Welch. Weight distributions of the cosets of the (32,
6) Reed-Muller code. IEEE Trans. Information Theory, 18(1):203–207, 1972.

[Can07] Christophe De Cannière. Analysis and Design of Symmetric Encryption Algorithms.
PhD thesis, Katholieke Universiteit Leuven, 2007.

[CCZ98] Claude Carlet, Pascale Charpin, and Victor A. Zinoviev. Codes, bent functions and
permutations suitable for DES-like cryptosystems. Des. Codes Cryptography, 15(2):125–
156, 1998.

[CGP+12] Claude Carlet, Louis Goubin, Emmanuel Prouff, Michaël Quisquater, and Matthieu
Rivain. Higher-order masking schemes for S-boxes. In Anne Canteaut, editor, Fast
Software Encryption - 19th International Workshop, FSE 2012, Washington, DC, USA,
March 19-21, 2012. Revised Selected Papers, volume 7549 of Lecture Notes in Computer
Science, pages 366–384. Springer, 2012.


[CPRR15] Claude Carlet, Emmanuel Prouff, Matthieu Rivain, and Thomas Roche. Algebraic
decomposition for probing security. In Rosario Gennaro and Matthew Robshaw, editors,
Advances in Cryptology - CRYPTO 2015 - 35th Annual Cryptology Conference, Santa
Barbara, CA, USA, August 16-20, 2015, Proceedings, Part I, volume 9215 of Lecture
Notes in Computer Science, pages 742–763. Springer, 2015.

[CRV14] Jean-Sébastien Coron, Arnab Roy, and Srinivas Vivek. Fast evaluation of polynomials
over binary finite fields and application to side-channel countermeasures. In Lejla
Batina and Matthew Robshaw, editors, Cryptographic Hardware and Embedded Systems
- CHES 2014 - 16th International Workshop, Busan, South Korea, September 23-26,
2014. Proceedings, volume 8731 of Lecture Notes in Computer Science, pages 170–187.
Springer, 2014.

[CV94] Florent Chabaud and Serge Vaudenay. Links between differential and linear cryptanalysis.
In Alfredo De Santis, editor, Advances in Cryptology - EUROCRYPT ’94, Workshop
on the Theory and Application of Cryptographic Techniques, Perugia, Italy, May 9-12,
1994, Proceedings, volume 950 of Lecture Notes in Computer Science, pages 356–365.
Springer, 1994.

[DEM15] Christoph Dobraunig, Maria Eichlseder, and Florian Mendel. Higher-order cryptanalysis
of LowMC. In Soonhak Kwon and Aaram Yun, editors, Information Security and
Cryptology - ICISC 2015 - 18th International Conference, Seoul, South Korea, November
25-27, 2015, Revised Selected Papers, volume 9558 of Lecture Notes in Computer Science,
pages 87–101. Springer, 2015.

[DEMS15] Christoph Dobraunig, Maria Eichlseder, Florian Mendel, and Martin Schläffer. CAESAR
submission: ASCON v1.1, August 2015. https://competitions.cr.yp.to/round2/
asconv11.pdf.

[Ful03] Joanne Elizabeth Fuller. Analysis of affine equivalent boolean functions for cryptography.
PhD thesis, Queensland University of Technology, 2003.

[Gol59] Solomon W. Golomb. On the classification of boolean functions. IRE Trans. Information
Theory, 5(5):176–186, 1959.

[GP16] Benedikt Gierlichs and Axel Y. Poschmann, editors. Cryptographic Hardware and
Embedded Systems - CHES 2016 - 18th International Conference, Santa Barbara, CA,
USA, August 17-19, 2016, Proceedings, volume 9813 of Lecture Notes in Computer
Science. Springer, 2016.

[GR16] Dahmun Goudarzi and Matthieu Rivain. On the multiplicative complexity of Boolean
functions and bitsliced higher-order masking. In Gierlichs and Poschmann [GP16], pages
457–478.

[KNP13] Sebastian Kutzner, Phuong Ha Nguyen, and Axel Poschmann. Enabling 3-share
threshold implementations for all 4-bit S-boxes. In Hyang-Sook Lee and Dong-Guk
Han, editors, Information Security and Cryptology - ICISC 2013 - 16th International
Conference, Seoul, Korea, November 27-29, 2013, Revised Selected Papers, volume 8565
of Lecture Notes in Computer Science, pages 91–108. Springer, 2013.

[Mai91] James A. Maiorana. A classification of the cosets of the Reed-Muller Code R(1, 6).
Mathematics of Computation, 57(195):403–414, 1991.

[NK95] Kaisa Nyberg and Lars R. Knudsen. Provable security against a differential attack. J.
Cryptology, 8(1):27–37, 1995.

[NNR19] Svetla Nikova, Ventzislav Nikov, and Vincent Rijmen. Decomposition of permutations
in a finite field. Cryptography and Communications, 11(3):379–384, 2019.

[Nyb93] Kaisa Nyberg. Differentially uniform mappings for cryptography. In Tor Helleseth, editor,
Advances in Cryptology - EUROCRYPT ’93, Workshop on the Theory and Application
of Cryptographic Techniques, Lofthus, Norway, May 23-27, 1993, Proceedings, volume
765 of Lecture Notes in Computer Science, pages 55–64. Springer, 1993.


[Nyb94] Kaisa Nyberg. S-boxes and round functions with controllable linearity and differential
uniformity. In Preneel [Pre95], pages 111–130.

[O’C94] Luke O’Connor. Properties of linear approximation tables. In Preneel [Pre95], pages
131–136.
[Pre95] Bart Preneel, editor. Fast Software Encryption: Second International Workshop. Leuven,
Belgium, 14-16 December 1994, Proceedings, volume 1008 of Lecture Notes in Computer
Science. Springer, 1995.

[PUB16] Léo Perrin, Aleksei Udovenko, and Alex Biryukov. Cryptanalysis of a theorem:
Decomposing the only known solution to the big APN problem. In Matthew Robshaw
and Jonathan Katz, editors, Advances in Cryptology - CRYPTO 2016 - 36th Annual
International Cryptology Conference, Santa Barbara, CA, USA, August 14-18, 2016,
Proceedings, Part II, volume 9815 of Lecture Notes in Computer Science, pages 93–122.
Springer, 2016.
[PV16] Jürgen Pulkus and Srinivas Vivek. Reducing the number of non-linear multiplications
in masking schemes. In Gierlichs and Poschmann [GP16], pages 479–497.
[RV13] Arnab Roy and Srinivas Vivek. Analysis and improvement of the generic higher-order
masking scheme of FSE 2012. In Guido Bertoni and Jean-Sébastien Coron, editors,
Cryptographic Hardware and Embedded Systems - CHES 2013 - 15th International
Workshop, Santa Barbara, CA, USA, August 20-23, 2013. Proceedings, volume 8086 of
Lecture Notes in Computer Science, pages 417–434. Springer, 2013.
[Saa11] Markku-Juhani O. Saarinen. Cryptographic analysis of all 4 × 4-bit S-boxes. In Ali
Miri and Serge Vaudenay, editors, Selected Areas in Cryptography - 18th International
Workshop, SAC 2011, Toronto, ON, Canada, August 11-12, 2011, Revised Selected
Papers, volume 7118 of Lecture Notes in Computer Science, pages 118–133. Springer,
2011.

Curriculum Vitae

Lauren De Meyer obtained her Master of Science in mathematical engineering
from KU Leuven in 2015. During her undergraduate studies in computer science
and electrical engineering, she conducted research projects under guidance of
Professors Bart Preneel and Vincent Rijmen. She spent her first year of graduate
studies at EPFL, Switzerland as an exchange student, where she was able to
work under the supervision of Professor Serge Vaudenay. She obtained a Master
of Science in electrical engineering from Harvard University in May 2016. In
October 2016, she joined the research group COSIC at KU Leuven, in pursuit
of a PhD degree, funded by the Fund for Scientific Research Flanders (FWO).

FACULTY OF ENGINEERING SCIENCE
DEPARTMENT OF ELECTRICAL ENGINEERING
COSIC
Kasteelpark Arenberg 10 box 2452
B-3001 Leuven
lauren.demeyer@esat.kuleuven.be
http://www.cosic.esat.kuleuven.be
