Thesis


High-Performance Hardware Architecture for

CRYSTALS-Dilithium

by
Truong Dang Quang

A THESIS

Submitted to the faculty of


INHA UNIVERSITY
in partial fulfillment of the requirement
for the degree of
MASTER OF SCIENCE

Department of Electrical and Computer Engineering


August, 2023
Major: Information and Communication Engineering


Advisor: Professor Hanho Lee
This certifies that the thesis of Truong Dang Quang is approved

Chair: 서 영 교

Referee: 이 한 호

Referee: 서 영 교
Abstract

In the fourth industrial revolution, technological advances in fields such as the Internet
of Things (IoT) and big data have made digital data increasingly important, and digital data
is replacing conventional data. With the revolution in digital data storage and communication
technologies, digital information can be easily stored, copied, changed, and transported.
These properties are useful, but they also raise security issues. Privacy, authentication,
and integrity of data are major concerns that require security procedures to be attached to
digital information. In areas such as receipts and contracts, users are concerned about
unauthorized access and data theft. Conventional signatures cannot address these issues.
The solution is the digital signature, a mathematical scheme that ensures the privacy of the
conversation, the integrity and authenticity of the digital message and its sender, and
non-repudiation by the sender. Digital signatures are currently used in various application
domains, including government, banking, financial services, healthcare, and many others.
Existing digital signature standards such as the RSA Probabilistic Signature Scheme
(RSA-PSS) and the Elliptic Curve Digital Signature Algorithm (ECDSA) offer a certain level
of security and were included in the 2009 version of Federal Information Processing
Standard FIPS 186. However, advances in quantum computing may break their security, because
the security of ECDSA and RSA-PSS rests on the difficulty of the elliptic curve discrete
logarithm problem and of the integer factorization problem, respectively. In September 2022,
the National Institute of Standards and Technology (NIST) released a status report selecting
Dilithium as the primary signature algorithm for standardization.
Dilithium is a member of the Cryptographic Suite for Algebraic Lattices (CRYSTALS), and
its core operation is arithmetic on polynomial matrices and vectors. Dilithium is a
lattice-based digital signature algorithm built on the Fiat-Shamir paradigm, with security
based on the Module Learning with Errors (M-LWE) and Short Integer Solution (SIS) problems.
The three core algorithms of Dilithium are key generation, signature generation, and
signature verification. Key generation creates a public key and a secret key for the other
algorithms. Signature generation produces a signature from the secret key and the message,
while verification checks the authenticity and integrity of the message.
This thesis presents a high-performance FPGA implementation of CRYSTALS-Dilithium
designed at the register transfer level (RTL) in Verilog. It is a combined architecture
capable of performing key generation, signature generation, and signature verification, and
of selecting between security levels at runtime. The proposed architecture has been
implemented with the Xilinx Vivado tool on a Virtex UltraScale+ FPGA platform, achieving the
lowest latency for all operations while maintaining a smaller area.
To accelerate the polynomial arithmetic operations in Dilithium, this thesis proposes a
flexible arithmetic unit designed specifically for Dilithium hardware implementation. With
polynomial multipliers based on the multi-path delay commutator (MDC) architecture, the
arithmetic unit reduces multiplication time while requiring fewer hardware resources than
existing architectures. In addition, the other sub-modules are parallelized as needed to
minimize stall and wait delays and to achieve better performance in the key generation,
signing, and verification operations.
Acknowledgements

I would like to extend my sincere gratitude to those who gave me the possibility to
complete this thesis. This thesis would not have been possible without the considerable
guidance and support of my advisor, Professor Hanho Lee, to whom I am deeply grateful.
I would like to thank the honorable dissertation committee for agreeing to be a part of
my dissertation committee, and for their kind effort. I am also grateful to professors in the
Department of Electrical and Computer Engineering, Inha University for the knowledge
in their valuable lectures which are very important for me in my research.
I would like to thank my colleagues, my friends in Digital Integrated Systems
Laboratory for their support, encouragement, and the friendly environment they provided
me during my stay. Last but not least, I am extremely thankful to my family and close
friends for their unconditional support in any circumstances.

Contents

Abstract
Acknowledgements

Chapter 1 Introduction
1.1 Overview
1.2 Contribution
1.3 Thesis organization

Chapter 2 Digital Signature Overview and Dilithium
2.1 Digital signature
2.2 Lattices
2.3 Lattice Problems
2.3.1 Shortest Vector Problem
2.3.2 Ring-LWE
2.3.3 Module-LWE
2.3.4 Fiat-Shamir with Aborts
2.4 Dilithium Scheme
2.4.1 Basic Ring Operations
2.4.2 NTT Domain Representation
2.4.3 Hashing
2.4.4 High/Low Order Bits and Hints
2.5 Dilithium Algorithm
2.5.1 Key Generation
2.5.2 Signature Generation
2.5.3 Verification
2.6 Summary

Chapter 3 NTT Algorithm and Architecture
3.1 NTT Multiplication
3.2 Cooley-Tukey Algorithm and Gentleman-Sande Algorithm
3.3 NTT Processor Architecture
3.3.1 Single-path Delay Feedback (SDF)
3.3.2 Multipath Delay Feedback (MDF)
3.3.3 Single-path Delay Commutator (SDC)
3.3.4 Multipath Delay Commutator (MDC)
3.3.5 Radix-2 Multipath Delay Commutator
3.4 Summary

Chapter 4 Hardware Implementation of Dilithium
4.1 Combined Architecture of CRYSTALS-Dilithium
4.2 Hardware Implementation for Sub-modules in Dilithium
4.2.1 Polynomial Arithmetic Module
4.2.2 Unified Butterfly Unit
4.2.3 Modular Reduction Module
4.2.4 BRAM Array Configuration
4.2.5 Hashing and Sampling
4.2.6 Encoder/Decoder
4.2.7 Make/Use Hint
4.2.8 Decomposer
4.3 Operation Scheduling
4.3.1 Key Generation
4.3.2 Signature Generation
4.3.3 Signature Verification
4.4 Summary

Chapter 5 Performance Evaluation
5.1 CRYSTALS-Dilithium Simulation
5.2 Resource Utilization and Performance
5.3 Summary

Chapter 6 Conclusion

Reference
List of Figures

Figure 1. Digital signature scheme
Figure 2. Example of a 2-dimensional lattice with two different bases
Figure 3. An SVP solution
Figure 4. Module LWE
Figure 5. Lattice-Based Identification Scheme
Figure 6. SampleInBall description
Figure 7. The pseudo-code for deterministic and randomized versions of Dilithium
Figure 8. CRYSTALS-Dilithium scheme
Figure 9. Block diagram of a typical NTT polynomial multiplication
Figure 10. Cooley-Tukey data flow diagram (n = 2^3)
Figure 11. Gentleman-Sande data flow diagram (n = 2^3)
Figure 12. Block diagram of a typical NTT polynomial multiplication
Figure 13. General block diagram of pipeline architecture
Figure 14. SDF architecture
Figure 15. MDF architecture
Figure 16. R4SDC structure
Figure 17. MDC architecture
Figure 18. Block diagram of a 256-point R2MDC FFT
Figure 19. Combined architecture CRYSTALS-Dilithium
Figure 20. Block diagram of the proposed flexible arithmetic module
Figure 21. Unified butterfly unit
Figure 22. Modular reduction module
Figure 23. Keccak hardware architecture
Figure 24. Encoder
Figure 25. Decoder
Figure 26. Decomposer
Figure 27. Key generation timing diagram
Figure 28. Schedule of the precomputation stage of signature generation
Figure 29. Timing diagram of rejection loop in signature generation
Figure 30. Schedule of signature verification
Figure 31. Simulation result of CRYSTALS-Dilithium
Figure 32. Simulation result of secret key in NIST level 2
Figure 33. Simulation result of public key in NIST level 2
Figure 34. Receiving message M process in signature generation phase
Figure 35. Simulation results of digital signature
Figure 36. Simulation results of verification phase
Figure 37. Verification results and execution time
List of Tables

Table 1. Supporting algorithms for Dilithium
Table 2. Parameters of Dilithium submitted in NIST Round 3
Table 3. Comparison of FPGA-based designs for Dilithium signature scheme
Chapter 1 Introduction

1.1 Overview

Conventional signatures are a very common way to verify the authenticity of a document.
In the era of digital technology, a corresponding method for digital data is required in
areas such as receipts and contracts. The solution to this security problem is the digital
signature, a mathematical technique used to validate the integrity and authenticity of
messages, digital documents, or software. The U.S. National Institute of Standards and
Technology (NIST) issued Federal Information Processing Standard FIPS 186 [1], which
specifies the Digital Signature Algorithm (DSA). The DSA was initially introduced in 1991
and updated in 1993 based on public feedback regarding the security of the scheme. There
was a further minor revision in 1996. With the advent of quantum computers, the RSA and ECC
signature algorithms are no longer considered secure. Due to the threat that quantum
computers pose to modern cryptosystems, NIST started a post-quantum cryptography
competition in 2017. The goal is to create new cryptographic standards that are presumed to
be quantum-resistant and that specify algorithms for public-key encryption, key
establishment, and digital signatures. After three rounds of evaluation and analysis,
CRYSTALS-Dilithium, a lattice-based signature scheme, was selected as the primary algorithm
for digital signature standardization [2]. This promising candidate is therefore expected
to replace classical digital signature mechanisms within the next few years. Digital
signatures have a variety of applications used across many organizations, including
verifying transactions, ratifying and managing contracts, identifying citizens, and
paperless banking. With the rapid development of digital technology, efficient and
high-performance implementations of digital signatures are crucial. This thesis focuses on
an efficient implementation that improves the performance of the Dilithium scheme.
1.2 Contribution
Throughout the NIST Post-Quantum Cryptography Standardization Process, many works have
implemented and evaluated Dilithium using various approaches, such as pure software or
hardware/software co-design. This thesis summarizes the notable full-hardware
implementations of round 3 Dilithium that are relevant to our work. Among them, two
combined architectures capable of performing the three major operations at all security
levels are described in [3] and [4]. Both approaches aim for high performance. In [28], a
lightweight hardware accelerator for CRYSTALS-Dilithium supporting security level 5 is
designed. Although other works are available, they either do not support all of the
security levels or are individual modules that support only one of the three major
operations. This thesis aims to design a hardware architecture that supports all three
major operations and all security levels of the most recent version of Dilithium.
Additionally, a flexible arithmetic unit, designed specifically for Dilithium hardware
implementation and inspired by the R2MDC FFT structure, is proposed. Finally, an optimized
operation schedule is presented to achieve an efficient hardware implementation and
maximize utilization of the sub-modules.

1.3 Thesis Organization


The remaining chapters of this thesis are organized as follows:
• Chapter 2 provides an overview of and background on lattice-based cryptography,
especially its hard problems. The theoretical fundamentals of the Dilithium digital
signature scheme and its algorithms, including key generation, signature generation, and
verification, are briefly described.
• Chapter 3 presents the fundamental theory and design methodology used in the
polynomial arithmetic unit. This chapter gives a theoretical introduction to the number
theoretic transform (NTT), analyzes the benefits and drawbacks of different FFT processor
architectures, and proposes an efficient one for implementing the Dilithium scheme in
hardware.
• Chapter 4 presents the implementation of the high-performance hardware architecture
for Dilithium, including its sub-modules and the operation scheduling used in this design.
• Chapter 5 provides the evaluation and a comparison with previous works.
• Chapter 6 summarizes the work achieved in this thesis.
Chapter 2 Digital Signature Overview and Dilithium

This chapter briefly introduces the theoretical groundwork of the lattice-based signature
scheme Dilithium and its functionalities, including the mathematical foundations of
lattices and their corresponding hard problems, such as Short Integer Solution (SIS) and
Learning with Errors (LWE). Additionally, the approach called Fiat-Shamir with Aborts and
the Dilithium algorithm as submitted to NIST are described in detail.

2.1 Digital signature


With the revolution in digital data storage and communication technologies, digital
information can be easily stored, copied, changed, and transported. These properties are
useful, but they also raise security issues, so solutions are needed to protect digital
information and documents, which are regarded as unreliable in areas where privacy,
authentication, and integrity of data are major concerns. In areas such as receipts,
contracts, and messages, users are concerned about unauthorized access and data theft.
Conventional signatures cannot change this situation because they are included in the
document as a part of the document. The solution to all these security issues is the
digital signature. When we sign a document digitally, we send the signature as a separate
document.
In the digital signature method, the recipient receives both the message and the
signature. To verify authenticity, the recipient must apply a verification technique to
the combination of the message and the signature. A digital signature ensures the privacy
of digital data and protects it from unauthorized access. Digital signatures are currently
used in various application domains, including government (e.g., online filing of tax
returns, citizen ID cards, issuing forms and licenses, reservations and ticketing), banking
(e.g., inter- and intra-bank messaging systems, corporate Internet banking applications),
financial services and brokerage (e.g., online trading, electronic contract notes),
healthcare (e.g., electronic medical records, healthcare management systems, electronic
prescriptions), and many others. Integrity, authentication, privacy, and non-repudiation
are the four key factors in achieving information security. Privacy, also known as
confidentiality, guarantees the protection of information from unauthorized access.

Figure 1: Digital signature scheme [5].

A digital signature is an authentication mechanism that allows the sender of a message
to attach a unique code that serves as a signature. Typically, as shown in Figure 1, the
digital signature is formed by hashing the message and encrypting the resulting digest with
the private key of the sender. The signature guarantees the source and integrity of the
message. The Digital Signature Standard is a NIST standard that makes use of the Secure
Hash Algorithm. The message signature, the plain message, and the public key (pk) of the
sender are packed together and then transformed into an encrypted and signed message using
the public key of the recipient. The recipient unpacks and decrypts the signature before
computing the digest of the received message with the same hash function. The resulting
message digest is then compared to the decrypted signature to verify the authenticity and
integrity of the message.
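
The hash-then-sign flow described above can be illustrated with a minimal Python sketch. It assumes textbook RSA with deliberately tiny, insecure parameters (n = 61 · 53) and SHA-256 reduced modulo n as the digest. The function names and all parameter values are illustrative assumptions, not taken from this thesis or from any standard, and the recipient-side encryption step is omitted.

```python
import hashlib

# Toy textbook-RSA key (insecure, for illustration only): n = 61 * 53
N = 3233
E = 17      # public exponent
D = 2753    # private exponent: 17 * 2753 = 1 (mod 3120), 3120 = 60 * 52

def digest(message: bytes) -> int:
    """Hash the message with SHA-256 and reduce it into Z_N (toy-sized)."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Signer: 'encrypt' the digest with the private exponent."""
    return pow(digest(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Verifier: recover the digest with the public exponent and compare."""
    return pow(signature, E, N) == digest(message)

signature = sign(b"contract: pay 100 units")
is_valid = verify(b"contract: pay 100 units", signature)
```

In this sketch, any change to the signature value makes verification fail, because the map x → x^E mod N is a permutation of Z_N for this modulus.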
There are many digital signature standards. This thesis focuses specifically on the
Dilithium algorithm [6], the primary algorithm selected by NIST for standardization and a
member of the Cryptographic Suite for Algebraic Lattices (CRYSTALS). The mathematical
foundations of lattices and their corresponding hard problems, such as Short Integer
Solution (SIS) and Learning with Errors (LWE), as well as the Fiat-Shamir with Aborts
approach [7], are discussed in the following sections.

2.2 Lattices
When discussing cryptographic systems that are secure against quantum computers, we have
many options, including, but not limited to, lattice-based systems, isogeny-based systems,
and multivariate-based systems. The most promising to date are lattice-based cryptosystems.

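Definition 1: An m-dimensional lattice $L$ is a subgroup of $\mathbb{R}^n$ generated by vectors $\mathbf{b}_i \in \mathbb{R}^n$:

$$L = \left\{ \sum_{i=1}^{m} a_i \mathbf{b}_i : a_i \in \mathbb{Z} \right\} = \left\{ B\mathbf{a} : \mathbf{a} \in \mathbb{Z}^m \right\}$$

where the $\mathbf{b}_i$ are linearly independent vectors, called basis vectors. The matrix
$B \in \mathbb{R}^{n \times m}$ is called the basis matrix and consists of the basis vectors.

Definition 2: An m-dimensional vector subspace $V$ of $\mathbb{R}^n$ is generated by vectors $\mathbf{b}_i \in \mathbb{R}^n$ as

$$V = \left\{ \sum_{i=1}^{m} a_i \mathbf{b}_i : a_i \in \mathbb{R} \right\}$$

where the $\mathbf{b}_i$ are linearly independent vectors, called basis vectors.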

Figure 2: Example of a 2-dimensional lattice with two different bases.

A lattice L is a discrete version of a vector subspace V: instead of taking any real
linear combination of the basis vectors $\{\mathbf{b}_1, \ldots, \mathbf{b}_m\}$, only linear
combinations with integer coefficients are allowed. Namely, $a_i \in \mathbb{Z}$ for a lattice
and $a_i \in \mathbb{R}$ for a vector subspace. Like a vector subspace, a lattice does not have
a unique set of basis vectors. An example of a 2-dimensional lattice with two sets of basis
vectors is shown in Figure 2. Consider the lattice L generated by the two vectors
$\mathbf{b}_1 = (2, 0)$ and $\mathbf{b}_2 = (1, 1)$. This is the set of all vectors of the
form (2x + y, y), where x, y ∈ ℤ. If we plot these points in the plane, we see that they
form a two-dimensional lattice. The red basis is certainly "more orthogonal" than the blue
basis, so the red basis is nicer than the blue one. As with vector subspaces, a central
lattice problem is: given a lattice basis, is there a nicer basis?
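
The statement that one lattice can have several bases can be checked with a short Python sketch. It uses the basis (2, 0), (1, 1) from the text and, as an assumption made only for this illustration, the alternative basis (1, 1), (1, -1), obtained as b2 and b1 - b2. Whether these match the red and blue bases of Figure 2 is not known from the text. Membership is decided with Cramer's rule over the integers.

```python
from itertools import product

def in_lattice(basis, v):
    """Return True if v is an integer combination of the two 2-D basis vectors."""
    (b1x, b1y), (b2x, b2y) = basis
    det = b1x * b2y - b2x * b1y          # non-zero for a valid basis
    a_num = v[0] * b2y - b2x * v[1]      # Cramer's rule numerator for a
    b_num = b1x * v[1] - b1y * v[0]      # Cramer's rule numerator for b
    return a_num % det == 0 and b_num % det == 0

basis_a = ((2, 0), (1, 1))    # b1, b2 from the text
basis_b = ((1, 1), (1, -1))   # assumed alternative basis: b2 and b1 - b2

# Both bases should accept exactly the same points of a small grid.
same_lattice = all(
    in_lattice(basis_a, v) == in_lattice(basis_b, v)
    for v in product(range(-5, 6), repeat=2)
)
```

For example, (3, 1) = 1·b1 + 1·b2 is a lattice point, while (1, 0) is not, since 2x + y = 1 with y = 0 has no integer solution.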
Since a lattice is a discrete version of a vector subspace, there always exists a shortest
vector other than the trivial zero vector. From this, we can define the non-zero minimum
of any lattice L, denoted by

$$\lambda_1(L) := \min\{\|x\|_2 : x \in L,\ x \neq 0\}.$$

The successive minima $\lambda_i(L)$ can be defined as the smallest radius r such that the
n-dimensional ball of radius r centered on the origin contains i linearly independent
lattice points. Many tasks in computing, and especially in cryptography, can be reduced to
determining the shortest non-zero vector in a lattice.
q-ary Lattices. In cryptography, one does not calculate in infinite rings such as ℤ, but
rather in finite ones such as $\mathbb{Z}_q$, for some prime q. Recall that
$\mathbb{Z}_q = \mathbb{Z}/q\mathbb{Z}$, represented by $(-q/2, q/2] \cap \mathbb{Z}$.
So-called q-ary lattices use the same approach for lattices. A q-ary lattice L is one such
that $q\mathbb{Z}^n \subseteq L \subseteq \mathbb{Z}^n$ for some integer q. Such a lattice is
determined by its points modulo q, so we only need to work with finitely many residues.
The following two q-ary lattices will be used later to reduce a hard lattice problem to
finding the shortest non-zero vector:
Definition 3: Given a matrix $A \in \mathbb{Z}_q^{n \times m}$ with m ≥ n, we define two
m-dimensional q-ary lattices in the following way [8]:

$$\Lambda_q(A) = \{ y \in \mathbb{Z}^m : y = A^T z \pmod q \text{ for some } z \in \mathbb{Z}^n \},$$

$$\Lambda_q^{\perp}(A) = \{ y \in \mathbb{Z}^m : Ay = 0 \pmod q \}.$$

A detailed definition of lattices and more information can be found in [8].

2.3 Lattice Problems
2.3.1 Shortest Vector Problem
The simplest and most famous hard problem on a lattice is to determine the shortest
vector within the lattice.
Definition 4: Given a lattice basis B, there are three variants of this problem:
• The shortest vector problem (SVP) for a lattice Λ is to find a nonzero vector x ∈ Λ
such that ‖x‖ ≤ ‖y‖ for all nonzero y ∈ Λ.
• The γ-approximate shortest vector problem (SVPγ) for a lattice Λ is to find a nonzero
vector x ∈ Λ such that ‖x‖ ≤ γ‖y‖ for all nonzero y ∈ Λ.
• The γ-unique SVP (uSVPγ): given a lattice L and a constant γ > 1 such that
λ2(L) > γ λ1(L), find a non-zero x ∈ L of length λ1(L).
See Figure 3 for an example of a two-dimensional lattice, the input basis, and the two
shortest lattice vectors that an SVP solver should find. Note that a shortest lattice
vector is not unique, since if x ∈ L then −x ∈ L as well. The LLL algorithm [9] will
heuristically solve the SVP, and for large dimensions it will solve the approximate SVP
with $\gamma = 2^{(n-1)/2}$ in the worst case. The γ-unique SVP is potentially easier than
the others, since we are given more information about the underlying lattice.
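
For very small lattices, the SVP can be solved by exhaustive search, which makes the definition concrete. The Python sketch below reuses the basis (2, 0), (1, 1) from Section 2.2, which is not necessarily the lattice of Figure 3, and searches integer coefficients within an assumed bound of 3. The function name and the bound are illustrative. This brute-force approach does not scale, which is why algorithms such as LLL are used in practice.

```python
from itertools import product

def shortest_vectors(basis, bound=3):
    """Search integer coefficients in [-bound, bound] for shortest non-zero vectors."""
    (b1x, b1y), (b2x, b2y) = basis
    best_sq, best = None, []
    for a, b in product(range(-bound, bound + 1), repeat=2):
        if a == 0 and b == 0:
            continue                      # skip the trivial zero vector
        v = (a * b1x + b * b2x, a * b1y + b * b2y)
        sq = v[0] ** 2 + v[1] ** 2        # squared Euclidean norm
        if best_sq is None or sq < best_sq:
            best_sq, best = sq, [v]
        elif sq == best_sq and v not in best:
            best.append(v)
    return best_sq, sorted(best)

norm_sq, vectors = shortest_vectors(((2, 0), (1, 1)))
```

Here the squared minimum is 2, attained by (1, 1), (1, -1), and their negatives, which illustrates that a shortest vector always comes with its negative.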

Figure 3: An SVP solution.

Short Integer Solution (SIS) Problem. The next hard problem is the Short Integer Solution
problem, which is defined as follows:
Definition 5: Given an integer q and vectors $a_1, \ldots, a_m \in \mathbb{Z}_q^n$, the SIS
problem is to find a short vector $z \in \mathbb{Z}^m$ such that
$z_1 a_1 + \cdots + z_m a_m = 0 \pmod q$. Here, 'short' refers to $\|\mathbf{z}\|_\infty = 1$.
The relation between the SIS problem and q-ary lattices is the following: if we set
$A = (a_1, \ldots, a_m) \in \mathbb{Z}_q^{n \times m}$ and

$$\Lambda_q^{\perp}(A) = \{ z \in \mathbb{Z}^m : Az = 0 \pmod q \},$$

the SIS problem becomes the SVP for the lattice $\Lambda_q^{\perp}(A)$. It is known that the
SIS problem has worst-case hardness; see [10] for more information and a proof of this
statement.
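
A toy SIS instance can be solved by brute force, as in the following Python sketch. The matrix A, the modulus q = 7, and the dimensions n = 2, m = 6 are assumed values chosen only for illustration. With these sizes a solution is guaranteed to exist: there are 2^6 = 64 binary vectors but only 7^2 = 49 possible values of Az mod q, so two binary vectors collide, and their difference is a non-zero vector with entries in {-1, 0, 1}.

```python
from itertools import product

def sis_bruteforce(A, q):
    """Search z in {-1, 0, 1}^m with ||z||_inf = 1 and A z = 0 (mod q)."""
    n, m = len(A), len(A[0])
    for z in product((-1, 0, 1), repeat=m):
        if not any(z):
            continue                      # skip the trivial zero vector
        if all(sum(A[i][j] * z[j] for j in range(m)) % q == 0 for i in range(n)):
            return z
    return None

# Toy instance (assumed values): n = 2, m = 6, q = 7
q = 7
A = [[1, 2, 3, 4, 5, 6],
     [3, 1, 4, 1, 5, 2]]
z = sis_bruteforce(A, q)
```

The search space has 3^m candidates, so this only works for toy sizes. For cryptographic parameters, no efficient method is known.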

2.3.2 Ring-LWE
The next hard lattice problem is called Learning with Errors (LWE). There are two LWE
problems: the search problem and the decision problem.
A more specific formulation of LWE is called Ring-LWE. Given a ring $R_q$ and an error
distribution $D_{R,\sigma}$, the Ring-LWE problems are defined as follows:
Ring-LWE Search Problem: Pick $a, s \in R_q$ and $e \leftarrow D_{R,\sigma}$, and set
$b := as + e \pmod q$. Given the pair (a, b), the search problem is to output the value s.
Ring-LWE Decision Problem: Given (a, b) where $a, b \in R_q$, determine which of the
following two cases holds:
1. b is chosen uniformly at random from $R_q$.
2. $b := as + e \pmod q$ where $e \leftarrow D_{R,\sigma}$ and $s \in R_q$.
The ring $R_q$ is defined as $R_q = \mathbb{Z}_q[X]/F(X)$, that is, the ring
$\mathbb{Z}[X]/F(X)$ with coefficients reduced modulo q, where F(X) denotes an integer
polynomial of degree n. The distribution $D_{R,\sigma}$ produces polynomials of degree less
than n whose coefficients follow a rounded normal distribution with mean zero and standard
deviation σ. The Ring-LWE problems are lattice-based problems with a worst-case to
average-case reduction and thus have worst-case hardness [8].
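
The construction of a Ring-LWE pair (a, b) can be sketched in Python as follows. The sketch assumes F(X) = X^n + 1, the choice Dilithium uses later, together with toy parameters n = 8, q = 97, and σ = 1. The function names and these values are illustrative assumptions, and the schoolbook multiplication is used only for clarity.

```python
import random

def negacyclic_mul(a, b, q):
    """Multiply two polynomials in Z_q[X]/(X^n + 1) by schoolbook convolution."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                res[k] = (res[k] + a[i] * b[j]) % q
            else:                          # X^n = -1: wrap around with a sign flip
                res[k - n] = (res[k - n] - a[i] * b[j]) % q
    return res

def rlwe_sample(n, q, sigma, rng):
    """Return (a, s, e, b) with b = a*s + e in Z_q[X]/(X^n + 1)."""
    a = [rng.randrange(q) for _ in range(n)]
    s = [rng.randrange(q) for _ in range(n)]
    e = [round(rng.gauss(0, sigma)) for _ in range(n)]   # rounded normal error
    a_times_s = negacyclic_mul(a, s, q)
    b = [(x + y) % q for x, y in zip(a_times_s, e)]
    return a, s, e, b

a, s, e, b = rlwe_sample(n=8, q=97, sigma=1.0, rng=random.Random(0))
```

The search problem asks for s given only (a, b). Without the small error e, s could be recovered by linear algebra, which is why the error term is essential.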

2.3.3 Module-LWE
The Module Learning with Errors (MLWE) problem was introduced to overcome the limitations
of both plain LWE and Ring-LWE by combining the two. Compared to ideal lattices, module
lattices have more intricate algebraic structures. To increase the security level of an
encryption algorithm based on Ring-LWE, the length N of the polynomial or the size q of the
modulus must be changed. Changing the polynomial ring in turn changes the NTT algorithm used
for polynomial multiplication and its internal operations, such as modular reduction. For
this reason, Ring-LWE is considered inflexible in adjusting the security level. In
contrast, Module-LWE organizes the public and private keys in matrix form, as shown in
Figure 4. Equation (1) describes the Module-LWE operation. Unlike Ring-LWE with the equation
b := as + e, where b, a, s, and e are single polynomials with N coefficients, in Module-LWE
a is a k × l matrix of polynomials, s is a vector of l polynomials, and b and e are vectors
of k polynomials. The security level can be changed by adjusting the number of polynomials,
that is, the parameter k representing the size of the matrix and vectors, without changing
the ring dimension n. More details on Ring-LWE and Module-LWE can be found in [11].

$$\mathbf{b}_{k} = \mathbf{a}_{k \times l} \times \mathbf{s}_{l} + \mathbf{e}_{k} \qquad (1)$$

Figure 4: Module LWE.
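
Equation (1) can be sketched in Python as a matrix-vector product whose entries are polynomials. The sketch assumes the ring Z_q[X]/(X^n + 1), small secret and error coefficients drawn uniformly from [-eta, eta], and toy dimensions. The function names, `eta`, and the parameter values are illustrative assumptions. It shows that k and l can be varied while the ring arithmetic in `poly_mul` stays unchanged.

```python
import random

def poly_mul(a, b, q):
    """Schoolbook product in Z_q[X]/(X^n + 1)."""
    n = len(a)
    res = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1          # X^n = -1
            idx = (i + j) % n
            res[idx] = (res[idx] + sign * a[i] * b[j]) % q
    return res

def poly_add(a, b, q):
    return [(x + y) % q for x, y in zip(a, b)]

def mlwe_sample(k, l, n, q, eta, rng):
    """Return (A, s, e, b) with b = A*s + e, A a k-by-l matrix of polynomials."""
    def small():
        return [rng.randint(-eta, eta) for _ in range(n)]
    A = [[[rng.randrange(q) for _ in range(n)] for _ in range(l)] for _ in range(k)]
    s = [small() for _ in range(l)]
    e = [small() for _ in range(k)]
    b = []
    for i in range(k):
        acc = [c % q for c in e[i]]                # start from the error term
        for j in range(l):
            acc = poly_add(acc, poly_mul(A[i][j], s[j], q), q)
        b.append(acc)
    return A, s, e, b

A, s, e, b = mlwe_sample(k=3, l=2, n=8, q=8380417, eta=2, rng=random.Random(1))
```

Calling `mlwe_sample` with larger k and l produces a larger instance over the same ring, which is the flexibility described above.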

2.3.4 Fiat-Shamir with Aborts
The Fiat-Shamir transform [7] is a method for transforming an identification scheme
into a signature scheme. An identification scheme, which we assume to be a public-key
identification (ID) protocol, is a protocol that allows the holder of a secret key to prove
its identity to any other entity holding the corresponding public key without disclosing
the secret key [10]. The identification scheme includes two main components: a
key-generation algorithm and an interactive protocol between a prover who has the secret
key and a verifier who has the corresponding public key [12].
The general identification scheme is a three-way protocol:
• Commitment phase: the prover commits to a certain value.
• Challenge phase: the verifier responds with a challenge.
• Verification phase: the prover provides a final response, called the proof, which
is connected to the challenge and the commitment. The verifier must be able to
verify this proof.
More information about identification schemes and how the Fiat-Shamir transform works
can be found in [7, 10, 12]. In this thesis, only the lattice-based identification scheme
is discussed.

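Private key: $\hat{\mathbf{s}} \xleftarrow{\$} D_s^m$
Public key: $h \xleftarrow{\$} \mathcal{H}(R, D, m)$, $\mathbf{S} \leftarrow h(\hat{\mathbf{s}})$

Prover: $\hat{\mathbf{y}} \xleftarrow{\$} D_y^m$, $\mathbf{Y} \leftarrow h(\hat{\mathbf{y}})$; sends $\mathbf{Y}$ to the verifier
Verifier: $\mathbf{c} \xleftarrow{\$} D_c$; sends $\mathbf{c}$ to the prover
Prover: $\hat{\mathbf{z}} \leftarrow \hat{\mathbf{s}}\mathbf{c} + \hat{\mathbf{y}}$; if $\hat{\mathbf{z}} \notin G^m$ then $\hat{\mathbf{z}} \leftarrow \perp$; sends $\hat{\mathbf{z}}$ to the verifier
Verifier: accepts if $\hat{\mathbf{z}} \in G^m$ and $h(\hat{\mathbf{z}}) = \mathbf{S}\mathbf{c} + \mathbf{Y}$

Figure 5: Lattice-Based Identification Scheme [12].
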
One such lattice-based identification scheme is shown in Figure 5. In the first step, the
prover picks an m-dimensional vector ŷ from a distribution D_y^m and commits to it by
sending Y = h(ŷ) to the verifier, where h is a cryptographic hash function. In turn,
the verifier selects a random challenge c from another distribution D_c and sends this
challenge to the prover. The prover then computes ẑ = ŝc + ŷ. If ẑ falls into G^m,
the set of responses that satisfy all security requirements, then
the prover sends this result to the verifier. Otherwise, the protocol is aborted. The intuition
behind aborting is that nothing about the secret key should be disclosed. If the
prover computed only ẑ = ŝc, the verifier could simply solve for ŝ.
Therefore, the masking vector ŷ is needed. Aborting is what keeps the protocol
witness indistinguishable. An identification scheme is said to be perfectly witness
indistinguishable if, for any public key S and any two valid secret keys s, s′ (i.e. s,
s′ ∈ D_s and h(s) = h(s′) = S), the view of any (possibly malicious) verifier
has exactly the same distribution when the prover uses s as when
the prover uses s′ [12].
Intuitively, it would make sense to pick ŷ uniformly at random from a much larger space
than ŝ. However, in the lattice-based scheme this is infeasible, because it would
require much stronger complexity assumptions and would decrease the
efficiency of the protocol [12]. Therefore, ŷ needs to be large enough to successfully
mask the secret key ŝ and thus make the protocol witness indistinguishable. On the
other hand, it must be small enough to keep the protocol efficient. Because ŷ is drawn
from such a limited range, the prover aborts the protocol whenever the resulting ẑ is not
in the correct range. Finally, the verifier accepts the interaction if ẑ ∈ G^m and h(ẑ) = Sc + Y.
The proof that this identification scheme is sound and complete is given in [12].
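The three moves and the abort rule can be illustrated with a toy sketch. It is not the scheme of [12]: the linear map h(x) = A·x mod q, the tiny dimensions, the ranges S_MAX, C_MAX, Y_MAX and all function names are assumptions chosen only so that the check h(ẑ) = Sc + Y can be executed; the parameters offer no security.

```python
import random

Q = 8380417                    # modulus, borrowed from Dilithium for illustration only
M, ROWS = 8, 4                 # toy dimensions (assumption, not secure)
S_MAX, C_MAX, Y_MAX = 1, 1, 64 # assumed ranges for secret, challenge and mask
Z_MAX = Y_MAX - S_MAX * C_MAX  # G^m: responses with every |z_i| <= Z_MAX

def h(A, x):
    """Linear map h(x) = A*x mod Q, so that h(s*c + y) = h(s)*c + h(y)."""
    return [sum(a * v for a, v in zip(row, x)) % Q for row in A]

def keygen(rng):
    A = [[rng.randrange(Q) for _ in range(M)] for _ in range(ROWS)]
    s = [rng.randint(-S_MAX, S_MAX) for _ in range(M)]
    return A, s, h(A, s)                       # public S = h(s)

def commit(A, rng):
    y = [rng.randint(-Y_MAX, Y_MAX) for _ in range(M)]
    return y, h(A, y)                          # commitment Y = h(y)

def respond(s, y, c):
    z = [si * c + yi for si, yi in zip(s, y)]
    # Abort (return None) when z leaves G^m, so that z does not depend on s.
    return z if all(abs(zi) <= Z_MAX for zi in z) else None

def verify(A, S, Y, c, z):
    if z is None or any(abs(zi) > Z_MAX for zi in z):
        return False
    return h(A, z) == [(Si * c + Yi) % Q for Si, Yi in zip(S, Y)]

def run_once(rng):
    """Repeat the protocol until the prover does not abort; return the verdict."""
    A, s, S = keygen(rng)
    while True:
        y, Y = commit(A, rng)
        c = rng.randint(-C_MAX, C_MAX)         # verifier's challenge
        z = respond(s, y, c)
        if z is not None:
            return verify(A, S, Y, c, z)
```

Calling `run_once(random.Random(1))` runs one complete interaction, restarting after every abort.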

2.4 Dilithium Scheme


Before going through the Dilithium algorithm, we need to define certain operations. All
definitions are taken from [6].

2.4.1 Basic Ring Operations

R and R_q respectively denote the rings ℤ[X]/(X^n + 1) and ℤ_q[X]/(X^n + 1), for q an
integer. In Dilithium, the value of n is always 2^8 = 256 and the prime q is 8380417
= 2^23 − 2^13 + 1. Regular font letters represent elements in R or R_q (which includes
elements in ℤ and ℤ_q), and bold lower-case letters denote column vectors with
coefficients in R or R_q. By default, all vectors are column vectors and bold upper-case
letters are matrices. We denote by v^T the transpose of the vector v.
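Modular reductions (centered reduction modulo α). We write r′ = r mod± α. For an even α, r′
is the unique element in the range −α/2 < r′ ≤ α/2 such that r′ ≡ r (mod α); for an odd α,
r′ is the unique element in the range −(α−1)/2 ≤ r′ ≤ (α−1)/2 such that r′ ≡ r (mod α).
For any positive integer α, r′ = r mod+ α denotes the unique element in the range
0 ≤ r′ < α such that r′ ≡ r (mod α).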

Sizes of elements. For an element w ∈ ℤ_q, we write ‖w‖∞ to mean |w mod± q|. We define
the ℓ∞ and ℓ2 norms for w = w_0 + w_1 X + ... + w_{n−1} X^{n−1} ∈ R:

\[ \|w\|_\infty = \max_i \|w_i\|_\infty, \qquad \|w\| = \sqrt{\|w_0\|_\infty^2 + \dots + \|w_{n-1}\|_\infty^2}. \]

Similarly, for w = (w_1, …, w_k) ∈ R^k, we define

\[ \|\mathbf{w}\|_\infty = \max_i \|w_i\|_\infty, \qquad \|\mathbf{w}\| = \sqrt{\|w_1\|^2 + \dots + \|w_k\|^2}. \]

We write S_η to denote all elements w ∈ R such that ‖w‖∞ ≤ η, and S̃_η to denote
the set {w mod± 2η : w ∈ R}.
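The centered reduction and the ℓ∞ norm defined above can be written down directly. The following is a minimal sketch; the function names are illustrative and polynomials are assumed to be plain Python lists of integer coefficients.

```python
Q = 8380417  # Dilithium modulus

def mod_plus(r: int, alpha: int) -> int:
    """r mod+ alpha: representative in [0, alpha)."""
    return r % alpha

def mod_pm(r: int, alpha: int) -> int:
    """r mod± alpha: centered representative.

    Even alpha: -alpha/2 < r' <= alpha/2.
    Odd alpha:  -(alpha-1)/2 <= r' <= (alpha-1)/2.
    """
    r %= alpha
    return r - alpha if r > alpha // 2 else r

def inf_norm(poly) -> int:
    """Infinity norm of a polynomial: largest |coefficient mod± Q|."""
    return max(abs(mod_pm(c, Q)) for c in poly)

def inf_norm_vec(vec) -> int:
    """Infinity norm of a vector of polynomials."""
    return max(inf_norm(p) for p in vec)
```

For example, a coefficient stored as Q − 5 counts with magnitude 5, which is why norm checks are performed on the centered representative.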

2.4.2 NTT Domain Representation


The modulus q is chosen such that there exists a primitive 512-th root of unity r modulo q.
Specifically, the value of r is chosen to be 1753. This choice ensures that the cyclotomic
polynomial X^256 + 1 splits into linear factors X − r^i modulo q, where i = 1, 3,
5, . . . , 511. By the Chinese remainder theorem (CRT), the cyclotomic ring R_q is thus
isomorphic to the product of the rings ℤ_q[X]/(X − r^i) ≅ ℤ_q. The isomorphism
a ↦ (a(r), a(r^3), …, a(r^511)) : R_q → ∏_i ℤ_q[X]/(X − r^i)

can be calculated quickly using the Fast Fourier Transform (FFT). In this case, where the
ground field is a finite field, the FFT is also called the Number Theoretic Transform (NTT).
The NTT domain representation â = NTT(a) ∈ ℤ_q^256 of a polynomial a ∈ R_q is defined to
have its coefficients in the order output by the reference NTT.
Concretely,
â = NTT(a) = (a(r_0), a(−r_0), . . . , a(r_127), a(−r_127)),
where r_i = r^brv(128+i) with brv(k) the bit-reversal of the 8-bit number k. With this
notation, and because of the isomorphism property, we have ab =
NTT^−1(NTT(a)NTT(b)). Further details about NTT implementations are given in the next chapter.
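The stated properties of r = 1753 and the ordering of the evaluation points can be checked numerically. This is a minimal sketch with illustrative helper names; `naive_ntt` evaluates the polynomial point by point and is meant for checking only, not as a fast transform.

```python
Q = 8380417
R = 1753
N = 256

def brv(k: int, bits: int = 8) -> int:
    """Bit-reversal of the `bits`-bit number k."""
    return int(format(k, f"0{bits}b")[::-1], 2)

def is_primitive_512th_root(r: int) -> bool:
    """r has order exactly 512 iff r^256 = -1 (mod Q)."""
    return pow(r, 256, Q) == Q - 1

def ntt_points():
    """Evaluation points in the order r_0, -r_0, ..., r_127, -r_127."""
    points = []
    for i in range(128):
        r_i = pow(R, brv(128 + i), Q)
        points += [r_i, (Q - r_i) % Q]
    return points

def naive_ntt(a):
    """Evaluate a(X) at every point; O(n^2), for illustration only."""
    return [sum(c * pow(x, j, Q) for j, c in enumerate(a)) % Q
            for x in ntt_points()]
```

Every point returned by `ntt_points()` is an odd power of r, so each one is a root of X^256 + 1 modulo q.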

2.4.3 Hashing
The Dilithium scheme employs several algorithms to hash binary strings in {0, 1}* to
domains of different forms.
Hashing to a Ball. Let B_τ denote the set of elements of R that have τ coefficients
equal to either −1 or 1 and all remaining coefficients equal to 0. We have
|B_τ| = 2^τ · C(256, τ), where C(256, τ) is the binomial coefficient. In Dilithium, the process
of hashing onto B_τ involves two steps using a cryptographic hash function. The first step
uses a second-preimage-resistant cryptographic hash function to map {0, 1}*
onto the domain {0, 1}^256, while the second step applies an XOF (e.g.
SHAKE-256) to map the output of the first step onto an element of B_τ. The algorithm
used to generate a random element in B_τ is sometimes called the “inside-out” version of
the Fisher-Yates shuffle. The high-level description of this algorithm is given in
Figure 6. The XOF in the second step creates the randomness required by the algorithm
in Steps 03 and 04, using the output of the first step as the seed.

SampleInBall(ρ)
01. c = c_0 c_1 … c_255 := 00 … 0
02. for i := 256 − τ to 255
03.     j ← {0, 1, … , i}
04.     s ← {0, 1}
05.     c_i := c_j
06.     c_j := (−1)^s
07. return c
Figure 6: SampleInBall description.
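The inside-out shuffle of Figure 6 can be sketched as follows. How the XOF output is consumed (eight bytes of sign bits first, then one byte per candidate index with rejection of values larger than i) is an assumption modelled on common software implementations, not something fixed by the figure; the function name and the buffer size are likewise illustrative.

```python
import hashlib

N = 256

def sample_in_ball(seed: bytes, tau: int) -> list[int]:
    """Return a polynomial with exactly tau coefficients in {-1, +1}, rest 0."""
    assert 0 < tau <= 64                       # 64 sign bits are reserved below
    # Assumption: 4 KiB of SHAKE-256 output is far more than the loop consumes.
    stream = hashlib.shake_256(seed).digest(8 + 4096)
    signs = int.from_bytes(stream[:8], "little")
    pos = 8
    c = [0] * N
    for i in range(N - tau, N):
        # Step 03: draw j uniformly from {0, ..., i} by rejection sampling.
        while True:
            j = stream[pos]
            pos += 1
            if j <= i:
                break
        # Step 04: take the next sign bit s.
        s = signs & 1
        signs >>= 1
        c[i] = c[j]                            # Step 05
        c[j] = 1 - 2 * s                       # Step 06: (-1)^s
    return c
```

Each loop iteration increases the number of non-zero coefficients by exactly one, which is why the result always has weight τ.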

Expanding Matrix A. ExpandA is the function that maps a uniform seed ρ ∈
{0, 1}^256 to a polynomial matrix A ∈ R_q^{k×l} in NTT domain representation. Matrix A is only
needed for multiplication. Hence, instead of outputting A, the function ExpandA generates
Â ∈ (ℤ_q^256)^{k×l}, which is interpreted as the number theoretic transform (NTT) domain
representation of the polynomial matrix A. Because A needs to be uniformly sampled and the
NTT is an isomorphism, ExpandA must sample the coefficients of Â uniformly in this
representation.
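One common way to obtain uniform coefficients in ℤ_q from an XOF is rejection sampling on 23-bit values. The sketch below illustrates that idea only; the domain separation by two index bytes, the three-byte little-endian parsing, and the function name are assumptions and are not taken from this thesis.

```python
import hashlib

Q = 8380417
N = 256

def rej_uniform_poly(rho: bytes, i: int, j: int) -> list[int]:
    """Sample the 256 coefficients of one matrix entry, uniform in [0, Q)."""
    seed = rho + bytes([j, i])        # assumed per-entry domain separation
    out_len = 3 * N * 2               # generous first guess of needed bytes
    while True:
        stream = hashlib.shake_128(seed).digest(out_len)
        coeffs = []
        for k in range(0, len(stream) - 2, 3):
            # Keep 23 bits so that candidates lie in [0, 2^23).
            t = int.from_bytes(stream[k:k + 3], "little") & 0x7FFFFF
            if t < Q:                 # reject values >= Q to stay uniform
                coeffs.append(t)
                if len(coeffs) == N:
                    return coeffs
        out_len *= 2                  # extremely unlikely: ask for more output
```

Because q is only slightly smaller than 2^23, almost every candidate is accepted, so the loop rarely consumes more than 3·256 bytes.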
Sampling vectors s1 and s2. The function ExpandS, used for generating the secret
vectors of the secret key (sk) in key generation, maps a seed ρ′ to (s1, s2) ∈ S_η^l × S_η^k.
Sampling vector y. In order to deterministically generate the randomness in the
signature scheme, the function ExpandMask is used to map a seed ρ′ and a nonce κ to
y ∈ S̃_γ1^l.
Hashing. The hash function used in the signature scheme is the extendable-output
function (XOF) SHAKE-256, which maps onto the specified domain.

2.4.4 High/Low Order Bits and Hints


In order to minimize the size of the public key, we require basic algorithms that
can extract the "higher-order" and "lower-order" bits from elements in ℤ_q. In this way,
given an arbitrary element r ∈ ℤ_q and a small element z ∈ ℤ_q, the higher-order bits of r + z can be recovered

without the need to store z. Therefore, we define algorithms that take z and r and produce a
1-bit hint h that permits one to calculate the high-order bits of z + r using only h and r.
This hint h is essentially the “carry bit” caused by z in the addition.
The algorithm Power2Round is the straightforward bit-wise way to split an element
of ℤ_q into its “high-order” and “low-order” bits: r = r1·2^d + r0, where r0 = r
mod± 2^d and r1 = (r − r0)/2^d.
If the representatives of r1 are chosen to be non-negative integers between 0 and
⌊q/2^d⌋, the distance modulo q between ⌊q/2^d⌋·2^d and 0 can be very small. This
becomes problematic when using a 1-bit hint, as even a small addition to r can result in a
change of more than 1 in the high-order bits of r. To avoid this problem, a simple
tweak is used: choose α to be a divisor of q − 1 and write r = r1·α + r0 in the same way as before.
For simplicity, α is chosen to be even. The possible values of r1·α are now
{0, α, 2α, 3α, . . . , q − 1}, and the distance between q − 1 and 0 is 1. The value q − 1 is
therefore removed from this set and the corresponding r's are rounded to 0, which at most
increases the magnitude of r0 by 1. This procedure is called
Decompose. Using this procedure as a subroutine, the following functions are defined
for use in Dilithium:
Decompose. The Decompose_q(r, α) function splits the value r into the
unique form r = r1·α + r0 with r0 = r mod± α, i.e. −α/2 < r0 ≤ α/2, except in the
rounded case above, where r1 = 0 and r0 is decreased by 1.
HighBits. The ‘high-bits’ are simply the r1 coefficients obtained by decomposing the
value r.
LowBits. Similar to the ‘high-bits’, the ‘low-bits’ are calculated with the
Decompose_q(r, α) function. The difference is that the ‘low-bits’ are the r0 coefficients.
MakeHint. The MakeHint_q(z, r, α) function decides whether a hint is needed to verify a
signature correctly. This is done by checking whether adding z to r changes the output of
the HighBits_q function. If it changes, we need a hint; if it does not, we can omit
the hint.
UseHint. This function restores the ‘high-bits’ of r + z from r when a hint
is present. If the hint is set, the sign of r0 decides whether 1 has to be added to or
subtracted from the ‘high-bits’ of r. The idea behind the hints is to reduce the public key size.

Power2Round_q(r, d)
01. r := r mod+ q
02. r0 := r mod± 2^d
03. return ((r − r0)/2^d, r0)

MakeHint_q(z, r, α)
04. r1 := HighBits_q(r, α)
05. v1 := HighBits_q(r + z, α)
06. return ⟦r1 ≠ v1⟧

UseHint_q(h, r, α)
07. m := (q − 1)/α
08. (r1, r0) := Decompose_q(r, α)
09. if h = 1 and r0 > 0 return (r1 + 1) mod+ m
10. if h = 1 and r0 ≤ 0 return (r1 − 1) mod+ m
11. return r1

HighBits_q(r, α)
12. (r1, r0) := Decompose_q(r, α)
13. return r1

Decompose_q(r, α)
14. r := r mod+ q
15. r0 := r mod± α
16. if r − r0 = q − 1
17. then r1 := 0; r0 := r0 − 1
18. else r1 := (r − r0)/α
19. return (r1, r0)

LowBits_q(r, α)
20. (r1, r0) := Decompose_q(r, α)
21. return r0
Table 1: Supporting algorithms for Dilithium.
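The supporting algorithms of Table 1 translate almost line by line into software. The following sketch operates on single coefficients; the Python function names are illustrative, and a hardware implementation would replace the divisions by constant-specific logic.

```python
Q = 8380417

def mod_pm(r: int, alpha: int) -> int:
    """Centered reduction r mod± alpha."""
    r %= alpha
    return r - alpha if r > alpha // 2 else r

def power2round(r: int, d: int):
    """Split r into (r1, r0) with r = r1 * 2^d + r0 (lines 01-03)."""
    r %= Q
    r0 = mod_pm(r, 1 << d)
    return (r - r0) >> d, r0

def decompose(r: int, alpha: int):
    """Split r into (r1, r0) with r = r1 * alpha + r0 mod Q (lines 14-19)."""
    r %= Q
    r0 = mod_pm(r, alpha)
    if r - r0 == Q - 1:          # border case: round q - 1 down to 0
        return 0, r0 - 1
    return (r - r0) // alpha, r0

def high_bits(r: int, alpha: int) -> int:
    return decompose(r, alpha)[0]

def low_bits(r: int, alpha: int) -> int:
    return decompose(r, alpha)[1]

def make_hint(z: int, r: int, alpha: int) -> int:
    """1 if adding z changes the high bits of r, else 0 (lines 04-06)."""
    return int(high_bits(r, alpha) != high_bits(r + z, alpha))

def use_hint(h: int, r: int, alpha: int) -> int:
    """Recover the high bits of r + z from r and the hint h (lines 07-11)."""
    m = (Q - 1) // alpha
    r1, r0 = decompose(r, alpha)
    if h == 1:
        return (r1 + 1) % m if r0 > 0 else (r1 - 1) % m
    return r1
```

With α = 2γ2, `use_hint(make_hint(z, r, α), r, α)` returns the same value as `high_bits(r + z, α)` whenever |z| ≤ α/2, which is the first property of Lemma 1.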

The following lemmas describe the properties of the supporting algorithms used in
the Dilithium scheme, which are crucial for ensuring its correctness and security. The
proofs of these lemmas can be found in the specification submitted to NIST round 3 [6].
Lemma 1: Suppose that q and α are positive integers such that q > 2α, q ≡ 1 (mod α) and α is
even. Let r and z be vectors of elements in R_q where ‖z‖∞ ≤ α/2, and let h, h′
be vectors of bits. Then the HighBits_q, MakeHint_q, and UseHint_q algorithms satisfy
the following properties:
1. UseHint_q(MakeHint_q(z, r, α), r, α) = HighBits_q(r + z, α).
2. Let v1 = UseHint_q(h, r, α). Then ‖r − v1·α‖∞ ≤ α + 1. Moreover,
if the number of 1’s in h is ω, then all except at most ω coefficients of
r − v1·α will have a magnitude of at most α/2 after centered reduction modulo q.
3. For any h, h′, if UseHint_q(h, r, α) = UseHint_q(h′, r, α), then h = h′.
Lemma 2: If ‖s‖∞ ≤ β and ‖LowBits_q(r, α)‖∞ < α/2 − β, then

HighBits_q(r, α) = HighBits_q(r + s, α).

Lemma 3: Let (r1, r0) = Decompose_q(r, α), (w1, w0) = Decompose_q(r + s, α), and
‖s‖∞ ≤ β. Then ‖s + r0‖∞ < α/2 − β ⟺ w1 = r1 ∧ ‖w0‖∞ < α/2 − β.

2.5 Dilithium Algorithm
The Key Generation, Signing, and Verification algorithms of the Dilithium signature
scheme submitted to NIST are presented in Figure 7.

2.5.1 Key generation


This algorithm generates the key pair used for signature generation and
verification.
Matrix A is not sampled directly but is deterministically generated from a
randomly sampled seed ρ ∈ {0, 1}^256. Since A is part of the public key, storing ρ takes less
space in the key pair than storing A. In addition, A is kept in NTT domain representation to take
advantage of point-wise multiplication. Next, the vectors s1 and s2 are sampled as
described in Section 2.4.3. We then calculate t := As1 + s2. The vectors t1
and t0 are the high-order and low-order bits of t, obtained via the function Power2Round_q(t, d). The public
key consists of the tuple pk = (ρ, t1) and the secret key is sk = (ρ, K, tr, s1, s2, t0). The data layout
of the key pair is described in detail in the Dilithium specification [6].

2.5.2 Signature generation

This function receives the secret key sk and the message M as parameters. The first
step is to create the matrix A with the help of the ExpandA(ρ) function.
µ holds the hash of tr concatenated with the message, µ = H(tr‖M). Since the signing
procedure may need to be repeated several times until a valid signature is generated, κ is
used as a counter. Each time a signing attempt is discarded, κ is
incremented and the sampling is restarted. This is necessary to make sure
that the SHAKE-256 output is distinct for each signing attempt on the same message.
The vector y is deterministically sampled by hashing K, µ and κ; it differs between
iterations because κ increases.
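The role of the counter κ can be illustrated with a short sketch. It only shows that the XOF input, and therefore the derived randomness, changes with every attempt; the two-byte little-endian encoding of κ and both function names are assumptions, and the conversion of the byte stream into the coefficients of y is omitted.

```python
import hashlib

def derive_rho_prime(K: bytes, mu: bytes) -> bytes:
    """512-bit seed rho' = H(K || mu), with H instantiated as SHAKE-256."""
    return hashlib.shake_256(K + mu).digest(64)

def mask_stream(rho_prime: bytes, kappa: int, nbytes: int = 64) -> bytes:
    """Raw XOF output from which y would be unpacked for attempt kappa.

    Assumption: kappa is appended as a 2-byte little-endian integer.
    """
    return hashlib.shake_256(rho_prime + kappa.to_bytes(2, "little")).digest(nbytes)
```

Two attempts with the same ρ′ but different κ yield unrelated streams, while repeating an attempt with the same κ reproduces the same stream, which is what makes signing deterministic.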
The vector w1, which holds the high-order bits of w = Ay, can be computed via
Decompose_q(Ay, 2γ2). Then, the challenge c̃ is the hash of µ and w1. The vector r0 is
computed as shown in line 20 of Figure 7.

The signature candidate z is computed as z := y + c·s1.
If the candidate passes the checks in line 21 of Figure 7, we then create a hint:
h := MakeHint_q(−ct0, w − cs2 + ct0, 2γ2). This creates a vector h in which each
‘coordinate’ is a single bit. The idea
behind hints is to reduce the size of the signature. If a bit equals 1, the summation causes
a ‘carry’ into the high-order bits.
The candidate is also rejected if either of the following two conditions holds:
1. ‖ct0‖∞ ≥ γ2 (the largest coefficient of ct0 is at least γ2),
2. the number of 1’s in h is greater than ω.
The signature is then composed as σ = (c̃, z, h).
The signature data layout is the concatenation of a bit-packed
representation of the challenge c̃ and encodings of z and h, in this order.

2.5.3 Verification
The verification algorithm takes the public key pk, the message M, and the signature
σ = (c̃, z, h) as parameters.
First, the matrix A must be reconstructed from the seed ρ. Then, µ is computed in the same
way as in the Sign algorithm, with tr recomputed as H(ρ‖t1). To reconstruct w′1, one needs to use the hint: w′1 =
UseHint_q(h, Az − ct1·2^d, 2γ2). Recall that UseHint_q(h, r, α) acts as the
counterpart of the MakeHint_q(z, r, α) function.
Finally, the signature is accepted if:
1. ‖z‖∞ < γ1 − β,
2. c̃ = H(µ‖w′1),
3. the number of 1’s in h is at most ω.

Gen
01. ζ ← {0,1}^256
02. (ρ, ρ′, K) ∈ {0,1}^256 × {0,1}^512 × {0,1}^256 := H(ζ)    ► H is instantiated as SHAKE-256
03. A ∈ R_q^{k×l} := ExpandA(ρ)    ► A is generated and stored in NTT representation as Â
04. (s1, s2) ∈ S_η^l × S_η^k := ExpandS(ρ′)
05. t := As1 + s2    ► Compute As1 as NTT^−1(Â · NTT(s1))
06. (t1, t0) := Power2Round_q(t, d)
07. tr ∈ {0,1}^256 := H(ρ ‖ t1)
08. return pk = (ρ, t1), sk = (ρ, K, tr, s1, s2, t0)

Sign(sk, M)
09. A ∈ R_q^{k×l} := ExpandA(ρ)    ► A is generated and stored in NTT representation as Â
10. µ ∈ {0,1}^512 := H(tr ‖ M)
11. κ := 0, (z, h) := ⊥
12. ρ′ ∈ {0,1}^512 := H(K ‖ µ) (or ρ′ ← {0,1}^512 for randomized signing)
13. while (z, h) = ⊥ do    ► Pre-compute ŝ1 := NTT(s1), ŝ2 := NTT(s2), t̂0 := NTT(t0)
14.     y ∈ S̃_γ1^l := ExpandMask(ρ′, κ)
15.     w := Ay    ► w := NTT^−1(Â · NTT(y))
16.     w1 := HighBits_q(w, 2γ2)
17.     c̃ ∈ {0,1}^256 := H(µ ‖ w1)
18.     c ∈ B_τ := SampleInBall(c̃)    ► Store c in NTT representation as ĉ = NTT(c)
19.     z := y + cs1    ► Compute cs1 as NTT^−1(ĉ · ŝ1)
20.     r0 := LowBits_q(w − cs2, 2γ2)    ► Compute cs2 as NTT^−1(ĉ · ŝ2)
21.     if ‖z‖∞ ≥ γ1 − β or ‖r0‖∞ ≥ γ2 − β, then (z, h) := ⊥
22.     else
23.         h := MakeHint_q(−ct0, w − cs2 + ct0, 2γ2)    ► Compute ct0 as NTT^−1(ĉ · t̂0)
24.         if ‖ct0‖∞ ≥ γ2 or the # of 1’s in h is greater than ω, then (z, h) := ⊥
25.     κ := κ + l
26. return σ = (c̃, z, h)

Verify(pk, M, σ = (c̃, z, h))
27. A ∈ R_q^{k×l} := ExpandA(ρ)    ► A is generated and stored in NTT representation as Â
28. µ ∈ {0,1}^512 := H(H(ρ ‖ t1) ‖ M)
29. c := SampleInBall(c̃)
30. w′1 := UseHint_q(h, Az − ct1·2^d, 2γ2)    ► Compute as NTT^−1(Â · NTT(z) − NTT(c) · NTT(t1·2^d))
31. return ⟦‖z‖∞ < γ1 − β⟧ and ⟦c̃ = H(µ ‖ w′1)⟧ and ⟦# of 1’s in h is ≤ ω⟧

Figure 7: The pseudo-code for deterministic and randomized versions of Dilithium.

Proof of correctness. Requirements one and three of the verification function are
trivially fulfilled, since they follow directly from how the signature is created. What is
left to show is that w′1 = w1.
If ‖ct0‖∞ < γ2, then by Lemma 1 we know that:
UseHint_q(h, w − cs2 + ct0, 2γ2)
= UseHint_q(MakeHint_q(−ct0, w − cs2 + ct0, 2γ2), w − cs2 + ct0, 2γ2)
= HighBits_q(w − cs2 + ct0 + (−ct0), 2γ2)
= HighBits_q(w − cs2, 2γ2).
Since w = Ay and t = As1 + s2, we can conclude that:
w − cs2 = Ay − cs2 = A(z − cs1) − cs2 = Az − ct.
Now, we show that w − cs2 + ct0 = Az − ct1·2^d:
Az − ct1·2^d = A(y + cs1) − c·((t − t0)/2^d)·2^d
= Ay + cAs1 − ct + ct0
= Ay + cAs1 − c(As1 + s2) + ct0
= Ay + cAs1 − cAs1 − cs2 + ct0
= w − cs2 + ct0.
The identities used above all follow from the algorithm and the helper functions:
t = As1 + s2,
w = Ay,
t1 = (t − t0)/2^d,
z = y + cs1.
Thus, the verifier computes:
UseHint_q(h, Az − ct1·2^d, 2γ2) = HighBits_q(w − cs2, 2γ2).
It remains to show that this value equals w1, i.e. that
HighBits_q(w − cs2, 2γ2) = HighBits_q(w, 2γ2).
The signer only outputs a signature when ‖LowBits_q(w − cs2, 2γ2)‖∞ < γ2 − β, and
‖cs2‖∞ ≤ β, so Lemma 2 applies. With
r1 = HighBits_q(w − cs2, 2γ2),
w1 = HighBits_q(w, 2γ2),
we obtain r1 = w1, and hence w′1 = w1.
The CRYSTALS-Dilithium scheme is illustrated in Figure 8.

[Figure: signing flow on the prover side (hash µ = H(tr‖M), sample y with coefficients in [−γ1, γ1], compute c̃ = H(µ‖HighBits(Ay)) and z = y + cs1, check conditions, MakeHint, output σ = (c̃, z, h)) and verification flow on the verifier side (hash µ = H(H(ρ‖t1)‖M), recover w′1, compare c̃ with H(µ‖w′1) to decide whether the message was modified).]

Figure 8: CRYSTALS-Dilithium scheme.

2.6 Summary
This chapter discussed the foundations of digital signature schemes, in particular the
CRYSTALS-Dilithium scheme. In addition, the fundamentals of lattice-based
cryptosystems and their hard problems were clarified. The chapter also covered the scheme's
specification, the properties of its rounding functions, and the detailed algorithms used.
These algorithms serve as the basis for the work in the
following chapters.

Chapter 3 NTT Algorithm and Architecture

The Number Theoretic Transform (NTT) is essentially a discrete Fourier transform over the
finite field ℤ_q, computed with Fast Fourier Transform (FFT) techniques and applied to
polynomials in the ring R_q = ℤ_q[x]/(x^n + 1). After transforming
a pair of polynomials to the "frequency" domain, their product can be easily computed
through an O(n) element-by-element product, and the resulting polynomial can be
converted back to the time domain via the inverse transformation (INTT). Hence, the cost
of the transformations to and from the frequency domain, which is
just O(n log(n)), is the dominant cost of polynomial multiplication.
As a signature scheme based on module lattices and one of the
algorithms of the CRYSTALS suite, Dilithium uses the number theoretic transform in order to
reduce computational complexity.

3.1 NTT Multiplication


Polynomial multiplication over a ring is the arithmetic operation that requires the most
processing time. To speed up the computation and reduce the computational
complexity of ring-LWE cryptographic schemes, polynomial multiplication based on the
NTT is used, as shown in Figure 9.
With coefficients a_i and b_i in ℤ_q, the polynomials a(x) and b(x) in the ring
R_q = ℤ_q[x]/(x^n + 1) can be expressed as Equations (2) and (3).
\[ a(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1} \tag{2} \]
\[ b(x) = b_0 + b_1 x + b_2 x^2 + \dots + b_{n-1} x^{n-1} \tag{3} \]
The result of this multiplication, c(x), can be expressed as Equation (4), where the
product is reduced modulo x^n + 1.
\[ c(x) = a(x) \cdot b(x) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j x^{i+j} \bmod (x^n + 1) \tag{4} \]
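Equation (4) can be evaluated directly with two nested loops. The sketch below is the O(n^2) schoolbook method in R_q; it makes explicit that a term x^(i+j) with i + j ≥ n wraps around with a sign change because x^n ≡ −1. The function name is illustrative.

```python
Q = 8380417

def schoolbook_negacyclic(a, b, q=Q):
    """Multiply a(x) * b(x) in Z_q[x]/(x^n + 1) with the double sum of Eq. (4)."""
    n = len(a)
    assert len(b) == n
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                # x^k = x^(k-n) * x^n = -x^(k-n) in this ring
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

For n = 256 this costs 65,536 coefficient multiplications per product, which is the baseline the NTT-based method improves on.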

[Figure: a(x) and b(x) each pass through a bit-reverse step (Step 1) and an NTT (Step 2); the results are multiplied point-wise (Step 3), bit-reversed (Step 4) and transformed by the INTT (Step 5) to give c(x).]

Figure 9: Block diagram of a typical NTT polynomial multiplication.

The NTT representation (A_0, A_1, · · · , A_{n−1}) of a polynomial a(x) ∈ R_q with
coefficients (a_0, a_1, · · · , a_{n−1}) is defined by Equation (5), where w_n is a
primitive n-th root of unity modulo q.
\[ A_i = \sum_{j=0}^{n-1} a_j w_n^{ij} \bmod q, \quad i = 0, 1, 2, \dots, n-1 \tag{5} \]
The inverse NTT (INTT) is defined by Equation (6).
\[ a_i = n^{-1} \sum_{j=0}^{n-1} A_j w_n^{-ij} \bmod q, \quad i = 0, 1, 2, \dots, n-1 \tag{6} \]
Note that the INTT has the same form as the NTT, with w_n replaced by
w_n^−1 mod q and the final result multiplied by n^−1 mod q. Since there are log(n) stages in the
NTT outer loop and O(n) operations in each stage, the overall time complexity
is O(n log(n)). The factors w are called twiddle factors, as in the FFT.
Multiplication of the polynomials a(x) and b(x) can be computed with two
NTTs and one INTT, as shown in Equation (7) [13]. The NTT provides a fast multiplication
algorithm in R_q with time complexity O(n log(n)) instead of O(n^2) for schoolbook
multiplication. The product c obtained after the INTT operation is the same as the result of
schoolbook polynomial multiplication.
\[ c = \mathrm{INTT}(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)) \tag{7} \]
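Equations (5) to (7) can be checked with a direct, non-fast implementation. One point is an assumption beyond the equations as written: to obtain a product modulo x^n + 1 rather than x^n − 1, the sketch weights coefficient a_j by ψ^j before the transform and by ψ^−j after the inverse, where ψ = 1753 is the 512-th root of unity and w_n = ψ^2. The outputs are in natural order, which differs from the bit-reversed order of the reference NTT.

```python
Q, N, PSI = 8380417, 256, 1753
OMEGA = PSI * PSI % Q                 # primitive N-th root of unity w_n

def _powers(root):
    """Table root^0 .. root^(N-1) modulo Q."""
    return [pow(root, k, Q) for k in range(N)]

def ntt_naive(a):
    """Eq. (5) applied to the psi-weighted coefficients; O(n^2)."""
    psi_pw, om_pw = _powers(PSI), _powers(OMEGA)
    w = [a[j] * psi_pw[j] % Q for j in range(N)]
    return [sum(w[j] * om_pw[i * j % N] for j in range(N)) % Q for i in range(N)]

def intt_naive(A):
    """Eq. (6) followed by removal of the psi weights; O(n^2)."""
    psi_pw = _powers(pow(PSI, -1, Q))
    om_pw = _powers(pow(OMEGA, -1, Q))
    n_inv = pow(N, -1, Q)
    return [n_inv * psi_pw[i] % Q * sum(A[j] * om_pw[i * j % N] for j in range(N)) % Q
            for i in range(N)]

def multiply(a, b):
    """Eq. (7): c = INTT(NTT(a) o NTT(b))."""
    return intt_naive([x * y % Q for x, y in zip(ntt_naive(a), ntt_naive(b))])

def schoolbook(a, b):
    """Reference product in Z_q[x]/(x^N + 1) for comparison."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k, p = i + j, a[i] * b[j]
            if k < N:
                c[k] = (c[k] + p) % Q
            else:
                c[k - N] = (c[k - N] - p) % Q
    return c
```

Comparing `multiply(a, b)` with `schoolbook(a, b)` on arbitrary inputs confirms that both routes give the same polynomial.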

[Figure: butterfly data flow with inputs x[0], …, x[7] in natural order, twiddle factors W^1, …, W^7, and outputs X[0], X[4], X[2], X[6], X[1], X[5], X[3], X[7] in bit-reversed order.]

Figure 10: Cooley-Tukey data flow diagram (n = 2^3).

[Figure: butterfly data flow with inputs X[0], X[4], X[2], X[6], X[1], X[5], X[3], X[7] in bit-reversed order, twiddle factors W^−1, …, W^−7, and outputs x[0], …, x[7] in natural order.]

Figure 11: Gentleman-Sande data flow diagram (n = 2^3).

Figure 12: Block diagram of a typical NTT polynomial multiplication.

Here, ○ denotes point-wise multiplication of the polynomials. Figures 10 and 11 show the
signal flows of the Cooley-Tukey (CT) algorithm and the Gentleman-Sande (GS) algorithm
for n = 2^3. A conventional NTT goes through
a bit-reversal step, but this step can be omitted by using the CT algorithm for the NTT
and the GS algorithm for the INTT [14]. Figure 12 shows a block diagram of an NTT-based
polynomial multiplier that does not perform bit-reversal operations [14]. By skipping these
steps, the NTT polynomial multiplier based on the CT-GS combination obtains its result
faster than the conventional NTT polynomial multiplier.

3.2 Cooley-Tukey Algorithm and Gentleman-Sande Algorithm
The Cooley-Tukey algorithm is an efficient algorithm for computing the forward
NTT and decimation-in-time (DIT) DFT operations.
It follows a divide-and-conquer approach: a transform of size N = n1·n2 is
divided into smaller transforms that are computed recursively. Applied to Equation (5), the
Cooley-Tukey algorithm decomposes the transform into its even- and odd-indexed
terms. Figure 10 is a data flow diagram of the NTT algorithm using Cooley-Tukey.
The Cooley-Tukey algorithm reduces the computational complexity of the NTT
from O(n^2) to O(n log(n)). It takes inputs in standard ordering and
outputs the result in bit-reversed ordering.
In contrast, the Gentleman-Sande algorithm is used for the inverse transformation
(INTT) and decimation-in-frequency (DIF) DFT operations. It takes inputs in
bit-reversed ordering and outputs them in standard ordering. Figure 11 shows the signal flow of the
INTT using the Gentleman-Sande algorithm. Details on these algorithms and how to apply
them to speed up the NTT can be found in [15].
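The pairing of a Cooley-Tukey forward transform with a Gentleman-Sande inverse can be sketched as two in-place loops. This is a software illustration under the Dilithium parameters, with twiddle factors stored as powers of ψ = 1753 in bit-reversed order; the table and function names are illustrative and no claim is made that the hardware uses the same loop structure.

```python
Q, N, PSI = 8380417, 256, 1753
BITS = N.bit_length() - 1             # log2(N) = 8

def brv(k: int, bits: int = BITS) -> int:
    """Bit-reversal of the `bits`-bit number k."""
    return int(format(k, f"0{bits}b")[::-1], 2)

PSI_REV = [pow(PSI, brv(k), Q) for k in range(N)]
PSI_INV_REV = [pow(pow(PSI, -1, Q), brv(k), Q) for k in range(N)]

def ntt_ct(a):
    """Forward NTT, Cooley-Tukey butterflies: natural in, bit-reversed out."""
    a, t, m = list(a), N, 1
    while m < N:
        t //= 2
        for i in range(m):
            s = PSI_REV[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % Q      # multiply, then add/sub
                a[j], a[j + t] = (u + v) % Q, (u - v) % Q
        m *= 2
    return a

def intt_gs(A):
    """Inverse NTT, Gentleman-Sande butterflies: bit-reversed in, natural out."""
    a, t, m = list(A), 1, N
    while m > 1:
        h = m // 2
        for i in range(h):
            s = PSI_INV_REV[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]              # add/sub, then multiply
                a[j], a[j + t] = (u + v) % Q, (u - v) * s % Q
        t, m = 2 * t, h
    n_inv = pow(N, -1, Q)
    return [x * n_inv % Q for x in a]

def poly_mul(a, b):
    """Product in Z_q[x]/(x^N + 1) without any bit-reversal step."""
    return intt_gs([x * y % Q for x, y in zip(ntt_ct(a), ntt_ct(b))])
```

Since point-wise multiplication does not care about the order of the points, the bit-reversed output of `ntt_ct` can be fed straight into `intt_gs`.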

3.3 NTT Processor Architecture


According to [16-17], NTT processors can be grouped into three basic types based on
their structure:
- Memory-based architecture: Memory-based architectures rely on a memory unit for
their operation. These architectures typically consist of one or more processing elements
for computation, a control unit, and memory blocks. Memory-based architectures
can be categorized into two types: single-memory architecture and dual-memory
architecture [16].

[Figure: chain of alternating buffer and processor stages.]

Figure 13: General block diagram of pipeline architecture.

- Array architecture: This type of NTT processor architecture divides the
whole computation among multiple independent processing elements (PEs) with local
buffers. This architecture is suitable for processing large amounts of data [16].
- Pipeline architecture: This thesis focuses on the pipeline architecture, which offers high
throughput and an energy-efficient implementation by reading the input data in series. The
basic structure of this architecture is shown in Figure 13. The most commonly used
pipeline architectures are Single-path Delay Feedback, Single-path Delay Commutator,
Multi-path Delay Commutator and Multi-path Delay Feedback.

3.3.1 Single-path Delay Feedback (SDF)


The SDF architecture employs a single data stream that passes through a multiplier at
every stage. The shift registers in the SDF architecture are utilized efficiently
because the inputs and outputs of each stage share the same shift registers. The radix-2
SDF architecture for N = 128 is shown in Figure 14. In the SDF scheme, the butterfly unit
stores the inputs in the feedback memory during the first N/2 cycles. After the first
N/2 cycles, the butterfly unit retrieves the samples x(n) from the feedback memory and
combines each of them with the incoming sample x(n+N/2). The output is then sent to
the next stage. The SDF architecture requires a specific ordering of the input data and the
use of bit-reversed addressing to achieve optimal performance. Despite its high
throughput and low area requirements, the SDF architecture suffers from a longer latency
compared to other FFT architectures, which can be a limiting factor in some applications.
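The store-then-combine behaviour of one SDF stage can be modelled cycle by cycle. The sketch below is a behavioural illustration only: it assumes the usual radix-2 SDF arrangement in which the feedback delay lines have lengths N/2, N/4, …, 1, it omits the twiddle-factor multiplier that follows the butterfly, and its function names are illustrative.

```python
from collections import deque

def sdf_delay_lengths(n: int) -> list[int]:
    """Feedback delay-line length of each radix-2 stage (assumed N/2, N/4, ..., 1)."""
    lengths, d = [], n // 2
    while d >= 1:
        lengths.append(d)
        d //= 2
    return lengths

def sdf_stage(block, q):
    """Process one block of samples, one sample per cycle.

    First half of the block:  samples are stored in the feedback FIFO.
    Second half of the block: the stored x(n) and incoming x(n + N/2) are combined;
                              the sum leaves the stage, the difference is fed back.
    Following half block:     the stored differences are streamed out.
    """
    half = len(block) // 2
    fifo = deque([0] * half)
    out = []
    for cycle, x in enumerate(list(block) + [0] * half):   # zeros flush the FIFO
        old = fifo.popleft()
        if (cycle // half) % 2 == 0:      # store phase: pass through delayed data
            fifo.append(x)
            out.append(old)
        else:                             # butterfly phase
            out.append((old + x) % q)
            fifo.append((old - x) % q)
    return out[half:]                     # drop the initial pipeline fill
```

For N = 128 the assumed delay lengths add up to 127 storage cells, one fewer than the transform size.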

[Figure: chain of radix-2 butterfly units (BF-2), each with a feedback delay line; the delay lengths shown are 64, 32, 16, 4 and 1.]

Figure 14: SDF architecture.

3.3.2 Multipath Delay Feedback (MDF)
The MDF architecture builds on the SDF architecture and keeps the same
overall number of internal memory cells. Additionally, multiple butterfly
units are necessary to process multiple input samples concurrently. Because the butterfly
output shares its storage with the butterfly input, multi-path delay feedback architectures are
more efficient than the corresponding multi-path delay commutator structures
in terms of memory utilization.
Several MDF architectures have been proposed. M. Shin [18] proposed an FFT
processor architecture using a four-parallel 128-point radix-2^4 MDF. This MDF
architecture offers a lower hardware complexity than the classical SDF
approaches and allows the radix-2^4 FFT algorithm to decrease the power consumption.
The MDF architecture is able to provide a higher throughput with minimal hardware
requirements by combining the strengths of SDF and MDC. In the MDF
architecture, the SDF architecture is expanded by using a multi-path approach. The
number of data paths can be raised to eight or even sixteen to achieve a higher
throughput rate. The architecture of a two-parallel 256-point MDF is shown in Figure 15.
[Figure: two parallel data paths, each a chain of fully pipelined radix-2 butterfly units (BF-2) with feedback delay lines; the delay lengths shown are 64, 32, 16, 4 and 1.]

Figure 15: MDF architecture.

3.3.3 Single-path Delay Commutator (SDC)

In the SDC architecture [19], a programmable method is employed to modify the
architecture of the standard radix-4 butterfly unit (BU), reducing the number of complex constant
multipliers required. Additionally, a combined delay-commutator is used to decrease the
memory required. However, the butterfly unit and delay-commutator become more
complex due to the programmability requirement. Figure 16 shows the R4SDC architecture
for a 256-point FFT.

3.3.4 Multipath Delay Commutator (MDC)


In the MDC architecture, input data are initially divided by a commutator into several parallel data
streams flowing forward. A commutator is essentially a switch used to
shift data between the radix butterfly stages of the pipeline architecture. The latencies
of the parallel data paths, which are defined by the algorithm, need to be matched.
The most straightforward R4MDC pipeline implementation is
shown in Figure 17. Another popular structure is the R2MDC FFT
structure, which is adopted in this thesis and described in the next section.
[Figure: four radix-4 butterfly units (BF4), each preceded by a delay commutator: DC 6x64, DC 6x16, DC 6x4 and DC 6x1.]

Figure 16: R4SDC structure.

[Figure: radix-4 butterfly units (BF4) and commutators (C4) connected by delay elements on the parallel data paths.]

Figure 17: MDC architecture.

Figure 18: Block diagram of a 256-point R2MDC FFT.

3.3.5 Radix-2 Multipath Delay Commutator


The Radix-2 Multipath Delay Commutator (R2MDC) architecture is a popular pipeline
architecture for the fast Fourier transform (FFT). Compared with the popular in-place
FFT architecture, R2MDC has fewer memory accesses, a more regular ordering of the
input and output data, and simpler control logic, and it is better at processing multiple
FFTs continuously. Figure 18 shows the architecture for a 256-point R2MDC FFT, which
needs two input coefficients per cycle to achieve a 100% utilization rate of the butterfly
units. This architecture can process both radix-2 decimation-in-time and decimation-in-
frequency FFTs by using different butterfly units and twiddle factors. In addition, it can
process both the FFT and the inverse FFT (IFFT), with the difference being that the IFFT
requires additional postprocessing and different twiddle factors.

3.4 Summary
This chapter summarized the NTT algorithm and the hardware architectures that are
commonly used for computing FFT algorithms. They form
the theoretical basis for building an efficient polynomial arithmetic module for the Dilithium
scheme in hardware. The pipeline architecture is the most
efficient for long serial arithmetic operations, such as those used in the Dilithium
algorithm. It has low latency and does not require a control logic unit to manage the
addresses of the coefficients. Among the popular pipeline FFT architectures, the R2MDC structure is
chosen for our arithmetic module, because it has fewer memory accesses and a lower
hardware cost than R4MDC. Although the control of the commutator in the MDC architecture
is more complicated than in delay feedback structures, it is possible to apply the
folding transformation method to reduce the required number of butterfly units by half [4].

Chapter 4 Hardware Implementation of Dilithium

This chapter presents the design of a high-performance hardware implementation of
CRYSTALS-Dilithium. The proposed hardware architecture can perform the three algorithms of
Dilithium (key generation, signature generation, and signature verification) and can select between
security levels at runtime. This design achieves the lowest latency for all operations while
maintaining a smaller area, and it is efficient in terms of the Area x Time trade-off metric.

4.1 Combined Architecture CRYSTALS-Dilithium


Figure 19 shows our proposed combined architecture, which is capable of performing key
generation, signature generation and signature verification and of selecting between security
levels at run time. Since the target is high performance, we parallelize the other
submodules as needed to minimize the stall and wait delays in the polynomial arithmetic
modules. The polynomial arithmetic module and the Keccak unit for hashing are the most
important modules and are used in all three algorithms. Depending on the algorithm
to be performed, some sub-modules of the combined scheme can be removed. This design
uses one input port and one output port with a 64-bit data width. The detailed resource
utilization and latency of our implementation can be seen in Table 3. The parameters of
Dilithium version 3.1 at all security levels can be seen in Table 2.
In Dilithium, signature generation is the most complex phase and has the longest execution
time. The schedule for this phase is split into a precomputation stage and a rejection loop
stage. To minimize latency, the rejection loop is further split
into a 2-stage pipeline. These two stages run in parallel, each performing its calculations
on one of the 2 polynomial arithmetic modules.

[Figure: AXI stream input with decoder, AXI stream output with encoder, polynomial memory, polynomial arithmetic module, Keccak core (SHAKE 128 / SHAKE 256), sampling units (matrix A, mask y, eta s1 and s2, sample-in-ball c), decomposer and hint module.]

Figure 19: Combined architecture CRYSTALS-Dilithium.

NIST Security Level                          2            3            5

Parameters
q [modulus]                                  8380417      8380417      8380417
d [dropped bits from t]                      13           13           13
τ [# of ±1's in c]                           39           49           60
challenge entropy [log C(256, τ) + τ]        192          225          257
γ1 [y coefficient range]                     2^17         2^19         2^19
γ2 [low-order rounding range]                (q − 1)/88   (q − 1)/32   (q − 1)/32
(k, l) [dimensions of A]                     (4, 4)       (6, 5)       (8, 7)
η [secret key range]                         2            4            2
β [τ · η]                                    78           196          120
ω [max. # of 1's in the hint h]              80           55           75
Repetitions                                  4.25         5.1          3.85

Table 2: Parameters of Dilithium submitted in NIST Round 3.
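The parameter sets of Table 2 can be kept in a small lookup structure, which is also a convenient place to check the relations between them. The sketch below is illustrative: the dictionary layout and the function name are assumptions, while the values are those of Table 2.

```python
Q = 8380417
D = 13

# Parameter sets indexed by NIST security level (values from Table 2).
PARAMS = {
    2: dict(tau=39, gamma1=2**17, gamma2=(Q - 1) // 88, k=4, l=4, eta=2, beta=78, omega=80),
    3: dict(tau=49, gamma1=2**19, gamma2=(Q - 1) // 32, k=6, l=5, eta=4, beta=196, omega=55),
    5: dict(tau=60, gamma1=2**19, gamma2=(Q - 1) // 32, k=8, l=7, eta=2, beta=120, omega=75),
}

def check_params(level: int) -> bool:
    """Check the relations the supporting algorithms rely on."""
    p = PARAMS[level]
    alpha = 2 * p["gamma2"]
    return (
        p["beta"] == p["tau"] * p["eta"]      # beta = tau * eta
        and alpha % 2 == 0                    # alpha is even
        and (Q - 1) % alpha == 0              # alpha divides q - 1
        and Q > 2 * alpha                     # precondition of Lemma 1
    )
```

Selecting a security level at run time then amounts to switching between the three entries of this table.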


Before describing the operation scheduling of our implementation, we describe the
submodules used in it.

4.2 Hardware Implementation for Sub-modules in Dilithium
4.2.1 Polynomial Arithmetic Module
The Polynomial Arithmetic module performs polynomial addition, subtraction, and
multiplication, which are used in lattice-based cryptosystems. As mentioned before,
polynomial multiplication is the most time-consuming operation and is therefore critical
to performance. The NTT is used to accelerate polynomial multiplication. The typical
hardware implementation of the NTT algorithm uses butterfly units to perform
layer-by-layer calculations in accordance with the butterfly diagram. The analysis in
chapter 3 showed that the pipelined NTT architecture is the most efficient choice for
high performance. This thesis proposes a flexible polynomial arithmetic module inspired
by the R2MDC FFT structure introduced in section 3.3.5. The overall structure of the
proposed flexible polynomial arithmetic module (PolyArith) is shown in Figure 20.
Our design utilizes eight butterfly units, each capable of performing the basic
arithmetic operations in both Cooley-Tukey and Gentleman-Sande butterfly configurations.
As mentioned above, to improve performance, we need 2 polynomial arithmetic modules
for this implementation. If we used the original R2MDC FFT architecture, we would need a
total of 16 butterfly units in our proposed architecture. Each butterfly unit comes with
its own sub-modules and a memory to store twiddle factors, which requires a large area.
Besides, when calculating pointwise multiplications, our storage scheme requires reading
sixteen coefficients and writing back eight coefficients per cycle for 8 butterfly units,
which is inefficient for dual-port BRAM usage.

[Figure 20 block diagram. Two rows of processing elements: PE 1 to PE 4 (In 1, Out 1) and
PE 5 to PE 8 (In 2, Out 2). Legend: 64D/32D blocks, switch box, butterfly computation;
blocks are marked as used in all modes, only in NTT, only in INTT, or only in full INTT.]

Figure 20: Block diagram of the proposed flexible arithmetic module.

To mitigate the area shortcoming of the R2MDC structure, we use a method called folding
transformation, which uses four butterfly units instead of the eight in the original
R2MDC structure while keeping the pipelined NTT structure [4]. In this design, the
signature generation operation needs 2 PolyArith modules, while the other phases require
only one. Therefore, we propose a flexible arithmetic module with eight butterfly units
that can switch between one module with eight BUs and two dependent modules with four BUs
each. In this way, depending on the operation scheduling of each phase in Dilithium, the
module can be reconfigured to boost the speed without raising the resource consumption.

4.2.2 Unified Butterfly Unit


Since the proposed design uses different butterfly structures for the NTT and INTT
operations, we propose a unified butterfly unit, shown in Fig. 21. The proposed butterfly
unit requires no more modular multipliers, adders or subtractors than a dedicated CT or
GS butterfly unit, i.e., one modular adder, one modular subtractor and one modular
multiplier.
As can be seen in Fig. 21, the proposed butterfly unit takes a, b and ω as inputs, and
the control signal ntt/intt (the red line) is used as the selection signal for the
multiplexers in the butterfly unit. A and B are the outputs. When performing pointwise
multiplication or subtraction, the result is taken from B, while output A is used for
addition in the CT butterfly configuration. In [20], Zhang et al. proposed a technique to
eliminate the multiplication of the resulting coefficients by $n^{-1} \bmod q$ after the
INTT operation. In this work, we adopt this technique by integrating a divide-by-2
operation into the addition unit and pre-processing the operand ω for the inverse NTT to
incorporate the factor $2^{-1}$.
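
The behaviour of the unified butterfly can be illustrated with a small software model. The following Python sketch is not the RTL of this thesis: the function names, the loop structure, and the way the halving is realized (adding q to odd values before shifting) are assumptions made for illustration. It uses a CT butterfly for the NTT and a GS butterfly with an integrated divide-by-2 for the INTT, with the inverse twiddle factors pre-multiplied by $2^{-1}$, so that no final multiplication by $n^{-1}$ is needed.

```python
Q = 8380417   # Dilithium modulus
PSI = 1753    # 512-th root of unity modulo Q given in the Dilithium specification


def bit_reverse(x, bits):
    """Reverse the lowest `bits` bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def half(x, q):
    """Compute x * 2^-1 mod q with a shift (assumed hardware-style halving)."""
    return x >> 1 if x % 2 == 0 else (x + q) >> 1


def butterfly(a, b, w, intt, q):
    """Unified butterfly: CT when intt is False, GS with halving when True."""
    if not intt:
        t = (w * b) % q
        return (a + t) % q, (a - t) % q
    # GS: the factor 2^-1 of output B is already folded into w.
    return half((a + b) % q, q), ((a - b) * w) % q


def transform(poly, psi, q, intt=False):
    """Negacyclic NTT (intt=False) or its exact inverse (intt=True)."""
    n = len(poly)
    bits = n.bit_length() - 1
    a = list(poly)
    zetas = [pow(psi, bit_reverse(k, bits), q) for k in range(n)]
    inv2 = pow(2, -1, q)
    lengths = [n >> s for s in range(1, bits + 1)]   # n/2, n/4, ..., 1
    if intt:
        lengths.reverse()                            # undo the layers in reverse order
    for length in lengths:
        for start in range(0, n, 2 * length):
            w = zetas[n // (2 * length) + start // (2 * length)]
            if intt:
                w = pow(w, -1, q) * inv2 % q         # pre-processed inverse twiddle
            for j in range(start, start + length):
                a[j], a[j + length] = butterfly(a[j], a[j + length], w, intt, q)
    return a
```

Because every INTT layer halves both outputs, the eight layers of a 256-point transform together contribute the factor $2^{-8} = n^{-1}$, which is why `transform(transform(p, PSI, Q), PSI, Q, intt=True)` returns `p` without a separate scaling pass.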

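[Figure 21 block diagram. Inputs a, b and ω; one modular adder, one modular subtractor and
one modular multiplier (× mod), connected through multiplexers controlled by ntt/intt;
outputs A and B.]

Figure 21: Unified butterfly unit.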

4.2.3 Modular Reduction Unit


The reference implementation of Dilithium uses Montgomery reduction for modular
reduction after multiplication. This algorithm can be applied in hardware, but it
requires additional multiplications and does not exploit any special property of the
modulus q. Another commonly used algorithm is Barrett reduction; this reduction can be
more efficient for the Dilithium modulus q, but the optimized algorithm contains a longer
carry chain that limits the maximum frequency. Land et al. [21] presented a fast modular
reduction method specific to the Dilithium modulus q. However, we found a small mistake:
the results do not lie within the interval (-q, 2q) as stated in [21], but within
(-q, 3q). In their reduction method, a 46-bit value s is reduced by recursively
exploiting the relation $2^{23} \equiv 2^{13} - 1 \pmod{q}$; the resulting equation is
given in [21]:
$$s[45{:}0] \equiv 2^{13}\,(s[45{:}43] + s[42{:}33] + s[32{:}23]) + s[22{:}0] - (s[45{:}43] + s[45{:}33] + s[45{:}23]).$$
Subtraction is less hardware-friendly than addition, so we use the additive inverse,
with $\bar{s}$ denoting the negative of s, and transform the equation above into:
$$s[45{:}0] \equiv s[22{:}0] + \bar{s}[45{:}23] + (2^{13}\,s[32{:}23] + \bar{s}[45{:}33]) + 2^{13}\,(s[42{:}33] + \bar{s}[45{:}43]) + 2^{13}\,s[45{:}43] + 50331648.$$
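
The first congruence and the interval (-q, 3q) can be checked with a short software model. The Python sketch below is only an illustration: the helper names `bits`, `fold` and `reduce_mod_q` are hypothetical, and the final correction step is a plain comparison chain rather than the adder structure with the constants m0, m1 and m2 used in the hardware.

```python
Q = 8380417  # Dilithium modulus, 2^23 - 2^13 + 1


def bits(s, hi, lo):
    """Return the bit slice s[hi:lo] (both ends inclusive) as an integer."""
    return (s >> lo) & ((1 << (hi - lo + 1)) - 1)


def fold(s):
    """Fold a 46-bit value using 2^23 = 2^13 - 1 (mod Q), as in the equation from [21]."""
    positive = ((bits(s, 45, 43) + bits(s, 42, 33) + bits(s, 32, 23)) << 13) + bits(s, 22, 0)
    negative = bits(s, 45, 43) + bits(s, 45, 33) + bits(s, 45, 23)
    return positive - negative


def reduce_mod_q(s):
    """Reduce a 46-bit value to [0, Q), assuming fold(s) lies in (-Q, 3Q)."""
    r = fold(s)
    if r < 0:
        return r + Q
    if r >= 2 * Q:
        return r - 2 * Q
    if r >= Q:
        return r - Q
    return r
```

Evaluating `fold` at the two extreme inputs, `(1 << 43) - 1` and `7 << 43`, gives values slightly above 2q and slightly above -q respectively, which is consistent with the interval (-q, 3q) stated above.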

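[Figure 22 block diagram. Inputs a and b (23 bits each), a multiplier producing a 46-bit
product, adders with the constants m0, m1 and m2, and a 23-bit output c.]

Figure 22: Modular reduction module.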

Because the result lies within the interval (-q, 3q), the hardware architecture for
modular multiplication in Fig. 22 follows directly; its result is a 23-bit value (with
m0 = 50331648 – 2q, m1 = 50331648 – q, m2 = 50331648 + q).

4.2.4 BRAM Array Configuration


Implementing Dilithium requires efficient storage for the large amount of data produced
by polynomial sampling and for intermediate values. Given the 23-bit modulus q, this
thesis proposes a BRAM array structure in which four coefficients are stored at the same
address. The required word width is 23 × 4 = 92 bits, which can be supported by a group
of three 36-kbit BRAMs. This structure uses 25% fewer BRAMs than storing one polynomial
coefficient per BRAM address.
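
The storage layout can be illustrated with a small software model. The following Python sketch packs four 23-bit coefficients into one 92-bit word and unpacks them again. The function names and the ordering of the coefficients inside the word (coefficient 0 in the least significant bits) are assumptions made for illustration, not details taken from the design; the sketch also assumes that each BRAM is used at a 36-bit width, so that three of them provide 108 bits per address.

```python
COEFF_BITS = 23        # width of one coefficient modulo q
COEFFS_PER_WORD = 4    # coefficients stored at one address
WORD_BITS = COEFF_BITS * COEFFS_PER_WORD   # 92 bits


def pack_word(coeffs):
    """Pack four 23-bit coefficients into one memory word (coefficient 0 in the LSBs)."""
    if len(coeffs) != COEFFS_PER_WORD:
        raise ValueError("expected exactly four coefficients")
    word = 0
    for i, c in enumerate(coeffs):
        if not 0 <= c < (1 << COEFF_BITS):
            raise ValueError("coefficient does not fit in 23 bits")
        word |= c << (i * COEFF_BITS)
    return word


def unpack_word(word):
    """Split one memory word back into its four coefficients."""
    mask = (1 << COEFF_BITS) - 1
    return [(word >> (i * COEFF_BITS)) & mask for i in range(COEFFS_PER_WORD)]


def store_polynomial(poly):
    """Model the memory image of a polynomial: one word per four coefficients."""
    return [pack_word(poly[i:i + COEFFS_PER_WORD])
            for i in range(0, len(poly), COEFFS_PER_WORD)]
```

With this layout a 256-coefficient polynomial occupies 64 addresses, and one read delivers four coefficients at once.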

4.2.5 Hashing and Sampling


[Figure 23 block diagram. Blocks: input, IO buffer, round function, state register,
output; handshake signals hash_ready and Buffer_ready.]

Figure 23: Keccak hardware architecture [22].

As in other lattice-based cryptosystems, the polynomials composing the vectors and
matrices in Dilithium are sampled independently, using a constant seed value with an
appended incrementing nonce value as the input to either SHAKE 128 or SHAKE 256.
SHAKE 128 and SHAKE 256 belong to the secure hash algorithm family; they have the same
structure but differ in their rate and capacity. Detailed information about the Keccak
core can be found in the Keccak hardware documentation published by the Keccak team [22].
Figure 23 shows the high-speed Keccak core design. It is based on a plain instantiation
of the combinational logic that computes one Keccak-f round, which is used iteratively.
The core is composed of three main components: the round function, the state register
and the input/output buffer. In our implementation, the Keccak cores are adapted from an
existing implementation [23]; we use 2 cores for SHAKE 128 and 1 core for SHAKE 256.
In the Dilithium scheme, there are 4 sampling functions with different distributions:
- Expand Matrix: After a uniform seed is fed to the Keccak core, this module takes a
64-bit value from the Keccak register every clock cycle and saves it in its internal
buffer. This module generates the samples of matrix A; four 23-bit values from the buffer
are sent to memory as four polynomial coefficients of A.
- Uniform Eta and Expand Mask are the sampling modules for the vectors 𝐬𝟏, 𝐬𝟐 and y.
Both are connected to a Keccak core for randomness and generate samples according to
their rejection conditions and the received data width.
- SampleInBall: The SampleInBall unit generates the challenge c. It absorbs the first
64 bits of the hash string as the “signs,” which determine whether a new value is 1 or -1
in each round. Then, it processes one 8-bit value per clock cycle. If the value is not
greater than the current loop index, it is taken as an address of a particular polynomial
RAM. The RAM is accessed in a “read first” manner and is initialized to zeroes. The
SampleInBall unit writes the new value to the BRAM at this address and, at the same time,
reads the original value stored in that location. The original value is written back to
the RAM at the address indicated by the loop index in the next clock cycle. If the loop
index coincides with the new address, however, the original value is simply discarded.
The new value and the original value are written to the RAM through different ports, so
the 8-bit hash values can be processed continuously.

[Figure 24 block diagram. Blocks: input buffer, shifter (>>x), and a LUT that is driven
by the mode signal and provides x; ports In and Out.]

Figure 24: Encoder.
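
The SampleInBall procedure described above can be sketched in software as follows. This is a minimal Python model, not the hardware unit: the helper `shake256_bytes`, the 32-byte seed used in the usage note, and the little-endian interpretation of the 64 sign bits are assumptions taken from the reference software rather than from this thesis. The list `c` plays the role of the zero-initialized polynomial RAM, and the "read first" access is modelled by reading the old value before writing the new one.

```python
import hashlib

N = 256  # number of coefficients of the challenge polynomial


def shake256_bytes(seed):
    """Yield the SHAKE256 output stream for `seed` one byte at a time."""
    size = 136
    buf = hashlib.shake_256(seed).digest(size)
    pos = 0
    while True:
        if pos == len(buf):
            size += 136   # extend the squeezed output; the prefix stays identical
            buf = hashlib.shake_256(seed).digest(size)
        yield buf[pos]
        pos += 1


def sample_in_ball(seed, tau):
    """Return a polynomial with exactly tau coefficients in {-1, 1} and the rest 0."""
    stream = shake256_bytes(seed)
    signs = int.from_bytes(bytes(next(stream) for _ in range(8)), "little")
    c = [0] * N
    for i in range(N - tau, N):
        j = next(stream)
        while j > i:                 # reject values greater than the loop index
            j = next(stream)
        old = c[j]                   # "read first": fetch the original value
        c[j] = 1 - 2 * (signs & 1)   # write the new value, +1 or -1
        if i != j:
            c[i] = old               # write the original value back at index i
        signs >>= 1
    return c
```

For example, `sample_in_ball(bytes(32), 39)` returns a 256-entry list with exactly 39 nonzero entries, matching τ for security level 2 in Table 2.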

4.2.6 Encoder/Decoder
The Dilithium parameters lead to different sample widths: as mentioned in section 4.2.5,
each coefficient of matrix A uses 23 bits, while the coefficients of vector y use 18 or
20 bits depending on the security level. Therefore, we need a module to encode the
vectors as byte strings. This is needed for absorbing them into SHAKE and for defining
the data layout of the keys and the signature. Our encoder is presented in Figure 24. The
vector to be packed is first saved to a buffer. The signal mode is a selection signal
indicating which vector is to be packed; the shift register then shifts each coefficient
and packs it into bytes. The bit-packing is explained in section 5.2 of the Dilithium
specification.
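[Figure 25 block diagram. Blocks: input buffer, shifter (>>x), LUT driven by the mode
signal, and the “power to round” path, which adds (1<<(T-1))-1, shifts right by T and
subtracts to produce t1 and t0; ports In and Out.]

Figure 25: Decoder.
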
Conversely, the decoder unit takes as input the 64-bit-wide data layouts of the secret
key, public key, digital signature and message, and unpacks them into the vectors or
variables used inside the Dilithium algorithm. This unit integrates a function named
“power to round”, which is used in the key generation phase to split each coefficient of
vector t into 10 high bits 𝑡1 and 13 low bits 𝑡0. The input and output of these modules
are 64 bits wide, which matches both the Keccak core interface and the data layout of the
AXI stream bus.
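
The “power to round” split can be illustrated with a few lines of Python. The sketch follows the formulation used in the Dilithium reference software (add $2^{d-1} - 1$, shift right by d, then subtract); the function name and the default d = 13 from Table 2 are the only assumptions, and the code models a single coefficient rather than the 64-bit datapath of the decoder.

```python
Q = 8380417  # Dilithium modulus
D = 13       # number of dropped bits from t (Table 2)


def power2round(r, d=D):
    """Split r mod Q into (r1, r0) with r = r1 * 2^d + r0 and -2^(d-1) < r0 <= 2^(d-1)."""
    r %= Q
    r1 = (r + (1 << (d - 1)) - 1) >> d   # high bits, rounded
    r0 = r - (r1 << d)                   # signed low bits
    return r1, r0
```

For every coefficient below q the high part fits in 10 bits, which is why the public key stores 𝑡1 with 10 bits per coefficient while 𝑡0 needs 13 bits.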

4.2.7 Make/Use Hint


We store the hint in two registers: one stores the offsets of the 1’s and the other
stores the k polynomial boundaries, in the same format as specified for packed
signatures. For the MakeHint operation, w − c𝑠2 and w − c𝑠2 + c𝑡0 are stored separately
so that both can be read simultaneously. We then look up the HighBits of both values and,
if they differ, a new offset is shifted in. For the UseHint operation, the hint module
looks up the HighBits for each coefficient for both h=0 and h=1. After the correct one is
selected, the value is shifted into a buffer register for sampling (as described before)
and absorbed to compute the value c’’, which is ultimately compared to the value in the
signature during verification.

4.2.8 Decomposer

The Decomposer function splits the value r into the unique form r = 𝑟1 · 2γ2 + 𝑟0 with a
small centered remainder |𝑟0| ≤ γ2. In the Dilithium scheme, γ2 is (q − 1)/88 at security
level 2 and (q − 1)/32 at the other levels (Table 2), so 𝑟1 takes one of 44 or 16 possible
values, respectively. To avoid DSP usage, the LUT block in Fig. 26 is used as the
divide-by-2γ2 unit; its output is one of the possible values, which have been
pre-computed and saved in a look-up table. The rest of the operations are mostly either
shift operations or additions/subtractions with a constant value.

[Figure 26 block diagram. Blocks: LUT (divide by 2γ2), subtractors, shift by 31 (>> 31),
mask with Q (&Q), and the constant (Q-1)/2; input r, outputs r1 and r0.]

Figure 26: Decomposer.
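
As a reference point for the Decomposer, the following Python sketch implements the Decompose function in the way the Dilithium specification [6] defines it, using a direct division instead of the look-up table of the hardware unit. The dictionary `GAMMA2` and the function name are illustrative; the corner case in which r − 𝑟0 equals q − 1 is handled as in the specification.

```python
Q = 8380417  # Dilithium modulus

# gamma2 per NIST security level, taken from Table 2
GAMMA2 = {2: (Q - 1) // 88, 3: (Q - 1) // 32, 5: (Q - 1) // 32}


def decompose(r, gamma2):
    """Return (r1, r0) such that r = r1 * 2*gamma2 + r0 (mod Q) and |r0| <= gamma2."""
    alpha = 2 * gamma2
    r %= Q
    r0 = r % alpha
    if r0 > gamma2:          # centre the remainder around zero
        r0 -= alpha
    if r - r0 == Q - 1:      # corner case: wrap the top interval around to r1 = 0
        return 0, r0 - 1
    return (r - r0) // alpha, r0


def high_bits_count(gamma2):
    """Number of distinct r1 values, i.e. the number of entries a LUT would need."""
    return (Q - 1) // (2 * gamma2)
```

`high_bits_count` evaluates to 44 for security level 2 and to 16 for levels 3 and 5, which bounds the size of a pre-computed table for 𝑟1.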

4.3 Operation Scheduling


4.3.1 Key Generation
The schedule of operations for key generation can be seen in Fig. 27. The longest path
for key generation is the computation, packing, and hashing of the polynomial vector t.
As such, our schedule aims to minimize any delay in the computation of t by immediately
sampling s1 and the matrix A, so that the NTT of vector s1 and the product t̂ = A · ŝ1 can
be performed as soon as possible. t is then encoded and hashed in parallel with the
addition operation, so no additional time is needed to calculate tr. Using the function
Power2Round, vector t is split into the high bits 𝑡1 and the low bits 𝑡0. The encoder unit
packs the values needed to form the secret key and the public key.

[Figure 27 timing diagram. Rows: PolyArith1 and PolyArith0 (NTT(s1), multiplication by A,
INTT, t += s2), Encoder (pack s1, pack s2, pk = (ρ, t1), sk = (ρ, K, ...)), and Sample
(Gen s1, Gen s2, Expand Matrix A).]

Figure 27: Key generation timing diagram.


4.3.2 Signature Generation
The schedule for signature generation is split into two sections: the precomputation
stage, shown in Fig. 28, and the rejection loop stage, where multiple signature attempts
are run in parallel, shown in Fig. 29. Precomputation is the stage where the secret key
values are unpacked and transformed into the NTT domain. For the rejection loop, the
timing diagram follows the order of operations of the algorithm in the Dilithium scheme.
In this stage, the vector NTTs and the matrix multiplications become the dominant
operations, and thus we split these operations evenly between the two pipeline stages,
which correspond to the two arithmetic modules, as can be seen in Fig. 29. The first
stage uses the PolyArith0 unit to compute the y and w vectors. The second stage uses the
results of the first stage to generate the vector z and the hint. We found that splitting
the computations into two stages gives better average-case performance for signature
generation. The rejection loop is repeated until all the conditions, including those on
the hint, are met. The average number of repetitions required for the rejection loop is
3.85 to 5.1, depending on the security level.

[Figure 28 timing diagram. Rows: Decoder (decode s1, decode s2, decode t0), PolyArith1
(NTT(s1), NTT(s2), NTT(t0)), PolyArith0 (NTT(y), multiplication by A, INTT), Encoder, and
Sample (Expand y, Expand Matrix A).]

Figure 28: Schedule of the precomputation stage of signature generation.

[Figure 29 timing diagram. Rows: PolyArith1 (NTT(c), INTT operations, w0-s, w0+t),
PolyArith0 (NTT(y), multiplication by A, INTT), Decoder (decomp w), Encoder (Pack w,
Pack z), and Sample (Gen c, makeHint, σ, Expand y).]

Figure 29: Timing diagram of rejection loop in signature generation.

4.3.3 Signature Verification


The verification schedule is shown in Fig. 30. The longest path for verification is the
calculation and hashing of w. In verification, the vectors z and 𝑡1 are immediately
unpacked from the signature and the public key, respectively, at the same time as matrix
A is generated, so that the calculation of w can be performed as soon as the polynomial
arithmetic unit completes the NTT operations. Once w is calculated, the hint is applied
to the higher-order bits so that the result can be hashed with μ and compared with the
challenge seed c’ to determine whether the signature is valid.

[Figure 30 timing diagram. Rows: Decoder (unpack z, unpack t1), PolyArith1 (subtraction,
INTT, decomp), PolyArith0 (NTT(z), NTT(t1), NTT(c), multiplication of A and z), Encoder
(Pack w1), and Sample (useHint, Gen c, Expand Matrix A).]

Figure 30: Schedule of signature verification.

4.4 Summary

This chapter has presented an approach to implementing a combined, high-performance
hardware architecture for Dilithium. Several optimized modules are designed to use fewer
resources while performing their functions faster, including a flexible polynomial
arithmetic module, a BRAM array, a compact Decompose module, and an optimized modular
reduction module. In addition, the operation scheduling for the three major algorithms is
described; it maximizes the utilization of the flexible arithmetic module, which is the
core of our design and is responsible for the majority of the operations in Dilithium.

Chapter 5 Performance Evaluation

This chapter demonstrates the simulation process used to verify the results and
performance characteristics of the high-performance hardware architecture for
CRYSTALS-Dilithium described in chapter 4. The first section of this chapter describes
the simulation of the Dilithium scheme using the simulation tool in the Vivado software.
The process begins with the key generation phase, where a key pair is generated. Then,
the signature generation operation is performed to create a digital signature from a
message, and finally, the process concludes with the verification of the input message's
integrity. Throughout this simulation, we evaluate the efficiency and effectiveness of
the CRYSTALS-Dilithium hardware architecture. The results are then presented, analyzed,
and compared with state-of-the-art hardware implementations at the end of the chapter.

5.1 CRYSTALS-Dilithium Simulation


The proposed implementation can perform the three major algorithms of the Dilithium
scheme. It has been modeled at the register-transfer level in Verilog HDL and
functionally verified by Verilog HDL simulation. This section shows the simulation of the
key generation, signature generation and verification phases in sequence. This work uses
15 samples to verify the results, 5 for each of the 3 security levels. The simulation
process can be observed in the Behavioral Simulation tool in Vivado, as seen in Fig. 31.

Figure 31: Simulation result of CRYSTALS-Dilithium.

As a combined implementation, it has 2 input selection signals, “mode” and “sec_lvl”,
to switch between phases and security levels, together with a start signal and the data
ports:
• The input signal "mode" is set to 0 to initiate the key generation phase, 1 to initiate
the signature generation phase, and 2 to initiate the verification phase.
• The signal “sec_lvl” has 3 options corresponding to the NIST security levels.
• The signal “start” informs the module to start working on a new sample.
• One 64-bit data input port “data_i” and one 64-bit data output port “data_o” are used
to transfer data.
Figures 32 and 33 show the output of the key generation operation, which includes
the secret key and the public key. Figure 34 illustrates the process of receiving the
message M during the signature generation phase, while the digital signature sent via the
output port can be seen in Figure 35. In the verification phase, the verification process
is performed after the digital signature and the message have been received, and the
result is sent to the "data_o" port. If the message is authentic, the result is 0, and if
it has been modified, the result is 1 (Figure 36).

Figure 32: Simulation result of secret key in NIST level 2.

Figure 33: Simulation result of public key in NIST level 2.

Figure 34: Receiving message M process in signature generation phase.

Figure 35: Simulation result of digital signature.

Figure 36: Simulation result of verification phase.

After each sample, the Tcl Console shows a result message and the latency in number of
clock cycles. The message is “completed” when the message sent by the prover is the same
as the one received by the verifier. Otherwise, the Tcl Console shows “Rejected”.

Figure 37: Verification result and execution time.
5.2 Resource Utilization and Performance
The proposed hardware implementation can perform the three major operations of the
Dilithium scheme, Keygen, Sign and Verify, at 3 security levels. All results were
generated using Xilinx Vivado 2022.2. As the target is high performance, this work uses
the Virtex UltraScale+ platform (XCVU57p); the performance results and the comparison
with existing implementations are detailed in Table 3. In Dilithium, the execution time
for key generation and verification at each security level is constant. According to the
specification [6], the average case for signature generation largely depends on the
average number of attempts needed to generate a valid signature. On average, level 2
requires 4.25 attempts, level 3 requires 5.1 attempts and level 5 requires 3.85 attempts.
In this design, each additional attempt requires 5665 cycles for security level 2, 7660
cycles for level 3 and 10.4K cycles for level 5. If the signature is valid on the first
attempt, our signature generation process finishes in 10838 cycles at level 2, 15469 at
level 3 and 23580 at level 5.
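
The average signing latency can be derived from these numbers with a short calculation. The Python sketch below assumes that the average latency equals the first-attempt latency plus (average attempts − 1) additional attempts; this formula is an assumption of the sketch, but with the values quoted above it reproduces the Sign cycle counts listed for this work in Table 3 to within rounding. The names are illustrative.

```python
FREQ_MHZ = 265  # clock frequency of this work (Table 3)

# Values quoted in the text, indexed by NIST security level
FIRST_ATTEMPT_CYCLES = {2: 10838, 3: 15469, 5: 23580}
EXTRA_ATTEMPT_CYCLES = {2: 5665, 3: 7660, 5: 10400}   # 10.4K for level 5
AVERAGE_ATTEMPTS = {2: 4.25, 3: 5.1, 5: 3.85}


def average_sign_cycles(level):
    """First attempt plus the expected number of additional attempts."""
    extra = AVERAGE_ATTEMPTS[level] - 1
    return FIRST_ATTEMPT_CYCLES[level] + extra * EXTRA_ATTEMPT_CYCLES[level]


def cycles_to_us(cycles, freq_mhz=FREQ_MHZ):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles / freq_mhz


for level in (2, 3, 5):
    cycles = average_sign_cycles(level)
    print(level, round(cycles), round(cycles_to_us(cycles), 1))
```

The same conversion applied to the first-attempt cycle counts gives the best-case signing latencies used when comparing with works that only report the best case.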
This thesis has mentioned a metric called Area×Time trade-off, which measures the
efficiency of an implementation by multiplying the number of hardware resources
(LUTs/FFs/DSPs/BRAM) by the time required. This metric allows for a fair comparison
with previous works that employ different types of hardware resources.
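
As a small worked example of this metric, the following Python sketch multiplies one resource type by the latency for a few designs. The thesis does not spell out how the four resource types are combined, so the sketch treats one resource type (LUTs) at a time; the function names are hypothetical and the numbers are the level 2 Keygen entries of Table 3.

```python
# (LUTs, Keygen latency in microseconds) at security level 2, copied from Table 3
LEVEL2_KEYGEN = {
    "Beckwith [3]": (53907, 19.0),
    "Zhao [4]": (29998, 43.0),
    "This work": (51317, 14.6),
}


def area_time(area, time_us):
    """Area x Time product; a lower value means a more efficient design."""
    return area * time_us


def rank_by_area_time(rows):
    """Return the design names sorted from best (lowest product) to worst."""
    return sorted(rows, key=lambda name: area_time(*rows[name]))


for name in rank_by_area_time(LEVEL2_KEYGEN):
    luts, time_us = LEVEL2_KEYGEN[name]
    print(f"{name}: {area_time(luts, time_us):.0f} LUT x us")
```

The same function can be applied to FFs, DSPs or BRAMs and to the Sign and Verify latencies to compare the other columns of Table 3.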

Level  Reference     Family   Freq.  Area                      Keygen         Sign            Verify
                              (MHz)  LUT    FF     DSP  RAM    cycles  μs     cycles  μs      cycles  μs
II     Aikata [25]   ZUs+     270    23277  9798   4    24     14594   54     31662   117     15423   57
       Aikata [26]   ZUs+     200    18494  9319   4    24     14183   71     30358   152     15044   75
       Land [21]     Artix-7  163    27433  10681  145  15     18761   115    76613   470     19687   121
       Wang [27]     Z-7000   159    18558  7342   10   17     7757    49     52038   327     7675    48
       Zhao [4]      Artix-7  96.9   29998  10366  10   11     4172    43     31600   326.1   4422    45.6
       Beckwith [3]  VUs+     256    53907  28435  16   29     4875    19     29876   117     6582    26
       This work     VUs+     265    51317  26438  16   28     3865    14.6   29249   110.4   5162    19.5
III    Aikata [25]   ZUs+     270    23277  9798   4    24     23619   87     48446   171     26124   97
       Aikata [26]   ZUs+     200    18494  9319   4    24     22957   115    47418   237     25535   128
       Land [21]     Artix-7  145    30900  11372  45   21     33102   229    123218  852     32050   222
       Wang [27]     Z-7000   159    19614  8466   10   21     12982   82     89213   561     11232   71
       Zhao [4]      Artix-7  96.9   29998  10366  10   11     5851    60.4   49496   510.8   6181    63.8
       Beckwith [3]  VUs+     256    53907  28435  16   29     8291    32     49437   193     9724    39
       This work     VUs+     265    51317  26438  16   28     6255    23.6   46875   176.9   7383    27.9
V      Aikata [25]   ZUs+     270    23277  9798   4    24     39737   147    70179   260     46671   173
       Aikata [26]   ZUs+     200    18494  9319   4    24     38841   194    68460   342     45789   229
       Land [21]     Artix-7  140    44653  13814  45   31     50892   364    145912  1042    52712   377
       Gupta [28]    ZUs+     391    13975  6845   4    35     63.2k   162    113.9k  291     67.9k   174
       Wang [27]     Z-7000   159    20973  9677   10   28     20185   127    93708   589     15875   100
       Zhao [4]      Artix-7  96.9   29998  10366  10   11     8765    90.5   55321   570.9   9039    93.3
       Beckwith [3]  VUs+     256    53907  28435  16   29     14037   55     55070   215     14642   57
       This work     VUs+     265    51317  26438  16   28     10550   39.8   53220   200.8   10300   38.9

Table 3: Comparison of FPGA-based designs for Dilithium signature scheme.

Aikata et al. [25] report a design that supports all security levels of both Dilithium
and Kyber. In comparison, our implementation consumes 2.2×, 2.7×, 2×, 1.2× more area (LUT,
FF, DSP, BRAM) but is on average 3.7× faster in Keygen and (2.9×, 3.5×, 4.4×) faster in
Verify, depending on the security level. It is worth noting that Aikata et al. do not
report the average time for signature generation and only provide results for the
best-case scenario, where the signature is generated in the first loop iteration.
Therefore, to ensure a fair comparison, the implementation in this thesis is compared on
the same best-case basis, where it achieves latencies of 40.9/58.4/89 μs, an improvement
of 2.9× on average.
The work in [26] is another design by Aikata et al., which combines Dilithium and Saber.
In terms of area utilization, the implementation in this thesis consumes more hardware
resources (2.8× LUT, 2.8× FF, 2× DSP, 1.2× BRAM) but is faster in Keygen (4.9× on average)
and in verification (3.8×, 4.6×, 5.9× for the three security levels). In the best-case
scenario for Sign, the implementation in this thesis improves the execution time by 3.7×,
4×, and 3.8×, respectively.
When compared to the three cryptosystems proposed by Land et al. [21] for three
levels of Dilithium, it becomes evident that the implementation in this thesis outperforms
them in terms of both resource utilization and latency.
Wang et al. [27] proposed three modules corresponding to the three levels of Dilithium,
whereas the implementation in this thesis is unified for all parameter sets, which makes
it more complex. When security level 5 of [27] is compared with this work, the former
consumes fewer hardware resources (2.4×, 2.7×, 1.6×, 1× in LUT, FF, DSP, BRAM), but the
latter achieves significantly better performance (3.2×, 2.9×, 2.6× in Keygen, Sign,
Verify).
Gupta et al. [28] proposed a lightweight architecture for Dilithium level 5 only. Their
design uses fewer resources than ours by factors of 3.7×, 3.9×, 4×, 0.8× (LUT, FF, DSP,
BRAM), but our design is 4.1× and 4.5× faster in Keygen and Verify, respectively. For the
Sign algorithm, their work only reports results for the best-case scenario; compared with
our best case, our design is 3.3× faster.

Compared with the work of [4], the area of this work is larger by 1.7×, 2.55×, 1.6×,
2.5× (LUT, FF, DSP, BRAM), but in terms of time, our design is faster by (2.9×, 2.6×,
2.2×) in Keygen and (2.3×, 2.2×, 2.4×) in Verify. For Sign, they report 2 cases: signing
with a new key pair and signing with a key pair that was stored before. To be fair, we
compare against the first case, in which everything must be computed from scratch; there,
our implementation is 3×, 2.9×, 2.8× faster.
The work of [3] is, like ours, a high-performance implementation. Its results were
generated using Vivado, but the maximum clock frequency was determined by the Minerva
hardware optimization tool [24], which is a different evaluation standard from the other
works, whose results are determined entirely by the Vivado tool. Nevertheless, compared
with [3], our implementation reaches a higher maximum frequency (265 MHz versus 256 MHz)
and slightly reduces the number of LUTs and FFs, by 5% and 7%, respectively. The latency
of the corresponding algorithms is also better, as can be seen in Table 3.
In conclusion, the implementation presented in this thesis is the fastest among the
compared state-of-the-art implementations, while being a unified architecture capable of
performing all algorithms and supporting all security levels of the Dilithium scheme.
Although it requires higher resource utilization to achieve high performance, the
Area×Time trade-off metric indicates that it outperforms previous works.

5.3 Summary

This chapter presented the simulation results of the proposed hardware implementation
obtained with the Vivado simulation tool. It also included the performance evaluation
results and a comparison with state-of-the-art works. The results show higher resource
utilization, which is the cost of high performance; in return, the proposed hardware
architecture has low latency and is efficient according to the Area×Time trade-off metric.

CHAPTER 6 CONCLUSION

This thesis presents a high-performance hardware architecture for CRYSTALS-Dilithium,
a primary algorithm selected by NIST for the standardization of digital signatures.
The main contributions of this thesis are summarized as follows:
Firstly, a high-performance hardware architecture capable of performing the three major
phases of Dilithium and selecting between security levels at runtime has been proposed.
This design achieves a low latency for all operations while maintaining a small area.
Evaluated by the Area×Time trade-off metric, which combines area requirement and latency,
this implementation is more efficient than previous works.


Secondly, this thesis presents an efficient approach to implementing the three operations
of Dilithium. The polynomial arithmetic module has been optimized to decrease the latency
of individual operations. The butterfly unit and modular reduction sub-modules have been
improved specifically for Dilithium’s parameters. Additionally, other sub-modules have
been parallelized as needed to minimize stall and wait delays in the overall
implementation. The proposed implementation can be used in digital signature
applications, which are necessary in many fields that use digital data.
Data storage and transmission services play an important role in modern life, and a
new standard for digital signatures is necessary to withstand the emergence of
quantum computers. The Dilithium algorithm has the potential to become the new standard,
and implementing it in hardware can significantly improve its performance. Therefore, it
is a promising research area to work on.

REFERENCES

[1] NIST, “FIPS 186-2 - Digital Signature Standard,”
http://csrc.nist.gov/publications/fips/archive/fips186-2/fips186-2.pdf.
[2] NIST, “Status report on the third round of the NIST-PQC standardization
process,” https://csrc.nist.gov/projects/post-quantum-cryptography.
[3] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware implementation
of crystals-dilithium,” Crypto. ePrint Arch., Report 2021/1451, 2021.
[4] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin, S. Wei and L.
Liu, “A compact and high-performance hardware architecture for crystals-dilithium,”
IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022.
[5] S. Nakov, “Practical cryptography for developers”, ISBN: 978-619-00-0870-5, Sofia,
November 2018.
[6] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, and
D. Stehle, “CRYSTALS-Dilithium”, Proposal to NIST PQC Standardization, Round3,
2021, https://csrc.nist.gov/Projects/post-quantum-cryptography/round-3-submissions.
[7] Amos Fiat and Adi Shamir, “How to prove yourself: Practical solutions to
identification and signature problems,” In Andrew M. Odlyzko, editor, Advances in
Cryptology - CRYPTO ’86, Santa Barbara, California, USA, 1986, Proceedings, volume
263 of Lecture Notes in Computer Science, pages 186–194. Springer, 1986.
[8] Nigel P. Smart, “Cryptography Made Simple,” Springer, 2016.
[9] K. Lenstra, W. Lenstra, and L. Lovasz, “Factoring polynomials with rational
coefficients,” Mathematische Annalen. 261:515–534, 1982.
[10] D. Micciancio and O. Regev, “Worst-case to average-case reductions based on
gaussian measures,” in Proc. 45th Symp. Found. Comput. Sci., Rome, Italy, 2004, pp.
372–381.
[11] J. Wang and M. Wang, “Module-LWE versus Ring-LWE, Revisited,” Cryptology
ePrint Archive, Report 2019/930, 2019.
[12] Vadim Lyubashevsky, “Fiat-Shamir with aborts: Applications to lattice and
factoring-based signatures,” Advances in Cryptology - ASIACRYPT 2009, 15th
International Conference on the Theory and Application of Cryptology and Information
Security, Tokyo, Japan, December 6-10, 2009. Proceedings, volume 5912 of Lecture
Notes in Computer Science, pages 598–616. Springer, 2009.
[13] T. Poppelmann, “Efficient implementation of ideal lattice-based cryptography,” IT-
Information Technology, vol. 59, no. 6, pp.305-309. Sep. 2017.
[14] T. N. Tan and H. Lee, “Efficient-scheduling parallel multiplier-based ring-LWE
cryptoprocessors,” Electronics, vol. 8, no. 4, pp. 413-426. Mar. 2019.
[15] P. Longa and M. Naehrig, “Speeding up the number theoretic transform for faster
ideal lattice-based cryptography,” in Cryptology and Network Security (Lecture Notes in
Computer Science), vol. 10052. Cham, Switzerland: Springer, Nov. 2016, pp. 124–139.
[16] M. Baas, “An approach to low-power, high-performance FFT processor design,”
Ph.D. dissertation, Stanford University, 1999.
[17] B. Gold and L. Rabiner, “Theory and Application of Digital Signal Processing,”
Prentice-Hall, 1975.
[18] M. Shin and H. Lee, “A high-speed, four-parallel radix-2^4 FFT processor for
UWB applications,” in Proceedings of IEEE International Symposium on Circuits and
Systems (ISCAS 2008), Seattle, Washington, USA, May 2008, pp. 960-963.
[19] G. Bi and E. Jones, “A pipelined FFT processor for word-sequential data,” Acoustic,
Speech and Signal Processing, IEEE Transactions on, vol.37, no.12, pp.1982-1985, 1989.
[20] N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei and L.Liu, “Highly Efficient
Architecture of NewHope-NIST on FPGA using Low-Complexity NTT/INTT,” IACR
Trans. on CHES, vol. 2020, no. 2, pp. 49–72, 2020.
[21] G. Land, P. Sasdrich, and T. Guneysu, “A hard crystal implementing Dilithium on
reconfigurable hardware,” Cryptology ePrint Archive, 2021
[22] Bertoni, Guido and Daemen, Joan and Peeters, Michael and Van Assche, Gilles,
“Keccak hardware implementation,” https://keccak.team/hardware.html.
[23] CERG, SHAKE, https://github.com/GMUCERG/SHAKE, 2021.
[24] F. Farahmand, A. Ferozpuri, W. Diehl, and K. Gaj, “Minerva: Automated hardware
optimization tool,” in 2017 international Conference on ReConFigurable Computing and
FPGAs, ReConFig 2017, Cancun: IEEE, Dec. 2017, pp. 1-8.

[25] A. Aikata, C. Mert, M. Imran, S. Pagliarini and S. Roy, “KaLi: A crystal for post-
quantum security using Kyber and Dilithium,” IEEE Trans. Circuits Syst. 1, 2023.
[26] Aikata, A. C. Mert, D. Jacquemin, A. Das, D. Matthews, S. Ghosh, and S. S. Roy,
“A unified cryptoprocessor for lattice-based signature and key-exchange,” Cryptology
ePrint Archive, Report 2021/1461, 2021.
[27] T. Wang, C. Zhang, P. Cao and D. Gu, “Efficient implementation of Dilithium
signature scheme on FPGA SoC Platform,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 30, 2022.
[28] N. Gupta, A. Jati, A. Chattopadhyay, and G. Jha, “Lightweight Hardware Accelerator
for Post-Quantum Digital Signature CRYSTALS-Dilithium”, Cryptology ePrint Archive,
Paper 2022/496, 28 April 2022.
