
Arithmetic and Algorithms for Emerging Cryptography

Paulo Martins, Leonel Sousa


INESC-ID
Instituto Superior Técnico
Universidade de Lisboa
Email: paulo.sergio@netcabo.pt, las@inesc-id.pt

Abstract—Current cryptography needs to urgently deal with two major issues. The first pertains to the derivation of schemes resistant to quantum computing. The second is related to the ubiquity of embedded devices. As more and more sensors are integrated into our general infrastructure, there is an increasing need to securely offload the data they collect to the Internet. These data are often centralised in computing servers, which need to manage large numbers of connections simultaneously. In this paper, this topic is addressed in two manners, both exploiting innovative number representations. First, a post-quantum cryptographic system is designed that achieves efficient encryption on embedded devices and large decryption throughput on high-performance computing platforms. In particular, the Residue Number System (RNS) is used to increase the decryption throughput. A second solution to the problem of offloading sensitive data is derived using Fully Homomorphic Encryption (FHE). With this type of cryptography one can encrypt data, producing ciphertexts that are malleable and that can be processed by others without any real access to the hidden data. A plaintext number representation, rooted in stochastic computing, is proposed, which achieves homomorphic circuits with small multiplicative depth.

1. Introduction

Number representations can have a profound impact on the efficiency of cryptographic systems. Certain numeral systems, such as the RNS, provide for parallel additions and multiplications, while penalising the execution of operations such as number comparison and division. Others, like the Mixed-Radix System (MRS), allow two numbers to be compared efficiently, but are less efficient when it comes to multiplications. The design of efficient cryptographic systems should therefore take into account which representation is more suitable for which cryptographic operation.

In the past, innovative number representations have been applied to traditional cryptographic schemes such as Rivest-Shamir-Adleman (RSA) and Elliptic Curve (EC) Cryptography (ECC) [1] to improve their efficiency and security. For instance, since the RNS breaks down long integer arithmetic into smaller channels, these may be processed in parallel to accelerate cryptographic primitives. Since the underlying channels are independent of one another, they may also be randomly chosen to achieve a form of message blinding and reduce the amount of leaked data. As new cryptographic schemes emerge, namely ones achieving post-quantum security and with homomorphic capabilities, there is a need to update number representation techniques and adapt them to the new schemes.

Emerging schemes include post-quantum cryptography and FHE. Goldreich-Goldwasser-Halevi (GGH) is a form of Lattice-based Cryptography (LBC) that achieves post-quantum security. A lattice is a repeated arrangement of points. Encryption corresponds to adding a message, encoded as a vector, to a lattice point. In Figure 1, the message p~ is added to the lattice point ~b0, producing the ciphertext ~c. Babai's round-off [2] is used during decryption to compute the closest lattice vector, ~b0, to ~c. This is only possible for the holder of the private key, which consists of a nearly orthogonal basis of the lattice (~b0 and ~b1). Since lattices of very large dimension are used to ensure security, the numbers one needs to process during decryption are very large. In this paper, this bottleneck is overcome through the use of the RNS, which parallelises the arithmetic associated with these numbers [3].

[Figure 1. Encryption and Decryption in GGH: the message p~ is added to the lattice point ~b0, yielding the ciphertext ~c; ~b0 and ~b1 form the private basis.]

With FHE, ciphertexts are endowed with a certain level of malleability, allowing others to operate on them without any real access to the underlying plaintexts [4].
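As an illustration of the GGH flow of Figure 1, the following Python sketch encrypts by adding a message vector to a lattice point and decrypts with Babai's round-off on a toy 2-D lattice. The basis B, the coordinates z and the message m are hypothetical toy values chosen so that decryption succeeds; real parameters use nearly orthogonal bases of very large dimension.

```python
from fractions import Fraction

# Toy GGH on a 2-D lattice. B (rows) is a nearly orthogonal private basis;
# all values here are illustrative, not cryptographic parameters.
B = [[4, 1], [1, 5]]
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]            # det(B) = 19
Binv = [[Fraction(B[1][1], det), Fraction(-B[0][1], det)],
        [Fraction(-B[1][0], det), Fraction(B[0][0], det)]]

def vecmat(v, M):
    """Row vector times 2x2 matrix."""
    return [v[0] * M[0][j] + v[1] * M[1][j] for j in range(2)]

z = [3, -2]          # lattice coordinates picked at encryption time
m = [1, -1]          # short message vector
c = [mi + li for mi, li in zip(m, vecmat(z, B))]       # encrypt: c = m + zB

# Decrypt with Babai's round-off: v = round(c B^-1) B, then m = c - v
coords = [round(x) for x in vecmat(c, Binv)]
v = vecmat(coords, B)
recovered = [ci - vi for ci, vi in zip(c, v)]
print(recovered)     # -> [1, -1]
```

Exact rationals (`Fraction`) stand in for the high-precision arithmetic that the round-off requires; Section 3 shows how this is traded for integer arithmetic.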
While FHE was impractical when first proposed, there has been a large effort by the research community to make it more efficient, with the introduction of techniques such as batching. Batching is a technique that exploits arithmetic properties of the plaintext space to simultaneously encrypt multiple messages in a single ciphertext. When operations are applied to a ciphertext, they affect each slot individually. An initial strategy to exploit batching was to design Boolean circuits and evaluate multiple gates in parallel. Herein, we argue that for a certain class of numbers, a stochastic representation of numbers can take better advantage of this technique [5]. With stochastic computing, numbers are represented as the relative frequency of 1s in a bit sequence. Arithmetic operations can be implemented using simple logical circuits: for instance, multiplication is achieved through the bitwise AND of two bit sequences.

2. Number representations

Certain cryptographic applications need to deal with unusually large integers. Due to hardware limitations, processors can only handle integers with up to 32 or 64 bits. Traditionally, larger numbers are processed by breaking their representation into words of 32 or 64 bits and processing them one word at a time. A school-book algorithm to multiply two numbers a = Σ_{i=0}^{n_a−1} a_i w^i and b = Σ_{j=0}^{n_b−1} b_j w^j, with w = 2^32 or 2^64, would proceed as follows:

    c = (Σ_{i=0}^{n_a−1} a_i w^i)(Σ_{j=0}^{n_b−1} b_j w^j) = Σ_{i=0}^{n_a−1} Σ_{j=0}^{n_b−1} a_i b_j w^{i+j}    (1)

requiring n_a × n_b basic multiplications.

The complexity of (1) can be reduced through a change in number representation. The RNS is a numeral system relying on the Chinese Remainder Theorem (CRT) [1]. An RNS relies on a basis of coprime integers n_1, . . . , n_m. Each value a in the interval [0, n − 1), for n = n_1 × . . . × n_m, has a unique representation in RNS:

    (a_1, . . . , a_m) = (a mod n_1, . . . , a mod n_m)    (2)

Due to the CRT, a multiplication in RNS can be performed by multiplying the digits of a by the digits of b element-wise. Suppose that n > c, so that no wrap-around happens. Then, (1) can be rewritten as

    c = (a_1, . . . , a_m) × (b_1, . . . , b_m) = (a_1 × b_1 mod n_1, . . . , a_m × b_m mod n_m)    (3)

If each n_i is smaller than 2^32 or 2^64, then the number of basic multiplications is reduced from n_a × n_b in (1) to m ≈ n_a + n_b in (3). Additions and subtractions can also be performed coefficient-wise. However, the RNS does not have a positional nature. This means that operations such as ⌊a/b⌉ ∈ Z (i.e. division with rounding to the nearest integer) cannot be directly computed in RNS.

A second number representation enabling parallelism makes use of stochastic processes [5]. In order to represent a number a ∈ [0, 1], a sequence of independent bits a_0, . . . , a_{m−1} is generated such that the probability P(a_i = 1) = a, ∀i ∈ {0, . . . , m − 1}.

Like with the RNS, the multiplication of two independent number representations can be performed by multiplying their digits element-wise:

    c = (a_1, . . . , a_m) × (b_1, . . . , b_m) = (a_1 × b_1, . . . , a_m × b_m)    (4)

However, the use of a stochastic number representation introduces further restrictions than the RNS. First, it is not possible to directly add two numbers. Instead, a scaled addition operation is available. Given the independent number representations of a, b, s ∈ [0, 1], the stochastic representation of c = s × a + (1 − s) × b can be computed as:

    c = (s_1 × a_1 + (1 − s_1) × b_1 mod 2, . . . , s_m × a_m + (1 − s_m) × b_m mod 2)    (5)

Also, there is no direct way to compute divisions.

3. Improving Babai's Round-off

Babai's round-off plays a crucial role in decrypting cryptograms in GGH. Given a basis B generating the lattice {~z B, for ~z ∈ Z^n}, and an input vector ~c, Babai's round-off approximates the closest vector ~v in the lattice to ~c as:

    ~v = ⌊~c B^{−1}⌉ B    (6)

where ⌊·⌉ denotes coordinate-wise rounding to the nearest integer. For a nearly orthogonal basis B and a well-formed cryptogram ~c, the expression in (6) will produce the exact closest lattice vector. For this computation to be correct, the vector-matrix multiplication ⌊~c B^{−1}⌉ has to be computed with high-precision rationals. Since one of the main objectives of this paper is to maximise the throughput of (6) by using the RNS, and the RNS can only handle integers, this multiplication has to be converted to integer arithmetic:

    ⌊~c B^{−1}⌉ = ⌊~c B'/det(B) + (1/2)~v1⌋
                = (2~c B' + det(B)~v1 − [2~c B' + det(B)~v1 mod 2det(B)]) / (2det(B))    (7)

where B' = det(B) B^{−1} is an integer matrix and ~v1 is the all-ones vector.

In (7), the rounding operation of (6) has been replaced by a modular reduction. In RNS, this operation can be achieved through Montgomery's reduction [1]. This algorithm is represented in Figure 2. It replaces a reduction of A modulo D = 2det(B) by a reduction modulo M_1, the product that underpins an RNS representation, i.e. the moduli in basis M_1 satisfy M_1 = Π_{m∈M_1} m. Since operations are implicitly reduced by M_1 in RNS, the first step of the algorithm can be achieved very efficiently.
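The element-wise multiplications of (3) and (4) can be checked with a short Python sketch; the moduli, operands and sequence length below are illustrative choices, not values from the paper.

```python
import math, random

# RNS, eq. (3): multiply residue-wise over pairwise-coprime moduli
moduli = [2**31 - 1, 2**19 - 1, 2**13 - 1]    # distinct Mersenne primes
n = math.prod(moduli)
a, b = 123456789, 987654321                   # a*b must stay below n
rns_c = [(a % ni) * (b % ni) % ni for ni in moduli]

c = 0                                         # CRT reconstruction of c
for ci, ni in zip(rns_c, moduli):
    Ni = n // ni
    c = (c + ci * Ni * pow(Ni, -1, ni)) % n
print(c == a * b)                             # -> True

# Stochastic, eq. (4): bitwise AND of two independent bit sequences
random.seed(1)
length = 100_000
sa = [random.random() < 0.6 for _ in range(length)]   # encodes a = 0.6
sb = [random.random() < 0.5 for _ in range(length)]   # encodes b = 0.5
prod = sum(x & y for x, y in zip(sa, sb)) / length
print(abs(prod - 0.6 * 0.5) < 0.01)           # estimate of 0.6 * 0.5 = 0.3
```

The RNS product is exact as long as the result fits below n, whereas the stochastic product is a statistical estimate whose accuracy grows with the sequence length, which is the trade-off Section 4 exploits.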
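The data flow of Figure 2 can be mimicked with plain integers by collapsing each RNS basis to a single modulus; D, M_1 and A below are small hypothetical values (with D even, mirroring D = 2det(B)), so this is a sketch of the reduction's arithmetic rather than an RNS implementation.

```python
import math

# Montgomery-style reduction mirroring Figure 2, with the RNS bases
# collapsed to single integers for readability (illustrative values).
D = 2 * 1234577          # reduction modulus, playing the role of 2*det(B)
M1 = 2**19 - 1           # stands for the product of the first RNS basis
assert math.gcd(D, M1) == 1
A = 987654321            # input to reduce, with A < D * M1

Q = (-A * pow(D, -1, M1)) % M1   # step 1: cheap, implicit mod M1 work
Z = (A + Q * D) // M1            # step 2: exact division, done in basis M2
if Z >= D:                       # the costly comparison Section 3 avoids
    Z -= D
# Z is the Montgomery residue: Z == A * M1^{-1} mod D
print(Z == (A * pow(M1, -1, D)) % D)   # -> True
```

Because A + Q*D is a multiple of M1 by construction, the division is exact; in a real RNS implementation it is this division that forces the base extension from M_1 to M_2.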
    In basis M_1:  Q ← −A D^{−1} mod M_1
    (Q is base-extended from M_1 to M_2)
    In basis M_2:  Z ← (A + Q D) M_1^{−1} mod M_2
                   if Z ≥ D: Z ← Z − D

Figure 2. RNS Montgomery's reduction

In contrast, since a division by M_1 is later required, and this operation cannot be implemented in M_1, data has to be converted to another basis M_2. After this division, the result will lie in the interval [0, 2D). To perform a full modular reduction, Z has to be compared with D and a subtraction carried out if necessary. Since the RNS is of a non-positional nature, the comparison with D is very expensive and ideally should be avoided.

However, if one simply omits the comparison, an error will possibly be introduced during the computation of ⌊~c B^{−1}⌉, and the wrong lattice vector will be associated with ~c. A strategy to detect and eliminate this error is as follows. If ~c is scaled by γ ∈ Z, the expected result will similarly be scaled by γ, but the error ~ve introduced during the computation will be detectable mod γ:

    RNS(γ ~c B^{−1}) = γ ⌊~c B^{−1}⌉ + ~ve    (8)

where the error term ~ve is detectable mod γ.

In addition, if the plaintext p~ is constrained such that ||p~||_∞ < m_σ/2, then (7) may be computed modulo m_σ. By following this design approach, it is possible not only to remove an expensive number comparison from Montgomery's reduction, but also to reduce M_2 to two moduli: γ and m_σ.

4. Homomorphic Stochastic Computing

FHE refers to the class of encryption systems that support two operations ⊕ and ⊗ satisfying:

    Enc(a) ⊕ Enc(b) = Enc(a + b)    (9)
    Enc(a) ⊗ Enc(b) = Enc(a × b)    (10)

for a and b belonging to a certain ring, and without ⊕ and ⊗ accessing the private key. Using such a system, an embedded platform may offload the processing of sensitive data to a server without giving the service provider access to that data.

Several FHE cryptosystems support batching [4]. Batching is a technique supporting the encryption of several bits in the same ciphertext, so that homomorphic additions and multiplications operate on them in parallel. In this paper, a stochastic representation is proposed as a generic framework for the homomorphic processing of applications requiring accurate but inexact computations. There is a direct mapping between the batching slots and a stochastic number representation, namely by encoding each bit of a stochastic number in a different batching slot.

The proposed encoding leads to low-complexity circuits for the homomorphic evaluation of operations such as weighted sums or multiplications. Since homomorphic multiplications map to the coefficient-wise AND of two plaintexts and homomorphic additions map to their coefficient-wise XOR, one can implement stochastic multiplication with a homomorphic multiplication, and scaled addition with a combination of homomorphic additions and multiplications.

While the previous set of operations may seem limited, a wide range of polynomials can be evaluated with it. In particular, certain polynomials can be converted to a Bernstein representation [6, Corollary 1] b_0, . . . , b_d such that:

    B(x) = Σ_{i=0}^{d} b_i B_i(x),   B_i(x) = (d choose i) x^i (1 − x)^{d−i}    (11)

With this representation, the evaluation of polynomials can be performed using Algorithm 1. Since this algorithm, at its core, performs repeated scaled additions, and the proposed method can evaluate this operation homomorphically, it can also evaluate Bernstein polynomials.

Algorithm 1 de Casteljau's algorithm for the evaluation of Bernstein polynomials
Require: B(x) = Σ_{i=0}^{d} b_i B_i(x)
Require: x_0
1: for i ∈ {0, . . . , d} do
2:     b_i^(0) = b_i
3: end for
4: for j ∈ {1, . . . , d} do
5:     for i ∈ {0, . . . , d − j} do
6:         b_i^(j) = b_i^(j−1) (1 − x_0) + b_{i+1}^(j−1) x_0
7:     end for
8: end for
9: return B(x_0) = b_0^(d)

5. Implementation Details & Experimental Results

The implementation of the GGH decryption operation using the strategy described in Section 3 took into account several levels of parallelism. Since servers often need to handle several connections simultaneously, a first level of parallelism relates to using work-groups to decrypt several messages in parallel on Graphical Processing Units (GPUs). Since GGH has to deal with vector arithmetic, the workload of processing vector components can be distributed among several work-items in GPUs or threads in multi-core Central Processing Units (CPUs).
Finally, the usage of the RNS leads to the homogeneous execution of several work-items, which is very suitable to GPU architectures and to the exploitation of CPU Single-Instruction-Multiple-Data (SIMD) extensions.

Figure 3 depicts the delay of the decryption operation for several implementation strategies. First, the MRS label refers to an implementation using the Mixed-Radix System (MRS) to perform the comparison in Figure 2. Due to the complexity of this operation, it achieves worse performance than a generic multiprecision library (NTL). Nevertheless, it exposes a greater level of parallelism, which, when exploited (MRS AVX2), leads to large performance improvements. Notwithstanding, when the bottleneck of number comparison is removed, the delay is reduced to a great extent (RNS), while still allowing for parallel implementations (RNS AVX2). The amenability of the RNS approach to parallel architectures is verified in practice in Figure 4, where massively parallel GPUs achieve throughputs one order of magnitude larger than CPUs.

[Figure 3. GGH decryption latency on an i7 6700K: delay [ms] versus lattice dimension (1000 to 1800) for the RNS, RNS AVX2, MRS, MRS AVX2 and NTL implementations.]

[Figure 4. GGH decryption throughput [messages/s] versus lattice dimension (1000 to 1800), for RNS AVX2 on i7 4770K, i7 5960X and i7 6700K CPUs and for the RNS GPU implementation on K40c, GTX 980 and Titan X GPUs.]

Finally, the applicability of the proposed homomorphic stochastic representation has been verified in practice for a wide range of nonlinear functions in Table 1. These functions were approximated with Taylor series, and the resulting polynomials were converted to Bernstein representations. An average Mean Squared Error (MSE) of about 10^−5 was achieved for an execution time of about 2 s. It should be noted that if a different precision were required, the cryptographic parameters could be changed to achieve it.

    Function           Bernstein Poly. Degree   Average MSE    Execution Time [s]
    cos(x)             6                        8.94 · 10^−6   2.24
    sin(x)             5                        8.9 · 10^−5    1.79
    tan(x)/tan(1)      5                        1.92 · 10^−5   1.8
    acos(x)/acos(0)    5                        2.82 · 10^−5   1.79
    asin(x)/asin(1)    5                        3.22 · 10^−5   1.8
    atan(x)            5                        2.85 · 10^−5   1.77
    cosh(x)/cosh(1)    6                        3.61 · 10^−5   2.31
    sinh(x)/sinh(1)    5                        2.57 · 10^−5   1.78
    tanh(x)            5                        3.19 · 10^−5   1.79
    exp(x)/exp(1)      6                        2.38 · 10^−5   2.28
    ln(x+1)            6                        1.08 · 10^−4   2.23

Table 1. Homomorphic evaluation of nonlinear functions on an i7 5960X

6. Conclusion

The representation of data has a major influence on how efficiently it can be processed. In public-key cryptography, this motto is materialised by the way numbers are represented. In this paper, RNS-based architectures supporting GGH decryption were proposed and evaluated, achieving low latency on CPUs and high throughput on GPUs. Moreover, with the proposal of homomorphic stochastic representations, the homomorphic evaluation of nonlinear functions has become more viable and efficient, with applications in image processing and machine learning.

Acknowledgments

This work was supported by Portuguese funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013 and by the Ph.D. grant with reference SFRH/BD/103791/2014.

References

[1] L. Sousa, S. Antão, and P. Martins, "Combining Residue Arithmetic to Design Efficient Cryptographic Circuits and Systems," IEEE Circuits and Systems Magazine, vol. 16, no. 4, 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7748580/

[2] L. Babai, "On Lovász' Lattice Reduction and the Nearest Lattice Point Problem (Shortened Version)," in Proc. of the 2nd Symp. of Theoretical Aspects of Comput. Sci., ser. STACS '85, London, UK, 1985, pp. 13–20. [Online]. Available: http://dl.acm.org/citation.cfm?id=646502.696106

[3] P. Martins, J. Eynard, J.-C. Bajard, and L. Sousa, "Arithmetical Improvement of the Round-Off for Cryptosystems in High-Dimensional Lattices," IEEE Transactions on Computers, vol. 66, no. 12, pp. 2005–2018, 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7891511/

[4] P. Martins, L. Sousa, and A. Mariano, "A Survey on Fully Homomorphic Encryption: an Engineering Perspective," ACM Computing Surveys, vol. 50, no. 6, pp. 1–33, Dec. 2017. [Online]. Available: http://dl.acm.org/citation.cfm?doid=3161158.3124441

[5] P. Martins and L. Sousa, "A Stochastic Number Representation for Fully Homomorphic Cryptography," in 2017 IEEE International Workshop on Signal Processing Systems (SiPS), Oct. 2017, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/document/8109973/

[6] W. Qian and M. D. Riedel, "The synthesis of robust polynomial arithmetic with stochastic logic," in 2008 45th ACM/IEEE Design Automation Conference, June 2008, pp. 648–653.
