
High-Performance Accelerator for Constant-Time Cross-Domain Integer and Montgomery Inversion on FPGA

ANONYMOUS AUTHOR(S)

Modular Inversion (MI) is a fundamental arithmetic operation in finite fields. It plays an essential role in many cryptographic applications and must be both fast and secure. Unfortunately, the simple MI algorithm is vulnerable to side-channel attacks such as the timing attack, which measures the latency of a computation. Because the running time reflects the number of iteration rounds, which is closely related to the data, an attacker may be able to recover the initial data. Keeping the hardware complexity and resource consumption of MI low is also challenging. In this paper, we propose two novel modular inversion algorithms, named Constant-Time Integer Modular Inversion (CT-IMI) and Constant-Time Complementary Montgomery Modular Inversion (CT-CMMI), which use a constant number of iteration rounds to resist the timing attack: the former is for common applications and the latter is for cross-domain use. CT-IMI processes data in the integer domain. In contrast, CT-CMMI directly takes data in the Montgomery domain, which saves the conversion steps for some applications, e.g. Elliptic Curve Cryptography (ECC) scalar multiplication. In software simulations, we count the average clock cycles of one inversion and show the relationship between operand bit length and latency. The noticeable differences between constant-time and non-constant-time algorithms confirm that timing attacks are possible. In addition, we design two efficient hardware architectures on FPGA. Experimental results show that our CT-IMI finishes one inversion in 2.56 μs with 4.2k LUTs and 1.8k FFs, and our CT-CMMI requires 2.45 μs, 2.7k LUTs and 1.6k FFs. The area-latency products of our CT-IMI and CT-CMMI reach 10.50 and 6.62 respectively, the best among all the results in the existing literature.

CCS Concepts: • Security and privacy → Public key (asymmetric) techniques; • Mathematics of computing → Finite Field Algorithm; • Hardware;

Additional Key Words and Phrases: Finite Field Arithmetic, Modular Inversion, Montgomery Algorithm, Elliptic Curve Cryptography, Side-Channel Attack, FPGA

1 INTRODUCTION
Efficient and secure cryptography has been widely used to protect private information from attackers. Even though many cryptosystems are computationally indistinguishable and secure enough in theory, they are still threatened by side-channel attacks when implemented on Internet of Things (IoT) equipment or embedded systems, such as Simple Power Analysis (SPA) [34], Cache-Timing Attack (CTA) [21], Machine Learning based Profiling Attack (MLPA) [33] and so on. These attacks collect side-channel information during the computation and use it to recover the initial information. A common countermeasure is to add redundant operations so that the computation finishes in constant time. Constant-time algorithms resist at least the most basic timing attacks, and for some personal information they may be necessary.
Modular inversion is one of the modular operations used in public key cryptosystems such as Rivest-Shamir-Adleman (RSA) [30], Elliptic Curve Cryptography (ECC) [18, 23], Diffie-Hellman protocols (DH) [10] and so on. Beyond cryptography, other applications, including digital signal processing, multimedia, and Reed-Solomon (RS) decoding, also need modular inversion. Unfortunately, compared with other modular operations, e.g., modular addition or multiplication, inversion is expensive to implement and should be avoided as much as possible. For example, ECC uses projective coordinates [11] to reduce the number of modular inversions during scalar multiplication, even though this trick may leak some information [27]. However, modular inversion is still necessary for ECC when converting the result from projective coordinates to affine coordinates. There have been many notable improvements in modular inversion algorithms and hardware implementations, but overall progress has been relatively slow.

J. ACM, Vol. 1, No. 1, Article . Publication date: March 2023.

In this paper, we propose two novel modular inversion algorithms that both prevent side-channel information leakage and compute efficiently. We also design two efficient hardware architectures based on our algorithms, named Constant-Time Integer Modular Inversion (CT-IMI) and Constant-Time Complementary Montgomery Modular Inversion (CT-CMMI) respectively. CT-IMI processes numbers in the same domain as classic algorithms, while CT-CMMI performs a cross-domain modular inversion. They achieve the best performance among all the results in the existing work, which shows that our algorithms are effective. Overall, our designs have the following features:
• High security. Our modular inversion algorithms run in constant time to resist timing attacks.
• High efficiency. Unlike the Extended Euclidean Algorithm (EEA) and Montgomery Inversion (MI), our designs not only perform well in software but are also hardware-friendly, because we use subtraction and shift operations to replace division and multiplication, while EEA and MI require more hardware resources.
• Cross-domain. The classic modular inversion algorithm keeps its input and output in the same domain. In our cross-domain design, inputs are in the Montgomery domain and results are in the integer domain, which avoids the extra conversion step.
The remainder of this paper is organized as follows. Sections 2 and 3 review the existing modular inversion algorithms and the state-of-the-art works. Section 4 proposes our novel algorithms and theoretically analyzes their principles. Section 5 presents our hardware architectures and the details of the hardware implementations. The comparison results in software and hardware are listed in Section 6. Finally, Section 7 summarizes our work.

2 BACKGROUND
In this section, we first briefly explain some basic concepts and classic applications of modular inversion. Then we introduce the Montgomery domain and its relationship with the integer domain. Finally, we elaborate on Fermat's Little Theorem (FLT), the Extended Euclidean Algorithm (EEA) and the Montgomery Inversion (MI) algorithm, as they are the basis for almost all research on modular inversion algorithms today.

2.1 Basic Concepts and Classic Applications of Modular Inversion
Modular inversion implements the division operation in a finite field. For an integer $a$, the modular inverse of $a$ in the field $GF(p)$ is written $a^{-1}$. It satisfies $a \cdot a^{-1} \bmod p = 1$, where $p$ is a prime and $0 < a < p$.
In public key cryptography, modular inversion is a common finite-field operation that requires much more complex computation than addition, subtraction and multiplication. It is necessary for some cryptosystems; e.g. RSA, one of the most widely used public key encryption algorithms, generates the secret key $d$ by computing the modular inverse of the public key $e$.
However, unlike modular multiplication, modular inversion is expensive and needs many iteration rounds to find the result, which costs a lot of logic resources and a long delay on hardware. Many works therefore try to avoid inversion as much as possible to obtain a low-cost solution. For example, point addition (PA) and point doubling (PD) are the two basic operations in Elliptic Curve Cryptography. Suppose there are two points $A(x_a, y_a)$ and $B(x_b, y_b)$ on an elliptic curve $y^2 = x^3 + ax + b$, and let $C(x_c, y_c)$ be the result of the point addition $A + B$. The coordinates of $C$ can be calculated by the following equations:

$$
\begin{aligned}
x_c &= \lambda^2 - x_a - x_b \\
y_c &= \lambda (x_a - x_c) - y_a
\end{aligned}
\tag{1}
$$

where $\lambda = \frac{y_b - y_a}{x_b - x_a}$. The expression for $\lambda$ contains one division, so each point addition performs one modular inversion. Point doubling likewise requires one inversion.
If these operations are repeated many times, the hardware resource consumption and latency become unacceptable. A common solution is to use projective coordinates instead of affine coordinates, which replaces modular inversion with several modular multiplications. However, the result of a series of PA and PD operations must be converted back to affine coordinates, and this conversion needs a modular inversion. Thus, modular inversion is still inevitable and requires further optimization.

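To make the cost in (1) concrete, the following is a minimal sketch of affine point addition over $GF(p)$. The function name `ec_add_affine` and the toy curve $y^2 = x^3 + 2x + 2$ over $GF(17)$ in the usage note are illustrative assumptions, not part of this paper; Python's built-in `pow(x, -1, p)` (Python 3.8+) stands in for the inversion unit.

```python
def ec_add_affine(A, B, p):
    """Add two affine points A, B with different x-coordinates on a curve over GF(p)."""
    (xa, ya), (xb, yb) = A, B
    # The single modular inversion of the whole addition: 1 / (xb - xa) mod p.
    inv = pow((xb - xa) % p, -1, p)
    lam = (yb - ya) * inv % p               # lambda in equation (1)
    xc = (lam * lam - xa - xb) % p
    yc = (lam * (xa - xc) - ya) % p
    return xc, yc
```

On the toy curve, adding $(5, 1)$ and $(6, 3)$ modulo 17 gives $(10, 6)$. Every call pays for one inversion, which is why long chains of PA and PD are usually carried out in projective coordinates.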
2.2 Montgomery Domain
For an integer $a$ and a prime $p$ defining the finite field, we define the conversion $\tilde{a} = a \cdot 2^l \bmod p$, where $l$ is an integer and $2^l > p$. $\tilde{a}$ is the form of $a$ in the Montgomery domain. Many arithmetic operations are more convenient in this domain, especially Montgomery modular reduction and Montgomery modular multiplication [24]. We can also combine simple multipliers with modular reduction to perform modular multiplication and avoid the Montgomery domain. In particular, for some Mersenne or pseudo-Mersenne primes, there exist optimized algorithms [8, 32] for high-performance modular reduction. However, Montgomery's methods are universal and significantly outperform ordinary modular multiplication when many multiplications are performed in succession.
We denote Montgomery multiplication as MontMul(·) and Montgomery reduction as MontRedc(·). For integers $a$, $b$ and their corresponding forms in the Montgomery domain $\tilde{a} = a \cdot 2^l \bmod p$, $\tilde{b} = b \cdot 2^l \bmod p$, the following equations are satisfied:

$$\mathrm{MontRedc}(a) = a \cdot 2^{l} \bmod p = \tilde{a} \tag{2}$$

$$\mathrm{MontMul}(a, b) = \frac{a \cdot 2^{l} \cdot b \cdot 2^{l}}{2^{l}} \bmod p = a \cdot b \cdot 2^{l} \bmod p \tag{3}$$

Assume an integer $c = a \cdot 2^{-l} \bmod p$; then we can use MontRedc($c$) to get $a$. This trick plays an important role in the design of CT-CMMI.

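The product rule in (3) can be checked numerically with a short sketch. The helper names `to_mont` and `mont_mul` are hypothetical, and `mont_mul` is written with a plain modular inverse of $2^l$ purely to show the arithmetic identity; a real Montgomery multiplier avoids that inverse by using shifts.

```python
def to_mont(a, p, l):
    """Map an integer a to its Montgomery form a * 2^l mod p."""
    return (a << l) % p

def mont_mul(x, y, p, l):
    """Return x * y * 2^(-l) mod p, the value a Montgomery multiplier produces.

    Illustration only: the factor 2^(-l) is computed with pow() here,
    whereas hardware obtains it through Montgomery reduction.
    """
    return x * y * pow(2, -l, p) % p
```

With $p = 97$ and $l = 7$, multiplying the Montgomery forms of 5 and 11 yields the Montgomery form of 55, so a chain of multiplications can stay in the Montgomery domain without intermediate conversions.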
134 2.3 Basic Modular Inversion Algorithms
We select three basic modular inversion algorithms here, i.e., FLT, EEA and MI. Many other algorithms stem from them.
FLT (Alg. 1) is a special case of Euler's theorem. It states that $a^{p-1} \equiv 1 \pmod p$ when $p$ is a prime and $a$ is an integer not divisible by $p$. Since $a^{p-2} \equiv a^{-1} \pmod p$, the modular inverse can be obtained by exponentiation. This requires many multiplications and squarings, which limits its efficiency.
The key to modular inversion is the trade-off between the number of iteration rounds and the cost of each round. EEA (Alg. 2) has a small number of iteration rounds, but each round requires time-consuming multiplication and division operations, which consume more hardware resources and make it unsuitable for resource-constrained devices. To overcome this shortcoming, the Binary Extended Euclidean Algorithm (BEEA) [16] was proposed, which uses more iteration rounds but replaces multiplication and division with shift, addition and subtraction. Compared to EEA, BEEA is more suitable for hardware implementation.

Algorithm 1 Fermat's Little Theorem
Require: Integer $a$, Modulus $p$
Ensure: $a^{-1} \bmod p$
1: $ret, k, x = 1, p - 2, a$
2: while $k \ne 0$ do
3:   if $k$ is odd then
4:     $ret = ret \cdot x \bmod p$
5:   end if
6:   $k = k \gg 1$
7:   $x = x^2 \bmod p$
8: end while
9: return $ret$

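Alg. 1 translates almost line by line into Python. This is a minimal sketch, with `inv_flt` as an illustrative name; Python integers are not constant-time, so it shows only the data flow of the square-and-multiply loop.

```python
def inv_flt(a, p):
    """Compute a^(p-2) mod p by right-to-left square-and-multiply (p prime)."""
    ret, k, x = 1, p - 2, a % p
    while k:
        if k & 1:               # current exponent bit is set
            ret = ret * x % p
        k >>= 1
        x = x * x % p           # x holds a^(2^i) in round i
    return ret
```

The loop length depends only on the bit length of $p - 2$, never on $a$, which is the property that makes FLT attractive despite its cost.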
Algorithm 2 Extended Euclidean Algorithm
Require: Integer $a$, Modulus $p$
Ensure: $a^{-1} \bmod p$
1: $x_1, x_2, r_1, r_2 = 0, 1, p, a$
2: while $r_2 \ne 0$ do
3:   $q = \lfloor r_1 / r_2 \rfloor$
4:   $r_1, r_2 = r_2, r_1 - q \cdot r_2$
5:   $x_1, x_2 = x_2, x_1 - q \cdot x_2$
6: end while
7: return $x_1 \bmod p$

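For comparison, Alg. 2 can be sketched in the same way (`inv_eea` is an illustrative name). Note the integer division and the two multiplications in every round, which are the operations that make EEA expensive in hardware.

```python
def inv_eea(a, p):
    """Extended Euclidean inversion of a modulo a prime p, following Alg. 2."""
    x1, x2, r1, r2 = 0, 1, p, a
    while r2 != 0:
        q = r1 // r2                        # one division per round
        r1, r2 = r2, r1 - q * r2            # remainder sequence
        x1, x2 = x2, x1 - q * x2            # Bezout coefficient of a
    return x1 % p
```

Unlike the FLT loop, the number of rounds here depends on the value of $a$, so the running time leaks information about the input.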
Montgomery inversion uses an iterative structure and Montgomery multiplication/reduction to improve the speed. For each integer $a$, it first iterates to find an integer $k$ and outputs $a^{-1} \cdot 2^k \bmod p$, and then uses Montgomery multiplication/reduction to get the final result $a^{-1} \bmod p$. In practical applications, the input is usually a number in the Montgomery domain of the form $\tilde{a} = a \cdot 2^l \bmod p$, where $l$ is an integer and $2^l > p$. When we use Montgomery inversion to invert $\tilde{a}$, the output should be $a^{-1} \cdot R \bmod p$, where $R = 2^l$. Alg. 3 shows the detailed algorithm.
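The two-phase structure described above can be sketched as follows. Phase one follows the loop of Alg. 3 and returns $a^{-1} \cdot 2^k \bmod p$ together with $k$. For phase two, this sketch removes the factor $2^k$ with $k$ modular halvings instead of the MontMul/MontRedc calls of Alg. 3; that substitution, and the function names, are assumptions made to keep the example self-contained.

```python
def almost_mont_inverse(a, p):
    """Phase one: return (a^(-1) * 2^k mod p, k) for an odd prime p and 0 < a < p."""
    k, u, v, r, s = 0, p, a, 0, 1
    while v > 0:
        if u % 2 == 0:
            u, s = u >> 1, s << 1
        elif v % 2 == 0:
            v, r = v >> 1, r << 1
        elif u > v:
            u, r, s = (u - v) >> 1, r + s, s << 1
        else:
            v, s, r = (v - u) >> 1, r + s, r << 1
        k += 1
    return p - (r % p), k

def inv_two_phase(a, p):
    """Phase two (bit-level variant): divide by 2 modulo p, k times."""
    r, k = almost_mont_inverse(a, p)
    for _ in range(k):
        r = r >> 1 if r % 2 == 0 else (r + p) >> 1   # modular halving
    return r
```

The value of $k$, and therefore the running time of both phases, varies with $a$; this is the data dependence that the constant-round design of Section 4 removes.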
These algorithms have different advantages in different scenarios. In terms of computational performance, MI is typically more efficient than BEEA, while FLT performs poorly. In terms of hardware resource consumption, BEEA is superior to MI, and FLT costs the most. However, FLT is still considered for protecting private data because the number of iterations it requires is independent of the input. This feature guarantees that FLT finishes the inversion in constant time, while BEEA and MI do not.

Algorithm 3 Montgomery Inversion
Require: Integer $a$, Modulus $p$
Ensure: $a^{-1} \bmod p$
1: $k, u, v, r, s = 0, p, a, 0, 1$
2: while $v > 0$ do
3:   if $u$ is even then
4:     $u, s = u \gg 1, s \ll 1$
5:   else if $v$ is even then
6:     $v, r = v \gg 1, r \ll 1$
7:   else if $u > v$ then
8:     $u, r, s = (u - v) \gg 1, r + s, s \ll 1$
9:   else
10:    $v, s, r = (v - u) \gg 1, r + s, r \ll 1$
11:  end if
12:  $k = k + 1$
13: end while
14: $r = p - (r \bmod p)$
15: $b = \mathrm{MontMul}(r, 2^{2 \cdot l - k}, p)$
16: $ret = \mathrm{MontRedc}(b, 2^{l}, p)$
17: return $ret$

3 RELATED WORK
Yan [37] and Choi [7] used a radix-4 form to improve the original BEEA and reduce the number of iteration rounds. Hossain [13], Zhou [38], and Mrabet [25] mainly focused on optimizing hardware designs. They shared adders and reduced the delay caused by long carry chains. Their implementation results showed improvements in both the maximum circuit frequency and the computation latency. Bigou [5] was the first to design a modular inversion based on the residue number system (RNS), which decomposes a large integer into multiple small integers in different prime number domains. Additions and multiplications between small integers in RNS are independent, which allows parallel computation. Later, in [6], he made a further optimization by removing the comparison and used the plus-minus trick to reduce the number of modular operations. Hars [12] also used the plus-minus trick to improve the Left-Shift [22], Right-Shift and Shift Euclidean algorithms and analyzed their performance with the GMP library [20].
Deshpande [9] presented a sequential design and a full-length design. He also pointed out the limitation of the full-length design on hardware-constrained devices: when the bit length is less than 1280, the full-length design performs better than the sequential one, but as the bit length grows, the sequential design becomes more hardware-friendly. It is worth noting that the algorithm used in Deshpande's designs came from Bernstein [4], a constant-time algorithm with thorough mathematical analysis. Besides, Murat [26] divided the full length into several blocks, where each smallest block accepts an 8-bit input.
Kaliski [1] first proposed the "almost Montgomery Inversion" and divided the computation into two phases. The first phase is similar to BEEA and iterates to get an intermediate result. The second phase is also iterative and consists of bit-level operations. Bos [3] proposed keeping the sum of the iteration rounds in the first and second phases constant. Savas [31] replaced the bit-level operations with Montgomery multiplication, which simplifies the second phase. In addition, Savas's work was not limited to the integer or Montgomery domain: he also proposed cross-domain algorithms but, to the best of our knowledge, these algorithms have not yet been implemented in hardware.
FLT is the earliest constant-time modular inversion algorithm, and its calculation process is similar to the Modular Exponentiation Algorithm (MEA). The structure used for RSA encryption can also be applied to compute FLT. In the elliptic curve digital signature system designed by Lim [2], an RNS-based FLT algorithm is used to perform the modular inversion for private key generation. Xu [36] designed the $C(\cdot)$ function to re-express prime numbers and used this representation to calculate the fewest modular multiplication and exponentiation steps required by FLT. The new method reduces the number of modular multiplications by about 90% without changing the number of powers, which improves the efficiency of FLT-based modular inversion.
In short, there are various methods to optimize the calculation of modular inversion, e.g. high radix and RNS. As side-channel attacks continue to advance, it becomes necessary to secure the modular inversion calculation. Because FLT is inefficient, more and more efficient constant-time modular inversion algorithms have been proposed in recent years. This paper builds on the theory of previous work, especially [15, 31]. We propose novel constant-time algorithms to resist the timing attack. Moreover, to the best of our knowledge, ours is the first work to implement cross-domain algorithms in hardware. Our work not only extends previous research, but also provides new directions and opportunities for future research.

4 CONSTANT TIME MODULAR INVERSION
BEEA performs better than EEA on hardware because it makes a more appropriate trade-off between the number of iteration rounds and the efficiency of each round. In this section, we introduce our novel modular inversion algorithms, which take advantage of some properties used in BEEA and MI.
Our algorithms have two phases, similar to MI. In the first phase, the output form of our algorithms differs from that of MI: the result in MI is $a \cdot 2^{k} \bmod p$, while in our new algorithms the result is $a \cdot 2^{-l} \bmod p$. Here $a$ is the input, and $k$, $l$ are integers with different meanings. Besides, in order to resist the timing attack, the number of iteration rounds is constant for a given field $GF(p)$. We name the first phase Complementary Montgomery Iteration (CMI). The second phase has two configurations, which determine whether the whole algorithm is classic or cross-domain. We call the classic algorithm Constant-Time Integer Modular Inversion (CT-IMI) and the cross-domain algorithm Constant-Time Complementary Montgomery Modular Inversion (CT-CMMI). CT-IMI and CT-CMMI share the same iteration structure. They differ in the form of the input number and in the processing of the second phase.

4.1 Complementary Montgomery Iteration Structure
In the following description, $a_{init}$ and $p_{init}$ denote the initial input number and the modulus. $a$ and $p$ are intermediate variables in the iterative process. In the beginning, $a = a_{init}$, $p = p_{init}$.
As is well known, EEA is based on Bézout's identity. This lemma states that for integers $a$ and $p$, there exist integers $x$, $y$ such that:

$$a \cdot x + p \cdot y = \gcd(a, p) \tag{4}$$

When $p_{init}$ is a prime number, $\gcd(a_{init}, p_{init}) = 1$ for each $a_{init} \in [1, p_{init})$. According to (4), there exist integers $x$, $y$ such that $a_{init} \cdot x + p_{init} \cdot y = 1$, which implies $a_{init} \cdot x \equiv 1 \pmod{p_{init}}$. Then the value of $x$, also called the modular inverse of $a_{init}$, can be solved with an iterative method.
As mentioned before, BEEA performs better than EEA because it replaces division and multiplication with addition, subtraction and shift operations. This optimization mainly depends on the following properties from number theory, whose proofs can be found in [19].
Assume the integers $a$, $p$ ($a < p$) are coprime. When both $a$ and $p$ are odd:

$$\gcd(a, p) = \gcd\left(a, \frac{p - a}{2}\right) \tag{5}$$

When $a$ is odd and $p$ is even:

$$\gcd(a, p) = \gcd\left(a, \frac{p}{2}\right) \tag{6}$$

When $a$ is even and $p$ is odd:

$$\gcd(a, p) = \gcd\left(\frac{a}{2}, p\right) \tag{7}$$

For all $a$ and $p$:

$$\gcd(a, p) = \gcd(a, p - a) \tag{8}$$

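A quick numerical check of properties (5) to (8) can help when reading the derivation that follows. This is a minimal sketch; `reduced_pair` is a hypothetical helper that applies whichever of (5), (6) or (7) matches the parity of a coprime pair.

```python
from math import gcd

def reduced_pair(a, p):
    """Apply property (5), (6) or (7) once; assumes gcd(a, p) == 1 and a < p."""
    if a % 2 == 1 and p % 2 == 1:
        return a, (p - a) // 2      # property (5): both odd
    if a % 2 == 1:
        return a, p // 2            # property (6): a odd, p even
    return a // 2, p                # property (7): a even, p odd
```

Each step keeps the pair coprime while shortening one operand by at least one bit, which is what allows the inversion to be built from shifts and subtractions only.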
Unlike BEEA, we introduce a new method for calculating $x$ based on these properties. We derive one case in detail; the other cases are similar.
Assume that the input $a_0 = a_{init}$ is even; $p_0 = p_{init}$ is odd since its initial value is a prime. Combining Bézout's identity (4) with property (7), there exist integers $x_0$, $y_0$, $x_1$ and $y_1$ that satisfy the following equation:

$$a_0 \cdot x_0 + p_0 \cdot y_0 = \gcd\left(\frac{a_0}{2}, p_0\right) = \frac{a_0}{2} \cdot x_1 + p_0 \cdot y_1 = a_0 \cdot \frac{x_1}{2} + p_0 \cdot y_1 \tag{9}$$

According to (9), $x_0 = \frac{1}{2} x_1$, $y_0 = y_1$ is one of the solutions. The relationship between $x_0$, $y_0$, $x_1$ and $y_1$ can be described more simply in matrix form (10).

$$\begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} \frac{1}{2} & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ y_1 \end{pmatrix} \bmod p_{init} \tag{10}$$

Similarly, we can now treat $a_1 = \frac{a_0}{2}$ and $p_1 = p_0$ as the input. The transform (11) is similar to (10).

$$\begin{pmatrix} a_1 & p_1 \end{pmatrix} = \begin{pmatrix} a_0 & p_0 \end{pmatrix} \begin{pmatrix} \frac{1}{2} & 0 \\ 0 & 1 \end{pmatrix} \tag{11}$$

If we ignore $x_0$, $y_0$ for now, the problem becomes calculating the inverse of $a_1$, and the result is $x_1$. The parity of $a_1$ and $p_1$ again matches one of the properties (5)(6)(7), so an iterative process appears. In the $i$-th iteration round, we denote the input as $a_i$, $p_i$, and there must exist $x_i$, $y_i$ that satisfy the equation $a_i \cdot x_i + p_i \cdot y_i = 1$. If $x_i$, $y_i$ can be fixed, the final result $x_0$ can be obtained through reverse iterative calculation. Besides, we stipulate that $a_i < p_i$ during the iteration; if $a_i > p_i$ occurs, we exchange their values. The reason will be explained below.
Based on properties (5) to (7), the parity of $a_i$, $p_i$ matches one of the cases in the $i$-th iteration round, which removes at least one bit from $a_i$ or $p_i$. After at most $n = \log_2 a + \log_2 p$ iteration rounds, the values of $a_n$ and $p_n$ become 0 and 1. If we did not stipulate $a_i < p_i$, it would be impossible to determine which of $a_n$ and $p_n$ becomes zero, and the situation would be more complex. Once we fix that $a_n$ and $p_n$ finally become 0 and 1, the value of $x_n$ can be arbitrary and $y_n$ must be 1, since the equation $a_n \cdot x_n + p_n \cdot y_n = 1$ still holds. The value of $y_n$ is unique and definite in this case.
Assume that the values of $a_k$ and $p_k$ become 0 and 1 after $k$ ($k \le n$) rounds. Their values will not change anymore, since their parity matches property (7). This means that the redundant iteration rounds from $k + 1$ to $n$ do not affect the correctness of the result. For different inputs $a$, the round $k$ may differ, but all of them can continue to iterate until round $n$. Therefore, the number of iteration rounds is now constant, and a constant number of rounds also means constant time.
Now we have solved the constant-time problem, but the problem of execution efficiency remains. A basic idea is to reduce the upper bound $n$ on the number of rounds after which $a_n$, $p_n$ become 0 and 1. We observe that only one of $a_i$ and $p_i$ changes when using properties (5)(6)(7) in the $i$-th iteration round. Intuitively, if both of them decrease, or one of them decreases by more than before, the upper bound also decreases. Since property (8) holds for any $a_i$, $p_i$, we combine it with (5)(6)(7) and obtain (13)(14)(15). Experimental results show that this is indeed effective, and we give a detailed method to determine the upper bound in Section 6. The function $\mathcal{F}(\cdot)$, a shorthand used in (13)(14)(15), is defined as follows.

$$\mathcal{F}(m, n) = \begin{cases} m & m < n \\ m - n & m \ge n \end{cases} \tag{12}$$

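When both $a$ and $p$ are odd:

$$\gcd(a, p) = \gcd\left(a, \mathcal{F}\left(\frac{p - a}{2}, a\right)\right) \tag{13}$$

When $a$ is odd and $p$ is even:

$$\gcd(a, p) = \gcd\left(a, \mathcal{F}\left(\frac{p}{2}, a\right)\right) \tag{14}$$

When $a$ is even and $p$ is odd:

$$\gcd(a, p) = \gcd\left(\frac{a}{2}, \mathcal{F}\left(\mathcal{F}\left(p - a, \frac{a}{2}\right), \frac{a}{2}\right)\right) \tag{15}$$
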
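Based on properties (13)(14)(15) and Bézout's identity, we define the $i$-th iteration round:

$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} = \begin{pmatrix} \alpha_i & \beta_i \\ \gamma_i & \omega_i \end{pmatrix} \begin{pmatrix} x_{i+1} \\ y_{i+1} \end{pmatrix} \bmod p_{init} \tag{16}$$

$$\begin{pmatrix} a_{i+1} & p_{i+1} \end{pmatrix} = \begin{pmatrix} a_i & p_i \end{pmatrix} \begin{pmatrix} \alpha_i & \beta_i \\ \gamma_i & \omega_i \end{pmatrix} \tag{17}$$

where $a_i \cdot x_i + p_i \cdot y_i = a_{i+1} \cdot x_{i+1} + p_{i+1} \cdot y_{i+1} = 1$, $a_0 = a_{init}$, $p_0 = p_{init}$. We define the matrix $\begin{pmatrix} \alpha_i & \beta_i \\ \gamma_i & \omega_i \end{pmatrix}$ as the Transition Matrix (TM) for the $i$-th iteration round. The $i$-th TM is determined by the parity and values of $a_i$, $p_i$, as shown in Tab. 1. Property (13) corresponds to three cases in Tab. 1 because we compare not only $\frac{p - a}{2}$ with $a$, but also $a$ with $\mathcal{F}\left(\frac{p - a}{2}, a\right)$, to guarantee $a_i < p_i$ in each iteration round. Properties (14) and (15) are handled similarly.
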
Table 1. Relationship between $a_i$, $p_i$ and Transition Matrix (TM)

Parity of $a_i$, $p_i$        Value of $a_i$, $p_i$                        TM      $\alpha_i$   $\beta_i$   $\gamma_i$   $\omega_i$
$a_i$: odd, $p_i$: odd        $p_i \ge 5a_i$                               $M_0$   $1$          $-3/2$      $0$          $1/2$
                              $3a_i \le p_i < 5a_i$                        $M_1$   $-3/2$       $1$         $1/2$        $0$
                              $a_i < p_i < 3a_i$                           $M_2$   $-1/2$       $1$         $1/2$        $0$
$a_i$: odd, $p_i$: even       $p_i \ge 4a_i$                               $M_3$   $1$          $-1$        $0$          $1/2$
                              $2a_i \le p_i < 4a_i$                        $M_4$   $-1$         $1$         $1/2$        $0$
                              $a_i < p_i < 2a_i$                           $M_5$   $0$          $1$         $1/2$        $0$
$a_i$: even, $p_i$: odd       $p_i \ge \frac{5}{2} a_i$                    $M_6$   $1/2$        $-2$        $0$          $1$
                              $2a_i \le p_i < \frac{5}{2} a_i$             $M_7$   $-2$         $1/2$       $1$          $0$
                              $\frac{3}{2} a_i \le p_i < 2a_i$             $M_8$   $1/2$        $-1$        $0$          $1$
                              $a_i < p_i < \frac{3}{2} a_i$                $M_9$   $-1$         $1/2$       $1$          $0$

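Based on the iteration formula (16), we can expand the whole chain to calculate $x_0$ as follows.

$$
\begin{aligned}
\begin{pmatrix} x_0 \\ y_0 \end{pmatrix}
&= \begin{pmatrix} \alpha_0 & \beta_0 \\ \gamma_0 & \omega_0 \end{pmatrix} \cdots \begin{pmatrix} \alpha_i & \beta_i \\ \gamma_i & \omega_i \end{pmatrix} \cdots \begin{pmatrix} \alpha_{n-1} & \beta_{n-1} \\ \gamma_{n-1} & \omega_{n-1} \end{pmatrix} \cdot \begin{pmatrix} x_n \\ y_n \end{pmatrix} \bmod p_{init} \\
&= \frac{1}{2^n} \begin{pmatrix} 2\alpha_0 & 2\beta_0 \\ 2\gamma_0 & 2\omega_0 \end{pmatrix} \cdots \begin{pmatrix} 2\alpha_i & 2\beta_i \\ 2\gamma_i & 2\omega_i \end{pmatrix} \cdots \begin{pmatrix} 2\alpha_{n-1} & 2\beta_{n-1} \\ 2\gamma_{n-1} & 2\omega_{n-1} \end{pmatrix} \cdot \begin{pmatrix} x_n \\ y_n \end{pmatrix} \bmod p_{init}
\end{aligned}
\tag{18}
$$

where $n$ is the number of iteration rounds. For each TM, we convert the coefficients to integers by doubling their values. This avoids the modular shift operation, in which an odd number must be added to $p_{init}$ before shifting.
We mentioned that after $n$ iteration rounds, the values of $a_n$ and $p_n$ become 0 and 1, respectively. The value of $y_n$ is always 1, while the value of $x_n$ can be arbitrary. In order to simplify the calculation of $x_0$, we set $x_n = 0$.
According to (18), we get the following expression to calculate $x_0$:

$$\begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \frac{1}{2^n} \begin{pmatrix} \alpha & \beta \\ \gamma & \omega \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} \bmod p_{init} \tag{19}$$

where $\begin{pmatrix} \alpha & \beta \\ \gamma & \omega \end{pmatrix} = \begin{pmatrix} 2\alpha_0 & 2\beta_0 \\ 2\gamma_0 & 2\omega_0 \end{pmatrix} \cdots \begin{pmatrix} 2\alpha_{n-1} & 2\beta_{n-1} \\ 2\gamma_{n-1} & 2\omega_{n-1} \end{pmatrix}$.
Then $x_0 = \frac{\beta}{2^n} \bmod p_{init}$, where $\beta$ is the number calculated during the iterative process. The above iteration process is the first phase of our new algorithms, and the second phase further processes $\beta$ to get the final modular inversion result $x_0$.
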
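As a small worked instance of (16) to (19), take $a_{init} = 2$ and $p_{init} = 7$. These values are chosen here only for illustration and use the rows of Tab. 1 as reconstructed above; two rounds are enough to reach $(a_n, p_n) = (0, 1)$.

```latex
% Round 0: a_0 = 2 is even, p_0 = 7 >= (5/2) a_0, so Tab. 1 selects M_6.
% Round 1: a_1 = 1 and p_1 = 3 are odd, 3 a_1 <= p_1 < 5 a_1, so Tab. 1 selects M_1.
\begin{align*}
\begin{pmatrix} a_1 & p_1 \end{pmatrix}
  &= \begin{pmatrix} 2 & 7 \end{pmatrix}
     \begin{pmatrix} \tfrac{1}{2} & -2 \\ 0 & 1 \end{pmatrix}
   = \begin{pmatrix} 1 & 3 \end{pmatrix}, \\
\begin{pmatrix} a_2 & p_2 \end{pmatrix}
  &= \begin{pmatrix} 1 & 3 \end{pmatrix}
     \begin{pmatrix} -\tfrac{3}{2} & 1 \\ \tfrac{1}{2} & 0 \end{pmatrix}
   = \begin{pmatrix} 0 & 1 \end{pmatrix}, \\
\begin{pmatrix} \alpha & \beta \\ \gamma & \omega \end{pmatrix}
  &= 2M_6 \cdot 2M_1
   = \begin{pmatrix} 1 & -4 \\ 0 & 2 \end{pmatrix}
     \begin{pmatrix} -3 & 2 \\ 1 & 0 \end{pmatrix}
   = \begin{pmatrix} -7 & 2 \\ 2 & 0 \end{pmatrix}, \\
x_0 &= \frac{\beta}{2^{2}} \bmod 7 = 2 \cdot 4^{-1} \bmod 7 = 2 \cdot 2 \bmod 7 = 4,
  \qquad 2 \cdot 4 = 8 \equiv 1 \pmod 7 .
\end{align*}
```

Running further rounds would keep $(a_i, p_i)$ at $(0, 1)$ through $M_6$ and only rescale $\beta$ by the matching power of two, so $x_0$ stays the same.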
Alg. 4 shows the details. In each iteration round, we first choose the Transition Matrix according to the parity and values of $a_i$, $p_i$, which corresponds to line 4. Then we apply the iteration formulas (16)(17), as shown in lines 5 and 6. We require an initial matrix before getting the first Transition Matrix, and the identity matrix $E = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ serves this purpose. After we get the first Transition Matrix $M_0 = \begin{pmatrix} \alpha_0 & \beta_0 \\ \gamma_0 & \omega_0 \end{pmatrix}$, we multiply $E$ by $M_0$. For the next round, we get $M_1$ and keep multiplying. Finally, we get the value of $\begin{pmatrix} \alpha & \beta \\ \gamma & \omega \end{pmatrix}$. Since only the value of $\beta$ is useful, we can use a row vector $\begin{pmatrix} 1 & 0 \end{pmatrix}_{1 \times 2}$ to replace the matrix $E$. In Alg. 4, $a$, $p$ and $u$, $v$ are intermediate variables, and we collect them in row vectors for simplicity. Lines 5 and 6 multiply a row vector by a matrix and store the result in the row vectors $A$, $B$.

Algorithm 4 Complementary Montgomery Iteration (CMI)
Require: Integer $a_{init}$, Modulus $p_{init}$, Iteration round $n$
Ensure: $v$
1: $a = a_{init}$, $p = p_{init}$, $u = 1$, $v = 0$, $k = 0$
2: $A = \begin{pmatrix} u & v \end{pmatrix}_{1 \times 2}$, $B = \begin{pmatrix} a & p \end{pmatrix}_{1 \times 2}$
3: while $k \ne n$ do
4:   choose $M_t$, $t \in [0, 9]$ according to Tab. 1
5:   $A = A \cdot 2M_t$
6:   $B = B \cdot M_t$
7:   $k = k + 1$
8: end while
9: return $v$
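A software model of Alg. 4 can be sketched as follows. Several points are assumptions of this sketch rather than statements from the paper: the function names, the round count $n = 2 \cdot \mathrm{bitlen}(p_{init})$ (a conservative bound; the tuned bound is determined in Section 6), the use of row $M_2$ for the boundary pair $a_i = p_i = 1$, and the final scaling by $2^{-n}$ with `pow` in place of the second phase. Python integers are not constant-time, so only the arithmetic is modelled.

```python
def select_2tm(a, p):
    """Return 2*M_t for the Tab. 1 row matching (a, p), as ((2α, 2β), (2γ, 2ω))."""
    if a % 2 == 1 and p % 2 == 1:
        if p >= 5 * a:
            return (2, -3), (0, 1)          # M0
        if p >= 3 * a:
            return (-3, 2), (1, 0)          # M1
        return (-1, 2), (1, 0)              # M2 (assumed to cover a == p == 1 too)
    if a % 2 == 1:                          # a odd, p even
        if p >= 4 * a:
            return (2, -2), (0, 1)          # M3
        if p >= 2 * a:
            return (-2, 2), (1, 0)          # M4
        return (0, 2), (1, 0)               # M5
    if 2 * p >= 5 * a:                      # a even, p odd
        return (1, -4), (0, 2)              # M6 (also the fixed point (0, 1))
    if p >= 2 * a:
        return (-4, 1), (2, 0)              # M7
    if 2 * p >= 3 * a:
        return (1, -2), (0, 2)              # M8
    return (-2, 1), (2, 0)                  # M9

def cmi(a_init, p_init, n):
    """Run exactly n rounds of Alg. 4 and return v, i.e. beta in (19)."""
    a, p, u, v = a_init, p_init, 1, 0
    for _ in range(n):
        (m00, m01), (m10, m11) = select_2tm(a, p)
        u, v = u * m00 + v * m10, u * m01 + v * m11                   # A = A * 2M_t
        a, p = (a * m00 + p * m10) // 2, (a * m01 + p * m11) // 2     # B = B * M_t
    if (a, p) != (0, 1):
        raise ValueError("n is too small for this input")
    return v

def inverse_via_cmi(a, p):
    """x_0 = beta / 2^n mod p, with the division done by pow() for checking."""
    n = 2 * p.bit_length()                  # conservative round count (assumption)
    return cmi(a, p, n) * pow(2, -n, p) % p
```

For example, `cmi(2, 7, 2)` returns $\beta = 2$, matching $x_0 = 2 / 2^2 \bmod 7 = 4$. Because each row at least halves the product $a_i \cdot p_i$, twice the bit length of the modulus is always enough rounds for this model.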

4.2 CT-IMI and CT-CMMI
In Section 4.1, we showed the iteration formula, which multiplies $n$ Transition Matrices to get $\beta$. But $\beta$ is not yet the final modular inverse $x_0$ and requires further processing.
Since the lowest $n$ bits of $\beta$ cannot simply be dropped, the simplest way is to use the modular shift operation. This operation adds $p_{init}$ to the number before shifting when it is odd, and shifts directly when it is even. This method requires another $n$ iteration rounds, which is not efficient. Fortunately, MontRedc(·) can simplify this process, since $\mathrm{MontRedc}(\beta) = \frac{\beta}{2^l}$. Combining CMI and MontRedc(·) therefore computes the inverse in the integer domain. Consider an integer $a_{init}$ and its Montgomery representation $a'_{init}$, with $a'_{init} = a_{init} \cdot 2^l \bmod p_{init}$. If we use $a'_{init}$ as the input number, we obtain $\beta'$ after the CMI process. If the input number is $a_{init}$, we obtain $\beta$, which is the Montgomery representation of $\beta'$, so $\frac{\beta}{2^n} = \frac{\beta' \cdot 2^l}{2^n} = \beta' \cdot 2^{l-n}$. When $l$ is equal to $n$, we directly get $x_0' = \beta' \bmod p_{init}$: feeding the Montgomery-domain number $a'_{init}$ into CMI yields the integer-domain inverse $a_{init}^{-1} \bmod p_{init}$, which is the cross-domain inversion of $a'_{init}$. The value of $l$ only needs to satisfy $2^l > p_{init}$, and the value of $n$ required by CMI is greater than $\log_2 p_{init}$, so we can set $l = n$. Now we can use a Montgomery-domain number as the input of CMI and get the inverse in the integer domain. Modular reduction is used here to limit the output to the finite field.

Algorithm 5 Montgomery Reduction
Require: Integer $v$, Modulus $p_{init}$
Ensure: $v \cdot r^{-1} \bmod p_{init}$
$p_{inv} := -p_{init}^{-1} \bmod r$, $r := 2^n$, $n$ is iteration round
1: $m = ((v \bmod r) \cdot p_{inv}) \bmod r$
2: $ret = (v + m \cdot p_{init}) \gg n$
3: return $ret$

Alg. 5 shows the details of Montgomery reduction. $p_{inv}$ is pre-computed, so the multiplication steps can be simplified. The operation $\bmod\ r$ can also be replaced by a bitwise AND because $r$ is $2^n$.

Algorithm 6 Modular Reduction For $P_{sm2}$
Require: Integer $v$
Ensure: $v \bmod P_{sm2}$
1: $(v_9, v_8, v_7, v_6, v_5, v_4, v_3, v_2, v_1, v_0)_{10 \times 32} = v$
2: $r = (v_9 + v_8 + v_7,\ v_6,\ v_5,\ v_9 + v_4,\ v_8 + v_3,\ v_2 - v_8 - v_9,\ v_1 + v_9,\ v_0 + v_9 + v_8)_{8 \times 32}$
3: $(r_8, r_7, r_6, r_5, r_4, r_3, r_2, r_1, r_0)_{8 \times 32} = r$
4: $ret = (r_7 + r_8,\ r_6,\ r_5,\ r_4,\ r_3 + r_8,\ r_2 - r_8,\ r_1,\ r_0 + r_8)_{8 \times 32}$
5: return $ret$

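The word-level folding of Alg. 6 can be modelled with Python integers. This sketch assumes the standard SM2 prime $P_{sm2} = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$ and an input below $2^{320}$; the helper names are hypothetical, and the word lists are written least-significant word first, the reverse of the order printed in Alg. 6. As in Alg. 6, the output is congruent to $v$ but may still need a final conditional subtraction to fall below $P_{sm2}$.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
MASK32 = 0xFFFFFFFF

def split_words(x, count):
    """Split x into `count` 32-bit words, least-significant word first."""
    return [(x >> (32 * i)) & MASK32 for i in range(count)]

def join_words(ws):
    """Recombine words (which may carry or borrow) with their positional weights."""
    return sum(w << (32 * i) for i, w in enumerate(ws))

def fold_sm2(v):
    """Two folding steps of Alg. 6; the result is congruent to v modulo P_SM2."""
    v0, v1, v2, v3, v4, v5, v6, v7, v8, v9 = split_words(v, 10)
    # Line 2: fold words 8 and 9 using 2^256 = 2^224 + 2^96 - 2^64 + 1 (mod P_SM2).
    r = join_words([v0 + v9 + v8, v1 + v9, v2 - v8 - v9, v3 + v8,
                    v4 + v9, v5, v6, v7 + v8 + v9])
    # Line 3: r can overflow 256 bits; r8 is the overflow word.
    r0, r1, r2, r3, r4, r5, r6, r7 = split_words(r, 8)
    r8 = r >> 256
    # Line 4: fold the overflow word once more with the same congruence.
    return join_words([r0 + r8, r1, r2 - r8, r3 + r8, r4, r5, r6, r7 + r8])
```

The congruence used in both folds is the SM2 analogue of the $P_{192}$ example: each occurrence of $2^{256}$ is replaced by $2^{224} + 2^{96} - 2^{64} + 1$, so the reduction needs only additions and subtractions of 32-bit words.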
Alg. 6 shows an optimization method for modular reduction, which uses the congruence properties of some specific primes, e.g. $2^{192} \equiv 2^{64} + 1 \pmod{P_{192}}$ with $P_{192} = 2^{192} - 2^{64} - 1$. Different primes use different congruence properties for reduction. We show the process for the prime used in SM2. The value of $v$ is split into 10 numbers, each 32 bits long. We denote it as $v = \sum_{i=0}^{9} v_i \cdot 2^{32 \cdot i}$. We assume that the bit length of $v$ is no more than 320 based on the statistical results of experiments for $P_{sm2}$. The structure of modular reduction depends on the prime used in practical applications; here we only give a reference. In summary, we give a full description of our modular inversion algorithms in Alg. 7.
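
The folding in Alg. 6 can be checked with a short Python sketch. It is an illustration rather than the authors' implementation: the helper names `split32`, `join32` and `mod_reduce_sm2` are ours, Python integers replace the hardware carry handling, and the trailing subtraction loop is an assumed final correction that Alg. 6 does not show.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
MASK32 = 0xFFFFFFFF


def split32(x: int, count: int) -> list:
    """Little-endian list of the `count` lowest 32-bit limbs of x."""
    return [(x >> (32 * i)) & MASK32 for i in range(count)]


def join32(words) -> int:
    """Recombine limbs; individual words may exceed 32 bits or be negative."""
    return sum(w << (32 * i) for i, w in enumerate(words))


def mod_reduce_sm2(v: int) -> int:
    """Sketch of Alg. 6 for inputs of at most 320 bits."""
    assert 0 <= v < 1 << 320
    v0, v1, v2, v3, v4, v5, v6, v7, v8, v9 = split32(v, 10)
    # Step 2 of Alg. 6, written least significant word first.
    r = join32([v0 + v9 + v8, v1 + v9, v2 - v8 - v9, v3 + v8,
                v4 + v9, v5, v6, v7 + v8 + v9])
    # Steps 3-4: everything above bit 255 plays the role of r8.
    r8 = r >> 256
    r0, r1, r2, r3, r4, r5, r6, r7 = split32(r, 8)
    ret = join32([r0 + r8, r1, r2 - r8, r3 + r8, r4, r5, r6, r7 + r8])
    # Assumed final correction (not shown in Alg. 6).
    while ret >= P_SM2:
        ret -= P_SM2
    return ret


largest = mod_reduce_sm2((1 << 320) - 1)
```

The word pattern follows from $2^{256} \equiv 2^{224} + 2^{96} - 2^{64} + 1$ and $2^{288} \equiv 2^{224} + 2^{128} - 2^{64} + 2^{32} + 1 \pmod{P_{sm2}}$, which is why $v_8$ and $v_9$ are added to or subtracted from the listed positions.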
J. ACM, Vol. 1, No. 1, Article . Publication date: March 2023.

Algorithm 7 CT-IMI and CT-CMMI
Require: Integer $a_{init} = a'_{init} \cdot 2^{n}$, Modulus $p_{init}$, Iteration round $n$
Ensure: $a_{init}^{-1}$ or $a'^{-1}_{init} \bmod p_{init}$
1: $v = \mathrm{CMI}(a_{init}, p_{init})$
2: if $MODE$ == CT-IMI then
3:    $ret = \mathrm{MontRedc}(v)$
4: else if $MODE$ == CT-CMMI then
5:    $ret = v \bmod p_{init}$   (modular reduction, Alg. 6)
6: end if
7: return $ret$
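
The dispatch in Alg. 7 can be illustrated with the following sketch. Since the CMI iteration itself is defined earlier in the paper, the code replaces it with a hypothetical stand-in, `cmi_stand_in`, which simply assumes that CMI returns $a^{-1} \cdot 2^{n} \bmod p$; this contract is our reading of Alg. 5 and Alg. 7, not something the sketch proves. All function names are ours, and `pow(a, -1, p)` requires Python 3.8 or later.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
N = 281  # iteration rounds used for P_sm2 in the hardware section


def cmi_stand_in(a: int, p: int, n: int) -> int:
    """Hypothetical stand-in for CMI; assumed to return a^(-1) * 2^n mod p."""
    return (pow(a, -1, p) << n) % p


def mont_redc(v: int, p: int, n: int) -> int:
    """Montgomery reduction as in Alg. 5, with an assumed final correction."""
    r_mask = (1 << n) - 1
    p_inv = (-pow(p, -1, 1 << n)) & r_mask
    m = ((v & r_mask) * p_inv) & r_mask
    return ((v + m * p) >> n) % p


def invert(a: int, p: int, n: int, mode: str) -> int:
    v = cmi_stand_in(a, p, n)
    if mode == "CT-IMI":
        return mont_redc(v, p, n)   # integer input: strip the factor 2^n
    if mode == "CT-CMMI":
        return v % p                # Montgomery input: 2^n has already cancelled
    raise ValueError(mode)


x = 0xC0FFEE
inv_from_integer = invert(x, P_SM2, N, "CT-IMI")
x_mont = (x << N) % P_SM2                      # Montgomery form of x
inv_from_montgomery = invert(x_mont, P_SM2, N, "CT-CMMI")
```

Under the assumed contract, both calls return the same integer-domain inverse of `x`, which mirrors the cross-domain behaviour described for CT-CMMI.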

5 HARDWARE ARCHITECTURE
In this section, we propose hardware architectures to implement our algorithms, including a universal iteration architecture and Montgomery and modular reduction architectures for $P_{sm2}$.
Based on the CT-IMI and CT-CMMI algorithms, we design two hardware architectures on Xilinx Virtex-7 (xc7vx690tffg1157-3) and Zynq-Ultrascale+ (xczu19eg-ffvb1517-3-e) FPGAs. Our designs are synthesized using the Xilinx Vivado 2018.3 synthesis tools with the "Area Optimized high" directive. Fig. 1 shows the top architecture. Since the iteration architecture is the same, the two designs differ only in the optional architecture, which is marked in Fig. 1. If the input numbers are integers, we choose the Montgomery reduction. Otherwise, the modular reduction is used. For high efficiency, the Montgomery and modular reduction architectures require prime-specific optimization. In this work, we use the prime $P_{sm2}$ to complete the whole modular inversion architecture.

Fig. 1. The Top Hardware Architecture of Modular Inversion

Four registers are mainly used during the iteration. At the beginning of the iteration, the 256-bit registers $a$ and $p$ store the input number and $P_{sm2}$. The 320-bit registers $u$ and $v$ are initialized to 1 and 0. Another 9-bit register $k$ is used as a counter. The matrix multiplication has been simplified to shifting and subtraction operations. LSB1, LSB2 and LSB3 are Least Significant Bits (LSBs), which reflect the parities. CMP is the output of a comparator, which equals 1 or 0. They are used in the multiplexers. In each iteration round, registers $a$, $p$, $u$ and $v$ update their values, and the update logic in Fig. 2 is based on Tab. 1. The counter in register $k$ increases by 1 at the same time. Each iteration is finished in one clock cycle. After 281 iterations, the iteration stops and the value in register $v$ remains unchanged.

Fig. 2. The Complementary Montgomery Iteration Hardware Architecture

When the iteration is finished, if the input is an integer, we use Montgomery reduction to get the final result according to the algorithm in Section 4.2. Otherwise, modular reduction is needed since the value in register $v$ may reach 320 bits.
Alg. 5 and its corresponding hardware architecture in Fig. 3 are designed to perform the Montgomery reduction. Let $r = 2^{281}$, so that $v \bmod r$ is computed with a bit-and operation. We put the lower 281 bits of $v$ into the $v_l$ register and the higher 39 bits of $v$ into the $v_h$ register. Since $P_{inv} = -P_{sm2}^{-1}$ can be pre-computed, the multiplication steps in Alg. 5 can be simplified. For the sake of brevity, we only draw one "Mul" in Fig. 3 and use symbols to distinguish inputs and outputs. The first multiplication is $v_l$ (1.1) $\cdot$ $P_{inv}$ (1.2). The full product (1.3) can reach 537 bits, but it is reduced modulo $r$. Thus, only the lower 281 bits are effective, which saves a lot of hardware resources. The second multiplication takes the result from (1.3) and $P_{sm2}$ (2.1). Since $P_{sm2}$ consists of only a small number of powers of 2 ($P_{sm2} = 2^{256} - 2^{224} - 2^{96} + 2^{64} - 1$), we directly convert this multiplication into shift-and-add operations. The output (2.3) is the lower 281 bits, and (2.4) is the higher 256 bits. Then we add the values in $v_l$ and (2.3) and keep only the carry (cin) of the result beyond 281 bits, which avoids the last shift operation in Alg. 5. Finally, the sum of $v_h$, cin and (2.4) is the modular inversion. At most one subtraction is required to ensure that the result is constrained within 256 bits.
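
The data path described above can be mirrored in software to see why only the carry of the low half is needed. The sketch below is a behavioural model under our own naming (`mul_p_sm2`, `mont_redc_datapath`, `v_l`, `v_h`, `cin`); it does not model timing or register widths, and `pow(..., -1, ...)` requires Python 3.8 or later.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
N = 281
R_MASK = (1 << N) - 1
P_INV = (-pow(P_SM2, -1, 1 << N)) & R_MASK   # pre-computed constant


def mul_p_sm2(m: int) -> int:
    """m * P_sm2 using only shifts, additions and subtractions."""
    return (m << 256) - (m << 224) - (m << 96) + (m << 64) - m


def mont_redc_datapath(v: int) -> int:
    assert 0 <= v < 1 << 320
    v_l, v_h = v & R_MASK, v >> N        # lower 281 bits and upper 39 bits
    m = (v_l * P_INV) & R_MASK           # (1.3): only the low 281 bits are kept
    t = mul_p_sm2(m)                     # second multiplication as shift-and-add
    t_lo, t_hi = t & R_MASK, t >> N      # (2.3) and (2.4)
    cin = (v_l + t_lo) >> N              # low 281 bits of this sum are zero by construction
    ret = v_h + cin + t_hi               # equals (v + m * P_sm2) >> 281
    if ret >= P_SM2:                     # at most one subtraction
        ret -= P_SM2
    return ret


sample = mont_redc_datapath((1 << 320) - 1)
```

Because $v_l + t_{lo}$ is a multiple of $2^{281}$, its only contribution to the shifted result is the carry bit, so the explicit right shift of the full sum can be skipped.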
Alg. 6 and its corresponding hardware architecture in Fig. 4 are designed to perform the modular reduction. To increase the maximum frequency of the circuit, registers $r_0, r_1, \cdots, r_8$ divide the modular reduction into two stages and form a two-stage pipeline.

Fig. 3. The Montgomery Reduction for $P_{sm2}$ Hardware Architecture

In the first stage, the 320-bit register value $v$ is divided into ten 32-bit numbers $v_0, v_1, \ldots, v_9$, such that $v = v_9 \cdot 2^{288} + v_8 \cdot 2^{256} + \cdots + v_0$. A carry chain is necessary for registers $r_0, r_1, \ldots, r_8$, whose bit length is 32. For instance, if $w_0 = v_0 + v_8 + v_9$, the bit length of $w_0$ will reach 33. Therefore, we divide $w_0$ into a one-bit carry and a 32-bit sum, such that the carry $c_0 = w_0 \gg 32$ and the sum $s_0 = w_0\ \&\ \mathrm{0xFFFFFFFF}$. The sum $s_0$ is stored in register $r_0$, and the carry is used to calculate $w_1 = v_1 + v_9 + c_0$. Ultimately, the ten numbers $v_0, \ldots, v_9$ are transformed into nine numbers $r_0, r_1, \ldots, r_8$. In fact, if we let $r = r_8 \cdot 2^{256} + \cdots + r_0$, it follows that $r \bmod p = v \bmod p$, and the bit length of $r$ is at most 288.
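
The carry chain of the first stage can be sketched as follows. This is a software model with names of our choosing (`stage1_carry_chain`, `x`, `w`); it lets the carry be any signed integer rather than a single bit, so a negative carry acts as a borrow for the word that contains subtractions.

```python
P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
MASK32 = 0xFFFFFFFF


def stage1_carry_chain(v: int) -> list:
    """Return [r0, ..., r8] for a 320-bit v, following step 2 of Alg. 6."""
    assert 0 <= v < 1 << 320
    x = [(v >> (32 * i)) & MASK32 for i in range(10)]   # v0 .. v9
    # Word-wise sums, least significant word first.
    w = [x[0] + x[8] + x[9], x[1] + x[9], x[2] - x[8] - x[9], x[3] + x[8],
         x[4] + x[9], x[5], x[6], x[7] + x[8] + x[9]]
    r, carry = [], 0
    for word in w:
        word += carry
        r.append(word & MASK32)   # 32-bit sum s_i, stored in r_i
        carry = word >> 32        # floor shift: negative values propagate a borrow
    r.append(carry)               # r8 collects what is left above bit 255
    return r


regs = stage1_carry_chain((1 << 320) - 1)
recombined = sum(word << (32 * i) for i, word in enumerate(regs))
```

Recombining the nine words gives a value congruent to $v$ modulo $P_{sm2}$, which matches the statement that $r \bmod p = v \bmod p$.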
In the second stage, the bit length of the register $ret$ is 256, and the processing is similar to the first stage. We perform addition, subtraction, and carry processing with the numbers in registers $r_0, \ldots, r_8$, and finally obtain eight 32-bit numbers that together form the 256-bit result, which is the output of modular inversion.

6 EXPERIMENTAL RESULT AND COMPARISONS
In this section, we present experiments in both software and hardware. We first analyse the least iteration rounds for numbers of different bit lengths and compare them with other algorithms. Since the performance of CT-IMI and CT-CMMI is similar in software, we only use CT-IMI for the software comparison. Then we present the specific designs for $P_{sm2}$ on FPGA and compare them with other hardware implementations for 256-bit numbers.

Fig. 4. The Two-Stage Pipeline Modular Reduction for $P_{sm2}$ Hardware Architecture

6.1 The Constant Iteration Round
The question remaining from Section 4 is the value of the iteration round $n$. For each prime, there exists a different boundary for $n$. Simply setting $n = \lceil \log_2 a \rceil + \lceil \log_2 p \rceil$ is too redundant. Since the theoretical derivation of the lowest boundary value for $n$ is beyond the scope of this paper, we use experimental statistics to estimate the boundary; all code and statistics are recorded in https://github.com/Aimmecat/CTCMI. The software experimental platform uses Python 3.7 on an Intel(R) Core(TM) i7-9700K CPU @ 3.6GHz with 32.0 GB RAM, and the operating system is Ubuntu 18.04.6 LTS (GNU/Linux 4.15.0-172-generic x86_64).
We choose $p_{192}, p_{224}, p_{256}, p_{384}, p_{521}, p_{1279}, p_{2203}$ for the experiments and $10^7$ random inputs created by the "secrets" library of Python, which draws cryptographically strong random numbers from the operating system. For each prime, we count the iteration rounds for different inputs until the values of $a$ and $p$ become 0 and 1. The maximum result is taken as the upper boundary for this prime.
Fig. 5 displays the distribution of iteration rounds for various primes, which is similar to a Gaussian distribution. To verify this, we employed Anderson-Darling and Shapiro-Wilk tests, but the results were inconclusive, as we cannot test all inputs, even for $P_{192}$. However, when we sampled $10^2$ from the $10^7$ results, all primes passed the test. As the sample size increased to $10^3$, only $P_{1279}$ and $P_{2203}$ passed the test. Although we cannot ensure that it strictly satisfies a Gaussian distribution, we can still use it as a model for simplifying the analysis and predicting the upper boundary by fitting the curve to a Gaussian distribution.

Fig. 5. The Distribution of Iteration Rounds with Different Primes

Tab. 2 shows the fitting results. Because the amplitude is too large, we normalize the original data before fitting the mean $\mu$ and standard deviation $\sigma$. The MSE is also calculated to show the accuracy of the fit. The Range column records the shortest and longest iteration rounds. After fitting $\mu$ and $\sigma$ for each prime, the value of $\mu \pm 3\sigma$ is calculated. Compared to the Range column, the longest iteration rounds are larger than $\mu + 3\sigma$. We denote the longest iteration round and the Gaussian distribution as $H$ and $G(\cdot)$. Since the maximum iteration round $n$ must lie in $[1, \log_2 a + \log_2 p]$, the probability is

$$Pr(n \le H) = \frac{\sum_{i=1}^{H} G(i)}{\sum_{i=1}^{\log_2 (a \cdot p)} G(i)}.$$

We do not use the CDF to calculate $Pr(n \le H)$ here since the distribution of the iteration round is discrete. The experimental results indicate that setting $H$ as the constant iteration round is reasonable. Although there may exist numbers that require more iteration rounds, the probability is almost negligible for practical applications. Choosing $H$ rather than $\lceil \log_2 a \rceil + \lceil \log_2 p \rceil$ can at least be treated as a trade-off between efficiency and accuracy.
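
As a rough illustration of this probability, the discrete sums can be evaluated directly. The sketch below is ours: the function name `prob_within` is hypothetical, the upper limit of 512 rounds for the 256-bit prime is an assumption standing in for $\log_2(a \cdot p)$, and $\mu$, $\sigma$ and $H$ are the rounded values from the $P_{256}$ row of Tab. 2, so the output can only approximate the tabulated probability.

```python
import math


def prob_within(mu: float, sigma: float, h: int, upper: int):
    """Return (Pr(n <= h), Pr(n > h)) for a Gaussian sampled at the integers 1..upper."""
    def g(i: int) -> float:
        # Unnormalized Gaussian; the normalization cancels in the ratio.
        return math.exp(-((i - mu) ** 2) / (2.0 * sigma ** 2))

    total = math.fsum(g(i) for i in range(1, upper + 1))
    tail = math.fsum(g(i) for i in range(h + 1, upper + 1))
    # The tail is computed directly so that tiny values are not lost to rounding.
    return 1.0 - tail / total, tail / total


# P_256 (SM2) row of Tab. 2: mu = 229.5, sigma = 8.1, H = 281.
p_ok, p_tail = prob_within(229.5, 8.1, 281, 512)
tail_exponent = math.log2(p_tail)   # Pr(n <= H) is about 1 - 2**tail_exponent
```

With these rounded inputs, `tail_exponent` should come out near $-34$, in the neighbourhood of the exponent listed for $P_{256}$ in Tab. 2.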

6.2 Software Simulations
In Section 6.1, we analysed how to determine the constant iteration rounds. We now choose some non-constant-time modular inversion algorithms to show the relationship between bit length and latency, and compare them with our algorithm to show the necessity of the constant-time feature. Fig. 6 shows the results. For each bit length, we sample $10^4$ numbers and calculate the average latency of one modular operation. The prime changes with the bit length of the number. When the bit length is less than 193, $P_{192}$ is used. When the bit length reaches 193, we change the prime to $P_{224}$. The other cases are similar and are marked in Fig. 6. For each prime, the small inset in the upper left of Fig. 6 shows the difference between the mean and the original latency. In addition to BEEA and MI, HBEEA [14], Stein [17] and P_stein are used for comparison.

Table 2. Statistics of Iteration Rounds and Gaussian Distribution Fitting for Different Primes

Prime        Source      $\mu$    $\sigma$  MSE    Range ([L, H])  $\mu \pm 3\sigma$  Probability
$P_{192}$    NIST [29]   171.8    6.9       0.028  [141, 214]      [151.1, 192.6]     $1 - 2^{-30.1}$
$P_{224}$    NIST [29]   200.6    7.5       0.025  [173, 243]      [178.1, 223.3]     $1 - 2^{-27.3}$
$P_{256}$    SM2 [28]    229.5    8.1       0.024  [199, 281]      [205.3, 253.8]     $1 - 2^{-33.9}$
$P_{384}$    NIST [29]   344.8    9.9       0.022  [306, 402]      [315.2, 374.4]     $1 - 2^{-28.6}$
$P_{521}$    Mersenne    468.3    11.6      0.017  [420, 530]      [433.4, 503.1]     $1 - 2^{-24.5}$
$P_{1279}$   Mersenne    1151.2   18.3      0.012  [1073, 1259]    [1096.4, 1206.0]   $1 - 2^{-29.3}$
$P_{2203}$   Mersenne    2053.8   24.4      0.011  [1948, 2183]    [1980.9, 2127.0]   $1 - 2^{-24.2}$

The experimental results show that CT-IMI is more secure than the non-constant-time algorithms, since its whole curve is stable and the difference between the mean and the original latency is nearly 0 for each prime. For the same prime, there is not much difference in operation latency between longer and shorter numbers. This feature enables it to resist the timing attack. Even though CT-IMI is strictly constant time at the algorithm level, the operating frequency during software simulations inevitably changes, resulting in slight differences in latency. It is worth noting that a prime, e.g. $P_{2203}$, does not require the input number to exceed 1279 bits. When using a 1-bit number, the latency is still nearly indistinguishable from that of a 2203-bit number. To keep Fig. 6 clear, we omit these parts.

Fig. 6. Comparison of Latency between Constant and Non-Constant Time Modular Inverse Algorithm for Different Bit Length Numbers

There are also many other constant-time algorithms, e.g. BOS [3], BY [4], hyBY [35]. In the latest available literature, Jin [15] shows that the average clock cycles of SICT-MI are lower than those of these algorithms. In Tab. 3, we compare CT-IMI with two constant-time modular inversion algorithms, the traditional FLT and the latest SICT-MI. Compared to SICT-MI, CT-IMI saves around 30% of the clock cycles for each prime. This is mainly due to the choice of iteration rounds. SICT-MI uses $2\lceil \log_2 p \rceil$ as the constant rounds, which is conservative, while we analyze the distribution of iteration rounds and choose a lower boundary for efficiency. Since we use Python for the experiments, the clock cycles of SICT-MI are larger than the results in the original paper. The growth rate of the clock cycles consumed by FLT is much greater than that of the other two, mainly due to the multiplication of large integers. We therefore conclude that FLT is not suitable for processing long numbers.

Table 3. Comparison of Average Operating Clock Cycles of Constant Time Modular Inverse Algorithms

           192 bits  224 bits  256 bits  384 bits  521 bits  1279 bits  2203 bits
FLT        337305    444174    558902    1223687   2725711   27131468   138936167
SICT-MI    290068    341334    398437    615215    862327    2385599    5209863
CT-IMI     196709    234804    274669    405105    579633    1598433    3305455
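
For reference, the FLT baseline computes the inverse as $a^{p-2} \bmod p$, which follows from Fermat's little theorem for a prime $p$ and $a \not\equiv 0 \pmod p$. The sketch below is a minimal illustration with a function name of our choosing; it relies on Python's built-in `pow`, which is not guaranteed to run in constant time, and it is not the benchmark code behind Tab. 3.

```python
def flt_inverse(a: int, p: int) -> int:
    """Modular inverse via Fermat's little theorem: a^(p-2) mod p for prime p."""
    if a % p == 0:
        raise ZeroDivisionError("a has no inverse modulo p")
    return pow(a, p - 2, p)


P_SM2 = 2**256 - 2**224 - 2**96 + 2**64 - 1
a = 0xDEADBEEF
inv = flt_inverse(a, P_SM2)
small = flt_inverse(3, 7)   # 3 * 5 = 15 = 2 * 7 + 1
```

The exponent $p - 2$ has about as many bits as $p$, and every step is a full-width modular multiplication, which is consistent with the steep growth of the FLT row for long primes.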

6.3 Hardware Results and Comparisons
For each prime, even though the iteration architecture is universal, the Montgomery and modular reduction architectures differ in order to achieve a highly efficient hardware implementation. We design the universal iteration architecture and the reduction architecture for $P_{sm2}$, which is widely used in ECC. Tab. 4 shows the experimental results, and two Area&Time (AT) products are used to evaluate performance. For CT-IMI, the input numbers are integers, while CT-CMMI requires Montgomery inputs. Since modular reduction is simpler than Montgomery reduction, CT-CMMI costs fewer hardware resources than CT-IMI. On Virtex-7, Spartan-6, and Virtex-5 FPGAs, each slice contains 4 LUTs and 8 FFs. In contrast, a slice on Ultrascale consists of 8 LUTs and 16 FFs. This means that Ultrascale requires fewer slices to implement the same design, due to the increased number of LUTs and FFs per slice. To the best of the authors' knowledge, all the latest and reasonable research results related to the implementation of modular inversion for 256-bit primes on FPGA are presented in Tab. 4.

Table 4. Comparison of Hardware Implementations of 256-bit Modular Inversion

Ref.            Technology  Freq. (MHz)  LUT      FF    Cycle  Slices   Latency ($\mu$s)  AT$_I$ (LUT × ms)  AT$_{II}$ (Slice × ms)
CT-IMI          Ultrascale  151.4        4.2k     1.8k  295    712      1.95              8.19               1.39
CT-IMI          Virtex-7    115.9        4.1k     1.8k  295    1132     2.56              10.50              2.90
CT-CMMI         Ultrascale  152.7        2.7k     1.6k  285    459      1.87              5.05               0.86
CT-CMMI         Virtex-7    116.3        2.7k     1.6k  285    854      2.45              6.62               2.09
Hossain [13]    Virtex-7    146.38       5.8k     1.3k  340    1480     2.32              13.47              3.43
Deshpande [9]   Ultrascale  190.4        unknown  0.8k  8949   915      47.00             unknown            43.01
Murat [26]      Spartan-6   231.87       1.1k     0.5k  5120   unknown  22.08             24.29              unknown
Mrabet [25]     Virtex-5    129          2.3k     1.1k  1024   592      7.93              18.24              4.70

AT$_I$: LUT cost × latency for one operation. AT$_{II}$: Slice cost × latency for one operation.

We evaluate several designs for implementing CT-IMI and CT-CMMI on different FPGA platforms. On Virtex-7, our CT-IMI has a latency of 2.56 $\mu$s and requires 4.1k LUTs, 1.8k FFs and 1132 slices. Our CT-CMMI has a latency of 2.45 $\mu$s and requires 2.7k LUTs, 1.6k FFs and 854 slices. Hossain's design has the lowest latency of 2.32 $\mu$s, but it requires 5.8k LUTs, 1.3k FFs and 1480 slices. Compared to Hossain's design, our CT-IMI saves around 27% of the LUTs and 23.5% of the slices, while our CT-CMMI saves around 53.4% of the LUTs and 42.3% of the slices. Murat's design has the highest maximum frequency and uses the fewest hardware resources, with 1.1k LUTs and 0.5k FFs, making it suitable for lightweight applications where latency is not critical. However, our designs need far fewer clock cycles than Murat's design, and our latency is around ten times lower. Mrabet's design costs 2.3k LUTs, 1.1k FFs and 592 slices, with a latency of 7.93 $\mu$s. Its area costs are smaller than ours, but our CT-IMI and CT-CMMI designs save around 68.4% and 69.7% of the time, respectively.
On Ultrascale, CT-IMI has a latency of 1.95 $\mu$s and requires 4.2k LUTs, 1.8k FFs and 712 slices. CT-CMMI has a latency of 1.87 $\mu$s and requires 2.7k LUTs, 1.6k FFs and 459 slices. Deshpande proposed a universal hardware architecture for numbers of different bit lengths. For 256-bit inputs, his design is also a constant-time algorithm and costs 915 slices. Our CT-IMI saves 22.2% of the slices and CT-CMMI saves 49.8% of the slices. The latency of his design is around 23.5 times ours.

Fig. 7. Comparison of AT$_I$ and AT$_{II}$ for Different Hardware Implementations

Fig. 7 shows the AT$_I$ and AT$_{II}$ for each design. Compared with the other designs, our designs balance area and time, and their A&T performance is the best among the compared designs. In terms of A&T performance, Hossain achieved a better result than Deshpande, Murat and Mrabet. However, on Virtex-7, Hossain's design has 1.28 times the AT$_I$ and 1.18 times the AT$_{II}$ of our CT-IMI, and 2.03 times the AT$_I$ and 1.64 times the AT$_{II}$ of our CT-CMMI.
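
The AT figures and ratios can be recomputed from the rounded entries of Tab. 4 with a few lines of Python. The helper names below are ours, and because the table lists LUT counts rounded to 0.1k, the recomputed products can differ from the published ones in the second decimal place.

```python
def latency_us(cycles: int, freq_mhz: float) -> float:
    """Latency in microseconds for a given cycle count and clock frequency."""
    return cycles / freq_mhz


def at_product(area: float, latency_in_us: float) -> float:
    """Area (LUTs or slices) multiplied by latency expressed in milliseconds."""
    return area * latency_in_us / 1000.0


# Virtex-7 rows of Tab. 4 (LUT counts are the rounded "k" values).
designs = {
    "CT-IMI": {"lut": 4100, "slices": 1132, "latency_us": 2.56},
    "CT-CMMI": {"lut": 2700, "slices": 854, "latency_us": 2.45},
    "Hossain": {"lut": 5800, "slices": 1480, "latency_us": 2.32},
}

at = {
    name: (at_product(d["lut"], d["latency_us"]),
           at_product(d["slices"], d["latency_us"]))
    for name, d in designs.items()
}

# Hossain's AT products relative to ours (values above 1 favour our designs).
ratios = {
    name: (at["Hossain"][0] / at[name][0], at["Hossain"][1] / at[name][1])
    for name in ("CT-IMI", "CT-CMMI")
}

cmmi_latency = latency_us(285, 116.3)   # cycles and frequency of CT-CMMI on Virtex-7
```

Rounding the entries of `ratios` to two decimals gives the factors quoted in the paragraph above.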
The hardware results show that our designs perform better than the other available designs in Tab. 4. Besides, our architectures have the advantage of the constant-time feature, and CT-CMMI can directly compute with Montgomery domain numbers. Thus, our designs are valuable and suitable for practical applications.

7 CONCLUSION
This paper proposes two modular inversion algorithms, CT-IMI and CT-CMMI. Both have the constant-time feature, which allows them to resist the timing attack. In the software analysis, we compare them with non-constant-time algorithms and show the significant differences in latency as the bit length increases. Therefore, the constant-time feature is necessary for modular inversion algorithms when considering security against side-channel attacks. CT-IMI is a classic design, which performs the modular inversion in the integer domain. In comparison, CT-CMMI is a cross-domain design, which takes Montgomery numbers as inputs and directly computes their inversion in the integer domain. Compared to CT-IMI, CT-CMMI avoids the conversion steps for some specific applications. To pursue high efficiency, we use a new method to analyse the boundary of iteration rounds. After sampling around $10^7$ random numbers and counting the iteration rounds, we find that the distribution is close to a Gaussian distribution. We use the maximum iteration round that occurs in the experiments as the constant iteration round, fit the Gaussian distribution and give the probability that using this value is correct. Finally, we propose our hardware architectures on FPGA, consisting of a new hardware iteration architecture, which is universal for any prime, and specific Montgomery and modular reduction architectures for $P_{sm2}$. To the best of our knowledge, this is also the first time that a cross-domain modular inversion algorithm has been implemented in hardware. Compared with the existing literature, both the clock cycles in software and the area&time product in hardware are the best among the compared designs. We believe that CT-IMI and CT-CMMI will be used in high-efficiency cryptosystem processors in the future.

REFERENCES
[1] 1995. The Montgomery inverse and its applications. 44, 8 (1995), 1064–1065. https://doi.org/10.1109/12.403725
[2] 2009. Elliptic curve digital signature algorithm over GF(p) on a residue number system enabled microprocessor. https://doi.org/10.1109/TENCON.2009.5396175
[3] 2023. Constant time modular inversion. 4 (2023). https://doi.org/10.1007/s13389-014-0084-8
[4] Daniel J. Bernstein and Bo-Yin Yang. 2019. Fast constant-time gcd computation and modular inversion. (2019), 340–398. https://doi.org/10.46586/tches.v2019.i3.340-398
[5] Karim Bigou and Arnaud Tisserand. 2013. Improving modular inversion in RNS using the plus-minus method. In Cryptographic Hardware and Embedded Systems-CHES 2013: 15th International Workshop, Santa Barbara, CA, USA, August 20-23, 2013. Proceedings 15. Springer, 233–249.
[6] Karim Bigou and Arnaud Tisserand. 2016. Binary-Ternary Plus-Minus Modular Inversion in RNS. IEEE Transactions on Computers 65, 11 (2016), 3495–3501. https://doi.org/10.1109/TC.2016.2529625
[7] Piljoo Choi, Jeong-Taek Kong, and Dong Kyue Kim. 2015. Analysis of hardware modular inversion modules for elliptic curve cryptography. In 2015 International SoC Design Conference (ISOCC) (2015-11). 313–314. https://doi.org/10.1109/ISOCC.2015.7401713
[8] Richard E. Crandall. 1991. Method and apparatus for public key exchange in a cryptographic system.
[9] Sanjay Deshpande, Santos Merino del Pozo, Victor Mateu, Marc Manzano, Najwa Aaraj, and Jakub Szefer. 2021. Modular Inverse for Integers using Fast Constant Time GCD Algorithm and its Applications. In 2021 31st International Conference on Field-Programmable Logic and Applications (FPL) (2021-08). 122–129. https://doi.org/10.1109/FPL53798.2021.00028 ISSN: 1946-1488.
[10] W. Diffie and M. Hellman. 1976. New directions in cryptography. IEEE Transactions on Information Theory 22, 6 (1976), 644–654. https://doi.org/10.1109/TIT.1976.1055638
[11] D. V. Chudnovsky and G. V. Chudnovsky. 1986. Sequences of numbers generated by addition in formal groups and new primality and factorization tests. Advances in Applied Mathematics 7, 4 (1986), 385–434. https://doi.org/10.1016/0196-8858(86)90023-0
[12] Laszlo Hars. 2006. Modular Inverse Algorithms Without Multiplications for Cryptographic Applications. 2006 (2006), 1–13. https://doi.org/10.1155/ES/2006/32192
[13] Md Selim Hossain and Yinan Kong. 2015. High-Performance FPGA Implementation of Modular Inversion over F_256 for Elliptic Curve Cryptography. In 2015 IEEE International Conference on Data Science and Data Intensive Systems (2015-12). 169–174. https://doi.org/10.1109/DSDIS.2015.47
[14] Jin Hu and Yongbin Li. 2022. An Improved Modular Inversion Algorithm and Its Hardware Implementation. Journal of Hunan University (Natural Sciences) 49, 02 (2022), 101–105. https://doi.org/10.16339/j.cnki.hdxbzkb.2022264
[15] Yaoan Jin and Atsuko Miyaji. 2023. Short-Iteration Constant-Time GCD and Modular Inversion. In Smart Card Research and Advanced Applications: 21st International Conference, CARDIS 2022, Birmingham, UK, November 7–9, 2022, Revised Selected Papers (Birmingham, United Kingdom). Springer-Verlag, Berlin, Heidelberg, 82–99. https://doi.org/10.1007/978-3-031-25319-5_5
[16] D. E. Knuth. 1969. The Art of Computer Programming, Volume 2, Seminumerical Algorithms. Addison-Wesley, Reading, MA.
[17] Donald E. Knuth. 1998. The Art of Computer Programming: Seminumerical Algorithms (3rd Edition). Addison-Wesley Publishing Company, Massachusetts.
[18] Neal Koblitz. 1987. Elliptic Curve Cryptosystems. Math. Comp. 48, 177 (Jan. 1987), 203–209.
[19] Edmund Landau. 1999. Elementary Number Theory (second ed.). AMS Chelsea Pub., Providence, R.I.
[20] The GNU Multiple Precision Arithmetic Library. 2021. GMP library. https://gmplib.org/. [Online; accessed 1-September-2021].
[21] Chen Lin and Yi Wang. 2022. Implementation of Cache Timing Attack Based on Present Algorithm. In 2022 8th Annual International Conference on Network and Information Systems for Computers (ICNISC). 32–35. https://doi.org/10.1109/ICNISC57059.2022.00016
[22] Róbert Lórencz. 2003. New Algorithm for Classical Modular Inverse. In Cryptographic Hardware and Embedded Systems - CHES 2002 (Berlin, Heidelberg, 2003) (Lecture Notes in Computer Science), Burton S. Kaliski, Çetin K. Koç, and Christof Paar (Eds.). Springer, 57–70. https://doi.org/10.1007/3-540-36400-5_6
[23] Victor S. Miller. 1986. Use of Elliptic Curves in Cryptography. In Advances in Cryptology - CRYPTO ’85 Proceedings, Hugh C. Williams (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 417–426.
[24] Peter L. Montgomery. 1985. Modular Multiplication without Trial Division. Math. Comp. 44, 170 (1985), 519–521.
[25] Amine Mrabet, Nadia El-Mrabet, Belgacem Bouallegue, Sihem Mesnager, and Mohsen Machhout. 2017. An efficient and scalable modular inversion/division for public key cryptosystems. In 2017 International Conference on Engineering & MIS (ICEMIS) (2017-05). 1–6. https://doi.org/10.1109/ICEMIS.2017.8272995 ISSN: 2575-1328.
[26] Ertugrul Murat, Süleyman Kardaş, and Erkay Savaş. 2011. Scalable and Efficient FPGA Implementation of Montgomery Inversion. In 2011 Workshop on Lightweight Security & Privacy: Devices, Protocols, and Applications (2011-03). 61–68. https://doi.org/10.1109/LightSec.2011.14
[27] David Naccache, Nigel P. Smart, and Jacques Stern. 2004. Projective Coordinates Leak. In Advances in Cryptology - EUROCRYPT 2004 (Berlin, Heidelberg, 2004), Christian Cachin and Jan L. Camenisch (Eds.). Springer Berlin Heidelberg, 257–267.
[28] National Standardization Technical Committee of Information Security of China. 2017. Information Technology - SM2 Elliptic Curve Public Key Cryptographic Algorithm - Part 5: Parameter Definition. GB/T 32918.5-2017.
[29] National Institute of Standards and Technology. 2000. FIPS 186-2, Digital Signature Standard (DSS). Federal Information Processing Standards Publication (2000).
[30] R. L. Rivest, A. Shamir, and L. Adleman. 1978. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM 21, 2 (Feb. 1978), 120–126. https://doi.org/10.1145/359340.359342
[31] E. Savas and C.K. Koc. 2000. The Montgomery modular inverse-revisited. IEEE Transactions on Computers 49, 7 (2000), 763–766. https://doi.org/10.1109/12.863048
[32] Jerome A. Solinas. 2011. Generalized Mersenne Prime. Springer US, Boston, MA, 509–510. https://doi.org/10.1007/978-1-4419-5906-5_32
[33] Burcu Sönmez, Ahmet Ali Sarıkaya, and Şerif Bahtiyar. 2019. Machine Learning based Side Channel Selection for Time-Driven Cache Attacks on AES. In 2019 4th International Conference on Computer Science and Engineering (UBMK). 1–5. https://doi.org/10.1109/UBMK.2019.8907211
[34] Keke Wu, Huiyun Li, Tingding Chen, and Fengqi Yu. 2009. Simple Power Analysis on Elliptic Curve Cryptosystems and Countermeasures: Practical Work. In 2009 Second International Symposium on Electronic Commerce and Security, Vol. 1. 21–24. https://doi.org/10.1109/ISECS.2009.7
[35] P. Wuille and G. Maxwell. 2021. roconnor-blockstream:Safegcd-bounds. https://github.com/sipa/safegcd-bounds.
[36] Sen Xu, Haihua Gu, Lingyun Wang, Zheng Guo, Junrong Liu, Xiangjun Lu, and Dawu Gu. 2017. Efficient and Constant Time Modular Inversions Over Prime Fields. In 2017 13th International Conference on Computational Intelligence and Security (CIS) (2017-12). 524–528. https://doi.org/10.1109/CIS.2017.00122
[37] Xiaodong Yan and Shuguo Li. 2007. Modified modular inversion algorithm for VLSI implementation. In 2007 7th International Conference on ASIC (2007-10). 90–93. https://doi.org/10.1109/ICASIC.2007.4415574 ISSN: 2162-755X.
[38] Tao Zhou, Xingjun Wu, Guoqiang Bai, and Hongyi Chen. 2002. New algorithm and fast VLSI implementation for modular inversion in Galois field GF(p). In IEEE 2002 International Conference on Communications, Circuits and Systems and West Sino Expositions (2002-06), Vol. 2. 1491–1495. https://doi.org/10.1109/ICCCAS.2002.1179061