Preprint
Article

This version is not peer-reviewed.

On the Parallelization of Square-Root Vélu’s Formulas

A peer-reviewed article of this preprint also exists.

Submitted:

17 January 2024

Posted:

18 January 2024

You are already at the latest version

Abstract
A primary challenge in isogeny-based cryptography lies in the substantial computational cost associated to computing and evaluating prime-degree isogenies. This computation traditionally relied on Vélu’s formulas, an approach with time complexity linear in the degree but which was further enhanced by Bernstein, De Feo, Leroux, and Smith to a square-root complexity. The improved square-root Vélu’s formulas exhibit a degree of parallelizability which has not been exploited in major implementations. In this study, we introduce a theoretical framework for parallelizing isogeny computations and provide a proof-of-concept implementation in C with OpenMP. While the parallelization effectiveness exhibits diminishing returns with the number of cores, we still obtain strong results when using a small number of cores. Concretely, our implementation shows that for large degrees it is easy to achieve speedup factors of up to 1.74, 2.54 and 3.44 for 2, 4 and 8 cores, respectively.
Keywords: 
;  ;  ;  ;  

1. Introduction

Since the foundational work of De Feo, Jao and Plût [1,2], which resulted in the SIKE protocol [3], isogeny-based cryptography has received considerable interest in the development of post-quantum cryptography. Today, the security of SIKE has been broken by a wave of attacks which started with the work of Castryck and Decru [4] and was followed up by the works of Maino, Martindale, Panny, Pope, and Wesolowski [5] and of Robert [6], ultimately demonstrating the existence of a polynomial-time attack on SIKE with any starting curve. Nevertheless, there are still many isogeny-based protocols that remain unbroken, including the CSIDH [7] key exchange protocol and signature schemes like SeaSign [8], CSI-Fish [9] and SQISign [10], the last of which is currently under consideration in the NIST (National Institute of Standards and Technology) process for Standardization of Additional Digital Signature Schemes [11].
While SIKE only required the computation of isogenies of degrees 2 and 3, there has been a tendency in some of the newer isogeny-based protocols to move towards higher and higher prime degrees. This has brought increasing attention to the task of optimizing the evaluation of large-degree isogenies. For instance, the original CSIDH proposal used isogenies of degrees up to 587, and Chávez, Chi, Jacques and Rodríguez have suggested that a level 5 dummy-free implementation would need degrees as high as 2239 [12] after revising the security. More recently, the level 5 version of SQISign that was submitted to the NIST standarization process [13] brings the ceiling up with a colossal degree-318233 isogeny computation. Even the aforementioned attacks on SIKE require the evaluation of large prime-degree isogenies, with the techniques from [5] requiring the evaluation of an isogeny of degree 321193 over a large extension field for a direct key recovery attack. Therefore, optimizing the computation of large prime-degree isogenies is of great interest for constructive purposes as well as for cryptanalysis. Interestingly, a new generation has also risen of isogeny-based protocols which utilize the techniques of the attack constructively, such as FESTA [14], QFESTA [15], IS-CUBE [16] and constructions for VDFs [17] and VRFs [18], of which the last two also involve large-degree isogenies.
Traditionally, the main method for computing isogenies of prime degree when there exists a point of order in the curve has been Vélu’s formulas, initially introduced in [19] and expanded upon in [20] and [21]. In view of the ubiquitousness of large-degree isogenies, the more recent improvement from Bernstein, De Feo, Leroux, and Smith [22], which reduced the time complexity from O ( ) to O ˜ ( ) , has been a keystone for the viability of many protocols. In addition to the straightforward savings that it provides, this new algorithm also exhibits a substantial degree of parallelizability that has remained largely unexplored.
  • Our Contributions. In this work we detail and showcase the parallelizability of the square-root Vélu formulas, focusing on the problem of computing a single large-degree isogeny from a fixed kernel generator and pushing a number of points through said isogeny. We analyze and provide a parallelism-friendly reformulation of the square-root Vélu formulas, and provide a proof-of-concept (PoC) implementation to illustrate the implications of our work. More precisely,
  • We propose a new indexing system for the points in the kernel of the isogeny that is similar to the one presented in [22], but which can be naturally split into subsets that are assigned to each core in a multi-core implementation.
  • We provide a parallelized expected cost function based on our square-root Vélu variant, as well as a revised expected cost of the sequential Vélu formulas that were presented in [23]. Assuming that the assigned task is to perform an isogeny construction and to push two points through it, the idealized speedup factors for large degrees is up to 1.79, 2.96 and 4.58 faster than the expected sequential cost function using two, four and eight cores, respectively.
  • We give a PoC implementation utilising C and OpenMP to compute our isogeny formulas with large odd prime degrees. Our implementation assumes for concreteness that isogenies are evaluated over a prime field such that there exist rational points of the desired degree in the curve, and achieves speedup factors for large degrees of up to 1.74 , 2.54 and 3.44 for 2, 4 and 8 cores, respectively, even for medium-sized isogenies.
  • Assumptions and generalizations. For the purposes of our implementation, we have searched for a 1792-bit prime such that p + 1 contains evenly-spaced prime factors ranging from 19 to 321193 (the degree of the isogeny in the attack of [5]). Isogenies are performed over the base prime field, and two points are pushed through each isogeny. These choices were made for concreteness and closely mirror the scenario in CSIDH. On the other hand, they do not adhere exactly to the case of SQISign (which works in a quadratic field extension) nor that of the attack in [5] (which performs isogenies in a much larger field extension). Nevertheless, the techniques that we describe save on a fixed number of field multiplications, which are the dominating part of the total cost. Therefore we expect that similar speedups in percentage would be obtained in those other cases, with the caveat that the savings could even be slightly larger as the parallelization overhead becomes more negligible with respect to the arithmetic cost. As for the theoretical savings, our costs are expressed in terms of the total number of field multiplications and remain valid for any choice of base field.
  • Related Work. Different forms of parallelism have been explored at various layers of isogeny-based protocols. For instance, the use of vectorization through Intel’s Advanced Vector Extensions (AVX-512) has been exploited for the finite field arithmetic layer, both in the context of CSIDH [24] and of the now obsolete SIKE protocol [25,26], obtaining speedup factors in the order of 1.5 . Additionally, it has also been proposed to use AVX-512 to batch multiple evaluations of the protocols, leading to increases in throughput of up to 3.6 for CSIDH [24] and 4.6 in SIKE [26]. When latency is the focus, vectorization can also be used to push multiple points through an isogeny, leading to the concept of parallel-friendly evaluation strategies which favour pushing points over point multiplications [26,27,28]. These parallel strategies can accelerate the evaluation of large composite-degree isogenies, but do not apply for isogenies of a large prime degree.
This work focuses exclusively on the multi-core optimization for the isolated evaluation of a large prime-degree isogeny, which has received much less attention. To the best of our knowledge, there are only simple parallelization strategies that have been proposed [29] for the linear-complexity Vélu formulas, not exploiting the improved complexity methods from [22]. We also do not consider vectorization: if it was to be used, we expect that the best way to exploit it would be at the field arithmetic layer. This would lead to improvements similar to those from [24,25,26] which are well documented and are largely parallel to our multi-core improvements.
  • Outline. The remainder of the manuscript is organized as follows. In Section 2 we give the basic background and description of the original square-root Vélu formulas with their three main building blocks: KPS, xISOG and xEVAL, and derive their expected cost function. In Section 3 we introduce our new framework, explaining the need for a new indexing system in Section 3.1 and then detailing the new algorithms in Section 3.2 and deriving their expected cost function in Section 3.3. Finally, in Section 4, we present our experimental results, comparing the performance to the theoretical savings.

2. Background

We denote by F q the finite field with q = p n elements, and p a prime number. We focus on supersingular Montgomery curves, E, given by
E : y 2 = x 3 + A x 2 + x , A F q { ± 2 } .
Remark 1.
In general, a Montgomery curve is defined by the equation E A , B : B y 2 = x 3 + A x 2 + x such that B 0 and A ± 2 , but all choices of B are equivalent up to isomorphism. For the sake of simplicity, we write E A instead of E A , 1 and omit the constant B.
Viewing the curve as a projective surface, the set E ( F q ) of F q -rational points in E forms a group where the point at infinity acts as the group identity. An order-d point P on E is a point on the curve such that d is the smallest positive integer satisfying
[ d ] P i = 1 d P = .
We write E [ d ] to refer to the d-torsion subgroup { P E ( F q ¯ ) [ d ] P = } of E.
  • Isogenies. An isogeny ϕ : E E is a surjective morphism between curves with a finite kernel such that ϕ ( E ) = E . Such a map is always also a group homomorphism. Two curves E and E are isogenous over F q if there exists such ϕ connecting them, or equivalently if # E ( F q ) = # E ( F q ) . The kernel of ϕ is { P E ( F q ) ϕ ( P ) = } , denoted by ker ϕ . When restricting only to separable maps, the degree of an isogeny is equal to the size of its kernel, and an isogeny is uniquely determined by its kernel up to an isomorphism. An isogeny ϕ : E E of degree is referred to as a -isogeny, and it has a unique dual (up to isomorphism) ϕ ^ : E E of the same degree such that ϕ ϕ ^ and ϕ ^ ϕ are equivalent to the multiplication-by- map on E and E , respectively.

2.1. Computation of -Isogenies

Let E A be a Montgomery curve defined over F q and P E A ( F q ) a point of order . If ϕ : E A E A is an isogeny with ker ϕ = P , then it is possible to find polynomials g ( x ) , h ( x ) F q [ x ] such that
ϕ ( x , y ) = g ( x ) h ( x ) , y g ( x ) h ( x ) .
The polynomials g ( x ) and h ( x ) are related by a formula stated by Elkies [30], so the main task for computing an -isogeny is that of obtaining h ( x ) from P. In the particular case of Montgomery curves, from the formulas of [31] the coefficient of the Montgomery curve E A can be obtained by,
A = 2 ( 1 + d ) 1 d , d = A 2 A + 2 h S ( 1 ) h S ( 1 ) 8 ,
and the x-coordinate ϕ x ( x ( Q ) ) of the point ϕ ( Q ) , by
ϕ x ( x ( Q ) ) = x ( Q ) h S ( 1 / x ( Q ) ) h S ( x ( Q ) ) 2 ,
where h S ( X ) = s S ( X x ( [ s ] P ) ) and S = { 1 , 3 , , 2 } .
In 2020, Bernstein, De Feo, Leroux and Smith [22] proposed an important improvement in the computation of isogenies. Their key idea exploits the fact that for isogeny evaluations and constructions, it is not required to compute the polynomial h itself, but only its evaluation at a given set of points. We briefly describe their method for evaluating (1) and (2), starting by pointing out that, although the map Q x ( Q ) is not holomorphic, an algebraic relation exists between the values. More precisely,
Lemma 1
([22] Lemma 4.3). Let E / F q be an elliptic curve, where q is a prime power. There exist biquadratic polynomials F 0 , F 1 , and F 2 in F q [ X 1 , X 2 ] such that
X x ( P + Q ) X x ( P Q ) = X 2 + F 1 ( x ( P ) , x ( Q ) ) F 0 ( x ( P ) , x ( Q ) ) + F 2 ( x ( P ) , x ( Q ) ) F 0 ( x ( P ) , x ( Q ) ) ,
for all P and Q in E such that 0 { P , Q , P + Q , P Q } .
This allows us to rearrange h S ( X ) in a baby-step giant-step style:
h ( X ) = i I j J ( X x ( [ i + j ] P ) ) ( X x ( [ i j ] P ) ) k K ( X x ( [ k ] P ) ) ,
where the index system { I , J } is chosen such that S = ( I + J ) ( I J ) K , with K = S ( I + J ) ( I J ) and satisfying the following definition:
Definition 1.
Let S, I and J be finite sets of integers. The ( I , J ) is an index system of S if:
1. 
the maps I × J Z defined by ( i , j ) i + j and ( i , j ) i j are both injective and have disjoint images.
2. 
I + J and I J are both contained in S.
From here, they show that the factor of h ( X ) that is related to I and J can be computed efficiently as a resultant. More precisely, letting I ± J be the union of the sets I + J and I J , the following result is proven.
Lemma 2
([22] Lemma 4.9). Let E / F q be an elliptic curve, where q is a prime power. Let P be an order-n element of E ( F q ) . Let ( I , J ) be an index system such that I , J , I + J , and I J do not contain any elements of n Z . Then
h I ± J ( X ) = 1 Δ I , J R e s Z ( h I ( Z ) , E J ( X , Z ) )
where
E J ( X , Z ) : = i J F 0 ( Z , x ( [ j ] P ) ) X 2 + F 1 ( Z , x ( [ j ] P ) ) X + F 2 ( Z , x ( [ j ] P ) )
and Δ I , J : = R e s Z ( h I ( Z ) ; D J ( Z ) ) where D J ( Z ) : = i J F 0 ( Z , x ( [ j ] P ) ) .
Remark 2.
Note that Δ I , J in Lemma 2 does not depend on the variable X. The formulas (2) and (1) for computing the codomain curve and point evaluation respectively, include quotients of the polynomial h, so in practice, Δ I , J in Lemma 2 does not need to be computed.
Notice that h I ± J is precisely the first parenthesis that appears in (3), which can be computed quadratically faster as a resultant, while the second parenthesis (corresponding to h K ( X ) ) is a leftover product that is still computed in linear time. Based on this, Adj, Chi-Domínguez, and Rodríguez-Henríquez [23] introduced three explicit algorithms for evaluating -isogenies: KPS (which computes the necessary multiples of the kernel generator), xISOG (which obtains the coefficient of the image curve) and xEVAL (which obtains the image of a point). These algorithms are reproduced here as Algorithm 1, Algorithm 2 and Algorithm 3.
Algorithm 1  KPS
Inputs: 
An elliptic curve E A / F q ; P E A ( F q ) of order an odd prime  
Output: 
I , J , K such that ( I , J ) is an index system for S 
1:
b ( 1 ) / 2  
2:
b ( 1 ) / 4 b  
3:
J { 2 j + 1 : 0 j < b }  
4:
I { 2 b ( 2 i + 1 ) : 0 i < b }  
5:
K S ( I ± J )  
6:
J { x ( [ j ] P ) : j J }  
7:
I { x ( [ i ] P ) : i I }  
8:
K { x ( [ k ] P ) : k K } v
9:
return  I , J and K
Algorithm 2  xISOG
Inputs: 
An elliptic curve E A / F q : y 2 = x 3 + A x 2 + x ; P E A ( F q ) of order an odd prime ; and I , J , K the output from KPS(P
Output: 
A F q such that E A / F q : y 2 = x 3 + A x 2 + x is the image curve of a separable isogeny with kernel P  
1:
h I x i I ( Z x i ) F q [ Z ]  
2:
E 0 , J x j J ( F 0 ( Z , x j ) + F 1 ( Z , x j ) + F 2 ( Z , x j ) ) F q [ Z ]  
3:
E 1 , J x j J ( F 0 ( Z , x j ) F 1 ( Z , x j ) + F 2 ( Z , x j ) ) F q [ Z ]  
4:
R 0 R e s Z ( h I , E 0 , J ) F q  
5:
R 1 R e s Z ( h I , E 1 , J ) F q  
6:
M 0 x k K ( 1 x k ) F q  
7:
M 1 x k K ( 1 x k ) F q  
8:
d A 2 A + 2 M 0 R 0 M 1 R 1 8
9:
return  2 ( 1 + d ) 1 d
Algorithm 3  xEVAL
Inputs: 
An elliptic curve E A / F q : y 2 = x 3 + A x 2 + x ; P E A ( F q ) of order an odd prime ; α = x ( Q ) = [ Q x : Q z ] 0 of a point Q E A ( F q ) P and I , J , K the output from KPS(P
Output: 
The image of x ( ϕ ( Q ) ) by w, where ϕ is a separable isogeny of kernel P
1:
h I x i I ( Z x i ) F q [ Z ]  
2:
E 0 , J x j J ( F 0 ( Z , x j ) / α 2 + F 1 ( Z , x j ) / α + F 2 ( Z , x j ) ) F q [ Z ]  
3:
E 1 , J x j J ( F 0 ( Z , x j ) α 2 + F 1 ( Z , x j ) α + F 2 ( Z , x j ) ) F q [ Z ]  
4:
R 0 R e s Z ( h I , E 0 , J ) F q  
5:
R 1 R e s Z ( h I , E 1 , J ) F q  
6:
M 0 x k K ( 1 / α x k ) F q  
7:
M 1 x k K ( α x k ) F q
8:
return  ( M 0 R 0 Q x ) 2 / ( M 1 R 1 Q z ) 2

2.2. Computing the Resultants

The main computational task that is required for the square-root Vélu algorithm is the evaluation of the resultants R e s Z ( h I , E i , J ) , which can be achieved in constant-time via a remainder tree approach as described in [23]. This task accounts for approximately 85% of the total computation time, and depends on the size of the sets I, J, and K. For the following discussion, we set # J = b , # I = b and # K = 1 2 2 b b . As a remainder, if f ( Z ) F q [ Z ] splits into linear factors as f ( Z ) = a 0 i < b ( Z x i ) and g ( Z ) F q [ Z ] , then the resultant is given by
R e s Z ( f ( Z ) , g ( Z ) ) = a b 0 i < b g ( x i ) .
For our purposes, the polynomials f and g are provided as a product of b linear polynomials and b quadratic polynomials, respectively. Since the x i values are readily available, one could compute (4) directly by evaluating g multiple times, but this would lead to a complexity of O ( b b ) = O ( ) .
Instead, we begin by obtaining the polynomials f and g as a list of coefficients by multiplying together all the terms following a product tree approach, as shown in Figure 1.
After the product tree for f has been built, we compute a corresponding reciprocal tree. Unlike the product tree, the reciprocal tree is computed from the root down. If a node in the product tree contains a polynomial F of degree m, then the corresponding node in the reciprocal tree contains ( F , c ) , where F is a polynomial of degree m and c is a constant such that r e v m ( F ) · F = c mod x m . Here, r e v m ( · ) denotes the polynomial with the list of coefficients in reverse order: if F = i = 0 m F i x i , then r e v m ( F ) = i = 0 m F n i x i . The reciprocal tree is depicted in Figure 2, and is used as an auxiliary tool to construct one last tree called the residue tree.
The residue tree is also built from the root down, and as depicted in Figure 3, each of its nodes contains c g mod F , where F is the polynomial in the corresponding node of the product tree of f and c the constant from the reciprocal tree. The leaves of the remainder tree contain a multiple of g ( Z ) mod ( Z x i ) = g ( x i ) , so a multiple of the resultant is readily obtained by multiplying all the leaves together. The multiple cancels out when taking the ratios at the last step of Algorithm 3 and Algorithm 2. Remarkably, this method allows us to compute (a multiple of) all of the g ( x i ) with a complexity of just O ˜ ( b + b ) = O ˜ ( ) . A detailed description for the cost of the product tree, reciprocal tree, and remainder tree is presented in the Appendix A.

2.3. Cost model for the Sequential Square-Root Vélu

An expected cost function for KPS, xISOG, and xEVAL has been presented in [23], but it takes into account some redundant computations and also fails to consider the cost of the remainder tree. We now present a more accurate and optimal cost function.
As in [23], we begin by noting that a better performance is achieved when # K is as small as possible and # I , # J are similar in size. Since we need to have 2 ( # I ) ( # J ) + # K = ( 1 ) / 2 , we set the size of # J to be b : = ( 1 ) / 2 and then perform division with remainder to obtain # I = ( 1 ) / ( 4 b ) and # K [ 0 , 2 b ] . For simplicity, from now on we ignore rounding errors and assume the average value for # K , so that # I # J # K b ( 1 ) / 2 .
Next, we provide explicit formulas for the number of field multiplications in each of the three procedures as a function of b. This cost model neglects the cost of field additions and subtractions, as well as the savings that a field squaring may have over a regular field multiplication. We consider this a fair compromise, since the total number of squarings is small and the benefit of having a single metric far outweighs the importance of the small error. The cost functions are expressed in terms of the cost of the various tree procedures, which are detailed in the Appendix A.
  • Cost function for KPS. Notice that the computation of h I is performed in both Algorithm 3 and Algorithm 2, so instead we delegate this product tree and the corresponding reciprocal tree to KPS and perform it only once. The total cost of KPS is then reflected by obtaining b multiples of an elliptic curve point (at a cost of 6 b field multiplications each) for each of I , J and K , and the cost of a product tree and reciprocal tree for b linear polynomials. We obtain
    C o s t KPS ( b ) = 18 b + P r o d u c t T r e e 1 ( b ) + R e c i p r o c a l T r e e ( b ) = 4 b log 2 ( 3 ) + 2 log 2 ( b ) + 16 b 2 .
  • Cost function for xISOG. The degree-2 factors in E 0 , J and E 1 , J can be obtained at the cost of 5 b field multiplications, and the complete polynomials are then obtained from two product trees of quadratic leaves. We then perform 2 residue trees to compute the two resultants of Algorithm 2, and another 2 ( b 1 ) multiplications to compute M 0 and M 1 . Algorithm 2 also requires an -th power exponentiation at an average cost of 1.5 log 3 ( log 2 ( b ) + 1 ) multiplications, which must be performed separately on the numerator and denominator when working in projective coordinates. Finally, there are another 10 multiplications for the final two steps. We obtain
    C o s t xISOG ( b ) = 5 b + 2 × P r o d u c t T r e e 2 ( b ) + 2 × R e s i d u e T r e e ( b ) + 2 ( b 1 ) + 2 × 3 ( log 2 ( b ) + 1 ) + 10 ,
    C o s t xISOG ( b ) = 18 b log 2 ( 3 ) + 6 log 2 ( b ) 5 b + 12 .
  • Cost function for xEVAL. In this case, computing the factors for E 0 , J requires 10 b field multiplications. However, only one product tree is required since E 1 , J can be obtained by inverting the coefficient list of E 0 , J , as noted in [23]. We then need two residue trees to compute the resultants in Algorithm 3, 2 ( 2 b 1 ) multiplications to compute M 0 and M 1 , and 6 more multiplications for the final step. The total is
    C o s t xEVAL ( b ) = 10 b + P r o d u c t T r e e 2 ( b ) + 2 × R e s i d u e T r e e ( b ) + 2 ( 2 b 1 ) + 6
    C o s t xEVAL ( b ) = 15 b log 2 ( 3 ) + 5 b + 2 .
  • Total cost function. The total cost function, assuming that two points need to be pushed through the isogeny as is the case in CSIDH and SQISign, is
    C o s t V é lu ( b ) = C o s t KPS ( b ) + C o s t xISOG ( b ) + 2 × C o s t xEVAL ( b ) = 52 b log 2 ( 3 ) + 8 log 2 ( b ) + 21 b + 14 .  

3. Parallelizing Square-Root Vélu Formulas

In this section, we present our main results which focuses on the problem of parallelizing Vélu’s formulas for isogeny computations.
An immediate observation is that the two resultants that appear in each of Algorithm 2 and Algorithm 3, which represent the heaviest part of the computation, are completely independent of each other. Therefore, there is a simple way to parallelize the two algorithms in a 2-core implementation by assigning one resultant to each core. We show that these algorithms exhibit a good degree of parallelizability even beyond two cores, by assigning more than one core per resultant. However, doing so requires a new indexing system that allows for a clear separation of the computation between cores, which we introduce in the first subsection.
Additionally, other parts of the algorithms are also apt for parallelization, such as the computation of M 0 and M 1 where a partial product can be assigned to each core, or the computation of product trees where a subtree can be assigned to each core after a small bit of sequential work. We show also that the computation of point multiples in KPS can be parallelized with our new index system.
After detailing the new index system, provide explicit new algorithms for the square-root Vélu formulas which can be computed by assuming the availability of n-cores, where n > 1 is a power of two.

3.1. Construction of a New Index System

The main observation that allows us to parallelize the computation of resultants beyond two cores is that they exhibit a multiplicative property. More precisely, if I can be written as the disjoint union
I = I 0 I 1 I n / 2 1 ,
then
R e s Z ( h I , E i , J ) = R e s Z ( h I 0 , E i , J ) × R e s Z ( h I 1 , E i , J ) × × R e s Z ( h I n / 2 1 , E i , J ) ,
so we can assign one subset I t to each of the cores, have them compute one subresultant each, and then multiply them all together. Doing so will require us to modify the sizes of I and J: since each resultant computation should have balanced sizes, we now require # J # I / ( n / 2 ) . Accordingly, we now need # J ( 1 ) / 2 n and # I ( 1 ) n / 4 . We will design such an indexing system according to the following lemma.
Lemma 3.
Let n > 1 be a power of two, ℓ be an odd positive integer and consider the set S = { 1 , 3 , . . . , } . If b = ( 1 ) / 2 n 1 and b = ( 1 ) / 2 n b , then ( I , J ) is an index system for S, where
J : = { 2 j + 1 : 0 j < b } ,
I = t = 0 n / 2 1 I t , and
I t : = { 2 b ( 2 t + 1 ) + 2 b n i : 0 i < b } for 0 t < n / 2 .
Moreover, I t I t ˜ = if and only if t t ˜ .
Proof of Lemma .
Suppose that
( 2 b ( 2 t + 1 ) + 2 b n i 1 ) ) ( 2 b ( 2 t ˜ + 1 ) + 2 b n i 2 ) ) = ( 2 j 1 + 1 ) ( 2 j 2 + 1 ) ,
where 0 j i , j 2 < b , 0 i 1 , i 2 < b and j 1 j 2 and i 1 i 2 (t and t ˜ can be equal). Then,
4 b ( t t ˜ ) + 2 b n ( i 1 i 2 ) = 2 ( j 1 j 2 ) ,
so
2 b ( t t ˜ + n ( i 1 i 2 ) / 2 ) = j 1 j 2
but this is impossible since 0 < | t t ˜ | < n and thus 0 < | j 1 j 2 | < b < 2 b | ( t t ˜ + n ( i 1 i 2 ) / 2 ) | . Therefore, the map I × J Z defined by ( i , j ) i + j is injective. Similarly, the map ( i , j ) i j is injective and therefore, both have disjoint images.
On the other hand, the sets I + J and I J are both contained in S. It is clear that the elements in these sets are odd integers. Since for all 0 t < n / 2 , 0 j < b and 0 i < b we have the following:
( 2 b ( 2 t + 1 ) + 2 b n i ) + ( 2 j + 1 ) 2 b ( 2 ( n / 2 1 ) + 1 + n i ) + ( 2 b + 1 ) = 2 n b i + 2 n b + 1
( 2 b ( 2 t + 1 ) + 2 b n i ) + ( 2 j + 1 ) 2 n b ( 1 ) 2 n b 1 + 2 n b + 1 = .
Similarly, 1 ( 2 b ( 2 t + 1 ) + 2 b n i ) ( 2 j + 1 ) for all 0 t < n / 2 , 0 j < b and 0 i < b since the minimum of I is ( 2 b ( 2 t + 1 ) + 2 b n i ) = 2 b when t = i = 0 and the maximum of J is 2 j + 1 = 2 b 1 when j = b 1 .    □
Note that we have chosen to partition I into only n / 2 subsets because, when n cores are available, we assign n / 2 cores to each of the two resultants. However, when computing point multiples during KPS we do have all n cores available for working on the n / 2 subsets. Therefore, it will be convenient to further divide each I t into two subsets to obtain a total of n half-sized subsets. Note that some subsets will have fewer elements than others because # I need not be a multiple of n. In practice, this is problematic because it is crucial for each core to perform exactly analogous tasks to guarantee performance. In order to achieve a balanced computation, we will add additional elements to the subsets that are lacking, ensuring that each subset contains the same number of elements, while ignoring redundant multiples in future steps. This idea is inspired by [22] and the parallelization technique proposed by Costello in [32] to compute the set [ i ] P 1 i d . We provide an example to illustrate this approach.
Example 1.
Let = 89 and n = 4 . From the Lemma 3 we get
J : = { 2 j + 1 : 0 j < 3 } , I = t = 0 1 I t ,
I t : = { 2 b ( ( 2 t + 1 ) + n i ) : 0 i < 3 } , f o r   a l l   0 t < 2 .
So I 0 : = { 2 b , 10 b , 18 b } and I 1 : = { 6 b , 14 b , 22 b } . Instead of this, we can consider the following sets:
I I t : = { 2 b ( ( 2 t + 1 ) + 2 n i ) : 0 i < 2 } f o r a l l 0 t < 4 .
So I I 0 : = { 2 b , 18 b } , I I 1 : = { 6 b , 22 b } , I I 2 : = { 10 b , 26 b } and I I 3 : = { 14 b , 30 b } .
We will now demonstrate that the partitioning of the set I, without taking into account the additional elements added to ensure equal-sized subsets, still forms an index system.
Proposition 1.
Let us assume the same conditions of the Lemma 3. Let c = b / 2 > 2 . If
I : = t = 0 n 1 I I t t = n / 2 n 1 { w t } w i t h w t : = 2 b ( 2 t + 1 + 2 n ( c 1 ) ) ( b mod 2 ) , a n d I I t : = { 2 b ( 2 t + 1 ) + b n ( 2 i ) : 0 i < c } f o r a l l 0 t < n ,
I I t : = { 2 b ( 2 t + 1 ) + b n ( 2 i ) : 0 i < c } , f o r a l l 0 t < n ,
I : = t = 0 n 1 I I t t = n / 2 n 1 { w t } , w i t h w t : = 2 b ( 2 t + 1 + 2 n ( c 1 ) ) ( b mod 2 ) ,
then I = I . In particular, ( I , J ) is an index system for S.
Proof of Proposition 1.
If b is an even integer, it is clear that I I t I , for all 0 t < n and thus t = n / 2 n 1 { w t } = { 0 } . Note that 0 2 i 2 c = b and considering that t = 0 n 1 I I t and I have the same cardinality, we conclude that they are equal. If b is an odd integer, all 0 t < n / 2 we have that I I t I . When n / 2 t < n , dropping the last element w t : = 2 b ( 2 t + 1 + 2 n ( c 1 ) ) in each I I n / 2 , . . . , I I n 1 , we have I I t I . Note that the cardinality of t = n / 2 n 1 { w t } is n / 2 and the cardinality of t = 0 n 1 I I t is n c = n ( ( b + 1 ) / 2 ) = n b + n / 2 = # I + n / 2 .    □
Similarly, we consider a partition of K = S ( I ± J ) = { 2 b b n + 1 , . . . , 2 , } into n sets K 0 , . . , K n 1 with d = # K / n elements,
K t = { 2 b b n + 2 ( t + 1 ) + 2 n k : 0 k < d } , 0 t < n
However, when # K is not divisible by n, we will add one additional element to each set K d mod ( n ) , . . . , K n 1 , ensuring that each subset contains the same number of elements.
The additional elements that were added to balance the size of the sets are not taken into account when calculating the corresponding products for each set.

3.2. Parallelized Algorithms

We now describe the construction of parallelized versions of algorithms KPS, xISOG, and xEVAL assuming that n cores are available, which we refer to as KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel.
The KPS-Parallel algorithm, presented as Algorithm 4, is based on the new index system from Proposition 1: instead of computing the sets I and K as in the original algorithm, KPS-Parallel computes the sets I I t and K t , as well as the polynomial product tree for h I I t , by assigning one core to each value of 0 t < n . Note that the computation of the resultants requires a reciprocal tree for each of the h I t , so the polynomials { h I I t , 0 t < n } are multiplied pair-wise to obtain the polynomials { h I t , 0 t < n / 2 } . We then compute the residue trees, which are built from the root down: the root is computed for each tree using n / 2 cores, and then the two subtrees of each tree are computed in parallel, using two cores per tree, for a total of n cores. Note that although the computations for the root of the product and reciprocal trees, corresponding to lines 11 and 12 of 4, should only be ran with n / 2 cores, we instead have all cores running with half of them repeating the work over dummy arrays I t for t > n / 2 . These computations are of course redundant, but ensure that the workload is balanced.
Algorithm 4 KPS-Parallel
Preprints 96582 i001
For the algorithms xISOG-Parallel and xEVAL-Parallel, the set J is partitioned into n subsets J t , and each core computes the polynomial product tree corresponding to one subset. The roots of the n trees are then multiplied together sequentially to obtain the full polynomial E i , J . Next, each core computes one of the subresultants R e s Z ( h I t , E i , J ) using the reciprocal trees from KPS-Parallel, and the subresultants are multiplied sequentially at the end to obtain the two principal resultants. As for the computations related to K , it is also split into subsets K t so that each core can compute a subproduct of M i , and the subproducts are multiplied together at the end.
Algorithm 5 xISOG-Parallel
Preprints 96582 i002
In the next subsection, we present a detailed analysis that estimates the cost of computing the image curve of an isogeny and the evaluation of a point using the procedures KPS-Parallel+ xISOG-Parallel+ xEVAL-Parallel.
Algorithm 6 xEVAL-Parallel
Preprints 96582 i003

3.3. Cost Analysis of the Parallel Square-Root Vélu

In this section, we will provide an estimation of the cost that is expected when performing the KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel procedures, which is analogous to the one presented in subSection 2.3 for the sequential algorithms. Recall that for our new index system from subSection 3.1 we have # J = ( 1 ) / ( 2 n ) , where is the degree of the isogeny and n represents the number of cores available, while each of the I t subsets has a size of b : = ( 1 ) / ( 2 n ( # J ) ) for a total size of # I = ( n / 2 ) ( 1 ) / ( 2 n ( # J ) ) . It follows that
# K : = 1 2 2 ( # I ) ( # J ) ,
# K 1 2 2 1 2 n 1 n 2 1 2 n 1 ,
# K = 2 n ( 1 ) n .
As before, we will ignore rounding errors and assume that # K takes the middle value of this range (neglecting the second term), so that # I n b / 2 , # J b , and # K n b for b = ( 1 ) / ( 2 n ) .
Proposition 2.
The cost of computing the parallel AlgorithmKPS-Parallelwith n cores is
C o s t KPS Parallel ( b , n ) = 8 3 b log 2 ( 3 ) + 4 log 2 ( b ) + 15 b + 12 n 18 ,
field multiplications.
Proof of Proposition 2.
In order to find all the point multiples in K, we first compute the first n multiples sequentially and then core t start from the t-th multiple and take steps of size n. This leads to wall-clock time of n + # K / n 1 point operations, which involve 6 field multiplications each. A similar approach is taken for computing all the multiples in the I I t sets, while the ones in J are computed sequentially as this is the smallest of the sets. A product tree is constructed for each of the h I I t in parallel, and line 11 involves multiplying two polynomials of degree b/2. As for the reciprocal trees, we incur the cost of the root node and then just half of the remainder cost since there is one core working on each subtree, so its cost is (Reciprocal(b) + ReciprocalTree(b))/2, where Reciprocal(b) = blog2(3) + b + 2log2(b) − 2 is the cost of the root node alone. The total cost is then
C o s t KPS Parallel ( b , n ) = 6 n + b 1 + 6 n + b / 2 1 + 6 b + P r o d u c t T r e e 1 ( b / 2 ) + P o l y M u l ( b / 2 ) + ( R e c i p r o c a l ( b ) + R e c i p r o c a l T r e e ( b ) ) / 2 ,
and the proposition follows after considering the cost functions in the Appendix A.
   □
We now focus on Algorithm 5:xISOG-Parallel which computes the coefficient of the image curve, and on Algorithm 6:xEVAL-Parallel which computes the isogeny evaluation at a specific point.
Proposition 3.
The cost of computing the parallel AlgorithmxISOG-Parallelwith n cores is
C o s t   xISOG Parallel ( b , n ) = 6 b log 2 ( 3 ) + 3 ( n 2 n + 2 ) b n log 2 ( 3 ) + 3 log 2 ( b n ) b b n + 5 2 n + 11 2 ,
field multiplications.
Proof of Proposition 3.
We begin by considering the cost of computing the polynomials E 0 , J and E 1 , J in Algorithm 5, where J comes from KPS-Parallel. We partition J into the subsets J 0 , , J n 1 of size ( b / n ) each, and compute one sub-product tree E 0 , J t per subset J t concurrently. The cost of computing the factors of each subset is given by 5 ( b / n ) field multiplications, then a product tree is computed for b / n quadratic polynomials on each core, and the tree roots are multiplied together sequentially to obtain the complete polynomial E 0 , J . This last step involves multiplying together n polynomials of degree 2 b / n each, which we denote as L i n e a r P r o d u c t 2 b / n ( n ) . An identical procedure is used to compute E 1 , J as well.
Using the residue trees r t r e e h I 0 , . . . , r t r e e h I n / 2 1 from KPS-Parallel, each core then compute a subresultant using a residue tree of size b / n , and the subresultants are multiplied sequentially at a cost of n / 2 1 multiplications per resultant.
For each of the M i , each core computes a subproduct of size b at a cost of b 1 multiplications, and then combining the subproducts takes another n 1 multiplications.
Finally, we use two cores to compute the -th power exponentiation in Algorithm  for the numerator and denominator concurrently at a cost of about 1 . 5 log 3 / 2 + 3 log 2 ( b ) + 3 log ( n ) / 2 , and the last few products take 10 more multiplications.
The total cost is then
C o s t xISOG Parallel ( b , n ) = 5 b / n + 2 × P r o d u c T r e e 2 ( b / n ) + 2 × L i n e a r P r o d u c t n ( 2 b / n ) + R e s i d u e T r e e ( b ) + n / 2 1 + 2 × ( b + n 2 ) + 3 / 2 + 3 log 2 ( b ) + 3 log ( n ) / 2 + 10 ,
and the proposition follows after considering the cost functions in the Appendix A.    □
Proposition 4.
The cost of computing the parallel AlgorithmxEVAL-Parallelwith n cores is
C o s t   xEVAL Parallel ( b , n ) = 6 b log 2 ( 3 ) + 3 2 ( n 2 n + 2 ) b n log 2 ( 3 ) + b + 7 b n + 5 2 n .
field multiplications.
Proof. 
The proof of the current proposition follows a similar structure to the previous proposition, with the main distinction lying in the polynomials E 0 , J E 1 , J M 0 , t and M 1 , t . As before, the factors of E 0 , J can be computed with a cost of 10 ( b / n ) , and only one product tree is needed since E 1 , J is obtained at no additional cost. For each of M 0 and M 1 , each core is still working with a subset of size b, but now there is a cost of b scalar multiplications for computing the factors in a subset, b 1 for the subproduct of each subset, and n 1 for combining the subproducts. There is no exponentiation in this case, and the final steps require only 6 additional multiplications. The total cost is then
C o s t xEVAL Parallel ( b , n ) = 10 b / n + P r o d u c T r e e 2 ( b / n ) + L i n e a r P r o d u c t n ( 2 b / n ) + R e s i d u e T r e e ( b ) + n / 2 1 + 2 ( 2 b + n 2 ) + 6 ,
and the proposition follows after considering the cost functions in the Appendix.    □
The next theorem summarizes our cost analyzis, and follows immediately from the previous propositions.
Theorem 1.
The expected total cost of the algorithmsKPS-Parallel,xISOG-Paralleland twicexEVAL-Parallelexpressed in field multiplications, assuming that two points need to be pushed through the isogeny, is given by
C o s t V é lu Parallel ( b , n ) = C o s t KPS Parallel ( b , n ) + C o s t xISOG Parallel ( b , n ) + 2 × C o s t xEVAL Parallel ( b , n ) = 62 3 b log 2 ( 3 ) + 6 ( n 2 n + 2 ) b n log 2 ( 3 ) + 7 log 2 ( b ) + 13 b n + 16 b + 3 2 log 2 ( n ) + 39 2 n 25 2 .
Remark 3.
While it may be tempting to compare (6) directly to (5), a direct comparison would be incorrect since the definitions of b (the size of the set J) are different in each case. The fair comparison would be in terms of ℓ, where (5) has b = ( 1 ) / 2 , whereas (6) has b = ( 1 ) / ( 2 n ) .

4. Experimental Results

We implemented our parallel version of the square-root Vélu algorithm in the C programming language, leveraging OpenMP for parallelization. This implementation has been integrated into a fork of the SQALE’d CSIDH library [12], where we can measure the savings due to parallelism for isogenies of each prime degree | p + 1 . In order to visualize the progression of the performance with the isogeny degree, we have incorporated a new 1792-bit prime
p t e s t : = 0 x 4 FA 4 E 8 C 57 C 4 EFF 02 D 0650 FE 3 AFC 0413536 E 72253101645 B 1387 DA 2 DD 519 C 17 FBEC 69843 C 04 DEAEA 2 DB 59 CDDDA 7876 B 514 C 101 A 1 DF 0 D 96778 BD 72 A 3 C 51844 BB 0196 F 73 F 1 DDBFEC 980 A 4 BB 3 B 200 A 4 E 618 C 54621 ADB 35 B 5 E 4 B 0545 F 5 BE 025 D 63 BC 914 AB 11 A 882 AD 78 B 6203 C 57 A 31031 B 98 B 6 C 104 DC 99 AC 9 A 4532 DEC 0 C 293 0 F 8 AE 51 B 008 E 4 BA 6 D 26 E 56 C 736 D 3 C 067 C 8 F 2 DFDF 7 F 8206 B 444 A 42 D 39 E 0 F 4 D 82 FF 3 FD 0 EB 1 DF 44 B 31 DDCDE 876 E 658489 E 1 CA 359 DAF 5868 A 6 C 22 E 8455 B 4 A 4 F 7679 F 6 2 B 0 C 30 D 8883 D 2 B 79931 C 19 E 4737 C 3 CC 33056461 E 9 C 96 A 175 D 94 B 594 B 2 A 9 EAB 4 B 6 B 6303 ,
where p t e s t + 1 = 4 i = 0 107 i for odd primes i that are roughly evenly spaced between 0 = 19 and 107 = 321193 (the degree of the isogeny in the attack of [5]). This prime was chosen so that isogenies of each degree i can be evaluated using only F p t e s t arithmetic (for which we also provide an assembly code implementation). As previously mentioned, this model most closely resembles the scenario in CSIDH as opposed to applications where isogenies are performed over quadratic fields or larger extensions. However, we make this choice for concreteness while still expecting our results to extrapolate well to other scenarios since the main savings come from reducing the number of field multiplications by a factor that is agnostic to the field itself. Our library can be found at https://github.com/TheParallelSquarerootVeluTeam/Parallel-Squareroot-Velu. The benchmark results presented in this section are from experiments conducted using an Intel(R) Xeon(R) Gold 6230R processor equipped with 26 physical cores, operating at 2 . 10 GHz . Turbo boost and Hyper-threading technologies were disabled during these experiments.
Our experimental results are illustrated in Figure 4. We measured the total computational time for executing KPS-Parallel, xISOG-Parallel, and two iterations of xEVAL-Parallel for each of the prime degrees i using 2, 4 and 8 cores, and compared to the computational time of the original sequential implementation of [12]. In Figure 4 we compare the observed speedup factors against the theoretical speedup using (6) and (5). We observe there is a more important difference for the eight-core implementation. Moreover, we notice that in situations where the degree of the isogeny is not sufficiently large, the overhead associated with synchronization and the computational steps involved tends to offset any potential performance gains achieved by employing eight cores. However, the speedup factor can be seen to achieve levels close to the asymptotic value starting from 10000 .
To showcase our results in a more practical setting, we also implemented and benchmarked our parallelized square-root Vélu algorithms in the context of SQALE’d CSIDH-9216. In its dummy-free variant, SQALE’d CSIDH-9216 performs the operations corresponding to KPS + xISOG + 2xEVAL exactly once for each of the 333 smallest odd primes (excluding 263), over a 9216-bit prime field. The experimental timings for each of the three routines, summed over all the degrees, are shown in Table 1. Our two-core implementation achieves speedup factors of 1.23, 1.70, and 1.92 for KPS-Parallel, xISOG-Parallel, and xEVAL-Parallel, respectively, while for the four-core implementation we obtained acceleration factors of 1.42, 2.27, and 3.02. While these accelerations span the entirety of the Vélu-related computations in the protocol, it is important to stress that they cannot be translated easily into an acceleration for the protocol as a whole, since the protocol includes other non-negligible computations (most notably elliptic curve point scalar multiplications).

5. Discussion

Since KPS-Parallel performs some of its computational tasks sequentially, while others only benefit from half of the total number of cores, it is not surprising that its speedup factor is the lowest out of the three square-root Vélu routines. Nonetheless, as illustrated by Table 1, this is not too important as its total cost is comparatively small. Fortunately, the largest accelerations come from the xEVAL-Parallel, which apart from being the most expensive of the three is often required to be evaluated multiple times for different points.
Both our experimental and theoretical results show that parallelizing with at least two cores yields strong improvements, which are important considering that they can be combined with other improvements such as vectorized field arithmetic. On the downside, it is also noticeable that the speedup factors in our implementations do not scale well when transitioning to four- and eight-core implementations. From eight cores onward, even the maximum speedup derived from the theoretic estimates suggests that the utilization that can be achieved may not be attractive for practical applications. The intuition behind these diminishing returns on the parallelization effectiveness is easy to explain, as the size of the resultants in algorithm 5 and 6 only decreases as 1 / n .

Acknowledgments

This work started when Chávez-Saab, J. and Ortega, O. were doing an internship at the Technology Innovation Institute (TII) under the guidance of Prof. Dr. Rodríguez-Henríquez F. We thank TII for sponsoring such an internship. We thank ANID for the study scholarship to Ortega, O. grant number 21190301. We also thank Dr. Chi-Domínguez J. and Dr. Zamarripa-Rivera L. for valuable discussion on an early version of this manuscript. Additionally, this work has received partial funding to facilitate the use of a server in CINVESTAV-IPN in Mexico which was used for our tests.

Appendix A. Cost Functions for Polynomial Operations

In this section, we detail the costs in field multiplications associated with various operations in both the sequential and parallel versions of square-root Vélu.
  • Polynomial Multiplication. The basic step is the multiplication of two polynomials of equal degree d, which we denote as P o l y M u l ( d ) . We use a Karatsuba strategy for polynomial multiplication, so the cost in field multiplications can be expressed as in [33]
    P o l y M u l ( d ) = d log 2 ( 3 ) .
  • Multiplication of Multiple Polynomials. Suppose we want to multiply together m polynomials of degree d each. If m is small, we opt for multiplying the polynomials one by one: in step 1 we are multiplying together 2 polynomials of degree d, and in step t for 1 < t < m we are multiplying a polynomial of degree t d by another of degree d. By partitioning the larger polynomial, we can perform this multiplication via t polynomial multiplications of degree d by d, so the total cost is
    L i n e a r P r o d u c t m ( d ) = t = 1 m 1 t · P o l y M u l ( d ) = m ( m 1 ) 2 P o l y M u l ( d ) = m ( m 1 ) d l o g 2 ( 3 ) 2 .
On the other hand, if m is large then we opt for a product tree strategy, where the input polynomials are placed at the leaves of the tree and each node contains the product of its two child nodes. We assume for simplicity that m is a power of 2. At level i (where i=0 corresponds to the root), we need to compute 2 i nodes, each of which is a product of two polynomials of degree m d / 2 i + 1 . The total cost is therefore
P r o d u c t T r e e d ( m ) = i = 0 log 2 ( m ) 1 2 i × P o l y M u l ( m d / 2 i + 1 ) = i = 0 log 2 ( m ) 1 2 i m d 2 i + 1 log 2 ( 3 ) ,
= ( m d ) log 2 ( 3 ) m d log 2 ( 3 ) .
We only use product trees for computing h I and E i , J which have leaves of degree d = 1 and d = 2 , respectively, so we are only interested in the special cases
P r o d u c t T r e e 1 ( m ) = m log 2 ( 3 ) m , P r o d u c t T r e e 2 ( m ) = 3 m log 2 ( 3 ) 3 m .
  • Reciprocal tree. Computing the resultants requires the aid of a reciprocal tree, which is built from the root down using the product tree. If a node in the product tree contains a polynomial F of degree m, then the corresponding node in the reciprocal tree contains ( F , c ) , where F is a polynomial of degree m and c is a constant such that r e v m ( F ) · F = c mod x m . Here, r e v m ( · ) denotes the polynomial with the list of coefficients in reverse order: if F = i = 0 m F i x i , then r e v m ( F ) = i = 0 m F m i x i .
We only use the case for the product tree of a polynomial f that has b leaves of degree 1. For the root of the tree, we need to know the cost of obtaining from scratch r e v b ( f ) 1 mod x b , where f has degree b. As pointed out in [23], if we have already computed a reciprocal ( f 0 , c 0 ) of half the degree (that is, r e v b ( f ) · f 0 = c 0 mod x b / 2 ), then the inverse modulus x b can be obtained as ( f , c ) with
f = c 0 f 0 ( r e v b ( f ) · f 0 c 0 ) f 0 mod x b ,
c = c 0 2 .
These equations can be evaluated at the cost of b / 2 + 1 field multiplications (for multiplying f 0 by the scalar c 0 ), plus the cost of 2 polynomial multiplications modulus x b / 2 (because the term in parenthesis is known to have null coefficients for all powers less than b / 2 ), plus one squaring. This leads to the recursive formula for the cost of the reciprocal
R e c i p r o c a l ( b ) = R e c i p r o c a l ( b / 2 ) + b / 2 + 1 + 2 × P o l y M u l ( b / 2 ) + 1
= R e c i p r o c a l ( b / 2 ) + b / 2 + 2 ( b / 2 ) l o g 2 ( 3 ) + 2 .
The base case (finding a multiple of the reciprocal of a constant) is trivial, so we obtain
R e c i p r o c a l ( b ) = i = 1 log 2 ( b ) b 2 i + 2 b 2 i log 2 ( 3 ) + 2 = b log 2 ( 3 ) + b + 2 log 2 ( b ) 2 .
The child nodes in the reciprocal tree can then be computed from the root. Let f 1 , f 2 be polynomials of degree d corresponding to sibling nodes of the product tree, and assume that we have already computed a reciprocal for their parent node F = f 1 · f 2 . That is, we already have ( F , C ) such that r e v 2 d ( F ) · F = C mod x 2 d . It follows that r e v d ( f 1 ) · ( r e v d ( f 2 ) · F ) = C mod x d , so we can compute a reciprocal for f 1 via the modular multiplication r e v d ( f 2 ) · F mod x d . The total cost of the reciprocal tree is then
R e c i p r o c a l T r e e ( b ) = R e c i p r o c a l ( b ) + i = 1 log 2 ( b ) 2 i × P o l y M u l ( b / 2 i )
= b l o g 2 ( 3 ) + b + 2 log 2 ( b ) 2 + i = 1 log 2 ( b ) 2 i b 2 i log 2 ( 3 )
= 3 b log 2 ( 3 ) b + 2 log 2 ( b ) 2 .
  • Residue tree. Recall that we can compute the resultant R e s Z ( f , g ) for monic f as
    R e s Z ( f , g ) = i g ( x i ) = i ( g mod f i ) ,
    where x i are the roots of f and f i = Z x i are its linear factors. In the context of 2 and 3, (resp. the parallel versions 5 and 6), f is a polynomial of degree b corresponding to h I (resp. h I t ) for which we have already computed the product tree with leaves f i as well as the reciprocal tree, and g is a polynomial of degree 2 b corresponding to E i , J for which we have computed the product tree with leaves of degree 2. We obtain the values g mod f i by building a residue tree from the root down. Each node of the residue tree contains R : = c F mod G , where F and G are the polynomials in the corresponding nodes of the product trees of f and g, respectively, and c is the constant of the reciprocal tree.
As pointed out in [34], the computation of F mod G for d e g ( G ) = 2 d e g ( F ) = 2 d can be achieved with the aid of ( F , c ) from the reciprocal tree via
F mod G = G r e v d ( F · r e v 2 d ( G ) mod x d ) · F mod x d ,
at the cost of an additional two polynomial multiplications modulus x d , so the total cost of the tree (including the cost to multiply together all of the leaves) is
R e s i d u e T r e e ( b ) = b 1 + 2 i = 0 log 2 ( b ) 2 i × P o l y M u l ( b / 2 i )
= b 1 + 2 i = 0 log 2 ( b ) 2 i b 2 i log 2 ( 3 )
= 6 b log 2 ( 3 ) 3 b 1 .
Note that the product of all the leaves gives us a multiple of the resultant (the multiple being the product of all the c constants at the leaves of the reciprocal tree), but these cancel out when taking ratios in the actual algorithms.

References

  1. De Feo, L.; Jao, D. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. Post-Quantum Cryptography; Yang, B.Y., Ed.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2011; pp. 19–34. [Google Scholar]
  2. De Feo, L.; Jao, D.; Plût, J. Towards quantum-resistant cryptosystems from supersingular elliptic curve isogenies. J. Mathematical Cryptology 2014, 8, 209–247. [Google Scholar] [CrossRef]
  3. SIKE - Supersingular Isogeny Key Encapsulation. Available online: https://sike.org/, 2023. Accessed: 6-12-2023.
  4. Castryck, W.; Decru, T. An Efficient key recovery attack on SIDH. Advances in Cryptology – EUROCRYPT 2023; Hazay, C., Stam, M., Eds.; Springer Nature Switzerland: Cham, 2023; pp. 423–447. [Google Scholar]
  5. Maino, L.; Martindale, C.; Panny, L.; Pope, G.; Wesolowski, B. A direct key recovery attack on SIDH. Advances in Cryptology – EUROCRYPT 2023; Hazay, C., Stam, M., Eds.; Springer Nature Switzerland: Cham, 2023; pp. 448–471. [Google Scholar]
  6. Robert, D. Breaking SIDH in polynomial time. Advances in Cryptology – EUROCRYPT 2023; Hazay, C., Stam, M., Eds.; Springer Nature Switzerland: Cham, 2023; pp. 472–503. [Google Scholar]
  7. Castryck, W.; Lange, T.; Martindale, C.; Panny, L.; Renes, J. CSIDH: An efficient post-quantum commutative group action. Advances in Cryptology – ASIACRYPT 2018: 24th International Conference on the Theory and Application of Cryptology and Information Security, Brisbane, QLD, Australia, December 2–6, 2018, Proceedings, Part III; Springer-Verlag: Berlin, Heidelberg, 2018; pp. 395–427. [Google Scholar]
  8. De Feo, L.; Galbraith, S.D. SeaSign: compact isogeny signatures from class group actions. Advances in Cryptology – EUROCRYPT 2019; Ishai, Y., Rijmen, V., Eds.; Springer International Publishing: Cham, 2019; pp. 759–789. [Google Scholar]
  9. Beullens, W.; Kleinjung, T.; Vercauteren, F. CSI-FiSh: Efficient isogeny based signatures through class group computations. Advances in Cryptology – ASIACRYPT 2019; Galbraith, S.D., Moriai, S., Eds.; Springer International Publishing: Cham, 2019; pp. 227–247. [Google Scholar]
  10. De Feo, L.; Kohel, D.; Leroux, A.; Petit, C.; Wesolowski, B. SQISign: Compact post-quantum signatures from quaternions and isogenies. Advances in Cryptology – ASIACRYPT 2020; Moriai, S., Wang, H., Eds.; Springer International Publishing: Cham, 2020; pp. 64–93. [Google Scholar]
  11. National Institute of Standards and Technology NIST. Available online: https://csrc.nist.gov/news/2023/additional-pqc-digital-signature-candidates, 2023. Accessed: 6-12-2023.
  12. Chávez-Saab, J.; Chi-Domínguez, J.; Jaques, S.; Rodríguez-Henríquez, F. The SQALE of CSIDH: sublinear Vélu quantum-resistant isogeny action with low exponents. J. Cryptogr. Eng. 2022, 12, 349–368. [Google Scholar] [CrossRef]
  13. SQIsign: Algorithm specifications and supporting documentation. Available online: https://csrc.nist.gov/csrc/media/Projects/pqc-dig-sig/documents/round-1/spec-files/sqisign-spec-web.pdf, 2023. Accessed: 6-12-2023.
  14. Basso, A.; Maino, L.; Pope, G. FESTA: Fast encryption from supersingular torsion attacks. Cryptology ePrint Archive, Paper 2023/660, 2023.
  15. Nakagawa, K.; Onuki, H. QFESTA: Efficient algorithms and parameters for FESTA using quaternion algebras. Cryptology ePrint Archive, Report 2023/1468, 2023.
  16. Moriya, T. IS-CUBE: An isogeny-based compact KEM using a boxed SIDH diagram. Cryptology ePrint Archive, Report 2023/1506, 2023.
  17. Decru, T.; Maino, L.; Sanso, A. Towards a Quantum-resistant Weak Verifiable Delay Function. Progress in Cryptology - LATINCRYPT 2023 - 8th International Conference on Cryptology and Information Security in Latin America; Aly, A.; Tibouchi, M., Eds. Springer, 2023, Vol. 14168, LNCS, pp. 149–168.
  18. Leroux, A. Verifiable random function from the Deuring correspondence and higher dimensional isogenies. Cryptology ePrint Archive, Report 2023/1251, 2023.
  19. Vélu, J. Isogénies entre courbes elliptiques. Comptes-Rendus de l’Académie des Sciences, Série I 1971, 273, 238–241. [Google Scholar]
  20. Kohel, D.R. Endomorphism rings of elliptic curves over finite fields. PhD thesis, University of California at Berkeley, The address of the publisher, 1996. Available at:http://iml.univ-mrs.fr/~kohel/pub/thesis.pdf.
  21. Washington, L. Elliptic curves: number theory and cryptography, Second Edition, 2 ed.; Chapman & Hall/CRC, 2008.
  22. Bernstein, D.J.; Feo, L.D.; Leroux, A.; Smith, B. Faster computation of isogenies of large prime degree. In: ANTS XIV. The Open Book Series, vol. 4(1), pp. 39–55 (2020), 2020.
  23. Adj, G.; Chi-Domínguez, J.; Rodríguez-Henríquez, F. Karatsuba-based square-root Vélu’s formulas applied to two isogeny-based protocols. Journal of Cryptographic Engineering 2022. [Google Scholar] [CrossRef]
  24. Cheng, H.; Fotiadis, G.; Großschädl, J.; Ryan, P.Y.A.; Rønne, P.B. Batching CSIDH group actions using AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems 2021, 2021, 618–649. [Google Scholar] [CrossRef]
  25. Orisaka, G.; López-Hernández, J.C.; Aranha, D.F. Finite field arithmetic using AVX-512 for isogeny-based cryptography. 18th Brazilian Symposium on Information and Computer Systems Security (SBSeg), 2018, pp. 49–56.
  26. Cheng, H.; Fotiadis, G.; Großschädl, J.; Ryan, P.Y.A. Highly Vectorized SIKE for AVX-512. IACR Transactions on Cryptographic Hardware and Embedded Systems 2022, 2022, 41–68. [Google Scholar] [CrossRef]
  27. Phalakarn, K.; Suppakitpaisarn, V.; Rodríguez-Henríquez, F.; Hasan, M.A. Vectorized and parallel computation of large smooth-Degree isogenies using precedence-constrained scheduling. Cryptology ePrint Archive, Paper 2023/885, 2023.
  28. Phalakarn, K.; Suppakitpaisarn, V.; Hasan, M.A. Speeding-up parallel computation of large smooth-degree isogeny using precedence-constrained scheduling. Information Security and Privacy; Nguyen, K., Yang, G., Guo, F., Susilo, W., Eds.; Springer International Publishing: Cham, 2022; pp. 309–331. [Google Scholar]
  29. Kato, G.; Suzuki, K. Speeding up CSIDH using parallel computation of isogeny. 2020 7th International Conference on Advance Informatics: Concepts, Theory and Applications (ICAICTA), 2020, pp. 1–6.
  30. Elkies, N.D.; others. Elliptic and modular curves over finite fields and related computational issues. AMS IP STUDIES IN ADVANCED MATHEMATICS 1998, 7, 21–76. [Google Scholar]
  31. Costello, C.; Hisil, H. A Simple and Compact Algorithm for SIDH with Arbitrary Degree Isogenies. Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part II; Takagi, T.; Peyrin, T., Eds. Springer, 2017, Vol. 10625, Lecture Notes in Computer Science, pp. 303–329.
  32. Costello, C. B-SIDH: Supersingular Isogeny Diffie-Hellman Using Twisted Torsion. Advances in Cryptology - ASIACRYPT 2020 - 26th International Conference on the Theory and Application of Cryptology and Information Security, Daejeon, South Korea, December 7-11, 2020, Proceedings, Part II; Moriai, S.; Wang, H., Eds. Springer, 2020, Vol. 12492, Lecture Notes in Computer Science, pp. 440–463.
  33. Karatsuba, A.; Ofman, Y. Multiplication of Multidigit Numbers on Automata. Soviet Physics Doklady 1962, 7, 595. [Google Scholar]
  34. Bernstein, D.J., Fast multiplication and its applications. In Algorithmic Number Theory; Buhler, J.; Stevenhagen, P., Eds.; Mathematical Sciences Research Institute Publications, Cambridge University Press: United Kingdom, 2008; pp. 325–384.
Figure 1. Diagram of the product tree for computing f i = ( Z x i ) .
Figure 1. Diagram of the product tree for computing f i = ( Z x i ) .
Preprints 96582 g001
Figure 2. Diagram of the reciprocal tree, where f i = ( Z x i ) , r e v ( . ) and c i are as mentioned above.
Figure 2. Diagram of the reciprocal tree, where f i = ( Z x i ) , r e v ( . ) and c i are as mentioned above.
Preprints 96582 g002
Figure 3. Diagram of the remainder tree for computing R e s Z ( f ( Z ) , g ( Z ) ) .
Figure 3. Diagram of the remainder tree for computing R e s Z ( f ( Z ) , g ( Z ) ) .
Preprints 96582 g003
Figure 4. Speedup factors of timings of the joint cost KPS-Parallel + xISOG-Parallel + 2xEVAL-Parallel over F ptest for each odd prime degree ( p t e s t + 1 ) . Experimental timings correspond to an average of 100 runs, whereas the expected savings are obtained from the estimated number of field multiplications from (6) and (5).
Figure 4. Speedup factors of timings of the joint cost KPS-Parallel + xISOG-Parallel + 2xEVAL-Parallel over F ptest for each odd prime degree ( p t e s t + 1 ) . Experimental timings correspond to an average of 100 runs, whereas the expected savings are obtained from the estimated number of field multiplications from (6) and (5).
Preprints 96582 g004
Table 1. Aggregate computational time (in gigacycles) measured for the single-core square-root Vélu procedures KPS, xISOG, and xEVAL described in §Section 2.1, and of the two-core and four-core implementations of the parallel analogues described in §Section 3.2. The costs are summed over each odd prime degree ( p 9216 + 1 ) , and the Total column corresponds to KPS + xISOG + 2xEVAL. Timings correspond with the average of 100 runs. (Gcc) is the number of gigacycles and (SF) is the experimental speedup factor considering the single core implementation as a baseline.
Table 1. Aggregate computational time (in gigacycles) measured for the single-core square-root Vélu procedures KPS, xISOG, and xEVAL described in §Section 2.1, and of the two-core and four-core implementations of the parallel analogues described in §Section 3.2. The costs are summed over each odd prime degree ( p 9216 + 1 ) , and the Total column corresponds to KPS + xISOG + 2xEVAL. Timings correspond with the average of 100 runs. (Gcc) is the number of gigacycles and (SF) is the experimental speedup factor considering the single core implementation as a baseline.
Algorithm Single core Algorithm Two cores Four cores
Gcc Gcc SF Gcc SF
KPS 23.6 KPS-Parallel 19.2 1.23 16.6 1.42
xISOG 58.5 xISOG-Parallel 34.5 1.70 25.8 2.27
xEVAL 59.5 xEVAL-Parallel 31.0 1.92 19.7 3.02
Total 201.1 115.7 1.74 81.8 2.46
1One gigacycle is one billion clock cycles.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2025 MDPI (Basel, Switzerland) unless otherwise stated