Preprint
Article

This version is not peer-reviewed.

Cycle-Stride Evaluation of Finite Fractional Fourier Orbits over Dyadic-Symmetry-Complete Prime Fields

Submitted:

29 May 2026

Posted:

02 June 2026


Abstract
We present a streaming algorithm for the full finite fractional Fourier orbit over a prime field $\mathbb{F}_p$ with $p \equiv 1 \pmod 4$. The construction rests on a projector-orbit identity: once the four eigenspace projections of the input signal under the normalized finite Fourier operator have been computed, every fractional power is recovered by a degree-three scalar recombination, indexed by the orbit parameter. The streaming phase materializes the full orbit of $p-1$ output vectors in $O((p-1)^2)$ field operations, matching the $\Omega((p-1)^2)$ output-size lower bound up to constant factors. For the subclass of such primes with 2 a primitive root, which we call dyadic-symmetry-complete primes and which necessarily satisfy $p \equiv 5 \pmod 8$, a contribution-array reformulation eliminates general modular multiplication from the streaming phase, leaving only modular additions and shift updates. This subclass is structurally distinct from the Fermat-prime NTT setting; only $p = 5$ is common to both. We give self-contained proofs of the four-cycle and projector identities, prove the necessary congruence and the Fermat-prime separation, and report timing experiments validating correctness and the $O(n^2)$ streaming scaling against a dense matrix-vector baseline.

1. Introduction

The finite fractional Fourier transform (FrFT) over a prime field $\mathbb{F}_p$ generalizes the discrete Fourier transform by interpreting it as one step of a cyclic rotation in representation space: the ordinary Fourier transform occupies a quarter-turn position in a four-cycle whose other cardinal points are identity, parity, and the inverse Fourier transform. Between these cardinal positions sit intermediate fractional transforms indexed by a parameter $s$ in a cyclic group. The continuum FrFT has well-established applications in optics, time-frequency signal analysis, and quantum mechanics [1,2]; finite-field constructions have been developed for discrete signal processing [3,4] and, more recently, for qudit-based quantum circuit implementations [5].
For primes $p \equiv 1 \pmod 4$, the field $\mathbb{F}_p$ contains a primitive fourth root of unity (a square root of $-1$), and the finite Fourier matrix admits an exact normalization with the four-cycle structure of the classical FrFT. This setting is convenient because the four-cycle, the projector decomposition onto eigenspaces, and the polynomial form of the fractional powers all transfer cleanly from the continuum theory. The price is that the cycle length $n = p-1$ is constrained to $4t$ for some integer $t$, and the standard radix-2 Cooley–Tukey FFT does not in general apply.
This paper makes three contributions.
  • Output-optimal full-orbit streaming.
After computing the four spectral projections $u_\ell = \Pi_\ell v$ of an input signal $v \in \mathbb{F}_p^n$ ($n = p-1$), the entire orbit $\{F^{[s]} v : s \in \mathbb{Z}_n\}$ is recovered by a streaming procedure performing $O(n)$ field operations per output. Total streaming cost is $O(n^2)$, which matches the output-size lower bound up to constant factors: the orbit consists of $n$ vectors each of length $n$, so any algorithm that materializes the full orbit must perform $\Omega(n^2)$ output operations. This result holds for every prime $p = 4t+1$ and every choice of primitive generator $e_t \in \mathbb{F}_p^\times$. We refer to such primes as symmetry-complete throughout, the condition simply being that $\mathbb{F}_p$ contains a primitive fourth root of unity and the normalized finite Fourier operator admits the four-cycle $F_t^4 = I$.
  • Dyadic-symmetry-complete primes and shift-based streaming.
We isolate the class of primes for which the streaming phase admits a particularly hardware-friendly implementation. Call $p$ dyadic-symmetry-complete (DSC) if $p \equiv 1 \pmod 4$ and 2 is a primitive root of $\mathbb{F}_p^\times$. For these primes, the streaming phase can be reformulated using elementwise contribution arrays whose updates between meridian steps consist of multiplications by $2, 4, 8$ modulo $p$; these are bit-shifts followed by conditional reductions and require no general modular multiplier. We establish two structural facts: every DSC prime satisfies $p \equiv 5 \pmod 8$ (because 2 being primitive forces 2 to be a quadratic non-residue), and among Fermat primes only $p = 5$ is DSC (because the only other Fermat prime with 2 as a primitive root is $p = 3$, which fails $p \equiv 1 \pmod 4$).
  • Output optimality interpretation.
The full-orbit streaming cost of $O(n^2)$ matches the output-size lower bound. It also matches, up to constants, the cost of the most natural NTT-based alternative: precompute $v$, $F_t v$, $J v$, $F_t^{-1} v$ (four transforms in $O(n \log n)$ each under any applicable NTT), assemble the projector components, and stream the orbit by recombination in $O(n^2)$. Our contribution is therefore not an asymptotic speedup over all alternatives but rather a unified streaming construction valid for every symmetry-complete prime, with a shift-only inner loop in the DSC subclass that we believe is novel.

1.1. Relation to Existing Work

The eigenstructure-based discretization of the FrFT was introduced by Pei, Yeh, and Tseng [4] and by Candan, Kutay, and Ozaktas [3], who used the projector decomposition to define canonical discrete fractional powers of the Fourier matrix. Our use of the same projectors as a precomputed decomposition enabling streaming evaluation across the full meridian cycle does not appear in those references, which focus on defining a single fractional transform rather than enumerating the orbit. The arithmetic FrFT of Floratos and Pavlidis [5] constructs discrete fractional Fourier families adapted to qudit quantum circuits.
The number-theoretic transform literature, beginning with Pollard [6] and now applied widely in lattice-based cryptography [7,8], exploits the fast modular arithmetic of specific prime fields. Fermat primes $p = 2^{2^k} + 1$ admit radix-2 NTTs of length a power of two, and the property “multiplication by 2 is a bit-shift modulo $p$” is used as an implementation optimization for the twiddle-factor updates. Crucially, in $\mathbb{F}_{2^{2^k}+1}$ the element 2 has order $2^{k+1}$ rather than $p-1$, so it is not a primitive root of $\mathbb{F}_p^\times$ for $k \ge 2$. The class of primes we identify, where 2 is primitive, is therefore distinct from the Fermat primes (with the exception of $p = 5$, the only common member).
For prime-length DFTs more generally, Rader’s algorithm [9] and Bluestein’s chirp-z transform [10] provide $O(n \log n)$ alternatives when $n$ is prime or has an inconvenient factorization. The cycle-stride streaming we develop here applies after a single setup phase regardless of how that phase is implemented; in particular, the post-projection streaming cost does not depend on whether the projections were obtained by direct matrix multiplication, by an NTT-based method, or by any other route.

1.2. Organization

Section 2 establishes the algebraic setting and proves the four-cycle and projector identities from first principles. Section 3 states and proves the general streaming theorem (Theorem 1) valid for every symmetry-complete prime. Section 4 introduces DSC primes, proves the congruence and Fermat-prime separation results (Theorem 2), and gives the contribution-array reformulation (Theorem 3). Section 5 analyzes complexity in operation counts. Section 6 reports empirical validation. Section 7 discusses applications and limitations.

2. Preliminaries

2.1. Setting and Notation

Throughout, $p$ is a prime with $p \equiv 1 \pmod 4$. Write $p = 4t+1$ and $n = p-1 = 4t$. The multiplicative group $\mathbb{F}_p^\times$ is cyclic of order $n$, and we fix a primitive generator $e_t \in \mathbb{F}_p^\times$. Set
\[ i_t := e_t^{\,t} \in \mathbb{F}_p^\times. \]
Let $\Phi_t := \mathbb{Z}_n$ and $V_t := \mathbb{F}_p^{\Phi_t}$, the space of functions $v : \Phi_t \to \mathbb{F}_p$ (equivalently, $n$-vectors over $\mathbb{F}_p$ indexed by $\Phi_t$).

2.2. The Finite Fourier Matrix

Definition 1
(Shell Fourier matrix). The shell Fourier matrix $W_t : V_t \to V_t$ has entries
\[ (W_t)_{k,j} := e_t^{\,jk}, \qquad j, k \in \Phi_t. \]
Definition 2
(Reversal operator). The reversal operator $J : V_t \to V_t$ acts by $(Jv)_k := v_{-k}$ (indices mod $n$). Equivalently $J_{k,\ell} = \delta_{k,-\ell}$.
It is immediate that $J^2 = I$ and that $J$ has trace equal to the number of indices $k \in \Phi_t$ with $k \equiv -k \pmod n$, namely $k \in \{0, 2t\}$.
Lemma 1
(Squared Fourier identity). $i_t^2 = -1$ in $\mathbb{F}_p$, and
\[ W_t^2 = -J. \]
Proof. 
First claim. Since $e_t$ is a primitive generator of $\mathbb{F}_p^\times$, we have $e_t^n = 1$ and $e_t^{n/2} \ne 1$. The element $e_t^{n/2}$ squares to 1, so it equals either $1$ or $-1$; it is not equal to $1$, hence equal to $-1$. Therefore $i_t^2 = e_t^{2t} = e_t^{n/2} = -1$.
Second claim. For $k, \ell \in \Phi_t$, the $(k, \ell)$ entry of $W_t^2$ is
\[ (W_t^2)_{k,\ell} = \sum_{j \in \Phi_t} e_t^{\,jk} \cdot e_t^{\,j\ell} = \sum_{j=0}^{n-1} e_t^{\,j(k+\ell)}. \]
If $k + \ell \equiv 0 \pmod n$, every summand equals 1 and the sum is $n = p-1 = -1$ in $\mathbb{F}_p$. Otherwise $e_t^{k+\ell} \ne 1$ and the finite geometric sum is $(e_t^{n(k+\ell)} - 1)/(e_t^{k+\ell} - 1) = 0$. Hence $(W_t^2)_{k,\ell} = -\delta_{\ell,-k} = -J_{k,\ell}$.    □
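Lemma 1 can be checked numerically on a small case. The sketch below assumes the example $p = 13$ with generator $e_t = 2$ (an illustrative choice, not one fixed by the text) and uses a hypothetical helper `matmul` for dense products over $\mathbb{F}_p$.

```python
# Sanity check of Lemma 1 for an assumed small example: p = 13, e_t = 2.
p, e = 13, 2          # 2 is a primitive root mod 13 (illustrative choice)
n = p - 1             # n = 4t
t = n // 4
i_t = pow(e, t, p)    # i_t = e_t^t

# Shell Fourier matrix (W_t)_{k,j} = e_t^{jk} and reversal J_{k,l} = delta_{k,-l}.
W = [[pow(e, j * k, p) for j in range(n)] for k in range(n)]
J = [[1 if (k + l) % n == 0 else 0 for l in range(n)] for k in range(n)]

def matmul(A, B, p):
    """Dense matrix product over F_p (hypothetical helper)."""
    size = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(size)) % p
             for c in range(size)] for r in range(size)]

W2 = matmul(W, W, p)
minus_J = [[(-x) % p for x in row] for row in J]

print("i_t^2 == -1:", i_t * i_t % p == p - 1)
print("W_t^2 == -J:", W2 == minus_J)
```

The check is quadratic-size and only meant for small $p$; it exercises the geometric-sum argument of the proof entry by entry.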

2.3. The Normalized Fourier Operator and Its Four-Cycle

Definition 3
(Normalized shell Fourier operator). The normalized shell Fourier operator is
\[ F_t := i_t W_t. \]
Proposition 1
(Four-cycle). $F_t^2 = J$ and $F_t^4 = I$.
Proof. 
By Lemma 1, $F_t^2 = i_t^2 W_t^2 = (-1)(-J) = J$. Squaring, $F_t^4 = J^2 = I$.    □
The four-cycle structure is the algebraic skeleton on which the finite FrFT is built. The four cardinal points are
\[ F_t^0 = I, \qquad F_t^1 = F_t \ (\text{Fourier}), \qquad F_t^2 = J \ (\text{parity}), \qquad F_t^3 = F_t^{-1} \ (\text{inverse Fourier}). \]
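The four cardinal points can be confirmed by direct computation. The following sketch again assumes the illustrative example $p = 13$, $e_t = 2$; the helper `matmul` is hypothetical.

```python
# Four-cycle check for F_t = i_t * W_t on an assumed example (p = 13, e_t = 2).
p, e = 13, 2
n = p - 1
i_t = pow(e, n // 4, p)

def matmul(A, B, p):
    """Dense matrix product over F_p (hypothetical helper)."""
    size = len(A)
    return [[sum(A[r][m] * B[m][c] for m in range(size)) % p
             for c in range(size)] for r in range(size)]

F = [[i_t * pow(e, j * k, p) % p for j in range(n)] for k in range(n)]
I = [[int(r == c) for c in range(n)] for r in range(n)]
J = [[1 if (k + l) % n == 0 else 0 for l in range(n)] for k in range(n)]

F2 = matmul(F, F, p)      # parity
F3 = matmul(F2, F, p)     # inverse Fourier
F4 = matmul(F2, F2, p)    # identity

print("F^2 == J:", F2 == J)
print("F^4 == I:", F4 == I)
print("F * F^3 == I:", matmul(F, F3, p) == I)
```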

2.4. Eigenspace Projectors

Since $F_t^4 = I$, the eigenvalues of $F_t$ lie among the four roots of $x^4 = 1$ in $\mathbb{F}_p$, namely $\{1, i_t, -1, -i_t\} = \{i_t^\ell : \ell = 0, 1, 2, 3\}$. These roots are distinct, so $F_t$ is diagonalizable: each eigenvalue’s algebraic multiplicity is the dimension of the corresponding eigenspace, and the projector onto eigenspace $\ell$ is given by the standard polynomial formula in $F_t$.
Definition 4
(Spectral projectors). For $\ell \in \{0, 1, 2, 3\}$, define
\[ \Pi_\ell := \frac{1}{4} \sum_{r=0}^{3} i_t^{-\ell r} F_t^{\,r}. \]
The scalar $1/4 \in \mathbb{F}_p$ denotes the multiplicative inverse of 4; since $p$ is an odd prime, this is well-defined.
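Proposition 2
(Projector identities). The operators $\Pi_\ell$ satisfy:
(i) 
$\Pi_\ell \Pi_m = \delta_{\ell m} \Pi_\ell$;
(ii) 
$\sum_{\ell=0}^{3} \Pi_\ell = I$;
(iii) 
$F_t \Pi_\ell = i_t^\ell \Pi_\ell$.
Proof. 
The proof uses the discrete orthogonality identity over $\mathbb{Z}_4$: for $a, b \in \{0, 1, 2, 3\}$,
\[ \frac{1}{4} \sum_{r=0}^{3} i_t^{\,r(a-b)} = \begin{cases} 1 & \text{if } a \equiv b \pmod 4, \\ 0 & \text{otherwise}, \end{cases} \]
which follows from $i_t^4 = 1$ and the same geometric-sum argument used in Lemma 1.
(iii): Multiplying $\Pi_\ell$ by $F_t$ on the left and using $F_t^4 = I$:
\[ F_t \Pi_\ell = \frac{1}{4} \sum_{r=0}^{3} i_t^{-\ell r} F_t^{\,r+1} = \frac{1}{4} \sum_{r=1}^{4} i_t^{-\ell(r-1)} F_t^{\,r} = i_t^\ell \cdot \frac{1}{4} \sum_{r=0}^{3} i_t^{-\ell r} F_t^{\,r} = i_t^\ell \Pi_\ell, \]
where we used $i_t^{-\ell \cdot 4} = 1$ to reindex.
(ii): $\sum_\ell \Pi_\ell = \frac{1}{4} \sum_{r=0}^{3} \Bigl( \sum_{\ell=0}^{3} i_t^{-\ell r} \Bigr) F_t^{\,r}$. By the orthogonality identity with $a = 0$, $b = r$, the inner sum is 4 for $r = 0$ and 0 otherwise. So $\sum_\ell \Pi_\ell = F_t^0 = I$.
(i): Using (iii) and the formula for $\Pi_m$,
\[ \Pi_\ell \Pi_m = \Pi_\ell \cdot \frac{1}{4} \sum_{r=0}^{3} i_t^{-mr} F_t^{\,r} = \frac{1}{4} \sum_{r=0}^{3} i_t^{-mr} (i_t^\ell)^r \Pi_\ell = \frac{1}{4} \sum_{r=0}^{3} i_t^{\,r(\ell - m)} \Pi_\ell = \delta_{\ell m} \Pi_\ell. \]
   □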
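The three projector identities lend themselves to a direct numerical check. The sketch below builds $\Pi_0, \dots, \Pi_3$ from Definition 4 for the assumed example $p = 13$, $e_t = 2$; the helpers `matmul` and `lincomb` are hypothetical, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
# Numerical check of Proposition 2 on an assumed example (p = 13, e_t = 2).
p, e = 13, 2
n = p - 1
i_t = pow(e, n // 4, p)

def matmul(A, B):
    """Dense matrix product over F_p (hypothetical helper)."""
    return [[sum(A[r][m] * B[m][c] for m in range(n)) % p
             for c in range(n)] for r in range(n)]

def lincomb(coeffs, mats):
    """Entrywise sum_r coeffs[r] * mats[r] over F_p (hypothetical helper)."""
    return [[sum(c * M[r][k] for c, M in zip(coeffs, mats)) % p
             for k in range(n)] for r in range(n)]

I = [[int(r == c) for c in range(n)] for r in range(n)]
Z = [[0] * n for _ in range(n)]
F = [[i_t * pow(e, j * k, p) % p for j in range(n)] for k in range(n)]

# Powers F^0, F^1, F^2, F^3.
powers = [I]
for _ in range(3):
    powers.append(matmul(powers[-1], F))

inv4 = pow(4, -1, p)
i_inv = pow(i_t, -1, p)
# Pi_l = (1/4) * sum_r i_t^{-l r} F^r
Pi = [lincomb([inv4 * pow(i_inv, l * r, p) % p for r in range(4)], powers)
      for l in range(4)]

orthogonal = all(matmul(Pi[l], Pi[m]) == (Pi[l] if l == m else Z)
                 for l in range(4) for m in range(4))
complete = lincomb([1, 1, 1, 1], Pi) == I
eigen = all(matmul(F, Pi[l]) == lincomb([pow(i_t, l, p)], [Pi[l]])
            for l in range(4))
print(orthogonal, complete, eigen)
```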

2.5. The Finite FrFT Family

Definition 5
(Finite fractional Fourier transform). For $s \in \Phi_t$,
\[ F^{[s]} := \sum_{\ell=0}^{3} e_t^{\,\ell s} \, \Pi_\ell. \]
Proposition 3
(FrFT properties). The family $\{F^{[s]} : s \in \Phi_t\}$ satisfies:
(i) 
$F^{[s+r]} = F^{[s]} F^{[r]}$ for all $s, r \in \Phi_t$.
(ii) 
$F^{[0]} = I$, $F^{[t]} = F_t$, $F^{[2t]} = J$, $F^{[3t]} = F_t^{-1}$, $F^{[4t]} = F^{[0]} = I$.
Proof. 
(i): Using the orthogonality of the projectors,
\[ F^{[s]} F^{[r]} = \sum_{\ell} e_t^{\,\ell s} \Pi_\ell \sum_{m} e_t^{\,mr} \Pi_m = \sum_{\ell} e_t^{\,\ell(s+r)} \Pi_\ell = F^{[s+r]}. \]
(ii): For $s = 0$: $F^{[0]} = \sum_\ell \Pi_\ell = I$ by Proposition 2(ii). For $s = t$: $e_t^{\,\ell t} = i_t^\ell$, so $F^{[t]} = \sum_\ell i_t^\ell \Pi_\ell$. Using $F_t \Pi_\ell = i_t^\ell \Pi_\ell$ and $\sum_\ell \Pi_\ell = I$, this equals $F_t$. For $s = 2t$: $e_t^{\,\ell \cdot 2t} = i_t^{2\ell} = (-1)^\ell$, and $F^{[2t]} = \sum_\ell (-1)^\ell \Pi_\ell = F_t^2 = J$. For $s = 3t$: $e_t^{\,3\ell t} = i_t^{3\ell} = i_t^{-\ell}$, giving $F_t^{-1}$. For $s = 4t$: $e_t^{\,4\ell t} = 1$, recovering $\sum_\ell \Pi_\ell = I$.    □
This is the algebraic setup we use throughout. We will return to the question of when the map $s \mapsto F^{[s]}$ is injective (faithfulness) in Section 7; for the algorithmic results below, what matters is the projector-orbit identity (2) of Theorem 1.
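The group law and the cardinal points of Proposition 3 can also be tested at the level of vectors, without forming any $n \times n$ projector. The sketch below assumes $p = 13$, $e_t = 2$; the function names `fourier`, `components`, and `frft` are hypothetical, and the projections are obtained from $v, F_t v, F_t^2 v, F_t^3 v$ via Definition 4.

```python
import random

# Vector-level check of Proposition 3 on an assumed example (p = 13, e_t = 2).
p, e = 13, 2
n = p - 1
t = n // 4
i_t = pow(e, t, p)

def fourier(v):
    """Apply F_t = i_t * W_t, with (W_t)_{k,j} = e_t^{jk}."""
    return [i_t * sum(pow(e, j * k, p) * v[j] for j in range(n)) % p
            for k in range(n)]

def components(v):
    """u_l = Pi_l v = (1/4) sum_r i_t^{-l r} F_t^r v for l = 0..3."""
    orbit = [list(v)]
    for _ in range(3):
        orbit.append(fourier(orbit[-1]))
    inv4, i_inv = pow(4, -1, p), pow(i_t, -1, p)
    return [[inv4 * sum(pow(i_inv, l * r, p) * orbit[r][k] for r in range(4)) % p
             for k in range(n)] for l in range(4)]

def frft(v, s):
    """F^[s] v = sum_l e_t^{l s} u_l (Definition 5 applied to v)."""
    u = components(v)
    return [sum(pow(e, l * s, p) * u[l][k] for l in range(4)) % p
            for k in range(n)]

random.seed(1)
v = [random.randrange(p) for _ in range(n)]
reversed_v = [v[-k % n] for k in range(n)]

cardinal_ok = (frft(v, 0) == v and frft(v, t) == fourier(v)
               and frft(v, 2 * t) == reversed_v)
group_ok = all(frft(frft(v, r), s) == frft(v, (s + r) % n)
               for s in range(n) for r in range(n))
print(cardinal_ok, group_ok)
```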

3. The Cycle-Stride Streaming Algorithm

3.1. The Projector-Orbit Identity

The central observation of this paper is the following identity, which follows directly from Definition 5.
Theorem 1
(Projector-orbit identity; Theorem A). Let $p = 4t+1$ be prime and $e_t \in \mathbb{F}_p^\times$ a primitive generator. For any signal $v \in V_t$, set
\[ u_\ell := \Pi_\ell v, \qquad \ell \in \{0, 1, 2, 3\}. \]
Then for every $s \in \Phi_t$,
\[ F^{[s]} v = u_0 + e_t^{\,s} u_1 + e_t^{\,2s} u_2 + e_t^{\,3s} u_3 \quad \text{in } V_t. \tag{2} \]
Consequently, after the projections $\{u_\ell\}_{\ell=0}^{3}$ have been computed, the entire orbit $\{F^{[s]} v : s \in \Phi_t\}$ can be evaluated in $O(n^2)$ field operations and $O(n)$ working memory beyond the output stream.
Proof. 
By Definition 5,
\[ F^{[s]} v = \sum_{\ell=0}^{3} e_t^{\,\ell s} \, \Pi_\ell v = \sum_{\ell=0}^{3} e_t^{\,\ell s} u_\ell, \]
which is (2). The cost analysis: there are $n$ values of $s$, and for each, computing $u_0 + e_t^{\,s} u_1 + e_t^{\,2s} u_2 + e_t^{\,3s} u_3$ requires $3n$ scalar multiplications and $3n$ additions in $\mathbb{F}_p$, plus a constant number of operations to update the powers $e_t^{\,s}, e_t^{\,2s}, e_t^{\,3s}$ between successive values of $s$. The total is $O(n^2)$ field operations. Working memory beyond the output stream is the four arrays $u_0, u_1, u_2, u_3$, totalling $4n$ field elements, plus $O(1)$ coefficient scalars.    □
Remark 1
(Output-size optimality). The full orbit $\{F^{[s]} v\}_{s \in \Phi_t}$ consists of $n$ vectors in $V_t$, each of length $n$, totalling $n^2$ field elements. Any algorithm that explicitly materializes or streams every element of the full orbit must therefore perform $\Omega(n^2)$ output operations. The cycle-stride streaming phase matches this output-size lower bound up to constant factors: it performs $O(n^2)$ field operations and produces $n^2$ output entries.

3.2. Algorithm A: Scalar-Coefficient Streaming

The most direct realization of Theorem 1 maintains three scalar coefficients $c_1, c_2, c_3$ representing $e_t^{\,s}, e_t^{\,2s}, e_t^{\,3s}$, and updates them multiplicatively between meridian steps.
Algorithm 1 performs the projector setup in $O(n^2)$ field operations and the streaming phase in $O(n^2)$ field operations (with $3n$ multiplications and $3n$ additions per meridian step, plus three scalar updates). The cost is dominated by general modular multiplications in the inner loop: the per-output term $c_\ell u_\ell[j]$ is a full modular multiply regardless of the value of $e_t$.
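A minimal Python sketch of the scalar-coefficient streaming loop is given below. It is not the implementation used in the experiments; it assumes the small non-DSC example $p = 17$ with primitive root $e_t = 3$, computes the projections by repeated dense application of $F_t$, and uses hypothetical function names throughout.

```python
import random

def fourier(v, e, p):
    """Apply F_t = i_t * W_t over F_p, with (W_t)_{k,j} = e^{jk}."""
    n = p - 1
    i_t = pow(e, n // 4, p)
    return [i_t * sum(pow(e, j * k, p) * v[j] for j in range(n)) % p
            for k in range(n)]

def components(v, e, p):
    """Projected components u_l = (1/4) sum_r i_t^{-l r} F_t^r v."""
    n = p - 1
    i_inv = pow(pow(e, n // 4, p), -1, p)
    inv4 = pow(4, -1, p)
    orbit = [list(v)]
    for _ in range(3):
        orbit.append(fourier(orbit[-1], e, p))
    return [[inv4 * sum(pow(i_inv, l * r, p) * orbit[r][k] for r in range(4)) % p
             for k in range(n)] for l in range(4)]

def stream_orbit_scalar(u, e, p):
    """Yield F^[s] v for s = 0..n-1 using three running coefficients."""
    n = p - 1
    c1 = c2 = c3 = 1
    mu1, mu2, mu3 = e % p, e * e % p, e * e * e % p   # per-step multipliers
    for _ in range(n):
        yield [(u[0][j] + c1 * u[1][j] + c2 * u[2][j] + c3 * u[3][j]) % p
               for j in range(n)]
        c1, c2, c3 = c1 * mu1 % p, c2 * mu2 % p, c3 * mu3 % p

p, e = 17, 3                     # assumed example; 3 is a primitive root mod 17
n = p - 1
random.seed(0)
v = [random.randrange(p) for _ in range(n)]
u = components(v, e, p)
orbit = list(stream_orbit_scalar(u, e, p))

# Cardinal points: s = 0 gives v, s = t gives F_t v, s = 2t gives J v.
print(orbit[0] == v,
      orbit[n // 4] == fourier(v, e, p),
      orbit[n // 2] == [v[-k % n] for k in range(n)])
```

Writing the stream as a generator keeps working memory at the four arrays plus three scalars, in line with the $O(n)$ bound of Theorem 1.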

4. Dyadic-Symmetry-Complete Primes

We now isolate a class of primes for which the streaming phase admits a structurally cleaner implementation.

4.1. Definition and Structural Characterization

Definition 6
(Dyadic-symmetry-complete prime). A prime $p$ is dyadic-symmetry-complete (DSC) if $p \equiv 1 \pmod 4$ and 2 is a primitive root of $\mathbb{F}_p^\times$.
Theorem 2
(Necessary congruence and Fermat-prime separation; Theorem C). Let $p$ be a DSC prime.
(i) 
$p \equiv 5 \pmod 8$.
(ii) 
If $p - 1$ is a power of 2, then $p = 5$.
(iii) 
Among the known Fermat primes $\{3, 5, 17, 257, 65537\}$, only $p = 5$ is DSC.
Proof. 
(i): If 2 is a primitive root of $\mathbb{F}_p^\times$, then 2 has order $p-1$, in particular order strictly greater than $(p-1)/2$. Hence $2^{(p-1)/2} \not\equiv 1 \pmod p$. By Euler’s criterion, $2^{(p-1)/2} \equiv \left(\tfrac{2}{p}\right) \pmod p$, where $\left(\tfrac{2}{p}\right)$ is the Legendre symbol. Therefore $\left(\tfrac{2}{p}\right) = -1$, i.e., 2 is a quadratic non-residue mod $p$.
The second supplementary law of quadratic reciprocity states $\left(\tfrac{2}{p}\right) = (-1)^{(p^2-1)/8}$, equivalently $\left(\tfrac{2}{p}\right) = 1$ iff $p \equiv \pm 1 \pmod 8$ and $\left(\tfrac{2}{p}\right) = -1$ iff $p \equiv \pm 3 \pmod 8$ [11]. Combining with $p \equiv 1 \pmod 4$, the only admissible residue class is $p \equiv 5 \pmod 8$.
(ii): Suppose $p > 5$ is DSC and $p - 1 = 2^N$ for some $N \ge 2$. Then $p = 2^N + 1$ is prime, which forces $N$ itself to be a power of 2: indeed, if $N$ has an odd factor $d > 1$, writing $N = dN'$, then $2^N + 1 = (2^{N'})^d + 1$ is divisible by $2^{N'} + 1$, which satisfies $1 < 2^{N'} + 1 < 2^N + 1$, contradicting primality. Hence $N = 2^k$ for some $k \ge 1$, and $p$ is a Fermat prime $F_k = 2^{2^k} + 1$.
In $\mathbb{F}_p$ we have $2^{2^k} = 2^N = p - 1 \equiv -1 \pmod p$, so $2^{2 \cdot 2^k} = 1$. Hence the order of 2 in $\mathbb{F}_p^\times$ divides $2 \cdot 2^k = 2^{k+1}$; and the order does not divide $2^k$ (since $2^{2^k} = -1 \ne 1$). Therefore $\operatorname{ord}_p(2) = 2^{k+1}$. For 2 to be primitive, we need $\operatorname{ord}_p(2) = p - 1 = 2^{2^k}$, i.e., $k + 1 = 2^k$. This equation holds at $k = 1$ (giving $p = 5$) but at no $k \ge 2$ (since $2^k > k + 1$ for all $k \ge 2$). Hence $p \le 5$, contradicting $p > 5$.
(iii): For each Fermat prime, one verifies directly: in $\mathbb{F}_3$, $2 = -1$ has order $2 = p - 1$, but $p = 3$ does not satisfy $p \equiv 1 \pmod 4$. In $\mathbb{F}_5$, 2 has order $4 = p - 1$ and $p \equiv 1 \pmod 4$, so $p = 5$ is DSC. In $\mathbb{F}_{17}$, $\mathbb{F}_{257}$, $\mathbb{F}_{65537}$, the order of 2 is $2^{k+1}$ for $k = 2, 3, 4$, namely $8, 16, 32$, far below $p - 1$. None of these is DSC.    □
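Part (iii) is a finite computation and can be reproduced in a few lines. The sketch below uses hypothetical helper names `order` and `is_dsc`, and `is_dsc` assumes its argument is already known to be prime.

```python
def order(a, p):
    """Multiplicative order of a modulo the prime p (a not divisible by p)."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k

def is_dsc(p):
    """DSC test for a prime p: p = 1 (mod 4) and 2 is a primitive root."""
    return p % 4 == 1 and order(2, p) == p - 1

fermat_primes = [3, 5, 17, 257, 65537]
orders = {p: order(2, p) for p in fermat_primes}
dsc_flags = {p: is_dsc(p) for p in fermat_primes}

for p in fermat_primes:
    print(p, "ord(2) =", orders[p], "p mod 8 =", p % 8, "DSC:", dsc_flags[p])
```

The printed residues also illustrate part (i): the only Fermat prime in the class $5 \pmod 8$ is $p = 5$.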
Corollary 1
(Structural separation from Fermat-prime NTT). The DSC primes and Fermat primes (other than $p = 5$) are disjoint classes. In particular, the dyadic shift property used in Fermat-prime NTT implementations [6,8], where “multiplication by 2 is bit-shift” refers to twiddle-factor update for non-primitive 2, is structurally different from the property exploited in this paper, where 2 is a primitive generator and powers of 2 traverse all of $\mathbb{F}_p^\times$.
Remark 2
(Effective search class). By Theorem 2(i), every DSC prime satisfies $p \equiv 5 \pmod 8$. The effective search space for DSC primes is therefore the arithmetic progression $\{8q + 5 : q \in \mathbb{Z}_{\ge 0}\}$ intersected with the primes; within this class one then tests whether 2 is a primitive root. The empirical density results below (Proposition 4) are reported relative to this effective class.

4.2. Density: Empirical and Conjectural

The density of DSC primes within primes $p \equiv 5 \pmod 8$ is governed heuristically by Artin’s conjecture on primitive roots, conditional on the Generalized Riemann Hypothesis for certain Dedekind zeta functions [12]. Artin’s constant
\[ C_{\mathrm{Artin}} = \prod_{q \text{ prime}} \left( 1 - \frac{1}{q(q-1)} \right) \approx 0.3739558 \]
governs the global density of primes for which 2 is a primitive root. Restricted to the congruence class $p \equiv 5 \pmod 8$, the conjectural density may differ; we are not aware of a closed form, but the heuristic and the empirical data agree that it is bounded away from zero.
Unconditionally, the deepest known result is Heath-Brown’s theorem [13]: at least one of the integers $2, 3, 5$ is a primitive root for infinitely many primes. This does not single out 2, so it does not unconditionally establish the infinitude of DSC primes. We therefore record the existence statement as conjectural and the algorithm’s applicability as empirically demonstrated for the explicit examples enumerated below.
Proposition 4
(Empirical density at small p). Among the 43 primes $p \le 1000$ with $p \equiv 5 \pmod 8$, exactly 36 are DSC, an empirical density of approximately $0.837$. The first thirty DSC primes are
5, 13, 29, 37, 53, 61, 101, 149, 173, 181, 197, 269, 293, 317, 349, 373, 389, 421, 461, 509, 541, 557, 613, 653, 661, 677, 701, 709, 757, 773.
The empirical density within $p \equiv 5 \pmod 8$ is notably higher than the global Artin constant. The discrepancy is consistent with the heuristic that restricting to the congruence class removes the primes where 2 must be a quadratic residue, and is in line with conditional results on primitive roots in arithmetic progressions surveyed in [14].
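The counts in Proposition 4 can be recomputed by brute force. The following sketch enumerates the search class of Remark 2 up to 1000 with trial division and a naive order computation; the helper names are hypothetical, and the approach is only intended for small bounds.

```python
def is_prime(m):
    """Trial-division primality test, adequate for small m."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True

def order(a, p):
    """Multiplicative order of a modulo the prime p."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k

# Effective search class: primes of the form 8q + 5 up to the bound.
bound = 1000
candidates = [p for p in range(5, bound + 1, 8) if is_prime(p)]
dsc = [p for p in candidates if order(2, p) == p - 1]

print(len(candidates), len(dsc), round(len(dsc) / len(candidates), 3))
print(dsc[:30])
```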

4.3. Bit-shift Multiplication Modulo p

Lemma 2
(Bit-shift multiplication). For any prime $p$ and any $a \in \{0, 1, \dots, p-1\}$, the operation $a \mapsto 2a \pmod p$ is realized by:
(1) 
left-shift $a$ by one bit to obtain $2a$ as an integer in $\{0, \dots, 2p-2\}$;
(2) 
if $2a \ge p$, subtract $p$.
On a register of $\lceil \log_2 p \rceil + 1$ bits, the operation requires one shift and one conditional subtraction. Iterated $k$ times, the operation realizes $a \mapsto 2^k a \pmod p$ using $k$ shifts and at most $k$ conditional subtractions.
This is elementary but worth recording. The DSC condition ensures that iterating this operation generates all of $\mathbb{F}_p^\times$ from the starting value 1.
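The shift-and-reduce step of Lemma 2 is short enough to state in code. The sketch below uses a hypothetical function name `double_mod` and the assumed DSC example $p = 13$ to illustrate that repeated doubling from 1 visits every nonzero residue exactly once.

```python
def double_mod(a, p):
    """Return 2a mod p for 0 <= a < p using one shift and one conditional subtraction."""
    a <<= 1
    return a - p if a >= p else a

p = 13                      # assumed example of a DSC prime
x, visited = 1, []
for _ in range(p - 1):
    visited.append(x)
    x = double_mod(x, p)

print(visited)
print("all nonzero residues visited:", sorted(visited) == list(range(1, p)))
print("back at 1 after p - 1 doublings:", x == 1)
```

For a symmetry-complete prime that is not DSC, such as $p = 17$, the same loop would return to 1 after only 8 doublings and miss half of $\mathbb{F}_p^\times$.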

4.4. Algorithm B: Contribution-Array Streaming for DSC Primes

The scalar-coefficient streaming of Algorithm 1 performs three general modular multiplications per output element ($c_\ell u_\ell[j]$). For DSC primes with $e_t = 2$, this can be restructured to eliminate general modular multiplications from the streaming phase entirely.
Algorithm 1 Scalar-coefficient streaming (Algorithm A)
Require: Prime $p = 4t+1$; primitive $e_t \in \mathbb{F}_p^\times$; signal $v \in \mathbb{F}_p^n$; precomputed projectors $\Pi_0, \Pi_1, \Pi_2, \Pi_3$
Ensure: The orbit $\{F^{[s]} v : s = 0, 1, \dots, n-1\}$
  1: for $\ell = 0, 1, 2, 3$ do
  2:     $u_\ell \leftarrow \Pi_\ell v$
  3: end for
  4: $c_1, c_2, c_3 \leftarrow 1, 1, 1$
  5: $\mu_1, \mu_2, \mu_3 \leftarrow e_t, e_t^2, e_t^3 \pmod p$                    ▹ Per-step multipliers
  6: for $s = 0, 1, \dots, n-1$ do
  7:     for $j = 0, 1, \dots, n-1$ do
  8:         $w[j] \leftarrow u_0[j] + c_1 u_1[j] + c_2 u_2[j] + c_3 u_3[j] \pmod p$
  9:     end for
 10:     output $w$ as $F^{[s]} v$
 11:     $c_1 \leftarrow c_1 \mu_1 \bmod p$
 12:     $c_2 \leftarrow c_2 \mu_2 \bmod p$
 13:     $c_3 \leftarrow c_3 \mu_3 \bmod p$
 14: end for
The reformulation maintains contribution arrays
\[ a_\ell^{(s)} := e_t^{\,\ell s} u_\ell \in V_t, \qquad \ell = 1, 2, 3, \]
which absorb the scalar coefficient into the vector. The recombination at each meridian step is then an additive expression with no scalar-vector multiplications, and the update between meridian steps is an elementwise multiplication of each $a_\ell$ by $e_t^\ell \in \{2, 4, 8\}$, which by Lemma 2 is a sequence of shifts.
Theorem 3
(Shift-based streaming over DSC primes; Theorem B). Algorithm 2 is correct: it outputs $F^{[s]} v$ at step $s$ for every $s \in \Phi_t$. The streaming phase after the projection setup performs no general modular multiplications; the inner-loop operations consist entirely of:
  • modular additions (in the output recombination);
  • left bit-shifts and conditional subtractions (in the contribution-array updates).
Per meridian step, Algorithm 2 performs $3n$ modular additions to form the output $w = u_0 + a_1 + a_2 + a_3$, and updates the three contribution arrays by multiplication by $2, 4, 8$ modulo $p$. Implemented by repeated doubling, the three updates require $6n$ elementary shift-reduce operations per meridian step. Equivalently, if multiplication by $2, 4, 8$ is implemented as fixed dyadic shift-reduce primitives, the updates consist of $3n$ fixed dyadic modular updates per step. In either interpretation, the streaming inner loop contains no general modular multiplication.
Algorithm 2 Contribution-array streaming for DSC primes (Algorithm B)
Require: DSC prime $p$; signal $v \in \mathbb{F}_p^n$; precomputed projectors $\Pi_0, \Pi_1, \Pi_2, \Pi_3$
Ensure: The orbit $\{F^{[s]} v : s = 0, 1, \dots, n-1\}$
  1: for $\ell = 0, 1, 2, 3$ do
  2:     $u_\ell \leftarrow \Pi_\ell v$
  3: end for
  4: $a_1 \leftarrow u_1$; $a_2 \leftarrow u_2$; $a_3 \leftarrow u_3$                    ▹ Contribution arrays
  5: for $s = 0, 1, \dots, n-1$ do
  6:     for $j = 0, 1, \dots, n-1$ do
  7:         $w[j] \leftarrow u_0[j] + a_1[j] + a_2[j] + a_3[j] \pmod p$
  8:     end for
  9:     output $w$ as $F^{[s]} v$
 10:     for $j = 0, 1, \dots, n-1$ do
 11:         $a_1[j] \leftarrow 2\, a_1[j] \bmod p$                    ▹ One shift
 12:         $a_2[j] \leftarrow 4\, a_2[j] \bmod p$                    ▹ Two shifts
 13:         $a_3[j] \leftarrow 8\, a_3[j] \bmod p$                    ▹ Three shifts
 14:     end for
 15: end for
Proof. 
We prove correctness by induction on $s$. The invariant maintained at the beginning of meridian step $s$ is $a_\ell^{(s)} = 2^{\ell s} u_\ell$ for $\ell = 1, 2, 3$. Base case ($s = 0$): $a_\ell^{(0)} = u_\ell = 2^0 u_\ell$. Inductive step: after the output at step $s$, lines 11 to 13 update $a_\ell$ to $2^\ell a_\ell^{(s)} = 2^\ell \cdot 2^{\ell s} u_\ell = 2^{\ell(s+1)} u_\ell = a_\ell^{(s+1)}$.
The output at step $s$ is
\[ w = u_0 + a_1^{(s)} + a_2^{(s)} + a_3^{(s)} = u_0 + 2^{s} u_1 + 2^{2s} u_2 + 2^{3s} u_3, \]
which equals $F^{[s]} v$ by Theorem 1 applied with $e_t = 2$.
Operation count: the recombination $w[j] = u_0[j] + a_1[j] + a_2[j] + a_3[j]$ contains three modular additions per index, hence $3n$ modular additions per meridian step. The contribution-array updates $a_1 \leftarrow 2 a_1$, $a_2 \leftarrow 4 a_2$, $a_3 \leftarrow 8 a_3$, each applied elementwise to a length-$n$ array, require $1 + 2 + 3 = 6$ elementary shift-reduce operations per index under the repeated-doubling implementation, for $6n$ shift-reduce operations per meridian step. Under the alternative interpretation in which multiplication by $2, 4, 8$ modulo $p$ is treated as a primitive dyadic operation, each contribution update is one dyadic modular update per index, for $3n$ such updates per step. Neither interpretation involves a general modular multiplication. □
Remark 3
(Comparison of Algorithms A and B). Algorithm 1 performs $3n$ general modular multiplications per meridian step in the inner loop. Algorithm 2 replaces these with elementwise multiplications by the fixed small constants $2, 4, 8$ modulo $p$, which decompose into $6n$ single-bit shift-reduce operations under repeated doubling or, equivalently, $3n$ fixed dyadic modular updates if these constants are implemented as dedicated shift-reduce primitives. On hardware where general modular multiplication is more expensive than shift-reduce (typically FPGA, ASIC, and microcontroller targets without hardware multipliers), Algorithm 2 is preferable. On a modern x86 or ARM CPU with fast hardware 64-bit integer multiplication, multiplication by a small constant is not meaningfully cheaper than general multiplication, and the two algorithms have similar wall-clock cost; we report this explicitly in Section 6.
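The streaming phase of Algorithm 2 can be sketched in Python using only additions, shifts, comparisons, and subtractions. In the sketch below, the arrays $u_0, \dots, u_3$ stand in for precomputed projections and are filled with random residues, since the invariant of Theorem 3 does not depend on how they were obtained; $p = 13$ is an assumed DSC example and all function names are hypothetical.

```python
import random

def double_mod(a, p):
    """2a mod p via one shift and one conditional subtraction (Lemma 2)."""
    a <<= 1
    return a - p if a >= p else a

def add_mod(a, b, p):
    """(a + b) mod p for 0 <= a, b < p via one conditional subtraction."""
    s = a + b
    return s - p if s >= p else s

def stream_orbit_shift(u, p):
    """Yield u0 + 2^s u1 + 4^s u2 + 8^s u3 for s = 0..p-2 without multiplying."""
    n = p - 1
    a1, a2, a3 = list(u[1]), list(u[2]), list(u[3])   # contribution arrays
    for _ in range(n):
        yield [add_mod(add_mod(add_mod(u[0][j], a1[j], p), a2[j], p), a3[j], p)
               for j in range(n)]
        for j in range(n):
            a1[j] = double_mod(a1[j], p)                                   # times 2
            a2[j] = double_mod(double_mod(a2[j], p), p)                    # times 4
            a3[j] = double_mod(double_mod(double_mod(a3[j], p), p), p)     # times 8

p = 13                      # assumed DSC prime, so e_t = 2
n = p - 1
random.seed(2)
u = [[random.randrange(p) for _ in range(n)] for _ in range(4)]

shifted = list(stream_orbit_shift(u, p))
# Reference: the scalar recombination of Theorem 1 with e_t = 2.
reference = [[sum(pow(2, l * s, p) * u[l][j] for l in range(4)) % p
              for j in range(n)] for s in range(n)]
print("shift-based stream matches scalar recombination:", shifted == reference)
```

The reference computation uses general modular multiplication on purpose, so that agreement with the shift-only stream exercises the induction in the proof rather than restating it.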

5. Complexity Analysis

We compare four methods for computing the full FrFT orbit $\{F^{[s]} v\}_{s \in \Phi_t}$ over a symmetry-complete prime $p = 4t+1$ with $n = p-1$. Let $M, A, S$ denote the cost of a modular multiplication, modular addition, and shift-with-reduction in $\mathbb{F}_p$ respectively.
  • Method 1: Direct.
Construct $F^{[s]}$ as a matrix for each $s$ via Definition 5 (cost $4n^2$ multiplications), then apply to $v$ ($n^2$ multiplications). Total over $n$ values of $s$: $O(n^3 M)$.
  • Method 2: Iterated NTT.
If an NTT of length $n$ is available at cost $O(n \log n)$, then $F^{[s]} v$ for a single $s$ can be evaluated in $O(n \log n)$ field operations by leveraging the matrix form of $F^{[s]}$. Iterating across $n$ values of $s$: $O(n^2 \log n)$. This baseline applies only when an NTT of length $n = 4t$ is available; by Theorem 2(ii), $n$ is not a power of 2 for any DSC prime $p > 5$, so radix-2 NTT does not apply directly.
  • Method 3: NTT-based projector setup, then streaming.
Compute $v$, $F_t v$, $J v$, $F_t^{-1} v$ once via NTT (cost $O(n \log n)$ each), then form the projector components $u_\ell$ as linear combinations. Stream the orbit via Algorithm 1 or 2. Total: $O(n \log n)$ setup $+$ $O(n^2)$ streaming $= O(n^2)$.
  • Method 4: Cycle-stride (this paper).
Compute the projector components via direct matrix-vector multiplication (cost $4n^2 M$) or via Method 3’s setup (cost $O(n \log n)$ if NTT applies). Stream via Algorithm 1 (general primitive $e_t$) or Algorithm 2 (DSC, $e_t = 2$). Total: $O(n^2)$.
The principal observations from Table 1:
(a)
The post-projection streaming phase is $O(n^2)$, which is output-optimal: the orbit contains $n^2$ field elements, so no algorithm producing the full orbit can have lower asymptotic cost.
(b)
Cycle-stride is not an asymptotic improvement over Method 3 (NTT setup + streaming); both achieve $O(n^2)$ total. The advantages of cycle-stride are: (i) it does not require NTT to be available at length $n$, which by Theorem 2(ii) is significant for DSC primes; (ii) its streaming phase has a particularly simple structure; and (iii) in the DSC subclass, Algorithm 2 replaces modular multiplications with shifts in the inner loop.
(c)
For single fractional transforms ($s$ fixed), an NTT-based method computes one $F^{[s]} v$ in $O(n \log n)$ field operations when applicable, beating both cycle-stride variants. Cycle-stride is preferable only when the full orbit (or a substantial fraction of it) is required.
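To make the leading-order counts concrete, the sketch below tallies the streaming-phase operations quoted in this section and in Theorem 3 for a given prime. It is a bookkeeping aid only: the function name and dictionary keys are hypothetical, lower-order terms such as the three scalar updates per step are omitted, and the tallies are operation counts rather than timing predictions.

```python
def op_counts(p):
    """Leading-order operation tallies for the full orbit, with n = p - 1."""
    n = p - 1
    return {
        # Method 1: per s, 4n^2 multiplications to build F^[s] plus n^2 to apply it.
        "direct_mults": n * (4 * n * n + n * n),
        # Algorithm 1: 3n multiplications and 3n additions per meridian step.
        "alg1_mults": n * 3 * n,
        "alg1_adds": n * 3 * n,
        # Algorithm 2: 3n additions and 6n shift-reduce operations per step.
        "alg2_adds": n * 3 * n,
        "alg2_shift_reduce": n * 6 * n,
        "alg2_mults": 0,
    }

counts = op_counts(197)      # n = 196
for name, value in counts.items():
    print(f"{name:>18}: {value}")
```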

6. Empirical Validation

We implemented both Algorithm 1 (scalar-coefficient streaming) and Algorithm 2 (contribution-array streaming for DSC primes) in Python with NumPy, using 64-bit integer arithmetic throughout. The reference baseline is direct construction of $F^{[s]}$ as a matrix from the precomputed projectors, followed by matrix-vector multiplication. All implementations share the same projector precomputation; only the per-$s$ work differs.
We emphasize the scope of these experiments at the outset. The measurements below validate correctness of Algorithms 1 and 2 against an independent dense reference, and validate the predicted $O(n^2)$ scaling of the streaming phase against the $O(n^3)$ scaling of the dense baseline. They do not benchmark Algorithm 2’s intended shift-reduce hardware advantage, which is a property of FPGA, ASIC, and microcontroller implementations and is not exposed on a general-purpose CPU running Python with NumPy; nor do they compare against optimized NTT pipelines, where a fair head-to-head requires careful implementation in a common framework. The shift-reduce hardware advantage and the NTT comparison are both left as future work.
Table 2 reports per-orbit wall-clock timings on a random signal $v \in \mathbb{F}_p^n$, averaged over 30 runs, on a single CPU core. The rightmost three columns show speedups relative to direct evaluation.
  • Correctness validation.
Both streaming algorithms produce outputs identical to the direct baseline at all $n$ meridian indices for every prime in Table 2. We additionally verified correctness of Algorithm 1 on a non-DSC symmetry-complete prime ($p = 41$, with primitive root $e_t = 6$); the general streaming structure applies regardless of whether $e_t = 2$.
  • Scaling validation.
The ratio between direct and streaming costs grows approximately linearly in $n$, consistent with the predicted $\Theta(n)$ ratio between $O(n^3)$ direct evaluation and $O(n^2)$ streaming. The empirical speedup of Algorithm 1 over direct at $p = 197$ is approximately $97\times$, with $n = 196$.
  • Algorithm A vs Algorithm B in Python.
A notable observation is that the contribution-array streaming (Algorithm B) runs approximately 30% slower than scalar-coefficient streaming (Algorithm A) in our Python/NumPy implementation. The reason is implementational rather than asymptotic: NumPy’s elementwise scalar multiplication (used in Algorithm B to update the contribution arrays) carries per-call overhead comparable to a small matrix operation, while scalar coefficient updates in Algorithm A use native Python integers with low overhead. The crossover in Algorithm B’s favor occurs on hardware where general modular multiplication is substantially more expensive than shift-and-reduce, which is not the case for 64-bit integer arithmetic on modern x86. The contribution-array variant’s principal advantage is therefore on FPGA, microcontroller, or custom-ASIC targets where multiplier area or cycle cost is significant.

7. Discussion

7.1. Faithfulness and Signal-Orbit Degeneracy

The streaming algorithms (Algorithms 1 and 2) compute $F^{[s]} v$ correctly for every $s \in \Phi_t$ regardless of whether the operator family $s \mapsto F^{[s]}$ is faithful or whether the realized signal orbit is degenerate. We separate two distinct questions about the orbit and clarify the role each plays in the present work.
Operator-level faithfulness concerns whether the map $s \mapsto F^{[s]}$ is injective on $\Phi_t$ as a map into $\operatorname{End}(V_t)$. From Definition 5, $F^{[s]} = F^{[r]}$ iff $\sum_\ell (e_t^{\,\ell s} - e_t^{\,\ell r}) \Pi_\ell = 0$, which (using $\Pi_\ell \Pi_m = \delta_{\ell m} \Pi_\ell$) holds iff $(e_t^{\,\ell s} - e_t^{\,\ell r}) \Pi_\ell = 0$ for every $\ell$. Hence operator-level faithfulness reduces to whether the spectral projectors $\Pi_1$ and $\Pi_3$ are nonzero: if $\Pi_1 \ne 0$, then $e_t^{\,s} = e_t^{\,r}$ in $\mathbb{F}_p$ forces $s = r$ modulo $n$ by primitivity of $e_t$.
Signal-orbit distinctness concerns whether, for a fixed vector $v$, the realized outputs $\{F^{[s]} v : s \in \Phi_t\}$ are pairwise distinct as vectors. By the projector-orbit identity (Theorem 1), $F^{[s]} v = F^{[r]} v$ iff $(e_t^{\,\ell s} - e_t^{\,\ell r}) u_\ell = 0$ for every $\ell$, where $u_\ell = \Pi_\ell v$ is the realized projected component. A particular vector with $u_1 = u_3 = 0$ has a signal orbit of period at most $2t$ in $s$ (since $e_t^{\,2t} = -1$ fixes the even components $\ell = 0, 2$), regardless of whether $\Pi_1$ and $\Pi_3$ are nonzero in operator terms.
These are different questions, and the algorithms do not require either: they produce $F^{[s]} v$ correctly for every $s$, with repeated outputs when the signal orbit is degenerate. The full operator-level faithfulness result for $t \ge 2$, which can be obtained by computing the projector ranks $\operatorname{rank} \Pi_\ell$ via $\operatorname{tr} \Pi_\ell$ and Gauss-sum evaluations of $\operatorname{tr} F_t^{\,r}$, is not required by Theorems 1 and 3.
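A degenerate signal orbit is easy to exhibit. The sketch below assumes $p = 13$, $e_t = 2$ and takes an even signal ($v_k = v_{-k}$), for which $Jv = v$ and hence the odd components $u_1$ and $u_3$ vanish; it then checks that the streamed orbit repeats with period dividing $2t$. Function names are hypothetical.

```python
import random

p, e = 13, 2                # assumed example
n = p - 1
t = n // 4
i_t = pow(e, t, p)

def fourier(v):
    """Apply F_t = i_t * W_t, with (W_t)_{k,j} = e_t^{jk}."""
    return [i_t * sum(pow(e, j * k, p) * v[j] for j in range(n)) % p
            for k in range(n)]

def components(v):
    """u_l = (1/4) sum_r i_t^{-l r} F_t^r v for l = 0..3."""
    orbit = [list(v)]
    for _ in range(3):
        orbit.append(fourier(orbit[-1]))
    inv4, i_inv = pow(4, -1, p), pow(i_t, -1, p)
    return [[inv4 * sum(pow(i_inv, l * r, p) * orbit[r][k] for r in range(4)) % p
             for k in range(n)] for l in range(4)]

# Even signal: v[k] == v[-k mod n], so J v = v.
random.seed(3)
half = [random.randrange(p) for _ in range(n // 2 + 1)]
v = [half[min(k, n - k)] for k in range(n)]

u = components(v)
orbit = [[sum(pow(e, l * s, p) * u[l][k] for l in range(4)) % p
          for k in range(n)] for s in range(n)]

odd_vanish = all(x == 0 for x in u[1] + u[3])
half_period = all(orbit[s] == orbit[(s + 2 * t) % n] for s in range(n))
distinct = len({tuple(w) for w in orbit})
print(odd_vanish, half_period, distinct)
```

The streaming algorithms need no special handling here: they emit all $n$ outputs, of which at most $2t$ are distinct.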

7.2. Applications

The principal use cases for streaming over the FrFT orbit are settings that require exact arithmetic over a finite field combined with sweeping over the fractional-rotation parameter. Three concrete examples:

Time-frequency signal analysis with exact arithmetic.

The continuum FrFT is widely used in chirp detection, time-frequency filtering, and time-frequency localization [2]. Implementations over a finite field via NTT-style arithmetic avoid floating-point roundoff in applications that demand reproducibility (e.g., regulatory signal processing or symbolic computation). When the application requires sweeping the fractional-rotation parameter — as in chirp parameter estimation or time-frequency contour search — the cycle-stride algorithm computes the entire sweep at output-optimal cost.

Qudit quantum circuit verification.

Floratos and Pavlidis [5] use the finite FrFT family as a one-parameter family of qudit gates. Classical simulation and verification of such circuits require computing the action of each gate on a state vector; the cycle-stride streaming computes the action of all $n$ fractional Fourier gates on a fixed state vector in $O(n^2)$ field operations, producing the $n$-vector orbit of total output size $n^2$. Explicitly materializing the full family of $n$ gate matrices would have output size $\Omega(n^3)$ and is outside the vector-orbit complexity claimed here; for circuit-simulation applications that need only gate actions on specific input states, the vector-orbit complexity is the relevant figure.

Hardware implementations.

On targets without hardware modular multipliers (low-power microcontrollers, area-constrained FPGAs), Algorithm 2 is structurally attractive: the streaming inner loop requires no multiplication unit beyond what is needed to compute the initial projections. For applications that perform many full-orbit evaluations on the same prime field — e.g., batch processing of different signals through the same fractional Fourier sweep — the projector precomputation amortizes and the streaming cost dominates, making the multiplier-free inner loop the relevant figure of merit.
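To make the contrast between the two streaming variants concrete, the following is a minimal sketch under stated assumptions rather than the code used in the experiments. It assumes the four projected components $u_0, \dots, u_3$ are already available with entries in $[0, p)$, that the generator is 2 (the DSC case), and that the orbit element for parameter $s$ is $\sum_\ell 2^{\ell s} u_\ell \bmod p$. The function names are illustrative.

```python
# Sketch of the two streaming variants over F_p with generator 2 (DSC case).
# ASSUMPTION: components u_0..u_3 are given with entries in [0, p), and the
# orbit element for parameter s is sum_l 2**(l*s) * u_l (mod p).

def double_mod(x, p):
    """Multiply x by 2 modulo p using one shift and one conditional subtract."""
    y = x << 1
    return y - p if y >= p else y


def stream_scalar(components, p, e=2):
    """Scalar-coefficient streaming: four coefficient multiplications per step."""
    n = p - 1
    coeff = [1, 1, 1, 1]
    step = [pow(e, l, p) for l in range(4)]
    for _step in range(n):
        yield [sum(c * u[j] for c, u in zip(coeff, components)) % p
               for j in range(n)]
        coeff = [c * m % p for c, m in zip(coeff, step)]


def stream_contribution(components, p):
    """Contribution-array streaming: modular additions and shift updates only."""
    n = p - 1
    contrib = [list(u) for u in components]
    for _step in range(n):
        out = []
        for j in range(n):
            total = 0
            for w in contrib:
                total += w[j]
                if total >= p:      # reduce by conditional subtraction
                    total -= p
            out.append(total)
        yield out
        # Advance s -> s + 1: the l-th array is doubled l times (l shifts).
        for l in range(1, 4):
            for _shift in range(l):
                contrib[l] = [double_mod(x, p) for x in contrib[l]]


# Usage with arbitrary placeholder components for p = 13 (not true projections).
p = 13
components = [[(3 * j + l + 1) % p for j in range(p - 1)] for l in range(4)]
scalar_orbit = list(stream_scalar(components, p))
contribution_orbit = list(stream_contribution(components, p))
```

In this sketch the inner loop of `stream_contribution` uses only additions, comparisons, subtractions, and one-bit shifts, which is the property that matters on multiplier-free targets. In interpreted Python this confers no speed advantage.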

7.3. Limitations and Future Work

We note several limitations of the present work.
First, the empirical comparison is against a dense matrix-vector baseline rather than against an optimized NTT pipeline. A fair comparison against, for example, hand-tuned Bluestein chirp-z implementations or production NTT libraries (NFLlib, Microsoft SEAL) is necessary to establish the practical regime in which cycle-stride dominates. We expect cycle-stride to remain competitive for full-orbit workloads but to lose to NTT for single-transform use cases when NTT applies.
Second, the cycle-stride algorithm produces the full orbit. Some applications may want a contiguous sub-arc of the orbit (e.g., a small window of fractional parameters around the standard Fourier transform). For such applications, a tailored algorithm restricting the streaming to the desired sub-arc is natural, with proportionally reduced cost; we have not developed this variant in detail.
Third, the DSC condition that 2 is primitive could be relaxed to $2^k$ for small $k$, broadening the prime class at the cost of slightly more expensive contribution updates (each step requires $k\ell$ shifts rather than $\ell$ shifts for the $\ell$-th contribution array). The full characterization of which prime classes admit a similar shift-only streaming phase is an open question.
Fourth, the structural separation of the DSC primes from the Fermat-prime NTT class (Theorem 2(ii)) means that the two methods address different prime classes and are not in direct competition for the same applications. A hybrid system that uses standard radix-2 NTT for Fermat primes and cycle-stride for DSC primes would cover a broader range of cycle lengths than either method alone, though the cycle lengths involved ($n$ a power of 2 versus $n = 4 \cdot (\text{odd})$) typically correspond to different signal sizes.

8. Conclusion

We have identified a streaming structure for the finite fractional Fourier transform over symmetry-complete prime fields. The full orbit of $p - 1$ fractional powers on a signal of length $p - 1$ can be generated in $O((p-1)^2)$ post-projection field operations, matching the $\Omega((p-1)^2)$ output-size lower bound up to constant factors. The streaming procedure separates cleanly into a projection setup that depends only on the field and the input signal, and a streaming phase whose cost is independent of the implementation of the setup. For the dyadic-symmetry-complete primes (those with $p \equiv 5 \pmod{8}$ for which 2 is a primitive root), the streaming phase admits a contribution-array reformulation that eliminates general modular multiplications from the inner loop, leaving only modular additions and shift updates.
The class of dyadic-symmetry-complete primes is structurally distinct from the Fermat primes used in classical radix-2 NTT settings: among the Fermat primes only $p = 5$ is DSC, and for every DSC prime with $p > 5$, the cycle length $n = p - 1$ is not a power of two. The two prime classes are therefore complementary rather than competing, and a finite-field signal-processing toolkit that includes both broadens the range of admissible cycle lengths beyond what either method covers individually.
The result is a clean algorithmic note rather than a dramatic asymptotic improvement: cycle-stride achieves the same $O(n^2)$ full-orbit total cost as the natural NTT-setup-and-stream alternative, and its principal contribution is in the structure of the streaming phase rather than in the asymptotic complexity. We expect the contribution-array variant to be most useful in hardware implementations where modular multiplication carries significant area or energy cost relative to shift-and-reduce; a head-to-head comparison against tuned NTT pipelines is left to future work.
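The separation from the Fermat primes can be checked by brute force for small primes. The sketch below is illustrative only: it takes a DSC prime to mean a prime $p \equiv 1 \pmod{4}$ for which 2 has multiplicative order $p - 1$, and the helper names are not from the paper.

```python
# Enumerate small DSC primes and check the congruence and Fermat separation.
# ASSUMPTION: "DSC prime" means p = 1 (mod 4) with 2 a primitive root mod p.

def is_prime(m):
    """Trial-division primality test, adequate for small m."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def order_of_two(p):
    """Multiplicative order of 2 modulo an odd prime p."""
    x, k = 2 % p, 1
    while x != 1:
        x = x * 2 % p
        k += 1
    return k


def dsc_primes(bound):
    """DSC primes below bound."""
    return [p for p in range(5, bound)
            if p % 4 == 1 and is_prime(p) and order_of_two(p) == p - 1]


def is_power_of_two(m):
    return m > 0 and (m & (m - 1)) == 0


primes = dsc_primes(200)
print(primes)
print(all(p % 8 == 5 for p in primes))                  # necessary congruence
print([p for p in primes if is_power_of_two(p - 1)])    # overlap with Fermat primes
```

Below 200 this enumeration returns exactly the primes listed in Table 2. The congruence $p \equiv 5 \pmod{8}$ is necessary but not sufficient: for example, $109 \equiv 5 \pmod{8}$, yet 2 has order 36 modulo 109, so 109 is not DSC.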

References

  1. Namias, V. The Fractional Order Fourier Transform and its Application to Quantum Mechanics. IMA Journal of Applied Mathematics 1980, 25, 241–265.
  2. Ozaktas, H.M.; Zalevsky, Z.; Kutay, M.A. The Fractional Fourier Transform with Applications in Optics and Signal Processing; Wiley, 2001.
  3. Candan, Ç.; Kutay, M.A.; Ozaktas, H.M. The Discrete Fractional Fourier Transform. IEEE Transactions on Signal Processing 2000, 48, 1329–1337.
  4. Pei, S.C.; Yeh, M.H.; Tseng, C.C. Discrete Fractional Fourier Transform Based on Orthogonal Projections. IEEE Transactions on Signal Processing 1999, 47, 1335–1348.
  5. Floratos, E.; Pavlidis, A. A Novel Finite Fractional Fourier Transform and its Quantum Circuit Implementation on Qudits, 2024, [arXiv:quant-ph/2409.05759].
  6. Pollard, J.M. The Fast Fourier Transform in a Finite Field. Mathematics of Computation 1971, 25, 365–374.
  7. Lyubashevsky, V.; Peikert, C.; Regev, O. On Ideal Lattices and Learning with Errors over Rings. Journal of the ACM 2013, 60, 43:1–43:35.
  8. Longa, P.; Naehrig, M. Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography. In Proceedings of the Cryptology and Network Security (CANS). Springer, 2016, pp. 124–139.
  9. Rader, C.M. Discrete Fourier Transforms when the Number of Data Samples is Prime. Proceedings of the IEEE 1968, 56, 1107–1108.
  10. Bluestein, L.I. A Linear Filtering Approach to the Computation of Discrete Fourier Transform. IEEE Transactions on Audio and Electroacoustics 1970, 18, 451–455.
  11. Ireland, K.; Rosen, M. A Classical Introduction to Modern Number Theory, 2nd ed.; Vol. 84, Graduate Texts in Mathematics, Springer, 1990.
  12. Hooley, C. On Artin’s Conjecture. Journal für die reine und angewandte Mathematik 1967, 225, 209–220.
  13. Heath-Brown, D.R. Artin’s Conjecture for Primitive Roots. The Quarterly Journal of Mathematics 1986, 37, 27–38.
  14. Moree, P. Artin’s Primitive Root Conjecture: A Survey. Integers 2012, 12A, A13.
Table 1. Cost decomposition for full-orbit computation. The cycle-stride streaming phase is $O(n^2)$ regardless of setup choice. The DSC specialization replaces M by S in the streaming phase.

| Method | Setup cost | Streaming cost |
|---|---|---|
| Direct (per-$s$ matrix construction) | $O(n^3 M)$ | |
| Iterated NTT (when applicable) | $O(n^2 \log n \cdot M)$ | |
| NTT setup + streaming (Method 3) | $O(n \log n \cdot M)$ | $O(n^2 \cdot (M + A))$ |
| Direct setup + Algorithm B (DSC) | $O(n^2 \cdot M)$ | $O(n^2 \cdot (S + A))$ |

Table 2. Wall-clock timings for full meridian-cycle computation on DSC primes, 30 runs averaged. “Direct” is per-s matrix construction plus matrix-vector multiplication. Algorithm A is scalar-coefficient streaming. Algorithm B is contribution-array streaming.

| p | n | Direct (s) | Alg. A (s) | Alg. B (s) | A/Direct | B/Direct | B/A |
|---|---|---|---|---|---|---|---|
| 5 | 4 | verify only | | | | | |
| 13 | 12 | 0.0002 | 0.0001 | 0.0001 | 2.7× | 2.1× | 0.77 |
| 29 | 28 | 0.0010 | 0.0003 | 0.0004 | 3.5× | 2.5× | 0.71 |
| 37 | 36 | 0.0015 | 0.0002 | 0.0003 | 8.0× | 5.4× | 0.67 |
| 53 | 52 | 0.0030 | 0.0003 | 0.0004 | 10.4× | 7.3× | 0.70 |
| 61 | 60 | 0.0043 | 0.0004 | 0.0005 | 12.0× | 9.1× | 0.76 |
| 101 | 100 | 0.0173 | 0.0006 | 0.0009 | 29.6× | 20.1× | 0.68 |
| 149 | 148 | 0.0553 | 0.0010 | 0.0015 | 55.6× | 37.6× | 0.68 |
| 173 | 172 | 0.0875 | 0.0012 | 0.0018 | 74.8× | 49.2× | 0.66 |
| 181 | 180 | 0.1023 | 0.0012 | 0.0019 | 82.0× | 53.9× | 0.66 |
| 197 | 196 | 0.1340 | 0.0014 | 0.0022 | 96.6× | 59.6× | 0.62 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
