Submitted:
01 July 2024
Posted:
01 July 2024
Abstract
Keywords:
1. Introduction
2. Background
2.1. Post-Quantum Cryptography
2.2. Dilithium
Algorithm 1: Main algorithm of Dilithium.
2.3. Kyber
- Kyber.CPAPKE.KeyGen. Key generation first generates a matrix $A$ by sampling from the SHAKE128 hash function. It then samples the secret $s$ and the error $e$ using SHAKE128. The public key is computed as $t = As + e$, and the secret key is $s$.
- Kyber.CPAPKE.Encryption. Encryption first regenerates the matrix $A$ by sampling from the SHAKE128 hash function. Then, with freshly sampled small polynomials $r$, $e_1$, and $e_2$, it computes $u = A^{T}r + e_1$ and $v = t^{T}r + e_2 + \mathrm{Decompress}_q(m, 1)$. The ciphertext $c$ is computed as $c = (\mathrm{Compress}_q(u, d_u), \mathrm{Compress}_q(v, d_v))$.
- Kyber.CPAPKE.Decryption. Decryption uses the secret key $s$ and the ciphertext $c$ to restore $u$ and $v$ by decompressing the ciphertext. The original message is computed as $m = \mathrm{Compress}_q(v - s^{T}u, 1)$.
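The three CPAPKE steps can be illustrated with a minimal Python sketch. It follows the relations $t = As + e$, $u = A^{T}r + e_1$, $v = t^{T}r + e_2 + \lceil q/2 \rfloor m$, and $m = \mathrm{round}(v - s^{T}u)$ from the Kyber specification, but it is a toy under stated assumptions rather than the scheme itself: the ring degree is reduced to 32, ciphertext compression is omitted, polynomial products use schoolbook multiplication instead of the NTT, noise comes from Python's `random` module, and the matrix expansion from SHAKE128 is simplified (16-bit words reduced modulo $q$, which is slightly biased). All function names are hypothetical.

```python
import hashlib
import random

Q = 3329   # Kyber modulus
N = 32     # toy ring degree (Kyber uses 256); assumption to keep the sketch fast
K = 2      # module rank, as in Kyber512
ETA = 2    # bound of the centered binomial noise

def poly_mul(a, b):
    """Schoolbook product in Z_Q[x]/(x^N + 1)."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] = (res[i + j] + ai * bj) % Q
            else:  # x^N = -1 wraps around with a sign flip
                res[i + j - N] = (res[i + j - N] - ai * bj) % Q
    return res

def poly_add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def poly_sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def dot(u, v):
    """Inner product of two vectors of polynomials."""
    acc = [0] * N
    for a, b in zip(u, v):
        acc = poly_add(acc, poly_mul(a, b))
    return acc

def gen_matrix(seed):
    """Expand a seed into a K x K matrix with SHAKE128 (simplified sampling)."""
    matrix = []
    for i in range(K):
        row = []
        for j in range(K):
            stream = hashlib.shake_128(seed + bytes([i, j])).digest(2 * N)
            row.append([int.from_bytes(stream[2 * t:2 * t + 2], "little") % Q
                        for t in range(N)])
        matrix.append(row)
    return matrix

def cbd(rng):
    """Small polynomial with coefficients in [-ETA, ETA]."""
    return [sum(rng.getrandbits(1) for _ in range(ETA))
            - sum(rng.getrandbits(1) for _ in range(ETA)) for _ in range(N)]

def keygen(rng):
    seed = bytes(rng.getrandbits(8) for _ in range(32))
    A = gen_matrix(seed)
    s = [cbd(rng) for _ in range(K)]
    e = [cbd(rng) for _ in range(K)]
    t = [poly_add(dot(A[i], s), e[i]) for i in range(K)]  # t = A s + e
    return (seed, t), s

def encrypt(pk, msg_bits, rng):
    seed, t = pk
    A = gen_matrix(seed)
    r = [cbd(rng) for _ in range(K)]
    e1 = [cbd(rng) for _ in range(K)]
    e2 = cbd(rng)
    # u = A^T r + e1 (column i of A dotted with r)
    u = [poly_add(dot([A[j][i] for j in range(K)], r), e1[i]) for i in range(K)]
    # v = t^T r + e2 + round(Q/2) * m
    half = (Q + 1) // 2
    v = poly_add(poly_add(dot(t, r), e2), [half * bit for bit in msg_bits])
    return u, v

def decrypt(s, ciphertext):
    u, v = ciphertext
    w = poly_sub(v, dot(s, u))  # equals noise + round(Q/2) * m
    return [1 if Q // 4 < c < 3 * Q // 4 else 0 for c in w]

rng = random.Random(2024)
pk, sk = keygen(rng)
message = [rng.getrandbits(1) for _ in range(N)]
assert decrypt(sk, encrypt(pk, message, rng)) == message
```

With these toy parameters the accumulated noise is bounded by $2KN\eta^2 + \eta = 514 < q/4$, so decryption always succeeds; the real scheme relies on a negligible failure probability instead.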
Algorithm 2: Main algorithm of Kyber.
2.4. Falcon
Algorithm 3: Main algorithm of Falcon.
2.5. SPHINCS+
Algorithm 4: Main algorithm of SPHINCS+.
3. Related Works and Motivation
- 1. Versatility across applications and environments. Allows for a single solution adaptable to different security and performance requirements.
- 2. Reduced need for multiple specialized accelerators. Unifies HW resources, reducing overall cost and complexity.
- 3. Simplified maintenance and updates. Changes can be applied uniformly across all supported schemes, and the design is easier to adapt to evolving cryptographic standards.
- 4. Enhanced flexibility and longevity of the HW design. Ensures compatibility with future PQC standards and extends the useful life of the HW investment.
4. Design Methodology
4.1. Performance Profiling
- P1-Polynomial operations are commonly used, but data types differ. Three schemes (Dilithium, Kyber, and Falcon) all perform operations over polynomial data. Dilithium and Kyber operate over polynomials with coefficients in integer rings, requiring variants of Montgomery reduction [26] and the Number Theoretic Transform (NTT). Both functions are frequently used and represent significant hotspots in their execution profiles. Falcon also operates over polynomial data but uses polynomials with floating-point coefficients and performs Fast Fourier Transforms (FFT) instead of NTT, eliminating the need for modular reductions.
- P2-Dissimilar proportion of Keccak. Keccak is another hotspot function present in all four schemes, but it accounts for a varying proportion of execution time: 43% for Dilithium, 19% for Kyber, 3.7% for Falcon, and 99% for SPHINCS+. No statistics are available on which scheme is used most frequently, because PQC schemes are still being developed and are not yet deployed in industry. With no clear reason to prefer any specific scheme, we assume all four schemes will be used equally.


- P3-Distinct high-level operation sequence. Although the schemes share common functions, their high-level operation sequences differ significantly. This is due not only to the inherent differences between the algorithms but also to the varying polynomial lengths used in different parameter sets. For instance, computing the NTT [27] with parallel butterfly modules requires a different number of stages per scheme, and each stage takes a different number of cycles depending on the number of butterfly modules instantiated.
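To make the stage and butterfly structure mentioned in P1 and P3 concrete, the following Python sketch implements an in-place negacyclic NTT with the Dilithium parameters ($q = 8380417$, $N = 256$, 512th root of unity 1753, taken from the Dilithium specification). It uses plain modular arithmetic rather than Montgomery form, and the function names and the cycle model `ntt_cycles` are illustrative assumptions, not the datapath of the proposed design. The cycle model is idealized: it assumes every butterfly module retires one butterfly per cycle with no memory stalls.

```python
Q = 8380417   # Dilithium modulus
N = 256       # polynomial length
ROOT = 1753   # primitive 512th root of unity mod Q (Dilithium specification)

def bitrev8(i):
    """Reverse the 8-bit binary representation of i."""
    return int(format(i, "08b")[::-1], 2)

# Twiddle factors in bit-reversed order, as consumed by the layered butterflies.
ZETAS = [pow(ROOT, bitrev8(i), Q) for i in range(N)]

def ntt(poly):
    """Forward NTT: log2(N) = 8 stages of N/2 Cooley-Tukey butterflies each."""
    a = list(poly)
    k = 0
    length = N // 2
    while length >= 1:
        for start in range(0, N, 2 * length):
            k += 1
            zeta = ZETAS[k]
            for j in range(start, start + length):
                t = zeta * a[j + length] % Q
                a[j + length] = (a[j] - t) % Q
                a[j] = (a[j] + t) % Q
        length //= 2
    return a

def intt(poly):
    """Inverse NTT with Gentleman-Sande butterflies and a final scaling by 1/N."""
    a = list(poly)
    k = N
    length = 1
    while length < N:
        for start in range(0, N, 2 * length):
            k -= 1
            zeta_inv = -ZETAS[k] % Q
            for j in range(start, start + length):
                t = a[j]
                a[j] = (t + a[j + length]) % Q
                a[j + length] = (t - a[j + length]) * zeta_inv % Q
        length *= 2
    n_inv = pow(N, -1, Q)
    return [x * n_inv % Q for x in a]

def poly_mul_ntt(a, b):
    """Product in Z_Q[x]/(x^N + 1) via pointwise multiplication in the NTT domain."""
    return intt([x * y % Q for x, y in zip(ntt(a), ntt(b))])

def poly_mul_schoolbook(a, b):
    """Reference negacyclic product, used only to check the NTT path."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            sign = 1 if i + j < N else -1
            res[(i + j) % N] = (res[(i + j) % N] + sign * ai * bj) % Q
    return res

def ntt_cycles(n, stages, num_butterflies):
    """Idealized cycle count: each stage needs ceil((n/2) / num_butterflies) cycles."""
    return stages * -(-(n // 2) // num_butterflies)
```

For example, `ntt_cycles(256, 8, 4)` models the 8-stage Dilithium NTT on four butterfly modules. Kyber's specification uses a 7-stage NTT on polynomials of the same length, so `ntt_cycles(256, 7, 4)` would apply there under the same assumptions, which illustrates why the control sequence differs between schemes.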
4.2. Proposed Design
4.2.1. Keccak Acceleration Module (KAM)
- 1. KAM-Small. Optimized for minimal area consumption, it uses 5 KALUs (Keccak ALUs) to compute each step of the Keccak permutation, taking 5 cycles per step.
- 2. KAM-Large. A mid-range solution balancing area and performance, it has 25 KALUs, allowing each Keccak permutation step to be computed in a single cycle.
- 3. KAM-FP. For maximum performance, this variant has a fully-pipelined datapath that computes each round of permutation in a single cycle.
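The workload that the three KAM variants accelerate can be sketched in software. The Python code below is a straightforward model of the Keccak-f[1600] permutation and a SHAKE128 sponge built on it, checked against `hashlib`. The cycle estimate at the end rests on assumptions that the text does not state explicitly: a "step" is taken to be one of the five step mappings of a round (θ, ρ, π, χ, ι), each permutation has 24 rounds, and I/O and control overhead are ignored. The names `shake128` and `permutation_cycles` are hypothetical.

```python
import hashlib

MASK64 = (1 << 64) - 1

def rol64(value, n):
    """Rotate a 64-bit lane left by n bits."""
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64

def keccak_f1600(lanes):
    """Apply the 24-round permutation to a 5x5 array of 64-bit lanes."""
    lfsr = 1
    for _ in range(24):
        # theta: column parities mixed into every lane
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate each lane and move it to a new position
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear mixing along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant generated by an LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes

def shake128(message, out_len):
    """Return (digest, number of permutation calls) for SHAKE128."""
    rate = 168  # bytes absorbed or squeezed per permutation
    state = bytearray(200)
    calls = 0

    def permute():
        nonlocal calls
        lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = keccak_f1600(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
        calls += 1

    padded = bytearray(message) + b"\x1f"   # SHAKE domain separation and first pad bit
    padded += bytes(-len(padded) % rate)
    padded[-1] ^= 0x80                      # final pad bit
    for offset in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[offset + i]
        permute()
    out = bytearray()
    while len(out) < out_len:
        out += state[:rate]
        if len(out) < out_len:
            permute()
    return bytes(out[:out_len]), calls

# Assumed cycles per round: 5 steps x 5 cycles, 5 steps x 1 cycle, 1 cycle per round.
CYCLES_PER_ROUND = {"KAM-Small": 25, "KAM-Large": 5, "KAM-FP": 1}

def permutation_cycles(variant):
    """Idealized cycles for one 24-round permutation on the given KAM variant."""
    return 24 * CYCLES_PER_ROUND[variant]

digest, calls = shake128(b"post-quantum", 32)
assert digest == hashlib.shake_128(b"post-quantum").digest(32)
```

Multiplying the number of permutation calls by `permutation_cycles(variant)` gives a rough estimate of hashing latency for each variant under these assumptions; it is not a measured result.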

4.2.2. Joint Polynomial Arithmetic Unit (JPAU)
4.2.3. Control Unit
- Sample_polynomial. The UPCU initiates and manages the polynomial sampling process. This includes setting up necessary registers and handling data flow for efficient sampling.
- Polynomial_multiplication. The UPCU controls the sequence of multiplication and accumulation operations, coordinating data flow and setting up operands for the computation.
- NTT_INTT. The UPCU manages the NTT and INTT operations, controlling the butterfly units and Montgomery reduction units. It ensures efficient operations by adjusting control signals and managing data flow through various stages, utilizing the Twiddle factor ROM for different schemes.
5. Implementation
6. Evaluation
6.1. Dilithium and Kyber
| Operation | Parameter | Gupta et al. [20] | Aikata et al. [15] | Aikata et al. [6] | Ours_S | Ours_M | Ours_L |
| Keygen | Dilithium2 | - | 0.52 | 1.27 | 1.00 | 1.74 | 2.09 |
| Keygen | Dilithium3 | - | 0.57 | 1.39 | 1.00 | 1.76 | 2.08 |
| Keygen | Dilithium5 | 1.11 | 0.61 | 1.50 | 1.00 | 1.77 | 2.08 |
| Keygen | Kyber512 | - | - | 4.66 | 1.00 | 1.04 | 2.18 |
| Keygen | Kyber768 | - | - | 3.47 | 1.00 | 1.04 | 2.26 |
| Keygen | Kyber1024 | - | - | 3.08 | 1.00 | 1.04 | 2.31 |
| Sign/Encapsulate | Dilithium2 | - | 0.96 | 2.31 | 1.00 | 1.90 | 2.01 |
| Sign/Encapsulate | Dilithium3 | - | 1.08 | 2.63 | 1.00 | 1.92 | 2.01 |
| Sign/Encapsulate | Dilithium5 | 2.39 | 1.35 | 3.30 | 1.00 | 1.93 | 2.01 |
| Sign/Encapsulate | Kyber512 | - | - | 2.85 | 1.00 | 1.07 | 2.74 |
| Sign/Encapsulate | Kyber768 | - | - | 2.57 | 1.00 | 1.07 | 2.71 |
| Sign/Encapsulate | Kyber1024 | - | - | 2.34 | 1.00 | 1.07 | 2.68 |
| Verify/Decapsulate | Dilithium2 | - | 0.83 | 2.02 | 1.00 | 1.83 | 2.00 |
| Verify/Decapsulate | Dilithium3 | - | 0.85 | 2.08 | 1.00 | 1.85 | 2.01 |
| Verify/Decapsulate | Dilithium5 | 1.71 | 0.86 | 2.11 | 1.00 | 1.86 | 2.03 |
| Verify/Decapsulate | Kyber512 | - | - | 2.26 | 1.00 | 1.11 | 2.65 |
| Verify/Decapsulate | Kyber768 | - | - | 1.96 | 1.00 | 1.11 | 2.61 |
| Verify/Decapsulate | Kyber1024 | - | - | 2.10 | 1.00 | 1.11 | 2.57 |

| Parameter | Dilithium2 | Dilithium3 | Dilithium5 |
| q | 8380417 | 8380417 | 8380417 |
| N | 256 | 256 | 256 |
| (k,l) | (4,4) | (6,5) | (8,7) |
| $\gamma_2$ | (q-1)/88 | (q-1)/32 | (q-1)/32 |

| Parameter | Kyber512 | Kyber768 | Kyber1024 |
| q | 3329 | 3329 | 3329 |
| N | 256 | 256 | 256 |
| k | 2 | 3 | 4 |
| $\eta_1$ | 3 | 2 | 2 |
| $\eta_2$ | 2 | 2 | 2 |
| $(d_u, d_v)$ | (10,4) | (10,4) | (11,5) |
6.2. Falcon
6.3. SPHINCS+
| Parameter | n | h | d | log(t) | k | w | NIST Security Level |
| SPHINCS+-256s | 32 | 64 | 8 | 14 | 22 | 16 | 5 |
| SPHINCS+-256s robust | 32 | 64 | 8 | 14 | 22 | 16 | 5 |
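The parameters in this table determine the key and signature sizes. As a worked example, the short Python sketch below evaluates the size formulas from the SPHINCS+ specification [10] for the 256s parameter set; the function names are hypothetical, and the code only computes sizes, not signatures.

```python
import math

def wots_len(n, w):
    """Number of n-byte WOTS+ chain values: message digits plus checksum digits."""
    len1 = math.ceil(8 * n / math.log2(w))
    len2 = math.floor(math.log2(len1 * (w - 1)) / math.log2(w)) + 1
    return len1 + len2

def signature_bytes(n, h, d, log_t, k, w):
    """Randomizer + FORS signature + hypertree signature, all in n-byte blocks."""
    return (1 + k * (log_t + 1) + h + d * wots_len(n, w)) * n

def public_key_bytes(n):
    return 2 * n   # public seed and hypertree root

def secret_key_bytes(n):
    return 4 * n   # secret seed, PRF key, and a copy of the public key

# SPHINCS+-256s: n = 32, h = 64, d = 8, log(t) = 14, k = 22, w = 16
print(signature_bytes(32, 64, 8, 14, 22, 16))  # 29792
print(public_key_bytes(32), secret_key_bytes(32))  # 64 128
```

The simple and robust variants share these parameters, so their sizes are identical; they differ only in how the tweakable hash functions are instantiated.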
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
| 1 | For simplicity, we will refer to CRYSTALS-Kyber as Kyber and CRYSTALS-Dilithium as Dilithium throughout this paper. |
References
1. Carracedo, J.M.; Milliken, M.; Chouhan, P.K.; Scotney, B.; Lin, Z.; Sajjad, A.; Shackleton, M. Cryptography for Security in IoT. In Proceedings of the 2018 Fifth International Conference on Internet of Things: Systems, Management and Security; 2018; pp. 23–30.
2. Katzenbeisser, S.; Polian, I.; Regazzoni, F.; Stöttinger, M. Security in Autonomous Systems. In Proceedings of the 2019 IEEE European Test Symposium (ETS); 2019; pp. 1–8.
3. Muzikant, P.; Willemson, J. Deploying Post-quantum Algorithms in Existing Applications and Embedded Devices. In Proceedings of the Ubiquitous Security; Wang, G.; Wang, H.; Min, G.; Georgalas, N.; Meng, W., Eds., Singapore; 2024; pp. 147–162.
4. Kim, D.; Choi, H.; Seo, S.C. Parallel Implementation of SPHINCS+ With GPUs. IEEE Transactions on Circuits and Systems I: Regular Papers 2024, 71, 2810–2823.
5. Shor, P.W. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Review 1999, 41, 303–332.
6. Aikata, A.; Mert, A.C.; Imran, M.; Pagliarini, S.; Roy, S.S. KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium. IEEE Transactions on Circuits and Systems I: Regular Papers 2023, 70, 747–758.
7. Avanzi, R.; Bos, J.; Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schanck, J.M.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Kyber, 2020. Algorithm Specifications and Supporting Documentation, Submission to the NIST post-quantum project.
8. Ducas, L.; Kiltz, E.; Lepoint, T.; Lyubashevsky, V.; Schwabe, P.; Seiler, G.; Stehlé, D. CRYSTALS-Dilithium, 2020. Algorithm Specifications and Supporting Documentation, Submission to the NIST post-quantum project.
9. Fouque, P.A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Pornin, T.; Prest, T.; Ricosset, T.; Seiler, G.; Whyte, W.; Zhang, Z. Falcon: Fast-Fourier Lattice-based Compact Signatures over NTRU, 2020. Specification v1.2.
10. Aumasson, J.P.; Bernstein, D.J.; Beullens, W.; Dobraunig, C.; Eichlseder, M.; Fluhrer, S.; Gazdag, S.L.; Hülsing, A.; Kampanakis, P.; Kölbl, S.; et al. SPHINCS+ specification, 2022. Submission to the NIST post-quantum project.
11. NIST. Selected Algorithms 2022, July 2022. https://csrc.nist.gov/projects/post-quantum-cryptography/selected-algorithms-2022.
12. Prest, T.; Fouque, P.A.; Hoffstein, J.; Kirchner, P.; Lyubashevsky, V.; Pornin, T.; Ricosset, T.; Seiler, G.; Whyte, W.; Zhang, Z. Falcon. Post-Quantum Cryptography Project of NIST, 2020.
13. Seo, E.Y.; Kim, Y.S.; Lee, J.W.; No, J.S. Peregrine: toward fastest FALCON based on GPV framework. Cryptology ePrint Archive 2022.
14. Bernstein, D.J.; Hopwood, D.; Hülsing, A.; Lange, T.; Niederhagen, R.; Papachristodoulou, L.; Schneider, M.; Schwabe, P.; Wilcox-O’Hearn, Z. SPHINCS: Practical Stateless Hash-Based Signatures. In Proceedings of the Advances in Cryptology – EUROCRYPT 2015; Oswald, E.; Fischlin, M., Eds., Berlin, Heidelberg; 2015; pp. 368–397.
15. Aikata, A.; Mert, A.C.; Jacquemin, D.; Das, A.; Matthews, D.; Ghosh, S.; Roy, S.S. A Unified Cryptoprocessor for Lattice-Based Signature and Key-Exchange. IEEE Transactions on Computers 2023, 72, 1568–1580.
16. Basso, A.; Bermudo Mera, J.M.; D’Anvers, J.P.; Karmakar, A.; Sinha Roy, S.; Van Beirendonck, M.; Vercauteren, F. SABER: Mod-LWR based KEM (Round 3 Submission), 2017. SABER submission package for round 3.
17. Lee, J.; Kim, W.; Kim, J.H. A Programmable Crypto-Processor for National Institute of Standards and Technology Post-Quantum Cryptography Standardization Based on the RISC-V Architecture. Sensors 2023, 23.
18. Nguyen, T.H.; Kieu-Do-Nguyen, B.; Pham, C.K.; Hoang, T.T. High-Speed NTT Accelerator for CRYSTAL-Kyber and CRYSTAL-Dilithium. IEEE Access 2024, 12, 34918–34930.
19. Lee, Y.; Youn, J.; Nam, K.; Jung, H.H.; Cho, M.; Na, J.; Park, J.Y.; Jeon, S.; Kang, B.G.; Oh, H.; et al. An Efficient Hardware/Software Co-Design for FALCON on Low-End Embedded Systems. IEEE Access 2024, 12, 57947–57958.
20. Gupta, N.; Jati, A.; Chattopadhyay, A.; Jha, G. Lightweight Hardware Accelerator for Post-Quantum Digital Signature CRYSTALS-Dilithium. IEEE Transactions on Circuits and Systems I: Regular Papers 2023, 70, 3234–3243.
21. Wagner, A.; Oberhansl, F.; Schink, M. To Be, or Not to Be Stateful: Post-Quantum Secure Boot using Hash-Based Signatures. In Proceedings of the 2022 Workshop on Attacks and Solutions in Hardware Security (ASHES ’22), New York, NY, USA, 2022; pp. 85–94.
22. Mandal, S.; Roy, D.B. KiD: A Hardware Design Framework Targeting Unified NTT Multiplication for CRYSTALS-Kyber and CRYSTALS-Dilithium on FPGA. In Proceedings of the 2024 37th International Conference on VLSI Design and 2024 23rd International Conference on Embedded Systems (VLSID); 2024; pp. 455–460.
23. Beckwith, L.; Nguyen, D.T.; Gaj, K. Hardware Accelerators for Digital Signature Algorithms Dilithium and FALCON. IEEE Design & Test.
24. Bisheh-Niasar, M.; Azarderakhsh, R.; Mozaffari-Kermani, M. A Monolithic Hardware Implementation of Kyber: Comparing Apples to Apples in PQC Candidates. In Proceedings of the Progress in Cryptology – LATINCRYPT 2021; Longa, P.; Ràfols, C., Eds., Cham; 2021; pp. 108–126.
25. Intel Inc. Intel VTune Profiler, 2023. https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html/.
26. Montgomery, P.L. Modular multiplication without trial division. Mathematics of Computation 1985, 44, 519–521.
27. Bekele, A. Cooley-Tukey FFT algorithms. Advanced Algorithms 2016.
28. Soni, D.; Basu, K.; Nabeel, M.; Aaraj, N.; Manzano, M.; Karri, R. Hardware Architectures for Post-Quantum Digital Signature Schemes 2021.
29. Synopsys Inc. Synopsys Design Compiler. https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/dc-ultra.html.
30. Martins, M.; Matos, J.M.; Ribas, R.P.; Reis, A.; Schlinker, G.; Rech, L.; Michelsen, J. Open Cell Library in 15nm FreePDK Technology. In Proceedings of the 2015 International Symposium on Physical Design (ISPD ’15), New York, NY, USA, 2015; pp. 171–178.
31. Amiet, D.; Leuenberger, L.; Curiger, A.; Zbinden, P. FPGA-based SPHINCS+ Implementations: Mind the Glitch. In Proceedings of the 2020 23rd Euromicro Conference on Digital System Design (DSD); 2020; pp. 229–237.




| | Kyber | Dilithium | FALCON | SPHINCS+ |
| Algorithm Type | Key Exchange (KEA) | Digital Signature (DSA) | Digital Signature (DSA) | Digital Signature (DSA) |
| Underlying Approach | Lattice-Based | Lattice-Based | Lattice-Based | Hash-Based |

| Mnemonic | Opcode | Description |
| NOP | 0000 | No operation |
| ADD | 0001 | Result[i] ← vec_a[i] + vec_b[i] |
| SUB | 0010 | Result[i] ← vec_a[i] - vec_b[i] |
| CADDQ | 0011 | Result[i] ← (vec_a[i] < 0) ? vec_a[i] + Q : vec_a[i] |
| MULT | 0100 | Result[i] ← vec_a[i] × vec_b[i] |
| SHIFT | 0101 | Result[i] ← vec_a[i] << SHIFT_AMOUNT |
| REDUCE | 0110 | Result[i] ← MontgomeryReduction(vec_a[i]) |
| AND | 0111 | Result[i] ← vec_a[i] AND vec_b[i] |
| OR | 1000 | Result[i] ← vec_a[i] OR vec_b[i] |
| XOR | 1001 | Result[i] ← vec_a[i] XOR vec_b[i] |
| NTT_BUTTERFLY | 1010 | Result ← Butterfly(vec_a) |
| INTT_BUTTERFLY | 1011 | Result ← InvButterfly(vec_a) |
| COMP | 1100 | Comp_result[i] ← COMPARE(vec_a[i], vec_b[i]) |
| RESERVED | 1101-1111 | - |
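The element-wise opcodes in this table can be modeled with a small Python interpreter. This is a behavioral sketch under assumptions, not the hardware: lanes are unbounded Python integers rather than fixed-width words, REDUCE is modeled as unsigned Montgomery reduction with an assumed radix $R = 2^{32}$ and the Dilithium modulus, and the butterfly and COMP opcodes are left out because the table does not define their exact semantics. The names `execute` and `montgomery_reduce` are hypothetical.

```python
Q = 8380417                        # Dilithium modulus, used here as the example Q
R_BITS = 32                        # assumed Montgomery radix R = 2^32
R = 1 << R_BITS
NEG_Q_INV = (-pow(Q, -1, R)) % R   # -Q^{-1} mod R, precomputed once

def montgomery_reduce(a):
    """Return a * R^{-1} mod Q for 0 <= a < Q * R, without dividing by Q."""
    m = (a * NEG_Q_INV) % R        # chosen so that a + m * Q is divisible by R
    t = (a + m * Q) >> R_BITS
    return t - Q if t >= Q else t

NOP, ADD, SUB, CADDQ, MULT, SHIFT, REDUCE, AND, OR, XOR = range(10)

# One lambda per element-wise opcode; arguments are (a, b, shift_amount).
ELEMENTWISE = {
    ADD: lambda a, b, sh: a + b,
    SUB: lambda a, b, sh: a - b,
    CADDQ: lambda a, b, sh: a + Q if a < 0 else a,
    MULT: lambda a, b, sh: a * b,
    SHIFT: lambda a, b, sh: a << sh,
    REDUCE: lambda a, b, sh: montgomery_reduce(a),
    AND: lambda a, b, sh: a & b,
    OR: lambda a, b, sh: a | b,
    XOR: lambda a, b, sh: a ^ b,
}

def execute(opcode, vec_a, vec_b=None, shift_amount=0):
    """Apply one opcode to every lane and return the result vector."""
    if opcode == NOP:
        return list(vec_a)
    if opcode not in ELEMENTWISE:
        raise NotImplementedError(f"opcode {opcode:04b} is not modeled in this sketch")
    if vec_b is None:
        vec_b = [0] * len(vec_a)
    operation = ELEMENTWISE[opcode]
    return [operation(a, b, shift_amount) for a, b in zip(vec_a, vec_b)]

# A modular multiplication is MULT followed by REDUCE; the result carries a
# factor R^{-1}, which Montgomery-form operands would cancel.
products = execute(MULT, [123456, 7], [654321, 9])
reduced = execute(REDUCE, products)
assert reduced == [p * pow(R, -1, Q) % Q for p in products]
```

Chaining SUB and CADDQ in the same way gives a subtraction whose negative results are mapped back into the range $[0, Q)$.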
| Parameter | Falcon512 | Falcon1024 | Peregrine*512 | Peregrine*1024 |
| q | 12289 | 12289 | 12289 | 12289 |
| N | 512 | 1024 | 512 | 1024 |
| b | 34034726 | 70265242 | 34034726 | 150700176 |

| Operation | Parameter | CPU (AVX) | Wagner et al. [21] | Wagner et al. [21] extended | Amiet et al. [31] | Ours_S | Ours_M | Ours_L |
| Keygen | 256s-simple | 0.05 | - | - | - | 1.00 | 1.00 | 2.95 |
| Keygen | 256s-robust | 0.04 | - | - | - | 1.00 | 1.00 | 2.97 |
| Sign | 256s-simple | 0.03 | - | - | 0.82 | 1.00 | 1.00 | 2.95 |
| Sign | 256s-robust | 0.03 | - | - | 0.83 | 1.00 | 1.00 | 2.97 |
| Verify | 256s-simple | 0.01 | 0.03 | 0.04 | 0.06 | 1.00 | 1.00 | 2.58 |
| Verify | 256s-robust | 0.01 | 0.02 | 0.04 | 0.08 | 1.00 | 1.00 | 2.59 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).





