Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Core Concepts
The authors present efficient sparse polynomial multiplication for the Dilithium lattice-based digital signature scheme on the ARM Cortex-M4 and Apple M2 platforms, with speedups of up to 30% and 55%, respectively, over state-of-the-art NTT-based multiplication.
Abstract
The paper develops efficient sparse polynomial multiplication for the Dilithium post-quantum signature scheme on the ARM Cortex-M4 and Apple M2 platforms.
For the ARM Cortex-M4, the authors:
- Optimize the sparse polynomial multiplication algorithm by encoding the challenge polynomial and implementing a branchless approach.
- Leverage the ARM Cortex-M4's DSP instructions to achieve 4-way parallelism for the secret polynomial vectors.
- Integrate the optimized sparse polynomial multiplication with the infinity norm checks in the Dilithium signing process, further enhancing signing efficiency.
- Optimize other Dilithium modules like polynomial sampling, rounding, and packing/unpacking using DSP instructions.
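The idea behind the first bullet can be sketched in portable C. This is a minimal illustration, not the authors' code: the `sparse_challenge` struct, the `ext` table, and the function name `sparse_mul` are hypothetical, and the DSP-instruction packing is not reproduced. The challenge is stored only as the positions and signs of its nonzero coefficients, and the secret polynomial is laid out as the table `(-s | s)` so that multiplying by `X^i` in `Z[X]/(X^256 + 1)` becomes a plain window read, with no branch for the wrap-around.

```c
#include <assert.h>
#include <stdint.h>

#define N 256      /* ring degree: polynomials live in Z_q[X]/(X^N + 1) */
#define MAX_TAU 64 /* room for the largest Dilithium challenge weight (60) */

/* Hypothetical encoding: only the nonzero challenge coefficients are stored. */
typedef struct {
    uint8_t pos[MAX_TAU];  /* exponent i of each nonzero term */
    int8_t  sign[MAX_TAU]; /* +1 or -1 */
    int     tau;           /* number of nonzero terms */
} sparse_challenge;

/* r = c * s in Z[X]/(X^N + 1), without reduction mod q. */
void sparse_mul(int32_t r[N], const sparse_challenge *c, const int8_t s[N])
{
    int16_t ext[2 * N];
    assert(c->tau >= 0 && c->tau <= MAX_TAU);

    /* ext = (-s | s): reading N coefficients starting at ext[N - i] yields
       X^i * s, with the wrapped-around part already negated. */
    for (int j = 0; j < N; j++) {
        ext[j] = (int16_t)(-s[j]);
        ext[N + j] = s[j];
    }
    for (int k = 0; k < N; k++)
        r[k] = 0;

    for (int t = 0; t < c->tau; t++) {
        const int16_t *win = ext + N - c->pos[t];
        const int32_t sg = c->sign[t];
        for (int k = 0; k < N; k++)
            r[k] += sg * win[k]; /* no branch on k or on the sign */
    }
}
```

The cost is `tau * N` additions, which is why the approach can compete with an NTT when the challenge has few nonzero coefficients.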
For the Apple M2, the authors:
- Use ARM NEON intrinsics to vectorize the sparse polynomial multiplication, achieving 16-way parallelism.
- Combine the sparse polynomial multiplication with parallel comparison functions to perform early rejection checks on the infinity norm.
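Fusing the multiplication with the infinity-norm check can be sketched as follows. This is a scalar stand-in under stated assumptions, not the authors' NEON code: the function name `mul_add_check`, the chunk width `LANES`, and the `(-s | s)` table are illustrative choices, and coefficients are assumed to be in centered (signed) representation. It computes `z = y + c*s` sixteen coefficients at a time, as a 16-lane vector unit would, and returns as soon as a chunk contains a coefficient with `|z[k]| >= bound`, so a rejected candidate does not pay for the remaining chunks.

```c
#include <assert.h>
#include <stdint.h>

#define N 256
#define LANES 16 /* chunk width; a scalar stand-in for 16 vector lanes */

/* Computes z = y + c*s chunk by chunk and stops at the first chunk that
   contains a coefficient with |z[k]| >= bound.
   Returns 1 if every coefficient passed, 0 on early rejection
   (z is then only partially written). */
int mul_add_check(int32_t z[N], const int32_t y[N],
                  const uint8_t pos[], const int8_t sign[], int tau,
                  const int8_t s[N], int32_t bound)
{
    int16_t ext[2 * N];
    assert(tau >= 0 && bound > 0);

    /* Same layout idea as a negacyclic shift table: (-s | s). */
    for (int j = 0; j < N; j++) {
        ext[j] = (int16_t)(-s[j]);
        ext[N + j] = s[j];
    }

    for (int base = 0; base < N; base += LANES) {
        int32_t acc[LANES] = {0};
        for (int t = 0; t < tau; t++) {
            const int16_t *win = ext + N - pos[t] + base;
            for (int l = 0; l < LANES; l++)
                acc[l] += sign[t] * win[l];
        }
        int bad = 0;
        for (int l = 0; l < LANES; l++) {
            int32_t v = y[base + l] + acc[l];
            z[base + l] = v;
            /* Comparison results are OR-ed, mirroring a vector compare. */
            bad |= (v >= bound) | (v <= -bound);
        }
        if (bad)
            return 0; /* early rejection: skip the remaining chunks */
    }
    return 1;
}
```

In Dilithium signing, the bound for `z = y + c*s1` is `gamma1 - beta`; a caller would restart the signing loop with fresh randomness when the function returns 0.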
The authors' optimized implementations achieve significant speedups of up to 30% on ARM Cortex-M4 and 55% on Apple M2 for sparse polynomial multiplication compared to the state-of-the-art Number-Theoretic Transform (NTT) implementation. They also demonstrate improvements in the overall Dilithium signing performance, reducing stack usage by 10.8%, 1.2%, and 7.7% for Dilithium2, Dilithium3, and Dilithium5, respectively, and enhancing signing performance by 0.4% to 0.8% compared to the previous ARM Cortex-M4 implementation. On the MacBook Air 2022, the authors' Dilithium implementation achieves 10% to 11% speedups in the signing procedure.
ESPM-D: Efficient Sparse Polynomial Multiplication for Dilithium on ARM Cortex-M4 and Apple M2
Stats
Dilithium2 sparse polynomial multiplication on Cortex-M4 achieves a 30% speedup over the state-of-the-art small-modulus NTT implementation.
Dilithium3 sparse polynomial multiplication on Cortex-M4 achieves an 11% speedup over the state-of-the-art small-modulus NTT implementation.
Dilithium5 sparse polynomial multiplication on Cortex-M4 is slightly slower than the state-of-the-art small-modulus NTT implementation.
Dilithium sparse polynomial multiplication on Apple M2 achieves 33% to 55% speedups over the NTT-based implementation for computing cs1 and cs2.
The optimized Dilithium implementation on Cortex-M4 reduces stack usage by 10.8%, 1.2%, and 7.7% in the signing procedure for Dilithium2, Dilithium3, and Dilithium5, respectively.
The optimized Dilithium implementation on Cortex-M4 enhances signing performance by 0.4% to 0.8% compared to the previous state-of-the-art implementation.
The optimized Dilithium implementation on Apple M2 achieves 10% to 11% speedups in the signing procedure compared to the previous state-of-the-art implementation.
Quotes
"Our optimized sparse polynomial multiplication achieves significant speedups of up to 30% on ARM Cortex-M4 and 55% on Apple M2 compared to the state-of-the-art Number-Theoretic Transform (NTT) implementation."
"Our optimized implementation not only reduces stack usage by 10.8%, 1.2%, and 7.7% in the signing procedure of Dilithium2, Dilithium3, and Dilithium5, respectively, but also enhances signing performance by 0.4% to 0.8% compared to the state-of-the-art ARM Cortex-M4 implementation."
"On the MacBook Air 2022, our Dilithium implementation achieves 10% to 11% speedups in the signing procedure."
Deeper Inquiries
How can the proposed sparse polynomial multiplication techniques be extended to other lattice-based cryptographic schemes beyond Dilithium?
The proposed sparse polynomial multiplication techniques can be extended to other lattice-based cryptographic schemes by adapting the algorithms to suit the specific requirements of each scheme. Since many lattice-based schemes involve polynomial operations, similar optimization strategies can be applied to improve efficiency. For example, techniques such as branchless sparse polynomial multiplication, parallelization, and encoding of challenge polynomials can be implemented in schemes like NTRUEncrypt, Kyber, and Saber. By customizing the algorithms to the polynomial structures and operations of each scheme, significant speedups and resource savings can be achieved across a range of lattice-based cryptographic applications.
What are the potential trade-offs or limitations of using 8-bit signed integer storage for the secret polynomial vectors in the Dilithium3 parameter set, and how can these be addressed?
Storing the secret polynomial vectors as 8-bit signed integers limits the range of values that can be represented. The secret coefficients themselves are small and fit comfortably, so the main risk is overflow when arithmetic results, such as accumulated sums of coefficients, are kept at the same width, which would silently corrupt the result. This can be addressed by accumulating in a wider data type for the critical operations, by bounding intermediate values analytically so that the narrow lanes provably cannot overflow, by adjusting the value range during computation, or by adding explicit overflow checks.
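To make the overflow concern concrete, the worst-case magnitude of a coefficient of `c*s` is `tau * eta`: the challenge has `tau` nonzero coefficients of ±1, and each secret coefficient lies in `[-eta, eta]`. The sketch below checks that bound against the signed 8-bit range, using the `(tau, eta)` values from the Dilithium specification: (39, 2) for Dilithium2, (49, 4) for Dilithium3, and (60, 2) for Dilithium5. The helper names are hypothetical, and this summary does not state how the authors handle the Dilithium3 case.

```c
#include <assert.h>
#include <stdint.h>

/* Largest possible |coefficient| of c*s: tau terms, each at most eta. */
int max_abs_product(int tau, int eta)
{
    assert(tau > 0 && eta > 0);
    return tau * eta;
}

/* 1 if every coefficient of c*s is guaranteed to fit a signed 8-bit lane. */
int fits_int8(int tau, int eta)
{
    return max_abs_product(tau, eta) <= INT8_MAX;
}
```

With these parameters the bounds are 78, 196, and 120. Only Dilithium3 exceeds 127, so an 8-bit accumulator could wrap there, while Dilithium2 and Dilithium5 stay in range.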
Given the performance improvements on ARM Cortex-M4 and Apple M2, what other embedded or mobile platforms could benefit from these optimization strategies, and what additional challenges might arise in porting the implementations to those platforms?
The optimization strategies employed for ARM Cortex-M4 and Apple M2 platforms can benefit other embedded or mobile platforms with similar architectures, such as ARMv7 and ARMv8-based processors. Platforms like STM32, Raspberry Pi, and Qualcomm Snapdragon could leverage these optimization techniques to enhance the performance of lattice-based cryptographic schemes. However, porting the implementations to different platforms may present challenges related to architecture-specific optimizations, memory constraints, and instruction set compatibility. Adapting the code to different platforms may require additional testing, tuning, and customization to ensure optimal performance and compatibility with the target hardware.