Preprint
Article

This version is not peer-reviewed.

Towards NVIDIA 1:4 Semi-Structured 75% Sparsity via Cannistraci–Hebb N:M Dynamic Sparse-to-Sparse Training

Submitted: 09 May 2026
Posted: 13 May 2026

Abstract
Early 2026 has witnessed significant volatility in the oil market, and an energy crisis is expected in the coming months. Large-scale LLM inference continues to consume substantial power in data centers, so improving inference efficiency is increasingly important for the energy sustainability of the AI economy. Semi-structured N:M sparsity, most notably 2:4 (50%) and potentially 1:4 (75%), offers a hardware-friendly path to lower compute and energy consumption, and is supported in modern GPU designs. Yet existing training methods for 2:4 sparsity (e.g., STE-based approaches) often incur large accuracy drops relative to dense baselines, and practical support for 1:4 remains limited in current software stacks. As a result, attention has shifted toward quantization and mixture-of-experts, leaving high-sparsity N:M pre-training underexplored. Here we introduce a paradigm shift: we treat neural networks as complex systems whose sparse connectivity can be trained using the network-science principles formalized by Cannistraci–Hebb sparse-to-sparse training (CHT), coupled with a tailored optimizer. We propose CHTsNM, a sparse-to-sparse training framework centered on Topology-Aware Newton–Schulz (TANS) optimization. TANS makes Newton–Schulz-style matrix updates compatible with dynamically changing semi-structured sparse topologies via active-mask projection, active-support RMS matching, and refresh-aware ramping after topology updates. CHTsNM further incorporates two lightweight mechanisms: Contextually Modulated LoRA (CoMoLoRA) for input-adaptive low-rank residual compensation, and Motif Pattern Revisitation (MPR) to improve exploration of legal row-wise N:M patterns. Across 4 LLaMA pre-training benchmarks, CHTsNM with 2:4 sparsity achieves performance close to dense baselines on most tasks and yields sparse-over-dense gains on 8 tasks. With 1:4 sparsity, performance approaches the dense baseline, though it does not yet consistently surpass it. For hardware evaluation, we report measured speedups for native 2:4 execution on current NVIDIA GPUs, and provide a clearly labeled CSR sparse-GEMM surrogate analysis to estimate the acceleration potential of 1:4. Overall, although 1:4 execution is not yet implemented in hardware, our results identify 1:4 sparse pre-training as a promising direction and establish TANS sparse-to-sparse optimization as a practical step toward future high-sparsity N:M accelerators.
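To make the abstract's terminology concrete, the following is a minimal PyTorch sketch, not the authors' released code, of two ideas named above: projecting a weight matrix onto a row-wise N:M semi-structured mask (keeping the N largest-magnitude entries in each group of M consecutive weights, so 1:4 yields 75% sparsity), and a hypothetical TANS-like step that orthogonalizes the gradient with a plain Newton–Schulz iteration, projects it onto the active mask, and rescales it to match the raw gradient's RMS over the active support. All function names and hyperparameters here are illustrative assumptions; refresh-aware ramping, CoMoLoRA, and MPR are omitted.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 1, m: int = 4) -> torch.Tensor:
    """Boolean mask keeping the n largest-|w| entries in each group of m
    consecutive entries along the last dimension (1:4 -> 75% sparsity)."""
    rows, cols = weight.shape
    assert cols % m == 0, "columns must be divisible by the group size m"
    groups = weight.abs().reshape(rows, cols // m, m)
    topk = groups.topk(n, dim=-1).indices       # top-n magnitudes per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, topk, True)
    return mask.reshape(rows, cols)

def newton_schulz_orth(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Plain Newton-Schulz iteration pushing g toward its nearest
    orthogonal factor (as used by Muon-style matrix optimizers)."""
    x = g / (g.norm() + 1e-7)                   # normalize so iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.transpose(-1, -2) @ x
    return x

def tans_step(weight, grad, mask, lr=1e-2):
    """One hypothetical TANS-like update: orthogonalize the gradient,
    project onto the active N:M support (active-mask projection), then
    rescale so its RMS over active entries matches the raw gradient's
    (active-support RMS matching)."""
    update = newton_schulz_orth(grad)
    update = update * mask                       # active-mask projection
    active = mask.sum().clamp(min=1)
    target_rms = (grad * mask).pow(2).sum().div(active).sqrt()
    current_rms = update.pow(2).sum().div(active).sqrt() + 1e-12
    update = update * (target_rms / current_rms)  # active-support RMS matching
    return (weight - lr * update) * mask         # weights stay on the support

# Demo: nm_mask(W, n=2, m=4) gives the 2:4 pattern that current NVIDIA
# sparse tensor cores accelerate; n=1, m=4 gives the 1:4 (75%) pattern.
W = torch.randn(8, 16)
mask = nm_mask(W, n=1, m=4)
W = tans_step(W * mask, torch.randn_like(W), mask)
```

This sketch only illustrates the legal row-wise N:M patterns and the mask-aware update flow described in the abstract; the paper's actual topology-regrowth rule (CHT link prediction) and training schedule are not reproduced here.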
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.