Preprint
Article

This version is not peer-reviewed.

Low-Light Video Enhancement via Fast–Slow Dual Branches and Flow-Guided Attention

Submitted:

17 April 2026

Posted:

17 April 2026


Abstract
Low-light video enhancement aims to restore clear, color-faithful, and temporally consistent visual content from video sequences captured under extremely low signal-to-noise ratios and high dynamic range constraints. Existing multi-frame enhancement methods typically adopt uniform spatio-temporal sampling and feature extraction strategies for all frames, making it challenging to simultaneously achieve long-range temporal denoising and accurate fast-motion modeling. To address this trade-off, we propose a low-light video enhancement framework based on a Fast–Slow dual-branch architecture. The video signal is decomposed into two complementary feature streams: a Slow branch with sparse temporal sampling and high spatial resolution, built on a Vision Transformer backbone, which focuses on long-range temporal denoising and high-frequency texture restoration for static and slow-moving regions; and a Fast branch with dense temporal sampling and low spatial resolution, built on a ViT-Tiny backbone, which efficiently captures large-scale motion and rapid illumination changes. To mitigate the discrepancy in sampling rates and spatial resolutions between the two branches, we further introduce a flow branch based on a pre-trained StreamFlow model and design a Flow-Guided Cross-Attention (FGCA) module. FGCA first uses optical flow to geometrically modulate and progressively align Fast-branch features, and then injects the flow-enhanced Fast features into the Slow branch at each space-time location via lightweight pixel-wise cross-attention. This mechanism achieves a cascade of coarse geometric alignment and fine semantic fusion. 
Experiments on two real-world low-light video datasets, SDSD-indoor and SDSD-outdoor, demonstrate that our method consistently outperforms several representative approaches in terms of PSNR, SSIM, AB(Var), and MABD, while effectively suppressing motion blur and ghosting artifacts in dynamic night scenes, yielding temporally stable and perceptually pleasing results.
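As a rough illustration of the FGCA idea described above, the sketch below warps each densely sampled Fast-branch feature toward the Slow frame with optical flow (coarse geometric alignment), then fuses the warped features into the Slow feature via pixel-wise cross-attention with a residual connection (fine semantic fusion). This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: all function names, shapes, and the single-head dot-product form are illustrative.

```python
import numpy as np

def flow_warp(feat, flow):
    """Backward-warp a feature map (C, H, W) with a flow field (2, H, W).

    flow[0]/flow[1] hold per-pixel x/y displacements; bilinear sampling
    with border clamping. This models the coarse geometric alignment step.
    """
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0]
            + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0]
            + wy * wx * feat[:, y1, x1])

def fgca(slow, fast_stack, flows):
    """Flow-Guided Cross-Attention sketch.

    slow       : (C, H, W)    Slow-branch feature (query).
    fast_stack : (T, C, H, W) densely sampled Fast-branch features.
    flows      : (T, 2, H, W) flow from each fast frame to the slow frame.

    Each fast feature is first flow-warped, then a per-pixel softmax over
    the T warped features weights their contribution (keys/values); a
    residual connection preserves the Slow-branch content.
    """
    C, H, W = slow.shape
    warped = np.stack([flow_warp(f, fl) for f, fl in zip(fast_stack, flows)])
    logits = np.einsum("chw,tchw->thw", slow, warped) / np.sqrt(C)
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)          # softmax over T fast frames
    fused = np.einsum("thw,tchw->chw", attn, warped)
    return slow + fused
```

In a real model the queries, keys, and values would come from learned projections and the warp would operate on multi-scale features; the sketch keeps only the two-stage structure (warp, then pixel-wise attention) that the abstract describes.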
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
