Preprint
Article

This version is not peer-reviewed.

Vision Transformer with Multi-Scale Dilated Convolution and Dual Attention Fusion for Robust Zooplankton Classification

Submitted: 16 June 2026

Posted: 17 June 2026


Abstract
Zooplankton serve as critical bioindicators of marine ecosystem health, yet their accurate classification remains challenging due to subtle inter-class differences, substantial intra-class variations, and complex underwater imaging conditions. While Vision Transformers (ViTs) have shown promise in fine-grained recognition, they struggle with the local feature modeling and multi-scale perception essential for zooplankton identification. This paper proposes ViT-MDFA, a novel architecture that synergistically integrates Multi-scale Dilated Convolution (MSDC) and Dual Attention (DA) mechanisms into the Vision Transformer framework. The MSDC module employs parallel dilated convolutions with strategically selected dilation rates to capture both fine-grained textures and global structural patterns without computational overhead. The DA mechanism combines channel-wise and spatial attention to adaptively emphasize diagnostically relevant features while suppressing background interference. Extensive evaluations across four challenging benchmarks (WHOI-Plankton, ZooScanNet, Kaggle-Plankton, and Dec-22) demonstrate that ViT-MDFA achieves state-of-the-art performance, reaching 97.46% accuracy and 96.73% F1-score on the Dec-22 dataset, surpassing the baseline ViT-B/16 by remarkable margins of 6.14% and 7.79%, respectively. Comprehensive ablation studies validate the individual contributions of MSDC (+1.83%) and DA (+0.81%) modules, with sensitivity analysis identifying the optimal dilation rate configuration [6, 12, 18]. Grad-CAM visualizations reveal that ViT-MDFA consistently attends to biologically meaningful morphological structures, such as antennae, body segments, and caudal spines. The proposed architecture achieves this superior performance while maintaining a lightweight, modular design suitable for deployment on flow cytometers and edge computing platforms, thereby enabling real-time, automated zooplankton monitoring for marine ecological assessment.
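To make the multi-scale claim concrete, the following minimal sketch computes how far a single dilated kernel reaches at each of the reported dilation rates [6, 12, 18]. The 3×3 kernel size is an assumption (the abstract does not state it), and the helper names `effective_kernel_size` and `same_padding` are hypothetical, introduced only for illustration. It uses the standard relation that a kernel of size k with dilation d spans k + (k − 1)(d − 1) pixels along each axis.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Pixels spanned along one axis by a kernel with the given dilation."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Per-side padding that preserves spatial size at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


KERNEL_SIZE = 3                 # assumption: not stated in the abstract
DILATION_RATES = [6, 12, 18]    # configuration reported in the abstract

branches = [
    {
        "dilation": d,
        "span": effective_kernel_size(KERNEL_SIZE, d),
        "padding": same_padding(KERNEL_SIZE, d),
        # Dilation spreads the taps apart; it does not add taps.
        "weights_per_filter_channel": KERNEL_SIZE ** 2,
    }
    for d in DILATION_RATES
]

for branch in branches:
    print(branch)
```

Under the 3×3 assumption, every branch keeps the same number of weights per filter and input channel while its span grows with the dilation rate, which is the usual motivation for running dilated branches in parallel.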
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated