Vision Transformer with Multi-Scale Dilated Convolution and Dual Attention Fusion for Robust Zooplankton Classification

Matteo Rossi; Haowei Li

doi:10.20944/preprints202606.1348.v1

Submitted: 16 June 2026

Posted: 17 June 2026


Abstract

Zooplankton serve as critical bioindicators of marine ecosystem health, yet their accurate classification remains challenging due to subtle inter-class differences, substantial intra-class variations, and complex underwater imaging conditions. While Vision Transformers (ViTs) have shown promise in fine-grained recognition, they struggle with the local feature modeling and multi-scale perception essential for zooplankton identification. This paper proposes ViT-MDFA, a novel architecture that synergistically integrates Multi-scale Dilated Convolution (MSDC) and Dual Attention (DA) mechanisms into the Vision Transformer framework. The MSDC module employs parallel dilated convolutions with strategically selected dilation rates to capture both fine-grained textures and global structural patterns without computational overhead. The DA mechanism combines channel-wise and spatial attention to adaptively emphasize diagnostically relevant features while suppressing background interference. Extensive evaluations across four challenging benchmarks (WHOI-Plankton, ZooScanNet, Kaggle-Plankton, and Dec-22) demonstrate that ViT-MDFA achieves state-of-the-art performance, reaching 97.46% accuracy and 96.73% F1-score on the Dec-22 dataset, surpassing the baseline ViT-B/16 by remarkable margins of 6.14% and 7.79%, respectively. Comprehensive ablation studies validate the individual contributions of the MSDC (+1.83%) and DA (+0.81%) modules, with sensitivity analysis identifying the optimal dilation rate configuration [6, 12, 18]. Grad-CAM visualizations reveal that ViT-MDFA consistently attends to biologically meaningful morphological structures, such as antennae, body segments, and caudal spines. The proposed architecture achieves this superior performance while maintaining a lightweight, modular design suitable for deployment on flow cytometers and edge computing platforms, thereby enabling real-time, automated zooplankton monitoring for marine ecological assessment.
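As a rough illustration of how parallel dilated convolutions can widen the field of view without adding weights, the sketch below computes the span covered by one kernel at each of the reported dilation rates [6, 12, 18]. It assumes 3x3 kernels and stride 1, neither of which is stated in the abstract, and the function names are hypothetical; it is not the authors' MSDC implementation.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span, in input pixels along one axis, covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that preserves spatial size for stride 1 and an odd kernel size."""
    return dilation * (kernel_size - 1) // 2


KERNEL_SIZE = 3               # assumption: kernel size is not given in the abstract
DILATION_RATES = [6, 12, 18]  # configuration reported as optimal in the abstract

for d in DILATION_RATES:
    span = effective_kernel_size(KERNEL_SIZE, d)
    pad = same_padding(KERNEL_SIZE, d)
    # The weight count per input/output channel pair does not depend on dilation.
    weights = KERNEL_SIZE * KERNEL_SIZE
    print(f"dilation={d:2d}  span={span}x{span}  padding={pad}  weights={weights}")
```

Under these assumptions the three branches cover 13x13, 25x25, and 37x37 input windows while each still uses nine weights per channel pair, which is one way to read the abstract's claim of multi-scale perception in a lightweight design.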

Keywords: zooplankton classification; fine-grained visual recognition; vision Transformer; multi-scale dilated convolution; dual attention mechanism; deep learning; marine ecological monitoring

Subject: Computer Science and Mathematics - Computer Vision and Graphics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the authors and the preprint are cited in any reuse.
