Zooplankton are critical bioindicators of marine ecosystem health, yet classifying them accurately remains difficult because of subtle inter-class differences, substantial intra-class variation, and complex underwater imaging conditions. Vision Transformers (ViTs) have shown promise in fine-grained recognition, but they are weak at the local feature modeling and multi-scale perception that zooplankton identification requires. This paper proposes ViT-MDFA, an architecture that integrates a Multi-scale Dilated Convolution (MSDC) module and a Dual Attention (DA) mechanism into the Vision Transformer framework. The MSDC module applies parallel dilated convolutions at several dilation rates to capture both fine-grained textures and global structural patterns without the parameter cost of larger kernels. The DA mechanism combines channel-wise and spatial attention to emphasize diagnostically relevant features and suppress background interference. Evaluations on four benchmarks (WHOI-Plankton, ZooScanNet, Kaggle-Plankton, and Dec-22) show that ViT-MDFA achieves state-of-the-art performance, reaching 97.46\% accuracy and a 96.73\% F1-score on the Dec-22 dataset and surpassing the ViT-B/16 baseline by 6.14\% and 7.79\%, respectively. Ablation studies confirm the individual contributions of the MSDC (+1.83\%) and DA (+0.81\%) modules, and a sensitivity analysis identifies [6,12,18] as the best-performing dilation rate configuration. Grad-CAM visualizations show that ViT-MDFA consistently attends to biologically meaningful morphological structures such as antennae, body segments, and caudal spines. The architecture retains a lightweight, modular design suited to deployment on flow cytometers and edge computing platforms, enabling real-time, automated zooplankton monitoring for marine ecological assessment.
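
To make the multi-scale claim concrete, the following minimal sketch computes the spatial extent covered by each parallel branch for the dilation rates [6,12,18]. It assumes a 3x3 kernel, which the abstract does not state. The function names are hypothetical and illustrative only, and this is not the authors' implementation. The sketch also shows that the weight count of a convolution does not depend on its dilation rate.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (in pixels) spanned by one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a 2D convolution, bias excluded. Dilation is not a factor."""
    return kernel_size * kernel_size * in_channels * out_channels


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated cross-correlation, written in pure Python."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


KERNEL_SIZE = 3                # assumption: not given in the abstract
DILATION_RATES = [6, 12, 18]   # configuration reported in the abstract

# Extent covered by each parallel branch, keyed by dilation rate.
extents = {d: effective_kernel_size(KERNEL_SIZE, d) for d in DILATION_RATES}

for rate, extent in extents.items():
    print(f"dilation {rate:2d}: covers {extent} x {extent} pixels "
          f"using {KERNEL_SIZE * KERNEL_SIZE} weights per channel pair")
```

Under the 3x3 assumption, the three branches span 13, 25, and 37 pixels per side while each uses the same nine weights per input-output channel pair, which is why parallel dilated branches can widen the receptive field without enlarging the kernels themselves.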