Submitted: 15 September 2025
Posted: 16 September 2025
Abstract
Keywords:
1. Introduction
- Underwater imaging hardware framework: A theoretical analysis system is established covering power modeling, endurance estimation, and thermal constraints. Combined with material selection and empirical data, a feasible solution is proposed for shallow-, mid-, and deep-water mission modes.
- Image processing module: HAT-based super-resolution and DICAM staged enhancement are integrated for clarity restoration, color correction, and contrast enhancement. Experiments and ablation studies confirm that this module substantially improves detection accuracy.
- Adaptive Vision Transformer (A-ViT): Dynamic token pruning and early-exit strategies enable on-demand computation, significantly reducing latency and GPU memory usage. Together with fallback and key-region preservation mechanisms, A-ViT enhances robustness under adversarial perturbations.
- Improved detector architecture (YOLOv11-CA_HSFPN): Coordinate attention and a high-order spatial feature pyramid are incorporated into the neck, boosting detection of small and blurred objects while providing accuracy and robustness gains with minimal additional computational cost.
2. Materials and Methods
2.1. Underwater Imaging Hardware Design
- Main control platform: Embedded processing unit, with power consumption denoted as .
- Imaging device: Image sensor or camera module, with power consumption denoted as .
- Illumination module: Light source unit (e.g., LED array), with power consumption denoted as and duty cycle ,where increases with operating depth.
- Power supply system: Battery pack with nominal voltage , capacity , and total energy:
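
The components listed above can be combined into a simple energy budget. The sketch below is illustrative only: the symbol names (`p_ctrl_w`, `p_cam_w`, `p_led_w`, `led_duty`), the additive power model, the `usable_fraction` derating term, and all numeric values are assumptions introduced here, not the authors' notation or measurements.

```python
from dataclasses import dataclass


@dataclass
class PowerBudget:
    """Hypothetical container for the component powers named in the list above."""
    p_ctrl_w: float   # main control platform power (W)
    p_cam_w: float    # imaging device power (W)
    p_led_w: float    # illumination power at 100% duty (W)
    led_duty: float   # LED duty cycle in [0, 1]; grows with operating depth

    def total_power_w(self) -> float:
        # Assumed additive model: only the LED term is scaled by its duty cycle.
        return self.p_ctrl_w + self.p_cam_w + self.led_duty * self.p_led_w


def battery_energy_wh(voltage_v: float, capacity_ah: float) -> float:
    """Nominal battery energy: voltage times capacity."""
    return voltage_v * capacity_ah


def endurance_h(energy_wh: float, power_w: float, usable_fraction: float = 1.0) -> float:
    """Runtime in hours; usable_fraction is an assumed derating factor."""
    return usable_fraction * energy_wh / power_w


# Placeholder numbers chosen only to make the arithmetic easy to follow.
budget = PowerBudget(p_ctrl_w=4.0, p_cam_w=2.0, p_led_w=10.0, led_duty=0.2)
energy = battery_energy_wh(voltage_v=14.8, capacity_ah=5.0)
runtime = endurance_h(energy, budget.total_power_w())
```

Raising `led_duty` toward 1.0 in this sketch increases total power and shortens the estimated runtime, which is the depth dependence the list describes.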
2.2. Image Processing Module
2.2.1. Image Super-Resolution
2.2.2. Staged Underwater Image Enhancement
- (1) Physics-based modeling of degradation.
- (2) Deep learning–based enhancement with DICAM.
  - A. Stage 1: Channel-level recovery. The physics-corrected R/G/B channels are processed with channel attention modules (CAMs) to refine inter-channel correlations and suppress severe chromatic bias. Overlapping local information is aggregated to strengthen structural edges and reduce scattering noise.
  - B. Stage 2: Color correction and dimensional reduction. A compact module comprising convolution (3×3), LeakyReLU, and convolution (1×1) layers compresses redundant features, followed by a Sigmoid activation to produce final enhanced outputs.
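
The Stage 2 layer sequence (3×3 convolution, LeakyReLU, 1×1 convolution, Sigmoid) can be sketched in plain NumPy as follows. This is a minimal forward-pass illustration, not the DICAM implementation: the channel counts, reflect padding, LeakyReLU slope, and random weights are all assumptions.

```python
import numpy as np


def conv2d_same(x, w):
    """Cross-correlation with reflect padding.

    x: (C_in, H, W) feature map; w: (C_out, C_in, k, k) kernel with odd k.
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="reflect")
    # windows has shape (C_in, H, W, k, k)
    windows = np.lib.stride_tricks.sliding_window_view(xp, (k, k), axis=(1, 2))
    return np.einsum("chwij,ocij->ohw", windows, w)


def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def stage2_color_correction(features, w3x3, w1x1):
    """3x3 conv -> LeakyReLU -> 1x1 conv -> Sigmoid, as listed for Stage 2."""
    hidden = leaky_relu(conv2d_same(features, w3x3))
    return sigmoid(conv2d_same(hidden, w1x1))


# Assumed sizes: 16 input feature channels compressed to 8, then to 3 (RGB).
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8, 8))
w3x3 = rng.normal(scale=0.1, size=(8, 16, 3, 3))
w1x1 = rng.normal(scale=0.1, size=(3, 8, 1, 1))
enhanced = stage2_color_correction(features, w3x3, w1x1)
```

The final Sigmoid bounds every output pixel to the open interval (0, 1), which is why it is a natural last layer for producing a normalized image.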
2.3. Improved Detector: YOLOv11-CA_HSFPN
- 1. Coordinate Attention (CA).
- 2. High-order Spatial Feature Pyramid Network (HSFPN).
- 3. Advantages of YOLOv11-CA_HSFPN
  - (1) Enhanced spatial directionality: Explicit modeling of long-range dependencies compensates for YOLOv11’s weakness in spatial feature capture.
  - (2) Improved small-target detection: CA-HSFPN facilitates discrimination of low-contrast small objects from noisy backgrounds, improving recall.
  - (3) Robustness reinforcement: Under challenging conditions such as low illumination and scattering, CA attention highlights structural cues, ensuring stable detection.
  - (4) Lightweight preservation: CA-HSFPN introduces only lightweight convolutions and pooling operations, adding negligible computational cost.
  - (5) Underwater adaptability: Targets with orientation-dependent distributions, such as fish schools, achieve better representation under CA-HSFPN, aligning with underwater scene characteristics.
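
To make the directional behavior of coordinate attention concrete, the following NumPy sketch applies the usual CA recipe to a single feature map: average-pool along each spatial axis, mix channels with a shared 1×1 projection, then produce separate height and width gates. It is a simplified illustration under stated assumptions: batch normalization is omitted, ReLU stands in for the nonlinearity, and the weight shapes and reduction ratio are hypothetical rather than taken from the YOLOv11-CA_HSFPN neck.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Simplified coordinate attention on one feature map.

    x:        (C, H, W) input features
    w_reduce: (C_mid, C) shared 1x1 projection (channel reduction)
    w_h, w_w: (C, C_mid) 1x1 projections for the height and width gates
    """
    _, H, _ = x.shape
    pooled_h = x.mean(axis=2)                      # (C, H): pool along width
    pooled_w = x.mean(axis=1)                      # (C, W): pool along height
    y = np.concatenate([pooled_h, pooled_w], axis=1)
    y = np.maximum(w_reduce @ y, 0.0)              # (C_mid, H + W); ReLU is an assumption
    gate_h = sigmoid(w_h @ y[:, :H])               # (C, H)
    gate_w = sigmoid(w_w @ y[:, H:])               # (C, W)
    # Each position (i, j) is scaled by its row gate and its column gate.
    return x * gate_h[:, :, None] * gate_w[:, None, :]


rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 10))                 # C=8, H=6, W=10 (arbitrary)
out = coordinate_attention(
    feat,
    w_reduce=rng.normal(size=(2, 8)),              # reduction ratio 4 (assumed)
    w_h=rng.normal(size=(8, 2)),
    w_w=rng.normal(size=(8, 2)),
)
```

Because both gates lie in (0, 1), the module can only attenuate features, and it does so per row and per column, which is what lets it emphasize elongated or orientation-dependent structures.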
2.4. Adaptive Vision Transformer for Modeling in Adversarial Underwater Environments
- 1. Dynamic Token Pruning
- 2. Early-Exit Mechanism
- 3. Robustness-Oriented Token Importance
- 4. Energy-Aware Computation
- Foreground ROI retention: Regions with high saliency are preserved at full resolution for YOLO-based detection, ensuring accurate boundary recall.
- Background down-sampling: Non-salient regions are down-sampled (e.g., to 0.5× or 0.25× resolution) and processed only once, reducing redundancy.
- Fallback mechanism: If ROI coverage drops below threshold, the model reverts to full-resolution processing to avoid missing potential targets.
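
The ROI retention, background down-sampling, and fallback rules above can be sketched as a small planning function. Everything numeric here is hypothetical, and "ROI coverage" is interpreted as the share of total saliency mass that falls inside the thresholded ROI; the source does not define the term, so this reading is an assumption.

```python
import numpy as np


def plan_inference(saliency, roi_thresh=0.5, min_coverage=0.8, bg_stride=2):
    """Decide between ROI-based processing and full-resolution fallback.

    saliency:     (H, W) non-negative saliency map
    roi_thresh:   pixels at or above this value are kept at full resolution
    min_coverage: assumed fallback threshold on the saliency mass inside the ROI
    bg_stride:    background down-sampling stride (2 corresponds to 0.5x)
    """
    roi_mask = saliency >= roi_thresh
    total = float(saliency.sum())
    coverage = float(saliency[roi_mask].sum()) / total if total > 0 else 0.0
    if coverage < min_coverage:
        # Fallback: the ROI misses too much salient content, so process everything.
        return {"mode": "full", "coverage": coverage, "roi_mask": None, "bg_shape": None}
    bg_shape = saliency[::bg_stride, ::bg_stride].shape
    return {"mode": "roi", "coverage": coverage, "roi_mask": roi_mask, "bg_shape": bg_shape}


# A compact salient blob: the ROI captures all of the saliency mass.
compact = np.zeros((8, 8))
compact[2:4, 2:4] = 1.0
plan_compact = plan_inference(compact)

# Diffuse, low saliency everywhere: nothing passes the ROI threshold.
diffuse = np.full((8, 8), 0.3)
plan_diffuse = plan_inference(diffuse)
```

In this sketch the compact scene is handled in ROI mode with a 4×4 background pass, while the diffuse scene triggers the full-resolution fallback.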
2.5. Energy-Constrained Adversarial Attacks and Vulnerability Analysis of A-ViT
- 1. Formulation of the Halting Mechanism
- 2. EOAP Adversarial Objective
  - (1) Natural camouflage. EOAPs can be visually concealed within low-contrast underwater textures such as coral patterns or sand ripples. These perturbations are difficult to distinguish through human vision or simple preprocessing, making them stealthy and dangerous.
  - (2) Practical constraints. Underwater missions are often extremely sensitive to inference latency. If EOAPs induce doubled or prolonged inference times, tasks may fail or incur severe consequences. For instance, in military scenarios, a delayed AUV detector may fail to recognize underwater mines or hostile targets in time, posing critical safety risks.
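
As a rough illustration of why a halting mechanism is an attack surface, the sketch below uses an ACT-style rule of the kind commonly associated with adaptive ViTs: each token accumulates a per-layer halting score and stops once the running sum reaches 1 − ε. The rule, the value of ε, and the toy scores are assumptions for illustration; they are not the paper's formulation or its EOAP objective.

```python
import numpy as np


def halting_depths(h, eps=0.01):
    """Layer (1-based) at which each token halts.

    h: (L, N) per-layer halting scores in [0, 1] for N tokens over L layers.
    A token halts at the first layer where its cumulative score reaches 1 - eps;
    tokens that never reach it run through all L layers.
    """
    num_layers = h.shape[0]
    reached = np.cumsum(h, axis=0) >= 1.0 - eps
    first = np.argmax(reached, axis=0) + 1
    return np.where(reached.any(axis=0), first, num_layers)


def compute_cost(h, eps=0.01):
    """Total token-layer evaluations, a simple proxy for energy and latency."""
    return int(halting_depths(h, eps).sum())


# Two tokens, three layers: token 0 is confident early, token 1 never halts.
clean = np.array([[0.5, 0.1],
                  [0.6, 0.1],
                  [0.1, 0.1]])
clean_depths = halting_depths(clean)

# A perturbation that suppresses halting scores keeps tokens alive longer.
attacked = 0.5 * clean
attacked_depths = halting_depths(attacked)
```

Here halving the halting scores pushes token 0 from layer 2 to layer 3, so the cost proxy rises from 5 to 6 token-layer evaluations. An energy-oriented patch aims for the same effect at scale.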
3. Results
3.1. Quantitative Analysis of Hardware Design and Material Selection Experiments
- (1) Shallow-water mode: LED duty ratio = 20%.
- (2) Medium-depth mode: LED duty ratio = 80%.
- (3) Deep-water mode: LED full power operation.
- 1. Metals:
  - (1) Aluminum alloy (6061/5083): High thermal conductivity (120–170 W/m·K) and moderate strength (215–275 MPa) allow effective heat dissipation with relatively low weight, suitable for shallow and medium-depth tasks. However, long-term immersion requires anodizing or protective coatings due to limited corrosion resistance.
  - (2) Stainless steel (316L): Offers excellent seawater resistance, especially against pitting corrosion, though high density (8.0 g/cm³) limits portability.
  - (3) Titanium alloy (Ti-6Al-4V): Superior strength (800–900 MPa) and corrosion resistance make it the only viable choice for >300 m deep-sea missions despite higher cost.
- 2. Transparent materials
  - (1) PMMA (Acrylic): High transparency (92–93%) and light weight (1.2 g/cm³) make it suitable for shallow (<50 m) low-load tasks. However, low strength (65–75 MPa) and poor durability limit deep-sea use.
  - (2) Polycarbonate (PC): Slightly lower transparency (88–90%) but better impact resistance than PMMA; suitable for shallow-to-medium depths (<100 m).
  - (3) Tempered glass: Transparency 90–92% with yield strength 140–160 MPa, applicable for medium depths (<300 m), though heavier and fracture-prone.
  - (4) Sapphire: Optimal candidate for deep-sea windows (>300 m) due to extreme strength (2000–2500 MPa), excellent corrosion resistance, and high optical stability. Cost, however, restricts large-scale use.
- Fully transparent solution (PMMA/PC/Sapphire): Simplifies optical design but struggles to meet thermal dissipation requirements under high-load conditions; limited to shallow-water, low-power tasks.
- Metallic housing + localized transparent window: Metals provide structural strength and heat dissipation, while the transparent window ensures optical imaging. This hybrid solution flexibly adapts to shallow, medium, and deep-water tasks, achieving balanced performance.
- Shallow-water tasks: Aluminum alloy housing + PMMA/PC window.
- Medium-depth tasks: Aluminum alloy or stainless steel housing + tempered glass window.
- Deep-water tasks: Titanium alloy housing + sapphire window.
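
The depth limits quoted for each window material, together with the three housing recommendations, can be expressed as a small lookup. This is a sketch of one possible reading: the function name is hypothetical, and the handling of the exact boundary values (for example, treating 300 m itself as deep water) is an assumption, since the text only gives "<" and ">" limits.

```python
def select_materials(depth_m: float) -> dict:
    """Suggest housing and window materials for a target operating depth."""
    if depth_m < 0:
        raise ValueError("depth_m must be non-negative")
    if depth_m < 50:
        # Shallow, low-load tasks: PMMA or PC windows are both listed.
        return {"housing": "aluminum alloy (6061/5083)", "window": "PMMA or PC"}
    if depth_m < 100:
        # PMMA is limited to <50 m; PC is listed up to <100 m.
        return {"housing": "aluminum alloy (6061/5083)", "window": "PC"}
    if depth_m < 300:
        return {"housing": "aluminum alloy or stainless steel (316L)",
                "window": "tempered glass"}
    # Deep water: only titanium and sapphire are recommended above 300 m.
    return {"housing": "titanium alloy (Ti-6Al-4V)", "window": "sapphire"}


shallow = select_materials(20)
medium = select_materials(150)
deep = select_materials(800)
```

A real selection would also weigh cost and the thermal load of the mission mode, which this lookup ignores.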
3.2. Experimental Analysis of the Image Processing Module
3.2.1. Image Super-Resolution
3.2.2. Multi-Stage Underwater Image Enhancement
3.3. Visualization Analysis of YOLOv11-CA_HSFPN Detector
- Stable Convergence: Validation box_loss and dfl_loss curves show minimal oscillations, indicating smooth convergence.
- Synchronized Quality Improvement: Precision and Recall consistently improve, and the final mAP plateau suggests strong generalization ability.
- Effectiveness of Structural Modifications: Compared with the baseline, the CA-HSFPN-enhanced model achieves higher plateaus across metrics, confirming stronger feature representation capacity.
3.4. Visualization Analysis of Fish-School Detection Based on A-ViT
3.5. Dynamic Visualization Analysis and Adversarial Vulnerability Verification of A-ViT in Underwater Fish-School Detection
- Energy efficiency improvement: By reducing redundant background computation, A-ViT compresses the effective processing area to less than half, which lowers memory consumption and inference latency together and improves deployability on embedded platforms.
- Robustness enhancement: The discarded regions often coincide with locations where adversarial patches can be most effectively hidden, so dynamic pruning tends to remove such perturbations while also lowering energy costs.
4. Discussion
4.1. Ablation Study of Image Processing Modules
4.2. Performance Analysis and Ablation Study of YOLOv11-CA_HSFPN
- HSFPN: improves multi-scale feature fusion, alleviating difficulties caused by diverse target sizes in underwater environments;
- CA module: enhances channel-wise discriminability, allowing the model to highlight task-relevant features and suppress background noise.
4.3. Experimental Analysis of the A-ViT+ROI Dynamic Inference Mechanism
4.4. Energy–Efficiency Vulnerability of A-ViT+ROI under Adversarial Attacks
4.5. Defense Strategy: Image-Stage Quick Defense via IAQ
5. Conclusions
- Hardware system framework: A comprehensive theoretical framework for power consumption modeling, endurance estimation, and material selection was established and validated under shallow, mid-depth, and deep-water conditions. A layered material scheme of metallic housing + localized transparent windows was proposed, ensuring stable operation under diverse working environments with balanced energy efficiency and thermal dissipation.
- Image processing module: By integrating HAT-based super-resolution reconstruction with the DICAM multi-stage enhancement method, the framework effectively mitigates blurring, color distortion, and low contrast in underwater imagery. Ablation experiments further demonstrate that this module improves image quality (UIQM, UCIQE) and downstream detection accuracy (mAP), thereby providing high-quality inputs for detection.
- A-ViT mechanism and security defense: The proposed Adaptive Vision Transformer (A-ViT), through dynamic token pruning and early-exit strategies, substantially reduces latency and memory usage under low-to-moderate foreground ratios while maintaining detection accuracy. In adversarial experiments, although vulnerabilities in energy efficiency were observed, the fallback mechanism and key-region preservation ensured high robustness. These results highlight A-ViT’s comprehensive advantages in balancing efficiency and security.
- Improved detector: The proposed YOLOv11-CA_HSFPN introduces Coordinate Attention (CA) and a High-order Spatial Feature Pyramid Network (HSFPN) into the Neck, significantly enhancing small-object and low-contrast target detection. Ablation studies reveal complementary effects between CA and HSFPN, enabling simultaneous improvements in accuracy and robustness with negligible additional inference cost.
- Exploring cross-modal fusion of optical and acoustic imaging to improve robustness in turbid or extreme underwater environments.
- Leveraging large-scale models and knowledge distillation to unify high-capacity detection with lightweight deployment.
- Developing multi-level defense and input screening mechanisms against Alpha-channel attacks and energy-oriented adversarial patches.
- Investigating adaptive energy scheduling strategies to optimally balance detection performance and platform endurance across diverse mission scenarios.
- Conducting long-term real-world underwater experiments to validate the stability, generalization, and practical deployment potential of the proposed framework.

| Mode | Power (W) | Endurance (h) | Energy per frame (J/frame, F = 15 fps) | Energy per 1000 frames (Wh/1000f) | Area (cm²) |
|---|---|---|---|---|---|
| Low-power cruise | 7.9 | 8.6 | 0.53 | 0.15 | 17.6 |
| Medium-power acquisition | 14.7 | 4.9 | 0.98 | 0.27 | 32.7 |
| High-load inspection | 26.5 | 2.6 | 1.77 | 0.49 | 58.9 |
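
The J/frame and Wh/1000f columns of the table above follow directly from the power column and the 15 fps frame rate. The short check below reproduces them; the function names are introduced here for illustration.

```python
def energy_per_frame_j(power_w: float, fps: float) -> float:
    """Energy spent per frame in joules: power divided by frame rate."""
    return power_w / fps


def energy_per_1000_frames_wh(power_w: float, fps: float) -> float:
    """Energy for 1000 frames in watt-hours (1 Wh = 3600 J)."""
    return energy_per_frame_j(power_w, fps) * 1000.0 / 3600.0


# Power values (W) as reported for the three modes.
modes = {
    "Low-power cruise": 7.9,
    "Medium-power acquisition": 14.7,
    "High-load inspection": 26.5,
}
per_frame = {name: round(energy_per_frame_j(p, 15), 2) for name, p in modes.items()}
per_1000 = {name: round(energy_per_1000_frames_wh(p, 15), 2) for name, p in modes.items()}
```

Rounded to two decimals, the results agree with the tabulated values for all three modes.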

| Material | Transparency (%) | Thermal Conductivity (W/m·K) | Density (g/cm³) | Strength (MPa) | Corrosion Resistance in Seawater | Cost |
|---|---|---|---|---|---|---|
| Aluminum alloy (6061/5083) | Opaque | 120–170 | 2.7 | 215–275 | Medium (requires anodizing) | Low |
| Stainless steel (316L) | Opaque | 15–17 | 8.0 | 170–290 | High (excellent pitting resistance) | Medium |
| Titanium alloy (Ti-6Al-4V) | Opaque | 6–7 | 4.4 | 800–900 | Excellent | High |
| PMMA (Acrylic) | 92–93 | 0.18–0.22 | 1.2 | 65–75 | Poor | Low |
| PC (Polycarbonate) | 88–90 | 0.18–0.22 | 1.2 | 70–85 | Poor–Medium | Low–Medium |
| Tempered glass | 90–92 | 0.9–1.1 | 2.5 | 140–160 | Medium | Medium |
| Sapphire | >90 | 30–40 | 3.9 | 2000–2500 | Excellent | High |

| Indicator Name | Blurred Image Value | Super-Resolved Image Value | Improvement |
|---|---|---|---|
| PSNR ↑ (dB) | 15.62 dB | 27.29 dB | +74.8% |
| SSIM ↑ | 0.186 | 0.885 | +375.8% |
| MSE ↓ | 0.0151 | 0.00187 | −87.6% |
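
For reference, MSE and PSNR for images scaled to [0, 1] are conventionally computed as in the sketch below. This is the standard definition, not necessarily the exact evaluation script behind the table. Dataset-averaged PSNR generally differs from the PSNR of the averaged MSE, so the two rows need not satisfy the formula jointly.

```python
import numpy as np


def mse(reference: np.ndarray, test: np.ndarray) -> float:
    """Mean squared error between two images of equal shape."""
    return float(np.mean((reference.astype(float) - test.astype(float)) ** 2))


def psnr(reference: np.ndarray, test: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; max_val is the peak pixel value."""
    err = mse(reference, test)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))


# Toy example: a constant offset of 0.1 gives MSE = 0.01, i.e. 20 dB.
ref_img = np.zeros((4, 4))
test_img = np.full((4, 4), 0.1)
toy_mse = mse(ref_img, test_img)
toy_psnr = psnr(ref_img, test_img)
```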

| Method | UIQM ↑ | UCIQE ↑ | MSE_UIQM ↓ | MSE_UCIQE ↓ |
|---|---|---|---|---|
| Original | 3.00 | 0.550 | – | – |
| CLAHE | 3.42 | 0.612 | 0.179 | 0.0038 |
| UDCP | 3.15 | 0.587 | 0.022 | 0.0016 |
| UWnet | 3.56 | 0.641 | 0.313 | 0.0051 |
| TCTL-Net | 3.71 | 0.665 | 0.500 | 0.0078 |
| DICAM | 3.85 | 0.673 | 0.722 | 0.0092 |

| Item | Value |
|---|---|
| GPU model | NVIDIA RTX 4090 18 GB |
| Learning Rate | 1 × 10⁻³ (cosine decay) |
| Batch Size | 32 |
| Input image resolution | 640 × 640 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Weight decay | 5 × 10⁻⁴ |
| Warmup epochs | 3 |
| Total training epochs | 300 |
| Model checkpoint interval | Every 10 epochs |

| Variant | Image Processing | Detector | mAP@0.5 ↑ | Precision ↑ | Recall ↑ |
|---|---|---|---|---|---|
| Baseline | None (Raw) | Original YOLO | 54.7 | 0.875 | 0.742 |
| A | CR | Original YOLO | 55.3 | 0.880 | 0.747 |
| B | SR | Original YOLO | 55.5 | 0.882 | 0.750 |
| C | CR + SR | Original YOLO | 56.0 | 0.886 | 0.752 |
| D | None | Modified YOLO | 56.2 | 0.889 | 0.754 |
| E | CR + SR | Modified YOLO | 57.0 | 0.892 | 0.760 |

| Variant | Precision ↑ | Recall ↑ | mAP@0.5 ↑ | Latency (ms) |
|---|---|---|---|---|
| YOLOv5 | 0.861 | 0.728 | 52.6 | 12.4 |
| YOLOv7 | 0.868 | 0.735 | 53.9 | 11.7 |
| YOLOv8 | 0.873 | 0.739 | 54.4 | 11.0 |
| YOLOv10 | 0.878 | 0.741 | 54.9 | 10.6 |
| YOLOv11 | 0.875 | 0.742 | 54.7 | 10.2 |
| RT-DETR v1-R50 | 0.870 | 0.741 | 55.0 | 14.8 |
| RT-DETR v2-R50 | 0.878 | 0.749 | 55.9 | 12.6 |
| YOLOv11-CA_HSFPN | 0.889 | 0.754 | 56.2 | 10.5 |

| Variant | Precision ↑ | Recall ↑ | mAP@0.5 ↑ | Latency (ms) | ΔmAP@0.5 |
|---|---|---|---|---|---|
| Baseline | 0.875 | 0.742 | 54.7 | 10.2 | - |
| +HSFPN | 0.881 | 0.746 | 55.6 | 10.4 | 0.9 |
| +CA | 0.877 | 0.749 | 55.8 | 10.3 | 1.1 |
| +CA_HSFPN | 0.889 | 0.754 | 56.2 | 10.5 | 1.5 |

| Foreground Ratio | Model | A-ViT+ROI Latency (ms) | Latency Reduction (%) | A-ViT+ROI VRAM (GB) | VRAM Reduction (%) | Fallback |
|---|---|---|---|---|---|---|
| 0.23 | YOLOv8 | 7.18 | 34.7 | 0.55 | 75.0 | FALSE |
| 0.41 | YOLOv8 | 7.77 | 29.4 | 0.92 | 58.2 | FALSE |
| 0.62 | YOLOv8 | 12.38 | −12.5 | 2.31 | −5.0 | TRUE |
| 0.24 | YOLOv11-CA_HSFPN | 7.63 | 27.3 | 0.71 | 74.6 | FALSE |
| 0.44 | YOLOv11-CA_HSFPN | 8.51 | 19.0 | 1.15 | 58.9 | FALSE |
| 0.59 | YOLOv11-CA_HSFPN | 11.19 | −6.6 | 2.95 | −5.4 | TRUE |
| 0.18 | RT-DETR-R50 | 7.55 | 48.9 | 0.72 | 80.0 | FALSE |
| 0.43 | RT-DETR-R50 | 9.88 | 33.2 | 1.49 | 58.6 | FALSE |
| 0.65 | RT-DETR-R50 | 15.27 | −3.2 | 3.78 | −5.0 | TRUE |
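
The latency-reduction column above appears consistent with a relative change measured against each detector's standalone latency (11.0 ms for YOLOv8 and 10.5 ms for YOLOv11-CA_HSFPN in the detector comparison table). That pairing is inferred from the numbers, not stated in the source. Negative values mean the A-ViT+ROI path was slower than the standalone detector.

```python
def reduction_pct(baseline: float, measured: float) -> float:
    """Relative reduction in percent; negative means the measurement got worse."""
    return (baseline - measured) / baseline * 100.0


# Standalone latencies (ms) taken from the detector comparison table.
baseline_ms = {"YOLOv8": 11.0, "YOLOv11-CA_HSFPN": 10.5}

# A-ViT+ROI latencies (ms) at increasing foreground ratios.
measured_ms = {
    "YOLOv8": [7.18, 7.77, 12.38],
    "YOLOv11-CA_HSFPN": [7.63, 8.51, 11.19],
}

reductions = {
    model: [round(reduction_pct(baseline_ms[model], m), 1) for m in values]
    for model, values in measured_ms.items()
}
```

The sign flip in the last entry of each list coincides with the rows where the fallback flag is TRUE, where the pruning overhead is paid without any savings.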

| Foreground Ratio | Model | A-ViT+ROI Latency (ms) | Latency Reduction (%) | A-ViT+ROI VRAM (GB) | VRAM Reduction (%) | Fallback |
|---|---|---|---|---|---|---|
| 0.23 | YOLOv8 | 12.83 | −16.6 | 2.50 | −13.6 | TRUE |
| 0.41 | YOLOv8 | 12.95 | −17.7 | 2.57 | −16.8 | TRUE |
| 0.62 | YOLOv8 | 13.12 | −19.3 | 2.61 | −18.6 | TRUE |
| 0.24 | YOLOv11-CA_HSFPN | 12.01 | −14.4 | 3.07 | −9.6 | TRUE |
| 0.44 | YOLOv11-CA_HSFPN | 12.17 | −15.9 | 3.13 | −11.8 | TRUE |
| 0.59 | YOLOv11-CA_HSFPN | 12.36 | −17.7 | 3.21 | −14.6 | TRUE |
| 0.18 | RT-DETR-R50 | 16.57 | −12.0 | 3.92 | −8.9 | TRUE |
| 0.43 | RT-DETR-R50 | 16.78 | −13.4 | 3.99 | −10.8 | TRUE |
| 0.65 | RT-DETR-R50 | 16.96 | −14.6 | 4.06 | −12.8 | TRUE |

Algorithm 1. IAQ-Gated Defense against SlowFormer Attacks

    Input: frame x
    Output: detection result y

    # Step 1: Lightweight feature extraction
    function extract_features(x):
        R_HF  ← DCT_highfreq_ratio(x)       # High-frequency energy ratio
        σ_H²  ← local_entropy_variance(x)   # Local entropy variance
        V_Lap ← Laplacian_variance(x)       # Sharpness anomaly
        Δ_c   ← color_shift(x)              # Color channel shift
        U     ← stability_consistency(x)    # Consistency under slight perturbation
        return [R_HF, σ_H², V_Lap, Δ_c, U]

    # Step 2: Attack score computation
    function IAQ_gate(x):
        feats ← extract_features(x)
        s ← normalize(feats) ⋅ w            # z-score + linear weighting
        if s ≥ τ:
            return “BYPASS”                 # Suspicious → bypass A-ViT
        else:
            return “A-ViT”                  # Normal image → use ROI + ViT

    # Step 3: Main inference flow
    for each frame x_t:
        gate ← IAQ_gate(x_t)
        if gate == “A-ViT”:
            y ← run_AViT_ROI_detector(x_t)
        else:
            y ← run_baseline_detector(x_t)
        output y
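
A reduced, runnable version of the gating logic in Algorithm 1 might look like the sketch below. It is a simplification under stated assumptions: only two of the five features are computed, an FFT-based high-frequency ratio is swapped in for the DCT-based one, and the calibration statistics, weights, and threshold τ are placeholder values, not the authors' settings.

```python
import numpy as np


def highfreq_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spectral energy above a radial frequency cutoff (cycles/pixel)."""
    spec = np.abs(np.fft.fft2(gray)) ** 2
    fy = np.fft.fftfreq(gray.shape[0])[:, None]
    fx = np.fft.fftfreq(gray.shape[1])[None, :]
    high = np.sqrt(fx ** 2 + fy ** 2) > cutoff
    total = spec.sum()
    return float(spec[high].sum() / total) if total > 0 else 0.0


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbor Laplacian over interior pixels."""
    lap = (gray[:-2, 1:-1] + gray[2:, 1:-1] + gray[1:-1, :-2] + gray[1:-1, 2:]
           - 4.0 * gray[1:-1, 1:-1])
    return float(lap.var())


def iaq_gate(gray, mean, std, weights, tau):
    """Return 'BYPASS' for suspicious frames and 'A-ViT' otherwise."""
    feats = np.array([highfreq_ratio(gray), laplacian_variance(gray)])
    score = float(((feats - mean) / std) @ weights)   # z-score + linear weighting
    return "BYPASS" if score >= tau else "A-ViT"


# Placeholder calibration; in practice these would be fitted on clean frames.
calib = dict(mean=np.array([0.05, 0.01]), std=np.array([0.05, 0.01]),
             weights=np.array([0.5, 0.5]), tau=2.0)

smooth_frame = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))   # gentle gradient
noisy_frame = np.random.default_rng(0).uniform(size=(32, 32))  # texture-like noise

route_smooth = iaq_gate(smooth_frame, **calib)
route_noisy = iaq_gate(noisy_frame, **calib)
```

With these placeholder settings the smooth frame is routed to the A-ViT path, and the high-frequency frame is sent to the baseline detector.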

| Model | Clean (No Attack) | Attack (No Defense) | Attack (IAQ Defense) |
|---|---|---|---|
| YOLOv8 | 7.7 ms | 13.0 ms | 11.0 ms |
| YOLOv11-CA_HSFPN | 8.5 ms | 12.2 ms | 11.1 ms |
| RT-DETR-R50 | 9.9 ms | 16.8 ms | 11.3 ms |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).