Submitted: 28 April 2025
Posted: 30 April 2025
Abstract
Keywords:
1. Introduction
- Design and implementation of a dual-input vision transformer (ViT) architecture that processes both original eye images and their frequency domain representations.
- Comprehensive evaluation and demonstration of the efficacy of frequency domain features in veterinary ophthalmology.
- Comparative analysis of Fourier and wavelet transformations to improve diagnostic precision in five common canine eye conditions.
2. Related Work
2.1. Deep Learning in Veterinary Medical Imaging
2.2. Multi-Modal Learning in Medical Imaging
2.2.1. Frequency Domain Analysis in Medical Imaging
2.3. Vision Transformers in Medical Image Analysis
3. Materials and Methods
3.1. Dataset

4. Image Preprocessing and Frequency Domain Transformations
4.1. Spatial Domain Preprocessing
4.1.1. Fourier Transform Preprocessing
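The paper's exact Fourier preprocessing parameters are not reproduced here, but the standard pipeline for turning an eye image into a frequency domain input is a centered 2D FFT followed by a log-magnitude map. A minimal NumPy sketch, assuming a single-channel image and a 224×224 resolution (the resolution is an assumption, not stated in this excerpt):

```python
import numpy as np

def fourier_preprocess(image: np.ndarray) -> np.ndarray:
    """Convert a grayscale image to a normalized log-magnitude Fourier spectrum.

    Returns an array of the same shape, scaled to [0, 1], with the
    zero-frequency component shifted to the center.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # centered 2D FFT
    magnitude = np.log1p(np.abs(spectrum))          # log scale compresses the dynamic range
    return (magnitude - magnitude.min()) / (magnitude.max() - magnitude.min() + 1e-8)

# Illustration on a synthetic 224x224 image
img = np.random.default_rng(0).random((224, 224))
spec = fourier_preprocess(img)
```

The log transform is what makes the spectrum usable as a network input: raw FFT magnitudes span many orders of magnitude, which would saturate a normalized image channel.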
4.1.2. Wavelet Transform Preprocessing
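The specific wavelet family and decomposition depth are not given in this excerpt; as an illustration of the idea, a single-level 2D Haar DWT can be written directly in NumPy (a library such as PyWavelets would normally be used for other wavelet families). The four subbands are commonly tiled back into a single image-sized input for the frequency branch:

```python
import numpy as np

def haar_dwt2(image: np.ndarray):
    """Single-level 2D Haar DWT. Expects even height and width.

    Returns the (LL, LH, HL, HH) subbands, each half the input size.
    """
    # Columns: pairwise average (low-pass) and difference (high-pass)
    lo = (image[:, 0::2] + image[:, 1::2]) / 2.0
    hi = (image[:, 0::2] - image[:, 1::2]) / 2.0
    # Rows: repeat the split on both intermediate results
    LL = (lo[0::2, :] + lo[1::2, :]) / 2.0  # approximation
    LH = (lo[0::2, :] - lo[1::2, :]) / 2.0  # horizontal detail
    HL = (hi[0::2, :] + hi[1::2, :]) / 2.0  # vertical detail
    HH = (hi[0::2, :] - hi[1::2, :]) / 2.0  # diagonal detail
    return LL, LH, HL, HH

img = np.random.default_rng(0).random((224, 224))
LL, LH, HL, HH = haar_dwt2(img)
# Tile the subbands into a 224x224 input for the frequency-branch encoder
tiled = np.block([[LL, LH], [HL, HH]])
```

Unlike the Fourier spectrum, the subbands remain spatially localized: LL is a downsampled version of the eye image, while the detail bands emphasize edges at a given scale.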

5. Model Architecture
- Baseline Model: A single-input Vision Transformer (ViT) that processes only the original images
- Fourier Multi-modal Model: A dual-input ViT that processes both original images and their Fourier transformations
- Wavelet Multi-modal Model: A dual-input ViT that processes both original images and their wavelet transformations
5.1. Baseline Model
5.2. Multi-Modal Models
- Two parallel ViT-Base encoders:
- One for processing the original spatial domain images
- One for processing the frequency domain representations (Fourier or wavelet)
- A fusion module to combine the embeddings from both encoders
- A classification head to produce the final predictions
- Linear layer: 1536 → 256 units with ReLU activation
- Dropout layer (rate = 0.1)
- Linear layer: 256 → 5 units (one per class)
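The head dimensions above are consistent with concatenating the two 768-dimensional ViT-Base embeddings (2 × 768 = 1536). The shapes can be sketched in NumPy with randomly initialized stand-in weights; this is an illustration of the fusion and head dimensions only, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # ViT-Base embedding size; 2 x 768 = 1536 after fusion

# Stand-in weights for illustration only (the real ones are learned)
W1, b1 = rng.normal(0.0, 0.02, (1536, 256)), np.zeros(256)
W2, b2 = rng.normal(0.0, 0.02, (256, 5)), np.zeros(5)

def classify(spatial_emb, freq_emb, train=False):
    """Fuse two 768-d encoder embeddings and score the 5 disease classes."""
    x = np.concatenate([spatial_emb, freq_emb], axis=-1)  # fusion: (1536,)
    h = np.maximum(x @ W1 + b1, 0.0)                      # Linear 1536 -> 256 + ReLU
    if train:                                             # inverted dropout, rate = 0.1
        h *= (rng.random(h.shape) >= 0.1) / 0.9
    return h @ W2 + b2                                    # Linear 256 -> 5 logits

logits = classify(rng.normal(size=D), rng.normal(size=D))
```

Concatenation is the simplest fusion choice; the same head applies unchanged whether the second embedding comes from the Fourier or the wavelet encoder.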

6. Training Procedure
7. Results
7.1. Classification Performance
7.2. Class-Wise Performance
7.3. Confusion Matrix Analysis
8. Discussion
8.1. Interpretation of Results
8.2. Technical Implications and Potential Applications
- Model Architecture Efficiency: Our dual-input ViT architecture shows that integrating complementary data representations can yield performance improvements without requiring more complex backbone networks.
- Frequency Domain Effectiveness: The consistent performance gains, particularly with Fourier transforms, suggest that frequency domain information could benefit other medical image classification tasks beyond veterinary applications.
- Automated Analysis Framework: The proposed framework provides a pipeline for automated image analysis that could be adapted to other diagnostic imaging tasks where consistent evaluation is valuable.
- Interpretable Results: The class-wise performance variations offer insights into which conditions benefit most from frequency domain analysis, potentially guiding future research directions.
8.3. Limitations
- Dataset Composition: Although comprehensive, our dataset may not represent the full diversity of presentations across different dog breeds, ages, and disease stages.
- Computational Requirements: The dual encoder architecture significantly increases computational demands during both training and inference, potentially limiting deployment on resource-constrained devices.
- Binary Approach to Disease: Our classification framework treats each condition as present or absent, whereas real-world cases often involve gradations of severity or co-occurring conditions.
- Model Interpretability: The attention mechanisms of ViTs offer some interpretability, but more work is needed to provide clinically meaningful explanations for the predictions.
8.4. Future Directions
- Model Optimization: Investigating more efficient fusion strategies and lightweight architectures to reduce computational requirements.
- Additional Modalities: Incorporating clinical metadata, breed information, or other imaging modalities (e.g., ultrasound) could further enhance performance.
- Explainable AI: Developing visualization techniques to highlight the specific features that contribute to diagnoses, enhancing clinician trust and adoption.
- Severity Grading: Extending the framework to assess the severity of the disease, not just the presence, would provide more nuanced clinical information.
- Comprehensive Benchmark Framework: Developing standardized evaluation protocols and benchmark datasets to systematically assess model performance across diverse image qualities, acquisition devices, and environmental conditions. This would enable objective comparison between different computational approaches and establish reliable metrics for real-world deployment readiness.
9. Conclusion
Acknowledgments
Abbreviations
| ViT | Vision Transformer |
| CNN | Convolutional Neural Network |
| FT | Fourier Transform |
| WT | Wavelet Transform |
| FFT | Fast Fourier Transform |
| DWT | Discrete Wavelet Transform |
| ROC-AUC | Receiver Operating Characteristic - Area Under Curve |
| MLP | Multilayer Perceptron |
References
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
- Shamshad, F.; Khan, S.; Zamir, S.W.; Khan, M.H.; Hayat, M.; Khan, F.S.; Fu, H. Transformers in medical imaging: A survey. Medical Image Analysis 2023, 88, 102802.
- Takahashi, S.; Sakaguchi, Y.; Kouno, N.; Takasawa, K.; Ishizu, K.; Akagi, Y.; Aoyama, R.; Teraya, N.; Bolatkan, A.; Shinkai, N.; et al. Comparison of vision transformers and convolutional neural networks in medical image analysis: a systematic review. Journal of Medical Systems 2024, 48, 84. [CrossRef]
- Xiao, S.; Dhand, N.K.; Wang, Z.; Hu, K.; Thomson, P.C.; House, J.K.; Khatkar, M.S. Review of applications of deep learning in veterinary diagnostics and animal health. Frontiers in Veterinary Science 2025, 12, 1511522. [CrossRef]
- Boroffka, S.A.; Voorhout, G.; Verbruggen, A.M.; Teske, E. Intraobserver and interobserver repeatability of ocular biometric measurements obtained by means of B-mode ultrasonography in dogs. American Journal of Veterinary Research 2006, 67, 1743–1749. [CrossRef]
- Zhou, S.K.; Greenspan, H.; Davatzikos, C.; Duncan, J.S.; Van Ginneken, B.; Madabhushi, A.; Prince, J.L.; Rueckert, D.; Summers, R.M. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE 2021, 109, 820–838. [CrossRef]
- Bracewell, R.N. The Fourier transform. Scientific American 1989, 260, 86–95.
- Mallat, S.G. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 1989, 11, 674–693. [CrossRef]
- Mini, M.; Devassia, V.; Thomas, T. Multiplexed wavelet transform technique for detection of microcalcification in digitized mammograms. Journal of Digital Imaging 2004, 17, 285–291. [CrossRef]
- Gomes, D.A.; Alves-Pimenta, M.S.; Ginja, M.; Filipe, V. Predicting canine hip dysplasia in x-ray images using deep learning. In Proceedings of the International Conference on Optimization, Learning Algorithms and Applications. Springer, 2021, pp. 393–400.
- Nyquist, M.L.; Fink, L.A.; Mauldin, G.E.; Coffman, C.R. Evaluation of a novel veterinary dental radiography artificial intelligence software program. Journal of Veterinary Dentistry 2025, 42, 118–127. [CrossRef]
- Dumortier, L.; Guépin, F.; Delignette-Muller, M.L.; Boulocher, C.; Grenier, T. Deep learning in veterinary medicine, an approach based on CNN to detect pulmonary abnormalities from lateral thoracic radiographs in cats. Scientific Reports 2022, 12, 11418. [CrossRef]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018, 41, 423–443.
- Zhou, T.; Ruan, S.; Guo, Y.; Canu, S. A multi-modality fusion network based on attention mechanism for brain tumor segmentation. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 377–380.
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2097–2106.
- Liu, R.; Li, Q.; Xu, F.; Wang, S.; He, J.; Cao, Y.; Shi, F.; Chen, X.; Chen, J. Application of artificial intelligence-based dual-modality analysis combining fundus photography and optical coherence tomography in diabetic retinopathy screening in a community hospital. BioMedical Engineering OnLine 2022, 21, 47. [CrossRef]
- Rahman, M.M.; Bhattacharya, P.; Desai, B.C. A framework for medical image retrieval using machine learning and statistical similarity matching techniques with relevance feedback. IEEE Transactions on Information Technology in Biomedicine 2007, 11, 58–69. [CrossRef]
- Abd El Kader, I.; Xu, G.; Shuai, Z.; Saminu, S.; Javaid, I.; Ahmad, I.S.; Kamhi, S. Brain tumor detection and classification on MR images by a deep wavelet auto-encoder model. Diagnostics 2021, 11, 1589. [CrossRef]
- Yoon, J.S.; Oh, K.; Shin, Y.; Mazurowski, M.A.; Suk, H.I. Domain generalization for medical image analysis: A survey. arXiv preprint arXiv:2310.08598 2023.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 2021.
- Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 574–584.
- AI Hub. Canine Ophthalmic Disease Dataset. AI Hub Open Dataset, 2021. Available online: https://aihub.or.kr/aihubdata/data/view.do?currMenu=&topMenu=&aihubDataSe=data&dataSetSn=562 (accessed on 3 February 2025).

| Class | Training | Validation | Test | Total |
|---|---|---|---|---|
| Eyelid Tumor | 3,607 | 481 | 722 | 4,810 |
| Nuclear Sclerosis | 7,200 | 960 | 1,440 | 9,600 |
| Cataract | 5,163 | 689 | 1,033 | 6,885 |
| Ulcerative Keratitis | 10,305 | 1,374 | 2,062 | 13,741 |
| Epiphora | 7,200 | 960 | 1,441 | 9,601 |
| Total | 33,475 | 4,464 | 6,698 | 44,637 |
| Model | Accuracy | Avg. F1-Score | ROC-AUC |
|---|---|---|---|
| Baseline (Single-modal) | 85.65% | 0.843 | 0.982 |
| Multi-modal (Fourier) | 86.52% | 0.854 | 0.983 |
| Multi-modal (Wavelet) | 86.17% | 0.849 | 0.984 |
| Disease Class | Baseline Accuracy | Baseline F1 | Fourier Accuracy | Fourier F1 | Wavelet Accuracy | Wavelet F1 |
|---|---|---|---|---|---|---|
| Eyelid Tumor | 84.63% | 0.795 | 89.06% | 0.818 | 86.43% | 0.808 |
| Nuclear Sclerosis | 77.92% | 0.766 | 77.71% | 0.774 | 78.33% | 0.773 |
| Cataract | 81.99% | 0.825 | 84.22% | 0.842 | 83.06% | 0.830 |
| Ulcerative Keratitis | 91.90% | 0.937 | 92.05% | 0.939 | 92.05% | 0.939 |
| Epiphora | 87.58% | 0.892 | 87.79% | 0.896 | 87.72% | 0.896 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).