Intelligent Staging Performance of Diabetic Retinopathy Based on Fundus Fluorescein Angiography Images with Different Diffusion Times

Wei Wang; Zhenpeng Chen; Mingming Li; Shuang Li; Kang Wang; Haiyun Li

doi:10.20944/preprints202606.0206.v1

Submitted: 02 June 2026

Posted: 02 June 2026

Abstract

Fundus fluorescein angiography (FFA) is the gold standard for assessing diabetic retinopathy (DR) severity, but whether the contrast diffusion time affects staging remains unclear. Following two clinical DR staging criteria, we built DR staging models on FFA images using the Swin-Transformer and ConvNeXt architectures and investigated the impact of FFA diffusion time on staging performance. A total of 7,508 FFA images were annotated according to both staging criteria and divided into venous, recirculation, and late phases. Single-phase and combined-phase images were used for multi-class and binary classification tasks. Model performance was evaluated using accuracy, precision, recall, and F1-score. A generalized linear model with Bonferroni correction was used for inter-phase statistical comparisons. The ConvNeXt architecture outperformed Swin-Transformer in overall accuracy. No statistically significant effect of diffusion time on staging performance was observed for either model (P > 0.05). Although staging accuracy declined numerically across successive phases, the differences were not significant; the decline may be attributable to reduced contrast and to lesions obscured by hyperfluorescent leakage in the late phase. Our findings indicate that contrast diffusion time has only a marginal, non-significant impact on DR staging accuracy, offering guidance for intelligent DR diagnosis using FFA images.

Keywords: diabetic retinopathy; fluorescein fundus angiography; Swin-Transformer; ConvNeXt; diffusion time; grading model

Subject: Medicine and Pharmacology - Ophthalmology

1. Introduction

Diabetic retinopathy (DR) is a leading cause of visual impairment and blindness, affecting approximately one-third of diabetic patients [1,2,3,4,5]. Vision loss occurs only in advanced stages, characterized by diabetic macular edema (DME) and/or proliferative diabetic retinopathy (PDR) [6,7,8]. Early intervention plays a vital role in preventing blindness. Therefore, accurate DR staging is essential for assessing fundus lesion status and disease progression [9,10].

Fluorescein fundus angiography (FFA) is the gold standard for in vivo observation of the retinal vascular system [11]. FFA accurately delineates lesions associated with different DR stages, such as microaneurysms, exudates, non-perfused areas, intraretinal microvascular abnormalities (IRMA), neovascularization, vascular leakage, macular edema, and the extent of PDR [12]. Moreover, FFA enables detection of subtle characteristic changes among similar lesions, holding potential to accurately differentiate clinical stages of DR [13,14].

Deep learning models have been widely applied to FFA-based DR staging. The Ai-Doctor system first demonstrated that deep learning can directly perform staging recognition on FFA images, achieving an AUC of 0.991–0.999 [15]. The InterpreFFA framework enhanced the capture of dynamic retinal pathological features via contrastive learning [16]. An LLM-assisted system explored interactive report generation [17]. Rasta et al. proposed an unsupervised algorithm for detecting non-perfusion areas, achieving 81% sensitivity, 78% specificity, and 91% accuracy [18]. Gao et al. developed a deep learning system on a large-scale FFA dataset, with VGG16 achieving a maximum accuracy of 94.17% [19]. Pan et al. constructed a multi-label model detecting four lesion types (non-perfusion, microaneurysms, leakage, laser scars), with DenseNet achieving AUCs of 0.8703, 0.9435, 0.9647, and 0.9653, respectively [20]. These studies have achieved promising results, but most focus on improving model performance on static images, while a key clinical variable, contrast agent diffusion duration, has rarely been systematically explored.

FFA examination is a dynamic process that evolves over time. Early-phase images exhibit high vascular contrast and clear outlines of micro-lesions; intermediate-phase images show increasing background fluorescence due to gradual leakage; late-phase images reveal reduced vascular contrast and blurred lesion morphology due to contrast metabolism and persistent leakage. Theoretically, these differences between phases may significantly affect deep learning model performance. Specifically, late-phase images carry two potential confounding factors [21]: first, reduced contrast may impede identification of microvascular details; second, hyperfluorescent leakage areas may partially or completely obscure key lesions such as microaneurysms and hemorrhages, increasing the risk of missed detection. Despite these potential effects, the actual degree of interference with model performance remains unclear, and statistically rigorous evaluations are lacking.

In this paper, following the International five-grade and Chinese six-grade DR staging criteria, we used two representative deep learning architectures, Swin-Transformer and ConvNeXt, to build intelligent DR staging models on FFA images, and systematically investigated how FFA images acquired at different diffusion times affect staging performance, with a focus on both trend and statistical significance. The main contributions are: (1) Dual grading standard validation: images were annotated under both the International five-grade and Chinese six-grade DR staging criteria to verify the generalization ability of the models under different grading systems. (2) Systematic investigation of the impact of diffusion time: to our knowledge for the first time, we systematically studied the impact of contrast agent diffusion duration on intelligent DR staging performance, using a generalized linear model (GLM) with Bonferroni correction for statistical testing, quantifying the accuracy differences among phases, and analyzing how contrast agent leakage may affect model performance. (3) Comparison of two mainstream deep learning architectures: models were built with Swin-Transformer and ConvNeXt to compare the relative advantages of attention mechanisms and convolutional architectures in FFA image staging.

2. Materials and Methods

A total of 7,508 FFA images from patients with diabetic retinopathy (DR) were included in this study. All images were independently annotated in accordance with both the International five-grade and Chinese six-grade DR classification standards. Discrepancies in annotations were resolved through consensus discussion to ensure labeling accuracy. Based on the circulation phase of the contrast agent, the FFA images were stratified into three subgroups for comparative evaluation: venous-phase, recirculation-phase, and late-phase. To systematically investigate the impact of contrast agent diffusion duration on model grading performance, two state-of-the-art deep learning architectures, Swin-Transformer and ConvNeXt, were each used to build intelligent DR staging models on FFA images. Multi-class classification (corresponding to DR staging grades) and binary classification (non-proliferative vs. proliferative DR) experiments were performed under both grading standards to validate model performance. The performance of the two models was evaluated using four core metrics: accuracy, precision, recall, and F1-score. Additionally, the influence of phase group on model performance was assessed statistically using a generalized linear model with Bonferroni correction.

2.1. Sample Collection

The samples were obtained from the Department of Ophthalmology, Beijing Friendship Hospital, Capital Medical University. All FFA images were acquired using a Heidelberg Spectralis HRA (Heidelberg Engineering, Germany) angiography system. After obtaining approval from the hospital ethics committee, we retrospectively collected fluorescein fundus angiography (FFA) images from patients newly diagnosed with diabetic retinopathy (DR) at our hospital between May 2016 and December 2025. A total of 450 patients (863 eyes; mean age 63.8 ± 12.6 years; male-to-female ratio 221:229) were included, yielding 7,508 eligible FFA images.

2.2. DR Grading Standards and Image Annotation

To guarantee the accuracy and reliability of grading labels, FFA images in this study were annotated independently following both the International five-grade and Chinese six-grade DR classification criteria, with a rigorous quality control procedure implemented throughout the annotation workflow.

The International five-grade clinical classification system stratifies DR severity into five hierarchical categories: no apparent diabetic retinopathy, mild non-proliferative diabetic retinopathy (NPDR), moderate NPDR, severe NPDR, and proliferative diabetic retinopathy (PDR).

The Chinese six-grade criterion was formulated by the Fundus Disease Study Group of the Ophthalmology Branch of the Chinese Medical Association in 2014. On the basis of the International five-grade framework, this domestic standard further refines the proliferative stage and establishes six sub-stages: microaneurysm stage, mild-to-moderate NPDR stage, severe NPDR stage, early PDR stage, moderate- and high-risk PDR stage, and advanced decompensated PDR stage.

All image grading and annotation tasks were independently performed by three ophthalmology experts from Beijing Friendship Hospital, Capital Medical University, each with over ten years of clinical experience in retinal diseases. Prior to formal annotation, all annotators received standardized training to unify the core definitions and differential diagnostic key points of the two grading systems, so as to minimize subjective assessment discrepancies. When annotations were inconsistent among the three raters, a senior retinal subspecialist was consulted to reach a consensus and determine the final label. The sample distributions under the two staging criteria are shown in Table 1 and Table 2.

2.3. Sample Grouping

According to the temporal progression of the FFA examination procedure, all enrolled images were categorized into three non-overlapping phases: venous-phase (V, 25 s–2 min after contrast injection), recirculation-phase (R, 2.5–6 min), and late-phase (L, 7–10 min) [22]. The three phase groups were strictly demarcated without temporal overlap, enabling a comparative evaluation of the models' independent recognition performance across different contrast agent diffusion durations. In addition, to explore the staging performance of images with different diffusion durations, we further constructed five groups of image samples under the two staging criteria, as shown in Table 3 and Table 4.
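As an illustration of the phase grouping described above, the following minimal sketch maps an acquisition time to a phase label. The time windows come from the text; the function name `assign_phase`, the inclusive boundaries, and the choice to leave frames that fall between windows unassigned are assumptions made for this example, since the handling of such frames is not described here.

```python
from typing import Optional

# Phase windows in seconds after contrast injection, taken from the text:
# venous 25 s to 2 min, recirculation 2.5 to 6 min, late 7 to 10 min.
PHASE_WINDOWS = {
    "V": (25, 120),
    "R": (150, 360),
    "L": (420, 600),
}


def assign_phase(seconds_after_injection: float) -> Optional[str]:
    """Return 'V', 'R', or 'L', or None if the time lies outside every window."""
    for phase, (start, end) in PHASE_WINDOWS.items():
        # Assumption: window boundaries are treated as inclusive.
        if start <= seconds_after_injection <= end:
            return phase
    # Assumption: frames in the gaps between windows are left unassigned.
    return None


if __name__ == "__main__":
    for t in (60, 130, 300, 480):
        print(t, assign_phase(t))
```

Because the windows do not touch, a frame taken at, say, 130 s belongs to no group under this sketch, which mirrors the non-overlapping design of the three phases.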

To investigate how diffusion duration influences the discrimination of disease severity, a binary classification task was established by stratifying diabetic retinopathy (DR) into non-proliferative DR (NPDR) and proliferative DR (PDR). To mitigate potential bias arising from class imbalance, the sample sizes of the two categories were kept balanced. A total of 2,854 images were included in this task, and the sample distribution is detailed in Table 5. Furthermore, to ensure consistency with the experimental design, the binary classification task adopted an identical phase grouping strategy, as presented in Table 6.

2.4. DR Staging Models on FFA Images

In this paper, two state-of-the-art deep learning architectures, Swin-Transformer and ConvNeXt, were respectively adopted to establish intelligent DR staging models on FFA images.

2.4.1. Swin-Transformer Based DR Staging Model

In this DR staging model, the Swin-Transformer was adopted as the backbone network, as shown in Figure 1. Its core advantage lies in the introduction of the shifted window attention mechanism, which restricts the computational complexity of self-attention while enabling information exchange between adjacent windows, thereby effectively capturing global contextual information within the image. The specific model configuration was as follows: the Swin-Tiny (Swin-T) variant was employed, comprising four stages, with a window size of 7×7, a head dimension of 32, and a multilayer perceptron (MLP) expansion ratio of 4. Transfer learning was applied using pre-trained weights from ImageNet-22K, where the parameters of the first two stages were frozen, and the last two stages were fine-tuned to adapt to the distinctive features of FFA images.
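To make the complexity argument concrete, the sketch below compares the cost of global self-attention with window-based self-attention using the complexity expressions given in the original Swin Transformer paper, Ω(MSA) = 4hwC² + 2(hw)²C and Ω(W-MSA) = 4hwC² + 2M²hwC. The 7×7 window is taken from the configuration above; the 224×224 input resolution, the 4×4 patch size, and the first-stage channel width C = 96 are assumptions based on the standard Swin-T setup, since the input resolution is not stated here.

```python
def global_attention_cost(h: int, w: int, c: int) -> int:
    """Cost of global multi-head self-attention on an h x w token map with c channels."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_attention_cost(h: int, w: int, c: int, m: int) -> int:
    """Cost of self-attention restricted to non-overlapping m x m windows."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed first-stage settings of a standard Swin-T on a 224 x 224 input.
IMAGE_SIZE = 224   # assumption: not stated in the text
PATCH_SIZE = 4     # assumption: standard Swin patch size
CHANNELS = 96      # assumption: standard Swin-T first-stage width
WINDOW = 7         # from the text

tokens_per_side = IMAGE_SIZE // PATCH_SIZE
global_cost = global_attention_cost(tokens_per_side, tokens_per_side, CHANNELS)
window_cost = window_attention_cost(tokens_per_side, tokens_per_side, CHANNELS, WINDOW)

if __name__ == "__main__":
    print(f"global: {global_cost:,}")
    print(f"window: {window_cost:,}")
    print(f"ratio:  {global_cost / window_cost:.1f}x")
```

The quadratic term in the number of tokens is what the windowing removes: the window cost grows linearly with the token count, while shifting the windows between layers restores information flow across window borders.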

2.4.2. ConvNeXt Based DR Staging Model

In this DR staging model, the ConvNeXt was adopted as the backbone network, as shown in Figure 2. ConvNeXt is a modernized convolutional neural network (CNN) architecture that incorporates design concepts from Transformers. In this study, the ConvNeXt-Tiny variant was adopted. Its key architectural features include a patchify stem layer, depthwise separable convolutions, an inverted bottleneck structure, larger kernel sizes (7×7), and a reduced number of activation and normalization layers. This design preserves the computational efficiency advantages of CNNs while enhancing the model's representational capacity. Transfer learning was performed using ImageNet pre-trained weights. By comparing ConvNeXt with Swin-Transformer, this study aims to explore the relative advantages of attention mechanisms versus convolutional architectures in the task of FFA image grading.

2.5. Experimental Setup

The models were trained on the Ubuntu operating system, with the implementation written in Python (version 3.8). The computational setup consisted of a single graphics processing unit (GPU) (NVIDIA RTX 3090 with 24 GB memory). The cross-entropy loss function was adopted as the optimization objective for the classification tasks. The AdamW optimizer was used, with an initial learning rate of 1e−4. The batch size was set to 64, and each model was trained for 100 epochs.

To comprehensively and objectively assess the performance of the proposed DR intelligent staging models, four core evaluation metrics were employed: accuracy, precision, recall, and F1-score.
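For reference, the four metrics can be computed from a confusion matrix as in the minimal sketch below. The averaging scheme for the multi-class tasks is not specified here, so the macro average (unweighted mean over classes) used in this example is an assumption, as are the function name `classification_metrics` and the toy labels.

```python
from typing import Dict, List


def classification_metrics(
    y_true: List[int], y_pred: List[int], num_classes: int
) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    # confusion[i][j] counts samples of true class i predicted as class j.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    correct = sum(confusion[k][k] for k in range(num_classes))
    precisions, recalls, f1_scores = [], [], []
    for k in range(num_classes):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(num_classes))
        actual_k = sum(confusion[k])
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / len(y_true),
        # Assumption: macro averaging, i.e. every class counts equally.
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1_scores) / num_classes,
    }


if __name__ == "__main__":
    # Hypothetical labels for a three-class toy problem.
    toy_true = [0, 0, 1, 1, 2, 2]
    toy_pred = [0, 1, 1, 1, 2, 0]
    print(classification_metrics(toy_true, toy_pred, num_classes=3))
```

Under macro averaging, the F1-score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.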

2.6. Statistical Analysis

To evaluate the impact of contrast agent diffusion duration on model grading performance, the accuracy under each phase grouping was used as the primary outcome indicator and analyzed using a generalized linear model (GLM). In the model, phase grouping (V, R, L, V+R, V+R+L) served as the fixed-effect factor, with the corresponding accuracy as the dependent variable. The model was fitted using a Gaussian distribution family with an identity link function. This modeling strategy enabled direct comparison of the least-squares means of model accuracy across different diffusion duration conditions while controlling for other systematic variations.

Given that multiple comparisons were conducted among phase groups, the Bonferroni method was applied to adjust P-values for all pairwise group comparisons to control the Type I error rate. The adjustments were performed automatically using the relevant functions of the multcomp and emmeans packages in R software. For all statistical tests, a corrected P < 0.05 was considered statistically significant.
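The multiple-comparison step can be illustrated with a short sketch. With a Gaussian family and identity link, the GLM reduces to an ordinary linear model with phase group as a single categorical factor, so the pairwise tests compare group means. Five phase groups give ten pairwise comparisons, and the Bonferroni adjustment multiplies each raw P-value by the number of comparisons, capped at 1. The analysis itself was run in R; the Python below only mirrors the adjustment arithmetic, and the raw P-values in it are hypothetical placeholders, not results from this study.

```python
from itertools import combinations
from typing import Dict, Tuple

PHASE_GROUPS = ["V", "R", "L", "V+R", "V+R+L"]  # groups named in the text

Pair = Tuple[str, str]


def bonferroni_adjust(raw_p: Dict[Pair, float]) -> Dict[Pair, float]:
    """Multiply each raw P-value by the number of comparisons, capped at 1."""
    m = len(raw_p)
    return {pair: min(1.0, p * m) for pair, p in raw_p.items()}


# All unordered pairs of phase groups: 5 choose 2 = 10 comparisons.
pairs = list(combinations(PHASE_GROUPS, 2))

# Hypothetical raw P-values, for illustration only.
hypothetical_raw_p = {pair: 0.02 * (i + 1) for i, pair in enumerate(pairs)}
adjusted_p = bonferroni_adjust(hypothetical_raw_p)

if __name__ == "__main__":
    for pair in pairs:
        flag = "significant" if adjusted_p[pair] < 0.05 else "n.s."
        print(pair, f"raw={hypothetical_raw_p[pair]:.2f}",
              f"adjusted={adjusted_p[pair]:.2f}", flag)
```

With ten comparisons, a raw P-value must fall below 0.005 to remain significant after correction, which shows how conservative the procedure is when the number of phase groups is this large.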

This analytical approach was applied to the accuracy comparisons of both the Swin-Transformer and ConvNeXt based models under the International five-grade classification, the Chinese six-grade classification, and the binary classification (NPDR vs. PDR) tasks.

3. Results

In this study, two DR staging models built on Swin-Transformer and ConvNeXt architectures were employed to perform multiple DR staging experiments on FFA images stratified according to different contrast agent diffusion durations. The multiple DR staging experiments were designed in accordance with the International five-grade DR criteria, Chinese six-grade DR criteria, and binary classification criteria (NPDR vs. PDR). All experimental images were stratified by FFA phase, including venous-phase (V), recirculation-phase (R), late-phase (L), as well as their combined cohorts (V+R, V+R+L). Model performance was assessed using accuracy, precision, recall and F1-score. A generalized linear model was adopted to statistically compare accuracy differences across distinct phase groups, and the Bonferroni method was used for pairwise multiple comparison correction.

3.1. Performance of Swin-Transformer and ConvNeXt Based Staging Models on FFA Images with Different Diffusion Durations Under the International Five-Grade DR Classification Standard

Under the International five-grade DR classification standard, the performance of the two staging models on FFA images with different diffusion durations is shown in Table 7. The ConvNeXt based model achieved superior overall performance compared to the Swin-Transformer based model across all phase groups. In the venous-phase, the ConvNeXt based model attained its best staging performance, with an accuracy of 86.67%, precision of 84.62%, recall of 85.26%, and F1-score of 84.91%, all of which were higher than those of the Swin-Transformer based model (83.55%, 82.15%, 80.12%, and 80.83%, respectively).

Over the course of contrast agent diffusion, both models exhibited a gradual decline in performance metrics as the diffusion duration increased. Specifically, the accuracy of Swin-Transformer based model decreased from its peak of 83.55% in the venous-phase to its lowest value of 77.46% in the late-phase, while that of ConvNeXt based model decreased from 86.67% in the venous-phase to 81.87% in the late-phase.

After fitting the data with a generalized linear model and applying the Bonferroni correction, none of the pairwise comparisons among the venous-phase, recirculation-phase, late-phase, or their combined groups (V+R, V+R+L) reached statistical significance for either model (all corrected P > 0.05), as shown in Table 8.

3.2. Performance of Swin-Transformer and ConvNeXt Based Staging Models on FFA Images with Different Diffusion Durations Under the Chinese Six-Grade DR Classification Standard

Under the Chinese six-grade DR classification standard, the performance of the two staging models on FFA images across different diffusion phases is summarized in Table 9. The ConvNeXt based model yielded consistently better overall performance than the Swin-Transformer based model across all phase subgroups.

In the venous-phase, the ConvNeXt based model achieved its best staging results, with an accuracy of 85.78%, precision of 89.57%, recall of 86.60%, and F1-score of 87.77%. All metrics were higher than those of the Swin-Transformer based model (80.87%, 86.00%, 77.37%, and 80.97%, respectively).

Longitudinal comparison revealed a clear declining trend in all performance indicators for both models as contrast agent diffusion time prolonged. The accuracy of Swin-Transformer based model dropped from a peak of 80.87% in the venous-phase to a minimum of 76.66% in the late-phase; meanwhile, ConvNeXt based model accuracy decreased from 85.78% in the venous-phase to 80.98% in the late-phase.
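The venous-to-late accuracy drops for the two multi-class standards can be reproduced directly from the accuracies reported in the text. The short sketch below does only this arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Accuracies (%) reported in the text for the venous (V) and late (L) phases.
REPORTED_ACCURACY = {
    ("International five-grade", "Swin-Transformer"): {"V": 83.55, "L": 77.46},
    ("International five-grade", "ConvNeXt"): {"V": 86.67, "L": 81.87},
    ("Chinese six-grade", "Swin-Transformer"): {"V": 80.87, "L": 76.66},
    ("Chinese six-grade", "ConvNeXt"): {"V": 85.78, "L": 80.98},
}


def venous_to_late_drop(accuracy_by_phase: dict) -> float:
    """Accuracy drop from venous to late phase, in percentage points."""
    return round(accuracy_by_phase["V"] - accuracy_by_phase["L"], 2)


drops = {key: venous_to_late_drop(acc) for key, acc in REPORTED_ACCURACY.items()}

if __name__ == "__main__":
    for (standard, model), drop in drops.items():
        print(f"{standard}, {model}: {drop:.2f} percentage points")
```

The drops are of similar size under both standards, between roughly four and six percentage points, which is the numerical trend that the statistical tests then evaluate.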

Generalized linear model fitting followed by Bonferroni correction showed no statistically significant differences in pairwise comparisons among the venous-phase, recirculation-phase, late-phase, and their combined groups for either model (P > 0.05), as shown in Table 10.

3.3. Performance of Swin-Transformer and ConvNeXt Based Staging Models on FFA Images with Different Diffusion Durations Under the International NPDR vs. PDR Classification Standard

Under the International binary classification standard of NPDR versus PDR, the performance of the two staging models on FFA images across different contrast diffusion phases is presented in Table 11. The ConvNeXt based model achieved superior overall performance relative to the Swin-Transformer based model across all phase subgroups.

In the venous-phase, ConvNeXt based model yielded the optimal classification results, with an accuracy of 94.92%, precision of 95.17%, recall of 89.32%, and F1-score of 91.86%. All these metrics outperformed those of the Swin-Transformer based model (93.75%, 91.57%, 89.26%, and 90.35%, respectively). Longitudinal observation showed that both models exhibited a slight numerical decline in performance as contrast agent diffusion duration prolonged. The accuracy of the Swin-Transformer based model decreased from a peak of 93.75% in the venous-phase to a minimum of 92.76% in the late-phase; for ConvNeXt based model, accuracy declined from the highest value of 94.92% in the venous-phase to the lowest level of 93.75% in the combined V+R+L group.

Following generalized linear model fitting and Bonferroni correction for multiple comparisons, no statistically significant differences were detected in any pairwise comparisons among the venous-phase, recirculation-phase, late-phase, and their combined groups for either model (P > 0.05), as shown in Table 12.

3.4. Performance of Swin-Transformer and ConvNeXt Based Staging Models on FFA Images with Different Diffusion Durations Under the Chinese NPDR vs. PDR Classification Standard

Under the Chinese NPDR vs. PDR binary classification standard, the performance of the two staging models on FFA images across different contrast agent diffusion phases is presented in Table 13. The Swin-Transformer based and ConvNeXt based models exhibited comparable overall performance on venous-phase (V) images: both achieved an accuracy of 93.30%, with F1-scores of 93.29% and 93.28%, respectively.

From a longitudinal perspective, both models demonstrated a gradual numerical decline in performance metrics as contrast agent diffusion duration prolonged. For the Swin-Transformer based model, accuracy decreased from its peak of 93.30% in the venous-phase to its lowest value of 91.34% in the V+R+L combined group; similarly, ConvNeXt’s accuracy declined from the maximum of 93.30% (venous-phase) to the minimum of 90.18% (V+R+L combined group).

Following generalized linear model (GLM) fitting and Bonferroni correction for multiple pairwise comparisons, no statistically significant differences were observed among the venous-phase, recirculation-phase, late-phase, and their combined groups for either model (P > 0.05), as shown in Table 14.

4. Discussion

Fluorescein fundus angiography (FFA) enables the detection of subtle DR lesions and serves as a pivotal imaging modality for evaluating DR severity. FFA images acquired at different contrast diffusion times vary in lesion manifestation. Theoretically, FFA images with different diffusion durations may significantly impact FFA-based DR staging model performance.

In this paper, two state-of-the-art deep learning architectures, Swin-Transformer and ConvNeXt, were each used to build intelligent DR staging models on FFA images. Strictly following the International five-grade and Chinese six-grade DR classification criteria, we systematically evaluated the effects of images with different contrast diffusion durations on model performance. The results showed that the impact of diffusion duration on model performance did not reach statistical significance (corrected P > 0.05), although performance exhibited a numerical downward trend in late-phase images, plausibly due to decreased contrast and lesion obscuration by hyperfluorescent leakage. To our knowledge, this study is the first to delineate the marginal effect of contrast diffusion duration from both statistical and mechanistic perspectives, providing empirical evidence for standardized FFA image acquisition and robustness optimization of intelligent DR staging models.

In the International five-grade, Chinese six-grade, and binary classification tasks, ConvNeXt based model generally outperformed Swin-Transformer based model in terms of accuracy, precision, recall, and F1-score across all phase groups, achieving the best staging performance in the venous-phase. This advantage may be attributed to the fact that ConvNeXt, while retaining the local feature extraction capabilities of convolutional neural networks, incorporates design concepts from Transformers (e.g., 7×7 large convolutional kernels and inverted bottleneck structures). This makes ConvNeXt more adept at capturing the coupling characteristics between local vascular structures (e.g., microaneurysms and IRMA) and global leakage patterns in FFA images. In contrast, although Swin-Transformer possesses powerful global modeling capabilities, in late-phase images with low contrast and blurred lesion boundaries, the attention mechanism may shift due to the lack of local structural constraints, leading to performance degradation. This finding suggests that hybrid architectures integrating convolutional local perception and Transformer global modeling possess greater potential for FFA image analysis.

Generalized linear model analysis with Bonferroni multiple comparison correction yielded no significant differences in all pairwise phase comparisons. For example, in the International five-grade task, the accuracy difference between the venous and late-phases for Swin-Transformer based model was 6.09 percentage points (corrected P = 0.3651); in the Chinese six-grade task, the F1-score difference between the venous-phase and the combined group for ConvNeXt based model reached 7.60 percentage points (corrected P = 0.8994). The lack of statistical significance may be attributed to the following factors: (1) the relatively large sample size (7,508 images) allowed the models to achieve a certain degree of phase robustness; (2) both Swin-Transformer and ConvNeXt architectures have a high tolerance for fluctuations in image quality; (3) when annotating recirculation-phase and late-phase images, physicians could refer to the venous-phase images of the same patient, which reduced annotation bias to some extent. Notably, the Chinese six-grade fine-grained task showed pronounced performance degradation: the F1-score of Swin-Transformer based model decreased by 17.84 percentage points and that of ConvNeXt based model decreased by 12.13 percentage points from the venous to the late-phase. This indicates that phase-related performance attenuation is more prominent in fine-grained DR grading, which warrants further validation in larger cohorts and higher-resolution datasets.

Despite the absence of statistical significance, the gradual performance decline observed in late-phase images has a plausible pathophysiological and imaging basis. First, reduced imaging contrast: with gradual vascular perfusion and metabolic clearance of the contrast agent over time, the fluorescence intensity gradient between retinal vessels and the background declines markedly. In the late-phase, retinal vasculature is largely cleared of fluorescein [23], blurring the boundaries of key lesions such as microaneurysms and capillary non-perfusion areas, and hindering the extraction of stable discriminative features. In this study, both models showed reduced recall in late-phase images; for Swin-Transformer under the Chinese six-grade standard, recall dropped from 77.37% to 61.40%, indicating an elevated risk of missed lesion detection. Second, lesion masking by hyperfluorescent leakage: in patients with PDR, extensive hyperfluorescent leakage is commonly observed in the late-phase. Chronic hyperglycemia disrupts tight junctions of retinal vascular endothelial cells, impairs the inner blood–retinal barrier, and increases vascular permeability [24]. Progressive retinal ischemia and hypoxia further induce neovascularization, whose immature endothelial structure is highly prone to dye leakage. Pathological manifestations include vascular wall staining, capillary dilatation, and focal dye pooling [22]. These hyperfluorescent regions may partially or completely obscure microaneurysms, retinal hemorrhages and neovascularization, ultimately inducing model misclassification and missed diagnosis. In the Chinese six-grade task, the precision of Swin-Transformer declined sharply from 86.00% to 65.88% in late-phase images, which is consistent with leakage-induced lesion obscuration increasing the risk of false-positive predictions.

This study adopted both the International five-grade and Chinese six-grade DR classification standards, and the performance trend across phases was highly consistent under both (venous-phase best, late-phase or combined group lowest), indicating that the phase effect is robust across classification standards. The Chinese six-grade standard, which further subdivides the proliferative stage, yielded lower overall accuracy than the International five-grade standard, suggesting that more fine-grained classification tasks impose higher demands on image quality and model discriminative ability.

The present findings have practical implications for clinical FFA acquisition and the deployment of intelligent DR analysis systems. First, optimization of acquisition timing: venous or recirculation-phase images are recommended as the preferred data source for constructing intelligent DR staging models. Second, phase labeling in clinical deployment: in clinical computer-aided diagnostic systems based on FFA, the contrast diffusion phase of input images should be explicitly recorded. For late-phase images, the system may appropriately lower its reported prediction confidence and provide clinical risk prompts to support clinician judgment. Third, multi-phase mixing in model training: the performance of combined phase groups fell between that of the single venous and late phases, suggesting that incorporating multi-phase FFA images into training datasets may enhance cross-phase generalization and model robustness.

Several limitations of this study should be acknowledged. First, all data were retrospectively collected from a single medical center. Despite the large sample size, geographical and device homogeneity may limit the external generalizability of the conclusions. Future multicenter prospective cohort studies are required for further validation. Second, this study only qualitatively explored the mechanism of lesion masking by late-phase leakage, without quantitative lesion-level evaluation, such as dynamic curves of microaneurysm detection rate across continuous diffusion time points. Moreover, the continuous dynamic FFA process was discretized into three independent phase subgroups in this work. Advanced strategies including temporal attention mechanisms and video-level sequential modeling were not investigated, which represents a valuable direction for future research to enhance model comprehension of dynamic FFA angiographic sequences.

5. Conclusion

In this paper, two state-of-the-art deep learning architectures, Swin-Transformer and ConvNeXt, were used to build intelligent diabetic retinopathy (DR) staging models on fundus fluorescein angiography (FFA) images. Strictly adhering to the International five-grade and Chinese six-grade DR classification criteria, we systematically evaluated the impact of FFA images acquired at different contrast agent diffusion times on model performance.

The findings demonstrate that FFA images acquired at different contrast agent diffusion times exert only a marginal, non-statistically significant impact on the accuracy of intelligent DR staging. This investigation clarifies the role of contrast agent diffusion time, a critical clinical variable, in regulating the performance of DR staging models. Furthermore, it provides a valuable theoretical reference for the development of FFA-based DR staging strategies, offers practical guidance for optimizing the selection of image acquisition phases in FFA examinations, and contributes to the precise diagnosis and standardized clinical management of DR.

Author Contributions

Wei Wang (First Author): Conceptualization, Data curation, Formal analysis, Writing – original draft, Validation (technical support for the model). Zhenpeng Chen: Methodology, Software, Validation (technical support for the model). Wei Wang, Mingming Li, and Shuang Li: Acquisition of data and data annotation (FFA image analysis). Kang Wang (Ophthalmologist): Clinical expertise (DR staging guidance). Haiyun Li (Corresponding Author): Supervision, Project administration, Writing – review & editing.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Rudnicka, A. R.; Shakespeare, R.; Chambers, R.; et al. Automated retinal image analysis systems to triage for grading of diabetic retinopathy: a large-scale, open-label, national screening programme in England. Lancet Digit. Health 2025, 7(11), 100914.
Li, S.; Zhao, R.; Zou, H. Artificial intelligence for diabetic retinopathy. Chin. Med. J. 2022, 135(3), 253–260.
Raj, A.; Singla, A.; Sidana, S. Preventive and therapeutic strategies via health care delivery system to minimize sight-threatening diabetic retinopathy: a narrative review. Curr. Diabetes Rep. 2025, 25(1), 36.
Liu, W.; Chen, J.; Niu, S.; et al. Recent advances in the study of circadian rhythm disorders that induce diabetic retinopathy. Biomed. Pharmacother. 2023, 166, 115368.
Seo, H.; Park, S. J.; Song, M. Diabetic retinopathy (DR): mechanisms, current therapies, and emerging strategies. Cells 2025, 14(5), 376.
Farahat, Z.; Zrira, N.; Souissi, N.; et al. Diabetic retinopathy screening through artificial intelligence algorithms: A systematic review. Surv. Ophthalmol. 2024, 69(5), 707–721.
Grauslund, J. Diabetic retinopathy screening in the emerging era of artificial intelligence. Diabetologia 2022, 65(9), 1415–1423.
Ikram, A.; Imran, A. ResViT FusionNet Model: An explainable AI-driven approach for automated grading of diabetic retinopathy in retinal images. Comput. Biol. Med. 2025, 186, 109656.
Li, X.; Hu, X.; Yu, L.; et al. CANet: cross-disease attention network for joint diabetic retinopathy and diabetic macular edema grading. IEEE Trans. Med. Imaging 2019, 39(5), 1483–1493.
Shukla, D.; Dhawan, A.; Kalliath, J. Featureless retina in diabetic retinopathy: Clinical and fluorescein angiographic profile. Indian J. Ophthalmol. 2021, 69(11), 3194–3198.
Liu, X.; Xie, J.; Hou, J.; et al. D-GET: group-enhanced transformer for diabetic retinopathy severity classification in fundus fluorescein angiography. J. Med. Syst. 2025, 49(1), 34.
An, D.; Tan, B.; Yu, D. Y.; et al. Differentiating microaneurysm pathophysiology in diabetic retinopathy through objective analysis of capillary nonperfusion, inflammation, and pericytes. Diabetes 2022, 71(4), 733–746.
Liu, Q.; Zhu, H.; Qian, T.; et al. Spatial-frequency dual-constrained Mamba diffusion model for cross-modal generation from CFP to FFA. Med. Image Anal. 2025, 109, 103925.
Daho, M. E. H.; Li, Y.; Zeghlache, R.; et al. DISCOVER: 2-D multiview summarization of optical coherence tomography angiography for automatic diabetic retinopathy diagnosis. Artif. Intell. Med. 2024, 149, 102803.
Zhao, X.; Lin, Z.; Yu, S.; et al. An artificial intelligence system for the whole process from diagnosis to treatment suggestion of ischemic retinal diseases. Cell Rep. Med. 2023, 4(10), 101197.
Shao, A.; Liu, X.; Shen, W.; et al. Generative artificial intelligence for fundus fluorescein angiography interpretation and human expert evaluation. npj Digit. Med. 2025, 8(1), 396.
Chen, X.; Zhang, W.; Xu, P.; et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digit. Med. 2024, 7(1), 111.
Rasta, S. H.; Nikfarjam, S.; Javadzadeh, A. Detection of retinal capillary nonperfusion in fundus fluorescein angiogram of diabetic retinopathy. BioImpacts BI 2015, 5(4), 183.
Gao, Z.; Jin, K.; Yan, Y.; et al. End-to-end diabetic retinopathy grading based on fundus fluorescein angiography images using deep learning. Graefe's Arch. Clin. Exp. Ophthalmol. 2022, 260(5), 1663–1673.
Pan, X.; Jin, K.; Cao, J.; et al. Multi-label classification of retinal lesions in diabetic retinopathy for automatic analysis of fundus fluorescein angiography based on deep learning. Graefe's Arch. Clin. Exp. Ophthalmol. 2020, 258(4), 779–785.
Yang, Y.; Li, F.; Liu, T.; et al. Comparison of widefield swept-source optical coherence tomographic angiography and fluorescein fundus angiography for detection of retinal neovascularization with diabetic retinopathy. BMC Ophthalmol. 2023, 23(1), 315.
Retinal Angiography and Optical Coherence Tomography; Springer: New York, 2009.
Johnson, R. N.; Fu, A. D.; McDonald, H. R.; et al. Fluorescein Angiography: Basic Principles and Interpretation. In Retina, 5th ed.; Elsevier Inc., 2012; pp. 2–50.e1.
Lutty, G. A. Effects of diabetes on the eye. Investig. Ophthalmol. Vis. Sci. 2013, 54(14), ORSF81–ORSF87.

Figure 1. Architecture diagram of the Swin-Tiny model.

Figure 2. Architecture diagram of the ConvNeXt model.

Table 1. The distribution of image samples annotated according to the International five-grade DR classification standard.

Dataset	level 1	level 2	level 3	level 4	level 5	Total
Sample size	791	635	1353	3222	1507	7,508
Percentage	10.54%	8.46%	18.02%	42.91%	20.07%	100%

Table 2. The distribution of image samples annotated according to the Chinese six-grade DR classification standard.

Dataset	level 1	level 2	level 3	level 4	level 5	level 6	Total
Sample size	635	1353	3222	1013	303	191	6717
Percentage	9.45%	20.14%	47.97%	15.08%	4.51%	2.84%	100%

Table 3. Grouping of images with different diffusion durations under the International five-grade DR classification standard (V: venous phase; R: recirculation phase; L: late phase).

Phase Group	level 1	level 2	level 3	level 4	level 5	Total
V	263	207	465	942	487	2364
R	300	293	646	1709	785	3733
L	228	135	242	571	235	1411
V+R	563	500	1111	2651	1272	6097
V+R+L	791	635	1353	3222	1507	7,508
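
The combined-phase rows in Table 3 are element-wise sums of the single-phase rows. The short Python sketch below reproduces that arithmetic from the V, R, and L counts in the table; the helper name `combine` and the dictionary layout are illustrative choices, not part of the study's pipeline.

```python
# Per-level image counts for the single-phase groups, copied from Table 3
# (International five-grade standard, levels 1 to 5).
single_phase = {
    "V": [263, 207, 465, 942, 487],
    "R": [300, 293, 646, 1709, 785],
    "L": [228, 135, 242, 571, 235],
}


def combine(*phases):
    """Element-wise sum of the per-level counts of the named phases."""
    rows = [single_phase[p] for p in phases]
    return [sum(level_counts) for level_counts in zip(*rows)]


# Build the combined groups from the single-phase rows.
groups = dict(single_phase)
groups["V+R"] = combine("V", "R")
groups["V+R+L"] = combine("V", "R", "L")

# Print each group with its per-level counts and row total.
for name, levels in groups.items():
    print(name, levels, sum(levels))
```

The same check can be applied to any of the grouping tables by swapping in the corresponding single-phase rows.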

Table 4. Grouping of images with different diffusion durations under the Chinese six-grade DR classification standard.

Phase Group	level 1	level 2	level 3	level 4	level 5	level 6	Total
V	207	465	942	320	116	51	2101
R	293	646	1709	521	147	117	3433
L	135	242	571	172	40	23	1183
V+R	500	1111	2651	841	263	168	5534
V+R+L	635	1353	3222	1013	303	191	6717

Table 5. The distribution of image samples annotated according to the two-stage DR classification standard.

Dataset	International NPDR	International PDR	Chinese NPDR	Chinese PDR
Sample Size	1427	1427	1427	1427
Percentage	50%	50%	50%	50%

Table 6. Grouping of images with different diffusion durations under the two-stage DR classification standard.

Phase Group	International NPDR	International PDR	Chinese NPDR	Chinese PDR
V	488	488	488	488
R	669	669	669	669
L	270	270	270	270
V+R	1157	1157	1157	1157
V+R+L	1427	1427	1427	1427

Table 7. Performance of the two staging models on FFA images with different diffusion durations under the International five-grade DR classification standard.

Model	Phase Group	ACC (%)	PRE (%)	RECALL (%)	F1-SCORE (%)
Swin-Transformer based model	V	83.55	82.15	80.12	80.83
	R	81.48	79.94	78.12	78.65
	L	77.46	77.16	76.07	76.44
	V+R	79.36	77.74	76.34	76.90
	V+R+L	79.01	76.70	75.21	75.70
ConvNeXt based model	V	86.67	84.62	85.26	84.91
	R	83.84	82.50	80.75	81.25
	L	81.87	80.06	80.60	79.88
	V+R	84.58	83.18	82.46	82.74
	V+R+L	83.92	83.05	81.98	82.42
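
For readers who wish to reproduce metrics of this kind, the following Python sketch derives accuracy, precision, recall, and F1-score from a confusion matrix. The averaging scheme used for the multi-class results is not stated in this section, so the sketch shows both macro and support-weighted averaging as possibilities. The confusion matrix is hypothetical and is not data from this study; the function names are illustrative.

```python
def per_class_metrics(cm):
    """Per-class precision, recall, F1 and support.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    precision, recall, f1, support = [], [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(f)
        support.append(actual_k)
    return precision, recall, f1, support


def summarize(cm, average="macro"):
    """Accuracy plus macro- or support-weighted precision, recall and F1."""
    p, r, f, support = per_class_metrics(cm)
    total = sum(support)
    accuracy = sum(cm[k][k] for k in range(len(cm))) / total
    if average == "macro":
        weights = [1 / len(cm)] * len(cm)
    else:  # "weighted": each class weighted by its share of true samples
        weights = [s / total for s in support]

    def avg(values):
        return sum(w * v for w, v in zip(weights, values))

    return {"accuracy": accuracy, "precision": avg(p), "recall": avg(r), "f1": avg(f)}


# Hypothetical two-class confusion matrix (NOT results from the study).
cm = [[90, 10],
      [10, 40]]

macro = summarize(cm, average="macro")
weighted = summarize(cm, average="weighted")
print(macro)
print(weighted)
```

Support-weighted recall always equals accuracy, whereas macro recall generally differs from it on imbalanced data, which offers a quick way to tell the two schemes apart.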

Table 8. P values for pairwise comparisons between phase groups, assessing the effect of contrast agent diffusion duration on the staging performance of the two models under the International five-grade DR classification standard.

Model	Phase Group	V	R	L	V+R
Swin-Transformer based model	R	0.9502
	L	0.3651	0.4619
	V+R	0.5774	0.7152	0.9249
	V+R+L	0.5209	0.6360	0.9684	0.9994
ConvNeXt based model	R	0.8214
	L	0.5129	0.9092
	V+R	0.9244	0.9910	0.7189
	V+R+L	0.8139	1.0000	0.8681	0.9904
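
The inter-phase comparisons in this study used a generalized linear model with Bonferroni correction. The Python sketch below illustrates only the Bonferroni step for the ten pairwise comparisons among five phase groups; it does not fit the generalized linear model. The raw P values are hypothetical placeholders rather than study results, and correcting over all ten pairs is an assumption.

```python
from itertools import combinations


def bonferroni(p_values):
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


groups = ["V", "R", "L", "V+R", "V+R+L"]
pairs = list(combinations(groups, 2))  # 10 unordered pairs of phase groups

# Hypothetical raw P values, one per pair (NOT results from the study).
raw = [0.30, 0.04, 0.08, 0.07, 0.20, 0.15, 0.12, 0.45, 0.60, 0.90]

adjusted = bonferroni(raw)
for (a, b), p_raw, p_adj in zip(pairs, raw, adjusted):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{a} vs {b}: raw={p_raw:.2f}, adjusted={p_adj:.2f} ({verdict})")
```

In this illustration a raw P value of 0.04 would pass an uncorrected 0.05 threshold but not the corrected one, which is the conservative behavior the correction is intended to provide.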

Table 9. Performance of the two staging models on FFA images with different diffusion durations under the Chinese six-grade DR classification standard.

Model	Phase Group	ACC (%)	PRE (%)	RECALL (%)	F1-SCORE (%)
Swin-Transformer based model	V	80.87	86.00	77.37	80.97
	R	79.45	77.01	70.77	73.40
	L	76.66	65.88	61.40	63.13
	V+R	80.61	82.45	74.58	77.91
	V+R+L	80.16	77.18	73.85	75.33
ConvNeXt based model	V	85.78	89.57	86.60	87.77
	R	85.53	83.16	80.26	81.56
	L	80.98	77.25	74.29	75.64
	V+R	85.56	82.57	82.33	82.36
	V+R+L	83.30	81.11	80.01	80.42

Table 10. P values for pairwise comparisons between phase groups, assessing the effect of contrast agent diffusion duration on the staging performance of the two models under the Chinese six-grade DR classification standard.

Model	Phase Group	V	R	L	V+R
Swin-Transformer based model	R	0.9900
	L	0.7509	0.8327
	V+R	1.0000	0.9748	0.5180
	V+R+L	0.9992	0.9949	0.6022	0.9989
ConvNeXt based model	R	1.0000
	L	0.6036	0.3228
	V+R	1.0000	1.0000	0.2680
	V+R+L	0.8994	0.6836	0.8468	0.5928

Table 11. Performance of the two staging models on FFA images with different diffusion durations under the International NPDR vs. PDR classification standard.

Model	Phase Group	ACC (%)	PRE (%)	RECALL (%)	F1-SCORE (%)
Swin-Transformer based model	V	93.75	91.57	89.26	90.35
	R	93.72	91.37	85.50	88.07
	L	92.76	89.82	85.38	87.36
	V+R	93.65	91.36	86.33	88.57
	V+R+L	93.43	91.35	85.67	88.15
ConvNeXt based model	V	94.92	95.17	89.32	91.86
	R	94.73	92.33	88.46	90.24
	L	94.06	91.98	87.78	89.68
	V+R	94.43	92.26	88.31	90.12
	V+R+L	93.75	90.82	87.54	89.06

Table 12. P values for pairwise comparisons between phase groups, assessing the effect of contrast agent diffusion duration on the staging performance of the two models under the International NPDR vs. PDR classification standard.

Model	Phase Group	V	R	L	V+R
Swin-Transformer based model	R	1.0000
	L	0.9889	0.9695
	V+R	1.0000	1.0000	0.9737
	V+R+L	0.9997	0.9986	0.9903	0.9994
ConvNeXt based model	R	1.0000
	L	0.9903	0.9886
	V+R	0.9980	0.9984	0.9987
	V+R+L	0.9513	0.8622	0.9995	0.9478

Table 13. Performance of the two staging models on FFA images with different diffusion durations under the Chinese NPDR vs. PDR classification standard.

Model	Phase Group	ACC (%)	PRE (%)	RECALL (%)	F1-SCORE (%)
Swin-Transformer based model	V	93.30	93.41	93.30	93.29
	R	91.73	91.88	91.73	91.72
	L	91.67	91.80	91.67	91.66
	V+R	91.34	91.35	91.34	91.34
	V+R+L	90.53	90.56	90.53	90.52
ConvNeXt based model	V	93.30	93.86	93.30	93.28
	R	92.59	92.64	92.59	92.59
	L	91.67	91.68	91.67	91.67
	V+R	91.56	91.56	91.56	91.56
	V+R+L	90.18	90.27	90.18	90.17

Table 14. P values for pairwise comparisons between phase groups, assessing the effect of contrast agent diffusion duration on the staging performance of the two models under the Chinese NPDR vs. PDR classification standard.

Model	Phase Group	V	R	L	V+R
Swin-Transformer based model	R	0.9710
	L	0.9851	1.0000
	V+R	0.9191	0.9998	1.0000
	V+R+L	0.7666	0.9803	0.9958	0.9913
ConvNeXt based model	R	0.9985
	L	0.9851	0.9982
	V+R	0.9442	0.9893	1.0000
	V+R+L	0.6893	0.8101	0.9890	0.9410

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.