Assessing the Performance of Artificial Intelligence Systems for the Screening of Diabetic Retinopathy: A Systematic Review and Meta-Analysis

Diabetic retinopathy is the most common microvascular complication of diabetes mellitus and one of the leading causes of blindness globally. Due to the progressive nature of the disease, earlier detection and timely treatment can lead to substantial reductions in the incidence of irreversible vision loss. Artificial intelligence (AI) screening systems have offered clinically acceptable and quicker results in detecting diabetic retinopathy from retinal fundus and optical coherence tomography (OCT) images. Thus, this systematic review and meta-analysis of relevant investigations was performed to document the performance of AI screening systems that were applied to fundus and OCT images of patients from diverse geographic locations including North America, Europe, Africa, Asia, and Australia. A systematic literature search on Medline, Global Health, and PubMed was performed and studies published between October 2015 and January 2020 were included. The search strategy was based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines, and an AI-based investigation was mandatory for study inclusion. The abstracts, titles, and full texts of potentially eligible studies were screened against inclusion and exclusion criteria. Twenty-one studies were included in this systematic review; 18 met inclusion criteria for the meta-analysis. The pooled sensitivity of the evaluated AI screening systems in detecting diabetic retinopathy was 0.93 (95% CI: 0.92-0.94) and the specificity was 0.88 (95% CI: 0.86-0.89). The included studies detailed training and external validation datasets, criteria for diabetic retinopathy case ascertainment, imaging modalities, and DR-grading scales, and compared AI results to those of human graders (e.g., ophthalmologists, retinal specialists, trained nurses, and other healthcare providers) as a reference standard.
The findings of this study showed that the majority of AI screening systems demonstrated clinically acceptable levels of sensitivity and specificity for detecting referable diabetic retinopathy from retinal fundus and OCT photographs. Further improvement depends on the continual development of novel algorithms with large and gradable sets of images for training and validation. If cost-effectiveness ratios can be optimized, AI can become a financially sustainable and clinically effective intervention that can be incorporated into the healthcare systems of low-to-middle income countries (LMICs) and geographically remote locations. Combining screening technologies with treatment interventions such as anti-VEGF therapy, acellular capillary laser treatment, and vitreoretinal surgery can lead to substantial reductions in the incidence of irreversible vision loss due to proliferative diabetic retinopathy.


Introduction
Diabetes mellitus is a global epidemic that affects approximately 422 million people, and its prevalence has been increasing rapidly in recent decades (22). The traditional approach to caring for diabetes mellitus in diverse health settings, including primary, secondary and tertiary care facilities, has been ineffective in addressing diabetes-induced complications, resulting in limited access to screening resources, increasing disease incidence rates, and unfavorable outcomes in low, middle, and high-income countries (22). Innovative and novel methods of care must therefore be developed in an effort to address the systemic effects of diabetes mellitus on the health of patients. Diabetic retinopathy, which affects one-third of diabetes mellitus patients, is the most prevalent diabetes-induced complication and causes preventable vision loss if left untreated (23). Specifically, diabetic retinopathy is a microvascular complication of diabetes mellitus that leads to the development of lesions that progressively damage the retina over time (23). Detecting diabetic retinopathy during its early stages is essential to prevent progressive vision loss. With an approximate incidence range of 2.4 to 13.1%, diabetic retinopathy is the leading cause of vision loss in low-to-middle-income countries (LMICs), with adults aged 18-64 constituting the most at-risk group for the condition (24). Public health interventions aimed at identifying and managing diabetic retinopathy in its early stages and at increasing participation in and access to screening and treatment services are crucial.
Although substantial data regarding diabetic retinopathy pathology exists and a comprehensive guide has been developed for ophthalmologists and internists by the International Council of Ophthalmology (ICO) detailing evidence-based principles for diagnosis, definition, screening and referral criteria, follow-up, and management options, a lack of screening programs in many countries is contributing to increasing rates of preventable vision loss (25). National diabetic retinopathy screening programs are not commonly incorporated in many countries because the implementation and maintenance of such programs require substantial resources, and many patients are unaware that they have the condition (23,25). Furthermore, attending recommended follow-ups is a challenge for patients with diabetes who are under financial pressure or who lack any mode of transportation (23,25). An additional contributor to the global ubiquity of diabetic retinopathy is the lack of resource availability for care and the cost of essential long-term treatment. For example, diabetic macular edema, which occurs when there is abnormal leakage of fluid in the macula from damaged blood vessels in the retina and is commonly caused by diabetic retinopathy, requires long-term and expensive treatments including anti-vascular endothelial growth factor (anti-VEGF) injections (26). Variable responses to treatment make caring for diabetic retinopathy-induced complications even more difficult and less reliable (24,26).

Framework for diabetic retinopathy management and clinical gaps
Reducing the burden of diabetic retinopathy necessitates a balance of individual and collective preventive measures, ranging from intensive medical treatment for diabetes patients experiencing progressive vision loss and screening for undiagnosed diabetic retinopathy to changes in transport or economic policies affecting the majority of the population (27). The management of diabetic retinopathy can be improved by implementing particular interventions in different settings ranging from individual to population care (23,27).
Preventing diabetes mellitus is one of the most upstream methods of preventing diabetic retinopathy (28). Strategies for doing so include but are not limited to tight glycemic control through dietary modifications, earlier detection by screening, physical activity promotion for high-risk groups, worksite behavioral interventions, changes to internal built environments and transport infrastructure, or fiscal policy to support access to healthy food (27,28).
Additional strategies include improving public awareness, developing evidence-based clinical guidelines and screening programs, and optimizing the utilization of anti-VEGF for progressive diabetic retinopathy (29).
The lack of sustainable diabetic retinopathy screening programs has formed a serious clinical gap in managing the condition (23,29). In order to close this gap, sustainable screening programs must be developed globally so that diabetes mellitus patients can be cared for in accessible primary care settings (30). Additionally, the advancement of risk prediction methods is necessary to ascertain which patients will likely develop vision-threatening diabetic retinopathy (28,30). At present, not even ophthalmologists can predict which group of diabetic retinopathy patients is at increased risk of vision loss. Lastly, improvements in accessibility and treatments for those experiencing vision loss from diabetic retinopathy are essential (30). Such treatments include acellular capillary laser treatments, anti-VEGF injections, and vitreoretinal surgeries (26).
Despite the existence of solutions to address clinical gaps, significant barriers stand in the way of their implementation (31). Establishing financially sustainable screening programs is a key challenge, particularly for LMICs that lack the resources to develop and maintain such efforts (31). Additionally, the recruitment and training of retinal specialists remains a significant challenge as their demand far exceeds the number of professionals available to screen retinal images of diabetes mellitus patients (32).
With the incidence of diabetes mellitus and vision-threatening diabetic retinopathy rapidly increasing globally, it is imperative that novel screening methods that are accurate, cost-effective, and sustainable are developed.

Artificial intelligence application in Ophthalmology
In the field of ophthalmology, the application of artificial intelligence (AI) using machine learning (ML) and deep learning (DL) has been extensively investigated. Specific ocular conditions that have been assessed with the use of AI technologies include glaucoma, age-related macular degeneration, and non-proliferative and proliferative retinopathies (33). Several applications of AI in optical coherence tomography (OCT) and retinal fundus photography have demonstrated high performance levels comparable to those of manual retinal graders and ophthalmologists (27,33). DL technologies have become useful in identifying macular edema based on OCT images, which can be particularly useful when screening for late-stage diabetic retinopathy (34).

Artificial intelligence application for diabetic retinopathy
One of the most promising methods for the large-scale management of diabetic retinopathy is the use of AI screening technologies. The success of AI screening systems largely depends on the presence of accessible nationwide screening programs in which diabetes mellitus patients are reminded to attend routine appointments (33,35). The accuracy of such technological developments, as assessed by prior investigations, has surpassed clinically acceptable thresholds of sensitivity (the number of true positive assessments divided by the number of all participants who actually have the condition) and specificity (the number of true negative assessments divided by the number of all participants who do not have the condition). With recent advancements in ML and DL, countries across the globe are revisiting the incorporation of AI systems for diabetic retinopathy screening (30,35).
Diabetic retinopathy is primed for AI. Screening for the condition depends solely on the use of a single image, whether it be color retinal fundus photographs or OCT images (27). Regardless of whether a diabetes mellitus patient is symptomatic or asymptomatic for diabetic retinopathy, the aforementioned imaging techniques will display the presence of hallmark morphological lesions so long as the condition is developing (36). Trained AI systems are thus sufficient to screen for retinal lesions and can serve as an effective and efficient solution for the scarcity of diabetic retinopathy management (25,34). An additional upside to automated diabetic retinopathy screening is that it does not replace the role of eye care professionals (35,36). By establishing wide-spread and accessible screening programs, the rate of disease detection will likely increase in parallel with the subsequent demand for tertiary care from ophthalmologists (36). Thus, the accessibility of eye care will increase and can encourage future investigations that seek to optimize and incorporate automated screening technologies into clinical settings. Furthermore, AI can be easily incorporated into diabetic retinopathy screening programs and thus acts as an enhancer rather than a disruptor to traditional screening methods (37).

The current progress of applying AI for diabetic retinopathy screening
Studies conducted by research groups across the globe have demonstrated clinically acceptable performances of AI screening systems in detecting diabetic retinopathy (38). This research extends to multifaceted AI as well, with some groups developing a single system that screens for multiple eye diseases at once including diabetic retinopathy, age-related macular degeneration (AMD), and glaucoma (31,25,38). Large external validation image sets consisting of retinal fundus or OCT photographs from diabetes mellitus patients in different countries have been used to demonstrate the performance of various AI systems (27). Countries and continents which have provided validation sets include the United States, Europe, Africa, Australia, India, China, Korea, and Thailand (26,27,32).
In addition to demonstrating clinically acceptable performance levels, AI for diabetic retinopathy screening has met the standards of the Food and Drug Administration in the United States, the Health Sciences Authority in Singapore, and Conformité Européenne (CE) marking in the European Economic Area (39,40). Securing regulatory approval serves as an important contributor to the progress of making AI application in eye care settings a commonplace practice (39).
Lastly, economic studies evaluating the financial feasibility of AI implementation have shown promising cost-effectiveness results. Based on the calculation of incremental cost-effectiveness ratios (ICERs), AI solutions demonstrate cost-saving benefits when compared to traditional, manual methods of retinal grading (41). These results support the claim that automated screening services for diabetic retinopathy prevention are not only clinically proven to detect signs of disease with high accuracy, but also are economically sustainable and would benefit primary care settings that choose to adopt them (38, 41).

Diabetic retinopathy progression and pathology
Generally, diabetic retinopathy progresses according to particular parameters. Glucose and glycated hemoglobin (HbA1c) levels, blood pressure, lipid levels, and smoking have near-linear relationships with retinopathy progression (42). Pregnancy may also cause rapid deterioration of the retina in those developing diabetic retinopathy (43). In humans, it takes several years for diabetic retinopathy to reach a stage where it could threaten a person's sight (42,43). The retina itself is a light-sensitive layer of cells at the back of the eye, which converts incident light into electrical signals that are sent to the brain for image generation (44). In order for the retina to function properly, it needs a constant supply of blood, which it receives through a network of capillaries (42,44). Over time, persistently high and uncontrolled blood glucose levels can damage retinal vasculature in three notable stages (45). The first stage is known as background retinopathy, during which tiny bulges, classified as aneurysms, develop in the blood vessels (46). These bulges may cause bleeding; however, at this stage of diabetic retinopathy development, they usually do not affect a person's vision (43,46). The second stage, pre-proliferative retinopathy, is characterized by more considerable bleeding due to greater damage to retinal vasculature and potential hemorrhaging (47). At this stage, vision will likely be impacted. The third stage, called proliferative retinopathy, demonstrates scar tissue and neovascularization in the form of minimally functional or nonfunctional acellular capillaries (42,47). These new vessels that develop on the retina are structurally weak, cause further bleeding, and eventually lead to progressive vision loss (47). In addition to the aforementioned lesions, microglial infiltration, lipemia retinalis, intraretinal microvascular abnormalities (IRMAs), and ischemia serve as other hallmark indicators of diabetic retinopathy progression (48).
Anyone with type 1 or type 2 diabetes mellitus is at risk of developing diabetic retinopathy; however, early detection with effective screening systems and subsequent treatment can prevent progressive vision loss. Figure 1 presents isolated healthy retinal vasculature. Figure 2 compares the vasculature of a non-diabetic versus a diabetic retina. The arrows indicate incident acellular capillaries that have resulted from neovascularization.
OCT allows for high-resolution imaging in the axial direction of the retina, resulting in cross-sectional visualization of vasculature, retinal cell layers, and limiting membranes (49). Additionally, OCT has the capability of capturing retinal reflectance, in which light is delivered through the pupil and images are formed from the light reflected back from the retina (49). The detection of reflectance allows studies to investigate biomarkers, such as inflammatory cytokines or neurotoxins released by microglial cells, that may affect visual function at the cellular level (49,50). This is particularly useful when screening for small changes in the retinal cell layers. Thus, OCT imaging can capture the location and nature of retinal changes, the thickness of the retina, and the integrity of the surrounding structures (49,50).
Retinal fundus photographs document the current ophthalmoscopic appearance of a patient's retina without detailed visualization of the retinal cell layers (51). Fundus photography is useful for detecting significant or large changes in retinal cell layers but, unlike OCT imaging, is limited in its ability to detect small changes. A fundus camera is a specialized low-power microscope with an attached camera that sends light rays through the pupil upon image capture (52). If the illumination system of the fundus camera and the produced image are focused and aligned, the illuminating light rays will reflect off the retina and back into the objective lens of the camera. A retinal fundus image is subsequently generated (51,52).
Retinal fundus photography was used more widely prior to the optimization of OCT imaging (52). Nowadays, OCT is commonly used due to its ability to detect subtle changes in the retina (52). In the context of screening for diabetic retinopathy, neovascularization is an important biomarker for detection in OCT and fundus photographs, whereas detecting changes in the retinal pigmented epithelium (RPE), as well as other cell layers, is more suited for OCT imaging.

Datasets and research communities used for the development and training of artificial intelligence screening systems
Four notable datasets and data scientist research communities were used across the various studies included in this review to externally validate diabetic retinopathy AI screening systems: Messidor-2, EyePACS, Kaggle, and E-Ophtha.
The Messidor-2 dataset is a collection of 874 diabetic retinopathy examinations (1,748 fundus images), each consisting of two macula-centered fundus images (one per eye). It does not include annotations that define a diabetic retinopathy ground truth, which allows researchers to externally validate their respective AI systems of interest without bias (53). The Messidor-Original dataset consists of 529 examinations (1,058 retinal fundus images) that come in pairs or as single images (53,54). In order to generate new Messidor datasets, diabetic patients were recruited at Brest University Hospital in France between October 16, 2009 and September 6, 2010 (53). The hospital's Ophthalmology Department imaged eye fundi, without inducing dilation, using a Topcon TRC NW6 non-mydriatic fundus camera at a 45-degree field of view (53). Only macula-centered images were incorporated in the dataset in order to remain consistent with Messidor-Original (54).
The EyePACS database consists of over five million retinal fundus images from diverse populations presenting different degrees of diabetic retinopathy severity (55). Such a large variety helps AI algorithms recognize diverse retinas that exist in real-world settings globally. Major automated screening development studies have used, and are currently using, EyePACS datasets to train and externally validate algorithms.
Kaggle is an online community of data scientists and machine learning researchers that provides a large set of high-resolution retinal fundus images taken under a variety of imaging conditions (56,57). Two images, one of each eye, from each subject are included in the dataset. Some images are displayed as one would see the retina anatomically (56). For example, when viewing an image of the right eye, one would see the optic nerve on the right and the macula on the left side of the image (56,57). Other images are shown as seen through a condensing lens on a microscope: inverted, as one sees in typical live eye exams. The Kaggle dataset is known for containing some images that contain artifacts and are out of focus, underexposed, or overexposed (57). AI systems must function and provide accurate outputs in the presence of such noise and variation to be deemed clinically acceptable (43).
E-ophtha is a database of color retinal fundus images used specifically for diabetic retinopathy research. E-ophtha contains two datasets consisting of 463 fundus images that demonstrate either exudates, microaneurysms, or hemorrhages (58). The exudate database contains 47 images with exudates and 35 images with no lesions. The microaneurysm set contains 148 images with microaneurysms or small hemorrhages and 233 images with no lesions (59). Thus, this dataset is particularly useful for training algorithms to recognize exudates, microaneurysms, and hemorrhages in fundus images (58, 59).

Convolutional neural network computation and training methodology
Convolutional neural networks (CNNs) form the base of deep learning (DL), a subset of machine learning (ML) where the algorithms are inspired by the structure of the human brain (60,61). CNNs take in data, train themselves to recognize the patterns in the data, then predict an output. They are made up of layers of neurons (61). The first layer, known as the input layer, receives the input; the output layer predicts the final output; and the layers in between, known as hidden layers, perform the majority of the computations required by the neural network (62).
In the context of diabetic retinopathy, CNNs are trained to recognize retinal lesions by training their algorithms with fundus and OCT images (61). Figure 3 demonstrates a general example of how a trained CNN computes probabilistic lesion outputs and makes a correct prediction (60). A section of retinal vasculature is presented to the CNN at 200 times magnification (63). The arrows indicate diabetes-induced acellular capillary growth (62,63). Each pixel of the image is fed as input to each neuron of the first layer. Neurons of one layer are connected to neurons of the next layer through channels, each of which is assigned a numerical value known as a weight (64). The inputs are multiplied by the corresponding weights and their sum is sent as input to the neurons in the hidden layer (63,64,65). Each of these neurons is associated with a numerical value called the bias, which is then added to the input sum (64,65). This value is then passed through a threshold function called the activation function. The result of the activation function determines whether the particular neuron will be activated (66). An activated neuron transmits data to the neurons of the next layer over the channels. In this manner, the data is propagated through the network (63,66). This is called forward propagation. In the output layer, the neuron with the highest value fires and determines the output. The values represent probabilities. In this example, the neuron correctly associated with acellular capillary recognition has the highest probability, hence that is the most likely output predicted by the neural network.
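The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration rather than the network of any included study: it uses plain fully connected layers (no convolutional filters), a ReLU activation, and a softmax output. The four pixel values, all weights and biases, and the three lesion labels are hypothetical values chosen only so that the example runs.

```python
import math

def relu(z):
    # Activation function: the neuron passes its value on only if it is positive.
    return max(0.0, z)

def softmax(scores):
    # Turn raw output-layer scores into probabilities that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_layer(inputs, weights, biases, activation=None):
    # weights[j][i] is the weight on the channel from input i to neuron j.
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def forward(pixels, hidden_w, hidden_b, out_w, out_b):
    hidden = dense_layer(pixels, hidden_w, hidden_b, relu)  # hidden layer
    scores = dense_layer(hidden, out_w, out_b)              # output layer
    return softmax(scores)

# Hypothetical labels, inputs, and parameters (assumptions, not from the source).
LABELS = ["acellular capillary", "aneurysm", "hemorrhage"]
pixels = [0.2, 0.9, 0.4, 0.7]
hidden_w = [[0.5, -0.2, 0.1, 0.4], [-0.3, 0.8, 0.2, -0.1], [0.1, 0.1, -0.4, 0.6]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[1.2, 0.4, 0.3], [-0.5, 0.9, 0.1], [0.2, -0.3, 0.8]]
out_b = [0.0, 0.0, 0.0]

probabilities = forward(pixels, hidden_w, hidden_b, out_w, out_b)
predicted = LABELS[probabilities.index(max(probabilities))]
print(dict(zip(LABELS, probabilities)), "->", predicted)
```

The label attached to the output neuron with the highest probability is reported as the prediction, mirroring the "neuron with the highest value fires" step in the text.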
If the neuron had associated with an incorrect output, such as aneurysm or hemorrhage, it would be an indication that further training is necessary. During the training process, along with the input, the CNN also has the output fed to it (67). The predicted output, whether correct or incorrect, is compared against the actual output to realize the error in prediction. The magnitude of the error indicates how wrong the CNN is and a positive or negative value suggests that the predicted value is either higher or lower than expected, respectively (67). This information is then transferred backward through the neural network, known as back propagation (68). Now based on this information, the weights are adjusted. This cycle of forward propagation and backward propagation is repeatedly performed with multiple inputs (68,69). The process is continued until the weights are assigned such that the neural network can predict retinal lesions correctly in most cases (69). This brings the training process to an end.
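The training cycle can be illustrated with a deliberately small sketch. It assumes a single sigmoid neuron rather than a full CNN, so the backward pass reduces to one application of the chain rule. The four two-feature examples and their labels (1 = lesion present, 0 = absent), the learning rate, and the number of epochs are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    # Forward propagation for one neuron: weighted sum plus bias, then activation.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def mean_loss(weights, bias, data):
    # Average cross-entropy between predicted and actual outputs.
    total = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, epochs=200, lr=0.5):
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = predict(weights, bias, x)   # forward propagation
            error = p - y                   # positive: too high, negative: too low
            # Backward step: for sigmoid + cross-entropy, the gradient of the
            # loss with respect to each weight is error * input.
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical toy data: (features, actual output).
data = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]

loss_before = mean_loss([0.0, 0.0], 0.0, data)
weights, bias = train(data)
loss_after = mean_loss(weights, bias, data)
print(loss_before, "->", loss_after)
```

In a multi-layer network the same error signal is passed backward layer by layer, but the forward, compare, adjust cycle has the same shape.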

Figure 3. Convolutional neural network analysis of diabetic retinopathy-induced acellular capillary growth
The left-hand image displays the formation of acellular capillaries in the vascular architecture of a diabetic retina.

Convolutional neural networks assessed by the included studies
Five different CNN architectures were used by the studies included in this review: AlexNet, Inception, Iowa Detection Program (IDP), Visual Geometry Group (VGG), and EyeArt.
The AlexNet architecture consists of eight distinct layers: five convolutional layers and three fully connected layers (70). AlexNet has three features that make it unique compared to other existing CNNs: overlapping pooling, rectified linear unit (ReLU) nonlinearity, and training across multiple graphics processing units (GPUs). Normally, CNNs pool outputs of adjacent groups of neurons with no overlapping (70). However, when overlapping was introduced in AlexNet, researchers observed a reduction in error of approximately 0.5% and found that it is more difficult for architectures with overlapping pooling to provide inaccurate output predictions (71). AlexNet uses ReLU instead of the hyperbolic tangent (tanh) function, which was traditionally used by CNNs (70,71). Incorporating ReLU is particularly advantageous for shortening AI training time; ReLU-based systems are able to reach a 25% error on CIFAR-10, EyePACS, and Messidor datasets six times faster than systems using the tanh function (72). In addition to faster training times, AlexNet also has the capacity to train larger models. AlexNet allows for multi-GPU training by putting half of its neurons on one GPU and the other half on another GPU (72).
Inception is a DL architecture consisting of CNNs that are 27 neuronal layers deep (73). Inception V3 and Inception V4, which are members of the Inception family and are the versions used by the eligible studies in this review, possess important ML features including label smoothing, factorized 7 x 7 convolutions, and the use of auxiliary classifiers to propagate information to the lower layers of the network (73). Label smoothing is a regularization technique for classification problems that prevents the Inception model from predicting outcomes too confidently during training and generalizing poorly (74). Factorizing the 7 x 7 convolutions means replacing the first 7 x 7 convolutional layer with a sequence of 3 x 3 convolutional layers. The term convolution itself refers to the mathematical combination of two functions to produce a third function (merging two sets of information). In the case of Inception V3 and Inception V4, the convolution performed on the input data, a fundus or OCT image, helps produce a feature map from which the CNNs can distinguish lesions that they have been or are being trained to recognize (73,74). Auxiliary classifiers are a component of the Inception architecture that improves the propagation of computations made by the large and deep Inception neural networks when receiving an input (75). In the context of diabetic retinopathy, including auxiliary classifiers in the AI screening system improves the efficiency of translating an input into a probabilistic outcome of the identity of a retinal lesion (73,74,75).
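Two of the AlexNet features named above can be illustrated with a short sketch. The first part compares the gradients of ReLU and tanh at a large input, which is one common explanation for why ReLU networks train faster. The second contrasts non-overlapping pooling (window 2, stride 2) with overlapping pooling (window 3, stride 2). The one-dimensional signal is a hypothetical stand-in for a row of a feature map.

```python
import math

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise (no saturation).
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # Derivative of tanh: shrinks toward 0 as |x| grows (saturation).
    return 1.0 - math.tanh(x) ** 2

def max_pool_1d(values, window, stride):
    # Slide a window over the values and keep the maximum of each window.
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, stride)]

signal = [1, 5, 2, 8, 3, 7, 9]  # hypothetical feature-map row

non_overlapping = max_pool_1d(signal, window=2, stride=2)
overlapping = max_pool_1d(signal, window=3, stride=2)

print(relu_grad(3.0), tanh_grad(3.0))
print(non_overlapping, overlapping)
```

With a window wider than the stride, neighboring windows share an element, so values at window boundaries (here the final 9) contribute to the pooled output instead of being dropped.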
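Label smoothing and the factorization of a 7 x 7 convolution can both be expressed as simple arithmetic. The sketch below assumes the standard formulation in which a one-hot target is mixed with a uniform distribution over the classes. The smoothing factor of 0.1 and the three-class setup are illustrative assumptions, and the weight counts are per input-output channel pair.

```python
def smooth_labels(one_hot, epsilon=0.1):
    # Mix the hard target with a uniform distribution over all classes.
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

def stacked_receptive_field(num_layers, kernel=3):
    # Receptive field of num_layers stacked stride-1 convolutions.
    return 1 + num_layers * (kernel - 1)

hard_target = [1.0, 0.0, 0.0]        # hypothetical 3-class one-hot label
soft_target = smooth_labels(hard_target)

weights_7x7 = 7 * 7                  # one 7 x 7 filter
weights_three_3x3 = 3 * (3 * 3)      # three stacked 3 x 3 filters

print(soft_target)
print(stacked_receptive_field(3), weights_7x7, weights_three_3x3)
```

Three stacked 3 x 3 layers cover the same 7 x 7 neighborhood with fewer weights, and the smoothed target never asks the model to output a probability of exactly 1.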
IDP is an algorithm based on expert designed image analysis that uses wavelet transformations (76). A wavelet is a mathematical function that is useful in image processing (76,77). Wavelet compression works by analyzing an image and converting it into a set of mathematical expressions that can be decoded by a neural network to identify features of an image (76,77). This is particularly useful when a CNN is fed an image containing large quantities or easily mistakable pieces of information (77). In the context of diabetic retinopathy, if a fundus or OCT image is taken of an eye with many and diverse lesions, for example from the proliferative stage, IDP has the ability to distinguish between morphological structures with considerable accuracy with its wavelet feature (77).
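The idea behind a wavelet transformation can be illustrated with the simplest case, a single-level Haar transform of one row of pixel intensities. This is a generic sketch and not the IDP implementation, whose wavelet family and parameters are not specified here. The pixel row is hypothetical and must have an even length.

```python
def haar_step(signal):
    # Split the signal into pairwise averages (coarse structure)
    # and pairwise half-differences (local detail).
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    # Rebuild the original signal exactly from the two halves.
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal

row = [4, 6, 10, 12, 50, 52, 8, 8]   # hypothetical pixel intensities
approx, detail = haar_step(row)
restored = haar_inverse(approx, detail)

print(approx, detail, restored)
```

The averages summarize broad structure, such as the bright region around the values 50 and 52, while the differences capture fine local changes. Separating the two scales is what makes wavelet coefficients useful as image features.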
VGG is a classical CNN based on an analysis of how to increase the depth of such networks (78). VGG is characterized by its simplicity as it uses small 3 x 3 filters, pooling layers, and a fully connected layer (78). Applying 3 x 3 convolutions on images with a 3 x 3 filter allows for the analysis of three-dimensional images (79). Additionally, they are used for blurring, sharpening, edge detection, and the embossing of images (78,79). The pooling feature of VGG allows the architecture to reduce the size of images while preserving their important characteristics (78). Fully connected layers are simply the connection between one layer of neurons to another, as is a defining feature of CNNs in general (78,79).
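The two building blocks named above, a 3 x 3 filter and a pooling layer, can be sketched in plain Python. The example assumes a hypothetical 6 x 6 grayscale image containing a vertical edge, a simple vertical-edge filter, and 2 x 2 max pooling with stride 2. As in most deep learning libraries, the "convolution" slides the filter without flipping it.

```python
def convolve2d(image, kernel):
    # Slide the kernel over every position where it fits entirely ("valid" mode).
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(k) for b in range(k))
             for c in range(cols)]
            for r in range(rows)]

def max_pool2d(feature_map, size=2):
    # Keep the maximum of each non-overlapping size x size block.
    return [[max(feature_map[r + a][c + b]
                 for a in range(size) for b in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

# Hypothetical image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]

# 3 x 3 filter that responds to dark-to-bright transitions from left to right.
edge_filter = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

feature_map = convolve2d(image, edge_filter)   # 4 x 4 map
pooled = max_pool2d(feature_map)               # 2 x 2 map

print(feature_map[0], pooled)
```

The feature map is nonzero only where the filter straddles the edge, and pooling halves each dimension while keeping that response.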
The EyeArt system is a cloud-based AI eye screening technology used to detect different stages of diabetic retinopathy through automated analysis of patients' color fundus images (80). It is commonly used by endocrinologists, general practitioners, and diabetologists in primary care settings to rapidly and accurately screen for signs of diabetic retinopathy within minutes (29). EyeArt uses morphological image analysis with DL techniques to create an automated diabetic retinopathy screening system (ADRSS) engineered for large-scale deployment in the cloud (80). EyeArt is known for its speed and accuracy, as it is able to screen 100,000 patients in less than 45 hours, whereas human graders can screen retinal images of only 8 to 12 patients per hour (80, 81).
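The throughput figures quoted above can be put on a common scale with a short calculation. It uses only the numbers stated in the text and assumes, for comparison, a single human grader working continuously.

```python
patients = 100_000
eyeart_hours = 45                     # upper bound stated in the text
human_rate_low, human_rate_high = 8, 12   # patients per hour

eyeart_rate = patients / eyeart_hours          # patients per hour (lower bound)
human_hours_slow = patients / human_rate_low   # hours at 8 patients per hour
human_hours_fast = patients / human_rate_high  # hours at 12 patients per hour

print(round(eyeart_rate), round(human_hours_fast), round(human_hours_slow))
```

On these figures the automated system processes at least roughly 2,200 patients per hour, while a single grader would need between roughly 8,300 and 12,500 hours for the same workload.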

Population of Interest
People who have been diagnosed with type 1 or type 2 diabetes mellitus comprise the population of interest. The populations that have contributed to the performance results of AI screening systems in this review are from diverse countries and continents including the United States, the United Kingdom, Africa, India, China, Thailand, and Australia. Participants were recruited from a variety of healthcare settings including primary care practices, screening units and programs in urban centers, endocrinology outpatient services, and tertiary care diabetes and general hospitals. Patients with type 1 or type 2 diabetes mellitus were identified either through diabetes or pharmacy registers and were invited for screening studies. Participants consented for the studies and had retinal photographs taken of their eyes.

Case definition and other important terms
Due to varying criteria in different countries regarding whether a person has diabetic retinopathy, no global standard or checklist of symptoms has been defined. However, based on the studies included in this review, there is a set of general indicators that are commonly used to screen for diabetic retinopathy: types 1 and 2 diabetes mellitus, diabetic macular edema, drusen, exudative retinal detachment, microvascular abnormalities, and retinal vessel occlusion (48,82). Diabetic retinopathy results from microvascular lesions in the retinas of patients suffering from type 1 or type 2 diabetes mellitus (83). Type 1 diabetes results from an autoimmune reaction that attacks the beta cells of the pancreas, leading to an inability to produce enough insulin and, subsequently, to consistently high blood glucose levels; it can have both genetic and environmental origins (84). Type 2 diabetes occurs when one's body becomes resistant to insulin and is associated with genetics and lifestyle choices (28). Diabetic macular edema, a contributor to the progression of diabetic retinopathy, occurs when leaky vessels cause fluid to build up in the macula at the center of the retina; it is commonly screened for during diabetic retinopathy examinations (85). Drusen are a defining feature of retinal degeneration and appear as small yellow or white spots on the retina that can be detected by ophthalmologists and trained AI screening systems with retinal photography (86). Exudative retinal detachment develops when fluid collects in the subretinal space. This often follows the development of diabetic macular edema in diabetic retinopathy patients as fluid builds up on the retina (83,84,85,86). Microvascular abnormalities associated with diabetic retinopathy include microaneurysms, hemorrhaging of retinal capillaries, and neovascularization, which is the formation of new and structurally weak vessels (acellular capillaries).
Lastly, retinal vein occlusion, the blockage of blood vessels in the fundus of one's eye, is a potential indicator of diabetic retinopathy that is screened for during examinations (87). Occlusion could relate to the development of hyperlipidemia and hypertension in diabetes patients, which lead to subsequent microvascular complications (82, 88).

Meaningful measures of AI screening system performance
Sensitivity and specificity are the measures of AI performance assessed by the eligible studies. In this review, sensitivity is reported as the percentage of screened participants with diabetic retinopathy who are correctly identified as positive by the AI screening system of interest (89). Specificity is reported as the percentage of screened participants without diabetic retinopathy who are correctly identified as negative by the system of interest (89). The "Royal Devon and Exeter National Health Service (NHS) Standards" state that a diabetic retinopathy screening program must achieve a sensitivity and specificity of ≥80% to be deemed clinically acceptable (16,89). The area under the receiver operating characteristic (ROC) curve (AUC) was also reported by some of the included studies. In the context of AI screening for diabetic retinopathy, the AUC is a summary measure of how well a particular screening system discriminates between participants with and without the disease (90). Specifically, the AUC can be interpreted as the average value of sensitivity over all possible values of specificity (90). Alternatively, it can be understood as the probability that a randomly selected participant with diabetic retinopathy has a screening result indicating a greater likelihood of the condition than that of a randomly chosen participant without diabetic retinopathy (90,91). ROC curves plot sensitivity as a function of 1 - specificity (the false-positive rate). Each point on the ROC curve represents the sensitivity-specificity pair obtained at a particular decision threshold (90, 91).
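As a minimal sketch of the definitions above (not code from any included study), the following Python computes sensitivity and specificity from confusion counts and estimates the AUC as the probability that a randomly chosen participant with diabetic retinopathy receives a higher screening score than a randomly chosen participant without it. The function names, counts, and scores are hypothetical and chosen for illustration only.

```python
from itertools import product


def sensitivity(tp: int, fn: int) -> float:
    """Share of participants with DR who are flagged positive."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of participants without DR who are flagged negative."""
    return tn / (tn + fp)


def auc_pairwise(scores_with_dr, scores_without_dr) -> float:
    """Probability that a case with DR outscores a case without DR.

    Every (with DR, without DR) pair is compared; ties count as half a win.
    """
    pairs = list(product(scores_with_dr, scores_without_dr))
    wins = sum(1.0 if d > h else 0.5 if d == h else 0.0 for d, h in pairs)
    return wins / len(pairs)


# Hypothetical counts and scores, not data from the included studies.
sens = sensitivity(tp=90, fn=10)
spec = specificity(tn=80, fp=20)
auc = auc_pairwise([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])

# The >=80% acceptability threshold quoted in the text, applied to both measures.
clinically_acceptable = sens >= 0.80 and spec >= 0.80
```

The pairwise comparison is quadratic in the number of participants, so it is suited to illustration rather than to large validation sets.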

Databases and search strategy
Systematic searches were performed in Medline, Global Health, and PubMed, using MeSH terms as appropriate. Prior to finalizing a search methodology, a pilot examination of studies was carried out in order to identify key MeSH terms used in relevant literature. Search filters were not used when selecting studies to avoid the exclusion of potentially admissible studies. Terms utilized in literature searches are as follows:

1.14 Selection of studies
A total of 341 studies were initially gathered using the aforementioned search terms in their designated databases. The studies were imported to Mendeley and duplicate literature was discarded, leaving 224 records for assessment. The remaining records were screened by title-abstract review according to the inclusion and exclusion criteria outlined in Table 2. Of the 224 records, 57 titles and abstracts were chosen for full-text evaluation. Following the completion of full-text evaluation and further consideration of inclusion-exclusion criteria, 21 studies were deemed eligible for inclusion in this study. Figure 4 illustrates the methodology used for the identification, screening, and eligibility-determination of included publications.
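The selection counts reported above can be tallied stage by stage. The short Python sketch below uses only the four record counts given in the text; the stage labels and the helper name `removed_per_stage` are illustrative assumptions.

```python
# Record counts reported in the text at each stage of study selection.
stages = [
    ("records identified", 341),
    ("after duplicate removal", 224),
    ("selected for full-text evaluation", 57),
    ("included in the review", 21),
]


def removed_per_stage(stages):
    """Return (stage label, number of records dropped on entering that stage)."""
    return [
        (label, previous - count)
        for (_, previous), (label, count) in zip(stages, stages[1:])
    ]


removed = removed_per_stage(stages)
for label, dropped in removed:
    print(f"{label}: {dropped} records removed")
```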

Table 2. -Inclusion and exclusion criteria used to screen preliminary publications collected from literature search

Inclusion Criteria
Study uses AI to detect DR
Study provides grading comparisons between AI and manual graders
Study provides proof of DR development with retinal photographs
Study specifies retinal imaging technique(s) used for data collection
Study provides sensitivity and specificity values for included AI systems
Study uses a sample size of greater than 100 participants for real-world and external validation
Study uses real-world validation data set(s) to assess neural network performance
Study participants did not have a history of laser treatments or surgeries of the retina or injection into either eye
Study participants were not participating in another investigational eye study or actively receiving investigation product for DR or DME
Study provides a pathway of regulatory approval for AI screening system
Study specifies DR-induced lesions ascertained by AI screening system

Exclusion Criteria
Study is irrelevant to diabetic retinopathy
Study does not use neural networks for retinal analysis
Study does not provide grading comparisons between AI and manual graders
Study does not specify retinal imaging technique(s) used for data collection
Study does not provide sensitivity or specificity values for included AI systems
Study uses a sample size of less than 100 participants for real-world and external validation
Study does not use real-world validation data set(s) to assess neural network performance
Study participants have a history of laser treatments or surgeries of the retina or injection into either eye
Study participants were participating in another investigational eye study or actively receiving investigation product for DR or DME
Study does not provide a pathway of regulatory approval for AI screening system
Study does not specify DR-induced lesions ascertained by AI screening system

Study quality
Quality assessment of the included studies was performed using the National Institutes of Health (NIH) quality assessment tool. The guidelines of the tool were used to provide a numerical score out of 14 and an overall rating for each of the included studies. The guidelines used for scoring consist of 14 "yes" or "no" questions regarding the clarity, validity, design, methods, and cohort populations of the included studies. After assessing all appropriate study components, if the number of "yes" answers is seven or more, a "Good" overall rating is assigned to the reviewed study. Scores of four to six are designated "Medium" and scores of three or fewer are designated "Poor". Of the 20 included studies, 13 were designated as "Good" studies, seven as "Medium", and none as "Poor" quality. "Yes" and "no" determinations were made to the best of the reviewer's ability with consideration to all aspects of every study in order to decrease the likelihood of subjective errors.
Rating (Good, Medium, or Poor): Good = 7-14 "yes" answers; Medium = 4-6; Poor = 0-3.
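The rating rule can be written as a small function. This is a sketch of the thresholds stated in the rating legend (Good = 7-14, Medium = 4-6, Poor = 0-3 "yes" answers); the function name `quality_rating` is a hypothetical label and is not part of the NIH tool itself.

```python
def quality_rating(yes_count: int) -> str:
    """Map the number of "yes" answers (0-14) to an overall quality rating."""
    if not 0 <= yes_count <= 14:
        raise ValueError("the tool has 14 questions, so the count must be 0-14")
    if yes_count >= 7:
        return "Good"
    if yes_count >= 4:
        return "Medium"
    return "Poor"


# Example: a hypothetical study with nine "yes" answers.
example_rating = quality_rating(9)
```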

Quality analysis of the included studies: Limitations and Strengths
The primary objective of the included studies was to assess the abilities of particular AI systems of interest to screen for diabetic retinopathy in people with type 1 and type 2 diabetes mellitus. Each study demonstrated a notable limitation in some form relating to either its external validity or its use of imaging methods to capture fundus or OCT photographs.
There are three key limitations regarding the external validity of the results reported by the included studies to the burden of diabetic retinopathy in their respective countries. The first limitation is not specifying the setting, environment, or type of community (for example, urban, suburban, or rural) from which study participants were recruited (1,2,8,12). Environmental factors play a key role in the pathogenesis of diabetes mellitus. Such factors include air pollution, soil, water, stress, lack of physical activity, unhealthy diet, vitamin D deficiency, and exposure to particular pathogens. Although a genetic basis also exists, the time of onset of diabetes mellitus, and subsequently diabetic retinopathy, depends largely on the aforementioned environmental factors. Risk factors for diabetic retinopathy are complex, and past studies have demonstrated that the neighborhood environment in which patients live influences retinal microvascular complications associated with diabetes. The assessment of environmental factors that contribute to diabetic retinopathy is important in AI screening studies that do not specify the severity of the disease present in study participants. The severity or stage of the disease may impact the AI system's reported sensitivity and specificity values, and so considering confounding variables such as setting, environment, and types of communities is important to produce accurate and generalizable results. The second limitation to the external validity of various studies is the recruitment of participants solely from a single location or limited geographic region (6,7,13). Reporting performance metrics based on a limited population produces results that may not be representative or generalizable to a larger area of interest, for example the country in which the study was conducted. 
A study that reports sensitivity and specificity results of an AI screening system based on participants from a single city may be representative of the population of that geographic region. However, further studies assessing such performance metrics must be conducted on a larger scale in order to develop national AI screening programs that are based on nationwide data and evaluation. The third limitation to external validity of particular studies included in this review is not providing the geographic or demographic information of recruited participants (16,19,20,21). Even if a particular AI screening system presents exceptionally high sensitivity and specificity results, because the geographic scope and demographic breakdown of its participants are unknown, one cannot estimate how the system will perform when presented with retinal scans of diabetes mellitus patients from a different region and demographic composition.
The majority of the included studies considered diverse symptomatic indicators of diabetic retinopathy, used proper imaging methods, and properly reported the quantity and source of fundus or OCT images (4,5,7-21). Reliable datasets designed for laboratory and clinical research that were used by the included studies for training and external validation are as follows: E-ophtha (4,17), Messidor-2 (4, 5), EyePACS (5,12,16,19), DiaRetDB1 (7), Kaggle (7,14), and IDRID (17). However, four studies (1,2,3,6) only provided the source of retinal images and failed to detail the sample size of the dataset used for the development, training, and external validation of their respective AI screening systems. For instance, Abramoff et al. reported that they obtained OCT images from 10 primary care practices across the United States but did not report the quantity of images analyzed by their AlexNet AI system. The quantity of images used to train an AI screening system directly affects how well it learns to recognize particular lesions. Although the generalizability of the results from Abramoff et al. may seem promising, as the study considers clinical settings from across the US, if the quantity of images analyzed was low, AlexNet may not have received sufficient training to recognize particular disease indicators, which could lessen the validity of its reported sensitivity and specificity results.
Despite presenting limitations, each study demonstrated notable strengths. Nine studies utilized large quantities of fundus or OCT images for the training and external validation of their respective AI systems (4,5,8,10,11,14,15,18,19). Quantities ranged from 25,000 to over 600,000 images. Using larger sample sizes allows for a more precise estimate of sensitivity and specificity, can be more representative of the sample's population, and can be used to better generalize results. In addition to using a sample size of greater than 200,000 retinal images, Li et al. (2019) assessed the performance of their AI system of interest (VGG-16) over a four-year period to assess the consistency of sensitivity and specificity results. Three studies recruited population-based cohorts, which allowed for the estimation of AI screening performance (sensitivity and specificity) values in the reference populations (1,2,15). For example, Ruamviboonsuk et al. used over 25,000 retinal images from a community-based nationwide diabetic retinopathy screening program in Thailand and reported severity distributions based on the stage of diabetic retinopathy predicted by their Inception V4 AI system. Additionally, one study applied its AI system to screen for a disease other than diabetic retinopathy for external validation purposes (9). Such an approach to external validation was unique to this study amongst all included studies in this review. Kermany et al. demonstrated the general applicability of their Inception V3 system by screening for pediatric pneumonia using chest X-rays, which helped externally validate their reported results of 97.8% sensitivity and 97.4% specificity. Furthermore, this demonstrates the versatility of the Inception V3 system. There are additional strengths amongst the included studies that are noteworthy. De Fauw et al. used less prohibitive training data requirements, which allowed researchers to develop their U-Net AI screening systems using retinal images from across multiple real-world settings. Hansen et al. and Sayres et al. assessed their respective AI screening systems across different stages and severity levels of diabetic retinopathy, which was unique to these two studies. Providing a stage and severity breakdown is an important feature for AI systems that are being assessed for future clinical implementation, as diabetic retinopathy is a progressive rather than a binary disease. Kanagasingam et al. assessed their Inception V3 AI screening system against an established gold standard grading protocol that the study developed in collaboration with ophthalmologists who specialize in retinal diseases. The alignment of AI screening results with a grading protocol from a reliable source supports the demonstrated performance of the system of interest. Son et al. developed their CNN to recognize diverse lesions, which increases the potential of screening systems to differentiate between stages and severity levels of progressing diabetic retinopathy. Lim et al. collected ungradable EyeArt AI results, dilated the eyes with unreadable images, and repeated screening assessments. Doing so improves the gradeability rate of EyeArt and allows for the greater use of available retinal images. Lastly, Tufail et al. reported the incremental cost-effectiveness ratio (ICER) for the EyeArt AI screening system, which is particularly useful for informing public health interventions that plan to implement accurate, top-performing, and economical AI systems for large-scale DR screening programs.
Trial of an AI system to detect diabetic retinopathy in people with diabetes
Urban, suburban, and rural classifications are not specified for the 10 primary care practices included in the study; community type could have a potential impact on the frequency of certain lesions and/or stages of DR in the study population, therefore the study's external validity is limited to its areas of interest
The number of retinal images gathered from the 10 primary care practices across the US is not specified; this creates uncertainty in the validity of sensitivity and specificity values reported by the study, as the number of retinal images used to train the AI algorithm directly affects how well it can recognize lesions during the external validation phase
The study includes 10 primary care practices from across the US rather than within a single region

Bellemo 2019 (2)
To evaluate the accuracy of an AI model using deep learning in a population-based diabetic retinopathy screening program in Zambia, a lower-middle-income country.
Urban, suburban, and rural classifications are not specified for the 5 mobile screening clinics included in the study; community type could have a potential impact on the frequency of certain lesions and/or stages of DR in the study population, therefore the study's external validity is limited to its areas of interest
The number of retinal images gathered from the 5 mobile screening clinics across Zambia is not specified; this creates uncertainty in the validity of sensitivity and specificity values reported by the study, as the number of retinal images used to train the AI algorithm directly affects how well it can recognize lesions during the external validation phase
The study includes 5 mobile screening units from across Zambia rather than within a single region

De Fauw 2018 (3)
To apply a novel deep learning architecture to a clinically heterogeneous set of three-dimensional optical coherence tomography scans from patients referred to a major eye hospital
N/A: population-based cohorts were used to externally validate the study's DLA

To establish a diagnostic tool based on a deep-learning framework for the screening of patients with common treatable blinding retinal diseases
The model was tested with 1,000 retinal images belonging to 633 patients from an unspecified location; uncertainty with regards to the environments, community types, and locations of residence of the study participants limits the study's ability to generalize its findings to retinal images belonging to patients outside of the study
N/A: proper symptomatic indicators, imaging methods (OCT), and reporting techniques were considered and used in this study
Study provides a more transparent and interpretable diagnosis of DR by highlighting the regions recognized by the neural network
Study further demonstrates the general applicability of its AI system to screen for pediatric pneumonia using chest X-ray images

2.6 Meta-Analysis
2.6.1 Sensitivity of AI screening systems reported by the eligible studies
Sensitivity is reported in this review as the percentage of screened participants with diabetic retinopathy who are correctly identified as positive by the AI screening system of interest. Amongst the 21 eligible studies included in this review, 19 reported sensitivity estimates for their respective AI architectures that were applied to retinal images from diverse populations (see Figure 6; 1, 4-21). The median sensitivity (true-positive rate) amongst the studies is 92.5%, with a total range of 80.3% to 100%. The quartile 1 (Q1) to quartile 3 (Q3) range is 91.4% to 96.1% (see Figure 5). The mean of the reported results is 92.4% sensitivity.
Sensitivity estimates depended largely on the quantity of retinal images used to train, develop, and externally validate AI screening systems, which was determined by the studies' respective authors. In addition to image quantity, the imaging modality (fundus photography or OCT), the geographic area of recruited participants, and the number of participants may have influenced the variation in sensitivity. Furthermore, Table 7 compares sensitivity values to the sample size of training datasets applied to AI screening systems. It is noteworthy that the subgroup with the larger sample size of ≥ 75,000 retinal images showed higher sensitivity (94.0%; 95% CI: 91.3% to 96.7%) than the subgroup with the smaller sample size of < 75,000 images (90.5%; 95% CI: 87.1% to 93.9%).

Figure 6. -Sensitivity (%) of AI screening systems and external validation data set (if specified) reported by the included studies
2.6.2 Sensitivity of AI screening systems according to type of architecture implemented
Different AI architectures may produce different sensitivity results due to many potential reasons including varying algorithms, reference standards, diversity of recognizable retinal lesions, various stages and severities of diabetic retinopathy presented to the architectures, type of imaging modality, quality and quantity of retinal images, geographic area in which assessments are conducted, demographic breakdown of participants, and different training, development, and external validation datasets. Figure 7 presents the mean sensitivity of each AI system that was assessed by the included studies. From lowest to highest sensitivity, the order of the reported results is as follows: 87.2% (AlexNet), 89

Specificity of AI screening systems reported by the eligible studies
Specificity is reported in this review as the percentage of screened participants without diabetic retinopathy who are correctly identified as negative by the AI screening system of interest. Amongst the 21 eligible studies included in this review, 18 reported specificity estimates for their respective AI architectures applied to retinal images from diverse populations (see Figure 9; 1, 4-21). The median specificity (true-negative rate) amongst the studies is 92.2%, with a total range of 69.9% to 98.8%. The Q1 to Q3 range is 90.6% to 95.2% (see Figure 8). The mean of the reported results is 90.3% specificity. Specificity estimates depended largely on the quantity of retinal images used to train, develop, and externally validate AI screening systems, which was determined by the studies' respective authors. In addition to image quantity, the imaging modality (fundus photography or OCT), the geographic area of recruited participants, and the number of participants may have influenced the variation in specificity. Furthermore, Table 8 compares specificity values to the sample size of training datasets applied to AI screening systems. It is noteworthy that the subgroup with the larger sample size of ≥ 75,000 retinal images showed higher specificity (93.7%; 95% CI: 90.6% to 96.8%) than the subgroup with the smaller sample size of < 75,000 images (90.0%; 95% CI: 85.4% to 94.6%).
2.6.4 Specificity of AI screening systems according to type of architecture implemented
Different AI architectures may produce different specificity results due to many potential reasons including varying algorithms, reference standards, diversity of recognizable retinal lesions, various stages and severities of diabetic retinopathy presented to the architectures, type of imaging modality, quality and quantity of retinal images, geographic area in which assessments are conducted, demographic breakdown of participants, and different training, development, and external validation datasets. Figure 10 presents the mean specificity of each AI system that was assessed by the included studies. From lowest to highest specificity, the order of the reported results is as follows: 69.9% (IDP), 80.2% (EyeArt), 90.1% (unspecified CNNs), 90.7% (AlexNet), 93.0% (VGG), 94.8% (Inception V3), and 95.3% (Inception V4). The pooled specificity amongst all studies is 87.7%. Due to the aforementioned differences between studies, precise comparisons of specificity are limited.

The squares and horizontal lines correspond to the study-specific sensitivity and 95% confidence intervals (CIs), respectively. The diamond represents the pooled sensitivity and 95% CI. The overall pooled sensitivity is 92.8% (95% CI: 91.9%-93.7%).

Figure 12. -Forest Plot for Reported Specificity Values
The squares and horizontal lines correspond to the study-specific specificity and 95% confidence intervals (CIs), respectively. The diamond represents the pooled specificity and 95% CI. The overall pooled specificity is 87.7% (95% CI: 86.4% to 89.0%).
2.6.6 Summary receiver operating characteristic (SROC) curve analysis
Figure 13 displays an SROC curve of the included studies. The dashed line indicates the 95% prediction region. This SROC curve shows the relationship between reported sensitivity and specificity values for each study. High sensitivity corresponds to a high negative predictive value and is the ideal factor of a "rule-out" test for diabetic retinopathy, while a high specificity corresponds to a high positive predictive value and is the ideal factor for a "rule-in" test.
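The link between sensitivity, specificity, and predictive values can be made concrete with Bayes' rule. The sketch below takes the pooled estimates reported in this review (sensitivity 92.8%, specificity 87.7%) and converts them to positive and negative predictive values. Predictive values also depend on disease prevalence in the screened population, which this review does not report; the prevalence values of 10% and 30% are therefore hypothetical assumptions, and the function name is illustrative.

```python
def predictive_values(sens: float, spec: float, prevalence: float):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Pooled estimates reported in this review.
POOLED_SENSITIVITY = 0.928
POOLED_SPECIFICITY = 0.877

# Hypothetical prevalence values, used only to show the dependence on prevalence.
for prevalence in (0.10, 0.30):
    ppv, npv = predictive_values(POOLED_SENSITIVITY, POOLED_SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.0%}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

Under these assumptions, the negative predictive value stays high at both prevalence levels while the positive predictive value rises with prevalence, which is consistent with reading a high-sensitivity screen primarily as a rule-out test.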

Discussion
To date, this is the largest systematic review and meta-analysis to assess the utility of neural networks for diabetic retinopathy screening. This study shows that neural network-based screening can correctly detect 92.8% (95% CI: 91.9% to 93.7%) of diabetes mellitus patients with referable diabetic retinopathy and exclude 87.7% (95% CI: 86.4% to 89.0%) of those without referable diabetic retinopathy. These results are superior to the pooled sensitivity and specificity results reported in previous meta-analyses on AI screening system performance and surpass the 80% acceptability threshold needed for AI screening systems to be applied in clinical settings. The sensitivity and specificity of six CNN models were compared in subgroup analyses. No significant differences in sensitivity were found among the included studies and all reported values were clinically acceptable. The mean specificity of IDP (71%) was below the 80% acceptance threshold, while all other AI screening systems demonstrated mean values of 80% or greater. The majority of studies recruiting diabetes mellitus patients from diverse backgrounds were conducted to assess the performance of their respective AI screening systems for clinical use, but there is a lack of studies that summarize those results quantitatively. This review provides quantitative evidence of the accuracy of such systems. Additionally, the results of this study show that CNNs have great potential in clinical application, as high screening accuracy was demonstrated amongst diverse neural architectures. Since it is quite expensive and time-consuming to develop and train algorithms with a large quantity of high-resolution and labeled retinal images, it would be cost-effective for future investigations to ascertain a gold standard for the size of development and training sets and for image resolution.
The findings from these investigations could be particularly useful for using neural networks to detect rare diseases, of which there are only a limited number of cases. It is noteworthy that this meta-analysis did demonstrate that the subgroups with larger sample sizes showed higher performance in terms of sensitivity and specificity. However, further research is still needed before concluding that the sample size of the training dataset influences sensitivity and specificity results. The screening accuracy of CNNs may not be affected by the criteria in experts' standards for screening diabetic retinopathy, which is reasonable because ICDRS was developed using ETDRS. However, because ICDRS is easier and more commonly used in clinical settings, it may be preferable to consider ICDRS criteria when developing, testing, and validating automated screening systems, though ETDRS is still treated as the gold standard.
With regard to imaging modalities, the majority of studies that have investigated or are currently investigating AI screening performance use fundus photography as the modality of choice, and a smaller number utilize OCT imaging. Fundus photography produces a two-dimensional image of the three-dimensional structure of the retina, while OCT captures cross-sectional (axial) views of the retina through light coherence. The main shortcoming of using fundus photography for screening is that it only produces two-dimensional images, while the structure of the retina is three-dimensional. For this reason, it is advantageous to use OCT for diabetic retinopathy visualization, as OCT can produce clear three-dimensional images of thick samples by rejecting background signals while collecting light directly reflected from retinal surfaces. Thus, researchers should focus on training algorithms to interpret OCT images in order to better screen for diabetic retinopathy. Alternatively, training neural networks to recognize and analyze both OCT and fundus images might lead to higher screening performance.
A comprehensive literature search was conducted in global health and biomedical databases. High quality studies that met a specific set of eligibility criteria were included in this study. Performance metrics of sensitivity and specificity were meta-analyzed to assess the performance of diverse AI screening systems. The heterogeneity of results was also graphically demonstrated through forest plot and SROC curve analyses.

Limitations for review
This review has several limitations that must be considered. First, in the subgroup analyses assessing the sensitivity and specificity of AI screening systems, only one study contributed results to each of the AlexNet, IDP, and Inception V4 subgroups. Reporting performance levels based on a limited number of studies weakens the credibility of this study's meta-analytic findings. Second, several studies used the same datasets for training and validation of their respective AI screening systems. Assessing the impact of overlapping data sources on the results is challenging because the contents of each dataset were not discussed in the included studies. Third, CNNs lack standardized cut-off points or thresholds with which to designate the severity of diabetic retinopathy. Due to this absence, this review could not make strong comparisons between their abilities to screen for disease severity. Fourth, there was a strong risk of selection bias amongst the included studies with regard to participant recruitment. It is unclear whether participant data were included in multiple datasets, so the overall performance metrics may be underestimated or overestimated due to potential changes in said participants' severities of diabetic retinopathy between data sources.

Conclusions
This review and meta-analysis demonstrates clinically acceptable performance from the majority of AI systems used in diabetic retinopathy screening. Although the majority of neural networks showed clinically acceptable performance levels, further improvement depends on the continual development of novel algorithms with large and gradable sets of images for training and validation. With the rapidly growing global burden of diabetic retinopathy, AI screening systems can increase the capacity for disease prevention by allowing for early detection. If cost-effectiveness ratios can be optimized, AI can become a financially sustainable and clinically effective intervention that can be incorporated into the healthcare systems of LMICs and geographically remote locations. AI screening can increase the efficiency of eye care and diabetes services and optimize the care of patients within healthcare systems that provide large-scale services on a population level. Combining screening technologies with treatment interventions such as anti-VEGF therapy, acellular capillary laser treatment, and vitreoretinal surgery can lead to substantial reductions in the incidence of irreversible vision-loss due to proliferative diabetic retinopathy. With further advancement, AI will inform and improve primary, secondary, and tertiary care settings' approaches to diabetic retinopathy management.