Perspectives on Applying Artificial Intelligence to Model Outbreak Risk: Contextualizing The Impact of Climate Change on Vector-borne Disease

As recent history has shown, a changing climate threatens not only to increase the spread of known diseases but also to drive the emergence of new and dangerous phenotypes. This occurred most recently with West Nile virus: a virus previously known for mild febrile illness rapidly emerged to become a major cause of mortality and long-term disability throughout the world. As we move forward into increasingly uncertain times, public health research must begin to incorporate a broader understanding of the determinants of disease emergence: what, how, why, and when. The increasing mainstream availability of high-quality open data and high-powered analytical methods presents promising new opportunities. Up to now, quantitative models of disease outbreak risk have been largely based on just a few key drivers, namely climate and large-scale climatic effects. Such limited assessments, however, often overlook key interacting processes and downstream determinants that are more likely to drive the local manifestation of disease. Such pivotal determinants may include local host abundance, human behavioral variability, and population susceptibility dynamics. The results of such analyses can therefore be misleading in cases where necessary downstream requirements are not fulfilled. It is therefore important to develop models that include climate and higher-level climatic effects alongside the downstream non-climatic factors that ultimately determine individual disease manifestation. Today, few models attempt to comprehensively address such dynamics: up until very recently, the technology simply has not been available. Herein, we present an updated overview of current perspectives on the varying drivers and levels of interactions that drive disease spread. We review the predominant analytical paradigms, discuss their strengths and weaknesses, and highlight promising new analytical solutions.
Our focus is on the prediction of arboviruses, particularly West Nile virus, as these diseases represent the pinnacle of epidemiological complexity, a solution to which would serve as an effective “gatekeeper”. We present the current state of the art with respect to known drivers of arbovirus outbreak risk and severity, differentially highlighting the impact of climatic and non-climatic drivers. The reality of multiple classes of drivers interacting at different geospatial and temporal scales requires advanced new methodologies. We therefore close by presenting and discussing some promising new applications of AI. Given the reality of accelerating disease risks due to climate change, public health and other related fields must begin the process of updating their research programs to incorporate these much-needed new capabilities.


Introduction
Climate change and variability have now been implicated in the range expansion of a number of vector-borne infectious diseases across the globe 1 . Copious empirical studies have been conducted to understand which variables contribute most to the emergence and spread of disease 2 . Such studies generally aim to establish statistical relationships between long-term climate trends, contemporaneous meteorological conditions, and disease. Some studies also include the presence and deterministically relevant activities of known disease vectors. This particular theme has been covered quite exhaustively: strong statistical and causal links have been demonstrated between vectors (mosquitoes and other arthropods) and climatic conditions, such as temperature and rainfall. These links have, in turn, been connected to the incidence of vector-borne diseases, such as dengue and West Nile virus. Such information has indeed proven valuable for the prediction of outbreak severity and for proactive intervention 3,4 . However, few would argue that the current geospatial and temporal precision of prediction does not need to be improved. Up to now, progress beyond the status quo has been stymied by methodological limitations: traditional research methodologies were simply not designed to handle the volume of data needed to deliver the precision required 5 . Recent methodological advancements, along with increased open data availability, promise to change this 6 . Herein, we present an updated overview of current perspectives on the varying drivers and levels of interactions that drive disease spread. We review the predominant analytical paradigms, discuss their strengths and weaknesses, and highlight promising new analytical solutions. Our focus is on the prediction of arboviruses, particularly West Nile virus, as these diseases represent the pinnacle of epidemiological complexity, a solution to which would serve as an effective "gatekeeper".
We present the current state of the art with respect to known drivers of arbovirus outbreak risk and severity, differentially highlighting the impact of climatic and non-climatic drivers. The reality of multiple classes of drivers interacting at different geospatial and temporal scales requires advanced new methodologies.

Modeling of climate and infectious diseases
The past 20 years have been characterized by a notable trend in the spread of infectious disease into previously naïve populations, and this spread has been projected to continue in line with climate change forecasts 7 . The enhancement and spread of infectious disease is driven by climate in various ways 8 . One notable example is the ongoing change to the habitats and behavior of both hosts and vectors [9][10][11] . This has been most notable in the case of mosquito-borne diseases, particularly those capable of being carried long distances by regular host migration [12][13][14] . Climate change also has potential impacts on human mobility, which has an obvious association with disease spread [15][16][17] .
Much research attention has been focused on understanding these dynamics more precisely. Two broad categories of quantitative models are applied for this purpose 18 : statistical and process-oriented. Statistical models focus mainly on the analysis of time series and patterns associated with disease outbreak data 19 . However, statistical models lack any explicit structure with respect to mechanistic drivers. As a result, while generally regarded as having high predictive power, they tend to generalize poorly 5 . Such models are often used for the development of early warning systems and for intervention (resource) planning 20 . An extension of the statistical modeling paradigm is referred to as statistical, or machine, learning. Such models take advantage of advances in computational power to allow for a more data-driven approach that obviates the need for a priori model specification and feature selection 21 (Simonsen L., et al. 2016). Decision trees are an increasingly popular implementation that provides several convenient advantages (Table 1). These models have met with great success in ecological modeling, where precise assumptions with respect to temporal associations and causality are of lesser priority than pure predictive power and feature discrimination 22 . Classification trees are also becoming increasingly popular for geospatial disease risk modeling 23 , which shares many similarities with ecological modeling in terms of methodology and predictors 24 .

Table 1: Decision tree methods.

| Decision tree method | Description |
| --- | --- |
| Decision Trees | A graphical representation of possible solutions to a decision based on certain conditions or thresholds being met. Each decision path is guided by the degree to which the evidence provided meets or exceeds successive threshold conditions. |
| Bagging (or Bootstrapping) | An ensemble meta-algorithm that combines predictions from multiple decision trees algorithmically, using a majority voting heuristic. Models can be based on subsets drawn randomly from a larger dataset. This is employed in cross-validation. |
| Random Forest | Similar to bagging but allows for feature randomization, so that each model represents a different set of features. Multiple decision trees are generated with randomized feature sets and the results are then algorithmically combined to generate a prediction. |
| Boosting | Extends either of the previous approaches by allowing each new model to learn from previous iterations. Error minimization heuristics guide each iteration: models with high validity are "boosted" and sequentially augmented with new features. |
| Gradient Boosting | Boosting-based algorithm that employs predictive heuristics (gradient descent) to guide each successive model iteration. |
| Extreme Gradient Boosting | Employs various internal optimizations to improve execution times by up to a factor of 100 and increase the stability of estimations. This allows for larger data sets to be applied. |

Such models are simple to run, deliver accurate predictions, and provide for easy interpretation 22 . They are being increasingly applied in many fields, including infectious disease modeling 21 . A major advantage of these methods is their extreme tolerance for data sets that are large and of disparate data types. Categorical and continuous variables can be modeled side by side with no need for pre-processing. Input variables are automatically stratified or "binned", so dummy variables do not need to be encoded and outliers do not need to be removed. Missing values are handled similarly. Objective functions can include classification or regression, or any other model for which a linear formulation exists, e.g., the Cox proportional hazards model. The result is a statistically robust modeling process that allows more data to be used with less overall effort. However, these models do not provide parameter estimates out of the box. As a result, they are of limited value for applications that require both the direction and the magnitude of an effect. These models are also prone to over-fitting and often require careful curation to ensure out-of-sample validity. "eXtreme Gradient Boosting" (XGB) is a recently introduced method that has shown exceptional promise with respect to the limitations cited above. Extensions to this method allow for robust effect estimation as well 25 . XGB is now regarded by many as having made all prior variants obsolete.
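The boosting idea summarized in Table 1 can be made concrete with a minimal sketch. The code below is not XGB and not the method of any cited study: it is a plain gradient-boosting loop for squared error, in which each new one-split tree ("stump") is fit to the residuals left by the previous rounds. The temperature and trap-count values are invented for illustration, and all function names are hypothetical.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared error) and adds a damped copy to the ensemble."""
    base = mean(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate

def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return mean((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys))

# Invented illustration data: weekly mean temperature (deg C) and trap counts.
temperature_c = [10, 12, 14, 16, 18, 20, 22, 24, 26, 28]
trap_count = [1, 1, 2, 2, 5, 9, 14, 15, 15, 16]

baseline = fit_boosted(temperature_c, trap_count, n_rounds=0)  # mean only
one_round = fit_boosted(temperature_c, trap_count, n_rounds=1)
boosted = fit_boosted(temperature_c, trap_count, n_rounds=50)

print("baseline MSE:", training_mse(baseline, temperature_c, trap_count))
print("boosted MSE: ", training_mse(boosted, temperature_c, trap_count))
```

The training error shrinks with every round, which is also why such models over-fit readily: in practice the number of rounds and the learning rate would be chosen against held-out data rather than the training set used here.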
By comparison, process-oriented models set out to more precisely quantify and characterize the relationships between observed and suspected causal determinants of outbreak risk, as well as other associated factors 26 . Such mechanistic models attempt to extend theory and practical knowledge to the problem of understanding, predicting, and controlling the factors that ultimately influence the manifestation and spread of disease 27 . Questions that might be addressed within the context of such models include: a) the association between vector abundance and disease manifestation 28 , b) climate- and environment-associated biotic factors that influence "a" 29 , c) population vulnerability factors that influence "b" 30,31 , d) human behavioral and mobility patterns that influence "c" 16,32 , and e) the effects of various control/clinical interventions on all of the above 33 . However, it is often very difficult to parameterize such theory-driven models, i.e., to align them precisely with the reality observed by means of empirical data (Figure 1).
Parameterization can be seen as an iterative process that begins with functional assumptions and attempts to improve them by collecting observations. However, this often requires data that is simply unavailable due to resource constraints or scale. Effective parameterization requires data and time. Researchers are therefore often forced to rely on limited data or even pure supposition. Parameter values may also differ according to the spatiotemporal scope being considered. Such models must therefore often be recalibrated when applied to contexts beyond their original scope. This problem tends to increase with model complexity. Parameterization under such circumstances therefore often requires substantial additional effort in terms of data collection and computation. This is, of course, preferable to having too little detail in the model, as the mechanisms driving the response of interest may not be sufficiently represented. Luo et al. (2011) 34 offer a solution in the form of data assimilation (DA), also referred to as online learning. DA uses streamed data to 1) continuously update model parameters based on observed data, and 2) quantify and integrate errors and patterns in unexplained variability. DA can be thought of as a machine-learning-based form of continuous parameter optimization. DA can be deployed to incorporate a broader spectrum of data, in real time, for predictions that are more reliable across contexts and scales.
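The two functions of DA described above, continuous parameter updating and error tracking, can be sketched with a scalar Kalman-style filter. This is a minimal illustration under stated assumptions, not the scheme of Luo et al.: the parameter is a weekly log growth rate of case counts, the observation and prior variances are arbitrary placeholders, and the case counts are invented.

```python
import math

def assimilate(observations, prior_mean, prior_var, obs_var, process_var=0.0):
    """Sequentially update one parameter estimate from streamed observations.

    Returns a list of (posterior_mean, posterior_var, innovation) tuples,
    one per observation. The innovation is the forecast error, i.e. the
    part of each observation the current estimate did not explain.
    """
    mean, var = prior_mean, prior_var
    history = []
    for y in observations:
        var += process_var              # forecast step: the parameter may drift
        gain = var / (var + obs_var)    # weight given to the new observation
        innovation = y - mean
        mean += gain * innovation       # analysis step: pull estimate toward data
        var *= (1.0 - gain)             # uncertainty shrinks after each update
        history.append((mean, var, innovation))
    return history

# Invented weekly case counts; each observation is a log growth rate.
weekly_cases = [10, 13, 17, 22, 29, 38]
growth_obs = [math.log(b / a) for a, b in zip(weekly_cases, weekly_cases[1:])]

# Placeholder prior (mean 0, variance 1) and observation variance (0.05).
history = assimilate(growth_obs, prior_mean=0.0, prior_var=1.0, obs_var=0.05)
for week, (mean, var, innovation) in enumerate(history, start=1):
    print(f"week {week}: growth={mean:.3f} var={var:.4f} innovation={innovation:+.3f}")
```

Setting `process_var` above zero lets the estimate keep adapting when the underlying parameter changes, for example when a model calibrated in one season or region is carried over to another.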
Beyond that, climate change scenarios based on long-term variability as well as anthropogenic causes have been used to estimate the long-term expansion or contraction of geospatial disease risks, particularly those associated with arthropod vectors (i.e., arboviruses). Such studies have found positive associations between determinants of arbovirus outbreak risk and climate change, in terms of a) changes to vector habitat suitability 35 , b) socioeconomic development 36 , and c) anthropogenic changes to the environment 37 . Human mobility and increased trade have been shown to be associated with increased outbreak risk 38 , and have in fact been linked to the introduction of both disease 32 and disease-competent vectors in Europe 39 . Taken together, this clearly indicates the need to develop models that are comprehensive both in terms of mechanistic assumptions and included predictor data.
Despite the history of successful research efforts, progress has been stymied by the frequent observation that, for many climate-related variables, associations are time-dependent and appear to change depending on time lag and seasonality 18 . Moreover, effect estimates are known to vary as a function of the geospatial scope and scale assumed for analysis 40,41 . In addition, the impact of underreporting and detection is an ongoing concern. Barker (2019) attributes this to stochastic noise due to underreporting of mild symptoms and the otherwise limited number of reported cases. This creates imprecision in the predictive object used to train the model and can lead to difficulties in both calibration and interpretation 42 . Beyond that, efforts to understand the role of underlying drivers of climatic variability have up to now been limited. Such drivers have been shown to contribute to the spread of infectious disease in both the American and Old World contexts. ENSO ("El Niño Southern Oscillation") is well established as a driver of dengue and malaria 43,44 . Likewise, NAO ("North Atlantic Oscillation") has been associated with a host of infectious diseases in Europe 45 . Predictions of the emergence and severity of disease that are spatiotemporally precise are therefore difficult to achieve in practice. This difficulty increases in direct proportion to the complexity of the dynamics associated with disease transmission. Transmission dynamics are relatively simple in cases where transmission is driven primarily by host-to-host contact (e.g., influenza), but unfortunately this is not always the case.
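The lag dependence of climate associations noted above can be illustrated with a simple lagged correlation scan. This is a minimal sketch on synthetic data: the case series is constructed to follow temperature with a three-week delay, so the scan should recover that lag. The series values, the filler counts, and the function names are all assumptions made for illustration.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def lagged_correlations(driver, response, max_lag):
    """Correlate driver[t] with response[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        shifted_driver = driver[:len(driver) - lag]
        shifted_response = response[lag:]
        result[lag] = pearson(shifted_driver, shifted_response)
    return result

# Synthetic weekly temperatures (deg C).
temperature = [15, 16, 18, 21, 25, 28, 30, 29, 26, 22, 19, 17, 15, 14]

# Synthetic cases: twice the temperature of three weeks earlier, with
# arbitrary filler counts for the first three weeks.
true_lag = 3
cases = [30, 30, 30] + [2 * t for t in temperature[:-true_lag]]

correlations = lagged_correlations(temperature, cases, max_lag=6)
best_lag = max(correlations, key=correlations.get)
for lag, r in correlations.items():
    print(f"lag {lag}: r = {r:+.3f}")
print("strongest association at lag", best_lag)
```

With real surveillance data the peak is rarely this clean, and the apparent best lag can shift with season and with the spatial unit of analysis, which is the difficulty the paragraph describes.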

Climate and West Nile Virus as an example
West Nile Virus (WNV) has attracted much attention as of late for this very reason 46 49 . This spread has been attributed to changes to conditions in the northern latitudes of both Europe and America that have increased the capacity of endemic mosquito vectors to transmit the disease 50 . The invasion and establishment of non-native but disease-competent mosquito species has likewise been associated with improved habitat suitability due to weather that is warmer and wetter [51][52][53] . However, this alone fails to explain the noted increase in syndromic severity as a proportion of infected cases.
Within the European context, multiple species of mosquito vectors have been associated with the WNV transmission cycle. In addition to importation via migrant hosts, local amplification requires a suitable local reservoir as well. However, the extent to which specific resident species impact WNV outbreak risk remains largely a mystery in the European context. In contrast to the United States, where several reservoir species have been firmly established, the situation in Europe is still a matter of much uncertainty. A broad range of possible amplification hosts has been suggested: in addition to birds, amphibians and reptiles have also been found to contribute to local amplification. Mammals, which include humans, horses, and even domestic pets, generally do not develop levels of viremia sufficient to transmit the virus to biting mosquitoes. While transmission of WNV generally requires a mosquito vector, the literature has cited dozens of discrete interactions that contribute to WNV outbreak risk. For example, host-to-host transmission has been observed among both humans and birds 54,55 .

Figure 2. Simplified transmission matrix for West Nile virus. The matrix comprises 25 cells, each representing the routes that contribute to the WNV transmission cycle for a given interaction.
This degree of determinant complexity is difficult to handle by means of traditional process-oriented or statistical models. Multicollinearity tends to be a function of model complexity, and often leads to difficulty in model construction or interpretation 56 . Geospatial or scale-dependent effect variability is another notable limitation 57 : it is an unavoidable statistical fact that estimates will vary according to the scope and size of the sample. Spatiotemporal precision is further hampered by the lack of quality data at sufficient levels of granularity and completeness 20 . One consequence is that features with greater levels of data availability, such as climate, are most convenient for disease modeling and tend to be highlighted accordingly. Meanwhile, features with high degrees of local variability, which are therefore more difficult to sample accurately, tend to be omitted or overgeneralized. This includes mechanistically critical factors associated with amplification hosts, vectors, and human susceptibility 58 . Process-oriented models suffer from the same limitations: absent a sufficient volume and variety of observational data, such models cannot be effectively parameterized.
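One common way to quantify the multicollinearity problem mentioned above is the variance inflation factor (VIF). For the special case of two predictors it reduces to 1 / (1 - r^2), where r is their Pearson correlation. The sketch below uses that two-predictor form only; the predictor values are invented, and the threshold of 10 mentioned in the comment is a conventional rule of thumb rather than anything taken from the source.

```python
from statistics import mean

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def vif_two_predictors(x1, x2):
    """Variance inflation factor when a model contains exactly two predictors.

    Values near 1 indicate little shared information; values above roughly
    10 are conventionally read as problematic collinearity.
    """
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

# Invented monthly predictors for one location.
mean_temp_c = [18, 20, 22, 24, 26, 28]
min_temp_c = [9, 11, 12, 15, 16, 19]     # moves almost in step with mean_temp_c
rainfall_mm = [40, 12, 33, 8, 25, 30]    # largely unrelated to mean_temp_c

vif_temp_pair = vif_two_predictors(mean_temp_c, min_temp_c)
vif_temp_rain = vif_two_predictors(mean_temp_c, rainfall_mm)
print(f"VIF, mean vs minimum temperature: {vif_temp_pair:.1f}")
print(f"VIF, mean temperature vs rainfall: {vif_temp_rain:.2f}")
```

When two predictors carry nearly the same information, as the two temperature series do here, a regression cannot reliably attribute an effect to either one, which is the interpretive difficulty described above.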

Climate in the context of other drivers of outbreak risk
Beyond climate, several key components are needed to understand the dynamics that determine the emergence and spread of infectious disease. These include non-climate drivers which facilitate 1) introduction and spread, 2) potential evolution, and 3) the susceptibility of available host populations. Such drivers facilitate transmission through an interactive pathway of underlying mechanisms and processes. Each of these factors has a different degree of causal proximity: some causal effects occur proximal to and in direct connection with the event, while others occur further upstream. All drivers of outbreak risk can be understood to work synergistically: the effect of one is often mediated through processes linked to other drivers. This often makes it difficult to isolate and quantify independent effects. Cross-scale effects, such as the interaction between climate and the behavior of local vectors and hosts, are another area where understanding is limited 41 . As previously described, the most causally direct drivers exhibit high geospatial variability and localization, and available data is limited 59 . New methods are therefore needed that can make better use of available data, and better capture and quantify these crucial local effects, both independently and in synergy with or downstream from other drivers 60 . Promising new applications of machine learning and AI may represent possible solutions, but are presently only in the early stages of development.

Figure 3. AI-powered confirmatory path analysis suggests potential causal associations
One distal effect associated with climate pertains to viral evolution. Changes to viral fitness have been cited as responsible for increases in the spread and neurotropic severity of WNV in Europe since 2008 61 . This mutated "lineage 2" has since supplanted the previous variant and has been responsible for an increasing disease burden ever since 46 . Historically, genetic drift and evolution have been recognized as important factors for epidemics and spread, and will likely continue to play an important role in the future. The potential relationship between climate change and the viral evolution of arboviruses has not garnered much research attention. However, given the well-established dependence of vectorial capacity on temperature, viruses could well experience accelerated transmission and replication cycles, thereby facilitating more rapid evolutionary processes. This would imply that a changing climate increases not only the risk of emergence and spread, but also the rate at which new, more virulent phenotypes potentially emerge.
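The temperature dependence of vectorial capacity referred to above can be sketched with the classical formulation C = m a^2 p^n / (-ln p), where m is the number of mosquitoes per host, a the daily biting rate, p the daily mosquito survival probability, and n the extrinsic incubation period in days. The sketch assumes that only n responds to temperature, through a simple degree-day rule; in reality biting rate and survival also vary with temperature. All parameter values, including the 14 °C threshold and 110 degree-days, are illustrative placeholders and not estimates from the source.

```python
import math

def extrinsic_incubation_days(temp_c, threshold_c=14.0, degree_days=110.0):
    """Degree-day rule: days needed to accumulate `degree_days` above the
    threshold. Placeholder parameters; returns infinity at or below the
    threshold, where the virus is assumed not to develop in the vector."""
    if temp_c <= threshold_c:
        return math.inf
    return degree_days / (temp_c - threshold_c)

def vectorial_capacity(temp_c, mosquitoes_per_host=5.0, bites_per_day=0.2,
                       daily_survival=0.9):
    """Classical vectorial capacity, C = m * a^2 * p^n / (-ln p), with only
    the incubation period n depending on temperature (an assumption)."""
    n = extrinsic_incubation_days(temp_c)
    if math.isinf(n):
        return 0.0
    surviving_incubation = daily_survival ** n
    expected_infective_life = 1.0 / (-math.log(daily_survival))
    return (mosquitoes_per_host * bites_per_day ** 2
            * surviving_incubation * expected_infective_life)

for temp_c in (12.0, 20.0, 24.0, 28.0):
    print(f"{temp_c:4.1f} C: incubation = {extrinsic_incubation_days(temp_c):6.1f} d, "
          f"capacity = {vectorial_capacity(temp_c):.3f}")
```

Because the incubation period enters as an exponent on survival, a modest warming shortens n and raises capacity disproportionately. Under these assumptions, more transmission cycles fit into a season, which is the mechanism the paragraph links to faster viral evolution.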

Where do we go from here?
The degree to which such challenges limit the performance of a given model is in direct proportion to the complexity and variability of the system being modeled. No other infectious disease can claim the degree of transmission chain complexity and variability demonstrated by WNV. WNV has consequently been declared "notoriously hard to predict" and likened to an "impetuous child" 62 . Given the reality of sudden and large surges in outbreak severity and range, such as that observed in 2018, new solutions are clearly required. That said, WNV is most definitely a "worst case" modeling target: current technology is more than sufficient for the most common diseases (i.e., influenza paradigms can no longer be relied upon, absent substantial efforts to allow for predictive precision at local scales by expanding the analytical scope of dynamics addressed.
Recent developments in data availability and computational approaches have been suggested to hold the answer 63 . Supported by the increasing availability of high-resolution data from open sources 64 , models that incorporate an interdisciplinary assortment of drivers are becoming a reality 65,66 . Looking forward, such efforts are expected to evolve to incorporate state-of-the-art predictive algorithms with novel methods for deriving effect estimates and exploring underlying associations. For example, a novel method has recently been introduced that generates statistically optimal local model explanations 25 . Local effect estimation allows global model structure to be assessed at a heretofore unimaginable level of analytical precision and granularity 67 . This will allow for more precise tracking and prediction of outbreaks as well as a greater ability to predict disease burden, especially in the context of climate change and the effect it is likely to have on the emergence of new pathogenic threats.
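One family of local explanation methods attributes a single prediction to its input features using Shapley values. Whether this matches the method cited above is an assumption. The sketch below computes exact Shapley values by enumerating every feature ordering, which is feasible only for a handful of features. The risk function, its coefficients, and the feature names are invented for illustration.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley attribution of model(instance) - model(baseline).

    Features are switched from their baseline value to their instance value
    one at a time, in every possible order; each feature's attribution is
    its average marginal contribution across all orderings.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_risk(features):
    """Hypothetical outbreak-risk score with a climate-by-host interaction."""
    return (0.5 * features["temperature_anomaly"]
            + 2.0 * features["temperature_anomaly"] * features["host_abundance"]
            - 1.0 * features["population_immunity"])

baseline = {"temperature_anomaly": 0.0, "host_abundance": 0.0,
            "population_immunity": 0.0}
district = {"temperature_anomaly": 1.0, "host_abundance": 0.5,
            "population_immunity": 0.2}

attributions = shapley_values(toy_risk, district, baseline)
for name, value in attributions.items():
    print(f"{name}: {value:+.3f}")
print("sum of attributions:", sum(attributions.values()))
print("risk above baseline:", toy_risk(district) - toy_risk(baseline))
```

The attributions sum to the difference between the local prediction and the baseline, and the interaction term is split evenly between temperature and host abundance. Repeating the calculation for each location yields the kind of local effect estimates from which global model structure can then be examined.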