Zoonotic risk technology enters the viral emergence toolkit

In light of the urgency raised by the COVID-19 pandemic, global investment in wildlife virology is likely to increase, and new surveillance programs will identify hundreds of novel viruses that might someday pose a threat to humans. Our capacity to identify which viruses are capable of zoonotic emergence depends on the existence of a technology—a machine learning model or other informatic system—that leverages available data on known zoonoses to identify which animal pathogens could someday pose a threat to global health. We synthesize the findings of an interdisciplinary workshop on zoonotic risk technologies to answer the following questions: What are the prerequisites, in terms of open data, equity, and interdisciplinary collaboration, to the development and application of those tools? What effect could the technology have on global health? Who would control that technology, who would have access to it, and who would benefit from it? Would it improve pandemic prevention? Could it create new challenges?


Introduction
After the COVID-19 pandemic ends, or even before [1], the world will face another emergence of a heretofore-unknown epidemic or pandemic threat, most likely a novel zoonotic virus. This is less a testament to the state of global health than a basic consequence of arithmetic: only an estimated one percent of mammal viruses are currently known to science, while as many as one in every five undiscovered viruses might have the ability to someday make the jump into human populations [2,3]. For example, a whole constellation of distinct SARS-related coronaviruses circulates in bats in China and Southeast Asia [4,5], and at least two-thirds of their reservoirs might still be unidentified [6]. But even the most intensively studied viruses and well-sampled hosts can harbor undiscovered diversity: influenza A viruses are perhaps the most widely agreed upon future pandemic threat [7-9], but novel strains emerging through reassortment in wildlife and livestock are often only noticed once they reach or cross the animal-human interface.
Despite the urgency of research on zoonotic emergence, the diversity and rapid evolution of viruses pose a problem of scale for actionable science.
The next zoonotic threat might be unfamiliar to virologists, but more likely than not, it will bear at least some similarity to previous counterparts. A handful of viral clades make the zoonotic jump most often, and are more likely to continue spreading within human populations [10-12]. As a result, novel zoonotic epidemics often harken back to previous outbreaks: severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) shares 76% of its genome with SARS-CoV, and much of its pathology [13]; the emergence of HIV-1 group M in the 1920s was followed by a dozen more spillovers of similar primate viruses, including the progenitor of HIV-2 [14-16]; and the emergence of filoviruses with Marburg virus in 1967 was followed almost a decade later by the first Sudan ebolavirus and Zaire ebolavirus outbreaks, both in 1976 [17]. These similarities point to the widely accepted idea that while individual viral emergence events are idiosyncratic and (as standalone stochastic events) unpredictable, they often follow predictable patterns, which might constitute the raw materials for a zoonotic risk assessment procedure: defining, for example, the virus species, conditions, or locations with a greater risk of causing or experiencing these events [18,19].
As zoonotic viruses and their non-human animal (hereafter animal) hosts become better characterized, a growing library of virological data is becoming increasingly available and accessible to the scientific community, putting this risk assessment procedure within reach for the first time.
This is increasingly possible with the growing application of machine learning to risk assessment problems. Zoonotic origins are often described as a sequential process, in which pathogens must pass through a series of biological, ecological, and social filters that would otherwise prevent their emergence [20,21]. At each of these steps, machine learning has been successfully and reliably applied: to predict the animal origins of a novel zoonosis [22,23], the potential hosts of undiscovered zoonoses [6,24], the ecological and anthropogenic risk factors for zoonotic spillover [25,26], the ability of novel viruses to infect humans [27], and their ability to transmit onwards in human populations [10,28]. Such models have also been used to predict the severity of disease [29], and may be extended to predict mortality in the future [28,30]. These methods have been particularly useful when they can harness the genomic signatures of host adaptation and compatibility [22,31], as for many viruses, genomes may be the only available data [32].
Here we focus on a subset of this emerging set of methods, which we term zoonotic risk technology and define as an informatic system, statistical model, or artificial intelligence that identifies at least one of two viral traits: zoonotic potential (which we define as the ability of an animal virus to infect a human host) and epidemic potential (the ability of a zoonotic infection to cause disease and transmit onwards in human populations). This set of approaches has a necessarily narrowly-defined scope, which does not encompass every component of "risk." For example, machine learning algorithms can also be applied to predict zoonotic isolates of bacteria [33,34], and similar models can be applied to identify potential wildlife reservoirs or arthropod vectors of zoonoses [24,35,36]. Similarly, spatiotemporal patterns of viral dynamics in livestock and wildlife reservoirs are a critical missing piece in many spillover risk assessments [37-39]. After a transmissible pathogen reaches human hosts, yet another set of virological, social, economic, and political factors determine whether a spillover event becomes an epidemic or pandemic [40-43].
However, we focus on the narrowly-defined idea of zoonotic risk technology as a way to operationalize a specific set of existing approaches to facilitate the identification of viruses with zoonotic potential, and to interrogate the potential value of these technologies to global health.
To facilitate discussion on these topics, we held a one-day digital workshop (the "Verena Forum on Zoonotic Risk Technology") at the Georgetown University Center for Global Health Science and Security in January 2021. This setting allowed scientists to present cutting-edge computational and laboratory approaches, and to discuss potential applications and challenges with global health practitioners, with a focus on equity concerns in data sharing and technology deployment. Here, we report a brief synthesis of our findings.
Zoonotic risk technologies are no longer hypothetical, and are rapidly emerging as practical, concrete applications of scientific knowledge. These tools are only one item in the broader toolkit of prediction in viral ecology, and like other predictive models, are imperfect. Here, we identify three major barriers to actionable science that researchers must consider further:
1. Technologies will have the most value to global health if they are treated as part of the process of characterizing risk, rather than the singular endpoint. Additional work, such as laboratory investigation, is required to validate predictions, but may be expensive at scale and potentially politically sensitive.
2. Academic publishing alone is insufficient to enable the deployment of tools in surveillance programs or rapid outbreak response scenarios; user-friendly, open-source tools must be coupled with global capacity building in risk analysis and mitigation.
3. The development and application of zoonotic risk technology, and the sharing of data to enable these processes, are likely to engage critical issues such as ownership, equity, and governance; these issues are considered central in global health, but relevant scholarship may not currently interface with existing research on zoonotic risk prediction.
We explore each of these issues in depth here, and discuss possible avenues for interdisciplinary work that might help overcome these barriers, identify conditions precedent to their use, and flag potential limitations.
How zoonotic risk technology works
At its core, zoonotic risk technology exploits the assumption that viruses with undetected zoonotic potential are more similar to known zoonoses than to non-zoonotic viruses. Early efforts have focused on identifying coarse traits that are common among known zoonoses, such as origins in particular host clades [11,44,45] or a broad host range [46-48]. These approaches are useful for identifying common profiles of what a zoonosis "looks like" (e.g., a vector-borne single-stranded RNA virus with a broad host range including primates) that can be generalized across animal viruses. One of the only examples of zoonotic risk technology available for public use, the SpillOver viral risk ranking platform, uses this approach to rank 887 viruses based on 31 risk factors [49].
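As a minimal sketch of how such a trait-based ranking can be implemented (the viruses, traits, and weights below are hypothetical illustrations, not the SpillOver methodology):

```python
# Hypothetical trait-based risk ranking: each virus is profiled on coarse
# risk factors, and a weighted additive score orders the candidate list.
# Traits and weights are illustrative only, not the SpillOver scheme.

viruses = {
    "virus_A": {"rna_genome": 1, "vector_borne": 1, "infects_primates": 1, "known_host_orders": 3},
    "virus_B": {"rna_genome": 1, "vector_borne": 0, "infects_primates": 0, "known_host_orders": 1},
    "virus_C": {"rna_genome": 0, "vector_borne": 0, "infects_primates": 1, "known_host_orders": 2},
}

# Hypothetical weights expressing how strongly each trait tracks zoonotic risk.
weights = {"rna_genome": 1.0, "vector_borne": 0.5, "infects_primates": 2.0, "known_host_orders": 0.8}

def risk_score(traits):
    """Weighted additive score: higher values resemble known zoonoses more."""
    return sum(weights[trait] * value for trait, value in traits.items())

# Rank viruses from highest to lowest scored risk.
for name in sorted(viruses, key=lambda v: risk_score(viruses[v]), reverse=True):
    print(name, round(risk_score(viruses[name]), 2))
```

The appeal of this style of model is that every input and weight is legible to a non-specialist, which is one reason trait-based approaches are often described as interpretable.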
These approaches benefit from generality and interpretability, but can be limited by data availability; for example, host range is rarely characterized exhaustively in wildlife viruses until they are a known threat to human health, and such characterizations may suffer from biases in study effort or surveillance [50].
Moreover, trait-based assessments may be limited by apparent contradictions in simple patterns. For example, genome size correlates positively with zoonotic risk [51] but has been reported as having contradictory effects on transmissibility [10,12]; replication in the cytoplasm similarly predicts zoonotic potential [11,52,53], but also predicts reduced transmissibility [10].
Genomic data increasingly offer an alternate avenue for predictive work. Genomes are inherently high-dimensional data, encoding meaningful information about microbiology and immunology, and are often the first aspect of a novel virus to be characterized, months or years before its ecology.
A simple model might be trained on the nucleotide similarity of viruses compared to known zoonotic threats, while a more advanced one might also include similarity in genomic composition biases [22,27]. This approach has worked well for the parallel problem of inferring viral origins using genomic features that encode coevolutionary signals of host adaptation. For example, CpG dinucleotide composition can be used to identify vertebrate viruses [54,55], exploiting a viral adaptation that matches genomic composition to the vertebrate genome in order to evade innate immune responses searching for non-self genetic material [56,57]. These patterns are rare and poorly understood today [58], but are the subject of significant interest. For zoonotic risk, a model is likely to identify some combination of broadly transferable coevolutionary adaptations that allow a virus to cross species barriers more readily within a broad group (e.g., primates, or vertebrates), and random genomic patterns that happen to increase their odds of successful infection of human hosts (which they may never have encountered in their evolutionary history). For example, incorporating the similarity of viral genomes to human housekeeping genes and interferon-stimulated genes appears to measurably improve prediction of zoonotic potential [27].
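As a concrete illustration, the CpG composition bias mentioned above is commonly summarized as an observed/expected ratio computed directly from a genome sequence; a minimal sketch (with a toy sequence standing in for a full viral genome):

```python
# Minimal sketch: CpG dinucleotide observed/expected ratio for a genome
# sequence. Many vertebrate-adapted viruses show suppressed CpG content
# (ratios well below 1), one genomic signature a model can exploit.

def cpg_oe_ratio(seq):
    """Observed CpG count divided by the count expected from base composition."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    observed = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    expected = (n_c * n_g) / len(seq)  # expectation if bases were independent
    return observed / expected if expected > 0 else float("nan")

# Toy example; in practice this is computed over full viral genomes.
print(round(cpg_oe_ratio("ATGCGTACGTTAGCCGATCGATTACG"), 2))
```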
Over time, genomic approaches are likely to move beyond similarity, and start identifying de novo predictors of viral compatibility with human cells. A mechanism-agnostic model may simply collapse genomes into hundreds of computational features, identify a small handful of significant predictors, and generalize these patterns, but it can only do so successfully with sufficient data. For example, massive clearinghouses of genomic sequences such as GISAID and the NIAID Influenza Research Database have enabled a number of models that accurately classify the zoonotic potential of influenza strains down to the protein level [59,60]. Decomposing viral genomes, and identifying the regions most relevant to zoonotic emergence, can open new avenues for advanced modeling that goes beyond pattern recognition. For example, researchers have developed a number of structural simulations to explore binding affinity between the spike protein of SARS-CoV-2 and ACE2 receptors in animal and human cells [61,62], and structural modeling can be paired with other trait data to better predict the capacity of various mammal species to transmit SARS-CoV-2 [63].
Similar approaches could be used to identify the zoonotic potential of other viruses for which surface protein structure and receptor use have been characterized [64]. These kinds of approaches are ultimately limited by the comparability of different structures in both host and pathogen genomes, and may be most predictive when comparing hosts or pathogens at lower taxonomic levels (i.e., from viral strain up to viral genus or family).
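As a sketch of the mechanism-agnostic, feature-based workflow described above (with random toy sequences and arbitrary labels, and assuming the scikit-learn library is available), a model might collapse genomes into k-mer count features, train a classifier on known zoonotic labels, and then inspect which features carry the most weight:

```python
# Sketch of a mechanism-agnostic genomic model: collapse genomes into 4-mer
# count features, fit a classifier, and rank the features that drive
# predictions. Sequences and labels here are random toy placeholders.
import random

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

random.seed(0)
genomes = ["".join(random.choices("ACGT", k=300)) for _ in range(40)]
labels = [i % 2 for i in range(40)]  # toy labels: 1 = known zoonotic

# Treat each genome as a string and count all overlapping 4-mers.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
X = vectorizer.fit_transform(genomes)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Identify the handful of 4-mers the model leans on most heavily.
kmers = vectorizer.get_feature_names_out()
top = np.argsort(model.feature_importances_)[::-1][:5]
print([kmers[i] for i in top])
```

With real data, the top-ranked features would be candidate genomic regions for the kind of downstream structural and experimental follow-up discussed above.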
It is difficult to define what zoonotic risk technology might look like within the next ten years. This nascent field of research is likely to grow exponentially as post-pandemic investments transform both the available data describing the global virome, and the institutional support for modeling research and development (and associated training in higher education). Especially if this work focuses on improvement through validation and cross-talk between experimentalists and modelers, we anticipate that the predictive accuracy and reliability of these technologies will continue to grow.
Previous work, especially from virologists, has been skeptical that these approaches might become a reliable source of inference [50,65]. However, prior work anticipating the level of predictive resolution that exists today was historically met with similar skepticism. As these technologies become powered by growing datasets cataloguing the global virome [2,66-70], and more complex microbiological predictors that capture host-virus interactions, it is difficult to imagine today how accurate and valuable they might become. If their potential for global health manifests, we should prepare now to guard against potential misuse, including monopolization in high-income countries, and to anticipate important matters of equity, including the equitable sharing of the benefits arising from their use.

Connecting computational and empirical work
Zoonotic risk technology can suggest which viruses may have zoonotic potential, with a nontrivial degree of uncertainty, but confirming that risk requires laboratory characterization. For example, successful viral replication in humans requires tens to hundreds of protein-protein interactions, which may not be predictable from viral sequence data alone [71-73]. Conversely, one of the greatest strengths of these tools is their ability to narrow down the list of (potentially millions of) viruses for risk assessment procedures that require complicated, sometimes-expensive experiments. For example, experimental evaluation of host competency may require establishment of cell lines from new species [74] or non-model organism systems in the laboratory, with a suite of associated challenges including unique housing requirements, low fecundity, a lack of commercial availability, few species-specific laboratory reagents, and often scant baseline data upon which to support health evaluations.
This aspect of zoonotic risk assessment can be complicated by concerns that it might require gain-of-function experiments, which use genetic editing or forced adaptation to induce new phenotypes, potentially expanding the host range, pathogenesis, or mode of transmission of a pathogen [75]. While these experiments have been critical to previous work (for example, by demonstrating the epidemic potential of highly pathogenic avian influenza through directed mutagenesis and serial passage to recover a virus capable of airborne transmission [76,77], or by demonstrating the potential of SARS-like viruses to jump from bat reservoirs into human populations [78]), they also face tremendous scrutiny given perceived or potential biosafety and biosecurity risks, including those potentially arising from dual-use research of concern [79]. Importantly, most host-virus interaction research (including in vitro and in vivo experimental infections) is not actually "gain-of-function" experimentation, but is mislabeled or misidentified as such by the public or media. These concerns are likely to face even greater scrutiny given public conversations about SARS-CoV-2's as-yet-unknown origins and the emergence of origin theories centered around biosecurity lapses [80].
There is a tremendous diversity of experimental approaches, stopping short of gain-of-function experiments, that may be used to validate predictive models and offer a more operationalized view of the problem. Among these are experimental infections to test cell entry and receptor usage, replication, pathogenesis, evasion of host immune responses, assembly and egress, and onward transmission [81,82]. The complexity of these experimental systems may begin with host cell lines and surrogate viruses, pseudoviruses, or replicon systems, and expand to include experiments with live virus and organoids [83] or live animal models [84]. Each of these laboratory approaches can offer targeted methods to validate predictions from machine learning models, such as virus-human compatibility, ability for viral replication and productive infection, and disease pathology, tissue tropism, or courses of infection. Many of these methods can be used without the requirement for high-containment laboratories, which is particularly important to ensure that a wide variety of different viral groups can be studied safely yet at scale and across country contexts.
Experimental work can further help identify the (model-able) molecular barriers to zoonotic emergence, including rules governing attachment and entry, transcription and genome replication, viral protein expression, innate immune antagonism, viral assembly, and egress. Each stage in the viral life cycle represents an opportunity to improve model performance, but will require the gathering and reconciliation of data across multiple host-virus systems and experimental approaches. For example, laboratory experiments are likely to vastly improve model performance upstream by offering new kinds of predictor data that reflect the various types of host responses to infection. These may include broad comparative data on host transcriptomic or proteomic responses to infection [85], or host-virus protein-protein interactions [72], which may help identify the mechanisms of infection and pathogenesis in humans even when collected from animal model systems [85]. Through better collaboration among statistical modelers and empiricists, future development of zoonotic risk technologies can iteratively validate or falsify model predictions, helping to improve the accuracy and applicability of predictive models over time. While some modeling publications may therefore recommend further characterization of specific viruses, this is unlikely to occur without active partnership between modelers and experimentalists, given that priority is often placed on known and recurrent threats.
Theory to technology, technology to toolkit
Most zoonotic risk technology is developed with the stated intent to contribute positively to human health and reduce the future burden of emerging zoonotic viruses. However, the knowledge that a virus poses a threat to human health often exists for years, even decades, before a catastrophic outbreak [32]. Zoonotic risk technology may therefore have limited benefit to global health without careful, intentional work focused on application and actionability. The pipeline from technology development, to implementation, to risk mitigation is likely to only succeed at first in specific, narrow use cases.
First, and most foundationally, predictive modeling work will always be disconnected from global health if the endpoint is in the academic literature. This is particularly the case if research groups are separated geographically and practically from the direct impacts of potential spillover events, and choose not to pursue collaboration and knowledge exchange with local experts, limiting both the expertise available to properly design and contextualize work, and the channels available for possible dissemination and outreach. Similar challenges have been identified for related modeling problems, like the development and deployment of early warning systems or real-time epidemic forecasting [86,87]. Engaging practitioners, policymakers, and stakeholders in the participatory design, release, and ongoing improvement of infrastructure is likely to increase the value of zoonotic risk technology, as will designing open tools with public interfaces (open-source software or websites, e.g. FluLeap: https://fluleap.bic.nus.edu.sg/) and based on open, interpretable data. These tools should be adaptable in order to more accurately reflect scientific advances, as well as changing user needs, over time. They must report appropriate, interpretable, and locally-relevant metrics of risk to inform decision-making, with uncertainty presented as transparently as possible, including where uncertainty comes from (e.g., data limitations vs. model calibration) and how uncertainty correlates with the outcome variables. Scientists may also be called on to develop new language for conveying context-dependent risk and communicating uncertainty to different audiences, including properly disclaiming results such that public, private, or health sector responses neither over-react to high-risk predictions nor under-react to low-risk predictions. This enduring challenge extends into many other aspects of global health and disease ecology.
At least in the near term, it is unlikely that any specific, coordinated, and effective response will be mobilized based solely on the identification of a novel virus with zoonotic potential. Resource scarcity prevents the development of individual surveillance systems or biomedical research-and-development programs for each of the thousands of wildlife viruses with zoonotic potential; existing programs focused on the narrowest set of expert-assessed high-risk threats (e.g., influenza A viruses, betacoronaviruses, or henipaviruses) are already over-encumbered. These programs may, however, benefit from technology that identifies zoonosis-relevant evolutionary shifts in viruses circulating in wildlife (e.g., the emergence of a Nipah virus lineage with greater estimated transmissibility [88]).
More broadly, these tools may find applications in existing One Health surveillance programs focused on high-risk interfaces between wildlife, domestic and captive animals, and humans.
Zoonotic risk technology is likely to be most actionable at small scales: a local inventory of wildlife viruses can be ranked according to risk, frequency, and degree of human-animal contact, with the highest-risk viruses incorporated back into local surveillance priorities. For studies reporting the discovery of novel animal viruses, this step can be as simple as an additional analysis, with at least one published example of this use case [89]. However, if studies are designed with these kinds of assessments in mind, they may also be able to collect additional data with tremendous value. For example, sequence-based viral discovery often focuses only on viral reads and discards data from the host [90]. Host-derived sequence data contain crucial information about the host response that could provide insight into a given virus's pathogenic potential in an animal or human host, allowing for further surveillance prioritization of both host and virus species. The utility of these studies is also limited by the quality of data shared in the public domain: sharing standardised and validated data, such as host species identification, location, specimen type, and date of collection, in a centralised resource, rather than leaving it buried in a publication (if it is shared at all), is one of the first steps to a truly collaborative approach.
Once high-risk viruses are identified and reported in wildlife monitoring studies, these pathogens may also be identified and flagged earlier in samples collected by programs that passively monitor the health of high-risk human populations (e.g., livestock keepers or wildlife traders) and sentinel hosts like livestock, or actively screen human populations for novel pathogens by investigating undiagnosed febrile illnesses [91-94]. Behavioral change or occupational safety interventions may then be targeted to reduce spillover risk for high-risk human populations, though they may be most feasible or successful if they target risky exposure to specific host species with multiple high-risk viruses and frequent human contact (especially if they already match local priorities), thereby protecting against their entire zoonotic virome. For example, while Nipah virus is the highest-priority zoonotic threat hosted by the Indian fruit bat (Pteropus medius), interventions that reduce Nipah exposure in humans may also protect against the other 50+ viruses that these bats host [95].
Similarly, the reservoir for Lassa virus carries a number of other bacterial zoonoses, and rodent control can reduce transmission risk across this range of threats [96].
Failure to equitably share benefits may limit impact
As the failure to ensure global access to diagnostics, therapeutics, and vaccines during the COVID-19 pandemic has demonstrated [97], the distribution of the benefits from health technologies, particularly novel technologies, is inequitable and a global injustice. Global health must not simply aspire to principles of health equity and social justice, but must also make equitable access to lifesaving technologies a condition precedent to their development and use. This must be a priority in the development and use of zoonotic risk technology, which may also pose a unique set of problems for both researchers and practitioners. These technologies depend on open data sharing, both to create sufficient training sets for artificial intelligence, and to actually apply them for risk assessment and subsequent mitigation. Community efforts to share human and animal sequence data at sufficient scales (i.e., to generate feature sets for advanced machine learning) exist for just a handful of high-profile viruses, nearly exclusively as part of international coordination on pandemic preparedness and response (e.g., influenza A and SARS-CoV-2 data sharing via the GISAID platform), while all-purpose repositories like GenBank still only capture a fraction of known viruses (as many are bottlenecked by taxonomic ratification), and lack essential metadata needed for prediction. Both face challenges with regard to contributors receiving credit and attribution for research (especially modeling studies) based on the data they submit (Box 1).
These problems become more complex with regard to the deployment of zoonotic risk technology itself. Initially, there may be resistance to using these tools: scientists who gather novel sequence data may rightfully be hesitant to upload unpublished data to online web tools for zoonotic risk prediction without clear and enforceable protections against data reuse by the curators of such tools, even if these tools are curated by a trusted third party (though access to this technology may inherently change power dynamics). This is only one concern out of a broader set of issues around access and benefit sharing for viral surveillance. Grounded in countries' sovereign rights to determine the use of resources within their territory, access and benefit-sharing regimes seek to redress and prevent injustices arising from the exploitation of genetic resources, and from the inequitable sharing of the benefits that arise from their use. Some protections and norms around the sharing of physical pathogen samples and the benefits arising from their use are reflected under the Nagoya Protocol on Access to Genetic Resources and the Fair and Equitable Sharing of Benefits Arising from their Utilization (Nagoya Protocol) to the Convention on Biological Diversity. Under the Nagoya Protocol, countries may implement domestic legislation that requires foreign researchers seeking access to pathogen samples, or in some cases even data related to those samples, to obtain the country's prior informed consent and to conclude mutually agreed terms that include benefit sharing, such as attribution in publications, capacity building, technology transfer, or intellectual property rights. Depending on the terms agreed, the use of genetic sequence data derived from pathogen samples may be restricted or conditioned: terms may prevent the sharing of sequence data in open-access databases, require open sharing, or require the sharing of any diagnostic tools developed using the sequence data.
Given that a growing number of labs are readily able to synthesize viruses from their genome sequences (e.g., horsepox [98] and SARS-CoV-2 [99]), there are concerns that the bargain underpinning access and benefit-sharing regimes built on physical samples, like those established in the Nagoya Protocol, may be in flux. If zoonotic risk technologies allow researchers to identify high-risk viruses with reasonable certainty before laboratory characterization, this could add an additional layer of complexity. Sequence sharing and synthesis of SARS-CoV-2, in addition to the global failure to equitably distribute associated vaccines, diagnostics, and therapeutics, may motivate attempts to expressly address these gaps in international legal instruments. These could include, for example, revision of the International Health Regulations (2005) or the Nagoya Protocol, or new international law, such as a Pandemic Treaty. Any international governance reform should actively consider, on equal footing, the importance of open data sharing and the equitable sharing of the benefits of novel technologies like zoonotic risk technology.
Even if zoonotic risk technologies are easily applied without challenges around sequence data sharing, there may be gaps between intentions and actionable science. When high-risk viruses are identified, findings may be kept private until published alongside existing research efforts, whether to protect credit for the field discoveries or because of governmental hesitancy to release information for fear of social stigmatisation at the local, national, and international levels. This could reinforce the disconnect between viral sampling and actionable science for global public health. At present, announcements about the discovery of notable animal viruses are often made ad hoc, either by press release or through conventional publishing. If zoonotic risk technology becomes a widely-adopted part of surveillance, new governance processes will likely need to be developed that protect researchers' careers and credit, but also ensure that announcements are transparent and verifiable, particularly if alarming or unusual results (e.g., the hypothetical discovery of a filovirus with zoonotic potential in bats in the United States) are likely to motivate public or international concern.
Another set of issues could arise around who benefits from zoonotic risk technology. It seems plausible that these technologies might mostly benefit from the effort and data sharing occurring in tropical countries, where zoonotic viral diversity is believed to be highest [11]. However, their development might mostly further the careers of researchers in high income countries, particularly if developed by experts who are unattuned to power dynamics in global health. Equally concerning, we identify a possibility that these tools will largely be developed as proprietary "risk assessment algorithms" by corporate "data science for impact" programs, for-profit global health firms, and non-profit organizations, just as they have been for the development of pandemic insurance programs or similar analytics. In these circumstances, and without appropriate governance, the countries with the highest burden of zoonotic emergence might find their own data (repackaged in an analytic format) sold back to them at a premium by scientists and corporations from high-income countries. Open sharing of academic research could help scientists undermine this trend and provide tools directly to end users in public health, or assist them in developing their own tools, but may simply accelerate advances in zoonotic risk technology without changing the existing colonial framework of global health. Involving researchers from the discipline of science and technology studies (STS) may lead to more honest and critical appraisal of the ethical issues surrounding the emerging technology, and who it benefits or harms.
Finally, we anticipate that zoonotic risk technology may replicate existing, and potentially create new, ethics and governance problems in synthetic biology. Just as convolutional neural networks and other kinds of artificial intelligence can be used to fabricate realistic images entirely through predictive algorithms (e.g., thispersondoesnotexist.com), zoonotic risk technology might be used to generate novel viral sequences (and potentially synthetic viruses) with high predicted zoonotic and epidemic potential. Already, researchers have used these approaches to simulate alternate coronavirus spike protein sequences that might be able to infect human cells [100]. These approaches might support biomedical work; for example, synthetic spike proteins could be used to test a candidate universal betacoronavirus vaccine for its value across "unsampled evolutionary space." However, if biomedical companies attempt to patent these sequences, they could create new problems for future sample sharing, therapeutic and vaccine development, or outbreak response if viruses with the relevant sequence someday emerge -potentially at the expense of some countries more than others. While similar issues have been raised before during zoonotic outbreaks [101], the novelty of simulated zoonoses might create new complications for intellectual property law.
Moreover, viral ranking algorithms or artificially simulated virus sequences might also be used by a malicious actor, highlighting the need to involve scholarship from the "dual use" field of bioethics.

Prediction isn't prevention
Zoonotic risk technology may become an asset in the emerging disease toolkit, but overselling this technology or understating uncertainty will lead to preventable divergences between expectations and scientific possibility. Models may ultimately have profound clinical and field applications, but the uncertainty around risk estimates and likelihood of inaccurate predictions must be carefully communicated. As part of that, epistemic differences in disciplinary conceptions of uncertainty may need to be bridged: for example, a model may make "errors" simply because reality is a stochastic observation of underlying risk landscapes, and a technology that correctly infers probabilities or risk landscapes will still never perfectly represent reality. (These may play into other disciplinary tensions about what "prediction" means: to public health experts and the public, prediction is often synonymous with anticipating future events, but to computational biologists, it may more often be used to describe accurate inference about biological possibility.) Further, there is no substitute for experimental work, and bench virology will play a critical role to generate the necessary data for model development and validation. Zoonotic risk technology is also no substitute for general public health preparedness; even though these tools could be used in the future to estimate the risk posed by newly-discovered viruses as soon as the first genome becomes available, many viruses are still likely to continue to enter human populations before they have been characterized in animals.
Whether these outbreaks become epidemics or pandemics is a problem outside the scope of the technologies we discuss.
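As a toy illustration of this epistemic point (our own construction, not a result from the workshop), even a model with perfect knowledge of the underlying risk landscape will still misclassify many individual stochastic events:

```python
# Toy simulation: outcomes are stochastic draws from true underlying
# probabilities, so even an oracle that knows those probabilities exactly
# and predicts the likelier outcome is wrong on many individual events.
import random

random.seed(1)
true_risk = [random.uniform(0.05, 0.6) for _ in range(10_000)]
outcomes = [1 if random.random() < p else 0 for p in true_risk]

predictions = [1 if p >= 0.5 else 0 for p in true_risk]
accuracy = sum(pred == obs for pred, obs in zip(predictions, outcomes)) / len(outcomes)
print(f"accuracy of a perfectly informed model: {accuracy:.2f}")  # well below 1.0
```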
Therefore, we warn that investments in research and development on topics like machine learning or animal virus genomics must not come at the expense of other essential kinds of modeling work (e.g., work focused on virus transmission and spread, or identifying the most consequential surveillance gaps), or more importantly, at the expense of non-technological investments in health systems strengthening, including attainment of universal health coverage, and similar aspects of pandemic preparedness. Similarly, it is possible that interest in pre-emergence zoonotic viruses might conflict with, redirect, or undermine local priorities like water- and food-borne diseases (and sanitation), agricultural health, high-burden communicable diseases (e.g., HIV/AIDS, tuberculosis, and malaria), or non-communicable diseases; interventions may even disrupt local interests and norms, potentially weakening outbreak response during emergencies. If the post-pandemic period becomes dominated by this narrow subset of research priorities, researchers will need to be individually careful to accurately and fairly present the value and importance of their work (an imperative that will be encouraged by efforts to reduce funding scarcity in this space).
At the same time, it is indisputable that zoonotic risk technology is currently limited in both development and application by data scarcity, and that the only solution to this is continued or greater investment in data collection, particularly in basic science. Post-pandemic investment in coordinated programs for viral discovery, One Health surveillance, bench virology, and other kinds of laboratory capacity are all likely to generate vital data that can improve the performance of these technologies, and remedy critical gaps in our current understanding of the global virome. These will be most effective if investments are maximized in the hotspots of zoonotic emergence, if modelers are engaged in the process to support data collection and processing in reusable formats, and, perhaps most importantly, if these investments are made with the aim of improving outbreak prevention and preparedness entirely independent of the success or failure of zoonotic risk technology as a scientific outcome.
Finally, we suggest that ongoing work is required to benchmark the accuracy and value of these technologies, that transparency and uncertainty be key facets of their presentation, and, most importantly, that the scientific community remain prepared for "surprises." (In a strikingly timely example, only days before the submission of this manuscript, the first-ever report of human infections with H5N8 avian influenza A virus was released.) Models are only as powerful as the data that inform them, and with such a small percentage of the global virome described to date, and new viruses evolving constantly, it seems likely that the next generation of risk prediction systems, and the public health infrastructure that may come to rely on them, will face a number of entirely unexpected threats.

Box 1. Crediting researchers for reused, open sequence data
When "big data" becomes available at scales that allow machine learning (or other intensive secondary analysis), the researchers who compiled the data often receive exponentially-diminishing credit through academic incentives. Existing public data repositories, including GISAID and NCBI GenBank, have no indexable source attribution for sequence data. GISAID requires acknowledgement of the source, but such acknowledgement is not a trackable metric contributing to career development; similarly, GenBank accession numbers assist in reproducibility of analyses, but are not indexed, and cannot be easily tracked by contributors as a career metric. This can disincentivize open data sharing if it is seen as a "non-promotable" task for those generating the data, given that other indexed metrics like citations may be used to determine scientific impact when evaluating funding proposals or in hiring and promotion decisions. Moreover, this system currently benefits users of public data repositories more than those who generate the data. In several instances during the COVID-19 pandemic, laboratories generating SARS-CoV-2 sequence data have been stretched thin with pandemic response and were unable to annotate, analyze, and publish on their data before computational or academic laboratories used the data in their own publications. Similar practices are particularly divisive when researchers use data generated by public health laboratories in developing nations without co-authorship, collaboration, or indexed citation of the source.
One potential solution would be an indexed DOI for sequence data; similar approaches are used for aggregate data in biodiversity research (e.g., by the Global Biodiversity Information Facility; gbif.org), and while many studies fail to follow recommendations for proper attribution, these procedures are a reasonable first step toward fair credit.