The PHA4GE SARS-CoV-2 contextual data specification for open genomic epidemiology

The Public Health Alliance for Genomic Epidemiology (PHA4GE) (https://pha4ge.org) is a global coalition that is actively working to establish consensus standards, document and share best practices, improve the availability of critical bioinformatic tools and resources, and advocate for greater openness, interoperability, accessibility and reproducibility in public health microbial bioinformatics. In the face of the current pandemic, PHA4GE has identified a clear and present need for a fit-for-purpose, open source SARS-CoV-2 contextual data standard. As such, we have developed an extension to the INSDC pathogen package, providing a SARS-CoV-2 contextual data specification based on harmonisable, publicly available, community standards. The specification is implementable via a collection template, as well as an array of protocols and tools to support the harmonisation and submission of sequence data and contextual information to public repositories. Well-structured, rich contextual data adds value, promotes reuse, and enables aggregation and integration of disparate data sets. Adoption of the proposed standard and practices will better enable interoperability between datasets and systems, improve the consistency and utility of generated data, and ultimately facilitate novel insights and discoveries in SARS-CoV-2 and COVID-19.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 9 August 2020 doi:10.20944/preprints202008.0220.v1


The importance of contextual data for interpreting SARS-CoV-2 sequences
The SARS-CoV-2 pandemic has been referred to as a once-in-a-century event (1). Beginning in late 2019 in Wuhan, China, the virus has now spread to virtually every country and territory in the world, causing hundreds of thousands of deaths and millions of confirmed cases of COVID-19 (2,3). Understanding, monitoring and preventing transmission have been primary goals of the public health response to SARS-CoV-2.
Tracking the spread and evolution of the virus at global, national and local scales has been aided by the analysis of viral genome sequence data alongside SARS-CoV-2 epidemiology. Large-scale sequencing efforts are often formalised as consortia across the world, including COG-UK in the UK (4), SPHERES in the USA (5), CanCOGeN in Canada (6), the Latin American Genomics SARS-CoV-2 Network (7,8), 2019nCoVR in China (9), and the South Africa NGS Genomic Surveillance Network (10). These combined efforts will result in the generation of hundreds of thousands of genome sequences within the first year of the pandemic. Deposition of these sequences into public repositories such as the Global Initiative on Sharing All Influenza Data (GISAID) (11) and the International Nucleotide Sequence Database Collaboration (INSDC) (12) has enabled rapid global sharing of data. At the time of writing, 97 countries had undertaken open sequencing initiatives (GISAID accessed 2020-07-02), generating more than 58,545 sequences, which are being reused and analysed on a massive scale. The open data sharing paradigm has had tremendous success in the genomic epidemiology of foodborne pathogens (13,14), and has the potential to reveal a deeper understanding of SARS-CoV-2 origin, pathogenicity, and basic biology when submissions from its wild hosts are included alongside human samples (15). The open sharing of SARS-CoV-2 data has already paid dividends for diagnostics and catalyzed a number of vaccine initiatives (16,17). Mutations that render assay probes less sensitive or ineffective are highly problematic in a pandemic where testing is a crucial aspect of infection control. Global monitoring of mutations in platforms like Nextstrain and CoV-Glue-UK has better enabled agility and confidence in the diagnostic domain (18).
Public health sequence data is of limited value without contextual data, which consists of laboratory (e.g. date and location of testing, cycle threshold (CT) values), clinical (e.g. hospitalization, outcomes), epidemiological (e.g. age, gender, exposures) and methods (sampling, sequencing, bioinformatics) information. For example, phylodynamics, the combined analysis of epidemiological, immunological, and evolutionary characteristics, is predicated on having accurate sampling time and location data for each genome, which aid public health practitioners in understanding the spatiotemporal patterns of disease transmission (19-21). Additionally, contextual data may be used to determine whether specific lineages are circulating in specific settings, e.g. long-term care facilities (22), meat packing plants (23), conferences (24) or other public gatherings (25), or are travel-related (26,27). Contextual data is particularly important for evaluating the epidemiological relevance of genomic relationships (28) in low-diversity pathogens such as SARS-CoV-2. Genomic variations are a key source of information that can help public health researchers better understand putative changes in transmission, virulence, epidemiology and therapeutics of an emerging pathogen. Evaluating which variants represent real, circulating viruses, as opposed to artifacts of sample handling or sequencing, depends on the capture of methodological information, such as experimental design, laboratory procedures, bioinformatic processing, and quality control metrics, in order to understand the context and limitations of analyses, e.g. detecting systematic batch effect errors related to certain sequencing centres and methods (29-31). These are just a few examples, and there are many additional ways to interpret public health genomic data to inform decision making for public health responses and develop greater scientific understanding of the pathogen.
Contextual data that are structured and consistent, particularly complying with community standards like minimum information checklists (MIxS (32) , MIGS (33) , Sample Application Standard (34) ) and ontologies (OBO Foundry (35) ), are easier to understand and process, and can be more easily aggregated and reused for different types of analyses. However, contextual data is often collected on a project-specific basis according to local needs and reporting requirements, and is often structured according to organization or initiative-specific data dictionaries. Furthermore, attribute packages and metadata standards developed by different organizations are scoped to cover as many use cases and pathogens as possible, and so can include fields of information not applicable to SARS-CoV-2 or that may be subject to privacy concerns, or exclude fields commonly used in public health surveillance and investigations. As different types of contextual data are subject to different ethical, practical and privacy concerns, not all components of existing standards are immediately or widely shareable. As a result, the range of generic metadata standards being applied to SARS-CoV-2 data presents challenges for data harmonization (36) and analysis critical for fighting the disease and ending the pandemic.
While the examples here focus on public health surveillance, we must recognize that good data management (tracking and documenting) goes beyond data sharing. Good data stewardship practices are critical not only for auditability and reproducibility, but also for posterity: documenting critical information about samples, methods, risk factors and outcomes can help build a roadmap for dealing with future public health crises.
In light of these challenges, PHA4GE has identified a clear and present need for a fit-for-purpose, open source SARS-CoV-2 contextual data specification which can be used to consistently structure information as part of good data management practices and for data sharing with trusted partners and/or public repositories. The specification was developed by consensus among domain experts, and incorporates existing community standards in light of SARS-CoV-2 public health needs in order to ensure privacy while maximizing information linkage, content and interoperability across datasets and databases, to better enable analyses to fight COVID-19.

SARS-CoV-2 Contextual Data Specification: The Framework
The purpose of the PHA4GE SARS-CoV-2 specification is to provide a mechanism for consistent structure, collection and formatting of fields and values containing SARS-CoV-2 contextual data. We emphasize that the purpose of this specification is not to force data sharing, but rather to provide a framework to structure data consistently across disparate laboratory and epidemiological databases so that they can be harmonized for different uses (Figure 1). Data sharing is just one use case and can involve sharing between divisions within a single agency, sharing between partners based on memorandums of understanding, or submission to public repositories. Contextual data can be captured and structured using the PHA4GE specification so that it can be more easily harmonized across different data sources and providers. Different subsets of the harmonized data can be 1) shared with public repositories, e.g. GISAID and INSDC, 2) shared with trusted partners, e.g. national sequencing consortia and public health partners, and 3) kept private and retained locally, with the potential for sharing in the future for particular surveillance or research activities. While fields have been colour-coded in the template to indicate whether they are considered "required", "strongly recommended" or "optional", how the specification is implemented, and how much, if any, of the data is shared, is ultimately at the discretion of the user. Box 1 describes the information types covered in the full specification.
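As a minimal sketch of the three sharing tiers described above, the following Python snippet splits one harmonized record into public, partner and private subsets. The tier assignments and the `split_by_tier` helper are hypothetical and chosen only for illustration; the specification itself leaves these decisions to the user.

```python
from typing import Dict

# Hypothetical tier assignments, for illustration only. In practice the
# data custodian decides which fields may leave the organisation.
SHARING_TIERS: Dict[str, str] = {
    "isolate": "public",
    "sample collection date": "public",
    "geo_loc name (country)": "public",
    "sample collected by": "partner",
    "host health state details": "private",
}


def split_by_tier(record: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Group a record's fields into public, partner and private subsets.

    Fields without an explicit tier are treated as private, so that
    nothing is shared by accident.
    """
    subsets: Dict[str, Dict[str, str]] = {"public": {}, "partner": {}, "private": {}}
    for field, value in record.items():
        tier = SHARING_TIERS.get(field, "private")
        subsets[tier][field] = value
    return subsets


if __name__ == "__main__":
    example = {
        "isolate": "SARS-CoV-2/human/Canada/ABC-123/2020",
        "sample collected by": "Example Public Health Laboratory",
        "local freezer position": "B7",
    }
    for tier, fields in split_by_tier(example).items():
        print(tier, fields)
```

Defaulting unlisted fields to the private subset is a design choice of this sketch: it keeps newly added local fields from being released before someone has reviewed them.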
The PHA4GE SARS-CoV-2 contextual data specification was created through broad consultation with representatives from public health laboratories, research institutes and universities in eight countries (Canada, Australia, Germany, Portugal, South Africa, Switzerland, the United Kingdom, the United States of America) who are involved with the SARS-CoV-2 genome sequencing and analysis efforts at various levels. Based on this consultation and consensus, the specification contains different fields covering a wide array of data types described in Box 1 ( Figure 1). The specification attempts to harmonize different data standards (INSDC, GISAID, MIxS, MIGS, Sample Application Standard) by reusing fields or mapping to fields, as much as possible. As PHA4GE embraces FAIR data stewardship principles (Findability, Accessibility, Interoperability and Reuse of digital assets), we strived to implement FAIR principles in the design and implementation of the specification for data management and data sharing. At their core, these principles emphasize machine-actionability and consistency of data, and are critical for dealing with the volume and complexity of genomic sequence and contextual data.
The versioned specification is available as a contextual data collection template (.xlsx) and in machine-amenable JSON format from GitHub (https://github.com/pha4ge/SARS-CoV-2-Contextual-Data-Specification; version at time of publication https://zenodo.org/record/3947048#.Xxs7gvhKg_U). The collection template also offers ontology-based standardized terms for a number of fields in the form of pick lists. The template is supported by a number of materials, such as a Reference Guide, which provides definitions and field-level guidance, as well as examples of how data might appear when structured according to the specification. A Standard Operating Procedure (SOP), which contains instructions for using the collection template, has also been provided, along with mappings of fields to standards and public repository submission requirements, and links to protocols that have been developed by PHA4GE for SARS-CoV-2 sequence submission. The different materials are outlined in Table 1.

Data submission protocol (NCBI)
The SARS-CoV-2 submission protocol for NCBI provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data.

Data submission protocol (EMBL-EBI)
The SARS-CoV-2 submission protocol for ENA provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data.

Data submission protocol (GISAID)
The SARS-CoV-2 submission protocol for GISAID provides step-by-step instructions and recommendations aimed at improving interoperability and consistency of submitted data.

JSON structure of PHA4GE specification
A JSON structure of the PHA4GE specification has been provided for easier integration into software applications.

Getting Started: How To Use The Standard
In designing the specification, we began by considering the goals of data collection and harmonization. Consulted partners felt that the first priority of standardizing data should be improved support for SARS-CoV-2 genomic surveillance activities and the submission of sequence data and minimal metadata to public repositories. The two most important attributes for tracking transmission from pathogen genomic data are temporal information describing when a sample was collected and spatial information describing where a virus was sampled. Comparisons of minimal contextual data requirements across different national sequencing efforts, as well as submission requirements for INSDC and GISAID databases, yielded a minimal set of 10 fields which we have annotated as "required" in the specification (colour-coded yellow in the collection template, see Table 1). Those fields, their definitions, and guidance notes are described in Table 2. A number of other fields have been annotated as "strongly recommended" (colour-coded purple in the collection template) for capturing sample collection and processing methods, critical epidemiological information about the host, and acknowledging scientific contributions. Fields colour-coded white are considered optional.

specimen collector sample ID
Every Sample ID from a single submitter must be unique. It can have any format, but we suggest that you make it concise, unique and consistent within your lab, and as informative as possible.

sample collected by
The name of the agency that collected the original sample.
The name of the agency should be written out in full (with minor exceptions) and be consistent across multiple submissions.

sequence submitted by
The name of the agency that generated the sequence.
The name of the agency should be written out in full (with minor exceptions) and be consistent across multiple submissions.

sample collection date
The date on which the sample was collected.

isolate
Identifier of the specific isolate.
This identifier should be a unique, indexed, alphanumeric ID within your laboratory. If submitted to the INSDC, the "isolate" name is propagated throughout different databases. As such, structure the "isolate" name to be ICTV/INSDC compliant in the following format: "SARS-CoV-2/host/country/sampleID/date"

host (scientific name)
The taxonomic, or scientific name of the host.

host disease
This field is only required if there was a host. If the host was a human, select COVID-19 from the pick list. If the host was asymptomatic, this can be recorded under "host health state details". "COVID-19" should still be provided if the patient is asymptomatic. If the host is not human, and the disease state is not known or the host appears healthy, put "not applicable".

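The field guidance above can be sketched in a few lines of Python: one helper assembles an "isolate" name in the "SARS-CoV-2/host/country/sampleID/date" pattern, and another reports which required fields are still empty. Both function names are hypothetical, the list of required fields contains only those named in the text above (the full specification lists ten), and the date component is passed through as given because the text does not fix its format.

```python
from typing import Dict, List, Optional

# Required fields named in the text above. The full specification lists
# ten required fields, so this list is deliberately partial.
REQUIRED_FIELDS: List[str] = [
    "sample collected by",
    "sequence submitted by",
    "sample collection date",
    "isolate",
    "host (scientific name)",
]


def build_isolate_name(host: str, country: str, sample_id: str, date_part: str) -> str:
    """Assemble "SARS-CoV-2/host/country/sampleID/date".

    Each component must be non-empty and must not contain "/", because
    the slash is the separator in the isolate name.
    """
    parts = [host, country, sample_id, date_part]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid isolate name component: {part!r}")
    return "/".join(["SARS-CoV-2", *parts])


def missing_required(record: Dict[str, Optional[str]]) -> List[str]:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    name = build_isolate_name("human", "Canada", "ABC-123", "2020")
    print(name)
    print(missing_required({"isolate": name, "sample collection date": "2020-03-01"}))
```

Running the check before submission, rather than after a repository rejects a record, keeps the correction close to the laboratory that holds the source data.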
As many contextual data types are stored in different locations and databases (e.g. LIMS, epidemiology case report forms and databases), a benefit of implementing the PHA4GE collection template is that it enables the capture of these different pieces of information in one place. The collection template also offers pick lists for a variety of fields, e.g. a curated INSDC country list for "geo_loc name (country)", the standardised name of the virus under the "organism" field (i.e. Severe acute respiratory syndrome coronavirus 2), and a multitude of standardised terms for cell lines in the "lab host" field. The pick lists provided are not exhaustive, but have been curated from current literature representing active sampling and surveillance activities. If a pick list is missing standardised terms of interest, the reference guide also provides links to different ontology look-up services enabling users to identify additional standardized terms. The reference guide provides definitions for the fields, additional guidance regarding the structure of the values in the field, and any suggestions for addressing issues pertaining to privacy and identifiability. The template SOP provides users with step-by-step instructions for populating the template, looking up standardized terms, and structuring sample descriptions. The SOP also highlights a number of ethical, practical, and privacy considerations for data sharing.
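To illustrate how pick lists support consistency, the sketch below checks a record's values against a small set of allowed terms. The pick lists shown are abbreviated examples written for this illustration, not the lists shipped with the template, and `find_off_list_values` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Abbreviated, illustrative pick lists. The collection template provides
# much longer, ontology-based lists.
PICK_LISTS: Dict[str, Set[str]] = {
    "organism": {"Severe acute respiratory syndrome coronavirus 2"},
    "geo_loc name (country)": {"Canada", "South Africa", "United Kingdom"},
}


def find_off_list_values(record: Dict[str, str]) -> List[str]:
    """Describe every value that is not a term in its field's pick list.

    Fields without a pick list, and fields missing from the record, are
    not checked here.
    """
    problems = []
    for field, allowed in PICK_LISTS.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}: {value!r} is not a pick list term")
    return problems


if __name__ == "__main__":
    record = {"organism": "SARS-CoV-2", "geo_loc name (country)": "Canada"}
    for problem in find_off_list_values(record):
        print(problem)
```

Because the pick lists are not exhaustive, a flagged value is a prompt to look up a standardized term in an ontology look-up service rather than proof of an error.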
Implementation of the PHA4GE specification around the world
How, and how much of, the specification is implemented is ultimately at the discretion of the user. To date, versions of the specification are being implemented in the CanCOGeN (Canada) and SPHERES (USA) SARS-CoV-2 sequencing initiatives, the AusTrakka (Australia) national data sharing platform (37), by the Global Emerging Pathogens Treatment Consortium (Africa) (38), and in the Baobab LIMS (39) at the South African National Bioinformatics Institute (SANBI) (40).

Submitting Data to Public Sequence Repositories
For a large global genomic surveillance program to be successful, each new entry (genome, contextual data (usually referred to as metadata), plus raw reads) must be made publicly available and properly linked in one of the collaborating databases. Most existing SARS-CoV-2 sequences have only been deposited in GISAID, with a small proportion of submitters (~20%, 2020-06-04) also depositing matching raw read data in the INSDC (i.e. National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) and DNA Data Bank of Japan (DDBJ)).
Within the INSDC, the metadata describing the samples are stored as accessioned BioSamples (41) with a consistent set of attribute names and standardized values. BioSamples add value, promote reuse, and enable interoperability of data submitted from laboratories that may only be connected by following the same metadata standard. The INSDC databases provide a generic pathogen metadata template for the BioSample that is heavily utilized for bacterial genomic surveillance (42) and extended for particular use cases (32). GISAID uses a different format and data structure for associating metadata, developed primarily for influenza surveillance and now extended to include SARS-CoV-2. The ENA provides a virus metadata checklist (ENA virus pathogen reporting standard checklist) developed as part of the COMPARE project (43), which is very similar to the GISAID submission requirements. Building on these existing standards, we developed a metadata specification for SARS-CoV-2 genomic surveillance that is broad enough for internal laboratory use while providing formatted submission templates for public release to INSDC and GISAID. The detailed mapping of PHA4GE fields to public repository submission requirements, as well as guidance and advice, are available as supporting documents (see Table 1). We have also provided detailed protocols for data submission to the three participating repositories, GenBank/SRA (NCBI), ENA (EMBL-EBI), and GISAID. An overview of how the PHA4GE specification is integrated into public repository submissions is presented in Figure 2.
1. Submit raw sequencing data and assembled/consensus genomes to INSDC and GISAID.
2. Create a BioSample record when submitting using the PHA4GE guidance, populating the mandatory and recommended fields where possible.
3. Curate your public records (sequence data and BioSample), updating them when subsequent information becomes available or retracting them if/when records become untrustworthy.
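The idea of mapping specification fields to repository submission fields can be sketched as a simple rename step. The mapping below is illustrative only: the target attribute names are assumptions made for this example, and the authoritative correspondence is the PHA4GE field-mapping document listed in Table 1. `to_submission_record` is a hypothetical helper.

```python
from typing import Dict

# Illustrative mapping only. Consult the PHA4GE field-mapping document
# for the authoritative correspondence to each repository.
EXAMPLE_BIOSAMPLE_MAP: Dict[str, str] = {
    "sample collection date": "collection_date",
    "geo_loc name (country)": "geo_loc_name",
    "isolate": "isolate",
    "host (scientific name)": "host",
    "sample collected by": "collected_by",
}


def to_submission_record(record: Dict[str, str], field_map: Dict[str, str]) -> Dict[str, str]:
    """Rename mapped fields for submission and drop all unmapped fields."""
    return {
        target: record[source]
        for source, target in field_map.items()
        if source in record
    }


if __name__ == "__main__":
    local_record = {
        "isolate": "SARS-CoV-2/human/Canada/ABC-123/2020",
        "sample collection date": "2020-03-01",
        "internal note": "not for release",
    }
    print(to_submission_record(local_record, EXAMPLE_BIOSAMPLE_MAP))
```

Dropping unmapped fields by default means that locally held information, such as the internal note in the example, is never sent to a repository unless it has been mapped deliberately.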

Conclusion
The collective response to the SARS-CoV-2 pandemic has resulted in an unprecedented deployment of genomic surveillance worldwide, bringing together public health agencies, academic research institutions, and industry partners. This unified action provides opportunities to more effectively understand and respond to the pandemic. Yet it also provides an enormous challenge, as realizing the full potential of this opportunity will require standardization and harmonization of data collection across these partners. As countries around the world face exponential growth in the number of COVID-19 cases, and prepare for new waves of infections throughout the pandemic, a unique opportunity for harmonization in data collection exists. With our SARS-CoV-2 metadata specification we have endeavored to create a mechanism for promoting consistent, standardized contextual data collection that can be applied broadly. We hope that, given sufficient uptake, this specification will improve the consistency of collected data, making them reusable by agencies as they continue working towards an increased understanding of SARS-CoV-2 epidemiology and biology, and harmonizing them such that community-based data sharing efforts are not excessively burdened. Furthermore, the framework for SARS-CoV-2 presented in this work can also be used to build a roadmap for dealing with future public health crises.