CoVigator—A Knowledge Base for Navigating SARS-CoV-2 Genomic Variants

Background: The outbreak of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) resulted in the global COVID-19 pandemic. The urgency for an effective SARS-CoV-2 vaccine has led to the development of the first series of vaccines at unprecedented speed. The discovery of SARS-CoV-2 spike-glycoprotein mutants with the potential to escape vaccine-induced protection and to increase infectivity, however, demonstrates the persisting importance of monitoring SARS-CoV-2 mutations to enable early detection and tracking of genomic variants of concern. Results: We developed the CoVigator tool with three components: (1) a knowledge base that collects new SARS-CoV-2 genomic data, processes them and stores the results; (2) a comprehensive variant calling pipeline; (3) an interactive dashboard highlighting the most relevant findings. The knowledge base routinely downloads and processes virus genome assemblies or raw sequencing data from the COVID-19 Data Portal (C19DP) and the European Nucleotide Archive (ENA), respectively. The results of variant calling are visualized through the dashboard in the form of tables and customizable graphs, making it a versatile tool for tracking SARS-CoV-2 variants. We put a special emphasis on the identification of intrahost mutations and make available to the community what is, to the best of our knowledge, the largest dataset on SARS-CoV-2 intrahost mutations. In the spirit of open data, all CoVigator results are available for download. The CoVigator dashboard is accessible via covigator.tron-mainz.de. Conclusions: With increasing demand worldwide in genome surveillance for tracking the spread of SARS-CoV-2, CoVigator will be a valuable resource of an up-to-date list of mutations, which can be incorporated into global efforts.


Introduction
The identification, characterization and monitoring of the pathogen responsible for a novel emerging disease is crucial for the development of a timely public health response. This includes rapid and open sharing of data [1], which has been adopted in past outbreaks to advance research and improve medical support [2,3]. The outbreak of the respiratory disease COVID-19 caused by SARS-CoV-2 demonstrated the increasing value of high-throughput sequencing by enabling the publication of the complete virus genome within one month of sampling [4,5]. The identification of the SARS-CoV-2 spike protein as a valuable target for vaccine design [6,7] led to the development of vaccines at unprecedented speed [8,9] and is still fostering further developments.
Nevertheless, the discovery of SARS-CoV-2 spike-glycoprotein mutants, associated with the potential to escape vaccine-induced protection, demonstrates the importance of continuously monitoring SARS-CoV-2 mutations to enable early detection and tracking of genomic variants of concern.

Materials and Methods
The CoVigator knowledge base is implemented in Python version 3.8 and the database for storing data is PostgreSQL (version 13.4).
The CoVigator pipeline (version 0.14.0) is implemented in the Nextflow framework version 19.10.0. All dependencies are managed within conda (version 4.9) environments [46] (see Table 1). The pipeline may receive as input either (1) a single-end FASTQ file, (2) two paired-end FASTQ files, (3) an assembly in FASTA format or (4) a VCF file with mutations. Adapter sequences are trimmed from FASTQs using fastp [47], alignment to the reference genome is performed with BWA-MEM2 [48], Base Quality Score Recalibration (BQSR) is performed with GATK [49], duplicate reads are marked with sambamba [50] and, finally, a horizontal and vertical coverage analysis is performed with samtools [51]. Variant calling on the BAM files derived from the FASTQs is performed with LoFreq [52], GATK [49], BCFtools [53] and iVar [54] (only the results from LoFreq are shown in the dashboard). Variant calling on the FASTA assemblies is performed with a custom script using Biopython's Needleman-Wunsch global alignment [55]. Further processing of the VCF files adds functional annotations with SnpEff [56], technical annotations with VAFator [14], ConsHMM conservation scores [57] and Pfam protein domains [58]. Pangolin [59] is employed to determine the lineage of every sample. Its input is either the input assembly in FASTA format or the consensus assembly derived from the clonal mutations (i.e., VAF ≥ 0.8) and the reference genome. See Table 1 for more details on the specific settings of each tool. The CoVigator dashboard is also implemented in Python using the visualization framework Dash (version 2.1.0). The computation is distributed on a high-performance computing cluster with Dask (version 2022.9.2), a library that provides advanced parallelism.
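The derivation of the consensus assembly from clonal mutations could be sketched as follows. This is a minimal illustration restricted to single-nucleotide variants; VCF parsing, indel handling, the real reference genome and all function names here are illustrative assumptions, not CoVigator's actual code.

```python
# Sketch: apply clonal SNVs (VAF >= 0.8) to the reference to obtain a
# consensus sequence, e.g. as input for lineage assignment with pangolin.
CLONAL_VAF_THRESHOLD = 0.8

def build_consensus(reference: str, mutations: list[dict]) -> str:
    """Apply clonal single-nucleotide variants to the reference sequence."""
    seq = list(reference)
    for m in mutations:
        if m["vaf"] >= CLONAL_VAF_THRESHOLD:
            pos = m["position"] - 1  # VCF positions are 1-based
            assert seq[pos] == m["ref"], "reference allele mismatch"
            seq[pos] = m["alt"]      # intrahost mutations are skipped
    return "".join(seq)

reference = "ACGTACGT"
mutations = [
    {"position": 2, "ref": "C", "alt": "T", "vaf": 0.95},  # clonal: applied
    {"position": 5, "ref": "A", "alt": "G", "vaf": 0.30},  # intrahost: skipped
]
print(build_consensus(reference, mutations))  # ATGTACGT
```

Only mutations above the clonal threshold alter the consensus; subclonal calls leave the reference base in place.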

System Description
The CoVigator system (Figure 1) has three main components: (1) the knowledge base, (2) the analysis pipeline and (3) the dashboard. For every sample, the knowledge base orchestrates the metadata retrieval, raw data download and finally its analysis through the pipeline for the detection of mutations. Furthermore, it makes all necessary data available through a database (PostgreSQL version 13). Finally, the dashboard presents the data to the end user through a set of interactive visualizations.
CoVigator operates via interaction with external systems: a high-performance computing (HPC) cluster and the ENA and COVID-19 Data Portal Application Programming Interfaces (APIs). Samples may overlap between the two source datasets (raw reads and genomic assemblies), as some data providers upload both data formats, as recommended. The results presented in the dashboard are therefore stratified by dataset.

Knowledge Base
The CoVigator knowledge base collects data from both genomic assemblies and raw reads, orchestrates its processing through the variant calling pipeline and stores all the metadata, raw data and processed results in a relational database.
The data for both datatypes are fetched via the corresponding API hosted by the European Bioinformatics Institute [60]; the metadata are normalized, the FASTQ (raw NGS reads) and FASTA (genomic assemblies) files are downloaded and their MD5 checksums are confirmed to ensure data integrity.
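The integrity check on downloaded files could be sketched as below; this is a minimal illustration, and the function name and chunked-read strategy are assumptions rather than CoVigator internals.

```python
# Sketch of the MD5 integrity check applied after each FASTQ/FASTA download.
import hashlib

def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its MD5 digest to the expected one."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large FASTQ files do not exhaust memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

A file whose computed digest differs from the checksum reported by the API would be flagged as corrupted and re-downloaded.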
Furthermore, the knowledge base iteratively builds a variant co-occurrence matrix (only for the raw read dataset) and precomputes analyses of the data (binned abundance of mutations, dN/dS ratios per gene and domain, top occurring variants, pairwise co-occurrence and counts of variants per lineage, country, sample, mutation type, length and nucleotide substitution) to ensure low-latency responses.

Figure 1. CoVigator system components. The accessor reads external data and stores it in an SQL database. The processor reads the stored data and distributes the processing of every sample in an HPC cluster via Dask. The pipeline processes FASTA and FASTQ data and finally stores the results back in the database (see Figure S1 for a more detailed FASTA and FASTQ processing pipeline). The dashboard reads the results and displays them in a set of interactive plots. The results are also available in raw format.

Analysis Pipeline
The CoVigator pipeline processes FASTQ and FASTA files into annotated and normalized analysis-ready VCF files via two independent workflows (Figures 1 and S1). We implemented the pipeline in the Nextflow framework [61] and manage all dependencies with Conda environments to enable seamless installation. We have embedded the SARS-CoV-2 reference genome ASM985889v3 [5]; with a different reference genome, the pipeline could readily analyze other virus sequences as well.

Dashboard
The dashboard is the user interface to CoVigator. There are two separate views for the raw reads and genomic assembly datasets. Each view provides a set of tabs that allows the user to explore different aspects of the data held in the database. Each tab provides some interactive visualizations, described below. When applicable, the tabs provide a set of filters on the left side. These have been excluded from the screenshots for the purpose of clarity.
The most relevant tabs are described below, and some notable findings are highlighted.

Samples
The samples tab (Figure 2) enables the user to explore the accumulation of samples through time and the evolution of the dN/dS ratio in different genomic regions. Figure 2A shows the accumulation of samples in each country. The dashboard allows the user to select specific countries and/or lineages. Figure 2B,C show the dynamics of the dN/dS ratio through time, over genes and protein domains. The dN/dS ratio aims to estimate the evolutionary pressure on SARS-CoV-2 proteins and domains. This metric, although originally developed for assessing diverging species, is an imperfect but simple estimation of the evolutionary pressure within the same species [62,63], in this case, SARS-CoV-2. There have been recent efforts to develop better alternatives for estimating the evolutionary pressure on SARS-CoV-2 [64]. The traditional interpretation of dN/dS is as follows: dN/dS < 1 indicates purifying selection, dN/dS = 1 indicates neutral evolution and dN/dS > 1 indicates positive selection.
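The counting intuition behind dN/dS can be illustrated with a toy calculation in the Nei-Gojobori style: observed nonsynonymous and synonymous substitutions are normalized by the number of nonsynonymous and synonymous sites in the region. This is an illustrative simplification; real estimators apply multiple-hit corrections that are omitted here, and the numbers below are made up.

```python
# Toy dN/dS: substitution counts normalized by site counts.
def dn_ds(nonsyn_subs: int, syn_subs: int,
          nonsyn_sites: float, syn_sites: float) -> float:
    """Return dN/dS from substitution and site counts."""
    dn = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per site
    ds = syn_subs / syn_sites        # synonymous substitutions per site
    return dn / ds

# 12 nonsynonymous and 4 synonymous substitutions over a region with
# 300 nonsynonymous and 100 synonymous sites: dN/dS = 1.0 (neutral).
print(dn_ds(12, 4, 300.0, 100.0))  # 1.0
```

With the same site counts, fewer nonsynonymous substitutions would push the ratio below 1 (purifying selection), and more would push it above 1 (positive selection).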

Lineages
The lineages tab enables the user to explore the different lineages through time and geography (Figure 3). Both the accumulation of samples in every lineage worldwide (Figure 3A) and the dominant lineage through time (Figure 3B) are shown.


Mutation Statistics
The mutation statistics tab provides insights into the variant calling results on the different datasets and genomic regions (Figure 4). Expected trends in the data can be confirmed in these visualizations.

See Figure S4 for a screenshot including the filters. The median number of SNVs per sample in the raw read dataset is 32, with an interquartile range (IQR) of 30 (Figure 4A). Additionally, the median number of MNVs is two, with an IQR of one. The number of deletions is lower (median: 3, IQR: 2) than the number of SNVs, and the number of insertions is even lower, with only a few samples carrying a single insertion. For the genomic assemblies, the numbers are slightly different, with median SNVs = 44 (IQR: 19), MNVs = 2 (IQR: 0), deletions = 4 (IQR: 3) and, again, just one insertion in a few samples (Figure 4B).
We observe that the base substitution C > T is by far the most frequent, followed by G > T and A > G; TA > T is the most frequent deletion and GG > AA the most frequent MNV in both datasets (Figure 4C,D).
In Figure 4E,F, we confirm that deletions are more frequent than insertions, with insertion-to-deletion ratios of 0.002 and 0.032 for raw reads and genomic assemblies, respectively. We also confirm two previous findings: (1) shorter deletions and insertions are more common than longer ones [65,66] and (2) deletions and insertions not causing a frameshift are overrepresented, as their impact on the resulting protein is more subtle [67]. In the genomic assemblies, we observe a long tail of deletions longer than 8 bp, which is not observed in the raw read results. We suspect this is a technical artefact introduced by our variant calling method. Finally, as shown in Figure 4G,H, we observe that the most frequent mutation effect is a missense variant, followed by a synonymous variant. This is consistent between both datasets.
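The frameshift classification underlying the second observation can be sketched in a few lines: an indel shifts the reading frame exactly when the length difference between the reference and alternate alleles is not a multiple of three. The function name and allele encoding (VCF-style REF/ALT strings) are illustrative assumptions.

```python
# Sketch: classify an indel as frameshift vs in-frame from its REF/ALT alleles.
def is_frameshift(ref: str, alt: str) -> bool:
    """An indel shifts the reading frame when its length is not a multiple of 3."""
    return abs(len(ref) - len(alt)) % 3 != 0

print(is_frameshift("TA", "T"))    # True: 1-bp deletion shifts the frame
print(is_frameshift("ATTT", "A"))  # False: 3-bp deletion keeps the frame
```

In-frame indels remove or add whole codons, so downstream amino acids are preserved, which is consistent with their overrepresentation among observed mutations.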

Recurrent Mutations
The recurrent mutations tab allows the user to explore the most recurrent mutations, by total count of observations through time, within their genomic context (Figure 5). Figure 5A shows the top recurrent mutations and their frequency and counts through time. The size of the table can be parameterized for up to 100 mutations. For instance, the user can explore the most recurrent mutations in the whole genome, a given gene or a given protein domain. Furthermore, the period in which the monthly counts are shown can be parameterized. The gene viewer (Figure 5B) has multiple tracks: (i) a scatter plot with the relevant mutations and their frequencies in the virus population, (ii) ConsHMM conservation tracks and (iii) gene and Pfam protein domains. The table in Figure S5 shows, in the mutation counts between April and July 2021, the decline of the Alpha lineage and the rise of the Delta lineage.
Additionally, the mutation statistics tab provides a co-occurrence analysis that points to clusters of co-occurring mutations and their correspondence with virus lineages; in the case of mutations shared between lineages, these clusters may contain a mixture of different but related lineages. Due to performance limitations, this analysis is only available for the raw read dataset and at the gene level. Figure S6 shows the Jaccard index co-occurrence matrix for the spike protein; its clustering results, annotated with SARS-CoV-2 lineages, are given in Table S1.
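The pairwise Jaccard index used in such a co-occurrence matrix can be sketched as follows: for two mutations, it is the number of samples carrying both divided by the number of samples carrying either. The data layout (a set of sample identifiers per mutation) is an illustrative assumption.

```python
# Sketch: Jaccard index between the sample sets of two mutations.
def jaccard(samples_a: set[str], samples_b: set[str]) -> float:
    """Fraction of samples carrying both mutations among samples carrying either."""
    union = samples_a | samples_b
    if not union:
        return 0.0  # neither mutation observed anywhere
    return len(samples_a & samples_b) / len(union)

mut1 = {"s1", "s2", "s3"}
mut2 = {"s2", "s3", "s4"}
print(jaccard(mut1, mut2))  # 0.5 (2 shared samples out of 4 total)
```

Mutations defining the same lineage tend to appear in the same samples and thus have a Jaccard index near 1, which is what makes lineage clusters visible in the matrix.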

Clonal and Intrahost Mutations in the Raw Read Dataset
The FASTQ files provide the pile-up of reads across the genome, which gives detailed information about the called variants. In particular, we can count the number of reads supporting each variant, which allows us to identify subclonal variants supported by only a fraction of the reads. These variants likely emerged within the host and are referred to as intrahost variants. The identification of intrahost variants is not possible on the genomic assemblies.
We consider mutations with a VAF greater than or equal to 80% as high-confidence clonal mutations, and those with a VAF greater than or equal to 50% and lower than 80% as low-confidence clonal mutations. Only high-confidence clonal mutations are used to determine a consensus sequence and assign a SARS-CoV-2 lineage (Figure 6).
The remaining dataset of mutations poses a different technical challenge due to the difficulty of separating true low-VAF mutations from noise. We first consider mutations with a VAF below 50% as raw candidate intrahost mutations.
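The VAF-based classification described above can be summarized in a short sketch; the category names are illustrative labels, not CoVigator's internal identifiers.

```python
# Sketch of the VAF thresholds: >= 0.8 high-confidence clonal,
# 0.5 <= VAF < 0.8 low-confidence clonal, < 0.5 candidate intrahost.
def classify_by_vaf(vaf: float) -> str:
    """Classify a mutation by its variant allele frequency (VAF)."""
    if vaf >= 0.8:
        return "clonal_high_confidence"
    if vaf >= 0.5:
        return "clonal_low_confidence"
    return "intrahost_candidate"

print(classify_by_vaf(0.95))  # clonal_high_confidence
print(classify_by_vaf(0.60))  # clonal_low_confidence
print(classify_by_vaf(0.10))  # intrahost_candidate
```

Only the first category feeds the consensus sequence and lineage assignment; the last category still needs the additional filtering discussed below to separate true intrahost mutations from noise.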
We observed a large number of low-frequency mutations among SARS-CoV-2 genomes. In order to establish a high-quality set of intrahost mutations for studying viral evolution, we screened and compared the literature on SARS-CoV-2 intrahost mutations for different filtering approaches and implemented a conservative approach (Table 2).

Table 2. Published and implemented filtering approaches for intrahost variants.

Conclusions
The persistently increasing amount of publicly available SARS-CoV-2 sequencing data calls for robust platforms that allow constant monitoring of genomic SARS-CoV-2 variants in heterogeneous datasets. Our CoVigator pipeline covers the essential steps of preparing the data and calling variants from SARS-CoV-2 raw sequencing data from ENA and genome assemblies from the COVID-19 Data Portal. The pipeline is integrated within the CoVigator knowledge base, which orchestrates the download, processing and storage of the underlying samples and results. The CoVigator dashboard provides different visualizations and features for selecting clonal variants across all genes of the SARS-CoV-2 genome in a selected period. The dashboard also provides a comprehensive analysis of intrahost variants observed in the raw read dataset. To this end, we propose a conservative approach based on filtering both samples and mutations. The dataset of intrahost mutations derived from public data that we make available through CoVigator is, to the best of our knowledge, the largest published dataset of SARS-CoV-2 intrahost mutations. The main strength of CoVigator is the combination of a software pipeline with a dashboard, which ensures both the processing of the data and its interpretation. Uniquely, CoVigator processes both genome assemblies and raw sequencing data, making it open-data-friendly and allowing it to be adapted to other SARS-CoV-2 data sources. A brief comparison of the important features of CoVigator with other pipelines is given in Table S2.
The identification of mutations over such heterogeneous datasets, obtained with different sequencing protocols, is challenging. With CoVigator, we observed VAF dilution on mutations identified via targeted amplicon sequencing with overlapping primers, genome edge effects and read edge effects. We aim to address these challenges in the future, e.g., by inferring the primers used in an arbitrary sample. Additionally, we implemented a simplistic phasing of clonal mutations occurring in the same amino acid to ensure their correct annotation. However, we identified the need for a phasing method for low-VAF mutations that existing germline phasing tools do not cover. CoVigator is currently limited to processing Illumina sequencing data, while the majority of SARS-CoV-2 sequencing projects (e.g., the ARTIC network) and pipelines use Oxford Nanopore sequencing. SARS-CoV-2 Nanopore data processing will be implemented in subsequent releases of CoVigator.
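The core check behind the simplistic same-codon phasing mentioned above can be illustrated as follows: two SNVs belong to the same codon when their CDS-relative positions fall in the same group of three, in which case they should be annotated together as an MNV. The coordinates and function name are illustrative assumptions, not CoVigator's actual implementation.

```python
# Sketch: detect whether two SNVs fall in the same codon of a coding sequence,
# so they can be merged into one MNV for correct amino acid annotation.
def same_codon(pos_a: int, pos_b: int, cds_start: int) -> bool:
    """True if two 1-based genomic positions share a codon of a CDS."""
    return (pos_a - cds_start) // 3 == (pos_b - cds_start) // 3

# For a hypothetical CDS starting at position 21:
print(same_codon(21, 23, cds_start=21))  # True: both in the first codon
print(same_codon(23, 24, cds_start=21))  # False: adjacent but different codons
```

Annotating the two SNVs separately could report two amino acid changes where the phased MNV actually encodes a single, different substitution, which is why the merge matters.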
Future versions of CoVigator can be broadened to other use cases, such as other infectious organisms or co-existing infections during the pandemic (see supplementary methods for further details [69,70]). Additionally, we envision the annotation of all possible mutations before their observation to potentially improve preparation for future variants of concern.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/v15061391/s1. Figure S1: workflow of the CoVigator pipeline; Figure S2: screenshot of the lineages tab in the ENA dataset; Figure S3: screenshot of the mutation statistics tab in the ENA dataset; Figure S4: recurrent mutations tab for the spike protein in the ENA dataset; Figure S5: top 30 mutations in the spike protein from the COVID-19 Data Portal; Figure S6: screenshot of the intrahost mutations tab; Figure S7: Jaccard index co-occurrence matrix on the spike protein; Table S1: co-occurrence clusters on the spike protein with their matching lineage information; Table S2: comparison of SARS-CoV-2 data processing pipelines with CoVigator. Supplementary methods for the co-occurrence analysis and for the CoVigator extension to other viruses.

Funding: BioNTech SE (Mainz, Germany) supports the study. The funder provided support in the form of a salary for author U.S., but did not have any additional role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. The specific roles of this author are articulated in the 'Author Contributions' section. In addition, the other authors are employees of the non-profit company TRON gGmbH and are supported in the form of salaries. TRON gGmbH did not have any additional role in the study design, data collection and analysis, decision to publish or preparation of the manuscript. Intel is committed to accelerating access to technology that can combat the current pandemic and enabling scientific discovery that better prepares our world for future crises. This solution was funded in part by Intel's Pandemic Response Technology Initiative. For more information about healthcare solutions from Intel, visit intel.com/healthcare.
For more information about Intel's COVID-19 response, visit https://www.intel.com/content/www/ us/en/corporate-responsibility/covid-19-response.html, accessed on 16 June 2023.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.