Rapid design of a bait capture platform for culture- and amplification-free next-generation sequencing of SARS-CoV-2

SARS-CoV-2 is a novel betacoronavirus and the aetiological agent of the current COVID-19 outbreak that originated in Hubei Province, China. While polymerase chain reaction is the front-line tool for SARS-CoV-2 surveillance, application of amplification-free and culture-free methods for isolation of SARS-CoV-2 RNA, partnered with next-generation sequencing, would provide a useful tool for both surveillance and research of SARS-CoV-2. We here release into the public domain a set of bait capture hybridization probe sequences for enrichment of SARS-CoV-2 RNA from complex biological samples. These probe sequences have been designed using rigorous bioinformatics methods to provide sensitivity, accuracy, and minimal off-target hybridization. Probe design was based on existing, validated approaches for detecting antimicrobial resistance genes in complex samples and it is our hope that this SARS-CoV-2 bait capture platform, once validated by those with samples in hand, will be of aid in combating the current outbreak.


Introduction
In late 2019, a novel coronavirus subsequently named "SARS-CoV-2" (synonymous with "2019 novel coronavirus" and "2019-nCoV") was identified as the aetiological agent of an outbreak of febrile respiratory illness, COVID-19, possibly associated with the Huanan South China Seafood Market in Wuhan, Hubei Province, China (1,2). Since the initial outbreak, clinical symptoms have ranged from mild illness to severe pneumonia, with the disease spreading through human-to-human transmission (1). Coronaviruses (CoVs) infect humans and other animals and are large (120-160 nm), roughly spherical, enveloped viruses, which carry a non-segmented positive-sense [...] that forms a clade within the subgenus Sarbecovirus (5). As SARS-CoV-2 is 89% identical to two bat SARS-like CoVs (ZC45 and ZC21), it is presumed that SARS-CoV-2 is another zoonotic emergence, with the intermediate host currently unknown (6). However, as SARS-CoV-2 is an RNA virus, mutation rates are high (approximately 10^-4 nucleotide substitutions per site per year) and whole-genome surveillance is critical for proper molecular epidemiology (i.e., tracking the outbreak) (6). To perform phylogenetic (forensic) analysis, the virus first needs to be isolated from a sample, but as viremia levels can vary dramatically depending on the stage of infection, sequencing the virus from complex metagenomic pools can be expensive and time consuming. Currently the method of choice is bronchoalveolar lavage with cell culture, in either human airway epithelial cells, Huh7, or VeroE6, which requires both time and expertise (2,6).
To minimize the risk of handling infectious viral cultures, preserve critical sample volumes, reduce labour and turn-around time, perform quick phylogenetic analysis, and enable new scientific enquiry, a reliable method for both culture- and amplification-free enrichment of SARS-CoV-2 RNA from complex samples containing a mixture of host and bacterial DNA and RNA would be of value for next-generation sequencing (NGS) workflows. In recent years, hybridization bait capture methods for enrichment of viral targets in metagenomics samples have been designed and validated for a range of applications (7-11). Our own team members have used bait capture to isolate and sequence the aetiological agent of the 'Black Death', Yersinia pestis (12), and have recently designed and validated an accurate and sensitive enrichment platform for NGS detection of antimicrobial resistance genes in complex biological samples based on our Comprehensive Antibiotic Resistance Database (13,14). In the last year, we have been combining our expertise in [...] outbreak and associated illness and mortality of COVID-19, we have opted to release our SARS-CoV-2 bait capture platform without experimental validation, in part relying on past success of similar bait capture design approaches and in part hopeful that it can be rapidly validated by those with samples in hand. We hope our platform proves useful for those combatting and researching the current SARS-CoV-2 outbreak.

Design
Using BaitsTools (v1.6.2) software (15), we designed 80-nucleotide (nt) hybridization probes by tiling with an offset of 20 nt across all SARS-CoV-2 sequences available at the National Center for Biotechnology Information (NCBI) prior to February 5, 2020 (n=37; complete genomes = 21, partial coding sequences = 16). Built-in BaitsTools functions were used to remove probe sequences that were incomplete or contained incorrect nucleotides. We also used BaitsTools to remove probes with GC content <25% or >55%, leaving 30,853 probes. Melting temperature (Tm) was predicted using the OligoArrayAux function melt.pl (settings, -n RNA -t 65 -C 1.89e-9) and used to remove probes with a Tm <55°C or >105°C (16). To prevent off-target hybridization between the probes and any non-viral DNA or RNA, the candidate set of probes was compared against GenBank's nucleotide database using high-throughput BLASTN (default settings) (13,17). Probes with high-scoring segment pairs (HSPs) >50 nt and high sequence similarity (>80%) to non-viral targets were discarded. Finally, the candidate set of hybridization probes was compared against itself through BLASTN analysis and a custom filter was applied to reduce the number of redundant probes sharing overlapping sequence space (discarding one member of non-identical pairs with >60 nt alignment and >80% sequence similarity), resulting in a candidate list of 1,310 bait capture hybridization probes for SARS-CoV-2.
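The tiling and GC-content steps described above can be illustrated with a minimal Python sketch. This is not the BaitsTools implementation; the function names (tile_probes, gc_fraction, passes_filters) are hypothetical and chosen only for illustration. It assumes that "incomplete" means shorter than the full probe length and that "incorrect nucleotides" means any character other than A, C, G, or T. The Tm, BLASTN, and redundancy filters are not reproduced here.

```python
def tile_probes(sequence, probe_len=80, offset=20):
    """Yield (start, probe) pairs for full-length probes tiled across a sequence.

    Only windows of exactly probe_len are produced, so a trailing partial
    window (an "incomplete" probe, by assumption) is never emitted.
    """
    seq = sequence.upper()
    for start in range(0, len(seq) - probe_len + 1, offset):
        yield start, seq[start:start + probe_len]


def gc_fraction(probe):
    """Return the fraction of G and C bases in a probe."""
    return (probe.count("G") + probe.count("C")) / len(probe)


def passes_filters(probe, gc_min=0.25, gc_max=0.55):
    """Keep probes made only of A/C/G/T with GC content in [gc_min, gc_max].

    The bounds mirror the text: probes with GC <25% or >55% are removed.
    """
    if set(probe) - set("ACGT"):  # ambiguity codes such as N are rejected
        return False
    return gc_min <= gc_fraction(probe) <= gc_max


if __name__ == "__main__":
    # Toy 120-nt sequence, used only to show the call pattern.
    toy = "ACGT" * 30
    kept = [(start, p) for start, p in tile_probes(toy) if passes_filters(p)]
    for start, probe in kept:
        print(start, len(probe), round(gc_fraction(probe), 2))
```

With an 80 nt probe and a 20 nt offset, each interior base of a target sequence is covered by four overlapping candidate probes before any filtering, which is why later filtering steps can remove many probes without necessarily opening gaps in coverage.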

Assessment
To predict the efficacy of capture by the candidate probes, a Bowtie2 alignment (settings, bowtie2 --end-to-end -N 1 '-L 32' -a) (18) was performed to align the set of 1,310 probes to 21 complete SARS-CoV-2 genome sequences, with the resulting alignment file analyzed using SAMtools (19). Statistics to determine the number of instances that a probe mapped to a section of a SARS-CoV-2 genome, the length coverage of probes across known genomes, and the depth of coverage by probes for each genome were calculated by adapting the Next Generation Sequencing Capture Assessment Tool (ngsCAT) in Python 3.6.8 (20). The GC content for each probe was calculated using the GCcontent.py Python3 script available at https://gist.github.com/wdecoster and the melting temperature was predicted using melt.pl, as described above. An individual Bowtie2 alignment was performed comparing the candidate probe set against a single SARS-CoV-2 genome (www.ncbi.nlm.nih.gov/nuccore/MN908947.3), analyzed using SAMtools, and visualized using JBrowse (21). Plots were generated using R.
Our probe set consistently displayed 3.4x coverage across all genomes (Figure 5). A drop in coverage occurs at approximately the 15 kb region of ORF1ab, but ~3x coverage is retained.
Candidate probes were visualized against one SARS-CoV-2 genome, MN908947.3, highlighting uniform coverage across the genome and visualizing the regions where drops in coverage reside.
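The depth and length (breadth) of coverage statistics described above can be sketched in a few lines of Python. This is an illustrative stand-in, not ngsCAT or SAMtools; the function names (depth_profile, summarize) are hypothetical, and the input is assumed to be a list of 0-based, half-open (start, end) intervals giving where each probe aligned on one genome.

```python
def depth_profile(intervals, genome_len):
    """Return per-base depth from 0-based, half-open (start, end) intervals.

    A difference array is used: +1 where an alignment begins, -1 just past
    where it ends, followed by a running sum over genome positions.
    """
    diff = [0] * (genome_len + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for delta in diff[:genome_len]:
        running += delta
        depth.append(running)
    return depth


def summarize(depth):
    """Summarize a depth profile as mean depth, breadth, and minimum depth."""
    covered = sum(1 for d in depth if d > 0)
    return {
        "mean_depth": sum(depth) / len(depth),
        "breadth": covered / len(depth),  # fraction of bases with depth >= 1
        "min_depth": min(depth),
    }


if __name__ == "__main__":
    # Toy example: seven 80-nt probes tiled every 20 nt on a 200-nt target.
    toy_intervals = [(s, s + 80) for s in range(0, 121, 20)]
    print(summarize(depth_profile(toy_intervals, 200)))
```

As a rough consistency check on the reported figure, 1,310 probes of 80 nt give 104,800 nt of probe sequence; if every probe aligned exactly once over its full length to the 29,903 nt MN908947.3 genome, the mean depth would be about 3.5x, in line with the ~3.4x coverage reported above.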

Discussion
The growing SARS-CoV-2 outbreak will require continued sequencing efforts to collect information pertaining to this novel coronavirus. At present, sequencing SARS-CoV-2 requires a culturing step that increases the risk to laboratory staff and makes analysis of SARS-CoV-2 difficult in resource-constrained countries without the facilities to culture the virus.
Bait capture of SARS-CoV-2 followed by NGS using small Nanopore sequencers provides a simpler alternative (24). Targeted capture enriches SARS-CoV-2 RNA through subtraction, physically separating target RNA from a complex patient sample, so that most sequenced fragments are on-target and the required metagenomic sequencing effort is reduced. This lower sequencing volume in turn reduces overall turn-around time and cost.
While probe sets for targeted enrichment of a broad range of viruses already exist (7-10), we propose a probe set designed specifically in response to the current SARS-CoV-2 outbreak.
For probe sets based on a broad diversity of viruses there is concern for off-target hybridization, resulting in off-target enrichment and added time and cost to the workflow (7,8). Probe design improvements in detecting rare DNA pertaining to antimicrobial resistance highlight the importance of considering off-target hybridization and balancing length and depth of coverage when designing probes for specific targets (13). Our probe design aims to maximize specificity and sensitivity to SARS-CoV-2 by removing candidate probes that could hybridize to human or other eukaryotic, bacterial, or archaeal RNA or DNA. Yet, in silico alignment tools do not accurately reflect hybridization in solution and it is entirely possible that our proposed probe set may cross-hybridize with other coronaviruses, such as SARS-CoV and MERS-CoV, or may have unanticipated off-target hybridization with unrelated viruses, bacterial members of the microbiome, or mammalian host DNA or RNA. While co-infection with multiple coronaviruses is rare, possible off-target hybridization with non-coronavirus DNA or RNA is an important concern (3). While our design reflects optimal methods, validated for bait capture of antimicrobial resistance (AMR) genes (13), experimental validation of our proposed SARS-CoV-2 bait capture platform is required. To this aim, we note that for our antimicrobial resistance gene bait capture platform we had 80 nt biotinylated ssRNA probes synthesized by Arbor Biosciences (Ann Arbor, MI) using the custom myBaits kit and that our protocol for capture of AMR DNA is described