COVID-19 Evolves in Human Hosts

Today, we are all threatened by an unprecedented pandemic: COVID-19. How different is it from other coronaviruses? Will it become attenuated or more virulent? Which animals may be its original host? In this study, we collected and analyzed nearly thirty thousand publicly available complete genome sequences of the COVID-19 virus from 79 different countries, together with the previously known flu-causing coronaviruses (HCov-229E, HCov-OC43, HCov-NL63 and HCov-HKU1) and the lethal pathogenic viruses SARS, MERS, Victoria, Lassa, Yamagata, Ebola, and Dengue. We found strong similarities between the currently circulating COVID-19 and SARS and MERS, as well as COVID-19 in rhinolophines and pangolins. In contrast, COVID-19 shares little similarity with the flu-causing coronaviruses and the other known viruses. Strikingly, we observed that the divergence of COVID-19 strains isolated from human hosts steadily increased from December 2019 to May 2020, suggesting that COVID-19 is actively evolving in human hosts. In this paper, we propose a novel MLCS algorithm, HA-MLCS¹, for big sequence data analysis (sequence length over $10^3$), which can calculate the common model of COVID-19 complete genome sequences and provide important information for vaccine and antibody development. Geographic and time-course analysis of the evolution trees of human COVID-19 reveals possible evolutionary paths among strains from 79 countries. These findings have important implications for the management of COVID-19 and the development of vaccines and medications.
∗ Both authors contributed equally to this research.
¹ The source code of HA-MLCS is available at: https://github.com/HA-MLCS/HA-MLCS
KDD ’20 Health Day: AI for COVID-19, August 23–27, 2020, CA, USA. © 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.


INTRODUCTION
Since it was first reported in December 2019, the severe infectious pneumonia caused by the new COVID-19 virus has spread widely from Wuhan City, across China, and to 188 countries. On March 11, 2020, the WHO declared the COVID-19 outbreak a pandemic, the first of its kind since the 2009 Swine Flu. Internationally, as of May 29, 2020, the outbreak of COVID-19 has resulted in more than 5,851,494 cases and 361,270 deaths. COVID-19 is currently the biggest health, economic and survival threat to the entire human race. We urgently need to understand this virus, find treatments and develop vaccines to combat it.
One challenge in developing effective antibodies and vaccines for COVID-19 is that we do not yet understand this virus. How far away is it from other coronaviruses? Has it undergone any changes since its first discovery? These questions are critical for us to find cures and design effective vaccines and medications, and critical for managing this virus. The study of COVID-19 began only recently [1][2][3][4][5][6]. So far, pioneering studies related to the virus have been limited to a few complete genome sequences and a few related viruses [7]. One study used six COVID-19 sequences from patients in Wuhan and compared them with those of SARS and MERS [8]. Another two studies used nine and five sequences respectively, and found that COVID-19 is similar to SARS [9,10]. Recent work [11] studied the emergence of genomic diversity and recurrent mutations in COVID-19 by using 7666 public genome assemblies. These pioneering efforts laid the foundation for our work.
In this paper, we collected nearly thirty thousand complete genome sequences, covering 29,305 genomes isolated from COVID-19 in human hosts from 79 countries, 21 genomes from animals and the environment (outside the human body), 101 genomes from the previously known flu-causing coronaviruses, and 61 genomes from seven potentially lethal pathogenic viruses: SARS, MERS, Victoria, Lassa, Yamagata, Ebola, and Dengue. This collection allows us to analyze the evolution and diversity of COVID-19 in depth. Note that, in this paper, all computations and analyses are done using only the collected COVID-19 complete genome sequences (COVID-19 sequences/strains/viruses for short).
In this paper, we report strong similarity between the currently circulating COVID-19 and the SARS virus, as well as strong similarities with COVID-19 in rhinolophines (especially with two strains) and in pangolins. In contrast, COVID-19 shares only a moderate sequence similarity with the flu-causing coronaviruses, despite reported similar symptoms. Strikingly, we observed that the divergence of COVID-19 strains isolated from human hosts steadily increased from December 2019 to May 2020, suggesting that COVID-19 is now actively evolving in human hosts. This may explain the differences in death rates across areas, as the virus might have evolved into strains of different lethality. Importantly, we propose a novel MLCS algorithm for big sequence analysis, which can calculate the common model (common subsequences) of COVID-19 sequences and provide important information for future studies of vaccine and antibody design. Evolutionary analysis of human COVID-19 from 79 countries reveals the following important findings. As early as December 2019, the COVID-19 virus was widespread in many countries and regions. It is particularly worth noting that, among the 15 countries with the most severe epidemics, the genome sequences of all except Russia and Spain almost never reside in the first generation of the evolution tree built from the 79 countries' sequences, which is of great significance to the traceability of COVID-19. Moreover, the other findings of the big sequence analysis in this paper may also provide important information for understanding and managing COVID-19 and for developing vaccines and medications for the virus in the near future.
The rest of this paper is organized as follows. Section 2 discusses our proposed novel MLCS (Multiple Longest Common Subsequence) algorithm NP-MLCS for COVID-19 big sequence data similarity analysis. The big data analysis results for COVID-19 strains are reported in Section 3. Section 4 concludes the paper.

A MLCS ALGORITHM NP-MLCS

2.1 Preliminaries
MLCS Problem. We define a subsequence of a given sequence over a finite alphabet $\Sigma$ as a sequence obtained by deleting zero or more (not necessarily consecutive) characters from the given sequence. Let $X = x_1 x_2 \ldots x_n$ and $Y = y_1 y_2 \ldots y_m$ be two sequences with lengths $n$ and $m$, respectively, over a finite alphabet $\Sigma$, i.e., $x_i, y_j \in \Sigma$. The goal of the Longest Common Subsequence (LCS) mining problem is to find all longest common subsequences of $X$ and $Y$. Similarly, the goal of the MLCS problem is to find all longest common subsequences of $d$ ($d \geq 3$) sequences of equal or different lengths. LCS is a special case of MLCS.
The MLCS problem is a classical NP-hard problem [12], which is related to the identification of sequences similarity and to the common model extraction between sequences. It has many important applications in bioinformatics, computational genomics, pattern recognition, etc. Based on the adopted method, existing MLCS algorithms can be classified into two categories: dynamic programming based and dominant-point based exact or approximate algorithms.
(1) Dynamic Programming Algorithms. Given two sequences $X = x_1 x_2 \ldots x_n$ and $Y = y_1 y_2 \ldots y_m$ with lengths $n$ and $m$, respectively, over a finite alphabet $\Sigma$ with $X[i] = x_i$, $Y[j] = y_j$, $x_i, y_j \in \Sigma$, $1 \leq i \leq n$ and $1 \leq j \leq m$, a dynamic programming algorithm iteratively constructs an $(n + 1) \cdot (m + 1)$ score matrix $L$, where $L[i, j]$ is the length of an LCS between the two prefixes $X' = x_1 \ldots x_i$ and $Y' = y_1 \ldots y_j$ of $X$ and $Y$. Once the score matrix $L$ is calculated, all the LCSs can be obtained by tracing back from the end element $L[n, m]$ to the starting element $L[0, 0]$. Both the time and space complexities of this algorithm are $O(mn)$. Given $d$ ($d \geq 3$) sequences with equal or unequal lengths, the matrix $L$ can be extended to $d$ dimensions for the MLCS problem, in which the element $L[i_1, i_2, \ldots, i_d]$ can be calculated in a similar way. Both the time and space complexities are $O(n^d)$ [13].
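The two-sequence case can be sketched in a few lines of Python. This is only a minimal illustration of the textbook dynamic programming scheme described above, not the implementation evaluated in this paper; the function names `lcs_table` and `all_lcs` are hypothetical, and the recursive trace-back is only practical for short sequences.

```python
from functools import lru_cache


def lcs_table(x: str, y: str) -> list[list[int]]:
    """Score matrix L where L[i][j] is the LCS length of x[:i] and y[:j]."""
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L


def all_lcs(x: str, y: str) -> set[str]:
    """Trace back from L[n][m] towards L[0][0], collecting every distinct LCS."""
    L = lcs_table(x, y)

    @lru_cache(maxsize=None)
    def back(i: int, j: int) -> frozenset:
        if i == 0 or j == 0:
            return frozenset({""})
        if x[i - 1] == y[j - 1]:
            # Every LCS of the two prefixes ends with this shared character.
            return frozenset(s + x[i - 1] for s in back(i - 1, j - 1))
        found = set()
        if L[i - 1][j] == L[i][j]:
            found |= back(i - 1, j)
        if L[i][j - 1] == L[i][j]:
            found |= back(i, j - 1)
        return frozenset(found)

    return set(back(len(x), len(y)))


if __name__ == "__main__":
    print(lcs_table("ABCBDAB", "BDCABA")[-1][-1])
    print(sorted(all_lcs("ABCBDAB", "BDCABA")))
```

The table alone costs $O(mn)$ time and space, which is why extending it to $d$ dimensions quickly becomes infeasible for long sequences.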
(2) Dominant-point Based Algorithms. The dominant-point based algorithms are motivated by the observation that most of the cells in the score matrix of the input sequences are useless and do not need to be computed. Only a very small subset of the cells, called dominants (see Def.1 in Sec. 2.3), should be computed and stored. A dominant-point based MLCS algorithm consists of two steps [14,16]: 1) constructing a directed acyclic graph, called MLCS-DAG, which consists of all MLCSs of input sequences; 2) computing all of the MLCSs of the sequences based on the MLCS-DAG.
Although many studies [13,16,17] have shown that dominant-point based MLCS algorithms are much faster than the classical dynamic programming based algorithms, theoretical analysis and statistical experiments [18,19] reveal that the current mainstream dominant-point based MLCS algorithms are hard to apply to big sequence data (sequence length of at least $10^3$). The reason is a serious weakness of their MLCS-DAG, which contains a massive number of redundant points and leads to an exponential explosion of memory and computation for large-scale or long sequences.

Related Work
Considering the space-time cost, approximate MLCS algorithms are usually designed for mining MLCSs of long and/or large-scale sequences, namely big sequences. As we aim to propose a high precision and efficient approximation MLCS algorithm in this paper, we only review existing representative approximation algorithms.
Existing approximate MLCS algorithms can be divided into two categories: with or without a guaranteed performance ratio, the ratio of MLCS length (i.e., |MLCS|) of an approximate solution to that of the optimal one. Algorithms such as LR, ExpA, and BNMAS belong to the first category [17]. They all provide a guaranteed performance ratio of 1/|Σ|, where |Σ| is the size of the sequence's alphabet Σ. Although interesting theoretically, they are not very useful in practice because the performance ratio of 1/|Σ| is too small, e.g., 1/4 for DNA sequences. Algorithms in the second category usually use heuristic or probabilistic search techniques to achieve a good performance. For example, Shyu and Tsai [20] used ant colony optimization to find approximate solutions. Wang et al. [21] proposed a heuristic greedy search algorithm MLCS-APP, and Pro-MLCS [17] adopted an iterative best first search strategy to progressively output better and better solutions. Yang et al. [22] presented two space-efficient approximate MLCS algorithms, SA-MLCS and SLA-MLCS with an iterative beam widening search strategy to reduce the space usage during the iterative calculating process. Experiments show that SA-MLCS and SLA-MLCS can solve an order of magnitude larger size instances than the state-of-the-art approximate algorithm Pro-MLCS. Although this second class of algorithms claims that optimal solutions can be found, the quality of the solutions is difficult to evaluate as there are no exact baseline algorithms for comparison.
From this review of the existing representative approximate algorithms, we make the following observations: 1) the precision of these algorithms is too low to meet practical needs; 2) these algorithms are hard to apply to big sequence data due to the weakness of their underlying dominant-point based methods; 3) despite great efforts, no approximate MLCS algorithm can tackle MLCS mining on big sequence data efficiently and effectively. The proposed novel MLCS algorithm aims to address all of these issues. In what follows, we detail the main procedures of our novel approximate MLCS algorithm and its underlying theory.

A Novel Approximate MLCS Algorithm
Constructing Successor Tables (ST). To obtain all the immediate successors of a dominant from the sequence set $T$ in $O(1)$ time, we design a new data structure, called the successor tables $ST$ of $T$. The construction and operation of $ST$ are detailed in Appendix A.
Constructing optimized MLCS-DAG. To overcome the weaknesses of the MLCS-DAG of the existing dominant-point based algorithms, we construct an optimized MLCS-DAG, called MLCS-ODAG, with a minimum number of non-critical points (points not contributing to the MLCSs of the sequence set $T$). To this end, we construct the MLCS-ODAG with the following procedure:
1) Two dummy $d$-dimensional points $(0, 0, \ldots, 0)$ (the source point) and $(\infty, \infty, \ldots, \infty)$ (the sink point) are first introduced into the MLCS-ODAG for the $d$ sequences, with all the other points in the MLCS-ODAG being successors of $(0, 0, \ldots, 0)$ and predecessors of $(\infty, \infty, \ldots, \infty)$. Let $k = 0$ and $D^k = \{(0, 0, \ldots, 0)\}$.
2) If $D^k = \emptyset$, go to 6); otherwise, for each point $p$ in $D^k$, calculate all its immediate successors using the successor tables $ST$ of $T$ and add a directed edge from $p$ to each of its successors. If point $p$ has no successor, a directed edge from $p$ to the sink point is added. All of the successors of the points from $D^k$ constitute an initial $(k + 1)$-th level point set, denoted as $D^{k+1}$.
3) To eliminate the many redundant points (repeated and non-critical points) possibly residing in $D^{k+1}$, a retention strategy is employed: only the best points (key points for short), which are most likely to contribute to the MLCSs of $T$, are retained. To this end, all the points from $D^{k+1}$ are sorted by the best non-dominated sorting method in [23] to obtain its first frontier set, denoted as $(D^{k+1})_1$. That is, for every $p \in (D^{k+1})_1$ there is no other point $p' \in D^{k+1}$ that dominates $p$. All points outside $(D^{k+1})_1$ are deleted from the set.
4) Through extensive experiments, we find that many non-critical points still reside in $(D^{k+1})_1$, although a large number of redundant points have been deleted in the above step. To eliminate the remaining redundant points, all the points in $(D^{k+1})_1$ are further evaluated by Eq. 1. Since points with higher scores probably have little or no contribution to the MLCSs of $T$, we keep only the top-ranked points with the minimum values in $(D^{k+1})_1$ (how many are kept per level is a user-customized parameter) and delete all the others. It is important to note that the deleted points may include key points, so this strategy makes our algorithm an approximate algorithm.
5) Set the next level to the retained set, i.e., $D^{k+1} = (D^{k+1})_1$, let $k = k + 1$, and go to 2).
6) End the construction of the MLCS-ODAG.
With the above steps, an optimized MLCS-ODAG of the sequences with as few redundant points as possible is constructed by the forward iteration $D^k \to D^{k+1}$. An example of constructing the MLCS-ODAG of 3 sequences is shown in Fig. 1.
where $\max(p)$ is the maximum value over all dimensions of the point $p$. The lower the Eq. 1 score of $p$, the greater the likelihood that $p$ will contribute to the MLCSs of the input sequences, and vice versa. The property of the proposed empirical scoring function has been proved in [14,19], and it works well in our experiments.
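The retention strategy of steps 3) and 4) can be illustrated with a small Python sketch. Several things here are assumptions rather than the paper's method: dominance is taken in the usual dominant-point sense (a point dominates another if it is no larger in every dimension and differs from it), the frontier is found by a simple quadratic scan instead of the best non-dominated sorting of [23], and the scoring function is passed in by the caller, with `max` used in the example purely as a hypothetical stand-in for Eq. 1. The names `dominates`, `first_frontier`, `retain` and `keep` are illustrative.

```python
from typing import Callable, Iterable

Point = tuple[int, ...]


def dominates(p: Point, q: Point) -> bool:
    """Assumed dominance: p <= q in every dimension and p != q."""
    return p != q and all(a <= b for a, b in zip(p, q))


def first_frontier(points: Iterable[Point]) -> list[Point]:
    """Drop repeated points, then keep those dominated by no other point."""
    unique = sorted(set(points))
    return [q for q in unique if not any(dominates(p, q) for p in unique)]


def retain(points: Iterable[Point],
           score: Callable[[Point], float],
           keep: int) -> list[Point]:
    """Keep the `keep` frontier points with the lowest score (step 4)."""
    frontier = first_frontier(points)
    return sorted(frontier, key=score)[:keep]


if __name__ == "__main__":
    level = [(2, 3, 4), (2, 3, 4), (5, 1, 9), (4, 4, 4), (3, 3, 5)]
    print(first_frontier(level))
    # `max` is only a placeholder score, not the function of Eq. 1.
    print(retain(level, score=max, keep=1))
```

Because `retain` may discard frontier points that lie on a longest path, a construction built on it is approximate, which matches the remark at the end of step 4).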
Mining all of the MLCSs. Given the constructed MLCS-ODAG, we need to design an efficient and effective strategy to extract all MLCSs from it. We start by reviewing the following concepts from the graph theory.
Definition 4: For a directed acyclic graph $G = \langle V, \preceq \rangle$, topological sorting finds a total order of the vertices in $V$ that is consistent with the partial order $\preceq$ [24].
Definition 5: A topological sorting algorithm [24] iteratively performs the following two steps until all vertices in $V$ have been traversed and processed: 1) outputting the vertices with in-degree 0; 2) deleting the edges connected to these vertices.
Inspired by the topological sorting algorithm and investigating our constructed MLCS-ODAG, we found the following important fact.
Theorem 1: For any point residing on a longest path corresponding to an MLCS (called a key point, denoted by $kp$), the sum of its forward level $fl(kp)$ (counted from the source $(0, 0, \ldots, 0)$ to the sink $(\infty, \infty, \ldots, \infty)$) and its backward level $bl(kp)$ (counted from the sink to the source) is exactly equal to $|MLCS| + 1$. Non-critical points do not have this property (see Figs. 1 and 2). This can be formulated as follows:
$$fl(kp) + bl(kp) = |MLCS| + 1 \qquad (2)$$
Proof: Given a set of sequences, let their MLCS length in the MLCS-ODAG be $|MLCS|$, which is exactly equal to the maximum value of the forward levels in the MLCS-ODAG minus one (proven in [18,25,26]). Hence, given a key point $kp$ residing on any of the longest paths of the MLCS-ODAG, if its forward-level value is $k$, i.e., $fl(kp) = k$, there must remain $|MLCS| - k$ levels from $kp$ to the sink point. So, its backward-level value $bl(kp)$ must be equal to $|MLCS| - k + 1$. Hence, $fl(kp) + bl(kp) = k + (|MLCS| - k + 1) = |MLCS| + 1$.
Based on the above observation, we replace the in-degree with the out-degree and layer the MLCS-ODAG by the topological sorting algorithm from the sink to the source, denoted as Algorithm BackwardTopSort. With this, all the non-critical points in the MLCS-ODAG are identified and can be easily removed. Fig. 2 shows the result of applying BackwardTopSort to Fig. 1. Note that the MLCS-ODAG shown in Fig. 2 contains only key points, that is, each path in the MLCS-ODAG corresponds to an MLCS of $S_1$, $S_2$ and $S_3$. In addition, as shown in Fig. 2, some key points, such as point (7,5,5), may be deleted while constructing the MLCS-ODAG, so that some MLCSs are lost. Our proposed MLCS algorithm, called NP-MLCS, therefore belongs to the approximate MLCS algorithm category.
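The level property of Theorem 1 can be checked on a toy graph. The sketch below is a minimal illustration under stated assumptions: the graph has a single source and a single sink, levels are obtained by layer-by-layer topological sorting (so a vertex's level is the length of the longest path reaching it), and the helper names `levels` and `key_points` as well as the example edges are hypothetical.

```python
from collections import defaultdict


def levels(edges):
    """Layered topological sort: level[v] = longest path length from a vertex
    with in-degree 0 to v."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    layer = [v for v in nodes if indeg[v] == 0]
    level, k = {}, 0
    while layer:
        nxt = []
        for u in layer:
            level[u] = k
            for v in succ[u]:
                indeg[v] -= 1          # "delete" the edge u -> v
                if indeg[v] == 0:
                    nxt.append(v)
        layer, k = nxt, k + 1
    return level


def key_points(edges, sink):
    """Vertices whose forward and backward levels sum to the sink's forward level."""
    fl = levels(edges)
    bl = levels([(v, u) for u, v in edges])   # same sort on the reversed graph
    total = fl[sink]                          # plays the role of |MLCS| + 1
    return {v for v in fl if fl[v] + bl[v] == total}


if __name__ == "__main__":
    dag = [("src", "a"), ("a", "b"), ("b", "sink"), ("src", "c"), ("c", "sink")]
    print(sorted(key_points(dag, "sink")))
```

In this toy graph the path through `c` is shorter than the longest path, so `c` fails the sum test and would be removed as a non-critical point.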
The pseudo-code of the proposed algorithm NP-MLCS is given in Appendix D.
We compared our algorithm with the state-of-the-art algorithms CRO and SA_MLCS via extensive experiments. From the experimental results shown in Appendix B, we can draw the following conclusions: 1) Although the baseline CRO always has the fastest speed, it has the lowest precision; 2) Our algorithm NP-MLCS is much better than SA_MLCS in both running time and precision. In terms of running time, our algorithm is orders of magnitude faster than the baseline SA_MLCS; 3) Our NP-MLCS works well on big sequence data.
Notably, NP-MLCS has the following unique properties:
1) Low space-time complexity
Theorem 2: The proposed algorithm NP-MLCS has $O(N \log N)$ time complexity and $O(N + |E|)$ space complexity, where $N$ and $|E|$ are the numbers of points and edges of the MLCS-ODAG.
Proof: For each sequence of $T$ over the alphabet $\Sigma$ with length $n$, $O(n|\Sigma|)$ time is needed for constructing its successor table. Constructing the MLCS-ODAG consists mainly of two operations: first, establishing the predecessor-successor relationships among the dominants in the MLCS-ODAG; second, sorting all of the points in $D^{k+1}$ by Algorithm BestNondominatedSorting [23]. Thus the time complexity for constructing the MLCS-ODAG is $O(N \log N)$. The backward topological sorting on the MLCS-ODAG by Algorithm BackwardTopSort needs $O(N')$ time, where $N'$ is the total number of points in the final MLCS-ODAG produced by BackwardTopSort, and $N' \ll N$. Therefore, the total time complexity of the proposed algorithm is $O(N \log N)$. Similarly, the storage space of the successor tables is $O(dn|\Sigma|)$, and the storage space of the MLCS-ODAG with $N$ points and $|E|$ edges is $O(N + |E|)$. The space complexity of NP-MLCS is $O(dn|\Sigma| + N + |E|) = O(N + |E|)$, as $O(dn|\Sigma|) \ll O(N + |E|)$.
2) 100% MLCS length precision
Theorem 3: The MLCS length precision of NP-MLCS is 100%, and its precision in the number of MLCSs can be calculated by Eq. 3.
In Eq. 3, the ratio term is the ratio of the total number of key points to the total number of points in the MLCS-ODAG. The other notations have the same meanings as in Sec. 2.3, and $|K|$ represents the size of the set $K$ of key points in the MLCS-ODAG.
Proof: Since the key points in the MLCS-ODAG uniquely contribute to and determine both the length and the total number of mined MLCSs, we argue that the precision of an approximate MLCS algorithm should be evaluated by both the length of the mined MLCSs and their number. As the procedure for constructing the MLCS-ODAG always keeps frontier points, no MLCS length precision is lost, and the MLCS length precision of NP-MLCS is 100%. However, since the points deleted from the frontier sets may include some key points, the precision of NP-MLCS in the number of mined MLCSs is defined by Eq. 3.
Notice that this property is very important for practical applications. In practice, it is not necessary to extract all the MLCSs of the sequences, but it is necessary to ensure that the length of the extracted MLCSs is accurate.
3) A novel approximate MLCS algorithm suitable for big sequence analysis in practice
The theoretical analysis and extensive experiments show that the proposed algorithm NP-MLCS is an efficient MLCS algorithm suitable for big sequences analysis in practice.

Computing platforms and tools
This investigation is carried out using two main computational tools: our proposed big sequence data (i.e., sequences with length over $10^4$) analysis algorithm NP-MLCS (for similarity analysis) and the existing MEGA X system [24] (for evolutionary relationship analysis). All the calculations were done on a computing cluster of 18 nodes (Intel(R) Xeon(R) Gold 5115 CPU, 2 chips, 10 cores/chip, 2 threads/core, @2.4 GHz and 96 GB RAM).

Similarity metrics
Based on the similarity metric design criteria and a common method for extracting subsequences among sequences in bioinformatics and computational biology [19,20], we give the following definitions and equations for computing the similarity of big sequences.
Definition 6 (LD): The Levenshtein/edit distance [22,23,24] is the minimum number of operations required to convert one character sequence into another, using the operations of inserting, deleting or changing a character.
LD is the most commonly used measure of similarity between two sequences; the LD-based similarity between a pair of sequences $X$ and $Y$ is defined by Eq. 4 [20]. The LCS-based similarity of a pair of sequences $X$ and $Y$ is defined by Eq. 5, where $|LCS|$ represents the length of the LCSs mined from the pair of sequences $X$ and $Y$, and $|X|$ and $|Y|$ represent the lengths of sequences $X$ and $Y$, respectively. We use two similarity metrics for each analysis experiment, one based on LCS (Eq. 5) and the other based on the Levenshtein/edit distance LD (Eq. 4). Together, the two metrics represent the similarities between a set of sequences and can quantitatively reveal potential evolutionary or genetic relationships between species, enabling medical professionals and biological researchers to cross-verify or cross-compare the results and possibly decide which method makes more biological sense.
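As a rough illustration of the two kinds of metric, the following Python sketch computes an edit-distance-based and an LCS-based similarity for a pair of short sequences. The normalization by the longer sequence length is an assumption made here for illustration and may differ from the exact forms of Eq. 4 and Eq. 5; the function names are hypothetical, and the quadratic-time routines are meant for short examples, not for genome-scale data.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # substitute or match
        prev = cur
    return prev[-1]


def lcs_length(x: str, y: str) -> int:
    """Length of a longest common subsequence of x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def ld_similarity(x: str, y: str) -> float:
    """ASSUMED normalization: 1 - LD / max(|X|, |Y|)."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - edit_distance(x, y) / longest


def lcs_similarity(x: str, y: str) -> float:
    """ASSUMED normalization: |LCS| / max(|X|, |Y|)."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else lcs_length(x, y) / longest


if __name__ == "__main__":
    print(lcs_similarity("ACGT", "AGT"), ld_similarity("ACGT", "AGT"))
```

Under this normalization both scores lie in $[0, 1]$, which makes the LCS/LD value pairs reported later directly comparable.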

Evolution and diversity of COVID-19
3.4.1 Evolution of COVID-19 viruses from 79 countries. To reveal more accurately the evolutionary relationships among the nearly 30,000 COVID-19 strains collected in 79 countries from December 2019 to May 2020, we first selected all of the sequences (25 in total) from China, the first country to report COVID-19 outbreaks, from Dec. 23, 2019 to Jan. 31, 2020, and all of the sequences (401 in total) from China and 18 other countries in January 2020. Then, we fed these 426 COVID-19 sequences into MEGA X to construct the evolutionary tree. From the constructed evolutionary tree, we selected all of the first generation sequences. After that, using uniform random sampling, i.e., ensuring that sequences from each of the 79 countries and their earliest sequences from Dec. 2019 to May 2020 can be drawn, we randomly sampled 10 groups of sequences from the 79 countries between February and May 2020. Then we added some new high-confidence sequences to each group to replace the low-confidence sequences containing multiple placeholder characters (which mark unknown nucleotides). Finally, each group of sequences, together with all of the first generation sequences calculated previously, was fed into MEGA X to generate its evolution tree. The 10 evolutionary trees produced by MEGA X demonstrated a high degree of consistency. Due to space limitations, we only present one evolutionary tree here in Fig. 3 and Appendix C; the others are available at https://github.com/NP-MLCS/NP-MLCS/tree/master/supplementary_materials.
Investigating the evolutionary tree shown in Fig. 3 allows us to make the following observations:
1) Although China was the first country to report COVID-19 outbreaks and to provide COVID-19 sequences, none of its sequences were in the earliest generations; they were concentrated in the sixth and the eighth branches of the later generations in the tree.
2) Apart from two sequences, one from Russia in the third branch and one from Spain in the fourth branch of the tree, all of the sequences from the top 15 countries currently reported to have the most severe outbreaks reside in the later generations, in the fifth to tenth branches of the tree.
3) Of all the existing 29,305 COVID-19 sequences in human hosts from 79 countries, the earliest sequence unexpectedly resided in the eighth branch of the later generations in the tree, which indicates that the COVID-19 virus probably began to spread among people in multiple countries as early as December 2019. This is also confirmed in a recent study [11].
4) The earliest sampled sequences from the 79 countries are distributed in different branches of the tree, which indicates the widespread infections and diversity of the COVID-19 virus in the world due to traveling and other reasons.
5) Although there is not yet enough evidence to trace COVID-19's origin, investigating the earliest generations' sequences in this evolution tree may provide some clues.
Table 1. Number of COVID-19 sequences by country and month.
| Country | Dec. 2019 | Jan. 2020 | Feb. 2020 | Mar. 2020 | Apr. 2020 | May 2020 | Total |
|---|---|---|---|---|---|---|---|
| USA | 0 | 21 | 113 | 4271 | 1968 | 15 | 6388 |
| England | 0 | 2 | 37 | 6306 | 7624 | 276 | 14245 |
| Spain | 0 | 0 | 12 | 474 | 19 | 0 | 505 |
| Italy | 0 | 5 | 5 | 57 | 18 | 0 | 85 |
| France | 0 | 9 | 13 | 330 | 36 | 0 | 388 |
| Germany | 0 | 9 | 23 | 110 | 40 | 0 | 182 |
| India | 0 | 2 | 0 | 116 | 180 | 46 | 344 |
| Canada | 0 | 4 | 7 | 145 | 36 | 0 | 192 |
| China | 25 | 286 | 241 | 103 | 6 | | |

Similarity and evolution of COVID-19 viruses.
In this study, we calculated all similarities of COVID-19 viruses among themselves and also between COVID-19 viruses and other related viruses. Notice that the similarity matrix of homogeneous sequences is symmetric: it represents pairwise comparisons, computed with our proposed MLCS algorithm, between sequences of the same virus type. Otherwise, the matrix is asymmetric and represents pairwise comparisons between sequences of two different virus types. The average similarity between sequences of the same virus type is computed using all the elements of the upper (or lower) half of the symmetric similarity matrix, excluding the diagonal elements, while the average similarity between sequences of two different virus types is calculated using all the elements of the asymmetric similarity matrix.
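The two averaging rules can be sketched as follows. This is a minimal Python illustration with made-up similarity values; the function names `mean_within` and `mean_between` are hypothetical and the matrices are plain nested lists.

```python
def mean_within(sim: list[list[float]]) -> float:
    """Average of the strict upper triangle of a symmetric n x n matrix
    (pairs of sequences of the same virus type, diagonal excluded)."""
    n = len(sim)
    values = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    return sum(values) / len(values)


def mean_between(sim: list[list[float]]) -> float:
    """Average of all entries of an r x c matrix
    (pairs of sequences drawn from two different virus types)."""
    values = [v for row in sim for v in row]
    return sum(values) / len(values)


if __name__ == "__main__":
    same_type = [[1.0, 0.9, 0.8],
                 [0.9, 1.0, 0.7],
                 [0.8, 0.7, 1.0]]
    two_types = [[0.6, 0.7],
                 [0.5, 0.6]]
    print(round(mean_within(same_type), 4), round(mean_between(two_types), 4))
```

Excluding the diagonal matters because each sequence has similarity 1 with itself, which would otherwise inflate the within-type average.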
Since China was the first country to report a COVID-19 outbreak and to submit COVID-19 virus sequences, and the USA, Italy and England are the countries most affected by the COVID-19 epidemic with a lot of sequence data from January to May 2020, the similarity and evolutionary analyses of the sequences of these four countries are reported here in particular; they are shown in Table 2, Figs. 4-5 and Appendix C, respectively. From this analysis, we can make the following observations:
1) Although the overall similarities of these human strains are high, we observed a reduction of the similarities in later months in all four countries, indicating that mutation within the human population is already occurring.
2) The averages of nucleotide differences from the four countries are 286.39, 292.35, 268.49 and 247.61, respectively, corresponding to the averages of nucleotide differences 325, 423, 378 and 289 of four countries. These changes imply rapid evolution of this virus, which might result in attenuated or more virulent strains. All these differences are statistically significant ($p < 0.0013$), which indicates that COVID-19 has begun its divergence in the human population.
3) Although the sequences of the COVID-19 virus from the four countries have evolved at different rates, the viruses in all of these countries are steadily mutating, which potentially explains the underlying differences in virulence and alerts us to consider this divergence in designing antibodies and vaccines.
4) By investigating the locations of the sequences of these countries, as well as of all other countries, in their evolutionary tree, we can infer that the generation of a sequence is positively, but not entirely, related to its sampling time. In addition, for each country, there are also some outlier sequences, e.g., strain EPI ISL 417180 China 2020.02.03. Further research on the first generation strains, including outliers, from these countries will be important in searching for the virus transmission path.

The similarity between COVID-19 and related viruses.
To help trace the original or the intermediate host of COVID-19 and to assist the finding of natural remedies, we analyzed the similarity between COVID-19 viruses in different hosts, including human, rhinolophine, pangolin, and environmentally collected strains.
We found that COVID-19 virus living in the environment is highly similar to that living in the human body. The average similarity can reach 0.9972/0.9972 (LCS/LD). This is expected as this is likely to reflect what is being transmitted right now among the human population. We also found strong similarities between TG13 and RaTG13 (rhinolophine host) and COVID-19 (human host), reaching 0.9599/0.9584 (LCS/LD) and 0.9599/0.9585 (LCS/LD), respectively. But the average similarities between COVID-19 with human host and the other COVID-19 strains with rhinolophine host are not very high, 0.7416/0.6631, lower than the similarity with COVID-19 strains (pangolin host), 0.8742/0.8604, by 13% and 20% (LCS/LD).
It has been reported that many symptoms of COVID-19 patients resemble those of influenza patients infected with one of the four known flu coronaviruses. Therefore, we computed the sequence similarities between COVID-19 viruses and the four known flu coronaviruses, HCov-229E, HCov-OC43, HCov-NL63 and HCov-HKU1. The average similarity matrices, computed with Eq. 5 (LCS-based) and Eq. 4 (LD-based), respectively, are shown in Appendix E. The average similarities (LCS/LD) are 0.6532/0.5557 for HCov-229E, 0.6806/0.5619 for HCov-OC43, 0.6597/0.5607 for HCov-NL63 and 0.6909/0.5612 for HCov-HKU1. The similarity values of the two metrics (LCS and LD) differ by about 10%, but the trends of the two results are consistent.
Compared to the shared similarity between COVID-19 and the seven lethal strains, the similarity between COVID-19 and other known flu-causing coronaviruses is in general higher except SARS and MERS. This pinpoints the importance of revisiting the treatment of flus and studying whether drug repurposing could possibly alleviate the current COVID-19 crisis.
It is also worth noting that the similarities between COVID-19 strains and the viruses HCov-OC43, Lassa, MERS, Victoria, Yamagata, Ebola and Dengue have increased steadily over the past six months.

CONCLUSION
Pathogenic mechanism, virus detection, and vaccine and drug developments all heavily depend on the analysis of the complete genome sequences of COVID-19. This study provides important information to support the decision making of medical and healthcare professionals in tracking COVID-19's mutation paths, developing virus detection tools, vaccines and drugs, and controlling the epidemic. Below, we reiterate several key findings.
First, the genome sequences of COVID-19 viruses in humans have already gone through mutations over the past six months. This has important implications for developing COVID-19 test kits, vaccines and antibody treatments. Recently, efforts to isolate antibodies for COVID-19 treatment have been announced by several pharmaceutical companies, and vaccines are being actively developed by many research labs around the world. The breadth of coverage of the antibodies and vaccines will be critical in determining their efficacy.
Second, COVID-19 shares little similarity with Ebola, but more with the four previously known flu-causing coronaviruses (HCov-229E, HCov-OC43, HCov-NL63 and HCov-HKU1), and even more with SARS. The sequence analysis suggests that treatments for SARS and other flu-inducing coronaviruses might be another route that we should explore. We recommend considering this during medication and treatment development.
Third, COVID-19 virus strains from most countries might have gone through multiple evolution paths. Extensive analyses of COVID-19 strains from different countries could potentially lead us to the first generation COVID-19 virus and its origin. As the data shown here indicate, at the national scale, COVID-19 could have already spread through multiple routes. This also highlights the need to develop more aggressive isolation and quarantine procedures for anyone demonstrating suspicious symptoms, even without direct or known contact with a patient.

APPENDIX A MAIN DATA STRUCTURE
Successor tables (ST). The successor tables $\{ST_1, ST_2, \ldots, ST_d\}$ of the sequence set $T = \{S_1, S_2, \ldots, S_d\}$ are built to support the compression of the data and quick search for the immediate successors of points. For a sequence $S_l = s_1, s_2, \ldots, s_n$ from the sequence set $T$ over a finite alphabet $\Sigma = (\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|})$, its successor table $ST_l$ is a two-dimensional array, where $ST_l[i, j]$ (the element of the $i$th row and the $j$th column) is defined by Eq. 6. The set of immediate successors of a $d$-dimensional point $p = (p_1, p_2, \ldots, p_d)$ can be obtained efficiently in $O(d|\Sigma|)$ time. For a $d$-dimensional point $p$, the operation for producing its immediate successors can be characterized by Eq. 7.
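A successor table of this kind can be sketched in Python as below. The sketch assumes the definition commonly used in dominant-point MLCS work, namely that the entry for character $\sigma_i$ and position $j$ holds the smallest position after $j$ at which $\sigma_i$ occurs (or nothing if there is none); this is an assumption about Eq. 6 and Eq. 7, and the names `successor_table` and `immediate_successors` are hypothetical. Positions are 1-based, with position 0 standing for the source point.

```python
from typing import Optional


def successor_table(seq: str, alphabet: str) -> dict[str, list[Optional[int]]]:
    """table[c][j] = smallest 1-based position k > j with seq[k] == c, else None."""
    n = len(seq)
    table: dict[str, list[Optional[int]]] = {c: [None] * (n + 1) for c in alphabet}
    for c in alphabet:
        nxt = None
        for j in range(n, -1, -1):
            table[c][j] = nxt              # next occurrence strictly after j
            if j >= 1 and seq[j - 1] == c:
                nxt = j
    return table


def immediate_successors(point, tables, alphabet):
    """One lookup per character and per sequence: O(d * |alphabet|) in total."""
    result = {}
    for c in alphabet:
        nxt = tuple(tables[l][c][p] for l, p in enumerate(point))
        if None not in nxt:                # c must still occur in every sequence
            result[c] = nxt
    return result


if __name__ == "__main__":
    seqs, sigma = ["ACGT", "CAGT"], "ACGT"
    tables = [successor_table(s, sigma) for s in seqs]
    print(immediate_successors((0, 0), tables, sigma))
    print(immediate_successors((3, 3), tables, sigma))
```

Each table is filled in a single right-to-left pass per character, which agrees with the $O(n|\Sigma|)$ construction cost per sequence used in the complexity analysis.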

B EXPERIMENTAL RESULTS
We evaluate the performance of the approximate algorithms, which vary not only in efficiency but also in precision. Here, the precision is measured by Eq. 3. The results for all the tested approximate algorithms, including the state-of-the-art CRO and SA_MLCS as well as our algorithm NP-MLCS, are shown in Tables 3 and 4. The proposed algorithm NP-MLCS is implemented in Java (JDK 1.8). The number of key points to be retained in each layer when constructing the MLCS-ODAG is a user-customized parameter (a positive integer). The source code of algorithm NP-MLCS is available at: https://github.com/NP-MLCS/NP-MLCS

E THE SIMILARITY BETWEEN COVID-19 AND RELATED VIRUSES