COVID-19 Mortality Risk Assessment among Various Age Groups Using Phylogenetic Analysis

The age-related mortality and morbidity risk of COVID-19 has been considered speculative, without enough scientific evidence. This study aimed to collect more evidence on the association between patient age and the risk of severe disease and/or mortality from SARS-CoV-2 infection. A genomic dataset with accompanying metadata (3608 samples), retrieved from GISAID and covering different geographical regions, was divided into 10 age groups (0-10, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-80, 81-90, 91-100 years) and classified as high-risk or low-risk according to patient clinical status. Genomic sequences were aligned with MAFFT, and a phylogenetic tree was built with FastTree in order to identify age-risk associations based on phylogenetic clustering. Case fatality rates (CFR) and odds ratios (OR) for high-risk outcomes were calculated for the different age groups. Results revealed that individuals aged 25-50 years have the best immune response to the infection, whereas disease fatality was higher in patients older than 50 years. We created an application to calculate the OR of being at high risk given a certain age threshold from GISAID datasets. OR values increased between ages 1-10 years (1.271) and 11-20 years (1.313), decreased for ages 21-30 years (1.290), and increased again for ages 61-70 years (2.465). The CFR calculated for each age group peaked at 90-100 years (26.8%) and was lowest at 0-10 years (0%). The CFR for ages above 50 years (11.6%-26.8%) was about twice or more that for ages below 50 years (0-6.6%). The phylogenetic analysis revealed that the majority of samples obtained from India were low-risk across different age groups and belonged to clade GH. Another cluster, from Singapore, showed unfavorable patient outcomes across several age groups and was classified under clade O. In conclusion, the analyses in this study showed a variety of age-risk associations.
As scientists from different countries upload more genomes to globally shared databases, more evidence will reinforce mortality risk associations in COVID-19 patients.


BACKGROUND AND INTRODUCTION
The current coronavirus pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is not the first pandemic caused by a member of the family Coronaviridae [1]. Severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS), caused by severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus (MERS-CoV), emerged in 2002 and 2012, respectively [1][2][3][4][5]. Seven coronaviruses are known to infect humans: 229E (α), NL63 (α), OC43 (β), HKU1 (β), MERS-CoV, SARS-CoV, and SARS-CoV-2. They belong to the family Coronaviridae, and SARS-CoV-2 is a strain of the SARS-related coronavirus (SARSr-CoV) that is genetically associated with other coronavirus strains that infect bats in China [6][7]. From their emergence until the end of 2002, coronaviruses were considered non-fatal [7]. SARS-CoV-2, first isolated in Wuhan, China, has led to more deaths than the earlier SARS outbreak of 2002-2003. As of September 7, 2020, it had been responsible for nearly 27 million cases and 900,000 deaths worldwide [8]. Sequenced SARS-CoV-2 genomic data from human hosts, available at the Global Initiative on Sharing All Influenza Data (GISAID), identified three major clades of SARS-CoV-2: clade G (a variant of the spike protein, S-D614G), clade V (a variant of the ORF3a coding protein, NS3-G251), and clade S (variant ORF8-L84S) [9]. Phylogenetic data have grouped the variants into three clusters: A, B, and C. Clusters A and C spread mostly outside East Asia, in America and Europe. In contrast, type B is the most common variant in East Asia, and its ancestral form appears never to have spread outside East Asia without first mutating into derived B types [10]. To date, SARS-CoV-2 has been characterized by mutations that help explain its origin and distribution and help trace viral pathogenesis on every continent [11].
For instance, 5775 distinct variants were identified among 10022 SARS-CoV-2 genomes analyzed; of these, 2969 missense mutations, 1965 synonymous mutations, and 484 mutations in non-coding regions were observed in samples obtained from 68 countries [12]. SARS-CoV-2 belongs to the Betacoronavirus genus, the most prevalent, and shares 82% nucleotide identity with SARS-CoV and about 50% with MERS-CoV [13]. Using specialized tools and techniques, researchers have been able to conduct phylogenetic studies of the SARS-CoV-2 genome against suspected zoonotic reservoirs [14][15]. Multiple Sequence Alignment (MSA) of SARS-CoV-2 genome data has revealed a great deal of information, including evolutionary diversity and similarity with other coronavirus strains. With the help of Multiple Alignment using Fast Fourier Transform (MAFFT), research has revealed that SARS-CoV and MERS-CoV both originated in bats [16]. In humans, the capacity of the immune system to fight infections diminishes with age [17]. Previous studies have suggested that older adults are highly susceptible to SARS-CoV-2 infection and may suffer severe COVID-19 outcomes due to comorbidities [18]. It is still not clear whether age (and the associated immune decline) has a direct influence on COVID-19 mortality.
It is therefore important to determine whether patient age affects disease severity and/or mortality rates, and to identify any evolutionary pattern exhibited by the virus as it is transmitted between hosts across geographical locations. In this study, we constructed a maximum likelihood phylogenetic tree of SARS-CoV-2 strains collected from the different countries to which the virus has spread. The phylogenetic tree gives an indication of which strain of the virus was introduced into a country and how that affects disease severity and/or mortality.

Data Download and Filtration
The clinical dataset of COVID-19 patients and the genomic sequence for each SARS-CoV-2-infected patient were collected from the GISAID EpiCoV repository [9]. The retrieved dataset contained a total of 4592 genomes. A Python (v.3.7.4) script using the pandas (v.1.0.5) [19] and Biopython (v.1.77) [20] packages was used to filter the data and generate the FASTA and clinical dataset files. For data cleaning, samples from non-human hosts were filtered out. Additionally, all samples with a patient status or patient age recorded as unknown or Not Available (NA) were filtered out, reducing the dataset from 4592 to 3608 samples. The Biopython package was used specifically to parse the FASTA file containing the complete genomic data. All the filtered genomic and clinical data were written to new files by the Python script.
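The filtering rules described above can be sketched as follows. This is a minimal illustration, not the authors' script: it uses only the Python standard library instead of pandas and Biopython so that it runs without dependencies, and the column names (`host`, `patient_status`, `patient_age`, `accession_id`), the set of "missing" markers, and the toy records are all assumptions rather than the actual GISAID export format.

```python
import csv
import io

# Hypothetical column names; a real GISAID metadata export may differ.
HOST, STATUS, AGE, ID = "host", "patient_status", "patient_age", "accession_id"
# Assumed spellings of "unknown" / "Not Available" values.
MISSING = {"", "unknown", "na", "n/a", "not available"}


def keep_record(row):
    """Keep human-host rows whose patient status and age are both known."""
    if row[HOST].strip().lower() != "human":
        return False
    return all(row[col].strip().lower() not in MISSING for col in (STATUS, AGE))


def filter_metadata(handle):
    """Return the tab-separated metadata rows that pass keep_record."""
    return [row for row in csv.DictReader(handle, delimiter="\t") if keep_record(row)]


def filter_fasta(handle, wanted_ids):
    """Yield (header, sequence) pairs whose header is in wanted_ids."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header in wanted_ids:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header in wanted_ids:
        yield header, "".join(chunks)


# Toy data, for illustration only.
metadata = io.StringIO(
    "accession_id\thost\tpatient_status\tpatient_age\n"
    "EPI_1\tHuman\tHospitalized\t54\n"
    "EPI_2\tHuman\tunknown\t31\n"
    "EPI_3\tBat\tLive\tNA\n"
)
fasta = io.StringIO(">EPI_1\nACGT\nACGT\n>EPI_2\nTTTT\n>EPI_3\nGGGG\n")

rows = filter_metadata(metadata)
kept_ids = {row[ID] for row in rows}
sequences = dict(filter_fasta(fasta, kept_ids))
```

Filtering the metadata first and then selecting sequences by identifier keeps the clinical table and the FASTA file in sync, which matters because the later phylogenetic and OR analyses join the two by sample.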

Multiple Sequence Alignment and Phylogenetic tree reconstruction
Using the filtered genomic data, multiple sequence alignment was performed with MAFFT (Galaxy v.7.221). MAFFT performs progressive alignment and iterative refinement for increased accuracy and can align as many as 30,000 sequences [21]. The maximum likelihood phylogenetic tree was generated with the FastTree program (Galaxy v.2.1.10+galaxy). FastTree generates maximum likelihood phylogenetic trees to find the local optimum or 'best tree' [22]. All analyses concerning phylogenetic tree reconstruction were performed on the Galaxy web server [23], using default parameters and the filtered genomic dataset. MAFFT generated an alignment in FASTA format, which was used as input to FastTree. iTOL was used to visualize the phylogenetic tree efficiently. One of the key features of this software is that it allows large trees with many taxa to be easily visualized, accessed online, and edited. Public sharing allows the tree to be reused in future analyses [24].
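For readers who prefer a local run over the Galaxy web server, the align-then-infer pipeline might look like the sketch below. It assumes that the `mafft` and `FastTree` executables are installed and on the PATH under those names; the `--auto` and `-nt` options are assumptions chosen for a nucleotide dataset and are not necessarily the Galaxy defaults used in this study. The helper function names are hypothetical.

```python
import subprocess


def mafft_command(input_fasta):
    """Command line for MAFFT with automatic strategy selection (assumed option)."""
    return ["mafft", "--auto", input_fasta]


def fasttree_command(alignment_fasta):
    """Command line for FastTree on a nucleotide alignment (assumed option)."""
    return ["FastTree", "-nt", alignment_fasta]


def run_to_file(command, output_path):
    """Run a command and write its standard output to output_path."""
    with open(output_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


def build_tree(input_fasta, alignment_fasta, tree_newick):
    """Align the sequences, then infer a tree from the alignment."""
    run_to_file(mafft_command(input_fasta), alignment_fasta)
    run_to_file(fasttree_command(alignment_fasta), tree_newick)
```

Both tools write their result to standard output, so each step redirects it to a file; the resulting Newick file could then be uploaded to a tree viewer such as iTOL.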

Odds Ratio calculations
To determine whether an exposure is a risk factor for a particular outcome, a measure of association, the odds ratio (OR), was used to compare the relative odds of the disease outcome given exposure to a specific variable of interest (e.g. age) [25]. In this study, the clinical dataset was categorized into "Low risk" and "High risk" using R (Supplementary Table 4) and then used to calculate the OR. Twelve patients with ages below 1 year or above 100 years were filtered out of the clinical dataset, and the remaining data were sorted in order of increasing age. This was done using an R script and Microsoft Excel.
The OR calculations were performed in R using the epiR (v. 1.0-15) and epitools (v. 0.5-10.1) packages. We used epiR for 2 by 2 OR calculations and epitools for multiple OR calculations. The Age versus Outcome plot was built by constructing a 2 by 2 table for every age cutoff and plotting each resulting odds ratio as a single value.
The OR values for Clade versus Outcome, relative to a reference clade, were calculated with the epitools package. The reference clade was created artificially, using the median patient count among all clades and the overall outcome ratio to split the data. For the age group OR representation, we computed mean OR values for the given age intervals. P-values were calculated with the chi-squared test.
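As an illustration of the 2 by 2 calculation for a single age cutoff, the following Python sketch computes an odds ratio with a Wald-type 95% confidence interval. It is not the epiR or epitools implementation: the Wald interval, the function names, and the toy patient records are assumptions made for the example, and the sketch does not handle tables containing a zero cell.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """OR and Wald 95% CI for a 2x2 table (all cells must be non-zero).

    a: exposed, high risk      b: exposed, low risk
    c: unexposed, high risk    d: unexposed, low risk
    """
    or_value = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se_log)
    upper = math.exp(math.log(or_value) + z * se_log)
    return or_value, lower, upper


def table_for_cutoff(patients, cutoff):
    """Build the 2x2 counts, treating age > cutoff as the exposure."""
    a = sum(1 for age, high in patients if age > cutoff and high)
    b = sum(1 for age, high in patients if age > cutoff and not high)
    c = sum(1 for age, high in patients if age <= cutoff and high)
    d = sum(1 for age, high in patients if age <= cutoff and not high)
    return a, b, c, d


# Made-up (age, is_high_risk) records, for illustration only.
patients = (
    [(72, True)] * 30 + [(72, False)] * 20 + [(35, True)] * 10 + [(35, False)] * 40
)
a, b, c, d = table_for_cutoff(patients, cutoff=50)
or_value, lower, upper = odds_ratio(a, b, c, d)
```

Repeating `table_for_cutoff` and `odds_ratio` over a range of cutoffs would give one OR per cutoff, which is the shape of data needed for an Age versus Outcome plot.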

CFR calculations
The case fatality ratio (CFR) is used to measure the proportion of infected individuals with fatal outcomes. It estimates the percentage of deaths among identified confirmed cases. It is measured by the formula:

$$\mathrm{CFR}\ (\%) = \frac{N_{dp}}{N_p} \times 100$$

where $N_{dp}$ = number of deaths from disease and $N_p$ = number of confirmed cases.
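
The CFR definition above (deaths from disease divided by confirmed cases, expressed as a percentage) can be applied per age group as in the sketch below. The age labels follow the groups used in this study; the function names, the assumption of integer ages between 0 and 100, and the toy records are illustrative only.

```python
from collections import Counter


def age_group(age):
    """Map an integer age in years to the study's labels (0-10, 11-20, ...)."""
    if age <= 10:
        return "0-10"
    lower = ((age - 1) // 10) * 10 + 1
    return f"{lower}-{lower + 9}"


def cfr_by_age_group(records):
    """CFR (%) per age group from (age, died) pairs."""
    cases, deaths = Counter(), Counter()
    for age, died in records:
        group = age_group(age)
        cases[group] += 1
        deaths[group] += int(died)
    return {group: 100.0 * deaths[group] / cases[group] for group in cases}


# Made-up (age, died) records, for illustration only.
records = [(5, False)] * 4 + [(95, True)] + [(95, False)] * 3
cfr = cfr_by_age_group(records)
```

Only groups with at least one confirmed case appear in the result, so the division never sees a zero denominator.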

OR calculations
The odds ratio (OR) for a particular age threshold compares the odds of a high-risk outcome among patients older than the threshold with the odds among younger patients.
After the cleaning step, the data contained 3608 individual patient records for subsequent analysis. OR analysis of the patient age groups highlights a trend towards a High risk outcome with increasing age. In particular, the OR   and 0.28 for G and S clades respectively, and CI (95%) covers above and below 1). The clade associated with the low-risk outcome is GR, which has 0.61 times the odds of a high-risk outcome relative to the reference clade, with a 95% confidence interval for the true odds ratio between 0.48 and 0.76 (P<.05). The V clade has the widest CI limits and a P-value of 0.7.
To increase research reproducibility and to ease analysis, result exploration, and interpretation, a Shiny application, ORCaG (Odds Ratio Calculations for GISAID data), was created. It is available online at https://biopavlohrab.shinyapps.io/ORCaG/ and https://github.com/MountainMan12/GISAID_phylo/tree/master/ORCaG. The application allows users to dynamically change the patient status category and the age cut-off for easy visual inspection of the data. Documentation for the app is available on the corresponding GitHub page.

CFR calculations
In this study, the CFR for each age group was measured. The CFR peaked in the age range 90-100 years (26.8%) and was lowest at ages 0-10 years (0%). The CFR for ages above 50 years was about two times or more that for ages below 50 years. This supports the view that patients above 50 years are at higher risk than patients below 50 years of age, although being at high risk does not mean that death is certain (Supplementary Table 3, Figure 6).

Phylogenetic Analysis
Phylogenetic analysis can shed light on the evolution of SARS-CoV-2. In the present study, more than 3500 genomes from 69 countries were analyzed; the highest patient count was observed for Indian samples belonging to the GH clade (Figure 3).
Upon tree visualization, a group of closely related Indian samples clustered together, the majority of which were classified as low-risk (Figure 4) and belonged to clade GH. In this cluster, no clear relationship could be seen between age and either disease risk or clade type.
In another cluster, made up of samples from Singapore, most samples were high-risk and belonged to clade O, with no particular correspondence to any age group (Figure 5). Figure 3 shows that the samples from Singapore have the highest frequency of clade O.

Immunity reaction
Additionally, an immunity reaction distribution plot of COVID-19 across all ages in the dataset was generated (Figure 1). It shows that the immunity reaction to COVID-19 was optimal among younger individuals (mostly ages 25-50 years). The immunity reaction dataset, based on patient reaction, is provided in Supplementary Table 1.

DISCUSSION
Since the beginning of the outbreak, age has been a significant prognostic determinant in COVID-19 patients [26]. In this study, the immunity reaction to COVID-19 infection was analyzed across all ages in the dataset. The immunity reaction was optimal among younger individuals (mostly ages 25-50 years), declined after the age of 50 years, and was lowest in the age range 75 to 100 years. Our results are consistent with immunosenescence, which is characterized by reduced B and T cell numbers and responses. Moreover, COVID-19 infection is characterized in many cases by lymphopenia (decreased lymphocyte numbers), which is much more pronounced in the elderly than in young and middle-aged patients [27]; this may impair the immune response in the elderly more than in younger age groups. Taken together, age-associated immune remodeling, along with other predisposing factors such as malnutrition, decreased physical activity, and associated chronic medical conditions among the elderly, leads to a decreased immune response in older age groups, elevating their susceptibility to infectious diseases and contributing to the severe clinical manifestations observed in older patients. These mechanisms may underlie the worse prognosis in older patients with COVID-19 and to some extent explain the results that follow, which might lead to further research.
An important characteristic of a novel infectious disease such as COVID-19 is its severity and its ability to cause death. The World Health Organization (WHO) report of August 2020 on estimating mortality from COVID-19 recommended that "efforts should be made to calculate risk-group-specific estimates of fatality risk to have a better insight on the true patterns of fatality" [28]. In the present study, we measured the CFR for different age groups, and the results showed that the death rate increases with age, which agrees with the Centers for Disease Control and Prevention (CDC) report comparing death rate ratios of different age groups with the 18-29 years age group [29]. A slight deviation in CFR progression was seen in the age ranges 11-20 and 81-90 years. This is consistent with the view that a CFR calculated during an ongoing pandemic is conditional, because some active cases may die after the time of reporting, leading to underestimation of the reported CFR [28,29].
We therefore measured another estimate of disease outcome: odds ratios (OR) for disease severity across age groups (Supplementary Table 2). Our results are consistent with a study of 17 million people in England, in which more than 90% of COVID-19-related deaths were recorded among people over 60 years of age. Furthermore, those above 80 years had an approximately 40-fold increased risk compared with those aged 51-60 years [30]. In the current study, the OR peaked in the age range 81-90 years, at almost triple the value for patients aged 71-80 years. This also agrees with an initial study conducted in Italy that described the mortality rate and risk factors for patients above 80 years [31]. This shows that patients above 80 years of age are at higher risk [32], which could be due to their inherently reduced immune capacity and resilience [31].
Phylogenetic analysis of the clades circulating in a country, and of their risk association with specific age groups, is necessary to identify the groups with the highest priority for targeted treatment. Our study identified seven clades; the most abundant was the GH clade, which was associated with low disease risk and had the highest count in the Indian and Saudi Arabian populations. Bartolini et al. [33] reported the clustering of V and G clades from European Union (EU) countries. In our present study, however, there was an expanded clustering of clades across the 69 countries. Clades L and O, which had the highest disease risk, had the lowest pairwise OR when compared with the other 5 clades. GISAID genome sequence interspersed different non-G clades, located on the gene with distribution in different clades indicating repeated occurrence with no evolutionary advantage [34].
Interestingly, and in contrast to our previous results, our phylogenetic tree visualization showed a deviation from the usual pattern relating advanced age (>50 years) to high-risk disease outcomes. Clusters of Indian samples showed the prevalence of the low-risk clade GH across varying age groups, while clusters from Singapore showed the prevalence of the high-risk clade O across all age groups. Changes in age-group-specific infection were observed earlier in a study carried out in EU countries, which showed a shift in the most affected age group from >60 years to 20-29 years over several months. The median age of infection was also shown to have decreased from 54 years to 39 years in the space of 7 months [34]. The impact of several genetic variants is suggested by the fact that the virus does not show similar mortality rates across different countries. Viral progression may vary with the genetic makeup of the individual, and outcomes may also be due to several other factors that influence treatment and patient care. This deviation suggests that further factors should be taken into account when performing a risk-group-specific analysis of the disease, which would provide a more accurate understanding of the mortality rates related to SARS-CoV-2.

CONCLUSIONS
We have successfully analyzed more than 3500 genomes of SARS-CoV-2 isolated from COVID-19 patients from different geographical locations and identified a positive association between patient age and COVID-19 disease severity.
This study has limitations, including the small size of the dataset. More genomes could increase confidence in the OR analysis results. Variation in access to treatment and healthcare facilities can also influence patient outcome.
In the context of the proposed hypothesis, it is not clear whether age has a direct impact on patient mortality, but this could be better understood by looking at other clinical factors.

DATA AVAILABILITY
All datasets used are provided in the Zenodo repository: https://zenodo.org/record/4007666#.X1tmwnYzavM All scripts written for the analysis are provided in the GitHub repo: https://github.com/MountainMan12/GISAID_phylo