Characterization of nucleocapsid (N) protein from novel coronavirus SARS-CoV-2

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the global COVID-19 pandemic, the most notorious global public health crisis of the last 100 years. SARS-CoV-2 is composed of four structural proteins and several non-structural proteins. The multifaceted nucleocapsid (N) protein is the major component of the structural proteins of CoVs. However, there are no dedicated genomic, sequence and structural analyses focusing on the potential roles of the N protein. Hence, a detailed study of the N protein of SARS-CoV-2 is urgently required. Herein, we present a comprehensive study of the N protein from SARS-CoV-2. We have identified seven motifs conserved in the three major domains, namely the N-terminal domain, the linker region and the C-terminal domain. Out of the seven motifs, six are conserved across different members of coronaviridae, while motif4 is specific to SARS-CoVs and has potential amyloidogenic properties. Additionally, we report that this protein has large patches of disordered regions flanking these seven motifs. These motifs are hubs of epitopes, with 67 experimentally verified epitopes from related viruses. We report the presence of three nuclear localization signals (NLS1-NLS3, mapped to residues 36-41, 256-262 and 363-389, respectively) and two nuclear export signals (NES1-NES2, mapped to residues 151-161 and 217-230, respectively) in the N protein of SARS-CoV-2. We also deciphered two Q-patches, Q-patch1 and Q-patch2, mapped to residues 266-306 and 361-418, which potentially help in the aggregation of viral proteins along with the 219LALLLLDR226 patch. Additionally, we have identified 14 antiviral drugs potentially binding to the seven motifs of the N protein using docking-based drug discovery methods.


Introduction
A novel coronavirus (CoV) named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, initially called 2019-nCoV) has brought the entire world to a halt with the outbreak of coronavirus disease 2019 (COVID-19). This disease has rapidly become a new emerging human disease and a global pandemic: SARS-CoV-2 has already infected more than 4,731,458 people across the globe and caused over 316,169 deaths, according to Situation Report-120 of the World Health Organization (WHO, webpage https://www.who.int/emergencies/diseases/novel-coronavirus-2019, 19 May 2020). This suggests a mortality rate of 6.7%, which is higher than the initial estimate of about 2%. Several countries have locked down their cities, states and international borders, which has led to almost no international travel and ultimately resulted in a massive slowdown in the global economy and businesses. Coronaviruses (CoVs) are members of coronaviridae (NCBI Taxonomy ID: 69399) and are enveloped, positive-sense, single-stranded RNA viruses. The members of coronaviridae are evolutionarily grouped into four genera with Greek letter prefixes: alpha (α-CoV), beta (β-CoV), gamma (γ-CoV) and deltacoronaviruses (δ-CoV). CoVs are known to cause infections of the mammalian respiratory and gastrointestinal tracts, but the infection mechanism of these viruses is not fully understood. SARS-CoV-2 is the seventh CoV known to infect humans; the other four infections by members of the β-CoV genus were SARS, Middle East respiratory syndrome (MERS), HCoV-OC43 and HCoV-HKU1, while the two α-CoV infections were HCoV-NL63 and HCoV-229E. On careful examination, it is clear that CoVs tend to cross species boundaries when infecting humans, using intermediate hosts such as bats, anteaters and camels [1,2]. Hence, it is cautioned that CoV-based infections will return every now and then with our exposure to intermediate hosts, most probably wild animal species [2].
Although CoVs have clearly caused infections multiple times, each time a new infection emerges we tend to control its impact by containing the spread of the virus through lockdowns and by using other antiviral drugs. It is high time that we focus on CoVs using an inside-out approach aimed at long-term solutions for the current disease, COVID-19, and at countering any future outbreak of CoVs. Hence, computational methods are urgently required for the characterization of various components of CoV genetics and biology. This will help unravel future therapeutic targets against SARS-CoV-2 and related viruses. We now have the advantage that, thanks to next-generation sequencing methods, several CoV genomes are available and massive global efforts are under way to make more genomic data available soon. The recently available genome of SARS-CoV-2 is 29.9 kb in size (NCBI Accession ID ASM985889v3) and harbours four essential structural proteins: the spike (S) glycoprotein, the small envelope (E) protein, the matrix (M) protein, and the nucleocapsid (N) protein [3]. The N protein of a coronavirus is a multifunctional protein that plays a crucial role in virus assembly and in RNA transcription. The N protein is crucial in the formation of helical ribonucleoproteins during packaging of the RNA genome and in regulating viral RNA synthesis during replication and transcription. This protein is also capable of regulating infected host cells and their cellular mechanisms. The primary functions of the N protein are binding to the viral RNA genome and packing it into a long helical nucleocapsid structure or ribonucleoprotein (RNP) complex. The N protein is highly immunogenic and highly expressed during infection, and is capable of inducing protective immune responses against SARS-CoV and SARS-CoV-2.
Herein, we have carried out a comprehensive characterization of the nucleocapsid (N) protein of SARS-CoV-2 from sequence, phylogenetic and structural perspectives to decipher potential therapeutic targets and epitome inventories. We identified seven motifs with different roles, either coronaviridae-specific (motif2) or SARS-CoV-specific (motif4), and two glutamine-rich Q-patches with potential roles in generating higher aggregation propensity in human cells. We also explored a library of FDA-approved drugs against the seven motifs of the N protein for their roles as anti-SARS-CoV-2 drugs and their use in drug repurposing, and the majority of drugs are HIV

Scanning nucleocapsid (N) proteins of several coronavirus genomes
We detected putative nucleocapsid (N) proteins from different viruses using BLASTP [4] with an E-value < 1e-10, using the nucleocapsid (N) protein (GenBank accession ID YP_009724397.2) as the query after setting up a local BLAST database. We annotated these sequences using OMICSBOX [5] and CELLO2GO [6].
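The E-value filter described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: it assumes the BLASTP hits were exported in the standard tabular layout (`-outfmt 6`), where the subject ID is the second column and the E-value the eleventh. The function name `filter_hits` and the three sample rows are invented for the example.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # threshold stated in the text


def filter_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return subject IDs (first occurrence order) with E-value below the cutoff."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id = row[1]
        evalue = float(row[10])  # 11th column of the default tabular layout
        if evalue < cutoff and subject_id not in kept:
            kept.append(subject_id)
    return kept


# Invented rows for illustration only; they are not results from the study.
SAMPLE = "\n".join([
    "YP_009724397.2\thit_A\t90.5\t419\t40\t0\t1\t419\t1\t422\t0.0\t780",
    "YP_009724397.2\thit_B\t31.0\t400\t250\t12\t20\t410\t15\t430\t2e-35\t140",
    "YP_009724397.2\thit_C\t25.0\t80\t60\t0\t100\t180\t90\t170\t0.003\t35",
])

print(filter_hits(SAMPLE))
```

In this toy input, only the first two hits pass the cut-off; the third has an E-value of 0.003 and is discarded.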

Protein sequence and structural analyses of nucleocapsid (N) protein from selected CoV
We aligned nucleocapsid (N) proteins from selected CoVs using the MUSCLE alignment suite [9], and the resulting alignments were visualized along with secondary structural elements using the ESPript 3.0 tool [10] and/or Jalview [11]. The homology model of the full-length N protein (accession ID QHD43423.pdb) was taken from I-TASSER [12]. We predicted nuclear localization and nuclear export locations using and NetNES1.1 [13], respectively, and these signals were further curated manually. We analysed disordered regions using PREDICTPROTEIN and DISOPRED3 [14]. We visualized the motifs in the homology model using PyMol (www.pymol.org) and YASARA (www.yasara.org).

Phylogenetics analyses
We carried out phylogenetic analyses of nucleocapsid (N) proteins from selected CoVs with the neighbor-joining (NJ) method [18], using 500 bootstrap replicates in MEGA-X [19].

Mapping of experimentally validated epitopes using IEDB database
We mapped experimentally validated immune epitopes flanking the seven identified motifs of the N protein, using a cut-off of at least 70% sequence identity, from the Immune Epitope Database and Analysis Resource (IEDB, web: http://www.iedb.org/); matches were derived with BLASTP [4].
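To make the 70% identity cut-off concrete, the sketch below scores each candidate epitope against every same-length window of a target sequence and keeps those reaching the threshold. It is a deliberate simplification: the study derived matches with BLASTP, whereas this example uses an ungapped sliding window. The target string (motif2 with invented flanking residues) and the candidate peptides are hypothetical.

```python
IDENTITY_CUTOFF = 70.0  # percent, as stated in the text


def best_ungapped_identity(epitope, target):
    """Highest percent identity of `epitope` over all same-length windows of `target`."""
    n = len(epitope)
    if n == 0 or n > len(target):
        return 0.0
    best = 0
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        matches = sum(a == b for a, b in zip(epitope, window))
        best = max(best, matches)
    return 100.0 * best / n


# Motif2 (from the text) embedded in invented flanks; not the real N protein.
TARGET = "AAAAPRWYFYYLGTGPEEEE"

# Hypothetical candidate epitopes for illustration.
CANDIDATES = ["PRWYFYYLGT", "PRWYFFYLGS", "KKKKKKKKKK"]

kept = [e for e in CANDIDATES
        if best_ungapped_identity(e, TARGET) >= IDENTITY_CUTOFF]
print(kept)
```

A gapped aligner such as BLASTP can report different identities for peptides with insertions or deletions, so this window-based score should be read only as an illustration of how the threshold is applied.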

Prediction of aggregation propensity for nucleocapsid (N) protein
To predict the aggregation propensity of the nucleocapsid (N) protein, we supplied the N protein sequence (YP_009724397.2) to four different tools, namely TANGO [20], AGGRESCAN [21], FoldAmyloid [22], and Amylpred [23]. For TANGO, the parameters pH, temperature, ionic strength and protein concentration were set to 7.4, 310 K, 0.1 M and 0.1 M, respectively, and the protein was analyzed in its native form without any N- or C-terminal protection. The data were then compared with those of a well-known amyloidogenic protein, TAR-DNA binding protein 43 (TDP-43, accession no. NP_031401.1), which is involved in amyotrophic lateral sclerosis (ALS) [24].

Docking analyses focusing on seven motifs
We downloaded FDA-approved drugs in SDF format from DrugBank [25]. For our docking analyses, we utilized the protein model of the full-length N protein from the I-TASSER website (accession ID QHD43423.pdb). In this model, we mapped the seven conserved motifs. We docked a total of 1908 drugs to the different motifs using AutoDock Vina (3), after energy minimization with the ff14SB AMBER force field in the UCSF Chimera software (4). We performed the virtual screening using VSvina custom scripts (https://github.com/narekum/VSvina). After docking, we carefully examined the binding modes of the best-affinity binders for each of the binding motifs. The resulting data were visualized using PyMol (www.pymol.org) and YASARA (www.yasara.org).

Protein architectures and sequence of N protein from SARS-CoV-2
We aligned nucleocapsid (N) proteins from representative viruses as listed, and a reannotation summary is provided in Table 1. The N protein of SARS-CoV-2 is 419 amino acids long with a molecular mass of 45.6 kDa. This N protein has two major intrinsically disordered regions (IDRs) at the N- and C-terminal ends, ranging from residues 1 to 41 and 366 to 419 (Figs. 1-2). There are three conserved domains in the architecture of coronavirus N proteins, namely an N-terminal RNA-binding domain (NTD), a C-terminal dimerization domain (CTD), and an intrinsically disordered Ser-Arg (SR)-rich linker region. We have built a model structure of the full-length N protein from SARS-CoV-2, which is composed of eight a-helices (helix a1 to helix a8), ten h-helices (helix h1 to helix h10) and nine b-sheets (b-sheet b1 to b-sheet b9) (Figs. 2-3). The N protein of SARS-CoV-2 is a close homolog of the N proteins of BtCoV-RatG12 and SARS-CoV, with 99% and 90% sequence identities and 99% and 94% sequence similarities, respectively (Fig. 2). In our sequence comparisons, we found that N proteins from SARS-CoVs have higher sequence identities and similarities, ranging from 89% to 99% identity and 93% to 94% similarity, whereas N proteins from other viruses, namely PHEV-1 (GenBank accession ID ARC95215.1) and PHEV-2 (AAY68302.1) from porcine hemagglutinating encephalomyelitis virus and MHV-1 (AAA76578.1) and MHV-2 (BAJ04701.1) from murine hepatitis virus, show 30-31% identity and 47% similarity. We also compared the N protein from four Indian SARS-CoV-2 genomes with the N protein from SARS-CoV-2 Wuhan isolates. We found that sequence identity and sequence similarity are in the range of 99-100% for the N protein deduced from these four genomes versus that of SARS-CoV-2 Wuhan isolates.

Fig 2. Protein sequence alignment of N protein illustrates secondary structural elements and positions of different motifs.
A. The full-length N protein of SARS-CoV-2 has different secondary structural elements, namely 8 a-helices, 10 h-helices and 9 b-sheets, mapped on top of the alignment and derived from the I-TASSER [12] model (PDB ID QHD43416.pdb). This protein sequence alignment was constructed using the MUSCLE suite [9] and is visualized along with secondary structural elements using the ESPript 3.0 tool [10]. Amino acids are coloured based on physicochemical properties as follows: cyan, HKR; red, DE; maroon, STNQ; pink, AVLIM. Abbreviations: CTD, C-terminal dimerization domain; IDR, intrinsically disordered region; NES, nuclear export signal; NLS, nuclear localization signal; NTD, N-terminal RNA-binding domain. Cyan triangle: alternative K or R for NLS3. Asterisks (*) mark positions of glutamines: red *, glutamine present only in SARS-CoVs; green *, glutamine absent in SARS-CoV-2; blue *, glutamine present only in other viruses; black *, glutamine present in all viruses; maroon *, glutamine present in SARS-CoVs and some other viruses.
B. Sequence identity and similarity scores depict that CoVs (top 6) have higher sequence identities and similarities (green shades) than other viruses (bottom 4): porcine hemagglutinating encephalomyelitis virus (PHEV-1, GenBank accession ID ARC95215.1, and PHEV-2, AAY68302.1) and murine hepatitis virus (MHV-1, AAA76578.1, and MHV-2, BAJ04701.1).
It is clearly evident that these viruses group into two classes (Fig. 1): (a) the first six N protein sequences from different CoVs (marked in red box, Fig. 1) and (b) the lower four N protein sequences with lower sequence identities (Fig. 1).
Fig 3. A. Front & back view illustrating motifs 1-7. B. Location of motif1 ( 69 GQGVPI 75 ) and motif2 ( 106 PRWYFYYLGTGP 117 ) in the N-terminal RNA-binding domain (NTD). C. Location of motif3 (S/R-rich region) in the turn between b-sheet b4 and P-helix P1, mapped to the end of the NTD and the linker region (LKR). D. Illustration of motif5 to motif7 mapped to the C-terminal dimerization domain (CTD), located in the turn between the P-helix P4 and the a-helix a5, in the turn between the a-helix a5 and the P-helix P5, and at the end of the a-helix a8, respectively.
The first two motifs, motif1 and motif2, are present in the NTD region at amino acid positions 69-75 and 106-117 as 69 GQGVPI 75 and 106 PRWYFYYLGTGP 117; these two motifs are mapped to the turn between b-sheets b1-b2 and to the b-sheet b5, respectively (Figs. 2-3). The third motif is a large S/R-rich stretch, starting at the end of the NTD region and extending over the linker region (LKR) from residue positions 176 to 207 (according to YP_009724397.2 numbering). This motif is present in the LKR region connecting the NTD to the CTD, as visualised on the structural model, in the turn between the b-sheet b4 and the P-helix P1 (Figs 1-2). The fourth motif is six residues long, 221 LLLLDR 226, in the LKR region, residing in the turn between the b-sheet b4 and the P-helix P1.
The remaining three motifs, motif5 to motif7, are localized in the C-terminal domain (CTD) at positions 257-263, 274-281 and 353-361, respectively. Motif5 is seven amino acids long, 257 KPRQKR[ST] 263, with either serine or threonine at the 7th position, and is structurally mapped to the turn connecting the P-helix P4 and the a-helix a5, whereas motif6 is closely mapped to the next turn connecting the a-helix a5 and the P-helix P5 and is six residues long, 274 FG[KR]RGP 281. The seventh motif is 353 LN....AY. 361, which is localized at the end of the a-helix a8.
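The bracket and dot notation used for the motifs above maps directly onto regular expressions, so a motif scan over a protein sequence can be sketched as follows. This is an illustrative sketch only: the patterns are copied from the text (motif3, the S/R-rich stretch, is omitted because its exact pattern is not given here), while the function name `scan_motifs` and the toy sequence are assumptions made for the example and do not represent the real N protein or its numbering.

```python
import re

# Patterns as written in the text: [XY] = either residue, "." = any residue.
MOTIF_PATTERNS = {
    "motif1": "GQGVPI",
    "motif2": "PRWYFYYLGTGP",
    "motif4": "LLLLDR",
    "motif5": "KPRQKR[ST]",
    "motif6": "FG[KR]RGP",
    "motif7": "LN....AY.",
}


def scan_motifs(sequence, patterns=MOTIF_PATTERNS):
    """Return {motif: [(start, end, match)]} with 1-based inclusive positions."""
    hits = {}
    for name, pattern in patterns.items():
        hits[name] = [
            (m.start() + 1, m.end(), m.group())
            for m in re.finditer(pattern, sequence)
        ]
    return hits


# Toy string assembled for illustration: one instance of each motif
# separated by invented two-residue spacers.
TOY = ("MA" + "GQGVPI" + "TT" + "PRWYFYYLGTGP" + "AA" + "LLLLDR" + "GG"
       + "KPRQKRT" + "EE" + "FGRRGP" + "DD" + "LNKHIDAYK")

for name, found in scan_motifs(TOY).items():
    print(name, found)
```

Running the same function on a real sequence would report positions in that sequence's own numbering, which is how the residue ranges quoted in the text could be cross-checked.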

Motif2 of N protein is coronaviridae-specific and is a potential hub for epitope design
Motif2 is 12 residues long, 106 PRWYFYYLGTGP 117, and is mapped to the b-sheet b5 (Figs. 1-2). This motif is present in various members of coronaviridae, including various coronaviruses, as evident from the phylogenetic tree (Fig. 4), where other viruses such as murine hepatitis viruses and porcine hemagglutinating encephalomyelitis virus are marked in blue boxes. This motif is 100% identical at this location, which prompted us to examine its potential immunological roles such as epitope formation. Interestingly, we found that this motif resides at a central location, and regions extending 9-10 residues in both directions have been experimentally validated as immune epitopes in various experiments for different SARS-CoVs using T-cell and MHC assays in human and mice (BALB/c). We have summarized 19 epitopes (epitope1 to epitope19) that fully or partially match motif2 in other SARS-CoVs and also in related viruses such as feline infectious peritonitis virus (strain KU-2) and murine hepatitis virus strain JHM (Table 3). The presence of 19 epitopes in this region of the N protein of CoVs, together with the total conservation of this motif, hints that motif2 is a potential hub for epitope design against SARS-CoV-2, which is supported by experimental validation in closely related viruses such as different strains of SARS-CoV, as summarized in Table 3.

Motif4 is SARS-CoV-specific with amyloidogenic properties
Motif4 is clearly an insertion in the linker (LKR) region of the nucleocapsid protein of SARS-CoV-2 and related CoVs (Fig. 2) and is not present in other viruses. We examined this motif in various members of coronaviridae and confirmed that this leucine-rich six-residue motif ( 221 LLLLDR 226 ) is present in various SARS-CoVs (Fig. 5A), in the turn between the b-sheet b4 and the P-helix P1 in the linker (LKR) region (Fig. 3). This motif is the central element of a potential nuclear export signal (NES, Fig. 5B). Using four tools for amyloidogenic propensity prediction, we identified eight amyloidogenic stretches present throughout the N protein sequence (Table 4). Interestingly, two of the eight predicted stretches, 108 WYFYYL 113 and 219 LALLLLDR 226 (an extension of motif4), possess the highest aggregation scores across all the tools. Notably, the sequence alignment of representatives of the SARS-CoV family and other viruses showed that the first region remains conserved among the related viruses, while motif4 has been specifically inserted in the SARS-CoV family. These data suggest that the nucleocapsid protein has a high aggregation propensity in host cells. The specific insertion of motif4 (harbouring a leucine-rich region) might enhance the aggregation propensity of this protein compared to that of the other viruses and thus might form amyloid-like structures.
Additionally, it is interesting to note that two amyloidogenic stretches, 350 VILLN 354 and 392 VTLLP 396, in the SARS-CoV-2 N protein (Fig. 2 & Table 4) bear sequence homology with 75 VLVVL 79 from ORF8b. Shi et al. previously demonstrated that the VLVVL motif confers aggregation ability on the ORF8b protein [26]. Thus, we speculate that these stretches in the N protein might confer the ability to aggregate in SARS-CoV-2; however, this needs to be experimentally verified. It is known that intracellular protein aggregation contributes to the pathogenesis of a variety of diseases and is both propagated by and contributes to inflammation [27,28]. Recently, the ORF8b protein of SARS-CoV has been shown to aggregate, leading to cytotoxicity in epithelial cells, and this cytotoxicity can be partially rescued by preventing the aggregation of this protein. The aggregation of ORF8b protein induces endoplasmic reticulum (ER) stress, lysosomal damage, and the activation of transcription factor EB (TFEB) [26]. Since the N protein of SARS-CoV-2 shows amyloidogenic stretches, it would be interesting to explore the possibility of aggregate formation and its role in inflammation and cytotoxicity in epithelial cells. Overall, all these amyloidogenic stretches (Table 4) require experimental validation of their detailed roles in aggregation, inflammation, cytotoxicity, autophagy and cellular apoptosis.

Potential roles of other motifs of SARS-CoV-2 N protein
We have evaluated the epitome profiles of the other motifs using IEDB and found 48 epitopes flanking these six motifs (Table 5). The first motif ( 69 GQGVPI 75 ), mapped to the turn between b-sheets b1-b2 (Figs. 2-3), has eight experimentally verified epitopes; seven of these are reported for SARS-CoV and one for murine hepatitis virus strain JHM (Table 5). The S/R-rich motif3 in the LKR has a total of ten epitopes experimentally verified for different SARS-CoV strains (Table 5). The six-residue-long motif4 ( 221 LLLLDR 226 ) has nine flanking epitopes in closely related SARS-CoV strains. Similarly, there are 5, 3, and 13 validated epitome regions for motif5, motif6 and motif7, respectively (Table 5). Taken together, of the 67 epitopes of the seven motifs (Tables 3, 5), 94% are deduced from SARS-CoV strains.

Deciphering nuclear localization signal and nuclear export signals in the N protein
There are three nuclear localization signals (NLS) in the N protein of SARS-CoV-2 (Fig. 2), namely NLS1-NLS3, mapped one each to the N-terminal IDR (residues 36-41), the CTD (residues 256-262) and the C-terminal IDR (residues 363-389). NLS2 is mapped to motif5. NLS3 is the longest NLS and is conserved in SARS-CoVs, but alternative lysine (K) and arginine (R) residues are present in the same region in other viruses (marked by cyan triangles in Fig 2).
This hints that NLS3 is common to viral N proteins with some variations. The nuclear export of proteins is mainly governed by CRM1 (chromosome region maintenance 1 protein), also known as XPO1 (exportin-1) [29], which recognizes the nuclear export signal (NES) in cargo proteins [29]. NESs are hydrophobic-rich (preferably leucine-rich) regions of 8-15 amino acids [29]. By comparing NES-containing proteins [30] with the SARS-CoV-2 N protein, we identified two NESs: NES1 is mapped to the NTD from residues 152-161, whereas the second NES (NES2) is mapped to the LKR from residues 217-230 (Figs. 2 and 5B). Interestingly, NES2 was found in the SARS-CoV-specific motif4 (Fig. 5B). These data suggest that motif4 might enhance the cytoplasmic localization of the N protein of SARS-CoV and have a SARS-CoV-specific cytoplasmic function. Overall, the N protein of SARS-CoV-2 is a unique protein harbouring three NLS and two NES signals. This potentially allows SARS-CoV-2 to use it as system of dynamically transporting

Identification of two glutamine-rich patches as Q-patches in the N protein
Disordered region prediction (Fig. S2) hinted at a glutamine-rich stretch, 239 QQQQGQ 244, in the N protein of SARS-CoVs, as depicted in the protein sequence alignment (Fig. 2). On manual inspection of the protein alignment, however, we found two large patches of glutamine-rich regions, which we named Q-patches (marked by * in Fig. 2). These two Q-patches, Q-patch1 and Q-patch2, are present in residue ranges 266-306 and 361-418, respectively (Table 6). Q-patch1 has a total of 20 glutamines (marked by * in Fig. 2): eight glutamines are SARS-CoV-specific, namely Q229, Q239, Q240, Q241, Q289, Q294, Q303 and Q306 (red stars in Fig. 2), whereas six glutamines are found only in other viruses, as R227Q, T245Q, K249Q, del255Q, T271Q, and H300Q (blue stars in Fig. 2). Additionally, three glutamines are conserved in all viruses, namely Q260, Q272 and Q281 (black stars in Fig. 2), whereas Q242 and Q244 are conserved in all SARS-CoVs along with some other viruses (maroon stars in Fig. 2). Interestingly, SARS-CoV-2 has lost one conserved glutamine, which is replaced by alanine, as A267Q (green star in Fig. 2). Q-patch2 is mapped from the end of motif7 at position 361 to almost the C-terminal end at position 418, and this patch harbours 16 glutamines (marked by * in Fig. 2). Out of these 16 glutamines, seven are SARS-CoV-2-specific, namely Q380, Q384, Q386, Q406, Q408, Q409 and Q418 (red stars in Fig. 2), whereas eight are present only in other viruses, as K361Q, P368Q, K370Q, K373Q, K375Q, D377Q, K388Q, and K405Q (blue stars in Fig. 2), and one glutamine, Q389, is present in all SARS-CoVs along with some other viruses (maroon star in Fig. 2). All in all, two patches of glutamine-rich regions were deciphered during our sequence analyses of the N proteins of SARS-CoV-2 and related viruses.
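Glutamine counts of the kind reported for the Q-patches can be tallied with a short helper that takes 1-based, inclusive residue ranges, matching the numbering convention used in the text. The sketch below is illustrative only: the helper names and the toy sequence are invented, and it counts glutamines in a single sequence, whereas the counts above also include alignment positions where glutamine occurs only in other viruses.

```python
def count_in_range(sequence, start, end, residue="Q"):
    """Count `residue` within 1-based inclusive positions start..end."""
    if start < 1 or end > len(sequence) or start > end:
        raise ValueError("range lies outside the sequence")
    return sequence[start - 1:end].count(residue)


def residue_positions(sequence, start, end, residue="Q"):
    """List the 1-based positions of `residue` within start..end."""
    return [i for i in range(start, end + 1) if sequence[i - 1] == residue]


# Invented toy sequence for illustration; not the real N protein.
TOY = "MSQQAKQQQQGQTVTKQ"

print(count_in_range(TOY, 6, 12))
print(residue_positions(TOY, 6, 12))
```

Applied to an aligned set of sequences, the same position list computed per sequence would show which glutamines are shared and which are lineage-specific.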

Small molecule docking at the conserved motifs of SARS-CoV-2 N protein
Previously, structure-based computer modelling has accelerated the drug discovery process for the development of antiviral drugs against several viruses, including hepatitis C virus (HCV [31]), hepatitis delta virus [32], Ebola virus [33] and Zika virus [34]. To identify potential drugs against the seven conserved motifs of the N protein, we screened FDA-approved drugs using a molecular docking method, as protein pockets of similar shapes can bind diverse drugs with different chemical properties [35]. We identified four drugs, namely adapalene, naldemedine, dihydroergotamine and midostaurin, binding to the seven conserved motifs of the N protein (Supplementary Table 2). The SARS-CoV-specific motif4 of the N protein showed the highest affinity (-11.9 kcal/mol) to dihydroergotamine (Fig. 6). Adapalene exhibited binding affinity towards motif1 and motif2 of the N protein (Fig. 6). We further identified 14 top antiviral FDA-approved drugs with affinity scores lower than -8.0 kcal/mol, which can potentially target motifs of the nucleocapsid protein of SARS-CoV-2 (Table 7). This list has ten drugs approved against HIV: abacavir, darunavir, delavirdine, dolutegravir, elvitegravir, indinavir, nelfinavir, raltegravir, rilpivirine, and tipranavir (Table 7). In addition, we deduced three drugs that are used against HCV, namely dasabuvir, boceprevir and sofosbuvir, and trifluridine (Table 7), a drug against herpes simplex viruses (HSV). All these drugs belong to six drug classes, namely integrase inhibitors, non-nucleoside reverse transcriptase inhibitors (NNRTIs), nucleoside reverse transcriptase inhibitors (NRTIs), 5-substituted 2′-deoxyuridines, HCV NS5B inhibitor + HCV NS5A inhibitor, and protease inhibitors [36]. Integrase inhibitors are an important class of drugs, as they target the viral integrase to inhibit the integration of viral DNA into human chromosomes. NNRTIs bind directly to the viral reverse transcriptase and inhibit DNA synthesis. The mechanism of action of each drug is listed in Table 7.
When we compared antiviral drug targets motif-wise, we found that motif2 can be targeted by 13 different drugs, whereas motif1 and motif2 by four each, motif5 by three drugs, and motif6 and motif7 by two antiviral drugs each (Table 7). These findings hint that repurposing a set of antiviral drugs against COVID-19 is possible, which requires further investigation.
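The selection step described in this section, keeping drugs whose docking score is lower than -8.0 kcal/mol and grouping them by motif, can be sketched as below. This is a hedged illustration rather than the VSvina workflow itself: the function name `select_binders` and the record layout are assumptions, and all entries except the dihydroergotamine score for motif4 (reported in the text) are hypothetical placeholders.

```python
from collections import defaultdict

AFFINITY_CUTOFF = -8.0  # kcal/mol; more negative means stronger predicted binding


def select_binders(results, cutoff=AFFINITY_CUTOFF):
    """Group (drug, score) pairs by motif, keeping scores below the cutoff.

    `results` is an iterable of (drug, motif, score) tuples.
    Hits for each motif are sorted from strongest to weakest binder.
    """
    by_motif = defaultdict(list)
    for drug, motif, score in results:
        if score < cutoff:
            by_motif[motif].append((drug, score))
    for hits in by_motif.values():
        hits.sort(key=lambda pair: pair[1])
    return dict(by_motif)


RESULTS = [
    ("dihydroergotamine", "motif4", -11.9),  # value reported in the text
    ("drug_X", "motif4", -8.4),              # hypothetical placeholder
    ("drug_Y", "motif2", -7.5),              # hypothetical; fails the cutoff
    ("drug_Z", "motif2", -9.1),              # hypothetical placeholder
]

for motif, hits in select_binders(RESULTS).items():
    print(motif, hits)
```

Counting the entries per motif in the returned dictionary gives the kind of motif-wise tally summarized above.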

Discussion
COVID-19, caused by SARS-CoV-2, is the major pandemic of the last 100 years. It has challenged global movement by locking down the human population at home in search of safety from this virus. SARS-CoV-2 is the seventh CoV that has infected humans via an intermediate host of zoonotic origin. It is well known that CoVs, e.g., SARS, MERS and the recent SARS-CoV-2 pandemic, have time and again caused great loss of life as well as economic damage. Various strategies are being used to develop drugs against cell-surface viral proteins [37]. Neutralizing antibodies bind to the surface proteins of viruses to prevent entry into host cells, but frequent mutation in the coat proteins can abolish antibody-mediated immunity [37]. As an alternative, the N protein can be used as an antigen for early diagnosis and for the development of vaccines for many viruses due to its conserved gene sequence. In particular, we focus on identifying motifs in the N protein of SARS-CoV-2 that are conserved across various CoVs. The N protein is mapped to the 3' end of the SARS-CoV-2 genome from 28,274 bp to 29,533 bp, with a total ORF size of 1260 bp, and is maintained in Indian strains such as EPI_ISL_426414, EPI_ISL_426415, EPI_ISL_413523 and EPI_ISL_426179 and also in 43 bat-CoV genomes (Fig. 1). Using a set of representative sequences, we have illustrated seven motifs, motif1-motif7, present in several CoVs and mapped to different regions of the N protein (Figs. 1-3, Table 2). The 12-amino-acid-long motif2 ( 106 PRWYFYYLGTGP 117 ) is conserved in all representative sequences and hence is coronaviridae-specific (Fig. 4), whereas motif4 is clearly SARS-CoV-specific (Fig. 5A).
Utilizing IEDB, we have generated the largest epitome inventory of any SARS-CoV, with 67 epitopes for seven motifs, of which 63 are deduced from SARS-CoV strains (Tables 3, 5). Given that the N proteins of SARS-CoV-2 and SARS-CoV are highly similar, with 90% sequence identity and 94% sequence similarity (Fig. 2), this epitome inventory is suitable for testing against SARS-CoV-2.
We have identified three nuclear localization signals (NLSs) and two leucine-rich nuclear export signals (NESs) in the N proteins of different SARS-CoVs (Figs 1-2 and 5). Previously, it has been shown that the N protein was localized mainly in the cytoplasm of SARS-CoV-infected Vero E6 cells [38], suggesting that a strong NES may be present in this protein. Timani et al. showed that the region 220 LALLLLDRLNRL 231 of the N protein of SARS-CoV was a functional NES [39]. Interestingly, the EGFP-tagged NTD also showed cytoplasmic localization [39], and the authors hinted at the presence of an additional NES in the NTD of the protein. In this study, we have identified NES1 in the NTD region, conserved in several SARS-CoVs (Fig. 2). These nuclear localization and export signals are conserved in the different SARS-CoVs (Fig. 2). This corroborates that the N protein is involved in dynamic nuclear-cytoplasmic trafficking. Such trafficking of proteins controls many cellular processes, including gene expression, signal transduction, cell differentiation, and immune response. The use of CRM1 via two NESs by SARS-CoVs suggests that CoVs are capable of mimicking and exploiting this conserved and constitutive mechanism of cellular protein export. There are some well-established viral examples of nuclear export, exploited by the matrix M1 protein of influenza A virus [40,41], the E7 oncoprotein of human papillomavirus [40] and the REV protein of HIV-1 [41]. The presence of both NLS and NES is also known for two viral proteins, VP1 and VP3 of chicken anemia virus, and using these signals these proteins regulate VP2 shuttling in cells [42,43]. Altogether, by exploiting these NLSs and NESs, the N protein becomes capable of shuttling between the cytoplasm and the nucleus during the SARS-CoV life cycle and plays an important role in SARS-CoV replication, assembly, and budding.
We have identified two glutamine-rich patches in the N protein of SARS-CoV-2, Q-patch1 (20 Qs) and Q-patch2 (16 Qs). Q-patch1 also contains GKGQQQQGQ, which is conserved in N proteins in 1727 genomes of bat CoVs and SARS-CoVs [44]. Repeats of Q and Q-rich regions are very well described in eukaryotic and viral genomes, where they interfere with autophagy by causing aggregation of viral proteins [44]. There are several examples of Q-rich regions assisting in controlling virus replication, as in bovine leukemia virus infection [45]. For RNA viruses like CoVs, these long Q-rich patches may play instrumental roles in genome replication and environmental sensing [46]. Along with other RNA viruses including MERS and SARS-CoVs, Q-rich patches are required for dsRNA folding domains near the 5′ end of these genomes, as reported for SARS-CoVs [47] and Flavivirus [48]. Additionally, there are some other possibilities in different viruses, such as dsRNA intermediate formation, roles in the interferon response and viral interference with the OAS/RNaseL system, as discussed recently [44]. Such long patches of Q-rich regions have not been described previously in RNA viruses, specifically coronaviridae. Taken together, the extended motif4 ( 219 LALLLLDR 226 , Table 4) and these two Q-rich patches suggest that the N protein of SARS-CoV-2 has a very high aggregation propensity in host cells. Q-rich patches are known to interfere with autophagy [49], and their RNA-binding abilities are known in several model organisms [44]. Q-rich regions are often associated with human diseases such as Huntington's disease and Huntington's disease-like 2 [44] and other neurological disorders [49]. Targeting the Q-rich patches can potentially provide therapeutic solutions against these diseases [50]. During this study, we have found sequence stretches of the N protein that are capable of creating higher aggregation propensity in human cells.
Hence, we recommend targeting these patches, such as the extended motif4, Q-patch1 and Q-patch2. Additional stretches are also critical in aggregate formation (Table 4). However, careful mutational studies will be required to confirm these findings. Some studies on other viruses have indicated that N proteins can form aggregates; for example, rhabdovirus N proteins form aggregates in CHO cells and bacteria [51]. This study further demonstrated that osmolytes and a chaperone-like phosphoprotein (P) maintain the N protein in a correctly folded form; however, the authors did not report on the sequence patches in the N protein that might facilitate this aggregation [51]. The presence of amyloid patches in the N protein might be one of the contributing reasons for this aggregation. All in all, amyloidogenic stretches in the nucleocapsid protein of SARS-CoV-2 may lead to aggregation, which may induce an overwhelming inflammatory stress in lung epithelial cells, leading to cytotoxicity. During the COVID-19 pandemic, the biggest rush is towards repurposing existing drugs, because drugs with known safety profiles are key to rapid deployment after initial success at the level of computational docking experiments [52]. Hence, we have also followed this strategy and explored potential antiviral drugs targeting the seven identified N protein motifs of SARS-CoV-2, and we have identified 14 antiviral drugs from DrugBank (Table 7). Overall, we have characterized the N protein of SARS-CoV-2 during the COVID-19 pandemic using genomic data, protein sequence, structure, epitopes and docking experiments to provide quick suggestions for further studies. We expect this work to provide a platform for further characterization of N proteins from different viruses, specifically CoVs.
Table 2. Overview of 7 motifs deduced from sequence analyses of nucleocapsid (N) proteins from CoVs and related viruses.
* Amino acid numbering of the nucleocapsid (N) protein of SARS-CoV-2 is followed.
# "[4]" indicates presence of SR a minimum of two times and a maximum of seven times.
$ "[XY]" indicates presence of either of the two given residues in the bracket.
% "." indicates presence of any of the 20 amino acids.
Table 3. Overview of experimentally measured immune epitopes, mapped to the motif2 of N protein and flanking regions. This data is derived from the Immune Epitope Database and Analysis Resource (IEDB, Web: http://www.iedb.org/).
Table 4. Predicted amyloidogenic stretches in the nucleocapsid protein of SARS-CoV-2.
Table 5. Summary of experimentally validated epitopes flanking N protein motifs, except motif2. This data is derived from the Immune Epitope Database and Analysis Resource (IEDB, Web: http://www.iedb.org/).
Table 6. Summary of two extended patches of glutamine (Q) residues as Q-patches.