Beyond History: The List of The Most Well Studied Human Protein Structures

Of the 20,000 or so canonical human protein sequences, 6,747 proteins had, as of July 2020, their full or partial structures determined at medium to high resolution by X-ray crystallography or other methods. Which of these proteins dominate the Protein Data Bank (PDB), and why? In this paper, we list the 272 top protein structures ranked by their number of PDB depositions. This set of proteins accounts for more than 40% of all available human PDB entries and represents past trends and the current status of protein science. We briefly discuss the relationship that some of the prominent protein structures have with protein biophysics research and mention their relevance to human diseases. The information may inspire researchers who are new to protein science, and it also provides a year-2020 snapshot of the state of protein science.

For a human protein in a protein-protein complex, the bound human segment may be very small (e.g. a peptide), and the partner protein may not be human (e.g. in the case of interactions with microorganisms). We also considered isoforms in the analysis, but assumed that the Latin name for human, "Homo sapiens", was mentioned in the UniProt entries for all human proteins. The top 200 human proteins, ranked by their number of PDB entries, are given in Table 1. Separately, we also list the 100 transmembrane proteins with the most entries (Table 2). Since 28 membrane proteins already appear in the first part of Table 1, only 272 unique proteins are listed in total. Overall, the frequent appearance of these proteins in the PDB arises from their biological importance in cellular processes and in human diseases, but for some also, to an increasingly lesser extent, from their use as model systems for our understanding of protein structure and function. Below we comment on some of the most highly represented structures and families that have emerged, also making note of the early history of protein structural biology.
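The ranking and merging of the two tables described above can be illustrated with a minimal Python sketch. The function names (`top_n`, `merge_tables`) and the toy counts are hypothetical placeholders, not the pipeline or data actually used for Tables 1 and 2; only the final bookkeeping line uses numbers from the text.

```python
from typing import Dict, List, Set, Tuple


def top_n(counts: Dict[str, int], n: int) -> List[str]:
    """Return the n protein IDs with the most PDB entries (ties broken by ID)."""
    return sorted(counts, key=lambda pid: (-counts[pid], pid))[:n]


def merge_tables(
    counts: Dict[str, int], membrane: Set[str], n_all: int, n_membrane: int
) -> Tuple[List[str], List[str], Set[str], List[str]]:
    """Build an overall top list, a membrane-only top list, and their union."""
    top_all = top_n(counts, n_all)
    membrane_counts = {pid: c for pid, c in counts.items() if pid in membrane}
    top_membrane = top_n(membrane_counts, n_membrane)
    overlap = set(top_all) & set(top_membrane)
    # dict.fromkeys removes duplicates while preserving rank order.
    unique = list(dict.fromkeys(top_all + top_membrane))
    return top_all, top_membrane, overlap, unique


# Toy, made-up entry counts (placeholders, NOT real PDB statistics).
toy_counts = {"A": 50, "B": 40, "C": 30, "D": 20, "E": 10, "F": 5}
toy_membrane = {"B", "D", "E", "F"}

top_all, top_membrane, overlap, unique = merge_tables(toy_counts, toy_membrane, 3, 3)
print(top_all, top_membrane, sorted(overlap), len(unique))

# The same bookkeeping with the numbers reported in the text:
# 200 overall + 100 membrane - 28 shared = 272 unique proteins.
unique_reported = 200 + 100 - 28
print(unique_reported)
```

In the toy case, one protein appears in both lists, so the merged list has 3 + 3 - 1 = 5 members, mirroring how 28 shared membrane proteins reduce 300 table rows to 272 unique proteins.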
It should be noted that the method we used to rank the top-hit proteins is completely different from the approach used in the report by Dolgin in 2017, where the top 10 genes in the human genome were identified by counting how often each gene appears in the PubMed database [3]. By contrast, we count the absolute number of available PDB entries for each protein on the UniProt web server. However, the two methods corroborate each other in part: some of the top genes identified in the study of Dolgin, such as Cellular tumor antigen p53, Tumor necrosis factor, Epidermal growth factor receptor and Estrogen receptor, also appear in our lists.

Top 10 Aqueous And Membrane Proteins Identified:
By far the most structures have been solved for Beta-2-microglobulin, Carbonic anhydrase 2 and Cyclin-dependent kinase 2, which hold first, second and third place among the most deposited structures in the PDB, with 770, 766 and 410 entries respectively. Prothrombin; Beta-secretase 1; DNA polymerase beta; HLA class I histocompatibility antigen, A alpha chain; Transthyretin; Bromodomain-containing protein 4; and DNA cross-link repair 1A protein rank 4th to 10th.
The top 10 membrane proteins (those with transmembrane regions) are: Beta-secretase 1; HLA class I histocompatibility antigen, A alpha chain; Estrogen receptor; HLA class I histocompatibility antigen, B alpha chain; Epidermal growth factor receptor; Histo-blood group ABO system transferase; Amyloid-beta precursor protein; Dipeptidyl peptidase 4; HLA class II histocompatibility antigen, DR alpha chain; Hepatocyte growth factor receptor.

Protein Classification:
Of the 20,000 or so canonical human proteins, notable protein groupings include 1,653 metabolic enzymes; 1,089 non-metabolic enzymes such as kinases and GTPases; 1,600 transcription factors; at least 1,555 transporters and channels; and 831 GPCRs [4][5][6]. By contrast, among the 272 top-hit proteins listed in the tables, there are 73 metabolic enzymes, 5 GTP-/ATPases, 44 kinases, 16 transcription factors, 5 ion channels and 4 G protein-coupled receptors (GPCRs). Moreover, there are 11 human leukocyte antigens, 5 histone proteins, 5 bromodomain (BRD)-containing proteins, 4 hormones and growth factors, 10 cell adhesion molecules, and 6 cystic fibrosis family proteins. The other 84 of the 272 proteins are not classified into major protein families, but all of them have important biological functions. Here we comment on several of the families.
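As a quick consistency check on the family tally above, the following Python sketch sums the per-family counts quoted in the text and derives the unclassified remainder. The dictionary layout and variable names are illustrative assumptions; the numbers themselves are taken from the paragraph above.

```python
# Family counts among the 272 top-hit proteins, as quoted in the text.
family_counts = {
    "metabolic enzymes": 73,
    "GTP-/ATPases": 5,
    "kinases": 44,
    "transcription factors": 16,
    "ion channels": 5,
    "GPCRs": 4,
    "human leukocyte antigens": 11,
    "histone proteins": 5,
    "bromodomain-containing proteins": 5,
    "hormones and growth factors": 4,
    "cell adhesion molecules": 10,
    "cystic fibrosis family proteins": 6,
}

TOTAL_TOP_HITS = 272

classified = sum(family_counts.values())
unclassified = TOTAL_TOP_HITS - classified
print(f"classified: {classified}, unclassified: {unclassified}")

# Share of each family within the top-hit list, largest first.
for family, n in sorted(family_counts.items(), key=lambda item: -item[1]):
    print(f"{family:<32} {n:>3}  {n / TOTAL_TOP_HITS:.1%}")
```

The classified families sum to 188, which leaves the 84 proteins that the text describes as not assigned to a major family.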

Kinases as One of The Best Studied Protein Families:
The high frequency of kinases among the top-hit proteins (44 of 272) is remarkable in contrast to their low fraction amongst the 20,000 canonical human proteins (518 of 20,000). Protein kinases are thought to modify up to 30% of all human proteins, and many of them, such as Raf kinase, Akt kinase, Ephrin type-A receptor 2 and Epidermal growth factor receptor (EGFR), have a crucial role in disease development, especially in cancer [7]. More than 250 kinase inhibitors are undergoing clinical trials and 37 are already approved as therapeutics [8]. Due to this biomedical significance, kinases are one of the most well studied families of human proteins.
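The over-representation of kinases can be made concrete as a fold enrichment, i.e. the kinase fraction in the top-hit list divided by the kinase fraction among all canonical proteins. The sketch below uses only the counts quoted above; the helper name `fold_enrichment` is a hypothetical choice, and no statistical test is implied.

```python
def fold_enrichment(
    hits_in_subset: int, subset_size: int, hits_in_background: int, background_size: int
) -> float:
    """Ratio of the hit fraction in a subset to the hit fraction in the background."""
    subset_fraction = hits_in_subset / subset_size
    background_fraction = hits_in_background / background_size
    return subset_fraction / background_fraction


# Counts quoted in the text.
KINASES_IN_TOP = 44
TOP_LIST_SIZE = 272
KINASES_IN_PROTEOME = 518
PROTEOME_SIZE = 20_000

enrichment = fold_enrichment(
    KINASES_IN_TOP, TOP_LIST_SIZE, KINASES_IN_PROTEOME, PROTEOME_SIZE
)

print(f"kinases in top-hit list: {KINASES_IN_TOP / TOP_LIST_SIZE:.1%}")       # 16.2%
print(f"kinases in proteome:     {KINASES_IN_PROTEOME / PROTEOME_SIZE:.1%}")  # 2.6%
print(f"fold enrichment:         {enrichment:.1f}x")                          # 6.2x
```

By this simple measure, kinases are roughly six times more frequent among the 272 top-hit proteins than among canonical human proteins overall.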

Membrane Protein Structures:
Membrane proteins represent 20-30% of human proteins. Many earlier reports noted that membrane proteins are largely underrepresented in structure determination (only ~2% of all PDB entries) in comparison to their number in genomes. This figure is no longer accurate, however, especially for human proteins. If we count all peripheral, transmembrane and integral membrane proteins, 2,237 distinct membrane proteins have at least one structure, corresponding to 33.2% of all human proteins with available structures. Counting single-pass and multi-pass transmembrane proteins only, 1,132 of the 6,747 proteins with available structures (16.8%) are membrane proteins. In both cases, this is close to the proportion of membrane proteins in the human genome. However, it is true that integral membrane proteins such as transporters, ion channels and GPCRs are still not well represented among the top 272 proteins with the most PDB entries. This is despite the fact that GPCRs, for example, account for approx. 30% of all drug targets. Several proteins of intense research interest are in the top-100 table for membrane proteins, and others are catching up. Until recently, integral membrane proteins typically had far fewer PDB entries than soluble proteins. This is at least partially due to difficulties in protein expression and purification. Transmembrane proteins such as receptor tyrosine kinases (e.g. EGF receptors and Eph receptors) and cell adhesion proteins (e.g. integrins) are prominently represented in the lists. However, these proteins have the majority of their domains exposed to solvent, and it is mostly the structures of these domains that have been determined, excluding the single membrane-crossing segment; only a few structures are available for the membrane-crossing regions, typically from NMR (about 27 as of 2017) [9]. Due to the technical challenges of sample preparation and, likely, the dynamic nature of the structures, the determination of full-length TM protein structures remains a frontier of structural biology, with increasing success reported using NMR, molecular modeling and cryo-EM, including cryo-electron tomography (cryo-ET).

Historical Implication and Model Proteins for Protein Science:
Several of the proteins listed in the tables have historical significance and/or have become model proteins for structural biology and protein biophysics research. However, it should be noted that some well-known proteins in protein history (from other organisms) do not appear in the tables, as we have restricted ourselves here to human proteins. Due to the challenge of crystallization, especially of eukaryotic proteins, crystallographers traditionally tried their luck with a wide range of species, especially in the days when proteins had to be purified from the organism itself. With the advent of recombinant protein expression, the focus shifted to prokaryotic homologues of human proteins; then, with the mandate of several structural genomics efforts to work on human proteins, the number of human proteins in the PDB received a significant boost. As a reference, if counting all species, the numbers of PDB entries for the proteins with the largest representation are the following ( The studies of Hemoglobin, Insulin, G-proteins, Na-K-ATPase, Prion, Cyclin dependent kinase, Ion channels, Ubiquitin, GPCRs and PD-L1 have been recognized with the Nobel Prize. For example, Myoglobin and Hemoglobin were the earliest proteins to have their 3D structures revealed by X-ray crystallography. Hemoglobin was also the first well-known allosteric protein complex, identified in the 1960s, and a key advance in our understanding of cooperativity [10]. Myoglobin and Cytochrome are early known examples of structure-based allostery for an individual protein.
In biophysical research, Ubiquitin, individually or as a multi-protein chain, is a model protein for studying protein conformational as well as configurational ensembles, protein dynamics and protein association/recognition [11]. Calmodulin and Lysozyme were widely used in the earlier NMR characterization of protein dynamics and conformational entropy [12]. H- and K-Ras have recently been used as model proteins for investigating the multi-orientational nature of protein configurations at the plasma membrane of cells [13]. Recently, p53 and Estrogen receptor have also been studied with respect to their likely changes over the course of evolution [14].

Relevance to Human Disease:
Many of the 272 proteins are important for their involvement in human diseases and remain a current focus of research. For example, the low-density lipoprotein (LDL) receptor is vital for regulating the concentration of human lipoprotein, which tracks human fat content. Proteins such as p53, Ras GTPases, Estrogen receptor and 14-3-3 proteins are crucial either in cancer development or in cancer metastasis [15]. Fibroblast growth factors and Neuropilin-1 are vital factors for human cell development. Cytokines and cell adhesion molecules, such as human leukocyte antigen, Tumor necrosis factor and T-cell surface glycoprotein CD4, are important for immunity. Amyloid-beta precursor protein, Microtubule-associated protein tau and the prion protein are crucial in the development of neuronal diseases [16]. Angiotensin-converting enzyme 2 (ACE2) and, more recently, Neuropilin-1 were identified as entry receptors for the coronavirus SARS-CoV-2 [17,18].

In summary, we considered the number of human proteins with fully or partially available medium to high resolution structures among the 20,000 or so canonical human proteins. From this set, we identified the 272 protein structures with the largest number of entries in the PDB. These proteins are also the ones that have been a focus of intense study, either because of their history as model systems or, more recently, because of their high biomedical relevance. Many of these proteins have influenced our understanding of protein structural and functional biology as well as biophysics. The information provided here should be helpful to researchers who are new to protein science, as in a sea of proteins, the top-studied proteins may serve as "Lighthouses" for future investigations. However, our analysis may also interest structural biologists, as a "Stamp in Time", showing how far Protein Science has moved and "the Waters which may lie ahead".