A Peptidoform Matching Strategy in Bottom-up Proteomics for Studying Functions of Posttranslational Modifications

Protein posttranslational modifications (PTMs) generate an enormous, but as yet undetermined, expansion of the expressed proteoforms. In this Viewpoint, we first differentiate the concepts of proteoform and peptidoform by reviewing and discussing previous literature. We show that current PTM biological investigation and annotation largely follow a PTM site-specific rather than a proteoform-specific approach. We further illustrate a potentially useful matching strategy in which a particular “modified peptidoform” is matched to the corresponding “unmodified peptidoform” as a reference for the quantitative analysis between samples and conditions. We suggest this strategy could provide directly relevant information for learning the PTM site-specific biological functions. Accordingly, we advocate for the wider use of the nomenclature “peptidoform” in future bottom-up proteomic studies.


Top-down and Bottom-up Proteomics: Proteoform and Peptidoform.
There are two principal proteomic methodologies for analyzing proteins and their PTM forms: the bottom-up approach and the top-down approach. In a general bottom-up workflow, proteins are digested into peptides using trypsin and then identified by liquid chromatography coupled with tandem MS analysis. Consequently, in the bottom-up approach the connection between the peptides and their proteins of origin is lost during the enzymatic digestion step. This information loss makes resolving PTM forms at the whole-protein level challenging. In contrast, in top-down workflows, intact proteins and protein variants are measured for their masses and then fragmented and identified by tandem MS, allowing a comprehensive characterization of the protein-level molecular composition.
The concept of proteoform is highly relevant for understanding top-down perspectives. The Consortium for Top-down Proteomics has defined the nomenclature "proteoform" to designate all the different molecular forms in which the protein product of a single gene can be found, which include changes due to genetic variations, alternatively spliced RNA transcripts, and posttranslational modifications [10]. In essence, each individual molecular form of expressed proteins is a proteoform [10].
Indeed, the concept of proteoform has greatly sharpened our views on protein diversity. Following the definition of proteoform, a protein that can carry a phosphate moiety at two sites could generate four possible proteoforms: three modified at either or both sites, and one unmodified proteoform. Likewise, the same or different types of PTMs (e.g., PTM crosstalk) at different amino acid residues might decorate a given protein in any combinatorial pattern. Therefore, an exceptional increase in the theoretical number of proteoforms due to PTMs could be expected (Figure 1, upper view). Moreover, certain PTM types such as polyubiquitination and glycosylation can generate additional constitutive variation at the same site. For example, if we consider the variable structural features of oligosaccharides, the number of proteoforms carrying distinct glycan structures will be fairly large. Very interestingly, despite such a theoretical "proteoform explosion", the real number of human proteoforms seems to be much lower, according to a community-level estimation [3]. A total of about one million proteoforms was roughly estimated based on, e.g., the current practice of histone modification analysis, the technical detection threshold, cell type uniqueness, and the practical cellular constraints in controlling the enzymatic writing and maintenance of PTMs. The authors nevertheless acknowledged that a precise estimate of the number of human proteoforms is difficult to provide [2,3]. As such, for a specific protein the pool of all possible proteoforms can be immense [11]. Although top-down studies have proven to be an exceptionally powerful resource for hypothesis-driven research on defined protein targets, there is a fairly long way to go before the top-down technique can routinely detect most (e.g., one million) proteoforms in a sample, due to its currently limited sensitivity, which is still perceived as worse than that of the bottom-up approach [12,13].
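The combinatorial argument above can be made concrete with a small sketch. It assumes that each site is a simple on/off switch, so n sites give 2^n theoretical proteoforms; the site names are hypothetical, and splice variants, multiple PTM types per site, and glycan heterogeneity are deliberately ignored.

```python
from itertools import product


def enumerate_proteoforms(sites):
    """Return every modified/unmodified combination for a list of site names.

    Assumption: each site is either modified or not (binary), so the
    result always contains 2 ** len(sites) entries.
    """
    proteoforms = []
    for states in product([False, True], repeat=len(sites)):
        # Keep only the names of the sites that are switched on.
        modified = tuple(s for s, on in zip(sites, states) if on)
        proteoforms.append(modified)
    return proteoforms


# Hypothetical two-site phosphoprotein.
two_site = enumerate_proteoforms(["S10", "T25"])
print(len(two_site))  # 4
print(two_site)       # [(), ('T25',), ('S10',), ('S10', 'T25')]

# The theoretical space doubles with every additional binary site.
for n in (2, 10, 20):
    print(n, 2 ** n)
```

Under this simplification, a protein with 20 modifiable sites already has more than one million theoretical proteoforms, which illustrates why the community estimate discussed above must rest on biological and technical constraints rather than on pure combinatorics.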
Figure 1. A virtual example of protein X (representing any protein) is shown. This protein X can be modified by different enzymes at different amino acid residues, forming a variety of possible proteoforms carrying acetylated, phosphorylated, or glycosylated sites and even combinatorial PTMs in the cells. Additionally, mRNA alternative splicing could create some truncated proteoforms of X. In bottom-up proteomics, the modified peptidoforms can be measured in PTM-enriched samples. If we take the non-modified peptidoform counterpart (measured by total proteomics) as a comparative reference, we can extract a pair of PTM/non-PTM peptidoforms, irrespective of the total proteoform pool, for interrogating the impact of a site-specific PTM among samples. Such information can be verified by, e.g., western blot, illustrating site-specific PTM functions. Protein structural regions of different colors denote the location of the respective peptidoforms. The phosphorylated S219 represents any PTM site of any type. The red equals sign highlights the proteoform background, which is the same in PTM-enriched and non-PTM measurements.
On the other hand, the concept of peptidoform has been used in several studies [14–24], but not widely. Peptidoform stands for specifically modified or mutated peptides with the same backbone amino acid sequence [20]. In the early days, peptidoform was always mentioned together with proteoform, to deliver the analogous concept at the peptide level. More recently, peptidoform has been used independently in bottom-up studies, such as in a few data-independent acquisition (DIA) MS-based [25,26] quantitative PTM investigations [14,16,20]. For example, we and colleagues have developed Inference of Peptidoforms (IPF), a computational algorithm for confident, systematic identification and quantification of peptidoforms in DIA-MS datasets [20], and applied it to the analysis of plasma proteomes of human twin individuals [20,27].

PTM function: Proteoform-specific or PTM Site-specific?
With the concepts of proteoform and peptidoform clarified above, it is instructive to revisit how we currently study PTM biology. To learn the function of a protein PTM, researchers normally perform experiments to measure its abundance and other properties (e.g., localization, stability, etc.) in a biological process such as drug-induced perturbation or disease development. The researchers would then refer to published literature or PTM databases and sometimes perform new validation experiments to fully establish a functional link between the PTM site and the biological question. Although bottom-up and top-down approaches both support broad, relative proteomic quantification between samples and conditions, what they actually measure are individual (digested) peptides and proteoforms, respectively. In an ideal scenario, if distinctive proteoforms are precisely identified and quantified between biological/clinical samples (see a nice example in Ref. [28]), and with such knowledge being accrued over time, the top-down measurement will pinpoint the different functions of every detectable proteoform. This type of proteoform-specific knowledge would also nicely fit the structural view of protein complexity, because the actions taken by each intact proteoform species in cellular processes should in any case depend on its unique molecular structure.
However, it is crucial to stress that different proteoforms of the same gene can share the same site-specific PTM, whereas many PTMs are currently studied and annotated in a site-specific, rather than a proteoform-specific, manner. For a virtual example, please see Figure 1. Herein, S219 of protein X (which can be any protein) can be phosphorylated by a particular kinase Y, resulting in phosphorylated S219 (i.e., pS219, highlighted in orange). All the proteoform species carrying pS219 (n=7 in this case) might co-exist in the cell after the enzymatic kinase-substrate reaction. Currently, the most concrete functional studies are performed for pS219 itself, but not for any one of the seven proteoforms carrying pS219 (which would otherwise need protein-level separation or purification [29]). Also, the resultant knowledge and annotations are built on pS219 (e.g., it is a substrate of kinase Y) [30], but not on each of the seven proteoforms.
Furthermore, it is intriguing to ask what is really being measured in PTM analyses performed using classic non-MS methods, such as western blotting (WB), immunohistochemistry (IHC), and enzyme-linked immunosorbent assay (ELISA). For these assays, antibodies have to be developed and optimized to target proteins carrying particular PTMs. During antibody production, the immunogen (the part of the protein that the antibody recognizes, e.g., a continuous stretch of amino acids, or the full-length protein) is the key. In the first production of phosphorylation-dependent antibodies twenty years ago, benzyl phosphonate was injected into rabbits to generate antibodies detecting phosphotyrosine-containing proteins [31][WEB1]. Nowadays, phosphosite-specific antibodies are typically generated by immunization with a synthetic phosphopeptide surrounding the phosphosite of interest. Further, a positive selection is normally performed to remove antibodies detecting the non-phosphorylated version. Therefore, whether WB, IHC, or ELISA measures a specific proteoform or multiple proteoforms carrying the same PTM site largely depends on the epitope and selectivity of the particular antibody used. Although in WB the molecular weight (MW) of the protein target is obtained by referring to MW markers, the MW information can be lacking in IHC and ELISA assays. Even in WB, due to the limited MW resolution of electrophoresis, a detected protein band might represent multiple proteoforms with varied molecular compositions (but sharing a PTM site) that are simply too close in MW to be resolved. In this regard, using full-length proteoforms as antigens for the production of high-quality affinity reagents, with methods like phage display, might help to increase the proteoform-level specificity in WB, IHC, and ELISA analyses [3].
Although many current research tools largely follow the PTM site-centric assumption, we want to point out that establishing the link between proteoform species and their functional significance would undoubtedly be a major advance in the future. This will catalyze the fundamental knowledge shift from PTM site-centric to proteoform-centric investigation, because eventually, biology research has to be both precise and mechanistic. Ultimately, the complete primary structures of proteins on a proteome scale will be useful [13,32]. Before that, however, corresponding experimental and informatic paradigms have to be established and widely applied. Recently, interesting workflows have been applied to infer proteoform-dependent functions from, e.g., peptide co-variation analysis across multiple samples and comparisons [33–35]. Emerging bioinformatic annotation tools, such as PTMsigDB, have just started to shift towards PTM site-specific annotation following PTM proteomic profiling [36,37] (which already represents a major conceptual advance compared to the conventional, widely used, gene-specific annotation frameworks [38,39]). Yet, proteoform-specific annotation databases have not been established proteome-wide, mainly due to the lack of data. In this regard, the recent initiative of The Human Proteoform Project is very timely and extremely important for assembling an atlas with more detailed knowledge by creating a comprehensive proteoform index [32].

Peptidoform: the concept revisited.
Due to the above challenges, we reason that the concept of peptidoform may facilitate studying PTMs using bottom-up proteomics (Figure 1, and below). In particular, we propose that the usage of peptidoform should clearly embrace both unmodified and differentially modified peptides that share the same backbone amino acid sequence. The peptidoforms can be generated by trypsin digestion or by other proteases. We further propose that previously used terms such as phosphomodiform [6] can be unified under the nomenclature "peptidoform", because phosphomodiform essentially means phosphorylated peptidoform.
The usage of peptidoform instead of "modified peptides" will a) remove those modified peptides shared by multiple genes, b) create a concept that is analogous to proteoform but describes the peptide-level diversity, and c) most importantly, emphasize the site-specific PTM biology in bottom-up proteomics, considering the enormous number of proteoforms in cells. In this regard, taking phosphorylation as an example, the phosphosite abundance profiling experiments in most previous phosphoproteomic studies, the phosphomodiform thermal stability analysis in Huang et al. and others [6,40,41], and our previous phosphomodiform lifetime study [7] all belong to peptidoform profiling.

A peptidoform matching strategy for relative quantification between samples.
In the present Viewpoint, we would like to suggest that a peptidoform matching strategy could be a powerful approach to study PTM site-specific functions. To elaborate, in our recent study [7], by using pulsed stable isotope labeling by amino acids in cell culture (pSILAC) [42], we performed a pilot phosphoproteome turnover analysis. In particular, we adopted a peptidoform matching strategy that directly interrogates the impact of individual phosphosites on protein turnover. This particular method is referred to as DeltaSILAC (delta determination of turnover rate for modified proteins by SILAC) [7]. In DeltaSILAC, for each site-specific phosphorylation, we determined the lifetime difference between the phosphorylated peptidoform (measured by pSILAC in enriched phosphoproteomes) and the non-phosphorylated peptidoform counterpart (measured by pSILAC in the same peptide samples without phospho-enrichment). This strategy successfully revealed that phosphorylation of the majority of sites increased protein stability in growing HeLa cells, which was not apparent without the matching strategy [7].
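The matching step can be illustrated with a minimal sketch. It is not the DeltaSILAC implementation: the function name, the peptide sequences, and the half-life values are hypothetical, and the delta is assumed here to be a plain difference, whereas the published method may define and model it differently.

```python
def match_peptidoforms(modified, unmodified):
    """Pair each modified peptidoform with its unmodified counterpart.

    modified:   dict mapping (backbone_sequence, site) -> measured value
    unmodified: dict mapping backbone_sequence -> measured value
    Returns a dict (backbone_sequence, site) -> modified minus unmodified.
    Sites whose unmodified counterpart was not quantified are skipped.
    """
    deltas = {}
    for (backbone, site), mod_value in modified.items():
        ref = unmodified.get(backbone)
        if ref is None:
            continue  # no matched reference, so no delta can be computed
        deltas[(backbone, site)] = mod_value - ref
    return deltas


# Hypothetical half-lives in hours; these are not values from the study.
phospho = {
    ("AASPEK", "S3"): 30.0,
    ("LLTPGR", "T3"): 12.0,
    ("GGSYVK", "S3"): 20.0,  # counterpart missing in the non-enriched run
}
non_phospho = {"AASPEK": 24.0, "LLTPGR": 15.0}

deltas = match_peptidoforms(phospho, non_phospho)
for key, delta in deltas.items():
    print(key, delta)
```

In this toy input only two of the three phosphorylated peptidoforms find a reference, so only two deltas are reported. A positive delta would be read as the phosphorylated peptidoform turning over more slowly than its unmodified counterpart.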
The key feature of this peptidoform matching strategy is that the total "proteoform pool" is always the same in both the PTM-enriched and the non-PTM measurements. Taking the example of Protein X again (Figure 1, bottom panel), no matter which and how many proteoform species of X exist in the cells, these proteoforms will generate a common peptide mixture after protease digestion. The bottom-up measurements on the peptidoforms of pS219 and npS219 (non-phosphorylated S219) will only extract two peptidoforms, compare their difference between samples/conditions, and directly infer the impact of S219 phosphorylation. Similarly, this comparison can be applied to all the other PTM sites and types (such as acetylation and glycosylation, see green and pink sites in Figure 1) or even to a simultaneous analysis of multiple PTMs. In the case of PTMs on lysine and arginine residues, proteases other than trypsin might be used to generate the PTM/non-PTM pair of peptidoforms.
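The collapse of many proteoforms into two peptidoforms can be shown with a toy in-silico digestion. The sketch below makes several assumptions: the sequence, the site positions, and the proteoform list are invented, trypsin is modeled by the common rule of cleaving after K or R unless the next residue is P, and missed cleavages and modification-dependent cleavage are ignored.

```python
def trypsin_digest(sequence):
    """Cleave after K or R, except before P (simplified rule).

    Returns a list of (start_position, peptide) with 1-based positions.
    """
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        cleave = aa in "KR" and not is_last and sequence[i + 1] != "P"
        if cleave or is_last:
            peptides.append((start + 1, sequence[start:i + 1]))
            start = i + 1
    return peptides


def peptidoforms_covering(sequence, proteoforms, site):
    """Collect the distinct peptidoforms of the peptide that spans `site`.

    proteoforms: list of dicts mapping 1-based position -> modification name
    """
    start, peptide = next(
        (s, p) for s, p in trypsin_digest(sequence) if s <= site < s + len(p)
    )
    end = start + len(peptide)
    observed = set()
    for mods in proteoforms:
        # Only modifications that fall inside this peptide are retained.
        local = tuple(sorted((pos, m) for pos, m in mods.items() if start <= pos < end))
        observed.add((peptide, local))
    return observed


# Hypothetical 21-residue protein; S13 is the site of interest.
protein_x = "MAGKTEYSLRDDSPELKAQWK"
proteoform_pool = [
    {},
    {13: "phospho"},
    {8: "phospho"},
    {8: "phospho", 13: "phospho"},
    {7: "phospho", 13: "phospho"},
    {7: "phospho", 8: "phospho"},
]

forms = peptidoforms_covering(protein_x, proteoform_pool, site=13)
for peptide, mods in sorted(forms):
    print(peptide, mods)
```

Six proteoforms enter the digestion, but the peptide spanning position 13 appears in only two states, modified and unmodified, because modifications elsewhere in the protein end up on other peptides.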
Of note, the relative quantitative comparison between samples is crucial because the pS219 and npS219 peptidoforms may have different physicochemical properties, resulting in different flyability and response in the mass spectrometer. The relative fold-change of the pS219/npS219 ratio between samples, rather than the pS219/npS219 ratio itself, can provide valuable and relevant information on the "PTM site-specific" biological functions.
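A small numerical sketch shows why the ratio of ratios is the more robust quantity. It assumes that each peptidoform has its own unknown but constant response factor and that the signal is linear in amount; all numbers and names are hypothetical.

```python
def site_fold_change(mod_a, unmod_a, mod_b, unmod_b):
    """Fold-change of the modified/unmodified ratio, sample A versus sample B."""
    return (mod_a / unmod_a) / (mod_b / unmod_b)


# Hypothetical true amounts of the two peptidoforms in two samples.
true_amount = {
    "A": {"p": 20.0, "np": 80.0},
    "B": {"p": 5.0, "np": 95.0},
}
# Assumed peptidoform-specific response factors (unknown in practice).
response = {"p": 0.3, "np": 1.7}

# Measured intensity = true amount * response factor (linear assumption).
measured = {
    sample: {form: amount * response[form] for form, amount in forms.items()}
    for sample, forms in true_amount.items()
}

ratio_true_a = true_amount["A"]["p"] / true_amount["A"]["np"]
ratio_measured_a = measured["A"]["p"] / measured["A"]["np"]
print(ratio_true_a, ratio_measured_a)  # within-sample ratio is distorted

fc_true = site_fold_change(
    true_amount["A"]["p"], true_amount["A"]["np"],
    true_amount["B"]["p"], true_amount["B"]["np"],
)
fc_measured = site_fold_change(
    measured["A"]["p"], measured["A"]["np"],
    measured["B"]["p"], measured["B"]["np"],
)
print(fc_true, fc_measured)  # the response factors cancel between samples
```

The within-sample ratio depends on the two response factors and is therefore not a reliable stoichiometry estimate, whereas the between-sample fold-change is unaffected as long as the response factors stay the same across runs.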
Conceivably, the extensive detection and quantification of the non-PTM peptidoform for each PTM site is crucial for this conceptualized strategy, which can be difficult for PTM sites of high stoichiometry. Indeed, in the DeltaSILAC study, although we estimated lifetimes for ~13,000 phosphorylated peptidoforms, we only measured lifetimes for 2,100 phos-/non-phos-peptidoform pairs [7]. Fortunately, the technical barrier to identifying non-PTM peptides is currently lower than that for PTM analysis. Normally, the amounts of the non-PTM samples are less limited, allowing for, e.g., peptide-level fractionation to increase the coverage of unmodified peptidoforms. As a final relevant note, our studies have suggested that, if the non-PTM protein-level reference is used to match the quantitative results of the PTM peptidoform, ~three times more PTM sites could be matched. This solution is of course less ideal: it impaired the turnover measurement [7] but seemed to be acceptable for the abundance profiling of PTM sites among steady-state cells [43].

The blind man feels an elephant: to take a part for the whole.
Previously, a variety of experimental and bioinformatic strategies have been developed with the purpose of properly normalizing PTM proteomic data using the protein-level results (or analyzing the two jointly), which we regret that we do not have space to cite and discuss here. This Viewpoint summarizes some of our understanding of the important concepts about peptide, protein, and PTM detection by MS, rather than providing a comprehensive review of the relevant topics.
To conclude, it seems that, even with the high resolving power provided by modern mass spectrometers, the complete, high-throughput quantification of all or most proteoforms might remain formidable in the near future. We conclude that peptidoform measurement and a peptidoform matching strategy can currently provide direct and relevant information to study the "PTM site-specific" biological functions, even if one is somehow (unfortunately) "blind" to the "proteoform universe" in the sample. This "taking part for the whole" strategy will likely work well for quantitative PTM analysis for a fairly long time. We advocate that the bottom-up community consider using the nomenclature "peptidoform" more often.
In essence, we argue that the "bottom-up" strategy pioneered by Eng and Yates [44] about 25 years ago does not just measure peptide surrogates, but the peptidoforms collapsed from the huge pool of proteoforms in a given sample. In fact, the relative quantification of a "peptidoform" between different conditions and samples provides quantitative information for understanding the PTM site-specific function, which many biologists care about and have studied for many years.