Identification and Validation of Human Missing Proteins and Peptides

Oct 5, 2017 - Identification and Validation of Human Missing Proteins and Peptides in Public Proteome Databases: Data Mining Strategy. Amr Elguoshy†...

Article

Identification and validation of human missing proteins and peptides in public proteome databases; Data mining strategy. Amr El Guoshy, Yoshitoshi Hirao, BO XU, Suguru Saito, Ali F. Quadery, Keiko Yamamoto, Toshiaki Mitsui, and Tadashi Yamamoto J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00423 • Publication Date (Web): 05 Oct 2017 Downloaded from http://pubs.acs.org on October 6, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Identification and validation of human missing proteins and peptides in public proteome databases: Data mining strategy.
Amr Elguoshy1, 2, 3, Yoshitoshi Hirao1, Bo Xu1, Suguru Saito1, Ali F. Quadery1, Keiko Yamamoto1, Toshiaki Mitsui2, Tadashi Yamamoto1* and Chromosome X project team of JProS (chair: Y. Ishihama).
1 Biofluid and Biomarker Center, Niigata University, Japan; 2 Graduate School of Science and Technology, Niigata University, Japan; 3 Biotechnology Department, Faculty of Agriculture, Al-Azhar University, Egypt
Correspondence to Tadashi Yamamoto, Biofluid Biomarker Center (BBC), Institute for Social Innovation and Cooperation, Niigata University, 8050, Ikarashi 2-no-cho, Nishi-ku, Niigata, 950-2181, Japan. Tel: +81-252626850. E-mail: [email protected]

Abstract: To help complete the Human Proteome Project (HPP), the Chromosome-Centric Human Proteome Project (C-HPP) began investigating missing proteins (MPs) in 2012. However, 2,579 and 572 protein entries in neXtProt (2017-1) are still classified as missing and uncertain proteins, respectively. In this study, we propose a pipeline to analyze, identify and validate human missing and uncertain proteins in open-access transcriptomics and proteomics databases. Analysis of RNA expression patterns for missing proteins in the Human Protein Atlas showed that 28% of them, such as olfactory receptor 1I1 (O60431), had no RNA expression, suggesting the need to consider uncommon tissues in transcriptomic and proteomic studies. Interestingly, 21% had an elevated expression level in a particular tissue (tissue-enriched proteins), indicating the importance of targeting such proteins in the tissues where they are elevated. Additionally, analysis of RNA expression levels showed that 95% of missing proteins had no or low expression (0-10 transcripts per million), indicating that low abundance is one of the major obstacles to detecting missing proteins. Moreover, missing proteins are predicted to generate fewer unique tryptic peptides than identified proteins. Searching for the predicted unique tryptic peptides of missing and uncertain proteins in the experimental peptide lists of open-access MS-based databases (PA, GPM) detected 402 missing and 19 uncertain proteins with at least 2 unique peptides (>= 9 aa) at < 5e-4 % FDR. Finally, matching the native spectra of the experimentally detected peptides with their SRMAtlas synthetic counterparts from three transition sources (QQQ, QTOF, QTRAP) allowed us to validate 41 missing proteins by >= 2 proteotypic peptides.
Keywords: HPP, missing proteins, uncertain proteins, PA, GPM, SRMAtlas

1 ACS Paragon Plus Environment


Introduction: The Chromosome-Centric Human Proteome Project (C-HPP) is one of the main programs of the Human Proteome Project (HPP), which aims to provide evidence for all proteins encoded by the human genome1, 2. The neXtProt database has been consolidated as the primary integrative knowledge database of human proteins3. neXtProt uses a multilevel rating scale (PE1-PE5) to assign experimental evidence to each protein: PE1 refers to proteins with experimental evidence at the protein level; PE2 to proteins with experimental evidence at the transcript level; PE3 to proteins inferred from homology; PE4 to predicted proteins; and PE5 to uncertain or dubious entries. Proteins at the PE2, PE3 and PE4 levels are considered the “missing proteins”4, 5. neXtProt (release 01.2017) listed 20,159 proteins, of which 17,008 (84.4% of protein entries) were annotated as identified proteins (PE1). The numbers of missing (PE2-4) and uncertain (PE5) proteins were 2,579 and 572, corresponding to 12.8% and 2.8% of the total entries in neXtProt, respectively.
Several explanations have been proposed for why PE2-4 proteins are missing. First, they are not expressed at all6, or they have a restricted expression pattern, being expressed only in a specific tissue, disease state or developmental stage7. Second, they are widely expressed but at a low level (low abundance proteins)7. Third, they are abundantly expressed but do not contain any tryptic cleavage site or generate only nested peptides8. Fourth, they have specific physicochemical properties such as high hydrophobicity9 or contain several transmembrane domains (TMD)4. Fifth, they are erroneously annotated at the genomic level, which results in incorrectly predicted protein sequences6, or they carry special modifications (PTM)10 or undergo special molecular processing events (MPE)8 that are not considered in the database search. Finally, they have already been identified in one of the proteomics repositories but have not yet been recognized by the proteomics community. Given these obstacles to the detection of missing proteins, numerous initiatives have been launched worldwide to search for and detect them. For example, Vandenbrouck et al. identified 206 missing and 4 uncertain proteins with at least two unique distinct peptides (≥ 9 amino acids) in spermatozoa (tissue-specific expression)11. Wei et al. detected 74 low abundance missing proteins in testis using high-resolution MS (QE-HF)12. In addition, Kitata et al. identified 74 missing membrane proteins (highly hydrophobic) at a protein-level FDR of 1% using Hp-RP StageTip prefractionation of membrane-enriched samples from 11 NSCLC cell lines13. Garin-Muga et al. identified 102 missing proteins of chromosome 16 at a protein-level FDR of 1% by data mining of 65 PRIDE human projects14.
Moreover, PeptideAtlas (PA)15 and the Global Proteome Machine (GPM)16 are well-curated protein expression databases that reprocess the data submitted to ProteomeXchange17 18, 19 according to their own unified analysis and validation workflows. neXtProt, the reference knowledge database for the HPP, collects MS-based peptide lists only from PA, after filtering through a threshold20, and not from other resources such as GPM, although GPM might be considered another good source for hunting missing proteins.


In addition, SRMAtlas represents an excellent source for missing protein validation, as it covers 99.7% of the human proteome by selected reaction monitoring (SRM)21. Thus, we propose a workflow that, first, investigates why PE2-5 proteins are missing by analyzing RNA sequencing data and predicting sequence features of the human proteome, highlighting the differences among missing, uncertain and identified proteins. Second, it tests the hypothesis that missing and uncertain proteins have already been identified elsewhere but have not been recognized by the proteomics community, by looking for unique tryptic peptides of these proteins in more proteomic resources than those considered by neXtProt. Third, it verifies the identified unique peptides corresponding to missing and uncertain proteins by examining the similarity of their fragment ion profiles to those of synthetic counterparts collected from SRMAtlas.

Materials and Methods:
1. Datasets:
1.1 Protein sequences dataset
The human protein sequence dataset “NeXtProt_all.peff.gz” was retrieved from the neXtProt database (release 2017-01-23, https://www.neXtProt.org). It contains 20,159 entries in 42,135 isoforms produced by alternative splicing. According to the updated protein existence annotations for human proteins, 17,008 proteins were annotated as PE1, 1,939 as PE2, 563 as PE3, 77 as PE4 and 572 as PE5 in the neXtProt database.
1.2 RNA expression dataset
The human RNA expression pattern dataset “proteinatlas.tsv.zip” was downloaded from the Human Protein Atlas (HPA, version 17) website; it contains RNA expression patterns for human genes in 37 human tissues. In HPA, the RNA expression patterns of putative protein-coding genes are classified, based on transcriptomic analysis across all major organs and tissue types, into 6 categories: (1) genes expressed in all tissues (housekeeping genes); (2) genes with an mRNA expression level in a particular tissue at least five-fold higher than in other tissues (tissue enriched genes); (3) genes with an mRNA expression level at least five-fold higher in a group of 2-7 tissues (group enriched genes); (4) genes with an expression level in a particular tissue at least five-fold higher than the average level in all tissues (tissue enhanced genes); (5) genes expressed in several, but not all, tissues (mixed genes); and (6) genes not detected in any tissue (not detected)22. In addition, the human RNA expression level dataset “rna_tissue.tsv.zip”, which contains RNA expression levels in TPM for human genes in 37 tissues, was downloaded from HPA. Moreover, the human RNA expression level dataset “GTEx_Analysis_v6p_RNAseq_RNA-SeQCv1.1.8_gene_median_rpkm.gct.gz”, which contains RNA expression levels as the median RPKM by tissue for human genes in 53 different tissues, was downloaded from the GTEx (V6P) website.
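As a quick consistency check on the protein existence counts quoted in section 1.1, the short sketch below recomputes the totals and percentages given in the Introduction. It uses only the numbers stated in the text; the variable and function names are illustrative.

```python
# Protein-existence (PE) counts quoted in the text for neXtProt release 2017-01-23.
pe_counts = {"PE1": 17008, "PE2": 1939, "PE3": 563, "PE4": 77, "PE5": 572}

total = sum(pe_counts.values())
# "Missing" proteins are the PE2-PE4 entries; "uncertain" proteins are PE5.
missing = pe_counts["PE2"] + pe_counts["PE3"] + pe_counts["PE4"]
uncertain = pe_counts["PE5"]


def percent(part, whole):
    """Share of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    "total": total,
    "missing": missing,
    "identified_pct": percent(pe_counts["PE1"], total),
    "missing_pct": percent(missing, total),
    "uncertain_pct": percent(uncertain, total),
}
print(summary)
```

The five PE counts add up to the 20,159 entries, and PE2 to PE4 add up to the 2,579 missing proteins reported in the text.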


Additionally, we downloaded “nextprot_ensg.txt” from the neXtProt database, which contains the corresponding Ensembl ID for all neXtProt proteins.

1.3 MS-based peptide datasets:
1.3.1 PeptideAtlas (PA) dataset: 1,222,862, 135,761 and 289,980 peptides, corresponding to the Human 2017-01 build, the Human Phosphoproteome 2017-01 build and the Human Testis 2017-01 build, respectively, were retrieved from the PA website “db.systemsbiology.net/sbeams/cgi/PeptideAtlas/GetPeptides”. The downloaded peptides were annotated as follows: best probability from PeptideProphet, number of observations, empirical proteotypic score, SSRCalc relative hydrophobicity, number of samples, number of protein mappings, number of genome locations, is exon spanning, is subpeptide of, protease ids, atlas build name.
1.3.2 Global Proteome Machine (GPM): All human peptides were retrieved from the PEPTIDE database (http://peptides.thegpm.org/~/peptides_by_species/) on February 2, 2017. This dataset contains 1,431,208 peptides. The downloaded peptides were annotated as follows: number of observations at charge 1, charge 2, charge 3 and charge 4, and the minimum observed E-value for the peptide.
1.4 SRMAtlas dataset: SRM transitions from a variety of data sources (QTOF, Agilent QQQ, QTrap5500, IonTrap and Predicted) for 166,174 human peptides covering 99.7% of the human proteome21 were retrieved from SRMAtlas. These peptides were annotated as follows: source, precursor m/z, precursor charge, fragment ion m/z, fragment ion charge, fragment ion type, relative intensity of the peak in the CID spectrum, etc.
2. Analysis of neXtProt canonical proteins at RNA and protein levels.
2.1 At RNA level: The Ensembl identifier was used as a non-redundant ID in the tables downloaded from HPA and GTEx to look up the corresponding RNA sequencing annotations for all canonical protein sequences in the neXtProt database (20,159 canonical proteins, version 2017-01). These data provide an opportunity to design proteomics experiments in a particular tissue or cell type based on RNA sequencing data.
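The lookup described in 2.1 amounts to a join on the Ensembl gene identifier. The following is a minimal sketch of that idea, not the authors' script: the two-column mapping layout, the `Gene`/`Tissue`/`TPM` column names and all identifiers are hypothetical placeholders, and the real files may be laid out differently.

```python
import csv
import io

# Toy stand-ins for the mapping file and the RNA table (assumed layouts, made-up IDs).
MAPPING_TSV = "NX_TEST1\tENSG_TEST1\nNX_TEST2\tENSG_TEST2\n"
RNA_TSV = (
    "Gene\tTissue\tTPM\n"
    "ENSG_TEST2\ttestis\t35.0\n"
    "ENSG_TEST2\tliver\t4.5\n"
)


def load_mapping(handle):
    """Return {ensembl_gene_id: protein_accession} from a two-column TSV."""
    return {row[1]: row[0] for row in csv.reader(handle, delimiter="\t") if len(row) >= 2}


def tpm_by_protein(mapping, handle):
    """Return {accession: {tissue: TPM}}; proteins without RNA rows get an empty dict."""
    table = {accession: {} for accession in mapping.values()}
    for row in csv.DictReader(handle, delimiter="\t"):
        accession = mapping.get(row["Gene"])
        if accession is not None:
            table[accession][row["Tissue"]] = float(row["TPM"])
    return table


mapping = load_mapping(io.StringIO(MAPPING_TSV))
expression = tpm_by_protein(mapping, io.StringIO(RNA_TSV))
print(expression)
```

Keeping an empty entry for proteins with no RNA rows makes the "no RNA expression" group explicit rather than silently dropping it.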
2.2 At protein level: Protein molecular weight (MW), isoelectric point (pI) and number of isoforms per gene were calculated for each canonical protein sequence in the neXtProt database (20,159 canonical proteins).
3. Analysis of all neXtProt proteins (including isoforms) at peptide level


3.1 Isoform merging: In the neXtProt database, all protein products encoded by one gene share a primary accession number, followed by a dash and a number. Therefore, in our analysis workflow, all isoforms with the same primary accession number were merged. As a result, 42,135 isoforms were merged into 20,159 protein clusters, each represented by the primary accession number of its members.
3.2 In silico trypsinization: Using an in-house Perl script, the 42,135 isoforms belonging to the 20,159 protein clusters were trypsinized in silico. The trypsinization rule was cleavage after K or R (not followed by P), with up to 2 missed cleavage sites allowed.
3.3 Uniqueness of in silico trypsinized peptides in the human proteome: Peptide uniqueness was evaluated at the level of the protein cluster, treating leucine (L) and isoleucine (I) as equivalent to guarantee confidence in uniqueness. In contrast, L and I were not treated as equivalent when searching for the same peptide in the peptide databases.
3.4 Splitting the human unique peptidome dataset: All human unique peptides were split into three sub-datasets: identified, missing and uncertain peptides, corresponding to identified, missing and uncertain human proteins in the neXtProt database (2017-01-23 release), respectively.
4. Searching for missing peptides and filtration
The unique peptides corresponding to the missing and uncertain proteins were searched against the peptides retrieved from PA and GPM. The peptides resulting from the search step were then filtered using the following 2 criteria:
4.1 Highly validated missing and uncertain peptides, obtained by setting a threshold on the peptide best probability and the E-value for peptides detected in PA and GPM, respectively. In detail, PA uses PeptideProphet to compute the posterior probability of each individual peptide-spectrum match (PSM).
This probability can be used to estimate the local FDR because posterior probability and FDR are complementary23; for example, a posterior probability of 9.9e-1 corresponds to < 5e-4 % local FDR. From this perspective, we used the strictest criterion (best probability = 1) for missing peptide identifications, which means that no peptides were identified in the decoy database at this level (local FDR < 5e-4 %). GPM uses X! Tandem E-value scores to check the validity of each PSM. These scores can be used to estimate the corresponding FDR according to the FDR calculations of Balgley et al., who calculated the false discovery rate corresponding to the expectation or probability score distributions of four different search engines (Mascot, Sequest, OMSSA, and X! Tandem). They showed that at an X! Tandem E-value score < [...] (> 50) TPM or RPKM in HPA and GTEx, respectively (Figure 2A, B), indicating that low abundance is one of the main obstacles to the identification of missing proteins.


Notably, the median expression level of missing, uncertain and identified proteins in testis is higher than the median in other tissues (Figure 1B, C). Interestingly, most human missing proteins were expressed in testis. In contrast, in other tissues such as skeletal muscle, many missing proteins have no expression (Figure 2A, B). These results suggest that the likelihood of detecting missing proteins is higher in testis than in other tissues (Figure 1B, C). On the other hand, analysis of mRNA expression in testis may be difficult, since most cells in testis have a much lower mRNA/DNA ratio than other cells, so the possibility of DNA contamination in RNA expression studies should be kept in mind.

1.2 At the protein level:
1.2.1 MW and pI
The mean MW of identified proteins was 66.6 kDa (median = 49.9 kDa), whereas the mean MW of missing and uncertain proteins was 43.8 kDa (median = 35.34 kDa) and 28.5 kDa (median = 19.2 kDa), respectively (Figure 3A). Generally, low molecular weight proteins tend to generate an insufficient number of detectable unique tryptic peptides, which decreases the likelihood of identifying such proteins. The mean pI of identified proteins was 7.2 (median = 6.8), whereas the mean pI of missing and uncertain proteins was 8.1 (median = 8.6) and 8.2 (median = 8.7), respectively (Figure 3B), indicating that a high proportion of missing and uncertain proteins tend to be more basic than identified proteins. Notably, highly basic proteins tend to generate high fragment charge states (> 3) during fragmentation, which are difficult to detect using the most common fragmentation technology (CID). Terry et al. reported the same finding using the difference between basic and acidic fractions instead of pI calculations9.
1.2.2 Isoform numbers: Our hypothesis was that proteins which have 2 isoforms show a tendency to be identified at the protein level through any of them. For example, the Q8IX05 protein has 2 isoforms (Q5JTY5-1, Q5JTY5-2). Q5JTY5-1 is the canonical one, which does not contain any tryptic proteotypic peptide. However, Q5JTY5-2 could be identified by one proteotypic peptide (GPDDILLGMFYDTDVPYK), considering all isoforms in neXtProt as our search space and treating L and I as equivalent. Based on this hypothesis, we compared identified, missing and uncertain proteins in terms of the number of isoforms per gene. The average number of isoforms per gene was 2.09, 1.40 and 1.09 for the identified, missing and uncertain groups, respectively, which decreases the likelihood of identifying missing and uncertain proteins, bearing in mind that missing and uncertain proteins also tend to be low MW proteins. Also, missing genes possessing more than one isoform (26% of all missing) may be more likely to be identified than those with only one isoform at the same protein length (74% of all missing) (Figure 3C). In conclusion, the missing proteins tend to be of low abundance, small in molecular weight and to have a modest number of isoforms compared to the identified proteins.
1.3 At the peptide level
To increase the likelihood and validity of missing protein identifications, we considered all isoforms generated from each gene and not only the canonical one. Therefore, all neXtProt isoforms with the same primary accession number were collected in one entry or cluster. As a result, 42,135 isoforms were merged into 20,159 protein clusters, each represented by the primary accession number. In silico trypsinization of these isoforms then generated 687,475, 1,075,798 and 1,188,861 peptides with 0, 1 and 2 missed cleavages, respectively. This calculation considered all distinct peptides, without any peptide length limitation, after removing redundancy and assuming that leucine and isoleucine are equivalent (Supplementary 1; Table 2). Notably, 7 proteins had no tryptic cleavage sites: two missing (Q9BYP8, Q9HC47), three uncertain (O15225, Q6NVV0, Q9NRI6) and two identified proteins (P60329, Q156A1) (Supplementary 1; Table 2). Considering only peptides with 0 missed cleavages (687,475 tryptic peptides), we noticed that 29 protein clusters (containing 29 isoforms) have no MS-observable tryptic peptides (7-40 aa): 16 protein clusters (16 isoforms) in the identified group, 9 protein clusters (9 isoforms) in the missing group and 4 protein clusters (4 isoforms) in the uncertain group (Supplementary 1; Table 3). In addition, 260 protein clusters (containing 279 isoforms) have no observable unique tryptic peptides (7-40 aa): 127 protein clusters (137 isoforms) in the identified group, 118 protein clusters (126 isoforms) in the missing group and 15 protein clusters (16 isoforms) in the uncertain group (Supplementary 1; Table 4).
Moreover, 291 protein clusters (containing 319 isoforms) have only one observable unique tryptic peptide (7-40 aa): 161 protein clusters (184 isoforms) in the identified group, 97 protein clusters (102 isoforms) in the missing group and 33 protein clusters (33 isoforms) in the uncertain group (Supplementary 1; Table 5).
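The digestion rule stated in the Methods (cleavage after K or R unless followed by P, with up to 2 missed cleavages) can be sketched as follows. The authors used an in-house Perl script that is not shown in the text, so this Python version is only an illustrative reconstruction of the stated rule; the function names and the toy sequence are hypothetical.

```python
def cleavage_fragments(sequence):
    """Split after K or R unless the next residue is P (fully cleaved pieces)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # C-terminal piece that does not end in K or R.
        fragments.append(sequence[start:])
    return fragments


def digest(sequence, max_missed=2):
    """Return {missed_cleavage_count: set of distinct peptides} for 0..max_missed."""
    fragments = cleavage_fragments(sequence)
    peptides = {n: set() for n in range(max_missed + 1)}
    for n in range(max_missed + 1):
        # A peptide with n missed cleavages is n + 1 consecutive fragments joined.
        for i in range(len(fragments) - n):
            peptides[n].add("".join(fragments[i:i + n + 1]))
    return peptides


toy = "MKAARPGGKLLR"  # made-up sequence, not a real protein
result = digest(toy)
print(result)
```

In the toy sequence the R at position 5 is followed by P, so no cleavage occurs there and the fully cleaved pieces are MK, AARPGGK and LLR. Storing peptides in sets mirrors the "distinct peptides after removing redundancy" counting described above.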


For proteins with few or no observable unique tryptic peptides, using other proteases such as chymotrypsin, individually or in combination, might therefore increase the likelihood of identification. The distribution of tryptic cleavage sites (0 missed cleavages) for each protein entry showed that missing proteins had a lower density and fewer MS-observable tryptic peptides (7-40 aa, 0 missed cleavages) than identified proteins of the same protein length (normalized peptide density = tryptic peptide count of each protein * 100 / protein length in amino acid residues) (Figure 4C). Investigating the uniqueness of all generated tryptic peptides showed that the unique peptides (2,826,644) tended to be longer than the shared ones (125,490) (Figure 4B). In addition, missing proteins tended to generate fewer unique tryptic peptides (7-40 aa) than identified proteins of the same protein length (Figure 4D). There was also a trend towards more unique peptides and fewer shared peptides as the number of missed cleavages increased (Figure 4A). In the preceding sections, we analyzed neXtProt proteins at different levels (RNA level, protein level, and tryptic peptide level), highlighting the differences among identified, missing and uncertain proteins. In particular, at the peptide level, we generated the predicted unique tryptic peptides corresponding to missing and uncertain proteins. The experimental counterparts of these predicted peptides might exist in one of the MS-based peptide databases such as PA and GPM. Therefore, the next step is to search for these predicted peptides in the PA and GPM experimental peptide lists.
2. Investigating the existence of missing and uncertain proteins (searching workflow)
2.1. Peptide identifications
According to our searching workflow (Figure 5), 42,135 entries in the neXtProt database were merged into 20,159 protein entries with all isoforms. Considering isoforms may increase the likelihood of missing protein identification by targeting the unique peptides of all isoforms of each missing protein. In silico trypsinization was then applied to the 42,135 isoforms in the 20,159 protein entries, generating 2,550,191 tryptic peptides. Among tryptic peptides with a length of at least 9 amino acids and up to 2 missed cleavages, 165,859 and 21,781 unique peptides corresponding to the missing and uncertain proteins, respectively, were taken into consideration for the peptide database search (Supplementary 1_Table 1).
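The cluster-level uniqueness test described in the Methods (isoforms merged by primary accession, I and L treated as equivalent, peptides of at least 9 aa) might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: the accessions and peptides are invented, and the helper names are hypothetical.

```python
from collections import defaultdict


def primary_accession(isoform_id):
    """'A0TEST-2' -> 'A0TEST': isoforms share the accession before the dash."""
    return isoform_id.split("-", 1)[0]


def unique_peptides(peptides_by_isoform, min_length=9):
    """Map each cluster to the peptides found in no other cluster (I treated as L)."""
    owners = defaultdict(set)     # I/L-normalised peptide -> clusters containing it
    originals = defaultdict(set)  # (cluster, normalised peptide) -> original spellings
    for isoform_id, peptides in peptides_by_isoform.items():
        cluster = primary_accession(isoform_id)
        for peptide in peptides:
            if len(peptide) < min_length:
                continue
            key = peptide.replace("I", "L")
            owners[key].add(cluster)
            originals[(cluster, key)].add(peptide)
    result = defaultdict(set)
    for key, clusters in owners.items():
        if len(clusters) == 1:
            (cluster,) = clusters
            result[cluster] |= originals[(cluster, key)]
    return dict(result)


# Invented isoforms and peptides, for illustration only.
toy = {
    "A0TEST-1": ["AAAAILAAAK", "GGGGGGGGGR"],
    "A0TEST-2": ["AAAAILAAAK", "TTTTTTTTTK"],
    "B0TEST-1": ["AAAALLAAAK", "SSSSSSSSSR", "SHORTK"],
}
unique = unique_peptides(toy)
print(unique)
```

The original I/L spelling is kept in the output because, as stated in the Methods, L and I were not treated as equivalent when the peptides were looked up in the peptide databases. A peptide shared by two isoforms of the same cluster still counts as unique to that cluster.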


Searching for the unique missing and uncertain peptides in the peptide lists of PA and GPM detected 3,537 distinct peptides corresponding to 1,013 missing and 92 uncertain proteins (Figure 5). In detail, 440 and 384 peptides corresponding to 396 and 68 missing and uncertain proteins, respectively, were identified in PA at best probability >= 9.92e-1, which corresponds to < [...] (>= 9 aa) (Figure 6A, B), whereas 404 and 56 missing and uncertain proteins, respectively, were identified by only one unique peptide.
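The protein-level grouping used here (proteins supported by at least two unique peptides of >= 9 aa versus those supported by a single peptide) can be sketched as a simple tally. The input format, identifiers and peptides below are hypothetical; score thresholds are assumed to have been applied before this step.

```python
from collections import defaultdict


def tally_support(detections, min_length=9, min_peptides=2):
    """Return (proteins with >= min_peptides distinct peptides, remaining proteins)."""
    by_protein = defaultdict(set)
    for protein, peptide in detections:
        if len(peptide) >= min_length:
            by_protein[protein].add(peptide)  # a set ignores repeated observations
    multi = {p for p, peps in by_protein.items() if len(peps) >= min_peptides}
    single = set(by_protein) - multi
    return multi, single


# Invented (protein, detected unique peptide) pairs.
detections = [
    ("MP_TEST1", "AAAAAAAAAK"),
    ("MP_TEST1", "GGGGGGGGGR"),
    ("MP_TEST1", "AAAAAAAAAK"),
    ("MP_TEST2", "TTTTTTTTTK"),
    ("MP_TEST2", "SHORTK"),
]
multi, single = tally_support(detections)
print(multi, single)
```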


3. Validation of missing and uncertain peptides
Searching the 2,746 detected missing and uncertain peptides against the 166,174 SRMAtlas peptides returned only 1,130 peptides corresponding to missing and uncertain proteins. We therefore downloaded the precursor ion and fragment ion data for all synthetic counterparts of these 1,130 peptides from the SRMAtlas repository for 3 different transition sources (QTOF, QQQ, and QTRAP). For the natural spectra of the detected missing and uncertain peptides, we downloaded spectra for only 600 missing and uncertain peptides from GPM. We could not consider all 1,130 peptides with counterparts in SRMAtlas because the task is time-consuming: the best spectrum match has to be selected manually for each detected peptide, and spectra with fewer modifications have to be preferred to match the simple synthetic peptides. Then, (1) we matched the precursor ion data (m/z and charge) of the natural GPM spectra with the precursor ion data from the different sources in the SRMAtlas database, and (2) for the precursor ions matched in step 1, we checked the similarity of their fragment ion profiles (m/z and intensity). Finally, 238 and 8 natural precursor ions from GPM, corresponding to 214 missing and 6 uncertain peptides, respectively, showed a good match of their fragment ion m/z patterns with those of the synthetic spectra in SRMAtlas (Supplementary 3_Tables 1-6) (Figure 7A, B). At the protein level, these validated peptides correspond to 149 missing and 6 uncertain proteins, providing experimental validation of 41 missing proteins in the current study by >= 2 proteotypic peptides (Supplementary 3_Tables 4).
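The two-step comparison above (precursor m/z and charge first, then fragment ion m/z) was performed manually by the authors. Purely as an illustration of the logic, a programmatic version could look like the sketch below; the tolerances, the dictionary layout and all m/z values are assumptions, and intensity similarity is not modelled.

```python
def precursor_matches(natural, synthetic, mz_tol=0.05):
    """Same charge and precursor m/z within an absolute tolerance (assumed value)."""
    return (natural["charge"] == synthetic["charge"]
            and abs(natural["mz"] - synthetic["mz"]) <= mz_tol)


def matched_fragment_fraction(natural_fragments, synthetic_fragments, mz_tol=0.5):
    """Fraction of synthetic fragment m/z values also present in the natural spectrum."""
    if not synthetic_fragments:
        return 0.0
    hits = sum(
        any(abs(s - n) <= mz_tol for n in natural_fragments)
        for s in synthetic_fragments
    )
    return hits / len(synthetic_fragments)


# Made-up spectra: one natural (GPM-like) and one synthetic (SRMAtlas-like) record.
natural = {"mz": 650.32, "charge": 2, "fragments": [175.1, 288.2, 401.3, 900.0]}
synthetic = {"mz": 650.33, "charge": 2, "fragments": [175.12, 288.20, 401.29, 514.40]}

same_precursor = precursor_matches(natural, synthetic)
fraction = matched_fragment_fraction(natural["fragments"], synthetic["fragments"])
print(same_precursor, fraction)
```

Only spectra passing the precursor check would go on to the fragment comparison; what fraction counts as a good match is a judgment the text leaves to manual inspection.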

4. Annotations of identified missing and uncertain proteins in this study.
Further investigation of the 402 newly identified missing proteins (>= 2 proteotypic peptides) showed that these proteins have a low RNA expression level (average = 6.1 TPM). Notably, 103 proteins (more than 25% of them) have a very low expression level across all 37 HPA tissues (average < 1 TPM) (Supplementary 2_Table 9). Moreover, the average length of these proteins is 492 aa, and 77 of them are very short proteins (< 250 amino acids). In contrast, no highly hydrophobic proteins exist in this list (maximum GRAVY = 0.7). More than 36% of these proteins have more than one isoform (Supplementary 2_Table 10). Moreover, for some of the missing proteins, the RNA sequencing data in HPA agreed with the proteomic data in PA and GPM. For example, the Q99811 protein is a tissue enhanced protein in skin and other tissues, and it has already been identified in the same tissue in the GPM peptide list by 2 proteotypic peptides (THYPDAFVREELAR, KNFSVSHLLDLEEVAAAGR) from the “PXD003414” study. Similarly, the “Q6UXN7” protein is a tissue-enriched protein in testis, and it has already been identified in the same tissue by PA and GPM: one proteotypic peptide, “MGIQHLGNALLVCEQPR”, was identified in both, whereas the other peptide, “AEEQGTQLWDPTK”, was identified exclusively in GPM by the “PXD004785” study. In other cases, the RNA data in HPA did not correspond to the proteomics data in PA and GPM. For example, the Q9NS40 protein is a tissue enriched protein in cerebral cortex according to HPA and a tissue enhanced protein in testis according to GTEx. However, it has already been detected in ovarian cancer cell lines by GPM through 3 proteotypic peptides, “ASSVHDIEGFGVHPK”, “KLSFESEGEKENSTNDPEDSADTIR” and “SSSFISSIDDEQKPLFSGIVDSSPGIGK”, in the “PXD000901” study.
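For the hydrophobicity annotation in this section (maximum GRAVY = 0.7), GRAVY is conventionally the mean Kyte-Doolittle hydropathy over all residues of a sequence. The text does not state which tool produced the values, so the sketch below is only an assumed, minimal way of computing such a score; the test sequences are arbitrary.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


hydrophobic = gravy("AILV")  # arbitrary hydrophobic stretch
basic = gravy("KR")          # arbitrary basic dipeptide
print(hydrophobic, basic)
```

Positive values indicate an overall hydrophobic sequence and negative values a hydrophilic one.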

Conclusion
In the current research, we applied a data mining strategy to identify and validate missing proteins by searching the missing peptides against the peptide lists of the PA and GPM databases. This strategy identified more than 15% of missing and uncertain proteins by at least 2 unique peptides (>= 9 aa). In addition, another 15% of missing and uncertain proteins were identified by only one unique peptide (>= 9 aa). Also, by using the SRMAtlas data to validate missing peptides, 238 and 8 precursor ions corresponding to 214 missing and 6 uncertain peptides, respectively, were found to have the same peptide fragment ion m/z patterns in both GPM and SRMAtlas. These validated peptides correspond to 149 missing and 6 uncertain proteins, indicating that 41 missing proteins have been validated in the current study by >= 2 proteotypic peptides.

Acknowledgment: The study was supported in part by the Center of Innovation Program from the Japan Science and Technology Agency (JST) and by the Research Grant for Kidney Disease from the Masanori Katagiri Foundation to TY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.


References
(1) Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30, (3), 221-3.
(2) Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; Beretta, L.; Bergeron, J.; Borchers, C. H.; Corthals, G. L.; Costello, C. E.; Deutsch, E. W.; Domon, B.; Hancock, W.; He, F.; Hochstrasser, D.; Marko-Varga, G.; Salekdeh, G. H.; Sechi, S.; Snyder, M.; Srivastava, S.; Uhlen, M.; Wu, C. H.; Yamamoto, T.; Paik, Y. K.; Omenn, G. S., The human proteome project: current state and future direction. Mol Cell Proteomics 2011, 10, (7), M111 009993.
(3) Gaudet, P.; Michel, P. A.; Zahn-Zabal, M.; Britan, A.; Cusin, I.; Domagalski, M.; Duek, P. D.; Gateau, A.; Gleizes, A.; Hinard, V.; Rech de Laval, V.; Lin, J.; Nikitin, F.; Schaeffer, M.; Teixeira, D.; Lane, L.; Bairoch, A., The neXtProt knowledgebase on human proteins: 2017 update. Nucleic Acids Res 2017, 45, (D1), D177-D182.
(4) Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S., Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res 2014, 13, (1), 15-20.
(5) Horvatovich, P.; Vegvari, A.; Saul, J.; Park, J. G.; Qiu, J.; Syring, M.; Pirrotte, P.; Petritis, K.; Tegeler, T. J.; Aziz, M.; Fuentes, M.; Diez, P.; Gonzalez-Gonzalez, M.; Ibarrola, N.; Droste, C.; De Las Rivas, J.; Gil, C.; Clemente, F.; Hernaez, M. L.; Corrales, F. J.; Nilsson, C. L.; Berven, F. S.; Bischoff, R.; Fehniger, T. E.; LaBaer, J.; Marko-Varga, G., In Vitro Transcription/Translation System: A Versatile Tool in the Search for Missing Proteins. J Proteome Res 2015, 14, (9), 3441-51.
(6) Horvatovich, P.; Lundberg, E. K.; Chen, Y. J.; Sung, T. Y.; He, F.; Nice, E. C.; Goode, R. J.; Yu, S.; Ranganathan, S.; Baker, M. S.; Domont, G. B.; Velasquez, E.; Li, D.; Liu, S.; Wang, Q.; He, Q. Y.; Menon, R.; Guan, Y.; Corrales, F. J.; Segura, V.; Casal, J. I.; Pascual-Montano, A.; Albar, J. P.; Fuentes, M.; Gonzalez-Gonzalez, M.; Diez, P.; Ibarrola, N.; Degano, R. M.; Mohammed, Y.; Borchers, C. H.; Urbani, A.; Soggiu, A.; Yamamoto, T.; Salekdeh, G. H.; Archakov, A.; Ponomarenko, E.; Lisitsa, A.; Lichti, C. F.; Mostovenko, E.; Kroes, R. A.; Rezeli, M.; Vegvari, A.; Fehniger, T. E.; Bischoff, R.; Vizcaino, J. A.; Deutsch, E. W.; Lane, L.; Nilsson, C. L.; Marko-Varga, G.; Omenn, G. S.; Jeong, S. K.; Lim, J. S.; Paik, Y. K.; Hancock, W. S., Quest for Missing Proteins: Update 2015 on Chromosome-Centric Human Proteome Project. J Proteome Res 2015, 14, (9), 3415-31.
(7) Omenn, G. S., The strategy, organization, and progress of the HUPO Human Proteome Project. J Proteomics 2014, 100, 3-7.
(8) Elguoshy, A.; Magdeldin, S.; Xu, B.; Hirao, Y.; Zhang, Y.; Kinoshita, N.; Takisawa, Y.; Nameta, M.; Yamamoto, K.; El-Refy, A.; El-Fiky, F.; Yamamoto, T., Why are they missing? : Bioinformatics characterization of missing human proteins. J Proteomics 2016, 149, 7-14.
(9) Farrah, T.; Deutsch, E. W.; Hoopmann, M. R.; Hallows, J. L.; Sun, Z.; Huang, C. Y.; Moritz, R. L., The state of the human proteome in 2012 as viewed through PeptideAtlas. J Proteome Res 2013, 12, (1), 162-71.
(10) Shiromizu, T.; Adachi, J.; Watanabe, S.; Murakami, T.; Kuga, T.; Muraoka, S.; Tomonaga, T., Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J Proteome Res 2013, 12, (6), 2414-21.
(11) Vandenbrouck, Y.; Lane, L.; Carapito, C.; Duek, P.; Rondel, K.; Bruley, C.; Macron, C.; Gonzalez de Peredo, A.; Coute, Y.; Chaoui, K.; Com, E.; Gateau, A.; Hesse, A. M.; Marcellin, M.; Mear, L.; Mouton-Barbosa, E.; Robin, T.; Burlet-Schiltz, O.; Cianferani, S.; Ferro, M.; Freour, T.; Lindskog, C.; Garin, J.; Pineau, C., Looking for Missing Proteins in the Proteome of Human Spermatozoa: An Update. J Proteome Res 2016, 15, (11), 3998-4019.
(12) Wei, W.; Luo, W.; Wu, F.; Peng, X.; Zhang, Y.; Zhang, M.; Zhao, Y.; Su, N.; Qi, Y.; Chen, L.; Zhang, Y.; Wen, B.; He, F.; Xu, P., Deep Coverage Proteomics Identifies More Low-Abundance Missing Proteins in Human Testis Tissue with Q-Exactive HF Mass Spectrometer. J Proteome Res 2016, 15, (11), 3988-3997.
(13) Kitata, R. B.; Dimayacyac-Esleta, B. R.; Choong, W. K.; Tsai, C. F.; Lin, T. D.; Tsou, C. C.; Weng, S. H.; Chen, Y. J.; Yang, P. C.; Arco, S. D.; Nesvizhskii, A. I.; Sung, T. Y.; Chen, Y. J., Mining Missing Membrane Proteins by High-pH Reverse-Phase StageTip Fractionation and Multiple Reaction Monitoring Mass Spectrometry. J Proteome Res 2015, 14, (9), 3658-69.
(14) Garin-Muga, A.; Odriozola, L.; Martinez-Val, A.; Del Toro, N.; Martinez, R.; Molina, M.; Cantero, L.; Rivera, R.; Garrido, N.; Dominguez, F.; Sanchez Del Pino, M. M.; Vizcaino, J. A.; Corrales, F. J.; Segura, V., Detection of Missing Proteins Using the PRIDE Database as a Source of Mass Spectrometry Evidence. J Proteome Res 2016, 15, (11), 4101-4115.
(15) Deutsch, E. W., The PeptideAtlas Project. Methods Mol Biol 2010, 604, 285-96.
(16) Craig, R.; Cortens, J. P.; Beavis, R. C., Open source system for analyzing, validating, and storing protein identification data. J Proteome Res 2004, 3, (6), 1234-42.
(17) Vizcaino, J. A.; Deutsch, E. W.; Wang, R.; Csordas, A.; Reisinger, F.; Rios, D.; Dianes, J. A.; Sun, Z.; Farrah, T.; Bandeira, N.; Binz, P. A.; Xenarios, I.; Eisenacher, M.; Mayer, G.; Gatto, L.; Campos, A.; Chalkley, R. J.; Kraus, H. J.; Albar, J. P.; Martinez-Bartolome, S.; Apweiler, R.; Omenn, G. S.; Martens, L.; Jones, A. R.; Hermjakob, H., ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol 2014, 32, (3), 223-6.
(18) Jones, P.; Cote, R. G.; Martens, L.; Quinn, A. F.; Taylor, C. F.; Derache, W.; Hermjakob, H.; Apweiler, R., PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res 2006, 34, (Database issue), D659-63.
(19) Farrah, T.; Deutsch, E. W.; Kreisberg, R.; Sun, Z.; Campbell, D. S.; Mendoza, L.; Kusebauch, U.; Brusniak, M. Y.; Huttenhain, R.; Schiess, R.; Selevsek, N.; Aebersold, R.; Moritz, R. L., PASSEL: the PeptideAtlas SRMexperiment library. Proteomics 2012, 12, (8), 1170-5.
(20) Baker, M. S.; Ahn, S. B.; Mohamedali, A.; Islam, M. T.; Cantor, D.; Verhaert, P. D.; Fanayan, S.; Sharma, S.; Nice, E. C.; Connor, M.; Ranganathan, S., Accelerating the search for the missing proteins in the human proteome. Nat Commun 2017, 8, 14271.
(21) Kusebauch, U.; Campbell, D. S.; Deutsch, E. W.; Chu, C. S.; Spicer, D. A.; Brusniak, M. Y.; Slagel, J.; Sun, Z.; Stevens, J.; Grimes, B.; Shteynberg, D.; Hoopmann, M. R.; Blattmann, P.; Ratushny, A. V.; Rinner, O.; Picotti, P.; Carapito, C.; Huang, C. Y.; Kapousouz, M.; Lam, H.; Tran, T.; Demir, E.; Aitchison, J. D.; Sander, C.; Hood, L.; Aebersold, R.; Moritz, R. L., Human SRMAtlas: A Resource of Targeted Assays to Quantify the Complete Human Proteome. Cell 2016, 166, (3), 766-78.
(22) Uhlen, M.; Oksvold, P.; Fagerberg, L.; Lundberg, E.; Jonasson, K.; Forsberg, M.; Zwahlen, M.; Kampf, C.; Wester, K.; Hober, S.; Wernerus, H.; Bjorling, L.; Ponten, F., Towards a knowledgebased Human Protein Atlas. Nat Biotechnol 2010, 28, (12), 1248-50.
(23) Nesvizhskii, A. I., A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010, 73, (11), 2092-123.
(24) Balgley, B. M.; Laudeman, T.; Yang, L.; Song, T.; Lee, C. S., Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy. Mol Cell Proteomics 2007, 6, (9), 1599-608.
(25) Schaeffer, M.; Gateau, A.; Teixeira, D.; Michel, P. A.; Zahn-Zabal, M.; Lane, L., The neXtProt peptide uniqueness checker: a tool for the proteomics community. Bioinformatics 2017.
(26) Carithers, L. J.; Moore, H. M., The Genotype-Tissue Expression (GTEx) Project. Biopreserv Biobank 2015, 13, (5), 307-8.

Figures:
Figure 1 | RNA expression data for 3 different protein groups: proteins experimentally validated at the protein level (PE1), missing proteins (PE2:4) and uncertain proteins (PE5) in 37 HPA tissues and 53 GTEx tissues. A, Distribution of the RNA expression pattern in HPA. B, Boxplot of RNA expression level in HPA. C, Boxplot of RNA expression level in GTEx.
Figure 2 | A, Barplot showing the number of proteins expressed in 37 HPA tissues at different expression levels in TPM. B, Barplot showing the number of proteins expressed in 53 GTEx tissues at different expression levels in RPKM.
Figure 3 | Physicochemical properties and isoform frequency for 3 different protein groups: proteins experimentally validated at the protein level (PE1), missing proteins (PE2:4) and uncertain proteins (PE5). A, Boxplot of protein MW for the 3 protein groups. B, Boxplot of protein pI for the 3 protein groups. C, Count of isoforms per gene as a percentage for the 3 protein groups.
Figure 4 | Frequency of tryptic peptide uniqueness, distribution of tryptic peptide length and distribution of tryptic peptide count per protein for all neXtProt proteins. A, Bar plot showing the count of unique and shared tryptic peptides for 0, 1 and 2 missed cleavages considering all peptide lengths. B, Density plot showing the distribution of tryptic peptide length of unique and shared peptides for 0, 1 and 2 missed cleavages considering all peptide lengths. C, D, Density plots comparing the distribution of all tryptic peptide count per protein and unique tryptic peptide count per protein, respectively (normalized by protein length), between proteins experimentally validated at the protein level (PE1), missing proteins (PE2:4) and uncertain proteins (PE5), considering only 0-missed-cleavage peptides and MS-observable peptides (7-40 aa).
Figure 5 | Bioinformatics workflow for missing protein detection and identification. The main steps in this workflow are: A, merging; B, in silico trypsinization; C, investigating peptide uniqueness; D, splitting; E, matching missing and uncertain peptides; F, 1st peptide filtration; G, 2nd peptide filtration; H, peptide validation using SRMAtlas data.
Figure 6 | Missing and uncertain peptide and protein identification. A, Venn diagram showing the overlap of matched missing and uncertain peptides in PA and GPM. B, Missing and uncertain proteins identified by at least 2 unique peptides (>= 9 aa) in PA and GPM.
Figure 7 | Validation of matched missing peptides using SRMAtlas data. A, Validation of Krueppel-like factor 7 (KLF7) protein (O75840-PE2) using 2 proteotypic peptides (SSAVDILLSR, TSQTLSAVDGTVTLK). B, Validation of Zinc finger protein 182 (ZNF182) protein (P17025-PE2) using 2 proteotypic peptides (IPFWNFPEVCQVDEQIER, HESFGNNMVDNLDLFSR).

Supplementary 1:
Table 1: neXtProt canonical proteins that do not contain any tryptic cleavage sites.
Table 2: Statistics of unique and shared peptides generated from in silico trypsinization of neXtProt proteins (including all isoforms).
Table 3: neXtProt proteins (including all isoforms) that do not generate any MS-searchable tryptic peptides (7-40 aa) considering only 0-missed-cleavage peptides.
Table 4: neXtProt proteins (including all isoforms) that do not generate any MS-searchable UNIQUE tryptic peptides (7-40 aa) considering only 0-missed-cleavage peptides.
Table 5: neXtProt proteins (including all isoforms) that generate only one MS-searchable UNIQUE predicted tryptic peptide considering only 0-missed-cleavage peptides.

Supplementary 2:
Table 1: Summary of missing and uncertain peptides matched to the PA peptide list at different best probability thresholds.
Table 2: Summary of missing and uncertain peptides matched to the GPM peptide list at different E-value thresholds and their corresponding peptide and protein FDR thresholds according to reference 24 in the manuscript.
Table 3: Missing proteins identified in PA with 2 proteotypic peptides (>= 9 aa). All peptide identifications have best probability = 1, which corresponds to < 5e-4 % peptide local FDR.
Table 4: Uncertain proteins identified in PA with 2 proteotypic peptides (>= 9 aa). All peptide identifications have best probability = 1, which corresponds to < 5e-4 % peptide local FDR.
Table 5: Missing proteins identified in GPM with more than one proteotypic peptide (>= 9 aa). All peptide identifications have E-value threshold [...]
[...] (>= 9 aa). All peptide identifications have E-value threshold [...]
[...] (>= 9 aa); one from PA and the other from GPM. All peptide identifications from PA have best probability = 1, which corresponds to < 5e-4 % peptide local FDR. All peptide identifications from GPM have E-value threshold [...]
[...] (>= 9 aa); one from PA and the other from GPM. All peptide identifications from PA have best probability = 1, which corresponds to < 5e-4 % peptide local FDR. All peptide identifications from GPM have E-value threshold [...]

[Figure 5 | Workflow diagram.
A, Isoforms merging: neXtProt, 20,159 protein clusters.
B, Trypsinization (missed cleavages = 2; peptide length >= 9; redundancy removed within cluster): 2,550,191 neXtProt peptides.
C, Uniqueness (selection of unique peptides): 2,409,562 neXtProt unique peptides.
D, Splitting: 2,221,922 identified unique peptides (PE=1) and 187,640 missing and uncertain unique peptides (PE=2:5).
E, Matching against PeptideAtlas peptides (Ver. 1701; all: 1,222,862; phospho: 135,761; testis: 289,980) and GPM peptides (Ver. 170205; 1,431,208): 3,537 matched peptides, assigned to 1,013 missing and 92 uncertain proteins.
F, 1st filtration (peptide identification at 0% FDR): 3,047 peptides, assigned to 864 missing and 83 uncertain proteins.
G, 2nd filtration (peptide uniqueness checker in neXtProt): 2,746 peptides, assigned to 806 missing and 75 uncertain proteins.
H, Peptide validation using SRMAtlas data.]
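Steps B and C of the workflow (in silico trypsinization with up to 2 missed cleavages, a minimum length of 9 aa, and selection of peptides unique to one protein cluster) can be sketched as follows. This is an illustrative reading of the diagram, not the authors' code: the function names, the cluster dictionary layout and the toy sequences are hypothetical, and the sketch assumes plain cleavage after every K or R with no proline exception, since the source does not state the cleavage rule.

```python
from collections import defaultdict


def tryptic_peptides(sequence, missed_cleavages=2, min_length=9):
    """Return the set of tryptic peptides with up to `missed_cleavages` missed sites."""
    # Assumption: cleave after every K or R (no proline rule).
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])

    peptides = set()
    for i in range(len(fragments)):
        # Joining k + 1 consecutive fragments corresponds to k missed cleavages.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def unique_peptides(clusters, **digest_options):
    """Map each peptide found in exactly one cluster to that cluster's identifier.

    clusters: dict mapping a cluster identifier to a list of isoform sequences.
    Peptides repeated among isoforms of the same cluster still count as unique.
    """
    owners = defaultdict(set)
    for cluster_id, isoforms in clusters.items():
        for sequence in isoforms:
            for peptide in tryptic_peptides(sequence, **digest_options):
                owners[peptide].add(cluster_id)
    return {pep: next(iter(ids)) for pep, ids in owners.items() if len(ids) == 1}


# Toy clusters (hypothetical sequences) sharing one N-terminal peptide.
toy = {"A": ["AAAAAAAAKCCCCCCCCCR"], "B": ["AAAAAAAAKDDDDDDDDDR"]}
print(sorted(unique_peptides(toy)))
```

Collecting peptides per cluster in a set mirrors the "remove redundancy within cluster" note in the diagram, so only sharing between different clusters makes a peptide non-unique.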

[Figure 6A | Venn diagrams of proteotypic peptides matched in PA and GPM, for missing and uncertain peptides. Values shown: 179, 2088, 110, 1, 131, 62.
Figure 6B | Proteins identified by > 1 proteotypic peptide in PA, GPM and PA-GPM, for missing and uncertain proteins. Values shown: 176, 41, 360, 21, 1, 17, 17.]

[Figure 7A | O75840. Figure 7B | P17025. Fragment ion labels: b6-672.27, b12-679.79.]
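The fragment labels recovered for Figure 7 (b6-672.27, b12-679.79) can be checked against theoretical b-ion m/z values. The sketch below computes them from standard monoisotopic residue masses; it appears consistent with the P17025 peptide HESFGNNMVDNLDLFSR if b12 is read as doubly charged, but that charge assignment, the absence of modifications, and the function name `b_ion_mz` are assumptions rather than statements from the source.

```python
# Standard monoisotopic residue masses in Da (unmodified residues only).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of a proton in Da


def b_ion_mz(peptide, n, charge=1):
    """Return the m/z of the b_n ion: the first n residues plus `charge` protons."""
    if not 1 <= n < len(peptide):
        raise ValueError("n must satisfy 1 <= n < len(peptide)")
    neutral = sum(MONOISOTOPIC[aa] for aa in peptide[:n])
    return (neutral + charge * PROTON) / charge


peptide = "HESFGNNMVDNLDLFSR"  # proteotypic peptide of P17025 named in the Figure 7 caption
print(round(b_ion_mz(peptide, 6), 2))             # compare with the label b6-672.27
print(round(b_ion_mz(peptide, 12, charge=2), 2))  # compare with b12-679.79 (2+ assumed)
```

A validation in the spirit of the study would repeat this comparison for every fragment ion observed in GPM and in SRMAtlas and require the m/z patterns to agree within the instrument tolerance.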