Informatics View on the Challenges of Identifying Missing Proteins


Letter

Informatics view on the challenges of identifying missing proteins from shotgun proteomics. Wai-Kok Choong, Hui-Yin Chang, Ching-Tai Chen, Chia-Feng Tsai, Wen-Lian Hsu, Yu-Ju Chen, and Ting-Yi Sung. J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00482 • Publication Date (Web): 09 Nov 2015




Journal of Proteome Research

Informatics view on the challenges of identifying missing proteins from shotgun proteomics

Wai-Kok Choong‡,1, Hui-Yin Chang‡,1,2,3, Ching-Tai Chen‡,1, Chia-Feng Tsai4, Wen-Lian Hsu1, Yu-Ju Chen4,*, Ting-Yi Sung1,*

1 Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan; 2 Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 11529, Taiwan; 3 Institute of Biomedical Informatics, National Yang-Ming University, Taipei 11221, Taiwan; 4 Institute of Chemistry, Academia Sinica, Taipei, Taiwan

ABSTRACT

Experimental evidence at the protein level from mass spectrometry and antibody experiments is essential to characterize the human proteome. neXtProt (2014-09 release) reported 20,055 human proteins, including 16,491 proteins identified at the protein level and 3,564 proteins unidentified. Excluding 616 proteins at the uncertain level, 2,948 proteins were regarded as missing proteins. Missing proteins remain unidentified partly because of MS limitations and intrinsic properties of the proteins, e.g., appearing only in specific diseases or tissues. Beyond such reasons, it is desirable to explore issues affecting validation of missing proteins from an “ideal” shotgun analysis of the human proteome. We thus performed in silico digestions of the human proteins to generate all in silico fully-digested peptides. With these presumed peptides, we investigated the identification of proteins without any unique peptide, the effect of sequence variants on protein identification, and difficulties in identifying olfactory receptors and highly similar proteins. Among all proteins with evidence at the transcript level, G protein-coupled receptors and olfactory receptors, based on InterPro classification, were the largest protein families and exhibited more frequent variants. For identifying missing proteins, these analyses suggest including sequence variants in the protein FASTA used for database searching. Furthermore, unique peptides identified in MS experiments would be crucial evidence for experimentally validating missing proteins.

KEYWORDS

Missing Proteins, Proteomics, Bioinformatics, Peptide identification, Protein inference

INTRODUCTION

The Chromosome-Centric Human Proteome Project1 (C-HPP) is an international project organized by the Human Proteome Organization (HUPO). It aims to discover and characterize all human proteins encoded by genes, in order to fill the gap between genomics and proteomics2. With the advent of mass spectrometry (MS) and antibody technologies, human proteins predicted from coding genes can be experimentally validated. However, a portion of human proteins still lack confident protein expression evidence and are thus considered missing. To clarify the human proteome, the current main theme of C-HPP is to identify missing proteins3-6.

neXtProt7 is a well-curated knowledge base of human proteins that is used to determine the missing protein list for C-HPP5. The neXtProt database (2014-09 release) contained 20,055 human proteins, of which 16,491 proteins had protein evidence (PE) at the protein level (PE1), 2,647 proteins at the transcript level (PE2), 214 proteins at the homology level (PE3), 87 proteins at the prediction level (PE4), and 616 proteins at the “uncertain” or “dubious” level (PE5). Proteins at PE2-5 are not confirmed by protein expression evidence. Since proteins at PE5 are uncertain or dubious, the 2,948 proteins at PE2-4 were regarded as missing proteins5.

Several reasons why missing proteins remain unidentified in biological samples or MS experiments have been reported in the literature. First, the detection capability of mass spectrometry constrains the detection of several kinds of proteins, e.g., low-abundance proteins4, 6, hydrophobic proteins8, 9, membrane proteins5, 8, 10, 11, proteins without trypsin cleavage sites8, and very small proteins with fewer than 50 amino acids11. Second, some proteins may be located in unusual tissues or cell types, or appear only at certain developmental stages (e.g., embryo or fetus), in specific disease states, or in specific cellular stress responses5, 6, 8. Another noteworthy reason relates to database sequence searching in shotgun proteomics analysis, which is frequently used to identify proteins. For example, a representative protein is sometimes chosen to represent a group of highly homologous or identical protein sequences in the protein database, and therefore a large number of proteins, e.g., immunoglobulins and cytokeratins, are missed in identification5, 6.

Though the MS-based draft maps of the human proteome recently reported by two independent groups12, 13 marked a milestone in profiling the human proteome, the human proteins in neXtProt have not been fully identified by shotgun proteomics analyses, probably due to the aforementioned reasons. Therefore, there is a need to explore an “ideal” shotgun analysis of the human proteome in order to pinpoint informatics issues that may affect protein identification and strategies to validate missing proteins. To fulfill this need, we conducted a systematic


analysis based on in silico digestions of the human proteins to obtain all in silico fully-digested peptides, as if they were all experimentally identified. With all of these presumed peptides, we investigated the informatics limitations of identifying missing proteins from current shotgun proteomics.

First, we analyzed the effect of shared peptides on protein identification. Shared peptides occur in several proteins and result in ambiguous protein identification14, 15, affecting protein-level evidence. In contrast, unique peptides match a single protein in the protein database and can lead to protein identification without ambiguity. Noting 145 proteins, including ten uncertain proteins at PE5, without any in silico unique peptide under the three digestions using trypsin, Lys-C and both, we revisited the experimental evidence of the 58 of these proteins annotated with PE1 evidence.

Second, to improve peptide and protein identifications, information on sequence variants, such as single nucleotide polymorphisms and somatic mutations, has been used in shotgun proteomics analyses16-21. However, variants are often neglected in the canonical sequences provided by protein sequence databases, which may cause proteins to be missed. We particularly examined the ten largest families and domains with PE2 evidence. We then compared the sequence variations of proteins with PE2 and with PE1 evidence in these families and domains to verify whether sequence variation could be a factor affecting identification of missing proteins in these families and domains.

Finally, as shared peptides can affect protein inference, we analyzed homologous proteins among the human proteins to investigate identification of highly similar proteins, particularly the extreme cases with 100% similarity. Based on the above in silico analyses, different strategies, e.g., antibody technology, top-down proteomics, different enzymes for protein digestion, improvement of database sequence searching by including sequence variants in protein


FASTA, can be considered in the design of experiments to experimentally validate the missing proteins.

METHODS

Databases - Human proteins in UniProt22 (20,189 proteins, 2014-09 release) and neXtProt (20,055 proteins, 2014-09 release) were used for the analyses in this paper. Human Protein Atlas23 (HPA) (version 13) was used for antibody-based evidence; PeptideAtlas24 (2015-03) and SRMAtlas25 (2014-08) were used for MS-based information.

The 20,189 human proteins were digested using six different enzymes: trypsin, Lys-C, the combination of trypsin and Lys-C (denoted as trypsin+Lys-C), chymotrypsin (high specificity), Glu-C, and Lys-N. Lys-N cleaves at the N-terminal side of K. The other enzymes cleave at the C-terminal side of the following amino acids: R and K (both not before P) for trypsin; R (not before P) and K for trypsin+Lys-C; K for Lys-C; F, Y and W (all three not before P) for chymotrypsin (high specificity rule defined in Expasy Peptide Cutter); and E and D for Glu-C. All the digestions followed the cleavage specificity rules provided by Expasy Peptide Cutter and Keil’s rules26. No miscleavage was allowed in these in silico digestions.

Sequence Variants – Sequence variants of the human proteins in UniProt were obtained from neXtProt (2014-09), in which the sequence variant annotations were derived from the dbSNP27 and COSMIC28 databases. Sequence variants occurring at the same position of a protein sequence were counted only once when calculating the amino acid length per sequence variant.

RESULTS AND DISCUSSION


Analysis of protein identification based on in silico fully-digested unique and shared peptides of human proteins

In shotgun proteomics, proteins are proteolytically digested into peptides, separated by liquid chromatography and then analyzed by mass spectrometry. Using currently available database search tools, such as Mascot29, SEQUEST30 and X!Tandem31, digested peptides can be identified and assembled into proteins. The choice of enzyme for protein digestion determines which peptides are cleaved from each protein. In order to obtain all possible peptides for protein identification, we simulated three in silico digestions of the 20,189 human proteins in UniProt, using the commonly used trypsin, Lys-C and the combination of both (denoted as trypsin+Lys-C). Table 1 lists the number of “in silico fully digested peptides” (simply called peptides for convenience) of the 20,189 proteins using different enzymes. As shown in Table 1, 621,501 unique peptides and 639,361 shared peptides were obtained using trypsin+Lys-C tandem digestion, both higher than the numbers obtained using trypsin or Lys-C alone. As trypsin is the most frequently used enzyme, we noted that 49.77% (607,642 out of 1,220,895) of tryptic peptides were shared peptides. This result is similar to a previous study, which reported that 53% of tryptic peptides in the human proteins of the International Protein Index (89,486 proteins) were shared by more than one protein32. After removing all redundant peptides, the numbers of shared peptides were dramatically reduced (as shown in Table 1). For example, after removing the redundant tryptic peptides, the number of shared tryptic peptides decreased from 607,642 to 48,675. In other words, the 607,642 shared peptide occurrences in the trypsin digestion mixture were repeats of only 48,675 distinct peptides. A similar phenomenon was observed for Lys-C and trypsin+Lys-C cleaved peptides.
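The counting described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names and the toy protein identifiers and sequences are hypothetical, and only the cleavage rules (taken from Methods, with no missed cleavages) come from the text. A peptide is treated as shared when it occurs in more than one protein, and the "distinct" counts correspond to the numbers after removing redundant peptides.

```python
import re
from collections import defaultdict

# Cleavage rules as stated in Methods (C-terminal side, no missed cleavages).
CLEAVAGE = {
    "trypsin": r"(?<=[KR])(?!P)",            # after K or R, not before P
    "lys-c": r"(?<=K)",                      # after K
    "trypsin+lys-c": r"(?<=K)|(?<=R)(?!P)",  # after K, or after R not before P
}

def digest(sequence, enzyme="trypsin"):
    """Return the fully digested peptides of one protein sequence, in order."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [p for p in re.split(CLEAVAGE[enzyme], sequence) if p]

def count_unique_and_shared(proteins, enzyme="trypsin"):
    """proteins: dict mapping a (hypothetical) protein id to its sequence."""
    owners = defaultdict(set)  # peptide -> ids of the proteins containing it
    occurrences = []           # every digested peptide, redundancy kept
    for pid, seq in proteins.items():
        for pep in digest(seq, enzyme):
            owners[pep].add(pid)
            occurrences.append(pep)
    unique = [p for p in occurrences if len(owners[p]) == 1]
    shared = [p for p in occurrences if len(owners[p]) > 1]
    return {
        "unique": len(unique),
        "shared": len(shared),                # counted with redundancy
        "distinct_unique": len(set(unique)),
        "distinct_shared": len(set(shared)),  # after removing redundancy
    }

# Toy input: three made-up sequences that all start with the peptide "MK".
toy = {"P1": "MKAAARPLLK", "P2": "MKGGGR", "P3": "MKTTTTK"}
counts = count_unique_and_shared(toy, "trypsin")
```

In this toy input the peptide MK occurs in all three proteins, so it contributes three shared occurrences but only one distinct shared peptide, which mirrors the reduction from 607,642 to 48,675 reported for trypsin.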
We then analyzed the peptide length of unique and shared peptides under the three digestions. The length distributions of unique and


shared peptides under trypsin digestion are shown in Figure 1, and the distributions under Lys-C and trypsin+Lys-C digestions are shown in Figures S-1 and S-2 in the Supporting Information. Overall, unique peptides were longer than shared peptides in all three digestions. In addition, most of the unique peptides had a sequence length of at least 7, which is consistent with the peptide length required for identifying missing proteins. Besides the UniProt human proteome, we also conducted a similar digestion analysis on the human proteome from the Ensembl database, taking protein isoforms into consideration. In both databases, the number of shared peptides became higher and the number of unique peptides became lower when each protein isoform was regarded as an individual sequence (Table S-1, Supporting Information).

Protein inference based on all shared peptides could be inconclusive – In order to use the PE annotation in neXtProt, we mapped the 20,189 proteins in UniProt to the proteins in neXtProt. As a result, 20,053 proteins in UniProt matched entries in neXtProt and were used for the following analysis; two neXtProt entries, P0CB43 (PE1) and Q8N8A4 (PE2), were not in UniProt and were excluded. Comparing the non-redundant unique peptides generated by the three digestion methods, 18,329 unique peptides were additionally obtained using trypsin+Lys-C digestion, whereas 31,571 and 246,906 unique peptides were additionally obtained using trypsin and Lys-C, respectively, as shown in Figure 2 (A). On the other hand, 19,908 of the 20,053 proteins had at least one unique peptide from any of the three digestions, whereas 145 proteins had no unique peptide, i.e., all of their peptides were shared (Figure 2 (B)). Detailed information on the 20,053 and 145 proteins is listed in Tables S-2 and S-3 in the Supporting Information, respectively.
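The comparison behind Figure 2 (B) can be sketched as follows, under the assumption that a protein counts as potentially identifiable by an enzyme when at least one of its fully digested peptides occurs in no other protein. The helper names and the toy sequences are hypothetical; the cleavage rules are those given in Methods.

```python
import re
from collections import defaultdict

RULES = {
    "trypsin": r"(?<=[KR])(?!P)",
    "lys-c": r"(?<=K)",
    "trypsin+lys-c": r"(?<=K)|(?<=R)(?!P)",
}

def digest(seq, enzyme):
    """Fully digest one sequence (no missed cleavages); needs Python 3.7+."""
    return [p for p in re.split(RULES[enzyme], seq) if p]

def proteins_with_unique_peptide(proteins, enzyme):
    """Ids of proteins that have at least one peptide found in no other protein."""
    owners = defaultdict(set)
    for pid, seq in proteins.items():
        for pep in digest(seq, enzyme):
            owners[pep].add(pid)
    return {next(iter(pids)) for pids in owners.values() if len(pids) == 1}

def identifiable_by_enzyme(proteins):
    """Map each enzyme to the set of proteins it could identify unambiguously."""
    return {enzyme: proteins_with_unique_peptide(proteins, enzyme) for enzyme in RULES}

# Toy input: P1 is the concatenation of P2 and P3; P5 and P6 are identical.
toy = {
    "P1": "AAARGGGK", "P2": "AAAR", "P3": "GGGK",
    "P4": "CCCK", "P5": "DDDKEEER", "P6": "DDDKEEER",
}
by_enzyme = identifiable_by_enzyme(toy)
never_unique = set(toy) - set().union(*by_enzyme.values())
```

Here Lys-C rescues P1, P2 and P3, which have only shared tryptic peptides, while the identical pair P5 and P6 has no unique peptide under any enzyme, the same situation as the identical sequences among the 145 proteins.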
As shown in Figure 2 (B), using trypsin digestion could possibly identify 1 protein


more than using Lys-C digestion; on the other hand, using Lys-C digestion could identify 50 proteins complementary to those identified using trypsin digestion. The tandem digestion of trypsin and Lys-C did not add any identifiable protein beyond those obtained with either enzyme alone. A similar analysis restricted to peptides with 6-40 amino acids33 was also performed (Figure S-3, Supporting Information). With this peptide length limitation, the number of proteins with no unique peptide increased from 145 to 229.

Among the 145 proteins without any unique peptide, 58 proteins had protein evidence at PE1 (protein), 38 at PE2 (transcript), 32 at PE3 (homology), 7 at PE4 (predicted), and 10 at PE5 (uncertain) based on neXtProt annotation; thus 77 (38+32+7) proteins were regarded as missing proteins. A concern could be raised about the validity of the PE1 evidence of the 58 proteins, since they consist of only shared peptides. We manually examined the MS information of the 58 proteins reported in neXtProt and PeptideAtlas. All peptides used for identifying the 58 proteins were shared peptides (see also Table S-4 in the Supporting Information). For example, Annexin A8 (P13928, Vascular anticoagulant-beta) was validated at the PE1 level by Burkard et al.34 using mass spectrometry (neXtProt release 2014-09; UniProt release 2015-10), in which 17 out of 43 in silico digested peptides were identified by mass spectrometry, as shown in Figure 3 and Table S-5 in the Supporting Information. On further manual examination of the PeptideAtlas and SRMAtlas databases, we observed that all of the 17 peptides were shared peptides. Annexin A8 was inferred from the 17 shared peptides, which yielded a large sequence coverage of the protein.
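The ambiguity of an inference that rests only on shared peptides can be made concrete with a small sketch. The greedy cover below is an illustration chosen here, not a method used in this paper, and the peptide-to-protein map is hypothetical: it asks whether the observed peptides of a target protein can all be accounted for by other proteins.

```python
from collections import defaultdict

def alternative_explanation(target, observed, owners):
    """Greedily choose proteins other than `target` that together contain every
    observed peptide. Returns the chosen ids, or None when at least one observed
    peptide occurs in no other protein (i.e., it is unique to `target`)."""
    covers = defaultdict(set)  # other protein -> observed peptides it contains
    for pep in observed:
        for pid in owners.get(pep, ()):
            if pid != target:
                covers[pid].add(pep)
    remaining, chosen = set(observed), []
    while remaining:
        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(covers), key=lambda pid: len(covers[pid] & remaining), default=None)
        if best is None or not covers[best] & remaining:
            return None
        chosen.append(best)
        remaining -= covers[best]
    return chosen

# Hypothetical peptide-to-protein map; "T" is the protein under scrutiny.
owners = {
    "MK": {"T", "A", "B"},
    "AAARPLLK": {"T", "A"},
    "GGGR": {"T", "B"},
    "TTTTK": {"T"},
}
only_shared = alternative_explanation("T", ["MK", "AAARPLLK", "GGGR"], owners)
with_unique = alternative_explanation("T", ["MK", "TTTTK"], owners)
```

When every observed peptide is shared, some combination of other proteins always accounts for the observations (proteins A and B in the toy map), whereas a single observed unique peptide rules out any such alternative.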
However, even though all of these shared peptides might uniquely determine the protein Annexin A8, we could not exclude the possibility that these shared peptides came from a combination of multiple proteins other than Annexin A8. As noted, the study of Burkard et al.34


has been removed from the references of Annexin A8 in the newer version of neXtProt (2015-09 release). In order to confirm Annexin A8, we examined the antibody experiment evidence reported in HPA, which showed that no antibody for this protein was currently available. We then turned to possible evidence from 3D structure. Réty et al.35 reported the 3D structure of Annexin A8, which validated this protein at the PE1 level.

For the 58 proteins with PE1 evidence, we manually examined their information in PeptideAtlas, GPMDB, HPA, neXtProt 3D Structure and UniProt, and summarized it in Table S-4 in the Supporting Information. Among the 58 proteins, 6 proteins were annotated as PE1 proteins by 3D structure evidence, 11 proteins by experimental or functional characterization evidence, 1 protein by binary interaction reported in the IntAct database, and the remaining 40 highly similar proteins were supported by literature. Note that among the 40 highly similar proteins, 15 proteins have identical sequences, i.e., 7 pairs of identical proteins and 1 protein identical or highly similar to other 14 proteins. The remaining 25 highly similar proteins correspond to difficult annotation cases and need further investigation. (Please see the subsection “Difficulties in identifying missing proteins with high similarity” for more discussion.) Inferring proteins that do not contain unique tryptic peptides in trypsin-based shotgun proteomic analysis may need further investigation. To experimentally validate proteins without any in silico unique peptide, including the 77 missing proteins among them, other experimental technologies, e.g., antibody or top-down proteomics, may be needed.

Proteins with fewer in silico fully digested unique peptides containing higher percentage of missing proteins – In shotgun proteomics analyses, unique peptides are essential to unambiguously infer proteins as protein-level evidence. Since trypsin is the most frequently used


enzyme for protein digestion in shotgun proteomics analysis, we examined the number of in silico (fully digested) unique tryptic peptides contained in each of the 20,053 neXtProt proteins and grouped the proteins by the number of their unique peptides. The distribution of protein evidence levels in each group is shown in Figure 4 and Table S-6 in the Supporting Information. As shown in Figure 4, groups of proteins with more unique peptides had a higher percentage of proteins with PE1 evidence, indicating that proteins with more unique peptides are more likely to be identified at the protein level. For example, of the 3,042 proteins with at least 51 unique peptides, 95% were at the PE1 level and 5% at the PE2-4 level. In contrast, proteins with fewer unique peptides tended to include a higher percentage of missing proteins; e.g., of the 1,317 proteins with fewer than 6 unique peptides, 54% were at the PE1 level, 33% at the PE2-4 level, and 13% at the PE5 level. We also noticed that 85.5% of proteins with PE1 evidence contained at least 11 unique peptides, while only 14.5% of the PE1 proteins contained fewer than 11 unique peptides. Therefore, missing proteins containing fewer unique peptides are relatively difficult to identify from MS shotgun proteomics analysis.

Using other enzymes to increase unique peptides for possible identification of missing proteins

In order to identify missing proteins with few in silico unique tryptic peptides, we considered using another enzyme, e.g., chymotrypsin or Glu-C, to obtain more unique peptides. Here we particularly analyzed the in silico chymotrypsin digestion of the 2,070 missing proteins with at most 20 unique tryptic peptides. These proteins were divided into groups with 1-10 in silico unique (tryptic) peptides (Group 1), 11-20 unique (tryptic) peptides (Group 2) and no unique (tryptic) peptide (Group 3), respectively.
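The grouping of proteins by their number of unique tryptic peptides, and the evidence-level percentages within each group, can be sketched as below. The bin edges, the function names and the toy records are assumptions for illustration; the actual bins used for Figure 4 are those defined in the figure and Table S-6.

```python
from collections import Counter, defaultdict

# Hypothetical bin edges (inclusive); the last bin is open-ended.
BINS = [(0, 5), (6, 10), (11, 20), (21, 50), (51, None)]

def bin_label(n_unique):
    """Return the label of the bin that contains n_unique."""
    for low, high in BINS:
        if high is None and n_unique >= low:
            return f">={low}"
        if high is not None and low <= n_unique <= high:
            return f"{low}-{high}"
    raise ValueError(f"negative peptide count: {n_unique}")

def pe_distribution(records):
    """records: iterable of (n_unique_tryptic_peptides, pe_level) pairs.
    Returns {bin label: {pe_level: percentage of the proteins in that bin}}."""
    per_bin = defaultdict(Counter)
    for n_unique, pe in records:
        per_bin[bin_label(n_unique)][pe] += 1
    return {
        label: {pe: 100.0 * k / sum(c.values()) for pe, k in c.items()}
        for label, c in per_bin.items()
    }

# Toy records, not data from the paper.
toy_records = [
    (0, "PE2"), (3, "PE1"), (4, "PE5"), (5, "PE2"),
    (60, "PE1"), (75, "PE1"), (80, "PE1"), (90, "PE3"),
]
distribution = pe_distribution(toy_records)
```

Feeding in one record per protein (its unique tryptic peptide count and its neXtProt PE level) gives the kind of per-group percentages quoted above.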
The three groups consisted of 1,060, 920, and 90 proteins, respectively. The distributions of in silico unique peptides by trypsin and


chymotrypsin of the proteins in Groups 1-3 are shown in Figure 5. Considering unique peptides from both digestions, 221 out of 1,060 proteins in Group 1 had at least 11 unique peptides, including 16 proteins with at least 21 unique peptides, owing to chymotrypsin digestion. Furthermore, 209 out of 920 proteins in Group 2 had at least 21 unique peptides, and 20 out of 90 proteins in Group 3, i.e., those without any unique tryptic peptide, had 1 unique chymotryptic peptide. We also conducted similar analyses using Glu-C, Lys-C, and Lys-N digestions. The distributions using trypsin, Glu-C, Lys-C, and Lys-N for the same protein groups are shown in Figure S-4 in the Supporting Information. Comparing these enzymes, chymotrypsin is probably the alternative digestion that could yield the most additional unique peptides.

Although single enzyme digestion has been the conventional procedure for protein identification, using multiple proteases in shotgun proteomics to improve protein identification has been reported. For example, Swaney et al.36 applied multiple proteases (Lys-C, Arg-C, Asp-N and Glu-C) to analyze the proteome of the model organism Saccharomyces cerevisiae. That study reported that the number of identified proteins increased from 3,313 using trypsin alone to 3,908 using multiple proteases and, similarly, that sequence coverage increased from 24.5% to 43.4%. Wiśniewski et al.37 introduced a protocol that enables consecutive digestion in conjunction with filter-aided sample preparation (FASP). With this protocol, tandem digestion with Lys-C and trypsin enabled identification of up to 40% more proteins and phosphorylation sites than trypsin digestion alone. To explore the possibility of using multiple proteases to discover missing proteins, we surveyed three recently published studies38-40 which utilized multiple enzymes to improve protein identification.
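How a second enzyme can supply unique peptides to a protein that has none under trypsin can be illustrated with the sketch below. The cleavage rules for chymotrypsin (high specificity), Glu-C and Lys-N follow the Methods section; the function names and the toy sequences are hypothetical.

```python
import re
from collections import defaultdict

RULES = {
    "trypsin": r"(?<=[KR])(?!P)",        # after K or R, not before P
    "chymotrypsin": r"(?<=[FYW])(?!P)",  # after F, Y or W, not before P
    "glu-c": r"(?<=[ED])",               # after E or D
    "lys-n": r"(?=K)",                   # before K (N-terminal side)
}

def digest(seq, enzyme):
    """Fully digest one sequence (no missed cleavages); needs Python 3.7+."""
    return [p for p in re.split(RULES[enzyme], seq) if p]

def unique_peptide_counts(proteins, enzyme):
    """Number of distinct unique peptides per protein (zero included)."""
    owners = defaultdict(set)
    for pid, seq in proteins.items():
        for pep in digest(seq, enzyme):
            owners[pep].add(pid)
    counts = {pid: 0 for pid in proteins}
    for pids in owners.values():
        if len(pids) == 1:
            counts[next(iter(pids))] += 1
    return counts

# Toy input: P1 is the concatenation of P2 and P3, so it has no unique tryptic peptide.
toy = {"P1": "AAAKGGFCCR", "P2": "AAAK", "P3": "GGFCCR"}
tryptic = unique_peptide_counts(toy, "trypsin")
chymotryptic = unique_peptide_counts(toy, "chymotrypsin")
gained = {pid for pid in toy if tryptic[pid] == 0 and chymotryptic[pid] > 0}
```

In the toy input P1 has no unique tryptic peptide, yet chymotrypsin produces the unique peptide AAAKGGF, which is the kind of gain reported for the 20 proteins in Group 3.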
By manually examining the proteins reported by these studies, we observed 41, 17 and 123 missing


proteins in their identified proteins, respectively. Further inspecting these found missing proteins, we noticed that 24, 2, and 23 proteins in the respective studies contained at least one unique (fully-digested) peptide and had FDR =51, (B) 26-50, and (C) 0-25 unique peptides, respectively. The LSV distributions of experimentally confirmed (PE1) proteins and missing proteins in the three protein groups are illustrated in Figure 6. In each group, missing proteins exhibited slightly smaller LSV than experimentally confirmed proteins, although the difference in Group A was not statistically significant (at the 5% significance level using the Mann-Whitney-Wilcoxon U-test implemented in R). Moreover, the LSVs of experimentally confirmed proteins were quite close in the three groups.

Since sequence variants can be observed from RNA sequencing, i.e., at the transcript level, we investigated the difference in LSV between proteins at the PE1 level and proteins at the PE2 level. The LSV distributions of PE1 and PE2 proteins are illustrated in Figure 7; both distributions were bell-shaped and had similar ranges. Notably, the mode of the distribution of proteins with PE2 evidence was smaller than that of proteins with PE1 evidence, i.e., shifted to the left (smaller LSV). We conjectured that some specific types of PE2 proteins exhibited a relatively lower LSV value, i.e., more frequent sequence variants, that


left those proteins unidentified by current database search engines and shifted the LSV distribution leftward. Therefore, we examined the proteins with PE2 evidence according to the family and domain classification provided by InterPro (release 48.0)41 and selected for analysis the ten largest families and domains, which contained between 44 and 443 proteins. Detailed information on the ten families and domains is shown in Table S-10 in the Supporting Information. Notably, the top three families or domains of proteins with PE2 evidence were G protein-coupled receptor (GPCR), rhodopsin-like (family: IPR000276 and domain: IPR017452) and olfactory receptor (OR, family: IPR000725). (Note that based on InterPro’s classification, these two families and one domain contain overlapping proteins.) The comparison of LSV between proteins with PE1 evidence and with PE2 evidence in the ten families or domains is illustrated in Figure 8, in which the LSVs of all proteins with PE1 and PE2 evidence, respectively, are shown as controls. Evidently, proteins in the top three families or domains exhibited much lower LSV distributions than those in the other 7 families or domains. Therefore, sequence variants could be a factor preventing the proteins in the GPCR rhodopsin-like and OR families from being confirmed at the protein level. This finding is consistent with previous studies, which reported that the olfactory receptors expanded their functional variability in olfactory perception through genetic polymorphisms42-45. In database sequence searching for protein identification, sequence variants, in addition to canonical sequences, may need to be included in the sequence FASTA in order to identify these proteins46-50.

Difficulties in identifying the olfactory receptor family

Olfactory receptors (ORs) are members of the 7-transmembrane domain, GPCR rhodopsin-like family.
They are located in the cell membrane of olfactory epithelium (nasal tissue) and detect the odors or chemical stimuli in environments for odor recognition. Human genome contains


approximately 400 intact olfactory receptor genes that have been identified, and these genes encode protein sequences ranging in length from 269 to 369 amino acids. ORs have been shown to exhibit a high level of genetic variation within the human population, leading to alteration of protein sequences and their functions42-45.

Revisiting the olfactory receptors with evidence at protein level - According to the annotation provided in the neXtProt database, ORs are a major group of missing proteins considered in the C-HPP. Out of the 419 reviewed ORs in the database (as listed in Table S-11 in the Supporting Information), only 10 (about 2.4%) were annotated with PE1 evidence, as listed in Table 3 (detailed information in Table S-12 in the Supporting Information). We looked into the evidence for the ten proteins reported in HPA and PeptideAtlas. Four ORs had corresponding antibodies as annotated in HPA; but based on HPA protein evidence and protein reliability, these four ORs were not confirmed at the protein level. According to the information in PeptideAtlas, all of the ten ORs had weak or insufficient evidence to ensure their existence at the protein level by MS-based methods. In summary, the protein-level existence evidence of these ten proteins, even with PE1 annotation, was inconclusive under MS-based analyses and antibody techniques. Therefore, all human ORs could be classified as missing proteins.

The tissue-specific issue of OR expression - Based on functional specificity, ORs were primarily presumed to be expressed in olfactory tissues. However, several studies have recently shown that some of the human and mouse ORs can be expressed in non-olfactory tissues51-56. For example, the human OR1D2 (NX_P34982, protein level) was expressed in spermatozoa and testis; human OR51E2 and its paralog OR51E1 (NX_Q8TCB6 and NX_Q9H255, protein level) were

ACS Paragon Plus Environment

15

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 40

expressed as prostate cancer cells, spleen and liver. Furthermore, an mRNA-Seq (NGS) analysis of 16 human tissues revealed that 111 of 400 ORs were expressed in more than one kind of nonolfactory tissues57. It implied that ORs could possibly be detected in non-olfactory tissues by proteomics techniques as they were presumed to be possible only in olfactory tissues. Dealing with this controversial tissue-specific issue, validation of the identified proteins is a critical issue in identifying ORs. For instance, evaluation of the spectrum quality58, examination of the unique tryptic peptide58 and targeted proteomics for validation are the possible way in non-targeted shotgun proteomic analysis. Sequence variants, hydrophobicity and shorter protein sequence length very likely causing olfactory receptors unidentified in MS analyses - In order to investigate possible reasons for ORs to be unidentified at protein level, we first compared the sequence variants of the PE2 proteins to that of the PE1 proteins in the GPCR rhodopsin-like and OR families and domain (Figure 8). Proteins with PE2 evidence in GPCR rhodopsin-like (family and domain) exhibited lower LSV than those with PE1 evidence. Since ORs are a subfamily of GPCR rhodopsin-like, sequence variant could be a possible reason for ORs being missing. Furthermore, we also compared hydrophobicity, protein sequence length and the number of in silico unique tryptic peptides of proteins in the GPCR rhodopsin-like and OR families and domain, compared with proteins in the other 7 families or domains mentioned above. The comparisons of protein/peptide hydrophobicity (we used GRAVY59 value of “Peptides” in R package to evaluate hydrophobicity), protein/peptide sequence length and the number of tryptic peptides are shown in Figures S-5 to S-9 in the Supporting Information, respectively. 
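The GRAVY value used above is the mean Kyte-Doolittle hydropathy index over all residues of a sequence. The analysis itself relied on the “Peptides” R package; the Python function below is only an illustrative re-implementation of that average, not the authors' code, and the two example sequences are made up for demonstration.

```python
# Kyte-Doolittle hydropathy index per amino acid (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathy (GRAVY) of a sequence."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        total = sum(KYTE_DOOLITTLE[residue] for residue in seq)
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]}") from None
    return total / len(seq)


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the paper.
    examples = {"hydrophobic_toy": "ILVFLLAVIL", "hydrophilic_toy": "KRDEKNQSDE"}
    for name, seq in examples.items():
        print(f"{name}: GRAVY = {gravy(seq):.2f}")
```

A positive GRAVY value indicates a predominantly hydrophobic sequence, which is the property associated above with PE2 proteins in the GPCR rhodopsin-like family.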
The results revealed that the PE2 proteins in the GPCR rhodopsin-like family (IPR000276) and domain (IPR017452) had higher hydrophobicity, smaller size (i.e., shorter sequence length), and fewer in silico unique tryptic peptides than PE1 proteins in the same family/domain; ORs showed hydrophobicity, protein sequence length, and average number of unique tryptic peptides similar to those of the PE2 proteins in the GPCR rhodopsin-like family and domain. Thus, sequence variants, protein hydrophobicity, protein sequence length and the number of unique peptides were possible reasons for the proteins in the GPCR rhodopsin-like family, including ORs, being missing. Since hydrophobicity and sequence length are inherent properties of ORs that hinder their identification by shotgun proteomic analyses, using different enzymes for protein digestion and including sequence variants in database sequence searching are suggested as possible improvements for the identification of ORs.

Difficulties in identifying missing proteins with high similarity

The presence of highly homologous proteins and redundant protein entries (i.e., truncated sequences and identical sequences under different gene names or gene products) in a protein sequence database could lead to ambiguity in protein inference from identified peptides14, 15. To investigate the influence of high sequence similarity on protein identification, we applied the frequently used CD-HIT60 program to cluster the 20,189 human protein sequences in UniProt at a threshold of 100% similarity. As a result, the 20,189 human proteins were clustered into 20,147 groups, and 28 of the 20,147 groups contained more than one protein, i.e., proteins with 100% sequence similarity. Of the 28 groups, 20 consisted of identical proteins and 8 consisted of proteins and their truncated sequences. Detailed information on the 28 groups is shown in Table S-13 in the Supporting Information. Next, we examined the PE annotation provided by neXtProt for the 70 proteins contained in the 28 groups. As a result, 28, 15, 19, 5, and 3 proteins were at the PE1, PE2, PE3, PE4, and PE5 levels, respectively; the 39 proteins at the PE2-4 levels were regarded as missing proteins.

We used the following two examples with PE1 evidence to discuss the difficulties of identifying missing proteins with high similarity. First, vacuolar fusion protein CCZ1 homolog (P86791) and vacuolar fusion protein CCZ1 homolog B (P86790) were two proteins with identical sequences, both annotated with PE1 evidence. Detailed information on the two proteins is shown in Table S-14 in the Supporting Information. Manually examining the peptide and antibody information of the two proteins in the neXtProt, PeptideAtlas, and HPA databases, we observed that their annotations were the same. It was therefore difficult to distinguish the two proteins from one another. By merging both sequences into a single sequence, we observed that in silico unique tryptic peptides could be obtained for experimentally identifying the sequence, as both proteins were reported at the PE1 level.

Second, histone H2A type 1 (P0C0S8) and histone H2A type 1-H (Q96KK5) had a sequence similarity of 100%, where Q96KK5 (128 amino acids) was a truncated sequence of P0C0S8 (130 amino acids). By manual examination, we found that the peptide and antibody information of P0C0S8 and Q96KK5 was identical in the neXtProt, PeptideAtlas, and HPA databases (Table S-15 in the Supporting Information). It was hard to determine which of the two proteins was experimentally confirmed. Furthermore, using the longer sequence (130 amino acids) to represent the two proteins, we observed that no in silico unique tryptic peptide could be obtained from the sequence. Identifying a protein sequence and its truncated sequence that consist of only tryptic peptides shared with other proteins was more difficult than in the previous example.

A group of proteins with 100% similarity was usually represented by a single identical or longer protein. Consider the first case, in which the representative sequence contains in silico unique tryptic peptides. Then the group of proteins could possibly be identified by shotgun proteomic analyses, but it was hard to distinguish the identical proteins from one another, as shown in the first example. To distinguish a protein sequence from its truncated sequence, top-down proteomics provides a possible solution. In the second case, in which the representative sequence does not contain any in silico unique tryptic peptides, identifying such proteins becomes more challenging. Other enzymes may be needed for protein digestion in the shotgun proteomics experiments to yield unique peptides for possible identification.

Conclusion

To identify the human proteome, several issues causing proteins to be missing have been reported in the literature, including the limitations of experimental methods and mass spectrometry techniques, the limited occurrence of proteins, and the limitations of currently available database sequence searching. In this paper, we attempted to investigate which of the 2,948 missing proteins were hard to detect even by an “ideal” shotgun proteomic analysis. Using three in silico digestions (trypsin, Lys-C, and both) to generate all in silico fully-digested peptides, and assuming that all of them were identified, 145 proteins contained no unique peptide, i.e., they consisted of only shared peptides. (The number would increase from 145 to 195 if only trypsin digestion was considered.) Protein inference based on shared peptides alone may be inconclusive, even for the 58 proteins with PE1 evidence, not to mention the 77 missing proteins. To validate such proteins, besides antibody experiments, top-down proteomic analysis and combining peptide identification results from multiple shotgun proteomics analyses based on multiple enzymes, e.g., trypsin and chymotrypsin, would be possible approaches, for which informatics remains challenging.
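The idea of a protein that consists only of shared peptides can be illustrated with a small in silico digestion. The sketch below rests on several assumptions that are not taken from the paper: cleavage rules are simplified (trypsin after K/R, Lys-C after K, no proline exception, no missed cleavages), the 6-40 residue length window is borrowed from the caption of Figure S-3, uniqueness is evaluated separately for each enzyme, and the four sequences are hypothetical toys rather than real proteins.

```python
from collections import defaultdict

# Residues after which each enzyme is assumed to cleave (simplified rules).
ENZYMES = {"trypsin": "KR", "lys-c": "K"}


def digest(sequence, cleave_after, min_len=6, max_len=40):
    """Fully digest a sequence and keep peptides within the length window."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if min_len <= len(p) <= max_len]


def unique_peptide_counts(proteins, cleave_after):
    """Count, per protein, the peptides that map to that protein only."""
    digests = {pid: set(digest(seq, cleave_after)) for pid, seq in proteins.items()}
    owners = defaultdict(set)
    for pid, peptides in digests.items():
        for peptide in peptides:
            owners[peptide].add(pid)
    return {
        pid: sum(1 for peptide in peptides if len(owners[peptide]) == 1)
        for pid, peptides in digests.items()
    }


def proteins_without_unique(proteins, enzyme_names):
    """Return IDs that have no unique peptide under any of the given enzymes."""
    remaining = set(proteins)
    for name in enzyme_names:
        counts = unique_peptide_counts(proteins, ENZYMES[name])
        remaining &= {pid for pid, n in counts.items() if n == 0}
    return remaining


# Hypothetical toy sequences, not real proteins.
TOY_PROTEINS = {
    "A": "AAAAAAKCCCCCCR",
    "B": "AAAAAAKCCCCCCR",  # identical to A
    "C": "AAAAAAKDDDDDDR",
    "D": "CCCCCCRAAAAAAK",
}

if __name__ == "__main__":
    print(sorted(proteins_without_unique(TOY_PROTEINS, ["trypsin"])))
    # prints ['A', 'B', 'D']
    print(sorted(proteins_without_unique(TOY_PROTEINS, ["trypsin", "lys-c"])))
    # prints ['A', 'B']
```

In this toy set, protein D has only shared tryptic peptides but gains a unique peptide under Lys-C, mirroring the drop from 195 to 145 proteins when more than one digestion is considered. The identical pair A and B never yields a unique peptide, whichever enzyme is used.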


Sequence variants in proteins could cause peptides to be unidentified, thereby probably affecting protein identification. Motivated by this inference, we examined whether sequence variation is a factor that causes proteins to be missing. From our analysis, we observed that sequence variation was not a factor for proteins being unidentified in most families or domains of proteins at PE2, except for two families, rhodopsin-like GPCRs and ORs. Since both rhodopsin-like GPCRs and ORs are mostly membrane proteins, using experimental techniques to mine the human membrane subproteome is a promising way to experimentally validate missing proteins in these two families. For identifying these families of proteins, including sequence variants in the protein FASTA could improve identification.

As MS technology is ever-improving, bioinformatics for protein identification will become crucial for identifying missing proteins. On the other hand, knowing the informatics limitations revealed by the in silico analyses provided in this paper, researchers can design appropriate experiments for identifying missing proteins. Collaborative effort on MS technology and informatics can facilitate the identification of missing proteins.

ASSOCIATED CONTENT

Supporting Information
Figure S-1. The distributions of unique and shared peptides of different lengths based on in silico Lys-C digestion.
Figure S-2. The distributions of unique and shared peptides of different lengths based on in silico trypsin and Lys-C tandem digestion.
Figure S-3. The Venn diagram of digested peptides and 20,053 proteins containing at least one unique digested peptide yielded by trypsin, Lys-C and both enzymes, when peptide length is limited to 6-40 amino acids.
Figure S-4. The distributions of proteins with different numbers of unique peptides using trypsin, Glu-C, Lys-N, and Lys-C, respectively.


Figure S-5. The distribution of protein hydrophobicity (in terms of GRAVY value) in ten families and domains classified by InterPro.
Figure S-6. The distribution of peptide hydrophobicity (in terms of GRAVY value) in ten families and domains classified by InterPro.
Figure S-7. The distribution of protein length in ten families and domains classified by InterPro.
Figure S-8. The distribution of unique peptide length in ten families and domains classified by InterPro.
Figure S-9. The distribution of unique peptide number in ten families and domains classified by InterPro.
Table S-1. The numbers of in silico fully tryptic-digested peptides of the 20,189 human proteins in the UniProt and Ensembl databases.
Table S-2. The detailed information of the 20,053 proteins common to UniProt and neXtProt.
Table S-3. The detailed information of the 145 proteins having no unique peptide.
Table S-4. A summary table of the 58 PE1 proteins consisting of only shared peptides.
Table S-5. The list of 43 fully-digested peptides of Annexin A8 (P13928, Vascular anticoagulant-beta), of which 17 were identified by Burkard et al.34 using mass spectrometry technology.
Table S-6. The number of proteins at PE1-5 categorized by their unique tryptic peptide numbers.
Table S-7. The unique fully-digested peptides of missing proteins identified in Guo et al.38 using a multiple protease strategy (https://proteomics.swmed.edu/confetti/, manually examined, FDR < 1% at peptide and protein levels).
Table S-8. The unique peptide information of missing proteins identified in Chen et al.39, which used trypsin, chymotrypsin, and both for digestion (FDR < 1% at peptide and protein levels).
Table S-9. The information of phospho-peptides identified by Giansanti et al.40 (FDR < 1% at peptide level).
Table S-10. The detailed information of the ten largest families or domains with proteins at PE2 level.
Table S-11. The detailed information of the 419 reviewed olfactory receptors.
Table S-12. The detailed information of the ten olfactory receptors annotated with PE1 evidence in neXtProt (201409 release).
Table S-13. The detailed information of the 28 groups of UniProt proteins with 100% sequence similarity clustered by CD-HIT.
Table S-14. The detailed information of the two proteins P86790 and P86791 that have 100% sequence similarity.


Table S-15. The detailed information of the two proteins P0C0S8 and Q96KK5 that have 100% sequence similarity.

This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
* Ting-Yi Sung, Tel: 886-2-2788-3799 #1711, E-mail: [email protected].
* Yu-Ju Chen, Tel: 886-2-27898660, E-mail: [email protected]

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. ‡These authors contributed equally.

ACKNOWLEDGMENT
This work was supported by the Academia Sinica, Ministry of Science and Technology of Taiwan (MOST103-2221-E-001-037), and Taiwan International Graduate Program. We would like to thank the anonymous reviewers for their valuable comments and suggestions. We also thank Thejkiran Pitti for fruitful discussion on the manuscript.

REFERENCES
1. Legrain, P.; Aebersold, R.; Archakov, A.; Bairoch, A.; Bala, K.; Beretta, L.; Bergeron, J.; Borchers, C. H.; Corthals, G. L.; Costello, C. E.; Deutsch, E. W.; Domon, B.; Hancock, W.; He, F.; Hochstrasser, D.; Marko-Varga, G.; Salekdeh, G. H.; Sechi, S.; Snyder, M.; Srivastava, S.; Uhlen, M.; Wu, C. H.; Yamamoto, T.; Paik, Y. K.; Omenn, G. S., The human proteome project: current state and future direction. Mol Cell Proteomics 2011, 10, M111 009993.
2. Paik, Y. K.; Jeong, S. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H. J.; Na, K.; Choi, E. Y.; Yan, F.; Zhang, F.; Zhang, Y.; Snyder, M.; Cheng, Y.; Chen, R.; Marko-Varga, G.; Deutsch, E. W.; Kim, H.; Kwon, J. Y.; Aebersold, R.; Bairoch, A.; Taylor, A. D.; Kim, K. Y.; Lee, E. Y.; Hochstrasser, D.; Legrain, P.; Hancock, W. S., The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat Biotechnol 2012, 30, 221-3.


3. Paik, Y. K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H. J.; Na, K.; Jeong, S. K.; He, F.; Binz, P. A.; Nishimura, T.; Keown, P.; Baker, M. S.; Yoo, J. S.; Garin, J.; Archakov, A.; Bergeron, J.; Salekdeh, G. H.; Hancock, W. S., Standard guidelines for the chromosome-centric human proteome project. J Proteome Res 2012, 11, 2005-13.
4. Marko-Varga, G.; Omenn, G. S.; Paik, Y. K.; Hancock, W. S., A first step toward completion of a genome-wide characterization of the human proteome. J Proteome Res 2013, 12, 1-5.
5. Lane, L.; Bairoch, A.; Beavis, R. C.; Deutsch, E. W.; Gaudet, P.; Lundberg, E.; Omenn, G. S., Metrics for the Human Proteome Project 2013-2014 and strategies for finding missing proteins. J Proteome Res 2014, 13, 15-20.
6. Omenn, G. S., The strategy, organization, and progress of the HUPO Human Proteome Project. J Proteomics 2014, 100, 3-7.
7. Lane, L.; Argoud-Puy, G.; Britan, A.; Cusin, I.; Duek, P. D.; Evalet, O.; Gateau, A.; Gaudet, P.; Gleizes, A.; Masselot, A.; Zwahlen, C.; Bairoch, A., neXtProt: a knowledge platform for human proteins. Nucleic Acids Res 2012, 40, D76-83.
8. Beck, M.; Claassen, M.; Aebersold, R., Comprehensive proteomics. Curr Opin Biotechnol 2011, 22, 3-8.
9. Farrah, T.; Deutsch, E. W.; Hoopmann, M. R.; Hallows, J. L.; Sun, Z.; Huang, C. Y.; Moritz, R. L., The state of the human proteome in 2012 as viewed through PeptideAtlas. J Proteome Res 2013, 12, 162-71.
10. Shiromizu, T.; Adachi, J.; Watanabe, S.; Murakami, T.; Kuga, T.; Muraoka, S.; Tomonaga, T., Identification of missing proteins in the neXtProt database and unregistered phosphopeptides in the PhosphoSitePlus database as part of the Chromosome-centric Human Proteome Project. J Proteome Res 2013, 12, 2414-21.
11. Ezkurdia, I.; Juan, D.; Rodriguez, J. M.; Frankish, A.; Diekhans, M.; Harrow, J.; Vazquez, J.; Valencia, A.; Tress, M. L., Multiple evidence strands suggest that there may be as few as 19 000 human protein-coding genes. Human Molecular Genetics 2014, 23, 5866-5878.
12. Kim, M. S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T. C.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.; Pandey, A., A draft map of the human proteome. Nature 2014, 509, 575-81.
13. Wilhelm, M.; Schlegl, J.; Hahne, H.; Moghaddas Gholami, A.; Lieberenz, M.; Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; Mathieson, T.; Lemeer, S.; Schnatbaum, K.; Reimer, U.; Wenschuh, H.; Mollenhauer, M.; Slotta-Huspenina, J.; Boese, J. H.; Bantscheff, M.; Gerstmair, A.; Faerber, F.; Kuster, B., Mass-spectrometry-based draft of the human proteome. Nature 2014, 509, 582-7.


14. Nesvizhskii, A. I.; Aebersold, R., Interpretation of shotgun proteomic data: the protein inference problem. Mol Cell Proteomics 2005, 4, 1419-40.
15. Nesvizhskii, A. I., A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J Proteomics 2010, 73, 2092-123.
16. Li, J.; Su, Z. L.; Ma, Z. Q.; Slebos, R. J. C.; Halvey, P.; Tabb, D. L.; Liebler, D. C.; Pao, W.; Zhang, B., A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics. Molecular & Cellular Proteomics 2011, 10.
17. Nijveen, H.; Kester, M. G. D.; Hassan, C.; Viars, A.; de Ru, A. H.; de Jager, M.; Falkenburg, J. H. F.; Leunissen, J. A. M.; van Veelen, P. A., HSPVdb-the Human Short Peptide Variation Database for improved mass spectrometry-based detection of polymorphic HLA-ligands. Immunogenetics 2011, 63, 143-153.
18. Roth, M. J.; Forbes, A. J.; Boyne, M. T.; Kim, Y. B.; Robinson, D. E.; Kelleher, N. L., Precise and parallel characterization of coding polymorphisms, alternative splicing, and modifications in human proteins by mass spectrometry. Molecular & Cellular Proteomics 2005, 4, 1002-1008.
19. Su, Z. D.; Sun, L.; Yu, D. X.; Li, R. X.; Li, H. X.; Yu, Z. J.; Sheng, Q. H.; Lin, X.; Zeng, R.; Wu, J. R., Quantitative detection of single amino acid polymorphisms by targeted proteomics. Journal of Molecular Cell Biology 2011, 3, 309-315.
20. Alves, G.; Ogurtsov, A. Y.; Yu, Y. K., RAId_DbS: mass-spectrometry based peptide identification web server with knowledge integration. BMC Genomics 2008, 9, 505.
21. Xi, H.; Park, J. S.; Ding, G. H.; Lee, Y. H.; Li, Y. X., SysPIMP: the web-based systematical platform for identifying human disease-related mutated sequences from mass spectrometry. Nucleic Acids Research 2009, 37, D913-D920.
22. Apweiler, R.; Bairoch, A.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O'Donovan, C.; Redaschi, N.; Yeh, L. S., UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32, D115-9.
23. Lindskog, C., The potential clinical impact of the tissue-based map of the human proteome. Expert Rev Proteomics 2015, 12, 213-5.
24. Farrah, T.; Deutsch, E. W.; Hoopmann, M. R.; Hallows, J. L.; Sun, Z.; Huang, C. Y.; Moritz, R. L., The State of the Human Proteome in 2012 as Viewed through PeptideAtlas. Journal of Proteome Research 2013, 12, 162-171.
25. Picotti, P.; Lam, H.; Campbell, D.; Deutsch, E. W.; Mirzaei, H.; Ranish, J.; Domon, B.; Aebersold, R., A database of mass spectrometric assays for the yeast proteome. Nat Methods 2008, 5, 913-4.
26. Keil, B., Specificity of proteolysis. Springer-Verlag: Berlin; New York, 1992; p ix, 336 p.
27. Sherry, S. T.; Ward, M. H.; Kholodov, M.; Baker, J.; Phan, L.; Smigielski, E. M.; Sirotkin, K., dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29, 308-11.
28. Forbes, S. A.; Bindal, N.; Bamford, S.; Cole, C.; Kok, C. Y.; Beare, D.; Jia, M.; Shepherd, R.; Leung, K.; Menzies, A.; Teague, J. W.; Campbell, P. J.; Stratton, M. R.; Futreal, P. A., COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res 2011, 39, D945-50.


29. Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S., Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551-67.
30. Eng, J. K.; McCormack, A. L.; Yates, J. R., An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J Am Soc Mass Spectrom 1994, 5, 976-89.
31. Craig, R.; Beavis, R. C., TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466-7.
32. Meyer-Arendt, K.; Old, W. M.; Houel, S.; Renganathan, K.; Eichelberger, B.; Resing, K. A.; Ahn, N. G., IsoformResolver: A Peptide-Centric Algorithm for Protein Inference. Journal of Proteome Research 2011, 10, 3060-3075.
33. Branca, R. M.; Orre, L. M.; Johansson, H. J.; Granholm, V.; Huss, M.; Perez-Bercoff, A.; Forshed, J.; Kall, L.; Lehtio, J., HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nat Methods 2014, 11, 59-62.
34. Burkard, T. R.; Planyavsky, M.; Kaupe, I.; Breitwieser, F. P.; Burckstummer, T.; Bennett, K. L.; Superti-Furga, G.; Colinge, J., Initial characterization of the human central proteome. BMC Syst Biol 2011, 5, 17.
35. Rety, S.; Sopkova-de Oliveira Santos, J.; Dreyfuss, L.; Blondeau, K.; Hofbauerova, K.; Raguenes-Nicol, C.; Kerboeuf, D.; Renouard, M.; Russo-Marie, F.; Lewit-Bentley, A., The crystal structure of annexin A8 is similar to that of annexin A3. J Mol Biol 2005, 345, 1131-9.
36. Swaney, D. L.; Wenger, C. D.; Coon, J. J., Value of Using Multiple Proteases for Large-Scale Mass Spectrometry-Based Proteomics. Journal of Proteome Research 2010, 9, 1323-1329.
37. Wisniewski, J. R.; Mann, M., Consecutive proteolytic digestion in an enzyme reactor increases depth of proteomic and phosphoproteomic analysis. Anal Chem 2012, 84, 2631-7.
38. Guo, X.; Trudgian, D. C.; Lemoff, A.; Yadavalli, S.; Mirzaei, H., Confetti: a multiprotease map of the HeLa proteome for comprehensive proteomics. Mol Cell Proteomics 2014, 13, 1573-84.
39. Chen, Q.; Yan, G.; Zhang, X., Applying multiple proteases to direct digestion of hundred-scale cell samples for proteome analysis. Rapid Commun Mass Spectrom 2015, 29, 1389-94.
40. Giansanti, P.; Aye, T. T.; van den Toorn, H.; Peng, M.; van Breukelen, B.; Heck, A. J., An Augmented Multiple-Protease-Based Human Phosphopeptide Atlas. Cell Rep 2015, 11, 1834-43.
41. Apweiler, R.; Attwood, T. K.; Bairoch, A.; Bateman, A.; Birney, E.; Biswas, M.; Bucher, P.; Cerutti, L.; Corpet, F.; Croning, M. D. R.; Durbin, R.; Falquet, L.; Fleischmann, W.; Gouzy, J.; Hermjakob, H.; Hulo, N.; Jonassen, I.; Kahn, D.; Kanapin, A.; Karavidopoulou, Y.; Lopez, R.; Marx, B.; Mulder, N. J.; Oinn, T. M.; Pagni, M.; Servant, F.; Sigrist, C. J. A.; Zdobnov, E. M.; Consortium, I., InterPro - an integrated documentation resource for protein families, domains and functional sites. Bioinformatics 2000, 16, 1145-1150.
42. Frumin, I.; Sobel, N.; Gilad, Y., Does a unique olfactory genome imply a unique olfactory world? Nature Neuroscience 2014, 17, 6-8.
43. Hasin-Brumshtein, Y.; Lancet, D.; Olender, T., Human olfaction: from genomic variation to phenotypic diversity. Trends in Genetics 2009, 25, 178-184.
44. Mainland, J. D.; Keller, A.; Li, Y. R.; Zhou, T.; Trimmer, C.; Snyder, L. L.; Moberly, A. H.; Adipietro, K. A.; Liu, W. L. L.; Zhuang, H. Y.; Zhan, S. M.; Lee, S. S.; Lin, A.; Matsunami, H., The missense of smell: functional variability in the human odorant receptor repertoire. Nature Neuroscience 2014, 17, 114-120.
45. Olender, T.; Waszak, S. M.; Viavant, M.; Khen, M.; Ben-Asher, E.; Reyes, A.; Nativ, N.; Wysocki, C. J.; Ge, D. L.; Lancet, D., Personal receptor repertoires: olfaction as a model. BMC Genomics 2012, 13.
46. Woo, S.; Cha, S. W.; Merrihew, G.; He, Y.; Castellana, N.; Guest, C.; MacCoss, M.; Bafna, V., Proteogenomic database construction driven from large scale RNA-seq data. J Proteome Res 2014, 13, 21-8.
47. Wang, X.; Zhang, B., customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics 2013, 29, 3235-7.
48. Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Scalf, M.; Smith, L. M., Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J Proteome Res 2014, 13, 228-40.
49. Wang, X.; Slebos, R. J.; Wang, D.; Halvey, P. J.; Tabb, D. L.; Liebler, D. C.; Zhang, B., Protein identification using customized protein sequence databases derived from RNA-Seq data. J Proteome Res 2012, 11, 1009-17.
50. Nesvizhskii, A. I., Proteogenomics: concepts, applications and computational strategies. Nat Methods 2014, 11, 1114-25.
51. Neuhaus, E. M.; Zhang, W.; Gelis, L.; Deng, Y.; Noldus, J.; Hatt, H., Activation of an olfactory receptor inhibits proliferation of prostate cancer cells. J Biol Chem 2009, 284, 16218-25.
52. Spehr, M.; Gisselmann, G.; Poplawski, A.; Riffell, J. A.; Wetzel, C. H.; Zimmer, R. K.; Hatt, H., Identification of a testicular odorant receptor mediating human sperm chemotaxis. Science 2003, 299, 2054-8.
53. Kang, N.; Kim, H.; Jae, Y.; Lee, N.; Ku, C. R.; Margolis, F.; Lee, E. J.; Bahk, Y. Y.; Kim, M. S.; Koo, J., Olfactory marker protein expression is an indicator of olfactory receptor-associated events in non-olfactory tissues. PLoS One 2015, 10, e0116097.
54. Sanz, G.; Leray, I.; Dewaele, A.; Sobilo, J.; Lerondel, S.; Bouet, S.; Grebert, D.; Monnerie, R.; Pajot-Augy, E.; Mir, L. M., Promotion of cancer cell invasiveness and metastasis emergence caused by olfactory receptor stimulation. PLoS One 2014, 9, e85110.
55. Weng, J.; Wang, J.; Hu, X.; Wang, F.; Ittmann, M.; Liu, M., PSGR2, a novel G-protein coupled receptor, is overexpressed in human prostate cancer. Int J Cancer 2006, 118, 1471-80.
56. Xu, L. L.; Stackhouse, B. G.; Florence, K.; Zhang, W.; Shanmugam, N.; Sesterhenn, I. A.; Zou, Z.; Srikantan, V.; Augustus, M.; Roschke, V.; Carter, K.; McLeod, D. G.; Moul, J. W.; Soppett, D.; Srivastava, S., PSGR, a novel prostate-specific gene with homology to a G protein-coupled receptor, is overexpressed in prostate cancer. Cancer Res 2000, 60, 6568-72.
57. Flegel, C.; Manteniotis, S.; Osthold, S.; Hatt, H.; Gisselmann, G., Expression profile of ectopic olfactory receptors determined by deep sequencing. PLoS One 2013, 8, e55368.
58. Ezkurdia, I.; Vazquez, J.; Valencia, A.; Tress, M., Analyzing the First Drafts of the Human Proteome. J Proteome Res 2014.
59. Kyte, J.; Doolittle, R. F., A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157, 105-32.
60. Fu, L. M.; Niu, B. F.; Zhu, Z. W.; Wu, S. T.; Li, W. Z., CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012, 28, 3150-3152.


Tables

Table 1. The number of in silico fully digested peptides of the 20,189 human proteins in UniProt using different enzymes.

| Enzymes | Digested peptides: unique | Digested peptides: shared | Digested peptides: total | Non-redundant digested peptides: unique | Non-redundant digested peptides: shared | Non-redundant digested peptides: total |
|---|---|---|---|---|---|---|
| Trypsin | 613,253 (50.23%) | 607,642 (49.77%) | 1,220,895 (100%) | 613,253 (92.65%) | 48,675 (7.35%) | 661,928 (100%) |
| LysC | 431,157 (64.7%) | 235,285 (35.3%) | 666,442 (100%) | 431,157 (94.68%) | 24,229 (5.32%) | 455,386 (100%) |
| Trypsin+LysC | 621,501 (49.29%) | 639,361 (50.71%) | 1,260,862 (100%) | 621,501 (92.48%) | 50,544 (7.52%) | 672,045 (100%) |
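The totals and percentages in Table 1 follow from the unique and shared counts by simple arithmetic: a unique peptide occurs once, so it contributes the same count to both halves of the table, whereas shared peptides are counted per occurrence in the left half and per distinct sequence in the right half. The short check below copies the counts from Table 1; the function and variable names are illustrative only.

```python
# Counts copied from Table 1:
# (unique peptides, shared peptide occurrences, distinct shared peptides)
TABLE1 = {
    "Trypsin": (613_253, 607_642, 48_675),
    "LysC": (431_157, 235_285, 24_229),
    "Trypsin+LysC": (621_501, 639_361, 50_544),
}


def summarize(unique, shared_occurrences, shared_distinct):
    """Recompute the totals and unique-peptide percentages for one enzyme."""
    total = unique + shared_occurrences
    nonredundant_total = unique + shared_distinct
    return {
        "total": total,
        "unique_pct": round(100 * unique / total, 2),
        "nonredundant_total": nonredundant_total,
        "nonredundant_unique_pct": round(100 * unique / nonredundant_total, 2),
    }


if __name__ == "__main__":
    for enzyme, counts in TABLE1.items():
        print(enzyme, summarize(*counts))
```

For trypsin, for example, this reproduces the totals 1,220,895 and 661,928 and the unique-peptide shares 50.23% and 92.65% reported in the table.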


Table 2. The number of missing proteins found in the identified proteins using multiple proteases.

| Study | Proteases | Sample | Number of identified proteins | Number of missing proteins found in the identified proteins | Number of validated missing proteins |
|---|---|---|---|---|---|
| Guo et al.38 | Chymotrypsin, Glu-C, and Lys-C | Human HeLa proteome | 5,223 | 41 | 8# |
| Chen et al.39 | Trypsin, Chymotrypsin, and Trypsin+Chymotrypsin | DLD-1 cells* | 555 | 15 | 2# |
| Giansanti et al.40 | AspN, Chymotrypsin, Glu-C, Lys-C, and Trypsin | Human phosphoproteome | 5,326 | 123 | 23& |

* DLD-1 cells: Duke's type C colorectal adenocarcinoma cells, Chinese Academy of Sciences, Shanghai, China.
# Containing at least one unique fully-digested peptide and FDR < 1% at peptide and protein levels.
& Containing at least one unique fully-digested peptide and FDR < 1% at peptide level.


Table 3. Information on the ten OR proteins annotated with PE1 evidence.

| neXtProt ID | neXtProt (201409 release) protein evidence | neXtProt (201505 release) protein evidence | HPA (version 13): antibody | HPA (version 13): HPA evidence | HPA (version 13): protein reliability | PeptideAtlas (201503): status |
| --- | --- | --- | --- | --- | --- | --- |
| NX_Q9H205 | PE 1 | PE 1 | - | - | - | insufficient evidence |
| NX_Q8TCB6 | PE 1 | PE 2 | HPA051439, CAB019995 | Transcript level | Uncertain | not detected |
| NX_Q9H255 | PE 1 | PE 1 | - | - | - | weak |
| NX_Q8NGP9 | PE 1 | PE 3 | - | - | - | not detected |
| NX_P34982 | PE 1 | PE 1 | - | - | - | not detected |
| NX_Q8NGR3 | PE 1 | PE 2 | - | - | - | not detected |
| NX_Q8NGI9 | PE 1 | PE 1 | HPA047135, HPA049575 | No evidence | Uncertain | weak |
| NX_Q8NGM8 | PE 1 | PE 2 | HPA048674 | No evidence | Uncertain | rejected |
| NX_Q8NGY6 | PE 1 | PE 3 | HPA056637 | No evidence | Uncertain | insufficient evidence |
| NX_Q96RD0 | PE 1 | PE 3 | - | - | - | not detected |


Figures

Figure 1. The peptide length distributions of unique and shared peptides based on in silico trypsin digestion. The data were derived from in silico trypsin digestion of 20,189 human proteins in UniProt, where the numbers of unique and shared peptides were 613,253 and 607,642, respectively. Red and blue areas represent the distributions of unique and shared peptides, respectively, across the peptide lengths shown on the X-axis.


Figure 2. (A) The number of non-redundant unique peptides using trypsin, Lys-C, and trypsin+Lys-C digestion. A total of 18,329 unique peptides were additionally obtained using trypsin+Lys-C digestion, whereas 31,571 and 246,906 unique peptides were additionally obtained using trypsin and Lys-C, respectively. (B) The number of proteins with at least one non-redundant unique peptide using trypsin, Lys-C, and trypsin+Lys-C digestion. In silico digestion analysis was conducted on the 20,053 proteins common to UniProt and neXtProt. Among these 20,053 proteins, 19,908 had at least one unique peptide from any of the three digestions, and 145 did not have any unique peptide.
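
One way to read the "additionally obtained" counts in panel A is as the peptides found by one digestion and by neither of the other two, that is, the exclusive regions of a three-set comparison. The Python sketch below computes such counts with set operations. This reading, the function name `exclusive_counts`, and the toy peptide sets are assumptions for illustration, not the procedure used to draw the figure.

```python
def exclusive_counts(peptide_sets):
    """For each digestion, count the peptides that appear in no other digestion's set."""
    result = {}
    for name, peptides in peptide_sets.items():
        # Union of every other digestion's peptides (empty if there are none).
        others = set().union(
            *(s for other, s in peptide_sets.items() if other != name)
        )
        result[name] = len(peptides - others)
    return result


# Hypothetical toy peptide sets, not data from the paper.
toy_sets = {
    "trypsin": {"AAAR", "GGK", "TTK", "MK"},
    "lys_c": {"GGK", "AAARGGK"},
    "trypsin_lys_c": {"AAAR", "GGK"},
}
print(exclusive_counts(toy_sets))
# {'trypsin': 2, 'lys_c': 1, 'trypsin_lys_c': 0}

# Panel B bookkeeping: proteins with and without any unique peptide.
assert 19_908 + 145 == 20_053
```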


Figure 3. The Annexin A8 sequence and its identified peptide sequences annotated with MS information. The gray bars denote the peptide sequences used to identify the protein reported by Burkard et al.34. The green and orange bars denote positions in the amino acid sequence covered by shared peptides annotated in PeptideAtlas and SRMAtlas, respectively.


Figure 4. The distribution of proteins at different evidence levels in all protein groups classified by the number of their unique tryptic peptides. The data were derived from in silico trypsin digestion of the 20,053 proteins common to the human proteomes of UniProt and neXtProt. The yellow curve, corresponding to the Y-axis on the right-hand side of the panel, shows the number of proteins in each protein group. The purple curve, corresponding to the Y-axis on the left-hand side of the panel, shows the cumulative percentage of PE1 proteins.


Figure 5. Comparison of the numbers of unique peptides generated by trypsin and chymotrypsin digestion. The in silico analysis was conducted on missing proteins with at most 20 unique tryptic peptides (i.e., 2,070 missing proteins). The 2,070 proteins were divided into three groups according to the number of unique tryptic peptides: proteins with 1-10 unique tryptic peptides (Group 1), with 11-20 unique tryptic peptides (Group 2), and without any unique tryptic peptide (Group 3). The distributions of proteins in the three groups are shown in A, B, and C, respectively.


Figure 6. LSV comparison between 17,479 verified (PE1) proteins and 2,211 missing proteins. Groups A, B, and C denote verified proteins (PE1) with >=51, 26-50, and 0-25 in silico unique tryptic peptides, respectively. Groups Am, Bm, and Cm denote missing proteins (PE2-4) with >=51, 26-50, and 0-25 in silico unique tryptic peptides, respectively. The numbers of proteins in groups A, Am, B, Bm, C, and Cm were 2,994, 137, 5,455, 469, 9,030, and 1,605, respectively.
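
The grouping in this caption can be expressed as a small binning rule. The Python sketch below assigns a group label from a protein's number of in silico unique tryptic peptides and its verified or missing status, then checks that the group sizes quoted in the caption add up to the stated totals. The function name `lsv_group` and its arguments are hypothetical; only the thresholds and counts come from the caption.

```python
def lsv_group(n_unique, is_missing):
    """Return the Figure 6 group label for a protein.

    n_unique: number of in silico unique tryptic peptides (non-negative int).
    is_missing: True for missing proteins (PE2-4), False for verified (PE1).
    """
    if n_unique < 0:
        raise ValueError("n_unique must be non-negative")
    if n_unique >= 51:
        label = "A"
    elif n_unique >= 26:
        label = "B"
    else:  # 0-25 unique tryptic peptides
        label = "C"
    return label + ("m" if is_missing else "")


# Group sizes as quoted in the caption.
GROUP_SIZES = {"A": 2994, "Am": 137, "B": 5455, "Bm": 469, "C": 9030, "Cm": 1605}

verified_total = sum(n for g, n in GROUP_SIZES.items() if not g.endswith("m"))
missing_total = sum(n for g, n in GROUP_SIZES.items() if g.endswith("m"))
print(verified_total, missing_total)  # 17479 2211
```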


Figure 7. Comparison of the LSV distributions between 16,490 proteins experimentally validated at the protein level (PE1) and 2,646 proteins validated at the transcript level (PE2).


Figure 8. The LSV distribution in ten families and domains classified by InterPro. The ten families or domains containing the largest numbers of proteins with PE2 evidence were included in this analysis; they are A: IPR017452 (GPCR, rhodopsin-like, 7TM), B: IPR000276 (G protein-coupled receptor, rhodopsin-like), C: IPR000725 (Olfactory receptor), D: IPR015880 (Zinc finger, C2H2-like), E: IPR013087 (Zinc finger C2H2-type/integrase DNA-binding domain), F: IPR001909 (Krueppel-associated box), G: IPR009057 (Homeodomain-like), H: IPR001356 (Homeobox domain), I: IPR020683 (Ankyrin repeat-containing domain), and J: IPR013783 (Immunoglobulin-like fold). For each family or domain X, X1 and X2 denote its proteins with PE1 and PE2 evidence, respectively. In addition, Groups PE1 and PE2 represent the total of 16,490 and 2,646 proteins with evidence at the PE1 and PE2 levels, respectively.
