Evaluating the possibility of detecting variants in shotgun proteomics


Article

Evaluating the possibility of detecting variants in shotgun proteomics via LeTE-fusion analysis pipeline Tung-Shing Mamie Lih, Wai-Kok Choong, Yu-Ju Chen, and Ting-Yi Sung J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00052 • Publication Date (Web): 08 Aug 2018 Downloaded from http://pubs.acs.org on August 9, 2018


Journal of Proteome Research

Evaluating the possibility of detecting variants in shotgun proteomics via LeTE-fusion analysis pipeline

Tung-Shing Mamie Lih,1,2,‡ Wai-Kok Choong,1,2,‡ Yu-Ju Chen,2 Ting-Yi Sung1,*

1 Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan

2 Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan

KEYWORDS Variant peptides, mass spectrometry, proteogenomics, shotgun proteomics.

ABSTRACT

In proteogenomic studies, many genome-annotated events, e.g., single amino acid variation (SAAV) and short INDEL, are often unobserved in shotgun proteomics. Therefore, we propose an analysis pipeline called LeTE-fusion (Le: peptide length, T: theoretical values, E: experimental data) to first investigate whether peptides of certain lengths are observed more often in mass spectrometry (MS)-based proteomics, which may hinder peptide identification and make genome-annotated events difficult to detect. By applying LeTE-fusion to different MS-based proteome data sets, we found that peptides of 7-20 amino acids are more frequently identified, possibly because of MS-related factors rather than proteases. We then extended the use of LeTE-fusion to four variant-containing sequence data sets (SAAV-only) of various sample complexity, up to the whole human proteome scale, which showed that theoretically ~70% of variants are observable in an ideal shotgun proteomics experiment. However, only ~40% of variants might be detectable in real shotgun proteomic experiments when LeTE-fusion uses the experimentally observed variant-site-containing wild-type peptides in PeptideAtlas to estimate the expected observable coverage of variants. Finally, we conducted a case study on the HEK293 cell line, with variants reported at the genomic level that were also identified in shotgun proteomics, to demonstrate the efficacy of LeTE-fusion in estimating the expected observable coverage of variants. To the best of our knowledge, this is the first study to systematically investigate the detection limits of genome-annotated events via shotgun proteomics using such an analysis pipeline.

INTRODUCTION

Proteogenomics focuses on integrating genomics and proteomics data to generate a comprehensive biological map that assists in disease diagnosis and therapeutic drug discovery, and that also helps refine gene models and increase confidence in proteomics analyses.1,2 Currently, mass spectrometry (MS) is the major analytical tool for examining alterations of genomes at the protein level.3,4 Among the different MS-based proteomics analyses, shotgun proteomics (or bottom-up proteomics), which couples liquid chromatography (LC) with tandem MS (MS/MS), is the common approach for identifying proteins and protein variations.4,5 In shotgun proteomics, proteins in a sample mixture are first digested into peptides using a protease (e.g., trypsin). The mixture of digested peptides is loaded into LC for peptide separation. As peptides are separated, eluted, and ionized, some peptides are selected for further fragmentation to produce MS/MS spectra, which contain the characteristic sequence fragment ions (e.g., b- and y-ions).2,6 The acquired MS/MS spectra are typically searched against a customized protein sequence database to identify the variant peptides. The customized database can be constructed from various sources, such as six-frame translation of the genomic sequence, expressed sequence tag data, and RNA sequencing data (RNA-Seq).2

However, the number of variant peptides identified in shotgun proteomics is often much smaller than the number of variants determined at the genomic level. For instance, Ruggles et al.7 predicted 20435 single nucleotide variant (SNV) peptides in patient-derived xenograft tumors from established Basal (WHIM2) using whole genome analysis (WGA). Among the WGA-estimated variant peptides, 5552 variant peptides were also found using RNA-Seq. Nonetheless, only 610 variant peptides were identified from MS/MS data.7 In a study based on five patients with lung adenocarcinoma, only four nonsynonymous mutants out of thousands of tumor-specific variants observed in RNA-Seq were identified via shotgun proteomics with manual validation.8 Shotgun proteomics can provide high-throughput proteome analysis for both simple and complex protein mixtures; yet there are detection limitations in MS, such as difficulty in detecting low-abundance proteins,9,10 hydrophobic proteins,11,12 and proteins composed of ≤50 amino acids.13 Moreover, many proteomics studies predominantly rely on trypsin as the single protease for protein digestion because of its specificity and efficiency.3,14 However, trypsin is unsuitable for proteins lacking trypsin cleavage sites.11

To explore why far fewer variants are detected at the protein level than at the genomic level, we propose a computational analysis pipeline, called LeTE-fusion, to conduct a comprehensive in silico analysis. LeTE-fusion considers the possible effect of peptide Lengths, computes Theoretical coverages to provide an ideal estimation of peptide and variant peptide detections, and then utilizes peptides with Experimental evidence to derive a more realistic estimation of the percentage of detectable genome-annotated variants in shotgun MS experiments. To account for the detection limitations of MS and the minimum peptide length required for more confident identification, the pipeline uses peptides with lengths from seven to 40 amino acids (aa) and groups peptides into six length ranges, namely 7-15aa, 7-20aa, 7-25aa, 7-30aa, 7-35aa, and 7-40aa, in order to examine the changes under different peptide lengths. We divided our study into three major parts.
First, applying LeTE-fusion, we conducted an in silico analysis on the human proteome to perform a pairwise comparison between in silico tryptic peptides and experimentally observed tryptic peptides in PeptideAtlas.12,15 From this comparison, we inspected the relationship between peptide length and peptide identification to check whether certain peptide lengths were observed more often, which could affect variant identification.

In the second part of the study, we applied LeTE-fusion to examine the percentage of variants (called theoretical variant coverage) that can ideally be detected using in silico fully digested variant peptides of the human protein sequences from four sequence data sets of samples with different complexities, namely whole human proteome, tissue, cell line, and a single protein. This theoretical variant coverage provides a theoretical bound on observing variant peptides in an ideal shotgun proteomics experiment; however, a theoretical variant coverage based on a single protease for digestion can still be limited regardless of the peptide-length range. For instance, Giansanti et al.16 investigated the use of seven proteases in parallel to digest the proteins of Escherichia coli (E. coli). Different sets of peptides were generated because of the distinct specificity of the proteases, and the number of identified proteins increased when the results from all proteases were combined. Moreover, Choong et al.17 thoroughly examined three published studies to demonstrate the possibility of identifying missing proteins experimentally by using multiple proteases. Using multiple proteases may therefore increase proteome coverage and enhance identification,17 so we conducted an analysis to study the feasibility of using multiple proteases in parallel to raise the theoretical bound of variant coverage.

The last part of our study focused on approximating a more realistic variant coverage than the theoretical bound. We used observed counterpart wild-type peptides (i.e., variant-site-containing wild-type peptides with experimental evidence in PeptideAtlas) in LeTE-fusion to assess the expected observable coverage of variant peptides in shotgun MS experiments. The physicochemical properties of variant peptides and their counterpart wild-type peptides were similar (shown later in Results and Discussion); thus, it was adequate to estimate the coverage of variants expected to be observable in actual shotgun MS experiments by matching in silico fully tryptic variant peptides to the observed counterpart wild-type peptides, with consideration of different peptide-length ranges. To demonstrate the efficacy of LeTE-fusion and of the above approximate estimation of variants detected in MS experiments, we conducted a case study on the Human Embryonic Kidney cell line (HEK293), with variants detected at both genomic18 and proteomic19 levels. However, a discrepancy between the variants expected to be observable and those experimentally detected was still noted.
We further utilized the abundance dynamic range from PaxDb20 (https://pax-db.org/) and Human Protein Atlas21 (www.proteinatlas.org) to examine whether the variants undetected in shotgun proteomics belonged to proteins with lower abundance, even at the RNA level.

To the best of our knowledge, this is the first in-depth in silico study to investigate the possibility of observing variants in shotgun proteomics. To conduct the study, we propose the LeTE-fusion analysis pipeline. By applying it, we can examine the factors that may affect variant identification at the protein level. Because of its robustness, LeTE-fusion can also be applied to any protein sequence data set to assess the likelihood of detecting the variants in the samples of interest. Furthermore, using observed counterpart wild-type peptides in LeTE-fusion, we found that only 28.5% to 44.5% of the variants in our three data sets from complex samples of cell line, tissue, and whole human proteome were expected to be observable in MS experiments.

MATERIALS AND METHODS

Data sets

We acquired 12 data sets from different sources and classified them into two main categories, namely, six protein sequence data sets and six MS-based data sets. A protein sequence data set contains protein sequences that were collected either from public sequence databases (e.g., UniProt) or from the literature. The first protein sequence data set was created using 20204 human protein sequences in UniProtKB/Swiss-Prot22 (version 201509), denoted as s_UP2015. The remaining five sequence data sets were proteins with variant information. To be specific, we first utilized the variant information provided in UniProt23 (humsvar.txt, version 201609, total of 73582 variants) to find the proteins in s_UP2015 having variant annotations to generate a sequence data set, denoted as s_Proteome. Next, we combined two protein FASTA files (https://github.com/vetbio/2015jpr) containing variant annotations published by Kim et al.8 that were generated by converting RNA-Seq data of paired tumor and adjacent normal tissues from a patient with lung adenocarcinoma (sample: 10060NT), denoted as s_Tissue. There were originally 21740 and 21331 missense variants in the FASTA files of the paired normal and tumor tissues, respectively. After filtering out any variants associated with the introduction or removal of stop codons, we had a total of 28657 variants in the s_Tissue data set. Third, we downloaded protein sequences from Ensembl24 and combined them with variant information obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC)25 for the A549 human cell line; the data set was denoted as s_CellLine. In the s_CellLine data set, initially there were 522 missense variants; after removing redundant variation events and excluding variants associated with the introduction or removal of stop codons, 275 variants remained. Fourth, we constructed a single protein data set of dual specificity mitogen-activated protein kinase kinase 1 (MP2K1_HUMAN), denoted as s_SinglePro, where the sequence and variant annotation (total of 8 known variants) were acquired from UniProt (accession: Q02750). The last sequence data set was a set of protein sequences of the HEK293 cell line with 1336 variants annotated via whole exome sequencing (WES),18,19 denoted as s_HEK293. Note that if a protein contains multiple variant annotations, we generate multiple entries for the same protein, and each entry contains only one of the variant annotations, e.g., the 73582 variant annotations in the s_Proteome data set correspond to 73582 protein sequences.

Unlike protein sequence data sets, the MS-based data sets contain proteins/peptides with MS experimental evidence. We collected lists of canonical human proteins (i.e., non-redundant proteins identified with high confidence) and their peptides in PeptideAtlas versions 201503 and 201601, denoted as m_PA2015 and m_PA2016, respectively. The m_PA2015 data set includes 1025698 distinct peptides from 133638335 peptide-spectrum matches (PSM) of 1011 samples, whereas m_PA2016 consists of 1166164 distinct peptides from 160165850 PSM of 1202 samples. We also downloaded MS-based human proteomics data with identification results in the Global Proteome Machine Database26 (GPMDB, downloaded on 2016/02/22), denoted as m_GPMDB. Another two MS-based data sets were identified proteins and peptides of DLD-1 cells published by Chen et al.,27 in which the authors acquired MS data from hundred-scale cell samples digested by trypsin and chymotrypsin in parallel (i.e., multiple MS experiments were conducted, each using a different single protease for protein digestion), denoted as m_CellTR and m_CellCH, respectively. Besides human data sets, we obtained identification results of Escherichia coli (E. coli) generated by Giansanti et al.16 using seven different proteases in parallel for their MS experiments, denoted as m_Ecoli. Table 1 shows the short tags, brief description, and number of peptides and variants in each data set. Note that the variants may come from single nucleotide polymorphisms or other alterations at the genomic level, depending on the definition given in the original sources.
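The one-variant-per-entry convention described above can be illustrated with a minimal Python sketch. The function name, the entry-naming scheme, and the toy accession and sequence are hypothetical and chosen only for illustration; the source does not specify how entries were named or stored.

```python
def expand_variant_entries(accession, sequence, variants):
    """Create one sequence entry per single amino acid variant.

    variants: list of (position, ref_aa, alt_aa) with 1-based positions.
    The entry naming scheme is a hypothetical choice for this sketch.
    """
    entries = []
    for position, ref_aa, alt_aa in variants:
        # Guard against annotations that do not match the sequence.
        if sequence[position - 1] != ref_aa:
            raise ValueError(f"{accession}: expected {ref_aa} at position {position}")
        mutated = sequence[:position - 1] + alt_aa + sequence[position:]
        entries.append((f"{accession}_{ref_aa}{position}{alt_aa}", mutated))
    return entries


# Toy protein with two annotated variants gives two separate entries.
entries = expand_variant_entries("P00000", "MKTAYIAK", [(3, "T", "A"), (5, "Y", "F")])
```

Keeping a single substitution per entry means each variant peptide differs from its wild-type counterpart at exactly one site, which simplifies the later per-variant coverage counts.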

LeTE-fusion

LeTE-fusion is designed as a robust analysis pipeline that considers the Length of peptides, Theoretical estimation, and Experimental evidence of peptides to explore the factors that might affect the identification of variant peptides while assessing the possibility of detecting variants in shotgun proteomics. LeTE-fusion contains four major components: (1) in silico digestion of protein sequence data sets, (2) grouping peptides and proteins based on different peptide-length ranges, (3) calculating theoretical coverage using in silico peptides, namely, theoretical protein coverage and theoretical variant coverage, and (4) examining experimental evidence of peptides in MS data sets to estimate experimental protein coverage or to estimate the coverage of variant sites expected to be observed. An illustration of LeTE-fusion is shown in Figure S1. The required processing steps in LeTE-fusion are described as follows.


Table 1. Short tags and data description of the data sets used in this study.

Tag | Sample source | Description | Variant-contained | Number of peptide sequences (7-40aa) | Number of variants

Protein Sequence Data Sets
s_UP2015 | Human | Reviewed human protein sequences in UniProt version 201509 | No | 542099 | -
s_Proteome | Human | Protein sequences in s_UP2015 that contain variants (reviewed human variant information from UniProt) | Yes | 53287 | 73582
s_Tissue | Human | Protein sequences generated from RNA-seq data of a paired tumor and its adjacent normal tissues of a patient with lung adenocarcinoma (sample: 10060NT) by Kim et al.7 | Yes | 20248 | 28657a
s_CellLine | Human | Protein sequences of A549 cell line downloaded from Ensembl combined with variant information found in COSMIC | Yes | 199 | 275a
s_SinglePro | Human | Protein sequence of dual specificity mitogen-activated protein kinase kinase 1 (MP2K1) with its variant information in UniProt | Yes | 4 | 8
s_HEK293 | Human | Protein sequences containing variants were detected using WES published by Lin et al.19 | Yes | 687b | 999

MS-based Data Sets
m_PA2015 | Human | Canonical proteins and peptides in PeptideAtlas (ver. 2015-03) | No | 1020326 | -
m_PA2016 | Human | Canonical proteins and peptides in PeptideAtlas (ver. 2016-01) | No | 1157723 | -
m_GPMDB | Human | Data downloaded from GPMDB on 2016/02/22 | No | 1124104 | -
m_Ecoli | E. coli | Identified peptides of E. coli via shotgun proteomics of parallel use of multiple proteases (i.e., trypsin, chymotrypsin, Arg-C, Asp-N, Glu-C, Lys-C and Lys-N) by Giansanti et al.15 | No | 647 (Glu-C); 4852 (Lys-C); 5341 (Asp-N); 7810 (Lys-N); 8430 (Arg-C); 15006 (Trypsin); 10537 (Chymotrypsin) | -
m_CellTR | DLD-1 cells | Identified trypsin (TR) digested peptides of 300 DLD-1 cells via shotgun proteomics by Chen et al.20 | No | 4046 | -
m_CellCH | DLD-1 cells | Identified chymotrypsin (CH) digested peptides of 300 DLD-1 cells via shotgun proteomics by Chen et al.20 | No | 832 | -

a We removed variants relating to introduction and removal of stop codons.
b We removed sequence conflict between RefSeq and Swiss-Prot as well as the isoforms.


In silico digestion

All of the protein sequence data sets were in silico digested by commonly used proteases in parallel, including trypsin, chymotrypsin, Glu-C, Lys-C, Arg-C, Asp-N, and Lys-N, except s_UP2015, which was subjected to in silico trypsin digestion only. All digestions followed the cleavage specificity rules provided by ExPASy PeptideCutter and Keil’s rules.28 To be specific, the first five proteases cleave at the C-terminus of specific amino acids, that is, K or R not before P for trypsin; F, Y, L, M or W not before P for chymotrypsin; D or E for Glu-C; K for Lys-C; and R for Arg-C. The last two proteases, Asp-N and Lys-N, cleave at the N-terminus of D and K, respectively. Furthermore, we considered only fully digested in silico peptides, referred to as in silico peptides for simplicity, in this study. As mentioned above, parallel use of multiple proteases means applying in silico digestion with one protease at a time.
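The cleavage rules listed above can be sketched in Python as follows. This is a simplified illustration that encodes only the rules stated in the text and assumes full digestion with no missed cleavages; it does not model any additional exception rules from PeptideCutter or Keil, and the names `C_TERM_RULES`, `N_TERM_RULES`, and `digest` are hypothetical.

```python
# C-terminal cutters: residues cleaved after, and whether a following
# proline blocks cleavage (as stated for trypsin and chymotrypsin).
C_TERM_RULES = {
    "trypsin": ("KR", True),
    "chymotrypsin": ("FYLMW", True),
    "glu-c": ("DE", False),
    "lys-c": ("K", False),
    "arg-c": ("R", False),
}
# N-terminal cutters: residue cleaved before.
N_TERM_RULES = {"asp-n": "D", "lys-n": "K"}


def digest(sequence, protease):
    """Return fully digested peptides (no missed cleavages) in sequence order."""
    protease = protease.lower()
    if protease in C_TERM_RULES:
        residues, proline_blocks = C_TERM_RULES[protease]
        cuts = [
            i + 1
            for i, aa in enumerate(sequence[:-1])
            if aa in residues and not (proline_blocks and sequence[i + 1] == "P")
        ]
    elif protease in N_TERM_RULES:
        residue = N_TERM_RULES[protease]
        cuts = [i for i, aa in enumerate(sequence) if aa == residue and i > 0]
    else:
        raise ValueError(f"unknown protease: {protease}")
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


# "Parallel" use of proteases: digest the same toy sequence once per protease.
toy = "AKRPGKEQRDLWK"
parallel = {name: digest(toy, name) for name in list(C_TERM_RULES) + list(N_TERM_RULES)}
```

Each protease is applied independently to the intact sequence, which mirrors the statement that parallel digestion uses one protease at a time rather than a mixture.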

Grouping peptides and proteins based on peptide length

The length of digested peptides may affect the overall result of a shotgun proteomics experiment. Long peptides may be undetectable by the mass spectrometer or lack a sufficient number of fragment ions for accurate peptide identification because of the detection limit, or because of ionization and/or fragmentation efficiency (assuming collision-induced dissociation, CID, as the fragmentation method). On the other hand, short peptides (i.e., peptides with length ≤ 6aa) may yield low-confidence protein identification because they are more likely to be shared peptides and thus may lead to ambiguity in protein inference. Furthermore, the percentage of shared peptides decreases gradually at peptide length ≥ 7aa, as described by Choong et al.17 Therefore, to take all of the above into consideration and to comply with PeptideAtlas’ standard of a peptide length of at least seven amino acids,12,29 when performing in silico digestion on a protein sequence data set by a specific protease, we considered fully digested in silico peptides with lengths in the following six ranges: 7-15 amino acids (aa), 7-20aa, 7-25aa, 7-30aa, 7-35aa, and 7-40aa.

We use Q69YW2 (protein name: protein stum homolog) in s_UP2015 to demonstrate peptide grouping. This protein has eight fully tryptic in silico peptides, as shown in Figure 1. For the peptide-length range of 7-15aa, we group GASSSSGVVVQVR (13aa), EKKGPLR (7aa), and EQGIPQQL (8aa) together. For 7-20aa, DAETAAAAAAVAAADPR (17aa) is included in addition to the aforementioned three peptides, and the grouping proceeds similarly for the peptide-length ranges 7-25aa, 7-30aa, 7-35aa, and 7-40aa. Proteins are then grouped based on their peptide lengths. For example, we say “proteins in the 7-20aa length range” to mean proteins having at least one in silico peptide of length 7-20aa, and only such peptides are used in the corresponding analysis.
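The grouping step can be sketched as below, using the four Q69YW2 peptides quoted in the text (the protein has eight in silico peptides in total; the other four are not listed in this section, so they are omitted). The names `LENGTH_RANGES` and `group_by_length_range` are hypothetical.

```python
# The six cumulative peptide-length ranges used in the study.
LENGTH_RANGES = [(7, 15), (7, 20), (7, 25), (7, 30), (7, 35), (7, 40)]


def group_by_length_range(peptides, ranges=LENGTH_RANGES):
    """Map each length range to the peptides whose length falls inside it."""
    return {
        f"{lo}-{hi}aa": [p for p in peptides if lo <= len(p) <= hi]
        for lo, hi in ranges
    }


# Only the peptides of Q69YW2 that are quoted in the text.
q69yw2_peptides = ["GASSSSGVVVQVR", "EKKGPLR", "EQGIPQQL", "DAETAAAAAAVAAADPR"]
groups = group_by_length_range(q69yw2_peptides)
```

Because every range starts at 7aa, the groups are nested: each wider range contains all peptides of the narrower ones, which is why the 7-20aa group here is the 7-15aa group plus the 17aa peptide.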

Calculating theoretical and experimental protein coverages

For each defined peptide-length range, we define the theoretical protein coverage of a protein as the percentage of the protein sequence covered by its fully digested in silico peptides under the given peptide-length range (i.e., total length of in silico peptides divided by the protein length), which reflects the protein coverage in an ideal shotgun proteomics experiment. As indicated in Figure 1, Q69YW2 has a total length of 141aa. To calculate the theoretical protein coverage under the length range of 7-15aa, we sum the lengths of the three in silico peptides and divide by the total protein length, which gives (13+7+8)/141=19.86%. The same procedure is carried out for the other length ranges.

The experimental protein coverage calculation and the grouping of experimentally observed peptides are identical to the theoretical ones, except that we only use peptides having experimental evidence. When examining the experimental evidence of theoretical peptides, we matched the peptides to experimentally observed fully digested peptides in the MS-based data sets. Similar to theoretical protein coverage, we define the experimental protein coverage of a protein as the total length of observed peptides divided by the protein length, which reflects protein observation in actual shotgun proteomics experiments.
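The coverage definition can be checked with a short Python sketch that reproduces the Q69YW2 example. The function names are hypothetical, and summing lengths assumes the peptides do not overlap, which holds for fully digested peptides of a single protein. The set of observed peptides in the usage lines is invented purely to show the call and is not taken from any data set.

```python
def protein_coverage(peptides, protein_length, lo, hi):
    """Percentage of the protein covered by peptides with lo <= length <= hi."""
    covered = sum(len(p) for p in peptides if lo <= len(p) <= hi)
    return 100.0 * covered / protein_length


def experimental_coverage(peptides, observed, protein_length, lo, hi):
    """Same calculation, restricted to peptides with experimental evidence."""
    return protein_coverage([p for p in peptides if p in observed], protein_length, lo, hi)


# Peptides of Q69YW2 quoted in the text; the protein is 141 aa long.
q69yw2_peptides = ["GASSSSGVVVQVR", "EKKGPLR", "EQGIPQQL", "DAETAAAAAAVAAADPR"]
theoretical_7_15 = protein_coverage(q69yw2_peptides, 141, 7, 15)  # 28 / 141, about 19.86%

# Hypothetical evidence set, for illustration only.
observed = {"GASSSSGVVVQVR"}
experimental_7_15 = experimental_coverage(q69yw2_peptides, observed, 141, 7, 15)
```

The experimental value can never exceed the theoretical one for the same range, since the observed peptides are a subset of the in silico peptides.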

Calculating theoretical variant coverage using in silico variant peptides

To conduct in silico analyses on variant sites in a protein sequence data set, we retained only fully digested in silico variant peptides for analysis. The in silico variant peptides are grouped based on their lengths into the above-mentioned six peptide-length ranges. Given the total number of variants in a data set, we define theoretical variant coverage as the percentage of variants that are contained in in silico peptides under a specific peptide-length range (e.g., 7-20aa). It provides a theoretical or ideal bound on detecting variants, assuming that all in silico variant peptides can be detected in an ideal shotgun proteomics experiment.
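A minimal sketch of this definition for trypsin is given below. It assumes single amino acid substitutions with 1-based positions and uses only the simplified trypsin rule stated earlier (cleave after K or R unless followed by P); the function names and the toy sequence are hypothetical. Digesting the mutated sequence, rather than the wild type, lets a substitution that creates or removes a cleavage site change the variant peptide.

```python
def tryptic_peptides(sequence):
    """Fully tryptic peptides: cleave after K/R unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


def variant_peptide(sequence, position, alt_aa):
    """Peptide of the mutated sequence that contains the variant site (1-based)."""
    mutated = sequence[:position - 1] + alt_aa + sequence[position:]
    start = 0
    for pep in tryptic_peptides(mutated):
        end = start + len(pep)
        if start < position <= end:
            return pep
        start = end
    raise ValueError("position outside sequence")


def theoretical_variant_coverage(variants, lo=7, hi=20):
    """Percentage of variants whose variant peptide length lies in [lo, hi]."""
    covered = sum(
        1 for seq, pos, alt in variants if lo <= len(variant_peptide(seq, pos, alt)) <= hi
    )
    return 100.0 * covered / len(variants)


# Toy protein: peptides are MK | AAAAAAAAK | GGR.
toy = "MK" + "A" * 8 + "K" + "GGR"
variants = [(toy, 5, "V"), (toy, 13, "A")]
coverage_7_20 = theoretical_variant_coverage(variants, 7, 20)
```

In this toy case the first variant falls in a 9aa peptide and is counted, while the second falls in a 3aa peptide and is not, so half of the variants are covered in the 7-20aa range.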


Figure 1. Peptide grouping and calculation of theoretical and experimental protein coverages using in silico and experimentally observed fully tryptic peptides based on different peptide-length ranges, illustrated by using Q69YW2 (Protein stum homolog) as an example.

RESULTS AND DISCUSSION

To investigate the inconsistency in the number of observed variants at the protein and genome levels, we used our proposed analysis pipeline, LeTE-fusion, to analyze the protein sequence data sets and MS-based data sets. We first applied LeTE-fusion to the human proteome to inspect the possible effects on peptide identification induced by either proteases or MS-related factors that may contribute to the lower discovery rate of variants in shotgun proteomics. Next, we examined the variant coverage in four protein sequence data sets with different sample complexity via LeTE-fusion and then estimated the expected observable variant coverage using counterpart wild-type peptides (variant-site-containing wild-type peptides with experimental evidence in PeptideAtlas). Finally, to verify the estimated variant coverage obtained via the pipeline, we conducted a case study using the HEK293 cell line data set.


Comparison of theoretical and experimental protein coverages of human proteome using trypsin digestion

We first examined the discrepancy between ideal and actual shotgun experiment outcomes by comparing the theoretical and experimental protein coverages based on trypsin digestion and peptides with lengths in the aforementioned six peptide-length ranges. Tryptic in silico peptides in s_UP2015 (protein sequence data set) and experimentally observed tryptic peptides (or simply observed peptides in this section) in m_PA2015 and m_PA2016 (MS-based data sets) were used to explore the human proteome. To be specific, following LeTE-fusion, we performed in silico trypsin digestion on protein sequences in s_UP2015 to obtain fully tryptic in silico peptides with lengths in a specific range. Observed peptides within the same length range were collected from m_PA2015 and m_PA2016 as two independent sets. Theoretical and experimental protein coverages were then calculated for each protein using the in silico and observed peptides, respectively.

To compare theoretical and experimental protein coverages based on in silico and observed peptides of different lengths, we calculated the coverages of the proteins common to all three data sets, s_UP2015, m_PA2015 and m_PA2016. For the six length ranges from 7-15aa to 7-40aa, the numbers of the corresponding protein groups varied slightly. There were 14420 proteins in the peptide-length range of 7-15aa, 14630 proteins in 7-20aa, 14674 proteins in 7-25aa, 14689 proteins in 7-30aa, 14698 proteins in 7-35aa, and 14703 proteins in 7-40aa. The protein coverages were computed using the proteins in the corresponding peptide-length range.

Differences in theoretical and experimental protein coverages based on different peptide-length ranges

For each peptide-length range, we calculated theoretical protein coverages using in silico peptides generated from s_UP2015, and two experimental protein coverages using observed peptides from m_PA2015 and m_PA2016, respectively. Figure 2A and B show the protein coverage distributions for peptide-length ranges 7-20aa and 7-40aa. Regardless of the peptide-length range, the experimental protein coverages based on m_PA2015 (red dots) and m_PA2016 (green dots) were distributed similarly; and yet more than half of the proteins showed experimental coverage [...] 60%; on the contrary, the experimental protein coverage was substantially lower, more than 50% of the proteins with experimental protein coverage [...] 99% of the proteins regardless of the peptide-length ranges (Figure 2A, 2B and S2), suggesting we would miss certain parts of the proteins. Furthermore, the distribution patterns of observed peptide ratios in the human proteome (Figure 2D) and the E. coli data set (Figure S4) were more likely influenced by MS-related factors than by the protease used for digestion. Instead of randomly or evenly selecting peptides in the sample to generate MS/MS spectra, MS conditions (e.g., CID fragmentation) seemed to favor peptides with lengths of 8-20aa, thereby rendering peptides in this length range more observable in MS experiments. Thus, the above analyses suggest that certain regions of the proteins might be less likely to be identified when the lengths of the digested peptides containing those regions fall outside the observable ranges in shotgun MS experiments. If these regions happened to have variants, then the ability to identify those variants would certainly be affected. Although the observed peptide ratio was much higher for peptide lengths 12 ≤ l ≤ 15, the total numbers of peptides of other lengths are still considerably large. Therefore, the analyses in the following sections are still carried out based on the defined peptide-length ranges.

Theoretical variant coverage analysis of protein sequence data sets with different sample complexities

In the previous sections, we found that peptides of certain lengths are observed more often than others. Thus, this section mainly describes our findings regarding the possibility of observing variants in the peptide-length range of 7-20aa. We also analyzed peptides in the other length ranges, but here we only present the comparison between the 7-20aa and 7-40aa length ranges, since the latter represents the case with the highest number of in silico peptides while remaining within the MS detection and peptide identification limit.

Theoretical variant coverages based on trypsin digestion

We first conducted an in silico analysis to calculate the theoretical variant coverages of proteins, assuming ideal shotgun analysis, in s_Proteome, s_Tissue, s_CellLine and s_SinglePro, four data sets with decreasing sample complexity at the whole human proteome, tissue, cell line and single protein levels, respectively. The theoretical variant coverages of the four data sets are shown in Table 2. The three data sets with higher sample complexity, from cell line to whole human proteome, exhibited a similar trend in theoretical variant coverage, which differed from that of the single protein data set. Specifically, using trypsin for digestion, ~70% of variants in s_Proteome (53287/73582), s_Tissue (20248/28657), and s_CellLine (199/275) could possibly be detected in shotgun proteomics when considering the longest peptide-length range (i.e., 7-40aa). Furthermore, the theoretical variant coverage decreased as the peptide-length range narrowed. Considering the peptide-length ranges that were more likely to be detected in MS experiments, namely, 7-15aa and 7-20aa, at most 37.45% and 48% of variants, respectively (both in the s_CellLine data set), were expected to be detected in MS.


On the other hand, the theoretical variant coverage of the single protein remained at 50% for peptide-length ranges of 7-25aa and longer, and dropped to 12.5% at 7-15aa and 7-20aa. The result for the single protein suggests that the theoretical variant coverage depends on the amino acid composition of the protein. For instance, if a protein sequence contains too many or too few K or R residues for trypsin digestion, we would have a lower chance of observing its fully tryptic variant peptides of appropriate lengths (e.g., 7-20aa). Each protein would have its own theoretical variant coverage depending on its amino acid composition. As the sample complexity increases, more proteins with different amino acid compositions are involved. Even assuming the longest peptide-length range of 7-40aa and an ideal shotgun proteomics experiment, the theoretical variant coverage was about 70%, leaving about 30% of variants unseen in the complex data sets. Hence, it is quite unlikely that most of the variant sites in data sets with relatively high sample complexity can be identified in shotgun proteomics. Moreover, the theoretical variant coverage decreased as the peptide length decreased, and thus it is unlikely that applying trypsin alone for digestion could identify most of the variant peptides in a complex proteome using shotgun proteomics.
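The dependence of theoretical variant coverage on the placement of K and R can be made concrete with a minimal Python sketch. It assumes the simplest tryptic rule (cleavage C-terminal to every K or R, no missed cleavages, no proline exception), applies each single amino acid substitution to the sequence before digestion, and asks whether the resulting fully tryptic variant peptide falls within the length range. The function names, the toy sequence, and the toy variants are hypothetical; this is not the LeTE-fusion code.

```python
def trypsin_digest(protein):
    """Fully tryptic peptides as (0-based start, sequence) pairs.

    Assumption: cleave after every K or R, with no missed cleavages
    and no proline rule.
    """
    peptides = []
    start = 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            peptides.append((start, protein[start:i + 1]))
            start = i + 1
    if start < len(protein):
        peptides.append((start, protein[start:]))
    return peptides


def variant_is_observable(protein, pos, alt, min_len=7, max_len=40):
    """True if the tryptic peptide carrying the substituted residue is in range."""
    mutated = protein[:pos] + alt + protein[pos + 1:]
    for start, pep in trypsin_digest(mutated):
        if start <= pos < start + len(pep):
            return min_len <= len(pep) <= max_len
    return False


def theoretical_variant_coverage(protein, variants, min_len=7, max_len=40):
    """Fraction of (position, alternative residue) variants that are observable."""
    hits = sum(variant_is_observable(protein, p, a, min_len, max_len)
               for p, a in variants)
    return hits / len(variants)


# Toy protein: tryptic peptides of 9, 6 and 12 residues (hypothetical).
toy = "ACDEFGHIKLMNPQRSTVWYACDEFGK"
toy_variants = [(2, "N"), (10, "V"), (8, "A")]  # the last one removes a K site
print(theoretical_variant_coverage(toy, toy_variants, 7, 40))
```

The second toy variant sits on a 6aa peptide and is therefore unobservable at 7-40aa, whereas the third removes a cleavage site and ends up on a longer, in-range peptide, showing how a substitution at K or R can itself change the peptide length.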

Table 2. Theoretical variant coverages using fully tryptic in silico peptides in different peptide-length ranges.

Data set                s_Proteome          s_Tissue            s_CellLine          s_SinglePro
Total variants          73582               28657               275                 8
Peptide-length range    number  percentage  number  percentage  number  percentage  number  percentage
7-40aa                  53287   72.4%       20248   70.7%       199     72.4%       4       50.0%
7-35aa                  50266   68.3%       19204   67.0%       193     70.2%       4       50.0%
7-30aa                  46021   62.5%       17573   61.3%       186     67.6%       4       50.0%
7-25aa                  40312   54.8%       15556   54.3%       168     61.1%       4       50.0%
7-20aa                  32804   44.6%       12608   44.0%       132     48.0%       1       12.5%
7-15aa                  23473   31.9%       9013    31.5%       103     37.5%       1       12.5%
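The percentages in Table 2 are simply the number of covered variants divided by the total number of variants in each data set. The short Python sketch below recomputes a few of them from the counts reported in the table; the helper name `coverage_pct` is a hypothetical label introduced only for this illustration.

```python
def coverage_pct(covered, total):
    """Variant coverage in percent, rounded to one decimal place."""
    return round(100.0 * covered / total, 1)


# Counts taken from Table 2 (7-40aa and 7-15aa rows).
totals = {"s_Proteome": 73582, "s_Tissue": 28657, "s_CellLine": 275, "s_SinglePro": 8}
covered_7_40 = {"s_Proteome": 53287, "s_Tissue": 20248, "s_CellLine": 199, "s_SinglePro": 4}
covered_7_15 = {"s_Proteome": 23473, "s_Tissue": 9013, "s_CellLine": 103, "s_SinglePro": 1}

pct_7_40 = {name: coverage_pct(covered_7_40[name], totals[name]) for name in totals}
pct_7_15 = {name: coverage_pct(covered_7_15[name], totals[name]) for name in totals}

for name in totals:
    print(f"{name}: 7-40aa {pct_7_40[name]}%, 7-15aa {pct_7_15[name]}%")
```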

Theoretical variant coverages calculated from parallel use of multiple proteases for in silico digestion

Considering the peptide-length range 7-20aa, which is more likely to be detected in shotgun proteomics experiments, the theoretical variant coverage using trypsin digestion is still less than 50%, suggesting more than 50% of variants are unlikely to be detected. Therefore, we conducted in silico analysis on the four data sets (i.e., s_Proteome, s_Tissue, s_CellLine, and s_SinglePro) using the following six proteases for digestion: chymotrypsin, Glu-C, Arg-C, Lys-C, Lys-N, and Asp-N, in addition to trypsin, to investigate whether the theoretical variant coverage could be increased. Each of the seven proteases was used to digest the proteins in the data sets in silico, generating in silico variant peptides from which the theoretical variant coverage was derived for each peptide-length range. Comparing the in silico digestion performance of the individual proteases, trypsin, chymotrypsin, and Glu-C yielded higher theoretical variant coverages than Arg-C, Lys-C, Lys-N, and Asp-N, regardless of the peptide-length range, for all the data sets except s_SinglePro (Figure S6 and Tables S1-4). The results could be attributed to the fact that trypsin, Glu-C, and chymotrypsin can cleave at multiple types of amino acid, generating more short peptides than the other four proteases, which can only cleave at a single type of amino acid. We then carried out a comprehensive analysis of all combinations of parallel use of 1 to 7 proteases (trypsin and the six other proteases) by non-redundantly aggregating the results of all individual proteases. For each data set, we evaluated the number of variants that could be observed via only one protease (regardless of the type of protease) or via more proteases. Figure 3 shows, for the s_Proteome and s_Tissue data sets, the theoretical variant coverage for k (1≦k≦7) proteases used in parallel, defined as the number of variants contained in in silico peptides produced by “exactly” k of the 7 proteases; the results for s_CellLine and s_SinglePro are shown in Figure S7.
We particularly noticed that some variants cannot be seen in in silico peptides of a particular length range generated by any of the seven proteases; these are denoted as “missed” variants for multiple proteases and for trypsin alone in Figure 3. Using the result for length range 7-20aa of s_Proteome as an example (“7-20 Multiple Proteases” in Figure 3A), 9802 variants (13.32% of all variants) are missed. Similarly, 15476 variants (theoretical variant coverage of 21.03%) can be seen in in silico variant peptides from only one particular protease (i.e., these variants cannot be seen with the other six proteases; marked as “1 protease”). In contrast, only 0.39% of the variants can be seen with every one of the seven proteases (marked as “7 proteases”) for protein digestion in MS experiments. Under the same peptide-length range, the s_Tissue data set exhibited similar behavior: 12.73% (3648/28657) of variants were missed and 20.69% (5931/28657) of variants were covered by only one protease (“7-20 Multiple Proteases” in Figure 3B). The overall variant coverage could be improved, especially in the peptide-length range of 7-40aa, by using multiple proteases in parallel to complement the limited digestion by trypsin alone. Moreover, parallel digestion using multiple individual proteases could generate different overlapping peptides that contain variants of the protein; identifying these overlapping peptides would increase the confidence in identifying variant peptides.
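The "exactly k proteases" tally can be sketched in a few lines of Python. The sketch assumes that, for each protease, the set of variants observable in its in silico peptides (within one length range) has already been computed; it then counts by how many proteases each variant is seen, with zero corresponding to "missed". The function name, variant identifiers, and per-protease sets are hypothetical toy data, not results from the paper.

```python
from collections import Counter


def coverage_by_protease_count(seen_by, all_variants):
    """Map k -> number of variants observable with exactly k proteases.

    seen_by: dict of protease name -> set of observable variant ids.
    k = 0 corresponds to the "missed" variants.
    """
    hits = Counter()
    for variants in seen_by.values():
        for v in variants:
            hits[v] += 1
    return Counter(hits[v] for v in all_variants)


# Toy input with three of the seven proteases (hypothetical variant ids).
all_variants = ["v1", "v2", "v3", "v4"]
seen_by = {
    "trypsin": {"v1", "v2"},
    "chymotrypsin": {"v1"},
    "Glu-C": {"v1", "v3"},
}

by_k = coverage_by_protease_count(seen_by, all_variants)
union_coverage = 1 - by_k[0] / len(all_variants)
print(dict(by_k), f"coverage with any protease: {union_coverage:.0%}")
```

The overall coverage obtained by non-redundant aggregation is the complement of the "missed" fraction, which is why adding proteases can only maintain or raise it.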

Figure 3. Theoretical variant coverages in different peptide-length ranges using trypsin only and multiple proteases for in silico digestion, respectively. (A) Results of s_Proteome data set (73582 variants), and (B) Results of s_Tissue data set (28657 variants). In this figure, all combinations of using exactly k (1≦k≦7) proteases in parallel were considered to calculate the theoretical variant coverage. Those variants that cannot be seen in in silico peptides from any of the seven proteases are denoted as “missed”.


Improving theoretical variant coverage to a more realistic estimated variant coverage by using observed counterpart wild-type peptides

The theoretical variant coverage provides an optimistic estimate of detecting variants in peptides within the desired length ranges in ideal shotgun proteomics, as shown in the previous section. For example, in the whole human proteome data set, the theoretical variant coverage obtained by trypsin digestion is 72.4%, given a peptide length of 7-40aa. The ability to detect variant peptides in real shotgun proteomics experiments can be constrained, based on our results from observed peptide ratios calculated using peptides with MS experiment evidence in PeptideAtlas. However, if multiple proteases are used in parallel, more variants should theoretically be detected. Therefore, the theoretical variant coverage may provide an over-optimistic expectation of detecting variants in shotgun MS experiments. To make the theoretical coverage reflect real detection outcomes more closely, we would need to analyze variant peptides with experimental evidence. Nevertheless, the number of identified variant peptides is still very low in shotgun MS experiments. To conduct a rigorous analysis and infer a more realistic estimate of detecting variant peptides, we used their observed counterpart (wild-type) peptides instead, that is, variant-site-containing wild-type peptides with MS experiment evidence in s_Proteome, s_Tissue, s_CellLine, and s_SinglePro. Presumably, the likelihood of detecting variant peptides whose counterpart wild-type peptides have been observed in MS data is much higher than that of variant peptides lacking such counterparts; we therefore propose to use observed counterpart wild-type peptides to infer possible detection of variant peptides.
A critical issue with this presumption, however, is whether the variant peptides and their observed counterpart wild-type peptides share similar peptide properties to support the rationale of such estimation. To address this issue, we compared the GRAVY hydrophobicity index, charge (at pH 7), and isoelectric point (pI) of variant peptides and their counterparts, either observed or non-observed, in the s_Tissue data set. We also applied the enhanced signature peptide (ESP) predictor30 and used ESP scores to evaluate the similarity in potential ion-current responses between in silico variant peptides and their wild-type counterpart peptides.31 As illustrated in Figure S8, only subtle differences are shown in GRAVY, charge (at pH 7), pI, and ESP scores, indicating that the variant peptides and their counterparts are quite similar. Therefore, our use of observed counterpart wild-type peptides to estimate the likelihood of detecting variant peptides is acceptable.
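To illustrate the kind of property comparison described above, the following minimal Python sketch computes the GRAVY index, defined as the mean Kyte-Doolittle hydropathy over all residues, for a wild-type peptide and a single-substitution variant. The peptide sequences are hypothetical, and the source does not state which tool was used to compute GRAVY, so this is only a sketch of the definition; charge, pI, and ESP scores are not covered here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


# Hypothetical wild-type peptide and its M -> V variant.
wild_type = "LMNPQR"
variant = "LVNPQR"

delta = gravy(variant) - gravy(wild_type)
print(f"wild type {gravy(wild_type):.3f}, variant {gravy(variant):.3f}, "
      f"difference {delta:.3f}")
```

Because a single substitution changes only one term of the average, the GRAVY shift shrinks as the peptide gets longer, which is consistent with the subtle differences reported for Figure S8.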


To find the observed counterpart wild-type peptides in the s_Proteome, s_Tissue, s_CellLine, and s_SinglePro data sets, the LeTE-fusion pipeline maps in silico wild-type peptides to peptides in PeptideAtlas. Since most data contained in PeptideAtlas were generated using trypsin for digestion, we focused here on tryptic peptides of the four data sets. Furthermore, because the counterpart wild-type peptides are not actual variant peptides, we use the term variant site instead of variant when describing the analysis results, in order to state clearly that these peptides only contain the sites where variants occur rather than being variant peptides themselves. To explain our analysis workflow, we use the s_SinglePro data set for demonstration.

Figure 4. Using the single protein data set, s_SinglePro (MP2K1_HUMAN), to illustrate the procedure of finding counterpart wild-type peptides. Given the eight variant sites of the protein, in silico wild-type peptides containing variant-occurring sites are generated and then mapped to peptides in PeptideAtlas, namely, m_PA2016, to obtain the counterpart wild-type peptides (with experimental evidence). Ly: within the length range; Ln: outside the length range.

Using the single protein data set to demonstrate the analysis of using LeTE-fusion to approximate the percentage of expected observable variants

The s_SinglePro data set consists of the protein sequence of MP2K1_HUMAN, which is a canonical protein (i.e., non-redundant protein identified with high confidence) in PeptideAtlas having eight known variants. Since we had the variant information from UniProt, we knew the exact locations at which the variants occurred (i.e., variant sites) on the protein. Thus, we performed in silico trypsin digestion on the wild-type protein sequence (from s_UP2015) and kept only the in silico wild-type peptides containing the variant sites, in order to seek the counterpart wild-type peptides having experimental evidence in PeptideAtlas. However, it was possible that we could not find fully tryptic experimentally observed counterpart wild-type peptides in PeptideAtlas. Thus, we allowed the observed counterpart wild-type peptides to have at most two missed cleavages in order to ensure that we included most of the variant sites. Figure 4 shows the mapping of the in silico wild-type peptides containing variant sites to the corresponding counterpart wild-type peptides in PeptideAtlas (specifically the m_PA2016 data set) for the single protein data set. As shown in Figure 4, we found that only four of the eight variant sites (green bars) were located on in silico peptides within the length range 7-40aa and had observed counterpart wild-type peptides in PeptideAtlas. Another three variant sites (gray bar) occurred on the same in silico wild-type peptide, which had a long length of 43aa and for which no observed counterpart wild-type peptide could be found in m_PA2016. The remaining variant site was in a very short in silico peptide of length 1aa (the 97th amino acid of the protein is changed from K to R if the single amino acid variation occurs), which was outside the desired peptide-length ranges; however, the variant site was contained in wild-type peptides with ≥1 missed cleavage found in m_PA2016. Nonetheless, we only retained observed wild-type peptides having in silico variant peptides within the defined length ranges. After matching the eight in silico wild-type peptides to the m_PA2016 data set, we obtained a list of observed counterpart wild-type peptides. Note that a variant site could be found in more than one counterpart wild-type peptide because ≤2 missed cleavages were allowed. In such cases, we counted the observation of the variant site only once instead of multiple times.
Note that in silico variant-site-containing peptides can have a length either within (denoted as Ly) or outside (denoted as Ln) a specific length range, and can either have corresponding observed counterpart wild-type peptides with at most two miscleavages in a specific length range (denoted as Ey) or not (denoted as En). For a combination of peptide-length ranges, in silico variant-site-containing peptides can thus be clustered into four groups, namely, LyEy, LyEn, LnEy, and LnEn. For each group, the variant coverage is calculated as the number of variant sites (i.e., total number of variant-site-containing peptides) in the group divided by the total number of variant sites in the entire data set. Figure 5 shows the variant coverages of the s_SinglePro data set for different combinations of peptide-length ranges, where the diagonal entries correspond to in silico peptides and observed counterpart wild-type peptides in the same length range. A more detailed result is provided in Table S5.


Since the peptides containing variant sites in the LyEy category are supposedly detectable in real shotgun proteomics, these variant sites can be regarded as expected observable variant sites, and the variant coverage of LyEy is called the expected observable variant coverage. For example, considering the peptide-length range of 7-40aa for both in silico peptides and counterparts (red square-bounded region in Figure 5), the variant coverages are LyEy = 50%, LyEn = 0%, LnEy = 12.5%, and LnEn = 37.5%. Therefore, 50% of variant sites are expected observable for the single protein. Compared with the theoretical variant coverage (i.e., the sum of the variant coverages of LyEy and LyEn), the expected observable coverage could represent a more realistic percentage of variant peptides expected to be observable in shotgun MS experiments.
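The four-group classification and the resulting coverages can be expressed as a minimal Python sketch. Each variant site is described by the length of its in silico peptide and the lengths of any observed counterpart wild-type peptides. In the toy data below, the 43aa and 1aa in silico lengths follow the s_SinglePro description, whereas the remaining in silico lengths and all observed lengths are hypothetical placeholders chosen only to reproduce the reported group sizes; the function names are likewise hypothetical.

```python
from collections import Counter

GROUPS = ("LyEy", "LyEn", "LnEy", "LnEn")


def classify_site(in_silico_len, observed_lens,
                  in_silico_range=(7, 40), observed_range=(7, 40)):
    """Assign a variant site to LyEy, LyEn, LnEy or LnEn."""
    lo, hi = in_silico_range
    ly = lo <= in_silico_len <= hi
    olo, ohi = observed_range
    ey = any(olo <= n <= ohi for n in observed_lens)
    return ("Ly" if ly else "Ln") + ("Ey" if ey else "En")


def group_coverages(sites, in_silico_range=(7, 40), observed_range=(7, 40)):
    """Percentage of variant sites in each group; each site is counted once."""
    counts = Counter(classify_site(length, obs, in_silico_range, observed_range)
                     for length, obs in sites)
    return {g: 100.0 * counts[g] / len(sites) for g in GROUPS}


# Eight variant sites: (in silico peptide length, observed counterpart lengths).
sites = [
    (12, [12]), (22, [22]), (23, [23, 30]), (24, [24]),  # hypothetical lengths
    (43, []), (43, []), (43, []),                         # 43aa peptide, none observed
    (1, [12]),                                            # 1aa peptide, observed with missed cleavage
]

cov = group_coverages(sites)
theoretical = cov["LyEy"] + cov["LyEn"]
print(cov, f"theoretical = {theoretical}%")
```

A site with several observed counterparts, such as the third entry, still contributes a single count, in line with counting each variant site only once.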

Figure 5. Variant coverages (in percentage) for the s_SinglePro data set (total of 8 variants) under different combinations of peptide-length ranges. Variant coverages are calculated using in silico peptides containing variant sites that either have or lack observed counterpart wild-type peptides in m_PA2016. Ly: within the length range; Ln: outside the length range; Ey: with experiment evidence; En: without experiment evidence.


Large-scale variant analysis based on observed counterpart wild-type peptides at human proteome, tissue, and cell line levels

To conduct analyses on the other three data sets, s_Proteome, s_Tissue and s_CellLine, we mapped the protein information of each data set against s_UP2015 and m_PA2016 to filter out (1) proteins having mismatched gene and protein identifiers; (2) non-canonical proteins based on the canonical protein list from PeptideAtlas, where canonical proteins refer to proteins identified with high confidence in MS experiments; and (3) proteins with sequence conflicts for given protein identifiers among different databases or different versions. After filtering, 62281, 8082, and 119 variants remained in s_Proteome, s_Tissue, and s_CellLine, respectively. For each data set, we mapped the in silico variant-site-containing peptides to experimentally observed peptides in m_PA2016, allowing at most two miscleavages, to find the observed counterpart wild-type peptides and compute the variant coverages for different length ranges. The results for s_Proteome, s_Tissue, and s_CellLine are shown in Figure 6. In the whole human proteome data set with 62281 variants, considering peptide length 7-20aa (Figure 6A, yellow square-bounded region), we estimated an expected observable variant coverage of 27.3%, i.e., 27.3% of variants could likely be detected in MS experiments, although the theoretical variant coverage was 45.7% (LyEy + LyEn: 27.3% + 18.4% = 45.7%). Even for the length range of 7-40aa (red square-bounded region), only 40.3% of variants were expected observable in shotgun proteomics, whereas the theoretical variant coverage of 73.2% was too optimistic. Thus, by using the observed counterpart wild-type peptides, it seems that we can obtain a more realistic estimate of variant coverage that is closer to real MS experiment outcomes. The other two data sets demonstrated similar results, as shown in Figure 6B and C, respectively. Detailed results are in Tables S6-8.
Because 7-20aa is the peptide-length range having the most detected peptides in MS experiments and 7-40aa is the length range containing the highest number of in silico peptides while remaining within the MS detection and peptide identification limit, we compared the results of s_Proteome, s_Tissue, and s_CellLine in these two length ranges. As shown in Figure 6D, the expected observable variant coverage for 7-20aa was much smaller than that for 7-40aa; in contrast, the variant coverage of LnEn for 7-20aa was much larger than that for 7-40aa. As the length range increased, the variant coverage of LnEn decreased and the expected observable variant coverage increased for all three data sets. Nevertheless, the expected observable variant coverage for the longest range was still EO_NoExp > TG. Second, we examined the RNA abundance obtained from the Human Protein Atlas21 among the three groups within the same biological sample under the same conjecture of EO_Exp > EO_NoExp > TG even at the RNA level. Finally, we analyzed the abundance correlation between protein and RNA among different biological samples to ensure that data from two different sources still provided adequate results.

Figure 8. Comparison among 999 variant sites found at genomic level, 160 expected observable variants in the peptide-length range of 7-40aa estimated by LeTE-fusion, and 68 experimental variants identified by Lobas et al. for the s_HEK293 data set.

Abundance dynamic range analysis at protein level

PaxDb20 provides protein abundance data sets across different domains with various sample complexities from MS experiments. We thus used data sets of the whole human proteome (H.sapiens-Whole organism (Integrated) as denoted in PaxDb), human kidney tissue (H.sapiens-Kidney (Integrated) as denoted in PaxDb), and the HEK293 cell line (H.sapiens-Cell line, HEK293, (Geiger,MCP,2012) as denoted in PaxDb) to analyze the protein abundance among the three groups, as shown in Figure S9. The abundance dynamic range in various samples should be disparate, since different cell types can express specialized proteins with different abundances that are in charge of unique properties of the cells.34 Thus, we utilized such fundamental variation among the samples to assign the whole human proteome and human kidney tissue data sets as negative controls in order to reduce any bias while verifying our conjecture. In the complex sample data sets (i.e., whole human proteome and kidney tissue), most of the proteins in EO_NoExp and EO_Exp had abundances with a similar median, higher than that of TG (i.e., TG < EO_NoExp ≈ EO_Exp), as shown in Figure S9A and B. However, for the HEK293 cell line data set from PaxDb, in contrast to the complex sample data sets from the same protein abundance database, most of the proteins in TG and EO_NoExp had similar but lower protein abundance than those in EO_Exp, and the median abundances of the three groups were TG ≈ EO_NoExp < EO_Exp, as shown in Figure S9C. We then further statistically analyzed the differences in median abundances among the three groups by the Kruskal-Wallis rank sum test in all the data sets, and the result was significant with an adjusted p-value