Technical Note Cite This: J. Proteome Res. XXXX, XXX, XXX−XXX

pubs.acs.org/jpr

Data-Dependent Scoring Parameter Optimization in MS-GF+ Using Spectrum Quality Filter

Hyunjin Jo and Eunok Paek*

Department of Computer Science, Hanyang University, Seongdong-gu, Seoul 04763, Korea


ABSTRACT: Most database search tools for proteomics have their own scoring parameter sets depending on experimental conditions such as fragmentation methods, instruments, digestion enzymes, and so on. These scoring parameter sets are usually predefined by tool developers and cannot be modified by users. The number of different experimental conditions grows as the technology develops, and the given set of scoring parameters could be suboptimal for tandem mass spectrometry data acquired using new sample preparation or fragmentation methods. Here we introduce a new approach to optimize scoring parameters in a data-dependent manner using a spectrum quality filter. The new approach conducts a preliminary search for the spectra selected by the spectrum quality filter. Search results from the preliminary search are used to generate data-dependent scoring parameters; then, the full search over the entire input spectra is conducted using the learned scoring parameters. We show that the new approach yields more and better peptide-spectrum matches than the conventional search using built-in scoring parameters when compared at the same 1% false discovery rate.

KEYWORDS: proteomics, search algorithm, parameter optimization, machine learning, peptide identification



Received: June 1, 2018. Published: July 23, 2018.

DOI: 10.1021/acs.jproteome.8b00415

INTRODUCTION

Database search is a common method to interpret mass-spectrometry data obtained in shotgun proteomics. Tandem mass (MS/MS) spectra and a protein sequence database are used as the input of a database search, together with many parameter values, including digestion enzyme, precursor mass tolerance, and modification list, that depend on sample preparation and experimental conditions. The software tools calculate match scores between an experimental spectrum and theoretical spectra deduced from candidate peptide sequences of the input database. As a result of the search, a peptide-spectrum match (PSM) list is reported. A validation step follows to estimate the significance of output PSMs. Two major methods of validation are a target-decoy method (ref 1) that estimates false discovery rate (FDR) and a model-based method such as PeptideProphet (ref 2) that calculates the probability of each PSM.

One important part of the database search technique is to properly set search parameters such as precursor mass tolerance, fragment mass tolerance, and variable modifications. These parameters are usually set based on a user’s knowledge of the experimental setup and an educated guess on the potential modifications believed to be included in the analytes. While the “yield” of the search depends heavily on these search parameter values, users often make less than optimal choices. Thus there have been studies to infer the optimal search parameter values to help users harvest more PSMs at a given FDR (refs 3, 4).

There is another set of parameters that greatly affects the search results: scoring parameters specific to each search tool. Most tools have their own built-in scoring parameter sets, the values of which depend on instruments, fragmentation methods, sample preparation protocols, and so on. These scoring parameters are usually hard-coded by software developers and cannot be dynamically set by users. For example, the latest version of MS-GF+ (refs 5−7) has 37 predefined scoring parameter sets. A proper scoring parameter set must be employed for the best search results. While the predefined parameter sets support a variety of data types defined by fragmentation method, protease, isobaric labeling, and resolution, the number of different parameter sets will keep increasing as technology develops. With the increasing number of scoring parameter sets, users will find it difficult to choose the optimal scoring parameter set for specific MS/MS data, especially data acquired with novel proteomic experimental technology. For example, there are currently many publicly available datasets of tandem mass tag (TMT)-labeled and phosphorylation-enriched spectra generated using higher-energy collisional dissociation (HCD). Unfortunately, the most up-to-date version of MS-GF+ does not provide a scoring parameter set for this type of MS/MS data.

Another approach to maximizing the number of PSMs at a given FDR is postprocessing of search results. Percolator (ref 8), for example, uses a support vector machine (SVM) to discriminate correct target PSMs from false ones. It utilizes confidently identified target PSMs and decoy PSMs to train the SVM model. PSMs are represented as a set of features including raw scores of an individual search engine, peptide mass, peptide length, and so on.

In this study, we suggest a new approach to obtaining optimized scoring parameters of MS-GF+, one of the most popular search tools for high-throughput proteomics, in a data-dependent manner. We call this approach a “data-dependent scoring parameters (DDP)” search. The DDP search consists of two search phases. During the first search, we conduct a quick database search only for the spectra selected by the spectrum quality filter. Using the first search result, we generate data-dependent scoring parameters, then use them for the second full-scale search. We compared the results of built-in and DDP searches in terms of the number of PSMs and the quality of PSMs at the same FDR. In all of the cases we tested, the new approach yielded more and better PSMs than built-in search results. We also tested the performance of the new approach followed by Percolator, a rescoring method based on semisupervised learning. We expected that dynamic scoring parameter optimization would affect the search results independently from a postprocessor such as Percolator. Applying both DDP search and Percolator in sequence yielded more PSMs than a built-in search followed by Percolator.
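The two-phase DDP search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' script and not the MS-GF+ API: `search`, `generate_params`, and `is_high_quality` are hypothetical callables standing in for an MS-GF+ run, ScoringParamGen, and the spectrum quality filter, and a PSM is represented by a hypothetical `(spectrum id, score, is_decoy)` tuple in which a higher score is better. The FDR is estimated with the simple target-decoy ratio (decoys divided by targets).

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical PSM record: (spectrum id, score, is_decoy); higher score = better.
PSM = Tuple[str, float, bool]


def filter_at_fdr(psms: Sequence[PSM], max_fdr: float = 0.01) -> List[PSM]:
    """Return target PSMs above the lowest cutoff with decoys/targets <= max_fdr."""
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    targets = decoys = keep = 0
    for i, (_sid, _score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            keep = i  # the top-i PSMs still satisfy the FDR bound
    return [p for p in ranked[:keep] if not p[2]]


def ddp_search(
    spectra: Sequence[Any],
    is_high_quality: Callable[[Any], bool],
    search: Callable[[Sequence[Any], Any], List[PSM]],
    generate_params: Callable[[List[PSM]], Any],
    builtin_params: Any,
    cap: int = 100_000,
) -> List[PSM]:
    """Two-phase search: learn scoring parameters, then search everything."""
    # Phase 1a: collect high-quality spectra until the cap or the input ends.
    selected = []
    for spectrum in spectra:
        if is_high_quality(spectrum):
            selected.append(spectrum)
            if len(selected) >= cap:
                break
    # Phase 1b: preliminary search with the built-in scoring parameters.
    first_result = search(selected, builtin_params)
    # Phase 1c: PSMs passing 1% FDR become the training set for new parameters.
    learned_params = generate_params(filter_at_fdr(first_result))
    # Phase 2: full search over all input spectra with the learned parameters.
    return filter_at_fdr(search(spectra, learned_params))
```

Passing the search engine and the parameter generator in as callables keeps the sketch independent of any particular command line; in practice each would wrap an external program run.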



EXPERIMENTAL SECTION

Workflow

For given input spectra, a training dataset for parameter optimization (Figure 1) is obtained by performing a search over a limited set of confidently assigned spectra. (1) We applied the filter until at most 100 000 high-quality (identifiable) spectra were obtained or we exhausted the input. (2) The first search was conducted only for the filtered high-quality spectra using built-in scoring parameters. (3) The identified PSMs from the first search results, validated at 1% FDR, were used as a positive training set for a scoring parameter generation tool of MS-GF+ (ScoringParamGen, ref 7) to produce data-dependent scoring parameters. (4) These scoring parameters were used in turn to conduct the second search for the entire original input spectra.

Figure 1. Workflow of search using data-dependent scoring parameters.

ScoringParamGen

A scoring parameter generation tool within the MS-GF+ package, ScoringParamGen, was used to optimize the scoring parameters. The tool accepts PSMs at 1% FDR as input to generate a file of scoring parameters. The input spectra had to be acquired using the same fragmentation method and enzyme. The PSMs are partitioned depending on their precursor charge and peptide length. The scoring parameters are generated for each partition because fragment ion propensities vary with precursor charge and peptide length. Within each partition, the peak rank distribution of each ion type is computed as well as the peak error distribution so that scores can be differentiated depending on the intensity and mass error of the peak. Because the scoring parameters are generated in each partition, PSMs of at least a few thousand unique peptides are needed as an input of ScoringParamGen (ref 7) so that there can be a sufficient number of training data for each partition. These scoring parameters were generated from the training set without assuming any prior knowledge of the protocol, and MS-GF+ utilizes this information when scoring the PSM. A more detailed description of MS-GF+ scoring can be found in the supplements to the MS-GF+ publication.

Dataset

Sixteen MS/MS datasets (Supporting Information, Table S-1), acquired from human samples digested with trypsin and fragmented using HCD, were downloaded from the PRIDE (ref 9) (http://www.ebi.ac.uk/pride/) and CPTAC (ref 10) (https://cptac-data-portal.georgetown.edu) repositories. Ten datasets (Table S-1a) were used for training of the spectrum quality filter, and the remaining six datasets (Table S-1b) were used for performance tests of the DDP search. All datasets were searched by MS-GF+ against the Swiss-Prot human database with the following search parameters: Enzyme = Trypsin, PrecursorMassTolerance = 20 ppm, and IsotopeErrorRange = [0, 1]. For datasets of Table S-1a, minLength = 6 and NTT = 1 were used, whereas minLength = 8 and NTT = 2 were used for those in Table S-1b. Modification lists were adopted from the original publication of the data (Supporting Data 1). PSMs were obtained at 1% FDR using the target-decoy method with a minimum peptide length restriction of 8.

Quality Filter

A simple HCD spectrum quality filter was built using a support vector machine (SVM). We generated three SVM models, one for each of the charge states 2+, 3+, and 4+ or higher. From the built-in search results, confidently identified MS/MS scans (at 1% FDR) were used as the positive training set, and unidentified MS/MS scans were used as the negative training set for the spectrum quality filter. For each SVM model, a total of 50 000 spectra from the 10 datasets in Table S-1a were used as a training set; for each dataset, 5000 spectra (2500 positive and 2500 negative training data) were randomly selected. Various spectral features have been reported to be discriminant in selecting high-quality identifiable spectra (refs 11−13). We selected the seven most relevant features out of 154 features from the previous research results (Supporting Information). A feature selection method in an R package was applied to determine the most discriminant features. The number of features was selected empirically to avoid both underfitting and overfitting. The feature values were standardized using the Z-score. Table 1 shows the performance of the spectrum quality filter in terms of precision and recall, which are frequently used to measure the performance of a classifier. Precision is defined as the number of true-positives divided by the total number of spectra classified as positive, and recall is defined as the number of true-positives divided by the total number of spectra that actually belong to the positive class. Precision differed considerably from dataset to dataset, whereas recall was relatively stable, even when the precision was low.
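For the spectrum quality filter, the Z-score standardization of feature values and the precision and recall measures can be written out as follows. This is only an illustrative sketch: the function names and the boolean label representation are assumptions, the population standard deviation is one possible choice for the Z-score, and the SVM training itself is not reproduced.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def z_scores(values: List[float]) -> List[float]:
    """Standardize one feature column: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no signal
    return [(x - mu) / sigma for x in values]


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Here `predicted[i]` is True when the filter accepts spectrum i, and `actual[i]` is True when the spectrum was confidently identified. In this setting a low precision only means that extra spectra are passed to the first search, whereas recall determines how many identifiable spectra remain available for training.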

A recall of 70% or higher was sufficient to test the DDP approach. A better quality filter can reduce the running time of the first search.

Table 1. Filtering Performance of the Spectrum Quality Filter Depending on Precursor Charges (CS)

             CS = 2                       CS = 3                       CS ≥ 4
dataset      precision (%)  recall (%)    precision (%)  recall (%)    precision (%)  recall (%)
PXD001078    49.23          77.29         67.78          86.47         43.92          86.31
PXD003289    33.98          92.06         26.93          87.85         13.37          74.53
PXD004342    43.34          71.08         55.17          76.53         32.75          71.54
PXD003660    22.17          94.73         22.51          94.85         6.55           77.32
PXD001543    43.04          72.25         38.94          71.03         27.46          78.37
PXD002735    39.86          80.18         46.66          86.64         31.36          91.86
average      38.60          81.27         43.00          83.90         25.90          79.99

RESULTS

At first glance, the DDP search might look as if it is “overfitting” because the scoring parameters obtained from a set of PSMs are applied to the same set of spectra. To investigate whether using the statistical data obtained from a preliminary search to dynamically adjust the scoring parameters of a search tool results in “overfitting”, we conducted a test on three publicly available datasets with known built-in protocols: PXD001078, PXD003289, and PXD004342. For the test, the spectra filtering step was skipped and the original input spectra were randomly divided into two groups. The first group was searched using a built-in scoring parameter set defined for each datatype, and its search results were used to generate data-dependent scoring parameters. The generated scoring parameters were used for DDP searches for both the first and the second groups. As shown in Table 2a, the first and second groups showed little difference in the number of PSMs at 1% FDR. From this result, we can say that the DDP approach simply tries to fine-tune a scoring parameter set optimized for the input spectra, where both the target and decoy sequences are given the same chance of matching a given spectrum. Our experience shows that data-dependent scoring parameters can be reliably obtained when more than 10 000 input PSMs are used. Considering that recent mass-spectrometry-based proteomics experiments often generate over one million spectra, only a small fraction of spectra are used to obtain spectrum characteristics when a quality filter is used.

Table 2. Datasets Were Randomly Divided into Two Groups: (a) Datasets with Corresponding Protocols in the MS-GF+ and (b) Datasets with No Corresponding Protocols in the MS-GF+ (footnote a)

dataset      group    first search result    second search result    increase (%)
(a) Datasets with Corresponding Protocols in the MS-GF+
PXD001078    first    15264                  16818                   10.18
             second   15221                  16739                   9.97
PXD003289    first    48429                  53743                   10.97
             second   48462                  53753                   10.92
PXD004342    first    64538                  71775                   11.21
             second   64530                  71670                   11.07
(b) Datasets with No Corresponding Protocols in the MS-GF+
PXD003660    first    2666                   2825                    5.95
             second   2628                   2778                    5.70
PXD001543    first    14464                  16183                   11.88
             second   14401                  16097                   11.77
PXD002735    first    9619                   9757                    1.44
             second   9634                   9829                    2.03

Footnote a: Scoring parameters generated by the first search of the first group were used for the second search of both the first and the second groups. The tests were performed 10 times, and the results were averaged.

Test results on the three datasets by DDP search with the spectrum quality filter are presented in Table 3a. As shown in Table 3a, 40.19, 14.17, and 18.48% of the input spectra were filtered as good-quality spectra, respectively. The first searches were conducted on the filtered spectra using a built-in scoring parameter set defined for each datatype. As a result of the first search, 26 778, 26 234, and 48 767 PSMs were obtained from each dataset at 1% FDR, respectively. Subsequently, these PSMs were used as an input to ScoringParamGen in MS-GF+ to generate data-dependent scoring parameters. The generated scoring parameters were applied in the second search to the entire original input spectra. After 1% FDR estimation of the second search results, 33 728, 107 307, and 141 759 PSMs were obtained, respectively, corresponding to increases of 10.70, 10.77, and 9.83% over the built-in search results. Over 90% of identifications overlap between the two methods (Table S-2a), and the number of identified PSMs, including phosphorylated PSMs, did increase significantly with DDP search. But can we say that the additional identifications are “good” in some sense? We compared the results from built-in and DDP searches in terms of their match scores. It seemed unfair to compare scores generated by MS-GF+ because the scores were calculated using different parameters. For an objective comparison, we adopted the XCorr score implemented in Tide software (ref 14). The density plots of Tide’s XCorr scores of PSMs exclusive to each method are illustrated in Figure 2a. We can see that the DDP search results showed better XCorr score distributions. Upon closer investigation, we found that there were two different reasons for the better DDP results. (A paired list of all of the differently identified spectra and their search results can be found in Supporting Data 2.) The first differential group showed different identifications between built-in and DDP searches. This was usually caused by isotope error correction of a precursor or by random scrambling of amino acids. For most datasets, over 70% of these PSMs got equal or greater XCorr scores in DDP searches (shown in the last column of Table S-2). The second group consisted of PSMs newly identified by the DDP search. These PSMs had the same peptide identifications in both built-in and DDP searches and thus the same XCorr scores. Their scores did not pass the 1% FDR cutoff in the built-in search, but the optimized scoring parameter sets changed their E values.

With the changed E values, these PSMs were salvaged by the DDP search. This means that the scoring parameter optimization not only increases the number of identifications but also leads to more accurate peptide identifications.

Table 3. Results of DDP Search: (a) Datasets with Corresponding Protocols in the MS-GF+ and (b) Datasets with No Corresponding Protocols in the MS-GF+ (footnote a)

                                     DDP search
dataset      built-in search result  no. filtered spectra (% filtered)  first search result  second search result  increase (%)
(a) Datasets with Corresponding Protocols in the MS-GF+
PXD001078    30468                   44006 (40.19%)                     26778                33728                 10.70
PXD003289    96877                   100000 (14.17%)                    26234                107307                10.77
PXD004342    129069                  100000 (18.48%)                    48767                141759                9.83
(b) Datasets with No Corresponding Protocols in the MS-GF+
PXD003660    5302                    29239 (38.43%)                     5115                 5749                  8.43
PXD001543    28845                   54956 (49.53%)                     21357                32181                 11.57
PXD002735    19250                   39129 (46.53%)                     16670                19666                 2.16

Footnote a: The first search was conducted only for the filtered spectra, while the second search was conducted for the entire original input spectra. The search results show the number of PSMs at 1% FDR. The last column shows the percentage increase in the number of PSMs obtained by DDP search when compared with built-in search.

Figure 2. Density plots of Tide’s XCorr scores of PSMs in Table S-2. Blue bars represent the PSMs identified solely by built-in search, and red bars represent the PSMs identified solely by DDP search. (a) Datasets with corresponding protocols in the MS-GF+. (b) Datasets with no corresponding protocols in the MS-GF+.

MS-GF+ has many built-in scoring parameter sets depending on various protocols, but the list cannot be complete because the technology keeps developing. For instance, there is currently no protocol for TMT-labeled and phosphorylation-enriched data, whereas such datasets are ubiquitous. For such datasets, one must employ a closely related scoring parameter set, probably resulting in a less-than-optimal outcome. We tested our approach on three TMT-labeled, phosphorylation-enriched datasets: PXD003660, PXD001543, and PXD002735. For comparison, a built-in scoring parameter set defined for phosphorylation-enriched data was adopted. Again, the DDP search gave better results than the built-in search in both the number of PSMs and their XCorr scores (Tables 2b and 3b, Table S-2b, and Figure 2b).

It may seem as if the DDP search is optimizing the identification results by rescoring PSMs. However, DDP search dynamically finds the optimal scoring parameters to be used during the search, while Percolator rescores the PSMs after the search is finished by introducing various extra features in addition to the score produced by the search tool. Figure 3 shows the results of built-in and DDP searches, each followed by Percolator. The DDP search followed by Percolator yielded more PSMs at 1% FDR in all six datasets, demonstrating that optimizing scoring parameters and postprocessing improve the search results in an orthogonal manner, and the DDP search and postprocessor can be employed at the same time for even better identification results.

Figure 3. Results of built-in and DDP searches followed by Percolator at 1% FDR. Y axis represents a percent increase in PSMs from built-in search result.

Just as Percolator iteratively rescores PSMs, the scoring parameters can also be optimized iteratively using the PSMs of the previous DDP search. For this iterative DDP test, we downloaded three more datasets: PXD000383, PXD000534, and PXD004732. The datasets were acquired on LTQ-Orbitrap Velos, LTQ Orbitrap Elite, and Orbitrap Fusion Lumos, respectively. PXD000383 and PXD000534 were fragmented using collision-induced dissociation (CID). PXD004732 was generated using multiple fragmentation methods, and we only used spectra generated by electron-transfer dissociation (ETD). Every search was conducted without filtering so that we could acquire as many PSMs as possible for ScoringParamGen. We first conducted iterative searches starting with a suitable built-in scoring parameter set (“normal search”). For most datasets, the number of identifications converged from the second iteration (Table S-3). It seemed as if we were performing a greedy search for the optimum value in a space of parameter sets, so we tested another iterative DDP search starting with a completely different scoring parameter set (“mutated search”), namely, a scoring parameter set defined for a completely different fragmentation method (HCD, CID vs ETD). As shown in Table S-3, some of the mutated searches converged from the third iteration, but others yielded only dozens of PSMs during the first phase, and thus ScoringParamGen and subsequent searches could not be applied. When the first search phase yielded a sufficient number of identifications for ScoringParamGen (in our experience, a few thousand identifications were sufficient), subsequent iterations converged to the optimal parameter set. With a mutated search, our method seems to get stuck in a local optimum, just like when a gradient descent method is applied in a nonconvex space. Even though the “mutated search” showed convergence in the number of identifications, the “normal search” converged faster. If the initial scoring parameter set was chosen suitably and a sufficient number of PSMs was obtained in the first phase, then the scoring parameter set would be optimized in one step. Unlike Percolator, which simply rescores PSMs and determines the cutoff value for FDR estimation, the DDP approach involves an additional search, a time-consuming task. Therefore, it is desirable that the DDP approach is conducted using a proper initial scoring parameter set with a spectrum quality filter so that convergence can be achieved in one iteration.

CONCLUSIONS

We suggested a new approach, called DDP search, to find and apply an optimized scoring parameter set in a data-dependent manner. Using this approach, we could identify more PSMs than the built-in search at the same FDR. We also showed that the PSMs additionally identified by DDP search showed better XCorr scores than those identified solely by the built-in search. These improvements were achieved at the expense of increased search time. We alleviate the increase in search time by adopting a spectrum quality filter. In the worst case, the DDP search can take twice as long as the original search. This overhead decreases as the number of input spectra increases. Adopting the spectrum quality filter is not required for DDP search, and it has pros and cons. A drawback is that the DDP search using the spectrum quality filter yielded fewer PSMs at a given FDR than the DDP search without the filter. However, if a sufficient number of PSMs (a few thousand, as suggested by the original MS-GF+ reports and confirmed by our experiments) can be provided for scoring parameter optimization, then DDP search results would be equally good with or without the filter. An obvious advantage is the reduction in search time during the first phase, and the efficiency gain will become more significant as the number of original input spectra increases. Recent proteomics experiments often generate millions of spectra from a single sample, and searching the entire set of spectra twice to gain a ∼10% increase in identifications may not be desirable. A script that can run MS-GF+ with DDP optimization using the spectrum quality filter is downloadable at https://prix.hanyang.ac.kr. The suggested approach could be applied to any search tool designed to set scoring parameters dynamically depending on an input data type.
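The effect of the quality filter on search time can be illustrated with a small calculation. The sketch below rests on a simplifying assumption that is not a timing model from this work: search time is taken to be proportional to the number of spectra searched, and the cost of running the filter and ScoringParamGen is ignored. The function name and its parameters are hypothetical; the default cap of 100 000 filtered spectra follows the workflow description.

```python
def first_search_overhead(total_spectra: int, pass_rate: float, cap: int = 100_000) -> float:
    """Extra search work of the first phase, relative to one full search.

    Assumes search time is proportional to the number of spectra searched
    (a simplification) and ignores the filter and ScoringParamGen run times.

    total_spectra: number of input spectra.
    pass_rate: fraction of input spectra the quality filter accepts (0 to 1).
    cap: maximum number of filtered spectra sent to the first search.
    """
    if total_spectra <= 0:
        raise ValueError("total_spectra must be positive")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    first_search_size = min(cap, int(total_spectra * pass_rate))
    return first_search_size / total_spectra


# No filtering: every spectrum is searched twice, so the overhead is 100%.
no_filter = first_search_overhead(50_000, 1.0, cap=50_000)
# With the cap in place, the overhead shrinks as the input grows.
one_million = first_search_overhead(1_000_000, 0.4)
five_million = first_search_overhead(5_000_000, 0.4)
```

Under this assumption, the relative overhead equals the fraction of spectra passed to the first search, so the percentages of filtered spectra reported in Table 3 (for example, 14.17% for PXD003289) can be read as rough overhead estimates.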



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.8b00415.

Table S-1: List of datasets. Table S-2: Comparison of built-in and DDP search results. Table S-3: Iterative DDP search results. Supporting Method: Features used to build quality filter. (PDF)
Supporting Data 1: Files containing modification list. (ZIP)
Supporting Data 2: PSM list of different results between built-in and DDP searches. (ZIP)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Tel: +82-2-2220-2377.

ORCID

Hyunjin Jo: 0000-0002-8325-0229
Eunok Paek: 0000-0003-3655-9749

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This research was supported by the Collaborative Genome Program for Fostering New Post-Genome Industry of the National Research Foundation funded by the Ministry of Science and ICT (NRF-2017M3C9A5031597) and by NRF-2017R1E1A1A01077412. We would like to thank Mincheol Jeon for his support on testing the DDP performance.

REFERENCES

(1) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207−14.
(2) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74 (20), 5383−5392.
(3) Kil, Y. J.; Becker, C.; Sandoval, W.; Goldberg, D.; Bern, M. Preview: a program for surveying shotgun proteomics tandem mass spectrometry data. Anal. Chem. 2011, 83 (13), 5259−67.
(4) May, D. H.; Tamura, K.; Noble, W. S. Param-Medic: A Tool for Improving MS/MS Database Search Yield by Optimizing Parameter Settings. J. Proteome Res. 2017, 16 (4), 1817−1824.
(5) Kim, S.; Gupta, N.; Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: A strike against decoy databases. J. Proteome Res. 2008, 7 (8), 3354−3363.
(6) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.
(7) Kim, S.; Mischerikow, N.; Bandeira, N.; Navarro, J. D.; Wich, L.; Mohammed, S.; Heck, A. J. R.; Pevzner, P. A. The Generating Function of CID, ETD, and CID/ETD Pairs of Tandem Mass Spectra: Applications to Database Search. Mol. Cell. Proteomics 2010, 9 (12), 2840−2852.
(8) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4 (11), 923−5.
(9) Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R. PRIDE: The proteomics identifications database. Proteomics 2005, 5 (13), 3537−3545.
(10) Ellis, M. J.; Gillette, M.; Carr, S. A.; Paulovich, A. G.; Smith, R. D.; Rodland, K. K.; Townsend, R. R.; Kinsinger, C.; Mesri, M.; Rodriguez, H.; Liebler, D. C. Cptac, Connecting Genomic Alterations to Cancer Biology with Proteomics: The NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discovery 2013, 3 (10), 1108−1112.
(11) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R., 3rd Automatic quality assessment of peptide tandem mass spectra. Bioinformatics 2004, 20 (Suppl 1), i49−54.
(12) Na, S. J.; Paek, E. Quality assessment of tandem mass spectra based on cumulative intensity normalization (vol. 5, pg 3241, 2006). J. Proteome Res. 2007, 6 (4), 1615−1615.
(13) Wong, J. W.; Sullivan, M. J.; Cartwright, H. M.; Cagney, G. msmsEval: tandem mass spectral quality assignment for high-throughput proteomics. BMC Bioinf. 2007, 8, 51.
(14) Diament, B. J.; Noble, W. S. Faster SEQUEST Searching for Peptide Identification from Tandem Mass Spectra. J. Proteome Res. 2011, 10 (9), 3871−3879.