Article pubs.acs.org/jpr

Hunting for Unexpected Post-Translational Modifications by Spectral Library Searching with Tier-Wise Scoring

Chun Wai Manson Ma† and Henry Lam*,†,‡

†Division of Biomedical Engineering and ‡Department of Chemical and Biomolecular Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China

Supporting Information

ABSTRACT: Discovering novel post-translational modifications (PTMs) of proteins and detecting specific modification sites on proteins are among the last frontiers of proteomics. At present, hunting for PTMs remains challenging in widely practiced shotgun proteomics workflows because of the typically low abundance of modified peptides and the greatly inflated search space as more potential mass shifts are considered by the search engines. Moreover, most popular search methods require the user to specify the modification(s) for which to search; therefore, unexpected and novel PTMs will not be detected. Here a new algorithm is proposed to apply spectral library searching to the problem of open modification searches, namely, hunting for PTMs without prior knowledge of what PTMs are in the sample. The proposed tier-wise scoring method looks for unexpected PTMs by allowing mass-shifted peak matches, but only when the number of matches found is deemed statistically significant. This allows the search engine to search for unexpected modifications while maintaining its ability to identify unmodified peptides effectively. The utility of the method is demonstrated using three different data sets, in which the numbers of spectrum identifications of both unmodified and modified peptides were substantially increased relative to a regular spectral library search as well as to another open modification spectral search method, pMatch.

KEYWORDS: spectral libraries, spectral library searching, post-translational modifications



INTRODUCTION

It has been estimated that over a million distinct protein molecules can be derived from the ∼23 000 genes in the human genome. Protein post-translational modifications (PTMs) account for most of this molecular diversity. PTMs refer to the covalent changes to the polypeptide after its synthesis in the ribosome, usually mediated by specific enzymes. These processes change the conformation of the proteins and regulate their activities, underlying the versatile functionality and sophisticated regulation mechanisms of proteins in the cell.1 To date, more than 300 types of PTM have been experimentally observed in humans, with over 25 000 modified proteins (over 220 000 putative modification sites) validated in the Swiss-Prot database.2 These numbers are expected to grow steadily as more experimental data become available. Nonetheless, only a few of this large number of PTM types have been extensively studied. Even for phosphorylation, probably the best understood PTM, much remains unknown about the phosphorylation sites, enzyme−substrate relationships, and the actual functions of the PTM in each protein.3 The vast diversity of PTMs, and our relative lack of knowledge about them, makes the identification of PTMs one of the most difficult tasks and one of the last frontiers of proteomics.4−6

Currently, the main approach for identification of PTMs in proteins on a large scale is sequence database searching with a priori specifications of allowable mass shifts of certain amino acid residues.7−10 Sequence database searching, as the name implies, relies on available protein sequence databases (typically predicted from genome sequences) to define the “search space”, that is, to enumerate all possible peptide candidates to consider for matching. When PTMs are allowed, the sequence search engine simply expands its “search space” to include both the unmodified and modified versions of the affected peptides. This approach has several important limitations. First, it requires the researcher to define in advance the types of PTMs he or she expects to find, together with the amino acid residues on which the PTMs are expected to occur. In this scheme, PTMs that are not specified beforehand will never be detected, even if the modified peptide is indeed in the sample. Second, the search space expands exponentially as more PTMs are searched. This is because if multiple PTMs can occur on a peptide, then all combinations of unmodified and modified sites have to be considered. This not only can make the search impractically slow but also can greatly reduce the sensitivity and specificity.11 Therefore, it is generally not advisable to consider more than a handful of PTMs at once in a typical sequence database search. These limitations make it

Received: October 7, 2013
Published: March 24, 2014
© 2014 American Chemical Society

dx.doi.org/10.1021/pr401006g | J. Proteome Res. 2014, 13, 2262−2271

Journal of Proteome Research

Article

difficult to hunt for PTMs, especially uncommon or unexpected PTMs in the proteome on a large scale.6

Two general directions have been proposed to relieve these limitations. The first approach, sometimes called “refinement” or “two-pass search,” is to limit attention to only proteins or peptides for which the unmodified version is detected.12−14 The search engine first performs a normal search considering only the most common PTMs (or none at all). Any proteins or peptides identified in this “first-pass” search are then allowed to take on a wider range of PTMs in the “second-pass” search, and query spectra that have not been identified in the first round are searched against these candidates. Because in any given sample typically only hundreds of proteins can be found, the expansion in search space due to PTMs is mitigated by the reduction in the number of proteins or peptides considered. This approach trades off “breadth” for “depth” and will fail to detect modified peptides for which the unmodified counterparts are not present in the same sample.

The second approach, called “open” or “blind” modification search, makes no assumption about what PTMs one might detect or on which proteins they might be found.11,14−19 Instead, the search engine tries to construct the profile of common PTMs from the data. To limit the search space, however, it typically has to assume that only one PTM is found on each peptide. (An exception is a recently proposed “multi-blind” algorithm that can account for multiple PTMs efficiently.11) The deviation of the observed precursor mass from the theoretical precursor mass of the candidate is presumed to be due to the presence of the PTM. Similarly, in the MS/MS spectrum of the modified peptide, a subset of the fragment ions (only those that are affected by the PTM) will also be shifted by the same precursor mass deviation (divided by the charge state of the fragment ion).

To proceed, the search engine first has to expand the precursor mass tolerance and consider not only candidates with precursor masses similar to that of the query spectrum but also those with very different precursor masses. Then, to detect such “partial-mass-shifted” spectra, the similarity scoring function is modified to allow for mass-shifted peak matches, in addition to ordinary, identical-m/z peak matches. The frequencies of spectral matches at different precursor mass shifts can be tabulated, revealing the profile of different PTMs in the sample. In an optional subsequent step, ordinary sequence database searching can be performed, allowing those PTMs detected at high frequency, to maximize the number of identifications.14 However, allowing mass-shifted peak matches will greatly increase the chance of random hits, leading to increased false-positive identifications, regardless of the algorithm used to detect these mass-shifted peak matches. The search space is also expanded considerably, affecting the search speed and sensitivity.

Spectral library searching has recently received attention as a promising alternative to sequence database searching.20−23 It matches query spectra to a library of reference spectra for which the identification is known. Because of its ability to consider the finer details of the spectrum in spectrum−spectrum matching, spectral library searching has been shown to outperform sequence database searching in sensitivity.24 Spectral libraries of peptides are compiled from a large amount of shotgun proteomics data of real biological samples, which are identified by sequence database searching. Because of the limitations previously described, spectral libraries at present contain only the most common PTMs, such as methionine oxidation, protein N-terminal acetylation, pyroglutamate formation (such as the NIST Peptide Tandem Mass Spectral Libraries, available at http://peptide.nist.gov), and phosphorylation.25 More recently, synthetic peptides have been utilized to attain full proteome coverage for the spectral library of yeast.26 Because of the technical difficulty in synthesizing modified peptides on a large scale, this approach is also limited to unmodified peptides. Therefore, in its simplest form, spectral library searching against existing spectral libraries is not effective for hunting for uncommon or unexpected PTMs in the proteome.

However, the open-modification search approach for sequence database searching, as previously described, can be and has been extended to spectral library searching. Although not directly applied to search spectral libraries, Bandeira et al.27 first proposed the idea of matching experimental spectra while allowing for mass shifts for the purpose of detecting spectral pairs presumably from related peptides. A dynamic programming approach is employed to find the best “gapped” alignment of the canonical ion series, similar to classic sequence alignment algorithms. A simpler algorithm followed in Bonanza,28 which counts equivalently two types of peak matches between the query spectrum and the library spectrum: peaks with similar m/z values and peaks whose m/z difference is close to the precursor m/z difference of the two spectra. It therefore assumes that a single modification in the peptide causes the m/z value of the precursor and a subset of its fragments (those containing the modification) to be shifted by the same amount. Subsequent methods, pMatch29 and QuickMod,30 took a similar approach for peak matching but utilized peak annotations to predict the possible m/z shifts of each peak based on the annotated charge state and ion type. QuickMod optimizes the scoring function by a support vector machine approach. To tolerate peak intensity variations due to the PTM, pMatch attempts to reduce the relative importance of the peak intensities in the library spectrum by “diluting” it with a theoretical spectrum consisting of all canonical (b- and y-type) ions at equal intensity. Despite these minor differences, the most important feature in all three algorithms is allowing each peak in the query spectrum to be matched more than once.

There are two main theoretical advantages of adopting an open modification search in spectral library searching compared with doing so in sequence database searching. First, spectral library searching has a much smaller search space to begin with, as spectral libraries consist of only peptides that have been observed thus far, whereas sequence database searching considers all peptides derivable from all proteins in an organism. In a sense, using spectral libraries as a starting point to hunt for PTMs is similar to the refinement approach previously described, except that the first-pass search has already been performed on all data from the entire community used to construct the spectral library. This way, the search space is greatly reduced relative to sequence databases, which promises to vastly improve the search speed and sensitivity, but not so much as to limit consideration to only proteins detected in one particular experiment, as in typical refinement searches.31 Second, reference spectra in libraries allow more accurate matching, as real spectra contain information about peak intensities and minor ions not commonly considered in sequence database searching.

It should be pointed out that the PTM, being covalently attached to the peptide, may change the chemistry to an extent that the fragmentation pattern is significantly altered. For instance, phosphate groups attached to serines and threonines are labile in CID and sometimes produce prominent phosphate neutral loss ions, which would not be present in the spectrum of the


corresponding unphosphorylated peptide.32 However, it has been shown that, notwithstanding this potential complication, the fragmentation pattern of the modified peptide retains enough similarity to that of the unmodified peptide (at least for the few amino acid substitutions and PTMs examined, including phosphorylation) that it is still beneficial in terms of sensitivity to use the spectrum of the unmodified peptide as a template to match that of the modified one after simple mass shifting.33

We present a method for open-modification search based on spectral library searching. Our overall strategy is similar to previously proposed methods for open-modification search in that we detect spectral matches allowing peaks to be matched with mass shifts defined by the precursor mass deviation. Unlike previous approaches, however, we aim to minimize the problem of random peak matches by adopting a tier-wise matching approach, such that mass-shifted matches are counted only if they pass stringent thresholds defined by a probabilistic model. In other words, our method does not presume the existence of a PTM and can find PTMs without sacrificing the identification rate of unmodified peptides. In addition, we account for the possibility of higher-charged fragment ions, again in a tier-wise manner, without the need to annotate peaks in the library spectrum. We show that our method outperforms the most recent open modification spectral search engine, pMatch,29 in terms of total spectra identified at a fixed false discovery rate (FDR). The algorithm is implemented in the spectral search engine SpectraST as part of the open-source Trans-Proteomic Pipeline (TPP) software suite.34
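The general idea of mass-shifted peak matching described in this introduction can be illustrated with a short sketch. It is not the tier-wise scoring of this work, nor the exact procedure of Bonanza, pMatch, or QuickMod; it only counts identical-m/z matches and matches shifted by the precursor m/z difference. The function name, the 0.5 m/z tolerance, the assumption of singly charged fragments, the choice to count each query peak at most once, and the toy peak lists are all assumptions made for illustration.

```python
def count_peak_matches(query_mz, library_mz, precursor_mz_diff, tol=0.5):
    """Count direct and mass-shifted peak matches between two peak lists.

    precursor_mz_diff: query precursor m/z minus library precursor m/z
        (assumed here to apply unchanged to singly charged fragments).
    tol: hypothetical fragment m/z tolerance.
    Each query peak is counted at most once, preferring a direct match;
    this is a simplification chosen for the sketch.
    """
    direct = 0
    shifted = 0
    for q in query_mz:
        if any(abs(q - l) <= tol for l in library_mz):
            # Ordinary match: fragment not affected by the modification.
            direct += 1
        elif any(abs(q - (l + precursor_mz_diff)) <= tol for l in library_mz):
            # Shifted match: fragment carries the modification.
            shifted += 1
    return direct, shifted


# Toy example: two fragments unchanged, two shifted by a hypothetical +80.
library_peaks = [200.0, 300.0, 400.0, 500.0]
query_peaks = [200.0, 300.0, 480.0, 580.0]
print(count_peak_matches(query_peaks, library_peaks, 80.0))
```

With no precursor m/z difference, the same call finds only the two direct matches, which shows why the shifted matches add evidence for a modified peptide but also add opportunities for random hits.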



SpectraST.36 The value of N is chosen to be 60 based on testing on the iPRG 2012 data set (Figure S4 in the Supporting Information).

Tier-Wise Spectrum Matching

The representation of a query spectrum SQ after the preprocessing previously described can be viewed as a set of ordered pairs {(Mi, Īi)}, i ∈ {1, 2, 3 ... N}. Each retained peak corresponds to one ordered pair, where Mi is its m/z value and Īi is its normalized rank-transformed intensity as defined in eqs 1 and 2. In our tier-wise matching approach, a pair of spectra is matched in several tiers. In the first tier (Tier-0), the peaks of the query spectrum are matched with the peaks in the similarly preprocessed library spectrum (SL = {(Mj, Īj)}, j ∈ {1, 2, 3 ... N}). The Tier-0 dot product is calculated as d0 =

k 0 = ( i , j) :

(3)

Mi − M j