
Technical Note

Efficient Reduction of Candidate Matches in Peptide Spectrum Library Searching Using the Top k Most Intense Peaks

Trung Nghia Vu,†,‡ Wout Bittremieux,†,‡ Dirk Valkenborg,§,∥,⊥ Bart Goethals,† Filip Lemière,# and Kris Laukens*,†,‡

† Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp, Belgium
‡ Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, B-2020 Antwerp, Belgium
§ Flemish Institute for Technological Research (VITO), B-2400 Mol, Belgium
∥ CFP−CeProMa, University of Antwerp, B-2020 Antwerp, Belgium
⊥ I-BioStat, Hasselt University, B-3590 Diepenbeek, Belgium
# Department of Chemistry, University of Antwerp, B-2020 Antwerp, Belgium


ABSTRACT: Spectral library searching is a popular approach for MS/MS-based peptide identification. Because the size of spectral libraries continues to grow, the performance of searching algorithms is an important issue. This technical note introduces a strategy based on a minimum shared peak count between two spectra to reduce the set of admissible candidate spectra when issuing a query. A theoretical validation through time complexity analysis and an experimental validation based on an implementation of the candidate reduction strategy show that the approach can achieve a reduction of the set of candidate spectra by (at least) an order of magnitude, resulting in a significant improvement in the speed of the search. Meanwhile, more than 99% of the positive search results are retained. This efficient strategy to drastically improve the speed of spectral library searching with a negligible loss of sensitivity can be applied to any current spectral library search tool, irrespective of the employed similarity metric.

KEYWORDS: Peptide identification, spectral library searching, query speed



INTRODUCTION

Over the past few years, a huge number of annotated spectral data sets, generated by various research groups, have been made publicly available. Because of the accumulation of these confidently assigned spectral data sets, spectral library searching has evolved into an effective approach for MS/MS-based peptide identification. In order to use a spectral library approach, annotated spectra from reliable experiments are collected and combined to construct a reference library of confidently assigned spectra. This library can then be used to identify an unknown MS/MS spectrum by finding the best matching reference spectrum. A major advantage of spectral library searching, as compared to traditional sequence database searching, for identifying unknown spectra is its effective similarity matching. In particular, the main advantage is that spectral libraries are constructed from real, representative experimental data, rather than in silico generated theoretical data.1

In order to determine whether two spectra match each other, a similarity score is employed. Stein and Scott2 evaluated several mass spectral comparison methods and concluded that the dot product achieved the best performance. Its advantages are that it has a good accuracy and is easy and fast to compute. As a consequence, the dot product is the most widely used similarity metric and has been used in, e.g., SpectraST,3 BiblioSpec,4,5 and X! Hunter.6 However, other, more complex similarity measures have been proposed as well. For example, Pepitome7 uses a combination of several scores including a hypergeometric test and a Kendall tau statistic. In addition, pMatch8 uses a combination of several different scores as well. Here, the dot product is combined with a probability-based score. Furthermore, Yen et al.9 used a hybrid spectral library consisting of reference spectra and simulated spectra, in combination with probabilistic and rank-based scoring methods. Recently, Li et al.10 proposed a sliding dot product algorithm, which combines the standard dot product with a noise dot product, i.e., the dot product of the background peaks, in order to distinguish spectral correlation


attributed to peptide fragments. Finally, Wu et al.11 used a hidden Markov model as a similarity metric in order to take into account the peak variability between the different spectral replicates that are used to construct a consensus spectrum in a spectral library. Despite the fact that spectral library searching is computationally less expensive than database searching, processing speed remains an important issue. Therefore, several improvements have been proposed to speed up the spectral library searching process. Most common spectral library workflows include a filtering step using the precursor mass of the query spectrum in order to reduce the number of candidate spectra, allowing only candidates for which the precursor mass is within a restricted tolerance window. Another approach is to use a binning method to reduce the complexity of the fragment spectra.3 This method achieves a dimensionality reduction by combining all of the peaks present within a certain mass window into a single bin. Furthermore, Shao et al.12 limited the number of peaks in a spectrum to 50 to reduce the computational complexity while using a modified dot product to achieve reasonable FDR results. On the other hand, Baumgardner et al.13 used GPU hardware to parallelize the computation of the similarity scores in order to achieve a significant increase in the speed of the searching process. These various approaches have proven to be effective for libraries with a limited size. However, because of the ready accessibility of high-quality spectral data sets, spectral libraries keep growing. As a result, when identifying a query spectrum, the number of candidates that needs to be considered as a match keeps increasing as well. Consequently, the computational cost for computing the best matching spectrum grows with the size of the spectral libraries. In this technical note, an alternative and complementary approach to reduce the number of candidate spectra during a spectral library search is introduced. The approach is based on two observations. First, the most intense peaks contribute more to the similarity score than the other peaks present in the spectra. Second, even after commonly used filtering steps, such as a precursor filter, most of the candidate spectra under consideration do not result in a sufficiently good match, i.e., a match with a high similarity score. In this technical note, we propose reducing the number of candidates before the (costly) calculation of similarity scores between the query spectrum and all candidate spectra. Therefore, a preliminary and fast screening of candidates is conducted on the basis of counting their most intense shared peaks. Only the candidate spectra for which the most intense peaks result in a good match with the query spectrum are retained for the subsequent similarity computations. The advantage of this approach is that comparing two spectra based on a limited number of the most intense peaks is computationally very cheap. On the other hand, many irrelevant candidate spectra with potentially low similarity scores can be filtered out. The experimental validation shows that the number of candidates before similarity computation can be reduced by up to an order of magnitude, whereas the search results differ only negligibly.

MATERIALS AND METHODS

First, a formal introduction of how a general candidate reduction strategy can be incorporated into the spectral library search process will be given. Afterward, the specific candidate reduction strategy based on a minimum shared peak count will be presented.

When querying a spectral library in order to find the best match for a query spectrum, a spectrum-spectrum match (SSM) similarity score between the query spectrum and each admissible candidate spectrum is calculated. On the basis of this score and a specified threshold, positive hits are distinguished from negative hits. SSMs with a score below the threshold are considered to be negative hits, whereas SSMs with a score above the threshold are considered to be positive hits. Among the positive hits, the candidate with the highest score is selected as the final search result. The ensuing strategy aims to reduce the number of candidates for which the similarity score needs to be computed without affecting the final search result.

Notation and Definition

Consider a spectral library L consisting of several spectra: L = {l1, ..., ln}. For each query spectrum q, the set of admissible candidate spectra is C = {c1, ..., cm}, with C ⊆ L and m ≤ n, depending on which, if any, prior filtering steps have been applied. In order to find the best matching SSM, for each of the candidate spectra ci ∈ C, the similarity with the query spectrum q has to be computed. When computing the similarity between query spectrum q and a candidate spectrum ci, a specific similarity metric D is used. The final result for a search of spectrum q against the set of candidate spectra C would then be

f(q, C, δ) = argmax_{c ∈ C} D(q, c)   subject to   D(q, c) ≥ δ    (1)

Here, δ is the score threshold used to indicate that an SSM is of sufficiently high quality. Note that we retain (at most) one candidate spectrum as the best SSM. Alternatively, multiple candidate spectra that satisfy the score threshold can be retained as well, e.g., for subsequent false discovery rate (FDR) computation. In the previous case, the set of admissible candidate spectra equals the full spectral library, i.e., m = n. However, a candidate reduction step can be used to first remove some of the candidate spectra that will not result in a high similarity score. This filtering step makes use of an approximate similarity metric based on low-dimensional spectral representatives, which can be computed very quickly, instead of the more expensive similarity metric used above. To obtain such a low-dimensional spectral representative s′ for a spectrum s, a transformation function 𝒯 is used, i.e.

s′ = 𝒯(s)    (2)

On the basis of these spectral representatives, an adapted similarity metric D′ is used to calculate the similarity between the transformed spectra. Thus, using the spectral representatives and the adapted similarity metric D′, a new candidate set C′ ⊆ C can be obtained for query q:

C′ = {c ∈ C | D′(𝒯(q), 𝒯(c)) ≥ ε}    (3)

Here, ε is the threshold for the similarity between the spectral representatives. The reduced candidate set C′ can then be used to obtain the final search result much in the same way as the original candidate set was used:

f(q, C′, δ) = argmax_{c ∈ C′} D(q, c)   subject to   D(q, c) ≥ δ    (4)

Ideally, both formulas 1 and 4 should result in the same candidate spectrum, yielding the best SSM for query spectrum q; however, this is not guaranteed. Namely, if the optimal candidate spectrum is removed from the candidate set during the candidate reduction step, then the result will differ. Therefore, a candidate reduction strategy will try to minimize the size of the reduced candidate set while retaining the candidate spectra that result in the best SSM with the query spectrum.



This way, after reducing the number of admissible candidates, the (expensive) similarity metric D has to be applied only to a smaller subset of candidate spectra. In the following sections, an efficient candidate reduction strategy based on a minimum shared peak count is introduced, and an evaluation of the theoretical gains in terms of time complexity, as well as the practical gains in terms of the obtained improvement in speed during an experimental validation, is presented. In addition, the experimental validation will evaluate the possible loss of optimal SSMs in terms of the observed loss in sensitivity.
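As an illustration of the general framework in formulas 1−4, the following Python sketch spells out the two-stage search, with the transformation 𝒯 and the approximate metric D′ passed in as callables. It is a minimal reading of the formulas rather than the SpectraST implementation; the binning width, the normalization, and all function names are illustrative assumptions.

```python
import math
from collections import defaultdict

def dot_product(query, candidate, bin_width=1.0):
    """Similarity metric D: dot product of binned, unit-normalized peak intensities."""
    def binned(peaks):
        bins = defaultdict(float)
        for mz, intensity in peaks:
            bins[int(mz / bin_width)] += intensity
        norm = math.sqrt(sum(v * v for v in bins.values())) or 1.0
        return {b: v / norm for b, v in bins.items()}
    q, c = binned(query), binned(candidate)
    return sum(v * c.get(b, 0.0) for b, v in q.items())

def best_ssm(query, candidates, D, delta):
    """Formulas 1 and 4: highest-scoring candidate whose score reaches delta, else None."""
    best_id, best_score = None, delta
    for cand_id, peaks in candidates.items():
        score = D(query, peaks)
        if score >= best_score:
            best_id, best_score = cand_id, score
    return best_id

def reduced_search(query, candidates, T, D_prime, D, epsilon, delta):
    """Formulas 2-4: prefilter with the cheap (T, D_prime) pair, then score with D."""
    q_repr = T(query)
    reduced = {cid: peaks for cid, peaks in candidates.items()
               if D_prime(q_repr, T(peaks)) >= epsilon}   # formula 3
    return best_ssm(query, reduced, D, delta)             # formula 4
```

In practice the representatives 𝒯(c) of the library spectra would be precomputed once rather than recomputed for every query, as discussed in the complexity analysis below.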


Candidate Reduction Strategy

In the previous section, the candidate reduction problem was presented in a very general manner. In this section, a specific strategy to solve this problem is introduced. Meanwhile, as an illustration, it is worth mentioning that some candidate reduction strategies are already commonly employed in spectral libraries. For example, the traditional precursor mass filter adheres to the definition of a candidate reduction strategy. Here, the original candidate set C equals the full spectral library. Furthermore, the transformation function 𝒯 in formula 2 (to transform a high-dimensional spectrum into a spectral representative) simply represents the spectrum by its precursor mass: each multidimensional spectrum is represented by a single value. The reduced candidate set C′ is then obtained by removing all spectra that are not within the required mass offset from the precursor mass of the query spectrum, i.e., in formula 3, ε is the required mass offset, and the adapted similarity metric D′ is the absolute value of the difference between the precursor mass for query spectrum q and a candidate spectrum ci (with the comparison in formula 3 inverted, because for this filter smaller mass differences indicate better matches).

The proposed candidate reduction strategy is complementary to this precursor mass filter. Because a precursor mass filter reduces the spectra to a single value, the filter uses only a minimal amount of information, and the reduced candidate set still contains several inferior candidate spectra. On the other hand, the proposed candidate reduction strategy is based on a minimum shared peak count between two spectra. This enables the proposed candidate reduction strategy to make use of some of the specific spectral properties while keeping the number of peaks used very limited to achieve an optimal improvement in processing speed.

As illustrated with the precursor filter, a candidate reduction strategy is based on a transformation function 𝒯 to transform a high-dimensional spectrum to a low-dimensional spectral representative and on an adapted similarity metric D′ to compute the similarity between such spectral representatives. The transformation function 𝒯 proposed here is based on the observation that, when calculating the similarity score between two spectra (for example, using the commonly used dot product), the most intense peaks of both spectra will have a larger contribution to the similarity score than the less intense peaks. Actually, the fact that peak intensity can be taken into account when calculating SSMs is one of the prime advantages of spectral libraries as opposed to sequence database approaches.1 Furthermore, when the two spectra both have an intense peak for the same m/z value, their contribution to the similarity score increases accordingly. On the other hand, noise peaks, or more generally, peaks with a low intensity, will have a limited contribution to the similarity score. Therefore, to obtain a good match when calculating an SSM, the most intense peaks of the two spectra should largely overlap. Consequently, when trying to find the best match between a query spectrum and several candidate spectra, the candidate spectra for which the most intense peaks overlap with those of the query spectrum will typically result in a higher similarity score. Hence, our transformation function 𝒯 will transform high-dimensional spectra to low-dimensional spectral representatives using the top k most intense peaks. An equivalent approach has previously been used by Frank et al.14 to reduce unnecessary similarity computations during spectrum clustering; however, to the best of our knowledge, this is the first time this approach has been used for candidate filtering during spectral library searching. More formally, for a spectrum s consisting of p peaks, s = {s1, ..., sp}, the transformation function 𝒯 in formula 2 can be concretized as follows

s′ = 𝒯(s) = {si ∈ s | |{sj ∈ s : si ≥ sj}| > p − k}    (5)

with |·| representing the set cardinality operator. Thus, the transformation function 𝒯 creates a spectral representative s′ for spectrum s consisting of the top k most intense peaks, s′ = {s′1, ..., s′k}.
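A direct transcription of formula 5 into Python may clarify the transformation; keeping the representative sorted by m/z is what later allows the comparison of formula 6 in a single linear pass. The representation of peaks as (m/z, intensity) pairs is an assumption made for illustration.

```python
def top_k_representative(peaks, k=10):
    """Formula 5: m/z values of the k most intense peaks, sorted by ascending m/z."""
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
    return sorted(mz for mz, _ in most_intense)

# peaks given as (m/z, intensity) pairs
spectrum = [(175.12, 120.0), (244.17, 980.0), (355.07, 310.0), (466.26, 45.0)]
print(top_k_representative(spectrum, k=2))  # [244.17, 355.07]
```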

Subsequently, the similarity metric D′ is used in formula 3 to determine the approximate similarity score between the transformed spectra in order to obtain a reduced candidate set. For similarity metric D′, the number of matching peaks between two transformed spectra is used. Here, we assume that two peaks match if their mass difference is at most 1 Da. If q′ and s′ are the spectral representatives of a query spectrum q and a candidate spectrum s, respectively, then the similarity score by D′ between them is defined as

D′(q′, s′) = |{q′i = s′j | q′i ∈ q′, s′j ∈ s′}|    (6)
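The adapted metric D′ can be realized as a merge-style count over the two sorted representatives, as sketched below. The greedy one-to-one pairing and the fixed 1 Da tolerance follow the description above; the function name and data layout are illustrative.

```python
def shared_peak_count(q_repr, c_repr, tol=1.0):
    """Formula 6: number of top-k m/z values of q_repr matching one of c_repr within tol Da.

    Both arguments are lists of m/z values sorted in ascending order, so a single
    two-pointer pass suffices, which gives the O(k) comparison used below.
    """
    count = i = j = 0
    while i < len(q_repr) and j < len(c_repr):
        diff = q_repr[i] - c_repr[j]
        if abs(diff) <= tol:   # match within the tolerance; consume both peaks
            count += 1
            i += 1
            j += 1
        elif diff > tol:       # candidate m/z is more than tol below the query m/z
            j += 1
        else:                  # query m/z is more than tol below the candidate m/z
            i += 1
    return count

print(shared_peak_count([244.2, 355.1, 512.3], [244.9, 400.0, 512.0]))  # 2
```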

Algorithm 1 outlines how formulas 5 and 6 can be used to calculate a reduced candidate set as defined in formula 3. In the next sections, the strategy is evaluated theoretically, via a computational complexity analysis, and validated experimentally.
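A sketch of how formulas 5 and 6 might be combined into such a candidate reduction step is shown below: the representatives of the whole library are computed once in a preprocessing phase, and only the cheap shared-peak comparison runs per query. This is an illustration of the idea under those assumptions, not a reproduction of Algorithm 1, and all identifiers are hypothetical.

```python
def top_k(peaks, k=10):
    """m/z values of the k most intense (m/z, intensity) peaks, sorted by m/z."""
    return sorted(mz for mz, _ in sorted(peaks, key=lambda p: p[1], reverse=True)[:k])

def shared(a, b, tol=1.0):
    """Number of m/z values in a with a partner in b within tol Da (both sorted)."""
    count = i = j = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tol:
            count, i, j = count + 1, i + 1, j + 1
        elif a[i] > b[j]:
            j += 1
        else:
            i += 1
    return count

def preprocess_library(library, k=10):
    """One-off preprocessing: compute the spectral representative of every library entry."""
    return {cid: top_k(peaks, k) for cid, peaks in library.items()}

def reduce_candidates(query_peaks, representatives, k=10, epsilon=2):
    """Formula 3 with the top-k representatives and the shared-peak metric."""
    q_repr = top_k(query_peaks, k)
    return [cid for cid, c_repr in representatives.items()
            if shared(q_repr, c_repr) >= epsilon]
```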

Computational Complexity Analysis

On the basis of the formal definition of the shared peak candidate reduction strategy, its computational gains in terms of theoretical time complexity can be analyzed. For simplification, suppose that all spectra have p peaks. The transformation function 𝒯 (formula 5) to transform a spectrum s consisting of p peaks, s = {s1, ..., sp}, into a spectral representative s′ consisting of the k most intense peaks, s′ = {s′1, ..., s′k}, has a computational complexity of O(p log(p)).


This value is the computational cost required to sort the peaks of s based on their intensity and extract the k most intense peaks. Subsequently, sorting these k most intense peaks based on their m/z value has a time complexity of O(k log(k)). Hence, for a spectral library L consisting of n spectra, L = {l1, ..., ln}, the first step in the preprocessing phase to compute the spectral representatives has a total time complexity of O(n(p log(p) + k log(k))). However, this preprocessing step has to be performed only once for each spectral library, and it will not influence the search performance.

Subsequently, when performing a query, the time complexity needed to transform a query spectrum q is again O(p log(p) + k log(k)). Next, q needs to be evaluated against a set of m candidate spectra C = {c1, ..., cm}, C ⊆ L, using the adapted similarity metric D′ (formula 6), to obtain a reduced candidate set. Because the peaks for the spectral representatives are already sorted on their m/z values, comparing two sorted lists only has time complexity O(k). Therefore, the total time complexity for reducing the candidate set during the querying phase equals O(p log(p) + k log(k) + mk). Furthermore, after the candidate reduction stage, m′ candidate spectra of C′ = {c′1, ..., c′m′} will be retained, with m′ ≤ m. The retained candidate spectra have to be compared to the query spectrum using the full similarity metric (formula 4). As an illustration, we provide time complexity results for the dot product as similarity metric D. The dot product is widely used in various spectral libraries, and it has a low time complexity of O(p) for comparing two spectra of length p. This also means that if a more advanced similarity metric were used, then the candidate reduction strategy would yield an even more significant improvement. To obtain the final search results (formula 4), the dot product scores between query spectrum q and all m′ candidate spectra in C′ are calculated, which has a time complexity of O(m′p).

We can summarize the time complexity needed for the entire similarity calculation as O(p log(p) + k log(k) + mk + m′p). Because q needs to be transformed only once, this is insignificant compared to the similarity computations. In other words, the complexity can be approximated as O(mk + m′p). In contrast, the time complexity needed for the complete similarity calculation to find the final search result without the candidate reduction strategy (formula 1) is O(mp). Because the number of peaks k used is kept (very) low, i.e., k ≪ p, comparing the spectral representatives requires very little additional time. On the other hand, if m′ ≪ m, not having to perform that many full similarity calculations significantly reduces the computation time needed to perform a similarity search. The experimental validation further shows that low values for k will already result in an order of magnitude reduction in admissible candidate spectra between C and C′.
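To make the trade-off between O(mk + m′p) and O(mp) tangible, the short calculation below plugs in illustrative numbers; these particular values of m, m′, p, and k are assumptions made for the sake of the example, not measurements from this study.

```python
m = 90_000         # admissible candidates after the precursor filter (assumed)
m_reduced = 9_000  # candidates surviving the top-k prefilter, i.e., a 10x reduction (assumed)
p = 140            # peaks per spectrum, comparable to the library averages reported below
k = 10             # number of extracted top peaks

full_cost = m * p                      # O(mp): dot product against every candidate
reduced_cost = m * k + m_reduced * p   # O(mk + m'p): cheap prefilter plus surviving dot products
print(f"approximate operation-count ratio: {full_cost / reduced_cost:.1f}x")  # ~5.8x here
```

The wall-clock speed-ups reported in the Results depend on the actual reduction ratios and on implementation constants, so such a back-of-the-envelope figure only indicates how the two terms trade off.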

Libraries, Data, and Implementation

As a validation of the candidate reduction strategy, its performance was experimentally evaluated. The candidate reduction strategy was implemented by modifying the open-source SpectraST spectral library search tool,3 which is part of the Trans-Proteomic Pipeline (TPP) [TPP v4.6 (occupy) rev 3],15 and was run on a machine with Windows 7 Enterprise OS, Intel Core Duo CPU E8600 3.33 GHz, 4 GB RAM. Specifically, SpectraST was modified to provide candidate counts and timing information for each query and to add the candidate reduction strategy, while all other aspects were kept identical to ensure a fair comparison. The Supporting Information contains further instructions on how SpectraST was modified.

Several different data sets were used in order to evaluate different aspects of the performance of the candidate reduction strategy. First, human plasma data sets and yeast data sets were used to evaluate the performance for spectra originating from different organisms. Furthermore, a high-resolution data set was used to evaluate the performance for data generated by modern mass spectrometry instruments. All data sets were retrieved from PeptideAtlas.16,17

The human plasma data consists of three raw data sets, PAe000337, PAe001281, and PAe000347, containing 70 759, 165 653, and 369 440 spectra, respectively. All reported human plasma results are the average results for these three data sets. The data sets were queried against the PeptideAtlas human plasma public library (built August 2013, available at http://www.peptideatlas.org/builds/), consisting of 93 860 spectra. The average and maximum number of peaks of an individual spectrum in the library is around 110 and 150, respectively. The yeast data consists of two raw data sets, PAe000142 and PAe000145, containing 136 966 and 118 099 spectra, respectively. Again, all reported yeast results are the average results for both data sets. The data sets were queried against the NIST yeast library (built on April 6, 2012, available at http://peptide.nist.gov), consisting of 92 415 spectra. The average and maximum number of peaks of an individual spectrum in the library is around 140 and 250, respectively. The high-resolution data consists of data set PAe003762, containing 145 557 spectra, and was generated by an LTQ Orbitrap Velos instrument.

Most parameters in the SpectraST software were set to the default values, but whenever required, these were adapted on the basis of the characteristics of the data sets. For the human and yeast data, the precursor mass tolerance was set equivalent to the default precursor mass tolerance, i.e., 3.0 Th. However, for the high-resolution data set, the precursor mass tolerance was set to 0.1 Th. Finally, the unmodified dot product was used as the discriminant scoring function.
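The precursor tolerances above (3.0 Th for the ion trap data and 0.1 Th for the high-resolution data) drive the conventional precursor mass filter that precedes the proposed strategy. A minimal sketch of such a filter over a library sorted by precursor m/z is given below; it illustrates the general idea and is not taken from SpectraST.

```python
import bisect

def precursor_filter(query_mz, sorted_mzs, spectrum_ids, tol=3.0):
    """Return ids of library spectra whose precursor m/z lies within +/- tol Th of the query.

    sorted_mzs is sorted ascending and spectrum_ids is aligned with it.
    """
    lo = bisect.bisect_left(sorted_mzs, query_mz - tol)
    hi = bisect.bisect_right(sorted_mzs, query_mz + tol)
    return spectrum_ids[lo:hi]

# e.g. with the 0.1 Th tolerance used for the high-resolution data
print(precursor_filter(785.84, [785.70, 785.83, 785.90, 786.40],
                       ["lib_a", "lib_b", "lib_c", "lib_d"], tol=0.1))  # ['lib_b', 'lib_c']
```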



RESULTS AND DISCUSSION

The performance of the candidate reduction strategy was evaluated both in terms of the achieved reduction in the size of the candidate sets and in terms of the time needed for identification. Furthermore, the reduced candidate sets were evaluated on the basis of the sensitivity of the strategy, which is defined as the degree to which the identification results when using the candidate reduction strategy are equal to the original identification results without using the candidate reduction strategy. In addition, different values for the various parameters defined in formulas 3−5 were evaluated. As previously defined, k is the number of top intense peaks used in the lower dimensional space instead of the full spectra, ε is the threshold for the minimum top k shared peak count between a query spectrum and the candidate spectra, and δ is the minimum threshold for the similarity metric to distinguish a "good" match between two spectra from a bad match, i.e., the minimum required dot product score. The value of the dot product for normalized spectra lies within the range [0,1], with a dot product of 1.0 indicating identical spectra.

In order for the candidate reduction strategy to be as effective as possible, a relatively low value of extracted peaks k is preferred.



On the basis of the cumulative percentage of total ion current (TIC) incorporated by the most intense peaks, as shown in Figure 1, which shows the cumulative contribution of each spectral peak to the total intensity computed for the human plasma spectral library, a maximum value around k = 10 was selected as a reasonable compromise. This small number of required peaks (over 10 times fewer than the average number of peaks per spectrum in the libraries) ensures a fast calculation of the spectral representatives, while those 10 peaks still describe, on average, about 50% of the full TIC. Meanwhile, after the first 10 most intense peaks, the contribution of each additional peak to the TIC diminishes. This means that, in general, the importance of additional peaks decreases. Results for the yeast spectral library are comparable and are available in the Supporting Information. In practical implementations, a more extensive evaluation of a larger range of k values remains possible.

Figure 1. Cumulative percentage of TIC explained by the most intense peaks. The contribution of each ith most intense peak was calculated by taking the average contribution of each ith most intense peak for all spectra in the human plasma spectral library.

Because the number of top peaks, k, has been restricted to the range [1,10], the number of minimum shared top peaks, ε, is restricted to the same range. However, for the convenience of the presentation, the values of ε have been limited to [1,5], whereas the full results can be found in the Supporting Information. Having a low value for ε requires that only a few of the most intense peaks match. Meanwhile, a higher value for ε enforces a more stringent lower bound, because most, or all, of the extracted peaks can be required to match.

Table 1 shows the reduction of admissible candidates for the human plasma data and the yeast data after applying the candidate reduction strategy. It can be seen that for increasing values of k, while keeping ε constant, a smaller reduction is obtained. Meanwhile, for a constant value of k and increasing values of ε, a larger reduction is obtained. Furthermore, the largest reduction is obtained when k and ε are equal to each other. This result is to be expected; in this case, all extracted peaks are required to match, which will not be true in most situations. However, adding some leeway, either by introducing more peaks, by increasing k, or by requiring fewer matching peaks, by decreasing ε, will avoid undue pruning of admissible candidates. Nevertheless, Table 1 shows that, even for divergent values of k and ε, the number of admissible candidates can already be decreased by an order of magnitude. In addition, both for the human plasma data and the yeast data, candidate reduction ratios are comparable, despite the fact that both experiments feature a different organism and a different library size. This indicates that the candidate reduction strategy is quite robust.

It was shown that the candidate reduction strategy is able to drastically decrease the number of candidates. However, pruning too many, or possibly any, of the candidates that would result in a positive identification match in the original candidate set should be avoided. To evaluate how the candidate reduction strategy influences the search results, the sensitivity at different dot product scoring thresholds is calculated. The sensitivity is defined as the proportion of queries that result in the same SSM with or without using the candidate reduction strategy. As aforementioned, an SSM is considered to be a good match if its dot product score is above a predefined scoring threshold δ. Whereas a dot product score of 1.0 signifies that the two spectra are identical, generally, a dot product score of 0.5 to 0.6 is considered a good threshold for retaining positive hits. Table 2 shows the sensitivity for varying values of k and ε for a dot product threshold of δ = 0.6. Intuitively, allowing some flexibility in the number of required matching peaks compared to the number of extracted peaks results in fewer positive hits being discarded and consequently results in a higher sensitivity. The shaded area indicates all combinations of k and ε that achieve a sensitivity exceeding 99%. Therefore, for these values, only a small number of probable positive hits will be lost while still achieving a sizable candidate reduction, as indicated by Table 1. Thus, with 99% sensitivity, it is possible to reduce the size of the candidate set from 3.5 up to more than 22 times for both the human plasma and yeast data. Furthermore, this table confirms that the proposed transformation function 𝒯 and similarity function D′ are able to very effectively capture the most important properties of matching spectra.
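Given the definition of sensitivity used here, a small helper of the following form could compute it from the search results obtained with and without the prefilter; the result data structures are assumptions made for illustration.

```python
def sensitivity(results_full, results_reduced, delta=0.6):
    """Fraction of positive hits (score >= delta without reduction) that keep the same SSM.

    Both arguments map a query id to a (best_candidate_id, score) pair, or to None when
    no candidate reached the score threshold.
    """
    positives = {q: hit for q, hit in results_full.items()
                 if hit is not None and hit[1] >= delta}
    if not positives:
        return 1.0
    preserved = sum(1 for q, (cand_id, _) in positives.items()
                    if results_reduced.get(q) is not None
                    and results_reduced[q][0] == cand_id)
    return preserved / len(positives)
```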

Table 1. Number of Admissible Candidates after the Candidate Reduction Strategy Has Been Applied for Varying Values of k and ε to Human Plasma Data (a) and Yeast Data (b)a

a k, the number of extracted peaks; ε, the required number of matching peaks. Each cell represents the ratio by which the size of the aggregated candidate set is reduced for the corresponding values of k and ε.


Table 2. Sensitivity of Applying the Candidate Reduction Strategy for Varying Values of k and ε to Human Plasma Data (a) and Yeast Data (b)a

a k, the number of extracted peaks; ε, the required number of matching peaks. Each cell represents the sensitivity for the corresponding values of k and ε at a dot product scoring threshold δ = 0.6.

Table 3. Performance of the Candidate Reduction Strategy on a High-Resolution Data Set for Varying Values of k and εa,b

a k, the number of extracted peaks; ε, the required number of matching peaks. Each cell represents the corresponding values of k and ε. bFor (a), the number of admissible candidates after the candidate reduction strategy has been applied for varying values of k and ε. Each cell represents the ratio by which the size of the aggregated candidate set is reduced for the corresponding values of k and ε. Without applying the candidate reduction, in total, 7 194 532 candidates are admissible aggregated over all queries, which is equivalent to, on average, 49 candidates per query. For (b), the sensitivity of applying the candidate reduction strategy for varying values of k and ε. Each cell represents the sensitivity for the corresponding values of k and ε at a dot product scoring threshold δ = 0.6.

The previous experiments were performed with a relatively wide precursor mass filter. However, because modern mass spectrometer instruments are able to measure at a much higher resolution, the precursor mass filter can be applied much more strictly, which will result in fewer remaining candidates after filtering on precursor mass. However, despite the decreased number of candidates prior to the candidate reduction step, similar reduction levels can be obtained for high-resolution data, as indicated by Table 3, which shows the results for the candidate reduction strategy when applied on a high-resolution data set with a strict precursor filter. These results indicate that the candidate reduction strategy can be successfully applied to different types of data, irrespective of its resolution.

To further investigate the performance of the candidate reduction strategy, the wall clock time was used to measure the absolute increase in speed achieved. Table 4 shows the increased speed when using the candidate reduction strategy for the high-resolution data set. The level of improvement in the speed was calculated by taking the ratio between the time needed when using the unmodified dot product as similarity metric and the time needed when first incorporating the candidate reduction strategy with specific combinations of k and ε. The time measured was restricted only to the time spent on performing the similarity computations; other aspects, such as I/O of the spectral library and the output of the results, are not influenced by the candidate reduction strategy and should not be taken into account. The timing estimates used are the average results for 10 independent searches, preceded by an additional search that was not taken into account to mitigate the effects of a "cold start".
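The timing protocol described above (one discarded warm-up search followed by ten timed repetitions, restricted to the similarity computations) can be mimicked with a small harness such as the one below, where run_similarity_stage is a hypothetical stand-in for the routine being measured.

```python
import statistics
import time

def average_query_time_ms(run_similarity_stage, repetitions=10):
    """Mean wall-clock time (ms) over `repetitions` runs, after one uncounted warm-up run."""
    run_similarity_stage()          # cold-start run, discarded
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_similarity_stage()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings_ms)

# speed-up factor = average_query_time_ms(plain_dot_product_search) /
#                   average_query_time_ms(search_with_candidate_reduction)
```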

Table 4. Improved Processing Speed Obtained by Applying the Candidate Reduction Strategy for Varying Values of k and εa

k, the number of extracted peaks; ε, the required number of matching peaks. Each cell represents the increased speed for the corresponding values of k and ε at a dot product score threshold δ = 0.6; only combinations that yield a sensitivity >99% (see Table 3) are reported. The unmodified query time for the high-resolution data set is 56339.9 ms, whereas the candidate reduction strategy improves the speed of this process by a factor of 50−80. a

The time spent on the similarity computation for the unmodified dot product is around 56 339.9 ms, whereas the candidate reduction strategy results in increases in speed by a factor of 50−80. In addition, the increased speed is not perfectly correlated with the candidate reduction ratio in Table 3, i.e., the increased speed for a higher k value is less pronounced than that for a lower k value, even though specific combinations of k and ε result in a comparable candidate reduction.


As an explanation, consider the previously calculated time complexity for the candidate reduction strategy: O(mk + m′p). Indeed, even though the number of pruned candidates can be comparable, for higher k values, the time needed for the low-dimensional similarity computations slightly increases compared to the case of lower k values. On the other hand, even for nonrestrictive combinations of k and ε, a huge increase in speed is already attained, with the candidate reduction strategy being at least 50 times faster than the unmodified approach. Furthermore, these results were obtained when using a (relatively) strict precursor mass filter of 0.1 Th to initially reduce the candidate set. If a wider precursor mass filter were used, for example, subsequently using the mass accuracy as a postsearch filtering mechanism to reduce false positives,18 the candidate reduction strategy would result in even bigger increases in speed.

In the previous analyses, the required number of shared peaks between different spectra was mainly evaluated. In addition, Figure 2 shows the relationship between the sensitivity and the dot product threshold δ at specific values k = 10 and ε = 2. It is clear that for low dot product score thresholds the sensitivity is rather low. This is caused by the fact that such low dot product scores do not provide a reliable similarity measure; even very dissimilar spectra will yield a sufficiently good match. On the other hand, the sensitivity shows a considerable improvement for dot product score thresholds of 0.5 or higher. Remarkably, it is exactly this threshold that is used to identify positive hits. Figure 2 shows that for these dot product score thresholds the sensitivity is very high. Consequently, almost all of the positive hits will be retained after the candidate reduction. Note that for very high dot product score thresholds (δ ≥ 0.8) the sensitivity will be (almost) 100%. Although Figure 2 shows only the evolution of the sensitivity values for k = 10 and ε = 2, sensitivities for other (realistic) values of these parameters follow the same trend. The full data is available in the Supporting Information.

Table 5. Ratio of Identifications Using PeptideProphet Running on Search Results of Different Combinations of k and ε and Search Results without Using the Strategya

k, the number of extracted peaks; ε, the required number of matching peaks. The data set used is the PAe000337 human plasma data set, and each cell represents the value for the corresponding values of k and ε. a

To further evaluate the influence of the candidate reduction strategy on the search results, the postsearch processing software PeptideProphet19 was deployed on the search results for different combinations of k and ε for the human plasma data set PAe000337. The total number of spectral matches filtered by the mixture model of PeptideProphet was used as a benchmark. More specifically, the results for PeptideProphet running on search results for different combinations of k and ε were compared with the results for PeptideProphet running on search results without using the strategy, and, finally, ratio values between those results were calculated. These ratios are presented in Table 5.

Figure 2. Relationship between the sensitivity of the candidate reduction strategy and the dot product score threshold for k = 10 extracted peaks and ε = 2 required matching peaks.

Interestingly, these ratio values are highly related and proportional to the sensitivity values in Table 2. The search results were further analyzed using the false discovery rate (FDR).

Figure 3. FDR comparison by PeptideProphet for the human plasma data set PAe000337. The green curve shows the FDR in terms of the estimated number of correct identifications for the search results without applying the candidate reduction strategy. The other curves each show the FDRs for the search results for a specific combination of k and ε corresponding to one of the shaded cells in Table 2. The top and bottom curves are highlighted in blue and red, respectively.

Figure 3 compares the FDR for the search results with and without applying the candidate reduction strategy. More specifically, the combinations of k and ε for which at least 99% sensitivity is obtained are evaluated (see Table 2). In general, there is little difference between the FDRs for the nonadapted search results and the search results after candidate reduction with the specified parameters. As long as the combination of k and ε preserves a high sensitivity, i.e., the good matches are retained, the FDR will be similar. This is also evidenced by the fact that the FDR for the combination of k = 10 and ε = 3 is the lowest, because this combination achieves the highest ratio reduction and the lowest sensitivity out of the selected parameter combinations. Table 5 indicates this as well.


It was experimentally validated that the presented candidate reduction strategy can provide a significant improvement in speed when performing a spectral library search. In addition, the effect of the required parameters, such as the lower dimensionality and the similarity threshold, on the achieved reduction of the candidate set was shown. Furthermore, it was demonstrated that for sensible values of these parameters the search results will not be influenced. However, for specific situations, an adapted value for some parameters may be desirable. Examples of these situations are when a very high sensitivity is required in order to retain all positive hits or when a major increase in speed is desired in order to very quickly assess SSMs by the reduced similarity measure. In these situations, Tables 1 and 2, as well as the Supporting Information, can give an indication of which parameter values might be appropriate.

CONCLUSIONS

This technical note introduces a candidate reduction strategy based on a shared peak count to significantly speed up spectral library searching by efficiently reducing the set of candidate spectra. The strategy was evaluated both theoretically, based on its time complexity, and empirically by extending the open-source SpectraST software. The candidate reduction strategy can reduce the number of candidates by (at least) an order of magnitude while still having a very high sensitivity, thus retaining almost all positive hits. Furthermore, the influence of several parameters on the candidate reduction strategy has been shown, and some general values for these parameters have been proposed.

The most important factors that influence the time needed to perform a spectral library search are the number of spectra that need to be matched and the size of the library. Current spectral libraries can contain several tens of thousands to hundreds of thousands of spectra, and this is only expected to increase in the future. Furthermore, a typical mass spectrometry experiment can contain several hundreds of thousands of spectra as well. Although computing a single SSM takes only a fraction of a second, identifying a full experiment can require up to several hours of computation time. This clearly illustrates the need for efficient methods to perform spectral library searching.

The candidate reduction strategy is universally applicable and is independent of the spectral library search engine or the employed similarity score. Instructions on how to implement the candidate reduction strategy are available in the Supporting Information. Furthermore, the full source code is available on request. On the other hand, the developers of spectral library software who know their software structure deeply will be able to apply the candidate reduction strategy in the most effective way. For example, it is possible to build an additional library containing only the top k peaks for each spectrum and link this to the main library by the same index. It is worth noting that this additional library would consume only a very small amount of memory because it needs to contain only the top k m/z values (intensity is not required) as well as the precursor mass and a reference index value. The query would need to be searched only against the additional library to filter the candidates before searching the main library, which contains all of the detailed information. Another possible implementation could directly include the top k peaks as additional information for each spectrum in the spectral library. When filtering, the software could extract this information and determine whether the spectrum under consideration should be retained. These are certainly interesting considerations, which we leave at the discretion of the respective software developers.

Finally, for the current analysis, the dot product was used as similarity metric because it is a very simple and universally used similarity metric. However, if a more complicated similarity metric that requires more time to compute were used, then the gains from the candidate reduction strategy would, comparatively, be even higher. Therefore, the applicability of the strategy in such situations would be an interesting avenue for future research.
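As a purely illustrative sketch of the auxiliary top-k library suggested above, each entry would only need the sorted top k m/z values, the precursor mass, and an index back into the main library; none of the identifiers below come from SpectraST.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TopKEntry:
    """Compact per-spectrum record for a separate, memory-light filtering library."""
    library_index: int     # position of the full spectrum in the main library
    precursor_mass: float  # reused for the usual precursor tolerance filter
    top_mz: List[float]    # m/z values of the k most intense peaks, sorted ascending

def build_auxiliary_library(main_library, k=10):
    """main_library: list of (precursor_mass, [(mz, intensity), ...]) entries."""
    auxiliary = []
    for index, (precursor_mass, peaks) in enumerate(main_library):
        top = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
        auxiliary.append(TopKEntry(index, precursor_mass, sorted(mz for mz, _ in top)))
    return auxiliary
```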



ASSOCIATED CONTENT

Supporting Information

Twenty-two tables describing the sensitivity and reduction times of full cases for the various combinations of k and ε for the human plasma data and the yeast data; the cumulative TIC curve for the yeast spectral library; the relationship between the sensitivity of the candidate reduction strategy and the dot product score threshold for the human plasma data and the yeast data; and instructions on how to integrate the candidate reduction strategy in the open-source SpectraST software. This material is available free of charge via the Internet at http://pubs.acs.org.






AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +32 (0) 3 265 33 10. Fax: +32 (0) 3 265 37 77.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

T.N.V. acknowledges support by a BOF interdisciplinary grant of the University of Antwerp. W.B. is supported by SBO grant "InSPECtor" (120025) of the Flemish agency for Innovation by Science and Technology (IWT).



REFERENCES

(1) Zhang, X.; Li, Y.; Shao, W.; Lam, H. Understanding the improved sensitivity of spectral library searching over sequence database searching in proteomics data analysis. Proteomics 2011, 11, 1075−1085.
(2) Stein, S. E.; Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 1994, 5, 859−866.
(3) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7, 655−667.
(4) Frewen, B. E.; Merrihew, G. E.; Wu, C. C.; Noble, W. S.; MacCoss, M. J. Analysis of peptide MS/MS spectra from large-scale proteomics experiments using spectrum libraries. Anal. Chem. 2006, 78, 5678−5684.
(5) Frewen, B.; MacCoss, M. J. Using BiblioSpec for creating and searching tandem MS peptide libraries. Curr. Protoc. Bioinf. 2007, 20, 13.7.1−13.7.12.
(6) Craig, R.; Cortens, J.; Fenyo, D.; Beavis, R. Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 2006, 5, 1843−1849.
(7) Dasari, S.; Chambers, M. C.; Martinez, M. A.; Carpenter, K. L.; Ham, A.-J. L.; Vega-Montoto, L. J.; Tabb, D. L. Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment. J. Proteome Res. 2012, 11, 1686−1695.
(8) Ye, D.; Fu, Y.; Sun, R.-X.; Wang, H.-P.; Yuan, Z.-F.; Chi, H.; He, S.-M. Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26, i399−i406.
(9) Yen, C.-Y.; Houel, S.; Ahn, N. G.; Old, W. M. Spectrum-to-spectrum searching using a proteome-wide spectral library. Mol. Cell. Proteomics 2011, 10, M111.007666.
(10) Li, H.; Zong, N. C.; Liang, X.; Kim, A. K.; Choi, J. H.; Deng, N.; Zelaya, I.; Lam, M.; Duan, H.; Ping, P. A novel spectral library workflow to enhance protein identifications. J. Proteomics 2013, 81, 173−184.
(11) Wu, X.; Tseng, C.-W.; Edwards, N. HMMatch: Peptide identification by spectral matching of tandem mass spectra using hidden Markov models. J. Comput. Biol. 2007, 14, 1025−1043.
(12) Shao, W.; Zhu, K.; Lam, H. Refining similarity scoring to enable decoy-free validation in spectral library searching. Proteomics 2013, 13, 3273−3283.
(13) Baumgardner, L. A.; Shanmugam, A. K.; Lam, H.; Eng, J. K.; Martin, D. B. Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J. Proteome Res. 2011, 10, 2882−2888.
(14) Frank, A. M.; Bandeira, N.; Shen, Z.; Tanner, S.; Briggs, S. P.; Smith, R. D.; Pevzner, P. A. Clustering millions of tandem mass spectra. J. Proteome Res. 2008, 7, 113−122.
(15) Keller, A.; Eng, J.; Zhang, N.; Li, X.-j.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, E1−E8.
(16) Deutsch, E. W.; Lam, H.; Aebersold, R. PeptideAtlas: A resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008, 9, 429−434.
(17) Farrah, T.; Deutsch, E. W.; Aebersold, R. Using the human plasma PeptideAtlas to study human plasma proteins. Methods Mol. Biol. 2011, 728, 349−374.
(18) Hsieh, E. J.; Hoopmann, M. R.; MacLean, B.; MacCoss, M. J. Comparison of database search strategies for high precursor mass accuracy MS/MS data. J. Proteome Res. 2010, 9, 1138−1143.
(19) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383−5392.
