

Article

MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics
Matthew The and Lukas Käll
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00749 • Publication Date (Web): 14 Dec 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics
Matthew The and Lukas Käll∗
Science for Life Laboratory, School of Biotechnology, Royal Institute of Technology - KTH, Box 1031, 17121 Solna, Sweden
E-mail: [email protected]

Abstract

Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative method to traditional search engines for analyzing spectra, specifically useful for larger scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries, as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available from https://github.com/statisticalbiotechnology/maracluster, under Apache 2.0 license.

∗ To whom correspondence should be addressed

Keywords Mass spectrometry, Proteomics, Hierarchical clustering, Bioinformatics, Database search, Spectral archives, Spectral libraries

Introduction

Shotgun proteomics is currently the most comprehensive, yet data-intensive, method to analyze protein content in complex biological mixtures. Modern mass spectrometry equipment generates gigabytes of data per hour. Not surprisingly, the processing of the resulting fragment spectra is a bottleneck in many mass spectrometry labs. The typical identification pipeline involves matching the fragment spectra to all the peptides of a sequence database using a database search engine, like Sequest 1 or Mascot 2. Such search engines compare each fragment spectrum against the predicted fragment spectra of all the peptides of the sequence database, and score the matches based on how well the predicted and observed fragment spectra overlap.

An alternative approach to analyzing the data is to first cluster the fragment spectra and to create a consensus spectrum for each cluster 3 before further interpretation of the data. Such a processing strategy has several theoretical benefits. Firstly, as datasets get larger we can use clustering to reduce the number of tests made for any dataset, and thereby reduce the induced multiple-hypothesis problems 4. Secondly, many methods depend on searches using spectral libraries, where one searches spectra against prerecorded spectra rather than theoretical spectra. This principle has been used both within proteomics 5 and metabolomics 6, and much of the processing of Data Independent Acquisition data is done with such spectral libraries 7. Clustering is an effective way to form such spectral libraries. Finally, such strategies allow detection of frequently recurring spectra that lack interpretation with normal search engines. More than just correcting errors of traditional search strategies, this can be seen as an aid to mine for unanticipated analytes such as rare variant peptides or post-translational modifications.

The literature contains a wide variety of clustering approaches for fragment spectra, such as Pep-Miner 3, MS2Grouper 8, NorBel2 9, MS-Cluster 10, PRIDE-Cluster 11 and CAMS-RS 12. After some preprocessing steps such as noise filtering, such algorithms normally do pairwise comparisons of all spectra of the analyzed set using a measure of similarity. Examples of such similarity measures are cosine distances 3, spectral dot products 10, or matches of consecutive peaks 12. Subsequently, the pairwise distances are used to cluster the spectra using schemes such as average-linkage hierarchical clustering 13, single-linkage hierarchical clustering 9, or custom varieties of greedy clustering 8,10.

We see two possible ways to improve the results of these previous efforts. Firstly, many of the selected pairwise scoring metrics do not correct for the differences in the frequencies of different fragment masses based on the set of candidate peptides for a spectrum 14. Secondly, many of the previous efforts do not have a strategy to counteract problems related to spectra containing fragments from more than one peptide, frequently referred to as chimeric spectra. As some studies show that more than 50% of all fragment spectra are isolated from precursor mass ranges containing more than one peptide species 15,16, the effect of such spectra creating links between clusters of different peptides becomes prominent when clustering comprehensive datasets.

Here, we present a scheme for hierarchical clustering of fragment spectra, MaRaCluster. It employs a p value distance measure based on the product of the probabilities of matched peaks being present and unmatched peaks being absent in a randomly drawn spectrum (Figure 1). This approach gives more weight to rare peaks while lowering the contribution of frequently present peaks. To counteract cluster contamination through chimeric spectra, we subsequently employ complete-linkage hierarchical clustering.

Methods

Data Acquisition

We downloaded a set of spectra, comprising 561 runs on 51 samples with a total of 7.8 million spectra from lysates of lymphoblastoid cell lines, from a study of the variation of protein abundance in humans 17. The investigated peptides contained a chemical modification in the form of TMT6plex tags for determining quantitative levels, and were analyzed on an LTQ Orbitrap Velos (Thermo Scientific) equipped with an online 2D nanoACQUITY UPLC System (Waters). We will refer to this set as the linfeng set. For some of the comparisons we also downloaded two sets of spectra, from human Du145 prostate cancer cells and from yeast, collected on an LTQ Orbitrap Velos (Thermo Scientific), as described in Moruz et al. 18 We will refer to these sets as the hm human and hm yeast sets.

Clustering Algorithm

Our hierarchical clustering scheme, MaRaCluster, can be divided into four distinct steps:

1. Preprocessing: the spectra are converted to MS2 format and assigned accurate precursor masses. Subsequently, the spectra are split into separate files based on these precursor masses to accommodate parallel processing.
2. Background distribution calculation: the spectra are reduced to a list of their N most intense fragment peak locations and we register the background frequency of fragments as a function of their mass-to-charge ratio.
3. Distance calculation: the pairwise distances between spectra are calculated using a scoring function, which we will describe in detail.
4. Clustering: a bottom-up hierarchical clustering is applied using a memory-constrained complete-linkage algorithm.

The individual steps are described in detail in separate subsections below.

[Figure 1 plots, panels A and B, each showing three stacked spectra. x-axis: m/z; y-axis: relative intensity (A), relative intensity / background frequency (B).]

Figure 1: Schematic overview of the rarity-based distance calculation for fragment spectra. (A) Based on a naive dot product, the top and middle spectra are most similar. (B) Our scoring function takes the product of the probability of the top spectrum's fragments randomly being matched (green bars) and not matched (red bars) by the other spectra. Subsequently, a p value is calculated based on a background frequency null model. Because the background frequency of fragment peaks at each m/z value is taken into account, matches of highly frequent fragment peaks result in a larger distance than matches of rare peaks. Based on this p value distance, the top and bottom spectra are most similar.

Preprocessing

We used Proteowizard 19 to convert the RAW files to MS1 and MS2 files. High-resolution precursor masses and charges were assigned to the MS2 spectra using information from the precursor scans with Hardklör 20 followed by Bullseye 21, through the interface of the Crux 2.0 package 22. For each MS2 spectrum we applied the fragment peak binning strategy used in 23, i.e. we placed the edges between the bins at $l_i = 1.000508 \cdot (0.18 + i)$ Th, for $i = 0, 1, \ldots$ We subsequently stored only the N = 40 bin locations with the highest intensity. Using higher values of N, e.g. N = 100, resulted in reduced performance, a phenomenon previously observed for spectral dot products as well 10.

To accommodate parallel processing, the spectra are split according to their precursor mass into windows of equal workload. We keep track of spectra with precursor masses within ±1 precursor mass tolerance of the edges of such windows, so as not to miss any spectrum pairs due to this workload-balancing step. Also, as Bullseye often returns multiple precursor mass candidates, each spectrum could be assigned to multiple precursor mass windows.
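The binning and peak selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MaRaCluster implementation: the function names `mz_to_bin` and `top_n_bins` are hypothetical, and summing the intensities of peaks that fall into the same bin is an assumption, since the text does not state how such peaks are combined.

```python
import math

BIN_WIDTH = 1.000508   # Th, spacing of the bin edges l_i
BIN_OFFSET = 0.18      # offset in l_i = BIN_WIDTH * (BIN_OFFSET + i)


def mz_to_bin(mz):
    """Return the index i such that l_i <= mz < l_(i+1)."""
    return math.floor(mz / BIN_WIDTH - BIN_OFFSET)


def top_n_bins(peaks, n=40):
    """Reduce a spectrum to the bin indices of its n most intense bins.

    peaks: iterable of (mz, intensity) pairs.
    Assumption: intensities of peaks sharing a bin are summed.
    """
    intensity_per_bin = {}
    for mz, intensity in peaks:
        b = mz_to_bin(mz)
        intensity_per_bin[b] = intensity_per_bin.get(b, 0.0) + intensity
    # Rank bins by decreasing intensity; ties are broken by bin index.
    ranked = sorted(intensity_per_bin, key=lambda b: (-intensity_per_bin[b], b))
    return sorted(ranked[:n])
```

With N = 40 a spectrum is thus represented by at most 40 integers, which is what makes the later pairwise comparisons cheap.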

Background distribution calculation

Our algorithm is dependent on a background distribution, i.e. a distribution from which we can determine the probability of a random match of a fragment at a given m/z. We hence compiled histograms of fragment peak bins by aggregating data from the spectra with the same charge at a specific precursor mass (Figure 2). As seen, this distribution differs between sets. For instance, a set containing TMT tags (the "linfeng" series in Figure 2) had a higher peak density in the 126-131 Th region, originating from the reporter masses. It is interesting to note that while the experimental setup greatly influences the background distribution, the studied organism seemed to have less of an influence. The background distribution from a yeast sample (hm yeast) did not differ much from the distribution of the human sample (hm human in Figure 2), while we see a prominent difference between the two series from human samples, hm human and linfeng. Also, as only a finite number of peptides from the proteome fall within a precursor mass range, the true fragment peaks are limited to the possible fragment ions of these peptides, causing an oscillating pattern with a period of around 10 Th.

In order to establish the background distributions for a certain precursor mass, the spectra were split by their precursor charge and subsequently binned by their precursor mass using the same bin edges as for the fragment peaks. We counted the occurrence of each fragment peak bin over all the spectra in such a precursor mass bin. The background distribution, $\xi(i)$, can then be defined for fragment bins, $i = 1, \ldots, M$, as the fraction of spectra that have a peak in the fragment bin $i$. Note that we calculate one set of $\xi(i)$ for every combination of precursor mass bin and charge.

Smoothing is necessary to prevent overfitting and to ensure that enough fragment peaks are present to capture the global distribution. We did so by averaging the current precursor bin with the precursor bins within 5 Da of the current bin, and we used a moving average filter of 5 Th on the resulting fragment peak histogram. The smoothing parameters were set with the goal of having a large number of spectra that correspond to at least a few different peptides, while at the same time not incorporating too much information from other precursor mass bins, which will have fragment ions at different m/z values. Here, the balancing of these two objectives was done arbitrarily, but this could be optimized in future work.
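As an illustration of how such a background distribution could be computed, the following Python sketch counts, for one combination of precursor mass bin and charge, the fraction of spectra with a peak in each fragment bin and then applies a moving average. Several details are assumptions made for the sake of a runnable example: the function name `background_frequency` is hypothetical, the caller is assumed to have already pooled the spectra from the precursor bins within 5 Da, a window of 5 bins stands in for the 5 Th filter (the bins are about 1 Th wide), and the window is simply truncated at the ends of the histogram.

```python
def background_frequency(spectra_bins, num_bins, window=5):
    """Estimate xi(i), the fraction of spectra with a peak in fragment bin i.

    spectra_bins: list of spectra, each a collection of fragment-bin indices
                  (already pooled over neighbouring precursor bins).
    num_bins:     number of fragment bins M.
    window:       width of the moving average, in bins (assumed ~ Th).
    """
    counts = [0] * num_bins
    for bins in spectra_bins:
        for b in set(bins):              # count each bin once per spectrum
            if 0 <= b < num_bins:
                counts[b] += 1

    n_spectra = len(spectra_bins)
    if n_spectra == 0:
        return [0.0] * num_bins
    raw = [c / n_spectra for c in counts]

    # Moving average; the window shrinks at the histogram edges (assumption).
    half = window // 2
    smoothed = []
    for i in range(num_bins):
        lo, hi = max(0, i - half), min(num_bins, i + half + 1)
        smoothed.append(sum(raw[lo:hi]) / (hi - lo))
    return smoothed
```

A wider window trades resolution of the oscillating pattern for robustness against sparsely populated precursor bins, which mirrors the balance discussed above.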



[Figure 2 plot: "Local fragment rarity distributions, prec m/z = 650 Th". x-axis: m/z (0 to 1200); y-axis: relative peak count (0 to 1.0); series: hm_human, hm_yeast, linfeng.]

Figure 2: The background distribution for three data sets. The figure shows the histogram of fragment peak bins of spectra with precursor charge +2 and precursor mass between 1290 and 1310 Da. The sets hm human and hm yeast 18 had the same experimental setup and rendered very similar background distributions, despite the investigated samples originating from different organisms. However, the linfeng set 17, which originated from another lab using similar instrumentation, ends up having a very different background distribution, despite its samples originating from the same organism as hm human. Also, an oscillating pattern with period ≈ 10 Th can be observed, caused by the finiteness of the set of true fragments of the candidate peptides for this precursor mass range.




Distance Calculation

A crucial component of any cluster analysis is how we measure the similarity between two clustered objects. Within computational mass spectrometry, the so-called scoring function is a natural choice. Scoring functions score similarities between fragment spectra, typically a spectrum predicted from an amino acid sequence and an observed spectrum. Although a number of such functions have already been described in the literature (see e.g. Fenyö & Beavis 24), we here introduce one more, a scoring function based on fragment mass rarity. Its main benefit is that it does not score every fragment ion equally but instead scores the individual peaks according to their background frequency. Below we describe the details of our scoring function, which is also outlined in Algorithm 1.

Scoring function

We want to establish a score of how well a query spectrum, $\sigma^Q$, resembles a target spectrum, $\sigma^T$, given a background distribution, $\xi(x)$, of how frequently a peak occurs at a given mass-to-charge ratio bin, $x$. Let $s^Q$ and $s^T$ denote the vectors of mass-to-charge ratio bins of the N most intense fragments of $\sigma^Q$ and $\sigma^T$. We can define a vector indicating the matches between the fragments in $s^Q$ and $s^T$ as $b = \{b_1, \ldots, b_N\}$, where $b_i = 1_{s_i^T \in s^Q}$ and $1$ denotes the indicator function. We also introduce the individual fragment probabilities, $q_i = \xi(s_i^T)$. The probability of drawing a spectrum from the background distribution, $\xi(x)$, that has peaks where $b_i = 1$ and lacks peaks where $b_i = 0$ is

$$y(b) = \Pr(b \mid q) = \prod_{i=1}^{N} q_i^{b_i} (1-q_i)^{(1-b_i)} = \prod_{i=1}^{N} q_i \left(\frac{1-q_i}{q_i}\right)^{(1-b_i)}. \tag{1}$$

While this score will generate a well-functioning ranking of different query spectra for a fixed target spectrum, its scores will not be comparable across different target spectra. Hence, we will normalize the probabilities into p values in the next section.
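To make the effect of Equation (1) concrete, the short Python sketch below evaluates the logarithm of y(b) for a query and a target spectrum. It is an illustration only: the function name `match_log_probability` and the representation of the background distribution as a dictionary from bin index to frequency are assumptions, and all frequencies are assumed to lie strictly between 0 and 1.

```python
import math


def match_log_probability(query_bins, target_bins, background):
    """Return log y(b) from Equation (1).

    query_bins, target_bins: bin indices of the N most intense fragments.
    background: mapping from bin index to background frequency xi(bin),
                assumed to satisfy 0 < xi < 1.
    """
    query = set(query_bins)
    log_y = 0.0
    for t in target_bins:
        q = background[t]
        if t in query:                 # b_i = 1: the peak is matched
            log_y += math.log(q)
        else:                          # b_i = 0: the peak is not matched
            log_y += math.log(1.0 - q)
    return log_y
```

For a target with one rare bin (frequency 0.01) and one common bin (frequency 0.4), a query matching only the rare bin yields y = 0.01 · 0.6 = 0.006, whereas a query matching only the common bin yields y = 0.99 · 0.4 = 0.396, so the rare match is considered far less likely to be random.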




Calculation of p values

We can use the probability from Equation (1) as a score and use a generator function 25 to calculate a p value using the null model that the peaks matched by $s^Q$ in $s^T$ were randomly drawn from $\xi(x)$. The p value of the matched subset represented by the binary vector $b$ can be represented as

$$p(b) = \sum_{b' \in \{0,1\}^N} 1_{y(b) \ge y(b')} \, y(b'). \tag{2}$$
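For a handful of peaks, the p value of Equation (2) can be evaluated exactly by enumerating all binary vectors, which is useful as a reference when checking an approximation. The sketch below is such a brute-force evaluation; the function name `exact_p_value` is hypothetical, and the small relative tolerance used to treat equal scores as ties is an assumption introduced to guard against floating-point round-off. The cost grows as 2^N, so this is only feasible for small N.

```python
import itertools
import math


def exact_p_value(b, q):
    """Evaluate Equation (2) by enumerating all b' in {0,1}^N.

    b: observed binary match vector (sequence of 0/1).
    q: background frequencies q_i of the target bins, 0 < q_i < 1.
    """
    def y(bits):
        # Probability of one match pattern under the background model.
        return math.prod(qi if bi else 1.0 - qi for bi, qi in zip(bits, q))

    y_obs = y(b)
    tolerance = 1.0 + 1e-12            # treat near-equal scores as ties
    return sum(
        y(bp)
        for bp in itertools.product((0, 1), repeat=len(q))
        if y(bp) <= y_obs * tolerance
    )
```

With two target bins of frequency 0.2 each, matching both gives p = 0.04, matching one gives p = 0.04 + 0.16 + 0.16 = 0.36, and matching none gives p = 1.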

An approximation of this type of p value can efficiently be calculated using the dynamic programming technique introduced in Moruz et al. 18 Following the suggested algorithm, we first evaluate the logarithm of Equation (1),

$$\log(y(b)) = \log\left(\prod_{i=1}^{N} q_i \left(\frac{1-q_i}{q_i}\right)^{(1-b_i)}\right) = Q + \sum_{i=1}^{N} (1-b_i) \log\left(\frac{1-q_i}{q_i}\right) \approx Q + k \sum_{i=1}^{N} (1-b_i) X_i. \tag{3}$$

Here, $Q = \sum_i \log(q_i)$ and $X_i = \mathrm{round}\left(\frac{\log((1-q_i)/q_i)}{k}\right)$ is a discretization of the values $\log((1-q_i)/q_i)$, with a user-defined, sufficiently small scaling factor, $k$. With this discretized equation we can investigate the distribution of scores, $y(b')$, using dynamic programming. Let $f(s, j)$ be a function that expresses the number of binary vectors, $\{b_1, \ldots, b_j\}$, such that $\sum_{i=1}^{j} (1-b_i) X_i = s$. Then, for $j = 1, \ldots, N$, we have:

$$f(s, j) = \begin{cases} f(s, j-1) + f(s - X_j, j-1) & \text{if } s \ge X_j \text{ and } j > 0 \\ f(s, j-1) & \text{if } s < X_j \text{ and } j > 0 \\ 1 & \text{if } s = 0,\ j = 0 \\ 0 & \text{if } s > 0,\ j = 0 \end{cases} \tag{4}$$

We can efficiently calculate $f(s, N)$ using the fact that $f(s, j)$ does not depend on $f(t, l)$ for $t > s$ and $l \ge j$. We first allocate a vector with all elements set to zero, except for the first element, which is set to one, i.e. $f(s, 0)$. Starting from the end of the vector, we then take $X_1$ and walk down to the beginning of the vector, updating the vector in place using Equation (4) while ignoring the second argument. We repeat the procedure for $X_j$, $j = 2, \ldots, N$.

To derive an approximation of the p value for our match $b$, we calculate the probability $p(b)$ of all binary vectors with scores at least as extreme as $b$ using the approximation of Equation (3) in Equation (2),

$$p(b) = \sum_{b' \in \{0,1\}^N} 1_{y(b) \ge y(b')} \, y(b') \approx \sum_{s=1}^{Y(b)} f(s, N) \cdot e^{ks+Q} = d(Y(b)). \tag{5}$$

Here, $Y(b) = \sum_{i=1}^{N} (1-b_i) X_i$ represents the discretized score of the match and $d(Y)$ is the discretized p value distance as a function of the score, $Y$.

Algorithm 1 MaRaCluster's scoring function

procedure getPvalueDistance(s^Q, s^T, ξ(x), k)
    ▷ s^Q: m/z bins of query spectrum
    ▷ s^T: m/z bins of target spectrum
    ▷ ξ(x): fragment frequency distribution
    ▷ k: bin score discretization parameter
    b ← 0                                      ▷ initialize binary matching vector
    for i ← 1, N do
        q_i ← ξ(s^T_i)                         ▷ get fragment frequencies of target bins
        X_i ← round(log((1 − q_i)/q_i)/k)      ▷ calculate discretized target bin scores
        if s^T_i ∈ s^Q then b_i ← 1            ▷ mark target bins found in query spectrum
    Y ← Σ_{i=1}^{N} (1 − b_i) X_i              ▷ calculate query spectrum score
    Q ← Σ_{i=1}^{N} log(q_i)
    f(0) ← 1                                   ▷ begin dynamic programming
    for s ← 1, Y do f(s) ← 0
    for j ← 1, N do
        for s ← Y, 1 do
            if s < X_j then f(s) ← f(s)
            else f(s) ← f(s) + f(s − X_j)
    p ← Σ_{s=1}^{Y} f(s) · e^{ks+Q}
    return p                                   ▷ return p value distance
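The scoring function can be turned into a short runnable Python sketch, shown below as an illustration rather than as the MaRaCluster source. Three points are assumptions that go beyond the text: the function name `p_value_distance` and the dictionary representation of the background distribution are hypothetical; all background frequencies are assumed to satisfy 0 < q_i ≤ 0.5, so that every discretized score X_i is non-negative; and the final sum starts at s = 0 so that the fully matched configuration required by Equation (2) is counted, whereas the printed sum in Equation (5) starts at s = 1.

```python
import math


def p_value_distance(query_bins, target_bins, background, k=0.05):
    """Approximate p value distance via the dynamic programme of Equation (4).

    background: mapping bin index -> frequency, assumed 0 < q_i <= 0.5.
    k:          discretization step for the bin scores.
    """
    query = set(query_bins)
    q = [background[t] for t in target_bins]

    # Discretized bin scores X_i and match indicators b_i.
    x = [round(math.log((1.0 - qi) / qi) / k) for qi in q]
    b = [1 if t in query else 0 for t in target_bins]

    y_score = sum((1 - bi) * xj for bi, xj in zip(b, x))   # Y(b)
    log_q_sum = sum(math.log(qi) for qi in q)              # Q

    # f[s] counts the binary vectors whose unmatched scores sum to s.
    f = [0] * (y_score + 1)
    f[0] = 1
    for xj in x:
        # Walk from the end of the vector down so each X_j is used once.
        for s in range(y_score, xj - 1, -1):
            f[s] += f[s - xj]

    # Sum the probabilities of all configurations with score <= Y(b).
    return sum(f[s] * math.exp(k * s + log_q_sum) for s in range(y_score + 1))
```

Only scores up to Y(b) are needed, so the vector has Y(b) + 1 entries and the cost is on the order of N · Y(b) additions per spectrum pair. Choosing k such that log((1 − q_i)/q_i)/k is an integer makes the discretization exact: for two target bins with frequency 0.2 and k = log(4)/10, the sketch returns 0.04, 0.36 and 1 for two, one and zero matched bins.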




Practical considerations

We used the scoring function described above to calculate distances between the spectra. Only spectra with predicted precursor masses within 20 ppm of each other were considered potential candidates. Conveniently, given the $X_i$ it is easy to pre-compute $d(Y)$, $Y = 0, \ldots, \sum_i X_i$, for each target spectrum, which can be used as a look-up table. We can hence reduce computations by saving the $X_i$ and the discrete function $d(Y)$. Furthermore, the function $\log(d(Y))$ is smooth and can be approximated by a fifth-degree polynomial in $Y$, thus allowing storage reduction from a few thousand values ($\sum_i X_i + 1$ values) to just 6 coefficients. As spectra can have multiple precursor mass candidates, pairs of spectra are frequently assigned multiple p values. We took the best p value out of this set, assuming that it corresponds to the true precursor mass-charge state.

Cosine distance

To justify the added complexity of introducing a rarity-based distance, we also conducted comparative experiments where we replaced our scoring function with a cosine distance, a normalized variation of the dot product 8. When doing so, we employed the same binning strategy and selection of N bins as for our p value distance, except that we also saved the intensities for each of the N bins. For two spectra, $\sigma^Q$ and $\sigma^T$, let $I^Q$ and $I^T$ denote the vectors of intensities at each of the mass-to-charge ratio bins. Note that at most N entries of $I^Q$ and $I^T$ will be non-zero. The cosine distance between the two spectra is then calculated as:

$$C(I^Q, I^T) = \frac{\sum_{i=0}^{m} I_i^Q \cdot I_i^T}{\sqrt{\sum_{i=0}^{m} \left(I_i^Q\right)^2} \cdot \sqrt{\sum_{i=0}^{m} \left(I_i^T\right)^2}} \tag{6}$$

Taking the square root of the intensities in $I^Q$ and $I^T$ before calculating the cosine distance showed significant improvements and was therefore applied in our experiments.
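The comparison measure of Equation (6), including the square-root transformation of the intensities, can be sketched as follows. The function name `cosine_sqrt` and the sparse dictionary representation (bin index to intensity) are assumptions chosen for brevity; returning 0 for an empty spectrum is likewise an assumption, as the text does not cover that case.

```python
import math


def cosine_sqrt(intensities_q, intensities_t):
    """Evaluate Equation (6) on square-root transformed intensities.

    intensities_q, intensities_t: mappings bin index -> intensity, holding
    only the (at most N) non-zero entries of I^Q and I^T.
    """
    root_q = {b: math.sqrt(v) for b, v in intensities_q.items()}
    root_t = {b: math.sqrt(v) for b, v in intensities_t.items()}

    # Only bins present in both spectra contribute to the dot product.
    dot = sum(v * root_t.get(b, 0.0) for b, v in root_q.items())
    norm_q = math.sqrt(sum(v * v for v in root_q.values()))
    norm_t = math.sqrt(sum(v * v for v in root_t.values()))
    if norm_q == 0.0 or norm_t == 0.0:
        return 0.0                     # assumption: empty spectra score 0
    return dot / (norm_q * norm_t)
```

Note that, as written, Equation (6) equals 1 for identical spectra and 0 for spectra without shared bins. The square root dampens the influence of a few dominant peaks on the result.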




Clustering

The p values form a large sparse affinity matrix, $A$, of size $S \times S$, where $S$ is the number of spectra. Due to the asymmetry in the p value calculation, the matrix is initially also asymmetric (though in practice the two p values are usually close), but we symmetrize it by taking $A_{ij} = A_{ji} = \max(A_{ij}, A_{ji})$. Next, the logarithm is taken, mapping the p values from [0, 1] to the range (−∞, 0]. Finally, elements in the range (−5, 0] are considered insignificant and set to 0, increasing the sparseness of the matrix. The diagonal could contain the self-affinity p(Ymax), but as this value is of no practical use, the diagonal elements are also set to 0.

Hierarchical clustering is applied to this affinity matrix using complete-linkage. Complete-linkage was chosen in order to be as conservative as possible. Specifically, we would like to avoid clusters becoming connected by chimeric spectra, i.e. spectra that contain fragments from more than one peptide, acting as links between clusters composed of spectra from a single peptide species.

When clustering a few million spectra, the affinity matrix does not fit in the main memory of a commodity computer, even when using a sparse representation. Therefore, a complete-linkage adaptation of the memory-constrained sparse UPGMA algorithm from Lowenstein et al. 26 was implemented.

To get from the dendrogram assembled by complete-linkage to a list of clusters, we need to set a threshold on some cluster metric. The simplest approach is to stop merging branches when a certain p value threshold is reached. As p values are uniformly distributed under the null hypothesis, the expected number of false positives should be the p value threshold times the number of spectrum comparisons. However, as the null hypothesis is not related to the peptide from which the spectra originated, it would be wrong to interpret a true or false positive as meaning that two spectra do or do not come from the same peptide. A threshold that seems to work well in practice is to take the inverse of the number of spectrum pairs that are compared, times a safety margin of e.g. 0.01 or 0.001. For most datasets this would result in a p value threshold between $10^{-10}$ and $10^{-20}$. Increasing the p value threshold, e.g. to $10^{-5}$, produced many contaminated clusters, i.e. clusters for which search engines report two or more different high-confidence peptides for their constituent spectra. On the other hand, there are many spectrum pairs with a p value in the range $(10^{-15}, 10^{-5})$ that search engines confidently identify as the same peptide. A careful trade-off between sensitivity (discovering weakly linked clusters) and specificity (avoiding contaminated clusters) is needed.
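The following Python sketch illustrates the three ingredients of this section on a toy scale: building the symmetrized, thresholded log-affinities, deriving a threshold from the number of compared pairs, and merging clusters by complete-linkage. It is a naive in-memory illustration and not the memory-constrained algorithm used for millions of spectra. The function names are hypothetical, the natural logarithm is assumed (the text only says "the logarithm"), and all p values are assumed to be strictly positive.

```python
import math


def build_affinities(p_values, log_cutoff=-5.0):
    """Symmetrize by the maximum p value and keep only log p <= log_cutoff.

    p_values: dict {(i, j): p}, possibly holding both (i, j) and (j, i).
    Returns a dict {(min(i, j), max(i, j)): log p}; absent pairs count as 0.
    """
    worst = {}
    for (i, j), p in p_values.items():
        if i == j:
            continue                   # the diagonal is not used
        key = (min(i, j), max(i, j))
        worst[key] = max(worst.get(key, 0.0), p)
    return {key: math.log(p) for key, p in worst.items()
            if math.log(p) <= log_cutoff}


def suggested_log_threshold(num_pairs, safety_margin=0.01):
    """Log of (1 / number of compared pairs) times a safety margin."""
    return math.log(safety_margin / num_pairs)


def complete_linkage(num_spectra, affinities, log_threshold):
    """Greedy bottom-up complete-linkage clustering on log p values."""
    clusters = {i: [i] for i in range(num_spectra)}

    def linkage(a, b):
        # Complete-linkage: the worst (largest) log p over all cross pairs.
        return max(affinities.get((min(u, v), max(u, v)), 0.0)
                   for u in clusters[a] for v in clusters[b])

    while True:
        best = None
        ids = sorted(clusters)
        for m in range(len(ids)):
            for n in range(m + 1, len(ids)):
                d = linkage(ids[m], ids[n])
                if d <= log_threshold and (best is None or d < best[0]):
                    best = (d, ids[m], ids[n])
        if best is None:
            break                      # no pair passes the threshold
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return sorted(sorted(c) for c in clusters.values())
```

As a usage example, consider four spectra where 0 and 1 match strongly, 2 and 3 match strongly, and spectrum 1 also matches spectrum 2 (as a chimeric spectrum might). With $10^8$ compared pairs and a safety margin of 0.01 the threshold is $10^{-10}$. Complete-linkage keeps {0, 1} and {2, 3} apart, because the pairs (0, 2), (0, 3) and (1, 3) have no significant p value, whereas single-linkage would have joined all four spectra through the link between 1 and 2.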

Results

To analyze the performance of MaRaCluster, we compared it with some of its alternatives on several data sets and benchmarks. The results below are from the analysis of the linfeng set 17. Clusters were generated by MaRaCluster for a range of p value thresholds. To quantify performance, the two benchmarks described below were used.

Firstly, we investigated MaRaCluster's ability to create clusters representative of the original set of spectra. A consensus spectrum was generated for each cluster using the merging procedure employed by MS-Cluster 10. The consensus spectra were searched by Tide 27, through the interface of the Crux 2.0 package, followed by post-processing with Percolator v2.08.01 28. For the Tide search, the same search parameters and protein database as in 17 were used. The results were visualized by plotting the number of peptides identified by Percolator at a q value threshold of 0.01 against the number of consensus spectra (Figure 3). The number of consensus spectra is given relative to the original 7.8 million spectra; the number of identified peptides is given relative to the 76,490 peptides identified from analyzing these original spectra.

To illustrate the benefit of using our rarity-based distance, we compared MaRaCluster (Figure 3A; solid, green series) to a modified version of MaRaCluster, where we replaced the p value distance with a cosine distance, as described in the Methods section (Figure 3A; dashed, red series). In a similar fashion, we investigated the added benefit of using a complete-linkage scheme over a single-linkage scheme, by benchmarking a modified version of MaRaCluster with a single-linkage scheme (Figure 3A; dotted, purple series). In both cases the original version of MaRaCluster retained more peptide identifications for the same number of consensus spectra than its two modified versions. Replacing complete-linkage by single-linkage hierarchical clustering showed the largest loss in performance.

The resulting clusters were also compared to clusterings generated by a state-of-the-art method intended for clustering large sets of spectra, MS-Cluster 10. For MS-Cluster we varied the mixing probability parameter over a range of probabilities, leaving all other parameters at their default values. The comparison to MS-Cluster (Figure 3B) also showed favorable performance of MaRaCluster. Interpolating the MS-Cluster series at 8% of spectra remaining (0.6 million clusters) showed 65% retainment of identified peptides, compared to 91% for MaRaCluster, thus resulting in 40% more identified peptides. The leftmost data point of MS-Cluster represents the default mixing probability, retaining just over 50% of the peptide identifications. The recommended p value threshold for MaRaCluster retained 99% of the peptide identifications with 19% of the original number of spectra remaining.

Secondly, we conducted an analysis of the purity of the clusters, split out by cluster size. For this purpose we assigned peptides and q values to the original spectra, again using Tide, followed by Percolator. To ensure a set of very high quality identifications, only spectra with a PSM with a q value below 0.001 were considered.
Spectra without a confident PSM were not included in the measurement of performance, but were kept as a part of the clustering procedure. The purity of a cluster was defined as the largest fraction of spectra sharing their matched peptide. We plotted the average purity grouped by cluster size for two clusterings, comparing



1.0

Retainment of identified peptides

1.0

Retainment of identified peptides

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

0.9

0.9

0.8

0.8

0.7

0.7

MaRaCluster CosineDistSqrt SingleLinkage

0.6 0.5 0.00

Page 16 of 25

0.05

0.10

0.15

0.20

0.25

0.30

Proportion of spectra remaining

0.35

0.6

0.40

0.5 0.00

0.05

0.10

0.15

0.20

0.25

MaRaCluster MS-Cluster 0.30


A

0.35

0.40

B

Figure 3: MaRaCluster retains more identified peptides for a given number of consensus spectra. We plotted the number of identified peptides by Tide + Percolator against the number of consensus spectra relative to the numbers before clustering. (A) Compared to modified versions of MaRaCluster, replacing the p-value distance with a cosine distance (red, dashed) or replacing complete-linkage by single-linkage clustering (purple, dotted), the original version of MaRaCluster (green, solid) performed better over the entire range of number of consensus spectra remaining. (B) When reducing the number of consensus spectra below 30% of the original number of spectra, MaRaCluster retained far more identified peptides than MS-Cluster.


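The purity measure used for Figure 4 (the largest fraction of spectra in a cluster sharing their matched peptide) can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes that each cluster is given as a list of matched peptides, with `None` for spectra lacking a confident PSM, that such spectra count toward the cluster size but not toward the purity, and that sizes are binned in doubling intervals as on the figure axis. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(peptides):
    """Largest fraction of identified spectra sharing the same peptide."""
    return Counter(peptides).most_common(1)[0][1] / len(peptides)


def size_bin(size):
    """Map a cluster size to a doubling bin label: 1, 2-3, 4-7, ..., 256+."""
    if size >= 256:
        return "256+"
    lower = 1 << (size.bit_length() - 1)  # largest power of two <= size
    upper = 2 * lower - 1
    return str(lower) if lower == upper else f"{lower}-{upper}"


def average_purity_by_size(clusters):
    """Average purity per size bin; clusters without identified spectra are skipped."""
    purities = defaultdict(list)
    for cluster in clusters:
        identified = [p for p in cluster if p is not None]
        if identified:
            purities[size_bin(len(cluster))].append(cluster_purity(identified))
    return {label: sum(v) / len(v) for label, v in purities.items()}


# Toy clusters (hypothetical peptides); None marks a spectrum without a confident PSM.
toy_clusters = [
    ["PEPA", "PEPA", None],
    ["PEPA", "PEPB", "PEPA", "PEPA"],
    ["PEPC", None, None, None, "PEPD"],
]
print(average_purity_by_size(toy_clusters))
```

Whether cluster size should include unidentified spectra is a choice of this sketch; the text only states that such spectra were excluded from the performance measurement but kept in the clustering.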
[Figure 4 plots. Panel A: Average purity (y-axis, 0.65 to 1.00) versus Cluster size (x-axis, in doubling bins: 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128-255, 256+) for MaRaCluster and MS-Cluster. Panel B: Percentage of spectra (y-axis, 0 to 25) versus Cluster size (x-axis, the same bins plus size 1) for MaRaCluster and MS-Cluster.]

Figure 4: The integrity of clusters from MaRaCluster is size-independent. We further investigated the clusterings of MaRaCluster and MS-Cluster at 19% of the original number of spectra. (A) We compared the average purity as a function of cluster size for MaRaCluster (green, solid) and MS-Cluster (blue, dashed). The purity of a cluster was defined as the largest fraction of spectra sharing their matched peptide. MS-Cluster tends to produce several very large clusters, a handful even above 2000 spectra. However, these become less reliable as their size grows, while the purity for MaRaCluster stays high even for large clusters. (B) We plotted the distribution of spectra as a function of the size of their cluster. A sizable fraction of the spectra belongs to large clusters; hence, the increased purity of MaRaCluster is likely one of the main contributors to the overall increased clustering performance.

MaRaCluster and MS-Cluster at 19% of spectra remaining (Figure 4). Whereas the average purity of MS-Cluster’s clusters declines for clusters above ≈ 20 spectra, MaRaCluster’s clusters achieve > 98% average purity independent of cluster size. Furthermore, the distribution of cluster sizes showed that MaRaCluster generated smaller clusters than MS-Cluster for a fixed number of clusters, but at the same time left fewer spectra unclustered (Figure 4B). At 19% of spectra remaining, MaRaCluster left 7.6% of the spectra unclustered compared to 13.5% for MS-Cluster.

The runtime of MaRaCluster was 7.5 hours wall clock time (4 cores, 28 CPU hours), compared to 14 hours for a typical run of MS-Cluster (single core). A couple of the steps, e.g., the computation of the p value polynomial approximations for each target spectrum as well as the p value calculation for each spectrum pair, can be done in parallel. Other parts of



[Figure 5 plots. Panels A and B; y-axis in both: Percentage of peptides reproducible; x-axis in B: Max number of samples where peptide is absent (0 to 5). Series: MaRaCluster, Without clustering.]

Figure 5: MaRaCluster increases the reproducibility of peptide identifications, i.e., the number of samples significantly identifying a peptide. (A) We counted the number of peptides identified in all of the 51 samples by propagating PSM identifications of the consensus spectra. The percentage of the peptides identified without clustering that were identified in all mixtures increased from 1% to almost 2%. (B) We plotted the percentage of reproducible peptides as a function of the maximum number of samples missing the peptide for the clusterings at 19% of spectra remaining. This plot shows that the increased reproducibility also holds under less strict requirements on reproducibility, i.e., when a peptide identification is allowed to be absent in a few samples.

the current implementation are not parallelized. Hence, the speedup of MaRaCluster is not entirely proportional to the number of cores.

In general, data-dependent acquisition suffers from low reproducibility, which could also be observed for this data set: only 1% of the identified peptides were present in all of the 51 samples. Clustering the spectra and propagating the PSM identifications of the consensus spectra to the original spectra increased this rate to almost 2% (Figure 5A). Allowing fewer clusters increases the probability of contaminated clusters, in which case propagating the PSM identification is unjustified. However, combined with the results from Figure 4, we can be confident that this effect is limited at 19% of spectra remaining, where 1.7% of the original number of identified peptides were identified in all samples. Under less strict criteria that allow peptides to be absent from some samples, the trend remains that clustering increases reproducibility (Figure 5B).
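The reproducibility measure described above can be sketched as a simple count. The code below is an illustration under stated assumptions, not the authors' pipeline: identifications are assumed to be available as (sample, peptide) pairs after propagation, and the function name and the `n_reference` parameter (the denominator, e.g. the number of peptides identified without clustering) are hypothetical.

```python
from collections import defaultdict


def reproducibility_curve(identifications, n_samples, max_absent_values, n_reference=None):
    """Percentage of peptides found in all but at most `max_absent` samples.

    identifications: iterable of (sample_id, peptide) pairs, e.g. after
        propagating consensus-spectrum identifications to the original spectra.
    n_reference: denominator for the percentage; defaults to the number of
        distinct peptides in `identifications` (an assumption of this sketch).
    """
    samples_per_peptide = defaultdict(set)
    for sample_id, peptide in identifications:
        samples_per_peptide[peptide].add(sample_id)

    denominator = n_reference if n_reference is not None else len(samples_per_peptide)
    curve = {}
    for max_absent in max_absent_values:
        n_reproducible = sum(
            1 for samples in samples_per_peptide.values()
            if n_samples - len(samples) <= max_absent
        )
        curve[max_absent] = 100.0 * n_reproducible / denominator if denominator else 0.0
    return curve


# Toy example with 3 samples (the study itself used 51).
toy = [(0, "PEPA"), (1, "PEPA"), (2, "PEPA"),
       (0, "PEPB"), (1, "PEPB"),
       (0, "PEPC"),
       (2, "PEPD")]
curve = reproducibility_curve(toy, n_samples=3, max_absent_values=[0, 1, 2])
print(curve)
```

Setting `max_absent` to 0 corresponds to the strict criterion of Figure 5A (present in every sample), while larger values correspond to the relaxed criteria on the x-axis of Figure 5B.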




Discussion

Here, we have demonstrated that our method, MaRaCluster, improves the clustering compared to some available alternatives. In particular, its rarity-based distance measure was found superior to the cosine distance, and its complete-linkage scheme superior to single-linkage clustering, on the tested data. We also demonstrated that it outperformed a state-of-the-art method, MS-Cluster.

For spectral dot product scores, intensity-based filtering and normalization is an important step to filter out noise and account for intensity variability in the fragment spectra of a single analyte. In this work, we did not use the fragment peak intensity information apart from the selection of the N bin locations. Instead, we relied on the rarity of fragment peaks to distinguish more informative fragments. Fragments that appear frequently in the collection of spectra identify a match less uniquely than rare fragments. This circumvents the heuristic preprocessing steps needed when using fragment-intensity-based metrics.

The rarity-based scoring function is a useful alternative to spectral dot products. It is robust against both systematic noise (e.g., mass tags and instrument calibration signals) and random noise. Owing to the use of p values, one can select a significance level a priori, which should be set as a function of the number of spectra included in the analysis. The recommended p value threshold for clustering in MaRaCluster gave consistent results in terms of quality, retaining > 95% of the number of originally identified peptides. The user might prefer less or more strict thresholds depending on the research question, and MaRaCluster can generate clusterings with different granularities from a single run.

Quite frequently, fragment spectra contain fragments from multiple peptide species, so-called chimeric spectra (refs 15, 16, 29). Here, we did not use, e.g., single-linkage clustering, which uses the minimal pairwise distance between the members of two clusters as its representation of the distance between them. In such schemes, a chimeric spectrum would easily bridge the clusters of its constituent peptides. Instead, we used a complete-linkage scheme, which uses the maximal pairwise distance and thereby counteracts such chimeric linkage. As the clustering method is completely modular with respect to the distance measure, other clustering methods can be applied. Experiments with the Markov Cluster Algorithm (ref 30) for a range of parameters proved inferior to complete-linkage (data not shown), with one of the main issues being that the relative strength of p values was difficult to capture.

Another interesting feature of our procedure is that it includes a null model, which can be used to calibrate the distance between spectra with easy-to-interpret p values. However, it should be noted that the null model makes the naïve assumption that the observations of the individual fragment peaks are independent of each other. The calibration of this p value will be the subject of future study. In particular, it could be interesting to investigate how representative this model is for spectral library searching, for which the field today resorts to decoy spectral libraries (ref 31).

Here, we used the MaRaCluster algorithm to cluster spectra from shotgun proteomics experiments. However, the approach is general for mass spectrometry and could probably be applied to, for instance, metabolomics or other small-molecule data (ref 32). Also, in this work we used our rarity-based scoring function to match different observed fragment spectra to each other. However, in principle the scoring function is generic and could be used for other purposes as well, e.g., to match theoretical spectra to observed fragment spectra in a search engine.

The MaRaCluster algorithm was designed with scalability to billions of spectra in mind. Given that the computation time for the ≈ 10⁷ spectra is well under one day, it seems plausible that the algorithm will be able to handle data sets with ≈ 10⁸ spectra.
However, when processing larger sets than that, the number of comparisons will most likely become too large, even with the restriction of only comparing spectra with precursor masses within a predefined mass tolerance. Under such conditions, the number of comparisons can be reduced further by using more advanced partitioning functions, e.g., taking retention time or the most abundant fragment peaks into account. However, as these features are more variable than the precursor mass, one has to establish a careful balance between precision and sensitivity for such functions.
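To make the effect of a partitioning function concrete, the following sketch counts candidate spectrum pairs under a precursor mass tolerance alone and then with an additional, deliberately crude retention-time partition. Everything here is an assumption for illustration: the absolute mass tolerance, the fixed-width retention-time buckets, the function names, and the toy data are hypothetical and not taken from MaRaCluster.

```python
from bisect import bisect_right
from collections import defaultdict


def count_pairs_within_tolerance(masses, tol):
    """Count unordered pairs whose precursor masses differ by at most `tol`."""
    sorted_masses = sorted(masses)
    pairs = 0
    for i, mass in enumerate(sorted_masses):
        # Index of the first mass that is larger than mass + tol.
        j = bisect_right(sorted_masses, mass + tol)
        pairs += j - i - 1
    return pairs


def count_pairs_with_rt_partition(spectra, mass_tol, rt_bucket):
    """Only compare spectra that fall into the same retention-time bucket."""
    buckets = defaultdict(list)
    for mass, rt in spectra:
        buckets[int(rt // rt_bucket)].append(mass)
    return sum(count_pairs_within_tolerance(m, mass_tol) for m in buckets.values())


# Toy spectra as (precursor mass, retention time in minutes); values are made up.
spectra = [(500.00, 10.0), (500.01, 11.0), (500.02, 55.0), (800.00, 12.0)]

mass_only = count_pairs_within_tolerance([m for m, _ in spectra], 0.05)
mass_and_rt = count_pairs_with_rt_partition(spectra, 0.05, 30.0)
print(mass_only, mass_and_rt)
```

In the toy data the retention-time partition reduces the candidate pairs from 3 to 1. The two pairs that are dropped would be lost even if the spectra stemmed from the same peptide, and fixed buckets also separate spectra that lie close together on opposite sides of a bucket boundary, which illustrates the balance between precision and sensitivity mentioned above.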

References

(1) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(2) Cottrell, J. S.; London, U. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(3) Beer, I.; Barnea, E.; Ziv, T.; Admon, A. Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 2004, 4, 950–960.
(4) Serang, O.; Käll, L. Solution to Statistical Challenges in Proteomics Is More Statistics, Not Less. J. Proteome Res. 2015,
(5) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nat. Methods 2008, 5, 873–875.
(6) Kopka, J.; Schauer, N.; Krueger, S.; Birkemeyer, C.; Usadel, B.; Bergmüller, E.; Dörmann, P.; Weckwerth, W.; Gibon, Y.; Stitt, M.; Willmitzer, L.; Fernie, A. R.; Steinhauser, D. GMD@CSB.DB: the Golm metabolome database. Bioinformatics 2005, 21, 1635–1638.
(7) Gillet, L. C.; Navarro, P.; Tate, S.; Röst, H.; Selevsek, N.; Reiter, L.; Bonner, R.; Aebersold, R. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol. Cell. Proteomics 2012, 11, O111–016717.




(8) Tabb, D. L.; Thompson, M. R.; Khalsa-Moyers, G.; VerBerkmoes, N. C.; McDonald, W. H. MS2Grouper: group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. J. Am. Soc. Mass Spectrom. 2005, 16, 1250–1261.
(9) Flikka, K.; Meukens, J.; Helsens, K.; Vandekerckhove, J.; Eidhammer, I.; Gevaert, K.; Martens, L. Implementation and application of a versatile clustering tool for tandem mass spectrometry data. Proteomics 2007, 7, 3245–3258.
(10) Frank, A. M.; Monroe, M. E.; Shah, A. R.; Carver, J. J.; Bandeira, N.; Moore, R. J.; Anderson, G. A.; Smith, R. D.; Pevzner, P. A. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 2011, 8, 587–591.
(11) Griss, J.; Foster, J. M.; Hermjakob, H.; Vizcaíno, J. A. PRIDE Cluster: building a consensus of proteomics data. Nat. Methods 2013, 10, 95–96.
(12) Saeed, F.; Hoffert, J. D.; Knepper, M. A. CAMS-RS: clustering algorithm for large-scale mass spectrometry data using restricted search space and intelligent random sampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2014, 11, 128–141.
(13) Powell, D. W.; Weaver, C. M.; Jennings, J. L.; McAfee, K. J.; He, Y.; Weil, P. A.; Link, A. J. Cluster analysis of mass spectrometry data reveals a novel component of SAGA. Mol. Cell. Biol. 2004, 24, 7249–7259.
(14) Xiao, C.-L.; Chen, X.-Z.; Du, Y.-L.; Li, Z.-F.; Wei, L.; Zhang, G.; He, Q.-Y. Dispec: A Novel Peptide Scoring Algorithm Based on Peptide Matching Discriminability. 2013,
(15) Luethy, R.; Kessner, D. E.; Katz, J. E.; MacLean, B.; Grothe, R.; Kani, K.; Faca, V.; Pitteri, S.; Hanash, S.; Agus, D. B.; Mallick, P. Precursor-ion mass re-estimation improves peptide identification on hybrid instruments. J. Proteome Res. 2008, 7, 4031–4039.



(16) Michalski, A.; Cox, J.; Mann, M. More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS. J. Proteome Res. 2011, 10, 1785–1793.
(17) Wu, L.; Candille, S. I.; Choi, Y.; Xie, D.; Jiang, L.; Li-Pook-Than, J.; Tang, H.; Snyder, M. Variation and genetic control of protein abundance in humans. Nature 2013, 499, 79–82.
(18) Moruz, L.; Hoopmann, M. R.; Rosenlund, M.; Granholm, V.; Moritz, R. L.; Käll, L. Mass fingerprinting of complex mixtures: protein inference from high-resolution peptide masses and predicted retention times. J. Proteome Res. 2013, 12, 5730–5741.
(19) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534–2536.
(20) Hoopmann, M. R.; Finney, G. L.; MacCoss, M. J. High-speed data reduction, feature detection, and MS/MS spectrum quality assessment of shotgun proteomics data sets using high-resolution mass spectrometry. Anal. Chem. 2007, 79, 5620–5632.
(21) Hsieh, E. J.; Hoopmann, M. R.; MacLean, B.; MacCoss, M. J. Comparison of database search strategies for high precursor mass accuracy MS/MS data. J. Proteome Res. 2009, 9, 1138–1143.
(22) McIlwain, S.; Tamura, K.; Kertesz-Farkas, A.; Grant, C. E.; Diament, B.; Frewen, B.; Howbert, J. J.; Hoopmann, M. R.; Käll, L.; Eng, J. K.; MacCoss, M. J.; Noble, W. S. Crux: rapid open source protein tandem mass spectrometry analysis. J. Proteome Res. 2014, 13, 4488–4491.
(23) Eng, J. K.; Fischer, B.; Grossmann, J.; MacCoss, M. J. A fast SEQUEST cross correlation algorithm. J. Proteome Res. 2008, 7, 4598–4602.




(24) Fenyö, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75, 768–774.
(25) Kim, S.; Gupta, N.; Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 2008, 7, 3354–3363.
(26) Loewenstein, Y.; Portugaly, E.; Fromer, M.; Linial, M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 2008, 24, i41–i49.
(27) Diament, B. J.; Noble, W. S. Faster SEQUEST searching for peptide identification from tandem mass spectra. J. Proteome Res. 2011, 10, 3871–3879.
(28) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
(29) Houel, S.; Abernathy, R.; Renganathan, K.; Meyer-Arendt, K.; Ahn, N. G.; Old, W. M. Quantifying the impact of chimera MS/MS spectra on peptide identification in large-scale proteomics studies. J. Proteome Res. 2010, 9, 4152–4160.
(30) Enright, A. J.; Van Dongen, S.; Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002, 30, 1575–1584.
(31) Lam, H.; Deutsch, E. W.; Aebersold, R. Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. J. Proteome Res. 2009, 9, 605–610.
(32) Thiel, P.; Sach-Peltason, L.; Ottmann, C.; Kohlbacher, O. Blocked Inverted Indices for Exact Clustering of Large Chemical Spaces. J. Chem. Inf. Model. 2014, 54, 2395–2401.

[Table of Contents graphic. Panel titles: Rarity-based distance metric; Counteracts chimeric linkage. Axis labels: frequency, intensity / frequency, m/z.]

Figure 6: For Table of Contents only

