Exhaustively Identifying Cross-Linked Peptides ... - ACS Publications

Aug 21, 2017 - Roughly speaking, there are two types of score functions in linear/cross-linked ..... know the underlying truth about cross-linked PSMs...
0 downloads 0 Views 1MB Size
Subscriber access provided by Eastern Michigan University | Bruce T. Halle Library

Technical Note

Exhaustively Identifying Cross-Linked Peptides with a Linear Computational Complexity Fengchao Yu, Ning Li, and Weichuan Yu J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00338 • Publication Date (Web): 21 Aug 2017 Downloaded from http://pubs.acs.org on August 23, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Exhaustively Identifying Cross-Linked Peptides with a Linear Computational Complexity Fengchao Yu,† Ning Li,∗,‡,¶ and Weichuan Yu∗,†,¶ †Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China ‡Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong, China ¶Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Hong Kong, China E-mail: [email protected]; [email protected] Phone: +852 2358 7335; +852 2358 7054

Abstract Chemical cross-linking coupled with mass spectrometry is a powerful tool to study protein-protein interactions and protein conformations. Two linked peptides are ionized and fragmented to produce a tandem mass spectrum. In such an experiment, a tandem mass spectrum contains ions from two peptides. The peptide identification problem becomes a peptide-peptide pair identification problem. Currently, most tools don’t search all possible pairs due to the quadratic time complexity. Consequently, missed findings are unavoidable. In our earlier work, we developed a tool named ECL to search all pairs of peptides exhaustively. Unfortunately, it is very slow due to the quadratic computational complexity, especially when the database is large. Furthermore, ECL uses a score function without statistical calibration, while researchers 1–3

1 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 38

have proposed that it is inappropriate to directly compare uncalibrated scores because different spectra have different random score distributions. Here, we propose an advanced version of ECL, named ECL2. It achieves a linear time and space complexity by taking advantage of the additive property of a score function. It can search a data set containing tens of thousands of spectra against a database containing thousands of proteins in a few hours. Comparison with other five state-of-the-art tools shows that ECL2 is much faster than pLink, StavroX, ProteinProspector, and ECL. Kojak is the only one that is faster than ECL2. But Kojak does not exhaustively search all possible peptide pairs. The comparison shows that ECL2 has the highest sensitivity among the state-of-the-art tools. The experiment using a large-scale in vivo cross-linking data set demonstrates that ECL2 is the only tool that can find the peptide-spectrum matches (PSMs) passing the false discovery rate/q-value threshold. The result illustrates that the exhaustive search and a well-calibrated score function are useful to find PSMs from a huge search space.

Keywords Cross-Linked Peptides Identification, Linear Computational Complexity, Exhaustive Database Search

1

Introduction

The power of chemical cross-linking coupled with mass spectrometry (XL-MS) has been well demonstrated in understanding protein structures and protein-protein interactions 4–7 . In XL-MS, we first link proteins with cross-linkers. Then, we quench the reaction and digest the proteins. Finally, we obtain pairs of linked peptides. However, identifying crosslinked peptides from XL-MS data is computationally challenging. The time complexity is quadratic with respect to the number of peptides in the database. Consequently, exhaustively

2 ACS Paragon Plus Environment

Page 3 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

searching all peptide-peptide pairs is time consuming and resource demanding. For example, there are around 3 × 106 peptides in the Homo sapiens (human) database (UniProtKB / Swiss-Prot, 2015-11 release, 20,205 proteins). Supposing the precursor mass tolerance is 10 ppm (parts per million), there will be around 107 peptide-peptide candidate pairs for each experimental spectrum on average. Many methods 8–28 have been developed to identify cross-linked peptides. These methods can be classified into two groups. The first group converts searching peptide-peptide pairs into searching two peptides sequentially with the help of specific cross-linkers. The second group limits the number of peptide-peptide pairs with heuristic pre-filtering procedures. Methods in the first group convert the quadratic time complexity into a linear time complexity by using cross-linkers 29–32 that can be broken during dissociation (e.g. collisioninduced dissociation (CID)). Kaake et al. 32 and Kao et al. 30 proposed to couple such crosslinkers with three levels of mass spectrometry (i.e. MS1, MS2, and MS3). The issue is that generating three levels of mass spectra requires a longer cycle time. Liu et al. 33 and Götze et al. 9 proposed to use cross-linker-cleaved signature peaks to infer the masses of two peptides. This method avoids generating three levels of mass spectrometry. However, the signature peaks may not be observed all the time, resulting in loss of useful data. Furthermore, the cleavable cross-linkers are not as widely used as mass-spectrometry-noncleavable crosslinkers (such as disuccinimidyl suberate (DSS) and bis(sulfosuccinimidyl) suberate (BS3)) in biological experiments 34,35 . Methods in the second group include xQuest/xProphet 36,37 , pLink 24 , ProteinProspector 26 , StavroX 8 , and Kojak 28 . They only use a fixed number of peptides to generate peptidepeptide pairs for each experimental spectrum. For example, xQuest/xProphet first uses the top 5000 peptides for pairing. Then, it filters all peptide-peptide pairs with a pre-score. Finally, it uses the top 50 peptide-peptide pairs for fine scoring. Similarly, pLink, ProteinProspector, and Kojak use the top 500, 1000, and 250 peptides, respectively, to generate peptide-peptide pairs. Such a strategy, however, only searches a fraction of all possible

3 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

peptide-peptide pairs. Let’s take the Homo sapiens (human) database as an example. There are around 107 peptide-peptide pairs for each experimental spectrum. With a rough estimation, pLink, ProteinProspector, and Kojak only search about 1.2%, 5.0%, and 0.3% of all peptide-peptide pairs, respectively. Clearly, the non-exhaustive search strategy cannot guarantee to find a spectrum’s highest scored peptide-peptide pair. The result is highly variable with respect to the database size. Our experiments have shown that there is a significant proportion of results missed with the non-exhaustive search strategy 38 . Also, the sensitivity decreases greatly as the increase of the database size. Petrotchenko and Borchers 39 proposed a fast algorithm to search peptide-peptide pairs. Its time complexity is quadratic and has already been widely used in tools including xQuest, Kojak, and ECL. In XL-MS, using a noncleavable amine-reactive cross-linker (such as DSS and BS3) to link two proteins is a common protocol 24,36,37,40,41 . In order to analyze XL-MS data using a noncleavable cross-linker, we developed a tool named ECL 38 that can exhaustively search a database in dozens of hours. However, its running time increases quadratically with respect to the database size. Comparison with methods in the second group shows that ECL is faster than xQuest, pLink, and StavroX, but is slower than ProteinProspector 26 and Kojak 28 when the database is large. There is a high possibility that a sample contains a large number of proteins. Such a phenomenon is very common in in vivo experiments. Take the data set published by Zhu et al. 7 for example, there are in total 15976 proteins that possibly exist in the sample. This means that the database used for cross-linked peptides identification must be large. Unfortunately, those tools that can handle a large database rely on a non-exhaustive search strategy, which results in a significant proportion of missed findings 38 . In this paper, we propose a tool named ECL2 that can search all peptide-peptide pairs with a linear time and space complexity. ECL2 has several advantages over ECL: Achieving a linear time complexity; using XCorr 42,43 as the score function, which is more robust to noise peaks than the correlation coefficient used in ECL; and using a linear-tail-fit method 24,44,45 to estimate the e-value for each peptide-spectrum match (PSM), which statistically calibrates

4 ACS Paragon Plus Environment

Page 4 of 38

Page 5 of 38

the original score function. ECL2 can analyze a typical liquid chromatography-tandem mass spectrometry (LC-MS/MS) data set using a large-scale database containing thousands of proteins in a few hours. To our knowledge, no exhaustive search tool can achieve such a speed. ECL2 also identifies the largest number of PSMs compared to pLink, StavroX, ProteinProspector, Kojak, and ECL with the same false discovery rate (FDR)/q-value threshold. The rest of the paper is organized as follows: Section 2 describes the algorithm of ECL2. Section 3 demonstrates the performance of ECL2 with real data sets. Section 4 concludes the paper with some discussions.

2

Method

By convention 11,24 , two linked peptides are called α chain and β chain, respectively. After dissociation (e.g. CID), there are fragmented ions from two peptide chains. Figure 1 illustrates the ions with marks: the green marks indicate linear ions that only contain one chain’s ions; and the red marks indicate cross-linking ions that contain one chain’s ions plus a modification containing the cross-linker and the other whole chain. b1

b2 b3

b4

b5

b6

b9

b8

b7

L L L L L L LL L

LDRKEIPLAK LLLLLLL L y9

y8

y7

b1

b2

y6

b3

y5

y4

b5

b4

y3

b6

y2

b7

y1

b8

b9

LLLLLLLLL

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

L LL LL LL L LFLLDTNKSR y9

y8

y7

y6

y5

y4

y3

y2

y1

Figure 1: An illustration of cross-linked peptides. The green marks indicate linear ions and the red marks indicate cross-linking ions. We also label the ion types and indexes in both chains. Given an experimental spectrum, the objective of cross-linked peptides identification can 5 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 38

be expressed as

max t∈S

s.t.

s(e, t), |m(e) − m(t)| ≤ τ1 ,

(1)

where S is a set containing all theoretical spectra of peptide-peptide pairs, s(e, t) is a score function, e is an experimental spectrum, t is a theoretical spectrum, m(•) is the precursor mass of a spectrum, and τ1 is the precursor mass tolerance. We need to pair two peptide chains to generate the corresponding theoretical spectra, which results in a quadratic time complexity. With an additive score function 2,46 , the score corresponding to a peptide-peptide pair equals the sum of two scores corresponding to two peptide chains. Other researchers have also made such an observation 18,22,28 , but they did not use it to reduce the time complexity. Here, we propose a new algorithm that achieves a linear time complexity using any score function with an additive property.

2.1

Additive Score Function

Given a spectrum, we can digitize the whole m/z range into bins based on the MS2 m/z tolerance:



⌋ m i= −o , τ2

(2)

where i is the index of the digitized bin, m is an m/z value, τ2 is the MS2 m/z tolerance, o is an offset, and ⌊x⌋ is the largest integer value smaller than or equal to x. For the i-th bin, the corresponding intensity can be obtained as ∑

(i+1+o)×τ2

vi =

pm ,

m=(i+o)×τ2

6 ACS Paragon Plus Environment

(3)

Page 7 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

where vi is the i-th value in the digitized vector and pm is the peak intensity whose m/z value is m. If there is no peak at the location m, pm = 0. Then, we have the following definition: Definition 1 Given a digitized experimental spectrum e and a digitized theoretical spectrum t, an additive score function reads

s(e, t) =

∑∑ j

f (gi (e), tj ),

(4)

i

where gi (e) is a measure of the i-th bin in the experimental spectrum, tj is the j-th value in t, and f (gi (e), tj ) is a scoring term. Roughly speaking, there are two types of score functions in linear/cross-linked peptides identification tools 22,24,26,28,36,37,42,45,47–50 : 1. Dot-product-based score functions, such as dot product 26,48,49 , intensity summation 36,37 , XCorr 22,28,36,37,42,45 , and kernel spectral dot product (KSDP) 24,47 . 2. Probability based score functions, such as match-odds 36,37 and log-odds function 50 . Dot product, intensity summation, and XCorr are additive score functions. For example, XCorr can be expressed as

XCorr(e, t) =

∑ i

75 1 ∑ ∑ ei × ti − ei+δ × ti 150 i δ=−75,δ̸=0

75 ∑ ∑ 1 = (ei − ei+δ ) × ti 150 i δ=−75,δ̸=0 ∑ = gi (e) × ti ,

(5)

i

where δ is an m/z offset and gi (e) = ei −

1 150

∑75

δ=−75,δ̸=0 ei+δ .

Here, we assume that there

are no overlapping peaks in the theoretical spectrum. This assumption may not be true in

7 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 38

some cases. However, such cases are quite rare due to the high-resolution property of mass spectrometers (e.g. Thermo Scientific Q-Exactive and LTQ Orbitrap Elite). According to Figure 1, theoretical peaks are from four sources: linear ions and crosslinking ions from the α chain; linear ions and cross-linking ions from the β chain. Correspondingly, a score of two linked peptide chains can be expressed as

s(e, t) =

∑∑ j

=

i

∑∑ jα

f (gi (e), tj ) f (gi (e), tjα ) +

i

∑∑ jβ

f (gi (e), tjβ )

i

= s(e, tα ) + s(e, tβ ),

(6)

where jα is a bin index corresponding to the peak from the α chain, jβ is a bin index corresponding to the peak from the β chain, tα is a digitized theoretical spectrum containing peaks from the α chain only, and tβ is a digitized theoretical spectrum containing peaks from the β chain only. Let’s call s(e, tα ) and s(e, tβ ) chain scores for convenience. With Equation (6), Equation (1) can be expressed as

max t∈S

s.t.

s(e, tα ) + s(e, tβ ), |m(e) − m(tα ) − m(tβ ) − mx | ≤ τ1 ,

(7)

where mx is the mass of the cross-linker.

2.2

Searching Cross-Linked Peptides

Equation (7) implies that, given an experimental spectrum, we can first calculate all chain scores separately. Let’s take Figure 1 as an example to demonstrate the calculation of the chain score for the toy example sequence “DRKEIPLAK”. In this example, we only consider single charged b/y-ions without post-translational modification for simplicity. We first calculate b-ion masses and y-ion masses without considering the cross-linker and the 8 ACS Paragon Plus Environment

Page 9 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

other chain. The indexes of the b/y-ions can be found in Figure 1. The first b-ion’s mass equals a proton’s mass plus the residue mass of “D”, the second b-ion’s mass equals the first b-ion’s mass plus the residue mass of “R”, the third b-ion’s mass equals the second b-ion’s mass plus the residue mass of “K”, and so on. The first y-ion’s mass equals the residue mass of “K” plus the mass of “H2 O”, the second y-ion’s mass equals the first y-ion’s mass plus the residue mass of “A”, the third y-ion’s mass equals the second y-ion’s mass plus the residue mass of “L”, and so on. After getting all b/y-ions’ masses, we use them as input to calculate the masses considering the cross-linker and the other chain. Since the link-site is “K”, the first two b-ions stay unchanged. Starting from the third b-ion, the masses equal the original b-ions’ masses plus the cross-linker’s mass and the other chain’s mass. However, in this step, we do not know what the other chain is. We can estimate the total mass of the cross-linker and the other chain using me − mc , where me is the spectrum’s precursor mass and mc is current chain’s mass. Normally, this estimation is precise enough because the precursor mass precision is much higher than the tandem mass precision. The calculation of the y-ions’ masses considering the cross-linker and the other chain is similar. After getting all b/y-ions’ masses, we fill in a Boolean vector based on these masses, MS2 m/z tolerance, and digitization offset (Equation (2)). Then, we calculate the chain score using the Boolean vector and the experimental spectrum. Please refer to Algorithm 1 for details. The ion mass calculation shows that the link-site’s index determines certain ions’ masses, which further determines the chain score. For different link-sites, we need to calculate the chain scores separately and pick the one with the highest score. After getting all chain scores, we add pairs of chain scores. The time complexity of adding all possible pairs is quadratic with respect to the number of chain scores. Here, we propose a digitization-based algorithm to achieve a linear time complexity. We first describe the procedure of searching cross-linked peptides given an experimental spectrum. Then, we analyze this procedure’s time and space complexity.

9 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 38

Algorithm 1 Calculating chain scores. Without loss of generality, we use ions fragmented from CID and don’t consider neutral loss ions. This can be easily applied to other dissociation methods by changing a b/y-ion to a/x-ion or c/z-ion. When we consider neutral loss ions, the computational complexity does not change. b is a vector of b-ion masses from the peptide chain, and y is a vector of y-ion masses from the peptide chain. We assume that the mass difference of the ions is larger than the MS2 m/z tolerance. xb is the link-site index corresponding to the b-ions, xy is the link-site index corresponding to the y-ions, mc is the mass of the peptide chain, e is the digitized experimental spectrum, me is the mass of the experimental spectrum, τ2 is the MS2 m/z tolerance, and o is the offset in digitization. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28:

procedure ChainScore(b, y, xb , xy , mc , e, me , τ2 , o) c ← vector[len(e)] s←0 for i ← 1, len(b) do ◃ fill bins corresponding to b-ions if i < xb then c[⌊bi /τ2 + o⌋] ← 1 ◃ ⌊bi /τ2 + o⌋ is a function taking the largest integer smaller than or equal to bi /τ2 + o else c[⌊(bi + me − mc )/τ2 + o⌋] ← 1 end if end for for i ← 1, len(y) do if i < xy then c[⌊yi /τ2 + o⌋] ← 1 else c[⌊(yi + me − mc )/τ2 + o⌋] ← 1 end if end for for j ← 1, len(c) do for i ← 1, len(e) do s ← s + f (gi (e), c[j]) end for end for

◃ fill bins corresponding to y-ions

◃ calculate the chain score ◃ f (gi (e), c[j]) is the score function

return s end procedure

Given a database, we first in silico digest all proteins into n peptide chains. All the peptide chains are sorted based on their masses. Then, we split the whole mass range into

10 ACS Paragon Plus Environment

Page 11 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

multiple intervals. The width of the intervals w is much smaller than the precursor mass tolerance τ1 . Here, we set the width w to 0.001 Da. The number of intervals is equal to

b=

mmax − mmin + 1, w

(8)

where b is the number of intervals, mmax is the maximal mass of the peptide chains, and mmin is the minimal mass of the peptide chains. All these values are pre-fixed before the database search. Finally, we assign the peptide chains into different intervals based on their masses. Given an experimental spectrum, we calculate the chain scores with respect to all possible peptide chains using Algorithm 1 and assign them to the corresponding intervals. According to Section 2.1, the highest score must come from one of the following situations: • Two peptide chains are from different intervals: The highest score is equal to the sum of the two top chain scores in two different intervals. • Two peptide chains are from the same interval: The highest score is equal to two times the top chain score in the interval. Thus, we only need to keep the top-scored peptide chain and the chain score in each interval. Given a peptide chain in the i-th interval, another peptide chain must be in the interval j satisfying |(i + j) × w − m(e) + mx | ∈ [−τ1 , τ1 ].

(9)

Here, we ignore the rounding error because w (i.e. 0.001 Da) is much smaller than τ1 (e.g. 0.01 Da) in practice. Algorithm 2 shows the pseudo code for identifying cross-linked peptides given a set of peptide chains and an experimental spectrum.

11 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 38

Algorithm 2 Identifying cross-linked peptides with a linear time and space complexity. Without loss of generality, we use ions fragmented from CID and don’t consider neutral loss ions. {bi } is a set of b-ion mass vectors from all peptide chains, {yi } is a set of y-ion mass vectors from all peptide chains, {xbi } is a set of link-site indexes corresponding to the b-ions, {xyi } is a set of link-site indexes corresponding to the y-ions, {mci } is a set of peptide chain masses, e is a digitized experimental spectrum, me is the mass of the experimental spectrum, mx is the mass of the cross-linker, τ1 is the precursor mass tolerance, τ2 is the MS2 m/z tolerance, o is the offset in digitization, and w is the mass interval in splitting the whole mass range. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25:

procedure Search({bi }, {yi }, {xbi }, {xyi }, {mci }, e, me , mx , τ1 , τ2 , o, w) ◃ i ∈ [1, n] sc ← vector[⌈max({mCi })/τ1 ⌉] s←0 ◃ s is the final score c1 ← −1 ◃ c1 is the index of the first peptide chain in the final result c2 ← −1 ◃ c2 is the index of the second peptide chain in the final result for i ← 1, |{mci }| do ◃ calculate chain scores and assign them to ranges s ← ChainScore(bi , yi , xbi , xyi , mci , e, me , τ2 , o) if sc [⌊mci /τ1 ⌋] < s then sc [⌊mci /τ1 ⌋] ← s end if end for for i ← 1, ⌈max({mci })/w⌉/2 do ◃ pair peptide pairs for j ← (me − mx − mci − τ1 )/w − i, (me − mx − mci + τ1 )/w − i do if sc [i] + sc [j] > s then s ← sc [i] + sc [j] c1 ← i c2 ← j end if end for end for return s, c1 , c2 end procedure

In the following, we analyze the time and space complexity of Algorithm 2. The time complexity of mass range splitting and peptide chains assignment is

O(n).

(10)

Without loss of generality, we suppose that the time and space complexity of calculating a 12 ACS Paragon Plus Environment

Page 13 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

chain score are independent of the number of peptides. The time complexity of calculating all chain scores and assigning them to intervals is

O(n).

(11)

The time complexity of finding pairs of peptide chains, summing chain scores, and keeping the highest-scored pair is

( O=

mmax − mmin τ1 · w w

) .

(12)

Combining Equation (10), (11), and (12), we obtain the total time complexity: (

(mmax − mmin )τ1 O n+ w2

) = O(n).

(13)

Because mmax , mmin , τ1 , and w are pre-fixed and independent of the database size, the total time complexity is linear with respect to the number of peptide chains. It is easy to see that the space complexity is also linear: O(n + (mmax − mmin )/w) = O(n). In summary, the linear time complexity is achieved with the following factors: 1) Taking advantage of an additive score function: A final score can be split into two chain scores. 2) Digitizing the whole mass range and assigning peptide chains to digitized intervals: With such a digitization, only one chain score is kept for each interval. This reduces the time complexity greatly. 3) Achieving a constant time complexity in summing chain scores by fixing the number of digitized intervals. Thus, the total time complexity reduces to O(n).

13 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

2.3

The Work-Flow of ECL2

The major contribution of this paper is proposing a new algorithm of linear computational complexity to exhaustively search all peptide-peptide pairs in a cross-linked peptides identification task. In Yu et al. 38 , the authors used the normalized cross correlation coefficient as the score function. The normalized cross correlation coefficient has a normalization term in the denominator. As noisy peaks in the experimental spectrum may affect the denominator a lot, this measure is not robust to noise. Furthermore, the normalized cross correlation coefficient doesn’t have the additive property. In contrast, XCorr can be expressed as a dot product of a transformed experimental spectrum and a theoretical spectrum (Equation (5)). The peaks in the transformed experimental spectrum will not contribute to the score if there is no corresponding peak in the theoretical spectrum. Furthermore, in the high resolution setting, 150 bins offset makes a small effect because peaks in the digitized vector are quite sparse. Thus, XCorr is more robust to the noise than the normalized cross correlation coefficient. With these reasons, we choose XCorr as the score function in ECL2. Figure 2 shows the work-flow of ECL2. It takes a data file and a protein database file as inputs. After digitizing the spectra, digesting the protein sequences, and in silico fragmenting the peptide sequences, it calculates chain scores using Algorithm 1. ECL2 only keeps the chains with the chain scores larger than or equal to zero, which is the threshold used by Kojak. Since XCorr may produce negative scores if there are poor similarities, using zero as the threshold means that we only consider chains contributing positives to the scores. This is also the logic used by ProteinProspector 26 . After calculating chain scores, ECL2 pairs peptide chains and finds the highest-scored pair using Algorithm 2. Keich et al. 1,51 showed that different PSMs might have different XCorr score distributions. It is inappropriate to directly compare different PSMs’ XCorr scores. Statistical correction is needed before estimating FDR. Thus, ECL2 uses the linear tail-fit method 24,44,45 to estimate an e-value for each PSM. Since ECL2 has already recorded chain scores for each experimental spectrum, ECL2 uses them to generate random XCorr scores corresponding 14 ACS Paragon Plus Environment

Page 14 of 38

Page 15 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

to random matches. ECL2 calculates random XCorr scores by adding any two chain scores whose summed mass is in the precursor tolerance. The maximum number of such random XCorr scores is 15000, which is the number used by pLink 24 . If there is not enough random XCorr scores within the precursor tolerance, ECL2 will relax the precursor tolerance by 1 Da and generate more random XCorr scores. ECL2 keeps relaxing until there are 15000 random XCorr scores or the precursor tolerance reaches 20 Da. 20 Da is the value used by Crux 22 . After generating random XCorr scores, ECL2 generates a survival curve and converts its y-axis to loge scale. ECL2 picks the tail segment using the approach in Comet 45 . Then, ECL2 fits a linear equation y = a × x + b with the picked segment, where x is the x-axis coordinate, a and b are parameters to be estimated. Finally, the e-value equals c × ea×h+b , d

(14)

where c is the number of candidates during the exhaustive search, d is the number of random XCorr scores, and h is the x-axis coordinate corresponding to matched peptide-peptide pair. After estimating e-values for all PSMs, ECL2 estimates the FDR with all PSMs’ evalues. There are three kinds of PSMs: The first contains two peptide chains from the target database, the second contains two peptide chains from the decoy database, and the third contains one peptide chain from the target database and another peptide chain from the decoy database. Thus, we can estimate the FDR using 24,37 [

] #{false positive} E[#{false positive}] f (s) − d(s) F DR(s) = E = = , #{positive} E[#{positive}] max(t(s), 1)

(15)

where s is an e-value threshold, #{false positive} is the number of false positives with evalues equal to or smaller than s, #{positive} is the number of positives with e-values equal to or smaller than s, E[•] is the expectation, t(s) is the number of the first kind of PSMs with e-values equal to or smaller than s, d(s) is the number of the second kind of PSMs with e-values equal to or smaller than s, and f (s) is the number of the third kind of PSMs with 15 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 38

e-values equal to or smaller than s. In order to make the sensitivity as high as possible, the FDR is converted into q-values in proteomics tools, such as Crux 22 , Comet 45 , Percolator 52 , pLink 24 , xProphet 37 , or ECL 38 . In ECL2, we also do such a conversion using 53

q(t) = min F DR(s),

(16)

s≥t

where t is a threshold. Spectra

Spectrum 1 Spectrum 1 . . .

Top-Scored Peptide Chain in Inverval1 . . . Top-Scored Peptide Chain in Interval k1

Digitization . . .

Calculate chain scores

Spectrum l

Top-Scored Peptide Chain in Interval 1 . . .

Spectrum l

Top-Scored Peptide Chain in Interval k2 Peptide 1 . . .

In silico fragmentation

For each spectrum, pair and calculate final scores

Peptide n Top-Scored Peptide-Peptide pairs

Spectrum 1 in silico digest

. . .

Protein Database

Estimate e-value and q-value

Output

Top-Scored Peptide-Peptide pairs

Spectrum l

Figure 2: The work-flow of ECL2.

We also develop a tool, named ECLViewer2, to facilitate viewing the results. Please refer to the Supplementary Document for the details of using ECL2 and ECLViewer2.

3

Experiments

We perform two sets of experiments to show the power of ECL2. The first set contains 10 data files from Makowski et al. 54 . The purpose is to demonstrate the effect of database size and

16 ACS Paragon Plus Environment

Page 17 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

the performance of ECL2 under different settings. The authors published 19 confident crosslinked proteins. We use them to generate databases with six different sizes. We use pLink, StavroX, ProteinProspector, Kojak, and ECL as benchmarks. The second set contains 30 data files from Zhu et al. 7 . The data files are from in vivo cross-linked proteins of Arabidopsis thaliana. We use the whole Arabidopsis thaliana database (TAIR10) 55 , which contains more than 3.5 × 104 proteins.

3.1

Identifying Peptides from In Vivo Cross-Linked Homo Sapiens Proteins

In Makowski et al. 54 , there are 10 data files from the cross-linking of human proteins, containing about 3×105 MS2 spectra in total. Please refer to Makowski et al. 54 for the details of the sample preparation and data acquisition. The authors reported 19 cross-linked proteins with a high confidence. In real applications, the database usually contains more proteins that do not exist in the sample. In the following, we combine different numbers of additional nonexistent proteins with the 19 proteins to generate six databases. These nonexistent proteins are generated by reversing protein sequences from Arabidopsis thaliana UniProt/SwissProt database. • The first database contains the 19 proteins plus 50 nonexistent proteins. • The second database contains all proteins in the first database plus another 150 nonexistent proteins. • The third database contains all proteins in the second database plus another 800 nonexistent proteins. • The fourth database contains all proteins in the third database plus another 4000 nonexistent proteins.

17 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

• The fifth database contains all proteins in the fourth database plus another 5000 nonexistent proteins. • The last database contains all proteins in the fifth database plus another 5000 nonexistent proteins. Without considering decoy sequences, we have six databases whose protein numbers are 69, 219, 1019, 5019, 10019, and 15019, respectively. We use StavroX (Version 3.6.0), pLink (Version 1.23), ProteinProspector (Version 5.17.1), Kojak (Version 1.5.4), ECL (Version 1.1.1), and ECL2 (Version 2.1.4) to search these data files against the six databases, respectively. Kojak allows users to control the number of peptide chains for each spectrum by modifying the “top_count” value. We run it with the value equal to 250 (default value in Hoopmann et al. 28 ) and 999999999, respectively. The purpose of the second setting is to study the difference between exhaustive and nonexhaustive search. ECL2 outputs peptide chains’ ranks and scores in the development mode. We also use such information to study the difference between exhaustive and non-exhaustive search. All tools’ precursor mass tolerance is 10 ppm, and the MS2 m/z tolerance is 0.01 Da. The allowed maximum missed cleavage is two. The allowed precursor masses are from 1000 Da to 12000 Da, the allowed peptide chain lengths are from 5 amino acids to 50 amino acids, and the allowed precursor charges are from 3 to 7. Some tools don’t have the parameter to constrain the precursor mass range, the peptide chain length, or the precursor charge during the search. For a fair comparison, we only consider the PSMs satisfying the above conditions although there may be more PSMs in the results. We set carbamidomethylation on “C” as the fixed modification. We don’t set any variable modification. All tools use the targetdecoy strategy 24,37,56 to estimate the FDR and q-value. The decoy sequences are generated by reversing the target sequences with the C-terminal unchanged. We use q-value ≤ 0.05 as the threshold for these tools. StavroX and pLink provide q-values for their own results. We use Percolator 52 to estimate q-values for Kojak’s results, as advised by the authors 57 . Percolator is a tool to re-score PSMs and estimate their q-values. The re-scoring step uses a 18 ACS Paragon Plus Environment

Page 18 of 38

Page 19 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

machine learning technique to combine multiple sub-scores into a final score for each PSM. Then, Percolator estimates q-values based on the final scores. Although the approach is different from what the researchers used in cross-linked peptide identification, we still use Percolator for Kojak’s results since it is the authors’ recommendation. ProteinProspector doesn’t provide a q-value in the result. Thus, we estimate it using Equation (15) and Equation (16). ECL and ECL2 report q-values by themselves. With the cut-off results, the PSMs from the cross-linking of the 19 proteins are treated as true positive PSMs and the PSMs containing at least one peptide chain from the nonexistent proteins are treated as false positive PSMs. This is just an approximation since we don’t know the underlying truth about cross-linked PSMs. All the parameter files and the database files can be found in the Supplementary File S-4 and S-1, respectively.

3.1.1

Identification Results

We summarize the true positive PSMs and false positive PSMs identified by the tools. We find that StavroX and Kojak may report multiple peptide-peptide pairs for one experimental spectrum. For a fair comparison, we only count each of these spectra once. Since we run Kojak with two “top_count” values (i.e. 250 and 999999999), we refer the first one as “Kojak” and the second one as “Kojak-ext”. Figure 3 shows the bar plots of the true positive and the false positive PSMs. StavroX cannot handle the second to sixth databases, pLink cannot handle the fourth to sixth databases, ProteinProspector cannot handle the fifth and sixth databases, Kojak-ext needs a very long time to search the third to sixth databases, and ECL also needs a very long time to search the fifth and sixth databases. Thus, the corresponding bars are left blank. The blue bars denote the true positive PSMs, and the orange bars denote the false positive PSMs. The value in the middle of each blue bar is the number of corresponding true positive PSMs, and the value at the top of each orange bar is the number of corresponding false positive PSMs. All these tools’ raw results can be found in the Supplementary File S-5.

19 ACS Paragon Plus Environment

Journal of Proteome Research

Identified PSMs 3500 21

135 96

105

89

91

69 Proteins

219 Proteins

1992

121

1787

1174

889

1595

892

877

5019 Proteins

True PSMs

10019 Proteins

ECL

ECL2

Kojak-ext

PP

Kojak

pLink

StavroX

ECL

ECL

Kojak-ext

PP

pLink

StavroX

ECL

ECL2

Kojak-ext

1019 Proteins

Kojak

483

428

PP

StavroX

ECL

ECL2

pLink

697

41

1615

1555

381

315

ECL2

82

Kojak-ext

PP

pLink

StavroX

ECL

ECL2

Kojak-ext

pLink

45

601

434

0

1871

73

52 59

2420

Kojak

2072

2064

PP

500

2762

2564

2448

7

1047

Kojak

1000

2632

37 2417

2888

2824

Kojak

1500

59

92

134

ECL2

2000

94

Kojak-ext

16

PP

159

Kojak

2500

pLink

67 71 71

StavroX

3000

StavroX

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 38

15019 Proteins

False PSMs

Figure 3: Bar plots showing identified true and false positive PSMs. The six bar plots correspond to the results of searching the six databases. Without considering decoy proteins, there are 69, 219, 1019, 5019, 10019, and 15019 proteins in the six databases, respectively. The blue bars denote true positive PSMs and the orange bars denote false positive PSMs. The value in the middle of each blue bar is the number of corresponding true positive PSMs and the value at the top of each orange bar is the number of corresponding false positive PSMs. “PP” stands for ProteinProspector. StavroX cannot handle the second to sixth databases, pLink cannot handle the fourth to sixth databases, ProteinProspector cannot handle the fifth and sixth databases, Kojak-ext needs a very long time to search the third to sixth databases, and ECL also needs a very long time to search the fifth and sixth databases. Thus, the corresponding bars are left blank. Figure 3 shows that ECL2 always identifies the largest number of PSMs among all tools. Comparing ECL2 with ECL, the major reason of different PSM numbers lies in the score function. As we mentioned in Section 2.3, ECL2 uses a more robust score function. Another advantage of ECL2 over ECL is that ECL2 can search a large database (e.g. containing more than 10000 proteins) within a reasonable period of time. This benefits from the linear time complexity algorithm in ECL2. In Section 3.1.2, we will further show the advantage of linear time complexity. Figure 3 also shows that ECL2 identifies more PSMs than Kojak. There are two major differences between Kojak and ECL2. The first one lies in the search strategy: non-exhaustive search versus exhaustive search. In order to see the outcome of this difference, we need to compare Kojak-ext with Kojak. The only difference between Kojak-ext and Kojak is the “top_count” value (999999999 v.s. 250), which is actually the difference of exhaustive

20 ACS Paragon Plus Environment

Page 21 of 38

search and non-exhaustive search. The results imply that exhaustive search may identify more PSMs than non-exhaustive search, which also applies to ECL2 and Kojak. The second difference lies in the different score functions and metrics used by Kojak 28,57 and ECL2. We also analyze the results of ECL2 to see how many PSMs would be missed if we limited the number of peptide chains for each spectrum. Figure 4 shows the trend with respect to different database sizes. We set three peptide chain rank thresholds: top 250 (proposed by Kojak), top 500 (proposed by pLink), and top 1000 (proposed by ProteinProspector). All these analyses show the advantage of exhaustive search, which implies the advantage of ECL2. In Section 3.1.2, we will show that it is unrealistic to let Kojak to exhaustively search all peptide-peptide pairs when the database is large. Thus, ECL2 is the only tool that can finish such a task.

Trend of Missed Findings Top 250

Percentage

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Top 500

Top 1000

50% 45% 40% 35% 30% 25% 20% 15% 10% 5% 0% 0

2000

4000

6000

8000

10000 12000 14000 16000

Protein Number in a Database Figure 4: The trend of missed findings with different peptide chain rank thresholds and database sizes. Three chain rank thresholds are used: top 250, top 500, and top 1000. Six different database sizes are used. Without considering decoy proteins, there are 69, 219, 1019, 5019, 10019, and 15019 proteins in these databases, respectively. Those PSMs who have at least one peptide chain whose rank is larger than the threshold are treated as missed findings.

It is interesting to see that as the database size increases, the number of identified PSMs decreases for all tools. One reason is that the chance of random matching increases as the 21 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 38

database size increases. In order to maintain the same FDR/q-value, a higher score threshold is required. Thus, the number of PSMs passing the corresponding score threshold decreases. All tools suffer from this issue. For each intra cross-linked PSM identified by ECl2, we calculate the distance of its two link-sites based on the Protein Data Bank (PDB) 58 . 12 of those 19 proteins have the structure information in the PDB (Supplementary Table S-1). Table 1 lists the summarized result. It shows that most PSMs have a distance less than or equal to 30 Å 24,37,38 . The detailed lists can be found in the Supplementary File S-3. Table 1: A table showing the number of intra cross-linked PSMs whose distances are less than or equal to 30 Å, the total number of checked intra cross-linked PSMs, and the ratios. 69 Proteins 219 Proteins 1019 Proteins 5019 Proteins 10019 Proteins 15019 Proteins

3.1.2

≤ 30 Å 344 354 280 220 176 156

Total 425 423 328 262 212 180

Ratio 0.81 0.84 0.85 0.84 0.83 0.87

Running Time

In this section, we calculate each tool’s average running time with respect to different database sizes. StavroX, pLink, Kojak, Kojak-ext, ECL, and ECL2 are run on a PC with an Intel Core i7-2600 CPU (3.40 GHz, 8 cores) and 32 GB memory. StavroX and ECL don’t support multi-thread computing, while Kojak, Kojak-ext, and ECL2 do support multi-thread computing. Thus, Kojak, Kojak-ext, and ECL2 are run with 8 cores. Although pLink supports multi-thread computing, it always crashes due to a “bad_alloc” error. Thus, we run it with single core. With the default setting, pLink works in the exhaustive mode when the database is small and in the open mode containing pre-filtering procedures when the database is large. We run it in the open mode for all the experiments for fair comparison. ProteinProspector is run on the authors’ web server. Table 2 shows the average running 22 ACS Paragon Plus Environment

Page 23 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

time in hours. Because we cannot control the computational resources assigned to ProteinProspector jobs, the running time has a fluctuation with respect to the database size. Table 2 shows that ECL2 is much faster than StavroX, pLink, ProteinProspector, Kojak-ext, and ECL. But it is slower than Kojak. We explained earlier that Kojak reduces the computational burden with simplifications. Table 2 also shows that Kojak is very slow if it does not limit the number of peptide chains. Enlarging the “top_count” value is unrealistic for Kojak when the database is large. Thus, ECL2 is the only tool that can exhaustively identify cross-linked peptides within a reasonable period of time when the database is large. Table 2: The average running time of the tools with respect to different database sizes. The unit is hours. StavroX cannot handle the second to sixth databases, pLink cannot handle the fourth to sixth databases, ProteinProspector cannot handle the fifth and sixth databases, Kojak-ext needs a very long time to search the third to sixth databases, and ECL also needs a very long time to search the fifth and sixth databases. Thus, the corresponding cells are marked with “NA”. “PP” stands for ProteinProspector. Because ProteinProspector runs on the authors’ web server and we cannot control the computational resources assigned to our jobs, the running time has a fluctuation with respect to the database size.

StavroX pLink PP Kojak Kojak-ext ECL ECL2

69 5.11 3.85 0.74 0.05 0.51 0.10 0.04

Target Protein Number in the 219 1019 5019 NA NA NA 10.40 33.62 NA 1.52 1.32 7.77 0.06 0.12 0.44 20.01 NA NA 0.61 5.35 85.44 0.08 0.33 1.85

Database 10019 15019 NA NA NA NA NA NA 0.80 1.15 NA NA NA NA 4.46 12.48

In order to show that ECL2 does have a linear time complexity, we plot the average running time with respect to the numbers of peptide chains (including decoy sequences) in Figure 5. For comparison, we also plot the average running time of ECL (Version 1.1.1) which has a quadratic time complexity. The figure clearly shows the advantage of linear complexity.

23 ACS Paragon Plus Environment

Journal of Proteome Research

Average Running Time 90 ECL

80

Running Time (hours)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 38

y = 258.7x2 + 7.104x - 0.0305 R² = 1

70 60

ECL2

50

40 30

y = 6.5517x - 0.7588 R² = 0.9052

20 10 0 -10 0

0.5

1

1.5

2

Peptide Chain Number (millions)

Figure 5: A plot showing the average running time with respect to different numbers of peptide chains. Decoy sequences are included. The x-axis is the number of peptide chains (the unit is millions) and the y-axis is the average running time (the unit is hours). The orange crosses are the observed running time of ECL, and the orange dashed line is the fitted quadratic line. The blue dots are the observed running time of ECL2 and the blue solid line is the fitted linear line. It also shows two lines’ equations and R2 values.

3.2

Identifying Peptides from In Vivo Cross-Linked Arabidopsis Thaliana Proteins

In this section, we use a large-scale cross-linking data set to demonstrate the power and necessity of ECL2. 30 data files from in vivo cross-linked Arabidopsis thaliana proteins are collected in Zhu et al. 7 . There are around 6 × 105 MS2 spectra in total. Please refer to Zhu et al. 7 for the details of the sample preparation and data acquisition. Since the data is from in vivo cross-linking and we don’t have the ground truth, we use the whole proteome of Arabidopsis thaliana as the database (downloaded from the Arabidopsis information resource 55 ). There are more than 3.5 × 104 proteins. Considering oxidation as the variable modification, there are more than 2 × 106 peptides, which results in 2 × 1012 peptide-peptide pairs. We try StavroX (Version 3.6.0), pLink (Version 1.23), ProteinProspector (Version 5.17.1), Kojak (Version 1.5.4), ECL (Version 1.1.1), and ECL2 (Version 2.1.4) to identify this data 24 ACS Paragon Plus Environment

Page 25 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

set. Because the data is generated by a Thermo LTQ Orbitrap XL mass spectrometer, the precursor mass tolerance is 10 ppm and the MS2 m/z tolerance is 0.5 Da. The allowed precursor charges are from 2 to 7. We set the allowed maximum missed cleavage to one, carbamidomethylation on “C” as the fixed modification, and oxidation on “M” as the variable modification. The rest of the parameters are the same as those in the earlier experiment. The detailed parameter files can be found in the Supplementary File S-4. Because the database is too large for a PC, we run the tasks on a server with two Intel Xeon E5-2670 v3 CPUs (24 cores in total) and 128 GB memory. Unfortunately, only Kojak and ECL2 can finish the task within a reasonable period of time. For each data file, Kojak needs about one hour and ECL2 needs about two hours. We observe that with q-value ≤ 0.05 as the threshold, all Kojak’s PSMs contain peptides from target and decoy databases simultaneously. This is because both databases have the same peptide sequence. Since we generate the decoy database by reversing the sequence, such a phenomenon is difficult to avoid. In order to avoid such an ambiguity, those peptides should not be considered during the search. The only related parameter is “min_peptide_mass”. We set it to 445 to include all peptides with length 5 (the sequence with the smallest mass is “GKGGK” if one link-site “K” is needed). Unfortunately, this parameter is not enough. Without those PSMs, Kojak identifies nothing. ECL2 identifies 137 PSMs, including the one identified by ECL in Zhu et al. 7 . To take a closer look, we summarize the ranks of the peptide chains corresponding to each spectrum. It turns out that 95 out of 137 PSMs have at least one peptide chain whose rank is larger than 250. Since Kojak uses at most 250 peptide chains to generate peptide-peptide pairs for each spectrum, those PSMs will be missed. We also note that 60 out of 137 PSMs have at least one peptide chain whose rank is larger than 1000. This means that even if we set the peptide chain number to that of ProteinProspector, there will still be significant missed findings. We don’t check the link-site distances from the intra cross-linked PSMs because there is only one protein whose accession number is AT1G08060.1 having the structure information in the PDB. The detailed results can be found in the Supplementary

25 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table S-2 and the details related to the e-value estimation can be found in the Supplementary File S-2.

4

Discussions

In this paper, we demonstrate that it is feasible to exhaustively search all peptide-peptide pairs with a linear time and space complexity. Given a data file with tens of thousands of MS2 spectra, ECL2 can finish the analysis using a big database in a few hours. Experiments show that ECL2 identifies the largest number of PSMs among all six tools. Experiments also show that ECL2 is much faster than StavroX, pLink, ProteinProspector, and ECL. In most PSMs, one peptide chain has a small rank and the other has a relatively large rank. We generate a scatter plot (Figure 6) using ECL2’s results from the Homo sapiens sample under the “15019 proteins” setting (Section 3.1). The database contains 15019 target proteins and 15019 decoy proteins. In the figure, each point corresponds to one identified PSM, the vertical axis corresponds to the log-transformed rank of the chain having a higher score, and the horizontal axis corresponds to the log-transformed rank of the chain having a lower score. We also plot three squares indicating the areas searched by Kojak, pLink, and ProteinProspector, respectively. Figure 6 shows that there are a considerable number of PSMs having one chain ranked smaller than 250 and the other ranked larger than 1000. For these PSMs, the larger ranked chains contribute little to the similarity score. Trnka et al. 26 proposed that the score from the larger ranked chain can provide an informative measurement of the matching quality. The authors combined such a score along with other sub-scores using a machine learning technique. Thanks to the power of the machine learning technique and the additional information from multiple sub-scores, such a strategy often performs better than those using a simple score function. However, such a performance improvement highly depends on the training data. How to incorporate multiple sub-scores in a robust and universal way is still an open question. Since such a study is beyond the

26 ACS Paragon Plus Environment

Page 26 of 38

Page 27 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

scope of this paper, we prefer studying it in the future.

Kojak pLink ProteinProspector

Figure 6: A scatter plot showing each PSM’s two chain ranks. Each point corresponds to one identified PSM, the vertical axis corresponds to the log-transformed rank of the chain having a higher score, and the horizontal axis corresponds to the log-transformed rank of the chain having a lower score. Three squares indicate the areas searched by Kojak, pLink, and ProteinProspector, respectively. The results in Section 3.1 and 3.2 show that there are PSMs with e-values smaller than 10−6 but q-values larger than 0.05. This is because there are decoy PSMs also having evalues small than 10−6 . In calculating the FDR using the target-decoy strategy, the value is large, which results in a large q-value. The underlying reason lies in the large search space. The chance of random match increases with respect to the size of the search space. Thus, we need a small e-value (or a higher score) to obtain a reasonable q-value (e.g. 0.05). It is also easy to figure out that such a phenomenon happens in inter-protein cross-linked PSMs, while intra-protein cross-linked PSMs rarely have such a phenomenon because their search spaces are relatively smaller. There are also PSMs with e-values larger than 1 but q-values smaller than 0.05. This is also due to the target-decoy strategy. If there are few decoy PSMs, the q-value will be small no mater how large the e-value is. Unfortunately, all cross-linked peptides identification tools have these issues. For those tools using scores instead of e-values 27 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

to estimate q-values, such issues are not obvious. These two issues are open questions in the community and beyond the scope of this paper. We would like to study them in the future.

Abbreviations 1. ECL: exhaustive cross-linked peptides identification. 2. XL-MS: chemical cross-linking coupled with mass spectrometry. 3. ppm: parts per million. 4. CID: collision-induced dissociation. 5. MS1: first level mass spectrometry. 6. MS2: tandem mass spectrometry. 7. MS3: third level mass spectrometry following MS2. 8. LC-MS/MS: liquid chromatography-tandem mass spectrometry. 9. DSS: disuccinimidyl suberate. 10. BS3: bis(sulfosuccinimidyl) suberate. 11. KSDP: kernel spectral dot product. 12. FDR: false discovery rate. 13. PSM: peptide-spectrum match. 14. PDB: protein data bank.

28 ACS Paragon Plus Environment

Page 28 of 38

Page 29 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Acknowledgement This work is partially supported by a theme-based project T12-402/13N from the Research Grants Council (RGC) of the Hong Kong S.A.R. government; an internal grant VPRGO15EG01 from HKUST; grants 31370315 and 31570187 from the National Natural Science Foundation of China (NSFC); grants 661613, 16101114, 16103615, and 16103817 from the General Research Fund (GRF) of the Hong Kong S.A.R. government; grants SRF11EG17PG-A and SRFI11EG17-A from the Energy Institute of HKUST; and a grant SBI09/10.EG01-A from the Croucher Foundation of the CAS-HKUST Joint Laboratory. We thank Meng Wang and Jiaan Dai for valuable discussions.

Supporting Information This information is available free of charge at http://pubs.acs.org/ • Supplementary Document. (PDF) • Supplementary Table S-1: A table of UniProt accession numbers and their corresponding PDB IDs. (XLSX) • Supplementary Table S-2: Detailed ECL2’s results from Section 3.2. (XLSX) • Supplementary File S-1: Databases used in Section 3.1 and 3.2. (ZIP) • Supplementary File S-2: Detailed e-value related results from Section 3.2. (ZIP) • Supplementary File S-3: Link-site distances of the intra cross-linked PSMs from Section 3.1. (ZIP) • Supplementary File S-4: Parameter files used in for all tools. (ZIP) • Supplementary File S-5: Identification tools’ raw results and logs from Section 3.1. (ZIP) 29 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 38

The source code and executable file can be found at http://bioinformatics.ust.hk/ ecl2.html.

References (1) Keich, U.; Kertesz-Farkas, A.; Noble, W. S. Improved false discovery rate estimation procedure for shotgun proteomics. Journal of Proteome Research 2015, 14, 3148–3161. (2) Howbert, J. J.; Noble, W. S. Computing exact p-values for a cross-correlation shotgun proteomics score function. Molecular & Cellular Proteomics 2014, 13, 2467–2479. (3) Keich, U.; Noble, W. S. On the importance of well-calibrated scores for identifying shotgun proteomics spectra. Journal of Proteome Research 2015, 14, 1147–1160. (4) Nguyen, V. Q.; Ranjan, A.; Stengel, F.; Wei, D.; Aebersold, R.; Wu, C.; Leschziner, A. E. Molecular architecture of the ATP-dependent chromatin-remodeling complex SWR1. Cell 2013, 154, 1220–1231. (5) Greber, B. J.;

Boehringer, D.;

Leitner, A.;

Bieri, P.;

Voigts-Hoffmann, F.;

Erzberger, J. P.; Leibundgut, M.; Aebersold, R.; Ban, N. Architecture of the large subunit of the mammalian mitochondrial ribosome. Nature 2014, 505, 515–519. (6) Politis, A.; Stengel, F.; Hall, Z.; Hernández, H.; Leitner, A.; Walzthoeni, T.; Robinson, C. V.; Aebersold, R. A mass spectrometry-based hybrid method for structural modeling of protein complexes. Nature Methods 2014, 11, 403–406. (7) Zhu, X.; Yu, F.; Yang, Z.; Liu, S.; Dai, C.; Lu, X.; Liu, C.; Yu, W.; Li, N. In planta chemical cross-linking and mass spectrometry analysis of protein structure and interaction in Arabidopsis. Proteomics 2016, 16, 1915–1927. (8) Götze, M.; Pettelkau, J.; Schaks, S.; Bosse, K.; Ihling, C. H.; Krauth, F.; Fritzsche, R.; Kühn, U.; Sinz, A. StavroX-a software for analyzing crosslinked products in protein 30 ACS Paragon Plus Environment

Page 31 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

interaction studies. Journal of the American Society for Mass Spectrometry 2012, 23, 76–87. (9) Götze, M.; Pettelkau, J.; Fritzsche, R.; Ihling, C. H.; Schäfer, M.; Sinz, A. Automated assignment of MS/MS cleavable cross-links in protein 3D-structure analysis. Journal of The American Society for Mass Spectrometry 2015, 26, 83–97. (10) Young, M.; Tang, N.; Hempel, J.; Oshiro, C.; Taylor, E.; Kuntz, I.; Gibson, B.; Dollinger, G. High throughput protein fold identification by using experimental constraints derived from intramolecular cross-links and mass spectrometry. Proceedings of the National Academy of Sciences of the United States of America 2000, 97, 5802–5806. (11) Schilling, B.; Row, R.; Gibsonb, B.; Guo, X.; Young, M. MS2Assign: Automated assignment and nomenclature of tandem mass spectra of chemically crosslinked peptides. Journal of The American Society for Mass Spectrometry 2003, 14, 834–850. (12) Chu, F.; Shan, S.-O.; Moustakas, D. T.; Alber, F.; Egea, P. F.; Stroud, R. M.; Walter, P.; Burlingame, A. L. Unraveling the interface of signal recognition particle and its receptor by using chemical cross-linking and tandem mass spectrometry. Proceedings of the National Academy of Sciences of the United States of America 2004, 101, 16454–16459. (13) Tang, Y.; Chen, Y.; Lichti, C.; Hall, R.; Raney, K.; Jennings, S. CLPM: A cross-linked peptide mapping algorithm for mass spectrometric analysis. BMC Bioinformatics 2005, 6, S9. (14) Ihling, C.; Schmidt, A.; Kalkhof, S.; Schulz, D. M.; Stingl, C.; Mechtler, K.; Haack, M.; Beck-Sickinger, A. G.; Cooper, D. M.; Sinz, A. Isotope-labeled cross-linkers and Fourier transform ion cyclotron resonance mass spectrometry for structural analysis of a protein/peptide complex. Journal of the American Society for Mass Spectrometry 2006, 17, 1100–1113. 31 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(15) Koning, L. J.; Kasper, P. T.; Back, J. W.; Nessen, M. A.; Vanrobaeys, F.; Beeumen, J.; Gherardi, E.; Koster, C. G.; Jong, L. Computer-assisted mass spectrometric analysis of naturally occurring and artificially introduced cross-links in proteins and protein complexes. FEBS Journal 2006, 273, 281–291. (16) Maiolica, A.; Cittaro, D.; Borsotti, D.; Sennels, L.; Ciferri, C.; Tarricone, C.; Musacchio, A.; Rappsilber, J. Structural analysis of multi-protein complexes by cross-linking, mass spectrometry and database searching. Molecular & Cellular Proteomics 2007, 6, 2200–2211. (17) Lee, Y.; Lackner, L.; Nunnari, J.; Phinney, B. Shotgun cross-linking analysis for studying quaternary and tertiary protein structures. Journal of Proteome Research 2007, 6, 3908–3917. (18) Singh, P.; Shaffer, S.; Scherl, A.; Holman, C.; Pfuetzner, R.; Freeman, T. L.; Miller, S.; Hernandez, P.; Appel, R.; Goodlett, D. Characterization of protein cross-links via mass spectrometry and an open-modification search strategy. Analytical Chemistry 2008, 80, 8799–8806. (19) Yu, E. T.; Hawkins, A.; Kuntz, I. D.; Rahn, L. A.; Rothfuss, A.; Sale, K.; Young, M. M.; Yang, C. L.; Pancerella, C. M.; Fabris, D. The collaboratory for MS3D: a new cyberinfrastructure for the structural elucidation of biological macromolecules and their assemblies using mass spectrometry-based approaches. Journal of Proteome Research 2008, 7, 4848–4857. (20) Nadeau, O. W.; Wyckoff, G. J.; Paschall, J. E.; Artigues, A.; Sage, J.; Villar, M. T.; Carlson, G. M. CrossSearch, a user-friendly search engine for detecting chemically crosslinked peptides in conjugated proteins. Molecular & Cellular Proteomics 2008, 7, 739– 749. (21) Panchaud, A.; Singh, P.; Shaffer, S.; Goodlett, D. xComb: A cross-linked peptide 32 ACS Paragon Plus Environment

Page 32 of 38

Page 33 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

database approach to protein-protein interaction analysis. Journal of Proteome Research 2010, 9, 2508–2515. (22) McIlwain, S.; Tamura, K.; Kertesz-Farkas, A.; Grant, C. E.; Diament, B.; Frewen, B.; Howbert, J. J.; Hoopmann, M. R.; Käll, L.; Eng, J. K. Crux: rapid open source protein tandem mass spectrometry analysis. Journal of Proteome Research 2014, 13, 4488. (23) Du, X.; Chowdhury, S. M.; Manes, N. P.; Wu, S.; Mayer, M. U.; Adkins, J. N.; Anderson, G. A.; Smith, R. D. Xlink-Identifier: an automated data analysis platform for confident identifications of chemically cross-linked peptides using tandem mass spectrometry. Journal of Proteome Research 2011, 10, 923–931. (24) Yang, B.; Wu, Y.-J.; Zhu, M.; Fan, S.-B.; Lin, J.; Zhang, K.; Li, S.; Chi, H.; Li, Y.X.; Chen, H.-F. Identification of cross-linked peptides from complex samples. Nature Methods 2012, 9, 904–906. (25) Holding, A. N.; Lamers, M. H.; Stephens, E.; Skehel, J. M. Hekate: software suite for the mass spectrometric analysis and three-dimensional visualization of cross-linked protein samples. Journal of Proteome Research 2013, 12, 5923–5933. (26) Trnka, M. J.; Baker, P. R.; Robinson, P. J.; Burlingame, A.; Chalkley, R. J. Matching cross-linked peptide spectra: only as good as the worse identification. Molecular & Cellular Proteomics 2014, 13, 420–434. (27) Mueller-Planitz, F. Crossfinder-assisted mapping of protein crosslinks formed by sitespecifically incorporated crosslinkers. Bioinformatics 2015, 31, 2043–2045. (28) Hoopmann, M. R.; Zelter, A.; Johnson, R. S.; Riffle, M.; MacCoss, M. J.; Davis, T. N.; Moritz, R. L. Kojak: Efficient analysis of chemically cross-linked protein complexes. Journal of Proteome Research 2015, 14, 2190–2198.

33 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(29) Petrotchenko, E. V.; Borchers, C. H. ICC-CLASS: isotopically-coded cleavable crosslinking analysis software suite. BMC Bioinformatics 2010, 11, 64. (30) Kao, A.; Chiu, C.-l.; Vellucci, D.; Yang, Y.; Patel, V. R.; Guan, S.; Randall, A.; Baldi, P.; Rychnovsky, S. D.; Huang, L. Development of a novel cross-linking strategy for fast and accurate identification of cross-linked peptides of protein complexes. Molecular & Cellular Proteomics 2010, 10, M110.002212. (31) Petrotchenko, E. V.; Serpa, J. J.; Borchers, C. H. An isotopically coded CID-cleavable biotinylated cross-linker for structural proteomics. Molecular & Cellular Proteomics 2011, 10, M110.001420. (32) Kaake, R. M.; Wang, X.; Burke, A.; Yu, C.; Kandur, W.; Yang, Y.; Novtisky, E. J.; Second, T.; Duan, J.; Kao, A. A new in vivo cross-linking mass spectrometry platform to define protein–protein interactions in living cells. Molecular & Cellular Proteomics 2014, 13, 3533–3543. (33) Liu, F.; Rijkers, D. T.; Post, H.; Heck, A. J. Proteome-wide profiling of protein assemblies by cross-linking mass spectrometry. Nature Methods 2015, 12, 1179–1184. (34) Tabb, D. L. Evaluating protein interactions through cross-linking mass spectrometry. Nature Methods 2012, 9, 879–881. (35) Leitner, A.; Faini, M.; Stengel, F.; Aebersold, R. Crosslinking and mass spectrometry: An integrated technology to understand the structure and function of molecular machines. Trends in Biochemical Sciences 2016, 41, 20–32. (36) Rinner, O.; Seebacher, J.; Walzthoeni, T.; Mueller, L.; Beck, M.; Schmidt, A.; Mueller, M.; Aebersold, R. Identification of cross-linked peptides from large sequence databases. Nature Methods 2008, 5, 315–318.

34 ACS Paragon Plus Environment

Page 34 of 38

Page 35 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(37) Walzthoeni, T.; Claassen, M.; Leitner, A.; Herzog, F.; Bohn, S.; Förster, F.; Beck, M.; Aebersold, R. False discovery rate estimation for cross-linked peptides identified by mass spectrometry. Nature Methods 2012, 9, 901–903. (38) Yu, F.; Li, N.; Yu, W. ECL: an exhaustive search tool for the identification of crosslinked peptides using whole database. BMC Bioinformatics 2016, 17, 1–8. (39) Petrotchenko, E. V.; Borchers, C. H. Application of a fast sorting algorithm to the assignment of mass spectrometric cross-linking data. Proteomics 2014, 14, 1987–1989. (40) Leitner, A.; Walzthoeni, T.; Aebersold, R. Lysine-specific chemical cross-linking of protein complexes and identification of cross-linking sites using LC-MS/MS and the xQuest/xProphet software pipeline. Nature Protocols 2014, 9, 120–137. (41) Lu, S.; Fan, S.-B.; Yang, B.; Li, Y.-X.; Meng, J.-M.; Wu, L.; Li, P.; Zhang, K.; Zhang, M.-J.; Fu, Y.; Luo, J.; Sun, R.-X.; He, S.-M.; Dong, M.-Q. Mapping native disulfide bonds at a proteome scale. Nature Methods 2015, 12, 329–331. (42) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5, 976–989. (43) Eng, J. K.; Fischer, B.; Grossmann, J.; MacCoss, M. J. A fast SEQUEST cross correlation algorithm. Journal of Proteome Research 2008, 7, 4598–4602. (44) Lynn, A. J.; Chalkley, R. J.; Baker, P. R.; Segal, M. R.; Burlingame, A. L. Protein Prospector and ways of calculating expectation values. Proceedings of the 54th ASMS Conference on Mass Spectrometry, Seattle. 2006; p 351. (45) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: An open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24.

35 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(46) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7, 655–667. (47) Wang, L.-h.; Li, D.-Q.; Fu, Y.; Wang, H.-P.; Zhang, J.-F.; Yuan, Z.-F.; Sun, R.-X.; Zeng, R.; He, S.-M.; Gao, W. pFind 2.0: a software package for peptide and protein identification via tandem mass spectrometry. Rapid Communications in Mass Spectrometry 2007, 21, 2985–2991. (48) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications 2014, 5, 5277. (49) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nature Methods 2008, 5, 873–875. (50) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: Identification of posttranslationally modified peptides from tandem mass spectra. Analytical Chemistry 2005, 77, 4626–4639. (51) Keich, U.; Noble, W. S. Progressive Calibration and Averaging for Tandem Mass Spectrometry Statistical Confidence Estimation: Why Settle for a Single Decoy? International Conference on Research in Computational Molecular Biology. 2017; pp 99–116. (52) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4, 923–925. (53) Storey, J. D.; Tibshirani, R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 2003, 100, 9440– 9445.

36 ACS Paragon Plus Environment

Page 36 of 38

Page 37 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(54) Makowski, M. M.; Willems, E.; Jansen, P. W.; Vermeulen, M. Cross-linking immunoprecipitation-MS (xIP-MS): Topological analysis of chromatin-associated protein complexes using single affinity purification. Molecular & Cellular Proteomics 2016, 15, 854–865. (55) TAIR10.

https://www.arabidopsis.org/download_files/Sequences/TAIR10_

blastsets/TAIR10_seq_20101214_updated, 2017. (56) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in largescale protein identifications by mass spectrometry. Nature Methods 2007, 4, 207–214. (57) Kojak. http://www.kojak-ms.org/param/index.html, Accessed: 2017-02-04. (58) Rose, P. W.; Prlić, A.; Altunkaya, A.; Bi, C.; Bradley, A. R.; Christie, C. H.; Di Costanzo, L.; Duarte, J. M.; Dutta, S.; Feng, Z. The RCSB protein data bank: Integrative view of protein, gene and 3D structural information. Nucleic Acids Research 2017, 45, D271–D281.

37 ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 38

Spectra Top-Scored Peptide Chain in Inverval1 . . .

Spectrum 1 Spectrum 1 . . .

Top-Scored Peptide Chain in Interval k1 Digitization . . .

Calculate chain scores

Spectrum l

Top-Scored Peptide Chain in Interval 1 . . .

Spectrum l

Top-Scored Peptide Chain in Interval k2 Peptide 1 . . .

In silico fragmentation

For each spectrum, pair and calculate final scores

Peptide n Top-Scored Peptide-Peptide pairs

Spectrum 1 in silico digest

. . .

Protein Database

Top-Scored Peptide-Peptide pairs

Spectrum l

For TOC only.

38 ACS Paragon Plus Environment

Estimate e-value and q-value

Output