De Novo Sequencing of Peptides from Top-Down Tandem Mass Spectra


Article

De novo sequencing of peptides from top-down tandem mass spectra
Kira Vyatkina, Si Wu, Lennard J. M. Dekker, Martijn M. VanDuijn, Xiaowen Liu, Nikola Tolić, Mikhail Dvorkin, Sonya Alexandrova, Theo M. Luider, Ljiljana Paša-Tolić, and Pavel A. Pevzner
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/pr501244v • Publication Date (Web): 28 Sep 2015 Downloaded from http://pubs.acs.org on September 29, 2015


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

De novo sequencing of peptides from top-down tandem mass spectra

Kira Vyatkina,∗,†,‡ Si Wu,¶ Lennard J. M. Dekker,§ Martijn M. VanDuijn,§ Xiaowen Liu,∥,⊥ Nikola Tolić,# Mikhail Dvorkin,† Sonya Alexandrova,† Theo M. Luider,§ Ljiljana Paša-Tolić,# and Pavel A. Pevzner∗,@,‡

Algorithmic Biology Laboratory, Saint Petersburg Academic University, Russia, Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Russia, Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK, USA, Erasmus MC, Department of Neurology, The Netherlands, Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, and Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA

E-mail: [email protected]; [email protected]
Phone: +7-921-954-2299. Fax: +7-812-448-6998



∗ To whom correspondence should be addressed
† Algorithmic Biology Laboratory, Saint Petersburg Academic University, Russia
‡ Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, Saint Petersburg State University, Russia
¶ Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK, USA
§ Erasmus MC, Department of Neurology, The Netherlands
∥ Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis
⊥ Center for Computational Biology and Bioinformatics, Indiana University School of Medicine
# Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory
@ Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA, USA


Abstract

De novo sequencing of proteins and peptides is one of the most important problems in mass spectrometry-driven proteomics. A variety of methods have been developed to accomplish this task from a set of bottom-up tandem (MS/MS) mass spectra. However, the more recently emerged top-down technology, which is steadily gaining popularity, opens new perspectives for protein analysis and characterization and creates a need for efficient algorithms for processing this kind of MS/MS data. Here we describe a method that retrieves long and accurate sequence fragments of the proteins contained in a sample from a set of top-down MS/MS spectra. To this end, we outline a strategy for generating high-quality sequence tags from top-down spectra, and introduce the concept of a T-Bruijn graph by adapting the notion of an A-Bruijn graph, widely used in genomics, to the case of tags. The output of the proposed approach is the set of amino acid strings spelled out by optimal paths in the connected components of a T-Bruijn graph. We illustrate its performance on top-down datasets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab.

Keywords: top-down mass spectrometry, de novo sequencing, T-Bruijn graph.

Introduction

Mass spectrometry (MS) has established itself as a standard and reliable tool for studying proteins. Until recently, the dominant technology in proteomics was bottom-up tandem mass spectrometry carried out at the peptide level; 1 however, the top-down strategy that analyzes intact proteins, 2 first regarded as a complementary means, is now rapidly becoming a compelling alternative to the classical bottom-up approach. Among the most important tasks of mass spectrometry-based proteomics is de novo sequencing, which is the only way to analyze proteins that come from an organism with an unknown genome, represent a novel splice variant, or cannot be derived directly from a genome, like antibodies.

In the past twenty years, the problem of peptide and protein de novo sequencing from bottom-up MS/MS spectra has received much attention, which stimulated the development of a number of valuable software tools, including PEAKS, 3 PepNovo, 4 pNovo, 5 Lutefisk, 6 and Sherenga. 7 Several other approaches capitalized on either overlapping peptides resulting from multiple enzyme digests, 8–12 or complementary information contained in mass spectra acquired using different fragmentation techniques from peptides 13–17 or proteins. 18 A recently published algorithm, TBNovo, 19 for de novo sequencing of proteins from combined MS/MS datasets utilizes top-down spectra as a scaffold to assemble bottom-up spectra collected from overlapping peptides. However, the more advanced the instruments for top-down mass spectrometry become, the more information can be learned from top-down MS/MS data alone, and consequently, the stronger the interest in and need for algorithmic solutions for processing this kind of data.

In this work, we introduce a fast and efficient method that, given a set of top-down MS/MS spectra, derives from it a number of amino acid strings representing accurate sequence fragments of proteins contained in the sample. In particular, it makes no assumptions about which technologies were employed to collect the input spectra. Our approach proceeds in three stages. First, it generates from a set of deconvoluted and preprocessed top-down MS/MS spectra a set of peptide sequence tags of length k, or k-tags, for a fixed k. Next, it constructs for the obtained set of tags a T-Bruijn graph, our proposed adaptation to the case of tags of the A-Bruijn graph 20 frequently applied in genomics. Briefly, the vertices of a T-Bruijn graph correspond to the k-tags, and two vertices define an edge if their respective tags likely represent two consecutive k-mers from a protein sequence (see Figure 2).
Finally, the de novo amino acid sequences are read from optimal paths in the connected components of the T-Bruijn graph. The details are provided below.

For this approach to be successful, the tags that serve as elementary building blocks for de novo reconstruction need to be of very high quality. The question of extracting accurate sequence tags has arisen many times in the context of interpreting bottom-up 6,13,21–34 as well as top-down 35,36 MS/MS spectra, and a number of sophisticated algorithms have been developed to this end. Here we present a fairly simple tag generation strategy, the most important aspect of which is that it uses an ultra-low constant mass tolerance. This seemingly contradicts the common practice of specifying the error tolerance in ppm when working with top-down data, and defining the allowed error in a mass difference relative to the larger of the masses under consideration. 35 However, although the absolute error in large masses can be correspondingly large due to poor external calibration, changes in temperature in the local environment, poorly set automatic gain control (AGC), and space charging, ion packets of similar m/z will experience approximately the same effects, which will lead to similar errors in the respective measured m/z values. As a consequence, the difference between the masses corresponding to consecutive fragment ions with similar charge states can be measured with very high precision. Although the fact that peaks in a spectrum bear a systematic error in their associated masses is well known to every mass spectrometrist, 37 and was noted in the context of tag generation from bottom-up spectra, 32 where the use of a relative mass accuracy of 0.01 Da was suggested, to the best of our knowledge it has never been recognized how much one can benefit from refining this observation and applying it to top-down spectra. Based on automated and manual analysis of a few datasets, we set the mass tolerance to 4 mDa and kept it throughout most of our experiments; this value still allowed us to retrieve many tags while ensuring their accuracy. To appreciate the difference from a traditionally used ppm-based tolerance, compare 4 mDa to, e.g., 0.1 Da, which is 10 ppm at 10,000 Da, keeping in mind that a tolerance of 10 ppm is nowadays commonly used for high-accuracy top-down Fourier-transform mass spectrometry (FTMS) data (even though for modern instruments, the detected error in mass measurements can be well below 10 ppm).

In what follows, we describe in detail the procedures for spectra preprocessing, tag generation, and T-Bruijn graph construction, and present experimental results for two top-down datasets acquired from carbonic anhydrase 2 (CAH2) and the Fab region of alemtuzumab, respectively, which illustrate the performance of the suggested method. The proposed approach is implemented in the software tool Twister, freely available online at http://bioinf.spbau.ru/en/twister.
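The comparison between the constant 4 mDa tolerance and a ppm-based tolerance made in the Introduction can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name `ppm_to_da` and the sample masses are our own choices and are not part of Twister.

```python
def ppm_to_da(mass_da: float, ppm: float) -> float:
    """Absolute mass window (in Da) that a relative ppm tolerance allows at a given mass."""
    return mass_da * ppm * 1e-6


CONSTANT_TOL_DA = 0.004  # the 4 mDa constant tolerance quoted in the text

# Illustrative fragment masses (hypothetical values chosen for the example).
for mass in (1_000.0, 10_000.0, 30_000.0):
    window = ppm_to_da(mass, 10.0)
    print(f"{mass:>8.0f} Da: 10 ppm = {window * 1000:.0f} mDa, "
          f"{window / CONSTANT_TOL_DA:.1f}x the constant tolerance")
```

At 10,000 Da the 10 ppm window is 0.1 Da, i.e. 25 times wider than the constant tolerance, and the ratio grows linearly with the fragment mass.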

Methods

Dataset acquisition

Carbonic anhydrase 2

Intact bovine carbonic anhydrase 2 (CAH2) was analyzed by a reversed-phase liquid chromatography (RPLC) system coupled online with a Thermo LTQ Orbitrap Elite mass spectrometer as described before. 38 Briefly, three datasets were generated using ETD, CID, and HCD fragmentation, respectively. For each dataset, a parent spectrum was collected at a 240k resolution at m/z of 400 (AGC target of 1E6, 2 microscans) followed by eight high-resolution MS/MS acquisitions (120k resolution at m/z of 400 with AGC target of 1E5, 2 microscans). For the LC-ETD MS/MS analysis, the ETD reaction time was set to 25 ms, and a reagent ion AGC target of 1E5 with a maximum injection time of 100 ms was chosen. For the LC-HCD MS/MS analysis, the normalized collisional energy (NCE) was set to 25%. In total, 3,031 ETD, 3,363 CID, and 3,437 HCD top-down MS/MS spectra were collected.

Fab region of alemtuzumab

Alemtuzumab was digested with papain, subsequently reduced, and analyzed by a reversed-phase liquid chromatography (RPLC) system coupled online with a Thermo LTQ Orbitrap Velos mass spectrometer as described before. 38 Two datasets were generated using ETD and HCD fragmentation, respectively. In either case, MS and MS/MS spectra were collected at a 100k and 60k resolution, respectively. In total, 4,962 ETD and 4,931 HCD top-down MS/MS spectra were collected.


Deconvolution

The acquired raw MS/MS spectra were centroided and converted into mzXML format with ReAdW, and then deconvoluted using the tool MS-Deconv 39 with the default parameters (maximum charge state: 30; maximum monoisotopic mass of fragment ions: 49,000 Da; signal-to-noise ratio: 1; envelopes of precursor ions were deconvoluted to derive the precursor masses of MS/MS spectra).

Spectra preprocessing

For a mass spectrum S, let P(S) denote the set of its peaks. We always assume that the peaks from P(S) are sorted in ascending order of their masses; the same assumption applies by default to any sequence of peaks under consideration. For a peak p ∈ P(S), let m(p) and I(p) denote its mass and intensity, respectively. The score of S is defined as $\mathrm{Score}(S) = \sum_{p \in P(S)} I(p)$. The deconvoluted spectra are preprocessed in four steps, as described below. Figure 1a-f illustrates the entire procedure for a CID or HCD spectrum; for an ETD spectrum S, the only difference would be in the number and masses of the auxiliary peaks.

Adding auxiliary peaks

Prior to tag generation, we add to an input spectrum S auxiliary peaks intended to complete ladders of fragment ions. Two of those correspond to the zero mass and the precursor mass PM(S) of S, respectively. Moreover, if S is a CID or HCD spectrum, we add two more peaks corresponding to the molecular mass m(H2O) of water and PM(S) − m(H2O), respectively, and if S is an ETD spectrum, we add four more peaks corresponding to m(NH3), PM(S) − m(NH3), m(H2O) − m(NH3), and PM(S) − m(H2O) + m(NH3), respectively. The intensity of each auxiliary peak is set to twice the maximum intensity over the original peaks.

Merging close peaks

Sometimes two or more peaks with nearly identical masses are observed in a spectrum. Such groups should be replaced by an appropriate single peak. We perform this as follows. The sequence of peaks is scanned in ascending order of their masses. For a peak p, we detect all the peaks whose mass exceeds m(p) by at most ε (in our experiments, ε = 4 mDa). If no such peaks are detected, we proceed to the next peak of the spectrum. Otherwise, the obtained group G of peaks (including p) is replaced with a single peak of intensity $\sum_{q \in G} I(q)$ at the mass m(p̃), where p̃ is the peak from G having the maximum intensity, and the newly formed peak is handled at the next step of the procedure.
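The two preprocessing steps described so far, adding auxiliary peaks and merging close peaks, can be sketched as follows. This is a minimal illustration rather than the Twister implementation: the `(mass, intensity)` tuple layout, the function names, and the monoisotopic mass constants are our assumptions, and we read "handled at the next step" as meaning that a newly formed peak is itself re-examined for further merging.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (mass in Da, intensity); layout assumed for this sketch

M_H2O = 18.010565  # monoisotopic mass of water, Da
M_NH3 = 17.026549  # monoisotopic mass of ammonia, Da
EPS = 0.004        # mass tolerance epsilon (4 mDa)


def add_auxiliary_peaks(peaks: List[Peak], precursor_mass: float,
                        fragmentation: str) -> List[Peak]:
    """Add the auxiliary peaks listed in the text and return all peaks sorted by mass."""
    masses = [0.0, precursor_mass]
    if fragmentation in ("CID", "HCD"):
        masses += [M_H2O, precursor_mass - M_H2O]
    elif fragmentation == "ETD":
        masses += [M_NH3, precursor_mass - M_NH3,
                   M_H2O - M_NH3, precursor_mass - M_H2O + M_NH3]
    else:
        raise ValueError(f"unsupported fragmentation type: {fragmentation}")
    # Each auxiliary peak gets twice the maximum intensity of the original peaks.
    aux_intensity = 2.0 * max((i for _, i in peaks), default=0.0)
    return sorted(peaks + [(m, aux_intensity) for m in masses])


def merge_close_peaks(peaks: List[Peak], eps: float = EPS) -> List[Peak]:
    """Replace every group of peaks lying within eps above a peak p by a single peak."""
    pending = sorted(peaks)
    merged: List[Peak] = []
    i = 0
    while i < len(pending):
        j = i + 1
        while j < len(pending) and pending[j][0] - pending[i][0] <= eps:
            j += 1
        if j == i + 1:
            # No close neighbours: keep the peak and move on.
            merged.append(pending[i])
            i += 1
        else:
            group = pending[i:j]
            mass = max(group, key=lambda p: p[1])[0]   # mass of the most intense peak
            intensity = sum(p[1] for p in group)       # summed intensity of the group
            # Assumption: the new peak is re-examined, so it may absorb further peaks.
            pending[i:j] = [(mass, intensity)]
    return merged
```

For example, `merge_close_peaks([(100.0, 5.0), (100.003, 7.0), (200.0, 1.0)])` collapses the first two peaks into one peak at 100.003 Da with intensity 12.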

Removing neutral losses

In our experiments, we regularly observed losses of water and ammonia, as well as combinations of those. However, neutral-loss ions sometimes form ladders long enough to give rise to accurate, although shifted, sequence tags. In view of this, our final choice was to eliminate water-loss ions only. Intuitively, this means generating fewer yet more accurate tags than if all the peaks were kept; however, this effect is slight (see the section “Results” and Table 2 for experimental details). When preprocessing a top-down (deconvoluted) spectrum S, we scan its peaks from left to right. If for a peak p, S also contains a peak p′ at the mass m(p) + m(H2O) (up to the tolerance ε), the intensity of p′ is increased by I(p), and p is eliminated.
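The water-loss removal rule can be sketched in a few lines. As before, the `(mass, intensity)` tuple layout and the function name are assumptions made for illustration; when several candidate partner peaks fall within the tolerance, this sketch simply takes the first one, a choice the text does not specify.

```python
from typing import List, Optional, Tuple

Peak = Tuple[float, float]  # (mass in Da, intensity); layout assumed for this sketch


def remove_water_losses(peaks: List[Peak], eps: float = 0.004,
                        m_h2o: float = 18.010565) -> List[Peak]:
    """Scan peaks left to right; fold each water-loss peak into its intact partner."""
    ordered = sorted(peaks)
    intensity = [i for _, i in ordered]
    kept: List[int] = []
    for idx, (mass, _) in enumerate(ordered):
        target = mass + m_h2o
        partner: Optional[int] = None
        for j in range(idx + 1, len(ordered)):
            delta = ordered[j][0] - target
            if delta > eps:
                break
            if abs(delta) <= eps:
                partner = j  # assumption: the first peak within tolerance is used
                break
        if partner is None:
            kept.append(idx)
        else:
            # p' inherits the (possibly already accumulated) intensity of p.
            intensity[partner] += intensity[idx]
    return [(ordered[i][0], intensity[i]) for i in kept]
```

Because the scan runs from lighter to heavier peaks, intensity accumulated by a peak from its own water-loss partner is carried along if that peak is later eliminated as well.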

Reflecting peaks

Since tandem mass spectra often contain only one of the two complementary ions (if any), we apply peak reflection to obtain the second fragment ion of each such pair, thus potentially extending ladders of consecutive fragment ions, which helps us generate more sequence tags. Reflecting a peak p of a spectrum S amounts to introducing in S a new peak p̄ with the mass m(p̄) = PM(S) − m(p) and intensity I(p̄) = I(p). If p̄ happens to lie close (in the above sense) to some other peak p′ ∈ P(S), p̄ and p′ are replaced with a single peak of intensity I(p̄) + I(p′) at the mass m(p̄) if I(p̄) > I(p′), and at m(p′) otherwise.
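A possible reading of the reflection step is sketched below. The function name and tuple layout are again illustrative assumptions, and "close" is interpreted as lying within ε of an existing peak on either side.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (mass in Da, intensity); layout assumed for this sketch


def reflect_peaks(peaks: List[Peak], precursor_mass: float,
                  eps: float = 0.004) -> List[Peak]:
    """Add the complementary peak PM - m(p) for every original peak p."""
    original = sorted(peaks)
    result = list(original)
    for mass, intensity in original:
        reflected_mass = precursor_mass - mass
        close = [k for k, (m, _) in enumerate(result)
                 if abs(m - reflected_mass) <= eps]
        if close:
            # Merge with the nearest existing peak; keep the mass of the more intense one.
            k = min(close, key=lambda idx: abs(result[idx][0] - reflected_mass))
            other_mass, other_intensity = result[k]
            merged_mass = reflected_mass if intensity > other_intensity else other_mass
            result[k] = (merged_mass, intensity + other_intensity)
        else:
            result.append((reflected_mass, intensity))
    return sorted(result)
```

With a precursor mass of 1000 Da, the peaks at 100.0 and 900.001 Da form a complementary pair: each ends up with the summed intensity of the two, while the unpaired peak at 300 Da gains a new partner at 700 Da.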


Sequence tags

A tag is a short sequence of amino acids endowed with an offset. A tag can be derived from a spectrum S in the following way. Two peaks p_1 and p_2 of S define an amino acid a if m(p_1) and m(p_2) differ by the mass of a. Similarly, k + 1 peaks p_1, ..., p_k, p_{k+1} together define a k-tag t with the amino acid sequence a_1 ... a_k if the peaks p_i and p_{i+1} define the amino acid a_i, where 1 ≤ i ≤ k. The amino acid sequence of t is denoted by s(t). The offset o(t) of t equals m(p_1). In addition, we associate with t an auxiliary spectrum S(t) with the set of peaks P(S(t)) = {p_1, ..., p_{k+1}}. The score of t is defined as Score(t) = Score(S(t)) (see Figure 1g).

Further on, we will have to merge tags with the same amino acid sequence and similar offsets. Let t_1, ..., t_m be such a group of tags, where m ≥ 2. The resulting tag t∗ inherits their common amino acid sequence, and its offset is set equal to that of the top-scoring underlying tag. The associated spectra S(t_1), ..., S(t_m) of the tags being merged are thereby superimposed, which essentially amounts to forming the i-th peak of S(t∗) by taking the i-th peak of each of S(t_1), ..., S(t_m) and gluing those together following the rules for merging close peaks. In particular, this implies that $\mathrm{Score}(t^*) = \sum_{i=1}^{m} \mathrm{Score}(t_i)$.

Generation

For each deconvoluted and preprocessed spectrum, a spectrum graph is constructed, with the vertices scored by the underlying peak intensity. An edge is introduced between two vertices if the absolute difference of their masses matches the mass of some amino acid within 2ε; each edge is directed from the vertex with the smaller mass to the one with the larger mass, and is labeled with the corresponding amino acid. For each connected component of the spectrum graph, an optimal (i.e., maximizing the total vertex score) path is computed, from which all possible k-tags are subsequently extracted, for a given k.
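The tag generation procedure above can be sketched as a dynamic program over peaks sorted by mass. This is a hedged illustration under several assumptions of ours: the residue mass table (standard monoisotopic values, with I and L represented by L since they are isobaric), the `(sequence, offset, score)` tag layout, the function name, and arbitrary tie-breaking between equally scored paths.

```python
from typing import Dict, List, Optional, Tuple

Peak = Tuple[float, float]       # (mass in Da, intensity)
Tag = Tuple[str, float, float]   # (amino acid sequence, offset, score)

# Standard monoisotopic residue masses (Da); I is omitted because it is isobaric with L.
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def spectrum_graph_tags(peaks: List[Peak], k: int, eps: float = 0.004) -> List[Tag]:
    """Extract all k-tags from an optimal path of each spectrum graph component."""
    peaks = sorted(peaks)
    n = len(peaks)
    max_residue = max(RESIDUE_MASS.values())
    parent = list(range(n))  # union-find over vertices, used for connected components

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    best = [intensity for _, intensity in peaks]  # best path score ending at each vertex
    back: List[Optional[Tuple[int, str]]] = [None] * n
    for j in range(n):
        for i in range(j - 1, -1, -1):
            diff = peaks[j][0] - peaks[i][0]
            if diff > max_residue + 2 * eps:
                break
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= 2 * eps:  # edge i -> j labeled with aa
                    parent[find(i)] = find(j)
                    if best[i] + peaks[j][1] > best[j]:
                        best[j], back[j] = best[i] + peaks[j][1], (i, aa)

    ends: Dict[int, int] = {}  # component root -> end vertex of its optimal path
    for v in range(n):
        root = find(v)
        if root not in ends or best[v] > best[ends[root]]:
            ends[root] = v

    tags: List[Tag] = []
    for end in ends.values():
        path, letters = [end], []
        while back[path[-1]] is not None:
            pred, aa = back[path[-1]]
            letters.append(aa)
            path.append(pred)
        path.reverse()
        letters.reverse()
        for s in range(len(letters) - k + 1):
            window = path[s:s + k + 1]  # the k + 1 peaks defining this k-tag
            tags.append(("".join(letters[s:s + k]), peaks[window[0]][0],
                         sum(peaks[v][1] for v in window)))
    return tags
```

Since every edge points from a lighter to a heavier peak, processing vertices in mass order is a valid topological order, so a single pass suffices to find the best-scoring path ending at each vertex.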


T-Bruijn graph construction

A T-Bruijn graph is built for a set of tags of a fixed length extracted from the input spectra. Similar to an A-Bruijn graph, 20 it accounts for positional information (in this case, tag offsets); in contrast to it, a T-Bruijn graph must be, and is guaranteed to be, acyclic (while an A-Bruijn graph is expected to contain cycles). For a set T of k-tags, a T-Bruijn graph G_T is constructed as follows. First, we construct from T a set T∗ of k-tags with associated multiplicities by gluing together all the tags with the same amino acid sequence and offset; the multiplicity of the resulting tag is set to the number of original tags merged into it. Next, for each tag t ∈ T∗, a vertex v_t of G_T is generated and labeled with t. Subsequently, for any two vertices v_t′ and v_t′′, a directed edge from v_t′ to v_t′′ is introduced if a_2 ... a_k = b_1 ... b_{k−1}, and o(t′′) − o(t′) equals m(a_1) within 2ε, where s(t′) = a_1 ... a_k and s(t′′) = b_1 ... b_k. The entire procedure, starting from tag generation, is illustrated in Figure 2. For the sake of simplicity, in this illustration we do not restrict our attention to the tags extracted from optimal paths in spectrum graphs, and consider all the possible ones instead.

In practice, to obtain the set T∗ from the set S of input spectra, we proceed as follows. First, we extract from S a set T of k-tags, as described in the previous section. Subsequently, the tags from T are grouped by amino acid sequence. Within each group, the tags are sorted by increasing offset. The obtained list is then scanned by a procedure fully analogous to the one for merging close peaks (see above) in search of subgroups of tags with similar offsets. The tags from each such subgroup τ are merged together as stated above. This gives us the set T∗ that serves as input for the procedure constructing a T-Bruijn graph.

For a vertex v_t with an associated tag t, its score is defined as Score(v_t) = Score(t). For a path p = v_{t_1} v_{t_2} ... v_{t_m} in G_T, we let $\mathrm{Score}(p) = \sum_{i=1}^{m} \mathrm{Score}(v_{t_i})$. The intuition is that p is scored based on the intensities of all the peaks from the original spectra that were transformed into the monoisotopic mass peaks in the deconvoluted spectra, which subsequently defined the tags composing p.

Finally, observe that each path p = v_{t_1} v_{t_2} ... v_{t_m} in G_T spells out an amino acid sequence $\sigma(p) = a_1^1 a_2^1 \ldots a_k^1 a_k^2 \ldots a_k^m$, where $a_i^j$ denotes the i-th amino acid of the tag t_j, for 1 ≤ i ≤ k and 1 ≤ j ≤ m. The length of σ(p) is m + k − 1.

An alternative approach

An intuitively simpler way to build a T-Bruijn graph consists of extracting (k + 1)-tags from the input spectra, which label the graph edges, and then obtaining the k-tags corresponding to the vertices as their prefixes and suffixes of length k. However, our suggested procedure potentially generates more edges: indeed, it may happen that no spectrum contains a (k + 1)-tag t_e, while two distinct spectra contain its prefix and suffix of length k, respectively. On the other hand, reducing k, e.g., by one in order not to miss the respective edges when starting from the edge-labeling tags (in which case an edge from v_t′ to v_t′′ output by our procedure will be represented by two consecutive edges labeled with t′ and t′′, respectively) will lead to generating more spurious edges as well; consequently, the numbers of both correct and erroneous de novo strings generated in this way will lie between those produced by the first approach for the tag lengths k − 1 and k, respectively.

Note that the former method, which generates vertex-labeling tags, allows, in particular, for 0-tags, which are simply peaks in the input spectra. In this case, a T-Bruijn graph represents a spectrum graph for a superspectrum constructed from the entire set of the given spectra, which immediately relates our proposed concept of a T-Bruijn graph to the classical notion of a spectrum graph. In practice, the accuracy of de novo strings extracted from a T-Bruijn graph for 0-tags is typically insufficient; however, this observation led us to give preference to the first version of the algorithm for constructing a T-Bruijn graph, even though the second one can also be used.
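The vertex-labeling construction and the read-out of de novo strings can be sketched as follows. This is an illustrative simplification, not the Twister code: tags are plain `(sequence, offset, score)` tuples, multiplicities and the superimposed tag spectra are not tracked (only the summed score is kept), the merge scan is a single pass, and k ≥ 1 is assumed. The residue mass table contains standard monoisotopic values, with I represented by L.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Tag = Tuple[str, float, float]  # (amino acid sequence, offset in Da, score)

RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def merge_tags(tags: List[Tag], eps: float = 0.004) -> List[Tag]:
    """Glue tags with the same sequence and similar offsets (the set T* of the text)."""
    by_sequence: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for seq, offset, score in tags:
        by_sequence[seq].append((offset, score))
    merged: List[Tag] = []
    for seq, items in by_sequence.items():
        items.sort()
        i = 0
        while i < len(items):
            j = i + 1
            while j < len(items) and items[j][0] - items[i][0] <= eps:
                j += 1
            group = items[i:j]
            top_offset = max(group, key=lambda item: item[1])[0]  # top-scoring tag's offset
            merged.append((seq, top_offset, sum(score for _, score in group)))
            i = j
    return merged


def t_bruijn_strings(tags: List[Tag], eps: float = 0.004) -> List[Tuple[str, float]]:
    """Return (string, score) spelled by an optimal path of each connected component."""
    vertices = sorted(merge_tags(tags, eps), key=lambda t: t[1])
    n = len(vertices)
    parent = list(range(n))

    def find(v: int) -> int:
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    best = [score for _, _, score in vertices]
    back: List[Optional[int]] = [None] * n
    # Offsets strictly increase along every edge, so offset order is topological.
    for v in range(n):
        seq_v, off_v, score_v = vertices[v]
        for u in range(v):
            seq_u, off_u, _ = vertices[u]
            if (seq_u[1:] == seq_v[:-1]
                    and abs((off_v - off_u) - RESIDUE_MASS[seq_u[0]]) <= 2 * eps):
                parent[find(u)] = find(v)
                if best[u] + score_v > best[v]:
                    best[v], back[v] = best[u] + score_v, u

    ends: Dict[int, int] = {}
    for v in range(n):
        root = find(v)
        if root not in ends or best[v] > best[ends[root]]:
            ends[root] = v

    results = []
    for end in ends.values():
        path = [end]
        while back[path[-1]] is not None:
            path.append(back[path[-1]])
        path.reverse()
        # First tag in full, then the last letter of each following tag: m + k - 1 letters.
        spelled = vertices[path[0]][0] + "".join(vertices[v][0][-1] for v in path[1:])
        results.append((spelled, best[end]))
    return sorted(results, key=lambda r: -r[1])
```

The edge rule also shows why the graph is acyclic: an edge exists only when the offset grows by roughly one residue mass, so no path can return to a vertex with a smaller offset.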


Results

We implemented the proposed approach in Java and benchmarked it on two top-down datasets acquired from CAH2 and the Fab region of alemtuzumab, respectively. The top-down MS/MS spectra were deconvoluted with MS-Deconv, preprocessed, and passed as input to our method. The algorithm depends on two parameters: the tag length k and the mass tolerance ε. Our experiments were carried out for 4-tags, unless stated otherwise; the benefits and drawbacks of tags of different lengths are discussed further below. On a modern desktop or laptop, the running time of the algorithm is only a few seconds (while deconvolution is somewhat more time-consuming, and requires a few minutes for a typical top-down dataset).

Mass tolerance setting

The smaller the mass tolerance, the more accurate the de novo strings are (see the supplementary file Twister tolerance.xls for the lists of those generated from the CAH2 and alemtuzumab datasets for tolerances from 1 to 10 mDa using 4-tags). However, if the tolerance is too small, the amino acid strings will be too short to allow for a meaningful analysis. At the same time, the effect of increasing the mass tolerance is twofold. On the one hand, this results in extensions and merges of small correct fragments due to the generation of new correct tags; on the other hand, the overall accuracy steadily deteriorates. Thus, a reasonable strategy for selecting the tolerance should aim at finding a balance between the length and accuracy of the de novo strings. One way to achieve this is to start with a small tolerance, and then iteratively increase it until degradation of the de novo results can be observed.

In our experiments, it was important that the de novo sequences be long enough to allow for identification of contaminants via a BLAST search (see below for details). Having set ε = 2 mDa, we missed two of the contaminants listed in the supplementary file Twister contaminants.xls for the CAH2 sample; however, at ε = 3 mDa, we were already able to identify all of them. Increasing it further to 4 mDa made the sequence fragments of CAH2 and the detected contaminants longer; at the same time, this led to no negative consequences such as extensions of the identified fragments inconsistent with the corresponding protein sequence(s), or sufficiently long and high-scoring strings that cannot be interpreted via a BLAST search. This value also worked well for the alemtuzumab dataset. However, for ε = 5 mDa, an 18-aa long fragment “LYSQNNLQSASALQTMQP” with a single correct 4-mer, and a 15-aa long fragment “TLWKYNVVFVTVTLE” and its reversed copy “ELTVTVFVVNYKWLT” with no correct 4-mer (i.e., one contained in the sequence of CAH2 or a contaminant, possibly in reversed form) appeared at the 18th, 21st, and 28th positions, respectively, of the output list for the alemtuzumab dataset. For none of these strings did a BLAST search against the non-redundant database suggest a candidate interpretation. This indicated that for the alemtuzumab dataset, the tolerance should be kept at 4 mDa (unless, of course, we could hope that those amino acid strings originated from novel proteins, but this was not the case). In contrast, for the CAH2 dataset, the tolerance could potentially be increased to 5 mDa without such consequences, yet also without a significant increase in the length of the identified de novo strings. Therefore, we chose to keep the mass tolerance of 4 mDa, appropriate for both datasets, throughout most of the experiments described in this paper.
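The iterative strategy for choosing the tolerance (start small and increase until the results degrade) can be written as a short loop. Both callables below are hypothetical placeholders: `run_de_novo` stands for a full run of the pipeline at a given tolerance, and `is_degraded` stands for whatever quality check is used; in the text this check relied on BLAST interpretation of the output strings and was not automated.

```python
from typing import Callable, List, Sequence


def select_tolerance(run_de_novo: Callable[[float], List[str]],
                     is_degraded: Callable[[List[str]], bool],
                     candidates_mda: Sequence[float] = tuple(range(1, 11))) -> float:
    """Return the largest tolerance (in mDa) tried before the results degrade."""
    chosen = candidates_mda[0]
    for eps_mda in candidates_mda:
        strings = run_de_novo(eps_mda / 1000.0)  # the pipeline expects Da
        if is_degraded(strings):
            break
        chosen = eps_mda
    return chosen
```

If even the smallest candidate is judged degraded, the sketch still returns it; a real workflow would need to handle that case explicitly.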

Justifying the preference for a constant mass tolerance

A natural expectation is that the errors in deconvoluted masses that were closer to each other in the m/z domain will be more similar than the errors in those separated by a larger delta m/z, assuming the respective charge states do not differ too much. This suggests that smaller mass tolerances favour tags whose underlying masses were deconvoluted from the same charge state, and raises the question of whether it might be beneficial to introduce an adaptive mass tolerance, which would depend on the distance separating, in the m/z space, the pairs of masses examined during tag generation. To clarify this point, we carried out the following computational experiments on the CAH2 and alemtuzumab datasets.

First, for each value of the mass tolerance from 1 to 10 mDa, and for 100 mDa, sets of “certified” and “uncertified” amino acids were produced. To this end, a spectrum graph was generated from each deconvoluted and preprocessed input spectrum, and an optimal path was extracted from each of its connected components, as during k-tag generation. Subsequently, for each obtained path p of length at least 5 (which would thus contribute to the set of 4-tags), we considered the amino acid string σ(p), and identified its longest substring σ′(p) of length at least 4 contained in the sequence of a target protein or contaminant, possibly in reversed form. The amino acids composing σ′(p) were classified as certified, while all the rest were classified as uncertified. Observe that an uncertified amino acid is not necessarily incorrect: it only means that it is not covered by any correct 4-tag. According to their type, the derived amino acids were stored in the set of certified or uncertified amino acids; for each amino acid, its observed mass was calculated as the difference of the masses of its rightmost and leftmost defining peaks p_r and p_l, respectively. Finally, the delta z and delta m/z values were calculated for the pair of peaks p_r and p_l as the absolute values of the differences z(p_r) − z(p_l) and (m(p_r) + z(p_r))/z(p_r) − (m(p_l) + z(p_l))/z(p_l) = m(p_r)/z(p_r) − m(p_l)/z(p_l), respectively, where z(p) denotes the charge state of the fragment ion that gave rise to a peak p. Subsequently, for each value of the mass tolerance under consideration, we gathered statistics on the number and percentage of the certified and uncertified amino acids derived from pairs of fragment ions with charge states differing by a fixed delta z, and from those separated by a delta m/z from a fixed range.
For the delta m/z statistics, the average error in the masses of the retrieved certified and uncertified amino acids was also recorded. The results for delta z and delta m/z are presented in the supplementary files TBruijn delta-z.xls and TBruijn delta-mz.xls, respectively. It can be immediately seen that a vast majority of the amino acids are produced by pairs of fragment ions separated by less than 200 units in the m/z space, and with charge states differing by at most 1: e.g., for ε = 4 mDa, the share of such amino acids (both certified and uncertified) is 85.93% and 97.41%, respectively, for CAH2, and 84.03% and 98.11%, respectively, for alemtuzumab. The corresponding fractions of certified and uncertified amino acids, along with the overall statistics on the certified and uncertified amino acids obtained for each value of the mass tolerance considered, are provided in the supplementary file TBruijn aa-stats.xls.

Consequently, we focused our attention on the region of delta m/z below 200, and partitioned it into 20 ranges of width 10. Next, for each dataset and each delta m/z range, lists of certified and uncertified amino acids were generated using a mass tolerance of 4 or 100 mDa; see the supplementary file TBruijn aa-mass error.xls. Along with each amino acid, the mass m, the charge z, and the m/z value of either underlying peak were reported, as well as the resulting delta m/z and delta z values, and the deviation of the observed amino acid mass from the theoretical value.

A close inspection of the results obtained for ε = 4 mDa reveals that for a delta m/z below 10, the two fragment ions giving rise to an amino acid in most cases have substantially different charge states (e.g., 1 vs. 6), or either relatively small or relatively large m/z values (e.g., below 50 or above 1,000 units). As a consequence, the error in the observed amino acid mass is often relatively large: in the former case, due to an error in a peak mass introduced at the time of deconvolution, and in the latter case, due to a decrease in the sensitivity of the instrument at small or large m/z values. Namely, the average mass error is approximately 5 and 3 mDa for both certified and uncertified amino acids for the CAH2 and alemtuzumab datasets, respectively.
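The delta z and delta m/z quantities, and the partition of the region below 200 into ranges of 10 units, can be illustrated with a short sketch. The `Peak` record, the function names, and the example masses are assumptions introduced here for illustration; only the formulas follow the definitions given in the text.

```python
from collections import namedtuple

# A deconvoluted peak: neutral mass in Da and the charge state of the
# fragment ion it was derived from (hypothetical record layout).
Peak = namedtuple("Peak", ["mass", "charge"])


def delta_z(p_left, p_right):
    """Absolute difference of the charge states of the two defining peaks."""
    return abs(p_right.charge - p_left.charge)


def delta_mz(p_left, p_right):
    """Absolute difference of m/z values; (m + z)/z - (m' + z')/z'
    simplifies to m/z - m'/z' because the added unit cancels."""
    return abs(p_right.mass / p_right.charge - p_left.mass / p_left.charge)


def mz_range(dmz, width=10.0, upper=200.0):
    """Return the [lo, lo + width) range containing dmz, or None if dmz >= upper."""
    if dmz >= upper:
        return None
    lo = int(dmz // width) * width
    return (lo, lo + width)


# A glycine (57.02146 Da) defined by two peaks that both carry charge 2.
left, right = Peak(1000.0, 2), Peak(1057.02146, 2)
dz = delta_z(left, right)
dmz = delta_mz(left, right)
bucket = mz_range(dmz)
```

With both peaks at charge 2 the glycine lands in the 20 to 30 range, and with charge 1 it would land in the 50 to 60 range, since the same mass difference is divided by a smaller charge.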
For a delta m/z ranging from 10 to 20, almost all the amino acids are defined by masses deconvoluted from the same charge state, which is mostly between 5 and 7. The average error in the amino acid mass immediately becomes smaller, and is now roughly 3 and 2 mDa for the CAH2 and alemtuzumab datasets, respectively. Similar observations hold for the delta m/z range from 20 to 30 units, in which glycines produced by pairs of peaks with charge 2, and alanines and serines produced by pairs of peaks with charge 3, are observed in either dataset. The average mass error thereby further decreases for both datasets. In the delta m/z ranges from 30 to 40 and from 40 to 50, larger amino acids appear, while the average mass error continues to decrease. Subsequently, in the delta m/z range from 50 to 60, glycines defined by fragment ions with charge state 1 emerge. The same tendency continues as delta m/z increases further, up to the range from 180 to 190. The average mass error thereby generally remains small. For the ranges of width 10 between 100 and 200 units, the amino acids derived from pairs of fragment ions with charge state 1, the masses of which fit into the delta m/z range being considered, clearly dominate (if any exist). Such pairs of ions generate amino acids most accurately, since no noticeable error is introduced in the respective neutral masses during deconvolution; this is appropriately reflected by the average mass error for those ranges (with the exception of the range from 150 to 160, in which arginine falls; however, this is consistent with the fact that arginines are rarely retrieved by our method, and their share in the respective amino acid list is particularly small for either dataset). Starting from the delta m/z range from 190 to 200, no amino acids defined by peaks with the same charge state can be observed, and the average error in an amino acid mass immediately becomes larger. Subsequently, as the range shifts towards larger delta m/z values, the reported m/z values of the respective peaks steadily become larger, which, together with the increase in delta m/z, leads to an increase in the average error in the masses of the retrieved amino acids, even though most of those are still defined by pairs of fragment ions with charge states differing by 1.
Since for ε = 4 mDa many uncertified amino acids are actually correct, the above observations apply to both the certified and the uncertified amino acids, for either dataset; however, the average error in the amino acid mass is larger for uncertified amino acids. For ε = 100 mDa, the overall tendency is still the same, but there are more deviations from it in the case of certified amino acids, and substantially more in the case of uncertified amino acids.

For either dataset, with the mass tolerance increasing from 1 to 10 mDa, the share of the amino acids defined by pairs of fragment ions with charge states differing by at most 1 steadily decreases from approximately 99.3% to 96%, and drops to roughly 91% at 100 mDa (see the supplementary file TBruijn delta-z.xls). Also, the share of the amino acids derived from pairs of fragment ions with the same charge state slowly decreases, while the share of those derived from pairs with charge states differing by 1 slowly increases. Both observations are consistent with the above-mentioned expectations. However, the accuracy of the derived amino acids deteriorates much faster. In particular, for the alemtuzumab dataset, at the mass tolerance of 100 mDa, most of the amino acids derived from pairs of fragment ions with different charge states turn out to be uncertified. This indicates that a substantial increase in the mass tolerance will result in sets of tags of unacceptable quality, and thus the tolerance should preferably be kept small. At the same time, since most certified amino acids are defined by fragment ions at most 200 m/z units apart, it can be safely left constant.
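To make the effect of the mass tolerance concrete, the sketch below derives amino acids from pairs of peak masses under a constant tolerance. It is an illustration only: the function names and the example peak list are hypothetical, the residue table holds standard monoisotopic residue masses (with L standing for both leucine and isoleucine), and the procedure is a plain pairwise scan rather than the spectrum graph construction used by the authors.

```python
# Standard monoisotopic residue masses in Da; L also represents I.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
MAX_RESIDUE = max(RESIDUE_MASS.values())


def match_residues(mass_diff, tol_mda=4.0):
    """Residues whose mass lies within a constant tolerance (in mDa) of mass_diff."""
    tol = tol_mda / 1000.0
    return [(aa, mass_diff - m) for aa, m in RESIDUE_MASS.items()
            if abs(mass_diff - m) <= tol]


def amino_acid_edges(peak_masses, tol_mda=4.0):
    """All (left, right, residue, error) tuples supported by pairs of peaks."""
    masses = sorted(peak_masses)
    edges = []
    for i, left in enumerate(masses):
        for right in masses[i + 1:]:
            diff = right - left
            if diff > MAX_RESIDUE + tol_mda / 1000.0:
                break                      # masses are sorted, no further match possible
            for aa, err in match_residues(diff, tol_mda):
                edges.append((left, right, aa, err))
    return edges


# Hypothetical peak masses spelling G then A; the outer pair also fits Q.
peaks = [1000.0, 1057.0216, 1128.0590]
tight = amino_acid_edges(peaks, tol_mda=4.0)
loose = amino_acid_edges(peaks, tol_mda=100.0)
```

At 4 mDa the outer pair is read as Q only, whereas at 100 mDa it also admits K, whose residue mass differs from that of Q by about 36 mDa; this is one simple way in which a loose tolerance introduces incorrect amino acids.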

Tag generation and T-Bruijn graph construction

The computational experiments were carried out for tags of length k from 3 to 6 generated on a scan-by-scan basis. For each k, the total number of extracted k-tags, along with the number of vertices and connected components of the T-Bruijn graph obtained from those, is given in Table 1. Observe that the number of de novo amino acid strings derived in each case equals the number of connected components in the respective T-Bruijn graph. The ratio between the number of k-tags and the number of vertices or connected components in the T-Bruijn graph constructed from those tags does not vary much for different k within the same dataset: for CAH2, the former ratio is between 1.3 and 1.4 and the latter between 4.4 and 4.6, while for alemtuzumab they are between 1.7 and 1.8 and between 7.1 and 8.8, respectively. At the same time, these numbers indicate that the k-tags derived from alemtuzumab were grouped more efficiently. This is consistent with the fact that while in the CAH2 dataset we detected and identified 17 contaminants, in the alemtuzumab dataset only 4 contaminants were observed (see below for details). Since in the case of alemtuzumab a vast majority of tags originated from the two target protein sequences, many of those were combined at the time of T-Bruijn graph construction.

As Table 3 illustrates, peak reflection almost always slightly improves the accuracy of the de novo strings and increases the coverage of a target protein sequence. This is consistent with the fact that the number of tags thereby slightly more than doubles. Consequently, we decided to apply this procedure at the preprocessing stage; however, it should be recognized that this leads to most de novo strings appearing in both direct and reversed form in the output list.
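The idea of combining k-tags into longer strings can be illustrated with a deliberately simplified sketch. It is not the T-Bruijn graph construction itself: it greedily chains tags whose sequences overlap in k−1 amino acids and whose offsets differ by the mass of the dropped residue. The tag representation as a (sequence, offset) pair, the meaning of the offset as the mass preceding the tag, the function names, and all offset values are assumptions made for this example.

```python
# Monoisotopic residue masses (Da) for the residues used in the toy tags.
RES = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
       "T": 101.04768, "L": 113.08406, "Q": 128.05858}


def follows(tag_a, tag_b, tol=0.004):
    """True if tag_b can directly follow tag_a: sequences overlap in k-1
    residues and the offsets differ by the mass of tag_a's first residue."""
    (seq_a, off_a), (seq_b, off_b) = tag_a, tag_b
    expected = off_a + RES[seq_a[0]]
    return seq_a[1:] == seq_b[:-1] and abs(off_b - expected) <= tol


def chain(tags, tol=0.004):
    """Greedily extend the first tag with compatible successors and
    return the amino acid string spelled by the chain."""
    remaining = list(tags)
    current = remaining.pop(0)
    spelled = current[0]
    extended = True
    while extended:
        extended = False
        for cand in remaining:
            if follows(current, cand, tol):
                spelled += cand[0][-1]     # append the one new residue
                current = cand
                remaining.remove(cand)
                extended = True
                break
    return spelled


# Hypothetical 4-tags with made-up offsets; ("ALTS", 1150.0) overlaps in
# sequence but has an inconsistent offset and is therefore skipped.
tags = [("QALT", 1000.0), ("ALTS", 1150.0), ("ALTT", 1128.0590),
        ("LTTL", 1199.0960), ("TTLG", 1312.1795)]
result = chain(tags)
```

The decoy tag shows why offsets matter: sequence overlap alone would allow two continuations, but only one of them is consistent in mass.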

De novo sequences

The amino acid strings spelled out by optimal paths in the connected components of the T-Bruijn graphs constructed from a set of k-tags, for k from 3 to 6, for the CAH2 and alemtuzumab datasets, are listed in the supplementary file Twister tag-length.xls. Leucine stands there for both itself and isoleucine, from which it is indistinguishable by standard MS/MS methods. For each de novo string, its longest fragment of length at least k that matches either a target protein or a contaminant from the corresponding sample is highlighted in color; the leftmost one is picked in case of ties. Moreover, the positions in the protein sequence of the first and last amino acids of the highlighted fragment are indicated; if the latter position is smaller than the former, the respective fragment occurs in this sequence in reversed form. Some of the short fragments can potentially match two or more sequences. In the case of alemtuzumab, for a fragment matching both its light and heavy chains, the former is reported. Otherwise, if a fragment matches both a target and a contaminant protein, the target one is output, and for a fragment corresponding to two or more contaminants, the one encountered first in the supplementary file Twister contaminants.xls is reported. This is sufficient for our illustrative purposes; however, an unambiguous identification can often be made by comparing the offset of a de novo sequence inherited from its underlying tags to the mass of one or a few amino acids that precede it or follow its reversed copy in a candidate protein sequence. An example of such analysis is provided in the supplementary materials.

A majority of the obtained amino acid strings constitute sequence fragments of either a target protein or a contaminant from the respective sample, possibly flanked by one or a few spurious amino acids. Figure 3 illustrates the coverage of the target protein sequences by the matching fragments of de novo strings derived from the respective dataset. For CAH2, the light chain of alemtuzumab, and the Fd region of the heavy chain of alemtuzumab, 173 out of 260, 177 out of 214, and 157 out of 228 amino acids are covered, which is 66.54%, 82.71%, and 68.86%, respectively. In addition, we point out that in our data analysis for CAH2, we observed N-terminal methionine truncation and serine acetylation; this also manifested itself as a substitution in the de novo strings of the N-terminal dimer “MS” or its reversed counterpart “SM” with a single amino acid ‘E’, the molecular mass of which, 129.043 Da, equals that of an acetylated serine residue, and therefore appropriately reflects those modifications (see e.g. the 15th de novo string “GWHHE” for 4-tags, which thus represents a reversed copy of the prefix of the target sequence).

For the purpose of comparison, we ran MS-Align+ (40) on either dataset, with the error tolerance for precursor and fragment masses set to 10 ppm, and a maximum of two unexpected post-translational modifications allowed. Protein-spectrum matches assuming post-translational modifications with an absolute mass difference over 100 Da were discarded.
Subsequently, we aligned the peptides identified with an E-value below $10^{-10}$ against the sequence of a target protein, and labeled the cleavage sites matched by the annotated fragment ions or one of the two respective auxiliary peaks (see Figure 3). The sequence coverage obtained in this way was 84.14%, 99.53%, and 91.89% for CAH2 and the light and heavy chains of alemtuzumab, respectively, indicating that several spectra contained fragment ions that supported no correct 4-tags (as should have been expected). At the same time, Twister could sometimes generate correct 4-tags covering cleavage sites missed by MS-Align+, owing to its ability to extract tags from spectra that cannot be interpreted, e.g., because of an error in the precursor mass.
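The sequence coverage figures reported in this section are ratios of covered positions to protein length. A minimal sketch of such a computation from matched fragment positions is given below; the function name and the toy intervals are hypothetical, and the positions are taken to be 1-based and inclusive, with either order of the two endpoints accepted so that reversed fragments are handled as well.

```python
def coverage(protein_length, fragments):
    """Return (covered_count, percentage) for a protein of the given length.

    `fragments` holds (first, last) 1-based inclusive positions of matched
    fragments; the endpoints may come in either order, since a fragment
    matched in reversed form is listed with its endpoints swapped.
    """
    covered = [False] * protein_length
    for first, last in fragments:
        lo, hi = min(first, last), max(first, last)
        for pos in range(lo, hi + 1):
            covered[pos - 1] = True        # overlapping fragments count once
    count = sum(covered)
    return count, 100.0 * count / protein_length


# Toy example with made-up intervals on a protein of length 20.
toy_count, toy_percent = coverage(20, [(1, 5), (12, 8), (4, 6)])

# The percentages quoted in the text follow from the reported counts.
reported = [round(100.0 * c / n, 2) for c, n in [(173, 260), (177, 214), (157, 228)]]
```

Marking positions in a boolean array keeps overlapping fragments from being counted twice, which matters here because most de novo strings appear in both direct and reversed form.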

Removal of water-loss ions

As mentioned earlier, at the time of spectra preprocessing, we eliminate water-loss ions by merging them with their unmodified counterparts. However, a comparison of the subsequently generated list of de novo strings with the one obtained while keeping such ions untouched shows that most differences concern the scores of the resulting strings rather than the amino acid sequences themselves. In particular, this sometimes slightly alters the relative order of the strings in the output list (sorted by decreasing score). Yet this preprocessing step sometimes helps to retrieve a longer correct fragment of a protein sequence, as the following example demonstrates. The 164th de novo string generated from the CAH2 dataset by our method using 4-tags is “AQVAQALTTLGTNGTAGVVF”; its blue prefix represents a reversed sequence fragment of the contaminant protein flavin reductase (NADPH). But if the water-loss ions are kept, its counterpart (the 179th in the respective output list) is the string “AQVQAALTTLGTNGTAGVVF”, in which the relative order of the 4th and 5th amino acids is incorrect. The prefix “AQVAQALT” originates from a CID and an HCD spectrum (scans 2,009 and 2,129, respectively), and “AQVQAALT” is entirely contained in a single CID spectrum (scan 2,094); the peaks that gave rise to their underlying tags are listed in Table 2. The peak at mass 5,729.057 Da in a preprocessed HCD spectrum, which separates ‘L’ and ‘T’ in the tag labeled with “QALT”, appears due to reflection of the original peak at mass 1,772.019 Da; however, if water-loss ions are removed, two peaks with mass roughly 1,754.007 Da get merged into it prior to reflection, increasing its intensity from 6,628.283 to 20,637.91. Either prefix under consideration is spelled out by a path comprising 5 vertices from the respective connected component of the T-Bruijn graph constructed for 4-tags.
The score of the former with and without water-loss ion elimination is 198,643.635 and 170,624.379, respectively, while the score of the latter is 180,625.809 in both cases. Moreover, the peak at mass 1,772.019 Da in a preprocessed HCD spectrum contributes to three more 4-tags extracted from the latter (namely, the ones labeled with “ALTT”, “LTTL”, and “TTLG”), and an increase in its intensity further augments the cost of the path producing the better de novo interpretation, which will thus be selected as the optimal one.
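The merging of water-loss ions into their unmodified counterparts can be sketched as follows. This is an illustration under stated assumptions, not the authors' preprocessing code: the function name, the 4 mDa matching tolerance, and the rule that intensities are summed are assumptions, and the two water-loss intensities are a made-up split chosen so that the total reproduces the intensity increase quoted in the text.

```python
WATER_MASS = 18.010565  # monoisotopic mass of H2O in Da


def merge_water_loss(peaks, tol=0.004):
    """Fold each water-loss peak into its parent peak.

    `peaks` is a list of (mass, intensity) pairs. A peak at mass
    M - WATER_MASS (within `tol`) is treated as a water loss of the peak
    at mass M: its intensity is added to the parent and it is dropped.
    """
    ordered = sorted(peaks)                          # ascending mass
    intensity = [inten for _, inten in ordered]
    absorbed = [False] * len(ordered)
    for i, (parent_mass, _) in enumerate(ordered):
        for j in range(i):                           # only lighter peaks qualify
            if not absorbed[j] and abs(parent_mass - WATER_MASS - ordered[j][0]) <= tol:
                intensity[i] += intensity[j]
                absorbed[j] = True
    return [(mass, intensity[k])
            for k, (mass, _) in enumerate(ordered) if not absorbed[k]]


# Parent peak and intensity from the text; the two water-loss peaks near
# 1,754.007 Da carry hypothetical intensities that sum to the reported gain.
spectrum = [(1772.019, 6628.283), (1754.007, 7000.0),
            (1754.009, 7009.627), (1500.0, 10.0)]
merged = merge_water_loss(spectrum)
```

After merging, only the unrelated peak at 1,500 Da and the strengthened parent peak remain, and the higher intensity of the parent is what raises the score of the paths passing through it.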

Deconvolution: MS-Deconv vs Xtract

In order to verify whether tags and de novo sequences of appropriate quality can be derived from spectra deconvoluted with the Xtract algorithm provided by the instrument vendor, we compared the results of our approach obtained upon deconvolution with MS-Deconv, either followed or not by peak reflection, with those obtained upon deconvolution with Thermo Xtract 3.0 using a signal-to-noise ratio of 1 (default for MS-Deconv) or 2 (default for Xtract). The maximum charge state for Xtract was set to 30 (the default value for MS-Deconv), and the other parameters were either default for Xtract 3.0 or determined by it automatically. The lists of the de novo strings generated thereby are provided in the supplementary file MS-Deconv vs Xtract.xls. Since Xtract may combine the input spectra, peak reflection should not be applied to the deconvoluted spectra it produces. On the other hand, if the spectra deconvoluted with Xtract and MS-Deconv were equally appropriate as input for the proposed approach, then the number of 4-tags derived from the Xtract-deconvoluted spectra would lie between the numbers of those obtained from the spectra deconvoluted with MS-Deconv with and without subsequent peak reflection. Moreover, the same would apply to the tag accuracy, and similar observations would hold for the de novo strings generated at the next stage.

As Table 3 illustrates, this is indeed the case for the number of both 4-tags and the corresponding de novo strings. In addition, for ε = 4 mDa, the longest correct fragment of the CAH2 sequence retrieved from the Xtract-deconvoluted spectra has length 41 (see the 5th de novo string for s/n of 1 or 2), while its longest correct subsequences derived from the spectra deconvoluted with MS-Deconv, with subsequent peak reflection, have length 19 (see the 7th, 94th, and 145th de novo strings). At the same time, the fraction of de novo strings containing a correct 4-mer (i.e., one occurring in the sequence of a target protein or contaminant, possibly in reversed form) is 65.77% and 66.92% for Xtract run with s/n 1 and 2, respectively, while for MS-Deconv it is 84.75% and 83.54% with and without peak reflection, respectively. This discrepancy is notably larger for the light chain of alemtuzumab: if MS-Deconv is applied, the share of de novo strings having a correct 4-mer is 77.95% and 77.05% with and without peak reflection, respectively, whereas in the case of Xtract it is as small as 39.84% and 56.17% for s/n of 1 and 2, respectively. Otherwise, with varying mass tolerance, Twister behaves similarly on both MS-Deconv and Xtract deconvoluted spectra, implying that the same strategies for selecting this parameter should be applicable in either case.

Sequence tag generation: a comparison with ProSight PTM

The key idea of the Twister approach is that it combines short tags derived from distinct spectra in order to come up with a set of de novo interpretations for the entire dataset rather than for individual spectra. This is why it cannot be directly compared with other existing algorithms. However, to illustrate the benefits of applying an ultra-low constant mass tolerance at the time of tag generation, we compared the long tags extracted by Twister from spectrum graphs constructed for individual spectra (i.e., the ones from which the k-tags are subsequently obtained, as described above) with the tags generated by ProSight PTM 2.0. The input spectra were deconvoluted with MS-Deconv and Xtract 3.0 prior to being passed to Twister and ProSight PTM 2.0, respectively. Twister applied no peak reflection at the preprocessing stage. Xtract was run with a signal-to-noise ratio of 2 and a maximum charge state of 30. Taking into account that Xtract may combine the given spectra, and aiming to provide a fair comparison, we restricted our attention to 5 spectra from each of the five datasets (namely, the ETD, HCD, and CID datasets for CAH2, and the ETD and HCD datasets for alemtuzumab) from which Twister produced one or a few tags matching the sequence of a target protein or contaminant, and such that both the tag offset and the precursor mass of the spectrum were consistent with some fragment of the corresponding protein sequence; each of those 25 spectra was then deconvoluted separately. The Sequence Tag Tolerance parameter of ProSight PTM was set to 10 ppm. The sets of tags of length at least 4 obtained in this way with either tool are listed in the supplementary file Sequence tags.xls.

The overall tendency is that both approaches retrieve approximately the same correct fragments of the underlying protein sequence, with one sometimes extracting slightly longer tags than the other. However, Twister typically outputs fewer tags than ProSight PTM, and the difference can be drastic: e.g., for the CAH2-CID spectrum 2,089, Twister produced 4 tags, and the two longest ones (each being correct) fully cover the sequence fragments contained in the 166 tags generated by ProSight PTM (almost all of which are incorrect). In addition, it should be noted that among the total of 6 tags obtained with Twister that were not classified as (fully) correct, four contain an amino acid ‘Q’ instead of a dimer “AG” or “GA” with an identical mass, and match the respective protein sequence otherwise.

Contaminant detection and identification

By means of a BLAST (41) search against the non-redundant database for the de novo sequences derived from 4-tags, we identified 16 and 4 contaminants in the CAH2 and alemtuzumab datasets, respectively. In the CAH2 sample, 10 contaminants were native (i.e., coming from bovine erythrocytes), while 6 were extraneous; the latter belonged to a mouse, Shewanella, or yeast organism, and appeared as carry-over proteins from previous runs on the same instrument or as contaminants on the column. The alemtuzumab sample was contaminated with two Synechococcus ribosomal proteins, one mouse protein, and a human lysozyme.
The list of contaminant proteins for either sample is provided in the supplementary file Twister contaminants.xls; moreover, for each contaminant, we indicate the de novo fragments of length at least 7 that matched the respective contaminant sequence but no other one among those being considered (and neither did their reversed copies). Additional evidence for the identification was provided by shorter matching fragments. All the ambiguities in the BLAST search results could be resolved by analyzing the mass offsets of the de novo strings, except for the human lysozyme, which can potentially be present in a mutated form. Further details are provided in the supplementary materials.

Discussion

We have introduced a method for retrieving, from a set of top-down tandem mass spectra, accurate fragments of the sequences of proteins present in the sample. It first generates from the deconvoluted and preprocessed spectra a set of high-quality sequence tags of a fixed length k using an ultra-low constant mass tolerance, and then combines the tags that appear consecutively in the sequence of an underlying protein into longer amino acid strings. The larger k is, the more reliable each individual k-tag is, yet the smaller the total number of tags and of the resulting de novo strings (see Table 1). However, having examined the values of k from 3 to 6, we concluded that the de novo sequence fragments are typically reported for each k not exceeding their length, and the quality of those of length at least 4 thereby does not vary much, while some strings of length 3 (immediately corresponding to the 3-tags that could not be extended) may be suspect. Therefore, we suggest using 4-tags to obtain an initial set of de novo strings and, if needed, subsequently enlarging this set with extra sequences that could be derived solely from 3-tags, or additionally validating its strings by searching for their counterparts produced from longer tags.

The second task is accomplished by constructing a T-Bruijn graph, which constitutes our proposed generalization of the concept of an A-Bruijn graph (20) to the case of tags, and extracting from it optimal paths that spell out the target amino acid sequences. Since each tag supporting such a path must agree very well with the ones that precede and follow it in the path, erroneous tags are particularly unlikely to appear in its middle. As a consequence, the obtained de novo strings typically represent correct sequence fragments, to some of which one or a few erroneous amino acids are appended at one or both ends. It should be noted, however, that such spurious amino acids do not necessarily arise from an incorrect interpretation of a spectrum: alternatively, they may originate from a low-abundance proteoform, from which only a few spectra were collected, or reflect post-translational modifications (recall that the N-terminal methionine truncation and serine acetylation in CAH2 manifested themselves as a terminal ‘E’ in the de novo fragments generated by Twister).

In particular, several of the derived de novo sequences happen to originate from contaminants, and are long enough to enable detection and identification of contaminant proteins by means of a BLAST (41) search against the non-redundant database. Since contaminants can be present in a commercially purified sample only in small quantities, this points to the extreme sensitivity of the suggested method, along with its potential applicability to the analysis of mutations, splice variants, and post-translational modifications (PTMs) of proteins.

The de novo fragments derived from a T-Bruijn graph can already be quite long: in our experiments, the length of the retrieved subsequences sometimes almost reached 60 amino acids. However, it is possible to further group them together, thereby merging the fragments that, though overlapping in the protein sequence, did not match each other in terms of mass offsets. The latter would be the case, in particular, for overlapping fragments retrieved from spectra acquired using different technologies that produce distinct types of fragment ions, and this effect could be observed for the ETD and CID/HCD spectra together composing the CAH2 dataset.
Alternatively, such strings may appear due to internal fragment ions sometimes observed in CID- and HCD-MS1 spectra, which give rise to C-terminal product ions with a mass 18 Da smaller than that of a corresponding y-ion (however, k-tags defined by ladders of such ions are readily retrieved by our algorithm, and further combined with matching k-tags from the spectra whose precursors are internal fragments with the C-terminus at the same position). These specific issues can potentially be addressed at the time of T-Bruijn graph construction or after the de novo strings have been obtained, by identifying and merging tags or strings that agree in terms of the amino acid sequence and are endowed with offsets that differ from each other accordingly.

Future developments of the proposed approach include handling of ±1 Da errors commonly introduced at the deconvolution stage, and identifying groups of de novo strings acquired from the same protein and appropriately combining them. Another promising direction is its adaptation to more complex samples, including highly modified proteins and protein mixtures.
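Two mass coincidences mentioned in this paper, an ‘E’ reported in place of an acetylated serine and a ‘Q’ reported in place of the dimer “AG” or “GA”, can be checked with a few lines of arithmetic. The sketch below uses standard monoisotopic residue masses and the mass of an acetyl group (C2H2O); the helper name and the 4 mDa default are choices made for this illustration.

```python
# Standard monoisotopic residue masses in Da.
MONO = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
        "Q": 128.05858, "K": 128.09496, "E": 129.04259}
ACETYL = 42.01056  # mass added by acetylation (C2H2O), in Da


def within(mass_a, mass_b, tol_mda=4.0):
    """True if the two masses agree within a constant tolerance given in mDa."""
    return abs(mass_a - mass_b) <= tol_mda / 1000.0


acetyl_serine = MONO["S"] + ACETYL       # acetylated serine residue
ala_gly = MONO["A"] + MONO["G"]          # the dimer AG (or GA)

e_matches = within(acetyl_serine, MONO["E"])
q_matches = within(ala_gly, MONO["Q"])
k_matches = within(MONO["K"], MONO["Q"])
```

Both coincidences are exact up to rounding because the pairs share the same elemental composition, so no mass tolerance, however small, can separate them; by contrast, K and Q differ by roughly 36 mDa and are told apart at a tolerance of a few mDa.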

Associated content

Supporting Information Available: the lists of de novo sequences generated from the CAH2 and alemtuzumab top-down datasets (1) for the mass tolerance from 1 to 10 mDa using 4-tags, (2) for the tag length from 3 to 6 at the mass tolerance of 4 mDa, and (3) with and without peak reflection at the preprocessing stage, and from the spectra deconvoluted with Xtract at signal-to-noise ratio 1 and 2, respectively, for 4-tags at the mass tolerance of 4 mDa; the lists of sequence tags generated by Twister and ProSight PTM 2.0 from a sample set of spectra; the lists of contaminants detected and identified in either dataset; statistics on the amino acids defined by pairs of masses deconvoluted from fragment ions (1) separated by a delta m/z from a fixed range, and (2) with the charge states differing by a fixed delta z, for the mass tolerance from 1 to 10 mDa, and 100 mDa, for either dataset; statistics on the amino acids derived (1) in total, (2) from pairs of fragment ions separated by less than 200 units in the m/z space, and (3) from pairs of fragment ions with the charge states differing by at most 1, for the mass tolerance from 1 to 10 mDa, and 100 mDa, for either dataset; statistics on the error in the observed mass of an amino acid defined by a pair of fragment ions separated by a delta m/z from a fixed range, for the mass tolerance of 4 and 100 mDa, for either dataset. This material is available free of charge via the Internet at http://pubs.acs.org.


Acknowledgements

The research by K.V. and P.A.P. was partially supported by the Government of the Russian Federation (grant 11.G34.31.0018, until December 2014) and the Russian Science Foundation (grant 14-50-00069, since February 2015). L.D. and M.V. are financially supported by the Netherlands Organization for Scientific Research (NWO), Zenith grant 93511034. We are grateful to Yury Tsybin for insightful remarks, and to Vitali Boitsov, Ivan Terterov, and Sergey Vyazmin for fruitful discussions.

References

(1) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198–207.
(2) Kelleher, N. L. Top down proteomics. Analytical Chemistry 2004, 76, 197A–203A.
(3) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry 2003, 17, 2337–42.
(4) Frank, A.; Pevzner, P. PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling. Analytical Chemistry 2005, 77, 964–973.
(5) Chi, H.; Chen, H.; He, K.; Wu, L.; Yang, B.; Sun, R.-X.; Liu, J.; Zeng, W.-F.; Song, C.-Q.; He, S.-M.; Dong, M.-Q. pNovo+: De Novo Peptide Sequencing Using Complementary HCD and ETD Tandem Mass Spectra. Journal of Proteome Research 2013, 12, 615–625.
(6) Taylor, J. A.; Johnson, R. S. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Communications in Mass Spectrometry 1997, 11, 1067–1075.


(7) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. De Novo Peptide Sequencing via Tandem Mass Spectrometry. Journal of Computational Biology 1999, 6, 327–342.
(8) Bandeira, N.; Tang, H.; Bafna, V.; Pevzner, P. Shotgun Protein Sequencing by Tandem Mass Spectra Assembly. Analytical Chemistry 2004, 76, 7221–7233.
(9) Bandeira, N.; Clauser, K. R.; Pevzner, P. A. Shotgun Protein Sequencing: Assembly of Peptide Tandem Mass Spectra from Mixtures of Modified Proteins. Molecular and Cellular Proteomics 2007, 6, 1123–1134.
(10) Bandeira, N.; Pham, V.; Pevzner, P.; Arnott, D.; Lill, J. R. Automated de novo protein sequencing of monoclonal antibodies. Nature Biotechnology 2008, 26, 1336–1338.
(11) Liu, X.; Han, Y.; Yuen, D.; Ma, B. Automated Protein (Re)Sequencing with MS/MS and a Homologous Database Yields Almost Full Coverage and Accuracy. Bioinformatics 2009, 25, 2174–2180.
(12) Castellana, N. E.; Pham, V.; Arnott, D.; Lill, J. R.; Bafna, V. Template Proteogenomics: Sequencing Whole Proteins Using an Imperfect Database. Molecular and Cellular Proteomics 2010, 9, 1260–1270.
(13) Savitski, M.; Nielsen, M. L.; Zubarev, R. A. New data base-independent, sequence tag-based scoring of peptide MS/MS data validates Mowse scores, recovers below threshold data, singles out modified peptides, and assesses the quality of MS/MS techniques. Molecular and Cellular Proteomics 2005, 4, 1180–8.
(14) Datta, R.; Bern, M. In Research in Computational Molecular Biology; Vingron, M., Wong, L., Eds.; Lecture Notes in Computer Science; Springer Berlin Heidelberg, 2008; Vol. 4955; pp 140–153.


(15) Bertsch, A.; Leinenbach, A.; Pervukhin, A.; Lubeck, M.; Hartmer, R.; Baessmann, C.; Elnakady, Y. A.; Müller, R.; Böcker, S.; Huber, C. G.; Kohlbacher, O. De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation. Electrophoresis 2009, 30, 3736–3747.
(16) He, L.; Ma, B. ADEPTS: Advanced Peptide De Novo Sequencing with a Pair of Tandem Mass Spectra. Journal of Bioinformatics and Computational Biology 2010, 08, 981–994.
(17) Guthals, A.; Clauser, K. R.; Frank, A. M.; Bandeira, N. Sequencing-Grade De novo Analysis of MS/MS Triplets (CID/HCD/ETD) From Overlapping Peptides. Journal of Proteome Research 2013, 12, 2846–2857.
(18) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Automated de novo sequencing of proteins by tandem high-resolution mass spectrometry. Proceedings of the National Academy of Sciences 2000, 97, 10313–10317.
(19) Liu, X.; Dekker, L.; Wu, S.; Vanduijn, M. M.; Luider, T. M.; Tolić, N.; Dvorkin, M.; Alexandrova, S.; Vyatkina, K.; Paša-Tolić, L.; Pevzner, P. A. De Novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra. Journal of Proteome Research 2014, 13, 3241–3248.
(20) Pevzner, P. A.; Tang, H.; Tesler, G. De novo repeat classification and fragment assembly. Genome Research 2004, 14, 1786–1796.
(21) Mann, M.; Wilm, M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Analytical Chemistry 1994, 66, 4390–4399.
(22) Taylor, J. A.; Johnson, R. S. Implementation and Uses of Automated de Novo Peptide Sequencing by Tandem Mass Spectrometry. Analytical Chemistry 2001, 73, 2594–2604.


(23) Tabb, D. L.; Saraf, A.; Yates, J. R. GutenTag: high-throughput sequence tagging via an empirically derived fragmentation model. Analytical Chemistry 2003, 75, 6415–6421.
(24) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. MultiTag: multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Analytical Chemistry 2003, 75, 1307–1315.
(25) Searle, B. C.; Dasari, S.; Turner, M.; Reddy, A. P.; Choi, D.; Wilmarth, P. A.; McCormack, A. L.; David, L. L.; Nagalla, S. R. High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. Analytical Chemistry 2004, 76, 2220–2230.
(26) Frank, A.; Tanner, S.; Bafna, V.; Pevzner, P. Peptide sequence tags for fast database search in mass-spectrometry. Journal of Proteome Research 2005, 4, 1287–1295.
(27) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Analytical Chemistry 2005, 77, 4626–39.
(28) Cao, X.; Nesvizhskii, A. I. Improved Sequence Tag Generation Method for Peptide Identification in Tandem Mass Spectrometry. Journal of Proteome Research 2008, 7, 4422–4434.
(29) Na, S.; Jeong, J.; Park, H.; Lee, K.-J.; Paek, E. Unrestrictive identification of multiple post-translational modifications from tandem mass spectrometry using an error-tolerant algorithm based on an extended sequence tag approach. Molecular and Cellular Proteomics 2008, 7, 2452–2463.
(30) Shen, Y.; Tolić, N.; Hixson, K. K.; Purvine, S. O.; Anderson, G. A.; Smith, R. D. De novo sequencing of unique sequence tags for discovery of post-translational modifications of proteins. Analytical Chemistry 2008, 80, 7742–7754.


(31) Tabb, D. L.; Ma, Z.-Q.; Martin, D. B.; Ham, A.-J. L.; Chambers, M. C. DirecTag: Accurate Sequence Tags from Peptide MS/MS through Statistical Scoring. Journal of Proteome Research 2008, 7, 3838–3846.
(32) Pan, C.; Park, B.; McDonald, W.; Carey, P.; Banfield, J.; VerBerkmoes, N.; Hettich, R.; Samatova, N. A high-throughput de novo sequencing approach for shotgun proteomics using high-resolution tandem mass spectrometry. BMC Bioinformatics 2010, 11, 118.
(33) Liu, W.-T.; Kersten, R. D.; Yang, Y.-L.; Moore, B. S.; Dorrestein, P. C. Imaging Mass Spectrometry and Genome Mining via Short Sequence Tagging Identified the Anti-Infective Agent Arylomycin in Streptomyces roseosporus. Journal of the American Chemical Society 2011, 133, 18010–18013.
(34) Kersten, R. D.; Yang, Y.-L.; Xu, Y.; Cimermancic, P.; Nam, S.-J.; Fenical, W.; Fischbach, M. A.; Moore, B. S.; Dorrestein, P. C. Natural Product Peptidogenomics: A Mass Spectrometry-guided Genome Mining Approach. Nature Chemical Biology 2011, 7, 667–673.
(35) LeDuc, R. D.; Taylor, G. K.; Kim, Y.-B.; Januszyk, T. E.; Bynum, L. H.; Sola, J. V.; Garavelli, J. S.; Kelleher, N. L. ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry. Nucleic Acids Research 2004, 32, W340–W345.
(36) Zamdborg, L.; LeDuc, R. D.; Glowacz, K. J.; Kim, Y.-B.; Viswanathan, V.; Spaulding, I. T.; Early, B. P.; Bluhm, E. J.; Babai, S.; Kelleher, N. L. ProSight PTM 2.0: improved protein identification and characterization for top down mass spectrometry. Nucleic Acids Research 2007, 35, W701–W706.
(37) Matthiesen, R., et al. Mass Spectrometry Data Analysis in Proteomics; Methods in Molecular Biology; Humana Press, 2013.


(38) Dekker, L.; Wu, S.; Vanduijn, M.; Tolić, N.; Stingl, C.; Zhao, R.; Luider, T.; Paša-Tolić, L. An integrated top-down and bottom-up proteomic approach to characterize the antigen binding fragment of antibodies. Proteomics 2014, 14, 1239–1248.
(39) Liu, X.; Inbar, Y.; Dorrestein, P. C.; Wynne, C.; Edwards, N.; Souda, P.; Whitelegge, J. P.; Bafna, V.; Pevzner, P. A. Deconvolution and Database Search of Complex Tandem Mass Spectra of Intact Proteins: A Combinatorial Approach. Molecular and Cellular Proteomics 2010, 9, 2772–2782.
(40) Liu, X.; Sirotkin, Y.; Shen, Y.; Anderson, G.; Tsai, Y. S.; Ting, Y. S.; Goodlett, D. R.; Smith, R. D.; Bafna, V.; Pevzner, P. A. Protein Identification Using Top-Down Spectra. Molecular and Cellular Proteomics 2012, 11.
(41) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 1990, 215, 403–410.


Tables

Table 1: Statistics on k-tags and the respective T-Bruijn graphs for the top-down datasets acquired from CAH2 and the Fab region of alemtuzumab, for k from 3 to 6.

     CAH2                               alemtuzumab
     k-tags   T-Bruijn graph            k-tags    T-Bruijn graph
k             vertices   components               vertices   components
3    6,753    4,755      1,509          32,492    17,849     4,551
4    4,152    3,031      905            21,153    11,948     2,757
5    2,652    1,992      594            14,535    8,388      1,773
6    1,702    1,313      384            10,520    6,125      1,193
without peak reflection
3    3,345    1,737      513            16,338    4,750      1,160
4    2,022    1,117      322            10,510    3,140      732
5    1,266    737        219            7,152     2,149      470
6    793      485        146            5,144     1,505      301

Table 2: The peaks from three spectra acquired from CAH2 that contributed to the de novo strings AQVAQALTTLGTNGTAGVVF and AQVQAALTTLGTNGTAGVVF generated upon and without water-loss ions removal, respectively, before and after preprocessing (i.e. optional water-loss ions elimination and peak reflection); here, the fragments matching the contaminant protein flavin reductase (NADPH) are highlighted in blue, and the incorrectly ordered neighbor amino acids in the latter string are underlined. WL: water loss. The masses and intensities of the water-loss peaks from the initial HCD spectrum are marked with an asterisk (bold in the original).

Spectrum ID 2,094; fragmentation CID; PM 7,501.083 Da
initially                    contributing peaks after preprocessing (WLs kept / WLs removed)
mass         intensity       mass         intensity
2,453.398    8,970.987       5,047.684    8,970.987
2,382.361    2,513.201       5,118.721    2,513.201
2,254.302    4,564.518       5,246.780    4,564.518
2,155.234    14,161.235      5,345.848    27,682.002
2,155.234    13,520.768
2,027.176    759.667         5,473.907    759.677
1,956.138    3,400.817       5,544.944    3,400.817
1,885.102    1,547.220       5,615.980    3,360.220
1,885.102    1,813.001
1,772.018    5,951.344       5,729.064    5,951.344
1,670.972    2,821.843       5,830.110    2,821.843

Spectrum ID 2,099; fragmentation CID; PM 7,501.080 Da
initially                    contributing peaks after preprocessing (WLs kept / WLs removed)
mass         intensity       mass         intensity
2,453.397    2,925.404       5,047.683    2,925.404
2,382.359    1,405.900       5,118.721    1,405.900
2,254.299    1,561.695       5,246.780    1,561.695
2,155.234    3,871.576       5,345.846    6,773.574
2,155.234    2,991.998
2,084.196    1,658.160       5,416.884    1,658.160
1,956.138    1,525.629       5,544.942    2,630.081

Spectrum ID 2,129; fragmentation HCD; PM 7,501.076 Da
initially                    contributing peaks after preprocessing (WLs kept / WLs removed)
mass         intensity       mass         intensity
2,254.303    1,305.932       5,246.773    4,716.812
2,254.302    3,410.880
2,155.235    10,102.753      5,345.841    16,624.544
2,155.235    6,521.792
2,084.197    7,713.998       5,416.879    10,201.464
2,084.198    2,487.467
1,956.138    7,373.091       5,544.937    10,845.030
1,956.138    3,471.940
1,885.102    8,354.726       5,615.973    8,354.726
1,772.019    6,628.283       5,729.057    20,637.910
1,754.007*   7,514.022*
1,754.007*   6,495.606*
1,670.970    2,844.098       5,830.106    2,844.098


Table 3: Statistics on the 4-tags and de novo strings generated from MS-Deconv and Xtract deconvoluted spectra for the CAH2 and alemtuzumab datasets, for the mass tolerance from 1 to 10 mDa. s/n: signal-to-noise ratio; LC/HC: light/heavy chain. For each dataset, the columns give the number of 4-tags, the total number of de novo strings, the number (#) and percentage (%) of de novo strings with a correct 4-mer, and the coverage (%).

         CAH2                                             alemtuzumab
ε (mDa)  4-tags   total    #      %       coverage (%)    4-tags    total    #       %       coverage LC (%)   coverage HC (%)
MS-Deconv, with peak reflection
1        2,718    727      642    88.31   57.31           6,797     1,864    1,636   87.77   75.23             58.77
2        3,523    854      744    87.12   65.00           14,532    2,525    2,148   85.07   82.24             62.72
3        3,875    884      759    85.86   66.15           18,770    2,648    2,180   82.33   83.18             67.11
4        4,152    905      767    84.75   66.54           21,153    2,757    2,149   77.95   82.71             68.86
5        4,324    926      771    83.26   69.23           22,638    2,797    2,123   75.90   84.11             71.49
6        4,453    968      787    81.30   69.23           23,753    2,866    2,093   73.03   85.05             69.74
7        4,561    988      791    80.06   69.23           24,511    2,897    2,076   71.66   85.05             68.86
8        4,701    1,009    795    78.79   69.23           25,126    2,940    2,040   69.39   85.05             68.42
9        4,787    1,038    789    76.01   69.23           25,722    2,965    1,998   67.39   85.05             68.42
10       4,928    1,053    793    75.31   69.23           26,309    3,003    1,950   64.94   85.98             68.42
MS-Deconv, without peak reflection
1        1,379    269      237    88.10   56.54           3,370     501      424     84.63   74.77             57.02
2        1,760    299      255    85.28   63.85           7,261     635      528     83.15   81.78             60.09
3        1,927    315      266    84.44   64.23           9,331     685      556     81.17   82.71             63.60
4        2,022    322      269    83.54   64.62           10,510    732      564     77.05   82.71             65.35
5        2,084    332      276    83.13   66.54           11,248    766      584     76.24   82.71             67.54
6        2,147    351      283    80.63   66.92           11,824    817      592     72.46   83.64             66.23
7        2,203    358      286    79.89   66.92           12,231    842      596     70.78   83.64             67.54
8        2,266    366      288    78.69   65.00           12,550    880      603     68.52   83.64             67.54
9        2,315    380      285    75.00   65.00           12,862    910      598     65.71   83.64             66.67
10       2,379    389      293    75.32   65.00           13,197    936      596     63.68   83.18             66.67
Xtract, s/n=1
1        2,186    384      309    80.47   63.08           5,649     725      464     64.00   78.04             60.09
2        2,649    423      333    78.72   63.85           8,038     914      535     58.53   85.05             68.42
3        2,834    469      349    74.41   65.38           9,698     1,193    577     48.37   89.25             65.35
4        2,990    533      354    66.42   65.77           11,018    1,516    604     39.84   88.79             69.74
5        3,149    588      375    63.78   65.00           12,095    1,846    620     33.59   88.79             68.42
6        3,284    648      388    59.88   70.77           13,509    1,980    613     30.96   88.32             60.53
7        3,402    700      404    57.71   68.08           14,269    2,032    629     30.95   88.32             62.72
8        3,404    697      390    55.95   68.08           14,737    1,972    617     31.29   87.38             67.54
9        3,475    720      385    53.47   67.69           15,306    1,981    612     30.89   88.79             67.54
10       3,544    725      376    51.86   67.69           15,818    1,942    603     31.05   90.19             69.30
Xtract, s/n=2
1        2,096    356      294    82.58   64.62           5,018     602      432     71.76   79.44             61.40
2        2,508    384      307    79.95   66.54           7,169     734      495     67.44   82.71             64.04
3        2,701    405      317    78.27   68.08           8,622     835      524     62.75   82.71             64.04
4        2,798    443      326    73.59   66.92           9,833     972      546     56.17   83.18             62.28
5        2,902    473      327    69.13   66.92           10,725    1,101    561     50.95   84.11             66.67
6        2,967    509      347    68.17   68.08           11,667    1,186    556     46.88   83.18             61.84
7        3,052    548      362    66.06   64.62           12,217    1,253    563     44.93   84.58             63.16
8        3,060    557      353    63.38   62.31           12,583    1,239    561     45.28   84.58             62.72
9        3,121    562      341    60.68   61.92           12,912    1,275    567     44.47   85.05             67.54
10       3,167    562      334    59.43   61.92           13,259    1,291    565     43.76   84.58             60.96


Figures


Figure 1: Preprocessing of a CID or HCD spectrum acquired from a toy “protein” with the amino acid sequence DFAEHPTGK and precursor mass PM = 1,000 Da: a) the initial (deconvoluted) spectrum; b) adding auxiliary peaks; c) merging close peaks; d) removing water-loss ions; e) reflecting peaks. f) The resulting spectrum contains four 3-tags t_KGT, t_AEH, t_HEA, and t_TGK with the offsets 18, 262, 401, and 696 Da, respectively, where t_s stands for a tag with the amino acid sequence s. Of those, only t_AEH can be derived from the original spectrum. g) The auxiliary spectrum S(t_HEA) consists of four peaks with the masses 401, 538, 667, and 738 Da and intensities I_401 = 18, I_538 = 30, I_667 = 35, and I_738 = 20, respectively; thus, Score(t_HEA) = I_401 + I_538 + I_667 + I_738 = 103.
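The peak reflection and tag scoring described in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes integer residue masses (as in the toy example) and exact mass matching rather than a mass tolerance; the function names `reflect_masses`, `tag_peak_masses`, and `tag_score` are hypothetical and are not taken from the authors' software.

```python
# Integer residue masses (Da) for the residues that occur in the four toy tags.
# Rounded values are an assumption made for the toy example only.
RESIDUE_MASS = {"A": 71, "E": 129, "G": 57, "H": 137, "K": 128, "T": 101}


def reflect_masses(masses, precursor_mass):
    """Mirror each peak mass m to precursor_mass - m (peak reflection)."""
    return [precursor_mass - m for m in masses]


def tag_peak_masses(offset, sequence):
    """Masses of the k+1 peaks that define a k-tag starting at `offset`."""
    masses = [offset]
    for residue in sequence:
        masses.append(masses[-1] + RESIDUE_MASS[residue])
    return masses


def tag_score(spectrum, offset, sequence):
    """Sum the intensities of the peaks defining the tag (exact matching assumed)."""
    return sum(spectrum[m] for m in tag_peak_masses(offset, sequence))


# Peaks of the auxiliary spectrum S(t_HEA) from the caption: mass -> intensity.
aux_spectrum = {401: 18, 538: 30, 667: 35, 738: 20}

reflected = reflect_masses([262, 304, 797], 1000)  # mirrored about PM = 1,000 Da
hea_masses = tag_peak_masses(401, "HEA")
hea_score = tag_score(aux_spectrum, 401, "HEA")
print(reflected, hea_masses, hea_score)
```

Under these assumptions, the reflected masses 738, 696, and 203 are the ones needed to complete t_HEA, t_TGK, and t_KGT, which is consistent with only t_AEH being derivable from the original spectrum, and the score of t_HEA evaluates to 103 as in the caption.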


Example 3-tags shown in the figure, grouped by source spectrum (offsets in parentheses): DST (100), STV (215), TVY (302), VYN (403); STV (215), TVL (302); VYN (203), YNP (302); NYV (221), YVT (335), AVT (427), VTS (498). Graph vertices (sequence, offset:multiplicity): DST 100:1, STV 215:2, TVY 302:1, TVL 302:1, VYN 403:1, VYN 203:1, YNP 302:1, NYV 221:1, YVT 335:1, AVT 427:1, VTS 498:1.

Figure 2: T-Bruijn graph construction, for the case of 3-tags. From each spectrum, a few 3-tags are extracted. The two tags with the same amino acid sequence STV and offset 215 are merged into a single tag that inherits both the sequence and the offset, and has multiplicity 2. The graph vertices are in one-to-one correspondence with the tags from the resulting set (for each vertex, the offset and multiplicity of its underlying tag are indicated). A directed edge is introduced between two vertices if their underlying 3-tags are presumed to correspond to the length-3 prefix and suffix of the same 4-mer from a target protein sequence: for example, this is the case for the tags with the amino acid sequences DST and STV and offsets 100 and 215 = 100 + m(D), respectively.
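The construction described in the Figure 2 caption can be sketched in Python as follows. This is an illustration under simplifying assumptions: residue masses are rounded to integers and offsets are compared for exact equality, whereas real spectra would require matching within a mass tolerance. The function name `build_graph` and the data layout are hypothetical.

```python
from collections import Counter

# Integer residue masses (Da); a simplification used only for this example.
RESIDUE_MASS = {"A": 71, "D": 115, "L": 113, "N": 114, "P": 97,
                "S": 87, "T": 101, "V": 99, "Y": 163}

# (sequence, offset) pairs of the 3-tags extracted from the four spectra in Figure 2.
tags = [
    ("DST", 100), ("STV", 215), ("TVY", 302), ("VYN", 403),
    ("STV", 215), ("TVL", 302),
    ("VYN", 203), ("YNP", 302),
    ("NYV", 221), ("YVT", 335), ("AVT", 427), ("VTS", 498),
]


def build_graph(tags):
    """Merge identical tags, then link tags that overlap as prefix and suffix of a (k+1)-mer."""
    multiplicity = Counter(tags)  # vertex -> number of tags merged into it
    edges = []
    for (seq_a, off_a) in multiplicity:
        for (seq_b, off_b) in multiplicity:
            overlaps = seq_a[1:] == seq_b[:-1]
            # The second tag must start one residue (the first residue of seq_a) later.
            shifted = off_b == off_a + RESIDUE_MASS[seq_a[0]]
            if overlaps and shifted:
                edges.append(((seq_a, off_a), (seq_b, off_b)))
    return multiplicity, edges


vertices, edges = build_graph(tags)
print(len(vertices), "vertices,", len(edges), "edges")
```

With the tags from the figure, the two copies of STV at offset 215 collapse into one vertex of multiplicity 2, and DST (100) is linked to STV (215) because 100 + m(D) = 215. The two VYN tags stay separate vertices because their offsets differ.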

a)
  1 MSHHWGYGKH NGPEHWHKDF PIANGERQSP VDIDTKAVVQ DPALKPLALV YGEATSSRMV 60
 61 NNGHSFNVEY DDSQDKAVLK DGPLTGTYRL VQFHFHWGSS DDQGSEHTVD RKKYAAELHL 120
121 VHWNTKYGDF GTAAQQPDGL AVVGVFLKVG DANPALQKVL DALDSIKTKG KSTDFPNFDP 180
181 GSLLPNVLDY WTYPGSLTTP PLLESVTWIV LKEPISVSSQ QMLKFRTLNF NAEGEPELLM 240
241 LANWRPAQPL KNRQVRGFPK 260

b)
  1 DIQMTQSPSS LSASVGDRVT ITCKASQNID KYLNWYQQKP GKAPKLLIYN TNNLQTGVPS 60
 61 RFSGSGSGTD FTFTISSLQP EDIATYYCLQ HISRPRTFGQ GTKVEIKRTV AAPSVFIFPP 120
121 SDEQLKSGTA SVVCLLNNFY PREAKVQWKV DNALQSGNSQ ESVTEQDSKD STYSLSSTLT 180
181 LSKADYEKHK VYACEVTHQG LSSPVTKSFN RGEC 214

c)
  1 QVQLQESGPG LVRPSQTLSL TCTVSGFTFT DFYMNWVRQP PGRGLEWIGF IRDKAKGYTT 60
 61 EYNPSVKGRV TMLVDTSKNQ FSLRLSSVTA ADTAVYYCAR EGHTAAPFDY WGQGSLVTVS 120
121 SASTKGPSVF PLAPSSKSTS GGTAALGCLV KDYFPEPVTV SWNSGALTSG VHTFPAVLQS 180
181 SGLYSLSSVV TVPSSSLGTQ TYICNVNHKP SNTKVDKKVE PKSCDKTH 228

Figure 3: Coverage of the target protein sequences with the matching fragments of the de novo sequences derived from the respective T-Bruijn graphs for 4-tags: a) CAH2; b) the light chain of alemtuzumab; c) the Fd region of the heavy chain of alemtuzumab. For each sequence, the cleavage sites not matched by the annotated ions or appropriate auxiliary peaks from the spectra identified by MS-Align+ are labeled green.
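One plausible way to compute a sequence coverage percentage of the kind reported for Figure 3 is sketched below in Python. It assumes that coverage is the share of target residues lying inside at least one exact occurrence of a matching fragment; the source does not spell out the matching rules (for example, how isobaric residues such as L and I are treated), so this definition, the function name `coverage_percent`, and the example fragments are assumptions for illustration only.

```python
def coverage_percent(target, fragments):
    """Percentage of target positions covered by exact occurrences of the fragments."""
    covered = [False] * len(target)
    for fragment in fragments:
        start = target.find(fragment)
        while start != -1:
            for i in range(start, start + len(fragment)):
                covered[i] = True
            start = target.find(fragment, start + 1)  # look for further occurrences
    return 100.0 * sum(covered) / len(target)


# First 20 residues of the alemtuzumab light chain (Figure 3b) and three
# hypothetical matching fragments chosen only to exercise the function.
target = "DIQMTQSPSSLSASVGDRVT"
fragments = ["QMTQSP", "SPSSL", "GDRV"]
print(round(coverage_percent(target, fragments), 2))
```

Overlapping fragments are counted once per residue, so the two overlapping fragments QMTQSP and SPSSL together cover nine positions rather than eleven.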

