De Novo Protein Sequencing by Combining Top ... - ACS Publications

May 30, 2014 - Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States. ∥ Algo...
Article pubs.acs.org/jpr

De Novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra

Xiaowen Liu,*,†,‡ Lennard J. M. Dekker,¶ Si Wu,§ Martijn M. Vanduijn,¶ Theo M. Luider,¶ Nikola Tolić,§ Qiang Kou,† Mikhail Dvorkin,∥ Sonya Alexandrova,∥ Kira Vyatkina,∥ Ljiljana Paša-Tolić,§ and Pavel A. Pevzner*,⊥

† Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, 535 West Michigan Street, IT 475, Indianapolis, Indiana 46202, United States
‡ Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 410 West 10th Street, Suite 5000, Indianapolis, Indiana 46202, United States
¶ Department of Neurology, Erasmus University Medical Center, Postbus 2040, 3000 CA Rotterdam, The Netherlands
§ Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
∥ Algorithmic Biology Laboratory, Saint Petersburg Academic University, 8/3 Khlopina Str, St. Petersburg 194021, Russia
⊥ Department of Computer Science and Engineering, University of California, 9500 Gilman Drive, San Diego, California 92093, United States

Supporting Information

ABSTRACT: There are two approaches for de novo protein sequencing: Edman degradation and mass spectrometry (MS). Existing MS-based methods characterize a novel protein by assembling tandem mass spectra of overlapping peptides generated from multiple proteolytic digestions of the protein. Because each tandem mass spectrum covers only a short peptide of the target protein, the key to high-coverage protein sequencing is to find spectral pairs from overlapping peptides in order to assemble tandem mass spectra into long ones. However, overlapping regions of peptides may be too short to be confidently identified. High-resolution mass spectrometers have become accessible to many laboratories. These mass spectrometers are capable of analyzing molecules of large mass values, boosting the development of top-down MS. Top-down tandem mass spectra cover whole proteins. However, top-down tandem mass spectra, even combined, rarely provide full ion fragmentation coverage of a protein. We propose an algorithm, TBNovo, for de novo protein sequencing by combining top-down and bottom-up MS. In TBNovo, a top-down tandem mass spectrum is utilized as a scaffold, and bottom-up tandem mass spectra are aligned to the scaffold to increase sequence coverage. Experiments on data sets of two proteins showed that TBNovo achieved high sequence coverage and high sequence accuracy.



Journal of Proteome Research

Received: December 29, 2013

© XXXX American Chemical Society

dx.doi.org/10.1021/pr401300m | J. Proteome Res. XXXX, XXX, XXX−XXX

INTRODUCTION

In the past two decades, most researchers in mass spectrometry (MS)-based computational proteomics have focused on protein identification by searching tandem mass spectrometry (MS/MS) spectra against protein databases1,2 generated from gene sequences in genomes. However, the sequences of some novel proteins, which are of vital importance in drug design, are not included in protein databases generated from genomes. For example, trastuzumab (Herceptin) and alemtuzumab (MabCampath), which are monoclonal antibodies, have been successfully used on patients with breast cancer3 and graft-versus-host disease.4 Captopril, a venom-based drug, has been successfully used on patients with cardiovascular disease.5 The sequences of these proteins, which are essential for understanding their structures and functions, cannot be obtained from the genomes because the genomes are not available or the proteins are not directly inscribed in the genomes. As a result, de novo protein sequencing has become the method of choice for sequencing these proteins. Even when a protein sequence is known, de novo protein sequencing may discover novel proteoforms of the protein generated from unexpected mutations, splicing events, and post-translational modifications (PTMs).

Edman degradation6,7 and MS8 have been widely used for de novo protein sequencing. Compared with the traditional Edman degradation method, MS-based de novo protein sequencing is more economical and efficient.8 The two approaches can also be combined for protein sequencing.9,10

The history of using MS to sequence proteins dates back more than 20 years. Johnson and Biemann first manually sequenced thioredoxin using MS/MS in 1987.11 Several years later, the method of assembling mass spectra of overlapped peptides for sequencing peptides and proteins was proposed.12,13 In this approach, multiple proteases are employed to cleave the same protein sample separately. Because the proteases cleave the protein at different sites, the peptides generated from one protease may overlap with those from another protease. Merging spectral pairs from overlapping peptides consecutively assembles short peptide MS/MS spectra into long ones, from which a de novo protein sequence or several long contigs are obtained. This approach has been extensively studied in the last 10 years.14−18 The shotgun protein sequencing (SPS) method proposed by Bandeira et al.16 obtains over 90% coverage of target protein sequences and 95% accuracy of reported contigs or sequences.19

In some cases, even though the target protein sequence is unknown, its homologous protein sequences are available. Comparative shotgun protein sequencing (cSPS) combines homologous sequences and SPS to improve de novo sequence coverage and accuracy.8 In cSPS, contigs reported by SPS are aligned to a homologous sequence to find their relative positions and orders. If two contigs are aligned to an overlapping segment in the homologous sequence, they are “glued” together. Recently, Guthals et al. proposed meta-SPS, in which contigs reported by SPS are further assembled.20 The accuracy of contigs or sequences reported by meta-SPS is about 95%. Genome assemblies have also been used in place of homologous sequences in de novo protein sequencing to find proteoforms with unexpected splicing events.21 Because of the complexity of mass spectra and the difficulty of identifying short peptide overlaps, these bottom-up MS-based methods often report several contigs rather than one long sequence covering the whole protein.

The combination of RNA-Seq and MS has proven to be a promising approach for sequencing antibodies:22 an RNA sequence database of antibodies is first generated using RNA-Seq, and bottom-up MS/MS spectra are then searched against the database to identify antibody sequences. However, this approach requires multiple experiments and increases the complexity of data analysis.

CHAMPS23 uses a different method for bottom-up MS-based de novo protein sequencing. Bottom-up MS/MS spectra are first analyzed by de novo peptide sequencing to generate peptide sequences. Then the peptide sequences are aligned to a homologous sequence to find overlapping peptide sequences and their positions relative to the homologous sequence. Finally, the peptide sequences are assembled to acquire a protein sequence. This method has two advantages: (1) de novo peptide sequencing removes most of the noise peaks in MS/MS spectra and converts complex spectra to simple peptide sequences, which are easier to assemble than spectra; (2) most errors in de novo peptide sequencing are corrected in peptide assembly by finding the consensus of overlapping peptides. Because of these advantages, CHAMPS achieved over 98% sequence coverage of target sequences and 100% accuracy of reported contigs/sequences. However, the performance of CHAMPS depends on the availability of highly similar homologous protein sequences.

In the past five years, high-resolution mass spectrometers (e.g., Orbitrap) have become accessible to many laboratories.24 These mass spectrometers are capable of analyzing intact proteins, facilitating the development of top-down MS. Because top-down MS/MS spectra are generated from intact proteins, long overlapping regions between top-down and bottom-up MS/MS spectra can be found. Top-down MS/MS spectra can be used as scaffolds to increase coverage and accuracy in de novo protein sequencing, especially when homologous protein sequences are not available. Compared with a homologous protein sequence, top-down MS/MS spectra may miss many peaks, but they do not contain mutations. In this paper, we propose an algorithm, TBNovo (top-down and bottom-up MS-based protein de novo sequencing), for de novo protein sequencing by combining top-down and bottom-up MS. Experiments on data sets of two proteins showed that the algorithm achieved high sequence coverage and high sequence accuracy.



METHODS

Mass Spectrometry Experiments

Top-down and bottom-up MS data sets were generated for two proteins: the light chain of alemtuzumab (MabCampath) and carbonic anhydrase 2 (CAH2_BOVIN, Swiss-Prot id: P00921).

Light Chain of Alemtuzumab

Alemtuzumab was digested with papain and subsequently reduced and analyzed by a reversed-phase liquid chromatography (RPLC) system coupled online with either a Thermo LTQ Orbitrap Velos mass spectrometer or a Thermo Q-Exactive mass spectrometer. A total of four data sets were generated. The first two data sets were obtained using the Thermo LTQ Orbitrap Velos with electron-transfer dissociation (ETD) fragmentation. In the first experiment, MS and MS/MS spectra were collected at resolutions of 30 000 and 100 000, respectively; in the second experiment, MS and MS/MS spectra were collected at resolutions of 100 000 and 100 000, respectively. The third data set was generated using the Thermo Q-Exactive with collision-induced dissociation (CID) fragmentation. In the LC−CID MS/MS analysis, MS and MS/MS spectra were collected at resolutions of 100 000 and 100 000, respectively. The fourth data set was generated using the Thermo Q-Exactive with high-energy collision dissociation (HCD) fragmentation. In the LC−HCD MS/MS analysis, MS and MS/MS spectra were collected at resolutions of 30 000 and 140 000, respectively. In total, 7686 CID, 4931 HCD, and 12 134 ETD MS/MS spectra were collected. (See ref 25 for details of the experiments.)

In bottom-up MS/MS experiments, alemtuzumab solution was reduced with dithiothreitol (DTT), alkylated with iodoacetamide, and digested overnight with trypsin, chymotrypsin, proteinase K, or pepsin. Digested alemtuzumab Fab fragments (fragment antigen-binding) were analyzed by a nanoLC system coupled with a Thermo LTQ Orbitrap XL mass spectrometer. MS and HCD MS/MS spectra were collected at resolutions of 30 000 and 7500, respectively. In total, 11 465 bottom-up MS/MS spectra were generated (trypsin: 2716 spectra; chymotrypsin: 4328 spectra; proteinase K: 1616 spectra; and pepsin: 1910 spectra). (See ref 25 for details of the experiments.)


CAH2

Similar to the light chain of alemtuzumab, top-down and bottom-up MS/MS spectra were generated for protein CAH2. In total, 15 top-down data sets were obtained, including a CID, an HCD, and an ETD data set as well as 12 targeted ETD data sets. These data sets contain 3363 CID, 3437 HCD, and 3045 ETD top-down MS/MS spectra. In addition, a total of 47 536 tryptic bottom-up MS/MS spectra were generated. (See the Supporting Information for details of the experiments.)

Converting Raw Spectra to PRM Spectra

A prefix residue mass (PRM) spectrum contains masses corresponding to the prefixes of a peptide or protein.26 PRM spectra simplify data analysis because suffix residue masses and various types of ions are no longer involved. When we directly match a bottom-up CID MS/MS spectrum to a top-down ETD MS/MS spectrum, the mass differences between b- and c-ions and those between y- and z-ions must be considered. After converting the CID and ETD spectra to PRM spectra, we treat all masses in the spectra as PRMs and do not need to consider these mass differences. In the proposed method, all MS/MS spectra are first converted to PRM spectra.

Each top-down MS/MS spectrum is deconvoluted to a list of neutral masses using MS-Deconv.27 The intensity of a neutral mass is the sum of the intensities of the peaks in its corresponding isotopomer envelope. Following the method in ref 28, the list of neutral masses is further converted to a PRM spectrum. For example, to convert a CID spectrum to a PRM spectrum, we add each neutral mass x extracted from the CID spectrum and its complementary mass PrecursorMass − x to the PRM spectrum, where PrecursorMass is the precursor mass of the spectrum. In addition, masses 0 and PrecursorMass − WaterMass are added to the PRM spectrum, where WaterMass is the mass of a water molecule. Other types of MS/MS spectra, such as ETD spectra, can be converted to PRM spectra using a similar method.

De novo peptide sequencing by bottom-up MS/MS spectra has been extensively studied, and many software tools have been developed.29−33 These software tools can be used to convert bottom-up MS/MS spectra to PRM spectra, removing noise and C-terminal ion peaks in the spectra. Any de novo peptide sequencing tool that reports de novo peptides and their confidence scores can be coupled with TBNovo for the conversion. Because PEAKS30 provided accurate de novo sequencing results on the studied data sets (see the Discussion section), we use PEAKS in the method description. After PEAKS reports a de novo peptide for each spectrum, a theoretical PRM spectrum is generated from the peptide using the following method. Let mass(r) be the residue mass of an amino acid r. A de novo peptide $r_1 r_2 \ldots r_n$ is represented as a PRM spectrum $b_0, b_1, \ldots, b_n$, in which $b_i = \sum_{k=1}^{i} \mathrm{mass}(r_k)$ is the sum of the masses of the first i residues of the peptide. In particular, $b_0 = 0$ and $b_n$ = MolecularMass − WaterMass, where MolecularMass is the molecular mass of the peptide. Peak intensities are ignored in the PRM spectra generated from bottom-up MS/MS spectra. In addition, PEAKS reports an average local confidence (ALC) score for each bottom-up MS/MS spectrum.

Both top-down and bottom-up PRM spectra contain correct and incorrect PRMs. The incorrect PRMs are introduced from deconvoluted noise masses and suffix residue masses in top-down PRM spectra and from incorrect de novo peptides in bottom-up PRM spectra. The objective of de novo protein sequencing is to keep correct PRMs and remove incorrect PRMs.

Spectral Selection and Merging

We assume that the molecular mass of the target protein is known. In practice, the molecular mass of the target protein can be estimated by manual inspection of isotopomer envelopes of its precursor ions. MS-Deconv is used to determine the precursor mass of each top-down tandem mass spectrum by deconvoluting its corresponding MS1 spectrum, but MS-Deconv sometimes reports ±1 Da errors in estimated precursor masses. The error of the molecular mass estimated by manual inspection needs to be smaller than 0.5 Da so that the ±1 Da errors can be detected. A top-down MS/MS spectrum is kept only if its precursor mass matches (within an error tolerance) the theoretical precursor mass of the protein.

Both top-down and bottom-up PRM spectra are filtered to remove low-quality ones, as only high-quality spectra provide accurate information for de novo sequencing. In each top-down data set, all top-down PRM spectra are ranked by the number of deconvoluted masses, and the top three spectra are kept. A bottom-up PRM spectrum is kept if its ALC score (reported by PEAKS) is at least 50%.

Because all top-down PRM spectra have the same (within an error tolerance) precursor mass, we assume that they are generated from one proteoform. The selected top-down spectra are merged into one spectrum to increase protein coverage. If the mass difference between two PRMs in the merged spectrum is smaller than an error tolerance, 15 parts per million (ppm) in the experiments, the two PRMs are merged into one by removing the lower-intensity one. The intensity of the merged PRM is the sum of the intensities of the two PRMs.

Next, some PRMs in the merged spectrum are discarded to reduce the number of noise PRMs. A PRM is supported by a top-down PRM spectrum if it is the same (within an error tolerance) as a PRM in the spectrum. If a PRM is supported by at least k of the selected top-down PRM spectra, the PRM is included in the merged spectrum; otherwise, it is discarded. To determine the value of k, we first estimate the length of the target protein from its precursor mass and the average mass of amino acid residues. Then we set the value of k so that the number of kept PRMs is about 2 times the length of the target protein, which is similar to the total number of theoretical PRMs and suffix residue masses of the protein.

Because a pair of PRMs x and PrecursorMass − x is added to a PRM spectrum for each neutral mass x, there are many complementary PRM pairs (x, PrecursorMass − x) in the merged spectrum. In most cases, only one of the PRMs in each pair is correct. (The incorrect PRM corresponds to a suffix residue mass.) Thus, a PRM is removed if it has fewer supporting spectra than its complementary PRM. (If a PRM and its complementary PRM have the same number of supporting spectra, both PRMs are kept.) Because of the ±1 Da errors introduced in top-down spectral deconvolution, the merged spectrum also contains some PRM pairs (x, y), where the difference between x and y is about 1 Da. Similar to complementary PRM pairs, the PRM with fewer supporting spectra in each pair (x, y) is removed.
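As a rough illustration of the rules in the sections Converting Raw Spectra to PRM Spectra and Spectral Selection and Merging, the following Python sketch converts neutral masses and de novo peptides to PRM spectra, counts supporting spectra, picks k, and drops the weaker PRM of each complementary pair. It is not the TBNovo implementation (which is written in Java): the function names, the short residue mass table, the 110 Da average residue mass, and the exhaustive search over k are all assumptions made for illustration, and the ±1 Da pair filter and intensity handling are omitted.

```python
from itertools import accumulate

WATER_MASS = 18.010565     # monoisotopic mass of H2O (Da)
AVG_RESIDUE_MASS = 110.0   # assumed rough average residue mass (Da)

# Monoisotopic residue masses (Da) for a handful of amino acids only.
RESIDUE_MASS = {"G": 57.021464, "A": 71.037114, "S": 87.032028,
                "V": 99.068414, "K": 128.094963, "R": 156.101111}


def neutral_masses_to_prm(neutral_masses, precursor_mass):
    """Add x and PrecursorMass - x for each neutral mass x, plus both end masses."""
    prms = {0.0, precursor_mass - WATER_MASS}
    for x in neutral_masses:
        prms.update((x, precursor_mass - x))
    return sorted(prms)


def peptide_to_prm(peptide):
    """Theoretical PRM spectrum b_0, ..., b_n of a peptide (prefix sums)."""
    return [0.0] + list(accumulate(RESIDUE_MASS[r] for r in peptide))


def same_mass(a, b, ppm=15.0):
    """True if two masses agree within a relative tolerance given in ppm."""
    return abs(a - b) <= ppm * 1e-6 * max(abs(a), abs(b))


def support(prm, spectra, ppm=15.0):
    """Number of PRM spectra that contain a mass matching prm."""
    return sum(any(same_mass(prm, m, ppm) for m in s) for s in spectra)


def choose_k(candidates, spectra, precursor_mass):
    """Pick k so that the number of kept PRMs is closest to 2 x estimated length."""
    target = 2 * precursor_mass / AVG_RESIDUE_MASS
    counts = [support(m, spectra) for m in candidates]
    return min(range(1, len(spectra) + 1),
               key=lambda k: abs(sum(c >= k for c in counts) - target))


def drop_weaker_complements(candidates, spectra, precursor_mass):
    """Remove a PRM if its complement PrecursorMass - x has more support."""
    return [x for x in candidates
            if support(x, spectra) >= support(precursor_mass - x, spectra)]
```

In this sketch a tie between a PRM and its complement keeps both, matching the rule stated in the text; a fuller version would also merge PRMs that fall within the tolerance of each other and sum their intensities.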

Improving the Top-Down Spectrum Using Bottom-Up Spectra

If we had an ideal top-down MS/MS spectrum containing only a full list of correct PRMs of a protein, it would be easy to obtain the protein sequence from the spectrum. Below we use bottom-up spectra to increase the number of correct PRMs and decrease the number of incorrect PRMs in the top-down spectrum (Figure 1). Let T0 be the merged top-down PRM spectrum generated in the previous section. Compared with the ideal spectrum, T0 misses some correct PRMs and contains many incorrect PRMs (Figure 1a). We will use bottom-up spectra to “improve” the top-down spectrum T0 in three steps: (1) spectral mapping (Figure 1b), (2) gap filling by extension (Figure 1c), and (3) gap filling by mass matching (Figure 1d).

If the ALC score (reported by PEAKS) of a bottom-up MS/MS spectrum is high, the quality of the spectrum is high, and its PRM spectrum is reliable. Let $\mathcal{B}$ and $\mathcal{E}$ be the sets of bottom-up PRM spectra corresponding to bottom-up MS/MS spectra with an ALC score ≥70% and ≥50%, respectively. (The cutoff values are chosen on the basis of experience.) In the steps “spectral mapping” and “gap filling by extension”, only high-quality bottom-up spectra in $\mathcal{B}$ are utilized to improve T0. In the step “gap filling by mass matching”, all bottom-up spectra in $\mathcal{E}$ are used to further increase the protein sequence coverage.

Figure 1. Improving a top-down PRM spectrum by using bottom-up PRM spectra. (a) A top-down PRM spectrum T of the target protein contains both correct and incorrect PRMs. (b) Spectral mapping. Using the alignments between T and bottom-up PRM spectra, red PRMs (supported by both T and at least one bottom-up PRM spectrum) and blue dotted PRMs (supported by at least two bottom-up PRM spectra) are selected to generate a new top-down PRM spectrum T1. (c) Gap filling by extension. In (c1), the PRM 988 is a candidate cleavage site for tryptic peptides because the mass difference between PRMs 832 and 988 is the same (within an error tolerance) as the residue mass of “R”. If a tryptic bottom-up PRM spectrum is a valid optimal spectrum for candidate cleavage site 988, the PRMs (blue dotted) in the spectrum are added to T1. In (c2), the PRM 1273 is a candidate cleavage site for chymotrypsin. If a chymotrypsin bottom-up PRM spectrum is a valid optimal spectrum for candidate cleavage site 1273, the PRMs (blue dotted) in the spectrum are added to the top-down PRM spectrum. (d) Gap filling by mass matching. Both PRMs 1785 and 3037 are candidate cleavage sites for trypsin. The mass difference 1252 = 3037 − 1785 is a candidate peptide mass. A bottom-up spectrum CYFFTKHIR with the precursor mass 1270 = 1252 + 18 is a candidate spectrum. If the candidate spectrum has the best score among all candidate spectra, PRMs (blue dotted) of the spectrum are used to fill the gap. (e) The resulting top-down PRM spectrum T contains more correct PRMs and fewer incorrect PRMs compared with the original spectrum T0.

Spectral Mapping

The shared mass score of a top-down PRM spectrum T and a bottom-up PRM spectrum B is the number of shared masses (within an error tolerance) in T and B, denoted by SScore(T, B). We can convert B to a new spectrum shift(B, x) by adding a value x to each PRM in B. The spectral mapping score between T and B is the maximum shared mass score among all shift values, that is, $\mathrm{MScore}(T, B) = \max_x \mathrm{SScore}(T, \mathrm{shift}(B, x))$. The shift value θ that maximizes the shared mass score of T and B is the optimal shift of T and B. A PRM is supported by B if it is the same (within an error tolerance) as a PRM in shift(B, θ).

If MScore(T0, B) is small, then the spectral mapping between T0 and B and the optimal shift of T0 and B are not confidently determined. On the basis of this observation, a spectrum $B \in \mathcal{B}$ is not used in spectral mapping if MScore(T0, B) < α, where α is a prespecified parameter. Let $\mathcal{B}^*$ be the set of bottom-up PRM spectra in $\mathcal{B}$ satisfying MScore(T0, B) ≥ α. We combine all PRMs in T0 and the shifted spectra shift(Bi, θi) for $B_i \in \mathcal{B}^*$ to form a new PRM spectrum (Figure 1b). Similar to the method in the section Spectral Selection and Merging, PRMs with similar masses are merged, and a PRM x is removed if its complementary PRM, or another PRM y whose mass differs from x by about 1 Da, has more supporting spectra. Finally, a PRM is removed if it is supported by only one PRM spectrum (either T0 or a single spectrum in $\mathcal{B}^*$).
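The shared mass score and spectral mapping score defined in the section Spectral Mapping can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names mirror the notation SScore, shift, and MScore, the absolute tolerance of 0.02 Da is an assumed stand-in for the paper's unspecified error tolerance, and the search only tries shifts that align one mass of B exactly with one mass of T, which is a simplifying heuristic rather than a search over all real-valued shifts.

```python
def sscore(t, b, tol=0.02):
    """Shared mass score: number of masses in b that match some mass in t."""
    return sum(any(abs(m - x) <= tol for x in t) for m in b)


def shift(b, x):
    """Add the shift value x to every PRM in b."""
    return [m + x for m in b]


def mscore(t, b, tol=0.02):
    """Spectral mapping score and the shift that attains it.

    Candidate shifts are the differences x - m for x in t and m in b
    (assumption: the best shift aligns at least one pair of masses).
    """
    best_score, best_shift = 0, 0.0
    for x in t:
        for m in b:
            s = sscore(t, shift(b, x - m), tol)
            if s > best_score:
                best_score, best_shift = s, x - m
    return best_score, best_shift
```

With such a function, the filter described in the text amounts to keeping only those bottom-up spectra whose score against T0 is at least α, for example `[b for b in spectra if mscore(t0, b)[0] >= alpha]`, where `spectra`, `t0`, and `alpha` are hypothetical variable names.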

Gap Filling by Extension

Let T[i, j] and T(i, j) denote the sets of PRMs m in a top-down PRM spectrum T satisfying i ≤ m ≤ j and i < m < j, respectively. A mass pair (i, j) in a PRM spectrum T is called a gap if j − i ≥ L (L is a prespecified parameter) and T(i, j) = ⌀. Let T1 be the resulting PRM spectrum of the previous step. Below we describe a method to fill a gap (i, j) in T1. Compared with T1, the original top-down spectrum T0 may contain some PRMs in gap (i, j) that are not included in T1 because they are not supported by any bottom-up PRM spectra. To utilize the PRMs in T0(i, j), the PRMs in T1[0, i] and T0(i, j) are combined to form a PRM spectrum U. We map bottom-up PRM spectra to U and use shifted PRMs in the bottom-up spectra to fill the gap (Figure 1c).

While this method can be applied to all proteases, tryptic bottom-up spectra will be used to describe it. Because trypsin cleaves a protein after “R” and “K” residues, most tryptic peptides start at a residue that follows an “R” or “K” residue. If a PRM pair l and k in U satisfies k − l ≈ mass(R) or k − l ≈ mass(K), then the PRM k might be a cleavage site, called a candidate cleavage site. Let $\mathcal{B}_t$ be all tryptic bottom-up PRM spectra in $\mathcal{B}$, the set of high-quality bottom-up PRM spectra (ALC score ≥70%). If a bottom-up spectrum B* satisfies $\mathrm{SScore}(U, \mathrm{shift}(B^*, k)) = \max_{B \in \mathcal{B}_t} \mathrm{SScore}(U, \mathrm{shift}(B, k))$, B* is an optimal spectrum for the candidate cleavage site k. If SScore(U, shift(B*, k)) is no less than a threshold, B* is a valid optimal spectrum of the candidate cleavage site k. In practice, the threshold is set as α − 2.

The smallest candidate cleavage site k in U is first used to fill the gap. The shifted PRM list C = shift(B*, k) is computed for each valid optimal spectrum B* for the candidate cleavage site, and all PRMs in C(i, j) are added to U (Figure 1c1). Next, the process is applied to the second smallest candidate cleavage site in U (Figure 1c2). This procedure is repeated until no candidate cleavage site can be found in U. Finally, if a PRM in U(i, j) is not supported by any valid optimal spectrum, it is removed (Figure 2). Similarly, the method can be performed using PRMs T1[j, N], where N is the largest PRM in T1. In practice, we first use PRMs in T1[0, i] to fill a gap (i, j). If the gap is not filled, PRMs in T1[j, N] are used to fill the gap. After PRMs are added to fill gaps, the resulting spectrum is “improved” again using the method described in the section Spectral Mapping.

Figure 2. Gap filling by extension.

Gap Filling by Mass Matching

Let T2 be the resulting PRM spectrum of the previous step. The PRM spectrum T2 may still contain gaps. For a gap (i, j) in T2, if T2 contains two candidate cleavage sites l and r with respect to tryptic digestion satisfying l < i and r > j, then the mass r − l + WaterMass is a candidate peptide mass with respect to tryptic digestion. A gap may have multiple candidate peptide masses. If the precursor mass of a bottom-up PRM spectrum matches a candidate peptide mass corresponding to the same protease, the spectrum is a candidate spectrum to fill the gap. For each candidate peptide mass of the gap, we find all its candidate bottom-up PRM spectra in $\mathcal{E}$, the set of bottom-up PRM spectra with an ALC score ≥50%. Finally, the bottom-up spectrum B* with the best ALC score is selected among all candidate bottom-up spectra (corresponding to all candidate peptide masses). Let C = shift(B*, l) be the shifted spectrum of B*. All PRMs in C(i, j) are added to T2 (Figure 1d). In addition, all PRMs in T0(i, j) are added to T2. The resulting spectrum will be used to reconstruct the target protein sequence.

Protein Sequencing Using Spectral Graphs

Let T be the top-down PRM spectrum obtained in the previous section. Our objective is to find a protein sequence P and its corresponding PRM spectrum S that best explain T and the spectra in $\mathcal{B}$. The objective function is based on the similarity between S and T as well as that between S and $\mathcal{B}$. Because only high-quality bottom-up PRM spectra provide accurate information for de novo protein sequencing, only a small set of spectra in $\mathcal{B}$ that are highly similar to S are used to compute the similarity between S and $\mathcal{B}$. Because S is unknown and T is similar to S, we use T to estimate the similarity between S and each spectrum in $\mathcal{B}$. The normalized spectral mapping score of a bottom-up PRM spectrum B and T is MScore(T, B)/|B|, where |B| is the number of PRMs in B. As defined in the section Spectral Mapping, a PRM in T is supported by B if it is the same (within an error tolerance) as a PRM in shift(B, θ), where θ is the optimal shift of T and B. All supporting bottom-up spectra for each PRM in T are ranked by their normalized spectral mapping scores. If a spectrum B is one of the best three spectra for a PRM, it is called a score spectrum. Let $\mathcal{B}_s$ be the set of all score spectra.

Protein Sequencing Problem

Given a top-down PRM spectrum T and a set of bottom-up PRM spectra $\mathcal{B}_s$, find a protein sequence P and its theoretical PRM spectrum S such that $\mathrm{SScore}(S, T) + \sum_{B_i \in \mathcal{B}_s} \mathrm{SScore}(S, \mathrm{shift}(B_i, \theta_i))$ is maximized, where θi is the optimal shift of Bi and T.

We use a spectral graph to solve this problem. A combined PRM spectrum U is obtained by combining the PRMs in T and all PRMs in the shifted bottom-up spectra shift(Bi, θi). To find a protein sequence that maximizes the scoring function, the PRM spectrum is converted to a spectral graph in which each node represents a PRM. If a PRM in U is supported by x spectra, the weight of its corresponding node is x. Node v1 is connected to node v2 by a directed edge if mass(v2) − mass(v1) is the same (within an error tolerance) as the mass of one or two amino acid residues, where mass(v) is the corresponding PRM of node v. Finally, we find a heaviest path from the node corresponding to mass 0 to the node corresponding to mass PrecursorMass − WaterMass. The heaviest path in the graph corresponds to the protein sequence P that maximizes the scoring function.

In the PRM spectrum S of P, there are some PRMs supported by only one PRM in U. These PRMs are removed to increase the accuracy of reported PRMs. In particular, because N- and C-terminal PRMs do not have overlapping peptides to support them, they have high error rates. Thus, N- and C-terminal PRMs (the smallest and the largest five PRMs) in S that are supported by only two PRMs in U are further removed. Using only the edges in the heaviest path, we may fail to distinguish between residues “K” and “Q” because their masses are similar. To distinguish the two residues, we find bottom-up PRM spectra in $\mathcal{B}_s$ that contain the residue and use the peaks of the bottom-up spectra to determine the correct residue.

Figure 3. Comparison between the target protein sequence of the light chain of alemtuzumab (214 amino acids) and the de novo protein sequence reported by TBNovo. If an amino acid reported by TBNovo differs from the target sequence but has the same mass as the amino acid in the target sequence, it has a green background. If an amino acid reported by TBNovo is incorrect, it has a red background.
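The gap definition, the candidate cleavage sites used in gap filling, and the heaviest-path search described in the section Protein Sequencing Problem can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not TBNovo's Java implementation: the function names, the 0.02 Da tolerance, and the simple quadratic dynamic program are assumptions, and node weights are taken as given (the number of supporting spectra per PRM).

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da (I and L share a mass and are listed once).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}
SINGLE = list(RESIDUE_MASS.values())
# An edge may span one residue or any pair of residues.
EDGE_MASSES = SINGLE + [a + b for a, b in combinations_with_replacement(SINGLE, 2)]


def find_gaps(prms, min_len):
    """Pairs (i, j) of consecutive PRMs with j - i >= L, i.e. T(i, j) is empty."""
    s = sorted(prms)
    return [(i, j) for i, j in zip(s, s[1:]) if j - i >= min_len]


def candidate_cleavage_sites(prms, residues="KR", tol=0.02):
    """PRMs k for which some PRM l satisfies k - l ~ mass(K) or mass(R) (trypsin)."""
    s = sorted(prms)
    return [k for k in s
            if any(abs(k - l - RESIDUE_MASS[r]) <= tol for l in s for r in residues)]


def heaviest_path(prms, weights, tol=0.02):
    """Heaviest path from prms[0] (mass 0) to prms[-1] in the spectral graph.

    prms must be sorted ascending; weights[i] is the node weight of prms[i].
    Returns (total weight, list of PRMs on the path), or None if no path exists.
    """
    n = len(prms)
    best = [float("-inf")] * n   # best[j]: heaviest path weight ending at node j
    prev = [None] * n            # back-pointers for path reconstruction
    best[0] = weights[0]
    for j in range(1, n):
        for i in range(j):
            if best[i] == float("-inf"):
                continue
            diff = prms[j] - prms[i]
            if (best[i] + weights[j] > best[j]
                    and any(abs(diff - m) <= tol for m in EDGE_MASSES)):
                best[j], prev[j] = best[i] + weights[j], i
    if best[-1] == float("-inf"):
        return None
    path, j = [], n - 1
    while j is not None:
        path.append(prms[j])
        j = prev[j]
    return best[-1], path[::-1]
```

Because the nodes are sorted by mass and every edge points to a larger mass, the graph is acyclic and a single left-to-right pass suffices. Note that with a 0.02 Da tolerance an edge cannot tell “K” (128.09496 Da) from “Q” (128.05858 Da) only when the tolerance is loosened, which is the ambiguity the text resolves using bottom-up spectra.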

RESULTS

We implemented TBNovo in Java and tested it on two data sets, the light chain of alemtuzumab and CAH2, on a desktop with a 3.4 GHz CPU and 16 GB of RAM. All top-down MS/MS spectra in the data sets were deconvoluted and converted to PRM spectra. The precursor masses of the top-down MS/MS spectra were compared with the molecular mass of the target protein, and only top-down spectra whose precursor masses fell within a 15 ppm error tolerance were kept. From each of the data sets (four data sets for the light chain of alemtuzumab and 15 data sets for CAH2), the top three PRM spectra with the largest number of PRMs were selected. The number three for selecting top spectra was determined on the basis of manual investigation of the quality of the deconvoluted MS/MS spectra. The selected top-down tandem mass spectra are of high quality and contain a large number of fragment ions, which are essential to de novo protein sequencing. For example, one HCD spectrum of the light chain of alemtuzumab contains 90 different fragment ions of the target protein (see Figure 1 in the Supporting Information).

All bottom-up MS/MS spectra were analyzed using PEAKS.30 The precursor and fragment ion error tolerances were set to 15 ppm and 0.05 Da, respectively. Carbamidomethylation was set as the fixed PTM on cysteine residues. For each bottom-up spectrum, only the top-scoring de novo peptide was reported. Because the top-down spectra were generated from the target protein without alkylation with iodoacetamide, we used the mass of unmodified cysteines when converting de novo peptides to PRM spectra, so that a bottom-up PRM spectrum from a peptide with cysteine residues can be mapped to the top-down PRM spectra.

Light Chain of Alemtuzumab

We selected 12 top-down spectra from the four top-down data sets; the total number of PRMs in the 12 spectra was 6186. After the top-down MS/MS spectra were merged, PRMs supported by fewer than k = 2 spectra were removed. (See the Methods section for the protocol to determine the value of k.) The resulting top-down PRM spectrum contained 526 PRMs with 49.1% sequence coverage of the target protein, which has 214 amino acids. PEAKS reported 3889 peptides from the bottom-up MS/MS spectra with default cutoff values: total local confidence (TLC) score ≥3 and ALC score ≥50%. Of the 3889 peptides, 1067 have an ALC score >70%. The running time of TBNovo was about 5 min.

In spectral mapping, 143 bottom-up PRM spectra were mapped to the top-down PRM spectra with a score ≥7 (α = 7). After the top-down spectrum was improved by spectral mapping, the resulting PRM spectrum contained 255 PRMs with 64.0% sequence coverage. In gap filling by extension, the parameter for the gap length was set to L = 300, and three gaps were identified. After gap filling by extension, the resulting PRM spectrum contained 450 PRMs, and the sequence coverage increased to 81.8%. The extension method filled two of the three gaps but failed to fill the third. In gap filling by mass matching, one tryptic bottom-up MS/MS spectrum was added to fill the remaining gap. The combined PRM spectrum in the protein sequencing problem contained 532 PRMs with 96.2% sequence coverage. After the top-down and selected bottom-up PRM spectra were converted to a spectral graph, the heaviest path algorithm reported 205 PRMs with 89.2% sequence coverage. After low-confidence PRMs were removed, TBNovo reported 188 PRMs, of which 184 were correct.

When correctness is evaluated on PRMs, the sequence coverage was 86.9%, and the accuracy rate was 97.8% (Figure 3). When correctness is evaluated on amino acid residues, 29 residues were not reported (reported as gaps), and 6 residues were incorrectly predicted, so the accuracy rate was 83.6%. The PRM spectrum was converted to a protein sequence with gaps. Even though some amino acids in the protein sequence are not reported, the positions of almost all reported contigs and gaps are correct. The main reason is that top-down spectra were used as a scaffold, providing essential information for finding the correct positions of contigs and gaps.

CAH2

We selected nine top-down spectra from the three top-down data sets as well as 12 targeted ETD top-down spectra; the total number of PRMs in the 21 spectra was 6186. After the top-down MS/MS spectra were merged, PRMs supported by fewer than k = 3 spectra were removed. The resulting top-down PRM spectrum contained 428 PRMs with 51.5% sequence coverage of the target protein. (The target protein has 258 amino acids excluding the N-terminal methionine.) PEAKS reported 14 520
peptides from the bottom-up MS/MS spectra with default cutoff values: total local confidence (TLC) score ≥3 and ALC score ≥50%. Of the 14 520 peptides, 8676 have an ALC score >70%. The running time of TBNovo was about 7 min. The parameters of TBNovo were the same as in the analysis of the light chain of alemtuzumab. A total of 2476 bottom-up PRM spectra were mapped to the top-down PRM spectra with a score ≥7 in spectral mapping. The resulting PRM spectrum contained 908 PRMs with 60.9% sequence coverage. After gap filling by extension and by mass matching, the resulting PRM spectrum contained 664 PRMs, and the sequence coverage increased to 84.5%. The combined PRM spectrum in the protein sequencing problem contained 646 PRMs with 82.5% sequence coverage. Finally, TBNovo reported 229 PRMs, of which 194 were correct. The sequence coverage and correct rate based on PRMs were 75.2% and 84.7%, respectively. The sequence coverage is not high because a region of the protein (residues 172−224) lacks fragment ions in both the bottom-up and the top-down MS/MS spectra.
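The accuracy rates quoted in this section follow from simple ratios of the reported counts. The sketch below recomputes them in Python; the function names are hypothetical and are not part of TBNovo, and the formulas are inferred from the numbers in the text (correct PRMs over reported PRMs, and correctly predicted residues over protein length) rather than taken from the authors' code.

```python
def prm_accuracy(correct_prms: int, reported_prms: int) -> float:
    """Percentage of reported PRMs that are correct."""
    return 100.0 * correct_prms / reported_prms


def residue_accuracy(protein_length: int, unreported: int, incorrect: int) -> float:
    """Percentage of residues that were reported and predicted correctly.

    Assumes (inferred from the text) that unreported residues and
    incorrectly predicted residues both count against the accuracy rate.
    """
    return 100.0 * (protein_length - unreported - incorrect) / protein_length


if __name__ == "__main__":
    # Light chain of alemtuzumab: 188 reported PRMs, 184 correct.
    print(f"light chain, PRM-based:     {prm_accuracy(184, 188):.2f}%")
    # Light chain: 214 residues, 29 reported as gaps, 6 incorrect.
    print(f"light chain, residue-based: {residue_accuracy(214, 29, 6):.2f}%")
    # CAH2: 229 reported PRMs, 194 correct.
    print(f"CAH2, PRM-based:            {prm_accuracy(194, 229):.2f}%")
```

The ratio 184/188 evaluates to roughly 97.87%, which the text gives as 97.8%; the other two ratios agree with the reported 83.6% and 84.7% after rounding to one decimal place.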

ASSOCIATED CONTENT

Supporting Information

Information on the mass spectrometry experiments of CAH2, parameter setting, and TBNovo coupled with PepNovo. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected].
*E-mail: [email protected]. Fax: +1-858-534-7029. Tel.: +1-858-822-4365.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was supported by a startup fund provided by Indiana University-Purdue University Indianapolis. L.J.M.D. and M.M.V. are financially supported by The Netherlands Organization for Scientific Research (NWO), Zenith grant no. 93511034. Portions of this work were performed in the William R. Wiley Environmental Molecular Sciences Laboratory (EMSL), a Department of Energy, Biological and Environmental Research (DOE BER) national scientific user facility located on the campus of Pacific Northwest National Laboratory (PNNL) in Richland, Washington. PNNL is a multiprogram national laboratory operated by Battelle for the DOE under Contract DE-AC05-76RLO1830.



DISCUSSION

In this paper, we proposed a new method for de novo protein sequencing using top-down and bottom-up MS/MS spectra. Because top-down MS/MS spectra serve as scaffolds in protein sequencing, almost all reported contigs and gaps are at the correct locations. This provides more valuable information than the sets of contigs without relative positions in the target sequence that are reported by existing approaches using only bottom-up MS/MS spectra.

The performance of de novo peptide sequencing is essential to high-accuracy protein sequencing because it determines the accuracy of the bottom-up PRM spectra from which the protein sequence is assembled. There are two commonly used software tools for this purpose: PEAKS30 and PepNovo.31 We used PEAKS in TBNovo because it provides scoring models for various proteases (e.g., chymotrypsin and pepsin) and fragmentation methods (e.g., HCD and ETD), which are critical for obtaining highly accurate de novo peptides. Because PepNovo does not provide such a range of scoring models, TBNovo coupled with PepNovo did not achieve high sequence coverage for the two proteins. (See the Supporting Information.)

The accuracy of the contigs and gaps reported by TBNovo is high and comparable to that of the best existing de novo protein sequencing tool, meta-SPS.20 The accuracy might be further improved by replacing the simple shared mass score with a scoring function that utilizes peak intensities as well as the local confidence scores reported by de novo peptide sequencing.

In TBNovo, the value of α is important for achieving high sequence coverage and accuracy. A low cutoff value for α decreases sequence accuracy by including many noise bottom-up PRMs, whereas a high cutoff value decreases sequence coverage by excluding most bottom-up PRM spectra from protein sequencing. (See the Supporting Information for details.) The suggested setting for α is 7.

Although the resulting sequences cover almost the whole target sequences, they still contain many gaps. These gaps, especially those at the N- and C-termini, are not filled because both top-down and bottom-up MS/MS spectra lack fragment ions in these regions. Generating more bottom-up MS/MS spectra using various proteases, or generating middle-down MS/MS spectra, might help to fill the gaps.
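The trade-off governed by the cutoff α can be illustrated with a small sketch. It assumes that the shared mass score is the number of bottom-up PRM masses that, after a mass shift, coincide with a scaffold PRM mass within a tolerance; this definition, the function names, and the 0.05 Da tolerance are assumptions made for illustration and are not taken from the TBNovo implementation.

```python
from bisect import bisect_left
from typing import Sequence


def shared_mass_score(scaffold: Sequence[float], bottom_up: Sequence[float],
                      shift: float, tol: float = 0.05) -> int:
    """Count bottom-up PRM masses that match a scaffold mass after shifting.

    `tol` is a hypothetical absolute tolerance in Da.
    """
    sorted_scaffold = sorted(scaffold)
    score = 0
    for mass in bottom_up:
        target = mass + shift
        # First scaffold mass that is >= target - tol.
        i = bisect_left(sorted_scaffold, target - tol)
        if i < len(sorted_scaffold) and sorted_scaffold[i] <= target + tol:
            score += 1
    return score


def passes_cutoff(scaffold: Sequence[float], bottom_up: Sequence[float],
                  shift: float, alpha: int = 7, tol: float = 0.05) -> bool:
    """Keep a bottom-up PRM spectrum only if its score reaches alpha."""
    return shared_mass_score(scaffold, bottom_up, shift, tol) >= alpha


if __name__ == "__main__":
    # Toy scaffold with PRM masses at 100, 200, ..., 1000 Da.
    scaffold = [float(m) for m in range(100, 1100, 100)]
    # Toy bottom-up PRM spectrum; the last mass is noise.
    bottom_up = [0.0, 100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 650.0]
    print(shared_mass_score(scaffold, bottom_up, shift=100.0))  # seven masses match
    print(passes_cutoff(scaffold, bottom_up, shift=100.0, alpha=7))
    print(passes_cutoff(scaffold, bottom_up, shift=100.0, alpha=8))
```

In this toy case the spectrum is kept at α = 7 and rejected at α = 8, which mirrors the point made above: raising the cutoff discards more bottom-up spectra, while lowering it admits spectra that share only a few, possibly coincidental, masses with the scaffold.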



REFERENCES

(1) Eng, J. K.; McCormack, A. L.; Yates, J. R., III An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989.
(2) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−3567.
(3) Wiles, M.; Andreassen, P. Monoclonals-the billion dollar molecules of the future. Drug Discovery World 2006, Fall, 17−23.
(4) Rai, K.; Hallek, M. Future prospects for alemtuzumab (MabCampath (TM)). Med. Oncol. 2002, 19, S57−S63.
(5) Lewis, R. J.; Garcia, M. L. Therapeutic potential of venom peptides. Nat. Rev. Drug Discovery 2003, 2, 790−802.
(6) Edman, P. Method for determination of the amino acid sequence in peptides. Acta Chem. Scand. 1950, 4, 283−293.
(7) Thoma, R. S.; Smith, J. S.; Sandoval, W.; Leone, J. W.; Hunziker, P.; Hampton, B.; Linse, K. D.; Denslow, N. D. The ABRF Edman Sequencing Research Group 2008 Study: Investigation into homopolymeric amino acid N-terminal sequence tags and their effects on automated Edman degradation. J. Biomol. Tech. 2009, 20, 216−225.
(8) Bandeira, N.; Pham, V.; Pevzner, P.; Arnott, D.; Lill, J. R. Automated de novo protein sequencing of monoclonal antibodies. Nat. Biotechnol. 2008, 26, 1336−1338.
(9) Thakkar, A.; Cohen, A. S.; Connolly, M. D.; Zuckermann, R. N.; Pei, D. High-throughput sequencing of peptoids and peptide-peptoid hybrids by partial Edman degradation and mass spectrometry. J. Comb. Chem. 2009, 11, 294−302.
(10) Calvete, J. J.; Ghezellou, P.; Paiva, O.; Matainaho, T.; Ghassempour, A.; Goudarzi, H.; Kraus, F.; Sanz, L.; Williams, D. J. Snake venomics of two poorly known Hydrophiinae: Comparative proteomics of the venoms of terrestrial Toxicocalamus longissimus and marine Hydrophis cyanocinctus. J. Proteomics 2012, 75, 4091−4101.
(11) Johnson, R. S.; Biemann, K. The primary structure of thioredoxin from Chromatium vinosum determined by high-performance tandem mass spectrometry. Biochemistry 1987, 26, 1209−1214.
(12) Hopper, S.; Johnson, R. S.; Vath, J. E.; Biemann, K. Glutaredoxin from rabbit bone marrow. Purification, characterization, and amino acid sequence determined by tandem mass spectrometry. J. Biol. Chem. 1989, 264, 20438−20447.
(13) Whaley, B.; Caprioli, R. M. Identification of nearest-neighbor peptides in protease digests by mass spectrometry for construction of sequence-ordered tryptic maps. Biol. Mass Spectrom. 1991, 20, 210−214.
(14) MacCoss, M. J.; McDonald, W. H.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates, J. R. Shotgun identification of protein modifications from protein complexes and lens tissue. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900−7905.
(15) Englander, J. J.; Del Mar, C.; Li, W.; Englander, S. W.; Kim, J. S.; Stranz, D. D.; Hamuro, Y.; Woods, V. L. Protein structure change studied by hydrogen-deuterium exchange, functional labeling, and mass spectrometry. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 7057−7062.
(16) Bandeira, N.; Tang, H.; Bafna, V.; Pevzner, P. Shotgun protein sequencing by tandem mass spectra assembly. Anal. Chem. 2004, 76, 7221−7233.
(17) Pham, V.; Henzel, W. J.; Arnott, D.; Hymowitz, S.; Sandoval, W. N.; Truong, B. T.; Lowman, H.; Lill, J. R. De novo proteomic sequencing of a monoclonal antibody raised against OX40 ligand. Anal. Biochem. 2006, 352, 77−86.
(18) Klammer, A. A.; MacCoss, M. J. Effects of modified digestion schemes on the identification of proteins from complex mixtures. J. Proteome Res. 2006, 5, 695−700.
(19) Bandeira, N.; Clauser, K. R.; Pevzner, P. A. Shotgun protein sequencing: assembly of peptide tandem mass spectra from mixtures of modified proteins. Mol. Cell. Proteomics 2007, 6, 1123−1134.
(20) Guthals, A.; Clauser, K. R.; Bandeira, N. Shotgun protein sequencing with meta-contig assembly. Mol. Cell. Proteomics 2012, 11, 1084−1096.
(21) Castellana, N.; Bafna, V. Proteogenomics to discover the full coding content of genomes: A computational perspective. J. Proteomics 2010, 73, 2124−2135.
(22) Cheung, W. C.; Beausoleil, S. A.; Zhang, X.; Sato, S.; Schieferl, S. M.; Wieler, J. S.; Beaudet, J. G.; Ramenani, R. K.; Popova, L.; Comb, M. J.; Rush, J.; Polakiewicz, R. D. A proteomics approach for the identification and cloning of monoclonal antibodies from serum. Nat. Biotechnol. 2012, 30, 447−452.
(23) Liu, X.; Han, Y.; Yuen, D.; Ma, B. Automated protein (re)sequencing with MS/MS and a homologous database yields almost full coverage and accuracy. Bioinformatics 2009, 25, 2174−2180.
(24) Tran, J. C.; Zamdborg, L.; Ahlf, D. R.; Lee, J. E.; Catherman, A. D.; Durbin, K. R.; Tipton, J. D.; Vellaichamy, A.; Kellie, J. F.; Li, M.; Wu, C.; Sweet, S. M. M.; Early, B. P.; Siuti, N.; LeDuc, R. D.; et al. Mapping intact protein isoforms in discovery mode using top-down proteomics. Nature 2011, 480, 254−258.
(25) Dekker, L.; Wu, S.; Vanduijn, M.; Tolić, N.; Stingl, C.; Zhao, R.; Luider, T.; Paša-Tolić, L. An integrated top-down and bottom-up proteomic approach to characterize the antigen binding fragment of antibodies. Proteomics 2014, DOI: 10.1002/pmic.201300366.
(26) Kim, S.; Gupta, N.; Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 2008, 7, 3354−3363.
(27) Liu, X.; Inbar, Y.; Dorrestein, P. C.; Wynne, C.; Edwards, N.; Souda, P.; Whitelegge, J. P.; Bafna, V.; Pevzner, P. A. Deconvolution and database search of complex tandem mass spectra of intact proteins: A combinatorial approach. Mol. Cell. Proteomics 2010, 9, 2772−2782.
(28) Liu, X.; Hengel, S.; Wu, S.; Tolić, N.; Paša-Tolić, L.; Pevzner, P. A. Identification of ultramodified proteins using top-down tandem mass spectra. J. Proteome Res. 2013, 12, 5830−5838.
(29) Pevzner, P. A.; Dančík, V.; Tang, C. L. Mutation-tolerant protein identification by mass spectrometry. J. Comput. Biol. 2000, 7, 777−787.
(30) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17, 2337−2342.
(31) Frank, A.; Pevzner, P. PepNovo: De novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005, 77, 964−973.
(32) Chi, H.; Sun, R. X.; Yang, B.; Song, C. Q.; Wang, L. H.; Liu, C.; Fu, Y.; Yuan, Z. F.; Wang, H. P.; He, S. M.; Dong, M. Q. pNovo: de novo peptide sequencing and identification using HCD spectra. J. Proteome Res. 2010, 9, 2713−2724.
(33) Guthals, A.; Clauser, K. R.; Frank, A. M.; Bandeira, N. Sequencing-grade de novo analysis of MS/MS triplets (CID/HCD/ETD) from overlapping peptides. J. Proteome Res. 2013, 12, 2846−2857.