Peptide Identification Using Vectors of Small Fragment Ions

Davide Cittaro, Dario Borsotti, Alessio Maiolica, Elisabetta Argenzio, and Juri Rappsilber*
Proteomics Group, The FIRC Institute of Molecular Oncology Foundation, Via Adamello 16, 20139 Milan, Italy
Received January 27, 2005

Abstract: Traditionally, peptide identification using fragmentation spectra relies on extracting the maximum amount of information from spectra. Using different combinations of small ion masses, we show that identifying a small number of fragment ions in a spectrum is sufficient for peptide identification. We consider y2-, y3-, b2-, and b3-ions and find the combination of b2-y2 to be sufficient for many peptides. Adding either the y3- or the b3-ion increases specificity and allows reliable peptide identification in the human proteome. Fragmentation spectra and peptides are represented as n-dimensional vectors, where n is the number of fragment ions considered plus one dimension for the peptide mass. The identification score is given by the Euclidean distance between the spectrum and the matching peptide in n-dimensional space. We show that this approach, using minimal information, allows for precise and fast peptide identification.

Keywords: peptide fragmentation • mass spectrometry • database search • attribute vectors • peptide identification

Journal of Proteome Research 2005, 4, 1006-1011. Published on Web 04/09/2005.

* To whom correspondence should be addressed. Tel: +39 02 574 303 706. Fax: +39 02 574 303 231 (shared). E-mail: juri.rappsilber@ifom-ieo-campus.it.

Introduction

Peptide identification is the central step in most mass spectrometric experiments for protein identification. Algorithms such as Mascot (1) or SEQUEST (2) achieve this step by fitting experimental to predicted spectra. Other algorithms, such as PeptideSearch (3) and GutenTag (4), generate sequence tags from the fragmentation spectra, which are then used to search for matching sequences in a database. In contrast, de novo approaches (5-10) use the spectra to predict complete peptide sequences without the use of a database. While these approaches are conceptually very different, they all aim to extract the maximum information available from the spectra and consequently require good-quality data for best results.

Recently, Halligan et al. (11) described a method to identify peptides using peptide amino acid attribute vectors (PAAV). PAAV is based on indexing a sequence database for fast searching. The sequences are represented as vectors whose dimensions are the amino acid proportions of the peptide sequence. A major strength of the PAAV method is that it is easy to model and computationally light. Also, using amino acid composition removes the need for any knowledge of the actual amino acid sequence. Despite the efficiency of the theoretical approach, however, the small amount of amino acid composition information that can be extracted from fragmentation spectra limits the application of PAAV to mass spectrometric data.

Building on PAAV, mass values could be substituted for amino acid information. Creating an index with masses as attributes requires as many dimensions as there are steps in the mass range considered. Applying the PAAV approach to complete fragmentation spectra ranging from 60 to 2000 m/z, with a 1 m/z step between consecutive points, leads to vectors in a 1940-dimensional space. Even with this coarse spectral description (1 m/z is a low resolution), the number of dimensions is too high for practical processing.

Instead of trying to accommodate all the information in the entire spectrum, we looked for ways to reduce the amount of data. In the peptide sequence tag approach, identification is achieved by annotating a small number of fragment ions that together yield peptide sequence information. Similarly, if it were possible to annotate peptide-specific fragments such as b- and y-ions, the dimensionality of the attribute vectors would be significantly reduced. Using b- and y-ions for indexing would allow unambiguous peptide identification using the PAAV method. However, under real conditions, only a small subset of peaks, at best, can be annotated as b- and y-ions. Furthermore, the annotation is only tentative. We measured the frequency of b- and y-ions in fragmentation spectra to determine the best candidates for the construction of an index and explored how many b- and y-ions are needed for a robust lookup.

Using fragment ions and peptide mass in place of amino acid composition retains the computational advantages of PAAV. Our approach has the additional advantage of relying on data that are contained in fragmentation spectra, and it can motivate future work to extract b- and y-ion information reliably. Because we use fragment ions rather than amino acid composition, as PAAV does, we refer to this method as peptide fragment ion vectors (PFIV) (Figure 1).
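As an illustration of what such a vector could look like, the following minimal Python sketch computes a PFIV-style vector (peptide mass followed by sorted fragment ion masses) for a single peptide. It is not the implementation used in this work. The function names `fragment_mz` and `pfiv` are hypothetical, the residue masses are standard monoisotopic values rather than values taken from this study, fragment ions are assumed to be singly charged, and cysteine is assumed to be carbamidomethylated.

```python
# Standard monoisotopic residue masses in Da (assumption: not taken from the paper).
# Cysteine includes the carbamidomethyl group; I and L share one mass.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_mz(seq, ion):
    """m/z of a singly charged b- or y-ion, e.g. ion='b2' or ion='y3'."""
    series, n = ion[0], int(ion[1:])
    if series == "b":
        # b-ions contain the first n residues plus one proton
        return sum(RESIDUE[a] for a in seq[:n]) + PROTON
    # y-ions contain the last n residues plus water plus one proton
    return sum(RESIDUE[a] for a in seq[-n:]) + WATER + PROTON


def pfiv(seq, ions=("b2", "y2", "y3")):
    """Vector [Mr, m1, ..., m(n-1)]: peptide mass first, then the chosen
    fragment masses in ascending order, all rounded to two decimals."""
    mr = sum(RESIDUE[a] for a in seq) + WATER
    fragments = sorted(round(fragment_mz(seq, ion), 2) for ion in ions)
    return [round(mr, 2)] + fragments


if __name__ == "__main__":
    print(pfiv("GGAGK"))
```

With the b2-y2-y3 combination the vector has four dimensions, compared with the 1940 dimensions of a spectrum binned at 1 m/z.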

10.1021/pr0500152 CCC: $30.25 © 2005 American Chemical Society

Figure 1. The peptide sequence database is converted into a set of coordinates composed of the peptide mass and selected fragment masses. The vector space is then clustered using the k-means algorithm. Peptides are identified by extracting the corresponding ions from the MS/MS spectra and finding the closest vector in the peptide space.

Materials and Methods

LC-MS Analysis. Protein mixtures from ongoing work in our lab on human nuclear complexes were reduced, alkylated, digested using trypsin at 1:20, and desalted on StageTips (12) prior to use. A QSTAR XL quadrupole time-of-flight (TOF) tandem mass spectrometer (Applied Biosystems/MDS-Sciex, Toronto, Canada) coupled to an Agilent 1100 NanoHPLC was used as the LC-MS system in this study. A Valco titanium union was taped to the NanoElectrospray ion source (Applied Biosystems/MDS-Sciex) and used to hold the packed spray needles. Spray emitters (FS360-100-8-N-5-C15, New Objective, Woburn, MA) were used as columns and were self-packed with ReproSil-Pur C18-AQ 3 µm (Dr. Maisch GmbH, Ammerbuch-Entringen, Germany). A spray voltage of 2100 V was applied. Mobile phase A consisted of water, 5% acetonitrile, and 0.5% acetic acid, and mobile phase B of acetonitrile and 0.5% acetic acid. The gradient went from 0% to 20% B in 50 min and then to 80% B in 25 min at a flow rate of 300 nL/min.

Peptide identification for the test set was performed using Mascot (v. 2.0) with the following search parameters: database, IPI human 2.37; allowed missed cleavages, 0; fixed modifications, carbamidomethyl cysteine; variable modifications, oxidized methionine; peptide mass tolerance, 0.2 Da; MS/MS tolerance, 0.2 Da. The experimental dataset was composed of 1131 peptide sequences, extracted from the Mascot results files, with scores higher than 35 and rank 1.

Vector Space Construction. To create vector spaces that represent the peptides in the database and can be used for peptide identification, IPI human 2.37 (13) was cleaved in silico with trypsin specificity, allowing no missed cleavages. I and L were considered the same amino acid. Only peptides in the mass range 600-8000 Da with a unique sequence were used in the following steps. The resulting peptide list was represented in a space $S^n$ defined in n dimensions. Each peptide was represented as a vector p with the peptide mass Mr as the first dimension; the remaining n-1 dimensions represented the masses of the chosen ions, sorted in ascending order:

$$
p = \begin{bmatrix} M_r \\ m_1 \\ \vdots \\ m_{n-1} \end{bmatrix}, \qquad m_1 \le m_2 \le \dots \le m_{n-1}
$$

Mass values were rounded to the second decimal place. Each vector was tagged with the corresponding peptide sequence. Duplicate vectors were removed and the remaining unique vectors were tagged with all the corresponding peptide sequences. Vector spaces were clustered using a k-means algorithm (14), as described below:

1. Start at $t = 0$ and generate k random vectors as cluster centers: $C_i(0)$, $i \in \{1, \dots, k\}$.

2. Assign each vector $p_j$ in space S to the closest cluster C:

$$
\lambda_j(t) = \underset{i \in \{1, \dots, k\}}{\arg\min}\; d\left(p_j, C_i(t)\right)
$$

where d is the Euclidean distance between $p_j$ and $C_i$ in n dimensions:

$$
d = \sqrt{\sum_{i=0}^{n} \Delta x_i^2}
$$

3. Calculate all k cluster means:

$$
C_i(t+1) = \frac{1}{|C_i(t)|} \sum_{\lambda_j(t) = i} p_j
$$

4. Repeat steps 2 and 3, with $t \leftarrow t+1$, until 95% of the vectors p no longer change cluster.

Perfect convergence required a significant number of iterations. However, 95% convergence was reached in 12 iterations. After clustering was complete, the vector space was ready to be queried with the mass spectrometric data. Each query vector q was matched in space by first assigning q to the closest cluster and then to the closest vector p inside that cluster using the Euclidean metric.

It is important to note that the 1131 vectors derived from our experimental dataset carried the peptide sequence as a tag as a result of our prior analysis of the mass spectrometric data by standard means (Mascot). The Mascot search results were used to locate the b- and y-ions in the fragmentation spectra. While the peptide sequence was not used for matching, it was used during the evaluation. Given d, the distance between the query and the assigned vector; τ, a distance cutoff; sq, the sequence of the query peptide; and sa, the sequence of the assigned peptide, we define

d 0.21 Da. A cutoff value that best classifies the assignments can be estimated from ROC curve analysis. For vector spaces built with three ion masses the best cutoff is τ = 0.21 Da; for the b2-y2 space the best cutoff is τ = 0.18 Da. This cutoff, or score, jointly addresses the accuracy of the precursor and fragment masses. It is appropriate for instruments such as quadrupole-TOF or TOF-TOF mass spectrometers, which have the same accuracy for precursor and fragment masses. An instrument with high accuracy on the precursor mass and much lower accuracy on the fragment masses, such as a 2D ion trap-FTICR mass spectrometer, gives vectors with higher accuracy in one dimension than in the others. This could be addressed through a dual cutoff. However, trapping instruments in general are not well suited for the detection of small ions.

We also measured the speed of the method (Table 6); most of the time is spent on clustering. The clustering can be done once and then saved for future searches. Matching an experimental vector in the vector space took from 50 ms in the smallest space to 190 ms in the largest one. Even on the relatively small computer we were using, the speed is therefore high enough to allow for on-the-fly analysis during the acquisition of the mass spectrometric data.

Modified peptides were not considered in this work. In preliminary experiments we have seen that modified peptides result in false negative identifications: the assigned sequence equals the query sequence, but the distance is greater than the cutoff (data not shown).

To improve the speed of the method we are considering using integer arithmetic instead of floating point. Also, using a 1-norm distance metric instead of the Euclidean distance could increase the computational speed:

$$
d = \sum_{i=0}^{n} |\Delta x_i|
$$
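The clustering, the two-stage lookup, and the distance cutoff discussed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used in this work: all function names are hypothetical, the initial cluster centers are sampled from the data points (the text only specifies random vectors), an empty cluster keeps its previous center, and an assignment is accepted only when d is below the cutoff τ. The `manhattan` function is the 1-norm alternative to the Euclidean distance.

```python
import math
import random


def euclidean(p, q):
    """Euclidean (2-norm) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """1-norm distance; needs neither squares nor a square root."""
    return sum(abs(a - b) for a, b in zip(p, q))


def nearest(p, points, dist):
    """Index of the entry in `points` that is closest to p."""
    return min(range(len(points)), key=lambda i: dist(p, points[i]))


def kmeans(vectors, k, stable=0.95, max_iter=100, seed=0, dist=euclidean):
    """k-means that stops once a fraction `stable` of vectors keep their cluster."""
    rng = random.Random(seed)
    # Assumption: initial centers are drawn from the data points themselves.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        new_labels = [nearest(p, centers, dist) for p in vectors]  # step 2
        unchanged = sum(a == b for a, b in zip(labels, new_labels))
        labels = new_labels
        if unchanged >= stable * len(vectors):                     # step 4
            break
        for i in range(k):                                         # step 3
            members = [p for p, lab in zip(vectors, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    else:
        # Iteration limit reached: make labels consistent with final centers.
        labels = [nearest(p, centers, dist) for p in vectors]
    return centers, labels


def identify(query, vectors, tags, centers, labels, cutoff, dist=euclidean):
    """Closest cluster first, then closest vector inside that cluster."""
    cluster = nearest(query, centers, dist)
    members = [j for j, lab in enumerate(labels) if lab == cluster]
    if not members:
        return None, math.inf
    best = min(members, key=lambda j: dist(query, vectors[j]))
    d = dist(query, vectors[best])
    # Assumption: the match is accepted only when d is below the cutoff.
    return (tags[best] if d < cutoff else None), d
```

Because clustering dominates the run time, `kmeans` would be run once per vector space and its result stored; each spectrum then only needs one call to `identify`, for example with `cutoff=0.21` for a three-ion space. Passing `dist=manhattan` switches both functions to the 1-norm, in which case the cutoff would have to be re-estimated.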

Conclusion

We showed that a few low-mass ions are sufficient for peptide identification. If the b2-, y2-, and either the y3- or the b3-ion can be assigned in a fragmentation spectrum, the matching peptide can be retrieved from the database. This shows the usefulness of small fragment ions, and future effort can be directed at reliably extracting them from fragmentation spectra. We implemented the database search as peptide fragment ion vectors (PFIV), using peptide attribute vectors following the work of Halligan et al. (11). PFIV relies exclusively on the precursor mass and small fragments. PFIV is hence complementary to the peptide sequence tag approach, which primarily builds on high-mass fragments. In conjunction with automatic peptide sequence tag searches or other database search methods, PFIV might be used to increase confidence in automatic protein identification.

Acknowledgment. We want to thank the members of our lab for helpful discussion and support, especially Lau Sennels for critically reading the manuscript. We thank Lara Lusa for suggestions regarding the statistical analysis. The work was funded by a Bioinformatics Center Grant from AIRC (Associazione Italiana per la Ricerca sul Cancro).

References
(1) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(2) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(3) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
(4) Tabb, D. L.; Saraf, A.; Yates, J. R. Anal. Chem. 2003, 75, 6415-6421.
(5) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327-342.
(6) Sheng, Q. H.; Xie, T.; Ding, D. F. Acta Biochim. Biophys. Sinica 2000, 32, 595-600.
(7) Zhang, Z. Q.; McElvain, E. S. Anal. Chem. 2000, 72, 2337-2350.
(8) Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J. Comput. Biol. 2001, 8, 325-337.
(9) Yergey, A. L.; Coorssen, J. R.; Backlund, P. S.; Blank, P. S.; Humphrey, G. A.; Zimmerberg, J.; Campbell, J. M.; Vestal, M. L. J. Am. Soc. Mass Spectrom. 2002, 13, 784-791.
(10) Hernandez, P.; Gras, R.; Frey, J.; Appel, R. D. Proteomics 2003, 3, 870-878.
(11) Halligan, B. D.; Dratz, E. A.; Feng, X.; Twigger, S. N.; Tonellato, P. J.; Greene, A. S. J. Proteome Res. 2004, 3, 813-820.
(12) Rappsilber, J.; Ishihama, Y.; Mann, M. Anal. Chem. 2003, 75, 663-670.
(13) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. Proteomics 2004, 4, 1985-1988.
(14) Hartigan, J. A. Clustering Algorithms 1998, 29, 256-276.
(15) Hunt, D. F.; Yates, J. R., 3rd; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. U.S.A. 1986, 83, 6233-6237.
(16) Shevchenko, A.; Chernushevich, I.; Ens, W.; Standing, K. G.; Thomson, B.; Wilm, M.; Mann, M. Rapid Commun. Mass Spectrom. 1997, 11, 1015-1024.
(17) Uttenweiler-Joseph, S.; Neubauer, G.; Christoforidis, S.; Zerial, M.; Wilm, M. Proteomics 2001, 1, 668-682.
(18) Schlosser, A.; Lehmann, W. D. Proteomics 2002, 2, 524-533.


Journal of Proteome Research • Vol. 4, No. 3, 2005 1011