Anal. Chem. 2004, 76, 1726-1732
Artificial Neural Network Analysis for Evaluation of Peptide MS/MS Spectra in Proteomics
Tomasz Bączek,*,† Adam Buciński,‡ Alexander R. Ivanov,§ and Roman Kaliszan†
Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Gen. J. Hallera 107, 80-416 Gdańsk, Poland, Division of Food Science, Institute of Animal Reproduction and Food Research of the Polish Academy of Sciences, Tuwima 10, 10-747 Olsztyn, Poland, and Harvard NIEHS Center for Environmental Health Proteomics Facility, Harvard School of Public Health, 665 Huntington Avenue, Boston, Massachusetts 02115
The aim of the work was to explore the usefulness of artificial neural network (ANN) analysis for the evaluation of proteomics data. The analysis was applied to the data generated by the widely used protein identification program Sequest, supplemented with several structural parameters readily calculated from peptide molecular formulas. Proteins from yeast cells were identified based on the MS/MS spectra of peptides. The constructed ANN was demonstrated to classify automatically as either “good” or “bad” the peptide MS/MS spectra otherwise classified manually. An appropriately trained ANN proves to be a high-throughput tool facilitating examination of Sequest’s results. ANNs are recommended as a means of automatic processing of large amounts of MS/MS data, which normally must be considered in the analysis of complex mixtures of proteins in proteomics.
Completion of the Human Genome Project enabled a better understanding of biological functions of organisms.1,2 However, these studies still provide a limited insight into the cellular processes. Nowadays, a comprehensive analysis and characterization of all the expressed proteins, called proteomics, is the focus of interest. Today, the most widely used procedure for analysis of complex protein mixtures is two-dimensional gel electrophoresis.3-6 While this approach has high resolving power, it suffers from a number of limitations.7,8 It is well known that efficient separation prior to mass
* Corresponding author: (tel.) (48) (58) 3493260; (fax) (48) (58) 3493262; (e-mail): [email protected].
† Medical University of Gdańsk.
‡ Institute of Animal Reproduction and Food Research of the Polish Academy of Sciences.
§ Harvard School of Public Health.
(1) International Human Genome Sequencing Consortium. Nature 2001, 409, 860-921.
(2) Venter, J. C.; Adams, M. D.; Myers, E. W.; Li, P. W.; Mural, R. J.; et al. Science 2001, 291, 1304-1351.
(3) Perrot, M.; Sagliocco, F.; Mini, T.; Monribot, C.; Schneider, U.; Shevchenko, A.; Mann, M.; Jeno, P.; Boucherie, H. Electrophoresis 1999, 20, 2280-2298.
(4) Poutanen, M.; Salusjarvi, L.; Ruohonen, L.; Penttila, M.; Kalkkinen, N. Rapid Commun. Mass Spectrom. 2001, 15, 1685-1692.
(5) Joubert, R.; Strub, J.-M.; Zugmeyer, S.; Kobi, D.; Carte, N.; van Dorsselaer, A.; Boucherie, H.; Jaquet-Gutfreund, L. Electrophoresis 2001, 22, 2969-2982.
(6) Salusjarvi, L.; Poutanen, M.; Pitkanen, J.-P.; Koivistoinen, H.; Aristidou, A.; Kalkkinen, N.; Ruohonen, L.; Penttila, M. Yeast 2003, 20, 295-314.
(7) Liebler, D. C. Introduction to Proteomics; Humana Press: Totowa, NJ, 2002.
(8) Pandey, A.; Mann, M. Nature 2000, 405, 837-846.
1726 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
spectrometry (MS) and proper database searching greatly facilitate identification of proteins.9 Such high-resolution separation techniques as two-dimensional chromatography (ion-exchange chromatography combined with reversed-phase liquid chromatography (RPLC),10-15 size-exclusion chromatography combined with RPLC,16 and RPLC combined with capillary zone electrophoresis17) coupled to mass spectrometry are currently under intensive development and evaluation in proteomics. Currently, liquid chromatography is primarily used to separate complex mixtures of peptides and proteins, whereas mass spectrometry is the method of choice for their identification. Liquid chromatography coupled to tandem mass spectrometry (LC/MS/MS) is a standard setup used to identify the components of protein complexes and subcellular compartments. This technique is capable of identifying hundreds of peptides.10,11 On the other hand, the high resolving power of two-dimensional chromatography (LC/LC) coupled to MS enables identification of thousands of peptides. Link et al.10 used that approach to identify the components of yeast and human ribosomes, while Washburn et al.,11 as well as Peng et al.,12 analyzed whole yeast cells. Alternative proteome analysis strategies based on peptide separations, including in-solution isoelectric focusing (sIEF)18-21 and capillary isoelectric focusing,22,23 as well
(9) Wehr, T. LCGC North Am. 2002, 20, 954-962.
(10) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-682.
(11) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-247.
(12) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(13) Wagner, K.; Miliotis, T.; Marko-Varga, G.; Bischoff, R.; Unger, K. K. Anal. Chem. 2002, 74, 809-820.
(14) Opiteck, G. J.; Lewis, K. C.; Jorgenson, J. W.; Anderegg, R. J. Anal. Chem. 1997, 69, 1518-1524.
(15) Davis, M. T.; Beierle, J.; Bures, E. T.; McGinley, M. D.; Mort, J.; Robinson, J. H.; Spahr, C. S.; Yu, W.; Luthy, R.; Patterson, S. D. J. Chromatogr., B 2001, 752, 281-291.
(16) Opiteck, G. J.; Jorgenson, J. W.; Anderegg, R. J. Anal. Chem. 1997, 69, 2283-2291.
(17) Lewis, K. C.; Opiteck, G. J.; Jorgenson, J. W.; Sheeley, D. M. J. Am. Mass Spectrom. 1997, 8, 495-500.
(18) Tan, A.; Pashkova, A.; Zang, L.; Foret, F.; Karger, B. L. Electrophoresis 2002, 23, 3599-3607.
(19) Zuo, X.; Speicher, D. W. Proteomics 2002, 2, 58-68.
(20) Herbert, B.; Righetti, P. G. Electrophoresis 2000, 21, 3639-3648.
(21) Michel, P. E.; Reymond, F.; Arnaud, I. L.; Josserand, J.; Girault, H. H.; Rossier, J. S. Electrophoresis 2003, 24, 2-11.
(22) Shen, Y.; Xiang, F.; Veenstra, T. D.; Fung, E. N.; Smith, R. D. Anal. Chem. 1999, 71, 5348-5353.
10.1021/ac030297u CCC: $27.50
© 2004 American Chemical Society Published on Web 02/05/2004
as chromatofocusing,24-26 have recently been employed for both protein and peptide identifications. In the present work, a two-dimensional separation system comprising sIEF and RPLC was used. An important issue in proteomics is finding an algorithm and computer program allowing unambiguous protein identification based on searching a sequence database using mass spectrometry data. In some approaches, peptides of various molecular masses from enzymatic digestion of a protein contribute to the experimental data (the so-called peptide mass fingerprinting approach).27 Another possibility is to use the MS/MS data of one or more peptides to confirm the protein identification (the so-called MS/MS ion search approach). Generally, the experimental data are compared with the calculated peptide mass or fragment ion mass values obtained by applying appropriate cleavage rules to the entries in a sequence database. Corresponding mass values are then counted or scored in such a way that the peptide or protein that best matches the data can be identified.27 In 1994, Yates and co-workers28,29 developed Sequest, a correlation algorithm for protein identification that matches actual peptide tandem mass spectrometry data to appropriate data from protein databases. Currently this software is one of the most widely used programs in proteomics.30 The basis of Sequest is the assumption that amino acid sequences of peptides can be inferred from tandem mass spectra. The Sequest algorithm automates the inferring process, first by enumerating candidates from the database that match the observed peptide’s mass. Then, the sequences are quickly checked against the spectrum by a preliminary scoring algorithm and the first nonmatches are removed.
Last, a more extensive cross-correlation scoring algorithm evaluates the sequence-derived theoretical spectra and compares them against the observed spectrum, and the sequences are ranked on the basis of such scoring.31 The use of programs such as Sequest is strictly associated with the appropriate interpretation of the MS data. In the Sequest program, output information is generated for peptides in the given database whose theoretical spectra match the given experimental spectrum well.30 A collection of statistics is presented, which helps to classify each match. Initially, the difference between normalized cross-correlation functions for the first and second ranked results (∆Cn) is used to indicate a correctly selected peptide sequence. Next, additional criteria are added, including the cross-correlation score between the observed peptide fragment mass spectrum and the theoretically predicted one
(23) Shen, Y.; Berger, S. J.; Anderson, G. A.; Smith, R. D. Anal. Chem. 2000, 72, 2154-2159.
(24) Kang, X.; Frey, D. D. Anal. Chem. 2002, 74, 1038-1045.
(25) Kang, X.; Frey, D. D. J. Chromatogr., A 2003, 991, 117-128.
(26) Lubman, D. M.; Kachman, M. T.; Wang, H.; Gong, S.; Yan, F.; Hamler, R. L.; O’Neil, K. A.; Zhu, K.; Buchanan, N. S.; Barder, T. J. J. Chromatogr., B 2002, 782, 183-196.
(27) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(28) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(29) Yates, J. R., III; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(30) Anderson, D. C.; Li, W.; Payan, D. G.; Noble, W. S. J. Proteome Res. 2003, 2, 137-146.
(31) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. J. Proteome Res. 2002, 1, 2-26.
(Xcorr), the preliminary score based on the number of ions in the MS/MS spectrum that match the experimental data (Sp), the rank of the particular match during the preliminary scoring (RSp), and the ion value (I) describing how many of the detected (observed) ions match the theoretical ions for the peptide listed. Finally, the Sequest analysis is normally followed by a manual interpretation of the MS/MS spectra. Anderson et al.30 recently described a learning algorithm, called the support vector machine (SVM), for evaluating the Sequest database search results. SVM was designed to distinguish between the correctly and incorrectly identified peptides based on recognition of subtle patterns in a complex data set. Using appropriate training sets, that approach allowed an automated computational analysis of Sequest data for individual peptides. That enabled a high-throughput analysis of peptide sequencing results. For MS/MS data, the SVM analysis of the experimentally obtained parameters, the Sequest-calculated statistics, and the additionally proposed parameters allowed a better match between these MS/MS data and the peptide sequences. However, manual examination of spectra of peptides with low SVM-calculated scores was still recommended to identify the noisy or poorly fragmenting spectra that might compromise peptide identification. The approach described in the present paper addresses the problem of manual interpretation of spectra of peptides. To circumvent or reduce manual interpretation, artificial neural network (ANN) analysis has been proposed. The ANN analysis is a method of data analysis that is supposed to emulate the human brain’s way of working. Artificial neural nets mimic the way in which arrays of neurons probably function in biological learning and memory. ANNs differ from classical computer programs in that they “learn” from a set of examples rather than being programmed to get the right answer. 
The information is encoded in the strength of the network’s “synaptic” connections.32-34 An ANN is a compact group of connected elements, ordered in layers, that are able to process information. There are three kinds of layers in an ANN: an input layer, one or more hidden layers, and an output layer. The elementary units of the network are called artificial neurons. These elements are connected with each other with different connection strengths. The connections are called synaptic weights. All the information held by the network is encoded in the weights, because those weights are in fact the numbers that determine the strength of the stimuli coming to the neurons. The most important feature of an ANN, which determines the specificity of that computational method, is the process of learning. Learning is realized by changing the values of all synaptic weights with the use of a specific algorithm. The most commonly used learning algorithm is the so-called back-propagation training algorithm. In the process of learning with that algorithm, the network uses the error between the current and the desired output to improve the values of synaptic weights. A more detailed description of ANNs has recently been presented in this journal.34
(32) Zupan, J.; Gasteiger, J. Anal. Chim. Acta 1991, 248, 1-30.
(33) Zupan, J.; Gasteiger, J. Neural Networks for Chemists. An Introduction; VCH: Weinheim, 1993.
(34) Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E. F.; Shen, Y.; Zhao, R.; Smith, R. D. Anal. Chem. 2003, 75, 1039.
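The learning process described above can be sketched in a few lines of Python. The fragment builds a network with one hidden layer and adjusts its synaptic weights by back-propagation with a momentum term. The class name TinyMLP, the toy learning set (a logical OR problem), and the network size are illustrative assumptions; the default learning coefficient (0.1) and momentum (0.3) follow the values quoted in the Results section. This is not the network or the software used in this work.

```python
import math
import random


def sigmoid(x):
    """Logistic activation of an artificial neuron."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """Hypothetical one-hidden-layer perceptron trained by back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # One weight row per neuron; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        # Previous weight changes, kept for the momentum term.
        self.dw_hidden = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
        self.dw_out = [0.0] * (n_hidden + 1)

    def forward(self, x):
        """Return (hidden activations, network output) for one input vector."""
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        out = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, target, lr=0.1, momentum=0.3):
        """Update all weights from the error on a single example."""
        hidden, out = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Error signal at the output neuron (squared error, sigmoid unit).
        delta_out = (target - out) * out * (1.0 - out)
        # Error signals propagated back to the hidden neurons.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, v in enumerate(hb):
            self.dw_out[j] = lr * delta_out * v + momentum * self.dw_out[j]
            self.w_out[j] += self.dw_out[j]
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(xb):
                change = lr * delta_hidden[j] * v + momentum * self.dw_hidden[j][i]
                self.dw_hidden[j][i] = change
                row[i] += change


def rms_error(net, data):
    """Root-mean-squared error between desired and current outputs."""
    return math.sqrt(sum((t - net.forward(x)[1]) ** 2 for x, t in data) / len(data))


# Toy learning set (an assumption for illustration): logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
net = TinyMLP(n_in=2, n_hidden=3)
error_before = rms_error(net, data)
order = random.Random(1)
for epoch in range(5000):
    order.shuffle(data)  # present the examples in a randomized manner
    for x, target in data:
        net.train_step(x, target)
error_after = rms_error(net, data)
print(error_before, error_after)
```

The rms error printed after training should be lower than the one before training, which is the quantity a training error graph tracks epoch by epoch.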
Table 1. Exemplary Fragment of the Input Data Considered in ANN Analysis

sequence                       pI     dz/dpH   H       MW       CH  Xcorr  ∆Cn   Sp     RSp  I     good/bad (1/0)
-.MAKDLLPKQAANEPSLK.D          8.25   -0.2353  43.10   1855.19  2   1.73   0.30  124.3  19   0.34  0
-.MASNAARVVATAKDFDK.V          8.34   -0.1936  40.30   1796.04  2   1.92   0.69  137.4  17   0.31  1
-.MDFYTTDINKNVVPLFSK.G         5.71   -0.2192  21.00   2133.45  2   1.76   0.70  301.0  2    0.35  0
-.MEQINSNSRK.K                 8.50   -0.1378  44.20   1207.34  2   2.26   0.33  994.4  2    0.67  1
-.MFNCLTKLVILVCLKYVAK.A        10.30  -1.5358  -34.90  2314.89  2   1.56   0.71  92.7   25   0.19  1
-.MGSISRYLLK.K                 11.00  -0.3808  1.80    1168.44  2   2.00   0.24  193.2  23   0.44  1
K.EVSVTKPLDVFLAASNLAR.A        6.07   -0.1610  21.20   2031.34  3   3.90   0.41  555.0  1    0.32  1
-.MLCAIKSTGYRYPR.T             12.01  -1.1734  17.10   1717.02  2   1.59   0.38  122.7  6    0.31  1
-.MLNILVLGNGAR.E               9.50   -0.0145  -4.00   1271.56  2   1.83   0.16  806.3  1    0.59  0
-.MMIIIFIELCRIADSLLWIPK.S      5.82   -0.2743  -56.90  2577.23  2   1.92   0.36  60.4   1    0.20  1
-.MSDDDYMNSDDDNDAEKR.Y         3.72   -4.3706  104.60  2137.12  2   1.66   0.66  144.5  21   0.26  1
-.MSTNCFSGYKDLIKEGDLTLIWVSR.D  5.88   -0.3078  30.50   2935.34  2   1.70   0.30  203.3  1    0.27  1
R.GNPTVEVELTTEK.G              4.25   -1.9601  52.80   1417.54  1   2.49   0.53  180.6  1    0.38  0
K.AADALLLKVNQIGTLSESIK.A       6.11   -0.1463  23.70   2085.43  3   3.82   0.40  959.4  1    0.39  1
-.MYNPVDAVLTK.I                5.59   -0.1665  1.40    1251.48  2   1.50   0.15  112.3  88   0.35  1
-.TQFTDIDKLAVSTIR.I            5.63   -0.2653  14.00   1708.94  3   3.58   0.54  549.0  1    0.50  1
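As a hedged illustration of how rows of Table 1 might be turned into network inputs, the Python sketch below stores one peptide match per record, applies the charge-dependent Xcorr and ∆Cn acceptance criteria stated in the Data Analysis part of the Experimental Section, and scales the ten input variables into the 0-1 range. The names PeptideMatch, passes_sequest_filter, and min_max_scale are hypothetical, and column-wise min-max scaling is an assumption about how the 0-1 scaling could be done; the text does not specify the exact procedure used by the software.

```python
from dataclasses import dataclass


@dataclass
class PeptideMatch:
    """One row of Table 1 (field names are hypothetical)."""
    sequence: str
    pI: float
    dz_dpH: float
    H: float
    MW: float
    CH: int
    Xcorr: float
    dCn: float
    Sp: float
    RSp: int
    I: float
    good: int  # manual label: 1 = "good", 0 = "bad"

    def features(self):
        """The 10 network inputs, in the column order of Table 1."""
        return [self.pI, self.dz_dpH, self.H, self.MW, float(self.CH),
                self.Xcorr, self.dCn, self.Sp, float(self.RSp), self.I]


def passes_sequest_filter(m):
    """Charge-dependent Xcorr cutoffs and the dCn cutoff quoted in the text."""
    if m.dCn <= 0.08:
        return False
    if m.CH == 1:
        return m.Xcorr > 2.0
    if m.CH == 2:
        return m.Xcorr >= 1.5
    if m.CH == 3:
        return m.Xcorr > 3.3
    return False  # other charge states are not covered by the stated criteria


def min_max_scale(rows):
    """Scale each column of a list of feature vectors into the 0-1 range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


# Three rows copied from Table 1.
matches = [
    PeptideMatch("-.MAKDLLPKQAANEPSLK.D", 8.25, -0.2353, 43.10, 1855.19, 2,
                 1.73, 0.30, 124.3, 19, 0.34, 0),
    PeptideMatch("-.MASNAARVVATAKDFDK.V", 8.34, -0.1936, 40.30, 1796.04, 2,
                 1.92, 0.69, 137.4, 17, 0.31, 1),
    PeptideMatch("K.EVSVTKPLDVFLAASNLAR.A", 6.07, -0.1610, 21.20, 2031.34, 3,
                 3.90, 0.41, 555.0, 1, 0.32, 1),
]
accepted = [m for m in matches if passes_sequest_filter(m)]
inputs = min_max_scale([m.features() for m in accepted])
targets = [m.good for m in accepted]
```

In practice the scaling limits would be taken over the whole data set rather than over three rows; the small sample is used here only to keep the example self-contained.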
In chemistry and related fields of research, an interest in neural network computing has been noted since 1986.35,36 ANNs found applications for compound classification, modeling of structure-activity relationships, identification of potential drug targets, and localization of specific structural and functional sites on biopolymers.37 Also, the suitability of neural networks to identify the strong correlation between carcinogenicity of polycyclic aromatic hydrocarbons and their structural parameters inferred from 13C NMR spectra was reported.38 A good performance of ANNs in predicting bioactivity classes based on physicochemical parameters of agents was demonstrated for dihydrofolate reductase inhibitors.39,40 ANNs were also proposed as decision support systems in dentistry41 and urology42-44 and to assess HIV/AIDS-related health performance.45 Very recently, attempts to use ANNs for modeling of chromatographic retention,46-50 including predictions of HPLC retention of both small-molecular-mass analytes49,50 and peptides,34 were also reported. In the approach presented here, a tryptic digest of proteins from yeast extract was analyzed by LC/ESI-MS/MS with the use of an ion trap. The peptide spectra were manually classified as “good” or “bad” according to the recommendations of Link et al.10 All the peptides detected were divided into three groups: training, validating, and testing. Each individual peptide was characterized by several features (Table 1). Among those were the following:
(35) Kaliszan, R. Structure and Retention in Chromatography, A Chemometric Approach; Harwood Academic: Amsterdam, 1997.
(36) Schneider, G.; Wrede, P. Prog. Biophys. Mol. Biol. 1998, 70, 175-222.
(37) Isu, Y.; Nagashima, U.; Hosoya, H.; Aoyma, T. J. Chem. Software 1994, 2, 76-95.
(38) Andrea, T. A.; Kalayeh, H. J. Med. Chem. 1991, 34, 2824-2836.
(39) So, S.-S.; Richards, W. G. J. Med. Chem. 1992, 35, 3201-3207.
(40) Ajay, A. J. Med. Chem. 1993, 36, 3565-3571.
(41) Brickley, M. R.; Shepherd, J. P.; Armstrong, R. A. J. Dent. 1998, 26, 305-309.
(42) Snow, P. B.; Rodvold, D. M.; Brandt, J. M. Urology 1999, 54, 787-790.
(43) Wei, J. T.; Tewari, A. Urology 1999, 54, 945-948.
(44) Krongrad, A.; Lai, S. Urology 1999, 54, 949-951.
(45) Lee, C. W.; Park, J.-A. Inf. Manage. (Amsterdam) 2001, 38, 231-238.
(46) Jalali-Heravi, M.; Parastar, F. J. Chromatogr., A 2000, 903, 145-154.
(47) Loukas, Y. L. J. Chromatogr., A 2000, 904, 119-129.
(48) Jimenez, O.; Marina, M. L. J. Chromatogr., A 1997, 780, 149-163.
(49) Buciński, A.; Bączek, T. Pol. J. Food Nutr. Sci. 2002, 11, 47-51.
(50) Kaliszan, R.; Bączek, T.; Buciński, A.; Buszewski, B.; Sztupecka, M. J. Sep. Sci. 2003, 26, 271-282.
the observed data (peptide molecular mass, MW, and peptide charge, CH), the Sequest program-calculated statistics data (Xcorr, ∆Cn, Sp, RSp, and I), and the parameters calculated for individual peptides based on their structural formulas (isoelectric point value, pI, the titration curve slope at the pI, dz/dpH, and hydrophobicity, H). The aim of the project was to test whether the trained and validated ANNs exhibit high enough sensitivity and specificity for an accurate, high-throughput assignment of the Sequest data in accordance with their manual interpretation. It has been demonstrated that ANNs were capable of efficiently processing large sets of data generated during the analysis of complex mixtures of proteins. The ANNs constructed in the work predicted in a reliable manner whether the MS/MS spectrum for a considered peptide was “good” or “bad”, thus replacing the need for manual interpretation of the huge amounts of MS/MS spectra typically considered in proteomics.
EXPERIMENTAL SECTION
Materials. Reagents. Ammonium persulfate, N,N,N′,N′-tetramethylethylenediamine, (γ-methacryloxypropyl)trimethoxysilane, trypsin, ammonium bicarbonate, dithiothreitol (DTT), iodoacetamide (IAA), trifluoroacetic acid, YPD broth, and protease inhibitor cocktail for fungal and yeast cells were obtained from Sigma-Aldrich (St. Louis, MO). Immobiline II pK 3.6, 4.6, 6.2, 7.0, 8.5, and 9.3 were products of Amersham Biosciences (Piscataway, NJ). Acrylamide, methylenebisacrylamide, urea, and tris(hydroxymethyl)aminomethane were from Pharmacia Biotech (Uppsala, Sweden). Isoelectric focusing anode and cathode buffers, ion-exchange membranes, and the PowerPac 3000 power supply were from Bio-Rad (Hercules, CA).
Methods. Yeast Growing and Preparation of the Yeast Proteins Sample. Yeast Saccharomyces cerevisiae strain YPH499 was obtained from American Type Culture Collection (Manassas, VA). 
Growth in YPD broth (containing yeast extract, peptone, and dextrose) at 30 °C was continued up to a density of 6 × 10^8 cells/mL. Next, the cells were harvested by centrifugation at 3000g for 10 min at 4 °C, and after decanting of the supernatant, the pellets were allowed to drain and their wet weight was determined. Lysis of the cells was performed using YeastBuster Protein Extraction Reagent (Novagen, Madison, WI),51 with the addition of protease inhibitors for fungal and yeast cells according to the instructions of the manufacturer. The concentration of the proteins, determined with the use of the bicinchoninic acid protein assay kit (Sigma-Aldrich, St. Louis, MO), was ∼7 mg/mL.
Protein Digestion. To enhance in-solution enzymatic digestion of the proteins in the yeast proteins sample, the digestion was performed with the addition of RapiGest SF reagent (Waters, Milford, MA), according to the instructions of the manufacturer. Briefly, the procedure was as follows. Protein pellets lyophilized to dryness were first suspended in RapiGest SF, previously dissolved in 50 mM ammonium bicarbonate (to give a 0.2% RapiGest solution after the addition of DTT and IAA solutions), and vortexed. After addition of DTT to a final concentration of 5 mM, the sample was heated at 60 °C for 30 min. After denaturation, the mixture was allowed to cool and IAA was added to a final concentration of 15 mM. The resulting mixture was placed in the dark for 30 min at room temperature. Then, trypsin was added in an amount providing an enzyme-to-protein ratio of 1:50 (w/w), and the sample was incubated at 37 °C for 90 min.
In-Solution Isoelectric Focusing Fractionation. The sIEF device was made and sIEF fractionation carried out according to the procedure of Tan et al.18 The sIEF device containing 11 polyacrylamide gel membranes, with pH values 4.00, 4.20, 4.36, 4.50, 5.21, 5.83, 5.99, 6.40, 8.47, 8.75, and 9.74, was used for the sIEF fractionation of the yeast proteins digest and the bovine serum albumin (BSA) digest, used additionally to optimize the fractionation of yeast peptides. The voltage was applied according to the following program: 100 V for 30 min, 200 V for 30 min, 500 V for 60 min, 1000 V for 60 min, and then 2000 V until completion of the process. The sIEF process was stopped when the current was less than 200 µA. 
For the sIEF analysis, a sample containing 0.1 mg/mL BSA digest and 0.7 mg/mL of the soluble fraction of the proteins from yeast cells was prepared. After focusing, 12 fractions were simultaneously transferred, using a 12-channel digital pipet (Labnet International, Woodbridge, NJ), into 0.2-mL tubes.
Nano-Reversed-Phase Liquid Chromatography/Electrospray Ionization-Ion Trap Tandem Mass Spectrometry (LC/ESI-IT-MS/MS). Nano-LC/MS analysis of sIEF fractions of the yeast proteins digest sample was carried out with the use of an UltiMate Capillary/Nano LC System (Dionex, San Francisco, CA), coupled to the LCQ Deca XP mass spectrometry system (ThermoFinnigan, San Jose, CA). A PepMap C18 column (3 µm, 100 Å, 75 µm i.d. × 150 mm) with a PepMap µ-precolumn (300 µm i.d. × 1 mm, packed with 5-µm C18, 100 Å), both from Dionex, was used. Gradient liquid chromatography elution was done with solvent A (water with the addition of 2% acetonitrile and 0.1% formic acid) and solvent B (water with the addition of 85% acetonitrile, 5% 2-propanol, and 0.1% formic acid). The gradient was 5 to 35% B in 85 min, followed by 35 to 90% B in 10 min, and finally, 90% B for another 5 min. The eluent flow rate was 300 nL/min. On-line ESI-MS was carried out in the positive-ion mode, with the ESI voltage typically set at 0.5-1.4 kV and the heated inlet capillary kept at 160 °C. Experiments were performed with a maximum in-source sample injection time of 50 ms, and three
(51) Drott, D.; Bahairi, S.; Grabski, A. inNovations 2002, 15, 14-16.
microscans were summed for each scan. A full MS scan between 400 and 2000 m/z was followed by three MS/MS scans between 150 and 2000 m/z for the three most intense ions of the MS scan. The relative collision energy was set at 35% with an activation time of 30 ms. Dynamic exclusion was processed with a repeat count of 2 and a repeat duration of 1 min, with a 3-min exclusion duration window.
Data Analysis by ANN. ANN analysis was run on a personal computer using Statistica Neural Networks v. 6.0 software (StatSoft, Tulsa, OK). The database of proteins for S. cerevisiae was that of the European Bioinformatics Institute.52 Isoelectric points, pI, and the titration curve slope values at pI, dz/dpH, of the peptides were calculated using pK values for amino acids.53 The hydrophobicity parameter, H, was calculated according to ref 54. In the case of the yeast proteins digest sample, the TurboSequest software (BioWorks 3.1, Thermo Finnigan) was used to search the database. The main searching criteria applied in TurboSequest are discussed in detail by Peng et al.12 Spectra for singly charged peptides with a cross-correlation score to a tryptic peptide (Xcorr) larger than 2.0, spectra for doubly charged tryptic peptides with Xcorr of at least 1.5, and spectra for triply charged tryptic peptides with Xcorr above 3.3 were accepted. For all the accepted spectra ∆Cn was above 0.08. Oxidation of methionine and carboxyamidomethylation of cysteine were treated as the variable and the fixed modifications, respectively. All the matches were manually confirmed observing the rules recommended by Link et al.10 Hence, first, the MS/MS spectra should be of good quality with the fragment ions clearly above the baseline noise. Second, some continuity in the b or y ion series should hold. Third, the y ions corresponding to a proline residue should be intense ions. 
Fourth, unidentified, intense fragment ions either correspond to +2 fragment ions or are assigned to the loss of one or two amino acids from one of the ends of the peptide.
RESULTS AND DISCUSSION
An artificial neural network, based on the multilayer perceptron and comprising 10 artificial neurons in the input layer, 23, 10, and 7 neurons in three consecutive hidden layers, and a single neuron in the output layer, was used. The architecture of the model applied is given in Figure 1. A supervised method of learning with the back-propagation strategy was used. The ANN was first trained for 1000 epochs. The learning coefficient was 0.1, and the momentum equaled 0.3. Then, the learning was continued with the conjugate gradient descent algorithm up to 1100 epochs to achieve the smallest value of root-mean-squared (rms) error. Data from the learning set were presented to the ANN in a randomized manner during the learning process. The changes in rms error were recorded for the learning and the validating data sets during the learning process (Figure 2). The ANN characterized by the lowest rms error was taken for further consideration. The Statistica Neural Networks software applied in the calculations divides each data set into three sections, the learning, validating, and testing subsets, in constructing the working ANN model. The
(52) http://www.ebi.ac.uk/.
(53) Bjellqvist, B.; Basse, B.; Ilsen, E.; Celis, J. E. Electrophoresis 1994, 15, 529-539.
(54) Abraham, D. J.; Leo, A. J. Proteins 1987, 2, 130-152.
Figure 3. Receiver operating characteristic curves for training, validating, and testing sets.
Figure 1. Architecture of the artificial neural network applied.
Figure 2. Training error graph.
total 2094 peptides analyzed in this work were randomly divided by default by the ANN program into the following sets: a learning set with 1048 peptides, a validating set with 523 peptides, and a testing set with 523 peptides. Prior to that process, the whole data set was scaled within the 0-1 range. The learning set of data is important for ANNs to recognize the relationships between input and output data.55 The quality of that recognition is checked with the use of the validating set of data. That set of data does not change the weight values in the network but indicates the realistic predictive properties of the designed ANN. Hence, learning algorithms do not use the validating or testing sets to adjust network weights, although the validating set may optionally be used to track the network’s error performance, to identify the best network, and to stop training. On that basis, a decision about the continuation or finalization of the learning process is made, and it is done without using
(55) http://www.statsoft.com/.
the testing set of data, which is used only for the final assessment of the ANN designed. Hence, the testing set is not used in training at all and is designed to give an independent assessment of the network’s performance when an entire network design procedure is completed. A straightforward evaluation of the quality of predictions made by ANNs consists of comparison of the classifications thus obtained to the classifications assigned manually. Disagreements between the two are counted as either false positives or false negatives. Prediction quality can be measured using the receiver operating characteristic (ROC) curve. Rather than depending upon a particular classification threshold, the ROC curve integrates information about the complete ranking of examples created by the ANN. The ROC curve represents, for varying classification thresholds, the rate of true positives (sensitivity) in relation to the rate of false positives (1 - specificity). It summarizes the performance of a two-class classifier across the range of possible thresholds. An ideal classifier hugs the left side and top side of the graph, and the area under the curve is 1.0. A random classifier should achieve a value of ∼0.5. The ROC curve is recommended for comparing classifiers as it does not merely summarize performance at a single arbitrarily selected decision threshold but across all the possible decision thresholds. The ROC curve can be used to select an optimum decision threshold. This threshold (which equalizes the probability of misclassification of either class, i.e., the probability of false positives and false negatives) can be used to automatically set confidence thresholds in classification networks with a nominal output variable with the two-state conversion function.55 ROCs for the learning, validation, and testing sets of data are presented in Figure 3. 
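The construction of an ROC curve and of the area under it can be sketched as follows. The function names roc_points and auc, as well as the scores and labels, are made up for illustration and are not data or code from this work; the area is obtained with the trapezoidal rule, which is one common choice.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: network outputs, higher meaning more likely "good";
    labels: manual classification, 1 = "good", 0 = "bad".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical outputs for six spectra and their manual labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(round(area, 3))  # 0.889, i.e. 8 of the 9 good/bad pairs are ranked correctly
```

An area of 1.0 corresponds to a ranking in which every “good” spectrum scores above every “bad” one, and a value near 0.5 to a ranking no better than chance.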
In all cases, the area under the curve exceeds the value 0.82 and hence the models based on the ANN applied can be treated as sensitive and accurate classifiers. In Table 2, summary statistics of the performance of the designed ANN are given. High classification coefficients and reasonably low errors obtained for the learning, validation, and testing sets of data indicate the quality of the predictions obtained. Results of predictions for the learning, validation, and testing sets of data are collected in Table 3. Among 704 spectra for peptides assigned as “good” in the learning set of data, 116 were classified incorrectly. That is, 83.52% of the “good” spectra in this set were assigned correctly. For the learning set of data, 81.69% were correctly assigned among the 344 spectra considered as
Table 2. Summation Statistics for the ANN Used for Evaluation of “Good” and “Bad” MS/MS Spectra

type of ANN: multiple perceptron, 10:10-23-10-7-1:1

                                learning process   validation process   testing process
classification coefficient for  0.8292             0.7553               0.7686
error for                       0.3328             0.3904               0.3982
Table 3. Classification of MS/MS Spectra Obtained with the Use of the Designed ANN for Learning, Validating, and Testing Sets of Data

                                         learning          validation        testing
                                         “good”   “bad”    “good”   “bad”    “good”   “bad”
all spectra                              704      344      368      155      362      161
correctly assigned spectra               588      281      282      113      272      130
incorrectly assigned spectra             116      63       86       42       90       31
percent of correctly assigned spectra    83.52    81.69    76.63    72.90    75.14    80.75
percent of incorrectly assigned spectra  16.48    18.31    23.37    27.10    24.86    19.25
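The percentages in Table 3 follow directly from the counts, as the short Python sketch below illustrates. Treating “good” as the positive class (so that the “good” column gives sensitivity and the “bad” column gives specificity) is an assumption made here for illustration, and the names table3 and summarize are hypothetical. The overall fraction of correctly assigned spectra computed this way appears to coincide with the classification coefficients listed in Table 2.

```python
# Counts copied from Table 3: (all "good", correctly assigned "good",
#                              all "bad",  correctly assigned "bad").
table3 = {
    "learning": (704, 588, 344, 281),
    "validation": (368, 282, 155, 113),
    "testing": (362, 272, 161, 130),
}


def summarize(n_good, ok_good, n_bad, ok_bad):
    """Derive per-class and overall rates from the four counts of one data set."""
    return {
        # Fraction of manually "good" spectra that the ANN also calls "good".
        "sensitivity": ok_good / n_good,
        # Fraction of manually "bad" spectra that the ANN also calls "bad".
        "specificity": ok_bad / n_bad,
        # Fraction of all spectra assigned to the class given manually.
        "accuracy": (ok_good + ok_bad) / (n_good + n_bad),
    }


summary = {name: summarize(*counts) for name, counts in table3.items()}
for name, s in summary.items():
    print(name,
          round(100 * s["sensitivity"], 2),
          round(100 * s["specificity"], 2),
          round(s["accuracy"], 4))
```

For the learning set, for example, this gives 83.52% and 81.69% for the two classes and an overall value of 0.8292.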
Table 4. Sensitivity Analysis Results for the Variables Considered in ANN Analysis

pI    dz/dpH   H     MW    CH     Xcorr   ∆Cn   Sp    RSp   I
6.0   7.0      8.0   3.0   10.0   2.0     5.0   1.0   9.0   4.0

Symbols are explained in the text.
“bad”. Very similar results were obtained for the validating set of data. Here, among 368 “good” spectra, 86 were classified incorrectly, and among 155 “bad” spectra, 42 were classified wrongly. In the case of the testing set of data, 90 of the 362 “good” spectra were classified incorrectly. At the same time, 31 of the 161 “bad” spectra in the testing set of data were classified incorrectly. In parallel with the statistics for the ANN analysis, a sensitivity analysis for the input variables was performed (Table 4). Sensitivity analysis gives insight into the usefulness of individual variables. It may identify variables that can be safely ignored in subsequent analysis and key variables that must always be retained. With that kind of analysis, it is possible to judge which parameters are the most significant (with sensitivity value close to 1) and the least significant (with sensitivity value close to 10) in generating the proposed ANN. According to sensitivity analysis, the Sequest-generated preliminary score, Sp, is the most significant parameter for interpretation of MS/MS spectra. Highly significant are molecular mass, MW, and two Sequest parameters: the ion value, I, and the cross-correlation score between the observed peptide fragment mass spectrum and the theoretically predicted one, Xcorr. The difference between the normalized cross-correlation functions, ∆Cn, is also a significant parameter. Only one Sequest-generated parameter, RSp, appears to be less significant. On the other hand, peptide charge, CH, hydrophobicity, H, isoelectric point value, pI, and the
titration curve slope value at the pI, dz/dpH, have lower significance or are insignificant. Apart from the ANN analysis, the training data set was also subjected to the Fisher test according to ref 30. To evaluate the correlations between individual variables and the classification labels associated with each peptide from the training set (input data), the Fisher criterion score (FCS) was used as a simple metric that is closely related to the Student’s t-test (Table 5). In Fisher analysis, unlike the sensitivity analysis, the most predictive single variable was Xcorr, i.e., the cross-correlation score between the observed peptide fragment mass spectrum and the theoretically predicted one. The preliminary score based on the number of ions in the MS/MS spectrum that match the experimental data, Sp, and the ion value, I, were the second most predictive single variables from the training set of data. The fourth and the fifth were as follows: the difference between the normalized crosscorrelation functions for the first and the second ranked results, ∆Cn, and the rank of the particular match at the preliminary scoring, RSp, respectively. Again, peptide charge, CH, hydrophobicity, H, isoelectric point value, pI, and the titration curve slope value at the pI, dz/dpH, had lower significance or were insignificant. CONCLUSIONS The utility of the artificial neural network analysis for the evaluation of proteomics data has been demonstrated. ANN analysis done on the Sequest-generated data, completed with a few readily calculated peptide parameters, allowed one to limit the manual interpretation of MS/MS spectra. The trained and validated ANN was proved to reliably classify the manually interpreted MS/MS spectra as “good” or “bad”. The ANN analysis appears to be a convenient, high-throughput method to facilitate the examination of the Sequest program results. It enables an automatic processing of large amounts of MS/MS data. Hence,
Table 5. Significance of the Variables in Terms of Fisher Criterion Score (FCS)

variable   pI      dz/dpH   H       MW      CH      Xcorr   ∆Cn     Sp      RSp     I
FCS        0.009   0.001    0.000   0.006   0.005   0.106   0.013   0.080   0.011   0.067
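For orientation, a Fisher criterion score for a single variable can be sketched as below. The form used here, the squared difference of the class means divided by the sum of the class variances, is one common definition; the exact definition and normalization behind Table 5 follow ref 30 and are not restated in this section, so this sketch is not expected to reproduce the tabulated values. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def fisher_criterion_score(values, is_good):
    """FCS of one variable for a two-class ("good" vs "bad") labeling.

    Assumption: FCS = (mean_good - mean_bad)^2 / (var_good + var_bad),
    with population variances (NumPy default, ddof=0).
    """
    values = np.asarray(values, dtype=float)
    is_good = np.asarray(is_good, dtype=bool)
    good, bad = values[is_good], values[~is_good]
    return (good.mean() - bad.mean()) ** 2 / (good.var() + bad.var())


# Toy data (not from the paper): two "good" and two "bad" spectra.
labels = [True, True, False, False]
separating = [3.0, 5.0, 1.0, 3.0]     # class means 4 and 2, variances 1 and 1
uninformative = [1.0, 3.0, 1.0, 3.0]  # identical class means

print(fisher_criterion_score(separating, labels))     # (4 - 2)^2 / (1 + 1)
print(fisher_criterion_score(uninformative, labels))  # equal means give zero
```

A variable whose class means coincide scores zero, which matches the reading of the near-zero entries in Table 5 as variables with little predictive value on their own.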
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
the use of ANNs is proposed for processing of analytical data routinely determined in proteomics. The high significance of the preliminary score, Sp, for the classification of MS/MS spectra as “good” or “bad” may appear surprising, as that parameter has been reported (ref 30) to be one of the less important Sequest-generated parameters for identifying a peptide. It must therefore be emphasized that the sensitivity analysis concerned the importance of individual parameters for the proper classification of spectra otherwise interpreted manually as “good” or “bad”. Because the Fisher criterion of classical statistics did not confirm the highest predictive potency
of Sp, the conclusion about that parameter might be disputable and might result from the limited explanatory power of the “learning” techniques to which ANNs belong.

ACKNOWLEDGMENT
T.B. is very grateful to the Foundation for Polish Science for support during the course of this work.

Received for review August 12, 2003. Accepted December 18, 2003.
AC030297U