Anal. Chem. 2002, 74, 5826-5830
Prediction of Chromatographic Retention and Protein Identification in Liquid Chromatography/Mass Spectrometry
Magnus Palmblad,*,† Margareta Ramström,‡ Karin E. Markides,‡ Per Håkansson,† and Jonas Bergquist‡
The Ångström Laboratory, Division of Ion Physics, Box 534, and Institute of Chemistry, Department of Analytical Chemistry, Box 531, Uppsala University, SE-751 21 Uppsala, Sweden
Liquid chromatography coupled on- or off-line with mass spectrometry is rapidly advancing as a tool in proteomics capable of dealing with the inherent complexity in biology and complementing conventional approaches based on two-dimensional gel electrophoresis. Proteins can be identified by proteolytic digestion and peptide mass fingerprinting or by searching databases using short sequence tags generated by tandem mass spectrometry. This paper shows that information on the chromatographic behavior of peptides can assist protein identification by peptide mass fingerprinting in liquid chromatography/mass spectrometry. This additional information is significant and already available at no extra experimental cost.

Mass spectrometry (MS) is regarded today as one of the most important and versatile tools in proteomics. Commonly applied experimental methods include electrophoretic or chromatographic separation with subsequent off-line tryptic digestion and matrix-assisted laser desorption/ionization or electrospray time-of-flight or ion trap mass spectrometry (1,2). Capillary electrophoresis (CE) MS (3,4) or liquid chromatography (LC) MS (5-7) coupled on-line via electrospray ionization (ESI) interfaces has also been used to analyze tryptic digests of complex biological samples such as whole-cell lysates (4,8-11) and human body fluids (12-14).

The main purpose of coupling a liquid separation method on-line to an electrospray mass spectrometer is to reduce the complexity of the sample introduced to the mass spectrometer at any given time. The dynamic range when a sample is directly infused is limited by ion suppression in the electrospray and detector. Ion trap and Fourier transform ion cyclotron resonance (FTICR) mass spectrometers have a limited storage capacity and hence also a limited dynamic range. The latter can be partially overcome by selectively loading the FTICR cell using a mass-selective quadrupole (15), although such instruments are not yet commercially available.

In addition to an improved dynamic range, the separation itself provides information on the analytes. In reversed-phase chromatography (RPC), this is primarily the hydrophobicity of the peptides (16,17). For a given measured tryptic peptide mass and measurement accuracy, there is only a certain number of theoretical tryptic peptides from proteins in a sequence database that are within experimental error from this mass (18,19). If mass accuracy is very high, with errors below 1 ppm, or near what can be achieved in high-field FTICR mass spectrometry under ideal conditions (20), there may exist a tryptic peptide of each protein for which there is only one candidate peptide within the mass measurement error, even if the candidates are calculated from all proteins in the organism (11,19). These peptides can then be used as “accurate mass tags” for protein identification. In general, however, mass accuracy is insufficient to identify proteins based on a single tryptic peptide mass, so that several peptides, or additional information on the peptides, are required for unambiguous protein identification.

This paper shows how information from liquid separation methods such as chromatography can be used to improve peptide mass fingerprinting based on accurate mass measurement alone (21-25). Although resolving power and accuracy in chromatographic separations are several orders of magnitude lower than in mass spectrometry, the information is complementary in nature and available at negligible computational cost and no extra experimental cost.

* Corresponding author. Fax: +1-925-423-7884. E-mail: [email protected].
† Division of Ion Physics.
‡ Department of Analytical Chemistry.
(1) Patterson, S. D.; Aebersold, R. Electrophoresis 1995, 16, 1791-1814.
(2) Shevchenko, A.; Jensen, O. N.; Podtelejnikov, A. V.; Sagliocco, F.; Wilm, M.; Vorm, O.; Mortensen, P.; Boucherie, H.; Mann, M. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 14440-14445.
(3) Smith, R. D.; Loo, J. A.; Barinaga, C. J.; Edmonds, C. G.; Udseth, H. R. J. Chromatogr. 1989, 480, 211-232.
(4) Smith, R. D.; Pasa-Tolic, L.; Lipton, M. S.; Jensen, P. K.; Anderson, G. A.; Shen, Y.; Conrads, T. P.; Udseth, H. R.; Harkewicz, R.; Belov, M. E.; Masselon, C.; Veenstra, T. D. Electrophoresis 2001, 22, 1652-1668.
(5) Whitehouse, C. M.; Dreyer, R. N.; Yamashita, M.; Fenn, J. B. Anal. Chem. 1985, 57, 675-679.
(6) Stacey, C. C.; Kruppa, G. H.; Watson, C. H.; Wronka, J.; Laukien, F. H.; Banks, J. F.; Whitehouse, C. M. Rapid Commun. Mass Spectrom. 1994, 8, 513-516.
(7) Voyksner, R. D. In Electrospray Ionization Mass Spectrometry; Cole, R. B., Ed.; John Wiley & Sons: New York, 1997; pp 323-341.
(8) Jensen, P. K.; Pasa-Tolic, L.; Peden, K. K.; Martinovic, S.; Lipton, M. S.; Anderson, G. A.; Tolic, N.; Wong, K. K.; Smith, R. D. Electrophoresis 2000, 21, 1372-1380.
(9) Shen, Y.; Tolic, N.; Zhao, R.; Pasa-Tolic, L.; Li, L.; Berger, S. J.; Harkewicz, R.; Anderson, G. A.; Belov, M. E.; Smith, R. D. Anal. Chem. 2001, 73, 3011-3021.
(10) Conrads, T. P.; Alving, K.; Veenstra, T. D.; Belov, M. E.; Anderson, G. A.; Anderson, D. J.; Lipton, M. S.; Pasa-Tolic, L.; Udseth, H. R.; Chrisler, W. B.; Thrall, B. D.; Smith, R. D. Anal. Chem. 2001, 73, 2132-2139.
(11) Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Masselon, C.; Pasa-Tolic, L.; Shen, Y.; Udseth, H. R. Omics 2002, 6, 61-90.
(12) Bergquist, J.; Palmblad, M.; Wetterhall, M.; Håkansson, P.; Markides, K. E. Mass Spectrom. Rev. 2002, 21, 2-15.
(13) Wetterhall, M.; Palmblad, M.; Håkansson, P.; Markides, K. E.; Bergquist, J. J. Proteome Res. 2002, 1, 361-366.
(14) Ramström, M.; Palmblad, M.; Markides, K. E.; Håkansson, P.; Bergquist, J. Proteomics, in press.
(15) Belov, M. E.; Nikolaev, E. N.; Anderson, G. A.; Udseth, H. R.; Conrads, T. P.; Veenstra, T. D.; Masselon, C. D.; Gorshkov, M. V.; Smith, R. D. Anal. Chem. 2001, 73, 253-261.
(16) Frenz, J.; Hancock, W. S.; Henzel, W. J.; Horváth, C. HPLC of Biological Macromolecules: Methods and Applications; Marcel Dekker: New York, 1990; pp 145-177.
(17) Cornette, J. L.; Cease, K. B.; Margalit, H.; Spouge, J. L.; Berzofsky, J. A.; DeLisi, C. J. Mol. Biol. 1987, 195, 659-685.
(18) Zubarev, R. A.; Håkansson, P.; Sundqvist, B. U. R. Anal. Chem. 1996, 68, 4060-4063.
(19) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolic, L.; Smith, R. D. Anal. Chem. 2000, 72, 3349-3354.
(20) Bruce, J. E.; Anderson, G. A.; Wen, J.; Harkewicz, R.; Smith, R. D. Anal. Chem. 1999, 71, 2595-2599.
10.1021/ac0256890 CCC: $22.00 © 2002 American Chemical Society. Published on Web 10/19/2002
5826 Analytical Chemistry, Vol. 74, No. 22, November 15, 2002

MATERIALS AND METHODS
Reversed-phase chromatography was performed on a Jasco 1580 micro-LC system (Jasco, Tokyo, Japan) with 8-10-cm length, 200-µm-i.d., 360-µm-o.d. fused-silica (Polymicro Technologies, Tucson, AZ) columns packed in-house with C18 reversed-phase material (5-µm ODS-AQ, YMC Europe GmbH, Schermbeck, Germany). The electrospray ends of the capillaries were mechanically tapered (26), and the exposed bare fused silica was Black Dust coated with polyimide (Alltech, Deerfield, IL) and 1-2-µm particle size graphite (Aldrich, Milwaukee, WI) according to Nilsson et al. (27) The tip was connected to ground, and the spectrometer orifice was held at -3 kV. The mobile-phase flow was 250-1000 µL/min, split ∼1:1000 in a Valco T-connector (VICI AG; Valco International, Schenkon, Switzerland) mounted before the column and injector. The columns were equilibrated for a minimum of 30 min before each sample injection. A volume of 10 µL of sample was injected through a sample loop connected to a six-port manual injection valve.
Linear gradients were 30-45 min, ramping from 0-20 to 80-100% organic solvent (99.5% ACN, 0.5% HAc) versus 0.5% HAc in H2O, pH ∼3. The Jasco LC system was connected to a Bruker Daltonics BioAPEX-94e 9.4-T FTICR mass spectrometer (Bruker Daltonics, Billerica, MA) via an Analytica electrospray interface (Analytica, Branford, CT) (28).

The samples analyzed to illustrate the use of information from chromatographic separations were bovine serum albumin (BSA), purchased from Sigma (Sigma, St. Louis, MO), and proteins extracted from cerebrospinal fluid (CSF) obtained from healthy donors, both digested by trypsin (Boehringer Mannheim GmbH, Mannheim, Germany, sequencing grade) as described by Bergquist et al. (12)

(21) Mann, M.; Hojrup, P.; Roepstorff, P. Biol. Mass Spectrom. 1993, 22, 338-345.
(22) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Biochem. Biophys. Res. Commun. 1993, 195, 58-64.
(23) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 5011-5015.
(24) Pappin, D. J. C.; Hojrup, P.; Bleasby, A. J. Curr. Biol. 1993, 3, 327-332.
(25) Yates, J. R., 3rd; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Anal. Biochem. 1993, 214, 397-408.
(26) Barnidge, D. R.; Nilsson, S.; Markides, K. E.; Rapp, H.; Hjort, K. Rapid Commun. Mass Spectrom. 1999, 994-1002.
(27) Nilsson, S.; Wetterhall, M.; Bergquist, J.; Nyholm, L.; Markides, K. E. Rapid Commun. Mass Spectrom. 2001, 15, 1997-2000.
(28) Palmblad, M.; Håkansson, K.; Håkansson, P.; Feng, X.; Cooper, H. J.; Giannakopulos, A. E.; Green, P. S.; Derrick, P. J. Eur. Mass Spectrom. 2000, 6, 267-275.

Information was extracted from the chromatography with a retention predictor that takes the amino acid composition of candidate peptides as its only input (29-34). The candidate peptides were those tryptic peptides within 5 ppm (28) of the measured mass in an in silico-digested protein database with 150 human body fluid proteins and BSA (12). The candidates could be ranked by combining mass measurement error and the deviation from predicted retention, calculating a total χ² value, where

$$\chi^2 = \frac{(m_{calc} - m_{exp})^2}{\sigma_m^2} + \frac{(t_{calc} - t_{exp})^2}{\sigma_t^2} \qquad (1)$$
and m and t are the mass and retention time, respectively. The standard deviations σ_m and σ_t were estimated from experimental data. The χ² value is sensitive to statistical outliers. Such outliers could result from a large error in prediction or a random (false) peptide match, i.e., a tryptic peptide in the database that has a mass close to the measured mass but an unrelated sequence and hence an unrelated chromatographic behavior. Therefore, the total χ² for all matching peptides from a protein in the database is not ideal for protein identification. Instead, the likelihoods of all matching peptide masses and retentions were summed to a total likelihood score, which was used to discriminate between true and random matches.

The retention time predictor was based on a simple model of peptide behavior in reversed phase, similar to one discussed by Hodges et al. (31-33) BSA peptides in the BSA tryptic digest and peptides from the abundant human serum albumin (HSA) and transferrin in the CSF digest were used to calculate a retention coefficient for each amino acid according to

$$t_{calc} = \sum_{i=1}^{20} n_i c_i + t_0 \qquad (2)$$
where c_i are the retention coefficients for the 20 amino acids, n_i is the number of residues of each amino acid in the peptide, and t_0 compensates for void volumes and a delay between sample injection and acquisition of mass spectra. These parameters were fitted by the least-squares method to experimental data from ∼70 BSA peptides or ∼100 HSA and transferrin peptides putatively identified by accurate mass measurement and high relative intensities in the mass spectra. All software was written in C and run on a standard PC with one 1.1-GHz AMD Athlon processor.

(29) Meek, J. L. Proc. Natl. Acad. Sci. U.S.A. 1980, 77, 1632-1636.
(30) Guo, D. C.; Mant, C. T.; Hodges, R. S. J. Chromatogr. 1987, 386, 205-222.
(31) Hearn, M. T.; Aguilar, M. I.; Mant, C. T.; Hodges, R. S. J. Chromatogr. 1988, 438, 197-210.
(32) Hodges, R. S.; Parker, J. M.; Mant, C. T.; Sharma, R. R. J. Chromatogr. 1988, 458, 147-167.
(33) Mant, C. T.; Zhou, N. E.; Hodges, R. S. J. Chromatogr. 1989, 476, 363-375.
(34) Sanz-Nebot, V.; Toro, I.; Benavente, F.; Barbosa, J. J. Chromatogr., A 2002, 942, 145-156.

RESULTS AND DISCUSSION
Figure 1 shows an LC/FTICR mass chromatogram of a CSF tryptic digest in a “virtual 2D gel” view. These data differ from standard 2D gel electrophoresis data in two important respects. First, the proteins have been digested prior to analysis, thereby creating a large number of “spots”. In this mass chromatogram, 70 204 individual peaks could be reduced to 6551 unique peptide masses. The same species was often found in several consecutive spectra. Further redundancy is due to multiple charge states and multiple isotopic peaks. Second, the resolving power (m/∆m) in the mass (or mass-to-charge) dimension is ∼100 000 times greater than what can be achieved by SDS-PAGE. For instance, two tryptic peptides as close as 0.0045 in m/z can be resolved in broadband mode in this instrument (35). The inset shows a small region of this data set, with resolved isotopic peaks.

(35) Palmblad, M.; Wetterhall, M.; Markides, K.; Håkansson, P.; Bergquist, J. Rapid Commun. Mass Spectrom. 2000, 14, 1029-1034.

Figure 1. Mass chromatogram of a cerebrospinal fluid tryptic digest showing over 6000 individual peptides. The inset shows the region between m/z 800 and 900 and retention times 7 and 16 min (spectra 35-85).

When c_i and t_0 were fitted to experimental data, a large number of peptides (50-100) was required for the c_i values to converge. The final values for three BSA and three CSF runs were found to correlate with hydrophobicity as expected; i.e., the more hydrophobic the residue, the higher the retention coefficient c_i. The retention coefficients derived from the CSF run represented by Figure 1 are shown in Table 1.

Table 1. Retention Coefficients for the 20 Amino Acid Residues Derived from 102 Peptides from the Two Most Abundant Proteins, HSA and Transferrin, in a Cerebrospinal Fluid Tryptic Digest Separated by Reversed-Phase Chromatography under Acidic Conditions (pH 3)

residue          c_i (spectra no.)   c_i (min)
arginine         -4.02               -0.76
serine           -3.75               -0.71
lysine           -3.47               -0.66
asparagine       -2.87               -0.54
glutamic acid    -1.39               -0.26
aspartic acid    0.23                0.04
glycine          1.53                0.29
threonine        1.94                0.37
alanine          2.18                0.41
histidine        3.00                0.57
proline          5.10                0.97
methionine       5.19                0.98
glutamine        5.37                1.02
cysteine         6.98                1.32
leucine          12.04               2.28
valine           12.88               2.44
phenylalanine    14.13               2.68
isoleucine       14.24               2.70
tyrosine         14.67               2.78
tryptophan       24.69               4.68
(t_0)            (25.17)             (4.77)

The t_0 values were all below and close to the observed void time (cf. Table 1 and Figure 1, where the void time is ∼5 min). This could be explained by a size dependency of the retention of tryptic peptides; i.e., the longer the peptide, the longer the retention time on the column. Small and internal tryptic peptides are relatively hydrophilic, on average, as they all have a basic C-terminal residue (arginine or lysine). Any linear dependence
on peptide length can be implicitly encoded by the retention coefficients in this model as a constant term added to each c_i.

The accuracy can be quantified by the standard deviation of predicted from measured retention for peptides identified by accurate mass measurement (Figure 2). The accuracy of the predictor was found to be 8-10% when “trained” by each of the six BSA and CSF data sets. Occasional outliers were observed, likely due to random (false) matches to the protein used for training the predictor. What is useful is not accurate prediction per se but how well the predictor discriminates between true and false matches.

Figure 2. Measured versus predicted retention for 70 HSA and transferrin peptides in CSF used to train the predictor (○) and random (false) matches from 100 randomly chosen yeast proteins (●). The standard deviation from predicted retention was 6.6 spectra (1.6 min) for HSA and transferrin peptides and significantly larger, 16.1 spectra (4.0 min), for the yeast peptides.

The CSF digests have been thoroughly analyzed by other methods, and a large number of proteins have been identified (14). Figure 3 illustrates the ability of a retention predictor to discriminate between a true match, in this case fibrinogen in CSF, and a random protein from yeast with a similar number of peptides matching measured peptides with a relative mass measurement error less than 5 ppm, by plotting measured versus predicted retention for both. In this case, mass measurement alone is not sufficient to discriminate between the two proteins.

A common task is to compare proteins from a protein sequence database to mass spectrometric data to find which protein or proteins best fit these data. In the case of CSF, a subset of human proteins found in plasma and other body fluids (12) was used to test the improvement of protein identification in complex mixtures by using the information from the predictor of chromatographic retention. The identification of CSF proteins also serves as an illustration of the applicability of this method in the analysis of a biological sample by tryptic digestion and liquid chromatography mass spectrometry. As an arbitrary figure of merit, we take the increase in the number of proteins out of 150 hypothetical proteins in the small a priori defined database (12) that could be identified with 95% significance or better if tested individually, i.e., not necessarily 95% significant as a union after a conservative statistical correction (e.g., Bonferroni correction).
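As an illustration only, the candidate ranking described in Materials and Methods (eq 1 combined with eq 2) can be sketched in a few lines of Python. This is not the original software, which was written in C. The coefficients and t_0 are the minute-scale values from Table 1, and σ_t = 1.6 min is taken from the caption of Figure 2. The value of σ_m (2 ppm), the function names, and the two candidate peptides with their mass errors and the 13.5-min retention time are assumptions made for this sketch, not values from the study. The mass term is expressed in ppm in both numerator and denominator, which leaves the ratio in eq 1 unchanged.

```python
# Retention coefficients c_i in minutes, copied from Table 1 (one CSF run).
RETENTION_COEFF_MIN = {
    "R": -0.76, "S": -0.71, "K": -0.66, "N": -0.54, "E": -0.26,
    "D": 0.04, "G": 0.29, "T": 0.37, "A": 0.41, "H": 0.57,
    "P": 0.97, "M": 0.98, "Q": 1.02, "C": 1.32, "L": 2.28,
    "V": 2.44, "F": 2.68, "I": 2.70, "Y": 2.78, "W": 4.68,
}
T0_MIN = 4.77        # t_0 from Table 1, in minutes
SIGMA_T_MIN = 1.6    # spread for true matches, from the Figure 2 caption
SIGMA_M_PPM = 2.0    # ASSUMPTION: not stated in the text, illustration only


def predict_retention(sequence):
    """Eq 2: summing c_i over residues equals sum(n_i * c_i); then add t_0."""
    return sum(RETENTION_COEFF_MIN[residue] for residue in sequence) + T0_MIN


def chi_square(mass_error_ppm, t_calc, t_exp):
    """Eq 1, with the mass deviation and its sigma both given in ppm."""
    mass_term = (mass_error_ppm / SIGMA_M_PPM) ** 2
    time_term = ((t_calc - t_exp) / SIGMA_T_MIN) ** 2
    return mass_term + time_term


def rank_candidates(candidates, t_exp):
    """Sort (sequence, mass_error_ppm) pairs by ascending chi-square."""
    scored = [
        (chi_square(error_ppm, predict_retention(seq), t_exp), seq)
        for seq, error_ppm in candidates
    ]
    return sorted(scored)


# Hypothetical candidates for one measured peak eluting at 13.5 min.
candidates = [("GSSGK", 0.5), ("LVNEVTEFAK", 1.0)]
for chi2, seq in rank_candidates(candidates, t_exp=13.5):
    print(f"{seq}: chi2 = {chi2:.2f}")
```

Because the predictor uses amino acid composition only, two peptides containing the same residues in a different order receive the same predicted retention in this sketch.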
This figure of merit serves to show that the extra information from a predictor of chromatographic retention can be significant in protein identification. Figure 4 shows histograms of likelihood scores for one CSF run in which 11 matching proteins were >95% significant, as compared to 6 proteins >95% significant without the information from predicted retention. The likelihood score in Figure 4 is simply the sum of the likelihoods for all peptides matching within 5 ppm, with the likelihood of the measured retention factored in when the predictor is used. No yeast proteins with scores near those for HSA and transferrin were found, and a conservative estimate is that these are at least 99.99% significant. The significances of several apolipoproteins, hemopexin, fibrinogen, lactoferrin, GFAP, and complement factor 9 were also improved.

Figure 3. Measured versus predicted retention for fibrinogen tryptic peptides (○) and tryptic peptides calculated from a yeast protein (●), both matching masses measured in a cerebrospinal fluid tryptic digest. The peptides matching fibrinogen are not necessarily all true matches. In fact, a small number of false matches could always be expected depending on protein size (the number of possible tryptic fragments) and the number of detected masses. The likelihood score gives extra weight to matches with measured retention close to predicted retention.

Figure 4. Significant matches in a LC/FTICR mass chromatogram. The number increases when information on measured versus predicted retention is added (bold), as compared to mass measurement alone (dashed). Sequences with likelihood scores of >95% of the yeast scores are indicated. The abundant proteins already known to be present in the sample, HSA and transferrin, were used as internal standards to train the predictor. The set of all these matching proteins would not be 95% significant after Bonferroni correction, but the significances can be further improved by taking information on the nonrandomness of the distribution of matching tryptic peptides in the sequence into account (12), by improving the mass accuracy, or by improving the accuracy of the predictor.

The use of “internal standards”, such as HSA in CSF, can be expected to make the prediction less dependent on chromatographic conditions such as pressure, flow rate, mobile-phase composition, or pH. The pH directly influences the charge of peptide side-chain groups and termini and hence the hydrophobicity (charged residues are more hydrophilic) and retention coefficients in RPC (29,34). In addition to a sufficiently large number of peptides in the training set, each amino acid should be present in several peptides in this set. The least abundant residues in BSA are tryptophan (two residues) and methionine (four residues). HSA contains only one tryptophan residue and does not suffice to determine the retention coefficients in separations of human body fluids, which is why transferrin (eight tryptophans) was also used. The identification of 15 out of the 70 tryptic peptides of BSA could be verified by electron capture dissociation (36), indicating the potential usefulness of LC/MS/MS to verify training sets for retention time prediction. The number of false peptides can be estimated by comparison with the expected number of random matches by peptides from an unrelated organism, such as yeast, when looking at mammalian proteins as shown here (12,37).

Equation 2 is not the only conceivable model for predicting reversed-phase chromatographic retention of peptides. A more advanced approach may take the exact sequence or even predicted secondary structure (38) into account. For instance, retention coefficients of residues at or near the termini, or residues predicted to interact with the surroundings, could be given higher weight in eq 2. It should be noted that the accuracy of prediction achieved here is relatively poor compared to that reported in the literature (29). The performance can be expected to improve significantly with the accuracy of the predictor. The computational cost was small,