Use of Peptide Retention Time Prediction for Protein Identification by

Anal. Chem. 2006, 78, 6265-6269

Technical Notes

Use of Peptide Retention Time Prediction for Protein Identification by off-line Reversed-Phase HPLC-MALDI MS/MS

Oleg V. Krokhin,†,‡ Stephen Ying,† John P. Cortens,‡ Dhiman Ghosh,‡ Victor Spicer,† Werner Ens,†,‡ Kenneth G. Standing,†,‡ Ronald C. Beavis,‡ and John A. Wilkins‡

Department of Physics and Astronomy, University of Manitoba, Winnipeg, Manitoba, R3T 2N2 Canada, and Manitoba Centre for Proteomics and Systems Biology, University of Manitoba, 799 JBRC, 715 McDermot Avenue, Winnipeg, Manitoba, R3E 3P4 Canada

A new algorithm, the sequence-specific retention calculator (SSRCalc), was developed to predict the retention times of tryptic peptides during RP HPLC fractionation on C18, 300-Å pore size columns. Correlations of up to ∼0.98 (R2 value) were obtained for a test library of ∼2000 peptides and ∼0.95-0.97 for a variety of real samples. The algorithm was applied in conjunction with an exclusion protocol based on mass (15 ppm tolerance) and retention time (2-min tolerance for a 0.66% acetonitrile/min gradient), the MART criteria, to significantly reduce the instrument time required for complete MS/MS analysis of a digest separated by RP HPLC. This was confirmed by reanalyzing a set of HPLC-MALDI MS/MS data with no loss in protein identifications, despite a 57% decrease in the number of virtually executed MS/MS analyses.

Ion-pair RP HPLC of peptides has become a compulsory part of modern MS-based proteomics studies of complex protein mixtures.1,2 Such fractionation prior to the final MS (or MS/MS) step of analysis significantly simplifies the spectra to be analyzed while introducing little to no interference with ESI or MALDI ionization techniques. Due to its lower resolving power, HPLC is often regarded as simply a sample preparation step, neglecting the valuable auxiliary information that can be extracted from retention data. However, the potential value of chromatographic retention times as an adjunct to mass spectrometry for peptide identification has recently been considered by the proteomics community.3-5

* Corresponding author. Phone: (204) 474 6184. Fax: (204) 474 7622. E-mail: [email protected].
† Department of Physics and Astronomy.
‡ Manitoba Centre for Proteomics and Systems Biology.

(1) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu. Rev. Biochem. 2001, 70, 437-473.
(2) Lambert, J. P.; Ethier, M.; Smith, J. C.; Figeys, D. Anal. Chem. 2005, 77, 3771-87.
(3) Palmblad, M.; Ramström, M.; Markides, K. E.; Håkansson, P.; Bergquist, J. Anal. Chem. 2002, 74, 5826-5830.
(4) Petritis, K.; Kangas, L. J.; Ferguson, P. L.; Anderson, G. A.; Pasa-Tolic, L.; Lipton, M. S.; Auberry, K. J.; Strittmatter, E. F.; Shen, Y.; Zhao, R.; Smith, R. D. Anal. Chem. 2003, 75, 1039-1048.

10.1021/ac060251b CCC: $33.50 Published on Web 07/26/2006

© 2006 American Chemical Society

Chromatographic retention models were the subject of intensive studies in the 1980s and 1990s.6-9 Most methods use a summation of the hydrophobicities of individual amino acids, ignoring the influence of their position within the peptide. In addition, they were developed for restricted sets (up to 100-200) of peptides, sometimes with modified N or C termini. Application of these models to separations of real digests gives less than satisfactory results, but recent developments in proteomics have brought a renewed interest in this field as the data sets for modeling grow in size. For example, by combining the results of a large number of HPLC-ESI MS/MS runs, Petritis et al.4 generated a data bank of ∼7000 peptides and used it for model optimization using an artificial neural network approach. The use of this type of sample pool for the development of a retention time algorithm offers a more realistic representation of the range and complexity of samples that are likely to be encountered experimentally. Working with data sets of large size allowed proteomics researchers to propose the first sequence-specific retention algorithms.10-12

(5) Palmblad, M.; Ramström, M.; Bailey, C. G.; McCutchen-Maloney, S. L.; Bergquist, J.; Zeller, L. C. J. Chromatogr., B. 2004, 803, 131-135.
(6) Meek, J. L. Proc. Natl. Acad. Sci. U.S.A. 1980, 77, 1632-1636.
(7) Browne, C. A.; Bennett, H. P. J.; Solomon, S. Anal. Biochem. 1982, 124, 201-208.
(8) Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, J. M. R.; Hodges, R. S. J. Chromatogr. 1986, 359, 499-517.
(9) Mant, C. T.; Hodges, R. S. HPLC of Biological Macromolecules; Marcel Dekker: New York, 2002; pp 433-511.
(10) Krokhin, O. V.; Craig, R.; Spicer, V.; Ens, W.; Standing, K. G.; Beavis, R. C.; Wilkins, J. A. Mol. Cell Proteomics 2004, 3, 908-919.
(11) Krokhin, O. V.; Ying, S.; Craig, R.; Spicer, V.; Ens, W.; Standing, K. G.; Beavis, R. C.; Wilkins, J. A. Poster presentation; 52nd ASMS Conference on Mass Spectrometry and Allied Topics; Nashville, TN, 2004.
(12) Petritis, K.; Kangas, L. J.; Yan, B.; Strittmatter, E. F.; Camp, D. G., II; Lipton, M. S.; Xu, Y.; Smith, R. D. Poster presentation; 52nd ASMS Conference on Mass Spectrometry and Allied Topics; Nashville, TN, 2004.
(13) Hsieh, S.; Dreisewerd, K.; van der Schors, R. C.; Jimenez, C. R.; Stahl-Zeng, J.; Hillenkamp, F.; Jorgenson, J. W.; Geraerts, W. P. M.; Li, K. W. Anal. Chem. 1998, 70, 1847-1852.
(14) Miliotis, T.; Kjellstrom, S.; Nilsson, J.; Laurell, T.; Edholm, L. E.; Marko-Varga, G. J. Mass Spectrom. 2000, 35, 369-377.
(15) Chen, H. S.; Rejtar, T.; Andreev, V.; Moskovets, E.; Karger, B. L. Anal. Chem. 2001, 7, 2323-2331.

Analytical Chemistry, Vol. 78, No. 17, September 1, 2006 6265

The use of an off-line HPLC-MALDI MS(MS/MS) combination has been gaining popularity in recent years.13-18 The capacity to completely decouple the separation and detection steps allows for the independent optimization of each process. In addition, MALDI offers the option of performing repeat analyses on the same static sample at different times. These properties make this combination ideal for the detailed study of complex peptide mixtures. This approach was used for the characterization of an extensive set of the peptides used to develop previous versions of our sequence-specific retention prediction model.10,11 In the latter case, digests of known proteins were separated, and the identities of the peptides were confirmed by MS/MS. This approach provided a large but well-characterized set of peptides for algorithm development.

The present algorithm, SSRCalc (sequence-specific retention calculator), differs from other4,6-8 retention prediction models because it takes into account both amino acid composition and residue position within the peptide chain. The latest version includes influences from amino acid composition, the nature of N- and C-terminal residues, peptide length, pI, and propensity to form helical structures. These considerations result in an accurate model that performs well with both constructed and experimentally derived digests. The algorithm achieves an R2 value of ∼0.98 for predicted retention times of a set of ∼2000 peptides and was recently described in detail elsewhere.19 Three consecutive versions of the algorithm have been developed and made available to the public (http://hs2.proteome.ca/SSRCalc/SSRCalc.html). SSRCalc is routinely used in our laboratory for a number of tasks, including protein identification, detailed peptide mapping of proteins, and collecting information about chemical and posttranslational modifications for future inclusion into the model. This paper illustrates an application of the retention time prediction algorithm for protein identification by off-line HPLC-MALDI MS/MS.

EXPERIMENTAL SECTION

Reagents.
Proteins (unless otherwise noted), dithiothreitol, iodoacetamide, trifluoroacetic acid (TFA), and 2,5-dihydroxybenzoic acid (DHB) were obtained from Sigma Chemicals (St. Louis, MO). Sequencing grade modified trypsin (Promega, Madison, WI) was used for all digestion procedures.

Preparation of Digests for HPLC-MS Analysis. A number of commercially available proteins were used to provide peptide mixtures for further micro-HPLC-MALDI MS (MS/MS) analysis. Sample preparation included reduction (10 mM dithiothreitol, 30 min, 57 °C), alkylation (50 mM iodoacetamide, 30 min in the dark at room temperature), dialysis (100 mM NH4HCO3, 6 h, 7 kDa molecular weight cutoff, Pierce), and trypsin digestion (1/50 enzyme/substrate weight ratio, 12 h, 37 °C). As a first step, digests of each individual protein (1 mg/mL) were analyzed separately by MALDI MS to confirm protein identity. Mixtures of these protein digests (0.4 pmol/µL of each protein) were prepared by appropriate dilution in 0.5% aqueous trifluoroacetic acid (TFA). The mixture (5 µL, 2 pmol of each protein) was injected into the µ-HPLC system. In total, six different mixtures were prepared, each resulting in the identification of 300-400 tryptic fragments within a mass range from 560 to 5000 Da. These experiments provided a data set of ∼2000 peptides that was used for model development. Digests were also prepared from a set of lectin affinity purified proteins from K562 cells. The isolation and digestion procedures are described elsewhere.20

(16) Zhang, B.; McDonald, C.; Li, L. Anal. Chem. 2004, 76, 992-1001.
(17) Krokhin, O.; Li, Y.; Andonov, A.; Feldmann, H.; Flick, R.; Jones, S.; Stroeher, U.; Bastien, N.; Dasuri, K. V. N.; Cheng, K.; Simonsen, J. N.; Perreault, H.; Wilkins, J.; Ens, W.; Plummer, F.; Standing, K. G. Mol. Cell Proteomics 2003, 2, 346-356.
(18) Krokhin, O.; Cheng, K.; Sousa, S.; Ens, W.; Standing, K. G.; Wilkins, J. A. Biochemistry 2003, 42, 12950-12959.
(19) Krokhin, O. Analytical Chemistry, submitted.

Chromatography and Fraction Collection. Deionized (18 MΩ) water and HPLC-grade acetonitrile were used for the preparation of eluents. Column temperature was maintained at 30 °C throughout all experiments. Chromatographic separations were performed using a micro-Agilent 1100 series system (Agilent Technologies, Wilmington, DE). Samples (5 µL) were injected directly onto a 150 µm × 150 mm column (Vydac 218 TP C18, 5 µm; Grace Vydac, Hesperia, CA) and eluted with a linear gradient of 1-80% acetonitrile (0.1% TFA) in 120 min, or 0.66% acetonitrile/min, at a flow rate of 4 µL/min. PEEK 65-µm-i.d. and fused-silica 50-µm-i.d. tubing was used for all pre- and postcolumn liquid connections. The column effluent (4 µL/min) was mixed on-line with 2,5-dihydroxybenzoic acid MALDI matrix solution (0.5 µL/min, 150 mg/mL DHB in water/acetonitrile, 1:1) and deposited by a computer-controlled robot onto a movable gold target at 0.5-min intervals. A Microtee P775 (Upchurch Scientific) was used for on-line mixing. One hundred and twenty (120) fractions were collected, because most tryptic peptides were eluted within 60 min. Fractions were finally air-dried and subjected to MALDI-MS (MS/MS) analysis. A tryptic digest of glycoproteins from K562 cells was fractionated using a similar setup but with a 0.61% acetonitrile/min gradient.20 Seventy 1-min fractions were collected as described above. Due to slightly different pre- and postcolumn tubing connections, the intercept on the retention time vs hydrophobicity plot was slightly shifted.

TOF Mass Spectrometry.
The spots of the chromatographic fractions were analyzed by single mass spectrometry (MS) with an m/z range of 560-5000 and by tandem mass spectrometry (MS/MS) in the Manitoba/Sciex prototype MALDI quadrupole/TOF (QqTOF) mass spectrometer.21 Orthogonal injection of ions from the quadrupole into the TOF section normally produces a mass resolving power of ∼10 000 fwhm and 10 ppm mass accuracy in the TOF spectra in both MS and MS/MS modes.

Peak Assignments and MS/MS Identification of Peptides. The “M/z”, “ProFound”, and “Global Proteome Machine” (GPM)22 programs (Manitoba Centre for Proteomics and Systems Biology, www.proteome.ca) were used for peak assignment, peptide mass fingerprint, and MS/MS identification of peptides, respectively. Signal-to-noise ratios of 2.5 and 1.3 were used for peak assignment in MS and MS/MS spectra, respectively. Combined peak lists for HPLC-MALDI MS runs were created for each separation by concatenating peak lists from individual fractions. The fraction number was used as a measure of the peptide retention time.

(20) Ghosh, D.; Krokhin, O.; Antonovici, M.; Ens, W.; Standing, K. G.; Beavis, R. C.; Wilkins, J. A. J. Proteome Res. 2004, 3, 841-850.
(21) Loboda, A. V.; Krutchinsky, A. N.; Bromirski, M.; Ens, W.; Standing, K. G. Rapid Commun. Mass Spectrom. 2000, 14, 1047-1057.
(22) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.

If the full intensity of a peak was contained in a single fraction, the peak was assigned a retention time equal to the fraction number; however, if that peak’s signal was distributed

Figure 1. SSRCalc retention time prediction for library set of peptides and tryptic peptides recovered from K562 cell affinity-purified protein mixture. Retention time vs hydrophobicity dependences for library set of ∼2000 peptides: version 2 (a) and version 3 (b) of retention calculator. Similar dependencies for all 347 tryptic peptides from K562 cell isolate: version 2 (c), version 3 (d).

between two (three) consecutive fractions, the assigned retention time was the intensity-weighted average of the fraction numbers. Thus only one m/z value (for the fraction containing the maximum intensity) remains in the combined peak list.

RESULTS AND DISCUSSION

The latest version of SSRCalc was developed to address the possible contributions of a number of parameters that might influence chromatographic retention times. These included amino acid composition, the nature of the N- and C-terminal residues, peptide length, hydrophobicity, isoelectric point, and the propensity to form helical structures.19 Adjustments for these parameters improved the correlation coefficient for the test library of 2000 peptides from 0.9585 for the previous version11 to 0.978 for the latest one (compare Figure 1a, b).19 As has been observed by others, the performance of prediction algorithms is not quite as good with real samples of unknown proteins (Figure 1c, d). However, it was readily apparent that the new version of the algorithm displayed improved performance on the unknown mixture. This is illustrated in Figure 1, which shows correlations for peptides identified in the HPLC-MALDI MS/MS run of the tryptic digest of lectin-affinity purified proteins from a K562 whole cell lysate (c, d). A total of 661 MS/MS spectra were collected in the latter case, and 401 peptides (60.6%) from 70 proteins were identified, including the 347 tryptic peptides used to generate Figure 1c, d. Although both versions of the algorithm exhibited lower correlations than the ∼0.96 and ∼0.98 R2 values for the library set, the improvement in the latest version is reflected in the correlation values for the unknown sample. It should be noted that the results in Figure 1a, b and c, d were derived with different LC systems, which had slightly different tubing configurations and gradient slopes. These differences are responsible for the changes in the slopes and intercepts (compare Figure 1c, d to 1a, b). The retention times for 97, 78, and 49% of the peptides in Figure 1b are predicted with an accuracy of ±4, ±2, and ±1 min, respectively. The margin of ±2 min was used as a retention time filter criterion throughout this work.

It should be noted that the contents of Figure 1c, d were randomly chosen and represent a relatively complex sample containing peptides of different abundances and nature. In some cases, we found much better correlation (up to an R2 value of 0.99, not shown here). This may indicate that the latter sets of peptides do not contain species that have a high propensity to form helical structures, are extremely large, or have extreme pI values, all of which commonly result in less accurate retention prediction. Such limitations originate from the fact that (i) the approach for taking into account peptides’ helicity, for example, still requires improvement,19 and (ii) the algorithm was developed for better fitting to an “average” peptide, whereas peptides with unusual physicochemical properties typically behave anomalously and will be studied further. In addition, a significant proportion of the total proteins (∼30%) in the lectin-affinity purified preparation were predicted to be integral membrane components, which may also have affected the chromatographic behavior of their peptides. Despite these limitations, we believe that correlation values in the range of 0.95-0.97 can provide significant support for data analysis and can be used to minimize the instrument time required for protein identification by MS/MS.

Figure 2. MALDI MS spectra of fraction no. 37 from the separation of the digest of lectin-affinity purified K562 isolate. All peaks analyzed by MS/MS are labeled. ●, peptides identified by GPM; ○, nonidentified peptides; +/-, peptides excluded/not excluded from the peak list during virtual reanalysis of MS/MS data with MART exclusion criteria.

HPLC-MALDI MS/MS Analysis Using Mass-Retention Time Exclusion Criteria. The sample presented in Figure 1c, d was analyzed in December 2002. We did not have the ability to repeat the analysis, but we were able to reanalyze it in-silico. The original data set contained MS spectra of 70 1-min fractions, a combined peak list, and 661 MS/MS spectra of the most abundant parent ions in the 900-4000 Da mass range. Figure 2 shows the MALDI MS spectra of fraction no. 37 from that separation. Note that during the actual analysis, MS/MS spectra were collected only for the labeled intense peaks, since each spectrum acquisition consumes the analytes deposited on the MALDI target.
Each separation should be calibrated to obtain the parameters of the regression equation Rt = B + (A × hydrophobicity), which permits application of the chromatographic prediction algorithm. This can be done either externally or internally, using a set of known peptides or a well-characterized digest of a known protein.19 In this particular case, calibrations were not performed; therefore, we assumed that the correlation from the data in Figure 1d was known before the virtual MS/MS analysis. The procedure of protein identification by MS/MS and peak exclusion based on mass-retention time (MART) criteria executes the following steps (Figure 3):
(1) The five most abundant peaks from the combined peak list were chosen, and their MS/MS spectra were submitted to a GPM search.

Figure 3. The scheme of RP HPLC-MALDI MS/MS analysis of peptide mixture based on MART exclusion criteria.

(2) The proteins identified were digested in-silico, allowing 1 missed cleavage. Peptide hydrophobicities were calculated using SSRCalc version 3. The predicted retention times for peptides from theoretical digests were found using the correlation from Figure 1d (Rt = 8.8609 + (0.9752 × hydrophobicity)).
(3) Since these proteins are already identified, we excluded their peptides from further consideration on the basis of two criteria: ±15 ppm mass tolerance and ±2 min deviation from the predicted retention time.
(4) The next five most abundant peaks (as in step 1) from the treated peak list were chosen, and their MS/MS spectra were combined with the first five to perform the second GPM search. If new proteins were identified, their potential tryptic peptides were excluded from the combined peak list, as in steps 2 and 3, and so on.
The procedure was repeated 57 times (285 spectra out of 661), starting from step 1 with a diminished peak list, until all 661 MS/MS spectra were either analyzed by GPM or excluded from the peak list. For real-time analysis, the criterion to terminate MS/MS acquisition will be the inability to acquire proper MS/MS spectra due to gradual sample consumption and a subsequent decrease in parent ion abundance throughout the peak list. The resulting list of identified proteins features the same 70 proteins found in the original MS/MS analysis20 without the exclusion; therefore, the number of MS/MS spectra needed to identify the same set of proteins was decreased by 57%. This should allow MS/MS analysis of peptides with lower abundances and a potential increase in the number of identified proteins. Nineteen MS/MS spectra were acquired from fraction no. 37 during the original analysis (labeled peaks in Figure 2). Sixteen of them resulted in peptide identification (labeled with solid circles) when all 661 MS/MS spectra were submitted to the GPM search simultaneously.
Eleven peaks were excluded from the peak list (labeled with a “+”) during virtual reanalysis with peak exclusion based on mass-retention time criteria. Therefore, only 8 (labeled with a “-”) out of 19 MS/MS spectra were reanalyzed virtually. This means that only 8 MS/MS spectra would need to be acquired from this spot to get the same result if such an analysis were performed in real time.
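The calibration and exclusion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure as actually run (the text notes that the exclusion steps were performed largely by hand). The names `Peak`, `fit_calibration`, `is_excluded`, and `apply_mart_exclusion` are hypothetical, hydrophobicity values are assumed to come from SSRCalc and are simply passed in, and retention times are assumed to be in minutes. The default intercept and slope are the Figure 1d values quoted in step 2.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

PPM_TOLERANCE = 15.0     # mass tolerance from the text (±15 ppm)
RT_TOLERANCE_MIN = 2.0   # retention time tolerance from the text (±2 min)


@dataclass(frozen=True)
class Peak:
    """One entry of the combined peak list (hypothetical structure)."""
    mz: float         # observed m/z
    rt_min: float     # observed retention time, minutes
    intensity: float  # peak intensity, arbitrary units


def fit_calibration(hydrophobicities: Sequence[float],
                    retention_times: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of Rt = B + A * hydrophobicity.

    Returns (B, A), i.e. (intercept, slope).
    """
    n = len(hydrophobicities)
    if n < 2 or n != len(retention_times):
        raise ValueError("need two or more paired calibration points")
    mean_h = sum(hydrophobicities) / n
    mean_rt = sum(retention_times) / n
    sxx = sum((h - mean_h) ** 2 for h in hydrophobicities)
    if sxx == 0:
        raise ValueError("hydrophobicity values must not all be equal")
    sxy = sum((h - mean_h) * (rt - mean_rt)
              for h, rt in zip(hydrophobicities, retention_times))
    slope = sxy / sxx
    intercept = mean_rt - slope * mean_h
    return intercept, slope


def is_excluded(peak: Peak, theoretical_mz: float, hydrophobicity: float,
                intercept: float, slope: float) -> bool:
    """True if the peak matches a theoretical peptide in BOTH mass and time."""
    ppm_error = abs(peak.mz - theoretical_mz) / theoretical_mz * 1e6
    predicted_rt = intercept + slope * hydrophobicity
    return (ppm_error <= PPM_TOLERANCE
            and abs(peak.rt_min - predicted_rt) <= RT_TOLERANCE_MIN)


def apply_mart_exclusion(peaks: Sequence[Peak],
                         theoretical: Sequence[Tuple[float, float]],
                         intercept: float = 8.8609,
                         slope: float = 0.9752) -> List[Peak]:
    """Drop peaks explained by any (theoretical m/z, hydrophobicity) pair."""
    return [p for p in peaks
            if not any(is_excluded(p, mz, h, intercept, slope)
                       for mz, h in theoretical)]
```

In an iterative run, `apply_mart_exclusion` would be called on the remaining peak list after each database search, with the theoretical peptides of the newly identified proteins. Requiring both criteria to match is what distinguishes this filter from exclusion by mass alone.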

Table 1. Peak Exclusion Based on Mass-Retention Time Criteria for Two Proteins Identified by MS/MS on First Five Most-Abundant Parent Ions from K562 Protein Isolate

protein ID                                                              endoplasmin precursor (GRP94)    α-glycosidase II alpha subunit
peptides identified if all 661 spectra submitted for search             46                               51
peptides identified if first 5 most intense peaks submitted for search  1                                3
peptides excluded based on mass-retention time criteria                 28                               16
peptides outside predicted retention time ±2 min                        9                                10
nontryptic and peptides with posttranslational modifications            8                                22

This approach is especially powerful at the early stages of analysis. Because the first, most abundant peaks analyzed likely belong to the most abundant proteins in the mixture, there is a high probability that trypsin digestion of these proteins will generate a large number of peptides (unless they are low molecular weight proteins or do not have many tryptic cleavage sites), which will be detected by MS. For example, a GPM search using the first five peptides from the combined peak list identified the proteins ranked 1 (one peptide) and 2 (three peptides) in the analysis of the whole set without exclusion criteria (Table 1); 28 and 16 peptides were excluded on the basis of mass-retention time criteria for endoplasmin precursor (94-kDa glucose-regulated protein, GRP94) and α-glycosidase II alpha subunit, respectively. These two proteins contain 49 peptides that were not excluded and were subsequently analyzed. They consist of peptides with retention times outside the ±2 min window of predicted retention, peptides with posttranslational modifications, and products of nontryptic cleavage. Application of the exclusion protocol based on MART criteria significantly reduces the instrument time required for complete MS/MS analysis of the digest separated by RP HPLC. Protein sequences were extracted from the database following initial MS/MS identification and digested in-silico, and the resulting pairs of calculated mass and calculated retention time were compared to the HPLC-MS data. A peak was excluded if both its mass and its retention time fell within the ±15 ppm and ±2 min windows, respectively. Peak exclusion based solely on mass is routinely applied in various MS protocols. The introduction of an additional constraint makes this exclusion more reliable. This was confirmed by reanalyzing the set of HPLC-MALDI MS/MS data, which showed no loss of protein identifications, despite the number of virtually executed MS/MS analyses being decreased by 57%.
This approach is particularly effective for the identification of large proteins with a limited number of posttranslational modifications and nonspecific cleavage sites. The application demonstrated here is well suited to the HPLC-MALDI combination. Since the two steps of the analysis are coupled off-line, there is enough time to perform the required calculations. There is potential for application to on-line ESI-MS/MS, but this will require greater speed and performance of MS/MS identification algorithms.
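The in-silico digestion step of the protocol can be illustrated with a minimal Python sketch. The function name `tryptic_peptides` is hypothetical, and the cleavage rule used here (cut after K or R unless the next residue is P) is a commonly used convention that is assumed for illustration; the text itself specifies only tryptic digestion with 1 missed cleavage. Peptide masses and SSRCalc hydrophobicities would still have to be computed separately for each returned sequence.

```python
from typing import List


def tryptic_peptides(sequence: str, missed_cleavages: int = 1) -> List[str]:
    """Return theoretical tryptic peptides of a protein sequence.

    Assumed rule: cleave C-terminal to K or R, except before P.
    Peptides containing up to `missed_cleavages` uncut sites are included.
    """
    if not sequence:
        return []
    # Collect the boundaries between fully cleaved fragments.
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        # Join the fragment with up to `missed_cleavages` following fragments.
        last = min(start + 1 + missed_cleavages, len(sites) - 1)
        for end in range(start + 1, last + 1):
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides
```

For a toy sequence such as `"AKRPGKC"`, the R-P bond is left uncut under the assumed rule, so the fully cleaved fragments are `AK`, `RPGK`, and `C`, and one missed cleavage adds `AKRPGK` and `RPGKC`.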

CONCLUSION

The current trends in proteomic research require stricter approaches to protein identification. In many cases MS/MS confirmation is needed to provide valid results. This imposes additional demands on the productivity of MS equipment and increases the instrument time required for the analysis. The MS/MS step of analysis is the most time- and sample-consuming. Any auxiliary information that can reduce the time required for MS/MS will provide definite advantages over existing procedures. In the case of RP HPLC fractionation prior to MS/MS, this information comes naturally because the retention time of the detected peak can be easily extracted and compared to the predicted values. The sequence-specific retention calculator algorithm allows one to predict the retention time with an observed vs predicted correlation of up to an R2 value of 0.98. The use of this algorithm allows a significant decrease in the time required for MS/MS by excluding from further consideration the peptides that belong to proteins already identified. Such involvement of HPLC procedures suggests, however, that proteomics researchers should pay more attention to chromatographic details: they should maintain high reproducibility of LC separation, avoid column overloading, etc. Although most of the exclusion procedures presented here were performed manually, they contain several well-defined steps and can be easily coded into automated algorithms.

ACKNOWLEDGMENT

This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada (K.G.S., W.E.), Genome Canada (W.E.), Canadian Institute for Health Research (J.A.W.), the Health Sciences Centre Foundation studentship (D.G.), and the U.S. National Institutes of Health (GM 59240, K.G.S.).

Received for review February 7, 2006. Accepted June 19, 2006. AC060251B
