Anal. Chem. 2005, 77, 2187-2200
Quantitative Proteomic Analysis by Accurate Mass Retention Time Pairs
Jeffrey C. Silva,*,† Richard Denny,§ Craig A. Dorschel,† Marc Gorenstein,† Ignatius J. Kass,‡ Guo-Zhong Li,† Therese McKenna,§ Michael J. Nold,‡ Keith Richardson,§ Phillip Young,§ and Scott Geromanos†
Waters Corporation, 34 Maple Street, Milford, Massachusetts 01757-3696, Waters Corporation, 100 Cummings Center, Beverly, Massachusetts 01915, and Waters Corporation, Atlas Park, Simons Way, M22 5PP, Manchester, Great Britain
Current methodologies for protein quantitation include 2-dimensional gel electrophoresis techniques, metabolic labeling, and stable isotope labeling methods, to name only a few. The current literature illustrates both pros and cons for each of these methodologies. In keeping with the teachings of William of Ockham, “with all things being equal the simplest solution tends to be correct”, a simple LC/MS-based methodology is presented that allows relative changes in abundance of proteins in highly complex mixtures to be determined. Utilizing a reproducible chromatographic separations system along with the high mass resolution and mass accuracy of an orthogonal time-of-flight mass spectrometer, the quantitative comparison of tens of thousands of ions emanating from identically prepared control and experimental samples can be made. Using this configuration, we can determine the change in relative abundance of a small number of ions between the two conditions solely by accurate mass and retention time. Employing standard operating procedures for both sample preparation and ESI-mass spectrometry, one typically obtains under 5 ppm mass precision and quantitative variations between 10 and 15%. The principal focus of this paper is to demonstrate the quantitative aspects of the methodology, followed by a discussion of the associated, complementary qualitative capabilities.

* Corresponding author. Phone: 978-482-3005. Fax: 508-482-2055. E-mail: [email protected].
† Milford, Massachusetts. ‡ Beverly, Massachusetts. § Manchester, Great Britain.
10.1021/ac048455k CCC: $30.25. Published on Web 03/02/2005. © 2005 American Chemical Society

Quantitative proteomics has been chartered as the technology which will serve as a major contributor in studies aimed at uncovering disease pathways, discovering biomarkers, and providing new insights into biological processes for drug discovery. In these experiments, mass spectrometry is used to determine the relative amounts of protein among different biological samples to characterize a variety of physiological conditions. In addition, further characterization of the physiological perturbation may require that the relative degrees of posttranslational modifications associated with the proteins of interest be determined. However, comprehensive quantitative proteomics remains technically challenging
due to the issues associated with sample complexity, sample preparation, and the wide dynamic range of protein abundance.1,2 Many approaches to quantitative proteomics have involved the combination of stable-isotope labeling methods for sample preparation with automated liquid chromatography coupled to a tandem mass spectrometer (LC/MS/MS).3-12 Stable isotopes are generally introduced into proteins or peptides by chemical modification,3-6 metabolic labeling,7-10 or enzymatic derivatization.11,12 The specificity of these isotopic labeling techniques is contingent upon observing different mass shifts, which can be generated by using a variety of available labeling reagents.

(1) Hamdan, M.; Righetti, P. G. Mass Spectrom. Rev. 2002, 21, 287-302.
(2) Lill, J. Mass Spectrom. Rev. 2003, 22, 182-194.
(3) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994-999.
(4) Zhou, H. L.; Ranish, J. A.; Watts, J. D.; Aebersold, R. Nat. Biotechnol. 2002, 19, 512-515.
(5) Griffin, T. J.; Gygi, S. P.; Rist, B.; Aebersold, R. Anal. Chem. 2001, 73, 978-986.
(6) Chakraborty, A.; Regnier, F. J. Chromatogr., A 2002, 949, 173-184.
(7) Veenstra, T. D.; Martinovic, S.; Anderson, G. A.; Pasa-Tolic, L.; Smith, R. D. J. Am. Soc. Mass Spectrom. 2000, 11, 78-82.
(8) Ong, S. E.; Kratchmarova, I.; Mann, M. J. Proteome Res. 2003, 2, 173-181.
(9) Krijgsveld, J.; Ketting, R. F.; Mahmoudi, T.; Johansen, J.; Artal-Sanz, M.; Verrijzer, C. P.; Plasterk, R. H. A.; Heck, A. J. R. Nat. Biotechnol. 2003, 21, 927-931.
(10) Oda, Y.; Huang, K.; Cross, F. R.; Cowburn, D.; Chait, B. T. PNAS 1999, 96, 6591-6596.
(11) Yao, X. D.; Freas, A.; Ramirez, J.; Demirev, P. A.; Fenselau, C. Anal. Chem. 2001, 73, 2836-2842.
(12) Stewart, I. I.; Thomson, T.; Figeys, D. Rapid Commun. Mass Spectrom. 2001, 15, 2456-2465.
(13) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Anal. Chem. 2003, 75, 4818-4826.
(14) Radulovic, D.; Jelveh, S.; Ryu, S.; Hamilton, T. G.; Foss, E.; Mao, Y.; Emili, A. Mol. Cell. Proteomics 2004, 3, 984-997.

Analytical Chemistry, Vol. 77, No. 7, April 1, 2005 2187

In two recent articles, Wang and co-workers,13 as well as Radulovic and co-workers,14 introduced quantitative, label-free LC/MS strategies for global profiling of complex protein mixtures. Both publications illustrate their specific algorithms for ion detection, clustering, and quantitation. The lower resolution instrument employed in the studies presented by Radulovic and colleagues requires that their data reduction scheme condense all detections into nominal mass bins. Though the data presented are compelling, the data reduction strategy involving nominal mass bins may result in significant errors when dealing with highly complex mixtures. As an example, in a simple proteome such as Escherichia coli, there are ∼105 000 tryptic peptides, including one missed cleavage, between 700 and 2481 molecular mass. On average, 7 of the 105 000 tryptic peptides are found within a mass tolerance of 5 ppm of any given peptide. If the mass tolerance is increased to 1 Da, the average number of tryptic peptides increases to 165. By this logic, having more than one peptide eluting within a nominal mass bin can be up to 23 times more likely if the data are reduced from accurate mass measurements to nominal mass. As a result, nominal mass binning of mass spectrometric LC/MS data may lead to problems in subsequent clustering of replicate analyses and to variability in the corresponding quantitative analysis. Radulovic and co-workers report that their quantitative results exhibited an acceptable measure of variance of 2-fold or less deviation in the observed signal intensities. In addition to presenting data from an identical instrument platform, Wang and colleagues also illustrated LC/MS data collected on a time-of-flight mass spectrometer. In this work, the authors indicated that the higher resolution and mass accuracy of the TOF system was found to be advantageous for tracking and quantifying large numbers of mass spectral peaks. The results obtained from these studies provided acceptable coefficients of variation (∼25%) across integrated peak intensities. The data acquisition platform used by Radulovic was configured to collect two parallel LC/MS experiments in a single LC/MS run for simultaneous quantitative and qualitative analysis. In an alternating fashion, the instrument measures the masses of eluting peptide components in MS mode in one function and then carries out a data-dependent CID for a subset of detected precursor masses in MS/MS mode in a second function. However, the authors affirm that considerably more peptide peaks are detectable in full-scan MS mode than can be identified in the same time frame using the collision-induced dissociation process.
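The peptide-density argument made earlier in this passage (an average of 7 peptides within 5 ppm versus 165 within 1 Da, a ratio of about 23) can be reproduced on any list of peptide masses. The following is a minimal sketch in Python, not the procedure used by the authors; the function name `mean_neighbors` and the five-mass toy list are hypothetical and stand in for a real in silico digest.

```python
from bisect import bisect_left, bisect_right

def mean_neighbors(masses, tol, ppm=True):
    """Average number of OTHER peptides within the mass tolerance of each peptide."""
    masses = sorted(masses)
    total = 0
    for m in masses:
        # Tolerance is relative (ppm) or absolute (Da), depending on the flag.
        dm = m * tol * 1e-6 if ppm else tol
        # Count masses in [m - dm, m + dm], then subtract the peptide itself.
        total += bisect_right(masses, m + dm) - bisect_left(masses, m - dm) - 1
    return total / len(masses)

# Hypothetical toy masses (Da); a real comparison would use a full tryptic digest.
toy = [1000.000, 1000.004, 1000.500, 1000.900, 2000.000]
at_5_ppm = mean_neighbors(toy, 5.0)             # accurate-mass tolerance
at_1_da = mean_neighbors(toy, 1.0, ppm=False)   # nominal-mass tolerance

# Ratio of the two averages quoted in the text (165 versus 7 peptides).
quoted_ratio = 165 / 7
```

On the toy list the 1 Da window already captures several times more neighbors than the 5 ppm window; the ratio of the two averages quoted in the text, 165/7, is about 23.6, which is where the "up to 23 times" figure comes from.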
This level of inefficiency means that additional MS/MS experiments would be needed for thorough identifications to be made in a given study.

The use of MS technology in high-throughput proteomics faces several challenges in order to accurately compare differentially expressed proteins from corresponding peptide component information, such as retention time, mass, and signal response. Among these challenges, software solutions for peak detection, chromatographic spectral alignment, charge-state reduction, and deisotoping need to be implemented in order to reduce the complexity of the continuum MS data and successfully compare differences among samples. The Expression Informatics software, introduced in this study, has been developed to carry out these functionalities for comprehensive, quantitative, differential expression analysis.

Although it has been observed that electrospray ionization (ESI) provides signal responses that correlate linearly with increasing analyte concentration,15-17 historically, there have been concerns regarding nonlinearity of signal response and ion suppression effects18-21 which have prevented the implementation of a simple LC/MS solution for quantitative proteomics. We outline a quantitative proteomics strategy which employs an LC/MS method as the basis for quantifying proteome profile data for differential expression analysis. This method relies on the changes in the peptide analyte signal response from each accurate mass measurement and corresponding retention time (AMRT) component to directly reflect their concentrations in one sample relative to another. This method does not require the use of any stable-isotope labeling method or enrichment strategy; however, it does require that the sample preparation conditions are carefully controlled for optimal, quantitative performance. Regardless of the analytical technique, the protein samples must be prepared in a fashion that ensures an efficient and reproducible separation, with concurrent elimination of undesirable artifacts. In this investigation, we prepared a tryptic digest of human serum spiked with increasing amounts of a standard protein mixture and observed the linear behavior in the signal from digested peptides corresponding to the experimentally configured protein concentrations. The methodology presented in this work maximizes the duty cycle of a quadrupole-time-of-flight (Q-TOF) mass spectrometer to yield extensive quantitative and qualitative information by systematically and simultaneously analyzing the peptide components from large sets of protein mixtures.22,23 Although this work involves the analysis of human serum, this methodology is applicable to any number of biological samples (plasma, urine, whole-cell lysate, organelle, tissue, or microbial).

(15) Purves, R. W.; Gabryelski, L. L. Rapid Commun. Mass Spectrom. 1998, 12, 695-700.
(16) Voyksner, R. D.; Lee, H. Rapid Commun. Mass Spectrom. 1999, 13, 1427-1437.
(17) Chelius, D.; Bondarenko, P. J. Proteome Res. 2002, 1, 317-323.
(18) Muller, C.; Schafer, P.; Stortzel, M.; Vogt, S.; Weinmann, W. J. Chromatogr., B 2002, 773, 47-52.
(19) Matuszewski, B. K.; Constanzer, M. L.; Chavez-Eng, C. M. Anal. Chem. 1998, 70, 882-889.

MATERIALS AND METHODS
Sample Preparation. Six aliquots of human serum (HS, Sigma source) were dispensed into separate Eppendorf tubes (∼200 µg).
An equimolar stock solution of exogenous proteins (yeast enolase and alcohol dehydrogenase, rabbit glycogen phosphorylase, and bovine serum albumin and hemoglobin; MPDS proteins) was prepared such that each protein was present at 5 pmol/µL in 50 mM ammonium bicarbonate (pH 8.5). The exogenous proteins were added to each of the six aliquots of human serum such that the final concentration of equimolar proteins was 0.500, 0.250, 0.100, 0.050, 0.025, and 0.010 pmol/µL (final volume of 200 µL), respectively. To avoid working under the specified limits of the pipettor, appropriate dilutions of the stock solution were made to ensure that at least 10-20 µL of stock protein solution, from a calibrated 20-µL pipettor, was added to achieve the desired final exogenous protein concentration. The volumes of the samples were adjusted to 100 µL with 50 mM ammonium bicarbonate (pH 8.5) containing 0.05% RapiGest.25 Protein was reduced in the presence of 10 mM dithiothreitol at 60 °C for 30 min. The protein was alkylated in the dark, in the presence of 50 mM iodoacetamide, at room temperature for 30 min. Proteolytic digestion was initiated by adding modified trypsin (Promega) at a ratio of 75:1 (total protein to trypsin, by weight) and incubated at 37 °C overnight. Each digestion mixture was diluted to a final volume of 200 µL with 50 mM ammonium bicarbonate (pH 8.5) to reduce the concentration of RapiGest detergent to 0.025%. The tryptic peptide solution was centrifuged at 13 000 rpm for 10 min, and the supernatant was transferred into an autosampler vial for peptide analysis via LC/MS. Each sample was analyzed in triplicate. The LC/MS analysis was performed using 10 µL of the final tryptic digest.

(20) Sangster, T.; Spence, M.; Sinclair, P.; Payne, R.; Smith, C. Rapid Commun. Mass Spectrom. 2004, 18, 1361-1364.
(21) Mei, H.; Hsieh, Y.; Nardo, C.; Xu, X.; Wang, S.; Ng, K.; Korfmacher, W. A. Rapid Commun. Mass Spectrom. 2003, 17, 97-103.
(22) Bateman, R. H.; Hoyes, J. B. U.K. Patent 2,364,168A, 2002.
(23) Purvine, S.; Eppel, J. T.; Yi, E. C.; Goodlett, D. R. Proteomics 2003, 3, 847-850.
(24) Geromanos, S.; Dongre, A.; Opiteck, G.; Silva, J. C. U.K. Patent 2,385,918A, 2003.
(25) Yu, Y. Q.; Gilar, M.; Lee, P. J.; Bouvier, E. S. P.; Gebler, J. C. Anal. Chem. 2003, 75, 6023-6028.

HPLC Configuration. Capillary liquid chromatography (CapLC) of tryptic peptides was performed with a Waters CapLC/Waters CapLC autosampler, equipped with a Waters NanoEase Atlantis C18, 300 µm × 15 cm reversed-phase column. The aqueous mobile phase (mobile phase A) contained 1% acetonitrile in water with 0.1% formic acid. The organic mobile phase (mobile phase B) contained 80% acetonitrile in water with 0.1% formic acid. Peptides were loaded onto the column with 6% mobile phase B. Peptides were eluted from the column with a gradient of 6-40% mobile phase B over 100 min at 4.4 µL/min, followed by a 10-min rinse with 99% mobile phase B. The column was immediately reequilibrated at initial conditions (6% mobile phase B) for 20 min. The lock mass, [Glu1]-fibrinopeptide at 100 fmol/µL (GFP), was delivered from the auxiliary pump of the CapLC at 1 µL/min to the reference sprayer of the NanoLockSpray source.

Mass Spectrometer Configuration. Mass spectrometry analysis of tryptic peptides was performed using a Waters/Micromass Q-Tof Ultima API, modified to provide enhanced mass accuracy. Detection events were acquired at 4 GHz. For all measurements, the mass spectrometer was operated in V mode with a typical resolving power of at least 10 000. The spectrum integration time was 1.8 s with an interscan delay time of 0.2 s. All analyses were performed in positive-mode ESI using a NanoLockSpray source. The lock mass channel was sampled every 30 s.
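To make the role of the lock mass concrete, the following minimal Python sketch applies a single-point lock-mass correction. It assumes a simple proportional rescaling against the monoisotopic doubly charged GFP ion at m/z 785.8426; the function names and the 20 ppm drift are hypothetical, and the correction actually applied by the instrument software may well differ.

```python
GFP_LOCK_MZ = 785.8426  # m/z of the doubly charged GFP lock-mass ion

def lock_mass_correct(observed_mz, observed_lock_mz, lock_mz=GFP_LOCK_MZ):
    """Rescale an observed m/z by the ratio of expected to observed lock mass.

    Assumption: the calibration drift is a single multiplicative factor
    shared by the analyte and the lock mass.
    """
    return observed_mz * (lock_mz / observed_lock_mz)

def ppm_error(measured_mz, reference_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - reference_mz) / reference_mz * 1e6

# Hypothetical drift: every m/z in the most recent scans reads 20 ppm high.
drift = 1 + 20e-6
observed_lock = GFP_LOCK_MZ * drift
observed_peptide = 724.41 * drift

corrected = lock_mass_correct(observed_peptide, observed_lock)
error_before = ppm_error(observed_peptide, 724.41)  # about +20 ppm
error_after = ppm_error(corrected, 724.41)          # about 0 ppm
```

Because the lock mass channel is sampled periodically rather than continuously, a real correction would use the most recent (or an averaged) lock-mass reading for each analyte scan; the sketch ignores that bookkeeping.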
The mass spectrometer was calibrated with a GFP solution (100 fmol/µL) delivered through the reference sprayer of the NanoLockSpray source. The doubly charged ion ([M + 2H]2+) was used for initial single point calibration (Lteff), and MS/MS fragment ions of GFP were used to obtain the final instrument calibration. Data acquisition was operated in the exact neutral loss mode, without an include list. Accurate mass LC/MS and LC/MSE data were collected using 10 eV for MS and 28-35 eV for MSE acquisition such that one cycle of MS and MSE data was acquired every 4.0 s. The RF offset was adjusted such that the LC/MS data were effectively acquired from m/z 300 to 2000, which ensured that any masses observed in the LC/MSE data less than m/z 300 were known to arise from dissociations in the collision cell.

RESULTS AND DISCUSSION
Ion Detection. The ion detection algorithm of the Expression Informatics software uses a maximum likelihood algorithm to deisotope and charge-state-reduce the m/z detections to the corresponding monoisotopic m/z (MH+) for each scan of the continuum LC/MS data.26 The algorithm also calculates the observed mass and intensity measurement deviation for every detected component. The chromatographic area associated with each component is calculated using an integration algorithm similar to the ApexTrack peak integration algorithm provided in the MassLynx software. If a particular component exists in more than one charge state, the corresponding area for any given monoisotopic ion is reported as the summed area from all contributing charge states. The retention time is determined for each reported monoisotopic ion at the moment it reaches its maximum intensity (apex). Each detected component is referred to as an AMRT (accurate-mass, retention time) component. An AMRT is extracted from the continuum data only if it exceeds a user-defined, minimum detection threshold. The software is also capable of processing the data using an autothreshold capability which automatically adjusts the ion detection threshold over time as a function of the dynamic range within the mass spectrometric data. This process culminates in an AMRT component list. This list contains many experimentally derived attributes for each of the recorded AMRT components (AMRTs). Included in this output are the weight-averaged monoisotopic mass and charge state, the calculated mass deviation, the deisotoped and charge-state-reduced sum intensity (centered by area), the chromatographic area, the calculated intensity deviation, the observed apex retention time (centered by area), and the observed start and stop time for the ion detection of the corresponding AMRT.

(26) Skilling, J.; Bryan, R. K. Mon. Not. R. Astron. Soc. 1984, 211, 111-124.

Clustering Peptide Components by Mass and Retention Time. One of the key operations required for the comparative analyses of peptide mixtures is clustering chemically identical components together from replicate injections of the same sample as well as among multiple samples. The clustering algorithm performs multiple binary comparisons to conduct the overall clustering strategy for a complete experiment.27,28 AMRT components from each injection are clustered to align identical components to one another on the basis of a mass precision and a retention time deviation threshold.
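As a rough picture of what an AMRT record holds and how two of them might be compared, consider the following Python sketch. It is illustrative only: the `AMRT` class, its field names, the default tolerances, and the example detections are hypothetical, and the simple first-seen retention time used here is a simplification of the weight-averaged attributes that the software is described as reporting.

```python
from dataclasses import dataclass

PROTON = 1.007276  # mass of a proton in Da

@dataclass
class AMRT:
    mh_plus: float  # charge-state-reduced monoisotopic mass (MH+)
    rt_min: float   # apex retention time, minutes
    area: float     # chromatographic area summed over charge states

def to_mh_plus(mz, z):
    """Convert a monoisotopic m/z at charge z to the singly protonated mass."""
    return mz * z - (z - 1) * PROTON

def is_match(a, b, ppm_tol, rt_tol):
    """True if two AMRTs agree within a ppm mass and a retention time tolerance."""
    ppm = abs(a.mh_plus - b.mh_plus) / a.mh_plus * 1e6
    return ppm <= ppm_tol and abs(a.rt_min - b.rt_min) <= rt_tol

def merge_charge_states(detections, ppm_tol=10.0, rt_tol=0.5):
    """Collapse (mz, z, rt, area) detections into AMRTs, summing charge-state areas."""
    amrts = []
    for mz, z, rt, area in sorted(detections, key=lambda d: d[2]):
        candidate = AMRT(to_mh_plus(mz, z), rt, area)
        for existing in amrts:
            if is_match(existing, candidate, ppm_tol, rt_tol):
                existing.area += area  # same species seen in another charge state
                break
        else:
            amrts.append(candidate)
    return amrts

# Hypothetical detections: one peptide seen as 2+ and 3+, plus an unrelated ion.
detections = [
    (724.405756, 2, 69.50, 8000.0),
    (483.272929, 3, 69.52, 2000.0),
    (600.000000, 2, 40.00, 500.0),
]
amrt_list = merge_charge_states(detections)
```

The same `is_match` predicate, applied between injections rather than within one, is the kind of mass and retention time test on which the clustering step is based.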
(27) Li, G.-Z.; Gorenstein, M.; Geromanos, S.; Silva, J. C.; Dorschel, C. A.; Riley T. Proc. 52nd ASMS Conf. Mass Spectrom. Allied Top. 2004, TPY 354, Nashville, TN.
(28) Gorenstein, M.; Li, G.-Z.; Geromanos, S.; Silva, J. C.; Dorschel, C. A.; Plumb, R. S.; Stumpf, C. L.; Riley, T. Proc. 52nd ASMS Conf. Mass Spectrom. Allied Top. 2004, WPJ 161, Nashville, TN.

Figure 1. (A) The AMRTs from two separate injections of the human serum spiked with 5 pmol of exogenous protein were clustered by mass and retention time using the Expression Informatics software to associate identical components. The initial results of the clustering algorithm are displayed by plotting the observed retention time deviation for all matched components versus the retention time of the first injection. Each point represents a paired AMRT having the appropriate mass (±10 ppm) and retention time tolerance (±5.0 min) from the first pass of the clustering algorithm. The red and blue lines define the corresponding upper and lower limits for the retention time tolerance used in the second pass filter. The matched components outside these tolerances are examples of similar mass measurements existing at multiple retention times within the 10 ppm mass tolerance. Although the absolute retention time deviation is ∼1.45 min throughout the entire chromatogram (min = -1.05, max = 0.40), the data indicate that the deviation of matched components at any given retention time does not exceed 0.5 min. (B) Using the retention time deviations from the matched components of the raw data, within the narrow retention time tolerance of 0.5 min, the retention times of the paired AMRTs are normalized and the redundant matched AMRTs are removed by eliminating those paired components outside the fine retention time tolerance. (C) Mass precision measurements from the 3131 replicating AMRTs (in at least two out of three injections) from the human serum samples containing 5.0 and 0.5 pmol exogenous proteins, whose replicate normalized intensity measurements were below 30% Cv. The 3131 replicating AMRTs produced 13 963 individual mass measurements used to produce the histogram plot of the mass precision. A total of 12 981 mass measurements were determined to have a mass precision of ±3 ppm, which constitutes ∼93% of the data set. (D) Coefficient of variation of the intensity measurements from the 3404 replicating AMRTs (in at least two out of three injections) from the human serum samples containing 5.0 and 0.5 pmol exogenous proteins. The 3404 replicating AMRTs produced 5032 combined Cv measurements from both samples and were used to produce the histogram plot of the coefficient of variation of the measured intensity. A total of 4557 of the 5032 Cv measurements were under 30%, which constitutes ∼90% of the data set. The average and median coefficient of variation from these two data sets are 11 and 14%, respectively.

In an initial binary comparison, a subset of the AMRTs from two separate injections is compared to establish the experimental retention time deviation behavior of identical AMRTs between the two samples. The subset of AMRTs considered in the initial comparison is typically those above the median intensity for the entire data set. In the initial comparison, a coarse threshold of typically 5 min is applied to consider all potential paired candidates. Often, peptides may not reproducibly elute at exactly the same time throughout a replicate analysis. However, one generally observes a consistent shift in retention time, whereby the observed retention time of a given set of peptides will deviate systematically, although not necessarily by the same magnitude. Due to the complexity of the data, there often exist conditions under which an AMRT in one condition or replicate will match within the threshold criterion to multiple AMRTs in a different replicate or condition. This, of course, is not desirable, since an AMRT from one condition or replicate should only match its identical companion in any other condition. To address these situations, the clustering algorithm calculates the delta retention time for all matched AMRTs and plots the retention time for each AMRT against the retention time difference observed among the corresponding matched components (Figure 1A). In doing so, the algorithm can determine the expected
retention time deviations for a given set of peptides at any given moment throughout the chromatogram. The expected retention time deviations are modeled by monitoring the density of points about a retention time deviation plot and determining the upper and lower retention time deviation boundaries for any given binary comparison. Only the matched AMRT components included within the defined retention time deviation boundaries are considered to satisfy the matching criteria. Figure 1A illustrates such a plot. A fine retention time deviation threshold of typically less than 0.5 min is generally observed among paired components between two experiments. Figure 1A illustrates a single pairwise comparison of a replicate injection of the same sample. If the chromatography were ideal, the retention time differences for all matched components would be 0, and the resulting plot would illustrate a straight horizontal line centered at zero deviation. Each point in the plot designates one paired set of components. Since many components elute from the column at any moment in time, the resulting plot should illustrate a dense scattering of points along
the retention time coordinate. Figure 1A illustrates that the reproducibility of the chromatographic peptide separation is ∼0.25 min, with an overall chromatographic deviation of 1.0 min. The pairwise comparison is performed for each of the replicate injections, as well as across the multiple experiments. The retention time deviations observed between the AMRTs of two injections serve as multiple internal standards and are used to determine an appropriate retention time offset for AMRTs eluting at any moment. The retention time offsets are used to normalize the observed retention time for every AMRT component. The effects of the retention time normalization are illustrated in Figure 1B. The output that is generated from the clustering routine is a large matrix, whereby identical components are aligned in each row for subsequent quantitative and statistical analysis. The assembled matrix will not only contain AMRTs which appear in each of the conditions for each of the replicate injections, but may also include those AMRTs which appear reproducibly in one or more of the six conditions.

To illustrate the level of specificity one is capable of obtaining with mass accuracy and retention time reproducibility, the processed data can be queried at different retention time and mass precision tolerances. As an example, injection 2 of the human serum with 2 pmol of MPDS protein produced 2582 AMRTs. The 2582 AMRTs were queried to determine how many were within a ±1-min retention time window and a 10 ppm mass tolerance. Using these tolerances, a total of 36 AMRTs (1.4%) were found to coexist within these parameters. Therefore, these 36 AMRTs could potentially add ambiguity during the clustering process and lead to incorrect clustering of the data. If the mass tolerance is allowed to expand to a 100-mDa error, the ambiguity increases to a total of 76 AMRTs (2.9%). At 1 Da, nominal mass, the ambiguity increases to a total of 657 AMRTs (25.4%).
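Two steps described above lend themselves to a short sketch: the fine retention time filter applied after the coarse first pass, and the query that counts potentially ambiguous AMRTs at a given pair of tolerances. The Python below is a simplified illustration and not the Expression Informatics algorithm. In particular, the local median deviation is swapped in for the density-based boundaries described in the text, and all function names, default values, and toy data are hypothetical.

```python
from bisect import bisect_left, bisect_right
from statistics import median

def fine_filter(pairs, window=5.0, fine_tol=0.5):
    """Second-pass filter on coarse (rt_a, rt_b) pairs from two injections.

    The median deviation of pairs eluting within +/- window minutes stands in
    for the expected retention time offset at that point in the chromatogram.
    Returns (rt_a, rt_b, offset) for pairs within fine_tol of that offset.
    """
    kept = []
    for rt_a, rt_b in pairs:
        local = [b - a for a, b in pairs if abs(a - rt_a) <= window]
        offset = median(local)
        if abs((rt_b - rt_a) - offset) <= fine_tol:
            kept.append((rt_a, rt_b, offset))
    return kept

def count_ambiguous(amrts, rt_tol, mass_tol, ppm=True):
    """Count (mass, rt) AMRTs with at least one other AMRT inside both tolerances."""
    amrts = sorted(amrts)
    masses = [m for m, _ in amrts]
    count = 0
    for i, (m, rt) in enumerate(amrts):
        dm = m * mass_tol * 1e-6 if ppm else mass_tol
        lo = bisect_left(masses, m - dm)
        hi = bisect_right(masses, m + dm)
        if any(j != i and abs(amrts[j][1] - rt) <= rt_tol for j in range(lo, hi)):
            count += 1
    return count

# Hypothetical coarse pairs: four follow a ~0.2 min shift, one is a false match.
coarse = [(10.0, 10.2), (10.5, 10.7), (11.0, 11.25), (11.5, 11.7), (12.0, 15.5)]
fine = fine_filter(coarse)

# Hypothetical AMRT list as (mass in Da, retention time in min).
toy = [(1000.000, 30.0), (1000.005, 30.4), (1000.300, 30.2), (1500.000, 30.1)]
at_10_ppm = count_ambiguous(toy, rt_tol=1.0, mass_tol=10.0)
at_1_da = count_ambiguous(toy, rt_tol=1.0, mass_tol=1.0, ppm=False)
```

Subtracting the returned `offset` from `rt_b` would give a normalized retention time in the spirit of Figure 1B. On the toy list, widening the mass tolerance from 10 ppm to 1 Da turns a previously unambiguous AMRT into an ambiguous one, which is the same effect, on a much smaller scale, as the counts reported for the 2582 AMRTs.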
These errors are compounded if the tolerances of both the retention time and mass precision are allowed to expand. If the retention time tolerance is widened to ±5 min, the following statistics are generated from the single data file: 293 AMRTs (11.3%) at 10 ppm mass tolerance, 441 AMRTs (17.1%) at 100-mDa mass tolerance, and 1112 AMRTs (43.1%) at 1-Da tolerance. These results are based on a single injection of a single sample. If one were to compare replicates among many different samples, this could lead to a significant number of AMRTs being clustered incorrectly and thereby produce highly irreproducible results. Having an LC/MS instrumentation platform that is capable of providing reproducible mass precision and accuracy along with reproducible chromatography will significantly increase the quality of the clustered data and will provide a more robust quantitative proteomics platform.

Data Normalization and Statistical Analysis. Once the AMRT data have been clustered, the clustering algorithm performs a number of mathematical and statistical calculations for the entire data set. To correct for injection variability and total protein load across samples, the intensity measurements for the entire data set are normalized. The intensity measurements of all detected AMRTs from each injection are normalized to a set of AMRTs (endogenous or exogenous) that are known not to have changed among the different samples. The internal AMRT standards used for normalization purposes were required to be present in all six experiments. Although the Expression Informatics software is capable of correcting the mass and intensity
measurements that are in dead time, there is a limit to its ability to accurately correct for those measurements.29,30 With this in mind, the internal AMRT standards selected for normalization were well below dead time and existed in all replicates of each sample. The average monoisotopic masses of the AMRTs used for normalization were 1273.6547, 1706.7746, and 2171.1138, with corresponding elution times of approximately 42.60, 53.60, and 101.80 min, respectively. These AMRT components were endogenous to human serum and were determined to originate from transferrin (data not shown).31 Next, the algorithm calculates the replication rate of each AMRT within and among all conditions. The algorithm also calculates the average mass, intensity, area, combined charge state, and retention time for each AMRT for all conditions. In addition, a standard deviation and a coefficient of variation are determined for each of these measured attributes. Using this information, the software annotates those AMRTs common and unique to each condition. Last, the algorithm performs binary comparisons for each of the conditions to generate an average normalized intensity ratio (log) for all matched AMRTs and also performs a Student’s t-test for each binary comparison. The final results of the clustering algorithm can be exported as a comma-delimited text file containing all of the mass spectrometric and chromatographic attributes for each AMRT, along with all of the mathematical and statistical calculations generated after the clustering process. This clustered data file can be further manipulated or visualized in any of a number of commercially available software packages, such as Microsoft Excel or Spotfire Decision Site.

The precision of the extracted mass measurements of the clustered components from the replicate injections of all samples was typically within ±5 ppm of the mean mass measurement.
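The per-AMRT calculations listed above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions rather than the software's implementation: the function names and example numbers are hypothetical, the logarithm base is not given in the text and base 10 is assumed, and a pooled-variance two-sample t statistic is assumed for the Student's t-test.

```python
import math
from statistics import mean, stdev

def normalize(intensities, standard_ids, target=1.0):
    """Scale one injection so that its internal-standard AMRTs average to target."""
    factor = target / mean(intensities[i] for i in standard_ids)
    return {key: value * factor for key, value in intensities.items()}

def cv_percent(values):
    """Coefficient of variation (sample standard deviation over mean), in percent."""
    return stdev(values) / mean(values) * 100.0

def log_ratio(cond_a, cond_b):
    """Log10 (assumed base) of mean normalized intensity in B relative to A."""
    return math.log10(mean(cond_b) / mean(cond_a))

def t_statistic(a, b):
    """Two-sample Student's t statistic with pooled variance (assumed form)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(pooled * (1 / na + 1 / nb))

def ppm_deviation(masses):
    """Deviation of each replicate mass from the cluster mean, in ppm."""
    m = mean(masses)
    return [(x - m) / m * 1e6 for x in masses]

# Hypothetical injection: two internal standards and one peptide of interest.
injection = {"std1": 200.0, "std2": 400.0, "pep": 900.0}
normalized = normalize(injection, ["std1", "std2"])

cv = cv_percent([90.0, 100.0, 110.0])
ratio = log_ratio([10.0, 10.0, 10.0], [100.0, 100.0, 100.0])
t = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
deviations = ppm_deviation([1000.000, 1000.002, 1000.004])
```

Converting the t statistic to a probability would require the t distribution, which is omitted here. The last function expresses each replicate mass as a ppm deviation from the cluster mean, which is the quantity behind the ±5 ppm mass precision quoted above.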
These data are illustrated in Figure 1C and demonstrate the robustness of the ion extraction software and the stability of the mass measurement instrumentation. In fact, 90% of the total number of replicated components were measured with a precision of (3 ppm. The reproducibility of the quantitative intensity measurements from the Expression Informatics software is summarized in Figure 1D. These results indicate that the coefficient of variation (Cv) among the replicate injections and across multiple samples were typically less than 15%, with a majority of the quantitative variation lying between 11 and 14% Cv. These observations are typically expected from the Expression Informatics software when using standard protocols for efficient sample preparation.32 Expression Analysis of AMRT Components. The purpose of these experiments was to demonstrate that the Expression Informatics software could ascertain the relative change in abundance of a small subset of proteins (MPDS proteins) spiked into a complex protein background (human serum). The MPDS (29) Rockwood, A. L.; Fabbi, J. C.; Harris, L.; Davis, L.; Lee, E. D.; Ogden, C.; Tolley, H.; Gunsay, M.; Sin, J. C. N.; Lee, H. G. Proc. 45th ASMS Conf. Mass Spectrom. Allied Top. 1997, WOE 0250, Palm Springs, CA. (30) Barbacci, D. C.; Russel, D. H.; Schultz, J. A.; Holocek, J.; Ulrich, S.; Burton, W.; Van Stipdonk, M. J. Am. Soc. Mass Spectrom. 1998, 9, 1328-1333. (31) Silva, J. C.; Richardson, K.; Young, P.; Denny, R.; Neeson, K.; McKenna, T.; Dorschel, C. A.; Li, G.-L.; Gorenstein, M.; Riley, T.; Geromanos, S. Proc. 52nd ASMS Conf. Mass Spectrom. Allied Top. 2004, MPX 452, Nashville, TN. (32) Dorschel, C. A.; Gorenstein, M.; Li, G.-Z.; Silva, J. C.; Geromanos, S.; Riley, T. Proc. 52nd Ann. ASMS Conf. Mass Spectrom. Allied Top. 2004, TPY 458, Nashville, TN.
Analytical Chemistry, Vol. 77, No. 7, April 1, 2005
2191
Figure 2. (A) The base peak intensity (BPI) of human serum with five equimolar exogenous proteins spiked at decreasing levels (5.00, 2.00, 1.00, 0.50, 0.25, and 0.10 pmol), (B) the selected ion chromatogram (SIC) of the doubly charged peptide ion, 724.34 ((0.05 m/z). The corresponding SICs were integrated using MassLynx processing software between 68.00 and 71 min. Processing parameters were set for automatic noise measurement, Savitzky-Golay smoothing (three channels, two smoothes), and ApexTrack peak integration. (C) The continuum mass spectrum at the apex of the corresponding 724.34 selected ion chromatogram in panel B (from 600 to 825 m/z). (D) The lock-masscorrected, centroided mass spectrum of the 724.34 isotope cluster (between 722 and 729 m/z) from panel C (smoothing: Savitzky-Golay, three channels, two smoothes; centering: three channels, centroid top 80%, centered by area) and lock-mass-corrected against the monoisotopic ion of Glu-Fib, 785.8426 m/z).
proteins were spiked at levels well below that of the most abundant proteins in the complex background. Six samples were prepared to reflect a dilution series of the MPDS proteins ranging from 10 to 500 fmol/µL. The samples were digested with trypsin as described in the Material and Methods section, and the resulting polypeptide mixtures were analyzed in triplicate by LC/MS.22-24 To demonstrate that the quantitative information relating to the MPDS proteins was available in the acquired LC/MS data, a manual analysis was performed on a previously characterized AMRT (m/z 724.41 at 69.5 min). Figure 2A depicts six total ion chromatograms (TICs) obtained from the LC/MS acquisitions. For the sake of space, only one replicate TIC is illustrated for each of the six different samples. The TICs illustrate a high degree of similarity among the six different samples, despite an overall 50-fold change in the relative levels of MPDS peptides throughout the six samples. Figure 2B illustrates the selected ion chromatograms (SICs) for the m/z 724.41 (z = 2, MH2+) ion at ∼69.5 min and the associated integrated peak areas, as determined by MassLynx. The identity of this peptide was validated by DDA to use as a proof-of-concept model for the subsequent quantitative comparison (data not shown, VVGLSTLPEIYEK peptide from yeast ADH). Figure 2C illustrates the six individual MS spectra obtained from each sample at the chromatographic apex of the SIC in Figure 2B (m/z 724.41). Each spectrum presented in Figure 2C is normalized to the highest ion in the spectrum to illustrate the dilution of the 724.41 MH2+ ion over the six different concentrations. The data presented in each spectrum illustrate a very high degree of similarity with respect to the other coeluting peptides in the background of human serum. This similarity is reflected not only in the number of ions present in each scan but also in the correlation among their respective intensities and relative intensity ratios. The degree of chromatographic reproducibility is further supported, at the global level, by the Expression Informatics processing and analysis of the clustered AMRTs obtained from each of the replicate analyses, as will be illustrated later. Figure 2D depicts each spectrum after it has been smoothed (Savitzky-Golay smoothing, three channels, two smoothes), centered (three channels, 80% of the centroid top, centered by area), and lock-mass corrected against the monoisotopic ion of GFP (m/z 785.8426). Comparison of the lock-mass-corrected mass measurements obtained from the six individual samples (m/z 724.41, MH2+) reflects the level of mass precision obtained from this methodology. It also establishes that one can use an LC/MS-based approach for relative quantitation of peptide components in a complex protein sample, provided that sufficient mass and retention time reproducibility are obtained. Table 1 outlines the results obtained from the manual interrogation of the raw data using the commercially available MassLynx software.

Table 1. Summary Table of the Manual and Automated Analysis(a)

manual processing (MassLynx)
human serum + exogenous proteins, pmol   theoretical ratio(b)   int(c)   MH+(d)      ppm(e)   calcd ratio(f)   error, %(g)
5.00                                      1.0                    15871    1447.8134    5.9      1.0
2.00                                      2.5                     5498    1447.8082   -2.3      2.9             15.2
1.00                                      5.0                     2775    1447.8062   -0.9      5.7             14.2
0.50                                     10.0                     1584    1447.8082   -2.3     10.0              0.1
0.25                                     20.0                      688    1447.7998    3.5     23.1             15.3
0.10                                     50.0                      343    1447.8042    0.4     46.3             -7.5
RMS error                                                                              3.1                       5.4

automated processing (Expression Informatics)
human serum + exogenous proteins, pmol   theoretical ratio(b)   int(h)   MH+(i)      ppm(j)   calcd ratio(k)   error, %(l)
5.00                                      1.0                   545213    1447.8112   -4.4      1.0
2.00                                      2.5                   205709    1447.8151   -7.1      2.7              8.0
1.00                                      5.0                   107305    1447.8086   -2.6      5.1              2.0
0.50                                     10.0                    51992    1447.8089   -2.8     10.5              5.1
0.25                                     20.0                    23808    1447.8102   -3.7     22.9             14.4
0.10                                     50.0                    10885    1447.8121   -5.0     50.1              0.2
RMS error                                                                              4.6                       3.5

(a) The mass measurements and signal response measurements obtained from manual analysis using MassLynx software and automated processing using the Expression Informatics software for the 1447.8048 monoisotopic ion (at ∼69 min) originating from the VVGLSTLPEIYEK peptide of yeast ADH are described in the table.
(b) The theoretical relative ratio for the spiked ADH peptide.
(c) The integrated peak measurement obtained using ApexTrack peak integration in MassLynx.
(d) The calculated monoisotopic mass from the lock-mass-corrected measurement of the 12C isotope of the doubly charged ion cluster.
(e) The corresponding ppm error obtained using the MassLynx software when compared to the theoretical monoisotopic mass, 1447.8048.
(f) The calculated relative ratio of each condition compared to the 5 pmol condition from the measured peak response.
(g) The relative percent error between the calculated relative ratio and the theoretical relative ratio.
(h) The integrated peak measurement obtained using the peak integration algorithm in the Expression Informatics software.
(i) The calculated monoisotopic mass from the lock-mass-corrected measurement of the doubly charged ion cluster using the maximum entropy algorithm in the Expression Informatics software.
(j) The corresponding ppm error obtained using the Expression Informatics software when compared to the theoretical monoisotopic mass, 1447.8048.
(k) The calculated relative ratio of each condition compared to the 5 pmol condition from the measured peak response.
(l) The relative percent error between the calculated relative ratio and the theoretical relative ratio.

The integrated peak area and accurate mass measurement of the monoisotopic ion for each sample are indicated in Table 1. In addition, the observed mass error (ppm) has been determined, along with the corresponding calculated response ratios for each of the samples, when compared to the 5-pmol sample. Upon manual interrogation of the raw continuum data, the overall quantitative accuracy is within ±10%. The average mass accuracy obtained from MassLynx for the yeast ADH peptide (724.41 m/z, z = 2) was below 5 ppm (RMS). Table 1 illustrates that the information is available in the raw continuum data to display the relative change in abundance of the yeast ADH protein (from 5000 to 100 fmol) in the complex background of human serum. The quality of the mass spectrometric data is highlighted in Table 1, which contains the average accurate mass measurement and corresponding parts-per-million error for the test AMRT in each of the separate samples. It also includes the average normalized intensity and the corresponding intensity ratios from the manual analysis of the yeast ADH peptide across all six experiments. The 18 LC/MS experiments were processed with the Expression Informatics software for a profiling analysis study. The Expression Informatics results of the same AMRT described earlier (m/z 724.41 MH2+, 1447.81 MH+) produced an average mass precision error below 5 ppm (4.1 ppm, RMS) and an average quantitative error of ∼5%. The results obtained from the automated processing of the raw continuum data were, thus, in agreement with the manually obtained data from MassLynx, described above. The response curves generated from the manual and automated processing of the VVGLSTLPEIYEK tryptic peptide from yeast ADH are illustrated in Figure 3A. These data demonstrate the consistency between the two data processing methods, whereby the two normalized response curves are nearly coincident, with an overall correlation coefficient of 0.999.
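The derived columns of Table 1 follow from simple arithmetic on the measured peak areas and masses. The sketch below recomputes the ppm error, its RMS, and the response ratios for the manual (MassLynx) data; the function names and the proton mass constant are illustrative assumptions and are not taken from either software package.

```python
import math

THEORETICAL_MH = 1447.8048   # theoretical monoisotopic MH+ quoted in Table 1
PROTON = 1.00728             # proton mass in Da (assumed constant)

# Manual (MassLynx) columns of Table 1, ordered from 5.00 to 0.10 pmol.
amount_pmol = [5.00, 2.00, 1.00, 0.50, 0.25, 0.10]
peak_area = [15871, 5498, 2775, 1584, 688, 343]
measured_mh = [1447.8134, 1447.8082, 1447.8062,
               1447.8082, 1447.7998, 1447.8042]


def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def doubly_charged_mz(mh):
    """m/z of the 2+ ion from the singly protonated mass MH+."""
    return (mh + PROTON) / 2.0


ppm = [ppm_error(m, THEORETICAL_MH) for m in measured_mh]
rms_ppm = rms(ppm)

# Ratios are expressed relative to the 5-pmol sample, as in the table.
theoretical_ratio = [amount_pmol[0] / a for a in amount_pmol]
calc_ratio = [peak_area[0] / p for p in peak_area]
pct_error = [(c - t) / t * 100 for c, t in zip(calc_ratio, theoretical_ratio)]

for a, t, c, e in zip(amount_pmol, theoretical_ratio, calc_ratio, pct_error):
    print(f"{a:5.2f} pmol  theoretical {t:5.1f}  calcd {c:5.1f}  error {e:6.1f}%")
print(f"RMS mass error: {rms_ppm:.1f} ppm")
```

Small differences from the tabulated percent errors are expected, since the table appears to have been computed from rounded ratios.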
The results show the linearity of the two data processing methods across the 2 orders of magnitude dynamic range inherent in the outlined experiments. Interestingly, the linear response of the exogenous ADH peptide (724.41 MH2+) seems to illustrate little or no ion suppression effects which may have resulted from the high background of
human serum peptides throughout the dilution series. Though the data presented in Figures 2 and 3 and Table 1 are quite encouraging, the challenge hinges on creating a software processing package that is capable of automating the process, whereby hundreds or thousands of TICs can be compared quantitatively. Table 2 illustrates the number of AMRTs obtained from each replicate of each sample, along with the associated combined intensity for all extracted AMRTs (after normalization). The variability associated with the number of extracted AMRTs is presented in Table 2 and illustrates a high degree of reproducibility across replicate injections. However, the data also illustrate a steady decrease in the number of AMRTs reported, along with a decrease in the combined intensity, as one moves from the samples containing the highest concentration of exogenous proteins to those containing the lowest. We plotted the change in the average number of AMRTs and total intensity versus the spiked protein concentration for the six samples and found the data to be linear, with R² values of 0.9878 and 0.9838, respectively (data not shown). Since the background of human serum proteins should not change from sample to sample, it is our contention that the associated y-intercepts of 1964 AMRTs and 7.0 × 10^7 intensity counts represent the basal level (number and associated intensity) of AMRTs present in the human serum. The 18 resulting XML files were generated from the continuum LC/MS data using the Expression Informatics software and contained both the mass spectrometric and chromatographic attributes for all extracted AMRTs. The XML files were processed using the associated clustering algorithm to group identical AMRTs across the replicate injections for all six samples. In the replicate analysis of the human serum with 5 pmol of MPDS protein, 68% of the total AMRTs were replicated in three out of three injections (2577 AMRTs of the 3797 total clustered AMRTs). 
The 2577 replicating AMRTs accounted for ∼87% of the total detected intensity. The overall trend suggests that the missing observations are due to the ion detection threshold parameters. Decreasing the stringency to two out of three replicate injections resulted in 85% of the total AMRTs and constituted 95% of the
Figure 3. (A) The response curves of the doubly charged polypeptide ion (observed 724.34 m/z, VVGLSTLPEIYEK peptide from yeast ADH) at ∼69 min from manual interrogation and automated processing of the spiked human serum data. The response measurements were normalized to the maximum observed response from the corresponding dilution series. (B) A subset of 25 response curves obtained from the output of the clustering tool of the Expression Informatics software. The clustered output file was imported into Spotfire, and the data were parsed by the average monoisotopic mass from all replicates of each sample using the trellis option in Spotfire. The average monoisotopic mass for each AMRT component is indicated at the top of each plot. Those AMRTs associated with the human serum (rows 1-4) did not change throughout the dilution series and are indicated by those response curves with a slope of 0, whereas all of those AMRTs that are associated with the exogenous proteins have a similar positive slope (row 5). The AMRTs were validated to each of the corresponding exogenous proteins: 1422.7261 MH+, EFTPVLQADFQK (bovine hemoglobin (β-chain)); 1529.7344 MH+, VGAHAGEYGAEALER (bovine hemoglobin (α-chain)); 1576.7762 MH+, LKPDPNTLCDEFK (bovine albumin); 1578.8098 MH+, VDDFLLSLDGTANK (yeast enolase); and 1580.8387 MH+, QIIEQLSSGFFSPK (rabbit phosphorylase B).
total detected intensity. In the replicate injection of the 5-pmol condition, the average intensity measurement for those AMRTs which replicated in three out of three injections was 36 666 counts, whereas the average intensity measurements for the AMRTs which replicated in either two or one out of three injections were 13 750 and 8411 counts, respectively. Lowering the ion detection threshold increases the number of AMRTs reported but also lowers the total fraction of replicating AMRTs. In addition, lowering the ion detection threshold does not dramatically affect the fraction of total intensity attributed to the replicating AMRTs.
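The basal AMRT level inferred from the y-intercept can be checked with an ordinary least-squares fit of the mean AMRT count per sample against the spiked amount. The sketch below uses the per-injection counts reported in Table 2; the helper name `linear_fit` is hypothetical, and the fitting procedure actually used by the authors is not described in the text.

```python
# AMRT counts for the three injections of each sample (Table 2),
# keyed by the amount of spiked MPDS protein in pmol.
amrt_counts = {
    5.00: [3142, 3231, 3212],
    2.00: [2382, 2582, 2758],
    1.00: [2383, 2087, 2244],
    0.50: [2005, 2062, 2106],
    0.25: [2012, 1939, 2058],
    0.10: [1972, 2002, 1923],
}


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, with R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


amounts = sorted(amrt_counts)
mean_counts = [sum(amrt_counts[a]) / len(amrt_counts[a]) for a in amounts]
slope, intercept, r_squared = linear_fit(amounts, mean_counts)

print(f"slope = {slope:.1f} AMRTs/pmol")
print(f"intercept = {intercept:.0f} AMRTs, R^2 = {r_squared:.4f}")
```

With these inputs the intercept falls close to the 1964 AMRTs quoted above; the same function can be applied to the summed normalized intensities to estimate the basal intensity.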
A total of 1776 AMRTs were found in common to all replicates of all six samples, constituting an average combined intensity of 7.12 × 10^7 counts. These results are consistent with the hypothesis regarding the basal level of the human serum AMRTs found to replicate among the six samples. Though one may suspect the total number of AMRTs to be low, considering the complexity of the background of human serum peptides, it should be noted that the purpose of this study is to verify that the Expression Informatics software identifies the appropriate change in relative abundance among the spiked MPDS peptides. The ion detection
Table 2. Summary Table of the Ion Detection Results(a)

sample                   measurement            inj 1         inj 2         inj 3         CV, %
5 pm ProStds HsSera      AMRTs                  3142          3231          3212          1.47
                         normalized intensity   1.04 × 10^8   1.03 × 10^8   9.90 × 10^7   2.59
2 pm ProStds HsSera      AMRTs                  2382          2582          2758          7.31
                         normalized intensity   8.40 × 10^7   8.58 × 10^7   8.85 × 10^7   2.66
1 pm ProStds HsSera      AMRTs                  2383          2087          2244          6.62
                         normalized intensity   7.61 × 10^7   8.22 × 10^7   8.18 × 10^7   4.27
0.5 pm ProStds HsSera    AMRTs                  2005          2062          2106          2.46
                         normalized intensity   7.56 × 10^7   7.46 × 10^7   7.97 × 10^7   3.57
0.25 pm ProStds HsSera   AMRTs                  2012          1939          2058          3.00
                         normalized intensity   8.00 × 10^7   7.27 × 10^7   8.12 × 10^7   5.88
0.1 pm ProStds HsSera    AMRTs                  1972          2002          1923          2.03
                         normalized intensity   7.79 × 10^7   7.08 × 10^7   7.38 × 10^7   4.81

(a) The total number of AMRTs is indicated for each replicate analysis of the six human serum samples. The sum of the normalized intensity for each replicate injection is listed below each of the corresponding total AMRT values. The coefficient of variation of the extracted AMRTs and their associated normalized intensity is calculated across the replicate injections of each sample. The ion detection parameters were set up to extract those multiply charged ions (charge states between 2 and 6) which exceeded 200 counts (center by area, after deisotoping).
threshold was set to generate AMRTs which spanned 3-4 orders of magnitude dynamic range within a given sample. The MPDS proteins were spiked into the human serum at levels such that their intensities were within this window of dynamic range. By applying these threshold parameters, we were able to demonstrate the appropriate response with the ADH peptide and, therefore, continue with the analysis to characterize the remaining AMRTs. The clustering results were exported from the Expression Informatics software and imported directly into Spotfire for evaluation. With identical components clustered across the replicate injections of the six samples (dilution series), one can readily obtain response curves for each of the clustered components. Figure 3B illustrates response curves for a subset of clustered AMRTs, in which the average normalized intensity is plotted as a function of the quantity (femtomole) of spiked MPDS proteins. The bottom five plots represent an individual peptide from four of the remaining five exogenous proteins. All of these response curves have a similar slope that is indicative of the configured serial dilution. The response curves in Figure 3B correspond to extracted AMRTs that replicated in all six samples of human serum with the exogenous proteins. The AMRTs with the experimentally determined monoisotopic masses (MH+) of 1422.7261, 1529.7344, 1576.7762, 1578.8098, and 1580.8387 represent peptides from bovine hemoglobin (β-chain), bovine hemoglobin (α-chain), bovine albumin, yeast enolase, and rabbit phosphorylase B, respectively. The mass accuracies associated with these corresponding peptides are all within ±5 ppm of the theoretical tryptic peptide mass. All of the plots for the remaining AMRTs have a slope of 0 and, therefore, correspond to background serum peptides that do not change in relative concentration across the six individual samples. 
For the purposes of this illustration, the x axis corresponds to the concentration of spiked exogenous proteins. In a biomarker discovery study, the concentration dependence could easily be replaced by a time course or different perturbations, such as drug dosage or environmental conditions. The ability to display these response curves (or conditional profiles) for all matched AMRTs enables one to perform comprehensive global comparisons rather than multiple binary comparisons. Using this approach, the AMRTs can be rapidly screened and characterized on the basis of their collective behavior across the multiple conditions. Self-organizing maps (SOMs) or k-means clustering techniques can be used to associate AMRTs that exhibit the same
behavior, and by extension, may be related to the same protein, metabolic, or regulatory pathway(s).33,34 Figure 4A illustrates a diagonal plot of the log of the average normalized intensity for matched AMRTs from the 5-pmol mixture (x axis) versus the 2-pmol mixture (y axis). The data illustrate two distinct clusters of ions that span close to 4 orders of magnitude dynamic range in ion detection and comprise 2997 matched AMRT component pairs between the two conditions. The data points are colored by their respective t-test score of the normalized intensities for all replicate injections between the two conditions to illustrate where the variance between the two conditions is statistically significant. The yellow data points illustrate those matched AMRT components with a t-test score of ≤0.01 (blue data points, >0.01). The red data point highlights the AMRT described in Table 1, for the purpose of the manual analysis and comparison to the automated processing.
(33) Mirkin, B. Mathematical Classification and Clustering, Nonconvex Optimization and Its Applications; Pardalos, P., Horst, R., Eds.; Kluwer Academic Publishers: The Netherlands, 1996, Chapter 11.
(34) MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Le Cam, L. M., Neyman, J., Eds.; University of California Press: Berkeley and Los Angeles, CA; Vol 1, pp 281-297.
Figure 4. Diagonal plots of the normalized log intensity. (A) Comparison of clustered AMRTs between human serum with 5.0 pmol of exogenous protein mixture versus human serum with 2.0 pmol of exogenous protein mixture. For each matched AMRT component, the average log intensity from each condition is plotted along each of the two axes. The data are presented as obtained from the clustered data set, without applying any statistical filters. (B) Same comparison as illustrated in panel A; however, the data have been filtered using a number of the available statistical measures obtained from the clustering tool of the Expression Informatics software. The data have been filtered to show only those matched AMRTs which were found to have a coefficient of variation of the normalized intensity of ≤30% among the replicate injections (minimum two out of three replicates per condition), as well as an observed mass precision of ≤10 ppm among the replicate injections. (C) Comparison of clustered AMRTs between human serum with 5 pmol of exogenous protein mixture versus human serum with 0.1 pmol of exogenous protein mixture after applying the statistical filter described above. (D) Comparison of clustered AMRTs between human serum with 5.0 pmol of exogenous protein mixture and human serum with 1.0 pmol of exogenous protein mixture after applying the statistical filter described above. (E) Comparison of clustered AMRTs between human serum with 0.50 pmol of exogenous protein mixture and human serum with 0.25 pmol of exogenous protein mixture after applying the statistical filter described above. The data presented in all panels are colored by binned probability score (p score) from a binary Student’s t-test. Those AMRTs which had a probability score of ≤0.01 are yellow, whereas those that are >0.01 are blue. The red data point corresponds to the monoisotopic ion of 1447.8048, which originates from the VVGLSTLPEIYEK peptide of yeast ADH. 
The interpolated black line corresponds to the expected fold change for each binary comparison.
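The yellow/blue binning described in the Figure 4 caption can be illustrated with a pooled-variance Student's t statistic on the replicate intensities of one AMRT in two conditions. This is a minimal sketch under stated assumptions: it handles only three replicates per condition (4 degrees of freedom), compares |t| against the tabulated two-tailed critical value of 4.604 for p = 0.01 rather than computing an exact probability, and uses made-up intensities; it is not the Expression Informatics implementation.

```python
import math

# Two-tailed critical t value for alpha = 0.01 with 4 degrees of freedom
# (three replicate injections per condition); taken from standard t tables.
T_CRIT_DF4_P01 = 4.604


def student_t(a, b):
    """Pooled-variance (Student's) t statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    if pooled_var == 0:
        # No scatter at all: any difference in means is treated as infinite t.
        return (math.inf if mean_a != mean_b else 0.0), df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def color_for(a, b):
    """'yellow' if the difference is significant at p <= 0.01, else 'blue'."""
    t, df = student_t(a, b)
    if df != 4:
        raise ValueError("this sketch only tabulates the df = 4 critical value")
    return "yellow" if abs(t) >= T_CRIT_DF4_P01 else "blue"


# Hypothetical normalized intensities for three injections per condition.
spiked_5pmol, spiked_2pmol = [1000, 1050, 980], [400, 420, 390]
serum_5pmol, serum_2pmol = [500, 520, 480], [510, 490, 505]

print(color_for(spiked_5pmol, spiked_2pmol))
print(color_for(serum_5pmol, serum_2pmol))
```

A library routine that returns an exact p value would normally replace the critical-value lookup; the lookup is used here only to keep the example free of external dependencies.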
Figure 5. (A) A scatter plot of the average normalized intensity of the clustered AMRTs versus their corresponding coefficient of variation among the replicate injections for human serum spiked with 5 pmol of exogenous protein versus human serum spiked with 2 pmol of exogenous protein. The blue data points represent 1840 AMRTs which satisfy the statistical filters described in Figure 4B, whereas the red data points illustrate the 1157 AMRTs that were removed during the filtering process. (B) A histogram plot of the corresponding fold changes determined among the 1840 AMRTs which met the applied statistical measures.
From this analysis, it is suggested that the yellow data points represent peptides from the MPDS proteins, whereas the blue data points originate from peptides from human serum proteins. The information provided by this methodology allows one to apply user-defined thresholds to the statistical analysis performed on any of the experimental attributes relating to each AMRT cluster, as well as a minimum replication rate within and across conditions, as a means to extract the highest quality data for subsequent quantitative analysis. Figure 4B depicts 1840 (61.4%) of the matched AMRT component pairs from Figure 4A after applying a specific set of statistical thresholds to reveal the highest quality data. These statistical measurements are provided by the Expression Informatics software and are included in the corresponding output file. In this instance, the data were filtered by (1) applying a replication requirement, in which corresponding AMRTs must exist in at least two out of the three replicate injections for each condition, (2) requiring that the coefficient of variation for the normalized intensities of an AMRT be ≤30%, and (3) requiring that the mass precision of clustered AMRTs be ≤10 ppm. The AMRTs that satisfy these filters account for >90% of the total average normalized intensity found in each condition. Of the 2997 AMRTs, 724 occurred in only one out of the three replicate injections, an additional 384 AMRTs had coefficients of variation >30%, and 49 AMRTs had mass precision errors exceeding 10 ppm. This indicates that the most variable data are due to the lower intensity AMRTs, as can be seen in Figure 5A. Figure 5A depicts a scatter plot of the average normalized intensity of each
AMRT from the filtered data versus the observed coefficient of variation for the entire clustered data set. The blue data points are the subset of 1840 AMRTs which meet the statistical parameters described above. As expected, the data illustrate that the statistical filtering process had the most significant effect on the lowest intensity AMRTs, since they will be most influenced by coeluting AMRTs and will therefore tend to exhibit the highest variability (Cv). Manual inspection of the clustered output of the replicate injections of the 5-pmol condition indicated that less than 80 of the AMRTs determined to be found in only one out of three replicate injections could have been associated with an AMRT determined to have replicated in only two out of three replicate injections. In this particular example, this represents a false clustering rate of ∼2%. However, since only the AMRTs found to replicate in only one out of three injections are eliminated from the quantitative processing, the information describing these potentially discarded AMRTs is still captured in those AMRTs which occurred in two out of three injections. One of the key features of this methodology is that it is an unbiased approach. The method does not require prescreening of polypeptide pools for those peptides that contain specific amino acids. This unbiased approach produces significantly more peptide ions per protein than some other quantitative methodologies which utilize isotope-coded affinity tags. In addition, the quantitative nature of this methodology allows the user to apply statistical methods to remove polypeptide ions (AMRTs) that exhibit questionable reproducibility from further consideration without jeopardizing the ability to find lower level changes. Figure 5B depicts a histogram plot of the observed fold change for the 1840 filtered AMRTs. The data presented illustrate two Gaussian distributions about the x axis which are centered at values of 1.0 and 2.5. 
These values correlate with the predicted results for the serum-related peptides (no change) and the spiked exogenous peptides (2.5-fold change).
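The three filters applied to the clustered data (replication in at least two of three injections per condition, intensity Cv ≤ 30%, mass precision ≤ 10 ppm) and the subsequent fold-change calculation can be sketched as follows. The data layout, the function names, and the definition of mass precision as the standard deviation of the replicate masses relative to their mean are all assumptions made for illustration; the replicate values are hypothetical.

```python
import statistics

MIN_REPLICATES = 2      # at least two of three injections per condition
MAX_CV_PERCENT = 30.0   # coefficient of variation of normalized intensity
MAX_MASS_PPM = 10.0     # mass precision among the replicate injections


def cv_percent(values):
    """Coefficient of variation (sample standard deviation / mean) in percent."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def mass_precision_ppm(masses):
    """Spread of replicate masses in ppm (assumed: stdev relative to mean)."""
    return statistics.stdev(masses) / statistics.mean(masses) * 1e6


def passes_filters(cluster):
    """cluster maps a condition label to a list of (mass, intensity) replicates."""
    all_masses = []
    for replicates in cluster.values():
        if len(replicates) < MIN_REPLICATES:
            return False
        intensities = [intensity for _, intensity in replicates]
        if cv_percent(intensities) > MAX_CV_PERCENT:
            return False
        all_masses.extend(mass for mass, _ in replicates)
    return mass_precision_ppm(all_masses) <= MAX_MASS_PPM


def fold_change(cluster, numerator, denominator):
    """Ratio of the mean replicate intensities of two conditions."""
    mean_num = statistics.mean(i for _, i in cluster[numerator])
    mean_den = statistics.mean(i for _, i in cluster[denominator])
    return mean_num / mean_den


# Hypothetical cluster for one spiked peptide in the 5- and 2-pmol samples.
spiked = {
    "5 pmol": [(1447.8112, 545000), (1447.8120, 560000), (1447.8105, 530000)],
    "2 pmol": [(1447.8150, 205000), (1447.8140, 210000), (1447.8145, 200000)],
}
# Hypothetical cluster seen in only one injection of the 2-pmol sample.
poorly_replicated = {
    "5 pmol": [(902.4511, 950), (902.4530, 700)],
    "2 pmol": [(902.4522, 820)],
}

print(passes_filters(spiked), round(fold_change(spiked, "5 pmol", "2 pmol"), 2))
print(passes_filters(poorly_replicated))
```

Collecting the fold changes of all clusters that pass the filters and binning them would give a histogram of the kind shown in Figure 5B.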
Figure 4C,D presents two additional diagonal plots of the log average normalized intensity of the 5-pmol mixture versus the 100-fmol and 1-pmol mixtures, respectively. The results from Figure 4C begin to test the limits of this methodology. At 100 fmol of spiked MPDS protein, we are approaching the limit of detection for the 300-µm scale chromatography selected for this series of experiments. This can manifest itself in the results by attenuating the expected fold change, producing more scatter between the upper and lower limits of the expected fold change. In addition, it should be noted that there are a number of peptides from the exogenous MPDS proteins that are chemically identical to peptides from a subset of the human serum proteins. Among these are human serum albumin and human hemoglobin. These chemically identical peptides will show an attenuated fold change as a function of their relative abundance over that of the endogenous peptide. Figure 4E illustrates the 250-fmol mixture versus the 100-fmol mixture. These plots illustrate two distinct ion distributions of AMRTs, which correlate with the relative concentration change of the MPDS proteins between the two samples as well as those unaffected human serum proteins. The blue data points represent those AMRTs that do not show any relative change with statistical significance between the two conditions (human serum proteins); the yellow data points represent those peptide components that do exhibit statistically significant changes between the two conditions (MPDS proteins). To confirm the quantitative results illustrated in Figure 4A-E, we performed a simple peptide mass fingerprinting search using the average mass measurement of each AMRT that was found in at least two out of three replicate injections from all six conditions with a t-test probability score of ≤0.01 (67 AMRTs in all). 
We searched a Swissprot database of over 200 000 entries at 5 ppm mass accuracy with no missed cleavages and required a minimum of four peptides to match. The search results accounted for 59 of the 67 total AMRTs. The 59 AMRTs identified 47 proteins by peptide mass fingerprint, which included the 5 spiked-in proteins (MPDS proteins) as well as 37 isoforms of the MPDS proteins from different species, including 23 different isoforms of glycogen phosphorylase. The final five identifications were examples of very high molecular weight proteins (>120 kDa) which have tryptic peptides with monoisotopic masses in common with the MPDS proteins. The level of redundancy is not surprising, since the search was performed using a non-species-specific database. In a true biomarker discovery experiment, the peptide mass fingerprint would most likely be restricted to a nonredundant database of a specific organism to reduce the number of isoforms one may obtain from the homology/identity found in a cross-species database. If we had spiked the proteins in at different concentrations, we could have used the quantitative fold change of the AMRTs as an additional filter or scoring mechanism to eliminate the wrongfully assigned high molecular weight protein assignments. We also suggest that the use of accurate mass in conjunction with fold change is a powerful strategy for MS-based protein identification. Since enzymatically digested proteins typically produce many peptides and this methodology does not limit the number of observed peptides per protein through the use of any type of affinity capture enrichment protocol, proteins which exhibit a relative fold change in expression will produce a number of
AMRTs (tryptic peptides) that will exhibit the same change in expression within some reasonable tolerance. It is suggested that the use of accurate mass in conjunction with the quantitative fold change provides additional specificity to allow rapid screening of complex protein mixtures for targeted proteins of interest which exhibit a change in relative abundance. In instances for which further validation is needed, the user has the ability to construct a targeted include list for subsequent MS/MS analysis from the accurate mass and retention times (AMRTs) obtained from the LC/MS acquisition. However, the parallel LC/MS and LC/MSE strategy implemented for this analysis contains not only the precursor ion information but also the associated fragment ion information from all the observed precursors and allows one to identify the precursor ions without having to perform the targeted MS/MS experiment.31 Low-energy precursor data are collected into function 1, while the associated elevated-energy data are collected into the second function. The low-energy precursor ions are associated with their corresponding high-energy fragment ions using the obtained chromatographic attributes. In this type of experiment, the software uses both the low- and elevated-energy data for qualitative assignment.20 The data presented in this manuscript illustrate that the Expression Informatics software is capable of reducing large sets of LC/MS analyses from complex protein mixtures to a simple list of AMRT components that have undergone a change in relative abundance due to the applied perturbation. These capabilities are provided for by the use of the ion detection, clustering, and quantitative functionalities. Having the ability to reduce these complex protein mixtures to a simple list of AMRT components greatly simplifies the problem of properly identifying the proteins affected by the applied perturbation. 
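The accurate-mass matching that underlies a peptide mass fingerprint search can be sketched as an in silico tryptic digest (no missed cleavages) followed by a ppm-tolerance comparison of theoretical MH+ values with the measured AMRT masses. The example is a simplified illustration: the toy protein sequence is hypothetical, cysteine modifications are ignored, and the residue masses are standard monoisotopic values rather than anything taken from the search engine used in this work.

```python
# Standard monoisotopic residue masses (Da); cysteine is unmodified here.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(protein):
    """Cleave after K or R (not before P), allowing no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def monoisotopic_mh(peptide):
    """Theoretical monoisotopic mass of the singly protonated peptide (MH+)."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def match_amrts(amrt_masses, protein, tolerance_ppm=5.0):
    """Return (peptide, measured mass, ppm error) for every match in tolerance."""
    hits = []
    for peptide in tryptic_peptides(protein):
        theoretical = monoisotopic_mh(peptide)
        for measured in amrt_masses:
            ppm = (measured - theoretical) / theoretical * 1e6
            if abs(ppm) <= tolerance_ppm:
                hits.append((peptide, measured, round(ppm, 1)))
    return hits


# Hypothetical toy sequence containing the yeast ADH peptide discussed above.
toy_protein = "AAKVVGLSTLPEIYEKGR"
# Two of the automated MH+ measurements listed in Table 1.
measured_masses = [1447.8112, 1447.8151]

for hit in match_amrts(measured_masses, toy_protein):
    print(hit)
```

In a real search, the same comparison would be run against every entry of the sequence database, and a protein would be reported only when a minimum number of its peptides matched (four in the search described above).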
In many cases, a subsequent protein identification from such complex protein mixtures can be ascertained from a simple peptide mass fingerprint of the specific AMRTs within a given fold change window. To illustrate this powerful capability, we conducted a PMF search with only those AMRTs present in at least two out of the three replicate injections for all conditions (5000-100 fmol MPDS proteins), with Cv’s of the associated replicating intensities of under 30%, with a mass precision of under 10 ppm, and illustrating a fold change with a t-test score of