Comprehensive Label-Free Method for the Relative Quantification of

Jun 24, 2005 - Pharmaceutical companies and regulatory agencies are broadly pursuing biomarkers as a means to increase the productivity of drug develo...
11 downloads 7 Views 630KB Size
Comprehensive Label-Free Method for the Relative Quantification of Proteins from Biological Samples Richard E. Higgs,* Michael D. Knierman, Valentina Gelfanova, Jon P. Butler, and John E. Hale Lilly Research Laboratories, MS 1533, Lilly Corporate Center, Indianapolis, Indiana, 46285 Received April 19, 2005

Pharmaceutical companies and regulatory agencies are broadly pursuing biomarkers as a means to increase the productivity of drug development. Quantifying differential levels of proteins from complex biological samples such as plasma or cerebrospinal fluid is one specific approach being used to identify markers of drug action, efficacy, toxicity, etc. We have developed a comprehensive, fully automated, and label-free approach to relative protein quantification from LC-MS/MS experiments of proteolytic protein digests including: de-noising, mass and charge state estimation, chromatographic alignment, and peptide quantification via integration of extracted ion chromatograms. Results from a variance components study of the entire method indicate that most of the variability is attributable to the LCMS injection, with a median peptide LC-MS injection coefficient of variation of 8% on a ThermoFinnigan LTQ mass spectrometer. Spiked recovery results suggest a quantifiable range of approximately 32fold for a sample protein. Keywords: relative quantification • biomarkers • proteomics • LC-MS/MS • differential expression

Introduction Advances in protein analytical technologies are finding broad applications in drug discovery programs, particularly in the area of biomarker discovery. Biomarkers are indicators of biological processes and include polynucleotides, proteins, and small molecule metabolites. Biomarkers of drug efficacy and toxicity have the potential to speed the process of drug development as they may provide indications of drug action at earlier stages than clinical endpoints. This could potentially shorten the length and reduce the cost of clinical trials and is therefore receiving increased attention from pharmaceutical and government agencies alike.1,2 Mass spectrometry is being widely applied to identify and quantify proteins in complex mixtures.3-5 Quantification of small molecules by integration of LC-MS extracted ion chromatogram (XIC) peaks has a long history in analytical chemistry. Similar quantification techniques applied to proteolytic protein digests have also been previously described.6 Recent developments in quantification using mass spectral techniques provide examples of how this technology may be utilized to identify biomarker candidates in complex samples. Several of these methods rely on the modification of peptides with reagents enriched in stable isotopes to introduce mass shifts in peptides which may then be used to quantify relative changes between two samples.7-9 The number of biological samples that need to be analyzed to detect statistically relevant protein changes, the complexity in the design and analysis of pairing study samples, and the resultant expense of these techniques limit their application for global sample analysis. Techniques have been reported for biomarker discovery using relative quantification of samples by compari* To whom correspondence should be addressed. E-mail: [email protected].

1442

Journal of Proteome Research 2005, 4, 1442-1450

Published on Web 06/24/2005

son of extracted ion chromatograms from mass spectra10,11 with several groups reporting similar methods based on this approach.12 The sheer volume of data collected during LC-MS/ MS analysis of complex protein mixtures requires that data analysis of these spectra be automated. We have developed a comprehensive analytical system to process the data from global LC-MS/MS analyses of complex protein mixtures. In contrast to other pattern-based,13,14 difference-based,15 or identification-based16,17 quantification methods, we have chosen to quantify peptides directly by integrating the ion current for the peak associated with a peptide. Our strategy is to produce a rectangular table with each row of the table representing a biological sample and each column representing a relative peptide quantity. With this table in hand, familiar statistical analysis techniques such as transformation, normalization, analysis of variance, etc. may be directly applied in a flexible framework supportive of arbitrary experimental designs (e.g., repeated measures, variance components analysis, etc.) and analyses. Our interest in statistical inference relative to a population of patients or pre-clinical animals motivated a highthroughput computational implementation that makes the analysis of hundreds of samples from a biomarker study practical. Our process was designed to reduce the number of false positive peptide identifications by using spectral quality evaluation and noise filtering. Variability of estimated peptide peak areas was minimized by integrating small, fixed, retention time windows of peptide XICs following chromatographic alignment (Figure 1). The method has been applied to double and triple-play LC-MS/MS data generated from LCQ, LTQ, and LTQ/FT mass spectrometers (ThermoFinnigan). Application of this technology to biological fluids may afford the ability to identify and measure protein changes without a priori knowl10.1021/pr050109b CCC: $30.25

 2005 American Chemical Society

Label-Free Relative Quantification of Proteins

Figure 1. Flow scheme for relative protein quantification.

edge of specific protein changes occurring in response to a particular biological perturbation.

Materials and Methods Serum Sample Preparation. Aliquots (50 µL) of rat serum were diluted with Montage equilibration buffer (Montage Albumin Deplete Kit, Millipore) to a volume of 200 µL. 100 µL of a 50% Protein G Sepharose bead suspension were added (Amersham Biosciences) and the mixture was rocked for 1 h at RT. Protein G Sepharose beads were pelleted at 2000 rpm for 2 min and 200 µL of the effluent was transferred to a preequilibrated Montage column (Montage Albumin Deplete Kit, Millipore). Pre-equilibration was performed via manufacturer specifications. The Montage column was then centrifuged at 500 × g for 2 min and the flow-thru was reapplied to the column and centrifuged again. Two consecutive 200 µL washes of Montage wash buffer (Montage Albumin Deplete Kit, Millipore) were passed over the column via 500 × g centrifugation for 2 min (final volume approximately 600 µL). An aliquot (120 µL) of the diluted and depleted rat serum was spiked with chicken lysozyme (approximately 1% of the initial total protein concentration). An equal volume (120 µL) of 8 M urea in 100 mM ammonium carbonate pH 11.0 was added to the spiked and depleted rat serum and an equal volume (240 µL) of reduction/alkylation cocktail (2% iodoethanol, and 0.5% triethylphosphine in acetonitrile) was then added. The solutions were capped and incubated for 1 h at 37 °C after which the solutions were speed vacuumed to dryness, at least 3 h. The serum pellet was then redissolved in 600 µL of a trypsin solution (Worthington TPCK treated, in 100 mM ammonium bicarbonate pH 8.0) to produce a 1.6 M urea solution and an enzyme: substrate ratio of 1:50 (w/w), assuming a 50% reduction in protein concentration after IgG/albumin depletion. The digestion was carried out at 37 °C overnight.18 Typically 10-20 µL of this digest was injected onto the mass spectrometer.

research articles Validation Study Sample Preparation. Study #1: Variance Components Analysis. Three aliquots of rat serum from each of five animals were passed over a separate Montage/Protein G column to reduce the albumin and immunoglobulin concentrations. After dilution with a urea buffer containing chicken lysozyme, each sample was divided into two aliquots for reduction, alkylation, and tryptic digestion as described above. Each tryptic digest (10 µL, approximately 0.167 µL of serum, 8.35 µg of starting total protein) was injected twice on an LCQ mass spectrometer in a random order. Study #2: Soybean Trypsin Inhibitor Titration. Eighteen aliquots of rat serum following the reduction of albumin and immunoglobulin concentrations were spiked with various (78 ng to 20 µg) amounts of soybean trypsin inhibitor (Sigma). Aliquots were reduced, alkylated, and digested as described above. Each tryptic digest (10 µL, approximately 0.141 µL of serum, 9.3 µg of starting total protein) was injected twice on an LCQ mass spectrometer in a random order. Study #3: Repeated LC-MS Injections on an LTQ Mass Spectrometer. A pool of digested rat serum (prepared as described above) was injected in twelve replicates at 20 µL each into an LTQ mass spectrometer. HPLC Conditions. A Surveyor autosampler and MS HPLC pump (ThermoFinnigan) were used for separation. Tryptic digests (10 µL) were injected onto a Zorbax SB300 1 × 50 mm C-18 reversed phase column (Agilent) at a flow rate of 50 µL/ min. The gradient conditions were 10-95% B (90-5% A) over 120 min, followed by a 0.1 min ramp to 100% C, followed by 5 min at 100% C, followed by a 0.1 min ramp to 10% B (90% A), and hold for 17 min at 10% B (90% A), where A was 0.1% formic acid in water, B was 50% acetonitrile 0.1% formic acid in water, and C was 80% acetonitrile 0.1% formic acid in water. The water and acetonitrile were Burdick and Jackson HPLC grade. The formic acid was from Aldrich. The effluent was diverted to waste for the first 2 min to keep the source clean. Between each sample in the set an injection of water is made and a shortened (60 min) gradient is performed to reduce carryover. LCQ Deca XP+ Conditions. The total column effluent (50 µL/min) was connected to the electrospray interface of an LCQ Deca XP+ ion trap mass spectrometer (ThermoFinnigan). The source was operated in positive ion mode with 4.0 kV electrospray potential, a sheath gas flow of 20 arbitrary units, and a capillary temperature of 150 °C. The source lenses were set by maximizing the ion current for the 2+ charge state of angiotensin. Data were collected in the triple play mode with the following parameters: centroid parent scan was set to 3 microscans and 200 ms maximum injection time, profile zoom scan was set to 3 microscans and 200 ms maximum injection time, and a centroid MS/MS scan was set to 3 microscans and 400 ms maximum injection time. Dynamic exclusion settings were set to a repeat count of one, exclusion list duration of two minutes, and rejection widths of -0.75 m/z and +2.0 m/z. Collisional activation was carried out at a relative collision energy of 35% and an exclusion width of 4 m/z. LTQ Conditions. The total column effluent (50 µL/min) was connected to the electrospray interface of an LTQ ion trap mass spectrometer (ThermoFinnigan). The source was operated in positive ion mode with 4.8 kV electrospray potential, a sheath gas flow of 20 arbitrary units, and a capillary temperature of 225 °C. The source lenses were set by maximizing the ion current for the 2+ charge state of angiotensin. Data were collected in the triple play mode with the following parameters: centroid parent scan was set to 1 microscan and 50 ms Journal of Proteome Research • Vol. 4, No. 4, 2005 1443

research articles

Higgs et al.

maximum injection time, profile zoom scan was set to 3 microscans and 500 ms maximum injection time, and a centroid MS/MS scan was set to 2 microscans and 2000 ms maximum injection time. Dynamic exclusion settings were set to a repeat count of one, exclusion list duration of two minutes, and rejection widths of -0.75 m/z and +2.0 m/z. Collisional activation was carried out at a relative collision energy of 35% and an exclusion width of 3 m/z. Zoom Scan Assessment. The data collected from a zoom scan triple-play experiment or from a high-resolution parent mass scan double-play experiment are used to estimate the quality of subsequent MS/MS spectra, the charge state of the peptide, and the monoisotopic and average mass of the peptide. The quality estimate is used to eliminate those scan events that were triggered off of noise or small molecules from further downstream processing. Peptide mass and charge state estimates are used in subsequent steps for peptide identification. Eliminating low quality scan events and more accurately estimating the charge state and mass of peptides ultimately reduces the number of false positives that must be dealt with at the peptide identification stages. Zoom scan assessment begins by hypothesizing a possible charge state of 1+, 2+, 3+, or 4+ for a peptide. Given the m/z of the scan event and a hypothesized charge state, it is possible to estimate the theoretical isotope distribution for a peptide of the hypothesized mass. Theoretical isotope relative heights were derived from an analysis of 15493 peptides ranging in length from 2 to 38 residues with a median length of 13. We have denoted the intensities of the four isotope peaks I0 for the monoisotopic peak, I1 for the +1 13C isotopic peak, I2 for the +2 13C isotopic peak, and I3 for the +3 13C isotopic peak. The 15,493 example peptides were then used to derive relationships for I0/max(I0, I1, I2, I3), I1/I0, I2/I0, and I3/I0 as functions of the peptide monoisotopic molecular weight (Figure 2). The detected peptide charge state and mass are estimated from the best match between the observed zoom scan spectrum and the theoretically derived spectrum for the possible charge states of 1+, 2+, 3+, and 4+. The agreement between the measured zoom scan and theoretical zoom scan spectra is based on a matched filter (convolution) operation.19 A cross-correlation coefficient is used as an intensity independent matching score between the measured and theoretical spectra. For data generated on an LCQ device we have found that modeling space charge effects (compression of the isotope peak spacing) within the theoretically derived isotope pattern is effective for improving the charge state estimates. MS/MS Filtering. To reduce the effect of MS/MS noise peaks on the false positive identification of peptides, a dynamic MS/ MS noise level is estimated for each spectrum. This noise level estimate is then subtracted from all MS/MS peak intensities with any resulting differences less than zero set to zero. The spectral noise level is estimated based on the observation that ideal MS/MS spectra of peptides have relatively few peaks (e.g., y-ions, b-ions, adducts, etc.) in a theoretical or high signal-tonoise ratio spectrum, while noisy MS/MS spectra typically have a high density of peaks within a local m/z neighborhood. Therefore, the filtering scheme developed uses a percentile of the peak intensities within a local m/z neighborhood as the noise estimate, where the percentile used is based on the density of peaks in the neighborhoodsa higher peak density results in a higher percentile to estimate the local noise level and a lower peak density results in a lower percentile to estimate the local noise level. A sigmoidal function (Figure 3) 1444

Journal of Proteome Research • Vol. 4, No. 4, 2005

Figure 2. Empirically derived relationships (from 15493 example peptides) between isotope peak intensities used to estimate the theoretical isotope pattern for a peptide (a) I0/max(I0,I1,I2,I3), nonlinear least-squares fit: I0/max (I0,I1,I2,I3) ) 1 if MW < 1800, e-0.00132+(MW-1800)0.0865 if MW g 1800, (b) I1/I0, linear least-squares fit: I1/I0 ) -0.00498 + 0.000560MW, (c) I2/I0, linear leastsquares fit: I2/I0 ) -0.367 + 0.000516MW + 1.59 × 10-7(MW 1527.34)2, and (d) I3/I0, nonlinear least-squares fit: I3/I0 ) 0.000605e0.00251MW-2.70×10-7MW2.

Figure 3. Filtering percentile as a function of local MS/MS peak density. Peak density is defined as the number of MS/MS peaks in a 40 m/z window divided by 40. Filtering Percentile ) 0.75 if Peak Density e0.1, (1 + e0.15-Peak Density/0.05) if Peak Density > 0.1.

to transform local MS/MS peak density to a percentile used for filtering was empirically derived by adjusting the parameters of the function and visually examining the resulting noise estimate for many MS/MS spectra by two of the authors. The final MS/MS noise level is then estimated by a Gaussian kernel smooth (150 m/z bandwidth) of two hundred equally spaced peak intensity percentiles across the full MS/MS mass range. Peptide Identification. Following the zoom-scan assessment to eliminate low quality data, estimation of peptide mass and charge state, and MS/MS filtering to remove noise peaks,

Label-Free Relative Quantification of Proteins

peptide identification is performed. Peptide identification is an active research topic with many approaches (e.g., Sequest,20 X! Tandem,21,22 MASCOT,23 etc.) under investigation. The focus of this article is quantification and not peptide identification, so no effort will be made to review the numerous approaches to the challenging identification problem. Our computational architecture is flexible and can accommodate any UNIX-based executable within our grid-computing environment. We have used Sequest, X! Tandem, and a machine learning based approach that combines output from both Sequest and X! Tandem as well as other sources (e.g., peptide retention modeling) for peptide identification in a manner similar to that described by Nesvizhskii et al.24 In the work reported here, we will restrict the examples to cases in which Sequest was used exclusively for peptide identification in a sequential searching strategy. The sequential searching strategy begins by searching a pre-filter database that contains proteins not of interest for our studies (e.g., albumin, keratin, IgGs, etc.) with trypsin only specification. Spectra not matching this initial search are then searched against the International Protein Index25 (IPI) and nonredundant26 (NR) databases with trypsin only specifications. Spectra not matching these databases are then sequentially searched again using the pre-filter, IPI, and NR databases with no enzyme specificity. Matches of an MS/MS spectrum to a database are user-definable and for the results presented here we used rules similar to those suggested by Peng et al. based on Sequest output.27 Chromatographic Alignment. Variability in the abundance of individual peptides between different samples may result in that peptide triggering an MS/MS scan in one sample and not in another. The area of this peptide may still be extracted from each sample, however it requires high quality chromatographic alignment between samples so that a consistent region the chromatogram is used for the extraction and integration of the ion current. Large biomarker studies can produce chromatographic shifts greater than one minute between pairs of samples (data not shown). Simply expanding the integration window by one or two minutes to account for chromatographic variability is not an option in our experience analyzing complex samples containing multiple coeluting peaks at most XIC masses. An expanded integration window that includes multiple peaks masks the quantification of individual peptides, produces results that are confounded with multiple peptides contributing to a value, and increases variability. We have found a simple pairwise alignment between all samples in a study and a selected (reference) sample in the study to work well for numerous biomarker discovery projects. This approach to alignment is founded on the following assumptions: (a) the samples included in the study are generally quite similar to each other with respect to their peptide content (i.e., there are many peptides in common between the samples), (b) the same chromatographic conditions are used for each sample in the study, and (c) in a local region of retention time, the retention time offset between any two samples is approximately constant. Given these assumptions, our approach proceeds by identifying matching landmark peptides between two samples, where a landmark match is declared if the following conditions are met: (1) the retention time of the triple play event between the samples is within a user-specified amount (e.g., 5 min), (2) the charge state of the peptide matches, (3) the m/z value of the monoisotopic peak from the zoom scans is within a userspecified amount (e.g., 0.7 Da) between the two samples, (4) the zoom scan cross-correlation coefficient of both peptides

research articles

Figure 4. Example MS/MS spectra and their estimated noise levels. 443 original peaks reduced to 118 peaks above estimated noise level in high noise spectrum (a). 589 original peaks reduced to 173 peaks above estimated noise level in lower noise spectrum (b).

Figure 5. Chromatographic alignment function between two samples included in the soybean trypsin inhibitor spiked recovery experiment. Retention time shift (minutes) vs retention time (minutes) for 462 landmark peptides are plotted with the associated loess fit.

to their respective theoretical isotope patterns exceeds a threshold (e.g., 0.65), and (5) the similarity between the corresponding MS/MS spectra exceeds a threshold (e.g., 0.75). The MS/MS similarity metric has been implemented as a correlation coefficient between two spectra following a convolution of each MS/MS stick-spectrum with a Gaussian peak shape. The optimal shift time required to align two landmark peptides is then estimated by the shift that yields the maximum overlap (estimated by convolution) between the XIC of the peptide in each sample. A smooth curve is then fit to the shift and retention time pairs using a weighted local regression technique, loess,28 where each observation is weighted by the cross-correlation coefficient between the landmark XICs at their optimal time shift value (Figure 5). The loess warping function for a sample is then applied to all of the peaks in the Journal of Proteome Research • Vol. 4, No. 4, 2005 1445

research articles

Higgs et al.

chromatogram (landmark or not). Thus, all samples in a study are projected onto the same retention time scale. The warping function between two samples is generally not monotonic over the entire retention time range and no restriction on overall monotonicity is used in our estimate of the warping function. We do however preserve the overall rank order of the retention times following alignment by constraining the bandwidth used in the loess fitting. This pairwise alignment strategy is clearly dependent on the sample chosen as the chromatographic alignment reference. In practice, we have found the pairwise approach to be generally robust to the sample chosen as the alignment reference and have therefore not pursued a global alignment approach. Additionally, a visual examination of the alignment warping functions for all samples included in a study has proven to be an effective means to detect and diagnose chromatography problems encountered in the analysis of dozens of study samples. For example, oscillatory warping functions have been associated with pump mixing problems while large magnitude linear warping functions have been associated with column degradation (data not shown). Peptide Quantification. Once all samples in a study have been chromatographically aligned to a reference, the relative quantification of peptides is carried out by integration of the XIC peak from the primary mass spectrum within each sample. A list of peptides to integrate within each sample is constructed by grouping together all triple-play events across all of the samples. This grouping can be done with or without the use of peptide identification. For the results presented here, we will restrict the examples to cases where we have used the peptide identification to group triple-play events. The same approach and software works for nonidentification based grouping by using the matching rules for alignment landmarks previously described. For each triple-play event in a group, the corresponding peptide XIC is generated in a local retention time region and the peak centroid (intensity weighted average of m/z values) is estimated. A fixed integration window for that peptide is then defined by the mean centroid retention time across all samples in the group ( a fixed retention time window (30 s. for our chromatography). Numerical integration (trapezoid rule) is then performed on the aligned XIC from all samples in the study in a fixed retention time region with optional smoothing and local linear baseline subtraction of the XIC prior to numeric integration. Example XICs and the derived integration regions using this procedure from eight validation study #1 rat serum samples indicates good overall quantification of the peptide (Figure 6). The importance of chromatographic alignment is apparent when this procedure is applied to un-aligned data (Figure 7). Following the integration of peptide-specific XIC peaks, the usual operations of transformation and normalization may be applied prior to any statistical analysis. Depending on the sample set being analyzed, normalization can be particularly important for minimizing systematic biases in ion current introduced by sample handling, sample concentration, instrument sensitivity drift during the course of data acquisition, etc. The spiked internal standard, chicken lysozyme, can be helpful in diagnosing and monitoring ion intensities before and after normalization. Postprocessing steps used for the validation studies described here include: (i) log2 transformation of the peptide peak areas, (ii) quantile normalization29 of the log2 peptide peak areas, and (iii) exponentiation of the quantile normalized log2 peptide peak areas to arrive at a normalized peptide quantity on the original 1446

Journal of Proteome Research • Vol. 4, No. 4, 2005

Figure 6. XICs from the 2+ R-1 macroglobulin peptide ATPLSLCALTAVDQSVLLLKPEAK for eight validation study #1 rat serum samples when chromatographic alignment is applied. Note that the peak from all samples fits within the highlighted [83.2, 84.2] integration region.

data scale. Relative protein levels are estimated by exponentiating the average of the normalized log2 peptide values associated with a protein. Computational Implementation. The previously described computational steps have all been implemented in a UNIXbased queuing environment in order to handle the throughput required from large proteomic biomarker discovery studies and to efficiently leverage the capability of commodity hardware arrayed as a grid of Linux compute clusters. The Perl scripting language was used for basic job control and input/output data

research articles

Label-Free Relative Quantification of Proteins

ing, and peptide identification) is specified by a user-defined XML configuration file describing which operations are to be performed, specific grid resources, priorities, and sequencing dependencies. A UNIX cron job monitors designated directories for ThermoFinnigan raw data files and a corresponding XML pipeline configuration file and automatically submits all of the pipeline jobs to the queue.

Results and Discussion We have used the bioanalytical and statistical methods previously described to perform relative quantification of peptides and proteins from complex biological samples for biomarker discovery. To characterize the entire quantification process we now describe three validation studies to estimate sources of variation, quantification, and translation from the LCQ platform to the LTQ platform.

Figure 7. XICs from the 2+ R-1 macroglobulin peptide ATPLSLCALTAVDQSVLLLKPEAK for eight validation study #1 rat serum samples when no chromatographic alignment is applied. Note that the peak from several samples is significantly misaligned with the highlighted [83.2, 84.2] integration region.

management of individual programs while the R computing environment30 was used for quantitative/statistical work. Individual programs were generally implemented with a Perl ‘wrapper’ using calls to R for the numerically intensive parts of the algorithms. All application code was developed to work in an integrated fashion within the Sun Grid Engine Environment31 (SGE). SGE provides a generic and flexible queue and resource management system that matches queue submissions with a grid of compute clusters. The real-time data processing pipeline (text extraction, zoom scan assessment, MS/MS filter-

Study #1: Variance Components Analysis. A sixty-sample (5 × 3 × 2 × 2) variance components study using rat sera (Lewis inbred male rats, 6-7 weeks old) was designed to assess the relative sources of variation attributable to animal-to-animal biological variability, technical variability attributable to abundant protein depletion, technical variability attributable to reduction, alkylation, and digestion, and technical variability attributable to the LC-MS injection. Rat sera from five animals were split into three aliquots for abundant protein removal. Following abundant protein removal, samples were split into two aliquots for reduction, alkylation, and digestion. Two injections of the digested material were made in a random order. Following alignment and quantification, log2 peptide peak areas were quantile normalized and exponentiated to obtain a normalized value on the original peak area scale. High confidence peptide identifications, defined by rules similar to those described by Peng et al.,27 were applied prior to statistical analysis. A random effects model containing the terms ANIMAL, DEPLETION[ANIMAL], RAD[ANIMAL*DEPLETION], and residual (INJECTION) was then fit for each high confidence peptide using the SAS VARCOMP procedure (version 8.02)32 with the restricted maximum likelihood (REML) method to estimate the relative contributions of animal variability, abundant protein depletion variability, reduction/alkylation/digestion variability, and LC-MS injection variability, respectively. Since each peptide has its own variance components partitioning, we examine the histogram over all peptides of total variance proportions attributable to each variance component. Visual examination of these histograms indicates that for the majority of peptides, the LC-MS injection is by far the dominant source of variability (Figure 8). Plotting the normalized peak areas of a prototypical peptide from this study demonstrates the overall results observed in this analysis of variance components (Figure 9). Study #2: Soybean Trypsin Inhibitor Titration. A thirtysix sample spiked recovery study using rat sera with spiked levels of soybean trypsin inhibitor (SBTI) was designed to assess the quantitative response of the bioanalytical and statistical methods. Eighteen aliquots from a pooled rat serum sample were spiked in duplicate with varying levels of SBTI (20 µg to 78 ng, in nine 2-fold dilutions), reduced, alkylated, and digested. Each digested sample was injected twice into the LC-MS system in random order. Following alignment and quantification, log2 peptide peak areas were quantile normalized and exponentiated to obtain a normalized value on the original peak area scale. Relative SBTI levels were estimated by exponentiJournal of Proteome Research • Vol. 4, No. 4, 2005 1447

research articles

Figure 8. Distribution over 1,184 peptides for (a) the proportion of total variance attributed to the animal effect, (b) abundant protein depletion effect, (c) reduction/alkylation/digestion effect, and (d) the LC-MS injection effect. Injection variability is the dominant contributor to total measurement error for most peptides.

Higgs et al.

Figure 9. Normalized peptide peak areas for the 2+ C-reactive protein peptide TSFNEILLFWTR. Three abundant protein removals denoted by D1, D2, and D3 for each of three animals are plotted (O and × represent reduction, alkylation, digestion #1 and #2, respectively). This is a prototypical peptide with an overall CV of 14%, 74% of total variability attributed to the LC-MS injection and 26% of total variability attributed to reduction, alkylation, and digestion. Visually, most variability appears to be coming from the LC-MS injection.

ating the average normalized log2 peptide peak areas for the high confidence SBTI peptide identifications using rules similar to those described by Peng et al.27 A four-parameter logistic function was fit to the estimated relative SBTI levels from the digestion #1 data and the log10 SBTI spiked concentration using the nonlinear regression platform within the JMP statistical software,33 (Figure 10a). Percent recovery values were estimated from the digestion #2 data by back-calculating an estimated SBTI concentration from the parameters of the four-parameter logistic (Figure 10b). While this is somewhat of an optimistic estimate of quantitative range since the same SBTI levels were used in the standard curve and the recovery estimation, it is nonetheless suggestive of a quantitative response, defined by recovery between 80 and 120%, for a 32-fold range (625 ng to 20 µg) in SBTI values. Study #3: Repeated LC-MS Injections on an LTQ Mass Spectrometer. The bioanalytical and statistical methods described were originally developed using data generated by a ThermoFinnigan LCQ mass spectrometer. Our laboratory has since upgraded to the LTQ instrument from ThermoFinnigan. The software described is compatible with LCQ or LTQ generated data without modification. Since most of the technical error in peptide quantification is attributed to the injection, we assessed the technical error from LC-MS injections with a small study in which we made twelve injections from a single rat serum digest sample. Following chromatographic alignment and peptide quantification, log2 peptide peak areas were quantile normalized and exponentiated to obtain a normalized value on the original peak area scale. High confidence peptide identifications, defined by rules similar to those described by 1448

Journal of Proteome Research • Vol. 4, No. 4, 2005

Figure 10. Four parameter logistic standard curve estimated from two LC-MS injections of one digestion of a soybean trypsin inhibitor (SBTI) titration on the log10 concentration scale (a). Percent recovery of SBTI from two LC-MS injections of a second digestion of the SBTI titration samples on the log10 concentration scale, with the 80-120% recovery region highlighted in gray (b).

Peng et al.,27 were applied prior to statistical analysis. The sample mean and variance were estimated for each high confidence peptide identification. An approximately linear

research articles

Label-Free Relative Quantification of Proteins

Conclusions A comprehensive LC-MS/MS data analysis process has been described. The process does not require any isotopic labeling of biological samples, is fully automated, and is implemented using a high-throughput computational environment, making large (dozens to hundreds) of biological sample studies practical. The objective of the method is to produce a rectangular table with relative peptide estimates for all of the biological samples in a studyseach peptide is quantified in each sample, irrespective of whether a triple-play event for a peptide occurred in each sample. Highly reproducible peptide peak areas are achieved by integrating peptide XICs in a local retention window following chromatographic alignment. A characterization of the method indicates that the majority of measurement error is occurring within the LC-MS injection itself. For an LTQ mass spectrometer the median CV of repeated injections for normalized peptide peak areas is 8%. A spiked recovery experiment with SBTI suggests that the method can be quantitative over an approximately 32-fold range. With the throughput and precision of the method, it appears that relatively small (20%) changes in protein relative levels between different biological sample sets (e.g. treated/vehicle, healthy/ disease, etc.) are discoverable with single injections per sample. The methods have been applied to data generated from both LCQ and LTQ mass spectrometers. The methods have also been applied to data from an LTQ/FT running in both the doubleplay (high-resolution parent scan followed by several nominal resolution MS/MS scans) and triple-play (nominal resolution primary scan, high-resolution zoom scan, and nominal resolution MS/MS scan) modes.

Figure 11. Sample standard deviation vs sample mean for 1,252 normalized peptide peak areas (a). Distribution of coefficients of variation (CV) for 1,252 normalized peptide peak areas (b). Median CV is 8%, 75th percentile is 11%, and 90th percentile is 15%.

relationship between the sample standard deviation and sample mean was observed (Figure 11a) which is suggestive of a multiplicative error model on the original data scale and an additive error model on the log scale and is consistent with previously reported variance functions from a time-of-flight mass spectrometer.34 The distribution of coefficient of variation (CV) values indicates good reproducibility with a median peptide CV value of 8%, 75th percentile of 11%, and a 90th percentile of 15% (Figure 11b). Without normalization the 50th, 75th, and 90th percentile of the peptide CVs were 9%, 13%, and 17%, respectively. The degree to which normalization reduces overall variability is study dependent, with a modest improvement in this small study and with substantial improvement in larger biomarker studies in which there can be drift in instrument sensitivity (data not shown). With no chromatographic alignment or normalization the 50th, 75th, and 90th percentile of the peptide CVs increase to 11%, 16%, and 21%, respectively. The chromatography in this small study was relatively reproducible resulting in warping functions with magnitudes less than 0.5 min and moderate reductions in variability when alignment is applied. However, larger biomarker studies tend to produce alignment warping functions with magnitudes of one minute or greater, thereby increasing the impact that chromatographic alignment has on reducing variability (data not shown).

Acknowledgment. We thank John Saalwaechter and Andrew Kaczorek for their efforts in developing and maintaining a high-availability grid-computing environment used for this work. We also thank Jude Onyia for supporting us in the development of these methods. References (1) FDA Critical Path Initiative, http://www.fda.gov/oc/initiatives/ criticalpath. (2) NIH Road map for Medical Research, http://www.nihroad map.nih.gov/index.asp. (3) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207. (4) Patterson, S. D.; Aebersold, R. H. Nat. Genet. 2003, 33, 311-323. (5) Domon, B.; Alving, K.; He, T.; Ryan, T. E.; Patterson, S. D. Curr. Opin. Mol. Ther. 2002, 4, 577-586. (6) Kaiser, R. E.; Higgs, R. E.; Julian, R. K. System and Methods for Quantitatively Comparing Complex Mixtures Using Single Ion Chromatograms Derived From Spectroscopic Analysis of Such Admixtures. United States Patent #5,885,841, granted 23 Mar 1999. (7) Gygi, S.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 19, 994-999. (8) Ji, J.; Chakraborty, A.; Geng, M.; Zhang, X.; Amini, A.; Bina, M.; Regnier, F. J. Chromatogr., B 2000, 745, 197-210. (9) Cagney, G.; Emili, A. Nat. Biotechnol. 2002, 20, 163-170. (10) Bondarenko, P. V.; Chelius, D.; Shaler, T. A. Anal. Chem. 2002, 74, 4741-4749. (11) Chelius, D.; Bondarenko, P. V. J. Proteome Res. 2002, 1, 317323. (12) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Anal. Chem. 2003, 75, 4818-4826. (13) Petricoin, E. F., III; Ardekani, A. M.; Hitt, B. A.; et al. Lancet 2002, 359, 572-577. (14) Radulovic, D.; Jelveh, S.; Ryu, S.; Hamilton, T. G.; Foss, E.; Mao, Y.; Emili, A. Mol. Cell. Proteomics 2004, 3, 984-997. (15) Wiener, M. C.; Sachs, J. R.; Deyanova, E. G.; Yates N. A. Anal. Chem. 2004, 76, 6085-6096.

Journal of Proteome Research • Vol. 4, No. 4, 2005 1449

research articles (16) Gao, J.; Opiteck, G. J.; Friedrichs, M. S.; Dongre, A. R.; Hefta, S. A. J. Proteome Res. 2003, 2, 643-649. (17) Colinge, J.; Chiappe, D.; Lagache, S.; Moniatte, M.; Bougueleret, L. Anal. Chem. 2005, 77, 596-606. (18) Hale, J. E.; Butler, J. P.; Gelfanova, V.; You, J.-S.; Knierman, M. D. Anal. Biochem. 2004, 333, 174-181. (19) Proakis, J. G.; Manolakis, D. G. Digital Signal ProcessingPrinciples, Algorithms and Applications, 2nd ed.; Prentice Hall: New York, 1992. (20) Eng, J. K.; McCormack, A. L.; Yates, R. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989. (21) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316. (22) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467. (23) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567. (24) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2004, 75, 4646-4658. (25) International Protein Index, European Bioinformatics Institute 2005, http://www.ebi.ac.uk/IPI/IPIhelp.html. (26) National Center for Biotechnology Information 2005, http:// www.ncbi.nlm.nih.gov.

1450

Journal of Proteome Research • Vol. 4, No. 4, 2005

Higgs et al. (27) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, J. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50. (28) Cleveland, W. S.; Grosse, E.; Shyu, W. M. Local regression models. Chapter 8 of Statistical Models in S; Chambers, J. M., Hastie, T. J., Eds.; Wadsworth & Brooks/Cole: 1992. (29) Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. Bioinformatics 2003, 19, 185-193. (30) R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing 2004, http://www.R-project.org. (31) Sun Grid Engine Project 2004, http://gridengine.sunsource.net. (32) SAS Institute. 2000. The SAS system: SAS Online Doc, version 8, HTML format [CD-ROM]. Cary, NC. (33) JMP Statistical Discovery Software, version 5.1.1, SAS Institute, Cary NC. (34) Anderle, M.; Roy, S.; Lin, H.; Becker, C.; Joho, K. Bioinformatics 2004, 20, 3575-3582.

PR050109B