Mixed-Effects Statistical Model for Comparative LC-MS Proteomics Studies† D. S. Daly,*,‡ K. K. Anderson,‡ E. A. Panisko,‡ S. O. Purvine,‡ R. Fang,§ M. E. Monroe,‡ and S. E. Baker‡ Pacific Northwest National Laboratory, 900 Battelle Boulevard, P.O. Box 999, Richland, Washington 99352, and Beckman Institute, California Institute of Technology, 1200 East California Boulevard, Pasadena, California 91125 Received July 16, 2007
Comparing a protein’s concentrations across two or more treatments is the focus of many proteomics studies. A frequent source of measurements for these comparisons is a mass spectrometry (MS) analysis of a protein’s peptide ions separated by liquid chromatography (LC) following its enzymatic digestion. Alas, LC-MS identification and quantification of equimolar peptides can vary significantly due to their unequal digestion, separation, and ionization. This unequal measurability of peptides, the largest source of LC-MS nuisance variation, stymies confident comparison of a protein’s concentration across treatments. Our objective is to introduce a mixed-effects statistical model for comparative LC-MS proteomics studies. We describe LC-MS peptide abundance with a linear model featuring pivotal terms that account for unequal peptide LC-MS measurability. We advance fitting this model to an often incomplete LC-MS data set with REstricted Maximum Likelihood (REML) estimation, producing estimates of model goodness-of-fit, treatment effects, standard errors, confidence intervals, and protein relative concentrations. We illustrate the model with an experiment featuring a known dilution series of a filamentous ascomycete fungus Trichoderma reesei protein mixture. For 781 of the 1546 T. reesei proteins with sufficient data coverage, the fitted mixed-effects models capably described the LC-MS measurements. The LC-MS measurability terms effectively accounted for this major source of uncertainty. Ninety percent of the relative concentration estimates were within 0.5-fold of the true relative concentrations. Akin to the common ratio method, this model also produced biased estimates, albeit less biased. Bias decreased significantly, both absolutely and relative to the ratio method, as the number of observed peptides per protein increased. Mixed-effects statistical modeling offers a flexible, wellestablished methodology for comparative proteomics studies integrating common experimental designs with LC-MS sample processing plans. It favorably accounts for the unequal LC-MS measurability of peptides and produces informative quantitative comparisons of a protein’s concentration across treatments with objective measures of uncertainties. Keywords: Mixed effect statistical model • LC-MS
Background Proteomics Research and Data. Proteomics research often employs comparative, screening, and time-course experiments that rely on estimating a protein’s concentration under one treatment relative to that protein’s concentration under another.1–5 A high-throughput peptide identification and quantification procedure using liquid chromatography (LC) coupled with mass spectrometry (MS) is a frequent source of measurements for these comparisons.6–8 A method to quantify and compare peptides and proteins across treatments must sur† Originally submitted and accepted as part of the “Statistical and Computational Proteomics” special section, published in the January 2008 issue of J. Proteome Res. (Vol. 7, No. 1). * Corresponding author. Tel: 509-375-4365. E-mail:
[email protected]. ‡ Pacific Northwest National Laboratory. § California Institute of Technology.
10.1021/pr070441i CCC: $40.75
2008 American Chemical Society
mount challenges due to the added nuisance variability from sample preparation and LC-MS processing. Consider a hypothetical proteomics study to discover disease biomarkers. The study’s experimental design may have subjects grouped by disease state and matched on age, gender, and body mass index with MS replicates nested within sample replicates within subject. The LC-MS processing of the resulting sample collection may require days to execute across multiple LC columns with unplanned MS emitter replacements. To make useful inferences about disease biomarkers, the analysis must deal appropriately with the nuisance variability due to subject heterogeneity and processing error. Most often, the LC-MS data analysis is a disjointed sequence of operations (filtering, normalization, missing observation management, ratioing, etc.) not underpinned by a unifying statistical model or truly reflecting the experiment’s design and LC-MS sample The Journal of Proteome Research 2008, 7, 1209–1217 1209 Published on Web 02/06/2008
research articles processing plan. A key contributor of nuisance variability is the unequal LC-MS measurability of peptides. Some peptides are more easily measured (i.e., identified and quantified) than others by LC-MS.9 This difference in measurability is most likely due to differences in sample digestion, elution, and ionization. The variation in peptide LC-MS measurability is a major source of variation across LC-MS measurements. Though the response of an enzymatic digestion LC-MS system varies significantly across peptides, this response is very reliable across replicate measurements of a given peptide. The probability that a peptide is observed using LC-MS decreases with its concentration. Consequently, an LC-MS proteomics experiment can produce a large, but incomplete or unbalanced, data set covering thousands of peptides wherein a significant percentage are not observed in each injection. A protein’s peptides will have varying levels of data completeness. Common measurements of MS abundance, such as the maximum ion intensity and integrated ion intensity, are asymmetrically distributed with small abundance values more numerous than large. Measurement variability increases as the expected value of the MS measurement increases (i.e., the absolute variability of large abundance measurements is greater than the variability of small abundance measurements). The relative variability appears approximately constant over the abundance range and independent of peptide identification. An optimal model for LC-MS proteomics measurements will feature a model integrating the experimental design and the laboratory processing plan and address the unequal LC-MS measurability of peptides. Fitting this model will account for measurement heterogeneity and accommodate unbalanced data while attenuating bias. A Naive LC-MS Proteomics Model. Consider an exemplary LC-MS analysis of a single protein mixture of equimolar peptides. Following extraction and purification, the peptides remain joined as a protein and, hence, at equal molarity in this idealized case-there is no protein degradation due to protease activity or dynamic modifications of the protein. After a complete tryptic digestion, the tryptic peptides of the protein continue at equal molarity. Then, ideally, the concentration of each peptide passes through LC-MS processing with equal efficiency producing an MS abundance measurement indexed by a peptide-specific (LC elution time/MS mass) measurement pair. And due to the uniform measurability across peptides in the exemplary LC-MS analysis, all peptides have comparable MS abundances; the MS abundances of the individual peptides are but replicate estimates of the parent protein’s abundance. In contrast, consider actual LC-MS analyses wherein one peptide may be more easily measured than another peptide at equal molarity due to numerous factors such as digestion efficacy, peptide hydrophobicity, and ionization potential. Each of these factors affects a peptide’s detection and the quantification of its abundance. For this discussion, we define peptide LC-MS measurability to be the ability of an LC-MS-based process to estimate that peptide’s abundance, from protein extraction and purification through enzymatic digestion, ionization, ion enumeration and aggregation, and peptide identification. Though LC-MS measurability varies from peptide to peptide within a sample, the measurability of the same peptide is very reliable across samples for any peptide measured under similar conditions on the same LC-MS platform. 1210
The Journal of Proteome Research • Vol. 7, No. 3, 2008
Daly et al. Further, the relative measurability between any two peptides is approximately constant across samples in which the peptides’ relative concentrations are constant. Consequently, abundance profiles for these peptides are very similar in shape, and parallel, within measurement error, on the appropriate scale (Figure 1). Mixed-Effects Statistical Modeling. Mixed-effects modeling is an established statistical methodology for the analysis of comparative, screening and time course experiments. The methodology features a unified modeling process from exploratory data investigation, through data preparation, model fitting and diagnosis, to inference and prediction.10 A mixed-effects model includes terms for both fixed effects such as researcherset treatments, and random effects such as subject response. This modeling is particularly effective in accommodating hierarchical sampling structures such as MS replicates nested within tissue replicates within animal replicates. Restricted maximum likelihood estimation (REML) was developed and refined to estimate more accurately variance components in random and mixed-effects models.11–13 In particular, REML applied to unbalanced data correctly tabulates degrees of freedom, improving error estimates and inferences; REML is better suited to fitting linear models to the often incomplete LC-MS data sets than other techniques such as ordinary least-squares analysis or analysis of variance. A biased sample of LC-MS abundances may result due to the unequal measurability of peptides-the probability of a missing observation for a peptide is higher at a lower concentration than the probability of a missing observation for that peptide at a higher concentration. The observed MS abundances in a set of replicate measurements, appear to be independent, to a first order, of LC-MS missing observation mechanism, though it may vary across replicate sets that vary in expected abundance (Figure 2). That is, we have observed that the probability of a missing observation within a set of replicates is nominally equal for each replicate-we are almost as likely to not observe a larger value as a smaller value across replicates. When we observe this behavior, we assume that the observed distribution of replicate abundance measurements of a given peptide is not significantly biased toward larger values by the missing observation mechanism. Directly accounting for the missing observation mechanism may improve parameter estimates in some instances. This, however, requires much more detailed knowledge of the statistical properties of LC-MS missing observation mechanism. We, and many others, are actively engaged in understanding and modeling this mechanism and then leveraging this research to improve the analysis of peptide abundances. This is a heady endeavor, however, and beyond the scope of this paper. As is, the standing of the mixed-effects modeling methodology among methods for analyzing LC-MS abundances is not diminished by the present treatment of missing observations. Routine application of mixed-effects modeling is facilitated by reliable, efficient statistical software such as R (The R Foundation for Statistical Computing, Vienna, Austria14). A suite of well-documented examples using this software are readily available.10,15
Methods A biologically induced difference between two conditions is often inferred from the ratio of a peptide’s LC-MS abundance estimates (i.e., a peptide’s relative abundance). The acceptance
Statistical Model for Comparative LC-MS Proteomics Studies
Figure 1. Original and adjusted peptide MS abundance profiles. Unique peptides of a given protein most often display similar MS abundance profiles-randomly perturbed by measurement error-across an experiment (upper panel). Removing the profile offsets due to differences in peptide MS measurability (i.e., peptide profile mean - overall mean) significantly reduces the nuisance variability among measurements within an MS injection due to this source (lower panel). In this case, adjusting for differences in peptide MS measurability reveals differences between neighboring dilutions-marked by white line segments– across the experiment as well as injection-by-injection processing effects.
of forming this ratio to eliminate or significantly minimize the systematic effects of LC-MS processing, coupled with the common assumption that measurement error is relative (i.e., MS measurement error increases with measurement value), suggests that variation in MS abundances may be explained adequately with a multiplicative error structure. Further, sample effects due to dilution/titration, fractionation, etc., are often multiplicative in nature. Consequently, log-transformed MS peptide abundances may be effectively described by an additive statistical model (i.e., in matrix notation, model terms, and coefficients are separable). The design of a LC-MS proteomics experiment and the conduct of the LC-MS analyses most often result in a large, complicated data set with effects intricately distributed across proteins. For instance, consider a hypothetical, but plausible, hierarchical study featuring multiple LC-MS injections of multiple preparations from multiple samples of multiple subjects within multiple study groups. The resulting data set, with measurements spanning hundreds to thousands of proteins and peptides, has a very intricate, intertwined correlation structure that should be handled by an all encompassing multiple protein model. Adequately defining this model and then estimating its parameters with a data set perforated with missing observations is presently beyond our reach. So, we presently model each protein individually. Extending the single protein mixed-effects statistical modeling of this paper to account for multiple proteins and the missing observation mechanism is a long-term goal of the authors.
research articles
Figure 2. Empirical MS abundance distributions by peptide observation fraction. Each boxplot summarizes the distribution of measured MS abundances from all peptides observed in P of 45 injections for P ) 1-45. The empirical distributions shift left with decreasing observation fraction while the distribution of values about the median remains nearly unchanged. This suggests that the distribution of a peptide’s measured abundances about its expected abundance is nearly independent of the peptide’s probability of observation for this experiment.
We fit an additive mixed-effects model to the log abundances of the observed peptides of each protein and calculate statistical estimates of model goodness-of-fit, parameters such as treatment and peptide effects, standard errors, and confidence intervals. The antilogarithm of the estimated difference between two treatment effects estimates the protein’s concentration under one treatment relative to that protein’s concentration under the second treatment. The mixed-effects statistical methodology for comparative LC-MS proteomics may be best explained through a moderately simple example with a known protein mixture. The derived example was designed to emulate the level of complexity and to produce the level of uncertainty of many biologically motivated experiments. The size, composition, and level of measurement uncertainty of the exemplar data set are commensurate with true biological data sets. Mixed-Effects Statistical Model of Peptide MS Abundance. Consider a comparative LC-MS experiment featuring T treatments, each applied to S subjects, each who provide M LC-MS injections-the design features M replicate injections nested within S subjects within T treatments. The MS abundance of each unique peptide of a given protein observed under this design may be described by a multiplicative model. Let Atspm represent the MS abundance of the Pth peptide in the mth measurement of the sth subject under the tth experimental treatment. Then, the peptide’s MS abundance as a function of sample concentration and treatment and processing effects is The Journal of Proteome Research • Vol. 7, No. 3, 2008 1211
research articles
Daly et al.
Atspm ) CTtSsFPp(1 + etspm) where C denotes the average concentration of the protein across an experiment’s treatments, Tt is the fixed effect of the tth treatment on the average concentration, Ss is the random effect of the sth subject on the average concentration, F is the across-peptides average constant of proportionality relating peptide abundance to concentration (that is, to a first-order approximation, A ) FC when no other effects are present), Pp is the measurability effect of the Pth peptide which incorporates processing effects such as digestion, elution, etc., and etspm is an independent random error associated with the kth measurement. The measurement errors have zero mean and variance σ2. The treatment effects, Tt, subject effects, Ss, peptide LC-MS measurabilities, Pp, and random measurement errors etspm may be estimated from replicate MS abundance measurements of two or more peptides and treatment samples. However, the concentration C and the constant F are not separable without more information (from internal standards, for example); that is, an estimate of the absolute concentration is not possible. Generally, this limitation is not an issue in a comparative study. To see how the relative concentration across two treatments may be estimated, consider the Pth peptide’s concentrations in two samples under two biological conditions, one measurement each, assuming equal initial concentrations. Let A11p1 and Aaap1 denote their measured LC-MS abundances obtained under identical LC-MS setup and conditions so that the peptide’s measurability, Pp, is constant. Suppose e11p1 and e22p1 represent the combined random variability due to subjects and measurement errors. Let R12 denote the ratio of these two abundances. Then, R12 is an estimate of concentration of the Pth peptide in the first sample relative to its concentration in the second sample: R12 )
A12p1 A22p1
)
CT1 FPp S1 (1 + e11p1) CT2 FPp S2 (1 + e22p1)
)
CT1 δ CT2 12
The average concentration and the processing effects, F and Pp, balance out. The ratio R12 estimates CT1/CT2, the peptide’s concentration under the first biological condition relative its concentration under the second, perturbed by the random error δ12 )
S1 (1 + e11p1) S2 (1 + e22p1)
To simplify analysis, the multiplicative MS abundance model above is transformed to an additive mixed-effects model with normally distributed random errors log(Atspm) ) µ + τt + γs + Fp + tspm where log( · ) is natural logarithm, µt ) log(FC), τt ) log(Tt), γs ) log(Ss), and Fp ) log(Pp). The random subject errors γs and measurement errors tspm ) log(1 + etspm) are assumed to be normally distributed with zero mean and variances σγ2 and σ2, respectively. 1212
The Journal of Proteome Research • Vol. 7, No. 3, 2008
In this simple case, γs represents the random effect of the s th subject while Pp, the peptide LC-MS measurability effect, is modeled as a fixed effect. A more sophisticated model may also include peptide LC-MS measurability in a random effect through an interaction term with another random effect such as animal or tissue unit-common sources of random error in comparative biological experiments. This model may also include terms replacing τt that reflect a more complex design of the biological experiment and terms that break out other LC-MS processing effects such as differences among LC columns and time of MS acquisition. It is also worth noting that the antilogarithms of many of the estimated effects may be interpreted directly as relative effects on the original scale. For instance, 100σˆ is an estimate of the relative measurement error, while 100Fˆp is the average percent increase or decrease in a measurement due to the LC-MS measurability of the Pth peptide relative to the average across all measurements. Mixed-Effects Modeling Example. The simple mixed-effects model is applied to a subset of peptides whose log-abundance profiles are expected to vary similarly across an experiment-for instance, a subset for peptides of one protein. An experiment may involve thousands of peptides, and hundreds to thousands of proteins. Therefore, applying the model to the data set from a comparative experiment requires a fair degree of user-guided automated statistical processing to assess data balance, define and fit the model, and summarize and report results, for each protein and across proteins. We illustrate this process using a highly controlled comparative LC-MS experiment exploiting known relative protein concentrations. Dilution Series Sample Preparation, Data Collection, and Screening. A set of 45 LC-MS injections were prepared from a complex protein mixture (total soluble protein) extracted from a Trichoderma reesei culture. Proteins were digested with trypsin after denaturation and purified by solid-phase extraction. The lyophilized peptides were resuspended in 0.1% formic acid. A sample from this stock mixture was drawn and then diluted. Next, three subsamples for LC-MS injection were drawn from the diluted sample. The process was repeated with the dilution level (zero-fold, 2-fold, 4-fold, 8-fold, or 16-fold) randomly chosen until three separate samples per dilution level were obtained. The replicate samples for each dilution level simulated 3 random draws from the same biological source. Triplicate injections from each dilution simulated a common practice in LC-MS proteomics analysis. The 45 samples were processed in random order as described below. For each injection, 1 µL of sample was injected onto a reversedphase column using an Agilent 1100 LC system (Agilent Technologies, Alpharetta, GA) and eluted directly into a Thermo LTQ mass spectrometer (ThermoFinnigan, Inc., San Jose, CA) by electrospray ionization. The mass spectrometer was operated in a data-dependent mode with one survey scan followed by five MS/MS spectra of the five most abundant ions observed in the survey scan. The raw data from the instrument was analyzed using the Sequest program6 with an in silico T. reesei protein database constructed from the genome sequence16 to identify the peptides which were observed. Peptides that were obtained from this analysis with accepted scoring criteria17 were utilized for peptide abundance measurements. Peptide abundances were estimated from selected ion chromatograms for the parent ion of each peptide observed using a readily available software program called MASIC.18 Other programs with similar algorithms are available.4,19–21 The
Statistical Model for Comparative LC-MS Proteomics Studies
research articles
Figure 3. β-D-Glucoside hydrolase MS abundance and relative concentration estimates. The estimated average abundance for each dilution level is presented with its 95% confidence interval (left panel). The nominal log[2] difference between neighboring MS abundance estimates is consistent with the known dilution levels of 1, 1/2, 1/4, 1/8, and 1/16 stock concentration. This agreement between the known and estimated values is made more evident via the ratios of these two sets, again with 95% confidence intervals (right panel). In each case, the agreement between known and estimated concentrations is within the precision of the analysis and is evidence of the accuracy of the method. The increasing upward bias from the 0- to the 16-fold dilution could not be ascertained; it may be due to a bias in the statistical method, the sample preparation, or sample processing.
software characterized the chromatographic peak for each peptide parent ion and, in particular, calculated the peak area to estimate peptide abundance. The final data set contained the estimated abundances, or missing value code, for the peptides observed in any of the 45 samples. Dilution Series Mixed-Effects Model. Let Atspm denote the log-transformed MS abundance of the mth measurement of Pth peptide in the sth sample of the tth dilution. Suppose µ is the average MS log-abundance over all observed peptides of a given protein and τt represents the effect of the tth dilution level. Suppose γs is the random effect of the sth dilution to the tth level with γs normally distributed with mean 0 and variance σγ2. Let Fp be the MS measurability effect of the Pth peptide. Let tspm denote an independent random measurement error and suppose tspm is normally distributed with mean 0 and variance σ2. Then, the mixed-effects model for this dilution series experiment is Atspm ) µ + τt + γs(t) + Fp + tspm γs(t) ∼ N(0, σγ2) tspm ∼ N(0, σ2) The primary interest is the differences between the 0-level no dilution effect τ0 and the effects of the other levels: τ1
through τ4. For proteins well-fit by the model, this corresponds directly to estimating the average relative concentrations between no dilution and the other dilution levels. Parameter Estimation and Data Imbalance. For a given peptide, a sufficient number and distribution of MS abundances across factor combinations must be observed for the peptide to contribute positively to the estimation of one or more model parameters. What is a sufficient number and distribution of MS abundances across factor combinations is not well understood. It is clear that one measurement, or a few isolated measurements scattered across factor combinations may not be informative for quantitation. At this time, we have found that two or more observations in two or more factor combinations (dilution levels in our experiment) most often gives acceptable results. Data balance across peptides for a given protein may be visualized by imaging a scored version of its peptide by injection submatrix of MS abundances (Figure 4). The resultant image succinctly summarizes the balance of the data highlighting the structure of observations fitted by the mixed-effects modeling. We keep this image in mind as we review modeling results and make inferences from the data. The Journal of Proteome Research • Vol. 7, No. 3, 2008 1213
research articles
Daly et al.
Figure 4. β-D-Glucoside hydrolase data analysis. This protein’s subset of MS abundances has very good balance for some peptides (white cells) with two or more measured abundances for each dilution level (upper left panel). Other peptides have poor balance due to few (light gray) or no observations (dark grey cells) across dilution levels. Among the fitted observations (white cells, upper left panel), measurement heterogeniety is apparent (upper right panel) and reduced by a log transformation (middle left panel). Fitted abundance values adjusted for peptide LC-MS measurability effects and measurement errors display their expected log[2] separation between neighboring dilution levels (middle right panel). The model fits the data well as witnessed by the lack of patterns in the residuals (lower left panel) and the close agreement to their expected values (lower right panel).
REML: Fitting and Summarization. For the MS abundances of each protein, the relevant terms of the general dilution series model were fit using REML. Terms not supported by observa1214
The Journal of Proteome Research • Vol. 7, No. 3, 2008
tions, such as dilution levels with no observations were excluded. The goodness of each fit was evaluated and the residuals were examined for undiscovered trends and to check
research articles
Statistical Model for Comparative LC-MS Proteomics Studies normality assumptions. All possible relative concentrations using no dilution as the baseline were estimated and compared to their expected true relative concentrations of 50%, 25%, 12.5%, or 6.25%. Modeling results were tabled, one set per protein. Each analysis was summarized visually with a 6-panel figure covering data balance, the distributions of original and logged abundance values across MS injections, estimated abundances corrected for peptide effects and measurement errors, and homogeneity and normality of residuals. Modeling effectiveness was evaluated by examining the distribution of estimation bias as function of the number of peptides observed per protein. % bias ) 100
ˆ -R R R
ˆ is the estimated where R is the true relative concentration and R relative concentration. For comparison, we also computed abundance ratios approximating a calculation implied widely in the literature: (1) select two treatment combinations to compare, (2) apply a “2 or more observations per treatment combination” data balance criteria to each peptide, (3) compute the abundance ratio for each passing peptide separately (dealing intuitively with the differences in peptide measurability), and (4) compute the mean of these ratios. We then computed a percentage difference in bias between the REML and corresponding ratio estimate: % bias diff ) 100
ˆ ratio-R ˆ REML R R
Finally, we examined the distributions of these differences as a function of the number of peptides observed per protein.
Results and Discussion The LC-MS analyses produced 45 annotated lists of peptide abundance values perturbed by sample preparation, processing and analysis. These were cross-tabulated making a 5599 peptide by 45 injection matrix. Each row and column were labeled with an appropriate protein/peptide or injection identifier, respectively. Rows were grouped by protein to create 1546 subsets while columns were grouped by dilution level and draw within dilution level reflecting the design of the experiment. Each matrix cell contained a positive MS abundance value, integrated intensity, or a “no measurement” indicator. The mixed-effects model was fit to 781 protein subsets, of the 1546, that passed a minimum data balance requirement of two or more abundance measurements for two or more dilution levels per peptide. Each of these 781 subsets contained enough measurements of the right treatment combinations to estimate a relevant mixed-effects model. The number of peptides per protein subset ranged from 1 to 34 with 58% of subsets including 2 or more peptides, and 23% featuring 5 or more. The estimated relative measurement errors σˆ ranged from near 0 to 113% with a median of 20.6% and standard deviation of 19.5%. Table 1 summarizes the median % bias values of the REML-based relative concentration estimates (which paired the no dilution to the 2-, 4-, 8-, and 16-fold dilutions) by the number of peptides per protein subset. There is a slight decrease in bias as the number of peptides per protein subset increases. However, on average, the median biases were 10%, 18.9%, 23.2%, and 31.2% for the 50%, 25%, 12.5%, and 6.25%
a
Table 1. REML Method Bias
dilution no. of peptides
50%
25%
12.5%
6.25%
no. of proteins
1 2 3 4 5–6 7–9 10–34
6.7 12.1 10.6 10.4 10.0 8.0 7.6
21.7 19.9 16.0 18.9 19.6 18.7 17.1
25.3 22.3 24.3 23.2 26.9 20.4 20.2
35.2 30.0 32.1 31.2 32.9 27.3 27.3
330 118 95 58 76 54 50
a The median % bias values of the REML-estimated, T. reesei protein relative concentrations (i.e., the median % bias of the 2-, 4-, 8-, and 16-fold dilution concentrations relative to the no dilution concentration) by the number of peptides-per-protein subset.
true relative concentrations, respectively. To put these biases in perspective, consider the following alternate interpretation. A median bias of 10% for a 1:2 dilution results in estimates that center about a 1:1.8 dilution; 18.9% bias for a 1:4 dilution results in estimates that center about a 1:3.36 dilution; a 23.2% bias for a 1:8 dilution results in estimates that center about a 1:6.5 dilution; and a 31.2% bias for a 1:16 dilution results in estimates that center about a 1:12.2 dilution. These systematic positive biases could be due to the data analysis methodology or could be an artifact of the laboratory procedures, or both, and cannot be determined without further experimentation. The biases of the REML estimates were found to be less than or equal to those of the ratio estimates. Figure 5 summarizes the distributions of the % bias diff values by true relative concentrations and the number of peptides per protein. When the number of peptides per protein is small, the biases in the two estimates are equivalent. When the number of peptides per protein is large, the biases of the ratio estimates is an additional 5% to 10% over the biases of the REML estimates. That is, when the number of peptides per protein is greater than or equal to 5, the median biases of the ratio estimates were approximately 15%, 24%, 30%, and 41% for the 50%, 25%, 12.5%, and 6.25% true relative concentrations, respectively. β-D-Glucoside Hydrolase: An Exemplar Analysis. The modeling outcome for protein T. reesei β-D-glucoside hydrolase is typical of data subsets that have sufficient data balance to estimate the model parameters with acceptable biases. The data set contained 281 measured abundances with 619 of 900, or 68.8%, missing. The β-D-glucoside hydrolase subset of MS abundances covered 20 peptides with 10 peptides having sufficient data to be included in mixed-effects modeling (upper left panel, Figure 4). At the highest concentration, MS abundances ranged from 1.8 × 106 to 8.1 × 107 intensity units with most values less than 1.2 × 107. Abundances decreased with decreasing concentration; abundances ranged from 9.6 × 104 to 5.2 × 106 with a median of 1.5 × 106 at the lowest concentration. The variation in abundances also decreased significantly with decreasing average abundance indicative of measurement heterogeneity (upper right panel, Figure 4). The log transformation of the abundances resulted in more homogeneous distributions across dilution levels effectively uncoupling expected abundance and variance in abundance (middle left panel, Figure 4). The mixed model fit the glucoside hydrolase data subset very well as witnessed by an adjusted R2 ) 92.2%. The relative measurement error was 35.1%. The fit accounted for the The Journal of Proteome Research • Vol. 7, No. 3, 2008 1215
research articles
Daly et al.
Figure 5. Difference in bias between ratio and REML methods. Boxplots (labeled on the left by the number of peptides per protein and the known relative concentration) of % BiasDiff show that the two methods are about equal in bias for one or two peptides per protein. The bias in the ratio method, however, increases compared to the REML method as the number of peptides per protein increases (as indicated by the rightward shift in the boxplots).
systematic variation in the data as demonstrated by the lack of a pattern in the residuals plotted against the fitted values (lower left panel, Figure 4). With the exception of a few outliers, the residuals closely followed a normal distribution, as assumed in the model, with the residuals in close agreement with their expected values (lower right panel, Figure 4). The fitted values with the overall mean and dilution effects retained, and the peptide MS measurability effects removed reflected the expected effects of the known dilutions (left panel, Figure 3, and middle right panel, Figure 4). Statistical hypothesis tests comparing the estimated dilution factors to the known dilution factors were not significant (right panel, Figure 3). The estimated relative concentrations were 49%, 27%, 14%, and 8% compared to the known relative concentrations 50%, 25%, 12.5%, and 6.25%, respectively.
Conclusions We introduced mixed-effects statistical modeling for comparative LC-MS proteomics studies. We illustrated its application with an experiment featuring a known dilution series of a T. reesei protein mixture. In our dilution study comparing estimated and true relative concentrations, the average bias in relative concentration estimates ranged from 3.5% for the 50% true relative concentration set to 1.85% for the 6.25% true relative concentration set. The source of this bias, be it the analysis methodology, an artifact of the experiment, or both, cannot be determined without further experimentation. The 1216
The Journal of Proteome Research • Vol. 7, No. 3, 2008
mixed-effects estimates were found to have equal or smaller bias than the common ratio estimates. For a protein whose data subset is sufficiently balanced, the mixed-effects statistical model produces informative quantitative comparisons of protein concentrations across treatments. In particular, the fitted model favorably accounts for differences in peptide LC-MS measurability, a major source of uncertainty in MS measurements. There are a number of quantitative methods for LC-MS based proteomics. Many make use of stable isotopes either by cellular, mid or postprocessing peptide labeling (ICAT,5,23 ITRAQ,24,25 18O,26 etc.) or isotopic peptide spiking (AQUA27). These methods are limited by the number of relevant isotopes available and by the number of samples or the number of peptides from any given experiment that can be compared. In addition, metabolic labeling of samples requires the use of a defined minimal media which is not amenable to human studies and other model systems. An advantage of nonlabeling methods is that sample processing is reduced, minimizing the amount of both sample loss and contamination. In addition, because there is not a limit to the number of conditions, or treatment combinations, complex experiments that involve a number of experimental samples (for example, a multipoint time course experiment) are straightforward to design, execute and analyze using established statistical methods. The mixedeffects modeling introduced in this paper, with its ability to account for differences in ion measurability, is appropriate for
research articles
Statistical Model for Comparative LC-MS Proteomics Studies both labeled and label-free studies, as well as LC-MS metabolomics studies. There remain a number of issues for the application and extension of mixed-effects modeling for comparative LC-MS proteomics studies. Assessing the balance of a data set against a model is straightforward. We know ahead of time, however, that a protein’s data set of MS abundances generally will be unbalanced due to the high frequency of unobserved peptide LC-MS measurements. Therefore, the more important issue is assessing the effect of unbalance on each parameter estimate so that the inferences related to those parameter estimates less affected by unbalanced data are appraised more positively than those more affected by the pattern of missing measurements. More than one protein may be the source for an observed peptide. Consequently, the MS abundance of this peptide may be due to contributions from one or more proteins. At this time, we do not directly account for this ambiguity in our modeling, and leave it to the researcher to resolve this (most often by filtering the data set, retaining only peptides identified uniquely with one protein as in our exemplary experiment). We are exploring multiple protein mixed-effects models that will accommodate this ambiguity and we will treat this in a future publication. The relative measurement error, Akaike’s information criteria and an overall F test are notable measures of model goodness of fit. These statistics, however, fail to convey a sense of model goodness of fit often associated with a statistic like the adjusted R2. For this modeling, a large R2 is a less informative measure of model goodness of fit with respect to the biological terms of greatest interest and the biological inferences to follow. A protein mixed-effects model featuring two or more peptides may have a large R2 because differences in peptide LC-MS measurability are often the largest source of variability in LC-MS abundances, and so our model’s terms addressing peptide LC-MS measurability may account for a significant percentage of variability in MS measurements. Our results show the mixed-effects models fit this type of data very well. We do, however, require a more informative measure of model goodness of fit, and are exploring partial R2 that are not as heavily influenced by the usually good fit of peptide LC-MS measurability terms. Biased sampling due to missing observations may bias parameter estimates. Reducing such bias by revising this statistical method requires better knowledge of the LC-MS missing observation mechanism. As is, the standing of the mixed-effects methodology among methods for comparing LC-MS abundances is not diminished by our present treatment of missing observations. Here, our proposed method addresses modeling one protein at a time. Proteomics studies generally involve the analysis of hundreds of proteins to discover which are affected by the treatments. In this case, controlling the level of false positive significant results is an issue.22 We are presently investigating controlling the false discovery rate for this type of study.
Acknowledgment. The research described in this paper was conducted as part of the Environmental Biomarkers Initia-
tive under the Laboratory Directed Research and Development Program, and the Genomes to Life program at Pacific Northwest National Laboratory, a multiprogram national laboratory operated by Battelle for the U.S. Department of Energy under Contract DE-AC05-76RL01830. The featured proteomics data were processed and archived by the Instrument Development Laboratory at the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the Department of Energy’s Office of Biological and Environmental Research.
References (1) Fang, R.; Elias, D. A.; Monroe, M. E.; Shen, Y.; McIntosh, W.; Wang, P.; Goddard, C. D.; Callister, S. J.; Moore, R. J.; Gorby, Y. A.; Adkins, J. N.; Fredrickson, J. K.; Lipton, M. S.; Smith, R. D. Mol. Cell. Proteomics 2006, 5, 714–725. (2) Kurian, D.; K, P.; Maenpaa, P. Proteomics 2006, 1, 1–2 [PMID: 16691555]. (3) Liu, T.; Qian, W.; Strittmatter, E.; Camp, D.; Anderson, G.; Thrall, B.; Smith, R. Nat. Biotechnol. 2001, 19, 242–247. (4) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Anal. Chem. 2003, 75, 4818–4826. (5) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994–999 [PMID: 10504701]. (6) Eng, J. K.; McCormack, A. L.; Yaes, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989. (7) Washburn, M. P.; Wolters, D.; Yates, J. R. Nature Biotechnol. 2001, 19, 242–247. (8) Masselong, C.; Anderson, G.; Harkewicz, R.; Bruce, J.; Pasa-Tolic, L.; Smith, R. Nat. Biotechnol. 2001, 19, 242–247. (9) Purvine, S. O. OMICS 2004, 8, 79–92. PMID: 15107238. (10) Pinheiro, J. C.; Bates, D. M. Mixed-effects models in S and S-plus; Springer-Verlag: New York, 2000. (11) Patterson, H. D.; Thompson, R. Biometrik 1971, 58, 545–554. (12) Searle, S. R.; Casella, G.; McCulloch, C. E. Variance components; John Wiley & Sons, Inc.: New York, 1992. (13) Laird, N. M.; Ware, J. H. Biometrics 1982, 38, 963–974. (14) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2007; ISBN 3-900051-07-0. (15) Venables, W. N.; Ripley, B. D. Modern applied statistics with S-Plus; Springer-Verlag, Inc.: New York, 1994. (16) Joint Genome Institute, Trichoderma reesei v1.0; Los Alamos National Security, LLC, 2006. (17) Wolters, D. A.; Washburn, M. P.; Yates, J. R. Anal. Chem. 2001, 73, 563–569. (18) Monroe, M. MASIC: MS/MS Automated Selected Ion Chromatogram generator; Proteomics Research Resource for Integrative Biology, Pacific Northwest National Laboratory: Richland, WA, 2006. (19) MacCoss, M. J.; Wu, C. C.; Liu, H.; Sadygov, R.; Yates, J. R. Anal. Chem. 2003, 5, 1–2. (20) Chelius, D.; Zhang, T.; Wang, G.; Shen, R.-F. Anal. Chem. 2003, 75, 6658–6665. (21) Bondarenko, P. V.; Chelius, D.; Shaler, T. A. Anal. Chem. y 2002, 74, 4741–4749. (22) Benjamini, Y.; Hochberg, Y. J. R. Stat. Soc., Ser. B, Method. 1995, 57, 289–300. (23) Han, D. K.; Eng, J.; Zhou, H.; Aebersold, R. Nat. Biotechnol. 2001, 19, 946–951. (24) DeSouza, L.; Diehl, G.; Rodrigues, M. J.; Guo, J.; Romaschin, A. D.; Colgan, T. J.; Siu, K. W. J. Proteome Res. 2005, 4, 377–386. (25) Ross, P. L. Mol. Cell. Proteomics 2004, 3, 1154–1169 [Epub 2004 Sep 22. PMID: 15385600]. (26) Yao, X. ; Freas, A.; Ramirez, J.; Demirev, P. A.; Fenselau, C. Anal. Chem. 2001, 73, 2836–2842 [Erratum: Anal. Chem. 2004 76(9), 2675. PMID: 11467524]. (27) Gerber, S. A.; Rush, J.; Stemman, O.; Krischner, M. W; Gygi, S. P. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 6940–6945 [Epub 2003 May 27. PMID: 12771378].
PR070441I
The Journal of Proteome Research • Vol. 7, No. 3, 2008 1217