Anal. Chem. 2005, 77, 7246-7254
Multicomponent Internal Recalibration of an LC-FTICR-MS Analysis Employing a Partially Characterized Complex Peptide Mixture: Systematic and Random Errors

Corey M. Yanofsky,*,† Alexander W. Bell,‡ Souad Lesimple,§ Frank Morales,† TuKiet T. Lam,| Greg T. Blakney,| Alan G. Marshall,| Brian Carrillo,† Kossi Lekpor,† Daniel Boismenu,§ and Robert E. Kearney†
Bioinformatics Group, Department of Biomedical Engineering, McGill University, Strathcona Building, 3640 University Street, Montreal, Quebec, Canada H3A 2B2, Department of Anatomy and Cell Biology, McGill University, 3640 University Street, Montreal, Quebec, Canada H3A 2B2, Montreal Proteomics Network, McGill University, Genome Building, 740 Docteur Penfield Avenue, Montreal, Quebec, Canada H3A 1A4, and Ion Cyclotron Resonance Program, National High Magnetic Field Laboratory, Florida State University, 1800 East Paul Dirac Drive, Tallahassee, Florida 32310-4005
In high-throughput proteomics, a promising current approach is the use of liquid chromatography coupled to Fourier transform ion cyclotron resonance mass spectrometry (LC-FTICR-MS) of tryptic peptides from complex mixtures of proteins. To apply this method, it is necessary to account for any systematic measurement error, and it is useful to have an estimate of the random error expected in the measured masses. Here, we analyze by LC-FTICR-MS a complex mixture of peptides derived from a sample previously characterized by LC-QTOF-MS. Application of a Bayesian probability model of the data and partial knowledge of the composition of the sample suffice to estimate both the systematic and random errors in measured masses.

* To whom correspondence should be addressed. Tel: 514-398-5736. E-mail: [email protected]. † Bioinformatics Group, McGill University. ‡ Department of Anatomy and Cell Biology, McGill University. § Montreal Proteomics Network, McGill University. | Florida State University.

Proteomics is a relatively new field of biology that aims to understand the state of a cell in a comprehensive manner.1 A key step in reaching this goal is the identification of the total protein complement of a biological sample. One experimental approach employs the enzyme trypsin to cleave the protein complement of the sample into “tryptic fragments”, which are relatively short peptide chains. The proteolytic action of trypsin is well characterized, so the presence of a particular tryptic fragment or set of fragments in the digested sample acts as an identifier of the protein from which they are derived. The tryptic fragments are themselves identified by mass spectrometry.2

(1) Hochstrasser, D. F.; Sanchez, J.-C.; Appel, R. D. Proteomics 2002, 7, 807-812. (2) Hanes, P. A., Jr.; Yates, J. R., III. Yeast Comput. Funct. Genet. 2000, 2, 81-87.

7246 Analytical Chemistry, Vol. 77, No. 22, November 15, 2005

A high-throughput approach to proteomics uses automated data collection to analyze complex samples and automated data
analysis algorithms to rapidly identify proteins from the resulting large sets of data. Two main approaches are employed. In one approach, proteins are resolved into spots ideally containing one or at most a few proteins by two-dimensional SDS-PAGE, in-gel digested by trypsin, and the masses of the tryptic peptides are determined by mass spectrometry. These masses act as a fingerprint to search a database of protein sequences that has been digested in silico by trypsin.3 In the other approach, complex mixtures of proteins are separated into less complex mixtures by one-dimensional SDS-PAGE, and the resulting bands are in-gel digested by trypsin, creating a complex mixture of peptides derived from many different proteins of similar apparent molecular weight. Ultimately, the peptide mixtures are resolved by reversed-phase liquid chromatography and introduced into a mass spectrometer with tandem MS capabilities by electrospray ionization (ESI). (The multidimensional protein identification technology (MudPIT) methodology follows a similar approach but replaces the preenzymatic digestion SDS-PAGE protein separation with a postenzymatic digestion ion-exchange chromatography stage.4) A survey MS scan of the LC eluent allows for the selection of doubly or triply charged parent ions (the most common charge states of tryptic peptides eluted at low pH) for tandem MS. The resulting fragmentation mass spectra contain sequence information that can be used to search a database of protein sequences.5

(3) Binz, P. A.; Muller, M.; Walther, D.; Bienvenut, W. V.; Gras, R.; Hoogland, C.; Bouchet, G.; Gasteiger, E.; Fabbretti, R.; Gay, S.; Palagi, P.; Wilkins, M. R.; Rouge, V.; Tonella, L.; Paesano, S.; Rossellat, G.; Karmime, A.; Bairoch, A.; Sanchez, J.-C.; Appel, R. D.; Hochstrasser, D. F. Anal. Chem. 1999, 71, 4981-4988. (4) Wolters, D. A.; Washburn, M. P.; Yates, J. R., III. Anal. Chem. 2001, 73, 5683-5690. (5) Yates, J. R., III; Eng, J. K.; McCormack, A. L.; Schietz, D. Anal. Chem. 1995, 67, 1426-1436.

10.1021/ac050640q © 2005 American Chemical Society. Published on Web 10/12/2005

Liquid chromatography coupled to Fourier transform ion cyclotron resonance mass spectrometry (LC-FTICR-MS) offers the highest resolving power, mass accuracy, sensitivity, and dynamic range for the characterization of tryptic peptide masses
in complex mixtures, compared to TOF and ion trap mass spectrometers. The FTICR technology has been used to investigate the entire protein complement of prokaryotic cells.6 The ultrahigh mass resolution and mass accuracy of FTICR-MS translate directly into higher confidence in automated identification by database matching of peptide masses. However, even a technique as precise as FTICR-MS is subject to various sources of error,7,8 and high-precision measurements are degraded if they are subject to an unknown systematic error. Therefore, it is necessary to account for any systematic mass measurement error; in addition, it is useful to have an estimate of the average magnitude of the random error in measured masses, to optimize the protein identification algorithms.

This paper describes an approach that uses preliminary knowledge of the contents of the sample (usually available for biological samples) to determine these errors and demonstrates its use on two data sets obtained by LC-FTICR-MS analysis of a complex tryptic peptide mixture. Because the abundant components of the sample have been previously identified by LC-QTOF techniques, a library of expected peptides can be constructed. This library is certain to contain many of the peptides that have been observed, although it also contains several thousand peptides not observed in the sample. The recalibration technique consists of matching the measured m/z ratios of the peptides in the sample to the masses of the expected peptides in the library. Even in the presence of numerous spurious matches between the data and the library, systematic measurement errors can be detected and corrected, and the average magnitude of random errors can be estimated. The recalibration technique does not fall into the standard categories of external calibration or internal calibration.
However, the technique does not stand alone; rather, it augments external calibration, allowing it to achieve mass accuracy comparable to internal calibration without requiring the spiking of the sample or the use of dual-sprayer sources; partial knowledge of the contents of the sample, encoded as a highly inclusive library of expected sample components, is used instead. The technique is not specific to peptide libraries; it is applicable in other contexts where highly complex samples with somewhat predictable compositions are analyzed by FTICR-MS, e.g., petroleomics.9

EXPERIMENTAL METHODS

Enriched preparations of endoplasmic reticulum rough10 and smooth microsomes11 derived from rat liver tissue were resolved by SDS-PAGE on a 7-15% gradient gel, which was then stained with Coomassie blue. The cytochrome P450 proteins largely migrate with apparent molecular size of ∼46 000-57 000 Da. This region of the gel was excised and destained, followed by reduction and alkylation, in-gel digestion with trypsin, and peptide extraction.

(6) Hongying, G.; Shen, Y.; Veenstra, T. D.; Harkewicz, R.; Anderson, G. A.; Bruce, J. E.; Pasa-Tolic, L.; Smith, R. D. J. Microcolumn Sep. 2000, 7, 383-390. (7) Herold, L. K.; Kouzes, R. T. Int. J. Mass Spectrom. Ion Processes 1990, 96, 275-289. (8) Gorshkov, M. V.; Guan, S.; Marshall, A. G. Int. J. Mass Spectrom. Ion Processes 1993, 128, 47-60. (9) Marshall, A. G.; Rodgers, R. P. Acc. Chem. Res. 2004, 37, 53-59. (10) Paiement, J.; Bergeron, J. J. M. J. Cell Biol. 1983, 96, 1791-1796. (11) Lavoie, C.; Lanoix, J.; Kan, F. W. K.; Paiement, J. J. Cell. Sci. 1996, 109, 1415-1425.
Solvents were delivered by a Shimadzu HPLC pumping system at 50 µL/min. Solutions A and B consisted of 5:94.5:0.5 and 95:4.5:0.5 acetonitrile/water/formic acid, respectively. Samples (20 µL) were injected onto a Vydac Microbore column (C8, 300 Å, 5 µm, 50 × 1.0 mm) with a manual injector. The following gradient was used for chromatographic separation: 0-5 min, 0% B; 5-60 min, linear from 0 to 50% B; 60-65 min, linear from 50 to 95% B; 65-70 min, held at 95% B; 70-71 min, linear from 95 to 0% B; 71-75 min, held at 0% B; 75 min, stop. Flow rate was postcolumn split to 350-450 nL/min by placement of a MicroTee flow rate splitter (Upchurch) between the column outlet and the electrospray interface. A 50-µm-inner diameter micro-ESI spray needle12 was coupled directly to the splitter, and a 2.0-kV ESI needle voltage was applied to the metal tubing waste line. The external electrospray interface has been described previously.13

Mass analyses were carried out with a home-built 9.4-T FTICR mass spectrometer configured for external ion accumulation.14 All experimental FTICR parameters have been described previously.15 The raw transients were processed with TOMAS, an in-house toolbox for mass spectrometry data analysis.16 TOMAS read the raw transient files, applied a Hanning window to the data, used an FFT to convert the transients to the frequency domain, and finally applied a calibration formula to convert frequency to m/z ratio.17,18 Peak detection from the calibrated mass spectral data was performed with Mascot Distiller (Matrix Science, London, U.K.), and the resulting peak lists were analyzed by use of a simple algorithm programmed in-house to provide a unique estimate of the m/z ratio and charge of each peptide, under the assumption that only peptides give rise to experimentally detected ions between m/z 400 and 1600. Two data sets were generated, one for the rough microsomes and one for the smooth microsomes.
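The transient-to-spectrum pipeline described above (Hanning window, FFT, frequency-to-m/z calibration) can be sketched as follows. This is a minimal illustration, not the TOMAS implementation: the synthetic transient, sampling rate, and the two-term calibration constants A and B are made-up placeholders.

```python
import numpy as np

def transient_to_mz_spectrum(transient, sample_rate, A=1.0e8, B=0.0):
    """Convert a raw ICR transient into a magnitude spectrum on an m/z axis.

    A Hanning window suppresses spectral leakage before the FFT; the
    frequency axis is then mapped to m/z with a simple two-term law of
    the form m/z = A/f + B/f**2 (placeholder constants; a real
    instrument's A and B come from calibration against known masses).
    """
    n = len(transient)
    windowed = transient * np.hanning(n)          # apodization
    spectrum = np.abs(np.fft.rfft(windowed))      # magnitude spectrum
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    freqs, spectrum = freqs[1:], spectrum[1:]     # drop the DC (f = 0) bin
    mz = A / freqs + B / freqs**2                 # frequency -> m/z
    return mz, spectrum

# toy transient: two damped sinusoids standing in for two ion packets
t = np.arange(0, 0.5, 1.0 / 2_000_000)            # 0.5 s at 2 MHz
transient = (np.cos(2 * np.pi * 150_000 * t) +
             0.5 * np.cos(2 * np.pi * 200_000 * t)) * np.exp(-3 * t)
mz, spec = transient_to_mz_spectrum(transient, sample_rate=2_000_000)
```

With B = 0 the mapping reduces to the ideal cyclotron relation, in which m/z is inversely proportional to frequency.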
Each data set consists of multiple mass spectra collected throughout the elution of the sample from the chromatographic column. In the paper, the analysis will be described in reference to the smooth microsome data set; the corresponding analysis for the rough microsome data set is available as Supporting Information.

RESULTS AND DISCUSSION

Data Interpretation. A list of 93 proteins representing the most abundant components of the cytochrome P450 protein (∼46-57 kDa) sample based on previous results was compiled, and a library of tryptic peptide masses was constructed by in silico digestion of the proteins in the list. Tryptic peptide masses were calculated for peptides with up to two missed cleavages.

(12) Emmett, M. R.; Caprioli, R. M. J. Am. Soc. Mass Spectrom. 1994, 5, 605-613. (13) Senko, M. W.; Hendrickson, C. L.; Pasa-Tolic, L.; Marto, J. A.; White, F. M.; Guan, S. H.; Marshall, A. G. Rapid Commun. Mass Spectrom. 1996, 10, 1824-1828. (14) Senko, M. W.; Hendrickson, C. L.; Emmett, M. R.; Shi, S. D. H.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1997, 8, 970-976. (15) Lam, T. T.; Lanman, J. K.; Emmett, M. R.; Hendrickson, C. L.; Marshall, A. G.; Prevelige, P. E. J. Chromatogr., A 2002, 982, 85-95. (16) Morales, F.; Kearney, R.; Bencsath-Makkai, Z.; Bergeron, J. Mol. Cell. Proteomics 2003, 2, 832. (17) Shi, S. D.-H.; Drader, J. J.; Freitas, M. A.; Hendrickson, C. L.; Marshall, A. G. Int. J. Mass Spectrom. 2000, 195-196, 591-598. (18) Ledford, E. B.; Rempel, D. L., Jr.; Gross, M. L. Anal. Chem. 1984, 56, 2744-2748.

Several
Figure 1. Schematic diagram of the matching of observed ions to members of the peptide library. Library peptide masses were calculated for the singly protonated species. There were 4157 matches to 1277 out of 1509 observed ions; the library contains 20 294 peptides with 19 904 unique masses. (The symbol “C*” represents 1-carbamidomethylcysteine.)
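The matching step diagrammed in Figure 1 amounts to collecting, for each observed m/z, every library mass within the tolerance window. A minimal sketch with a made-up four-peptide mini-library (the library is sorted once, then binary-searched per observation):

```python
import bisect

def match_ions(observed_mz, library_mz, tol=0.1):
    """Return (obs_index, lib_index, matching_error) triples for every
    library mass within +/- tol Th of each observed m/z value.

    matching_error = library m/z - observed m/z, following the text's
    definition of the matching error.
    """
    order = sorted(range(len(library_mz)), key=lambda k: library_mz[k])
    sorted_mz = [library_mz[k] for k in order]
    matches = []
    for i, mz in enumerate(observed_mz):
        lo = bisect.bisect_left(sorted_mz, mz - tol)   # first candidate
        hi = bisect.bisect_right(sorted_mz, mz + tol)  # one past the last
        for pos in range(lo, hi):
            matches.append((i, order[pos], sorted_mz[pos] - mz))
    return matches

# made-up mini-library and observations (illustrative values only)
library = [500.2647, 500.3011, 742.4108, 1021.5532]
observed = [500.2712, 742.4090]
matches = match_ions(observed, library)
for obs_i, lib_i, err in matches:
    print(obs_i, lib_i, round(err, 4))
```

Note that the first observation picks up two tentative matches, of which at most one can be correct; this is exactly the one-to-many situation the model must handle.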
Figure 2. Histograms of matching errors: (a) simulated data demonstrating the expected appearance of error-free measurements; (b) observed data, revealing the presence of systematic errors.
kinds of modifications were taken into consideration: masses of peptides with N-terminal glutamines and glutamates were calculated for both the unmodified and pyroglutamate forms; masses of peptides containing protein N-termini with N-terminal residue alanine, valine, or serine were calculated in both the acetylated and unmodified forms; and the masses of peptides containing methionine were calculated in both oxidized and unoxidized forms. (All cysteines were treated as alkylated to carbamidomethylcysteine.) The m/z ratios corresponding to the masses in the tryptic peptide library were matched to the measured m/z ratios at a tolerance of 10^-1 Th, which is much wider than the typical measured m/z error for the FTICR instrument (on the order of 10^-3 Th). Figure 1 shows a schematic diagram of the matching process. The difference between the m/z ratio of a library peptide and that of the matching experimentally observed peptide is termed the matching error. The matches are based only on the measured m/z ratios of the observed ions and are therefore only tentative matches: they may be either correct (that is, the peptide that was observed is actually the same as the one it has matched
in the library) or incorrect (that is, the peptide that was observed has a mass close to the one it matched in the library entirely by coincidence). Note that a given experimental peptide can match more than one library peptide, although only one such match can be correct; each match will be associated with its own matching error. Only correct matches give information about the miscalibration of the measured m/z ratios; incorrect matches can tell us nothing about the functioning of the mass spectrometer.

Following the matching process, a histogram of the matching errors was created. Under the assumption that the measured m/z ratios contain no systematic error, the histogram would be expected to contain a background of incorrect matches widely dispersed throughout the range of possible matching errors and a sharp spike of correct matches whose matching errors would be clustered around zero; Figure 2a illustrates these expected results with simulated data, created by combining 4000 random points uniformly distributed over the range -0.1 to +0.1 (roughly simulating the incorrect matches) with 400 random points normally distributed with mean 0 and standard deviation 10^-3 (simulating the correct matches).
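Simulated data of the kind shown in Figure 2a can be regenerated in a few lines; the random seed and bin count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# 4000 incorrect matches: uniform over the +/- 0.1 Th tolerance window
incorrect = rng.uniform(-0.1, 0.1, size=4000)
# 400 correct matches: normal, mean 0, sd 1e-3 Th (error-free calibration)
correct = rng.normal(0.0, 1e-3, size=400)

errors = np.concatenate([incorrect, correct])
counts, edges = np.histogram(errors, bins=100, range=(-0.1, 0.1))

# the bins straddling zero should stand far above the flat background
# of roughly 4000/100 = 40 incorrect matches per bin
print(counts.max(), counts.mean())
```

The sharp central spike against a flat background is what makes the correct-match population detectable even though individual matches cannot be labeled.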
Figure 3. Measured m/z ratios versus matching errors.
Figure 2b shows the histogram of the observed matching errors. The background of incorrect matches is present, as expected; the cluster of correct matches is also seen, but it is centered on an error of -0.01 Th and is much wider than would be expected for error-free FTICR data. It is clear from this figure that the measurements are affected by some kind of systematic error; this is most likely due to a miscalibration of the data and should therefore be a function of the experimentally measured m/z ratios. Figure 3 shows a plot of the m/z ratio of detected peptides versus matching error. Each matched experimental peptide is represented by one or more points in the plot, depending on the number of matches to that peptide. Two populations of matches, correct and incorrect, are clearly apparent on the plot: correct matches are tightly clustered diagonally in the center of the plot, and incorrect matches are dispersed throughout the displayed range of matching errors. As expected, the correct matches display systematic bias related to their measured m/z ratios:
(m/z)m = (m/z)t + Δ(m/z) + ε    (1)

in which (m/z)m is the measured m/z ratio, (m/z)t is the true m/z ratio, Δ(m/z) is the systematic bias of a measured m/z ratio, and ε is a random measurement error, different for each correct match. It appears that the systematic bias term can be well approximated by a line,
Δ(m/z) = a(m/z)m + b    (2)
in which a is the slope of the line and b is the intercept. To maximize measurement accuracy, the systematic error term must be subtracted from the measured m/z ratio, leaving an estimate in which only random error remains.
(m/z)c = (m/z)t + ε = (m/z)m - a(m/z)m - b    (3)
in which (m/z)c is the corrected m/z ratio. To achieve this
correction, good estimates of the regression line slope parameter a and the intercept parameter b must be obtained.

We might reasonably expect that constructing Figures 2b and 3 for different charge states would give significantly different results, because the number of possible incorrect matches is expected to be higher for higher charge states at a given m/z. (We expect that the distribution of matching errors of correct matches will not be affected by charge state, since the systematic bias is not dependent on charge except through m/z, and the random error is independent of charge state.) We examined the matches corresponding to charge states +2, +3, and +4, which accounted for 36.6, 35.8, and 17.0% of matches, respectively, totaling 89.4% of all matches. (These proportions of charge states are consistent with tryptic digestion.) Surprisingly, we found that their statistical properties were so similar that treating different charge states separately was not merited. (Histograms of observed matching errors distinguished according to charge state are shown in Figure S-1 in the Supporting Information.)

Data Modeling. Data such as those shown in Figure 3 are not suited to the standard linear least-squares regression technique because that method assumes that all of the data are described by the same model, whereas here, we assume that the linear equation applies only to the correct matches. With the wealth of data available, it would be possible to fit a line to the data by eye, but the accuracy of the line would be unknown, and the resulting estimates of the slope and intercept of the line are of limited use without some way of determining ranges of plausible values for those parameters. For this reason, we used Bayesian statistical techniques.

In the Bayesian approach,19-21 probabilities are used to quantify the plausibility of hypotheses.
For example, it might be reasonable to ascribe a probability of 0.7 to the hypothesis that a particular tentative match between the data and the library is a correct match. All such assessments are relative to the information used in forming the assessment, and individuals with different states of information may ascribe the same hypothesis different probabilities. For hypotheses regarding the true values of real-valued parameters, probability distributions are used to encode the plausibility that the true value will be found in different ranges. For example, if one is 95% certain that the true value of a measured mass is in the range 393.5 ( 0.2 Da, it might be reasonable to encode that information by assigning the true value of the measured mass a normal probability distribution with mean 393.5 Da and standard deviation 0.1 Da (note: probability distributions used in this paper are described in the Supporting Information). This is the key feature of the Bayesian framework: all forms of uncertainty are expressed through the use of probabilities and probability distributions. Thus, Bayesian analyses produce both parameter estimates and ranges of plausible values around those estimates. (19) Jaynes, E. T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, U.K., 2003; xix-197. (20) Bernardo, J. M.; Smith, A. F. M. Bayesian Theory; John Wiley and Sons: New York, 2000. (21) Gelman, A.; Carlin, J. B.; Stern, H. S.; Rubin, D. B. Bayesian Data Analysis, 2nd ed.; Chapman & Hall/CRC: Boca Raton, FL, 2003.
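The interval-to-standard-deviation conversion in the example above can be checked numerically: for a normal distribution, plus or minus two standard deviations enclose roughly 95% of the probability mass, so a 95% interval of ±0.2 Da corresponds to a standard deviation of about 0.1 Da. Only the standard library is needed:

```python
import math

def normal_mass_within(half_width, sigma):
    """Probability that a normal variate falls within +/- half_width
    of its mean, given standard deviation sigma."""
    return math.erf(half_width / (sigma * math.sqrt(2)))

# 95% certainty that the mass is in 393.5 +/- 0.2 Da
# -> encode as normal with sd ~= 0.1 Da, as in the text's example
mass = normal_mass_within(0.2, 0.1)
print(round(mass, 3))  # prints 0.954
```

The enclosed mass comes out at 95.4%, confirming that the ±0.2 Da interval and the 0.1 Da standard deviation are consistent encodings of the same state of knowledge.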
Upon obtaining new information (i.e., collecting experimental data), probability distributions are updated through the use of Bayes’s theorem.
P(θ|D,X) = P(θ|X) P(D|θ,X) / P(D|X)    (4)
The terms are defined as follows: θ represents the parameters of interest, in this case, the slope and intercept of the line associated with the systematic bias observed in Figure 3, along with some measure of the average magnitude of the random scatter around that line. D represents the observed data, in this case, the measured m/z ratios and matching errors plotted in Figure 3. X represents the prior information, in this case, the library of possible sample components, knowledge of the data-generating mechanism, and initial values or probable ranges of values for the parameters of interest. P(θ|D,X) is the “posterior” probability distribution, which encodes information about the parameters conditional on both the data and the prior information. Obtaining a representation of the posterior distribution is the goal of all practical Bayesian analyses. P(θ|X) is the “prior” probability distribution for the parameters, which expresses everything known about them before the data are taken into account. In this case, relatively little information about the parameters was available prior to collecting the data, so “vague” prior distributions (i.e., distributions that give roughly equal weight to all reasonable values) were used. P(D|θ,X) is the “direct” probability of the data, that is, the probability of observing a particular data set given the values of the parameters. Its mathematical form is determined by prior information about the data-generating mechanism. Since the data are known and the parameters are not, this term is actually a function of the parameters, called the likelihood function. It is peaked at those values of the parameters most strongly supported by the data. P(D|X) is the prior probability of the data. It is constant with respect to the parameters.
In relation to the posterior distribution, it is the normalization constant of the probability kernel in the numerator of the equation, ensuring that the posterior distribution integrates to a total probability of one.

The data points in the population of correct matches do not fall exactly on the line because they are subject to random measurement errors, represented by ε in eq 1. We assumed for both theoretical and empirical reasons that the random measurement errors associated with correct matches are normally distributed. Theoretically, random measurement errors are composed of the sum of small independent discrepancies and so are expected to be approximately normal by the central limit theorem, which states that the sum of many random variables will have a probability distribution approaching normal, provided that no one term of the sum dominates all others. Empirically, the normal distribution was found to fit the data very well (see below). This assumption explicitly introduces a new parameter, the standard deviation of the random measurement errors. This parameter measures the average magnitude of the random scatter around the regression line and is represented by the symbol σe.
Figure 4. Histograms of matching errors, displaying the shape of the incorrect matching error distribution. The dashed gray curve is a normal distribution fit by eye to the background distribution.
The model must explicitly account for the presence of incorrect matches in the data. This constraint requires introducing new distributions and parameters, whose role is to distinguish between data points in the population of correct matches and data points in the population of incorrect matches. The new parameters are “nuisance” parameters in that they have no direct bearing on the desired data recalibration but are required by the model. The nuisance parameters fall into two categories, those necessary for estimating the density of incorrect matches in the data space and those necessary for separating the data points into correct and incorrect matches. Figure 4 is a histogram of matching errors in which the range and the bin widths have been chosen for good visualization of the distribution of incorrect matching errors. To properly take into account the confounding effect of the incorrect matches on the estimates of the regression line parameters, it is critical to model the incorrect distribution well in the region where the correct and incorrect matches overlap (that is, between matching errors of -0.02 and 0.00 Th). A prospective model cannot actually be verified in this region, but a model that fits the data well in the area immediately around the region of overlap can be used to interpolate the density of incorrect matches in the critical region. Also plotted in Figure 4 is a normal distribution fitted by eye to the data. It demonstrates that the normal distribution is a good model for the distribution of incorrect matches between -0.07 and +0.05 Th; this range is centered around -0.01 Th, where the correct matches were found, and includes a sufficiently large portion of the incorrect distribution to ensure accurate modeling in the critical region. 
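A truncated normal density over the window just described can be evaluated with only the standard library. In this sketch, the window (-0.07 to +0.05 Th) and center (-0.01 Th) follow the text, while the scatter value of 0.03 Th is an assumed illustration:

```python
import math

def truncnorm_pdf(x, mu, sigma, lo, hi):
    """Density of a normal(mu, sigma) truncated to [lo, hi].

    mu and sigma describe the *untruncated* normal (the roles played by
    the paper's mu_I and sigma_I); the mass outside [lo, hi] is
    renormalized back into the window.
    """
    if not (lo <= x <= hi):
        return 0.0
    phi = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    cdf = lambda z: 0.5 * (1 + math.erf((z - mu) / (sigma * math.sqrt(2))))
    return phi / (cdf(hi) - cdf(lo))

lo, hi = -0.07, 0.05
pdf_center = truncnorm_pdf(-0.01, -0.01, 0.03, lo, hi)   # inside the window
pdf_outside = truncnorm_pdf(0.06, -0.01, 0.03, lo, hi)   # excluded by truncation
```

The renormalization is what distinguishes this from a plain normal: the density integrates to one over the window alone, matching the fact that data outside the window were excluded.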
Since an accurate model outside the critical region was not necessary, the data were truncated to exclude points with matching errors above +0.05 Th and below -0.07 Th, and the distribution of incorrect matches was therefore modeled as a truncated normal distribution. This choice introduced two new parameters, the center parameter µI and the scatter parameter σI, which are the mean and standard deviation of the equivalent untruncated normal distribution. To separate the data into correct and incorrect matches, we defined an indicator variable li for each measured m/z ratio, which is equal to 1 if the list of tentative matches to that measured m/z ratio contains the correct match and is equal to 0 if the list of tentative matches to that measured m/z ratio does not contain
the correct match. The index i runs from 1 to the total number of measured m/z ratios. This set of indicator variables is rather like a set of flips of a biased coin, where heads (1) is equivalent to the correct peptide for the measured m/z ratio being present in the library, and tails (0) is equivalent to the correct peptide for the measured m/z ratio being missing from the library. The bias of the coin is equivalent to the fraction of measured m/z ratios whose correct peptide is present in the library. Since this fraction of coverage is not known in advance, we introduce it explicitly as the probability that li is equal to 1 for each measured m/z ratio, and give it the symbol λ. Properly speaking, there should be a unique probability for each group of measured m/z ratios with the same charge state and number of matches in the library; however, in the interests of simplicity, the same probability was applied to all measured m/z ratios. Having defined an indicator variable for measured m/z ratios, we then did the same for the matches themselves. For each member of the list of tentative matches to the ith measured m/z ratio, we defined ci,j to be equal to 0 if that match is incorrect and equal to 1 if that match is correct. The index j runs from 1 to the total number of tentative matches to the ith measured m/z ratio. Note the relationship between li and ci,j,
li = ∑j ci,j    (5)
that is, if the list of tentative matches contains the correct match, then one of the ci,j is equal to 1 and li is equal to 1; if the list of tentative matches does not contain the correct match, then all of the ci,j are equal to 0 and li is also equal to 0. The data do not contain enough information to determine the values of the li and ci,j indicator variables; that is, the data cannot inform us whether a particular match is correct or incorrect. However, the data do allow us to estimate the probability that each match is correct. We used the symbol pi,j to represent the probability that the jth match to the ith measured m/z ratio is correct, given that the list of tentative matches to the ith measured m/z ratio does contain the correct match. We now summarize the model, to clarify the role of each parameter. An ion’s m/z ratio and charge are measured, and a list of potential matches is retrieved from the library. There is a probability λ that the correct match is found in the list of matches; given that the correct match is present, each match has a probability pi,j of being the correct match for that measured m/z ratio. The matching errors of the correct matches are normally scattered around the line defined in eq 1; the standard deviation of the scatter, σe, is unknown. The matching errors of the incorrect matches are drawn from a truncated normal distribution whose center parameter, µI, and scatter parameter, σI, are unknown. This model implicitly provides a full specification of the likelihood function (the analytical expression is derived in the Supporting Information). To obtain the posterior distribution, prior distributions must be assigned to the parameters of the model, i.e., a, b, σe, µI, σI, λ, and pi,j. Most of the prior distributions encoded a negligible amount of information about these variables relative to the information in the data, so the particular forms of the prior distributions were chosen for algorithmic tractability. 
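For a single matching error, the likelihood implied by this model is a two-component mixture: a normal component centered on the systematic-bias line for a correct match and a truncated-normal background for an incorrect match. The sketch below collapses the indicator variables li and ci,j into a single mixture weight w, which is a simplification of the full likelihood (derived in the Supporting Information); the parameter values in the usage lines are illustrative only.

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * SQRT2PI)

def trunc_normal_pdf(x, mu, sigma, lo, hi):
    if not (lo <= x <= hi):
        return 0.0
    cdf = lambda z: 0.5 * (1 + math.erf((z - mu) / (sigma * math.sqrt(2))))
    return normal_pdf(x, mu, sigma) / (cdf(hi) - cdf(lo))

def match_likelihood(err, mz, a, b, sigma_e, mu_i, sigma_i, w,
                     lo=-0.07, hi=0.05):
    """Marginal density of one matching error: weight w on the 'correct'
    component (normal around the bias line a*mz + b, sd sigma_e) and
    weight 1 - w on the 'incorrect' truncated-normal background.
    Collapsing the indicators into one weight w is a simplification."""
    line = a * mz + b                      # systematic bias at this m/z
    return (w * normal_pdf(err, line, sigma_e)
            + (1 - w) * trunc_normal_pdf(err, mu_i, sigma_i, lo, hi))

# illustrative parameter values; the full log likelihood of a data set
# would be the sum of log densities over all tentative matches
L_in = match_likelihood(-0.01, 800.0, -2.35e-5, 8.91e-3, 9.6e-4, -0.01, 0.03, 0.3)
L_out = match_likelihood(0.03, 800.0, -2.35e-5, 8.91e-3, 9.6e-4, -0.01, 0.03, 0.3)
```

An error lying on the bias line draws most of its density from the sharp correct-match component, while an error far from the line is explained almost entirely by the broad background; this contrast is what lets the sampler apportion probability between the two populations.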
The parameters a, b, and µI were assigned normal prior distributions with mean 0 and standard deviation 10^3; these prior
distributions are essentially flat over all reasonable values of the parameters and so do not unduly influence the estimates of these parameters. The λ parameter was assigned a uniform prior distribution over values from 0 to 1; that is, all possible values were taken to be equally likely. The pi,j parameters are discrete probabilities and therefore must take values between 0 and 1; furthermore, their definition implies that, for a given i, the sum Σj pi,j must be equal to unity. These constraints define a multidimensional space of possible values for the pi,j parameters. The pi,j parameters were given uniform prior distributions over this multidimensional space.

There was some prior information about the standard deviation of random measurement errors, σe, and the scatter parameter of incorrect matches, σI. Properly calibrated measured m/z values should have errors on the order of 10^-3 Th for this type of FTICR-MS,22 and the distribution of matching errors of incorrect matches should have a large scatter relative to that of correct matches. For algorithmic convenience, the “precisions” (the squared inverses of σe and σI) were given gamma prior distributions. Gamma distributions have two parameters, which may be thought of as the “expected standard deviation” and the “strength of expectation”. For σe, we used an expected standard deviation of 10^-3 Th and a strength of expectation equivalent to that provided by six data points. Under this prior distribution, there is a 95% probability that σe is between 5.8 × 10^-4 and 2.3 × 10^-3 Th. For σI, we used an expected standard deviation of 3 × 10^-2 Th and a strength of expectation equivalent to four data points. Under this prior distribution, there is a 95% probability that σI is between 1.5 × 10^-2 and 1.0 × 10^-1 Th.
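One common way to realize an "expected standard deviation with the weight of n data points" is a gamma prior on the precision τ = 1/σ² with shape n/2 and scale chosen so that E[τ] = 1/σ₀². The sketch below is an assumed reconstruction, since the exact parameterization is not spelled out here, so its interval will not reproduce the quoted numbers exactly; it uses Monte Carlo quantiles rather than an analytic inverse gamma CDF to stay within the standard library.

```python
import math
import random

def sigma_interval(n_pseudo, expected_sigma, draws=200_000, seed=1):
    """Monte Carlo 95% interval for sigma under a gamma prior on the
    precision tau = 1/sigma**2.

    Assumed parameterization (illustrative): shape = n_pseudo / 2, with
    the scale set so that E[tau] = 1 / expected_sigma**2, i.e. the
    prior 'expects' standard deviation expected_sigma with the weight
    of n_pseudo data points.
    """
    rng = random.Random(seed)
    shape = n_pseudo / 2
    scale = 1.0 / (shape * expected_sigma ** 2)   # E[tau] = shape * scale
    sigmas = sorted(1.0 / math.sqrt(rng.gammavariate(shape, scale))
                    for _ in range(draws))
    return sigmas[int(0.025 * draws)], sigmas[int(0.975 * draws)]

# prior on sigma_e: expected sd 1e-3 Th, weight of six data points
sig_lo, sig_hi = sigma_interval(6, 1e-3)
```

Under this assumed parameterization the 95% interval comes out near (6 × 10^-4, 2.2 × 10^-3) Th, the same order as the interval quoted in the text.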
Note that although these prior distributions are relatively informative when compared to the prior distributions chosen for the other parameters, the information they contribute to the posterior distribution is small compared to the information in the data. The posterior distribution of the parameters is now fully specified, but unfortunately, it is not analytically or numerically tractable, as it involves every possible combination of correct and incorrect matches permitted by the data, of which there are ∼1.3 × 10547. (This is the product of the number of possible assignments for each detected m/z ratio.) Fortunately, this problem can be overcome by a method called Markov chain Monte Carlo (MCMC).21,23 In this approach, multiple samples are drawn from the posterior distribution, each sample consisting of a vector containing values for every parameter in the problem. The elements of each sample containing the nuisance parameters can be discarded; the elements of each sample containing the interesting parameters can be used to construct histograms and statistics. We used the WinBUGS software package24 to collect 12 000 samples from the joint posterior distribution of the parameters. (The following technical points will be of concern to those familiar with MCMC techniques. A burn-in of 2000 iterations was used, and convergence was assessed using the GelmanRubin convergence statistic, as modified by Brooks and Gelman,25 and also by examining the stability of the 2.5, 50, and 97.5% (22) Johnson, K. L.; Mason, C. J.; Muddiman, D. C.; Eckel, J. E. Anal. Chem. 2004, 76, 5097-5103. (23) Brooks, S. P. Statistician 1998, 1, 69-100. (24) Spiegelhalter, D.; Thomas, A.; Best, N.; Lunn, D. WinBUGS User Manual Version 1.4, 2003. Available: http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/ contents.shtml. (25) Brooks S. P.; Gelman, A. J. Comput. Graph. Stat. 1998, 4, 434-455.
Analytical Chemistry, Vol. 77, No. 22, November 15, 2005
7251
Table 1. Posterior Means and Standard Deviations for the Model Parameters

parameter (units)     posterior mean    posterior standard deviation^a
a (dimensionless)     -2.35 × 10^-5     0.06 × 10^-5
b (Th)                 8.91 × 10^-3      0.5 × 10^-3
σe (Th)                9.6 × 10^-4       0.6 × 10^-4

^a The standard deviations are reported at the same order of magnitude as the means to aid in assessing the accuracy of the estimates.
quantiles of the parameter distributions for three chains run from different initial values.)

Measurement Error Estimation. For the parameters a, b, and σe, the samples from all the single-variable posterior distributions were checked for normality by use of normal quantile-quantile plots (not shown); in each case, the central 95% of the probability mass was well represented by the normal distribution. These posterior distributions can therefore be succinctly described by reporting the posterior mean, which is effectively the best estimate of the parameter value, and the posterior standard deviation, which quantifies the uncertainty in that estimate. Table 1 shows the means and standard deviations of the posterior distributions of the parameters of interest, as calculated from the 12 000 samples from the joint posterior distribution (see Table S-1 in the Supporting Information for the rough microsome equivalent).

It is worth noting that the estimated standard deviation of the random measurement error (reported in Table 1), 9.6 × 10^-4 Th, is consistent with the expected accuracy of FTICR-MS (it corresponds to an error of 1.3 ppm for an ion at 750 Th). The model is constrained to allocate all of the variation in the data that cannot be explained by the systematic bias term or by the presence of incorrect matches to the random measurement error. If the model were inappropriate for these data, any lack of fit between the model and the data would tend to inflate the estimated value of σe; the fact that the estimate is consistent with our prior information about its value indicates that the model is adequate to our needs.

To make inferences about the position of the regression line, it is necessary to examine the joint posterior estimates of the slope and intercept parameters, the goal being to estimate the regression line and its bounds.
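The ppm figure quoted above is a direct unit conversion; a minimal sketch (the function name is ours):

```python
def error_ppm(sigma_th, mz):
    """Convert an absolute m/z error (in Th) at a given m/z to parts per million."""
    return sigma_th / mz * 1e6

# sigma_e = 9.6e-4 Th at m/z 750 corresponds to about 1.3 ppm
print(round(error_ppm(9.6e-4, 750.0), 1))   # -> 1.3
```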
Figure 5 shows a plot of the 12 000 samples from the joint posterior distribution of the slope parameter, a, and the intercept parameter, b (see Figure S-4 in the Supporting Information for the rough microsome equivalent). Also shown is an ellipse enclosing 95% of the posterior probability, under the assumption that the posterior distribution is bivariate normal. The best estimate of a and b is the mean of the distribution, whose location is reported in Table 1, but any pair of (a, b) values within the ellipse defines a reasonable regression line. Figure 6 shows the plot of measured m/z ratio versus matching error with the best estimate of the regression line and the boundary curves within which reasonable regression lines are located (see Figure S-5 in the Supporting Information for the rough microsome equivalent). Although the distinction is difficult to see in the figure, the boundary curves do not cross; they enclose the best estimate of the regression line but never touch it.
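An ellipse of the kind shown in Figure 5 follows from the bivariate-normal assumption: the 95% region is the set of points x with (x − μ)ᵀΣ⁻¹(x − μ) ≤ χ²₂(0.95) ≈ 5.991, with μ and Σ estimated from the posterior samples. A sketch with synthetic correlated draws standing in for the (a, b) samples (all names are ours):

```python
import random

random.seed(2)

def ellipse_95(xs, ys):
    """Return (mean, inverse covariance, threshold) defining the 95% ellipse
    (x - mu)' C^-1 (x - mu) <= chi2_2(0.95) for bivariate-normal samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    inv = (syy / det, -sxy / det, sxx / det)   # entries of C^-1
    return (mx, my), inv, 5.991                # chi-square(2) 95% quantile

def inside(pt, mu, inv, thr):
    """True if the point lies inside the ellipse."""
    dx, dy = pt[0] - mu[0], pt[1] - mu[1]
    return inv[0] * dx * dx + 2 * inv[1] * dx * dy + inv[2] * dy * dy <= thr

# 12 000 correlated Gaussian samples, mimicking the (a, b) posterior draws.
xs = [random.gauss(0.0, 1.0) for _ in range(12000)]
ys = [0.8 * x + random.gauss(0.0, 0.6) for x in xs]
mu, inv, thr = ellipse_95(xs, ys)
frac = sum(inside(p, mu, inv, thr) for p in zip(xs, ys)) / len(xs)
```

For truly bivariate-normal samples, `frac` should come out very close to 0.95, which is exactly the check the normality assumption invites.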
Figure 5. Scatterplot of the posterior probability samples of the slope parameter and the intercept parameter. Each point in the figure defines a line in Figure 4; the plausibility that the true values of the parameters are found at any point is indicated by the density of posterior samples at that point. Also plotted is an ellipse enclosing 95% of the posterior samples; there is therefore a 95% probability that the true values of the parameters are within that ellipse. The vertical striations in the plot are an artifact due to the discretization of the slope parameter.
Figure 6. Scatterplot of measured m/z ratios versus matching errors, along with the estimated regression line (solid) and boundary curves (dashed) for lines within the 95% posterior credible ellipse.
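Taking the Table 1 point estimates at face value, and assuming the systematic bias has the linear form bias(m/z) = a·(m/z) + b implied by the regression line in Figure 6, the recalibration amounts to a simple subtraction. A sketch (function names are ours):

```python
A = -2.35e-5   # slope a from Table 1 (dimensionless)
B = 8.91e-3    # intercept b from Table 1 (Th)

def bias(mz):
    """Estimated systematic error (Th) at a given measured m/z."""
    return A * mz + B

def recalibrate(mz):
    """Corrected m/z: subtract the estimated systematic bias."""
    return mz - bias(mz)

# The estimated bias grows roughly linearly in magnitude across the mass range:
print(round(bias(550.0) * 1e3, 1), "mTh")    # -> -4.0 mTh
print(round(bias(1400.0) * 1e3, 1), "mTh")   # -> -24.0 mTh
```

The -4 and -24 mTh values agree with the smooth microsome calibration errors quoted in the Conclusion, a useful consistency check on the Table 1 estimates.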
Using the mean of the joint distribution of the parameters to estimate the regression line, we can now carry out our original goal of obtaining corrected m/z ratios by subtracting the systematic bias, as in eq 3. Figure 7 shows a plot of corrected m/z ratios versus matching error; it is evident that the assumed linear correction has eliminated most of the systematic measurement error. Figure 8 shows the histogram of the matching errors of the corrected m/z ratios (see Figure S-6 in the Supporting Information for the rough microsome equivalent). The similarity between this histogram and that of Figure 2a shows that the correction has been successful: correct matches are now clustered around zero in a sharp spike that is clearly distinguishable from the background of incorrect matches. Also plotted is a normal distribution centered at zero with a standard deviation equal to the estimated standard deviation of the random measurement errors reported in Table 1. This analysis provides visual confirmation that the normal
Figure 7. Scatterplot of corrected m/z ratios versus matching errors (see text).
Figure 8. Histogram of matching errors calculated based on corrected m/z ratios. The gray curve is the distribution of random measurement errors estimated by the model.
distribution fits the distribution of random measurement errors and that the estimate of the standard deviation of the random measurement errors is accurate.

Since these data were collected without automatic gain control, the number of ions in the ICR cell changes considerably from scan to scan as the peptides elute. One might therefore expect varying space-charge effects to make the random errors much more extreme than was observed. But errors due to space charge are not entirely random; the average space-charge effect, due to the difference between the number of ions present during the external calibration and the average number of ions during the LC run, contributes to the systematic error in the model and is therefore corrected by the method. Only the variation of the space-charge effect around its average value appears as random error in the model.

CONCLUSION

The effective use of the large amount of information in LC-FTICR-MS data sets in high-throughput proteomics requires high confidence in the accuracy of the measured m/z ratios. This study has demonstrated that it is possible to estimate the systematic and random errors in the experimental m/z measurements based on partial knowledge of the composition of the sample, which is
usually available for biologically interesting samples. The Bayesian statistical approach yields knowledge of the region in which reasonable regression lines will be found, which permits the recalibration of every measured m/z ratio, even those for which no match has been found in the peptide library. The present approach also gives an experiment-specific estimate of the typical random error in the corrected measurements.

In this particular case, the analysis was applied to two LC-FTICR-MS runs. In the smooth microsome data set, linear calibration errors affecting every spectrum were found to range in magnitude from -4 mTh at 550 Th to -24 mTh at 1400 Th. These systematic errors were removed, and the resulting corrected values were subject only to random measurement errors, estimated to have a standard deviation of 0.96 ± 0.06 mTh. In the rough microsome data set, linear calibration errors affecting every spectrum were found to range in magnitude from -4 mTh at 550 Th to -29 mTh at 1400 Th. These systematic errors were removed, and the resulting corrected values were subject only to random measurement errors, estimated to have a standard deviation of 0.88 ± 0.06 mTh.

The strength of this technique lies in its ability to provide these inferences even though there is no a priori information about which matches are correct and which are incorrect. This feature depends critically on the ultrahigh mass accuracy of the FTICR mass spectrometer: other mass spectrometers have mass accuracies on the same order as the dispersion of the incorrect matches, and thus only extremely severe systematic errors would be distinguishable from the background. The general approach can be adapted to any problem in which more than one category of data points is collected in a single data set with no way to determine with certainty to which category any given point belongs.
For example, in the present case, different systematic bias models could easily be accommodated simply by changing eq 2. A simple way to understand the manner in which the Bayesian model estimates the parameters associated with the different categories of data is to consider the intuitively appealing approach of giving each point a weight representing, in some sense, its "degree of membership" in each category and then estimating the parameters associated with each category using weighted averages. To determine the ideal weightings, however, one must know the parameters associated with each category beforehand; one is essentially trapped in a "chicken-or-egg" quandary. The Bayesian approach provides a statistical justification for the intuitive procedure, in that it simultaneously weights each point according to its probability of membership in each category and estimates the parameters associated with each category.

ACKNOWLEDGMENT

This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada, Genome Quebec, the Canadian Institutes of Health Research, the National Science Foundation (CHE-99-09502), Florida State University, and the National High Magnetic Field Laboratory in Tallahassee, FL. The authors are indebted to John Bergeron, who was instrumental in arranging for funding, in the coordination of the research groups at McGill University and Florida State University, and in the choice of samples and data collection methodology. We gratefully acknowledge the contribution of Mark R. Emmett. C.M.Y. thanks David Spiegelhalter for his guidance in the use of the WinBUGS software package.
SUPPORTING INFORMATION AVAILABLE

Effect of charge state on the analysis, definition and properties of the probability distributions used in the article, derivation of the analytical expression of the likelihood function, rough microsome data set analysis, and availability of the method. This material is available free of charge via the Internet at http://pubs.acs.org.
Received for review April 14, 2005. Accepted August 26, 2005.
AC050640Q