Anal. Chem. 2006, 78, 7397-7409

Robust Algorithm for Alignment of Liquid Chromatography-Mass Spectrometry Analyses in an Accurate Mass and Time Tag Data Analysis Pipeline

Navdeep Jaitly,† Matthew E. Monroe,‡ Vladislav A. Petyuk,† Therese R. W. Clauss,† Joshua N. Adkins,‡ and Richard D. Smith*,‡

Environmental Molecular Science Laboratory and Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352

Liquid chromatography coupled to mass spectrometry (LC-MS) and tandem mass spectrometry (LC-MS/MS) has become a standard technique for analyzing complex peptide mixtures to determine composition and relative abundance. Several high-throughput proteomics techniques attempt to combine complementary results from multiple LC-MS and LC-MS/MS analyses to provide more comprehensive and accurate results. To effectively collate and use results from these techniques, variations in mass and elution time measurements between related analyses need to be corrected using algorithms designed to align the various types of data: LC-MS/MS versus LC-MS/MS, LC-MS versus LC-MS/MS, and LC-MS versus LC-MS. Described herein are new algorithms referred to collectively as liquid chromatography-based mass spectrometric warping and alignment of retention times of peptides (LCMSWARP), which use a dynamic elution time warping approach similar to traditional algorithms that correct for variations in LC elution times using piecewise linear functions. LCMSWARP is compared to the equivalent approach based upon linear transformation of elution times. LCMSWARP additionally corrects for temporal drift in mass measurement accuracies. We also describe the alignment of LC-MS results and demonstrate their application to the alignment of analyses from different chromatographic systems, showing the suitability of the present approach for more complex transformations.

* To whom correspondence should be addressed. E-mail: [email protected].
† Environmental Molecular Science Laboratory.
‡ Biological Sciences Division.
(1) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(2) Eng, K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(3) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.

Liquid chromatography-mass spectrometry (LC-MS)-based proteomics analyses of peptide mixtures can provide global qualitative and quantitative information about biological systems.1 Qualitative information in the form of protein identifications is generated by software such as SEQUEST,2 Mascot,3 ProteinLynx Global Server, Spectrum Mill,4 Protein Prospector,5 ProbID,6 and
X!Tandem7-9 that analyze MS/MS spectra of peptides derived from a global protease digestion (e.g., trypsin). Quantitative information is generally extracted from the precursor ion (MS1) scans in these analyses or from all scans using higher resolution mass accuracy LC-MS analyses (i.e., without MS/MS). The latter analyses generally provide better quantitative information because of a higher ion sampling rate and improved ion discrimination than achieved in lower resolution analyses. Programs such as Decon2LS (http://ncrr.pnl.gov/software) and MSInspect (http://proteomics.fhcrc.org/) implement deisotoping algorithms such as THRASH10 that extract masses and elution times of the ion species present in LC-MS spectra. However, due to challenges with run-to-run variations in both the separation and mass analysis dimensions, more sophisticated alignment algorithms are needed to extract higher quality information from large scale experiments involving multiple analyses.

(4) Clauser, K. R.; Baker, P.; Burlingame, A. L. In 44th ASMS Conference on Mass Spectrometry and Allied Topics; Portland, OR, May 12-16, 1996; 365.
(5) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71, 2871-2882.
(6) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412.
(7) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.
(8) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. Proteomics 2005, 5, 3475-3490.
(9) Field, H. I.; Fenyo, D.; Beavis, R. C. Proteomics 2002, 2, 36-47.
(10) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320-332.
(11) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolic, L.; Smith, R. D. Anal. Chem. 2000, 72, 3349-3354.
(12) Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Anal. Chem. 2002, 74, 5826-5830.
(13) Smith, R. D.; Pasa-Tolic, L.; Lipton, M. S.; Jensen, P. K.; Anderson, G. A.; Shen, Y.; Conrads, T. P.; Udseth, H. R.; Harkewicz, R.; Belov, M. E.; Masselon, C.; Veenstra, T. D. Electrophoresis 2001, 22, 1652-1668.
(14) Pierce, K. M.; Wood, L. F.; Wright, B. W.; Synovec, R. E. Anal. Chem. 2005, 77, 7735-7743.

A number of approaches have been developed and applied for high-throughput proteomic applications that rely on combined results obtained from different experimental platforms.11-13 While the computational steps used in each approach are unique, all of the methods rely on using the mass and LC elution time information of detected species observed across several experiments to find common chemical species. Some approaches use raw m/z ions and the entire elution profile of each m/z bin14-16 while others use monoisotopic masses and the peak apex of
elution profiles of chemical species inferred to be in the samples using preprocessing algorithms.13,17,18 For example, the accurate mass and time (AMT) tag approach developed at our laboratory combines LC-MS/MS analyses with high mass measurement accuracy (MMA) LC-Fourier transform ion cyclotron resonance (FTICR) MS analyses in a relatively high-throughput pipeline. In this approach, results from both LC-MS and LC-MS/MS analyses are converted into lists of monoisotopic masses and elution times. Results from different data sets are combined by finding the transformation functions of mass and elution times that are required to remove variability in mass and elution time measurements between analyses. An alternative approach by Radulovic et al. utilizes a software suite for finding features common to LC-MS and LC-MS/MS experiments performed using ion trap mass spectrometers. The software bins peaks from MS scans by m/z bins and uses signal processing algorithms to discover peaks in the chromatographic dimension and to create “pamphlets”, which contain pixels for peaks identified. Pamphlets from different experiments are aligned by using a 2D smoothing spline in the m/z and time dimensions to correct for m/z and time drift.16 Because of the lower resolution of the instruments used, these data sets are hard to deisotope, and processing is done on peak-level information. Signal processing algorithms are used to reduce noise, by requiring that detectable peaks be observed over multiple consecutive spectra in the chromatographic dimension. Due to the use of LC-MS/MS data sets, this approach provides peptide identifications with the quantitative information available from precursor ion MS scans. Listgarten et al. describe a method to concurrently align multiple data sets and normalize intensities by using a continuous profile model.19 While this approach has the additional appealing feature that it provides a consensus alignment across all data sets simultaneously, it is computationally intensive and is based upon use of total ion chromatogram data. Another recently proposed pipeline creates “signal maps” from MS peaks of an LC-MS analysis and uses similarities and differences in signal maps to identify changing features.15 Noise peaks are removed by assuming that peaks without any corresponding peaks in ±5 scans or without other peaks in the expected isotopic envelope of the peak are spurious noise peaks. It performs a dynamic programming alignment using a score that assumes the similarity of intensity profiles of mass spectra in different LC-MS analyses. Multiple analyses are combined in a progressive strategy of aligning and merging data sets based on similarity. Several other approaches for aligning LC-MS and gas chromatography (GC/MS) data sets have also previously been described in the literature.20,21

(15) Prakash, A.; Mallick, P.; Whiteaker, J.; Zhang, H.; Paulovich, A.; Flory, M.; Lee, H.; Aebersold, R.; Schwikowski, B. Mol. Cell. Proteomics 2006, 5, 423-432.
(16) Radulovic, D.; Jelveh, S.; Ryu, S.; Hamilton, T. G.; Foss, E.; Mao, Y.; Emili, A. Mol. Cell. Proteomics 2004, 3, 984-997.
(17) Kearney, P.; Thibault, P. J. Bioinf. Comput. Biol. 2003, 1, 183-200.
(18) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Anal. Chem. 2003, 75, 4818-4826.
(19) Listgarten, J.; Neal, R. M.; Roweis, S. T.; Emili, A. In Advances in Neural Information Processing Systems 17; MIT Press: Cambridge, MA, 2005.
(20) Bylund, D.; Danielsson, R.; Malmquist, G.; Markides, K. E. J. Chromatogr., A 2002, 961, 237-244.
(21) Zimmer, J. S.; Monroe, M. E.; Qian, W. J.; Smith, R. D. Mass Spectrom. Rev. 2006, 25, 450-482.

Each of these approaches has its own computational requirements and implicit
challenges based on how the data are preprocessed. Approaches dealing with data that have been minimally preprocessed tend to be more computationally intensive and complicated, but potentially more accurate. Here we describe an algorithm for alignment of LC-MS data sets in the context of an AMT tag approach and involving alignment in two different scenarios: (1) LC-MS analyses with LC-MS/MS analyses and (2) LC-MS analyses with other LC-MS analyses. LC-MS/MS analyses are also aligned to each other, but that procedure will not be discussed in this paper. Before describing the algorithm, we briefly describe a statistical model to characterize the variability of the mass and LC elution time measurements we have observed for the preprocessed data in LC-MS and LC-MS/MS measurements. The liquid chromatography-based mass spectrometric warping and alignment of retention times of peptides (LCMSWARP) algorithm uses a scoring method based on this model to quantify the similarity between chromatographic subsections of the experiments and uses the scores to find a transformation function of mass and elution time to produce the best piecewise linear alignment between two experiments. We show that LCMSWARP effectively corrects for both mass and elution time variability in a typical LC-FTICR MS analysis of a complex sample. Finally, we illustrate additional utility of the algorithm by demonstrating alignments of data sets from two disparate LC systems that use different separation columns and different gradient shapes (i.e., linear versus exponential) and by aligning two experiments with missing subsections, an important development for dissemination of analytical methods to different laboratories and analysis platforms.

EXPERIMENTAL PROCEDURES

Sample Preparation. Two different Salmonella typhimurium strains were grown in three different culture conditions (one mimicking macrophage cell infection) and then subjected to offline strong cation exchange followed by online reversed-phase LC-MS/MS. A detailed methods description can be found elsewhere.22 The same sample preparation method was used for the related LC-MS experiments.

Liquid Chromatography and Mass Spectrometry. Sample separation was achieved using a constant-pressure capillary HPLC system that was manufactured in-house. Details of this system are described elsewhere.22,23 Reversed-phase capillary HPLC columns were manufactured in-house by slurry packing 5-µm Jupiter C18 stationary phase (Phenomenex, Torrance, CA) into a 60-cm length of 360 µm o.d. × 150 µm i.d. fused-silica capillary tubing (Polymicro Technologies Inc., Phoenix, AZ) incorporating a 2-µm retaining screen in a 1/16-in. capillary-bore union (Valco Instruments Co., Houston, TX). The HPLC system was equilibrated at 5000 psi with 100% mobile phase A (0.2% acetic acid and 0.05% TFA in water) for initial starting conditions. Mobile phase was switched from A to B (0.1% TFA in 90% acetonitrile/10% water) 20 min after sample injection, creating an exponential gradient as mobile phase B displaced A in the mixer.

(22) Adkins, J.; Mottaz, H. M.; Norbeck, A. D.; Rue, J.; Clauss, T.; Purvine, S. O.; Heffron, F.; Smith, R. D. Mol. Cell. Proteomics 2006, 5, 1450-1461.
(23) Shen, Y.; Tolic, N.; Zhao, R.; Pasa-Tolic, L.; Li, L.; Berger, S. J.; Harkewicz, R.; Anderson, G. A.; Belov, M. E.; Smith, R. D. Anal. Chem. 2001, 73, 3011-3021.

Approximately 5 cm of 360-µm-i.d. fused-silica tubing
packed with 5 µm of C18 was used to split ∼25 µL/min of flow before the injection valve. The split flow controls gradient speed under conditions of constant-pressure operation. Flow through the capillary HPLC column was ∼2 µL/min when equilibrated to 100% mobile phase A.

FTICR-MS analysis was performed using a ThermoElectron model LTQ-FT linear ion trap-FTICR hybrid mass spectrometer (ThermoElectron Corp., San Jose, CA) with electrospray ionization (ESI). The HPLC column was coupled to the mass spectrometer using an in-house-manufactured interface. No sheath gas or makeup liquid was used. The heated capillary temperature and spray voltage were 200 °C and 2.2 kV, respectively. Data acquisition began 20 min after sample injection and continued for 100 min over a mass (m/z) range of 400-2000. The automatic gain control target was 30 000 for the full-scan ion trap and 10⁶ for the FTICR cell. For each cycle, the two most abundant ions from MS analysis were selected for MS/MS analysis using a collision energy setting of 35%. A dynamic exclusion time of 60 s and a single repeat count were used to discriminate against previously analyzed ions. Tandem mass spectral analysis was performed using a ThermoElectron model LTQ ion trap mass spectrometer (ThermoElectron Corp., San Jose, CA) with ESI. For each cycle, the ten most abundant ions from MS analysis were selected for MS/MS analysis, using a collision energy setting of 45%. Dynamic exclusion was used to discriminate against previously analyzed ions.

An Agilent 1100 2D nanoflow system (Palo Alto, CA) was used for the comparative study of alignment across LC systems. A Zorbax 300SB-C18, 5 mm × 300 µm, 5-µm column was used for enrichment, while a Zorbax SB-C18, 30 cm × 75 µm column with 300-Å pores and 3.5-µm particles was used for separation. Mobile phase A was 0.02% TFA, 0.1% acetic acid in water, and mobile phase B was 0.02% TFA, 0.1% acetic acid in acetonitrile. The enrichment column was switched to position 1 at 0 min, switched to position 2 at 5 min, and returned to position 1 at 135 min. Unlike the constant-pressure system described above, the Agilent 1100 2D maintained a constant flow rate of 300 nL/min, with the pressure varying as needed.

DISCUSSION

Terminology. Chemical species in LC-MS experiments display a three-dimensional signature in the data corresponding to the dimensions of m/z, elution time, and intensity. In each mass spectrum, a detected species generally gives rise to an isotopic profile (envelope) in the m/z dimension with peaks observable for each isotope (depending on the resolution of the instrument and the charge of the ions). In the chromatographic dimension, each resolvable charged species also gives rise to LC elution profiles (peaks). Similarly, LC-MS/MS provides a three-dimensional signature in m/z, LC elution time, and intensity from MS spectra along with additionally associated MS/MS spectra, which are typically interpreted using a database searching tool such as SEQUEST to provide peptide sequence identification. Here we use the term feature to refer to the unique charged species that is inferred to give rise to the three-dimensional profile (with additional MS/MS spectra in LC-MS/MS experiments) that is observed in the data. Each such feature has a representative monoisotopic mass and an LC elution time that is the time of the peak apex of its elution profile. (From here on, elution time will be used to refer to the time of the apex of the LC elution profile of a feature.)

A chemical species when observed by LC-MS is referred to as an LC-MS feature, while the same species observed in an LC-MS/MS experiment is referred to as an LC-MS/MS feature, although both features are observations of the same underlying chemical species. The mass of an LC-MS feature is the monoisotopic mass determined from deisotoping the isotopic profile using a modified version of the THRASH algorithm,10 while the mass of the LC-MS/MS feature is the mass of the theoretical peptide sequence and posttranslational modifications (if any) of the peptide. More details of the algorithms used to discover the LC-MS and LC-MS/MS features can be found elsewhere21 and in the Supporting Information. Importantly, the same chemical species detected in multiple analyses are grouped together by assuming that features with similar mass and elution times in different experiments arise from the same underlying chemical species. This grouping of features across different experiments (regardless of the type of experiment) is referred to as finding common features, but actually represents the grouping of features based on the assumption that within some defined limits the same chemical species gave rise to each observed feature. A feature common to multiple LC-MS experiments is referred to as a feature cluster, while a feature common to multiple LC-MS/MS measurements, which is identified with some level of confidence from database search scores (such as cross correlation in SEQUEST and probability values in PeptideProphet24), is referred to as a mass and time tag (MT). In our previous work, we have applied linear correction of elution time to adjust for variability in elution time before finding common features (see ref 21 and Supporting Information).

Monoisotopic Mass and Peak Apex Elution Time Variability Model. LC separation of a complex peptide mixture is possible because of inherent differences in chemical and physical properties of the peptides. Different peptides interact with varying specificity with the mobile and stationary phases, which results in a wide range of elution times depending on factors that include the flow rate, system design (e.g., dead volumes), initial sample injection (e.g., variations in sample volume), temperature, reversed-phase gradient shape, noise in the gradient, e.g., due to mixing of the two solvents, variations in flow rate (or pressure, depending upon the system design), etc. In multiple liquid chromatography experiments with exactly the same configuration and same sample, a peptide would ideally elute at exactly the same time because each of these factors would be identical. In practice, however, the observed elution time is the result of small deviations in several of these factors. The greatest differences from analysis to analysis in our measurements are manifested as differences in the dead time and relative separation speeds, i.e., the time between the start of an analysis and when the first analyte elutes from the separation column and the speed of the reversed-phase gradient profile. These differences are actually linked and arise from the use of a constant-pressure LC system, which unavoidably results in some flow rate variations due to changes in the solvent flow arising from changes in the porosity of either the analytical column or the flow splitter (in practice, another column joined at a tee just before the sample injection valve).
(24) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.

Figure 1. (a) Plot of the relationship between scan numbers of LC-MS features of an LC-MS experiment with elution time nonlinearity and the NET values of the MTs these features matched to by using linear alignment. Each LC-MS feature is represented as a red dot with the x coordinate equal to the elution time (in scan number) of the LC-MS feature, and the y coordinate the NET value of the best MT it matched to. Also shown is the linear transformation function (NET = 0.00035 × scan − 0.445) between scan numbers and NET values used for this alignment (solid green line). (b) Plot of the residuals between the NET values of the MTs and the NET values arrived at by application of the linear transformation function to the scan numbers of the LC-MS features. The trend in the red dots illustrates local nonlinearities in the data, which make it necessary to use wider tolerances than are necessitated by the actual spread when matching. The transformation found by LCMSWARP is shown by the dashed blue line on top of the red points. (c, d) Mass measurement accuracy can drift over an experiment as a result of changing temperatures, instrumental parameters such as magnetic field, and interactions between ions. As a result, the difference (in ppm) between the mass of the LC-MS features and the MTs they matched to can be seen to vary as a function of different parameters. We have observed that m/z and elution time seem to be the major parameters. (c) Plots the relationship between mass errors for the data set in (a) and (b) as a function of m/z. (d) Plots the relationship between the mass errors for a different data set as a function of elution time (scan numbers). The mass error transformation found by LCMSWARP is shown as a solid black line in both.

As a result, an elution time correction is needed to effectively compare analyses, which we have found can be effectively done using a suitable global linear transformation between analyses21 (more details can be found in the Supporting Information). While such global or “macroscopic” corrections address the flow rate variability from experiment to experiment, they cannot correct for changes during an analysis, such as due to gradient noise, or to some extent for other types of variations between analyses, e.g., due to temperature changes, variations in solvent composition, or changes to the stationary phase due to incomplete column regeneration. It is expected from the central limit theorem that even after correcting for global trends of dead time and flow rate changes, the effect of these lesser understood microscopic factors can result in the observed elution times being a normally distributed variable around an ideal elution time. The width of the distribution is dependent on how well variations in experimental conditions are minimized and how well behaved each peptide is with respect to small changes in separation parameters. While detailed research has been performed on characterizing the elution profiles of chemical species,25 in this work, we attempt to model the elution time (as defined earlier) of a peptide from experiment to experiment as a Gaussian distribution, under the assumption of similar experimental configuration. Evaluation of the Gaussian model with 14 690 different peptides observed in at least 10 of 572 LC-MS/MS analyses showed that the data fit the model well (see the Model Testing section in the Supporting Information). Alternative models were also tested, but the normal distribution performed best and had the advantage that the variability was encompassed by one parameter, the standard deviation of the variability of elution time, rather than multiple parameters that affect the shape of the distributions and the understanding of the variability for different peptides.

While a linear alignment works well for global first-order correction of elution times, aligning LC-MS experiments displaying local nonlinearities in elution time forces the use of elution time tolerances wider than the actual variability of the data. Figure 1a shows the relationship between MS scan numbers of LC-MS features of an LC-MS experiment and the normalized elution time (NET) values of the MTs these features are matched to using a linear alignment (the NET for a mass tag is the average elution time for the mass tag in multiple LC-MS/MS experiments, which have been normalized to a baseline; see Supporting Information for more details). Also shown is the linear transformation function between MS scan numbers and NET values used for this alignment. Figure 1b shows the residuals between the NET values of the MTs and the NET values obtained by application of the linear transformation function to the scan numbers of the LC-MS features. As can be seen, the residuals display nonlinear behavior as a function of scan number. Because of the nonlinearity present in these data, the comparison of LC-MS features to MTs would need a tolerance of ∼0.08 NET, even though it is evident that the width of the distribution of residuals around the mean is actually much smaller. Thus, the use of a nonlinear trend, such as the one shown passing through the center of the residuals, would enable the use of smaller NET tolerances and thus provide more confident matches. We observe that local nonlinearity at a given elution time causes most peptides eluting at that time to respond similarly (i.e., essentially all elute later or sooner than their expected elution times). This is probably the result of an underlying factor such as a disturbance in the expected gradient that affects the elution times of all peptides similarly. As a result, the local elution time regions of different analyses display similar patterns in mass and elution time (Figure 2).

Figure 2. Mass and elution time sections of two different LC-MS experiments. Similar patterns are observed, although the elution times will typically vary. The local similarity between analyses is used to find global alignment between analyses to correct for elution time variability.

(25) Felinger, A. Data analysis and signal processing in chromatography; Elsevier Science B.V.: Amsterdam, 1998.
(26) Masselon, C.; Tolmachev, A. V.; Anderson, G. A.; Harkewicz, R.; Smith, R. D. J. Am. Soc. Mass Spectrom. 2002, 13, 99-106.
(27) Mitchell, D. W.; Smith, R. D. Phys. Rev. E: Stat. Phys., Plasmas, Fluids, Relat. Interdiscip. Top. 1995, 52, 4366-4386.
(28) Yanofsky, C. M.; Bell, A. W.; Lesimple, S.; Morales, F.; Lam, T. T.; Blakney, G. T.; Marshall, A. G.; Carrillo, B.; Lekpor, K.; Boismenu, D.; Kearney, R. E. Anal. Chem. 2005, 77, 7246-7254.

The MMA can also vary over the analysis, as well as between analyses due to use of incorrect calibration equations, changes in experimental parameters such as variations in ion population (significant in FTICR), temperature, magnetic field, and pressure in the mass analyzer.26-28 However, it is often difficult to isolate the factors producing such variations. Figure 1c shows the mass

errors observed in the same LC-MS analysis as in Figure 1a, as a function of m/z calculated using the difference between the mass of the LC-MS features and the theoretical mass of the best (lowest Mahalanobis distance) MT match. Here the drift in mass measurement accuracy can be attributed to imprecise calibration coefficients because of the obvious functional relationship between the m/z and the mass errors. Figure 1d shows the mass errors observed for a different LC-MS analysis where the mass error distribution seems to drift primarily as a function of time. We attempt to use a normal model to describe the mass measurement errors observed in subsections (of m/z and time). We assume that, at any instant, the mass error distribution is normal around a mean value and that errors in the matching process are randomly distributed (with a uniform distribution) in our list of matches. The normality of the instantaneous mass error distribution was tested over 9 subsections for each of 20 LC-MS analyses, and the normal distribution was seen to fit the data very well (more details on the hypothesis testing are given in the Model Testing section of the Supporting Information). The NET errors observed for LC-MS features are expected to be independent of the observed mass errors because the variability of elution times is caused by the variability in the separation systems, which is assumed to be independent of the mass spectrometer. We tested this assumption of independence of mass and elution time errors for the LC-MS features seen in the 9 subsections of the 20 LC-MS analyses used above. On the basis of the tests (see the Model Testing section in the Supporting Information), little or no correlation was seen between the mass and the NET error distributions. Hence, an independent bivariate normal distribution of mass and elution time errors of individual features can be used appropriately to approximate the observed variability of the elution time subsections of an LC-MS experiment. Chromatographic subsections of two different analyses can be scored for similarity on the basis of the above model, using the mass and elution time variability of the individual features that are in common to the two analyses. We now describe the similarity score and the algorithm used to align the data sets in the LC elution time and mass dimensions.
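To make this model concrete, the short C++ sketch below computes the log-likelihood match score for a single feature pair under the independent bivariate normal assumption just described; the type and function names (Feature, matchLogLikelihood) and the numerical values are our own illustration, not the published implementation.

```cpp
// Sketch: log-likelihood of a feature-to-MT match under independent
// normal mass (ppm) and NET error distributions, as described above.
#include <cmath>
#include <cstdio>

struct Feature {
    double mass; // monoisotopic mass, Da
    double net;  // normalized elution time (already in the reference scale)
};

double matchLogLikelihood(const Feature& f, const Feature& g,
                          double sigmaMassPpm, double sigmaNet) {
    const double PI = 3.14159265358979323846;
    double massErrPpm = (f.mass - g.mass) / g.mass * 1.0e6;
    double netErr = f.net - g.net;
    // Squared Mahalanobis distance: Euclidean distance scaled by the
    // variance in each (independent) dimension.
    double d2 = massErrPpm * massErrPpm / (sigmaMassPpm * sigmaMassPpm)
              + netErr * netErr / (sigmaNet * sigmaNet);
    // Log density of the independent bivariate normal distribution.
    return -std::log(2.0 * PI * sigmaMassPpm * sigmaNet) - 0.5 * d2;
}

int main() {
    Feature f{1000.505, 0.421}; // observed LC-MS feature (hypothetical)
    Feature g{1000.500, 0.418}; // candidate MT (hypothetical)
    std::printf("log likelihood = %.3f\n", matchLogLikelihood(f, g, 5.0, 0.02));
    return 0;
}
```

In the algorithm itself, a quantity of this form is evaluated for the best (lowest Mahalanobis distance) reference match of each alignee feature in a subsection and summed to give the subsection similarity score.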


Methods. Algorithm. Definitions and Parameters. A feature f from an LC-MS or an LC-MS/MS analysis is represented as a mass and time pair: (fmass, ftime), while a MT, g, from a mass and time tag database is represented as a mass and time pair, with additional information on the number of times (count) the MT was observed in the LC-MS/MS experiments: (gmass, gtime, gc). A feature and a MT will both be referred to interchangeably as features when only mass and elution time information is considered since these descriptors are common to both. An experiment E is described by the set of features discovered by our preprocessing algorithms when applied to data from the experiment; i.e., E = {f : f is a feature discovered by our preprocessing algorithm}. N is the user-specified number of elution time sections that the alignee data set is broken into (usually 100, with each section covering ∼1% of the run) for the NET alignment, while Nc is the number of elution time sections the reference data set is broken into, where c is referred to as the contraction/expansion factor. Nmass is the number of sections the alignee data set is broken into for the mass recalibration, while M is the number of sections the reference data set is broken into for the same purpose. Dnet is the number of discontinuous sections the LCMSWARP NET transformation function can jump (described in the next section), and Dmass is the number of discontinuous sections the corresponding LCMSWARP mass transformation function can jump (see next section). σnet and σmass are the standard deviations in the normalized elution time dimension (0-1) and the mass dimension (in parts per million, ppm), respectively, while tolmass and tolnet are the user-supplied mass (also in ppm) and normalized elution time tolerances. When the elution time of a feature f from an alignee data set is transformed to the elution time frame of the reference data set, we refer to the transformed elution time as fnet to reflect the fact that this normalized time is different from the original elution time ftime and is now in the scale of the reference data set. When the reference data set is a mass and time tag database, the elution times are in the NET scale. When the reference data set is another LC-MS experiment, the scale is converted from scan number or elution time to a scale of 0-1 by linearly scaling the scan numbers or elution times of features in the data set to cover the range of 0-1.

Description. Figure 3 describes the LCMSWARP algorithm to align two data sets to each other in the elution time dimension. To do so, LCMSWARP breaks the alignee data set into N equal-sized elution time segments (typically ∼100) and the reference data set into Nc equal-sized elution time segments, where c represents a contraction/expansion factor (usually 3) that allows for distortion of the LC-MS data sets during alignment (lines 3 and 4). Each section of the alignee data set is aligned linearly against all continuous sections in the baseline data set of length between 1 and c² sections, and a similarity score called the mscore is computed for each such alignment (line 9). Since the alignee data set has N sections and the reference has Nc sections, this allows each section of the alignee data set to be compressed to 1/c or expanded to c times the corresponding fraction in the reference data set. Thus, both symmetric expansion and contraction are possible.
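The sectioning bookkeeping is easy to picture in code. The following self-contained C++ sketch, our own illustration using the N and c parameters defined above, enumerates every candidate pairing of an alignee section with a contiguous run of 1 to c² reference sections; the count it prints corresponds to the O(N²c³) mscore evaluations discussed in the complexity analysis later in the paper.

```cpp
// Sketch: enumerate candidate (alignee section, reference span) pairings.
#include <cstdio>

int main() {
    const int N = 100;          // alignee sections (~1% of the run each)
    const int c = 3;            // contraction/expansion factor
    const int numRef = N * c;   // reference sections

    long long candidates = 0;
    for (int i = 0; i < N; ++i) {             // each alignee section
        for (int j = 0; j < numRef; ++j) {    // each reference start section
            // A span of w reference sections, 1 <= w <= c*c, lets an alignee
            // section shrink to 1/c or stretch to c times its nominal width.
            for (int w = 1; w <= c * c && j + w <= numRef; ++w)
                ++candidates;                 // one mscore to evaluate
        }
    }
    std::printf("candidate section alignments: %lld (order N^2 c^3)\n",
                candidates);
    return 0;
}
```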
To compute the mscore for an alignment of sections, common features are found (Figure 3, line 8, and Figure 3, Supporting Information).

Figure 3. Algorithm for calculation of similarity scores (mscore) and alignment scores (ascore) between sets of features from two experiments A and B.

First the elution time, ftime, of each feature f in the alignee section is transformed to a value fnet, in the elution time scale of the reference sections, using a linear transformation. Next, for each feature f, the feature g in the reference data set with the lowest Mahalanobis distance (Euclidean distance scaled by variance in each component) from f in the mass and time dimensions is found, with the constraint that the mass of g is within a user-supplied mass tolerance of the mass of f and the constraint that the elution time of g is within a user-supplied tolerance of the transformed elution time of f. For each feature f with no matches in the reference data set within the supplied tolerances, we make up a match with a “virtual” feature with mass fmass + fmass × tolmass × 10⁻⁶ and elution time fnet + tolnet. The match score is computed as the log of the likelihood of observing the mass and NET differences given the assumed bivariate normal distribution (Figure 3, line 9, Figure 4). For alignment of an LC-MS data set to a MT tag database, multiple matches to the LC-MS features are often possible when the mass and time tag database is large.
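As a small illustration of the “virtual” match used for unmatched features, the C++ fragment below places the substitute reference feature exactly one tolerance away in each dimension; the coordinates and tolerance values are hypothetical.

```cpp
// Sketch: the "virtual" match used to penalize an unmatched feature f.
#include <cstdio>

int main() {
    const double tolMassPpm = 25.0, tolNet = 0.02; // illustrative tolerances
    double fMass = 1000.500, fNet = 0.421;          // unmatched alignee feature
    // Virtual reference feature placed exactly one tolerance away in both
    // dimensions, so the match contributes the worst in-tolerance score.
    double gMass = fMass + fMass * tolMassPpm * 1.0e-6;
    double gNet = fNet + tolNet;
    std::printf("virtual match: mass %.5f Da, NET %.3f\n", gMass, gNet);
    return 0;
}
```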

Figure 4. Algorithm for calculation of match score for a set of feature matches.

Thus, it is important to factor in the prior probability that the MT that a feature matched to is indeed real. This can be done by multiplying each conditional probability of mass error and NET error term (from assignment of an LC-MS feature to a MT) with the prior probability that a MT was present in the sample. Since a mass and time tag database is usually a collection of a large number of LC-MS/MS experiments, the coverage of observable peptides is expected to be much larger than that for a single LC-MS experiment. We thus approximate the probability that a MT was present in the LC-MS sample using the empirical fraction of the total number of LC-MS/MS runs in the mass and time tag database that the MT was seen in. Since the denominator for the prior probability is constant for all MTs (equal to the number of the LC-MS/MS experiments used to build the mass and time tag database) and the denominator from Bayes rule (for conditional and prior probabilities) is just a normalizing constant, these can be ignored in the mscore. Therefore, the only term needed in the mscore is the log(number of observations of MT). When aligning an LC-MS data set to another LC-MS data set, the prior probabilities cannot be estimated. We would like to make the mscore more symmetric because the alignment algorithm should be symmetric (i.e., not affected by which of the LC-MS experiments was chosen as the baseline). We do so by also penalizing for features that are missing in the reference data set. While this approach is ad hoc, this symmetric addition helps with the quality of alignment when the alignee and the reference data set have a significantly different number of features.

An alignment score (ascore) is computed for all possible alignments by summing together the match scores of the subsections that are matched as a result of the alignment. This is done by using a dynamic programming algorithm similar to the one used to align biological sequences. Discontinuities are also allowed in the alignment process by a user-supplied value Dnet that specifies how many extra sections can be present in the data set (since it is usually assumed that the alignee data set will be the more complete data set). After alignment is performed, features from the two data sets are matched by choosing for each feature in the alignee data set the feature in the reference data set that is the closest in Mahalanobis distance. When the reference data set is a mass and time tag database, the probability of observation is also factored in as explained earlier. As a result, a set of candidate feature pairs is generated. If mass calibration is to be performed, we use a larger than necessary mass tolerance to generate this set of candidate matches so that drift in measurement accuracy does not cause real matches to be missed. These candidate matches are used to discover a mass recalibration function, with elution time, m/z of the alignee features, or both as the predictor variable, that characterizes the drift in the mass measurement error over the course of the experiment.
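The recurrence behind the ascore computation can be sketched as follows in C++. Here best[i][j] is the best cumulative score for aligning the first i alignee sections so that they end at reference section boundary j; mscore() is a stub, and the Dnet discontinuity handling is simplified to a skippable prefix, so this is a schematic of the dynamic program rather than the published routine.

```cpp
// Sketch: dynamic programming over section alignments (ascore).
#include <algorithm>
#include <cstdio>
#include <vector>

const double NEG_INF = -1.0e18;
const int N = 100, c = 3, Dnet = 3;
const int NUM_REF = N * c;

// Stub similarity score for aligning alignee section i to reference
// sections [j, j + w); a real mscore sums match log-likelihoods.
double mscore(int i, int j, int w) {
    double off = i * c - j; // penalize departures from the diagonal
    return -off * off - 0.1 * w;
}

int main() {
    std::vector<std::vector<double>> best(
        N + 1, std::vector<double>(NUM_REF + 1, NEG_INF));
    for (int j = 0; j <= Dnet; ++j) best[0][j] = 0.0; // skippable prefix

    for (int i = 1; i <= N; ++i) {
        for (int j = 1; j <= NUM_REF; ++j) {
            for (int w = 1; w <= c * c && w <= j; ++w) {
                if (best[i - 1][j - w] == NEG_INF) continue;
                best[i][j] = std::max(
                    best[i][j], best[i - 1][j - w] + mscore(i - 1, j - w, w));
            }
        }
    }
    // Best alignment of all N alignee sections, wherever it ends.
    double ascore = *std::max_element(best[N].begin(), best[N].end());
    std::printf("ascore = %.2f\n", ascore);
    return 0;
}
```

The implementation described in the text additionally records enough information for the piecewise transformation function to be recovered by backtracking.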

Figure 5. Algorithm for calculation of the mass drift score matrices between two experiments A and B by using set S of feature matches.

One of the possibilities to discover this function was to use smoothing spline regression.29 However, this procedure was strongly affected by outliers and incorrect matches that were not always symmetrically distributed around the central trend. Thus, a dynamic programming algorithm is used to generate a first-pass piecewise linear transformation function that is applied to the data. Candidate matches are regenerated for mass recalibration by rematching based on the Mahalanobis distance after recalibration, and then the final transformation function is discovered by using a natural cubic regression spline in the second pass. The dynamic programming algorithm to calculate the alignment score for the piecewise alignments is shown in Figure 5. The mass errors are divided into 2M equal-width bins of width tolmass/M. The MMA of the instrument is allowed to drift linearly from one elution time bin to another with the ends of the drift lying on the boundaries of the mass error bins. The mass error function is constrained to jump less than a user-specified parameter, Dmass. The goodness of each possible jump in MMA from one elution time section to the next is calculated by the sum of the squares of the mass differences after the transformation, scaled by the variance of mass errors observed in the section after the transformation (line 15).

(29) Hastie, T.; Tibshirani, R.; Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction; Springer: New York, 2001.

Figure 6. FDR (a) and the TPR (b), using a linear alignment (illustrated by open circles) and using LCMSWARP (illustrated by asterisks), for the LC-MS to LC-MS/MS matching as a function of NET tolerance. As can be seen from this figure, increasing the NET tolerance causes both the FDR and the TPR to increase with both models and shows the tradeoff between using tolerances that are too strict or too loose. LCMSWARP performs much better than linear alignment for this example, with both higher TPR and lower FDR.

It was observed that the matches chosen after alignment contain a mixture of incorrect and correct answers. To reduce the effect of outliers, every match with a mass error greater than a user-specified tolerance, Zmassσmass, is treated as having a mass error equal to Zmassσmass. A global alignment score is calculated using dynamic programming to find the lowest possible score among all alignments. The piecewise alignment function can be recovered from this alignment score by backtracking. While Figure 5 shows the dynamic programming working with scan number as the predictor variable, the same routine is alternatively used with m/z as the predictor variable instead of scan number. In our production environment, a two-pass routine has been observed to be the most stable, where the dynamic programming algorithm first performs recalibration as a function of m/z and then performs another recalibration as a function of scan number.

LCMSWARP also operates in two passes. In the first pass, it performs NET alignment using historical mass and NET standard deviations and extra wide tolerances in mass (∼50 ppm for FTMS analyses and ∼200 ppm for QTOF) and NET (∼0.05). Following this first pass, an initial alignment is computed and mass measurements are corrected using the algorithm described previously. Using this initial alignment and mass correction, best matches are found and used to compute the actual NET and mass standard deviations for the alignee data set. These real standard deviations are then used to recalculate the alignment function, and the matching process is repeated.
Clustering of LC-MS Analyses. Metabolomics and many proteomics analyses using LC-MS for candidate biomarker discovery often involve aligning analyses to uncover initially unidentified features that change in some interesting fashion between samples. Alignment of a pair of LC-MS analyses is accomplished by LCMSWARP as described above. Once all data sets are aligned to the same chromatographic coordinate scale, a complete- or single-linkage hierarchical clustering is applied to the mass and NET (transformed) coordinates of the features to group common features into feature clusters. The distance function used to represent the difference between two features f and g is again the Mahalanobis distance between the two if their mass difference and NET difference are within the specified tolerances; otherwise, the distance is infinite. By using this alignment and clustering approach, unidentified features in several data sets can be grouped into a master list. A two-pass procedure is again employed to avoid having to depend on one baseline for alignment. Once data sets are aligned and clusters are detected, a representative mass and elution time is computed for the features in each cluster and then used as the baseline for the next round of alignment, in which clusters are reselected. To reduce computational time, the entire merged set of features from all experiments is sorted by mass and broken down into partitions at indices where consecutive features are separated by a mass difference greater than 3σmass.
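The partitioning step just described can be sketched in a few lines of C++; for simplicity this illustration compares masses in daltons with a fixed σmass rather than a ppm-scaled tolerance.

```cpp
// Sketch: sort merged feature masses and start a new partition wherever
// the gap between consecutive masses exceeds 3 * sigmaMass.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> masses = {999.51, 999.52, 1000.49, 1000.50,
                                  1000.52, 1001.48, 1001.50};
    std::sort(masses.begin(), masses.end());
    const double sigmaMass = 0.01; // illustrative value, Da
    std::vector<std::vector<double>> partitions(1);
    partitions.back().push_back(masses[0]);
    for (size_t i = 1; i < masses.size(); ++i) {
        if (masses[i] - masses[i - 1] > 3.0 * sigmaMass)
            partitions.emplace_back(); // gap > 3 sigma: new partition
        partitions.back().push_back(masses[i]);
    }
    for (size_t p = 0; p < partitions.size(); ++p)
        std::printf("partition %zu: %zu features\n", p, partitions[p].size());
    return 0;
}
```

Hierarchical clustering is then run independently within each partition, which keeps each clustering problem to roughly the size of the number of data sets.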

Figure 7. (a) Two-dimensional heat map of mscore between sections of two LC-MS experiments. The intense region represents sections of high-quality matches. As can be seen from this figure, there is a strong relationship between similar regions of the experiments. The beginning and the end of the analyses have no features present, and hence, a trend is not observed in either region. The scores for each subsection were normalized to a z score by using the mean and standard deviation of its scores against all the sections of the second run. (b) Heat map of similarity score between an LC-MS experiment and the MT tag database of LC-MS/MS analyses.

Because of the natural isotopic distribution of carbon, nitrogen, hydrogen, oxygen, and sulfur, the mass distributions for organic compounds lie primarily around well-defined centers. For example, peptides of mass around 1000 are mostly concentrated at 1000.5, and very few peptides exist with masses between 1000.7 and 1001.3 Da or between 999.7 and 1000.3 Da. This allows natural partitioning of the masses seen in peptides. The in silico partitioning we apply before clustering breaks the high-MMA data set into numerous smaller partitions (each of a size of the order of the number of data sets) and allows fast clustering without an observable effect on results.

RESULTS

The existence of local nonlinear trends in LC-MS data (Figure 1) can potentially lead either to incorrect matches (e.g., false peptide identifications), since one would use wider matching tolerances (i.e., cut-offs) than would be necessary if nonlinear correction were applied, or to missed identifications (i.e., false negatives) when narrow tolerances are applied. The latter gives rise to more “missing data” when a given species may be within tolerances in one run, but outside in another.

Increasing the elution time tolerance causes fewer false negatives, but more false positives. We illustrate this using an LC-MS analysis where the identity of LC-MS features is already known from an orthogonal means and then matching these features to MTs in a mass and time tag database by using linear alignment to perform the alignment and matching. Once the alignment and matching are performed, we can compare the matches obtained for the LC-MS features from the alignment against the true identity of the features we already know and assess the true positive and false discovery rates. The data set we used was generated by analyzing trypsin-digested proteins from S. typhimurium using a Finnigan LTQ-FT instrument to obtain moderately high MMA MS precursor ion scans while concurrently performing MS/MS fragmentation analyses in the linear ion trap stage. The MS precursor scans were used to generate the LC-MS features that were then tied to the MS/MS spectra by comparing the theoretical mass generated from SEQUEST database searches to the mass in the related high MMA precursor MS scans. Using Washburn/Yates criteria to filter the MS/MS identifications from SEQUEST, a set of 1088 LC-MS features was identified.
The error rate of the algorithm was assessed by comparing the sequences assigned to these features from the MS/MS search with those obtained from the matching process after alignment. The true positive rates (TPR) and false discovery rates (FDR) achieved by performing alignment at different NET tolerances are shown in Figure 6. As can be seen from this curve, increasing the NET tolerance results in an increase in the TPR but also causes an increase in the FDR for this data set. The relative rates of change of FDR and TPR depend on the percent of ambiguous/false matches present, which depends on the complexity of the samples and the types of samples being compared. More complex samples are more likely to have different peptides with similar mass and elution times that can cause ambiguity in the matching. Additionally, when samples are increasingly different, the random rate of errors in matching increases compared to the rate of correct matches and also results in more ambiguous matches. Indeed, we have observed that increasing the size of the mass and time tag database, by loosening the filters applied to the MS/MS data sets, causes the error rate of matching to increase, making it attractive to optimize this aspect, as well as the overall alignment process. For the same data set, we also show the FDR and TPR obtained by using LCMSWARP for the above matching (Figure 6). In this example, LCMSWARP performs much better than linear alignment, showing a higher TPR and lower FDR through most of the range of tolerances. Of course, the difference in performance depends on the nature of the data set. For data sets with a high degree of nonlinearity, LCMSWARP provides much better results than linear alignment, but for data sets with less nonlinearity, the alignment performance of the two algorithms is increasingly similar, although LCMSWARP is never worse and is also much faster (see below).

Figure 7 shows two example heat maps that represent similarity scores (mscore) between data sets for two different alignments. In the first heat map, two LC-MS analyses with over 10 000 LC-MS features each were aligned to each other, while in the second heat map, an LC-MS data set with over 10 000 LC-MS features was aligned to a mass and time tag database containing over 30 000 high-quality MTs built from over 1000 LC-MS/MS analyses. As can be seen, the mscore is able to highlight very clearly not only the similarity between related regions of the analyses but also the transformation function between the two. It is interesting to note from this figure that the residuals at higher scan numbers are further away from the linear fit for this data set. Higher residuals are often observed at higher scan numbers in our LC-MS analyses for the more hydrophobic components. Importantly, the ascore is able to find an appropriate transformation function in NET, and the ascoremass is able to find an appropriate transformation function in mass, as can be seen in the central trend lines in Figure 1. The two-pass algorithm used by LCMSWARP is able to reduce the distribution of mass and NET errors (Figure 8) when such nonlinear variabilities are present in the elution time or the MMA.

In this work, the LCMSWARP algorithm was also adapted to find groups of features common to different LC-MS experiments. Figure 9 shows the results following alignment of five LC-MS data sets.
These results were then clustered to determine features in common among the analyses.

Figure 8. Histogram of NET errors after linear alignment and LCMSWARP are applied. As can be seen from this figure, the variability in NET measurements is corrected. This effect is more dramatic for a more distorted data set such as the one illustrated here. Reduction in the width of the error distribution allows us to use tighter tolerances in the matching process and reduce the rate of false positive matches.

The intensities of elements from different clusters are compared to a random baseline data set in Figure 10, which illustrates how features from different analyses appear to exhibit similar intensities. This behavior can be used to detect species exhibiting significant changes in abundance in a population of clusters.

The robustness of the LCMSWARP method was evaluated by comparing results obtained from different LC systems operating with a different gradient shape and also by aligning samples with missing sections. An Agilent 1100 capillary LC system coupled to a Finnigan Orbitrap mass spectrometer was used to analyze a mixture of 12 proteins and 23 peptides.30 The resulting data were aligned to a MT tag database that had been constructed using the same samples, but analyzed using our constant-pressure LC system coupled to Finnigan LCQ and LTQ mass spectrometers. The Agilent capillary LC system employed a linear mobile-phase gradient, while the constant-pressure system used an exponential mobile-phase gradient. The differences in these gradients are exemplified by the mscore trends shown in Figure 11a. Despite these differences, LCMSWARP was able to find the function to transform the results from the two different separation systems (blue line). Figure 11b shows results of a different experiment, where 80 fractions of an LC separation were analyzed by direct infusion with a NanoMate (Advion BioSciences, Ithaca, NY) into a mass spectrometer. Missing data are present because some of the fractions were not successfully analyzed. After the 80 direct-infusion MS experiments were merged together to create one master data set, the merged data set was aligned against the corresponding mass and time tag database using LCMSWARP, allowing for discontinuities in the NET dimension. LCMSWARP was able to find the correct transformation function despite the missing fractions, as illustrated by the blue line in Figure 11b. These examples show LCMSWARP to be a robust algorithm for effectively tightening the error distribution of NET values and mass error values when local nonlinearities in chromatography or drift in MMA of the mass spectrometer are observed.

The time complexity of LCMSWARP is O(N²c³) for computing all mscores in the NET dimension and O(N²c³Dnet) for computing all the ascores in the NET dimension.

(30) Purvine, S.; Picone, A. F.; Kolker, E. Omics 2004, 8, 79-92.

Figure 9. Two-dimensional view of zoomed-in mass regions of five LC-MS data sets before and after alignment with LCMSWARP. As illustrated, clusters of features become more obvious after alignment.

Figure 10. Intensity of elements of different clusters compared to a random baseline data set. Note that most features show similar intensity patterns in each of the analyses.

Finding all of the closest features for each mscore computation between two sets is quadratic in the worst case, while scoring them is linear in the number of pairs. Since the number of pairs is less than the number of features, n, and the number of features is much larger than Dnet, the complexity of LCMSWARP is O(N²c³n²) in the worst case. However, the average run time for finding all pairs of closest features is much closer to O(n log n + m log m + (m + n)) = O(n log n + m log m), where n and m are the number of features in the two sets, because after sorting the features, all possible pairs can be precomputed using the tight partitions formed by the mass and NET distributions. By sorting features in the mass dimension, stepping through the sorted list, and matching each current feature to only the range of features within the mass tolerances, we are able to reduce the candidate set of matches to a size that is, on average, similar for most features. This size reflects the complexity of the sample.
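A minimal C++ sketch of this sorted sweep is given below; the masses and the tolerance are illustrative, and the real implementation applies analogous windowing in the NET dimension as well.

```cpp
// Sketch: after sorting both feature lists by mass, a moving window yields,
// for each alignee feature, only the reference features within tolerance.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> alignee = {500.21, 750.44, 1000.50, 1200.63};
    std::vector<double> reference = {500.20, 500.22, 999.90, 1000.51, 1200.60};
    std::sort(alignee.begin(), alignee.end());
    std::sort(reference.begin(), reference.end());
    const double tolPpm = 25.0;

    size_t lo = 0; // left edge of the window into the reference list
    for (double m : alignee) {
        double tolDa = m * tolPpm * 1.0e-6;
        while (lo < reference.size() && reference[lo] < m - tolDa) ++lo;
        for (size_t k = lo; k < reference.size() && reference[k] <= m + tolDa; ++k)
            std::printf("candidate: %.3f ~ %.3f\n", m, reference[k]);
    }
    return 0;
}
```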

When implemented in C++, the LCMSWARP algorithm is able to align features from an LC-MS experiment with over 10 000 features to a mass and time tag database with over 200 000 MTs within 10 s on a 3-GHz PC. This time, of course, includes only the time required to perform alignment on the preprocessed data in feature space and not the time for feature detection in the samples. The feature detection is, in fact, a time-consuming procedure currently taking between 1 and 3 h for one LC-MS experiment. In contrast, aligning 10 LC-MS experiments, each with ∼10 000 features, takes less than 5 s.

It is well known that the worst case time complexity of complete-linkage clustering is at most O(n² log n) while the time complexity of single-linkage clustering is O(n²), where n is the number of features to be clustered. However, for our clustering operation, we partition features into groups by mass differences between consecutively higher mass features, as described previously.
Figure 11. (a) Heat map of mscore from alignment of an LC-MS experiment using a complex protein mixture performed using a linear reversed-phase gradient with an Agilent LC system and analyzed against a MT tag database generated from LC-MS/MS analyses using a constant-pressure exponential gradient Isco system. The difference between the linear gradient LC system and the exponential gradient LC system is clearly evident at the beginning of the analysis. A linear transformation function (in green) is able to provide a degree of fit in the central part of the analysis but fails in the early part of the analysis. The LCMSWARP function is able to find the nonlinear transformation function that captures the variability of the two LC systems. (b) Heat map of mscore from alignment of 80 fractions of an LC separation, which were individually analyzed by direct infusion and merged into one data set. The merged data set was aligned against a mass and time tag database of similar samples. Some of the fractions were not successfully analyzed and are thus missing in the merged data set, as can be seen very clearly by the discontinuities in the heat map.

Each group is on average of size O(d), where d is the number of data sets. The number of bins, m, is usually quite stable and reflects the mass distribution of peptides, which results from the natural isotopic composition of elements as explained previously. Thus, on average, the clustering scales as m²d² log(md) for complete-linkage and as m²d² for single-linkage clustering. For a small number of data sets (