
Threshold-Avoiding Proteomics Pipeline
Frank Suits,*,† Berend Hoekman,‡,§ Therese Rosenling,‡,§ Rainer Bischoff,‡,§ and Peter Horvatovich‡,§

† IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, New York 10598, United States
‡ Department of Analytical Biochemistry, University of Groningen, Antonius Deusinglaan 1, Groningen, The Netherlands
§ Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA Nijmegen, The Netherlands

ABSTRACT: We present a new proteomics analysis pipeline focused on maximizing the dynamic range of detected molecules in liquid chromatography–mass spectrometry (LC–MS) data and accurately quantifying low-abundance peaks to identify those with biological relevance. Although there has been much work to improve the quality of data derived from LC–MS instruments, the goal of this study was to extend the dynamic range of analyzed compounds by making full use of the information available within each data set and across multiple related chromatograms in an experiment. Our aim was to distinguish low-abundance signal peaks from noise by noting their coherent behavior across multiple data sets, and central to this is the need to delay the culling of noise peaks until the final peak-matching stage of the pipeline, when peaks from a single sample appear in the context of all others. The application of thresholds that might discard signal peaks early is thereby avoided, hence the name TAPP: threshold-avoiding proteomics pipeline. TAPP focuses on quantitative low-level processing of raw LC–MS data and includes novel preprocessing, peak detection, time alignment, and cluster-based matching. We demonstrate the performance of TAPP on biologically relevant sample data consisting of porcine cerebrospinal fluid spiked over a wide range of concentrations with horse heart cytochrome c.

Liquid chromatography–mass spectrometry (LC–MS) continues to grow in popularity for biomarker discovery and quantitative proteomics experiments,1–5 and research is ongoing to improve methods for extracting maximal information from the large, complex data sets that these state-of-the-art instruments produce. An LC–MS data set for label-free comparative profiling comprises multiple single-stage mass spectrometry scans acquired with increasing retention time (rt) as different peptides emerge from the liquid chromatograph. The resulting raw data correspond to a large (perhaps many gigabytes) two-dimensional (2D) image of intensity in the dimensions of mass/charge ratio (m/z) and rt. The image is further complicated by series of isotopologues and different charge states of the same compounds. In order to detect and quantify the peptides present in this “image,” an LC–MS analysis pipeline must measure constituent peaks that span a 3–5 order of magnitude dynamic range of intensities, which on the low end merges with noise of both electronic and chemical origin.6,7 Often the choice between considering a peak to be noise versus a compound signal relies on an assumed cutoff in intensity (e.g., height, area, or estimated signal-to-noise ratio (S/N) of the peak) below which the signal is considered unreliable. Since this decision is based only on local information in the chromatogram, there is a risk of discarding biologically relevant peaks as false negatives early in the pipeline and thereby losing considerable measurement dynamic range.

One way to gain confidence in the identification of a peak is to perform MS/MS acquisition on it while it is still eluting from the LC. The resulting MS/MS spectrum not only helps identify the peptide but gives greater confidence in matching a peak in one chromatogram to a peak in another. Unfortunately, not all peaks can be identified in this way, and with no a priori knowledge of which peaks merit MS/MS study, one currently relies on data-dependent acquisition algorithms that typically acquire MS/MS only for the higher-abundance peaks, leading to a reduction in coverage and dynamic range of those peaks with supporting MS/MS information.8 A data-directed acquisition (DDA) approach,9,10 using a priori knowledge of certain peaks, or reanalysis of the sample multiple times to increase MS/MS coverage, also has a limited dynamic range of concentration compared to MS-only acquisition, which uses the full acquisition time of the mass spectrometer to acquire quantitative information on the constituent compounds of the sample. In contrast, MS/MS may require 75% of the acquisition time, assuming one single-stage MS scan followed by three data-dependent MS/MS scans.


Since the goal of this work is to detect low-abundance peaks, including those near the level of the noise, MS/MS on all peaks is impractical due to the limited coverage possible and the difficulty of acquiring MS/MS from low-abundance precursor peaks, although recent advances have improved this coverage.11 As a result, we focus on achieving the greatest confidence possible in separating true signal peaks from noise based solely on the single-stage LC–MS data. This is consistent with the goals of high-throughput, label-free comparative proteomics profiling, which strives to cover a large dynamic range of concentration and is in common use for biomarker discovery. Historically, peak extraction focuses first on finding the peaks in each mass spectrum using traditional mass spectrometry peak detection routines and then aggregating the peaks corresponding to the same compound in rt to identify the maximum of each peak in the chromatographic elution dimension. This continues today with the so-called “centroid mode” output of LC–MS instruments, where peaks are detected within each individual MS scan and output as lists of (m/z, intensity) pairs at each scan’s retention time. This results in a great reduction in the size of the output file, since each full-scan spectrum has been reduced to a list of peaks, but it requires peak detection without knowing the contents of adjacent scan lines. Furthermore, peak detection on a per-scan basis requires an imposed distinction between low-abundance signal peaks and noise peaks, usually based on a minimum cutoff or threshold value and sometimes a measure of the peak quality.12,13 Discriminating signal from noise based on the small amount of data in a single scan line, while ignoring data in adjacent scan lines and other similar data sets, makes it more challenging to qualify and reject noise peaks, leading to a lower dynamic range of concentrations for detected compounds. Alternatively, LC–MS instruments can output the data in raw, “profile” format comprising all acquired intensities in the actual MS scan, e.g., ion current, as a function of m/z. These raw data sets are much larger and have an irregular structure if the m/z and rt values do not have constant steps between them. The result is more computationally demanding to analyze since, unlike an image with uniformly spaced pixels, these values are scattered nonuniformly in 2D space, and each value requires a full set of (rt, m/z, intensity) values, rather than a long list of intensities mapped onto a uniformly spaced grid of rt and m/z. The resulting computational demands, particularly as instruments increase in resolution and data sets become larger, have made peak detection and quantification from the raw data sets less attractive than the simpler and smaller centroid-mode output. Nonetheless, there are many advantages to working with the raw data despite the added computational burden. First, a peak can be more accurately detected in the context of many adjacent line scans as the peptide elutes, since more information from the instrument is available for each peak. When discriminating signal peaks from noise peaks during centroid-mode acquisition, a noise peak that is large but appears in a single scan line will be retained, whereas a low-abundance peptide peak that appears in one or several adjacent scan lines will be discarded as noise if it never exceeds the cutoff. In addition, peaks that overlap in two dimensions will appear with a recognizable spatial relationship in the (m/z, rt) 2D space, as opposed to a simple succession of overlapping peaks of changing relative intensities in a series of adjacent m/z scans.
Finally, since a peak is spread out in the two dimensions of rt and m/z, adjacent scan lines provide a greater number of values to improve the confidence in the peak’s estimated intensity.
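As a concrete illustration of the two output modes discussed above, the following C++ declarations contrast the per-scan centroid representation with the scattered profile representation. The type and field names are illustrative only and do not correspond to any particular file format or to TAPP internals.

```cpp
#include <utility>
#include <vector>

// Centroid mode: each scan is reduced to a list of (m/z, intensity) peaks
// detected independently within that single scan, tagged with its retention time.
struct CentroidScan {
    double rt;                                      // retention time of the scan (min)
    std::vector<std::pair<double, double>> peaks;   // (m/z, intensity) pairs
};

// Profile mode: every acquired intensity is kept as a scattered (rt, m/z, intensity)
// triplet; the points are not on a uniform grid, so the data set is much larger,
// but no per-scan signal/noise decision has been imposed.
struct ProfilePoint {
    double rt;         // retention time (min)
    double mz;         // mass/charge ratio (Th)
    double intensity;  // measured ion current
};

using CentroidData = std::vector<CentroidScan>;
using ProfileData  = std::vector<ProfilePoint>;
```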


Figure 1. Outline of the TAPP LC–MS analysis pipeline. The focus of this paper is the first part, from “Meshing” to “MetaMatch”, but additional processing steps (not described here) are shown for context. Note that the size of the data being processed changes abruptly from the full chromatograms to peak lists and that time alignment with Warp2D benefits from this data reduction since it works with peak lists.

The central theme presented here is to work with raw data straight from the instrument as much as possible and to view all the data from an experiment in context in order to recognize signals near the level of noise, thereby making maximum use of the MS instrument’s dynamic range. Although there have been many efforts to improve the performance of individual components in the LC–MS analysis pipeline, and there is an appreciation that important disease-related signal may lie at the low end of the dynamic range,14 there has been little emphasis on the benefits of viewing each peak in the context of the ensemble of related experimental data to distinguish low-abundance signal from noise, although Coombes and co-workers discuss the concept.15,16 The focus of this work is to avoid any preprocessing that might cull signal prematurely and to delay the identification and rejection of noise peaks until the final stage, where all peaks from an experiment can be studied as an ensemble.

’ ANALYSIS PIPELINE
In TAPP (threshold-avoiding proteomics pipeline), we want to make the best use of the raw LC–MS data while avoiding the early application of thresholds that might discard biological signal together with noise. Figure 1 is a flowchart of the key components of TAPP. TAPP first regularizes the irregular raw, profile data by “meshing” it onto a regular grid along with a slight 2D Gaussian smoothing, with controllable parameters, that is gentle enough to retain low-intensity noise peaks. Having the intensities on a regular grid makes the data set more like an image that is compact and efficient to process with minimal loss of resolution, as long as the grid spacing matches the resolution of the instrument. In fact, many mass spectrometers have changing resolution in m/z that would result in oversampling on one end of the spectrum and undersampling on the other. TAPP corrects for this by mapping the intensities onto a warped grid with spacing that changes to match the resolution of the instrument. This warping in m/z ensures constant resolution along the nonlinear m/z axis, as long as the original mapping is restored after the extraction of peaks. Constant resolution not only reduces the size of the grid, but it also provides a more consistent representation of peaks across m/z, as described in more detail below.
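As an illustration of the meshing idea, the sketch below “splats” each scattered raw (rt, m/z, intensity) point onto a regular grid with a gentle 2D Gaussian kernel. It is a minimal sketch under assumed data structures (Mesh, RawPoint) and an assumed accumulation rule, not the exact TAPP implementation; in TAPP the m/z axis of the grid is additionally warped, as described in the Variable Mass Resolution section.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Regular grid of intensities in (rt, m/z); spacings and smoothing widths are
// illustrative parameters, chosen to roughly match the instrument resolution.
struct Mesh {
    double rt0, mz0;        // coordinates of grid point (0, 0)
    double drt, dmz;        // grid spacing in rt and m/z
    int nrt, nmz;           // grid dimensions
    std::vector<double> v;  // intensities, row-major: v[i * nmz + j]

    double& at(int i, int j) { return v[static_cast<std::size_t>(i) * nmz + j]; }
};

struct RawPoint { double rt, mz, intensity; };

// "Splat" each scattered raw point onto the grid with a small 2D Gaussian kernel.
// The smoothing is deliberately gentle so that low-intensity (possibly noise)
// peaks are preserved rather than culled at this stage.
void meshRawData(const std::vector<RawPoint>& raw, Mesh& mesh,
                 double sigmaRt, double sigmaMz)
{
    const int reachRt = static_cast<int>(std::ceil(3.0 * sigmaRt / mesh.drt));
    const int reachMz = static_cast<int>(std::ceil(3.0 * sigmaMz / mesh.dmz));

    for (const RawPoint& p : raw) {
        const int ic = static_cast<int>(std::round((p.rt - mesh.rt0) / mesh.drt));
        const int jc = static_cast<int>(std::round((p.mz - mesh.mz0) / mesh.dmz));
        for (int i = ic - reachRt; i <= ic + reachRt; ++i) {
            if (i < 0 || i >= mesh.nrt) continue;
            for (int j = jc - reachMz; j <= jc + reachMz; ++j) {
                if (j < 0 || j >= mesh.nmz) continue;
                const double dt = mesh.rt0 + i * mesh.drt - p.rt;
                const double dm = mesh.mz0 + j * mesh.dmz - p.mz;
                const double w = std::exp(-0.5 * (dt * dt / (sigmaRt * sigmaRt) +
                                                  dm * dm / (sigmaMz * sigmaMz)));
                mesh.at(i, j) += w * p.intensity;   // accumulate smoothed intensity
            }
        }
    }
}
```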


Figure 2. Depiction of the slope-based peak-searching algorithm. The central region contains two shoulder peaks with a narrow valley between them. The white dots show the region of rt, m/z space assigned to the rear peak, excluding the shoulder peak toward the front. The black dots map out the boundary of the peak for use in estimating the background level directly from the data. Visualization created with OpenDX.20

Although TAPP provides the greatest benefit with raw, profile-mode data, the same smoothing and mapping onto a warped grid also works for centroid-mode data. TAPP finds peaks in the grid with a novel slope-based algorithm that regards every local maximum as a peak and carves out its surrounding territory based on slope rather than intensity. This has several advantages. First, every local maximum counts as a peak, and no minimum threshold intensity is applied to separate signal from noise. Second, overlapping and shoulder peaks will be split as long as there is a local minimum, or “valley,” between them, and the values around the boundary of each peak’s region of (m/z, rt) space provide a local estimate of the baseline and a calculation of the peak volume above that baseline. Figure 2 is a visualization of the partitioning of peaks in (m/z, rt) space by this slope-based search. The white dots mark grid points assigned to a given peak, and the black dots indicate border regions where the slope decays or begins to rise upward, corresponding to the peak boundary used for estimating the local background. This means each peak has a corresponding baseline value that takes into account contributions from overlapping peaks, thereby avoiding the need for a separate baseline-subtraction heuristic. Third, knowledge of the peak shape and extent in the rt and m/z dimensions allows additional metrics for analysis and noise rejection, as well as alternate ways to define the location and height of the “peak” itself. The peak can be defined as the midpoint of the highest pixel, the volumetric centroid of the full peak volume, or a “windowed centroid” of the neighborhood of the highest pixel, i.e., a peak location determined by the centroid of values within a small radius of the peak maximum rather than a full center-of-mass calculation of the entire peak region. Each definition has advantages and disadvantages, but having the full volumetric region of each peak defined allows the freedom to choose the centroiding method that best suits the experiment and instrumentation.
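The sketch below conveys the flavor of such a slope-based search on the meshed grid: starting from a local maximum, the region grows outward only while the intensity keeps decreasing, and the cells where the descent stops form the boundary used to estimate a local baseline and the peak volume above it. The tie-breaking and boundary rules of the actual TAPP implementation are not specified in the text, so treat this as an illustration rather than the exact algorithm.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Works on a regular intensity grid (nrt x nmz, row-major, as produced by the
// meshing step). Indices: i = rt, j = m/z.
struct PeakRegion {
    int iMax, jMax;   // grid location of the local maximum
    double height;    // intensity at the maximum
    double baseline;  // mean intensity over the boundary cells
    double volume;    // summed intensity above the baseline
};

// Grow a peak region outward from the local maximum at (i0, j0): a neighboring
// cell joins the region only while the intensity keeps decreasing (downhill walk).
// Cells where the descent stops mark the peak boundary and give a local baseline.
// "assigned" must be a zero-initialized vector of size nrt * nmz shared across
// calls so that each grid cell is claimed by at most one peak.
PeakRegion growPeak(const std::vector<double>& v, int nrt, int nmz,
                    int i0, int j0, std::vector<char>& assigned)
{
    auto idx = [&](int i, int j) { return static_cast<std::size_t>(i) * nmz + j; };
    PeakRegion pk{i0, j0, v[idx(i0, j0)], 0.0, 0.0};

    std::queue<std::pair<int, int>> frontier;
    std::vector<std::pair<int, int>> region, boundary;
    frontier.push({i0, j0});
    assigned[idx(i0, j0)] = 1;

    while (!frontier.empty()) {
        auto [i, j] = frontier.front();
        frontier.pop();
        region.push_back({i, j});
        const double here = v[idx(i, j)];
        const int di[4] = {1, -1, 0, 0}, dj[4] = {0, 0, 1, -1};
        for (int k = 0; k < 4; ++k) {
            const int ni = i + di[k], nj = j + dj[k];
            if (ni < 0 || ni >= nrt || nj < 0 || nj >= nmz) continue;
            if (assigned[idx(ni, nj)]) continue;
            if (v[idx(ni, nj)] < here) {   // still descending: belongs to this peak
                assigned[idx(ni, nj)] = 1;
                frontier.push({ni, nj});
            } else {                       // slope turns upward: boundary cell
                boundary.push_back({ni, nj});
            }
        }
    }

    double sum = 0.0;
    for (auto [i, j] : boundary) sum += v[idx(i, j)];
    pk.baseline = boundary.empty() ? 0.0 : sum / static_cast<double>(boundary.size());
    for (auto [i, j] : region) pk.volume += v[idx(i, j)] - pk.baseline;
    return pk;
}
```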


An LC–MS analysis pipeline usually includes a method of aligning multiple chromatograms in time to correct for nonlinear variations in elution time. This time-alignment, or time-warping, phase can occur before or after the extraction of peaks. In TAPP we first extract the peaks and then apply the Warp2D17 method to align the peaks using the following peak properties: 2D location in (m/z, rt) space, peak height, and peak widths in the rt and m/z dimensions. As described in Suits et al., Warp2D is an efficient and accurate retention-time warping method that takes advantage of information in the 2D (m/z, rt) space to drive alignment of related peaks in separate chromatograms that have similar rt and nearly identical m/z coordinates, while it ignores peak pairs that have comparable rt but different m/z.17 Since Warp2D uses peak lists rather than the full spectrum including background areas, it avoids bias from baseline noise and is inherently driven by coherent peak locations between samples to determine the time alignment. This recognition of correlated peaks in pairs of samples is itself a form of noise rejection, since peaks that have no spatial correlation between two samples will not bias the time alignment. Note that single-stage LC–MS data, especially when acquired with high-resolution mass spectrometers, may also be prone to shifts in the m/z dimension due to possible mass calibration distortion. In this case the m/z axis should be recalibrated to exact mass to improve the mass accuracy of the data, either by using background ions or internal standards.18,19 Once each sample data set has been reduced to a set of time-aligned peaks (all aligned to the same reference sample), the next phase of TAPP is to match peaks across the ensemble of samples. This accomplishes two goals: (1) matching the same peaks across multiple samples so they can be quantitatively compared according to sample label (e.g., wild type vs mutant) and (2) rejecting peaks that are not members of a cluster as noise rather than signal peaks. It is in this last phase that TAPP finally culls noise peaks, not based on an intensity threshold, but based on their lack of a consistent appearance with other peaks in a small cluster. The size of the cluster is determined by the m/z resolution of the instrument and the expected time-alignment errors across the data set after applying Warp2D. The maximum cluster size, in the m/z and rt dimensions, and the minimum number of peaks present in the chromatograms of one of the sample groups are the key threshold parameters TAPP uses to reject noise peaks. It is important to note that these parameters are independent of the intensity and shape of the peaks, and they are applied at the end of processing, after the cluster-based peak-matching process.
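To illustrate how a peak-list-based 2D alignment score can ignore peak pairs with similar rt but different m/z, the sketch below computes an overlap-style benefit between a sample and a reference peak list, treating each peak as a 2D Gaussian. This is one plausible reading of the Warp2D scoring idea, not its exact benefit function; see ref 17 for the actual method. The Peak fields mirror the properties listed above.

```cpp
#include <cmath>
#include <vector>

// A peak as used for time alignment: 2D location, height, and widths in rt and m/z.
struct Peak {
    double rt, mz;            // location (min, Th)
    double height;            // apex intensity above the local baseline
    double sigmaRt, sigmaMz;  // peak widths in the two dimensions
};

// Illustrative benefit function for scoring a candidate time alignment: every
// sample/reference peak pair contributes according to the overlap of the two
// 2D Gaussians. Pairs with similar rt but different m/z contribute essentially
// nothing, which is what makes a 2D score robust against coincidental rt matches.
// (In practice one would restrict the double loop to nearby pairs.)
double overlapBenefit(const std::vector<Peak>& sample,
                      const std::vector<Peak>& reference)
{
    double benefit = 0.0;
    for (const Peak& a : sample) {
        for (const Peak& b : reference) {
            const double srt2 = a.sigmaRt * a.sigmaRt + b.sigmaRt * b.sigmaRt;
            const double smz2 = a.sigmaMz * a.sigmaMz + b.sigmaMz * b.sigmaMz;
            const double drt = a.rt - b.rt;
            const double dmz = a.mz - b.mz;
            benefit += a.height * b.height *
                       std::exp(-0.5 * (drt * drt / srt2 + dmz * dmz / smz2));
        }
    }
    return benefit;  // higher means better-aligned peak lists
}
```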

’ VARIABLE MASS RESOLUTION
Mass spectrometers have peak widths that vary with m/z according to known equations based on the physics of the separation and detection process. The terminology is sometimes confusing in its use of “resolution” and “resolving power,” so we speak here directly in terms of the width of a peak, σ, as a function of m/z. Equations 1a–d show the dependence of the peak width on m/z for the main types of mass spectrometers used in LC–MS.21–23 For clarity, in these and the following equations we use mz as shorthand for m/z in subscripts and when attached to subscripts, e.g., $\sigma_{mz}$ and $mz_0$.

Quadrupole and ion trap:
$$\sigma_{mz} = \mathrm{const} \qquad (1a)$$

TOF and QTOF:
$$\sigma_{mz} \propto m/z \qquad (1b)$$

Orbitrap:
$$\sigma_{mz} \propto (m/z)^{3/2} \qquad (1c)$$

FTICR:
$$\sigma_{mz} \propto (m/z)^{2} \qquad (1d)$$

A problem arises in mapping the LC–MS data to a regular mesh because the mesh may be too fine at one m/z value and too coarse at another, resulting in oversampling and undersampling, respectively. Not only does this result in an inefficient representation of the data, but regions that are too fine may cause single peaks to split into subpeaks due to small variations in values across the broadly sampled peak, while coarse regions may sacrifice the resolution delivered by the instrument and lose information. In order to have more consistent sampling of the mesh and a more robust peak search routine based on slope, TAPP maps the raw data onto a deformed mesh that has variable m/z spacing, which results in uniform sampling in terms of peak resolution in the m/z dimension. In order to do this, the peak resolution function in m/z, σ(m/z), of each mass spectrometer must be integrated to determine a formula for m/z as a function of an index, n, that takes uniform steps in resolution space. Furthermore, the mesh must have steps that correspond to the actual σ values of the instrument, i.e., the size of the steps should match the resolution of the instrument at each m/z value.

Several items are needed to create a mesh with the desired irregular spacing, with steps of uniform resolution: a formula, m/z(n), for generating m/z as a function of an index; a user-specified resolution, $\bar{\sigma}$, at a given m/z value, $\overline{mz}$; and a starting mass value for the initial index, $mz_0$. In this characterization, the peak width is specified at only a single m/z value, which, along with the starting mass and the particular mass spectrometer resolution formula, completely determines the desired uniform-resolution grid spacing at all m/z values. Equations 2a–d show the result of inverting σ(m/z) to produce the desired m/z(n) grid spacing function for the main types of mass spectrometers and their corresponding resolution functions.

Quadrupole and ion trap:
$$m/z(n) = mz_0 + n\,\bar{\sigma} \qquad (2a)$$

TOF and QTOF:
$$m/z(n) = mz_0\, e^{(\bar{\sigma}/\overline{mz})\, n} \qquad (2b)$$

Orbitrap:
$$m/z(n) = \left(\frac{1}{\sqrt{mz_0}} - \frac{\bar{\sigma}}{\overline{mz}^{3/2}}\, n\right)^{-2} \qquad (2c)$$

FTICR:
$$m/z(n) = \frac{mz_0}{1 - (\bar{\sigma}\, mz_0/\overline{mz}^{2})\, n} \qquad (2d)$$

The function is completely determined by the type of mass spectrometer, the starting mass for index 0, and the expected resolution at a single specified m/z value. This parametrization avoids potential confusion caused by terms such as “resolving power” that may themselves be a function of m/z for a given type of mass spectrometer. The constant-resolution mesh allows more efficient use of memory and simplifies peak searching and parametrization since the peak widths are constant across the mesh. It also unifies the appearance of peaks across the full m/z range and avoids the creation of small shoulder peaks caused by oversampling near a broad peak. Although the resolution equations 1a–d may not apply exactly to a given instrument, this has no detrimental impact since the mass values are ultimately mapped back to the original nondeformed m/z space using the exact inverse formula. The deformed mesh is a temporary construct used to find peaks and need only approximate the changing resolution in m/z in order to have the desired roughly uniform sampling density. Once centroids are found in the deformed mesh, the corresponding inverse mapping yields the true m/z values of the peaks.
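For concreteness, the following C++ sketch generates the variable-spacing m/z grid of eqs 2a–d for the four analyzer types. The function and parameter names (warpedMzGrid, sigmaRef, mzRef) are illustrative rather than TAPP’s actual API, and the loop bound is a simple safeguard against degenerate inputs.

```cpp
#include <cmath>
#include <vector>

enum class Analyzer { Quadrupole, TofQtof, Orbitrap, Fticr };

// Generate the variable-spacing m/z grid of eqs 2a-d: m/z as a function of the
// integer index n, given the analyzer type, the starting mass mz0 (index 0), and
// the expected peak width sigmaRef at a single reference mass mzRef. The grid then
// takes steps of roughly one peak width everywhere in m/z.
std::vector<double> warpedMzGrid(Analyzer type, double mz0, double sigmaRef,
                                 double mzRef, double mzMax)
{
    std::vector<double> grid;
    for (int n = 0; n < 100000000; ++n) {           // hard cap to avoid runaway loops
        double mz = 0.0;
        switch (type) {
        case Analyzer::Quadrupole:                  // eq 2a: constant peak width
            mz = mz0 + n * sigmaRef;
            break;
        case Analyzer::TofQtof:                     // eq 2b: width proportional to m/z
            mz = mz0 * std::exp((sigmaRef / mzRef) * n);
            break;
        case Analyzer::Orbitrap: {                  // eq 2c: width ~ (m/z)^(3/2)
            const double t = 1.0 / std::sqrt(mz0)
                           - (sigmaRef / std::pow(mzRef, 1.5)) * n;
            mz = 1.0 / (t * t);
            break;
        }
        case Analyzer::Fticr:                       // eq 2d: width ~ (m/z)^2
            mz = mz0 / (1.0 - (sigmaRef * mz0 / (mzRef * mzRef)) * n);
            break;
        }
        if (mz > mzMax || mz <= 0.0) break;         // stop past the acquisition range
        grid.push_back(mz);
    }
    return grid;
}
```

As an illustrative call, Analyzer::TofQtof with mz0 = 300, sigmaRef = 0.03 Th at mzRef = 300, and mzMax = 2000 yields a grid whose spacing grows in proportion to m/z, consistent with eq 1b.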

Although many of the methods used by TAPP are computationally expensive and require a large amount of memory, the recent move from 32-bit to 64-bit architectures allows even a typical modern laptop to process high-resolution data sets in memory. Memory objects in TAPP may exceed the limits imposed by 32-bit architectures and demand either reduced resolution or a partitioning of the data for analysis, but a 64-bit architecture allows manipulation of entire data sets at full resolution with the constant-resolution meshes. TAPP is implemented in portable C++, and it runs on most operating systems and platforms ranging from simple laptops to high-end clusters. We performed the analysis for this work on an IBM Power-6 cluster with eight processors per node and 128 GB of memory. When one CPU of this computer is used, the meshing process takes approximately 30 min to convert a 26 GB mzData XML file to a 300 MB binary mesh, with most of the time spent parsing the XML. The slope-based peak search then takes 50 s, and each Warp2D operation, warping one list of 200K peaks to a reference, takes 5 min. Since these are all independent operations, they can easily be run in parallel across a computer cluster, so that the overall time to process an experiment is limited by the time spent meshing a single file.
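Because the per-sample stages are independent, dispatching them in parallel is straightforward. The sketch below uses std::async as one simple option; processSample is a hypothetical stand-in for the per-file TAPP stages (meshing, peak picking, alignment to a fixed reference), not an actual TAPP entry point.

```cpp
#include <future>
#include <string>
#include <vector>

// Dispatch the independent per-sample stages in parallel and wait for them all.
void processAllSamples(const std::vector<std::string>& sampleFiles)
{
    std::vector<std::future<void>> jobs;
    for (const std::string& file : sampleFiles)
        jobs.push_back(std::async(std::launch::async, [file] {
            // processSample(file);  // hypothetical: mesh, pick peaks, align to reference
        }));
    for (std::future<void>& j : jobs) j.get();   // wait for every sample to finish
}
```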

’ CLUSTER-BASED THRESHOLDING
There are many ways to cluster scattered points in two dimensions, but a unique aspect of the clustering needed in this context is that some peaks are presumed to be noise and are not assigned to any cluster. This is in contrast to standard clustering algorithms that place each element in an optimal cluster and leave no elements unclassified. In TAPP, a realistic cluster representing a peptide signal is required to have a certain fraction of constituent peaks within a maximum user-specified radius in m/z and rt. Our method for clustering these peaks is called MetaMatch, and the clusters found are referred to as MetaPeaks. Isolated peaks that do not belong to a cluster are culled as “orphan” noise peaks. MetaMatch takes as input a list of aligned peaks from samples with group labels. The samples may all have the same label, or they may comprise a set of classes corresponding to, for example, mutant/wild type (MT/WT), disease type, or any number of categories. XCMS makes use of a similar class-based clustering scheme.24


In order to accommodate and detect differential expression across the classes, it is necessary to allow a cluster to contain peaks from only one class and not the others. In a differential mutant/wild-type experiment, a peak may be present in MT and not WT, or vice versa. If there are 10 of each type and a cluster is present with eight peaks from MT and none from WT, then that cluster appears in 80% of the MT samples but only 40% of the overall samples. As a result, in order to capture strongly differentially expressed peaks, the cluster criterion is a minimum occupancy fraction on a per-class basis. MetaMatch therefore requires the list of aligned peaks with assigned labels, the maximum cluster radius in m/z and rt, and the minimum fraction of peaks that must be present in at least one class. Note that this is a form of thresholding, but not one based on intensity; instead it is based on a requirement for a coherent presence of peaks across the samples of the same class, regardless of intensity. Noise peaks unrelated to biologically important peptide signals are assumed to be random and will not show this coherent behavior across samples, although chemical background signals may appear coherently and will require rejection later in the analysis pipeline. The fact that clustering occurs in the m/z and rt dimensions means that this clustering is not directly suitable for an equivalent clustering of MS/MS peaks, since each peak in an MS/MS spectrum is bound to a single rt value. Nonetheless, an analogous clustering and matching method may be possible for LC–MS/MS data. The clustering algorithm of MetaMatch is a specific form of within-groups clustering.25 The neighborhood of each peak is searched for a possible other cluster member. If none is found within the radial cutoffs, the peak is marked as an orphan and a new peak is selected. If a nearby peak is found, a centroid of the peak cluster is calculated based on the two peak positions, and a new search for peaks near that centroid commences. Each time a peak is added, a new cluster centroid is calculated and the cluster is checked for any peaks that have now drifted beyond the maximum radii from this new centroid. Such peaks are marked as “available”: they are not orphans yet, because they may match a different cluster to be created. Once a MetaPeak has stopped growing, it is checked for minimum occupancy within any class. If it does not meet this criterion, its peaks are released for possible matching with other clusters. Otherwise the peaks are marked as “taken” and the MetaPeak and its constituent peaks are logged. The algorithm proceeds through the entire list of peaks multiple times until there are no further changes, at which point it stops and outputs the list of MetaPeaks and orphans. Despite the seeming computational expense of this algorithm, a native implementation in C++ found clusters across 35 files, representing a total of 7 million peaks, in 11 min using one CPU of the computer described above. The output contained 150 000 MetaPeaks and 2.6 million orphans. The orphans represent the culled noise peaks, and the MetaPeaks quantify the intensities of peaks across the samples and between classes, allowing fold-ratio calculations of differential expression between classes of samples, as needed for comparative quantitative proteomics and biomarker discovery.
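The following simplified sketch captures the core of this clustering scheme: grow a candidate cluster around a running centroid within the maximum m/z and rt radii, then keep it only if some class reaches the minimum occupancy fraction. The “available”/“taken” bookkeeping and the repeated sweeps over the full peak list described above are omitted for brevity, so this is an illustration of the MetaMatch idea rather than a faithful reimplementation; all names are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

// One time-aligned peak from one sample, carrying its sample's class label.
struct AlignedPeak {
    double mz, rt, height;
    int sampleId;
    int classLabel;
    int cluster = -1;   // index of the MetaPeak it belongs to, or -1 (orphan)
};

// Simplified, single-pass sketch of MetaMatch-style clustering. dMz and dRt are the
// maximum cluster radii; minFraction is the minimum occupancy required in at least
// one class; classSizes[c] is the number of samples carrying class label c.
int metaMatchSketch(std::vector<AlignedPeak>& peaks, double dMz, double dRt,
                    double minFraction, const std::map<int, int>& classSizes)
{
    int nClusters = 0;
    for (std::size_t seed = 0; seed < peaks.size(); ++seed) {
        if (peaks[seed].cluster >= 0) continue;
        // Grow a candidate cluster around a running centroid.
        double cMz = peaks[seed].mz, cRt = peaks[seed].rt;
        std::vector<std::size_t> members;
        for (int pass = 0; pass < 5; ++pass) {            // a few centroid refinements
            members.clear();
            double sumMz = 0.0, sumRt = 0.0;
            for (std::size_t i = 0; i < peaks.size(); ++i) {
                if (peaks[i].cluster >= 0) continue;
                if (std::fabs(peaks[i].mz - cMz) <= dMz &&
                    std::fabs(peaks[i].rt - cRt) <= dRt) {
                    members.push_back(i);
                    sumMz += peaks[i].mz;
                    sumRt += peaks[i].rt;
                }
            }
            if (members.empty()) break;
            cMz = sumMz / members.size();
            cRt = sumRt / members.size();
        }
        // Keep the cluster only if some class reaches the minimum occupancy fraction;
        // otherwise its peaks remain unassigned (eventually counted as orphans).
        std::map<int, int> perClass;
        for (std::size_t i : members) perClass[peaks[i].classLabel]++;
        bool keep = false;
        for (const auto& [label, count] : perClass) {
            auto it = classSizes.find(label);
            if (it != classSizes.end() &&
                count >= minFraction * it->second) { keep = true; break; }
        }
        if (keep && members.size() > 1) {
            for (std::size_t i : members) peaks[i].cluster = nClusters;
            ++nClusters;
        }
    }
    return nClusters;   // peaks left with cluster == -1 are treated as orphans
}
```

For the spiked CSF data described below, the corresponding settings would be dMz = 0.05 Th, dRt = 0.3 min, and minFraction = 0.3 with a single class assigned to all samples.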

’ PARAMETER TUNING
Since MetaMatch relies on three parameters for clustering (the maximum cluster size in the m/z and rt dimensions, along with the minimum allowed cluster occupancy in any class), one must determine parameter values that result in robust clustering with good noise rejection.


Figure 3. Close-up view of MetaPeak clusters and orphans. Individual peaks appear as round symbols sized according to log(height). Peaks are colored based on their assigned MetaPeak in a random scheme to help visually distinguish them. Orphan peaks with no assigned MetaPeak appear in gray, indicating that they have been culled as noise peaks. Square symbols indicate the location and mean size of the MetaPeaks. The light-blue MetaPeak at m/z 390.2, 38.5 min, indicated with the red arrow, is a spiked peptide, with levels plotted in Figure 5a. The dark-blue peak above it (blue arrow) is an endogenous CSF peptide peak with levels shown in Figure 5b. The symbols are slightly fuzzy because they are drawn in transparent colors to reduce the effect of stacking order on how they are drawn.

The m/z and rt cluster dimension limits are linked to known properties of the instrument and experiment, so they can be set directly. Nonetheless, it is helpful to assess the sensitivity of the results to the parameters by performing repeated matchings with different values and noting the effect on resulting quantities, such as the number of MetaPeaks found and the number of orphans rejected. Figure 3 shows the MetaPeak clusters found in a small region of an LC–MS analysis of trypsin-digested porcine CSF. The clusters are colored randomly to help distinguish them from each other, and all constituent peaks have the color assigned to the corresponding MetaPeak, shown as a square located at the cluster centroid. The local clustering of peptide peaks is evident, in contrast with the more random and scattered peaks attributed to noise and colored gray in the figure. Figure 3 shows the final thresholding stage in TAPP, where noise peaks are rejected based on their nonmembership in a small cluster rather than their intensity relative to a threshold. This makes use of information across all samples, with knowledge of the sample classes, to gain insight into the recurring appearance of low-abundance signal peaks. A key question in any thresholding process for noise rejection is how to tune the parameters. The Dm/z and Drt parameters, which specify the maximum cluster radii in the m/z and rt dimensions, are based primarily on empirical assessments of the instrument resolution, but additional tuning insight will also come from studying the cluster behavior in parameter space. Parts a and b of Figure 4 are contour plots of MetaMatch results as a function of a 2D scan varying the cluster width in m/z and rt.


Figure 4. (a) Contour plots of the number of MetaPeaks as a function of the clustering parameters, Dm/z and Drt, for uniform-resolution meshing. This figure captures the sensitivity of the clustering algorithm to the choice of parameters, which helps reveal parameter values that have less influence on the results and therefore capture the true signal clusters inherent in the distribution of peaks. The parameters chosen for this study are Dm/z = 0.05 Th and Drt = 0.3 min, as shown by the red “+”. (b) Number of remaining “orphan” peaks after clustering into MetaPeaks. For very small cluster widths, in the lower left, the number of orphans rejected as noise peaks is very sensitive to the widths chosen. This begins to smooth out noticeably near Dm/z = 0.04 Th and Drt = 0.2 min. Again we show the parameters used for the study with the red “+”.


Figure 5. (a) Height of a peak related to a spiked cytochrome c peptide vs spiking level. (b) Height of a peak related to an endogenous CSF peptide vs spiking level.

Note that the number of MetaPeaks found (Figure 4a) and the number of orphans generated (Figure 4b) show a strong dependence for small values of the cluster size parameters, Dm/z and Drt, but they smooth out rapidly once the cluster size matches the measurement resolution. This smoothing indicates that the clustering behavior is less sensitive to the parameter values and suggests that the resulting clusters are intrinsic to the distribution of peaks in the data and not an artifact of clustering with specific parameter values. Although these figures do not point directly to optimal parameters, they provide a helpful guide to how sensitive the results are to given parameter values and thus can aid in parameter tuning.

The spiked CSF data set (described below) used values of Dm/z = 0.05 Th and Drt = 0.3 min, which are realistic values given the instrumentation (a QTOF mass spectrometer with a mass resolution of approximately 10 000, coupled to a microfluidic nano-LC with a time repeatability of ±7 s after time alignment26) and correspond to smooth regions in Figure 4. These parameters represent physical properties of the instrument and the accuracy of the time-alignment procedure, and the parameter scan provides a way to estimate the sensitivity of the matching results to their values.
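A parameter scan such as the one behind Figure 4 can be expressed as a simple loop over candidate (Dm/z, Drt) values, rerunning the clustering and tabulating the number of MetaPeaks and orphans; regions where these counts change slowly indicate parameter values to which the results are insensitive. The sketch below reuses the illustrative AlignedPeak and metaMatchSketch names from the clustering sketch above, and the parameter ranges are arbitrary examples.

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Scan a grid of (dMz, dRt) values, rerun the clustering sketch for each pair,
// and print a table of MetaPeak and orphan counts for contour plotting.
void scanClusterParameters(const std::vector<AlignedPeak>& alignedPeaks,
                           double minFraction, const std::map<int, int>& classSizes)
{
    std::printf("# dMz(Th)  dRt(min)  metaPeaks  orphans\n");
    for (int im = 1; im <= 10; ++im) {
        const double dMz = 0.01 * im;                    // 0.01 .. 0.10 Th
        for (int ir = 1; ir <= 10; ++ir) {
            const double dRt = 0.05 * ir;                // 0.05 .. 0.50 min
            std::vector<AlignedPeak> peaks = alignedPeaks;   // fresh copy per run
            const int metaPeaks =
                metaMatchSketch(peaks, dMz, dRt, minFraction, classSizes);
            int orphans = 0;
            for (const AlignedPeak& p : peaks)
                if (p.cluster < 0) ++orphans;
            std::printf("%8.3f %9.3f %10d %9d\n", dMz, dRt, metaPeaks, orphans);
        }
    }
}
```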


’ RESULTS
It is important to demonstrate the effectiveness of a pipeline such as TAPP on a suitably complex biological sample. For this reason we applied TAPP to a series of spiked porcine cerebrospinal fluid (CSF) samples. We spiked 35 samples with varying concentrations of horse heart cytochrome c in replicates of five, using a wet-lab protocol similar to that used in a recent work,26 with details in the Supporting Information for this article. We meshed the 35 samples onto a uniform-resolution grid based on the resolution behavior of the QTOF instrument according to eq 1b, followed by slope-based peak picking and Warp2D time alignment to a single, arbitrarily chosen reference sample. We used MetaMatch to find all clusters containing peaks from at least 10 of the 35 samples in each MetaPeak (a fraction of 0.3 used as the minimal class occupancy, with a single class assigned to all samples) and then searched the list of MetaPeaks for those that show differential trends consistent with the spiking. Figure 5 shows the constituent peaks of one such MetaPeak, revealing the clear stepped spiking level through the classes of samples. The consistency of the peak intensities within each spiking level indicates the quantitative precision of the peak extraction method, and the matching of all the related peaks to the same MetaPeak, despite the large dynamic range of spiked peak abundances, demonstrates the success of both the time alignment and the MetaMatch clustering method. For comparison, Figure 5 also shows a peak from an endogenous peptide, indicating consistent abundance across the samples, as expected.

An effective way to communicate the clustering and noise rejection aspects of TAPP is to study an animation of the spiked samples as a function of spiking. Figure S1 (Supporting Information) presents this sequence as an animated visualization that shows the concentration-specific intensity change for the peptide plotted in Figure 5. Note that the human eye–brain perceives the clustering behavior automatically and picks out the spiked peaks based on their correlated decrease with reduced spiking level. There are small shifts of the peaks in rt, but much less than would be evident without Warp2D time alignment. The clusters of true signal peaks, corresponding to MetaPeaks, are evident in the animation, as well as the sporadic noise peaks culled out as orphan peaks, which appear with a random distribution in the retention time and m/z space.

’ DISCUSSION
This paper focuses on the benefits of delaying the rejection of noise peaks until late in the analysis so that information from the entire data ensemble can aid in distinguishing signal from noise. In addition, novel methods are provided for improving resolution and for clustering peaks across samples that have general applicability to LC–MS work outside of biomarker discovery. However, if the goal of an experiment is to quantify specific peptides, then targeted LC–MS/MS is more appropriate and there is little need to retain peaks near the level of noise. Furthermore, any discovered biomarker signature needs to be followed up to identify the peptide and understand its role in the biology behind the experiment, and recognizing differential peaks is only an early step in the process.
Nonetheless, if signatures are present only as low-abundance peaks, and low-abundance peaks are explicitly ignored in typical LC–MS workflows while they are retained in TAPP, then TAPP clearly has an advantage, since otherwise the key data of interest would be completely unavailable.


Although there may be a long list of differential signatures, they can be ranked according to fold ratio and ease of follow-up, in keeping with the role of LC–MS-based biomarker discovery as “hypothesis generation.”27 TAPP provides a quantitative peak matrix as output, which can then undergo isotope and charge-state deconvolution. The data presented in this article were obtained with label-free analysis, but TAPP also applies to labeled data since it is generic to LC–MS peak extraction and analysis. It is difficult to quantify the true and false positive rates of detection for faint peaks in a biological sample, because ground-truth knowledge of faint peptide signals is unavailable when the complete composition of the sample is unspecified. In the example data shown in this paper, the “biomarker” signature is the progressive and expected loss of intensity across classes of samples in an experiment due to spiking. Although this is somewhat contrived as a biological signal, the same principles apply to the detection of a differential signal based on fold ratio for a wild-type/mutant experiment. For a more general experiment with multiple samples across several classes, a more complex biomarker signature is possible, involving coordinated enhancement and suppression of different peptides. Regardless of the complexity of the biomarker signature, the principles applied in TAPP have clear benefit, since they make maximum use of the available information within a given sample set, both in coverage and in dynamic range, compared to the early application of thresholds or selective coverage based on MS/MS. There will always be error in identifying true signal peak clusters across multiple samples, but TAPP aims to improve the detection of the low-abundance peaks where important biological signatures may be found.
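As a small example of how the quantitative peak matrix can feed such a ranking, the sketch below computes per-class mean intensities for each MetaPeak and orders the MetaPeaks by absolute log2 fold ratio between two chosen classes. The row layout and field names are illustrative, not TAPP’s actual output format.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One row of the quantitative peak matrix: a MetaPeak with the measured intensity
// of its constituent peak in each sample (0 where the sample contributed no peak).
struct MetaPeakRow {
    double mz, rt;
    std::vector<double> intensity;   // indexed by sample
};

// Rank MetaPeaks by the fold ratio between two sample classes (e.g., mutant vs
// wild type) as a simple way to shortlist candidate differential signatures.
std::vector<std::size_t> rankByFoldRatio(const std::vector<MetaPeakRow>& rows,
                                         const std::vector<int>& classOfSample,
                                         int classA, int classB)
{
    std::vector<double> score(rows.size(), 0.0);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        double sumA = 0.0, sumB = 0.0;
        int nA = 0, nB = 0;
        for (std::size_t s = 0; s < rows[r].intensity.size(); ++s) {
            if (classOfSample[s] == classA) { sumA += rows[r].intensity[s]; ++nA; }
            if (classOfSample[s] == classB) { sumB += rows[r].intensity[s]; ++nB; }
        }
        const double meanA = nA ? sumA / nA : 0.0;
        const double meanB = nB ? sumB / nB : 0.0;
        const double eps = 1e-9;                       // avoid division by zero
        score[r] = std::fabs(std::log2((meanA + eps) / (meanB + eps)));
    }
    std::vector<std::size_t> order(rows.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return score[a] > score[b]; });
    return order;   // MetaPeak indices, largest absolute log2 fold ratio first
}
```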

’ ASSOCIATED CONTENT

Supporting Information. Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

’ AUTHOR INFORMATION
Corresponding Author

*E-mail: [email protected].

’ ACKNOWLEDGMENT
The study of CSF was supported by a grant from the Dutch TopInstitute Pharma (TIP 4.102). Groningen further gratefully acknowledges the following grants: BioRange (NBIC) 2.2.3; NPC (Netherlands Proteomics Centre) Bsik03015; Dutch Multiple Sclerosis Foundation.

’ REFERENCES
(1) Horvatovich, P.; Govorukhina, N. I.; Reijmers, T. H.; van der Zee, A. G.; Suits, F.; Bischoff, R. Electrophoresis 2007, 28, 4493–4505.
(2) Mischak, H.; Allmaier, G.; Apweiler, R.; Attwood, T.; Baumann, M.; Benigni, A.; Bennett, S. E.; Bischoff, R.; Bongcam-Rudloff, E.; Capasso, G.; Coon, J. J.; D’Haese, P.; Dominiczak, A. F.; Dakna, M.; Dihazi, H.; Ehrich, J. H.; Fernandez-Llama, P.; Fliser, D.; Frokiaer, J.; Garin, J.; Girolami, M.; Hancock, W. S.; Haubitz, M.; Hochstrasser, D.; Holman, R. R.; Ioannidis, J. P. A.; Jankowski, J.; Julian, B. A.; Klein, J. B.; Kolch, W.; Luider, T.; Massy, Z.; Mattes, W. B.; Molina, F.; Monsarrat, B.; Novak, J.; Peter, K.; Rossing, P.; Sanchez-Carbayo, M.; Schanstra, J. P.; Semmes, O. J.; Spasovski, G.; Theodorescu, D.; Thongboonkerd, V.; Vanholder, R.; Veenstra, T. D.; Weissinger, E.; Yamamoto, T.; Vlahou, A. Sci. Transl. Med. 2010, 2, 46ps42.


(3) Nilsson, T.; Mann, M.; Aebersold, R.; Yates, J. R.; Bairoch, A.; Bergeron, J. J. M. Nat. Methods 2010, 7, 681–685.
(4) Yates, J. R., III; McCormack, A. L.; Schieltz, D.; Carmack, E.; Link, A. J. Protein Chem. 1997, 16, 495–497.
(5) Cox, J.; Mann, M. Nat. Biotechnol. 2008, 26, 1367–1372.
(6) Cappadona, S.; Levander, F.; Jansson, M.; James, P.; Cerutti, S.; Pattini, L. Anal. Chem. 2008, 80, 4960–4968.
(7) Cappadona, S.; Nanni, P.; Benevento, M.; Levander, F.; Versura, P.; Roda, A.; Cerutti, S.; Pattini, L. J. Biomed. Biotechnol. 2010, 2010, 1–10.
(8) Michalski, A.; Cox, J.; Mann, M. J. Proteome Res. 2011, 10, 1785–1793.
(9) Domon, B.; Aebersold, R. Nat. Biotechnol. 2010, 28, 710–721.
(10) Wepf, A.; Glatter, T.; Schmidt, A.; Aebersold, R.; Gstaiger, M. Nat. Methods 2009, 6, 203–205.
(11) Panchaud, A.; Jung, S.; Shaffer, S. A.; Aitchison, J. D.; Goodlett, D. R. Anal. Chem. 2011, 83, 2250–2257.
(12) Tautenhahn, R.; Bottcher, C.; Neumann, S. BMC Bioinf. 2008, 9, 504.
(13) Zhang, J.; Gonzalez, E.; Hestilow, T.; Haskins, W.; Huang, Y. Curr. Genomics 2009, 10, 388–401.
(14) Anderson, N. L.; Anderson, N. G. Mol. Cell. Proteomics 2002, 1, 845–867.
(15) Baggerly, K. A.; Morris, J. S.; Edmonson, S. R.; Coombes, K. R. J. Natl. Cancer Inst. 2005, 97, 307–309.
(16) Coombes, K. R.; Baggerly, K. A.; Morris, J. S. In Fundamentals of Data Mining in Genomics and Proteomics; Dubitzky, W., Granzow, M., Berrar, D. P., Eds.; Springer: New York, 2007; pp 79–102.
(17) Suits, F.; Lepre, J.; Du, P.; Bischoff, R.; Horvatovich, P. Anal. Chem. 2008, 80, 3095–3104.
(18) Haas, W.; Faherty, B. K.; Gerber, S. A.; Elias, J. E.; Beausoleil, S. A.; Bakalarski, C. E.; Li, X.; Villen, J.; Gygi, S. P. Mol. Cell. Proteomics 2006, 5, 1326–1337.
(19) Scheltema, R. A.; Kamleh, A.; Wildridge, D.; Ebikeme, C.; Watson, D. G.; Barrett, M. P.; Jansen, R. C.; Breitling, R. Proteomics 2008, 8, 4647–4656.
(20) Open Visualization Data Explorer, OpenDX. http://www.opendx.org.
(21) Guilhaus, M. J. Mass Spectrom. 1995, 30, 1519–1532.
(22) Marshall, A. G.; Hendrickson, C. L.; Jackson, G. S. Mass Spectrom. Rev. 1998, 17, 1–35.
(23) Hoffmann, E. d.; Stroobant, V. Mass Spectrometry: Principles and Applications, 3rd ed.; Wiley: West Sussex, England, 2007.
(24) Smith, C. A.; Want, E. J.; O’Maille, G.; Abagyan, R.; Siuzdak, G. Anal. Chem. 2006, 78, 779–787.
(25) Fielding, A. H. Cluster and Classification Techniques for the Biosciences; Cambridge University Press: Cambridge, U.K., 2007.
(26) Rosenling, T.; Slim, C. L.; Christin, C.; Coulier, L.; Shi, S.; Stoop, M. P.; Bosman, J.; Suits, F.; Horvatovich, P. L.; Stockhofe-Zurwieden, N.; Vreeken, R.; Hankemeier, T.; van Gool, A. J.; Luider, T. M.; Bischoff, R. J. Proteome Res. 2009, 8, 5511–5522.
(27) Cravatt, B. F.; Simon, G. M.; Yates, J. R., III. Nature 2007, 450, 991–1000.
