CLUE-TIPS, Clustering Methods for Pattern Analysis of LC-MS Data

Lakshmi Manohar Akella,*,† Tomas Rejtar,* Christina Orazine, Marina Hincapie, and William S. Hancock
Barnett Institute of Chemical and Biological Analysis, Northeastern University, Boston, Massachusetts 02115
Received May 13, 2009

Liquid Chromatography Mass Spectrometry (LC-MS) based proteomics is an important tool for detecting changes in peptide/protein abundances in samples, potentially leading to the discovery of disease biomarker candidates. We present CLUE-TIPS (Clustering Using Euclidean distance in Tanimoto InterPoint Space), an approach that compares complex proteomic samples for similarity/dissimilarity analysis. In CLUE-TIPS, an intersample distance feature map is generated from filtered, aligned, and binarized raw LC-MS data by applying the Tanimoto distance metric to obtain normalized similarity scores between all sample pairs for each m/z value. We developed clustering and visualization methods for the intersample distance map to analyze samples for differences at the sample level as well as at the individual m/z level. An approach to query for specific m/z values that are associated with similarity/dissimilarity patterns in a set of samples is also briefly described. CLUE-TIPS can also be used as a tool for assessing the quality of LC-MS runs. The presented approach does not rely on tandem mass spectrometry (MS/MS), isotopic labels, or gels, nor does it rely on feature extraction methods. The CLUE-TIPS suite was applied to LC-MS data obtained from plasma samples collected at various time points and under various treatment conditions from immunosuppressed mice implanted with MCF-7 human breast cancer cells. The generated raw LC-MS data were used for pattern analysis and similarity/dissimilarity detection. CLUE-TIPS successfully detected the differences/similarities between samples taken at various time points during tumor progression and also recognized differences/similarities between samples representing various treatment conditions.

Keywords: LC-MS • proteomics • biomarker discovery • hierarchical clustering • heat map • visualization • Multi-Dimensional Scaling (MDS) • Principal Component Analysis (PCA) • Tanimoto distance

* To whom correspondence should be addressed. E-mail: (L.M.A.) [email protected], (T.R.) [email protected].
† Current address: Marine Biological Laboratory, Woods Hole, MA 02543.
Journal of Proteome Research 2009, 8, 4732–4742. Published on Web 09/02/2009. 10.1021/pr900427q CCC: $40.75. © 2009 American Chemical Society

Introduction

The importance of mass spectrometry based approaches for analyzing biological samples at the protein level is widely known.1,2 Changes in the abundance of a particular set of proteins or their modifications can be associated with health and disease in organisms. Methods and technologies that analyze biological samples by comparing their protein/peptide levels are important for the discovery of disease pathways and for the identification of diagnostic and prognostic disease biomarkers. Many of the approaches that compare samples by quantifying peptide/protein levels apply isotope labeling and use tandem mass spectrometry (MS/MS).3-7 However, only 20% of the MS/MS spectra collected during LC-MS experiments are typically assigned to peptides, while many high-intensity precursor ions are missed. As a result, the list of peptides identified by MS/MS is incomplete and proteins of interest might be overlooked.11,15,16 Approaches that rely on the raw LC-MS data rather than on a list of peptides can overcome this limitation.

In recent years, several label-free approaches for LC-MS based quantitative proteomic analysis were introduced.8-16 Some of the label-free approaches utilize MS/MS information for quantitation,17 while other approaches use the MS/MS derived peptide identifications only to establish chromatographic landmarks for peak matching and alignment.18 Most of the current approaches for comparative/quantitative LC-MS proteomics that do not rely on MS/MS or chemical/isotopic labels can be broadly classified into two groups: those that detect and quantify LC-MS features (for example, peaks at specific m/z values and retention times) and then compare two or more samples,8-13 and those that work with the entire MS signal without peak detection.14-16 The approaches by Wang et al.8 and Radulovic et al.9 are label-free strategies that involve denoising, peak detection, alignment, grouping of peptide features, and quantitation.

In the context of biomarker discovery, the problem of detecting differences in a group of samples is perhaps more difficult than the problem of classifying samples.16 A variety of approaches that primarily aim at finding differences between groups of LC-MS samples (e.g., disease vs control) rely on, for example, variants of the t-test.8-10,14,16 In Listgarten et al.,16 a spatial t-statistic was used to detect the differences after alignment and normalization with an approach based on Hidden Markov Models.19 In many of these approaches, once the differences associated with specific peptides have been identified, a targeted MS/MS analysis of the identified differences is performed8,12,14 for protein identification using SEQUEST20 or Mascot.21 Once the proteins linked to the differences in samples have been identified, they can be further analyzed to identify potential biomarkers.

Among other methods, one approach computes a distance/similarity score for each sample pair using a dynamic programming based optimal alignment scoring strategy such as Needleman-Wunsch,22 applied to the mass spectra of the sample pair.15,23 In another approach, peptide features were extracted from LC-MS samples and each sample was expressed in terms of the relative protein abundances of the thousands of peptide features that were extracted. The result is a matrix with samples as rows and peptide features as columns, which is similar to the representation of gene expression microarray data.12,13,24 The generated sample versus peptide matrix can be used as input for unsupervised clustering methods or, when label information is available, for supervised approaches to pattern analysis.

The peak-based approaches detect and extract peaks in individual LC-MS runs, and the generated peak lists are then compared across several runs. Importantly, some peaks (features) could be lost in the process.14,16 In addition, peptide elution is a complex and not well-understood process. As a result, the peptide feature lists tend to be ambiguous, with different tools generating different feature lists for the same sample.15 Hence, it is difficult to accurately represent all samples using a set of features extracted in such an approach. The advantages of each of the approaches are likely to be closely linked to the precision of the MS instrument being used. High-resolution data are typically used with peak-based approaches, while lower-resolution data are typically used with signal-based approaches.16

In addition, because of the large dimensionality of the peptide feature space, traditional machine learning approaches require a set of samples much larger than the dimensionality of that space, a problem commonly referred to as the curse of dimensionality. Dimensionality reduction methods like PCA and/or feature selection methods need to be applied before proceeding to clustering. In approaches that generate sample versus peptide feature arrays, the number of extracted peptide features can range from hundreds to thousands, but the number of samples is typically much lower, which affects classification. In our approach, we work in the intersample distance feature space, which allows classification even when the number of samples is small, with the help of the clustering methods that we describe in the Methods and Materials section.

In this paper, we present an approach for analyzing the similarities and detecting differences in multiple samples at the m/z level, and also an approach to extract m/z values for MS/MS sequencing that satisfy a set of conditions for a particular subset of samples. Our approach is primarily a peak-free approach, in which we generate a normalized all-pair similarity matrix for all relevant m/z values. Instead of detecting individual peaks (features), our approach utilizes raw LC-MS signal information and performs a series of preprocessing steps such as m/z filtering, alignment, baseline correction, and denoising. Then, the set of LC-MS samples is transformed into a single matrix of pairwise similarity scores, using the Tanimoto distance, for all m/z values in the set of samples, and subspaces of this Tanimoto interpoint distance space are clustered for similarity analysis.
Tanimoto distance is widely used as a measure of similarity between chemical compounds expressed as binary vectors, called fingerprints, in large combinatorial compound libraries.25 Fingerprints for each molecule are generated as an ASCII string of 1's and 0's indicating the presence or absence of certain substructural fragments in each compound. These fingerprints are an abstract representation of the structural features of a molecule. Just as a chemical compound can be thought of as a combination of its substructures, we can represent an EIC (Extracted Ion Chromatogram) or a mass spectrum as a series of peaks. The Tanimoto distance between two binary vectors is one minus the ratio of the number of bits set in both vectors to the number of bits set in at least one of the two vectors, as shown below.

$$D_{\mathrm{Tanimoto}}(S_i, S_j) = 1 - \frac{|S_i \cap S_j|}{|S_i \cup S_j|} \qquad (1)$$

It can be seen from eq 1 that the Tanimoto distance is a number between 0 and 1. We use the Tanimoto similarity matrix, which holds the similarity scores of all pairs of samples for all m/z values, to extract and analyze patterns from the LC-MS samples efficiently in a variety of ways. Also, we observed that the standard Euclidean distance was less useful for pattern analysis when working in the interpoint distance space. Once the Tanimoto similarity matrix is generated, it occupies only a small fraction of the storage space of the original set of samples. The similarity matrix approach defines an intersample similarity feature space, for which we developed a set of clustering methods that detect m/z level differences in a fully unsupervised setting. These methods complement each other in extracting patterns and analyzing the similarities/differences between samples for every m/z value in the given set of LC-MS samples. We also briefly describe an approach to query the Tanimoto similarity matrix and automatically extract relevant m/z values associated with the detected similarities and differences in samples.
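As an illustration of eq 1, the Tanimoto distance between two binary vectors can be sketched in Python as follows. This is a minimal sketch and not the CLUE-TIPS implementation; the function name and the handling of two all-zero vectors (treated here as identical) are assumptions made for the example.

```python
from typing import Sequence


def tanimoto_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Eq 1: 1 - |A intersect B| / |A union B| for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    if either == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return 1.0 - both / either


# Hypothetical binary vectors: 2 shared bits out of 4 set bits in total.
s1 = [1, 1, 0, 1, 0, 0]
s2 = [1, 0, 0, 1, 1, 0]
d = tanimoto_distance(s1, s2)
```

Identical vectors give a distance of 0 and vectors with no bit in common give a distance of 1, which matches the range stated for eq 1.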

Methods and Materials

We begin with a general description of the algorithm (CLUE-TIPS) for analyzing LC-MS patterns and then describe the details of the individual steps involved in the approach.

Algorithm Description. In the CLUE-TIPS (Clustering Using Euclidean distance in Tanimoto Inter-Point Space) approach, we first process raw LC-MS signal data without extracting specific peak information and perform a series of steps including m/z filtering, alignment, denoising, and baseline correction to generate a set of matrices that we term the Knowledgebase (see Figure 1). We call each matrix in the Knowledgebase an mz-matrix; it contains the Extracted Ion Chromatograms (EICs) of all samples for one m/z value. The Tanimoto distance is applied to binary vectors generated from each of the mz-matrices to produce a matrix of m/z values and Tanimoto intersample distance values. This matrix of m/z values and Tanimoto similarity scores of all pairs of samples is shown in Figure 1 as the Tanimoto Feature Map. The intersample Tanimoto scores are the features for the clustering methods.

Figure 1. Diagram giving a basic overview of the algorithm implemented in CLUE-TIPS.

m/z Filtering and Alignment. Each LC-MS sample is represented as a matrix with chromatographic time in rows and m/z values in columns. The maximum intensity of each m/z column is calculated, and only those columns (m/z values) whose maximum intensity is at least a fraction R of the mean of those maximum intensity values are retained. In our experiment, R was set to 0.1, which selected about 70% of the original m/z values. The m/z filtering is performed for one sample at a time, and the m/z values common to all samples were used for the construction of the Tanimoto Feature Map. After the m/z filtering, we perform chromatographic alignment of the MS signals using the Ordered Bijective Interpolated Warping (OBI-Warp)26,27 software. OBI-Warp aligns matrices along a single axis using Dynamic Time Warping (DTW) and a bijective interpolated warp function.27 Our input m/z versus time matrices were converted into the lmata format and provided as input to OBI-Warp with default parameters.

Noise Reduction and Baseline Correction. After alignment, the LC-MS samples are represented as a set of mz-matrices, where each matrix contains the EICs of all samples as rows for a particular m/z value. A moving average filter, a special case of the Finite Impulse Response (FIR) filter,28 was used to reduce the noise in the mz-matrices. Then, for each row (EIC) of every mz-matrix, the variable background (baseline) is estimated and corrected using a spline approximation within multiple shifted windows.

Binarization. Each mz-matrix is transformed into a binary matrix of zeros and ones. Each data element data(i,j) of the mz-matrix, representing the intensity value at time j in the EIC of the ith sample for the corresponding m/z value, is transformed into a string of k zeros and ones. The number of consecutive ones to populate is given by the smallest integer greater than or equal to the value of N in the following equation.

$$N = \left(\frac{\sqrt{\mathrm{data}(i,j)}}{\mathrm{GlobalMax}}\right) \times 200 \qquad (2)$$

GlobalMax is a large constant, which was set to 10^6 in our experiment. As a result of binarization, each mz-matrix gets expanded by a factor of k, which was set to 40 in the experiment.

Tanimoto Intersample Distance Map. After the binarization step, we generate an intersample distance vector by applying the Tanimoto distance to all pairs of samples in each mz-matrix. For a set of n samples, the n(n - 1)/2 columns of the Tanimoto intersample distance matrix, one per sample pair, are arranged in the order (2,1), (3,1), ..., (n,1), (3,2), ..., (n,2), ..., (n,n - 1). Each mz-matrix is thus transformed into a vector of all-pair similarity scores, and the vectors generated for all mz-matrices are arranged into a matrix with the relevant m/z values as rows and the intersample similarity values as columns. We refer to this matrix as the Tanimoto interpoint map. When only specific samples of interest need to be analyzed, we can extract the corresponding subspace from the Tanimoto interpoint map by selecting only the sample pairs associated with those samples. The column index of a sample pair in the Tanimoto interpoint map is given by eq 3. We extract the columns associated with the sample pairs of interest and concatenate them to generate the Tanimoto submatrix of intersample distances.

$$\mathrm{Index}(\lambda, \kappa) = (\lambda - \kappa) + \sum_{p=1}^{\kappa - 1} (n - p), \quad \text{where } (\lambda, \kappa) \text{ is a sample pair with } \lambda > \kappa \qquad (3)$$
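As an illustration, the binarization of eq 2 and the column indexing of eq 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the MATLAB code used in CLUE-TIPS: the function names are hypothetical, the square root in eq 2 is assumed to apply to the intensity only, and N is assumed to be capped at k so that every intensity is encoded with the same number of bits.

```python
import math
from typing import List, Tuple


def binarize_value(intensity: float, k: int = 40,
                   global_max: float = 1e6, scale: float = 200.0) -> List[int]:
    """Encode one intensity as k bits: N leading ones, then zeros (eq 2)."""
    # Assumption: the radical in eq 2 covers the intensity only.
    n = math.ceil(math.sqrt(max(intensity, 0.0)) / global_max * scale)
    n = min(n, k)  # assumption: cap N at k so the encoded length stays k
    return [1] * n + [0] * (k - n)


def binarize_eic(eic: List[float], k: int = 40) -> List[int]:
    """Concatenate the k-bit codes of all time points of one EIC."""
    bits: List[int] = []
    for value in eic:
        bits.extend(binarize_value(value, k))
    return bits


def pair_index(lam: int, kappa: int, n: int) -> int:
    """Eq 3: 1-based column of sample pair (lam, kappa), lam > kappa, for n samples."""
    if not 1 <= kappa < lam <= n:
        raise ValueError("expected 1 <= kappa < lam <= n")
    return (lam - kappa) + sum(n - p for p in range(1, kappa))


def pair_order(n: int) -> List[Tuple[int, int]]:
    """Column order of the map: (2,1), (3,1), ..., (n,1), (3,2), ..., (n,n-1)."""
    return [(lam, kappa) for kappa in range(1, n) for lam in range(kappa + 1, n + 1)]


# Columns for a hypothetical subset of samples {1, 3, 4} out of n = 4 samples.
subset_columns = [pair_index(lam, kappa, 4) for lam, kappa in [(3, 1), (4, 1), (4, 3)]]
```

Each binarized EIC can then be compared with the EIC of another sample at the same m/z value using the Tanimoto distance of eq 1, and the resulting scores are stored in the columns given by `pair_index`.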

Clustering Methods for Pattern Analysis. The following clustering-based methods were developed and applied to the Tanimoto interpoint map for LC-MS sample pattern analysis, as shown in Figure 1.

(1) Clustering and Heat-Map Visualization. The Tanimoto interpoint map can be clustered using hierarchical clustering, and the overall similarity/dissimilarity patterns can be visualized with the help of a heat map.

(2) Mining Information by Aggregated Similarity Vector. The number of entries in the Tanimoto interpoint map for all pairs of samples increases as the square of the number of samples. As a result, it can be difficult to visualize sample patterns when the number of samples is high. We can mine information from the Tanimoto interpoint map by generating a mean similarity vector for all relevant pairs of samples. The mean similarity vector is obtained by averaging, for each sample pair of interest, the similarity scores across all m/z values. From the average similarity vector for all pairs, we reconstructed the complete-linkage hierarchical clustering29 of the data at the individual sample level. A dendrogram29 can be used to visualize the hierarchical clustering, as shown in Figure 1. One can also use the median, the mode, or another aggregating function to generate a similarity vector.

(3) MDS/PCA-2D Sample Plots. The aggregated similarity vector generated by the above procedure was transformed into a symmetric distance matrix, where the entry at row i and column j is the aggregated Tanimoto similarity score between samples i and j. Multi-Dimensional Scaling (MDS),29,30 Principal Component Analysis (PCA),29 or other dimensionality reduction techniques can be applied to the symmetric distance matrix. When MDS was applied, the data were mapped to two dimensions in a way that minimizes the sum of the squared differences between the distances of pairs of points in the target (2D) space and in the original space. All samples were then plotted in the 2D space for similarity analysis. When PCA was applied, the data were plotted in the space of the first two principal components to assess similarity at the individual sample level.
The 2D plots obtained with the MDS/PCA approach can in principle be compared to the plots generated by 'Chaorder',23 which was used to assess biases in quantitative proteomic experiments.

(4) Cluster of Histogram Counts. The similarity scores in each column of the Tanimoto interpoint map, that is, the scores of one sample pair, were binned, and the bin counts for all pairs of samples were plotted. The resulting plot can be viewed as a cluster or group of histogram-count plots for all pairs of samples. Similar pairs of samples form peaks around the bins corresponding to a high degree of similarity, that is, a low Tanimoto score, and less similar sample pairs form peaks around the bins representing a low degree of similarity, that is, a high Tanimoto score. This cluster of histogram counts is very useful for drilling down into the sample-level similarities.

(5) Clustering for Queries. The Tanimoto interpoint map can be queried in a variety of ways for the specific set of m/z values that make a particular group of samples similar or dissimilar. In a typical query, there are groups of samples where the samples within a group are similar and the samples in different groups are dissimilar, and we search for the m/z values that are consistent with this group structure. In a more general query, we think of a particular hierarchical pattern in a set of samples, as in a dendrogram representing hierarchical clusters, and then search for the set of m/z values that satisfy the sample pattern. To carry out such queries, a query vector covering all the pairs of samples involved in the query is constructed and populated with Tanimoto similarity scores that are consistent with the desired hierarchical pattern of samples. The submatrix associated with all the sample pairs corresponding to the samples in the query is constructed from the Tanimoto interpoint map using eq 3, with the query vector included as a row of that submatrix. K-means29 clustering is then performed on this submatrix, with m/z values as data points and intersample Tanimoto distances as features, and with a specific number of clusters as input. The data points (m/z values) that cluster with the query vector are then selected. Clustering approaches other than k-means can also be applied to extract the m/z values. The m/z values extracted by the queries can be further analyzed and/or sequenced for a set of peptides/proteins that may be linked to the similarities/differences in the sample set. As a result, this approach can be used to search for potential biomarker candidates.

Table 1. Sample Description

samples   time point   tumor   treatment
1, 2      Week-0       No      Baseline
3, 4      Week-6       No      Baseline
11, 12    Week-6       No      Tamoxifen
7, 8      Week-6       No      Estrogen and Tamoxifen
5, 6      Week-3       Yes     Baseline
15, 16    Week-6       Yes     Baseline
9, 10     Week-0       Yes     Estrogen
19, 20    Week-3       Yes     Estrogen
21, 22    Week-6       Yes     Estrogen
17, 18    Week-6       Yes     Tamoxifen
13, 14    Week-6       Yes     Estrogen and Tamoxifen

All plots in the above methods can also serve as a visual tool for assessing the quality of the LC-MS experiment. Outliers in the set of LC-MS samples can also be detected and eliminated with these methods. When sample replicate information is available, we can detect and eliminate an outlier with the help of the clustered heat map. Under ideal conditions, we expect replicate LC-MS runs to have very similar scores for all or most m/z values. When the Tanimoto interpoint distance matrix is visualized using a heat map, we expect to see a dark color for a replicate sample pair, indicating a high degree of similarity. When an outlier is involved in a particular pair, we do not see the similarity in the heat map even when the samples in the pair are replicates. A particular sample is confirmed as an outlier with the heat map along with some of our other methods.
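The query procedure of method 5 can be sketched in Python as follows. This is a minimal sketch, not the CLUE-TIPS implementation: the function names, the m/z values, and the scores are hypothetical, and the deterministic farthest-point seeding that starts from the query vector is an assumption made so that the example is reproducible.

```python
from typing import Dict, List


def sq_dist(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two score vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iters: int = 50) -> List[int]:
    """Plain Lloyd's k-means; seeds are chosen by farthest-point selection."""
    centroids = [list(points[0])]  # assumption: the first seed is the query row
    while len(centroids) < k:
        # next seed: the point farthest from its nearest existing seed
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels: List[int] = []
    for _step in range(iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def query_mz(submatrix: Dict[float, List[float]], query: List[float], k: int) -> List[float]:
    """Return the m/z values whose pair scores cluster with the query vector."""
    mz_values = list(submatrix)
    points = [query] + [submatrix[mz] for mz in mz_values]  # query is row 0
    labels = kmeans(points, k)
    return [mz for mz, lab in zip(mz_values, labels[1:]) if lab == labels[0]]


# Three samples, pair columns ordered (2,1), (3,1), (3,2).
# Query: samples 1 and 2 similar to each other, both dissimilar to sample 3.
query = [0.0, 1.0, 1.0]
submatrix = {
    500.5: [0.10, 0.90, 0.80],
    612.3: [0.90, 0.20, 0.10],
    650.2: [0.05, 0.95, 0.90],
    820.4: [0.80, 0.10, 0.30],
}
hits = query_mz(submatrix, query, k=2)
```

In this toy example, the two m/z values whose score patterns resemble the query vector are returned, while the other two fall into the second cluster.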
When the number of LC-MS samples is large, we can transpose the Tanimoto interpoint map so that the matrix has the sample pairs as rows and the m/z values as columns (features) and construct a SOM (Self-Organizing Map).31 The SOM can be used to construct submatrices having similar sample pairs as rows and m/z values as columns from the sets of data points (sample pairs) clustered around each centroid in the SOM. Each of these submatrices can be transposed and used as input to our clustering methods for pattern analysis. A SOM approach for automatically extracting relevant subspaces was not implemented in this version of CLUE-TIPS.

The CLUE-TIPS software suite, which implements all the clustering methods, was largely written in MATLAB. Some of the preprocessing methods applied to the LC-MS data were written in Perl. CLUE-TIPS took about 25 min on a single desktop computer to generate the results described in this paper, excluding preprocessing steps such as filtering and alignment. The software suite is available upon request.

Description of the Data Set. The performance of the CLUE-TIPS suite was evaluated using a previously published data set.32 Briefly, a study aiming to discover putative plasma biomarker candidates of breast cancer using proteomic analysis was performed on plasma samples collected from immunocompromised mice with an implanted human cancer xenograft (MCF-7). Groups of mice with implanted tumor as well as control groups were subjected to several different treatments, including administration of estrogen (promotes tumor growth), tamoxifen (inhibits tumor growth), a combination of estrogen and tamoxifen, and no treatment. Plasma samples were collected at several time points: at the beginning of the study and then at week 3 and week 6 after tumor implantation. Collected plasma samples were enriched for the glycoproteome using multilectin affinity chromatography (MLAC), digested with trypsin, and analyzed by nanoLC-MS with a low-resolution ion trap instrument. A selected subset of this study consisting of 22 LC-MS runs was used for the evaluation of the CLUE-TIPS suite.

Figure 2. (A) Tanimoto interpoint distance matrix clustering and heat-map visualization for samples 3, 4, 15, 16, 17, and 18. (B) Cluster of histogram counts, (C) dendrogram, and (D) MDS plot of the mean aggregated vector for samples 3, 4, 15, 16, 17, and 18.

Figure 3. Clusterings of samples 3, 4, 15, 16, 19, and 20 indicate that sample 19 (shown with ovals) is an outlier.

Results and Discussion

A total of 22 LC-MS data files from the mouse study,32 acquired at different time points under various treatments (estrogen, tamoxifen, and both), were used for pattern analysis with the CLUE-TIPS suite. A brief description of these samples is given in Table 1. The data set represents 11 different treatment conditions analyzed in duplicate, that is, a total of 22 runs. The m/z filtering scheme described earlier was applied to each of the 22 samples with the filtering parameter R set to 0.1. Then, m/z values below a cutoff value were also removed from the filtered set of m/z values. We set the cutoff to 500 in the experiment. As a result of the filtering process, about 30% of the m/z values were removed from each sample. The set of m/z values common to all the samples was later used in the Tanimoto interpoint map. Alignment in chromatographic retention time using OBI-Warp26,27 was performed on the samples, followed by baseline correction and denoising. The Tanimoto interpoint map for all pairs of samples was then generated from the binarized EICs as described in the Methods and Materials section.

Identifying Outliers. LC-MS data quality in terms of reproducibility, linked to, for example, electrospray instability or clogging of the LC-MS column, has a profound effect on data analysis; as a result, low-quality LC-MS runs (outliers) should be eliminated prior to the analysis. Outliers were first detected and removed from the set of 22 samples to generate a new Tanimoto interpoint map without the outlier samples. Clustering methods were then applied to the newly generated map. On the basis of the sample replicate information and information about sample categories such as disease/normal, we can find and eliminate outlier samples. We started with groups of samples consisting of samples and their replicates, and used the replicate information to detect and remove outliers from each group. For the first group, we analyzed three replicate pairs, (3, 4), (15, 16), and (17, 18), and generated the heat map visualization of the corresponding Tanimoto interpoint distance submatrix, which was extracted from the original Tanimoto interpoint map using eq 3. It can be seen from Figure 2A that the heat map pattern for the sample pairs (3, 4) and (15, 16) is very dark, indicating that samples 3 and 4 are very similar for most of the m/z values, as are samples 15 and 16. A similar dark pattern would be expected for the replicate sample pair (17, 18), but the light color pattern indicates high dissimilarity and shows that one of the runs in the pair (17, 18) is an outlier. It can be seen from Figure 2B that the histogram count plots of the sample pairs (3, 4) and (15, 16) are on the extreme left of the plot and cluster well. The pairs that include sample 18 are on the right side of the plot, showing dissimilarity, as marked by the vertical oval. In Figure 2C, we plotted the hierarchical clustering of the average similarity vector for the sample pairs, which clearly indicated that sample 18 was an outlier. We can also observe that sample 18 is an outlier from the MDS plot shown in Figure 2D. After detecting that sample 18 is an outlier, we proceeded with another group of samples and their replicates. In Figure 3, we see that, although samples 19 and 20 are replicate runs, they do not show similarity in the heat map. It can also be seen from Figure 3 that all sample pairs associated with sample 19 clustered together, forming an outlier cluster, as indicated by the large yellow patch on the left side of the heat map. The heat map entry associated with the sample pair (19, 20) showed large dissimilarity between these samples. Also, in the cluster of histogram counts in Figure 3, all sample pairs that include sample 19 formed a distinct cluster on the right side of the plot, as indicated by the arrow. We conclude that sample 19 is also an outlier. Similarly, samples 7 and 5 were also found to be outliers.

Figure 4. (A) Clustered heat map and the cluster of histogram counts of the Tanimoto interpoint map after outlier removal. (B) MDS plot and the dendrogram showing hierarchical clustering of samples generated from the mean aggregated vector of the Tanimoto interpoint map after outlier removal.

Identifying Analytically Similar Samples. After eliminating the outliers, we applied the various clustering methods to the resulting subspace, obtained by applying eq 3. The clustered heat map and the cluster of histogram counts are shown in Figure 4A. To address questions about sample similarity and dissimilarity, we first analyze the cluster of histogram counts in Figure 4A. The most similar sample pair is (13, 14), as it has the highest number of scores in the lowest Tanimoto score range, indicated by the peak at the extreme left of the cluster of histogram counts. Samples 13 and 14 are indeed replicate runs, and they form the most similar replicate pair in the sample set, as can be seen in the histogram plot in Figure 4A. The other replicate sample pairs also have high peaks on the left side of the histogram cluster plot in Figure 4A, indicating a high degree of similarity, as expected. The dendrogram showing hierarchical clustering and the MDS plot of the aggregated vector of the Tanimoto interpoint map are shown in Figure 4B. The aggregated similarity vector was generated by averaging, for each pair of samples, the similarity scores over all m/z values in the Tanimoto interpoint map. From the dendrogram in Figure 4B, it can be seen that all replicate runs clustered well and that the sample pairs (13, 14), (1, 2), and (15, 16) have the highest degree of similarity. The clustered heat map is useful for analyzing the similarities and dissimilarities between samples together with information about the m/z values associated with them. Darker patterns in the heat map indicate higher similarity, and one can select a specific region of interest in the heat map and zoom in to analyze the similarities/differences at the m/z level, as indicated in the heat map in Figure 4A.
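The aggregation and hierarchical clustering steps behind such a dendrogram can be sketched in Python as follows. This is a minimal sketch with hypothetical toy scores for four samples, not the CLUE-TIPS implementation; the function names are assumptions, sample indices are 0-based in the code, and the columns follow the pair order (2,1), (3,1), (4,1), (3,2), (4,2), (4,3).

```python
from typing import List, Tuple


def aggregate(feature_map: List[List[float]]) -> List[float]:
    """Mean Tanimoto score of each sample pair (column) over all m/z rows."""
    n_rows = len(feature_map)
    return [sum(col) / n_rows for col in zip(*feature_map)]


def to_square(vec: List[float], n: int) -> List[List[float]]:
    """Expand the pair vector into a symmetric n x n distance matrix."""
    d = [[0.0] * n for _ in range(n)]
    values = iter(vec)
    for kappa in range(n - 1):
        for lam in range(kappa + 1, n):
            d[lam][kappa] = d[kappa][lam] = next(values)
    return d


def complete_linkage(d: List[List[float]]) -> List[Tuple[List[int], List[int], float]]:
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [[i] for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: distance between the farthest members
                h = max(d[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Two hypothetical m/z rows of a Tanimoto interpoint map for four samples.
feature_map = [
    [0.1, 0.8, 0.9, 0.7, 0.9, 0.3],
    [0.3, 0.6, 0.7, 0.9, 0.7, 0.3],
]
mean_vec = aggregate(feature_map)
dist = to_square(mean_vec, 4)
merges = complete_linkage(dist)
```

In this toy example, the first two samples merge first and the last two merge next, which is the behavior expected of two replicate pairs; the merge heights are the values that a dendrogram would display.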

research articles

Akella et al.

Figure 5. Clustering methods applied to the subset 14, 15, 17, 20, and 22. The dendrogram was generated from the mean similarity vector of the Tanimoto interpoint map of the subset.

Table 2. Examples Where the Expected and Observed Patterns Matched in Samples 14, 15, 17, 20, and 22

sample subsets | expected pattern (ref 32) | observed pattern from clustering
14 and 15 | Dissimilar | Dissimilar
15 and 22 | Dissimilar | Dissimilar
14 and 17 | Dissimilar | Dissimilar
17, 20, and 22 | 17 is more similar to sample 20 than sample 22 | 17 is more similar to sample 20 than sample 22, from the heat map pattern of the pairs (17, 20) and (17, 22) in Figure 5A
14, 20, and 22 | 14 is closer/similar to 20 and 22. 14 is dissimilar to 15 and 17. | As expected from Figure 5

Identifying Biologically Dissimilar Samples. Among the dissimilar samples are the sample pairs (14, 9), (13, 9), (22, 9), and (14, 11), as they fall in the extreme right bin, indicating a very low degree of similarity, as shown in the histogram count plot of Figure 4A. On the basis of the sample description in Table 1, we know that samples 13 and 14 are replicates, and from the previous analysis, we know that they are analytically very similar samples. We therefore expect sample pairs (14, 9) and (13, 9) to have similar Tanimoto score patterns. As expected, these sample pairs fall in the dissimilar region (extreme right) of the histogram plot in Figure 4A. Tamoxifen suppresses tumor growth, while estrogen promotes it (ref 32). From Table 1, samples 9 and 10 are estrogen-treated tumor samples at week-0, while samples 13 and 14 are estrogen- and tamoxifen-treated tumor samples at week-6, and sample 22 is the estrogen-treated tumor sample at week-6. As expected, and also based on the measured tumor size, there was greater tumor growth (ref 32) in sample 14 than in sample 9 or 10. Similarly, samples 22 and 9 are expected to be different. Also, sample 11 is the tamoxifen-treated normal sample at week-6, which is expected to be different from samples 13 and 14. Interestingly, we observe this dissimilarity in sample pairs (14, 9), (13, 9), (22, 9), and (14, 11) in our cluster of histograms. It should be noted that, besides its effects on the estrogen receptor, tamoxifen is known to interact with other receptors. Since tamoxifen is an estrogen receptor antagonist, these off-target effects may


Figure 6. Extracted ion chromatograms of samples 14, 15, 17, 20, and 22 for m/z value 711.

explain the differences in the change in tumor growth, and the accompanying changes in the plasma, between subjects treated with tamoxifen and estrogen and subjects treated with estrogen only.

Analysis of Subsets of Samples for Similarity/Dissimilarity. We conducted further analysis using the clustering methods on a subset of samples from the original set. On the basis of our knowledge of the data set, we examined a set of five tumor samples: four at week-6 (samples 14, 15, 17, and 22) and one at week-3 (sample 20), all with different treatment conditions. We used eq 3 to extract, from the original Tanimoto interpoint map of similarity scores, the subspace associated with all 10 possible pairs of the 5 samples. Clustering methods were applied to the Tanimoto interpoint submap consisting of all 10 sample pairs, as shown in Figure 5. Patterns observed from the clustering methods that matched the expected patterns are summarized in Table 2. From the plots in Figure 5, samples 15 and 14 are the most dissimilar and samples 22 and 20 are the most similar in this five-sample subset. Sample 15 is the baseline sample (no treatment) at week-6, and there should be significant tumor growth in sample 14 (the week-6 sample treated with estrogen and tamoxifen) compared to sample 15 (ref 32). Samples 20 and 22 are both estrogen-treated tumor samples, where sample 20 is at week-3 and sample 22 is at week-6, so we can expect these samples to be biologically similar. From the plot in Figure 5A, sample 17 is more similar to sample 20 than to sample 22: there are more dark horizontal bands, representing similarity at the m/z level, in the heat map entry associated with sample pair (17, 20) than in the entry associated with sample pair (17, 22). Though samples 17 and 22 are both week-6 tumor samples, the tamoxifen-treated

sample 17 is expected to have a slower tumor growth rate than the estrogen-treated sample 22 at week-6. We can therefore expect sample 17 to be closer to the week-3, estrogen-treated sample 20 than to the week-6, estrogen-treated sample 22. To analyze the subtle similarities/differences in pairs (17, 22) and (17, 20), it was necessary to analyze the patterns in the heat map, as the dendrogram (Figure 5C) and the histogram counts (Figure 5B) did not provide this information. The dendrogram plot in Figure 5C, generated from the average similarity vector, shows that on average samples 20 and 22 are similar and samples 15 and 17 are similar. The plot in Figure 5C also indicates that sample 14 is closer to the pair (22, 20). This is consistent with the tumor growth patterns of the data set (ref 32).

Querying the Tanimoto Interpoint Map. We applied our clustering-based query approach to the Tanimoto interpoint submap consisting of all 10 sample pairs of samples 14, 15, 17, 20, and 22. From the previous analysis of this subset, shown in Figure 5, we can divide this group of samples into 2 clusters/groups, with samples 20, 22, and 14 in the first group and samples 15 and 17 in the second group, as observed from the hierarchical clustering plot in panel C of Figure 5. We can further divide the first group into 2 clusters/groups, with samples 20 and 22 in the first and sample 14 in the second. Then, an array of Tanimoto similarity scores for all 10 pairs, termed the query vector, was generated. For an array index associated with a sample pair whose 2 samples come from different/dissimilar groups, we filled that array position with the maximum value minus the standard deviation of that sample pair's scores over all m/z values in the Tanimoto interpoint map. Similarly, for an array index associated with a sample pair from the same/similar group, we filled that array position with the minimum value plus the standard deviation of that sample pair's scores. The query can be expressed in words as "Extract all m/z values that make the sample pair (20, 22) similar and the sample pair (15, 17) similar, while making all other pairs in the set dissimilar". We then clustered the Tanimoto interpoint map for the 10 sample pairs along with the query vector using k-means clustering to output a set of m/z values. The query generated about 36 m/z values that clustered with the query vector. We observed that the extracted ion chromatograms of many of these m/z values are consistent with the query. Figure 6 shows the extracted ion chromatograms for the m/z value 711 generated from the query. It can be seen that samples 20 and 22 are similar and that sample 14 is closer to the pair (20, 22). Samples 15 and 17 are also similar, as per the pattern in Figure 5C. Other queries could be performed and evaluated manually in the same way; nevertheless, a complete analysis providing rigorous statistical assurance about the correctness of the extracted m/z values would be desirable.
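The query procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the submap is assumed to be a dictionary from sample pair to per-m/z scores (lower = more similar), the population standard deviation is used because the text does not say which estimator was applied, and the number of clusters (k = 2) and the farthest-point initialization seeded at the query vector are choices made for this example. The function names and the toy scores are hypothetical.

```python
import math
from statistics import mean, pstdev


def build_query(submap, groups):
    """One entry per sample pair: min + sd if both samples share a group
    (expected similar), max - sd otherwise (expected dissimilar)."""
    group_of = {s: g for g, members in enumerate(groups) for s in members}
    query = []
    for (a, b), scores in submap.items():
        sd = pstdev(scores)  # assumption: population standard deviation
        same = group_of[a] == group_of[b]
        query.append(min(scores) + sd if same else max(scores) - sd)
    return query


def kmeans_labels(points, k, iters=50):
    """Lloyd's k-means; the first centroid is points[0], the others are
    chosen by a deterministic farthest-point rule (an assumed init)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append([mean(col) for col in zip(*members)] if members
                       else centroids[c])
        if new == centroids:  # converged
            break
        centroids = new
    return labels


def query_mz(mz_values, submap, groups, k=2):
    """Return the m/z values that fall in the same cluster as the query vector."""
    query = build_query(submap, groups)
    # One row per m/z value, one column per sample pair.
    rows = [[scores[i] for scores in submap.values()]
            for i in range(len(mz_values))]
    labels = kmeans_labels([query] + rows, k)
    return [mz for mz, lab in zip(mz_values, labels[1:]) if lab == labels[0]]


# Hypothetical toy submap: samples A and B are expected to be similar, C different.
mz_values = [400, 500, 600, 700, 800, 900]
submap = {
    ("A", "B"): [0.5, 0.1, 0.5, 0.6, 0.0, 0.5],
    ("A", "C"): [0.5, 0.9, 0.4, 0.5, 1.0, 0.5],
    ("B", "C"): [0.4, 0.9, 0.5, 0.5, 1.0, 0.6],
}
hits = query_mz(mz_values, submap, groups=[{"A", "B"}, {"C"}])
print(hits)
```

For the query stated in the text, the groups argument would correspond to `[{20, 22}, {15, 17}, {14}]` applied to the 10-pair submap. As the text itself notes, m/z values returned this way still need to be checked against the extracted ion chromatograms.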

Conclusion

CLUE-TIPS, an approach to analyze patterns and, specifically, to find similarities and differences in LC-MS samples at the m/z level, was presented in this paper. The effectiveness of the approach was demonstrated using data from LC-MS analysis of plasma collected for different treatment conditions at various time points of tumor growth from nude xenografted mice implanted with human breast cancer cells. The CLUE-TIPS approach is an important tool for finding global patterns associated with similarities/differences in LC-MS samples and for using these patterns to query for specific m/z values that are potentially responsible for them. Thus, this approach is broadly applicable to the detection of potential protein biomarker candidates. We also demonstrated the use of CLUE-TIPS to analyze the quality of LC-MS runs. Besides its applications in protein/peptide biomarker discovery, the suite could be applied to other LC-MS data, including metabolites. Importantly, CLUE-TIPS can also be applied to other analytical methods such as NMR or IR.

Acknowledgment. We acknowledge AstraZeneca for providing the samples for analysis. The authors would like to thank Prof. B. L. Karger for his support. Contribution 940 from the Barnett Institute.


References

(1) Aebersold, R.; Goodlett, D. R. Mass spectrometry in proteomics. Chem. Rev. 2001, 101 (2), 269–95.
(2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198–207.
(3) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17 (10), 994–9.
(4) Zhou, H.; Ranish, J. A.; Watts, J. D.; Aebersold, R. Quantitative proteome analysis by solid-phase isotope tagging and mass spectrometry. Nat. Biotechnol. 2002, 20 (5), 512–5.
(5) Griffin, T. J.; Gygi, S. P.; Rist, B.; Aebersold, R.; Loboda, A.; Jilkine, A.; Ens, W.; Standing, K. G. Quantitative proteomic analysis using a MALDI quadrupole time-of-flight mass spectrometer. Anal. Chem. 2001, 73 (5), 978–86.
(6) Chakraborty, A.; Regnier, F. E. Global internal standard technology for comparative proteomics. J. Chromatogr., A 2002, 949 (1-2), 173–84.
(7) Andreev, V. P.; Li, L.; Rejtar, T.; Li, Q.; Ferry, J. G.; Karger, B. L. New algorithm for 15N/14N quantitation with LC-ESI-MS using an LTQ-FT mass spectrometer. J. Proteome Res. 2006, 5 (8), 2039–45.
(8) Wang, W.; Zhou, H.; Lin, H.; Roy, S.; Shaler, T. A.; Hill, L. R.; Norton, S.; Kumar, P.; Anderle, M.; Becker, C. H. Quantification of proteins and metabolites by mass spectrometry without isotopic labeling or spiked standards. Anal. Chem. 2003, 75 (18), 4818–26.
(9) Radulovic, D.; Jelveh, S.; Ryu, S.; Hamilton, T. G.; Foss, E.; Mao, Y.; Emili, A. Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 2004, 3 (10), 984–97.
(10) Silva, J. C.; Denny, R.; Dorschel, C. A.; Gorenstein, M.; Kass, I. J.; Li, G. Z.; McKenna, T.; Nold, M. J.; Richardson, K.; Young, P.; Geromanos, S. Quantitative proteomic analysis by accurate mass retention time pairs. Anal. Chem. 2005, 77 (7), 2187–200.
(11) America, A. H.; Cordewener, J. H.; van Geffen, M. H.; Lommen, A.; Vissers, J. P.; Bino, R. J.; Hall, R. D. Alignment and statistical difference analysis of complex peptide data sets generated by multidimensional LC-MS. Proteomics 2006, 6 (2), 641–53.
(12) Li, X. J.; Yi, E. C.; Kemp, C. J.; Zhang, H.; Aebersold, R. A software suite for the generation and comparison of peptide arrays from sets of data collected by liquid chromatography-mass spectrometry. Mol. Cell. Proteomics 2005, 4 (9), 1328–40.
(13) Bellew, M.; Coram, M.; Fitzgibbon, M.; Igra, M.; Randolph, T.; Wang, P.; May, D.; Eng, J.; Fang, R.; Lin, C.; Chen, J.; Goodlett, D.; Whiteaker, J.; Paulovich, A.; McIntosh, M. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 2006, 22 (15), 1902–9.
(14) Wiener, M. C.; Sachs, J. R.; Deyanova, E. G.; Yates, N. A. Differential mass spectrometry: a label-free LC-MS method for finding significant differences in complex peptide and protein mixtures. Anal. Chem. 2004, 76 (20), 6085–96.
(15) Prakash, A.; Mallick, P.; Whiteaker, J.; Zhang, H.; Paulovich, A.; Flory, M.; Lee, H.; Aebersold, R.; Schwikowski, B. Signal maps for mass spectrometry-based comparative proteomics. Mol. Cell. Proteomics 2006, 5 (3), 423–32.
(16) Listgarten, J.; Neal, R. M.; Roweis, S. T.; Wong, P.; Emili, A. Difference detection in LC-MS data for protein biomarker discovery. Bioinformatics 2007, 23 (2), e198–204.
(17) Andreev, V. P.; Li, L.; Cao, L.; Gu, Y.; Rejtar, T.; Wu, S. L.; Karger, B. L. A new algorithm using cross-assignment for label-free quantitation with LC-LTQ-FT MS. J. Proteome Res. 2007, 6 (6), 2186–94.
(18) Jaffe, J. D.; Mani, D. R.; Leptos, K. C.; Church, G. M.; Gillette, M. A.; Carr, S. A. PEPPeR, a platform for experimental proteomic pattern recognition. Mol. Cell. Proteomics 2006, 5 (10), 1927–41.
(19) Rabiner, L. R. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 1989, 77 (2), 257–86.
(20) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(21) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–67.
(22) Needleman, S. B.; Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48 (3), 443–53.
(23) Prakash, A.; Piening, B.; Whiteaker, J.; Zhang, H.; Shaffer, S. A.; Martin, D.; Hohmann, L.; Cooke, K.; Olson, J. M.; Hansen, S.; Flory, M. R.; Lee, H.; Watts, J.; Goodlett, D. R.; Aebersold, R.; Paulovich, A.; Schwikowski, B. Assessing bias in experiment design for large scale mass spectrometry-based quantitative proteomics. Mol. Cell. Proteomics 2007, 6 (10), 1741–8.
(24) Noy, K.; Fasulo, D. P. In Robust Estimation and Graph-Based Meta Clustering for LC-MS Feature Extraction; BIBM: Washington DC, 2007; pp 230–236.
(25) Flower, D. R. On the properties of bit string based measures of chemical similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379–86.
(26) Prince, J. T.; Marcotte, E. M. Chromatographic alignment of ESI-LC-MS proteomics data sets by ordered bijective interpolated warping. Anal. Chem. 2006, 78 (17), 6140–52.
(27) OBI-Warp Home page: http://obi-warp.sourceforge.net/index.html.
(28) Oppenheim, A. V.; Willsky, A. S.; Nawab, S. H. Signals and Systems, 2nd ed.; Prentice-Hall: Upper Saddle River, NJ, 1997.
(29) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification, 2nd ed.; Wiley: Hoboken, NJ, 2001.
(30) Borg, I.; Groenen, P. J. F. Modern Multi-Dimensional Scaling, 2nd ed.; Springer-Verlag: New York, 2005.
(31) Kohonen, T. Self-Organizing Maps; Springer-Verlag: New York, 1997.
(32) Orazine, C. I.; Hincapie, M.; Hancock, W. S.; Hattersley, M.; Hanke, J. H. A proteomic analysis of the plasma glycoproteins of a MCF-7 mouse xenograft: a model system for the detection of tumor markers. J. Proteome Res. 2008, 7 (4), 1542–54.

PR900427Q