Article
Power Normalization for Mass Spectrometry Data Analysis and Analytical Method Assessment Yuezhi Melodie Du, Ye Hu, Yu Xia, and Zheng Ouyang Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b04418 • Publication Date (Web): 16 Feb 2016 Downloaded from http://pubs.acs.org on February 19, 2016
Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Analytical Chemistry
Power Normalization for Mass Spectrometry Data Analysis and Analytical Method Assessment
Y. Melodie Du1, Ye Hu2, Yu Xia3 and Zheng Ouyang1,3*
1 Weldon School of Biomedical Engineering, Purdue University, 206 S. Martin Jischke Dr., West Lafayette, IN 47907
2 Department of Nanomedicine, Houston Methodist Research Institute, 6565 Fannin Street, Houston, TX 77030
3 Department of Chemistry, Purdue University, 560 Oval Dr., West Lafayette, IN 47907
Prepared for Analytical Chemistry November 2015
*Corresponding Author: Prof. Zheng Ouyang Weldon School of Biomedical Engineering Purdue University West Lafayette, IN 47907 Email:
[email protected] Phone: 765-494-2214
ACS Paragon Plus Environment
Abstract
Biomarker profiling using mass spectrometry plays an essential role in biological studies and is highly dependent on the data analysis used for sample classification. In this study, we introduce power normalization of the mass spectra as a method for systematically altering the weights of peaks at different intensity levels. In combination with the support vector machine (SVM) method, its impact on sample classification has been characterized using data from four previously reported studies, covering the distinction of anomeric configurations of sugars, types of bacteria, stages of melanoma and types of breast cancer. A comprehensive analysis of the data normalized at different values of the power normalization index (PNI) was developed, and analysis tools, including error-PNI plots, reference profiles and error source profiles, were used to assess the potential of the analytical methods as well as to find proper approaches for classifying the samples.
Key words: Biomarker identification, Classification, Power normalization, Support vector machine, Probability estimation, Oligosaccharides, Bacteria, Melanoma, Breast cancer
Introduction
Sample classification based on mass spectrometry (MS) analysis has been widely practiced in biological studies, including proteomics,1-5 disease diagnosis,6-10 bacteria identification,11,12 structure analysis of carbohydrates,13,14 etc. The general approach is to extract the characteristic features from the mass spectra to distinguish different types of samples. This can be done simply by observing a single peak corresponding to a unique compound, but in most cases a set of multiple peaks needs to be used in the data analysis. Those peaks represent the existence of a set of chemical or biological compounds at different concentrations. The classification based on the peak profiles can be done using peak correlation or matching,3,4,15 which counts the identical peaks between the sample spectra and a library, or using statistical methods, such as standard deviation,6 t-test8 and ANOVA,16 which are mostly used in proteomics and metabolomics2,17,18 to determine the identity of a sample.19,20 The distinction between samples often cannot be achieved based simply on the existence of the peaks in the spectra, but also requires their absolute or relative concentrations; therefore the unique profiles of a set of compounds are used for sample classification and biomarker identification. Pattern recognition or machine learning methods have been used for this purpose, such as the dot product,21 principal component analysis (PCA),12 support vector machine (SVM),22,23 decision tree24 and neural networks.25 Analysis of peak profiles extracts the information from the mass spectra in a way very different from the peak-by-peak analysis approaches,19,20 and can also comprehensively reveal different aspects of the samples.
Profiling-based sample classification and biomarker identification depend on the significance, or weights, set for the peaks included in the profile. Typically, higher weights are given to peaks of higher intensities, which thereby contribute more to the final decision19,26,27 (Supporting Information, section S1). In the analysis of complex samples, however, the major peaks of the potential biomarkers can be of low intensity due to suppression by the chemical noise from other compounds in the samples. The negative impact of this suppression on data analysis is typically minimized by pre-selecting the mass range for the peaks of interest, based on previous knowledge. This, however, can introduce errors in the data analysis due to bias in the selection of the mass ranges. The nature of the samples and the quality of the data can also vary significantly at different stages of a study. For example, at the early stage of a study the major aim is typically to develop and optimize the analytical method for obtaining high-quality spectra. At this stage, the number of biological samples as well as the knowledge about the samples can be limited. It is also difficult to make an arbitrary selection of peaks or mass ranges for data analysis without introducing a significant bias.
Data analysis providing a comprehensive evaluation of the experimental method is particularly important for rapid method development and effective optimization, so that the method can become ready for the analysis of a large number of biological samples. Machine learning methods, such as the support vector machine, have advantages for early-stage method development due to their excellent applicability to raw data and minimal requirement for pre-existing knowledge. A potential risk in the use of machine learning methods is over-fitting, which
can be prevented by reserving testing data sets from training for an objective evaluation of the model. Intensity scaling or intensity transformation has also been implemented in data analysis to overcome the suppression problem mentioned above, such as denoting the peaks as 0 or 1 with a set threshold,28,29 or intensity rescaling based on ranking4 and ranking orders.30,31 In some studies, logarithm32 or square root33 normalizations of peak intensities have also been performed to emphasize the low-intensity peaks. An appropriate intensity transformation can improve the accuracy of sample classification; however, it should be carefully selected based on the nature of the samples.
In this study, we explored a method involving peak intensity transformation for a systematic evaluation of mass spectrometry data for sample classification and biomarker identification. The outcome can also be used to assess and optimize the analytical approach at an early stage of a study. A distinct feature of this method is the power normalization of the peak intensities prior to the sample classification, which was done using SVM in this study. The power normalization index (PNI, to be further described) was varied to systematically rescale the intensities of all the peaks, and the subsequent impact on the sample classification was analyzed. We also introduce the error-PNI count plot, which reveals the relationship between the power normalization and the errors in sample classification and, more importantly, serves as a high-level summary of the possibility of distinguishing the samples analyzed using a particular analytical procedure. Suggestions can also be made for further improvement of the analytical method. Data from four previously reported experimental studies6,8,12,34 were used for the development and validation of this method.
Method for Data Processing and Analysis
In order to process the mass spectra efficiently, each mass spectrum was converted into a multi-dimensional vector, with each dimension corresponding to a particular mass-to-charge ratio (m/z) and its magnitude assigned as the peak intensity at that m/z value. The power normalization was first applied to each spectrum at a given PNI value, and the mass spectra for each sample category were then divided into training and testing groups. The classification was done using a multi-class SVM method. The training group was used to generate the model with classification boundaries, while the testing group was used to evaluate the classification accuracy based on the model. The SVM method has been shown to be powerful in classifications with a low number of samples,35 which makes it particularly suitable for early-stage studies. The errors in the testing results can then be used to construct the error-PNI count maps.
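As an illustration of the vectorization step described above, the following Python sketch maps a peak list onto a fixed m/z grid. It is not the Matlab code used in the study; the function name spectrum_to_vector, the unit bin width and the choice to sum intensities falling into the same bin are assumptions made only for this example.

```python
def spectrum_to_vector(peaks, mz_min, mz_max, bin_width=1.0):
    """Map a peak list [(mz, intensity), ...] onto a fixed m/z grid.

    Assumptions (not from the paper): bins are centred on
    mz_min, mz_min + bin_width, ...; peaks outside [mz_min, mz_max]
    are ignored; intensities landing in the same bin are summed.
    """
    n_bins = int(round((mz_max - mz_min) / bin_width)) + 1
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz <= mz_max:
            index = int(round((mz - mz_min) / bin_width))
            vector[index] += intensity
    return vector


# Hypothetical peak list; the peak at m/z 60 falls outside the grid.
example_peaks = [(50.1, 10.0), (53.0, 5.0), (53.2, 2.0), (60.0, 1.0)]
example_vector = spectrum_to_vector(example_peaks, mz_min=50, mz_max=55)
```

Every spectrum converted with the same grid has the same length, so the resulting vectors can be passed directly to a normalization and classification step.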
Multi-class SVM and decision scores
Multi-class SVM analysis of the data was performed after the power normalization of the mass spectra, and the impact of the PNI selection on the sample classification was evaluated (see Supporting Information for the multi-class SVM method). The decision scores, which are typically used to determine the identity of the samples,35 were calculated for all possible types. The ranks of the scores were then used to determine the similarity between the testing sample and each possible sample type. The decision score was calculated using the sum of distances (D) between the testing sample and all the classification boundaries for different sample types. This works
particularly well for evaluating the initial analysis in an early-stage study, where the number of possible sample types is typically larger than the number of replicates of each sample type. For example, in the study for the bacteria analysis12 there were 16 types of bacteria, but only five samples of each type were analyzed. The distance dij of the tth tested sample from the classification boundary between the ith and jth sample types was calculated as
$$d_{ij} = \frac{\omega_{ij} \cdot \phi(x_t) - b_{ij}}{\lvert \omega_{ij} \rvert} \qquad \text{(Equation 1)}$$
where ω is the normal vector of the hyperplane (see Supporting Information for details). The larger the distance, the further the data point lies from the classification boundary, which means a higher likelihood of a correct assignment of the sample to the corresponding group. The numerator [ωijφ(xt)-bij] in Equation 1 is called the decision value. Sometimes, ranking can be achieved by simply using ωijφ(xt).36 However, this can only be done when the number of sample types is limited and the properties of the training sets are similar enough that b and |ω| can be ignored. The sum of the distances Dtm of the tth tested sample for the mth sample type was calculated as
$$D_{tm} = \sum_{j,\ \text{if } i=m} d_{ij} \; - \sum_{i,\ \text{if } j=m} d_{ij}, \qquad i < j, \quad m = 1, 2, \ldots, n \qquad \text{(Equation 2)}$$
where n is the total number of sample types in the classification. The calculated sums of distances D can be used to support a variety of data analyses, such as the similarity ranking of all the possible sample types for the testing sample, as well as the ranking of the characteristic peaks that contribute the most to the classification (see Supporting Information).
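Equations 1 and 2 can be illustrated with a minimal Python sketch. It assumes a linear kernel, so that φ(x) = x, and 0-based indices for the sample types; the function names and the dictionary layout used for the pairwise boundaries are hypothetical and are not taken from the SVM package used in the study.

```python
import math


def pairwise_distance(w, b, x):
    """Equation 1 with a linear kernel (assumption): (w . x - b) / |w|."""
    decision_value = sum(wi * xi for wi, xi in zip(w, x)) - b
    return decision_value / math.sqrt(sum(wi * wi for wi in w))


def summed_distances(boundaries, x, n_types):
    """Equation 2: D[m] adds d_ij when i == m and subtracts it when j == m.

    boundaries maps a pair (i, j) with i < j to the tuple (w_ij, b_ij).
    """
    totals = [0.0] * n_types
    for (i, j), (w, b) in boundaries.items():
        d = pairwise_distance(w, b, x)
        totals[i] += d
        totals[j] -= d
    return totals


def rank_types(totals):
    """Sample types ordered from most to least similar to the tested sample."""
    return sorted(range(len(totals)), key=lambda m: totals[m], reverse=True)


# Toy example: three sample types in a two-dimensional feature space.
boundaries = {
    (0, 1): ([1.0, 0.0], 0.0),
    (0, 2): ([0.0, 1.0], 0.0),
    (1, 2): ([3.0, 4.0], 5.0),
}
totals = summed_distances(boundaries, [2.0, 1.0], n_types=3)
ranking = rank_types(totals)
```

In this toy case the three pairwise distances are 2, 1 and 1, so type 0 receives the largest sum and is ranked first.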
Power normalization
The power normalization was used to adjust the weights of the peaks at different intensity levels for the classification. It was performed by scaling the intensities of all the peaks in a mass spectrum with a power normalization index (PNI) (Equation 3),
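$$\mathrm{Peak}_i = \left( \frac{\mathrm{peak}_i}{\sqrt{\sum_{j} \mathrm{peak}_j^{2}}} \right)^{\mathrm{PNI}}, \qquad j = 1, 2, 3, \ldots, n \qquad \text{(Equation 3)}$$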
where peaki is the original intensity of the ith peak in the spectrum and Peaki is the scaled intensity after the power normalization. The denominator, the square root of the sum of squares of all the peak intensities, was used to achieve an energy balance in every spectrum. As shown in Figure 1, normalization at a given PNI changed the relative difference in weight between peaks of high and low intensities, which affected their contributions to the classification. Normalization with a PNI lower than 1 reduced the differences in intensity, granting higher weights to the peaks of lower intensities. For example, the fragment patterns in the MS/MS spectra recorded for two synthesized monosaccharides,13,37 ido-α-GA and glc-β-GA, were very similar (Figure 1c, d). After rescaling the spectra with a power normalization at a PNI of 0.3, the peaks previously hidden became more prominent (Figure 1a, b). A further analysis using SVM identified that some of the peaks originally of low intensity made critical contributions to distinguishing ido-α-GA and glc-β-GA (to be further discussed).
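A minimal Python sketch of Equation 3 is given here to show how a PNI below 1 raises the relative weight of a weak peak. The function name power_normalize, the handling of an all-zero spectrum and the toy intensities are assumptions for illustration; non-negative intensities and a positive PNI are assumed.

```python
import math


def power_normalize(intensities, pni):
    """Equation 3: divide by the L2 norm of the spectrum, then raise to PNI.

    Assumes non-negative intensities and pni > 0. An all-zero spectrum is
    returned unchanged (a choice made for this sketch only).
    """
    norm = math.sqrt(sum(v * v for v in intensities))
    if norm == 0.0:
        return [0.0 for _ in intensities]
    return [(v / norm) ** pni for v in intensities]


# Toy spectrum: one weak peak next to a base peak 100 times more intense.
unit_scaled = power_normalize([3.0, 4.0], 1.0)   # PNI = 1 is plain L2 scaling
low, high = power_normalize([1.0, 100.0], 0.3)
ratio_scaled = low / high  # (1/100) ** 0.3, about 0.25 instead of 0.01
```

Because the same norm divides every peak, the ratio of two scaled peaks is simply the original intensity ratio raised to the PNI.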
Figure 1. MS/MS spectra normalized at PNI 0.3 for (a) ido-α-GA and (b) glc-β-GA and the original spectra (c) and (d), respectively. (e) The weighting factor of different intensities as a function of PNI.
The data analysis was performed using programs written in Matlab (version R2012b, MathWorks, Natick, MA, USA). The modules for training and testing using SVM were downloaded online.38 The functions for picking the characteristic peaks, ranking the similarity and plotting the figures were programmed based on the intermediate variables in the SVM model.
Results and Discussion
Data sets from four previously reported studies were used to test and validate the method described above. These included the spectra recorded for synthesized carbohydrates of sixteen stereo configurations,13 lipid profiles of sixteen types of bacteria,12 and mass spectra of peptides in human blood samples from patients with melanoma6 and breast cancer.39 These data sets were all collected at an early stage of the respective studies, when new analytical methods were being developed, the numbers of samples for each type were relatively low, and the experimental conditions might vary significantly.
Error-PNI count plot and biomarker identification
When the spectra of a sample in a testing group were power-normalized and classified using SVM, the sample was assigned to a category based on the ranking of the decision scores. The number of wrong assignments could then be counted for each PNI value and used to plot an error-PNI curve as part of the error-PNI count map. This method was first applied to classifying 16 D-aldohexose-glycolaldehydes (GA), synthesized in 8 sugar types with two anomeric configurations for each type, viz. α-D-all, β-D-all, α-D-alt, β-D-alt, α-D-gal, β-D-gal, α-D-glc, β-D-glc, α-D-gul, β-D-gul, α-D-ido, β-D-ido, α-D-man, β-D-man, α-D-tal, β-D-tal. They were synthesized as the standards for producing glycosyl-GA anions at m/z 221, which can be used as the diagnostic ions for probing the anomeric configurations of oligosaccharides. In the previous study,13 the glycosyl-GA anions at m/z 221 were produced through collision-induced dissociation (CID) from disaccharide ions, and were then further fragmented under a controlled CID
condition (see Supporting Information).13,37 The MS3 spectra were used for classification by the spectral similarity score method.17 In that study, as well as in previous ones using the diagnostic ions for anomeric configuration identification,13,37 the MS/MS data sets were acquired with the CID conditions controlled by keeping the intensity of the surviving precursor ions after CID at 18 ± 5% of the base product ion (100%). It had been believed that this was important for achieving reproducible sample classification. In this study, 136 MS/MS spectra (Supporting Information, Table S1) collected under this condition during the previous study were first used for testing the data analysis method with power normalization. For each test, one spectrum of a particular sample type was randomly picked as the testing sample and the remaining 135 spectra were used for training the SVM model. Both the testing and training spectra were power-normalized at the same PNI value, and the errors in assignment were counted and used for plotting the error-PNI curves, as shown in Figure 2.
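The counting procedure behind an error-PNI curve can be sketched in Python as follows. Two simplifications are made and should be read as assumptions: every spectrum is held out once (exhaustive leave-one-out) instead of being picked at random, and a nearest-centroid classifier is used as a stand-in for the multi-class SVM of the study, only to keep the example self-contained. All function names and the toy spectra are hypothetical.

```python
import math


def power_normalize(intensities, pni):
    """Equation 3: L2-normalize the spectrum, then raise each peak to PNI."""
    norm = math.sqrt(sum(v * v for v in intensities))
    if norm == 0.0:
        return [0.0 for _ in intensities]
    return [(v / norm) ** pni for v in intensities]


def nearest_centroid_predict(train_vectors, train_labels, x):
    """Stand-in classifier (NOT the SVM used in the paper)."""
    sums, counts = {}, {}
    for vec, label in zip(train_vectors, train_labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    best_label, best_dist = None, float("inf")
    for label in sums:
        centroid = [s / counts[label] for s in sums[label]]
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, centroid)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def error_pni_curve(spectra, labels, pni_values):
    """Return {label: [error count at each PNI]} from leave-one-out tests."""
    curve = {label: [0] * len(pni_values) for label in set(labels)}
    for k, pni in enumerate(pni_values):
        # Training and testing spectra are normalized at the same PNI.
        scaled = [power_normalize(s, pni) for s in spectra]
        for t in range(len(scaled)):
            train_x = scaled[:t] + scaled[t + 1:]
            train_y = labels[:t] + labels[t + 1:]
            predicted = nearest_centroid_predict(train_x, train_y, scaled[t])
            if predicted != labels[t]:
                curve[labels[t]][k] += 1
    return curve


# Toy data: two well-separated sample types, two spectra each.
spectra = [[10.0, 0.0, 1.0], [9.0, 0.0, 1.2], [0.0, 10.0, 1.0], [0.0, 9.0, 1.1]]
labels = ["A", "A", "B", "B"]
curve = error_pni_curve(spectra, labels, [0.5, 1.0])
```

Each list in curve corresponds to one error-PNI curve; with the well-separated toy spectra no errors are counted at either PNI.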
[Figure 2 plot. x-axis: Power Normalization Index (0.1 to 1.1); y-axis: Error Count (100%) (0 to 60); legend: all-α, all-β, alt-α, alt-β, gal-α, gal-β, glc-α, glc-β, gul-α, gul-β, ido-α, ido-β, man-α, man-β, tal-α, tal-β; annotated regions ① and ②.]
Figure 2. (a) Error-PNI plot of 16 types of synthesized monosaccharide-GAs. Sugar all-α is not plotted because no classification error was found within the PNI range used.
It is evident that the selection of the PNI value has a significant impact on the classification accuracy. For 14 of the 16 sugars (all except tal-β and ido-α), the assignment improves as the PNI increases and can be achieved with 100% accuracy when the PNI is larger than 0.5. However, for tal-β and ido-α, 100% accuracy could only be obtained when the PNI is lower than 0.5 and 0.34, respectively. This indicates that the dominant fragment peaks of the diagnostic ions from these two sugars are not related to the structural differences, while the peaks of minor intensities are and can therefore be used as biomarkers for distinguishing them from other stereoisomers. The error-PNI plot shown in Figure 2 represents a summary of a systematic evaluation of the effectiveness of the experimental approach applied for classifying a
chemical or biological system. The valley in each curve indicates the best normalization point for classifying each individual component in the chemical system; the overlap of the valleys, if one exists, indicates the best normalization point for the global classification of the chemical system.
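The search for overlapping valleys can be written as a short Python sketch. Here a valley is taken to be the set of PNI values at which a curve reaches its minimum error count, which is an assumption of this example; the function names and the toy curves are hypothetical.

```python
def valley_indices(errors):
    """Indices at which one error-PNI curve reaches its minimum."""
    lowest = min(errors)
    return {k for k, e in enumerate(errors) if e == lowest}


def shared_valley_pni(curve, pni_values):
    """PNI values lying in the valley of every curve ([] if none overlap)."""
    common = set(range(len(pni_values)))
    for errors in curve.values():
        common &= valley_indices(errors)
    return [pni_values[k] for k in sorted(common)]


pni_values = [0.3, 0.5, 0.7]

# The valleys of both curves include PNI = 0.5.
overlapping = shared_valley_pni({"type A": [2, 0, 0], "type B": [0, 0, 3]}, pni_values)

# No single PNI is best for both curves.
disjoint = shared_valley_pni({"type A": [0, 1, 1], "type B": [1, 1, 0]}, pni_values)
```

An empty result corresponds to the situation in which no single normalization point serves every sample type, so a multi-step classification would have to be considered.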
For instance, using the method involving the CID of the diagnostic ions for classifying the sugars mentioned above, a normalization at a PNI between 0.3 and 0.34 (① in Figure 2) should be selected to distinguish glc-β and tal-β, but a PNI of 0.43 to 0.58 should be selected for gul-β and ido-α (② in Figure 2). Based on the error-PNI plot for the chemical system with the 16 sugars, it can be predicted that a complete classification cannot be done in a single step using the current analytical method, since there is no overlap of all the valleys of the error-PNI curves. The best overall result for a single-step classification would be obtained with a PNI between 0.46 and 0.56, where 15 of 16 isomers can be classified correctly. However, based on the error-PNI plot, a multi-step classification can be suggested to further improve the classification, which will be discussed later.
In the previous studies13,37 a number of statistical methods were tested and the similarity score method21,40 was shown to work much better for classifying the sugars based on the MS/MS spectra of the diagnostic ions at m/z 221. The reason can now be explained with the error-PNI plot obtained in this study. Derived from the dot product calculation, the similarity score method calculates the ratio between the geometric mean and the arithmetic mean of the corresponding intensities of two spectra (Equation 4),21,40
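$$\text{Similarity Score} = \frac{\sum_i \left( k I_{s,i} I_{r,i} \right)^{0.5}}{\sum_i \dfrac{k I_{s,i} + I_{r,i}}{2}}, \qquad k = \frac{\sum_i I_{r,i}}{\sum_i I_{s,i}} \qquad \text{(Equation 4)}$$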
where Is,i is the peak intensity of the ith m/z value in the sample spectra, Ir,i is the peak
intensity in the reference spectra (library or training), and k is the normalization term that is related to the sum of all the peak intensities in the spectra. Note that there is an exponent of 0.5 in the numerator, which is equivalent to a power normalization at a PNI of 0.5. This is in agreement with the analysis using the error-PNI plot in this study.
The effect of using an appropriate PNI to classify the data can also be illustrated by its impact on the PCA results. Using the classification of all-α, alt-β, gal-α and man-β as an example, Figure 3a and b show the PCA results based on the original data and the data normalized at a PNI of 0.5, respectively. Without the normalization, all-α, alt-β, gal-α and man-β could not be clearly distinguished (Figure 3a); however, after the normalization at PNI = 0.5, the data points for each sample were much better grouped (Figure 3b). This is because an emphasis on the spectral peaks of lower intensities increased the difference between the spectra from different types of samples. This also indicates that the peaks of the highest intensities in the mass spectra of these samples might not be suitable as signature peaks for distinguishing these samples.
The similarity ranking method was used for assigning the sample group based on the data analysis. The impact on the similarity ranking and the improvement of the classification by the normalization can be illustrated using the distinction between the ido-α and glc-β samples as an example. After applying SVM classification with the original spectra and with the spectra normalized at a PNI of 0.5, the similarity rankings for ido-α were plotted as shown in Figure 3c and d, respectively, for comparison. The ranking was based on the sum of distances Dtm (Equation 2) between the testing samples and the classification boundaries (Figure 3d inset). Without the power normalization, the ido-α
sample was mis-assigned as glc-β, as shown in Figure 3c; however, it was corrected after the power normalization was applied with a proper PNI value (Figure 3d).
Figure 3. PCA of 4 types of commonly misclassified sugars with (a) original data and (b) data normalized at PNI 0.5. Similarity ranking of testing sample ido-α based on (c) original data and (d) data normalized at PNI 0.5, insets showing the boundary figures of three top-ranked types. Loading plots based on SVM for classifying the two highly similar sample groups, ido-α and glc-β, (e) with the original data and (f) data normalized at PNI of 0.5.
The normal vector ω (in Equation 1) at the corresponding PNI can be used to select the characteristic peaks for each sample type. The peaks with the highest ω values contribute the most to distinguishing the sample from others. Using the distinction between the ido-α and glc-β samples as an example, the top 10 candidate peaks ranked for the decision-making are m/z 87, 141, 161, 117, 159, 163, 113, 71, 86, 85 and 185 (Figure 3f). A comparison of the loading plots with and without power normalization (Figure 3e, f) shows that the data analysis with power normalization identified some peaks, such as m/z 71 and 141, which were not previously selected13 as signature peaks but can actually contribute to the distinction between the ido-α and glc-β samples.
Another capability enabled by the analysis with the error-PNI plots is the evaluation of critical experimental conditions. In previous studies,13,37 it was claimed that the CID condition, viz. the precursor ion intensity kept at 18 ± 5% relative to the base product ion after CID,37 was critical for using the diagnostic ions at m/z 221 to identify the correct anomeric configurations. To test whether this condition needs to be retained when the data analysis is performed with spectral power normalization, we collected an additional 132 MS/MS spectra (Supporting Information, Table S1) without carefully tuning the CID condition as described above. The intensity ratio between the precursor ion and the base product ion varied in a range of 24% - 100%. The SVM model was still trained using the 136 spectra collected under the controlled CID condition. The classification of the 132 samples yielded 100% accuracy.
Based on the study with the classification of the sugar samples, it is evident that the power normalization can have a significant impact on data analysis using the mass spectra. The error-PNI plot can serve as a unique but effective tool for evaluating the
data as well as for assisting the development of the experimental methods. A multi-step classification procedure can also be designed based on the information provided by the error-PNI plot. For instance, in order to achieve a complete classification, at the first step a PNI = 0.5 (① in Figure 2) can be selected to classify all 15 GAs except for tal-β; at the second step, a PNI = 0.32 can be selected to classify tal-β and glc-β (② in Figure 2).
After the validation with the classification of the sugars, the developed data analysis method was applied to classifications using data from three other studies. In a previous study, low temperature plasma was used to perform a direct analysis of bacteria,12 including Bacillus subtilis, Staphylococcus aureus ATCC 25923, E. coli K12, and 13 Salmonella enterica bacteria (see Table S), all in Luria-Bertani (LB) agar. As is typical for an early-stage study, the study included a limited sample quantity, viz. five spectra for each of the 16 bacteria types, and the spectra were subject to strong matrix effects. For testing the classification, one sample of each bacterium was randomly selected as the testing sample and the rest were used for training. As shown in Figure 4a and b, while the peaks in the m/z range above 200 are attributed to fatty acid ethyl esters from the bacterial membrane, there are abundant peaks in the lower m/z range due to other chemicals in the sample matrices. By applying the classification method with power normalization, the effect of pre-selecting a mass range can be systematically analyzed and a clear strategy for classification can be derived. The error-PNI plot for classification without pre-selecting the m/z range is shown in Figure 4c, with the error-PNI curves highlighted for four types of Salmonella enterica bacteria, Paratyphi B DMS106/76 (SARA50), Typhimurium LT2 (SARA2), Paratyphi B DMS3205/83 (SARA47) and Paratyphi B DMS53/76 (SARA51). These bacteria are
highly similar in terms of the lipid profiles in the mass spectra12 and could not be well distinguished (Figure 4a and 4b, Figure S4a). There is clearly a difficulty in classifying these samples, since there is no overlap of the valleys in the error-PNI curves for these bacteria (Figure 4c). With the m/z range 250-300 pre-selected, the possibility of correct classification is significantly improved (Figure 4d). At a PNI of 0.7 (① in Figure 4d), 15 of 16 bacteria can be correctly classified; the exception is SARA47, which can be correctly classified without power normalization (PNI = 1, at ② in Figure 4c). Based on these results, a two-step procedure should be used for the classification of all the bacteria using SVM, with the first step based on the original data followed by a second step applying power normalization at a PNI of 0.7.
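The text does not specify how the two steps are combined. One possible reading, given here as a hedged Python sketch, is to accept the first-step assignment only for the sample types that the first step is known to classify reliably and to pass every other sample to the second step. The function two_step_classify, the trusted-label set and the stub classifiers below are hypothetical and stand in for SVM models trained at the two PNI values.

```python
def two_step_classify(x, classify_step1, classify_step2, trusted_step1_labels):
    """Cascade two classifiers (an assumed combination rule, see lead-in).

    classify_step1 / classify_step2: callables returning a label for x,
    e.g. models trained on data normalized at two different PNI values.
    trusted_step1_labels: labels accepted directly from the first step.
    """
    first = classify_step1(x)
    if first in trusted_step1_labels:
        return first
    return classify_step2(x)


# Stub classifiers for illustration only (not trained models).
def step1_stub(x):
    return "SARA47" if x[0] > 0.5 else "other"


def step2_stub(x):
    return "SARA50"


accepted_first = two_step_classify([0.9], step1_stub, step2_stub, {"SARA47"})
passed_on = two_step_classify([0.1], step1_stub, step2_stub, {"SARA47"})
```

The same wrapper would cover the sugar example if the first callable were a model at PNI 0.5 and the second a model at the lower PNI selected for tal-β and glc-β.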
Figure 4. Mass spectra of (a) SARA50 and (b) SARA51. The blue square is the location of mass range selection in the previous study. Error-PNI plots (c) without and (d) with mass range selection.
Similarly, error analysis was performed for the classification of serum samples for different stages of melanoma (Supporting Information Section S6). In the previous study,6 serum samples from a B16 mouse model were used to detect pulmonary metastatic melanoma. The samples were collected at four different time points: before the injection of the melanoma cells (day 0) and on days 7, 14 and 21 after the injection. On-chip fractionation was performed on the peptides collected from serum, followed by positive-mode MS analysis of the peptides using MALDI-TOF. Thirty samples were collected for each stage. For the classification using power normalization in this study, one sample was randomly chosen as the testing point and the remaining 29 samples were used for training in SVM. The PNI was varied from 0.01 to 2 with a step size of 0.05. The error-PNI plot for the data analysis is shown in Figure 5a. The samples collected on day 7, day 14 and day 21 can be distinguished from each other with relatively high confidence when using SVM classification with normalization at a PNI between 0.46 and 0.51; however, the error for classifying the day 0 samples can be as high as nearly 30%. The assignments of the day 0 samples are summarized in Figure 5b, with about 15% misclassified as “day 7” and 15% as “day 14”. The “day 0” and “day 21” samples, however, can be distinguished from each other very well at PNI 0.5. The misclassification of “day 0” samples can be systematically analyzed over the entire PNI range, and the result is shown in Figure 5c. At a PNI of 0.1, very few “day 0” samples are misclassified as “day 7” or “day 21”, and none as “day 14”. At a PNI larger than 0.3, “day 0” samples can be completely distinguished from “day 21” samples but not from “day 7” or “day 14”. The plot in Figure 5c is termed the “relevance profile” for the “day 0” samples, which reveals the relationship between one particular
sample type, the “day 0” sample in this case, and any of the other sample types. This analysis is extremely useful for understanding failures in the sample classification and for making improvements by selecting the proper spectral normalization.
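A relevance profile of this kind can be tabulated directly from the classification results. The sketch below assumes the results are available as (PNI, true type, assigned type) records; the record layout, the function names and the toy records are hypothetical and are not the data of the melanoma study. The grid helper reproduces the scan from 0.01 to 2 in steps of 0.05, which contains the values 0.46 and 0.51 quoted above.

```python
from collections import Counter, defaultdict


def pni_grid(start=0.01, stop=2.0, step=0.05):
    """PNI values from `start` upward in equal steps, never exceeding `stop`.
    Values are rounded to two decimals to hide floating-point noise."""
    count = int(round((stop - start) / step, 9)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def relevance_profile(records, sample_type):
    """For each PNI, count how samples whose TRUE type is `sample_type`
    were assigned across all types (correct assignments included)."""
    profile = defaultdict(Counter)
    for pni, true_type, assigned_type in records:
        if true_type == sample_type:
            profile[pni][assigned_type] += 1
    return {pni: dict(counts) for pni, counts in sorted(profile.items())}


# Hypothetical records, for illustration only.
toy_records = [
    (0.11, "day 0", "day 0"),
    (0.11, "day 0", "day 7"),
    (0.51, "day 0", "day 0"),
    (0.51, "day 0", "day 14"),
    (0.51, "day 7", "day 7"),   # ignored: true type is not "day 0"
]
profile_day0 = relevance_profile(toy_records, "day 0")
```

Plotting the per-type counts in `profile_day0` against PNI would give a stacked view comparable in spirit to Figure 5c.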
It is noteworthy that the relevance profile of a sample type is highly characteristic and can potentially be used for identifying samples of this type.
In another study, peptides circulating in blood, which were cleaved by carboxypeptidase N in the tumor microenvironment, were collected and analyzed in order to identify the developmental stages of breast cancer.8 Circulating peptides in 58 human plasma samples were profiled using MALDI-TOF MS, including 10 samples of healthy controls (Control), 11 samples with stage I (BC-I), 12 samples with stage II (BC-II), 15 samples with stage III (BC-III) and 10 samples with stage IV (BC-IV) breast cancer. Each sample had two replicates, which is not sufficient for sophisticated method development but is typical for an early-stage study. For testing the data analysis using SVM with power normalization, one sample of each type was randomly chosen for testing and the rest were used for training in SVM. The PNI was varied from 0.01 to 2 with a step size of 0.05. The samples were extracted from serum and strong matrix effects were observed (Figure S7). The error-PNI plot (Figure S8) indicates a high possibility of error if the original data (PNI = 1) were used directly for classification. Following the strategies discussed above, a thorough analysis at a system level can be done to understand the situation, which was quite complicated in this case. In addition to the relevance profile used above, an error source profile can also be derived to summarize the misclassification of other sample types into one particular type, as shown in Figure 5d for the Control sample type. According to the error source profile, no
BC-I sample is misclassified as Control at any PNI value. At a PNI around 0.4, all the samples classified as Control are true control samples, but at PNI 0.8 nearly 10% of the samples classified as Control are actually stage II and 2% are stage IV. As the PNI changes, the classification results change accordingly.
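The error source profile looks at the classification results from the opposite direction: it groups samples by the type they were assigned to and asks what they truly are. A minimal sketch, again assuming hypothetical (PNI, true type, assigned type) records and hypothetical function names, could be:

```python
from collections import Counter, defaultdict


def error_source_profile(records, target_type):
    """For each PNI, count the TRUE types of all samples that were
    ASSIGNED to `target_type`."""
    profile = defaultdict(Counter)
    for pni, true_type, assigned_type in records:
        if assigned_type == target_type:
            profile[pni][true_type] += 1
    return {pni: dict(counts) for pni, counts in sorted(profile.items())}


def purity(counts, target_type):
    """Fraction of samples assigned to `target_type` that truly belong to it."""
    total = sum(counts.values())
    return counts.get(target_type, 0) / total if total else 0.0


# Hypothetical records, for illustration only (not the study's data).
toy_records = [
    (0.41, "Ctrl", "Ctrl"),
    (0.41, "Ctrl", "Ctrl"),
    (0.41, "BC-II", "BC-II"),
    (0.81, "Ctrl", "Ctrl"),
    (0.81, "Ctrl", "Ctrl"),
    (0.81, "Ctrl", "Ctrl"),
    (0.81, "BC-II", "Ctrl"),   # a BC-II sample mis-assigned to Control
    (0.81, "BC-IV", "BC-IV"),
]
ctrl_profile = error_source_profile(toy_records, "Ctrl")
ctrl_purity = {pni: purity(c, "Ctrl") for pni, c in ctrl_profile.items()}
```

In this toy case every sample assigned to Control at the lower PNI is a true control, while at the higher PNI one of four is not, which mirrors the kind of PNI dependence described above.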
[Figure 5 panels. (a) Error-PNI plot of melanoma: Error Count (100%) vs Power Normalization Index (0 to 2), with curves for 0, 7, 14 and 21 days. (b) Error analysis of “0 day” at PNI 0.5: assignments to 0 day, 7 day and 14 day. (c) Relevance profile of “0 day”: Sample Count vs Power Normalization Index (0 to 2), for 0, 7, 14 and 21 days. (d) Error source profile of “Ctrl”: Sample Count vs Power Normalization Index (0 to 2), for Ctrl, BC-II, BC-III and BC-IV.]
Figure 5. (a) Error-PNI plot for the classification of melanoma samples. (b) Classification result of the “0 day” samples. (c) Relevance profile for the “0 day” samples, showing the misclassification of “0 day” samples as other sample types, as a function of PNI. (d) Breast cancer stage study: error source profile for the Control sample type, showing the misclassification of samples of other types into the “Control” type, as a function of PNI.
A probability estimate can be obtained for an unknown sample by combining its classification results at different PNIs, based on the relevance and error source profiles.
As a simple example, if one sample is classified as BC-IV at PNI 0.4 but as Control at PNI 0.8, its true identity can be estimated with a probability. The number of samples of each type mis-assigned to BC-IV at PNI 0.4 and to Control at PNI 0.8 can be extracted from the error source profiles of the BC-IV and Control types (Figure S9), as listed in Table 1. The probability that the sample is actually a Control sample mis-assigned as BC-IV can be calculated as $p_{\text{Ctrl}} = (32/52)(16/52)$, where 52 is the total number of samples. The probability of its being a BC-II sample is $p_{\text{BC-II}} = (2/52)(25/52)$. Most likely, this sample would not be BC-I, BC-III or BC-IV, based on the information listed in the table. When such a sample is determined to be a Control type by the data analysis reported here, the probability of its being a true Control sample is calculated as $P_{\text{Ctrl}} = p_{\text{Ctrl}}/(p_{\text{Ctrl}} + p_{\text{BC-II}}) = 92\%$. There is an 8% probability of its being a BC-II sample. SVM with power normalization at multiple PNI values enables a comprehensive analysis of the data that can assist the process of finding the ultimate solution in the classification. The information on the mis-assignments can be used for the sample classification as well.

Table 1. Numbers of samples of each type assigned as BC-IV at PNI 0.4 and as Control at PNI 0.8 (total 52 samples).

Sample type    PNI 0.4, classified as BC-IV    PNI 0.8, classified as Ctrl
Ctrl           32                              16
BC-I           25                              0
BC-II          24                              2
BC-III         29                              0
BC-IV          46                              0
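The arithmetic of this estimate can be reproduced with a few lines of Python. The sketch below is only an illustration: the counts are copied from Table 1 as printed, and the names `TABLE_1`, `joint_scores` and `posterior` are hypothetical. Note that the table lists 24 for BC-II at PNI 0.4 whereas the expression in the text uses 25; either value gives a Control probability of roughly 91%, close to the 92% quoted above.

```python
# Counts as printed in Table 1:
# sample type -> (assigned as BC-IV at PNI 0.4, assigned as Ctrl at PNI 0.8)
TABLE_1 = {
    "Ctrl": (32, 16),
    "BC-I": (25, 0),
    "BC-II": (24, 2),
    "BC-III": (29, 0),
    "BC-IV": (46, 0),
}
TOTAL_SAMPLES = 52


def joint_scores(table, total):
    """Product of the two assignment fractions for each true sample type."""
    return {
        sample_type: (n_bc4 / total) * (n_ctrl / total)
        for sample_type, (n_bc4, n_ctrl) in table.items()
    }


def posterior(scores):
    """Rescale the joint scores so that they sum to 1 over all types."""
    norm = sum(scores.values())
    return {sample_type: s / norm for sample_type, s in scores.items()}


scores = joint_scores(TABLE_1, TOTAL_SAMPLES)
probabilities = posterior(scores)

for sample_type, p in probabilities.items():
    print(f"{sample_type}: {p:.3f}")
```

Types with a zero count in either column drop out of the estimate, which is why only the Control and BC-II types contribute in this example.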
Conclusion
In this study, power normalization of mass spectra was introduced for data analysis, and SVM was applied to the normalized data for sample classification. A set of visualization methods, including the error-PNI plot, the relevance profile and the error source profile, was introduced to facilitate the analysis of the data, which ultimately reflect the analytical approach adopted for the biological study. The selection of the proper normalization factor has been shown to be critical for the sample classification. Its application in data analysis has been demonstrated for studies involving spectra of sugar standards, bacteria, melanoma and breast cancer samples, and the multidimensional data analysis enabled by power normalization at various PNI values could be used to improve the sample classification significantly. Using the data analysis for the melanoma and breast cancer studies, we also demonstrated that the “errors” in the classification can actually be used for improving sample identification, with the multidimensional classification enabled by power normalization at different PNIs.
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI:10.1021/acs.analchem.xxxx.
• Details of the original experimental data and methods, and results of data analysis (PDF)
Acknowledgement
The authors thank Dr. Chiharu Konda and Prof. R. Graham Cooks for providing the mass spectrometry data for analysis of sugars and bacteria, respectively. This work
was supported by the National Institute of General Medical Sciences (1R01GM106016) from the National Institutes of Health.
References
(1) Pham, T. V.; Piersma, S. R.; Oudgenoeg, G.; Jimenez, C. R. Expert Rev. Mol. Diagn. 2012, 12, 343-359.
(2) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(3) Yang, B.; Wu, Y.-J.; Zhu, M.; Fan, S.-B.; Lin, J.; Zhang, K.; Li, S.; Chi, H.; Li, Y.-X.; Chen, H.-F.; Luo, S.-K.; Ding, Y.-H.; Wang, L.-H.; Hao, Z.; Xiu, L.-Y.; Chen, S.; Ye, K.; He, S.-M.; Dong, M.-Q. Nat. Methods 2012, 9, 904-906.
(4) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(5) Jia, C.; Yu, Q.; Wang, J.; Li, L. Proteomics 2014, 14, 1185-1194.
(6) Fan, J.; Huang, Y.; Finoulst, I.; Wu, H.-j.; Deng, Z.; Xu, R.; Xia, X.; Ferrari, M.; Shen, H.; Hu, Y. Cancer Lett. 2013, 334, 202-210.
(7) Gholami, B.; Norton, I.; Eberlin, L. S.; Agar, N. Y. R. IEEE J. Biomed. Health Inform. 2013, 17, 734-744.
(8) Li, Y.; Li, Y.; Chen, T.; Kuklina, A. S.; Bernard, P.; Esteva, F. J.; Shen, H.; Ferrari, M.; Hu, Y. Clin. Chem. 2014, 60, 233-242.
(9) Liao, H.; Wu, J.; Kuhn, E.; Chin, W.; Chang, B.; Jones, M. D.; O'Neil, S.; Clauser, K. R.; Karl, J.; Hasler, F.; Roubenoff, R.; Zolg, W.; Guild, B. C. Arthritis Rheum. 2004, 50, 3792-3803.
(10) Zou, W.; She, J.; Tolstikov, V. V. Metabolites 2013, 3, 787-819.
(11) Sauer, S.; Kliem, M. Nat. Rev. Microbiol. 2010, 8, 74-82.
(12) Zhang, J. I.; Costa, A. B.; Tao, W. A.; Cooks, R. G. Analyst 2011, 136, 3091-3097.
(13) Konda, C.; Bendiak, B.; Xia, Y. J. Am. Soc. Mass Spectrom. 2012, 23, 347-358.
(14) Both, P.; Green, A. P.; Gray, C. J.; Sardzik, R.; Voglmeir, J.; Fontana, C.; Austeri, M.; Rejzek, M.; Richardson, D.; Field, R. A.; Widmalm, G.; Flitsch, S. L.; Eyers, C. E. Nat. Chem. 2014, 6, 65-74.
(15) McDonnell, L. A.; Heeren, R. M. A. Mass Spectrom. Rev. 2007, 26, 606-643.
(16) Pereira, J.; Porto-Figueira, P.; Cavaco, C.; Taunk, K.; Rapole, S.; Dhakne, R.; Nagarajaram, H.; Camara, J. S. Metabolites 2015, 5, 3-55.
(17) Katajamaa, M.; Oresic, M. J. Chromatogr. A 2007, 1158, 318-328.
(18) Gay, S.; Binz, P. A.; Hochstrasser, D. F.; Appel, R. D. Proteomics 2002, 2, 1374-1391.
(19) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22, 214-219.
(20) Yang, D.; Ramkissoon, K.; Hamlett, E.; Giddings, M. C. J. Proteome Res. 2008, 7, 62-69.
(21) Wan, K. X.; Vidavsky, I.; Gross, M. L. J. Am. Soc. Mass Spectrom. 2002, 13, 85-88.
(22) Wu, B. L.; Abbott, T.; Fishman, D.; McMurray, W.; Mor, G.; Stone, K.; Ward, D.; Williams, K.; Zhao, H. Y. Bioinformatics 2003, 19, 1636-1643.
(23) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Nat. Methods 2007, 4, 923-925.
(24) Geurts, P.; Fillet, M.; de Seny, D.; Meuwis, M. A.; Malaise, M.; Merville, M. P.; Wehenkel, L. Bioinformatics 2005, 21, 3138-3145.
(25) Ball, G.; Mian, S.; Holding, F.; Allibone, R. O.; Lowe, J.; Ali, S.; Li, G.; McCardle, S.; Ellis, I. O.; Creaser, C.; Rees, R. C. Bioinformatics 2002, 18, 395-404.
(26) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(27) Zhan, X.; Patterson, A. D.; Ghosh, D. BMC Bioinformatics 2015, 16.
(28) Koenig, T.; Menze, B. H.; Kirchner, M.; Monigatti, F.; Parker, K. C.; Patterson, T.; Steen, J. J.; Hamprecht, F. A.; Steen, H. J. Proteome Res. 2008, 7, 3708-3717.
(29) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75, 768-774.
(30) Bern, M.; Kil, Y. J.; Becker, C. Curr. Protoc. Bioinformatics 2012, Chapter 13.
(31) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Proteomics 2007, 7, 655-667.
(32) Coombes, K. R.; Tsavachidis, S.; Morris, J. S.; Baggerly, K. A.; Hung, M. C.; Kuerer, H. M. Proteomics 2005, 5, 4107-4117.
(33) Tabb, D. L.; MacCoss, M. J.; Wu, C. C.; Anderson, S. D.; Yates, J. R. Anal. Chem. 2003, 75, 2470-2477.
(34) Konda, C.; Londry, F. A.; Bendiak, B.; Xia, Y. J. Am. Soc. Mass Spectrom. 2014, 25, 1441-1450.
(35) Chang, C.-C.; Lin, C.-J. ACM Trans. Intell. Syst. Technol. 2011, 2, 1-27.
(36) Geppert, H.; Horváth, T.; Gärtner, T.; Wrobel, S.; Bajorath, J. J. Chem. Inf. Model. 2008, 48, 742-746.
(37) Fang, T. T.; Bendiak, B. J. Am. Chem. Soc. 2007, 129, 9721-9736.
(38) Chang, C.-C.; Lin, C.-J. ACM Trans. Intell. Syst. Technol. 2011, 2.
(39) Li, Y. J.; Li, Y. G.; Chen, T.; Kuklina, A. S.; Bernard, P.; Esteva, F. J.; Shen, H. F.; Ferrari, M.; Hu, Y. Clin. Chem. 2014, 60, 233-242.
(40) Zhang, Z. Q. Anal. Chem. 2004, 76, 3908-3922.