A Multivariate Approach To Reveal Biomarker Signatures for Disease Classification: Application to Mass Spectral Profiles of Cerebrospinal Fluid from Patients with Multiple Sclerosis

Tarja Rajalahti,*,†,‡ Ann C. Kroksveen,§ Reidar Arneberg,| Frode S. Berven,§,⊥ Christian A. Vedeler,†,‡ Kjell-Morten Myhr,†,‡,# and Olav M. Kvalheim∇

Department of Clinical Medicine, University of Bergen, Bergen, Norway, Department of Neurology, Haukeland University Hospital, Bergen, Norway, Institute of Medicine, University of Bergen, Bergen, Norway, Pattern Recognition Systems AS, Bergen, Norway, Proteomic Unit (PROBE), Department of Biomedicine, University of Bergen, Bergen, Norway, The Norwegian Multiple Sclerosis National Competence Centre, Haukeland University Hospital, Bergen, Norway, and Department of Chemistry, University of Bergen, Bergen, Norway

Received February 16, 2010

Mass spectral profiles from cerebrospinal fluid (CSF) are used as input to a novel multivariate approach to select features responsible for the separation of patients with multiple sclerosis (MS) from control groups. Our targeted statistical approach makes it possible to systematically remove features in the spectral fingerprints masking the components expressing the disease pattern. The low molecular weight CSF proteome from 54 patients with MS and a range of other neurological diseases (OND), as well as neurological healthy controls (NHC), is analyzed in replicates using mass spectral profiling. Statistically validated partial least-squares discriminant analysis (PLS-DA) models are created as a first step to separate the groups. Using the group membership as a target, the most discriminatory projection in the multivariate space spanned by the spectral profiles is revealed. From the resulting target-projected component, the spectral regions most significantly contributing to group separation are identified using the nonparametric discriminating variable (DIVA) test together with the so-called selectivity ratio (SR) plot. Our approach is general and can be applied for other diseases and instrumental techniques as well. Keywords: feature selection • biomarkers • disease classification • multivariate analysis • chemometrics • proteomics • mass spectrometry • MALDI-TOF • cerebrospinal fluid • multiple sclerosis

* Corresponding author. E-mail: [email protected]. Tel: +47 55583366. Fax: +47 55589490.
† Department of Clinical Medicine, University of Bergen.
‡ Department of Neurology, Haukeland University Hospital.
§ Institute of Medicine, University of Bergen.
| Pattern Recognition Systems AS.
⊥ Proteomic Unit (PROBE), Department of Biomedicine, University of Bergen.
# The Norwegian Multiple Sclerosis National Competence Centre, Haukeland University Hospital.
∇ Department of Chemistry, University of Bergen.

Journal of Proteome Research 2010, 9, 3608–3620. Published on Web 05/25/2010. DOI: 10.1021/pr100142m. © 2010 American Chemical Society.

1. Introduction

Protein and peptide profiling of tissue extracts and body fluids using mass spectrometry1,2 is a commonly used method to detect biomarker patterns with the purpose to detect differences between samples from different groups, for example, patients with a particular disease and controls. This approach has shown promising results, for example, in cancer research using serum as the body fluid of interest.3-6 An important issue in this context is how to reveal the information in the mass spectral profiles distinguishing diseased from controls. Each sample is very complex, described by thousands of variables (m/z numbers), and is influenced by several factors not related to compositional differences between samples. To get a reliable detection of potential biomarkers, it is crucial to use proper data-analytical tools.7-9 Thus, Spiegelman et al.8 argue that “Rigorous application of sound statistical and chemometric principles will benefit the overall scientific community by improving protein biomarker discovery and validation.” Multivariate projection methods developed in chemometrics can be used to simplify complex proteomic data and make the visualization of spectral fingerprints easier.10-12 Furthermore, they make classification of samples and detection of biomarker signatures possible.13

Multiple sclerosis (MS) is a chronic inflammatory disease of the central nervous system (CNS) causing demyelination, axonal damage, and irreversible neurological disability.14 The cause of the disease remains unknown, but it is believed that the disease is a result of environmental factors acting on genetically susceptible individuals. No clinical sign or laboratory test alone gives the diagnosis of MS. The diagnosis is based on disease history, clinical examination combined with magnetic resonance imaging, and detection of oligoclonal bands in the cerebrospinal fluid (CSF). Evidence of disseminated disease in time and space gives the diagnosis of MS.15 However, subclinical disease activity with irreversible damage may occur even before the first clinically detectable symptoms of the disease.

Early treatment slows down disease activity and early diagnosis is therefore crucial.16 Thus, there is a definite need for sensitive and specific tests to map the disease signature and thus enable an early diagnosis. Many different strategies have been used to search for diagnostic markers for MS, but no reliable biomarkers have so far been detected.17-21 The complex heterogeneous nature of MS and its different subpopulations makes it probable that several biomarkers are needed for early diagnosis of MS.22-25 It is believed that pathological and physiological changes in the CNS are better reflected in the CSF than in blood,26,27 making CSF the body fluid of choice to search for biomarkers in MS. Stoop et al.28,29 used mass spectrometry to compare CSF peptide profiles in MS patients, patients with clinically isolated syndrome (CIS), and patients with other inflammatory/noninflammatory neurological disease (OIND/OND). They identified several proteins associated with MS. In one study,28 they used presence and absence of peaks to find biomarker candidates, while in the other study29 they used peak height as a quantitative measure of peptide abundance to find markers distinguishing MS patients from controls.

In this study, we used matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) mass spectrometry30 to map the protein pattern in the low molecular weight (MW) CSF proteome. The aim was to reveal a disease signature for MS that can separate MS patients from patients with other neurological diseases (OND), as well as neurological healthy controls (NHC). The low MW fraction of the CSF proteome is believed to contain important biomarker candidates like cytokines, chemokines, hormones, and proteolytic fragments of larger proteins.2,31 MALDI is relatively straightforward to use and demands relatively simple analytical workup. Therefore, it may have the potential to be used on a routine basis in clinical laboratories. Because of the severe overlap of peaks in MALDI-TOF mass spectral profiles, whole profiles of m/z numbers are used as input to a novel multivariate data modeling approach to reveal the biomarker signatures separating the different groups of samples. This is the first time the methodology is applied to real proteomics data. Using this approach, we were able to find disease patterns distinguishing MS patients from control groups. In addition to providing a tool for classification and diagnosis of patients, identification of the proteins mapped by these patterns may provide important clues to the causes of MS.

2. Materials and Methods

The workflow used in this study for revealing biomarker signatures for disease classification is presented in Figure 1.

2.1. Study Population and Collection of CSF. The study included 18 patients with relapsing remitting multiple sclerosis (MS); 18 patients with other neurological diseases (OND), for example, headache, vertigo, neuropathy, encephalopathy; and 18 neurological healthy controls (NHC), that is, patients that underwent minor orthopedic surgery or investigation of the lower extremities. Clinical evaluation and diagnostic lumbar puncture of patients with MS and OND were performed at the Department of Neurology, Haukeland University Hospital. CSF from the NHC patients was collected during the procedure of spinal anesthesia at the Department of Orthopedic Surgery, Haukeland University Hospital. Written informed consent was obtained from the included patients and the study was approved by the Regional Committee for Medical and Health Research Ethics and the Norwegian Social Science Data Services.


Figure 1. Workflow used for revealing biomarker signatures for disease classification.

CSF was collected in sterile polypropylene tubes and centrifuged at 450g for 5 min to remove cells and particles. The supernatants from NHC patients were put on dry ice before storage at -80 °C, while the CSF supernatants from the patients with MS and OND were stored at -80 °C immediately after centrifugation. Prior to analysis, the samples were thawed, aliquoted into new tubes, and frozen once more at -80 °C.

2.2. Sample Preparation Prior to Mass Spectrometry Analysis. Prior to acquisition of the low MW proteome, CSF from each patient was diluted 1:5 using solution A containing 2% acetonitrile (ACN) (Riedel-de Haën) and 0.01% trifluoroacetic acid (TFA) (Fluka). The diluted CSF was then added to a 30 kDa MW cutoff filter (Vivascience) and centrifuged at 2000g for 12 min at 4 °C. The flow-through, containing the low MW fraction of the proteome, was purified and concentrated using Multi-SPE C8 extraction discs (3M, Millipore) as previously described.32 The purified samples were dried using SpeedVac centrifugation.

2.3. MALDI-TOF Analysis of the Low MW CSF Proteome. The dried samples were dissolved in solution B containing 70% ACN and 0.1% TFA, and mixed 1:1 with α-cyano-4-hydroxycinnamic acid (CHCA) matrix (Bruker Daltonics) (0.3 g/L in ethanol (Arcus)/acetone (Sigma Aldrich) 2:1). One microliter of this solution was spotted onto a 600 µm MTP AnchorChip 600/384 plate (Bruker Daltonics). Each spot was washed with 10 mM ammonium phosphate monobasic (Sigma Aldrich) dissolved in 0.1% TFA prior to recrystallization using ethanol/acetone/0.1% TFA (6:3:1). Mass spectral profiles were acquired in the m/z range 740-9000 using an AutoFlex (Bruker Daltonics) mass spectrometer in a positive linear mode. The parameter settings were: laser frequency 20 Hz, ion source I 20 kV, pulsed ion extraction 250 ns with ion suppression up to 500 Da. No real-time smoothing was performed.
The analysis was performed using an auto execute method with the following setup: 20 initial uncollected shots at 35% laser power, followed by 100 shots that were collected. This was repeated at different positions until a total of 2000 shots had been collected. The laser power varied between 20% and 24%. The spectra were automatically collected if the signal-to-noise ratio was evaluated to be above 3 with a peak resolution of 200 in auto execute mode32 (FlexControl software version 2.0, Bruker Daltonics).

2.4. Data Analysis of MALDI-TOF Spectral Profiles. 2.4.1. Data Set. CSF from each patient was fractionated and spotted several times, thus giving many replicated experiments. This procedure produced two types of replicates: (i) analytical replicates, that is, aliquoted samples undergoing full analytical workup, and (ii) instrumental replicates, that is, a fractionated sample spotted several times. The number of replicates varied slightly from patient to patient due to practical experimental considerations. Each replicate was used as a single spectrum in the analysis. In this way, a realistic estimate of the variation in the complete analytical procedure was obtained, including also the modeling step. In total, the data set consisted of 498 spectra. Each spectral profile was described by intensities at 44 403 m/z numbers (mass range from 740.04 to 8999.84 Da). Since the aim was to find the discriminating m/z regions in the mass spectral profiles which separated the groups from each other, the groups were compared pairwise and three different subsets were defined: MS versus OND, MS versus NHC, and OND versus NHC.

2.4.2. Data Pretreatment. To remove noncompositional factors, for example, baseline effects, shifts in m/z values, structured noise (heteroscedasticity), and differences in signal intensities caused by analytical workup and the instrumental technique, the spectra were pretreated prior to data analysis according to the recommendations given by Arneberg et al.33 Data were first smoothed using moving average with window size 10. All spectra were then aligned to an average spectrum using recursive alignment by fast Fourier transform (RAFFT) cross-correlation function34,35 with window size 20. All m/z values under 1000 Da were removed after alignment since this region contained much background noise.
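As an illustration of the smoothing and alignment steps, the following minimal Python sketch applies a moving average with window size 10 and estimates a single shift between a spectrum and a reference from their FFT-based cross-correlation. It is a simplification and not the RAFFT implementation used in the study: RAFFT aligns recursively on segments, whereas this sketch assumes one circular shift for the whole profile. The function names and the synthetic peak are hypothetical.

```python
import numpy as np

def moving_average(spectrum, window=10):
    """Smooth one spectrum with a moving average; output has the same length."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

def fft_shift_estimate(reference, spectrum):
    """Integer number of points by which `spectrum` must be rolled to match
    `reference`, taken from the maximum of the circular cross-correlation."""
    n = len(reference)
    spec_ref = np.fft.rfft(reference)
    spec_obs = np.fft.rfft(spectrum)
    xcorr = np.fft.irfft(spec_ref * np.conj(spec_obs), n)
    lag = int(np.argmax(xcorr))
    # Lags beyond half the length correspond to negative shifts.
    return lag - n if lag > n // 2 else lag

# Synthetic example (hypothetical data): one Gaussian peak displaced by 5 points.
axis = np.arange(256)
reference = np.exp(-0.5 * ((axis - 100) / 3.0) ** 2)
shifted = np.roll(reference, 5)

lag = fft_shift_estimate(reference, shifted)
aligned = np.roll(shifted, lag)            # circular shift, for simplicity only
smoothed = moving_average(np.ones(50))     # interior points of a flat signal stay flat
```

A real profile would need the shift estimated per segment and edge points handled without wrap-around; both are omitted here.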
The spectra had been baseline subtracted using the instrument’s own software (FlexAnalysis by Bruker Daltonics). If baseline subtraction yielded negative intensities, the negative intensities were adjusted to zero. Finally, heteroscedastic noise was transformed to homoscedastic noise using square root transformation36 and the data were normalized to unit norm. Data were mean-centered prior to modeling but not autoscaled, since this would magnify the noise from m/z values with low intensity when profiles with all m/z numbers are used. In addition, our approach does not depend on autoscaled data to reveal components with relatively low concentration, since the use of square root transformation reduces differences between components with low and high concentration. After all pretreatment steps, there were 41 507 m/z numbers describing each profile.

2.4.3. Data Analysis. Principal component analysis (PCA)37,38 was used to obtain an overview of the data. PCA may also reveal clustering among the samples when the major variation in spectral fingerprints represents separation between groups. Supervised methods, like partial least-squares discriminant analysis (PLS-DA),39 are more suitable when within-group variance dominates over between-group variance. These methods utilize a priori information about the samples’ group membership. PLS-DA was applied to the data to be able to find discriminating m/z regions in the mass spectral profiles of the different patient groups. A binary response variable was first created; this variable gets a value zero (0) or one (1) and gives a group membership for each sample in the group, for example, healthy controls and MS patients, respectively. A PLS regression40-42 model was then calculated using the mass spectral profiles as input. Using a part of the samples as a separate test set is not feasible when the number of samples is small compared to the number of variables. Therefore, double cross-validation, utilizing all samples both for training and external validation, was used to obtain PLS-DA models with the best predictive performance.43,44

Since numerous PLS components are needed to describe the variation in complex spectral profiles, a PLS-DA model is difficult to interpret. Therefore, we proceeded with the target projection (TP) method45 to obtain easier interpretation. In TP, the spectral data matrix is projected onto the PLS regression vector. TP provides the part of the data matrix X of mass spectral profiles that is covarying with the vector of group membership y for the estimated model dimension, that is, the number of PLS components. With this projection, the information in the spectra unrelated (i.e., orthogonal) to the response variable (group membership) is removed and we obtain a single latent variable (the target-projected component) that represents the axis of optimal discrimination between the two groups in question. Thorough theoretical explanation with illustrations can be found in a recent paper by Kvalheim.46 With the same number of PLS components, the target-projected component is identical to the predictive orthogonal PLS component obtained from orthogonal PLS (O-PLS).47 The O-PLS method is widely used in, for example, metabonomics.12,48

Percent explained variance in X (R²X) for the PLS and the TP models are used to show the information content in X used to model y (PLS) and the information in X specifically related to y (TP). Percent explained variance in y (R²y) is used as one measure of the overall predictive performance of a model. Because of small sample size and group heterogeneity, data in proteomic applications cannot be assumed to follow a normal distribution.
Therefore, a nonparametric approach called the discriminating variable (DIVA) test49 and the selectivity ratio (SR) plot50 were used for interpreting the TP model and revealing the most discriminating mass spectral regions. Since SR is a measure of between-to-within group variance, SR can reveal regions in spectral profiles with both high explanatory and high predictive significance for the investigated response. The DIVA test has been developed to provide statistical boundaries for the SR method and thus make the feature selection easier. More details about the applied procedures are provided in section 3 (Results and Discussion). Data analysis was performed using Sirius version 8.0 (Pattern Recognition Systems AS, Bergen, Norway).
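The final pretreatment steps and the target projection can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the Sirius implementation: the PLS regression vector is computed with a textbook PLS1 (NIPALS) loop, the TP scores and loadings are assumed to follow the usual definition (projection of X onto the normalized regression vector), and the data are synthetic. All function and variable names are hypothetical.

```python
import numpy as np

def pretreat(X_raw):
    """Set negatives to zero, square-root transform, scale each profile to
    unit norm, and mean-center each variable (no autoscaling)."""
    X = np.sqrt(np.clip(X_raw, 0.0, None))
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X - X.mean(axis=0)

def pls1_regression_vector(X, y, n_components):
    """Textbook PLS1 (NIPALS) on centered X and y; returns the vector b with X @ b ~ y."""
    Xk, yk = X.copy(), y.astype(float).copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        c = yk @ t / tt
        Xk = Xk - np.outer(t, p)   # deflate X
        yk = yk - c * t            # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def target_projection(X, b):
    """TP scores and loadings from the normalized regression vector (assumed definition)."""
    w_tp = b / np.linalg.norm(b)
    t_tp = X @ w_tp
    p_tp = X.T @ t_tp / (t_tp @ t_tp)
    return t_tp, p_tp

# Synthetic demo (hypothetical data): 40 profiles, 200 variables, two groups.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 20)
X_raw = rng.random((40, 200))
X_raw[groups == 1, 50:60] += 1.0        # group 1 has higher intensity in one region
X = pretreat(X_raw)
y = groups - groups.mean()              # centered binary response
b = pls1_regression_vector(X, y, n_components=3)
t_tp, p_tp = target_projection(X, b)
```

Because the TP scores are the fitted response divided by the length of b, profiles coded one receive higher scores on average than profiles coded zero.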

3. Results and Discussion

3.1. Multivariate Modeling of Pretreated Mass Spectral Profiles. Figure 2 shows representative MALDI-TOF spectra after pretreatment of the data. The spectra span the mass range from 1000.06 to 8999.84 Da (41 507 m/z numbers). PCA models were first calculated for the three subsets (MS vs OND, MS vs NHC, and OND vs NHC) using the pretreated spectral data. Two principal components (PC) were calculated to attain an overview of the data. The obtained PCA scores for the subset MS versus OND are shown in Figure 3. The first PC explains 15.0% and the second PC 14.2% of the total spectral variation. The two groups are severely overlapping each other and no tendency of clustering is observed in the PCA score plot. Extraction of further PCs did not provide any separation either. The situation is similar for the other two subsets as well (results not shown).

PLS-DA was used to be able to discriminate between different patient groups. Two strategies are possible for solving the problem: either a single model, including all three groups, or three binary models, modeling the groups pairwise. For our purpose, it is better to use three distinct binary classifications. The combination PLS-DA/TP is optimal with respect to finding the most discriminatory vector for a two-group classification problem. PLS-DA/TP can be extended to handle more than two groups, but the classification result can no longer be presented on a single vector for feature selection. When doing three distinct classifications, we obtain the best discriminatory features for each comparison, and if one then compares the results from all three classifications, one can obtain the features that provide separation of all three groups simultaneously. PLS-DA models were calculated for the three subsets using pretreated spectral data as explanatory variables (x_i, i = 1, 2, ..., 41 507) and group membership as the response variable (y). The group membership for the three subsets was defined as follows: (i) zero (0) for the OND and one (1) for the MS samples, (ii) zero (0) for the NHC and one (1) for the MS samples, and (iii) zero (0) for the NHC and one (1) for the OND samples.

Figure 2. Representative MALDI-TOF spectral profiles for the low MW fraction of the CSF proteome after pretreatment of the data (1-9 kDa).

Figure 3. PCA score plot for the spectral profiles in MS (red) and OND (blue).

The number of significant PLS components was estimated using a double cross-validation scheme with outer and inner loops to ensure the best predictive performance for each model. Since replicates were used independently and the same sample was therefore represented several times, the “true” number of degrees of freedom is exaggerated and even double cross-validation may lead to overfitting. To reduce this risk, a relatively large percentage of samples was left out in double cross-validation steps. Twenty percent of the spectra were kept out for the outer loop to be used for independent external validation. The outer loop was repeated five times so that every spectrum had been kept out once, and only once. In the inner loop, 25% of the spectra were kept out at a time. The inner loop was repeated four times so that every spectrum had been kept out once, and only once, also in the inner loop. Centering was redone in the cross-validation. This resulted in models with 13 significant PLS components for subsets MS versus OND and OND versus NHC, and 14 significant PLS components for subset MS versus NHC, with explained variance in the range of 63.5-65.7% for the spectral data and 99.6-99.7% for the response. The validated PLS-DA models explain the response (group membership) exceptionally well. The large residual variation (approximately 35%) in the spectral data is expected, since most of the m/z numbers describe component patterns that are shared by all samples. Actually, the PLS-DA models incorporate much variation in the mass spectral profiles that is unrelated to the variation pattern in the response. This orthogonal variation has to be removed in order to uncover

Table 1. Summary of the Modeling Results for the Three Different Subsets before and after Variable Reduction^a

subset          no. of    A    R²(X_PLS-DA)   R²(X_TP)   R²(y)    % MCCR    SR      no. of selected
                spectra        %              %          %        (DIVA)    limit   m/z^b
MS vs OND       247       13   65.74          6.34       99.57    76.5      0.50    88 (0.2%)
MS vs OND^c     247       5    91.98          53.24      68.16
MS vs NHC       356       14   65.30          10.55      99.65    80        0.52    754 (1.8%)
MS vs NHC^c     356       9    87.24          43.70      93.34
OND vs NHC      390       13   63.45          10.08      99.57    79.5      0.60    260 (0.6%)
OND vs NHC^c    390       6    87.76          50.91      83.72

^a A = no. of PLS components; R²(X_PLS-DA) = explained variance in X for PLS-DA model; R²(X_TP) = explained variance in X for TP model; R²(y) = explained variance in y for PLS-DA and TP model; MCCR = mean correct classification rate.
^b Corresponding percentage of the original spectral profile in parentheses.
^c Reduced subset with selected m/z regions.
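The number of PLS components (A) reported in Table 1 was chosen by double cross-validation. The splitting scheme described in the text (five outer folds of 20%, each followed by four inner folds of 25% of the remaining spectra) can be sketched as follows. The function name, the random seed, and the use of NumPy are assumptions made for illustration; model fitting, re-centering within each split, and the selection of A are left out.

```python
import numpy as np

def double_cv_splits(n_spectra, n_outer=5, n_inner=4, seed=0):
    """Yield (train_idx, test_idx, inner) for each outer fold, where `inner`
    is a list of (fit_idx, validation_idx) pairs drawn from train_idx."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_spectra)
    outer_folds = np.array_split(order, n_outer)
    for i, test_idx in enumerate(outer_folds):
        # Spectra outside the current outer fold are available for model building.
        train_idx = np.concatenate(
            [fold for j, fold in enumerate(outer_folds) if j != i])
        inner_folds = np.array_split(train_idx, n_inner)
        inner = []
        for k, val_idx in enumerate(inner_folds):
            fit_idx = np.concatenate(
                [fold for j, fold in enumerate(inner_folds) if j != k])
            inner.append((fit_idx, val_idx))
        yield train_idx, test_idx, inner

# Example with the 247 spectra of the MS vs OND subset.
splits = list(double_cv_splits(247))
```

The sketch splits individual spectra at random, as in the text; every spectrum is held out exactly once in the outer loop and exactly once in each inner loop that contains it.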

Figure 4. Scores on the target-projected (TP) component for the spectral profiles in MS (red) and OND (blue).

the biomarker signatures. Each of the three PLS-DA models was therefore rotated using TP to obtain one target-projected component that captures the predictive spectral information. The TP component represents the axis of optimal discrimination between the investigated groups. The calculated TP component for the three target-projected models explained 6.3-10.6% of the total variation in the mass spectral profiles. This shows that only a small fraction of the total variation in the spectral data contributes to the discrimination between the groups. The explained variances for the response remain the same for the TP models as for the PLS-DA models. Table 1 summarizes modeling results for the three subsets. Each spectral profile in a subset gets its own unique score value on the TP component. A perfect separation is achieved if all profiles from one group get high score values while all the profiles from the other group get low score values. An example of TP scores for subset MS versus OND is shown in Figure 4. A clear separation between the two groups can be observed; all MS samples get high scores and all OND samples get low scores. Similar results are obtained for the other two subsets (results not shown). The TP procedure is able to separate the different groups solely on the basis of their mass spectral profiles.

3.2. Detection of the Biomarker Signatures. The validated TP models can be used for revealing m/z numbers differing significantly in abundance between profiles from diseased and controls. The TP loadings represent the features in the spectral profiles that explain the separation between the groups. A TP loading is calculated for each spectral variable using the TP scores. Figure 5 shows an example of TP loadings for subset MS versus OND. Several m/z numbers get large positive or negative loading values and therefore seem to be contributing significantly to the observed separation between the patient groups. However, as illustrated by Chau et al.,51 the size of TP loadings is not necessarily the best criterion to select the most significant discriminatory variables. The reason for this is that a spectral region with large loading values may in fact reflect a protein with relatively high abundance, and thus large variance, but still with low discriminatory ability. This is due to the definition of TP loadings which is based on reproduced covariances between the vector of group memberships for each sample and the vector of intensities for each m/z number.

Use of selectivity ratio (SR) has shown promising results in biomarker selection.50,51 SR is defined as the ratio between explained variance and residual (unexplained) variance on the target-projected component and it is calculated for each spectral variable. By using a variable selection method based upon the size of SR, it is possible to find the most important m/z regions contributing to the separation between groups, that is, m/z numbers that can point to potential biomarkers. Variables with a high SR value have a good discriminatory ability between two groups of samples. If a variable is not well modeled by TP, it means that the chance of providing good discriminatory ability is low for this variable. Minor variables that covary with larger variables with good discriminatory ability will be enhanced and thus selected as shown in a previous paper.50 However, a boundary between highly discriminating and less significant variables is needed to be able


Figure 5. Target projected loadings for the spectral profiles in subset MS vs OND.
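The difference between ranking variables by TP loading and by selectivity ratio can be illustrated with a short Python sketch. It assumes the definition given in the text (explained variance divided by residual variance on the target-projected component, per variable) and attaches the sign of the TP loading so that the direction of change is visible. The regression vector is a simple stand-in rather than a validated PLS-DA model, the data are synthetic, and all names are hypothetical.

```python
import numpy as np

def selectivity_ratio(X, b):
    """Signed selectivity ratio for each column of the centered matrix X,
    given a regression vector b (assumed to come from a PLS-DA model)."""
    w_tp = b / np.linalg.norm(b)
    t_tp = X @ w_tp                          # TP scores
    p_tp = X.T @ t_tp / (t_tp @ t_tp)        # TP loadings
    X_tp = np.outer(t_tp, p_tp)              # part of X explained by the TP component
    explained = np.sum(X_tp ** 2, axis=0)
    residual = np.sum((X - X_tp) ** 2, axis=0)
    return np.sign(p_tp) * explained / residual

# Synthetic demo (hypothetical data): variable 0 follows group membership,
# the remaining 49 variables are noise.
rng = np.random.default_rng(1)
groups = np.repeat([0, 1], 20)
y = groups - groups.mean()
X = rng.normal(size=(40, 50))
X[:, 0] = 5.0 * y + 0.1 * rng.normal(size=40)
X -= X.mean(axis=0)

b = X.T @ y                  # stand-in for a PLS regression vector
sr = selectivity_ratio(X, b)
best = int(np.argmax(np.abs(sr)))
```

In this toy case the discriminating variable receives by far the largest ratio, and its positive sign indicates higher intensity in the group coded one.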

Figure 6. Discriminating variable (DIVA) plot for subset MS vs OND: percent mean correct classification rate (MCCR) (continuous red line) and standard deviation of MCCR (dashed lines). The chosen SR value is marked with black vertical line.

to detect the most promising biomarker candidates. A nonparametric statistical test called discriminating variable (DIVA) test has been developed to provide statistical boundaries for the SR method. The DIVA test is based on testing every m/z ratio with respect to correct classification rate (CCR) and it gives a probability measure on how well variables within a certain SR interval separate the two groups. An SR to be used for each model can thus be determined at a chosen probability level. The chosen probability level can be seen as a compromise to balance the possibility of including many false biomarker candidates against the risk of missing important biomarkers. The DIVA approach is univariate in the sense that each variable is checked independently for discriminatory ability based on correct classification rate. Thus, there is a chance that a certain combination of variables, each having low discriminatory ability, can together provide a separation. This risk, however, must be balanced against the risk of achieving apparently good classification performance, but limited or no predictive ability. With a large variable-to-sample ratio, as one usually has in proteomic/metabonomic applications, there is a high risk of obtaining spurious results if one starts to consider all possible combinations of variables with the aim to find the “best” separation. Our approach selects variables with both good classification performance and relatively high correlation to the response, that is, group membership. Figure 6 shows results from DIVA test for MS versus OND (results for MS vs NHC and OND vs NHC are similar and therefore not shown). In the DIVA plot, the mean correct classification rates (MCCR) and their standard deviations for all SR intervals are plotted against SR. Since the CCR for a single variable is identical to sensitivity in a binary classification situation, the MCCR can be interpreted as a mean sensitivity within a certain SR interval. Thus, the DIVA plot connects classification performance to the ratio of between-group to


Figure 7. Selectivity ratio (SR) plot for subsets (A) MS vs OND, (B) MS vs NHC, and (C) OND vs NHC. The chosen SR limits are marked with horizontal lines.
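The curve shown in a DIVA plot (mean and standard deviation of the per-variable correct classification rate within each SR interval) can be sketched in Python as below. The exact per-variable classification rule of the published DIVA test is not given in this text, so the rule used here (a threshold at the midpoint between the two group medians) is purely an assumption for illustration, as are the function names and the toy data.

```python
import numpy as np

def per_variable_ccr(X, groups):
    """Correct classification rate of each variable used on its own.
    ASSUMPTION: samples are classified by thresholding at the midpoint
    between the two group medians; the published test may differ."""
    med0 = np.median(X[groups == 0], axis=0)
    med1 = np.median(X[groups == 1], axis=0)
    threshold = (med0 + med1) / 2.0
    high_is_group1 = med1 >= med0
    predicted_1 = np.where(high_is_group1, X > threshold, X < threshold)
    return (predicted_1 == (groups[:, None] == 1)).mean(axis=0)

def diva_curve(abs_sr, ccr, edges):
    """For each SR interval [edges[k-1], edges[k]) that contains variables,
    return (lower, upper, mean CCR, std of CCR)."""
    bin_index = np.digitize(abs_sr, edges)
    rows = []
    for k in range(1, len(edges)):
        in_bin = ccr[bin_index == k]
        if in_bin.size:
            rows.append((edges[k - 1], edges[k], in_bin.mean(), in_bin.std()))
    return rows

# Toy data (hypothetical): one perfectly separating and one constant variable.
groups = np.repeat([0, 1], 10)
X = np.zeros((20, 2))
X[:, 0] = 10.0 * groups
X[:, 1] = 1.0
ccr = per_variable_ccr(X, groups)
curve = diva_curve(np.array([2.0, 0.1]), ccr, edges=np.array([0.0, 1.0, 3.0]))
```

An SR limit would then be read off as the lowest interval whose mean rate reaches the chosen probability level.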

within-group variance in each variable in a quantitative manner.49 The DIVA tests indicate that SR values of approximately 0.50 (MS vs OND), 0.52 (MS vs NHC), and 0.60 (OND vs NHC) represent reasonable choices for reducing the number of m/z regions to a manageable size without sacrificing too many regions with discriminatory ability. The chosen MCCR levels for the three binary classifications vary from 76.5% to 80% (Table 1). The chosen SR values are subsequently used as the lower limit for variable selection in the SR plot (Figure 7). The SR plot looks like a spectrum, showing the SR at each m/z number. SR values can be positive or negative in the SR plot; the sign is related to the correlation pattern of the spectral variables to the response. SR is multiplied with the sign of the corresponding TP loading to enhance interpretation. By adding the sign, we can see which m/z regions increase or decrease in intensity. We observe that only a few of the spectral regions have values above the defined SR limit in each subset. Selected m/z regions do not necessarily correspond to whole peaks but rather to fractions of peaks. This is a result of overlap between peaks that reduces the SR in the overlapping areas. In the MS versus OND subset, only 88 m/z numbers have a significant contribution to the separation between MS patients and OND patients at the chosen probability level. This selection represents 0.2% of the original spectral profile. The positive and negative values in the SR plot correspond to m/z numbers with a relative increased or decreased intensity, respectively, in the MS profiles compared to the OND profiles. In the MS versus NHC subset, 754 m/z numbers, corresponding to 1.8% of the original spectral profile, contribute to the separation. Regions having a positive SR are relatively more abundant in the MS group and regions having a negative SR are more abundant in NHC group. In the OND versus NHC subset, 260 m/z numbers, corresponding to 0.6% of the original spectral profile, are selected. In this subset, a positive SR corresponds to higher intensities in OND group while a negative SR corresponds to higher intensities in NHC group. The selected spectral regions for each subset classification and the group where these regions are more abundant are listed in Table 2.

The most interesting spectral regions are the ones that are selected as discriminating in more than one classification. If we compare the selected regions for different subset classifications, we observe eight m/z regions that are common to two subset classifications (Table 2). The regions around m/z 2299 and 2498 are common to the MS versus OND and MS versus NHC subset classification and are relatively more abundant in MS patients. The regions around m/z 2318 and 2821 are common to the MS versus NHC and OND versus NHC subset classification. These regions are more abundant in the NHC group compared to the MS and OND groups, that is, they seem to be less abundant in the samples with neurological diseases. Also the regions around m/z 2430, 3237, 5121, and 7162 are selected for both the MS versus NHC and the OND versus NHC subset classifications, but they are more abundant in MS and OND groups, that is, they seem to be biomarker candidates for neurological diseases. The eight peaks corresponding to m/z regions discussed above are displayed in Figure 8. Although no single m/z region in the biomarker signatures can provide complete separation, visual inspection of the spectral regions showed that there is a rather good separation between groups in each of the selected

Table 2. Selected m/z Regions for the Three Different Subsetsᵃ

MS vs OND Classification
1030.30-1030.50 (OND)      2150.45-2154.55 (MS)       2297.86-2300.49 (MS)ᵇ
2496.56-2499.30 (MS)ᵇ      4331.62-4332.82 (OND)      4386.54-4388.35 (OND)

MS vs NHC Classification
1443.45-1443.68 (MS)       1547.63-1548.35 (MS)       1550.87 (MS)
1655.05-1655.92 (NHC)      1669.11-1670.10 (NHC)      1956.58-1959.95 (NHC)
2297.86-2304.45 (MS)ᵇ      2316.03-2321.46 (NHC)ᶜ     2341.63-2342.67 (MS)
2361.15-2363.38 (MS)       2426.51-2435.38 (MS)ᵈ      2445.32-2450.75 (MS)
2497.62-2498.99 (MS)ᵇ      2677.05-2677.37 (NHC)      2700.45-2701.25 (NHC)
2730.8 (MS)                2731.28-2731.92 (MS)       2737.34-2740.85 (MS)
2753.16 (MS)               2814.29-2815.58 (NHC)      2818.49-2825.62 (NHC)ᶜ
2903.79-2905.43 (NHC)      2972.53-2975.03 (MS)       3186.68-3189.27 (NHC)
3191.85-3192.54 (NHC)      3201.16-3202.19 (NHC)      3203.23-3215.49 (NHC)
3236.09-3239.21 (MS)ᵈ      3251.54-3252.41 (MS)       3802.70-3803.26 (MS)
3907.52-3911.34 (NHC)      4462.74-4464.37 (MS)       4479.04-4481.69 (MS)
5118.79-5123.15 (MS)ᵈ      5758.69-5762.62 (MS)       6131.53 (MS)
6132.00 (MS)               6132.96-6134.39 (MS)       6134.87-6136.77 (MS)
6258.55-6266.75 (MS)       6446.39-6452.26 (MS)       7156.47 (MS)
7156.98-7159.30 (MS)       7161.88-7162.65 (MS)ᵈ      7447.79-7457.78 (MS)
7459.36 (MS)               7459.88-7464.88 (MS)

OND vs NHC Classification
1514.32-1515.39 (NHC)      1916.58-1919.78 (NHC)      2316.47-2318.52 (NHC)ᶜ
2427.11-2428.16 (OND)ᵈ     2430.11-2430.26 (OND)ᵈ     2433.87-2434.47 (OND)ᵈ
2814.12-2824.00 (NHC)ᶜ     3233.84-3239.21 (OND)ᵈ     3492.77-3493.85 (OND)
3806.27-3810.59 (OND)      3930.42-3930.99 (NHC)      4176.60-4177.39 (NHC)
4550.90-4551.31 (OND)      4739.02-4742.58 (OND)      5117.92-5123.15 (OND)ᵈ
7047.18-7047.69 (OND)      7160.07-7162.91 (OND)ᵈ

ᵃ Group with increased relative abundance in parentheses. ᵇ Common to MS vs OND and MS vs NHC, relatively more abundant in MS. ᶜ Common to MS vs NHC and OND vs NHC, relatively more abundant in NHC. ᵈ Common to MS vs NHC and OND vs NHC, relatively more abundant in MS and OND.

regions, thus supporting our approach to reveal discriminating regions in mass spectral profiles. If we compare the selected spectral regions from the SR/DIVA approach to the results from TP loadings (Figure 5), we observe that use of loadings results in the selection of more peaks. Some of the same peaks can be found but their order of importance is not the same. It is obvious that the use of loadings will give a forest of potential biomarker candidates, but most of them will be falsified at a significant probability level.

3.3. Analysis Using the Detected Biomarker Signatures. When the complete spectral profiles are used, the number of samples is very small compared to the number of variables. This is a typical feature of all -omics data. As shown in the first part of the analysis, most of the variables are irrelevant as they represent variation not related to the response. Therefore, the number of variables can be reduced drastically with minor loss of information. We now repeat the analysis using only the selected m/z regions. New PLS-DA models followed by TP are calculated for each subset. Double cross-validation gave 5, 9, and 6 significant PLS components for subsets MS versus OND, MS versus NHC, and OND versus NHC, respectively. Explained variance was in the range of 87.2-92.0% for the spectral data and 68.2-93.3% for the response. The calculated TP component in the three target-projected models explained 43.7-53.2% of the total variation in the mass spectral profiles. The results are summarized in Table 1. These models, calculated from the reduced mass spectral profiles, have higher explained variance in spectral data, but slightly lower explained variance in response than the models calculated for the complete profiles. The reason is that the selected m/z numbers are more relevant for the separation of groups, but at the same time, we lose some explanatory power in the classification pattern exhibited by the response by removing almost all the spectral variables. TP scores for all subsets are shown in Figure 9. Classification results based on the TP scores for the reduced spectral profiles are shown in Table 3. Excellent separation between the two groups can be observed in all subsets even though a few of the replicated spectra from some samples comparing MS versus OND and OND versus NHC are falsely classified. 
In MS versus OND classification, only one MS patient is systematically falsely classified while four other MS patients have 8-43% of the replicates falsely classified. In the same subset classification, one OND patient has 60% of the replicates falsely classified and six OND patients have 8-30% of the replicates falsely classified. This means that only one MS and one OND patient are falsely classified when taking into account the replicate variation in the samples. In OND versus NHC classification, four OND patients and one NHC patient have one replicate (corresponding to 7-20%) falsely classified, while only one NHC patient is systematically falsely classified. These results are due to analytical variance and the existence of a fuzzy zone in between the groups. Liland et al.52 and Forshed et al.53 have shown that optimization makes it possible to reduce analytical variance substantially, so that MALDI-TOF and similar mass spectrometry approaches produce more quantitative results. Note, however, that the separation of MS from NHC is still perfect after variable reduction. Despite the huge reduction in spectral variables, the loss of information in the classification pattern is surprisingly small and the correlation between TP scores before and after variable selection is 0.824, 0.967, and 0.916 for the subsets MS versus OND, MS versus NHC, and OND versus NHC, respectively. Furthermore, use of the nonparametric Mann-Whitney U-test to compare the groups using the scores on the TP component gives p < 10⁻³⁶ for each subset classification, demonstrating excellent multivariate selectivity in the reduced mass spectral profiles. No separation was observed between the groups when PCA models were calculated using the complete spectral profiles.
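The step from replicate-level to patient-level classification described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `patient_level_calls`, the majority rule (a patient counts as misclassified when more than 50% of the replicates are wrong), and the demo records are all assumptions introduced here. The majority rule is merely consistent with the text, where a patient with 60% wrong replicates is counted as falsely classified and one with 43% is not.

```python
from collections import defaultdict

def patient_level_calls(replicates, threshold=0.5):
    """Summarise replicate-level predictions per patient.

    replicates: iterable of (patient_id, true_group, predicted_group).
    Returns {patient_id: (fraction_wrong, patient_misclassified)}.
    The majority threshold is an assumption, not taken from the paper.
    """
    counts = defaultdict(lambda: [0, 0])  # patient -> [n_wrong, n_total]
    for patient, truth, predicted in replicates:
        counts[patient][0] += int(truth != predicted)
        counts[patient][1] += 1
    return {
        patient: (n_wrong / n_total, n_wrong / n_total > threshold)
        for patient, (n_wrong, n_total) in counts.items()
    }

# Hypothetical replicate results, invented for illustration only.
demo = (
    [("P1", "MS", "MS")] * 6 + [("P1", "MS", "OND")]
    + [("P2", "OND", "MS")] * 3 + [("P2", "OND", "OND")] * 2
    + [("P3", "MS", "OND")] * 5
)
calls = patient_level_calls(demo)
for patient in sorted(calls):
    fraction_wrong, misclassified = calls[patient]
    print(f"{patient}: {fraction_wrong:.0%} wrong, patient misclassified: {misclassified}")
```

In this toy input, P1 has one wrong replicate out of seven and stays correctly classified, whereas P2 (3 of 5 wrong) and P3 (all wrong) are counted as misclassified patients.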


Figure 8. Spectral m/z regions that are selected as discriminating and common in two subset classifications. Mean spectrum (prior to square root transformation and normalization) for each group is shown: MS (red), OND (blue), and NHC (black). Regions around m/z (A) 2299, (B) 2498, (C) 2318, (D) 2821, (E) 2430, (F) 3237, (G) 5121, and (H) 7162 are marked with vertical dashed lines.
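Finding regions shared by two subset classifications, as in Table 2 and Figure 8, amounts to comparing m/z intervals. The sketch below uses strict interval overlap as the matching rule; this rule and the function name `overlapping_regions` are assumptions, since the text does not state how common regions were matched. The intervals are a small excerpt copied from Table 2, not the full table.

```python
def overlapping_regions(regions_a, regions_b):
    """Return all pairs of (low, high) m/z intervals that overlap."""
    pairs = []
    for a_low, a_high in regions_a:
        for b_low, b_high in regions_b:
            # Two closed intervals overlap when each starts before the other ends.
            if a_low <= b_high and b_low <= a_high:
                pairs.append(((a_low, a_high), (b_low, b_high)))
    return pairs

# Excerpt of Table 2 (selected intervals only).
ms_vs_ond = [(1030.30, 1030.50), (2150.45, 2154.55),
             (2297.86, 2300.49), (2496.56, 2499.30)]
ms_vs_nhc = [(1443.45, 1443.68), (2297.86, 2304.45),
             (2497.62, 2498.99), (3236.09, 3239.21)]

common = overlapping_regions(ms_vs_ond, ms_vs_nhc)
for region_a, region_b in common:
    print(region_a, "overlaps", region_b)
```

For this excerpt the two overlapping pairs are the regions around m/z 2299 and 2498, the ones reported as common to the MS versus OND and MS versus NHC classifications.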

The situation is different after variable reduction. For the reduced MS versus OND subset, the first PC explains a major part (67.5%) of the variation. When looking at the scores for PC1 (Figure 10), a clear tendency for separation between the groups along this component can be observed. The result is actually very similar to the TP scores for the reduced subset (Figure 9A). PC1 for the subsets MS versus NHC and OND versus NHC explains 60.2% and 69.1% of the variation, respectively. Also here a clear tendency for separation is observed (score plots not shown). These results support the conclusion


that the selected m/z regions do contain the essential information about the disease classification and are important for the separation between the different groups. Comparison with the TP score patterns (Figure 9), however, shows the superiority of the TP approach over PCA to single out the underlying disease signatures.
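The quantity quoted above, the share of variation explained by PC1, can be illustrated with a small sketch. It uses power iteration and only the Python standard library so that it runs without extra packages; a linear algebra library would normally be used instead. The function name `first_pc` and the two-variable toy data are assumptions made for illustration and are unrelated to the CSF profiles.

```python
def first_pc(X, n_iter=100):
    """Return (explained_fraction, scores) for PC1 of column-centred X."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, means)] for row in X]
    total = sum(v * v for row in Xc for v in row)  # total sum of squares
    # Start from the longest centred row (assumed not orthogonal to PC1).
    w = list(max(Xc, key=lambda row: sum(v * v for v in row)))
    for _ in range(n_iter):
        t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]               # X w
        w = [sum(row[j] * ti for row, ti in zip(Xc, t)) for j in range(len(w))]  # X^T t
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
    scores = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
    return sum(v * v for v in scores) / total, scores

# Toy data: a group effect shared by both variables plus a small opposing pattern.
group = [0, 0, 0, 0, 1, 1, 1, 1]
pattern = [1, -1, 1, -1, 1, -1, 1, -1]
X = [[10 + 4 * g + 0.5 * e, 20 + 4 * g - 0.5 * e] for g, e in zip(group, pattern)]

fraction, scores = first_pc(X)
print(f"PC1 explains {fraction:.1%} of the variation")
print("scores:", [round(s, 2) for s in scores])
```

In this toy case PC1 follows the shared group effect, so the scores of the two groups fall on opposite sides of zero, which mirrors the tendency for separation described for the reduced profiles.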

4. Conclusions

In this work, a targeted multivariate projections approach is, for the first time, applied to real proteomic data from full


Figure 9. Scores on the target-projected (TP) component for the detected biomarker signatures in (A) MS vs OND, (B) MS vs NHC, and (C) OND vs NHC. MS (red), OND (blue), and NHC (black).

Table 3. Classification Results Based on TP Scores for the Detected Biomarker Signature

subset        % correct classification rateᵃ    description of false classification casesᵇ
MS vs OND     MS: 89.5                           1 patient with all replicates, 1 patient with several replicates (43%), 3 patients with one replicate (8-25%)
              OND: 92.3                          1 patient with several replicates (60%), 6 patients with one replicate (8-30%)
MS vs NHC     100
OND vs NHC    OND: 97.2                          4 patients with one replicate (7-11%)
              NHC: 98.4                          1 patient with all replicates, 1 patient with one replicate (20%)

ᵃ Percentage of all replicated samples. ᵇ Percentage of falsely classified replicates in parentheses.

spectral profiling. MALDI-TOF mass spectral data from CSF samples were analyzed to reveal disease signatures that separate MS patients from patients with OND and NHC. Spectral profiles were analyzed using PCA, PLS-DA and TP modeling. SR plot and DIVA test were used to select a small number of m/z regions with good discriminatory power for MS versus OND, MS versus NHC, and OND versus NHC subset classification. These are potential biomarker signatures that correlate with differences in the protein expression portrayed by the low MW fraction of CSF for the MS, OND, and NHC groups and can be used per se as a tool for disease classification. From these signatures, a few m/z regions were the same for more than one subset classification and these are the most interesting biomarker candidates. Further analysis is necessary for the identification of the candidate biomarkers. The unique multivariate disease pattern revealed by analytical approaches like the one presented here is expected to be useful also for detecting new groups resulting from small but correlated changes in the profiles. Such observations may give important clues to the disease mechanism and pathogenesis. The most discriminating m/z regions do not necessarily represent complete peaks. Actually, in most cases, they represent fractions of peaks. Thus, visual inspection of the pretreated profiles revealed in most cases overlap between components in the vicinity of discriminating m/z regions. This is due to the huge number of proteins in the low MW protein fraction, which must ultimately lead to overlapping m/z regions for many components using MALDI-TOF and similar screening techniques. This may create ambiguities when using conventional methods for peak detection and matching of peaks between samples, leading to loss of information and discriminatory ability. Multivariate resolution techniques54 can be used to separate discriminating overlapping peaks, but these procedures are time-consuming and not easy to automate. On the other hand, the use of SR plot and DIVA test on spectral profiles provides the backbone of a flexible, semiautomatic, and objective procedure for variable selection where both narrow and broad discriminatory m/z regions are located without the need for the assumption that variables must represent whole single separated peaks. These regions are combined into disease patterns with the best possible performance with respect to separating groups from each other. Furthermore, these disease patterns provide the most promising m/z regions for biomarker discovery in the investigated fraction. These features of our approach represent an advantage over more conventional analytical methods. Although only used on MALDI-TOF spectral profiles in the present work, our data-analytical approach is much more general. 
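How an SR limit turns a profile into narrow and broad signed regions can be sketched as below. Several parts are assumptions and should be read as such: the SR is taken as the ratio of explained to residual variance of each variable on the TP component (following the definition in refs 49 and 50), the regression vector comes from a one-component PLS model (for which it is proportional to Xᵀy), the limit of 1.0 is arbitrary rather than derived from a DIVA test, and the data, m/z axis, and function names are invented for illustration.

```python
import math

def center_columns(X):
    """Subtract the column mean from every entry of X (a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def signed_selectivity_ratio(X, b):
    """Signed SR per variable from the target projection of X onto b."""
    Xc = center_columns(X)
    b_norm = math.sqrt(sum(v * v for v in b))
    w = [v / b_norm for v in b]                                # normalised regression vector
    t = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]   # TP scores
    tt = sum(v * v for v in t)
    p = [sum(row[j] * ti for row, ti in zip(Xc, t)) / tt for j in range(len(b))]  # TP loadings
    sr = []
    for j, pj in enumerate(p):
        explained = sum((ti * pj) ** 2 for ti in t)
        residual = sum((row[j] - ti * pj) ** 2 for row, ti in zip(Xc, t))
        ratio = explained / residual if residual > 0 else math.inf
        sr.append(math.copysign(ratio, pj))  # attach the sign of the TP loading
    return sr

def contiguous_regions(mz, signed_sr, limit):
    """Group neighbouring variables with |SR| >= limit and equal sign."""
    regions, current = [], None  # current = [start_mz, end_mz, sign]
    for x, value in zip(mz, signed_sr):
        sign = 1 if value > 0 else -1
        if abs(value) >= limit and current is not None and current[2] == sign:
            current[1] = x                      # extend the open region
            continue
        if current is not None:
            regions.append(tuple(current))      # close the open region
        current = [x, x, sign] if abs(value) >= limit else None
    if current is not None:
        regions.append(tuple(current))
    return regions

# Invented data: variables 1 and 2 increase and variable 4 decreases in group 1.
group = [0, 0, 0, 0, 1, 1, 1, 1]
a = [1, -1, 1, -1, 1, -1, 1, -1]
c = [1, 1, -1, -1, 1, 1, -1, -1]
d = [1, -1, -1, 1, 1, -1, -1, 1]
X = [[5 + 0.5 * a[i], 5 + 2 * group[i] + 0.1 * d[i], 5 + 2 * group[i] - 0.1 * d[i],
      5 + 0.5 * c[i], 5 - 2 * group[i] + 0.1 * c[i]] for i in range(8)]
mz = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0]  # hypothetical m/z axis

# One-component PLS: the regression vector is proportional to X^T y (centred).
y_mean = sum(group) / len(group)
b = [sum(row[j] * (g - y_mean) for row, g in zip(center_columns(X), group))
     for j in range(len(mz))]
sr = signed_selectivity_ratio(X, b)
regions = contiguous_regions(mz, sr, limit=1.0)
print(regions)
```

The two neighbouring variables that increase in group 1 are merged into one positive region, and the single decreasing variable forms a narrow negative region, without any peak detection step.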
The generalization to surface-enhanced laser desorption/ionization (SELDI) mass spectrometry and


Figure 10. Scores on the first principal component for the detected biomarker signature in MS vs OND. MS (red), OND (blue).

nuclear magnetic resonance (NMR) spectroscopy data is obvious, but by using the technique of unfolding,55 also data from hyphenated instruments like liquid chromatography-mass spectrometry (LC-MS) can be analyzed by our approach. Several chemometric methods have been tested for disease classification and biomarker detection in spectral profiles. SR plot and DIVA test can be used in combination with these methods. Nørgaard et al. classified breast cancer samples using extended canonical variates analysis (ECVA).56 Like TP, ECVA uses PLS as an intermediate step to find a single discriminatory vector for variable selection. The classifier in ECVA is determined using the difference between the mean spectra for each group as the response in regression, while TP uses the group membership as the response. Thus, one could also build our DIVA and SR approach on top of ECVA for binary classification problems. The same of course applies to orthogonal partial least-squares (O-PLS) regression, which has been used, for example, for detecting biomarkers predictive for clinical outcome of acute myeloid leukemia.53 As mentioned earlier (section 2.4.3), the TP component is identical to the predictive O-PLS component when the same number of PLS components is used. Imre et al. classified cancer patients and healthy individuals using linear discriminant analysis (LDA) for MALDI peak data.57 LDA outperforms PLS when applicable, that is, when the number of samples is larger than the number of variables. 
However, when the data are highly collinear, which is always the case when using full spectral profiles, dimensionality reduction is needed as provided, for example, by PLS.58 A combination of analysis of variance and principal component analysis (ANOVA-PCA) has recently been applied for biomarker profiling in MALDI measurements of amniotic fluids59 and Escherichia coli cells.60 ANOVA-PCA avoids rotation of the principal components and therefore loadings for the significant PCs can be used directly for interpretation. In proteomic analysis, the search for selective biomarkers that completely separate groups of samples has not proven very fruitful. We fully support the view of Stoop et al.29 who state that “Protein differences in patient comparison studies are virtually never black-and-white phenomena. Thus, compounds will mostly still be present in a non-diseased state but in changed concentrations.” At the onset of a disease, many


components are probably in the gray zone, but a combination of several regions may still add up to a separation with very high statistical significance due to high multivariate selectivity. The separation ability may be strong despite the fact that differences in intensities between the groups of samples are small and monitoring just one peak at a time shows almost no discrimination. Changes in the correlation pattern among many moderately discriminating components act as an amplifier. Therefore, it should be kept in mind that selected regions do not act in isolation from each other: separation between groups occurs as a result of peaks that are changing in a systematic manner at the same time.
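The amplifying effect of correlated changes can be made concrete with a deliberately simple sketch. The numbers are invented and not taken from the study: two variables share a large common variation and each carries only a small group shift in opposite directions. Separation is measured with the fraction of correctly ordered control/patient pairs (the Mann-Whitney U statistic divided by the number of pairs); the helper name `auc` is an assumption.

```python
def auc(group_a, group_b):
    """Fraction of (a, b) pairs with b > a, counting ties as one half."""
    wins = sum((b > a) + 0.5 * (b == a) for a in group_a for b in group_b)
    return wins / (len(group_a) * len(group_b))

shared = [-2, -1, 0, 1, 2]  # large variation common to both variables
shift = 0.3                 # small, opposite group effect on each variable

controls = [(s, s) for s in shared]
patients = [(s + shift, s - shift) for s in shared]

auc_x1 = auc([c[0] for c in controls], [p[0] for p in patients])
auc_x2 = auc([c[1] for c in controls], [p[1] for p in patients])
auc_combined = auc([c[0] - c[1] for c in controls],
                   [p[0] - p[1] for p in patients])

print(f"variable 1 alone: {auc_x1:.2f}")        # 0.60
print(f"variable 2 alone: {auc_x2:.2f}")        # 0.40
print(f"difference x1 - x2: {auc_combined:.2f}")  # 1.00
```

Each variable on its own is close to chance (0.5), yet their difference removes the shared variation and separates the two groups completely, which is the behavior the paragraph above attributes to multivariate selectivity.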

Acknowledgment. This work was supported by grants from the aid of EXTRA funds from the Norwegian Foundation for Health and Rehabilitation, the Kjell Almes Legacy, the Bergen MS Society, the Meltzer foundation, and the Norwegian MS Society. The authors are partly supported by the National Programme for Research in Functional Genomics (FUGE) funded by the Norwegian Research Council, and the Western Norway Regional Health Authority. Pattern Recognition Systems AS is thanked for partial financing of R. Arneberg and for providing the Sirius software free of charge. Prof. Rune J. Ulvik (Institute of Medicine, University of Bergen) is thanked for administrative support. Magnus Berle, Bjarte Askeland and the unit for orthopedic anesthesia, Haukeland University Hospital, are acknowledged for collection and inclusion of healthy orthopedic patients in the control CSF biobank.

References
(1) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198–207.
(2) Hu, S.; Loo, J. A.; Wong, D. T. Human body fluid proteome analysis. Proteomics 2006, 6 (23), 6326–6353.
(3) Villanueva, J.; Philip, J.; Entenberg, D.; Chaparro, C. A.; Tanwar, M. K.; Holland, E. C.; Tempst, P. Serum peptide profiling by magnetic particle-assisted, automated sample processing and MALDI-TOF mass spectrometry. Anal. Chem. 2004, 76 (6), 1560–1570.
(4) Villanueva, J.; Shaffer, D. R.; Philip, J.; Chaparro, C. A.; Erdjument-Bromage, H.; Olshen, A. B.; Fleisher, M.; Lilja, H.; Brogi, E.; Boyd, J.; Sanchez-Carbayo, M.; Holland, E. C.; Cordon-Cardo, C.; Scher, H. I.; Tempst, P. Differential exoprotease activities confer tumor-specific serum peptidome patterns. J. Clin. Invest. 2006, 116 (1), 271–284.
(5) de Noo, M. E.; Mertens, B. J.; Ozalp, A.; Bladergroen, M. R.; van der Werff, M. P.; van de Velde, C. J.; Deelder, A. M.; Tollenaar, R. A. Detection of colorectal cancer using MALDI-TOF serum protein profiling. Eur. J. Cancer 2006, 42 (8), 1068–1076.
(6) Petricoin, E. F.; Ardekani, A. M.; Hitt, B. A.; Levine, P. J.; Fusaro, V. A.; Steinberg, S. M.; Mills, G. B.; Simone, C.; Fishman, D. A.; Kohn, E. C.; Liotta, L. A. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359 (9306), 572–7.
(7) Smit, S.; Hoefsloot, H. C. J.; Smilde, A. K. Statistical data processing in clinical proteomics. J. Chromatogr., B 2008, 866 (1-2), 77–88.
(8) Spiegelman, C. H.; Pfeiffer, R.; Gail, M. Using chemometrics and statistics to improve proteomics biomarker discovery. J. Proteome Res. 2006, 5 (3), 461–462.
(9) McDonald, R. A.; Skipp, P.; Bennell, J.; Potts, C.; Thomas, L.; O’Connor, C. D. Mining whole-sample mass spectrometry proteomics data for biomarkers - An overview. Expert Syst. Appl. 2009, 36 (3), 5333–5340.
(10) Eriksson, L.; Antti, H.; Gottfries, J.; Holmes, E.; Johansson, E.; Lindgren, F.; Long, I.; Lundstedt, T.; Trygg, J.; Wold, S. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Anal. Bioanal. Chem. 2004, 380 (3), 419–429.
(11) Lee, K. R.; Lin, X.; Park, D. C.; Eslava, S. Megavariate data analysis of mass spectrometric proteomics data using latent variable projection method. Proteomics 2003, 3 (9), 1680–1686.
(12) Lindon, J. C.; Nicholson, J. K. Spectroscopic and statistical techniques for information recovery in metabonomics and metabolomics. Annu. Rev. Anal. Chem. 2008, 1, 45–69.
(13) Hendriks, M.; Smit, S.; Akkermans, W.; Reijmers, T. H.; Eilers, P. H. C.; Hoefsloot, H. C. J.; Rubingh, C. M.; de Koster, C. G.; Aerts, J. M.; Smilde, A. K. How to distinguish healthy from diseased? Classification strategy for mass spectrometry-based clinical proteomics. Proteomics 2007, 7 (20), 3672–3680.
(14) Compston, A.; Coles, A. Multiple sclerosis. Lancet 2002, 359 (9313), 1221–1231.
(15) Polman, C. H.; Reingold, S. C.; Edan, G.; Filippi, M.; Hartung, H. P.; Kappos, L.; Lublin, F. D.; Metz, L. M.; McFarland, H. F.; O’Connor, P. W.; Sandberg-Wollheim, M.; Thompson, A. J.; Weinshenker, B. G.; Wolinsky, J. S. Diagnostic criteria for multiple sclerosis: 2005 Revisions to the “McDonald Criteria”. Ann. Neurol. 2005, 58 (6), 840–846.
(16) Stueve, O.; Bennett, J. L.; Hemmer, B.; Wiendl, H.; Racke, M. K.; Bar-Or, A.; Hu, W.; Zivadinov, R.; Weber, M. S.; Zamvil, S. S.; Pacheco, M. F.; Menge, T.; Hartung, H. P.; Kieseier, B. C.; Frohman, E. M. Pharmacological treatment of early multiple sclerosis. Drugs 2008, 68 (1), 73–83.
(17) Szczucinski, A.; Losy, J. Chemokines and chemokine receptors in multiple sclerosis. Potential targets for new therapies. Acta Neurol. Scand. 2007, 115 (3), 137–146.
(18) Carrette, O.; Burkhard, P. R.; Hughes, S.; Hochstrasser, D. F.; Sanchez, J. C. Truncated cystatin C in cerebrospinal fluid: Technical artefact or biological process. Proteomics 2005, 5 (12), 3060–3065.
(19) Jurewicz, A.; Matysiak, M.; Raine, C. S.; Selmaj, K. Soluble Nogo-A, an inhibitor of axonal regeneration, as a biomarker for multiple sclerosis. Neurology 2007, 68 (4), 283–287.
(20) Lindsey, J. W.; Crawford, M. P.; Hatfield, L. M. Soluble Nogo-A in CSF is not a useful biomarker for multiple sclerosis. Neurology 2008, 71 (1), 35–37.
(21) Hein, K.; Kohler, A.; Diem, R.; Sattler, M. B.; Demmer, I.; Lange, P.; Bahr, M.; Otto, M. Biological markers for axonal degeneration in CSF and blood of patients with the first event indicative for multiple sclerosis. Neurosci. Lett. 2008, 436 (1), 72–76.
(22) Lehmensiek, V.; Suessmuth, S. D.; Touscher, G.; Brettschneider, J.; Felk, S.; Gillardon, F.; Tumani, H. Cerebrospinal fluid proteome profile in multiple sclerosis. Mult. Scler. 2007, 13 (7), 840–849.
(23) Bielekova, B.; Martin, R. Development of biomarkers in multiple sclerosis. Brain 2004, 127, 1463–1478.
(24) Ingram, G.; Hakobyan, S.; Robertson, N. P.; Morgan, B. P. Complement in multiple sclerosis: its role in disease and potential as a biomarker. Clin. Exp. Immunol. 2009, 155 (2), 128–139.
(25) Tumani, H.; Hartung, H. P.; Hemmer, B.; Teunissen, C.; Deisenhammer, F.; Giovannoni, G.; Zettl, U. K.; Bio, M. S. S. G. Cerebrospinal fluid biomarkers in multiple sclerosis. Neurobiol. Dis. 2009, 35 (2), 117–127.
(26) Zhang, J. Proteomics of human cerebrospinal fluid - the good, the bad, and the ugly. Proteomics: Clin. Appl. 2007, 1 (8), 805–819.

(27) Roche, S.; Gabelle, A.; Lehmann, S. Clinical proteomics of the cerebrospinal fluid: Towards the discovery of new biomarkers. Proteomics: Clin. Appl. 2008, 2 (3), 428–436.
(28) Stoop, M. P.; Dekker, L. J.; Titulaer, M. K.; Burgers, P. C.; Sillevis Smitt, P. A.; Luider, T. M.; Hintzen, R. Q. Multiple sclerosis-related proteins identified in cerebrospinal fluid by advanced mass spectrometry. Proteomics 2008, 8 (8), 1576–1585.
(29) Stoop, M. P.; Dekker, L. J.; Titulaer, M. K.; Lamers, R. J. A. N.; Burgers, P. C.; Smitt, P. A. E. S.; van Gool, A. J.; Luider, T. M.; Hintzen, R. Q. Quantitative matrix-assisted laser desorption ionization-Fourier transform ion cyclotron resonance (MALDI-FTICR) peptide profiling and identification of multiple-sclerosis-related proteins. J. Proteome Res. 2009, 8 (3), 1404–1414.
(30) Hillenkamp, F.; Karas, M.; Beavis, R. C.; Chait, B. T. Matrix-assisted laser desorption ionization mass-spectrometry of biopolymers. Anal. Chem. 1991, 63 (24), A1193–A1202.
(31) Tirumalai, R. S.; Chan, K. C.; Prieto, D. A.; Issaq, H. J.; Conrads, T. P.; Veenstra, T. D. Characterization of the low molecular weight human serum proteome. Mol. Cell. Proteomics 2003, 2 (10), 1096–1103.
(32) Berven, F. S.; Kroksveen, A. C.; Berle, M.; Rajalahti, T.; Flikka, K.; Arneberg, R.; Myhr, K. M.; Vedeler, C.; Kvalheim, O. M.; Ulvik, R. J. Pre-analytical influence on the low molecular weight cerebrospinal fluid proteome. Proteomics: Clin. Appl. 2007, 1 (7), 699–711.
(33) Arneberg, R.; Rajalahti, T.; Flikka, K.; Berven, F. S.; Kroksveen, A. C.; Berle, M.; Myhr, K. M.; Vedeler, C. A.; Ulvik, R. J.; Kvalheim, O. M. Pretreatment of mass spectral profiles: application to proteomic data. Anal. Chem. 2007, 79 (18), 7014–7026.
(34) Wong, J. W. H.; Cagney, G.; Cartwright, H. M. SpecAlign - processing and alignment of mass spectra datasets. Bioinformatics 2005, 21 (9), 2088–2090.
(35) Wong, J. W. H.; Durante, C.; Cartwright, H. M. Application of fast Fourier transform cross-correlation for the alignment of large chromatographic and spectral datasets. Anal. Chem. 2005, 77 (17), 5655–5661.
(36) Kvalheim, O. M.; Brakstad, F.; Liang, Y. Z. Preprocessing of analytical profiles in the presence of homoscedastic or heteroscedastic noise. Anal. Chem. 1994, 66 (1), 43–51.
(37) Jackson, J. E. A Users’ Guide to Principal Components; Wiley: New York, 1991.
(38) Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2 (1-3), 37–52.
(39) Sjöström, M.; Wold, S.; Söderström, B. PLS discriminant plots. In Pattern Recognition in Practice II; Elsevier Science Publ. B. V.: Holland, 1986; pp 461-470.
(40) Geladi, P.; Kowalski, B. R. Partial least-squares regression - a tutorial. Anal. Chim. Acta 1986, 185, 1–17.
(41) Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. J. The collinearity problem in linear-regression - the partial least-squares (PLS) approach to generalized inverses. SIAM J. Sci. Stat. Comput. 1984, 5 (3), 735–743.
(42) Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58 (2), 109–130.
(43) Stone, M. Cross-validatory choice and assessment of statistical predictions. J. R. Stat. Soc. Ser. B 1974, 36 (2), 111–147.
(44) Filzmoser, P.; Liebmann, B.; Varmuza, K. Repeated double cross validation. J. Chemom. 2009, 23 (3-4), 160–171.
(45) Kvalheim, O. M.; Karstang, T. V. Interpretation of latent-variable regression models. Chemom. Intell. Lab. Syst. 1989, 7 (1-2), 39–51.
(46) Kvalheim, O. M. Interpretation of partial least squares regression models by means of target projection and selectivity ratio plots. J. Chemom. [Online early access], DOI: 10.1002/cem.1289. Published online: Feb 17, 2010.
(47) Trygg, J.; Wold, S. Orthogonal projections to latent structures (OPLS). J. Chemom. 2002, 16 (3), 119–128.
(48) Teul, J.; Ruperez, F. J.; Garcia, A.; Vaysse, J.; Balayssac, S.; Gilard, V.; Malet-Martino, M.; Martin-Ventura, J. L.; Blanco-Colio, L. M.; Tunon, J.; Egido, J.; Barbas, C. Improving metabolite knowledge in stable atherosclerosis patients by association and correlation of GC-MS and H-1 NMR fingerprints. J. Proteome Res. 2009, 8 (12), 5580–5589.
(49) Rajalahti, T.; Arneberg, R.; Kroksveen, A. C.; Berle, M.; Myhr, K. M.; Kvalheim, O. M. Discriminating variable test and selectivity ratio plot: quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles. Anal. Chem. 2009, 81 (7), 2581–2590.
(50) Rajalahti, T.; Arneberg, R.; Berven, F. S.; Myhr, K. M.; Ulvik, R. J.; Kvalheim, O. M. Biomarker discovery in mass spectral profiles by means of selectivity ratio plot. Chemom. Intell. Lab. Syst. 2009, 95 (1), 35–48.


(51) Chau, F. T.; Chan, H. Y.; Cheung, C. Y.; Xu, C. J.; Liang, Y.; Kvalheim, O. M. Recipe for uncovering the bioactive components in herbal medicine. Anal. Chem. 2009, 81 (17), 7217–7225.
(52) Liland, K. H.; Mevik, B.-H.; Rukke, E.-O.; Almøy, T.; Skaugen, M.; Isaksson, T. Quantitative whole spectrum analysis with MALDI-TOF MS, Part I: measurement optimization. Chemom. Intell. Lab. Syst. 2009, 96 (2), 210–218.
(53) Forshed, J.; Pernemalm, M.; Tan, C. S.; Lindberg, M.; Kanter, L.; Pawitan, Y.; Lewensohn, R.; Stenke, L.; Lehtio, J. Proteomic data analysis workflow for discovery of candidate biomarker peaks predictive of clinical outcome for patients with acute myeloid leukemia. J. Proteome Res. 2008, 7 (6), 2332–2341.
(54) de Juan, A.; Tauler, R. Multivariate curve resolution (MCR) from 2000: Progress in concepts and applications. Crit. Rev. Anal. Chem. 2006, 36 (3-4), 163–176.
(55) Rajalahti, T.; Huang, F.; Klement, M. R.; Pisareva, T.; Edman, M.; Sjostrom, M.; Wieslander, A.; Norling, B. Proteins in different Synechocystis compartments have distinguishing N-terminal features: A combined proteomics and multivariate sequence analysis. J. Proteome Res. 2007, 6 (7), 2420–2434.


(56) Nørgaard, L.; Soletormos, G.; Harrit, N.; Albrechtsen, M.; Olsen, O.; Nielsen, D.; Kampmann, K.; Bro, R. Fluorescence spectroscopy and chemometrics for classification of breast cancer samples - a feasibility study using extended canonical variates analysis. J. Chemom. 2007, 21 (10-11), 451–458.
(57) Imre, T.; Kremmer, T.; Heberger, K.; Molnar-Szollosi, E.; Ludanyi, K.; Pocsfalvi, G.; Malorni, A.; Drahos, L.; Vekey, K. Mass spectrometric and linear discriminant analysis of N-glycans of human serum alpha-1-acid glycoprotein in cancer patients and healthy individuals. J. Proteomics 2008, 71 (2), 186–197.
(58) Barker, M.; Rayens, W. Partial least squares for discrimination. J. Chemom. 2003, 17 (3), 166–173.
(59) Harrington, P. D.; Vieira, N. E.; Espinoza, J.; Nien, J. K.; Romero, R.; Yergey, A. L. Analysis of variance-principal component analysis: A soft tool for proteomic discovery. Anal. Chim. Acta 2005, 544 (1-2), 118–127.
(60) Chen, P.; Lu, Y.; Harrington, P. B. Biomarker profiling and reproducibility study of MALDI-MS measurements of Escherichia coli by analysis of variance-principal component analysis. Anal. Chem. 2008, 80 (5), 1474–1481.
