Combination of Statistical Approaches for Analysis of 2-DE Data Gives Complementary Results

Harald Grove,†,‡ Bo M. Jørgensen,§ Flemming Jessen,§ Ib Søndergaard,| Susanne Jacobsen,⊥ Kristin Hollung,† Ulf Indahl,# and Ellen M. Færgestad*,†

Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, P.O. Box 5003, N-1431 Ås, Norway; Nofima Food, Osloveien 1, N-1430 Ås, Norway; National Institute of Aquatic Resources, Department of Seafood Research, Technical University of Denmark, Building 221, DK-2800 Kgs. Lyngby, Denmark; Department of Systems Biology, Center for Microbial Biotechnology, Technical University of Denmark, Building 221, DK-2800 Kgs. Lyngby, Denmark; Department of Systems Biology, Enzyme and Protein Chemistry, Technical University of Denmark, Building 224, DK-2800 Kgs. Lyngby, Denmark; and Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, P.O. Box 5003, N-1431 Ås, Norway

Received June 10, 2008

* To whom correspondence should be addressed. E-mail: [email protected]; phone: +47 64970107; fax: +47 64970333.
† Nofima Food. ‡ Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences. § Department of Seafood Research, Technical University of Denmark. | Department of Systems Biology, Technical University of Denmark. ⊥ Department of Systems Biology, Enzyme and Protein Chemistry, Technical University of Denmark. # Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences.

Five methods for finding significant changes in proteome data have been used to analyze a two-dimensional gel electrophoresis data set. We used both univariate (ANOVA) and multivariate (Partial Least Squares with jackknife, Cross Model Validation, Power-PLS, and CovProc) methods. The gels were taken from a time-series experiment exploring the changes in metabolic enzymes in bovine muscle at five time points after slaughter. The data set consisted of 1377 protein spots, and for each analysis, the data set was preprocessed to fit the requirements of the chosen method. Each analysis produced a list of proteins found to be significantly changed according to the experimental design. Although the number of selected variables varied between the methods, we found that this depended on the specific aim of each method. CovProc and P-PLS focused more on finding the minimum subset of proteins necessary to explain properties of the samples, and these methods ended up with fewer selected proteins. There was also a correlation between level of significance and frequency of selection for the selected proteins.

Keywords: Variable selection • 2-DE • multivariate methods • false discovery rate

Introduction

Proteomics is a technique used for analyzing large amounts of protein data to find biologically relevant changes. Two-dimensional gel electrophoresis (2-DE) is one commonly used method in this regard. The number of visualized proteins on a 2-DE gel ranges from several hundred up to a few thousand. To efficiently analyze the amount of data generated by 2-DE, there is a need for reliable analysis tools, both for the image analysis and for the subsequent data analysis and selection of significantly changed proteins (variable selection). For the variable selection, there exists a wide variety of approaches that are based on different assumptions and also place different requirements on the data.

The variable selection methods aim either to detect significant variables one by one in a univariate approach or to detect the significant variables by including the information of all the variables in a multivariate approach.1 Both univariate and multivariate analysis methods have been used to analyze proteomics data, and while both can be used to select significantly changed variables, the multivariate approach has the added benefit of giving information about the relationship between both samples and variables. Choosing which method to use is not trivial, and different techniques may report different variables as significant for an experiment.2,3 A review of different methods can be found in Smit et al.4

One problematic aspect of proteomic data is the large number of variables included in the analysis. A conclusion that a protein volume has changed significantly is based on the probability of observing that change, and there will always be a chance that a reported change is due to natural variation; such findings are called false positives. The commonly chosen level of significance is 5%, meaning that any result with less than a 5% chance of arising from natural variation is reported as significant. With many variables analyzed, this translates into the expectation that about 5% of the truly unchanged variables will nevertheless be reported as significant. To reduce this problem, algorithms have been devised to adjust the p-values based on the total number of variables tested. An overview of multiple hypothesis testing can be found in Dudoit et al.5 Other methods used in proteomic studies for calculating false discovery rates are the calculation of q-values6 and rotation testing.7 A comparison of different methods for controlling the false discovery rate (FDR) in microarray data was presented by Qian et al.8 Both the q-value calculation and the rotation testing aim to find how many of the selected variables are false positives, as opposed to the expected number of false positives based on all the analyzed variables. When these methods are used, significance can be reported as the expected number of false positives among the selected proteins. In recent years, the false discovery rate has been used to assess the significance of reported results in 2-DE experiments.9,10

Even when a variable is reported as statistically significant, there is also a biological aspect. In some experiments, differences are only reported if there is at least a twofold increase in protein amount. This limitation is based on biological considerations, as differences below this value are considered too small to have any relevant effect. It is also possible that, even though the difference between two levels of a protein is statistically significant, the magnitude of the change might still not be of any relevance.

The purpose of this paper was to apply both univariate and multivariate approaches to a data set and compare the selected variables from each method. While many different methods exist for selecting significantly changed variables, we have chosen to look at a selected few. Similarly, there are several options for preprocessing of spot volume data; in this paper, we have let the preprocessing be a part of each analysis method. The problem of reporting false positives is also discussed, as is the difference in results between a screening approach and marker-selection approaches.
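The rotation test used later in this work is not reproduced here, but the general idea of adjusting p-values for the number of tests can be illustrated with the widely used Benjamini-Hochberg step-up procedure. The short Python sketch below is only an illustration of multiplicity adjustment, not the rotation-test procedure applied in this study, and the p-values are simulated stand-ins:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up FDR adjustment of a vector of raw p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    adj = p[order] * n / np.arange(1, n + 1)        # p_(i) * n / i
    adj = np.minimum.accumulate(adj[::-1])[::-1]    # enforce monotonicity from the largest p
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out

# With 1377 spots tested at an unadjusted 5% level, about 0.05 * 1377 ≈ 69
# truly unchanged spots are expected to be flagged by chance alone.
rng = np.random.default_rng(0)
p_raw = rng.uniform(size=1377)                      # stand-in p-values under the null
print("unadjusted hits:", (p_raw < 0.05).sum())     # around 69
print("BH-adjusted hits at 5% FDR:", (benjamini_hochberg(p_raw) < 0.05).sum())  # usually 0
```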

Materials and Methods

Samples. The gel images used for this paper and the biological background for the samples have been described previously.11 Briefly, meat samples were taken from seven animals at five time points after slaughter (1, 2, 3, 6, and 10 h). Water-soluble proteins were extracted, focused on IPG 4-7 strips, and separated on 12.5% SDS-PAGE in the second dimension. All 2-DE gels with samples from the same animal (five time points) were run at the same time, which means that differences between animals were confounded with differences between gel runs. Because of this, differences between animals were not investigated. The gels were silver stained and scanned at 240 dpi using a regular office scanner.

Image Analysis. Image analysis was performed with Progenesis SameSpots v1.0 (Nonlinear Dynamics). All gels were aligned to a common reference gel, chosen as the gel with the least smearing and the most clearly focused spots. Spot detection and matching were done automatically after alignment, copying the spot shapes from the reference gel to all the other gels, so that the same spot boundaries were applied for all gels. The problem of missing values is thereby avoided, because low-abundance spots are still measured as long as a spot has been found in that position on other gels. The spot volumes were calculated after background subtraction and reported as a percentage of the total protein volume detected on the gel.

Data Analysis. The data (spot volumes) from the image analysis constituted the X-matrix, with the protein variables as columns and muscle sample values as rows.

Preprocessing and data analysis are specified below for each analysis method. The variable indicating time after slaughter was changed to an equidistant variable, replacing 1, 2, 3, 6, and 10 h with 0, 1, 2, 3, and 4, and was used as the y-variable (see Discussion below). Data analysis was performed using The Unscrambler v9.6 (Camo, Oslo, Norway), Matlab v7.2 (The MathWorks, MA), and 50-50 MANOVA (http://www.langsrud.com/stat/ffmanova.htm).

Choice of Scale for Storage Time. For the storage time, there was a choice between several options. The motivation for changing the values of this variable was to achieve a closer to linear relationship between the protein volumes and the time variable. Depending on how the protein amounts are expected to change during storage, the time variable may be represented by the actual time, assuming a constant change in protein amounts, or by the equidistant timeline, which compresses the time scale at later time points (6 h and 10 h) when the change with time decreases. Another possibility is to use a smooth nonlinear transformation, for example, logarithms. In the present case, the result using a logarithmic transformation was almost equal to that using equidistant time. We have therefore shown only the results from using the equidistant time, although we realize that a complete biological study would preferably have included more options.

Univariate ANOVA with Rotation Testing. A univariate regression approach was performed using a two-way ANOVA model, with time and batch as the two factors, on one variable at a time. The reported significance levels (p-values) were corrected for the number of tests performed by using the rotation test procedure.7 For this analysis, the data were preprocessed by a Box-Cox transformation (x_new = (x^λ - 1)/λ) to correct for the dependency between the mean and the variance of each variable. This transformation has been reported to be suitable for 2-DE data, since the log-transformation tends to have problems achieving a linear result at lower protein volumes.12 The parameter λ was set to minimize the correlation between the mean and the variance of the variables.

PLS Regression with Jackknife. Partial Least Squares (PLS) regression with jackknife estimation of significant regression coefficients was calculated using seven-segment cross-validation with mean-centering at each validation step. The cross-validation segments consisted of samples from a single batch (five time points). Scaling of the data was done once, using a method called group-scaling.13 The variables we were looking for would be those with large variation across the time series. To avoid scaling down these variables and scaling up those with little variation, the group-scaling method calculates a weight based on the variation between the animals while keeping out the variation between time points. The weight for each variable is calculated from the standard deviation of samples measured at the same time point; here, five standard deviations were calculated for each variable, and the final weight was the average of these standard deviations. The level of significance for each variable was based on the stability of the estimated regression coefficients. After the spots with significant regression coefficients had been identified, a new PLS regression with jackknife was performed on a matrix containing these variables only. The procedure was repeated until convergence, defined as the point where all coefficients were significant.
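As an illustration of the group scaling and the iterative jackknife-based selection described above, the following Python sketch (using scikit-learn rather than the Unscrambler/Matlab tools listed above; the 2·SE stability rule is a rough stand-in for the Martens jackknife test, and the data are simulated) scales each spot by the average within-time-point standard deviation and then repeats PLS with leave-one-animal-out jackknifing, keeping only variables with stable coefficients, until the selected set no longer changes:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def group_scale(X, time_labels):
    """Divide each spot by the average within-time-point standard deviation,
    so that variables changing consistently over time are not scaled down."""
    sds = np.array([X[time_labels == t].std(axis=0, ddof=1)
                    for t in np.unique(time_labels)])
    w = sds.mean(axis=0)
    w[w == 0] = 1.0                                   # guard against constant variables
    return X / w

def jackknife_pls_selection(X, y, animal, n_components=2, max_iter=20):
    """Repeat PLS with leave-one-animal-out jackknifed coefficients,
    keeping only variables with stable coefficients, until convergence."""
    keep = np.arange(X.shape[1])
    for _ in range(max_iter):
        k = min(n_components, keep.size)
        coefs = []
        for a in np.unique(animal):
            train = animal != a                       # leave out one animal (one segment)
            m = PLSRegression(n_components=k, scale=False)  # PLS mean-centers at each fit
            m.fit(X[train][:, keep], y[train])
            coefs.append(np.ravel(m.coef_))
        coefs = np.array(coefs)
        mean = coefs.mean(axis=0)
        se = coefs.std(axis=0, ddof=1) / np.sqrt(len(coefs))
        stable = np.abs(mean) > 2.0 * se              # rough 5% stability criterion
        keep = keep[stable]
        if stable.all() or keep.size == 0:
            break
    return keep

# Simulated stand-in for the 35 x 1377 spot-volume matrix (7 animals x 5 time points)
rng = np.random.default_rng(2)
animal = np.repeat(np.arange(7), 5)
time = np.tile([0, 1, 2, 3, 4], 7).astype(float)      # equidistant coding of 1, 2, 3, 6, 10 h
X = rng.normal(size=(35, 300))
X[:, :5] += np.outer(time, np.ones(5))                # five spots increase with time
Xs = group_scale(X, time)
selected = jackknife_pls_selection(Xs, time, animal)
print(len(selected), "spots with stable coefficients:", selected[:10])
```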
CovProc. The CovProc method has been found efficient for finding a reduced set of variables to be used in a regression analysis.14 The basic idea is to rank the variables according to a combination of the fit and the prediction. This idea is called the H-principle and suggests using the squared covariance matrix XᵀYYᵀX as a measure of the strength of the linear relationship between X and Y.15 The procedure is to perform a PLS regression and rank the variables according to their regression weights calculated with the optimal number of components. Starting with the highest ranked variable, the variables are then added one by one to the model, and the fit and variance are calculated at each step. In this way, the fit between the model and the data is increased and the model variation is decreased. The important point in using the CovProc method is that the variable set is expanded only as long as this improves the results. The iteration in the CovProc procedure is stopped when the regression coefficient falls below a given value.

Power-PLS. Power-PLS (P-PLS) is a modification of PLS regression.16 The main difference from ordinary PLS is an explicit factorization of the PLS loading-weight covariances as a product of the associated correlation and standard deviation parts. A power parameter is then introduced to control the importance of the correlation between X and y versus the column-wise standard deviation of X. The purpose of this modification is to avoid the influence of directions in X having large variability but little connection to y. By adjusting the power parameter so that the loading weights maximize the correlation between X and y, optimal variables for model building can more easily be separated from the rest. By setting the power parameter to give equal weight to the standard deviation and the correlation, P-PLS gives the same solution as an ordinary PLS model. Significant variables were selected on the basis of stable regression coefficients using a jackknife procedure with full cross-validation. A regression coefficient was considered stable if the confidence interval for its estimated value did not include zero. The model complexity was chosen as the minimum number of components giving a prediction error not significantly poorer than the minimum prediction error.16 The significance in this case was calculated according to a χ²-statistic based on the squared sample errors.

CMV. The Cross Model Validation (CMV) method is a PLS analysis with an added validation step.17,18 One sample is removed before the model is built on the rest of the samples, using ordinary PLS with jackknife and full cross-validation. The kept-out sample is then predicted using the results from the PLS analysis. This is repeated until all samples have been kept out, and the results are presented as the percentage of validation steps in which each regression coefficient is found to be significant. As with ordinary PLS with cross-validation, the choice of validation segments is important for the results. The data were transformed by the same Box-Cox equation as for the univariate ANOVA.
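The CMV selection-frequency idea can be sketched as an outer leave-one-sample-out loop around an inner jackknifed PLS. The Python outline below is schematic only (scikit-learn in place of the original software, a 2·SE rule in place of the jackknife significance test, and single samples rather than batch segments as validation units), with simulated data:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def inner_jackknife_select(X, y, n_components=2):
    """Inner step: leave-one-out jackknife of the PLS coefficients,
    flagging variables whose coefficients are stably non-zero."""
    n = X.shape[0]
    coefs = []
    for i in range(n):
        mask = np.arange(n) != i
        m = PLSRegression(n_components=n_components, scale=False)
        m.fit(X[mask], y[mask])
        coefs.append(np.ravel(m.coef_))
    coefs = np.array(coefs)
    mean = coefs.mean(axis=0)
    se = coefs.std(axis=0, ddof=1) / np.sqrt(n)
    return np.abs(mean) > 2.0 * se                 # stand-in for the jackknife t-test

def cross_model_validation(X, y, n_components=2):
    """Outer CMV loop: keep one sample out, run the inner selection on the rest,
    and count how often each variable comes out as significant."""
    n, p = X.shape
    counts = np.zeros(p)
    for i in range(n):
        keep = np.arange(n) != i                   # outer validation sample kept out
        counts += inner_jackknife_select(X[keep], y[keep], n_components)
    return 100.0 * counts / n                      # selection frequency in percent

# Simulated data: 35 samples, 100 spots, the first three tied to the time variable
rng = np.random.default_rng(3)
y = np.tile([0, 1, 2, 3, 4], 7).astype(float)
X = rng.normal(size=(35, 100))
X[:, :3] += y[:, None]
freq = cross_model_validation(X, y)
print("spots selected in every outer step:", np.where(freq == 100)[0])
```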
Significance Levels. The significance level was set individually for each method based on commonly used standards. ANOVA used a cutoff at 5% for p-values adjusted by rotation testing. The rotation testing is a way of calculating the FDR for each variable, and the FDR then gives the expected number of false positives among the selected variables. Our choice of 5% meant that we accepted that 5% of the selected variables could be false positives. PLS and P-PLS used a cutoff at 5% for the calculation of stable regression coefficients, with cross-validation used to find the regression coefficient variances. CMV presents the results as a percentage showing how often each variable is significant for building a good prediction model in each of the outer cross-validation steps.

Figure 1. Histogram showing the number of selected variables for each strategy.

Significance is calculated in the same way as for ordinary PLS. CovProc chooses variables according to their predictive value, using a ranked list of variables as the basis for selection for as long as the model is improved.
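The CovProc-type selection can likewise be sketched as a ranked forward selection. The Python outline below is schematic rather than the published COVPROC algorithm: variables are ranked by the magnitude of their full-model PLS coefficients and added one at a time for as long as cross-validated prediction of the storage-time variable improves, with the stopping rule simplified to a minimum gain in R²; the data are simulated:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_score

def covproc_like_selection(X, y, n_components=2, min_gain=1e-3):
    """Schematic ranked forward selection in the spirit of CovProc: rank variables
    by the full-model PLS coefficients, then add them one by one as long as the
    cross-validated R2 for predicting y keeps improving."""
    full = PLSRegression(n_components=n_components, scale=False).fit(X, y)
    order = np.argsort(-np.abs(np.ravel(full.coef_)))   # strongest variables first
    cv = KFold(n_splits=7)                               # contiguous blocks of five rows = one animal here
    selected, best_r2 = [], -np.inf
    for j in order:
        trial = selected + [j]
        k = min(n_components, len(trial))                # cannot use more components than variables
        r2 = cross_val_score(PLSRegression(n_components=k, scale=False),
                             X[:, trial], y, cv=cv).mean()   # default score is R2
        if r2 > best_r2 + min_gain:
            selected, best_r2 = trial, r2
        else:
            break                                        # stop once the model no longer improves
    return selected, best_r2

# Simulated example: 35 samples x 200 spots, y = equidistant storage-time coding
rng = np.random.default_rng(1)
X = rng.normal(size=(35, 200))
y = np.tile([0, 1, 2, 3, 4], 7).astype(float)
X[:, 0] += y                                             # plant one informative spot
subset, r2 = covproc_like_selection(X, y)
print(subset, round(r2, 2))
```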

Results

After the alignment and spot detection, we had a data set consisting of 1377 variables. There were a total of 276 selected variables across all five strategies. A histogram showing the number of selected variables for each strategy is shown in Figure 1. The univariate two-way ANOVA model, using time and batch (animal) as factors, selected 241 spots as significantly changed with time. The PLS regression selected 93 variables that showed stable regression coefficients at convergence; the process needed four filtering steps until all remaining variables were marked as significant. The CMV analysis selected 75 variables that had stable regression coefficients in all cross-validation segments. In total, 121 variables were marked as having stable regression coefficients in at least 25 of the 35 segments. The CovProc method selected 20 variables to give an optimal model of the time series. Because the P-PLS procedure can be fine-tuned, the number of variables it selected varied according to how much weight was put on the correlation compared to the variance. To focus on variable selection, the power parameter (γ)16 was kept close to 1; at γ = 1, the analysis would focus solely on the correlation between X and y. We performed the first analysis with γ = 0.99, which yielded one variable as significant. Removing this variable and redoing the analysis resulted in another selected variable. Continuing this procedure of removing selected variables from the data and redoing the analysis gave five variables chosen with γ = 0.99. Then, we adjusted the power parameter to γ = 0.98 and repeated the analysis procedure. Continuing until no more variables were selected as significant resulted in 31 selected variables.

The numbers of selected variables shared by each combination of the five methods are shown in Figure 2. Fourteen variables were shared among all strategies, while another 14 were shared by four of the five strategies. For each strategy, the selected variables were ranked according to the calculated significance. Comparing the individual rankings with the combined results showed that 19, 15, and 17 of the top 28 variables for ANOVA, PLS, and CMV, respectively, were present among the 28 variables from the combined results. For CovProc and P-PLS, a similar observation can be drawn from the fact that only 1 of 20 and 7 of 31 variables, respectively, were not present among the top 28 variables from the combined results.

Figure 2. Diagram of the selected variables from the five strategies. The colors correspond to the colors in Figure 1. ANOVA, bottom square (red); CMV, right square (gray); PLS with jackknife (green); P-PLS, circle (black); COVPROC, star (blue). The number 14 in the middle represents the number of variables selected by all the strategies.

Plotting the locations of the 28 common variables gave no indication that the selection of the proteins was affected by their position on the gel (Figure 3). The same result was found for variables unique to each strategy. The group-scaled data used with the PLS regression were also applied to the CMV method to compare the effect of preprocessing. The new preprocessing resulted in 127 selected variables, of which 81 were the same as in the previous CMV results. There were 62 variables shared with the PLS regression, 51 of them the same as from the first CMV analysis.
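The overlap counts summarized in Figure 2 can be derived directly from the five selection lists. The snippet below uses small hypothetical spot-ID sets (the actual lists are not reproduced here) to show how the shared and method-specific counts are obtained:

```python
from itertools import combinations

# Hypothetical spot-ID selections standing in for the five lists of selected variables
selections = {
    "ANOVA":   {1, 2, 3, 4, 5, 6, 7, 8},
    "PLS":     {1, 2, 3, 4, 5, 9},
    "CMV":     {1, 2, 3, 4, 6, 10},
    "CovProc": {1, 2, 3, 11},
    "P-PLS":   {1, 2, 4, 12},
}

# Union of all selected spots (276 in this study) and the core shared by all five methods
all_selected = set().union(*selections.values())
shared_by_all = set.intersection(*selections.values())
print(len(all_selected), "selected in total;", len(shared_by_all), "shared by all five")

# Count how many methods selected each spot, e.g. to find spots picked by four or more
counts = {spot: sum(spot in s for s in selections.values()) for spot in all_selected}
print("selected by at least four methods:",
      sorted(spot for spot, c in counts.items() if c >= 4))

# Pairwise overlaps of the kind summarized in Figure 2
for (a, sa), (b, sb) in combinations(selections.items(), 2):
    print(f"{a} and {b} share {len(sa & sb)} spots")
```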

Discussion

Preprocessing of Variables. Preprocessing of the data was performed separately for each method. For multivariate analysis, variables are usually scaled prior to data analysis when certain properties, for example, a difference in absolute magnitude, would otherwise undesirably influence the results. Since the procedure depends on the properties of the data, in addition to the properties of the method of analysis, scaling should be decided separately for each case. It may also be useful to vary the scaling to find different subsets of variables.13 The data used for PLS regression with jackknife were preprocessed by an alternative method of scaling, in which each variable was divided by the average of the variance within samples at each time point. When the scaling factor is based only on the variation within groups of animals, only those variables showing little interanimal variation will be up-scaled. The scaling factor is independent of how much the variable changes with time, as long as this change is consistent among the seven animals. Another option is to preprocess the data using transformations. This is a common strategy with analysis of variance, where one requirement is that the mean and the variance of a variable should be independent. As is usual with 2-DE data, we found that the variance of the protein volumes depended on the measured protein volume. The most common strategy for removing this dependency is a logarithmic transformation; however, it has been shown that this procedure is not suitable for low-abundance proteins.12 That was also observed in our data set, so we decided to use a Box-Cox transformation instead to compensate for the problem. The effect of changing the preprocessing from transformation to group scaling was investigated by using the CMV method.

The results showed that the preprocessing had little effect on the overlap with the PLS regression results, and the difference was mainly due to the method and not the preprocessing.

Choice of Significance Level. The number of selected variables from all our tested methods depends on the chosen level of significance. Although this level is usually chosen to be 5%, an ANOVA approach measures the fit between the design variable and the response, while a PLS approach, as shown here, measures the stability of each variable's regression coefficient after deciding on the optimal model.19 How false positives are handled when performing multiple tests will also affect the results. A typical experiment will usually generate from several hundred to a couple of thousand variables. With the 1377 variables in this experiment, an unmodified significance test at the 5% level is expected to yield 69 significantly changed variables by chance alone. In this experiment, we have used different ways of dealing with this problem. The multivariate methods reduce the number of significant variables through a combination of model fit and significance testing on the regression coefficients after cross-validation. The ANOVA method calculates the raw p-values and then finds the FDRs by the rotation testing method. Choosing 5% for the FDR signifies that 12 of the 241 selected variables are expected to be false positives. It might be useful to consider changing the level of the FDR based on the possibility and ease of verification in further experiments. Because the most significant variables are shared by most of the methods, false positives are also most likely to be found among those variables chosen by only one or two strategies.

Combining the Results. With no consideration of which analysis might be best suited to the task, the results from this experiment could vary from five proteins with the first step of the P-PLS to 241 proteins with the ANOVA. Since the five proteins are just a subset of the 241 proteins, ANOVA gives a broader view of which proteins are changing, whereas the five proteins are those chosen as necessary by the P-PLS method to give good predictions of how long after slaughter the meat sample was collected. While only 14 variables were selected by all the methods, the variables common to most of the strategies were also those showing the highest significance score within each strategy. The difference in the number of selected variables is therefore most prevalent for the least significant variables within each strategy. The results also matched what would be expected from each method. The CovProc method is designed to find strong candidate variables, suited to building good models between the predictors and the response. The same effect can be obtained with the P-PLS method by focusing the power parameter on the correlation. These methods will then try to find the optimal number of variables that gives a good prediction model of the response variable. The PLS and CMV methods define variables as significant if their contribution to the model is consistent in the cross-validation. For the CovProc method, there might be variables that correlate with the response but do not increase the model fit because of redundancy. Such variables are more likely to be included by the PLS regression or by CMV, resulting in a larger number of selected variables.
The ANOVA calculates significance for each variable separately, and the number of selected variables is entirely dependent on the chosen p-value or, when an FDR method is used, on the number of accepted false positives. Since ANOVA does not use any information involving more than one variable, there are no problems with redundancy in model prediction.

Figure 3. Spot location of variables found in at least four of the five strategies.

This means that all variables have an equal chance of being selected. On the other hand, the lack of any possibility of interaction between the variables means that some variables might be overlooked if they are only significant in conjunction with other variables. One example of this was shown by Karp et al.,20 where a cluster of proteins was found to be significant with a PLS discriminant analysis (PLS-DA) approach although the change in each individual protein was not significant. A study comparing PLS-DA with the differential analysis in PDQuest also showed that variables selected by univariate tests perform worse when prediction models are built from the selected variables.21 It has also been shown that, while different multivariate analyses can give the same overall model, the selected variables might still differ.22 This difference leads to the question of whether there is a best method for analyzing the data, or whether the methods just show different subsets of a true list of significant proteins. For each method, there are several steps between the measured data and the presented results, such as preprocessing, parameter estimation, and validation of the results, and each step might contribute to the differences between the results presented by the methods.

Concluding Remarks

To extract all the information from an experiment like 2-DE, the data will have to be analyzed by several methods. How selected proteins from a 2-DE experiment should be reported depends on which method was used to analyze the data: saying that a variable is significant based on a multivariate method like PLS is not the same as saying it is significant based on a univariate method. Differences in the number of selected variables can therefore easily be due to differences in the level of significance used by each method. Methods that focus on good prediction models will give fewer selected proteins, while methods focusing on variable stability and relevant variation will give more selected proteins. Depending on the objective of the experiment, both results are useful.

Although a stringent p-value might be necessary to claim a potential biological marker, we argue that, for 2-DE experiments like the one presented in this paper, it might be more important to make sure that all relevant information is extracted. Differences in results might then contain important information about the experiment. While it is important not to lose any potentially interesting variables, using the FDR to calculate significance will give a better view of the false positives in the reported results. Finally, even though some variables might be highly significant in the statistical analysis, it is the biological significance that should be the final verification of potential biological markers.

Abbreviations: ANOVA, analysis of variance; CMV, cross model validation; CovProc, covariance procedures; FDR, false discovery rate; PLS, partial least squares; P-PLS, Power-PLS.

Acknowledgment. This work was supported by The Fund for the Research Levy on Agricultural Products in Norway.

References

(1) Jessen, F.; Lametsch, R.; Bendixen, E.; Kjaersgard, I. V. H.; Jorgensen, B. M. Extracting information from two-dimensional electrophoresis gels by partial least squares regression. Proteomics 2002, 2, 32–35.
(2) Maurer, M. H.; Feldmann, R. E.; Bromme, J. O.; Kalenka, A. Comparison of statistical approaches for the analysis of proteome expression data of differentiating neural stem cells. J. Proteome Res. 2005, 4, 96–100.
(3) Meunier, B.; Bouley, J.; Piec, I.; Bernard, C.; Picard, B.; Hocquette, J. F. Data analysis methods for detection of differential protein expression in two-dimensional gel electrophoresis. Anal. Biochem. 2005, 340, 226–230.
(4) Smit, S.; Hoefsloot, H. C. J.; Smilde, A. K. Statistical data processing in clinical proteomics. J. Chromatogr., B 2008, 866, 77–88.
(5) Dudoit, S.; Shaffer, J. P.; Boldrick, J. C. Multiple hypothesis testing in microarray experiments. Stat. Sci. 2003, 18, 71–103.
(6) Storey, J. D. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B 2002, 64, 479–498.
(7) Langsrud, O. Rotation tests. Stat. Comput. 2005, 15, 53–60.

(8) Qian, H. R.; Huang, S. Comparison of false discovery rate methods in identifying genes with differential expression. Genomics 2005, 86, 495–503.
(9) Jia, X.; Hollung, K.; Therkildsen, M.; Hildrum, K. I.; Bendixen, E. Proteome analysis of early post-mortem changes in two bovine muscle types: M. longissimus dorsi and M. semitendinosis. Proteomics 2006, 6, 936–944.
(10) Karp, N. A.; McCormick, P. S.; Russell, M. R.; Lilley, K. S. Experimental and statistical considerations to avoid false conclusions in proteomics studies using differential in-gel electrophoresis. Mol. Cell. Proteomics 2007, 6, 1354–1364.
(11) Jia, X. H.; Ekman, M.; Grove, H.; Faergestad, E. M.; Aass, L.; Hildrum, K. I.; Hollung, K. Proteome changes in bovine longissimus thoracis muscle during the early postmortem storage period. J. Proteome Res. 2007, 6, 2720–2731.
(12) Gustafsson, J. S.; Ceasar, R.; Glasbey, C. A.; Blomberg, A.; Rudemo, M. Statistical exploration of variation in quantitative two-dimensional gel electrophoresis data. Proteomics 2004, 4, 3791–3799.
(13) Jensen, K. N.; Jessen, F.; Jorgensen, B. M. Multivariate data analysis of two-dimensional gel electrophoresis protein patterns from few samples. J. Proteome Res. 2008, 7, 1288–1296.
(14) Reinikainen, S. P.; Hoskuldsson, A. COVPROC method: strategy in modeling dynamic systems. J. Chemom. 2003, 17, 130–139.
(15) Hoskuldsson, A. The H-principle: new ideas, algorithms and methods in applied mathematics and statistics. Chemom. Intell. Lab. Syst. 1994, 23, 1–28.

(16) Indahl, U. A twist to partial least squares regression. J. Chemom. 2005, 19, 32–44.
(17) Ambroise, C.; McLachlan, G. J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 6562–6566.
(18) Anderssen, E.; Dyrstad, K.; Westad, F.; Martens, H. Reducing overoptimism in variable selection by cross-model validation. Chemom. Intell. Lab. Syst. 2006, 84, 69–74.
(19) Martens, H.; Martens, M. Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Qual. Preference 2000, 11, 5–16.
(20) Karp, N. A.; Griffin, J. L.; Lilley, K. S. Application of partial least squares discriminant analysis to two-dimensional difference gel studies in expression proteomics. Proteomics 2005, 5, 81–90.
(21) Marengo, E.; Robotti, E.; Bobba, M.; Milli, A.; Campostrini, N.; Righetti, S. C.; Cecconi, D.; Righetti, P. G. Application of partial least squares discriminant analysis and variable selection procedures: a 2D-PAGE proteomic study. Anal. Bioanal. Chem. 2008, 390, 1327–1342.
(22) Jacobsen, S.; Grove, H.; Jensen, K. N.; Sorensen, H. A.; Jessen, F.; Hollung, K.; Uhlen, A. K.; Jorgensen, B. M.; Faergestad, E. M.; Sondergaard, I. Multivariate analysis of 2-DE protein patterns - practical approaches. Electrophoresis 2007, 28, 1289–1299.
