Evaluation of the Variables Characterized by Significant Discriminating Power in the Application of SIMCA Classification Method to Proteomic Studies

Emilio Marengo,*,† Elisa Robotti,† Marco Bobba,† and Pier Giorgio Righetti‡

Department of Environmental and Life Sciences, University of Eastern Piedmont, Via Bellini 25/G, 15100 Alessandria, Italy, and Department of Chemistry, Materials and Engineering Chemistry "Giulio Natta", Polytechnic of Milano, Via Mancinelli 7, 20131 Milan, Italy

Received November 8, 2007
SIMCA classification can be applied to 2D-PAGE maps to identify changes occurring in cellular protein contents as a consequence of illnesses or therapies. These data sets are complex to treat due to the large number of proteins detected. A method for identifying relevant proteins from SIMCA discriminating powers is proposed, based on the Box-Cox transformation coupled to probability papers. The method successfully allowed the identification of the relevant spots from 2D maps.

Keywords: soft-independent model of class analogy • Box-Cox transformation • 2D-PAGE maps • discriminating power • normal probability plot
Introduction

The main goal of proteomics is to compare the protein expression of different cell extracts, such as control and diseased samples or diseased and drug-treated ones, in diagnostic/prognostic problems or in drug design studies. One of the most widely used tools for protein separation in this field is two-dimensional gel electrophoresis, which provides a final 2D map (called 2D-PAGE, from polyacrylamide gel electrophoresis) where the separated proteins appear as spots spread over a gel matrix. Sets of replicate 2D maps are usually compared to limit the problems associated with the poor reproducibility of 2D gel electrophoresis. Once the sets of maps are obtained, they usually undergo classical differential analysis, carried out by dedicated software packages (e.g., PDQuest, MelanieIII), to identify the spots showing a relevant up- or down-regulation with respect to control samples. An alternative to this standard procedure is the multivariate statistical analysis of the so-called spot volume data sets, in which each map is described in terms of the volumes of the identified spots. Multivariate statistical tools are very effective in this case because they take into account the correlations existing among the detected spots; they have been successfully applied to spot volume data since the mid-1980s.1,2 Among the available options, Principal Component Analysis and classification tools such as SIMCA (Soft-Independent Model of Class Analogy) are the most widely used multivariate tools in proteomics.3–6

* To whom correspondence should be addressed. Prof. Emilio Marengo, Department of Environmental and Life Sciences, University of Eastern Piedmont, Via Bellini 25/G, 15100 Alessandria, Italy. Tel: +39 0131 360272. Fax: +39 0131 360250. E-mail:
[email protected]. † University of Eastern Piedmont. ‡ Polytechnic of Milano. 10.1021/pr700719a CCC: $40.75
2008 American Chemical Society
SIMCA has recently been applied with good results by our research group as well.7–9 In these applications, the final aim was not the classification of new samples for diagnostic purposes but the identification of the main differences between the classes of objects, that is, of the spots showing a relevant biological effect. A particular problem faced when applying SIMCA to proteomic studies is the identification of the spots that actually represent possible biomarkers, that is, spots with a relevant ability to discriminate between two classes (control/diseased or control/drug-treated). For this purpose, SIMCA provides a statistical index: the discriminating power (DP) of each variable for each pair of classes. Once the DP values are calculated, the researcher has to select which of them are statistically relevant in order to identify the possible biomarkers. Usually, variables with DP values larger than a selected threshold are retained (typically a threshold of 3, 4, or 5), but this procedure is not very effective in proteomics because of the extremely large number of variables. The method used for selecting the relevant spots must provide a reasonable set of possible biomarkers that can be further investigated via mass spectrometry, avoiding the waste of time and experimental effort spent identifying spots that are not really significant for the classification. In this paper, classical criteria for the identification of the relevant DP values are compared to an alternative method. The classical criteria rely on fixed threshold values (3, 4, 5) or on the mean or median DP value; the alternative is based on probability plots, preceded by the normalization of the DP distribution through the Box-Cox transformation.
The proposed procedure is based on two successive steps: (1) through a nonlinear Box-Cox transformation, the population of the calculated DP values is turned into a well-known statistical distribution (e.g., Gaussian or Gamma), and (2) the relevant spots are identified, by means of probability plots, as those whose transformed DP value does not match the reference statistical distribution. The concept underlying the procedure is that the variables characterized by a relevant DP value do not belong to the population of the spots showing homogeneous values of discriminating power. Different distributions were tested, but the Gaussian and Gamma distributions proved to be the best alternatives. The method is applied here to four proteomic data sets of increasing complexity and increasing noise to verify the general validity of the approach and its robustness when compared to classical methods.

The Journal of Proteome Research 2008, 7, 2789–2796. Published on Web 06/11/2008.
Theory

Soft-Independent Model of Class Analogy (SIMCA). The SIMCA classification method10–21 is based on the independent modeling of each class by means of PCA.16,20,21 Each class is modeled in terms of its relevant PCs, and the samples of each class are contained in the so-called SIMCA boxes, defined by the relevant PCs of that class. Each i-th sample is assigned to a class by calculating its distance from the SIMCA model of every class; the object is assigned to class c if its distance from class c is the smallest among the distances from all the classes and if it is smaller than the critical distance characterizing class c. One of the most important advantages of SIMCA is that each sample is classified only by means of the useful information contained in the PCs that build up the model of each class; spurious information and experimental error are eliminated by considering only the relevant PCs to describe each class. The method is also useful when small data sets are analyzed (more variables than objects) because it performs a substantial dimensionality reduction.

SIMCA classification starts with the calculation of the relevant PCs for each class of objects, which define the so-called class model. If the data are autoscaled, each object $x_{iv}$ belonging to class g is modeled as:

$$x_{ivg} = \sum_{a=1}^{A_g} t_{iag}\, l_{vag} + r_{ivg}; \quad g = 1, \ldots, G; \; a = 1, \ldots, A_g; \; i = 1, \ldots, n_g; \; v = 1, \ldots, P \qquad (2.1)$$

(G = number of classes present; $A_g$ = number of significant PCs for class g; $n_g$ = number of samples in class g; P = number of original variables), where $t_{iag}$ = score of the i-th object of class g on the a-th PC; $l_{vag}$ = loading of the v-th variable on the a-th PC of class g; $r_{ivg}$ = residual of the i-th object of class g for variable v. The values estimated by the model are then:

$$\hat{x}_{ivg} = \sum_{a=1}^{A_g} t_{iag}\, l_{vag} \qquad (2.2)$$

whereas the residuals, consistently with eq 2.1, are defined as:

$$r_{ivg} = x_{ivg} - \hat{x}_{ivg} \qquad (2.3)$$
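The class model of eqs 2.1–2.3 can be illustrated with a minimal Python sketch. It is not the software used in this work: the function name `fit_class_model`, the use of NumPy's SVD to obtain the PCs, and the class-wise autoscaling are assumptions made only for illustration, and the data are synthetic.

```python
import numpy as np

def fit_class_model(X, n_pc=1):
    """Illustrative PCA model of one class (rows = maps, columns = spots)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)      # guard against constant variables
    Z = (X - mean) / std                    # autoscaled data (assumed class-wise)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_pc].T                  # l_vag, shape (P, A_g)
    scores = Z @ loadings                   # t_iag, shape (n_g, A_g)
    fitted = scores @ loadings.T            # eq 2.2
    residuals = Z - fitted                  # chosen so that eq 2.1 holds
    return {"Z": Z, "loadings": loadings, "scores": scores,
            "fitted": fitted, "residuals": residuals}

# Synthetic example: 4 maps described by 10 spots, one PC retained.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(4, 10))
model = fit_class_model(X_demo, n_pc=1)
```

With this convention the fitted values plus the residuals reproduce the autoscaled data exactly, and the residuals are orthogonal to the retained loadings.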
The classification rule for object i is based on Fisher's F-test; object i is classified in class g if:

$$\frac{\mathrm{rsd}_{ig}^{2}}{\mathrm{rsd}_{g}^{2}} < F_{critic}\left(\alpha,\; \nu_1 = P - A_g,\; \nu_2 = (P - A_g)(n_g - A_g - 1)\right) \qquad (2.4)$$
where $\mathrm{rsd}_{ig}$ = residual standard deviation of object i on class g; $\mathrm{rsd}_{g}$ = residual standard deviation of class g; $F_{critic}$ = critical value of F defining the SIMCA box; α = significance level (usually set at 0.05, corresponding to a probability level of 95%); ν1, ν2 = degrees of freedom of the numerator and denominator of the F-test, respectively. The residual standard deviation of each object i (i.e., its distance from the model of class g) is thus compared to the residual standard deviation of class g (i.e., the typical distance of class g); if the ratio of their squares is smaller than the critical F value based on the degrees of freedom and on the significance level, object i is classified in class g.

SIMCA provides several statistics useful for a deeper analysis of the classification performed. For our purpose, the most important is the discriminating power, which measures the ability of each variable to discriminate between two classes (c and g) at a time. The greater the discriminating power, the more a variable weighs on the classification of an object in class c or g. It is defined as:

$$DP_{vc} = \frac{\mathrm{rsd}_{vcg}^{2} + \mathrm{rsd}_{vgc}^{2}}{\mathrm{rsd}_{vc}^{2} + \mathrm{rsd}_{vg}^{2}}; \qquad \mathrm{rsd}_{vcg}^{2} = \sum_{i=1}^{n_g} \frac{r_{ivcg}^{2}}{n_g} \qquad (2.5)$$
where $\mathrm{rsd}_{vcg}^{2}$ = squared residual standard deviation of variable v of the objects of class c from the model of class g; $\mathrm{rsd}_{vgc}^{2}$ = squared residual standard deviation of variable v of the objects of class g from the model of class c; $\mathrm{rsd}_{vc}^{2}$ = squared residual standard deviation of variable v of the objects of class c from the model of their own class; $\mathrm{rsd}_{vg}^{2}$ = squared residual standard deviation of variable v of the objects of class g from the model of their own class; $r_{ivcg}^{2}$ = squared residual of variable v of the i-th object of class c from the model of class g; $n_g$ = number of objects in class g. The discriminating power is positive by definition and has no upper limit. Subjective methods are usually used to select the variables showing a relevant discriminating power; the most common are graphical representations of the DPs, in which the variables are reported on the x-axis and the DP on the y-axis, and the variables are selected by visual inspection of the graph. Other methods are based on selected cutoff values of DP; thresholds of 3, 4, or 5 are usually adopted. Another option is to select the variables showing a DP larger than the average DP value. Different criteria are compared here to verify their effectiveness in proteomic studies. A further criterion is then proposed, based on the coupling of the Box-Cox transformation and probability plots.

Box-Cox Transformation and Normal Probability Plot. The choice of the variables characterized by a relevant DP is particularly critical for proteomic data sets because of their large dimensionality. Moreover, the discriminating spots most interesting from a biological point of view are not usually those showing the largest discriminating power. In general, the spots with the largest DP are well-known biomarkers but are not so relevant in a biological study where interesting and peculiar cause-effect relationships have to be identified. The problem that then arises is the selection of a suitable threshold. The proposed procedure is based on a preliminary transformation of the sets of DPs by the family of Box-Cox transformations,20,21 defined as:

$$y'_j(\lambda) = \begin{cases} \dfrac{y_j^{\lambda} - 1}{\lambda} & \text{for } \lambda \neq 0 \\[2ex] \ln(y_j) & \text{for } \lambda = 0 \end{cases} \qquad (2.6)$$
where $y'_j$ = transformed DP value of variable j; $y_j$ = original DP value of variable j; λ = order of the transformation. This is a family of transformations depending on the λ parameter. By varying λ, it is possible to explore different transformations, e.g., the inverse (λ = -1), the square root (λ = 0.5), its inverse (λ = -0.5), and the logarithm (λ = 0). This step turns the population of DP values into a well-known probability distribution, so that the subsequent identification of the relevant spots can exploit the reference statistical distribution. Once the transformed values are calculated for each λ value, different statistical distributions can be compared with them, and the best-fitting combination of distribution and λ can be identified. Here we considered the Gaussian, Gamma, and Lognormal distributions; the discussion is focused on the Gaussian and Gamma distributions, which gave the best results.

For the Gaussian distribution, the best λ value (i.e., the one providing the best fit to a normal distribution) can be selected by means of normal probability plots (NPP),20,21 that is, plots where the x-axis reports the observed values and the y-axis reports the cumulative probability value associated with each x value under the hypothesis of normally distributed data. This plot shows a linear trend when the data are normally distributed. NPPs are built reporting the observed value on the x-axis and the normal score $z_j$ on the y-axis:

$$z_j = \Phi^{-1}\left(\frac{3j - 1}{3N + 1}\right) \qquad (2.7)$$

where $\Phi^{-1}$ = inverse of the normal cumulative distribution function (converting the probability value p into the normal value z); j = rank of the j-th variable; N = total number of variables. The best λ value is chosen as the one giving the DP distribution most similar to a Gaussian. From the NPP corresponding to the selected λ value, a cutoff value is detected that allows the identification of the relevant variables. The general concept underlying this procedure is that the variables showing a relevant DP do not belong to the population of the spots showing homogeneous values of this statistic: they will lie outside the linear trend associated with the values belonging to a normal distribution. This typically happens at the extremes (tails) of the distribution.

As already pointed out, other distributions were considered (Gamma, Lognormal); for these distributions, probability/probability plots were used to identify the distribution best fitting the experimental data obtained for each λ value. Probability/probability plots report the experimental probability values on the x-axis and those corresponding to the selected theoretical distribution on the y-axis. The best-fitting distribution is then the one providing the most linear trend between experimental and theoretical data. The identification of the best-fitting function is based on graphical inspection, as is common practice with probability plots.
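The transformation and the plotting positions described in this section can be sketched in a few lines of Python. The sketch assumes the standard Box-Cox form for eq 2.6 and the plotting positions of eq 2.7; the function names `box_cox`, `normal_scores`, and `npp_points` are hypothetical, and the DP values are made up for illustration.

```python
import numpy as np
from statistics import NormalDist

def box_cox(y, lam):
    """Box-Cox transformation of positive values (standard form assumed)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

def normal_scores(n):
    """Normal scores z_j for ranks j = 1..n, following eq 2.7."""
    j = np.arange(1, n + 1)
    p = (3 * j - 1) / (3 * n + 1)
    nd = NormalDist()                       # standard normal distribution
    return np.array([nd.inv_cdf(pk) for pk in p])

def npp_points(dp, lam):
    """Coordinates of a normal probability plot of the transformed DPs."""
    x = np.sort(box_cox(dp, lam))           # observed (transformed) values
    return x, normal_scores(len(x))         # plotted against the normal scores

# Made-up DP values, transformed with lambda = -1.2.
x_demo, z_demo = npp_points([1.2, 3.4, 0.8, 10.0], -1.2)
```

Plotting `z_demo` against `x_demo` gives the NPP; points that leave the straight line in the upper tail would be the candidates for relevant spots.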
Experimental Details

The procedure was applied to four different data sets of increasing complexity:

(1) Neuroblastoma,9 consisting of 8 maps from mouse adrenal gland extracts, divided into two classes of 4 samples each (532 spots detected): control (healthy; hereafter indicated as HEA) and diseased (affected by neuroblastoma; hereafter indicated as ILL) samples;
(2) Cell lines,8 consisting of 10 maps from human lymphoma cells, divided into two classes of 5 samples each (264 spots detected): GRANTA (commercial cell line) and MAVER (newly established cell line);
(3) Endothelia, consisting of 24 maps from tumoral human endothelia extracts, divided into 4 classes of 6 samples each (1167 spots detected): control samples (diseased; hereafter indicated as CTR), samples treated with rapamycin (RAP), samples treated with vinblastine (VBL), and samples treated with a mixture of the two drugs (RAPVBL);
(4) Pancreas,7 consisting of 18 samples from human tumoral pancreatic cell extracts, divided into 4 classes (435 spots detected): control samples of the PACA44 cell line (4 diseased samples; PACA), control samples of the T3M4 cell line (5 diseased samples; T3M4), PACA44 samples treated with Trichostatin-A (4 samples; PACATSA), and T3M4 samples treated with Trichostatin-A (5 samples; T3M4TSA).
The four data sets present some typical problems the researcher has to face in proteomics: in diagnosis/prognosis, the identification of the differences in the proteomic profile of control and diseased samples, to provide information about the effect of a disease on the protein expression of the target cell (Neuroblastoma data set); in product development, the identification of the differences between two cell lines in order to evaluate the possibility of commercializing a newly established cell line (Cell lines data set); in drug design, the identification of the role played by different drugs on the proteomic profile of the target cell (Endothelia data set) or the effect of a particular active principle on different cell lines (Pancreas data set). The first two data sets (Neuroblastoma and Cell lines) represent quite simple cases where the samples are divided into two classes and are characterized by low noise (the samples are well separated into the two groups). The Endothelia data set instead represents a more complex case because the samples are divided into four classes that are not so well separated from one another. Differing noise levels are characteristic of proteomic data sets and are explored here to prove the ability of the approach to identify the relevant spots also when noisy data are considered. Results obtained on different data sets are reported here to illustrate the reliability of the proposed approach on typical problems of different nature and its general validity.

Table 1. Percentage of Cumulative Explained Variance for the First 2 or 4 PCs Calculated Performing a Separate PCA on Each Class Contained in All the Data Sets under Study: Neuroblastoma, Cell Lines, Endothelia, and Pancreas

Neuroblastoma (8 × 532)
class       PC1     PC2
HEA         43.33   72.85
ILL         46.01   77.74

Cell Lines (10 × 264)
class       PC1     PC2
GRANTA      46.80   70.68
MAVER       46.59   68.88

Endothelia (24 × 1167)
class       PC1     PC2     PC3     PC4
CTR         28.41   50.56   71.81   87.02
RAP         27.33   49.88   68.85   85.81
VBL         27.74   48.86   67.50   84.58
RAPVBL      29.74   50.61   68.94   85.86

Pancreas (18 × 435)
class       PC1     PC2
PACA        39.39   75.12
T3M4        32.31   57.22
PACATSA     40.42   72.86
T3M4TSA     33.74   59.12

The maps used in this study were provided by the research group of Prof. P. G. Righetti (Polytechnic of Milan, Italy) and Dr. Daniela Cecconi (University of Verona, Italy); the experimental protocols followed to obtain the maps for all the investigated cases are not provided here because they have already been discussed elsewhere.7–9 No partition of the samples into training and test sets could be made, both because of the small number of samples (a severe problem in proteomics) and because of the preliminary nature of the work, devoted more to the identification of the differences existing between the classes than to the use of the SIMCA model for diagnostic purposes (classification of new samples). The first two cases present two classes each: in these cases SIMCA gives a single set of discriminating power values, because only one pair of classes is present. The other two data sets represent a more complex problem, because four classes are present: SIMCA gives six sets of discriminating power values (one for each pair of classes compared). Once the sets of discriminating powers had been calculated for each data set, the Box-Cox transformation was applied to each set by varying the λ parameter from -2.00 to 2.00 with a step of 0.20; the values λ = 0.50 and -0.50 were added to include the transformations corresponding to the square root and its inverse, respectively. The entire procedure, from SIMCA classification to the identification of the most relevant DPs, was repeated for each pair of classes in each case study, eliminating the discriminating spots at each iteration and performing a new SIMCA classification on the remaining spots. The iterative procedure was stopped when no further spots were identified as relevant by the probability plots for the Gaussian and Gamma distributions, respectively. For all the pairs of classes considered, 1-3 iterations were sufficient to achieve convergence.
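The λ grid described above can be written out explicitly, as in the following Python sketch. In this work the best λ is chosen by graphical inspection of the probability plots; purely for illustration, the sketch replaces that visual judgement with a numerical proxy, the correlation coefficient between the sorted transformed values and the normal scores. This proxy, the function names, and the synthetic data are assumptions of the sketch, not part of the published procedure.

```python
import numpy as np
from statistics import NormalDist

def lambda_grid():
    """-2.00 ... 2.00 in steps of 0.20, plus -0.50 and 0.50 (23 values)."""
    base = np.round(np.arange(-10, 11) * 0.2, 2)
    return np.unique(np.concatenate([base, [-0.5, 0.5]]))

def box_cox(y, lam):
    """Box-Cox transformation of positive values (standard form assumed)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def normal_scores(n):
    """Normal scores for ranks 1..n with positions (3j - 1)/(3N + 1)."""
    j = np.arange(1, n + 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(p) for p in (3 * j - 1) / (3 * n + 1)])

def npp_correlation(dp, lam):
    """Linearity of the NPP, measured by a correlation coefficient (proxy)."""
    y = np.sort(box_cox(dp, lam))
    return float(np.corrcoef(y, normal_scores(len(y)))[0, 1])

def best_lambda(dp):
    """Grid value of lambda giving the most linear NPP under the proxy."""
    grid = lambda_grid()
    scores = [npp_correlation(dp, lam) for lam in grid]
    return float(grid[int(np.argmax(scores))])

# Synthetic check: exponentiated normal scores are made Gaussian by the
# logarithm, so the proxy should select lambda = 0 for these data.
dp_demo = np.exp(normal_scores(200))
lam_demo = best_lambda(dp_demo)
```

In an iterative use, the spots flagged at the selected λ would be removed, SIMCA rerun on the remaining spots, and the grid search repeated until no further spots depart from the linear trend.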
Results and Discussion

SIMCA Classification. SIMCA was applied to the four data sets after autoscaling. Only the first PC was retained as significant to describe the classes present in each data set, as a consequence of both the small number of samples in each class and the 100% correct assignments (non-error rate, NER%) obtained for each data set with only the first PC in the SIMCA model of each class. Table 1 reports the percentage of cumulative explained variance for the first 2 or 4 PCs calculated by performing a separate PCA on each class of all the data sets under study. The first PC explains between about 32% and 47% of the total variance for the classes considered, with the exception of the Endothelia data set, where it explains about 30%. For all the classes investigated, only the first PC was considered significant since it allowed the correct classification of all the samples in the corresponding class and because of the low number of samples. However, from a general point of view, it is possible to include in the SIMCA model all the relevant PCs; the DP values will then be calculated on the basis of all the significant PCs considered.

Discriminating Power. For each data set, a set of discriminating powers was obtained for each pair of classes. The Box-Cox transformation was applied to the DPs of each class comparison, and a normal probability plot was built for each λ value. Table 2 reports the λ values giving the DP distribution most similar to the Gaussian and Gamma distributions; the values reported correspond to the final iteration. For the normal distribution, the results are similar in all the cases considered: the best transformations always have negative λ, with the best values ranging from -1.40 (for three class comparisons of the Endothelia data set) to -0.60 (for four class comparisons of the Pancreas data set). The most typical value is -1.20. The procedure also gives similar results across the data sets for the Gamma distribution, for which the best λ values range from -0.60 to -0.20. The Pancreas data set generally shows the λ values closest to zero among the cases investigated. All the best transformations show similar negative values; this reflects the behavior of the DP values, whose distributions have almost the same shape even when different biological systems are studied.

Table 2. Best λ Values Giving the Distribution of DPs More Similar to a Gaussian or Gamma Distribution for Each Data Set under Study

data set         classes            normal distr.   Gamma distr.
Neuroblastoma    HEA-ILL            -1.20           -0.40
Cell Lines       GRANTA-MAVER       -1.20           -0.60
Endothelia       CTR-RAP            -1.40           -0.60
                 CTR-VBL            -1.40           -0.60
                 CTR-RAPVBL         -1.40           -0.50
                 RAP-VBL            -1.20           -0.50
                 RAP-RAPVBL         -1.20           -0.60
                 VBL-RAPVBL         -1.20           -0.50
Pancreas         PACA-T3M4          -0.80           -0.40
                 PACA-PACATSA       -0.80           -0.40
                 PACA-T3M4TSA       -0.60           -0.20
                 T3M4-PACATSA       -0.60           -0.20
                 T3M4-T3M4TSA       -0.60           -0.20
                 T3M4TSA-PACATSA    -0.60           -0.20
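A set of discriminating powers for one pair of classes (eq 2.5) could be computed along the following lines. This is a minimal sketch under stated assumptions, not the software used for the results above: each class is autoscaled with its own mean and standard deviation, one PC is retained by default, the squared residuals are simply averaged over the projected objects without degrees-of-freedom corrections, and the function names and the random data are hypothetical.

```python
import numpy as np

def fit_class(X, n_pc=1):
    """Mean, standard deviation and loadings of a one-class PCA model."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0, 1.0, std)      # guard against constant variables
    Z = (X - mean) / std
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, Vt[:n_pc].T

def residuals(X, model):
    """Residuals of the objects in X with respect to a class model."""
    mean, std, loadings = model
    Z = (X - mean) / std
    return Z - (Z @ loadings) @ loadings.T

def mean_sq(R):
    """Mean squared residual of each variable over the objects."""
    return (R**2).mean(axis=0)

def discriminating_power(Xc, Xg, n_pc=1):
    """DP of each variable for the pair of classes (c, g), as in eq 2.5."""
    model_c, model_g = fit_class(Xc, n_pc), fit_class(Xg, n_pc)
    cross = mean_sq(residuals(Xc, model_g)) + mean_sq(residuals(Xg, model_c))
    own = mean_sq(residuals(Xc, model_c)) + mean_sq(residuals(Xg, model_g))
    return cross / own

# Synthetic example: two classes of 5 maps, 20 spots each.
rng = np.random.default_rng(1)
Xc = rng.normal(size=(5, 20))
Xg = rng.normal(size=(5, 20))
dp = discriminating_power(Xc, Xg)
```

The result contains one DP value per spot and does not depend on the order in which the two classes are given.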
As an example of the effect of changing λ, the NPP, the probability/probability plot for the Gamma distribution, and the corresponding data histogram for one of the DP sets are reported in Figure 1 as a function of λ. As can be observed, changing λ causes relevant changes in the distribution of the data. For the normal distribution, the best linearity is obtained when λ is -1.20. The shape of the graph obtained in this situation is typical of NPPs corresponding to a Gaussian distribution of the data. When outliers exist, they are usually easily recognizable as points that do not fit the main linear trend; this usually takes place in the tails of the distribution. This means that with this method it is possible to identify the data that do not belong to the main Gaussian distribution as the data lying outside the main linear trend created by the central data. For the Gamma distribution, the best linearity is obtained when λ is -0.40. In this case too, the relevant spots (variables) can be identified as those diverging most from the linear trend; this is particularly true for the upper (more positive) tail of the distribution. Similar results were obtained for all the data sets considered. The NPPs and Gamma plots showed for all the comparisons a linear trend up to a threshold where the points depart from the straight line represented in each graph. This threshold represents the cutoff value used to identify the relevant spots. No (or very small) deviation from normality or from the Gamma distribution can be detected for the smallest values.

Figures 2, 3, and 4 represent, as bar diagrams, the number of significant spots obtained with the different criteria examined: Neuroblastoma and Cell lines data sets (Figure 2), Endothelia data set (Figure 3), and Pancreas data set (Figure 4). Looking at the bar diagrams for all the cases under study, it is clear that the three methods based on fixed threshold values (3, 4, or 5) lead to an extremely large number of significant spots; the use of a fixed cutoff value therefore does not seem to be a suitable choice for proteomic data sets. The spots identified as important by these three procedures in fact also include spots that are not significant for the proper classification of the samples. To confirm this statement, SIMCA classification was repeated, for each fixed threshold value independently, after the successive removal of groups of spots considered significant by the threshold adopted. SIMCA failed to classify all the samples correctly even when not all the spots deemed significant had been removed from the model. A procedure allowing the definition of the proper threshold value for each case independently therefore appears a more effective alternative in proteomics, where too large a number of detected spots would be of no use to the researcher; the main risk would be an unnecessary experimental effort to identify by mass spectrometry proteins that are not actually relevant as biomarkers.

Figure 1. Histograms, NPP, and Gamma probability plots of 5 λ values for the set of DPs of classes HEA-ILL in the Neuroblastoma data set: (a) λ = -1.80; (b) λ = -1.20; (c) λ = -0.40; (d) λ = 0.40; (e) λ = 1.20.

Figure 2. Bar diagrams representing the number of significant spots identified by the different criteria examined for the Neuroblastoma and Cell lines data sets.
This is also the case for the median value: even though it is not a fixed value, it selects a large number of spots since it is not sensitive to the presence of extreme DP values. The NPP and Gamma plot procedures give similar results as regards the number of significant spots identified, and they represent a middle course between the fixed and median values on one side and the average value on the other. The average method, in fact, provides a number of significant spots that in most cases is slightly smaller than that obtained by the NPP procedure; for more critical cases, however, such as the Endothelia data set, it can identify a number of spots significantly smaller than NPP. It probably misses some spots showing a relevant ability of discrimination (i.e., potential biomarkers). The average method seems to be the one giving the results most similar to the NPP/Gamma approach, but it must be pointed out that this agreement depends too heavily on the particular DP values calculated, because the average value depends very much on the single values used for its computation and on the overall number of spots identified. The NPP/Gamma approach is therefore a more reliable method for the identification of the relevant spots, because it is less dependent on the number of spots present in the data set and on their actual DP values.

Figure 3. Bar diagrams representing the number of significant spots identified by the different criteria examined for the six class comparisons of the Endothelia data set.
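The classical criteria compared in this section amount to counting the spots whose DP exceeds a cutoff. The short Python sketch below illustrates this on made-up DP values; the function name `count_selected` and the numbers are hypothetical, and the median criterion is assumed to select the spots with DP above the median, by analogy with the average criterion.

```python
import numpy as np

def count_selected(dp, thresholds=(3.0, 4.0, 5.0)):
    """Number of spots selected by each classical criterion."""
    dp = np.asarray(dp, dtype=float)
    counts = {f"DP > {t:g}": int((dp > t).sum()) for t in thresholds}
    counts["DP > median"] = int((dp > np.median(dp)).sum())
    counts["DP > mean"] = int((dp > dp.mean()).sum())
    return counts

# Made-up DP values with one extreme spot.
dp_demo = np.array([1.0, 1.2, 1.5, 2.0, 3.5, 4.5, 6.0, 40.0])
counts = count_selected(dp_demo)
```

In this toy case the single extreme value pulls the mean up to about 7.5, so the average criterion keeps only one spot while the median keeps half of them, which mirrors the sensitivity of the average method to individual DP values discussed above.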
Conclusions

The present paper is focused on the comparison of different methods for the identification of the variables showing a significant discriminating power in the application of the SIMCA classification method in proteomics. The application of SIMCA to proteomic data sets makes it necessary to develop reliable methods for the identification of the relevant variables (spots) because of the particular nature of the cases under investigation. Proteomic data sets are characterized by a small number of samples described by a large number of variables, some of which stand out from the others because of particularly large DP values. It is important to select a proper threshold value allowing the identification of all the spots exhibiting a biologically relevant role, which are not necessarily characterized by extreme DP values. A method is proposed here based on a preliminary transformation of the sets of DPs with the Box-Cox transformation, followed by the identification of the relevant spots by means of normal probability plots. Other distributions are also taken into account, and the one giving the best results, together with the normal distribution, is the Gamma distribution. The results given by the NPP and Gamma plot methods are in good agreement, with the Gamma probability plots giving a number of relevant spots similar to or slightly larger than that obtained by the NPP. The number of spots identified by the NPP/Gamma procedure is compared to that obtained by classical approaches. Classical procedures based on fixed threshold values and on the median DP value provide an extremely large number of significant spots, some of which proved not to be significant for the classification of the samples. Therefore, these classical methods are not effective when applied to proteomic data sets. The average DP value is the only criterion showing results similar to the NPP/Gamma approach: it provides in general a number of significant spots that is similar to or slightly smaller than that provided by the NPP/Gamma approach. However, when noisier data are considered (Endothelia data set), or data sets where the separation between the classes is not very effective, the average method provides a number of spots that is significantly smaller than that provided by the NPP/Gamma approach. In such cases, the average method runs the risk of not identifying all the possible biomarkers and therefore misses important information. From this point of view, the alternative approach proposed appears more reliable and robust.

Figure 4. Bar diagrams representing the number of significant spots identified by the different criteria examined for the six class comparisons of the Pancreas data set.

Acknowledgment. E.M. is supported by Regione Piemonte, Ricerca Scientifica Applicata, DD.RR. No. 78-9416 of 19-May-2003 and No. 111-20277 of 1-August-2003. We gratefully thank Dr. Daniela Cecconi (University of Verona) for providing the 2D-PAGE data sets used for statistical analysis.

References
(1) Anderson, N. L.; Hofmann, J. P.; Gemmel, A.; Taylor, J. Global Approaches to Quantitative Analysis of Gene-Expression Patterns Observed by Use of Two-dimensional Gel Electrophoresis. J. Clin. Chem. 1984, 30, 2031–2036.
(2) Tarroux, P.; Vincens, P.; Rabilloud, T. HERMeS: A second generation approach to the automatic analysis of two-dimensional electrophoresis gels. Part V: Data analysis. Electrophoresis 1987, 8, 187–199.
(3) Kjaersgard, I. V. H.; Norrelykke, M. R.; Jessen, F. Changes in cod muscle proteins during frozen storage revealed by proteome analysis and multivariate data analysis. Proteomics 2006, 6 (5), 1606–1618.
(4) Verhoeckx, K. C. M.; Gaspari, M.; Bijlsma, S. In search of secreted protein biomarkers for the anti-inflammatory effect of beta(2)-adrenergic receptor agonists: Application of DIGE technology in combination with multivariate and univariate data analysis tools. J. Proteome Res. 2005, 4 (6), 2015–2023.
(5) Eriksson, L.; Antti, H.; Gottfries, J. Using chemometrics for navigating in the large data sets of genomics, proteomics, and metabonomics (gpm). Anal. Bioanal. Chem. 2004, 380 (3), 419–429.
(6) Gottfries, J.; Sjogren, M.; Holmberg, B. Proteomics for drug target discovery. Chemometr. Intell. Lab. 2004, 73 (1), 47–53.
(7) Marengo, E.; Robotti, E.; Cecconi, D.; Scarpa, A.; Righetti, P. G. Identification of the regulatory proteins in human pancreatic cancers treated with Trichostatin A by 2D-PAGE maps and multivariate statistical analysis. Anal. Bioanal. Chem. 2004, 379 (7–8), 992–1003.
(8) Marengo, E.; Robotti, E.; Bobba, M.; Liparota, M. C.; Antonucci, F.; Rustichelli, C.; Zamò, A.; Chilosi, M.; Hamdan, M.; Righetti, P. G. Multivariate statistical tools applied to the characterization of the proteomic profiles of two human lymphoma cell lines by two-dimensional gel electrophoresis. Electrophoresis 2006, 27, 484–494.
(9) Marengo, E.; Robotti, E.; Righetti, P. G.; Campostrini, N.; Pascali, J.; Ponzoni, M. Study of proteomic changes associated with healthy and tumoral murine samples in neuroblastoma by principal component analysis and classification methods. Clin. Chim. Acta 2004, 345, 55–67.
(10) Wold, S. Pattern recognition by means of disjoint principal components models. Pattern Recogn. 1976, 8, 127–139.
(11) Van Der Voet, H.; Doornbos, D. A. The improvement of SIMCA classification using kernel density estimation. Part 1. Anal. Chim. Acta 1984, 161, 115–123.
(12) Van Der Voet, H.; Doornbos, D. A. The improvement of SIMCA classification using kernel density estimation. Part 2. Anal. Chim. Acta 1984, 161, 125–134.
(13) Van Der Voet, H.; Coenegracht, P. M. J.; Hemel, J. B. New probabilistic versions of the Simca and Classy classification methods. Part 1. Theoretical description. Anal. Chim. Acta 1987, 192, 63–75.
(14) Van Der Voet, H.; Coenegracht, P. M. J. The evaluation of probabilistic classification methods, Part 2. Comparison of SIMCA, ALLOC, CLASSY and LDA. Anal. Chim. Acta 1988, 209, 1–27.
(15) Frank, I. DASCO: a new classification method. Chemometr. Intell. Lab. 1988, 4, 215–222.
(16) Mertens, B.; Thompson, M.; Fearn, T. Principal Component outlier detection and SIMCA: a synthesis. Analyst 1994, 119, 2777–2784.
(17) Kvalheim, O. M.; Oygard, K.; Grahl-Nielsen, O. SIMCA multivariate data analysis of blue mussel components in environmental pollution studies. Anal. Chim. Acta 1983, 150, 145–152.
(18) Saaksjarvi, E.; Khaligi, M.; Minkkinnen, P. Waste water pollution modelling in the southern area of lake Saimaa, Finland, by the simca pattern recognition method. Chemometr. Intell. Lab. 1989, 7, 171–180.
The Journal of Proteome Research • Vol. 7, No. 7, 2008 2795
(19) Forina, M.; Drava, G.; Contarini, G. Feature selection and validation of SIMCA models: a case study with a typical italian cheese. Analusis 1993, 21, 133–147.
(20) Vandeginste, B. G. M.; Massart, D. L.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part B; Elsevier: Amsterdam, 1988.
(21) Massart, D. L.; Vandeginste, B. G. M.; Deming, S. M.; Michotte, Y.; Kaufman, L. Chemometrics: a textbook; Elsevier: Amsterdam, 1988.
PR700719A