Retention Prediction of Peptides Based on Uninformative Variable Elimination by Partial Least Squares

R. Put,† M. Daszykowski,†,‡ T. Baczek,§ and Y. Vander Heyden*,†

† FABI, Department of Analytical Chemistry and Pharmaceutical Technology, Pharmaceutical Institute, Vrije Universiteit Brussel-VUB, Laarbeeklaan 103, B-1090 Brussels, Belgium
‡ Department of Chemometrics, Institute of Chemistry, University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland
§ Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdansk, J. Hallera 107, 80-416 Gdansk, Poland

Received February 9, 2006

A quantitative structure-retention relationship analysis was performed on the chromatographic retention data of 90 peptides, measured by gradient elution reversed-phase liquid chromatography, and a large set of molecular descriptors computed for each peptide. Such an approach may be useful in proteomics research to improve the correct identification of peptides. A principal component analysis on the set of 1726 molecular descriptors reveals a high information overlap in the descriptor space. Since variable selection is therefore advisable, the retention of the peptides is modeled with uninformative variable elimination partial least squares (UVE-PLS), besides classic partial least squares (PLS) regression. The Kennard and Stone algorithm was used to select a calibration set (63 peptides) from the available samples; this set was used to build the quantitative structure-retention relationship models. The remaining 27 peptides were used as an independent external test set to evaluate the predictive power of the constructed models. The UVE-PLS model consists of only 5 components (compared to 7 components in the best PLS model) and has the best predictive properties, i.e., an average error on the retention time of less than 30 s. Compared to stepwise regression and an empirical model, the obtained UVE-PLS model leads to better and much better predictions, respectively.

Keywords: peptides • molecular descriptors • retention prediction • QSRR • UVE-PLS • PLS

Introduction

Nowadays much interest goes to the prediction of physicochemical properties of molecules, such as their biological activity or their retention on chromatographic systems.1-5 This usually is done using so-called quantitative structure-property relationship (QSPR) models that relate the property of interest to a set of molecular descriptors. These descriptors code for the chemical information and are related to certain physicochemical properties of the molecule.6 The use of quantitative structure-retention relationship (QSRR) models for the chromatographic retention prediction of peptides may be a valuable tool in proteomics research. In particular, peptide retention prediction could help to identify a large number of peptides during proteomics research.7 Two-dimensional gel electrophoresis still is widely used for peptide identification due to its high resolving power.8,9 On the other hand, in proteomics, new high-resolution separation techniques, such as comprehensive two-dimensional HPLC coupled to mass spectrometry (MS), are under intensive development

* To whom correspondence should be addressed. Tel: +32(2)4774734. Fax: +32(2)4774735. E-mail: [email protected]. † Vrije Universiteit Brussel. ‡ University of Silesia. § Medical University of Gdansk.


Journal of Proteome Research 2006, 5, 1618-1625

Published on Web 06/15/2006

and evaluation.10,11 In general, reversed-phase gradient elution high-performance liquid chromatography is often used to separate complex mixtures of peptides, with mass spectrometry as the identification method. Moreover, liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS) is frequently used to identify the components of protein complexes and subcellular compartments.12 However, proteomic analysis still suffers from an incomplete use of the information available from HPLC experiments. In combination with mass spectrometry data, the prediction of the retention of given peptides could be helpful to improve the confidence of peptide identifications.13 When compared to isocratic RPLC, a more diverse set of solutes can be eluted on a given gradient elution system. This makes gradient elution RPLC better suited to analyze a diverse set of peptides with different amino acid compositions and total lengths. In some publications,14,15 gradient elution is considered as a combination of different isocratic systems; the gradient system is then used to search for isocratic conditions that make the elution (and separation) of a set of solutes achievable.15 From that point of view, one could argue that it is not useful, or even meaningless, to build QSRR models for gradient HPLC systems, since for different retention times different isocratic systems are proposed. However, QSRRs for

© 2006 American Chemical Society

research articles


gradient elution systems do make sense, under the condition that all chromatographic parameters (column, temperature, ...) are kept constant and the same gradient is always used (i.e., for all solutes analyzed).16,17 Thus, to be able to build meaningful QSRR models, the chromatographic conditions at a given time need to be constant, which can be the case for both isocratic and gradient elution RPLC systems. To build quantitative structure-retention relationship models with good predictive power, it is important that all physicochemical molecular properties relevant in the retention process are represented. Nowadays, over 6000 molecular descriptors have already been defined,6 and this number still increases. In general, two main QSRR approaches exist: (a) QSRR models are built from an a priori chosen small set of descriptors, selected based on their physicochemical properties as known by the chemist;18,19 and (b) the models are derived starting from a large set of molecular descriptors (hundreds or thousands), from which the best are selected by means of variable selection methods or during the model building.5,17 However, neither approach is a priori successful or better, since a proper selection of the relevant descriptors, which explain the measured retention, is needed. The performance of QSRR models constructed with the first approach mainly depends on the knowledge about the retention mechanisms and the availability of molecular descriptors describing them. For QSRR models derived with the second approach, the key factor to obtain good predictions is an appropriate variable selection. For that purpose a stepwise procedure can be used, as included, for instance, in stepwise regression,20 classification and regression trees (CART),21-23 and multivariate adaptive regression splines (MARS).5,24-26 Global search methods, like genetic algorithms27 and simulated annealing,28 form another group of variable selection methods.
Besides the above, methods employing latent variables, like uninformative variable elimination partial least squares,29 can also be used. In this paper, the main emphasis is put on modeling the relationship between a set of 1726 molecular descriptors for 90 peptides and their chromatographic retention measured by gradient elution RPLC, with particular attention to the variable selection problem. The retention of the peptides is modeled with partial least squares (PLS) regression and uninformative variable elimination PLS (UVE-PLS). The resulting models are compared to stepwise regression and multiple linear regression (MLR) models, which are the most frequently used modeling approaches in the field of QSRR.4,7,13,16 Stepwise regression was applied to the set of 1726 descriptors, whereas for MLR three descriptors with known properties were used (approach a).

2. Theory

2.1. Principal Component Analysis. Principal Component Analysis (PCA) aims to substitute a large number of explanatory variables by a few new, so-called latent variables, defined as Principal Components (PCs).30 The PCs are linear combinations of the initial variables and maximize the description of the data variance. The PCs can also be regarded as a set of orthogonal axes, i.e., a new coordinate system, representing as much data variance as possible. In the literature, many PCA algorithms are described. The most popular is Singular Value Decomposition (SVD).31 With SVD the original data matrix, X, is represented as a product of three matrices S, V, and D; hence, the PCA model can be written as

X = SVD^T    (1)

where S is a column matrix of normalized principal components, V a diagonal matrix with the singular values on its diagonal, and D^T the transposed matrix of loadings. The PCs are ranked in the columns of matrix S in decreasing order of importance in describing the data variance; the importance of a PC is scored by its corresponding singular value in the diagonal matrix V. The first, and often only a few, PCs are most relevant for describing the data variability, while the last ones are associated with random noise. The not-normalized PCs are the product of the S and V matrices. The loadings, D, relate the PCs to the original explanatory variables and inform about the contribution of every explanatory variable to a given PC. A key property of PCA is that it preserves the Euclidean distances between the objects in the space of the PCs, which is crucial when visualizing the data structure. The visualization is done in the space of the first few PCs, which explain the majority of the data variance. By projecting the objects on the planes of selected pairs of PCs (score plots), similarities between objects can be visualized: the closer the objects are, the more similar they are in a chemical sense. The similarities between the explanatory variables are analyzed with loading plots, in which the loadings are projected on the plane of selected pairs of PCs.

2.2. Stepwise Regression. The objective of stepwise regression20 is to construct a multivariate regression model for a certain property, y, based on a few deliberately selected explanatory variables. In stepwise regression, the first selected explanatory variable has the highest correlation with y. Then, explanatory variables are consecutively added to the model in a forward selection procedure: a new variable is added if a significant change in the residuals of the model is observed.
The significance is evaluated using a statistical test, usually an F-test.20 Each time a new variable is included in the model, a backward elimination step follows, in which an F-test detects earlier selected variables that can be removed from the model without changing the residuals significantly. The variable selection procedure terminates when no additional variable significantly improves the model. Stepwise regression is very popular in QSRR studies,32 since the stepwise procedure is intuitively simple and based on the classical multiple linear regression (MLR) approach. Moreover, it is implemented in almost all statistical software packages. One drawback of the method is that no optimal variable selection is guaranteed, since new variables are found based on the variables previously included in the model.

2.3. Partial Least Squares (PLS). The goal of Partial Least Squares, PLS, is to build a linear model describing a dependent variable, y, by a set of explanatory variables, X. Depending on the application, the number of explanatory variables can vary from a limited number up to hundreds or thousands. Especially in chemical applications, the ratio of the number of samples to the number of variables is often very small. In PLS, modeling is performed in the space of a few PLS factors, T, constructed to maximize the covariance between X and y.33,34 The PLS model is given by the following equation:

y = Tq + e = Xb + e    (2)

where q are the regression coefficients associated with the PLS factors (T), e (m × 1) are the residuals defined as the differences between the observed and predicted y, X is the matrix of explanatory variables (m × n), and b are the regression coefficients of the n explanatory variables, which are obtained using the following equation:

b = W(P^T W)^-1 q    (3)

where W is a matrix of loadings obtained by maximizing the covariance criterion, and P is the result of projecting X onto T. To ensure good predictive abilities of the PLS model, the calibration set should represent all of the data variance sources, and the number of PLS factors should be optimized. To prevent overfitting, the proper number of PLS factors is evaluated based on the predictive properties of the models for an external test set.

2.4. Uninformative Variable Elimination by Partial Least Squares (UVE-PLS). A PLS calibration model can be improved by excluding uninformative variables that have a high variance but a small covariance with the dependent variable y. Model improvement means that the model complexity and/or the prediction error decreases (i.e., the prediction ability increases). In our applications, the Uninformative Variable Elimination by Partial Least Squares approach (UVE-PLS), proposed by Centner et al.,29 was used. This calibration approach exploits the idea of stability of the regression coefficients. UVE-PLS relies on a cutoff value determined from the stability of the coefficients associated with irrelevant variables added to the original data (see below). The data matrix X (m × n) is augmented with a matrix N (m × k), containing normally distributed random numbers with an order of magnitude of 10^-10. The m vectors of regression coefficients, b, are calculated with leave-one-out cross-validation for the optimal complexity f and put into a matrix B (m × (n + k)). Its first n columns are the regression coefficients related to the experimental variables, and the k remaining columns are related to the uninformative variables. The stability of the regression coefficient for the j-th variable is defined as

s_j = mean(B(:,j)) / std(B(:,j))    (4)

where B(:,j) represents all elements of the j-th column of matrix B, and mean and std stand for the mean value and standard deviation of the m elements of that column, respectively. The k noisy variables are irrelevant to model y. Thus, to discriminate stable from unstable regression coefficients, a cutoff value is defined as

cutoff = max(s(n+1 : n+k))    (5)

where max is the maximal value, and s(n+1 : n+k) is the vector of stabilities of the k regression coefficients associated with the k noisy variables. All experimental variables with an absolute value of the stability of the regression coefficient below the cutoff value are irrelevant to model y and are eliminated from the original data set, because their information content is not higher than that of the random variables.

2.5. QSRR Model Statistics. During model building, the model fit improves with the model complexity. Thus, the more factors included in the model, the better it fits the calibration data. Usually, the model fit is evaluated by the root-mean-squared error (RMSE), computed for the calibration data, which is defined as

RMSE(f) = sqrt[ Σ_{i=1..c} (y_i − ŷ_i^(f))^2 / c ]    (6)

where y_i is the experimental retention value of the i-th calibration sample, ŷ_i^(f) the retention value predicted for the i-th calibration sample using the model (of complexity f), and c the number of calibration set objects. The determination of the optimal complexity of the model requires an estimation of its predictive ability, to prevent overfitting to the calibration data. After all, the main goal of QSRR models is to obtain a reasonable prediction of the retention for future samples. To prevent over-optimistic prediction errors, it is preferable to evaluate the prediction by means of an independent external test set, rather than using internal validation procedures, such as cross-validation.20 The predictive ability of a model is characterized by the Root Mean Squared Error of Prediction (RMSEP) of an independent external test set, which is defined as

RMSEP(f) = sqrt[ Σ_{i=1..t} (y_i − ŷ_i^(f))^2 / t ]    (7)

where y_i is the experimental retention value of the i-th test set object, ŷ_i^(f) the retention value predicted for the i-th object using the model (of complexity f), and t the number of test set objects.

2.6. Molecular Descriptors. The definition of molecular descriptors is twofold: a so-called theoretical descriptor can be defined as a number resulting from a mathematical procedure that encodes useful chemical information, whereas an experimental descriptor is the result of a standardized experiment.6 Because of practical advantages, theoretical descriptors usually are preferred in QSRR. Depending on the representation of the molecule from which the theoretical descriptors are calculated, a distinction is made between zero- (0D), one- (1D), two- (2D), three- (3D), and four-dimensional (4D) theoretical descriptors, which are derived from the chemical formula, substructure list representations, molecular graphs, geometrical representations, and lattice representations of the molecule, respectively.6
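As an illustrative sketch of the stability criterion of eqs 4 and 5 (our own, not the authors' code): for brevity, the leave-one-out coefficient matrix B is computed here with ordinary least squares rather than a PLS model of fixed complexity f, on simulated data in which only some variables carry information:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 40, 5, 5                         # samples, experimental vars, noise vars
X = rng.normal(size=(m, n))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.01 * rng.normal(size=m)

N = 1e-10 * rng.normal(size=(m, k))        # uninformative variables, magnitude ~1e-10
XN = np.hstack([X, N])                     # augmented data matrix, m x (n + k)

# m leave-one-out coefficient vectors stacked into B (m x (n + k));
# a real UVE-PLS implementation would fit a PLS model of complexity f here
B = np.empty((m, n + k))
for i in range(m):
    keep = np.arange(m) != i
    B[i], *_ = np.linalg.lstsq(XN[keep], y[keep], rcond=None)

s = B.mean(axis=0) / B.std(axis=0)         # stability s_j of eq 4
cutoff = np.abs(s[n:]).max()               # eq 5: max stability over the noise vars
retained = np.flatnonzero(np.abs(s[:n]) > cutoff)   # informative experimental vars
```

Variables whose coefficients are no more stable than those of the added noise are discarded, after which the final PLS model is refit on the retained variables.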

3. Experimental Section

The studied data set consists of a vector of retention times for 90 peptides with known amino acid composition (see Table 1) and a matrix of molecular descriptors computed for each peptide. The peptides AF, YL, ML, WW, GM, GL, and WF were from Sigma-Aldrich (St. Louis, MO), angiotensin II was from Fluka, and all other peptides were synthesized at the Department of Organic Chemistry, University of Gdansk.4 The retention time (tR) of the peptides was measured on an XTerra MS C18 column (Waters, Milford, MA; 15.0 × 0.46 cm I.D., 5 µm particle size). The mobile phase consisted of a mixture of water with 0.12% trifluoroacetic acid (solvent A) and acetonitrile with 0.10% trifluoroacetic acid (solvent B). Gradient elution was carried out from 0% B (t = 0 min) to 60% B (t = 20 min). The


Table 1. Peptides Used and Their Retention Times (tR)

no.  peptide sequence            tR (min)   no.  peptide sequence          tR (min)
1    AF                          11.9       46   DAEFRHDSG-NH2             10.6
2    YL                          12.7       47   DAEFGHDSG-NH2             10.9
3    ML                          12.8       48   DAEFRHDSGY-NH2            11.6
4    WW                          15.9       49   Ac-DAEFRHDSGY-NH2         12.5
5    GM                          8.8        50   DAEFGHDSGF-NH2            13.1
6    GL                          11.0       51   Ac-DAEFGHDSGF-NH2         14.0
7    WF                          15.6       52   EVHHQK-NH2                8.2
8    LPQIENVKGTEDSGTT-CONH2      13.0       53   Ac-EVHHQK-NH2             9.3
9    VKGTEDSGTT-CONH2            9.3        54   EVRHQK-NH2                8.5
10   EHADLLAVVAASQKK-CONH2       15.2       55   Ac-EVRHQK-NH2             9.4
11   VVAASQKK-CONH2              9.5        56   EVHHQKLVFF-NH2            15.5
12   LAQAVRSS-CONH2              10.8       57   Ac-EVHHQKLVFF-NH2         15.9
13   SFSMIKEGDYN-CONH2           13.9       58   EVRHQKLVFF-NH2            15.5
14   Ac-NH-CEQDGDPE-CONH2        10.5       59   Ac-EVRHQKLVFF-NH2         16.0
15   YKIEAVKSEPVEPPLPSQ-CONH2    14.0       60   LVFF-NH2                  17.2
16   LPPGPAVVDLTEKLEGQGG-CONH2   16.5       61   GSNKGAIIGLM-NH2           15.5
17   VVDLTEKLEGQGG-CONH2         13.8       62   GKTKEGVLY-NH2             12.7
18   DRVYIHPF                    15.2       63   KTKEGVLY-NH2              12.7
19   KETS                        4.2        64   TKEGVLY-NH2               12.9
20   AKETS                       5.0        65   KEGVLY-NH2                12.8
21   VAKETS                      8.7        66   EGVLY-NH2                 13.2
22   TVAKETS                     9.4        67   GVLY-NH2                  13.1
23   HTVAKETS                    9.5        68   MAGASELGTGPGA-NH2         11.5
24   WHTVAKETS                   11.8       69   AGGYKPFNLETA-NH2          14.3
25   HWHTVAKETS                  11.8       70   GAPGGPAFPGQTQDPLYG-NH2    14.6
26   LHWHTVAKETS                 13.0       71   Ac-ETHLHWHTVAK-NH2        13.8
27   MAGAAAAG-NH2                10.1       72   Ac-ETHLHWHTVAKET-NH2      13.2
28   SKPKTNMKHMAGAAAAG-NH2       11.4       73   WHT                       11.6
29   Ac-HNPGYPHNPGYP-NH2         12.2       74   HWHT                      11.6
30   HSDGIFTDS                   13.4       75   LHWHT                     13.1
31   HSEGTFTSD                   11.3       76   HLHWHT                    13.1
32   YKIEAVQSETVEPPPPAQ-NH2      13.4       77   THLHWHT                   13.0
33   TLSYPLVSVVSESLTPER-NH2      17.7       78   ETHLHWHT                  12.9
34   PYPLRDVRGEPLEPPEPS-NH2      14.0       79   SETHLHWHT                 12.9
35   EVHHQKLVFFAEDVGSNK-NH2      14.6       80   Ac-EVHHQK                 9.5
36   EVHHQKLVFFAKDVGSNK-NH2      13.9       81   EVHHQK                    8.5
37   EVHHQKLVFFAQDVGSNK-NH2      14.5       82   EVRHQKLVFF                16.0
38   EVHHQKLVFFAGDVGSNK-NH2      14.5       83   EVRHQK                    8.8
39   EVHHQKLVFFAENVGSNK-NH2      14.5       84   Ac-EVRHQK                 9.7
40   EVHHQKLVFFGEDVGSNK-NH2      14.5       85   Ac-EVHHQKLVFF             16.4
41   pEADPNKFYGLM-NH2            15.8       86   EVHHQKLVFF                16.0
42   DAEFRH-NH2                  10.6       87   Ac-EVRHQKLVFF             16.5
43   Ac-DAEFRH-NH2               11.7       88   Ac-DAEFRH                 11.9
44   DAEFGH-NH2                  10.9       89   DAEFGH                    11.2
45   Ac-DAEFGH-NH2               12.0       90   Ac-DAEFGH                 12.3

temperature was kept constant at 40 °C, a flow rate of 1 mL/min was used, and detection was performed at a wavelength of 223 nm. The peptide samples were dissolved in a mixture of water with trifluoroacetic acid (0.10%).13 The molecular descriptors were calculated based on the optimized geometrical representations of the peptides obtained with HyperChem 6.03 Professional software (Hypercube, Gainesville, FL), using the molecular mechanics force field method (MM+) with the Polak-Ribière conjugate gradient algorithm and an RMS gradient of 0.05 kcal/(Å mol) as stopping criterion. Ten molecular descriptors were computed in HyperChem 6.03 Professional. Besides these, 1630 molecular descriptors belonging to 20 classes were calculated using Dragon Professional 5.0 software (Milano Chemometrics and QSAR Research Group-Talete, Milano, Italy, http://www.disat.unimib.it/chm/Dragon.htm), which was also used to extract 86 principal components from 16 descriptor classes (see Table 2). The scores of the peptides on these PCs were also considered as molecular descriptors. Thus, a total of 1726 molecular descriptors was used. Further calculations were performed in the Matlab 6.5 environment (The Mathworks, Natick, MA).
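The Kennard and Stone algorithm used below (section 4.2) to split the 90 peptides into calibration and test sets can be sketched as follows (our own illustration, not the authors' code): start from the two most distant objects and repeatedly add the object whose smallest distance to the already selected ones is largest:

```python
import numpy as np

def kennard_stone(X, n_cal):
    """Return indices of n_cal calibration objects covering X uniformly."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    i, j = np.unravel_index(np.argmax(d), d.shape)             # two most distant objects
    selected = [int(i), int(j)]
    while len(selected) < n_cal:
        remaining = [r for r in range(len(X)) if r not in selected]
        # for each candidate, its distance to the closest already-selected object ...
        dmin = d[np.ix_(remaining, selected)].min(axis=1)
        # ... and pick the candidate for which that distance is largest
        selected.append(remaining[int(np.argmax(dmin))])
    return np.array(selected)

X = np.random.default_rng(2).normal(size=(90, 3))   # stand-in descriptor matrix
cal = kennard_stone(X, 63)                          # 63 calibration peptides, as in the paper
test = np.setdiff1d(np.arange(90), cal)             # the remaining 27 form the test set
```

In the paper, the split is applied in the descriptor space, so the calibration set covers the whole experimental domain.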

Table 2. Molecular Descriptor Classes Calculated with Dragon Software

class  family of descriptors       dimensionality  no. of descriptors  no. of PCs derived
1      constitutional              0-D             48                  5
2      topological                 2-D             119                 7
3      walk and path counts        2-D             47                  3
4      connectivity indices        2-D             33                  3
5      information indices         2-D             47                  5
6      2D autocorrelations         2-D             96                  11
7      edge adjacency indices      2-D             107                 4
8      BCUT                        2-D             64                  2
9      topological charge indices  2-D             21                  2
10     eigenvalue-based indices    2-D             44                  3
11     randic molecular profiles   3-D             41                  1
12     geometrical                 3-D             74                  6
13     RDF                         3-D             150                 3
14     3D-MoRSE                    3-D             160                 16
15     WHIM                        3-D             99                  8
16     GETAWAY                     3-D             197                 7
17     functional group counts     1-D             121                 0
18     atom-centered fragments     1-D             120                 0
19     charge                      other           14                  0
20     molecular properties        other           28                  0



Figure 1. (a) Cumulative percentage of explained data variance for consecutive PCs; (b) PC1-PC2 loading plot; (c) PC1-PC2-PC3 score plot; the size of the spheres is proportional to the retention observed for a peptide.

4. Results and Discussion

4.1. Principal Component Analysis for Molecular Descriptor Exploration. To explore the set of 1726 molecular descriptors, principal component analysis was applied to the autoscaled descriptors. The first two principal components (PCs) describe 68% of the total data variance, which suggests a high correlation among the molecular descriptors. Twelve PCs are needed to explain 90% of the data variance (see Figure 1a). This high correlation between descriptors is also revealed in the loading plot of the first two PCs, in which the loadings of the molecular descriptors on PC1 are plotted vs their loadings on PC2 (see Figure 1b). The score plot of the first PCs informs on the behavior of the peptides in the chromatographic system. In Figure 1c, a 3D score plot is shown, where each peptide is denoted by a sphere with a size proportional to its retention. A V-shaped data structure can be observed; however, no clear retention trend is present. It can thus be concluded that no simple retention trend exists in the space of the molecular descriptors.

4.2. QSRR Models. Quantitative structure-retention relationship models were built to describe the retention times (tR) of the peptides based on the given set of molecular descriptors. Prior to the QSRR model building, the descriptor values were autoscaled in order to remove undesired scale differences between the descriptors. Three modeling methodologies were studied on the given data set: classical PLS without variable selection, UVE-PLS, and stepwise regression. Finally, the empirical MLR model described in ref 13 was used as a reference to evaluate the models obtained with the above methodologies. To judge the predictive power of the constructed QSRR models, the set of peptides was split into a calibration and a test set using the Kennard and Stone algorithm.35,36 This algorithm allows a uniform selection of a predefined number of objects. Such a uniform selection ensures that the whole experimental domain is sampled and all sources of data variance are included in the calibration set. In this study, the calibration set used to construct the models contains 63 peptides; the remaining 27 peptides served as an independent test set to validate the models. The multivariate character of the studied data set, as well as the high degree of correlation among the molecular descriptors, suggests using partial least squares (PLS) for the QSRR model building. PLS allows modeling the data while taking advantage of an excess of variables, regardless of their high correlation. For our application, first a classic PLS model was built, relating the physicochemical properties of the peptides, represented by 1726 molecular descriptors, to their retention. Figure 2a shows the model fit (RMSE) and the predictive properties for the test samples (RMSEP) as a function of the model size for a series of PLS models; model size refers to the number of PLS components in a given model. The best PLS model is composed of seven factors and has a good model fit: the RMSE equals 0.359 (2.65% of the variation in y) and the RMSEP equals 0.485 (3.58% of the variation in y) (see Figure 2a,b). Since the chromatographic retention times were measured in minutes, these errors correspond to 0.359 and 0.485 min, i.e., 21.5 and 29.1 s for RMSE and RMSEP, respectively. Notice the difference between the RMSE and RMSEP values, which may indicate that the model slightly overfits the calibration set. Both the PCA exploration of the molecular descriptor space and the RMSE vs RMSEP values suggest that the QSRR model can be improved by means of UVE-PLS. Since many variables seem to contain analogous information, removing uninformative variables might improve the models. By removing variables that have a large variance but a small covariance


Figure 2. PLS models statistics: (a) RMSE (O-O) and RMSEP (2-2) vs model size for PLS models with different complexity; (b) Predicted retention times (tR) for the best model, vs observed retention times (tR) (+: calibration set samples; O: test set samples).

with y, the PLS model becomes more stable, and often fewer PLS components are needed. The constructed UVE-PLS models indeed show a better stability than the initial PLS model (see Figure 3a). The best UVE-PLS model is composed of only five latent factors, and its RMSE and RMSEP values equal 0.454 (3.36% of the variation in y) and 0.453 (3.35% of the variation in y), respectively. Since these RMSE and RMSEP values are expressed in minutes, they correspond to errors of about 27 s. From the initial set of 1726 molecular descriptors, only 128 were retained by UVE-PLS. Figure 4 shows the X-loadings of the UVE-PLS model for all retained descriptors. Since none of the PLS components is dominated by one, or a small number of, molecular descriptors, the model still is not easy to interpret. It can be concluded that the molecular properties that are important in the UVE-PLS model cannot easily be identified in this approach. Figure 3b shows the prediction of the test set samples (O) and the fit of the calibration set samples (+). Compared to PLS, the UVE-PLS model may be preferred, since two fewer PLS components are needed to obtain about the same predictive power. The obtained UVE-PLS model was also compared with a stepwise regression model. By setting the significance level α of the F-test to 0.05, the stepwise model retains 11 descriptors. However, a model defined by seven descriptors was selected based on the evaluation of the predictive properties (see Figure 5a). Among them, there are two 1-D, four 2-D, and


Figure 3. UVE-PLS models statistics: (a) RMSE (O-O) and RMSEP (2-2) vs model size for UVE-PLS models with different complexity; (b) Predicted retention times (tR) for the best model, vs observed retention times (tR) (+: calibration set samples; O: test set samples).

one 3-D descriptors. An overview of the selected variables and their regression coefficients in the stepwise model is given in Table 3. Both the model fit and the predictive properties of the stepwise model (see Figure 5b) are worse than those of the UVE-PLS model: the data fit (RMSE) of the stepwise MLR model equals 0.547, and the RMSEP equals 0.668. On the other hand, the constructed stepwise model is easier to interpret, since molecular descriptors as such are used in the model, whereas in the UVE-PLS model latent variables are applied. Moreover, these PLS components were created from a total of 128 molecular descriptors retained after uninformative variable elimination. The larger number of descriptors retained by the UVE-PLS model can be explained by the lack of an orthogonality condition, which means that the retained variables can be correlated, whereas the variables selected for the stepwise regression model are mutually orthogonal. Finally, the above results were compared with the MLR model described in ref 13, in which three molecular descriptors are used: (i) the logarithm of the sum of gradient retention times of the amino acids composing the peptide (log SumAA); (ii) the logarithm of the peptide's van der Waals volume (log VDWVol); and (iii) the logarithm of the theoretically calculated n-octanol-water partition coefficient (clog P). The model obtained for the calibration set was used to predict the retention times of the test set samples. The corresponding RMSE and RMSEP values are equal to 0.789 and 0.882, respectively, which is considerably worse than the above results. Figure 6 indicates



Figure 5. Stepwise regression model statistics: (a) RMSE (O-O) and RMSEP (2-2) vs model size for stepwise regression models with different complexity; (b) predicted retention times (tR) for the best model vs observed retention times (tR) (+: calibration set samples; O: test set samples).

Figure 4. UVE-PLS model: X-loadings of the 128 molecular descriptors retained.

both a lack of model fit to the calibration set samples (+), and poor predictive power for the external test samples (O).
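The RMSE and RMSEP figures compared above all follow eqs 6 and 7; as a small helper function (our own illustration, not the authors' code):

```python
import numpy as np

def rms_error(y, y_pred):
    """Eqs 6/7: sqrt(sum_i (y_i - y_pred_i)^2 / number of objects)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

# Applied to the calibration fit this gives the RMSE, and applied to external
# test-set predictions the RMSEP; an error expressed in minutes converts to
# seconds by multiplying by 60 (e.g., 0.485 min is about 29.1 s, as reported
# for the best PLS model).
```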

Table 3. Overview of the Stepwise Regression Model (significance level α = 0.05, RMSE = 0.547, RMSEP = 0.668)

no.             family                     dimension  descriptor  regression coeff
(intercept b0)                                                    12.65
1               topological descriptors    2-D        PW5         0.70
2               atom-centered fragments    1-D        C.003       1.65
3               2D autocorrelations        2-D        PC06.04     -1.41
4               geometrical descriptors    3-D        G(N..N)     -1.00
5               atom-centered fragments    1-D        H-047       0.94
6               walk and path counts       2-D        piID        -0.39
7               2D autocorrelations        2-D        MATS2e      0.29

5. Conclusions

Modeling a certain property of a molecule often is a difficult task. However, HPLC retention prediction may be an additional source of valuable information in proteomic research. To obtain reliable models with good predictive properties, it is advisable to consider a large set of molecular descriptors representing a wide variety of molecular properties that can potentially be included in the models. The most crucial step in such an approach is the objective selection of relevant molecular descriptors from the given descriptor set; capable variable selection methods are thus needed. Here, it was demonstrated that the selection of relevant descriptors in QSRR studies can be achieved using the UVE-PLS approach. A set of 1726 molecular descriptors was calculated for 90 peptides, for which the retention on a reversed-phase liquid chromatography system using gradient elution was measured. The constructed UVE-PLS QSRR model shows good predictive properties for new peptides, as measured using independent test set samples that were not considered during the model calibration. The RMSEP value equals 0.453 for the model composed of only five latent factors, and only 128 descriptors were selected from the initial set by the uninformative variable elimination procedure. It was concluded that the UVE-PLS approach performs better than regular PLS: it has a better predictive ability and resulted

Journal of Proteome Research • Vol. 5, No. 7, 2006

in a less complex model (5 vs 7 components). Thus, also in this application the uninformative variable elimination approach is useful for variable selection, since it is able to select the most relevant molecular descriptors from the given set of 1726 descriptors. The UVE-PLS approach can also be considered as a better alternative to stepwise regression, since the most informative variables (descriptors) were successfully selected from a large set, and even better predictive abilities were achieved. Since prediction errors of less than 30 s were obtained, the methodology proposed may be of great value in the field of proteomics. For instance, such retention predictions may be useful in mass spectrometry data analysis and this additional information may improve the confidence of peptide identifications. From a practical point of view, the gradient slope also still can be reduced, which might result in a larger retention time variation for the different peptides and an increased distinction between them.



Figure 6. Predicted retention times (tR) with the MLR model from ref 13 vs observed retention times (tR) (+: calibration set samples; ○: test set samples).

Acknowledgment. M. Daszykowski would like to express his sincere gratitude to the Foundation for Polish Science for financial support; the other authors acknowledge support from BWS/03/07.

References

(1) Vedani, A.; Dobler, M.; Lill, M. A. Toxicol. Appl. Pharmacol. 2005, 207, 398-407.
(2) Verma, R. P.; Kurup, A.; Mekapati, S. B.; Hansch, C. Bioorg. Med. Chem. 2005, 13, 933-948.
(3) Perkins, R.; Fang, H.; Tong, W.; Welsh, W. J. Environ. Toxicol. Chem. 2003, 22, 1666-1679.
(4) Kaliszan, R.; Baczek, T.; Cimochowska, A.; Juszczyk, P.; Wisniewska, K.; Grzonka, Z. Proteomics 2005, 5, 409-415.
(5) Put, R.; Xu, Q. S.; Massart, D. L.; Vander Heyden, Y. J. Chromatogr., A 2004, 1055, 11-19.
(6) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000.
(7) Baczek, T.; Bucinski, A.; Ivanov, A. R.; Kaliszan, R. Anal. Chem. 2004, 76, 1726-1732.
(8) Liebler, D. C. Introduction to Proteomics; Humana Press: Totowa, NJ, 2002.
(9) Pandey, A.; Mann, M. Nature (London) 2000, 405, 837-846.
(10) Wagner, K.; Miliotis, T.; Marko-Varga, G.; Bischoff, R.; Unger, K. K. Anal. Chem. 2002, 74, 809-820.
(11) Davis, M. T.; Beierle, J.; Bures, E. T.; McGinley, M. D.; Mort, J.; Robinson, J. H.; Spahr, C. S.; Yu, W.; Luethy, R.; Patterson, S. D. J. Chromatogr., B 2001, 752, 281-291.
(12) Washburn, M. P.; Wolters, D.; Yates, J. R. Nat. Biotechnol. 2001, 19, 242-247.

(13) Baczek, T.; Wiczling, P.; Marszall, M.; Vander Heyden, Y.; Kaliszan, R. J. Proteome Res. 2005, 4, 555-563.
(14) Quarry, M. A.; Grob, R. L.; Snyder, L. R. Anal. Chem. 1986, 58, 907-917.
(15) Schoenmakers, P. J.; Bartha, A.; Billiet, H. A. H. J. Chromatogr. 1991, 550, 425-447.
(16) Baczek, T.; Kaliszan, R. J. Chromatogr., A 2003, 987, 29-37.
(17) Hancock, T.; Put, R.; Vander Heyden, Y.; Coomans, D.; Everingham, Y. Chemom. Intell. Lab. Syst. 2005, 76, 185-196.
(18) Kaliszan, R.; van Straten, M. A.; Markuszewski, M.; Cramers, C. A.; Claessens, H. A. J. Chromatogr., A 1999, 855, 455-486.
(19) Baczek, T.; Kaliszan, R.; Novotná, K.; Jandera, P. J. Chromatogr., A 2005, 1075, 109-115.
(20) Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier: Amsterdam, 1997.
(21) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. G. Classification and Regression Trees; Wadsworth: Belmont, CA, 1984.
(22) Put, R.; Perrin, C.; Questier, F.; Coomans, D.; Massart, D. L.; Vander Heyden, Y. J. Chromatogr., A 2003, 998, 261-276.
(23) Daszykowski, M.; Walczak, B.; Xu, Q.; Daeyaert, F.; de Jonge, M. R.; Heeres, J.; Koymans, L. M. H.; Lewi, P. J.; Vinkers, H. M.; Janssen, P. A.; Massart, D. L. J. Chem. Inf. Comput. Sci. 2004, 44, 716-726.
(24) Frank, I. E. Chemom. Intell. Lab. Syst. 1995, 27, 1-9.
(25) Xu, Q. S.; Daszykowski, M.; Walczak, B.; Daeyaert, F.; de Jonge, M. R.; Heeres, J.; Koymans, L. M. H.; Lewi, P. J.; Vinkers, H. M.; Janssen, P. A.; Massart, D. L. Chemom. Intell. Lab. Syst. 2004, 72, 27-34.
(26) Xu, Q. S.; Massart, D. L.; Liang, Y. Z.; Fang, K. T. J. Chromatogr., A 2003, 998, 155-167.
(27) Devillers, J. Genetic Algorithms in Molecular Modeling; Academic Press: London, 1996.
(28) Barakat, M. T.; Dean, P. M. J. Comput.-Aided Mol. Des. 1990, 4, 295-316.
(29) Centner, V.; Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B.; Sterna, C. Anal. Chem. 1996, 68, 3851-3858.
(30) Malinowski, E. R. Factor Analysis in Chemistry; Wiley: New York, 1991.
(31) Golub, G. H.; Van Loan, C. F. Matrix Computations; The Johns Hopkins University Press: Oxford, 1983.
(32) Fragkaki, A. G.; Koupparis, M. A.; Georgakopoulos, C. G. Anal. Chim. Acta 2004, 512, 165-171.
(33) Martens, H.; Naes, T. Multivariate Calibration; Wiley: Chichester, UK, 1989.
(34) Naes, T.; Isaksson, T.; Fearn, T.; Davies, T. Multivariate Calibration and Classification; NIR Publications: Chichester, UK, 2002.
(35) Kennard, R. W.; Stone, L. A. Technometrics 1969, 11, 137-148.
(36) Daszykowski, M.; Walczak, B.; Massart, D. L. Anal. Chim. Acta 2002, 468, 91-103.

PR0600430
