
Molecular descriptor subset selection in theoretical peptide quantitative structure retention relationships model development using nature-inspired optimization algorithms

Petar Žuvela‡, J. Jay Liu‡,*, Katarzyna Macur†, Tomasz Bączek∞

‡ Department of Chemical Engineering, Pukyong National University, 365 Sinseon-ro, 608-739 Busan, Korea
† Laboratory of Mass Spectrometry, Intercollegiate Faculty of Biotechnology, University of Gdańsk and Medical University of Gdańsk, Kładki 24, 80-822 Gdańsk, Poland
∞ Department of Pharmaceutical Chemistry, Medical University of Gdańsk, Hallera 107, 80-416 Gdańsk, Poland
* Fax: +82 51 6296429. E-mail: [email protected].

Anal. Chem., DOI: 10.1021/acs.analchem.5b02349 (published online September 8, 2015)

ABSTRACT: In this work, the performance of five nature-inspired optimization algorithms, Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), Firefly Algorithm (FA), and Flower Pollination Algorithm (FPA), was compared in molecular descriptor selection for the development of quantitative structure-retention relationship (QSRR) models for 83 peptides that originate from eight model proteins. A matrix of 423 descriptors was used as input, and QSRR models based on the selected descriptors were built using Partial Least Squares (PLS), with the Root Mean Square Error of Prediction (RMSEP) used as the fitness function for their selection. Three performance criteria, prediction accuracy, computational cost, and the number of selected descriptors, were used to evaluate the developed QSRR models. The results show that all five variable selection methods outperform interval PLS (iPLS), sparse PLS (sPLS), and the full PLS model, and that GA is superior because of its lowest computational cost and high accuracy (RMSEP of 5.356%) achieved with a small number of variables (nine descriptors). The GA-QSRR model was validated initially through Y-randomization. In addition, it was successfully validated with an external validation set of 102 peptides originating from Bacillus subtilis proteomes (RMSEP of 22.030%). Its applicability domain was defined, from which it was evident that the developed GA-QSRR model is robust. All the sources of the model's error were identified, thus allowing for further application of the developed methodology in proteomics.

INTRODUCTION

Reversed phase liquid chromatography (RP-LC) coupled with tandem mass spectrometry (MS/MS) is a hyphenated technique widely used in proteomics.1 RP-LC provides powerful separation, whereas MS/MS allows for accurate identification of peptide sequences. Several identification algorithms based on MS/MS data are available.2-6 Analyzed protein samples are identified by finding the best match between experimental spectra and theoretical spectra obtained from a protein database, e.g., with the Sequest2 algorithm, which was used in this work.

One of the most important LC parameters is the retention time of an analyte. Under constant experimental conditions, it depends on molecular geometry. Quantitative structure-retention relationship (QSRR) models allow for its direct prediction from the molecular structure. In combination with MS/MS spectra, predicted retention times can be used to verify identified peptide amino acid sequences, as well as to exclude false-positive (or false-negative) results.7

Several papers consider various modeling approaches to peptide retention time prediction. Guo et al.8 constructed a linear model for the prediction of peptide elution times in RP-LC based on the retention coefficients of the individual amino acids that constitute a polypeptide chain. Krokhin et al.9 developed an online sequence-specific retention calculator (SSRCalc). In a study by Gilar et al.,10 it was expanded to eliminate false-positive identifications, with confirmation or exclusion of proteins based on the presence of only one characteristic peptide. In addition to such empirical molecular descriptors (e.g., retention times of individual amino acids), theoretical descriptors can also be used. However, because more than 4000 of them are available for calculation,11,12 using an appropriate method of variable selection is crucial.

Although variable selection methods are prevalent in many scientific fields, e.g., spectroscopy,13-20 only a few studies have been published on peptide QSRR. To tackle molecular descriptor selection, Put et al.21 utilized Uninformative Variable Elimination–Partial Least Squares13 (UVE-PLS). In UVE-PLS, an initial matrix of descriptors is expanded with noise, and subsequent PLS models are constructed. Variables are retained based on a cutoff value of the ratio of the sample mean to the standard deviation. This might lead to the removal of descriptors that do not seem important on their own but, in linear combination, have a strong influence on retention time. Ukić et al.22 reported such a case in QSRR modeling of a set of carbohydrates analyzed by ion chromatography, where the full PLS model exhibited a lower error than UVE-PLS. Chen et al.16 made a similar observation; they argued that the UVE-PLS procedure cannot noticeably make a complex PLS model simpler because the retained variables still contain noise.


Bodzioch et al.23 presented a comprehensive comparison of different variable selection and regression methods in peptide QSRR model development. For that purpose, the authors developed two QSRR models: one based on a small number of a priori defined descriptors, as reported in a study of 101 peptides by Michel et al.,24 and another obtained by variable selection. In the latter, an experimental descriptor (log Sum(k + 1)AA11) was included in the variable matrix. This novel descriptor was an improvement over the old one (log SumAA11) and required the measurement of fewer amino acids. However, attempts at completely replacing it with a theoretical descriptor were unsuccessful.11 It was the common and highest-contributing variable in both UVE-PLS and Genetic Algorithm (GA) selection prior to Multiple Linear Regression (MLR) and PLS. Parameter optimization was not performed; therefore, it is unknown whether varying the parameters or excluding the experimental descriptor would have influenced the variable selection. Stepwise MLR, a classical variable selection method, was used by Dobricic et al.25 in a QSRR study for the prediction of the retention factor of 22 L-aminoacidic amides of cortienic acids. However, with few observations it performs poorly22 or leads to overfitting26 because of the large number of selected variables. In a recent study, a QSRR model based on theoretical descriptors for the prediction of peptide retention in RP-LC was developed by Golmohammadi et al.,27 utilizing GA coupled with PLS for variable selection. Optimization of the GA parameters was not performed. Consequently, although the authors repeated each GA run ten times, the possibility that varying the parameters could have improved the prediction of the final QSRR model cannot be excluded.

In general, GA coupled with both linear and non-linear regression methods is widely used for variable selection in QSRR model development.27-32 However, an in-depth study on applications of such metaheuristic algorithms in peptide QSRR has, to the best of our knowledge, not been published yet. Therefore, we present a comprehensive comparative study of variable selection in peptide QSRR model development based on nature-inspired optimization algorithms. QSRR models were developed for peptides of eight model proteins obtained from biological samples using theoretical molecular descriptors, with PLS as the regression method of choice because of its ability to manage collinear descriptors and because it is far less computationally intensive than non-linear machine learning methods.33 Namely, Genetic Algorithms34 (GA-PLS), Particle Swarm Optimization35 (PSO-PLS), Artificial Bee Colony36 (ABC-PLS), Firefly Algorithm37 (FA-PLS), and Flower Pollination Algorithm38 (FPA-PLS) were applied. Combinatorial optimization of their parameters was performed, and they were further compared to the interval PLS (iPLS)14 and sparse PLS (sPLS)39 methods. The performance of the full PLS model is also given. Finally, the best model was selected, validated with an external validation set that contains 102 peptides originating from two Bacillus subtilis proteomes, and used for an applicability domain study.

EXPERIMENTAL

Sample preparation. Aqueous solutions (3 mg/mL) of (a) standard proteins: bovine milk β-casein, human serum albumin, bovine serum albumin, chicken egg ovalbumin, ribonuclease B, bovine milk lactoglobulin, bovine myoglobin, and insulin-like growth factor-binding protein 1 (purified from human amniotic fluid using a previously reported procedure40), and (b) Bacillus subtilis proteins (obtained after extraction from endospores of strains 168 and ΔprpE) were first reduced by incubation with 100 mM dithiothreitol (DTT, freshly prepared in 100 mM ammonium bicarbonate buffer, pH 8.5) at 60 °C for 30 minutes. Subsequently, trypsin was added to the samples (enzyme to substrate ratio 1:50). Proteolytic digestion was performed for 12 hours at 37 °C, and trifluoroacetic acid (TFA) was added to quench the reaction. The concentrations of the obtained peptide mixtures were approximately 50 pmol/µL. All the protein standards, chemicals (DTT, TFA, ammonium bicarbonate, and MS-grade acetonitrile), and MS-grade trypsin used in this study were purchased from Sigma-Aldrich (Steinheim, Germany). Water used in the experiments was deionized with a Direct-Q (Millipore, Bedford, MA, USA) system.

RP-LC-MS/MS conditions. Experiments were performed using a Finnigan LC-UV-MS/MS LTQ linear ion trap MS system with an ESI ion source (Thermo Finnigan, San Jose, CA, USA), controlled by Thermo Xcalibur 1.4 software. Peptide mixtures were separated on an XTerra MS C18 (2.1 x 100 mm, 3.5 µm) chromatographic column (Waters, Milford, MA, USA). The mobile phase consisted of solvent A (0.1% aqueous solution of TFA) and solvent B (0.1% TFA in acetonitrile), mixed online. Chromatographic analysis was carried out with a linear 90 minute gradient, from 0 to 60% of solvent B, at a flow rate of 200 µL/min. The MS/MS analyses were performed in the positive ion mode using a source voltage of 4.62 kV, a capillary voltage of 40.97 V, and a capillary temperature of 219.96 °C. MS/MS spectra were generated in the linear ion trap via collision-induced dissociation, with an isolation width of 3 Da and an activation amplitude of 35% of the ejection RF amplitude, which corresponds to 1.58 V.

Protein identification. Experimental retention times (tR,exp) of the peptides were defined at the peak intensity maximum. The acquired MS/MS spectra were automatically searched against the protein database (*.fasta, downloaded from UniProtKB) with the use of the Sequest algorithm included in Bioworks 3.0 (Thermo Finnigan, San Jose, CA, USA). The filtering criteria of Washburn et al.41 were employed in protein identification. Peptides with Xcorr values of at least 1.9, 2.2, and 3.75 for +1, +2, and +3 charged tryptic peptides, respectively, were accepted as correctly identified. The ΔCn values were above 0.08. The results from these analyses, (a) 83 peptides of eight model proteins and (b) 102 peptides of Bacillus subtilis proteomes, were used for QSRR model development. The performed RP-LC-MS/MS analyses are described in detail by Macur et al.42

Model development. Prior to model construction, the geometries of the molecular structures of the peptides were optimized in HyperChem Professional 8 (Hypercube Inc., Gainesville, Florida, USA) software by means of the Molecular Mechanics method43 with the CHARMM (BIO+) force field,44 utilizing the Polak-Ribière conjugate gradient algorithm45 and a root mean square (RMS) gradient value of 0.1 kcal mol-1 Å-1 as the stopping criterion. Although it is usual practice in QSRR studies to perform geometry optimization in vacuum,46,47 we solvated the peptide structures in water boxes of different edge lengths according to their size, in order to properly simulate RP-LC conditions.
The optimized molecular structures were then transferred to Dragon 6.0 (Talete, Milano, Italy) software, and an initial matrix of 4,885 molecular descriptors was calculated.

In order to reduce this matrix, descriptors with constant or near-constant values, as well as highly collinear descriptors, were removed iteratively and pairwise. In this way, the matrix was reduced to 423 variables. In order to optimize the number of PLS latent variables, the dataset was split randomly in a total of 1000 iterations; in each, the number of latent variables with the lowest Root Mean Square Error of Prediction (RMSEP) was selected. Three latent variables had the highest frequency, and that number was deemed optimal. The reduced matrix was then uniformly divided into a training set with 50 observations and a testing set with 33 observations using the Kennard and Stone algorithm.48 After separating the datasets, subsets of five to fifteen molecular descriptors were selected in a manner that minimizes RMSEP:

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i(\mathrm{obs.}) - y_i(\mathrm{pred.})}{y_i(\mathrm{obs.})}\right)^{2}} \times 100\ [\%] \qquad (1)$$
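As an illustration of how a candidate descriptor subset is scored during selection, the following is a minimal Python sketch (not the authors' MATLAB implementation) of the relative RMSEP of equation (1) computed for a three-latent-variable PLS model. scikit-learn's PLSRegression (NIPALS-based) stands in for the SIMPLS algorithm used in this work, and the random matrices merely stand in for the 423-descriptor data.

```python
# Minimal sketch: evaluating a candidate descriptor subset with the
# relative RMSEP of equation (1) and a three-latent-variable PLS model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep_percent(y_obs, y_pred):
    """Equation (1): relative root mean square error of prediction, in %."""
    rel = (y_obs - y_pred) / y_obs
    return np.sqrt(np.mean(rel ** 2)) * 100.0

def subset_fitness(subset, X_train, y_train, X_test, y_test, n_lv=3):
    """Fitness of a descriptor subset = RMSEP of a PLS model built on it."""
    pls = PLSRegression(n_components=n_lv, scale=True)
    pls.fit(X_train[:, subset], y_train)
    y_pred = pls.predict(X_test[:, subset]).ravel()
    return rmsep_percent(y_test, y_pred)

# Hypothetical example with random data in place of the real descriptor matrix
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(50, 423)), rng.uniform(10, 80, 50)
X_test, y_test = rng.normal(size=(33, 423)), rng.uniform(10, 80, 33)
print(subset_fitness([0, 5, 17, 42, 100, 200, 311, 390, 410],
                     X_train, y_train, X_test, y_test))
```

All five metaheuristics below use this kind of subset-to-RMSEP mapping as their fitness function.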

In order to derive the PLS latent variables, the SIMPLS49 algorithm was utilized because of its simplicity and speed. All calculations were performed using MATLAB 2014b (MathWorks, Natick, Massachusetts, USA) and R 3.2.1 (R Foundation for Statistical Computing, Vienna, Austria) software on a personal computer equipped with an AMD Athlon II X2 250 processor with 3.00 GHz frequency and 5 GB (4.75 GB usable) of RAM.

THEORETICAL

Genetic Algorithm (GA). GA is an optimization algorithm based on Darwin's principles of natural selection, first introduced by Holland.34 The foundation of the algorithm is survival of the fittest: a population of candidate solutions is evolved toward better solutions. Elite units from the population survive into the next generation, whereas the remaining units are replaced with children resulting from breeding (crossover). Similar to natural evolution, there is a possibility of mutation in GA. In this work, uniform creation, scattered crossover, uniform selection, and uniform mutation functions were used. The functions were written in such a manner that there cannot be duplicate variables within one unit. The algorithm parameters were optimized, whereas the stopping criterion was 100 generations without improvement in fitness.

Particle Swarm Optimization (PSO). PSO is a population-based optimization algorithm developed by Kennedy and Eberhart.35 The PSO population consists of random solutions (particles). Unlike in other population-based algorithms, each particle is also associated with a velocity. The positions and velocities of the particles are defined by equations (2) and (3), where c1 and c2 represent positive constants, and rnd1() and rnd2() represent two random number generation functions uniformly distributed in [0,1]. Parameter xi represents the i-th particle, pi represents the best previous position of the i-th particle, pb represents the best particle, and vi represents the velocity of particle i. The particles therefore fly in the direction of better solutions.

$$v_i = v_i + c_1\,\mathrm{rnd}_1()\,(p_i - x_i) + c_2\,\mathrm{rnd}_2()\,(p_b - x_i) \qquad (2)$$

$$x_i = x_i + v_i \qquad (3)$$
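A minimal sketch of the update in equations (2) and (3) is given below. Treating rounded, de-duplicated particle coordinates as descriptor column indices is one possible encoding for subset selection and is an assumption of this illustration, not a detail reported here; c1 = c2 = 2 are common defaults rather than values from this work.

```python
# Illustrative PSO velocity/position update, equations (2) and (3).
import numpy as np

rng = np.random.default_rng(1)

def pso_step(x, v, p_best, g_best, c1=2.0, c2=2.0):
    """One update for a swarm; x, v, p_best: (swarm_size, dim), g_best: (dim,)."""
    rnd1 = rng.random(x.shape)   # rnd1() in equation (2)
    rnd2 = rng.random(x.shape)   # rnd2() in equation (2)
    v = v + c1 * rnd1 * (p_best - x) + c2 * rnd2 * (g_best - x)   # eq. (2)
    x = x + v                                                     # eq. (3)
    return x, v

# toy usage: 450 particles searching 9-dimensional positions in [0, 423)
x = rng.uniform(0, 423, size=(450, 9))
v = np.zeros_like(x)
x, v = pso_step(x, v, p_best=x.copy(), g_best=x[0])
subset = np.unique(np.clip(np.round(x[0]).astype(int), 0, 422))  # hypothetical mapping
```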

Artificial Bee Colony Algorithm (ABC). The ABC algorithm is a powerful nature-inspired optimization algorithm developed by Karaboga and Basturk36 based on the intelligent behavior of a honeybee swarm. There are three groups of bees: employed bees, onlookers, and scouts. Employed bees are those that go to a previously visited food source, onlooker bees are those waiting in the dance area to choose a food source, and scout bees are those that perform random searches. When a food source is exhausted, its employed and onlooker bees become scouts. Onlooker bees choose food sources based on the probability value associated with each food source, pi, which is calculated according to equation (4), where fiti represents the fitness value of solution i evaluated by its employed bee, and SN is the number of food sources, equal to the number of employed bees (BN).

$$p_i = \frac{fit_i}{\sum_{n=1}^{SN} fit_n} \qquad (4)$$

A new potential food source position is produced from the old one according to equation (5), where k ∈ {1, 2, ..., BN} and j ∈ {1, 2, ..., D} are randomly chosen indexes, and φi,j is a random number between -1 and 1.

$$v_{ij} = x_{ij} + \varphi_{ij}\left(x_{ij} - x_{kj}\right) \qquad (5)$$
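The selection and neighbour-search rules in equations (4) and (5) can be sketched as follows; the fitness transformation 1/(1 + RMSEP) and the continuous food-source encoding are illustrative assumptions rather than details taken from this work.

```python
# Sketch of ABC onlooker selection (eq. 4) and neighbour generation (eq. 5).
import numpy as np

rng = np.random.default_rng(2)

def onlooker_probabilities(fitness):
    """Equation (4): p_i = fit_i / sum_n fit_n (fitness positive, larger = better)."""
    fitness = np.asarray(fitness, dtype=float)
    return fitness / fitness.sum()

def neighbour_source(foods, i):
    """Equation (5): v_ij = x_ij + phi_ij * (x_ij - x_kj) for one random j, k != i."""
    sn, dim = foods.shape
    j = rng.integers(dim)                              # random dimension
    k = rng.choice([s for s in range(sn) if s != i])   # random other food source
    phi = rng.uniform(-1.0, 1.0)
    candidate = foods[i].copy()
    candidate[j] = foods[i, j] + phi * (foods[i, j] - foods[k, j])
    return candidate

foods = rng.uniform(0, 423, size=(120, 9))        # colony of 120 food sources
fitness = 1.0 / (1.0 + rng.uniform(5, 40, 120))   # placeholder fitness values
p = onlooker_probabilities(fitness)
chosen = rng.choice(len(foods), p=p)              # onlooker picks a source
new_candidate = neighbour_source(foods, chosen)
```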

Firefly Algorithm (FA). FA is an optimization algorithm introduced by Yang37 based on the behavior and flashing light of fireflies. The fundamental functions of such light flashes are to attract mating partners and potential prey, as well as to serve as a warning mechanism. The light intensity I at a particular distance r from the light source is expressed by the inverse square law:

$$I \propto \frac{1}{r^{2}} \qquad (6)$$

The flashing light in FA is associated with the objective function that is to be optimized. Thereby, FA uses three idealized rules:
1. All fireflies are unisex, so that one firefly is attracted to other fireflies regardless of sex.
2. The attractiveness of a firefly is proportional to its brightness.
3. The brightness of a firefly is affected or determined by the fitness function.

Flower Pollination Algorithm (FPA). FPA is an optimization algorithm developed by Yang38 based on the characteristics of naturally occurring pollination of flowering plants. The procedure is summarized as follows:
1. Create an initial population of n flower/pollen gametes with random solutions.
2. Find the best solution g in the initial population.
3. Define a probability switch between 0 and 1.
4. Define a stopping criterion, such as a fixed number of generations, iterations, or accuracy.
5. For all the flowers in the population, if a random number between 0 and 1 is lower than the probability switch, draw a step vector L that obeys a Lévy distribution, and calculate global pollination as:

$$x_i^{t+1} = x_i^{t} + L\,(g - x_i^{t}) \qquad (7)$$


6. If a random number between 0 and 1 is higher than the probability switch, draw ε from a uniform distribution in [0,1], and calculate local pollination as:

$$x_i^{t+1} = x_i^{t} + \varepsilon\,(x_j^{t} - x_k^{t}) \qquad (8)$$

7. Evaluate the new solutions and, if they are better, update them in the population.
8. Find the current best solution g.
9. Repeat steps 5 to 8 until the stopping criterion is reached.

For FPA, the stopping criterion was the number of iterations (1000), whereas the other parameters were varied.

Nature-inspired algorithms' parameter optimization. Combinatorial optimization of the algorithm parameters was performed. Parameters were varied over intervals with defined increments. In all optimization iterations, one parameter was varied, whereas the others were kept constant. Each combination was run in three replicates. Optimization ran until all parameter combinations were exhausted for each algorithm under the conditions detailed below. CPU time was measured with the built-in MATLAB function cputime in order to evaluate the computational cost of the run with optimal conditions.

Considering that GA is a population-based search method, population size is an important GA parameter because it represents the number of PLS models. One of the first comprehensive studies of GA parameters was published by De Jong51 in his PhD thesis. De Jong reported that a small population size improved performance at the beginning of the GA evolution process, whereas a large population size improved performance in the long term. A low mutation rate was suitable for average fitness, whereas a high mutation rate was suitable for best fitness. Therefore, population size was varied from 50 to 500 with an increment of 50, and mutation rate was varied from 0.2 to 0.8 with an increment of 0.1.

For the PSO algorithm, only one parameter, the swarm size, was optimized. Poli et al.52 reported that swarm size values in a range of 20-50 are quite common. However, the swarm size of the PSO algorithm directly affects the direction toward solutions, its robustness, and the computational cost.53 Therefore, it should be varied over a wider range: because it represents the PSO population size, and similar to GA, too small a swarm is likely to result in convergence to a local solution, whereas too large a swarm has a higher computational cost, slowing convergence in the process. Therefore, swarm size was varied from 100 to 500 with an increment of 50.

In the case of the ABC algorithm, again only one parameter, the colony size, was optimized. Because colony size is the ABC parameter equivalent to the swarm size in PSO, it was varied over a broader interval: from 20 to 200 with an increment of 20.

In the optimization of FA, three parameters, the number of fireflies, the randomization parameter (α), and the attractiveness variation (γ), were varied. The number of fireflies was varied in a narrow interval of low values because the convergence of the FA algorithm is inversely proportional to the number of fireflies.54 Therefore, it was varied from 10 to 30 with an increment of 10. The randomization parameter (α) weights the randomization term of the firefly movement defined by equation (9), where xi and xj represent the positions of fireflies i and j, respectively, β0 represents the initial attractiveness value, γ represents the attractiveness variation, rij represents the distance between fireflies i and j, and rand() represents a random number generation function uniformly distributed in [0,1]. The value of the randomization factor is, in most cases, between 0 and 1 (highly random). In order to reduce the computational cost of optimization, it was not varied throughout the entire interval, but from 0.2 to 0.8 with an increment of 0.1.

$$x_i = x_i + \beta_0\, e^{-\gamma r_{ij}^{2}}\,(x_j - x_i) + \alpha\left(\mathrm{rand}() - \tfrac{1}{2}\right) \qquad (9)$$
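A minimal sketch of the movement rule in equation (9) is shown below; β0 = 1 is an assumed default not stated in the text, while α = 0.2 and γ = 0.8 correspond to the optimal values reported later in this work, and "brighter" is taken to mean lower RMSEP.

```python
# Illustrative firefly movement, equation (9).
import numpy as np

rng = np.random.default_rng(3)

def move_firefly(x_i, x_j, alpha=0.2, beta0=1.0, gamma=0.8):
    """Move firefly i toward a brighter firefly j."""
    r_ij = np.linalg.norm(x_i - x_j)              # distance between fireflies i and j
    beta = beta0 * np.exp(-gamma * r_ij ** 2)     # attractiveness term
    return x_i + beta * (x_j - x_i) + alpha * (rng.random(x_i.shape) - 0.5)

# toy usage: firefly 0 moves toward a brighter firefly 1 in a 9-dimensional space
fireflies = rng.uniform(0, 423, size=(30, 9))
fireflies[0] = move_firefly(fireflies[0], fireflies[1])
```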

The attractiveness variation parameter (the fixed light absorption coefficient) is of crucial importance in determining the convergence speed, as well as how the FA algorithm behaves.37 Depending on the application, it may take values from 0.01 up to 100. However, for most cases its values lie in [0,1], where 0.0 is regarded as low, 0.75 as medium, and 1.0 as high.55 Thus, this parameter was varied from 0.1 to 1.0 with an increment of 0.1.

Two FPA parameters were optimized, the population size and the probability switch. Population size was varied from 20 to 200 with an increment of 20. Although flower pollination can occur both locally and globally, adjacent flower patches or nearby flowers are more likely to be pollinated by local flower pollen. To emulate this, a probability switch p is used to switch between the two. The default value of the probability switch is 0.5, whereas 0.8 is optimal for most applications.38 Thus, the probability switch was varied from 0.2 to 0.8 with a 0.2 increment.

Finally, for sPLS, the thresholding parameter (η) is a key parameter and should be optimized.39 Its values lie between 0 and 1; therefore, we varied it in that range with an increment of 0.1.

The molecular descriptor subset from the method that outperformed the rest in both predictive ability and computational cost was selected and validated.

Model validation. The model applied to both the testing and external validation sets was validated preliminarily by means of Y-randomization. Y-randomization56 is a procedure based on random permutation of the y-variable observations in order to observe the effect of chance in fitting the given data. The predictive ability of the best model was further validated utilizing an external validation set of 102 peptides from two Bacillus subtilis proteomes; in this way, the possibility of developing an overoptimistic model can be avoided.57 In order to support this, an applicability domain58 was graphically defined by constructing a Williams plot. The Williams plot is a graphical depiction of the dependence between the standardized residuals of the model and the leverage, which corresponds to the diagonal of the Hat matrix:59

$$h = \mathrm{diag}\left(\mathbf{X}_2\left(\mathbf{X}_1^{T}\mathbf{X}_1\right)^{-1}\mathbf{X}_2^{T}\right) \qquad (10)$$

where X1 represents the training set variable matrix, and X2 represents the training, testing, or validation set matrix. The decision on whether a respective observation is within or outside of the applicability domain was based on warning limits: three multiples of the standard deviation and the critical leverage value h*:60

$$h^{*} = 3\,\frac{K + 1}{N} \qquad (11)$$

where K is the number of variables and N is the number of observations.
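A compact sketch of equations (10) and (11) is given below; the pseudo-inverse is used for numerical stability (the plain inverse is implied by equation (10)), and the random matrices are placeholders for the nine selected descriptors.

```python
# Leverage-based applicability domain check, equations (10) and (11).
import numpy as np

def leverages(X_train, X_new):
    """Equation (10): diagonal of X_new (X_train' X_train)^-1 X_new'."""
    core = np.linalg.pinv(X_train.T @ X_train)      # (X1^T X1)^-1, pseudo-inverse
    return np.einsum('ij,jk,ik->i', X_new, core, X_new)

def critical_leverage(n_vars, n_obs):
    """Equation (11): h* = 3(K + 1)/N."""
    return 3.0 * (n_vars + 1) / n_obs

# toy usage with random data in place of the selected descriptor matrices
rng = np.random.default_rng(4)
X_train = rng.normal(size=(50, 9))
X_test = rng.normal(size=(33, 9))
h = leverages(X_train, X_test)
h_star = critical_leverage(n_vars=9, n_obs=50)
outside_domain = h > h_star      # leverage flag plotted on the Williams plot X-axis
```

The residual-based limit (±3 standard deviations of the standardized residuals) supplies the Y-axis flag of the Williams plot.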

RESULTS AND DISCUSSION

The developed QSRR models were derived from the peptides obtained from the digestion of eight model proteins. Their amino acid sequences were identified in the RP-LC-MS/MS analysis followed by a Sequest database search. The total of 83 model peptides with the highest Sequest scores, 50 for training and 33 for testing, were uniformly selected by the Kennard and Stone algorithm.

In order to utilize the five variable selection methods, combinatorial optimization of their parameters was performed. For GA-PLS, it was found that the optimal population size and mutation rate were 450 and 0.2, respectively; application of those values yielded a model with an RMSEP value of 5.356%. In the case of PSO-PLS, the optimal swarm size was 450, which yielded a 5.336% RMSEP. In turn, the lowest RMSEP for the ABC-PLS model (6.956%) was found when the colony size was 120. For FA-PLS, it was found that the optimal number of fireflies and randomization parameter were 30 and 0.2, respectively, whereas the optimal value of the FA attractiveness variation parameter was 0.8. The optimal values of those FA parameters yielded a model with an RMSEP value of 6.908%. For FPA-PLS, the optimal population size was 200, whereas the optimal FPA probability switch value was 0.2; these parameters yielded a model with an RMSEP value of 7.113%. Finally, for sPLS, the optimal thresholding parameter value was 0.1, for an RMSEP value of 26.360%.

Table 1. Nature-inspired algorithms' performance summary.

Method             n     RMSEP / %   CPU time / s
GA-PLS             9       5.356        106.38
PSO-PLS           15       5.336        652.50
ABC-PLS            7       6.956        222.06
FA-PLS             9       6.908        179.70
FPA-PLS            9       7.113       1516.36
iPLS (forward)    18      25.175        n.a.
iPLS (backward)   58      20.263        n.a.
sPLS              75      26.360        n.a.
Full PLS model   423      36.990        n.a.

Method abbreviations are explained in the text; n – number of selected variables.

The results of the algorithms' performance are summarized in Table 1. It may be observed that PSO-PLS gives the lowest RMSEP; however, it is only slightly lower than that of GA-PLS, for a more complex model and at an increased computational cost. With nine variables, GA-PLS is therefore considered a better model than PSO-PLS, whose best model required 15 variables. The remaining methods yielded models with considerably higher errors. To the authors' knowledge, this is the first application of the ABC and FPA algorithms in peptide QSRR, and in variable selection in general, using PLS for regression. Recently, Goodarzi and Coelho20 applied FA in near-infrared (NIR) spectroscopy, with PLS used for regression. They compared it to GA and PSO and concluded that, with FA, the lowest number of wavelengths is selected. Moreover, the FA algorithm was used in multivariate calibration of NIR spectra of whole grain wheat samples by de Paula et al.,61 where it was shown to yield a highly accurate model (prediction error < 1%). In this work, for nine selected molecular descriptors, the FA-PLS model yielded an RMSEP of 6.908%. Therefore, a more appropriate application of FA could be variable selection in spectroscopic studies, rather than QSRR.

The nature-inspired algorithms greatly outperform the classical iPLS,14 sPLS,39 and the full PLS model. Forward iPLS yielded a model with 18 selected variables and an RMSEP of 25.175%, whereas backward iPLS yielded a model with 58 selected variables and an RMSEP of 20.263%. sPLS yielded a model with 75 selected variables and an RMSEP of 26.360%. For iPLS, this is not an odd result because it was specifically developed for NIR14,15 and NMR17 spectroscopy, whereas sPLS does not seem to be suitable in the case of QSRR.

The model summary (Figure 1) for the GA-PLS model shows that both the X-space and the Y-space are highly explained.

Figure 1. PLS model summary for the X-space and Y-space of the GA-PLS model for three latent variables.

Its predictive ability is demonstrated in Figure 2. The differences between the predicted and experimental retention times for the best model, as may be seen from Table S1, ranged from 0 to 3 minutes within the 90 minute analysis time.

Figure 2. Predictive ability of the best GA-PLS model based on a testing set of peptides from model proteins (n = 33).


Table 2. Correlation matrix of X-variables between themselves and with retention time for the best GA-PLS model.

Variables   tR / min   R4p      DISPp    ChiG/D   EtaF,A   HATS7m   IC1      GATS6e   GATS4e   Uc
tR / min     1
R4p          0.769     1
DISPp       -0.103    -0.103    1
ChiG/D       0.695     0.789   -0.185    1
EtaF,A       0.704     0.745   -0.003    0.694    1
HATS7m      -0.739    -0.626    0.014   -0.592   -0.672    1
IC1         -0.115    -0.098    0.255   -0.117    0.185    0.116    1
GATS6e       0.157    -0.022   -0.022   -0.059   -0.262   -0.118   -0.066    1
GATS4e      -0.097    -0.055    0.110   -0.136   -0.316   -0.113   -0.104   -0.026    1
Uc           0.737     0.737    0.070    0.647    0.963   -0.684    0.316   -0.196   -0.205    1

All abbreviations are explained in the text.

Considering the low error, it can be observed that a small difference between the experimental and predicted retention times indicates a high probability that the amino acid sequence was correctly identified by means of the Sequest algorithm. The best GA-PLS model consists of the R autocorrelation of lag 4 weighted by polarizability (R4p), the displacement value weighted by polarizability (DISPp), the Randić-like index from the distance/distance matrix (ChiG/D), the Eta average functionality index (EtaF,A), the leverage-weighted autocorrelation of lag 7 weighted by mass (HATS7m), the information content index for neighborhood symmetry of 1st order (IC1), the Geary autocorrelation of lag 6 weighted by Sanderson electronegativity (GATS6e), the Geary autocorrelation of lag 4 weighted by Sanderson electronegativity (GATS4e), and the unsaturation count (Uc). A more detailed description of all selected molecular descriptors can be found in ref 12.

As seen from the VIP62 (variable importance to projection) plot (Figure 3) of the best GA-PLS model, four molecular descriptors have VIP values higher than the threshold of 1,63-65 and five lie below it. However, proper threshold values differ under conditions of low proportion, high correlation, or equal coefficient structure.64

Figure 3. PLS variable importance to projection for the best GA-PLS model.
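For reference, a sketch of the standard VIP calculation for a single-response PLS model is shown below, written against scikit-learn attribute names; this is the textbook formula, not code from this work.

```python
# Variable importance to projection (VIP) for a single-response PLS model.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls):
    T = pls.x_scores_        # (n_samples, n_components)
    W = pls.x_weights_       # (n_features, n_components)
    Q = pls.y_loadings_      # (n_targets, n_components)
    p, _ = W.shape
    ssy = (Q[0] ** 2) * np.einsum('ia,ia->a', T, T)    # y-variance explained per LV
    w_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
    return np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())

# toy usage with random data standing in for the nine selected descriptors
rng = np.random.default_rng(5)
X, y = rng.normal(size=(50, 9)), rng.uniform(10, 80, 50)
vip = vip_scores(PLSRegression(n_components=3).fit(X, y))
important = vip > 1.0        # the commonly used threshold of 1
```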

Indeed, the results of the correlation analysis (Table 2) show the presence of collinear descriptors. Regardless of the fact that GA selected five molecular descriptors with low variable importance to projection, they cannot be discarded. Therefore, VIP is not appropriate for further variable selection and interpretation here.

Y-randomization with 1000 iterations was performed for both the testing and external validation sets. The training set retention times were permuted, and for each iteration a GA-PLS run with the optimal parameters was performed on the scrambled y-observations. It yielded average RMSEP values of 22.97% and 32.42% (Figure 4) on the testing and external validation sets, respectively. This further confirms the predictive ability of the model, given that the randomly permuted y-observations explained distinctly less variance than the y-observations sorted in the proper order.
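A simplified sketch of this Y-randomization check is given below. For brevity, the GA search that was repeated on every permuted response vector is replaced by refitting PLS on a fixed descriptor subset, so the sketch only illustrates the permutation-and-refit loop, not the full procedure.

```python
# Y-randomization: permute training responses, refit, and record test error.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep_percent(y_obs, y_pred):
    rel = (y_obs - y_pred) / y_obs
    return np.sqrt(np.mean(rel ** 2)) * 100.0

def y_randomization(X_train, y_train, X_test, y_test, n_iter=1000, n_lv=3, seed=0):
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_iter):
        y_perm = rng.permutation(y_train)      # scramble the responses
        pls = PLSRegression(n_components=n_lv).fit(X_train, y_perm)
        errors.append(rmsep_percent(y_test, pls.predict(X_test).ravel()))
    return np.mean(errors)   # should be far larger than the real model's RMSEP

# toy usage with random data in place of the selected descriptors
rng = np.random.default_rng(6)
X_train, y_train = rng.normal(size=(50, 9)), rng.uniform(10, 80, 50)
X_test, y_test = rng.normal(size=(33, 9)), rng.uniform(10, 80, 33)
print(y_randomization(X_train, y_train, X_test, y_test, n_iter=50))
```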

Figure 4. Y-randomization plot. Testing set (orange lines) and validation set (royal blue lines).

Moreover, the results show (Table S2) that, when applied to the external validation set of 102 peptides originating from the investigated Bacillus subtilis proteomes, the model has an error of 22.030%. The trend remains linear (Figure 5), but the points are not perfectly scattered along the y = x line. The deviation originates from the inclusion of peptides with lower identification scores, as well as from identifications based on only one occurrence, which, by proteomic standards,42 could mean that they are not present in the sample.

Figure 5. Predictive ability of the best GA-PLS model for an external validation set of peptides from Bacillus subtilis proteomes. Peptides with high identification scores and/or identified based on more than one occurrence are depicted by upward pointing triangles; peptides with lower identification scores and/or identified based on one occurrence are depicted by downward pointing triangles. (n = 102)

Thus, the structures used to calculate the descriptors for those peptides were not accurate, resulting in inaccurate predictions. The developed GA-PLS model was shown to be superior to the GA-PLS model developed by Bodzioch et al.23 This may very well be caused by their use of fixed GA parameter values without optimization, as well as by the inclusion of an experimental descriptor in their variable selection procedure. Finally, an applicability domain was defined for the GA-PLS model, which is depicted in the Williams plot (Figure 6) with ±3σ and h* (0.6) as warning limits.


Figure 6. Williams plot for the best GA-PLS model with ±3σ and h* (0.6) as warning limits. Diamonds represent training set observations, upward pointing triangles represent testing set observations, and downward pointing triangles represent validation set observations. (n = 152)

All the training set observations lie well within the applicability domain, whereas the testing set has two outliers that are structurally distant from the training set. The amino acid sequences of the two peptides in question were LDELR and TCVADESAENCD. As can be seen from Table S1, the retention times of those peptides were nevertheless very well predicted, which confirms the robustness of the model. Concerning the external validation set observations, 13 outliers are present, three of which are very close to the critical leverage and are fairly well predicted. However, ten of the outlying peptides are outside the warning limits and are considered strong outliers. All are very long peptides with long experimentally determined retention times. There are two causes for their large absolute errors (Table S2): poor sequence identification and the fact that larger peptides require non-linear regression methods for more accurate prediction.66

CONCLUSIONS

In the presented study, five nature-inspired global optimization algorithms were used for the selection of molecular descriptors for the derivation of QSRR models. GA greatly outperformed the other methods and yielded the best model, with an RMSEP of 5.356%, a smaller number of selected variables, and the lowest computational cost. It was shown that a more appropriate application of FA could be variable selection in spectroscopy, whereas this is the first application of ABC and FPA in QSRR and in variable selection in general. All of the nature-inspired methods outperformed iPLS and sPLS. The results of the developed methodology have shown that a properly optimized GA exhibits exceptional performance in the selection of a subset of molecular descriptors in QSRR. Coupling it with non-linear machine learning methods should lead to the development of highly predictive models, which is the focus of future work. In that case, the absolute error can serve as an indicator of false-positive or false-negative identification.

ASSOCIATED CONTENT

Supporting Information. Experimental chromatographic data and values of the selected molecular descriptors for the GA-PLS model. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
*Phone: +82 51 6296453. Fax: +82 51 6296429. E-mail: [email protected].

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2013R1A1A1A05004852).

REFERENCES

(1) Domon, B.; Aebersold, R. Science 2006, 312, 212-217.
(2) Eng, J. K.; McCormack, A. L.; Yates III, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(3) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.


(4) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964.
(5) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.
(6) Tabb, D. L.; Fernando, C. G.; Chambers, M. C. J. Proteome Res. 2007, 6, 654-661.
(7) Bączek, T.; Kaliszan, R. Proteomics 2009, 9, 835-847.
(8) Guo, D.; Mant, C. T.; Taneja, A. K.; Parker, M. R.; Hodges, R. S. J. Chromatogr. 1986, 359, 499-518.
(9) Krokhin, O. V.; Craig, R.; Spicer, V.; Ens, W.; Standing, K. G.; Beavis, R. C.; Wilkins, J. A. Mol. Cell. Proteomics 2004, 3, 908-919.
(10) Gilar, M.; Olivova, P.; Chakraborty, A. B.; Jaworski, A.; Geromanos, A. J.; Gelber, J. C. Electrophoresis 2009, 30, 1157-1167.
(11) Bodzioch, K.; Bączek, T.; Kaliszan, R.; Vander Heyden, Y. J. Pharm. Biomed. Anal. 2009, 50, 563-569.
(12) Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics; Wiley-VCH: Weinheim, 2009.
(13) Centner, V.; Massart, D. L.; de Noord, O. E. Anal. Chem. 1996, 68, 3851-3858.
(14) Nørgaard, L.; Saudland, A.; Wagner, K.; Nielsen, J. P.; Munck, L.; Engelsen, S. B. Appl. Spectrosc. 2000, 54, 413-419.
(15) Leardi, R.; Nørgaard, L. J. Chemom. 2004, 18, 486-497.
(16) Chen, D.; Shao, X. G.; Hu, B.; Su, Q. D. Analyst 2004, 129, 664-669.
(17) Kristensen, K.; Savoroni, F.; Ravn-Haren, G.; Paulsen, M.; Markowski, J.; Larsen, F. H.; Dragsted, L. O.; Engelsen, S. B. Metabolomics 2010, 6, 129-136.
(18) Gosselin, R.; Rodrigue, D.; Duchesne, C. Chemom. Intell. Lab. Syst. 2010, 100, 12-21.
(19) Xiaobo, Z.; Jiewen, Z.; Povey, M. J. W.; Holmes, M.; Hanpin, M. Anal. Chim. Acta 2010, 667, 14-32.
(20) Goodarzi, M.; Coelho, L. D. S. Anal. Chim. Acta 2014, 852, 20-27.
(21) Put, R.; Daszykowski, M.; Bączek, T.; Vander Heyden, Y. J. Proteome Res. 2006, 5, 1618-1625.
(22) Ukić, Š.; Novak, M.; Žuvela, P.; Avdalović, N.; Liu, Y.; Buszewski, B.; Bolanča, T. Chromatographia 2014, 77, 985-996.
(23) Bodzioch, K.; Durand, A.; Kaliszan, R.; Bączek, T.; Vander Heyden, Y. Talanta 2010, 81, 1711-1718.
(24) Michel, M.; Bączek, T.; Studzińska, S.; Bodzioch, K.; Jonsson, T.; Kaliszan, R.; Buszewski, B. J. Chromatogr. A 2007, 1175, 49-54.
(25) Dobricic, V.; Nikolic, K.; Vladimirov, S.; Cudina, O. Eur. J. Pharm. Sci. 2014, 56, 105-112.
(26) Saeys, Y.; Inza, J.; Larrañaga, P. Bioinformatics 2007, 23, 2507-2517.
(27) Golmohammadi, H.; Dashtbozorgi, Z.; Vander Heyden, Y. Chromatographia 2015, 78, 7-19.
(28) Goodarzi, M.; Jensen, R.; Vander Heyden, Y. J. Chromatogr. B 2012, 910, 84-94.
(29) Hancock, T.; Put, R.; Coomans, D.; Vander Heyden, Y.; Everingham, Y. A. Chemom. Intell. Lab. Syst. 2005, 76, 185-196.
(30) Szaleniec, M.; Dudzik, A.; Pawul, M.; Kozik, B. J. Chromatogr. A 2009, 1216, 6224-6235.
(31) D'Archivio, A. A.; Giannitto, A.; Maggi, M. A. J. Chromatogr. A 2013, 1298, 118-131.
(32) Gupta, V. K.; Khani, H.; Ahmadi-Roudi, B.; Mirakhorli, S.; Fereyduni, E.; Agarwal, S. Talanta 2011, 83, 1014-1022.
(33) Tian, F.; Yang, L.; Lv, F.; Zhou, P. Anal. Chim. Acta 2009, 644, 10-16.
(34) Holland, J. H. Adaptation in Natural and Artificial Systems; University of Michigan Press: Ann Arbor, Michigan, 1975.
(35) Kennedy, J.; Eberhart, R. Proc. - Int. Conf. Neural Networks 1995, 4, 1942-1948.
(36) Karaboga, D.; Basturk, B. A. J. Global Opt. 2007, 39, 459-471.
(37) Yang, X. S. Lect. Notes Comput. Sci. 2009, 5792, 169-178.
(38) Yang, X. S.; Karamonglu, M.; He, X. Procedia Comput. Sci. 2013, 18, 861-868.

(39) Chun, H.; Keles, S. J. R. Stat. Soc. 2010, 72, 3-25.
(40) Sala, A.; Capaldi, S.; Visai, L.; Minchiotti, L.; Galliano, M.; Monaco, H. L. J. Biol. Chem. 2005, 33, 29812-29819.
(41) Washburn, M. P.; Wolters, D.; Yates III, J. R. Nature Biotechnol. 2001, 19, 242-247.
(42) Macur, K.; Bączek, T.; Kaliszan, R.; Temporini, C.; Corana, F.; Massolini, G.; Grzenkowicz-Wydra, K.; Obuchowski, M. J. Biomed. Biotechnol. 2010, DOI: 10.1155/2010/718142.
(43) Burkert, U.; Allinger, N. L. Molecular Mechanics; American Chemical Society: Washington, DC, 1982.
(44) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. J. Comp. Chem. 1983, 4, 187-217.
(45) Polak, E.; Ribière, G. ESAIM: Math. Modell. Numer. Anal. 1969, 3, 35-43.
(46) Turowski, M.; Morimoto, T.; Kimata, K.; Moude, H.; Ikegami, T.; Hosoya, K.; Tanaka, N. J. Chromatogr. A 2001, 911, 177-190.
(47) Studzińska, S.; Molíkova, M.; Kosobucki, P.; Jandera, P.; Buszewski, B. Chromatographia 2011, 73, 535-544.
(48) Kennard, R. W.; Stone, L. A. Technometrics 1969, 11, 137-148.
(49) de Jong, S. Chemom. Intell. Lab. Syst. 1993, 18, 251-263.
(50) De Jong, K. A. Analysis of the Behavior of a Class of Genetic Adaptive Systems. Ph.D. Thesis, University of Michigan, August 1975.
(51) Poli, R.; Kennedy, J.; Blackwell, T. Swarm Intell. 2007, 1, 33-57.
(52) Chen, D.; Zhao, C. Appl. Soft Comput. 2009, 9, 39-48.
(53) Yang, X. S. Int. J. Bio-Inspir. Comput. 2010, 2, 78-84.
(54) Marichelvam, M. K.; Prabaharan, T.; Yang, X. S. IEEE Trans. Evol. Comput. 2014, 18, 301-305.
(55) Rücker, C.; Rücker, G.; Meringer, M. J. Chem. Inf. Model. 2007, 47, 2345-2357.
(56) Gramatica, P. QSAR Comb. Sci. 2007, 26, 697-701.
(57) Weaver, S.; Gleeson, M. P. J. Mol. Graph. Modell. 2008, 26, 1315-1326.
(58) Atkinson, A. C. Plots, Transformations and Regression. An Introduction to Graphical Methods of Diagnostic Regression Analysis; Clarendon Press: Oxford, 1985.
(59) Bujak, R.; Struck-Lewicka, W.; Kaliszan, M.; Kaliszan, R.; Markuszewski, M. J. J. Pharm. Biomed. Anal. 2015, 108, 29-37.
(60) de Paula, L. C. M.; Soares, A. S.; de Lima, T. W.; Delbem, A. C. B.; Coelho, C. J.; Filho, A. R. G. PLoS One 2014, 9, DOI: 10.1371/journal.pone.0114145.
(61) Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109-130.
(62) Chong, I. G.; Jun, C. H. Chemom. Intell. Lab. Syst. 2005, 78, 103-112.
(63) Mehmood, T.; Liland, K. H.; Snipen, L.; Sæbø, S. Chemom. Intell. Lab. Syst. 2012, 118, 62-69.
(64) Ghasemi, J.; Saaidpour, S. J. Chromatogr. Sci. 2009, 47, 156-163.
(65) Shinoda, K.; Sugimoto, M.; Tomita, M.; Ishihama, Y. Proteomics 2008, 8, 787-798.
