
Article

A Multi-Objective Genetic Algorithm for Outlier Removal. Oren E. Nahum, Abraham Yosipof, and Hanoch Senderowitz. J. Chem. Inf. Model., Just Accepted Manuscript. DOI: 10.1021/acs.jcim.5b00515. Publication Date (Web): November 10, 2015.




A Multi-Objective Genetic Algorithm for Outlier Removal

Authors: Oren E. Nahum,a,b,d Abraham Yosipof,c Hanoch Senderowitzd*

Affiliations:
a Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel
b School of Management and Economics, The Academic College of Tel-Aviv – Yaffo, Tel-Aviv, Israel
c Department of Business Administration, Peres Academic Center, Rehovot, Israel
d Department of Chemistry, Bar-Ilan University, Ramat-Gan 52900, Israel

*Corresponding author, e-mail: [email protected]


Abstract

QSAR (or QSPR) models are developed to correlate activities for sets of compounds with their structure-derived descriptors by means of mathematical models. The presence of outliers, namely, compounds that differ in some respect from the rest of the data set, compromises the ability of statistical methods to derive QSAR models with good prediction statistics. Hence, outliers should be removed from data sets prior to model derivation. In this work we present a new multi-objective genetic algorithm for the identification and removal of outliers based on the k nearest neighbors (kNN) method. The algorithm was used to remove outliers from three different data sets of pharmaceutical interest (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors), and its performance was compared with that of five other methods for outlier removal. The results suggest that the new algorithm provides filtered data sets which (1) better maintain the internal diversity of the parent data sets and (2) give rise to QSAR models with much better prediction statistics. Equally good filtered data sets in terms of these metrics were obtained when another objective function was added to the algorithm (termed "preservation"), forcing it to remove certain compounds with low probability only. This option is highly useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds. We expect this new algorithm to be useful in future QSAR applications.

Keywords: multi-objective optimization, evolutionary algorithms, outlier removal, outlier detection, k nearest neighbors, QSAR, optimization, distance-based method.


1. Introduction

QSAR (or QSPR) models correlate a specific activity for a set of compounds with their structure-derived descriptors by means of a mathematical model. Such models have been widely applied in many fields including chemistry, biology, and environmental sciences. The role of QSAR models in the identification of new compounds and in their subsequent optimization has been constantly growing and is now recognized by many practitioners of computer-aided drug design methodologies.1, 2

The ability to derive predictive QSAR models depends on multiple factors including the presence of outliers, namely, objects (e.g., compounds) that are different in some respect from the rest of the data set.3-5 The presence of outliers can compromise the ability to build reliable predictive models since many algorithms will attempt to fit the outliers at the expense of the bulk. Consequently, outliers should be removed from the data set prior to model derivation. This removal of course does not preclude the careful analysis of outliers since they may be of great interest.6, 7 There are several methods for outlier removal.8 Statistical estimators such as deviation by a number of standard deviations (SD) from the mean could be used when the distribution of the descriptors values within the data set is well defined.3, 5 When the distributions are not well defined, non-parametric methods should be used. Several such methods have been described in the literature, including distance-based methods9-12 and methods based on support vector machines.13

Most outlier removal techniques reported in the literature remove outliers in a single step based on a single descriptors space ("one-pass" methods). A single-step approach ignores the fact that when multiple outliers are present in a data set they may mask each other. Thus, certain outliers may be observable only after others have been removed.14 Moreover, distance-based methods for outlier removal are very sensitive to the descriptors space; using a single descriptors space may therefore miss some outliers. Addressing these challenges requires a method that removes outliers in an iterative manner, preferably considering different descriptors spaces at each iteration. Several iterative approaches for the removal of outliers, either in a single descriptors space or in multiple descriptors spaces, have been reported in the literature.7, 14-16 Recently, Yosipof and Senderowitz17 presented a new method for the iterative identification and subsequent removal of outliers identified in potentially different descriptors spaces based on the k nearest neighbors algorithm (the kNN-OR algorithm). According to this approach, an outlier is defined as a compound whose distance to its k nearest neighbors is too large for its activity to be reliably predicted. Thus, at each iteration, the algorithm builds a kNN model, evaluates it according to the leave-one-out cross validation metric (Q²_LOO; see equation (1) below) and removes from the data set the compound whose elimination results in the largest increase in Q²_LOO. This procedure is repeated until Q²_LOO exceeds a pre-defined threshold.

Despite its good performance, the kNN-OR algorithm suffers from several drawbacks:

1. The algorithm is greedy, removing at each iteration the compound that maximizes Q²_LOO. However, it is possible that removing at a specific iteration a compound that does not maximize the current Q²_LOO will allow, at a later iteration, the removal of a different compound such that the final Q²_LOO is higher than that obtained by removing at each iteration the compound that maximizes the present Q²_LOO. This means that it may be possible to reach a pre-defined value of Q²_LOO by removing fewer compounds.

2. The algorithm uses a single stopping criterion with an arbitrarily selected value. Thus, a large number of outliers may have to be removed before this criterion is met. On the other hand, a slightly lower stopping criterion may require the removal of far fewer compounds, yet this option is not presented to the user.

3. It is possible that at the final iteration, two subsets of compounds would have the same value of Q²_LOO, each using different descriptors and a different number of neighbors. However, in the kNN-OR algorithm, only one subset is selected and the other is lost.

4. The kNN-OR algorithm does not allow specific compounds to be kept in the data set (or at least removed with a lower probability) if the user so wishes.

Meeting these challenges requires the treatment of outlier removal as a multi-objective optimization problem (MOOP). MOOPs are characterized by having several, often conflicting objectives that have to be optimized simultaneously. In such cases, there is usually no single solution which outperforms all other solutions in terms of all the objectives but rather a set of non-dominated solutions. Two solutions are said to be non-dominated if each is better than the other in at least one objective. The set of non-dominated solutions forms the Pareto front.18 This is illustrated in Figure 1, where each circle represents a solution to the optimization problem. The curved line represents the Pareto front. Each solution is assigned a Pareto rank based on the number of solutions which dominate it (a solution dominates another solution if it is better than it across all objectives). The solid circles are non-dominated solutions, have a Pareto rank of zero and fall on the Pareto front. Dominated solutions are shown as empty circles and the number of solutions which dominate them is given in parentheses.

Figure 1: Potential solutions of a two-objective problem represented by the Pareto front. Dominated solutions are shown as empty circles and the number of solutions which dominate them is written in parentheses.
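To make the dominance relation concrete, the following minimal Python sketch (illustrative helper names, not part of the published work) computes Pareto ranks for a two-objective maximization problem; rank-zero solutions form the Pareto front.

```python
def dominates(a, b):
    """True if solution a dominates solution b: a is at least as good as b in
    every objective and strictly better in at least one (maximization assumed)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(solutions):
    """Rank of each solution = number of solutions that dominate it (0 = on the front)."""
    return [sum(dominates(other, s) for other in solutions if other is not s)
            for s in solutions]

# Example with two objectives, both maximized:
sols = [(0.9, 3), (0.8, 5), (0.7, 2), (0.6, 4)]
print(pareto_ranks(sols))  # -> [0, 0, 2, 1]
```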

MOOPs could be solved by transforming them into single-objective problems. In such cases only a single solution is typically provided.19, 20 Alternatively, multiple solutions could be obtained by using multi-objective evolutionary algorithms (MOEAs). These are stochastic optimization techniques aimed at finding Pareto-optimal solutions for a particular problem by using a population-based approach. The optimization mechanism of MOEAs is quite similar to that of evolutionary algorithms (EAs), except for the usage of the dominance relation as the criterion for reproduction probability. Thus, at each generation, objective values are calculated for every individual in the population and are then used to rank the individuals based on their dominance relationships within the population. Higher-ranking individuals are given higher probabilities to produce the offspring population. In non-elitism MOEAs, the best solutions of the current population are not preserved when the next generation is created.21 Examples of this category include VEGA,22 MOGA,23 NPGA,24 and NSGA.21 In contrast, elitism MOEAs preserve the best individuals from generation to generation. In this way, the system never loses the best individuals found during the optimization process. Algorithms such as PAES,25 SPEA2,26 PDE,27 NSGA-II,28 and MOPSO29 belong to this category.

This paper presents a multi-objective genetic algorithm for the identification and removal of outliers using the Strength Pareto Evolutionary Algorithm 2 (SPEA2)26 (see Methods section for more details). The MOOP algorithm simultaneously minimizes the number of compounds to be removed and maximizes the kNN-derived Q²_LOO. In addition, the descriptors space and the number of nearest neighbors (k) are also optimized in accord with the kNN algorithm. This work builds on the Yosipof and Senderowitz17 (kNN-OR) algorithm yet improves on it by addressing the above-described drawbacks.

The new algorithm was tested on the same data sets (logBBB, factor 7 inhibitors, and dihydrofolate reductase inhibitors) and using the same metrics as the original kNN-OR algorithm. This was done in order to facilitate the comparison between the different methods. Comparison consisted of (1) comparing internal diversity prior to and following outlier removal and (2) evaluating the predictive statistics of QSAR models derived from the data sets processed by the new algorithm as well as by several standard methods for outlier removal. We demonstrate that for all data sets, the new algorithm outperforms five other methods and produces filtered data sets similar in quality to those produced by the kNN-OR algorithm yet with fewer compounds removed. In addition, equally good data sets in terms of these metrics were obtained when another objective function was added to the algorithm (termed "preservation"), forcing it to remove certain compounds with low probability only. This option is useful when specific compounds should preferably be kept in the final data set, either because they have favorable activities or because they represent interesting molecular scaffolds.


2. Methods

2.1 Data Sets

Three data sets were used for developing and testing the new algorithm. Briefly, these were collected from the literature, curated using an in-house protocol and subjected to descriptors calculations using Discovery Studio, after which correlated, constant and nearly constant descriptors were removed (see more details on data sets curation and the selection of descriptors for each data set in Yosipof and Senderowitz17).

1. logBBB set: A set of 152 compounds with known logBBB values was obtained from the literature.30-32 These compounds were characterized by a final set of 15 descriptors (see reference 17 for more details on the descriptors selection process and Table 1 for the final set of descriptors). The complete data set is provided in Table S1 of the supporting information.

Table 1: Descriptors selected for the logBBB data set.

Jurs Descriptors:               TPSA, FNSA3, DPSA1, RPCG, RNCG
Structural and Thermodynamic:   AlogP98, #Hydrogen bond acceptor, #Hydrogen bond donor, #Rotatable bonds
E-state Keys:                   ES_Sum_sssN, ES_Sum_ssCH2, ES_Sum_dO, ES_Sum_aaCH
Spatial Descriptors:            Radius of gyration
Topological Descriptor:         Balaban index JY

2. F7 set: A set of 355 factor 7 inhibitors (F7) was obtained from Norinder et al.33 These compounds were characterized by a final set of 19 descriptors (see Table 2). The complete data set is provided in Table S2 of the supporting information.

Table 2: Descriptors selected for the F7 data set.

Structural and Thermodynamic:   AlogP98, Molecular Solubility, #Rotatable Bonds, #Chains, #Hydrogen bond acceptor, #Hydrogen bond donor, Molecular FractionalPolarSASA
E-state Keys:                   ES_Sum_aasC, ES_Sum_dO, ES_Sum_sNH2, ES_Sum_ssCH2, ES_Sum_ssNH, ES_Sum_sssCH, ES_Sum_sssN, ES_Count_aasC, ES_Count_sCH3, ES_Count_sssCH
Topological Descriptor:         JX
Information Content:            CIC

3. DHFR set: A set of 673 dihydrofolate reductase (DHFR) inhibitors was obtained from Sutherland et al.34 These compounds were characterized by a final set of 19 descriptors (see Table 3). The complete data set is provided in Table S3 of the supporting information.

Table 3: Descriptors selected for the DHFR data set.

Structural and Thermodynamic:   AlogP98, Molecular Weight, Molecular Solubility, #Rings, #Aromatic Rings, #Rings6, #Hydrogen bond acceptor, #Hydrogen bond donor, Molecular FractionalPolarSurfaceArea, Molecular PolarSASA
E-state Keys:                   ES_Sum_aaCH, ES_Sum_aaN, ES_Sum_aasC, ES_Sum_sCH3, ES_Sum_sNH2, ES_Sum_ssCH2, ES_Count_aasC
Topological Descriptor:         JX
Information Content:            IC

2.2 Multi-Objective k-Nearest Neighbors Genetic Algorithm for Outlier Removal

The purpose of this algorithm is to find the smallest possible set of compounds (i.e., outliers) whose removal from a data set will afford QSAR models with good prediction statistics. This is inherently a MOOP, which requires the simultaneous minimization of the number of compounds to be removed and the maximization of model performance.

In this work QSAR models were derived using the k-nearest neighbors (kNN) algorithm. Briefly, kNN predicts the activity of a compound based on the weighted average of the activities of its k nearest neighbors. This algorithm has its roots in the similar property principle,35 which states that similar compounds have similar properties. Inherent to the kNN algorithm is the optimization of the descriptors space in terms of which nearest neighbors are determined, as well as of the number of nearest neighbors. The performance of the kNN model on the internal (i.e., training) set is evaluated by the leave-one-out (LOO) cross-validated value (Q²_LOO, equation (1)).
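A minimal sketch of such a kNN prediction and of the Q²_LOO evaluation is given below (inverse-distance weighting is an assumption; the paper states only that a weighted average of the neighbors' activities is used):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, eps=1e-8):
    # Predict the activity of compound x as the distance-weighted average of
    # the activities of its k nearest neighbors (inverse-distance weights).
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nn = np.argsort(d)[:k]                    # indices of the k nearest neighbors
    w = 1.0 / (d[nn] + eps)
    return np.dot(w, y_train[nn]) / w.sum()

def q2_loo(X, y, k=3):
    # Leave-one-out cross-validated Q2 (equation (1)): each compound is
    # predicted from the remaining n-1 compounds.
    preds = np.array([knn_predict(np.delete(X, i, axis=0), np.delete(y, i), X[i], k)
                      for i in range(len(y))])
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
```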


The multi-objective optimization procedure is carried out using the Strength Pareto Evolutionary Algorithm 2 (SPEA2),26 which approximates the Pareto front for multi-objective optimization problems. SPEA2 uses an external set (archive) for storing primarily non-dominated solutions. At each generation, it combines archive solutions with the current population to form the next archive, which is then used to produce offspring for the next generation. Each individual i in the archive At and the population Pt is assigned a raw fitness value R(i), determined by the number of its dominators in both archive and population. R(i) = 0 corresponds to a non-dominated individual, whereas a high R(i) value means that individual i is dominated by many individuals. These raw values are then used to rank the individuals for the purpose of selecting candidates for reproduction. However, the raw fitness value by itself may be insufficient for ranking when most individuals do not dominate each other. Therefore, additional information, based on the kth nearest neighbor density of the individuals, is incorporated to remove rank redundancy. The workflow of the SPEA2 algorithm is described below:

Algorithm – SPEA2

Input:  N – archive size; M – offspring population size; T – maximum number of generations
Output: A* – non-dominated set of solutions to the optimization problem

1. Initialization: Generate an initial population P0 and create the empty archive (external set) A0 = ∅. Set t = 0.
2. Fitness assignment: Calculate fitness values for individuals in Pt and At.
3. Environmental selection: Copy all non-dominated individuals in Pt and At to At+1. If the size of At+1 exceeds N, reduce At+1 by means of the truncation operator; otherwise, if the size of At+1 is less than N, fill At+1 with dominated individuals in Pt and At.
4. Termination: If t ≥ T or another stopping criterion is satisfied, set A* to the set of decision vectors represented by the non-dominated individuals in At+1 and stop.
5. Mating selection: Perform selection with replacement on At+1 in order to fill the mating pool.
6. Variation: Apply recombination and mutation operators to the mating pool and set Pt+1 to the resulting population. Increment the generation counter (t = t + 1) and go to Step 2.
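As a sketch of step 2, SPEA2's raw fitness can be computed as follows (reusing the dominates helper from the Introduction sketch; in SPEA2 proper, R(i) sums the strengths of i's dominators, which reduces to a simple dominator count when each strength is taken as one):

```python
def spea2_raw_fitness(union):
    # union = archive + population, each individual given as an objective tuple.
    # S(j): strength of j = number of individuals that j dominates.
    strength = [sum(dominates(j, i) for i in union) for j in union]
    # R(i): sum of the strengths of all individuals that dominate i.
    # R(i) = 0 marks a non-dominated individual.
    return [sum(s for j, s in zip(union, strength) if dominates(j, i))
            for i in union]
```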


In the present case, each candidate solution must specify the set of descriptors and the number of neighbors used by the kNN algorithm, as well as the identity of compounds considered as outliers. This information is coded by a three-component binary array (i.e., chromosome). The first part of the array encodes the number of neighbors using a binary representation. The second part describes the descriptors identity ("1" and "0" representing, respectively, selected and unselected descriptors for the current solution). The third part lists the compounds considered as outliers using the same representation as for the descriptors. The resulting chromosomes were subjected to a set of genetic operations as follows: two parent chromosomes were selected using roulette wheel selection and were subjected to a multi-site crossover operator (where the number of sites was a random number in the range of one to six) to produce two new chromosomes. These represented a new combination of the number of neighbors, descriptors and outliers. The resulting chromosomes were further mutated to increase the diversity of the solutions population and to prevent trapping in local minima. Five types of mutation operators were used, each with its own probability: (1) General mutation: this operation changed the value of a random bit in the chromosome; (2) Swap mutation: this operation swapped the values of two randomly chosen bits in the chromosome; (3) Mutation in compounds: this operation performed a general mutation on the part of the chromosome encoding information on the compounds considered as outliers; (4) Mutation in the number of neighbors: this operation performed a general mutation on the part of the chromosome encoding information on the number of nearest neighbors; and (5) Mutation in descriptors: this operation performed a general mutation on the part of the chromosome encoding information on the selected descriptors.
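The encoding can be pictured with the following sketch (field widths and helper names are illustrative assumptions; the published implementation was written in C#):

```python
import random

N_K_BITS = 4    # bits encoding k (assumption: 4 bits, i.e., k up to 15)
N_DESC = 15     # one bit per candidate descriptor (logBBB example)
N_CMPD = 152    # one bit per compound; 1 = treated as an outlier

def decode(chrom):
    # Split the binary array into (k, selected descriptors, outlier indices).
    k = int("".join(map(str, chrom[:N_K_BITS])), 2)
    desc = [i for i, b in enumerate(chrom[N_K_BITS:N_K_BITS + N_DESC]) if b]
    outliers = [i for i, b in enumerate(chrom[N_K_BITS + N_DESC:]) if b]
    return k, desc, outliers

def swap_mutation(chrom):
    # One of the five operators: swap the values of two randomly chosen bits.
    i, j = random.sample(range(len(chrom)), 2)
    chrom[i], chrom[j] = chrom[j], chrom[i]

chromosome = [random.randint(0, 1) for _ in range(N_K_BITS + N_DESC + N_CMPD)]
k, descriptors, outliers = decode(chromosome)
```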

For each new generation, raw fitness values are calculated for each individual based on the information encoded in its chromosome. This calculation is based on Q²_LOO (which in turn depends on the selected descriptors, the selected number of neighbors and the identity of the outliers removed, via the kNN algorithm) and on the number of outliers removed.
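Combining the helpers sketched above, a plausible objective evaluation for one individual might read as follows (the negation recasts the minimization of removed compounds as a maximization):

```python
import numpy as np

def objectives(chrom, X, y):
    # Objective vector: (Q2_LOO of the filtered data set in the selected
    # descriptors space, minus the number of removed compounds).
    k, desc, outliers = decode(chrom)
    keep = np.setdiff1d(np.arange(len(y)), outliers)
    X_filtered = X[np.ix_(keep, desc)]
    return q2_loo(X_filtered, y[keep], max(k, 1)), -len(outliers)
```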

The algorithm was coded in C# using Visual Studio Professional 2010.


2.3 Alternative Outlier Removal Methods

The results of the current algorithm were compared with those obtained by the kNN-OR algorithm reported by Yosipof and Senderowitz17 as well as with the results obtained by five other outlier removal methods reported in that work. These include (1) a distance-based method, (2) a distance K-based method, (3) one-class SVM, (4) statistics, and (5) random outlier removal. In all cases, the same number of outliers was removed to allow for a facile comparison between the different methods. For completeness, a brief description of these methods is provided below (see reference 17 for more details).

2.3.1 kNN-OR Method

kNN-OR is an iterative method for the removal of outliers. At each iteration, the algorithm builds a kNN model, evaluates it according to the leave-one-out cross validation metric (Q²_LOO) and removes from the data set the compound whose elimination results in the largest increase in Q²_LOO. This procedure is repeated until Q²_LOO exceeds a pre-defined threshold.
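For reference, the greedy loop can be sketched as follows (a simplified single-descriptors-space, fixed-k version reusing the q2_loo helper above; the published algorithm also re-optimizes descriptors and k):

```python
import numpy as np

def knn_or(X, y, k=3, q2_target=0.85):
    # Greedy outlier removal: at each iteration drop the compound whose
    # elimination yields the largest Q2_LOO, until the threshold is met.
    keep = list(range(len(y)))
    while q2_loo(X[keep], y[keep], k) < q2_target and len(keep) > k + 1:
        trials = [(q2_loo(X[keep[:i] + keep[i + 1:]], y[keep[:i] + keep[i + 1:]], k), i)
                  for i in range(len(keep))]
        _, worst = max(trials)   # compound whose removal helps Q2_LOO most
        keep.pop(worst)
    return keep                  # indices of the retained compounds
```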

2.3.2 Distance-Based Method for Outlier Removal36

A compound is considered an outlier if at least p percent of its Euclidean distances to the other compounds are larger than d. In this work d was set to the average Euclidean distance between compounds in each data set, and p was set to 67%, 87.5%, and 74% for the logBBB, F7, and DHFR data sets, respectively.
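A sketch of this rule, with d computed as the data set's average pairwise distance as described above:

```python
import numpy as np

def distance_based_outliers(X, p=0.67):
    # Flag compound i as an outlier if at least a fraction p of its Euclidean
    # distances to the other compounds exceed d, the average pairwise distance.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d = D[np.triu_indices(len(X), k=1)].mean()
    far_fraction = (D > d).sum(axis=1) / (len(X) - 1)   # self-distance is 0
    return np.where(far_fraction >= p)[0]
```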

2.3.3 Distance K-Based Method for Outlier Removal37

Outliers are identified by calculating Euclidean distances between each compound and its k nearest neighbors. Next, the n compounds with the largest distances are considered outliers and removed. Here the value of n was set to the same number of outliers removed by all other methods considered in this work.
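A sketch of this procedure (the per-compound score is taken here as the mean distance to the k nearest neighbors; the paper does not specify whether the mean, sum, or maximum is used):

```python
import numpy as np

def distance_k_outliers(X, k=3, n_remove=19):
    # Score each compound by the mean Euclidean distance to its k nearest
    # neighbors; the n_remove highest-scoring compounds are flagged as outliers.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)                 # exclude self-distances
    score = np.sort(D, axis=1)[:, :k].mean(axis=1)
    return np.argsort(score)[-n_remove:]
```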


2.3.4 One-Class SVM-Based Outlier Removal13

This method attempts to identify a boundary which best separates the majority of the data points from the origin. Points which are found on the other side of the decision boundary are considered outliers. In this study the method's parameters were adjusted to remove the same number of outliers as removed by all other methods. One-class SVM calculations were performed by LIBSVM as implemented in WEKA version 3.7.9.38

2.3.5 Statistics-Based Outlier Removal39

Statistics-based outlier removal considers outliers to be compounds with descriptor values deviating from the mean by a pre-defined number (typically three) of standard deviations (SD). In this study, for the logBBB and DHFR data sets, a compound was considered an outlier if the value of at least one of its descriptors was higher (or lower) by more than 3.1 SD from the mean. For the F7 data set, a compound was considered an outlier if the values of at least two of its descriptors were higher (or lower) by more than 2.9 SD from their respective means.
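A sketch of the SD rule with the two parameter settings used here:

```python
import numpy as np

def sd_outliers(X, n_sd=3.1, min_desc=1):
    # Flag a compound if at least min_desc of its descriptor values deviate
    # from the column mean by more than n_sd standard deviations
    # (n_sd = 3.1, min_desc = 1 for logBBB and DHFR; 2.9 and 2 for F7).
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return np.where((z > n_sd).sum(axis=1) >= min_desc)[0]
```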

2.3.6 Random “Outlier” Removal

Random “outlier” removal was performed by randomly excluding subsets of compounds (the sizes of which matched the number of outliers removed by all other methods) from within each data set. Random “outlier” removal was repeated ten times for each data set.

2.4 Internal Diversity

For all data sets, prior to and following compounds removal, the internal diversity was evaluated using pairwise Euclidean distances between all compounds in the original normalized descriptors space. The results are presented as distance histograms.
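These histograms can be produced along the following lines (sketch; the bin count is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_distance_histogram(X, keep=None, bins=30, **kwargs):
    # Histogram of pairwise Euclidean distances in the normalized descriptors
    # space, optionally restricted to the compounds surviving filtration.
    if keep is not None:
        X = X[keep]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d = D[np.triu_indices(len(X), k=1)]   # unique pairs only
    plt.hist(d, bins=bins, histtype="step", density=True, **kwargs)
```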

2.5 QSAR Models

Following outlier removal, the remaining compounds were divided into a modeling set (80%) and an external validation set (20%). Division was performed with a newly developed method that uses a Monte Carlo/simulated annealing procedure to optimize a representativeness function for the selection of a subset of objects (e.g., compounds) which best represents the parent data set.40 The unselected and selected subsets constitute the training (modeling) and validation (i.e., test) sets, respectively. Models were built from the modeling set and evaluated on the validation set using the Random Forest (RF) approach as implemented in the CF program version 2.13 (http://www.qsar4u.com/) with default parameters, as well as using our in-house implementation of the kNN approach.17


For the training set, kNN and RF models were validated using leave-one-out cross validation (Q²_LOO; equation (1)) and the determination coefficient (R², which takes the exact form of equation (1)), respectively. For the test sets, both models were evaluated using the external explained variance (Q²_ext, equation (2)) according to the OECD guidelines for model validation.41

$$Q^2_{\mathrm{LOO}} = 1 - \frac{\sum_i \left(y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}\right)^2}{\sum_i \left(y_i^{\mathrm{exp}} - \bar{y}^{\mathrm{exp}}\right)^2} \qquad (1)$$

$$Q^2_{\mathrm{ext}} = 1 - \frac{\sum_i \left(y_i^{\mathrm{exp}} - y_i^{\mathrm{pred}}\right)^2}{\sum_i \left(y_i^{\mathrm{exp}} - \bar{y}^{\mathrm{exp}}_{\mathrm{train}}\right)^2} \qquad (2)$$

In equations (1) and (2), $y_i^{\mathrm{exp}}$ is the experimental value, $y_i^{\mathrm{pred}}$ is the predicted value, and $\bar{y}^{\mathrm{exp}}$ ($\bar{y}^{\mathrm{exp}}_{\mathrm{train}}$) is the mean of the experimental results over the modeling (training) set compounds. We have also repeated the calculation of Q²_ext with equation (2) using for the mean the experimental results over the test set compounds and found identical values (results not shown).

In addition, for all models, the mean absolute error (MAE; equation (3)) between the predicted ($y_i^{\mathrm{pred}}$) and experimental ($y_i^{\mathrm{exp}}$) data of the test sets was calculated:

$$\mathrm{MAE} = \frac{\sum_i \left|y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}}\right|}{n} \qquad (3)$$

where n is the number of observations.
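Equations (1)–(3) translate directly into code (sketch):

```python
import numpy as np

def explained_variance(y_true, y_pred, y_ref_mean):
    # Equations (1) and (2): 1 - SS_res / SS_tot, with SS_tot referenced to
    # the modeling (training) set mean y_ref_mean.
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_ref_mean) ** 2)

def mae(y_true, y_pred):
    # Equation (3): mean absolute error.
    return np.mean(np.abs(y_true - y_pred))
```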

Compounds in the validation set were also tested for their residence within the models' applicability domain (AD). This was mathematically defined as $D_T = \bar{d} + Z\sigma$, where $D_T$ is a threshold distance, $\bar{d}$ is the average Euclidean distance between each compound and its k nearest neighbors in the training set, $\sigma$ is the standard deviation of these Euclidean distances, and Z is an (arbitrary) parameter to control the significance level. In this study, Z = 0.5. If the distance of the test compound from any of its k nearest neighbors in the training set exceeded the threshold, the prediction was considered unreliable.42
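A sketch of this AD test:

```python
import numpy as np

def in_applicability_domain(X_train, x, k=3, z=0.5):
    # D_T = d_bar + Z*sigma, computed over the kNN distances within the
    # training set; a test compound is inside the domain only if all of its
    # k nearest training neighbors lie within D_T.
    D = np.sqrt(((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(D, np.inf)
    knn_d = np.sort(D, axis=1)[:, :k]
    d_t = knn_d.mean() + z * knn_d.std()        # threshold distance
    d_test = np.sort(np.linalg.norm(X_train - x, axis=1))[:k]
    return bool((d_test <= d_t).all())
```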

3. Results

As pointed out in reference 17, kNN models for the logBBB, F7, and DHFR data sets could be obtained when including all compounds (152, 355, and 673 for logBBB, F7, and DHFR, respectively) in the modeling sets. This attempt led to reasonable yet clearly improvable Q²_LOO values in all cases.

3.1 logBBB Data Set

The logBBB data set contained 152 compounds which were subjected to the outlier removal procedure. In the kNN-OR algorithm,17 the stopping criterion of the outlier removal algorithm was set to Q²_LOO ≥ 0.85. This resulted in the removal of 19 compounds, leaving a total of 133 compounds for the subsequent QSAR modeling.

The GA-kNN outlier removal method is a multi-objective optimization algorithm, and therefore provides a set of non-dominated solutions. For the logBBB data set, this resulted in 35 non-dominated solutions, all with k = 3 and number of descriptors = 8. Figure 2 presents the relation between the number of compounds removed and Q²_LOO for all solutions. As can clearly be seen, higher values of Q²_LOO were obtained when larger numbers of outliers were removed from the data set.


Figure 2: Q²_LOO vs. number of compounds removed in the set of non-dominated solutions found by the GA-kNN algorithm for the logBBB data set. Note that this graph does not represent a successive removal of outliers. Thus, the ten compounds whose removal resulted in Q²_LOO = 0.8 are not a subset of the 25 compounds whose removal resulted in Q²_LOO = 0.9.

For the GA-kNN algorithm, a result with Q²_LOO = 0.85 required the removal of only 13 compounds (as opposed to 19 compounds required by the kNN-OR algorithm). A solution which resulted in the removal of 19 compounds had Q²_LOO = 0.88.

Figure 3 compares the internal diversity of the logBBB data set prior to and following GA-kNN optimization-based outlier removal for both the Q²_LOO = 0.88 solution (removal of 19 compounds) and the Q²_LOO = 0.85 solution (removal of 13 compounds). As the data clearly demonstrate, the removal of either 13 (8.6% of the data set) or 19 compounds (12.5% of the data set) did not significantly change the distribution of distances, suggesting that the coverage of chemistry space was largely unaffected by the removal of outliers. This in turn implies that the domain of applicability of QSAR models derived from the filtered data set will not be reduced. For comparison, similar graphs are provided for the other methods considered in this work, which demonstrate that for all other rational outlier removal procedures (except for kNN-OR) the distribution at short distances remained unchanged while long distances were truncated. Not surprisingly, random compound removal (Figure 3H) had no effect on the distribution of distances.



Figure 3: A comparison between pairwise Euclidean distance distributions before (red lines) and after (blue lines) the removal of outliers from the logBBB data set. Distances were calculated in the original descriptors space (see Table 1). (A) GA-kNN optimization-based outlier removal (13 compounds removed); (B) GA-kNN optimization-based outlier removal (19 compounds removed); (C) kNN-OR optimization-based outlier removal (19 compounds removed); (D) Distance-based outlier removal (19 compounds removed); (E) Distance K-based outlier removal (19 compounds removed); (F) One-class SVM-based outlier removal (19 compounds removed); (G) Statistics-based outlier removal (19 compounds removed); (H) Random "outlier" removal (19 compounds removed).


Compounds surviving the different removal procedures were divided into a modeling set (106 compounds) and a validation set (27 compounds) and subjected to QSAR modeling using the kNN or the Random Forest algorithms. The results are presented in Table 4 (kNN) and Table 5 (RF). In these tables, results averaged across all 10 random models are designated as "Random consensus". Results for the last six entries in each table (as well as in Tables 6-9) were taken from Yosipof and Senderowitz.17

Table 4: Results obtained for the logBBB data set using kNN.

                            no applicability domain       with applicability domain
                   Q²_LOO   Q²_ext   MAE                  Q²_ext   MAE    coverage %
GA-kNN              0.80     0.86    0.24                  0.85    0.25       81
kNN-OR              0.81     0.88    0.24                  0.89    0.22       74
Distance based      0.67     0.72    0.31                  0.78    0.28       74
Distance K-based    0.70     0.54    0.39                  0.61    0.38       89
One-class SVM       0.61     0.70    0.31                  0.75    0.31       85
Statistics          0.60     0.61    0.41                  0.68    0.39       89
Random consensus    0.63     0.62    0.35                  0.67    0.33       81

Table 5: Results obtained for the logBBB data set using Random Forest.

                   R²_OOB   Q²_ext   MAE
GA-kNN              0.67     0.65    0.35
kNN-OR              0.60     0.65    0.39
Distance based      0.52     0.54    0.36
Distance K-based    0.56     0.52    0.36
One-class SVM       0.51     0.67    0.34
Statistics          0.48     0.54    0.40
Random consensus    0.53     0.57    0.37

The best results for both modeling and test sets using kNN were obtained for the data sets generated by the kNN-OR and the GA-kNN algorithms (GA-kNN: Q²_LOO = 0.80; Q²_ext = 0.86 with no AD; Q²_ext = 0.85 with AD; MAE = 0.24 with no AD, MAE = 0.25 with AD; kNN-OR: Q²_LOO = 0.81; Q²_ext = 0.88 with no AD; Q²_ext = 0.89 with AD; MAE = 0.24 with no AD, MAE = 0.22 with AD). Models generated using all other outlier removal methods and by random compound removal provided poorer results (lower Q²_LOO and Q²_ext and higher MAE values).


The results obtained with the RF method largely mirror those obtained using kNN. Overall, higher R²_OOB and Q²_ext were obtained for the data sets filtered by the GA-kNN and kNN-OR algorithms (R²_OOB = 0.67, Q²_ext = 0.65 for GA-kNN and R²_OOB = 0.60, Q²_ext = 0.65 for kNN-OR) than for almost any of the data sets rationally filtered by the other methods (the only exception being Q²_ext = 0.67 for the data set filtered with one-class SVM).

As noted above, the original kNN-OR Q²_LOO ≥ 0.85 stopping criterion was met by the new GA-kNN algorithm after the removal of only 13 compounds. The remaining 139 compounds were divided into a modeling set (111 compounds) and a validation set (28 compounds) and subjected to QSAR modeling. The results (kNN: Q²_LOO = 0.80; Q²_ext = 0.75 with no AD; Q²_ext = 0.80 with AD; MAE = 0.34 with no AD, MAE = 0.30 with AD; AD coverage = 75%; RF: R²_OOB = 0.69; Q²_ext = 0.61; MAE = 0.37) are very similar to those presented in the first entries of Table 4 and Table 5 (albeit with slightly poorer performance for the external set) and are still better than those obtained for the data sets processed by all other methods.

3.2 F7 Data Set

The F7 data set consisted of 355 compounds which were subjected to the outlier removal procedure. In the kNN-OR algorithm,17 the stopping criterion (Q²_LOO ≥ 0.85) was met after the removal of 22 compounds. For this data set, the set of solutions provided by the GA-kNN algorithm consisted of 23 non-dominated solutions, in which k ranged between 3 and 5 and the number of descriptors between 9 and 12. As with the logBBB data set, there is a correlation between the number of compounds removed and Q²_LOO values (Figure 4), with higher values obtained upon the removal of larger numbers of outliers.


Figure 4: Q²_LOO vs. number of compounds removed in the set of non-dominated solutions found by the GA-kNN algorithm for the F7 data set.

For the GA-kNN algorithm, a result with Q²_LOO = 0.85 required the removal of only 20 compounds. A solution which resulted in the removal of 22 compounds had Q²_LOO = 0.86. Figure 5 compares the internal diversity of the F7 data set before and after outlier removal by the GA-kNN algorithm as well as by all other algorithms considered in reference 17. Since only a small number of compounds were removed (~6% of the parent data set), the distance distribution is largely unaffected by the removal of outliers.


Figure 5: A comparison between pairwise Euclidean distance distributions before (red lines) and after (blue lines) the removal of outliers from the F7 data set. Distances were calculated in the original descriptors space (see Table 2). (A) GA-kNN optimization-based outlier removal (20 compounds removed); (B) GA-kNN optimization-based outlier removal (22 compounds removed); (C) kNN-OR optimization-based outlier removal (22 compounds removed); (D) Distance-based outlier removal (22 compounds removed); (E) Distance K-based outlier removal (22 compounds removed); (F) One-class SVM-based outlier removal (22 compounds removed); (G) Statistics-based outlier removal (22 compounds removed); (H) Random "outlier" removal (22 compounds removed).

Following outlier removal, the remaining compounds were divided into a modeling set (266 compounds) and a test set (67 compounds) and subjected to QSAR modeling using kNN and RF. The results are presented in Table 6 (kNN) and Table 7 (RF) and follow the same trend as obtained for the logBBB data set.

Table 6: Results obtained for the F7 data set using kNN.

                            no applicability domain       with applicability domain
                   Q²_LOO   Q²_ext   MAE                  Q²_ext   MAE    coverage %
GA-kNN              0.82     0.84    0.30                  0.90    0.24       64
kNN-OR              0.80     0.86    0.30                  0.89    0.27       67
Distance based      0.66     0.60    0.41                  0.66    0.36       69
Distance K-based    0.69     0.62    0.38                  0.64    0.38       76
One-class SVM       0.70     0.67    0.40                  0.71    0.35       75
Statistics          0.69     0.59    0.40                  0.69    0.32       66
Random consensus    0.70     0.61    0.43                  0.67    0.39       73

Table 7: Results obtained for the F7 data set using Random Forest.

                   R²_OOB   Q²_ext   MAE
GA-kNN              0.81     0.79    0.34
kNN-OR              0.78     0.81    0.34
Distance based      0.69     0.61    0.42
Distance K-based    0.72     0.64    0.38
One-class SVM       0.72     0.68    0.39
Statistics          0.72     0.60    0.41
Random consensus    0.70     0.66    0.41

For kNN, the best results for the modeling set were obtained with the GA-kNN algorithm (Q²_LOO = 0.82), followed closely by kNN-OR (Q²_LOO = 0.80). Both methods also provided the best results for the test set (GA-kNN: Q²_ext = 0.84 and Q²_ext = 0.90 without and with AD, respectively; kNN-OR: Q²_ext = 0.86 and Q²_ext = 0.89 without and with AD, respectively). All other methods provided significantly poorer results. The same holds true for the results obtained with RF.

As noted above, the original kNN-OR Q²_LOO ≥ 0.85 stopping criterion was met by the new GA-kNN algorithm after the removal of only 20 compounds. The remaining 335 compounds were divided into a modeling set (268 compounds) and a validation set (67 compounds) and subjected to QSAR modeling. The results (kNN: Q²_LOO = 0.80; Q²_ext = 0.88 with no AD; Q²_ext = 0.91 with AD; MAE = 0.27 with no AD, MAE = 0.23 with AD; AD coverage = 67%; RF: R²_OOB = 0.79; Q²_ext = 0.83; MAE = 0.32) are again very similar to those presented in the first entries of Table 6 and Table 7 and are better than those obtained for the data sets processed by all other methods.

3.3 DHFR Data Set

The DHFR data set consisted of 673 compounds which were subjected to the outlier removal procedure. In Yosipof and Senderowitz,17 the stopping criterion (Q²_LOO ≥ 0.85) was met after the removal of 87 compounds. For this data set, the set of solutions provided by the GA-kNN algorithm consisted of 118 non-dominated solutions, in which k = 2 (for solutions with Q²_LOO values between 0.8 and 0.9) or k = 3 (for solutions with Q²_LOO values between 0.68 and 0.8) (Figure 6), and the number of descriptors ranged between 14 and 16. As with the two other data sets, there is a correlation between the number of compounds removed and Q²_LOO values (Figure 7), with higher values obtained upon the removal of larger numbers of outliers.

Figure 6: Q²_LOO vs. the value of k in the set of non-dominated solutions found by the GA-kNN algorithm for the DHFR data set.

Figure 7: Q²_LOO vs. number of compounds removed in the set of non-dominated solutions found by the GA-kNN algorithm for the DHFR data set.

For the GA-kNN algorithm, a result with Q²_LOO = 0.85 required the removal of only 75 compounds. A solution which resulted in the removal of 87 compounds had Q²_LOO = 0.87. Figure 8 compares the internal diversity of the DHFR data set before and after outlier removal by the GA-kNN algorithm as well as by all other algorithms considered in reference 17. Similar to the logBBB data set, the data demonstrate that the removal of either 75 compounds (11.1% of the data set) by the GA-kNN algorithm or of 87 compounds (12.9% of the data set) by the GA-kNN


and kNN-OR algorithms did not change the distribution of distances, suggesting that the coverage of chemistry space was largely unaffected by the removal of outliers. In contrast, all other methods (except again random removal) demonstrated truncation at long distances, implying reduced internal diversity.


Figure 8: A comparison between pairwise Euclidean distance distributions before (red lines) and after (blue lines) the removal of outliers from the DHFR data set. Distances were calculated in the original descriptors space (see Table 3). (A) GA-kNN optimization-based outlier removal (75 compounds removed); (B) GA-kNN optimization-based outlier removal (87 compounds removed); (C) kNN-OR optimization-based outlier removal (87 compounds removed); (D) Distance-based outlier removal (87 compounds removed); (E) Distance K-based outlier removal (87 compounds removed); (F) One-class SVM-based outlier removal (87 compounds removed); (G) Statistics-based outlier removal (87 compounds removed); (H) Random "outlier" removal (87 compounds removed).

As with the previous two sets, the remaining compounds were divided into a modeling set (469 compounds) and a test set (117 compounds) and subjected to QSAR modeling using kNN and RF. The results are presented in Table 8 (kNN) and Table 9 (RF) and again demonstrate the good performance of the GA-kNN method.

Table 8: Results obtained for the DHFR data set using kNN.

                            no applicability domain       with applicability domain
                   Q²_LOO   Q²_ext   MAE                  Q²_ext   MAE    coverage %
GA-kNN              0.78     0.83    0.31                  0.86    0.29       75
kNN-OR              0.79     0.83    0.32                  0.85    0.30       75
Distance based      0.54     0.63    0.48                  0.75    0.42       77
Distance K-based    0.64     0.74    0.45                  0.78    0.44       74
One-class SVM       0.62     0.65    0.53                  0.74    0.44       65
Statistics          0.55     0.54    0.48                  0.65    0.40       65
Random consensus    0.59     0.53    0.54                  0.64    0.49       70

Table 9: Results obtained for the DHFR data set using Random Forest.

                   R²_OOB   Q²_ext   MAE
GA-kNN              0.73     0.75    0.41
kNN-OR              0.71     0.77    0.40
Distance based      0.55     0.68    0.47
Distance K-based    0.66     0.73    0.47
One-class SVM       0.62     0.72    0.47
Statistics          0.55     0.61    0.46
Random consensus    0.61     0.62    0.52

Equally good results were obtained following the GA-kNN-based data set filtration using the Q²_LOO ≥ 0.85 stopping criterion. As stated above, this criterion led to the removal of only 75 compounds from the data set (as opposed to 87 compounds using the kNN-OR algorithm with the same stopping criterion). The remaining 598 compounds were divided into a modeling set (478 compounds) and a test set (120 compounds) and the resulting models produced the following statistics: kNN: Q²_LOO = 0.77; Q²_ext = 0.82 with no AD; Q²_ext = 0.85 with AD; MAE = 0.34 with no AD, MAE = 0.31 with AD; AD coverage = 78%; RF: R²_OOB = 0.73; Q²_ext = 0.75; MAE = 0.39.

3.4 Introducing the "Preservation" Criterion

To demonstrate the benefits of multi-objective optimization, this section presents a new objective function which is optimized by the GA-kNN algorithm simultaneously with the other functions described above (i.e., the maximization of Q²_LOO and the minimization of the number of outliers). Given a data set intended to be used for the derivation of QSAR models, it is sometimes desirable to refrain from filtering out certain compounds because, for example, they have favorable biological activities or represent a scaffold of particular interest. If no special measures are taken, such compounds could be considered outliers and removed from the data set by the GA-kNN algorithm.

To guard against the removal of such interesting compounds, a new property termed "Preservation" was assigned to each compound. The preservation property takes values in the range of 0 to 1 with higher values assigned to more "important" compounds. In real cases, the precise assignment of preservation values should be performed by the user. In the present case, preservation values were randomly assigned (see below). Next, a new objective was added to the GA-kNN algorithm, namely the maximization of the total “Preservation”. To demonstrate the usage of the new function, compounds comprising the logBBB data set were assigned preservation values as follows: 142 compounds were assigned random values between 0 and 0.3 and ten compounds were assigned random values between 0.8 and 1. Application of the  GA-kNN algorithm to simultaneously maximize "Preservation", maximize  , and minimize

the number of compounds considered to be outliers resulted in a set of 383 non-dominated solutions. Not surprisingly, high values of "Preservation" corresponded to solutions with no compounds removal. Compounds removal reduced the value of "Preservation from 32.5 (no compounds removed) to 28.5 (24 compounds removed; see Figure 9).

Figure 9: Preservation values vs. number of compounds removed using the GA-kNN algorithm for the logBBB data set. Results were obtained upon the simultaneous optimization of Preservation, Q²_LOO, and the number of outliers.

For all non-dominated solutions k = 3 and the number of descriptors ranged between 8 and 10 (see Figure 10). As with all other data sets, a relation exists between the number of compounds removed and Q²_LOO (Figure 11). Figure 11 illustrates two important features of the GA-kNN algorithm: (1) the ability to obtain multiple solutions characterized by the same value of Q²_LOO yet with different numbers of removed compounds (solutions across the horizontal red line); (2) the ability to obtain several solutions characterized by almost the same value of Q²_LOO and the same number of removed compounds, yet with different compounds removed in each solution (any two almost overlapping points in Figure 11).


Figure 10: Number of compounds removed vs. the number of descriptors in the final models obtained with the GA-kNN algorithm for the logBBB data set. Results were obtained upon the simultaneous optimization of Preservation, Q²_LOO, and the number of outliers.


Figure 11: Q²_LOO vs. number of compounds removed in the set of non-dominated solutions found by the GA-kNN algorithm for the logBBB data set. Results were obtained upon the simultaneous optimization of Preservation, Q²_LOO, and the number of outliers.

In this case, a result with Q²_LOO = 0.85 was obtained upon the removal of 16 compounds. This number is slightly larger than that obtained in the initial run (13 compounds; see section 3.1), presumably due to imposing the "Preservation" criterion, yet it is still smaller than the number of compounds removed by the kNN-OR algorithm (19 compounds). Importantly, none of the compounds with a "Preservation" value > 0.8 was removed. Similar to the previous results obtained for this data set (Figure 3), compound removal did not affect the internal diversity of the data set (Figure 12).


Figure 12: A comparison between pairwise Euclidean distances distributions before (red lines) and after (blue lines) the removal of outliers using the GA-kNN algorithm for the logBBB data set. Distances were calculated in the original descriptors space (see Table 1).

Compounds surviving the filtration process were divided into a modeling set (109 compounds) and a test set (27 compounds), affording once more models with good prediction statistics (kNN: Q²_LOO = 0.79; Q²_ext = 0.89 with no AD; Q²_ext = 0.88 with AD; MAE = 0.24 with no AD, MAE = 0.25 with AD; AD coverage = 78%; RF: R²_OOB = 0.63; Q²_ext = 0.61; MAE = 0.37).

The preservation criterion was further tested on the F7 and DHFR data sets. For the F7 data set, 345 compounds were assigned random values between 0 and 0.3 and the remaining ten compounds were assigned random values between 0.8 and 1. Similarly, for the DHFR data set, 663 compounds were assigned random values between 0 and 0.3 and the remaining ten compounds were assigned random values between 0.8 and 1. Application of the GA-kNN algorithm to simultaneously maximize "Preservation", maximize Q²_LOO, and minimize the number of compounds considered to be outliers resulted in a set of 1725 non-dominated solutions for the F7 data set and 3710 non-dominated solutions for the DHFR data set. For the F7 data set, compound removal reduced the value of "Preservation" from 63.9 (no compounds removed) to 58.3 (33 compounds removed). For the DHFR data set, compound removal reduced the value of "Preservation" from 110.8 to 95.8 (by removing 100 compounds).


For the F7 data set, a result with Q²_LOO = 0.85 was obtained upon the removal of 20 compounds, including two with a "Preservation" value > 0.8. In contrast with the results obtained for the logBBB data set, this number is equal to that obtained in the initial run (i.e., without the application of the "Preservation" criterion). As before, compound removal did not affect the internal diversity of the F7 data set (results not shown). For the DHFR data set, a result with Q²_LOO = 0.85 was obtained upon the removal of 95 compounds. This number is slightly higher than that obtained in the initial run (87 compounds), presumably due to imposing the "Preservation" criterion. None of the ten compounds with a "Preservation" value > 0.8 was removed. As in all other cases, compound removal did not affect the internal diversity of the DHFR data set.

For both the F7 and DHFR data sets, compounds surviving the filtration process were divided into a modeling set (275 and 462 compounds for the F7 and DHFR data sets, respectively) and a test set (69 and 116 compounds, respectively), affording once more models with good prediction statistics (F7 kNN: q² = 0.78; R² = 0.77 with no AD; R² = 0.91 with AD; MAE = 0.34 with no AD, MAE = 0.25 with AD; AD coverage = 70%; F7 RF: q² = 0.75; R² = 0.79; MAE = 0.34; DHFR kNN: q² = 0.77; R² = 0.8 with no AD; R² = 0.82 with AD; MAE = 0.35 with no AD, MAE = 0.23 with AD; AD coverage = 78%; DHFR RF: q² = 0.73; R² = 0.74; MAE = 0.4). These numbers are similar to those obtained for the original models (i.e., models derived from the data sets processed without the Preservation criterion) and are better than those obtained for models derived from the data sets processed by all other methods (except kNN-OR).

4. Discussion and Conclusions

Two basic assumptions underlie this work: (1) the presence of outliers within a data set compromises the ability of statistical methods to produce predictive QSAR models, and (2) this ability is regained upon outlier removal. Moreover, since there is a trade-off between the number of compounds removed from a data set and the applicability domain of models subsequently derived from it, it is preferable to remove the smallest possible number of compounds from the data set.

This paper presents a new multi-objective, kNN-based genetic algorithm for outlier removal (GA-kNN), built on the foundations of the recently developed kNN-OR algorithm.17 Being based on the kNN approach, the algorithm retains in a data set compounds whose activities can be reliably predicted from those of their nearest neighbors, whether these neighbors are close to the compounds or remote from them. At the same time, the algorithm removes from a data set compounds whose activities cannot be predicted from those of their nearest neighbors even when these neighbors are close to them in the given descriptors space (i.e., activity cliffs).
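
This retention criterion can be made concrete with a short leave-one-out sketch. The fragment below illustrates the underlying kNN idea only; it is not the published kNN-OR/GA-kNN implementation.

```python
import numpy as np

def knn_prediction_errors(X, y, k=5):
    """Leave-one-out kNN: predict each compound's activity as the mean
    activity of its k nearest neighbors (itself excluded) and return
    the absolute prediction errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    errors = np.empty(len(y))
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)   # distances to all compounds
        d[i] = np.inf                          # exclude the compound itself
        nearest = np.argsort(d)[:k]
        errors[i] = abs(y[i] - y[nearest].mean())
    return errors

# Compounds with large errors despite having close neighbors are
# activity-cliff suspects; compounds predicted well are kept regardless
# of how remote their neighbors are.
```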

The GA-kNN algorithm was applied to the filtration of three data sets of different activities, sizes, and descriptors and was found to outperform five standard outlier removal algorithms (distance-based, distance k-based, one-class SVM, statistical, and random; see below for a comparison with the original kNN-OR algorithm) in terms of two metrics. First, the internal diversity of the data sets was not reduced upon filtration with GA-kNN, suggesting that the applicability domain of QSAR models derived from these data sets is not reduced. In contrast, filtration with all other methods (except random removal) led to diversity reduction. Second, QSAR models derived from the data sets filtered with GA-kNN provided significantly better prediction statistics than QSAR models derived from data sets filtered by any of the other methods. This is true for both kNN-derived and RF-derived models, suggesting that GA-kNN could be coupled with both distance-based and non-distance-based QSAR methods.

In terms of the above two metrics, equally good data sets were obtained by GA-kNN and by the original kNN-OR algorithm. Yet for GA-kNN, the q²_kNN ≥ 0.85 criterion was met upon the removal of fewer compounds (logBBB: 13 instead of 19; F7: 20 instead of 22; DHFR: 75 instead of 87). This suggests a potentially larger applicability domain for the resulting models. Moreover, treating outlier removal as a multi-objective optimization problem has two additional advantages: (1) a set of non-dominated solutions (as opposed to a single solution) is provided to the user, allowing for the selection of the most suitable one; (2) other objectives can be added and simultaneously optimized by the algorithm. This is exemplified in this work by considering compound "Preservation" as an additional objective while removing outliers from all three data sets. Equally good filtered data sets (as judged by the internal diversity and the prediction statistics of the resulting QSAR models) retaining most of the compounds with high
preservation values were obtained upon adding this objective. The GA-kNN algorithm could be easily extended to incorporate additional objectives (e.g., minimizing the number of descriptors in the final model or maximizing model interpretability).

In conclusion, this work presents the first multi-objective optimization algorithm for the rational removal of outliers from QSAR data sets, with performances exceeding those of previously reported outlier removal algorithms. Importantly, since the new method removes outliers based on the predictability of their activities from those of their nearest neighbors, it could potentially be used for cleaning QSAR data sets of activity cliffs. We expect this new algorithm to be useful in future QSAR applications. The new algorithm will be made available upon request.

Supporting Information Available: Tables S1-S3. This material is available free of charge via the Internet at http://pubs.acs.org.

6. References

1. Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R., QSAR modeling: where have you been? Where are you going to? Journal of Medicinal Chemistry 2014, 57, 4977-5010.
2. Tropsha, A.; Golbraikh, A., Predictive QSAR modeling workflow, model applicability domains, and virtual screening. Current Pharmaceutical Design 2007, 13, 3494-3504.
3. Hawkins, D. M., Identification of Outliers. Springer: 1980; Vol. 11.
4. Johnson, R. A.; Wichern, D. W., Applied Multivariate Statistical Analysis. Prentice Hall: Englewood Cliffs, NJ, 1992; Vol. 4.
5. Barnett, V.; Lewis, T., Outliers in Statistical Data. Wiley: New York, 1994; Vol. 3.
6. Kim, K. H., Outliers in SAR and QSAR: is unusual binding mode a possible source of outliers? J Comput Aided Mol Des 2007, 21, 63-86.
7. Tarko, L.; Stecoza, C. E.; Ilie, C.; Chifiriuc, M. C., QSAR studies on antibacterial activity of some substituted dihydrodibenzothiepins. Rev Chim-Bucharest 2009, 60, 476-479.
8. Ben-Gal, I. Outlier detection. In Data Mining and Knowledge Discovery Handbook; Springer: 2005; pp 131-146.
9. Knox, E. M.; Ng, R. T. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases, 1998; pp 392-403.
10. Ramaswamy, S.; Rastogi, R.; Shim, K. Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD Record, 2000; Vol. 29; pp 427-438.
11. Kaufman, L.; Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons: 2009; Vol. 344.
12. Zhu, H.; Martin, T. M.; Ye, L.; Sedykh, A.; Young, D. M.; Tropsha, A., Quantitative structure-activity relationship modeling of rat acute toxicity by oral exposure. Chem Res Toxicol 2009, 22, 1913-21.
13. Schölkopf, B.; Smola, A. J.; Williamson, R. C.; Bartlett, P. L., New support vector algorithms. Neural Computation 2000, 12, 1207-1245.
14. Cao, D. S.; Liang, Y. Z.; Xu, Q. S.; Li, H. D.; Chen, X., A new strategy of outlier detection for QSAR/QSPR. J Comput Chem 2010, 31, 592-602.
15. Schmid, G. H.; Csizmadia, V. M.; Mezey, P. G.; Csizmadia, I. G., The application of iterative optimization techniques to chemical kinetic data of large random error. Canadian Journal of Chemistry 1976, 54, 3330-3341.
16. Laszlo, T., Monte Carlo method for identification of outlier molecules in QSAR studies. Journal of Mathematical Chemistry 2010, 47, 174-190.
17. Yosipof, A.; Senderowitz, H., k-Nearest neighbors optimization-based outlier removal. Journal of Computational Chemistry 2015, 36, 493-506.

18. Coello Coello, C. A., Evolutionary multi-objective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 2006, 1, 28-36.
19. Miettinen, K., Nonlinear Multiobjective Optimization. Springer Science & Business Media: 1999; Vol. 12.
20. Zeleny, M.; Cochrane, J. L., Multiple Criteria Decision Making. McGraw-Hill: New York, 1982; Vol. 25.
21. Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. Wiley: 2001.
22. Schaffer, J. D. Multiple objective optimization with vector evaluated genetic algorithms. In Proceedings of the 1st International Conference on Genetic Algorithms, 1985; L. Erlbaum Associates Inc.: 1985; pp 93-100.
23. Fonseca, C. M.; Fleming, P. J. Genetic algorithms for multiobjective optimization: formulation, discussion and generalization. 1993; pp 416-423.
24. Horn, J.; Nafpliotis, N.; Goldberg, D. A niched Pareto genetic algorithm for multiobjective optimization. In First IEEE Conference on Evolutionary Computation, Piscataway, New Jersey, 1994; Vol. 1.
25. Knowles, J. D.; Corne, D. W., Approximating the nondominated front using the Pareto archived evolution strategy. Evolutionary Computation 2000, 8, 149-172.
26. Zitzler, E.; Laumanns, M.; Thiele, L. SPEA2: improving the strength Pareto evolutionary algorithm. 2001; pp 95-100.
27. Abbass, H. A.; Sarker, R.; Newton, C. PDE: a Pareto-frontier differential evolution approach for multi-objective optimization problems. 2001; Vol. 2.
28. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T., A fast and elitist multi-objective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002, 6, 182-197.
29. Coello, C. A. C.; Pulido, G. T.; Lechuga, M. S., Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation 2004, 8, 256-279.
30. Zhang, L.; Zhu, H.; Oprea, T.; Golbraikh, A.; Tropsha, A., QSAR modeling of the blood–brain barrier permeability for diverse organic compounds. Pharmaceutical Research 2008, 25, 1902-1914.
31. Platts, J. A.; Abraham, M. H.; Zhao, Y. H.; Hersey, A.; Ijaz, L.; Butina, D., Correlation and prediction of a large blood–brain distribution data set—an LFER study. European Journal of Medicinal Chemistry 2001, 36, 719-730.
32. Katritzky, A. R.; Kuanar, M.; Slavov, S.; Dobchev, D. A.; Fara, D. C.; Karelson, M.; Acree Jr, W. E.; Solov'ev, V. P.; Varnek, A., Correlation of blood–brain penetration using structural descriptors. Bioorganic & Medicinal Chemistry 2006, 14, 4888-4917.
33. Norinder, U.; Carlsson, L.; Boyer, S.; Eklund, M., Introducing conformal prediction in predictive modeling. A transparent and flexible alternative to applicability domain determination. Journal of Chemical Information and Modeling 2014, 54, 1596-1603.

34. Sutherland, J. J.; O'Brien, L. A.; Weaver, D. F., Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships. Journal of Chemical Information and Computer Sciences 2003, 43, 1906-1915.
35. Johnson, M. A.; Maggiora, G. M., Concepts and Applications of Molecular Similarity. John Wiley & Sons: New York, 1990.
36. Knorr, E.; Ng, R. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB, New York, USA, 1998; Morgan Kaufmann Publishers Inc.: 1998; pp 392-403.
37. Ramaswamy, S.; Rastogi, R.; Shim, K., Efficient algorithms for mining outliers from large data sets. SIGMOD Rec. 2000, 29, 427-438.
38. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H., The WEKA data mining software: an update. SIGKDD Explor. Newsl. 2009, 11, 10-18.
39. Barnett, V.; Lewis, T., Outliers in Statistical Data. Wiley: New York, 1994.
40. Yosipof, A.; Senderowitz, H., Optimization of molecular representativeness. Journal of Chemical Information and Modeling 2014, 54, 1567-1577.
41. Organisation for Economic Co-operation and Development. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models; OECD Series on Testing and Assessment 69, OECD Document ENV/JM/MONO(2007)2, 2007; p 55 (paragraphs 65 and 198), Table 5.7.
42. Tropsha, A.; Gramatica, P.; Gombar, V. K., The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR & Combinatorial Science 2003, 22, 69-77.
