
Article

Optimal HTS fingerprints definitions by using a desirability function and a genetic algorithm Alvaro Cortes Cabrera, and Paula M. Petrone J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00447 • Publication Date (Web): 09 Feb 2018 Downloaded from http://pubs.acs.org on February 14, 2018


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Optimal HTS Fingerprints Definitions by Using a Desirability Function and a Genetic Algorithm

Alvaro Cortes Cabrera¹* and Paula M. Petrone²$

¹ GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, UK

² BarcelonaBeta Brain Research Center. Carrer de Wellington, 30, 08005 Barcelona, Spain

*[email protected] [email protected]

KEYWORDS. Genetic algorithm; HTS fingerprints; biological fingerprints; desirability function; machine learning.

ABSTRACT

Compound biological fingerprints built on data from high throughput screening (HTS) campaigns, or HTS fingerprints, are a novel cheminformatics method of representing compounds by integrating chemical and biological activity data that is gaining momentum in its application to drug discovery, including hit expansion, target identification and virtual screening. HTS fingerprints present two major limitations, noise and missing data, which are intrinsic to the high-throughput data acquisition technologies and to the assay availability or assay selection procedure used for their construction. In this work, we present a methodology to define an optimal set of HTS fingerprints by using a desirability function that encodes the principles of
maximum biological and chemical space coverage, and minimum redundancy between HTS assays. We use a genetic algorithm to optimize this desirability function and obtain an optimal fingerprint, which is evaluated for performance on a test set of 33 diverse assays. Our results show that the optimal HTS fingerprint represents compounds in chemical-biology space using 25% fewer assays. Used for virtual screening, the optimal HTS fingerprint obtained equivalent performance, in terms of both AUC and enrichment factors, to the full fingerprint for 27 out of 33 test assays, while randomly assembled fingerprints could achieve equivalent performance in only 23 test assays.

INTRODUCTION

Compound biological fingerprints have emerged as an alternative to the standard chemical molecular representation¹ that can be successfully exploited for hit expansion², target identification²,³, virtual screening⁴ and computational toxicology⁵⁻⁷. Owing to the large amount of data available from high-throughput screening (HTS) campaigns in the corporate⁸ and public⁹ domains, HTS readouts are the most popular bioactivity source for building the so-called HTS fingerprints (HTSfps)¹⁰. In HTSfps, each molecule is represented by several assay readouts, typically hundreds, for different targets and acquired by different assay technologies, which together stand for the molecule's unique biological profile. HTSfps have been successful at finding new active chemical scaffolds and also compounds that are able to reproduce a desired cellular phenotype in prospective screens².

Despite their success, they present two major limitations: noise and missing bioactivity data. Several sources of noise are well characterized in high-throughput screening technologies, including experimental conditions (dispensing errors, temperature fluctuations, evaporation, etc.), the assay technology (inhibitors of luciferase, fluorescent compounds and other assay-specific nuisance compounds) or the properties of the
compounds (unstable chemical groups, presence of impurities, toxicity, aggregation, etc.), among others, but also errors in the annotation and automatic processing of the results (legacy datasets)⁸,¹¹, all of which contribute to masking the signal of the truly active compounds in each assay¹²⁻¹⁴. The missing data arise from the fact that not all molecules are screened in every HTS campaign, as institutions continually update their screening libraries according to the needs of their research projects. Moreover, the number of assay readouts for a particular molecule depends on how conservative a company or scientific institution is about compound selection and on the amount of resources that it can deploy for each screen (e.g. focused screening vs. full deck).

The noise and missing-data factors operate at the level of the individual assay, and therefore both problems are intimately related to the key question of selecting the assays that will best describe a compound's biological profile in an HTS fingerprint. Most studies, including our previous work², take a 'the more, the merrier' approach to this problem³,⁴, by which all the available HTS assays are included in the HTS fingerprint. The rationale behind this selection procedure is the hope of covering enough of the biological space to be useful beyond the obvious target space (e.g. receptor subtypes), together with the assumption that redundant assays and their biological targets would neither add information nor be detrimental to the analysis. However, a more detailed study has led us to identify many situations in which including more assays does not necessarily extend the hypothetical biological coverage, but instead introduces additional noise and redundancy.

In this paper we address the issue of HTSfps construction. First, we show that the noise in the fingerprint is correlated and not independent, which makes it difficult to remove from the signal once the fingerprint is constructed. Second, we provide a protocol for building an HTS fingerprint with a careful selection of de-correlated assays that optimizes at the same time the chemical and biological
space coverage, and that can yield models with performance equivalent to that of a larger fingerprint in the predictions for a standard data set of 33 test screens.

METHODS

Assay sets. To allow a straightforward comparison with recent related literature, the set of PubChem HTS assays compiled by Helal et al.¹⁵ was used to build the initial HTS fingerprint, and a test set for its evaluation was chosen according to the same authors. Briefly, a total of 210 assays, selected to maximize the number of molecules and the number of target families covered, were used to build the full HTS fingerprint (see Supporting Table S1). Another 33 assays, unrelated to the first set and not used in any optimization step, were used for testing and performance evaluation (see Supporting Tables S2 and S3). Assay readouts were downloaded from PubChem and transformed to Z-score values by calculating the average and the standard deviation of each assay and applying equation 1, where $x$ is the readout of a molecule and $\mu$ and $\sigma$ are the average and the standard deviation of the assay.

$$Z = \frac{x - \mu}{\sigma} \qquad \text{(Equation 1)}$$

Extreme Z-score values, below −9 or above 9, were truncated at the ±9 standard deviations limit. Fingerprint assays were aggregated to produce a matrix in which each row corresponds to an individual molecule and the columns carry the Z-score readouts for each of the 210 assays. Molecules with readouts for fewer than 25% of the assays (52 assays) and frequent hitters (hit rate > 5%, with a hit defined as Z-score ≥ 3) were removed from the matrix, giving a total of 353,460 molecules. Missing values were replaced with zeros, following Riniker et al.⁴ For the 33 assays in the test set, the activity annotations provided in PubChem by the scientists responsible for each screen were used.
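The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names, the use of `None` for untested molecules, the choice of the population standard deviation, and the hit rate computed over measured assays only are all assumptions, since the text does not specify them.

```python
from statistics import mean, pstdev


def zscore_assay(readouts, limit=9.0):
    """Convert one assay's raw readouts to Z-scores truncated at +/- limit.

    `readouts` holds one value per molecule; None marks an untested molecule.
    Assumption: the population standard deviation is used (not stated in the text).
    """
    measured = [x for x in readouts if x is not None]
    mu, sigma = mean(measured), pstdev(measured)
    if sigma == 0:
        raise ValueError("assay has no variance; Z-scores are undefined")
    return [
        None if x is None else max(-limit, min(limit, (x - mu) / sigma))
        for x in readouts
    ]


def build_fingerprints(profiles, min_fraction=0.25, max_hit_rate=0.05, hit_z=3.0):
    """Filter molecules and fill gaps; `profiles` maps molecule id -> Z-score list.

    Assumption: the hit rate is computed over the assays actually measured.
    """
    kept = {}
    for mol_id, zs in profiles.items():
        measured = [z for z in zs if z is not None]
        if not measured or len(measured) < min_fraction * len(zs):
            continue  # too sparse: fewer than 25% of the assays have a readout
        hit_rate = sum(z >= hit_z for z in measured) / len(measured)
        if hit_rate > max_hit_rate:
            continue  # frequent hitter
        kept[mol_id] = [0.0 if z is None else z for z in zs]  # missing -> 0
    return kept
```

Keeping the Z-score step per assay and the filtering step per molecule mirrors the order in the text: normalization happens before the matrix is assembled, and zeros are only filled in for molecules that survive both filters.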


Noise filtering. To filter the inherent noise in the full HTS fingerprint matrix (210 assays × 353,460 compounds), we carried out a principal component analysis (PCA). Then, using concepts from random matrix theory¹⁶ and the Marchenko-Pastur distribution¹⁷, we defined a minimum threshold for the eigenvalue of the last signal-carrying component according to equation 2:

$$\text{Threshold} = 1 + \sqrt{\frac{p}{n}} \qquad \text{(Equation 2)}$$

where p is the number of assays or variables (210) and n is the number of molecules or samples (353,460). Only the components above the threshold were used to reconstruct the HTS fingerprint.

Desirability function. To find the optimal set of assays for building the HTS fingerprint, we designed a desirability function based on four objectives:

a) Selected assays should not be correlated. This involves reducing the average correlation between assay pairs, but also the maximum correlation between any two particular assays; e.g. two targets of the same subfamily, or two assays for the same target, tend to produce highly correlated and redundant results.
b) Maximum molecular coverage, which translates into selecting assays in which large compound libraries have been screened.
c) Maximum biological coverage, which implies using the largest possible number of truly independent assays.
d) Minimum missing data, which decreases the sparsity of the final HTS fingerprint and increases its information content; e.g. an assay in which only 5% of the compounds are also found in other campaigns should not be included.

To achieve these objectives, the desirability function has five sub-components with individual weights, as described in Table 1. These components are integrated into the final desirability value using Equation 3.

$$d = \left(\prod_{i=1}^{5} d_i^{\,w_i}\right)^{1/\sum_{i=1}^{5} w_i} \qquad \text{(Equation 3)}$$

where $d$ is the global desirability value, $d_i$ is the desirability of the ith sub-component and $w_i$ is its weight. By adjusting the weights, the function can be adapted to prioritize certain objectives over others.

Table 1. Desirability function definition, penalty scheme and weights ($w_i$), where $Y_i$ is the value of the parameter for a particular set of assays, $Y_T$ is the maximal value of the parameter, $p_i$ is the value of the penalty for parameter i, and $d_i$ is the desirability of the ith parameter.

| Function | Form | Penalty | $w_i$ |
| --- | --- | --- | --- |
| Missing data ratio | $d_1(Y_i) = \frac{0.7 - Y_i}{0.7}$ if $Y_i < 0.7$; $0$ if $Y_i \geq 0.7$ | $p_1(Y_i) = 0.001$ if $Y_i < 0.7$; $1.0 - Y_i$ if $Y_i \geq 0.7$ | 5.0 |
| Maximum Pearson correlation | $d_2(Y_i) = \frac{0.4 - \lvert Y_i \rvert}{0.4}$ if $\lvert Y_i \rvert < 0.4$; $0$ if $\lvert Y_i \rvert \geq 0.4$ | $p_2(Y_i) = 0.001$ if $\lvert Y_i \rvert < 0.4$; $Y_i$ if $\lvert Y_i \rvert \geq 0.4$ | 1.0 |
| Average Pearson correlation | $d_3(Y_i) = 1.0 - \lvert Y_i \rvert$ | $p_3(Y_i) = 0.001$ | 0.3 |
| Number of assays | $d_4(Y_i) = Y_i / Y_T$ | $p_4(Y_i) = 0.001$ | 1.0 |
| Number of molecules | $d_5(Y_i) = Y_i / Y_T$ | $p_5(Y_i) = 0.001$ | 5.0 |
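As an illustration of how Equation 3 combines the desirability forms of Table 1, the following Python sketch computes the global desirability as a weighted geometric mean. The function and key names are hypothetical; only the desirability forms and the weights are taken from Table 1, and the penalty terms are deliberately left out.

```python
import math

# Weights w_i as listed in Table 1 (dictionary keys are illustrative names).
WEIGHTS = {
    "missing": 5.0,
    "max_corr": 1.0,
    "avg_corr": 0.3,
    "n_assays": 1.0,
    "n_molecules": 5.0,
}


def sub_desirabilities(missing_ratio, max_corr, avg_corr, assay_fraction, molecule_fraction):
    """Evaluate the five d_i; the two fractions are Y_i / Y_T, already in [0, 1]."""
    return {
        # max(0, ...) reproduces the piecewise forms: zero beyond the cut-off.
        "missing": max(0.0, (0.7 - missing_ratio) / 0.7),
        "max_corr": max(0.0, (0.4 - abs(max_corr)) / 0.4),
        "avg_corr": 1.0 - abs(avg_corr),
        "n_assays": assay_fraction,
        "n_molecules": molecule_fraction,
    }


def global_desirability(d, weights=WEIGHTS):
    """Weighted geometric mean of the d_i, computed in log space for stability."""
    if any(value <= 0.0 for value in d.values()):
        return 0.0  # any unmet criterion zeroes the whole product
    log_sum = sum(weights[key] * math.log(d[key]) for key in d)
    return math.exp(log_sum / sum(weights[key] for key in d))


# Example call with illustrative inputs.
d = global_desirability(sub_desirabilities(0.10, 0.25, 0.05, 0.8, 0.9))
```

Because the terms are multiplied, a single sub-component equal to zero makes the global value zero regardless of the others, which is the behavior that motivates the penalty scheme in the same table.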


Optimization algorithm. Ideally, the algorithm should be able to handle high-dimensional vectors (many potential assays) and a discrete space (binary variables: an assay is either used or not) with a minimal number of parameters. Since the number of variables to optimize may vary from a few dozen to hundreds, depending on the available assays, a robust optimization algorithm is needed. For this reason, we selected a simple, yet powerful, genetic algorithm (GA).

In addition, the desirability function introduces a new challenge for the optimization algorithm: the function cannot provide enough information to guide the algorithm from the subspace of non-feasible solutions to the region of acceptable solutions. That is, the desirability value for a subset of assays that does not meet the minimum criteria will always be zero, with no information about how 'bad' the set is or how to improve it. Starting from a random population of sets of assays, this would be the case for the vast majority of the proposed solutions, if not all, and the algorithm would spend hundreds of iterations blindly and randomly exploring the solution space until it found an acceptable set. To remedy this issue, we implemented a small penalty scheme, as recommended by Ortiz et al.¹⁸, with the form described in Table 1 and combined with the desirability function as indicated in equations 4 and 5, which can effectively lead the GA to the feasible subspace in a few iterations.

$$p = \left(\prod_{i=1}^{5} p_i\right)^{1/5} \qquad \text{(Equation 4)}$$

$$d_{final} = d - p \qquad \text{(Equation 5)}$$
where $p$ is the global penalty for the set, $p_i$ is the penalty associated with the ith sub-component, $d$ is the global desirability and $d_{final}$ is the final score including all the terms.

The final algorithm was implemented in C, using OpenMP for parallelization because of the high computational cost of the process. We applied a total of 1,000 optimization rounds to ensure convergence. Following the rule of a population size equal to 10 times the number of variables, to guarantee enough diversity, we used an initial population of 2,100 individuals randomly initialized with a value of zero or one for each assay, which indicates whether the assay is used or discarded in that candidate solution. At the end of each round, we used a tournament scheme for the selection phase, in which the best of three random individuals is selected for reproduction. Then, a two-point crossover mating function, applied with a probability of 50%, was employed to generate the offspring. Finally, each individual was selected for mutation with a probability of 20% and, if selected, each of its genes (individual assays) was flipped with a probability of 5%. The GA was run 10 times to evaluate the convergence of the results. The code is included as Supporting Information.

Evaluation. To test the performance of the different variants of the HTS fingerprint, we built random forest classifiers to predict the active compounds in each of the 33 assays in the test set. For each target, we randomly selected 20% of the data set for training the model and used the remaining 80% to evaluate its performance using the area under the curve (AUC) of the receiver-operating characteristic (ROC) plot and the enrichment factor (EF) at 1%, 5%, 10% and 20% of the screened list of molecules. Each evaluation test was repeated 10 times to generate statistics. AUC and EF1% values per assay were compared across the fingerprint versions, and statistically significant differences were investigated using a Welch t-test (α=0.01) with the Bonferroni correction for multiple testing.
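The genetic algorithm settings listed under "Optimization algorithm" (tournament of three, two-point crossover at 50%, mutation selection at 20% with a 5% per-gene flip, population of 10 times the number of variables) can be sketched in Python as follows. This is an illustrative re-implementation, not the authors' C/OpenMP code: all function names are hypothetical, and the full generational replacement and the tracking of the best individual seen so far are assumptions that the text does not describe.

```python
import random


def tournament(population, scores, rng, size=3):
    """Return the best of `size` randomly drawn individuals."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]


def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points; returns two new lists."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(individual, rng, p_select=0.2, p_flip=0.05):
    """With probability p_select, flip each gene independently with p_flip."""
    if rng.random() >= p_select:
        return individual
    return [1 - gene if rng.random() < p_flip else gene for gene in individual]


def run_ga(fitness, n_genes, rounds=1000, p_cross=0.5, seed=0):
    """Maximise `fitness` over binary vectors (1 = assay used, 0 = discarded)."""
    rng = random.Random(seed)
    pop_size = 10 * n_genes  # rule of thumb quoted in the text
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        scores = [fitness(ind) for ind in population]
        top = max(range(pop_size), key=lambda i: scores[i])
        if scores[top] > best_score:  # assumption: remember the best seen so far
            best, best_score = population[top][:], scores[top]
        offspring = []
        while len(offspring) < pop_size:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            if rng.random() < p_cross:
                a, b = two_point_crossover(a, b, rng)
            offspring.extend([mutate(a, rng), mutate(b, rng)])
        population = offspring[:pop_size]  # assumption: full generational replacement
    return best, best_score


# Toy usage: the number of selected genes stands in for the final score d_final.
best, score = run_ga(fitness=sum, n_genes=12, rounds=40)
```

In a real run the `fitness` argument would evaluate the final score of Equation 5 for the assay subset encoded by each individual; the toy fitness here only checks that selection, crossover and mutation push the population in the right direction.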


RESULTS AND DISCUSSION

We have built a compound biological fingerprint based on public data from PubChem using a set of assays related to the NIH Molecular Libraries Program (MLP)¹⁹,²⁰ (see Methods; set of 210 assays by Helal et al.¹⁵). These selection criteria ensure good chemical coverage, since the set includes the largest HTS assays publicly available, and good compound overlap between assays, since these projects share chemical libraries. However, we can expect a high disparity in the levels of quality and noise among assays, as well as redundancy in the assay information.

In an initial attempt to improve the quality of the Z-score HTSfps, we worked under the hypothesis that our full fingerprint could be considered the result of combining a hypothetical pure signal from the assays with noise from different sources (Figure 1A). Under this assumption, it would be possible to separate both components and reconstruct a de-noised version of the fingerprint. To this end, we applied a simple approach based on principal component analysis (PCA) and random matrix theory²¹,²². In a first stage, the HTSfps matrix is subjected to PCA and decomposed into its principal components. This decomposition allows isolating the strongest components, i.e. the components with the highest eigenvalues, which carry most of the signal of the data set, from the weak components, which carry mostly noise. To determine a threshold that optimally separates both sets of components, we relied on the Marchenko-Pastur distribution, which characterizes the eigenvalues that can be expected from a random matrix (a matrix of pure noise) of any particular size and can be used to estimate the eigenvalue of the last noisy component. By applying equation 2, we determined the threshold for the HTSfps matrix and selected the most informative components to rebuild the fingerprint. Once reconstructed, the fingerprint was evaluated on our independent test set of 33 assays by training random forest classifiers with random splits of 20% of the fingerprint data and evaluating their performance
in the remaining 80%. For comparison, the same experiment was performed using the full fingerprint under identical conditions for training and testing.

Figure 1. A) HTS fingerprint de-noising protocol. Principal component analysis is used to decompose the HTS fingerprint; then, signal-carrying components with an eigenvalue higher than the threshold predicted from the Marchenko-Pastur law are selected to reconstruct the matrix. B) Optimization algorithm to select the set of most informative assays. The genetic algorithm starts with a random population. Individuals with high desirability are selected for reproduction. Random mutations are introduced in the offspring with a low frequency to add diversity.
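As a rough worked example of the threshold used in panel A, take the matrix dimensions given in Methods ($p = 210$ assays, $n = 353{,}460$ molecules). Assuming that equation 2 reads $1 + \sqrt{p/n}$ and that the eigenvalues are those of the correlation matrix of standardized assay columns (both are assumptions made here for illustration), the arithmetic is:

$$\sqrt{\frac{p}{n}} = \sqrt{\frac{210}{353{,}460}} \approx 0.0244, \qquad \text{Threshold} = 1 + \sqrt{\frac{p}{n}} \approx 1.0244$$

For comparison, the upper edge of the Marchenko-Pastur distribution for unit-variance noise is

$$\lambda_{+} = \left(1 + \sqrt{\frac{p}{n}}\right)^{2} \approx 1.0493$$

Because $n$ is much larger than $p$, both values sit only slightly above 1, so any component whose eigenvalue is clearly larger than 1 would be treated as signal under either reading.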

We used two main parameters to evaluate the de-noised fingerprints: the area under the curve (AUC) of the ROC plot, which measures the global ability of the classifier to distinguish between active and inactive molecules, and the enrichment factor at 1% (EF1%), which measures the increase in the probability of finding active molecules in the top 1% of the ranked list produced by the classifier compared with random selection. As can be seen in Figure 2, both parameters are significantly (α=0.01) worse for all the assays in the test set when using the reconstructed HTSfps.
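Both metrics can be computed directly from a ranked list of classifier scores. The following pure-Python sketch is illustrative only: the function names are hypothetical, labels are assumed to be 1 for active and 0 for inactive, and rounding the size of the top fraction to the nearest integer is an assumption about a detail the text does not specify.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) identity, with average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(rank for rank, label in zip(ranks, labels) if label)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def enrichment_factor(labels, scores, fraction=0.01):
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate."""
    n_top = max(1, round(fraction * len(labels)))  # assumption: nearest integer
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])
    return (hits_in_top / n_top) / (sum(labels) / len(labels))
```

An AUC of 0.5 and an enrichment factor of 1 both correspond to random ranking, which is why the two metrics are reported together: the first summarizes the whole list, the second only the part a screening campaign would actually follow up.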


Figure 2. AUC and EF1% for the 33 assays in the test set using the full HTS fingerprint (red) and the de-noised version using PCA and RMT (blue) are shown. Assays with significant (α=0.01) differences in performance are indicated with an asterisk.

This apparent contradiction can be explained by considering the assumptions that our de-noising methodology requires. A key requirement for PCA to be able to separate signal from noise is that the noise present in the data is independent and uncorrelated. From the results of our
analysis we can conclude that this is not the case for the HTSfps, and complex noise relationships between assays can be expected²³. Consequently, by discarding the components below the Marchenko-Pastur threshold we are also removing a large part of the signal. To evaluate this point, we counted the number of positive annotations (i.e. Z-score value ≥ 3) before and after the procedure, as a proxy for the actual information content of the fingerprint. From the initial 559,202 activity annotations out of 68,252,191 (0.81%), the filtered fingerprint contained only 114,606 (0.16%), which represents only 20% of the initial number. In addition, to assess the effects of the PCA itself, we measured the performance of the resulting models using all the components as fingerprints. As can be seen in Supporting Figure S1, these are also significantly worse than the original fingerprint. This phenomenon is not exclusive to HTS data; it has been described before for other types of data, such as gene expression microarrays, where Colwell et al.²⁴ found analogous problems.

Since reducing the noise once the HTSfps has been assembled is a challenging task, we propose to address both the noise and the missing data issues simultaneously when building the initial assay set (Figure 1B). Starting with our full HTSfps, we defined a desirability function (Table 1, see Methods) to be optimized with a GA (see Methods). The optimization procedure converged well in all the runs (see Methods and Supporting Figure S2), and the resulting fingerprints had a large overlap in terms of the assays selected and virtually equivalent values for the 5 parameters used in our desirability function. The total number of assays selected by all the solutions was 166 (79% of the total 210), with a maximum absolute Pearson correlation between assays of 0.30 and an average absolute correlation of 0.02. The optimal HTSfps covered 92% of the total number of molecules
with a ratio of missing information (unavailable readouts divided by the product of the number of molecules and the number of assays) of around 3.7% in all cases.

After the optimization procedure, the best assay set according to the global desirability score was selected to build a reduced fingerprint, which was again evaluated using the independent test set of 33 assays. Its performance was compared with that of 5 sets of 166 randomly selected assays and with that of the full HTS fingerprint, which includes all the available assays. As can be seen in Figure 3, the performance of the optimal fingerprint is comparable to that of the full fingerprint, which carries 25% more assays. Only 6 test set assays (18%) show significantly worse global performance (AUC) when the optimized fingerprint is used to train the classifiers, while an average of 10 test set assays (30%) show degraded performance when the random assay sets are used (Supporting Information and Supporting Figure S3). Regarding the EF1% metric, again only 6 test set assays show degraded performance when the optimized fingerprint is used to train the models. In the case of the randomly assembled fingerprints, 10 test assays (30%) on average have worse enrichments. Our results confirm that the optimization procedure provides better sets of assays than pure random selection.

Figure 3. AUC and EF1% for the 33 assays in the test set using the full HTS fingerprint (red) and the optimal version found in this work (blue). Assays with significant (α=0.01) differences in performance are indicated with an asterisk.
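The per-assay comparison behind the asterisks can be sketched as follows. This is an illustrative pure-Python computation of the Welch statistic and its Welch-Satterthwaite degrees of freedom, plus a Bonferroni-adjusted significance level. The function names are hypothetical, and dividing α by 33 (one test per assay) is an assumption, since the text does not state how many comparisons were corrected for.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    var_a = variance(a) / len(a)  # squared standard error of each sample mean
    var_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (var_a + var_b) ** 0.5
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


def bonferroni_alpha(alpha=0.01, n_tests=33):
    """Per-test significance level; n_tests=33 is an assumption (one per assay)."""
    return alpha / n_tests
```

Here `a` and `b` would hold, for one test assay, the 10 repeated AUC (or EF1%) values obtained with two fingerprint versions. Turning the statistic into a p-value additionally requires the Student t distribution, for example through `scipy.stats.ttest_ind(a, b, equal_var=False)`, and the result is then compared with the adjusted level.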


The balanced desirability function that we have employed in this work penalizes missing information and lack of molecular coverage at the expense of being more permissive with the number of assays and assay correlation. By choosing this set of weights, our methodology also provides a solution to the particular case of “moving collections”, when the library of
compounds to be screened is renewed periodically or incrementally, and an updated snapshot HTSfps is required every few months.

Our performance evaluation is based on a diverse set of 33 assays, which provides reasonable confidence in the ability of the optimized fingerprints to be representative of a wide biological space. However, it might still be possible to modify the set of weights in the desirability function to reduce the risk of not covering enough of that space, by biasing the optimization procedure towards maximal inclusion of non-redundant assays at the expense of allowing a higher percentage of missing data in the final fingerprint.

In summary, our method is flexible enough to accommodate new specific requirements through the introduction of new parameters in the desirability function, such as assay incompatibility in cases where several orthogonal assays are available for the same target; forcing better data coverage of a particular subset of molecules, for instance in the case of focused libraries; or including proxy measurements of the noise level of each assay, such as the hit rate or the technology employed.

CONCLUSIONS

In this work, we have introduced a methodology to build more efficient biological fingerprints from HTS data to represent compounds in cheminformatics applications such as virtual screening. This technique mitigates the issues related to noise and missing data intrinsic to HTS fingerprints. We have shown that the noise in the fingerprints is dependent and correlated across assays, which makes its separation from the signal a complex task. To circumvent the issue, we introduced a method to design the optimal set of most informative assays by using a desirability function encoding our requirements and a genetic algorithm for the challenging optimization
process. The results show that the classification models built using an optimal set of assays have performance equivalent to that of models built on a more complex data set, with a larger number of partially correlated and redundant assays, in 27 out of 33 test assays, and improved performance over randomly selected sets, which achieved equivalent performance in only 23 out of 33 test assays.

Code availability: The code and algorithms used in this work can be found in the Supporting Information or downloaded from http://github.com/accsc/HTS_GA

AUTHOR INFORMATION

Co-Corresponding Authors

[email protected]. GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, Hertfordshire, SG1 2NY, UK

[email protected]. BarcelonaBeta Brain Research Center. Carrer de Wellington, 30, 08005 Barcelona, Spain

Author Contributions

A.C. designed the algorithm and A.C. and P.P. wrote the paper.

Conflict of Interest

A.C. is a GlaxoSmithKline employee.

ABBREVIATIONS

AUC, Area under the curve; ROC, Receiver-operating characteristic; GA, Genetic algorithm; MLP, NIH Molecular Libraries Program; HTS, High throughput screening; fps, Fingerprints; PCA, Principal component analysis; RMT, Random matrix theory.


Supporting Information Available: Tables S1 and S2; Figures S1, S2, and S3; and tables with individual results per assay in the test set for the different methods evaluated.

REFERENCES

1. Cortés-Cabrera, A.; Morris, G. M.; Finn, P. W.; Morreale, A.; Gago, F., Comparison of ultra-fast 2D and 3D ligand and target descriptors for side effect prediction and network analysis in polypharmacology. British journal of pharmacology 2013, 170, 557-567.
2. Cortes Cabrera, A.; Lucena-Agell, D.; Redondo-Horcajo, M.; Barasoain, I.; Díaz, J. F.; Fasching, B.; Petrone, P. M., Aggregated Compound Biological Signatures Facilitate Phenotypic Drug Discovery and Target Elucidation. ACS chemical biology 2016, 11, 3024-3034.
3. Wassermann, A. M.; Lounkine, E.; Urban, L.; Whitebread, S.; Chen, S.; Hughes, K.; Guo, H.; Kutlina, E.; Fekete, A.; Klumpp, M., A screening pattern recognition method finds new and divergent targets for drugs and natural products. ACS chemical biology 2014, 9, 1622-1631.
4. Riniker, S.; Wang, Y.; Jenkins, J. L.; Landrum, G. A., Using information from historical high-throughput screens to predict active compounds. Journal of chemical information and modeling 2014, 54, 1880-1891.
5. Zhu, H.; Zhang, J.; Kim, M. T.; Boison, A.; Sedykh, A.; Moran, K., Big data in chemical toxicity research: the use of high-throughput screening assays to identify potential toxicants. Chemical research in toxicology 2014, 27, 1643-1651.
6. Zhang, J.; Hsieh, J.-H.; Zhu, H., Profiling animal toxicants by automatically mining public bioassay data: a big data approach for computational toxicology. PloS one 2014, 9, e99863.
7. Russo, D. P.; Kim, M. T.; Wang, W.; Pinolini, D.; Shende, S.; Strickland, J.; Hartung, T.; Zhu, H., CIIPro: a new read-across portal to fill data gaps using public large-scale chemical and biological data. Bioinformatics 2016, 33, 464-466.
8. Kramer, C.; Dahl, G.; Tyrchan, C.; Ulander, J., A comprehensive company database analysis of biological assay variability. Drug discovery today 2016.
9. Fourches, D.; Sassano, M. F.; Roth, B. L.; Tropsha, A., HTS navigator: freely accessible cheminformatics software for analyzing high-throughput screening data. Bioinformatics 2013, 30, 588-589.
10. Petrone, P. M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian, P.; Cornett, A.; Deng, Z.; Davies, J. W.; Jenkins, J. L.; Glick, M., Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS chemical biology 2012, 7, 1399-1409.
11. Kalliokoski, T.; Kramer, C.; Vulpetti, A., Quality issues with public domain chemogenomics data. Molecular Informatics 2013, 32, 898-905.
12. Glick, M.; Klon, A. E.; Acklin, P.; Davies, J. W., Enrichment of extremely noisy high-throughput screening data using a naive Bayes classifier. Journal of biomolecular screening 2004, 9, 32-36.
13. Gribbon, P.; Lyons, R.; Laflin, P.; Bradley, J.; Chambers, C.; Williams, B. S.; Keighley, W.; Sewing, A., Evaluating real-life high-throughput screening data. Journal of biomolecular screening 2005, 10, 99-107.
14. Diller, D. J.; Hobbs, D. W., Deriving knowledge through data mining high-throughput screening data. Journal of medicinal chemistry 2004, 47, 6373-6383.


15. Helal, K. Y.; Maciejewski, M.; Gregori-Puigjané, E.; Glick, M.; Wassermann, A. M., Public domain HTS fingerprints: design and evaluation of compound bioactivity profiles from PubChem’s bioassay repository. Journal of chemical information and modeling 2016, 56, 390-398.
16. Tracy, C. A.; Widom, H. The distributions of random matrix theory and their applications. In New Trends in Mathematical Physics; Springer: 2009, pp 753-765.
17. Marchenko, V. A.; Pastur, L. A., Distribution of eigenvalues for some sets of random matrices. Matematicheskii Sbornik 1967, 114, 507-536.
18. Ortiz Jr, F.; Simpson, J. R. A Genetic Algorithm with a Modified Desirability Function Approach to Multiple Response Optimization. In IIE Annual Conference. Proceedings, 2002; Citeseer: 2002; p 1.
19. Huryn, D. M.; Cosford, N. D., The molecular libraries screening center network (MLSCN): identifying chemical probes of biological systems. Annual Reports in Medicinal Chemistry 2007, 42, 401-416.
20. Wang, Y.; Bolton, E.; Dracheva, S.; Karapetyan, K.; Shoemaker, B. A.; Suzek, T. O.; Wang, J.; Xiao, J.; Zhang, J.; Bryant, S. H., An overview of the PubChem BioAssay resource. Nucleic acids research 2010, 38, D255-D266.
21. Sharifi, S.; Crane, M.; Shamaie, A.; Ruskin, H., Random matrix theory for portfolio optimization: a stability approach. Physica A: Statistical Mechanics and its Applications 2004, 335, 629-643.
22. Ulfarsson, M. O.; Solo, V., Dimension estimation in noisy PCA with SURE and random matrix theory. IEEE transactions on signal processing 2008, 56, 5804-5816.
23. Glick, M.; Jenkins, J. L.; Nettles, J. H.; Hitchings, H.; Davies, J. W., Enrichment of high-throughput screening data with increasing levels of noise using support vector machines, recursive partitioning, and Laplacian-modified naive Bayesian classifiers. Journal of chemical information and modeling 2006, 46, 193-200.
24. Colwell, L. J.; Qin, Y.; Huntley, M.; Manta, A.; Brenner, M. P., Feynman-Hellmann Theorem and Signal Identification from Sample Covariance Matrices. Physical Review X 2014, 4, 031032.

