Automated Protocol for Large-Scale Modeling of Gene Expression Data

Oct 31, 2016 - Here we show how automated machine learning workflows can produce models that are predictive of differential gene expression as a funct...
0 downloads 10 Views 3MB Size
Subscriber access provided by United Arab Emirates University | Libraries Deanship

Article

Automated protocol for large-scale modeling of gene expression data Michelle Lynn Hall, David Calkins, and Woody Sherman J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00260 • Publication Date (Web): 31 Oct 2016 Downloaded from http://pubs.acs.org on November 5, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Automated Protocol for Large-Scale Modeling of Gene Expression Data Michelle Lynn Hall,∗,†,¶ David Calkins,‡ and Woody B. Sherman† †Schr¨odinger, Inc., 222 Third Street, Cambridge, MA 02143 ‡Schr¨odinger, Inc., 101 SW Main St #1300, Portland, OR 97204 ¶Current address: Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139 E-mail: [email protected]

Abstract With the continued rise of phenotypic and genotypic-based screening projects, computational methods to analyze, process, and ultimately make predictions in this field take on growing importance. Here we show how automated machine learning workflows can produce models that are predictive of differential gene expression as a function of a compound structure using data from A673 cells as a proof of principle. In particular, we present predictive models with an average accuracy of greater than 70% across a highly diverse ∼ 1000 gene expression profile. In contrast to the usual in silico design paradigm, where one interrogates a particular target-based response, this work opens the opportunity for virtual screening and lead optimization for desired multi-target gene expression profiles.

1

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Introduction Target-based design Computer-aided drug design has traditionally focused on target-based approaches (including structure- and ligand-based projects) such as docking, pharmacophore modeling, molecular dynamics simulations, and predicting protein-ligand binding free energies. 1 Target-based approaches rely on the assumption that engaging the target of interest (binding, activating, allosterically modulating, etc.) will give rise to the desired phenotype, although this assumption is frequently untrue. With the growing challenges associated with discovering new drugs using target-based approaches and the development of higher throughput phenotypic screening methods, 2 the need for in silico methods to facilitate design outside of the target-based space also grows.

Phenotypic-based design In silico methods for phenotypic-based design roughly fall into two categories: target identification 3 and quantitative structure phenotype relationships (QSPhR). 4–6 Target identification has been particularly well-studied, presumably because of the attractiveness of identifying a particular target that is primarily responsible for the phenotype of interest. 3 Ideally, this brings the phenotypic project back into the traditional target-based design space, where traditional rational design approaches (e.g., docking, molecular dynamics, water analyses, free energy calculations, etc.) can be applied. However, when target identification is not fruitful, or the particular phenotype is shown to be too polypharmacological to benefit from target-based design, QSPhR approaches can be applied to make predictions and drive projects based on phenotypic or genotypic data. 4–6 At its core, this is a ligand-based design problem, although the target response can involve many (dozens to thousands, depending on the assay) factors, as opposed to common targetbased single objective approaches (i.e., target binding affinity or cellular activity). Typically 2

ACS Paragon Plus Environment

Page 2 of 36

Page 3 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

one employs ligand-based models to describe a particular phenotypic output as a function of chemical descriptors via a predictive algorithm (e.g., percent cell proliferation as a function of chemical fingerprints as predicted by a Na¨ıve Bayesian classifier). Historically, there have been a small number of phenomena to describe (e.g., percent cell proliferation, cell size), and thus, independent models can be constructed for each property. However, with evolving high-throughput technologies for phenotypic analysis and screening, the number of phenotypic outputs is rapidly expanding to a much larger number. One example is manifest in SmartCube technology from PsyhoGenics, 7 which gives thousands of phenotypic outputs for a single compound, dose, and time point combination. Thus, the ability of the modeler to carefully construct predictive models one-by-one for these types of phenotypic experiments is intractable. As such, automated protocols that can efficiently build a large number of predictive models to describe such a high-dimensional output are imperative.

Genotypic-based design While both flavors of in silico phenotypic design are well-studied (target identification and QSPhR), genotypic design is relatively unexplored by the molecular modeling community. Genotypic design is the intermediary between phenotypic- and target-based design, where the interaction of the drug with the target(s) elicits a differential gene expression profile, which ultimately begets a particular phenotype. 8,9 Each cell exhibits a native particular gene expression profile and these profiles are fingerprints of the cell state (e.g., healthy or diseased). In drugging a particular cell type, one may wish to elicit the rescue gene expression profile, which causes the diseased cell to express as if it were healthy, thus exhibiting the desired, healthy phenotype. Therefore, in contrast, to target- and phenotypic-based design, in genotypic-based design, the goal is to elicit a particular gene expression profile. The problem of high dimensionality described above for phenotypic experiments applies equally to genotypic experiments, where one is not typically concerned with the expression of a single gene, but a collection of genes (i.e., a gene expression profile). 3

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

To meet this growing need, we describe one application of an automated workflow to build and validate predictive models to describe high-dimensional outputs such as gene expression profiles and high-dimensional phenotypic experiments. In particular, similarly to QSPhR described above, we are interested in quantitative structure gene expression relationships (QSGeR), where we set out to build predictive models of differential gene expression on a gene-by-gene basis as a function of chemical structure. We use A673 cells 10 in this particular experiment, but describe how the protocol is equally extensible to other cell lines and outputs as well.

Technical Overview Gene expression In a cell, DNA is transcribed into RNA and mRNA in the nucleus. 11 After being exported to the cytoplasm, mRNA is translated subsequently into protein. Those genes that are transcribed to protein are said to be expressed. Naturally, where all cells have the same genome, it is the differential expression of their genes (amongst other factors) that give rise to the different types of cells. That is, for example, while a kidney cell and an eye cell may have the same genes, it is the differential expression of these genes that result in the differing phenotypes and functions of these cells. Therefore, understanding differential gene expression is a fundamental part of understanding functioning at the cellular level. Finally, it is not the expression of a single gene, but the expression of a collection of genes (i.e., the gene expression profile) in concert that constitutes a marker of the cellular state.

Differential gene expression as a function of perturbagen In addition to giving rise to different cell types, gene expression profiles can also be considered fingerprints of cellular state (i.e., healthy or diseased). 11 Furthermore, the addition of small molecule perturbagens can alter the gene expression profiles from their native states as shown 4

ACS Paragon Plus Environment

Page 4 of 36

Page 5 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

in Fig 1. 12,13 One can imagine the goal of particular therapeutic is to modify a particular diseased cell’s gene expression profile such that it more closely mimics that of a healthy cell. For example, treatment of a particular diseased cell with a certain small molecule could elicit a rescue profile that results in the diseased cell more closely mimicking the healthy cell. Practically, altering the native outcome of one or more pathways in the cell responsible for the pathogenic state achieves this. Here, measuring differential gene expression profiles or phenotypes as an indicator of therapeutic efficacy offers the advantage of giving a complete image of the cell state, where we know the cell may exhibit multiple compensatory pathways to offset the altering effect of hitting just one target, for example.

Figure 1: Schematic of differential gene expression as a function of perturbagen. In the top, we see the native gene expression profile for seven genes in the presence of control, e.g., DMSO. After exposure to a perturbagen (i.e., compound dosing), some genes’ expressions are altered, while others remain unaffected. In particular, gene1 and gene6 are up- and down-regulated versus control, respectively. Taking the difference in these two expression profiles gives the differential gene expression shown at the bottom. As with phenotypic design, genotypic design can evade rational design since the cellular basis for the measured outcome is often not obvious. In contrast to phenotypic design, however, we are not aware of any studies that attempt to build predictive models of gene expression as a function of small molecule perturbagens. Nevertheless, some correlation between chemical structure of the perturbagen and differential gene expression exists. 14 In particular, Chen et al. have shown that there is a 20% chance that two compounds with 5

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Tanimoto similarity ≥ 0.85 have similar gene expression profiles. Their results suggest that it should be feasible to build quantitative and predictive models of differential gene expression, as is discussed at length in this work.

L1000 The L1000 method is a technology developed at the Broad Institute to study gene expression in a high-throughput manner. 15,16 It takes advantage of the fact that gene expression is highly correlated. That is, it is possible to describe the expression profile for a particular gene by looking at the expression of another, highly correlated gene. Because of this redundancy and using statistical analysis and extrapolation, it is possible to describe ∼ 20k genes with a reduced set of only ∼ 1k “landmark” genes – hence the term “L1000”. The ∼ 1k genes were chosen to be minimally redundant, widely expressed in different cellular contexts, and possess inferential value for the remaining genes not in the 1k set. Of particular interest to the computational scientist, is the fact that these datasets are published and freely available on the LINCScloud at http://www.lincscloud.org/l1000/.

“Characteristic direction” as a quantifier of differential gene expression Quantifying differential gene expression as a function of perturbagen is a non-trivial undertaking in bioinformatics. Clark et al. have described a method for transforming and normalizing the L1000 gene expression data to give “characteristic directions” on a per gene 17 ~ basis as a function of chemical perturbagen, ∆gene. Strictly speaking, these characteristic

directions are 978-dimensional vectors that provide the best description of the overall change in gene expression across all 978 genes measured in going from control (i.e., DMSO) to per~ turbagen dosing. These 978-dimensional vectors (∆gene) are decomposable into their 978 ˆ 2 , ..., ∆gene ˆ 978 ), where each component vector ∆gene ˆ i, component vectors (∆gene ˆ 1 , ∆gene

6

ACS Paragon Plus Environment

Page 6 of 36

Page 7 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

then, describes the overall change in gene expression of that one particular gene or compoˆ i ) is quantified nent. The change of this one particular gene (or component vector, ∆gene ~ within the context of the whole gene expression profile ∆gene, in going from control to perturbagen. We refer to the matrix that encapsulates all perturbagens studied (here, 175) along with all genes studied (here, 978) as the chemogenomic matrix (see Fig 2). That is, ~ ˆ 1 , ∆gene ˆ 2 , ..., ∆gene ˆ 978 . ∆gene = ∆gene Finally, because the characteristic directions are normalized to unit vectors, one can compare the magnitudes of the individual component vectors to quantify how one gene’s P ~ ˆ 2i = 1. differential expression compares to another according to ∆gene = i=978 ∆gene i=1

Put more simply, quantifying the absolute difference in gene expression across different

perturbagens is not attempted. Instead, for any arbitrary perturbagen m, the overall differ~ m is normalized to a unit vector. That is, ∆gene ~ m = 1. ential gene expression profile ∆gene ~ m = ∆gene ~ n = 1, even if their absolute Thus for any two perturbagens m and n, ∆gene expressions differ. Therefore, while this information does not allow us to interrogate difˆ i for any arbitrary gene i across different ferences in any particular gene’s expression ∆gene perturbagens m and n, it does make it possible to interrogate differences in genes i and j ˆ i vs. ∆gene ˆ j ) for a single perturbagen m. (∆gene ~ We sought to create predictive models of ∆gene for any arbitrary perturbagen m by ˆ 1 , ∆gene ˆ 2 , ..., ∆gene ˆ 978 ) of ∆gene ~ tackling each of the 978 components (∆gene individually. ˆ i , where we seek to create In cheminformatics parlance, the dependent variable then is ∆gene 978 different models (running over i from 1 to 978) to describe all 978 genes of the gene expression set for any arbitrary perturbagen m.

A673 cell line As we had access to proprietary RNAseq data against A673 cells, we elected to focus our efforts on this particular cell line. (Note that we do not use this proprietary data for any of the analysis presented herein, and have instead used only publicly available sources for ease 7

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2: Heat map representation of chemogenomic matrix used in this study. The 978 genes analyzed are shown along the top, while the 175 pertubagens studied are on the left. Note that “perturbagen” describes the particular combination of compound, dose and time point used. In the work described here, only compounds measured at the same dose and time point were used. Thus, we use “perturbagen” and “compound” interchangeably. The color of each cell of the matrix represents the differential expression of genei in the presence of perturbagenj , ∆geneji , where red, blue and white are used to describe down-regulation, up-regulation, or no effect upon perturbation, respectively. The color ramp of the red or blue reflects increasing deviation from the median value (∆geneji =0.0) in the case of downand up-regulation, respectively.

8

ACS Paragon Plus Environment

Page 8 of 36

Page 9 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

of reproducibility, as described in the ”Methods: Dataset assembly” section.) In principle, one could carry out the same analysis with any cell line, as the algorithm is in theory agnostic to the cellular environment. That is, we only considered the compound structures and their effect on gene expression in the models. We also elected to focus on immortalized cells (e.g., A673 cells) over primary cells. While quantitative prediction of primary cells is of high interest, we speculated that they would have larger noise in their expression profiles, and thus we elected to focus on immortalized cells only in this study.

Results and discussion QSGeR model construction Predictive models ˆ i (the Here we construct 978 different predictive models, one per gene, that describe ∆gene dependent variable) as a function of chemical structure. While careful construction of models by an expert who can manually adjust parameters is desirable to produce the most predictive models, when faced with high-dimensional problems (i.e., 978 different outputs), this becomes intractable. Therefore, we turned to an automated protocol for the construction of predictive models, a Schr¨odinger utility called “AutoQSAR”, 18 which was shown to be as predictive as human experts in most cases. In this work, 18 we have shown AutoQSAR to be as robust or nearly as robust as human experts. Therefore, its utility is not necessarily in replacing the human expert in cases where one wishes to describe low-dimensional outputs (e.g., pIC50 versus a small number of targets), but in automating the task of building high-dimensional predictive models that would otherwise prove too time-consuming for practical application. ˆ i , were coupled with their Each of the 978 distinct differential gene expressions, ∆gene corresponding compound IDs and structures and used as input for AutoQSAR. AutoQSAR ˆ i ) along with the compound structures takes as input the dependent variable (here, ∆gene

9

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and constructs a predictive model that describes the dependent variable as a function of fingerprints or molecular descriptors. The first step is the generation of fingerprints of various types (i.e., radial, 19 dendritic, 20 linear, 21 and MOLPRINT2D 22,23 ) and 497 topology-based descriptors that consist of physicochemical properties, graph theoretical indices, and functional group counts as implemented in Canvas. 20,24 The exact fingerprints used are given in Table 1 while the full list of 497 descriptors is in the Supporting Information. The resultant pool of descriptors and fingerprint bits is then reduced to a maximally informative subset. First, uninformative descriptors are eliminated. In particular, discrete and continuous descriptors showing identity greater than 90% or variance less than 1−12(

PN

i=1

x2i /N )

, where x is the descriptor value for any given input molecule and N is the

number of input molecules, are eliminated, respectively. Second, the remaining descriptors are further filtered to eliminate colinearities, giving a subset where no pair of descriptors exhibits an absolute Pearson correlation coefficient greater than 0.8. Colinearity elimination occurs through hierarchical, agglomerative clustering on the absolute Pearson correlation matrix of the descriptors, where the descriptor closest to the center of each cluster is chosen as the representative descriptor for that cluster. For all results shown here, this maximally informative subset comprised 61 descriptors. Note that the descriptor selection process is unsupervised. That is, the descriptors are chosen ˆ i , therefore giving the same subset independent of the learner or dependent variable, ∆gene of descriptors for each of the combinations of learner and training/test set split built as part of the AutoQSAR workflow. Finally, the 32-bit fingerprints are also reduced to a maximally-informative subset, in that only the 10,000 bits with the greatest variance are maintained. In particular, this gave 10,000 linear fingerprints, 10,000 radial fingerprints, 10,000 dendritic fingerprints, but only 1,582 MOLPRINT2D fingerprints. The latter displayed greater colinearity and less variance than other fingerprints and thus were reduced to a much smaller subset. As with the descriptor

10

ACS Paragon Plus Environment

Page 10 of 36

Page 11 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

selection, this process was also unsupervised, and therefore the same subset of fingerprints is used in all model building. Table 1: Canvas Fingerprints Utilized in AutoQSAR Models Fingerprint Type

Description

Dendritic Linear MOLPRINT2D

Linear and branched fragments. 20 Linear fragments and ring closures. 20 A radial-like fingerprint that encodes atom environments using lists of atom types located at different topological distances. 22,23 Fragments that grow radially from each atom. Also known as extended connectivity fingerprints (ECFPs). 19

Radial

Simultaneously, the input dataset (here, 175 compounds) is split by default into 50 random training and test sets, using a 75:25 split, respectively. The maximally informative descriptor and fingerprints subsets are subsequently coupled with various predictive algorithms and applied to the different training sets. In the case presented here, we focus on categorical dependent variables (i.e., up- or downregulated or unaffected versus control), and thus the predictive algorithms employed include recursive partitioning 25 and Bayesian classification. 26 Note that we did not attempt to build a model to predict all three categories (up-regulated, down-regulated, or unaffected), since model performance is expected to degrade with an increasing number of classes it is trained to predict. Instead, we focused on building two, two-category models: (a) up-regulated versus control or not and (b) down-regulated versus control or not. Each combination of fingerprints or descriptors plus learning algorithm (recursive partitioning or Bayes) is applied to the different training set splits. Note that AutoQSAR only supports the use of fingerprints for kernel-based partial least squares regression (not employed in this study) and Bayesian classification. Recursive partitioning and all other learners support the use of descriptors only. In total, 300 distinct combinations are tried. All 300 of these models are then tested on the independent test sets. Subsequently, per11

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 36

cent accurate predictions on the training and test set are quantified, accuracytraining and accuracytest , respectively. A prediction is only considered accurate if the predicted category (e.g., up-regulated) matches the experimental category. We naturally expect (at least some) degradation in the test set performance versus the training set but desire high accuracy on both and as little divergence as possible between the two. The fraction of compounds with correct predictions (fk ) in each category k, for all categories n studied is used to compute the accuracy across both the training and test sets as shown in Eq 1. n

accuracy =

1X fk n k=1

(1)

Given that we concern ourselves only with two-category models, n = 2, and we may equivalently write Eq 1 as a function of the numbers of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP),   TN TP 1 + accuracy = 2 (T P + F N ) (T N + F P )

(2)

It is important to note that both categories are equally weighted in Eqs 1 and 2. That is, irrespective of whether data is split into two categories of equal population (i.e., a 50:50 split) or, as we have done here, unequally split the data by 75:25 or greater, the accuracy weights both categories equally. This prevents spurious and misleading estimates of accuracy that can originate from simply heavily populating one category and thus effectively training a model to always predict the majority category. 27 Such a case is obviously not predictive and therefore it’s important to make sure that any metric of model quality account for this properly. Finally, the accuracy given in Eq 2 is used to compute a score along [0,1] which approaches unity only in cases of high accuracy of both the training and test sets, as given in Eq 3.

12

ACS Paragon Plus Environment

Page 13 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

score = accuracytest ∗ (1.0 − |accuracytrain − accuracytest |)

(3)

The top ten predictive models (by score) per gene are then carried forward. In particular, these give the final consensus model that will be used in any subsequent prospective predictions. This procedure was repeated for all 978 genes of the ensemble, with two models created per gene ((a) up-regulated or not and (b) down-regulated or not), as shown in Fig 3. Therefore, in total, 978 genes * 2 models/gene * 300 combinations/model = ∼ 600k predictive models were built to create the best possible final set of predictive models for this gene expression profile. The entire process required approximately 35 compute days, but was accomplished in 2.5 wall clock hours through parallelization across about 2,000 processors.

Figure 3: Schematic of the procedure used to automatically generate models for entire gene expression profile. Each gene in the collection of genes (in this case, 978 total) was run through AutoQSAR. The AutoQSAR algorithm itself is depicted in the gray box while the summation of the results of all genes predicted gives the predicted expression profile, shown outside the box. Running this across all 978 genes gave predictive models for the entire profile.

13

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Null models To further validate the generated models, we also tested the ability of a null model to describe differential gene expression. In particular, we asked the question of whether molecular weight might be just as descriptive as the various combinations of fingerprints and descriptors described above. Therefore, the same procedure described above was repeated with the exception that all models were constructed using molecular weight as the only independent variable.

QSGeR model performance Predictive models The distribution of the score’s given by Eq 3 for all models are shown in Fig 4. Note that score runs from 0 to 1, where 0.0, 0.5 and 1.0 correspond to perfectly unpredictive, random and predictive models, respectively. The median model performance score is 0.71 for both types of models: (a) up-regulated versus control or not and (b) down-regulated versus control or not. Nearly all models show performance better than a random score of 0.5. These results support our hypothesis that there is an appreciable correlation between chemical structure and its subsequent effect on gene expression. Note that score described in Eqs 1-3 was constructed as part of the AutoQSAR workflow and is generalizable to all types of dependent variables (i.e., continuous or categorical). Additionally, score is constructed carefully such that it is extensible to cases where the dependent variable is discretized into more than two categories. Nevertheless, the dependent variables employed here (namely, ∆genei ) are two-category ((a) up-regulated or not and (b) down-regulated or not). Therefore, a Matthew’s correlation coefficient (MCC) is also employed to measure model performance. MCC is specific to two-category models, and like score described in Eq 3, it attempts to capture model predictivity in a way that is agnostic to the distribution of the dependent

14

ACS Paragon Plus Environment

Page 14 of 36

Page 15 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

variables amongst the two categories. The MCC for each top-scoring model was computed from the confusion matrix of true positives (T P ), true negatives (T N ), false positives (F P ), and false negatives (F N ) according to Eq 4.

M CC = p

TP ∗ TN − FP ∗ FN (T P + F P )(T P + F N )(T N + F P )(T N + F N )

(4)

The distribution of Matthew’s correlation coefficients (M CC) across all models is given in Figure S1-2 and Table S1 of the Supporting Information. Note that M CC runs from -1 to 1, where -1, 0 and +1 correspond to perfectly unpredictive, random and predictive models, respectively. The median model performance score is 0.40-0.42 for both types of models: (a) up-regulated versus control or not and (b) down-regulated versus control or not. Nearly all models show performance better than a random score of 0.0. The results agree with the quantification using score distributions. Namely, these suggest that there is an appreciable correlation between differential gene expression and chemical structure that can be exploited to build predictive models. Because 300 different combinations of learning algorithm (i.e., Bayesian classification versus recursive partitioning) and independent variables (i.e., fingerprints and descriptors) were used in each run of AutoQSAR, it is possible to collect statistics on which combinations proved most predictive across the largest number of genes. In particular, we see that Bayesian classification outperforms recursive partitioning in 91% of cases studied (see Fig 5b). Furthermore, we find that descriptors also outperform all fingerprint types (see Fig 5a) when coupled with a Bayesian learner. (Recall that recursive partitioning with fingerprints is not supported.) That is, descriptors give the best performance in 95% of cases, whereas MOLPRINT2D, linear, dendritic and radial fingerprints only give the best performance in 1-2% of the cases studied here. This finding is likely due to the limited size and diversity of the training set. Furthermore, where gene expression is modulated by numerous potential pathways and targets, we should not expect a fingerprint-based method, which may very

15

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

well just memorize substructures, to perform as well as something more generalizable such as molecular descriptors. Finally, note that for each gene, we have constructed two, two-category models: one in which the gene is up-regulated or not and the other where the gene is down-regulated or not. One could also construct a single, two-category model to simply describe whether or not a gene is affected (irrespective of whether it is up- or down-regulated). In principle, it might prove more facile to predict whether a gene’s expression is affected or not instead of predicting the direction of the change. To test this, we regenerated and scored all predictive models across all genes as described above, but using a single, two-category model (affected or not). The distribution of score is shown in Figure S3. We see a slight degradation in the distribution of score across the set. In particular, the average score drops from 0.71 to 0.67. Null models The distribution of model scores is shown in Fig 4. The median for the null models is a score of 0.60 in both the up- and down-regulated sets with a distribution similar to the above-described set of models. This is a fair value but is still less predictive than the models mentioned above. In contrast to the predictive models, where Bayesian classification proved to be the most predictive learner, recursive partitioning proved most fruitful when coupled with the molecular weight descriptor in 92% of the models for both up- and down-regulated (see Fig 5b). As described in detail elsewhere, 18 the recursive partitioning learner is an ensemble of 100 decision trees, where each is built for a random split of the training set (not to be confused with the random training/test set split performed prior to model generation). When evaluating this learner on any test set compound, each of the 100 trees casts a vote and the final category is then determined by majority vote. Therefore, in a dataset of the size used in this study (approximately 200 compounds), it is possible to imagine that such a method is honing in on the particular molecular weight range responsible for the particular differential

16

ACS Paragon Plus Environment

Page 16 of 36

Page 17 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(a) Performance of predictive and null models to describe differential down regulation, shown in dark red and light red, respectively.

(b) Performance of predictive and null models to describe differential up regulation, shown in dark blue and light blue, respectively.

Figure 4: Histogram of scores for the top-performing models generated across all 978 genes for both categories of model: (a) down-regulated or not in red and (b) up-regulated or not in blue. Scores computed according to Eq 3. Predictive models were constructed from various learning algorithms (Bayesian classification or recursive partitioning) using different independent variable types (descriptors or fingerprints). Null models were likewise built from various learning algorithms (Bayesian classification or recursive partitioning) using only molecular weight as a descriptor. Note that nearly all models perform better than random (score = 0.5) and the median scores are 0.60 and 0.71 for the null and predictive models, respectively. 17

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

gene expression seen in the dataset. We do not expect this to hold as one moves to larger datasets.

Methods Dataset assembly The characteristic direction data for the L1000 dataset was downloaded from http://amp. pharm.mssm.edu/public/L1000CDS_download/. The data was locally hosted as a Mongo database and queried to curate all characteristic direction data versus A673 cells. Python JSON modules were used to convert all JSON objects to CSV files. The approximately 400 compounds available in the L1000 dataset which had been tested against A673 cells were curated at their corresponding time point of six hours and at varying doses: 0.4 to 177 µM. The data was then binned by fixed time point and dose, where only the bin with the largest number of compounds was retained. In particular, 175 compounds at 10 µM and six hours of treatment were carried forward for subsequent analysis. Using the compound IDs, the Broad LINCScloud API was queried using their RESTful web service to obtain the SMILES strings for each. Ultimately, a single file containing each compound ID, dose, time point, SMILES string, and 978 characteristic directions was assembled. Strictly speaking, the particular combination of compound ID, dose, and time point is referred to as the “perturbagen”. However, since all perturbagens studied here were at constant dose and time point, we shall refer to them simplistically as “compounds” going forward. The data produced a matrix comprised of compounds IDs and their SMILES strings on one axis and the differential gene expressions (in the form of characteristic directions) on the other, for all genes. Note that the L1000 dataset only contains 978 genes, but “1000” is used for convenience. Therefore, the final matrix carried forward was 175 (compounds) by 978 (genes) dimensions, where all 978 genes plus the compound IDs and SMILES strings were retained. 18

ACS Paragon Plus Environment

Page 18 of 36

Page 19 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(a) Preference for independent variable types (molecular descriptors or various fingerprint types) as measured by their percent occurrences in the top-scoring models across all ∼ 2,000 different predictive models constructed.

(b) Preference for learner types (Bayesian classification or recursive partitioning) as measured by their percent occurrences in the top-scoring models across all ∼ 2,000 different models constructed, both null (green) and predictive (purple).

Figure 5: Preference for learner methods and independent variables among the top scoring models. Approximately 600,000 models were constructed and ranked to give a final set of ∼ 2,000 top-scoring models for both the set of predictive models and null models and the preference for the different learner and independent variable types was tabulated amongst the top-scoring subset. Molecular descriptors are preferred to all fingerprint types in the predictive models (see Fig 5a). The null model gives better performance when using a recursive partitioning learner, whereas the preference is the opposite in the case of the predictive models (see Fig 5b). Finally, we did not quantify a preference for independent variable of the null models, as molecular weight was the only independent variable used. 19 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Categorization of characteristic direction All characteristic directions for all 175 compounds versus all 978 genes were curated to give 175 * 978 = ∼ 170k data points in total (see Fig 6). The various quartiles (25th , 50th /median, and 75th ) were then measured and found to -0.014, 0.000 and 0.012, respectively. As we interest ourselves in cases where differential expression provides a large degree of up- or down-regulation versus the control, we defined all points ≤ 25th percentile as substantially down-regulated versus control. Similarly, we set and all points ≥ 75th percentile as upregulated versus control. All other points were classified as relatively unaffected versus control. Thus, all numerical characteristic direction data points were classified into one of three categories: down-regulated, up-regulated, or unaffected versus control.

Figure 6: Distribution of ∆genei values across all genes and all compounds studied. Note that count is shown on a log scale and the count falls off rapidly outside the mean. All values ≤ 25th percentile (-0.014) are designated as down-regulated, while those ≥ 75th percentile (0.012) are designated as up-regulated.

AutoQSAR Schr¨odinger’s AutoQSAR utility 18 was used to generate all predictive models. This method is described in detail elsewhere, 18 but relevant details are also given in the Results and Dis20

ACS Paragon Plus Environment

Page 20 of 36

Page 21 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

cussion section. In short, the AutoQSAR algorithm is shown schematically in the gray box of ˆ i ), Fig 3. Given an input of compound structures and some dependent variable (here, ∆gene molecular descriptors and 2D fingerprints are computed. Subsequently, feature selection is undertaken to capture the maximally informative subset of descriptors and fingerprints. As the dependent variable used here (∆genei ) was categorical (up-regulated or not and downregulated or not), the two learning methods applied were recursive partitioning and Bayesian classification. These two learning methods are coupled with the maximally informative subsets of either fingerprints or descriptors and applied to different 75:25 training/test set splits of the input data. The resultant predictive models (e.g., Bayesian classification using dendritic fingerprints on training/test set split #1) are then evaluated by percent accurate predictions against both the training and test sets. Each model is then scored according to Eq 3. By default, the top 10 models are carried forward as a consensus model for use in any subsequent prediction.

Conclusions We present an automated workflow based on the AutoQSAR tool to generate predictive models of differential gene expression from small-molecule compound dosing against A673 cells. These models were used to predict an entire expression profile as opposed to the expression of just a few genes. This work highlights the ability to build predictive models that describe high-dimensional data using an automated workflow, which allows this computationally challenging task to be executed in a straightforward manner that could be applied broadly to the prediction of gene expression using different cell lines, compounds, and/or conditions. Although we focused here exclusively on A673 cells as a proof of principle, the automated AutoQSAR protocol reported here is equally applicable to any high dimensionality problem, including phenotypic screening outputs. We also showed that the models give superior performance over a reasonable null model and give reasonable correlation with

21

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

experimental data from an independent test set. The primary metric for assessing the success of a virtual screening protocol is the enrichment it confers, as compared with a random sampling or another virtual screening approach. While enrichments are traditionally discussed in the context of target-based screens, their importance is perhaps even greater in the phenotypic and genotypic space, where throughputs are much lower. It has been shown in phenotypic design, that quantitative structure phenotype relationships (QSPhR) can be derived using one model system (e.g., yeast) and applied to another model system (e.g., human cells) to give higher enrichments in screening than would be achieved via random selection of compounds for screening alone. 6 We posit that similarly, quantitative structure gene expression relationships (QSGeR) can be used to enrich cell line screening sets as well. In the same vein as the work describe by Wallace et al., we propose that QSGeR should be used in gene expression-based screens to predict the profile of new compounds and improve screening efficiency. Finally, we suggest that this opens the door to other applications of QSGeR, including target identification and lead optimization. In particular, understanding the differential gene expression profile of a particular compound lends itself to understanding off-target and multi-target effects as the targets relate to their genes. Additionally, we propose that within the limitations of any knowledge-based model, these predictive models of gene expression can be used to facilitate lead optimization in the gene expression space, where one wishes to design for a particular rescue expression. In conclusion, we present a study that works to narrow the gap between computational biology, which traditionally focuses on target-gene-phenotype relationships, and computational chemistry, which is more focused on compound-target relationships. We propose that large-scale multi-objective predictive modeling approaches, such as the one presented here, can produce a more complete description of drug action and therefore can facilitate the drug discovery process.

22

ACS Paragon Plus Environment

Page 22 of 36

Page 23 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The authors declare the following competing financial interest(s): All authors are current or former employees of Schr¨odinger, Inc. and some authors hold shares of that company.

Acknowledgement The authors thank (in alphabetical order) Scott P. Brown, Patrick Devaney, Steve Dixon, and RJ Swett for helpful discussions. The authors also thank the Broad Institute, and in particular, David Lahr, Ted Natoli, and Aravind Subramanian for access to the L1000 dataset and helpful discussions. The authors additionally thank Antonio Osario at Schr¨odinger, Inc. and Avi Ma’ayan and Qiaonan Duan at the Icahn Mount Sinai School of Medicine for help with local hosting of and accessing the L1000 characteristic direction data, respectively. The authors also thank Katie Hughes for assistance with graphics rendering. The authors finally thank the anonymous reviewers for careful and critical reading of this manuscript and whose suggestions proved quite helpful.

Supporting Information Available All data used to build and validate QSGeR models is publicly available from the LINCS cloud server as described in the the Methods: Dataset assembly section. 16 The complete list of all 497 descriptors employed as independent variables is provided as a text file. For access to script to run prospective QSGeR predictions for any arbitrary input compound, please contact the corresponding authors. This material is available free of charge via the Internet at http://pubs.acs.org/.

23

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 36

References (1) Baig, M. H.; Ahmad, K.; Roy, S.; Ashraf, J. M.; Adil, M.; Siddiqui, M. H.; Khan, S.; Kamal, M. A.; Provazn´ık, I.; Choi, I. Computer Aided Drug Design: Success and Limitations. Cur. Pharm. Des. 2016, 22, 572–581. (2) Swinney, D. C.; Anthony, J. How Were New Medicines Discovered? Nat. Rev. Drug Discovery 2011, 10, 507–519. (3) Jenkins, J. L.; Bender, A.; Davies, J. W. In Silico Target Fishing: Predicting Biological Targets from Chemical Structure. Drug Discovery Today: Technol. 2006, 3, 413 – 421. (4) Lang, P. SAR by HCS. Nat. Chem. Biol. 2008, 4, 18–19. (5) Mart´ınez-Jim´enez, F.; Papadatos, G.; Yang, L.; Wallace, I. M.; Kumar, V.; Pieper, U.; Sali, A.; Brown, J. R.; Overington, J. P.; Marti-Renom, M. A. Target Prediction for an Open Access Set of Compounds Active Against Mycobacterium Tuberculosi. PLoS Comput. Biol. 2013, 9, 1–15. (6) Wallace, I. M.; Urbanus, M. L.; Luciani, G. M.; Burns, A. R.; Han, M. K. L.; Wang, H.; Arora, K.; Heisler, L. E.; Proctor, M.; St. Onge, R. P.; Roemer, T.; Roy, P. J.; Cummins, C. L.; Bader, G. D.; Nislow, C.; Giaever, G. Compound Prioritization Methods Increase Rates of Chemical Probe Discovery in Model Organisms. Chem. Biol. 2011, 18, 1273–1283. (7) PsychoGenics — Drug Discovery — Proprietary Drug Discovery Platforms: SmartCube. http://www.psychogenics.com/smartcube.html,

(accessed Jan 6,

2016). (8) Dean, P. M.; Zanders, E. D.; Bailey, D. S. Industrial-Scale, Genomics-Based Drug Design and Discovery. Trends Biotechnol. 2001, 19, 288–292.

24

ACS Paragon Plus Environment

Page 25 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(9) Huang, S. Gene Expression Profiling, Genetic Networks, and Cellular States: An Integrating Concept For Tumorigenesis and Drug Discovery. J. Mol. Med. 1999, 77, 469–480. (10) Martinez-Ramirez, A.; Rodriguez-Perales, S.; Melendez, B.; Martinez-Delgado, B.; Urioste, M.; Cigudosa, J. C.; Benitez, J. Characterization of the A673 Cell Line (Ewing Tumor) by Molecular Cytogenetic Techniques. Cancer Genet. Cytogenet. 2003, 141, 138–142. (11) Alberts, B., Johnson, A., Lewis, J., Morgan, D., Raff, M., Roberts, K., Walter, P., Eds. Molecular Biology of the Cell, 6th ed.; Garland Science: New York, 2014. (12) Lamb, J.; Crawford, E. D.; Peck, D.; Modell, J. W.; Blat, I. C.; Wrobel, M. J.; Lerner, J.; Brunet, J.-P.; Subramanian, A.; Ross, K. N.; Reich, M.; Hieronymus, H.; Wei, G.; Armstrong, S. A.; Haggarty, S. J.; Clemons, P. A.; Wei, R.; Carr, S. A.; Lander, E. S.; Golub, T. R. The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 2006, 313, 1929–1935. (13) Lamb, J. The Connectivity Map: A New Tool for Biomedical Research. Nat. Rev. Cancer 2007, 7, 54–60. (14) Chen, B.; Greenside, P.; Paik, H.; Sirota, M.; Hadley, D.; Butte, A. J. Relating Chemical Structure to Cellular Response: An Integrative Analysis of Gene Expression, Bioactivity, and Structural Data Across 11,000 Compounds. CPT Pharmacometrics Syst. Pharmacol. 2015, 4, 576–584. (15) Peck, D.; Crawford, E. D.; Ross, K. N.; Stegmaier, K.; Golub, T. R.; Lamb, J. A Method for High-throughput Gene Expression Signature Analysis. Genome Biol. 2006, 7, R61–R61. (16) L1000. http://www.lincscloud.org/l1000/, (accessed Feb 16, 2016).

25

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(17) Clark, N. R.; Hu, K. S.; Feldmann, A. S.; Kou, Y.; Chen, E. Y.; Duan, Q.; Ma’ayan, A. The Characteristic Direction: A Geometrical Approach to Identify Differentially Expressed Genes. BMC Bioinf. 2014, 15, 1–16. (18) Dixon, S. L.; Duan, J.; Smith, E.; Bargen, C. D. V.; Sherman, W. P.; Repasky, M. P. AutoQSAR: An Automated Machine Learning Tool for Best-Practices QSAR Modeling. Future Med. Chem. 2016, In press. (19) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. (20) Sastry, M.; Lowrie, J. F.; Dixon, S. L.; Sherman, W. Large-Scale Systematic Analysis of 2D Fingerprint Methods and Parameters to Improve Virtual Screening Enrichments. J. Chem. Inf. Model. 2010, 50, 771–784. (21) Daylight Theory:

Fingerprints. http://www.daylight.com/dayhtml/doc/theory/

theory.finger.html, (accessed Sep 25, 2016). (22) Bender, A.; Mussa, H. Y.; Glen, R. C.; Reiling, S. Molecular Simililarity Searching Using Atom Environments, Information-Based Feature Selection, and a Na¨ıve Bayesian Classifier. J. Chem. Inf. Comp. Sci. 2004, 44, 170–178. (23) Bender, A.; Mussa, H. Y.; Glen, R. C.; Reiling, S. Similarity Searching of Chemical Databases Using Atom Environment Descriptors (MOLPRINT2D): Evaluation of Performance. J. Chem. Info. Comp. Sci. 2004, 44, 1708–1718. (24) Duan, J.; Dixon, S. L.; Lowrie, J. F.; Sherman, W. Analysis and Comparison of 2D Fingerprints: Insights into Database Screening Performance using Eight Fingerprint Methods. J. Mol. Graphics Modell. 2010, 29, 157–170. (25) Zeileis, A.; Hothorn, T.; Hornik, K. Model-Based Recursive Partitioning. J. Comp. Graph. Stat. 2008, 17, 492–514. 26

ACS Paragon Plus Environment

Page 26 of 36

Page 27 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(26) Kotsiantis, S. B.; Zaharakis, I.; Pintelas, P. In Emerging Artificial Intelligence Applications in Computer Engineering; Breuker, J., Dieng-Kuntz, R., Guarino, N., Kok, J. N., Liu, J., de Mantaras, R. L., Mizoguchi, R., Musen, M., Zhong, N., Eds.; IOS Press: 1013 BG Amsterdam, Netherlands, 2007. (27) Consider for example a dataset of 100 compounds, where we employ a 75:25 split. Here, 75 compounds would belong to category A and the remaining 25 are assigned to category B. If we simply predict that all 100 compounds belong to the majority category (A), we have fA = 1.0 and fB = 0.0. That is, we have correctly predicted all compounds with experimental category A and incorrectly predicted all compounds with experimental category B. From Eq 1 or 2, we would have accuracy = 0.5. Finally, setting accuracytest = accuracytraining = 0.5 for the example described herein, we see from Eq 3, that score = 0.5. That is, even in a dataset that is heavily skewed with respect to which category the data belongs to, the score is robust in detecting when performance is no better than random.

27

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Graphical TOC Entry

28

ACS Paragon Plus Environment

Page 28 of 36

Page 29 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 2: Heat map representation of chemogenomic matrix used in this study. The 978 genes analyzed are shown along the top, while the 175 perturbagens studied are on the left. Note that "perturbagen" describes the particular combination of compound, dose and time point used. In the work described here, only compounds measured at the same dose and time point were used. Thus, we use "perturbagen" and "compound" interchangeably. The color of each cell of the matrix represents the differential expression of genei in the presence of perturbagenj, ∆geneji, where red, blue and white are used to describe downregulation, up-regulation, or no effect upon perturbation, respectively. The color ramp of the red or blue reflects increasing deviation from the median value (∆geneji=0.0) in the case of down- and up-regulation, respectively. 679x461mm (72 x 72 DPI)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 4: Histogram of scores for the top-performing models generated across all 978 genes for both categories of model: (a) down-regulated or not in red and (b) up-regulated or not in blue. Scores computed according to Eq 3. Predictive models were constructed from various learning algorithms (Bayesian classification or recursive partitioning) using different independent variable types (descriptors or fingerprints). Null models were likewise built from various learning algorithms (Bayesian classification or recursive partitioning) using only molecular weight as a descriptor. Note that nearly all models perform better than random (score = 0.5) and the median scores are 0.60 and 0.71 for the null and predictive models, respectively. (b) Performance of predictive and null models to describe differential up regulation, shown in dark blue and light blue, respectively. 705x474mm (72 x 72 DPI)

ACS Paragon Plus Environment

Page 30 of 36

Page 31 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 4: Histogram of scores for the top-performing models generated across all 978 genes for both categories of model: (a) down-regulated or not in red and (b) up-regulated or not in blue. Scores computed according to Eq 3. Predictive models were constructed from various learning algorithms (Bayesian classification or recursive partitioning) using different independent variable types (descriptors or fingerprints). Null models were likewise built from various learning algorithms (Bayesian classification or recursive partitioning) using only molecular weight as a descriptor. Note that nearly all models perform better than random (score = 0.5) and the median scores are 0.60 and 0.71 for the null and predictive models, respectively. (a) Performance of predictive and null models to describe differential down regulation, shown in dark red and light red, respectively. 705x474mm (72 x 72 DPI)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1: Schematic of differential gene expression as a function of perturbagen. In the top, we see the native gene expression profile for seven genes in the presence of control, e.g., DMSO. After exposure to a perturbagen (i.e., compound dosing), some genes' expressions are altered, while others remain unaffected. In particular, gene1 and gene6 are up- and down-regulated versus control, respectively. Taking the difference in these two expression profiles gives the differential gene expression shown at the bottom. 709x390mm (72 x 72 DPI)

ACS Paragon Plus Environment

Page 32 of 36

Page 33 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 5: Preference for learner methods and independent variables among the top scoring models. Approximately 600,000 models were constructed and ranked to give a final set of ca. 2,000 top-scoring models for both the set of predictive models and null models, and the preference for the different learner and independent variable types was tabulated amongst the top-scoring subset. Molecular descriptors are preferred to all fingerprint types in the predictive models (see Fig 5a). The null model gives better performance when using a recursive partitioning learner, whereas the preference is the opposite in the case of the predictive models (see Fig 5b). Finally, we did not quantify a preference for independent variable of the null models, as molecular weight was the only independent variable used (b) Preference for learner types (Bayesian classification or recursive partitioning) as measured by their percent occurrences in the topscoring models across all ca. 2,000 different models constructed, both null (green) and predictive (purple). 706x476mm (72 x 72 DPI)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 6: Distribution of genei values across all genes and all compounds studied. Note that count is shown on a log scale and the count falls off rapidly outside the mean. All values ≤ 25th percentile (-0.014) are designated as down-regulated, while those ≥ 75th percentile (0.012) are designated as up-regulated. 597x545mm (72 x 72 DPI)

ACS Paragon Plus Environment

Page 34 of 36

Page 35 of 36

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 5: Preference for learner methods and independent variables among the top scoring models. Approximately 600,000 models were constructed and ranked to give a final set of ca. 2,000 top-scoring models for both the set of predictive models and null models, and the preference for the different learner and independent variable types was tabulated amongst the top-scoring subset. Molecular descriptors are preferred to all fingerprint types in the predictive models (see Fig 5a). The null model gives better performance when using a recursive partitioning learner, whereas the preference is the opposite in the case of the predictive models (see Fig 5b). Finally, we did not quantify a preference for independent variable of the null models, as molecular weight was the only independent variable used. (a) Preference for independent variable types (molecular descriptors or various fingerprint types) as measured by their percent occurrences in the top-scoring models across all ca. 2,000 different predictive models constructed. 706x476mm (72 x 72 DPI)

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 3: Schematic of the procedure used to automatically generate models for entire gene expression pro le. Each gene in the collection of genes (in this case, 978 total) was run through AutoQSAR. The AutoQSAR algorithm itself is depicted in the gray box while the summation of the results of all genes predicted gives the predicted expression profi le, shown outside the box. Running this across all 978 genes gave predictive models for the entire profi le. 742x522mm (72 x 72 DPI)

ACS Paragon Plus Environment

Page 36 of 36