Li Xue, Bin Tang, Wei Chen, and Jiesi Luo. Prediction of CRISPR sgRNA activity using a deep convolutional neural network. J. Chem. Inf. Model., Just Accepted Manuscript. DOI: 10.1021/acs.jcim.8b00368. Publication Date (Web): November 28, 2018.

Prediction of CRISPR sgRNA activity using a deep convolutional neural network

Li Xue1‡, Bin Tang2‡, Wei Chen3,* and Jiesi Luo4,*

1 School of Public Health, Southwest Medical University, Luzhou, Sichuan, China
2 Basic Medical College of Southwest Medical University, Luzhou, Sichuan, China
3 Integrative Genomics Core, City of Hope National Medical Center, Duarte, CA, USA
4 Key Laboratory for Aging and Regenerative Medicine, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, Sichuan, China


ABSTRACT

The CRISPR-Cas9 system, derived from adaptive immunity in bacteria and archaea, has been developed into a powerful tool for genome engineering with wide-ranging applications. Optimizing single guide RNA (sgRNA) design to improve the efficiency of target cleavage is a key step for successful gene editing with the CRISPR-Cas9 system. Because not all sgRNAs targeting a given gene are equally effective, computational tools have been developed from experimental data to increase the likelihood of selecting effective sgRNAs. Despite considerable efforts to date, accurately predicting functional sgRNAs directly from large-scale sequence data remains a major challenge. We propose DeepCas9, a deep-learning framework based on a convolutional neural network (CNN), to automatically learn the sequence determinants and thereby enable the identification of functional sgRNAs for the CRISPR-Cas9 system. We show that the CNN method outperforms previous methods in both the ability to correctly identify highly active sgRNAs in experiments not used in training and the ability to accurately predict the target efficacies of sgRNAs in different organisms. In addition, we visualize the convolutional kernels and show that the identified sequence signatures match known nucleotide preferences. Finally, we demonstrate the application of our method to the design of next-generation genome-scale CRISPRi and CRISPRa libraries targeting the human and mouse genomes. We expect that DeepCas9 will reduce the number of sgRNAs that need to be experimentally validated, enabling more effective and efficient genetic screens and genome engineering. DeepCas9 can be freely accessed at https://github.com/lje00006/DeepCas9.


Introduction

The development of genome engineering technology based on the clustered regularly interspaced short palindromic repeats (CRISPR) system is one of the most powerful biological breakthroughs of recent decades1-3. It enables precise genome editing in a wide variety of organisms, has transformed biological research, and holds tremendous promise in biotechnology and medicine. Based on their composition in prokaryotes, CRISPR-Cas systems can be classified into two distinct classes and further subdivided into at least six types4, 5. Among them, the Class 2 type II CRISPR-Cas system derived from Streptococcus pyogenes is the most widely used because of its simplicity, efficiency, and scalability6. This CRISPR system recognizes and cleaves target DNA using three minimal components: the RNA-guided endonuclease Cas9, a specificity-determining CRISPR RNA (crRNA), and an auxiliary trans-activating crRNA (tracrRNA)7, 8. Following crRNA and tracrRNA hybridization, Cas9 is guided to genomic loci matching a 20-nt guide sequence within the crRNA, immediately upstream of a protospacer adjacent motif (PAM)7, 8. Once bound to the target DNA, two nuclease domains in Cas9, HNH and RuvC, cleave the DNA strands complementary and non-complementary to the guide sequence, leaving a blunt-ended DNA double-strand break (DSB)9, 10. In addition, the crRNA and tracrRNA can be fused to generate a single guide RNA (sgRNA)8. Together with the sgRNA, the Cas9 protein from Streptococcus pyogenes can be programmed to cleave virtually any sequence preceding a 5'-NGG-3' PAM sequence in an easy and rapid manner. The unprecedented flexibility of this RNA-guided system has enabled a broad range of applications, including genome editing11, gene expression regulation12, 13, epigenetic status modification14, and even targeting of single-stranded RNA15.


It has been observed that the mutagenesis rate of the CRISPR-Cas9 system varies greatly16. Further studies have indicated that the on-target efficiency of site-directed mutation is highly dependent on the sgRNA17. In practice, different sgRNAs often manifest a spectrum of potency, and only a fraction of them are highly effective18-20. Small positional shifts along the target DNA are sufficient to alter sgRNA function in an unpredictable manner16, 21. Hence, designing effective sgRNAs is critical for reliable gene-knockdown experiments. Several recent studies have attempted to decipher the molecular features that determine sgRNA cleavage activity. Wang et al. performed genetic screens in two human cell lines using large-scale sgRNA libraries11. They showed that sgRNA efficiency was associated with specific sequence motifs, enabling the prediction of more effective sgRNAs. Doench et al. assessed the potency of sgRNAs in libraries that tiled twelve cell surface receptors on murine and human cell lines and demonstrated that the nucleotide composition of the DNA downstream of the PAM sequence contributed to sgRNA efficiency18. Wong et al. reanalyzed the data from Doench et al. and identified novel features characteristic of functional sgRNAs22. Based on previously published datasets, Hu et al. analyzed the effects of sequence context on sgRNA efficiency and refined their models by incorporating additional features, such as a preference for cytosine at the cleavage site23. Chari et al. applied a high-throughput sequencing approach to measure sgRNA activity in a large-scale screen24. They revealed that both the underlying nucleotide sequence and chromatin accessibility contribute to targeting efficiency, informing an improved approach to genome-targeting design. Moreno-Mateos et al. tested sgRNA activity in vivo using zebrafish embryos as a model system and found strong G-richness and higher stability within efficient sgRNAs20. Doench et al. further designed a tiling library targeting eight human genes known to confer resistance to three drugs19. With the combination


of previous datasets, they developed a more effective modeling approach for sgRNA activity prediction.

Deep-learning algorithms, a recent extension of multi-layered artificial neural networks, are a particularly powerful approach for learning complex patterns at multiple layers25. Compared to traditional machine learning methods, deep-learning algorithms can take raw features from an extremely large, annotated data set, such as images or genomes, and use them to create a predictive tool based on the patterns buried inside26. Over the past few years, these algorithms have been applied in bioinformatics and computational biology to manage the increasing amounts and dimensions of data generated by high-throughput techniques. For example, previous applications of deep-learning algorithms have achieved great success in predicting splicing activity27, the specificities of DNA- and RNA-binding proteins28, and epigenetic marks and DNA accessibility29, 30. Recently, Kim et al. used deep learning to improve the prediction of CRISPR-Cpf1 guide RNA activity31.

In this paper, we introduce DeepCas9, a computational framework that applies a deep convolutional neural network (CNN) to predict sgRNA activity and learn the underlying sequence code. Through comprehensive experiments, we demonstrate that DeepCas9 outperforms known state-of-the-art activity prediction methods on ten published datasets. To make DeepCas9 more interpretable, we propose a strategy to visualize the convolutional kernels and successfully identify key sequence features of sgRNAs that determine the target activity of genome editing. To demonstrate the application of DeepCas9, we also focus on datasets from genomic screens that used CRISPR activation (CRISPRa) and CRISPR interference (CRISPRi) technology32 and show the ability of deep learning to predict highly active sgRNAs.


Methods

Knock-out and cleavage efficiency datasets

We used ten public, experimentally validated sgRNA efficacy datasets for gene knockout (KO), collected and processed by Haeussler et al.33 These experiment-based datasets cover several cell types from five species, some of which have been used to develop existing selection algorithms. The Chari dataset comprised 1,234 guides infecting 293T cells with scrambled spCas9 targets, whose target mutation rates were the readout for efficiency24. The Wang/Xu dataset consisted of 2,076 guides targeting 221 genes whose deletion results in a growth disadvantage in the leukemia cell line HL-6011, 23. The knockout efficiency of this dataset was evaluated from the decline in sgRNA abundance in the screens. For the dataset from Doench et al.18, only the data for 951 guides against six mouse cell-surface protein genes (Cd5, Cd28, H2-K, Cd45, Thy1 and Cd43), which could be assayed by flow cytometry, were kept. Here, the abundance of integrated sgRNAs in FACS-isolated, target-negative cells was used as the measure of knockout success. We developed the deep-learning-based sgRNA prediction tool using these three datasets for training. The external test set contained seven datasets, comprising 2,333 (Doench dataset, A375)19, 4,239 (Hart dataset, Hct116)34, 1,020 (Moreno-Mateos dataset, zebrafish)20, 72 (Gandhi dataset, Ciona), 50 (Farboud dataset, C. elegans)35, 102 (Varshney dataset, zebrafish)36 and 111 (Gagnon dataset, zebrafish)37 guides, respectively. The first dataset was a new version from Doench et al., comprising 2,333 guides targeting eight human genes (CCDC101, MED12, TADA2B, TADA1, HPRT, CUL3, NF1 and NF2) whose knockout success was inferred from resistance to one of three drugs (vemurafenib, 6-thioguanine and selumetinib)19. The second dataset was


obtained from 4,239 guides against 829 genes determined to be essential in five different cell lines34. Here, only the results from Hct116, the only cell line with a replicate and a high correlation between replicates, were used33. The third dataset was a set of 1,020 guides targeting 128 different genes in the zebrafish genome20. Unlike the previous datasets, in which guides were transcribed in vivo from a U6 small RNA promoter by RNA Polymerase III (Pol III), this dataset used in vitro, T7-transcribed sgRNAs delivered into zebrafish one-cell embryos. The remaining four datasets contained fewer than 120 guides each. For these datasets, sgRNA activity was restricted to direct experimental assay, with KO efficacy defined as the measured cleavage efficiency (Table S1).

When training the deep-learning model, each training dataset was first sorted by cleavage efficiency in descending order. We then applied a Min-Max rescaling procedure to the cleavage efficiencies of each dataset. Min-Max normalization maps a value into the range [0, 1] and is defined as fnk = (fk − fmin)/(fmax − fmin), where fmax and fmin are the maximum and minimum cleavage efficiencies, fk is the original measured value, and fnk is the corresponding rescaled value. For the testing datasets, we retained a real value for each guide and then trained the CNN model to predict these scores.

Convolutional neural network

A convolutional neural network (CNN) is a type of deep-learning algorithm originally inspired by the natural visual perception mechanism of living creatures38. A powerful aspect of CNNs is that they allow computers to process hierarchical spatial representations efficiently and holistically, without relying on laborious feature crafting and extraction. The basic CNN architecture consists of three types of layers: convolution layers, pooling layers and fully


connected layers. The convolutional layer learns feature representations of the inputs. It uses weight vectors called filters to scan local patches of the data for a given pattern, calculating a local weighted sum at every position. The pooling layer achieves shift-invariance by reducing the number of connections between convolutional layers. Pooling is an important concept in CNNs: it determines the presence of a pattern in a region by computing the maximum or average pattern match in smaller patches, thereby aggregating regional information into a single number. After successive convolution and pooling layers, there may be one or more fully connected layers that perform high-level reasoning. These take all neurons in the previous layer and connect them to every neuron of the current layer to generate global semantic information39.

Designed to analyze spatial information, CNNs have achieved very good performance on tasks such as visual recognition, speech recognition and natural language processing25. In bioinformatics and computational biology, CNNs are also showing great promise for high-throughput sequence analysis40. Leveraging the rapidly growing, high-dimensional data sets (e.g., from DNA sequencing and RNA measurements) and the great improvements in graphics processing units (GPUs), CNNs make it possible to better exploit functional genomics data by training complex networks with multiple layers that capture their internal structure. Prior studies have demonstrated that CNNs can discover high-level features, improve performance over state-of-the-art methods, and increase interpretability of the structure of biological data28-30.

Design of DeepCas9


DeepCas9 is organized in a sequential layer-by-layer structure in which convolution layers and pooling layers play a key role in extracting input features at different spatial scales. DeepCas9 receives a 30-bp target sequence as input and produces a regression score that correlates highly with sgRNA activity. Compared to previous approaches with explicit feature input18-20, 23, DeepCas9 skips the steps of extracting features from the sequence objects and selecting "effective features", leveraging a CNN instead. DeepCas9 can thus automatically learn the feature representations that best encode the information essential for activity prediction. DeepCas9 was implemented using the MXNet library (https://mxnet.incubator.apache.org/).

The training architecture of DeepCas9 is shown in Figure 1a. Initially, DeepCas9 uses one-hot encoding to convert the DNA sequence, translating the nucleotide at each position into a four-dimensional binary vector representing the presence or absence of an A, C, G or T. The convolution layer then performs one-dimensional convolution operations across the sequence with 50 filters. This process can be denoted as:

Convolution(X)_i^k = ReLU( Σ_{m=1}^{M} Σ_{n=1}^{N} W_{mn}^k · X_{i+m−1, n} )

where X is the input, i is the index of the output position and k is the index of the kernel. W^k = (w_{mn})_{M×N} represents the weight matrix of the kth convolution kernel with size M × N, where M is the size of the sliding window and N is the number of input channels29. Here, M and N are both equal to 4. ReLU represents the nonlinear activation function applied to the convolution outputs:

ReLU(x) = x if x ≥ 0, and 0 if x < 0
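As a minimal illustration of these operations, the encoding, convolution, ReLU and max-pooling steps can be sketched in numpy. The filter weights below are hypothetical random values purely for demonstration; the real model learns 50 filters during training, and the pooling window of size 2 is described in the next section.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a 4 x L binary matrix (rows: A, C, G, T)."""
    m = np.zeros((4, len(seq)))
    for i, b in enumerate(seq):
        m[BASES.index(b), i] = 1.0
    return m

def conv1d_relu(x, w):
    """Slide a 4 x M filter across the 4 x L one-hot matrix.
    Output length is L - M + 1; ReLU keeps only positive responses."""
    L, M = x.shape[1], w.shape[1]
    out = np.array([np.sum(w * x[:, i:i + M]) for i in range(L - M + 1)])
    return np.maximum(out, 0.0)  # ReLU(x) = max(0, x)

def max_pool(x, m=2):
    """Non-overlapping max pooling with window size m (remainder dropped)."""
    n = len(x) // m
    return x[:n * m].reshape(n, m).max(axis=1)

seq = "ACGTACGTACGTACGTACGTACGTACGTGG"            # a 30-bp target site
x = one_hot(seq)                                  # shape (4, 30)
w = np.random.default_rng(0).normal(size=(4, 4))  # one hypothetical 4x4 filter
fmap = conv1d_relu(x, w)                          # shape (27,), as in the text
pooled = max_pool(fmap)                           # shape (13,)
print(x.shape, fmap.shape, pooled.shape)
```

The shapes reproduce the dimensions stated in the text: a 30-bp input yields a feature map of length 30 − 4 + 1 = 27 per filter, which max pooling with window 2 reduces to 13.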


The pooling layer computes the maximum value in each of the non-overlapping windows of size 2, which reduces variance and increases translational invariance over a region. This process can be denoted as:

pooling(X)_i^k = max( X_{M(i−1)+1}^k, X_{M(i−1)+2}^k, ⋯, X_{Mi}^k )

where X is the input, i is the index of the output position, k is the index of the kernel and M is the pooling window size. The fully connected layer integrates high-level features of the DNA sequence and transforms them into a fixed-dimension space. All neurons in the fully connected layer receive input from all outputs of the previous layer and apply the ReLU function. The last layer, a regression output layer, performs a linear transformation of the outputs of the fully connected layer and makes a prediction for each sgRNA's activity.

Training of DeepCas9 model

The proposed model was optimized for the mean squared error loss function using mini-batch stochastic gradient descent with Adam updates, and regularized by dropout with a 0.3 dropout rate. Random search was used to optimize a set of important hyper-parameters, such as:


1. Learning rate: the step size the optimizer takes in parameter space when updating the model parameters.
2. Batch size: the number of training examples processed before the parameters are updated.
3. Maximum epochs: the total number of passes over the training data.
4. Early-stopping patience: the number of epochs without improvement in validation loss after which training stops.

Figure 1. Inside the DeepCas9 architecture. (a) On the left, a convolutional neural network is applied to automatically learn DNA sequence features. First, the DNA sequence is encoded as "one-hot" vectors, with a 1 at the position corresponding to the nucleotide type (A, C, G or T) and zeros elsewhere. The convolution and pooling operations are then applied to the input vectors, producing the output of each layer as feature maps. Finally, the output of the fully connected layer is fed to a linear regression layer that assigns a score for the activity. (b) The process of the convolution operation, in which the 4×4 filter traverses the entire input vector with a stride of 1 to calculate the output. A filter can be visualized as a sequence motif, which helps to show which nucleotide type the filter prefers at each sequence position.

Convolutional kernels visualization strategy

When a kernel is slid over the input sequence, it functions as a motif detector and becomes activated when the input matches its preference. To capture the activated positions of the sgRNA, a permutation method implemented in the R package rfPermute41 was applied to the neurons in the fully connected layer. The importance of each neuron was determined by estimating the average decrease in node impurity after permuting each predictor variable. We then mapped all neurons from the fully connected layer back to the convolution layer to identify the importance of the kernels.
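The centered log2 enrichment score used for rendering kernels as sequence logos can be sketched in a few lines. The per-position probabilities below are hypothetical, chosen only to illustrate the computation; in DeepCas9 they derive from the trained convolutional kernels.

```python
import numpy as np

def enrichment_scores(p, q):
    """EDLogo-style centered log-ratio: for each nucleotide, log2(p_i/q_i)
    minus the median log2 ratio taken across all nucleotides at the position."""
    r = np.log2(np.asarray(p, dtype=float) / np.asarray(q, dtype=float))
    return r - np.median(r)

p = [0.50, 0.20, 0.20, 0.10]   # hypothetical A/C/G/T probabilities at one position
q = [0.25, 0.25, 0.25, 0.25]   # uniform background
scores = enrichment_scores(p, q)
print(np.round(scores, 3))     # positive = enriched, negative = depleted
```

The median-centering makes the scores sum-free of a common offset, so enriched and depleted nucleotides are displayed symmetrically in the logo.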


We directly converted each kernel of the first convolutional layer into a motif that impacts sgRNA efficiency (Figure 1b). We computed the enrichment or depletion of a nucleotide at a position as:

ℓ_i = log2( p_i / q_i ) − median{ log2( p_j / q_j ) : j = 1, 2, ⋯, n }

where p = (p1, p2, …, pn) denotes the probabilities of the n elements C1, …, Cn (A, C, G and T) permitted at a position, and q = (q1, q2, …, qn) denotes the corresponding background probabilities. We plotted the putative sequence logos using the EDLogo tool in the R package Logolas (https://github.com/kkdey/Logolas).

Baseline methods

To evaluate DeepCas9, we compared its prediction performance with that of four existing algorithms, namely SSC score23, sgRNA Scorer24, sgRNA Designer (rule set I)18, and sgRNA Designer (rule set II)19. We retrieved efficiency scores for the tested sgRNAs from their websites, using default parameters for all methods.

We also compared other machine learning models, including Support Vector Machine (SVM), Random Forest (RF), L1-regularized linear regression (L1 regression), L2-regularized linear regression (L2 regression), L1L2-regularized linear regression (L1L2 regression) and Gradient-boosted regression tree (Boosted RT), from the Scikit-learn library42 with optimized parameters. For SVM, we used the radial basis function (RBF) kernel, and the two parameters, the regularization parameter C and the kernel width parameter γ, were optimized by grid search over exponentially growing sequences of (C, γ) (C = 2^-2, 2^-1, …, 2^9 and γ = 2^-6, 2^-5, …, 2^5). For RF, the two parameters, ntree


(the number of trees to grow) and mtry (the number of variables randomly selected as candidates at each node), were optimized using a grid search approach; ntree ranged from 500 to 3000 with a step of 500, and mtry from 2 to 40 with a step of 2. For the linear regressions (L1, L2 and L1L2), the regularization parameter was searched over 100 points in log space, from a minimum of 10^-4 to a maximum of 10^3. For the Gradient-boosted regression trees, the number of base estimators, the maximum depth of the individual regression estimators, the minimum number of samples required to split an internal node and the minimum number of samples required at a leaf node were selected by searching over 200 models. Because SVM and RF are classification models, we trained them to separate potent and weak sgRNAs, pre-classified using a top-20% versus bottom-80% efficacy cutoff.

Evaluation Metric

To evaluate the performance of DeepCas9, we used the Spearman correlation coefficient (r) between measured efficiency scores and prediction scores. A two-tailed Student's t-test with n − 2 degrees of freedom under the null hypothesis was used to determine the significance of each Spearman correlation of the prediction scores.

Results

Overview of the DeepCas9 deep learning model

The deep CNN in DeepCas9 consisted of a hierarchical architecture that used raw DNA sequence as input and predicted the corresponding knockout efficiency. The CNN models consisted of convolution layers, rectification layers, pooling layers and fully connected layers. We carefully designed and sized the model architecture to make it appropriate for our purpose


(Figure 1a). The entire calculation process can be briefly described as follows: each input sequence was converted to a one-hot matrix with 4 rows and 30 columns, the four rows corresponding to the four nucleotides A, G, T and C. The first convolutional layer performed 50 convolutions with a 4 × 4 filter on the one-hot matrix, producing 50 feature maps of size 1 × 27. These filters automatically extracted predictive features from the input sequences during model training. After convolution, rectified linear units (ReLU) passed on the filter-scanning results that were above thresholds learned during model training. The second, pooling layer performed 1 × 2 spatial pooling of each feature map using the max value, producing 50 feature maps of size 1 × 13. Max pooling was applied to find the most significant activation signal in a sequence for each filter. All the pooling results were combined into one vector of size 650. Two fully connected layers were employed, each with 128 nodes. A dropout layer was added between the two fully connected layers to improve the generalization capability of the model and avoid overfitting. The output layer was a regression layer, which output the predicted score for the activity.

After model construction, we performed nested cross-validation with three datasets (Wang/Xu HL60, Doench Mouse-EL4 and Chari 293T) to evaluate the generalized performance of model selection and training of DeepCas9. In each fold of the outer tenfold cross-validation, models were trained on 90% of the data and tested on the remaining 10%. This procedure was repeated ten times so that each tenth of the data served once as the test set. The average Spearman correlation coefficients between experimentally obtained cleavage efficiencies and predicted scores in these datasets were 0.58, 0.66 and 0.69, respectively (Figure S1), indicating the robustness and reproducibility of the CNN algorithm.
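The outer loop of this evaluation can be sketched as follows. This is a minimal illustration with synthetic data and a stand-in regressor in place of the CNN; the guide count, feature construction and model are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((200, 120))                         # 200 guides, flattened 4 x 30 one-hot
y = X[:, :4].sum(axis=1) + 0.1 * rng.random(200)   # synthetic cleavage efficiencies

# Outer tenfold cross-validation: train on 90%, test on the held-out 10%,
# scoring each fold by the Spearman correlation of measured vs. predicted.
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    r, _ = spearmanr(y[test_idx], model.predict(X[test_idx]))
    scores.append(r)
print(np.mean(scores))
```

In the full nested procedure, hyper-parameter search would run inside each training fold, so the held-out tenth never influences model selection.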
When compared to conventional machine learning algorithms, such as Support Vector Machine, Random Forest, L1-regularized


linear regression, L2-regularized linear regression, L1L2-regularized linear regression, and Gradient-boosted regression tree, the Spearman correlation coefficients of the CNN in cross-validation were significantly higher, especially for the larger training datasets (Figure S1). Furthermore, as the length of the target sequence used for cross-validation increased, the average Spearman correlation coefficient stopped increasing at 30 bp, suggesting that 30 bp is adequate as the input target sequence (Figure S2).

Performance evaluation

The sequence features and design rules reported in previous studies are inconsistent across sgRNA libraries, cell types, and organisms18, 20, 23, 24. These issues are mostly related to the heterogeneous data sources generated from different platforms, experimental conditions, cell types, and organisms, and to the insufficient quantity of CRISPR-based genome-editing data. Most previous methods or tools used a single dataset or a small number of sgRNAs to develop a learning model18, 20, 24. Their predictions or sgRNA design rules are therefore likely biased or incomplete. In this work, we tested whether sgRNAs across datasets could be used to create a meta-predictor with even higher accuracy and coverage. We expect that integrating multiple deep-learning models and data sources will lead to more comprehensive and more accurate sgRNA efficacy prediction. We therefore integrated DeepCas9-mEL4, DeepCas9-293T and DeepCas9-HL60 to generate DeepCas9, a weighted score for sgRNA activity based on the predictions of the different deep-learning models. We assessed the performance of DeepCas9 and found that integrating predictions gave substantial improvements over each model taken individually (Figure S3).
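The integration step can be sketched as a weighted average of the three per-dataset models' predictions. The exact weighting scheme is not specified here, so uniform weights and the example scores below serve purely as an illustration.

```python
import numpy as np

def ensemble_score(preds, weights=None):
    """Weighted average of model predictions.
    preds: dict mapping model name -> per-sgRNA scores; weights default to uniform."""
    names = sorted(preds)
    p = np.stack([np.asarray(preds[n], dtype=float) for n in names])
    w = (np.full(len(names), 1.0 / len(names)) if weights is None
         else np.asarray(weights, dtype=float))
    w = w / w.sum()            # normalize so the weights sum to 1
    return w @ p               # one combined score per sgRNA

preds = {
    "DeepCas9-mEL4": [0.2, 0.9, 0.5],   # hypothetical per-model scores
    "DeepCas9-293T": [0.4, 0.7, 0.6],   # for three example sgRNAs
    "DeepCas9-HL60": [0.3, 0.8, 0.4],
}
print(ensemble_score(preds))
```

Non-uniform weights (e.g., favoring the model trained on the largest dataset) drop into the same interface via the `weights` argument.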


To evaluate DeepCas9 performance for prediction of sgRNA on-target activity, the results were compared to SSC score23, sgRNA Scorer24, sgRNA Designer (rule set I)18 and sgRNA Designer (rule set II)19 competitor methods using ten data sets from different species and cell types. Figure 2 presents the obtained results by considering the proposed method and the following competitors. We first compared DeepCas9 with sgRNA Designer (rule set I), the first openly available platform generated by Doench et al.18. By targeting twelve cell surface receptors on murine and human cell lines with a total of 1,841 sgRNA, Doench et al. found that only a very small proportion (~5%) of all sgRNAs are highly effective18. In line with this result, we found that their algorithm (rule set I) trended to predict the majority of sgRNAs in the test datasets to have low cleavage efficiency. This resulted in a poor correlation between the prediction score of sgRNA Designer and measured

Figure 2. Performance comparison of DeepCas9 with other prediction methods. (a) Relative graph of Spearman correlation coefficients between different algorithms and datasets. The test datasets are arranged vertically, whereas the prediction algorithms are


placed horizontally. For each data set, the experimental system is indicated by a species or cell type. The number of guides in each dataset is shown in parentheses. (b) The improvement of DeepCas9 over the different algorithms on the test data sets.

cleavage efficiency in some test datasets (e.g., r = -0.07 for the Gagnon dataset and r = 0.04 for the Moreno-Mateos dataset) (Figure 2a). DeepCas9 performed substantially better than sgRNA Designer (rule set I) on all test datasets (Figure 2b); its average Spearman correlation coefficient was 74.37% higher. We then compared DeepCas9 with two current state-of-the-art algorithms, sgRNA Scorer24 and SSC score23. Chari et al. previously generated large sgRNA libraries for the CRISPR/Cas9 systems of both S. pyogenes (SpCas9) and S. thermophilus (St1Cas9) and built two support vector machine models (sgRNA Scorer), one for each system, that predict high- and low-performing sgRNAs from their sequence composition24. In contrast, Xu et al. used published CRISPR-Cas9 drop-out screening data11 to derive 28 characteristics of sgRNAs showing twofold higher efficacy than their less active counterparts for the same genes (SSC score)23. DeepCas9 outperformed both competitors, with average Spearman correlation coefficients 39.26% and 33.87% higher than SSC score and sgRNA Scorer, respectively (Figure 2b). Like sgRNA Designer (rule set I), the rule set II algorithm predicts the ability of sgRNAs to knock out the target gene19. It improves on rule set I by ranking data from multiple large-scale CRISPR experiments and combining their information into a new, more generalizable model19. When comparing the two algorithms, we observed that although DeepCas9 reached lower individual correlation coefficients than sgRNA Designer


(rule set II) on some datasets, it improved the results on seven of the test datasets and achieved an average Spearman correlation coefficient 6.96% higher than that of sgRNA Designer (rule set II) (Figure 2b). We next investigated whether the performance of DeepCas9 is related to the number of sgRNAs tested. As shown in Figure S4, we did not observe a significant correlation between the Spearman correlation coefficient and the number of sgRNAs tested. The predictive performance of the various methods was consistently better on the Farboud C. elegans dataset than on the Hart Repl2Lib1 HCT116 dataset, indicating that the quality of the input dataset is an important parameter for on-target design tools (Figure 2a). Different experimental measurements and biological readouts, which can vary greatly between the chosen targets, may have added to the difficulty of predicting sgRNA efficacy more precisely. Furthermore, these potentially confounding factors may have introduced biases into the datasets used to establish prediction rules, preventing clean transfer of the underlying features to other datasets. DeepCas9 is therefore designed for personalized sgRNA design that takes cell-type and data-source heterogeneity into account.

Visualizing and understanding DeepCas9

The purpose of this section is to identify the convolutional kernels most important for sgRNA activity and to determine their relative importance, in order to guide future research and CRISPR design efforts. To capture the important convolutional kernels, we performed a feature analysis of the 128 neurons extracted from the convolution and pooling operations and then mapped every neuron back to the original convolution layer. The relative importance of a convolutional kernel was calculated as the total importance score of the neurons generated by that kernel.
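The kernel-ranking step, summing per-neuron importance scores over the neurons each kernel produced, can be sketched as follows. The neuron scores, the neuron-to-kernel mapping, and the kernel labels below are toy values for illustration, not the values computed in the paper.

```python
# Hedged sketch of ranking convolutional kernels: each pooled neuron
# carries an importance score, and a kernel's relative importance is the
# sum over the neurons that kernel generated. All values are toy data.
from collections import defaultdict

def rank_kernels(neuron_scores, neuron_to_kernel, top_k=5):
    """Sum neuron importances per kernel; return the top_k (kernel, score) pairs."""
    totals = defaultdict(float)
    for neuron, score in neuron_scores.items():
        totals[neuron_to_kernel[neuron]] += score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy example: 6 neurons mapped onto 3 hypothetical kernels.
scores = {0: 4.0, 1: 6.5, 2: 1.0, 3: 2.2, 4: 0.9, 5: 3.0}
mapping = {0: "k9", 1: "k9", 2: "k10", 3: "k10", 4: "k48", 5: "k48"}
print(rank_kernels(scores, mapping, top_k=3))  # "k9" sums to 10.5 and ranks first
```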
We showed the top five convolutional kernels for the three training datasets (Figure 3) and visualized the contents of each kernel using the strategy proposed in the


Methods. The most important kernels (kernel 9 for the Doench Mouse-EL4 dataset, kernel 10 for the Chari 293T dataset, and kernel 48 for the Wang/Xu HL60 dataset) had importance scores greater than 10, roughly twice those of the next-best kernels; the other top kernels had scores ranging from 3.09 to 5.21. Within the important kernels, we observed a strong preference for guanine and a low preference for thymine, matching the known nucleotide preferences of highly active sgRNAs. Although these important kernels showed similar nucleotide preferences, the kernel contents selected for mEL4, HL60, and 293T cells were not identical. We further visualized the positions of the DNA sequence that are most important for the prediction of sgRNA activity. The importance of a position is determined by the fraction of kernels activated at that position. When a kernel is slid over the protospacer and flanking sequences, it functions as a motif detector and becomes activated when a certain location matches its


Figure 3. The important convolutional kernels revealed by the deep CNN models. Each convolutional kernel has a length of four nucleotides. At each position of the input sequence, the kernel has a preference among nucleotide types. Each time the kernel is moved, it produces an output by taking the sum of the element-wise product of the input and the kernel's position-specific weights. Nucleotides unfavorable to the kernel suppress the output, while preferred nucleotides increase it. When a sub-sequence matches a kernel's preference, the kernel is activated and produces a positive output at that sub-sequence location. The sequence position importance plot illustrates


where in the sequence the kernel becomes activated for highly active sgRNAs. The higher the importance score, the more kernels are activated at that position.

preference. This process, which differs from most approaches that map particular nucleotide or motif preferences to positions, is equivalent to scanning learned PWMs across the target sequence. We observed that the majority of kernels were activated when they convolved a continuous region adjacent to the PAM, comprising positions 17, 18, 19, and 20 and the ambiguous nucleotide of the PAM (the 'N' in 'NGG') (Figure 3). This result indicates that mutagenesis efficacy correlates with the nucleotide composition at defined positions adjacent to the PAM. It has been proposed that positions 16-20, known as the seed region, determine targeting specificity by making contacts with the arginine-rich bridge helix (BH) within the recognition (REC) lobe of the Cas9 protein43. Base pairing between the seed region of the guide RNA and the target DNA strand drives further step-wise destabilization of the target DNA duplex and directional formation of the guide-RNA-target-DNA heteroduplex44. Thus, target DNA complementarity to the crRNA in the seed region is critical for the cleavage activity of Cas9. In summary, the powerful learning ability of DeepCas9 not only allows sequence motifs to be detected automatically but also reveals additional information, such as motif locations.

Application of DeepCas9 to CRISPRi and CRISPRa

To demonstrate the application of DeepCas9 to the CRISPRi/a system, we examined curated data sets from screens that used CRISPR activation (CRISPRa) and CRISPR interference (CRISPRi) technology23, 45. Briefly, the three datasets contain a total of 7,715 sgRNAs: 6,612 target negatively selected essential genes in the CRISPRi experiment, 571 target positively selected genes upon CTx-DTA treatment, and 532 target growth-inhibiting genes in the CRISPRa experiment. These sgRNAs were grouped into "efficient" and "inefficient" classes based on phenotype scores that measure relative sgRNA abundance. We used the DeepCas9 model to test the ability of deep learning to discriminate efficient from inefficient sgRNAs in the three independently generated CRISPRi/a data sets. We calculated activity scores for these sgRNAs and drew box plots (Figure 4). On the largest CRISPRi data set, DeepCas9 exhibited the greatest difference between the score distributions of sgRNAs classified as effective versus ineffective (fold change (fc) = 1.20, P-value = 1.1×10^-41, two-sample Kolmogorov-Smirnov test). On a smaller CRISPRi data set, DeepCas9 again favored effective sgRNAs, with an fc of 1.16 and a P-value of 1.7×10^-5. The smallest CRISPRa data set, with 532 sgRNAs, gave an fc of 1.14 and a P-value of 1.4×10^-4. Together, these observations suggest that although the structure and function of Cas9 and dCas9 differ, the commonalities between CRISPR knockout and CRISPRi/a, such as the initiation of sgRNA-DNA pairing, preserve the predictive power of DeepCas9.
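The statistical comparison used above can be sketched as a two-sample Kolmogorov-Smirnov test on the predicted scores of efficient versus inefficient sgRNAs, together with the fold change of their means. The score distributions below are synthetic stand-ins, not the screen data.

```python
# Sketch of the efficient-vs-inefficient comparison: two-sample KS test
# plus fold change of mean scores. Scores are synthetic normal samples.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
efficient = rng.normal(loc=0.60, scale=0.15, size=500)    # higher predicted activity
inefficient = rng.normal(loc=0.50, scale=0.15, size=500)  # lower predicted activity

stat, p_value = ks_2samp(efficient, inefficient)
fold_change = efficient.mean() / inefficient.mean()
print(f"KS statistic={stat:.3f}, P={p_value:.2e}, fc={fold_change:.2f}")
```

A large KS statistic with a small P-value indicates that the two score distributions are well separated, which is what Figure 4 shows for the real data sets.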

Figure 4. DeepCas9 performance on three independently generated CRISPRi/a data sets.


Discussion and Conclusions

In this paper, we demonstrated that real-valued in vivo or in vitro sgRNA efficiency scores from the CRISPR-Cas9 system can be predicted from raw sequences by our deep-learning models, allowing effective sgRNAs to be identified precisely and rapidly and bringing closer the ultimate goal of validated genome-wide sgRNA libraries for high-throughput or individual gene knockout, interference, or activation experiments. The previous work DeepCpf1 also used a deep-learning model, trained directly on DNA sequences, to predict CRISPR-Cpf1 guide RNA activity with good performance31. To our knowledge, DeepCas9 is the first deep-learning method that predicts CRISPR-Cas9 sgRNA activity directly from DNA sequences without hand-crafted feature input. The high performance of our deep-learning model rests on its ability to automatically extract sequence signatures, capture activity motifs, and integrate sequence context. In addition, multiple assays and data sources from different sgRNA libraries, cell types, and organisms were integrated simultaneously in our deep-learning models through shared kernels in the CNN layer. We used an ensemble model, obtained by integrating the individual deep-learning models, which exhibited significantly improved predictive performance; this strategy has not been exploited by existing learning-based methods. These features enabled us to develop DeepCas9 as a new tool that predicts sgRNA efficacy with increased accuracy. We showed that the deep CNN models in DeepCas9 accurately learned the known nucleotide preferences of sgRNAs during training, suggesting that the high performance of DeepCas9 stems from its ability to capture key sequence features affecting sgRNA on-target efficacy. The performance on the benchmark datasets and the comparative studies between the CNN and other machine-learning methods showed deep learning to be the better predictor. On the test datasets,


DeepCas9 was more accurate in predicting sgRNA activity, clearly demonstrating the advantages of the deep CNN approach. In summary, DeepCas9 provides a framework for building deep-learning models that predict the activities of abundant sgRNAs across the genomes of multiple species. The deep CNN model in DeepCas9 can be trained on any CRISPR dataset for any cell or species type with available data. With advances in high-throughput CRISPR/Cas9 screening technologies and the accumulation of genomic CRISPR data, deep-learning models in DeepCas9 will become available for an increasing number of cell and species types; as a result, the performance of DeepCas9 will continue to improve. The high performance of DeepCas9 should help solve the challenge of optimizing sgRNA design to maximize activity and will facilitate the high-throughput evaluation and prioritization of sgRNAs on a genome-wide scale.

Supporting Information. Table S1 contains detailed information on the efficiency datasets available for this study. Figure S1 shows the performance comparison of the CNN models with other prediction models on the Wang/Xu HL60, Doench mouse EL-4, and Chari 293T datasets. Figure S2 shows the performance comparison of the CNN model for different sizes of target sequence. Figure S3 shows the performance of a given subset algorithm on a certain dataset. Figure S4 shows a scatter plot of the number of sgRNAs against the Spearman correlation coefficient.

Author Information
*Corresponding Authors:


Jiesi Luo. E-mail: [email protected]
Wei Chen. E-mail: [email protected]

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. ‡These authors contributed equally.

Funding Sources
This work was funded by the National Natural Science Foundation of China (No. 21803045).

Notes The authors declare no competing financial interest.

References 1. Doudna, J. A.; Charpentier, E. Genome editing. The new frontier of genome engineering with CRISPR-Cas9. Science 2014, 346, 1077-+. 2. McNutt, M. Breakthrough to genome editing. Science 2015, 350, 1445-1445. 3. Shalem, O.; Sanjana, N. E.; Zhang, F. High-throughput functional genomics using CRISPR-Cas9. Nat. Rev. Genet. 2015, 16, 299-311. 4. Makarova, K. S.; Wolf, Y. I.; Alkhnbashi, O. S.; Costa, F.; Shah, S. A.; Saunders, S. J.; Barrangou, R.; Brouns, S. J.; Charpentier, E.; Haft, D. H.; Horvath, P.; Moineau, S.; Mojica, F. J.; Terns, R. M.; Terns, M. P.; White, M. F.; Yakunin, A. F.; Garrett, R. A.; van der Oost, J.; Backofen, R.; Koonin, E. V. An updated evolutionary classification of CRISPR-Cas systems. Nat. Rev. Microbiol. 2015, 13, 722-736. 5. Makarova, K. S.; Zhang, F.; Koonin, E. V. SnapShot: Class 1 CRISPR-Cas Systems. Cell 2017, 168, 946-946. 6. Hsu, P. D.; Lander, E. S.; Zhang, F. Development and applications of CRISPR-Cas9 for genome engineering. Cell 2014, 157, 1262-1278. 7. Deltcheva, E.; Chylinski, K.; Sharma, C. M.; Gonzales, K.; Chao, Y.; Pirzada, Z. A.; Eckert, M. R.; Vogel, J.; Charpentier, E. CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 2011, 471, 602-607.


8. Jinek, M.; Chylinski, K.; Fonfara, I.; Hauer, M.; Doudna, J. A.; Charpentier, E. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 2012, 337, 816-821. 9. Anders, C.; Niewoehner, O.; Duerst, A.; Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 2014, 513, 569-573. 10. Nishimasu, H.; Ran, F. A.; Hsu, P. D.; Konermann, S.; Shehata, S. I.; Dohmae, N.; Ishitani, R.; Zhang, F.; Nureki, O. Crystal structure of Cas9 in complex with guide RNA and target DNA. Cell 2014, 156, 935-949. 11. Wang, T.; Wei, J. J.; Sabatini, D. M.; Lander, E. S. Genetic Screens in Human Cells Using the CRISPR-Cas9 System. Science 2014, 343, 80-84. 12. Gilbert, L. A.; Larson, M. H.; Morsut, L.; Liu, Z.; Brar, G. A.; Torres, S. E.; SternGinossar, N.; Brandman, O.; Whitehead, E. H.; Doudna, J. A.; Lim, W. A.; Weissman, J. S.; Qi, L. S. CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes. Cell 2013, 154, 442-451. 13. Konermann, S.; Brigham, M. D.; Trevino, A. E.; Joung, J.; Abudayyeh, O. O.; Barcena, C.; Hsu, P. D.; Habib, N.; Gootenberg, J. S.; Nishimasu, H.; Nureki, O.; Zhang, F. Genome-scale transcriptional activation by an engineered CRISPR-Cas9 complex. Nature 2015, 517, 583U332. 14. Rusk, N. CRISPRs and epigenome editing. Nat. Methods 2014, 11, 28-28. 15. Rousseau, B. A.; Hou, Z.; Gramelspacher, M. J.; Zhang, Y. Programmable RNA Cleavage and Recognition by a Natural CRISPR-Cas9 System from Neisseria meningitidis. Mol Cell 2018, 69, 906-914. 16. Hsu, P. D.; Scott, D. A.; Weinstein, J. A.; Ran, F. A.; Konermann, S.; Agarwala, V.; Li, Y.; Fine, E. J.; Wu, X.; Shalem, O.; Cradick, T. J.; Marraffini, L. A.; Bao, G.; Zhang, F. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 2013, 31, 827-832. 17. Mandal, P. K.; Ferreira, L. M. R.; Collins, R.; Meissner, T. B.; Boutwell, C. L.; Friesen, M.; Vrbanac, V.; Garrison, B. 
S.; Stortchevoi, A.; Bryder, D.; Musunuru, K.; Brand, H.; Tager, A. M.; Allen, T. M.; Talkowski, M. E.; Rossi, D. J.; Cowan, C. A. Efficient Ablation of Genes in Human Hematopoietic Stem and Effector Cells using CRISPR/Cas9. Cell stem cell 2014, 15, 643-652. 18. Doench, J. G.; Hartenian, E.; Graham, D. B.; Tothova, Z.; Hegde, M.; Smith, I.; Sullender, M.; Ebert, B. L.; Xavier, R. J.; Root, D. E. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 2014, 32, 1262-1267. 19. Doench, J. G.; Fusi, N.; Sullender, M.; Hegde, M.; Vaimberg, E. W.; Donovan, K. F.; Smith, I.; Tothova, Z.; Wilen, C.; Orchard, R.; Virgin, H. W.; Listgarten, J.; Root, D. E. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 2016, 34, 184-191. 20. Moreno-Mateos, M. A.; Vejnar, C. E.; Beaudoin, J. D.; Fernandez, J. P.; Mis, E. K.; Khokha, M. K.; Giraldez, A. J. CRISPRscan: designing highly efficient sgRNAs for CRISPRCas9 targeting in vivo. Nat. Methods 2015, 12, 982-988. 21. Cong, L.; Ran, F. A.; Cox, D.; Lin, S. L.; Barretto, R.; Habib, N.; Hsu, P. D.; Wu, X. B.; Jiang, W. Y.; Marraffini, L. A.; Zhang, F. Multiplex Genome Engineering Using CRISPR/Cas Systems. Science 2013, 339, 819-823. 22. Wong, N.; Liu, W.; Wang, X. WU-CRISPR: characteristics of functional guide RNAs for the CRISPR/Cas9 system. Genome Biol. 2015, 16, 218-225.


23. Xu, H.; Xiao, T.; Chen, C. H.; Li, W.; Meyer, C. A.; Wu, Q.; Wu, D.; Cong, L.; Zhang, F.; Liu, J. S.; Brown, M.; Liu, X. S. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 2015, 25, 1147-1157. 24. Chari, R.; Mali, P.; Moosburner, M.; Church, G. M. Unraveling CRISPR-Cas9 genome engineering parameters via a library-on-library approach. Nat. Methods 2015, 12, 823-836. 25. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436-444. 26. Webb, S. Deep Learning for Biology. Nature 2018, 554, 555-557. 27. Xiong, H. Y.; Alipanahi, B.; Lee, L. J.; Bretschneider, H.; Merico, D.; Yuen, R. K.; Hua, Y.; Gueroussov, S.; Najafabadi, H. S.; Hughes, T. R.; Morris, Q.; Barash, Y.; Krainer, A. R.; Jojic, N.; Scherer, S. W.; Blencowe, B. J.; Frey, B. J. RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 2015, 347, 1254806. 28. Alipanahi, B.; Delong, A.; Weirauch, M. T.; Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831-838. 29. Zhou, J.; Troyanskaya, O. G. Predicting effects of noncoding variants with deep learningbased sequence model. Nat. Methods 2015, 12, 931-934. 30. Kelley, D. R.; Snoek, J.; Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 2016, 26, 990-999. 31. Kim, H. K.; Min, S.; Song, M.; Jung, S.; Choi, J. W.; Kim, Y.; Lee, S.; Yoon, S.; Kim, H. H. Deep learning improves prediction of CRISPR-Cpf1 guide RNA activity. Nat. Biotechnol. 2018, 36, 239-241. 32. Horlbeck, M. A.; Gilbert, L. A.; Villalta, J. E.; Adamson, B.; Pak, R. A.; Chen, Y.; Fields, A. P.; Park, C. Y.; Corn, J. E.; Kampmann, M.; Weissman, J. S. Compact and highly active next-generation libraries for CRISPR-mediated gene repression and activation. eLife 2016, 5, e19760. 33. Haeussler, M.; Schonig, K.; Eckert, H.; Eschstruth, A.; Mianne, J.; Renaud, J. 
B.; Schneider-Maunoury, S.; Shkumatava, A.; Teboul, L.; Kent, J.; Joly, J. S.; Concordet, J. P. Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biol. 2016, 17, 148-159. 34. Hart, T.; Chandrashekhar, M.; Aregger, M.; Steinhart, Z.; Brown, K. R.; MacLeod, G.; Mis, M.; Zimmermann, M.; Fradet-Turcotte, A.; Sun, S.; Mero, P.; Dirks, P.; Sidhu, S.; Roth, F. P.; Rissland, O. S.; Durocher, D.; Angers, S.; Moffat, J. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell 2015, 163, 1515-1526. 35. Farboud, B.; Meyer, B. J. Dramatic Enhancement of Genome Editing by CRISPR/Cas9 Through Improved Guide RNA Design. Genetics 2015, 199, 959-U106. 36. Varshney, G. K.; Pei, W. H.; LaFave, M. C.; Idol, J.; Xu, L. S.; Gallardo, V.; Carrington, B.; Bishop, K.; Jones, M.; Li, M. Y.; Harper, U.; Huang, S. C.; Prakash, A.; Chen, W. B.; Sood, R.; Ledin, J.; Burgess, S. M. High-throughput gene targeting and phenotyping in zebrafish using CRISPR/Cas9. Genome Res. 2015, 25, 1030-1042. 37. Gagnon, J. A.; Valen, E.; Thyme, S. B.; Huang, P.; Ahkmetova, L.; Pauli, A.; Montague, T. G.; Zimmerman, S.; Richter, C.; Schier, A. F. Efficient Mutagenesis by Cas9 ProteinMediated Oligonucleotide Insertion and Large-Scale Assessment of Single-Guide RNAs. Plos One 2014, 9, e98186. 38. Hubel, D. H.; Wiesel, T. N. Shape and arrangement of columns in cat's striate cortex. J Physiol 1963, 165, 559-568.


39. Gu, J. X.; Wang, Z. H.; Kuen, J.; Ma, L. Y.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X. X.; Wang, G.; Cai, J. F.; Chen, T. Recent advances in convolutional neural networks. Pattern Recognit 2018, 77, 354-377. 40. Angermueller, C.; Parnamaa, T.; Parts, L.; Stegle, O. Deep learning for computational biology. Mol. Syst. Biol 2016, 12, 878. 41. Altmann, A.; Tolosi, L.; Sander, O.; Lengauer, T. Permutation importance: a corrected feature importance measure. Bioinformatics 2010, 26, 1340-1347. 42. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Mach Learn. Res. 2011, 12, 2825-2830. 43. Nishimasu, H.; Ran, F. A.; Hsu, P. D.; Konermann, S.; Shehata, S. I.; Dohmae, N.; Ishitani, R.; Zhang, F.; Nureki, O. Crystal Structure of Cas9 in Complex with Guide RNA and Target DNA. Cell 2014, 156, 935-949. 44. Anders, C.; Niewoehner, O.; Duerst, A.; Jinek, M. Structural basis of PAM-dependent target DNA recognition by the Cas9 endonuclease. Nature 2014, 513, 569-+. 45. Gilbert, L. A.; Horlbeck, M. A.; Adamson, B.; Villalta, J. E.; Chen, Y.; Whitehead, E. H.; Guimaraes, C.; Panning, B.; Ploegh, H. L.; Bassik, M. C.; Qi, L. S.; Kampmann, M.; Weissman, J. S. Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell 2014, 159, 647-661.


Figure 1. Inside the DeepCas9 architecture. (a) On the left, a convolutional neural network is applied to automatically learn DNA sequence features. First, the DNA sequence is encoded as "one-hot" vectors, with a 1 at the position corresponding to the nucleotide type (A, C, G, or T) and 0 elsewhere. Convolution and pooling operations are then applied to the input vectors, producing the output of each layer as feature maps. Finally, the output of the fully connected layer is fed to a linear regression layer that assigns an activity score. (b) The process of the convolution operation, in which a 4×4 filter traverses the entire input vector with a stride of 1 to calculate the output. A filter can be visualized as a sequence motif, which helps show which nucleotide the filter prefers at each sequence position.
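The encoding and convolution steps described in Figure 1 can be sketched as follows; the "G-detector" filter weights below are illustrative, not learned DeepCas9 parameters.

```python
# Minimal sketch of Figure 1's input pipeline: one-hot encode a DNA
# sequence (4 channels x length) and slide a 4x4 filter with stride 1,
# each output being the sum of an element-wise product. Toy filter only.
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) binary matrix."""
    mat = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        mat[BASES.index(base), j] = 1.0
    return mat

def convolve(x, kernel):
    """Valid 1-D convolution over positions: output length L - k + 1."""
    k = kernel.shape[1]
    return np.array([np.sum(x[:, i:i + k] * kernel)
                     for i in range(x.shape[1] - k + 1)])

x = one_hot("GGTACGGA")
g_detector = np.zeros((4, 4))
g_detector[BASES.index("G"), :] = 1.0   # toy filter rewarding G at any position
print(convolve(x, g_detector))          # counts Gs in each 4-nt window
```

Because the filter is just a weight matrix over nucleotide channels and positions, visualizing it as in Figure 1b amounts to reading off which base has the largest weight at each of its four positions.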




