Fluorescence Color by Data-Driven Design of Genomic Silver Clusters Stacy M Copp, Alexander Gorovits, Steven M Swasey, Sruthi Gudibandi, Petko Bogdanov, and Elisabeth G. Gwinn ACS Nano, Just Accepted Manuscript • DOI: 10.1021/acsnano.8b03404 • Publication Date (Web): 30 Jul 2018 Downloaded from http://pubs.acs.org on July 31, 2018
Fluorescence Color by Data-Driven Design of Genomic Silver Clusters
Stacy M. Copp1*, Alexander Gorovits2, Steven M. Swasey3, Sruthi Gudibandi2, Petko Bogdanov2, Elisabeth G. Gwinn4
1 Center for Integrated Nanotechnologies, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
2 Department of Computer Science, University at Albany-SUNY, 1400 Washington Ave. Albany, NY 12222, USA
3 Department of Chemistry, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
4 Department of Physics, University of California, Santa Barbara, Santa Barbara, CA 93106, USA
*Corresponding author: [email protected]
ABSTRACT: DNA nucleobase sequence controls the size of DNA-stabilized silver clusters, leading to their well-known yet little-understood sequence-tuned colors. The enormous space of possible DNA sequences for templating clusters has challenged understanding of how sequence selects cluster properties and limited design of applications that employ these clusters. We investigate the genomic role of DNA sequence for fluorescent silver clusters using a data-driven approach. Employing rapid parallel silver cluster synthesis and fluorimetry, we determine the fluorescence spectra of silver cluster products stabilized by 1,432 distinct DNA oligomers. By
applying pattern recognition algorithms to this large experimental data set, we discover certain DNA base patterns, or “motifs,” that correlate to silver clusters with similar fluorescence spectra. These motifs are employed in machine learning classifiers to predictively design DNA template sequences for specific fluorescence color bands. Our method improves the selectivity of templates for silver clusters with peak emission wavelengths beyond 660 nm by 330%. The discovered base motifs also provide physical insights into how DNA sequence controls silver cluster size and color. This predictive design approach for color of DNA-stabilized silver clusters exhibits the potential of machine learning and data mining to increase the precision and efficiency of nanomaterials design, even for a soft-matter-inorganic hybrid system characterized by an extremely large parameter space.
KEYWORDS: DNA, metal cluster, fluorescence, high-throughput, machine learning
Beyond encoding the genome of natural organisms, DNA sequence can also be engineered to realize self-assembling DNA nanostructures with custom geometries,1 to direct assembly of molecules and nanoparticles,2,3 and even to program motion.4 A distinct evolution in DNA nanotechnology is the use of DNA to template inorganic particles with sizes down to the cluster regime of just 2-100 atoms. Recent research has particularly focused on fluorescent metal
clusters of gold, silver, and copper.5–7 DNA-stabilized silver clusters (AgN-DNA) are the best studied DNA-inorganic cluster composites because various 10-40 base DNA oligomer sequences stabilize AgN-DNA with distinct, narrow-band fluorescence colors spanning visible to near-infrared wavelengths.8 With very small sizes of only ~10-30 silver atoms,9 AgN-DNA have intriguing potential as fluorescence reporters for a wide range of nanoscale events.10–13 However, applications are currently much impeded by the unsolved puzzle of the cluster “genome”: how does DNA sequence determine silver cluster color? Decoding the AgN-DNA genome would improve fundamental understanding of these materials and enable informed design of DNA sequences for a range of AgN-DNA applications, such as analyte sensing, biological imaging, and targeted placement of clusters on DNA constructs, at the high precision that Watson-Crick pairings currently afford to established DNA nanotechnology schemes. Thus far, the large space of possible DNA template sequences has hindered understanding of the sequence-color relation: many researchers stabilize AgN-DNA with DNA strands containing 20-30 bases, of which there are >10^18 distinct sequences to choose from. To overcome the challenges presented by this large experimental parameter space and uncover how DNA sequence selects for AgN-DNA color, we harness tools from machine learning and bioinformatics to explore large experimental datasets connecting DNA sequence to AgN-DNA fluorescence color. “Materials informatics,” the study and design of materials systems by application of machine learning and other methods,14,15 is emerging as an efficient route to realizing desired material properties through data-driven prediction of synthesis parameters.16–22 Such methods are also improving the fundamental understanding of certain materials systems.17,22
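The size of the sequence space quoted above follows from simple counting: each position in a strand can hold one of four bases, so a strand of n bases has 4^n possible sequences. The short sketch below only illustrates that arithmetic; the function name is ours, not the authors'.

```python
# Count the distinct DNA sequences of a given length (4 possible bases per position).
BASES = "ACGT"


def sequence_space_size(length: int) -> int:
    """Number of distinct DNA sequences containing `length` bases."""
    return len(BASES) ** length


# Print the size of the space for a few template lengths discussed in the text.
for n in (10, 20, 30):
    print(f"{n}-base templates: {sequence_space_size(n):.3e} possible sequences")
```

For 30-base strands alone the count already exceeds 10^18, whereas 10-base strands give about one million sequences, a space small enough to sample meaningfully by experiment.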
AgN-DNA fluorophores are especially promising candidates for data-driven methods for several reasons. First, high-throughput methods have been developed for rapid synthesis and optical characterization of AgN-DNA.17,23 Second, decoding how DNA sequence selects AgN-DNA properties is a high-dimensional problem due to the immense number of possible DNA template sequences. Third, successful design of AgN-DNA with desired optical properties requires stringent control over cluster size with atomic precision because small changes in cluster structure can produce large spectral shifts.9,23 Here, we seek to engineer DNA templates to control the morphology, and thereby the fluorescence color, of clusters of just ~10-20 silver atoms. Using high-throughput experimental methods, we generate a large training dataset associating many DNA sequences with the fluorescence spectra of the silver clusters they stabilize. We employ a pattern recognition algorithm to extract DNA base subsequences called “motifs” that are correlated to specific fluorescence wavelength bands and employ these motifs in a machine learning classifier. Our approach discriminates sequences that stabilize clusters in different color bands, successfully classifying DNA templates in an experimental training dataset into physically relevant color categories with cross-validation accuracies up to 90%. We then employ the trained classifier to select new templates for AgN-DNA in specific color bands and verify these templates experimentally. This work successfully demonstrates a “closed loop” data-driven design strategy of a nanomaterial from start to finish: a large experimental data set is used to train a machine learning classifier, and new predictions by the classifier are then tested experimentally.
RESULTS AND DISCUSSION
Scheme 1 outlines our data-driven design method for DNA templates stabilizing AgN-DNA of specific fluorescence colors. The method requires many “training” data of individual DNA
sequences and the fluorescence spectra of the AgN-DNA product(s) they stabilize (Scheme 1 I), as well as definition of color classes (Scheme 1 II). To begin to uncover the physically important information that connects DNA sequence to color, we perform motif mining using a tool for biological sequence pattern recognition24 (Scheme 1 III). Step III translates the training data into a parameterized form that is used to train a machine learning classifier (Scheme 1 IV). The nucleobase patterns uncovered in Step III, in conjunction with trained classifiers, are used to generate new sequences for select color classes (Scheme 1 V) that are then tested experimentally (Scheme 1 VI). Each step of this method is described in detail below.
Scheme 1. Sequence design method for DNA templates that stabilize AgN-DNA within specific color bands.
Experimental training data. To generate training data (Scheme 1 I), we synthesize and characterize AgN-DNA templated by 1432 10-base DNA oligomers, each with a distinct sequence, using robotic parallel cluster synthesis and fluorimetry in 384-well plate format.17,23 We consider 10-base DNA oligomers because these are long enough to stabilize fluorescent AgN-DNA25 and to potentially capture patterns relevant to more commonly used longer DNA templates, while short enough to allow sufficient sampling of the space of all 4^10 possible distinct sequences. About half of the DNA training sequences were randomly generated23 and half were
previously designed to be brightly fluorescent, regardless of color.17 The combination of random and designed sequences is necessary to generate ample training data for DNA templates that stabilize fluorescent AgN-DNA because, at most, 25% of random 10-base sequences stabilize fluorescent AgN-DNA.23 The likelihood of selecting templates for fluorescent AgN-DNA can be increased three-fold by machine learning-aided design,17 thus expediting generation of sufficient training data. We emphasize that the designed training sequences were not designed to be selective for fluorescence color and were required to differ from training sequences by at least two base mutations,17 limiting the impact of designed training data on results and interpretation. A given DNA template may stabilize no fluorescent cluster products, one fluorescent product, or multiple products of different fluorescence colors, as evidenced by the number of peaks in the fluorescence spectrum. To determine the color(s) of these products, spectra are fitted to a sum of Gaussians to extract peak fluorescence wavelengths and fluorescence intensities. The training data then consists of 1432 template sequences and associated fluorescence peak wavelengths and intensities (Table S3). Further experimental details are provided in Methods and Supporting Information. Class definitions. 
We define physically motivated color classes based on the correlation of fluorescence color and cluster structure (Scheme 1 II).9,23 Previous work used mass spectrometry of purified AgN-DNA solutions to establish that “green” AgN-DNA with emission centered around 540 nm primarily contain four neutral silver atoms, while “red” AgN-DNA with emission centered around 630 nm primarily contain 6 neutral silver atoms; this study included several of the 10-base training templates.23 Due to these “magic number” cluster sizes that correspond to especially stable AgN-DNA,23 we expect most AgN-DNA to naturally fall into magic number color bands, justifying the use of discrete color classes. The color band for a given class will be
broad to the extent that color depends on structural details other than silver content, such as variations in cluster shape or solvent exposure associated with the specifics of the nucleic acid environment.26 The multi-peaked color distribution of the fluorescence spectral peaks, λp, in the training data supports this expectation for natural color categories (Figure 1A). Thus, we define a “Green” class to contain DNA templates for AgN-DNA with brightest spectral peaks λp < 580 nm. (DNA templates with 580 nm < λp < 600 nm are excluded to better separate color classes based on magic numbers.) In Figure 1A, the broad peak near 630 nm has hints of longer wavelength structure that may be associated with larger cluster sizes or other changes in structure (see Supporting Information Figure S1 for a higher resolution histogram). Thus, we define a “Red” class (600 nm < λp < 660 nm) and a “Very Red” class (λp > 660 nm). Finally, DNA templates with fluorescence intensities below a dark threshold are categorized in a “Dark” class. Such Dark templates may either stabilize no silver clusters or may stabilize clusters that are insufficiently fluorescent for detection. (We exclude from analysis DNA templates with intensities between “bright” and “dark” intensity thresholds to improve selectivity for brightly fluorescent AgN-DNA;17 details of thresholds are in the Supporting Information.)
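The class definitions above amount to a small decision rule on the brightest peak's wavelength and intensity. The following is a minimal sketch of that rule, not the authors' code. The wavelength boundaries come from the text, but the intensity thresholds are given only in the Supporting Information, so `dark_threshold` and `bright_threshold` are left as caller-supplied placeholders. The treatment of peaks that fall exactly on a boundary is also our assumption.

```python
from typing import Optional

# Wavelength boundaries (nm) taken from the class definitions in the text.
GREEN_MAX_NM = 580.0
RED_MIN_NM = 600.0
VERY_RED_MIN_NM = 660.0


def color_class(peak_nm: float, intensity: float,
                dark_threshold: float, bright_threshold: float) -> Optional[str]:
    """Return 'Dark', 'Green', 'Red', 'Very Red', or None if the template is excluded.

    The two intensity thresholds are hypothetical inputs; the actual values
    are reported only in the paper's Supporting Information.
    """
    if intensity < dark_threshold:
        return "Dark"
    if intensity < bright_threshold:
        return None  # between the dark and bright thresholds: excluded from analysis
    if peak_nm < GREEN_MAX_NM:
        return "Green"
    if peak_nm < RED_MIN_NM:
        return None  # 580-600 nm gap: excluded to separate Green from Red
    if peak_nm < VERY_RED_MIN_NM:
        return "Red"  # assumption: a peak at exactly 600 nm is counted as Red
    return "Very Red"  # assumption: a peak at exactly 660 nm is counted as Very Red
```

Returning `None` for excluded templates keeps the two exclusion rules (the 580-600 nm gap and the intermediate intensity band) explicit rather than silently folding them into a neighboring class.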
Figure 1. (A) Histogram of bright spectral peaks measured for AgN-DNA stabilized by 1432 10-base DNA training templates. Dashed vertical lines demark color class boundaries. (B) Histogram of the numbers of DNA sequences in color classes: sequences with bright spectral peaks in only one class (grey) and sequences with secondary bright peaks in a different class (checkered blue).
This physically motivated color classification presents several computational challenges. First, the color classes have different sizes (Figure 1B). Imbalanced class sizes can cause machine learning classifiers to bias learning towards large classes only, as many algorithms assume data is evenly distributed among classes; this adversely affects classifier performance.27 Machine learning from such imbalanced data is a well-studied problem28–31 that can be addressed at the data level by sampling, at the algorithm level by imposing non-uniform training misclassification costs, and by hybrids of both.31 We balance class sizes by using all sequences in the smallest class, Green, and randomly subsampling Red, Very Red, and Dark classes. Second, AgN-DNA synthesis on some templates produces AgN-DNA of more than one color class (checkered blue bars, Figure 1B). Because these multi-color DNA templates likely mix DNA base patterns favoring different silver cluster sizes, we omit them from the training data. The large data sets used here required high throughput AgN-DNA synthesis by robotic pipetting. We tested reproducibility by replicate synthesis on half of the training sequences (Figure S2) and found negligible variance for the numbers of sequences falling into the Dark Class, 14% variance for the Very Red Class, 12% variance for the Red Class, and 39% variance for the Green Class, which is apparently more sensitive to experimental variations in reagent delivery (Table S1). Thus, we expect higher design success for Red and Very Red classes than for Green. Machine learning classifier. Support vector machines (SVMs), available in the Weka library,32,33 were used for classification. We select SVMs because they performed slightly better than other standard algorithms in the classification of DNA templates for AgN-DNA fluorescence
intensity17 and because SVMs are relatively straightforward to interpret,22 in contrast to more complex “deep learning” methods.34 Briefly, an SVM learns to separate two classes of data represented by feature vectors in a high-dimensional space. Because our AgN-DNA color prediction problem involves four color classes, we employ one-versus-one (OvO) classification.35 With this method, each color class is associated with three trained SVMs that predict the set of probabilities for a given DNA sequence to belong to the desired class (e.g. Green) rather than an undesired class (e.g. Dark, Red, or Very Red). Motif mining. For effective classification using machine learning algorithms, feature vectors should describe the characteristics of a datum that determine its classification. Thus, we seek a representation of the training data that parameterizes information about a DNA sequence that is most important for determining color class. Previously, we showed that certain DNA base subsequences, or “motifs,” were predictive of AgN-DNA fluorescence intensity, while SVMs trained using feature vectors composed of entire parameterized sequences were poorly predictive of intensity.17 This was likely due, in part, to the fact that feature vectors containing information about select base motifs better capture important patterns that are invariant with position in a sequence while excluding information that is irrelevant for classification. Due to the clear role of DNA sequence in selecting AgN-DNA color, we hypothesize that certain DNA base motif patterns are also selective of AgN-DNA color. The motif mining algorithm, MERCI,24 was used to search for motifs that occur more frequently in one color class than the others (Scheme 1 III). Given a “positive” class and a “negative” class of sequences, MERCI finds subsequences occurring at a rate > fp in the positive class and < fn in the negative class. 
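The frequency criterion described above can be illustrated with a deliberately simplified stand-in for MERCI. The sketch below considers only contiguous subsequences and treats fp and fn as fractions of sequences. Both choices are our assumptions for illustration; the MERCI tool itself may define its motifs and parameters differently. All function names are hypothetical.

```python
from typing import Dict, List, Set


def containing_counts(sequences: List[str], max_len: int) -> Dict[str, int]:
    """For every contiguous subsequence of length 1..max_len, count the sequences containing it."""
    counts: Dict[str, int] = {}
    for seq in sequences:
        seen: Set[str] = set()
        for k in range(1, max_len + 1):
            for i in range(len(seq) - k + 1):
                seen.add(seq[i:i + k])
        # Each sequence contributes at most once per motif.
        for motif in seen:
            counts[motif] = counts.get(motif, 0) + 1
    return counts


def discriminative_motifs(positive: List[str], negative: List[str],
                          f_p: float, f_n: float, max_len: int = 4) -> List[str]:
    """Motifs present in more than f_p of positives and fewer than f_n of negatives.

    Assumes both input lists are non-empty.
    """
    pos = containing_counts(positive, max_len)
    neg = containing_counts(negative, max_len)
    selected = []
    for motif, n_pos in pos.items():
        rate_pos = n_pos / len(positive)
        rate_neg = neg.get(motif, 0) / len(negative)
        if rate_pos > f_p and rate_neg < f_n:
            selected.append(motif)
    return sorted(selected)
```

Counting each sequence at most once per motif makes the rates fractions of templates rather than raw occurrence counts, which matches the "rate in the positive class" wording more closely than a per-occurrence tally would.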
To find base motifs discriminative of color, we assign one color class as “positive” and combine all other classes as “negative,” extracting base motifs specifically correlated to one color class only. We take the
union of discriminative base motifs for each color class and represent DNA templates using binary features corresponding to each identified motif: a value of “1” if a base motif is present in a sequence and “0” if it is absent (details in Supporting Information).
Feature selection. This motif mining approach focuses on subsequences that correlate to color and thereby removes irrelevant information that can otherwise lower a machine learning algorithm’s classification accuracy.17 The choice of MERCI parameters fp and fn strongly tunes the number of selected motifs. Thus, in addition to selecting discriminative motifs, we may unintentionally select extraneous base motifs with no role in determining color, thus lowering classification accuracy, or we may select too few motifs, thus leaving out important base patterns and also lowering prediction accuracy. To capture only the most discriminative base motifs, we employ a class of supervised machine learning methods called feature selection (FS) that detect the most informative features for class determination.30,36 Here, FS identifies base motifs that are most important for determining the color class of a sequence. To identify base motifs related to each of the four color classes, we use OvO classification to perform FS (see Supporting Information for FS details).
Cross-validation. The generalizability of the trained SVM classifiers is tested by 10-fold cross-validation. This process trains each pairwise classifier on a randomly selected subset of the training data and determines classification accuracy on the remaining subset, repeating 10 times to determine variance (Table S2). (Thus, the accuracy of the trained SVM classifiers is evaluated on data that is not included in the training set.)
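As a rough illustration of the cross-validation procedure just described, the sketch below builds ten train/test splits in which every template is held out exactly once. This is the standard k-fold construction and is our reading of the text. The authors' Weka-based implementation may partition the data differently, and the function name is hypothetical.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random assignment of items to folds
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        # Training set is everything outside the held-out fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

A classifier would be trained on each `train` list and scored on the matching `test` list; the spread of the ten accuracies then gives the variance estimate.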
Using this measure, we tested various possible feature sets in addition to the FS-identified motifs, finding that cross-validated classification accuracies were maximized by a refined set of feature vectors composed of the FS-identified motifs and additional features with aggregate template information (see Methods and Supporting
Information). This refinement not only improved accuracies but also greatly reduced the dimensionality of the feature space, from over 1600 motifs identified by MERCI to slightly over 200 features, aiding interpretation of results as discussed below.
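The motif part of the feature vectors described above is a simple presence/absence encoding. A minimal sketch follows, assuming contiguous motifs matched by substring search. The additional aggregate template features mentioned in the text are not specified here and are therefore not modeled. The motif list in the usage note is invented for illustration.

```python
from typing import List


def motif_features(sequence: str, motifs: List[str]) -> List[int]:
    """Binary feature vector: 1 if the motif occurs anywhere in the sequence, else 0."""
    return [1 if motif in sequence else 0 for motif in motifs]
```

For example, with the hypothetical motif list `["CC", "TT", "GG", "AAT"]`, the template `GGCCGGAACC` maps to `[1, 0, 1, 0]`. Because the encoding ignores where a motif occurs, it captures patterns that are invariant with position in the sequence.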
Figure 2. Heat map of pairwise SVM classification accuracies for OvO classifiers.
The classification accuracies of the resultant pairwise SVMs are shown in Figure 2. Pairwise classifiers discriminate Red and Very Red from Dark with high (~90%) accuracies. Discrimination between Red and Very Red is least accurate, as might be expected given their neighboring color bands. Discrimination between Green and Dark has lower accuracy than between Dark and Red or Very Red, as expected due to the greater experimental variance found for the Green class (Figure S2).
DNA template sequence design. After cross-validation tests, we retrained the pairwise SVM classifiers, employing all training data, and used the retrained SVM to select new template sequences for silver clusters of a desired color class. The design process has two steps. First, because computational testing of all possible new templates is infeasible, we use a simple generative scheme to build new candidate sequences for a given color, concatenating base motifs identified by FS as most important for color classification (Scheme 1 V, details in Methods and
Supporting Information). The retrained classifiers then predict the pairwise probabilities that the new template will produce silver clusters in the desired color class. The lowest of the three pairwise probabilities, Pmin, represents the least certain color classification. We rank the generated templates for a given color in descending order of Pmin. We experimentally tested the efficacy of our design methods using the top 180 ranked designed templates for Green and Very Red silver clusters. These classes were chosen for testing because training data is sparsest in these regions of the color distribution (Figure 1A), making design most challenging and the discovery of new templates most valuable. Figure 3 shows results for Green (Figure 3A) and Very Red (Figure 3B) designed templates. Design clearly concentrates the measured color distribution into the desired color bands. Table 1 compares percentages of DNA templates in each class for designed sequences and training data (successful design is reflected by increased percentage for the desired class and decreased percentages for other classes). By machine learning-aided design, we select DNA templates for bright Very Red clusters with 330% greater success, and bright Green clusters with 70% greater success.
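The ranking step described above (score each candidate by its weakest pairwise probability, then sort) can be sketched as follows. The candidate sequences and probabilities in the usage note are invented purely for illustration, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def rank_by_p_min(pairwise_probs: Dict[str, List[float]],
                  top_n: int) -> List[Tuple[str, float]]:
    """Rank candidate templates by P_min, the lowest of their pairwise probabilities.

    `pairwise_probs` maps each candidate sequence to the probabilities, one per
    pairwise classifier, that it belongs to the desired class.
    """
    scored = [(seq, min(probs)) for seq, probs in pairwise_probs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # descending P_min
    return scored[:top_n]
```

Given hypothetical candidates with probabilities `[0.95, 0.80, 0.90]`, `[0.99, 0.60, 0.97]`, and `[0.85, 0.88, 0.91]`, the third ranks first (P_min = 0.85) even though the second has the highest individual probabilities. Ranking on the minimum favors templates that every pairwise classifier agrees on.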
Figure 3. Histograms of bright spectral peaks, normalized to the numbers of DNA sequences per experiment: (A) design for the Green Class (green) and training data (grey), (B) design for the Very Red Class (red) and training data (grey).
Table 1. Relative changes in color class sizes after machine learning-aided design.

                      Dark Class^a   Green Class^a   Red Class^a   Very Red Class^a
Design for Very Red   -87.7%         -82.1%          +10.1%        +328.4%
Design for Green      +81.3%         +70.2%          -71.4%        -57.8%

^a Percent changes in class sizes are relative to the class sizes of the original training data.
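The entries in Table 1 are relative changes, so a value of +328.4% means the designed set contains roughly 4.3 times the fraction of Very Red templates found in the training data. The sketch below shows one plausible way to compute such values from class percentages. The underlying class fractions are not listed in this section, so the function names are ours and any inputs are hypothetical.

```python
def relative_change_percent(training_percent: float, designed_percent: float) -> float:
    """Percent change of a class's share in the designed set relative to the training set."""
    return 100.0 * (designed_percent - training_percent) / training_percent


def fold_change(relative_change: float) -> float:
    """Convert a percent change (e.g. +328.4) into a multiplicative factor (about 4.28)."""
    return 1.0 + relative_change / 100.0
```

For instance, a class that made up a hypothetical 10% of the training data and 42.84% of the designed set would show a relative change of +328.4%.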
Machine learning-aided design is most successful when features capture the salient information of the underlying predicted classes. Thus, the success of our design method for AgN-DNA in specific color bands supports the hypothesis that certain DNA base motifs within a sequence select for cluster size and, therefore, color. Using these motifs and our trained SVM model, we demonstrate much greater control over the colors of silver clusters stabilized by DNA strands. In particular, we successfully bias color distributions to the sparser regions of our training data sets, the Green and Very Red classes, even though the training data contained many fewer examples of these sequences. This success is most obvious in Figure 3B, where design for Very Red sequences resulted in many AgN-DNA in the near-infrared spectral regions, despite few examples of these emitters present in the initial training data. Design success for the Very Red Class is substantially higher than for Green. This reflects the ~3 times larger experimental variability in spectra of sequences stabilizing “greener” clusters (Figure S2, Table S1), as well as the smaller number of Green training sequences available for learning (Figure 1B). Analysis of our results for the Green Class shows that most designed Green sequences do produce the right color but with low fluorescence intensities: only
14% of designed Green sequences produced a bright, dominant green peak, but 47% stabilized dominant green products below the “bright” fluorescence intensity threshold. These “green-not-bright” sequences point to commonalities between DNA templates for smaller, shorter wavelength silver clusters and Dark templates. In particular, Dark motifs may be necessary to truncate silver cluster growth, so that green AgN-DNA require a delicate balance of Green and Dark motifs.
The role of stoichiometry during synthesis. Besides sequence, stoichiometry during AgN-DNA synthesis could affect cluster color and product yields. To test sensitivity to stoichiometry, we synthesized clusters on templates designed for Green and Very Red Classes using a higher ratio of [Ag+]/[DNA base], resulting in fewer bright spectral peaks overall for both classes (Figure S3, also shown previously23). The color distribution of designed Green templates showed a slight shift to longer wavelengths, while the color distribution for Very Red templates developed a distinct, longer wavelength sub-peak that may be associated with a larger magic number cluster size. Thus, while sequence is the primary determinant of AgN-DNA color,8 synthesis conditions can further optimize yield of a desired AgN-DNA on a given template.
The role of DNA base sequence in color selection. Next, we examine the FS-identified motifs to understand how these base motifs engage in color selection. (Because FS identifies the information that is most relevant for machine learning classification, these FS-identified motifs are likely to be the crucial determinants of AgN-DNA color.) For simplicity, we begin with single-base composition in FS-identified motifs before moving to multi-base patterns in FS-identified motifs. Figure 4 shows the average single-base composition extracted from FS motifs.
(We note that the random subsampling used to balance class sizes during motif mining, training, and FS results in slightly different lists of FS-identified motifs after each run; thus, Figures 4, 5,
and S4 show average base pattern counts, not the FS-identified motifs themselves.) The distinctive patterns of motif base composition for the four color classes can be related to previous findings on Ag+-mediated DNA pairings of homobase37 and mixed-base strands.38 Studies of Ag+-mediated DNA pairings are relevant to AgN-DNA because AgN-DNA form by partial reduction of such Ag+-DNA products and are known to incorporate both neutral and cationic silver.9,23 Our results for the different color classes indicate that the Ag+-base interactions are imprinted into cluster structure. Dark motifs are much richer in thymine (T) than any of the color classes, agreeing with the very weak T-Ag+ association: recruitment of silver cations is a prerequisite for cluster formation.37 Guanine (G)-rich motifs are most prevalent for Very Red clusters, while cytosine (C)-rich motifs are correlated with all bright color classes and dominate in the Red class. Both C and G exhibit strong Ag+-mediated homobase pairings,37,38 recruiting abundant silver cations that may then be readily reduced to form a silver cluster. Furthermore, studies of C and G strands with single base mutations found that while the numbers of Ag+ that are bound to C-rich strands vary slightly upon base mutation, G-rich strands show a large (~80%) increase in Ag+ association.38 This presence of more Ag+ on the DNA prior to reduction may explain why G-rich motifs correlate more strongly to longer wavelength clusters. Finally, adenine (A)-rich motifs are particularly prominent in the Green Class. A-homobase strands associate Ag+ much less than for C- and G-homobase strands, yet more than for T-homobase strands.37 Thus, A-rich motifs may provide just enough Ag+ to form a small, green cluster without fostering growth of larger, redder clusters.
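The single-base analysis above rests on tallying how often each base appears in a class's selected motifs. The sketch below shows one plausible way to compute such a composition, pooling all bases across a motif list. The exact averaging used for Figure 4 is not specified in this section, so this is an assumption, and the example motifs in the usage note are invented.

```python
from collections import Counter
from typing import Dict, List


def base_composition(motifs: List[str]) -> Dict[str, float]:
    """Fraction of A, C, G, and T among all bases pooled over a non-empty motif list."""
    counts = Counter(base for motif in motifs for base in motif)
    total = sum(counts[base] for base in "ACGT")
    return {base: counts[base] / total for base in "ACGT"}
```

Applied to a hypothetical thymine-rich motif list such as `["TT", "TTA", "AT"]`, this yields a T fraction of 5/7, the kind of skew the text reports for the Dark class.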
Figure 4. Average base content of motifs selected by FS for Dark, Green, Red, and Very Red Classes. Error bars represent standard deviations calculated from ten FS trials (FS balances class size by subsampling, resulting in some variability in selected motif lists between trials).
Figure 5. Bottom graph: Average counts of 2-base patterns in motifs selected by FS for Dark (grey bars), Green (green bars), Red (red bars), and Very Red (dark red bars) Classes. Error bars represent standard deviations calculated from ten FS trials, as for Figure 4. 2-base patterns are sorted left-to-right in decreasing order of the standard deviation of the average counts for all four color classes per 2-base pattern (top graph), defined to be a measure of color “selectivity” (see main text).
Next, we investigate multi-base patterns in FS-identified motifs for each color class. Figure 5 displays average counts of all 2-base patterns (see Figure S4 for 3-base patterns). To aid analysis, the “selectivity” of each multi-base pattern is defined to be the standard deviation of average counts for the four color classes (that is, the standard deviation of the four bar heights per base pattern).
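The selectivity measure defined above can be illustrated with a minimal Python sketch. The motif lists are hypothetical toy data, not the FS-identified motifs; counting overlapping occurrences, averaging per motif, and using the population standard deviation are assumptions made for illustration, since these details are given only in the Supporting Information.

```python
from collections import Counter
from itertools import product
from statistics import pstdev

BASES = "ACGT"
PATTERNS = ["".join(p) for p in product(BASES, repeat=2)]  # all 16 2-base patterns


def average_pattern_counts(motifs):
    """Mean number of overlapping occurrences of each 2-base pattern per motif."""
    totals = Counter()
    for motif in motifs:
        for i in range(len(motif) - 1):
            totals[motif[i:i + 2]] += 1
    return {p: totals[p] / len(motifs) for p in PATTERNS}


def selectivity(class_motifs):
    """Standard deviation, across color classes, of each pattern's average count."""
    averages = {c: average_pattern_counts(m) for c, m in class_motifs.items()}
    # Population standard deviation is an assumption; the text does not specify.
    return {p: pstdev([averages[c][p] for c in class_motifs]) for p in PATTERNS}


# Hypothetical toy motif lists, NOT the motifs identified in this work.
toy_motifs = {
    "Dark": ["TTT", "TTA"],
    "Green": ["AAT", "TCA"],
    "Red": ["CCC", "CCG"],
    "Very Red": ["GGC", "GGG"],
}

sel = selectivity(toy_motifs)
# Sort patterns in decreasing order of selectivity, as in Figure 5.
ranked = sorted(PATTERNS, key=lambda p: sel[p], reverse=True)
print(ranked[:3])
```

A pattern that appears equally often in every class has zero selectivity under this definition, whereas a pattern concentrated in one class scores highest.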
Base patterns are sorted left-to-right in decreasing order of selectivity. Base patterns with the greatest variance in occurrence across color classes have the highest selectivity and are more strongly correlated with some color classes than with others. For instance, CC appears frequently in FS-identified motifs for all three fluorescent classes yet very infrequently in Dark motifs. GC and CG both favor Red and Very Red clusters, while GG more strongly favors Very Red. TT very strongly favors Dark, and several A-rich base patterns favor Green. Notably, the relative order of bases within a motif can influence color selectivity, e.g. TC is selective of Green, while CT is not, pointing to the importance of considering base motifs as opposed to simple base content when developing predictive methods for AgN-DNA color. Figure S4, displaying 3-base patterns, further demonstrates the complex connection between DNA base patterns and AgN-DNA color.
The challenge of imperfect data for materials informatics
One of the grand challenges that faces researchers who use tools from data science to study and design materials is “imperfect data,” that is, data that includes noise and uncertainty and may even come from multiple laboratories and multiple experimental procedures.39 Here, we approach this challenge by: (1) considering data taken in the same laboratory using the same experimental procedure, (2) quantifying experimental uncertainty (Figure S2) and taking into account this uncertainty during analysis, and (3) accounting for variations in class size through rigorous sub-sampling during both motif mining and training of machine learning algorithms. The success of this method suggests that, with care, experimental data libraries can be of great use for nanomaterials discovery.
CONCLUSION
We develop a data-driven method to design DNA templates that select silver clusters of certain sizes, and therefore fluorescence colors, by learning from a large experimental training data set. This method first discovers the DNA base motifs that are most discriminative of color, using a combination of motif mining and feature selection. Then, a set of trained pairwise classifiers identifies which new candidate sequences, built from the discovered motifs, are predicted to fall within a selected color class. Our method improves selectivity for longer-wavelength AgN-DNA, at the boundary of the visible and near-infrared spectrum, by 330% and for short-wavelength, “green” AgN-DNA by 70%. In addition to demonstrating successful design of DNA templates that discriminate for cluster size differences of just a few silver atoms, we also explore why the discovered base motifs are selective of color. This work shows the power of combining new advances in high-throughput experiments with data informatics to achieve molecular design of materials systems.
METHODS
Synthesis and characterization. Synthesis of DNA-stabilized silver clusters was performed with a pipetting robot on a set of 1432 DNA sequences in 384-well microplates, about half of which were randomly generated23 and half of which were previously designed to be brightly fluorescent, regardless of color.17 Emission spectra were measured from 450 nm to 800 nm with a Tecan Infinite M200 Pro plate reader. Experimentally measured peak wavelengths and peak areas for training sequences are provided in Table S3.
Data classification. Training sequences are categorized into color classes by peak wavelengths and integrated peak areas. A given sequence is “bright” if at least one peak area is greater than a threshold value that is sufficiently above the detector noise level and normalized across plates using a control DNA sequence. Only sequences associated with “bright” peaks falling into one
color class are used for training; sequences associated with multiple color classes are excluded. A fourth “Dark” class includes sequences with total integrated intensity less than a given normalized value. Classification is described in detail in Supporting Information.
Motif mining. The four color classes were used to build balanced multi-class SVMs based on motif features learned from the training data set. To recognize color-predictive motifs, we employed MERCI,24 which mines a set of motifs common in a positive class and uncommon in a negative class (specific parameters listed in Supporting Information).
Feature selection and additional features. An FS algorithm within Weka32 was used to reduce the MERCI-selected motifs to ~120 from an initial set of ~1600, thereby reducing feature vector dimensionality. FS was performed ten times for each pair of classes (multiple FS runs are used to account for subsampling needed to balance class size). Then, selected motifs were ranked by the number of times each motif was identified, and the top ~120 motifs were selected to generate feature vectors (details in Supporting Information). In addition, feature vectors contained counts of all 2- and 3-base motifs, class-specific likelihoods of base-base transitions, and total counts of class-specific motif content per sequence, for a total of 88 additional aggregate features (example in Supporting Information).
Classification using support vector machines. A multi-class SVM was trained on the class-labeled training data represented by the feature vectors described above. To classify a sequence, a single pairwise SVM predicts the most likely of two classes and an associated probability for the most likely class. The multi-class SVM combines classifiers for each possible pair.
Sequence generation. Sequences designed for a specific color class were generated using the list of class-specific FS-identified motifs.
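The ranking step described under “Feature selection and additional features” can be sketched in a few lines of Python. This is a minimal illustration, not the Weka-based implementation used in this work: the FS run outputs are hypothetical toy lists, alphabetical tie-breaking is an assumption, and the binary presence/absence feature vector is an assumed encoding (the actual feature vector construction is described in the Supporting Information).

```python
from collections import Counter


def rank_motifs_by_selection_frequency(fs_runs, top_k):
    """Rank motifs by how many FS runs selected them and keep the top_k."""
    counts = Counter()
    for selected in fs_runs:
        counts.update(set(selected))  # count each motif at most once per run
    # Descending by count; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [motif for motif, _ in ranked[:top_k]]


def motif_presence_features(sequence, motifs):
    """Binary feature vector: 1 if the motif occurs in the sequence, else 0.

    Presence/absence encoding is an assumption made for illustration.
    """
    return [1 if motif in sequence else 0 for motif in motifs]


# Hypothetical outputs of three FS runs (the text describes ten runs per class pair).
toy_fs_runs = [
    ["CCG", "GG", "TT"],
    ["CCG", "GG"],
    ["CCG", "AA"],
]

top_motifs = rank_motifs_by_selection_frequency(toy_fs_runs, top_k=2)
features = motif_presence_features("ACCGGTTACA", top_motifs)
print(top_motifs, features)
```

Ranking by selection frequency favors motifs that are chosen consistently regardless of which subsample was drawn to balance class sizes.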
To build a sequence, motifs were sampled from a probability distribution that is based on motif frequency in a given color class and motif position
within the sequences in the training data (details in Supporting Information). Each sampled motif was placed in the sequence or rejected if incompatible with already placed motifs, until at most 3 unspecified bases remained, which were then filled randomly. Sequences were constrained to have 2 to 5 position-specific base mutations from all training sequences.
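The placement loop described under “Sequence generation” can be sketched as follows. This is a hedged illustration under stated assumptions: the `(motif, start)` placements and their weights are hypothetical stand-ins for the frequency- and position-based probability distribution detailed in the Supporting Information, the default sequence length of 10 and the `max_tries` cap are assumptions, and the constraint on base mutations relative to training sequences is not included.

```python
import random

BASES = "ACGT"


def compatible(template, motif, start):
    """True if the motif fits at `start` without contradicting placed bases."""
    if start < 0 or start + len(motif) > len(template):
        return False
    return all(template[start + i] in (None, base) for i, base in enumerate(motif))


def generate_sequence(weighted_placements, length=10, max_unspecified=3,
                      rng=None, max_tries=1000):
    """Place sampled motifs until at most `max_unspecified` bases remain unset.

    `weighted_placements` is a list of ((motif, start), weight) pairs; this
    data layout is hypothetical. If `max_tries` is exhausted first, all
    remaining positions are filled randomly.
    """
    rng = rng or random.Random()
    placements = [placement for placement, _ in weighted_placements]
    weights = [weight for _, weight in weighted_placements]
    template = [None] * length
    for _ in range(max_tries):
        if template.count(None) <= max_unspecified:
            break
        motif, start = rng.choices(placements, weights=weights, k=1)[0]
        if compatible(template, motif, start):
            template[start:start + len(motif)] = motif  # accept the motif
        # otherwise the sampled motif is rejected and a new one is drawn
    # Fill any still-unspecified positions with random bases.
    return "".join(b if b is not None else rng.choice(BASES) for b in template)


# Hypothetical class-specific placements and weights, for illustration only.
toy_placements = [
    (("CCG", 0), 3.0),
    (("GG", 2), 2.0),
    (("GGC", 3), 1.0),
    (("AA", 8), 1.0),
    (("TT", 0), 0.5),
]

designed = generate_sequence(toy_placements, rng=random.Random(0))
print(designed)
```

Rejecting incompatible motifs rather than overwriting placed bases keeps every accepted motif intact in the final sequence.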
SUPPORTING INFORMATION
The Supporting Information is available free of charge on the ACS Publications website: High-throughput experimental details; details on spectral fitting and color class definitions; analysis of experimental reproducibility; details of motif mining, feature vectors, and feature selection; sequence generation details; effects of stoichiometry on synthesis; 3-base patterns in feature selection-identified motifs; DNA sequence libraries.
ACKNOWLEDGEMENTS
The authors thank H. Nicholson and A. Chiu for insights into interpretation of classification accuracy and M. Radeke for use of a pipetting robot acquired with support from NIH-NEI 5R24-EY14799. The authors acknowledge use of UCSB’s Biological Nanostructures Laboratory within the California NanoSystems Institute. This work was performed, in part, at the Center for Integrated Nanotechnologies, an Office of Science User Facility operated for the U.S. Department of Energy (DOE) Office of Science. Los Alamos National Laboratory, an affirmative action equal opportunity employer, is operated by Los Alamos National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. S.M.C. acknowledges support from NSF-DGE-1144085. This work was supported by NSF-DMR-1309410.
The authors declare no competing interests.
REFERENCES: (1)
Linko, V.; Dietz, H. The Enabled State of DNA Nanotechnology. Curr. Opin. Biotechnol. 2013, 24, 555–561.
(2)
Zhang, F.; Nangreave, J.; Liu, Y.; Yan, H. Structural DNA Nanotechnology: State of the Art and Future Perspective. J. Am. Chem. Soc. 2014, 136, 11198–11211.
(3)
Kuzyk, A.; Schreiber, R.; Fan, Z.; Pardatscher, G.; Roller, E.-M.; Högele, A.; Simmel, F. C.; Govorov, A. O.; Liedl, T. DNA-Based Self-Assembly of Chiral Plasmonic Nanostructures with Tailored Optical Response. Nature 2012, 483, 311–314.
(4)
Marras, A. E.; Zhou, L.; Su, H.-J.; Castro, C. E. Programmable Motion of DNA Origami Mechanisms. Proc. Natl. Acad. Sci. 2015, 112, 713–718.
(5)
Chakraborty, S.; Babanova, S.; Rocha, R. C.; Desireddy, A.; Artyushkova, K.; Boncella, A. E.; Atanassov, P.; Martinez, J. S. A Hybrid DNA-Templated Gold Nanocluster for Enhanced Enzymatic Reduction of Oxygen. J. Am. Chem. Soc. 2015, 137, 11678–11687.
(6)
Gwinn, E.; Schultz, D.; Copp, S.; Swasey, S. DNA-Protected Silver Clusters for Nanophotonics. Nanomaterials 2015, 5, 180–207.
(7)
Jia, X.; Li, J.; Han, L.; Ren, J.; Yang, X.; Wang, E. DNA-Hosted Copper Nanoclusters for Fluorescent Identification of Single Nucleotide Polymorphisms. ACS Nano 2012, 6, 3311–3317.
(8)
Gwinn, E. G.; O’Neill, P.; Guerrero, A. J.; Bouwmeester, D.; Fygenson, D. K. Sequence-Dependent Fluorescence of DNA-Hosted Silver Nanoclusters. Adv. Mater. 2008, 20, 279–283.
(9)
Schultz, D.; Gardner, K.; Oemrawsingh, S. S. R.; Markešević, N.; Olsson, K.; Debord, M.; Bouwmeester, D.; Gwinn, E. Evidence for Rod-Shaped DNA-Stabilized Silver Nanocluster Emitters. Adv. Mater. 2013, 25, 2797–2803.
(10)
Petty, J. T.; Sergev, O. O.; Kantor, A. G.; Rankine, I.; Ganguly, M.; David, F.; Wheeler, S.; Wheeler, J. F. Ten-Atom Silver Cluster Signaling and Tempering DNA Hybridization. Anal. Chem. 2015, 87, 5302–5309.
(11)
Yeh, H.-C.; Sharma, J.; Shih, I.-M.; Vu, D. M.; Martinez, J. S.; Werner, J. H. A Fluorescence Light-up Ag Nanocluster Probe That Discriminates Single-Nucleotide Variants by Emission Color. J. Am. Chem. Soc. 2012, 134, 11550–11558.
(12)
Shah, P.; Rørvig-Lund, A.; Chaabane, S. Ben; Thulstrup, P. W.; Kjaergaard, H. G.; Fron, E.; Hofkens, J.; Yang, S. W.; Vosch, T. Design Aspects of Bright Red Emissive Silver Nanoclusters/DNA Probes for MicroRNA Detection. ACS Nano 2012, 6, 8803–8814.
(13)
Liu, J. DNA-Stabilized, Fluorescent, Metal Nanoclusters for Biosensor Development. TrAC - Trends Anal. Chem. 2014, 59, 99–111.
(14)
Jain, A.; Ong, S. P.; Hautier, G.; Chen, W.; Richards, W. D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; Persson, K. A. Commentary: The Materials Project: A Materials Genome Approach to Accelerating Materials Innovation. APL Mater. 2013, 1, 011002.
(15)
Lookman, T.; Alexander, F. J.; Bishop, A. R. Perspective: Codesign for Materials Science: An Optimal Learning Approach. APL Mater. 2016, 4, 053501.
(16)
Pilania, G.; Wang, C.; Jiang, X.; Rajasekaran, S.; Ramprasad, R. Accelerating Materials Property Predictions Using Machine Learning. Sci. Rep. 2013, 3, 1–6.
(17)
Copp, S. M.; Bogdanov, P.; Debord, M.; Singh, A.; Gwinn, E. Base Motif Recognition and Design of DNA Templates for Fluorescent Silver Clusters by Machine Learning. Adv. Mater. 2014, 26, 5839–5845.
(18)
Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big Data of Materials Science: Critical Role of the Descriptor. Phys. Rev. Lett. 2015, 114, 105503.
(19)
Xue, D.; Balachandran, P. V.; Hogden, J.; Theiler, J.; Xue, D.; Lookman, T. Accelerated Search for Materials with Targeted Properties by Adaptive Design. Nat. Commun. 2016, 7, 1–9.
(20)
Mueller, T.; Kusne, A. G.; Ramprasad, R. Machine Learning in Materials Science: Recent Progress and Emerging Applications. In Rev. Comput. Chem.; 2016; Vol. 29, pp 186–273.
(21)
Yuan, R.; Liu, Z.; Balachandran, P. V.; Xue, D.; Zhou, Y.; Ding, X.; Sun, J.; Xue, D.; Lookman, T. Accelerated Discovery of Large Electrostrains in BaTiO3-Based Piezoelectrics Using Active Learning. Adv. Mater. 2018, 1702884.
(22)
Lee, E. Y.; Fulan, B. M.; Wong, G. C. L.; Ferguson, A. L. Mapping Membrane Activity in Undiscovered Peptide Sequence Space Using Machine Learning. Proc. Natl. Acad. Sci. 2016, 113, 13588–13593.
(23)
Copp, S. M.; Schultz, D.; Swasey, S.; Pavlovich, J.; Debord, M.; Chiu, A.; Olsson, K.; Gwinn, E. Magic Numbers in DNA-Stabilized Fluorescent Silver Clusters Lead to Magic Colors. J. Phys. Chem. Lett. 2014, 5, 959–963.
(24)
Vens, C.; Rosso, M.-N.; Danchin, E. G. J. Identifying Discriminative Classification-Based Motifs in Biological Sequences. Bioinformatics 2011, 27, 1231–1238.
(25)
Schultz, D.; Gwinn, E. Stabilization of Fluorescent Silver Clusters by RNA Homopolymers and Their DNA Analogs: C,G versus A,T(U) Dichotomy. Chem. Commun. (Cambridge, U. K.) 2011, 47, 4715–4717.
(26)
Copp, S. M.; Faris, A.; Swasey, S. M.; Gwinn, E. G. Heterogeneous Solvatochromism of Fluorescent DNA-Stabilized Silver Clusters Precludes Use of Simple Onsager-Based Stokes Shift Models. J. Phys. Chem. Lett. 2016, 7, 698–703.
(27)
He, H.; Garcia, E. A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
(28)
Japkowicz, N.; Stephen, S. The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 2002, 6, 429–449.
(29)
Akbani, R.; Kwek, S.; Japkowicz, N. Applying Support Vector Machines to Imbalanced Datasets. In European Conference on Machine Learning; Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D., Eds.; Springer, Berlin, Heidelberg, 2004; pp 39–50.
(30)
Theodoridis, S.; Koutroumbas, K. Pattern Recognition, 4th ed.; Academic Press: Cambridge, 2008.
(31)
Krawczyk, B. Learning from Imbalanced Data: Open Challenges and Future Directions. Prog. Artif. Intell. 2016, 5, 221–232.
(32)
Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The WEKA Data Mining Software. ACM SIGKDD Explor. Newsl. 2009, 11, 10.
(33)
Chang, C.-C.; Lin, C.-J. LIBSVM. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
(34)
Lubbers, N.; Lookman, T.; Barros, K. Inferring Low-Dimensional Microstructure Representations Using Convolutional Neural Networks. Phys. Rev. E 2017, 96, 052111.
(35)
Bishop, C. M. Pattern Recognition and Machine Learning; 2006; Vol. 4.
(36)
Hall, M. A.; Smith, L. A. Practical Feature Subset Selection for Machine Learning. In Proceedings of the 21st Australasian Computer Science Conference ACSC’98; Springer, Berlin, Heidelberg: Perth, 1998; pp 181–191.
(37)
Swasey, S. M.; Leal, L. E.; Lopez-Acevedo, O.; Pavlovich, J.; Gwinn, E. G. Silver (I) as DNA Glue: Ag(+)-Mediated Guanine Pairing Revealed by Removing Watson-Crick Constraints. Sci. Rep. 2015, 5, 10163.
(38)
Swasey, S. M.; Gwinn, E. G. Silver-Mediated Base Pairings: Towards Dynamic DNA Nanostructures with Enhanced Chemical and Thermal Stability. New J. Phys. 2016, 18, 045008.
(39)
Sun, B.; Fernandez, M.; Barnard, A. S. Statistics, Damned Statistics and Nanoscience – Using Data Science to Meet the Challenge of Nanomaterial Complexity. Nanoscale Horiz. 2016, 1, 89–95.