The Resistome: A Comprehensive Database of Escherichia coli Resistance Phenotypes


Article

The Resistome: A Comprehensive Database of Escherichia coli Resistance Phenotypes James D. Winkler, Andrea L. Halweg-Edwards, Keesha E Erickson, Alaksh Choudhury, Gur Pines, and Ryan T. Gill ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.6b00150 • Publication Date (Web): 20 Jul 2016 Downloaded from http://pubs.acs.org on July 27, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ACS Synthetic Biology

The Resistome: A Comprehensive Database of Escherichia coli Resistance Phenotypes James D. Winkler,∗,†,‡ Andrea L. Halweg-Edwards,†,¶ Keesha E. Erickson,† Alaksh Choudhury,† Gur Pines,† and Ryan T. Gill† Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO USA E-mail: [email protected]

Abstract

The microbial ability to resist stressful environmental conditions and chemical inhibitors is of great industrial and medical interest. Much of the data related to mutation-based stress resistance, however, is scattered through the academic literature, making it difficult to apply systematic analyses to this wealth of information. To address this issue, we introduce the Resistome database: a literature-curated collection of Escherichia coli genotypes-phenotypes containing over 5,000 mutants that resist hundreds of compounds and environmental conditions. We use the Resistome to understand our current state of knowledge regarding resistance and detect potential synergy or antagonism between resistance phenotypes. Our dataset represents one of the most comprehensive collections of genomic data related to resistance currently available. Future development will focus on the construction of a combined genomic-transcriptomic-proteomic framework for understanding E. coli's resistance biology. The Resistome can be downloaded at https://bitbucket.org/jdwinkler/resistome_release/overview.

∗ To whom correspondence should be addressed
† Department of Chemical and Biological Engineering, University of Colorado Boulder, Boulder, CO USA
‡ Present Address: Shell Biodomain, 3333 Texas 6, Houston TX USA
¶ Present Address: Muse Biotechnologies, 4001 Discovery Drive, Boulder CO USA

Keywords synthetic biology, standardization, adaptive evolution, genomics

Introduction

Genetic mechanisms of inhibitor resistance are of great medical and industrial interest, due to their respective impacts on human and microbial health. The rise in antibiotic-resistant infections is especially troubling due to the rapid spread of drug resistance amongst clinically important microbes, a phenomenon that has rapidly become one of the great health challenges of the 21st century (1, 2). New therapeutic approaches or compounds are therefore required to treat these resistant microbes. Similarly, the stability of biocatalysts used industrially for fermentation, wastewater purification, and other processes in the face of adverse environments or chemical inhibitors has significant economic implications (3). Significant research effort has therefore been devoted to understanding the physiological mechanisms for resistance phenotypes under a range of industrially important conditions (4–9). Underlying both areas of research is the desire to improve our knowledge of microbial systems biology so that we can predict how resistance can arise from a given genetic context, and how we can take advantage of antagonistic or synergistic trait interactions to manipulate evolutionary outcomes (10).

Adaptive laboratory evolution (11), library screening (6, 12, 13), and other high-throughput experiments coupled with genomic sequencing have enabled researchers to generate vast amounts of data relating phenotypes to known or discovered genotypes that are difficult to interpret and apply experimentally (14). Adding to this big data challenge is the variety of formats and venues used to disseminate data describing resistance phenotypes, especially in older, lower-throughput studies of a specific trait of interest. Since the era of omics biology has arrived in full force, as predicted (15, 16), there have been numerous efforts to devise ways to organize these data (17–22). However, these resources are not principally designed to enable analysis of mutationally acquired resistance phenotypes. Improved integration and annotation are critical to extract useful inferences concerning resistance and the subsequent design of experimental tests, necessitating the development of a database focused on analyzing genotype-phenotype relationships.

This work introduces the Resistome database as a platform for understanding genetically determined E. coli traits. After reviewing the database format, data sources, and software pipeline used for analysis, we use the Resistome to quantitatively examine trends in the study of resistance phenotypes. An assessment of significant phenotype interactions and their preliminary implications for designing combination drug therapies and engineering multi-resistant industrial strains then follows, with the aim of developing experimental tests that may uncover exploitable interactions. We also present machine learning-based tools to help researchers contextualize their experimental results and examine relationships with prior studies, while also distinguishing between general and specific adaptive mutations. Finally, possible future extensions of the database to explore transcriptomic and proteomic aspects of resistance are discussed.

Methods and Materials

Database Sources and Implementation

The Resistome database is composed of flat key-value records, where each record represents a single study (one peer-reviewed publication) containing one or more mutants of a given E. coli species. Papers were obtained by searching databases such as Google Scholar and by exhaustive extraction from particular journals, such as Antimicrobial Agents and Chemotherapy, that are frequent publishing venues for resistance-related studies. Every mutant has at least one mutated gene. Each level of specification (paper, mutant, mutation) contains metadata concerning experimental conditions, project and engineering methodology, and putative effects of each mutation.

The software used for reading and analyzing the Resistome is currently written in Python 2.7. Statistical analysis was performed with numpy 1.9.2 and scipy 0.16.01b, while network analysis relied on the NetworkX 1.9 package. Matplotlib 1.9 was used for data and network visualization. Gene ontologies were obtained from the Gene Ontology Consortium (23, 24), and a list of E. coli essential genes was obtained from EcoGene (25). Protein data were obtained from PDB (26) and visualized using PyMol 1.8. The Resistome software and data can be downloaded from https://bitbucket.org/jdwinkler/resistome_release/overview.
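The three levels of specification (paper, mutant, mutation) can be pictured with a small sketch. Every field name and value below is a hypothetical placeholder chosen for illustration; the actual Resistome record format may differ. The sketch flattens a nested record into dotted key-value pairs and checks the stated rule that every mutant carries at least one mutated gene.

```python
# Hypothetical study record: paper -> mutants -> mutations.
# Field names and values are illustrative assumptions, not the Resistome schema.
record = {
    "title": "Example study",
    "year": 2014,
    "methods": ["adaptive evolution"],
    "mutants": [
        {
            "strain": "MG1655",
            "phenotypes": ["butanol resistance"],
            "mutations": [
                {"gene": "geneA", "type": "snp", "effect": "unknown"},
                {"gene": "geneB", "type": "deletion", "effect": "unknown"},
            ],
        },
    ],
}


def flatten(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested record."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), node


def validate(study):
    """Raise ValueError if any mutant lacks a mutated gene; return mutant count."""
    for position, mutant in enumerate(study["mutants"]):
        if not mutant["mutations"]:
            raise ValueError(f"mutant {position} has no mutated gene")
    return len(study["mutants"])


flat = dict(flatten(record))
print(flat["mutants.0.mutations.1.gene"])
print(validate(record))
```

Keeping the nested form for editing and the flat form for storage is only one possible design; the point is that each metadata item stays addressable by a single key.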

ChemGen Integration

In addition to the data gathered here, several large chemical genomics databases have been developed for E. coli in the past. ChemGen (27) is one of the most comprehensive data sources (324 conditions x 3,979 non-essential gene knockout mutants) available for assessing the phenotypic response of the Keio collection mutants to over three hundred chemical inhibitors or environmental conditions that reduce strain growth. If only gene-inhibitor interactions with at least 2.5-fold fitness changes are considered to be significant, the total number of strains within ChemGen passing this threshold (3,489) is approximately 60% of the size of the Resistome. Using this fitness threshold, the filtered data were converted into a Resistome-compatible format and can be included in calculations using a simple toggle. The analyses presented here do not include ChemGen data.
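The thresholding and toggle described above might look like the following sketch. It assumes that each gene-condition score is a fitness ratio relative to wild type and that a 2.5-fold change in either direction counts as significant; the function names, record fields, and example values are hypothetical and are not taken from the ChemGen or Resistome code.

```python
FOLD_THRESHOLD = 2.5  # threshold quoted in the text


def filter_interactions(scores, threshold=FOLD_THRESHOLD):
    """Keep gene-condition pairs whose fitness changed at least `threshold`-fold.

    `scores` maps (gene, condition) to an assumed fitness ratio versus wild type.
    """
    kept = []
    for (gene, condition), fold in sorted(scores.items()):
        if fold >= threshold:
            effect = "resistant"
        elif fold <= 1.0 / threshold:
            effect = "sensitive"
        else:
            continue  # change too small to be considered significant
        kept.append({"gene": gene, "condition": condition,
                     "effect": effect, "source": "ChemGen"})
    return kept


def load_records(resistome, chemgen, include_chemgen=False):
    """Return curated records, optionally extended with filtered ChemGen data."""
    return list(resistome) + (list(chemgen) if include_chemgen else [])


# Made-up example values, for illustration only.
scores = {
    ("geneA", "condition 1"): 0.2,
    ("geneB", "condition 1"): 1.1,
    ("geneC", "condition 2"): 3.0,
}
chemgen_records = filter_interactions(scores)
curated = [{"gene": "geneD", "condition": "condition 3",
            "effect": "resistant", "source": "curated"}]
print(len(load_records(curated, chemgen_records)))
print(len(load_records(curated, chemgen_records, include_chemgen=True)))
```

Defaulting the toggle to off mirrors the statement that the analyses presented here exclude ChemGen data.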

Input Standardization

Gene names were represented using official accession numbers obtained from Biocyc (19) using a name matching script. Briefly, the E. coli Biocyc databases were converted into local PostgreSQL databases using a custom script, and then searched to find standardized equivalents for every gene mutated in the Resistome. While the majority of the mutants deposited in the Resistome are descendants of K-12, mutations in other E. coli lineages (B, W, etc.) are also present in the database. If these mutations are in identically named orthologs, then they are converted to the standardized b-number representation used for E. coli K-12; otherwise, they are represented using their unique locus ID if a match can be found in our Biocyc name matching pipeline. Mutation annotations (locations, base changes, and so forth) are extracted directly from the curated studies. Approximately 98% of Resistome genes were converted into their standardized accession numbers (b-numbers).

Phenotype names were standardized using manual curation; for example, 1-butanol, n-butanol, and butanol resistance are all converted into butanol resistance. Each phenotype was manually associated with tags describing the phenotype classification (antibiotic resistance, solvents or biofuels, and so forth), and a simple ontology describing the relationship of the various resistance phenotypes was developed to enable class-to-class (antibiotic versus solvent resistance, for example) comparison of resistance patterns.
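As a rough illustration of the two standardization steps, the sketch below replaces the Biocyc-derived database with small in-memory dictionaries. The function names, the fallback behavior, and the table contents are assumptions made for this example; only the butanol synonyms come from the text.

```python
# Illustrative lookup tables standing in for the local Biocyc-derived database.
GENE_TO_BNUMBER = {"thra": "b0002", "lacz": "b0344"}

# Synonyms from the example in the text; the class tag is an assumed label.
PHENOTYPE_SYNONYMS = {"1-butanol": "butanol", "n-butanol": "butanol"}
PHENOTYPE_CLASS = {"butanol": "solvents or biofuels"}


def standardize_gene(name, locus_id=None):
    """Return the K-12 b-number for a gene name, else the supplied locus ID."""
    return GENE_TO_BNUMBER.get(name.strip().lower(), locus_id)


def standardize_phenotype(name):
    """Map a free-text phenotype onto a (standard name, class tag) pair."""
    compound = name.strip().lower()
    suffix = " resistance"
    if compound.endswith(suffix):
        compound = compound[: -len(suffix)]
    compound = PHENOTYPE_SYNONYMS.get(compound, compound)
    return compound + suffix, PHENOTYPE_CLASS.get(compound)


print(standardize_gene("ThrA"))
print(standardize_phenotype("n-butanol resistance"))
```

A gene without a K-12 match simply falls through to whatever locus ID the caller supplies, which loosely mirrors the fallback described for non-K-12 lineages.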

Network Analysis

Several network types were constructed for analyzing interactions within and between Resistome strains. A gene-gene interaction network was modeled as an undirected network G(V, E), where genes were vertices in V, and an edge e = (u, v) in E connected two vertices if they were found in the same mutated strain. Each edge e was weighted according to how frequently that particular edge was found in the database. An undirected bipartite network G(V, P, E) consisting of genes in V, phenotypes in P, and edges connecting only genes and phenotypes was constructed as well. In this network, gene nodes are linked to phenotypes only if a given gene is mutated in a strain possessing that resistance phenotype; edge weights are computed as above. Community detection for phenotype overlap analysis was performed using igraph 0.1.11 with the algorithm specified in the text.
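The edge-weighting rule for both networks can be sketched with the standard library alone. The input layout (a list of gene-list and phenotype-list pairs per strain) and the placeholder names are assumptions made for this example rather than the Resistome's own data structures.

```python
from collections import Counter
from itertools import combinations


def gene_gene_edges(strains):
    """Weight each undirected gene pair by the number of strains containing both."""
    edges = Counter()
    for genes, _phenotypes in strains:
        for u, v in combinations(sorted(set(genes)), 2):
            edges[(u, v)] += 1
    return edges


def gene_phenotype_edges(strains):
    """Weight each gene-phenotype pair by the number of strains linking them."""
    edges = Counter()
    for genes, phenotypes in strains:
        for gene in set(genes):
            for phenotype in set(phenotypes):
                edges[(gene, phenotype)] += 1
    return edges


# Placeholder strains: (mutated genes, resistance phenotypes).
strains = [
    (["geneA", "geneB", "geneC"], ["drug X"]),
    (["geneA", "geneB"], ["drug X", "solvent Y"]),
    (["geneC"], ["solvent Y"]),
]
gg = gene_gene_edges(strains)
gp = gene_phenotype_edges(strains)
print(gg[("geneA", "geneB")], gp[("geneC", "solvent Y")])
```

Sorting each gene pair gives every undirected edge a single canonical key. The resulting weights could then be loaded into a graph library, for instance with NetworkX's `Graph.add_edge(u, v, weight=w)`.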


Statistical Analysis

The number of observed genes or biological processes that affect the resistance or sensitivity of a strain to a pair of conditions was compared using Fisher's two-tailed exact test. P-values lower than 10⁻⁴ were considered significant. The enrichment of particular gene ontology tags relative to the background E. coli genome distribution was estimated using a bootstrapping approach. Gene ontology tags with p-values less than 0.05 were considered significantly enriched. Other statistical tests are specified in the text.
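For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-tailed Fisher's exact test on a 2x2 table. The layout of the table (genes affecting both conditions, only one, only the other, or neither) is an assumption about how the comparison could be set up; the paper's own analysis used scipy.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    p_value = sum(prob(x) for x in range(low, high + 1)
                  if prob(x) <= observed * tolerance)
    return min(1.0, p_value)


# Assumed layout: a = genes affecting both conditions, b = only the first,
# c = only the second, d = neither. Counts are arbitrary.
p = fisher_exact_two_sided(1, 9, 11, 3)
print(p < 1e-4)  # compare against the significance cutoff used in the text
```

The relative tolerance keeps tables whose probability ties with the observed one from being dropped by rounding error.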

Results and Discussion

The main goal of the Resistome is to provide a standardized, consistent window into our current state of knowledge regarding resistance phenotypes, in the hope that researchers can better understand future experiments by looking into the past for contextualization. After introducing the content of the database and software tools for its analysis, we explore statistically significant phenotype interactions that may be useful for designing experimental tests of antagonistic pleiotropy to screen for combined antibiotic therapies (28) or adaptive laboratory evolution to explore alternative genetic mechanisms that may enable two supposedly incompatible phenotypes to co-exist within a single strain.

Database Summary

The 300 papers stored in the Resistome represent 5,511 E. coli mutants that collectively resist 228 compounds or conditions across 25 categories (Table 1). Overall, the genes mutated in these mutants represent approximately 2.5X coverage of the entire E. coli MG1655 genome, though Resistome mutations and their interactions are not evenly distributed throughout the genome (Figure 1).

The source data can be divided into two main categories: low-throughput studies focusing on intense characterization of a single or few gene(s), versus genome-wide studies that investigate potential contributions from every gene in the E. coli genome to the phenotype of interest. This divide can be clearly seen in the number of mutants screened per paper (Figure 2) starting in 2006, the point at which the Keio collection and other libraries became widely available. Given that the technologies used to rapidly genotype strains following selections or screening will only become cheaper, the scale of selection and screening in resistance studies will continue to increase for the foreseeable future. A computational infrastructure to facilitate contextualization and inter-study comparison of these data is clearly needed.

Most Resistome data focus on either antibiotic resistance or a collection of industrially relevant resistance phenotypes for E. coli (Figure 3). Tackling the medical challenge of dealing with infectious organisms with broad spectra of antibiotic resistance has occupied a large fraction of research attention. Examining the temporal distribution of phenotype data reveals a fairly stable division of research effort between these categories; antibiotic resistance is mainly of medical importance, while studies dedicated to acid, furan, osmotic, or solvent stresses likely respond to a combination of industrial biocatalyst improvement and general systems biology research (29). The data present in the database reflect researcher interests when working with E. coli, a metabolically flexible Gram-negative organism (30); as other organisms are added to the Resistome, the intersection of their specific resistant traits with medical and industrial needs will shift the phenotype data distribution accordingly.

Both random (adaptive evolution, mutagenesis) and rational (manually selected and introduced) mutations are curated within the Resistome, with the former category comprising approximately 24% of all papers curated thus far (74/300).
There is a clear dichotomy in the type and number of mutations observed per study between each category when examining their mutational spectra (Figure 4), due to the use of designed libraries such as the Keio, ASKA, or TRMR collections that only intentionally inactivate or overexpress one gene per mutant. As a result, mutants generated randomly have many more perturbed biological processes than libraries (6.5 versus 0.9 processes, P = 9 × 10⁻¹¹⁸, two-tailed t-test). Each randomly generated strain also has approximately 1.40 mutated essential genes versus 0.08 genes for non-random methods (P < 10⁻²⁵¹, two-tailed t-test), highlighting a key difference between random and non-random studies: essential genes can be mutated if they remain functional enough to avoid cell death, but cannot be deleted and sometimes exhibit dosage sensitivity when overexpressed (31). Mutated essential genes are most commonly involved with DNA integrity or transcription processes (RNA polymerase, termination) due to their direct targeting by drugs and ability to influence transcription on a global scale. Despite the widespread use of designed deletion and overexpression libraries, mutations in these and other essential processes appear to be critical for adaptation and should be considered when designing an experiment for E. coli's phenotypic improvement.

Classifying Resistance using Machine Learning

Relationships between resistance studies may be obscure: the studies may examine nominally unrelated phenotypes (for example, (4) and (32) detected identical amino acid changes in MreB), traits may depend on different mutated genes but similar biological processes, or the studies may simply be unknown to researchers at the time and place where they would make a difference in research outcomes. One of the goals of the Resistome is to minimize the role of chance in detecting these connections by not only providing a repository for resistance studies, but also assisting researchers in detecting similarities between new experimental data and historical Resistome results using unsupervised learning techniques.

Network theory is often applied to biological problems as a means to quantify interactions and information flows within biological systems (33, 34). Community detection, a form of unsupervised learning applied to networks to detect related nodes or edges, is particularly useful for classifying new data without relying on researcher inference (35, 36). Generating a network through pairwise co-occurrence from the Resistome (see Methods and Materials) yields the gene-gene interaction network in Figure 5. Although it is possible to encode the mutation types in network nodes, in this case we use an (effectively) binary formulation where nodes are included if they are mutated in any fashion to simplify downstream analysis. Clustering the network reveals several densely interconnected communities of biological interest: many of these clusters represent common stress resistance mechanisms as deduced from GO statistical analysis, such as DNA topology control (affected by fluoroquinolones), cell-shape control (modulated by osmotic stress and penicillin challenges), and general DNA-binding proteins (regulator-induced pleiotropy). New data can be easily integrated into this structure by submitting a simple list of mutated genes found in a strain, recomputing the clustering to account for new genes and modified edge weights, and then examining the clusters containing these genes. These clusters effectively represent automatically inferred similarities that should simplify analysis of library or evolution selection experiments considerably.

Drivers of Resistance and Sensitivity

With a standardized representation of resistance genotypes in one hand, coupled with a range of tools for interpreting these data in the other, we now shift our focus to the development of resistance along with interactions between resistance phenotypes. How a given genotype leads to the observed phenotype is a central question in microbiology and bioengineering. Examining the annotated cellular, biological, and molecular roles of resistance genes reveals three main categories that are significantly over-represented within the Resistome (P < 0.05): regulatory mechanisms that result in global perturbations of cellular physiology, specific adaptations for individual stressors, and growth rate improvements. All of these mutations may be present in a single strain and exhibit complex patterns of epistasis, making resistance a challenging phenotype to reverse engineer. Understanding epistatic interactions is critical, however, for forward engineering of resistance traits of interest and computationally guided experimental design.


Adaptation via Regulatory Mutations

The E. coli regulatory network is capable of rapid processing of external and internal signals to maintain cellular homeostasis and promote growth. Understanding the connections between regulatory mutations and the resulting phenotypes can be extremely challenging due to the complexity of this network. A total of 287 genes annotated as transcriptional regulators are mutated in the Resistome and are heavily over-represented in mutants compared to their abundance in the E. coli MG1655 genome (P = 3.44 × 10⁻⁶⁵, Fisher's exact test); the regulator mutations associated with at least four distinct phenotypic classes are shown in Figure 7, demonstrating that a single regulator can affect a host of different traits. Mutations affecting transcriptional termination (rho, nusG, and several others) are enriched at a similar significance level (P = 3 × 10⁻⁵⁰, Fisher's exact test). Mutant variants of other regulators such as CRP, which principally controls transcription in response to carbon source availability in the environment, also improve resistance to at least eight conditions (37). Most of the mutated amino acids appear to be located near known functional sites when visualized (Figure 6), suggesting that mutagenesis experiments can saturate these sites to achieve diverse phenotypes. Mutation of these regulators and others presumably causes expression of the primary gene(s) conferring resistance to change, leading indirectly to the observed phenotypes.

In addition to proteins directly controlling gene transcription, changes to RNA polymerase resulting in pleiotropic changes to gene expression are enriched as well (P = 3.94 × 10⁻⁷⁹, Fisher's exact test). Since all the RNA polymerase components, excluding certain sigma factors, are essential, these mutations are mainly discovered using adaptive evolution or random mutagenesis.
These mutations can involve deletions or insertions within rpoC or rpoB, as well as random mutagenesis of sigma factors (38). The stress-associated sigma factor RpoS has long been known to be frequently inactivated in chemostat evolution experiments (39), so it is not surprising that other RNA polymerase components might also be drivers of adaptation. While these particular mutations appear to be mostly growth associated in the Resistome, their wide range of possible effects makes it difficult to link these mutations to the observed strain phenotypes without extensive transcriptomic, proteomic, and metabolomic analysis (40). Additional data may also provide insight into the translation of genotypes to phenotypes via transcription factor activation (41).

Transient responses at the gene expression level also promote resistance without necessarily being reflected at the genome level. For instance, a non-genetic basis has recently been shown for many cancers, including ovarian (42) and lung (43) cancer progression and leukemia drug resistance (44). Imbalances in protein or RNA levels, without mutation, can influence cancer severity (45). Similarly, bacteria have been observed to possess regulatory mechanisms that allow for decision making in response to a stimulus, again without a genetic basis (46, 47). Gene expression data, obtained from microarray, RNA-seq, qPCR, and other sources, therefore must be considered in order to comprehensively decode pathways of resistance. A central database that includes resistance-conferring mutations as well as resistance-conferring responses at the gene expression level will set the stage for novel and complex analysis of epistatic interactions between genetic and non-genetic changes.

Walls, Pumps, and Locks: Specific Adaptation Mechanisms

The second broadly identifiable group of genes confers specific resistance to particular stressors, generally through extrusion via pumps, mutation of the target protein, or membrane property changes that exclude the inhibitor more efficiently from the cell. A classic example involves selection for gyrA mutations in response to fluoroquinolone antibiotics, or residue changes in penicillin binding proteins (MreB, MreC) that confer resistance to that antibiotic class. The GO tag corresponding to these mechanisms (response to drug) is one of the most enriched categories (P = 7.5 × 10⁻¹²⁸, Fisher's exact test), which reflects the concentration of antibiotic-resistant genotypes within the Resistome. Since most of these mechanisms are well known, the advantage of using a database to interrogate new experiments is the ability to rapidly find connections that may not be immediately obvious, both to existing knowledge encapsulated in resources like the GO network and to other experiments.

Growth versus Resistance

The challenge of possessing large datasets describing complex traits, such as chemical resistance, is that it is difficult to confidently define a given genotype-phenotype relationship due to the sheer number of potential contributors to the phenotype. Many mutations might be neutral hitchhikers or genes that have a marginal impact on the phenotype of interest at best. Over the entire set of mutations contained within the Resistome, 17.3% of genes are associated with growth-related phenotypes in over 90% of studies, while 30.5% of genes are associated with non-growth-related phenotypes (various types of resistance) in the same fraction of mutants. Depending on researcher tolerance for potential false negatives or positives, genes exceeding a certain growth-associated threshold (80-90%) may be removed from subsequent analyses without filtering out genes that are likely to drive adaptation rather than faster growth. A complete list of genes associated with either classification is given in Table S1.

Perhaps not surprisingly, genes that are found in growth-annotated studies more than 90% of the time are generally associated with central metabolism, amino acid biosynthesis, ion transport, and other related processes (Table 3). While it is possible to filter these types of GO tags out of Resistome data automatically, metabolic gene mutations often compensate for the growth defects introduced by other mutations (48) and therefore play an important role in adaptation. The non-growth-associated genes are generally related to drug export, oxidative stress tolerance, or biofilm formation (Table 2). As additional data are deposited into the Resistome, this type of filtering will enable researchers to accurately and quickly pinpoint the likely drivers of resistance for their particular phenotypes of interest and reduce the amount of necessary mutation characterization in strains with a high mutational load.
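The growth-association filter can be sketched as follows. The sketch assumes that each observation is a (gene, is_growth_phenotype) pair, one per study in which the gene is mutated; the function names and data layout are hypothetical, and only the idea of an 80-90% cutoff comes from the text.

```python
from collections import defaultdict


def growth_fraction(observations):
    """Fraction of each gene's observations annotated as growth phenotypes."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [growth count, total count]
    for gene, is_growth in observations:
        counts[gene][0] += int(is_growth)
        counts[gene][1] += 1
    return {gene: growth / total for gene, (growth, total) in counts.items()}


def drop_growth_associated(observations, threshold=0.9):
    """Return genes whose growth-associated fraction does not exceed `threshold`."""
    fractions = growth_fraction(observations)
    return sorted(gene for gene, frac in fractions.items() if frac <= threshold)


# Placeholder observations: geneA is always growth-annotated, geneB rarely.
observations = (
    [("geneA", True)] * 10
    + [("geneB", True)] + [("geneB", False)] * 3
)
print(drop_growth_associated(observations))
```

Lowering `threshold` toward 0.8 removes more candidate genes, trading a higher risk of discarding true drivers of resistance for fewer growth-related hitchhikers.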


Limitations of Random Mutation-Selection

While adaptive evolution and other random mutagenesis techniques have been generally successful when used to improve inhibitor resistance, inherent limits arising from the structure of the genetic code have been found to restrict mutation accessibility significantly (49). The underlying issue is that mutating a codon such that one amino acid is changed to another (AA → AA') often requires two or three nucleotide changes that are unlikely to occur randomly within a reasonable experimental time frame. Analysis of the amino acid changes documented within the Resistome reveals that amino acid transitions requiring a single SNP comprise 97.6% of all amino acid transitions, with the balance primarily consisting of two nucleotide changes (Figure 8). The observed AA → AA' frequency strongly correlates with the minimum number of nucleotides (r = 0.71, P = 6 × 10⁻⁶⁹, Spearman correlation) that must be changed to interconvert between their codons with single nucleotide changes. For the E. coli code, each amino acid can be converted to approximately 8 other amino acids with a SNP (s = 2), indicating that this robustness limitation is not only significant statistically but strongly curtails the potential chemical diversity accessible using random methods. Gene synthesis can be used to overcome this effect for individual protein libraries by manual selection of the library content, but for whole cells, it may be necessary to explore mutagenesis approaches that mutate multiple nucleotides at a site simultaneously. Ameliorating this effect offers an intriguing opportunity to improve the effectiveness of most random methods that warrants further exploration.
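The accessibility argument can be made concrete with a short sketch that uses the standard genetic code to compute the minimum number of nucleotide changes separating two amino acids, and the set of amino acids reachable by a single SNP. The function names are illustrative, and the way accessibility is counted here (over all synonymous codons of the starting amino acid) is an assumption that may differ from the paper's own calculation.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def codons_for(amino_acid):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == amino_acid]


def hamming(first, second):
    """Number of positions at which two codons differ."""
    return sum(x != y for x, y in zip(first, second))


def min_changes(aa_from, aa_to):
    """Minimum nucleotide changes needed to convert aa_from into aa_to."""
    return min(hamming(a, b)
               for a in codons_for(aa_from) for b in codons_for(aa_to))


def snp_accessible(amino_acid):
    """Amino acids reachable from `amino_acid` by one nucleotide change."""
    others = set(AMINO_ACIDS) - {"*", amino_acid}
    return {aa for aa in others if min_changes(amino_acid, aa) == 1}


print(min_changes("M", "I"), min_changes("M", "W"), min_changes("M", "D"))
print(sorted(snp_accessible("M")))
```

Methionine, with its single codon ATG, reaches only six other amino acids through one substitution, while conversions such as M → D need all three positions changed.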

Detecting Phenotype Antagonism

As the discovery of antibiotics has slowed and the specter of resistance rises, researchers have increasingly focused on devising treatment schedules and drug combinations that might extend the life of our current antibiotic repertoire (50, 51). One possible route to limit adaptive evolution of drug resistance is to exploit antagonistic pleiotropy between two traits of interest (52, 53), in which a strain is challenged with two distinct inhibitors and acquiring resistance to one leads to increased sensitivity to the other. Metabolic engineers face a related challenge when attempting to improve the robustness of industrial biocatalysts towards the complex combination of stresses encountered in large-scale cultivation (29). Identifying statistically significant overlaps between the modified genes or processes that affect resistance for pairs of phenotypes is therefore of interest, as each interaction can be characterized to see if it affects the adaptive fitness landscapes (54, 55). To facilitate a direct comparison between gene-phenotype pairs, we transformed the gene overlaps recorded for each phenotype pair into ontology tag overlaps to examine commonalities in the biological roles of perturbed genes. Solvent resistance has significant functional overlaps with antibiotic, thermal, radiation, acid, osmotic, and furan resistance phenotypes (Figure 9). Interestingly, the acid-solvent interaction has been confirmed as an antagonism for E. coli ethanol tolerance (9), and is suspected in the case of Lactobacillus brevis n-butanol tolerance (56). The overlap between processes affected by solvent and radiation resistance may also be due to the role of oxidative stress in both conditions (57, 58). These findings imply that it may be challenging to design industrial organisms possessing all of these phenotypes, and that engineering a solvent-tolerant strain will likely result in unintended, possibly undesirable collateral traits.
An interesting question that must also be addressed experimentally is whether there are alternative genotypes that alleviate antagonistic pleiotropy and enable two ostensibly incompatible phenotypes to co-exist. Since the Resistome does not contain a complete set of resistance genotypes for any particular chemical inhibitor or environmental condition, the only conclusion we can currently draw is that typical resistance traits selected for under acid or solvent stress, for example, are not likely to be co-resident in the same strain. Radiation, antibiotic, and oxidative stress sensitivities also have a statistically significant overlap of affected processes. Radiation resistance or sensitivity is modulated by mutations to genes involved with the SOS and oxidative stress responses (58), and these processes are also frequently known or speculated to be affected by drug or oxidative stress. Examining the gene ontology enrichments for these overlapping gene sets reveals that DNA repair genes (radA, uvrA, efeO, yadC, pgmB, among others) simultaneously influence all three phenotypes. While these connections are intriguing, experimental verification through adaptive evolution or combinatorial library screening is needed to determine if these phenotypes exhibit some form of antagonistic pleiotropy or synergy under controlled conditions. We also examined whether interactions can be detected within a specific class of papers. In this case, we analyzed only antibiotic-resistant mutants and repeated this procedure to identify statistically significant overlaps in affected gene ontology processes (Figure 10). While we detect many significant interactions, many of these antibiotic resistance phenotypes are studied using identical gene sets or libraries, so it is less surprising to observe significance here. These interactions need further experimental validation before they can be leveraged into usable treatment strategies for forestalling the development of adaptive resistance, although the results encapsulate our existing knowledge that a small number of multi-drug resistance proteins can protect against a wide variety of toxic compounds.
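One common way to score such an overlap, shown here only as a sketch, is a one-sided hypergeometric test on the two sets of ontology tags. This is a substituted textbook test chosen for illustration; this section does not state which test produced Figures 9 and 10, and the function name, the tag sets, and the universe size are hypothetical.

```python
from math import comb


def overlap_pvalue(tags_a, tags_b, universe_size):
    """P(overlap >= observed) if tags_b were drawn at random from the universe.

    tags_a, tags_b: sets of ontology tags perturbed under two phenotypes.
    universe_size: total number of distinct tags that could have been drawn
    (an assumption the analyst must supply).
    """
    observed = len(tags_a & tags_b)
    size_a, size_b = len(tags_a), len(tags_b)
    total = comb(universe_size, size_b)
    # Sum the hypergeometric probabilities for every overlap size at least
    # as large as the one observed.
    tail = sum(
        comb(size_a, i) * comb(universe_size - size_a, size_b - i)
        for i in range(observed, min(size_a, size_b) + 1)
    )
    return tail / total
```

A small p-value indicates only that two phenotypes share more tags than expected by chance. It does not distinguish antagonism from synergy, which, as noted above, requires experimental verification.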

Conclusions

We present here a genotype-resistance phenotype database containing, to our knowledge, one of the most complete descriptions of resistance phenotypes generated using common methods available, along with the requisite software tools to exploit these data for improving our biological understanding of E. coli resistance. Given the continuing threat of antibiotic-resistant microbes and cancers, our ability to treat these diseases is rapidly becoming dependent on finding ways to forestall the acquisition of resistance. Future development of the Resistome will primarily focus on integrating genetic, proteomic, and transcriptomic datasets into a cohesive whole to identify treatment or condition combinations that potentially slow adaptive evolution. Coupling the protein mutagenesis data contained within the database with increasingly sophisticated models of protein function (59) may allow for building prospective resistance models that enable the creation of resistance-conferring strains de novo, facilitating rapid and defined improvements of strain traits without relying on combinatorial or random methods. The Resistome offers the opportunity to contextualize and exploit resistance data that may be key to extending the longevity of our medical arsenal and improving the robustness of industrial biocatalysts.

Acknowledgement

This work was supported by the Fulbright Foundation East Asia Foundation (to JW) and the Department of Energy Genome Sciences program (grant #DE-SC008812). We would also like to thank the Hsueh-Fen Juan laboratory (National Taiwan University) for hosting JW during the development of the Resistome and for many helpful conversations concerning its content, the numerous researchers who graciously provided us with raw data from their studies, and other Fulbright fellows for their interest in this project. In addition, we thank the people of Taiwan for their hospitality during the execution of this project.

References

1. Brown, E. D., and Wright, G. D. (2016) Antibacterial drug discovery in the resistance era. Nature 529, 336–343.
2. Blair, J. M., Webber, M. A., Baylay, A. J., Ogbolu, D. O., and Piddock, L. J. (2015) Molecular mechanisms of antibiotic resistance. Nature Reviews Microbiology 13, 42–51.
3. Peabody, G. L., Winkler, J., and Kao, K. C. (2014) Tools for developing tolerance to toxic chemicals in microbial systems and perspectives on moving the field forward and into the industrial setting. Current Opinion in Chemical Engineering 6, 9–17.
4. Winkler, J. D., Garcia, C., Olson, M., Callaway, E., and Kao, K. C. (2014) Evolved osmotolerant Escherichia coli mutants frequently exhibit defective N-acetylglucosamine catabolism and point mutations in cell shape-regulating protein MreB. Applied and Environmental Microbiology 80, 3729–3740.
5. Minty, J. J., Lesnefsky, A. A., Lin, F., Chen, Y., Zaroff, T. A., Veloso, A. B., Xie, B., McConnell, C. A., Ward, R. J., Schwartz, D. R., Rouillard, J.-M., Gao, Y., Gulari, E., and Lin, X. (2011) Evolution combined with genomic study elucidates genetic bases of isobutanol tolerance in Escherichia coli. Microbial Cell Factories 10, 18.
6. Warner, J. R., Reeder, P. J., Karimpour-Fard, A., Woodruff, L. B., and Gill, R. T. (2010) Rapid profiling of a microbial genome using mixtures of barcoded oligonucleotides. Nature Biotechnology 28, 856–862.
7. Sandoval, N. R., Mills, T. Y., Zhang, M., and Gill, R. T. (2011) Elucidating acetate tolerance in E. coli using a genome-wide approach. Metabolic Engineering 13, 214–224.
8. Wang, X., Yomano, L. P., Lee, J. Y., York, S. W., Zheng, H., Mullinnix, M. T., Shanmugam, K., and Ingram, L. O. (2013) Engineering furfural tolerance in Escherichia coli improves the fermentation of lignocellulosic sugars into renewable chemicals. Proceedings of the National Academy of Sciences 110, 4021–4026.
9. Goodarzi, H., Bennett, B. D., Amini, S., Reaves, M. L., Hottes, A. K., Rabinowitz, J. D., and Tavazoie, S. (2010) Regulatory and metabolic rewiring during laboratory evolution of ethanol tolerance in E. coli. Molecular Systems Biology 6, 378.
10. Palmer, A. C., and Kishony, R. (2013) Understanding, predicting and manipulating the genotypic evolution of antibiotic resistance. Nature Reviews Genetics 14, 243–248.
11. Long, A., Liti, G., Luptak, A., and Tenaillon, O. (2015) Elucidating the molecular architecture of adaptation via evolve and resequence experiments. Nature Reviews Genetics 16, 567–582.


12. Baba, T., Ara, T., Hasegawa, M., Takai, Y., Okumura, Y., Baba, M., Datsenko, K. A., Tomita, M., Wanner, B. L., and Mori, H. (2006) Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular Systems Biology 2.
13. Kitagawa, M., Ara, T., Arifuzzaman, M., Ioka-Nakamichi, T., Inamoto, E., Toyonaga, H., and Mori, H. (2006) Complete set of ORF clones of Escherichia coli ASKA library (a complete set of E. coli K-12 ORF archive): unique resources for biological research. DNA Research 12, 291–299.
14. Lehner, B. (2013) Genotype to phenotype: lessons from model organisms for human genetics. Nature Reviews Genetics 14, 168–178.
15. Davidson, S., Overton, C., and Buneman, P. (1995) Challenges in Integrating Biological Data Sources. Journal of Computational Biology 2, 557–572.
16. Lee, S. Y., Lee, D.-Y., and Kim, T. Y. (2005) Systems biotechnology for strain improvement. Trends in Biotechnology 23, 349–358.
17. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research 44, D457–D462.
18. Otsuka, Y. et al. (2014) GenoBase: comprehensive resource database of Escherichia coli K-12. Nucleic Acids Research 43, D606–D617.
19. Caspi, R., Billington, R., Ferrer, L., Foerster, H., Fulcher, C. A., Keseler, I. M., Kothari, A., Krummenacker, M., Latendresse, M., Mueller, L. A., Ong, Q., Paley, S., Subhraveti, P., Weaver, D. S., and Karp, P. D. (2015) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 44, D471–D480.


20. Gama-Castro, S. et al. (2015) RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Research 44, D133–D143.
21. Kim, H., Shim, J. E., Shin, J., and Lee, I. (2015) EcoliNet: a database of cofunctional gene network for Escherichia coli. Database 2015, bav001.
22. McArthur, A. G. et al. (2013) The Comprehensive Antibiotic Resistance Database. Antimicrobial Agents and Chemotherapy 57, 3348–3357.
23. Ashburner, M. et al. (2000) Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29.
24. Consortium, G. O. (2014) Gene Ontology Consortium: going forward. Nucleic Acids Research 43, D1049–D1056.
25. Zhou, J., and Rudd, K. E. (2012) EcoGene 3.0. Nucleic Acids Research gks1235.
26. Rose, P. W., Prlić, A., Bi, C., Bluhm, W. F., Christie, C. H., Dutta, S., Green, R. K., Goodsell, D. S., Westbrook, J. D., Woo, J., Young, J., Zardecki, C., Berman, H. M., Bourne, P. E., and Burley, S. K. (2014) The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Research 43, D345–D356.
27. Nichols, R. J. et al. (2011) Phenotypic Landscape of a Bacterial Cell. Cell 144, 143–156.
28. Chandrasekaran, S., Cokol-Cakmak, M., Sahin, N., Yilancioglu, K., Kazan, H., Collins, J. J., and Cokol, M. (2016) Chemogenomics and orthology-based design of antibiotic combination therapies. Mol Syst Biol 12, 872.
29. Patnaik, R. (2008) Engineering Complex Phenotypes in Industrial Strains. Biotechnol. Prog. 24, 38–47.


30. Zhang, X., Tervo, C. J., and Reed, J. L. (2016) Metabolic assessment of E. coli as a Biofactory for commercial products. Metabolic Engineering 35, 64–74.
31. Makanae, K., Kintaka, R., Makino, T., Kitano, H., and Moriya, H. (2013) Identification of dosage-sensitive genes in Saccharomyces cerevisiae using the genetic tug-of-war method. Genome Research 23, 300–311.
32. Shiomi, D., Toyoda, A., Aizu, T., Ejima, F., Fujiyama, A., Shini, T., Kohara, Y., and Niki, H. (2013) Mutations in cell elongation genes mreB, mrdA and mrdB suppress the shape defect of RodZ-deficient cells. Molecular Microbiology 87, 1029–1044.
33. Mori, H., Takeuchi, R., Otsuka, Y., Bowden, S., Yokoyama, K., Muto, A., Libourel, I., and Wanner, B. L. Prokaryotic Systems Biology; Springer, 2015; pp 155–168.
34. Ryan, C. J., Cimermančič, P., Szpiech, Z. A., Sali, A., Hernandez, R. D., and Krogan, N. J. (2013) High-resolution network biology: connecting sequence with function. Nature Reviews Genetics 14, 865–879.
35. Fortunato, S. (2010) Community detection in graphs. Physics Reports 486, 75–174.
36. Mitra, K., Carvunis, A.-R., Ramesh, S. K., and Ideker, T. (2013) Integrative approaches for finding modular structure in biological networks. Nature Reviews Genetics 14, 719–732.
37. Geng, H., and Jiang, R. (2015) cAMP receptor protein (CRP)-mediated resistance/tolerance in bacteria: mechanism and utilization in biotechnology. Applied Microbiology and Biotechnology 99, 4533–4543.
38. Alper, H., and Stephanopoulos, G. (2007) Global transcription machinery engineering: a new approach for improving cellular phenotype. Metabolic Engineering 9, 258–267.
39. Notley-McRobb, L., King, T., and Ferenci, T. (2002) rpoS mutations and loss of general stress resistance in Escherichia coli populations as a consequence of conflict between competing stress responses. Journal of Bacteriology 184, 806–811.
40. Cheng, K.-K., Lee, B.-S., Masuda, T., Ito, T., Ikeda, K., Hirayama, A., Deng, L., Dong, J., Shimizu, K., Soga, T., Tomita, M., Palsson, B. O., and Robert, M. (2014) Global metabolic network reorganization by adaptive mutations allows fast growth of Escherichia coli on glycerol. Nature Communications 5.
41. Fu, Y., Jarboe, L. R., and Dickerson, J. A. (2011) Reconstructing genome-wide regulatory network of E. coli using transcriptome data and predicted transcription factor activities. BMC Bioinformatics 12, 233.
42. Timsah, Z., Ahmed, Z., Ivan, C., Berrout, J., Gagea, M., Zhou, Y., Pena, G. N. A., Hu, X., Vallien, C., Kingsley, C. V., Lu, Y., and Hancock, J. F. (2015) Grb2 depletion under non-stimulated conditions inhibits PTEN, promotes Akt-induced tumor formation and contributes to poor prognosis in ovarian cancer. Oncogene 1–11.
43. Jen, J., Lin, L.-l., Chen, H.-t., Liao, S.-y., Lo, F.-y., Tang, Y.-a., Su, W.-c., Salgia, R., Hsu, C.-l., Huang, H.-c., Juan, H.-f., and Wang, Y.-c. (2015) Oncoprotein ZNF322A transcriptionally deregulates alpha-adducin, cyclin D1 and p53 to promote tumor growth and metastasis in lung cancer. Oncogene 1–13.
44. Okabe, S., Tauchi, T., and Ohyashiki, K. (2008) Cancer therapy: Preclinical characteristics of Dasatinib- and Imatinib-resistant chronic myelogenous leukemia cells. Clinical Cancer Research 14, 6181–6187.
45. Brock, A., Chang, H., and Huang, S. (2009) Non-genetic heterogeneity - a mutation-independent driving force for the somatic evolution of tumours. Nature Reviews Genetics 10, 336–341.
46. Balaban, N. Q., Merrin, J., Chait, R., Kowalik, L., and Leibler, S. (2004) Bacterial persistence as a phenotypic switch. Science (New York, N.Y.) 305, 1622–5.


47. Veening, J.-W., Stewart, E. J., Berngruber, T. W., Taddei, F., Kuipers, O. P., and Hamoen, L. W. (2008) Bet-hedging and epigenetic inheritance in bacterial cell development. PNAS 105, 4393–4398.
48. Minty, J. J., Lesnefsky, A. A., Lin, F., Chen, Y., Zaroff, T. A., Veloso, A. B., Xie, B., McConnell, C. A., Ward, R. J., Schwartz, D. R., Rouillard, J.-M., Gao, Y., Gulari, E., and Lin, X. (2011) Evolution combined with genomic study elucidates genetic bases of isobutanol tolerance in Escherichia coli. Microb Cell Fact 10, 18.
49. Firnberg, E., and Ostermeier, M. (2013) The genetic code constrains yet facilitates Darwinian evolution. Nucleic Acids Research 41, 7420–7428.
50. Imamovic, L., and Sommer, M. O. (2013) Use of collateral sensitivity networks to design drug cycling protocols that avoid resistance development. Science Translational Medicine 5, 204ra132–204ra132.
51. Goulart, C. P., Mahmudi, M., Crona, K. A., Jacobs, S. D., Kallmann, M., Hall, B. G., Greene, D. C., and Barlow, M. (2013) Designing antibiotic cycling strategies by determining and understanding local adaptive landscapes. PLoS ONE 8, e56040.
52. Lazar, V. et al. (2014) Bacterial evolution of antibiotic hypersensitivity. Molecular Systems Biology 9, 700–700.
53. Dragosits, M., Mozhayskiy, V., Quinones-Soto, S., Park, J., and Tagkopoulos, I. (2013) Evolutionary potential, cross-stress behavior and the genetic basis of acquired stress resistance in Escherichia coli. Molecular Systems Biology 9, 643.
54. de Visser, J. A. G., and Krug, J. (2014) Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics 15, 480–490.
55. Schenk, M. F., Szendro, I. G., Salverda, M. L., Krug, J., and de Visser, J. A. G. (2013) Patterns of epistasis between beneficial mutations in an antibiotic resistance gene. Molecular Biology and Evolution 30, 1779–1787.
56. Winkler, J., and Kao, K. C. (2011) Transcriptional Analysis of Lactobacillus brevis to N-Butanol and Ferulic Acid Stress Responses. PLoS ONE 6, e21438.
57. Rutherford, B. J., Dahl, R. H., Price, R. E., Szmidt, H. L., Benke, P. I., Mukhopadhyay, A., and Keasling, J. D. (2010) Functional genomic study of exogenous n-butanol stress in Escherichia coli. Applied and Environmental Microbiology 76, 1935–1945.
58. Farr, S. B., and Kogoma, T. (1991) Oxidative stress responses in Escherichia coli and Salmonella typhimurium. Microbiological Reviews 55, 561–585.
59. Woolfson, D. N., Bartlett, G. J., Burton, A. J., Heal, J. W., Niitsu, A., Thomson, A. R., and Wood, C. W. (2015) De novo protein design: how do we expand into the universe of possible protein structures? Current Opinion in Structural Biology 33, 16–26.


Tables

Table 1: Key Database Statistics

Property          Value         Description
Papers            300           Curated papers
Journals          81            Number of journals represented
Methods           17            Strain engineering techniques
Designs           5,511         Deposited E. coli strains
Genes             3,639         Number of unique genes mutated
Mutants/paper     18.4 (151.3)  Avg. mutants per record
Mutations/mutant  2.02 (10.2)   Avg. mutations per mutant
Phenotypes        228           Unique resistance, sensitivity phenotypes

An overview of the Resistome. Each paper is manually curated from the literature, although in many cases tables of mutants were curated using special, paper-specific parsers for larger adaptive evolution or library screening studies with defined vocabularies for describing mutants and their phenotypes.

Table 2: Enriched Tags for Non-Growth Related Genes

GO Tag      P-value    Description
GO:0015307  5.697E-08  drug:proton antiporter activity
GO:0046677  5.952E-06  response to antibiotic
GO:0042493  7.650E-06  response to drug
GO:0042542  7.782E-06  response to hydrogen peroxide
GO:0015385  3.814E-05  sodium:proton antiporter activity
GO:0035725  6.674E-05  sodium ion transmembrane transport
GO:0022900  6.994E-05  electron transport chain
GO:0070301  3.181E-04  cellular response to hydrogen peroxide
GO:0006754  3.750E-04  ATP biosynthetic process
GO:0044011  3.962E-04  single-species biofilm formation on inanimate substrate
GO:0046618  5.240E-04  drug export
GO:0030163  5.258E-04  protein catabolic process
GO:0015991  5.639E-04  ATP hydrolysis coupled proton transport

Enriched gene ontology tags for non-growth associated genes (those that are associated with non-growth-related studies more than 90% of the time in the Resistome). A p-value of P < 0.05 from a bootstrapping test (see Methods and Materials) was considered significant.


Table 3: Enriched Tags for Growth Related Genes

GO Tag      P-value    Description
GO:0046487  1.688E-06  glyoxylate metabolic process
GO:0009992  9.981E-06  cellular water homeostasis
GO:0034220  4.406E-05  ion transmembrane transport
GO:0008679  2.441E-04  2-hydroxy-3-oxopropionate reductase activity
GO:0042938  2.667E-04  dipeptide transport
GO:0042936  2.667E-04  dipeptide transporter activity
GO:0006212  3.804E-04  uracil catabolic process
GO:0070329  5.389E-04  tRNA seleno-modification
GO:0004834  6.577E-04  tryptophan synthase activity
GO:0004474  7.044E-04  malate synthase activity
GO:0009424  7.350E-04  bacterial-type flagellum hook
GO:0034618  9.292E-04  arginine binding

Enriched gene ontology tags for growth associated genes (those that are associated with growth-related studies more than 90% of the time in the Resistome). A p-value of P < 0.05 from a bootstrapping test (see Methods and Materials) was considered significant.

Figure Legends

Figure 1: Modification frequency and linkage information for Resistome mutations along the E. coli genome. Spike height is proportional to the modification frequency at that locus, while edges, weighted by co-occurrence frequency, connect genes that are mutated in the same strain. This visualization was generated from the MG1655 genome using the default Circos settings.

Figure 2: Deposited papers versus time. The number of mutants per paper analyzed (right y-axis) is displayed in red, while the blue bars (left y-axis) indicate the number of papers in the Resistome curated from that year. The Resistome contains a total of 300 papers containing 5,511 mutants. The Keio collection (12 ) was introduced in 2006.


Figure 3: Resistance study classification over the time course of the Resistome. The top 11 categories are listed, and all other studies are included in the other classification. The full name for each stress resistance category is AAC: antibiotics and anti-chemotherapy resistance, FUR: furfural, GROW: general growth, HOT: high temperatures, ACID: low pH, METI: metabolic inhibitors (antimetabolites), OA: organic acid, OSM: hypo- or hyperosmotic stress, OXI: oxidative stress, RAD!: ionizing radiation, SB: solvents and biofuels, and OTH: other.

Figure 4: Distribution of mutational types in Resistome data. The full names for each type are: AA*: amino acid mutation, INT: genomic integration, SNP: single nucleotide polymorphism, REP: gene repression, OE: over-expression, DEL: deletion, TN: transposon insertion, INDEL: small (<100 bp) insertion or deletion, BTW: intergenic mutation, PLA: plasmid cloning, FRAME: frameshift, UNK: unknown, RBS-T: RBS tuning, DUPE: internal duplication, LDEL: large deletion, CON: constitutive expression, LAMP: large amplification, TER: terminated, FUSE: protein fusion, AMP: amplification, COMP: compartmentalization, TRC: truncation, ASEN: anti-sense gene knockdown, REG: new regulator relationship. Note that residue changes and SNPs included in the database are for different sites and do not reflect the same mutations.

Figure 5: The Resistome gene-gene interaction network, generated from the Cartesian self-products for every deposited strain genotype. Clustering, denoted by node color, was performed using the multilevel algorithm to detect genes that frequently interact, with the aim of inferring synergistic or antagonistic resistance interactions. Edges that occur five or fewer times in the network were pruned for visual clarity, leaving 496 nodes connected by 3,054 edges.

Figure 6: The structure of the CRP regulatory protein (PDB ID 2CGP) with mutations represented as enlarged spheres. A) The red spheres are sites identified as functionally important (DNA or ligand binding, etc.) through PDB analysis and the UniProt database. Approximately 90% of the mutated sites have been identified as functionally important. Blue spheres represent residues of unknown function. B) Spheres are colored according to conservation (blue: highly conserved; red: not conserved). Most mutated residues are highly conserved within CRP variants.


Figure 7: Map of regulator-phenotype pairs. Each block is colored black if that regulator (y-axis) is mutated in a strain that is more resistant or more sensitive to a given phenotype (x-axis), and white otherwise. Only regulators that are mutated in strains that resist at least three phenotypes are included in this plot. Legend: AAAM: amino acid antimetabolites, AAC: antibiotics and anti-chemotherapy resistance, DETR: detergents, FUR: furfural, ANTIM: general antimetabolites, GROW: general growth, HIGHG: high gravity, HOT: high temperatures, ACID: low pH, COLD: low temperature, METI: metabolic inhibitors (antimetabolites), MUTA: mutagens, NTL: nutrient limitation, OA: organic acid, OSM: hypo- or hyperosmotic stress, OXI: oxidative stress, RAD!: ionizing radiation, SB: solvents and biofuels.

Figure 8: The proportions of AA→AA’ transitions requiring a minimum of 1, 2, and 3 nucleotide changes observed in the Resistome. The horizontal axis represents the original amino acid present in the protein sequence, while the number of times a given amino acid is mutated into any other amino acid or stop codon is given in parentheses.

Figure 9: Significance of gene ontology tag overlaps for strains resistant (A) and sensitive (B) to inhibitors in each phenotype. Only phenotype pairs whose interaction significance falls below the specified threshold of P < 10⁻⁴ are included here. P-values are limited to 10⁻²⁰ for visual clarity. Note that the heatmaps have different scales, although the significance cutoff is identical in both cases. Phenotype names are shared with Figure 7.

Figure 10: Significance of gene ontology tag overlaps for strains resistant (A) and sensitive (B) to each type of antibiotic. Only phenotype pairs whose interaction significance falls below the specified threshold of P < 10⁻⁴ are included here. Note that the heatmaps have different scales, although the significance cutoff is identical in both cases.


Graphical TOC Entry

[Graphical TOC image; legible text: "Resistome Database".]

28 ACS Paragon Plus Environment

Page 28 of 38

B4401

B1237

3 06 B4 062 B4 8 01 8 B439887 B 39 B

B3 B B3378 76 3 5

84

9

35

6 B0

B3 69 65 B 9 0 36 B3 42 64 3

B3

B349

B092

5

9

B3428 B3404

B3357 B3340 B3295 B3261 B3251 B124

67 9

1

B B115 5330 1 74

B2231

48

B20

B2 51 6

01

B3

B2

B30

6 B209

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

ACS Synthetic Biology

B B004 47 64 8

Page 29 of 38

ACS Paragon Plus Environment

9

ACS Synthetic Biology

Page 30 of 38

30

2500

25

2000 Mutants/Paper

20

1500 15

1000

10 5

500

0

0

1986 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

Resistome Papers

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

ACS Paragon Plus Environment

Publication Year

1.0

Page 31 of 38

0.8 0.6 0.4 0.2 0.0

1986 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016

Proportion

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

ACS Synthetic Biology

Year

AAC ACID FUR METI GROW OA ACS Paragon Plus Environment HOT OSM

OXI RAD! SB OTH

2500

Page 32 of 38

2000 1500

Counts

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

ACS Synthetic Biology

1000 500

INT AA* SNP REP OE DEL TN INDEL BTW PLA FRAME UNK RBS-T DUPE LDEL CON LAMP FUSE AMP COMP TRC ASEN REG

0

ACS Paragon Plus Environment

Mutation Type

Page 33 of 38

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33

ACS Synthetic Biology

ACS Paragon Plus Environment

A 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

ACS Synthetic Biology

B

ACS Paragon Plus Environment

Page 34 of 38

[Figure: regulator genes against phenotype categories.
Gene labels: RPOS, ACRB, CRP, RPOB, ARCA, CSPE, RPOD, CBL, MARR, ICLR, HUPA, CADC, COMR, HEPA, OMPR, FHLA, FRMR, HNS, FIS, SGRR, RPOC, EVGA, MARC, EMRR, HUPB, HCAR, TRER, ARCB, CYNR, ACRR, YBCM, PDHR, MGRB, TYRR, YCJW, RSEB, CPXR, GADW, RPON, YDCQ, CRL, MRAZ, ENVR, RPOA, SFSB, RSTA, RPIR, YEDW, FRVR, CSGD, CSPA, MALT, MQSR, RCSB, RHO, GLNG, GALS, YDCI, GREA, YAFC, FABR, OXYR, RFAH, FNR, YAFQ, YGAV, CSIR, GADX, NAGC, RPOH, IHFB, NSRR, RCLR, ARSE, PCNB, OGRK, YDFH, GALR, FUR, APPY, CYSB, LYSR, MALI, YEBC, NUSA, ARGP, MARA, PMRA, YEIE, PHOP, CAIF, PRPR, GLPR, RTCR, YFET, YHAJ, YDCR, HDFR, DINJ, NUSB, GCVA, MCBR, SOXS, PSPF, FEAR, YFHA, RCSA, MNTR, FADR, YIHL, MATA, LRP, ILVY, CSPD, SOHA, ADIY, EXUR, FLGM, MODC, YPDC, ATOC, BARA, RSEA, FECI, DEOT, YQHC, LEUO, CHAB, LEXA, YAGI, ZUR, NEMR, FLIT, YQEI, YGFI, YDEO, SDIA, PUTA, YQJI, YCAN, YEBK, PERR, AGAR, NADR, TRPR, CYTR, DEOR, CSPG.
Phenotype category labels: AAAM, AAC, DETR, FUR, ANTIM, GROW, HIGHG, BASE, HOT, ACID, COLD, METI, MUTA, NTL, OA, OSM, OXI, RAD!, SB.]

[Figure: amino acid replacement distance. Labels: "Amino Acids" and "AA Replacement Distance"; numeric scale 0.0 to 1.0. Amino acid categories, with counts in parentheses: X(3), A(205), C(19), D(223), E(216), F(100), G(164), H(56), I(222), K(122), L(202), M(81), N(94), P(114), Q(114), R(154), S(218), T(114), V(157), W(45), Y(73).]

[Figure: phenotype categories compared against each other. Scale: log10[P-value] (ticks 0 to 16). Categories on both axes: FUR, BASE, SB, NTL, GROW, AAAM, AAC, RAD!, OTH, HIGHG, DETR, ANTIM, COLD, HOT, OXI, METI, ACID, OA, MUTA, OSM. Other labels: "1 REP", "2 REP", "3 REP"; panel label A.]

[Figure: antibiotic classes, panels A and B. Scale: log2[P-value] (ticks 0 to 20). Class labels: LINCOSAMIDE, AMINONUCLEOSIDE, SULFONAMIDES, BETA-LACTAM ANTIBIOTIC, MACROLIDES, PHOSPHONIC, FATI, ANISOLES, TETRACYCLINE, ANTIMETABOLITE, AMINOCOUMARIN, ANTHRACYCLINE, MITOMYCINS, CEPHALOSPORIN, AMINOGLYCOSIDE, FUSIDANES, MONOBACTAMS, FLUOROQUINOLONE, AZOLE.]