Article
MinGenome: An in silico top-down approach for the synthesis of minimized genomes
Lin Wang and Costas Maranas
ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.7b00296 • Publication Date (Web): 18 Dec 2017
ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
ACS Synthetic Biology
Research article
MinGenome: An in silico top-down approach for the synthesis of minimized genomes
Lin Wang1 and Costas D. Maranas1*
1 Department of Chemical Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, United States
ABSTRACT
Genome minimized strains offer advantages as production chassis by reducing transcriptional cost, eliminating competing functions and limiting unwanted regulatory interactions. Existing approaches for identifying stretches of DNA to remove are largely ad hoc, based on information on presumably dispensable regions obtained through experimentally determined non-essential genes and comparative genomics. Here we introduce a versatile genome reduction algorithm, MinGenome, that implements a mixed-integer linear program (MILP) to identify, in descending order of size, all dispensable contiguous sequences whose removal does not affect the organism’s growth or other desirable traits. Known essential genes or genes that cause significant fitness or performance loss can be flagged and their deletion prohibited. MinGenome also preserves needed transcription factors and promoter regions, ensuring that retained genes will be properly transcribed, while also avoiding the simultaneous deletion of synthetic lethal pairs. The potential benefit of removing even larger contiguous stretches of DNA if only one or two essential genes (to be re-inserted elsewhere) are within the deleted sequence is explored. We applied the algorithm to design a minimized E. coli strain and found that we were able to recapitulate the long deletions identified in previous experimental studies and discover alternative combinations of deletions which have not yet been explored in vivo.
KEYWORDS: MILP; top-down; minimal genome; genome-scale model; E. coli
Microbes such as E. coli, Saccharomyces cerevisiae, and Clostridia have been widely applied in metabolic engineering projects for biofuels and biorenewables.1 However, the complexity of microbial metabolism requires a sophisticated approach in the design of interventions. The deletion of a seemingly unrelated gene may have deleterious consequences due to regulatory effects, alterations of cofactor ratios, accumulation of intermediates or inability to recycle metabolites. Many of these metabolic interdependencies are inherent in biology as new traits are accumulated in response to changing selection pressures and random mutation. They offer robustness to gene disruptions or changes to the external environment but they may also trigger surprising and hard-to-predict unwanted responses for bioproduction.2 Metabolic models have come a long way towards capturing these effects but unknowns still remain.2 A possible way to circumvent some of the biological complexity is to re-write the genome by refactoring genes so that the regulation is predictable/controllable and unneeded genes are eliminated.3,4 This approach has led to simpler and functional gene clusters. A minimized genome has inherently fewer interactions with heterologous pathways and fewer unknown elements; it is thus more amenable to prediction through modeling and can serve as an ideal and controllable chassis for efficient bioproduction.
Genome reduction can be achieved by either a bottom-up or a top-down strategy. Bottom-up genome minimization requires the de novo assembly of pathways into long stretches and the linking of all necessary components into a single contiguous chromosome. This level of engineering requires complete knowledge of all biological processes and interactions thereof along with genome stability imperatives. To our knowledge, even though the de novo assembly of the entire genomes of M. genitalium5 and yeast6 has been achieved, no re-engineering has been attempted. In contrast, top-down genome minimization works by successively removing non-essential contiguous genome stretches. Generally, a few large stretches are first removed followed by many shorter deletions.7 The main advantage of the top-down approach is that the starting point is an operational genome. Therefore, any adverse effects caused by a deletion can always be remedied by reverting to the state before the latest deletion. Even though the top-down approach may not reach a “true” minimal genome, it is a more pragmatic approach for constructing bioproduction chassis as genome reduction proceeds only as long as the minimized strain still meets growth, production yield, and rate goals.
Many top-down genome reductions have been carried out in E. coli over the last decade. The first reduced genome shrank the E. coli genome by 8.4% by removing its genomic islands.8 Hashimoto et al. further reduced the genome by 29.7% by deleting many non-essential genes.9 However, the genome-minimized strain exhibited a slower growth rate and deformed cell morphology. Posfai et al. managed to reduce the genome by 15.3% without affecting growth rate.10 In this strain (MDS43), mobile DNA and pathogenic genes present on the chromosome were deleted. The goal of minimizing the genome by 30% while maintaining growth rate spurred the ‘Minimal genome factory’ (MGF) research project in Japan.11 The first strain (MGF-01), lacking 22% of the genome, was constructed by removing a subset of non-identical regions between E. coli and a close relative, Buchnera spp.12 Starting from MGF-01, a strain reaching a 30% genome reduction was constructed by introducing viable deletions from earlier studies.13 Recently, Hirokawa et al. extended the reduction to 35.2% by deleting more dispensable regions in the MGF-01 strain.14 All of the identified deletions relied on comparative genomics and a single gene knockout library. Metabolic modeling was not part of the analysis; it is therefore plausible that many genes were deemed essential even though less characterized bypassing pathways may be available.15 The lack of convergence to a unique reduction in genome minimization studies alludes to the possibility of even larger reductions and/or alternative deletions. For example, MGF-01 shares only 568 gene deletions with the ones carried out in strain MDS43 (only 52.5% similarity).12 As more sophisticated genome editing tools (e.g., CRISPR16) are becoming commonplace, the need for a computational aid that will help successively minimize genomes consistent with a set of performance criteria beyond simply growth rate is becoming more pressing.
To address current limitations of computational tools for genome minimization, herein we present MinGenome, a mixed-integer linear programming (MILP) algorithm for top-down genome reduction. MinGenome identifies a ranked list of deletions starting with the longest one and proceeding monotonically to shorter ones. Identifying the larger DNA stretches to be deleted first affords savings in the time and cost needed to carry out the reductive process. MinGenome also relies on a deterministic algorithm (i.e., MILP), thus ending up with a single (barring any alternate optima) reduced genome and top-down genome reduction scheme. Note that an earlier effort17,18 used a stochastic algorithm to assess the end points of reductive evolutionary processes of endosymbiotic bacteria using biomass production feasibility to judge viability. This algorithm tended to terminate at different reduced genomes given the stochastic nature of the reduction process. MinGenome relies on (i) a genome-scale model (GSM) representation of metabolism, (ii) gene location information from GenBank19, (iii) gene essentiality information from a transposon library, (iv) operon and promoter site structure information from EcoCyc20, and (v) transcription factor information from RegulonDB21. The procedure first identifies the largest contiguous DNA stretch within the genome that can be deleted without affecting growth or any other performance criterion (e.g., target product max yield or ATP availability). Subsequently, the next largest dispensable stretch of DNA is identified given all the earlier deletions. The successive identification of smaller DNA stretches to remove continues until (i) a percent genome reduction goal, (ii) a maximum number of deletions, or (iii) a minimum DNA stretch size to be deleted is reached. The simultaneous deletion of computationally predicted synthetic lethal pairs is avoided by maintaining biomass yield at a non-zero or maximum level depending on user specifications. In the second phase, the MinGenome algorithm assesses the possibility of removing even larger stretches of DNA if only one or two essential genes are within the deleted sequence. These deleted essential genes can then be re-inserted into the genome at a different location. We applied the algorithm to design a genome-minimized E. coli K-12 MG1655 strain and found that we were able to recapitulate the long deletions identified in previous experimental studies. A new deletion scheme that has not been explored before is proposed, with large-scale genomic deletions ranging from 14.7 to 63.3 kb. MinGenome can be readily applied to other organisms provided the aforementioned information is available. The MinGenome algorithm was implemented in Python and C++ and is available on GitHub (https://github.com/maranasgroup).
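To make the successive, longest-first procedure and its three stopping criteria concrete, the following Python sketch applies a greedy search to a toy, linearly ordered gene list. It is only an illustration under strong assumptions: it is not the MILP used by MinGenome, it ignores the genome-scale model entirely, and viability is reduced to a hypothetical `essential` flag per gene. All names (`Gene`, `longest_window`, `plan_deletions`) and all coordinates are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int          # bp, toy coordinates
    end: int            # bp, toy coordinates
    essential: bool = False


def longest_window(genes, deleted, max_reinsert=0):
    """Longest run of undeleted genes holding at most `max_reinsert`
    essential genes. Returns (length_bp, i, j) or None."""
    best = None
    for i, first in enumerate(genes):
        if i in deleted or first.essential:
            continue
        n_essential = 0
        for j in range(i, len(genes)):
            if j in deleted:
                break                      # windows never span earlier deletions
            if genes[j].essential:
                n_essential += 1
                if n_essential > max_reinsert:
                    break
                continue                   # a window may not end on an essential gene
            length = genes[j].end - first.start
            if best is None or length > best[0]:
                best = (length, i, j)
    return best


def plan_deletions(genes, genome_size, target_fraction=None,
                   max_deletions=None, min_length=0, max_reinsert=0):
    """Greedy longest-first deletion plan with the three stopping criteria."""
    deleted, plan, removed_bp = set(), [], 0
    while True:
        if max_deletions is not None and len(plan) >= max_deletions:
            break
        if target_fraction is not None and removed_bp >= target_fraction * genome_size:
            break
        best = longest_window(genes, deleted, max_reinsert)
        if best is None or min_length > best[0]:
            break
        length, i, j = best
        reinserted = [g.name for g in genes[i:j + 1] if g.essential]
        deleted.update(range(i, j + 1))
        removed_bp += length               # toy accounting: includes re-inserted genes
        plan.append((genes[i].name, genes[j].name, length, reinserted))
    return plan


genes = [
    Gene("a", 0, 10), Gene("b", 10, 30), Gene("E1", 30, 35, essential=True),
    Gene("c", 35, 80), Gene("d", 80, 90), Gene("E2", 90, 100, essential=True),
    Gene("e", 100, 120),
]
plan_strict = plan_deletions(genes, genome_size=120)
plan_reinsert = plan_deletions(genes, genome_size=120, max_reinsert=1)
plan_target = plan_deletions(genes, genome_size=120, target_fraction=0.4)
```

In this toy setting, allowing one essential gene per deletion merges the stretches on either side of `E1` into a single longer deletion, so fewer deletions are needed, which mirrors the motivation given for the re-insertion phase.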
RESULTS AND DISCUSSION
The proposed MinGenome computational workflow is shown pictorially in Figure 1. First, information is collected on gene positions on the chromosome, essential genes, operons and their promoter sites, and transcription factors as input for the MinGenome algorithm. In addition, MinGenome allows as options (i) gene re-insertion, (ii) retention of transcription factor genes and (iii) user-supplied expansion of the list of essential genes. This provides the information needed to flag the regions that cannot be deleted throughout the entire genome-minimization process. Next, the MinGenome algorithm is successively applied to identify the largest remaining contiguous dispensable DNA stretch. The objective function here is to maximize the distance between the start position and end position of the deleted DNA stretch. Genes and promoters within the deleted stretch are removed. The relationships between promoters and genes and between genes and reactions are modeled as logic constraints. A gene cannot be expressed if its promoter is removed, and an enzymatic reaction is knocked out if the gene (or genes) coding for the enzyme is deleted. A set of performance criteria (e.g., maintaining growth rate, target product max yield, or ATP availability) can be imposed as constraints in the MinGenome algorithm. In order to maintain growth, the deletion is deemed viable only if the genome-minimized strain maintains a pre-specified biomass yield. The MinGenome algorithm identifies segments to be deleted until one of the imposed stopping criteria is met. The longest deletions generally remove genes encoding unknown functions, secondary metabolism, motility, phages, and antibiotic resistance.22 Shorter stretches generally involve genes in alternative metabolic pathways.
The MinGenome algorithm is first deployed to design a genome-minimized E. coli strain. The first case study demonstrates the algorithm's capability of predicting long deletions and contrasts them with existing in vivo genome reduction efforts. The second case study assumes a more conservative posture on gene essentiality by appending to the “not to be deleted” list genes that are deemed essential for the closely related organism Buchnera spp. but not for E. coli, as put forth by Mizoguchi et al.11 Expanding the essential genes list leads to shorter deletions, and more of them are needed to reach 35% genome reduction than in the first case study. In the last study, we allow up to one essential gene to be deleted and then re-inserted elsewhere in the genome. We find that this leads to a reduction in the number of contiguous stretches that need to be deleted to reach 35% genome reduction from 69 to 56 (see Figure 7). Note that after 69 MinGenome predicted deletions we reach a genome reduction level of 40%.
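The promoter-gene and gene-reaction logic described in the workflow can be illustrated with a small Python sketch. It only evaluates the consequences of one given deletion; it does not optimize anything and is not the MILP encoding used by MinGenome. The gene-reaction rule format (a list of alternative enzyme complexes per reaction) and all names and coordinates are assumptions made for this example.

```python
def overlaps(feature, deletion):
    """True if two half-open (start, end) intervals share at least one bp."""
    (s, e), (ds, de) = feature, deletion
    return de > s and e > ds


def consequences(deletion, promoter_pos, gene_pos, promoter_of, gpr):
    """Return (silenced_genes, knocked_out_reactions) for one deleted interval.

    promoter_of maps gene -> promoter; gpr maps reaction -> list of enzyme
    complexes, each complex being a list of genes that are all required.
    """
    lost_promoters = {p for p, pos in promoter_pos.items() if overlaps(pos, deletion)}
    lost_genes = {g for g, pos in gene_pos.items() if overlaps(pos, deletion)}
    # A gene is not expressed if it is deleted or if its promoter is deleted.
    silenced = lost_genes | {g for g, p in promoter_of.items() if p in lost_promoters}
    # A reaction survives if at least one complex has all of its genes expressed.
    knocked_out = {
        rxn for rxn, complexes in gpr.items()
        if not any(all(g not in silenced for g in cx) for cx in complexes)
    }
    return silenced, knocked_out


# Toy chromosome: P1 drives the operon g1-g2, P2 drives g3.
promoter_pos = {"P1": (0, 50), "P2": (500, 550)}
gene_pos = {"g1": (60, 300), "g2": (310, 480), "g3": (560, 900)}
promoter_of = {"g1": "P1", "g2": "P1", "g3": "P2"}
gpr = {
    "R1": [["g1"], ["g3"]],   # isozymes: g1 OR g3
    "R2": [["g2", "g3"]],     # complex: g2 AND g3
    "R3": [["g3"]],
}

silenced_a, knocked_a = consequences((0, 55), promoter_pos, gene_pos, promoter_of, gpr)
silenced_b, knocked_b = consequences((300, 490), promoter_pos, gene_pos, promoter_of, gpr)
```

Deleting only the promoter `P1` silences both operon genes even though their coding sequences remain, which is why promoter positions have to be tracked alongside gene positions.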
Identification of the 32 longest contiguous deletions for E. coli using MinGenome.
MinGenome was run on the E. coli GSM iJO1366 (ref 23) with the requirement of maintaining maximum theoretical growth under aerobic minimal glucose growth medium. Essential genes obtained from EcoliWiki (http://ecoliwiki.net/) were flagged and their deletion was prohibited. We successively ran MinGenome until we reached a cumulative deletion size of 2321 kb. This involved 38 deletions for a final genome size reduction of 50% (see Supporting Information Table S1). In an earlier in vivo genome reduction of E. coli,24 long deletions (larger than 41.4 kb) were achieved by aggregating the results from medium-scale deletions. Herein we compare the 32 longest MinGenome predicted deletions with the 32 large-scale deletions carried out experimentally. Figure 2 superimposes the top 32 deletions from the Profiling of E. coli Chromosome (PEC) database (https://shigen.nig.ac.jp/ecoli/pec/) with the top 32 deletions predicted by MinGenome. The numbering of MinGenome predicted deletions is in descending size order (i.e., deletion 1, …, 32) while the numbering of experimental deletions uses the original deletion name in the PEC database with an additional ‘LD’ prefix when the original label starts with a number (e.g., LD1, LD28, and LD3-32-1). Out of the 32 predicted deletions, 27 have regions overlapping with deletions already carried out experimentally in various genome minimization studies. Deletions 5 and 18 (see Supporting Information Table S1) share the same genes with experimental large-scale deletions LD1 and LD28 but with slightly different start and end positions of the deleted stretches. As many as 14 deletions (i.e., 2, 24, 7, 21, 15, 11, 22, 28, 14, 16, 4, 20, 17 and 10, see also Supporting Information Table S1) are at least 95% identical to the experimental deletions (see Figure 2). Experimental deletions also track well with the MinGenome predictions in the later deletions. For example, MinGenome deletion 35 matches the experimental deletion LD3-32-1 but with slightly different start and end positions.
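A comparison of this kind can be sketched as an interval-overlap calculation. The Python example below is illustrative only: the text does not state how the 95% identity figure was computed, so the `identity` measure used here (shared base pairs divided by the length of the longer interval) is an assumption, as are the function names and the toy coordinates.

```python
def overlap_bp(a, b):
    """Number of shared base pairs between two (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def identity(a, b):
    """Assumed measure: overlap divided by the length of the longer interval."""
    longer = max(a[1] - a[0], b[1] - b[0])
    return overlap_bp(a, b) / longer if longer else 0.0


def classify(predicted, experimental, threshold=0.95):
    """Label each predicted deletion by its best-matching experimental one."""
    report = {}
    for name, interval in predicted.items():
        scores = {e: identity(interval, e_iv) for e, e_iv in experimental.items()}
        best = max(scores, key=scores.get) if scores else None
        score = scores[best] if best is not None else 0.0
        if score >= threshold:
            label = "near-identical"
        elif score > 0:
            label = "overlapping"
        else:
            label = "novel"
        report[name] = (best if score > 0 else None, round(score, 3), label)
    return report


# Toy coordinates in bp; the deletion names are placeholders.
predicted = {"1": (1000, 2000), "2": (5000, 5600), "3": (9000, 9500)}
experimental = {"LD_a": (1010, 2000), "LD_b": (5500, 7000)}
report = classify(predicted, experimental)
```

With real data, the predicted intervals would come from Supporting Information Table S1 and the experimental ones from the PEC database; neither is reproduced here.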
MinGenome also predicted long deletion DNA segments that are not included in the experimentally carried out deletions so far. For example, five of the MinGenome deletions (i.e., deletions 8, 26, 27, 29 and 31) involve stretches of DNA that were absent in experimentally carried out long deletions (see Figure 2). However, the regions do contain independently carried out medium-scale deletions that were not integrated to construct the genome-minimized strain. In the experimental procedure of Hashimoto et al.9, a long deletion was constructed when a number of contiguous medium-scale deletions that had been constructed independently in the previous step (labeled as OCR1 to OCR71 and OCL1 to OCL89) could be aggregated to reach a long deletion greater than 41.4 kb (e.g., LD21 is the combination of OCL47 and OCL48). An unsuccessful medium-scale deletion that contained an essential gene was excluded entirely, so no long deletion meeting the required length could be constructed in that region. However, part of such a medium-scale deletion can still be included to form a long deletion. For example, Deletion 8 was not constructed experimentally because the medium-scale deletion OCL61 contains an essential gene, gltX, encoding glutamate-tRNA ligase (see Figure 3). However, the part of OCL61 that starts at gene ypdK and stops before gene gltX can be combined with two successfully carried out deletions, OCL62 and OCL63 (ref 9), to form a long deletion. Similarly, deletion 26 can be reconstructed based on deletions OCL31 and OCL30 carried out in Hashimoto et al. (see Figure 3).9 Two genes (yraL and rpsO, encoding 16S rRNA 2'-O-ribose C1402 methyltransferase and 30S ribosomal subunit protein S15, respectively) within the deletion OCL30 were considered essential in the original experimental design9, but both of them were reported as non-essential genes in later studies.25,26 In addition, deletions 27, 29, and 31, which are shorter than the 41.4 kb threshold (ranging from 32 kb to 40 kb), were not integrated into the experimental long deletions. However, they also overlap with experimental medium-scale deletions OCL73-1/4 and OCL74, OCL27 and OCL28-2/8/9, and OCR48/49, respectively (see Figure 3), which are potential stretches that can be combined into longer deletions.
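The aggregation rule described above (contiguous successful medium-scale deletions are merged, and only merged stretches above 41.4 kb count as long deletions) can be sketched as follows. This is a simplified reading of the procedure, not a reproduction of the experimental protocol; the function name, the `max_gap` parameter and the toy deletions `M1` to `M5` are hypothetical.

```python
def aggregate(medium, min_length=41_400, max_gap=0):
    """Merge adjacent successful medium-scale deletions into long deletions.

    medium: iterable of (name, start_bp, end_bp, succeeded).
    Returns a list of (member_names, merged_length_bp) for merged runs that
    are at least `min_length` long. A failed deletion breaks the run.
    """
    runs, current = [], []
    for name, start, end, ok in sorted(medium, key=lambda m: m[1]):
        adjacent = bool(current) and max_gap >= start - current[-1][2]
        if ok and adjacent:
            current.append((name, start, end))
            continue
        if current:
            runs.append(current)
        current = [(name, start, end)] if ok else []
    if current:
        runs.append(current)
    merged = [([n for n, _, _ in run], run[-1][2] - run[0][1]) for run in runs]
    return [m for m in merged if m[1] >= min_length]


# Toy data: M3 failed (e.g., it contained an essential gene), which splits
# the region into two runs; only the first one exceeds 41.4 kb.
medium = [
    ("M1", 0, 20_000, True),
    ("M2", 20_000, 45_000, True),
    ("M3", 45_000, 70_000, False),
    ("M4", 70_000, 95_000, True),
    ("M5", 95_000, 110_000, True),
]
long_deletions = aggregate(medium)
relaxed = aggregate(medium, min_length=40_000)
```

The sketch excludes a failed medium-scale deletion entirely, as in the experimental procedure. Rescuing the dispensable part of such a region, as MinGenome does for deletion 8, would require gene-level coordinates rather than whole-deletion intervals.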
In addition, MinGenome on many occasions avoided growth-detrimental deletions that were attempted experimentally. Hashimoto et al.9 hypothesized that the experimental large-scale deletion LD13 (from b2236 to b2276) contained genes responsible for growth rate and cell shape as the cells assumed a different length-to-width ratio after LD13 was introduced. Interestingly, MinGenome predicted no deletion within the part of region LD13 that contains genes menE, menC, menH, and menD (b2260 to b2264), which code for the menaquinone biosynthesis pathway that is essential for the production of biomass precursors menaquinone-8 and 2-demethylmenaquinone-8. Instead, MinGenome predicted two shorter deletions from b2236 to b2259 and b2265 to b2276 at deletions 39 and 91, respectively. Similarly, MinGenome predicted no deletion within the segment of the experimental deletion LD3-17-1 as the knockout of gene gdhA (b1761) encoding glutamate dehydrogenase results in a slower growth rate as predicted by the E. coli model iJO1366 (ref 23). A ∆gdhA mutant was also previously reported to be growth detrimental in vivo.27 It is important to note that Hashimoto et al.9 have shown that the genome-minimized strain grows slower as more long deletions are accumulated. In addition to the growth-adverse gene deletions that were skipped by MinGenome, potential regulatory effects may also exist that affect growth. A more conservative posture is studied in the next case study with additional constraints that flag “putative” essential genes and transcription factors as not to be deleted.
MinGenome avoided the deletion of computationally predicted synthetic lethals (SLs). SLs are sets of genes whose simultaneous deletion prohibits growth. Avoiding the simultaneous deletion of SLs is an important consideration: while constructing the minimal genome JCVI-syn3.0, some of the reduced strains failed to grow due to the deletion of SL pairs.5 A number of computational approaches have been developed to predict SLs from GSMs using bilevel optimization as well as minimal cut set and elementary flux mode methods.28,29 However, the a priori enumeration of all possible SLs (triples, quadruples and even higher combinations) requires significant computational time. MinGenome circumvents this challenge by simply enforcing biomass production as a constraint after all the cumulative deletions up to this point are imposed. The deletions were subsequently compared against available SL datasets28 (see Supporting Information Table S3), confirming that no SLs were simultaneously deleted.
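The post hoc comparison against SL datasets amounts to checking that no SL set is fully contained in the cumulative set of deleted genes. A minimal Python sketch of such a check follows, with hypothetical gene names and function names. It does not reproduce how MinGenome itself avoids SLs, which is done through the biomass constraint rather than through enumeration.

```python
def violated_sl_sets(deleted_genes, sl_sets):
    """Return every SL set (pair, triple, ...) whose members are all deleted."""
    deleted = set(deleted_genes)
    return [tuple(s) for s in sl_sets if deleted.issuperset(s)]


def is_safe_to_add(candidate_genes, already_deleted, sl_sets):
    """True if adding the candidate deletion completes no SL set."""
    combined = set(already_deleted) | set(candidate_genes)
    return not violated_sl_sets(combined, sl_sets)


# Placeholder gene names; real SL sets would come from a published dataset.
sl_sets = [("geneA", "geneB"), ("geneC", "geneD", "geneE")]
already_deleted = {"geneA", "geneC", "geneD"}

ok_candidate = is_safe_to_add({"geneF"}, already_deleted, sl_sets)
bad_candidate = is_safe_to_add({"geneB"}, already_deleted, sl_sets)
violations = violated_sl_sets(already_deleted | {"geneB"}, sl_sets)
```

Deleting one member of an SL pair, or two members of an SL triple, is acceptable; only completing a full set is flagged.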
The MinGenome algorithm was exhaustively applied until no more genes could be removed. A total of 572 genes were retained at the end, with 423 of them included in model iJO1366 (ref 23), while the 149 retained genes that are absent from iJO1366 are the essential genes flagged in the first step of MinGenome. These genes were classified based on their COG function30 (see Figure 4 and Supporting Information Table S2). The total number of COG classifications is 657 as some genes had multiple functions. As shown in Figure 4, essential genes cover 97.4% of the set of genes in information storage and processing. However, more than half of the genes in the remaining categories are non-essential genes that are required to maintain biomass yield at the maximum level, the performance criterion defined in our MinGenome simulation.
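The reason the number of COG classifications (657) exceeds the number of retained genes (572) is that a gene with several functions is counted once per category. A short Python sketch with invented genes and category letters illustrates the tally; it uses no data from Table S2.

```python
from collections import Counter


def cog_tally(gene_cogs):
    """Count category assignments; multi-function genes count once per category."""
    counts = Counter(cat for cats in gene_cogs.values() for cat in cats)
    return counts, sum(counts.values())


# Invented example: g2 carries two categories.
gene_cogs = {"g1": ["J"], "g2": ["C", "E"], "g3": ["E"]}
counts, total_assignments = cog_tally(gene_cogs)

# Bookkeeping reported in the text: model genes plus flagged essential genes.
retained_total = 423 + 149
```

Here three genes yield four assignments, in the same way that 572 retained genes yield 657 classifications.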
E. coli minimized genome using gene essentiality from both E. coli and Buchnera spp. along with transcription factor information.
Here we explored another deletion scheme with two additional criteria compared with the first case study: (i) genes that have homologs in Buchnera spp. were also labeled as essential as suggested in Mizoguchi et al.’s design of the MGF-01 strain12, and (ii) all genes coding for transcription factors with known functions in RegulonDB are retained. Buchnera spp. are symbiotic bacteria closely related to E. coli. Symbionts tend to retain only the absolutely essential genes as most nutrients are imported from their host without a need for the corresponding biosynthetic pathways or genes associated with pathogenicity.31 In addition to enzyme coding genes, here we also retain all genes associated with the known transcriptional regulatory network that activates or represses genes while reducing the genome with MinGenome. Consequently, we flagged all 218 transcription factors in RegulonDB as not to be deleted. There exist both Boolean-based regulatory models2,32 and probabilistic ones33 for E. coli. In this study, we simply enforced that the entire regulatory network remains intact with no provisions as to which transcription factors are needed. In principle, either Boolean or probabilistic models can be used to narrow down the list of essential transcription factors.
As in the first case study, we compared MinGenome predictions with the MGF-01 strain that has a 35% genome reduction with 91 deletions catalogued in the Profiling of E. coli Chromosome (PEC) database (https://shigen.nig.ac.jp/ecoli/pec/). MinGenome terminated when the genome reduction reached 35% after 69 deletions (see Supporting Information Table S4). The longer deletions range in length from 14.7 kb to 63.3 kb and are smaller than the long deletions carried out in the first case study (which range from 31.2 kb to 226 kb). Note that CRISPR/Cas9 can achieve long deletions from 23 kb to 1 Mb, but deletions of larger sizes generally have lower efficiency.34 The MGF-01 strain shares 568 gene deletions with strain MDS43 and 493 gene deletions with strain ∆16 (ref 12). MinGenome deletion predictions and MGF-01 strain deletions share 993 genes (see Figure 5). The MinGenome predictions allude to the possibility of additional genome reduction.
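Shared-deletion counts such as those quoted above are set intersections over deleted gene identifiers. The sketch below shows one way to compute them in Python. The text does not state how its similarity percentages are normalized, so both normalizations offered here (Jaccard index and fraction of the smaller set) are assumptions, and the gene names are placeholders.

```python
def shared_deletions(genes_a, genes_b):
    """Compare two sets of deleted genes."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    smaller = min(len(a), len(b))
    return {
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "fraction_of_smaller": len(shared) / smaller if smaller else 0.0,
    }


# Placeholder gene sets for two hypothetical reduced strains.
strain_x = {"g1", "g2", "g3", "g4"}
strain_y = {"g3", "g4", "g5"}
summary = shared_deletions(strain_x, strain_y)
```

Reporting the raw shared count alongside an explicitly named normalization avoids ambiguity when strains delete very different numbers of genes.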
We compared the deletions of metabolic reactions carried out in the MGF-01 strain and in the second case study of MinGenome. We find that reaction dispensability strongly depends on the type of metabolism, with largely convergent results between MinGenome and experimental studies (see Figure 6). Secondary carbon metabolism is by far the most highly reduced set of reactions: 62.9% and 50.3% of the deleted reactions fall in alternative carbon metabolism and inner and outer membrane transport in vivo and in MinGenome predictions, respectively (see Figure 5). As glucose was selected as the only carbon source, we observe that a number of deleted pathways are responsible for the uptake and metabolism of alternative substrates such as D-glucose 1-phosphate, branching glycogen, D-fructose, glycerol, D-xylose, and D-ribose.
We also find that MinGenome predictions preferentially eliminate reactions involved in alternative/redundant pathways compared to existing minimization studies (see Supporting Information Table S5). In particular, in oxidative phosphorylation pathways, as many as 12 redundant electron transport reactions are deleted involving the oxidation/reduction of ubiquinone, menaquinone, and demethylmenaquinone that enable growth under a variety of conditions. In the citric acid cycle, two malate dehydrogenase (MDH) alternate reactions (MDH2 and MDH3) that convert malate to oxaloacetate are suggested for deletion by MinGenome. Both of them use the secondary activity of malate oxidoreductase (gene mqo) for the malate conversion. Note that deletion of the primary gene (∆mdh) results in a severe growth defect35. MinGenome here correctly prioritized the deletion of alternative reactions (e.g., keep MDH and delete MDH2/MDH3) because gene mdh is in closer proximity to essential genes than gene mqo, thus preventing a longer deletion. Similarly, in amino acid metabolism pathways, the deletion of the NADPH-dependent glutamate synthase reaction (GLUSy) is suggested while the NADH-dependent reaction (GLUS) is retained. In purine and pyrimidine biosynthesis pathways, the primary reaction GAR transformylase-T (GART) is retained while the alternative reaction phosphoribosylglycinamide formyltransferase (GARFT) in the inosine monophosphate (IMP) biosynthesis pathway is deleted. In contrast, in the nucleotide salvage pathway, reactions inosine kinase (INSK) and AMP nucleosidase (AMPN) are deleted while the alternative reactions 5'-nucleotidase (NTD11) and adenine phosphoribosyltransferase (ADPT) are retained. In all cases, MinGenome makes decisions on reaction retention based on the potential to generate the longest deletion stretch without considering whether the retained pathway is primary or secondary. Informed by experimental information, a user can flag primary reaction pathways as not to be deleted when using MinGenome.
The deletions suggested by MinGenome were also implemented in E. coli’s genome-scale model of metabolism and expression (ME model)36 to verify the viability of the mutants after the successive deletions with the additional consideration of protein resource allocation to enzymes and transcription/translation machinery. A number of studies have shown that protein allocation for RNA polymerase37 and ribosomes38 is linearly correlated with E. coli’s growth rate. Interestingly, even after the cumulative imposition of 69 deletions, no growth defect was predicted by the ME model simulation. This suggests that the protein allocation for transcription/translation may be sufficient for cell growth after the deletions.
Identification of multiple longest contiguous deletions for E. coli using MinGenome while allowing for one gene re-insertion (per deletion) to the genome. As genetic engineering tools (ref 39) allow for the efficient integration of genes back into the genome, here we explored whether it is possible to reach the same or higher levels of genome reduction by deleting fewer but longer stretches of DNA. CRISPR/Cas9 has recently been reported to insert sequences no longer than 2 kb in E. coli with more than 90% efficiency using dsDNA as the editing template (ref 40). It has also been shown that the FLP-FRT site-specific recombination system can remove a stretch of DNA while moving the essential genes within the contiguous region to a complementary mini-F plasmid (ref 24). We therefore explored genome reduction with a re-insertion strategy by applying the MinGenome algorithm with the corresponding option switched on. As before, we required retention of the maximum theoretical growth under aerobic minimal glucose medium. Results (see Supporting Information Table S1) confirm that it is indeed possible to delete larger stretches of DNA with the provision of gene re-insertion. The new longest deletion combines deletions 1 and 3 to form a single contiguous deletion. Interestingly, the gene that needs to be re-inserted (i.e., aldA encoding aldehyde dehydrogenase A) is not essential (ref 41). However, the associated reaction glycolaldehyde dehydrogenase (GCALDD) is predicted to be essential based on model iJO1366 (ref 23). This is because it is needed for degradation of the glycolaldehyde side product in folate metabolism. Aziz et al. (ref 42) showed a hidden pathway involving gene prpC that can convert glycolaldehyde to glycolyl-CoA to bypass this function. Upon adding this pathway to the model, we find a longer deletion with the option of no re-insertion that matches the new longest deletion that needs one gene re-insertion.
The cumulative deletions with and without one gene re-insertion of the second case study are subsequently compared (see Figure 7 and Supporting Information Table S6). The number of deletions needed to reach 35% genome reduction decreases from 69 to 56 when one gene re-insertion is allowed. The total genome reduction without gene re-insertion reaches only 30.9% after 56 deletions. As expected, MinGenome predicts longer genomic deletions when re-insertion is allowed. In addition, the combined larger deletions often reveal genes falsely predicted as essential by iJO1366 (ref 23), as in the case of aldA. Another such example is gene thiD encoding phosphomethylpyrimidine kinase in deletion 53. In model iJO1366 (ref 23), the ∆thiD mutant is predicted to be lethal because phosphomethylpyrimidine kinase synthesizes 4-amino-2-methyl-5-(diphosphomethyl)pyrimidine (HMP-PP), an essential precursor of thiamine pyrophosphate (vitamin B1). However, gene thiD2, which is missing in iJO1366, was reported to encode the same phosphomethylpyrimidine kinase (ref 43). As a result, an in vivo deletion of the combined regions showed no growth defect, corroborating the non-existence of an essential gene within the region and thus obviating the need for the re-insertion step.
CONCLUSION
This paper introduces the MILP optimization algorithm MinGenome for the construction of genome-minimized strains. MinGenome performs a top-down genome reduction approach that successively eliminates metabolic and regulatory genes starting with the longest possible deletion first. In order to avoid lethal or growth-defective deletions, essential genes and synthetic lethal pairs are retained by enforcing a constraint on biomass yield. We applied MinGenome to predict deletions for the E. coli K-12 MG1655 genome. Compared with deletions already carried out experimentally, MinGenome predictions match the experimental long deletions well. In addition, MinGenome was able to predict new deletions and avoid growth-detrimental ones. Finally, we explored MinGenome's capability to reach the same or higher levels of genome reduction by deleting fewer but longer stretches of DNA while allowing for gene re-insertions. MinGenome is a general algorithm that can be applied to other organisms assuming that information on gene annotation, metabolic pathways, biomass description, gene location and essentiality, promoter sites, and operon structure is available (ref 44). The quality of the obtained genome reduction scheme depends upon the accuracy of the provided information. MinGenome allows as options (i) gene re-insertion, (ii) retention of transcription factor genes, and (iii) user-supplied expansion of the list of essential genes. It is available as open-source Python/C++ programs on the Maranas Lab's GitHub website.
Moving forward, a key challenge for MinGenome is predicting the effect of the deletion of genes serving non-metabolic roles. Possibly, a transcriptional regulatory model, such as a Boolean-based (refs 2, 32) or probabilistic (ref 33) one, can be used in conjunction with the GSM to screen the deletion of needed TFs. Kinetic models (refs 45, 46) and ME models (ref 36) could also be used to make decisions on reaction retention that favor higher enzyme efficiency or lower enzyme cost. In addition, genes with unknown functions (ref 47) will be revealed as essential if MinGenome predictions lead to lethal outcomes, thus helping to determine their function. Currently, MinGenome relies on user input to flag them as no-deletion genes. Ultimately, we anticipate that one of the most important contributions of MinGenome would be the identification of previously unknown essential functions whenever a newly deleted stretch causes lethality. By systematically probing the entire genome for hidden essential (or synthetic lethal) functions, the pace of gene annotation would be accelerated.
METHODS
A detailed description of MinGenome is presented in this section. First, we illustrate the workflow and detailed mathematical formulation of the MinGenome algorithm along with a toy example to clarify its function. Next, we show how the algorithm can be extended to allow for multiple simultaneous deletions while keeping the genome-minimized strain consistent with a set of performance criteria. Finally, we demonstrate how the MinGenome algorithm can identify even larger dispensable stretches on the chromosome while allowing for gene re-insertions.
MinGenome: Optimization-driven workflow for genome minimization. MinGenome requires as input data (i) a genome-scale model, (ii) the genome sequence, (iii) gene essentiality, and (iv) transcriptional units (see Supporting Information Table S7 for details). The genome sequence of Escherichia coli str. K-12 substr. MG1655 (NC_000913.3) was downloaded from GenBank (ref 19). The latest GSM reconstruction, iJO1366 (ref 23), was used here. Experimentally verified essential genes (ref 48) were also flagged so that they would not be deleted during the simulation. This is important as the metabolic model can capture gene essentiality only for genes with a metabolic role. However, only 20% of essential genes have a known metabolic function; 75.6% of essential genes are responsible for cell envelope/division, protein quality control, DNA modification and maintenance, as well as protective functions that are not modeled in the GSM, and 4.4% of essential genes in E. coli do not have a known function (ref 49). Promoter region information for each transcription unit was obtained from the EcoCyc database (ref 20). The deletion of the promoter of a transcription unit results in no expression of the genes in that transcription unit.
The MinGenome algorithm is posed and solved as a MILP problem. It requires the definition of the following sets, parameters, and variables.
Sets
$I = \{i \mid i = 1, \dots, |I|\}$: set of metabolites
$J = \{j \mid j = 1, \dots, |J|\}$: set of reactions
$K = \{k \mid k = 1, \dots, |K|\}$: set of genes
$P = \{p \mid p = 1, \dots, |P|\}$: set of promoters
$G \subset K \cup P$: set of genes and promoters
$P_k \subset P$: set of promoters that initiate the expression of gene $k$.
Parameters
$L_k^{\mathrm{start}}$: Position, counted from the origin of replication, of the first nucleotide of the deleted sequence when gene or promoter $k$ is selected as the beginning of the deleted stretch. Note that $L_k^{\mathrm{start}}$ is not always the start site of a gene or a promoter. It is the first nucleotide of the non-overlapped region between gene/promoter $k$ and gene/promoter $k - 1$ (see Figure 8 and Supporting Information Table S7).
$L_k^{\mathrm{end}}$: Position of the first nucleotide of the gene or promoter $k$ immediately after the deleted sequence (see Figure 8 and Supporting Information Table S7).
Variables
$$x_k = \begin{cases} 1, & \text{if gene or promoter } k \text{ is the first gene or promoter within the deleted segment} \\ 0, & \text{otherwise} \end{cases}$$
$$y_k = \begin{cases} 1, & \text{if gene or promoter } k \text{ is immediately after the end of the deleted segment} \\ 0, & \text{otherwise} \end{cases}$$
$$z_k = \begin{cases} 1, & \text{if gene or promoter } k \text{ is deleted} \\ 0, & \text{otherwise} \end{cases}$$
$v_j$: flux of reaction $j$ (mmol gDW$^{-1}$ h$^{-1}$)
MILP representation of MinGenome algorithm
$$\max \; \sum_{k \in G} y_k L_k^{\mathrm{end}} - \sum_{k \in G} x_k L_k^{\mathrm{start}}$$
subject to
$$\sum_{j \in J} S_{ij} v_j = 0, \quad \forall i \in I \tag{1}$$
$$\sum_{k \in G} x_k = 1, \quad \sum_{k \in G} y_k = 1 \tag{2}$$
$$\sum_{j=1}^{k} x_j - \sum_{j=1}^{k} y_j = z_k, \quad \forall k \in G \tag{3}$$
$$v_{\mathrm{biomass}} \ge U \cdot v_{\mathrm{biomass,max}} \tag{4}$$
$$z_k = \prod_{p \in P_k} z_p, \quad \forall k \in K \tag{5}$$
$$\text{GPR constraints for reaction } j \text{ (see below)}, \quad \forall j \in J \tag{6}$$
$$x_k, y_k, z_k \in \{0, 1\}, \quad \forall k \in G$$
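Before the individual terms are discussed, a brute-force analogue of the optimization on a toy genome may help fix ideas. The sketch below is not the MinGenome implementation: it replaces the MILP and the biomass constraint with a plain enumeration and a hypothetical set of elements that must be kept, and the function name, positions, and essential set are all illustrative assumptions. It only shows how the objective rewards the difference between the end position of the element following the deletion and the overlap-adjusted start position.

```python
def longest_deletion(l_start, l_end, essential):
    """Enumerate (start, end) pairs on a toy linear genome.

    l_start[k]: first deletable nucleotide if element k opens the deletion
                (already shifted past any overlap with element k - 1).
    l_end[k]:   first nucleotide of element k, used when k directly
                follows the deleted stretch.
    essential:  indices that must be kept (a stand-in for the biomass
                constraint and the user-flagged no-deletion genes).
    Elements s, ..., e - 1 play the role of z_k = 1.
    """
    best = None
    n = len(l_start)
    for s in range(n):
        for e in range(s + 1, n):
            if any(k in essential for k in range(s, e)):
                continue  # this stretch would remove a required element
            length = l_end[e] - l_start[s]  # objective value of this pair
            if best is None or length > best[0]:
                best = (length, s, e)
    return best


# Hypothetical toy data: element 1 overlaps element 0, so its deletable
# region starts at 120 although the element itself starts at 100.
toy_l_start = [0, 120, 300, 450, 700, 900]
toy_l_end = [0, 100, 300, 450, 700, 900]
toy_essential = {0, 3}

print(longest_deletion(toy_l_start, toy_l_end, toy_essential))
```

In this toy case the best choice opens the deletion at element 1 and ends just before element 3, removing 330 nucleotides; the overlap with element 0 is left untouched because the start position was shifted.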
The objective function maximizes the distance between the first and the last deletion event, thus yielding the longest contiguous stretch that can be deleted while satisfying all imposed requirements. The first deletion event occurs at the nucleotide of the gene/promoter whose deletion has no effect on any overlapping genes or promoters, whereas the last deletion event is pegged at the last nucleotide before the gene or promoter immediately after the deleted sequence (see Figure 8). This ensures that the correct length of the deleted segment is assessed. Constraint 1 is the standard FBA steady-state metabolite conservation requirement. Constraint 2 ensures that only one of the genes or promoters will be selected as the start, and only one as the end, of the deletion event. Constraint 3 ensures that all the genes/promoters after the start gene/promoter and before the end gene/promoter will also be deleted. Note that $z_k = 1$ implies deletion of gene/promoter $k$. In the toy example (see Figure 9A), genes 2 through 4 are within the deletion stretch when the start indicator $x_2 = 1$ and the end indicator $y_5 = 1$. As a result, $z_m = 1$ for $m \in \{2, 3, 4\}$. Constraint 4 imposes a requirement on the biomass yield as a fraction of the maximum theoretical value. In all case studies, $U$ is set equal to one, implying that the maximum theoretical biomass yield is retained upon the imposed deletions. As discussed earlier, additional requirements on the maximum theoretical yield of other fluxes can be imposed along with requirements on biomass depending on the ultimate production goals of the minimized strain. Constraint 5 ensures that a gene is non-functional if all of its promoters ($P_k$) are deleted (see Supporting Information Table S7 for details). It is worth mentioning that one gene may have more than one promoter. For example, genes ileS, ispA, fkpB, and ispH (from b0026 to b0029) have two promoters, ileSp1 and ispAp, and both of them bind to sigma factor $\sigma^{70}$ (ref 20). Constraint 5 defines AND relationships that can be reformulated as linear constraints as follows:
$$z_k \ge \sum_{p \in P_k} z_p - (|P_k| - 1) \tag{7}$$
$$z_k \le z_p, \quad \forall p \in P_k \tag{8}$$
where $|P_k|$ is the number of promoters for gene $k$. Based on constraints 7 and 8, $z_k = 1$ only if $z_p = 1$ for all $p \in P_k$, which is identical to constraint 5.
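The equivalence between the product form of constraint 5 and its linear reformulation can be checked by exhaustive enumeration. The following sketch is only an illustrative check, not part of MinGenome; the function names and the cap of four promoters per gene are assumptions made here. For every binary assignment of promoter deletion indicators, it verifies that the only gene indicator satisfying constraints 7 and 8 is the product of the promoter indicators.

```python
from itertools import product


def linear_and_feasible(z_gene, z_promoters):
    """True if (z_gene, z_promoters) satisfies constraints 7 and 8."""
    n = len(z_promoters)
    c7 = z_gene >= sum(z_promoters) - (n - 1)
    c8 = all(z_gene <= z_p for z_p in z_promoters)
    return c7 and c8


def linearization_matches_product(max_promoters=4):
    """Brute-force check over all binary promoter assignments."""
    for n in range(1, max_promoters + 1):
        for z_promoters in product((0, 1), repeat=n):
            target = int(all(z_promoters))  # product of binary values
            feasible = [z for z in (0, 1)
                        if linear_and_feasible(z, z_promoters)]
            if feasible != [target]:
                return False
    return True


print(linearization_matches_product())
```

Because the check is exhaustive only up to the chosen number of promoters, it illustrates the argument rather than replacing it; the general case follows from the same two inequalities.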
GPR constraints. Constraint 6 links reactions and genes through appropriate gene-to-protein-to-reaction (GPR) constraints (ref 28). Here $LB_j$ and $UB_j$ denote the lower and upper bounds on the flux of reaction $j$. The following cases need to be accounted for:
(i) A single gene $k$ codes for the enzyme catalyzing reaction $j$ (i.e., one-to-one mapping).
$$(1 - z_k) LB_j \le v_j \le (1 - z_k) UB_j$$
(ii) Two genes $k_1$ and $k_2$ code for an enzyme complex catalyzing reaction $j$.
$$\begin{cases} (1 - z_{k_1}) LB_j \le v_j \le (1 - z_{k_1}) UB_j \\ (1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_2}) UB_j \end{cases}$$
(iii) Two genes $k_1$ and $k_2$ code for two isozymes catalyzing reaction $j$.
$$\begin{cases} (1 - z_{k_1} + 1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_1} + 1 - z_{k_2}) UB_j \\ LB_j \le v_j \le UB_j \end{cases}$$
(iv) More complex GPR relations.
The complete logic statement that describes the GPR relation is posed using AND and OR statements. The logic statement is converted into constraints (ref 41) similar to the ones for (ii) and (iii). For example, the constraints for genes $k_1$, $k_2$, and $k_3$ with a GPR expressed as the logic statement {($k_1$ AND $k_2$) OR ($k_1$ AND $k_3$)} are as follows:
$$\begin{cases} (1 - z_{k_3} + 1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_3} + 1 - z_{k_2}) UB_j \\ (1 - z_{k_1}) LB_j \le v_j \le (1 - z_{k_1}) UB_j \\ LB_j \le v_j \le UB_j \end{cases}$$
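The three-gene example can be checked numerically. The sketch below is an illustration under stated assumptions rather than MinGenome code: it assumes bounds with a non-positive lower bound and a non-negative upper bound, and the function names and the bound values of -10 and 10 are hypothetical. It intersects the three inequalities for the GPR {(k1 AND k2) OR (k1 AND k3)} and compares the resulting flux range with a direct evaluation of the logic statement.

```python
from itertools import product


def flux_bounds(z1, z2, z3, lb, ub):
    """Flux range implied by the three constraints of the example.

    z1, z2, z3 are deletion indicators (1 = deleted) for k1, k2, k3.
    Assumes lb <= 0 <= ub so that scaling by 0, 1, or 2 behaves as intended.
    """
    either = (1 - z3) + (1 - z2)  # k2 OR k3: zero only if both are deleted
    needed = 1 - z1               # k1 is required by both branches
    lower = max(either * lb, needed * lb, lb)
    upper = min(either * ub, needed * ub, ub)
    return lower, upper


def gpr_active(z1, z2, z3):
    """Direct evaluation of (k1 AND k2) OR (k1 AND k3)."""
    k1, k2, k3 = (z1 == 0), (z2 == 0), (z3 == 0)
    return (k1 and k2) or (k1 and k3)


for z in product((0, 1), repeat=3):
    print(z, flux_bounds(*z, lb=-10, ub=10), gpr_active(*z))
```

Whenever the logic statement is false, the flux is pinned to zero; otherwise the original bounds are recovered, which is the behavior the constraints are meant to encode.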
Identifying multiple simultaneous deletions with MinGenome. In order to obtain multiple deletions without synthetic lethal gene pairs, the right-hand side of constraint 2 is modified as follows:
$$\sum_{k \in G} x_k = N, \quad \sum_{k \in G} y_k = N \tag{9}$$
Here $N$ is the pre-specified number of contiguous genome deletions. Note that because all $N$ deletions are simultaneously performed, no synthetic lethal pairs (or higher-order combinations) will be included in the list of genes to be deleted. As shown in Figure 9A, the top three deletions in the genome of E. coli are predicted by the model. The starts of the three deletions are $x_2 = 1$, $x_7 = 1$, and $x_{12} = 1$, and the ends of the three deletions are $y_5 = 1$, $y_{10} = 1$, and $y_{15} = 1$. The modified constraint 2 (i.e., constraint 9 with $N = 3$) and constraint 3 are satisfied since $x_2 + x_7 + x_{12} = 3$ and $y_5 + y_{10} + y_{15} = 3$. It is important to note that constraint 3 ensures that those deletions will not overlap with each other. For example, constraint 3 ensures that $y_5 = 1$ must happen between $x_2 = 1$ and $x_7 = 1$. As shown in Figure 9A, if deletion 1 overlaps with a new deletion 2 (e.g., $x_4 = 1$ happens before $y_5 = 1$), constraint 3 ($\sum_{j=1}^{4} x_j - \sum_{j=1}^{4} y_j = z_4$) is violated. As $x_2 = x_4 = 1$, constraint 3 indicates that $z_4 = 2$, which contradicts the constraint $z_4 \in \{0, 1\}$. As a result, overlapped deletions are precluded from the MinGenome predictions. MinGenome is applied successively to identify the longest deletions. In iteration $n$, the information on deletions gained through the previous $n - 1$ iterations is used as parameters and constraint 9 is modified (i.e., the constant is changed from $n - 1$ to $n$) to allow for the search of the $n$-th deletion.
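The non-overlap argument can be reproduced with running sums. The sketch below is a minimal illustration of the cumulative-sum constraint on the toy example, not the MinGenome code; the function names and the toy genome length of 16 genes are assumptions. It computes the deletion indicator of each gene as the running count of starts minus the running count of ends and flags any value outside {0, 1}.

```python
from itertools import accumulate


def deletion_indicator(starts, ends, n_genes):
    """Running sum of start indicators minus running sum of end indicators.

    starts: genes that open a deletion (x_k = 1)
    ends:   genes immediately after a deletion (y_k = 1)
    Returns the list [z_1, ..., z_n].
    """
    x = [1 if k in starts else 0 for k in range(1, n_genes + 1)]
    y = [1 if k in ends else 0 for k in range(1, n_genes + 1)]
    return [cx - cy for cx, cy in zip(accumulate(x), accumulate(y))]


def is_binary(z):
    """The indicator must stay in {0, 1} for the deletions to be valid."""
    return all(value in (0, 1) for value in z)


# Three non-overlapping deletions of the toy example (Figure 9A).
z_ok = deletion_indicator({2, 7, 12}, {5, 10, 15}, n_genes=16)
print([k for k, value in enumerate(z_ok, start=1) if value == 1])

# A second deletion that starts at gene 4, before the first one ends at gene 5.
z_bad = deletion_indicator({2, 4}, {5, 10}, n_genes=16)
print(z_bad[3], is_binary(z_bad))
```

In the overlapping case the indicator of gene 4 evaluates to 2, so the binary requirement fails and the configuration is infeasible.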
Genome minimization algorithm for larger dispensable stretches on the chromosome while allowing for gene re-insertions. MinGenome can also be modified to allow for presumably larger deletions by allowing for one or more required genes to be re-inserted back into the genome. Constraint 3 is replaced by the following constraints:
$$\sum_{j=1}^{k} x_j - \sum_{j=1}^{k} y_j = w_k, \quad \forall k \in G \tag{10}$$
$$w_k \ge z_k, \quad \forall k \in G \tag{11}$$
$$\sum_{k \in G} w_k = \sum_{k \in G} z_k + 1 \tag{12}$$
where $w_k$ is defined as follows:
$$w_k = \begin{cases} 1, & \text{if gene } k \text{ is within a larger dispensable region to be deleted} \\ 0, & \text{otherwise} \end{cases}$$
Constraint 10 defines the larger deletion region between $x_j$ and $y_j$ predicted by the model. Constraints 11 and 12 allow one gene in the larger region to be re-inserted back into the genome. Figure 9B depicts an example in which MinGenome predicted a longer deletion by combining two deletions (gene 2 to gene 3 and gene 5 to gene 7), and essential gene 4 ($z_4 = 0$ and $w_4 = 1$) within the longer deletion should be re-inserted into the genome (see Table 1).
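The bookkeeping of the re-insertion constraints can be illustrated on the Figure 9B example. The sketch below is an assumption-laden illustration rather than the MinGenome implementation: it handles a single deletion, assumes that gene 8 is the first gene after the combined stretch (the text only states that the stretch runs from gene 2 to gene 7), and uses hypothetical function and argument names. It builds the region indicator and the deletion indicator, then checks constraints 11 and 12.

```python
def check_reinsertion(start, end, kept, n_genes):
    """Evaluate constraints 11 and 12 for one contiguous region.

    start:   first gene inside the region (x_start = 1)
    end:     first gene after the region (y_end = 1)
    kept:    genes inside the region that are re-inserted (z_k = 0)
    Returns (w, z, feasible), with w and z as dicts keyed by gene number.
    """
    genes = range(1, n_genes + 1)
    # Constraint 10 for a single region: w_k = 1 exactly for start <= k < end.
    w = {k: int(start <= k < end) for k in genes}
    # A gene is permanently deleted if it lies in the region and is not kept.
    z = {k: int(w[k] == 1 and k not in kept) for k in genes}
    c11 = all(w[k] >= z[k] for k in genes)
    c12 = sum(w.values()) == sum(z.values()) + 1  # exactly one re-insertion
    return w, z, c11 and c12


w, z, feasible = check_reinsertion(start=2, end=8, kept={4}, n_genes=9)
print(sum(w.values()), sum(z.values()), feasible)
```

With gene 4 kept, the region covers six genes and five of them are deleted, so the one-gene gap required by constraint 12 is met; keeping no gene or two genes breaks that equality.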
ASSOCIATED CONTENT
Supporting information
Supporting Information Tables S1 to S7 for three case studies carried out in the paper.
AUTHOR INFORMATION
Corresponding author
*Tel.: (814) 863-9958, Email: [email protected]
Author contributions
L.W. and C.D.M. conceived the study and designed the algorithms. L.W. performed the simulations, data analysis, and interpretation. Both authors contributed to writing the manuscript and discussion.
Notes
The authors declare no competing financial interests.
ACKNOWLEDGMENTS
The authors acknowledge the inputs given by Thomas J. Mueller, Margaret N. Simons, Satyakam Dash, and Joshua Chan at the various stages of idea refinement and implementation. The authors gratefully acknowledge funding from the NSF (http://www.nsf.gov/) award NSF/MCB 1546840. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.
ABBREVIATIONS
GSM, genome-scale model; MILP, mixed-integer linear programming; LD, large-scale deletion
For Table of Contents Use Only
FIGURES AND TABLES
Figure 1. Schematic representation of the MinGenome algorithm. The MinGenome algorithm builds on GSMs with information on essential genes and on gene and promoter positions. Logic constraints are imposed on promoters, genes, and reactions. MinGenome provides three additional options: (i) gene re-insertion, (ii) retention of transcription factor genes, and (iii) user-supplied expansion of the list of essential genes. MinGenome identifies the sequence of deletions starting with the largest dispensable region and proceeding monotonically to shorter ones.
Figure 2. Comparison between 32 experimental long deletions (blue) and the top 32 long deletions predicted by MinGenome (orange). MinGenome-predicted deletions are in descending size order (i.e., from deletion 1 to deletion 32), and experimental deletions keep their initial names in the PEC database with an additional 'LD' prefix when the original label starts with a number (e.g., LD1, LD14, OCR37(-km), and LD3-15-1Y). In addition to a number of regions overlapping with experimentally carried out deletions, MinGenome predicted additional long deletions and avoided potentially growth-detrimental deletions.
Figure 3. Five MinGenome predictions (deletions 8, 26, 27, 29, and 31) and the corresponding experimentally carried out medium-scale deletions that were constructed independently at the same locations. Experimental deletions are labeled with their names in the PEC database (e.g., OCL63, OCL31, and OCR48,49-8). Genes colored red are truly essential, and genes colored green are falsely predicted as essential.
Figure 4. COG-based functional analysis of the genes in the minimal genome predicted by the MinGenome algorithm. The genes are categorized into five groups: (i) information storage and processing, (ii) cellular processes, (iii) metabolism, (iv) poorly characterized, and (v) COG not assigned. We observed that most of the non-essential genes retained in the minimal genome are in "Metabolism" and are required to maintain the maximum biomass yield.
Figure 5. Venn diagram comparing the genes deleted in strain MGF-01 with the MinGenome predictions.
[Figure 6 bar chart. y-axis: number of reaction deletions; legend: experimental deletion, MinGenome prediction]
Figure 6. Comparison of the deletions of metabolic reactions between MinGenome predictions and the MGF-01 strain in the second case study. Reactions are categorized based on their subsystems in iJO1366 (ref 23). Blue bars indicate the reactions deleted in the MGF-01 strain, and orange bars indicate the reactions deleted in MinGenome predictions. MinGenome predictions have more deletions in glycerophospholipid metabolism, nucleotide salvage pathways, oxidative phosphorylation, amino acid metabolism, and the pentose phosphate pathway.
Figure 7. Comparison of the percentage of genome reduction in the first 56 deletions between the MinGenome algorithm with and without gene re-insertion. Blue bars indicate the cumulative MinGenome deletions while allowing for gene re-insertion, and orange bars indicate the cumulative MinGenome deletions with the re-insertion option switched off.
Figure 8. Definitions of the start site and the end site of a deletion in the MinGenome algorithm. The red line indicates a contiguous stretch (from gene 2 to gene 6) that MinGenome predicted as a long deletion, and the orange line indicates the overlapped region between gene 1 and promoter 2. Note that the stretch of DNA for promoter 2 cannot be deleted entirely due to the region overlapping with gene 1. MinGenome circumvents this challenge by defining the start site ($L_k^{\mathrm{start}}$), which adjusts the beginning nucleotide of the long deletion to the non-overlapped region, and the end site ($L_k^{\mathrm{end}}$), which ensures that the deletion does not affect the promoter or gene immediately after the long deletion.
Figure 9. A toy example of MinGenome. (A) Three deletion segments were constructed in the genome. The three red lines indicate long deletions 1, 2, and 3. $z_k = 1$ indicates the deletion of gene $k$; $x_k = 1$ and $y_k = 1$ mark the first gene in the deleted stretch and the first gene immediately following the deleted stretch, respectively. (B) Longer deletion by the MinGenome algorithm while allowing for one essential gene re-insertion. The long deletion is constructed by combining two deletions, (i) gene 2 to gene 3 and (ii) gene 5 to gene 7, and gene 4 is allowed to be re-inserted into the genome. Note that, for clarity, the involvement of promoters is omitted here.
Table 1: Toy example of one gene re-insertion into the genome. The longer deletion combines two deletions: (i) gene 2 to gene 3 and (ii) gene 5 to gene 7.
variable 1