
MinGenome: An in silico top-down approach for the synthesis of minimized genomes Lin Wang, and Costas Maranas ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.7b00296 • Publication Date (Web): 18 Dec 2017 Downloaded from http://pubs.acs.org on December 23, 2017


ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ACS Synthetic Biology

Research article

MinGenome: An in silico top-down approach for the synthesis of minimized genomes

Lin Wang¹ and Costas D. Maranas¹*

¹ Department of Chemical Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, United States

ABSTRACT

Genome minimized strains offer advantages as production chassis by reducing transcriptional cost, eliminating competing functions and limiting unwanted regulatory interactions. Existing approaches for identifying stretches of DNA to remove are largely ad hoc, based on information on presumably dispensable regions through experimentally determined non-essential genes and comparative genomics. Here we introduce a versatile genome reduction algorithm, MinGenome, that implements a mixed integer linear program (MILP) to identify, in descending order of size, all dispensable contiguous sequences whose removal does not affect the organism’s growth or other desirable traits. Known essential genes or genes that cause significant fitness or performance loss can be flagged and their deletion prohibited. MinGenome also preserves needed transcription factors and promoter regions, ensuring that retained genes will be properly transcribed, while also avoiding the simultaneous deletion of synthetic lethal pairs. The potential benefit of removing even larger contiguous stretches of DNA if only one or two essential genes (to be re-inserted elsewhere) are within the deleted sequence is explored. We applied the algorithm to design a minimized E. coli strain and found that we were able to recapitulate the long deletions identified in previous experimental studies and discover alternative combinations of deletions which have not yet been explored in vivo.

KEYWORDS: MILP; top-down; minimal genome; genome-scale model; E. coli

Microbes such as E. coli, Saccharomyces cerevisiae, and Clostridia have been widely applied in metabolic engineering projects for biofuels and biorenewables.¹ However, the complexity of microbial metabolism requires a sophisticated approach in the design of interventions. The deletion of a seemingly unrelated gene may have deleterious consequences due to regulatory effects, alterations of cofactor ratios, accumulation of intermediates or inability to recycle metabolites. Many of these metabolic interdependencies are inherent in biology as new traits are accumulated in response to changing selection pressures and random mutation. They offer robustness to gene disruptions or changes to the external environment but they may also trigger surprising and hard-to-predict unwanted responses for bioproduction.² Metabolic models have come a long way towards capturing these effects but unknowns still remain.² A possible way to circumvent some of the biological complexity is to re-write the genome by refactoring genes so that regulation is predictable and controllable and unneeded genes are eliminated.³,⁴ This approach has led to simpler and functional gene clusters. A minimized genome has inherently fewer interactions with heterologous pathways and fewer unknown elements; it is thus more amenable to prediction through modeling and can serve as an ideal and controllable chassis for efficient bioproduction.

Genome reduction can be achieved by either a bottom-up or a top-down strategy. Bottom-up genome minimization requires the de novo assembly of pathways into long stretches and the linking of all necessary components into a single contiguous chromosome. This level of engineering requires complete knowledge of all biological processes and interactions thereof along with genome stability imperatives. To our knowledge, even though the de novo assembly of the entire genomes of M. genitalium⁵ and yeast⁶ has been achieved, no re-engineering has been attempted. In contrast, top-down genome minimization works by successively removing non-essential contiguous genome stretches. Generally, a few large stretches are first removed followed by many shorter deletions.⁷ The main advantage of the top-down approach is that the starting point is an operational genome. Therefore, any adverse effects caused by a deletion can always be remedied by reverting to the state before the latest deletion. Even though the top-down approach may not reach a “true” minimal genome, it is a more pragmatic approach for constructing bioproduction chassis as genome reduction proceeds only up to the point that the minimized strain reaches growth and production yield and rate goals.

Many top-down genome reductions have been carried out in E. coli over the last decade. The first reduced genome shrunk the E. coli genome by 8.4% by removing its genomic islands.⁸ Hashimoto et al. further reduced the genome by 29.7% by deleting many non-essential genes.⁹ However, the genome-minimized strain exhibited a slower growth rate and deformed cell morphology. Posfai et al. managed to reduce the genome by 15.3% without affecting growth rate.¹⁰ In this strain (MDS43), mobile DNA and pathogenic genes present on the chromosome were deleted. The goal of minimizing the genome by 30% while maintaining growth rate spurred the ‘Minimal genome factory’ (MGF) research project in Japan.¹¹ The first strain (MGF-01), lacking 22% of the genome, was constructed by removing a subset of non-identical regions between E. coli and a close relative, Buchnera spp.¹² Starting from MGF-01, a strain reaching a 30% genome reduction was constructed by introducing viable deletions from earlier studies.¹³ Recently, Hirokawa et al. extended the reduction to 35.2% by deleting more dispensable regions on the MGF-01 strain.¹⁴ All of the identified deletions relied on comparative genomics and a single gene knockout library. Metabolic modeling was not part of the analysis; it is therefore plausible that many genes were deemed essential even though less characterized bypassing pathways may be available.¹⁵ The lack of convergence to a unique reduction in genome minimization studies alludes to the possibility for even larger reductions and/or alternative deletions. For example, MGF-01 shares only 568 gene deletions with the ones carried out in strain MDS43 (only 52.5% similarity).¹² As more sophisticated genome editing tools (e.g., CRISPR¹⁶) are becoming commonplace, the need for a computational aid that will help successively minimize genomes consistent with a set of performance criteria beyond simply growth rate is becoming more pressing.

To address current limitations of computational tools for genome minimization, herein we present MinGenome, a mixed-integer linear programming (MILP) algorithm for top-down genome reduction. MinGenome identifies a ranked list of deletions starting with the longest one and proceeding monotonically to shorter ones. Identifying the larger DNA stretches to be deleted first affords savings in the time and cost needed to carry out the reductive process. MinGenome also relies on a deterministic algorithm (i.e., MILP), thus ending up with a single (barring any alternate optima) reduced genome and top-down genome reduction scheme. Note that an earlier effort¹⁷,¹⁸ used a stochastic algorithm to assess the end points of reductive evolutionary processes of endosymbiotic bacteria using biomass production feasibility to judge viability. This algorithm tended to terminate at different reduced genomes given the stochastic nature of the reduction process. MinGenome relies on (i) a genome-scale model (GSM) representation of metabolism, (ii) gene location information from GenBank¹⁹, (iii) gene essentiality information from a transposon library, (iv) operon and promoter site structure information from EcoCyc²⁰, and (v) transcription factor information from RegulonDB²¹. The procedure first identifies the largest contiguous DNA stretch within the genome that can be deleted without affecting growth or any other performance criterion (e.g., target product max yield or ATP availability). Subsequently, the next largest dispensable stretch of DNA is identified given all the earlier deletions. The successive identification of smaller DNA stretches to remove continues until (i) a percent genome reduction goal, (ii) a maximum number of deletions, or (iii) a minimum DNA stretch size to be deleted is reached. The simultaneous deletion of computationally predicted synthetic lethal pairs is avoided by maintaining biomass yield at a non-zero or maximum level depending on user specifications. In the second phase, the MinGenome algorithm assesses the possibility of removing even larger stretches of DNA if only one or two essential genes are within the deleted sequence. These deleted essential genes can then be re-inserted into the genome in a different location. We applied the algorithm to design a genome-minimized E. coli K-12 MG1655 strain and found that we were able to recapitulate the long deletions identified in previous experimental studies. A new deletion scheme that has not been explored before is proposed, with large-scale genomic deletions ranging from 14.7 to 63.3 kb. MinGenome can be readily applied to other organisms provided that the aforementioned information is available. The MinGenome algorithm was implemented in Python and C++ and is available on GitHub (https://github.com/maranasgroup).
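
The successive procedure and its three stopping criteria can be illustrated with a minimal Python sketch. This is not the MinGenome implementation: the function names, the half-open base-pair intervals, and the `find_largest` callback (which stands in for one MILP solve) are all assumptions made for illustration, and `largest_remaining` is a toy substitute that simply picks the longest not-yet-deleted gap from a fixed list.

```python
from typing import Callable, List, Optional, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp), end exclusive; hypothetical coordinates


def successive_deletions(
    find_largest: Callable[[List[Interval]], Optional[Interval]],
    genome_length: int,
    target_fraction: float,
    max_deletions: int,
    min_stretch: int,
) -> List[Interval]:
    """Collect deletions, largest first, until one stopping criterion is met."""
    deletions: List[Interval] = []
    removed = 0
    # Criteria (ii) and (i): maximum number of deletions, percent reduction goal.
    while len(deletions) < max_deletions and removed < target_fraction * genome_length:
        candidate = find_largest(deletions)  # stand-in for one MILP solve
        if candidate is None:
            break  # nothing dispensable is left
        start, end = candidate
        # Criterion (iii): stop once the best remaining stretch is too short.
        if end - start < min_stretch:
            break
        deletions.append(candidate)
        removed += end - start
    return deletions


def largest_remaining(gaps: List[Interval]) -> Callable[[List[Interval]], Optional[Interval]]:
    """Toy stand-in for the MILP: return the longest gap not yet deleted."""
    def pick(done: List[Interval]) -> Optional[Interval]:
        left = [g for g in gaps if g not in done]
        return max(left, key=lambda g: g[1] - g[0]) if left else None
    return pick


if __name__ == "__main__":
    pick = largest_remaining([(0, 500), (1000, 1200), (2000, 2050)])
    print(successive_deletions(pick, 10_000, 0.5, 10, 100))
```

In the actual algorithm each call would re-solve the optimization given all earlier deletions, so the candidate returned at one step can depend on what was removed before it.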


RESULTS AND DISCUSSION

The proposed MinGenome computational workflow is shown pictorially in Figure 1. First, information is collected on gene positions on the chromosome, essential genes, operons and their promoter sites, and transcription factors as input for the MinGenome algorithm. In addition, MinGenome allows as options (i) gene re-insertion, (ii) retention of transcription factor genes and (iii) user-supplied expansion of the list of essential genes. This provides the information needed to flag the regions that cannot be deleted throughout the entire genome-minimization process. Next, the MinGenome algorithm is successively applied to identify the largest remaining contiguous dispensable DNA stretch. The objective function here is to maximize the distance between the start position and end position of the deleted DNA stretch. Genes and promoters within the deleted stretch are removed. The relationships between promoters and genes and between genes and reactions are modeled as logic constraints. A gene cannot be expressed if its promoter is removed, and an enzymatic reaction is knocked out if the gene (or genes) coding for the enzyme is deleted. A set of performance criteria (e.g., maintaining growth rate, target product max yield, or ATP availability) can be imposed as constraints in the MinGenome algorithm. To maintain growth, a deletion is deemed viable only if the genome-minimized strain maintains a pre-specified biomass yield. The MinGenome algorithm identifies segments to be deleted until one of the imposed stopping criteria is met. The longest deletions generally remove genes encoding unknown functions, secondary metabolism, motility, phages, and antibiotic resistance.²² Shorter stretches generally involve genes in alternative metabolic pathways.

The MinGenome algorithm is first deployed to design a genome-minimized E. coli strain. The first case study demonstrates the algorithm's capability of predicting long deletions and contrasts these predictions with existing in vivo genome reduction efforts. The second case study assumes a more conservative posture on gene essentiality by appending to the “not to be deleted” list genes that are deemed essential for the closely related organism Buchnera spp. but not in E. coli, as put forth by Mizoguchi et al.¹¹ Expanding the list of essential genes results in shorter deletions, so more deletions are needed to reach 35% genome reduction than in the first case study. In the last study, we allow up to one essential gene to be deleted and then re-inserted elsewhere in the genome. We find that this reduces the number of contiguous stretches that need to be deleted to reach 35% genome reduction from 69 to 56 (see Figure 7). Note that after 69 MinGenome predicted deletions we reach a genome reduction level of 40%.
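
The promoter-gene and gene-reaction logic described in the workflow can be sketched outside of an MILP as a direct evaluation for one candidate stretch. The sketch below is an illustration under stated assumptions, not the MinGenome formulation: the data layout, the function names, and the representation of each reaction as a list of alternative gene sets (a reaction survives if at least one alternative is fully intact) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp), end exclusive


def lost_genes(
    deletion: Interval,
    genes: Dict[str, Interval],
    promoters: Dict[str, int],
    gene_promoter: Dict[str, str],
) -> Set[str]:
    """Genes that can no longer be expressed after removing `deletion`."""
    start, end = deletion
    lost: Set[str] = set()
    for name, (g_start, g_end) in genes.items():
        # The gene itself overlaps the deleted stretch.
        gene_hit = g_start < end and g_end > start
        # Or its promoter position falls inside the deleted stretch.
        prom = gene_promoter.get(name)
        promoter_hit = prom is not None and start <= promoters[prom] < end
        if gene_hit or promoter_hit:
            lost.add(name)
    return lost


def lost_reactions(lost: Set[str], gpr: Dict[str, List[Set[str]]]) -> Set[str]:
    """Reactions for which every alternative gene set contains a lost gene."""
    return {
        rxn
        for rxn, alternatives in gpr.items()
        if alternatives and all(alt & lost for alt in alternatives)
    }


if __name__ == "__main__":
    # Hypothetical toy data: three genes, each with its own promoter.
    genes = {"a": (100, 200), "b": (300, 400), "c": (500, 600)}
    promoters = {"p1": 90, "p2": 290, "p3": 450}
    gene_promoter = {"a": "p1", "b": "p2", "c": "p3"}
    gpr = {"R1": [{"a"}, {"b"}], "R2": [{"b", "c"}], "R3": [{"c"}]}
    gone = lost_genes((280, 420), genes, promoters, gene_promoter)
    print(gone, lost_reactions(gone, gpr))
```

In an MILP these implications would be written as linear inequalities over binary variables; evaluating them directly, as here, is only useful for checking a deletion that has already been proposed.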

Identification of the 32 longest contiguous deletions for E. coli using MinGenome.

MinGenome was run on the E. coli GSM iJO1366²³ with the requirement of maintaining maximum theoretical growth under aerobic minimal glucose growth medium. Essential genes obtained from EcoliWiki (http://ecoliwiki.net/) were flagged and their deletion was prohibited. We ran MinGenome successively until the cumulative deletion size reached 2321 kb. This involved 38 deletions for a final genome size reduction of 50% (see Supporting Information Table S1). In an earlier in vivo genome reduction of E. coli,²⁴ long deletions (larger than 41.4 kb) were achieved by aggregating the results from medium-scale deletions. Herein we compare the 32 longest MinGenome predicted deletions with the 32 large-scale deletions carried out experimentally. Figure 2 superimposes the top 32 deletions from the Profiling of E. coli Chromosome (PEC) database (https://shigen.nig.ac.jp/ecoli/pec/) with the top 32 deletions predicted by MinGenome. The numbering of MinGenome predicted deletions is in descending size order (i.e., deletion 1, …, 32) while the numbering of experimental deletions uses the original deletion name in the PEC database with an additional ‘LD’ prefix when the original label starts with a number (e.g., LD1, LD28, and LD3-32-1). Out of the 32 predicted deletions, 27 overlap with deletions already carried out experimentally in various genome minimization studies. Deletions 5 and 18 (see Supporting Information Table S1) share the same genes with experimental large-scale deletions LD1 and LD28 but with slightly different start and end positions of the deleted stretches. As many as 14 deletions (i.e., 2, 24, 7, 21, 15, 11, 22, 28, 14, 16, 4, 20, 17 and 10; see also Supporting Information Table S1) are at least 95% identical to the experimental deletions (see Figure 2). Experimental deletions also track well with the MinGenome predictions in the later deletions. For example, MinGenome deletion 35 matches the experimental deletion LD3-32-1 but with slightly different start and end positions.
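
A comparison of this kind between predicted and experimental deletions can be sketched as an interval-overlap calculation. The text does not state how percent identity was computed, so the Jaccard index used below (shared length divided by combined length) is an assumption, and the coordinates and labels in the example are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp), end exclusive


def jaccard(a: Interval, b: Interval) -> float:
    """Shared length of two intervals divided by the length of their union."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - shared
    return shared / union if union else 0.0


def best_match(
    predicted: Interval, experimental: Dict[str, Interval]
) -> Tuple[Optional[str], float]:
    """Experimental deletion with the highest overlap score, if any overlaps."""
    name, score = None, 0.0
    for label, interval in experimental.items():
        s = jaccard(predicted, interval)
        if s > score:
            name, score = label, s
    return name, score


if __name__ == "__main__":
    # Hypothetical coordinates and labels, for illustration only.
    experimental = {"X1": (0, 100), "X2": (150, 260)}
    print(best_match((150, 250), experimental))
```

A threshold such as 0.95 on this score would then separate near-identical deletions from partial overlaps.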

MinGenome also predicted long deletion segments that have not been included in the experimentally carried out deletions so far. For example, five of the MinGenome deletions (i.e., deletions 8, 26, 27, 29 and 31) involve stretches of DNA that were absent from the experimentally carried out long deletions (see Figure 2). However, these regions do contain independently carried out medium-scale deletions that were not integrated to construct the genome-minimized strain. In the experimental procedure of Hashimoto et al.,⁹ a long deletion was constructed when a number of contiguous medium-scale deletions that had been constructed independently in the previous step (labeled as OCR1 to OCR71 and OCL1 to OCL89) could be aggregated to reach a long deletion greater than 41.4 kb (e.g., LD21 is the combination of OCL47 and OCL48). An unsuccessful medium-scale deletion that contained an essential gene was excluded entirely; thus no long deletion meeting the required length could be constructed in that region. However, part of the medium-scale deletion can still be included to form a long deletion. For example, deletion 8 was not constructed successfully because the medium-scale deletion OCL61 contains the essential gene gltX encoding glutamate-tRNA ligase (see Figure 3). However, part of OCL61 (starting from gene ypdK and stopping before gene gltX) can be combined with two successfully carried out deletions, OCL62 and OCL63,⁹ to form a long deletion. Similarly, deletion 26 can be reconstructed based on deletions OCL31 and OCL30 carried out by Hashimoto et al.⁹ (see Figure 3). Two genes (yraL and rpsO, encoding 16S rRNA 2'-O-ribose C1402 methyltransferase and 30S ribosomal subunit protein S15, respectively) within the deletion OCL30 were considered essential in the original experimental design,⁹ but both were reported as non-essential in later studies.²⁵,²⁶ In addition, deletions 27, 29, and 31, which are shorter (32 kb to 40 kb, below the 41.4 kb threshold), were not integrated into the experimental long deletions. However, they also overlap with experimental medium-scale deletions OCL73-1/4 and OCL74, OCL27 and OCL28-2/8/9, and OCR48/49, respectively (see Figure 3), which are potential stretches that can be combined into longer deletions.
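
The aggregation step described above, in which contiguous medium-scale deletions are combined and kept only if the result exceeds 41.4 kb, can be sketched as an interval merge followed by a length filter. The function name, the `max_gap` tolerance, and the example coordinates are hypothetical; only the 41.4 kb threshold comes from the text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start_bp, end_bp), end exclusive


def aggregate(
    medium: List[Interval], max_gap: int = 0, min_length: int = 41_400
) -> List[Interval]:
    """Merge contiguous medium-scale deletions and keep merged stretches
    longer than `min_length` base pairs."""
    merged: List[Interval] = []
    for start, end in sorted(medium):
        if merged and start - merged[-1][1] <= max_gap:
            # Contiguous (or within tolerance): extend the previous stretch.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [m for m in merged if m[1] - m[0] > min_length]


if __name__ == "__main__":
    # Hypothetical coordinates: two adjacent deletions and one isolated one.
    print(aggregate([(0, 20_000), (20_000, 45_000), (100_000, 130_000)]))
```

A medium-scale deletion that failed because of an essential gene could be entered here in truncated form, ending just before that gene, which is the partial reuse suggested for OCL61.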

In addition, MinGenome on many occasions avoided growth-detrimental deletions that were attempted experimentally. Hashimoto et al.⁹ hypothesized that the experimental large-scale deletion LD13 (from b2236 to b2276) contained genes responsible for growth rate and cell shape, as the cells assumed a different length-to-width ratio after LD13 was introduced. Interestingly, MinGenome predicted no deletion spanning the part of region LD13 that contains genes menE, menC, menH, and menD (b2260 to b2264), which code for the menaquinone biosynthesis pathway that is essential for the production of the biomass precursors menaquinone-8 and 2-demethylmenaquinone-8. Instead, MinGenome predicted two shorter deletions, from b2236 to b2259 and from b2265 to b2276, as deletions 39 and 91, respectively. Similarly, MinGenome predicted no deletion within the segment of the experimental deletion LD3-17-1, as the knockout of gene gdhA (b1761) encoding glutamate dehydrogenase results in a slower growth rate as predicted by the E. coli model iJO1366²³. A ∆gdhA mutant was also previously reported to be growth-detrimental in vivo.²⁷ It is important to note that Hashimoto et al.⁹ have shown that the genome-minimized strain grows slower as more long deletions are accumulated. In addition to the growth-adverse gene deletions that were skipped by MinGenome, potential regulatory effects that affect growth may also exist. A more conservative posture is examined in the next case study, with additional constraints that flag “putative” essential genes and transcription factors as not to be deleted.

MinGenome avoided the deletion of computationally predicted synthetic lethals (SLs). SLs are genes whose simultaneous deletion prohibits growth. Avoiding the simultaneous deletion of SLs is an important consideration: while constructing the minimal genome JCVI-syn3.0, some of the reduced strains failed to grow due to the deletion of SL pairs.⁵ A number of computational approaches have been developed to predict SLs from GSMs using bilevel optimization as well as minimal cut set and elementary flux mode methods.²⁸,²⁹ However, the a priori enumeration of all possible SLs (triples, quadruples and even higher combinations) requires significant computational time. MinGenome circumvents this challenge by simply enforcing biomass production as a constraint after all the cumulative deletions up to this point are imposed. The deletions were subsequently compared against available SL datasets²⁸ (see Supporting Information Table S3), confirming that no SLs were simultaneously deleted.
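
The post hoc comparison against a synthetic lethal dataset amounts to checking that no listed gene set is entirely contained in the cumulative set of deleted genes. A minimal sketch follows; the function name and the gene identifiers are hypothetical, and the same check works for triples or larger sets.

```python
from typing import Iterable, List, Set, Tuple


def violated_sets(
    deleted: Set[str], lethal_sets: Iterable[Tuple[str, ...]]
) -> List[Tuple[str, ...]]:
    """Synthetic lethal sets whose members have all been deleted."""
    return [s for s in lethal_sets if set(s) <= deleted]


if __name__ == "__main__":
    # Hypothetical gene identifiers, for illustration only.
    lethal = [("g1", "g2"), ("g3", "g4"), ("g1", "g3", "g5")]
    print(violated_sets({"g1", "g2", "g3"}, lethal))
```

An empty result corresponds to the outcome reported above, where no synthetic lethals were simultaneously deleted.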

The MinGenome algorithm was exhaustively applied until no more genes could be removed. A total of 572 genes were retained at the end, with 423 of them included in model iJO1366²³, while the 149 retained genes that are absent from iJO1366²³ are the essential genes flagged in the first step of MinGenome. These genes were classified based on their COG function³⁰ (see Figure 4 and Supporting Information Table S2). The total number of COG classifications is 657 as some genes had multiple functions. As shown in Figure 4, essential genes cover 97.4% of the set of genes in information and processing. However, more than half of the genes in the remaining categories are non-essential genes that are required to maintain biomass yield at the maximum level, the performance criterion defined in our MinGenome simulation.

E. coli minimized genome using gene essentiality from both E. coli and Buchnera spp. along with transcription factor information.

Here we explored another deletion scheme with two additional criteria compared with the first case study: (i) genes that have homologs in Buchnera spp. were also labeled as essential, as suggested in Mizoguchi et al.’s design of the MGF-01 strain,¹² and (ii) all genes coding for transcription factors with known functions in RegulonDB are retained. Buchnera spp. is a symbiotic bacterium that is a close relative of E. coli. Symbionts tend to retain only the absolutely essential genes as most nutrients are imported from their host, without a need for the corresponding biosynthetic pathways or genes associated with pathogenicity.³¹ In addition to enzyme-coding genes, here we also retain all genes associated with the known transcriptional regulatory network that activates or represses genes during genome reduction with MinGenome. Consequently, we flagged all 218 transcription factors in RegulonDB as not to be deleted. There exist both Boolean-based regulatory models²,³² and probabilistic ones³³ for E. coli. In this study, we simply enforced that the entire regulatory network remains intact with no provisions as to which transcription factors are needed. In principle, either Boolean or probabilistic models can be used to narrow down the list of essential transcription factors.
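
The effect of enlarging the no-deletion list can be illustrated with a simplified sketch that ignores the metabolic constraints altogether: it only enumerates maximal runs of consecutive unflagged genes and ranks them by length. This linear scan is a stand-in chosen for illustration, not the MILP; it treats the chromosome as linear (no wrap-around), and the gene names and coordinates are hypothetical.

```python
from typing import List, Set, Tuple

Gene = Tuple[str, int, int]  # (name, start_bp, end_bp), sorted by start


def deletable_runs(genes: List[Gene], protected: Set[str]) -> List[Tuple[int, int, List[str]]]:
    """Maximal runs of consecutive unprotected genes, longest first.

    Each run is reported as (start_bp, end_bp, gene_names).
    """
    runs: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in genes:
        if gene[0] in protected:
            # A protected gene closes the current run.
            if current:
                runs.append(current)
                current = []
        else:
            current.append(gene)
    if current:
        runs.append(current)
    spans = [(r[0][1], r[-1][2], [g[0] for g in r]) for r in runs]
    return sorted(spans, key=lambda s: s[1] - s[0], reverse=True)


if __name__ == "__main__":
    genes = [("a", 0, 100), ("b", 100, 250), ("c", 250, 300),
             ("d", 300, 500), ("e", 500, 520)]
    essential = {"c"}
    extra_flags = {"b"}  # e.g., a homolog-based or transcription factor flag
    print(deletable_runs(genes, essential))
    print(deletable_runs(genes, essential | extra_flags))
```

In this toy case, adding one flagged gene shortens the longest candidate run from 250 bp to 220 bp, which mirrors the shorter deletions seen when the essential gene list is expanded.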

As in the first case study, we compared MinGenome predictions with the MGF-01 strain, which has 35% genome reduction with 91 deletions catalogued in the Profiling of E. coli Chromosome (PEC) database (https://shigen.nig.ac.jp/ecoli/pec/). MinGenome terminated when the genome reduction reached 35% after 69 deletions (see Supporting Information Table S4). The longer deletions range from 14.7 kb to 63.3 kb and are smaller than the long deletions carried out in the first case study (which range from 31.2 kb to 226 kb). Note that CRISPR/Cas9 can achieve long deletions from 23 kb to 1 Mb, but deletions of larger sizes generally have lower efficiency.³⁴ The MGF-01 strain shares 568 gene deletions with strain MDS43 and 493 gene deletions with strain ∆16.¹² MinGenome deletion predictions and MGF-01 strain deletions share 993 genes (see Figure 5). The MinGenome predictions allude to the possibility for additional genome reduction.

We compared the deletions of metabolic reactions carried out in the MGF-01 strain and in the second case study of MinGenome. We find that reaction dispensability is highly metabolism dependent, with largely convergent results between MinGenome and experimental studies (see Figure 6). Secondary carbon metabolism is by far the most highly reduced set of reactions: alternative carbon metabolism and inner and outer membrane transport account for 62.9% and 50.3% of the deleted reactions in vivo and in the MinGenome predictions, respectively (see Figure 5). As glucose was selected as the only carbon source, we observe that a number of deleted pathways are responsible for the uptake and metabolism of alternative substrates such as D-glucose 1-phosphate, branching glycogen, D-fructose, glycerol, D-xylose, and D-ribose.

We also find that MinGenome predictions preferentially eliminate reactions involved in alternative/redundant pathways compared to existing minimization studies (see Supporting Information Table S5). In particular, in oxidative phosphorylation pathways, as many as 12 redundant electron transport reactions are deleted involving the oxidation/reduction of ubiquinone, menaquinone, and demethylmenaquinone that enable growth under a variety of conditions. In the citric acid cycle, two malate dehydrogenase (MDH) alternate reactions (MDH2 and MDH3) that convert malate to oxaloacetate are suggested for deletion by MinGenome. Both of them use the secondary activity of malate oxidoreductase (gene mqo) for the malate conversion. Note that deletion of the primary gene mdh (∆mdh) results in a severe growth defect.³⁵ MinGenome here correctly prioritized the deletion of alternative reactions (e.g., keep MDH and delete MDH2/MDH3) because gene mdh is in closer proximity to essential genes than gene mqo, which prevents a longer deletion around mdh. Similarly, in amino acid metabolism pathways, the deletion of the NADPH-dependent glutamate synthase reaction (GLUSy) is suggested while the NADH-dependent reaction (GLUS) is retained. In purine and pyrimidine biosynthesis pathways, the primary reaction GAR transformylase-T (GART) is retained while the alternative reaction phosphoribosylglycinamide formyltransferase (GARFT) in the inosine monophosphate (IMP) biosynthesis pathway is deleted. In contrast, in the nucleotide salvage pathway, the reactions inosine kinase (INSK) and AMP nucleosidase (AMPN) are deleted while the alternative reactions 5'-nucleotidase (NTD11) and adenine phosphoribosyltransferase (ADPT) are retained. In all cases, MinGenome makes decisions on reaction retention based on the potential to generate the longest deletion stretch without considering whether the retained pathway is primary or secondary. Informed by experimental information, a user can flag primary reaction pathways as not to be deleted when using MinGenome.

The deletions suggested by MinGenome were also implemented in E. coli’s genome-scale model of metabolism and expression (ME model)³⁶ to verify the viability of the mutants after the successive deletions, with the additional consideration of protein resource allocation to enzymes and the transcription/translation machinery. A number of studies have shown that protein allocation to RNA polymerase³⁷ and ribosomes³⁸ is linearly correlated with E. coli’s growth rate. Interestingly, even after the cumulative imposition of 69 deletions, no growth defect was predicted by the ME model simulation. This suggests that the protein allocation for transcription/translation may be sufficient for cell growth after the deletions.

Identification of multiple longest contiguous deletions for E. coli using MinGenome while allowing for one gene re-insertion (per deletion) to the genome. As genetic engineering tools³⁹ allow for the efficient integration of genes back into the genome, here we explored whether it is possible to reach the same or higher levels of genome reduction by deleting fewer but longer stretches of DNA. CRISPR/Cas9 has recently been reported to insert sequences no longer than 2 kb in E. coli with more than 90% efficiency using dsDNA as the editing template.⁴⁰ It has also been shown that the FLP-FRT site-specific recombination system can remove a stretch of DNA while moving the essential genes within the contiguous region to a complementary mini-F plasmid.²⁴ We therefore explored genome reduction with a re-insertion strategy by applying the MinGenome algorithm with the corresponding option switched on. As before, we required retention of the maximum theoretical growth under aerobic minimal glucose medium. Results (see Supporting Information Table S1) confirm that it is indeed possible to delete larger stretches of DNA with the provision of gene re-insertion. The new longest deletion combines deletions 1 and 3 to form a single contiguous deletion. Interestingly, the gene that needs to be re-inserted (i.e., aldA encoding aldehyde dehydrogenase A) is not essential.⁴¹ However, the associated reaction glycolaldehyde dehydrogenase (GCALDD) is predicted to be essential based on model iJO1366.²³ This is because it is needed for degradation of the glycolaldehyde side product in folate metabolism. Aziz et al.⁴² showed a hidden pathway involving gene prpC that can convert glycolaldehyde to glycolyl-CoA to bypass this function. Upon adding this pathway to the model, we find a longer deletion with the option of no re-insertion that matches the new longest deletion that needs one gene re-insertion.

The cumulative deletions with and without one gene re-insertion in the second case study are subsequently compared (see Figure 7 and Supporting Information Table S6). The number of deletions needed to reach 35% genome reduction drops from 69 to 56 when one gene re-insertion is allowed. Without gene re-insertion, the total genome reduction only reaches 30.9% after 56 deletions. As expected, MinGenome predicts longer genomic deletions when re-insertion is allowed. In addition, the combined larger deletions often reveal genes falsely predicted as essential by iJO1366²³, as in the case of aldA. Another such example is gene thiD, encoding phosphomethylpyrimidine kinase, in deletion 53. In model iJO1366²³, the ∆thiD mutant is predicted to be lethal because phosphomethylpyrimidine kinase synthesizes 4-amino-2-methyl-5-(diphosphomethyl)pyrimidine (HMP-PP), an essential precursor of thiamine pyrophosphate (vitamin B1). However, gene thiD2 was reported to encode the same phosphomethylpyrimidine kinase, which is missing in iJO1366.⁴³ As a result, an in vivo deletion of the combined regions showed no growth defect, corroborating the absence of an essential gene within the region and thus obviating the need for the re-insertion step.

CONCLUSION

This paper introduces MinGenome, a MILP optimization algorithm for the construction of genome-minimized strains. MinGenome performs a top-down genome reduction approach that successively eliminates metabolic and regulatory genes, starting with the longest possible deletion. To avoid lethal or growth-defective deletions, essential genes and synthetic lethal pairs are retained by enforcing a constraint on biomass yield. We applied MinGenome to predict deletions for the E. coli K-12 MG1655 genome. Compared with deletions already carried out experimentally, MinGenome predictions match the experimental long deletions well. In addition, MinGenome was able to predict new deletions and avoid growth-detrimental ones. Finally, we explored MinGenome’s capability to reach the same or higher levels of genome reduction by deleting fewer but longer stretches of DNA while allowing for gene re-insertions. MinGenome is a general algorithm that can be applied to other organisms provided that information on gene annotation, metabolic pathways, biomass description, gene location and essentiality, promoter sites, and operon structure is available.⁴⁴ The quality of the obtained genome reduction scheme depends on the accuracy of the provided information. MinGenome allows as options (i) gene re-insertion, (ii) retention of transcription factor genes, and (iii) user-supplied expansion of the list of essential genes. It is available as open-source Python/C++ programs on the Maranas Lab’s GitHub website.

Moving forward, a key challenge for MinGenome is predicting the effect of deleting genes that serve non-metabolic roles. Possibly, a transcriptional regulatory model (such as Boolean-based regulatory models²,³² or probabilistic ones³³) can be used in conjunction with the GSM to screen against the deletion of needed TFs. Kinetic models⁴⁵,⁴⁶ and ME models³⁶ could also be used to make decisions on reaction retention that favor higher enzyme efficiency or lower enzyme cost. In addition, genes with unknown functions⁴⁷ will be revealed as essential if MinGenome predictions lead to lethal outcomes, thus helping to determine their function. Currently, MinGenome relies on user input to flag them as no-deletion genes. Ultimately, we anticipate that one of the most important contributions of MinGenome will be the identification of previously unknown essential functions whenever a newly deleted stretch causes lethality. By systematically probing the entire genome for hidden essential (or synthetic lethal) functions, the pace of gene annotation would be accelerated.

METHODS

A detailed description of MinGenome is presented in this section. First, we illustrate the workflow and detailed mathematical formulation of the MinGenome algorithm along with a toy example to clarify its function. Next, we show how this algorithm allows for multiple simultaneous deletions while keeping the genome-minimized strain consistent with a set of performance criteria. Finally, we demonstrate how the MinGenome algorithm can identify even larger dispensable stretches on the chromosome while allowing for gene re-insertions.

MinGenome: Optimization-driven workflow for genome minimization. MinGenome requires as input data (i) a genome-scale model, (ii) the genome sequence, (iii) gene essentiality, and (iv) transcriptional units (see Supporting Information Table S7 for details). The genome sequence of Escherichia coli str. K-12 substr. MG1655 (NC_000913.3) was downloaded from GenBank.¹⁹ The latest GSM reconstruction (iJO1366²³) was used here. Experimentally verified essential genes⁴⁸ were also flagged so that they would not be deleted during the simulation. This is important because the metabolic model can capture essentiality only for genes with a metabolic role. However, only 20% of essential genes have a known metabolic function; 75.6% of essential genes are responsible for cell envelope/division, protein quality control, DNA modification and maintenance, as well as protective functions that are not modeled in the GSM; and 4.4% of essential genes in E. coli do not have a known function.⁴⁹ Promoter region information for each transcription unit was obtained from the EcoCyc database.²⁰ The deletion of the promoter of a transcription unit results in no expression of the genes in that transcription unit.

The MinGenome algorithm is posed and solved as a MILP problem. It requires the definition of the following sets, parameters, and variables.

Sets

$I = \{i \mid i = 1, \ldots, |I|\}$: set of metabolites

$J = \{j \mid j = 1, \ldots, |J|\}$: set of reactions

$K = \{k \mid k = 1, \ldots, |K|\}$: set of genes

$P = \{p \mid p = 1, \ldots, |P|\}$: set of promoters

$L \subset K \cup P$: set of genes and promoters

$P_k \subset P$: set of promoters that initiate the expression of gene $k$.

Parameters

$s_l$: Position of the first nucleotide of the deleted sequence, counted from the origin of replication, when gene or promoter $l$ is selected as the beginning of the deleted stretch. Note that $s_l$ is not always the start site of a gene or a promoter. It is the first nucleotide of the non-overlapped region between gene/promoter $l$ and gene/promoter $l-1$ (see Figure 8 and Supporting Information Table S7).

$e_l$: Position of the first nucleotide of the gene or promoter $l$ immediately after the deleted sequence (see Figure 8 and Supporting Information Table S7).
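The two position parameters can be illustrated with a short sketch. It assumes 1-based inclusive coordinates and elements sorted by their first nucleotide; the function name and the toy coordinates are hypothetical and are not taken from Table S7 or from the MinGenome code.

```python
from typing import List, Tuple

def start_end_positions(elements: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (s_l, e_l) for each gene/promoter.

    `elements` holds (first, last) nucleotide positions, 1-based and inclusive,
    sorted by `first`. This is an illustrative reading of the parameter
    definitions, not the authors' implementation.
    """
    result = []
    previous_last = 0  # nothing precedes the first element
    for first, last in elements:
        # s_l: first nucleotide not shared with any earlier element
        s_l = max(first, previous_last + 1)
        # e_l: first nucleotide of the element itself, so a deletion that
        # ends just before it leaves the element intact
        e_l = first
        result.append((s_l, e_l))
        previous_last = max(previous_last, last)
    return result

# Hypothetical layout: gene 1 spans 100..250, promoter 2 spans 240..300
# (overlapping gene 1), and gene 2 spans 301..500.
toy_elements = [(100, 250), (240, 300), (301, 500)]
positions = start_end_positions(toy_elements)
```

In this toy layout, a deletion that starts at promoter 2 and ends before gene 2 has length 301 - 251 = 50 nucleotides, so the ten nucleotides shared with gene 1 are left untouched.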

Variables

$x_l = \begin{cases} 1, & \text{if gene or promoter } l \text{ is the first gene or promoter within the deleted segment} \\ 0, & \text{otherwise} \end{cases}$

$y_l = \begin{cases} 1, & \text{if gene or promoter } l \text{ is immediately after the end of the deleted segment} \\ 0, & \text{otherwise} \end{cases}$

$z_l = \begin{cases} 1, & \text{if gene or promoter } l \text{ is deleted} \\ 0, & \text{otherwise} \end{cases}$

$v_j$: flux of reaction $j$ (mmol gDW$^{-1}$ h$^{-1}$)

MILP representation of MinGenome algorithm

$$\max \; \sum_{l \in L} y_l e_l - \sum_{l \in L} x_l s_l$$

s.t.

$$\sum_{j \in J} S_{ij} v_j = 0 \qquad \forall i \in I \tag{1}$$

$$\sum_{l \in L} y_l = 1, \qquad \sum_{l \in L} x_l = 1 \tag{2}$$

$$\sum_{j=1}^{l} x_j - \sum_{j=1}^{l} y_j = z_l \qquad \forall l \in L \tag{3}$$

$$v_{biomass} \ge f \cdot v_{biomass,max} \tag{4}$$

$$z_k = \prod_{p \in P_k} z_p \qquad \forall k \in K \tag{5}$$

$$\text{GPR constraints for reaction } j \text{ (see below)} \qquad \forall j \in J \tag{6}$$

$$x_l, y_l, z_l \in \{0, 1\} \qquad \forall l \in L$$

The objective function maximizes the distance between the first and the last deletion event, thus yielding the longest contiguous stretch that can be deleted while satisfying all imposed requirements. The first deletion event occurs at the first nucleotide of the start gene/promoter whose deletion does not affect any overlapping gene or promoter, whereas the last deletion event is pegged at the last nucleotide before the gene or promoter immediately after the deleted sequence (see Figure 8). This ensures that the correct length of the deleted segment is assessed.
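As a rough illustration of what the objective selects, the sketch below enumerates every start/end pair on a tiny genome instead of solving a MILP. The `viable` callback is a hypothetical stand-in for the model-based checks (biomass yield, essential genes), and all positions are invented for the example; the last element can only serve as an end marker in this simplified form.

```python
from typing import Callable, List, Optional, Set, Tuple

def longest_deletion(
    start_pos: List[int],
    end_pos: List[int],
    viable: Callable[[Set[int]], bool],
) -> Optional[Tuple[int, int, int]]:
    """Brute-force analogue of the objective: maximize end_pos[b] - start_pos[a].

    `a` is the first deleted element and `b` is the first element after the
    deletion. Returns (length, a, b), or None if no deletion is viable.
    """
    best = None
    n = len(start_pos)
    for a in range(n):
        for b in range(a + 1, n):
            deleted = set(range(a, b))  # elements a .. b-1 are removed
            if not viable(deleted):
                continue
            length = end_pos[b] - start_pos[a]
            if best is None or length > best[0]:
                best = (length, a, b)
    return best

# Hypothetical six-element genome without overlaps, so both position
# parameters coincide with the element start sites.
sites = [0, 100, 250, 400, 700, 900]
essential = {0, 3, 5}  # assumed essential elements
best_deletion = longest_deletion(sites, sites, lambda d: not (d & essential))
```

Here the search returns the stretch covering elements 1 and 2 (300 nucleotides), which is longer than the alternative stretch covering element 4 alone (200 nucleotides).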

Constraint 1 is the standard FBA steady-state conservation requirement for metabolite $i$. Constraint 2 ensures that only one of the genes or promoters will be selected as the start, and only one as the end, of the deletion event. Constraint 3 ensures that all the genes/promoters after the start gene/promoter and before the end gene/promoter will also be deleted. Note that $z_l = 1$ implies deletion of gene/promoter $l$. In the toy example (see Figure 9A), genes 2 through 4 are within the deletion stretch when $x_2 = 1$ and $y_5 = 1$. As a result, $z_m = 1$, $m \in \{2, 3, 4\}$. Constraint 4 imposes a requirement on the biomass yield as a fraction of the maximum theoretical value. In all case studies, $f$ is set equal to one, implying that the maximum theoretical biomass yield is retained upon the imposed deletions. As discussed earlier, additional requirements on the maximum theoretical yield of other fluxes can be imposed along with requirements on biomass depending on the ultimate production goals of the minimized strain. Constraint 5 ensures that a gene $k$ is non-functional if all of its promoters ($P_k$) are deleted (see Supporting Information Table S7 for details). It is worth mentioning that one gene may have more than one promoter. For example, genes ileS, ispA, fkpB, and ispH (from b0026 to b0029) have two promoters, ileSp1 and ispAp, and both of them bind to sigma factor $\sigma^{70}$.²⁰ Constraint 5 defines AND relationships that can be reformulated as linear constraints as follows:

$$z_k \ge \sum_{p \in P_k} z_p - (|P_k| - 1) \tag{7}$$

$$z_k \le z_p \qquad \forall p \in P_k \tag{8}$$

where $|P_k|$ is the number of promoters for the gene $k$. Based on constraints 7 and 8, $z_k = 1$ only if $z_p = 1$ for all $p \in P_k$, which is identical to constraint 5.
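The equivalence between the product form (constraint 5) and its linear reformulation (constraints 7 and 8) can be checked exhaustively for small promoter sets. The sketch below is only such a check; the function name is hypothetical.

```python
from itertools import product

def linearization_matches_product(n_promoters: int) -> bool:
    """Check that constraints 7 and 8 admit exactly z_k = prod(z_p).

    Every 0/1 assignment of the promoter variables is enumerated, and the
    set of z_k values allowed by the two linear constraints is compared
    with the value given by the product in constraint 5.
    """
    for z_p in product((0, 1), repeat=n_promoters):
        target = int(all(z_p))  # constraint 5: gene lost only if all promoters are lost
        feasible = [
            z_k
            for z_k in (0, 1)
            if z_k >= sum(z_p) - (n_promoters - 1)   # constraint 7
            and all(z_k <= value for value in z_p)   # constraint 8
        ]
        if feasible != [target]:
            return False
    return True

# Promoter sets of size 1 to 4 cover the two-promoter example in the text.
all_match = all(linearization_matches_product(n) for n in range(1, 5))
```

Constraint 8 forces $z_k$ to zero as soon as one promoter survives, while constraint 7 forces it to one when every promoter is deleted, so exactly one value remains feasible in each case.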

GPR constraints. Constraint 6 links reactions and genes through appropriate gene-to-protein-to-reaction (GPR) constraints.²⁸ In the expressions below, $LB_j$ and $UB_j$ denote the lower and upper bounds on the flux of reaction $j$. The following cases need to be accounted for:

(i) A single gene $k$ codes for the enzyme catalyzing reaction $j$ (i.e., one-to-one mapping).

$$(1 - z_k) LB_j \le v_j \le (1 - z_k) UB_j$$

(ii) Two genes $k_1$ and $k_2$ code for an enzyme complex catalyzing reaction $j$.

$$(1 - z_{k_1}) LB_j \le v_j \le (1 - z_{k_1}) UB_j$$

$$(1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_2}) UB_j$$

(iii) Two genes $k_1$ and $k_2$ code for two isozymes catalyzing reaction $j$.

$$(1 - z_{k_1} + 1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_1} + 1 - z_{k_2}) UB_j$$

$$LB_j \le v_j \le UB_j$$

(iv) More complex GPR relations

The complete logic statement that describes the GPR relation is posed using AND and OR statements. The logic statement is converted into constraints⁴¹ similar to the ones for (ii) and (iii). For example, the constraints for genes $k_1$, $k_2$, and $k_3$ with a GPR expressed as the logic statement {($k_1$ AND $k_2$) OR ($k_1$ AND $k_3$)} are as follows:

$$(1 - z_{k_3} + 1 - z_{k_2}) LB_j \le v_j \le (1 - z_{k_3} + 1 - z_{k_2}) UB_j$$

$$(1 - z_{k_1}) LB_j \le v_j \le (1 - z_{k_1}) UB_j$$

$$LB_j \le v_j \le UB_j$$
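The effect of these GPR constraints on the allowed flux range can be sketched numerically. The helper below assumes the GPR has already been rewritten as an AND of OR-clauses (for instance, {(k1 AND k2) OR (k1 AND k3)} becomes k1 AND (k2 OR k3)) and that $LB_j \le 0 \le UB_j$; the function name, the clause encoding, and the bound values are illustrative assumptions rather than part of MinGenome.

```python
from typing import Dict, List, Tuple

def gpr_flux_bounds(
    lb: int,
    ub: int,
    clauses: List[List[str]],
    deleted: Dict[str, int],
) -> Tuple[int, int]:
    """Effective flux bounds implied by the GPR constraints.

    Each clause lists genes joined by OR (isozymes); clauses are joined by
    AND (complex subunits). `deleted[g]` plays the role of z_g.
    Assumes lb <= 0 <= ub.
    """
    lower, upper = lb, ub  # LB_j <= v_j <= UB_j always applies
    for clause in clauses:
        surviving = sum(1 - deleted[gene] for gene in clause)
        # (sum of (1 - z)) * LB_j <= v_j <= (sum of (1 - z)) * UB_j
        lower = max(lower, surviving * lb)
        upper = min(upper, surviving * ub)
    return lower, upper

# GPR from case (iv): k1 AND (k2 OR k3), with hypothetical bounds of -10 and 10.
rule = [["k1"], ["k2", "k3"]]
intact = gpr_flux_bounds(-10, 10, rule, {"k1": 0, "k2": 0, "k3": 0})
one_isozyme_lost = gpr_flux_bounds(-10, 10, rule, {"k1": 0, "k2": 1, "k3": 0})
both_isozymes_lost = gpr_flux_bounds(-10, 10, rule, {"k1": 0, "k2": 1, "k3": 1})
subunit_lost = gpr_flux_bounds(-10, 10, rule, {"k1": 1, "k2": 0, "k3": 0})
```

Losing a single isozyme leaves the reaction fully active, whereas losing both isozymes or the shared subunit pins the flux to zero.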

Identifying multiple simultaneous deletions with MinGenome. In order to obtain multiple deletions without synthetic lethal gene pairs, the right-hand side of constraint 2 is modified as follows:

$$\sum_{l \in L} y_l = N, \qquad \sum_{l \in L} x_l = N \tag{9}$$

Here $N$ is the pre-specified number of contiguous genome deletions. Note that because all $N$ deletions are simultaneously performed, no synthetic lethal pairs (or higher-order combinations) will be included in the list of genes to be deleted. As shown in Figure 9A, the top three deletions in the genome of E. coli are predicted by the model. The starts of the three deletions are $x_2 = 1$, $x_7 = 1$, and $x_{12} = 1$, and the ends of the three deletions are $y_5 = 1$, $y_{10} = 1$, and $y_{15} = 1$. Constraint 9 (i.e., the modified constraint 2) is satisfied since $x_2 + x_7 + x_{12} = 3$ and $y_5 + y_{10} + y_{15} = 3$. It is important to note that constraint 3 ensures that those deletions will not overlap with each other. For example, constraint 3 ensures that $y_5 = 1$ must happen between $x_2 = 1$ and $x_7 = 1$. As shown in Figure 9A, if deletion 1 overlaps with a new deletion 2 (e.g., $x_4 = 1$ happens before $y_5 = 1$), constraint 3 ($\sum_{j=1}^{4} x_j - \sum_{j=1}^{4} y_j = z_4$) is violated. As $x_2 = x_4 = 1$, constraint 3 indicates that $z_4 = 2$, which contradicts the constraint $z_l \in \{0, 1\}$. As a result, overlapped deletions are precluded from the MinGenome predictions. MinGenome is applied successively to identify the longest deletions. In iteration $n$, the information on deletions gained through the previous $n - 1$ iterations is used as parameters, and constraint 9 is modified (i.e., the constant is changed from $n - 1$ to $n$) to allow for the search of the $n$-th deletion.
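The way the cumulative-sum constraint both marks deleted genes and rules out overlaps can be reproduced with a few lines. This is a minimal sketch on the 15-gene toy layout described above; the helper names are hypothetical.

```python
from itertools import accumulate
from typing import List, Set

def indicator(n: int, ones: Set[int]) -> List[int]:
    """0/1 list over genes 1..n with ones at the given 1-based positions."""
    return [1 if gene in ones else 0 for gene in range(1, n + 1)]

def deletion_indicator(x: List[int], y: List[int]) -> List[int]:
    """z_l = (running sum of x up to l) - (running sum of y up to l)."""
    return [cx - cy for cx, cy in zip(accumulate(x), accumulate(y))]

def is_feasible(x: List[int], y: List[int]) -> bool:
    """The start/end choice is acceptable only if every z_l is binary."""
    return all(z in (0, 1) for z in deletion_indicator(x, y))

n_genes = 15
# Three non-overlapping deletions: starts at genes 2, 7, 12; ends at 5, 10, 15.
x_ok = indicator(n_genes, {2, 7, 12})
y_ok = indicator(n_genes, {5, 10, 15})
z_ok = deletion_indicator(x_ok, y_ok)
deleted_genes = [gene for gene, z in enumerate(z_ok, start=1) if z == 1]

# Overlapping case: a second deletion starts at gene 4, before gene 5 closes the first.
x_bad = indicator(n_genes, {2, 4})
y_bad = indicator(n_genes, {5, 10})
z_bad = deletion_indicator(x_bad, y_bad)
```

In the overlapping case the running difference reaches 2 at gene 4, which no binary $z_4$ can match, so a MILP solver would reject that combination of starts and ends.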

Genome minimization algorithm for larger dispensable stretches on the chromosome while allowing for gene re-insertions. MinGenome can also be modified to allow for presumably larger deletions by allowing one or more required genes to be re-inserted back into the genome. Constraint 3 is replaced by the following constraints:

$$\sum_{j=1}^{l} x_j - \sum_{j=1}^{l} y_j = w_l \qquad \forall l \in L \tag{10}$$

$$w_l \ge z_l \qquad \forall l \in L \tag{11}$$

$$\sum_{l \in L} w_l = \sum_{l \in L} z_l + 1 \tag{12}$$

where $w_l$ is defined as follows:

$w_l = \begin{cases} 1, & \text{if gene } l \text{ is within a larger dispensable region to be deleted} \\ 0, & \text{otherwise} \end{cases}$

Constraint 10 defines the larger deletion region between $x_j$ and $y_j$ predicted by the model. Constraints 11 and 12 allow one gene in the larger region to be re-inserted back into the genome. Figure 9B depicts an example in which MinGenome predicted a longer deletion by combining two deletions (gene 2 to gene 3 and gene 5 to gene 7), and essential gene 4 ($z_4 = 0$ and $w_4 = 1$) within the longer deletion should be re-inserted into the genome (see Table 1).
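The re-insertion idea can be mimicked by brute force on a toy genome resembling Figure 9B. The sketch below simplifies the formulation in two labeled ways: stretch length is measured in number of genes rather than nucleotides, and a stretch may contain at most (rather than exactly) the allowed number of genes that cannot be deleted. The function name and the gene layout are assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

def longest_stretch(
    n_genes: int,
    non_deletable: Set[int],
    max_reinsertions: int,
) -> Optional[Tuple[int, int, List[int]]]:
    """Longest contiguous stretch of genes (1-based, inclusive).

    The stretch may contain up to `max_reinsertions` non-deletable genes,
    which are returned as the genes to re-insert after the deletion.
    """
    best = None
    for first in range(1, n_genes + 1):
        for last in range(first, n_genes + 1):
            reinserted = [g for g in range(first, last + 1) if g in non_deletable]
            if len(reinserted) > max_reinsertions:
                continue
            if best is None or (last - first) > (best[1] - best[0]):
                best = (first, last, reinserted)
    return best

# Hypothetical eight-gene genome in which genes 1, 4, and 8 cannot be deleted.
required = {1, 4, 8}
without_reinsertion = longest_stretch(8, required, max_reinsertions=0)
with_reinsertion = longest_stretch(8, required, max_reinsertions=1)
```

Without re-insertion the best stretch is gene 5 to gene 7; allowing one re-insertion merges it with gene 2 to gene 3 into a single stretch from gene 2 to gene 7, with gene 4 flagged for re-insertion.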

ASSOCIATED CONTENT

Supporting information

Supporting Information Tables S1 to S7 for the three case studies carried out in the paper.

AUTHOR INFORMATION

Corresponding author

*Tel.: (814)863-9958, Email: [email protected]

Author contributions

L.W. and C.D.M. conceived the study and designed the algorithms. L.W. performed the simulations, data analysis, and interpretation. Both authors contributed to writing the manuscript and to the discussion.

Notes

The authors declare no competing financial interests.

ACKNOWLEDGMENTS

The authors acknowledge the input given by Thomas J. Mueller, Margaret N. Simons, Satyakam Dash, and Joshua Chan at the various stages of idea refinement and implementation. The authors gratefully acknowledge funding from the NSF (http://www.nsf.gov/) award NSF/MCB 1546840. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

ABBREVIATIONS

GSM, genome-scale model; MILP, mixed-integer linear programming; LD, large-scale deletion

FIGURES AND TABLES

Figure 1. Schematic representation of the MinGenome algorithm. The MinGenome algorithm builds on GSMs with information on essential genes and on gene and promoter positions. Logic constraints are imposed on promoters, genes, and reactions. MinGenome offers three additional options: (i) gene re-insertion, (ii) retention of transcription factor genes, and (iii) user-supplied expansion of the list of essential genes. MinGenome identifies the sequence of deletions starting with the largest dispensable region and proceeding monotonically to shorter ones.

Figure 2. Comparison between 32 experimental long deletions (blue) and the top 32 long deletions predicted by MinGenome (orange). MinGenome predicted deletions are in descending size order (i.e., from deletion 1 to deletion 32), and experimental deletions keep their original names in the PEC database with an additional ‘LD’ prefix when the original label starts with a number (e.g., LD1, LD14, OCR37(-km), and LD3-15-1Y). In addition to a number of regions overlapping with experimentally carried out deletions, MinGenome predicted additional long deletions and avoided potentially growth-detrimental deletions.

Figure 3. Five MinGenome predictions (deletions 8, 26, 27, 29, and 31) and their corresponding experimentally carried out medium-scale deletions that were constructed independently at the same locations. Experimental deletions are labeled with their names in the PEC database (e.g., OCL63, OCL31, and OCR48,49-8). The red and green colored genes indicate genes that are truly essential and falsely predicted as essential, respectively.

Figure 4. COG-based functional analysis of the genes in the minimal genome predicted by the MinGenome algorithm. The genes are categorized into five groups: (i) information storage and processing, (ii) cellular processes, (iii) metabolism, (iv) poorly characterized, and (v) COG not assigned. We observed that most of the non-essential genes retained in the minimal genome are in “Metabolism” and are required to maintain the maximum biomass yield.

Figure 5. A Venn diagram comparing the deleted genes in strain MGF-01 with the MinGenome predictions.

[Figure 6 bar chart. Y-axis: number of reaction deletions (0 to 120); legend: experimental deletion, MinGenome prediction.]

Figure 6. Comparison of the deletions of reactions in metabolism between MinGenome predictions and the MGF-01 strain in the second case study. Reactions are categorized based on their subsystems in iJO1366²³. Blue bars indicate the deleted reactions in the MGF-01 strain, and orange bars indicate the deleted reactions in MinGenome predictions. MinGenome predictions have more deletions in glycerophospholipid metabolism, nucleotide salvage pathways, oxidative phosphorylation, amino acid metabolism, and the pentose phosphate pathway.

Figure 7. Comparison of the percentage of genome reduction over the first 56 deletions between the MinGenome algorithm with and without gene re-insertion. Blue bars indicate the cumulative MinGenome deletions while allowing for gene re-insertion, and orange bars indicate the cumulative MinGenome deletions with the re-insertion option switched off.

Figure 8. Definitions of the start site and the end site of a deletion in the MinGenome algorithm. The red line indicates a contiguous stretch (from gene 2 to gene 6) that MinGenome predicted as a long deletion, and the orange line indicates the overlapped region between gene 1 and promoter 2. Note that the stretch of DNA for promoter 2 cannot be deleted entirely because part of it overlaps with gene 1. MinGenome circumvents this challenge by defining the start site ($s_l$), which adjusts the beginning nucleotide of the long deletion to the non-overlapped region, and by ensuring that the deletion up to the end site ($e_l$) does not affect the promoter or gene immediately after the long deletion.

Figure 9. A toy example of MinGenome: (A) Three deletion segments were constructed in the genome. The three red lines indicate long deletions 1, 2, and 3. $z_l = 1$ indicates the deletion of gene $l$; $x_l = 1$ and $y_l = 1$ represent the first gene in the deleted stretch and the first gene immediately following the deleted stretch, respectively. (B) Longer deletion by the MinGenome algorithm while allowing for one essential gene re-insertion. The long deletion is constructed by combining two deletions, (i) gene 2 to gene 3 and (ii) gene 5 to gene 7, and gene 4 is allowed to be re-inserted into the genome. Note that for clarity the involvement of promoters is omitted here.

Table 1: Toy example of one gene re-insertion to the genome. The longer deletion combines two deletions (i) gene 2 to gene 3 and (ii) gene 5 to gene 7.

variable  1