Orthogonal Information Encoding in Living Cells with High Error-Tolerance, Safety and Fidelity


Article

Orthogonal information encoding in living cells with high error-tolerance, safety and fidelity Lifu Song, and An-Ping Zeng ACS Synth. Biol., Just Accepted Manuscript • DOI: 10.1021/acssynbio.7b00382 • Publication Date (Web): 10 Feb 2018 Downloaded from http://pubs.acs.org on February 11, 2018


ACS Synthetic Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ACS Synthetic Biology

Orthogonal information encoding in living cells with high error-tolerance, safety and fidelity

Lifu Song¹ and An-Ping Zeng¹,*

¹ Institute of Bioprocess and Biosystems Engineering, Hamburg University of Technology, Hamburg, Germany

* To whom correspondence should be addressed. Tel: +49 40 42878-3217; Fax: +49 40 42878-2909; Email: [email protected]

Present address: Prof. An-Ping Zeng ([email protected]), Institute of Bioprocess and Biosystems Engineering, Hamburg University of Technology, Denickestrasse 15, 21071 Hamburg, Germany

Abstract

Information encoding in DNA is of great interest, but its application in vivo may be questionable because errors can be enriched exponentially by cellular replication and the artificial sequences may interfere with natural ones. Here, a novel self-error-detecting, three-base block encoding scheme (SED3B) is proposed for reliable and orthogonal information encoding in living cells. SED3B adds error detecting bases to small data blocks, which can be combined with the inherent redundancy of DNA molecules for effective error correction. Errors at a rate of 19% can be corrected, as shown by error-prone PCR experiments with E. coli cells. A calculation based on this preliminary result shows that SED3B-encoded information in E. coli can remain reliable for more than 12,000 years of continuous replication. Importantly, SED3B-encoded sequences do not share sequence space with any reported natural DNA sequence except for some short tandem repeats, which indicates, for the first time, a low biological relevance of the encoded sequences. These features make SED3B attractive for broad orthogonal information encoding purposes in living cells, e.g. comment and barcode encoding in synthetic biology. As a proof of concept, 10 different barcodes were encoded in E. coli cells. After continuous replication for ten days, including exposure to ultraviolet light for 2 to 3 minutes (lethality > 60%) per day, all barcodes were fully recovered, demonstrating the stability of the encoded information. An online encoding-decoding system is implemented and available at http://biosystem.bt1.tu-harburg.de/sed3b/.

Keywords

DNA data storage, data encoding in living cells, data encoding in DNA, biological barcode, synthetic biology comment, error correction

Introduction

Deoxyribonucleic acid (DNA) is the natural information carrier utilized in all living organisms on earth (1). Artificial information encoding in DNA (AIED) is a potential solution for high-density, long-term data storage, which is urgently required to store exponentially growing global digital data (2–7). AIED also has several other attractive applications, such as hiding messages in DNA microdots (8), embedding watermarks/signatures in genes or organisms (9–15), and barcode and comment encoding for synthetic biology programs (16). Applications of AIED can be divided into two categories, in vitro and in vivo. For in vitro applications, great attention has recently been given to large data storage applying various error correcting codes and DNA synthesis and sequencing technologies (17–23). For in vivo applications, there are two ways of writing information into the DNA of living cells. One is to embed the information in synonymous codons of coding regions (9,11–15), which is specific to watermark/signature encoding. The other is to encode the information in extra DNA sequences that are inserted into non-coding regions of the DNA of living cells (10), which can be used generally for different purposes.

In contrast to in vitro AIED for large data storage (17–25), limited progress has been made on data encoding in living cells. Because of the data ‘writing’ and ‘reading’ barriers imposed by the cell membrane, large data storage in living cells may be limited, although there have been such efforts recently (24,25). Unlike in vitro applications for large data storage, in vivo applications, e.g. encoding of watermarks/signatures, barcodes and comments, only require storing a small amount of information. However, correcting errors that could be enriched by cellular replication and avoiding interference of artificial sequences with natural ones (i.e. being biologically relevant) are two major issues for in vivo applications. Although work on watermark/signature encoding algorithms that write encrypted information into synonymous codons claimed no influence on protein expression or function, this claim is not convincing given that codon optimization has been widely used to optimize protein expression (9,11–15,26–28). Furthermore, altering synonymous codons may influence cell physiology at the RNA level, which still requires more advanced investigation and validation.
In contrast, writing information in extra DNA provides a simple way to encode information in the DNA of living cells by inserting the extra DNA into non-coding regions (10). Furthermore, methods that write information in extra DNA sequences can be applied to any in vivo application rather than only to watermark/signature encoding. To this end, a new method that avoids possible interference of the encoded artificial DNA sequences with natural ones by a more advanced mechanism, and that corrects the errors enriched exponentially by cellular replication, is highly desirable. Theoretically, the encoding schemes designed for in vitro data storage in DNA are also usable for in vivo applications. However, to the best of our knowledge, no reported in vitro data storage method specifically addresses the two major issues mentioned above. One unique feature of information storage in DNA is that there are always many copies of DNA molecules representing the same data. In other words, there is a high inherent data redundancy. In this study, using a novel way of adding error detection codes block by block, we have established an efficient encoding scheme (termed SED3B) which takes full advantage of this inherent redundancy for error correction. SED3B can effectively repress the exponentially enriching errors that emerge from DNA replication, as indicated by in silico and experimental results. In addition to avoiding extreme GC content and long homopolymers and having simple secondary structures, SED3B-encoded sequences also show very low biological relevance, as shown by comparative studies with naturally formed sequences. High error tolerance and low biological relevance make SED3B promising for orthogonal information encoding in living cells.
To facilitate the use of SED3B as an information encoding scheme in living cells, an online encoding-decoding system with comment and barcode encoding examples has been implemented and released at http://biosystem.bt1.tu-harburg.de/sed3b/.

Material and Methods

Detailed steps for encoding binary data into DNA string


To detail the steps of encoding arbitrary digital data into a DNA string with SED3B, an arbitrary computer file is represented as a string S1 of bits (each bit taking the value 0 or 1). The steps are illustrated in Figure 1 and explained as follows (a Perl script implementation is provided in Supporting Information 1):

1) Bit string S1 is converted, four bits at a time, to a DNA string S2 of characters in {A, C, G, T} using the scheme shown in Figure 1 (rows labelled “Data encoding bases”).
2) One error detecting base is inserted after every two bases according to assignment rule I shown in Figure 1 to generate DNA string S3.
3) S3 is scanned three bases at a time for “TTT”, and the error detecting base of the first block following each “TTT” is changed to a new base according to rule II to generate the final DNA string S4.

Decoding error-containing DNA strings into binary data

The decoding process restores the original binary data from a varying number of error-containing DNA strings. In principle, our encoding scheme permits detection of substitution errors as well as insertions and deletions (indels); the latter are recognized by the extensive errors that emerge in consecutive three-base encoding blocks. The details of the decoding process are illustrated in Figure 2 and described as follows (a Perl script implementation is provided in Supporting Information 1):

1. Generate a consensus DNA string block by block as follows:
   a) Read a three-base block from all DNA fragments and remove all three-base blocks in which errors are detected (the error detecting rule is initialized to rule I);
   b) Make the consensus block by taking the block with the largest occurrence frequency;
   c) Switch the error detecting rule to rule II if the consensus block is ‘TTT’; otherwise, switch to rule I;
   d) Go to the next block and repeat steps a), b), and c) until a complete consensus DNA string is generated.
2. Translate the consensus DNA string into a bit string based on the scheme shown in Figure 1.
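The encoding and decoding steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the 2-bit mapping (BITS_TO_BASE) and the two rule tables (RULE_I, RULE_II) are hypothetical stand-ins for the assignments of Figure 1, which are not reproduced in the text, and the function names encode and decode are likewise assumptions. The authors' own implementation is the Perl script in Supporting Information 1. The sketch further assumes substitution errors only, so that all reads stay aligned block for block.

```python
from collections import Counter

# Hypothetical 2-bit mapping; the actual mapping is the one in Figure 1.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

# Hypothetical rule I table: two data bases -> error detecting base.
RULE_I = {
    "AA": "T", "AC": "C", "AG": "G", "AT": "A",
    "CA": "C", "CC": "A", "CG": "T", "CT": "G",
    "GA": "G", "GC": "T", "GG": "A", "GT": "C",
    "TA": "A", "TC": "G", "TG": "C", "TT": "T",
}
# Hypothetical rule II: rule I with check bases G and T swapped, so "TT" -> "G".
RULE_II = {pair: {"G": "T", "T": "G"}.get(c, c) for pair, c in RULE_I.items()}


def encode(data: bytes) -> str:
    """Steps 1-3: bits -> data bases -> three-base blocks with rule switching."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bases = "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
    blocks, rule = [], RULE_I
    for i in range(0, len(bases), 2):
        pair = bases[i:i + 2]
        block = pair + rule[pair]
        blocks.append(block)
        # Rule II applies only to the block that directly follows "TTT".
        rule = RULE_II if block == "TTT" else RULE_I
    return "".join(blocks)


def decode(reads: list[str], n_blocks: int) -> bytes:
    """Block-wise consensus over reads, discarding blocks that fail the check."""
    rule, pairs = RULE_I, []
    for k in range(n_blocks):
        candidates = [read[3 * k:3 * k + 3] for read in reads]
        valid = [b for b in candidates if len(b) == 3 and rule.get(b[:2]) == b[2]]
        if not valid:
            raise ValueError(f"no error-free copy of block {k}")
        consensus = Counter(valid).most_common(1)[0][0]
        pairs.append(consensus[:2])
        rule = RULE_II if consensus == "TTT" else RULE_I
    bits = "".join(BASE_TO_BITS[base] for base in "".join(pairs))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With this sketch, encode(b"Hello, World!") gives a 78-base string (13 bytes, 52 data bases plus 26 error detecting bases), and decode recovers the text from a set of reads as long as each block has at least one copy that passes the check and the correct block is the most frequent among those copies.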

Analysis of error tolerance by in silico simulation

The 35,292 bp DNA string encoding the logo of our institute was used as input for the error tolerance simulation. The tolerance of SED3B to substitution errors and to insertion-deletion errors was investigated separately. Random errors were introduced base by base with a specified error probability. For the simulation of substitution error tolerance, the rates of A↔T and G↔C transition errors were set to twice those of A/T↔G/C transition errors to mimic the natural DNA replication process. Varying numbers of DNA sequences with random errors were then used for decoding to test the error tolerance. The Perl script for introducing errors into DNA strings and the related input files are detailed in Supporting Information 2.

In vivo verification of the error tolerance

Error-prone PCR experiment

To test the error tolerance of the SED3B encoding scheme in practice, we used error-prone PCR to introduce random errors into the synthesized DNA fragments and recovered the original information from the error-rich DNA fragments. We encoded the text “Hello, World!” into a 78 bp DNA string. Error-prone PCR was performed with the JBS dNTP-Mutagenesis Kit using the recommended protocol with 30 thermal cycles to introduce errors into the encoded DNA sequence. The fragments amplified by error-prone PCR were ligated with linearized pZE21-MCS plasmid using the In-Fusion® HD Cloning Kit from Clontech® Laboratories. The ligation products were transformed into Stellar E. coli competent cells. Plasmid extractions of individual colonies were submitted for sequencing. The plasmid map, the encoded information, the primers used and the sequencing results are detailed in Supporting Information 3.

UV exposure experiment

Ten barcodes were generated randomly and encoded into DNA sequences using SED3B. Ten primers containing the encoded DNA barcodes, with two additional 15 bp hooks at both ends, were synthesized in order to insert the DNA barcodes into the vector pZE21-MCS. The 10 DNA barcodes were cloned into pZE21-MCS plasmids individually using the Gibson assembly method. The 10 resulting plasmids were transformed into E. coli Top10 cells individually. To test the reliability of the encoded DNA barcodes in E. coli cells, we continuously cultivated the cells for 10 days and exposed them to UV light for 2 to 3 minutes per day (fatality rate > 60%, indicating an extreme condition). For each UV exposure, 10 µL of fresh cells were mixed with 90 µL of sterilized water in a 96-well plate and then placed in a clean bench with a 30 W UV lamp for 2 to 3 minutes. The cells from the 10th round of UV exposure were spread on plates at proper dilutions. Five colonies for each barcode were sent out for sequencing. The sequencing results were used to recover the encoded barcodes.
The barcodes and the encoded DNA barcodes are listed in Table S1 in Supporting Information 4. The primers used to construct the plasmids containing the encoded DNA barcodes are listed in Table S2 in Supporting Information 4.

Biological relevance analysis of SED3B encoded sequences

It is challenging to verify the biological relevance of encoded sequences directly. We therefore focused on a comparative analysis of SED3B-encoded DNA sequences and naturally formed sequences, aiming to show that the two are orthogonal. If so, this indirectly indicates that SED3B-encoded sequences have limited biological relevance. For this purpose, we implemented a Perl script (see Supporting Information 5) to search natural biological sequences for sub-sequences that do not violate the SED3B encoding scheme. All three frames were considered. The 30,151,123 nucleotide sequences available in the NCBI nucleotide collection (nt) database, collected on May 28, 2015, were used as inputs for the analysis. Tandem repeats that were found to partially follow the SED3B encoding were analyzed using “Tandem repeat finder version 4.07b” with default settings (29).

Secondary structure analysis

As secondary structure is formed by complementarily matched partial sequences, a larger number of complementarily matched k-mers indicates a more complex secondary structure of the corresponding DNA string. Here, k-mers refers to all possible substrings of length k in a given string (30). The total number of complementarily matched k-mers and their percentage among all k-mers were used as indicators of the secondary structure complexity of the DNA strings in the current analysis. A self-implemented Perl script was used for analyzing the complementarily matched k-mers (Supporting Information 6). A k-mer length of 12 was applied during the analysis.

Implementation of an online encoding-decoding system with SED3B

The online system is implemented using the CakePHP (https://cakephp.org/) web development framework. Two different applications are provided: comment encoding-decoding and biological barcode encoding-decoding. The system is available at http://biosystem.bt1.tu-harburg.de/sed3b/.

Results

Principles of a self-error-detecting, three-base block encoding scheme

To fully utilize the redundancy of DNA molecules for error correction, we designed a novel self-error-detecting, three-base block (SED3B) encoding scheme for effective and flexible error correction. In more detail, the binary bits are first transformed, four bits at a time, into data encoding DNA bases using the scheme shown in Figure 1. We then insert one error checking base after every two data encoding bases to form three-base blocks. The third base is designed to detect whether errors have emerged in the two encoding bases. A simple way of error checking by the third base is the checksum principle (31). However, the checksum method offers no option for optimizing against homopolymers and extreme GC content. Instead, we designed a novel strategy to enable error checking by the third base. We first divide all 16 possible two-base combinations into four groups such that no two combinations in the same group share the same first base or the same second base. We then assign an error detecting base to every group as shown in Figure 1. Thus, the two data encoding bases and the error detecting base no longer match each other if an error emerges in any of the three bases. In other words, a single-base error in any of the three bases can be detected. To avoid extreme GC content, long homopolymers and complex secondary structures in the encoded DNA strings, three additional principles are followed while assigning error checking bases to the four groups of two-base combinations: 1) fewer than three G/C bases in a three-base block, to avoid extreme GC content; 2) no block in which all three bases are identical, to avoid long homopolymers; 3) no pair of three-base blocks that are complementarily matched. However, principles 1) and 2) cannot be satisfied simultaneously.
To address this issue, we designed two rules for assigning error detecting bases, as shown in Figure 1 (rows labelled “error detecting base rules”). Rule I satisfies the principle that no more than two G/C bases are present in any three-base block, but it does produce a “TTT” homopolymer. Rule II eliminates all three-base homopolymers but allows G/C at all three positions. During encoding, Rule I is used in general. Only if “TTT” is present is the rule for assigning the error detecting base switched temporarily to Rule II, and it is switched back to Rule I after one block has been encoded. Thus, continuous “T” homopolymers are avoided because the error detecting base for “TT” is “G” rather than “T” in Rule II, and extreme GC content is also avoided because blocks with three G/C bases occur in Rule II only if the previous three-base block is “TTT”. As a result, no more than seven continuous “T”, five continuous “A” or “C” and three continuous “G” can exist in the encoded DNA strings, which has been shown to be acceptable for current DNA synthesis and sequencing technologies (18). The GC content can be kept below 67.7%. Since two-thirds of the total bases are used for data encoding, the SED3B scheme has a theoretical encoding efficiency of ~66.7%, disregarding addressing and compression.

High error tolerance revealed by in silico simulations

Tolerance of substitution errors

To test the capability of SED3B to detect substitution errors, we introduced random substitution errors at different rates into SED3B-encoded DNA fragments and calculated the percentage of errors that could be detected by SED3B. As shown by the green triangles in Figure 3, more than 90% of errors can be detected when the error rate is below 10%, and 78% of errors can still be detected even when the error rate is as high as 30%. The error rates after error repression, shown as red crosses, are more than one order of magnitude lower than the untreated ones. Next, we tested the error correction capability using varying numbers of DNA sequences. We first tested with 10 and 100 DNA sequences. As shown in Figure 4, the error tolerance with 100 DNA sequences is, as expected, higher than with 10 DNA sequences: errors up to a rate of 5% are tolerated using 10 sequences and up to 33% using 100 sequences. To estimate the number of sequences required for reliable correction at a specific error rate, we performed a series of simulations with error rates ranging from 1% to 40% in steps of 1%. At each simulated error rate, we started with a small number of sequences and attempted to retrieve the data in 500 iterations. If errors emerged in any of the 500 iterations, we increased the number of sequences by one and repeated the process until no errors emerged in all 500 iterations. As shown in Figure 5, although the required number of sequences increased exponentially with the error rate, 200 sequences were enough to correct errors at a very high rate of 40%.
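As a rough check on the block-level error detection that underlies these simulations, the grouping principle can be explored with a short Python sketch. The check-base table RULE_I below is a hypothetical assignment that satisfies the grouping principle; it is not the table of Figure 1. The function names are also assumptions of this sketch. It enumerates outcomes exactly instead of sampling, and it treats the three possible substitutions of a base as equally likely, which differs from the weighted error model used in the simulations, so its numbers are not expected to reproduce Figure 3.

```python
from itertools import product

BASES = "ACGT"
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT
# Hypothetical check-base table (not the one in Figure 1), listed in PAIRS order.
RULE_I = dict(zip(PAIRS, "TCGACATGGTACAGCT"))
VALID = {pair + check for pair, check in RULE_I.items()}


def groups_are_disjoint(rule: dict[str, str]) -> bool:
    """Grouping principle: pairs that share a check base never share
    a first base or a second base."""
    for check in BASES:
        members = [pair for pair, c in rule.items() if c == check]
        firsts = {pair[0] for pair in members}
        seconds = {pair[1] for pair in members}
        if len(members) != 4 or len(firsts) != 4 or len(seconds) != 4:
            return False
    return True


def all_single_substitutions_detected() -> bool:
    """Every single-base change of a valid block must give an invalid block."""
    return all(
        block[:i] + base + block[i + 1:] not in VALID
        for block in VALID
        for i in range(3)
        for base in BASES
        if base != block[i]
    )


def detected_fraction(p: float) -> float:
    """Fraction of corrupted blocks flagged as invalid when each base is
    independently replaced, with probability p, by one of the other three bases."""
    corrupted = detected = 0.0
    for block in VALID:
        for outcome in product(BASES, repeat=3):
            n_err = sum(a != b for a, b in zip(block, outcome))
            if n_err == 0:
                continue
            prob = (p / 3) ** n_err * (1 - p) ** (3 - n_err)
            corrupted += prob
            if "".join(outcome) not in VALID:
                detected += prob
    return detected / corrupted
```

Under these assumptions every single-base substitution is detected, and only blocks hit by two or three substitutions can slip through. For this hypothetical table, detected_fraction gives roughly 0.97 at p = 0.1 and roughly 0.89 at p = 0.3.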
Tolerance of insertion-deletion errors

Insertion-deletion (indel) errors are a type of error unique to DNA as a storage medium, since indel errors cannot occur on traditional planar storage media. Indel errors are more difficult to correct than substitution errors. To the best of our knowledge, no method that handles indel errors well has been reported so far. SED3B encodes data into consecutive three-base blocks. An indel error in an SED3B-encoded DNA sequence shifts the frame of the following three-base blocks, resulting in a series of blocks that violate the error checking rules. Thus, indel errors can be easily detected during decoding. To investigate the tolerance of SED3B to indel errors, we ran a series of simulations in a similar way to the substitution error simulations. As shown in Figure 6, with 10 DNA sequences, indel errors at a rate of 0.4% can be effectively corrected. By increasing the number of DNA sequences to 100, indel errors at a rate as high as 3% can be corrected. SED3B thus has an approximately 10 times higher capability to correct substitution errors than indel errors. Interestingly, it has been reported that the rate of nucleotide substitutions in vivo is also about 10 times higher than the rate of indels (32). Thus, SED3B is also reliable and compatible with in vivo applications, as verified by the experimental studies in the following sections. Furthermore, the results above were obtained with a relatively simple decoding algorithm that gives no special consideration to the information in the frame-shifted blocks.
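The frame-shift effect described above can be illustrated with a small Python sketch. The rule tables are hypothetical stand-ins for those in Figure 1, and invalid_blocks is an assumed helper name, not part of the authors' Perl scripts. The example string is a valid encoding under the hypothetical rule I, from which a single base is deleted.

```python
from itertools import product

PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]  # AA, AC, ..., TT
# Hypothetical check-base tables (not those of Figure 1), in PAIRS order.
RULE_I = dict(zip(PAIRS, "TCGACATGGTACAGCT"))
RULE_II = {pair: {"G": "T", "T": "G"}.get(c, c) for pair, c in RULE_I.items()}


def invalid_blocks(seq: str) -> list[int]:
    """Indices of three-base blocks that violate the current checking rule."""
    bad, rule = [], RULE_I
    for start in range(0, len(seq), 3):
        block = seq[start:start + 3]
        # An incomplete trailing block is counted as a violation.
        if len(block) < 3 or rule.get(block[:2]) != block[2]:
            bad.append(start // 3)
        # Rule II applies only to the block directly after "TTT".
        rule = RULE_II if block == "TTT" else RULE_I
    return bad


intact = "CACGAGCGTGCT"              # blocks CAC, GAG, CGT, GCT
deleted = intact[:4] + intact[5:]    # one base lost from the second block

print(invalid_blocks(intact))
print(invalid_blocks(deleted))
```

In this example the intact string has no violating block, whereas after the deletion every block from the damaged one onward fails the check. A frame-shifted block can still pass by chance, so a run of violations, not a single one, is the signature of an indel.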


A more advanced decoding algorithm that makes use of the information in the frame-shifted blocks would further enhance the indel error correction capability of SED3B.

SED3B encoded DNA sequences show low biological relevance

Biological safety has been widely discussed in synthetic biology (33,34). However, this problem did not draw enough attention in previous studies of information encoding in DNA. Large data storage in DNA will produce a huge number of novel DNA fragments. The encoding scheme should provide mechanisms that prevent the encoded DNA fragments from being utilized by microbes in nature, especially for large data storage applications. Similar to the genetic code in nature, SED3B also encodes in three-base blocks. However, only one-fourth of the 64 possible three-base combinations are used in SED3B in general, and another one-fourth is used only when “TTT” is present in the previous encoding block. We believe that such an encoding scheme imposes strong limitations on the encoded DNA string, making it hard to form “biologically meaningful” sequences. To test this, we implemented a Perl script to search natural DNA sequences for sub-sequences that satisfy our encoding rules (the Perl script is detailed in Supporting Information 5). We analyzed all 30,151,123 nucleotide sequences available in the NCBI nucleotide collection (nt) database (collected on May 28, 2015), considering all three frames. The results showed that no complete coding sequence fits our encoding rules and that the number of matched partial sequences decreases rapidly as the cut-off length increases, as shown by the blue dots in Figure 8. Furthermore, a large proportion of the matched partial sequences contain tandem repeat structures, which have low biological meaning, as shown by the red dots in Figure 8.
Indeed, all partial sequences longer than 65 bp are found to be tandem repeats (the sequences are listed in Supporting Information 5). These results imply that SED3B-encoded DNA sequences and naturally formed DNA sequences are located in different sequence spaces, with a slight overlap consisting of tandem repeat sequences. In other words, SED3B-encoded DNA strings show very low biological relevance.

SED3B encoded DNA sequences show simple secondary structures

Synthesis and sequencing of DNA fragments with complex secondary structures is not yet a well-solved problem (35). Retrieving information stored in DNA with complex secondary structures is a challenge that has been considered in previous studies (2,3). Complex secondary structures are formed by complementary subsequences. In SED3B, all reverse complementary three-base combinations are eliminated, which in principle strongly hinders the encoded DNA strings from forming complex secondary structures. To test this point, we encoded three files of different sizes into DNA strings using the SED3B encoding scheme and, separately, without the third-base optimization. Since it is difficult to predict and compare the secondary structure complexity of DNA sequences directly, the total number of complementarily matched k-mer pairs (CMKM) and their percentage among all k-mers were used as indicators of the complexity of secondary structures. Although the SED3B-encoded DNA strings are 1.5 times as long as those without the third-base optimization, the total number of CMKMs is reduced by more than 80% with SED3B, as shown in Figure 9. Furthermore, the percentage reduction in CMKMs increases with the data volume. In the case of file c, with a size of 9.797 kilobytes, CMKMs are reduced by 94%. This implies that DNA strings encoded by SED3B have much simpler secondary structures.

Reliable orthogonal information encoding in living cells using SED3B

Effective error correction and low biological relevance make SED3B very promising for orthogonal information encoding in living cells. To test in practice the reliability of information written with SED3B and stored in living cells, we encoded the digital information “Hello, World!” in a plasmid. Since the replication error rate of E. coli cells is very low, we used error-prone PCR to speed up the error enrichment process. We used the JBS dNTP-Mutagenesis Kit, which has a very high mutation rate of up to 20%. The error-prone PCR products were transformed into Stellar E. coli competent cells. Fourteen individual colonies were picked for plasmid extraction and sequencing. The sequencing results revealed that error rates ranging from 11% to 30% had been introduced, with an average error rate of 19.1%. The original information can be retrieved correctly from the 14 sequences. The detailed sequence information is presented in Supporting Information 3.

Random errors can emerge and be enriched exponentially by replication of DNA. The final rate of errors depends on the fidelity of DNA replication and the number of replications. For DNA replication to destroy the stored information, the rate of enriched errors must exceed the error rate that can be tolerated by the encoding scheme. Thus, we get the following inequality:

$$E_f = 1 - (1 - e)^{N} > E_t \qquad \text{(Eq. 1)}$$

where $E_f$ denotes the final rate of errors after $N$ rounds of replication with a replication error probability of $e$ per base, and $E_t$ is the rate of errors that can be tolerated. It has been reported that the DNA replication error rate of E. coli cells is as low as $10^{-9}$ to $10^{-11}$ per base pair (36). Here, we use the highest error rate, i.e. $10^{-9}$, to make a conservative estimate. Although our simulation results show that SED3B can tolerate error rates as high as 40%, we calculated with an error rate of 19.1%, which was demonstrated in practice in the error-prone PCR experiment. Using these numbers, we obtain:

$$N > \frac{\log(1 - E_t)}{\log(1 - e)} = \frac{\log(1 - 0.191)}{\log(1 - 0.000000001)} \approx 2.128 \times 10^{8}$$
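The bound on the number of replications can be checked numerically with a few lines of Python. This is a sketch that assumes Eq. 1 reads E_f = 1 − (1 − e)^N > E_t, with e the per-base replication error probability, N the number of replications and E_t the tolerated error rate. The function name min_replications is an assumption of this sketch.

```python
import math


def min_replications(e: float, tolerated: float) -> float:
    """Solve 1 - (1 - e)**N > tolerated for N.

    log1p(-x) computes log(1 - x) accurately even for very small x.
    """
    return math.log1p(-tolerated) / math.log1p(-e)


# Values used in the text: e = 1e-9 per base, tolerated error rate 19.1%.
n = min_replications(1e-9, 0.191)
print(f"N > {n:.3e}")
```

Evaluated this way, N comes out at roughly 2.1 × 10^8 replications, the same order of magnitude as the value quoted above.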

The doubling time of E. coli is around 0.5 to 1 hour. We use a doubling time of 0.5 hours for the following calculation. Thus, the minimal replication time $T$ required to destroy the information is obtained from the number of replications $N$ and the doubling time $t_d$:

$$T = N \times t_d = 2.128 \times 10^{8} \times 0.5\ \text{hours} \approx 12{,}100\ \text{years}$$

Thus, more than 12,000 years of replication would be needed to distort the information, indicating reliable information encoding in living cells.

In a further experiment, we encoded 10 different barcodes (listed in Table S1 of Supporting Information 4) into pZE21-MCS plasmids and transformed the 10 resulting plasmids into E. coli Top10 cells individually. To test the reliability of the encoded barcodes in E. coli cells, we continuously cultivated the cells for 10 days and exposed them to UV light for 2 to 3 minutes (fatality rate > 60%) per day. The cells from the 10th round of UV exposure were spread on plates at proper dilutions. As confirmed by sequencing five colonies for each barcode, we were able to fully recover the barcodes without any errors. These results reinforce the reliability of the encoded barcodes in living cells. To facilitate the use of SED3B as an information encoding system in living cells, an online system for comment and barcode encoding-decoding has been implemented and released at http://biosystem.bt1.tu-harburg.de/sed3b/.

Concluding remarks

Reliable information storage in living cells may seem questionable, since errors can be introduced and enriched exponentially over rounds of replication and the artificial sequences may interfere with natural ones. In the current study, we presented a novel encoding scheme, SED3B, which takes full advantage of the inherent redundancy of DNA molecules for error correction. As indicated for the first time by in silico and experimental results, SED3B can effectively correct exponentially enriching errors during DNA replication using a small population of DNA molecules. Based on error-prone PCR experiments with E. coli cells, more than 12,000 years of continuous replication are estimated to be required to make SED3B-encoded information unrecoverable in growing E. coli cells. Furthermore, we showed for the first time that SED3B-encoded DNA sequences have little biological relevance to known natural DNA sequences.

Synthetic biologists are designing diverse biological parts, circuits and even whole genomes and organisms for various applications. To organize such an ever-growing ‘library’, a reliable and efficient barcoding system is required to avoid misuse of plasmids and organisms and to allow easy detection of misuse. Moreover, as in computer programming, we also need to encode information as comments in “synthetic biological programs” in relevant applications. In these cases, SED3B can be applied for reliable information encoding with little effect on the functions of the designed parts and cells.

In principle, the SED3B encoding scheme is also applicable to in vitro data storage in DNA, although more extensive investigations would be required to confirm its value in this regard. Indeed, SED3B shows some advantages for large-scale data storage owing to its coding scheme. First, SED3B is the first encoding scheme shown to tolerate both substitution and indel errors well. We showed that, using merely ten DNA sequences, SED3B can correct a substitution error rate of 3%. It has been reported that the error rate of high-throughput DNA synthesis technology is currently around 0.5% (18). Thus, five sequences are enough for reliable information encoding with state-of-the-art DNA synthesis technology using SED3B. In a recent study by Erlich and Zielinski, 1,300 copies of each DNA fragment were used for reliable data storage in DNA (7). For successful recovery of the data, they needed to sequence merely about 10 copies of each DNA molecule. Our results show that the copy number of molecules for storage can be further reduced by using SED3B, which in turn will enhance the storage density.

Additionally, releasing a huge amount of artificial DNA fragments into the environment might cause biological safety issues, especially for large-scale data storage. For example, microbes in nature may employ the novel DNA fragments to generate diversity. This, in turn, may accelerate the development of antibiotic resistance in microbes, which is one of the most critical problems for human health at present (37). Thus, the encoding scheme should provide mechanisms to avoid or reduce the formation of biologically relevant DNA sequences. With its unique feature of low biological relevance, SED3B is the first scheme to show potential for large-scale data storage with biological safety taken into account.


References

1. Watson, J. D., and Crick, F. H. C. (1953) Molecular Structure of Nucleic Acids. A Structure for Deoxyribose Nucleic Acid, Nature 171, 737–738.
2. Church, G. M., Gao, Y., and Kosuri, S. (2012) Next-generation digital information storage in DNA, Science (New York, N.Y.) 337, 1628.
3. Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E. M., Sipos, B., and Birney, E. (2013) Towards practical, high-capacity, low-maintenance information storage in synthesized DNA, Nature 494, 77–80.
4. Fu, Q., Li, H., Moorjani, P., Jay, F., Slepchenko, S. M., Bondarev, A. A., Johnson, P. L. F., Aximu-Petri, A., Prufer, K., Filippo, C. de, Meyer, M., Zwyns, N., Salazar-Garcia, D. C., Kuzmin, Y. V., Keates, S. G., Kosintsev, P. A., Razhev, D. I., Richards, M. P., Peristov, N. V., Lachmann, M., Douka, K., Higham, T. F. G., Slatkin, M., Hublin, J.-J., Reich, D., Kelso, J., Viola, T. B., and Paabo, S. (2014) Genome sequence of a 45,000-year-old modern human from western Siberia, Nature 514, 445–449.
5. Extance, A. (2016) How DNA could store all the world's data, Nature 537, 22–24.
6. Bancroft, C., Bowler, T., Bloom, B., and Clelland, C. T. (2001) Long-term storage of information in DNA, Science (New York, N.Y.) 293, 1763–1765.
7. Erlich, Y., and Zielinski, D. (2017) DNA Fountain enables a robust and efficient storage architecture, Science (New York, N.Y.) 355, 950–954.
8. Clelland, C. T., Risca, V., and Bancroft, C. (1999) Hiding messages in DNA microdots, Nature 399, 533–534.
9. Heider, D., and Barnekow, A. (2007) DNA-based watermarks using the DNA-Crypt algorithm, BMC bioinformatics 8, 176.
10. Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R.-Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., Krishnakumar, R., Assad-Garcia, N., Andrews-Pfannkoch, C., Denisova, E. A., Young, L., Qi, Z.-Q., Segall-Shapiro, T. H., Calvey, C. H., Parmar, P. P., Hutchison, C. A., Smith, H. O., and Venter, J. C. (2010) Creation of a bacterial cell controlled by a chemically synthesized genome, Science (New York, N.Y.) 329, 52–56.
11. Heider, D., and Barnekow, A. (2008) DNA watermarks: a proof of concept, BMC molecular biology 9, 40.
12. Heider, D., Kessler, D., and Barnekow, A. (2008) Watermarking sexually reproducing diploid organisms, Bioinformatics (Oxford, England) 24, 1961–1962.
13. Heider, D., Pyka, M., and Barnekow, A. (2009) DNA watermarks in non-coding regulatory sequences, BMC research notes 2, 125.
14. Liss, M., Daubert, D., Brunner, K., Kliche, K., Hammes, U., Leiherer, A., and Wagner, R. (2012) Embedding permanent watermarks in synthetic genes, PloS one 7, e42465.
15. Haughton, D., and Balado, F. (2013) BioCode: two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA, BMC bioinformatics 14, 121.
16. http://syntheticbiology.org/Vectors/Barcode.html.
17. Jain, M., Fiddes, I. T., Miga, K. H., Olsen, H. E., Paten, B., and Akeson, M. (2015) Improved data analysis for the MinION nanopore sequencer, Nat. Methods 12, 351–356.
18. Kosuri, S., and Church, G. M. (2014) Large-scale de novo DNA synthesis: technologies and applications, Nat. Methods 11, 499–507.
19. Grass, R. N., Heckel, R., Puddu, M., Paunescu, D., and Stark, W. J. (2015) Robust chemical preservation of digital information on DNA in silica with error-correcting codes, Angew. Chem. Int. Ed. Engl. 54, 2552–2555.
20. Yim, A. K.-Y., Yu, A. C.-S., Li, J.-W., Wong, A. I.-C., Loo, J. F. C., Chan, K. M., Kong, S. K., Yip, K. Y., and Chan, T.-F. (2014) The Essential Component in DNA-Based Information Storage System: Robust Error-Tolerating Module, Front Bioeng Biotechnol 2, 49.


21. Blawat, M., Gaedke, K., Hütter, I., Chen, X.-M., Turczyk, B., Inverso, S., Pruitt, B. W., and Church, G. M. (2016) Forward Error Correction for DNA Data Storage, Procedia Computer Science 80, 1011–1022.
22. Bornholt, J., Lopez, R., Carmean, D. M., Ceze, L., Seelig, G., and Strauss, K. A DNA-Based Archival Storage System. In the Twenty-First International Conference (Conte, T., and Zhou, Y., Eds.), pp 637–649.
23. Tabatabaei Yazdi, S. M. H., Gabrys, R., and Milenkovic, O. (2016) Portable and Error-Free DNA-Based Data Storage, bioRxiv Epub 2016. DOI: 10.1101/079442.
24. Akhmetov, A., Ellington, A., and Marcotte, E. (2016) A highly parallel strategy for storage of digital information in living cells.
25. Yachie, N., Ohashi, Y., and Tomita, M. (2008) Stabilizing synthetic data in the DNA of living organisms, Systems and synthetic biology 2, 19–25.
26. Mueller, S., Jafari, F., and Roth, D. (2016) A covert authentication and security solution for GMOs, BMC bioinformatics 17, 389.
27. Fischer, M. D., McClements, M. E., La Martinez-Fernandez de Camara, C., Bellingrath, J.-S., Dauletbekov, D., Ramsden, S. C., Hickey, D. G., Barnard, A. R., and MacLaren, R. E. (2017) Codon-Optimized RPGR Improves Stability and Efficacy of AAV8 Gene Therapy in Two Mouse Models of X-Linked Retinitis Pigmentosa, Molecular therapy : the journal of the American Society of Gene Therapy 25, 1854–1865.
28. Pouyet, F., Mouchiroud, D., Duret, L., and Sémon, M. (2017) Recombination, meiotic expression and human codon usage, eLife Epub 2017. DOI: 10.7554/eLife.27344.
29. Benson, G. (1999) Tandem repeats finder: a program to analyze DNA sequences, Nucleic Acids Res. 27, 573–580.
30. Pevzner, P. A., Tang, H., and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America 98, 9748–9753.
31. Wei, B., and Chen, T. (2014) Verifying Data Migration Correctness: The Checksum Principle, RTI Press.
32. Saitou, N., and Ueda, S. (1994) Evolutionary rates of insertion and deletion in noncoding nucleotide sequences of primates, Molecular biology and evolution 11, 504–512.
33. Mandell, D. J., Lajoie, M. J., Mee, M. T., Takeuchi, R., Kuznetsov, G., Norville, J. E., Gregg, C. J., Stoddard, B. L., and Church, G. M. (2015) Biocontainment of genetically modified organisms by synthetic protein design, Nature 518, 55–60.
34. Rovner, A. J., Haimovich, A. D., Katz, S. R., Li, Z., Grome, M. W., Gassaway, B. M., Amiram, M., Patel, J. R., Gallagher, R. R., Rinehart, J., and Isaacs, F. J. (2015) Recoded organisms engineered to depend on synthetic amino acids, Nature 518, 89–93.
35. Goodwin, S., McPherson, J. D., and McCombie, W. R. (2016) Coming of age: ten years of next-generation sequencing technologies, Nature reviews. Genetics 17, 333–351.
36. Fijalkowska, I. J., Schaaper, R. M., and Jonczyk, P. (2012) DNA replication fidelity in Escherichia coli: a multi-DNA polymerase affair, FEMS microbiology reviews 36, 1105–1121.
37. Blair, J. M. A., Webber, M. A., Baylay, A. J., Ogbolu, D. O., and Piddock, L. J. V. (2015) Molecular mechanisms of antibiotic resistance, Nature reviews. Microbiology 13, 42–51.

Author contributions

LFS proposed the original idea and performed the studies. APZ supervised the studies. LFS and APZ wrote the manuscript.

Competing financial interests

A patent has been released (PCT/EP2016/078122) regarding the encoding scheme presented in this study.


Supporting Information

Supporting Information 1: Source code of the Perl scripts used for encoding and decoding data using the SED3B scheme
Supporting Information 2: Input files utilized for error tolerance simulations
Supporting Information 3: Primers and map of the plasmid used in the error-prone PCR experiment
Supporting Information 4: Lists of the 10 barcodes and primers utilized for construction of the 10 barcode-encoding plasmids
Supporting Information 5: Source code of the Perl script used for biological relevance analysis and the search results against the NCBI nr database
Supporting Information 6: Source code of the Perl script used for k-mer analysis


TABLE AND FIGURE LEGENDS

Figure 1. Illustration of encoding binary data into DNA string using the SED3B encoding scheme.


Figure 2. Detailed steps of decoding error-containing DNA strings into an error-free bit string.

The black, green and red characters stand for the data-encoding bases, error-correction bases and error-containing bases, respectively. The encoding scheme permits detection of insertion and deletion errors by detecting runs of consecutive errors in encoding blocks.


Figure 3. Detection and repression of substitution errors by SED3B encoding scheme

▲ Percentage of errors detected by the SED3B method. ┿ Percentage of errors remaining in DNA fragments after removal of the detected errors. × Percentage of random errors introduced during simulations. Errors were introduced into DNA fragments randomly, base by base. A range of error rates from 1% to 30% was simulated in increments of 1%. At each step, random errors were introduced at the specified error rate, and each step was iterated 500 times. More than 90% of errors were detected when the error rate was below 10%, and more than 78% of errors were detected even at an error rate as high as 30%. The error rates after error repression (red crosses) are more than one order of magnitude lower than the untreated ones.
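The error-introduction step of this simulation can be sketched as follows. This is a minimal illustration, not the authors' Perl implementation: it assumes that each base is replaced independently by one of the three other bases with the given probability, and the 200-base reference, the function names and the seed are all hypothetical placeholders. It covers only the introduction of errors, not SED3B error detection.

```python
import random

BASES = "ACGT"


def introduce_substitutions(seq: str, error_rate: float, rng: random.Random) -> str:
    """Replace each base, independently with probability error_rate, by a different base."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def observed_error_rate(original: str, mutated: str) -> float:
    """Fraction of positions at which the mutated string differs from the original."""
    return sum(a != b for a, b in zip(original, mutated)) / len(original)


def sweep(reference: str, iterations: int = 500, seed: int = 0) -> dict:
    """Mean observed error rate for settings of 1% to 30% in 1% steps."""
    rng = random.Random(seed)
    results = {}
    for percent in range(1, 31):
        rate = percent / 100
        observed = [
            observed_error_rate(reference, introduce_substitutions(reference, rate, rng))
            for _ in range(iterations)
        ]
        results[percent] = sum(observed) / iterations
    return results


reference = "ACGT" * 50  # hypothetical 200-base placeholder, not an SED3B-encoded string
results = sweep(reference)
print(f"setting 10%: mean observed error rate {results[10]:.3%}")
```

Averaged over 500 iterations, the observed rate should track the setting closely, which corresponds to the × series in the figure.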

[Figure 3 plot. Horizontal axis: setting of error rates introduced (0% to 30%). Vertical axis: percentage of errors (logarithmic scale, 0.01% to 100%).]

Figure 4. Correction of substitution errors by multiple DNA sequences encoded with SED3B

■ Percentage of errors in DNA strings after removing the detected errors. ▲ Percentage of errors remaining in the final recovered information when 10 error-containing DNA strings were used for retrieval. × Percentage of errors remaining in the final recovered information when 100 DNA strings were used for retrieval. Errors were introduced into DNA fragments randomly, base by base. A range of error rates from 1% to 40% was simulated in increments of 1%. At each step, random errors were introduced at the specified error rate, and each step was iterated 500 times. Panels A and B show the same results with different axis ranges.

[Figure 4 plots, panels A and B. Horizontal axis: error rate setting (0% to 40%). Vertical axis: percentage of errors (panel A: 0% to 10%; panel B: 0.00% to 0.30%).]

Figure 5. Simulation of the number of sequences required for reliable information recovery from DNA fragments with various rates of substitution errors

To estimate the number of sequences required for reliable correction of a specific rate of errors, a series of simulations was performed over error rates from 1% to 40% in increments of 1%. At each simulated error rate, we started with a small number of sequences and attempted to retrieve the data in 500 iterations. If errors emerged in any of the 500 iterations, we increased the number of sequences by one and repeated the process until no errors were found in all 500 iterations.
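The search procedure described in this legend can be sketched as follows. Only the outer loop (increase the number of sequences by one until all 500 iterations are error-free) follows the text; the recovery step itself is a stand-in. As a loudly labeled assumption, the sketch uses a plain per-base majority vote over noisy copies instead of SED3B decoding, and all names, the 60-base reference and the 5% error rate are hypothetical.

```python
import random
from collections import Counter
from typing import Callable

BASES = "ACGT"


def required_sequences(trial: Callable[[int], bool], start: int = 1,
                       iterations: int = 500, max_sequences: int = 1000) -> int:
    """Smallest count n >= start for which trial(n) succeeds in every iteration."""
    for n in range(start, max_sequences + 1):
        if all(trial(n) for _ in range(iterations)):
            return n
    raise RuntimeError("no sequence count up to max_sequences passed all iterations")


def mutate(seq: str, error_rate: float, rng: random.Random) -> str:
    """Substitute each base independently with probability error_rate."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < error_rate else base
        for base in seq
    )


def make_majority_trial(reference: str, error_rate: float,
                        rng: random.Random) -> Callable[[int], bool]:
    """Stand-in recovery (NOT SED3B): per-base majority vote over n noisy copies."""
    def trial(n: int) -> bool:
        copies = [mutate(reference, error_rate, rng) for _ in range(n)]
        consensus = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*copies)
        )
        return consensus == reference
    return trial


rng = random.Random(1)
reference = "".join(rng.choice(BASES) for _ in range(60))  # hypothetical payload
needed = required_sequences(make_majority_trial(reference, 0.05, rng))
print("sequences needed in this toy model:", needed)
```

Because the stand-in decoder differs from SED3B, the numbers it produces are not comparable with those in the figure; the sketch only illustrates the stopping rule.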

[Figure 5 plot. Horizontal axis: error rates in DNA fragments (0% to 50%). Vertical axis: required number of sequences for error tolerance (0 to 300). Fitted curve: y = 6.0413e^(8.4916x), R² = 0.9926. Labelled data point: 40%, 201.]

Figure 6. Correction of indel errors by SED3B with multiple DNA sequences

▲ Percentage of errors remaining in the final recovered information when 10 error-containing DNA strings were used for retrieval. × Percentage of errors remaining in the final recovered information when 100 DNA strings were used for retrieval. Indel errors were introduced into the DNA fragments randomly, base by base. At each step, random indel errors were introduced at the specified error rate, and each step was iterated 500 times. An indel error rate of 0.4% can be corrected using 10 DNA sequences, and an indel error rate of 9% can be corrected with 100 DNA sequences.
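Base-by-base introduction of indel errors can be sketched in the same spirit. The legend does not state how insertions and deletions were balanced, so this minimal sketch assumes that each indel event is a deletion or an insertion with equal probability and that an inserted base is drawn uniformly; the function name and example string are hypothetical.

```python
import random

BASES = "ACGT"


def introduce_indels(seq: str, error_rate: float, rng: random.Random) -> str:
    """At each base, with probability error_rate, either delete it or insert a base after it."""
    out = []
    for base in seq:
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                continue                    # deletion: drop this base
            out.append(base)
            out.append(rng.choice(BASES))   # insertion: add a random base after it
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
original = "ACGTTGCAAGCT" * 10  # hypothetical 120-base placeholder
mutated = introduce_indels(original, 0.05, rng)
print(len(original), len(mutated))
```

Unlike a substitution, an indel changes the length of the string and shifts every downstream base, which is why indel tolerance is simulated separately from substitution tolerance.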


Figure 7. Correct information retrieved using 14 sequences with high rates of errors introduced by error-prone PCR

The three-base blocks with errors detected are replaced with “---”. The average error rate is ~19.1%.
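One way to picture how several masked reads can be combined is a block-wise vote that ignores masked blocks. The sketch below is an assumption for illustration only: this section does not spell out how SED3B merges multiple sequences, so a simple majority vote over three-base blocks stands in for it, and the reads are made-up examples rather than data from the experiment.

```python
from collections import Counter

MASK = "---"


def block_consensus(masked_reads: list, block: int = 3) -> str:
    """Majority vote per block, skipping blocks that were masked as erroneous."""
    length = len(masked_reads[0])
    if length % block or any(len(read) != length for read in masked_reads):
        raise ValueError("reads must have equal length, a multiple of the block size")
    consensus = []
    for start in range(0, length, block):
        votes = Counter(
            read[start:start + block]
            for read in masked_reads
            if read[start:start + block] != MASK
        )
        # If every read is masked at this block, the block stays unresolved.
        consensus.append(votes.most_common(1)[0][0] if votes else MASK)
    return "".join(consensus)


reads = ["ACG---TTA", "ACGGCA---", "---GCATTA", "ACGGCATTC"]  # hypothetical reads
print(block_consensus(reads))  # prints ACGGCATTA
```

In this toy example the undetected error in the last read (TTC) is outvoted by the two unmasked copies of TTA.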


Figure 8. Comparative analysis of SED3B encoded sequences with natural DNA sequences

All partial sequences longer than 66 bp that satisfy the SED3B scheme are tandem repeats. The 30,151,123 sequences in the NCBI Nucleotide database collected on 28.05.2015 were used as inputs. All three frames were analyzed. The horizontal axis shows the length cut-off of partial sequences. The vertical axis shows the total number of partial sequences with a length equal to or longer than the cut-off. The small chart at the top right is a zoomed-in view of the large chart. Blue dots: total numbers of matched partial sequences equal to or longer than a specific length. Red dots: total numbers of matched partial sequences that were found to be tandem repeats.


Figure 9. The number of complementarily matched k-mers is reduced remarkably by using the SED3B scheme.

The numbers above the blue bars give the percentage reduction in complementarily matched k-mers achieved by applying SED3B for secondary structure optimization, for each input file.
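Counting complementarily matched k-mers can be sketched as below. The exact definition used in the paper is given outside this section, so the sketch assumes one plausible reading: a k-mer occurrence counts if its reverse complement also occurs somewhere in the same sequence. The value of k, the function names and the example string are hypothetical, and this is not the authors' Perl script from Supporting Information 6.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def count_complementary_kmers(seq: str, k: int) -> int:
    """Number of k-mer occurrences whose reverse complement also occurs in seq."""
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return sum(
        count for kmer, count in kmers.items()
        if reverse_complement(kmer) in kmers
    )


example = "AAACCCTGGGTTT"  # AAACCC and GGGTTT are reverse complements of each other
print(count_complementary_kmers(example, 6))  # prints 2
```

Under this assumed definition, a lower count means fewer stretches that could pair with each other, which is the sense in which the legend relates the count to secondary structure.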

[Figure 9 bar chart. Vertical axis: number of complementarily matched k-mers (0.00E+00 to 1.20E+07). Series: optimized with SED3B, not optimized. Input files: File A (1,005 kb), File B (1,704 kb), File C (9,797 kb). Percentage labels shown: 94.2%, 90.2%, 80.9%.]

Orthogonal and reliable for thousands of years
