Functional Annotation of Proteins Encoded by the Minimal Bacterial

effective training dataset extracting from various bacterial genomes. The experimentally validated ...... protein function annotation and will facilit...
2 downloads 0 Views 2MB Size
Subscriber access provided by Kaohsiung Medical University

Article

Functional Annotation of Proteins Encoded by the Minimal Bacterial Genome Based on Secondary Structure Element Alignment Zhiyuan Yang, and Stephen Kwok-Wing Tsui J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00262 • Publication Date (Web): 14 May 2018 Downloaded from http://pubs.acs.org on May 16, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 38 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Functional Annotation of Proteins Encoded by the Minimal Bacterial Genome Based on Secondary Structure Element Alignment

Zhiyuan Yang 1,2,3, Stephen Kwok-Wing Tsui 2,3,4,* 1

College of Life Information Science & Instrument Engineering, Hangzhou Dianzi University, Hangzhou 310018, China

2

School of Biomedical Sciences, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

3

Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

4

Centre for Microbial Genomics and Proteomics, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

1 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 38

Abstract In synthetic biology, one of key focuses is building a minimal artificial cell which can provide basic chassis for functional study. Recently, the J. Craig Venter Institute published the latest version of the minimal bacterial genome JCVI-syn3.0, which only encoded 438 essential proteins. However, among them functions of 149 proteins remain unknown because of the lack of effective annotation method. Here, we report a secondary structure element alignment method called SSEalign based on an effective training

dataset

extracting

from

various

bacterial

genomes.

The

experimentally validated homologous genes in different species were selected as training positives, while unrelated genes in different species were selected as training negatives. Moreover, SSEalign used a set of well-defined basic alignment elements with the backtracking line search algorithm to derive the best parameters for accurate prediction. Experimental results showed that SSEalign achieved 88.2% test accuracy, which is better than existing prediction methods. SSEalign was subsequently applied to identify the functions of those unannotated proteins in the latest published minimal bacteria genome JCVI-syn3.0. Results indicated that at least 136 proteins out of 149 unannotated proteins in the JCVI-syn3.0 genome could be annotated by SSEalign. Our method is effective for the identification of protein homology in JCVI-syn3.0 and can be used to annotate those hypothetical proteins in other bacterial genomes. Keywords: Protein secondary structure, homology identification, JCVI-syn3.0, essential gene 2 ACS Paragon Plus Environment

Page 3 of 38 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

1. INTRODUCTION A minimal genome refers to a genome chassis that provides the basic functions for a living creature without redundant genes. In 2010, the first version of the minimal bacterial genome JCVI-syn1.0 had been reported by Gibson et al.

1

with chemically

synthesized methodology. This genome was transplanted into the cell of Mycoplasma

capricolum whose nucleus has been removed. This artificial genome had finally been demonstrated to possess the potential to self-replication and alive in the basic culture medium. Later, transposon mutagenesis technique

2

was applied to the genes of

JCVI-syn1.0 to identify dispensable genes. Finally, a more compressed genome JCVI-syn3.0 was obtained, which is smaller than any genomes of autonomously replicating cells reported before 3. The JCVI-syn3.0 genome is approximately 531kbp in size and consists of 473 essential genes (438 protein-coding and 35 RNA-coding genes). The 438 protein-coding genes were then annotated by searching against TIGRfam database 4 and divided into two groups: (a) JCVI-kno3.0 genes: 289 genes whose functions are clearly known, including those genes whose functions are extensively studied and can be supported by multiple aspects of evidence, such as genomic context and the structure information. (b) JCVI-unk3.0 genes: 149 genes whose functions are ambiguous and even completely unknown. As all these genes are essential for living organisms, we 3 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

hypothesized that the 149 encoded proteins with unknown functions should share homology with proteins in other bacteria. Although many tools are currently available for the homology identification, their performance of detecting protein homology in bacteria is sub-optimal. This problem is particularly serious in novel genome annotation, which relies on sequence alignment to annotate the functions of translated proteins. The reason for their failures is that they have used datasets of protein homology with a wide range of identities but not datasets within bacteria, whose alignment identity of protein homology often fall into the twilight zone (≤35%). To overcome this shortcoming, we strictly restricted the training dataset to be protein homology in bacteria and specifically optimized parameters for detecting homology in this situation. It is well known that the evolution rate of protein secondary structures (SS) is much slower than that of amino acids 5. After million years of evolution, amino acid sequences of protein homologs have greatly changed but their structures have not been significantly disrupted. The protein secondary structure is also the basis of the tertiary structure of a protein and it can serve as a bridge that links the primary sequence and the tertiary structure for the functional analysis. The tools for protein secondary structure prediction have been extensively studied since the 1990s. These tools can be typically divided into two different categories: template-based prediction and ab initio prediction. Many of these computational 4 ACS Paragon Plus Environment

Page 4 of 38

Page 5 of 38 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

methods are based on the close templates but a few for ab initio prediction. Among these tools, three of them (JPRED4 6, PsiPred 7 and SSpro 8) have been widely used for protein secondary structure prediction. JPRED4 is a web-based tool and the query sequence has a limit of protein length (≤800 amino acids), making it not suitable for SS prediction in this study. PsiPred is a template-based tool for secondary structure prediction. Considering that closely related templates are not available for the protein homology in the twilight zone, PsiPred is also not suitable for SSEalign. Recently, a study reported a comparison of the prediction accuracy of these tools using third-party datasets. They found that SSpro has the best prediction accuracy for CullPDB and CB513 datasets 9. Furthermore, the ab initio prediction method had been highly improved recently, which makes it possible to predict the secondary structures of those proteins in the twilight zone. Therefore, SSpro using ab initio strategy was employed to conduct secondary structure prediction in SSEalign. In this study, we report a method that is specifically designed for homology identification of hypothetical proteins. We first trained our method in well-studied bacteria to show the performance of the SSEalign. The line search optimization approach was used to derive the best scoring matrix for this application. The derived parameters were then applied to identify the homology among the JCVI-syn3.0 genome and several other well-annotated bacteria. Lastly, six variables were used to further interpret our annotation results.

5 ACS Paragon Plus Environment

Journal of Proteome Research 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

2. EXPERIMENTAL SECTION Our work was shown in Figure 1 and was composed of two sections: SSEalign in SCOP 10 and ECOD datasets 11, SSEalign applied in the JCVI-syn3.0 dataset.

6 ACS Paragon Plus Environment

Page 6 of 38

Page 7 of 38 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 1 - Flowchart of our work. (A) SSEalign in SCOP and ECOD datasets; (B) SSEalign applied in the JCVI-syn3.0 genome. BAE: basic alignment element; SCOP35: subset of SCOP database, any protein alignments within this subset were less than or equal 35%; SSE: Secondary Structure Element. BAC dataset: special bacteria dataset for training in JCVI-syn3.0.

2.1 SCOP and ECOD datasets datasets These two datasets (SCOP and ECOD) are manually established by different groups using different methods. The proteins in the SCOP database were selected as the training dataset, while proteins in the ECOD database were used as independent controls to evaluate the performance of SSEalign. Different hierarchies are present in the SCOP database, such as class, fold and superfamily. The protein members of the same superfamily were considered as homologs suggested by He et al

12.

Thus, the protein pairs within the same

superfamily were selected as positive samples, while the protein pairs from different superfamilies were selected as negative samples. Since imbalance of positive and negative samples could dramatically affect the model training in machine learning problems 13, the same number of positive and negative samples were selected. Short sequences (