Article pubs.acs.org/jpr

Iterative Genome Correction Largely Improves Proteomic Analysis of Nonmodel Organisms

Xiaohui Wu,† Lina Xu,† Wei Gu, Qian Xu, Qing-Yu He,* Xuesong Sun,* and Gong Zhang*

Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Huang-Pu Avenue West 601, Guangzhou 510632, China

Supporting Information

ABSTRACT: The current application and development of proteomic studies typically depend on the availability of sequenced genomes. Protein identification based on peptides detected by liquid chromatography-tandem mass spectrometry is limited by the absence of sequenced genomes in many nonmodel organisms. In this study, we demonstrate a new strategy based on our stable, accurate, and error-tolerant FANSe (Fast and Accurate mapping tool for Nucleotide Sequencing datasets) mapping algorithm to correct genome sequences iteratively. To evaluate the efficiency of the corrected genome databases in proteomic studies, MS/MS spectra of the whole proteome extracted from a Bacillus pumilus strain lacking a complete genome sequence were searched against protein sequence databases derived from the complete reference genome sequence of a homologous bacterium and from the corrected genome sequence. The results indicated that the corrected protein sequence database significantly facilitated peptide/protein identification. Importantly, this strategy can also detect novel peptide variants. This strategy of genome correction will promote the development of functional proteomics in nonmodel organisms.

KEYWORDS: genome correction, accurate mapping, protein identification
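The identification principle the abstract refers to, matching measured peptide masses against theoretical masses computed from a sequence database, can be illustrated with a minimal sketch. This is not the algorithm of any engine named in this paper; the function names, the 10 ppm tolerance, and the simplified unmodified-peptide assumption are illustrative only.

```python
# Illustrative sketch of database-search peptide identification:
# compare a measured peptide mass against theoretical masses computed
# from candidate database peptides. Modifications, missed cleavages,
# and fragment-ion scoring are deliberately omitted.

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added for the peptide termini

def peptide_mass(seq: str) -> float:
    """Theoretical monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def match_candidates(measured_mass, candidates, tol_ppm=10.0):
    """Return candidate peptides whose theoretical mass lies within
    tol_ppm parts-per-million of the measured mass."""
    hits = []
    for seq in candidates:
        theo = peptide_mass(seq)
        if abs(theo - measured_mass) / theo * 1e6 <= tol_ppm:
            hits.append(seq)
    return hits
```

The sketch makes the paper's central dependency concrete: a peptide can only be reported if its exact sequence, and hence its mass, is present in the searched database, which is why database quality dominates identification in nonmodel organisms.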



INTRODUCTION

Mass spectrometry (MS)-based proteomics has developed into a powerful tool for investigating the function of genes in living organisms. Highly efficient MS identification algorithms, such as ProVerB,1 MASCOT,2 Sequest,3 pFind,4−6 and X!Tandem,7,8 rely on matching the masses of intact peptides to the corresponding masses derived from reference databases such as RefSeq and UniProt. These protein sequence databases depend largely on genomic information. However, many important species have not yet been adequately covered despite the rapid development of genome sequencing technology. The lack of reliable genome sequences seriously impedes the application and development of proteomic studies because it makes peptide identification difficult, especially in microbial communities, owing to the high genomic variation among strains within one species.9,10 Although de novo peptide sequencing has an obvious advantage for protein identification in organisms without sequenced genomes, it still suffers from low identification accuracy, requires a significant amount of manual validation, and faces an immense challenge in recovering full-length peptides compared with database search.11 Although the Multi-Strain Mass Spectrometry Prokaryotic Database Builder (MSMSpdbb) can generate a combined protein database from several closely related species,12 cross-species comparison is still not efficient in identifying

proteins originating from species phylogenetically distant from the corresponding reference organisms or belonging to poorly conserved protein families.13 Moreover, cross-species protein identification depends heavily on the existing protein databases of closely related species and cannot capture all actual amino acid variations. Therefore, we aimed to establish a practical method that identifies the actual sequence variations and is independent of information from other species. There are two strategies at the genomic level to overcome the sequence variations between the reference genome sequence and the nonmodel organism: de novo genome assembly and mapping-based genomic sequence correction.14,15 De novo genome assembly uses short sequencing reads to assemble a draft genome and then performs automated annotation to mark the possible open reading frames (ORFs). However, the current sequencing platforms and automated assembly algorithms output tens of thousands of short, broken contigs that lack ORF integrity or hinder ORF prediction, even for a small genome.16 Accurate scaffolding and gap-closing necessitate extensive and skillful manual work or additional experiments such as primer walking.16−18 Still, these assemblies are error-prone: a de novo assembly of the bacterium Alcanivorax borkumensis SK2 (3.1 Mbp

Received: October 8, 2013
Published: May 8, 2014

dx.doi.org/10.1021/pr500369b | J. Proteome Res. 2014, 13, 2724−2734


genome) at 30× sequencing depth reaches a correctness of only 95.3% (roughly one error in every 20 nucleotides) and a coverage of 98.7%, lower than the almost perfect mapping-based approach.15 In addition, automatic annotation still needs improvement: the best annotator in the test could annotate only 52.8% of the real ORFs of the Alcanivorax borkumensis SK2 assembly, while the false-positive rates (FPRs) were >49%.15 In contrast, the mapping-based genome correction strategy maps sequencing reads to the whole genome sequence of a strain of the same species, searches for single nucleotide variations (SNVs), and updates the reference sequence accordingly. It is particularly accurate when a closely related genome sequence is available,15 and the detailed annotation of the reference sequence can be applied directly. Although mapping-based genome correction cannot handle large insertions relative to the reference genome or genome shuffling, these missing parts are often noncrucial in protein-oriented studies because most of the coding genes in the systems are present. Therefore, it has been used in population genotyping (reviewed in ref 19) as well as in reduced-representation genome analysis.20 A number of error-correction methods for sequencing are available.21 However, the drawbacks are clear: (i) the SNV sensitivity can be as low as 0.03% and varies largely depending on the data set; (ii) the most sensitive algorithm needs more than 11 h and ∼10 GB of RAM to correct a small E. coli genome, which is computationally very expensive; and (iii) all successful approaches handled genomes that are very close to the reference genome (deviation

described previously.23 The PCR product was sequenced on an ABI 3730XL instrument with BigDye chemistry (Applied Biosystems). The 16S rRNA sequence was aligned against the NCBI nucleotide collection using the nucleotide BLAST program.
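The core of the mapping-based correction loop described in the Introduction, tallying the bases that mapped reads place at each reference position and replacing the reference base wherever a variant dominates, can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the FANSe implementation: alignments are represented as gap-free (start, read) tuples, and the depth and consensus-fraction thresholds are arbitrary placeholders.

```python
# Minimal sketch of one round of mapping-based genome correction.
# Iterating lets reads that previously failed to map (too many
# mismatches against the old reference) map in the next round.
from collections import Counter

def correct_reference(reference, alignments, min_depth=3, min_fraction=0.8):
    """reference: genome string; alignments: list of (start, read_seq)
    tuples, start being the 0-based position where the read aligned.
    Returns (corrected_sequence, number_of_SNVs_applied)."""
    # Pile up the base counts each read contributes per position.
    piles = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if pos < len(piles):
                piles[pos][base] += 1
    # Replace the reference base where a variant clearly dominates.
    corrected, n_snvs = list(reference), 0
    for pos, pile in enumerate(piles):
        depth = sum(pile.values())
        if depth >= min_depth:
            base, count = pile.most_common(1)[0]
            if base != reference[pos] and count / depth >= min_fraction:
                corrected[pos] = base
                n_snvs += 1
    return "".join(corrected), n_snvs
```

In an iterative scheme, the reads would be remapped to the corrected sequence and this step repeated until no further SNVs are applied, i.e., until the reference has converged toward the sequenced strain.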