The Need for Manuscripts To Include Database Identifiers for Proteins

Jul 24, 2018 - Departments of Biochemistry and Chemistry and Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illin...
Editorial Cite This: Biochemistry 2018, 57, 4239−4240

pubs.acs.org/biochemistry




and co-workers investigated the levels of misannotation in several well-curated enzyme superfamilies in four public protein sequence databases (UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, GenBank NR, and KEGG) (ref 2). They concluded that the levels of misannotation in the automatically annotated databases (UniProtKB/TrEMBL and GenBank NR) averaged 5−63%, with the level of misannotation exceeding 80% for some superfamilies. Usually, the functions were “overpredicted”; i.e., the annotations provided more molecular detail than justified, e.g., specific reactions using specific substrates. It is likely that, since 2009, the level of misannotation has increased as errors are propagated by automated annotation transfer.
UniProt allows any member of the community with experimental evidence to submit revisions or corrections to TrEMBL annotations (http://www.uniprot.org/help/submissions#guidelines and http://www.uniprot.org/update); the entry with the verified function is transferred to UniProtKB/Swiss-Prot and is publicly available in subsequent database releases. In contrast, GenBank records (with annotations) can be changed only by, or with the permission of, the original submitter of the record. NCBI/GenBank provides the Third Party Sequence Annotation database (TPA) that allows others to provide postsubmission experimental annotation data (see https://www.ncbi.nlm.nih.gov/books/NBK53704/#gbankquickstart.although_i_m_not_listed).
UniProtKB/Swiss-Prot does not (and cannot) provide complete coverage of published functional information (the consequence of manual curation), and the update procedures described in the previous paragraph are easily ignored (if they are known at all) by the experimental community; thus, a more efficient approach for transferring experimentally determined functions from the literature to the databases is required.
Biochemistry has now adopted guidelines (“Accession IDs for Proteins”) to facilitate the transfer of experimentally verified functions from published articles to the UniProt database. These guidelines are published in the Information for Authors (http://pubs.acs.org/paragonplus/submission/bichaw/bichaw_authguide.pdf) and are reproduced at the end of this editorial. Briefly, authors are asked to include accession IDs for all proteins experimentally characterized in their manuscripts: the accession IDs should be indicated in parentheses after the protein name in the text of the manuscript, e.g., in the Introduction, Materials and Methods, Results, and/or Discussion sections, and/or in a list in a section at the end of the manuscript. UniProt accession IDs are encouraged; however, because not all entries in the NCBI database are present in the UniProt database, NCBI accession IDs can also be provided. Manuscripts can be searched electronically by UniProt for accession IDs, thereby facilitating capture of the experimentally determined
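To illustrate why parenthetical accession IDs make a manuscript machine-searchable, the following is a minimal Python sketch that scans text for tokens matching the accession format documented in the UniProt help pages. It is not a description of how UniProt actually mines the literature, which this editorial does not specify; the function name `find_accessions` and the sample sentence are hypothetical, and a pattern match does not confirm that an entry exists in the database.

```python
import re

# Accession format as documented in the UniProt help pages:
# 6 characters (e.g. P0A6P9) or 10 characters (e.g. A0A024R161).
UNIPROT_ACCESSION = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)


def find_accessions(text: str) -> list[str]:
    """Return accession-like tokens, de-duplicated, in order of first appearance.

    Hypothetical helper: it checks the format only, not database membership.
    """
    matches = (m.group(0) for m in UNIPROT_ACCESSION.finditer(text))
    return list(dict.fromkeys(matches))


# Illustrative sentence following the "ID in parentheses after the name" style.
sample = (
    "The enzyme (UniProt P0A6P9) and a second protein "
    "(UniProt A0A024R161) were purified; P0A6P9 was crystallized."
)
print(find_accessions(sample))
```

Because the format is regular, an ID placed after the protein name can be recovered without any understanding of the surrounding prose, which is the practical advantage of the guideline.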

The two major protein databases, GenBank/RefSeq maintained by the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) and UniProt maintained by the UniProt Consortium (EMBL-EBI; http://www.uniprot.org/), are increasing rapidly in size as a result of the low cost and ease of genome sequencing. In its most recent release (Release 88; May 23, 2018), GenBank/RefSeq contained 110,333,800 entries. UniProt provides two databases, UniProtKB/TrEMBL with entries for which the annotations are derived from the European Nucleotide Archive (https://www.ebi.ac.uk/ena) and UniProtKB/Swiss-Prot with entries for which the annotations are manually curated from the literature; the most recent UniProt release (Release 2018_06; June 20, 2018) contained a total of 116,587,823 entries, 116,030,110 in UniProtKB/TrEMBL and 557,713 in UniProtKB/Swiss-Prot. The size of the UniProt database is increasing at a rate of 2.5%/month, i.e., a doubling time of approximately 2.5 years (Figure 1).
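The growth figures quoted above can be checked with a short calculation. The Python sketch below assumes compound growth at a constant monthly rate, which is an assumption made here for illustration; the editorial does not state how its doubling time was derived. The function name `doubling_time_months` is hypothetical.

```python
import math


def doubling_time_months(monthly_growth: float) -> float:
    """Months needed to double in size under compound growth at a fixed monthly rate."""
    return math.log(2) / math.log(1 + monthly_growth)


# Growth rate quoted in the text: 2.5% per month.
months = doubling_time_months(0.025)
years = months / 12
print(f"doubling time: {months:.1f} months ({years:.2f} years)")

# Entry counts quoted for UniProt Release 2018_06.
trembl = 116_030_110
swissprot = 557_713
total = trembl + swissprot
print(f"total entries: {total:,}")
print(f"manually curated fraction: {swissprot / total:.2%}")
```

Under this assumption the doubling time comes out at about 28 months, in line with the rounded figure given in the text, and the manually curated entries account for well under 1% of the database.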

Figure 1. Growth of the UniProt database. The most recent release (Release 2018_06) contained a total of 116,587,823 entries, 116,030,110 entries in UniProtKB/TrEMBL (blue line) and 557,713 entries in UniProtKB/Swiss-Prot (red line). The decrease in April 2015 was the result of archiving redundant sequences (encoded by similar species) in the UniParc database.

The large and rapidly increasing number of entries in the protein databases provides the potential for a better understanding of the “chemical, physical, mechanistic, and/or structural basis of biological or cell function” (precisely the scope of this journal). However, realizing this goal depends on the ability to effectively leverage and integrate the large amount of data; this, in turn, is dependent on the reliability of the functional annotations (in vitro activities and in vivo functions). The annotations for