Article pubs.acs.org/jpr

Computational Survey of Sequence Specificity for Protein Terminal Tags Covering Nine Organisms and Its Application to Protein Identification

Akiyasu C. Yoshizawa,* Yuko Fukuyama, Shigeki Kajihara, Hiroki Kuyama, and Koichi Tanaka

Koichi Tanaka Laboratory of Advanced Science and Technology, Shimadzu Corporation, Kyoto 604-8511, Japan

ABSTRACT: In 1998, Wilkins et al. (J. Mol. Biol. 1998, 278, 599−608) reported high specificity in terminal regions (terminal tags) of 15 519 proteins from five organisms and proposed a methodology for identifying proteins by terminal tags. However, their examined sequence data were not based on complete genome sequences. Here, we examined current proteome data (217 249 entries from UniProt 2013_6 complete/reference proteome for nine organisms including human) in terms of the specificity of terminal tags and their computational annotation. One example from the results indicated that the specificity of N-terminal tags plateaued at 28% at a length of six residues for human; even when using both N- and C-terminal tags, specificity was merely 66%. In order to determine the cause of these low specificities, the annotation of proteins sharing terminal tags with other proteins was examined. The results suggested that a large majority were phylogenetically or functionally related, whereas nonrelated proteins sharing terminal tags made up less than 1% of human proteome data. On the basis of these findings, we constructed the terminal tag sequence database ProteinCarta (http://ms3d.jp/software/proteincarta/), which includes all terminal tags of proteomes from the nine organisms analyzed here, in order to confirm the specificity of terminal tags and to identify the parent protein.

KEYWORDS: Protein terminal specificity, terminal sequence tag, protein identification, database, bioinformatics



Received: July 30, 2014
Published: November 13, 2014
© 2014 American Chemical Society
dx.doi.org/10.1021/pr500793h | J. Proteome Res. 2015, 14, 756−767

INTRODUCTION

Identifying proteins plays a pivotal role in proteomics, and mass spectrometry is one of the most popular and powerful techniques for protein identification. Peptide mass fingerprinting (PMF)1 and peptide fragment (fragmentation) fingerprinting (PFF, or MS/MS ion search)1 are commonly implemented in such popular search engines as Mascot,2 Sequest,3 X!Tandem,4,5 ProteinProspector,6 and MaxQuant.7 These methodologies are in the mainstream of protein identification, but different identification approaches can be considered for specialized purposes.

In 1998, the methodology of protein identification using sequence tags (short amino acid sequences) at the N- and C-termini was pioneered by Wilkins et al.8 They investigated the relationship between the specificity and length of terminal regions in 15 519 protein sequences. In their report, 41% (Saccharomyces cerevisiae) to 83% (Mycoplasma genitalium) of four-residue sequences at the N-terminus (terminal tags) were specific, and 74% (Homo sapiens) to 97% (M. genitalium) of C-terminal tags were specific. From these results, they proposed a methodology for identifying proteins by terminal tags and published the database search tool TagIdent, which searches for proteins using sequence tags of up to six amino acid residues, pI value, and molecular weight. However, determination of terminal sequences was not practicable at the time because of two problems in the experimental methodology. First, the Edman degradation method9,10 was available, but it was only for N-terminal sequencing. Second, mass spectrometry-based methods were premature due to the absence of efficient pretreatments such as enrichment or isolation techniques for peptides of interest.

The terminal regions of a protein are important both for identifying the protein and for clarifying various biological phenomena. The terminal regions are significant for their posttranslational modifications.11−20 These modifications are directly related to protein function and/or its regulation. It is thus necessary to investigate terminal regions, and an intense effort has been continuously devoted to developing experimental techniques for decades. Mass spectrometry has become the method of choice for analyzing macromolecules of biological origin. In addition, RNA sequencing (RNA-seq), which reads mRNA sequences whose information can be used to determine a protein's sequence, is now being developed using next-generation sequencing (NGS). Various methods (e.g., combined fractional diagonal chromatography (COFRADIC)21−23 and the combined methodology of chemical modification24−27) have been developed to clarify structures of protein terminal regions, achieving a high degree of purity of terminal peptides. With these methods, it is now feasible to determine terminal sequences of mature proteins. These terminal region investigations of comprehensive proteome samples are known as positional proteomics28,29 or terminomics.30 In this research area, extensive studies on species-specific protein cleavage31 and alternative translation initiation30,32 are underway.

Here, we examined the feasibility of identification using only terminal tags, as proposed by Wilkins et al.8 This methodology can be expected to contribute to the research efficiency of terminomics. We obtained proteome data of nine organisms from the UniProt reference/complete proteome data and examined their terminal regions, especially human protein terminal regions. Counting specific terminal tags, we determined that a large majority of terminal tags are specific, except for isoforms and paralogs, and that, in some rare cases, proteins that are not phylogenetically or functionally related share short terminal tags. On the basis of these results, we constructed a terminal tag database for these nine organisms. Users can identify a possible protein with terminal tags as the query; if the input query tags are too short to identify the protein uniquely, then this database system calculates how many more amino acid residues in the terminal tag must be determined.
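The query behavior described in the last sentence above can be illustrated with a minimal Python sketch. It is not the ProteinCarta implementation; the function names `build_index` and `residues_needed`, the toy sequences, and the 10-residue cap are assumptions made for illustration, and only N-terminal tags are handled (C-terminal tags could be treated the same way on reversed sequences).

```python
from collections import defaultdict

def build_index(proteins, max_len=10):
    """Map every N-terminal tag of length 1..max_len to the IDs of proteins carrying it."""
    index = defaultdict(set)
    for pid, seq in proteins.items():
        for n in range(1, min(max_len, len(seq)) + 1):
            index[seq[:n]].add(pid)
    return index

def residues_needed(index, proteins, tag, max_len=10):
    """For each protein matching `tag`, report how many more residues must be
    determined before its N-terminal tag becomes unique (0 = already unique,
    None = not resolvable within max_len residues)."""
    result = {}
    for pid in sorted(index.get(tag, ())):
        seq = proteins[pid]
        extra = None
        for n in range(len(tag), min(max_len, len(seq)) + 1):
            if len(index.get(seq[:n], ())) == 1:
                extra = n - len(tag)
                break
        result[pid] = extra
    return result

# Hypothetical toy proteome (not real UniProt entries).
proteins = {"P1": "MKTAYIAK", "P2": "MKTAYLLG", "P3": "MSTNPKPQ"}
index = build_index(proteins)
print(residues_needed(index, proteins, "MKT"))  # both candidates need three more residues
```

A query that already identifies one protein returns 0 for it, and a tag that matches nothing returns an empty result.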



MATERIALS AND METHODS

Retrieved Data

All amino acid sequence data in our data set were obtained from UniProt release 2013_6;33 sequences from Escherichia coli (strain K12) and M. genitalium (strain ATCC 33530/G-37/NCTC 10195) were obtained from the complete proteome set (http://www.uniprot.org/taxonomy/?query=complete:yes%20ancestor:2), and sequences from the other seven organisms (H. sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Drosophila melanogaster (fly), Caenorhabditis elegans (worm), Arabidopsis thaliana (weed), and S. cerevisiae (yeast)) were obtained from the reference proteome set (http://www.uniprot.org/taxonomy/?query=reference:yes%20ancestor:2759). All gene numbers were retrieved from Ensembl.34 Figure 1 lists the version of genomic data for each organism from Ensembl.

Processing Sequences

All entries annotated as fragment or fragments were removed from the data set. All sequences from Swiss-Prot in the data set were then processed to mature forms. Specifically, sequences of signal peptides and propeptides were removed as follows.

(1) First, all information on signal peptides/propeptides was retrieved from the feature information and the variant information included in the XML version of Swiss-Prot release 2013_6. Two data sets were created from Swiss-Prot sequences by removing residues corresponding to either of two categories: (a) annotated signal peptides or propeptides only when their qualities are experimentally confirmed in Swiss-Prot and (b) annotated signal peptides or propeptides regardless of their qualities (i.e., their qualities are experimentally confirmed or by similarity or probable or potential in Swiss-Prot).

(2) For human data, three data sets were then similarly created using TrEMBL human sequences by removing residues corresponding to one of the following three categories: (c) residues identical to a human signal peptide that is contained in category a, (d) residues identical to a human signal peptide that is contained in category b, and (e) no residues removed.

(3) Finally, for human, four data sets, (i), (ii), (iii), and (iv), were created: (i) removing category a from Swiss-Prot and category e for TrEMBL (no residues removed from TrEMBL), (ii) removing category b from Swiss-Prot and category e for TrEMBL (no residues removed from TrEMBL), (iii) removing category a from Swiss-Prot and category c from TrEMBL, and (iv) removing category b from Swiss-Prot and category d from TrEMBL.

These four data sets thus consist of the following sequences: (i) Swiss-Prot human sequences without experimentally confirmed signal peptide/propeptide residues and TrEMBL human sequences without any processing, (ii) Swiss-Prot human sequences without all confirmed/possible signal peptide/propeptide residues and TrEMBL human sequences without any processing, (iii) Swiss-Prot human sequences without experimentally confirmed signal peptide/propeptide residues and TrEMBL human sequences without subsequences identical to any experimentally confirmed signal peptides in Swiss-Prot human sequences, and (iv) Swiss-Prot human sequences without all confirmed/possible signal peptide/propeptide residues and TrEMBL human sequences without subsequences identical to any confirmed/possible signal peptides in Swiss-Prot human sequences.

For every other organism, two data sets corresponding to categories (i) and (ii) were created. The signal peptide/propeptide was removed from the variant sequence only when the amino acid sequence at the position of the signal peptide/propeptide in the variant protein was identical to the corresponding sequence in the Swiss-Prot canonical sequence. All of the following analyses were performed against terminal tags extracted from the processed sequences above.

Classification of Annotation for Protein Sequences with Shared Terminal Tags

KEGG OC35 and the annotation information in UniProt were utilized to automatically classify the annotation of protein sequences that shared the same terminal tags. All data set (UniProt) sequences were mapped to the sequences in the KEGG GENES database (release 58.1)36 based on identifier mapping lists provided by UniProt and KEGG and on sequence similarity comparisons performed by BLAST (ver. 2.2.21).37 The KEGG OC number assigned to each KEGG GENES sequence was then mapped to each data set sequence. One protein (the protein on the first line) in the list of proteins sharing the same terminal tags was chosen and defined as the terminal-group canonical sequence. The identities and similarities for all other sequences in the group were then calculated. Sequence similarity was examined using the Water program, which is included in the EMBOSS bioinformatics program suite (ver. 6.2.0−2)39 and implements the Smith−Waterman algorithm.38
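The construction of human data sets (i) to (iv) described under Processing Sequences can be sketched in Python as follows. This is an illustrative simplification, not the authors' Perl code: the record fields `signal_len` and `evidence` are hypothetical, propeptides are ignored, and a TrEMBL "subsequence identical to a signal peptide" is assumed to mean an identical N-terminal prefix.

```python
from dataclasses import dataclass

EXPERIMENTAL = "experimental"  # stand-in label for an experimentally confirmed annotation

@dataclass
class SwissProtEntry:
    sequence: str
    signal_len: int = 0           # length of the annotated signal peptide (0 if none)
    evidence: str = EXPERIMENTAL  # e.g. "by similarity", "probable", "potential"

def mature_swissprot(entry, confirmed_only):
    """Category (a) when confirmed_only is True, category (b) otherwise."""
    if entry.signal_len == 0:
        return entry.sequence
    if confirmed_only and entry.evidence != EXPERIMENTAL:
        return entry.sequence
    return entry.sequence[entry.signal_len:]

def known_signals(entries, confirmed_only):
    """Collect the signal peptides that were removed from Swiss-Prot."""
    return {e.sequence[:e.signal_len] for e in entries
            if e.signal_len and (not confirmed_only or e.evidence == EXPERIMENTAL)}

def mature_trembl(sequence, signals):
    """Categories (c)/(d): strip a prefix identical to a known signal peptide.
    An empty `signals` set corresponds to category (e), no residues removed."""
    for sp in sorted(signals, key=len, reverse=True):  # prefer the longest match
        if sequence.startswith(sp):
            return sequence[len(sp):]
    return sequence

def build_data_set(swissprot, trembl, confirmed_only, process_trembl):
    signals = known_signals(swissprot, confirmed_only) if process_trembl else set()
    return ([mature_swissprot(e, confirmed_only) for e in swissprot]
            + [mature_trembl(s, signals) for s in trembl])

# (confirmed_only, process_trembl) for data sets (i) to (iv)
SETTINGS = {"i": (True, False), "ii": (False, False), "iii": (True, True), "iv": (False, True)}
```

Calling `build_data_set(swissprot, trembl, *SETTINGS["iii"])` would then yield the processed sequences from which terminal tags are extracted for data set (iii).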

Experiment Conditions for Evaluation Example

Human epidermal growth factor receptor-2 (Her2, or ErbB2/Neu) was purified from breast cancer cell line BT-474 as described in ref 40. The obtained protein sample in the excised gel plug (30 pmol) was reduced (by tris(2-carboxyethyl)phosphine hydrochloride (TCEP)) and alkylated (by iodoacetamide (IAA)), followed by digestion using trypsin. Matrix-assisted laser desorption/ionization (MALDI) mass spectra were obtained using an AXIMA Performance time-of-flight mass spectrometer (Shimadzu/Kratos, Manchester, UK) as described in ref 40. The obtained spectra were analyzed by the MWD method41 and Mascot2 on Mass++.42 (See Supporting Information 2 for details.)

Implementation of Analysis Programs and the Database System

The search function of the protein terminal sequence database system (ProteinCarta) is implemented utilizing the Sary program,43 which implements the suffix array algorithm.44 All analyses were performed using in-house programs implemented in Perl. The ProteinCarta database system was implemented in Perl and JavaScript.

RESULTS AND DISCUSSION

Data Sets from Nine Organisms’ Proteome Data

As with database searches by PFF or PMF, identification using terminal tags depends on the characteristics of the target database. We used UniProt, the most typical target database for PFF, as the target database. Figure 1 indicates the contents of the data sets, namely, Swiss-Prot sequences and TrEMBL sequences in the entire proteomes of the seven eukaryotic model organisms and two prokaryotic organisms examined. Sequences annotated as fragment(s) were removed from all of the data sets; thus, the total number of sequences for each organism in the data sets is less than the number of sequences in the original data from the UniProt reference/complete proteome sets. In the data set, 60% of human and mouse sequences and 40% of weed (a dicotyledon whose genome was the first model plant genome to be sequenced) and rat sequences have been manually curated. Manual curation of three unicellular organisms (yeast, E. coli, and mycoplasma) is complete. As indicated in Figure 1, the ratio of the protein count to gene number (referred to as the P/G ratio in this article) varies with the organism. The P/G ratio for human exceeds 250%; for yeast, it is slightly less than 100%.

Ordinarily, amino acid sequences stored in a database are identical to transcripts. However, the mature form of proteins in cells lacks various kinds of amino acid residues, such as signal peptides, propeptides, and initiator methionines, after the ordinary truncation processes. The amino acid sequences and intraprotein positions of these truncated residues are stored in public databases, e.g., Swiss-Prot. Thus, if the signal peptide of an observed mature protein is identical to the stored signal peptide, then the N-terminal tag of this mature protein is identical to the sequence starting from the residue next to the signal peptide. However, numerous signal peptides remain undiscovered, so when a novel signal peptide is truncated, a mature protein may start from a different residue from the data in the database. Similarly, C-terminal tags may stop at nonterminal residues of database-stored sequences because novel C-terminal truncations may occur. Therefore, when searching a database with terminal tags obtained from actual mature proteins, it is necessary to allow them to start or stop at any residues in the neighborhood of the database sequence terminal residues. It is impossible to compare arbitrary length tags from/to arbitrary residues in the database sequences for investigating their specificity because the number of cases increases exponentially, producing a combinatorial explosion. The following analyses were thus performed against the sequences from which the signal peptides/propeptides stored in Swiss-Prot were removed.

The signal peptide/propeptide data stored in UniProt (Swiss-Prot) can be classified into four classes according to reliability. The most reliable information is that confirmed by experiments (no indication in Swiss-Prot entries), and the other three categories are computational estimation results (indicated as by similarity, probable, and potential in Swiss-Prot). We classified the Swiss-Prot sequences into two categories according to this reliability: only experimentally confirmed signal peptides were removed, and all known signal peptides (including three kinds of computationally estimated signal peptides) were removed. We also classified the TrEMBL sequences into two categories: those removing the peptide fragments at the terminal region of a protein sequence (which is identical to a signal peptide removed from Swiss-Prot sequences) and those not removing them. Finally, for human sequences, four data sets corresponding to the four categories, (i), (ii), (iii), and (iv), were generated as indicated in the Materials and Methods section.

Figure 1. Details of analyzed data. The bar graph on the left indicates approximate numbers and ratios of the Swiss-Prot isoform sequence, the Swiss-Prot canonical sequence, and the TrEMBL sequence in the data sets, from left to right. The right table indicates actual numbers of these three categories, total protein numbers, P/G ratios (the ratio of protein total numbers to gene numbers), actual gene numbers, and the source of genomic data. Total protein numbers indicate the number of sequences after sequences annotated as fragment or fragments in the description lines were removed. According to the UniProt Web site (http://www.uniprot.org/help/about), with manual curation for Swiss-Prot, an automatic-curated sequence in TrEMBL is first classified into two categories: the canonical sequence, which is a representative sequence in multiple variant sequences of a protein, and the isoform, which is a nonrepresentative sequence. The words canonical and variants are assigned under manual curation; hence, all TrEMBL sequences should be classified as canonical sequences or isoforms.

Terminal Tag Specificity in Nine Organisms

Three amino acid-length peptides have 8000 (20^3) possible variations from the permutation of 20 kinds of amino acids, and four amino acid-length peptides have 160 000 (20^4) possible variations; thus, it is not possible to identify the entire human proteome by very short (no longer than three amino acids) terminal peptides. Therefore, we investigated using terminal peptide tags with lengths no shorter than three amino acids in the same way as the study by Wilkins et al.8 Table 1 lists the specificities of single N- or C-terminal regions in the human proteome from four data sets, focusing on the same characteristics as those in the study by Wilkins et al. In contrast, their work presented the results of three-, four-, and five-residue-long tags. The total number of sequences in the data set is 61 960, as a result of removing fragment sequences. For each data set, the number of specific terminal tags and its ratio to the whole protein number at each terminal and each terminal length are indicated. The ratios are also plotted in Figure 2a. In all cases, a longer tag indicates higher specificity when the tag length is short enough; however, when the length of terminal tags reaches six or seven residues, the frequency of the specific tags from all organisms reaches a plateau. Therefore, identifying a sequence in UniProt with only a single terminal tag on either terminus is impossible because the plateau level is 43% at most. No significant difference was observed between the results of data sets (i) and (ii); thus, the quality of signal peptide/propeptide information has little effect on specificity. In addition, the processing of TrEMBL sequences (the removal of putative signal peptides) did not drastically change the specificity of terminal sequences (the results of data sets (iii) and (iv)). The four graphs of the specificity of C-terminal tag sequences are practically identical (Figure 2a).

Note that 70% of 10-residue-long human C-terminal tags are specific sequences in the current proteome data, although slightly more than 40% of all human proteins have a specific C-terminus; this result indicates that some terminal tags are shared with a very large number of protein sequences. Putative signal peptides/propeptides were not removed from proteins in the proteomes of the eight organisms other than human because the processing of human TrEMBL sequences had no effect. Similarly, Figure 2b−i indicates the ratios of specific terminal tags in these other organisms. For each organism, the only data sets that were used were (i) and (ii) described in the Materials and Methods section, and both contain nonprocessed TrEMBL sequences. The actual numbers of specific terminal tags of these eight organisms are presented in Supporting Information Table 1. The 10 most common terminal tags in all

Table 1. Frequency of Unique Terminal Tags at Human Protein N- and C-Termini and Their Ratios to Total Sequences (61 960)a

Organism: Homo sapiens; total sequences: 61 960. Each cell gives the no. of sequences with unique terminal tags and, in parentheses, the % of all sequences in the data set.

Terminus    Tag length  Data set (i)     Data set (ii)    Data set (iii)   Data set (iv)
N-terminal  3AA         612 (1.0%)       1001 (1.6%)      542 (0.9%)       825 (1.3%)
N-terminal  4AA         1938 (3.1%)      3271 (5.3%)      1865 (3.0%)      2972 (4.8%)
N-terminal  5AA         11 812 (19.1%)   13 180 (21.3%)   11 692 (18.9%)   12 640 (20.4%)
N-terminal  6AA         17 538 (28.3%)   18 626 (30.1%)   17 374 (28.0%)   17 848 (28.8%)
N-terminal  7AA         18 615 (30.0%)   19 620 (31.7%)   18 429 (29.7%)   18 792 (30.3%)
N-terminal  8AA         18 887 (30.5%)   19 894 (32.1%)   18 694 (30.2%)   19 054 (30.8%)
N-terminal  9AA         19 074 (30.8%)   20 089 (32.4%)   18 886 (30.5%)   19 255 (31.1%)
N-terminal  10AA        19 232 (31.0%)   20 257 (32.7%)   19 046 (30.7%)   19 433 (31.4%)
C-terminal  3AA         866 (1.4%)       866 (1.4%)       866 (1.4%)       866 (1.4%)
C-terminal  4AA         16 258 (26.2%)   16 317 (26.3%)   16 258 (26.2%)   16 315 (26.3%)
C-terminal  5AA         24 509 (39.6%)   24 607 (39.7%)   24 509 (39.6%)   24 602 (39.7%)
C-terminal  6AA         25 579 (41.3%)   25 681 (41.4%)   25 579 (41.3%)   25 676 (41.4%)
C-terminal  7AA         25 844 (41.7%)   25 944 (41.9%)   25 843 (41.7%)   25 937 (41.9%)
C-terminal  8AA         26 043 (42.0%)   26 145 (42.2%)   26 041 (42.0%)   26 136 (42.2%)
C-terminal  9AA         26 225 (42.3%)   26 326 (42.5%)   26 223 (42.3%)   26 315 (42.5%)
C-terminal  10AA        26 359 (42.5%)   26 459 (42.7%)   26 357 (42.5%)   26 448 (42.7%)

a The frequencies are identical to the ratios of proteins with specific terminal tags in the following four proteome data sets: (i) Swiss-Prot human sequences without experimentally confirmed signal peptide or propeptide residues and TrEMBL human sequences without any processing, (ii) Swiss-Prot human sequences without all confirmed or possible signal peptide or propeptide residues and TrEMBL human sequences without any processing, (iii) Swiss-Prot human sequences without experimentally confirmed signal peptide or propeptide residues and TrEMBL human sequences without subsequences identical to any experimentally confirmed signal peptides in Swiss-Prot human sequences, and (iv) Swiss-Prot human sequences without all confirmed or possible signal peptide or propeptide residues and TrEMBL human sequences without subsequences identical to any confirmed or possible signal peptides in Swiss-Prot human sequences (see Materials and Methods).
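The quantity tabulated in Table 1 (the number of sequences whose terminal tag occurs exactly once, and its percentage of all sequences) can be computed along the following lines. This is a minimal sketch rather than the authors' Perl analysis; the function names are illustrative, and treating a sequence shorter than the tag length as having no tag is an assumption made here.

```python
from collections import Counter

def terminal_tag(seq, length, terminus="N"):
    """Return the N- or C-terminal tag of the given length, or None if the
    sequence is too short (assumption: such sequences never count as unique)."""
    if len(seq) < length:
        return None
    return seq[:length] if terminus == "N" else seq[-length:]

def unique_tag_stats(sequences, length, terminus="N"):
    """Count sequences whose terminal tag is carried by no other sequence and
    return (count, percentage of all sequences in the data set)."""
    tags = [terminal_tag(s, length, terminus) for s in sequences]
    counts = Counter(t for t in tags if t is not None)
    n_unique = sum(1 for t in tags if t is not None and counts[t] == 1)
    return n_unique, 100.0 * n_unique / len(sequences)

# Hypothetical toy data set (not real proteome sequences).
toy = ["MKTAYI", "MKTAYL", "MSTNPK", "MKT"]
for n in (3, 6):
    print(n, unique_tag_stats(toy, n, "N"), unique_tag_stats(toy, n, "C"))
```

Running the same function for lengths 3 through 10 on each processed data set would give one row of counts and percentages per tag length.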


Figure 2. Frequency of specific (unique) sequence tags at protein N- and C-termini and tag length. The x axis represents the length of terminal tags, and the y axis represents the frequency of specific tags as the ratio of the number of sequences with specific terminal tags to the total number of sequences in the data set for the organism. (a) Human. Both N- and C-terminal tags are extracted from the same four data sets used in Table 1. (b) Mouse. (c) Rat. (d) Fly. (e) Worm. (f) Weed. (g) Yeast. (h) E. coli. (i) M. genitalium. For each graph from b to i, both N- and C-terminal tags are extracted from the two data sets described in the Materials and Methods section. For all organisms, the graphs for C-termini completely overlap; the graphs for N-termini indicate a certain level of variance.
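The frequency plotted on the y axis can be written compactly as follows. The notation is introduced here only for illustration and does not appear in the original article: D is the data set, t_{T,ℓ}(s) is the tag of length ℓ at terminus T of sequence s, and f_T(ℓ) is the plotted frequency.

```latex
f_{T}(\ell) \;=\;
\frac{\bigl|\{\, s \in D \;:\; t_{T,\ell}(s') \neq t_{T,\ell}(s)
      \text{ for all } s' \in D,\ s' \neq s \,\}\bigr|}{|D|},
\qquad T \in \{\mathrm{N}, \mathrm{C}\}

% Worked check against Table 1 (human, three-residue C-terminal tags):
f_{\mathrm{C}}(3) \;=\; \frac{866}{61\,960} \;\approx\; 1.4\%

% Upper bound from the size of the tag space (20 amino acids):
f_{T}(\ell) \;\le\; \frac{20^{\ell}}{|D|},
\qquad \text{e.g. } \frac{20^{3}}{61\,960} = \frac{8000}{61\,960} \approx 12.9\%
```

The bound shows why three-residue tags cannot identify the whole human data set, and the observed value of about 1.4% lies well below it.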

and fly. These results are not sufficient for any biologically solid conclusions; however, this tendency is an unambiguous characteristic of UniProt proteome data, which is both the subject of our methodology and the usual database search target.

data sets of nine organisms and their frequencies of occurrence are presented in Supporting Information Table 2. Generally, when tags are relatively short (