The Construction of a Bioactive Peptide Database in Metazoa

Feng Liu,† Geert Baggerman,*,‡ Liliane Schoofs,§ and Geert Wets†

Data Analysis & Modeling Group, Transportation Research Institute, Hasselt University, Building D, 3590 Diepenbeek, Belgium; ProMeta, Interfacultary Center for Proteomics and Metabolomics, K.U.Leuven, Herestraat 49, bus 1023, 3000 Leuven, Belgium; and Laboratory for Functional Genomics and Proteomics, Naamsestraat 59, Leuven, Belgium

J. Proteome Res. 2008, 7, 4119-4131.

Received January 17, 2008

Bioactive peptides play critical roles in regulating most biological processes in animals, and have considerable biological, medical and industrial importance. Many peptides have been discovered either through their biological activity in vitro or through their sequence similarity to known peptides in silico. Through searches in the Swiss-Prot and TrEMBL protein databases using BLAST alignment tools and other in silico methods, all currently known bioactive peptides and their precursor proteins are extracted. In addition, 132 recently discovered putative peptide genes in Drosophila as well as their orthologs in other species are collected. In total, 20 027 bioactive peptides from 19 438 precursor proteins covering 2820 metazoan species are retained; they make up a peptide database and a peptide precursor database, respectively. The peptides and peptide precursor proteins are further classified into 373 families, 178 of which are represented by Prosite, Pfam or SMART motifs, or by typical peptide motifs that have been constructed recently. The remaining 195 families are novel peptide families. The motifs characterizing the 178 peptide families are saved into a peptide motif database. The peptide, peptide precursor and peptide motif databases (version 1.0) are the most complete collection of peptides, precursors and peptide motifs in Metazoa so far. They are available on the WWW at http://www.peptides.be/.

Keywords: peptide • peptide precursor • motif • Prosite • Pfam • Smart • BLAST

1. Introduction

Bioactive peptides occur throughout the animal kingdom, from the least evolved phyla to the highest vertebrates. They play key roles as signaling molecules in many, if not all, physiological processes, for instance as peptidergic neurotransmitters or neurohormones, as peptidergic toxins, or as growth factors. Therefore, they are of considerable biological, medical and industrial importance.1 Peptides are synthesized in the cell in the form of large preproproteins (precursors), which are then cleaved and modified to generate biologically functional peptides.2 A number of peptides have been discovered based on their biological activity in vitro.3 In addition, putative peptides and their precursor proteins have been identified by sequence analysis in silico. This analysis is based either on sequence similarity between a protein under investigation and a known peptide or its precursor protein, or on a match between the protein and a peptide motif from conserved domain databases such as Prosite,4 Pfam,5 and SMART.6 In most cases, proteins showing significant sequence similarities have similar functional properties.

* To whom correspondence should be addressed. Geert Baggerman, ProMeta, Interfacultary Center for Proteomics and Metabolomics, K.U.Leuven, Herestraat 49, bus 1023, 3000 Leuven, Belgium. e-mail: [email protected].
† Hasselt University.
‡ ProMeta, Interfacultary Center for Proteomics and Metabolomics.
§ Laboratory for Functional Genomics and Proteomics.
10.1021/pr800037n CCC: $40.75

© 2008 American Chemical Society

Current databases that comprise a variety of peptides include, for example, the EROP-Moscow oligopeptide database and the SwePep database. The EROP-Moscow database7 consists of peptides that are extracted directly from publications in scientific journals, and all of its peptides are no more than 50 amino acids in length. The SwePep database8 contains endogenous peptides and small proteins below 10 kDa; most of the peptides are collected from Uniprot when the corresponding database entries (proteins) have peptides annotated in the ‘Feature’ line of the Uniprot file. Neither of these two databases covers all known bioactive peptides and their precursor molecules, and both include nonpeptide proteins that are either located in the nucleus of a cell or function as transport, milk or enzyme proteins. So far, a database that systematically collects all known bioactive peptides and their precursors in Metazoa has been lacking.

In this paper, we use an alternative peptide search approach that combines BLAST alignment tools with the annotation in the Uniprot protein database and the discussion in the literature. As a result, a novel peptide database is constructed; it includes all known bioactive peptides that are documented in Uniprot. The precursor proteins from which the peptides are released make up a peptide precursor database. All these peptides and precursor proteins are further classified into peptide families, and the motifs characterizing the corresponding peptide families are saved into a peptide motif database. The currently constructed peptide, peptide precursor and motif databases are the most complete collection of peptides, precursors and peptide motifs in Metazoa to date.

Published on Web 08/16/2008

2. Method

2.1. Peptide Precursor Collection.

(1) All proteins from Metazoa that function as mature peptides, or as peptide precursor proteins that can be further processed into smaller bioactive peptides, are assembled into a peptide precursor database. A protein has the characteristics of a peptide or peptide precursor when it is annotated in Uniprot (release 13.2), which consists of the Swiss-Prot (release 55.2) and TrEMBL (release 38.2) protein databases, in any of the following ways: (i) it is annotated as a bioactive peptide in the ‘Features’ line; (ii) its protein name contains peptide keywords; or (iii) it is annotated with peptide keywords in the ‘Keywords’ line. The peptide keywords are molecular function keywords: amphibian defense peptide, antimicrobial (including antibiotic, defensin and fungicide), antiviral protein, cytokine, endorphin, growth factor, hormone, hypotensive agent, neuropeptide, neurotransmitter, opioid peptide, vasoactive (including vasoconstrictor and vasodilator), toxin (including cardiotoxin, ionic channel inhibitor and neurotoxin) and antifreeze protein. The definitions of these keywords can be found in Uniprot. From the collection of proteins obtained in this way, proteins are subsequently excluded when (i) a subcellular location as membrane protein is indicated in the Uniprot protein file or (ii) the protein is also characterized by nonpeptide keywords, which do not refer to peptide proteins. The nonpeptide keywords are receptor, signal-anchor, DNA binding, milk, nuclear protein, transport, collagen, and enzyme. In addition to the peptide and precursor proteins collected by the above-mentioned method, the novel putative peptide precursors discovered in Drosophila9 are also included in the peptide precursor database.
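The inclusion and exclusion rules of step (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Entry` record, its field names and the simple case-insensitive substring matching are assumptions made for illustration, and applying the nonpeptide keywords to protein names as well as to the ‘Keywords’ line is a simplification borrowed from the wording of step (2). Only the two keyword lists are taken from the text.

```python
from dataclasses import dataclass, field

# Keyword lists copied from the text above (lower-cased for matching).
PEPTIDE_KEYWORDS = {
    "amphibian defense peptide", "antimicrobial", "antibiotic", "defensin",
    "fungicide", "antiviral protein", "cytokine", "endorphin", "growth factor",
    "hormone", "hypotensive agent", "neuropeptide", "neurotransmitter",
    "opioid peptide", "vasoactive", "vasoconstrictor", "vasodilator", "toxin",
    "cardiotoxin", "ionic channel inhibitor", "neurotoxin", "antifreeze protein",
}
NONPEPTIDE_KEYWORDS = {
    "receptor", "signal-anchor", "dna binding", "milk", "nuclear protein",
    "transport", "collagen", "enzyme",
}


@dataclass
class Entry:
    """Hypothetical, simplified view of one Uniprot record."""
    name: str
    keywords: set = field(default_factory=set)
    has_peptide_feature: bool = False  # bioactive peptide in the feature lines
    is_membrane: bool = False          # subcellular location: membrane


def _mentions(text, vocabulary):
    """True if any vocabulary term occurs in the text (case-insensitive)."""
    text = text.lower()
    return any(term in text for term in vocabulary)


def classify(entry):
    """Return 'precursor', 'excluded' or 'unannotated' for one entry."""
    kws = {k.lower() for k in entry.keywords}
    peptide_like = (
        entry.has_peptide_feature
        or _mentions(entry.name, PEPTIDE_KEYWORDS)
        or bool(kws & PEPTIDE_KEYWORDS)
    )
    nonpeptide = bool(kws & NONPEPTIDE_KEYWORDS) or _mentions(
        entry.name, NONPEPTIDE_KEYWORDS
    )
    if peptide_like:
        # Step (1): drop membrane proteins and nonpeptide-annotated proteins.
        return "excluded" if (entry.is_membrane or nonpeptide) else "precursor"
    # Neither keyword class present: candidate pool for the alignment step.
    return "excluded" if nonpeptide else "unannotated"
```

For example, `classify(Entry("Insulin", {"Hormone"}))` yields `'precursor'`, whereas an entry named "Growth hormone receptor" with the keyword "Receptor" yields `'excluded'`.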
(2) The proteins in Metazoa that contain neither the characteristic peptide keywords nor the nonpeptide keywords in their protein names or in the ‘Keywords’ line of their protein files are also extracted from Uniprot. In total, 175 778 proteins are obtained; the function of these proteins is not definitely annotated as peptide or nonpeptide in Uniprot.

(3) Stand-alone PSI-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)10 is used to align all the proteins in the peptide precursor database obtained in step 1 with all the proteins collected in step 2. Because, in many cases, only a short motif in the sequence of a peptide precursor is conserved and responsible for the function of the protein,11 the score matrix PAM30 is used and the word size is set to 2 in order to find short but strong similarities. Only those proteins that display similarities to the extracted peptide precursors with a significant BLAST score (e-value < 0.001) are retained. The obtained list is then checked manually in terms of cellular component, biological process and molecular function as stated by GO (gene ontology) terms or in the literature. As a result, we collected 1438 additional proteins for the peptide precursor database, although these novel peptide precursors had so far not been annotated with peptide keywords in Uniprot.

(4) In the final peptide precursor database, each database entry represents a peptide precursor protein or, in some cases, a mature peptide protein. The entry consists of the Uniprot entry name, Uniprot accession number, protein name, gene name, species, species taxon, mass, protein sequence length and protein sequence. In the case of a precursor protein, the information on its signal peptide is also retained. The presence of a signal peptide is assumed when it is indicated in the protein file in the protein database; in other cases, the signal peptide prediction program SignalP12 (http://www.cbs.dtu.dk/services/SignalP/) is used to predict the presence of a signal peptide. If a protein is known to belong to a family as annotated in the ‘SIMILARITY’ line, or displays a significant match to an existing family motif from databases such as PROSITE, Pfam or SMART as indicated in the ‘DR’ line of its protein file in Uniprot, the information on its family classification is collected into the precursor database. In addition, if a protein matches one of the patterns in the recently developed peptide pattern database,13 this classification information is also taken into account.

2.2. In Silico Extraction of Peptides.

From each assembled peptide precursor protein, the bioactive peptide sequences are extracted in silico using the annotated start and end positions of the subsequences that are annotated as ‘peptide’ or ‘chain’ in the ‘Feature’ line of the corresponding protein file. Database entries in the peptide precursor database that consist only of the mature peptide sequence, that is, cases where the protein precursor is unknown, are also retained. Small proteins (less than 200 amino acids in length) from the precursor database that contain an N-terminal signal peptide and for which no mature peptides have as yet been annotated presumably contain a single peptide, and are therefore also included in the peptide database after in silico removal of the N-terminal signal peptide. According to statistics on all annotated bioactive peptide sequences, 97% are no longer than the 200 amino acid threshold value.

2.3. Peptide Classification and Motif Collection.

All proteins in the peptide precursor database are classified into families according to their family classification information collected from Uniprot or from our recently developed peptide pattern database.13 Proteins that are not identified by any of the existing peptide motifs can be assigned to a particular peptide family based on their molecular functions as described in the literature. Proteins that display sequence similarities with a significant score (e-value < 0.001) as obtained by BLAST are also clustered into the same family. Once all the precursor proteins are classified into peptide families, the peptides within the peptide database, which are cleaved in silico from these precursors, are automatically assigned to the corresponding peptide families. The peptide motifs from Prosite, Pfam, SMART and the peptide pattern database are then collected into a peptide motif database. Figure 1 shows the construction procedure for the peptide, the peptide precursor and the peptide motif databases. We combined annotation information from Uniprot and from the literature with computational tools such as BLAST in order to search for bioactive peptides and their precursor proteins as completely as possible, and to cluster these proteins accurately into families.
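The extraction rule of section 2.2 and the similarity-based grouping of section 2.3 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the `(kind, start, end)` feature tuples with 1-based inclusive positions, and the `(protein_a, protein_b, e_value)` hit tuples are hypothetical, and the use of single-linkage grouping (any significant hit joins two proteins) is an assumption, since the text does not state how pairwise similarities are merged into families.

```python
def extract_peptides(sequence, features, signal_end=None, max_len=200):
    """Cut annotated peptides out of a precursor sequence.

    features: (kind, start, end) tuples, 1-based inclusive positions;
              only 'peptide' and 'chain' annotations are used.
    signal_end: position of the last signal-peptide residue, if any.
    """
    peptides = [
        sequence[start - 1:end]
        for kind, start, end in features
        if kind.lower() in ("peptide", "chain")
    ]
    if peptides:
        return peptides
    # No annotated mature peptide: a small protein with a signal peptide is
    # presumed to contain a single peptide once the signal is removed.
    if signal_end is not None and len(sequence) < max_len:
        return [sequence[signal_end:]]
    return []


def cluster_families(protein_ids, hits, max_evalue=1e-3):
    """Group proteins linked by significant hits (single linkage, assumed)."""
    parent = {p: p for p in protein_ids}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a, b, evalue in hits:
        if evalue < max_evalue:
            parent[find(a)] = find(b)

    families = {}
    for p in protein_ids:
        families.setdefault(find(p), []).append(p)
    return sorted(families.values(), key=len, reverse=True)
```

With a toy sequence `"MMMMMACDEFKRGHK"` and features `[("signal", 1, 5), ("peptide", 6, 10), ("peptide", 13, 15)]`, `extract_peptides` returns `["ACDEF", "GHK"]`; without peptide features but with `signal_end=5` it returns the single presumed peptide `"ACDEFKRGHK"`.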

3. Database Information Organization and Software Implementation

The three databases reside at http://www.peptides.be and contain the information on all known peptide and peptide precursor proteins in Uniprot as well as all known peptide motifs. These databases enable users to perform a rapid search via key features of peptides and to carry out statistical analysis on all the known peptides, peptide precursor proteins and peptide families.

Figure 1. Construction procedure of the peptide, peptide precursor and peptide motif databases.

3.1. Information Organization.

The site presents information via an HTML-based multilevel interface. The interface begins with a Main page, which includes the Home, Search, Statistics, Submission, Help, and Contact pages. Figure 2 illustrates the architecture of the database site. The Home page gives a general introduction to the three databases and their release version. The Statistics page contains statistics on the length distribution, species distribution and family distribution of all peptides and precursor proteins. The Submission page provides a window where users can submit their peptide data to the database. The Help page gives information on how to use the site, and the Contact page leads to our e-mail address.

Figure 2. Architecture of the database site.

The main part of the database site is the Search page. It provides a rapid search for peptide records according to specific peptide characteristics, such as peptide accession number, peptide name, length, monoisotopic mass, amino acid sequence, organism, peptide family, and the Uniprot accession number of the precursor protein from which the peptide is extracted. Each database entry (peptide record) in the peptide database is tagged with a unique peptide accession number, which begins with the characters ‘PEP’ followed by 5 numerical digits. When a search query is filled in on the Search page, a Result page is returned, which contains a list of peptides that meet the specified characteristics. Each peptide, in turn, links to a Peptide page, which describes the details of the peptide. On the Peptide page, ‘Uniprot accession’ links to the Uniprot database Web site, where the corresponding peptide precursor protein is opened. ‘Peptide family’ opens a Family page listing all the corresponding peptide family members. On the Family page, ‘motif’ links to a Motif page, which shows the family motif, either as a pattern in Prosite format or as a Pfam (or SMART) motif accession number; ‘Peptides in FASTA’ enables users to download the amino acid sequences of all peptides from this family. ‘Precursor’ connects to a Precursor page, where all the sequences of the peptide precursor proteins from this family are displayed.

3.2. Software Implementation.

The server software consists of the MySQL database server (version 5.0.21), the Apache Web server (version 2.0.54) and the PHP language (version 5.1.4). The system creates dynamic Web pages using CGI scripts written in PHP. In response to each user query, the required HTML pages are generated interactively. Once the user’s Web browser has sent the HTTP query to the Web server, the script containing the database query is executed. After the request is processed, the PHP script dynamically generates the results in the form of an HTML page, which is then sent to the user’s computer.
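The Search page logic described above can be illustrated with a small Python sketch. It is only a stand-in: the site itself uses PHP and MySQL, whereas this sketch uses Python's built-in `sqlite3` with an in-memory database, and the table name, column names and the two demo records (including their accession numbers) are invented for illustration. The only detail taken from the text is the accession format, ‘PEP’ followed by 5 digits.

```python
import re
import sqlite3

# Accession format stated in the text: 'PEP' followed by 5 digits.
ACCESSION_RE = re.compile(r"PEP\d{5}")

# Hypothetical column names; the real schema is not given in the paper.
SEARCHABLE = ("accession", "name", "organism", "family", "precursor_accession")


def is_peptide_accession(text):
    """True if text has the form PEP + 5 digits."""
    return ACCESSION_RE.fullmatch(text) is not None


def make_demo_db():
    """In-memory stand-in for the MySQL server, with two made-up records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE peptide (accession TEXT PRIMARY KEY, name TEXT, "
        "sequence TEXT, organism TEXT, family TEXT, precursor_accession TEXT)"
    )
    conn.executemany(
        "INSERT INTO peptide VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("PEP00001", "Demo peptide A", "GHK", "Homo sapiens",
             "Demo_family_1", "X00001"),
            ("PEP00002", "Demo peptide B", "YRI", "Demo species",
             "Demo_family_2", "X00002"),
        ],
    )
    return conn


def search_peptides(conn, **criteria):
    """Return (accession, name, sequence) rows matching all given criteria."""
    unknown = set(criteria) - set(SEARCHABLE)
    if unknown:
        raise ValueError(f"unsupported search field(s): {sorted(unknown)}")
    # Column names come from the whitelist; values are bound as parameters.
    where = " AND ".join(f"{column} = ?" for column in criteria) or "1 = 1"
    sql = (
        "SELECT accession, name, sequence FROM peptide "
        f"WHERE {where} ORDER BY accession"
    )
    return conn.execute(sql, tuple(criteria.values())).fetchall()
```

Binding the search values as query parameters, rather than pasting them into the SQL string, is the design choice worth keeping in any real implementation of such a search form.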

4. Results

In total, 20 027 bioactive peptides and 19 438 peptide precursor proteins make up the peptide and the peptide precursor databases, respectively. The database entries originate from 2820 different metazoan species and comprise cytokines and growth factors (4319), hormones (9114), antimicrobial peptides (2685), toxins (2423), antifreeze proteins (200), and peptides from other functional families (1286). Of the 19 438 proteins in the peptide precursor database, 19 208 are classified into 373 peptide families, each including at least 2 peptides or precursor proteins; the remaining 230 proteins have no identified homologs and are put together into a special ‘unique peptides’ group. Table 1 lists all 373 peptide families together with, for each family, the number of precursor proteins, the number of peptides, and the phyla distribution of these proteins. A total of 178 (48%) of these peptide families are characterized by motifs collected in the peptide motif database.13

4.1. Peptide and Precursor Length Distribution.

Most peptide sequences (97% according to the statistics on our peptide database) are shorter than 200 amino acids. The shortest peptides are three amino acids long, for example, the sea anemone Antho-RIamide-2 neuropeptide ‘YRI’,14 the funnel-web spider omega-agatoxin-1A minor chain ‘SPC’ (P15969),15 and the human growth-modulating peptide ‘GHK’ (P01157).16 The existence of numerous short bioactive peptides implies that, in the whole bioactive peptide superfamily, there are many conserved regions within peptide precursor proteins that are limited in length but are biologically important functional portions of these molecules.13 The majority of peptide precursor sequences (98%) are no more than 500 amino acids in length, and they are relatively short compared to nonpeptide proteins.

4.2. Phyla Distribution among Peptide Families.

The assembled metazoan peptides come from various phyla, including Annelida, Arthropoda, Chordata, Cnidaria, Echinodermata, Echiura, Hemichordata, Mollusca, Nematoda, Nemertea, Platyhelminthes and Porifera. Fifty-three (14%) of the 373 peptide families listed in Table 1 contain peptides originating from at least 2 phyla. While different phyla have evolved at different paces to develop novel peptides, they still retain certain common characteristics. The existence of peptide families shared by different phyla further implies that organisms in the whole animal kingdom, from the least evolved phyla to the highest

Table 1. Peptide Families and Phyla Distribution^a

Id | Family_name | NPro | NPep | Phyla distribution

Cytokine and Growth Factor
1 | TGF_beta | 1213 | 619 | Annelida; Arthropoda; Chordata; Cnidaria; Echinodermata; Hemichordata; Mollusca; Nematoda; Platyhelminthes
2 | Small_cytokines | 635 | 690 | Chordata
3 | Nerve_growth_factor | 480 | 355 | Chordata; Echinodermata
4 | HBGF_FGF | 399 | 278 | Arthropoda; Chordata; Echinodermata; Nematoda
5 | interferon_alpha_beta_delta | 368 | 358 | Chordata
6 | Platelet_derived_growth_factor | 223 | 181 | Arthropoda; Chordata; Cnidaria; Nematoda
7 | Interleukin_1 | 222 | 162 | Chordata
8 | Receptivity_factor | 209 | 14 | Chordata
9 | Interferon_gamma | 120 | 120 | Chordata
10 | Interleukin_10 | 104 | 101 | Chordata
11 | Interleukin_2 | 99 | 99 | Chordata
12 | Interleukin_4_13 | 96 | 96 | Chordata
13 | Heparin_binding_epidermal_growth_factor | 95 | 92 | Arthropoda; Chordata; Cnidaria
14 | Interleukin_6 | 94 | 72 | Chordata
15 | Hepatoma_derived_growth_factor | 64 | 21 | Arthropoda; Chordata; Nematoda
16 | GRANULINS | 68 | 38 | Arthropoda; Chordata; Mollusca; Nematoda; Platyhelminthes
17 | Interleukin_15 | 64 | 64 | Chordata
18 | PTN_MK_heparin_binding | 59 | 55 | Arthropoda; Chordata; Mollusca
19 | Gremlin | 58 | 40 | Arthropoda; Chordata; Cnidaria; Nematoda
20 | Interleukin_17 | 53 | 48 | Chordata; Nematoda
21 | Interleukin_12_alpha | 46 | 32 | Chordata
22 | Interleukin_16 | 41 | 7 | Arthropoda; Chordata
23 | Osteopontin | 35 | 14 | Chordata
24 | GMF_beta | 34 | 34 | Arthropoda; Chordata; Nematoda
25 | FAM3 | 33 | 13 | Chordata
26 | Granulocyte_macrophage_colony_stimulating_factor | 31 | 31 | Chordata
27 | Neuregulin | 31 | 19 | Chordata
28 | LIF_OSM | 30 | 21 | Chordata
29 | Interleukin_7_9 | 29 | 29 | Chordata
30 | Interleukin_5 | 25 | 25 | Chordata
31 | Trefoil_factor | 22 | 21 | Chordata
32 | Interleukin_3 | 21 | 21 | Chordata
33 | Interleukin_11 | 17 | 11 | Chordata
34 | uteroglobin | 13 | 13 | Chordata
35 | Amphiregulin | 11 | 8 | Chordata
36 | Interleukin_21 | 10 | 10 | Chordata
37 | Ciliary_neurotrophic_factor | 9 | 9 | Chordata
38 | Growth_factor_ARC | 8 | 1 | Chordata
39 | Interleukin_28_29 | 7 | 7 | Chordata
40 | Stromal_cell_derived_growth_factor | 6 | 6 | Chordata
41 | IL_27_p28_subunit | 3 | 0 | Chordata
42 | Neurosecretory_protein_VGF | 3 | 2 | Chordata
43 | Interleukin_31 | 3 | 3 | Chordata
44 | IL_6_subfamily_like_cytokine_M17 | 2 | 1 | Chordata
45 | Fibrosin | 2 | 2 | Chordata
46 | Thymic_factor | 2 | 2 | Chordata

Hormones
1 | Somatotropin | 818 | 570 | Chordata; Platyhelminthes
2 | Insulin | 649 | 916 | Arthropoda; Chordata; Mollusca; Nematoda
3 | ACTH_domain_and_Opioids_neuropeptides | 490 | 738 | Annelida; Chordata
4 | Glycoprotein_hormones_beta_chain | 484 | 487 | Arthropoda; Chordata; Mollusca; Nematoda
5 | FMRFamide_and_related_neuropeptides | 270 | 457 | Annelida; Arthropoda; Chordata; Cnidaria; Mollusca; Nematoda; Platyhelminthes
6 | Glucagon_GIP_secretin_VIP | 224 | 382 | Arthropoda; Chordata; Cnidaria; Mollusca; Platyhelminthes
7 | Natriuretic_peptides | 202 | 268 | Chordata
8 | Gonadotropin_releasing_hormones | 197 | 357 | Chordata; Mollusca
9 | ACBP | 172 | 105 | Arthropoda; Chordata; Nematoda; Platyhelminthes
10 | Arthropod_CHH_MIH_GIH | 161 | 209 | Arthropoda; Nematoda
11 | Pancreatic_hormone | 136 | 174 | Arthropoda; Chordata; Mollusca; Platyhelminthes
12 | Tachykinin | 127 | 210 | Arthropoda; Chordata; Echiura; Mollusca

Table 1. Continued

Id | Family_name | NPro | NPep

Hormones
13 | Glycoprotein_hormones_alpha_chain | 123 | 127
14 | Neurohypophysial_hormones | 122 | 188
15 | Gastrin_cholecystokinin | 104 | 197
16 | Calcitonin_CGRP_IAPP | 97 | 103
17 | Leptin | 93 | 94
18 | Pyrokinins | 88 | 117
19 | Corticotropin_releasing_factor | 88 | 97
20 | Granins | 86 | 83
21 | Neuropeptide_like_protein | 84 | 205
22 | Ghrelin_and_Motilin_related_peptide | 80 | 115
23 | Somatostatin | 75 | 102
24 | Bradykinin | 69 | 105
25 | Agouti | 63 | 63
26 | Parathyroid_hormone | 62 | 74
27 | Allatostatin | 62 | 159
28 | Periviscerokinin | 59 | 65
29 | Transthyretin | 58 | 58
30 | Endothelin | 52 | 60
31 | Erythropoietin_thrombopoeitin | 52 | 49
32 | Stanniocalcin | 51 | 26
33 | Adipokinetic | 50 | 76
34 | Hepcidin | 48 | 49
35 | Bombesin | 44 | 61
36 | Prokineticin | 41 | 41
37 | Accessory_gland_specific_peptide_26Aa | 37 | 4
38 | Angiotensin_like_peptide | 36 | 55
39 | Neurexophilin | 33 | 18
40 | Galanin | 33 | 40
41 | Pro_MCH | 33 | 62
42 | Urotensin_II | 31 | 38
43 | Neuroendocrine_protein_7B2 | 30 | 17
44 | Neuromedin_U_S | 29 | 37
45 | Egg_laying_hormone | 26 | 43
46 | Adrenomedullin | 25 | 35
47 | Resistin | 25 | 25
48 | Accessory_gland_protein_62 | 25 | 25
49 | Cocaine_and_amphetamine_regulated_transcript_protein | 25 | 37
50 | Pigment_dispersing_hormone | 24 | 34
51 | Guanylin | 23 | 25
52 | Wamide_neuropeptides | 21 | 53
53 | Leucokinin | 20 | 22
54 | Thyroliberin | 19 | 28
55 | GBP_PSP1_paralytic | 18 | 20
56 | Allatotropin | 18 | 15
57 | Nasal_embryonic_luteinizing_hormone_releasing_hormone_factor | 17 | 11
58 | Nitrophorin | 16 | 12
59 | Neurotensin_neuromedin_N | 15 | 32
60 | Accessory_gland_specific_peptide_70A | 15 | 15
61 | Prothoracicotropic_hormone_precursor | 14 | 5
62 | Accessory_gland_protein_26Ab | 13 | 13
63 | Orexin | 13 | 18
64 | Eclosion_hormone | 12 | 12
65 | neuropeptide_B_W | 11 | 15
66 | Bursicon | 11 | 11
67 | VD1_RPD2_alpha_peptide | 10 | 10
68 | Cardioactive_peptide | 9 | 9
69 | Corazonin | 9 | 13
70 | Tuberoinfundibular_peptide | 8 | 8
71 | Apelin | 7 | 19
72 | Orcokinin | 7 | 17
73 | Achatin | 6 | 6
74 | Neuroparsin | 6 | 7
75 | Insulin_growth_factor_like | 5 | 5
76 | Morphogenetic_neuropeptide | 5 | 5

Phyla distribution, in row order (Ids 13-49): Arthropoda; Chordata; Annelida; Arthropoda; Chordata; Mollusca; Arthropoda; Chordata; Chordata; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Chordata; Arthropoda; Mollusca; Nematoda; Chordata; Chordata; Arthropoda; Chordata; Chordata; Chordata; Arthropoda; Arthropoda; Arthropoda; Chordata; Nematoda; Chordata; Chordata; Chordata; Arthropoda; Chordata; Chordata; Chordata; Arthropoda; Annelida; Chordata; Chordata; Chordata; Chordata; Chordata; Chordata; Chordata; Annelida; Mollusca; Chordata; Chordata; Arthropoda; Chordata;

Phyla distribution, in row order (Ids 50-57): Arthropoda; Chordata; Arthropoda; Chordata; Cnidaria; Mollusca; Nematoda; Arthropoda; Nematoda; Chordata; Arthropoda; Arthropoda; Chordata;

Phyla distribution, in row order (Ids 58-76): Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Chordata; Arthropoda; Chordata; Arthropoda; Mollusca; Arthropoda; Arthropoda; Chordata; Chordata; Arthropoda; Mollusca; Arthropoda; Chordata; Chordata; Cnidaria;

Table 1. Continued

Id | Family_name | NPro | NPep

Hormones
77 | Lymna_DF_amide | 5 | 5
78 | Osteocrin | 5 | 5
79 | Accessory_gland_peptide_Acp33A | 5 | 5
80 | diuretic_hormone_class_II | 5 | 5
81 | Peptide_hormone | 2 | 2
82 | Neuropeptide_S | 4 | 4
83 | Proctolin | 4 | 4
84 | Ecdysis_triggering_hormone | 4 | 6
85 | Short_neuropeptide_F | 4 | 5
86 | Small_cardioactive_peptide | 4 | 5
87 | Neuroendocrine_secretory_protein_55 | 4 | 12
88 | Male_accessory_gland_protein | 3 | 3
89 | WWamide | 3 | 3
90 | Myomodulin | 3 | 13
91 | Pleurin | 3 | 3
92 | Androgenic_gland_hormone | 3 | 6
93 | Antho_RIamide | 3 | 4
94 | Hym_preprohormone | 2 | 4
95 | SPTR_prohormone | 2 | 2
96 | Abdominal_ganglion_neuropeptide | 2 | 6
97 | Myoactive_tetradecapeptide | 2 | 2
98 | Ovarian_ecdysteroidogenic_hormone | 2 | 3
99 | Contraction_inhibiting_peptide | 2 | 2
100 | Antidiuretic_factor | 2 | 2
101 | Putative_neuropeptide | 2 | 2
102 | Buccalin | 2 | 22
103 | Neuroactive_polyprotein_R15 | 2 | 6
104 | Accessory_gland_protein_98 | 2 | 2
105 | Fulicin | 2 | 10
106 | Trypsin_modulating_oostatic_factor | 2 | 2
107 | Light_yellow_cell_peptide | 2 | 2

Antimicrobial
1 | Beta_defensin | 408 | 409
2 | antimicrobial_1 | 391 | 480
3 | Arthropod_defensins | 134 | 134
4 | Mammalian_defensins | 134 | 158
5 | Cecropin | 123 | 130
6 | Cathelicidins | 88 | 101
7 | Attacin | 67 | 40
8 | Bombinin | 59 | 126
9 | Drosomycin_like | 44 | 44
10 | Penaeidin | 40 | 40
11 | pleurocidin | 29 | 29
12 | 4kD_defensin | 27 | 27
13 | Termicin | 21 | 21
14 | Liver_expressed_antimicrobial | 19 | 19
15 | spaetzle | 15 | 6
16 | Crustin_like_peptide | 14 | 11
17 | Ceratotoxin | 13 | 21
18 | grammistin | 13 | 13
19 | Moricin | 10 | 9
20 | Gloverin | 10 | 10
21 | Granulysin_NK_lysin_like | 8 | 8
22 | Coleoptericin | 8 | 7
23 | Anti_lipopolysaccharide_factor | 8 | 8
24 | tachyplesin_polyphemusin | 7 | 7
25 | Lebocin | 7 | 7
26 | Tryptophyllin | 7 | 7
27 | Electrin | 6 | 6
28 | Clavanin | 6 | 6
29 | Phylloseptin | 6 | 6
30 | Rubellidin | 6 | 6
31 | Pilosulin | 5 | 9
32 | apidaecin | 5 | 11
33 | Anionic_peptide_clone_precursor | 5 | 5

Phyla distribution, in row order (Hormones 77-107, then Antimicrobial 1-33): Mollusca; Chordata; Arthropoda; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Mollusca; Chordata; Arthropoda; Mollusca; Mollusca; Mollusca; Arthropoda; Cnidaria; Cnidaria; Annelida; Mollusca; Mollusca; Annelida; Arthropoda; Mollusca; Arthropoda; Mollusca; Mollusca; Mollusca; Arthropoda; Mollusca; Arthropoda; Mollusca; Arthropoda; Chordata; Arthropoda; Chordata; Platyhelminthes; Arthropoda; Mollusca; Nematoda; Chordata; Arthropoda; Chordata; Nematoda; Arthropoda; Chordata; Arthropoda; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Chordata; Chordata; Chordata; Chordata; Chordata; Arthropoda; Arthropoda; Arthropoda;

Table 1. Continued

Id | Family_name | NPro | NPep

Antimicrobial
34 | Putative_antimicrobial_peptide_clone_precursor | 5 | 5
35 | Ponericin | 5 | 5
36 | Gambicin | 5 | 5
37 | Styelin | 5 | 5
38 | Metalnikowin | 4 | 4
39 | Salmocidin | 4 | 4
40 | Mytilin | 4 | 4
41 | Pseudin | 4 | 4
42 | Tigerinin | 4 | 4
43 | Putative_antimicrobial_knottin_protein_Btk | 4 | 4
44 | oxyopinin_2 | 4 | 4
45 | Abaecin | 4 | 4
46 | Drosocin | 4 | 4
47 | Xenoxin | 3 | 3
48 | Histatin | 3 | 29
49 | diapausin | 3 | 3
50 | Antimicrobial_peptide_Alo | 3 | 3
51 | Dermcidin | 3 | 5
52 | Megourin | 3 | 3
53 | Hymenoptaecin | 2 | 2
54 | Hematopoietic_antimicrobial_peptide | 2 | 4
55 | Gallerimycin | 2 | 2
56 | Hylin_b | 2 | 2
57 | Peptide_BmKn | 2 | 2
58 | Myticin | 2 | 2
59 | Parabutoporin | 2 | 2
60 | Arenicin | 2 | 2
61 | Halocidin_subunit | 2 | 2
62 | Acanthoscurrin | 2 | 2
63 | Japonicin | 2 | 2
64 | Ponericin_L | 2 | 2
65 | Metchnikowin | 2 | 2
66 | Formaecin | 2 | 2
67 | Antimicrobial_peptide_lumbricin | 2 | 2
68 | Kassinatuerin | 2 | 2
69 | Hemiptericin | 2 | 2
70 | Pyrrhocoricin | 2 | 2

Toxin
1 | Conotoxin_1 | 644 | 670
2 | Snake_toxins | 373 | 373
3 | Scorpion_toxin | 272 | 276
4 | Scorpion_short_toxin_1 | 110 | 108
5 | Mu_agatoxin_and_spider_toxin_SFI | 62 | 77
6 | Alpha_conotoxin | 56 | 58
7 | Anenome_neurotoxin | 32 | 32
8 | I_superfamily_conotoxin | 31 | 31
9 | Ergtoxin | 25 | 25
10 | omega_agatoxin | 25 | 28
11 | Myotoxins | 16 | 17
12 | Scorpion_short_toxin_2 | 15 | 15
13 | Spider_toxin_Tx2 | 15 | 15
14 | huwentoxin_2 | 14 | 14
15 | Potassium_channel_toxin_alpha_KTx | 15 | 15
16 | Omega_atracotoxin | 13 | 13
17 | Long_chain_potassium_channel_inhibitor_scorpion_toxin | 12 | 13
18 | Beta_bungarotoxin_B_chain_precursor | 12 | 12
19 | Melittin | 11 | 11
20 | Delta_atracotoxin | 8 | 8
21 | conotoxin_P | 8 | 8
22 | Pardaxin | 7 | 7
23 | sea_anemone_potassium_channel_inhibitory_toxin | 6 | 6
24 | Mast_cell_degranulating_peptide | 5 | 5
25 | sea_anemone_BDS_toxin | 5 | 5
26 | Hydralysin | 4 | 4

Phyla distribution, in row order (Antimicrobial 34-70, then Toxin 1-26): Arthropoda; Arthropoda; Arthropoda; Chordata; Arthropoda; Chordata; Mollusca; Chordata; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Chordata; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Chordata; Arthropoda; Chordata; Arthropoda; Mollusca; Arthropoda; Annelida; Chordata; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Annelida; Chordata; Arthropoda; Arthropoda; Arthropoda; Mollusca; Platyhelminthes; Chordata; Arthropoda; Arthropoda; Arthropoda; Mollusca; Cnidaria; Mollusca; Arthropoda; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Chordata; Arthropoda; Chordata; Arthropoda; Mollusca; Chordata; Cnidaria; Arthropoda; Cnidaria; Cnidaria;

Table 1. Continued

Id | Family_name | NPro | NPep

Toxin
27 | Conantokin | 4 | 4
28 | Jellyfish_toxin | 4 | 3
29 | aptotoxin | 4 | 4
30 | latrotoxin | 4 | 4
31 | Aptotoxin_1_4_6_9_Paralytic_peptide | 4 | 4
32 | Alpha_A_conotoxin | 4 | 4
33 | janus_atracotoxin | 3 | 3
34 | Conotoxin_contulakin | 3 | 3
35 | Neurotoxin_magi | 3 | 3
36 | Potassium_channel_toxin_kappa_KTx | 3 | 3
37 | Conotoxin_flf14 | 3 | 3
38 | venom_vasodilator_peptide | 3 | 3
39 | LiTx_toxin | 3 | 3
40 | SNTX_VTX_toxin | 3 | 3
41 | spider_toxin_CSTX | 3 | 5
42 | Insecticidal_toxin_DTX | 3 | 3
43 | Toxin_PsTX_60 | 3 | 3
44 | Toxin_protein_KITx | 3 | 3
45 | Paralytic_insecticial_toxin | 3 | 3
46 | Lycotoxin | 2 | 2
47 | Conotoxin_Gla | 2 | 2
48 | Toxin_Tc50_43 | 2 | 2
49 | Toxin_Tc46_61 | 2 | 2
50 | Neurotoxin_B | 2 | 2
51 | sea_anemone_short_toxin | 2 | 2
52 | Tamulustoxin | 2 | 2
53 | Waglerin | 2 | 2
54 | scorpion_toxin_IsCT | 2 | 4
55 | Toxin_TxP_I | 2 | 2
56 | pompilidotoxin | 2 | 2
57 | Toxin_AETX | 2 | 2
58 | Oxytoxin | 2 | 2
59 | Polybine | 2 | 2
60 | Tetrapandin | 2 | 2
61 | Ectatomin | 2 | 2

Antifreeze
1 | Antifreeze_protein_1 | 41 | 41
2 | Type_III_antifreeze | 36 | 37
3 | Antifreeze_protein_AFP_2 | 31 | 31
4 | Antifreeze_glycoprotein_AFGP | 18 | 25
5 | Type_I_antifreeze_protein | 17 | 17
6 | Antifreeze_protein_type_IV | 5 | 5

Others
1 | Fibrinogen_alpha_1_chain_precursor | 1043 | 706
2 | Disintegrin | 503 | 271
3 | Thymosin_beta | 90 | 100
4 | Sperm_activating_peptide | 41 | 53
5 | Putative_peptide_11 | 28 | 28
6 | L71 | 27 | 27
7 | Nematode_specific_peptide_family_group_c | 24 | 24
8 | Submaxillary_gland_androgen_regulated_protein | 18 | 24
9 | Pacifastin_related_peptide | 16 | 37
10 | Colipase | 17 | 17
11 | Mastoparan | 17 | 17
12 | Putative_peptide_7 | 15 | 15
13 | Immune_induced_peptide_precursor | 12 | 12
14 | Salivary_glue_protein_3 | 10 | 7
15 | GGNG_myoactive_peptide | 9 | 10
16 | CAMP_generating_peptide | 9 | 7
17 | Nematode_specific_peptide_family_group_b | 8 | 8
18 | Vespid_chemotactic_peptide | 8 | 8
19 | Dynastin | 7 | 7

Phyla distribution, in row order (Toxin 27-61, then Antifreeze 1-6, then Others 1-19): Mollusca; Cnidaria; Arthropoda; Arthropoda; Arthropoda; Mollusca; Arthropoda; Mollusca; Arthropoda; Arthropoda; Mollusca; Chordata; Arthropoda; Chordata; Arthropoda; Arthropoda; Cnidaria; Arthropoda; Arthropoda; Arthropoda; Mollusca; Arthropoda; Arthropoda; Nemertea; Cnidaria; Arthropoda; Chordata; Arthropoda; Arthropoda; Arthropoda; Cnidaria; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Chordata; Chordata; Arthropoda; Chordata; Chordata; Chordata; Arthropoda; Chordata; Echinodermata; Mollusca; Nematoda; Porifera; Arthropoda; Chordata; Echinodermata; Nematoda; Arthropoda; Chordata; Cnidaria; Echinodermata; Mollusca; Nematoda; Platyhelminthes; Porifera; Echinodermata; Arthropoda; Arthropoda; Nematoda; Chordata; Arthropoda; Chordata; Chordata; Arthropoda; Arthropoda; Arthropoda; Arthropoda; Annelida; Arthropoda; Nematoda; Arthropoda; Chordata;

research articles

Feng et al.

Table 1. Continued

Id | Family_name | NPro | NPep | Phyla distribution
20 | Secapin | 6 | 6 | Arthropoda
21 | Cytin_A_chain | 6 | 6 | Annelida
22 | Phosphatidylethanolamine_binding_protein | 6 | 12 | Chordata
23 | Metastasis_suppressor_KiSS | 6 | 16 | Chordata
24 | Immune_induced_peptides_13 | 5 | 7 | Arthropoda
25 | Attractin | 5 | 5 | Mollusca
26 | Bombolitin | 5 | 5 | Arthropoda
27 | Salivary_anti_thrombin_anophelin | 5 | 5 | Arthropoda
28 | Testis_ecdysiotropin_peptide | 5 | 5 | Arthropoda
29 | Putative_peptide_1 | 5 | 5 | Arthropoda
30 | Putative_peptide_10 | 5 | 0 | Arthropoda
31 | Coagulogen | 4 | 12 | Arthropoda
32 | Putative_peptide_4 | 4 | 4 | Arthropoda
33 | Putative_peptide_5 | 4 | 0 | Arthropoda
34 | giant-lens | 4 | 1 | Arthropoda
35 | Seminal_vesicle_specific_peptide | 4 | 5 | Arthropoda
36 | Protein_C12orf39 | 4 | 5 | Chordata
37 | Cysteine_rich_peptide | 3 | 3 | Arthropoda
38 | Follicular_dendritic_cell_secreted | 3 | 3 | Chordata
39 | Pedibin | 3 | 3 | Cnidaria
40 | Sodefrin | 3 | 3 | Chordata
41 | Statherin | 3 | 3 | Chordata
42 | trunk_protein | 3 | 1 | Arthropoda
43 | Salivary_glue_protein_5 | 3 | 3 | Arthropoda
44 | VEGF_co_regulated_chemokine_1 | 3 | 3 | Chordata
45 | Putative_peptide_2 | 3 | 0 | Arthropoda
46 | Putative_peptide_9 | 3 | 3 | Arthropoda
47 | Eyestalk_peptide | 2 | 2 | Arthropoda
48 | Conophan | 2 | 2 | Mollusca
49 | Putative_peptide_3 | 2 | 2 | Arthropoda
50 | Putative_peptide_6 | 2 | 0 | Arthropoda
51 | Putative_peptide_8 | 2 | 0 | Arthropoda
52 | Putative_peptide_12 | 2 | 2 | Arthropoda
53 | Putative_peptide_13 | 2 | 1 | Arthropoda
54 | Pneumadin | 2 | 2 | Chordata

a Table 1 (peptide families and phyla distribution) lists 373 peptide families in total: 48 families of cytokines and growth factors, 110 families of peptide hormones, 82 families of antimicrobial peptides, 70 families of toxins, 6 families of antifreeze peptides, and 57 families of other types of bioactive peptides. Within each of these family types, the peptide families are sorted in decreasing order of 'NPro'. Id: family identity. Family_name: the name of the peptide family. NPro: the number of proteins in a family. NPep: the number of peptides in a family. Phyla distribution: the phyla from which the peptide precursor proteins originate.

vertebrates, have a close relationship and a peptide evolved history (17). For example, TGF_beta family members are widely distributed among all above-mentioned phyla except for Echiura and Nemertea. It has been suggested that these family members share an early evolutionary origin and a high conservation of protein functionality during animal evolution (18). Members of the FMRFamide_and_related_neuropeptides family occur not only in nematodes, arthropods, molluscs and annelids, but also in chordates (e.g., P83308 from chicken), Platyhelminthes (e.g., P41853 from Artioposthia triangulata) and Cnidaria (e.g., Hydra). Hydra, despite its simple nervous system, contains a number of peptides belonging to this family, such as the neuropeptides O76947, O76948 and O76949 (19).

In addition to the peptide families that cover several phyla, the remaining 320 peptide families are at this moment confined to a single phylum, such as bombesin in Chordata and adipokinetic hormone in Arthropoda. There are 149 such phylum-specific families in Arthropoda, 117 in Chordata, 31 in Mollusca, 13 in Cnidaria, 6 in Annelida, 2 in Nematoda, 1 in Echinodermata, and 1 in Nemertea. Furthermore, some of these families have been identified only in a single organism, such as the arenicin peptide in the lugworm (20) and the tetrapandin toxin in the emperor scorpion (21).
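The family counts quoted in the Table 1 footnote and in the paragraph above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative, and the count of families shared between phyla is derived here rather than stated by the authors.

```python
# Family counts per peptide type, as listed in the Table 1 footnote.
families_by_type = {
    "cytokines and growth factors": 48,
    "peptide hormones": 110,
    "antimicrobial peptides": 82,
    "toxins": 70,
    "antifreeze peptides": 6,
    "other bioactive peptides": 57,
}

# Families confined to a single phylum, as listed in the text.
single_phylum_families = {
    "Arthropoda": 149, "Chordata": 117, "Mollusca": 31, "Cnidaria": 13,
    "Annelida": 6, "Nematoda": 2, "Echinodermata": 1, "Nemertea": 1,
}

total_families = sum(families_by_type.values())
confined = sum(single_phylum_families.values())
# Derived value (not stated in the text): families found in more than one phylum.
multi_phylum = total_families - confined

print(total_families, confined, multi_phylum)
```

Both sums agree with the totals given in the text (373 families, 320 of them confined to one phylum), which leaves 53 families that occur in more than one phylum.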


Figure 3. Peptide frequency distribution among phyla. Ann, Annelida; Art, Arthropoda; Cho, Chordata; Cni, Cnidaria; Ech (left), Echinodermata; Ech (right), Echiura; Mol, Mollusca; Nem (left), Nematoda; Nem (right), Nemertea; Pla, Platyhelminthes; Por, Porifera.

4.3. Peptide and Family Distribution among Phyla. Figure 3 shows the peptide distribution among phyla. The majority of peptides are found within the phylum Chordata, for which 14 358 (72%) peptides have been identified. Within the Chordata, most peptides have been identified in humans (1206 peptides). For the remaining phyla, the following organisms have the highest number of identified peptides: Drosophila (Arthropoda, 325 peptides), Caenorhabditis elegans (Nematoda, 327 peptides), Aplysia californica (Mollusca, 132 peptides), Hydra magnipapillata (Cnidaria, 29 peptides), Eisenia fetida (Annelida, 9 peptides), Hemicentrotus pulcherrimus (Echinodermata, 11 peptides), Schistosoma japonicum (Platyhelminthes, 6 peptides), Urechis unicinctus (Echiura, 7 peptides), Cerebratulus lacteus (Nemertea, 3 peptides), and Sycon raphanus (Porifera, 2 peptides).

Figure 4 illustrates the peptide family distribution among phyla. A comparison with Figure 3 shows that, although more peptides have been identified in Chordata than in, for instance, Arthropoda, the chordate phylum contains fewer peptide families than the arthropods. This is because arthropods have a wide variety of peptide families but few identified peptides within each family, whereas in chordates more peptides have been identified within the same family.

Figure 4. Peptide family frequency distribution among phyla. Ann: Annelida; Art: Arthropoda; Cho: Chordata; Cni: Cnidaria; Ech (left): Echinodermata; Ech (right): Echiura; Hem: Hemichordata; Mol: Mollusca; Nem (left): Nematoda; Nem (right): Nemertea; Pla: Platyhelminthes; Por: Porifera.

4.4. Comparison to Other Existing Peptide Databases. Our peptide database is compared to two major peptide databases: the EROP-Moscow oligopeptide database and the SwePep database. The EROP-Moscow database (http://erop.inbi.ras.ru/) includes peptides whose chemical structures have been completely determined. All information in this database is extracted directly from primary sources, the great majority of which are publications in scientific journals. There are 6477 oligopeptides in the EROP-Moscow database (release of 25-Dec-2006), and none of them is longer than 50 amino acids. Most of these oligopeptides are produced by ribosomal synthesis in Metazoa; the remaining ones are formed by nonribosomal enzymes from Bacteria and Fungi. Ninety-three percent of these entries are also collected in our peptide database. A total of 203 of the other oligopeptides in the EROP-Moscow database are not annotated as bioactive peptides in Uniprot, although their precursor proteins are in our peptide precursor collection. The remaining oligopeptides in the EROP-Moscow database are not indexed in Uniprot, and neither are their precursor proteins. In addition to short (oligo)peptides, our database also collects numerous bioactive peptide sequences of more than 50 amino acids in length.

The SwePep database (release of 2006-02-15) (http://www.swepep.org/) contains 4180 annotated endogenous peptides and small proteins below 10 kDa from different tissues, originating from 394 species. It also includes 50 novel peptides identified by the developer of the SwePep database. The annotated peptides are collected from Uniprot when a peptide is indicated in the 'Feature' line of the corresponding database entry. Our peptide database collects not only amino acid sequences annotated as 'PEPTIDE', but also sequences annotated as 'CHAIN' in the 'Feature' line of Uniprot protein files, for example, the peptide andropin (34 aa) from precursor O16825 in Drosophila and the gastrin/cholecystokinin-like peptide (52 aa) from P80110 in Trachemys scripta. In addition, our peptide database accommodates mature bioactive peptide proteins or protein fragments, as well as sequences from small precursor proteins after the removal of signal peptides. Thus, compared to the existing peptide databases, our peptide database collects more bioactive peptides from various sources.

Of the 20 027 peptides in this newly established peptide database, 12 001 are extracted from peptide precursor proteins in which the sequences are annotated as 'PEPTIDE' or 'CHAIN' in the 'Feature' line. The remaining 8026 peptides are mature peptides or peptide precursor proteins that are less than 200 amino acids in length. A total of 8438 (42%) of all these peptides are no more than 50 amino acids in length. The remaining 58% comprise longer bioactive amino acid sequences (>50 amino acids), which are either annotated as 'CHAIN' in the corresponding Uniprot protein file or obtained directly from small proteins. These larger peptides also play an important signaling role in physiology (e.g., as growth factors or toxins), and their inclusion makes our peptide database more complete.
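The extraction of 'PEPTIDE' and 'CHAIN' features described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the regular expression and the toy record (identifier, coordinates and sequence) are all hypothetical, and the parser assumes a Uniprot-style flat-file entry in which feature lines start with `FT` and the sequence follows the `SQ` line.

```python
import re

# Accepts the older column layout ("FT   PEPTIDE   11   20   Name.")
# as well as the "FT   PEPTIDE   11..20" layout.
FT_RE = re.compile(r"^FT   (PEPTIDE|CHAIN)\s+(\d+)(?:\.\.|\s+)(\d+)")

def extract_features(record: str):
    """Return (key, start, end, subsequence) for each PEPTIDE/CHAIN feature."""
    features, seq_parts, in_seq = [], [], False
    for line in record.splitlines():
        if line.startswith("SQ"):
            in_seq = True                     # sequence block starts after SQ
        elif line.startswith("//"):
            in_seq = False                    # end of entry
        elif in_seq:
            seq_parts.append(line.replace(" ", ""))
        else:
            m = FT_RE.match(line)
            if m:
                features.append((m.group(1), int(m.group(2)), int(m.group(3))))
    seq = "".join(seq_parts)
    # Feature coordinates are 1-based and inclusive.
    return [(key, s, e, seq[s - 1:e]) for key, s, e in features]

# Hypothetical toy entry, invented for illustration only.
TOY_RECORD = """\
ID   TOY_PRECURSOR   Reviewed;   30 AA.
FT   SIGNAL        1     10
FT   PEPTIDE      11     20       Toy peptide.
FT   CHAIN        21     30       Toy chain.
SQ   SEQUENCE   30 AA;
     MKLLVVALLA GSPEFMRFGK RAAYPWDHNE
//
"""

for key, start, end, subseq in extract_features(TOY_RECORD):
    print(key, start, end, subseq, len(subseq) <= 50)
```

The last column printed for each feature flags sequences of at most 50 amino acids, the length threshold used in the text to separate oligopeptides from longer bioactive sequences.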
In addition, some amino acid sequences, for example, the milk protein 'Alpha-S1-casein' P02662 (E00398) and the transport protein 'Hemoglobin subunit beta' P02081 (E01089) from Bos taurus in EROP-Moscow, as well as the nuclear proteins Q9H300 (human) and Q5R5H4 (orangutan) in SwePep, are not in our peptide collection, because proteins that are annotated with both peptide keywords and nonpeptide keywords are excluded from our peptide database.

4.5. Comparison to Other Existing Peptide Classification. In the Uniprot database, proteins are annotated as belonging to a family based on sequence similarity (as stated in the 'SIMILARITY' line) or based on a significant match to an existing motif from the Prosite, Pfam and Smart databases (as stated in the 'DR' line). However, many of the annotated peptide families do not cover all known family members; a number of peptides or precursor proteins are still not classified into any family. Most of these missing family members are short protein fragments and mature peptide molecules, as well as peptide precursors that have a low degree of sequence similarity with other members of the corresponding peptide family. Thus, they cannot be identified either by sequence similarity comparison or by motif match. For example, the family 'HBGF_FGF' is annotated in the 'SIMILARITY' line and is also characterized by Prosite, Pfam and Smart motifs in the 'DR' line. Of the 455 proteins in this family in our database, 426 are annotated as matching the corresponding family motif 'PF00167', and 99 of these matching proteins are also indicated to belong to this family by sequence similarity. However, the remaining 29 proteins in this family, such as human fibroblast growth factor 13 isoform (Q9Y643) and porcine basic fibroblast growth factor fragment (Q9TRD1), are not annotated in Uniprot with respect to their family classification. In the present peptide database, they are clustered into this family based on information on their molecular function discussed in the literature.

While each peptide family in our database covers as many family members as possible, it excludes false positive proteins that match the existing peptide motifs. For example, two nonpeptide proteins, Q9VIA4 from Drosophila melanogaster and A8P7Q9 from Brugia malayi, contain the Prosite pattern 'TGF_BETA_1', but they are excluded from the corresponding family 'TGF_beta' in our database because they are not peptide precursors according to the criteria used for the construction of the present database.

Apart from the family classification information annotated in Uniprot, we also make use of our recently developed peptide pattern database (12) for clustering the peptides and their precursor proteins. The short conserved peptide patterns are efficient in identifying short protein fragments or mature peptides, and the use of these patterns improves the accuracy and completeness of the peptide family classification. For example, the family 'FMRFamide_and_related_neuropeptides' in our database contains 261 proteins; in Uniprot, only 108 (41%) are annotated as belonging to this family by sequence similarity, 91 (35%) have a significant match to the corresponding Pfam family motif 'PF01581', and 17 can be identified by both sequence similarity and this Pfam motif. However, 243 (93%) of the proteins in the family 'FMRFamide_and_related_neuropeptides' match the corresponding patterns in our peptide pattern database, many of which are not annotated in Uniprot with respect to family classification (for instance, FMRFamide-related peptide Q86G61 from Heterodera glycines and LFRFa Q5U900 from Lymnaea stagnalis).
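The coverage figures quoted for the two example families can be combined by inclusion-exclusion. The sketch below assumes that the 108 similarity-annotated and the 91 Pfam-matching proteins each include the 17 proteins identified by both methods; the text does not state this explicitly, and the derived union is therefore an estimate rather than a figure reported by the authors.

```python
# FMRFamide_and_related_neuropeptides, counts as quoted in the text.
total = 261          # proteins in the family in the present database
by_similarity = 108  # annotated by sequence similarity in Uniprot
by_pfam = 91         # significant match to Pfam motif PF01581
by_both = 17         # identified by both (assumed to be included in both counts)
by_patterns = 243    # match the short peptide patterns

# Inclusion-exclusion: proteins found by at least one Uniprot annotation.
by_either = by_similarity + by_pfam - by_both

def pct(n: int) -> int:
    """Percentage of the family, rounded to the nearest integer."""
    return round(100 * n / total)

print(by_either, pct(by_either), pct(by_patterns))

# HBGF_FGF: proteins without a PF00167 annotation.
hbgf_total, hbgf_pfam = 455, 426
print(hbgf_total - hbgf_pfam)
```

Under that assumption, the two Uniprot annotations together cover 182 of the 261 proteins (about 70%), compared with 93% for the peptide patterns.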
In addition to the 178 characterized peptide families, 195 peptide families in our database are not identified by current peptide motifs (e.g., nasal embryonic luteinizing hormone-releasing hormone factor and prothoracicotropic hormone). Most of these families are small and contain recently discovered peptides. Proteins are assigned to these families based on information available in the literature or on protein sequence similarities as revealed by BLAST.
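The motif-based family assignment discussed in this section, including the exclusion of proteins that match a motif but are not peptide precursors, can be sketched as follows. This is an illustration under stated assumptions and not the authors' implementation: the pattern is invented for the example (it is neither the Prosite pattern 'TGF_BETA_1' nor an entry of the peptide motif database), the accessions and sequences are toy data, and the converter handles only the basic elements of the Prosite pattern syntax.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a pattern in Prosite syntax into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # '<' anchors the N-terminus, '>' the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Optional repetition, e.g. x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        # '[ABC]' and single residues are already valid regex.
        count = "" if lo is None else "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + count + suffix)
    return "".join(parts)

def assign_to_family(proteins, pattern, is_peptide_precursor):
    """Keep proteins that match the motif and pass the precursor criteria."""
    rx = re.compile(prosite_to_regex(pattern))
    return [acc for acc, seq in proteins.items()
            if rx.search(seq) and is_peptide_precursor(acc)]

# Hypothetical pattern and toy data, invented for illustration only.
HYPOTHETICAL_PATTERN = "[FLM]-R-F-G-[KR](1,2)"
toy_proteins = {
    "TOY1": "MKSLAFMRFGKRSE",   # matches, treated as a peptide precursor
    "TOY2": "MAAAPLRFGRRTTV",   # matches, but flagged as a nonpeptide protein
    "TOY3": "MSTNQQ",           # no match
}
nonpeptide = {"TOY2"}

members = assign_to_family(toy_proteins, HYPOTHETICAL_PATTERN,
                           lambda acc: acc not in nonpeptide)
print(members)
```

Keeping the precursor check separate from the pattern match mirrors the two-step logic in the text: a motif hit alone is not sufficient for family membership.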

5. Discussion

In this paper, three databases (a peptide, a peptide precursor and a peptide motif database) are created; they consist, respectively, of all currently known peptides, peptide precursors and peptide motifs in the kingdom Metazoa. The construction of these databases is important for studying the structure and function of peptides and for identifying new members of a particular peptide family. It also provides a platform for analyzing the evolutionary relationships among organisms that share related peptides from a family. The construction procedure of the peptide database can also be applied to non-Metazoa species, such as plants and viruses, in which bioactive peptides and their precursor genes have also been discovered (for example, antimicrobial peptide O81338 from common ice plant and fibroblast growth factor P41444 from Autographa californica nuclear polyhedrosis virus).

The present database is more complete than existing peptide databases such as the EROP-Moscow database and the SwePep database. However, the integration between the SwePep database and our peptide database could be important for the biochemical identification of novel peptides. The SwePep database was developed to speed up and improve the peptide identification process using mass spectrometry. In the identification process, the experimental peptide masses are compared to the peptide masses stored in the SwePep database, both with and without possible post-translational modifications. The peptides in SwePep that match the experimental peptide masses, and are thus considered potential bioactive endogenous peptides, are singled out. These peptides can later be confirmed with tandem mass spectrometry data. However, as indicated in that study, of all 400 experimental peptide masses detected in hypothalamic mouse brain tissue by DeCyder MS, only 54 could be identified by the SwePep database. While the experimental peptide masses may include completely novel peptides, some of the known peptides, together with their post-translational modifications, are probably missing from the SwePep database. For example, 195 endogenous mouse peptides are collected in the SwePep database, but no fewer than 434 mouse peptides with a mass below 10 kDa are included in our peptide database. The integration between the SwePep database and our peptide database could identify and validate more peptides in vitro, as more putative peptides are available in our peptide database. It can be noted that for 2634 (14%) of the 19 437 entries in our precursor database, no bioactive peptides were annotated. Examples are human anti-mullerian hormone (Q6GTN3) and transforming growth factor beta superfamily signaling ligand (Q4H3X4) from Ciona intestinalis.
The putative peptides from these large precursor proteins can be obtained by using a cleavage site prediction program, such as the neural network model for the proprotein convertase (PC) family (22) (http://www.cbs.dtu.dk/services/ProP/) or NeuroPred (http://neuroproteomics.scs.uiuc.edu/neuropred.html) (23), which identify the putative cleavage sites in peptide precursors. It is also possible to add nonannotated post-translational modifications (PTMs) to the peptides with a PTM prediction program such as the AutoMotif server (24) (http://automotif.bioinfo.pl/). The integration of the SwePep peptide identification process, our peptide and precursor databases, a cleavage site prediction program and a PTM prediction program could provide great potential for identifying bioactive peptides in vitro.

Many peptides exert their biological activity by binding to membrane proteins such as G-protein coupled receptors (GPCRs) and ion channels. The creation of a link between the newly established peptide database and existing receptor databases, such as the IUPHAR database (http://www.iuphar-db.org/GPCR/) or the G-protein Coupled Receptor Database (http://bioinformatics2.biol.uoa.gr/gpDB/index.jsp), could facilitate research on peptides, receptors and the interactions between them. This is especially important for neurobiology and for drug discovery in general, where the goal is to deorphanize GPCRs and other peptide receptors (25).

Finally, with all known peptide motifs available in our peptide motif database, and all known peptide and peptide precursor sequences available in our peptide and precursor databases, a peptide prediction program will be added to this Web site. A given new protein sequence (or a group of proteins) can be compared with all available peptide motifs as well as with all peptide or peptide precursor sequences. Proteins that have a significant match to a peptide motif, or that display significant sequence similarity to a peptide or precursor sequence, could be predicted as putative peptides or precursor proteins. Researchers are encouraged to submit their newly identified peptides or peptide precursors on the Web site, http://www.peptides.be.
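The integration outlined in the Discussion (cleaving precursors into putative peptides, computing their masses with and without a modification, and matching them to experimental masses) can be sketched in a few lines. This is a deliberately crude illustration and not the ProP neural network, the NeuroPred model or the SwePep matching procedure: cleavage is assumed to occur at every run of two or more basic residues, the only modification considered is C-terminal amidation of a Gly-extended peptide, and the toy precursor, the signal peptide length and the mass tolerance are hypothetical.

```python
import re

WATER = 18.010565        # monoisotopic mass of H2O in Da
AMIDATION = -0.984016    # mass shift for C-terminal amidation (OH -> NH2)
RESIDUE_MASS = {         # monoisotopic residue masses in Da
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def candidate_peptides(precursor: str, signal_len: int = 0):
    """Split a precursor at runs of two or more K/R residues (crude rule)."""
    body = precursor[signal_len:]
    return [frag for frag in re.split(r"[KR]{2,}", body) if frag]

def mono_mass(seq: str, amidated: bool = False) -> float:
    """Monoisotopic mass of a linear peptide, optionally C-terminally amidated."""
    mass = sum(RESIDUE_MASS[aa] for aa in seq) + WATER
    return mass + AMIDATION if amidated else mass

def match_masses(observed, peptides, tol_ppm: float = 10.0):
    """Return (observed mass, sequence, amidated) for every match within tolerance."""
    hits = []
    for mass in observed:
        for seq in peptides:
            forms = [(seq, False)]
            if len(seq) > 1 and seq.endswith("G"):
                # A C-terminal Gly is treated as an amidation signal.
                forms.append((seq[:-1], True))
            for form, amidated in forms:
                theoretical = mono_mass(form, amidated)
                if abs(mass - theoretical) / theoretical * 1e6 <= tol_ppm:
                    hits.append((mass, form, amidated))
    return hits

# Hypothetical toy precursor: 10-residue signal peptide followed by three
# fragments separated by dibasic (KR) sites.
TOY_PRECURSOR = "MKLLVVALLA" + "SPEKRFMRFGKRAAYPW"
peptides = candidate_peptides(TOY_PRECURSOR, signal_len=10)
print(peptides)
print(match_masses([598.3050, 700.0], peptides))
```

In this toy run only the first experimental mass finds a match, the amidated form of the middle fragment, which illustrates why modifications have to be considered alongside the unmodified sequences when comparing masses.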

Acknowledgment. This research was sponsored by IWT funds and by the FWO grants G0146.03 and G0580.06.

References

(1) Yoneda, M.; Watanobe, H.; Terano, A. Central regulation of hepatic function by neuropeptides. J. Gastroenterol. 2001, 36 (6), 361–367.
(2) Horton, H. R.; Moran, L. A.; Ochs, R. S.; Rawn, D. J.; Scrimgeour, K. G. Principles of Biochemistry, 3rd ed.; Prentice Hall: Upper Saddle River, NJ, 2002; pp 728–732.
(3) Hinuma, S.; Habata, Y.; Fujii, R.; Kawamata, Y.; Hosoya, M.; Fukusumi, S.; Kitada, C.; Masuo, Y.; Asano, T.; Matsumoto, H.; Sekiguchi, M.; Kurokawa, T.; Nishimura, O.; Onda, H.; Fujino, M. A prolactin-releasing peptide in the brain. Nature 1998, 393 (6682), 272–276.
(4) Hulo, N.; Bairoch, A.; Bulliard, V.; Cerutti, L.; De, C. E.; Langendijk-Genevaux, P. S.; Pagni, M.; Sigrist, C. J. The PROSITE database. Nucleic Acids Res. 2006, 34 (Database issue), D227–D230.
(5) Bateman, A.; Coin, L.; Durbin, R.; Finn, R. D.; Hollich, V.; Griffiths-Jones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E. L.; Studholme, D. J.; Yeats, C.; Eddy, S. R. The Pfam protein families database. Nucleic Acids Res. 2004, 32 (Database issue), D138–D141.
(6) Schultz, J.; Milpetz, F.; Bork, P.; Ponting, C. P. SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl. Acad. Sci. U.S.A. 1998, 95 (11), 5857–5864.
(7) Zamyatnin, A. A.; Borchikov, A. S.; Vladimirov, M. G.; Voronina, O. L. The EROP-Moscow oligopeptide database. Nucleic Acids Res. 2006, 34 (Database issue), D261–D266.
(8) Falth, M.; Skold, K.; Norrman, M.; Svensson, M.; Fenyo, D.; Andren, P. E. SwePep, a database designed for endogenous peptides and mass spectrometry. Mol. Cell. Proteomics 2006, 5 (6), 998–1005.
(9) Liu, F.; Baggerman, G.; D’Hertog, W.; Verleyen, P.; Schoofs, L.; Wets, G. In silico identification of new secretory peptide genes in Drosophila melanogaster. Mol. Cell. Proteomics 2006, 5 (3), 510–522.
(10) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17), 3389–3402.
(11) Baggerman, G.; Liu, F.; Wets, G.; Schoofs, L. Bioinformatic analysis of peptide precursor proteins. Ann. N.Y. Acad. Sci. 2005, 1040, 59–65.
(12) Nielsen, H.; Brunak, S.; von Heijne, G. Machine learning approaches for the prediction of signal peptides and other protein sorting signals. Protein Eng. 1999, 12 (1), 3–9.
(13) Liu, F.; Baggerman, G.; Schoofs, L.; Wets, G. Uncovering conserved patterns in bioactive peptides in Metazoa. Peptides 2006, 27 (12), 3137–3153.
(14) Nothacker, H. P.; Rinehart, K. L.; McFarlane, I. D.; Grimmelikhuijzen, C. J. Isolation of two novel neuropeptides from sea anemones: the unusual, biologically active L-3-phenyllactyl-Tyr-Arg-Ile-NH2 and its des-phenyllactyl fragment Tyr-Arg-Ile-NH2. Peptides 1991, 12 (6), 1165–1173.
(15) Santos, A. D.; Imperial, J. S.; Chaudhary, T.; Beavis, R. C.; Chait, B. T.; Hunsperger, J. P.; Olivera, B. M.; Adams, M. E.; Hillyard, D. R. Heterodimeric structure of the spider toxin omega-agatoxin IA revealed by precursor analysis and mass spectrometry. J. Biol. Chem. 1992, 267 (29), 20701–20705.
(16) Schlesinger, D. H.; Pickart, L.; Thaler, M. M. Growth-modulating serum tripeptide is glycyl-histidyl-lysine. Experientia 1977, 33 (3), 324–325.
(17) Hokfelt, T.; Broberger, C.; Xu, Z. Q.; Sergeyev, V.; Ubink, R.; Diez, M. Neuropeptides-an overview. Neuropharmacology 2000, 39 (8), 1337–1356.
(18) Vitt, U. A.; Hsu, S. Y.; Hsueh, A. J. Evolution and classification of cystine knot-containing hormones and related extracellular signaling molecules. Mol. Endocrinol. 2001, 15 (5), 681–694.
(19) Darmer, D.; Hauser, F.; Nothacker, H. P.; Bosch, T. C.; Williamson, M.; Grimmelikhuijzen, C. J. Three different prohormones yield a variety of Hydra-RFamide (Arg-Phe-NH2) neuropeptides in Hydra magnipapillata. Biochem. J. 1998, 332 (Pt 2), 403–412.
(20) Ovchinnikova, T. V.; Aleshina, G. M.; Balandin, S. V.; Krasnosdembskaya, A. D.; Markelov, M. L.; Frolova, E. I.; Leonova, Y. F.; Tagaev, A. A.; Krasnodembsky, E. G.; Kokryakov, V. N. Purification and primary structure of two isoforms of arenicin, a novel antimicrobial peptide from marine polychaeta Arenicola marina. FEBS Lett. 2004, 577 (1-2), 209–214.
(21) Shalabi, A.; Zamudio, F.; Wu, X.; Scaloni, A.; Possani, L. D.; Villereal, M. L. Tetrapandins, a new class of scorpion toxins that specifically inhibit store-operated calcium entry in human embryonic kidney-293 cells. J. Biol. Chem. 2004, 279 (2), 1040–1049.
(22) Duckert, P.; Brunak, S.; Blom, N. Prediction of proprotein convertase cleavage sites. Protein Eng. Des. Sel. 2004, 17 (1), 107–112.
(23) Southey, B. R.; Amare, A.; Zimmerman, T. A.; Rodriguez-Zas, S. L.; Sweedler, J. V. NeuroPred: a tool to predict cleavage sites in neuropeptide precursors and provide the masses of the resulting peptides. Nucleic Acids Res. 2006, 34 (Web Server issue), W267–W272.
(24) Plewczynski, D.; Tkacz, A.; Wyrwicz, L. S.; Rychlewski, L. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics 2005, 21 (10), 2525–2527.
(25) Lee, D. K.; George, S. R.; O’Dowd, B. F. Novel G-protein-coupled receptor genes expressed in the brain: continued discovery of important therapeutic targets. Expert Opin. Ther. Targets 2002, 6 (2), 185–202.

PR800037N
