PIGOK: Linking Protein Identity to Gene Ontology and Function
Richard J. Jacob†,‡ and Rainer Cramer*,§
Department of Biochemistry, University College London, Gower Street, London WC1E 6BT, United Kingdom, Ludwig Institute for Cancer Research, 91 Riding House Street, London W1W 7BS, United Kingdom, and The BioCentre and School of Chemistry, University of Reading, Whiteknights, PO Box 221, Reading RG6 6AS, United Kingdom
Received April 8, 2006
Abstract: Here we introduce PIGOK, a database that allows rapid retrieval of physicochemical properties, Gene Ontology, and Kyoto Encyclopedia of Genes and Genomes information about a protein or a list of proteins. We applied PIGOK to analyze Schizosaccharomyces pombe proteins displaying differential expression under oxidative stress and identified their biological functions and pathways. The database is available on the Internet at http://pc4-133.ludwig.ucl.ac.uk/pigok.html.
Keywords: Software • data analysis • bioinformatics • database • proteomics • gene ontology • Kyoto Encyclopedia of Genes and Genomes • codon adaptation index • mass spectrometry
* To whom correspondence should be addressed. E-mail: r.k.cramer@reading.ac.uk. Telephone: +44 (0)118 378 4550. Fax: +44 (0)118 378 4551.
† University College London.
‡ Ludwig Institute for Cancer Research.
§ University of Reading.
DOI: 10.1021/pr0601537. © 2006 American Chemical Society

Introduction
Large-scale proteomic experiments can identify thousands of proteins in a single experiment. A common task after protein identification is to determine each protein’s known function. The proteomic researcher will usually draw on a variety of sources for this information, depending on the species being studied. Two very useful resources are the Gene Ontology projects and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (1). The Gene Ontology Consortium has supplied a species-independent vocabulary for describing gene products in terms of their associated biological processes, cellular components, and molecular functions. The Gene Ontology Annotation (GOA) project (2) at the European Bioinformatics Institute (EBI) and various groups focusing on individual model organisms, for example the Schizosaccharomyces pombe genome project at the Sanger Centre (3) and the FlyBase (4) team, are currently assigning Gene Ontology (GO) annotations for individual genomes. One of the KEGG services that is extremely useful to biologists is a database of pathways mapping the molecular interaction networks in biological processes.

This “wealth of data” is stored in multiple databases and files, so it cannot be assembled into a single table suitable for further processing and manuscript preparation without laborious copying and pasting from the different sources. Alternatives to this cumbersome undertaking are programs or scripts that collate the desired data from different sources and assemble it in a table or database for later reporting. However, many of these approaches suffer from similar problems: for example, the commands or links used to access the databases change format over time, “breaking” scripts, and connections to externally hosted databases are slow. Although the microarray community has generated a number of databases, for example SOURCE (5) and DAVID (6), to solve many of these problems, these databases are oriented very much toward the needs of that community and often provide data only on arrayed genes. The bioinformatics community is just starting to develop similar tools: Integr8 offers integrated views of complete genomes and proteomes including links to the GOA project (7) but not KEGG, and the Multi-Protein Survey System (8) (MPSS) provides a database with a design philosophy similar to that of PIGOK but with slightly different functionality.

Therefore, we developed a database solution that addresses these problems and provides the available data along with precalculated physicochemical properties in a format that is useful to the proteomics community. PIGOK, Protein Interrogation of Gene Ontology and KEGG Databases, is a database containing precalculated isoelectric point (pI), molecular weight (MW), and codon adaptation index (9) (CAI) values, GO and GO slim annotations, and KEGG pathway accession numbers.
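To illustrate what a precalculated CAI value represents, the following minimal Python sketch computes a CAI as the geometric mean of relative codon adaptiveness values, following the widely used Sharp and Li formulation. This is an assumption about the calculation and not a description of the programs used to build PIGOK; the reduced codon table, the reference codon counts, and the function names `relative_adaptiveness` and `cai` are hypothetical.

```python
import math

# Hypothetical synonymous codon families; a real table covers all amino acids.
CODON_FAMILIES = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
}

def relative_adaptiveness(ref_counts):
    """Return w[codon] = count / highest count within its synonymous family."""
    weights = {}
    for codons in CODON_FAMILIES.values():
        top = max(ref_counts.get(c, 0) for c in codons)
        for c in codons:
            if top > 0 and ref_counts.get(c, 0) > 0:
                weights[c] = ref_counts[c] / top
    return weights

def cai(cds, weights):
    """Geometric mean of w over the codons of `cds` that have a weight."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

# Hypothetical codon counts from a set of highly expressed genes.
reference = {"AAA": 20, "AAG": 80, "GAA": 50, "GAG": 50}
w = relative_adaptiveness(reference)
optimal = cai("AAGGAA", w)     # both codons are preferred ones
suboptimal = cai("AAAGAA", w)  # AAA is four times rarer than AAG
```

A sequence built only from the most frequent codon of each family scores 1, and each less favored codon lowers the geometric mean, which is why a high CAI is commonly read as a predictor of high expression.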
Materials and Methods
We designed a simple web interface for the database, which is also accessible directly via scripts (Figure 1). PIGOK is implemented as a structured query language (SQL) database (MySQL), and a combination of Perl scripts with BioPerl libraries (10) and EMBOSS programs (11) generates the pI, MW, and CAI values. GO information is taken from the GOA project data files (2), and GO slim entries (12) were precalculated for each GO definition. KEGG pathway accession numbers for the proteins, where they exist, are extracted from the KEGG data files for each species. Currently, PIGOK supports the S. cerevisiae, S. pombe, International Protein Index (13) (IPI) Human, Mouse, Rat, Zebrafish, Arabidopsis, Chicken, Cow, and SWISS-PROT (14) databases. The common gateway interface (cgi) scripts also generate links to the originating databases and to the NCBI Entrez Gene (15) database. The cgi scripts can optionally write their results to a Microsoft Excel worksheet for later data manipulation. The PIGOK database resides on a Dell (Texas, U.S.A.) dual 800 MHz Intel Pentium III computer running the Fedora Core 3 Linux operating system and takes approximately 20 s to generate results for the whole S. pombe genome.

Journal of Proteome Research 2006, 5, 3429-3432
Published on Web 09/23/2006
technical notes
Figure 1. Overview of the PIGOK database layout. A BioPerl/EMBOSS script precalculates the CAI, pI, and MW values for each protein. A second script determines the GO slim value for each GO annotation and loads the results into the database. A final database loading script enters KEGG pathway accession numbers into the database. The database is then accessible via local scripts or through the cgi interface.
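The layout in Figure 1 can be illustrated with a small relational sketch. The following Python example uses the standard-library `sqlite3` module as a stand-in for MySQL; the table names, column names, accession numbers, and all values are hypothetical and are not taken from the PIGOK schema or from any real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL server
conn.executescript("""
CREATE TABLE protein (accession TEXT PRIMARY KEY, pi REAL, mw REAL, cai REAL);
CREATE TABLE go_annotation (accession TEXT, go_id TEXT, go_slim TEXT);
CREATE TABLE kegg_pathway (accession TEXT, pathway_id TEXT);
""")

# Placeholder rows; a loading script would fill these from the source files.
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)",
                 [("P_A", 5.6, 36000.0, 0.71), ("P_B", 8.9, 52000.0, 0.32)])
conn.executemany("INSERT INTO go_annotation VALUES (?, ?, ?)",
                 [("P_A", "GO:x1", "response to stress"),
                  ("P_B", "GO:x2", "protein biosynthesis")])
conn.executemany("INSERT INTO kegg_pathway VALUES (?, ?)",
                 [("P_A", "path:x1")])

def lookup(accessions):
    """Return one row per (protein, GO slim, pathway) for the given list."""
    marks = ",".join("?" for _ in accessions)
    sql = f"""
        SELECT p.accession, p.pi, p.mw, p.cai, g.go_slim, k.pathway_id
        FROM protein p
        LEFT JOIN go_annotation g ON g.accession = p.accession
        LEFT JOIN kegg_pathway k ON k.accession = p.accession
        WHERE p.accession IN ({marks})
        ORDER BY p.accession
    """
    return conn.execute(sql, list(accessions)).fetchall()

rows = lookup(["P_A", "P_B"])
```

The `LEFT JOIN` clauses keep proteins that have no pathway entry, in line with pathway accession numbers being stored only where they exist.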
Figure 2. A comparison of the whole S. pombe genome GO slim biological process annotations (inner ring) vs the proteins displaying differential expression under oxidative stress (outer ring). The GO annotations “generation of precursor metabolites and energy”, “response to stress”, and “protein biosynthesis” are overrepresented in the experimental data set.
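The comparison shown in Figure 2 amounts to comparing the share of each GO slim term in the experimental set with its share in the whole genome. A minimal Python sketch follows; the function names, the term lists, and the counts are hypothetical illustrations rather than the published data, and no statistical test is implied.

```python
from collections import Counter

def slim_proportions(annotations):
    """Map each GO slim term to its fraction of all annotations in the list."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def overrepresented(subset, genome):
    """Return {term: ratio} for terms whose share in `subset` exceeds `genome`."""
    sub = slim_proportions(subset)
    gen = slim_proportions(genome)
    return {t: sub[t] / gen[t] for t in sub if t in gen and sub[t] > gen[t]}

# Hypothetical annotation lists (one GO slim term per annotated protein).
genome = ["response to stress"] * 10 + ["transport"] * 40 + ["cell cycle"] * 50
subset = ["response to stress"] * 5 + ["transport"] * 3 + ["cell cycle"] * 2

ratios = overrepresented(subset, genome)
```

In this toy data, "response to stress" makes up half of the subset but only a tenth of the genome, so it is the only term reported.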
To test the software, a sample data set was analyzed. S. pombe exposed to peroxide-induced oxidative stress was compared to its wild type (16). The differently treated S. pombe cells were lysed, labeled with CyDye, and analyzed by two-dimensional difference gel electrophoresis (17). After gel image analysis with DeCyder image analysis software (GE Healthcare, Amersham, UK), a set of significantly regulated proteins was selected for identification. The proteins displaying differential expression were excised robotically for manual digestion and for automatic digestion with a prototype robotic platform. The extracted peptide digests were analyzed by matrix-assisted laser desorption/ionization mass spectrometry (MALDI MS) peptide mass fingerprinting, and the proteins were identified by database searching with Mascot (18) (Matrix Science, London, UK).

Journal of Proteome Research • Vol. 5, No. 12, 2006
Figure 3. The codon adaptation indices (CAIs) of the differentially expressed proteins identified from the oxidative stress experiment were compared to the CAI distribution of the whole S. pombe genome. The identified proteins showed a bias toward proteins predicted to be more abundant.
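The comparison in Figure 3 can be reproduced in outline by binning the CAI values of the identified proteins and of the whole genome into the same intervals and comparing the relative frequencies. The Python sketch below assumes ten equal-width bins over 0 to 1; the function names and the CAI values are hypothetical and are not the published data.

```python
def cai_histogram(values, bins=10):
    """Relative frequency of CAI values in equal-width bins over [0, 1]."""
    if not values:
        raise ValueError("no CAI values supplied")
    counts = [0] * bins
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"CAI out of range: {v}")
        index = min(int(v * bins), bins - 1)  # v == 1.0 falls in the last bin
        counts[index] += 1
    return [c / len(values) for c in counts]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Hypothetical CAI values for illustration only.
genome_cai = [0.15, 0.22, 0.28, 0.31, 0.35, 0.42, 0.48, 0.55, 0.63, 0.78]
identified_cai = [0.48, 0.55, 0.63, 0.78]

genome_hist = cai_histogram(genome_cai)
identified_hist = cai_histogram(identified_cai)
shift = mean(identified_cai) - mean(genome_cai)  # positive if biased upward
```

Using relative rather than absolute frequencies lets a small identified set be compared with a much larger genome-wide distribution on the same axis.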
Results and Discussion
A total of 110 gel spots displaying differential expression were analyzed by both manual and automated methods, and the results were combined, identifying 110 proteins from 98 gel spots and resulting in a total of 75 unique gene products. Using PIGOK to look up the GO and KEGG information for the identified protein set allowed quick interpretation of the data. The differentially regulated proteins had a greater proportional representation than the proteins of the whole genome in the yeast GO slim annotations “response to stress”, “protein biosynthesis”, and “generation of precursor metabolites and energy” from the biological process ontology (Figure 2). KEGG pathway analysis showed similar pathway assignments: 8 proteins were involved in the Glycolysis/Gluconeogenesis pathway, and 11 of the 39 pathways identified were involved in amino acid metabolism or biosynthesis. A histogram of the CAI values of the identified proteins plotted against those of the whole genome also shows that the identified proteins are predicted to be more abundant on average (Figure 3). Other studies (19-21) showed similar biases toward the identification of the more abundant proteins, indicating that whether lower abundance proteins are identified may depend on the gel separation or protein detection methods or may be attributable to the specific response to the changes in the biological system.

During the development of PIGOK, a number of changes had to be made to the underlying source code. The formats of a number of data files changed, or the files were replaced entirely by comparable files, necessitating changes in the database loading scripts. Very often, the quality of the source files improved through the addition of more cross-referenced accession numbers and other information. One such change was the inclusion of the Entrez Gene accession number in the IPI reference files.
Before this change, Entrez Gene accession numbers had to be inferred from a third data source that contained cross references to both the IPI and the Entrez Gene data files. Occasionally, the webpages or websites containing information about the proteins or the GO and KEGG annotations changed, thus breaking the links that PIGOK generates. As these links are generated on the fly by the PIGOK query
script and not stored in the database, the query script can be quickly updated, fixing the problem. A broken link will affect web users only for a short while, until the query script is modified with the new URL information. Changes to the data sources or reference sites that PIGOK uses occur approximately once a year per providing source.

PIGOK’s simple web-based interface is publicly available at http://pc4-133.ludwig.ucl.ac.uk/pigok.html. We also provide a downloadable package for the database at http://www.ludwig.ucl.ac.uk/bachem_html/software.htm. The package includes all the scripts used to generate the database structure, to process the protein databases and calculate their physicochemical parameters, annotations, and relationships, and to load the resulting information into the database, along with the cgi scripts and a number of other example scripts to access the data.
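A broken link can be repaired without touching the database because only accession numbers are stored, while URLs are assembled from templates at query time. The following Python sketch illustrates that design choice; the template strings point at the reserved `example.org` domain, and the source keys, the function name `build_link`, and the accession values are hypothetical.

```python
from urllib.parse import quote

# Hypothetical URL templates, one per external source. Updating a template
# here changes every generated link without modifying any stored data.
LINK_TEMPLATES = {
    "go": "https://example.org/go/term?id={acc}",
    "kegg": "https://example.org/kegg/pathway?id={acc}",
    "gene": "https://example.org/gene/{acc}",
}

def build_link(source, accession):
    """Assemble a URL for `accession` from the template for `source`."""
    try:
        template = LINK_TEMPLATES[source]
    except KeyError:
        raise ValueError(f"unknown source: {source}") from None
    # Percent-encode the accession so characters such as ':' are URL-safe.
    return template.format(acc=quote(accession, safe=""))

before = build_link("gene", "12345")

# Simulate the external site changing its address scheme.
LINK_TEMPLATES["gene"] = "https://example.org/v2/genes/{acc}"
after = build_link("gene", "12345")
```

Only the one template entry changes when a site moves, which is the property that makes the fix quick.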
Conclusion
PIGOK permits rapid analysis of a group of proteins, linking them to their GO and KEGG accession numbers and physicochemical properties. Overviews of the proteins’ biological processes, cellular components, molecular functions, and KEGG pathways allow a researcher to form an initial interpretation of the biological relevance of their results. The PIGOK database is accessible via the World Wide Web or can be downloaded and installed locally. Future work will add support for further genome databases as the GO information becomes available, expand the range and functionality of the scripts, and take fuller advantage of the relationships among the data in the database. Additional physicochemical properties can also be calculated for each protein, for example the number of transmembrane domains or the hydrophobicity of the protein.
Acknowledgment. We thank John Timms for numerous helpful discussions. R.J.J. is supported by a BBSRC CASE award.

References
(1) Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. Nucleic Acids Res. 1999, 27, 29-34.
(2) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.; Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. Nat. Genet. 2000, 25, 25-29.
(3) Wood, V.; Gwilliam, R.; Rajandream, M. A.; Lyne, M.; Lyne, R.; Stewart, A.; Sgouros, J.; Peat, N.; Hayles, J.; Baker, S.; Basham, D.; Bowman, S.; Brooks, K.; Brown, D.; Brown, S.; Chillingworth, T.; Churcher, C.; Collins, M.; Connor, R.; Cronin, A.; Davis, P.; Feltwell, T.; Fraser, A.; Gentles, S.; Goble, A.; Hamlin, N.; Harris, D.; Hidalgo, J.; Hodgson, G.; Holroyd, S.; Hornsby, T.; Howarth, S.; Huckle, E. J.; Hunt, S.; Jagels, K.; James, K.; Jones, L.; Jones, M.; Leather, S.; McDonald, S.; McLean, J.; Mooney, P.; Moule, S.; Mungall, K.; Murphy, L.; Niblett, D.; Odell, C.; Oliver, K.; O’Neil, S.; Pearson, D.; Quail, M. A.; Rabbinowitsch, E.; Rutherford, K.; Rutter, S.; Saunders, D.; Seeger, K.; Sharp, S.; Skelton, J.; Simmonds, M.; Squares, R.; Squares, S.; Stevens, K.; Taylor, K.; Taylor, R. G.; Tivey, A.; Walsh, S.; Warren, T.; Whitehead, S.; Woodward, J.; Volckaert, G.; Aert, R.; Robben, J.; Grymonprez, B.; Weltjens, I.; Vanstreels, E.; Rieger, M.; Schafer, M.; Muller-Auer, S.; Gabel, C.; Fuchs, M.; Dusterhoft, A.; Fritzc, C.; Holzer, E.; Moestl, D.; Hilbert, H.; Borzym, K.; Langer, I.; Beck, A.; Lehrach, H.; Reinhardt, R.; Pohl, T. M.; Eger, P.; Zimmermann, W.; Wedler, H.; Wambutt, R.; Purnelle, B.; Goffeau, A.; Cadieu, E.; Dreano, S.; Gloux, S.; Lelaure, V.; Mottier, S.; Galibert, F.; Aves, S. J.; Xiang, Z.; Hunt, C.; Moore, K.; Hurst, S. M.; Lucas, M.; Rochet, M.; Gaillardin, C.; Tallada, V. A.; Garzon, A.; Thode, G.; Daga, R. R.; Cruzado, L.; Jimenez, J.; Sanchez, M.; del Rey, F.; Benito, J.; Dominguez, A.; Revuelta, J. L.; Moreno, S.; Armstrong, J.; Forsburg, S. L.; Cerutti, L.; Lowe, T.; McCombie, W. R.; Paulsen, I.; Potashkin, J.; Shpakovski, G. V.; Ussery, D.; Barrell, B. G.; Nurse, P. Nature 2002, 415, 871-880.
(4) Drysdale, R. A.; Crosby, M. A. Nucleic Acids Res. 2005, 33, D390-395.
(5) Diehn, M.; Sherlock, G.; Binkley, G.; Jin, H.; Matese, J. C.; Hernandez-Boussard, T.; Rees, C. A.; Cherry, J. M.; Botstein, D.; Brown, P. O.; Alizadeh, A. A. Nucleic Acids Res. 2003, 31, 219-223.
(6) Dennis, G., Jr.; Sherman, B. T.; Hosack, D. A.; Yang, J.; Gao, W.; Lane, H. C.; Lempicki, R. A. Genome Biol. 2003, 4, P3.
(7) Kersey, P.; Bower, L.; Morris, L.; Horne, A.; Petryszak, R.; Kanz, C.; Kanapin, A.; Das, U.; Michoud, K.; Phan, I.; Gattiker, A.; Kulikova, T.; Faruque, N.; Duggan, K.; McLaren, P.; Reimholz, B.; Duret, L.; Penel, S.; Reuter, I.; Apweiler, R. Nucleic Acids Res. 2005, 33, D297-302.
(8) Hao, P.; He, W. Z.; Huang, Y.; Ma, L. X.; Xu, Y.; Xi, H.; Wang, C.; Liu, B. S.; Wang, J. M.; Li, Y. X.; Zhong, Y. Bioinformatics 2005, 21, 2142-2143.
(9) Grantham, R.; Gautier, C.; Gouy, M.; Mercier, R.; Pave, A. Nucleic Acids Res. 1980, 8, r49-r62.
(10) Stajich, J. E.; Block, D.; Boulez, K.; Brenner, S. E.; Chervitz, S. A.; Dagdigian, C.; Fuellen, G.; Gilbert, J. G.; Korf, I.; Lapp, H.; Lehvaslaiho, H.; Matsalla, C.; Mungall, C. J.; Osborne, B. I.; Pocock, M. R.; Schattner, P.; Senger, M.; Stein, L. D.; Stupka, E.; Wilkinson, M. D.; Birney, E. Genome Res. 2002, 12, 1611-1618.
(11) Rice, P.; Longden, I.; Bleasby, A. Trends Genet. 2000, 16, 276-277.
(12) Biswas, M.; O’Rourke, J. F.; Camon, E.; Fraser, G.; Kanapin, A.; Karavidopoulou, Y.; Kersey, P.; Kriventseva, E.; Mittard, V.; Mulder, N.; Phan, I.; Servant, F.; Apweiler, R. Brief Bioinform. 2002, 3, 285-295.
(13) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. Proteomics 2004, 4, 1985-1988.
(14) O’Donovan, C.; Martin, M. J.; Gattiker, A.; Gasteiger, E.; Bairoch, A.; Apweiler, R. Brief Bioinform. 2002, 3, 275-284.
(15) Maglott, D.; Ostell, J.; Pruitt, K. D.; Tatusova, T. Nucleic Acids Res. 2005, 33, D54-58.
(16) Weeks, M. E.; Sinclair, J.; Jacob, R. J.; Saxton, M. J.; Kirby, S.; Jones, J.; Waterfield, M. D.; Cramer, R.; Timms, J. F. Proteomics 2005, 5, 1669-1685.
(17) Unlu, M.; Morgan, M. E.; Minden, J. S. Electrophoresis 1997, 18, 2071-2077.
(18) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(19) Gygi, S. P.; Corthals, G. L.; Zhang, Y.; Rochon, Y.; Aebersold, R. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 9390-9395.
(20) Salusjarvi, L.; Poutanen, M.; Pitkanen, J. P.; Koivistoinen, H.; Aristidou, A.; Kalkkinen, N.; Ruohonen, L.; Penttila, M. Yeast 2003, 20, 295-314.
(21) Trabalzini, L.; Paffetti, A.; Scaloni, A.; Talamo, F.; Ferro, E.; Coratza, G.; Bovalini, L.; Lusini, P.; Martelli, P.; Santucci, A. Biochem. J. 2003, 370, 35-46.