Proteomic 2DE Database for Spot Selection, Automated Annotation, and Data Analysis

Lars Malmström,† Johan Malmström,† György Marko-Varga,*,‡ and Gunilla Westergren-Thorsson†

Cell & Molecular Biology, University of Lund, BMC, C13, 221 84 Lund, Sweden, and Analytical Chemistry, University of Lund, Box 124, 221 00 Lund, Sweden

Received November 9, 2001

We present a software solution that enables faster and more accurate analysis of 2DE/MALDI TOF MS data. The software supports data analysis through a number of automated data selection functions and advanced graphical tools. Once protein identities are determined using MALDI TOF MS, automated data retrieval from online databases provides biological information. The software, called 2DDB, reduces analysis time to a fraction of that needed for more manual data analysis, without loss of quality. The database contains over 100 000 data entries, and selected parts can be reached at http://2ddb.org.

Keywords: spot selection • automated annotation • 2DE/MALDI TOF MS • 2DDB

Introduction

With the human genome made public last year, we entered the post-genomic era, and the focus turned toward the proteome: all the proteins expressed at any one time in a cell, a tissue, an organ, or an organism. Proteomics, the study of the proteome, has become the new scientific "buzzword" and has attracted enormous attention from the scientific community as well as from investors (1, 2). Several hundred million US dollars have already been invested by pharmaceutical and biotech companies. A recent forecast predicts that the proteomics market will grow nearly 6-fold, from US $963 million in 2000 to US $5.6 billion by 2006 (3). Academia and the pharmaceutical/biotech industry have both moved their efforts toward completing the human proteome. To this end, several new high-throughput techniques have been developed to identify (4, 5), quantify (6), and characterize (7) proteins (for a review, see Pandey and Mann (8)). Many of these technologies use mass analysis as the final protein identifier (for a review, see Mann et al. (9)).

Problems common to all high-throughput techniques are data handling and data interpretation. Unless the data can be filtered, annotated, and viewed in an effective, automated way, biological conclusions will be very difficult to draw. In our laboratory, we use a combination of two-dimensional gel electrophoresis (2DE), for separation and quantification of crude cell extracts, and matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI TOF MS), for identification, to study changes in protein expression in different tissues subjected to various stimuli. In a typical experiment, triplicate gels are run for multiple stimuli at multiple time points. Each gel is capable of separating thousands of proteins, visible as spots when stained with either fluorescent or silver staining (10, 11). Most of the information in the gels is of secondary interest. Hence, identifying and filtering the interesting data is necessary because of the large amount of complex data present. In addition, accurate information about identified proteins has to be easily retrieved and made searchable when a large number of proteins are identified. For this reason, we have developed a software solution, the 2-DE database (2DDB), to aid the selection of interesting spots from the gels and to automate information retrieval from the Internet when protein identities are entered. This tool is clearly of value when dealing with large amounts of data derived from 2DE/MALDI TOF MS.

* To whom correspondence should be addressed. E-mail: [email protected].
† Cell & Molecular Biology, University of Lund.
‡ Analytical Chemistry, University of Lund.
10.1021/pr010004i CCC: $22.00 © 2002 American Chemical Society

Materials and Methods

Software Utilized. The 2DDB software system is implemented under Debian Linux 2.2 (release potato) on a standard desktop computer with an 800 MHz AMD Duron CPU, a 40 GB hard drive, and 128 MB RAM. It is built on three main building blocks: a structured query language (SQL) relational database (MySQL release 3.22) and two programming languages, Perl (Practical Extraction and Report Language) 5.004 and PHP 4.01. Perl is used to extract data from files exported from Bio-Rad's PDQUEST (Bio-Rad discovery series, Bio-Rad Laboratories, Sundbyberg, Sweden) and to import them into the database. Perl is also used to perform the statistical calculations and to maintain and update the database. PHP is used to create the interactive web interface and to generate graphs and images on the fly. Also needed are a web server, Apache (version 1.30), and several add-on software packages for Perl and PHP. MySQL stores the data in relational tables and enables retrieval of the data in a specific and easily controlled manner.

Journal of Proteome Research 2002, 1, 135-138


Published on Web 02/02/2002


Figure 2. Large gray rectangles represent an experiment in the 2DDB system. The 2DE experiment is performed with triplicate gels for each experimental condition (A). The gels are scanned and analyzed in software such as Bio-Rad's PDQUEST (B). The data are exported from PDQUEST and imported into the 2DDB software (C), where a statistical analysis is performed (D). The data are presented to the user as time/bar graphs and as marked spots on a reference gel (E). This information helps the user to manually select regulated proteins (F).

Figure 1. Data structure in 2DDB. The colored arrows indicate relations in the database; e.g., blue arrows indicate the relation between the experiments (experiments 1 and 2 in the picture) and the experimental information table, represented by a blue box. Data are exported from PDQUEST (A), and each experiment results in a data table (B). A table with statistical calculations accompanies the main data table. Each experiment is described by a main data table containing experimental conditions and a gel table containing information about individual gels (C), including the gel group information used when performing the statistical analysis. The information can be used to select protein spots of interest, which are then identified by MALDI TOF MS (E). The identities are then imported into the 2DDB, which automatically retrieves information from online databases such as SwissProt and PubMed (E). The three data tables in C link each experiment to the others.
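The table layout in Figure 1 might be sketched as follows. This is a minimal illustration, not the 2DDB schema: all table and column names are hypothetical, Python's built-in sqlite3 module stands in for MySQL, and a single raw-data table is used where 2DDB keeps one data table per experiment. The sketch shows the idea of a raw-data table accompanied by a precomputed statistics table (mean, standard deviation, and number of gels per spot).

```python
import sqlite3
import statistics
from collections import defaultdict

# Hypothetical schema: table and column names are illustrative, not from 2DDB.
SCHEMA = """
CREATE TABLE experiment (experiment_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE gel (
    gel_id        INTEGER PRIMARY KEY,
    experiment_id INTEGER REFERENCES experiment(experiment_id),
    gel_group     TEXT      -- replicate group used in the statistics
);
CREATE TABLE spot_raw (     -- one row per matched spot per gel
    spot_id INTEGER, gel_id INTEGER REFERENCES gel(gel_id),
    x REAL, y REAL, iod_ppm REAL, quality REAL
);
CREATE TABLE spot_stats (   -- precomputed summary, one row per spot and group
    experiment_id INTEGER, spot_id INTEGER, gel_group TEXT,
    mean_ppm REAL, sd_ppm REAL, n_gels INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database holding one spot on three replicate gels."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO experiment VALUES (1, 'demo experiment')")
    conn.executemany("INSERT INTO gel VALUES (?, 1, ?)",
                     [(1, "control"), (2, "control"), (3, "control")])
    conn.executemany("INSERT INTO spot_raw VALUES (?, ?, ?, ?, ?, ?)",
                     [(101, 1, 10.0, 20.0, 100.0, 1.0),
                      (101, 2, 10.1, 20.2, 110.0, 1.0),
                      (101, 3, 9.9, 19.8, 120.0, 1.0)])
    return conn


def refresh_stats(conn):
    """Recompute the summary table once so that later queries can reuse it."""
    groups = defaultdict(list)
    rows = conn.execute(
        "SELECT g.experiment_id, r.spot_id, g.gel_group, r.iod_ppm "
        "FROM spot_raw AS r JOIN gel AS g ON g.gel_id = r.gel_id")
    for experiment_id, spot_id, gel_group, ppm in rows:
        groups[(experiment_id, spot_id, gel_group)].append(ppm)
    conn.execute("DELETE FROM spot_stats")
    for (experiment_id, spot_id, gel_group), values in groups.items():
        # A sample standard deviation needs at least two gels.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        conn.execute("INSERT INTO spot_stats VALUES (?, ?, ?, ?, ?, ?)",
                     (experiment_id, spot_id, gel_group,
                      statistics.mean(values), sd, len(values)))
    conn.commit()
```

Keeping the summary in its own table means a page that lists many spots reads one row per spot instead of aggregating the raw rows on every request.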

Results and Discussion

Database Structure. The database has three layers. The lowest layer is the actual data, which consist of the MySQL data tables. Above it is the analysis/logical layer, built in Perl and PHP, where data from different tables and different online sources are combined and served to the presentation layer. The presentation layer, handled by PHP, presents the data to the user in a web format and adds functionality such as hyperlinks to information within the 2DDB or to relevant external web sites. This architecture permits very large data structures and enables easy database maintenance. Because data are retrieved automatically from the Internet databases, the database is easy to keep up to date. The internal data structure of the database is shown in Figure 1. Data from every experiment are stored in two separate tables, one containing all the raw data and one containing the mean and standard deviation for each spot as well as the number of gels in which the spot was present. The latter table speeds up queries by eliminating recalculation of the statistics for every query, thereby reducing the system's response time. Information concerning the gels and the experiments is stored in two tables that are common to all experiments in the database.

Data Import. A number of 2D electrophoresis gels are run with samples from, e.g., healthy/diseased tissue or stimulated/unstimulated control cell culture; see Figure 2A. After being scanned, the gels are analyzed in spot-matching software, the PDQUEST (version 6.1.0) two-dimensional gel analysis system (Figure 2B). In the ideal case, a single protein constitutes a spot on the gel. In the real case, a spot often contains more than one protein, because the proteins have comigrated or by other means ended up in the same position in the gel. The spot-matching software matches each spot to the equivalent spots in the other gels, which is essential for detecting differences between the samples. Corresponding spots are given a common spot id. Each spot on the gel is also given an integrated optical density (IOD) value by the software. This value is normalized to the total IOD of all valid spots; thus, each spot is expressed in ppm (parts per million) of the total IOD of all valid spots. The spot id, together with the x and y coordinates, the IOD, and the quality of the spot, is exported and imported into the 2DDB (Figure 2C). The software produces easily navigated web pages that present the data to the user in an easily digested form. The raw data are complemented with time and bar graphs and with information from online databases such as SwissProt (12) and PubMed. The user can make sub-selections manually or let the computer present, e.g., all spots that have been significantly (two-sided t test at the 95% confidence level) up- or down-regulated between healthy and diseased human tissue. Data can also be mapped back onto the gel image, helping the user to distinguish spots from gel artifacts and spot mismatches (Figure 2F).
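The ppm normalization and the significance filter described above can be illustrated with a short sketch. It rests on assumptions not stated in the text: the function names are hypothetical, the t test is taken to be the pooled-variance (Student) form, and the two-sided 5% critical values are hard-coded for small replicate numbers only.

```python
import math
import statistics

# Two-sided critical values of Student's t at the 5% level, by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def normalize_to_ppm(iod_by_spot):
    """Express each valid spot's IOD as parts per million of the gel total."""
    total = sum(iod_by_spot.values())
    if total <= 0:
        raise ValueError("total IOD of valid spots must be positive")
    return {spot: 1_000_000 * iod / total for spot, iod in iod_by_spot.items()}


def is_regulated(control, treated):
    """Two-sided pooled-variance t test at the 95% level (assumed variant).

    control and treated are lists of ppm values for one spot, one value per
    replicate gel; each list needs at least two values.
    """
    n1, n2 = len(control), len(treated)
    df = n1 + n2 - 2
    if df not in T_CRIT_95:
        raise ValueError("critical value not tabulated for df=%d" % df)
    pooled_var = ((n1 - 1) * statistics.variance(control)
                  + (n2 - 1) * statistics.variance(treated)) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(treated) - statistics.mean(control)
    if std_err == 0:
        # No spread within either group: any difference counts as regulated.
        return diff != 0
    return abs(diff / std_err) > T_CRIT_95[df]
```

With triplicate gels per condition there are four degrees of freedom, so a spot is flagged only when the absolute t value exceeds 2.776.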
The features described above aid the user in selecting proteins of interest (Figure 2D), which can then be further analyzed.

Data Annotation. In this case, we use MALDI TOF MS to characterize the spots identified as interesting (Figure 2F), and the protein identities obtained from MALDI TOF MS are entered into the 2DDB. Information about the proteins is automatically obtained from different online databases and is conveniently displayed on the web pages. The data tables shown in Figure 1C are common to all experiments and thus constitute bridges between different experiments. This enables conclusions to be drawn from a larger set of experiments.

Figure 3. The data in the database can be visualized on the reference gel. In this image, all proteins found by PDQUEST are mapped onto the gel. If the protein of a spot is not known, the spot is marked in yellow with its identification number. If the protein is known, the spot is marked in red with the SwissProt id. The spots are clickable, and a click displays either a bar graph or a combined line and bar graph together with information about the protein and the raw data.

Figure 4. Multiple graphs from a sub-selection can be viewed in a single web page, as seen in part A. This gives a good overview of how the proteins in the selection reacted to the stimulation in comparison with the control. Each graph is clickable and will display a larger graph, information about the protein if it is identified, and all the raw data, color-coded according to whether the up- or down-regulation was significant, as seen in part B. Part C shows an example of a bar graph.

Case Study. Here, we present a specific case study to display the power of the 2DDB system. This particular case addresses the difficulties in analyzing a large number of gels derived from expression analysis of proteins in a cell culture responding to various stimuli over time. Human primary fetal fibroblasts were stimulated with three stimuli for three different durations: 24, 48, and 96 h. Each experiment was repeated three times, and one of the time points was repeated once. The gels were matched in PDQUEST, where the amount of data made it impossible to draw meaningful conclusions about the biological responses to the stimuli. Therefore, data from the matched gels were exported as text files and subsequently imported and analyzed in the 2DDB. The 2DDB visualizes data on gel images (Figure 3) where every spot is clickable. A click on a spot results in a detailed web page displaying information on the expression in all gels, average values, and simple graphs that allow the relative expression of the protein to be monitored over time in response to a given stimulus; see Figure 4. The line graphs are used when, for example, cells have been subjected to multiple stimuli at various time points. The x-axis then shows time and the y-axis relative regulation compared to control, and the different lines represent different experiments. Filled markers indicate significant regulation. The bar graphs can, for example, show different stimuli on the x-axis and relative regulation compared to control on the y-axis. This enabled a complete investigation of the spots and simplified the selection of spots of interest for mass spectrometric protein identification to a matter of hours.
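As an illustration of the line graphs described in the case study, the points of one line (relative regulation compared to control at each time point, with a flag for filled markers) could be assembled roughly as below. The function and field names are hypothetical, and the significance test is passed in as a parameter; the `ranges_disjoint` helper is only a crude placeholder for demonstration, not the t test used by 2DDB.

```python
from statistics import mean


def regulation_series(control, stimulated, is_significant):
    """Build the points of one line in a relative-regulation time graph.

    control and stimulated map a time point in hours to the list of ppm
    values measured on the replicate gels at that time point.
    is_significant(control_values, stimulated_values) returns True or False.
    """
    points = []
    for time_h in sorted(stimulated):
        ctrl, stim = control[time_h], stimulated[time_h]
        points.append({
            "time_h": time_h,
            "ratio": mean(stim) / mean(ctrl),      # y-axis: relative to control
            "filled": is_significant(ctrl, stim),  # filled marker if significant
        })
    return points


def ranges_disjoint(a, b):
    """Placeholder significance rule: the two value ranges do not overlap."""
    return max(a) < min(b) or max(b) < min(a)


# Hypothetical ppm values for one spot at two time points.
control = {24: [100, 100, 100], 48: [100, 110, 120]}
stimulated = {24: [200, 200, 200], 48: [105, 115, 125]}
series = regulation_series(control, stimulated, ranges_disjoint)
```

Passing the test in as a function keeps the plotting data independent of whichever statistical criterion is chosen.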
Figure 5. Identified proteins are displayed in a list in part A. A single click will show all the graphs in overview mode (Figure 4A) or a single graph in detailed mode (Figure 4B). One can also view information about the protein, as seen in part B, or map the entire list onto a gel image, as seen in part C. Integrating information about the proteins, whether they are up-regulated or down-regulated, and where they were found on the gel helps the user to determine which proteins were co-regulated.

After MS protein identification of the selected spots, the SwissProt ID and name were entered into the database, allowing automated retrieval of data from SwissProt and PubMed (Figure 5). The information on the identified proteins is stored locally so that it can be accessed quickly when needed. The construction of the database and the subsequent annotation make it possible to search for the expression of the same proteins in other experiments, in response to different stimuli, and in various tissues.
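The local storage of retrieved annotations might look like the following sketch. It is an assumption-laden illustration: the class, table, and field names are hypothetical, sqlite3 stands in for MySQL, and the actual retrieval from SwissProt or PubMed is represented by a caller-supplied `fetch` function, since the text does not describe how 2DDB queries those services.

```python
import sqlite3


class AnnotationCache:
    """Store protein annotations locally so each entry is fetched only once."""

    def __init__(self, fetch, conn=None):
        # fetch(accession) -> (name, description); supplied by the caller.
        self._fetch = fetch
        self._conn = conn if conn is not None else sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS annotation ("
            "accession TEXT PRIMARY KEY, name TEXT, description TEXT)")

    def get(self, accession):
        """Return the annotation, retrieving and storing it on first use."""
        row = self._conn.execute(
            "SELECT name, description FROM annotation WHERE accession = ?",
            (accession,)).fetchone()
        if row is None:
            name, description = self._fetch(accession)
            self._conn.execute("INSERT INTO annotation VALUES (?, ?, ?)",
                               (accession, name, description))
            self._conn.commit()
            row = (name, description)
        return {"accession": accession, "name": row[0], "description": row[1]}


# Stand-in for an online lookup; records how often it is called.
fetch_calls = []


def fake_fetch(accession):
    fetch_calls.append(accession)
    return ("Demo protein", "placeholder annotation for " + accession)


cache = AnnotationCache(fake_fetch)
first = cache.get("X00001")    # made-up accession, fetched and stored
second = cache.get("X00001")   # answered from the local table
```

Because the accession is the primary key, the same table can serve as the link when searching for one protein across several experiments.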

Conclusions

In the last couple of years, biology has evolved from a single question/single answer science into a high-throughput science. Today, huge amounts of data can be generated in a short time, and data handling and data analysis have become a significant problem. A typical problem in 2DE/MS identification is that the investigator can study how well over 2000 proteins are regulated in one single experiment. Of course, not all of these 2000 proteins are interesting, and not all should be subjected to further investigation, partly because of the high costs and the amount of manual work involved. The software system we present here is aimed at selecting a manageable number of proteins for further investigation by making the data clickable through a web interface with helpful graphics in the form of graphs and gel images. Presentation of relevant data for each identified protein will also help the investigator to address the biological question he or she is trying to answer.

The 2DDB database holds more than 100 000 data entries today, and the amount of data is steadily increasing. Our research team uses the database to investigate new hypotheses within human pulmonary diseases and to verify specific protein sequences that hold information on a unique protein sequence modification, which can be of significant importance when it can be linked to a biological activation process. Ongoing experiments with human pulmonary mesenchymal cells derived from central and distal lung biopsies are being processed. New clinical studies initiated by our research team are under way and are also planned to be included, and the information acquired from the Internet will be kept current through continuous updates. Today, the 2DDB retrieves information about annotated proteins from Internet databases such as PubMed and SwissProt. In the future, the intention is to extend the number of databases to include pathway databases, such as KEGG (13), and additional protein and gene databases, such as Pfam (14) and EMBL (15), which can be used to further study the biological pathways associated with the regulated proteins.

Acknowledgment. This work was supported by grants from the Swedish Medical Research Council (No. 11550), the Heart-Lung Foundation, the Vårdal Foundation, the Society for Medical Research, the J. A. Persson, G. & J. Kock and A. Österlund Foundations, Riksföreningen mot Reumatism, Gustaf V.s 80-årsfond, A.-G. Crafoord Foundation, the Thelma Zoéga Foundation, and the Medical Faculty, University of Lund.


References

(1) Financial Times 2001, 11, 1.
(2) Int. Econ. 2001, 24, 1.
(3) Study Foresees Proteomics Market Growing to $5.6B by 2006. GenomeWeb 2001, 1.
(4) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Direct analysis of protein complexes using mass spectrometry. Nat. Biotechnol. 1999, 17, 676-682.
(5) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19, 242-247.
(6) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17, 994-999.
(7) Cagney, G.; Uetz, P.; Fields, S. High-throughput screening for protein-protein interactions using two-hybrid assay. Methods Enzymol. 2000, 328, 3-14.
(8) Pandey, A.; Mann, M. Proteomics to study genes and genomes. Nature 2000, 405, 837-846.
(9) Mann, M.; Hendrickson, R. C.; Pandey, A. Analysis of Proteins and Proteomes by Mass Spectrometry. Annu. Rev. Biochem. 2001, 70, 437-473.
(10) Bratt, C.; Lindberg, C.; Marko-Varga, G. Restricted access chromatographic sample preparation of low mass proteins expressed in human fibroblast cells for proteomics analysis. J. Chromatogr. A 2001, 909, 279-288.
(11) Shevchenko, A.; Wilm, M.; Vorm, O.; Mann, M. Mass spectrometric sequencing of proteins from silver-stained polyacrylamide gels. Anal. Chem. 1996, 68, 850-858.
(12) Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000, 28, 45-48.
(13) Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999, 27, 29-34.
(14) Bateman, A.; Birney, E.; Durbin, R.; Eddy, S. R.; Howe, K. L.; Sonnhammer, E. L. The Pfam protein families database. Nucleic Acids Res. 2000, 28, 263-266.
(15) Baker, W.; van den Broek, A.; Camon, E.; Hingamp, P.; Sterk, P.; Stoesser, G.; Tuli, M. A. The EMBL nucleotide sequence database. Nucleic Acids Res. 2000, 28, 19-23.
