PLIPS, an Automatically Collected Database of Protein Lists Reported by Proteomics Studies

Alexey V. Antonov,*,† Sabine Dietmann,† Philip Wong,† Rodchenkov Igor,† and Hans W. Mewes†,‡

Helmholtz Zentrum München-German Research Center for Environmental Health (GmbH), Institute for Bioinformatics and System Biology, Ingolstädter Landstrasse 1, D-85764 Neuherberg, Germany, and Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, 85350 Freising, Germany

Received September 22, 2008

The spectrum of problems covered by proteomics studies ranges from the discovery of compartment-specific cell proteomes to clinical applications, including the identification of diagnostic markers and the monitoring of drug treatment effects. In most cases, the ultimate results of a proteomics study are lists of proteins found to be present (or differentially present) under the cell physiological conditions studied. Normally, the results are published directly in the article in one or several tables. As a result, this type of information remains scattered across hundreds of proteomics publications. We have developed a Web mining tool that collects this information by searching through full-text papers and automatically selecting tables that report a list of protein identifiers. By searching through major proteomics journals, we have collected approximately 800 independent, recently published studies, which report about 1000 different protein lists. On the basis of these data, we developed a computational tool, PLIPS (Protein Lists Identified in Proteomics Studies). PLIPS accepts as input a list of protein/gene identifiers. Using statistical analysis, PLIPS identifies recently published proteomics studies that report protein lists significantly intersecting with the query list. PLIPS is a freely available Web-based tool (http://mips.helmholtz-muenchen.de/proj/plips).

Keywords: proteomics databases • enrichment analysis • web tool for proteomics • PLIPS • analysis of proteomics data

* To whom correspondence should be addressed. E-mail: a.antonov@helmholtz-muenchen.de.
† GSF National Research Center for Environment and Health, Institute for Bioinformatics and System Biology.
‡ Technische Universität München.

Journal of Proteome Research 2009, 8, 1193–1197. Published on Web 02/13/2009. 10.1021/pr800804d CCC: $40.75 © 2009 American Chemical Society

1. Introduction

Proteomic approaches have enormous potential to provide new insights into the molecular mechanisms underlying cell biology and its structural organization.1 Indeed, proteomics provides the opportunity to study different functional cell states by identifying the expressed proteins. A goal of most proteomics studies is therefore to identify all the proteins contained within a sample.2-7 Thus, independent of the biological phenomena studied, the outcome of the majority of proteomics studies is a list (or several lists) of proteins. In the next step, the functional context of the identified proteins must be interpreted. A widely accepted strategy is to infer the biological processes that are most relevant to the analyzed proteins.8-14 The inference is based on prior knowledge. The Gene Ontology (GO) database or the Kyoto Encyclopedia of Genes and Genomes (KEGG) is commonly used as reference knowledge.15,16

Given the significant growth in the number of experimental studies that report protein lists related to different functional contexts, experimental researchers have an interest in finding recently published protein lists that significantly intersect with the set resulting from their own work. This information can help to clarify the links between different biological phenomena, as well as the functional role of the proteins involved. In most cases, the identified lists of proteins are reported in tabular format in the paper. Although publicly available, this type of information is scattered across hundreds of papers. Thus, in most cases, this valuable information is used only in the paper where it was published.

We have developed a Web mining tool that collects this type of information by searching through full-text papers and automatically selecting tables that report a list of protein identifiers. In most cases, such tables in proteomics papers provide lists of proteins found to be present or differentially present in a particular cell state or under various physiological conditions. By searching through major proteomics journals (Proteomics, Journal of Proteome Research, Molecular & Cellular Proteomics, Proteomics - Clinical Applications), we have collected approximately 800 independent studies published during the last seven years, which report about 1000 different protein lists.

There are several databases for capturing and disseminating proteomics data, some of which provide data analysis pipelines.17-23 For example, the PRIDE database has been developed to provide a standards-compliant repository for mass-spectrometry-based proteomics data comprising identifications of proteins, peptides, and post-translational modifications, together with the mass spectra that provide evidence for these identifications. The disadvantage of public repositories is that they require researchers to submit their data before or after publication. At present, the situation is far from ideal, and many studies that report valuable information are still not systematically covered by available database resources. The quality of data in standards-compliant public repositories is no doubt higher than that of the automatically collected data in our work (automatically collected protein lists may be incomplete; e.g., some protein identifiers may not be recognized). Additionally, public repositories can collect a lot of valuable supplemental information about the experimental setup, technical parameters, and so on. On the other hand, the main purpose of our tool is to provide experimental researchers with quick access to a catalog of previously published studies that report some of the same proteins in a different or similar functional context. For this application, a few missing proteins in a list are not highly relevant. The coverage of our data, in terms of the number of studies included (approximately 800), is reasonably high.

Relying on the collected data, we developed a computational tool, PLIPS (Protein Lists Identified in Proteomics Studies). PLIPS accepts as input a list of protein/gene identifiers. Using statistical analysis, PLIPS identifies recently published proteomics studies that reported protein lists significantly intersecting with the query list. PLIPS (http://mips.gsf.de/proj/plips) is a freely available Web-based tool.

Table 1. Types of Gene Identifiers Recognized by PLIPS and Data Sources Used for ID Mapping

type of IDs                                    file used
“Gene Symbol”, “Ensembl”, “LocusTag”           ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
“RefSeq Protein ID”, “RefSeq Transcript ID”    ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
“UniProt/Swiss-Prot”                           ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_refseq_uniprotkb_collab.gz
“UniGene”                                      ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2unigene
“Affymetrix probe codes”                       http://www.affymetrix.com/ (annotation files)

2. Materials and Methods

We collected all papers published during the last 7 years in the major proteomics journals: Proteomics, Journal of Proteome Research, Molecular & Cellular Proteomics, and Proteomics - Clinical Applications. From each paper, the tables were extracted and searched for protein/gene identifiers. Only tables that contained more than 10 unique protein/gene identifiers of the same type (“UniProt/Swiss-Prot”, “Gene Symbol”,24 “Ensembl”,25 “RefSeq Protein ID”, “RefSeq Transcript ID”26) were selected. The collected protein/gene lists were systematically organized. To allow comparative analyses of the collected data, we mapped all identified protein lists to NCBI “Entrez Gene”24 identifiers. For mapping purposes, files from the NCBI and Affymetrix Web sites were used. Detailed information on the data sources used by PLIPS is given in Table 1. We would like to point out that protein and gene identifiers can be highly ambiguous, with multiple synonymous variants.10 For this reason, the quality of the retrieved data in PLIPS may differ depending on the type of identifier reported in the original publication.

The user can query a list of protein/gene identifiers. An automatic enrichment analysis is implemented to identify recently published papers that reported protein/gene lists significantly intersecting with the query list. For each protein/gene list f in the database, the number of genes l_f common to the query list and f is counted. In the next step, the null hypothesis H0 (genes from the query list and list f are independent) is tested. The hypergeometric test (adjusted for multiple testing by a Monte Carlo simulation procedure) is employed to assess the significance of the intersection between the query protein/gene list and each protein/gene list previously reported by experimental proteomics studies. The estimated p-value corresponds exactly to the definition of an experiment-wise Westfall and Young p-value.11,27,28
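The intersection test described in this section can be illustrated with a minimal Python sketch. It is not the PLIPS implementation: the function names, the choice of background gene universe, the number of simulations, and the single-step "minimum p-value" form of the Monte Carlo adjustment are all assumptions made for illustration, since the text specifies only that a hypergeometric test is adjusted by simulation to give an experiment-wise Westfall and Young p-value.

```python
import random
from math import comb


def hypergeom_tail(universe_size, list_size, query_size, overlap):
    """Return P(X >= overlap) when query_size genes are drawn without
    replacement from universe_size genes, list_size of which are in list f."""
    total = comb(universe_size, query_size)
    upper = min(list_size, query_size)
    favourable = sum(
        comb(list_size, i) * comb(universe_size - list_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


def intersect_with_database(query, database, universe, n_sim=1000, seed=0):
    """Return {list name: (raw p-value, adjusted p-value)}.

    Assumptions (hypothetical, not taken from the paper):
    - `universe` is the background set of gene identifiers;
    - `database` is a non-empty dict mapping a list name to its genes;
    - the adjustment compares each observed p-value with the minimum
      p-value obtained for random query lists of the same size.
    """
    rng = random.Random(seed)
    background = sorted(set(universe))
    background_set = set(background)
    query = set(query) & background_set
    lists = {name: set(genes) & background_set for name, genes in database.items()}

    def raw_p_values(candidate):
        # One hypergeometric tail probability per database list f,
        # based on the overlap l_f between the candidate list and f.
        return {
            name: hypergeom_tail(
                len(background), len(genes), len(candidate), len(candidate & genes)
            )
            for name, genes in lists.items()
        }

    observed = raw_p_values(query)

    # Null distribution of the smallest p-value over all database lists.
    min_p_null = [
        min(raw_p_values(set(rng.sample(background, len(query)))).values())
        for _ in range(n_sim)
    ]

    adjusted = {}
    for name, p in observed.items():
        hits = sum(1 for m in min_p_null if m <= p)
        adjusted[name] = (hits + 1) / (n_sim + 1)  # add-one Monte Carlo estimate
    return {name: (observed[name], adjusted[name]) for name in lists}
```

Calling `intersect_with_database(query, database, universe)` with a query list, a dictionary of published lists, and a background universe returns a raw and an adjusted p-value for each published list. Because the smallest attainable adjusted p-value is 1/(n_sim + 1), the number of simulations bounds the resolution of the estimate.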

3. Results

PLIPS (http://mips.helmholtz-muenchen.de/proj/plips) is a freely available Web-based collection of automatically retrieved, previously published protein/gene lists. PLIPS has an easy-to-use interface. The user can browse through recently published protein/gene lists and see the connections between different studies that reported significantly similar protein lists. The user can also submit a list of protein/gene identifiers to find statistically significant links to previously published studies. As input, PLIPS accepts several types of protein (gene) identifiers. PLIPS supports most protein and gene identifiers, such as “Entrez Gene”,29 “UniProt/Swiss-Prot”, “Gene Symbol”,24,29 “UniGene”,29 “Ensembl”,25 “RefSeq Protein ID”, “RefSeq Transcript ID”,26 and “Affymetrix probe codes”.30 As output, PLIPS provides a catalog of previously published studies that report protein/gene lists significantly intersecting with the query protein/gene list. Each protein list in PLIPS was compared with all other identified lists to find commonly shared proteins. Those list pairs that have significant intersection with p-value