Compid: A New Software Tool To Integrate and Compare MS/MS Based Protein Identification Results from Mascot and Paragon

Niina Lietzén,†,# Lari Natri,‡,# Olli S. Nevalainen,‡ Jussi Salmi,‡ and Tuula A. Nyman*,†

Protein Chemistry Research Group, Institute of Biotechnology, University of Helsinki, Finland, and Department of Information Technology and TUCS, University of Turku, Finland

Received August 12, 2010

Abstract: Tandem mass spectrometry-based proteomics experiments produce large amounts of raw data, and different database search engines are needed to reliably identify all the proteins from these data. Here, we present Compid, an easy-to-use software tool that can be used to integrate and compare protein identification results from two search engines, Mascot and Paragon. Additionally, Compid enables extraction of information from large Mascot result files that cannot be opened via the Web interface and calculation of general statistical information about peptide and protein identifications in a data set. To demonstrate the usefulness of this tool, we used Compid to compare Mascot and Paragon database search results for a mitochondrial proteome sample of human keratinocytes. The reports generated by Compid can be exported and opened as Excel documents or as text files using configurable delimiters, allowing the analysis and further processing of Compid output with a multitude of programs. Compid is freely available and can be downloaded from http://users.utu.fi/lanatr/compid. It is released under an open source license (GPL), enabling modification of the source code. Its modular architecture allows for creation of supplementary software components, e.g., to enable support for additional input formats and report categories.

Keywords: database searching • Mascot • Paragon • protein identification • tandem mass spectrometry

* To whom correspondence should be addressed. Protein Chemistry Research Group, Institute of Biotechnology, University of Helsinki, P.O. Box 65 (Viikinkaari 1), FI-00014, Finland. E-mail: [email protected]. Phone: +358-9-191-59411. Fax: +358-9-191-59930.
† University of Helsinki.
‡ University of Turku.
# These authors contributed equally to this work.
10.1021/pr100824w © 2010 American Chemical Society

Introduction

Tandem mass spectrometry-based protein identification has a key role in modern proteomic studies. Protein identification from complex mixtures using LC-MS/MS produces large amounts of raw data, and efficient data processing and results validation form a critical part of the studies.1 Peptides and proteins are usually identified using different database search engines. First, the program matches the acquired fragment ion spectra to the theoretical fragment ion spectra created for each theoretical peptide from the protein sequence database. Peptides are identified based on the quality of these peptide-spectrum matches. Then, the identified peptides are matched to protein sequences and protein identifications are inferred based on the matches. Protein groups are formed from individual proteins in such a way that the members of a protein group share a significant amount of the same MS/MS data. One of the challenges in protein identification is that many peptides can be matched to several different proteins because of homologous proteins and redundant entries in the sequence database.2 Also, the quality of raw MS/MS data and peptide-spectrum matches is important, as false peptide identifications can affect protein inference and result in incorrect protein identifications. Especially with complex samples, different database search algorithms can identify partially different proteins because of the differences in peptide-spectrum matching and protein grouping algorithms.

There are several database search engines available for protein identification from LC-MS/MS experiments.3 However, the end user often faces a limited choice of software that accepts raw data from the particular mass spectrometry instrument used for the experiments. Mascot (Matrix Science) is a commercial database search engine, which accepts raw data from several different mass spectrometers and can thus be used with a wide variety of instruments.4 The public version of Mascot can be used with small data sets, but when processing large amounts of MS/MS spectra, the in-house version of Mascot is needed. The Paragon algorithm used in ProteinPilot (Applied Biosystems) is a database search algorithm designed to process tandem mass spectrometry data from their instruments.5 In addition to Paragon, Mascot can also be used through the ProteinPilot interface.
The use of more than one database search engine has proven beneficial when analyzing complex protein mixtures, as different search engines produce both confirmatory and complementary information from the same raw data.6-8 Manual comparison and analysis of large database search results is, however, laborious and time-consuming, and different bioinformatics tools have been developed to make these analyses easier.9,10 However, most of these tools can utilize data from only a few database search engines, and currently there are no freely available tools that could be used to combine protein identification results from Paragon with results from other search engines. Here, we present Compid, a fast and easy-to-use tool to integrate and compare protein identification results from Mascot and Paragon.

Journal of Proteome Research 2010, 9, 6795–6800. Published on Web 10/25/2010

technical notes

Materials and Methods

Compid. Compid is a Java application that facilitates the interpretation of peptide and protein identifications produced by the Mascot and Paragon database search engines. The application includes various reporting tools for summarizing database search results. These aid the user in the analysis of the peptide and protein identification results. In particular, one can easily compare the output of Mascot and Paragon at the peptide and protein levels. Compid can be run on different platforms: Linux, Solaris, and Windows. It is freely available for download at http://users.utu.fi/lanatr/compid.

Compid analyses consist of two consecutive phases: importing data sets into Compid’s internal database, and generating reports from the imported data. The supported formats for input data sets include XML files and Excel workbooks exported from Paragon, and DAT files from Mascot. To enable inputting of Mascot data sets, one has to download the freely available Mascot Parser software library from Matrix Science. For all the input file formats, the imported data include the titles for each MS/MS scan with the identified peptide sequences and the matches of the peptides to proteins along with the associated scores. Compid relies on the original peptide identifications made by Mascot and Paragon, and does not calculate any new probabilities regarding peptide identifications. The grouping of similar proteins performed by the original software is included. The Excel files from ProteinPilot contain only the primary protein identifications for each group, but for other formats, all proteins from each group are imported. Sequences from a FASTA-formatted protein sequence database can also be imported to Compid and added to the imported protein identifications; they can then be used for sequence-based protein comparisons.
Compid can create two types of reports from the data sets, namely, a peptide report and a protein group report. These list the peptides and proteins, respectively, identified in the data sets. The reports can be created for a single data set, or for a combination of several data sets (a union), which is treated as if it were a single data set. Also, two data sets or data set unions can be compared against each other, resulting in a three-part report: elements (i.e., peptides or proteins) common to both data sets and elements found only in one of the data sets. Here, a Venn diagram of common and unique elements is also created. Descriptive statistical information can be shown about the numbers of peptide and protein identifications in a data set and the queries (MS/MS spectra) resulting in these hits.

Compid allows querying its internal database manually using the SQL query language. Additionally, it enables extraction of information from large Mascot result files that cannot be opened via the Web interface. Compid also includes a utility which converts a sequence database in FASTA format to a decoy database, where the amino acid sequences are either randomized or reversed. The reports generated by Compid can be exported and opened as Excel documents or as text files using configurable delimiters, enabling further analysis with other programs.

Journal of Proteome Research • Vol. 9, No. 12, 2010

In Compid, protein identifications are grouped based on their accession numbers, the peptides associated with them, or their full sequences, allowing for a specified number of amino acid alterations. The Mascot and Paragon programs provide their own protein groups: each group contains the proteins that are possible identifications based on mainly the same set of MS/MS data. The user can use this original grouping, or allow Compid to create its own grouping based on the individual proteins identified. In the latter case, each protein initially forms its own group with only one member. Next, each group is compared to every other group and a similarity score of the group pair is calculated by comparing each protein pair between the groups. The final score of the group pair is the similarity of the most similar protein pair (single linkage) or the sum of similarity scores of all the protein pairs (complete linkage). The group pair with the highest score is merged through a union operation and the iteration is continued until no group pair has a similarity score that exceeds a user-given threshold.

There are three different ways of calculating the protein-wise similarity score. First, the sequence-based alternative uses the Needleman-Wunsch algorithm for performing a linear alignment of the protein sequences. Here, the user can set the maximum number of allowed amino acid modifications. The score is increased by one if the amino acids in the sequences match. Otherwise, if they do not match or if there is a gap, the score is reduced by one. Second, the peptides from the proteins can be matched. Here, the peptides either have to match exactly (Strict Equality) or ambiguous amino acid matches (X → anything, B → D or N, Z → E or Q) can be allowed. The user can specify a similarity threshold by setting a lower limit on the number of common peptides in the compared proteins (Common Peptides), or by requiring that the full set of peptides matched in one of the proteins is a subset of the peptides in the other protein (Peptide Subset). The third and simplest alternative is to compare the accession numbers of the proteins (Accession only). As a result of the above grouping process, Compid returns a set of protein groups, where similar proteins are in the same group.

Proteome Samples.
Mitochondria from human HaCaT keratinocytes (approximately 1 × 10^7 cells) were isolated with the Qproteome Mitochondria Isolation Kit (Qiagen). Proteins from this cell fraction were separated by SDS-PAGE. Afterward, the gel was Coomassie stained; the lane was cut into 20 slices; proteins were reduced, alkylated, and in-gel digested with trypsin; and the resulting peptides were analyzed by nanoLC-MS/MS.11-13 The LC-MS/MS analysis was done using an Ultimate 3000 nanoLC (Dionex) and a QSTAR Elite hybrid quadrupole TOF-MS (Applied Biosystems/MDS Sciex) with nano-ESI ionization. The peptide samples were first loaded on a ProteCol C18 trap column (10 mm × 150 µm, 3 µm, 120 Å) (SGE), followed by peptide separation on a PepMap100 C18 analytical column (15 cm × 75 µm, 5 µm, 100 Å) (LC Packings/Dionex) at 200 nL/min using a gradient of 0-40% acetonitrile (ACN) in 100 min. MS data were acquired using Analyst QS 2.0 software. The LC-MS/MS data were searched with in-house Mascot version 2.2 through the ProteinPilot 3.0 interface and with the ProteinPilot Paragon algorithm against human sequences in the Swiss-Prot database (release 57.4, 20 331 sequences) and in the NCBI database (version 20080129, 180 801 sequences). Similar search criteria were used in both Mascot and Paragon database searches (Table 1A). In the database searches, raw data from all 20 gel fractions were processed together. False discovery rates (FDRs) for the identifications were calculated using the target-decoy strategy with concatenated normal and reversed sequence databases,14 and they varied between 2% and 3%.
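The decoy database conversion and the target-decoy FDR estimate mentioned above can be illustrated with a minimal Python sketch. It is not Compid's code: the function names `make_decoy` and `estimate_fdr` are hypothetical, shuffling is assumed as the meaning of "randomized", and the estimate 2 × (decoy hits) / (all hits) is one commonly used choice for concatenated target-decoy searches; the text does not state which formula was applied. The counts in the usage line are invented for illustration.

```python
import random


def make_decoy(sequence, mode="reverse", rng=None):
    """Return a decoy version of one protein sequence (hypothetical helper)."""
    if mode == "reverse":
        return sequence[::-1]
    if mode == "random":
        # Assumption: "randomized" is taken to mean shuffling the residues,
        # which preserves the amino acid composition of the protein.
        residues = list(sequence)
        (rng or random).shuffle(residues)
        return "".join(residues)
    raise ValueError("mode must be 'reverse' or 'random'")


def estimate_fdr(n_target, n_decoy):
    """Estimate the FDR of a concatenated target-decoy search.

    Assumption: every decoy hit implies roughly one false target hit,
    so the number of false positives is about 2 * n_decoy.
    """
    total = n_target + n_decoy
    return 2.0 * n_decoy / total if total else 0.0


# Usage with made-up counts: 985 target hits and 15 decoy hits.
print(make_decoy("PEPTIDE"))
print(estimate_fdr(n_target=985, n_decoy=15))
```

Reversal keeps the decoy database deterministic, whereas shuffling needs a fixed random seed if the decoy database has to be reproducible.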

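The protein grouping procedure described under Compid (start from singleton groups, repeatedly merge the most similar pair, stop at a threshold) can be sketched as follows. This is a hedged illustration rather than the Compid implementation: all function names are hypothetical, only the peptide-based similarity with the Common Peptides criterion and single linkage is shown, and a pair is merged only while its score strictly exceeds the threshold, following the wording of the text.

```python
from itertools import combinations

# Ambiguity codes from the text: X matches anything, B is D or N, Z is E or Q.
AMBIGUOUS = {"B": "DN", "Z": "EQ"}


def residues_match(a, b):
    """True if two residues are equal or compatible through an ambiguity code."""
    if a == b or "X" in (a, b):
        return True
    return b in AMBIGUOUS.get(a, "") or a in AMBIGUOUS.get(b, "")


def peptides_match(p, q, strict=True):
    """Strict Equality, or residue-wise matching with ambiguity codes allowed."""
    if strict:
        return p == q
    return len(p) == len(q) and all(residues_match(a, b) for a, b in zip(p, q))


def common_peptides(peps_a, peps_b, strict=True):
    """Number of peptides of one protein that have a match in the other."""
    return sum(any(peptides_match(p, q, strict) for q in peps_b) for p in peps_a)


def group_similarity(g1, g2, proteins, strict=True):
    """Single linkage: the score of the most similar protein pair."""
    return max(common_peptides(proteins[a], proteins[b], strict)
               for a in g1 for b in g2)


def group_proteins(proteins, threshold, strict=True):
    """Merge the best-scoring group pair until no score exceeds the threshold."""
    groups = [frozenset([acc]) for acc in sorted(proteins)]
    while len(groups) > 1:
        score, g1, g2 = max(
            ((group_similarity(a, b, proteins, strict), a, b)
             for a, b in combinations(groups, 2)),
            key=lambda item: item[0],
        )
        if score <= threshold:
            break
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups


# Usage with toy data: P1 and P2 share two peptides, P3 shares none.
toy = {"P1": {"AAAK", "CCCK"}, "P2": {"AAAK", "CCCK", "DDDK"}, "P3": {"EEEK"}}
print(group_proteins(toy, threshold=1))
```

Recomputing every pairwise score after each merge is quadratic per iteration; it keeps the sketch short, but a real tool handling thousands of proteins would likely cache the scores.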

Compid: Software Tool for Protein Identification Results

Table 1. Mascot and Paragon Database Search Criteria and Results

(A) Database Search Criteria

search criterion | Mascot | Paragon
taxonomy | human | human
enzyme | trypsin | trypsin
modifications | fixed mod.: carbamidomethyl modification of cysteine; variable mod.: methionine oxidation | cysteine alkylation with iodoacetamide
precursor and fragment ion mass tolerance | 0.2 Da | not defined
peptide charge state | +1, +2, +3 | not defined
identification threshold | p < 0.05; “bold red” | 95% confidence (Unused ProtScore >1.3)
other settings | 1 missed cleavage allowed | gel-based ID, rapid search

(B) Database Search Results: Numbers of Protein Identifications

sequence database | Mascot | Paragon
Swiss-Prot | 1414 | 1213
NCBI | 1463 | 1203
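For the sequence-based grouping option described in Materials and Methods, the stated scoring rule (+1 for a matching residue, −1 for a mismatch or a gap) corresponds to a global alignment score of the Needleman-Wunsch type. The following Python sketch computes that score under exactly those values; the function name `nw_score` is hypothetical, and the user-set limit on amino acid modifications is not modeled because the text does not say how it is applied.

```python
def nw_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global (Needleman-Wunsch) alignment score of two sequences.

    Only the score is computed, not the alignment itself, so two rows of
    the dynamic programming matrix are enough.
    """
    # Row 0: aligning the empty prefix of seq_a against prefixes of seq_b.
    prev = [j * gap for j in range(len(seq_b) + 1)]
    for i, res_a in enumerate(seq_a, start=1):
        curr = [i * gap]  # column 0: prefix of seq_a against the empty prefix
        for j, res_b in enumerate(seq_b, start=1):
            diagonal = prev[j - 1] + (match if res_a == res_b else mismatch)
            gap_in_b = prev[j] + gap      # res_a aligned to a gap
            gap_in_a = curr[j - 1] + gap  # res_b aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


# Usage: identical sequences score their length; one deletion costs one gap.
print(nw_score("ACDE", "ACDE"))
print(nw_score("ACDE", "ACE"))
```

Because mismatches and gaps carry the same penalty here, the score of two sequences equals the number of matched positions minus the number of unmatched or gapped positions in the best alignment.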

Results and Discussion

In the present report, we introduce Compid, a new software tool which can be used to integrate and compare peptide and protein identification results from two database search engines, Mascot (Matrix Science) and Paragon (Applied Biosystems) (Figure 1). Compid can create two types of reports from the data sets, a peptide report and a protein group report, which list the peptides and proteins, respectively, identified in the data sets (Figure 2). The peptide report contains a list of peptides in the given data sets, either including only the peptides assigned to proteins, or all of them. For each peptide, the associated queries and scores, as reported by Mascot (ms) or Paragon (pg), are presented. Correspondingly, the protein group report lists the identified proteins. For each protein, the number of matching peptide sequences as well as the number of distinct MS/MS spectra and their spectral numbers are displayed, along with the protein identification scores provided by the database search engines. Descriptive statistical information is shown about the peptide and protein identifications in a data set and the queries (MS/MS spectra) resulting in these hits.

Figure 1. Compid user interface. Starting window of Compid is shown in the background. The right panel shows the protein groups report parameter window. A Venn diagram created for a comparison of two data sets is shown in the upper left panel and the job status window in the lower left panel.

Figure 2. Compid analysis and contents of peptide and protein group reports. Compid can be used to create three different types of reports based on Mascot (ms) and/or Paragon (pg) identification data. Additionally, manual SQL queries can be used to retrieve information from Compid’s internal database. Both the peptide report and the protein group report consist of four data tables: a summary table including numbers of common and unique identifications, and more detailed tables including information on the peptides/protein groups identified from both data sets or only one of them. In peptide identification tables, peptide sequences, numbers of queries, and Mascot/Paragon scores corresponding to each unique peptide are reported, as well as the average score and the score and identifier of the best peptide-spectrum match for each peptide. In protein group identification tables, the numbers of original Mascot/Paragon protein groups, the number of proteins in each group, protein accession numbers, Mascot/Paragon identification scores, numbers of unique peptide identifications for each protein group, as well as the queries matching the corresponding protein group together with the peptide identification scores are reported.

To demonstrate the usefulness of Compid, we have used it to extract information from large Mascot result files and to compare the identification results obtained with the Mascot and Paragon database search engines for a mitochondrial proteome sample of human keratinocytes. Protein identifications with Mascot and Paragon were done against two different protein sequence databases, Swiss-Prot and NCBI, to study how the size and redundancy of the database affect the identification results. On the basis of these database searches, we were able to identify from our sample more than 1400 proteins using Mascot and more than 1200 proteins using Paragon (Table 1B).
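The comparison of two data sets into common and unique identifications, which underlies the three-part reports and the Venn diagram, amounts to plain set operations. A minimal Python sketch, assuming identifications are represented by hashable keys such as peptide sequences or accession numbers; the function names and the toy peptides are hypothetical, and taking the union of both data sets as the denominator for the shared fraction is an assumption.

```python
def compare_identifications(ids_a, ids_b):
    """Split two collections of identifications into the three report parts."""
    a, b = set(ids_a), set(ids_b)
    return {"common": a & b, "only_a": a - b, "only_b": b - a}


def summarize(report):
    """Counts for a Venn diagram plus the fraction of shared identifications."""
    counts = {part: len(items) for part, items in report.items()}
    total = sum(counts.values())  # size of the union of both data sets
    counts["common_fraction"] = counts["common"] / total if total else 0.0
    return counts


# Usage with invented peptide sequences standing in for two search results.
mascot_peptides = {"AAAK", "CCCK", "DDDK"}
paragon_peptides = {"CCCK", "DDDK", "EEEK", "FFFK"}
report = compare_identifications(mascot_peptides, paragon_peptides)
print(summarize(report))
```

A data set union, as described for Compid, would correspond to merging several such sets before calling the comparison.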


Figure 3. Peptide- and protein-level comparisons of Mascot and Paragon database search results show that there are differences between the two database search algorithms. (A and B) Numbers of all peptide identifications vs low-confidence peptide-spectrum matches from the mitochondrial proteome of HaCaT keratinocytes in Mascot (ms) and Paragon (pg) database searches against (A) Swiss-Prot and (B) NCBI protein sequence databases. Assigned peptides: peptides used in protein identifications. (C and D) Protein-level comparison of the identification results in Mascot and Paragon database searches against (C) Swiss-Prot and (D) NCBI databases. Protein-level comparisons were done using accession-based, peptide subset-based, and protein sequence-based comparison methods.

First, Compid was used to create a peptide-level comparison of Mascot and Paragon identification results for the mitochondrial proteome sample of HaCaT keratinocytes (Supplementary Table 1). Here, the number of peptide identifications remained approximately the same regardless of the sequence database used (Mascot, 35 189 and 35 006 peptides and Paragon, 31 141 and 35 381 peptides against Swiss-Prot and NCBI, respectively). However, only approximately 20% of the peptide identifications in both comparisons were common to both database search algorithms. In general, a large proportion of the MS/MS spectra produced in a tandem mass spectrometry experiment does not contain enough fragment ion information for high-confidence peptide identifications. Additionally, some high-quality MS/MS spectra remain poorly identified because of, for example, constrained database search parameters, inaccurate charge state or precursor ion m/z, or gaps in the protein sequence database used in the searches.15 This can also be seen in our results, as over 70% of the peptide identifications unique to either Mascot or Paragon were low-confidence identifications having a Mascot ion score