Coverage and Consistency: Bioinformatics Aspects ... - ACS Publications

Sep 10, 2013 - Tim Keighley,. † and Mark P. Molloy*. ,†. †. Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Austr...
0 downloads 0 Views 3MB Size
Article pubs.acs.org/jpr

Coverage and Consistency: Bioinformatics Aspects of the Analysis of Multirun iTRAQ Experiments with Wheat Leaves Dana Pascovici,† Donald M. Gardiner,‡ Xiaomin Song,† Edmond Breen,† Peter S. Solomon,§ Tim Keighley,† and Mark P. Molloy*,† †

Australian Proteome Analysis Facility, Macquarie University, Sydney, NSW 2109, Australia CSIRO Plant Industry, Queensland Bioscience Precinct, 306 Carmody Road, Brisbane, QLD 4067, Australia § Plant Sciences Division, Research School of Biology, The Australian National University, Canberra, ACT 0200, Australia ‡

S Supporting Information *

ABSTRACT: The hexaploid genome of bread wheat (Triticum aestivum) is large (17 Gb) and repetitive, and this has delayed full sequencing and annotation of the genome, which is a prerequisite for effective quantitative proteomics analysis. Aware of these constraints we investigated the most effective approaches for shotgun proteomic analyses of bread wheat that would support large-scale quantitative comparisons using iTRAQ reagents. We used a data set that was generated by two-dimensional LC−MS of iTRAQ labeled peptides from wheat leaves. The main items considered in this study were the choice of sequence database for matching LC−MS data, the consistency of identification when multiple LC−MS runs were acquired, and the options for downstream functional analysis to generate useful insight. For peptide identification we examined the extensive NCBInr plant database, a smaller composite cereals database, the Brachypodium distachyon model plant genome, the EST-based SuperWheat database, as well as the genome sequence from the recently sequenced D-genome progenitor Aegilops tauschii. While the most spectra were assigned by using the SuperWheat database, this extremely large database could not be readily manipulated for the robust protein grouping that is required for large-scale, multirun quantitative experiments. We demonstrated a pragmatic alternative of using the composite cereals database for peptide spectra matching. The stochastic aspect of protein grouping across LC−MS runs was investigated using the smaller composite cereals database where we found that attaching the Brachypodium best BLAST hit reduced this problem. Further, assigning quantitation to the best Brachypodium locus yielded promising results enabling integration with existing downstream data mining and functional analysis tools. Our study demonstrated viable approaches for quantitative proteomics analysis of bread wheat samples and shows how these approaches could be similarly adopted for analysis of other organisms with unsequenced or incompletely sequenced genomes. KEYWORDS: wheat, quantitative, mass spectrometry, bioinformatics, peptide identification



INTRODUCTION Despite its economic importance and efforts toward sequencing the bread wheat (Triticum aestivum) genome carried out by the International Wheat Gene Sequencing Consortium (http:// www.wheatgenome.org/), the genome sequence of bread wheat will not be available in the immediate future. The large 17 Gb genome size, the high percentage of repetitive sequences (>80%), and the hexaploid nature of wheat1 make the genome assembly and gene prediction a challenging task. A variety of other plant genome projects are in various stages of completion,2 with finished genome sequences available for Arabidopsis and a number of other dicotyledonous species and high quality genome sequences available for monocot species such as rice, sorghum, maize, and Brachypodium distachyon.3 The absence of a complete annotated wheat genome sequence also means no comprehensive wheat protein database is available, which in turn makes proteomics projects on wheat difficult to carry out. Wheat proteomics, by both gel-based and shotgun mass spectrometry, relies on the ability to match © 2013 American Chemical Society

peptide tandem mass spectra to sequences from protein databases to infer and quantitate proteins. An incomplete protein database will result in a large proportion of unmatched spectra, while alternatively, a large database composed of known proteins from related species may result in ambiguous matches with low scores due to poor homology. To date to the best of our knowledge only one classical shotgun proteomics paper has been reported on wheat,4 though the wheat proteomics literature is supported by other gel-based experiments. In regard to the challenges associated with LC−MS shotgun proteomic experiments, Ford et al. described an iTRAQ study (for a recent review on iTRAQ, see ref 5) using an EST-derived database for protein identification and reported matching only 10% of the tandem MS spectra. They also discussed the difficulties of determining peptide uniqueness for Special Issue: Agricultural and Environmental Proteomics Received: June 5, 2013 Published: September 10, 2013 4870

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

or stability of the identification across successive MS runs. There is inherent uncertainty in inferring the presence of a particular protein based on spectra matching to peptides due to the protein inference problem:15 peptides will be shared among different proteins, and the problem will be compounded in the presence of a highly redundant database. Protein identification and quantitation software such as ProteinPilot16 and Mascot17 attempt to address the protein inference problems by placing proteins in groups, which informally speaking are sets of proteins identified by roughly the same set of spectra, from which a single or shared winners are reported. There is a stochastic element in the selection of the winner protein in a group especially with a large redundant database: one protein may be selected as a winner in one run, and a closely related one may be selected the winner in subsequent runs. Unfortunately, this presents problems when aligning proteins for the downstream quantitative analysis. This ambiguity or stochastic element is captured in ref 18, where it is shown that considering winner proteins only may yield a poorer reproducibility rate or overlap over several runs. Thus, for a large multirun experiment, in addition to the number of proteins identified, it is necessary to consider the ease of correct quantitation alignment across runs. This stands in contrast to gel-based experiments, where gel alignment is typically done prior to protein spot identification. The aim of this work was to evaluate the best pipeline for quantitation and analysis of multirun iTRAQ proteomic experiments using wheat leaves. To determine the best suited database for assigning bread wheat tandem mass spectra to proteins and for downstream functional analysis, we compared several options: the extensive NCBInr plant database, a smaller composite cereals database assembled in-house, the Brachypodium distachyon model plant genome, and peptides predicted from the newly released wheat D-genome progenitor Aegilops tauschii. We considered simple metrics such as percentage of spectra matched, numbers of proteins identified, and the reliability of identification (FDR). This analysis showed that the in-house assembled composite cereals database and the Dgenome progenitor were favored. We also examined use of the SuperWheat database for peptide assignment but found the use of this large database to be unwieldy for robust cross-run iTRAQ quantitation assignments. We evaluated the option of searching with the composite database, but projecting the quantitation to the Brachypodium best BLAST hit as a means of readily integrating with existing downstream functional analysis tools such as AgriGO.19 We used the Brachypodium BLAST mapping to assign biological process and molecular function categories to the protein identifications and thus to characterize functionally the identified proteins separately for each iTRAQ run. We found that the biological function categories highly overabundant among the proteins identified in each iTRAQ run as compared to the background Brachypodium genome are consistently those containing proteins with high copy numbers per cell such as metabolic categories and translation. Thus downstream functional enrichment analyses of small proteomics subsets is best performed using a more conservative proteomics-specific background such as the set of proteins identified in a wheat proteomics experiment. The findings from these experiments could be useful to shotgun proteomic analyses for other organisms with unsequenced or incompletely sequenced genomes.

subsequent quantitation but did not investigate solutions to the problem. This report helps to underscore the difficulty associated with large-scale iTRAQ quantitation and highlights the necessity to improve shotgun proteomic based analyses of wheat. In order to proceed with a wheat shotgun proteomics experiment, options include using a partially complete genome database containing existing wheat proteins or peptides, ESTdatabases that reflect expressed proteins, a related plant species with complete genome established, or a composite plant database containing multiple complete genomes. Searching the wheat spectra against existing wheat proteins or peptides would be ideal if a sufficient number of proteins could be reliably identified, and this will no doubt become the best option as the wheat database matures. Various wheat EST resources are available, notably the extensive (>2 million sequences) SuperWheat database,6,7 which has yielded useful results for protein identification in gel-based experiments, though by their size and redundancy present challenges for consistent quantitation across multiple iTRAQ LC−MS runs. Another option is searching against a protein database from a related model species that has been well annotated; several such examples are reviewed in ref 8 and show that the choice of species specific versus database from a related model depends heavily on the specific cases, as well as the phylogenetic distance between the plant investigated and the available models. For instance, in the cultivated tobacco (Nicotiana tabacum) experiment described in ref 9 a database assembled from more distant species including Arabidopsis yielded identifications for all of the random set of tobacco proteins selected for evaluation, while the much smaller N. tabacum database yielded only 11% of the identifications. On the other hand in the case of the experiment on pea chloroplast (Pisum sativum) of10 the nonspecies specific database created from Arabidopsis and Medicago truncatula was found to be less adequate than the species specific pea database to which it was compared. In the case of wheat the available Brachypodium distachyon presents itself as the most appropriate fully sequenced candidate model species for homology matching due to the high quality genome assembly and associated gene annotations. The survey sequence of wheat11 and the two progenitor species Triticum urartu12 and Aegilops tauschii13 of the A and D genomes, respectively, have only recently been made available and currently lack tools for downstream functional analysis of detected proteins. Even if a complete wheat genome were available, the homeologous versions of proteins from each of the three genomes would make shotgun proteomic analyses complicated, and a haploid representation of the wheat genome would need to be generated. Alternative options are either large comprehensive databases such as NCBI nonredundant (nr) plant or smaller composite databases that can be assembled by amalgamating sequences from various related plant genome sequences. The choice of database significantly impacts the results as the proteins identified will always be a function of the proteins present in the matching database. This aspect was made manifest in the experiments of ref 14, where it was shown that, when aiming to identify plant pathogens via LC−MS/MS, adding database sequences in an effort to gain greater coverage led to more cross-species matches and misidentifications. A large comprehensive database is not a priori the best option. While obtaining more protein identifications is always desirable, the other facet of reliable assignments is the certainty 4871

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research



Article

METHODS

database was downloaded in March 2013 and contained 533,716 entries. The combined cereals database was assembled by combining available wheat identifiers from Uniprot and NCBI, Brachypodium identifiers (downloaded from http:// www.brachypodium.org/), Sorghum, Maize and Rice identifiers as well as Uniprot and TrEMBL sources, for a total of 299,040 proteins. The Brachypodium database contained the subset of all Brachypodium identifiers from the combined database, for a total of 35,455 entries, thus the Brachypodium database was completely included in the combined cereals database. The wheat D genome progenitor peptide sequences were downloaded from GigaDB; the file used contains 43,150 sequences (wheatD_final_43150.gff.pep.Corrected) and is at present available from the ftp link at http://gigadb.org/data set/ 100054. The SuperWheat database (Version #100211) contains 2,094,746 entries and was obtained by contacting the authors.

Growth and Experimental Conditions

Wheat cultivar BG220 (a recombinant inbred line resulting from crossing BR34 and Grandin) was used for all experiments and was maintained in a controlled environment facility with day/night cycles of 16/8 h at 20/15 °C with 60/90% relative humidity. Ten plants per 10 cm pot were grown in potting mix for 10 days. The first leaves were infiltrated in a central zone on the adaxial side using a 1 mL needleless syringe with either water or 200 ppm deoxynivalenol (Sigma) or were untreated. Each leaf received approximately 200 μL of solution and the infiltration zone was marked with permanent marker. Regions that received the treatment were harvested at 4, 12, 24, and 48 h following treatment with either water or deoxynivalenol. The untreated plants were harvested at 4 h only. Sampled leaves were placed into 15 mL tubes and immediately frozen in liquid nitrogen and stored at −80 °C until protein extraction

Database Search and Protein Identification

The experimental nanoLC ESI MS/MS data were submitted to ProteinPilot V4.2 (AB Sciex) for data processing using each of the above four specially built databases. The Paragon method was used for database searches.16 The parameters were set as the following: iTRAQ 4plex or 8plex peptide labeled for Sample Type depending on the corresponding sample types in the runs. MMTS for Cys Alkylation, Trypsin for Digestion, Triple TOF 5600 for Instrument, None for Species (the whole database was used), Quantitate and Bias correction enabled, Biological modification for ID Focus, Thorough ID and Run False Discovery Rate Analysis enabled. As recommended by ProteinPilot developer, the detected protein threshold (Unused ProtScore) in database search method was set as larger than 0.10 (better than 20% confidence) to allow ProteinPilot false discovery rate analysis.21 After ProteinPilot data processing, protein summaries and peptide summaries were exported as text files. Proteins identified with better than 1.3 Unused score (greater than 95% confidence) were used for further data analysis.

Experimental Design

The three iTRAQ runs used for this technical evaluation are a subset of a larger experiment designed to evaluate the effect of deonxynivalenol exposure on wheat. Wheat leaf samples were taken after 4, 12, 24, and 48 h after infiltration with either water or deoxynivalenol. The design information showing the placement of samples on runs is shown in Supplementary Table S.T1. Protein Extraction

Wheat leaves were ground in liquid nitrogen, then proteins precipitated with ice cold acetone overnight at −80 °C. Proteins were recovered by centrifugation at −20 °C for 1 h at 17,418g. To extract proteins, 400ul of iTRAQ buffer (0.25 M triethylammonium bicarbonate, pH 8.5; 0.05% SDS) was added to the pellets. The samples were vortexed and probe sonicated on ice, followed by centrifugation at 14,000g for 5 min. The supernatants were transferred to new tubes while the remaining pellets were extracted a second time using 200ul of iTRAQ buffer as described above. The supernatants from the two extractions were combined and centrifuged at 14,000g for 5 min and protein concentrations determined by amino acid analysis. No specific depletion of Rubisco was undertaken. Further details of this analysis procedure are shown in Supplemental Analytical Methods M1.

Data Analysis

The data analysis was carried out using the R statistical programming environment. The Protein Pilot summary files containing the protein quantitation, protein database Fasta files (Combined and Brachypodium) and additional annotation files used for the data analysis carried out in this present work have been placed into an R package (WheatRPac), alongside with the open source R code used to generate figures 1-5; the WheatRPac R package is available from ftp://ftp.proteome.org. au/WheatProject/. The Supporting Information R1 gives instructions for installing the package and running the included R code.

iTRAQ Experiment

iTRAQ 4-plex and iTRAQ 8-plex runs were carried out according to the iTRAQ experimental design. For the pooled samples, equal amount of proteins were pooled from three biological replicates. The iTRAQ experimental procedures were reported in detail previously20 and are also described in Supplemental Analytical Methods M1. Briefly, 100ug of proteins from each sample were reduced with TCEP [Tris(2carboxyethyl)phosphine], alkylated with MMTS (methyl methanethiosulfonate), digested with trypsin and labeled with iTRAQ tag. The labeled peptides were fractionated by strong cation exchange HPLC and the fractions were subjected to reverse phase nanoLC ESI MS/MS data acquisition using a TripleTOF 5600 mass spectrometer.

iTRAQ Run Alignment with and without Competitors

In the default text export of Protein Pilot identified proteins the group winner is assigned quantitation, while the competitors are listed as part of the group but are not assigned quantitation. Hence in a naı̈ve alignment of two protein summary files only the winners are considered and the competitors are simply discarded. In order to assess the competitor overlap, we first “fill down” a protein group to assign the same quantitation to the competitors as for the winner protein, and then intersect the new sets of quantitation but only retain the proteins that were identified as winners in either one run or the other. Thus overall over two runs both approaches are equally parsimonious, in that they retain only those proteins that were selected

Database Construction

Five databases were evaluated: the NCBInr viridiplantae plant database; an in-house assembled combined cereals database; an all Brachypodium database; wheat D genome progenitor database, and the SuperWheat database.6,7 The NCBInr 4872

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

phytozome.net. For each protein in the database the best BLAST hit only was retained and appended to the Protein Summary search results. Then two measures were calculated for each protein group in each search: the number of proteins in the group, and the number of distinct best BLAST hits within a group. The group size is related to the database redundancy −redundant databases yield larger groups − but many large groups may have the same best BLAST hit. The distribution of protein group size and number of best BLAST hits in protein group was plotted side by side for the combined cereals and Brachypodium searches. Quantitation Aggregation

Protein ratios from the combined cereals search were aggregated by best Brachypodium BLAST hit using the average in log space (corresponding to a geometric mean of the ratios), thus projecting the quantitation onto the BBH. This quantitation was then compared to the Brachypodium search of the same spectra by assessing the correlation and line of best fit. Brachypodium best BLAST hits were classified based on whether they were single hits (best match for a single identified protein), double hits (same best matches for two identified proteins), and triple or more hits (same best matches for three or more proteins). Gene Ontology Enrichment

Gene Ontology annotation was downloaded from AgriGO and summarized at plant GO slim categories for each iTRAQ run. The resulting annotation was compared with the overall background genomics annotation both by using the Gene Set Enrichment analysis available from AgriGO and by using the PloGO R package developed in our group;22 enrichment of a category is evaluated in both cases for each GO category of interest by Fisher’s exact test. Since each individual iTRAQ run was found to be enriched at numerous categories when compared to the genomics background, a bootstrapping exercise was carried out by generating small random subsets from the set of proteins identified in one run, and comparing the gene ontology annotation both to the genomic background and to the set of all proteins identified in one iTRAQ run.

Figure 1. Combined cereals database sources (bars) and median percentage of protein identifications (line) assigned; the percentages for each of the three iTRAQ runs are marked as a separate blue dot. The percentage of identifications originating from the respective species for each of the experimental iTRAQ runs is overlaid on top. The Plant category refers to plants other than the cereals already listed.

as winners in either one run or the other; however in the competitor overlap a larger number of proteins are seen to be shared between the two sets. The side by side correlation plot of the quantitation from the naı̈ve alignment and competitoralignment can be used to compare the overall impact of the filldown approach. The comparison of the two alignments can be generated using the runOverlap function in the WheatRPac package.



Best BLAST Hit Assignment and Protein Group Size

RESULTS

Comparison of Databases for Peptide Spectra Matching

BLAST mappings of the combined cereals database and D genome progenitor to the Brachypodium data set were carried out, (BLASTP version 2.2.25+, E-value < e-10.) The Brachypodium predicted protein set corresponding to annotation version 1.2 was downloaded from www.

To establish an optimum bioinformatics pipeline for analysis of quantitative wheat shotgun proteomics data sets we used wheat leaves infiltrated with either water (control) or 200 ppm deoxynivalenol (DON) to induce proteome expression

Table 1. Comparison of Peptide Identification Results Using 4-plex iTRAQ Labeled Bread Wheat with Different Sequence Databasesa database

spectra matched (%)

total proteins identified

protein groups identified

average group size

proteins quantitated

FDR (%) based on decoy db 0.3% (0.08% qt proteins only) 0.08% (0% qt proteins only) 2.3% (0.2% qt proteins only) 0.2% (0.07% qt proteins only) nd

composite

52.7

4770

1262

3.8

1227

NCBInr plant

48.4

11884

1162

10.2

1070

Brachypodium

44.3

1808

1023

1.8

985

D Genome Progenitor Ae. tauschii Super Wheat DB

53.7

1577

1353

1.2

1324

59.5

54,254

1,694

32.0

nd

no. of sequences in database 238,594 533,716 35,533 43,150 2,094,746

a

The FDR could not be determined for the SuperWheat database due to the large database size, and hence the results are not directly comparable with the others. nd = not determined. 4873

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

Figure 2. Comparison of search overlap and protein quantitation correlation, naı̈ve and considering competitors (4-plex Run 3 protein abundance ratio Label 115/114 shown, corresponding to a DON 48H/Control sample). By looking at winner proteins only, the apparent intersection between the Brachypodium search and the combined cereals search is small (432 proteins). Considering competitor proteins yields a considerably higher overlap (724 proteins). The quantitation correlation is similar for the naı̈ve and competitor inferred overlap, 0.95 for the ratio shown.

changes. The extracted proteins were labeled with iTRAQ reagents, and several LC−MS experiments were conducted. Three iTRAQ runs were generated, two iTRAQ 8-plex runs and one iTRAQ 4-plex run (Supplemental Analytical Methods M1). We used the 4-plex iTRAQ run data as a model to investigate peptide spectra matching using various sequence databases. Table 1 details the comparisons of using the NCBI nr plant database, combined cereals database, Brachypodium database, and wheat D-genome progenitor (Aegilops tauschii) database for peptide identification. We considered metrics including percentage of spectra matched, number of proteins identified, numbers of protein groups identified, number of quantitated proteins, and false discovery rates based on decoy database search. The D-genome progenitor and the combined cereals database performed closely in terms of percentage of spectra matched (53.7% and 52.7%, respectively) and numbers of protein groups identified (1353 and 1262, respectively). The NCBInr plant database was next in terms of spectra matched and proteins identified; however, due to the much larger number of sequences the search time was also increased, and the high database redundancy was reflected in the large number of protein identifications compared to the number of protein groups (average group size 10.2). The Brachypodium search provided a surprisingly high percentage of spectra matches (44.3%) considering its size is 12% of the combined cereals database (Figure 1) and only 7% of the NCBInr size. Though

fewer proteins were identified, use of this database provided added advantage for functional analysis with direct links to downstream analysis tools such as AgriGO (http://bioinfo.cau. edu.cn/agriGO/). The high abundance of Rubisco in plant cells is a contributing factor that accounts for significant MS/MS sequencing time that could impact on the depth of detection of other proteins in shotgun proteomic studies. The approach we took to address this problem was to fractionate the peptides by SCX prior to reversed-phase LC−MS analysis. This orthogonal separation approach distributes the Rubisco peptides, enabling effective detection of peptides from other proteins. Across the 12 SCX fractions, the percentage of Rubisco-attributable peptides varied from 18% to 54% (see Supplemental Analysis A1). Comparison with the SuperWheat EST-Based Database

An extensive database of transcribed wheat gene sequences resides in the SuperWheat database6 (2.09 million sequences), and therefore we attempted to compare use of this database with those described above. The extremely large size of this predominantly EST database restricted ProteinPilot use to only a forward search (i.e., FDR using a decoy database search of forward and reverse appended sequences could not be determined as this computation failed on our 8-core CPU system). Compared to the other databases used in Table 1, the forward search yielded a greater number of spectra matches (59.5%) and identified more protein groups (1,694). However, 4874

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

Figure 3. Distribution of Protein Pilot group sizes (A) and number of distinct best BLAST hits (BBH) (B) in the protein group for each consolidated from the three iTRAQ runs for each of the two searches; the area between the two distributions is shaded. The combined cereals database has larger groups with a long tailed distribution of group sizes, some as large as 226 proteins (the image × range was truncated for ease of display). However, the two databases have very similar distribution of distinct BBH in each group.

that are identified by roughly the same spectra are grouped together and the quantitation is assigned to the winner. However, neither the protein groups nor the group winners are necessarily conserved across successive runs. The stochastic element of picking winners is described for instance in ref 18, where it is shown that ignoring the inherent ambiguity in protein identification can lead to low reproducibly across runs. There are alternatives to the naı̈ve merging of winner proteins only, but none is computationally trivial. The AB SCIEX Protein Alignment template (www.absciex.com/ ProteinAlignmentTemplate) was designed to handle the correct alignment of such searches instead of the simplistic matching of winners across runs; however, its deficiency is that it requires a master search as a reference frame, which can be prohibitive for a large experiment. Another approach described in ref 24 employs PCGA (Protein Group Code Algorithm) to group together overlapping groups across runs; however, once large across-run groups are generated, it is not straightforward to consolidate the quantitation within. In our particular example, considering only the quantitated proteins (winners in each ProteinPilot group), the apparent or naı̈ve overlap between the quantitated proteins in Brachypodium and combined cereals databases is approximately 42% of the Brachypodium search, which is quite low considering that the Brachypodium database is a subset of the combined cereals one. Alternatively, when merging, taking competitor proteins into account and assigning the same quantitation to the

the sequence redundancy was significantly greater than all other compared databases (54,254 proteins made up the 1,694 protein groups, or equivalently an average group size of 32.0). While use of the SuperWheat EST database resulted in more matching wheat spectra, this result remains to be qualified as the FDR remains unknown. Moreover, it is unwieldy to directly use this large, highly redundant database with multiple iTRAQ runs due to problems of across run misalignment that impact upon reliable protein quantitation (see below). Identification and Quantitation Comparison Across Searches

Given the difficulties discussed above with using large, redundant databases for multiple iTRAQ runs, we focused our subsequent analyses on the combined cereals and the Brachypodium databases. We looked in more detail at the protein identifications and quantitation obtained for the same spectra using the combined cereals and Brachypodium databases. This served a 2-fold purpose: first it showed immediately the impact of disregarding competitor proteins from the identifications, and second, it established baseline variability in quantitation when the same spectra were searched against two related databases. The Pro Group algorithm in Protein Pilot16 takes a conservative approach to protein identification that follows the Molecular and Cellular Proteomics publication guidelines23 for reporting the number of protein identifications. Proteins 4875

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

Figure 4. Inferred quantitation correlation. (A) When mapping protein winners to the BBH, unique BBH form more than 87% of the mappings (Single category). Double hits are those for whom two separate protein winners can be identified based on the spectra, but they have the same Brachypodium BBH. Single and double BBH put together amount to 98.4% of the total. (B) Overall correlation for inferred quantitation. (C) When merging with the Brachypodium search via the BBH, single best BLAST hits have correlations comparable with the baseline merging by accession or direct Brachypodium search (0.95, compare with Figure 2). (D) Correlation for double matches. (E) Correlation for averaged double matches.

a plausible proxy can be given via BLAST mapping to Brachypodium. Different species proteins homologous to the same wheat protein should more often map to the same Brachypodium transcript via BLAST, while different wheat proteins might not. Figure 3 shows the distribution of protein groups by group size and distinct BBH numbers in the group. The stochastic aspect of assigning winners within a group is likely to be higher with the composite database, which has larger groups, but not so after mapping to the Brachypodium BBH, as many large protein groups will have a unique BBH. BLAST mapping to a related organism has in this case the beneficial effect of reducing the misalignment across runs. Considering the percentage of protein groups containing two or more proteins as a possible measure of the database redundancy, we found that the median combined cereals run contains 65% nonunique groups, compared to only 42% of such groups in the Brachypodium search. The difference adequately reflects the increased redundancy of the combined cereals database. Considering the percentage of protein groups with two or more distinct best BLAST hit (BBH) transcripts in them, we found however that the median combined cereals search has 24% groups with two or more distinct BBH, while the median Brachypodium search has 29% protein groups with two or more distinct transcripts.

competitors, followed by only selecting the proteins found to be a winner in at least one run, the overlap becomes 69% of the Brachypodium search. The respective overlap and quantitation correlations are shown in Figure 2. For searches run using a nonredundant database the overlap with or without competitors will be very close. When restricting to Brachypodium hits quantitated in both searches, the quantitation ratio is similar (Pearson correlation coefficient of log iTRAQ ratios is 0.95) whether considering only group winners or taking competitors into account. We regard this as the baseline variability in protein quantitation when identifying the same proteins searching the same spectra using different databases. BLAST Mapping to Brachypodium Impact on Protein Pilot Group Size

Searching with a larger and highly redundant database will yield larger protein groups (Table 1) and with repeated runs will have a more pronounced stochastic aspect to assigning the winner protein: from among similar proteins one might be selected as the winner in one run, and a related one in another run. When working with a composite database from multiple species, there should be in theory two kinds of proteins that cannot be distinguished by a given set of spectra: similar ones from separate species homologous to the same wheat protein, and closely homologous wheat proteins that cannot be distinguished by the given spectra. Without a complete wheat database it is of course impossible to know which is which, but 4876

dx.doi.org/10.1021/pr400531y | J. Proteome Res. 2013, 12, 4870−4881

Journal of Proteome Research

Article

Figure 5. Plant GO slim terms classification of proteins identified in each iTRAQ experiment, for the Brachypodium, combined cereals, and D Genome progenitor A. tauschii searches; in the case of the combined cereals and D Genome progenitor searches the GO annotation was inferred by BLAST mapping to Brachipodium. As there is overlap between the various processes, the different categories add up to more than 100%. Biological processes such as subcategories of metabolic process, translation, and transport contribute a large percentage of the total proteins identifications in each run. The percentages of annotation are consistent by iTRAQ run and broadly consistent across the different database searches as well.

Inferred Quantitation

more winning proteins comprise over 98% of the total. Figure 4B shows the correlation of the inferred data with the Brachypodium search data for one example iTRAQ ratio in the 4-plex run (Ratio 115/114 in iTRAQ Run 3). The baseline correlation, comparing Brachypodium proteins quantitated as winners in both searches, is 0.96 as shown in Figure 2 and shows the variability in quantitation when the same proteins are identified by searching the same spectra using different database searches. The overall inferred correlation is 0.88 and shows the difference between quantitating the Brachypodium hits directly and assigning quantitation on the basis of the BBH. However, when separating the proteins into the categories from panel A, the correlation for the single BLAST hits is comparable to the baseline correlation (correlation coefficient 0.95, panel C), and the correlation for the BLAST hits shared by two proteins is is lower (correlation coefficient 0 0.72, panel D). For the BLAST hits shared by two or more protein group winners, the quantitation needs to be aggregated. This was achieved by averaging the log iTRAQ ratios mapping to the same BBH. Panel E shows the aggregation correlation with the Brachypodium matches (0.81 for the double matches). While the correlations are lower than for the single matches, they are nonetheless improved by aggregation, as is the regression line slope. Again, this compares the averaged quantitation assigned via BBH mapping with the quantitation assigned by direct search against the Brachypodium database. There are insufficient triple and higher matches to make meaningful comments about the quality of fit for this group. An additional metric considered for the subset of aggregated double matches has been the percentage of “strong” discordance, which we define as (number of ratios aggregated where max is >1.2 and min is 1 and min is