Article pubs.acs.org/jpr

Proteogenomic Analysis of the Venturia pirina (Pear Scab Fungus) Secretome Reveals Potential Effectors

Ira R. Cooke,*,†,‡ Dan Jones,§,∥ Joanna K. Bowen,⊥ Cecilia Deng,⊥ Pierre Faou,† Nathan E. Hall,†,‡ Vignesh Jayachandran,† Michael Liem,† Adam P. Taranto,§ Kim M. Plummer,*,§,∥ and Suresh Mathivanan*,†

†Department of Biochemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria 3086, Australia
‡Life Sciences Computation Centre, Victorian Life Sciences Computation Initiative
§Department of Botany, Centre for AgriBioscience, La Trobe University, Melbourne, Victoria 3086, Australia
∥Plant Biosecurity Cooperative Research Centre, LPO Box 5012, Bruce ACT 2617, Australia
⊥The New Zealand Institute for Plant and Food Research Limited (PFR), Auckland 1025, New Zealand

Supporting Information

ABSTRACT: A proteogenomic analysis is presented for Venturia pirina, a fungus that causes scab disease on European pear (Pyrus communis). V. pirina is host-specific, and the infection is thought to be mediated by secreted effector proteins. Currently, only 36 V. pirina proteins are catalogued in GenBank, and the genome sequence is not publicly available. To identify putative effectors, V. pirina was grown in vitro on and in cellophane sheets mimicking its growth in infected leaves. Secreted extracts were analyzed by tandem mass spectrometry, and the data (ProteomeXchange identifier PXD000710) was queried against a protein database generated by combining in silico predicted transcripts with six frame translations of a whole genome sequence of V. pirina (GenBank Accession JEMP00000000). We identified 1088 distinct V. pirina protein groups (FDR 1%) including 1085 detected for the first time. Thirty novel (not in silico predicted) proteins were found, of which 14 were identified as potential effectors based on characteristic features of fungal effector protein sequences. We also used evidence from semitryptic peptides at the protein N-terminus to corroborate in silico signal peptide predictions for 22 proteins, including several potential effectors. The analysis highlights the utility of proteogenomics in the study of secreted effectors. KEYWORDS: Venturia pirina, proteogenomics, secreted effectors, Ave1



INTRODUCTION

Advances in DNA sequencing technologies have resulted in affordable whole genome sequencing for non-model microorganisms, including many plant pathogenic fungi. One of the chief outcomes of a genome sequencing project will be a database of putative genes and gene products, usually generated de novo by gene finding algorithms,1−3 perhaps refined by the addition of transcriptomic sequences.4 Proteogenomics provides a means of validating these lists at the protein level, as well as identifying novel proteins, and correcting errors such as frameshifts or incorrect start sites that are difficult to detect from DNA or RNA sequences alone.5 Nevertheless, the benefits of reconciling proteomic and genomic data are not often realized, presumably because there are relatively few tools to deal with the computational complexities of proteogenomic searches (although see Pang et al.,6 Risk et al.7), or to annotate gene models based on peptide information.8

Venturia pirina (synonym pyrina) is a pathogenic fungus that infects European pear (Pyrus communis), causing significant crop losses and quality reduction, as well as damage to infected trees.9,10 Members of the Venturia genus exhibit strong host specificity on several major fruit crops, including apple; European, Chinese or Japanese pear; peach; and cherry.11,12 This host specificity is driven by coevolved relationships, likely to involve effector proteins secreted by the pathogen and resistance (R) proteins (often receptors or receptor-like kinases) in the host. Natural disease resistance (governed by these R proteins) exists and is used in plant breeding programs. Gaining a better understanding of this pathosystem, in particular by identifying fungal effector and host resistance genes, will significantly aid the research effort to reduce the economic impacts of this suite of diseases. While plant R genes have conserved sequences and are readily identified in plant genome sequences, effector genes lack sequence similarity. Most characterized fungal effectors, however, have similar properties, in that they are small (generally less than 200 amino acids), secreted, and cysteine rich (typically enabling the formation of two to four disulfide

© 2014 American Chemical Society

Received: February 28, 2014 Published: June 26, 2014

dx.doi.org/10.1021/pr500176c | J. Proteome Res. 2014, 13, 3635−3644

Journal of Proteome Research

Article

using HCl and NaOH) with 10 samples (one per 10 cm Petri dish) of fungal material in cellophane and of agar. All Falcon tubes were then placed in a mini-rotator (Thermo Scientific) for 48 h at 4 °C. The proteins were later separated using centrifugation at 17 000g in a Sorvall SS-34 centrifuge. The resulting supernatant solutions were then passed through sterile 0.2 μm filters (Merck Millipore) to remove any traces of pellet material. The resulting filtrate was then concentrated by passing it through a 3 kDa Amicon filter retaining proteins larger than 3 kDa.

V. pirina secretome samples (30 μg) were loaded onto a precast NuPAGE 4−12% Bis-Tris gel in 1× MES SDS running buffer. A constant voltage of 150 V was used for running the gel followed by visualization of the protein bands by Coomassie stain (Bio-Rad). The gel was stained for 1 h followed by destaining (20% ethanoic acid and 7.5% acetic acid in Milli-Q water) overnight. The gel lane was excised into 20 bands followed by in-gel reduction, alkylation, and trypsinization as described previously.18 Briefly, reduction was performed by using 10 mM DTT (Bio-Rad) in 50 μL per gel band for 30 min, followed by alkylation for 20 min with 25 mM iodoacetamide (Sigma) in 70 μL per gel band and digestion with 150 ng of sequencing grade trypsin (Promega) overnight at 37 °C. The resulting peptides were extracted with 0.1% trifluoroacetic acid (TFA) and 100 μL of 50% acetonitrile (ACN) and finally concentrated to ∼20 μL using the SpeedVac Concentrator. A graphical description of our sample preparation method from fungal culture through to LC−MS/MS is provided as Supporting Information (Figure S1).

bonds) resulting in a degree of stability for these proteins in hostile extracellular conditions of plant tissues.13 A major barrier in the research of any aspects of Venturia biology and disease is the lack of published whole genome sequences and the corresponding gene models for Venturia. Currently, only 288 protein coding sequences (including many redundant sequences) have been published on the NCBI GenBank database for the entire Venturia genus, including just 36 for V. pirina. In this study, we performed a proteogenomic analysis on V. pirina to identify protein coding genes including the effector proteins that could mediate scab disease in pear. A major impediment to the identification of effectors is their relatively low abundance, compared to plant proteins, in infected leaves. A previous proteomics study of a related disease system (apple scab) identified only plant proteins from Venturia inaequalis infected apple leaves.14 For this reason, models of infection, such as growth on potato dextrose agar (PDA) overlaid with cellophane, have been used to mimic the growth of Venturia spp. in the leaf.15 In this infection mimic system, the fungus forms specialized structures (appressoria, penetration pegs and laterally dividing, multicellular tissue known as stroma) on and within the cellophane sheet.15−17 Since this growth more closely resembles that in infected leaves than on pure agar, it is expected that proteins involved in infection (such as effectors) will be produced in sufficient abundance for identification by mass spectrometry. Our study took advantage of this growth system as well as a sample preparation protocol that attempted to enrich for secreted proteins to identify as many potential effectors as possible. In addition, we developed a suite of proteogenomic tools that we used to map our proteomic results to genomic locations. 
By examining our proteomic results in a genomic context, we were able to analyze the reliability of gene predictions by the popular gene annotation software AUGUSTUS. Finally, we exploited the semitryptic search capability of search engines, X!Tandem and MS-GF+, to examine N-terminal modifications for a subset of our identified proteins, including modifications critical to effector function such as signal peptide cleavage.



Mass Spectrometry

Samples were analyzed using three different LC−MS/MS platforms in order to maximize the number of peptides that could be ionized and therefore identified. The three mass spectrometers used were an Orbitrap Elite (Thermo Fisher), an Ion-Trap (HCT ultra PTM Discovery System, Bruker Daltonics) and a MALDI-TOF-TOF instrument (Ultraflex III, Bruker Daltonics). Peptides resulting from the 20 gel bands were resuspended in 10 μL with 8 μL of each being used for analysis on the HCT. The remaining 2 μL from gel bands 1−10 and 11−20 were pooled together to create two samples (20 μL volume each). These two samples were then used for MALDI-TOF/TOF and Orbitrap analysis. The following sections provide details of mass spectrometry performed on all three platforms grouped by ionization method. Details of software used to extract MS/MS peaklists and perform database searches are given in a separate Database Searches section below. The mass spectrometry-based proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the data set identifier PXD000710 and DOI 10.6019/PXD000710.

MATERIALS AND METHODS

Fungal Culture and Sample Preparation

A monospore (genetically pure) culture of V. pirina Aderh. (isolate ICMP 11032, listed by the synonym, V. pyrina, originally isolated from European pear, cultivar Winter Nelis, in Hastings, NZ, in 1990) was grown on sterilized cellophane sheets (Fowlers Vacola, Vic, Australia) overlaid onto potato dextrose agar (PDA, Sigma-Aldrich) in 9 cm Petri dishes. Cultures were grown for 2 weeks (20 °C under 12 h light/dark cycle) before harvesting. After this, most stages of growth (hyphae, stroma, and spores) were evident. Secreted proteins were harvested from V. pirina growing in and on the cellophane sheets and from the agar medium beneath the sheets. Protein harvesting was performed by gentle washing (in buffers of various pH) of the cellophane sheets and the roughly chopped agar, both containing fungal material and secreted proteins. Cellophane and agar pieces were placed separately in 50 mL falcon tubes (BD Falcon) containing 20 mL of 0.5 M Tris buffer solution with protease inhibitors (Complete Ultra protease inhibitor F. Hoffman, La Roche) (1 tablet dissolved in 50 mL). Samples were harvested using three different pHs of the buffer solution (4, 7, and 11, adjusted

ESI−LC−MS/MS

Peptides reconstituted in 0.1% formic acid and 2% ACN (buffer A) were loaded onto a trap column (C18 PepMap 100 μm i.d. × 2 cm trapping column, Dionex) at 5 μL/min for 6 min and washed for 6 min before switching the precolumn in line with the analytical column (Vydac MS C18, 3 μm, 300 Å and 75 μm i.d. × 25 cm, Grace Pty. Ltd.). The separation of peptides was performed at 300 nL/min using a linear ACN gradient of buffer A and buffer B (0.1% formic acid, 80% ACN), starting from 5% buffer B to 60% over 90 min (HCT) or 120 min (Orbitrap Elite).


assembly length was 41 983 696 bp with a GC content of 47.3%. Assessment of genome quality using CGAL 0.9.621 (summing CGAL scores from separate bowtie2 mappings of reads from each library) gives a likelihood score of −4.09 × 10^9. This Whole Genome Shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession JEMP00000000. The version described in this paper is version JEMP01000000. A set of gene models was constructed from mRNA sequences derived from the closely related fungus V. inaequalis using PASA_r2012-06-25.22 V. inaequalis mRNA sequences were derived from previously published expressed sequence tags (ESTs)15−17 and transcriptome assemblies of both in vitro and infected leaf (in planta) RNA. De novo gene predictions were performed using AUGUSTUS 2.5.51 using a parameter file generated from these gene models. Prediction of alternative transcripts from sampling was allowed.

Data were collected in Data Dependent Acquisition mode using m/z 300−1500 as the MS scan range. CID MS/MS spectra were collected for the 20 most intense ions on the Orbitrap or the top three on the HCT. Dynamic exclusion parameters were set as follows: repeat count 1, duration 90 s, and in addition for the Orbitrap, the exclusion list size was set at 500 with early expiration disabled. Other instrument parameters for the Orbitrap were the following: MS scan at 120 000 resolution, maximum injection time 150 ms, AGC target 1 × 10^6, CID at 35% energy for a maximum injection time of 150 ms with AGC target of 5000. The Orbitrap Elite was operated in dual analyzer mode with the Orbitrap analyzer being used for MS and the linear trap being used for MS/MS. The HCT used a maximum accumulation time of 200 ms with an ion charge current (ICC) target of 200 000. MS spectra were a sum of seven individual scans (scanning speed of 8100 (m/z)/s). MS/MS spectra used a fragmentation amplitude of 0.85 V and were a sum of six scans ranging from m/z 100 to 2200 at a scan rate of 26 000 (m/z)/s.

Proteogenomic Software Tools

All analysis was performed using a suite of command-line tools developed by the authors and released as an open source tool suite, Protk (https://github.com/iracooke/protk). Protk provides a consistent command-line interface around a wide variety of existing tools such as the Trans-Proteomic-Pipeline,23 BLAST+24,25 or search engines such as X!Tandem26 and Mascot (Matrix Science). Protk wrappers deal with differences in output or input formats, automate submission of searches to online search engines, and provide for maintenance of shared databases. Full details of all Protk tools used in this study are provided as Supporting Information (Table S1). The command-line version of Protk is available as a Ruby gem (http://rubygems.org/gems/protk) and many individual Protk tools are also available as graphical tools via the Galaxy toolshed (http://toolshed.g2.bx.psu.edu/). Although most tools in Protk are wrappers around existing tools, the tool protxml_to_gff was created specifically for this project.

LC−MALDI-MS/MS

Peptides were fractionated using a MALDI plate spotter (Proteineer Fc, Bruker Daltonics) coupled to an UltiMate 3000 nano-LC system (Dionex). Samples were first loaded onto a trapping column, PepMap100 C18 (5 μm, 100 Å, 300 μm i.d. × 5 mm) for 5 min at 7 μL/min in buffer A (2% ACN and 0.05% TFA) and washed for 6 min before switching the precolumn in line with the analytical column (Vydac Everest 5 μm, 300 Å, and 150 μm i.d. × 15 cm, Grace Pty. Ltd.). The flow rate used for separation on the analytical column was 1.2 μL/min with a 65 min gradient of buffer B (80% ACN and 0.05% TFA) as follows: (2−16)% B from (0−3) min to 55% B at 62 min. Eluted peptides were collected at ten second intervals on a 384 spot AnchorChip plate with 800 μm diameter and mixed online with HCCA (α-cyano-4-hydroxycinnamic acid) (HCCA Dried Droplet, AnchorChip Standard Target). Mass spectra were recorded on an Ultraflex III TOF/TOF instrument (Bruker Daltonics) equipped with LIFT capability using FlexControl version 3.3 and WARP-LC version 3.3 software (Bruker Daltonics). Settings applied were the following: positive reflector, m/z: 900− 4000 Da, voltage 26 kV, and a delayed extraction time of 20 ns. At each spot we accumulated 750 MS shots and 2000 MS/MS shots at fine-scale positions determined using a random walk. The 13 most promising MS peaks were selected for MS/MS at each spot using an algorithm within the WARP-LC software. Calibration was done externally using a peptide standard mixture (Bruker Daltonics). The spectra were processed using version 3.3 of FlexAnalysis with SNAP algorithm with a threshold of three, and peak lists were filtered to remove any unwanted matrix and keratin peaks.

Database Construction for Proteogenomic Analysis

The protein database was constructed by concatenating a six frame translation of the entire V. pirina genome with translations of all transcripts (including alternative isoforms) predicted by AUGUSTUS. In addition, the cRAP database of common contaminants (version 1.0 http://www.thegpm.org/crap/) was included. All sequences potentially coding for peptides of 20 amino acids or more were included in the six frame translation. The inclusion of six frame translations ensures that peptides arising from novel (i.e., not predicted by AUGUSTUS) genes will not be excluded (provided they do not cross a splice boundary), whereas the inclusion of predicted transcripts allows for peptides that span two or more exons. To facilitate downstream proteogenomics analysis, we encoded the genomic coordinates for the start and end positions of each coding sequence in the FASTA headers of our database. The resulting database consisted of 1.2 million putative protein sequences and was then expanded to include a further 1.2 million decoy sequences generated using an implementation of the make_random algorithm (Palmblad 2001 http://www.ms-utils.org/make_random.html) included with Protk.
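The six frame translation step described above can be illustrated with a minimal sketch. It is not the Protk implementation: the function names and the FASTA header layout (`scaffold_strand_frame_N|start|end`, 1-based inclusive coordinates) are hypothetical, and open reading frames are taken as stop-to-stop segments regardless of start methionine, with the 20 amino acid minimum from the text as the default.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    # Codons containing ambiguous bases are translated as 'X'.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_orfs(scaffold_id, seq, min_len=20):
    """Yield (header, protein) for every stop-to-stop segment of at
    least min_len residues in all six reading frames. The header format
    is an assumption made for this sketch."""
    seq = seq.upper()
    n = len(seq)
    for strand, template in (("fwd", seq), ("rev", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(template[frame:])
            aa_pos = 0  # residue offset of the current segment
            for segment in protein.split("*"):
                if len(segment) >= min_len:
                    start = frame + 3 * aa_pos        # 0-based on template
                    end = start + 3 * len(segment)    # exclusive
                    if strand == "rev":
                        # Convert back to forward-strand coordinates.
                        start, end = n - end, n - start
                    header = f"{scaffold_id}_{strand}_frame_{frame + 1}|{start + 1}|{end}"
                    yield header, segment
                aa_pos += len(segment) + 1  # +1 skips the stop codon
```

Carrying the coordinates in the header means that any later peptide hit can be placed on the genome without re-aligning it, which is the design choice the text relies on for downstream mapping.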

V. pirina Genome and Gene Predictions

Genomic DNA from V. pirina isolate ICMP_11032 (synonym V. pyrina, http://scd.landcareresearch.co.nz/Specimen/ICMP_11032) was isolated from cultures grown on Potato Dextrose Agar, and purified using a Qiagen DNeasy Blood and Tissue kit (catalog 69506). DNA was sequenced on an Illumina GAIIx at the ACRF Biomolecular Resource Facility (ABRF) and the Beijing Genomics Institute (BGI). Sequencing consisted of three libraries with insert sizes of 131, 2000, and 5000 bp. Output FASTQ files were assessed for quality using SolexaQA.19 Sequences were assembled using Velvet 1.1.20 The final assembly consisted of 7758 scaffolds with an N50 of 330 809 with the largest scaffold of 1 704 168 bp. The total

Database Searches

Prior to database searching, raw data from each instrument was converted from native instrument format to mzML. All conversions were performed using msconvert (version 3.0.432327) with vendor peak picking enabled, and using the MS2Denoise function with default parameters. Database searches were performed on each output file from each instrument run, using three search engines, Mascot (version 2.4.0 Matrix Science), X!Tandem (Cyclone 2013.02.01.126) and MS-GF+ (v9517 http://proteomics.ucsd.edu/Software/MSGFPlus.html). Settings for all search engines were variable modifications (oxidized methionine, N-terminal acetylation), fixed modification (carbamidomethylation of cysteine), precursor ion mass tolerance (10 ppm Orbitrap, 80 ppm MALDI, 1.2 Da HCT), fragment ion mass tolerance (0.5 Da Orbitrap, 0.8 Da HCT and 0.4 Da MALDI), and up to one missed tryptic cleavage. For X!Tandem and MS-GF+ only, semitryptic searches were enabled, thereby permitting peptides with a single trypsin cut site. Raw search results for each run were then converted to pepXML and processed with PeptideProphet28 using decoy sequences to help fit the negative distribution. All files were combined by processing with iProphet29 (default settings) to produce a single pepXML file. This was subsequently analyzed with ProteinProphet30 to produce a single protXML file containing information about all peptides identified, as well as information used to group indistinguishable or related proteins based on peptide redundancy. We retained protein entries at 1% ProteinProphet FDR, and peptides assigned to those proteins at 5% adjusted FDR.
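To make the term "semitryptic" concrete, the sketch below classifies a peptide by how many of its two termini are consistent with trypsin cleavage (after K or R) or with a protein terminus. The function name is hypothetical, the rule ignores the common exception for cleavage before proline, and the toy protein in the usage note is invented around the peptide shown in Figure 2; the search engines apply their own internal rules.

```python
def cleavage_class(protein, peptide, sites="KR"):
    """Return 'tryptic', 'semitryptic' or 'nonspecific' for a peptide.

    A terminus counts as specific if it coincides with a protein
    terminus or follows a residue in `sites`. Simplified: the
    proline rule is not applied (assumption of this sketch).
    """
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    n_specific = start == 0 or protein[start - 1] in sites
    c_specific = end == len(protein) or peptide[-1] in sites
    return {2: "tryptic", 1: "semitryptic", 0: "nonspecific"}[n_specific + c_specific]
```

For an invented precursor `MKLLSAAGEIASASTYKPPYFPNKTT`, the peptide `GEIASASTYKPPYFPNK` is classed as semitryptic because its N-terminus follows an alanine rather than K or R, which is the pattern expected when a signal peptidase, not trypsin, produced that end.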

Secretome and Signal Peptide Predictions

To assess coverage of the putative V. pirina secretome, and as a cross reference against which our observed signal peptidase cleavage sites could be compared, we assessed all predicted protein sequences (translations of AUGUSTUS-predicted transcripts), and all confirmed proteins (ORFs and predicted transcripts with peptide evidence) for the presence of a signal peptide. To perform this assessment, we used the following computational protocol, which is adapted from the protocol by Emanuelsson et al.33 for the case where we are only interested in secreted proteins. Proteins were first assessed using TargetP 1.134,35 which classified them into three categories (S, secreted; M, containing a mitochondrial targeting sequence; or unknown). Proteins were automatically retained if TargetP gave a prediction of S (secreted) with a reliability class of one (highest possible). TargetP provides reliability classes based on the difference between the raw scores of the chosen category (top scoring) and the category with the next highest raw score. If TargetP gave a prediction of secretion, but in a reliability class greater than one, the protein was then assessed using TMHMM 2.036 and rejected if it contained at least ten amino acids in transmembrane regions other than at the N-terminus (first 60 positions). Finally, a protein was retained only if SignalP 4.137 predicted a signal peptide (D cutoff = 0.34).
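The decision logic of this protocol can be sketched as a small function over precomputed tool outputs. This is one reading of the text, not the authors' code: it assumes that reliability class 1 secreted predictions bypass the TMHMM and SignalP checks, that non-S TargetP predictions are rejected, and that a D-score equal to the cutoff passes. The field names are hypothetical and nothing here calls TargetP, TMHMM or SignalP.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    targetp_class: str          # "S", "M", or "_" for unknown (encoding assumed)
    targetp_reliability: int    # 1 is the most reliable class
    tm_residues_after_60: int   # TMHMM residues in helices beyond position 60
    signalp_d: float            # SignalP D-score


def is_secreted(p, d_cutoff=0.34):
    """Apply the three-stage filter described in the text (as interpreted here)."""
    if p.targetp_class != "S":
        return False                      # not predicted secreted by TargetP
    if p.targetp_reliability == 1:
        return True                       # retained automatically
    if p.tm_residues_after_60 >= 10:
        return False                      # likely membrane protein
    return p.signalp_d >= d_cutoff        # final check for a signal peptide
```

Ignoring transmembrane residues in the first 60 positions matters because the hydrophobic core of a signal peptide is often reported as a transmembrane helix.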

Mapping Peptide Information to the V. pirina Genome



We created a software tool, protxml_to_gff using the ruby language that is able to quickly and accurately map peptide identifications from a provided protXML file back to a given genome. The tool is included as part of a suite of tools called Protk. As input, the tool requires a protXML file from Protein Prophet, a nucleotide FASTA file containing the genome itself, and an amino acid FASTA file with the protein database that was used in the proteomics search. As output, the tool produces a single file in genome features file v3 (gff3) format. Genomic coordinates for proteins (transcript coordinate and all exon coordinates) are obtained from the proteomics FASTA file, and the coordinates of coding sequences for each peptide are then inferred from these. In cases where a peptide maps across one or more splice junctions, each fragment is recorded as a child of the parent peptide. Since a protXML file contains both protein and peptide level confidence scores, the user can specify both separately, and peptides are only recorded if they pass both confidence thresholds. In this study, we specified the peptide confidence threshold at 5% FDR (1- number of sibling peptides adjusted probability) and protein level confidence at 1% FDR (1-protein probability). If a peptide could be derived from multiple database entries (either open reading frames or predicted transcripts) coordinates were recorded for all possible locations. The new tool works on any protXML generated by searching a protein database created using the database generation tools in Protk that encode genome coordinates in the FASTA header.
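The coordinate inference described above, including the case where a peptide spans a splice junction, can be sketched as follows. This is an illustrative reimplementation in Python rather than the Ruby code of protxml_to_gff; the function name and argument layout are hypothetical, and exon coordinates are assumed to be 1-based, inclusive, and to cover coding sequence only.

```python
def peptide_to_genome(exons, strand, pep_start, pep_len):
    """Map a peptide to genomic fragments.

    exons     : list of (start, end) coding segments, 1-based inclusive
    strand    : "+" or "-"
    pep_start : 0-based residue offset of the peptide in the protein
    pep_len   : peptide length in residues
    Returns one (start, end) fragment per exon the peptide touches,
    in transcript order.
    """
    # Walk exons in the order they are transcribed.
    ordered = sorted(exons, reverse=(strand == "-"))
    nt_start = 3 * pep_start
    nt_end = nt_start + 3 * pep_len       # exclusive, in CDS coordinates
    fragments = []
    offset = 0                            # CDS position where this exon begins
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start + 1
        lo = max(nt_start, offset)
        hi = min(nt_end, offset + ex_len)
        if lo < hi:                       # peptide overlaps this exon
            if strand == "+":
                fragments.append((ex_start + lo - offset,
                                  ex_start + hi - offset - 1))
            else:
                fragments.append((ex_end - (hi - offset) + 1,
                                  ex_end - (lo - offset)))
        offset += ex_len
    return fragments
```

A peptide that returns more than one fragment crosses a splice junction; in gff3 output each fragment would be written as a child feature of the parent peptide, as the text describes.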

RESULTS AND DISCUSSION

V. pirina Proteome and Secretome Coverage

Table 1. Coverage of the V. pirina Genome and Secretome by Peptides Observed Using Mass Spectrometry(a)

                     AUGUSTUS genes    AUGUSTUS transcripts    novel ORFs
Whole Genome         1042              1821                    30
Secreted Proteins    200 (17)          293 (17)                17 (14)

(a) AUGUSTUS predictions (genes/transcripts), and ORFs outside AUGUSTUS predictions (novel ORFs) with at least one observed peptide are shown. Rows represent the number of each type of entity found for the entire genome, or for secreted proteins. Potential effectors are a subset of secreted proteins and are shown in brackets.

Our analysis identified 1088 V. pirina protein groups (a complete list is available as Supporting Information Table S3) containing proteins that share common peptides. These protein groups and their associated peptides provide evidence for the expression of a total of 1074 genes including 1042 AUGUSTUS predicted gene models and 30 novel genes (Table 1, and see Supporting Information Tables S3 and S4). Note that the total number of genes identified (1074) is different from the number of protein groups (1088) because alternate transcripts of the same gene will occasionally be placed in separate protein groups when identifications come from distinct peptides, and because closely related genes with shared peptides may be placed in the same protein group. Also note that the number of novel genes (30) comes from a total of 35 novel open reading frames (ORFs) but includes five instances where two separate but closely spaced or overlapping ORFs were considered to form a gene. The protein and gene identifications in our study are based on a total of 6299 significant peptide identifications. While the vast majority (5652) of peptides could be identified by either a six frame translation or predicted transcript database, 589 peptides (at splice junctions) were identified from predicted transcripts alone, and 58 peptides

Visualization and Comparison with Predicted Gene Models

After mapping peptides to genomic coordinates, we were able to compare peptides with existing gene models using the Bedtools31 (version 2.16.2) suite for genome arithmetic. These tools allowed us to check for and locate regions of interest, such as novel exons, confirmed splice sites, confirmed and novel start sites, novel genes, gene extensions and frameshifts. These regions of interest were exported to separate gff3 files and then loaded as separate tracks into the Integrative Genomics Viewer (IGV)32 (version 2.1.28) for visualization.


additional factors may be contributing to this low identification rate, including (1) the inability of AUGUSTUS (and presumably other gene finders) to correctly predict short proteins (see section Length Distribution of Proteins); (2) the likely presence of numerous low probability (i.e., not expressed) transcripts among AUGUSTUS predictions leading to inflated predictions of secreted proteins from genomic analysis alone (see section AUGUSTUS Transcript Probabilities as Indicators of Expression); and (3) incorrect annotations at the N-terminus preventing signal peptide predictions. Our results suggest that proteogenomics can help mitigate these factors, specifically: small proteins (shorter than 200 amino acids), which have no associated gene prediction, can be found using six frame translations (Table 1); detection of peptides at splice junctions helps discriminate between expressed and nonexpressed transcripts; and identification of semitryptic N-terminal peptides provides information on N-terminal post-translational processing including signal-peptidase cleavage.

were identified exclusively from open reading frames outside predicted genes (Figure 1). This demonstrates the benefits of including both six frame translations and de novo predicted transcripts in the search database.

Proteogenomic Analysis Identifies Putative Effector Proteins

Of the 30 novel loci matched by peptides (i.e., loci not predicted as genes by AUGUSTUS), 14 were predicted to be secreted, and all of these exhibited classic characteristics of effectors,13,15,17 i.e., small (79−169 amino acids), secreted, and cysteine rich (four or more cysteine residues). Five out of these 14 were similar to proteins expressed from the Ave1 effector gene of Verticillium dahliae (the causal agent of Verticillium wilt of tomato). Eight of the 14 have sequences with no sequelogs in the NCBI database (no hits with E < 1.0 via tBLASTn), which is consistent with the fact that effectors are often lineage specific.42 The only other secreted protein in this set was a 233 amino acid protein of unknown function conserved in many fungi (three cysteine residues only). Using the criteria that effectors should be shorter than 200 amino acids and contain at least four cysteine residues,13 we found an additional 17 potential effectors among the 200 secreted proteins that were called by AUGUSTUS. A full list of all potential effector proteins is provided as Supporting Information in Table S2. Searches based on six frame translations are therefore responsible for over one-third of the putative effectors identified in this study. In addition, proteogenomics was able to corroborate the signal peptidase cleavage sites for a total of nine of these. A complete list of all 31 potential effectors, including putative coding sequences, is available as Supporting Information (Table S4).

Six proteins with similarity to proteins expressed from the V. dahliae Ave1 effector gene were identified in the secretome of V. pirina (five novel and one predicted by AUGUSTUS). In V. dahliae, Ave1 is a single copy gene, encoding a small, secreted protein with four cysteine residues. The Ave1 protein plays a key role during V. dahliae infection of susceptible tomato plants, but Ave1 is also recognized by resistant tomato plants containing the Ve1 resistance protein (a classical plant resistance receptor).42 We identified, via BLAST search, 14 loci with sequence similarity (E < 1 × 10^−5) to the V. dahliae Ave1, including 12 genes and two pseudogenes (lacking an open reading frame) in the V. pirina genome. This expansion stands in contrast to V. dahliae where Ave1 exists in single copy. One possible explanation for the observed expansion of Ave1 homologues in V. pirina is that they encode effectors, and play a role in the coevolutionary arms race between fungus and host. Our proteogenomics results were able to unambiguously

Figure 1. Venn diagram showing number of peptides identified from different entry types in our protein search database. Category “AUGUSTUS genes” refers to proteins corresponding to all transcripts from predicted genes. Six frame translation includes all possible coding sequences regardless of start methionine.

Our study significantly increased the number of known protein coding genes for V. pirina by providing proteomic evidence for 1042 AUGUSTUS gene models and 30 novel genes. Prior to this study, only 36 protein coding sequences were reported for V. pirina on the NCBI GenBank database (http://www.ncbi.nlm.nih.gov/protein/), many of which are redundant. These correspond to seven distinct protein coding genes in the UniProt database (http://www.uniprot.org), of which only three encoded secreted proteins. The number of identified genes from our study includes three of the seven already known. We found proteomic evidence supporting the expression of 200 genes encoding secreted proteins (Table 1). It is interesting to note that although the number of peptides (Figure 1) and proteins identified exclusively from six frame translations is very small, these contribute a large fraction (14/30) of potential effectors. Although we used a sample preparation protocol designed to enrich for secreted proteins, we also find (using our bioinformatic secretion prediction protocol) that 84% of the identified proteins in our samples are not predicted to be secreted. Presumably, some of these proteins could have been secreted through nonclassical secretory mechanisms including exosomes,38 but it is also likely that some proteins are present due to cell lysis. Nevertheless, the total number of secreted proteins identified by us (200) is similar to that of other fungal secretome studies.39−41 Like other studies on fungal secretomes, we found many more predicted secreted proteins from genomic analysis (in our case 2753 isoforms from 1794 genes) than were identified by mass spectrometry (Fungal Secretome Database40). This discrepancy suggests that a large proportion of secreted proteins are either secreted in very low abundance, or only expressed and secreted under specific growth conditions (including during plant infection). Several

dx.doi.org/10.1021/pr500176c | J. Proteome Res. 2014, 13, 3635−3644

Journal of Proteome Research

Article

Figure 2. Proteogenomic evidence for the expression and signal-peptide cleavage of a novel Ave1 homologue from V. pirina. Top of the figure shows a genomic region around a potential effector (scaffold111_frame_1_orf_713) with a long open reading frame (green bar) with four uniquely mapping peptides (yellow bars). Middle section of the figure shows a zoomed view on the highlighted region surrounding a semitryptic peptide GEIASASTYKPPYFPNK suggesting the presence of a signal peptide (blue bar). The putative signal-peptide cleavage site is corroborated by results from SignalP 4.1 (bottom of figure) which shows a strong peak in the S-score at the N-terminus of the observed peptide.

was represented by multiple N-terminal peptides representing variants in nontryptic cleavage position (± three amino acids). We found that 55 of these 63 proteins could be classified into one of four categories, as summarized in Table 2.
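The four categories in Table 2 amount to a small decision procedure. The sketch below encodes the stated criteria; the function name, the argument names, the order in which the criteria are tested, and the "unclassified" fallback are assumptions made for illustration, since the text does not specify how overlapping criteria were resolved.

```python
from typing import Optional


def classify_cleavage(
    observed_site: int,
    signalp_site: Optional[int],
    targetp_mitochondrial: bool,
    has_tryptic_parent: bool,
    tolerance: int = 5,
) -> str:
    """Assign one observed semitryptic N-terminus to a Table 2 category.

    Sites are residue positions in the precursor protein. `signalp_site`
    is None when SignalP predicts no cleavage site. The precedence of the
    checks below is an assumption, not taken from the paper.
    """
    if signalp_site is not None:
        offset = observed_site - signalp_site  # positive: prediction lies upstream
        if abs(offset) <= tolerance:
            return "signal peptidase"
        if offset > tolerance:
            return "propeptide cleavage"
    if targetp_mitochondrial:
        return "mitochondrial processing"
    if has_tryptic_parent:
        return "nonspecific cleavage"
    return "unclassified"
```

Proteins matching none of the criteria fall through to "unclassified", which corresponds to the proteins in the text that could not be placed in any of the four categories.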

confirm expression (via uniquely mapping peptides) for six Ave1 homologues, of which five were not predicted at all by AUGUSTUS. Like many effectors, Ave1 codes for a small protein (less than 200 amino acids);13 this property may have contributed to this failure of AUGUSTUS to annotate gene family members. Furthermore, we observed semitryptic peptides at the N-terminus of proteins expressed from three of these Ave1 homologues, providing direct evidence of signal peptide cleavage to produce mature secreted proteins from these genes. An example of one such protein is shown in Figure 2, and demonstrates agreement between mapped peptides and predictions from SignalP.
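To make the notion of a semitryptic N-terminal peptide concrete, the following sketch checks whether each end of a peptide coincides with an expected trypsin cleavage site in its parent protein. The peptide GEIASASTYKPPYFPNK is the one reported in Figure 2, but the flanking sequence in `TOY_PROTEIN` is invented purely for illustration, and the simple "after K or R, not before P" rule is an assumed approximation of trypsin specificity.

```python
def is_tryptic_boundary(protein: str, i: int) -> bool:
    """True if a cut between protein[i-1] and protein[i] is expected from trypsin.

    Protein termini count as valid boundaries. Cleavage after K/R is
    assumed to be blocked when the following residue is proline.
    """
    if i == 0 or i == len(protein):
        return True
    return protein[i - 1] in "KR" and protein[i] != "P"


def classify_peptide(protein: str, peptide: str) -> str:
    """Classify a peptide by how its two ends relate to tryptic sites.

    Only the first occurrence of the peptide in the protein is considered.
    """
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    if start == 1 and protein[0] == "M":
        return "N-terminal methionine excision"
    n_ok = is_tryptic_boundary(protein, start)
    c_ok = is_tryptic_boundary(protein, end)
    if n_ok and c_ok:
        return "fully tryptic"
    if c_ok:
        return "semitryptic (nontryptic N-terminus)"
    if n_ok:
        return "semitryptic (nontryptic C-terminus)"
    return "nontryptic"


# Hypothetical precursor: invented signal peptide + observed peptide + invented tail.
TOY_PROTEIN = "MKFSTLLAALPLLASA" + "GEIASASTYKPPYFPNK" + "DGSR"
```

Peptides in the "semitryptic (nontryptic N-terminus)" class are the candidates whose N-terminus may mark a signal peptide cleavage site.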

Table 2. Possible Origins for Observed Semitryptic Peptides

cleavage origin             no. of proteins    criterion
signal peptidase            22                 signal peptide site predicted by SignalP within five amino acids
propeptide cleavage         2                  signal peptide cleavage predicted by SignalP more than five amino acids upstream
mitochondrial processing    6                  TargetP predicts mitochondrial targeting
nonspecific cleavage        25                 existence of a tryptic peptide encompassing the observed semitryptic peptide

Signal Peptides and Other N-Terminal Cleavage Sites

A particularly powerful feature of proteogenomics is the opportunity to determine the location of signal peptide cleavage sites. While definitive verification of a signal peptide currently requires direct protein sequencing (Edman or de novo mass spectrometry sequencing) of the mature protein (http://www.uniprot.org/manual/signal), proteogenomics can be used to verify relatively large numbers of cleavage sites across a complex sample of proteins with much less effort. Given the number (approximately 250) of verified sites required to train signal peptide prediction algorithms,43 proteogenomics has the potential to provide enough verified sites for specialist training sets on particular organisms. In a step toward this, Müller et al.44 used proteogenomics to identify 63 signal peptidase cleavage sites in a prokaryote (Helicobacter pylori) and discovered that the SPase cleavage motif was LxA, rather than the otherwise well-known AxA motif.45 A key requirement for identifying signal peptide cleavage sites is the ability to search for semitryptic peptides. This is a feature provided by both the MS-GF+ and X!Tandem search engines, and was used in our analysis. Excluding examples of N-terminal methionine excision, we identified a total of 83 N-terminal semitryptic peptides from 63 proteins. This difference is explained by the fact that, in a small number of cases, a protein

A total of 22 proteins were found for which SignalP also predicted a cleavage site in close (± five amino acids) agreement with our observed N-terminal peptide. We also found two other proteins where the cleavage site predicted by SignalP was between eight and 20 amino acids upstream of our observed site. In these cases, the discrepancy is likely due to the existence of a short propeptide which is also removed to produce the mature protein. In this case we use the term propeptide simply to refer to a short sequence that must be cleaved in order to activate the protein (see http://www.uniprot.org/manual/propep for a definition). The Swiss-Prot database currently lists around 11 thousand reviewed proteins containing a propeptide, and a large proportion (eight thousand) of these also have signal peptides. The presence of both a signal peptide and propeptide is common among effectors and toxins.46,47 For a further six proteins we found that TargetP predicted the existence of a cleaved mitochondrial targeting sequence. In four of these cases, the cleavage site was in close (± three amino acids) agreement with the prediction made by TargetP. Aside from these biologically interesting origins of semitryptic peptides, we also


position (P2) is one of alanine, serine, glycine, threonine, valine, proline or cysteine. Since these amino acids coincide with the stabilizing amino acids under the N-end rule,51 it has been speculated that the purpose of NME has been simply to expose these amino acids so as to increase protein half-life.52 Our results show that while alanine and serine are strongly conserved at the position immediately following methionine (P2), the remaining amino acids were unconserved. This suggests a special role for alanine and serine over and above the stabilizing effects under the N-end rule. Our results agree with the recent findings of Bonissone et al.53 who proposed that the purpose of NME is to expose only alanine and serine with the remaining five stabilizing N-end rule amino acids being exposed incidentally due to their similarity to alanine and serine. Bonissone et al.53 argued that if the purpose of NME was to expose all seven of the stabilizing P2 amino acids, then all seven should be equally conserved in that position, whereas if a particular subset were important, only those should show a pattern of conservation. Using a large proteogenomics data set ranging across many organisms, and examining sequelogs of the same protein across taxa, they found that the most conserved P2 amino acids under NME were alanine and serine,53 with the remaining five being largely unconserved. Our analysis is consistent with their results and appears to support the hypothesis that the role of NME is primarily to expose alanine and serine with the remaining five amino acids being exposed incidentally.
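The P2 rule described above can be expressed as a small check. This is a minimal sketch that assumes the seven permissive residues listed in the text; the function names and the example sequences are hypothetical, and the residue frequencies it reports are a simpler quantity than the conservation across sequelogs analysed by Bonissone et al.

```python
from collections import Counter

# P2 residues after which NME generally occurs, as listed in the text
# (Ala, Ser, Gly, Thr, Val, Pro, Cys).
NME_PERMISSIVE_P2 = set("ASGTVPC")


def nme_expected(protein: str) -> bool:
    """True if the protein starts with Met and its second residue permits excision."""
    return len(protein) >= 2 and protein[0] == "M" and protein[1] in NME_PERMISSIVE_P2


def p2_frequencies(proteins):
    """Fraction of each P2 residue among proteins expected to undergo NME."""
    counts = Counter(p[1] for p in proteins if nme_expected(p))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}
```

For example, `p2_frequencies(["MASK", "MAGT", "MSLL", "MKLV"])` ignores the last sequence, whose second residue (K) is not permissive, and reports alanine at two thirds and serine at one third.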

found 25 examples where a fully tryptic parent peptide also existed. In such cases, we assumed that the cleavage was due to some form of nonspecific peptide degradation, perhaps caused by sample preparation. Sequence conservation in the region of signal peptide cleavage sites identified by us is shown in Figure 3 and agrees

Figure 3. Sequence logo plot48,49 showing conservation close to the signal peptide cleavage site (black arrow) for 22 proteogenomically verified sequences. Vertical axis and letter size represent bit score which is a measure of amino acid conservation at that position (horizontal axis).

with current expectations for a eukaryotic organism.45 In particular, there is an A[X]A motif with A at positions −3 and −1 relative to the cleavage site. While this motif is strongly conserved in some prokaryotes,44,45 it is much less well conserved in eukaryotes, where other small amino acids such as glycine and proline sometimes take the place of alanine.45 Interestingly, Figure 3 also clearly shows a strong preference for either alanine or leucine at various positions in the cleaved (N-terminal) region as well as conservation of phenylalanine at position −7. This is in contrast to the N-terminus of the mature protein (right part of Figure 3) where there is little sequence conservation.
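The bit score on the vertical axis of a sequence logo is the information content of each alignment column. A minimal sketch of that calculation is given here, assuming a 20-letter amino acid alphabet and omitting the small-sample correction that logo tools may apply; the function names are illustrative.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column: log2(N) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_information(sequences):
    """Per-position information content for equal-length aligned sequences."""
    return [column_information("".join(col)) for col in zip(*sequences)]
```

A fully conserved column scores log2(20), about 4.32 bits, and a column split evenly between two residues scores one bit less. Positions with little conservation, such as those on the mature-protein side of the cleavage site, score close to zero when many sequences are aligned.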

Length Distribution of Novel Proteins

We found a total of 35 protein-coding open reading frames (ORFs) or partial ORFs (lacking a start methionine) that were novel, in the sense that they were not identified by AUGUSTUS gene prediction and were found sufficiently far (1/10 the average inter-gene distance) from any other gene to rule them out as gene extensions. Examination of these 35 novel ORFs in a genome browser revealed three cases where two ORFs were closely spaced and it seemed likely that they belonged to the same gene. All of these novel ORFs had ProteinProphet probabilities greater than 0.99, with 19 being identified by a single unique peptide and the remaining 16 being supported by between two and five unique peptides. Our goal was to compare the length distribution of these novel ORFs identified by proteogenomics with the length distribution of transcripts predicted by AUGUSTUS (and also identified by proteogenomics). Such a comparison is complicated by the fact that identification of an ORF does not necessarily identify a full-length protein. To deal with this issue we performed a BLASTx search of novel open reading frames against the NCBI nr protein database with the aim of finding full-length sequelogs to our novel sequences. Since these are full-length sequences they can be legitimately compared with full-length AUGUSTUS predicted transcripts. This search yielded 16 matches (E < 1 × 10^−5), of which ten were distinct full-length proteins (listed as complete proteins in the UniProt Knowledgebase http://uniprot.org). Figure 5 compares the length distribution of these ten full-length novel protein sequelogs with the lengths of proteins predicted by AUGUSTUS (and with confirmed protein expression). Despite the small number of novel sequences available for comparison, it is clear that these tend to be much shorter than the bulk of transcripts predicted by AUGUSTUS. The clear difference in Figure 5 is also supported by a Kolmogorov−Smirnov test (p < 1 × 10^−6).
This result suggests that small proteins (which are precisely the proteins of most interest as potential effectors13) are likely to be severely under-represented in gene lists based purely on
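The two-sample Kolmogorov−Smirnov comparison of length distributions rests on a simple statistic: the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic using the standard library; it does not compute a p-value, and how the test was run in the study is not stated, so this is an illustration of the idea rather than a reproduction of the analysis.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Applied to two lists of protein lengths, a value of D near 1 indicates almost no overlap between the distributions, while a value near 0 indicates that they are nearly identical. A p-value can be obtained from statistical packages such as `ks.test` in R or `scipy.stats.ks_2samp` in SciPy.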

N-Terminal Methionine Excision

We observed 117 instances of N-terminal methionine excision (NME) in V. pirina and plot the pattern of conservation of the first seven residues following the start methionine of the corresponding proteins in Figure 4. NME is a strongly conserved

Figure 4. Sequence logo plot constructed from the first seven amino acids of 117 proteins displaying N-terminal methionine excision (excluding the initiator methionine). Vertical axis and letter size represent bit score which is a measure of amino acid conservation at that position (horizontal axis). Note that only A, S, V, G, T, and P are found at position 2, but of these, only A and S are strongly conserved. At all other positions, there is no sequence conservation.

biochemical function, but its purpose is poorly understood.50 In general, NME only occurs if the amino acid in the second


Figure 5. Length distributions of homologous full-length proteins to novel ORFs (n = 10; dashed line) versus proteins predicted by AUGUSTUS and confirmed with proteomics (n = 1821; solid line). Both lines shown are smoothed kernel densities computed using the density function in R.57

Figure 6. Distributions of AUGUSTUS transcript probabilities for isoforms where at least two transcripts were predicted from the same gene. Solid curve shows 61 uniquely identified transcripts from our study. Dashed curve shows all 1821 ab initio predicted transcripts.

are expressed. This implies that a very low probability cutoff would be required to avoid excluding any expressed transcripts in an ab initio gene prediction run and therefore limits the practical utility of AUGUSTUS transcript probabilities as a means of limiting database size, or ab initio predicting expression for other purposes.
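The trade-off described here, between shrinking the search database and losing genuinely expressed transcripts, can be made explicit with a small calculation. The function names are illustrative, and any probabilities passed in would be the AUGUSTUS transcript probabilities for all predicted transcripts and for the proteomically confirmed subset; no values from the study are reproduced here.

```python
def cutoff_tradeoff(all_probs, expressed_probs, cutoff):
    """Return the fractions of all and of expressed transcripts with probability >= cutoff."""
    kept_all = sum(p >= cutoff for p in all_probs) / len(all_probs)
    kept_expressed = sum(p >= cutoff for p in expressed_probs) / len(expressed_probs)
    return kept_all, kept_expressed


def max_safe_cutoff(expressed_probs):
    """Largest cutoff that still retains every confirmed expressed transcript."""
    return min(expressed_probs)
```

With hypothetical inputs `all_probs = [0.05, 0.1, 0.2, 0.4, 0.6, 0.9, 0.95, 1.0]` and `expressed_probs = [0.4, 0.9, 0.95, 1.0]`, a cutoff of 0.5 halves the database but discards one of the four expressed transcripts, and the largest safe cutoff is 0.4. A long low-probability tail among expressed transcripts pushes the safe cutoff down in the same way.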

computational gene-finding (also see Table 1). This is because gene prediction algorithms typically avoid calling genes from short open reading frames or putative transcripts.54,55 Our result supports the findings of other recent studies in which novel short polypeptides have been discovered in heavily studied organisms such as humans56 using proteomic approaches.



CONCLUSIONS

Using proteogenomics, we have obtained evidence for the expression of 1074 genes from V. pirina. Importantly, many potential effector proteins that may play a role in plant disease were identified. These proteins are useful targets for functional studies to confirm their role as effectors in V. pirina. Such functional work to confirm the role of effectors has previously been performed for the sister species V. inaequalis using gene silencing58,59 and protein bioassays.60 Similar methods could be used to confirm the role of the potential effectors identified in this study, although gene silencing may be complicated for the genes with similarity to the V. dahliae Ave1, as they are present as a gene family in V. pirina, rather than a single copy gene, and may display some functional redundancy. With these results, further studies can be performed to develop inhibitors against potential effectors so as to prevent the fungal infection. We also highlight the fact that proteogenomics is uniquely placed to refine gene models in ways that are useful for identifying secreted proteins, and if samples are enriched for these, then significant biological insights can be obtained at low cost.

AUGUSTUS Transcript Probabilities as Indicators of Expression

AUGUSTUS gene prediction resulted in a total of 20 756 transcripts across 11 963 genes for the V. pirina genome. Of these, we found peptides from a total of 1821 transcripts; however, many of these were shared peptides belonging to multiple transcripts for the same gene. A total of 723 transcripts could be unambiguously identified, usually in cases where just a single transcript was predicted for a gene. Of particular interest are a set of 61 transcripts where we were able to uniquely identify a particular splice variant from among several possibilities at the same locus. These transcripts can be used to investigate the utility of AUGUSTUS’ ab initio transcript probabilities1 as indicators of the likelihood of expression. AUGUSTUS transcript probabilities are obtained by random sampling of alternate parses (gene models) of the genome sequence according to their posterior probability. One possible use of these transcript probabilities is distinguishing between genuinely expressed splice variants versus hypothetical variants that are not expressed. If AUGUSTUS transcript probabilities were a reliable indicator of expression, we would expect that any transcript that could be unambiguously identified in a proteomics experiment should have a high AUGUSTUS probability. In cases where AUGUSTUS predicts just a single transcript for a gene, we found that the transcript probability was always high so we excluded these cases from our analysis. Instead, we focused (see Figure 6) on cases where multiple transcripts existed for a gene, and we plotted the probabilities for our 61 uniquely identifiable transcripts versus the background (all predicted transcripts including those not identified by proteomics). The results clearly show that the majority of transcripts predicted by AUGUSTUS have very low probabilities, and that those we identified uniquely by proteomics have much higher probabilities compared with the background. 
Nevertheless, it is also notable from Figure 6 that there is a long tail to the distribution of probabilities for expressed transcripts, such that some transcripts with low probabilities (less than 0.5)



ASSOCIATED CONTENT

Supporting Information

Figure S1: Preparation of Venturia pirina samples for mass spectrometry. Table S1: A complete list and description of all software tools used in the proteogenomic analysis. Table S2: List of all potential effector proteins identified across all Venturia pirina samples. Table S3: Complete list of all proteins identified across all Venturia pirina samples. Table S4: List of all novel loci including details of signal peptide prediction. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Authors

*(I.R.C.) E-mail: [email protected]. Phone: +61 3 9479 2256. Fax: +61 3 9479 1226.
*(K.M.P.) E-mail: [email protected]. Phone: +61 3 9032 7474.
*(S.M.) E-mail: [email protected]. Phone: +61 3 9479 2565.


Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was supported by the Australian NH&MRC fellowship (1016599) and Australian Research Council Discovery Grant (DP130100535) to S.M., VLSCI’s Life Sciences Computation Centre, a collaboration between Melbourne, Monash and La Trobe Universities and an initiative of the Victorian Government, Australia to I.R.C. and N.E.H. The work was also supported by a La Trobe University eResearch grant, and a La Trobe University collaborative grant. Plant Biosecurity Cooperative Research Centre PhD studentship supported D.J. The authors would like to acknowledge the support of the Australian Government’s Cooperative Research Centre’s Program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank J. Strugnell for providing valuable feedback on the manuscript.

REFERENCES

(1) Stanke, M.; Keller, O.; Gunduz, I.; Hayes, A.; Waack, S.; Morgenstern, B. AUGUSTUS: Ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006, 34, W435−W439.
(2) Lukashin, A. V.; Borodovsky, M. GeneMark.hmm: New solutions for gene finding. Nucleic Acids Res. 1998, 26, 1107−1115.
(3) Mathivanan, S. Integrated bioinformatics analysis of the publicly available protein data shows evidence for 96% of the human proteome. J. Prot. Bioinf. 2014, 7, 41−49.
(4) Stanke, M.; Schöffmann, O.; Morgenstern, B.; Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform. 2006, 7, 62.
(5) Castellana, N.; Bafna, V. Proteogenomics to discover the full coding content of genomes: A computational perspective. J. Proteomics 2010, 73, 2124−2135.
(6) Pang, C. N. I.; Tay, A. P.; Aya, C.; Twine, N. A.; Harkness, L.; Hart-Smith, G.; Chia, S. Z.; Chen, Z.; Deshpande, N. P.; Kaakoush, N. O.; Mitchell, H. M.; Kassem, M.; Wilkins, M. R. Tools to co-visualize and co-analyse proteomic data with genomes and transcriptomes: Validation of genes and alternative mRNA splicing. J. Proteome Res. 2013, 13, 84−98.
(7) Risk, B. A.; Spitzer, W. J.; Giddings, M. C. Peppy: Proteogenomic search software. J. Proteome Res. 2013, 12, 3019−3025.
(8) Castellana, N. E.; Shen, Z.; He, Y.; Walley, J. W.; Briggs, S. P.; Bafna, V. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Mol. Cell. Proteomics 2014, 13, 157−167.
(9) Shabi, E. In Oxford Handbook of Innovation; Jones, A. L., Aldwinkle, S. H., Eds.; American Phytopathological Society: St. Paul, MN, 1990; Chapter Pear scab.
(10) Villalta, O.; Washington, W.; McGregor, G. Susceptibility of European and Asian pears to pear scab. Plant Protection Quarterly 2004, 19, 2−4.
(11) Schnabel, G.; Schnabel, E. L.; Jones, A. L. Characterization of ribosomal DNA from Venturia inaequalis and its phylogenetic relationship to rDNA from other tree-fruit Venturia species. Phytopathology 1999, 89, 100−108.
(12) Bus, V. G. M.; Rikkerink, E. H. A.; Caffier, V.; Durel, C.-E.; Plummer, K. M. Revision of the nomenclature of the differential host-pathogen interactions of Venturia inaequalis and Malus. Annu. Rev. Phytopathol. 2011, 49, 391−413.
(13) Stergiopoulos, I.; de Wit, P. J. G. M. Fungal effector proteins. Annu. Rev. Phytopathol. 2009, 47, 233−263.
(14) Gau, A. E.; Koutb, M.; Piotrowski, M.; Kloppstech, K. Accumulation of pathogenesis-related proteins in the apoplast of a susceptible cultivar of apple (Malus domestica cv. Elstar) after infection by Venturia inaequalis and constitutive expression of PR genes in the resistant cultivar Remo. Eur. J. Plant Pathol. 2004, 110, 703−711.
(15) Kucheryava, N.; Bowen, J. K.; Sutherland, P. W.; Conolly, J. J.; Mesarich, C. H.; Rikkerink, E. H.; Kemen, E.; Plummer, K. M.; Hahn, M.; Templeton, M. D. Two novel Venturia inaequalis genes induced upon morphogenetic differentiation during infection and in vitro growth on cellophane. Fungal Genet. Biol. 2008, 45, 1329−1339.
(16) Bowen, J. K.; Mesarich, C. H.; Bus, V. G. M.; Beresford, R. M.; Plummer, K. M.; Templeton, M. D. Venturia inaequalis: The causal agent of apple scab. Mol. Plant Pathol. 2010, 12, 105−122.
(17) Bowen, J. K.; Mesarich, C. H.; Rees-George, J.; Cui, W.; Fitzgerald, A.; Win, J.; Plummer, K. M.; Templeton, M. D. Candidate effector gene identification in the ascomycete fungal phytopathogen Venturia inaequalis by expressed sequence tag analysis. Mol. Plant Pathol. 2009, 10, 431−448.
(18) Kalra, H.; Adda, C. G.; Liem, M.; Ang, C.-S.; Mechler, A.; Simpson, R. J.; Hulett, M. D.; Mathivanan, S. Comparative proteomics evaluation of plasma exosome isolation techniques and assessment of the stability of exosomes in normal human blood plasma. Proteomics 2013, 13, 3354−3364.
(19) Cox, M. P.; Peterson, D. A.; Biggs, P. J. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinform. 2010, 11, 485.
(20) Zerbino, D. R.; Birney, E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18, 821−829.
(21) Rahman, A.; Pachter, L. CGAL: Computing genome assembly likelihoods. Genome Biol. 2013, 14, R8.
(22) Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003, 31, 5654−5666.
(23) Deutsch, E. W.; Shteynberg, D.; Lam, H.; Sun, Z.; Eng, J. K.; Carapito, C.; Von Haller, P. D.; Tasman, N.; Mendoza, L.; Farrah, T.; Aebersold, R. Trans-Proteomic Pipeline supports and improves analysis of electron transfer dissociation data sets. Proteomics 2010, 10, 1190−1195.
(24) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403−410.
(25) Camacho, C.; Coulouris, G.; Avagyan, V.; Ma, N.; Papadopoulos, J.; Bealer, K.; Madden, T. L. BLAST+: Architecture and applications. BMC Bioinform. 2009, 10, 421.
(26) Craig, R.; Beavis, R. C. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−1467.
(27) Chambers, M. C.; et al. A cross-platform toolkit for mass spectrometry and proteomics. Nat. Biotechnol. 2012, 30, 918−920.
(28) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383−5392.
(29) Shteynberg, D.; Deutsch, E.; Lam, H.; Eng, J.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R.; Aebersold, R.; Nesvizhskii, A. I. iProphet: Multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 2011, 10, No. M111.007690.
(30) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75, 4646−4658.
(31) Quinlan, A. R.; Hall, I. M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841−842.
(32) Robinson, J. T.; Thorvaldsdóttir, H.; Winckler, W.; Guttman, M.; Lander, E. S.; Getz, G.; Mesirov, J. P. Integrative genomics viewer. Nat. Biotechnol. 2011, 29, 24−26.
(33) Emanuelsson, O.; Brunak, S.; Von Heijne, G.; Nielsen, H. Locating proteins in the cell using TargetP, SignalP and related tools. Nat. Protoc. 2007, 2, 953−971.
(34) Emanuelsson, O.; Nielsen, H.; Brunak, S.; Von Heijne, G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 2000, 300, 1005−1016.
(35) Nielsen, H.; Engelbrecht, J.; Brunak, S.; Von Heijne, G. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 1997, 10, 1−6.


(36) Krogh, A.; Larsson, B.; Von Heijne, G.; Sonnhammer, E. L. L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305, 567−580.
(37) Petersen, T. N.; Brunak, S.; Von Heijne, G.; Nielsen, H. SignalP 4.0: Discriminating signal peptides from transmembrane regions. Nat. Methods 2011, 8, 785−786.
(38) Kalra, H.; Simpson, R. J.; Ji, H.; Aikawa, E.; Altevogt, P.; Askenase, P.; Bond, V. C.; Borràs, F. E.; Breakefield, X.; Budnik, V. Vesiclepedia: A compendium for extracellular vesicles with continuous community annotation. PLoS Biol. 2012, 10, e1001450.
(39) Braaksma, M.; Martens-Uzunova, E. S.; Punt, P. J.; Schaap, P. J. An inventory of the Aspergillus niger secretome by combining in silico predictions with shotgun proteomics data. BMC Genomics 2010, 11, 584.
(40) Lum, G.; Min, X. J. FunSecKB: The fungal secretome knowledgebase. Database 2011, 2011, No. bar001.
(41) Tsang, A.; Butler, G.; Powlowski, J.; Panisko, E. A.; Baker, S. E. Analytical and computational approaches to define the Aspergillus niger secretome. Fungal Genet. Biol. 2009, 46, S153−S160.
(42) de Jonge, R.; van Esse, H. P.; Maruthachalam, K.; Bolton, M. D.; Santhanam, P.; Saber, M. K.; Zhang, Z.; Usami, T.; Lievens, B.; Subbarao, K. V. Tomato immune receptor Ve1 recognizes effector of multiple fungal pathogens uncovered by genome and RNA sequencing. Proc. Natl. Acad. Sci. U.S.A. 2012, 109, 5110−5115.
(43) Zhang, Z.; Henzel, W. J. Signal peptide prediction based on analysis of experimentally verified cleavage sites. Protein Sci. 2009, 13, 2819−2824.
(44) Müller, S. A.; Findeiß, S.; Pernitzsch, S. R.; Wissenbach, D. K.; Stadler, P. F.; Hofacker, I. L.; von Bergen, M.; Kalkhof, S. Identification of new protein coding sequences and signal peptidase cleavage sites of Helicobacter pylori strain 26695 by proteogenomics. J. Proteomics 2013, 86, 27−42.
(45) Tuteja, R. Type I signal peptidase: An overview. Arch. Biochem. Biophys. 2005, 441, 107−111.
(46) Honma, T.; Hasegawa, Y.; Ishida, M.; Nagai, H.; Nagashima, Y.; Shiomi, K. Isolation and molecular cloning of novel peptide toxins from the sea anemone Antheopsis maculata. Toxicon 2005, 45, 33−41.
(47) Rouxel, T.; et al. Effector diversification within compartments of the Leptosphaeria maculans genome affected by repeat-induced point mutations. Nat. Commun. 2011, 2, 202.
(48) Crooks, G. E.; Hon, G.; Chandonia, J.-M.; Brenner, S. E. WebLogo: A sequence logo generator. Genome Res. 2004, 14, 1188−1190.
(49) Schneider, T. D.; Stephens, R. M. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 1990, 18, 6097−6100.
(50) Giglione, C.; Vallon, O.; Meinnel, T. Control of protein lifespan by N-terminal methionine excision. EMBO J. 2003, 22, 13−23.
(51) Varshavsky, A. The N-end rule. Cell 1992, 69, 725−735.
(52) Arfin, S. M.; Bradshaw, R. A. Cotranslational processing and protein turnover in eukaryotic cells. Biochemistry 1988, 27, 7979−7984.
(53) Bonissone, S.; Gupta, N.; Romine, M.; Bradshaw, R. A.; Pevzner, P. A. N-terminal protein processing: A comparative proteogenomic analysis. Mol. Cell. Proteomics 2012, 12, 14−28.
(54) Brent, M. R.; Guigó, R. Recent advances in gene structure prediction. Curr. Op. Struct. Biol. 2004, 14, 264−272.
(55) Goli, B.; Nair, A. S. The elusive short gene−an ensemble method for recognition for prokaryotic genome. Biochem. Biophys. Res. Commun. 2012, 422, 36−41.
(56) Slavoff, S. A.; Mitchell, A. J.; Schwaid, A. G.; Cabili, M. N.; Ma, J.; Levin, J. Z.; Karger, A. D.; Budnik, B. A.; Rinn, J. L.; Saghatelian, A. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nat. Chem. Biol. 2013, 9, 59−64.
(57) R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2013.

(58) Fitzgerald, A. M.; Mudge, A. M.; Gleave, A. P.; Plummer, K. M. Agrobacterium and PEG-mediated transformation of the phytopathogen Venturia inaequalis. Mycol. Res. 2003, 107, 803−810.
(59) Fitzgerald, A. M.; Van Kan, J. A.; Plummer, K. M. Simultaneous silencing of multiple genes in the apple scab fungus, Venturia inaequalis, by expression of RNA with chimeric inverted repeats. Fungal Genet. Biol. 2004, 41, 963−71.
(60) Win, J.; Greenwood, D.; Plummer, K. Characterisation of a protein from Venturia inaequalis that induces necrosis in Malus carrying the Vm resistance gene. Phys. Mol. Plant Pathol. 2003, 62, 193−202.
