Using Proteomics to Mine Genome Sequences - Journal of Proteome

Jan 21, 2004 - Proteome Systems Ltd., Locked Bag 2073, North Ryde NSW 1670, Australia. Journal of Proteome Research , 2004, 3 (3), ... PepLine: A Soft...
0 downloads 9 Views 741KB Size
Using Proteomics to Mine Genome Sequences Jonathan W. Arthur* and Marc R. Wilkins* Proteome Systems Ltd., Locked Bag 2073, North Ryde NSW 1670, Australia Received July 18, 2003

We present a method for mining unannotated or annotated genome sequences with proteomic data to identify open reading frames. The region of a genome coding for a protein sequence is identified by using information from the analysis of proteins and peptides with MALDI-TOF mass spectrometry. The raw genome sequence or any unassembled contigs of an organism are theoretically cleaved into a number of equal sized but overlapping fragments, and these are then translated in all six frames into a series of virtual proteins. Each virtual protein is then subjected to a theoretical enzymatic digestion. Standard proteomic sample preparation methods are used to separate, array, and digest the proteins of interest to peptides. The masses of the resulting peptides are measured using mass spectrometry and compared to the theoretical peptide masses of the virtual proteins. The region of the genome responsible for coding for a particular protein can then be identified when there are a large number of hits between peptides from the protein and peptides from the virtual protein. The method makes no assumptions about the location of a protein in a particular gene sequence or the positions or types of start and stop codons. To illustrate this approach, all 773 proteins of Pseudomonas aeruginosa contained in SWISS-PROT were used to theoretically test the method and optimize parameters. Increasing the size of the virtual proteins results in an overall improvement in the ability to detect the coding region, at the cost of decreasing the sensitivity of the method for smaller proteins. Increasing the minimum number of matching peptides, lowering the mass error tolerance, or increasing the signal-to-noise ratio of the simulated mass spectrum, improves the ability to detect coding regions. The method is further demonstrated on experimental data from Mycobacterium tuberculosis and is also shown to work with eukaryotic organisms (e.g., Homo sapiens). Keywords: proteomics • genome annotation • open reading frames • peptide mass fingerprinting • mass spectrometry

Introduction The sequencing of genomes is now routine. GenBank contains genomic data for over 1000 species with more than 86 complete microbial genomes.1 However, the DNA sequence in and of itself tells one very little about the organism. It is, initially, nothing more than a long sequence of only four characters. The real value in the sequencing effort comes from determining the location and nature of the genes within the genome and the proteins for which those genes code. These proteins are, in turn, responsible for determining the way the organism functions. There are a number of known processes for attempting to determine the location of proteins in genome sequence data. Mathe et al.2 contains a recent review of the various techniques for gene prediction in eukaryotic data. The methods generally divide into two classes. In the first, an identification of an open reading frame is made when the content of the sequence identifies it as a coding region. This is based on either sufficient sequence similarity to a well-annotated protein or by analysis * To whom correspondence should be addressed. Phone: +61 2 9889 1830. Fax: +61 2 9889 1805. E-mail: [email protected]; [email protected]. 10.1021/pr034056e CCC: $27.50

 2004 American Chemical Society

of the compositional bias, G+C content, or codon usage. In the second method, identification is made by searching for signal sequences indicating start and stop codons as well as splice sites. It is also possible to use a combination of the methods. In all cases, the various techniques adopt various assumptions and hypotheses about the nature and location of open reading frames in the DNA sequence. Detection methods requiring such assumptions or hypotheses may produce incorrect results if the assumptions or hypotheses are incorrect. For example, these procedures are unlikely to locate nontypical sequences, which ironically may be of more interest than other proteins having more typical sequences identified using existing techniques. Furthermore, the accuracy of the various techniques varies considerably between applications. In a landmark paper evaluating the various programs for gene structure prediction, Burset and Guigo´3 found the average percentage of exons correctly identified was less than 50%. Since this time, the methods have been improved dramatically as recent evaluations against mammalian4 and plant5 data show. Even so, the best average exon sensitivity and specificity for any of the techniques in Journal of Proteome Research 2004, 3, 393-402

393

Published on Web 01/21/2004

research articles mammalian data is 0.76. This suggests a significant proportion of the open reading frames in a genome go unpredicted with current techniques. Mass spectrometry has been previously used in conjunction with whole genome data in order to identify proteins in a sample. Kuster et al.6 and Choudhary et al.7 both describe the use of MS/MS data to identify proteins, whereas Giddings et al.8 use matrix-assisted laser desorption-ionization time-offlight mass spectrometry (MALDI-TOF MS) data in conjunction with a genome fingerprint scanning analysis to identify proteins. In this paper, we present an alternative method for annotating genome sequences using proteomic data. The method is hypothesis independent and does not make assumptions for the detection of a protein from nucleic acid sequences. As such, it overcomes a number of the limitations of existing de novo gene prediction algorithms. Furthermore, the method uses data derived from MALDI-TOF mass spectrometry matched against virtual proteins. Although MS/MS data can also be applied to the same task, MALDI-TOF mass spectrometers are relatively inexpensive and thus common in many laboratories. MALDITOF mass spectrometers are also easy to use and produce data that is relatively easy to interpret. This makes the results of our technique easy to comprehend for a user familiar with peptide mass fingerprinting and does not require more complicated MS/MS analysis or large increases in computational resources as the size of the genome increases. The method described in the paper is validated with a combination of theoretical and empirical analyses. In the former, we use a query set of all proteins from P. aeruginosa to demonstrate the conceptual soundness of the technique. This is followed with an empirical analysis with mass spectra derived from a sample of M. tuberculosis demonstrating that the conceptual soundness of the technique transfers to the experimental environment. The application of the method to the identification of open reading frames from the H. sapiens genome is also demonstrated.

Materials and Methods Preparation of the Genome. The genome is initially prepared for searching. The method9 works equally well using either a complete, assembled genome sequence or an unassembled series of contigs. In the latter case, the method outlined here is simply repeated for each contig in turn. The method is described for the former case for simplicity. Figure 1 depicts the technique in a schematic diagram. The genome sequence is initially sliced into a series of gene segments of a fixed, predetermined length, for example 1050 base pairs (bp) corresponding to a typical protein size of 33 to 37 kDa. The gene segments overlap each other so the second gene segment commences at a point halfway along the first. In the example above, this means the first gene is composed of base pairs 1 to 1050, the second is composed of base pairs 525 to 1575, the third is base pairs 1051 to 2100, etc. Each gene segment is then translated in six frames to form six virtual proteins. Each virtual protein sequence is then cleaved according to the cleavage rules of a proteolytic enzyme, usually trypsin. (Trypsin cleaves a protein on the C-terminal side of each lysine or arginine residue, except where such a residue is immediately followed by a proline residue.) The resulting peptides will include examples containing one or more stop codons. These peptides are artificial and are never seen in an experimental digest of a protein and so are excluded 394

Journal of Proteome Research • Vol. 3, No. 3, 2004

Arthur and Wilkins

Figure 1. Flowchart depicting an overview of the process of annotating a genome using proteomic information.

from the list of tryptic peptides. However, the N-terminal portion of such a peptide, up to the first stop codon, could be found as the last peptide cleaved from a protein. This portion is retained. No assumption can be made about the C-terminal portion of a peptide containing a stop codon, from the residue after the last stop codon until the end of the sequence. However, we include a tryptic peptide for each subsequence of this portion starting with a methionine. This captures potential, typical initial peptides in a protein sequence without assuming the sequence can only start with a methionine. Finally, those peptides with a mass less than 400 Da are discarded, as these masses are not typically seen in a MALDI mass spectrum. The remaining sequences and masses of the virtual peptides are stored in a database. Three specific genomes were used to demonstrate the technique. All genomes were downloaded from GenBank. The theoretical assessment of the method used the P. aeruginosa genome (GenBank accession number NC_002516). This genome was sequenced and annotated by Stover et al.10 and is 6.26 Mbp in length. The genome was prepared with three different segment sizes 732 bp, 1050 bp, and 1362 bp. The M. tuberculosis H37Rv genome (GenBank accession number NC_000962) was used for the experimental analysis. Only one segment size of 1050 bp was used in the experimental analysis. Finally, the demonstration of the technique on eukaryotic data used the sequence of chromosome 22 from H. sapiens. These data were downloaded as a series of contigs. Each contig was sequentially prepared with a segment size of 1050 bp. Identification of Open Reading Frames. A tissue sample from the organism is prepared using standard proteomic techniques.11 This analysis can be performed in a number of ways and the final sample preparation technique is very dependent on the type of sample being studied. The detail of the chosen sample preparation method is largely irrelevant to the later analysis and genome annotation; however, in the interests of completeness, we describe a typical sample preparation here and provide details of the specific sample preparation used to acquire the data for our experimental assessment of the method in a separate section below. Another, detailed example of typical sample preparation and mass spectrometry techniques can be found in Pedersen et al.12 The sample contains the proteome of the organism, i.e., a set of proteins expressed by the genome. The proteins are

research articles

Using Proteomics to Mine Genome Sequences

generally solubilized as well as reduced and alkylated to break any cysteine disulfide bonds. The proteins are then separated according to their isoelectric point using isoelectric focusing. Once this is complete the proteins are further separated according to their molecular weight using two-dimensional (2D) polyacrylamide gel electrophoresis. This results in a 2D gel with a series of spots, each corresponding to a protein expressed by the organism. The spots can be stained and the gel imaged. Application of image analysis to the gel image allows the detection of the location of the spots followed by their excision from the gel. The protein in each spot is then digested using trypsin (or equivalent) and prepared for mass spectroscopy. A MALDI-TOF mass spectrometer measures the mass of the tryptic peptides and these can be extracted from the resulting mass spectrum.13,14 Peptide mass fingerprinting15-19 is used to compare the experimental masses of the peptides to the theoretical tryptic peptide masses from the theoretical digestion of the genome. Each of the experimental masses is compared to all the theoretical masses in the database. An experimental mass matches a theoretical mass if it is equivalent within a specified range around the theoretical mass called the error tolerance. The number of peptides from each virtual protein matching one of the experimental masses is calculated. A “hit” or a potential identification is achieved when the number of matching peptides in the virtual protein exceeds a user-defined threshold called the “minimum to match”. A confident identification of the correct virtual protein is achieved when the number of matching peptides on the segment significantly exceeds the number of matching peptides in the next best hit. Once the matching virtual protein is found, it is trivial to establish the region of the genome coding for the protein expressed in the sample. Preparation of a Query Set of Proteins. A query set of proteins was prepared from the set of known proteins from P. aeruginosa taken from SWISS-PROT20 (release 40.28)sa total of 773 proteins. Each protein was theoretically digested according to the cleavage rules of trypsin to yield a set of peptides whose masses were calculated. Tryptic peptides whose mass was less than 400 Da were discarded, as these masses are not usually seen on a typical MALDI mass spectrum. Sample Preparation for Experimental Analysis. The mass spectra for the experimental part of our analysis were derived from a study of M. tuberculosis proteins secreted into culture. Samples of human sputa from tuberculosis positive individuals were streaked onto tuberculosis culture plates. Colonies of M. tuberculosis were inoculated into 10 mL of GAS broth and incubated, with gentle shaking at 37 °C for two weeks. The cells were centrifuged and inoculated into 100 mL of broth for a further two weeks. The bacteria were pelleted and discarded and protease inhibitor cocktail tablets added. The supernatant was cooled on ice, filtered through a 0.45 mm filter, then a 0.22 mm filter before being stored at -80 °C. The filtrate was thawed and concentrated to about 0.5% of the original volume using an ultrafiltration cell with a 3 kDa cutoff filter. This material was washed in 10 volumes of PBS using a 5 kDa cutoff centrifugal filter. CHAPS was added to the sample to a final concentration of 0.5% (w/v), before dilutingto-a ratio of 1:9 with ice-cold acetone (100%). After 320 min, at -20 °C, the precipitate was pelleted by centrifugation for 10 min (5000 g) at 4 °C before the pellet was resuspended in 30 mL of sample buffer (7 M urea, 2 M thiourea, 2% CHAPS, 5

mM Tris) by sonication for 10 min using and ultrasonic bath. The remaining particles were removed by a 4 min spin at 21 000 g. The resuspended pellet was reduced with a final concentration of 3 mM tributylphosphene for 1 h, and then alkylated for another hour in the dark with 15 mM iodoacetamide at room temperature. The sample (125 mL) was rehydrated overnight on a Pharmacia immobilized pH gradient (IPG) 7 cm, pI 4-7 strip and focused from 40 000-50 000 Vh, with a final current of between 10 and 20 mA. The IPG strip was loaded onto a 1.5 mm Novex Bis Tris 4-12% mini gel and run for 30 min at 5 mA/gel then 30 min at 10 mA/gel and then 30 mA/gel until the tracking dye was within 3 mm of the base of the gel. The gel was stained with colloidal Coomassie G-250 and destained for 1-3 days before imaging and storage in 1% acetic acid. Protein spots were excised using the Proteome Systems Xcise instrument. These spots were washed, digested, and extracted, then desalted and concentrated using micro-ziptips. The resulting peptides were analyzed by automated MALDI-TOF MS on the Bruker Bi Flex III.

Results and Discussion Theoretical Assessment of the Method. A series of computational simulations were undertaken in order to demonstrate the efficacy of the method and determine the optimum parameters. The simplest simulation involved taking the set of known proteins for Pseudomonas aeruginosa. The genome was sequenced and annotated by Stover et al.10 and is 6.26 Mbp in length. The query set of known proteins from P. aeruginosa was taken from SWISS-PROT20 (release 40.28)sa total of 773 proteins. They were prepared as described in the Materials and Methods section. The tryptic peptides of each protein in the query set in turn were searched against the raw genome prepared using the method described in the Materials and Methods section above and depicted in Figure 1. To recap briefly, the genome is prepared by the following: • cutting the nucleotide sequence into overlapping gene segments of equal length; • translating the gene segments in 6 frames to generate virtual proteins; • theoretically cleaving the resulting virtual proteins according to the cleavage rules of trypsin; • and calculating the masses of the peptides and storing these in a database. The tryptic peptides of each query protein are searched against the peptides from each virtual protein stored in the database using peptide mass fingerprinting. When the P. aeruginosa genome is divided into 1050 bp segments, it produces 11 933 genome segments that translate in six frames to 71 598 virtual proteins. The region of the genome coding for a protein is determined by finding the segment with the highest number of matching peptides. This is the “best hit”. The “second best hit” is determined by finding the segment with the next highest number of peptides, excluding those segments connected to the segment with the highest number of peptides through a chain of overlapping segments. This is illustrated in Figure 2. This allows for the fact that the protein may be longer than one or more segments and thus may have a significant number of hits on adjacent segments. To further illustrate the theoretical assessment a specific example is shown in Figure 3. In the example, the known Journal of Proteome Research • Vol. 3, No. 3, 2004 395

research articles

Figure 2. Schematic diagram showing how the best hit and second best hit are defined. The second best hit is the segment having the most hits, when the segments overlapping the best hit are ignored. This is a necessary distinction because the protein sequence may extend across multiple segments.

protein sequence of Aspartate aminotransferase (SWISS-PROT, P72173) from P. aeruginosa was theoretically digested using the cleavage rules of trypsin. The masses of these peptides were searched using peptide mass fingerprinting against the prepared genome of P. aeruginosa. The best hit, indicated by the largest number of matching peptide masses, is genome segment 6711 with 29 matches. However, this protein stretches across multiple segments, namely 6710 (19 matches), 6711 (29 matches), and 6712 (18 matches). The second best hit is therefore segment 4455 with 7 matches. It is important to clarify that the correct hit, as demonstrated in the above example, is indicated by the difference between the number of matching peptides on segments 6710, 6711, and 6712 and the number of peptides matching at random on other segments. It is not the number of matching peptides per se that is important. The matching segments above would still have been identified, albeit with reduced confidence, if the matching segments had had, for example, between 8 and 10 matching peptides each. The computational analysis discussed below examines the extent to which the correct hit can be expected to have a distinctly higher number of matching peptides than found in random matches to other segments throughout the genome. Proof of Concept. To further explore the utility of this approach, all 773 query proteins from P. aeruginosa were

Arthur and Wilkins

binned according to the number of tryptic peptides with mass greater than 400 Da they generated in a theoretical digestion. The first bin contained all proteins with 1 to 10 peptides; the second contained all proteins with 11 to 20 peptides, etc. This approximately reflects the binning of the proteins by their molecular weight. All peptides from proteins were then matched against the P. aeruginosa genome and then stored. The number of matching peptides in the best hit for each of the proteins in the bin was averaged, as was the number of matching peptides in the second best hit. These two numbers were plotted as in Figure 4 to show the difference between the correct hit and the best of the incorrect hits (the latter representing the hits found through random peptide matches alone). The effectiveness of the technique is related to the distance between these two curves. Initial simulations showed that there were insufficient large proteins (i.e., those with greater than 80 tryptic peptides) to generate a meaningful average for these bins. Hence all results are reported for proteins with up to eighty tryptic peptides. The results in Figure 4 show an analysis using a segment size of 1050 bp, error tolerance of 0.1 Da, and a minimum to match value of 7 peptides. There is a distinct difference between the average best hit and the average second best hit (the best of the incorrect hits). The average correct hit has about five matching peptides for the smallest protein through to about 35 for the largest proteins. In contrast, the average second best hit has less than one hit for the smallest proteins studied. This increases steadily to a maximum of about eleven hits for the largest proteins studied. Both curves appear to approach a particular maximum value. This is determined by the size of the genome segments, and hence the size of the virtual proteins, in use. As the segments get larger they can accommodate more matching peptides. This effect will be discussed later. The matching peptides on the second best hit are random occurrences. They occur because a subset of the query masses just happen to randomly match peptides from another segment

Figure 3. Extracts from a screen shot depicting an example of the theoretical assessment of the method. The theoretical peptides from “Aspartate aminotransferase” (P72173) were searched against the genome of P. aeruginosa using the method described in the paper. The best hit is segment 6711 with 29 matches and the second best hit, as defined in the main text, is segment 4455 with 7 matches. 396

Journal of Proteome Research • Vol. 3, No. 3, 2004

Using Proteomics to Mine Genome Sequences

Figure 4. Graph depicting the results of a simulation to demonstrate the effectiveness of the method. On average, for all proteins in P. aeruginosa, the best hit (filled square), corresponding to the coding region has more peptide hits than the next best hit (open square). This distinction is particularly evident in large proteins but decreases as the proteins become smaller.

Figure 5. Reducing the error tolerance improved the efficacy of the method. The average top hit is indicated with filled symbols (square: 0.1 Da, upward triangle: 0.01 Da, downward triangle: 0.001 Da, diamond: 0.0001 Da) while the average second hit is indicated with the respective open symbols.

of the genome. In order for a coding region to be correctly identified, the correct match must match more peptides than typically found in the best of these random matches. It is clear from the figure that, on average, there is a three to 4-fold difference between the number of peptides matching the correct hit and the number of peptides matching the best of the incorrect hits. Optimization of the Approach. To assess the technique, we repeated the above analysis while changing various parameters from either the preparation of the virtual proteins from the genome or the peptide mass fingerprinting identification of the open reading frames. Figure 5 shows the effect of reducing the error tolerance on the efficacy of the method. Four simulations were performed on the test set of proteins described above using a segment size of 1050 bp and a minimum to match of seven peptides. The error tolerance was varied from 0.1 to 0.0001 Da in orders of magnitude. The results show the efficacy of the method improves as the error tolerance is reduced. There is very little effect on the best hit with at most one or two fewer peptides matches. In contrast, the effect on the second best hit is dramatic, reducing it to less than four across the entire range of proteins. By reducing the error tolerance the chance of finding a random match for any query mass is reduced. The best hit is almost entirely composed of legitimate matches, hence reducing the error tolerance has little effect on the result. However,

research articles

Figure 6. Increasing the minimum to match improves the efficacy of the method. The average top hit is indicated with filled symbols (square: 20 peptides, upward triangle: 10 peptides, downward triangle: 7 peptides, circle 4 peptides) while the average second hit is indicated with the respective open symbols.

Figure 7. Increasing the segment size improves the efficacy of the method. The average top hit is indicated with filled symbols (square: 732 bp, upward triangle: 1050 bp, diamond: 1362 bp) whereas the average second hit is indicated with the respective open symbols.

the second best hit is almost entirely composed of false, random matches and reducing the error tolerance quickly eliminates these matches. Figure 6 shows the effect of increasing the minimum to match value on the efficacy of the method. Four simulations were performed on the test set of proteins described above using a segment size of 1050 bp and an error tolerance of 0.1 Da. The minimum to match value was respectively set at 20, 10, 7, and 4 peptides. The results show the efficacy of the method improves as the minimum to match is increased. There is little effect on the best hit except once the minimum to match value exceeds, or nearly exceeds, the number of masses in the query protein. A protein with, for example, five query masses will not be detected if a minimum of 10 matches are required. There is a stronger effect on the average second best hit. As the minimum to match increases, an increasing number of the second best hits in each bin will be screened out because they have fewer matches than the required threshold. This results in a lower average for that bin, demonstrating an improved efficacy for the method. Figure 7 shows the effect of increasing the segment size on the efficacy of the method. Three simulations were performed on the test set of protein described above using an error tolerance of 0.1 Da and minimum to match of seven peptides. The segment size was respectively set at 732, 1050, and 1362 base pairs. This corresponds to protein sizes of approximately 25, 38, and 50 kDa, respectively. The results show the efficacy of the method improves as the segment size is increased. The Journal of Proteome Research • Vol. 3, No. 3, 2004 397

research articles

Figure 8. Increasing the signal to noise ratio improves the efficacy of the method. The average top hit is indicated with filled symbols (square: 0% replaced peptides, upward triangle: 20%, diamond: 40%, circle: 60%, downward triangle: 80%) whereas the average second hit, indicated for 0% replaced peptides only, is shown with an open square.

improvement results in an increase in the average number of matching peptides in the top hit. As the size of the segment increases, more of the query masses from a particular protein fall in the one segment rather than being spread across adjacent segments. However, the size of this effect varies with the size of the query protein. For smaller proteins, with less tryptic peptides, the change is almost negligible, whereas for large proteins, approximately 10 extra matches are added for each increase in the segment size. This demonstrates the limit to the improvement to be gained by increasing the segment size. Once the segment size exceeds the size of the query protein, any further increases in the segment size will have no effect on the hit, as all the matching peptides will already be contained within the segment. The effect on the second best hit is very small. There is only a slight increase, one to two matches, in the average number of matches with each increase in segment size, reflecting the small chance of an extra random match occurring on the same segment. The above simulations have one common feature. The query set of peptide masses is composed of the complete set of tryptic peptides from a P. aeruginosa protein of mass greater than 400 Da. In an experimental situation, not all of the masses measured on the mass spectrometer arise from the protein in the sample. Some spectra contain more than one legitimate protein. Others may have become contaminated during the experimental process. Some peptides are not ionized efficiently in the source of the mass spectrometer, and finally, the determination of the mass of a particular peptide may be incorrect or overlooked. To simulate this inclusion of spurious masses and removal of legitimate masses, we performed five simulations where a certain percentage of the original query masses were replaced by masses randomly drawn from a list of the masses found by theoretically digesting all protein sequences in SWISS-PROT and TrEMBL. Figure 8 shows the results of this analysis. The simulations were performed with a segment size of 1050 bp, error tolerance of 0.1 Da, and minimum to match of seven. The percentage of original query masses replaced with random masses was 0%, 20%, 40%, 60%, and 80% in each successive search. The effect on the best hit is noticeable. Each successive increase in the number of query masses replaced by random masses decreases the number of matches in the best hit. The effect is relatively independent of the size of the query protein with about 5 to 7 matches lost with each successive reduction in the signal-to-noise ratio. In regard to the second best hit, 398

Journal of Proteome Research • Vol. 3, No. 3, 2004

Arthur and Wilkins

the only meaningful curve is the original curve where the full set of query masses is used. As more random masses are introduced, the concept of the best, or “correct” hit, is less well defined. Increasingly, it will be lost completely, and in this case the original second best hit is artificially promoted to be the best hit. Hence, the average second best hit will be artificially reduced. Once about 60% of the original query masses have been replaced, the best hit becomes indistinguishable from the original second best hit. This indicates that on average 40% or more of the peptides in the query protein are required in order to effectively separate the correct hit from the random hits. It is important to remember this figure is an average quantity. Some proteins may be correctly identified with a smaller percentage of the peptides present while other may require more peptides to be present. Experimental Assessment Section. A culture of M. tuberculosis was used as the source of proteins for an experimental analysis of the genome annotation technique. The sample was prepared and the proteins separated using 2D gel electrophoresis (see Materials and Methods). A number of spots were cut from the gel, digested with trypsin, and the peptides analyzed with MALDI-TOF mass spectrometry. These peaks were analyzed using standard peptide mass fingerprinting techniques to identify the proteins contained in each spot. The genome of M. tuberculosis was segmented into 1050 base pair segments, translated, and digested as described in the outline of the method. The experimental masses were searched against the genome using peptide mass fingerprinting as described in the theoretical outline. Twenty spots previously identified by other techniques were chosen as a set to test our approach with empirical data. The proteins were known to have high quality mass spectra, as indicated by a large number of the tryptic peptides being seen. The proteins included some isoforms that appeared as trains of spots on the 2D gel. However, nine of these twenty spots actually contained more than one protein. Although the spots were chosen on the basis of high coverage, the secondary protein or proteins in the same spot have less coverage. These spots were also analyzed resulting in an analysis of 29 proteins having a range of coverage from 16% to 83%, with an average of about 60%. The masses from each spot were searched against the genome using the technique described above. As the proteins in the sample contained artifactual modifications, the peptide mass-fingerprinting algorithm also included a check for matches to the modified form of the peptide. An error tolerance of 0.1 Da was used. The minimum to match was varied between 4 and 6 peptides as needed to screen out large numbers of random matches. Once the peptide mass fingerprinting was complete, the results were examined to find the segment with the highest number of matches. In 12 of the 29 cases, the matching segment of the genome was easily and clearly identified due to the large number of unambiguous peptide matches for the correct hit and a small number of matches for the subsequent hits. In the remaining 17 cases, the number of matching masses was similar for a small group of segments, making it unclear which segment was the correct identification based on the number of hits alone. However, a review of the matching peptides for each of these segments clearly divided them into two classes. In the first, the vast majority of the matching peptides were contained in one contiguous string of

Using Proteomics to Mine Genome Sequences

research articles

Figure 9. Identification of the coding region stretching over segments 800 to 803 for “Chaperone protein dnaK” (P32723).

amino acid residues between two stop codons. In the second, the peptides were scattered over the sequence, almost completely separated by stop codons. Those in the first class were clearly the parts of the genome that are the matching segments. Once the segments were found, the contiguous region of amino acid containing the most peptides between two stop codons was matched against SWISS-PROT using BLAST. The top hit was taken as the identification of the protein using the genome annotation technique and compared to the identity found by standard peptide mass fingerprinting. The identities from both techniques matched for all 29 of the proteins examined, proving the accuracy of our technique. On a secondary point, although our theoretical analysis indicated that, on average, more than 40% of the peptides from a protein were required in order to identify the protein-coding region, the experimental analysis demonstrates it is possible to identify some proteins with a much lower coverage. As mentioned above, some of the proteins in this analysis had as little as 16% coverage. Two examples are provided to demonstrate this. In the first, shown in Figure 9, four consecutive segments (800-803) received 10, 12, 12, and 6 peptide matches, respectively. All other segments had less than 6 matches and were screened out using a minimum to match of 6 peptides. This indicates the protein found on the gel matches the region of the genome stretching across these four segments. The protein sequence of segment 802 was compared to all the proteins in the SWISSPROT database using BLAST. The protein was thus identified as “Chaperone protein dnaK” (P32723). This protein of molecular weight 66.7 kDa exactly matches the identification determined by standard peptide mass fingerprinting, indicating the technique correctly identified the region of the genome coding for the protein of interest. In the second example, shown in Figure 10, two regions of interest were found. The first involved segments 7308 (5 matches) and 7309 (7 matches), the second involved segments 8290 (7 matches) and 8291 (6 matches). There was one other segment with 6 matches. All of the other segments had less

than 6 matches. Those with fewer than 5 matches were screened out using the minimum to match parameter. The number of matches in these two regions is still more than the number of matches in the background hits. However, because it was close to the background number, all the segments were examined to check for contiguous regions of matching peptides described above. The two identified regions above had all but one of their matches in a contiguous region. All of the other hits were eliminated on these criteria except 802. At this stage, given our past identification of segment 802 as P32723, a 67 kDa protein, we can eliminate this segment also because we know the spot was cut from a low molecular weight region of the gel. The portion of the protein sequence between two stop codons having the most hits was, in each case, submitted to BLAST as described above. The first region identified as “10 kDa chaperonin” (P09621) and the second region identified as “10 kDa culture filtrate antigen cfp 10” (O69739). Both these proteins were found in this spot using standard peptide mass fingerprinting. These proteins did not stand out as clearly as in the first example, but were still identifiable through further verification as described above. This demonstrates the technique can also work when multiple proteins are located in the one spot and when the proteins being searched for are small. This is a key issue as small open reading frames are some of the most difficult to find using standard open reading frame identification techniques. Application to Eukaryotic (human) Data. The method can, in principle, be applied to higher order genomes including the human genome. To demonstrate this, the genome sequence of chromosome 22 of H. sapiens was prepared and searched using the genome annotation technique. A theoretical peak list was generated using the sequence of “Apolipoprotein L5” (Q9BWW9), known to be located on chromosome 22. This peak list was searched against the entire sequence of chromosome 22 using the genome annotation method using an error tolerance of 0.1 Da and a minimum to match of 10 peptides. Figure 11 shows the result of this search. There were 12 hits Journal of Proteome Research • Vol. 3, No. 3, 2004 399

research articles

Arthur and Wilkins

Figure 10. Identification of the coding regions for “10 kDa chaperonin” (P09621) on segments 7308 and 7309, and ”10 kDa culture filtrate antigen cfp 10” (O69739) on segments 8290 and 8291. This demonstrates the method also works for small proteins.

with between 10 and 23 matches. Examining the details of each of these in turn shows all except two of these hits involve matches to repeat regions in the genome i.e., the same peptide occurs multiple times repeatedly resulting in an artificially high number of matches. The remaining two hits are on overlapping segments. Comparing the sequence of these segments to all the proteins in SWISS-PROT using BLAST identifies the protein as “Apolipoprotein L5” (Q9BWW9).

Conclusion The above results demonstrate the effectiveness of the method in detecting the region of a genome coding for a particular expressed protein. The theoretical analysis allows us to draw some conclusions about the best parameters to use when applying the method. The error tolerance should be chosen to be as small as possible. In practice, this is limited by the experimental data. Data generated to a higher level of accuracy, for example, from internally calibrated MALDI-TOF spectra or spectra from Fourier transform mass spectrometry (FT-MS), allows a smaller error tolerance to be used. This, in turn, improves the ability 400

Journal of Proteome Research • Vol. 3, No. 3, 2004

of the technique to eliminate random matches, and correctly identify the coding region. The minimum to match parameters should be chosen on the basis of the size of the query protein. Generally, the higher the value, the more false positive matches will be eliminated. If the proteins being studied are from a higher molecular weight region of the gel, then the parameter can be increased to make use of this effect. However, when working in the low molecular weight region of the gel, a low minimum to match is required to avoid obtaining a false negative. In practice, the parameter will generally be set high and the search repeated while gradually lowering the minimum to match threshold until a point where identity is clearly identified by showing a gap between the correct hit and the random hits. It is sometimes useful to adopt this procedure even when a clear identity is found at a high minimum to match to check for the possibility of a second protein being contained in the spot. The segment size should be chosen to be representative of the expected size of proteins in the sample. Larger segment sizes make it easier to identify larger proteins while having little or no effect on smaller proteins. This suggests using a size as

Using Proteomics to Mine Genome Sequences

research articles

Figure 11. Segment with the highest number of matches (36302) is identified as the coding region for “Apolipoprotein L5” (Q9BWW9). Other segments are eliminated because they contain many repeat regions.

large as the largest expected protein. In this way, a single comparison against a genome with a relatively large segment size can potentially be used to identify all the coding regions within the genome. However, although there is no reduction in the ability to detect smaller proteins with a larger segment size, smaller segment sizes allow the region of the genome identified to be more exactly defined. A segment size of 1050 bp gives twice as much resolution to the method as a segment size of 2100 bp. Hence, in practice, the best method is to use a range of segment sizes, starting with a larger segment size to perform the identification, and then reducing this if necessary to more narrowly define the coding region. Such an analysis could easily be automated with the query masses being submitted sequentially to the method using increasingly narrow segment sizes. Reviewing the method itself, a number of possible modifications or alterations were considered. These include the following: Is It Necessary to Overlap the Segments? The segments are overlapped to facilitate the process of identifying the region of the genome coding for the protein of interest. In some cases, the masses from the protein of interest could be distributed across two adjacent segments, with a portion of the masses at the end of one segment and a second portion at the start of the next segment. This means the number of masses on each of the two segments will be closer to the background number of random, “noise” matches found on the remaining segments making it harder to identify the hit. However, by using overlapping segments the masses at the end of one segment and the start of the next will all be located on the common, overlapping segment. This means the number of masses on the common, overlapping segment will be further from the background number of random, “noise” matches making it easier to identify this as the correct location of the proteincoding region in the genome. In principle, the overlap is not necessary for the method to work, but greatly facilitates the identification process. Is It Necessary to Segment the Genome? In principle, the segmentation of the genome before digestion and analysis with MALDI-TOF data is optional. Indeed, the three previously

reported uses of mass spectrometry data to find protein identification in genomic data6-8 use the single genome sequence without segmenting except to provide computational efficiency. However, this paper demonstrates the advantages of segmenting the genome prior to searching it with proteomic data. The genome is segmented to enable easier identification of the protein-coding region of the genome. The genome is segmented into fixed sections, regardless of the length or possible location of the protein coding regions. This makes it possible to establish the noise (or background number of hits) expected when matching data against the entire genome. It is difficult, or near impossible, to establish this conclusively if a definite segment length is not used. Understanding the number of background or random matches to the peptide masses can help to determine the protein coding regions, as when the number of matches exceeds the number of random matches on other segments, a protein-coding region is indicated. If the genome were not segmented, it would be difficult to determine in a systematic way when a concentration of hits was indicative of a protein-coding region. It would be necessary to look for a certain number of hits in a certain length region, but the exact value of this parameter would need to be predetermined to be of use in assessing the results. Each segment of the genome simulates a protein (the translation of a certain region of a genome). By segmenting, the peptide identification process is analogous to peptide mass fingerprinting. In this manner, users can readily assess the outcomes of this approach without needing to learn and understand the complexities of a new protein identification paradigm. Previous studies6,7 described the need to break the genome into smaller pieces in order to reduce the computational demand of working with a whole genome. Our estimate also suggests segmenting the genome will lead to a number of improvements in computational performance. In particular, working with a whole genome at once is demanding in terms of computer memory. Smaller segments can be analyzed sequentially and thus require less memory at any particular point in the calculation. Journal of Proteome Research • Vol. 3, No. 3, 2004 401

research articles Finally, segmenting the genome provides important utility in terms of the ability of a user to visualize the results of the genome annotation. Our method presents the results of the analysis to users in a form almost identical to the output of standard peptide mass fingerprinting analysis. This aids in interpretation of the results, as many users are familiar with the common parameters such as number of hits and percent coverage. Our method also negates the need to develop a novel, graphical user interface to enable the adequate browsing of a complete genome as opposed to focusing on the relevant segments of the genome. Can the Method be Used for Protein Identification? The method is specifically designed for use with unannotated (and perhaps unassembled) genomes. For these genomes, there will be few (if any) protein sequences available in public sequences databases. This means that standard peptide mass fingerprinting against the protein sequence information will not yield a protein identification for most of the proteins expressed in the sample. The method described in the paper can be used to identify the protein and annotate the genome at the same time. The method is also useful where a large number of the protein sequences exist in public sequence databases, but most of these entries are derived from ORF prediction algorithms rather than experimental evidence. As outlined in the Introduction, there are a number of known limitations to the various ORF prediction algorithms. Our method provides a technique for identifying proteins that may have been overlooked in the ORF prediction process. The implementation of the method described above assumes the enzyme used to digest the gel spots is trypsin. This is the most common enzyme used experimentally. Thus, the theoretical digestion of the segments is also done using the cleavage rules of trypsin. However, the method can use any appropriate enzyme to digest the experimental proteins. In this case, the theoretical digestion of the genome segments needs to use the cleavage rules for the enzyme to be used in the experimental analysis. If the experimental analysis is done with multiple enzymes, it is possible to use the findings from multiple searches with each of the enzymes to confirm the identification of the region of the genome. If both analyses identify a certain region of the genome as a possible protein-coding region, then the region is more likely to be correctly identified as such. It is possible that each analysis may not have enough hits to be clearly distinguished from the background but because multiple analyses indicate the same region, it can still be identified as the protein-coding region. It is also possible to take missed cleavage peptides and modified peptides into account. When the cleavage rules are used to determine the theoretical peptides, the sequence of peptides resulting from a missed cleavage can also be calculated. This allows the mass of these peptides to also be determined. During the peptide mass-fingerprinting search these masses can also be compared to experimental masses. Similarly, one can calculate the mass of a modified form of each of the peptides and check these masses also when comparing against the experimental masses. The method can also be used to identify pseudo-genes. A genome may contain pseudo-genes (regions of the genome resulting from gene duplication or retro-transposition made inactive through evolution of the genome). These regions may have considerable sequence similarity to other coding regions.

402

Journal of Proteome Research • Vol. 3, No. 3, 2004

Arthur and Wilkins

Thus applying the method to a protein spot may result in multiple regions of the genome being identified: one corresponding to the protein coding region and others corresponding to pseudo-genes derived from that protein. Similarly, the genome may contain families of protein with very similar sequences. An application of the method in this case may again identify multiple regions of the genome corresponding to this family of proteins. Finally, although it has not been illustrated in this research, the method could be used to locate coding and noncoding regions in eukaryotic genomes to help identify exon and intron boundaries. As the identification of intron-exon boundaries can be difficult, we imagine that this may become a key application of this technique.

Acknowledgment. The authors thank Sushila Thomas for making the mass spectra from her analysis of M. tuberculosis culture available for use in the empirical testing of this technique. Xcise is a trademark of Proteome Systems Ltd. References (1) Wheeler, D. L.; Church, D. M.; Federhen, S.; Lash, A. E.; Madden, T. L.; Pontius, J. U.; Schuler, G. D.; Schrimi, L. M.; Sequeira, E.; Tatusova, T. A.; Wagner, L. Nucleic Acids Res. 2003, 31, 28-33. (2) Mathe, C.; Sagot, M.-F. S. T.; Rouze, P. Nucleic Acids Res. 2002, 30, 4103-4117. (3) Burset, M.; Guigo, R. Genomics 1996, 34, 353-367. (4) Rogic, S.; Mackworth, A. K.; Ouellette, F. B. F. Genome Res. 2001, 11, 817-832. (5) Pavy, N.; Rombauts, S.; Dehais, P.; Mathe, C.; Ramana, D. V. V.; Leroy, P.; Rouze, P. Bioinformatics 1999, 15, 887-899. (6) Kuster, B.; Mortensen, P.; Andersen, J. S.; Mann, M. Proteomics 2001, 1, 641-650. (7) Choudhary, J. S.; Blackstock, W. P.; Creasy, D. M.; Cottrell, J. S. Proteomics 2001, 1, 651-667. (8) Giddings, M. C.; Shah, A. A.; Gesteland, R.; Moore, B. Proc. Natl. Acad. Sci. 2003, 100, 20-25. (9) Arthur, J. W.; Wilkins, M. R. Annotation of genome sequences 2003, International Patent Application Number PCT/AU03/00300. (10) Stover, C. K.; Pham, X. Q.; Erwin, A. L.; Mizoguchi, S. D.; Warrener, P.; Hickey, M. J.; Brinkman, F. S. L.; Hufnagle, W. O.; Kowalik, D. J.; Lagrou, M.; Garber, R. L.; Goltry, L.; Tolentino, E.; WesbrockWadman, S.; Yuan, Y.; Brody, L. L.; Coulter, S. N.; Folger, K. R.; Kas, A.; Larbig, K.; Lim, R.; Smith, K.; Spencer, D.; Wong, G. K. S.; Wu, Z.; Paulsen, I. T.; Reizer, J.; Saier, M. H.; Hancock, R. E. W.; Lory, S.; Olson, M. V. Nature 2000, 406, 959-964. (11) Wilkins, M. R., Williams, K. L., Appel, R. D., Hochstrasser, D. F., Eds.; Proteome Research: New Frontiers in Functional Genomics; Springer-Verlag: Berlin, 1997, Heidelberg. (12) Pedersen, S. K.; Harry, J. L.; Sebastian, L.; Baker, J.; Traini, M.; McCarthy, J. T.; Manoharan, A.; Wilkins, M. R.; Gooley, A. A.; Righetti, P. G.; Packer, N. H.; Williams, K. L.; Herbert, B. R. J. Proteome Res. 2003, 2, 303-311. (13) Breen, E. J.; Hopwood, F. G.; Williams, K. L.; Wilkins, M. R. Electrophoresis 2000, 21, 2243-2251. (14) Breen, E. J.; Holstein, W. L.; Hopwood, F. G.; Smith, P. E.; Thomas, M. L.; Wilkins, M. R. Spectroscopy 2003, in press. (15) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. 1993, 90, 5011-5015. (16) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Biochem. Biophys. Res. Comm. 1993, 195, 58-64. (17) Mann, M.; Hojrup, P.; Roepstorff, P. Biol. Mass Spec. 1993, 22, 338-345. (18) Pappin, D. J. C.; Hojrup, P.; Bleasby, A. J. Curr. Biol. 1993, 3, 327332. (19) Yates, J. R. I.; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Anal. Biochem. 1993, 214, 397-408. (20) Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.-C.; Estreicher, A.; Gasteiger, E.; Martin, M. J.; Michoud, K.; O’Donovan, C.; Phan, I.; Pilbout, S.; Schneider, M. Nucleic Acids Res. 2003, 31, 365-370.

PR034056E