Report
Mining Genomes with MS
Searching protein and nucleotide databases with MS data allows accurate identification of amino acid sequences

John R. Yates, III
Ashley L. McCormack
Jimmy Eng
University of Washington

Analytical Chemistry News & Features, September 1, 1996

The recent sequencing of the Haemophilus influenzae genome has provided the scientific community with its first look at the complete genome of a living organism (1). The success of this sequencing venture and the resulting unique insights into the biology of a single-cell organism will encourage new ventures into whole genome analysis of organisms other than just those targeted for sequencing under the human genome project. As sequencing technology and systems integration for production sequencing improve, we believe that whole genome sequencing will become a standard technique for the study of an organism.

The existence of a genome sequence for an organism provides a powerful infrastructure to facilitate experimentation. For example, simultaneously examining changes in the expression of a large number of genes as a function of cellular perturbation or activation will be straightforward with the construction of ordered arrays of DNA sequences from known genes (2).

Exploring and using the vast amount of information encoding a genome will require a new synergy between computer technology and biological experimentation. Powerful software tools that can model and manipulate biological data will be necessary for analyzing experiments in the context of genomic sequence information. Among recent software developments are powerful new approaches for correlating mass spectra of peptides and proteins to sequences in the database. These new software tools, which allow fast, accurate, and automated protein identification, are contributing to a revolution in the practice of protein biochemistry.

Protein and nucleotide databases

Collecting, assimilating, and disseminating data are central to the scientific process. To permit widespread access to sequence information, several sequence depositories are maintained around the world. In general, databases contain two levels of information: The first level contains the nucleotide or amino acid sequence information, and the second level provides annotations to the sequence, which may include a description of the organism the sequence comes from, relevant references, sequence features, and an accession number.

The sequence information consists of a string of characters representing the amino acids or nucleotides. The characters G, C, A, and T are used to represent deoxynucleotide residues, and three nucleotides are used to code for one amino acid residue. There are 64 (4^3) possible combinations of deoxynucleotides that code for 20 amino acids (61 codons) and stop signals (3 codons), leading to redundancy for some of the codons used for amino acids. A nucleotide sequence can be translated to a protein sequence by converting the triplet codons to their respective amino acid sequences. A region of nucleotide sequence that
can be translated to an amino acid sequence over some length (>50-100 bases) before a stop codon is encountered is called an open reading frame and is most likely a gene. (By random chance, a stop codon should occur within 21 codons.) Three reading frames exist for the translation of a nucleotide sequence in the 5' to 3' direction, and an additional three reading frames on the complementary strand are also read in the 5' to 3' direction. Because the strand that contains the coding region for a gene is not always known, a complete 6-reading frame translation is usually necessary to determine potential open reading frames.

Nucleotide sequence data from genomic DNA or expressed genes are stored in databases such as GenBank. Expressed gene sequences, or expressed sequence tags (ESTs), are generated by extracting mRNA from cells and converting it to cDNA with reverse transcriptase (3). The nucleotide sequence of the cDNA is then determined with a single sequencing analysis. An EST (cDNA) sequence can be 300-700 bases in length but may contain a higher percentage of errors (1-3%) than the genomic sequence. The sequences of expressed genes, because of the obvious connection to proteins, are stored in a new database called dbEST.

0003-2700/96/0368-534A/$12.00/0 © 1996 American Chemical Society

A single-letter code also exists for the 20 amino acids. Because the amino acids asparagine (N) and glutamine (Q) can be hydrolyzed under acidic conditions to the corresponding acids aspartic acid (D) and glutamic acid (E), respectively, the letters B and Z are used to indicate ambiguities in the designation of N/D and Q/E. When the identity of an amino acid at a specific location is unknown or unresolved, it is designated by the letter X. It is not uncommon for amino acid residues to be modified to other forms, but this information is not directly stored as part of the character-based sequence information. Any known information about modifications to protein sequences may be included in sequence annotations. The Protein Information Resource and Swiss-Prot databases are two good sources of information about known modifications to proteins.

Searching with amino acid sequence information

Once a "new" amino acid sequence has been obtained, the first step is to determine whether that sequence or a similar sequence exists in a database. By using string search or sequence alignment algorithms, an exact match between strings can be found if the sequence exists in the database. Sequence alignment algorithms can also be used to compare sequences to determine whether another sequence exists in the database that may be evolutionarily related. If functional information is available for the related sequence, clues about the function of the new sequence can be inferred. One frequently used alignment program is accessible by the e-mail server at the National Center for Biotechnology Information (blast@ncbi.nlm.nih.gov), which runs the Basic Local Alignment Search Tool (BLAST) (4); sequences can be e-mailed to the server, and a result is returned by e-mail within a few minutes. Access to BLAST (http://ncbi.nlm.nih.gov) and other search programs has been simplified by the development of the World Wide Web and Web browsers.

By using sequence alignment methods, a protein sequence or a related sequence can be found to identify or gain insight into the putative function of a protein. Given the rapid increase in the amount of information in sequence databases, database searches have become an important component of sequence studies. Early identification of a protein as "known" can eliminate the need for further characterization and save a lot of time and effort. In addition, identifying similarity to a protein of known function can provide important clues and experimental direction to determine the function of the new protein.
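The 6-reading frame translation described in the database section above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used by any database or search program: the function names are hypothetical, the standard genetic code is assumed, and the minimum open reading frame length of 17 codons (roughly 50 bases) is an assumed cutoff taken from the lower end of the range quoted earlier.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate consecutive triplets; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three frames on each strand (hypothetical helper)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def open_reading_frames(protein, min_codons=17):
    """Split a translation at stop codons and keep the long stretches.

    min_codons=17 is an assumed cutoff, not a value from the article.
    """
    return [seg for seg in protein.split("*") if len(seg) >= min_codons]
```

Calling `six_frame_translation` on a nucleotide string returns a dictionary keyed by frame label (`+1` to `+3` and `-1` to `-3`), and each translation can then be passed to `open_reading_frames` to list candidate coding stretches.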
Because interpretation of MS data to derive an unambiguous sequence can take hours to days, efficiency can be greatly increased by prior determination of whether the data represent a known sequence. The ability to quickly identify a known protein in a biological process may also present new information to help understand the protein's function or the process under study. For example, by sequencing peptides with cytotoxic activity in melanoma-specific T-cell assays, it was discovered that several tumor antigens associated with melanoma were derived from the sequences of known proteins (5).

Searching with molecular weight information
Mass spectrometers produce two types of information: mass-to-charge (m/z) information from the measurement of an intact protonated molecular ion or detailed
structural information resulting from deliberate cleavage of covalent bonds within the ion (Figure 1a). Determining the molecular weight of an intact known protein can provide information about covalent modifications, proteolytic processing, or even sequence errors, but it does not pinpoint such changes to specific sites. More detailed information must be obtained by site-specific proteolysis of the protein and determination of the masses of the resulting peptides. Such peptide mapping has become a method for confirming protein sequences or identifying the locations of modifications (6, 7). A less obvious implication of peptide mapping is that the pattern produced by any given sequence can be used as an identifying characteristic.

In 1993, five papers appeared that described computer algorithms for searching protein databases with peptide mass maps (8-12). By comparing the "unknown" protein's mass map with those predicted for all the sequences in the database, the protein could be identified. Obviously, the more complete the mass map, the more accurate this identification. Experimental circumstances, however, can complicate the mapping process. For example, it is unusual to completely recover all peptides, particularly if proteolysis is performed with small quantities of material in a polyacrylamide gel or with electroblotted proteins. Furthermore, modifications to a protein sequence such as phosphorylation and glycosylation alter the mass of some fragments and lead to fragments with molecular weights that cannot be placed in the sequence. In practice, a complete mass map is not usually needed to identify a sequence; as few as three peptides can form a unique map and allow protein identification (11). The identification of a protein sequence from a highly conserved family is also problematic, and a mass map may hit similar protein sequences from closely related organisms.

Figure 1. Protein information provided by MS. (a) Mass measurement of the peptides derived by site-specific proteolysis of proteins and (b) detailed amino acid sequence information produced by CID.

In most mass-mapping search algorithms, protein identification is determined by the number of masses that match those predicted. Results obtained by searching EST databases with a 6-reading frame translation have shown that as the number of sequences increases, the amount of information necessary to obtain an unambiguous match also increases (13). Some search algorithms allow the inclusion of additional information such as the molecular weight of the protein (obtained from SDS-PAGE or mass measurement by MS), amino acid composition or sequence data, and mass maps obtained by proteolysis with a second enzyme.

Simple derivatization experiments, some of which can be done directly on a sample probe, can also quickly provide limited composition information. Acetylation of a peptide will reveal the number of free amino groups, which can indicate whether lysine is present. Esterification will indicate the number of aspartic acid or glutamic acid residues present. Deuterium exchange of the acidic protons will also give information relevant to the amino acid composition. All of these methods increase the level of confidence in the match to the protein sequence, but ultimately it is best to obtain the identification in a single experiment. As databases grow in size, it will be more important to have sophisticated
scoring methods that can be used to reduce ambiguity in the search results and flag false positive results. An elegant scoring routine for weighting mass maps to protein sequences, dubbed MOWSE (11), uses a scheme that assigns a frequency factor based on the number of times that a peptide's m/z value occurs in a protein of a particular molecular weight. These frequencies are precalculated for peptides created with commonly used site-specific proteases. For example, the number of times peptides with a molecular weight of 1000 ± 100 Da occur in proteins of 35 ± 10 kDa is determined. Whenever a peptide's m/z value matches a predicted fragment within a protein, the frequency is assigned on the basis of that peptide's molecular weight and the protein's molecular weight.

Measurement of the statistical relevance of m/z values for collections of peptides in relation to all the sequences present in the database and incorporation of such measures into scoring routines will help improve and define the specificity inherent in m/z information as well as the constraints needed to uniquely and unambiguously identify proteins with peptide mass maps. This information should be easy to calculate as the genomes of organisms are sequenced.
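The simplest form of mass-map searching described above, counting how many observed peptide masses match those predicted for each database sequence, can be sketched as follows. This is a minimal illustration under stated assumptions and is not MOWSE or any of the published algorithms: the function names are hypothetical, trypsin is assumed to cleave after K or R except before P, monoisotopic residue masses are used, and the 0.5-Da matching tolerance is an arbitrary choice.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, protein, tolerance=0.5):
    """Number of observed masses within tolerance of a predicted peptide."""
    predicted = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(
        any(abs(mass - p) <= tolerance for p in predicted) for mass in observed
    )


def rank_proteins(observed, database, tolerance=0.5):
    """Rank database entries (name -> sequence) by number of matches."""
    scores = {
        name: count_matches(observed, seq, tolerance)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A frequency-weighted scheme of the kind described for MOWSE would replace the plain match count with a score built from precalculated peptide-mass frequencies; that step is deliberately left out of this sketch.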
Tandem MS data

Searching databases with a peptide mass map provides a powerful method for correlating mass spectral information to a sequence database. Because a set of peptide masses is necessary to attain some level of statistical relevance, this method is limited primarily to situations in which at least several peptide masses from the same protein have been observed. It can also be complicated by the presence of peptides derived from a mixture of proteins. For example, an important immunological problem for which mass mapping is ineffective is the analysis of peptides presented by major histocompatibility (MHC) molecules. In this immunological system, peptides are derived from intracellular and extracellular protein antigens that are proteolyzed, processed, and loaded into the binding site of the MHC molecule and then transported to the cell surface. Peptides present on the cell surface are derived from a large number of different proteins and, depending on the subtype of the MHC molecule, as many as 3000 different peptides may be present. By combining microseparation techniques with tandem MS, many immunologically important peptides have been sequenced (5, 14).

In peptide sequence analysis by tandem MS, a peptide ion is selected with the first mass analyzer and passed into a collision cell filled with a neutral, inert gas such as argon or helium (15). The ions undergo collision-induced dissociation (CID) to produce a set of fragment ions, which are then separated by m/z value in the second mass analyzer. Peptides fragment under CID conditions to produce a ladder of sequence ions in which the difference in mass between two related fragment ions corresponds to the molecular weight of an amino acid residue (Figure 1b). Because a fragment ion series can retain charge on the N-terminus (a-, b-, c-, d-ions) or the C-terminus (x-, y-, w-, z-ions) of the peptide, the sequence ions observed in the tandem mass spectrum can provide redundant information (15, 16). Although this information is complementary, it adds to spectral complexity and complicates interpretation of the spectrum, and deriving amino acid sequence information from a tandem mass spectrum is often the rate-limiting step. Amino acid sequence information can be obtained from the fragmentation patterns observed by using low-energy collisions on triple quadrupole mass spectrometers (15), magnetic-sector hybrids (17), and ion traps (quadrupole and ion cyclotron resonance) (18, 19); high-energy collisions on four-sector mass spectrometers (20, 21); and postsource decay processes observed on matrix-assisted laser desorption/ionization (MALDI) reflectron time-of-flight (reTOF) mass spectrometers (22).

Searching with sequence and m/z data from tandem mass spectra. Mann and Wilm have developed an approach for searching databases based on the information that can be derived by partial interpretation of a tandem mass spectrum (23). The user generates a short sequence string, 2-4 amino acids in length, by interpretation of the tandem mass spectrum. This string alone is not very specific, but when combined with information such as the peptide's molecular weight, the fragment ion m/z values (b- or y-ions) that define the sequence string, and the protease cleavage specificity, the search is sufficiently specific to match a small number of amino acid sequences. During the search, sequence ions and character strings are considered in both the N-terminal (b-ion) and C-terminal (y-ion) directions, because it isn't known a priori whether the m/z values used to define the sequence string were derived from b- or y-type ions.

If a search doesn't identify a sequence, additional searches can be carried out to consider the presence of an unknown covalent modification to the peptide or the presence of a sequence error in the database. For example, the presence of a modification to the N-terminal region of the peptide's sequence would cause the b-ion of low m/z value and the y-ion of high m/z value to change from the values predicted by the database sequence. By eliminating the requirement that those particular m/z values match the predicted values, sequences meeting the remaining criteria can be found. When criteria are dropped, however, the specificity of the search is decreased, and matches to many sequences must be considered further by manual interpretation to determine the amino acid sequence, including the modification or correction of the sequence error, that best fits the tandem mass spectrum. If fragment ions in the tandem mass spectrum are of poor S/N, an unambiguous verification may be difficult. Although this method is potentially very specific given a long sequence string, the user must interpret each tandem mass spectrum to add the sequence string. Two sites have appeared on the World Wide Web with variations on the above method for database searching (http://rafael.ucsf.edu/mstag.html and http://chait-sgi.rockefeller.edu/).

Searching directly with raw, uninterpreted tandem mass spectra. The SEQUEST algorithm (24) uses the predictability of peptide fragmentation to identify amino acid sequences that would match the fragmentation pattern observed in the tandem mass spectrum (http://thompson.mbt.washington.edu/sequest.html). In addition to the sequence-specific information present in a tandem mass spectrum, low m/z ions corresponding to the structure +NH2=CHR (immonium ions) provide diagnostic information about the amino acid content of the peptide. Under low-energy collision conditions, immonium ions are frequently observed when the amino acids leucine, isoleucine, methionine, tryptophan, tyrosine, or phenylalanine are present in the peptide. High-energy collisions produce immonium ion information more representative of the amino acid composition (25). This information is also used in the search process to weight sequences.
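The ladder of sequence ions described above for CID spectra can be illustrated by predicting the singly charged b- and y-ion m/z values for a peptide. The sketch below is an assumption-laden illustration rather than part of any search program: the function name is hypothetical, only b- and y-type ions are considered, all ions are taken to be singly protonated, and monoisotopic masses are used.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) m/z lists for a singly charged peptide.

    b_ions[i-1] holds the first i residues plus a proton;
    y_ions[i-1] holds the last i residues plus water and a proton.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal fragments b1 .. b(n-1)
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments y1 .. y(n-1)
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions
```

Adjacent members of either series differ by exactly one residue mass, which is what allows a sequence to be read from the spectrum, and each b-ion pairs with a complementary y-ion whose m/z values sum to the neutral peptide mass plus two protons.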
Figure 2. Cross-correlation analysis method.
Amino acid sequences in the database are scanned to find linear combinations of amino acids, proceeding from the N- to the C-terminus, that are within some tolerance of the peptide mass represented by the tandem mass spectrum. This mass tolerance can be as large as 200 u or as small as 0.5 u. Although a large mass tolerance allows more sequences to be evaluated and hence slows the search, the accuracy of the search is not substantially affected by the mass tolerance used to select the amino acid sequences, because two scoring methods are used to evaluate the sequences for their closeness-of-fit to the tandem mass spectrum. Sequence selection can also be guided by the cleavage specificity of the protease used to create the peptide, including consideration of incompletely digested peptides from either side of the primary cleavage sites, or it can be performed with no assumptions about how the peptide was created.

Nucleotide databases can also be searched by translating nucleotide sequences "on the fly" to protein sequences (26). A translation in all 6 reading frames is usually performed and results in roughly a sixfold increase in the number of amino acid sequences examined. Chemical modifications can be considered by changing the amino acid mass used to calculate the peptide masses. Because a modification such as phosphorylation may not be present at every occurrence of a threonine, serine, or tyrosine, it is important to evaluate each possibility (27).

Once an amino acid sequence is within the defined mass tolerance, a preliminary evaluation is performed (24). First, the number of predicted fragment ions that match ions observed in the spectrum
within 1 u and their abundances are summed. A set of ion types most likely to be found in the tandem mass spectra generated by a particular type of mass spectrometer may be selected for use in the search. A sequence that matches a continuous set of ions (one in which consecutive sequence ions are present) is weighted more heavily than one that randomly matches a few sequence ions. If an immonium ion is present in the spectrum, an additional component of the score is increased or decreased depending on whether the associated amino acid is present in the sequence under consideration. The total number of predicted sequence ions is also noted, and a score is calculated for each amino acid sequence. This method of scoring can be sufficient to identify the correct amino acid sequence if the tandem mass spectrum has good S/N and if many of the sequence ions are present.

When necessary, the sensitivity of the searches can be increased by incorporating a second scoring step. This scoring routine is based on a comparison of the tandem mass spectrum of interest with theoretical mass spectra developed by calculating predicted sequence ions and then assigning a relative abundance to each type of ion (Figure 2). The relative ion abundance can be varied for the different ion types, which is particularly useful for the reconstruction of high-energy CID spectra in which a wide variety of ion types can be observed with different ion abundances. A cross-correlation function (28, 29) is used to compare the reconstructed mass spectrum with the query tandem mass spectrum. The reconstructed spectrum and the experimental spectrum represent discrete input signals, as shown in Figure 2. If two signals are the same, the correlation function should maximize at a displacement value (τ) of 0. Cross-correlation is computed by fast Fourier transformation of the two data sets, multiplication of one transform by the complex conjugate of the other, and inverse transformation of the resulting product. A final score is attributed to each candidate peptide sequence by subtracting the mean of the function over the range −75 < τ < 75 from the cross-correlation value at τ = 0.

The normalized cross-correlation value (Cn) has proven valuable in identifying correct answers and flagging false positives. When a difference greater than 0.1 is observed between the normalized cross-correlation values of the first and the second ranked sequences, the first answer is usually correct (24). A small difference in the ΔCn value is also observed when the first two answers have a high degree of sequence similarity. Figure 3 shows a plot of the cross-correlation differences (ΔCn) for tandem mass spectra of 44 peptides analyzed against the OWL database of ~100,000 entries. For all but 2 of the 44 peptides, the correct answer appeared in the top-ranking position. The sensitivity of the cross-correlation method is remarkable.
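The second scoring step described above, a cross-correlation value at zero displacement corrected by the mean over nearby displacements, can be sketched as follows. This is a simplified illustration and not the SEQUEST implementation: the function names are hypothetical, spectra are binned to 1-u bins as an assumption, the correlation is evaluated directly with zero padding rather than by fast Fourier transformation, and the normalization used for the difference between the two top scores (division by the top score) is also an assumption.

```python
def bin_spectrum(peaks, n_bins):
    """Convert (m/z, intensity) pairs to a fixed-length signal (1-u bins)."""
    signal = [0.0] * n_bins
    for mz, intensity in peaks:
        index = int(round(mz))
        if 0 <= index < n_bins:
            signal[index] = max(signal[index], intensity)
    return signal


def cross_correlation(x, y, tau):
    """Sum of x[i] * y[i + tau] over the bins where both signals exist."""
    n = len(x)
    return sum(x[i] * y[i + tau] for i in range(n) if 0 <= i + tau < n)


def xcorr_score(experimental, theoretical, max_lag=75):
    """Value at tau = 0 minus the mean over -max_lag < tau < max_lag."""
    at_zero = cross_correlation(experimental, theoretical, 0)
    lags = range(-max_lag + 1, max_lag)
    background = sum(
        cross_correlation(experimental, theoretical, t) for t in lags
    ) / len(lags)
    return at_zero - background


def delta_cn(scores):
    """Normalized difference between the two best scores.

    Assumes at least two scores and a positive top score; dividing by
    the top score is an assumed normalization.
    """
    ranked = sorted(scores, reverse=True)
    return (ranked[0] - ranked[1]) / ranked[0]
```

In use, each candidate sequence selected by mass would be turned into a theoretical spectrum, binned in the same way as the experimental spectrum, and scored with `xcorr_score`; the candidates would then be ranked and `delta_cn` compared with the 0.1 threshold mentioned in the text.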
Amino acid sequences conserved among related proteins can also be readily identified across many species. Finally, automation of the data analysis end of protein identification allows implementation of higher throughput sampling techniques such as interfacing an autosampler to the chromatograph.

After the human genome and the genomes of other organisms are sequenced, the challenge will be to assign functions to sequences in the database and to reconstruct the networks of interactions comprising physiological processes. Genomic sequence information will provide the biological blueprints necessary to decipher the details of molecular organization, interaction, and function, and biological studies will shift from a reductionist approach (isolation and study of single components) to a more systemic and networked view of processes. Through tandem MS and database searching, a protein in a sample enriched for biological activity could be directly and rapidly identified even though the sample may not be homogeneous. Furthermore, the ability to use sequence [...] MS [...] integrating and automating sequence analysis and protein identification.

References
(1) Fleischmann, R. D.; Adams, M. D.; White, O., et al. Science 1995, 269, 496-512.
(2) Schena, M.; Shalon, D.; Davis, R. W.; Brown, P. O. Science 1995, 270, 467-70.
(3) Adams, M. D.; Kelley, J. M.; Gocayne, J. D., et al. Science 1991, 252, 1651-56.
(4) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215, 403-10.
(5) Cox, A. L.; Skipper, J.; Chen, Y.; Henderson, R. A.; Darrow, T. L.; Shabanowitz, J.; Engelhard, V. H.; Hunt, D. F.; Slingluff, C. L., Jr. Science 1994, 264, 716-19.
(6) Gibson, B. W.; Biemann, K. Proc. Natl. Acad. Sci. USA 1984, 81, 1956-60.
(7) Morris, H. R.; Panico, M.; Taylor, G. W. Biochem. Biophys. Res. Commun. 1983, 117, 299-305.
(8) Yates, J. R., III; Griffin, P. R.; Speicher, S.; Hunkapiller, T. Anal. Biochem. 1993, 214, 397-407.
(9) Henzel, W.; Billeci, T.; Stults, J.; Wong, S.; Grimley, C.; Watanabe, C. Proc. Natl. Acad. Sci. USA 1993, 90, 5011-15.
(10) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Biochem. Biophys. Res. Commun. 1993, 195, 58-64.
(11) Pappin, D.; Hojrup, P.; Bleasby, A. Curr. Biol. 1993, 3, 327-32.
(12) Mann, M.; Hojrup, P.; Roepstorff, P. Biol. Mass Spectrom. 1993, 22, 338-45.
(13) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Protein Science 1994, 3, 1347-50.
(14) Hunt, D. F.; Henderson, R. A.; Shabanowitz, J.; Sakaguchi, K.; Michel, H.; Sevilir, N.; Cox, A. L.; Apella, E.; Engelhard, V. H. Science 1992, 255, 1261-63.
(15) Hunt, D. F.; Yates, J. R., III; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. USA 1986, 83, 6233-37.
(16) Roepstorff, P.; Fohlman, J. Biomed. Mass Spectrom. 1984, 11, 601.
(17) Bean, M. F.; Carr, S. A.; Thorne, G. C.; Reilly, M. H.; Gaskell, S. J. Anal. Chem. 1991, 63, 1473-81.
(18) Kaiser, R. E., Jr.; Cooks, R. G.; Syka, J. E. P.; Stafford, G. C. Rapid Commun. Mass Spectrom. 1990, 4, 30-33.
(19) Senko, M. W.; Beu, S. C.; McLafferty, F. W. Anal. Chem. 1994, 66, 415-18.
(20) Biemann, K. Biomed. Environ. Mass Spectrom. 1988, 16, 99-111.
(21) Medzihradsky, K. F.; Burlingame, A. L. Methods: A Companion to Methods in Enzymology 1994, 6, 284-303.
(22) Kaufmann, R.; Kirsch, D.; Spengler, B. Int. J. Mass Spectrom. Ion Processes 1994, 131, 355-85.
(23) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-99.
(24) Eng, J.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-89.
(25) Falick, A. M.; Hines, W. M.; Medzihradsky, K. F.; Baldwin, M. A.; Gibson, B. W. J. Am. Soc. Mass Spectrom. 1993, 4, 882-93.
(26) Yates, J. R., III; Eng, J.; McCormack, A. L. Anal. Chem. 1995, 67, 3202-10.
(27) Yates, J. R., III; Eng, J.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-36.
(28) Powell, L. A.; Hieftje, G. M. Anal. Chim. Acta 1978, 100, 313-27.
(29) Owens, K. Appl. Spectrosc. Rev. 1992, 27, 1-49.
(30) Yates, J. R., III; Eng, J. K.; Clausner, K.; Burlingame, A. L. J. Am. Soc. Mass Spectrom., in press.
(31) Yates, J. R., III; Eng, J.; Schieltz, D.; Link, A. Proceedings of the 43rd Conference on Mass Spectrometry and Allied Topics, 1995, 325.
(32) Griffin, P. R.; MacCoss, M.; Blevins, R. A.; Aaronson, J. S.; Eng, J. K.; Yates, J. R., III. Rapid Commun. Mass Spectrom. 1995, 9, 1546-51.
(33) Larmann, J. P., Jr.; Lemmo, A. V.; Moore, A. W., Jr.; Jorgenson, J. W. Electrophoresis 1993, 14, 439-47.
(34) Patterson, S. D. Anal. Biochem. 1994, 221, 1-15.
(35) Patterson, S. D.; Aebersold, R. Electrophoresis 1995, 16, 1791-1814.

John R. Yates, III, is an assistant professor of molecular biotechnology, Ashley L. McCormack is a staff scientist, and Jimmy Eng is a research consultant at the University of Washington. Address correspondence to Yates at Box 357730, Dept. of Molecular Biotechnology, University of Washington, Seattle, WA 98195-7730 ([email protected]).