Mass spectrometric methods for protein sequencing - ACS Publications

Mass spectrometric methods for protein sequencing. Klaus Biemann. Anal. Chem., 1986, 58 (13), pp 1288A–1300A. DOI: 10.1021/ac00126a001. Publication...

Klaus Biemann, Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Mass. 02139

ANALYTICAL CHEMISTRY, VOL. 58, NO. 13, NOVEMBER 1986. 0003-2700/86/A358-1288$01.50/0 © 1986 American Chemical Society

Proteins are among the most important components of all living systems. Their functions range from catalysts (enzymes) to regulators to structural components. The building blocks of proteins are about 20 amino acids (H2N-CH(R)-COOH), all differing in the structure of R, linked together by peptide bonds (-CO-NH-) in chains that may consist of a few dozen to more than 1000 amino acids. The determination of the “primary” structure of proteins, namely, the arrangement (sequence) of the various amino acids along this chain, is a formidable task. It is generally accomplished by cleaving the large chain into smaller, more manageable segments and determining their amino acid sequence first. There are an enormous number of ways in which these 20 building blocks can be linearly assembled and, therefore, an equally large number of possible peptides that can result from the chemical or enzymatic cleavage of a protein. However, they all have an important structural feature in common: a repeating backbone of consecutive -NH-CH(R)-CO- units and the strict limitation that R must be one of the side chains of the naturally occurring amino acids. These restrictions make it possible to determine the structure of peptides by mass spectrometry (MS) despite their almost infinite variability.

The potential usefulness of MS for determining the amino acid sequence of peptides was recognized in the late 1950s (1). Cleavage of a single bond along the backbone produces a fragment whose mass is related to the sum of all side chains and the number of peptide backbone units retained in the fragment. Thus, interpretation of the mass spectrum of a suitable peptide derivative to provide information about a previously unknown sequence of amino acids is generally rather straightforward. During the 1960s and 1970s, however, the technique’s contribution to the elucidation of protein structure was quite limited (2,3). Determination of the amino acid sequence of small peptides was possible only after extensive chemical conversions to derivatives that were sufficiently volatile to permit their introduction into the spectrometer via a gas chromatograph or the solid-sample probe. One benefit of the gas chromatograph was that it allowed analysis of complex peptide mixtures derived from proteins; the solid-sample probe was useful only for derivatives of single peptides or very simple mixtures. The gas chromatography/mass spectrometry (GC/MS) technique required the more complicated conversion to N-trifluoroethyl-O-trimethylsilyl polyamino alcohols (4), whereas N-acetyl-N,O-permethylated derivatives (5) were generally used for direct introduction. The reactions involved in these derivatizations and the sequence-specific fragmentation upon electron ionization (EI) are illustrated in Scheme I.

Most often these mass spectrometric techniques were used to answer specific questions, such as the identification of N-terminal blocking groups and sequences. These blocked peptides are not amenable to manual or automated Edman degradation (6), a chemical method that removes and identifies one amino acid after the other in a stepwise fashion. It is the only widely used practical method for the direct determination of the amino acid sequence of polypeptides and proteins and can generally be carried through 10-30 steps and even further in favorable cases.

Because both the reductive GC/MS method and the permethylation technique have now been eclipsed by an entirely different approach, they will not be discussed here in detail. It suffices to say that their advantages and disadvantages nicely complemented those of the Edman degradation. Thus, a combination of the two became an efficient strategy for the determination of the primary structure of a few small proteins in the late 1970s and early 1980s. Most notable among them was the membrane protein bacteriorhodopsin, which consists of a single chain of 248 amino acids (7). Because it exhibits unique solubility properties (it is very hydrophobic), it could not be cleaved by enzymes, normally the first step in sequencing. It is soluble only in 80% formic acid, the solvent of choice for chemical cleavage with cyanogen bromide. This reagent splits the peptide bond after methionine, resulting, in this case, in 10 peptides, ranging in length from 4 to 50 amino acids, each ending in homoserine (the reaction product of methionine). Only the peptide representing the original C-terminus of the protein does not end with homoserine.

Partial acid hydrolysis of each of these 10 peptides, followed by reductive derivatization and analysis of the resulting complex mixture, provided the sequences of many di- to hexapeptides. The sequences could be assembled either by lining them up where there was sufficient redundancy in the information or by using partial sequences derived at about the same time by the automated Edman degradation. The latter method often did not define the sequence all the way to the C-terminal homoserine of the particular peptide. As these hydrophobic peptides become shorter and shorter during the Edman degradation, they are increasingly soluble in the organic solvents used in this procedure and are then quickly lost. On the other hand, the derivatives of nonpolar amino acids are particularly well suited for MS because they can be easily transmitted through a gas chromatograph or vaporized from a solids probe.

DNA-protein correlation

Around 1980 the picture changed dramatically in a number of ways. First, the fine art of chemical sequencing of proteins reached a pinnacle when the primary structure of the first protein that consisted of more than 1000 amino acids was determined (8). But this achievement appeared to be rendered obsolete by the development of relatively simple, fast, and reliable methods (9,10) for the determination of the base sequence of the gene coding for a given peptide. The amino acid sequence of the peptide could then be simply derived by translation of the DNA sequence using the genetic code. This meant that one did not have to determine the amino acid sequence of the protein directly, as long as the corresponding gene could be isolated. Recombinant DNA techniques (genetic engineering) came along at the same time and made this process even simpler and less time-consuming.

However, a number of problems remained. The codon for each amino acid corresponds to three bases in the DNA chain, and for a protein of 1000 amino acids one has to identify correctly at least 3000 consecutive nucleotide bases without a single error of omission, insertion, or misidentification. In addition, because each correct DNA sequence can be translated into three entirely different strings of amino acids (the so-called reading frames), some protein data are necessary to define the correct one. Conventionally, this was achieved by determining the amino acid sequence of the protein for a short portion at the N-terminal and C-terminal end and redundantly sequencing the entire gene to reveal any errors in an individual sequencing experiment.

Here again, MS proved quite useful. Reductive derivatization of a partial acid or enzymatic hydrolysate of even a large protein produced a complex mixture of hundreds of peptide derivatives. By GC/MS it was possible to determine enough sequences of three to six consecutive amino acids derived from regions randomly located along the protein that one could fit them to the amino acid sequence predicted from the DNA results. Deletion or insertion errors in the DNA sequence were easily recognized if some of the peptides matched in one reading frame and others matched in another. The first and still the largest protein to be studied by this technique, the enzyme alanyl-tRNA synthetase, is 875 amino acids long. Its structure was determined by the combination of DNA sequencing and mass spectrometric peptide sequencing in a real-time collaborative effort (11).
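The reading-frame argument above can be made concrete with a short sketch. The Python below translates one DNA strand in its three forward reading frames with the standard genetic code and reports which frames contain a given short peptide, much as short GC/MS-derived sequences were fitted to the DNA-predicted sequence. The function names (`translate`, `frames_containing`) and the toy sequence are illustrative assumptions, not data or software from the work described here.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame):
    """Translate the coding strand in reading frame 1, 2, or 3.

    Incomplete trailing codons are ignored.
    """
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def frames_containing(dna, peptide):
    """Return the reading frames whose translation contains the peptide."""
    return [f for f in (1, 2, 3) if peptide in translate(dna, f)]


# Toy example (hypothetical sequence): one strand, three different translations.
dna = "ATGGCTAAAGGTTGG"
for f in (1, 2, 3):
    print(f, translate(dna, f))
```

If peptides found experimentally map to frame 1 in one region and to another frame further along, the sketch would flag exactly the situation the text describes: a base was probably dropped or added in between.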

FABMS

The second event that occurred in the early 1980s and dramatically changed the role of MS in peptide and protein structure research was the development of fast atom bombardment MS (FABMS) by Barber et al. (12). Although some may still argue that this method is merely a modification of secondary ion MS (SIMS), the use of a liquid matrix, and not so much the use of neutral atoms, suddenly made it possible to ionize large polar molecules simply and without any prior chemical derivatization. Although this had been achieved previously using either field desorption (FD) (13) or plasma desorption (PD, in which the fission products of 252Cf cause the ionization) (14), neither of these techniques gained much popularity for technical reasons, chiefly inconvenience or unavailability of the instrumentation. It should be pointed out, however, that a group in Osaka, Japan, made quite a bit of progress in the use of FDMS to solve protein structure problems, most notably the identification of single amino acid mutations in abnormal hemoglobins that cause various inherited diseases (15,16).

Peptides have always been the favored test compound for new ionization techniques, in part because of their structural uniqueness, their wide availability in all sizes, complexities, and amounts, and perhaps most importantly because of their intellectual appeal as essential components of biological systems. Barber’s first publication describing the FAB ionization process included a peptide of mol wt 1318 as an example (12). It exhibited an abundant (M + H)+ ion, a characteristic of all FAB mass spectra of polar compounds. Immediately, the mass spectrometric world took notice, and a flood of papers concerned with the mass spectra of peptides followed. An intermediate milestone was reached with the demonstration that even a small protein, such as insulin (mol wt 5803.4 for the human variety), can be successfully ionized (17).

In addition to a pronounced peak corresponding to the mass of the protonated molecule, the (M + H)+ ion, FAB spectra exhibit some fragment ions. Although these are generally of low abundance, they increase with a decrease in the molecular size of the peptide and increase, up to a point, with increasing sample size. A further complication in the ability to interpret the FAB spectrum of a peptide is the obliterating effect of low-mass peaks resulting from the liquid matrix (glycerol or thioglycerol) and its cluster ions. Thus, although the literature abounds with papers describing the correlation of a FAB spectrum of a peptide with its known sequence, those seriously concerned with the determination of the primary structure of large peptides or proteins do not attempt to deduce an unknown sequence from a FAB mass spectrum directly. Rather, they use this information in conjunction with other chemical or mass spectrometric fragmentation procedures.

There are, however, many situations in which the molecular weight information alone suffices. The best documented area is a variant of the DNA-protein structure correlation discussed earlier. Although the GC/MS method provides short actual sequences that can be matched against those predicted from a DNA sequence, comparing the molecular weights (accurate to within one mass unit) of the peptides found in the mixture produced by the action of an enzyme, such as trypsin, with those predicted from a hypothetical amino acid sequence is an equally stringent test of the correctness of that sequence. This prediction is simple and reliable because trypsin very specifically cleaves the peptide bonds at the carboxyl side of the basic amino acids arginine and lysine. Because these two amino acids occur in all proteins with average frequency, tryptic peptides are generally 2-25 amino acids long. With the exception of the very short peptides, their molecular weights are quite distinct. Thus it is very rare that more than one peptide of the same molecular weight is predicted from a hypothetical (DNA-derived) amino acid sequence.

Errors in a DNA sequence hardly ever result in a transposition of two or more amino acids; they normally cause a change in reading frame (which results in an entirely different amino acid sequence, including the locations of lysines and arginines) or the replacement of one amino acid by another. As a consequence, these errors in the DNA sequence always lead to a change in the molecular weights of the predicted tryptic peptides. It follows that it is quite easy to check the correctness of a DNA sequence by the FABMS determination of the molecular weights of the tryptic peptides derived from the corresponding protein (18). Where they match, the DNA sequence is correct, whereas any mismatches directly indicate the type of error (deletion, addition, or misidentification of a base) and pinpoint its location, as illustrated in Figure 1. This information makes correction easy. Clearly, this check-and-balance strategy is most useful when carried out in parallel because it eliminates the need for the redundant resequencing of the same region of the gene just to uncover the unavoidable errors made the first time around. Table 1 lists the proteins whose structures were determined by these strategies.
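The tryptic-peptide check described above amounts to a small calculation, sketched here in Python. The sketch cleaves a hypothetical protein sequence after every lysine (K) and arginine (R), computes the (M + H)+ value of each peptide, and lists the predicted peptides that have no experimental counterpart. Several choices are assumptions made for illustration only: monoisotopic residue masses are used, cleavage is taken to be complete, the commonly noted exception for a following proline is ignored, and the names `tryptic_peptides`, `mh_plus`, and `unmatched` are hypothetical.

```python
import re

# Monoisotopic residue masses (Da) of the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (free N- and C-termini)
PROTON = 1.00728   # converts M to (M + H)+


def tryptic_peptides(protein):
    """Cleave at the carboxyl side of every K and R (simplified rule)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+", protein)


def mh_plus(peptide):
    """Predicted (M + H)+ value of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def unmatched(protein, observed_mh, tol=0.5):
    """Predicted tryptic peptides with no observed (M + H)+ within tol."""
    return [
        p for p in tryptic_peptides(protein)
        if not any(abs(mh_plus(p) - m) <= tol for m in observed_mh)
    ]


# Toy example: the middle peptide was not seen in the (made-up) spectrum,
# which would point to a possible error in that stretch of the DNA sequence.
observed = [mh_plus("MAK"), mh_plus("LL")]
print(unmatched("MAKGWRLL", observed))
```

A run of consecutive unmatched peptides, rather than a single one, is what would be expected from a reading-frame shift, since every residue downstream of the error changes.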


Figure 1. [...] subunit of glycyl-tRNA synthetase from E. coli (adapted from Reference 18). Labels recovered from the figure: amino acid positions 339, 346, and 347; reading frame 3.

The molecular weights predicted for reading frame 1 matched those found by FABMS up to amino acid position 324 and then again starting with amino acid 387. Translation of the DNA sequence in reading frame 3 corresponded to an amino acid sequence containing two tryptic peptides whose molecular weights matched two experimentally determined ones, namely, from amino acid 339 to 346 and 347 to 360. A shift to reading frame 3 indicates a missing base in the DNA sequence (insertion of a base would shift to reading frame 2). Also, between the codon of amino acid 324 and that of 339 there were only 41 bases rather than 42 (a multiple of three for each codon). Similarly, between the codons of amino acid 360 and 387 there were 79 bases instead of 78. These two errors can be pinpointed further by generating all hypothetical amino acid sequences derived from these short DNA sequence regions in which one base has been consecutively inserted or deleted at each position until a tryptic peptide appears whose molecular weight corresponds to one found in the tryptic digest of the protein (for details see Reference 18).
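The pinpointing procedure in the caption, trying every single-base insertion and deletion until a tryptic peptide of the observed molecular weight appears, can be sketched as a brute-force search. The Python below is only an illustration of that idea under stated assumptions: it works on a short region read in frame 1, uses monoisotopic residue masses and complete cleavage after K and R, and skips any peptide that runs into a stop codon. All names (`single_base_variants`, `candidate_corrections`, and so on) are hypothetical and do not reproduce the software of Reference 18.

```python
import re
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def translate(dna):
    """Translate in frame 1, ignoring an incomplete trailing codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def mh_plus(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def peptide_masses(protein):
    """(M + H)+ of each tryptic peptide; peptides containing a stop are skipped."""
    peptides = re.findall(r"[^KR]*[KR]|[^KR]+", protein)
    return [mh_plus(p) for p in peptides if "*" not in p]


def single_base_variants(dna):
    """Yield ((kind, position, base), sequence) for every one-base deletion or insertion."""
    for i in range(len(dna)):
        yield ("delete", i, dna[i]), dna[:i] + dna[i + 1:]
    for i in range(len(dna) + 1):
        for b in BASES:
            yield ("insert", i, b), dna[:i] + b + dna[i:]


def candidate_corrections(dna, target_mh, tol=0.5):
    """Edits after which some tryptic peptide matches the observed (M + H)+."""
    return [
        edit for edit, variant in single_base_variants(dna)
        if any(abs(m - target_mh) <= tol for m in peptide_masses(translate(variant)))
    ]


# Toy example: drop one base from a made-up gene region, then search for edits
# that bring back a peptide with the "observed" mass of LWEVR.
true_dna = "GGTGCTAAACTGTGGGAAGTTCGT"          # encodes GAKLWEVR
reported = true_dna[:10] + true_dna[11:]      # base 10 missing
print(candidate_corrections(reported, mh_plus("LWEVR")))
```

Because leucine and isoleucine have the same mass, and because different edits can yield peptides of equal mass, a search of this kind returns candidates rather than a proof; the surviving candidates would still have to be checked against the sequencing data.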

Most of the proteins in Table 1 are rather large because it is much easier to determine the sequence of a few thousand bases in a DNA strand than a few hundred amino acids in a protein. On the other hand, it is relatively simple to generate the limited protein data, such as the molecular weights of a large fraction of all tryptic peptides, necessary to check the DNA sequence.

Table 1. Protein sequences deduced by a combination of DNA sequencing and FABMS

Protein                                        Number of amino acids   Reference
Gln-tRNA synthetase                            550                     a
Gly-tRNA synthetase (α and β) from E. coli     303 + 687               b
Met-tRNA synthetase from yeast                 751                     c
His-tRNA synthetase from E. coli               324                     d
Glu-tRNA synthetase from E. coli               471                     e
Endo-β-N-acetylglucosaminidase H               313                     f
Rabbit muscle creatine phosphokinase           380                     g
Rabbit brain creatine phosphokinase            380                     h
Protein S from Myxoc. xanthus                  173                     i
Human glucose transporter                      492                     j
Cytosolic phosphoenolpyruvate carboxykinase    626                     k

a Biemann, K. Int. J. Mass Spectrom. Ion Phys. 1982, 45, 183. b Webster, T. A. et al. J. Biol. Chem. 1983, 258, 10637-41. c Gibson, B. W. et al. Proc. Natl. Acad. Sci. U.S.A. 1984, 81, 1956. d Freedman, R. et al. J. Biol. Chem. 1985, 260, 10063-68. e Breton, R. et al. J. Biol. Chem., in press. f Robbins, P. W. et al. J. Biol. Chem. 1984, 259, 7577-83. g Putney, S. et al. J. Biol. Chem. 1984, 259, 14317-20. h Pickering, L. et al. Proc. Natl. Acad. Sci. U.S.A. 1985, 82, 2310-14. i Takao, T. et al. J. Biol. Chem. 1984, 259, 6105-9. j Mueckler, M. et al. Science 1985, 229, 941-45. k Beale, E. G. et al. J. Biol. Chem. 1985, 260, 10748-60.

Genetically engineered proteins

Another area of current interest is the verification of the structure of a protein that has been prepared by recombinant DNA techniques. The large-scale synthesis of biologically active proteins that occur in extremely low concentrations in higher organisms can now be accomplished by insertion of a properly manipulated plasmid in a single cell system that can be cultured in large quantities. Well-known examples are the production of human insulin, interferons, and interleukins. Many other proteins and numerous modifications of them are produced for research purposes. It is, however, important to verify that the new host organism produces the correct protein and that it is not modified in any way (other than a desired one). This verification can be accomplished by the same techniques as outlined above for the DNA-derived protein sequences and is sometimes referred to as FAB mapping (19,20). Essentially one checks whether the peptides produced by specific cleavage (either with trypsin or with cyanogen bromide) have the expected molecular weight.

This principle has been demonstrated by Morris et al. (20) on insulin as a test compound and was used by Richter et al. to show that the protein eglin c, when produced by recombinant techniques in E. coli, contains an additional acetyl group (21). This modification was first recognized when the molecular weight determined by FABMS was found to be 8134 instead of 8092 as calculated from the known sequence of the natural protein. Cleavage of the biosynthetic protein with various enzymes finally showed that this extra acetyl group was attached to the N-terminal threonine.

More recently, site-specific amino acid replacements in bacteriorhodopsin were produced in H. G. Khorana’s group via the total synthesis of the corresponding gene and expressed in E. coli (22). Here again the determination of the molecular weights of the peptides produced by cleavage of the native protein and those obtained from the proteins produced by the synthetic genes corroborated the correctness of the desired modifications. As an example, Figure 2 shows the (M + H)+ ion regions of the N-terminal cyanogen bromide fragment of native bacteriorhodopsin, which begins with pyroglutamic acid, and of a modified analogue in which pyroglutamic acid was deleted and replaced by methionine (which had been cleaved off as homoserine). The mass difference of the two (M + H)+ ions at m/z 2191.4 and m/z 2080.2, respectively, corresponds to pyroglutamic acid minus H2O, that is, 111 mass units (23).
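The two examples above both come down to interpreting a mass difference. A minimal Python sketch of that bookkeeping follows; it compares an observed shift with a small lookup table of candidate modifications. The table entries are monoisotopic values computed from the elemental formulas, the tolerance is an arbitrary choice, and the name `explain_shift` is hypothetical. The molecular weights quoted in the text are used as given, without assuming whether they are average or monoisotopic values.

```python
# Candidate mass shifts (Da), monoisotopic, from elemental composition.
KNOWN_SHIFTS = {
    "acetyl group (C2H2O)": 42.0106,
    "pyroglutamyl residue (pyroglutamic acid minus H2O)": 111.0320,
}


def explain_shift(observed, expected, tol=0.5):
    """Return the names of known modifications matching observed - expected."""
    delta = observed - expected
    return [name for name, shift in KNOWN_SHIFTS.items() if abs(delta - shift) <= tol]


# Values quoted in the text.
print(explain_shift(8134, 8092))        # recombinant vs. natural eglin c
print(explain_shift(2191.4, 2080.2))    # native vs. modified N-terminal fragment
```

A match against such a table only suggests what the modification is; as the eglin c case shows, locating it still required cleavage of the protein and analysis of the resulting peptides.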

Direct mass spectrometric sequencing of proteins

An entirely different approach has to be taken if the amino acid sequence of a protein is to be determined directly, without isolating and sequencing the corresponding gene. For smaller proteins, direct sequencing is more efficient because the work of finding and identifying the gene represents the major effort and is just as difficult for small proteins as for large ones. The conventional approach applies the automated Edman procedure to the protein itself and certain cleavage products. However, FABMS is becoming a more and more useful alternative or complementary approach. As mentioned earlier, a conventional FAB spectrum of a typical peptide produced by enzymatic or chemical cleavage does not contain sufficiently abundant fragment ions to define its
