Anal. Chem. 2010, 82, 3283–3292

Top-Down de Novo Protein Sequencing of a 13.6 kDa Camelid Single Heavy Chain Antibody by Matrix-Assisted Laser Desorption Ionization-Time-of-Flight/Time-of-Flight Mass Spectrometry

Anja Resemann,† Dirk Wunderlich,† Ulrich Rothbauer,‡ Bettina Warscheid,§,| Heinrich Leonhardt,‡ Jens Fuchser,† Katja Kuhlmann,§ and Detlev Suckau*,†

Bruker Daltonik GmbH, Fahrenheitstrasse 4, 28359 Bremen, Germany, Department of Biology and Center for Integrated Protein Science, Ludwig Maximilians University Munich, Grosshaderner Strasse 2, 82152 Planegg-Martinsried, Germany, Medizinisches Proteom-Center, Ruhr-Universitaet Bochum, Universitaetsstrasse 150, 44780 Bochum, Germany, and Clinical & Cellular Proteomics, Medical Faculty and Center for Medical Biotechnology, Duisburg-Essen University, 45117 Essen, Germany

The primary structure of a 13.6 kDa single heavy chain camelid antibody (VHH) was determined by matrix-assisted laser desorption ionization-time-of-flight/time-of-flight (MALDI-TOF/TOF) top-down sequence analysis. The majority of the sequence was obtained by mass spectrometric de novo sequencing, with the N-terminal 14 amino acid residues being determined using T3-sequencing and database interrogation. The determined sequence was confirmed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis of a tryptic digest, which also provided high-energy collisionally induced dissociation (CID) data permitting the clear assignment of 3 of the 14 isobaric Leu/Ile residues. Five of the 11 Leu/Ile ambiguities could be resolved by homology comparisons with known VHH sequences. The monoisotopic molecular weight of the VHH was determined by ultrahigh-resolution orthogonal electrospray (ESI)-TOF analysis and found to be 13 610.6066 Da, in excellent agreement with the established sequence. To our knowledge, this is the first time that the entire primary structure of a protein with a molecular weight >13 kDa has been established by mass spectrometric top-down sequencing.
Today, de novo protein sequence determination is typically performed on the DNA level on a grand scale. Although it is a well established method for the assignment of protein identity and the identification of post-translational modifications (PTMs) in protein analysis or proteomics,1 MS has rarely been used for de novo protein sequencing. In pioneering work, MS was used to sequence thioredoxin based on de novo sequencing of proteolytic peptides,2 while later work concentrated on the use of chemical treatment strategies.3,4 MS plays a crucial role in the assignment of recombinant protein structures in basic research and the pharmaceutical industry.5 Both top-down and bottom-up approaches are used for detailed protein characterization.6-8 Top-down protein sequencing, mass spectrometric sequence assignment without any prior proteolytic digestion steps, has been developed to a level that offers a powerful orthogonal approach to the classical bottom-up method. In the bottom-up approach, proteins are initially digested with proteolytic agents such as trypsin and then analyzed by combinations of chromatographic and mass spectrometric means.1 This process destroys the information about the order of the proteolytic peptides along the protein chain and is, therefore, too laborious for de novo protein sequencing in general.

* Corresponding author. Dr. Detlev Suckau, Bruker Daltonik GmbH, Fahrenheitstr. 4, D-28359 Bremen, Germany. Phone: +49-421-2205-245. E-mail: dsu@bdal.de.
† Bruker Daltonik GmbH.
‡ Ludwig Maximilians University Munich.
§ Ruhr-Universitaet Bochum.
| Duisburg-Essen University.
10.1021/ac1000515 © 2010 American Chemical Society. Published on Web 03/23/2010.
Analytical Chemistry, Vol. 82, No. 8, April 15, 2010

(1) Hufnagel, P.; Rabus, R. J. Mol. Microbiol. Biotechnol. 2006, 11 (1-2), 53–81.
(2) Johnson, R. S.; Biemann, K. Biochemistry 1987, 26 (5), 1209–14.
(3) Chait, B. T.; Wang, R.; Beavis, R. C.; Kent, S. B. Science 1993, 262 (5130), 89–92.
(4) Zhong, H.; Zhang, Y.; Wen, Z.; Li, L. Nat. Biotechnol. 2004, 22 (10), 1291–1296.
(5) Zhang, Z.; Pan, H.; Chen, X. Mass Spectrom. Rev. 2009, 28, 147–176.
(6) Chait, B. T. Science 2006, 314 (5796), 65–66.
(7) Macht, M. Bioanalysis 2009, 1 (6), 1131–1148.
(8) Hoffman, M. D.; Sniatynski, M. J.; Kast, J. Anal. Chim. Acta 2008, 627 (1), 50–61.
(9) Han, X.; Jin, M.; Breuker, K.; McLafferty, F. W. Science 2006, 314 (5796), 109–112.

Top-Down Mass Spectrometric Protein Sequencing. Electron capture dissociation (ECD) is a well established technique on electrospray mass spectrometers for the top-down sequence analysis of proteins as large as 270 kDa.9 More recently, electron transfer dissociation (ETD) has also been employed for bottom-up10 and top-down protein identification.11,12 However, the potential for de novo sequencing of entire proteins by ESI-ECD or ETD remains unfulfilled.13 Matrix-assisted laser desorption ionization (MALDI), in contrast, has not yet been the subject of a great deal of attention in the top-down protein sequencing arena.14 Originally developed in the 1990s on linear mode MALDI-time-of-flight (TOF),15 top-down protein sequencing was later performed on reflector mode MALDI-TOF16 and TOF/TOF instruments.17,18 More recently, MALDI has been more broadly applied for top-down protein characterization using in-source decay (ISD),7,19 and MALDI-ISD measurements can be set up in automated workflows using the recently introduced matrix 1,5-diaminonaphthalene (DAN).18 Even challenging tasks that are difficult to investigate using ESI-based methods, such as the assignment of PTMs7 and PEGylation sites in therapeutic proteins, can be accomplished by MALDI-ISD.20 MALDI top-down sequencing (TDS) is predominantly used to characterize protein terminal sequences between approximately the 10th and 80th residue from the C- and N-terminus.21 Access to terminal sequences is provided by T3-sequencing, i.e., the further tandem mass spectrometry (MS/MS) analysis of fragment ions generated by ISD in the TOF/TOF part of the instrument.17 MALDI-TDS can be applied even in the presence of N-terminal blocking acetyl or pyroglutamyl groups,17,20 which gives the method a significant advantage over Edman sequencing. In analogy to ETD and ECD spectra, the MALDI-ISD process is mediated by a hydrogen radical transfer step and provides predominantly c, y, and (z + 2) fragment ions.18,22 Generally, a uniform fragmentation along the protein backbone is observed that provides long stretches of sequence tags in clearly readable ladders of singly charged fragment ions suitable for de novo sequencing. This is in stark contrast to CID,23 which provides solitary landmark peaks produced by labile bond cleavages, such as those adjacent to Pro or Asp, that can only be correlated to previously known protein sequences. In previous top-down protein sequencing studies, only partial assignments of previously known sequences were achieved.
This meant that database searching was essential to the success of the analysis11,12,24 or the work needed to be extensively complemented by bottom-up LC-MS/MS data.25 Here we report the first use of top-down sequencing for establishing a de novo protein sequence that could not be found in public protein sequence databases. MALDI-ISD and T3-sequencing26 were used to establish the complete structure of a 13.6 kDa camelid antibody. Additional LC-MALDI-MS/MS data from a tryptic digest were used to demonstrate the ability of high-energy collisions of the MALDI-TOF/TOF instrument27 to distinguish leucine (Leu) from isoleucine (Ile) residues. Once the full sequence of the variable heavy chain of heavy chain antibodies (VHH) domain antibody had been established using MALDI-TDS, we used ESI-ultrahigh-resolution orthogonal TOF (oTOF) technology to determine both the precise molecular weight and the isotopic distribution. These data were used to validate the assignment and, at the same time, demonstrate the absence of any chemical artifacts such as Gln/Asn deamidation. The combined approach showed that de novo sequencing of proteins up to 14 kDa without reference sequences is possible using MALDI-TOF. Furthermore, the precise determination of the molecular ion mass and isotopomer distribution is suitable for the validation of a determined sequence. To our knowledge, this study presents the first successful de novo protein sequence determination using top-down mass spectrometry.

Figure 1. Schematic structure of a conventional antibody (IgG) in comparison with camelid antibodies and their variable heavy chain nanobodies (VHH).

(10) John, J. P.; Pollak, A.; Lubec, G. Electrophoresis 2009, 30 (17), 3006–3016.
(11) Macek, B.; Waanders, L. F.; Olsen, J. V.; Mann, M. Mol. Cell. Proteomics 2006, 5 (5), 949–958.
(12) Rauser, S.; Marquardt, C.; Balluff, B.; Albers, C.; Belau, E.; Hartmer, R.; Suckau, D.; Specht, K.; Ebert, M. P.; Schmitt, M.; Aubele, M.; Höfler, H.; Walch, A. J. Proteome Res. 2010, DOI: 10.1021/pr901008d.
(13) Young, N. L.; Dimaggio, P. A.; Plazas-Mayorca, M. D.; Baliban, R. C.; Floudas, C. A.; Garcia, B. A. Mol. Cell. Proteomics 2009, 8 (10), 2266–2284.
(14) Breuker, K.; Jin, M.; Han, X.; Jiang, H.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2008, 19 (8), 1045–1053.
(15) Brown, R. S.; Lennon, J. J. Anal. Chem. 1995, 67 (21), 3990–3999.
(16) Suckau, D.; Cornett, D. S. Analusis 1998, 26, 18–21.
(17) Suckau, D.; Resemann, A.; Schuerenberg, M.; Hufnagel, P.; Franzen, J.; Holle, A. Anal. Bioanal. Chem. 2003, 376, 952–965.
(18) Demeure, K.; Quinton, L.; Gabelica, V.; De Pauw, E. Anal. Chem. 2007, 79, 8678–8685.
(19) Hardouin, J. Mass Spectrom. Rev. 2007, 26, 672–682.
(20) Yoo, C.; Suckau, D.; Sauerland, V.; Ronk, M.; Ma, M. J. Am. Soc. Mass Spectrom. 2009, 20 (2), 326–333.
(21) Suckau, D.; Resemann, A. J. Biomol. Techniques 2009, 20 (5), 258–262.
(22) Köcher, T.; Engström, A.; Zubarev, R. A. Anal. Chem. 2005, 77 (1), 172–177.
(23) Liu, Z.; Schey, K. L. J. Am. Soc. Mass Spectrom. 2005, 16 (4), 482–490.
(24) Binz, P. A.; Abdi, F.; Affolter, M.; Allard, L.; Barblan, J.; Bhardwaj, S.; Bienvenut, W. V.; Bulet, P.; Burgess, J.; Carrette, O.; Corthals, G.; Delalande, F.; Diemer, H.; Favreau, P.; Giuliano, E.; Gueguen, Y.; Guillaume, E.; Hahner, S.; Man, P.; Michalet, S.; Neri, D.; Noukakis, D.; Palagi, P.; Paroutaud, P.; Carvalho-Pimenta, D.; Quadroni, M.; Resemann, A.; Richert, S.; Rybak, J.; Sanchez, J.-C.; Scherl, A.; Scheurer, S.; Schweiger Hufnagel, U.; Siethoff, C.; Suckau, D.; van Dorsselaer, A.; Wagner-Redeker, W.; Walter, N.; Stöcklin, R. Proteomics 2003, 3, 1562–1566.

Camelid Antibodies: Nanobodies. In all mammalian species, antibodies are composed of two identical heavy chains and two identical light chains comprising a bipartite epitope binding domain (Figure 1). So far the only exceptions to this paradigm are immunoglobulin G (IgG) antibodies of the Camelidae (llamas, alpacas, camels, and dromedaries), which contain a fraction of heavy-chain antibodies (HCAbs) that lack the light chain.28 The antigen-binding unit of HCAbs is reduced to a single variable domain called the VHH domain (variable heavy chain of heavy chain antibodies) with a molecular weight of ∼14 kDa.
Because of their size of 2.5 nm × 4 nm, VHH domains are also called nanobodies (Figure 1). Nanobodies have numerous advantages over other small antibody fragments like Fab or single-chain variable fragment (scFv) that are derived from conventional antibodies. First, only one domain needs to be cloned and expressed to generate a mature nanobody in vivo. Second, specific nanobodies can easily be selected with phage display technologies. Third, nanobodies are highly soluble and stable and can be efficiently expressed in heterologous systems.29 Nanobodies typically have affinities in the nanomolar range, comparable with those of scFvs.30,31 Described applications of nanobodies include targeting and tracing of antigens in live cells, modulation of protein properties in living cells, targeted modulation of enzymes, and their use as immobilized nanotraps to precipitate protein complexes in vivo and in vitro.32-34

Technical innovations in mass spectrometry (MS) have made it possible to obtain sequence information from undigested proteins. Do these new developments mean that MS is ready to replace traditional techniques such as Edman sequencing? In this study, a well characterized protein sample, a camelid antibody, was generated and subjected to MALDI top-down de novo sequencing.

(25) Ma, M.; Chen, R.; Ge, Y.; He, H.; Marshall, A. G.; Li, L. Anal. Chem. 2009, 81, 240–247.
(26) Suckau, D.; Resemann, A. Anal. Chem. 2003, 75, 5817–5824.
(27) Macht, M.; Asperger, A.; Deininger, S. O. Rapid Commun. Mass Spectrom. 2004, 18 (18), 2093–2105.
(28) Hamers-Casterman, C.; Atarhouch, T.; Muyldermans, S.; Robinson, G.; Hammers, C.; Bajyana Songa, E.; Bendahman, N.; Hammers, R. Nature 1993, 363, 446–448.

MATERIAL AND METHODS

Study Design. This work was organized in the form of a blinded study, in which the group in Munich (H.L. and U.R.) generated the protein samples, and the group in Bochum (B.W. and K.K.) selected the protein of interest and validated and assayed its structure, purity, and amount. The Bochum group also acted as the anonymizer. The group in Bremen (A.R., J.F., D.S., and D.W.) subsequently performed mass spectrometric de novo sequencing and sequence validation work to characterize the primary structure of the selected protein. After the results had been returned to the anonymizer, the true sequence of the protein, namely, the camelid nanobody, was revealed. In the work presented here, the original analysis results are presented and discussed with relation to the actual protein sequence.
The authors who performed the de novo sequencing work had no information other than that (1) a protein with a molecular weight of approximately 15 kDa and a free N-terminus was to be sequenced, (2) the protein was homogeneous, with no other protein species being present in the sample, and (3) 120 pmol of protein was contained in each of the 6 vials received.

Selection, Expression, and Purification of the R2P9 Nanobody. The R2P9-nanobody was derived from a native VHH library. To generate the library we isolated lymphocytes from an alpaca (Lama pacos), amplified the repertoire of the variable regions of the heavy-chain antibodies by polymerase chain reaction (PCR), and cloned it into a phage display vector. A VHH library of 2.5 × 10⁵ individual clones was obtained in Escherichia coli TG1 cells. We randomly selected clones for recombinant expression analysis and determined the DNA sequences for three unique VHH domains. The sequence of the R2P9-VHH-domain showed the amino acid substitutions in the framework-2 region (see Figure S-4 in the Supporting Information) that are characteristic for a llama VHH: Y37, E44, R45 (numbering according to Kabat and Wu35). The nature of these amino acids abrogates the interaction with a possible VL domain, and their hydrophilic character renders the domain soluble in aqueous medium. The absence of additional cysteine residues besides the conserved C22 and C92 is a common feature of llama VHHs and distinguishes them from dromedary VHHs.29

The coding sequence of the R2P9-VHH-domain was cloned into the pHEN4C bacterial expression vector30 using the NcoI and NotI restriction sites (which adds a C-terminal in-frame histidine [His6] tail). The vector was used to transform Escherichia coli BL21 cells. For expression and purification, 1 L of E. coli culture was induced with 1 mM isopropyl β-D-1-thiogalactopyranoside (IPTG) for 20 h at room temperature (RT). Bacterial cells were harvested by centrifugation (10 min, 5000g) and the pellet was resuspended in 10 mL of binding buffer (1× PBS, 500 mM NaCl, 20 mM imidazole, 1 mM phenylmethanesulphonylfluoride (PMSF), and 10 µg/µL lysozyme). The R2P9-VHH-domain was highly expressed and yielded 0.5-1 mg of soluble VHH per 500 mL of IPTG-induced bacterial culture. For bacterial lysis, incubation continued for 1 h at 4 °C in a rotary shaker and the cell suspension was sonicated (5 × 30 s pulse) on ice. After centrifugation (20 min, 20 000g) soluble proteins were loaded on a preequilibrated 1 mL HiTrap-column (GE Healthcare) and purified using an Äkta-purifier system. The His-tagged R2P9 nanobody was eluted by a linear gradient from 20 to 500 mM imidazole as a monomer with a molecular weight of ∼14 kDa in gel filtration analysis, which is consistent with the size predicted from its sequence (Figure S-1 in the Supporting Information).

(29) Muyldermans, S. J. Biotechnol. 2001, 74, 277–302.
(30) Arbabi Ghahroudi, M.; Desmyter, A.; Wyns, L.; Hamers, R.; Muyldermans, S. FEBS Lett. 1997, 414, 521–526.
(31) Saerens, D.; Pellis, M.; Loris, R.; Pardon, E.; Dumoulin, M.; Matagne, A.; Wyns, L.; Muyldermans, S.; Conrath, K. J. Mol. Biol. 2005, 352, 597–607.
(32) Rothbauer, U.; Zolghadr, K.; Tillib, S.; Nowak, D.; Schermelleh, L.; Gahl, A.; Backmann, N.; Conrath, K.; Muyldermans, S.; Cardoso, M. C.; Leonhardt, H. Nat. Methods 2006, 3, 887–889.
(33) Jobling, S. A.; Jarman, C.; Teh, M. M.; Holmberg, N.; Blake, C.; Verhoeyen, M. E. Nat. Biotechnol. 2003, 21, 77–80.
(34) Rothbauer, U.; Zolghadr, K.; Muyldermans, S.; Schepers, A.; Cardoso, M. C.; Leonhardt, H. Mol. Cell. Proteomics 2008, 7, 282–289.

Gel Filtration.
A total of 10 µg of immobilized metal ion affinity chromatography (IMAC)-purified R2P9-nanobody was loaded on a Superdex-75 column (Amersham Pharmacia Biotech) and chromatographed at a flow rate of 0.5 mL/min in column buffer (1× PBS, 0.5 mM EDTA). BSA (66 kDa), carbonic anhydrase (29.5 kDa), and cytochrome c (12.5 kDa) were used as calibration standards.

R2P9 Nanobody Sequence. The sequence is as follows: DVQLVESGGGLVQAGGSLRLSCAASGIIFSINAMGWYRQAPGKERELVAAISSNGNTLYADSVKGRFTISRDNAKNTLYLQMNSLKPEDTAMYYCTAPDDEYDYWGQGTQVTVSSKKKHHHHHH. The nucleotide sequence for the VHH domain gene has been deposited by Ulrich Rothbauer in the NCBI database under GenBank Accession Number bankit1285548 GU233430.

Preparation of the Study Sample. The nanobody was sent to the anonymizer in 1 mL of 0.2 mg/mL protein in PBS buffer. For LC-ESI-MS/MS analyses, 10 µg of nanobody was precipitated with acetone to remove the PBS buffer. The pellet was resuspended in 35 µL of 40 mM ammonium bicarbonate/10% acetonitrile. Trypsin (sequencing grade, Promega) was added at an enzyme-to-substrate ratio of 1:20, and the sample was digested for 4 h at 48 °C. The reaction was stopped by adding 0.5 µL of TFA. Of each sample, 2.5 µL was diluted with 0.1% TFA to 15 µL and analyzed by LC-ESI-MS/MS as described in the Supporting Information. The intact R2P9 nanobody samples that were sent to the Bremen group were precipitated with acetone, resuspended in 10 mM ammonium bicarbonate buffer, and dried in a speed vac. Six aliquots (120 pmol each) of the test sample were used for MALDI top-down sequencing.

(35) Kabat, E. A.; Wu, T. T. J. Immunol. 1991, 147, 1709–1719.

MALDI-MS Analysis and Top-Down Sequencing of the Study Sample. The sample obtained from the anonymizer was dissolved in 10 µL of water/0.1% TFA prior to MS analysis. Different MALDI matrices were applied: sinapinic acid for intact mass determination and sDHB (super DHB = mixture of DHB and 2-hydroxy-5-methoxybenzoic acid, 1:9) or DAN (1,5-diaminonaphthalene) for ISD fragmentation. DAN was obtained from Acros Organics (Geel, Belgium); the other matrices were from Bruker Daltonics (Bremen, Germany), and all of them were used without further purification. Saturated solutions were made of sinapinic acid in 30% ACN/water/0.1% TFA and of DAN in 50% ACN/water/0.1% TFA. The DAN solution was always prepared freshly before use. A total of 50 mg of sDHB was dissolved in 1 mL of 50% ACN/water/0.1% TFA. A volume of 2 µL of sample solution was mixed with 2 µL of matrix solution and spotted in 1 µL droplets onto a stainless steel target and air-dried at ambient temperature. A protein mixture of insulin, ubiquitin, cytochrome c, and myoglobin (Protein Calibration Mix I, Bruker Daltonics) was used for external calibration of the mass spectra of intact proteins. Intact BSA (bovine serum albumin, GERBU, Germany) was prepared with DAN and sDHB, and ISD fragments were used for external calibration of ISD sample spectra. All MALDI mass spectra were acquired using an ultrafleXtreme MALDI-TOF/TOF mass spectrometer with smartbeam II laser (Bruker Daltonics).
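The R2P9 sequence listed under "R2P9 Nanobody Sequence" can be checked against the reported intact masses with a few lines of arithmetic. The following is a minimal sketch and not the authors' software: the function names are hypothetical, the residue masses are standard monoisotopic values, and one intramolecular disulfide bond between the two conserved cysteines is assumed for the intact protein.

```python
# Standard monoisotopic residue masses in Da (amino acid minus H2O).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565     # H2O added for the free N- and C-termini
HYDROGEN = 1.007825   # one H atom; two are lost per disulfide bond
PROTON = 1.007276     # charge carrier in positive ion mode

# R2P9 sequence as given in the Methods section (124 residues incl. His6 tail).
R2P9 = (
    "DVQLVESGGGLVQAGGSLRLSCAASGIIFSINAMGWYRQAPG"
    "KERELVAAISSNGNTLYADSVKGRFTISRDNAKNTLYLQMN"
    "SLKPEDTAMYYCTAPDDEYDYWGQGTQVTVSSKKKHHHH"
    "HH"
)


def monoisotopic_mass(sequence, disulfide_bonds=0):
    """Neutral monoisotopic mass of a linear protein chain.

    Assumption: each disulfide bond removes two hydrogen atoms.
    """
    residues = sum(MONO[aa] for aa in sequence)
    return residues + WATER - 2 * HYDROGEN * disulfide_bonds


def mz(mass, charge, n_mer=1):
    """m/z of an [n*M + zH]z+ ion for a neutral mass M."""
    return (n_mer * mass + charge * PROTON) / charge


if __name__ == "__main__":
    # Assumed: one disulfide bridge in the intact nanobody.
    print(f"monoisotopic M = {monoisotopic_mass(R2P9, disulfide_bonds=1):.4f} Da")
    # Triply charged dimer, using the measured average mass of 13 617.4 Da.
    print(f"[2M + 3H]3+ at m/z {mz(13617.4, 3, n_mer=2):.1f}")
```

Under the stated disulfide assumption, the computed monoisotopic mass falls within a few millidalton of the 13 610.6066 Da reported for the UHR-TOF measurement, and the same helper places the triply charged dimer near m/z 9079.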
The instrument was used in linear mode for intact mass determination and in reflector mode, and acquisition was optimized for a mass range from 1000 to 8000 Da allowing monoisotopically resolved ISD fragment ion detection. Approximately 2000-8000 laser shots were acquired per ISD spectrum using a 1 kHz acquisition speed in positive ion mode with a 25 kV acceleration voltage. The low mass deflection was set to 900 Da; detection gain was 3× the gain used for peptide detection. T3-sequencing spectra were acquired in MS/MS mode by selecting ISD fragment ions as precursors. No additional collision gas was used. The parameters for the Mascot MS/MS search of the N-terminus were as follows: protein database, NCBI (20080903); enzyme, none; taxonomy, mammalia; MS tolerance, 100 ppm; MS/MS tolerance, 0.7 Da; variable modification, Amidated (C-term), to account for the c-ion structure. Compass 1.3 software was used for data acquisition and processing, and BioTools 3.2 Service Pack 1 was used for TDS analysis (all Bruker Daltonics). The SNAP algorithm was used for monoisotopic peak annotation.36 The top-down module in BioTools is able to automatically create sequence tags on ISD spectra that represent the amino acid sequence of the protein. Additional interactive assignments of amino acid residues based on the mass differences between ISD fragment ions were performed using FlexAnalysis 3.3 processing software (Bruker Daltonics). The automatic and interactive software routines even permitted the assignment of dipeptide sequences in the case of a missing fragment in the c- or (z + 2)-ion series due to proline gaps.19 Proline gaps are associated with the sequence motif "X-Pro" (X, any amino acid residue), as cleavage N-terminal to the α-C atom of proline is typically not detected due to its cyclic nature.

For LC-MALDI analysis, one of the sample aliquots was digested with trypsin after reduction with dithiothreitol (DTT) and alkylation with iodoacetamide in 50 mM (NH4)2CO3. The digest was separated using a Bruker Easy-nLC equipped with a C18 NanoSeparations column (75 µm i.d. × 10 cm, 5 µm particles) with a flow of 0.3 µL/min. Fractions were collected with a Proteineer fc fraction collector and spotted on an 800 µm AnchorChip target (both Bruker Daltonics) with a continuous sheath flow of matrix solution (35 µL of a saturated solution of α-cyano-cinnamic acid in 95% acetonitrile/water/0.1% TFA was mixed with 74 µL of water, 675 µL of acetonitrile, 8 µL of 10% TFA in water, and 8 µL of 100 mM NH4H2PO4 in water).

Precise Molecular Weight Determination of the Nanobody. Before analysis, the sample was HPLC-purified. The HPLC separation was performed on an Agilent 1200 RP HPLC system equipped with a binary pump, column oven, and cooled well plate autosampler. A volume of 5 µL of the sample (1.6 pmol/µL) was cleaned up on a reverse phase column (Zorbax SB-C8, Rapid Resolution Cartridge, 2.1 mm × 30 mm, 3.5 µm). Water with 0.1% formic acid was used as solvent A; solvent B consisted of 0.1% formic acid in acetonitrile (all chemicals HPLC-MS grade). Gradient: 0 min, 0% B; 2 min, 0% B; 10 min, 98% B; 12 min, 98% B; 12.1 min, 0% B; 20 min, 0% B. The flow rate was 300 µL/min; the column oven was heated to 40 °C.

(36) Köster, C. Mass spectrometry method for accurate mass determination of unknown ions. U.S. Patent 6,188,064, February 13, 2001.

UHR-TOF. The Bruker Daltonics maXis ultrahigh-resolution oTOF mass spectrometer (UHR-TOF) was operated in ESI positive ion mode.
Further settings: nebulizer 1.6 psi, dry gas rate 9 L/min, dry gas temperature 190 °C, funnel rf 400 Vpp, multipole rf 400 Vpp, ion energy 5 V, low mass cutoff 322, collision energy 10 V, collision rf 1200 Vpp, ion cooler rf 400 Vpp, transfer time 120 µs, prepulse storage time 10 µs. Data were acquired in profile mode with a scan speed of 0.5 Hz with a resolution of 30 000 at m/z 922. The maXis was calibrated 1 h before sample analysis by direct infusion of a calibration mixture (ESI-L low-concentration tuning mix, Agilent Technologies) with a flow rate of 3 µL/min in the enhanced quadratic calibration mode. At the beginning of the sample measurement, a volume of 20 µL of the same calibration mixture was introduced to the ESI source via a six-port valve controlled by the MS acquisition software. For recalibration and further processing of the acquired data file, the DataAnalysis software 4.0, SP1 (Bruker Daltonics) was used. The data set was recalibrated externally in enhanced quadratic calibration mode by using the automatically introduced calibrant. On top of this recalibration, a single point correction ("lock mass") for every single mass spectrum was performed. The lock mass container is installed inside the ion source in the atmospheric pressure region right above the inlet to the ion optics and contains a small piece of mineral wool soaked with 20 µL of lock mass solution. The solution was prepared by 50-fold dilution of a stock solution (Chip Cube high mass reference, Agilent Technologies) in acetonitrile. The protonated lock mass ion has the elemental composition C24H19F36N3O6P3 and appears at m/z 1221.9906. An averaged mass spectrum was deconvoluted with the maximum entropy algorithm in DataAnalysis.

Fourier Transform Mass Spectrometry (FTMS). A 7 T solariX FT mass spectrometer (Bruker Daltonics) was operated in ESI positive ion mode. The mass spectrometer was calibrated externally with ESI tune mix (Agilent, diluted 1:40 in acetonitrile). The nebulizer gas was set to 3.0 L/min, drying gas flow rate was 4.2 L/min, and drying gas temperature was 210 °C. The ions were excited with a frequency sweep and detected for about 0.75 s in the mass range m/z 500-3000. The size of the acquired data set was 1 MW. The data were zero filled once before sine apodization. For mass calibration of the FTMS data in DataAnalysis 4.0, an internal single reference mass ([ubiquitin + 10H]10+ at m/z 857.47060) was used. The averaged mass spectrum was deconvoluted with a maximum entropy algorithm and peaks were picked using the SNAP II algorithm.36

Safety Considerations. DAN (1,5-diaminonaphthalene) matrix is a potential cancer hazard and may cause eye, skin, and respiratory tract irritation. The toxicological properties of this material have not been fully investigated. Personal protective equipment includes dust mask type N95 (US), eyeshields, and gloves. Hazard codes: Xn, N. Risk statements: 40-50/53. Safety statements: 36/37-60-61.

RESULTS

Nanobody Sequence Validation and Quality Control. The study protein R2P9 was characterized by bottom-up LC-ESI-MS/MS with a sequence coverage of 63%. Searches against the NCBInr database resulted in the identification of five partially overlapping peptides of camelid antibodies and contaminating proteins from Escherichia coli (Table S-1A in the Supporting Information). However, the peptides from camelid antibodies were not sufficient to determine the amino acid sequence of R2P9.
The purity and homogeneity of the R2P9 sample were confirmed by a search against an in-house database containing the R2P9 sequence and sequences of common contaminants such as keratins, resulting in the identification of R2P9 with a total Mascot score of 3045 and sequence coverage of 63% (Table S-1B in the Supporting Information). Since the R2P9 nanobody was not contained in public databases and its homogeneity was confirmed, it was chosen as the study sample and shipped by the anonymizer.

De Novo Sequence Determination of the Unknown Protein. The first step of a top-down analysis is usually the determination of the molecular weight, protein homogeneity, and sample quality. The linear mode MALDI-TOF spectrum of the sample in sinapinic acid contained a single protein with a molecular mass of 13 617.4 ± 1.5 Da (n = 3) and no contaminants (Figure 2). The degree of salt and other adducts was low, making the sample well-suited for ISD-based top-down sequencing.

Figure 2. The MALDI-TOF spectrum of the study sample R2P9 provided the molecular weight of the protein (average MW 13 617.4 Da, n = 3) and confirmed that the sample had an appropriate concentration and molecular homogeneity, both prerequisites for subsequent top-down sequencing work. The small peak at ∼m/z 9079 represents the 3+ charge state of the protein dimer.

(37) Takayama, M. J. Am. Soc. Mass Spectrom. 2001, 12 (4), 420–427.

The two matrices DAN and sDHB were used for ISD analysis. DAN is known to have reductive properties18,37 sufficient to reduce disulfide bonds. Furthermore, it facilitates the generation of intense N-terminal c- and C-terminal (z + 2)-ion series, while the a- and y-ion series are rather weak. This reduces the complexity of ISD spectra of proteins and facilitates the determination of N-terminal sequence tags and de novo sequencing in general. In contrast, sDHB typically promotes intense c- and y-ion formation accompanied by (z + 2)- and a-ions. The respective fragment spectra are more complex but provide better access to the protein C-terminal fragment ion series. Therefore, the complementary use of these matrices provides targeted analysis of both protein termini. N- and C-terminal sequences can be distinguished by the mass differences within the N-terminal (c − a = 45 Da) and C-terminal (y − (z + 2) = 15 Da) fragment ion series.17,26

The initial top-down sequence analysis of the unknown protein using the DAN matrix provided a clear ISD spectrum in the 1-7 kDa range (Figure 3). In this mass range, all peaks were monoisotopically labeled using the SNAP algorithm.36 An automated algorithm was used in the BioTools software to establish the sequence calls in the range from m/z 1841 to 5805 (Figure 3). The initial sequence call was truncated at m/z 4010. The de novo sequencing algorithm was then reapplied to the data set with proline gaps taken into account, which extended the readout of the sequence to m/z 5805. In this preparation, predominantly c-ions were observed that result from the peptide backbone cleavage C-terminal to the peptide bonds. In the case of the cyclic amino acid proline, the formation of a c-ion N-terminal to proline would require the cleavage of the ring. This does not occur with an observable frequency. Therefore, proline gaps are typically observed in c-ion and (z + 2)-ion series that result from cleavage of the same bond.19,38 To assign sequences across proline gaps de novo, the BioTools software permutes the sequence motifs XP and XPP with X representing an arbitrary amino acid. In the sequence calculation mode that enabled the detection of proline gaps, the dimer sequence AP was assigned between the c-ions at m/z 4010.0 and 4178.2.
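The c-ion ladder and the proline-gap assignment described above can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the BioTools algorithm: the function names are hypothetical, cysteine is taken as reduced (DAN reduces disulfide bonds), c-ions are treated as singly protonated with m/z equal to the sum of the residue masses plus NH3 plus one proton, and only the XP motif is tested for a gap.

```python
# Standard monoisotopic residue masses in Da (amino acid minus H2O).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
AMMONIA = 17.026549   # NH3 retained by a c-type fragment
PROTON = 1.007276     # single charge carrier

# First 42 residues of R2P9, taken from the sequence given in the Methods.
N_TERM = "DVQLVESGGGLVQAGGSLRLSCAASGIIFSINAMGWYRQAPG"


def c_ion_mz(prefix):
    """m/z of the singly charged c-ion that contains the residues in `prefix`."""
    return sum(MONO[aa] for aa in prefix) + AMMONIA + PROTON


def c_ladder(sequence):
    """Map n -> m/z for expected c_n ions, skipping proline gaps.

    c_n is omitted when residue n+1 is proline, because its formation
    would require cleavage of the proline ring.
    """
    ladder = {}
    for n in range(1, len(sequence)):
        if sequence[n] == "P":      # residue n+1 (0-based index n) is Pro
            continue
        ladder[n] = c_ion_mz(sequence[:n])
    return ladder


def assign_proline_gap(delta, tol=0.2):
    """Return XP dipeptides whose summed residue mass matches `delta` within `tol` Da.

    Leu is not distinguished from Ile; only one of the isobaric pair is reported.
    """
    hits = []
    for aa in sorted(MONO):
        if aa == "I":               # isobaric with L, report L only
            continue
        if abs(MONO[aa] + MONO["P"] - delta) <= tol:
            hits.append(aa + "P")
    return hits


if __name__ == "__main__":
    ladder = c_ladder(N_TERM)
    print(f"c21 = {ladder[21]:.2f}, c39 = {ladder[39]:.2f}, c41 = {ladder[41]:.2f}")
    print("c40 expected:", 40 in ladder)
    print("gap 4010.0 -> 4178.2:", assign_proline_gap(4178.2 - 4010.0))
```

With these assumptions the calculated c21 and c39 values agree with the peaks quoted at m/z 2041.1 and 4010.0, c40 is absent from the ladder, and the 168.2 Da step is matched only by Ala-Pro. The 0.2 Da tolerance is an arbitrary choice for this illustration.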
The isobaric reversed sequence PA can be excluded from assignment to the gap as the mass difference is comprised of the amino acid N-terminal to proline and the proline residue. MALDI-TDS spectra on the ultrafleXtreme provided welldistinguished peaks between 1 and 8 kDa. The ability to assign monoisotopic peaks diminishes at higher m/z values, while in the (38) Zubarev, R. A.; Kelleher, N. L.; McLafferty, F. W. J. Am. Chem. Soc. 1998, 120, 3265–3266.

Analytical Chemistry, Vol. 82, No. 8, April 15, 2010

Figure 4. T3-sequencing analysis of the protein’s N-terminus. The c-ion in the ISD spectrum (see Figure 3) at m/z 2041.2 was selected and further fragmented in the TOF/TOF without admission of collision gas. The resulting T3-spectrum was submitted to a standard Mascot search, and a dromedary VHH sequence (gi|7263724, bottom) with a peptide score of 134 (expectation value 1.4 × 10⁻⁹) was retrieved from the NCBI database providing a full assignment of the 21 N-terminal R2P9 residues.

Figure 3. MALDI-ISD spectrum of R2P9 in DAN matrix (top). On the basis of the monoisotopically labeled fragment ions (see examples in inserts) in this spectrum, the N-terminal 64mer sequence was established by a combination of software supported de novo sequencing. The displayed c-ion sequence tags were generated automatically in BioTools without (center) and extended with consideration of proline gaps (bottom). An AP proline gap was assigned at m/z 4010.0.

lower mass range the spectra tend to be complex, hampering the de novo read out of sequence tags. Therefore, the extension of sequence tags toward protein termini requires a second level of top-down sequence analysis on the same MALDI-TOF/TOF: T3-sequencing.7,19,26 Here, we selected the c-ion at m/z 2041.1 in the precursor ion selector and further fragmented it in the TOF/TOF part of the instrument (Figure 4) using laser-induced metastable dissociation (LID).17 The resulting MS/MS spectrum of the protein N-terminus was subjected to a protein database search in which a published camelid VHH (gi|7263724) was

Figure 5. Sequence assignment of the 65 N-terminal residues of R2P9 as compiled from de novo sequencing of the ISD fragment ions c18-c65 and by T3-sequencing of c21. The interruption in the c-ion series at c40 indicates a proline gap: here, an Ala-Pro dimer sequence fills the gap. The insets enable judging of the data quality of the sequence calls up to 6.5 kDa.

identified with a significant Mascot peptide score of 134 (expectation value 1.4 × 10⁻⁹). With this match, the remainder of the N-terminal sequence was efficiently resolved by a database search, and no further de novo sequencing of this region was required. The identified N-terminal sequence allowed the assignment of both the fragment ion at m/z 2041.1 as c21 and the 65 N-terminal amino acid residues. Figure 5 shows the original MALDI-TDS spectrum using DAN (Figure 2) annotated with the N-terminal protein sequence ranging from c11 to c65. The sequence calls at masses higher than m/z 5805 were obtained interactively and assignments were made using the FlexAnalysis software. Sequence assignment and peak picking were performed simultaneously, providing calls up to 6.6 kDa. Each

Figure 6. Analysis of the protein C-terminus. ISD spectrum of R2P9 obtained from an sDHB preparation (top). Predominantly C-terminal fragments are observed that provided for straightforward manual sequence assignment in the 1-3.5 kDa range. The sequence readout CTAPDDEYDYWGQGTQVTVSSKK was N-terminally truncated at a Cys residue suggesting the presence of a disulfide bond. The tag was C-terminally extended toward the protein C-terminus by T3-sequencing. The fragment ions at m/z 1699 (which turned out to be y14) were selected from the ISD spectrum and further fragmented in the TOF/TOF (bottom). Manual de novo sequencing in FlexAnalysis allowed extension of the known sequence VTVSSKK toward the C-terminus based on the y-ion series. The protein C-terminus was thus assigned as VTVSSKKKHHHHHH, revealing a C-terminal His6-tag.
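As a plausibility check on the y14 assignment, the singly protonated monoisotopic m/z of the C-terminal 14mer can be recomputed from standard residue masses. This is a minimal sketch rather than part of the published workflow; the helper name `y_ion_mz` and the mass table are introduced here for illustration only.

```python
# Monoisotopic residue masses in Da; "L" stands for the isobaric pair Leu/Ile.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H2O, monoisotopic
PROTON = 1.007276   # mass of a proton

def y_ion_mz(c_terminal_sequence):
    """Monoisotopic m/z of the singly charged y-ion formed by the given C-terminal residues."""
    residues = sum(RESIDUE_MASS[aa] for aa in c_terminal_sequence)
    return residues + WATER + PROTON

y14 = y_ion_mz("VTVSSKKKHHHHHH")
```

The computed value lies within about 0.1 Da of the precursor at m/z 1699 selected for T3-sequencing, which is consistent with its assignment as y14.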

call was based on the unequivocal assignment of monoisotopic peaks with relatively high abundance and a maximal residue mass tolerance of 0.1 Da. Exemplary N-terminal sequence calls based on c-ions are shown in Figure 5 (top panel). The reader should note that this spectrum acquired from the DAN matrix preparation permitted Cys-22 to be read through as free thiol without any other reducing agent but DAN being added.18 This single ISD spectrum with an additional T3-sequencing spectrum provided the full, uninterrupted sequence assignment of N-terminal residues 1 to 65, covering 59% of the protein molecular weight. The remaining sequence was obtained through C-terminal protein sequencing following the same approach. A second preparation of R2P9 using sDHB provided an ISD spectrum in which y- and (z + 2)-ions were quite intense whereas the c-ions were largely absent due to the intact disulfide bridge which truncated the N-terminal sequence readout (Figure 6, top). Such truncations are typical indicators for the presence of branched or cyclic structures, e.g., for proline, as discussed above. Protein glycosylation, PEGylation,20 other high molecular weight modifying groups, or disulfide cross-links17,26 are typically the cause of such abrupt drops in the intensities of fragment ion series. As described below, a single disulfide cross-link is actually present in the sequence, with one cysteine residue following the sequence calls at m/z 3410.9. This residue also causes the truncation of the C-terminal sequence readout. Again, the combination of automated and interactive, software-assisted de novo sequence

Figure 7. Extension of the C-terminal sequence readout toward the N-terminus and complete sequence draft of R2P9. To extend the readout beyond Cys-95, the study sample protein was reduced using TCEP and prepared in sDHB. The C-terminal sequence was established with high confidence up to ∼(z + 2)55 and with less confidence up to (z + 2)60 (top panels) and used to generate a draft sequence. The draft sequence was initially validated by bottom-up LC-MALDI-MS/MS analysis; the matching tryptic peptides are shown as gray bars (N-terminal peptide is of highest abundance and colored black). Sequence stretches obtained by the initial ISD spectra are marked in green, stretches derived from T3-sequencing in yellow, and the sequence obtained in TCEP/sDHB analysis based on C-terminal sequencing with (z + 2) ions in blue. Amino acid residues confirmed by bottom-up analysis (red) provided a sequence coverage from LC-MALDI work of 67%.

analysis provided sequence calls from m/z 1210.7 to m/z 3410.9, with a clear truncation of the ion series beyond that point. A y-ion at m/z 1699.0 was further selected for T3-sequencing (Figure 6, bottom) to extend the sequence tag toward the protein C-terminus, providing a spectrum well suited to automated de novo sequencing. The RapiDeNovo module for de novo sequencing in the BioTools software provided a sequence suggestion (100 ppm MS tolerance, 0.3 Da MS/MS tolerance) based on y-, b-, a-, and immonium ions that revealed a C-terminal His6-tag characteristic of recombinant proteins and the full assignment of the 14 C-terminal residues. The 9 C-terminal calls were revealed only by T3-sequencing analysis. Figure 6 summarizes the sequence assignments based on ISD and T3-sequencing, which cover the 29 C-terminal residues of R2P9, corresponding to 25% of the protein molecular weight. The four spectra discussed so far covered 74% of the protein sequence at that stage. As sDHB does not, unlike DAN, provide for the reduction of disulfide bonds, the N-terminal extension of the C-terminal 29mer sequence beyond cysteine required an additional sample preparation. Here, the protein was reduced by tris(2-carboxyethyl)phosphine (TCEP) under acidic conditions followed by the addition of sDHB and MALDI sample preparation. The sequence calls were manually extended from (z + 2)29 up to (z + 2)60 (Figure 7), which did not allow for a significant overlap with the N-terminal sequencing result except for Gly-65. The quality of monoisotopic peak assignments and sequence calls was quite good up to ∼(z + 2)55. However, the calls up to (z + 2)60 were based on monoisotopic peak assignments from isotopomers with low signal-to-noise values that reduced the certainty of these assignments.
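The principle behind the sequence calls described above, reading residues from the mass differences between consecutive monoisotopic fragment peaks within a residue mass tolerance of 0.1 Da, can be sketched as follows. This is a simplified illustration and not the BioTools or FlexAnalysis algorithm; the function name `read_ladder` and the synthetic peak list are assumptions. Leu and Ile are isobaric and are reported as "L", and Lys and Gln differ by only 0.036 Da, so the sketch simply returns the closest residue mass.

```python
# Monoisotopic residue masses in Da; "L" stands for the isobaric pair Leu/Ile.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def read_ladder(peaks, tol=0.1):
    """Call one residue per pair of adjacent peaks; '?' marks a step with no match within tol."""
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        best = min(RESIDUE_MASS, key=lambda aa: abs(RESIDUE_MASS[aa] - delta))
        calls.append(best if abs(RESIDUE_MASS[best] - delta) <= tol else "?")
    return "".join(calls)

def synthetic_ladder(start, sequence):
    """Build an idealized fragment ladder (hypothetical data) from a known sequence."""
    peaks = [start]
    for aa in sequence:
        peaks.append(peaks[-1] + RESIDUE_MASS[aa])
    return peaks

tag = read_ladder(synthetic_ladder(1000.0, "GTQVTVSS"))
```

A step that matches no single residue, such as the 168.09 Da difference of an Ala-Pro pair, is returned as "?" and would have to be resolved separately as a proline gap.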
Therefore, additional means of validation were employed that included bottom-up LC-MALDI MS/MS confirmation and top-down precision matching between the calculated molecular weight based on that sequence and the

experimental molecular ion information. A MALDI peptide mass fingerprint after tryptic digestion of R2P9 provided a sequence coverage of 67%. Unfortunately, this bottom-up approach did not provide a good coverage of the range between the amino acid residues 64 and 95, and thus an LC-MALDI-MS/MS analysis was performed to close that gap. All peptides that were analyzed by LC-MALDI-MS/MS were subjected to searches against the established sequence. BioTools permits the search and scoring of matching peptides after unspecific proteolytic cleavage. The peptide R2P9-(58-71) at m/z 1499.90 (theoretical m/z = 1499.7965) was reliably assigned, confirming the part of the determined sequence with the lowest confidence (Figure S-2 in the Supporting Information). The typical question that can be addressed by Edman sequencing39 better than with any mass spectrometric method is the reliable discrimination of the isobaric residues Leu and Ile. All assignments of Leu in the draft sequence at this point stand for L or I (i.e., Leu or Ile); a further distinction cannot be made by any current top-down mass spectrometric sequencing approach. The assignment of Ile-69 in the sequence draft (Figure 7) was based solely on the database identification of the peptide R2P9-(65-71) “GRFTISR” using Mascot (data not shown). The intense N-terminal peptide R2P9-(1-19) in the LC-MS/MS data set was fragmented under high-energy CID (heCID) conditions to provide differentiation between Leu and Ile (Figure S-3 in the Supporting Information). MALDI-TOF/TOF fragmentation of peptides utilizing a collision gas allows the distinction between Leu and Ile, as they produce side chain-specific d- and w-fragment ions.17,27,40 The heCID spectrum of the peptide R2P9-(1-19) shows intense Leu-specific w-ions wL2 (m/z 229.09), wL9 (m/z 841.38), and wL16 (m/z 1440.71) demonstrating the presence of Leu at all three positions.
These data resulted in Leu-2, Leu-9, Leu-16, and Ile-69 being included in the sequence delivered to the anonymizer while all other calls of Leu in the sequence represented a placeholder for either Leu or Ile. However, the heCID data obtained from the N-terminal peptide demonstrate that an additional bottom-up LC-MALDI experiment not only provides data that can validate sequence candidates but can also provide distinction between the isobaric amino acids Leu and Ile, which is not possible by any other current MS instrumentation in the typical peptide and protein laboratory, such as electrospray instruments. A homology search based on the determined sequence using protein BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi)41 identified several homologous camelid VHH sequences from Lama pacos, Lama glama, and Camelus dromedarius with expectation values in the range 10⁻⁴² to 10⁻⁴⁹ (data not shown). A comparison between the determined candidate sequence and 42 VHH genes42 indicated a reasonable homology of 81% for amino acid residues R2P9-(1-95), taking germline VHH cvhhp4242 as an arbitrary reference. Two complementarity determining regions (CDR1 and CDR2) are coded in this part of the sequence, while R2P9-(96-124) contained a camelid VHH

(39) Edman, P. Mol. Biol. Biochem. Biophys. 1970, 8, 211–255.
(40) Papayannopoulos, I. Mass Spectrom. Rev. 1995, 14, 49–73.
(41) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215 (3), 403–410.
(42) Nguyen, V. K.; Hamers, R.; Wyns, L.; Muyldermans, S. EMBO J. 2000, 19 (5), 921–930.

Figure 8. Fully sequence-assigned ISD spectrum of R2P9 in DAN. The error plot (bottom) allows the overall data quality to be judged based on the mass errors associated with each sequence call. Beyond approximately 4-5 kDa, mass errors increase from ∼0.02 to 0.3 Da due to reduced peak intensity and spreading of the fragment ion information into more isotopomers.

nanobody sequence fused to a His6-purification tag introduced by the plasmid expression vector pHEN4C. In order to estimate whether Leu or Ile residues are present at positions that were not elucidated by heCID analysis, we compared the candidate sequence with the sequences published by Nguyen42 in Figure S-4 in the Supporting Information. Out of 14 ambiguous positions that were occupied by either Ile or Leu, 6 Leu and 2 Ile residues could be assigned using the residue numbering scheme of Figure S-4 in the Supporting Information: Leu-4, Leu-11, and Leu-18 were unequivocally determined using heCID of the N-terminal peptide, and Leu-20, Ile-51, Ile-70, Leu-81, and Leu-86 were assigned by analogy to the reference sequence. Neither MS nor homology evidence could be used to assign residues Leu-27, Leu-28, Leu-31, Leu-47, Leu-59, and Leu-78 and therefore their assignment remained ambiguous. The draft sequence in Figure 7 was assigned using the initial ISD spectrum (Figure 3), which provided for the extension of the N-terminal sequence to c71 (Figure 8) and the readout of GRFTISR from N- and C-terminal sequence calls in the ISD spectrum at ∼7 kDa. Here, the resolving power of the MALDI-TOF/TOF instrument in the range of 30 000-45 000 was essential to safely assign monoisotopic masses up to ∼8 kDa. A plot (Figure 8, bottom) of the mass errors of all matching fragment ions of the ISD spectrum (top) shows mass errors of less than 0.1 Da for most sequence calls below 5 kDa and of less than 0.3 Da in the range 5-8 kDa. Across the entire mass range from 1 to 8 kDa, a root mean square (rms) error of 0.1 Da was the basis for the sequence assignment. Molecular Weight of R2P9. For the final confirmation of the draft sequence, we determined the intact protein mass with high accuracy. New generation ultrahigh-resolution ESI-oTOF instruments use an analog-to-digital converter.
This technique provides high-precision determination of the quantitative isotopomeric composition of mass spectrometric peaks due to the high dynamic range and the absence of dead times that could attenuate subsequent isotopomers. The mass accuracy for isotopically resolved protein molecular weight determination using internal “lock mass” calibration is typically better than 1 ppm. We obtained a baseline-resolved isotopic distribution of the molecular ion of R2P9 on the maXis ESI-oTOF with a charge deconvoluted neutral base peak isotopomer at 13 618.6271 Da (see Figure 9). The elemental composition of R2P9 was calculated based on the determined sequence as C594H916N172O187S5 yielding a theoretical monoisotopic molecular weight of 13 610.605 83 Da. The isotopic pattern was calculated and matched to the experimental data assuming 32 000 resolving power. The experimentally determined base peak, which we use routinely for the MW determination of proteins with resolved isotopic peaks, was only 0.8 mDa (59 ppb) higher than 13 618.6263 Da, i.e., the base peak of the calculated isotopomer distribution. The offset between both entire isotopomer distributions as specified by the SNAP algorithm36 was -0.92 ppm, which is largely due to poorer matches of the isotopomers with lower abundance. Thus, the experimentally determined molecular weight was in excellent agreement with that calculated from the determined sequence. We used this information solely as a gross control parameter suggesting the sequence that was established is actually in agreement with an independently accessible parameter that can be accurately determined. In addition, the accurate match between the calculated and experimental isotopic distribution led to the conclusion that the protein in question was produced without any unwanted side reactions or artifacts, such as deamidation of Gln or Asn. This was also evidence for the complete formation of the disulfide bond between Cys-22 and Cys-95 that should have been formed under the oxidative conditions of protein folding. This contrasts with the observation of deamidated peptides in LC-MS/MS analysis (data not shown), indicating that the deamidation occurred during protein digestion and chromatographic separation of the bottom-up analysis rather than accurately reflecting the actual properties of the protein sample. We repeated the accurate molecular weight determination using a 7 T ESI-FTMS and obtained a precise molecular weight of 13 610.605 24 Da. This represents the monoisotopic molecular weight with a mass error of 0.59 mDa (43 ppb), roughly equivalent to the mass of an electron, based on a match of the entire experimental and calculated isotopic patterns using the SNAP algorithm.36 All evidence based on precise mass determination and isotopic pattern matching indicated that the study sample was homogeneous and an excellent match to the determined sequence. Therefore, the results described here were returned to the anonymizer for evaluation and correlation with the internal quality control results. The anonymizer received the results and compared them with the correct R2P9 sequence (Figure S-4 in the Supporting Information). In conclusion, the entire sequence was correctly determined using MALDI-TDS, including the 3 Leu residues in the N-terminal peptide. All other Leu/Ile ambiguities were unresolved in the returned data.

Figure 9. High-resolution ESI mass spectrum of intact R2P9 after charge deconvolution (top). Inverse pattern (bottom) and monoisotopic molecular weight of 13 610.6058 Da calculated based on the elemental composition C594H916N172O187S5 derived from the candidate sequence. Assigned base peak MWs show a 59 ppb mass error between both isotopomer distributions.
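The intact-mass figures quoted for R2P9 can be reproduced with a short calculation. The sketch below is illustrative only: the helper names `monoisotopic_mass` and `ppb_error` are introduced here, and the isotope masses are standard reference values rather than numbers taken from the paper. It sums the monoisotopic masses for the elemental composition C594H916N172O187S5 and expresses the two reported mass deviations in ppb.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotopes: 12C, 1H, 14N, 16O, 32S.
MONO = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}

def monoisotopic_mass(formula):
    """Sum monoisotopic masses for a simple formula such as 'C594H916N172O187S5'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONO[element] * int(count or 1)
    return total

def ppb_error(measured, reference):
    """Relative deviation of a measured mass from a reference mass in parts per billion."""
    return (measured - reference) / reference * 1e9

theoretical = monoisotopic_mass("C594H916N172O187S5")
ftms_error = ppb_error(13610.60524, 13610.60583)     # ESI-FTMS monoisotopic mass vs theory
base_peak_error = ppb_error(13618.6271, 13618.6263)  # ESI-oTOF base peak vs calculated base peak
```

Under these assumptions the theoretical mass comes out at about 13 610.6058 Da, and the two deviations are about -43 ppb and +59 ppb, matching the magnitudes stated in the text.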

DISCUSSION

This study investigated whether top-down protein sequencing on a MALDI-TOF/TOF mass spectrometer was suitable for the de novo sequencing of an entire, undigested protein. This extends the scope of top-down sequencing beyond analysis of known peptides37 or proteins,9,18 proteins carrying modifications,7,20,24 or proteins that can be identified by database searching.12,21,26 The work presented here reports on top-down de novo sequencing of the largest protein (124 residues, 13.6 kDa) analyzed by MS thus far. Bottom-up data were used solely to confirm the draft sequence and to further resolve Leu/Ile ambiguities by heCID. Most other top-down based de novo protein sequencing projects have relied on bottom-up approaches to a much greater extent and involved smaller proteins such as the crustacean hyperglycemic hormone, which comprises 72 amino acid residues.25 This work demonstrates the power of MALDI-TDS for de novo sequencing of proteins, a task that has traditionally been addressed by Edman sequence analysis.39 The acquisition of ISD spectra for such analyses can be largely automated with the DAN matrix.18 Even the database interrogation of top-down data can be performed in an automated fashion.21,26,43 The sample consumption

(43) Zamdborg, L.; LeDuc, R. D.; Glowacz, K. J.; Kim, Y. B.; Viswanathan, V.; Spaulding, I. T.; Early, B. P.; Bluhm, E. J.; Babai, S.; Kelleher, N. L. Nucleic Acids Res. 2007, 35, W701–706.

for MALDI-ISD spectra is typically in the 10-50 pmol range for each sample preparation, which is sufficient to define a protein's N- and C-termini and to call up to 80 amino acid residues. This compares favorably with Edman sequencing. Nevertheless, the task in this study was considerably more difficult as every single residue needed to be accounted for. The high resolving power of the MALDI-TOF instrument that provided monoisotopic peak assignment up to ∼8 kDa and a high mass accuracy of fragment ions were essential for the successful sequence determination. We regard the long consecutive sequence tags that are usually obtained from MALDI-TDS spectra as extremely helpful for de novo protein sequencing. This property of MALDI-TDS spectra contrasts with most published ECD or ETD spectra in which fragment ions are produced sporadically along the protein backbone.9,14,38,43 However, although some sequence calls were relatively simple and required less than an hour of experimental and data analysis time (green shaded sequence in Figure 7), the determination of the remaining protein sequence required considerably more effort. The elucidation of the N- and C-terminal sequences (shaded yellow) required additional pseudo-MS3 analysis (T3-sequencing)26 from the same sample preparations. Here, the availability of a matching N-terminal sequence in the NCBI protein sequence database greatly facilitated the 14 N-terminal sequence calls, which otherwise would have required de novo sequencing of that spectrum. Such sequencing is possible but is time-consuming and requires considerable experience. Overall, calling the 124 amino acid residues of the VHH protein took 1 week, including the confirmation through precision mass matching of the intact molecular weight at the 40-50 ppb level.
In contrast to Edman sequencing, C-terminal sequence calls are possible with MALDI-TDS and N-terminal protein modifications such as pyroglutamylation or acetylation do not preclude N-terminal sequence calls. Typical sample requirements are the availability of >10 pmol of isolated protein in solution for identification and >100 pmol for de novo sequencing. The quality of the protein sample is very important for successful MALDI-TDS analysis; high-quality samples may permit calling more than 140 residues21 and identification of proteins larger than 100 kDa. Large parts of the determined sequence were confirmed by LC-MS/MS analysis, illustrating that the combination of top-down de novo sequencing and bottom-up validation may work quite well for sequence assignments

(44) Schäfer, H.; Chervet, J. P.; Bunse, C.; Joppich, C.; Meyer, H. E.; Marcus, K. Proteomics 2004, 4 (9), 2541–2544.

whereas using the bottom-up approach alone would not provide a sequence coverage sufficient for de novo peptide sequencing. This work demonstrates that ∼14 kDa proteins can be sequenced entirely by MALDI-TDS. Larger proteins may also be analyzed by MALDI-TDS but only if (a) reference sequence information is available or (b) results are complemented by bottom-up approaches. Typically, the N- and C-terminal 50-80mer sequences can be obtained relatively quickly by MALDI-TDS. This makes the approach well-suited for combination with common methods of molecular cloning. The terminal sequences obtained by MALDI-TDS enable PCR primers to be designed for cloning an unknown protein of interest and sequencing it using common DNA sequencing methods. This combination has the added value of providing (a) detailed information about both terminal sequences including modifications, (b) an accurate description of the molecular ion, including the molecular heterogeneity (e.g., different glycoforms), and (c) the full sequence of a larger protein chain. In such a combined approach, the termini could be reliably determined in an initial step while sequencing through the core region of a protein could be achieved through DNA sequencing. For such approaches, MALDI-TDS is highly suitable as it routinely provides sequence information from both termini.

ACKNOWLEDGMENT

The authors thank Friedrich Lottspeich (Martinsried) for confirmative Edman sequencing work of the R2P9 nanobody and Helmut E. Meyer (Bochum), Friedrich Lottspeich, and Roland Kellner (Darmstadt) for organizing this study as a contribution to the 2009 Martinsried Meeting on “Micromethods in Protein Chemistry” and their support and critical discussion of the manuscript.

NOTE ADDED AFTER ASAP PUBLICATION

This paper was published on the Web on March 23, 2010. Text in the title and in the tenth paragraph of the Materials and Methods was revised. The corrected version was reposted on March 26, 2010.
SUPPORTING INFORMATION AVAILABLE

Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review January 8, 2010. Accepted March 8, 2010. AC1000515