Product Review: Software for MS Protein Identification - Analytical

Judith Handley. Anal. Chem. , 2002, 74 (5), pp 159 A–162 A. DOI: 10.1021/ac021963e. Publication Date (Web): March 1, 2002. Cite this:Anal. Chem. 74,...
product review

Software for MS Protein Identification

Automated protein analysis has arrived.

Judith Handley

Protein database search programs are now widely available, and many include handy “tools” that ease characterizing structures as well as identifying proteins. Many programs offer similar capabilities and operate on mass spectrometers from a number of vendors. Even better, most software vendors let prospective customers preview their software by providing free access through their websites. However, transferring MS data to a website is usually a cut-and-paste process that is limited to one spectrum at a time. If you buy the product, the licensed software provides privacy and usually transfers data automatically. In addition, software developers are forming alliances with instrument manufacturers to integrate their software with mass spectrometers, online instruments, and relational databases.

This product review surveys some of the available software, and several proteomics researchers offer their opinions on “test-driving” software under individual working conditions. Instead of the usual product table, websites are given at the end of the article. These sites contain further information, and most have links to databases, other search engines, and protein characterization tool software.

Scenery

Nearly all database searches begin with the basics. The user sets the search engine for such parameters as the enzyme used for the digest, the type of mass spectrometer, the database and species to search, and types of chemical modifications. “Often people set [the parameters] so you hit at least four to five peptides to definitively say that you’ve matched the protein,” says Gary Siuzdak of the Scripps Research Institute.

The “traditional” first step in protein analysis is to collect data from a MALDI mass spectrometer and use it with a search engine designed for a “mass map” or “fingerprint spectrum”, says Mark McDowall of Micromass (United Kingdom). “All mass fingerprint search engines use a slightly different approach, but [they] basically take a protein database and, knowing the rules of enzymology, predict where [each] protein is digested with an enzyme,” says John Michnowicz of Agilent Technologies. For relatively pure samples, McDowall says that this process of generating theoretical m/z peaks and matching them with the fingerprint spectrum is very rapid and reliable.

But if the sample is complex, the search results may suggest several possible proteins. Then MS/MS is the usual next step. Tandem MS is also useful if concentrations are low or more structural information is needed. McDowall says that some researchers even choose to bypass a fingerprint spectrum and begin their analysis with MS/MS.

However, searching tandem MS data requires special algorithms. Tandem MS creates fragmentation patterns by breaking polypeptide ions from the enzyme digest into pieces. McDowall says that a typical algorithm predicts this fragmentation pattern “from known rules in the literature. It produces a list of fragment masses and compares them to those actually in your spectrum.” This search engine looks for the best matching set of mass peaks to identify the protein.

An alternative to peak-matching is determining amino acid sequences from the MS/MS spectrum and then searching a database for matching amino acid sequences. A related method searches for a “sequence tag”, which Brian Chait of Rockefeller University defines as a characteristic stretch of three or four amino acids in the middle of a specific peptide. Al Burlingame of the University of California–San Francisco says that some programs automatically find the sequence tag once the mass spectrum is entered, but for other programs, “somebody has to manually decide that two or three amino acids actually belong together.” Finally, notes Michnowicz, “Some software search programs are fine-tuned for a specific mass spectrometer. For instance, the ion trap will fragment materials slightly differently than a triple ‘quad’.”

Twists in the road

A search engine would most likely be sufficient if protein samples are relatively pure or if only identification is wanted. But some researchers want more. There can be variations of the same protein, and identifying a protein does not identify the actual structure in the sample. For instance, if three particular peptides “only exist in the first half of a sequence, you wouldn’t know the specific form [of the polypeptide] because you don’t have information about the rest of the sequence,” explains Alex Taylor of Immunex. For some applications, such as pharmaceutical research, specific structures are needed.

If a search engine cannot sequence the amino acids, software tools are available to determine short sequences of amino acids for a sequence search program. Other tools determine such possible variations as glycosylations and phosphorylations, sequence errors, or unexpected proteolytic cleavages. Depending on the vendor, these tools may be sold as separate modules or part of the search algorithm.

Some programs go a step further by pulling together all of these pieces of the structural puzzle for de novo sequencing: deciphering the amino acid sequence of a protein not already in a database. “I think people have begun to understand now that the most important information is not in peptides that can be identified with the protein, but with the peptides which are modified, and that’s where a lot of the structural and functional information is held. That’s where we need to apply de novo sequencing,” comments Chris Sutton of Kratos Analytical.
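The mass fingerprint search that Michnowicz and McDowall describe under Scenery can be pictured with a small sketch. It is an illustration only, not the method of any product reviewed here: the cleavage rule is a simplified version of trypsin specificity, the 0.2 Da tolerance is an arbitrary choice, the score is a bare count of matched peptides, and the two-entry database holds made-up sequences.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # singly protonated ions are assumed throughout


def tryptic_digest(sequence):
    """Split after K or R, except before P (a simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mz(peptide):
    """Theoretical m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def matched_peptides(observed_mz, sequence, tolerance=0.2):
    """Count theoretical peptides with an observed peak within tolerance."""
    return sum(
        any(abs(peptide_mz(p) - mz) <= tolerance for mz in observed_mz)
        for p in set(tryptic_digest(sequence))
    )


def rank_candidates(observed_mz, database, tolerance=0.2):
    """Order database entries by number of matched peptides, best first."""
    scores = {
        name: matched_peptides(observed_mz, seq, tolerance)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy database with made-up sequences, for illustration only.
toy_database = {"protein_a": "AAKGGRPLLKTT", "protein_b": "MMMKWWWR"}
peaks = [peptide_mz("AAK"), peptide_mz("GGRPLLK")]
print(rank_candidates(peaks, toy_database))
```

A real search would also have to handle missed cleavages, chemical modifications, and a proper score, which is where the products in this review differ from one another.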

Trust software?

The consensus of experts is a qualified “yes” that the available protein identification and characterization software is reliable, but there are a few cautions. “All of these search engines depend on whether or not the nominal masses of the peaks and the charge states ... are assigned correctly,” Burlingame warns. Whether or not there are serious errors, he says, depends on several data factors, such as the S/N of the spectra and the way the mass peak is determined. He says that the mass peak is also affected by the increasing relative proportion of C-13 in peptide masses greater than ~1600 Da. Because most of the proteins’ weight comes from carbon (of which C-13 is ~1.1%) and hydrogen, as the number of carbon atoms increases, C-13 becomes more important for these peptides. Thus, Burlingame says, it is important to understand all the data in the run and apply judicious human intervention.

Algorithms that score search results give clues to the certainty of a protein’s determined identity, but methods for scoring differ among search engines and manufacturers. Numerical values may be absolute, relative, or based on probability or statistics. Calculations may include peak intensities, the relative number of matched peaks versus the set of peaks for the target protein, and other criteria. John Yates of the Scripps Research Institute says, “Scoring can be a quagmire. All the programs have strengths and weaknesses.” He says that users gain confidence in a scoring algorithm with experience, and he recommends running control samples to learn an algorithm’s capabilities and limitations.
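Burlingame's point about C-13 can be checked with a short binomial calculation. The sketch below is an illustration only: it counts carbon alone, uses the ~1.1% abundance quoted above, and takes the carbon counts (30, 70, and 110) as assumed round numbers rather than values for any real peptide.

```python
from math import comb

C13_ABUNDANCE = 0.011  # ~1.1%, the figure quoted in the article


def carbon_isotope_fractions(n_carbons, max_heavy=3, p=C13_ABUNDANCE):
    """Probability of exactly k carbon-13 atoms, for k = 0..max_heavy.

    Carbon-only binomial model; isotopes of H, N, O and S are ignored.
    """
    return [
        comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)
        for k in range(max_heavy + 1)
    ]


# Assumed carbon counts for a small, a medium and a large peptide.
for n in (30, 70, 110):
    fractions = carbon_isotope_fractions(n)
    print(n, [round(f, 3) for f in fractions])
```

In this carbon-only model the first isotope peak is n·p/(1 − p) times the height of the all-C-12 peak, so the all-C-12 peak stops being the tallest once a peptide has roughly 90 carbons. Isotopes of the other elements, which the sketch ignores, shift that crossover, and a peak picker that assumes the tallest peak is the monoisotopic one will then be off by about 1 Da.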

Drivability

With all the capabilities of these programs, one might think that a supercomputer is necessary to run them. Actually, the complexity and number of samples are the major factors that determine how much computing power is required. Software is available, often from the same vendor, for running search engines and associated protein analysis tools on almost any type of computer or network system.

According to Chait, some laboratories are tackling 100,000 identifications a week. On such a scale, more computing power and additional types of software are needed. “When you’re doing that amount of identifications, it has to be highly automated, and all of that data has to be placed in a special relational database so that you can fish out the data when you need it,” he says. Communication among different software and hardware modules then becomes an issue. “Every different manufacturer puts out data, usually in their own format,” says Chait. “Some of them are very straightforward ... so that people can just plug data into a search engine. Others are like cryptology exercises.” Most of the current software enables data to be copied and pasted without reformatting, or reformatting may require only a trivial push of a button.

In comparing software capabilities, Steven Gygi of the Harvard Medical School considers situations in which very small differences in data and its processing significantly affect results. “If there is enough information to identify [a protein], but there are low-abundance peptides, the MS/MS spectra may be marginal. Then the difference in algorithms becomes more apparent as you try to decide what is important and what is not important,” he says. So test-driving helps identify algorithms that produce the desired results for these situations.

Siuzdak says that with the same data, he has been able to identify a protein with one program, but not another. “So the programs are handling data in different ways,” he says. He recommends running the same set of data with different programs and comparing the results: the certainty of the identification, the ability to look at protein modifications, how easy the output format is to interpret, and the “ability to simultaneously identify multiple proteins.” Also important, he says, is the coverage, the percentage of the protein’s amino acid sequence that is determined.

The ability to identify many proteins quickly varies with the search engine, tools, and their level of integration into a user’s system. “In many systems,” says McDowall, “the data acquisition and interpretation is divorced, so you have to manually move data from one to the other. And sometimes interpretation is in different modules ... so that the MS/MS databank search module might be separate from the MS fingerprint search, etc.” But, he adds, during the past few years, more software elements have been integrated with each other, driven by the demands of high-throughput screening of complex samples. Systems are now available that interface data acquisition from numerous separation and analytical techniques with interactive decision-making modules and real-time identification of proteins. “With LC/MS/MS experiments, you can do a complete chromatographic run, switching sequentially from MS on an LC sample to MS/MS on automatically identified interesting precursor peaks,” explains Detlev Suckau of Bruker Daltonic.
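Coverage, one of the comparison criteria Siuzdak lists, is simple enough to compute by hand when test-driving two programs on the same data. The following is a minimal sketch under stated assumptions: the function name is hypothetical, matched peptides are located by exact substring search, and the protein sequence is made up for the example.

```python
def sequence_coverage(protein, matched_peptides):
    """Percentage of residues in `protein` covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in matched_peptides:
        if not peptide:
            continue  # ignore empty strings
        start = protein.find(peptide)
        while start != -1:
            # Mark every residue spanned by this occurrence of the peptide.
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Made-up sequence: two matched peptides cover 5 of 12 residues.
print(sequence_coverage("AAKGGRPLLKTT", ["AAK", "TT"]))
```

Overlapping peptides are counted once per residue, so reporting the same stretch twice does not inflate the figure.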

Free Web access, local licensing, and integration

What follows are brief descriptions of software and vendor websites. Systems linked to particular mass spectrometers are also included.

Mascot software from Matrix Science, Ltd. (United Kingdom), www.matrixscience.com, offers modules for searching mass fingerprints, sequence tags, and spectra from MS/MS. Another feature of Mascot, says Matrix Science’s David Creasy, is that all three types of searching can be combined. The scoring algorithm is a probability-based transformation of Molecular Weight Search (MOWSE). Mascot is integrated with Bruker’s BioTools software, www.brukerdaltonik.de/biotools.html, which visualizes the data and analyzes it for evidence of modifications, cross-links, and point mutations, says Suckau. Another level of integration is the agreement of Matrix Science to work with both Kratos Analytical (United Kingdom), a division of Shimadzu, and Agilent, www.chem.agilent.com/Scripts/Phome.asp, to make their mass spectrometer software compatible with Mascot. Customers load Mascot onto their mass spectrometers or separate servers. Sutton says that the Kratos IntelliMarque software, www.shimadzu-biotech.net, “feeds information back and forth with Mascot” for semi-automated sequencing of post-source decay spectra. For spectrometers from other vendors, Creasy says the Daemon tool “can look in a spectrometer’s directory for new raw data files and ... create peak lists that are automatically submitted to the Mascot search engine.”

ProteinProspector from the University of California–San Francisco, http://prospector.ucsf.edu/, provides several search capabilities and tools, but with separate programs for mass fingerprint searching (MS-Fit) and tandem MS spectra searching (MS-Tag, MS-Homology, and MS-Pattern). MS-Tag is the sequence tag search engine, and the latter two programs search for sequences that could have errors in them, allow for sequence errors in the database, and search for homology changes (in which an amino acid has been replaced by another with a similar function). Burlingame says that MS-Pattern is less sophisticated and has a different scoring system than MS-Homology, which incorporates information about the biological evolution of substituents. Unlike some search engines, these search directly from the data in MS/MS spectra. Tools are provided to further characterize a protein by calculating theoretical enzymatic cleavage, peptide masses, amino acid sequences, possible modifications, cross-links, and homologues. Richard Jacob of the University of California–Berkeley notes that ProteinProspector, unlike most other applications, can calculate masses for high-energy side-chain ions (d, v, and w). To run on a local server, the computer must be configured as a Web server.

David Hicks of Applied Biosystems, www.appliedbiosystems.com/products/byType.cfm?id=6, says that ProteinProspector’s MS-Fit is built into the software for their MALDI-TOF instrument. Together with Matrix Science, Applied Biosystems has coordinated algorithms to make both the MALDI-TOF and ES-quadrupole-TOF instruments compatible with Mascot. Another feature for the quadrupole TOF is the PepSea algorithm from MDS Protana (Denmark), www.protana.com/solutions/software/default.asp. This search engine is built into the Applied Biosystems BioAnalyst software, which characterizes peptide modifications and performs de novo sequencing.

Software, like biological species, sometimes evolves along different branches. Products from ProteoMetrics, www.Canada.proteometrics.com, evolved functionally and commercially from a Rockefeller University group’s PROWL software. ProFound is the search engine for MALDI peptide mass mapping, and Sonar is the MS/MS search engine for MALDI and ESI mass spectrometers. Each program alone processes only limited data, but when incorporated into the Knexus package, they become part of a semi-automated system that can analyze hundreds of data files at a time, says Jennifer Krone of ProteoMetrics. High throughput with the search engines in the RADARS package is due to a relational database and other integrated components, says Krone. Tools for editing spectra, visualizing information, and finding modifications are included in all programs. The tools can be downloaded for free. Krone also says that ProFound and Sonar are integrated into MALDI-TOF mass spectrometers from Amersham Biosciences (formerly Amersham Pharmacia Biotech), http://proteomics.amershambiosciences.com, but the screen appearance is different from the usual ProFound screen.

Peter Hojrup of Lighthouse Data (Denmark) says that GPMAW, http://welcome.to/GPMAW, searches for fingerprint spectra on public or local databases, but “relies on the experience of the user,” rather than on probability scoring. He describes the program as a “postprocessing tool [for] detailed mass analysis and planning mass spec experiments.” It identifies a wide range of variations in peptides, including modifications and cross-links.

Lutefisk performs de novo interpretation of tandem mass spectra. Its source code can be downloaded for free from www.immunex.com/researcher/lutefisk, and Taylor says that it can be run on almost any platform. If the search results include several possible identities with small variations, the Lutefisk data is used as queries in CIDentify freeware. This program differentiates ambiguous dipeptides, and its source code is available from a link at the Lutefisk website. The source code can also be modified for different data input formats.

A newcomer is BioBridge Computing (Sweden), www.biobridge.se. Their automated PIUMS software searches MALDI fingerprint spectra. Compugen’s ProtoCall MS, www.Cgen.com, also searches for fingerprint spectra.
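One reason de novo results can come back with “ambiguous dipeptides” is plain arithmetic: some pairs of residues weigh almost exactly the same as a single residue. The sketch below enumerates such pairs from standard monoisotopic residue masses. It is offered as background for the ambiguity mentioned in the Lutefisk and CIDentify description above, not as a description of how either program works, and the 0.005 Da default tolerance is an arbitrary assumption.

```python
from itertools import combinations_with_replacement

# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def isobaric_dipeptides(tolerance=0.005):
    """List (pair, residue) where the pair's mass matches a single residue."""
    hits = []
    for first, second in combinations_with_replacement(sorted(RESIDUE_MASS), 2):
        pair_mass = RESIDUE_MASS[first] + RESIDUE_MASS[second]
        for residue, mass in RESIDUE_MASS.items():
            if abs(pair_mass - mass) <= tolerance:
                hits.append((first + second, residue))
    return hits


for pair, residue in isobaric_dipeptides():
    print(f"{pair} (in either order) has the same mass as {residue}")
```

Widening the tolerance, as a lower-accuracy instrument effectively does, lengthens the list; for example, a glycine plus an alanine cannot be told from lysine at 0.05 Da.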

Isolated systems

Two software systems not accessible through free Internet searching are automated for high throughput by the integration of search engines and characterization tools with online instruments and a database system. ProteinLynx Global SERVER, www.micromass.co.uk/software, from Micromass (United Kingdom), is a client/server system with a full set of search engines and characterization tools, and McDowall says that it is fully functional only on Micromass instruments. TurboSEQUEST, www.thermo.com/eThermo/CDA/Products/Product_Listing/0,1086,11556-113-113,00.html, is part of Thermo Finnigan’s Bioworks system and is supported only on their spectrometers.

McDowall says that the peptide mass fingerprint search engine of ProteinLynx is based on the complete spectrum, whereas many search algorithms look only at a user-specified number of intense peaks. He says that another difference resides in the probabilistic de novo algorithm, which quantifies the confidence for each amino acid in each position as well as the confidence for the total sequence.

TurboSEQUEST searches protein databases from uninterpreted MS/MS spectra. Amy Zumwalt of Thermo Finnigan says that it cross-correlates fragment ion information to increase the certainty in results. It also identifies peptide modifications and isotope-coded affinity tags (ICATs), which are used for quantitation. Bioworks includes de novo sequencing. Gygi adds that the user can “write other programs on top of TurboSEQUEST” to customize functions.
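Searching from uninterpreted MS/MS spectra rests on the step McDowall describes earlier: predict a list of fragment masses for a candidate peptide and compare them with the peaks in the spectrum. A minimal sketch of that step follows. It assumes singly charged b and y ions only, and it scores by a simple shared-peak count, which is a stand-in chosen for brevity and not the cross-correlation that TurboSEQUEST uses; the function names, the 0.5 Da tolerance, and the toy spectrum are all hypothetical.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values.

    b_ions runs from b1 upward; y_ions runs from the longest y ion to y1.
    """
    masses = [RESIDUE_MASS[r] for r in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal piece
    return b_ions, y_ions


def shared_peak_count(observed_mz, peptide, tolerance=0.5):
    """Count observed peaks that lie within tolerance of a predicted ion."""
    b_ions, y_ions = fragment_ions(peptide)
    predicted = b_ions + y_ions
    return sum(
        any(abs(mz - ion) <= tolerance for ion in predicted)
        for mz in observed_mz
    )


# Toy spectrum: two peaks chosen to match the candidate, one that does not.
b_ions, y_ions = fragment_ions("PEPTIDE")
toy_spectrum = [b_ions[1], y_ions[-1], 500.0]
print(shared_peak_count(toy_spectrum, "PEPTIDE"))
```

Running the same count over every candidate peptide in a database and keeping the best scorer gives the general shape of a peak-matching search, though real engines weight intensities and ion types rather than counting peaks equally.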

Previews

Keeping up with all the new alliances and products is impossible, but a few have surfaced during the preparation of this article. The following are a few examples of upcoming projects planned for release this year.

One collaboration in progress for ProteoMetrics is to incorporate Sonar into ESI-TOF spectrometers from Amersham Biosciences, says Krone. Another ProteoMetrics project, says David Miller of Genomic Solutions, www.genomicsolutions.com, is to combine the Genomic Solutions Protein Warehouse with RADARS and some additional software components to create a new search engine and software package.

In a strategic alliance with IBM, Proteome Systems (Australia), www.proteomesystems.com, is implementing a multicomponent, integrated system that functions with a relational database on an IBM server. BioinformatIQ uses the IonIQ search engine for fingerprint mass mapping. Marc Wilkins of Proteome Systems says that the scoring method for search results is based on MOWSE, but with some added criteria. These components will become part of an integrated system with Kratos’s Axima spectrometers and its IntelliMarque software, according to Bill Skea of Proteome Systems. Similarly, BioinformatIQ will be integrated with the Thermo Finnigan LCQ, its Xcalibur software, and TurboSEQUEST for LC/MS/MS analysis.

Tina Settineri of Applied Biosystems/MDS Sciex (Canada) says that these partners are configuring two search engines, Pro ID and Pro ICAT, to identify proteins from uninterpreted MS/MS data. She says the algorithm includes searches for disulfide bonds, peptide modifications, and homologues.

Judith Handley is an assistant editor with Analytical Chemistry.

Upcoming product reviews

July: Flow-injection analysis
August: Raman spectroscopy
September: Ion-trap mass spectrometers