
MSQuant, an Open Source Platform for Mass Spectrometry-Based Quantitative Proteomics

Peter Mortensen,† Joost W. Gouw,‡,§ Jesper V. Olsen,†,| Shao-En Ong,†,⊥ Kristoffer T. G. Rigbolt,† Jakob Bunkenborg,† Jürgen Cox,# Leonard J. Foster,†,§ Albert J. R. Heck,‡ Blagoy Blagoev,† Jens S. Andersen,*,† and Matthias Mann*,†,#

Center for Experimental Bioinformatics, Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Campusvej 55, DK-5230 Odense M, Denmark, Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands, and Department for Proteomics and Signal Transduction, Max-Planck Institute for Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany

Received August 13, 2009

Mass spectrometry-based proteomics critically depends on algorithms for data interpretation. A current bottleneck in the rapid advance of proteomics technology is the closed nature and slow development cycle of vendor-supplied software solutions. We have created an open source software environment, called MSQuant, which allows visualization and validation of peptide identification results directly on the raw mass spectrometric data. MSQuant iteratively recalibrates MS data, thereby significantly increasing mass accuracy and leading to fewer false positive peptide identifications. Algorithms to increase data quality include an MS3 score for peptide identification and a post-translational modification (PTM) score that determines the probability that a modification such as phosphorylation is placed at a specific residue in an identified peptide. MSQuant supports relative protein quantitation based on precursor ion intensities, including element labels (e.g., 15N), residue labels (e.g., SILAC and ICAT), termini labels (e.g., 18O), functional group labels (e.g., mTRAQ), and label-free ion intensity approaches. MSQuant is available, including an installer and supporting scripts, at http://msquant.sourceforge.net.

Keywords: Mass spectrometry-based proteomics • proteomics software • protein identification • protein quantitation • stable isotope labeling

Introduction

Even though protein science and 2D gel electrophoresis have been around for decades, proteomics (the large-scale study of proteins) has only recently proven its value as a postgenomic tool. Several areas of proteomics use different technological platforms,1-3 and mass spectrometry (MS) is a particularly powerful method for characterizing endogenous proteins, including their modifications. A flow diagram of one of the currently most popular forms of MS-based proteomics4 is shown in Figure 1A. Briefly, proteins are first extracted from the source of interest (e.g., protein complexes, organelles, or organisms), which may contain anywhere from a handful of proteins to several thousand. In the case of a stable isotope labeled (SIL) experiment, the two or more populations to be compared are mixed at some point upstream of MS. Proteins need to be enzymatically digested and fractionated to reduce complexity, in either order, and the resulting peptide mixtures are then analyzed by online chromatography. The mass spectrometer analyzes the mass-to-charge ratio (m/z) and signal intensities of the eluting peptides and selects peptide ions in turn for fragmentation, giving rise to product ion spectra (MS/MS or MS2). Selecting ions from the product ion spectrum for a second step of fragmentation (MS3) adds further specificity to the database search (Figure 1B). In a qualitative experiment, the objective is to identify as many of the proteins as possible in a given sample, whereas in a quantitative experiment the relative or absolute amount of proteins is also determined.5 The quantitative information is derived from the intensity of the peptide precursor ion signal as it elutes from the column, the extracted ion chromatogram (XIC). In SIL-based quantitative proteomics, such as in the ICAT6 or metabolic labeling7 technologies, this XIC is compared between two isotopically different forms of the same peptide. Depending on the experiment, the relative amount of modified peptides such as phosphorylated peptides may also be quantified. The major disconnect in existing workflows based on this model is that the software used to acquire the mass spectra does not communicate directly with the database search software. To gain biologically or clinically relevant insights from proteomic data, however, robust and effective software solutions are required.8

Here we describe an open-source program called MSQuant, which implements a design strategy with significant advantages for proteomics projects. First, being open source, it can be modified and extended by its users rapidly and independently of the MS manufacturer. Second, we use building blocks of the manufacturer’s software to gain direct access to the raw data without any information loss and without having to recreate basic data manipulation and visualization tools. As a result, we have been able to quickly implement a range of analysis tools that are still not available in any commercial software. These analysis tools have been crucial in enabling a number of large-scale proteomics projects in our laboratories.9-17

Figure 1. Data generation in MS-based quantitative proteomics. (A) Two populations of cells are either chemically or metabolically labeled with a “light” and a “heavy” isotope. They are combined and proteins are extracted, followed by optional protein separation, fractionation and digestion (dashed arrows). Alternatively, the proteins can be directly digested after mixing, followed by peptide separation and fractionation (solid arrows). Finally, peptides are analyzed by liquid chromatography coupled online to a tandem mass spectrometer. (B) The upper panel shows the extracted ion current (XIC) of a 15N-labeled (red) and unlabeled (blue) peptide pair. The second panel depicts a mass spectrum (MS) that was acquired during the elution (black bar in the upper panel) of these peptides and the third panel shows the fragmentation spectrum (MS2) of the “heavy” peptide (encircled in the second panel). The bottom panel shows the fragmentation spectrum (MS3) of the most intense MS2 signal (encircled in the third panel).

* To whom correspondence should be addressed. E-mail: mmann@biochem.mpg.de and [email protected].
† University of Southern Denmark.
‡ Utrecht University.
§ Current address: Centre for High-Throughput Biology, 2125 East Mall, University of British Columbia, Vancouver, BC, Canada, V6T 1Z4.
| Current address: Department of Proteomics, The Novo Nordisk Foundation Center for Protein Research, Faculty of Health Sciences, University of Copenhagen, Blegdamsvej 3B, DK-2200 Copenhagen, Denmark.
⊥ Current address: Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142, United States.
# Max-Planck Institute for Biochemistry.

Journal of Proteome Research 2010, 9 (1), 393–403. Published on Web 11/04/2009. DOI: 10.1021/pr900721e. © 2010 American Chemical Society.

Materials and Methods

MSQuant Development Environment. MSQuant is written in the Microsoft .NET integrated development environment. The main reason for this choice was that MS vendors’ software is implemented on the Windows platform, and programs created in the .NET environment interface easily with those programs, partially solving the problem of proprietary software. The Microsoft .NET environment has some similarities to the Java environment and contains 9911 public classes (Microsoft .NET Framework 3.5 SP1) providing low- and high-level functionality. Like other modern development environments, .NET enables object-oriented architecture, exception handling, and rapid development of graphical user interfaces, as well as a host of other powerful features that enable efficient software construction. A limitation of the .NET environment in our experience is that it often interfaces poorly with cross-platform tools used in bioinformatics. As an example, the R and Bioconductor environments18,19 are used extensively in our groups, but we currently transfer data between MSQuant and these platforms via a spreadsheet format.

MSQuant interfaces with MS vendors’ software through the standard Microsoft binary component interface (COM objects, for Component Object Model). For Analyst (Applied Biosystems/MDS-Sciex) and Xcalibur (Thermo Fisher Scientific), MSQuant uses the native spectrum display components of the manufacturer’s software; thus, a user familiar with these programs does not have to learn a new spectrum display and navigation scheme. For MassLynx (Micromass/Waters), it uses the versatile open source ZedGraph for spectrum display.

DTASuperCharge. In our experience, vendor software can often mis-assign the charge state or monoisotopic mass of the precursor ion; in the worst cases, the software might not even assign a precursor mass at all. As part of the MSQuant environment, we also developed DTASuperCharge for Thermo Fisher Scientific data, a program that attempts to assign the correct charge state and isotope state for each precursor (Figure 2, layer 2). The algorithm generates an average isotope distribution and determines the sum of square deviations to the observed isotope cluster for each possible charge state and isotope position. The sum of square deviations serves as a score to determine the most likely result. The m/z value and charge state are then exported into the file used for database searching. Furthermore, DTASuperCharge allows the user to directly export the product ion peak list as determined by the vendor software or to export the results of the built-in peak-picking algorithm. Such a step can be necessary to limit the number of peaks for database searching; it ensures that the database search engine will accept the data format, and it speeds up the search.

In an attempt to retain the most informative peaks, DTASuperCharge recursively finds the largest peaks in the fragment spectrum. First, the largest data point is found and the corresponding peak is determined by peak recognition. This defines two new mass ranges, to the left and to the right of the peak, in which the procedure is repeated until the user-defined number of iterations has been reached. Alternatively, the user can specify that this algorithm is performed independently in smaller mass ranges, such as 300 m/z windows.

Other optional preprocessing steps in DTASuperCharge include the proper assignment of the precursor mass in MS3 experiments of phosphopeptides. In these experiments, the MS2 experiment serves to recognize phosphopeptide candidates and to generate a neutral loss precursor, which is then fragmented in an additional step to identify it and to locate the phosphorylation site. In this case, to satisfy database search engine requirements, the precursor mass of the MS3 experiment is specified as that of the original parent ion rather than the neutral loss ion in the product ion spectrum, as is normally done by vendor software. The improvement in the identification results obtained with this preprocessing of the mass spectra depends on the sophistication of the vendor program and ranges from no difference to being an essential step. It is in any case essential that the centroided data for database searching contain a link to the spectrum in the acquisition raw file. The DTASuperCharge program is open source and available via the MSQuant home page.

Figure 2. Pipeline versus interactive data analysis. Data files are generated by the vendor-specific acquisition software (layer 1). In the first processing step (layer 2), peak lists are extracted from the raw data file. Critical tasks in this step are to determine the correct charge state and the monoisotopic peak in the MS spectra and the selection of the most relevant peaks in the fragment spectra. Peak lists serve as input to the Mascot search engine (layer 3), and results may be subjected to bioinformatic analysis (layer 5). Additionally, a quantitation step may be incorporated at the beginning of the procedure or after peptide identification (layer 4). In extension of this linear pipeline model, MSQuant incorporates extensive feedback to the raw data in the form of iterative searches after recalibration and visual control of identification and quantitation results using the vendor’s software components.

Handling of Multiple Files. In situations where multiple raw data files comprise a single experiment, such as protein correlation profiling (see below) and prefractionation of samples, MSQuant needs to keep track of the raw file from which each peptide was identified. To accomplish this, a script (MGFcombiner) with a graphical user interface (GUI) writes the file information into the header of each MS2 or MS3 experiment (Figure 2, layer 2).

Getting Started as an MSQuant User. Currently, MSQuant supports the ABI-Sciex, the Thermo Fisher Scientific, and the Micromass/Waters data files directly. Detailed instructions for the entire process of MSQuant installation can be found on the MSQuant homepage, in particular regarding dependencies on other required software. A single installer for MSQuant with all additional software, including DTASuperCharge, can be downloaded from http://msquant.sourceforge.net. Before executing the installer, we recommend studying the installation notes and examples in detail.

The User Interface. MSQuant has five main windows, as summarized in Figure 3 and illustrated in Figures 4-6 displaying data from a recent study on protein and phosphorylation dynamics in human embryonic stem cells.20 The start screen associates Mascot result files with the corresponding raw data files. It also allows the user to specify the quantitation mode, various filters used in parsing the Mascot file and a number of other parameters. The recalibration window displays the results of recalibration based on precursor ions identified with a user-defined score threshold (Figure 4). The protein list window is the main document window and contains a list of identified proteins with the number of identified peptides as well as a protein score. Double clicking a protein brings up the protein validation window (Figure 5). The user can manually verify or reject peptides and proteins. The fifth window is used for quantitation and contains the peptides identifying the protein, a visual display of the LC profile, a list of the SIL ratios for each precursor mass spectrum and a panel for a mass spectrum (Figure 6). This window allows a manual check of quantitation accuracy. For example, if the SIL ratio of one peptide is different from the others identifying the same protein, stepping through the zoomed part of the spectrum will quickly reveal if there is a problem in quantitation. Saving of results is carried out from the protein list window and acts on the user-selected proteins. Data can be saved as a session file that can be reopened in MSQuant, as a tab-separated text file, or directly into OpenOffice Calc or Microsoft Excel. The user interface also allows specification of the quantitation mode, including a large range of possible labeling conditions; arbitrary labels can be specified in an XML file.

Figure 3. Main application windows of MSQuant. The start screen associates Mascot result files with the corresponding raw data files and specifies parameters and filters for parsing the Mascot file into MSQuant. The recalibration window allows the user to evaluate peptide mass accuracy before and after recalibration. The protein list window is the main document window and contains a list of identified proteins. This window interfaces with modules for the analysis of sequence and quantitative information extracted from the precursor ion and product ion spectra, respectively. MSQuant stores all data for an experiment in a document file and exports annotated spectra and data in various report formats.

Figure 4. Screenshot of the Recalibration window in MSQuant. This window visualizes the peptide mass errors of a data set before and after recalibration. Various display options are available, including the scatter plot of peptide masses versus relative peptide mass errors shown here. The trend line for the 8926 high scoring peptides indicates a small systematic calibration error.

Batch Processing. MSQuant allows batch processing of an arbitrary subset of identified proteins. Quantitation, MS3 scoring (see below) and post-translational modification (PTM) scoring can all be performed in batch mode. Due to limitations in the .NET environment, MSQuant can experience memory problems with large data sets, so that the current state must be saved and the quantitation continued after restarting the program. This task can be performed automatically by starting MSQuant from the Windows command line, which also provides the option to automatically process all data in multiple experiments.

Getting Started as an MSQuant Developer. The full source code for MSQuant can be obtained from the MSQuant homepage. MSQuant is written as four projects in one “solution” in C# and VB.NET. These projects can also be downloaded, and information about installation is contained in a separate text file.
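The recursive peak selection described for DTASuperCharge above can be illustrated with a short sketch. This is not the DTASuperCharge source code: the function name `pick_peaks` is hypothetical, Python is used only for illustration, and the sketch assumes already-centroided input, so that "peak recognition" reduces to choosing a single data point.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def pick_peaks(peaks: List[Peak], depth: int) -> List[Peak]:
    """Recursively keep the most intense peak in each mass range.

    `peaks` must be sorted by m/z. `depth` plays the role of the
    user-defined number of iterations; at most 2**depth - 1 peaks
    are returned.
    """
    if depth <= 0 or not peaks:
        return []
    # Locate the most intense data point in the current mass range.
    top = max(range(len(peaks)), key=lambda i: peaks[i][1])
    # The selected peak splits the range into a left and a right part,
    # and the same procedure is repeated in each of them.
    left = pick_peaks(peaks[:top], depth - 1)
    right = pick_peaks(peaks[top + 1:], depth - 1)
    return left + [peaks[top]] + right  # result stays sorted by m/z


spectrum = [(100.0, 5.0), (150.0, 50.0), (200.0, 10.0), (300.0, 100.0),
            (400.0, 20.0), (500.0, 80.0), (600.0, 1.0)]
print(pick_peaks(spectrum, 2))
```

With a depth of 2, the sketch keeps the base peak plus the strongest peak on either side of it. Because each recursion level works within its own mass range, the retained peaks are spread across the spectrum rather than clustered around the most intense region.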

Results and Discussion

MSQuant Software. The liquid chromatography-tandem mass spectrometry (LC-MS/MS) experiment described in Figure 1 is controlled by software provided by the manufacturer of the mass spectrometer. The resulting data reside in raw data files (layer 1 in Figure 2) in proprietary formats and can typically only be visualized by the same software. The route from raw data to relevant proteomic results involves four levels of data processing: (i) feature extraction, (ii) database searching/protein identification, (iii) quantitation, and (iv) bioinformatics analysis (Figure 2). The processing of mass spectra begins with the generation of peak lists from the fragmentation spectra (layer 2 in Figure 2). The peak lists, consisting of the m/z value and charge state of the precursor and the m/z value and intensity of the fragments, are then submitted to a database search program (layer 3 in Figure 2). Often, however, the link between the results of this database search and the raw data upon which it is based is lost. In 2002 we started to develop, out of necessity, a program we called Mascot Parser as a platform for the interpretation of Mascot database search results against raw product ion spectra. Furthermore, in 2001 we had begun to develop stable isotope labeling by amino acids in cell culture (SILAC) but could not find any software to extract quantitative information from the mass spectra. Manual quantitation of even a simple experiment could take several months, leading to the development of simple scripts for extracting quantitative information.21 We then added a quantitation module to Mascot Parser, which was accordingly renamed MSQuant; in this way, MSQuant had a significant part in the rapid success of SILAC. We have since used MSQuant as a platform for quickly implementing algorithmic solutions. In early 2004, we posted MSQuant and its source code at the SourceForge Web site, and it has since been in use by a large number of proteomics researchers the world over (see for example refs 22-25).

Figure 5. Screenshots of the Validation window in MSQuant. In screenshot A, the top box lists peptides that identify the DNA-dependent protein kinase catalytic subunit. Besides the sequence, MSQuant also shows detailed information for each peptide, including the identified modifications, the Mascot ion score and the delta score (difference between the Mascot ion scores of the peptide and the second best hit). The lower screenshot (B) shows the analysis of PTM sites in MSQuant. The doubly charged peptide IEDVGSDEEDDSGKDK contains, besides two labeled lysine residues, a phosphorylation. There are two possible positions in this peptide that can be phosphorylated: S6 and S12. By PTM-scoring this peptide in MSQuant, theoretical ions for each of these two positions are calculated and matched to the peaks in the experimental raw fragment spectrum. MSQuant then calculates the PTM-score and ranks all the possibilities accordingly. These are displayed in the bottom left window and the corresponding annotated product ion spectra can be visualized by clicking on them. For export, the preferred ones are initially selected automatically but they can also be changed by the user.

Figure 6. Screenshot of the Quantitation window in MSQuant. This window is divided into four panels. (A) Identified peptides of the selected protein that are quantifiable. (B) MS scans of a quantified peptide with the retention time (ret. T.), scan number (cycle), MS peak areas (A1 and A2), and ratio (2/1) indicated. The raw mass spectrum of each scan can be visualized in (C) by activating (e.g., double clicking) a scan (B) or peptide (A). The small red horizontal lines indicate the height of the theoretical isotopes based on the elemental composition of the peptide. (D) Graph where the MS peak area of the peptides is plotted against the retention time.

MS Manufacturer Support. One problem with the simple pipeline model arises from the fact that the software used to acquire and display the raw MS data is almost always proprietary. Thus, tools to analyze the raw data have to be implemented by the manufacturer, who may not be interested in specialized applications and who usually has a conservative software release policy. MS vendors offer various proprietary software suites for the analysis of proteomic data, but these often lag behind when new technologies allow new types of data to be acquired. Due to their closed nature, they typically do not allow much interaction or experimentation with the data. MSQuant therefore aims to support access to raw data formats from a range of vendors (Figure 2). Initially designed for the Applied Biosystems/MDS-Sciex data format (.wiff), MSQuant has now been expanded to support data from Thermo Fisher Scientific (.raw) and Micromass/Waters as well. This is facilitated by the design of MSQuant, to which modules can be added to incorporate support for specific data files, provided that programmatic access to this kind of data is possible, e.g., via the standard Microsoft binary component interface (COM). MSQuant is programmed in such a way that data from the individual vendor-specific modules is generalized immediately after extraction, making it relatively easy to add other modules (Figure 2, layer 1). Interaction with Waters/Micromass data is made available via the MassLynx Datafile Access Component (DAC), which allows simple access to the MassLynx raw data files and is included in the installation of MassLynx. MSQuant uses DAC, for example, to extract data points from the raw data files to display fragmentation spectra and full-scan precursor ion spectra.

Modules and Functionality of MSQuant. Signal processing techniques are often applied to MS and MS/MS data to remove chemical and electronic noise;26 peak picking algorithms can then attempt to find peaks, isotope clusters and charge states.

In our experience, current peak picking algorithms have a significant error rate, especially in the case of metabolic labeling with suboptimal 15N-enrichments (i.e., 15N-enrichments ≤99%).27 Since there can be thousands of potential fragment signals in a spectrum, an important step in peak list generation is the selection of the peaks most likely to be informative, as usually only a few are relevant for a database search. DTASuperCharge or similar programs are first used to preprocess the raw data and prepare it for database searching; this step also incorporates crucial information required by MSQuant to locate the correct raw data during processing (Figure 2, layer 2). MSQuant does not contain a database search program, unlike for instance the Virtual Expert Mass Spectrometrist (VEMS) software package,28 but relies on a database search program such as Mascot to identify peptide hits corresponding to the MS/MS data (Figure 2, layer 3). At the start of a session, MSQuant interprets the output of the database search, and the identified peptides and proteins are parsed into an internal data structure (Figure 2, layer 4). Information for each peptide includes its charge state, observed mass, position of the MS/MS experiment in the raw data file, score, sequence, modification state, theoretical mass and start of the peptide sequence in the protein. MSQuant also checks that the correct quantitation mode has been selected and that labeling is consistent for the identified peptides. The second best matching sequence is also included in the internal data structure.

Recalibration. The first functionality implemented in MSQuant is an iterative recalibration.9 Error in mass determination has two sources, systematic and random error. Systematic error is caused by a variety of factors, for example, drift of calibration constants with time or with temperature, and it can normally be minimized by frequent calibration or by using internal standards.
In practice, however, often neither of these is performed and, as a result, researchers often search proteome data with very wide mass tolerance windows, needlessly degrading the useful mass accuracy of the instrument.29 MSQuant sidesteps this problem by using several hundred high scoring peptides as internal calibrants, assuming that these are most likely to be correctly identified. Optimal instrument-dependent calibration constants are calculated from the observed versus calculated masses of these peptides and these are then applied to all measured masses. The overall improvement in average mass accuracy is visualized in a separate window with various display options that provide the user with an immediate evaluation of the data quality and, thus, instrument performance and optimal database search parameters (Figure 4). On a hybrid quadrupole TOF instrument (Applied Biosystems/MDS-Sciex QSTAR, Micromass/Waters Q-ToF), average absolute mass accuracy improves from 50-200 ppm (depending on the calibration state of the instrument) to typically around 10 ppm. A script changes the precursor masses in the peak list file, after which a second search can be performed using the improved mass tolerance. In practice, this simple algorithm improves the mass accuracy of the instrument several fold, leading to much more specific search results. Elimination of systematic mass error would have been straightforward to implement either in MS vendor or database search software but is still not widely employed. This illustrates the value of user access to the data flow as is possible in MSQuant. For FT-ICR data using selected ion monitoring scans, as well as LTQ-Orbitrap data with lockmass enabled,30 mass accuracy is typically a few hundred parts-per-billion and is not significantly improved. In full scan FT-ICR data, a systematic error can be introduced by space charge effects, and in this case MSQuant employs a script to recalibrate masses in the frequency domain.31

Validation of Protein and Peptide Hits. As mentioned above, the typical pipeline model in proteomics does not allow re-evaluation of raw data after the database search. Coupled with the large amount of data generated in proteomics experiments, this means that users are often restricted to relying on purely statistical measures (i.e., the Mascot score) or, at best, to viewing a static picture of the spectrum at very low resolution. While this latter option is better than no evaluation, it does not take advantage of what might be high resolution/high accuracy data or of informative details in the spectra (e.g., proline-directed cleavage fragments). To get around the normal limitations in pipelined data, MSQuant retains the location of the spectrum identifying a peptide in its internal data structure. When evaluating a protein, its peptides are displayed in a list; the raw fragmentation spectrum used to identify each peptide is accessible by double clicking, and the data can be zoomed to any resolution. Importantly, the high resolution raw data is overlaid with the information calculated from the peptide identification. The ion series corresponding to the type of fragment ions (e.g., b- and y-ions in CID or c- and z-ions in ETD) are automatically marked on the respective peaks, and peaks matching a theoretical fragment mass are highlighted (Figure 5). Spectrum annotation is generalized and can be customized by the user. For example, for phosphopeptides, ion series both with and without the neutral loss of phosphate can be displayed concurrently, which significantly improves the assignment of phosphorylation sites.
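The recalibration step described earlier in this section (high scoring peptides used as internal calibrants) can be sketched as follows. This is a minimal illustration and not the MSQuant implementation: the function names are hypothetical, and a straight-line model of the ppm error as a function of the observed value is an assumption made here for simplicity, whereas the text only states that instrument-dependent calibration constants are fitted.

```python
from typing import List, Tuple


def ppm_error(observed: float, calculated: float) -> float:
    """Relative mass error in parts per million."""
    return (observed - calculated) / calculated * 1e6


def fit_linear_ppm(calibrants: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of ppm error = intercept + slope * observed.

    `calibrants` holds (observed, calculated) pairs for confidently
    identified peptides. The linear form is an illustrative assumption.
    """
    xs = [obs for obs, _ in calibrants]
    ys = [ppm_error(obs, calc) for obs, calc in calibrants]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return intercept, slope


def recalibrate(observed: float, intercept: float, slope: float) -> float:
    """Remove the predicted systematic error from one measured value."""
    predicted_ppm = intercept + slope * observed
    return observed / (1.0 + predicted_ppm * 1e-6)


# Simulated calibrants that all read 50 ppm too high.
calibrants = [(c * (1 + 50e-6), c) for c in (500.0, 800.0, 1200.0, 1500.0)]
intercept, slope = fit_linear_ppm(calibrants)
corrected = recalibrate(1000.0 * (1 + 50e-6), intercept, slope)
print(round(ppm_error(corrected, 1000.0), 3))
```

After such a correction, the residual error reflects mainly random scatter, which is why a second database search can use a narrower mass tolerance.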
When validating spectra, expert mass spectrometrists look for certain telltale features, such as the presence of an intense fragment N-terminal to a proline in the sequence or the characteristic a2, b2 pair, but most database search engines ignore these critical features. MSQuant automatically annotates N-terminal proline breaks and the a2, b2 pair, facilitating general use of these features. Importantly, validation in MSQuant is highly visual and streamlined, requiring only seconds to display and validate a spectrum. In a typical proteomics experiment, hundreds of proteins may have been identified but only a small number are biologically interesting (those that change quantitatively in the experiment, for example). Furthermore, many proteins do not need manual validation since they are identified with many peptides and high scores. In our experience, validation in MSQuant of a few hundred MS/MS spectra is almost always sufficient, does not take much time and greatly adds to the overall data quality of the experiment. We highly recommend verification with raw data (as exemplified here in MSQuant) for all cases where extensive biological follow-up is dependent on correct interpretation of an MS/MS experiment.

MS3 Scoring. Proteomic strategies normally involve a single peptide fragmentation step. The introduction of instruments such as the linear ion trap has recently made it technically feasible to perform two stages of fragmentation (i.e., MS3) on a chromatographic time scale and with high sensitivity. Such a second step of fragmentation should dramatically increase the information on a peptide and greatly increase identification confidence. However, no software was available to use the MS3 information for peptide identification, and consequently MS3 was only performed in special cases, such as phosphopeptide identification, where standard database search algorithms could be used.32
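Both spectrum annotation during validation and fragment-based scoring rely on theoretical fragment masses. The sketch below computes singly charged b- and y-ion m/z values for an unmodified peptide and matches them to observed peaks. It is an illustration only, not MSQuant code: the function names and the 0.02 m/z tolerance are hypothetical, and the residue masses are standard monoisotopic values rounded to five decimals.

```python
from typing import Dict, List, Tuple

PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # monoisotopic mass of H2O

# Standard monoisotopic residue masses (unmodified amino acids).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def by_ions(sequence: str) -> Tuple[List[float], List[float]]:
    """Return singly charged b1..b(n-1) and y1..y(n-1) m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    b, y = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal fragments
        running += m
        b.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments
        running += m
        y.append(running + WATER + PROTON)
    return b, y


def annotate(observed_mz: List[float], sequence: str,
             tol: float = 0.02) -> Dict[str, float]:
    """Map ion labels (e.g. 'b2') to the nearest observed peak within tol."""
    b, y = by_ions(sequence)
    labelled = [(f"b{i}", m) for i, m in enumerate(b, 1)]
    labelled += [(f"y{i}", m) for i, m in enumerate(y, 1)]
    matches: Dict[str, float] = {}
    for label, theo in labelled:
        nearest = min(observed_mz, key=lambda mz: abs(mz - theo), default=None)
        if nearest is not None and abs(nearest - theo) <= tol:
            matches[label] = nearest
    return matches


print(annotate([148.06, 227.10, 500.00], "PEPTIDE"))
```

A neutral-loss series for phosphopeptides could be added in the same way by shifting the relevant theoretical masses before matching; that extension is left out here to keep the sketch short.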

Using the MSQuant framework, we quickly implemented an MS3 score33 as follows. In a first step, the Mascot identification is used to determine the putative MS3 precursor and the theoretical MS3 fragments. The peak list of the MS3 spectrum is then reduced to four peaks per 100 m/z and the overlap of calculated and observed fragments is counted. Finally, the probability of observing these fragments by chance is calculated using a binomial distribution. If there are several MS3 experiments for one MS2 experiment, MSQuant calculates all scores but retains only the highest scoring spectrum. Likewise, if there are multiple possible precursor identities (y-ion or b-ion, for example), MSQuant will consider both possibilities and retain the one with the best score. For compatibility with the Mascot scoring scheme, we report 10 times the negative logarithm (base 10) of this probability. This MS3 score has proven to be very useful and allows identification of proteins on the basis of a single peptide.34-36

Post-Translational Modification (PTM) Score. For modified peptides, both the identity of the peptide and the placement of the modification need to be determined. Sometimes there is only one possibility for the location of the modification. Often, however, the nature of the modification and the identity of the primary peptide sequence can be clearly established but the exact site of the modification is less clear. For example, in the analysis of the phosphorylated and SILAC labeled peptide IEDVGSDEEDDSGKDK, the phospho group could be located on either of the two serine residues in the primary sequence of this peptide (Figure 5B). The site of modification must be within the sequence stretch corresponding to the peptide, but can only be localized precisely if the corresponding, distinguishing fragments are present.
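The MS3 scoring steps described above (reduce the peak list, count matches, convert a binomial chance probability to a score) can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, the per-fragment chance probability of 0.04 assumes that four retained peaks per 100 m/z each cover roughly a 1 m/z matching window, and the cumulative tail (at least k matches) is used as the probability.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def top_peaks_per_window(peaks: List[Peak], per_window: int = 4,
                         width: float = 100.0) -> List[Peak]:
    """Keep the `per_window` most intense peaks in each m/z window."""
    windows: Dict[int, List[Peak]] = {}
    for mz, intensity in peaks:
        windows.setdefault(int(mz // width), []).append((mz, intensity))
    kept: List[Peak] = []
    for index in sorted(windows):
        ranked = sorted(windows[index], key=lambda p: p[1], reverse=True)
        kept.extend(ranked[:per_window])
    return sorted(kept)  # sorted by m/z


def chance_match_score(n_theoretical: int, n_matched: int,
                       p: float = 0.04) -> float:
    """Return -10*log10 of P(X >= n_matched), X ~ Binomial(n_theoretical, p).

    p = 0.04 is an assumed chance of a random match per theoretical fragment.
    """
    tail = sum(
        math.comb(n_theoretical, k) * p ** k * (1 - p) ** (n_theoretical - k)
        for k in range(n_matched, n_theoretical + 1)
    )
    return -10.0 * math.log10(tail)


# Matching 8 of 20 theoretical fragments is far less likely by chance
# than matching 2 of 20, so it receives a much higher score.
print(round(chance_match_score(20, 2), 1), round(chance_match_score(20, 8), 1))
```

Reporting the score on a -10 log10 scale puts it on the same footing as a Mascot ion score, as noted in the text.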
Search engines such as Mascot in principle generate scores for each of the different phosphorylation sites, but they do not specifically score for fragments distinguishing between these possibilities. We developed a probability score (PTM-score) based on assigning a probability that the observed fragments match the fragments calculated for a given sequence by chance. In the MSQuant framework, we first applied this score for MS3 experiments as described above33 and then further developed the algorithm for phosphorylation matching. It iterates through all the possible modification sites and generates a score based on the number of supporting fragment masses (Figure 5B), including handling the placement of several phosphorylation sites in a sequence, each of which may have different probabilities (described more fully in ref 37 and in slightly modified form in ref 38). We have applied this score to a large-scale quantitative study of protein phosphorylation where more than 6000 phosphorylation sites were identified and classified according to probability of phosphorylation site placement.37 While it was developed for phosphorylation, the principles underlying the PTM-score are of a general nature and can be used for any modification. MSQuant also allows evaluation of the PTM score by displaying the calculated fragment ions for any combination of the possible site-specific modifications for the MS/MS experiment as proposed by the scoring algorithm. Toggling between these possibilities gives valuable information about how much better the top scoring site localization is compared to other interpretations.

Quantitation. As mentioned above, one of the primary reasons behind our development of MSQuant was to enable large SILAC experiments, rather than having to rely on manual or semiautomated analysis.21 In SILAC, stable isotope labeled amino acids such as 13C6-Arg and 13C6 15N2-Lys are metabolically incorporated into the proteome, with normal isotopic abundance versions serving as a control. The mass-distinguishable SILAC populations are then treated differentially and mixed together to compare their proteomes. Peptides occur in pairs,7 triplets,39 or even quintuplets,40 and the quantitative ratios between these forms of the peptide accurately reflect the relative abundance of the proteins in the proteomes. Since the initial versions, however, we have generalized the quantitation tools in such a way that in principle any level of multiplexing can be quantified by MSQuant. In practice, however, multiplexing MS-based quantitation beyond triplex is not favorable due to the increase in complexity at the MS level. We have also generalized the type of modification used for quantitation; for example, SILAC employs amino acid isotopologues but many other approaches involve SIL of specific atoms or chemical derivatization of specific functional groups. MSQuant handles all SIL-based quantitative approaches that work on the MS level that we are aware of. This is done through the use of user-definable modifications and quantitation schemes in the XML settings file new_MSQ_quantitationModes.xml. MSQuant quantifies SIL pairs or triplets on the basis of peptide identifications (rather than directly from the data) and requires at least one of the members of a SIL pair to be identified by MS/MS, which is then used to calculate the position for all other partner peaks (Figure 6). MSQuant uses an algorithm, AutoCenter, that centers the quantitation on the actual peaks. This is important for Finnigan LTQ-FT data where the masses in some MS spectra are shifted due to space-charge effects. Thus it is not necessary to widen the quantitation mass window to account for this effect, which would introduce the risk of affecting the quantitation result with data from unrelated peaks. Users can click on the result for any MS scan and view the corresponding raw data.
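As an illustration of how the position of a partner peak follows from a single MS/MS identification, the sketch below computes the expected m/z of the SILAC partner from the identified peptide's m/z, charge, and sequence. It is a minimal example and not MSQuant code: the function names are hypothetical, and the labels are assumed to be the 13C6-Arg and 13C6 15N2-Lys pair mentioned above, with mass shifts derived from the standard monoisotopic mass differences between 13C and 12C and between 15N and 14N. The AutoCenter step is not modelled.

```python
# Monoisotopic mass differences in Da (standard values).
C13_MINUS_C12 = 1.0033548
N15_MINUS_N14 = 0.9970349

# Assumed label set for this sketch: 13C6-Arg and 13C6 15N2-Lys.
LABEL_SHIFTS = {
    "R": 6 * C13_MINUS_C12,                      # about +6.0201 Da per Arg
    "K": 6 * C13_MINUS_C12 + 2 * N15_MINUS_N14,  # about +8.0142 Da per Lys
}


def label_mass_shift(sequence, shifts=LABEL_SHIFTS):
    """Total heavy-minus-light mass difference for a peptide sequence."""
    return sum(shifts.get(residue, 0.0) for residue in sequence)


def partner_mz(identified_mz, charge, sequence,
               identified_is_heavy=False, shifts=LABEL_SHIFTS):
    """Expected m/z of the SILAC partner of an identified peptide ion."""
    delta = label_mass_shift(sequence, shifts) / charge
    if identified_is_heavy:
        return identified_mz - delta
    return identified_mz + delta
```

For the peptide IEDVGSDEEDDSGKDK, which contains two lysines and no arginine, the total shift under these assumed labels is about 16.03 Da, so a doubly charged light ion would have its heavy partner about 8.01 m/z higher.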
Often, single scans are unreliable due to interference from coeluting peaks, for instance. These scans can be removed from consideration under user control. A somewhat complicating issue in stable-isotope coding with heavy nitrogen (15N) is that the mass increase introduced by the label is more variable than in SILAC, since it depends on the number of nitrogen atoms in the molecule rather than the number of residues of a specific amino acid. Finding peak pairs and quantifying their intensities therefore requires a different approach for 15N labeling and, to this end, MSQuant iterates through the sequence of the identified peptide (i.e., residue by residue) and computes the mass increase for each amino acid.

Label-Free Quantitation. While mass spectrometry is not in itself quantitative, signals for the same peptides can be compared between runs. In a technique called “protein correlation profiling”, we made use of this feature in distinguishing background proteins from genuine organellar proteins.11 The centrosome, an important organelle involved in organizing the microtubule network and cell division, was partially separated by centrifugation. Fractions surrounding and including the peak centrosomal fraction were digested and analyzed by LC-MS/MS. Quantitation based on XIC values for each peptide through the gradients gave a characteristic profile for genuine centrosomal proteins but not for copurifying proteins.11 While straightforward in principle, this algorithm required matching thousands of peptides in adjacent gradient fractions on the basis of mass and retention time. To align retention times, MSQuant uses a simple linear fit of the high scoring peptides common to each run. In this way, peaks that are not sequenced in every run can nevertheless be quantified. The ability to accurately assign proteins to either centrosome or background and the observed consistency of centrosomal profiles of usually better than 30% implies excellent matching of peptides between runs. It also implies good reproducibility of the peptide signal across fractions in very complex mixtures. We have since used MSQuant to perform protein correlation profiling of all membrane-bound compartments of the cell14 and, in a two-compartment comparison, to follow the changes in the membrane proteome of stem cells during differentiation.41

Data Filtering and Export. MSQuant allows filtering of the identified peptides so that peptides passed into the program are limited to those fulfilling various criteria such as Mascot score, mass accuracy, peptide length, and modification and labeling status. This feature allows the user to apply the tools described above for the analysis of only a subset of peptides. As an example, tyrosine phosphorylated peptides can easily be selected for a swift and focused inspection. After completing protein quantitation and validation in MSQuant, changes to the data can be saved for reanalysis or the protein and peptide data can be exported in different formats. Filtering options available at this stage make possible the export of high-quality data for downstream bioinformatic analysis. The various filtering options also facilitate automated analysis of entire data sets based on predefined quality criteria. The export function serves as a general interface to statistical analysis of the data. These functions are not contained in MSQuant itself but can be supplied by external packages such as OpenOffice Calc, Microsoft Excel, or the above-mentioned R environment.

MSQuant and Related Software.
A large number of software tools have now been developed for the analysis of MS-based proteomics data.28,42 Several of them are “black box” designs tied to instrument vendors or are commercial products. Others, such as the MaxQuant software,43 profited significantly from the experience gained with the MSQuant software, which has led to more sophisticated mathematical and algorithmic design for the analysis of high-resolution FT data. MaxQuant is, however, restricted to this type of data (Thermo Fisher Scientific) and supports label-free and SILAC labeled data only. MSQuant, on the other hand, supports data from two additional vendors (i.e., Micromass/Waters and Applied Biosystems/MDS Sciex). In addition, MSQuant can handle any precursor ion intensity based quantitation method (e.g., SILAC, 15N, label-free, ICAT, 18O, etc.) and allows fast and easy viewing of annotated spectra used for quantitation, as well as product ion mass spectra for manual validation of sequence assignment. Due to its open-source nature, users have the possibility of implementing specific features that are not yet part of MSQuant. Taken together, these aspects make MSQuant distinct from other quantitation software available.

Conclusions and Perspectives

In this paper, we have described several advantages of an open and extensible software environment. Data quality and quality of manual validation are tremendously enhanced through fast, convenient access to the raw data, and we have implemented a number of novel data manipulation ideas due to access to all parts of the data processing scheme. An especially strong point of MSQuant is that it allows integration of data from advanced acquisition schemes and optimal use of the raw data, resulting in very high quality identifications and quantitation. MSQuant also supports data from several instrument types and vendors. In our opinion, technological progress in proteomics will depend just as much on algorithmic advances as on improvements in sample preparation strategies and mass spectrometric hardware. We therefore hope that MSQuant has made a contribution to the development of these crucial data analysis methods and helps overcome the major data analysis bottleneck of mass spectrometry-driven proteomics.

Acknowledgment. Work at the Center for Experimental BioInformatics acknowledges support by a generous grant of the Danish National Research Foundation and funding received from the European Commission’s seventh Framework Programme (grant agreement HEALTH-F4-2008-201648/PROSPECTS). We thank past and present members of CEBI for their contributions to the development of MSQuant. We also thank all users of MSQuant elsewhere for their active support of this project.

References

(1) Tyers, M.; Mann, M. From genomics to proteomics. Nature 2003, 422 (6928), 193–7.
(2) de Hoog, C. L.; Mann, M. Proteomics. Annu. Rev. Genomics Hum. Genet. 2004, 5, 267–93.
(3) Zhu, H.; Snyder, M. Protein chip technology. Curr. Opin. Chem. Biol. 2003, 7 (1), 55–63.
(4) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198–207.
(5) Ong, S. E.; Mann, M. Mass spectrometry-based proteomics turns quantitative. Nat. Chem. Biol. 2005, 1 (5), 252–62.
(6) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17 (10), 994–9.
(7) Ong, S. E.; Blagoev, B.; Kratchmarova, I.; Kristensen, D. B.; Steen, H.; Pandey, A.; Mann, M. Stable isotope labeling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 2002, 1 (5), 376–86.
(8) Patterson, S. D. Data analysis--the Achilles heel of proteomics. Nat. Biotechnol. 2003, 21 (3), 221–2.
(9) Lasonder, E.; Ishihama, Y.; Andersen, J. S.; Vermunt, A. M. W.; Pain, A.; Sauerwein, R. W.; Eling, W. M. C.; Hall, N.; Waters, A. P.; Stunnenberg, H. G.; Mann, M. Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 2002, 419 (6906), 537–542.
(10) Andersen, J. S.; Lam, Y. W.; Leung, A. K.; Ong, S. E.; Lyon, C. E.; Lamond, A. I.; Mann, M. Nucleolar proteome dynamics. Nature 2005, 433 (7021), 77–83.
(11) Andersen, J. S.; Wilkinson, C. J.; Mayor, T.; Mortensen, P.; Nigg, E. A.; Mann, M. Proteomic characterization of the human centrosome by protein correlation profiling. Nature 2003, 426 (6966), 570–4.
(12) Kratchmarova, I.; Blagoev, B.; Haack-Sorensen, M.; Kassem, M.; Mann, M. Mechanism of divergent growth factor effects in mesenchymal stem cell differentiation. Science 2005, 308 (5727), 1472–7.
(13) Kerner, M. J.; Naylor, D. J.; Ishihama, Y.; Maier, T.; Chang, H. C.; Stines, A. P.; Georgopoulos, C.; Frishman, D.; Hayer-Hartl, M.; Mann, M.; Hartl, F. U. Proteome-wide analysis of chaperonin-dependent protein folding in Escherichia coli. Cell 2005, 122 (2), 209–20.
(14) Foster, L. J.; de Hoog, C. L.; Zhang, Y.; Xie, X.; Mootha, V. K.; Mann, M. A mammalian organelle map by protein correlation profiling. Cell 2006, 125 (1), 187–99.
(15) Vermeulen, M.; Mulder, K. W.; Denissov, S.; Pijnappel, W. W.; van Schaik, F. M.; Varier, R. A.; Baltissen, M. P.; Stunnenberg, H. G.; Mann, M.; Timmers, H. T. Selective anchoring of TFIID to nucleosomes by trimethylation of histone H3 lysine 4. Cell 2007, 131 (1), 58–69.
(16) Romijn, E. P.; Christis, C.; Wieffer, M.; Gouw, J. W.; Fullaondo, A.; van der Sluijs, P.; Braakman, I.; Heck, A. J. Expression clustering reveals detailed co-expression patterns of functionally related proteins during B cell differentiation: a proteomic study using a combination of one-dimensional gel electrophoresis, LC-MS/MS, and stable isotope labeling by amino acids in cell culture (SILAC). Mol. Cell. Proteomics 2005, 4 (9), 1297–310.
(17) Gouw, J. W.; Pinkse, M. W.; Vos, H. R.; Moshkin, Y. M.; Verrijzer, C. P.; Heck, A. J. R.; Krijgsveld, J. In vivo stable isotope labeling of fruit flies reveals post-transcriptional regulation in the maternal-to-zygotic transition. Mol. Cell. Proteomics 2009, 8 (7), 1566–1578.
(18) Gentleman, R. C.; Carey, V. J.; Bates, D. M.; Bolstad, B.; Dettling, M.; Dudoit, S.; Ellis, B.; Gautier, L.; Ge, Y.; Gentry, J.; Hornik, K.; Hothorn, T.; Huber, W.; Iacus, S.; Irizarry, R.; Leisch, F.; Li, C.; Maechler, M.; Rossini, A. J.; Sawitzki, G.; Smith, C.; Smyth, G.; Tierney, L.; Yang, J. Y.; Zhang, J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10), R80.
(19) R Development Core Team. R: A Language and Environment for Statistical Computing; R Development Core Team: Vienna, Austria, 2007.
(20) Prokhorova, T. A.; Rigbolt, K. T.; Johansen, P. T.; Henningsen, J.; Kratchmarova, I.; Kassem, M.; Blagoev, B. Stable isotope labeling by amino acids in cell culture (SILAC) and quantitative comparison of the membrane proteomes of self-renewing and differentiating human embryonic stem cells. Mol. Cell. Proteomics 2009, 8 (5), 959–70.
(21) Foster, L. J.; De Hoog, C. L.; Mann, M. Unbiased quantitative proteomics of lipid rafts reveals high specificity for signaling factors. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (10), 5813–8.
(22) Dreisbach, A.; Otto, A.; Becher, D.; Hammer, E.; Teumer, A.; Gouw, J. W.; Hecker, M.; Volker, U. Monitoring of changes in the membrane proteome during stationary phase adaptation of Bacillus subtilis using in vivo labeling techniques. Proteomics 2008, 8 (10), 2062–2076.
(23) Lemeer, S.; Jopling, C.; Gouw, J. W.; Mohammed, S.; Heck, A. J.; Slijper, M.; den Hertog, J. Comparative phosphoproteomics of zebrafish Fyn/Yes morpholino knockdown embryos. Mol. Cell. Proteomics 2008, 7 (11), 2176–2187.
(24) Chan, Q. W.; Foster, L. J. Changes in protein expression during honey bee larval development. Genome Biol. 2008, 9 (10), R156.
(25) Pandhal, J.; Ow, S. Y.; Wright, P. C.; Biggs, C. A. Comparative proteomics study of salt tolerance between a nonsequenced extremely halotolerant cyanobacterium and its mildly halotolerant relative using in vivo metabolic labeling and in vitro isobaric labeling. J. Proteome Res. 2009, 8 (2), 818–28.
(26) Listgarten, J.; Emili, A. Statistical and computational methods for comparative proteomic profiling using liquid chromatography-tandem mass spectrometry. Mol. Cell. Proteomics 2005, 4 (4), 419–34.
(27) Gouw, J. W.; Tops, B. B. J.; Mortensen, P.; Heck, A. J. R.; Krijgsveld, J. Optimizing identification and quantitation of 15N-labeled proteins in comparative proteomics. Anal. Chem. 2008, 80 (10), 7796–7803.
(28) Matthiesen, R.; Trelle, M. B.; Hojrup, P.; Bunkenborg, J.; Jensen, O. N. VEMS 3.0: algorithms and computational tools for tandem mass spectrometry based identification of post-translational modifications in proteins. J. Proteome Res. 2005, 4 (6), 2338–47.
(29) Zubarev, R.; Mann, M. On the proper use of mass accuracy in proteomics. Mol. Cell. Proteomics 2007, 6 (3), 377–81.
(30) Olsen, J. V.; de Godoy, L. M.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. Cell. Proteomics 2005, 4 (12), 2010–21.
(31) de Godoy, L. M.; Olsen, J. V.; de Souza, G. A.; Li, G.; Mortensen, P.; Mann, M. Status of complete proteome analysis by mass spectrometry: SILAC labeled yeast as a model system. Genome Biol. 2006, 7 (6), R50.
(32) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villen, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Large-scale characterization of HeLa cell nuclear phosphoproteins. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (33), 12130–5.
(33) Olsen, J. V.; Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (37), 13417–22.
(34) Adachi, J.; Kumar, C.; Zhang, Y.; Olsen, J. V.; Mann, M. The human urinary proteome contains more than 1500 proteins, including a large proportion of membrane proteins. Genome Biol. 2006, 7 (9), R80.
(35) Pilch, B.; Mann, M. Large-scale and high-confidence proteomic analysis of human seminal plasma. Genome Biol. 2006, 7 (5), R40.
(36) de Souza, G. A.; Godoy, L. M.; Mann, M. Identification of 491 proteins in the tear fluid proteome reveals a large number of proteases and protease inhibitors. Genome Biol. 2006, 7 (8), R72.
(37) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell 2006, 127 (3), 635–48.
(38) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24 (10), 1285–92.
(39) Blagoev, B.; Ong, S. E.; Kratchmarova, I.; Mann, M. Temporal analysis of phosphotyrosine-dependent signaling networks by quantitative proteomics. Nat. Biotechnol. 2004, 22 (9), 1139–45.
(40) Molina, H.; Yang, Y.; Ruch, T.; Kim, J. W.; Mortensen, P.; Otto, T.; Nalli, A.; Tang, Q. Q.; Lane, M. D.; Chaerkady, R.; Pandey, A. Temporal profiling of the adipocyte proteome during differentiation using a five-plex SILAC based strategy. J. Proteome Res. 2009, 8 (1), 48–58.
(41) Foster, L. J.; Zeemann, P. A.; Li, C.; Mann, M.; Jensen, O. N.; Kassem, M. Differential expression profiling of membrane proteins by quantitative proteomics in a human mesenchymal stem cell line undergoing osteoblast differentiation. Stem Cells 2005, 23 (9), 1367–77.
(42) Mueller, L. N.; Brusniak, M. Y.; Mani, D. R.; Aebersold, R. An assessment of software solutions for the analysis of mass spectrometry based quantitative proteomics data. J. Proteome Res. 2008, 7 (1), 51–61.
(43) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26 (12), 1367–72.

PR900721E

Journal of Proteome Research • Vol. 9, No. 1, 2010 403