Article pubs.acs.org/ac

Game-Theory-Based Search Engine to Automate the Mass Assignment in Complex Native Electrospray Mass Spectra

Yao-Hsin Tseng,† Charlotte Uetrecht,‡,§ Shih-Chieh Yang,† Arjan Barendregt,‡ Albert J. R. Heck,‡ and Wen-Ping Peng*,†

† Department of Physics, National Dong Hwa University, Shoufeng, Hualien, Taiwan 97401, R.O.C.
‡ Biomolecular Mass Spectrometry and Proteomics, Bijvoet Center for Biomolecular Research and Utrecht Institute for Pharmaceutical Sciences, Utrecht University, and Netherlands Proteomics Centre, Padualaan 8, 3584 CH Utrecht, The Netherlands
§ Sample Environment Group, European XFEL GmbH, Notkestraße 85, 22607 Hamburg, Germany

Supporting Information

ABSTRACT: Electrospray ionization coupled to native mass spectrometry (MS) has evolved into an important tool in structural biology to decipher the composition of protein complexes. However, the mass analysis of heterogeneous protein assemblies is hampered by their overlapping charge state distributions, fine structure, and peak broadening. To facilitate the mass analysis, it is important to automate the preprocessing of raw mass spectra, the assignment of ion series to peaks, and the deciphering of subunit compositions. So far, the automation of preprocessing raw mass spectra has not been accomplished; Massign was introduced to simplify data analysis and decipher the subunit compositions. In this study, we develop a search engine, AutoMass, to automatically assign ion series to peaks without any additional user input, for example, limited ranges of charge states or ion mass. AutoMass includes an ion intensity-dependent method to check for Gaussian distributions of ion series and an ion intensity-independent method to address highly overlapping and non-Gaussian distributions. The minimax theorem from game theory is adopted to define the boundaries. With AutoMass, the boundaries of ion series in the well-resolved tandem mass spectra of the hepatitis B virus (HBV) capsids and those of the mass spectrum from the CRISPR-related cascade protein complex are accurately assigned. Theoretical and experimental HBV ion masses agree to within ∼0.03%. The analysis is finished within a minute on a regular workstation. Moreover, less well-resolved mass spectra, for example, complicated multimer mass spectra and norovirus capsid mass spectra at different levels of desolvation, are analyzed. In sum, this first-ever fully automatic program reveals the boundaries of overlapping ion peak series and can further aid the development of high-throughput native MS and top-down proteomics.

Received: June 28, 2013 Accepted: October 31, 2013 Published: October 31, 2013

© 2013 American Chemical Society

dx.doi.org/10.1021/ac401940e | Anal. Chem. 2013, 85, 11275−11283

Electrospray ionization mass spectrometry (ESI-MS) is now in widespread use to investigate the composition and structure of large heterogeneous protein assemblies.1−5 Through an array of instrumental improvements on time-of-flight (ToF)6,7 and orbitrap analyzers,8 microchannel plate (MCP) detectors,9 nano-ESI ion sources, and quadrupoles (Q)6,7 for high mass transmission, the investigation of large protein complexes and virus assemblies has reached levels up to a mass of 18 MDa with charge states up to 350+, an m/z (mass-to-charge ratio) range up to 80 000, and mass resolution of a few thousand.10,11 This resolution helps acquire and assign mass spectra of large protein complexes and decipher their quaternary structure, stoichiometry, and topology. However, when analyzing large and heterogeneous protein complexes, especially with high m/z values, overlapping peak series appear, which often hinder the correct mass assignment.12−15 Two factors contribute to incorrect mass assignment: first, peaks with increasing charges in the high mass range are shifted closer to each other and the boundaries are thus blurred, and second, incomplete desolvation may cause substantial peak broadening.13,16

To assign masses correctly, three things have to be dealt with. First, the noisy raw mass spectra have to be noise-filtered and reduced in size without losing information. Preprocessing requires thorough peak finding, smoothing, background subtraction, and automatic threshold determination to preserve all information in the spectra. Second, a program has to identify peaks, assign ion series to the correct peaks, and determine the boundary between different distributions to yield accurate masses. Third, results from several experiments (e.g., full mass spectra and collision-induced dissociation) have to be combined to reveal the true stoichiometry of multiprotein complexes, where many subunit combinations can result in the same complex mass.

Currently, there are a few tools to deal with the above issues. For example, MassLynx17 can partially assign the charge and mass of proteins with intermediate size14,15,18−21 but cannot handle proteins in the highest mass range.22 Massign,16 introduced by Morgner et al., is the first systematic approach to optimize data analysis. It has semiautomatic and automatic modes to tackle all steps of data processing: reducing spectra size, smoothing and subtracting background, identifying peaks, assigning ion series (masses), and reducing the number of possible subunit combinations by integrating information from multiple sources. However, its ion series assignment requires manual input parameters such as potential mass range, m/z range, and maximum possible charge. Even so, ion series are often missed and need to be added manually. In addition, the boundaries between ion series require manual refinement. Benesch et al. developed a deconvolution algorithm, CHAMP, which can estimate the distribution of various stoichiometries from overlapping and unresolved peaks.23 It is comparable to SOMMS24 but is much more user-friendly. Thalassinos et al. developed the Amphitrite software,25 which is comparable in peak assignment to CHAMP and SOMMS. The main advantage of this software is the analysis of ion mobility data. It can handle very complex samples and retrieve or compare ion shapes more readily than could be achieved manually.

The above tools are semiautomatic; however, to handle the increasing amount of data, automatic mass assignment has become a necessity. To our knowledge, no program can yet assign ion series to peaks automatically. Here, we present an automatic algorithm, AutoMass, to partially preprocess and assign ion series to correct peaks in mass spectra. It can successfully analyze overlapping charge state distributions of large protein complexes and determine the correct boundaries between ion series. AutoMass uses three steps to assign ion series. First, the baseline is subtracted and the peaks are fitted with Gaussian distributions. After that, a threshold is set to pick peaks.
Second, possible ion series are selected with the mass selection filter (MSF). Third, boundaries are determined among ion series by using the Minimax theorem from game theory.26

Von Neumann introduced game theory to solve zero-sum two-person games, in which the players try to maximize their individual gain. In our case, two ion series compete for a peak in the overlapping or boundary region. Mass (m) and charge (z) are two players in an ion series, and the mass-to-charge ratio (m/z) is an observable. Here we adopt two optimization strategies of the Minimax theorem, that is, maximizing the change in mass standard deviation (SD) and minimizing the charge shift of the searched ion series that can reach a zero-shift result (zero-sum) by varying m/z sets. If an m/z peak in the searched ion series belongs to another ion series, deleting or selecting that m/z value results in local maxima and minima in the mass SD and charge shift, respectively. Therefore, the competition of mass and charge in one ion series can result in a maximum gain.

Complicated tandem mass spectra generated by collision-induced dissociation (CID) at accelerating voltages of 400 V of T = 3 and T = 4 hepatitis B virus (HBV) capsid ions are taken as a model system to prove the correctness and efficacy of our program.27 The deviation between the assigned and theoretical masses for both T = 3 and T = 4 fragment ions is less than 0.03%, which facilitates precise mass measurements to determine exact molecular stoichiometries of complexes even in the range of 3−4 MDa at sufficient spectral resolution. Moreover, we correctly assign subcomplexes with more than two peaks per ion series of the CRISPR-related cascade protein complex from Escherichia coli,28 which cover a broad charge (up to 50+) and mass (17−450 kDa) range. The obtained results are consistent with the results obtained by manual assignments. AutoMass is also used to assign incompletely desolvated mass spectra of norovirus-like particles at activation energies of 50 and 200 V; in addition, ion species in connector scaffolding proteins of bacteriophage phi29 can be partially assigned.

AutoMass is highly efficient, as it takes at most a few minutes for the mass assignment of complicated protein complexes; for example, it takes one minute to assign ion series to ∼100 peaks in the HBV tandem mass spectra. It is user-friendly since it only requires a threshold adjustment. No further manual setting of parameters, prior knowledge of ion species, limitation of charge states,22 upper mass bound setting, or defined m/z selection window is required. In addition, it can potentially offer a core platform for automated mass list calculation in high-throughput top-down proteomics and native mass spectrometry. The mass lists can then be used to assign the stoichiometry of the detected complexes or to determine the time evolution of their signal intensities.
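The link between mass, charge, and the observable m/z described above can be illustrated with a short sketch. It is not the AutoMass implementation (which the text indicates runs in LabVIEW); it is a minimal Python illustration under stated assumptions: every charge is carried by one proton (1.007276 Da), the peaks are sorted by ascending m/z with the charge dropping by one per peak, and the function names (neutral_mass, mass_sd, best_first_charge) are hypothetical. For a candidate charge of the first peak it computes the implied masses and their SD, then picks the candidate with the smallest SD.

```python
import statistics

PROTON_MASS = 1.007276  # Da; assumption: each charge is carried by one proton


def neutral_mass(mz, z):
    """Neutral mass implied by an m/z value and an assumed charge z."""
    return z * (mz - PROTON_MASS)


def mass_sd(mz_peaks, z_first):
    """Population SD of the masses implied by one candidate charge assignment.

    mz_peaks must be sorted by ascending m/z; the first peak gets charge
    z_first and each following peak one charge less.
    """
    masses = [neutral_mass(mz, z_first - k) for k, mz in enumerate(mz_peaks)]
    return statistics.pstdev(masses)


def best_first_charge(mz_peaks, z_max=400):
    """Candidate charge of the first peak that minimizes the mass SD."""
    candidates = range(len(mz_peaks), z_max + 1)  # keeps every charge >= 1
    return min(candidates, key=lambda z: mass_sd(mz_peaks, z))


# Synthetic example: a 100 kDa species observed at charges 20+, 19+, 18+.
peaks = [100_000.0 / z + PROTON_MASS for z in (20, 19, 18)]
z_hat = best_first_charge(peaks)
mass_hat = statistics.mean(
    neutral_mass(mz, z_hat - k) for k, mz in enumerate(peaks)
)
```

With noise-free synthetic peaks the SD vanishes only at the true charge set; real spectra have measurement error, so the minimum is shallower and neighboring candidates need the additional constraints described in the text.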



SOFTWARE WORKFLOW

Three steps are used to assign ion series: picking peaks (I), selecting possible ion series (II), and determining boundaries between ion series (III). Steps II and III are then repeated until all identified peaks are assigned (see also Figure 1). Prior to these steps, raw mass spectra were exported from the Q-TOF instrument software and smoothed with a Savitzky-Golay algorithm.

Picking Peaks. Smoothing and Baseline Subtraction. To subtract the background, AutoMass uses adjacent averaging to smooth the raw mass spectrum, setting the data division (range) of each m/z point automatically. The average intensity in this data division is then subtracted from the intensity of each m/z point, and a mass spectrum with a reduced baseline is generated. Adjacent averaging can help smooth the acquired mass spectrum, reduce its background, and enhance the sharp edge of a signal peak.

Identifying Peaks. To identify the useful peaks in the mass spectrum, first, the average slope of the whole data set is calculated. Second, two slopes are calculated from any three adjacent m/z points (from point 1 to 2 and 2 to 3, respectively). If the slopes (absolute values) of the starting points or ending points of the peaks are greater than the average slope of the whole data set, then the program marks them as peaks. Third, the searched peaks are fitted with a Gaussian distribution (Gaussian vi. in LabVIEW, National Instruments).

Intensity Threshold. The threshold is defined by least-squares fitting of the whole mass spectrum, and then the program determines its cutoff point. Signals that are greater than this cutoff point are deleted. The above process is repeated three times to get a final cutoff point, as shown in Figure S1 (in the Supporting Information). This cutoff point multiplied by the input S/N ratio is the intensity threshold used in the program. The S/N ratio is user-defined. Then, a peak list (line spectrum) is generated for further analysis.
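The peak-picking steps above can be sketched roughly as follows. This is a simplified Python illustration, not the authors' LabVIEW code, and several details are assumptions: the averaging window is a fixed half_window parameter (AutoMass sets the data division automatically), a peak apex is taken to be a point where the slope changes from positive to negative, and the least-squares fit in the threshold step is assumed to be a constant, whose least-squares solution is simply the mean. All function names are hypothetical.

```python
def adjacent_average(intensity, half_window):
    """Moving average over up to 2 * half_window + 1 neighboring points."""
    n = len(intensity)
    averaged = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        averaged.append(sum(intensity[lo:hi]) / (hi - lo))
    return averaged


def subtract_baseline(intensity, half_window):
    """Subtract the local adjacent average from every intensity value."""
    local_mean = adjacent_average(intensity, half_window)
    return [y - m for y, m in zip(intensity, local_mean)]


def find_peaks(mz, intensity):
    """Indices of apex points whose flanking slope exceeds the mean slope.

    Assumption: an apex is a sign change of the slope from + to -.
    """
    slopes = [
        (intensity[i + 1] - intensity[i]) / (mz[i + 1] - mz[i])
        for i in range(len(mz) - 1)
    ]
    mean_abs_slope = sum(abs(s) for s in slopes) / len(slopes)
    apexes = []
    for i in range(1, len(mz) - 1):
        rising, falling = slopes[i - 1], slopes[i]  # points 1->2 and 2->3
        steep = max(abs(rising), abs(falling)) > mean_abs_slope
        if rising > 0 and falling < 0 and steep:
            apexes.append(i)
    return apexes


def noise_cutoff(intensity, rounds=3):
    """Fit a constant (the mean), delete signals above it, and repeat."""
    kept = list(intensity)
    cutoff = sum(kept) / len(kept)
    for _ in range(rounds):
        cutoff = sum(kept) / len(kept)  # least-squares constant = mean
        kept = [y for y in kept if y <= cutoff]
    return cutoff


def intensity_threshold(intensity, sn_ratio, rounds=3):
    """Final cutoff multiplied by the user-defined S/N ratio."""
    return noise_cutoff(intensity, rounds) * sn_ratio
```

A typical call sequence under these assumptions would be subtract_baseline on the raw intensities, find_peaks on the corrected spectrum, and intensity_threshold to decide which apexes enter the peak list; the Gaussian fit of each peak is omitted here.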
Selecting Ion Series by Mass Selection Filter. A mass selection filter (MSF) is designed to find possible ion series based on our previous observations.22 To select ion series from the peak list, three constraints have to be met: (1) a periodic pattern, (2) one charge increment, and (3) overtone. With MSF, a set which has three consecutive peaks (m1/z1, m2/z2, m3/z3) is picked, and the mass standard deviation (SD) is plotted as a function of charge state z (Figure 3c). Then, the mass SD shows a minimum at the correct charge state set and harmonics of the fundamental periodic pattern.30 This phenomenon is known in the acoustics of musical instruments and is called overtone (3) or higher harmonics. Figure 2 shows an overtone analysis of the MSF.

Figure 2. Schematic of the mass selection filter, which searches for the correct ion series. The “i” indicates the angular frequency of the charge state. Black squares represent a correct ion series with serial integers.

MSF picks up the first (i = 1) m/z peak and searches the whole line spectrum to match the overtone condition. With MSF (black lines/squares), a correct ion series is found if serial integers of the angular frequency (ki) can be obtained (e.g., i = 1, 2, 3, 4, ...); by contrast, if nonintegers are found (e.g., i = 1.6, 2.6, 3.6, ...; green lines/diamonds), the ions belong either to other ion series or to noise. Once the ion series is found, MSF chooses the second peak (i = i + 1) and searches the next overtone series. In Figure 3c,d, 22 antinodes and 26 antinodes are observed between two nodes, respectively. The type of overtone and number of antinodes depend on the sampled peaks and are therefore different in Figure 3c,d. The antinodes here are integer multiples and constitute a set of overtones, as mentioned in the paper by Mann et al.31 With these three constraints, our MSF search engine can find all possible ion series according to a set of three consecutive m/z peaks. This set is formed automatically by AutoMass. AutoMass then calculates the mass SD over mean mass () ratio, which has to be