Article Cite This: Anal. Chem. 2017, 89, 13128−13136

pubs.acs.org/ac

iTop-Q: an Intelligent Tool for Top-down Proteomics Quantitation Using DYAMOND Algorithm

Hui-Yin Chang,†,‡ Ching-Tai Chen,†,‡ Chu-Ling Ko,§ Yi-Ju Chen,∥ Yu-Ju Chen,∥ Wen-Lian Hsu,† Chiun-Gung Juo,*,⊥,# and Ting-Yi Sung*,†

† Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
§ Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan
∥ Institute of Chemistry, Academia Sinica, Taipei 115, Taiwan
⊥ Molecular Medicine Research Center, Chang Gung University, Taoyuan 333, Taiwan
# PharmaEssentia Corp., Taipei 115, Taiwan

Supporting Information

ABSTRACT: Top-down proteomics using liquid chromatography coupled with mass spectrometry has been increasingly applied to the analysis of intact proteins to study genetic variation, alternative splicing, and post-translational modifications (PTMs) of proteins (proteoforms). However, only a few tools have been developed for charge state deconvolution, monoisotopic/average molecular weight determination, and quantitation of proteoforms from LC-MS1 spectra. Although Decon2LS and MASH Suite Pro provide intraspectrum charge state deconvolution and quantitation, manual processing is still required to quantify proteoforms across multiple MS1 spectra. An automated tool for interspectrum quantitation is therefore a pressing need. In this paper, we present a user-friendly tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), that automatically performs large-scale proteoform quantitation based on interspectrum abundance in top-down proteomics. Instead of utilizing a single spectrum for proteoform quantitation, iTop-Q constructs extracted ion chromatograms (XICs) of possible proteoform peaks across adjacent MS1 spectra to calculate abundances for accurate quantitation. Notably, iTop-Q is implemented with a newly proposed algorithm, called DYAMOND, which uses dynamic programming for charge state deconvolution. In addition, iTop-Q performs proteoform alignment to support quantitation analysis across replicates/samples. Performance evaluations on an in-house standard data set and a public large-scale yeast lysate data set show that iTop-Q achieves highly accurate quantitation that is more consistent than intraspectrum quantitation. Furthermore, the DYAMOND algorithm is suitable for high charge state deconvolution and can distinguish shared peaks in coeluting proteoforms. iTop-Q is publicly available for download at http://ms.iis.sinica.edu.tw/COmics/Software_iTop-Q.

Received: June 17, 2017. Accepted: November 22, 2017. Published: November 22, 2017.
DOI: 10.1021/acs.analchem.7b02343. Analytical Chemistry 2017, 89, 13128−13136. © 2017 American Chemical Society.

Liquid chromatography (LC) coupled with mass spectrometry (MS) or tandem mass spectrometry (MS2) has become a predominant platform for proteomics research because of its high sensitivity, increasing resolution, and high processing speed.1−4 Bottom-up and top-down proteomics are two complementary approaches in the field of proteomics.5,6 In bottom-up proteomics, proteins are digested into peptides using proteases, and the peptides are then separated by LC and analyzed by MS and MS/MS.7,8 Top-down proteomics, without proteolytic digestion, utilizes intact protein masses for proteomics analyses, providing an opportunity for the characterization and identification of post-translational modifications (PTMs) on the proteins. In top-down proteomics, intact proteins are separated by LC prior to MS. With the assistance of heat, nebulizing gas, and high voltage, the separated intact proteins are desorbed as multiply charged protein ions for MS detection.9,10 Since a protein usually elutes as a chromatographic peak over a span of retention time, the charged protein ions are recorded in several consecutive MS1 spectra, and intense protein ions are subjected to MS2 analysis to obtain fragment information for identifying proteoforms along with their PTMs.11

Quantitation is an important task in proteomics because it enables comparative studies of proteins between different disease or health states for biomarker discovery.11−13 Several strategies have been proposed for intact protein quantitation. For example, Du et al. utilized 14N/15N metabolic labeling strategies for measuring expression ratios of intact proteins using top-down mass spectrometry.14 Bergmann et al. combined bottom-up and top-down approaches with a MeCAT labeling strategy for the absolute quantitation of proteolytic peptides and intact proteins from a complex biological system.15 Nevertheless, limitations on the application of labeling strategies to intact protein quantitation were noted because the labeling efficiency decreases as molecular mass increases.11,13,16 The label-free strategy, on the other hand, has drawn much attention for the relative quantitation of intact proteins because of its relatively easy sample preparation, the absence of expensive labeling reagents, and its applicability to primary human samples.13 For example, Ntai et al. presented an integrated platform for identification and label-free quantitation analyses of proteoforms and applied it to measure the abundance fold change caused by deleting a histone deacetylase in S. cerevisiae.17

In general, there are two different methodologies in the label-free strategy. The first is intraspectrum quantitation, which performs relative quantitation of modified and unmodified proteoforms present within the same spectrum.11,13,18 For example, Pesavento et al. demonstrated ionization efficiency as a major issue by measuring the intensity ratios of histone H4 proteoforms and their fragment ions in single MS2 spectra.19 The other methodology is the construction of extracted ion chromatograms (XICs) from MS1 spectra. Castagnola et al. revealed hypo-phosphorylation as a defective event in more than 60% of autistic spectrum disorder patients using extracted ion abundances of intact proteins of interest in human saliva.20 Recently, Wu et al. quantitatively profiled 83 proteoforms of 20 identified proteins in human parotid and submandibular gland secretions by using an accurate mass and time tag database for identified proteoforms and generating XICs from the raw data accordingly.21

Several tools have been proposed for intact protein quantitation. Decon2LS22 and MASH Suite Pro23 are two public tools for intraspectrum quantitation. ProSightPC24−26 (Thermo Scientific), Biopharma Finder (Thermo Scientific), and MassHunter BioConfirm (Agilent) are commercial tools that also perform intact protein quantitation. Nevertheless, a cross-spectra label-free strategy for proteoform quantitation is more challenging because it additionally requires automated construction of XICs. Recently, ProMex, included in MSPathFinder,27 has become publicly available; it clusters isotopic envelopes, constructs XICs to determine the elution time span for refinement, and uses theoretical isotopic envelopes to score the likelihood of detected proteoform features.

In this study, we present a fully automated tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), to construct XICs across multiple MS1 spectra for proteoform quantitation. Since most proposed charge state deconvolution algorithms, such as MaxEnt,28 THRASH29 (implemented in Decon2LS and MASH Suite Pro), MS-Deconv,30,31 and UniDec,32 are aimed mainly at intraspectrum deconvolution, we propose a new deconvolution algorithm for iTop-Q, called DYAMOND (DYnamic progrAMming ON charge state Deconvolution), for the deconvolution of the constructed XICs. In iTop-Q, the constructed XICs are clustered, and those passing our quality criteria are called putative proteoform envelopes, corresponding to putative proteoforms. With the DYAMOND algorithm, the monoisotopic and average masses of detected putative proteoforms are accurately calculated and reported. Moreover, iTop-Q aligns the detected putative proteoforms across different replicates/samples for direct abundance comparison.


EXPERIMENTAL SECTION

Standard Protein Data Set. Chemicals. Cytochrome c (Cyt c) standard protein (theoretical protein average mass: 12361.96, obtained from ProteoMass), all chemicals, and solvents were purchased from Sigma-Aldrich (St. Louis, MO, U.S.A.). The chemicals were all of analytical grade. Water and acetonitrile were of CHROMASOLV grade. The protein sample was dissolved in 10% acetonitrile to form a solution of 1 mg/mL.

Instrument. A UPLC system (Waters, Milford, MA, U.S.A.) equipped with a C4 reversed-phase column (2.1 × 100 mm, 1.8 μm, BEH 300; Waters, Milford, MA, U.S.A.) was coupled with an LTQ-Orbitrap XL MS (Thermo Scientific, San Jose, CA, U.S.A.) with an orthogonal electrospray ionization (ESI) source. For liquid chromatography, the initial flow rate was 0.1 mL/min with 98% solvent A (0.1% formic acid and 0.01% trifluoroacetic acid) and 2% solvent B (acetonitrile with 0.1% formic acid and 0.01% trifluoroacetic acid). A volume of 5 μL of sample was injected. After injection, solvent B was maintained at 2% for 10 min, increased to 40% over 40 min, maintained at 40% for 5 min, and then increased to 98% over 5 min, after which this composition was held for 12 min. Finally, solvent B was reduced back to 2% in 8 min and held at this percentage for 5 min. For mass spectrometry, full scan acquisition was performed in profile mode with a preset resolution of 60000.

Public Large-Scale Yeast Lysate Data Set. A public yeast lysate data set of seven fractions acquired by LC-UVPD-MS/MS (LC: Bruker-Michrom, Auburn, CA; MS/MS: Thermo Scientific Orbitrap Elite mass spectrometer, Bremen, Germany), with three or four technical replicates in each fraction, was downloaded.33 A total of 292 proteoforms corresponding to 215 proteins were identified from these seven fractions using ProSightPC 3.0. Data processing for both data sets is described in detail in Supporting Information, section I.



METHODS

Intelligent Algorithms for Top-down Proteomic Quantitation. iTop-Q accepts input files in mzXML and mzML formats, which can be conveniently converted from raw data by existing converters. It accepts LC-MS data (possibly also containing MS2) acquired in profile or centroid mode. Since iTop-Q focuses on quantifying intact proteins, it extracts and processes only MS1 spectra. The general workflow of iTop-Q is shown in Figure 1.

Preprocessing of MS1 Data. To reduce data complexity, iTop-Q first preprocesses each MS1 spectrum. Signal centroiding, noise removal, selection of representative isotopic signals in each spectrum, and construction of XICs across MS1 spectra are described in detail in Supporting Information, section II. We use peaks to represent XICs in the following paragraphs.

The DYAMOND Algorithm for Charge State Deconvolution. Grouping Peaks Based on Retention Time. Since the peaks of a protein theoretically elute at close retention times, iTop-Q first groups peaks based on their retention time. Starting from the most intense peak, say pi, among the detected peaks, iTop-Q groups peaks whose apex retention time lies in the range of t1 − Δ to t2 + Δ, where Δ is the retention time tolerance (2 s by default), and t1 and t2 are the starting and ending retention times of pi, respectively.
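The retention-time grouping step can be sketched as follows. This is a minimal illustration rather than iTop-Q's actual code: the `Peak` record, its field names, and the function `group_by_retention_time` are hypothetical, and retention times are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """One XIC (hypothetical record; field names are illustrative)."""
    mz: float         # m/z of the XIC
    apex_rt: float    # retention time at the apex (s)
    start_rt: float   # start of elution, t1 for the seed peak (s)
    end_rt: float     # end of elution, t2 for the seed peak (s)
    intensity: float  # apex intensity


def group_by_retention_time(peaks, delta=2.0):
    """Return (seed, group) for the most intense peak.

    The group holds every peak whose apex lies in
    [t1 - delta, t2 + delta], sorted by increasing m/z.
    """
    seed = max(peaks, key=lambda p: p.intensity)
    lo, hi = seed.start_rt - delta, seed.end_rt + delta
    group = [p for p in peaks if lo <= p.apex_rt <= hi]
    return seed, sorted(group, key=lambda p: p.mz)


peaks = [
    Peak(850.0, 106.5, 100.0, 110.0, 5e5),
    Peak(800.0, 100.0, 95.0, 105.0, 1e6),   # most intense, becomes the seed
    Peak(900.0, 120.0, 115.0, 125.0, 2e5),  # apex outside 93..107 s
]
seed, group = group_by_retention_time(peaks)
print(seed.mz, [p.mz for p in group])
```

In a full pipeline the grouped peaks would presumably be removed from the pool and the procedure repeated with the next most intense remaining peak; that loop is omitted here.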


Figure 1. General workflow of iTop-Q. After data input, iTop-Q first performs putative proteoform detection on each individual run using DYAMOND. Then, it aligns detected putative proteoforms across different samples/replicates and generates a summary table of protein abundances with sample/replicate names in columns and detected proteoforms in rows.

Calculating Possible Charge States. Let P = {p1, p2, ..., pi, ..., pn} be the list of grouped peaks sorted by increasing m/z value, where n is the number of peaks. Assuming that most peaks in P correspond to a single protein, Prot, with mass M and consecutive charge states (e.g., 12+, 13+, 14+, etc.), we compute the possible charge states between any two peaks in P. Specifically, assuming that two arbitrary peaks, pi and pj, have consecutive charge states zi and zi − 1, and solving the simultaneous equations

$$
mz_i = \frac{M + z_i \times H}{z_i} \quad \text{and} \quad mz_j = \frac{M + (z_i - 1) \times H}{z_i - 1} \tag{1}
$$

we obtain

$$
z_i = \frac{mz_j - H}{mz_j - mz_i} \tag{2}
$$

where mzi and mzj are the m/z values of pi and pj (mzj > mzi), and H is the mass of a proton. For each pi in the peak list P, by applying eq 2 to pi paired with any peak pj in P, iTop-Q generates a list of possible charge states, li, for peak pi, 1 ≤ i ≤ n − 1. To reduce the size of li, we check for each possible charge state candidate, z, whether there is an isotopic peak on both the right-hand and left-hand sides of the most intense isotopic peak of pi, with m/z intervals of 1/z among the three isotopic peaks. If so, z is retained as a possible charge state; otherwise, z is removed from li.

Determining Charge States Using Dynamic Programming. Let zmin and zmax be the minimum and maximum charges, respectively, of all peaks in P, and let Z be the list of consecutive integers in decreasing order from zmax to zmin. We use dynamic programming to assign charge states by optimizing the following scoring function F(i,j) for i = 1, 2, ..., n and j = 1, 2, ..., zmax − zmin + 1:

$$
F(i,j) = \max
\begin{cases}
F(i-1,\, j-1) + M(p_i, z_j) \\
F(i-1,\, j) - d_1 \\
F(i,\, j-1) - d_2
\end{cases}
\qquad
M(p_i, z_j) =
\begin{cases}
+2n, & \text{if } z_j \in l_i \\
-2n, & \text{if } z_j \notin l_i
\end{cases}
\tag{3}
$$

where d1 and d2 are the penalties of assigning a possible charge state to a gap in the peak list and assigning a gap in the charge state (i.e., no charge state) to a peak, respectively; we set d1 = d2 = 1. Initially, F(i,0) = F(0,j) = 0 for all i, j. Using dynamic programming, a score table is established, on which a backtracking procedure is applied to find the optimized charge state assignment. The list of peaks with optimized assigned charge states, denoted by Pc, defines a putative proteoform of an intact protein, Prot. When more than one path achieves the maximum score, DYAMOND calculates the standard deviation of the protein masses determined by the peaks in each path and selects the path with the minimum standard deviation as the optimized path for charge state assignment.

Furthermore, if there are discontinuous charge states in the defined proteoform (e.g., the proteoform is assigned the charge states 13+, 14+, 16+, and 18+, and the charge states 15+ and 17+ are missing), a postprocessing procedure is performed to search again all peaks (with or without charge states) in the retention time range and to calculate the mass of each peak using its m/z value and the missing charge states. If the difference between the calculated mass and M is within a mass tolerance, the peak is also regarded as part of the defined proteoform, and it is reported as a shared peak if it has already been assigned to another already-determined proteoform. Finally, iTop-Q validates the quality of Pc by the number of peaks and the continuity of the charge states. If Pc contains at least three peaks and includes at least two continuous charge states, Pc is regarded as qualified. Otherwise, Pc is regarded as unqualified, and the peaks of Pc are put back into the original peak list for another charge state deconvolution procedure.

In addition to the quality validation of putative proteoforms, iTop-Q also validates the quality of charge states in the putative proteoforms using isotopic signals. Specifically, for each charge state in a putative proteoform, we examine whether there are at least three isotopic signals with m/z intervals consistent with the charge state (i.e., 1/z). If so, the charge state is regarded as qualified and the peak is considered validated. Otherwise, the charge state is regarded as unqualified and the peak is considered not validated (marked in green in the user interface of iTop-Q).

Calculating the Masses and Abundances of a Putative Proteoform. Since Pc is composed of multiple peaks, determining its protein monoisotopic/average mass and abundance are important tasks. We calculate the protein average mass by using the m/z and charge state of the most intense peak in Pc and apply the averagine model34 to compute the protein monoisotopic mass of Pc. For protein abundance, according to our analyses of six different abundance calculation methods, we utilize the abundance of the most intense peak in Pc as the representative since it has the most consistent quantitation performance. After putative proteoform detection, each input file has its corresponding putative proteoform list.

Figure 2. Comparison of iTop-Q, ProMex, and Decon2LS on the standard protein Cytochrome c (Cyt c) in terms of calculated charge states and protein monoisotopic mass, where “Total” denotes the total number of charge states detected, and “Distinct” denotes the number of distinct charge states. (A) The number of charge states and monoisotopic masses of Cyt c calculated by iTop-Q, where PPE1 and PPE2 are two detected putative proteoforms. (B) The number of charge states and monoisotopic masses of Cyt c calculated by ProMex, where PPE1 and PPE2 are displayed together because ProMex reports multiple proteoform features corresponding to Cyt c, each containing peaks eluted within the entire retention time range of PPE1 and PPE2, without distinguishing PPE1 and PPE2. Since the representative peak of one detected feature of Cyt c is assigned an incorrect charge state, the standard deviation of the calculated monoisotopic mass in the third replicate is larger than those in the other two replicates. However, the standard deviation becomes much smaller after removing the feature. (C) The number of charge states and monoisotopic masses calculated by Decon2LS, where three MS1 spectra, each with the most intense signals of PPE1 and PPE2, are selected from each replicate as representatives.

Aligning Putative Proteoforms across Replicates/Samples. In a label-free top-down LC-MS experiment, proteins with almost the same masses that elute at close retention times in any two runs are commonly regarded as identical proteoforms. iTop-Q, therefore, groups putative proteoforms across runs based on their masses and retention times. To compensate for possible retention time shifts of the same proteoform in different runs, iTop-Q performs a retention time adjustment procedure prior to putative proteoform grouping. iTop-Q first selects the run having the largest number of detected proteoforms as the reference and aligns the proteoform list of each of the other runs pairwise with that of the reference. The putative proteoforms commonly detected in both runs, called landmarks, are used to model the retention time shift distribution between the reference and the other run. Two proteoforms are considered commonly detected in both runs if they satisfy the following conditions: (1) their masses differ within a user-defined mass tolerance; (2) they have at least two common peaks, i.e., close m/z and the same charge state. With a list of landmarks, we utilize the LOESS regression algorithm35,36 with a span of 20% and weight of 1 to construct a retention time drift model, and adjust all putative proteoforms in the aligned run accordingly. After retention time adjustment, putative proteoforms in the other run are aligned with those in the reference if two proteoforms have a protein mass difference less than a mass threshold and an adjusted retention time difference within a given retention time tolerance.
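The charge-state candidates of eq 2 and the recurrence of eq 3 can be sketched together as below. This is a simplified illustration, not the DYAMOND implementation. It rests on several assumptions that are not in the text: a candidate is kept when eq 2 is within `tol` of an integer, the higher-m/z partner pj is also given the candidate z − 1 so that the last peak has a candidate, the isotopic 1/z filter is skipped, traceback starts from the highest-scoring cell, and ties are not resolved by mass standard deviation.

```python
PROTON = 1.007276  # mass of a proton in Da (H in eqs 1 and 2)


def candidate_charges(mzs, tol=0.05):
    """Apply eq 2 to every pair (i, j) with mz_j > mz_i (mzs sorted ascending)."""
    cands = [set() for _ in mzs]
    for i, mz_i in enumerate(mzs):
        for j in range(i + 1, len(mzs)):
            gap = mzs[j] - mz_i
            if gap <= 0:
                continue
            z = (mzs[j] - PROTON) / gap          # eq 2
            zi = round(z)
            if zi >= 2 and abs(z - zi) <= tol:   # tol is an assumed threshold
                cands[i].add(zi)                 # l_i in the text
                cands[j].add(zi - 1)             # assumption: partner carries z_i - 1
    return cands


def assign_charges(mzs, d1=1, d2=1, tol=0.05):
    """Fill the score table of eq 3 and trace back one optimal path.

    Returns a dict mapping m/z to the assigned charge state.
    """
    mzs = sorted(mzs)
    n = len(mzs)
    cands = candidate_charges(mzs, tol)
    all_z = set().union(*cands)
    if not all_z:
        return {}
    charges = list(range(max(all_z), min(all_z) - 1, -1))  # Z: z_max down to z_min
    m = len(charges)

    def match(i, j):  # M(p_i, z_j) in eq 3, with 1-based i and j
        return 2 * n if charges[j - 1] in cands[i - 1] else -2 * n

    F = [[0] * (m + 1) for _ in range(n + 1)]  # F(i,0) = F(0,j) = 0
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + match(i, j),
                          F[i - 1][j] - d1,
                          F[i][j - 1] - d2)
            if F[i][j] > best[0]:
                best = (F[i][j], i, j)

    _, i, j = best
    assigned = {}
    while i > 0 and j > 0:
        if F[i][j] == F[i - 1][j - 1] + match(i, j):
            if match(i, j) > 0:                  # record true matches only
                assigned[mzs[i - 1]] = charges[j - 1]
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d1:
            i -= 1
        else:
            j -= 1
    return assigned


# Synthetic envelope: one protein of mass 12358 Da observed at 15+ to 12+.
mass = 12358.0
mzs = [(mass + z * PROTON) / z for z in (15, 14, 13, 12)]
print(assign_charges(mzs))
```

Note that eq 2 alone also yields spurious integer candidates for non-adjacent pairs (for example, the 15+ and 12+ peaks above give a candidate of 5). The dynamic programming step is what rejects them, because only the consecutive assignment 15, 14, 13, 12 aligns every peak without gap penalties.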
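The mass calculation and the envelope quality criterion described in Methods can be illustrated with a few helper functions. The names `neutral_mass`, `mass_spread`, and `is_qualified` are hypothetical, the mass is obtained by rearranging eq 1, "continuous charge states" is read here as two charge states differing by one, and the population standard deviation is an assumed choice for the tie-breaking statistic.

```python
from statistics import mean, pstdev

PROTON = 1.007276  # mass of a proton in Da


def neutral_mass(mz, z):
    """Rearranged eq 1: M = z * (m/z - H)."""
    return z * (mz - PROTON)


def mass_spread(assignment):
    """Mean and standard deviation of the masses implied by {m/z: charge}.

    A path with a smaller spread would be preferred when several
    paths reach the same maximum score.
    """
    masses = [neutral_mass(mz, z) for mz, z in assignment.items()]
    return mean(masses), pstdev(masses)


def is_qualified(charges):
    """At least three peaks and at least two consecutive charge states."""
    zs = sorted(set(charges))
    has_consecutive = any(b - a == 1 for a, b in zip(zs, zs[1:]))
    return len(charges) >= 3 and has_consecutive


assignment = {(12358.0 + z * PROTON) / z: z for z in (15, 14, 13)}
print(mass_spread(assignment))
print(is_qualified(assignment.values()))
```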
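The retention time adjustment can be sketched with a small LOESS-style smoother. This is a stand-in under stated assumptions, not the regression code used by iTop-Q: it fits a local straight line with tricube weights over the nearest 20% of landmarks, it does not model the "weight of 1" setting, and the function names and the tolerance values in `same_proteoform` are hypothetical.

```python
def loess_predict(xs, ys, x0, span=0.2):
    """Local linear fit at x0 over the nearest span * n landmarks."""
    n = len(xs)
    k = max(2, int(round(span * n)))
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    # Bandwidth slightly beyond the farthest neighbour so its weight is not zero.
    h = max(abs(xs[i] - x0) for i in nearest) * 1.001 or 1.0
    w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    if sxx == 0:
        return my
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    return my + (sxy / sxx) * (x0 - mx)


def adjust_rt(landmark_rt_run, landmark_rt_ref, rt):
    """Map a retention time in the aligned run onto the reference time axis."""
    return loess_predict(landmark_rt_run, landmark_rt_ref, rt)


def same_proteoform(mass_a, rt_a, mass_b, rt_b, mass_tol=1.0, rt_tol=30.0):
    """Match rule after adjustment; both tolerances are illustrative values."""
    return abs(mass_a - mass_b) < mass_tol and abs(rt_a - rt_b) <= rt_tol


# Landmarks with a constant 5 s drift between the run and the reference.
run_rt = [float(5 * i) for i in range(21)]
ref_rt = [t + 5.0 for t in run_rt]
print(adjust_rt(run_rt, ref_rt, 52.0))
```

For production use, an established implementation such as the `lowess` function in statsmodels would be a more robust choice than this hand-written smoother.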



RESULTS AND DISCUSSION Performance Evaluation by a Standard Protein Data Set. We first used a standard protein data set with three technical replicates to evaluate the performance of iTop-Q. A standard intact protein Cytochrome c (Cyt c) with the same concentration was injected into three technical replicates. Because of the separations of LC gradient, the intact protein forms two proteoform envelopes in each technical replicate (as shown in Supporting Information, Figure S1). Evaluation on Charge State Deconvolution. Using iTopQ, two putative proteoform envelopes, called PPE1 and PPE2, with assigned charge states and eluted in close retention time in the three replicates were detected (Supporting Information, Figure S2). The m/z, retention time, charge states, intensity, and S/N values of the peaks of PPE1 and PPE2 in the three replicates are listed in Supporting Information, Tables S1 and S2, respectively. In order to evaluate the performance of iTopQ, we utilized ProMex27 and Decon2LS22 to process this data set as well. Similar to iTop-Q, ProMex also constructs XICs across MS1 spectra. Decon2LS determines the charge states of signals in each MS1 spectrum using THRASH algorithm29 and reports the abundances of detected signals accordingly. The number of peaks, proteoform features or signals belonging to 13131

DOI: 10.1021/acs.analchem.7b02343 Anal. Chem. 2017, 89, 13128−13136

Article

Analytical Chemistry

Decon2LS since it does not provide integrated signal abundance across spectra. All the abundance ratios of Decon2LS are listed in Supporting Information, Tables S5−S7. Figure 3 shows the abundance ratio distributions under six abundance calculation methods on putative proteoforms

Cyt c detected by iTop-Q, ProMex and Decon2LS in each replicate are listed in Supporting Information, Table S3. Figure 2 shows the number of charge states and monoisotopic masses of Cyt c in the three replicates calculated by iTop-Q, ProMex, and Decon2LS. Since Decon2LS detects the signals of Cyt c in several neighboring MS1 spectra depending on the elution of Cyt c, we only show the MS1 spectra with the most intensive signals of PPE1 and PPE2 in Figure 2C. All the MS1 spectra with the signals of PPE1 and PPE2 detected by Decon2LS are shown in Supporting Information, Figures S3−S5. As shown in Figure 2A, the total number of detected charge states is equal to the number of distinct charge states, meaning there is no redundant charge states in iTop-Q. In addition, we noticed that ProMex did not distinguish PPE1 and PPE2, that is, the proteoform features contain peaks in the entire retention time range of PPE1 and PPE2 as shown in its output results (Supporting Information, Table S4), whereas iTop-Q detected both proteoform envelopes though they eluted in close retention time. For protein monoisotopic mass calculation, iTop-Q, ProMex, and Decon2LS report the masses of 12350.04 ± 0.4, 12408.83 ± 99.8, and 12351.8 ± 0.9, respectively, where the relative large standard deviation of ProMex was caused by a detected peak incorrectly assigned with a charge state (as shown in Figure 2 and Supporting Information, Table S4). After removing the incorrect charge state, the standard deviation of monoisotopic mass calculated by ProMex become much smaller (12352.71 ± 5.9). For protein average mass calculation, both iTop-Q and Decon2LS have mass error smaller than 5 Da compared with the theoretical average mass obtained from ProteoMass, while ProMex is not compared since it does not provide protein average mass information. iTop-Q has smaller standard deviation (calculated protein average mass: 12358.46 ± 0.35) than Decon2LS (calculated protein average mass: 12359.61 ± 1.11). 
The detailed average mass comparison of iTop-Q and Decon2LS is shown in Supporting Information, Figure S6. Finally, iTop-Q took 3.28 min in average to process each replicate, whereas ProMex and Decon2LS take 66.33 and 76.48 min in average, respectively. Evaluation on Abundance Calculation by Six Different Methods. Selecting a proper method for accurately quantifying the proteoform is important. We considered six different methods to calculate the abundance of a proteoform, including using the most abundant peak area (M1), using the apex intensity of the most intensive peak (M2), summing the areas of top three intensive peaks (M3), summing the apex intensities of top three intensive peaks (M4), summing the areas of all peaks (M5), and summing the apex intensities of all peaks (M6). Note that here peaks refer to the XICs for iTop-Q and proteoform features for ProMex (as apex intensity and abundance based on area are provided), and signals for Decon2LS. To evaluate these methods, we calculated proteoform abundance ratios between any two replicates, defined by the abundance in replicate i measured by a specific method divided by the abundance in replicate j, 1 ≤ i, j ≤ 3, and i ≠ j, which are expectedly close to 1. Since Decon2LS reports signal abundances in individual spectrum, different abundances of PPE1 and PPE2 could be reported from different MS1 spectra of each replicate. We thus calculated the abundance ratios among all of the MS1 spectra containing signals of PPE1 or PPE2 of the three replicates using the methods of M2, M4, and M6. Note that M1, M3, and M5 cannot be applied in

Figure 3. Abundance ratio distributions under six different abundance calculation methods of detected proteoform envelopes of the standard protein data set by iTop-Q, ProMex, and Decon2LS. The six methods are as follows. M1: using the most abundant peak area, M2: using the apex intensity of the most intensive peak, M3: summing the areas of top three intensive peaks, M4: summing the apex intensities of top three intensive peaks, M5: summing the areas of all peaks, and M6: summing the apex intensities of all peaks.

detected by iTop-Q, ProMex, and Decon2LS. We noticed that, using the three tools, all the six methods have their replicate abundance ratios close to 1, and, for iTop-Q, M1 has the smallest standard deviation. Compared with Decon2LS, iTop-Q and ProMex have smaller standard deviation on the calculated abundance ratios, suggesting that using XIC area could provide better quantitation accuracy. Performance Evaluation and an Application Demonstrated by a Public Large-Scale Yeast Lysate Data Set. We utilized a public yeast lysate data set33 of a higher sample complexity than the standard data set, where overlapping proteoform envelopes may occur, to evaluate iTop-Q’s performance. This data set includes seven fractions, each with three or four technical replicates. A total of 292 proteoforms, provided by the authors, were identified from these seven fractions using ProSightPC 3.0. This data set was processed by iTop-Q, ProMex, and Decon2LS. Though iTop-Q constructs XICs across MS1 spectra and reassembles XICs to form putative proteoform envelopes, it took on average 1.4 min to process each replicate, compared to 154.67 and 746.67 min for each replicate by ProMex and Decon2LS, respectively. Processing the data set, iTop-Q acquired a total of 4027 putative proteoform envelopes (including nondistinct, being multiply detected putative proteoforms), comprised of 26089 peaks. On the other hand, ProMex detected a total number of 79448 proteoform features (each with a representative peak with assigned charge state and m/z value), and Decon2LS detected a total number of 1199287 signals in all of the replicates of the seven fractions. 
Manually examining the quality of 292 identified proteoforms by our proposed criteria for qualified proteoform envelopes described in the DYAMOND algorithm in the Experimental Section, we found 176 proteoforms passed the criteria, whereas 47 proteoforms did not pass the criteria even though their precursor XICs were constructed, and 69 had low signal intensities so that their precursor XICs could not be constructed. We thus utilized the 176 proteoforms as the 13132

DOI: 10.1021/acs.analchem.7b02343 Anal. Chem. 2017, 89, 13128−13136

Article

Analytical Chemistry

and Decon2LS to evaluate protein quantitation. Since we do not know the exact protein abundance ratios among different fractions, we calculated protein ratios among technical replicates (which are expected to be 1) in each fraction for our evaluation. Figure 4 shows the abundance ratios of the

benchmark for the following analyses. The detailed information on the 176 proteoforms is listed in Supporting Information, Table S8. Performance Evaluation on Charge State Deconvolution and Protein Monoisotopic/Average Mass Accuracy. iTop-Q detected 1516 peaks of all the 176 proteoform envelopes from the replicates of seven fractions, while ProMex and Decon2LS detected 943 peaks of 168 proteoforms and 7387 signals of 176 proteoforms, respectively. The detailed information on peaks, proteofrom features, and signals detected by iTop-Q, ProMex, and Decon2LS is listed in Supporting Information, Tables S9− S11, respectively. In the output results of Decon2LS, signals with the same charge state and m/z could be repeatedly reported (Supporting Information, Figure S7) and the number of detected signals comprising the same putative proteoform could vary across spectra (Supporting Information, Figure S8 and Table S10). Similar situation occurs in the output results of ProMex (Supporting Information, Figure S9 and Table S11). We thus independently took the union of charge states of the proteoforms detected by ProMex and Decon2LS to compare with the charge states detected by iTop-Q. The number of total and distinct charge states detected by the three tools is shown in Supporting Information, Figure S10(A). The charge state comparison of iTop-Q, ProMex, and Decon2LS is shown in Supporting Information, Figure S10(B), where 28% (506/1827) of charge states were commonly detected by the three tools, and 7% (135/1827) of charge states were only detected by iTop-Q. In addition, our tool detected more high charge states than ProMex and Decon2LS (Supporting Information, Figure S10(C)). 
Analyzing the peaks with charge states detected by iTop-Q alone, we observed that ProMex and Decon2LS also detected 48% (65 out of 135) and 83% (112 out of 135) of them in their representative peaks and signals, respectively, but assigning the them with incorrect charge states (Supporting Information, Table S12). This is probably due to noisy background or overlapping isotope envelopes that leads to the incorrect charge state deconvolution. Our DYAMOND algorithm calculates possible charge states and applies dynamic programming to filter out incorrect charge state assignment, and thus is shown to be highly accurate. On the other hand, the 17% (311 out of 1827) charge states undetected by iTop-Q is mainly due to the signals with low S/N ratios such that their XICs could not be constructed (Supporting Information, Figure S11). It is also noted that, though the distinct charge state number of ProMex is relatively lower than iTop-Q and Decon2LS because of only one representative peak with its charge state reported for each detected feature, the number of detected charge states greatly increases when considering all charge states within the charge state range provided by ProMex as detected (Supporting Information, Figure S10(A),(B)). Finally, based on the 506 commonly detected peaks/signals, the monoisotopic masses calculated by iTop-Q are close to those calculated by ProMex and Decon2LS (Supporting Information, Figure S12(A)), and the average masses calculated by iTop-Q and Decon2LS are also highly correlated (as shown in Supporting Information, Figure S12(B)). Performance Evaluation on Protein Quantitation. Quantifying proteoforms in a complex data set is challenging since overlapping proteoforms may occur, increasing the difficulty of accurate protein quantitation. Similar to quantitation analysis of standard protein data set, we compare quantitation analysis on the 176 benchmark proteoforms detected by iTop-Q, ProMex,

Figure 4. Abundance ratio distributions of iTop-Q, ProMex, and Decon2LS in the yeast lysate data set using the six different abundance calculation methods.

detected features between any two replicates in the same fraction using iTop-Q, ProMex, and Decon2LS. All three tools achieved median proteoform ratios of 1. iTop-Q and ProMex had much smaller abundance ratio deviations than Decon2LS. This result suggests that using peak areas yields more consistent abundance calculation, echoing the previous finding in the standard data set. Furthermore, based on the ratio distributions of iTop-Q and ProMex, using the area of the most intense peak (M1) in a putative proteoform envelope provides the smallest protein ratio deviation.

Accurate Deconvolution of Shared Peaks in Coeluting Proteoforms. In large-scale top-down proteomics experiments, several proteoforms may coelute at close retention times. In some cases of coelution, one or more peaks in a proteoform envelope may overlap with peaks in another proteoform envelope; we call such overlapping peaks in coeluting proteoforms shared peaks henceforth. It is important for a tool to be capable of separating proteoform envelopes from one another and distinguishing shared peaks in the coeluting proteoforms. Among the 176 proteoforms detected by iTop-Q, only 25 proteoforms did not coelute with any other proteoform within a retention time tolerance of ±0.5 min, while 151 proteoforms coeluted with other putative proteoforms (Supporting Information, Figure S13(A)). The detailed number of coeluting proteoforms is shown in Supporting Information, Figure S13(B). Among the 151 coeluting proteoforms, 123 proteoforms did not have any shared peak and were relatively easy to distinguish, and 28 proteoforms had shared peaks, which were successfully detected by iTop-Q (Supporting Information, Figure S13(C)). Figure 5 shows a heat map of the proteoform envelopes of two coeluting proteoforms selected from the 176 identified proteoforms, where the x-axis and y-axis represent the retention time and m/z, respectively.
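The coelution criterion above (retention times within ±0.5 min) and the notion of a shared peak can be illustrated with a small sketch. It is a simplification under stated assumptions: each proteoform is represented by a single retention time and a list of peak m/z values, and two peaks are called shared when their m/z values differ by no more than `MZ_TOL`, a hypothetical tolerance that does not come from the paper. iTop-Q itself works on XICs and isotopic patterns, which this sketch does not model.

```python
from itertools import combinations

RT_TOL_MIN = 0.5  # retention time tolerance stated in the text (minutes)
MZ_TOL = 0.01     # hypothetical m/z tolerance for calling two peaks "shared"


def coeluting_pairs(proteoforms, rt_tol=RT_TOL_MIN):
    """Return name pairs whose retention times differ by at most rt_tol.

    `proteoforms` maps a name to {"rt": float, "peaks": [m/z, ...]}.
    """
    pairs = []
    for a, b in combinations(sorted(proteoforms), 2):
        if abs(proteoforms[a]["rt"] - proteoforms[b]["rt"]) <= rt_tol:
            pairs.append((a, b))
    return pairs


def shared_peaks(peaks_a, peaks_b, mz_tol=MZ_TOL):
    """Return (m/z in A, m/z in B) pairs that fall within mz_tol of each other."""
    return [(x, y) for x in peaks_a for y in peaks_b if abs(x - y) <= mz_tol]


# Toy proteoforms with made-up retention times and m/z values.
toy = {
    "A": {"rt": 30.0, "peaks": [800.10, 864.60]},
    "B": {"rt": 30.4, "peaks": [864.605, 900.20]},
    "C": {"rt": 45.0, "peaks": [700.00]},
}
pairs = coeluting_pairs(toy)
shared = shared_peaks(toy["A"]["peaks"], toy["B"]["peaks"])
```

In this toy input only A and B coelute, and they share one peak; C elutes far from both and is therefore counted as non-coeluting.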
As shown in Figure 5, one green peak is shared by two proteoform envelopes. In addition, four peaks outlined by dotted lines in the two proteoform envelopes are invalidated by iTop-Q since their isotopic patterns do not fit those calculated from the charge states assigned by iTop-Q. Using iTop-Q, not only proteoforms but also shared peaks can be successfully distinguished from one another. This analysis demonstrates the ability of iTop-Q in distinguishing coeluting proteins and allowing users to verify the quality of detected proteoform envelopes.

Figure 5. Example of two identified and coeluted proteins, RS20_YEAST (average protein mass: 13817.62) and RS19B_YEAST (average protein mass: 15784.39). The proteoform envelopes of RS20_YEAST and RS19B_YEAST are colored in red and blue, respectively. The peak shared by the two proteoforms is colored in green.

Application to Protein Post-Translational Modification Quantitation. The quantitative study of post-translational modifications is important since post-translational modifications play pivotal roles in the determination of biological processes.37 For example, SOD1 protein localized in mitochondria protects proteins from oxidative injury, participates in energy generation, and is ubiquitous in both fermentative and respiratory growth of yeast.38 The expression level of this protein is altered in yeast in response to different redox stimuli, and it may directly or indirectly influence the activity of protein kinases, regulating the translational activity of the ribosomes.39 In addition, SOD1 activity and the phosphorylation state of ribosomal P proteins have been shown to be closely correlated during the diauxic shift and logarithmic growth of yeast.32 Computing the abundance ratios of SOD1 protein and its phosphorylated form could benefit the study of biological processes of yeast lysate.

According to the identification results of the yeast lysate data set provided by Cannon et al.,33 56% (98/176) of the proteoforms were modified, with acetylation being the most commonly observed modification in the data set (Supporting Information, Table S13). Specifically, seven proteins, including SOD1 protein (corresponding to 18 proteoforms), were identified with both unmodified and modified forms; 10 proteins (corresponding to 35 proteoforms) were identified with various PTM forms but no unmodified form; and 123 proteins (corresponding to 123 proteoforms) were identified with either a modified or an unmodified form. Using iTop-Q, the abundance ratios of the seven proteins with both modified and unmodified forms were computed (Figure 6), where the protein with ID P02293 was calculated four times since this protein has been identified with four modifications. Notably, decreased phosphorylation levels were quantified by iTop-Q for Cu−Zn superoxide dismutase (SOD1, P00445, 0.27-fold), 60S ribosomal protein L22A (RPL22A, P05749, 0.07-fold), and L26-B (RPL26B, P53221, 0.11-fold).

Figure 6. Abundances of seven proteins with and without post-translational modifications. Two proteins (accession numbers: P02293 and P02294) have four and two different post-translational modifications, respectively, and the other five proteins have a single post-translational modification. The fold change is defined as the abundance of a protein with post-translational modification divided by the abundance of the protein without post-translational modification.
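The fold change defined in the Figure 6 caption (abundance of the modified form divided by abundance of the unmodified form) can be written out as a short sketch. The function names and the abundance values below are hypothetical and serve only to make the definition concrete; they are not iTop-Q output.

```python
def fold_change(modified_abundance, unmodified_abundance):
    """Fold change = abundance with PTM / abundance without PTM."""
    if unmodified_abundance <= 0:
        raise ValueError("unmodified abundance must be positive")
    return modified_abundance / unmodified_abundance


def fold_changes_by_ptm(unmodified_abundance, modified_abundances):
    """Compute one fold change per modified form of the same protein.

    `modified_abundances` maps a PTM label to the abundance of that form,
    so a protein with several modifications yields several ratios.
    """
    return {
        ptm: fold_change(abundance, unmodified_abundance)
        for ptm, abundance in modified_abundances.items()
    }


# Made-up abundances for a protein with two modified forms.
ratios = fold_changes_by_ptm(1.0e7, {"ptm_1": 2.5e6, "ptm_2": 2.0e7})
```

A ratio below 1 means the modified form is less abundant than the unmodified form, which is how values such as 0.27-fold in the text should be read.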


Figure 7. Main user interface of iTop-Q. (A) Proteoform summary table summarizes all detected putative proteoforms with their calculated masses, retention time, and abundances in different replicates/samples. (B) Putative proteoform envelope table lists peaks included in the clicked proteoform envelope with their m/z, retention time, intensity values, and charge states. (C), (D), and (E) are the heat map, XIC plot, and projected spectrum plot of a selected peak in the putative proteoform envelope table. (F) Parameter setting table lists the parameters used in the quantitation.

iTop-Q: a Friendly and Graphical Quantitation Tool. We implemented iTop-Q in the C# programming language as a portable tool (i.e., one that does not require installation), so that it can be run on several Microsoft Windows platforms (including Windows 7, 8, 10, and Windows Server 2008 and 2012) for top-down proteomics quantitation. To help users quantify their LC-MS data, iTop-Q provides a quantitation wizard that guides users through processing the imported data step by step. In the quantitation wizard, only one parameter, mzWidth, is required, since the data resolution varies among instruments. However, users may also modify other parameters, such as the mass tolerance and noise threshold in the advanced settings, to optimize quantitation results. Once the quantitation process is completed, the main user interface of iTop-Q displays six panels (Figure 7A−F). The first panel (Figure 7A) shows a proteoform summary table, which lists the monoisotopic mass, average mass, retention time, and abundance of detected putative proteoforms in different replicates or samples. When the user double-clicks an abundance entry (i.e., the abundance of a proteoform in a specific processed file) in the summary table, the putative proteoform envelope table (Figure 7B) lists the detailed information (e.g., the m/z, retention time, assigned charge state, and intensities) of the peaks included in the envelope of the selected putative proteoform. More importantly, coelution information is also provided in this panel. If the selected proteoform coelutes with another proteoform, say proA, the protein ID of proA is listed in the column “Shared with protein ID”. The next three panels (Figure 7C−E) provide graphical visualizations, which display the elution heat map, constructed XIC, and projected MS1 spectrum of the selected proteoform. In the plot of the constructed XIC, users can use the mouse to redefine the boundary of the XIC, and the proteoform abundance is updated instantly. Finally, the last panel (Figure 7F) lists the parameter settings used in the quantitation. To use iTop-Q for intact protein quantitation, users who have identification results for a top-down proteomics data set can map their identified proteins to those detected by iTop-Q using proper monoisotopic/average mass and retention time tolerances.
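The mapping between identified proteins and detected proteoforms mentioned above can be sketched as a simple tolerance match. This is an illustrative sketch, not iTop-Q functionality: the record fields, the function names, and the default tolerances (10 ppm for mass, 1.0 min for retention time) are assumptions chosen only for the example and should be set according to the instrument and chromatography used.

```python
def ppm_error(observed_mass, reference_mass):
    """Signed mass error of observed_mass relative to reference_mass, in ppm."""
    return (observed_mass - reference_mass) / reference_mass * 1e6


def map_identifications(identified, detected, mass_tol_ppm=10.0, rt_tol_min=1.0):
    """Match identified proteins to detected proteoforms by mass and RT.

    `identified`: list of {"name", "mass", "rt"} from an identification tool.
    `detected`:   list of {"id", "mass", "rt"} from the quantitation output.
    Returns a dict: protein name -> list of matching detected proteoform ids.
    """
    matches = {}
    for ident in identified:
        matches[ident["name"]] = [
            d["id"]
            for d in detected
            if abs(ppm_error(d["mass"], ident["mass"])) <= mass_tol_ppm
            and abs(d["rt"] - ident["rt"]) <= rt_tol_min
        ]
    return matches


# Made-up records for illustration.
identified = [{"name": "protein_X", "mass": 15000.00, "rt": 30.0}]
detected = [
    {"id": 1, "mass": 15000.05, "rt": 30.3},  # about 3.3 ppm away, RT within 1 min
    {"id": 2, "mass": 15000.05, "rt": 42.0},  # mass matches, RT too far
    {"id": 3, "mass": 16000.00, "rt": 30.1},  # RT matches, mass too far
]
mapping = map_identifications(identified, detected)
```

Requiring both tolerances to hold helps avoid matching coeluting proteoforms of different mass, or proteoforms of similar mass that elute at different times.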



CONCLUSION

As top-down proteomics continues to increase in throughput and in the complexity of the samples analyzed, the lack of robust bioinformatics tools for top-down data analysis, management, and interpretation has become a major obstacle in comparison with the bottom-up approach.10,12 To overcome these challenges, we have developed iTop-Q as a friendly and graphical tool for protein quantitation at the MS1 level. An intelligent algorithm, DYAMOND, has also been designed to perform the difficult charge state deconvolution of MS1 spectra. According to our analyses, iTop-Q is an effective quantitation tool with high quantitation accuracy.

(12) Cui, W. D.; Rohrs, H. W.; Gross, M. L. Analyst 2011, 136, 3854−3864.
(13) Toby, T. K.; Fornelli, L.; Kelleher, N. L. Annu. Rev. Anal. Chem. 2016, 9, 499−519.
(14) Du, Y.; Parks, B. A.; Sohn, S.; Kwast, K. E.; Kelleher, N. L. Anal. Chem. 2006, 78, 686−694.
(15) Bergmann, U.; Ahrends, R.; Neumann, B.; Scheler, C.; Linscheid, M. W. Anal. Chem. 2012, 84, 5268−5275.
(16) Collier, T. S.; Hawkridge, A. M.; Georgianna, D. R.; Payne, G. A.; Muddiman, D. C. Anal. Chem. 2008, 80, 4994−5001.
(17) Ntai, I.; Kim, K.; Fellers, R. T.; Skinner, O. S.; Smith, A. D. t.; Early, B. P.; Savaryn, J. P.; LeDuc, R. D.; Thomas, P. M.; Kelleher, N. L. Anal. Chem. 2014, 86, 4961−4968.
(18) Smith, L. M.; Kelleher, N. L.; et al. Nat. Methods 2013, 10, 186−187.
(19) Pesavento, J. J.; Mizzen, C. A.; Kelleher, N. L. Anal. Chem. 2006, 78, 4271−4280.
(20) Castagnola, M.; Messana, I.; Inzitari, R.; Fanali, C.; Cabras, T.; Morelli, A.; Pecoraro, A. M.; Neri, G.; Torrioli, M. G.; Gurrieri, F. J. Proteome Res. 2008, 7, 5327−5332.
(21) Wu, S.; Brown, J. N.; Tolic, N.; Meng, D.; Liu, X.; Zhang, H.; Zhao, R.; Moore, R. J.; Pevzner, P.; Smith, R. D.; Pasa-Tolic, L. Proteomics 2014, 14, 1211−1222.
(22) Jaitly, N.; Mayampurath, A.; Littlefield, K.; Adkins, J. N.; Anderson, G. A.; Smith, R. D. BMC Bioinf. 2009, 10, 87.
(23) Cai, W.; Guner, H.; Gregorich, Z. R.; Chen, A. J.; Ayaz-Guner, S.; Peng, Y.; Valeja, S. G.; Liu, X.; Ge, Y. Mol. Cell. Proteomics 2016, 15, 703−714.
(24) LeDuc, R. D.; Taylor, G. K.; Kim, Y. B.; Januszyk, T. E.; Bynum, L. H.; Sola, J. V.; Garavelli, J. S.; Kelleher, N. L. Nucleic Acids Res. 2004, 32, W340−345.
(25) Zamdborg, L.; LeDuc, R. D.; Glowacz, K. J.; Kim, Y. B.; Viswanathan, V.; Spaulding, I. T.; Early, B. P.; Bluhm, E. J.; Babai, S.; Kelleher, N. L. Nucleic Acids Res. 2007, 35, W701−706.
(26) Fellers, R. T.; Greer, J. B.; Early, B. P.; Yu, X.; LeDuc, R. D.; Kelleher, N. L.; Thomas, P. M. Proteomics 2015, 15, 1235−1238.
(27) Park, J.; Piehowski, P. D.; Wilkins, C.; Zhou, M.; Mendoza, J.; Fujimoto, G. M.; Gibbons, B. C.; Shaw, J. B.; Shen, Y.; Shukla, A. K.; Moore, R. J.; Liu, T.; Petyuk, V. A.; Tolic, N.; Pasa-Tolic, L.; Smith, R. D.; Payne, S. H.; Kim, S. Nat. Methods 2017, 14, 909−914.
(28) Ferrige, A. G.; Seddon, M. J.; Jarvis, S.; et al. Rapid Commun. Mass Spectrom. 1991, 5, 374−377.
(29) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320−332.
(30) Liu, X.; Inbar, Y.; Dorrestein, P. C.; Wynne, C.; Edwards, N.; Souda, P.; Whitelegge, J. P.; Bafna, V.; Pevzner, P. A. Mol. Cell. Proteomics 2010, 9, 2772−2782.
(31) Kou, Q.; Wu, S.; Liu, X. BMC Genomics 2014, 15, 1140.
(32) Marty, M. T.; Baldwin, A. J.; Marklund, E. G.; Hochberg, G. K.; Benesch, J. L.; Robinson, C. V. Anal. Chem. 2015, 87, 4370−4376.
(33) Cannon, J. R.; Cammarata, M. B.; Robotham, S. A.; Cotham, V. C.; Shaw, J. B.; Fellers, R. T.; Early, B. P.; Thomas, P. M.; Kelleher, N. L.; Brodbelt, J. S. Anal. Chem. 2014, 86, 2185−2192.
(34) Senko, M. W.; Beu, S. C.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 1995, 6, 229−233.
(35) Cleveland, W. S. J. Am. Stat. Assoc. 1979, 74, 829−836.
(36) Cleveland, W. S. Am. Stat. 1981, 35, 54−54.
(37) Aebersold, R.; Mann, M. Nature 2016, 537, 347−355.
(38) Nedeva, T. S.; Petrova, V. Y.; Zamfirova, D. R.; Stephanova, E. V.; Kujumdzieva, A. V. FEMS Microbiol. Lett. 2004, 230, 19−25.
(39) Zielinski, R.; Pilecki, M.; Kubinski, K.; Zien, P.; Hellman, U.; Szyszka, R. Biochem. Biophys. Res. Commun. 2002, 296, 1310−1316.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.7b02343. The description of data processing and parameter settings; the preprocessing of MS1 data; supplementary Figures S1−S13 (PDF). Supplementary Tables S1−S13 (XLSX).



AUTHOR INFORMATION

Corresponding Authors

*Phone: +886-2-2788-3799, ext. 1711. Fax: +886-2-2782-4814. E-mail: [email protected].
*Phone: +886-2-2655-7688, ext. 1367. Fax: +886-2-2655-7626. E-mail: [email protected].

ORCID

Hui-Yin Chang: 0000-0003-1767-1874
Yu-Ju Chen: 0000-0002-3178-6697
Ting-Yi Sung: 0000-0002-6028-0409

Author Contributions

‡H.-Y.C. and C.-T.C. contributed equally to this work.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was supported by the Academia Sinica, Ministry of Science and Technology of Taiwan (MOST106-2221-E-001018), and Taiwan International Graduate Program. The first coauthor also thanks Prof. Alexey Nesvizhskii for support during paper revision.



REFERENCES

(1) Bogdanov, B.; Smith, R. D. Mass Spectrom. Rev. 2005, 24, 168−200.
(2) Zhou, H.; Ning, Z.; Starr, A. E.; Abu-Farha, M.; Figeys, D. Anal. Chem. 2012, 84, 720−734.
(3) Gosetti, F.; Mazzucco, E.; Gennaro, M. C.; Marengo, E. J. Chromatogr. B: Anal. Technol. Biomed. Life Sci. 2013, 927, 22−36.
(4) Lanucara, F.; Holman, S. W.; Gray, C. J.; Eyers, C. E. Nat. Chem. 2014, 6, 281−294.
(5) Savaryn, J. P.; Catherman, A. D.; Thomas, P. M.; Abecassis, M. M.; Kelleher, N. L. Genome Med. 2013, 5, 53.
(6) Skinner, O. S.; Havugimana, P. C.; Haverland, N. A.; Fornelli, L.; Early, B. P.; Greer, J. B.; Fellers, R. T.; Durbin, K. R.; Do Vale, L. H.; Melani, R. D.; Seckler, H. S.; Nelp, M. T.; Belov, M. E.; Horning, S. R.; Makarov, A. A.; LeDuc, R. D.; Bandarian, V.; Compton, P. D.; Kelleher, N. L. Nat. Methods 2016, 13, 237−240.
(7) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19, 242−247.
(8) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R. Nat. Biotechnol. 1999, 17, 676−682.
(9) Kelleher, N. L.; Thomas, P. M.; Ntai, I.; Compton, P. D.; LeDuc, R. D. Expert Rev. Proteomics 2014, 11, 649−651.
(10) Catherman, A. D.; Skinner, O. S.; Kelleher, N. L. Biochem. Biophys. Res. Commun. 2014, 445, 683−693.
(11) Cai, W.; Tucholski, T. M.; Gregorich, Z. R.; Ge, Y. Expert Rev. Proteomics 2016, 13, 717−730.