
Exploring the Potential of Data-independent Acquisition Proteomics using Untargeted All-ion Quantitation: Application to Tumor Subtype Diagnosis. Zhixiang Yan and Ru Yan. Anal. Chem., Just Accepted Manuscript • Publication Date (Web): 09 Mar 2018. Downloaded from http://pubs.acs.org on March 9, 2018


Analytical Chemistry

Exploring the Potential of Data-independent Acquisition Proteomics using Untargeted All-ion Quantitation: Application to Tumor Subtype Diagnosis

Zhixiang Yan†,‡ and Ru Yan*,†,‡

† State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau, Taipa, Macao, China
‡ Zhuhai UM Science & Technology Research Institute, Zhuhai 519080, China

* Corresponding author: [email protected]

Abstract

Maximizing the recovery of meaningful biological information can facilitate proteomics-guided early detection and precise treatment of diseases. However, conventional protein- and peptide-level targeted quantification of untargeted data-independent acquisition (DIA) data, such as sequential window acquisition of all theoretical spectra (SWATH), does not necessarily capture all of the information present. Untargeted all-ion quantification could in theory retrieve more features from SWATH digital maps by circumventing the initial identification step, but it is intrinsically susceptible to errors because of the extreme complexity of proteome samples and the poor selectivity of a single ion. In this study, we optimized untargeted all-ion quantification of SWATH data and applied it to differentiate tumor subtypes. Large peptides and low-abundance peptides benefited most from untargeted all-ion quantification. Top-ranked significant ions were linked to their corresponding ion envelopes, in which multiple correlated ions were used for measurement, and only ion envelopes containing at least three ions with a consistent intensity ratio were kept as refined differentiating features. Multivariate statistical analysis revealed that, for the tested dataset, the refined markers discovered by untargeted SWATH analysis showed diagnostic power comparable to that of protein and peptide markers. Limitations and benefits of the approach are further discussed.

INTRODUCTION

The advance of mass spectrometry has greatly promoted the application of proteomics in biomarker discovery, disease diagnostics, and personalized therapy.1-5 While it is not difficult to differentiate tumor from normal tissues and biofluids, achieving

subtype-specific and stage-specific proteomic diagnosis for precision medicine remains a major challenge. For instance, follicular thyroid carcinoma (FTC) and benign follicular adenoma (FA) are clinically indistinguishable by cytological evaluation of ultrasound-guided fine needle aspiration (FNA) biopsies, so many patients undergo unnecessary surgery.6,7 A previous proteomic study also indicated that the FA and FTC proteomes are remarkably similar.8

In discovery proteomics, tandem mass spectra (MS/MS) are usually collected for global peptide and protein identification through automatic data-dependent acquisition (DDA). However, DDA is inherently stochastic because it relies on the MS1 survey scan to trigger MS/MS.9 Furthermore, quantification by extracted fragment ion chromatograms, as in multiple reaction monitoring (MRM) and data-independent acquisition (DIA) analysis, is not possible in DDA owing to the incomplete and irregular sampling of chromatographic peaks by MS/MS. This is particularly undesirable for quantification in complex mixtures, where integration of fragment ion chromatograms is more selective, sensitive, and reproducible than precursor ion-based quantification.10,11 Recently, DIA methods such as sequential window acquisition of all theoretical spectra (SWATH) have emerged as another important approach in discovery proteomics.
SWATH combines the comprehensiveness of conventional shotgun proteomics with the sensitivity and reproducibility of MRM to detect and accurately quantify a large number of analytes.12-16 While SWATH can collect MS/MS spectra of all available peptide ions using a wide Q1 isolation window (e.g., 25 Da), a major challenge is the generation of the high-quality MS/MS libraries against which targeted fragment ion chromatograms are extracted for quantification.15,16 These libraries are generally constructed by DDA locally on the same samples, and less frequently from community data repositories or synthesized peptides. This is ironic, given that the sensitive SWATH assay has to be built on a DDA method whose sensitivity is substantially lower than that of SWATH itself. The dilemma can be partially alleviated by identifying and quantifying peptides directly from DIA data without a spectral library, using dedicated software such as DIA-Umpire, which generates pseudo-MS/MS spectra for database search by precursor-fragment grouping.17 However, in many cases fragment ions are not clearly associated with a precursor ion in SWATH.18 Furthermore, some peptides can be identified in MS/MS without a detectable precursor ion in MS1.19 The great challenge of assigning all fragments to their corresponding precursors may explain why DDA identified more MS/MS spectra than DIA (the total numbers of identified proteins and peptides were

comparable),20 although DIA (MS/MSALL) fragments significantly more precursors than DDA. Recently, the OpenSWATH,14 DIA-Umpire,17 and Skyline21 software platforms have enabled simultaneous quantification of MS1 and SWATH-MS/MS signals to reduce false peak integrations caused by interferences and/or weak signals.14,17,21 However, they use only the MS1 and MS/MS information of identified peptides. We reasoned that expanding the coverage to initially unidentified peptide ions would provide more quantitative information for precise disease diagnosis.

In this study, we explored untargeted all-ion quantitation to rescue unidentified precursor and fragment ions that are obscured in the conventional DIA workflow. The comparative analysis of unidentified features for inter-group discrimination could be an end in itself, and might eventually lead to clinical utility without the identity of the discriminating features ever being discovered. However, previous work that used untargeted, unidentified surface-enhanced laser desorption and ionization time-of-flight (SELDI-TOF) profiles22 to identify ovarian cancer showed that such a strategy is susceptible to artefacts and systematic biases because of the poor selectivity of a single ion. We therefore systematically optimized untargeted all-ion quantitation of SWATH data and evaluated its utility in tumor subtype diagnosis.

EXPERIMENTAL SECTION

Mass Spectrometry Datasets. We employed three publicly available SWATH datasets acquired on the AB Sciex TripleTOF 5600 or 6600 system to optimize and evaluate all-ion quantification: a human cell line background proteome spiked with heavy-isotope-labeled phosphopeptides at 12 concentrations,14 a tissue proteome of thyroid tumor subtypes,8 and a serum proteome of breast cancer subtypes.4

Protein and Peptide Level Quantitation. The local library was loaded into PeakView (version 2.2.0) with the SWATH Processing Micro App (AB Sciex), excluding shared peptides.
Peptide identification in SWATH was based on the following parameters: 100 peptides per protein, 6 transitions per peptide, 99% peptide confidence, 1% FDR (a maximum of 1% false-positive peak groups), a fragment ion extraction window of 10 min, and a mass tolerance of 50 ppm. The identified proteins and peptides, together with their peak areas and FDR values, were exported. The data were then filtered to keep only peptides with FDR ≤ 1% in at least five of the eight samples in each subgroup and to remove reverse protein hits.
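The sample-level FDR filter described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-dictionary layout, the function name `passes_fdr_filter`, the peptide names, and the toy FDR values are hypothetical and do not reflect the PeakView export format.

```python
def passes_fdr_filter(fdr_by_group, max_fdr=0.01, min_samples=5):
    """Return True if a peptide meets the FDR threshold in enough samples of every subgroup.

    fdr_by_group maps a subgroup name to the list of per-sample FDR values
    (hypothetical layout; eight samples per subgroup in this toy example).
    """
    return all(
        sum(fdr <= max_fdr for fdr in fdrs) >= min_samples
        for fdrs in fdr_by_group.values()
    )

# Toy values for illustration only, not taken from the study.
peptides = {
    "PEPTIDE_A": {"FA": [0.001] * 6 + [0.2] * 2, "FTC": [0.005] * 5 + [0.3] * 3},
    "PEPTIDE_B": {"FA": [0.001] * 8, "FTC": [0.005] * 4 + [0.3] * 4},
}
kept = [name for name, groups in peptides.items() if passes_fdr_filter(groups)]
```

Here `PEPTIDE_B` is dropped because only four of its eight FTC samples meet the 1% threshold, even though all eight FA samples do.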

Untargeted All-ion Quantitation. SWATH is a looped product ion mode. For instance, for a mass range of 400–1200 Da and an isolation width of 26 Da, the instrument generally acquires data as follows: Experiment 1, MS1 scan; Experiment 2, MS/MS of 400–426; Experiment 3, MS/MS of 425–451; ... Experiment 33, MS/MS of 1175–1201. Currently, MarkerView (version 1.3.1, AB Sciex) is the only software that can extract all precursor ions and fragment ions from SWATH MS1 and MS/MS experiments, respectively, and it was therefore employed to evaluate untargeted all-ion quantitation. The overall workflow consisted of three main steps (Figure 1).

Step 1 was peak finding using the 'Enhance' algorithm, which centroids the signals in each spectrum and then merges masses belonging to the same peak cluster. The ion with the highest intensity was taken as the base peak mass of the cluster. The following parameters were used: minimum and maximum RT set to exclude peaks eluting early or late in the chromatographic run (adjusted for each case); background subtraction offset of 10 scans; noise intensity multiplication factor of 1.3 (the background spectrum is multiplied by this value before subtraction to compensate for minor variations in the intensity of background ions); minimum spectral peak width of 10 ppm; minimum peak width of 3 scans; charge states assigned.

Step 2 was peak alignment with a mass tolerance of 30 ppm and an RT tolerance of 0.7 min after RT correction. Peaks that fall within these tolerances, either between samples or within a single sample, are aligned to the same peak. MarkerView supports only linear RT correction, whereas LC usually produces non-linear RT drift.
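To make the window layout and the alignment tolerances concrete, the following minimal sketch reproduces the looped acquisition scheme and a simple tolerance test. It is not MarkerView's algorithm: the function names, the choice of the first m/z as the ppm reference, and the pairwise comparison are assumptions made for illustration.

```python
def swath_windows(start=400, stop=1200, width=26, overlap=1):
    """List (low, high) Q1 isolation windows in Da for a looped SWATH method."""
    windows = []
    low = start
    while low < stop:
        windows.append((low, low + width))
        low += width - overlap  # consecutive windows share `overlap` Da
    return windows

def ppm_difference(mz_a, mz_b):
    """Relative m/z difference in parts per million, referenced to mz_a."""
    return abs(mz_a - mz_b) / mz_a * 1e6

def same_feature(mz_a, rt_a, mz_b, rt_b, ppm_tol=30.0, rt_tol=0.7):
    """True if two peaks fall within both the mass and the RT tolerance."""
    return ppm_difference(mz_a, mz_b) <= ppm_tol and abs(rt_a - rt_b) <= rt_tol

windows = swath_windows()
# MS/MS experiments are numbered from 2 because Experiment 1 is the MS1 scan.
experiments = {index + 2: window for index, window in enumerate(windows)}
```

With the defaults, `experiments` runs from Experiment 2 (400–426) to Experiment 33 (1175–1201). Two peaks at m/z 402.23 and 402.25 differ by roughly 50 ppm, so `same_feature` keeps them apart at a 30 ppm tolerance but would merge them at 75 ppm.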
To compensate for the lack of non-linear RT correction in MarkerView, a non-linear RT drift model was first constructed using locally quadratic (loess) regression implemented in R (version 3.4.3).23 The entire RT range was then split into multiple smaller ranges for segmented linear RT correction, using the turning points of the non-linear curve as the segment boundaries (with a 2-min overlap between adjacent segments). Each segment employed 4-8 endogenous landmark ions that were (a) present in every sample with reasonable intensity, (b) evenly distributed across the entire RT range, and (c) well shaped, without neighboring isobaric peaks within a 3-min RT window. The minimum intensity was 500 for precursor ions and 50 for fragment ions. The allowed peak width was 0.1-4 min. Only peaks detected in at least 75% of all samples were kept for further analysis.

Step 3 was re-measurement and quality control of candidate marker ions. To ensure that each biomarker was discovered from high-quality measurements, the top-ranked significant ions revealed by untargeted analysis (based on p-value) were traced back to their corresponding precursor ion envelope (different isotopes (M, M+1, M+2) and charge states) and/or fragment ion envelope (coeluting, highly correlated fragments). Their peak areas were recalculated using MultiQuant (version 3.0.3, AB Sciex) with the MQ4 algorithm and the following parameters: Gaussian smooth width, 3 points; min peak width, 3 points; noise percentage, 85%; baseline subtraction window, 2 min; peak splitting, 2 points. The extracted ion chromatograms (XICs) were

manually inspected to identify potential interferences and to visually assess XIC quality (intensity, noise level, peak shape) for proper peak picking and integration. The sum of multiple ions (at least three) derived from the same precursor/fragment ion envelope was used for quantitation to increase sensitivity and selectivity. Only precursor/fragment ion envelopes containing at least three ions with a consistent intensity ratio across samples were kept as refined marker ions (Figure 1). When ambiguities were present, the corresponding ion features were checked by examining the isotope distribution, ion ratio, and elution profile of nearby features.

Multivariate Statistical Analysis. Protein, peptide, and ion intensity data were imported into MarkerView (version 1.3.1, AB Sciex) for most likely ratio (MLR) normalization,13 and group differences were examined using an unpaired two-tailed t-test. Five features from the top 20 statistically significant ions were selected as diagnostic marker ions using random forest as implemented in MetaboAnalyst 3.0 (ref 24). The diagnostic performance of the selected biomarkers was quantified by the area under the receiver operating characteristic curve (AUROC). The ROC curves were generated by Monte-Carlo cross validation (MCCV), in which two thirds of the samples were used to evaluate feature importance. The top-ranked features were then used to build classification models, which were validated on the remaining one third of the samples. The procedure was repeated multiple times to calculate the performance and confidence interval (CI) of each model. Unsupervised principal component analysis (PCA) and hierarchical clustering analysis (HCA, using Euclidean distance and complete linkage) were performed using MetaboAnalyst 3.0 (ref 24) on log-transformed and auto-scaled data for the selected markers.
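As a rough illustration of the validation scheme, the sketch below computes AUROC from its rank interpretation (the probability that a positive sample scores higher than a negative one) and draws Monte-Carlo splits with two thirds of the samples used for training. It is not the implementation used in the study; the function names, the fixed seed, and the unstratified shuffling are assumptions made for brevity.

```python
import random

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def mccv_splits(n_samples, n_iter, train_fraction=2 / 3, seed=0):
    """Yield (train_indices, test_indices) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[:n_train]), sorted(indices[n_train:])

# Toy classifier scores for two negative (0) and two positive (1) samples.
example_auc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
splits = list(mccv_splits(n_samples=16, n_iter=3))
```

In a full workflow, a model would be trained on each `train_indices` set, scored on the held-out third, and the resulting AUROC values summarized as a mean with a confidence interval.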
Cross validation was carried out to validate the PCA model, with the goodness of fit represented by R2 and the predictive ability of the model represented by Q2.

RESULTS AND DISCUSSION

Optimization of Untargeted All-ion Quantification. Two SWATH datasets that differed in sample type (serum, ref 4; tissue, ref 8) and data acquisition conditions (34 fixed windows on a TripleTOF 5600, ref 4; 100 variable windows on a TripleTOF 6600, ref 8) were evaluated to investigate whether the MarkerView parameters need to be optimized for each dataset or whether the selected parameters are generally applicable. Unlike the conventional SWATH workflow, which uses multiple coeluting fragments to confirm peptide identity and control the FDR, untargeted all-ion quantification is based solely on the m/z and RT of individual ions. Mass tolerances of 10, 20, 30, 40, 50, and 75 ppm were first evaluated for untargeted analysis. As shown in Figure 2A1, a narrow mass tolerance of 10-20 ppm tended to split features: the M+1 isotope peak at m/z 916.44, with an intensity of 24.43 (Figure 2A2), was split into two peaks with intensities of 15.12 and 9.31. In contrast, a large mass tolerance of 75 ppm tended to merge features. As shown in Figure 2B1, the M+2 isotope peak at m/z 402.23 was

successfully separated from a singly charged interfering peak at m/z 402.25 using a mass tolerance of 10-50 ppm. However, a mass tolerance of 75 ppm merged these two peaks into one feature with a markedly increased peak width and an unreasonably high intensity (Figure 2B2), given that the M+2 intensity of a small peptide precursor ion should be significantly lower than the M and M+1 intensities. Taken together, a mass tolerance of 30-50 ppm is preferable for balancing peak splitting against peak merging.

To further evaluate the overall m/z shift for peak alignment, we randomly selected 100 precursor ions and 100 fragment ions across the m/z range of 350-1500 and compared the detected m/z of each ion across samples. For each ion, ∆m/z max was calculated by subtracting the lowest m/z value from the highest and was expressed in relative (ppm) and absolute (Da) units. As shown in Figure 3, ppm is the better unit for setting the mass tolerance because the m/z shift expressed in ppm is relatively consistent across the mass range. A tolerance of 30 ppm covers all observed ∆m/z values in both datasets and could be used as a general parameter for untargeted SWATH analysis.

Although RT variation depends on the chromatographic conditions, an RT tolerance of 0.7 min generally covered the RT variation remaining after linear correction in both datasets. However, neighboring isoformic or isobaric peaks within the allowed tolerance window may become cross-aligned, whereby signals from different analytes are treated as a single feature. Indeed, accurate pairing of corresponding chromatographic features requires highly accurate modeling of complex non-linear RT drift. The software team at AB Sciex is currently developing new non-linear alignment algorithms, so a new version of MarkerView could further improve untargeted all-ion quantification.

Comparison of Untargeted and Targeted Ion Extraction.
Next, we evaluated the capability of untargeted all-ion quantification to extract and integrate ions by reanalyzing standard response curves for heavy-isotope-labeled phosphopeptides spiked into a human cell line background proteome.14 We performed untargeted analysis using the vendor software MarkerView 1.3.1. Targeted peak integration was carried out using open-source Skyline21 and the vendor software PeakView and MultiQuant. After matching the m/z and RT of the precursor ion and the top three fragment ions, the peak areas of 50 randomly selected synthetic spiked-in phosphopeptides integrated by untargeted and targeted analysis were highly correlated (R2 > 0.9), with several examples shown in Figure 4, suggesting that peak picking and peak integration by untargeted and targeted analysis are generally comparable. Untargeted all-precursor and all-fragment analyses were also highly complementary. On one hand, as shown in Figure 4A, the MS1 precursor ion

was more prone to interferences, whereas the more selective MS/MS fragment ions typically had fewer interferences and thus achieved a better linear response and a better correlation between targeted and untargeted analysis at low concentrations (Figure 4B). On the other hand, in some cases the SWATH MS/MS fragments exhibited significantly lower intensity and more interferences than MS1 (Figure 4C): the MS1 quantification covered 11 of the 12 concentration points, whereas the most abundant fragment could not be used for quantification (Figure 4D). Taken together, simultaneous untargeted analysis of both precursor and fragment ions maximizes the information recovered from SWATH data.

Marker Ion Quality Control, Identification and Multiple Testing. Untargeted quantification at the single-ion level is intrinsically susceptible to interference and error, especially for low-abundance ions. To ensure that high-quality marker ions were selected for further analysis, we kept only ion envelopes containing at least three correlated ions with a consistent intensity ratio. An FDA guideline likewise recommends using three or more specific ions to confirm the identity of a known component in a sample.25 Figure 5 illustrates a fragment ion envelope that was employed as a refined marker feature. Its five constituent fragment ions exhibited very low intensity, but their combined quantification was highly selective and readily distinguished from interferences. In contrast, Figure S1 shows a scenario in which a low-confidence marker ion was excluded. This ion (m/z = 1488.4168, RT = 36.54 min) was detected in the SWATH MS1 scan, but the MS1 spectrum provided no isotopic envelope (M, M+1, M+2) for precursor ion characterization. Because of its low intensity, no corresponding fragment ions were detected either. Thus, it was difficult to ensure that the same ion was being compared across samples.
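The envelope criterion (at least three ions with a consistent intensity ratio, quantified by their sum) could be automated along the following lines. This is a sketch under assumptions, not the manual procedure used here: consistency is approximated by the coefficient of variation of each ion's ratio to the most intense ion, and the 0.3 cutoff, the function names, and the toy intensities are all hypothetical.

```python
from statistics import mean, pstdev

def consistent_ions(envelope, max_cv=0.3):
    """Return the ions whose ratio to the most intense ion is stable across samples.

    envelope maps an ion label to its peak areas, one value per sample.
    """
    reference = max(envelope.values(), key=sum)  # most intense ion overall
    kept = []
    for name, intensities in envelope.items():
        ratios = [x / ref for x, ref in zip(intensities, reference) if ref > 0]
        if len(ratios) < 2:
            continue
        average = mean(ratios)
        if average > 0 and pstdev(ratios) / average <= max_cv:
            kept.append(name)
    return kept

def refine_envelope(envelope, min_ions=3, max_cv=0.3):
    """Sum the consistent ions per sample, or return None if fewer than min_ions remain."""
    kept = consistent_ions(envelope, max_cv)
    if len(kept) < min_ions:
        return None
    return [sum(values) for values in zip(*(envelope[name] for name in kept))]

# Toy peak areas for three samples; the last ion mimics an interfered signal.
envelope = {
    "M": [100, 200, 300],
    "M+1": [50, 100, 150],
    "M+2": [20, 40, 60],
    "interfered": [5, 90, 10],
}
summed = refine_envelope(envelope)
```

In this toy case the interfered ion is excluded and the remaining three ions are summed per sample; an envelope with only two stable ions would be rejected outright.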
While a complete study of definitive marker ion identification is beyond the scope of this paper, we employed various strategies to improve the identification of the discriminating ions, including the use of an extended SWATH library containing 10,000 human proteins,26 retrieval of pseudo-MS/MS spectra directly from SWATH data by DIA-Umpire,17 de novo sequencing,27 tolerant database search,28 and manual inspection of MS/MS spectra. The success of identification is ultimately limited by the quality of the fragmentation spectra and by the database search algorithms. Some marker ions were easily assigned to their corresponding peptides; for instance, NKDQGTYEDYVEGLR was identified as a top-ranked marker in both the peptide-level and all-ion-level comparisons. However, identification of low-abundance marker ions was not successful, primarily because of the poor quality of their MS/MS spectra (Figure 5D). Linking low-abundance precursor ions to fragment ions was particularly difficult. Efforts should be made to obtain more informative fragment spectra by sample fractionation, reduced SWATH window width, better separation using a longer column and/or a longer gradient, and different

fragmentation methods. However, even good-quality spectra may remain unidentified. This is to be expected, as no current database search engine can correctly annotate all collected MS/MS spectra.

We evaluated Benjamini-Hochberg FDR-adjusted p-values for multiple testing. However, FDR-adjusted p-values were too stringent at the conventional 0.05 cutoff and yielded no significant differentiating features at the protein, peptide, or ion level, so unadjusted p-values were combined with a fold-change threshold instead (p < 0.05, fold change ≥ 2). This phenomenon has also been observed by Wu et al.15 In addition, multiple testing correction could introduce bias when comparing protein-, peptide-, and ion-level biomarker discovery, because many peptides and ions are derived from the same proteins and are highly correlated. The Benjamini-Hochberg procedure was derived for independent (or positively dependent) tests, and the Bonferroni correction becomes increasingly conservative as the tests become more correlated.

Large Peptides and Low Abundance Peptides Benefit More from Untargeted All-Ion Quantification. Figure 6A shows the mass distribution of deconvoluted precursor ions (carrying two to four charges) in all-ion quantification, together with that of the peptides identified in the local ion library and in SWATH MS/MS (1% FDR), for a thyroid carcinoma SWATH dataset.8 A total of 31,788 deconvoluted precursor ions were detected in SWATH MS1, whereas 16,027 and 10,633 peptides were identified in the deep DDA-generated ion library and in SWATH MS/MS, respectively. The peptide identification rate in the DDA-generated ion library decreased markedly as peptide mass increased, and consequently the ion library mainly cataloged peptides below 2400 Da. Peptides quantified by SWATH-MS/MS exhibited a similarly unbalanced mass distribution. Figure 6B shows the precursor intensity distribution, which indicates that large peptides at all abundance levels were generally unidentified by SWATH-MS/MS.
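For reference, the deconvoluted (neutral) mass of a precursor follows directly from its m/z and charge. The short sketch below assumes protonated ions and a proton mass of 1.007276 Da; the function name and the example m/z values are hypothetical and are not taken from the dataset.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton

def neutral_mass(mz, charge):
    """Neutral mass (Da) of a protonated ion [M + zH]z+ observed at `mz`."""
    return charge * (mz - PROTON_MASS)

# A triply charged precursor at m/z 1134.35 corresponds to a neutral mass just above 3400 Da.
large_peptide = neutral_mass(1134.35, 3)
# A doubly charged precursor at m/z 500.0 corresponds to a neutral mass just under 1000 Da.
small_peptide = neutral_mass(500.0, 2)
```

Because the acquisition range caps the observable m/z, peptides heavier than about 2400 Da appear mostly as ions carrying three or more charges.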
Although large peptides are less preferred for quantification because of their poorer ionization and fragmentation, they are virtually the only choice in some constrained situations. For instance, acidic and highly hydrophobic membrane proteins may contain a very limited number of K and R residues in their sequences and thus mainly produce large tryptic peptides. Peptides with missed cleavages could be considered for quantification if they are reproducibly generated and not enough fully tryptic peptides are detected.26 Missed cleavages also increase peptide length. Currently, the two dominant mass spectrometry systems for DIA, the TripleTOF 5600 and the Q-Exactive, both employ beam-type collision-induced dissociation (CID). However, backbone cleavage of large peptides by beam-type CID generally requires higher collision energy, which, because of the extensive cascaded collisions in beam-type CID, produces a large number of low-abundance product ions that are not suitable for MS/MS quantification.27,28 Such an example is depicted in Figure S2.

Two triply charged precursor ions (peptide molecular weight above 3400 Da) exhibited distinct signals in MS1 but were dissociated into multiple minor fragments in SWATH MS/MS. As shown in Figure 6B, low-abundance peptides also benefit significantly from untargeted all-ion quantification. They can be quantified by precursor ions or, more selectively, by fragment ions, as shown in Figure 5.

Application of Untargeted All-ion Quantitation in Tumor Subtype Diagnosis. We used tumor subtype classification to evaluate the diagnostic power of markers discovered by all-ion quantification. We chose minimally invasive follicular thyroid carcinoma (FTC) and benign follicular adenoma (FA) to challenge the approach, because FTC and FA are clinically indistinguishable6,7 and could not be effectively differentiated by conventional SWATH protein quantification.8 Single biomarkers are unlikely to be specific for complex diseases such as FTC and are likely to fail in larger populations. Classification using multiple markers is therefore preferable, as it can more readily accommodate biological variation and provide more robust diagnostic accuracy. Indeed, increasing the number of features gradually increased the AUC of the multivariate ROC curve, which exceeded 0.96 when all top-20 ranked features were employed (Figure 7A). However, such a pronounced difference is unlikely to hold in large-scale sample sets because of high inter-individual variability. The small sample size (n = 8 per group) also results in a wide 95% confidence interval (Figure 7A), making the model less reliable. Figure 7B shows the unsupervised PCA plots of all four groups using the first two principal components (PCs), which accounted for more than half of the total variance (52.3%-59.7%), considerably more than the remaining PCs (Figure S3).
In all cases, the papillary thyroid carcinoma (PTC) group was well separated from the other groups, the FA and FTC groups were indistinguishable from each other, and the normal group partially overlapped with the FTC group. The k-means clustering (Figure S4) and hierarchical clustering (Figure 7C3) also suggested that the samples should be clustered into these three groups. On the basis of the above multivariate statistical comparisons, the markers revealed by untargeted all-ion discovery did not show significantly higher differentiating power than protein and peptide markers. More large-scale datasets should be evaluated to investigate whether additional diagnostic marker ions can be discovered when protein markers give unsatisfactory results.

Untargeted all-ion quantification is not a replacement for protein/peptide quantification. As described, it is more error prone than protein/peptide quantification, which can use the target-decoy strategy to control the FDR. However, it can be used as an alternative comparative analysis to uncover a wealth of additional information. In addition to finding diagnostic biomarkers missed by targeted analysis, untargeted all-ion quantification could also be applied in cases where high-confidence analyte

identification (including protein, peptide, lipid, and metabolite) cannot be achieved. For instance, human fecal samples contain a large number of highly variable, unknown gut microbial proteins, and building high-quality sample-specific sequence databases is costly and demanding. A preliminary study indicated that protein-level SWATH results vary considerably depending on the microbial protein database used (e.g., metagenome and UniProt). A large number of MS/MS spectra could not be confidently identified because of the incompleteness of the database and the higher FDR threshold in metaproteomics. Thus, we will evaluate untargeted all-ion quantification in our ongoing metaproteomic and metabolomic study.

CONCLUSIONS

An untargeted all-ion quantification strategy was explored to retrieve MS and MS/MS information lost in targeted analysis of untargeted SWATH data. Compared with protein-centric and peptide-centric analysis, this ion-centric strategy maximized the extraction of quantitative information inherent in SWATH, thus providing a more comprehensive biomolecular profile for differential analysis of highly heterogeneous diseases such as cancer. In its current implementation, however, untargeted all-ion quantification is challenged by mass shift and RT shift, which may result in peak misalignment. Another constraint is the poor selectivity of a single ion, which requires manual selection of high-quality biomarker ions from the candidates. This shortcoming could be mitigated by new algorithms that allow automated untargeted quantification of ion envelopes (multiple correlated ions) rather than single ions. New methods providing better mass and RT recalibration, and/or new instrumentation capable of even greater mass accuracy and RT stability, will further empower untargeted SWATH quantification. However, confident identification of biomarker ions requires additional analytical and bioinformatic efforts.
While the protein, peptide, and ion markers showed comparable diagnostic power in the tested dataset, more large-scale datasets should be evaluated to determine whether additional diagnostic marker ions can be discovered when protein markers give unsatisfactory results.

ACKNOWLEDGMENTS

This work was financially supported by the National Natural Science Foundation (Ref. no: 81473281) and University of Macau (MYRG2015-00220-ICMS-QRCM).

REFERENCES

(1) Hanash, S.; Taguchi, A. Cancer J. 2011, 17, 423–428.
(2) Shaheed, S. U.; Rustogi, N.; Scally, A.; Wilson, J.; Thygesen, H.; Loizidou, M. A.; Hadjisavvas, A.; Hanby, A.; Speirs, V.; Loadman, P.; Linforth, R.; Kyriacou, K.; Sutton, C. W. J. Proteome Res. 2013, 12, 5696–5708.


(3) Ortea, I.; Rodríguez-Ariza, A.; Chicano-Gálvez, E.; Arenas Vacas, M. S.; Jurado Gámez, B. J. Proteomics 2016, 138, 106–114.
(4) Gajbhiye, A.; Dabhi, R.; Taunk, K.; Jagadeeshaprasad, M. G.; RoyChoudhury, S.; Mane, A.; Bayatigeri, S.; Chaudhury, K.; Santra, M. K.; Rapole, S. J. Proteomics 2017, 163, 1–13.
(5) Duarte, T. T.; Spencer, C. T. Proteomes 2016, 4 (4), pii: 29.
(6) Cibas, E. S.; Ali, S. Z. Am. J. Clin. Pathol. 2009, 132, 658–665.
(7) Haugen, B. R.; Woodmansee, W. W.; McDermott, M. T. Clin. Endocrinol. (Oxf.) 2002, 56, 281–290.
(8) Martínez-Aguilar, J.; Clifton-Bligh, R.; Molloy, M. P. Sci. Rep. 2016, 6, 23660.
(9) Yan, Z. X.; Yan, R. Anal. Chem. 2015, 87, 2861–2868.
(10) McLafferty, F. W. Science 1981, 214, 280–287.
(11) Picotti, P.; Bodenmiller, B.; Mueller, L. N.; Domon, B.; Aebersold, R. Cell 2009, 138, 795–806.
(12) Gillet, L. C.; Navarro, P.; Tate, S.; Röst, H.; Selevsek, N.; Reiter, L.; Bonner, R.; Aebersold, R. Mol. Cell. Proteomics 2012, 11, O111.016717.
(13) Lambert, J. P.; Ivosev, G.; Couzens, A. L.; Larsen, B.; Taipale, M.; Lin, Z. Y.; Zhong, Q.; Lindquist, S.; Vidal, M.; Aebersold, R.; Pawson, T.; Bonner, R.; Tate, S.; Gingras, A. C. Nat. Methods 2013, 10, 1239–1245.
(14) Rosenberger, G.; Liu, Y.; Röst, H. L.; Ludwig, C.; Buil, A.; Bensimon, A.; Soste, M.; Spector, T. D.; Dermitzakis, E. T.; Collins, B. C.; Malmström, L.; Aebersold, R. Nat. Biotechnol. 2017, 35, 781–788.
(15) Wu, J. X.; Song, X.; Pascovici, D.; Zaw, T.; Care, N.; Krisp, C.; Molloy, M. P. Mol. Cell. Proteomics 2016, 15, 2501–2514.
(16) Zi, J.; Zhang, S.; Zhou, R.; Zhou, B.; Xu, S.; Hou, G.; Tan, F.; Wen, B.; Wang, Q.; Lin, L.; Liu, S. Anal. Chem. 2014, 86, 7242–7246.
(17) Tsou, C. C.; Avtonomov, D.; Larsen, B.; Tucholska, M.; Choi, H.; Gingras, A. C.; Nesvizhskii, A. Nat. Methods 2015, 12, 258–264.
(18) Chen, G.; Walmsley, S.; Cheung, G. C. M.; Chen, L.; Cheng, C. Y.; Beuerman, R. W.; Wong, T. Y.; Zhou, L.; Choi, H. Anal. Chem. 2017, 89, 4897–4906.
(19) Panchaud, A.; Scherl, A.; Shaffer, S. A.; von Haller, P. D.; Kulasekara, H. D.; Miller, S. I.; Goodlett, D. R. Anal. Chem. 2009, 81, 6481–6488.
(20) Stewart, P. A.; Fang, B.; Slebos, R. J.; Zhang, G.; Borne, A. L.; Fellows, K.; Teer, J. K.; Chen, Y. A.; Welsh, E.; Eschrich, S. A.; Haura, E. B.; Koomen, J. M. Proteomics 2017, 17 (6), doi: 10.1002/pmic.201600300.
(21) Rardin, M. J.; Schilling, B.; Cheng, L.-Y.; MacLean, B. X.; Sorenson, D. J.; Sahu, A. K.; MacCoss, M. J.; Vitek, O.; Gibson, B. W. Mol. Cell. Proteomics 2015, 14, 2405–2419.
(22) Petricoin, E. F.; Ardekani, A. M.; Hitt, B. A.; Levine, P. J.; Fusaro, V. A.; Steinberg, S. M.; Mills, G. B.; Simone, C.; Fishman, D. A.; Kohn, E. C.; Liotta, L. A. Lancet 2002, 359, 572–577.
(23) Podwojski, K.; Fritsch, A.; Chamrad, D. C.; Paul, W.; Sitek, B.; Stühler, K.; Mutzel, P.; Stephan, C.; Meyer, H. E.; Urfer, W.; Ickstadt, K.; Rahnenführer, J. Bioinformatics 2009, 25, 758–764.
(24) Xia, J.; Wishart, D. S. Nat. Protoc. 2011, 6, 743–760.
(25) Food and Drug Administration. Guidance for Industry: Bioanalytical Method Validation; US Department of Health and Human Services, FDA, Center for Drug Evaluation and Research: Rockville, 2001.
(26) Rosenberger, G.; Koh, C. C.; Guo, T.; Röst, H. L.; Kouvonen, P.; Collins, B. C.; Heusel, M.; Liu, Y.; Caron, E.; Vichalkovski, A.; Faini, M.; Schubert, O. T.; Faridi, P.; Ebhardt, H. A.; Matondo, M.; Lam, H.; Bader, S. L.; Campbell, D. S.; Deutsch, E. W.; Moritz, R. L.; Tate, S.; Aebersold, R. Sci. Data 2014, 1, 1–15.
(27) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337–2342.
(28) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. Nat. Biotechnol. 2015, 33, 743–749.
(29) Chiva, C.; Ortega, M.; Sabidó, E. J. Proteome Res. 2014, 13, 3979–3986.
(30) Shipkova, P.; Drexler, D. M.; Langish, R.; Smalley, J.; Salyan, M. E.; Sanders, M. Rapid Commun. Mass Spectrom. 2008, 22, 1359–1366.
(31) Vatansever, B.; Lahrichi, S. L.; Thiocone, A.; Salluce, N.; Mathieu, M.; Grouzmann, E.; Rochat, B. J. Sep. Sci. 2010, 33, 2478–2488.


Figure 1. General workflow of untargeted SWATH all-ion quantitation for biomarker discovery. MarkerView converts the data to centroid form if profile raw files are used. A nonlinear RT drift model was constructed in R to guide the segmented linear RT correction and peak alignment in MarkerView. To improve selectivity and quantitation accuracy, the intensities of the top-ranked significant ions were recalculated in MultiQuant by summing multiple ions derived from the corresponding precursor/fragment ion envelope. Only precursor/fragment ion envelopes showing consistent intensity ratios across different samples were kept as refined marker ions (black dots in the volcano plots).
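As a rough illustration of the segmented linear RT correction named in the Figure 1 caption, the sketch below maps an observed retention time onto a reference scale by linear interpolation between anchor points. The function name `correct_rt`, the anchor values, and the choice to extrapolate with the nearest segment are illustrative assumptions; in the workflow itself the correction was carried out in MarkerView, guided by a drift model built in R.

```python
from bisect import bisect_right

def correct_rt(rt, anchors):
    """Map an observed RT onto a reference scale by piecewise-linear interpolation.

    anchors: list of (observed_rt, reference_rt) pairs sorted by observed_rt.
    RTs outside the anchor range are extrapolated with the nearest segment
    (an assumption made for this sketch).
    """
    if len(anchors) < 2:
        raise ValueError("need at least two anchor points")
    observed = [a[0] for a in anchors]
    # Pick the segment whose left edge is the last anchor <= rt, clamped to valid segments.
    i = bisect_right(observed, rt) - 1
    i = max(0, min(i, len(anchors) - 2))
    (x0, y0), (x1, y1) = anchors[i], anchors[i + 1]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (rt - x0)

# Hypothetical anchors in minutes: observed RT in one run vs. reference RT.
anchors = [(10.0, 10.0), (30.0, 30.6), (50.0, 51.0), (70.0, 70.4)]
corrected = [correct_rt(rt, anchors) for rt in (20.0, 49.14, 60.0)]
```

With these made-up anchors, an observed RT of 20.0 min maps to about 20.3 min. Each segment has its own slope, which is how a set of linear pieces can follow a nonlinear drift.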

Figure 2. Impact of mass tolerance on untargeted peak picking, illustrated by the isotopic envelopes (M, M+1, M+2) of two doubly charged precursor ions. The detected m/z values and intensities are provided. The blue arrows on the x-axis indicate the range of each detected peak. The isotope peak M+1 at m/z 916.4537 (A) is split into two peaks when a mass tolerance of 10-20 ppm is used. The isotope peak M+2 at m/z 402.23 (B) is merged with a singly charged interfering peak at m/z 402.2534 when a mass tolerance of 75 ppm is used. The relatively high intensity of M+2 (B2) indicates the presence of interference, because the M+2 intensity of a small peptide precursor ion should be significantly lower than the M and M+1 intensities.
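The arithmetic behind the Figure 2 caption can be sketched as follows. The two m/z values in panel B are taken from the caption; the intensities, the function names, and the greedy merging rule are hypothetical and only mimic the effect of a mass tolerance, not the vendor peak-picking algorithm. Because m/z 402.23 is reported to two decimals, the computed separation (roughly 58 ppm) is approximate.

```python
def ppm_difference(mz_a, mz_b):
    """Absolute difference between two m/z values in ppm, relative to mz_a."""
    return abs(mz_b - mz_a) / mz_a * 1e6

def tolerance_in_da(mz, ppm):
    """Width in Da that a ppm tolerance corresponds to at a given m/z."""
    return mz * ppm * 1e-6

def merge_peaks(peaks, ppm):
    """Greedily merge centroids lying within `ppm` of the previous merged peak.

    peaks: list of (mz, intensity) with positive intensities.
    A merged peak gets the intensity-weighted mean m/z and the summed intensity.
    """
    merged = []
    for mz, intensity in sorted(peaks):
        if merged and ppm_difference(merged[-1][0], mz) <= ppm:
            prev_mz, prev_int = merged[-1]
            total = prev_int + intensity
            merged[-1] = ((prev_mz * prev_int + mz * intensity) / total, total)
        else:
            merged.append((mz, intensity))
    return merged

# m/z values from the caption (panel B); intensities are made up for illustration.
panel_b = [(402.23, 100.0), (402.2534, 80.0)]
separation_ppm = ppm_difference(402.23, 402.2534)   # roughly 58 ppm
wide = merge_peaks(panel_b, ppm=75)     # the two peaks collapse into one
narrow = merge_peaks(panel_b, ppm=20)   # the two peaks stay separate
window_da = tolerance_in_da(916.4537, 10)  # about 0.009 Da at the M+1 peak in panel A
```

The same numbers show why the tolerance is a trade-off: a 75 ppm window swallows an interfering peak about 58 ppm away, while a 10 ppm window at m/z 916.4537 is only about 0.009 Da wide and can split a single isotope peak whose detected m/z drifts by more than that.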

Figure 3. Maximal m/z shift (∆m/z max) of 100 precursor ions and 100 fragment ions for two SWATH datasets. The tissue proteome (A1, B1) was acquired with 100 variable windows on a TripleTOF 6600, and the serum proteome (A2, B2) was acquired with 34 fixed windows on a TripleTOF 5600. Each sample was subjected to peak picking separately, and the detected m/z values of the same ion across different samples were compared. For each ion, ∆m/z max was calculated by subtracting the lowest m/z value from the highest one and is expressed in absolute units of Da (A1, A2) and in relative units of ppm (B1, B2).
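The ∆m/z max calculation described in the Figure 3 caption can be written out as a short sketch. The caption does not state which m/z is used as the denominator for the ppm value, so the mean detected m/z is assumed here; the function name and the example values are hypothetical.

```python
def delta_mz_max(mz_values):
    """Return (delta in Da, delta in ppm) for one ion detected across samples.

    The Da value is the highest minus the lowest detected m/z.
    The ppm value divides by the mean detected m/z (an assumption of this sketch).
    """
    if len(mz_values) < 2:
        raise ValueError("need detections from at least two samples")
    delta_da = max(mz_values) - min(mz_values)
    mean_mz = sum(mz_values) / len(mz_values)
    return delta_da, delta_da / mean_mz * 1e6

# Hypothetical m/z values of the same ion picked separately in three samples.
detections = [500.2500, 500.2540, 500.2520]
shift_da, shift_ppm = delta_mz_max(detections)  # about 0.004 Da, about 8 ppm
```

Reporting both units is useful because the same absolute shift in Da corresponds to a larger ppm value at low m/z than at high m/z.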


Figure 4. Comparison of peak intensities obtained by untargeted ion extraction using the vendor software MarkerView and by targeted ion extraction using the open-source software Skyline. Synthetic isotope-labeled phosphopeptides were measured in a 12-step dilution series (ranging from 1:1 to 1:127) with a human cell line background. (A and C) Extracted ion chromatograms of the monoisotopic precursor ion (top panel) and the corresponding top three fragment ions (bottom panel). (B and D) Comparison between untargeted integrated areas and areas obtained with targeted extraction for the precursor ion and the most abundant fragment ion across different concentrations. Inset plots show the details of the low concentration range.
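One simple way to put a number on the agreement shown in panels B and D is an ordinary least-squares line through the paired areas together with its R2. The sketch below is only an illustration under that assumption: the function name is hypothetical, the area values are synthetic, and the comparison in the figure may have been computed differently.

```python
def linear_fit_r2(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, returning (slope, intercept, R2)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Synthetic peak areas (arbitrary units) for five dilution steps, not measured data.
targeted = [1.0, 2.0, 4.0, 8.0, 16.0]
untargeted = [1.2, 2.1, 4.3, 7.9, 16.4]
slope, intercept, r2 = linear_fit_r2(targeted, untargeted)
```

A slope near 1 with a small intercept would indicate that the two extraction routes report similar areas; R2 alone only says the relationship is linear, not that the areas are equal.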


Figure 5. Example of an unidentified low-abundance marker ion envelope (RT = 49.14 min) differentiating the FA and FTC groups. (A) MarkerView screenshot highlighting five correlated marker ions in SWATH-MS/MS experiment 16 of the thyroid carcinoma dataset. (B) Box-and-whisker plot showing the distribution of normalized intensity in the two groups. (C) Extracted ion chromatograms (XICs) showing the relative ion intensities across different samples. (D) SWATH-MS/MS spectrum highlighting the significantly lower intensity of the marker ions compared with the other co-eluting fragment ions. A view of the corresponding low-intensity fragment ion region is shown in the inset.
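The idea of correlated marker ions forming one envelope can be illustrated with a small sketch that checks which ions follow the same intensity pattern across samples as a chosen reference ion. The Pearson correlation, the 0.9 cutoff, the ion labels, and the intensities are all assumptions made for illustration; they are not the criteria or data used for Figure 5.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def correlated_ions(intensities, reference, min_r=0.9):
    """Return ions whose cross-sample profile correlates with the reference ion.

    intensities: dict mapping an ion label to its intensities across samples.
    min_r: correlation cutoff (hypothetical value chosen for this sketch).
    """
    ref_profile = intensities[reference]
    return [ion for ion, profile in intensities.items()
            if ion != reference and pearson(ref_profile, profile) >= min_r]

# Made-up intensities of three co-eluting ions in four samples.
intensities = {
    "ion_a": [10.0, 20.0, 30.0, 40.0],
    "ion_b": [21.0, 39.0, 62.0, 80.0],   # tracks ion_a closely
    "ion_c": [40.0, 10.0, 35.0, 12.0],   # unrelated pattern
}
envelope = correlated_ions(intensities, reference="ion_a")
```

Ions that pass such a check could then be summed into a single envelope intensity, whereas an ion with an unrelated pattern is more likely to come from a different co-eluting species.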


Figure 6. Mass distribution (A) and abundance distribution (B) of deconvoluted precursor ions in MS1 and of peptides identified in the ion library and in SWATH-MS/MS (1% FDR) in the thyroid carcinoma dataset.


Figure 7. Distinction between the follicular thyroid carcinoma (FTC), benign follicular adenoma (FA), papillary thyroid carcinoma (PTC), and normal groups. (A1-A3) Multivariate ROC curve analysis of FA and FTC using random forest-based sample classification and feature ranking. The area under the curve (AUC) and 95% confidence interval (CI) for the top 2, 3, 5, 7, 10, and 20 ranked markers are indicated in each plot. Unsupervised principal component analysis score plots (B1, R2=0.55, Q2=0.44; B2, R2=0.57, Q2=0.48; B3, R2=0.58, Q2=0.51) are provided with 95% confidence ellipsoids. Unsupervised hierarchical clustering analysis (C1-C3) is based on Spearman distance and complete linkage.
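The clustering settings named in the Figure 7 caption (Spearman distance, complete linkage) can be sketched in plain Python. Here the Spearman distance is assumed to be 1 minus the Spearman rank correlation, and the sample names and profiles are invented; this is a minimal illustration of the technique, not the software used to produce panels C1-C3.

```python
from itertools import combinations
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman_distance(x, y):
    """1 - Spearman correlation (assumed definition); profiles must not be constant."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 1 - cov / sqrt(vx * vy)

def complete_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    dist = {frozenset(pair): spearman_distance(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(names, 2)}

    def linkage(pair):
        # Complete linkage: cluster distance is the largest pairwise sample distance.
        a, b = pair
        return max(dist[frozenset((s, t))] for s in a for t in b)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Invented marker-intensity profiles for four samples.
profiles = {
    "FA_1": [1, 2, 3, 4, 5], "FA_2": [1, 3, 2, 4, 5],
    "FTC_1": [5, 4, 3, 2, 1], "FTC_2": [5, 3, 4, 2, 1],
}
merges = complete_linkage(profiles)
```

In this toy example the two FA samples and the two FTC samples pair up first, and the two pairs join last at the largest distance. Rank-based distances of this kind depend only on the ordering of marker intensities within each sample, which makes them less sensitive to a few extreme values.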

