AntDAS: Automatic Data Analysis Strategy for ... - ACS Publications


Article

AntDAS: Automatic Data Analysis Strategy for UPLC–QTOF-based Nontargeted Metabolic Profiling Analysis
Hai-Yan Fu, Xiao-Ming Guo, Yue-Ming Zhang, Jing-Jing Song, Qing-Xia Zheng, Ping-Ping Liu, Peng Lu, Qian-Si Chen, Yong-Jie Yu, and Yuanbin She
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.7b03160 • Publication Date (Web): 18 Sep 2017
Downloaded from http://pubs.acs.org on September 18, 2017


Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

AntDAS: Automatic Data Analysis Strategy for UPLC–QTOF-based Nontargeted Metabolic Profiling Analysis

Hai-Yan Fu (a,*), Xiao-Ming Guo (a), Yue-Ming Zhang (b), Jing-Jing Song (c), Qing-Xia Zheng (d), Ping-Ping Liu (d), Peng Lu (d), Qian-Si Chen (d), Yong-Jie Yu (b,f,*), Yuanbin She (e)

(a) School of Pharmaceutical Sciences, South Central University for Nationalities, Wuhan 430074, China
(b) College of Pharmacy, Ningxia Medical University, Yinchuan 750004, China
(c) Ningxia Institute of Cultural Relics and Archeology, Yinchuan 750001, China
(d) China Tobacco Gene Research Center, Zhengzhou Tobacco Research Institute of CNTC, Zhengzhou 450001, China
(e) Zhejiang University of Technology, Hangzhou 310014, China
(f) Ningxia Engineering and Technology Research Center for Modernization of Hui Medicine, Ningxia Medical University, Yinchuan 750004, China

ABSTRACT: High-quality data analysis methodology remains a bottleneck for metabolic profiling analysis based on ultraperformance liquid chromatography–quadrupole time-of-flight mass spectrometry (UPLC–QTOF). The present work aims to address this problem by proposing a novel data analysis strategy wherein (1) chromatographic peaks in the UPLC–QTOF dataset are automatically extracted by using an advanced multiscale Gaussian smoothing-based peak extraction strategy; (2) a peak annotation stage is used to cluster fragment ions that belong to the same compound; (3) with the aid of the high-resolution mass spectrometer, time-shift correction across the samples is efficiently performed by a new peak alignment method; (4) components are registered by using a newly developed adaptive network searching algorithm; (5) statistical methods, such as analysis of variance and hierarchical cluster analysis, are then used to identify the underlying marker compounds; and finally, (6) compound identification is performed by matching the extracted peak information, involving high-precision m/z and retention time, against our compound library containing more than 500 plant metabolites. A manually designed mixture of 18 compounds is used to evaluate the performance of the method, and all compounds are detected at various concentration levels. The developed method is comprehensively evaluated on an extremely complex plant dataset containing more than 2000 components. Results indicate that the performance of the developed method is comparable with that of XCMS. The MATLAB GUI code is available from http://software.tobaccodb.org/software/antdas.

Nontargeted metabolic profiling analysis based on ultraperformance liquid chromatography hyphenated with high-resolution time-of-flight mass spectrometry (e.g., UPLC–QTOF) is extensively used in many scientific fields.1,2 Modern hyphenated instruments, which combine the powerful separation capability of UPLC with high-resolution mass spectrometry, provide massive chemical information for thousands of compounds,3–6 which greatly benefits metabolomics and proteomics studies that aim to discover biomarkers and their pathways.2,5 Compared with the rapid development of modern analytical instruments, however, the data analysis procedure has become a bottleneck in nontargeted metabolic profiling analysis.7 A typical UPLC–QTOF-based nontargeted metabolic profiling analysis consists of several stages, namely, peak extraction, annotation, alignment, and statistical analysis, to discover underlying biomarkers.4,8–12 A number of methods have been developed to achieve this goal.1,8–30 The best known are XCMS,19 MET-COFEA,24 and MZmine.31 The success of current methods depends heavily on two critical stages, i.e., peak extraction32–34 and peak alignment.35–39 Widely used peak detection methods include algorithms based on the continuous wavelet transform (CWT).40–42 In complex sample analysis, however, false-positive or false-negative peaks are still frequently encountered. Additionally, a large percentage of users are not experts in data analysis. They do not know exactly how to optimize the large number of parameters in these methods to obtain desirable results, even with the aid of guidelines. Moreover, parameters optimized on several ions in some samples may not be suitable for the others.

The other drawback of UPLC–QTOF analysis is the time-shift problem across samples.35,39 In extremely complex plant samples, time-shift correction based only on high-precision mass values is inapplicable because several candidate peaks commonly meet both the mass tolerance and the elution range of the targeted peak. Although time-shift correction has been extensively studied for one-dimensional chromatographic signals, the limited efficiency of these one-dimensional methods hinders their application to UPLC–QTOF-based nontargeted metabolic profiling analysis because hundreds and even thousands of chromatograms are present in each sample.

The present work develops a new nontargeted data analysis strategy (AntDAS) for UPLC–QTOF-based nontargeted metabolic profiling analysis. In this strategy, the peak detection, annotation, and time-shift correction stages can be performed automatically and simultaneously. Compound information, including high-precision m/z and retention time, can be directly imported into a user-defined compound library for metabolite identification. The performance of AntDAS is evaluated on two UPLC–QTOF datasets, i.e., a mixture of standards and a complex plant metabolic profiling dataset.

METHODOLOGY

AntDAS consists of the following eight sub-steps: chromatogram extraction, peak extraction, peak annotation, time-shift correction, peak registration, peak filling, statistical analysis, and compound identification. An advanced peak extraction algorithm has previously been developed in our laboratory, and in this work, high-precision m/z marking and peak annotation procedures are added. Moreover, a new time-shift correction strategy and a novel peak registration strategy are proposed.

Chromatogram Extraction

AntDAS implements the peak detection procedure based on chromatograms. According to Stolt et al.,34 peak extraction can be accomplished for a chromatogram covering a wide m/z range, such as ±0.5 Da. Figure S-1 provides the chromatogram extraction procedure in AntDAS. First, high-precision masses are rounded to integers. Then, points with identical rounded m/z values are merged. Finally, a chromatogram matrix is constructed with rows as the spectrum (scan) indices and columns as the mass values. Each acquired data point is placed at the corresponding mass column and scan row. Figure S-1E provides an extracted chromatogram at the 111th column, corresponding to m/z 111. An advantage of rounding high-precision m/z values for chromatogram extraction is that data analysis efficiency can be greatly improved. For example, chromatogram extraction takes less than 10 s for a 100 MB file in mzData.xml format.

Peak Extraction

A smoothing-based strategy is implemented in AntDAS for peak extraction, which involves baseline correction, peak detection, false-positive peak filtration, and high-precision m/z marking.

Baseline correction. Baseline correction is implemented based on the local minimal values in the chromatogram.22,33 Most of the minimal values usually correspond to random background noise recorded by the detector.43 AntDAS first extracts the minimal values in the chromatogram into a vector. Then, a moving window-based strategy is used to filter out the outlying minimal values located under peaks. Finally, a linear interpolation strategy is utilized to retrieve the baseline drift for correction. A detailed illustration of baseline correction can be found in ref 33.

Peak detection. AntDAS uses a multiscale Gaussian smoothing strategy for peak detection. The concept is that chromatographic peaks must be local maxima and can be further emphasized after smoothing, as expressed by eq 1.
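As a rough numerical illustration of this concept (formalized in eq 1), the following minimal sketch smooths a synthetic noisy chromatogram with a unit-sum Gaussian kernel at a few scales and lists the local maxima that remain. It is not the AntDAS implementation, which is distributed as MATLAB code. The function names (`gaussian_kernel`, `smooth`, `local_maxima`), the synthetic signal, the kernel truncation at four standard deviations, and the scale values are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, truncate=4.0):
    """Discrete Gaussian whose elements sum to 1 (sigma in data points).

    The truncation at `truncate` standard deviations is an assumption.
    """
    half = max(1, int(np.ceil(truncate * sigma)))
    x = np.arange(-half, half + 1)
    g = np.exp(-0.5 * (x / sigma) ** 2)
    return g / g.sum()

def smooth(signal, sigma):
    """Convolve the signal with the normalized Gaussian, keeping its length."""
    return np.convolve(signal, gaussian_kernel(sigma), mode="same")

def local_maxima(signal):
    """Indices of points strictly higher than both neighbours."""
    s = np.asarray(signal)
    return np.flatnonzero((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])) + 1

# Synthetic chromatogram (illustrative only): two Gaussian peaks plus noise.
rng = np.random.default_rng(0)
t = np.arange(400)
clean = (100 * np.exp(-0.5 * ((t - 120) / 6) ** 2)
         + 60 * np.exp(-0.5 * ((t - 260) / 8) ** 2))
noisy = clean + rng.normal(0, 3, t.size)

# Candidate peak positions in the raw signal and at each (assumed) scale.
raw_maxima = local_maxima(noisy)
maxima_by_scale = {sigma: local_maxima(smooth(noisy, sigma))
                   for sigma in (1.0, 2.0, 4.0)}
```

In this toy signal, random noise produces many spurious maxima in the raw data, whereas far fewer remain at the largest scale, and maxima close to the two simulated peak apexes are still present there. Comparing `maxima_by_scale` across scales is the starting point for the scale-wise search described next.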

$$\mathbf{s}_{\mathrm{smoothed}} = \mathbf{s} * G \tag{1}$$

where $\mathbf{s}$ and $\mathbf{s}_{\mathrm{smoothed}}$ represent the original and smoothed signals, respectively, $*$ is the convolution operator, and $G$ is the Gaussian function normalized so that its elements sum to 1. The standard deviation of the Gaussian function is the smoothing scale. Our investigation suggests that an increment of 0.1 for the smoothing scale can be applied in most situations. Figure S-2 provides an illustration of AntDAS peak detection. AntDAS detects pseudo-peak positions by (1) detecting the maximal values under each smoothing scale and (2) searching for ridge lines of maximal values across smoothing scales. Notably, the number of ridge lines is larger than that of the peaks, so a false-positive peak elimination stage is needed.

False-positive peak filtration. An adaptive instrumental noise estimation step is used. First, signals that monotonically increase or decrease around the detected peaks are eliminated from the chromatogram. Second, the noise is estimated by using a moving window smoothing strategy. The center point, $x_i$, in the window is replaced as

$$x_i = \max(x_i, w_{0.9}) \tag{2}$$

where $w_{0.9}$ is the value that is larger than 90% of the data points in the moving window (i.e., the 90th percentile). Third, a linear interpolation strategy is used to retrieve the instrumental noise across the entire chromatogram. The estimated noise is insensitive to the window width, and a width of 101 points can be employed for data analysis. Chromatographic peaks with signal-to-noise ratios of