
Technical Note

MetNet: Metabolite network prediction from high-resolution mass spectrometry data in R aids metabolite annotation
Thomas Naake, and Alisdair R. Fernie
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b04096 • Publication Date (Web): 30 Nov 2018


Analytical Chemistry

MetNet: Metabolite network prediction from high-resolution mass spectrometry data in R aids metabolite annotation

Thomas Naake1,* and Alisdair R. Fernie1

1Central Metabolism, Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany

ABSTRACT: A major bottleneck of mass spectrometric metabolomic analysis remains the rapid detection and annotation of unknown m/z features across biological matrices. Such analysis is especially cumbersome for complex samples with hundreds to thousands of unknown features. Traditionally, annotation has been done manually, which limits reproducibility and automation. Furthermore, different analysis tools are typically used at different steps, which requires parsing data and switching environments. We present MetNet, implemented in the R programming language and available as an open-source package via the Bioconductor project. MetNet, which is compatible with the output of the xcms/CAMERA suite, uses the data-rich output of mass spectrometry metabolomics to putatively link features based on their relation to other features in the data set. MetNet uses both structural and quantitative information of metabolomics data for network inference and enables the annotation of unknown analytes.

Among the main challenges in mass spectrometric (MS) metabolomics is the high-throughput analysis of metabolic features, including their fast detection and subsequent annotation. In contrast to the screening of known, previously characterized metabolic features, the putative annotation of unknown features is often cumbersome and requires considerable manual work, hindering the retrieval of biological information from these data1. Especially in complex samples (e.g. from plant tissues), many features are not represented in downstream analyses because they lack annotation; statements on metabolic changes, e.g. due to the application of stress or to mutations, therefore focus mainly on previously reported metabolites. High-resolution MS data are often very information-rich, and reactions, in the sense of metabolic conversions/transformations, can be derived from structural properties of features2. For instance, in the specialized metabolism of plants, fungi or bacteria, these metabolic conversions can be translated into metabolic pathways that are characterized by chemical modifications of a scaffold structure. Given that transformations in these pathways often occur in a linear way, metabolic conversions can be seen as a metabolic grid in which vertices represent metabolites and edges represent the corresponding metabolic conversions. In addition, statistical associations between features (based on their intensity values) can be a valuable resource for finding co-synthesized or co-regulated metabolites that occur in the same biosynthetic pathways2,3.

Different software solutions exist for reconstructing metabolic networks; they differ in the assumptions they make and in the kind of prior knowledge they require (reviewed in de Souza et al.1). One popular software tool is mummichog4, which uses network analysis for pathway enrichment, prediction of functional activity and putative metabolite identification. While this approach is valid for organisms for which genome-scale metabolic models are available, its applicability may be limited for high-throughput metabolic screening of certain biological matrices (e.g. tissues and stress conditions) or of medicinal plant species. Another popular software tool for the putative identification of specialized metabolites is PlantMat5: it matches combinatorial enumerations of predefined aglycones and decorating building blocks (e.g. glycosyl and acyl subunits) against features of the dataset to predict metabolites. PlantMat is restricted by the search space of possible aglycones and, in addition, does not take into account the quantitative information that is commonly acquired by mass spectrometers. Inspired by the approach of Breitling et al.6, MetaNetter, another popular tool, is independent of the search space of predefined aglycones since it only takes into account mass differences that correspond to metabolic transformations. Furthermore, MetaNetter allows for the incorporation of quantitative information (i.e. intensity values), which is brought into agreement with the edges inferred from mass differences.

Given that an analysis tool for the construction of mass difference networks within the R framework is still lacking, we developed MetNet to close this gap and to expand the functionality relative to comparable software tools (e.g. MetaNetter). MetNet combines information from both structural data (differences in m/z values of features and resulting changes in polarity) and statistical associations (intensity values of features per sample) to propose putative metabolic networks that can be used for further exploration, e.g. in hypothesis formulation about the corresponding chemical structures or in putative pathway reconstructions. In our opinion, MetNet has the following advantages over comparable software (e.g. MetaNetter): (1) It uses the R programming environment and is thus compatible with upstream (peaklist generation, annotation of peaklists, etc., e.g. by the widely used R/Bioconductor packages xcms7 and CAMERA8)


and downstream (statistical analyses or data integration, etc.) software for metabolomics analysis. Ultimately, this allows the complete metabolite analysis to be carried out in one programming/scripting environment. (2) It is script-based, which allows for automation, scaling and reproducibility of workflows. (3) For the calculation of statistical associations, MetNet uses a community-based strategy for network inference that accounts for biases inherent in certain statistical methods.

Theory, Implementation and Workflow

The idea of using high-resolution MS data for network construction was first proposed by Breitling et al.6 and was implemented soon after as a Cytoscape plugin, MetaNetter9,10, which infers metabolic networks from molecular weight differences and from Pearson and partial correlation. Incorporating network inference on statistical associations was shown to reduce the number of false positives3. We decided to implement different algorithms for network inference to account for biases that are inherent in these statistical methods and to integrate the results11. Currently, the following statistical models are implemented: least absolute shrinkage and selection operator12, Random Forest13, Pearson and Spearman correlation (including partial and semipartial correlation)2, context likelihood of relatedness (CLR)14, the algorithm for the reconstruction of accurate cellular networks (ARACNE)15 and constraint-based structure learning16 (Bayesian networks). These models can be used either independently or in combination to predict the underlying metabolic networks. After the structural adjacency matrix and the statistical (consensus) adjacency matrix have been created, the two matrices are combined into an adjacency matrix that has support from both structural information and statistical properties. For the functions provided by MetNet and a detailed workflow, we refer to the script in the Supplementary Information.

Data requirements. MetNet requires information about m/z values, retention time (optional) and intensity values. Typically, such data are stored in an m x n matrix, with m features in the rows and n samples in the columns, together with vectors of length m that store the m/z and retention time values of each feature. Such an m x n matrix is returned by xcms or CAMERA, two commonly employed software tools in the R environment for processing mass spectral data and annotating isotopes and adducts7,8. MetNet does not offer any functions to pre-process and normalize data; however, it is paramount for statistical computation that the data set is appropriately normalized across samples.
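The input layout described under "Data requirements" can be illustrated with a small sketch. It is written in Python for illustration only and is not part of the MetNet R package; the feature names, all numeric values and the helper `check_layout` are hypothetical.

```python
# Illustrative only: a tiny peak list in the layout described above.
# Feature names, m/z values, retention times and intensities are made up.
features = ["F1", "F2", "F3"]
mz = [449.11, 465.10, 595.17]    # one m/z value per feature
rt = [285.6, 273.3, 280.0]       # optional retention time (s) per feature
intensities = [                  # m x n matrix: rows = features, columns = samples
    [1200.0, 1500.0, 900.0, 1100.0],
    [300.0, 420.0, 250.0, 310.0],
    [80.0, 95.0, 60.0, 70.0],
]


def check_layout(mz, rt, intensities):
    """Return (m, n) if the per-feature vectors match the matrix rows."""
    m = len(intensities)
    n = len(intensities[0]) if m else 0
    if any(len(row) != n for row in intensities):
        raise ValueError("all rows must have the same number of samples")
    if len(mz) != m or (rt is not None and len(rt) != m):
        raise ValueError("mz and rt need one entry per feature (row)")
    return m, n


shape = check_layout(mz, rt, intensities)
```

Passing `rt=None` mirrors the statement that retention time is optional.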

Figure 1. Creating the network topology based on structural information. A: Metabolites are transformed by specific reactions such as hydroxylation or rhamnosylation. Differences can be readily observed by shift in m/z values. In the case of flavonol glycosides, transformations can occur as e.g. hydroxylation (+15.99 Da) or rhamnosylation (+146.06 Da). B: Depending on the attached chemical group the polarity of the metabolite will change. In the case of flavonols, hydroxylation and rhamnosylation will render the metabolites more polar, resulting in earlier retention time on a reverse-phase liquid chromatography. C: Exemplary network based on structural information.
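The mass shifts quoted in the Figure 1 caption can be reproduced from elemental compositions. The sketch below is an illustration in Python, not MetNet code; the net elemental changes assigned to each transformation (e.g. C6H10O4 for rhamnosylation, i.e. addition of a deoxyhexose with loss of water) are assumptions stated in the comments.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}

# Assumed net elemental change per transformation (product minus substrate).
TRANSFORMATIONS = {
    "hydroxylation": {"O": 1},                     # + O
    "rhamnosylation": {"C": 6, "H": 10, "O": 4},   # + deoxyhexose - H2O
    "glucosylation": {"C": 6, "H": 10, "O": 5},    # + hexose - H2O
    "malonylation": {"C": 3, "H": 2, "O": 3},      # + malonic acid - H2O
}


def mass_shift(delta):
    """Monoisotopic mass of a net elemental change given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in delta.items())


shifts = {name: mass_shift(delta) for name, delta in TRANSFORMATIONS.items()}
for name, value in shifts.items():
    print(f"{name}: +{value:.2f} Da")
```

Rounded to two decimals, hydroxylation and rhamnosylation give the +15.99 Da and +146.06 Da shown in Figure 1.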

Creating the adjacency matrix based on structural information. The molecular weight difference w_X is defined by w_X = |w_A - w_B|, where w_A is the molecular weight of substrate A and w_B is the molecular weight of product B6 (typically, m/z values are used as a proxy for the molecular weight, since the molecular weight cannot be derived directly from MS data, cf. Figure 1 A). If a molecular weight difference that corresponds to a queried transformation is detected, the pair of vertices is set as adjacent in the adjacency matrix. Retention time is a further source of information that can support edges in the adjacency matrix or remove erroneous edges (cf. ref 3). Although the retention time correction step can be omitted, a specific transformation typically results in a polarity change of the molecule that is reflected by a shifted retention time (see Figure 1 B), so the correction reduces the number of false positive assignments. For ambiguous metabolic transformations, the retention time correction can be omitted for these specific transformations only. From the adjacency matrix, a network can be created in which vertices are metabolic features and edges represent a transformation between features (optionally corrected by retention time shift, see Figure 1 C).

Creating the adjacency matrix based on statistical associations. MetNet follows the outline described by Marbach et al.11, who proposed (1) to create adjacency matrices per model and (2) to build a consensus adjacency matrix from these. From the consensus adjacency matrix, a network can be created in which vertices are metabolic features and edges correspond to statistical associations supported by the statistical methods (see Figure 2).
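The structural step (mass-difference matching with an optional retention time check) can be sketched as follows. This is a minimal Python illustration, not the MetNet implementation: the function name, the absolute tolerance `tol`, and the encoding of the expected retention time direction as "-", "+" or "?" are assumptions, and the m/z and retention time values are toy numbers loosely modelled on features F1 to F4 of Figure 1.

```python
def structural_adjacency(mz, rt, transformations, tol=0.01):
    """Link feature pairs whose m/z difference matches a transformation.

    transformations: list of (name, mass_shift, rt_direction), where
    rt_direction is "-" (product elutes earlier), "+" (product elutes later)
    or "?" (no retention time check). Pass rt=None to skip the check entirely.
    """
    n = len(mz)
    adj = [[0] * n for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(n):
            if mz[j] <= mz[i]:
                continue  # treat i as the lighter substrate, j as the product
            diff = mz[j] - mz[i]
            for name, shift, direction in transformations:
                if abs(diff - shift) > tol:
                    continue
                if rt is not None:
                    if direction == "-" and not rt[j] < rt[i]:
                        continue
                    if direction == "+" and not rt[j] > rt[i]:
                        continue
                adj[i][j] = adj[j][i] = 1
                edges.append((i, j, name))
    return adj, edges


# Toy data: m/z values constructed from the shifts below; retention times
# (s) are invented so that more polar products elute earlier.
mz = [449.1080, 465.1029, 595.1659, 611.1608]
rt = [300.0, 290.0, 285.0, 275.0]
transformations = [
    ("hydroxylation", 15.9949, "-"),
    ("rhamnosylation", 146.0579, "-"),
]
adj, edges = structural_adjacency(mz, rt, transformations)
```

With these toy values, four edges are found, matching the grid in Figure 1 C. If the retention time of a product contradicts the expected direction, the corresponding edge is dropped, which is how the correction removes false positives.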

Figure 2. Creating the network topology based on statistical associations. MetNet uses quantitative information from m metabolites M1 to Mm, where x are intensity values, to find statistical associations between metabolites and infer a network topology. Different statistical methods are implemented, and the resulting adjacency matrices per model are combined into a consensus adjacency matrix (not shown here). The figure is modified from Steuer2.
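The two-step outline (one adjacency matrix per model, then a consensus) can be illustrated with a deliberately simple sketch. It is written in Python, uses only Pearson and Spearman correlation, and builds the consensus by a vote count; the threshold, the voting rule and all function names are assumptions for illustration and do not reproduce how MetNet or Marbach et al. aggregate models. The intensity values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def model_adjacency(intensities, score, threshold):
    """Adjacency matrix of one model: 1 where |score| >= threshold."""
    m = len(intensities)
    adj = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            if abs(score(intensities[i], intensities[j])) >= threshold:
                adj[i][j] = adj[j][i] = 1
    return adj


def consensus_adjacency(adjacencies, min_support):
    """Keep an edge if at least min_support models report it."""
    m = len(adjacencies[0])
    return [[1 if sum(a[i][j] for a in adjacencies) >= min_support else 0
             for j in range(m)] for i in range(m)]


# Rows are metabolites M1..M3, columns are samples (made-up intensities).
M = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
per_model = [model_adjacency(M, f, threshold=0.9) for f in (pearson, spearman)]
consensus = consensus_adjacency(per_model, min_support=2)
```

In this toy example only M1 and M2 are linked, because both models agree on that pair and neither supports an association with M3.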

Combining the adjacency matrices from structural and statistical information. After the adjacency matrices have been created from structural and statistical information, the two matrices are combined. Only connections that are present in both adjacency matrices are reported in the final “consensus” network (i.e. the intersection is taken).

Further analysis. Existing R packages can be employed to analyze the network topology further, to extract features, to visualize the network structure (e.g. by using sna17 or igraph18), or to integrate other data sources (e.g. genomic information or transcriptomic data) that will further enhance the predicted network.
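Taking the intersection and then extracting connected components, as done for the real-world data set in the next section, can be sketched as below. This is a Python illustration with hypothetical function names and toy matrices; in R, packages such as igraph provide equivalent functionality.

```python
def intersect(a, b):
    """Element-wise AND of two 0/1 adjacency matrices of equal size."""
    return [[x & y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def connected_components(adj):
    """Connected components among vertices that have at least one edge."""
    n = len(adj)
    seen = set()
    components = []
    for start in range(n):
        if start in seen or not any(adj[start]):
            continue  # already assigned, or an isolated vertex
        stack, component = [start], []
        seen.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            for w in range(n):
                if adj[v][w] and w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


# Toy matrices for four features.
structural = [   # edges 0-1, 0-2, 1-3, 2-3
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
statistical = [  # edges 0-1, 0-3, 2-3
    [0, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
]
combined = intersect(structural, statistical)
components = connected_components(combined)
```

Only the edges 0-1 and 2-3 are supported by both matrices, so the combined network falls into two components of two features each.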

Application of the MetNet pipeline to real-world data sets and classification of predicted network

The MetNet workflow was applied to a metabolic UHPLC-MS dataset from a set of Nicotiana leaf samples containing 10131 features acquired in positive ionization mode. The data set was queried for the following metabolic transformations: “hydroxylation”, “malonylation”, “rhamnosylation”, “glucosylation” and addition of “disaccharide” and “trisaccharide”. After building the adjacency matrix based on structural information and after retention time correction, 5097 and 3563 features, respectively, remained that connect to at least one other feature based on the metabolic transformations defined above. Of these features, 2336 remained after building the adjacency matrix based on quantitative information (i.e. intensity values, using Pearson and Spearman correlation, CLR and ARACNE models) and combining it with the adjacency matrix based on the structural information. These 2336 features formed 619 connected components, of which the components with membership size > 10 are depicted in Figure 3. For the total network, 3924 edges were assigned based on mass difference and 2417704 edges were assigned by the central graph of the statistical models (2026 intersecting edges). The MetNet pipeline reconstructs a component that contains features of the flavonoid pathway, including, for example, kaempferol glucoside (supported by [M+H]+ 449.11 and 287.06, eluting at 285.6 seconds), quercetin rhamnoglucoside glucoside (supported by [M+H]+ 773.21, 449.11 and 303.05, eluting at 273.6 seconds) and quercetin rhamnoside glucoside (supported by [M+H]+ 611.16 and 465.10, eluting at 273.3 seconds). Furthermore, we highlighted a component containing diterpene glycosides19 (DTG, see Figure 3) and putatively identified DTG. Within this component, 87 edges were based on mass differences and 704 were assigned by the central graph of the statistical models (71 intersecting edges). Based on known DTG, adjacent vertices can be structurally predicted and putatively annotated; e.g. feature 1073.4 (eluting at 404.8 seconds) and a Nicotianoside XII isomer (eluting at 398.5 seconds) are separated by a mass difference of 86.00, corresponding to a malonyl group.

In another application, the MetNet pipeline inferred the adjacency matrix from structural information using the MoNA Fiehn HILIC (hydrophilic interaction liquid chromatography) library, which contains primary metabolites, specialized metabolites and pharmaceutically active agents (see script in the Supplementary Information and Figure S1 therein). MetNet predicted a putative network based on the following metabolic transformations: “acetylation”, “hydroxylation”, “rhamnosylation” and “glucosylation”. In addition, the predicted edges were checked for agreement with the expected retention time shift. We calculated classification measures for the predicted networks (see Table S1 in the Supplementary Information) and found high accuracy (>0.9990), low false positive rate (0.9990) and high true positive rate (>0.9032) for all respective groups. The false discovery rate ranged between 1 (only one predicted link in the group) and 0.69 and decreased when the retention time correction step was applied. However, in some instances false negatives were reported after the retention time correction. In general, we found that the high false positive rate can partly be explained by the occurrence of stereoisomers or structural isomers, or by metabolic transformations with the same mass (e.g. formation of a ketone instead of addition of a hydroxyl group, see Supplementary Information and supplementary .xlsx file, MetNet_MoNA_Fiehn_HILIC_classification.xlsx), that were not assigned as true positives. Based on the results obtained from the MoNA Fiehn HILIC library, we state that (1) MetNet can be applied to other separation techniques such as the commonly used separation by liquid chromatography and (2) MetNet, using the inference by structural properties, gives meaningful results that aid putative metabolite annotation and convey biological meaning.
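The classification measures named above follow the standard confusion-matrix definitions, which can be sketched as below. The Python function is illustrative and the counts in the example are invented; they are not taken from Table S1.

```python
def classification_measures(tp, fp, tn, fn):
    """Standard measures from true/false positive/negative edge counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "true_positive_rate": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_discovery_rate": ratio(fp, tp + fp),
    }


# Invented example: a group with a single predicted link that is wrong.
single_link = classification_measures(tp=0, fp=1, tn=9999, fn=0)

# Invented example: a larger group.
larger_group = classification_measures(tp=28, fp=62, tn=99900, fn=3)
```

The first example shows why a false discovery rate of 1 can coexist with a very high accuracy: with only one predicted link, a single false positive drives the false discovery rate to 1, while the large number of true negatives keeps the accuracy close to 1.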

Figure 3: MetNet predicts biologically meaningful networks. Mass spectrometric data was queried for metabolic transformations (“hydroxylation”, “malonylation”, “rhamnosylation”, “glucosylation”, and addition of “disaccharide” and “trisaccharide”) and assigned edges were corrected for expected shifts in retention time. The models Pearson and Spearman correlation, CLR and ARACNE were employed to assign statistical association between features. For simplicity, components are displayed that have more than 10 members (right side). One component is highlighted that contains previously reported diterpene glycosides (DTG)19. Based on the reported edges structural predictions on hitherto unassigned features can be made. Venn diagrams show the number of reported edges for the whole dataset and the highlighted component (containing DTG) for adjacency matrices based on structural and statistical properties and intersection.

Conclusion

MetNet offers a powerful approach to infer ab initio metabolic network topology from MS metabolomics data. It does not require any a priori information on the metabolites that are present in biological matrices, but detects possible metabolic transformations (e.g. following a biosynthetic pathway of biomolecules) based on known metabolic transformations. To reduce possible false positive associations, MetNet uses statistical models to validate and refine the detected edges. MetNet extends the xcms/CAMERA suite by a further analysis step and improves the automation and reproducibility of MS metabolomics data analysis in the R environment. The described tool thus enables the annotation of previously unknown analytes and facilitates the retrieval of biological information from high-resolution MS-based metabolomics data sets.

ASSOCIATED CONTENT

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website.
S1_MetNet.pdf: A script that describes a typical workflow via the MetNet package for LC-MS data, creates Figure 3 of the main text and describes the workflow for running the MoNA Fiehn HILIC library data (PDF)
MetNet_peaklist.txt: Accompanying file containing a peaklist to run the script described in S1_MetNet.pdf (TXT)
MetNet_MoNA_Fiehn_HILIC.txt: Accompanying file to run the script described in S1_MetNet.pdf containing a HILIC library of primary metabolites, specialized metabolites and pharmaceutically active agents acquired in positive and negative ionization mode, downloaded from http://mona.fiehnlab.ucdavis.edu/downloads (TXT)
MetNet_MoNA_Fiehn_HILIC_classification.xlsx: Accompanying .xlsx file that contains manual classification of predicted metabolic transformations using the MetNet_MoNA_Fiehn_HILIC.txt file and calculation of classification measures (XLSX)

AUTHOR INFORMATION Corresponding Author * [email protected]

Author Contributions T.N. designed the software algorithm and programmed the software. Both authors analyzed the data. The manuscript was written through contributions of both authors. Both authors have given approval to the final version of the manuscript.

Notes The authors declare no competing financial interest.

ACKNOWLEDGMENT T.N. acknowledges the support by the IMPRS-PMPG program and A.R.F. the support of Max Planck Society. T.N. and A.R.F. thank Dr. Emmanuel Gaquerel for providing the example LC-MS dataset.

REFERENCES
(1) Perez de Souza, L.; Naake, T.; Tohge, T.; Fernie, A. R. Gigascience 2017, 6, 1-20.
(2) Steuer, R. Brief Bioinform 2006, 7, 151-158.
(3) Morreel, K.; Saeys, Y.; Dima, O.; Lu, F.; Van de Peer, Y.; Vanholme, R.; Ralph, J.; Vanholme, B.; Boerjan, W. Plant Cell 2014, 26, 929-945.
(4) Li, S. Z.; Park, Y.; Duraisingham, S.; Strobel, F. H.; Khan, N.; Soltow, Q. A.; Jones, D. P.; Pulendran, B. Plos Comput Biol 2013, 9.
(5) Qiu, F.; Fine, D. D.; Wherritt, D. J.; Lei, Z. T.; Sumner, L. W. Anal Chem 2016, 88, 11373-11383.
(6) Breitling, R.; Ritchie, S.; Goodenowe, D.; Stewart, M. L.; Barrett, M. P. Metabolomics 2006, 2, 155-164.
(7) Smith, C. A.; Want, E. J.; O'Maille, G.; Abagyan, R.; Siuzdak, G. Anal Chem 2006, 78, 779-787.
(8) Kuhl, C.; Tautenhahn, R.; Bottcher, C.; Larson, T. R.; Neumann, S. Anal Chem 2012, 84, 283-289.
(9) Burgess, K. E. V.; Borutzki, Y.; Rankin, N.; Daly, R.; Jourdan, F. J Chromatogr B Analyt Technol Biomed Life Sci 2017, 1071, 68-74.
(10) Jourdan, F.; Breitling, R.; Barrett, M. P.; Gilbert, D. Bioinformatics 2008, 24, 143-145.
(11) Marbach, D.; Costello, J. C.; Kuffner, R.; Vega, N. M.; Prill, R. J.; Camacho, D. M.; Allison, K. R.; Consortium, D.; Kellis, M.; Collins, J. J.; Stolovitzky, G. Nat Methods 2012, 9, 796-804.
(12) Tibshirani, R. J Roy Stat Soc B Met 1996, 58, 267-288.
(13) Breiman, L. Mach Learn 2001, 45, 5-32.
(14) Faith, J. J.; Hayete, B.; Thaden, J. T.; Mogno, I.; Wierzbowski, J.; Cottarel, G.; Kasif, S.; Collins, J. J.; Gardner, T. S. Plos Biol 2007, 5, 5466.
(15) Margolin, A. A.; Nemenman, I.; Basso, K.; Wiggins, C.; Stolovitzky, G.; Dalla Favera, R.; Califano, A. BMC Bioinformatics 2006, 7 Suppl 1, S7.
(16) Scutari, M. J Stat Softw 2010, 35, 1-22.
(17) Butts, C. T. J Stat Softw 2008, 24, 1-51.
(18) Csardi, G.; Nepusz, T. InterJournal, Complex Systems 2006, 1695, 1-9.
(19) Heiling, S.; Khanal, S.; Barsch, A.; Zurek, G.; Baldwin, I. T.; Gaquerel, E. Plant Journal 2016, 85, 561-577.


[Figure 1 graphic, panels A to C. Features: F1: kaempferol-3-O-glucoside; F2: quercetin-3-O-glucoside; F3: kaempferol-3-O-rutinoside; F4: quercetin-3-O-rutinoside. Mass shifts shown: +15.99 Da and +146.06 Da. Retention time relations shown: RT(F1) > RT(F2), RT(F1) > RT(F3), RT(F2) > RT(F4), RT(F3) > RT(F4).]





[Figure 2 graphic: intensity values x1,1 to xm,n for metabolites M1 to Mm across conditions, and an example network of M1, M2 and M3.]

[Figure 3 graphic. Venn diagram edge counts for the whole dataset: statistical only 2415678, intersection 2026, structural only 1898. Venn diagram edge counts for the highlighted DTG component: structural only 16, intersection 71, statistical only 633. Vertex legend: (putative) DTG; unknown metabolite.]
