
Forecasting Chronic Diseases using Data Fusion

Evrim Acar, Gözde Gürdeniz, Francesco Savorani, Louise Hansen, Anja Olsen, Anne Tjønneland, Lars Ove Dragsted, and Rasmus Bro

J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 31 May 2017

Downloaded from http://pubs.acs.org on May 31, 2017


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Forecasting Chronic Diseases using Data Fusion

Evrim Acar1*, Gözde Gürdeniz2, Francesco Savorani3, Louise Hansen4, Anja Olsen4, Anne Tjønneland4, Lars Ove Dragsted2, Rasmus Bro1

1 Department of Food Science, Faculty of Science, University of Copenhagen, 1958 Frederiksberg C, Denmark ([email protected], [email protected])

2 Department of Nutrition, Exercise and Sports, Faculty of Science, University of Copenhagen, 1958 Frederiksberg C, Denmark ([email protected], [email protected])

3 Department of Applied Science and Technology (DISAT), Polytechnic University of Turin – Corso Duca degli Abruzzi 24, 10129 Torino (TO), Italy ([email protected])

4 Danish Cancer Society Research Center, Strandboulevarden 49, 2100 Copenhagen, Denmark ([email protected], [email protected], [email protected])

* Corresponding Author: Evrim Acar, Email: [email protected], Tel: +4535331439

ACS Paragon Plus Environment

Abstract

Data fusion, i.e., extracting information through the fusion of complementary data sets, is a topic of great interest in metabolomics since analytical platforms such as Liquid Chromatography-Mass Spectrometry (LC-MS) and Nuclear Magnetic Resonance (NMR) spectroscopy, commonly used for chemical profiling of biofluids, provide complementary information. In this study, with a goal of forecasting acute coronary syndrome (ACS), breast cancer and colon cancer, we jointly analyzed LC-MS and NMR measurements of plasma samples, and the metadata corresponding to the lifestyle of participants. We used supervised data fusion based on multiple kernel learning and exploited the linearity of the models to identify significant metabolites/features for the separation of healthy referents and the cases developing a disease. We demonstrated that (i) fusing LC-MS, NMR and metadata provided better separation of ACS cases and referents compared to individual data sets, (ii) NMR data performed the best in terms of forecasting breast cancer, while fusion degraded the performance, and (iii) neither the individual data sets nor their fusion performed well for colon cancer. Furthermore, we showed the strengths and limitations of the fusion models by discussing their performance in terms of capturing known biomarkers for smoking and coffee. While fusion may improve performance in terms of separating certain conditions by jointly analyzing metabolomics and metadata sets, it is not necessarily always the best approach, as in the case of breast cancer.

Keywords: data fusion, multiple kernel learning, Liquid Chromatography - Mass Spectrometry, Nuclear Magnetic Resonance Spectroscopy, acute coronary syndrome, cancer

1. Introduction

When the goal is to unravel the dynamics of a complex system such as the human metabolism, the complexity of the problem necessitates collection and analysis of data using multiple technologies. Analytical platforms such as LC-MS, NMR and Gas Chromatography-Mass Spectrometry (GC-MS) are commonly used for chemical profiling of biological samples. Measurements from such platforms are capable of detecting different types of chemical compounds with different levels of sensitivity and provide complementary information 1, 2. Therefore, there is great interest in data fusion in metabolomics in order to get a better understanding of the human metabolome.

The ultimate goal in many metabolomics applications is to jointly analyze data sets from multiple analytical platforms and identify chemical compounds related to certain conditions, e.g., food intake, drug treatment, or various diseases. This problem can be cast as a multi-view learning problem, where the labels of the samples correspond to the condition of interest, e.g., a case group vs. a control group, and measurements of samples from different platforms correspond to multiple views. Concatenation of multiple views can be considered as one way of fusion; however, it increases the risk of overfitting, particularly for untargeted studies measuring thousands of chemical compounds. Previously, subspace-based multi-view learning methods, such as dimension reduction of each data set followed by Partial Least Squares (PLS)-based classifiers or vice versa, have been used to jointly analyze metabolomics measurements, e.g., LC-MS and GC-MS data in microbial metabolomics 3, LC-MS and NMR measurements of cerebrospinal fluid samples (CFS) 4, microarray and GC-MS data in plant biology 5, and NMR and fluorescence spectroscopy measurements for cancer diagnostics 6. Another multi-view learning approach is multiple kernel learning (MKL), which is a well-studied topic in machine learning 7.
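As a rough illustration of the kernel-combination idea behind MKL, the following Python sketch builds one linear kernel per view and fuses them as a weighted sum. It is only a sketch under stated assumptions: the function names, the toy data, the cosine normalization and the fixed equal weights are all hypothetical and are not taken from this study; in an actual MKL method the kernel weights are learned together with the classifier.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def linear_kernel(X: Sequence[Sequence[float]]) -> Matrix:
    """Gram matrix K[i][j] = <x_i, x_j> for samples stored as rows of X."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def normalize_kernel(K: Matrix) -> Matrix:
    """Rescale K so that every diagonal entry equals 1.

    Assumes no sample is the all-zero vector (otherwise this divides by zero).
    """
    d = [K[i][i] ** 0.5 for i in range(len(K))]
    n = len(K)
    return [[K[i][j] / (d[i] * d[j]) for j in range(n)] for i in range(n)]


def combine_kernels(kernels: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum of per-view kernels; non-negative weights keep the sum a valid kernel."""
    if len(kernels) != len(weights) or any(w < 0 for w in weights):
        raise ValueError("need one non-negative weight per kernel")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for K, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy data: the same 3 samples measured in two views.
lcms = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]  # 3 samples x 2 features
nmr = [[2.0], [1.0], [2.0]]                  # 3 samples x 1 feature

views = [normalize_kernel(linear_kernel(v)) for v in (lcms, nmr)]
# Equal weights are an assumption made for illustration only.
fused = combine_kernels(views, weights=[0.5, 0.5])
```

The fused Gram matrix has the same sample-by-sample shape as each per-view kernel, so it could be handed to any kernel classifier that accepts a precomputed kernel. Normalizing each kernel first is one common way to keep a view with many features from dominating the sum.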

Different kernels are used as a measure of similarity for different views, and data from multiple views are fused by combining the kernels. There are several applications of MKL-based approaches in omics studies as well, e.g., joint analysis of GC-MS and NMR measurements of CFS to study multiple sclerosis 8, and fusion of omics data using MKL-based PLS-type models 9.

In this paper, we address the challenging problem of forecasting whether people will develop certain diseases, i.e., acute coronary syndrome, breast cancer and colon cancer, within five years after sample collection, using LC-MS and NMR measurements of plasma samples and metadata containing information regarding their lifestyle. In addition to analyzing individual data sets, we also formulate the problem as a multi-view learning problem and investigate whether data fusion based on MKL improves the forecasting performance. Finally, to validate our approach and study the influence of confounding variables, we investigate the ability of the approach to predict current lifestyle habits such as smoking and coffee drinking, for which we have subjective answers from the volunteers and objective markers in the metabolic profiles.

2. Materials and Methods

2.1. Samples from the Danish Diet, Cancer and Health (DCH) cohort

The DCH cohort consists of 57,053 people who were enrolled in the study between 1993 and 1997. Participants were eligible for inclusion if they fulfilled the following criteria: age between 50 and 64 years, born in Denmark, and no previous cancer diagnosis in the Danish Cancer Registry. Each participant filled in a detailed food frequency questionnaire (FFQ) and a lifestyle questionnaire at baseline; development and validation of the FFQ are described elsewhere 10, 11. The FFQ contained questions regarding 192 food and beverage items and was designed to capture information on the participants’ habitual diet during the previous 12 months. Anthropometric measurements were taken, including height (m) and weight (kg), and participants gave various biological samples, including plasma (non-fasting).

The participants in the DCH cohort were linked to the Central Population Registry for information on vital status and emigration. Information on ACS and cancer incidence was obtained by linkage of the Central Population Registry number of each participant to the National Patient Registry (ACS) or the Danish Cancer Registry (breast and colon cancer) and the Cause of Death Registry. Each person was followed for primary ACS or cancer occurrence from the date of entry until an event or the end of follow-up. Incident cases of breast cancer were defined according to the International Classification of Diseases (ICD) tenth revision (ICD-10) diagnosis code C50, and colon cancer likewise using the code C18. Incident cases of fatal or nonfatal ACS in the cohort were identified and defined according to the eighth revision (ICD-8) diagnosis codes 410–410.99 and 427.27 until 31 December 1993 and subsequently by ICD-10 (diagnosis codes I20.0, I21.0–I21.9 and I46.0–I46.9). Each case was individually validated through review of medical records 12. All ACS cases were free of previous ACS diagnoses before baseline, and all cancer cases were free of previous cancer diagnoses.

Our data set contains plasma samples from 3376 participants, including 412 who developed breast cancer between baseline and April 27, 2006, 408 who developed colon cancer between baseline and April 27, 2006, 1092 who developed ACS between baseline and December 31, 2003, and 1476 healthy referents free of those three diseases at the end of follow-up (April 27, 2006). The referents were selected randomly, corresponding to a case-cohort design 13, where referents are a random set of unmatched controls.
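The ICD-10 part of the case definitions above can be sketched as a simple prefix lookup in Python. This is an illustrative sketch only: the function and dictionary names are hypothetical, codes are assumed to be written in the dotted form used in the text (e.g., "I21.4"), and the ICD-8 codes used until 31 December 1993 are deliberately left out.

```python
from typing import Optional

# ICD-10 code prefixes taken from the case definitions in Section 2.1.
# "I21" and "I46" stand for the ranges I21.0-I21.9 and I46.0-I46.9;
# matching by prefix is an assumption about how the codes are recorded.
ICD10_PREFIXES = {
    "breast cancer": ("C50",),
    "colon cancer": ("C18",),
    "ACS": ("I20.0", "I21", "I46"),
}


def classify_icd10(code: str) -> Optional[str]:
    """Return the disease group for an ICD-10 code, or None if it matches none."""
    normalized = code.strip().upper()
    for disease, prefixes in ICD10_PREFIXES.items():
        # str.startswith accepts a tuple and returns True if any prefix matches.
        if normalized.startswith(prefixes):
            return disease
    return None
```

For example, `classify_icd10("C50.9")` maps to breast cancer, while `classify_icd10("I20.1")` returns `None` because only I20.0 is part of the ACS definition.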

Figure 1(a) shows the histograms of “time to diagnosis”, i.e., time between sample collection and the date of diagnosis, for the three diseases. In this paper, we only used the samples from the participants who developed the disease within five years of sample collection, which corresponds to 392 breast cancer, 186 colon cancer, and 619 ACS cases. Figure 1(b) shows the histograms of the length of the follow-up period for referents for the three diseases. For each disease, we used only the referents who had been followed up for at least five years.

Plasma was prepared from participants’ blood collected in tubes coated with citrate as anticoagulant and then stored at -150 or -80 °C until analysis.

In addition to plasma samples, we also used 56 metadata variables from the FFQ and lifestyle questionnaire considered relevant for this study based on a priori knowledge from the literature. The selected variables are mainly related to diet but also include weight, height, blood pressure, smoking status, level of exercise, etc. The complete list of variables can be found in Supplementary Table S1.

Figure 1. Histograms of (a) time to diagnosis for cases and (b) follow-up time for referents, for breast cancer, colon cancer and acute coronary syndrome.
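The five-year inclusion rule described in Section 2.1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Participant` record and its field names are hypothetical, a year is approximated as 365.25 days, and "within five years" is treated as inclusive of the five-year boundary; none of these details are specified in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


@dataclass
class Participant:
    sample_date: date               # date of plasma sample collection (baseline)
    diagnosis_date: Optional[date]  # None for referents
    end_of_followup: date


def years_between(start: date, end: date) -> float:
    """Elapsed time from start to end, in (approximate) years."""
    return (end - start).days / DAYS_PER_YEAR


def is_included_case(p: Participant, horizon_years: float = 5.0) -> bool:
    """Case diagnosed no later than horizon_years after sample collection."""
    if p.diagnosis_date is None:
        return False
    return 0 <= years_between(p.sample_date, p.diagnosis_date) <= horizon_years


def is_included_referent(p: Participant, horizon_years: float = 5.0) -> bool:
    """Referent with no diagnosis and at least horizon_years of follow-up."""
    if p.diagnosis_date is not None:
        return False
    return years_between(p.sample_date, p.end_of_followup) >= horizon_years
```

Applying both predicates per disease reproduces the kind of selection described above: cases outside the five-year window and referents with shorter follow-up are left out of the analysis.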

2.2. LC-MS measurements

Prior to LC-MS analysis, plasma proteins were precipitated as described previously 14. The samples were randomized into 47 batches and placed in 96-well plates before analysis. An ultra-performance liquid chromatography (UPLC) system coupled to a quadrupole time-of-flight mass spectrometer (Premier QTOF, Waters Corporation, Manchester, UK) was used for sample analysis. The mobile phase was 0.1% formic acid in water (solvent A) and 0.1% formic acid in 70% acetonitrile and 30% methanol (solvent B). Five µL of each sample were injected into an HSS T3 C18 column (2.1 x 100 mm, 1.8 µm) coupled with a VanGuard HSS T3 C18 column (2.1 x 5 mm, 1.8 µm), and the gradient was operated for 7.0 min. The eluate was analyzed by electrospray ionization (ESI) in positive and negative mode, applying a capillary voltage of 3.2 kV and 2.8 kV, respectively, and a cone voltage of 20 kV. Ion source and desolvation gas (nitrogen) temperatures were set to 120 and 400 °C, respectively. More detail on UPLC-QTOF analysis conditions can be found elsewhere 15.

Blanks (B) (5% of acetonitrile:methanol 70:30 v/v in water), metabolomics standard mixtures (MStd) and ‘lab pool’ serum samples (LP) were injected after every 40 samples throughout each analytical batch. In order to assess the analytical quality, blanks and metabolomics standards were run regularly and monitored during the instrumental run to ensure stability of retention times to within 0.1 min and mass accuracy within 5 ppm. B and MStd were analyzed every 40 samples along with an LP sample used on each plate for the entire set, resulting in 3-4 repetitions of each per batch. LP samples were evaluated visually with Principal Component Analysis (PCA) after data preprocessing to assess between-batch data quality (Supplementary Figure S1).

MS/MS fragmentation experiments were conducted for structural characterization of the unknown compounds identified as being of interest to explain our prediction models. The collision-induced dissociation (CID) energy was set to 10, 20 and 30 eV in separate runs after targeted parent ions had been selected by the MS1 (±1 mass unit) within a retention time window of ±10 s around the retention time previously observed in the full-scan experiment; all other parameters were kept as for the LC-MS full-scan experiments described above.

2.2.1. Preprocessing of LC-MS measurements

Initially, raw data were converted to the intermediate mzData format using an R function (convert.waters.Rd, https://github.com) and further preprocessed by XCMS 16 using the centWave peak picking algorithm. Preprocessed data were then imported into MATLAB. Feature filtering was applied based on three criteria. First, the features detected earlier than 0.3 min or later than 6 min were removed. Second, if a feature had a significant peak area (>60, approx. 6 times the maximal background noise) in the first blank sample in at least one of the analytical batches, the feature was removed from the entire set as background noise. Third, features with implausible masses (the first decimal place > 4) and retention times (