OnPLS-based multi-block data integration: a multivariate approach to

6 days ago - ACS eBooks; C&EN Global Enterprise ... Journal of Chemical Education · Journal of Chemical Information and Modeling .... OnPLS-based mult...
1 downloads 0 Views 2MB Size
Subscriber access provided by RMIT University Library

Article

OnPLS-based multi-block data integration: a multivariate approach to interrogating biological interactions in asthma Stacey N. Reinke, Beatriz Galindo-Prieto, Tomas Skotare, David Iain Broadhurst, Akul Singhania, Daniel Horowitz, Ratko Djukanovic, Timothy S.C. Hinks, Paul Geladi, Johan Trygg, and Craig E. Wheelock Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b03205 • Publication Date (Web): 18 Oct 2018 Downloaded from http://pubs.acs.org on October 21, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

Analytical Chemistry

OnPLS-based multi-block data integration: a multivariate approach to interrogating biological interactions in asthma

Stacey N. Reinke1,2*, Beatriz Galindo-Prieto3,4,5, Tomas Skotare3, David I. Broadhurst2, Akul Singhania6,7, Daniel Horowitz8, Ratko Djukanović6,9, Timothy S.C. Hinks6,9,10, Paul Geladi11, Johan Trygg3, Craig E. Wheelock1,12* Affiliations: of Physiological Chemistry 2, Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm, Sweden 2Centre for Integrative Metabolomics and Computational Biology, School of Science, Edith Cowan University, Perth, Australia 3Computational Life Science Cluster, Department of Chemistry (KBC), Umeå University, Umeå, Sweden 4Industrial Doctoral School (IDS), Umeå University, Umeå, Sweden 5Department of Engineering Cybernetics (ITK), Norwegian University of Science and Technology (NTNU), Trondheim, Norway. 6Clinical and Experimental Sciences, University of Southampton Faculty of Medicine, Southampton University Hospital, Southampton, SO16 6YD, UK 7Laboratory of Immunoregulation and Infection, The Francis Crick Institute, London, NW1 1AT, UK 8Janssen Research and Development, High Wycombe, Buckinghamshire, UK 9NIHR Southampton Respiratory Biomedical Research Unit, Southampton University Hospital, Southampton, SO16 6YD, UK 10NIHR Oxford Biomedical Research Centre / Respiratory Medicine Unit, NDM Experimental Medicine, University of Oxford, Level 7, John Radcliffe Hospital, Oxford, OX3 9DU 11Swedish University of Agricultural Sciences, Forest Biomass and Technology, SE 90183 Umeå, Sweden 12Gunma University Initiative for Advanced Research (GIAR), Gunma University, Maebashi, Japan 1Division

* Corresponding author: Craig E. Wheelock 1Division of Physiological Chemistry 2, Department of Medical Biochemistry and Biophysics, Karolinska Institute, SE-171 77 Stockholm, Sweden. Email: [email protected] Stacey N. Reinke of Physiological Chemistry 2, Department of Medical Biochemistry and Biophysics, Karolinska Institute, Stockholm, Sweden 2Centre for Integrative Metabolomics and Computational Biology, School of Science, Edith Cowan University, Perth, Australia Email: [email protected] 1Division

ACS Paragon Plus Environment

1

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

50 51 52 53 54 55 56 57 58

Page 2 of 33

ORCIDs: Stacey N Reinke: 0000-0002-0758-0330 Beatriz Galindo-Prieto: 0000-0001-8776-8626 David Broadhurst: 0000-0003-0775-9581 Timothy Hinks: 0000-0003-0699-2373 Paul Geladi: 0000-0001-6618-5931 Craig Wheelock: 0000-0002-8113-0653

ACS Paragon Plus Environment

2

Page 3 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

59

Abstract

60

Integration of multi-omics data remains a key challenge in fulfilling the potential of

61

comprehensive systems biology. Multiple-block orthogonal projections to latent structures

62

(OnPLS) is a projection method that simultaneously models multiple data matrices,

63

reducing feature space without relying on a priori biological knowledge. In order to improve

64

the interpretability of OnPLS models, the associated multi-block variable influence on

65

orthogonal projections (MB-VIOP) method is used to identify variables with the highest

66

contribution to the model. This study combined OnPLS and MB-VIOP with interactive

67

visualisation methods to interrogate an exemplar multi-omics study, using a subset of 22

68

individuals from an asthma cohort. Joint data structure in six data blocks was assessed:

69

transcriptomics; metabolomics; targeted assays for sphingolipids, oxylipins, and fatty

70

acids; and a clinical block including lung function, immune cell differentials, and cytokines.

71

The model identified 7 components, 2 of which had contributions from all blocks (globally

72

joint structure), and 5 that had contributions from 2-5 blocks (locally joint structure).

73

Components 1 and 2 were the most informative, identifying differences between healthy

74

controls and asthmatics, and a disease-sex interaction, respectively. The interactions

75

between features selected by MB-VIOP were visualised using chord plots, yielding

76

putative novel insights into asthma disease pathogenesis, the effects of asthma treatment,

77

and biological roles of uncharacterised genes. For example, the gene ATP6V1G1, which

78

has been implicated in osteoporosis, correlated with metabolites that are dysregulated by

79

inhaled corticoid steroids (ICS), providing insight into the mechanisms underlying bone

80

density loss in asthma patients taking ICS. These results show the potential for OnPLS,

81

combined with MB-VIOP variable selection and interaction visualisation techniques, to

82

generate hypotheses from multi-omics studies and inform biology.

83

ACS Paragon Plus Environment

3

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 33

84

Introduction

85

In the post-genomic era, data-driven science has become increasingly necessary due to

86

the vast array of instrumentation that is capable of generating thousands of data points for

87

a single analytical observation1-2. In addition to using classical univariate statistical

88

methods, machine-learning techniques have become routinely used to interrogate and

89

understand vast amounts of data3-4. Two common characteristics of omics data are that

90

the number of measured variables is vastly greater than the number of observations5, and

91

that there is a degree of multicollinearity between variables6. As such, computational

92

methods that project high dimensional data into a smaller number of component variables

93

have become commonplace7. Multivariate projection methods such as Principal

94

Components Analysis (PCA)8, Partial Least Squares Discriminant Analysis (PLS-DA)9,

95

Canonical Variate Analysis (CVA)8, together with Hierarchical Cluster Analysis (HCA)10,

96

Random Forests11, and Support Vector Machines (SVM)12 are all used to analyze omics

97

data3-4. PLS-DA and its extension, orthogonal projection to latent structures – discriminant

98

analysis (OPLS-DA)13-14, have become popular projection methods in the metabolomics

99

community15. As modeling methods become increasingly complicated, they have also

100

become concomitantly difficult to interpret. Assignment of the variable importance often

101

becomes an a posteriori statistical process based on either permutation testing or random

102

resampling (e.g., confidence intervals derived from bootstrap/jackknife statistics)16. For

103

methods based on a PLS algorithm, the direct statistical method of Variable Influence on

104

Projection (VIP)17-18 is often used to estimate variable contribution to the resulting models.

105 106

In recent years, as the omics sciences have matured, it has become common to acquire

107

data from multiple omics platforms in a single biological experiment. As such, each

108

biological sample is interrogated by multiple analytical platforms, which in turn can be

109

linked to multiple sources of experimental metadata. Data from each platform (or

ACS Paragon Plus Environment

4

Page 5 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

110

measurement context) can be considered a discrete block, with multiple blocks making up

111

the complete data set of the experiment. Multivariate projection methods such as OPLS-

112

DA have proven successful in modelling the underlying latent biological structure within a

113

single high dimensional data block; however, they are theoretically unsuitable for

114

modelling multiple data blocks simultaneously. There are two reasons for this issue. First,

115

if multiple data blocks are concatenated into a single matrix, with no accounting for

116

measurement context, then the subsequent model can be considered as a single

117

projection model, where the weighting of each variable is governed by the total sum of

118

squares19. This, in principle, demands that each block is normalized to the same size, to

119

avoid a projection model that is biased towards the impact of the data set with the most

120

variables. In practice, this can be problematic, particularly when there are a mix of blocks

121

of vastly different sizes. For example, in a model concatenating 20,000 transcripts, 200

122

metabolites, and 20 clinical variables, the transcripts would over-represent the global data

123

structure and thus have a larger contribution to the resulting model. In multi-block

124

modelling, this is not an issue as each block is treated independently. This approach

125

leaves flexibility to scale individual variables according to importance and also to keep

126

variables in their original unit. Second, each individual data set is associated with its own

127

underlying structure19-20, describing the true biological variance, and also platform-specific

128

measurement error. Covariance of biological latent structure across multiple data blocks

129

is implicit; however, it is a fair assumption that the measurement error across multiple

130

blocks will be independent, and thus easily ignored at this block-interaction level.

131

Conversely, if multiple data blocks are concatenated into a single data set before

132

projection, the model will struggle to effectively separate true biological structure from

133

block-specific noise and result in erroneous interpretation of the conglomerate projection

134

model.

135

ACS Paragon Plus Environment

5

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 33

136

To address the need for multivariate methods to simultaneously model multiple data

137

matrices, a number of multi-block data integration methods have been proposed21-23. In

138

2011, Löfstedt and Trygg24 proposed a novel multi-block multivariate method called

139

OnPLS, which utilises the framework of OPLS to decompose data from more than two

140

input matrices. Multi-block models, such as OnPLS, are fully symmetric, meaning each

141

data block is weighted to allow an equal contribution to the model, regardless of the

142

number of variables or underlying data structure within each block25. Multi-block

143

approaches offer further advantages over single-block or block-concatenation in

144

biomarker discovery. First, the validity of any true biological biomarker is significantly

145

increased if there is a clear covariance between data blocks, thus reducing the possibility

146

of false discovery26. Second, contrary to block-concatenation modelling, which is strongly

147

biased toward the globally-joint variation, multi-block analysis decomposes the different

148

levels of variation (global, local, unique)27 such that relatively small but informative trends

149

are also identified. Recently, Galindo-Prieto et al. adapted the VIP concept for multi-block

150

data analysis (Multi-Block Variable Influence on Orthogonal Projections method28, MB-

151

VIOP) to identify the variables that contribute to these different levels of joint structure.

152 153

The aim of this study was combine OnPLS and MB-VIOP with data visualization methods

154

to create a workflow capable of simultaneously modeling and investigating interactions

155

between multiple omics data blocks. The study chosen for this purpose was a subset from

156

a previously reported asthma cohort for which multiple omics datasets were acquired in

157

isolation29-30. These analyses included untargeted metabolomics, targeted metabolite

158

assays, differential immune cell population analyses, and cytokine arrays. Additionally, for

159

the present study, transcriptomics of peripheral blood T cells was performed. OnPLS

160

modelling and MB-VIOP were then used to integrate the disparate data blocks into a single

ACS Paragon Plus Environment

6

Page 7 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

161

model, which was then interrogated to identify novel interactions between the data blocks

162

and disease status as well as other clinical endpoints.

163

ACS Paragon Plus Environment

7

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 33

164

Experimental Section

165

Clinical Cohort

166

Briefly, 12 healthy controls and 10 severe asthmatics were included from the original

167

study29. Transcriptomics was subsequently performed on peripheral blood T cells and

168

metabolomics/metabolic profiling assays were performed on serum. All participants were

169

enrolled from the NIHR Southampton Respiratory Biomedical Research Unit and

170

University Hospital Southampton outpatient clinics; all provided written informed consent.

171

The National Research Ethics Service Committee South Central – Southampton B ethics

172

committee (UK; Ref 10/H0504/2) approved this study. Clinical classification and enrolment

173

criteria were previously described29, 31. Participant data were included in the present study

174

if they were classified as either healthy control or severe asthmatic individuals in the

175

existing cohort and data from all data blocks (described in next section) were collected.

176 177

Sample Collection and Analyses

178

Details of sample collection and transcriptomics analyses are available in the supporting

179

information. Details of analytics, quality control, and data cleaning for metabolomics,

180

targeted metabolic assays, and clinical assays were performed as previously described29-

181

30.

182 183

Data Blocks and Processing

184

Six data blocks were used for modelling: Transcriptomics, Sphingolipids, Metabolomics,

185

Fatty Acids, Oxylipins, and Clinical Data (Figure 1). A complete list of all variables included

186

for each data block is provided in Tables S1-S6. The data blocks were defined by a priori

187

knowledge about both the system under observation and the measurement technology19.

188

The primary consideration was the underlying structure of the data that could possibly

189

confound the biological interaction between blocks. To avoid bias in combining information

ACS Paragon Plus Environment

8

Page 9 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

190

from different probes for one gene, all non-QC probes were included for OnPLS modelling;

191

the Transcriptomics block included 54,613 variables. This approach is commonly used for

192

analyzing transcriptomics data32. Four data blocks represented serum metabolites:

193

Sphingolipids (28 variables, targeted assay), Metabolomics (66 variables, untargeted

194

assay screened against an in-house chemical library), Fatty Acids (14 variables, targeted

195

assay), and Oxylipins (38 variables, targeted assay). Twenty-three clinical variables were

196

combined into the Clinical data block; these variables were derived from typical clinical

197

assays and measurements, and included lung function tests, bronchoalveolar lavage fluid

198

and peripheral blood T cell populations, serum cytokines, and serum vitamin D3. For

199

clinical data, values that were missing due to being below the limit of detection (LOD) of

200

the respective assay were imputed with 1/10 of the lowest measured value, because the

201

LOD was not known for each assay and OnPLS cannot process missing values. Data that

202

were missing for an entire subset array of the clinical data (e.g. for individuals missing the

203

cytokine assay) were imputed using the median value of the corresponding clinical group

204

(control or asthma). Remaining missing values were replaced using PCA imputation. Prior

205

to OnPLS model calculation, all data (except for transcriptomics) were log-transformed.

206

All data were then scaled to unit variance.

207 208

OnPLS Model Calculation and Visualisation

209

The OnPLS model simultaneously analyzed the data matrices, returning output matrices

210

of shared information (components), as described27. These output matrices reveal shared

211

data structure on three levels for each data matrix, which can be summarized as:

212 213

𝑋𝑖 =

𝑋𝐺

+

𝑔𝑙𝑜𝑏𝑎𝑙𝑙𝑦 𝑗𝑜𝑖𝑛𝑡

𝑋𝐿

+ 𝑋𝑈 +

𝐸

𝑙𝑜𝑐𝑎𝑙𝑙𝑦 𝑗𝑜𝑖𝑛𝑡

𝑢𝑛𝑖𝑞𝑢𝑒

𝑟𝑒𝑠𝑖𝑑𝑢𝑎𝑙 𝑛𝑜𝑖𝑠𝑒

(1)

214

Globally joint components reveal structure that is shared by all input data matrices. Locally

215

joint components reveal structure shared by two or more, but not all, of the input matrices.

ACS Paragon Plus Environment

9

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 33

216

Finally, unique components identify latent structure that is present in only one input matrix.

217

The OnPLS model returned separate score vectors for each data block in each

218

component. To identify the sources of biological variance explained by the OnPLS

219

components, the component scores for each block were correlated with metadata

220

variables not included in the clinical data block: clinical class (control vs. asthma), sex,

221

age, BMI, dose of inhaled and oral corticosteroids, and smoking (current/former smoker

222

vs. never smoked). The resulting Pearson correlation coefficients were presented as a

223

metadata correlation plot33. To visualise the overall OnPLS model, hierarchical Principal

224

Component Analysis (PCA)34 was used to summarize the 30 OnPLS score vectors,

225

resulting in two PCA components describing the relationships in the OnPLS model. Prior

226

to calculating the PCA model, the score vectors were scaled to unit variance. The PCA

227

score plot showed individual participants and the loadings plot displayed the score vectors

228

from the OnPLS model, labelled by block type and OnPLS model component number.

229 230

MB-VIOP Concept, Motivation and Calculation

231

Multiblock variable influence on orthogonal projections (MB-VIOP) is a feature selection

232

method that (i) sorts the input variables by importance for data interpretation in OnPLS

233

models, either for the total model (all variation types together) or per component (global,

234

local, or unique variations separately), and (ii) explores the connections between the

235

variables (either in the same or different data matrix) that contribute to explain the same

236

component (latent variable) in the multiblock system. Multiblock-VIOP is a model-based

237

variable selection method because it uses the n preprocessed data matrices, the score

238

vectors, and the normalized loading vectors from an OnPLS model. OnPLS regression

239

can relate the data matrices according to the model component; however, it must be

240

emphasized that not all input variables of these related matrices will connect among

241

themselves to explain the variation contained in a specific model component. The MB-

ACS Paragon Plus Environment

10

Page 11 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

242

VIOP algorithm is necessary to sort the input variables according to their connections for

243

interpreting the variation contained in one or more specific components. Furthermore, MB-

244

VIOP finds the degree of importance of each variable in the correct proportion for a

245

multiblock system, which cannot be achieved by the OnPLS normalized loadings plot35.

246 247

The calculation of the MB-VIOP values can be summarized as the Hadamard products of

248

the normalized loadings multiplied by the ratio of the variation explained by a model

249

component and the cumulated variation. After a block- and component-wise iterative

250

algorithm with all input variables from the six data matrices involved, the resulting MB-

251

VIOP vectors were normalized by Euclidean norm and by the number of original (input)

252

variables raised to the ½ power. The variables of interest that were identified by MB-VIOP

253

were selected as a subset for further multivariate analysis as shown below. For additional

254

details about the MB-VIOP fundamentals and algorithm, readers are referred to the

255

original references28.

256 257

Data Visualisation

258

The between-block covariance of the subset of variables contributing to Components 1

259

and 2 of the OnPLS model were visualised using chord-plots36. Using the variables

260

reaching a defined MB-VIOP threshold, a chord-plot was constructed by first calculating

261

the Spearman rank correlation coefficient (r) for each pairwise combination of variables

262

with MB-VIOP values above a threshold. Those variables where a significant (p70% of the variance in each metabolic profiling data block

305

(Sphingolipids, Fatty Acids, and Oxylipins), and 56% of the Metabolomics block. This

306

higher degree of explained variance can be attributed to the selective association between

307

these variables and asthma. These targeted assays were performed to confirm findings

308

from the initial metabolomics screen30. While not all targeted metabolites were originally

309

detected using metabolomics, they represent biological processes known to be involved

310

in inflammation. This point is of particular relevance in that it is not the number of variables

311

in a given data block that is the primary driver, but rather the inherent biological content4,

312

9.

313

into a single OnPLS model – and demonstrates the utility of this approach for data

314

modeling. However, there is the expected caveat that data blocks that contain higher

315

levels of biological structure will have a concomitant increase in contribution to the overall

316

OnPLS model.

This facet makes it meaningful to combine disparate omics blocks of varying structure

317 318

To determine the biological factors associated with each OnPLS component and data

319

block, model score vectors were correlated with a number of known biological factors

320

(Figure 2). Component 1 scores from all blocks positively and significantly (p 1.0 with

405

dehydroepiandrosterone-sulfate (DHEA-S) being the strongest driver of the control-

406

asthma difference (Supplemental Tables). Component 1 of OnPLS had 31 variables with

407

a MB-VIOP > 1.0, 15 of which were unique to OnPLS modelling. Whereas DHEA-S was

408

a major driver in the covariance in the single-block analysis, it was less important in the

409

combined covariance of the OnPLS model. The Transcriptomics, Oxylipin, and Clinical

410

data blocks showed similar trends, with OPLS and OnPLS revealing different biological

411

insights (data not shown).

412 413

While the present study shows the potential for OnPLS-based modelling to be useful for

414

simultaneously modelling multiple data blocks and generating hypotheses, it is limited by

415

sample size and study power. Furthermore, OnPLS is currently unable to derive a block

416

weighting such that MB-VIOP values can be scaled and directly compared across all

417

blocks. Accordingly, MB-VIOP values can only be directly compared within a given data

418

block and not between blocks. In interpreting the results, one must consider the overall

419

contribution, not only of the block per se, but also of the individual variables, to the

420

respective component.

421 422

Conclusions

423

The multi-block OnPLS method combined with MB-VIOP variable selection and interaction

424

visualisation techniques yielded putative novel insights into asthma disease pathogenesis,

425

the effects of asthma treatment, and biological roles of genes. The current study was

426

performed in a worst-case scenario approach using a small sample set, with unbalanced

427

groups, and multiple study confounders. While these issues limit the ability of the different

428

components of the OnPLS model to identify unique biological sources of variation, it

429

demonstrates the potential for this method for identifying key structure in omics data

ACS Paragon Plus Environment

18

Page 19 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

430

integration. It is likely that in large well-designed studies, the different components would

431

be able to identify and explain other sources of biological and/or experimental variability

432

(e.g., therapeutics, centre bias, diet). It is also possible that this approach would be useful

433

in identifying sub-phenotypes of disease, with different sub-groups and/or mechanisms

434

described by different components. We therefore propose that OnPLS modeling can be

435

incorporated into large-scale molecular phenotyping studies for stratified medicine. Given

436

that the omics technologies detect molecules that function in a highly interdependent and

437

dynamic manner within a living system, multi-block methods such as OnPLS, together

438

with MB-VIOP and interaction visualisation, provide a logical approach to investigating

439

systems biology.

440 441

Acknowledgements

442

The authors wish to thank Rickard Sjögren for providing the script to generate the

443

metadata correlation plot. SNR was supported by a Canadian Institutes of Health

444

Research (CIHR) Fellowship (MFE-135481). TSCH was supported by Wellcome Trust

445

Research Fellowships (088365/z/09/z and 104553/z/14/z), by the Academy of Medical

446

Sciences and by the National Institute for Health Research (NIHR) Oxford Biomedical

447

Research Centre (BRC). AS was supported by the Faculty of Medicine, University of

448

Southampton, UK. BGP was supported by MKS Instruments AB and by IDS / KBC of

449

Umeå University (Sweden) for 2016-2017, and supported by an ERCIM ‘Alain

450

Bensoussan’ Fellowship Programme at the Department of Engineering Cybernetics (ITK)

451

of the Norwegian University of Science and Technology (Norway) for 2017-2018. CEW

452

was supported by the Swedish Heart Lung Foundation (HLF 20170603). We acknowledge

453

the support of the Swedish Heart Lung Foundation (HLF 20170734), the Swedish

454

Research Council (2016-02798), the Karolinska Institutet and the ChAMP (Centre for

455

Allergy Research Highlights Asthma Markers of Phenotype) consortium which is funded

ACS Paragon Plus Environment

19

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 33

456

by the Swedish Foundation for Strategic Research, the Karolinska Institutet, AstraZeneca

457

& Science for Life Laboratory Joint Research Collaboration, and the Vårdal Foundation.

458 459

Conflict of Interest Disclosure

460

The authors have no conflicts of interest to declare.

461 462

Supporting Information

463



Supporting methods for transcriptomics

464



Supporting tables

465

o

466 467

variables, Table S1 o

468 469

o

o

o

o

480

MBVIOPOnPLS and VIPOPLS-DA values for Block 6 (Clinical) variables, Table S6

o

478 479

MBVIOPOnPLS and VIPOPLS-DA values for Block 5 (Oxylipins) variables, Table S5

476 477

MBVIOPOnPLS and VIPOPLS-DA values for Block 4 (Fatty Acids) variables, Table S4

474 475

MBVIOPOnPLS and VIPOPLS-DA values for Block 3 (Metabolomics) variables, Table S3

472 473

MBVIOPOnPLS and VIPOPLS-DA values for Block 2 (Sphingolipids) variables, Table S2

470 471

MBVIOPOnPLS and VIPOPLS-DA values for Block 1 (Transcriptomics)

Rho values for correlations between variables presented in Figure 5A (Component 1), Table S7

o

P values for correlations between variables presented in Figure 5A (Component 1), Table S8

ACS Paragon Plus Environment

20

Page 21 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

481

Analytical Chemistry

o

482 483 484

Rho values for correlations between variables presented in Figure 5B (Component 2), Table S9

o

P values for correlations between variables presented in Figure 5B (Component 2), Table S10

485 486

ACS Paragon Plus Environment

21

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 33

487

References

488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531

1. Ideker, T.; Galitski, T.; Hood, L., A new approach to decoding life: systems biology. Ann Rev Genomics Hum Genet 2001, 2, 343-72. 2. Kell, D. B.; Oliver, S. G., Here is the evidence, now what is the hypothesis? The complementary roles of inductive and hypothesis-driven science in the postgenomic era. Bioessays 2004, 26 (1), 99-105. 3. Brown, M.; Dunn, W. B.; Ellis, D. I.; Goodacre, R.; Handl, J.; Knowles, J. D.; O’Hagan, S.; Spasić, I.; Kell, D. B., A metabolome pipeline: from concept to data to knowledge. Metabolomics 2005, 1 (1), 39-51. 4. Gromski, P. S.; Muhamadali, H.; Ellis, D. I.; Xu, Y.; Correa, E.; Turner, M. L.; Goodacre, R., A tutorial review: Metabolomics and partial least squaresdiscriminant analysis--a marriage of convenience or a shotgun wedding. Anal Chim Acta 2015, 879, 10-23. 5. Wheelock, A. M.; Wheelock, C. E., Trials and tribulations of 'omics data analysis: assessing quality of SIMCA-based multivariate models using examples from pulmonary medicine. Mol Biosyst 2013, 9 (11), 2589-96. 6. Nørgaard, L.; Bro, R.; Westad, F.; Engelsen, S. B., A modification of canonical variates analysis to handle highly collinear multivariate data. J Chemom 2006, 20 (8‐10), 425-435. 7. Broadhurst, D. I.; Kell, D. B., Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2006, 2 (4), 171-196. 8. Krzanowski, W. J., Principles of Multivariate Analysis: A User's Perspective. Clarendon Press: 1990. 9. Wold, S.; Sjöström, M.; Eriksson, L., PLS-regression: a basic tool of chemometrics. Chemom Intell Lab Syst 2001, 58 (2), 109-130. 10. Hastie, T.; Tibshirani, T.; Friedman, J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed.; Springer-Verlag: New York, 2009; p 745. 11. Breiman, L., Random Forests. Mach Learn 2001, 45 (1), 5-32. 12. Cortes, C.; Vapnik, V., Support-Vector Networks. Mach Learn 1995, 20 (3), 273-297. 13. Bylesjö, M.; Rantalainen, M.; Cloarec, O.; Nicholson, J. K.; Holmes, E.; Trygg, J., OPLS discriminant analysis: combining the strengths of PLS‐DA and SIMCA classification. J Chemom 2006, 20 (8‐10), 341-351. 14. Trygg, J.; Wold, S., Orthogonal projections to latent structures (O‐PLS). J Chemom 2002, 16 (3), 119-128. 15. Madsen, R.; Lundstedt, T.; Trygg, J., Chemometrics in metabolomics--a review in human disease diagnosis. Anal Chim Acta 2010, 659 (1-2), 23-33. 16. Xia, J.; Broadhurst, D. I.; Wilson, M.; Wishart, D. S., Translational biomarker discovery in clinical metabolomics: an introductory tutorial. Metabolomics 2013, 9 (2), 280-299. 17. Wold, S.; Johansson, E.; Cocchi, M., PLS Partial Least Squares Projections to Latent Structures. In 3D QSAR in Drug Design: Theory, Methods, and Applications, Kubinyi, H., Ed. Springer: Netherlands, 1993; pp 523-550.

ACS Paragon Plus Environment

22

Page 23 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576

Analytical Chemistry

18. Galindo‐Prieto, B.; Eriksson, L.; Trygg, J., Variable influence on projection (VIP) for orthogonal projections to latent structures (OPLS). J Chemom 2014, 28 (8), 623-632. 19. Höskuldsson, A.; Svinning, K., Modelling of multi-block data. J Chemom 2006, 20 (8-10), 376-385. 20. Cavill, R.; Jennen, D.; Kleinjans, J.; Briede, J. J., Transcriptomic and metabolomic data integration. Brief Bioinform 2016, 17 (5), 891-901. 21. Van Loan, C. F., Generalizing the singular value decomposition. SIAM J Numer Anal 1976, 13 (1), 76-83. 22. Van Deun, K.; Van Mechelen, I.; Thorrez, L.; Schouteden, M.; De Moor, B.; van der Werf, M. J.; De Lathauwer, L.; Smilde, A. K.; Kiers, H. A. L., DISCO-SCA and Properly Applied GSVD as Swinging Methods to Find Common and Distinctive Processes. PLoS ONE 2012, 7 (5), e37840. 23. Lock, E. F.; Hoadley, K. A.; Marron, J. S.; Nobel, A. B., Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat 2013, 7 (1), 523. 24. Löfstedt, T.; Trygg, J., OnPLS—a novel multiblock method for the modelling of predictive and orthogonal variation. J Chemom 2011, 25 (8), 441-455. 25. Smilde, A. K.; Westerhuis, J. A.; de Jong, S., A framework for sequential multiblock component methods. J Chemom 2003, 17 (6), 323-337. 26. Li, C. X.; Wheelock, C. E.; Skold, C. M.; Wheelock, A. M., Integration of multi-omics datasets enables molecular classification of COPD. Eur Respir J 2018, 51 (5). 27. Löfstedt, T.; Hoffman, D.; Trygg, J., Global, local and unique decompositions in OnPLS for multiblock data analysis. Anal Chim Acta 2013, 791, 13-24. 28. Galindo-Prieto, B. Novel variable influence on projection (VIP) methods in OPLS, O2PLS, and OnPLS models for single-and multi-block variable selection: VIPOPLS, VIPO2PLS, and MB-VIOP methods. Doctoral Dissertation. Umeå University, 2017. 29. Hinks, T. S.; Zhou, X.; Staples, K. J.; Dimitrov, B. D.; Manta, A.; Petrossian, T.; Lum, P. Y.; Smith, C. G.; Ward, J. A.; Howarth, P. H.; Walls, A. F.; Gadola, S. D.; Djukanovic, R., Innate and adaptive T cells in asthmatic patients: Relationship to severity and disease mechanisms. J Allergy Clin Immunol 2015, 136 (2), 32333. 30. Reinke, S. N.; Gallart-Ayala, H.; Gomez, C.; Checa, A.; Fauland, A.; Naz, S.; Kamleh, M. A.; Djukanovic, R.; Hinks, T. S.; Wheelock, C. E., Metabolomics analysis identifies different metabotypes of asthma severity. Eur Respir J 2017, 49 (3), 1601740. 31. Vijayanand, P.; Seumois, G.; Pickard, C.; Powell, R. M.; Angco, G.; Sammut, D.; Gadola, S. D.; Friedmann, P. S.; Djukanovic, R., Invariant natural killer T cells in asthma and chronic obstructive pulmonary disease. N Engl J Med 2007, 356 (14), 1410-22. 32. Diez, D.; Wheelock, A. M.; Goto, S.; Haeggstrom, J. Z.; Paulsson-Berne, G.; Hansson, G. K.; Hedin, U.; Gabrielsen, A.; Wheelock, C. E., The use of network

ACS Paragon Plus Environment

23

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619

Page 24 of 33

analyses for elucidating mechanisms in cardiovascular disease. Mol Biosyst 2010, 6 (2), 289-304. 33. Skotare, T.; Sjögren, R.; Surowiec, I.; Nilsson, D.; Trygg, J., Visualization of descriptive multiblock analysis. J Chemom 2018, https://doi.org/10.1002/cem.3071. 34. Wold, S.; Kettaneh, N.; Tjessem, K., Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection. J Chemom 1996, 10 (5‐6), 463-482. 35. Galindo-Prieto, B.; Trygg, J.; Geladi, P., A new approach for variable influence on projection (VIP) in O2PLS models. Chemom Intell Lab Syst 2017, 160, 110-124. 36. Holten, D., Hierarchical edge bundles: Visualization of adjacency relations in hierarchical data. IEEE Trans Vis Comp Graph 2006, 12 (5), 741-748. 37. Hunter, J. D., Matplotlib: A 2D graphics environment. Comput Sci Eng 2007, 9 (3), 90-95. 38. Wang, D.; Bodovitz, S., Single cell analysis: the new frontier in 'omics'. Trends Biotechnol 2010, 28 (6), 281-90. 39. Holgate, S. T.; Wenzel, S.; Postma, D. S.; Weiss, S. T.; Renz, H.; Sly, P. D., Asthma. Nat Rev Dis Primers 2015, 1, 15025. 40. Tan, L. J.; Wang, Z. E.; Wu, K. H.; Chen, X. D.; Zhu, H.; Lu, S.; Tian, Q.; Liu, X. G.; Papasian, C. J.; Deng, H. W., Bivariate Genome-Wide Association Study Implicates ATP6V1G1 as a Novel Pleiotropic Locus Underlying Osteoporosis and Age at Menarche. J Clin Endocrinol Metab 2015, 100 (11), E1457-66. 41. Wong, C. A.; Walsh, L. J.; Smith, C. J.; Wisniewski, A. F.; Lewis, S. A.; Hubbard, R.; Cawte, S.; Green, D. J.; Pringle, M.; Tattersfield, A. E., Inhaled corticosteroid use and bone-mineral density in patients with asthma. Lancet 2000, 355 (9213), 1399-1403. 42. McNamara, P.; Seo, S. B.; Rudic, R. D.; Sehgal, A.; Chakravarti, D.; FitzGerald, G. A., Regulation of CLOCK and MOP4 by nuclear hormone receptors in the vasculature: a humoral mechanism to reset a peripheral clock. Cell 2001, 105 (7), 877-89. 43. Jang, Y. S.; Kang, Y. J.; Kim, T. J.; Bae, K., Temporal expression profiles of ceramide and ceramide-related genes in wild-type and mPer1/mPer2 double knockout mice. Mol Biol Rep 2012, 39 (4), 4215-21. 44. Gooley, J. J.; Chua, E. C., Diurnal regulation of lipid metabolism and applications of circadian lipidomics. J Genet Genomics 2014, 41 (5), 231-50. 45. Kenfield, M.; Yu, H.; Ehlers, A.; Xie, W.; Gunsten, S.; Agapov, E.; Horani, A.; Holtzman, M. J.; Brody, S. L.; Haspel, J., Dynamic Regulation Of Circadian Clock Genes In Chronic Airway Disease. In C31. BASIC AND TRANSLATIONAL STUDIES ON THE ROLE OF THE EPITHELIUM, American Thoracic Society: 2017; pp A5210-A5210.

620

ACS Paragon Plus Environment

24

Page 25 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

621 622

Analytical Chemistry

Table 1. Clinical Data Healthy Control (N=12)

Severe Asthma (N=10)

26.5 (24.8, 30.8)

63 (43.5, 63)

9/3

4/6

24.1 (22.6, 43.2)

34.0 (27.4, 43.2)

Never Smoker (#)

11

6

Current/Former Smoker (#)

1

4

0

10 (1280)

Age (Years) Sex (M/F) BMI (kg/m2) Smoking Status

Treatment Inhaled corticosteroids (#, median dose*)

Oral corticosteroids (#) 0 Values are median (interquartile range) or number *Beclomethasone dipropionate equivalent g

3

623 624 625

ACS Paragon Plus Environment

25

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47

626 627 628

Page 26 of 33

Table 2. OnPLS Model Summary Component 1 2 3 4 5 6 7 Sum

Connection Global Local Local Global Local Local Local

Transcriptomics 8% 7% 8% 6% 4% 5% 37%

Sphingolipids 33% 11% 14% 9% 6% 13% 86%

Metabolomics 16% 7% 10% 9% 7% 7% 56%

Fatty Acids 25% 39% 10% 74%

Oxylipins 8% 40% 7% 7% 9% 71%

Clinical 17% 15% 14% 9% 55%

629 630

ACS Paragon Plus Environment

26

Page 27 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676

Analytical Chemistry

Figure Legends Figure 1. Schematic of potential shared structure between data blocks. The six data blocks used in this study are shown with their respective numbers of variables. The diagram shows all possible shared structure connections between the data blocks. Figure 2. Correlation between model scores and metadata. Circle size and colour intensity are proportional to strength of correlation (larger and darker indicates strong correlation). Red, positive correlation; blue, negative correlation. Thick outline around box, significant correlation (p 1.0. Vertical lines are drawn to show MBVIOP > 1.0 threshold for all blocks in addition to the increased MB-VIOP thresholds of > 2.0 and > 2.5 for Transcriptomics in Components 1 and 2, respectively. Percentages reflect the amount of variance described by each component, for each data block. (A) Component 1. (B) Component 2. Figure 5. Chord plots showing between-block correlations. (A) Component 1. (B) Component 2. Chord plots were made by calculating the Spearman rank correlations for each pairwise comparison of variables meeting the MB-VIOP thresholds. Variables with a significant (p < 0.01) between-block correlation were presented in the chord plots. Nodes represent variables. Text colour is associated with block: grey, Transcriptomics; green, Metabolomics; yellow, Sphingolipids; blue, Fatty Acids; orange, Oxylipins; red, Clinical. The number of correlations associated with a given node are noted in parenthesis next to the name of the variable. Node colour represents direction of change. Component 1: blue, increased in asthma; red, decreased in asthma. Component 2: white, increased in females; black, increased in males. Chords represent correlations: yellow, positive correlation; purple, negative correlation. Each chord plot was constrained such that withinblock correlations were ignored. (---) denotes non-coding gene transcripts.

ACS Paragon Plus Environment

27

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

677 678

Page 28 of 33

For TOC Only

679

ACS Paragon Plus Environment

28

Page 29 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

Figure 1. Schematic of potential shared structure between data blocks. The six data blocks used in this study are shown with their respective numbers of variables. The diagram shows all possible shared structure connections between the data blocks.

ACS Paragon Plus Environment

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2. Correlation between model scores and metadata. Circle size and colour intensity are proportional to strength of correlation (larger and darker indicates strong correlation). Red, positive correlation; blue, negative correlation. Thick outline around box, significant correlation (p 1.0. Vertical lines are drawn to show MB-VIOP > 1.0 threshold for all blocks in addition to the increased MB-VIOP thresholds of > 2.0 and > 2.5 for Transcriptomics in Components 1 and 2, respectively. Percentages reflect the amount of variance described by each component, for each data block. (A) Component 1. (B) Component 2.

ACS Paragon Plus Environment

Page 32 of 33

Page 33 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

Figure 5. Chord plots showing between-block correlations. (A) Component 1. (B) Component 2. Chord plots were made by calculating the Spearman rank correlations for each pairwise comparison of variables meeting the MB-VIOP thresholds. Variables with a significant (p < 0.01) between-block correlation were presented in the chord plots. Nodes represent variables. Text colour is associated with block: grey, Transcriptomics; green, Metabolomics; yellow, Sphingolipids; blue, Fatty Acids; orange, Oxylipins; red, Clinical. The number of correlations associated with a given node are noted in parenthesis next to the name of the variable. Node colour represents direction of change. Component 1: blue, increased in asthma; red, decreased in asthma. Component 2: white, increased in females; black, increased in males. Chords represent correlations: yellow, positive correlation; purple, negative correlation. Each chord plot was constrained such that within-block correlations were ignored. (---) denotes non-coding gene transcripts.

ACS Paragon Plus Environment