Analysis of Isocratic Chromatographic Retention Data using Bayesian Multilevel Modeling

Article

Analysis of Isocratic Chromatographic Retention Data using Bayesian Multilevel Modeling. Łukasz Kubik, Roman Kaliszan, and Paweł Wiczling. Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b04033 • Publication Date (Web): 18 Oct 2018. Downloaded from http://pubs.acs.org on October 22, 2018



Łukasz Kubik, Roman Kaliszan, Paweł Wiczling*

Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Gen. J. Hallera 107, 80-416 Gdańsk, Poland

*Corresponding author's e-mail: [email protected]

Abstract

The objective of this work was to develop a multilevel (hierarchical) model based on isocratic reversed-phase high-performance liquid chromatographic data collected in methanol and acetonitrile for 58 chemical compounds. Such a multilevel model is a regression model of the analyte-specific chromatographic measurements in which all the regression parameters are given a probability model. It is fundamentally different from the most common approach, in which parameters are estimated separately for each analyte (without sharing information across analytes and organic modifiers).

The statistical analysis was done with the Stan software, which implements Bayesian inference with Markov Chain Monte Carlo sampling. During the model building process a series of multilevel models of different complexity were obtained: 1) a model with no pooling (separate models are fitted for each analyte); 2) a model with partial pooling (a common distribution for analyte-specific parameters); and 3) a model with partial pooling and a regression model relating analyte-specific parameters to analyte-specific properties (QSRR equations). All the models were compared with each other using 10-fold cross-validation.

The benefits of multilevel models in inference and prediction were shown. In particular, the obtained models allowed us i) to better understand the data and ii) to solve many routine analytical problems, e.g. to obtain well-calibrated predictions of the retention factor of an analyte in acetonitrile-containing mobile phases given no, one or several measurements in methanol-containing mobile phases, and vice versa.

Keywords

multi-level modeling, Bayesian statistics, liquid chromatography, QSRR

1. Introduction

The retention mechanism in reversed-phase high-performance liquid chromatography (RP HPLC) is a complicated process involving a great variety of interactions that are difficult to describe exactly [1]. Generally, the retention factor depends on the properties of the mobile phase, the stationary phase and the analyzed compounds, e.g. the polar and non-polar surface area of the analytes, the dielectric constant of the mobile phase, the surface properties of the packing material and other descriptors [2]. The complex nature of these interactions usually requires mathematical models to quantify the relationship between retention time and multiple method parameters, such as pH, temperature, buffer concentration and other conditions [1,3]. Such a model, when appropriately validated, can be of great help during method development by giving the analyst a means to predict chromatograms for a wide range of experimental conditions.

Models used in the field of chromatography are often built to describe the behavior of a single analyte (or of a set of analytes, modeled one at a time). Such models are certainly useful and serve their role in solving many problems encountered in the laboratory [4-7]. In this work we provide a generalization of these models to multilevel (hierarchical) models, which could further increase the role of predictive modeling in the field of chromatography. The basic idea is to take into account similarities between analytes, solvents or columns while developing a chromatographic model. As an example, consider isocratic chromatographic data collected in methanol (MeOH) and acetonitrile (ACN) containing mobile phases for a diverse set of analytes. One would generally approach these data by building a separate model for each analyte, for either MeOH or ACN, and then seek a relationship between analyte-specific chromatographic parameters, such as log kw, and analyte properties, such as log P or polar surface area (QSRR equations). This is not an optimal approach, as there is a lot of information that could be shared, both between analytes and between organic modifiers, e.g. the same log kw value regardless of organic modifier type, or the similarity of log kw values for analytes having similar log P values.

A multilevel model is a regression model of the individual (analyte-specific) chromatographic measurements in which all the model parameters (regression coefficients) are also given a probability model. These second-level parameters are also estimated from the data. Multilevel modeling is a well-known and commonly used mathematical technique, applied in many fields, e.g. in cancer studies [8], anesthesiology [9] or educational research [10]. It is still not a common tool for retention prediction; however, Bayesian inference itself has been reported to be a useful approach in chromatography [11-13]. Recently, multilevel modeling using the Stan software was described as a convenient method for describing gradient HPLC data [14].

In this work we re-analyzed isocratic RP HPLC data previously obtained in our department [15] for 58 chemical compounds. During the model building process a series of models of different complexity were proposed and compared: 1) a model with no pooling (separate models were fitted for each analyte); 2) a model with partial pooling (a common distribution for analyte-specific parameters); and 3) a model with partial pooling and a regression model between analyte-specific parameters and analyte-specific properties (QSRR equations). The multilevel models were implemented in the Stan software, which provides full Bayesian inference for continuous-variable models through Markov Chain Monte Carlo (MCMC) methods. The predictive performance of the proposed models was evaluated using posterior predictions, the Watanabe-Akaike Information Criterion (WAIC) and 10-fold cross-validation. We also illustrate the usefulness of the proposed models in predicting retention times in ACN-containing mobile phases given no, one or several measurements in MeOH-containing mobile phases, and vice versa.

2. Experimental Section

2.1 Chromatographic parameters

The data used to illustrate the main concept of multilevel models are taken from the article by Al-Haj et al. [15]. They were obtained using RP HPLC in the isocratic mode with UV detection. 58 drug-like chemical compounds, listed in the Supporting Information, were analyzed. The analytes had lipophilicity (MLOGP) that ranged from -0.2 to 5.1 and molecular mass that ranged from 78 to 270 g/mol. MeOH and ACN were used as organic modifiers in the mobile phase. The percent content of organic modifier in the mobile phase (φ) equaled 20, 25, 30, 35, 40, 45, 50, 60, 65, 70, 75, 80, 85, 90, 95 for MeOH and 20, 25, 30, 40, 50, 60, 65, 70, 75, 80 for ACN. For more details, see the original article. The raw data are attached in the Supporting Information and are also presented graphically in Supporting Figure S1.

2.2 Molecular modeling

In the original work [15], a set of structural descriptors was calculated for each analyte using the HyperChem software [16]. Three descriptors were used in modeling: total dipole moment (µ), maximum electron excess on the most charged atom (δmin) and water-accessible molecular surface area (AWAS).

Additionally, all 58 compounds were re-modelled using, in order: Open Babel 2.3.2 [17] (SMILES to MOL2 conversion), Discovery Studio Visualizer 16 [18] (preliminary geometry optimization using a Dreiding-like forcefield [17]), GaussView 3.09 [20] (preparation of Gaussian input files) and Gaussian 09 [21] (final structure optimization using the B3LYP method with the 6-31G basis set). Only 4-iodophenol was optimized using the STO-6G basis set, owing to the presence of the iodine atom. The Dragon 7 [22] software was used to calculate MLOGP and molar weight.

2.3 Multilevel modeling

Multilevel modeling was carried out using the Stan [23] / CmdStan 2.16 [24] software linked with Matlab® R2017b [25] through MatlabStan 2.15 [26]. Each model was computed with the following Stan settings: number of iterations = 1000, warmup = 1000, number of Markov chains = 4. The Stan code was based on the work of Margossian and Gillespie [27,28]. Exemplary Stan code can be found in the Supporting Information.

Once the model parameters have been determined, predictions (and the uncertainty around them) can be obtained for a new (not-yet-analyzed) analyte; these predictions take into account the information about the likely values of analyte-specific parameters (from the posterior distribution) and any set of experimental data. To assess the accuracy of such predictions, posterior predictive checks were used. Such checks are simply replicated datasets generated by using the model inference in the forward direction. These replicated datasets, when compared visually with the original data, make it possible to assess the model fit and the predictive capabilities of the model [29]. The predictive power of the models was assessed with the Watanabe-Akaike Information Criterion (WAIC), using the MatlabStan command mstan.waic. WAIC is conceptually similar to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC); here, the higher the WAIC value, the better the predictive performance of the model.

Since WAIC cannot assess model performance for new analytes (it approximates leave-one-measurement-out cross-validation), a 10-fold cross-validation (specifically leave-analytes-out cross-validation) was used instead. The analytes from the original data were randomly partitioned into 10 subsamples. Of the 10 subsamples, a single subsample was excluded from the analysis. The remaining 9 subsamples, plus none or a limited number of measurements from the excluded analytes, were used to obtain predictions for those excluded analytes. The cross-validation process was then repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. The results from the folds were combined and summarized as the log pointwise predictive density of cross-validation (LPDCV) and the root mean square error of cross-validation (RMSECV). LPDCV is a preferred way to evaluate the predictive accuracy of a Bayesian model. RMSECV is less appropriate for models that are far from the normal distribution. Details on how WAIC, LPDCV and RMSECV were calculated are provided in the Supplementary Information and in reference [30]. Six scenarios were considered during the cross-validation, depending on the number of measurements available for predictions: 1) no measurements (NONE); 2) a single MeOH measurement (1M); 3) a single ACN measurement (1A); 4) a single MeOH and a single ACN measurement (1M & 1A); 5) all ACN measurements (AllA); and 6) all MeOH measurements (AllM). They allowed us to assess the uncertainty of predictions when only a limited amount of experimental data is available.
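The leave-analytes-out partitioning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `analyte_folds` and `rmse`, the fixed random seed and the striding scheme are hypothetical choices made for this example and are not taken from the original Matlab/Stan workflow.

```python
import math
import random


def analyte_folds(analyte_ids, n_folds=10, seed=0):
    """Randomly partition analyte identifiers into n_folds disjoint subsamples."""
    ids = list(analyte_ids)
    random.Random(seed).shuffle(ids)
    # Striding through the shuffled list gives folds whose sizes differ by at most one.
    return [ids[f::n_folds] for f in range(n_folds)]


def rmse(observed, predicted):
    """Root mean square error between paired observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


folds = analyte_folds(range(58))  # 58 analytes, as in the data set
for held_out in folds:
    training = [a for fold in folds if fold is not held_out for a in fold]
    # A model would be fitted to `training` (plus any permitted measurements of
    # the held-out analytes) and used to predict the held-out analytes here.
    assert len(training) + len(held_out) == 58
```

Partitioning by analyte rather than by measurement is what makes the procedure test predictions for analytes the model has not seen.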

3. Model development procedure

3.1 The classical approach proposed by Al-Haj et al. [15]

In the original approach presented by Al-Haj et al. [15], a separate model was fitted for each analyte. The structural model assumed a simple Snyder-Soczewiński model for MeOH and ACN that can be described by the following equation:

$$\log k_{ijk} = \log k_{w,ik} - S_{1,ik} \cdot \varphi_j \tag{1}$$

where j = 1…J denotes the jth (out of J) mobile phase composition; i = 1…nAnalytes denotes the ith (out of nAnalytes) analyte; k = 1, 2 denotes MeOH (k = 1) or ACN (k = 2); log kw,ik denotes a chromatographic measure of hydrophobicity (analyte and organic modifier specific), which is basically the retention factor corresponding to zero content of the organic modifier (i.e. neat water); and S1,ik is the slope coefficient (also analyte and organic modifier specific), which can be understood as the apparent difference between the retention factors in water and in MeOH or ACN.
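As a minimal illustration of Eq. 1, the sketch below evaluates the linear Snyder-Soczewiński relationship for one analyte in one organic modifier. The function name `log_k_linear`, the parameter values and the use of φ as a volume fraction (0 to 1) rather than a percentage are assumptions made for this example, not values from the original work.

```python
def log_k_linear(log_kw: float, s1: float, phi: float) -> float:
    """Eq. 1: log k = log kw - S1 * phi (phi given as a volume fraction, 0..1)."""
    return log_kw - s1 * phi


# Hypothetical analyte: log kw = 3.0 and S1 = 4.0 (illustrative values only).
for phi in (0.2, 0.4, 0.6, 0.8):
    print(f"phi = {phi:.1f}  log k = {log_k_linear(3.0, 4.0, phi):.2f}")
```

With φ expressed as a fraction, S1 is the drop in log k between neat water (φ = 0) and neat organic modifier (φ = 1), which matches the interpretation of the slope given above.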

The observed retention factors (log kObs) were further modeled according to:

$$\log k_{Obs,z} \sim N\left(\log k_{i[z]j[z]k[z]},\ \sigma_{i[z]k[z]}\right) \tag{2}$$

where z = 1…nObs denotes the zth (out of nObs) measurement; N denotes the normal distribution with the mean given by Eq. (1) and standard deviation σik; and a tilde (~) denotes "has the probability distribution of", i.e. the values of log kObs are randomly drawn from the given (in this case normal) distribution. Standard deviations are conventionally assumed to be analyte and organic modifier specific.

In the next modeling step, QSRR relationships were proposed separately for log kw,i1 and log kw,i2 by assuming a linear relationship between log kw and a set of predictors (descriptors), e.g. lipophilicity (log P), or total dipole moment (µi), maximum electron excess on the most charged atom (δmin,i) and water-accessible molecular surface area (AWAS,i):

$$\log k_{w,ik} \sim N\left(\theta_{\log k_w,k} + \beta_{1,k} \log P_i,\ \omega_{\log k_w,k}\right) \tag{3}$$

$$\log k_{w,ik} \sim N\left(\theta_{\log k_w,k} + \beta_{1,k}\,\mu_i + \beta_{2,k}\,\delta_{min,i} + \beta_{3,k}\,A_{WAS,i},\ \omega_{\log k_w,k}\right) \tag{4}$$

where θlogkw,k is the retention factor for an analyte with descriptors equal to zero, ω is the scale parameter, and β1,k-β3,k are regression coefficients (different for MeOH and ACN).

Such a modeling approach has several weaknesses: i) log kw is assumed to be different for MeOH and ACN; ii) two independent QSRR equations for log kw were proposed for MeOH and ACN; iii) QSRR equations for other parameters (e.g. for S1,ik) were not explored; iv) the two-stage approach (the estimation of QSRR equations conditional on the estimated log kw,ik values) does not properly take into account the uncertainty of log kw,ik, which differs depending on the degree of extrapolation; and v) the correlations between analyte-specific parameters were not explored. There is no need to make such simplifications during the model building process, as shown in the subsequent sections.

3.2 Model with partial pooling (a common distribution for analyte-specific parameters)

In what follows we assume a more realistic non-linear relationship between log k and the organic modifier content (the equation of Neue et al. [31]):

$$\log k_{ijk} = \log k_{w,i} - \frac{S_{1,ik} \cdot \varphi_j}{1 + S_{2,ik} \cdot \varphi_j} \tag{5}$$

where S2,ik is the curvature coefficient of the ith analyte for MeOH (k = 1, equivalent notation S2m,i) and ACN (k = 2, equivalent notation S2a,i). Please note that log kw is the same for MeOH and ACN, as it should be. For convenience, this equation was reparametrized in terms of the retention factors in neat MeOH and ACN (log km and log ka), noting that:

$$\log k_{m,i} = \log k_{w,i} - \frac{S_{1,i1}}{1 + S_{2,i1}} \tag{6}$$

$$\log k_{a,i} = \log k_{w,i} - \frac{S_{1,i2}}{1 + S_{2,i2}} \tag{7}$$

The retention factor in neat MeOH or ACN has a more natural interpretation than the slope.
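The reparametrization can be made concrete with a short sketch. Rearranging Eq. 6 (or Eq. 7) gives S1 = (log kw - log k_org)(1 + S2), which is then substituted into Eq. 5. The function name `neue_log_k`, the parameter values and the use of φ as a volume fraction (so that φ = 1 is neat organic modifier) are assumptions made for this illustration.

```python
def neue_log_k(log_kw: float, log_k_org: float, s2: float, phi: float) -> float:
    """Eq. 5 written in terms of log k in neat organic modifier (Eq. 6 or 7).

    phi is the organic modifier content as a volume fraction (0..1).
    """
    # Rearranging Eq. 6/7 gives S1 = (log kw - log k_org) * (1 + S2).
    s1 = (log_kw - log_k_org) * (1.0 + s2)
    return log_kw - s1 * phi / (1.0 + s2 * phi)


# Hypothetical analyte: log kw = 3.0, log km = -1.0, S2m = 0.5 (illustrative only).
for phi in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"phi = {phi:.2f}  log k = {neue_log_k(3.0, -1.0, 0.5, phi):.3f}")
```

By construction the curve passes through log kw at φ = 0 and through log km (or log ka) at φ = 1, so both end points are directly interpretable parameters.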

The observed retention factors (log kObs) were modeled similarly to the previous approach:

$$\log k_{Obs,z} \sim N\left(\log k_{i[z]j[z]k[z]},\ \sigma\right) \tag{8}$$

where z denotes the zth measurement and N denotes the normal distribution with the mean given by Eq. (5) and standard deviation σ. This time a common standard deviation is used. This assumption can be relaxed if needed.

Multilevel modeling makes it possible to specify a range of second-level models for the analyte-specific parameters (log kw,i, log km,i, log ka,i, ln S2m,i, ln S2a,i):

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} \\ \theta_{\log k_m} \\ \theta_{\log k_a} \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \tag{9}$$

where MST denotes the multivariate Student t distribution, θ is the mean value of a parameter, ν is a normality parameter, and Ω denotes a variance-covariance matrix. In particular, θlogkw, θlogkm, θlogka, θlnS2m and θlnS2a denote typical values of the parameters. The use of a multivariate distribution makes it possible to model the correlation between analyte-specific parameters. Please note that some correlations (especially between log kw,i, log km,i and log ka,i) are expected in a chromatographic system. S2m and S2a were modeled on a logarithmic scale to ensure positive values.

The model for log kObs is the same as previously. The following priors were assigned based on literature findings and our judgment, and are part of the model assumptions [1,32,33]:

$$\theta_{\log k_w} \sim N(2, 5) \tag{10}$$

$$\theta_{\log k_m} \sim N(0, 5) \tag{11}$$

$$\theta_{\ln S_{2m}} \sim N(\ln(0.2), 0.5) \tag{12}$$

$$\theta_{\log k_a} \sim N(0, 5) \tag{13}$$

$$\theta_{\ln S_{2a}} \sim N(\ln(2), 0.5) \tag{14}$$

In this work we decided to use weakly informative priors that do not place too much probability in any particular interval (which would favor those values). θlogkw was assumed to be 2 ± 5; thus, without any data, we think that the typical analyte will have log kw,i in a range from -8 to 12 (± 2 SD around the mean). Similarly, log km,i and log ka,i were assumed to be in a range from -10 to 10. In the case of the θlnS2m and θlnS2a parameters, the prior means were based on the literature data [1,32,33] with a coefficient of variation of 50%; thus S2m,i and S2a,i were assumed to be in a range from 0.09 to 0.46 for MeOH and from 0.89 to 4.4 for ACN. Please note that these priors can easily be changed to reflect additional knowledge.
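The ± 2 SD ranges quoted for the log k priors follow directly from the prior means and standard deviations. The helper below is a hypothetical convenience function written for this illustration; it covers only the normal priors on the log k scale.

```python
def two_sd_range(mean: float, sd: float) -> tuple:
    """Return the interval mean +/- 2 standard deviations."""
    return (mean - 2.0 * sd, mean + 2.0 * sd)


log_kw_range = two_sd_range(2.0, 5.0)  # prior N(2, 5) on the typical log kw
log_km_range = two_sd_range(0.0, 5.0)  # prior N(0, 5) on the typical log km (same for log ka)
print(log_kw_range, log_km_range)      # (-8.0, 12.0) (-10.0, 10.0)
```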

Further, we decomposed the prior on the covariance matrix into a scale (ω) and a correlation matrix (ρ) according to the formula:

$$\Omega = \mathrm{diag}(\omega)\,\rho\,\mathrm{diag}(\omega) \tag{15}$$

where ω and ρ were given the following priors:

$$\omega_{\log k_w},\ \omega_{\log k_m},\ \omega_{\log k_a},\ \omega_{\ln S_{2m}},\ \omega_{\ln S_{2a}} \sim N^{+}(0, 5) \tag{16}$$

$$\rho \sim LKJ(1) \quad (5 \times 5 \text{ matrix}) \tag{17}$$

where N+ denotes the half-normal distribution and LKJ denotes the Lewandowski, Kurowicka, and Joe distribution [34]. In this case LKJ(1) ensures that the density is uniform over correlation matrices of order 5. The priors for the standard deviation of the residuals and for the degrees of freedom of the MST distribution were:

$$\sigma \sim N^{+}(0, 1) \tag{18}$$

$$\nu \sim gamma(2, 0.1) \tag{19}$$

thus favoring a normal distribution. The use of the Student t distribution ensures robustness. It was required because some analytes differ considerably from typical ones.
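The decomposition of the covariance matrix into scales and a correlation matrix can be sketched as follows. The function name and the 2 x 2 numbers are hypothetical and chosen only to keep the example small; the model in the text uses a 5 x 5 matrix.

```python
def covariance_from_scale_and_correlation(omega, rho):
    """Return Omega = diag(omega) * rho * diag(omega) as a list of rows."""
    n = len(omega)
    return [[omega[i] * rho[i][j] * omega[j] for j in range(n)] for i in range(n)]


# Hypothetical 2 x 2 example (illustrative values only).
omega = [0.5, 2.0]
rho = [[1.0, 0.8],
       [0.8, 1.0]]
cov = covariance_from_scale_and_correlation(omega, rho)
print(cov)
```

Separating scales from correlations lets each be given its own prior, which is why the half-normal and LKJ priors can be specified independently.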

3.3 Model with no pooling

A no-pooling approach is obtained by fixing all omegas to a large value. This is equivalent to assuming that no information is shared between different analytes; thus, all analyte-specific parameters are essentially estimated from the analyte-specific data alone. In this work the following values were assumed: ωlog kw, ωlog km, ωlog ka, ωlnS2m, ωlnS2a = 10, ρ is the identity matrix of size 5, and ν equals 20, with the rest of the code being similar to that in the previous section.

3.4 Model with partial pooling and a regression model between analyte-specific parameters and analyte-specific properties

The model with partial pooling can be further extended by adding predictors (descriptors) that explain part of the inter-analyte variability. As an example, the relationship between log kw,i, log ka,i, log km,i and lipophilicity (MLOGPi) or molecular mass (MMOLi) can be proposed as follows:

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} + \beta_{1,\log k_w} \cdot (MLOGP_i - 2.34) \\ \theta_{\log k_m} + \beta_{1,\log k_m} \cdot (MLOGP_i - 2.34) \\ \theta_{\log k_a} + \beta_{1,\log k_a} \cdot (MLOGP_i - 2.34) \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \tag{20}$$

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} + \beta_{2,\log k_w} \cdot (MMOL_i - 150) \\ \theta_{\log k_m} + \beta_{2,\log k_m} \cdot (MMOL_i - 150) \\ \theta_{\log k_a} + \beta_{2,\log k_a} \cdot (MMOL_i - 150) \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \tag{21}$$

Such a relationship is consistent with the expected similarity between log k and MLOGP, and between log k and molecular mass, as the latter is correlated with log P. In this case the following priors were used: β1 ~ N(1.00, 0.50) and β2 ~ N(0.02, 0.01), thus assuming that the relationship between MLOGP and log kw, log km and log ka is linear with a slope close to one, and that the relationship between molecular mass and log kw, log km and log ka is linear with a slope and standard deviation 50 times smaller (where 50 is the standard deviation of the molecular masses of the analytes).
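The second-level mean for a single parameter can be sketched as a centered linear predictor. The function name `typical_parameter` is hypothetical; the numbers used below (θlogkw = 3.20, β1,logkw = 1.30, centering value 2.34) are the posterior means reported in the Results section for the Pooled-log P model, and the MLOGP value of 3.34 is an arbitrary illustration.

```python
def typical_parameter(theta: float, beta: float, descriptor: float, center: float) -> float:
    """Second-level mean: theta + beta * (descriptor - center)."""
    return theta + beta * (descriptor - center)


# Typical log kw for a hypothetical analyte with MLOGP = 3.34,
# i.e. one log P unit above the centering value of 2.34.
log_kw_typical = typical_parameter(theta=3.20, beta=1.30, descriptor=3.34, center=2.34)
print(f"typical log kw = {log_kw_typical:.2f}")
```

Centering the descriptor means that θ is directly the typical parameter value for an analyte with MLOGP = 2.34 (or a molecular mass of 150), which keeps the priors on θ interpretable.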

4. Results and discussion

In this work we presented a series of multilevel models obtained during the analysis of isocratic data collected for 58 compounds in MeOH and ACN. The proposed models provide a unified description of the whole dataset. This is in contrast to “classical” methods of analyzing chromatographic data, which tend to ignore the hierarchical structure of the data and perform the analysis at the analyte level only.

Figure 1 and Figure 2 show, respectively, the individual predictions (predictions corresponding to future observations on the same analyte) and the typical predictions (predictions corresponding to future observations of a new analyte) for 5 representative compounds selected based on the accuracy of the fit (from the worst to the best, based on the root mean square error of the Pooled-log P model). The individual fits are satisfactory for all the considered models. They are also much better than for the original model that assumed the Snyder-Soczewiński equation (data not shown). The similarity of the models is also confirmed by the WAIC, LPDCV (all data) and RMSECV (all data) measures, which are essentially identical (Table 1). This means that, given all the observations, we can predict analyte retention equally well using any of the presented models. This is not true when trying to predict the retention factor of an analyte for which no experimental data are available. Such typical predictions are presented in Figure 2. The typical predictions show much higher uncertainty than the individual predictions, as there is less information about the retention factor of an analyte without any measurements. In this case only the information from other analytes and from predictors, such as log P or molecular mass, can be taken into account. In our analysis the predictive performance of the tested models, expressed as LPDCV, was -232.3 for Pooled-log P, -1436.7 for Pooled-Mmol, -1775.1 for Pooled and finally -4809.6 for the Unpooled model. In the case of an analyte with no measurements, information about log P leads to more accurate predictions than molecular mass.

The goodness-of-fit plots are presented in Figure 3. These plots show the relationship between the observations and the model predictions (typical and individual) and make it possible to assess the calibration accuracy and sharpness of the predictions [35]. For individual predictions, both the accuracy (whether the points are close to the line of unity over the whole range of measurements) and the sharpness (the spread of the points around the line of identity) are excellent for all the tested models. The situation is different for the typical predictions. The accuracy and sharpness are reasonable for the Pooled models with predictors (log P reduces uncertainty more than molecular mass). Please note that the calibration is problematic for the Unpooled and Pooled models, which means that those models should be avoided for predictions when there is no information on analyte properties available.

Supporting Table S1 presents a summary of the marginal posterior distributions of the model parameters. These parameters summarize all the important features of the data and can be used by others to predict retention factors of new analytes (for a column and analytes similar to those used to develop the model).

The mean normality parameter (ν) equals 2.60 and 2.90, depending on the model. Low values of the normality parameter indicate that the studied MST distribution has heavy tails; thus, there are analytes that are considerably different from the typical ones. These analytes have an unusual retention time profile. 4-Aminophenol is an example of such a compound, as it has higher retention in ACN than in MeOH over the whole range of organic modifier contents (Figure 1).

The log P value of a typical analyte in our dataset is 2.34. For such an analyte the estimated log k value equals 3.20 for neat water (log kw), -0.71 for neat MeOH (log km) and -1.00 for neat ACN (log ka). The calculated curvature coefficients (S2) equal 0.59 and 1.40 for MeOH and ACN, respectively. These numbers are close to the literature values [1,32,33]. Also, a strong correlation (0.74-0.76) between the S2 parameters for MeOH and ACN (ρlnS2M, lnS2A) is observed. The mean β1,logkw in the Pooled-log P model is close to one (1.30). The mean β1,logkm and β1,logka are much smaller (0.29 and 0.21, respectively); however, the trend with log P is evident. A similar situation can be observed for the β2 parameter: the mean β2,logkw (0.029) in the Pooled-Mmol model is close to the prior value (0.02) and β2 for MeOH and ACN are much smaller (0.0058 and 0.0043, respectively), which is a consequence of the correlation between log P and molecular mass.

Correlations between the log k parameters corresponding to single-component mobile phases (water, MeOH and ACN) are evident, in particular for the Pooled model (ρlogkm, logka = 0.84, ρlogkm, logkw = 0.78, ρlogka, logkw = 0.79). For the Pooled-log P and Pooled-Mmol models this correlation persists even after including common predictors (0.76, 0.71, 0.74, and 0.53, 0.24, 0.24, respectively). If there is a strong correlation between some model parameters, then information on one parameter can give information about the likely value of the other parameter, e.g. the correlation between log ka and log km indicates that knowing the retention times of an analyte in the water-MeOH system gives the analyst information on its retention in the water-ACN system. Thus, understanding the correlation can help to make more reliable predictions.
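One way to see how a correlation carries information from one solvent system to the other is the standard conditional-mean formula for an elliptical distribution. The expression below is a simplified sketch, not the computation used in this work: it assumes that log km,i is known exactly and considers only the pair (log km,i, log ka,i), whereas the models here propagate the full posterior uncertainty. For a multivariate Student t distribution with ν > 1 the conditional mean has the same linear form as for the normal distribution.

```latex
E\left[\log k_{a,i} \mid \log k_{m,i}\right]
  = \theta_{\log k_a}
  + \rho_{\log k_m,\,\log k_a}\,
    \frac{\omega_{\log k_a}}{\omega_{\log k_m}}
    \left(\log k_{m,i} - \theta_{\log k_m}\right)
```

With a correlation of 0.84, as estimated for the Pooled model, an analyte whose log km lies above the typical value is expected to have a log ka above the typical value as well, by an amount scaled by the ratio of the two between-analyte standard deviations.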


Supporting Figure S2 presents the individual (analyte-specific) parameters for 5 representative analytes as obtained by the four considered models. The regularization imposed by the multilevel model is immediately visible when comparing the uncertainty of the parameters obtained using the Unpooled and Pooled models. With the pooled approach the uncertainty is reduced and the individual values are “shrunk” toward the typical values. This is especially visible for parameters that are difficult to estimate (such as ln S2m,i and ln S2a,i).
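The shrinkage described above can be illustrated with the simplest conjugate case. This is only a sketch under strong assumptions (a single normally distributed parameter with known measurement and between-analyte standard deviations), not the multilevel model fitted in this work; the function name and all numbers are hypothetical.

```python
def partial_pooling_estimate(y_mean, n_obs, sigma, mu_pop, tau):
    """Posterior mean and sd of one analyte's parameter (normal-normal model).

    y_mean : mean of the analyte's own n_obs measurements
    sigma  : measurement sd (assumed known)
    mu_pop : population (typical) value
    tau    : between-analyte sd (assumed known)
    """
    data_precision = n_obs / sigma ** 2
    prior_precision = 1.0 / tau ** 2
    # Weight given to the analyte's own data; the rest goes to the typical value
    weight = data_precision / (data_precision + prior_precision)
    post_mean = weight * y_mean + (1.0 - weight) * mu_pop
    post_sd = (data_precision + prior_precision) ** -0.5
    return post_mean, post_sd


# One noisy measurement: the estimate moves halfway toward the typical value
print(partial_pooling_estimate(y_mean=2.0, n_obs=1, sigma=1.0, mu_pop=0.0, tau=1.0))
# Ten measurements: the analyte's own data dominate and shrinkage is small
print(partial_pooling_estimate(y_mean=2.0, n_obs=10, sigma=1.0, mu_pop=0.0, tau=1.0))
```

In this toy setting the unpooled estimate would have a standard deviation of sigma/sqrt(n_obs); the pooled posterior standard deviation is always smaller, and the difference is largest when the analyte's own data are scarce.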

Multilevel models proposed in this work are a natural framework for understanding chromatographic data. They are also useful for predicting retention times when chromatographic data are limited, a situation that is usually of interest to the analyst. In this work we illustrated this concept by predicting retention times under limited access to the chromatographic data. Specifically, access to no measurements (None), a single MeOH (1M), a single ACN (1A), a single MeOH and a single ACN (1M & 1A), all MeOH (AllM) and all ACN (AllA) measurements was considered. Figure 4 illustrates the agreement between predictions (using cross-validation) and observations for all the models and the 6 considered scenarios. It can be used to assess accuracy and sharpness, as was done previously. The predictive performance is summarized in Table 1. The predictions along with their uncertainty are shown in Supporting Figures S-3 and S-4 for 4-aminophenol (the worst fit, with an unusual retention time profile) and xanthene (the best fit), respectively.

It is clear that more experimental data lead to more accurate predictions. Interestingly, the pooled models are reasonably well calibrated whenever at least one measurement is available for predictions. Still, the sharpness of predictions is slightly better once predictors (molecular mass or log P) are included in the model. Because pooling applies knowledge about the whole population of analytes, adding a single experimental point for just one mobile phase (MeOH- or ACN-containing) results in a significant reduction of the uncertainty around predictions for both eluents. Such an effect is not observed for the Unpooled model. In this case, access to experimental data in one of the organic modifiers improves the predictions only for that specific eluent, without influencing the other. The degree of uncertainty reduction can be assessed from the LPDCV and RMSECV measures in Table 1 and by visual inspection of Supporting Figures S-3 and S-4.


The LPDCV can be used directly to compare the models (higher values denote higher predictive performance). In general, when comparing all the predictive accuracy measures (WAIC, LPDCV, RMSECV), goodness-of-fit plots and cross-validation plots, the Pooled-log P and Pooled-Mmol models lead to useful, well-calibrated predictions in situations where limited data are available for predictions. The Unpooled model has the worst predictive accuracy unless a large number of analyte-specific measurements are available.
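For readers who want to reproduce measures of this kind, the sketch below shows one common way to compute a log pointwise predictive density and a root mean square error from held-out data. It assumes that the log-likelihood of each held-out observation has already been evaluated for every posterior draw; the function names and the toy numbers are hypothetical and do not reproduce the values in Table 1.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def lpd(log_lik):
    """Log pointwise predictive density.

    log_lik[s][i] = log p(y_i | theta_s) for posterior draw s and
    held-out observation i. The density is averaged over draws for each
    observation, then the logs are summed over observations.
    """
    n_obs = len(log_lik[0])
    return sum(log_mean_exp([draw[i] for draw in log_lik]) for i in range(n_obs))


def rmse(observed, predicted):
    """Root mean square error between observations and point predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Toy example: 2 posterior draws, 2 held-out observations
toy_log_lik = [[math.log(0.5), math.log(0.2)], [math.log(0.5), math.log(0.4)]]
print(lpd(toy_log_lik))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

In a k-fold setting the same functions would be applied to the observations left out in each fold and the per-fold LPD values summed, so that every observation contributes exactly once.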

5. Conclusions

In this work we proposed and compared several multilevel chromatographic retention models. Such models allow information to be shared efficiently between analytes and organic modifiers; this information can then be used to predict retention and the associated uncertainty for new analytes or for analytes with only a few measurements available. In particular, they can help to solve many practical analytical problems, such as predicting retention times in ACN-containing mobile phases given no, one or several measurements in MeOH-containing mobile phases.

Bayesian multilevel modeling is mathematically sophisticated; thus, it has rarely been applied in the field of analytical chemistry. Nevertheless, recent advances in state-of-the-art platforms for statistical modeling and high-performance statistical computation (Stan's probabilistic programming language) make these models attractive for solving everyday separation prediction problems encountered by chromatography practitioners.

6. References

(1) Nikitas, P.; Pappa-Louisi, A., J. Chromatogr. A 2009, 1216, 1737-1755.
(2) Gritti, F.; Guiochon, G., Anal. Chem. 2005, 77, 4257-4272.
(3) Kazakevich, Y.V.; LoBrutto, R.; Chan, F.; Patel, T., J. Chromatogr. A 2001, 913, 75-87.
(4) DryLab®, Molnár-Institute for applied chromatography, Berlin, Germany, molnar-institute.com.
(5) Leweke, S.; von Lieres, E., Comput. Chem. Eng. 2018, 113, 274-294.
(6) Wen, Y.; Talebi, M.; Amos, R.I.J.; Szucs, R.; Dolan, J.W.; Pohl, C.A.; Haddad, P.R., J. Chromatogr. A 2018, 1541, 1-11.
(7) Vander Heyden, Y.; Perrin, C.; Massart, D.L., Handbook of Analytical Separations, 2000, 1, 163-212.
(8) Zahnd, W.E.; McLafferty, S.L., Ann. Epidemiol. 2017, 27, 739-748.
(9) Hastings, R.H.; Glaser, D., Anesth. Analg. 2011, 113, 877-887.
(10) Schreiber, J.B.; Griffin, B.W., J. Educ. Res. 2004, 98, 24-33.
(11) Wiczling, P.; Kaliszan, R., Anal. Chem. 2016, 88, 997-1002.
(12) Wiczling, P.; Kubik, Ł.; Kaliszan, R., Anal. Chem. 2015, 87, 7241-7249.
(13) Barcaru, A.; Mol, H.G.J.; Tienstra, M.; Vivó-Truyols, G., Anal. Chim. Acta 2017, 983, 76-90.
(14) Wiczling, P., Anal. Bioanal. Chem. 2018, 410, 3905-3915.
(15) Al-Haj, M.A.; Kaliszan, R.; Nasal, A., Anal. Chem. 1999, 71, 2976-2985.
(16) HyperChem™, Hypercube Inc., Waterloo, ON, Canada, 1999.
(17) O'Boyle, N.; Banck, M.; James, C.A.; Morley, C.; Vandermeersch, T.; Hutchison, G.R., J. Cheminform. 2011, 3, 33.
(18) Dassault Systèmes BIOVIA, Discovery Studio Visualizer, Release 16.1, San Diego: Dassault Systèmes, 2015.
(19) Hahn, M., J. Med. Chem. 1995, 38, 2080-2090.
(20) GaussView, Version 3.09, Roy Dennington, Todd Keith and John Millam, Semichem Inc., Shawnee Mission, KS, 2009.
(21) Gaussian 09, Revision A.02, M. J. Frisch et al., Gaussian, Inc., Wallingford, CT, 2009.
(22) Kode srl, Dragon (software for molecular descriptor calculation) version 7.0.6, 2016, https://chm.kode-solutions.net.
(23) Carpenter, B.; Gelman, A.; Hoffman, M.D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; Riddell, A., J. Stat. Softw. 2017, 76, 1-32.
(24) Stan Development Team. 2017. CmdStan: the command-line interface to Stan, Version 2.16.0, http://mc-stan.org.
(25) MATLAB® and Statistics Toolbox Release R2017b, The MathWorks®, Inc., Natick, Massachusetts, United States.
(26) Stan Development Team. 2017. MatlabStan: the MATLAB interface to Stan, http://mc-stan.org.
(27) https://github.com/stan-dev/stan/wiki/Complex-ODE-Based-Models.
(28) http://mc-stan.org/events/stancon2017-notebooks/stancon2017-margossian-gillespie-ode.html.
(29) Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B., Bayesian data analysis, 2nd ed.; Chapman & Hall/CRC Texts in Statistical Science: Boca Raton, 2004.
(30) Vehtari, A.; Gelman, A.; Gabry, J., Stat. Comput. 2017, 27, 1413-1432.
(31) Neue, U.D.; Phoebe, C.H.; Tran, K.; Cheng, Y.F.; Lu, Z., J. Chromatogr. A 2001, 925, 49-67.
(32) Pappa-Louisi, A.; Nikitas, P.; Balkatzopoulou, P.; Malliakas, C., J. Chromatogr. A 2004, 1033, 29-41.
(33) Snyder, L.R.; Kirkland, J.J.; Dolan, J.W., Introduction to modern liquid chromatography, 3rd ed.; Wiley-Blackwell: Oxford, 2010.
(34) Lewandowski, D.; Kurowicka, D.; Joe, H., J. Multivar. Anal. 2009, 100, 1989-2001.
(35) Gneiting, T.; Balabdaoui, F.; Raftery, A.E., J. Royal Stat. Soc., Series B 2007, 69, 243-268.

7. Acknowledgements

This project was supported by the National Science Centre, Poland (grant 2015/18/E/ST4/00449) and by funds of the Polish Ministry of Science and Higher Education granted for the development of young scientists (participants of doctoral studies), no. 01-0224/08/529. Part of the calculations were carried out at the Academic Computer Centre in Gdańsk.

8. Conflict of Interest Disclosure

The authors declare no competing financial interest.


Tables

Table 1. Summary of the predictive performance measures used to compare the tested models. The predictive performance (log pointwise predictive density (LPDCV) and root mean square error (RMSECV)) was assessed based on 10-fold cross-validation for 4 tested models and 6 prediction scenarios (based on no measurements (None), a single MeOH measurement (1M), a single ACN measurement (1A), a single MeOH and a single ACN measurement (1M & 1A), all ACN measurements (AllA) and all MeOH measurements (AllM)). WAIC is an approximation of leave-one-out measurement cross validation.

WAIC of the models (All Data), WAIC (ΔWAIC): -5823.7 (0); -5814.2 (-9.5); -5822.7 (-1.0); -5832.5 (8.7)

LPDCV (LPDCV for MeOH data/LPDCV for ACN data) after 10-fold cross validation

Available data | Unpooled | Pooled | Pooled-log P | Pooled-Mmol
None | -4809.6 (-3073.5/-1736.2) | -1775.1 (-1161.4/-613.7) | -232.3 (-198.5/-33.8) | -1436.7 (-955.4/-481.3)
1M | -3168.0 (-1509.5/-1658.5) | 928.4 (753.2/175.2) | 931.5 (751.1/180.4) | 1027.3 (828.1/199.2)
1A | -3551.7 (-3015.3/-536.4) | 773.4 (228.7/544.7) | 893.6 (308.8/584.8) | 948.5 (222.3/726.1)
1M & 1A | -1773.7 (-1327.4/-446.3) | 1373.4 (845.8/527.6) | 1482.5 (855.6/626.9) | 1636.7 (914.3/722.3)
AllM | 429.4 (1962.1/-1532.7) | 2222.7 (1948.3/274.5) | 2249.0 (1946.1/302.9) | 2236.0 (1949.3/286.7)
AllA | -1692.8 (-2902.3/1209.6) | 1478.8 (262.5/1216.2) | 1540.0 (326.9/1213.1) | 1503.0 (288.8/1214.1)
All Data | 3171.7 (1962.1/1209.6) | 3164.5 (1948.3/1216.2) | 3159.2 (1946.1/1213.1) | 3163.4 (1949.3/1214.1)

RMSECV (RMSECV for MeOH data/RMSECV for ACN data) after 10-fold cross validation

Available data | Unpooled | Pooled | Pooled-log P | Pooled-Mmol
None | 0.81 (0.85/0.72) | 0.78 (0.82/0.73) | 0.30 (0.33/0.26) | 0.62 (0.64/0.59)
1M | 0.48 (0.36/0.63) | 0.18 (0.19/0.18) | 0.18 (0.18/0.17) | 0.18 (0.17/0.19)
1A | 0.98 (1.2/0.34) | 0.18 (0.22/0.094) | 0.18 (0.21/0.096) | 0.18 (0.21/0.091)
1M & 1A | 0.36 (0.42/0.20) | 0.16 (0.19/0.087) | 0.15 (0.17/0.091) | 0.15 (0.17/0.086)
AllM | 0.37 (0.033/0.62) | 0.12 (0.034/0.19) | 0.11 (0.034/0.17) | 0.11 (0.033/0.18)
AllA | 0.79 (0.99/0.024) | 0.20 (0.25/0.024) | 0.20 (0.25/0.024) | 0.20 (0.25/0.024)
All Data | 0.03 (0.033/0.024) | 0.03 (0.033/0.024) | 0.03 (0.033/0.024) | 0.03 (0.033/0.024)
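The WAIC reported in Table 1 can be approximated from posterior draws along the following lines. This is a minimal sketch of the commonly used definition (log pointwise predictive density minus a variance-based penalty); the multiplication by -2 (deviance scale) is an assumption, since the scale used in this work is not stated in this section, and the function names and toy input are hypothetical.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """WAIC on the deviance scale (assumed here): -2 * (lppd - p_waic).

    log_lik[s][i] = log p(y_i | theta_s) for posterior draw s and
    observation i; at least two draws are required for the variance.
    """
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd += log_mean_exp(column)
        # Effective number of parameters: variance of the log-likelihood over draws
        p_waic += variance(column)
    return -2.0 * (lppd - p_waic)


# Toy input: two identical draws, so the penalty term is zero
toy = [[math.log(0.5), math.log(0.25)], [math.log(0.5), math.log(0.25)]]
print(waic(toy))
```

Unlike the cross-validated measures, this quantity is computed from a single fit to all the data, which is why it appears only in the "All Data" row.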


Figures

Figure 1. Predicted (posterior median (line) and 95% credible intervals (shaded area)) and observed retention factors (dots) for 5 representative analytes. Predictions correspond to future observations on the same analyte, i.e. posterior predictions conditioned on the observed data from the same analyte. Black corresponds to MeOH, whereas red corresponds to ACN.

Figure 2. Predicted (posterior median (line) and 95% credible intervals (shaded area)) and observed retention factors (dots) for 5 representative analytes. Predictions correspond to future observations of a new analyte, i.e. posterior predictive distributions. Black corresponds to MeOH, whereas red corresponds to ACN. For the Unpooled model the median is not smooth due to large posterior variability.

Figure 3. Goodness-of-fit plots of the four considered models: the observed versus the mean typical predicted retention factors (the a posteriori mean of the predictive distribution corresponding to future observations of a new analyte) and the observed versus the mean individual predicted retention times (the a posteriori mean of the predictive distribution conditioned on the observed data from the same analyte). The black symbols denote MeOH and the red symbols denote ACN.

Figure 4. Goodness-of-fit plots of the four considered models after 10-fold cross-validation. The graph shows the observed versus the mean predicted retention factors (the a posteriori mean of the predictive distribution conditioned on part of the observed data from the same analyte; specifically none, a single MeOH (1M), a single ACN (1A), a single MeOH and a single ACN (1M & 1A), all MeOH (AllM) and all ACN (AllA) measurements). The red symbols denote predictions for ACN and the black symbols denote predictions for MeOH.

