Article
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b04033 • Publication Date (Web): 18 Oct 2018
Analytical Chemistry
Analysis of Isocratic Chromatographic Retention Data using Bayesian Multilevel Modeling
Łukasz Kubik, Roman Kaliszan, Paweł Wiczling*
Department of Biopharmaceutics and Pharmacodynamics, Medical University of Gdańsk, Gen. J. Hallera 107, 80-416 Gdańsk, Poland
*Corresponding author's e-mail: [email protected]
ACS Paragon Plus Environment
Abstract
The objective of this work was to develop a multilevel (hierarchical) model based on isocratic reversed-phase high-performance liquid chromatographic data collected in methanol and acetonitrile for 58 chemical compounds. Such a multilevel model is a regression model of the analyte-specific chromatographic measurements in which all the regression parameters are given a probability model. It is fundamentally different from the most common approach, where parameters are estimated separately for each analyte (without sharing information across analytes and different organic modifiers).
The statistical analysis was done with the Stan software, which implements Bayesian inference with Markov Chain Monte Carlo sampling. During the model building process a series of multilevel models of different complexity were obtained: 1) a model with no pooling (separate models are fitted for each analyte); 2) a model with partial pooling (a common distribution for analyte-specific parameters); and 3) a model with partial pooling and a regression model relating analyte-specific parameters and analyte-specific properties (QSRR equations). All the models were compared with each other using 10-fold cross-validation.
The benefits of multilevel models in inference and predictions were shown. In particular, the obtained models allowed us i) to better understand the data and ii) to solve many routine analytical problems, e.g. to obtain well-calibrated predictions of the retention factor for an analyte in acetonitrile-containing mobile phases given no, one or several measurements in methanol-containing mobile phases and vice versa.
Keywords
multi-level modeling, Bayesian statistics, liquid chromatography, QSRR
1. Introduction
The retention mechanism in reversed-phase high-performance liquid chromatography (RP HPLC) is a complicated process involving a great variety of interactions that are difficult to describe exactly¹. Generally, the retention factor depends on the properties of the mobile phase, the stationary phase and the analyzed compounds, e.g. the polar and non-polar surface area of the analytes, the dielectric constant of the mobile phase, the surface properties of the packing material and other descriptors². The complex nature of these interactions usually requires mathematical models to quantify the relationship between retention time and multiple method parameters, such as pH, temperature, buffer concentration and other conditions¹,³. Such a model, when appropriately validated, can be of great help during method development by giving an analyst the means to predict chromatograms for a wide range of experimental conditions.
Models used in the field of chromatography are often built to describe the behavior of a single analyte (or of a set of analytes, modeled one at a time). Such models are certainly useful and serve their role in solving many problems encountered in the laboratory⁴⁻⁷. In this work we would like to generalize these models to multilevel (hierarchical) models, which could further increase the role of predictive modeling in the field of chromatography. The basic idea is to take into account similarities between analytes, solvents or columns while developing a chromatographic model. As an example, let us consider isocratic chromatographic data collected in methanol (MeOH) and acetonitrile (ACN) containing mobile phases for a diverse set of analytes. One would generally approach these data by building separate models for each analyte, either for MeOH or for ACN, and then eventually seek a relationship between analyte-specific chromatographic parameters, such as log kw, and analyte properties, such as log P or polar surface area (QSRR equations). This is not an optimal approach, as there is a lot of information that could be shared both between analytes and between organic modifiers, e.g. the same log kw value regardless of the organic modifier type, or similar log kw values for analytes having similar log P values.
A multilevel model is a regression model of the individual (analyte-specific) chromatographic measurements in which all the model parameters (regression coefficients) are also given a probability model. These second-level parameters are also estimated from the data. Multilevel modeling is a well-known and commonly used mathematical technique, applied in many fields, e.g. in cancer studies⁸, anesthesiology⁹ or educational research¹⁰. It is still not a common tool for retention prediction; however, Bayesian inference itself has been reported to be a useful approach in chromatography¹¹⁻¹³. Recently, multilevel modeling using the Stan software was described as a convenient method for describing gradient HPLC data¹⁴.
In this work we re-analyzed isocratic RP HPLC data previously obtained in our department¹⁵ for 58 chemical compounds. During the model building process a series of models of different complexity were proposed and compared: 1) a model with no pooling (separate models were fitted for each analyte); 2) a model with partial pooling (a common distribution for analyte-specific parameters); and 3) a model with partial pooling and a regression model between analyte-specific parameters and analyte-specific properties (QSRR equations). The multilevel models were implemented in the Stan software, which provides full Bayesian inference for continuous-variable models through Markov Chain Monte Carlo (MCMC) methods. The predictive performance of the proposed models was evaluated using posterior predictions, the Watanabe-Akaike Information Criterion (WAIC) and 10-fold cross-validation. We also illustrate the usefulness of the proposed models in predicting retention times in ACN-containing mobile phases given no, one or several measurements in MeOH-containing mobile phases and vice versa.
2. Experimental Section
2.1 Chromatographic parameters
The data used to illustrate the main concept of multilevel models are taken from the article by Al-Haj et al.¹⁵ They were obtained using RP HPLC in the isocratic mode with UV detection. 58 drug-like chemical compounds, listed in the Supporting Information, were analyzed. The analytes had lipophilicity (MLOGP) ranging from -0.2 to 5.1 and molecular mass ranging from 78 to 270 g/mol. MeOH and ACN were used as organic modifiers of the mobile phase. The percent content of organic modifier in the mobile phase (φ) equaled 20, 25, 30, 35, 40, 45, 50, 60, 65, 70, 75, 80, 85, 90, 95 for MeOH and 20, 25, 30, 40, 50, 60, 65, 70, 75, 80 for ACN. For more details the reader is referred to the original article. The raw data are attached in the Supporting Information and are also presented graphically in Supporting Figure S1.
2.2 Molecular modeling
In the original work¹⁵, a set of structural descriptors was calculated for each analyte using the HyperChem software¹⁶. Three descriptors were used in modeling: total dipole moment (µ), maximum electron excess on the most charged atom (δmin) and water-accessible molecular surface area (AWAS).
Additionally, all 58 compounds were re-modelled using, in sequence: Open Babel 2.3.2¹⁷ (SMILES to MOL2 conversion), Discovery Studio Visualizer 16¹⁸ (preliminary geometry optimization using a Dreiding-like force field¹⁷), GaussView 3.09²⁰ (preparation of Gaussian input files) and Gaussian 09²¹ (final structure optimization with the B3LYP method and the 6-31G basis set). Only 4-iodophenol was optimized using the STO-6G basis set, due to the presence of the iodine atom. The Dragon 7²² software was used to calculate MLOGP and molecular mass.
2.3 Multilevel modeling
Multilevel modeling was carried out using the Stan²³ / CmdStan 2.16²⁴ software linked with Matlab® R2017b²⁵ through MatlabStan 2.15²⁶. For each model we used the following Stan settings: number of iterations = 1000, warmup = 1000, number of Markov chains = 4. The Stan codes were based on the work of Margossian and Gillespie²⁷,²⁸. Exemplary Stan code can be found in the Supporting Information.

Once the model parameters are determined, predictions (and the uncertainty around them) can be obtained for a new (not-yet-analyzed) analyte, taking into account the information about the likely values of analyte-specific parameters (from the posterior distribution) and any set of experimental data. To assess the accuracy of such predictions, posterior predictive checks were used. Such predictive checks are simply replicated data sets simulated by running the fitted model in the forward direction. These replicated data sets, when compared visually with the original data, allow one to assess the model fit and the predictive capabilities of the model²⁹.

The predictive power of the models was assessed with the Watanabe-Akaike Information Criterion (WAIC), using the MatlabStan command mstan.waic. WAIC is conceptually similar to the Akaike information criterion (AIC) and the Bayesian information criterion (BIC); here, the higher the WAIC value, the better the predictive performance of the model. Since WAIC is not able to assess the model performance for new analytes (it approximates leave-one-measurement-out cross-validation), a 10-fold cross-validation (specifically leave-analytes-out cross-validation) was used instead for that purpose. The analytes from the original data were randomly partitioned into 10 subsamples. In each fold, a single subsample was excluded from the analysis, and the remaining 9 subsamples, plus none or a limited number of measurements from the excluded analytes, were used to obtain predictions for those excluded analytes. The cross-validation process was repeated 10 times, with each of the 10 subsamples used exactly once as the validation data. The results from the folds were combined and summarized as a log pointwise predictive density of cross-validation (LPDCV) and a root mean square error of cross-validation (RMSECV). The LPDCV is the preferred way to evaluate the predictive accuracy of a Bayesian model; RMSECV is less appropriate for models that are far from the normal distribution. The details on how WAIC, LPDCV and RMSECV were calculated are provided in the Supporting Information and in reference 30.

Six scenarios were considered during the cross-validation, depending on the measurements available for predictions: 1) no measurements (NONE); 2) a single MeOH measurement (1M); 3) a single ACN measurement (1A); 4) a single MeOH and a single ACN measurement (1M & 1A); 5) all ACN measurements (AllA); and 6) all MeOH measurements (AllM). They allowed us to assess the uncertainty of predictions when only a limited number of experimental data are available.
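The leave-analytes-out partitioning and the two summary measures described above can be sketched as follows. This is only an illustration in Python, not the authors' Matlab/Stan workflow: the function names are hypothetical, the random seed is arbitrary, and the log pointwise predictive density follows the standard definition (log of the posterior-averaged likelihood, summed over held-out points), which may differ in detail from the calculation given in the Supporting Information.

```python
import numpy as np

def leave_analytes_out_folds(n_analytes, n_folds=10, seed=0):
    """Randomly partition analyte indices into n_folds disjoint subsamples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_analytes)
    return [np.sort(fold) for fold in np.array_split(order, n_folds)]

def rmse(observed, predicted):
    """Root mean square error between observations and point predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def log_pointwise_predictive_density(log_lik_draws):
    """Sum over points of log(mean over draws of p(y_i | theta_s)).

    log_lik_draws has shape (n_draws, n_points) and holds log p(y_i | theta_s)
    for held-out points y_i and posterior draws theta_s.
    """
    ll = np.asarray(log_lik_draws, dtype=float)
    m = ll.max(axis=0)  # subtract the column maximum for numerical stability
    lpd_i = m + np.log(np.mean(np.exp(ll - m), axis=0))
    return float(lpd_i.sum())

# 58 analytes split into 10 folds; each fold is held out exactly once.
folds = leave_analytes_out_folds(58)
```

In each fold, the model would be refitted without the held-out analytes (or with only the few measurements allowed by the scenario), and the held-out log-likelihoods and predictions would be passed to the two summary functions.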
3. Model development procedure
3.1 The classical approach proposed by Al-Haj et al.¹⁵
In the original approach presented by Al-Haj et al.¹⁵, separate models were fitted for each analyte. The structural model assumed a simple Snyder-Soczewiński model for MeOH and ACN that can be described by the following equation:

$$\log k_{ijk} = \log k_{w,ik} - S_{1,ik} \cdot \varphi_j \qquad \text{(Eq. 1)}$$

where j = 1…J denotes the jth (out of J) mobile phase composition; i = 1…nAnalytes denotes the ith (out of nAnalytes) analyte; k = 1, 2 denotes MeOH (k = 1) or ACN (k = 2); log kw,ik denotes a chromatographic measure of hydrophobicity (analyte and organic modifier specific), which is basically the retention factor (on the log scale) corresponding to zero content of the organic modifier (i.e. neat water); and S1,ik is the slope coefficient (also analyte and organic modifier specific), which can be understood as an apparent difference between the log retention factors in water and in MeOH or ACN.

The observed retention factors (log kObs) were further modeled according to:

$$\log k_{Obs,z} \sim N(\log k_{i[z]j[z]k[z]},\ \sigma_{i[z]k[z]}) \qquad \text{(Eq. 2)}$$

where z = 1…nObs denotes the zth (out of nObs) measurement; N denotes the normal distribution with the mean given by Eq. (1) and standard deviation σik; and the tilde (~) denotes "has the probability distribution of", i.e. the values of log kObs are randomly drawn from the given (in this case normal) distribution. Standard deviations are conventionally assumed to be analyte and organic modifier specific.
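To make the analyte-by-analyte approach concrete, the sketch below fits Eq. 1 to one analyte in one organic modifier by ordinary least squares. Least squares gives the maximum-likelihood estimates under the normal error model of Eq. 2; it is not the Bayesian fit used in this work. The data are synthetic, φ is assumed to be expressed as a volume fraction, and the function name and numerical values are hypothetical.

```python
import numpy as np

def fit_snyder_soczewinski(phi, log_k):
    """Least-squares fit of log k = log kw - S1 * phi for a single analyte.

    Returns (log_kw, S1, sigma), where sigma is the residual standard deviation.
    """
    phi = np.asarray(phi, dtype=float)
    log_k = np.asarray(log_k, dtype=float)
    slope, intercept = np.polyfit(phi, log_k, 1)
    residuals = log_k - (intercept + slope * phi)
    sigma = np.sqrt(np.sum(residuals ** 2) / (phi.size - 2))
    return float(intercept), float(-slope), float(sigma)

# Synthetic measurements for one hypothetical analyte (not data from the paper).
rng = np.random.default_rng(0)
phi = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
log_k = 3.0 - 4.0 * phi + rng.normal(0.0, 0.02, phi.size)

log_kw, s1, sigma = fit_snyder_soczewinski(phi, log_k)
```

Repeating such a fit independently for every analyte and modifier is the no-information-sharing strategy that the multilevel models generalize.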
In the next modeling step, QSRR relationships were proposed separately for log kw,i1 and log kw,i2 by assuming a linear relationship between log kw and a set of predictors (descriptors), e.g. lipophilicity (log P), or total dipole moment (µi), maximum electron excess on the most charged atom (δmin,i) and water-accessible molecular surface area (AWAS,i):

$$\log k_{w,ik} \sim N(\theta_{\log k_w,k} + \beta_{1,k} \log P_i,\ \omega_{\log k_w,k}) \qquad \text{(Eq. 3)}$$

$$\log k_{w,ik} \sim N(\theta_{\log k_w,k} + \beta_{1,k}\,\mu_i + \beta_{2,k}\,\delta_{min,i} + \beta_{3,k}\,A_{WAS,i},\ \omega_{\log k_w,k}) \qquad \text{(Eq. 4)}$$

where θlogkw,k is the value of log kw for an analyte with all descriptors equal to zero, ωlogkw,k is the scale parameter and β1,k-β3,k are regression coefficients (different for MeOH and ACN).
There are several weaknesses of such a modeling approach: i) log kw is assumed to be different for MeOH and ACN; ii) two independent QSRR equations for log kw were proposed for MeOH and ACN; iii) QSRR equations for other parameters (e.g. for S1,ik) were not explored; iv) the two-stage approach (the estimation of QSRR equations conditional on the estimated log kw,ik values) does not properly take into account the uncertainty of log kw,ik, which differs depending on the degree of extrapolation; and v) the correlations between analyte-specific parameters were not explored. There is no need to make such simplifications during the model building process, as shown in the subsequent sections.
3.2 Model with partial pooling (a common distribution for analyte-specific parameters)
In what follows we assume a more realistic non-linear relationship between log k and the organic modifier content (the equation of Neue et al.³¹):

$$\log k_{ijk} = \log k_{w,i} - \frac{S_{1,ik} \cdot \varphi_j}{1 + S_{2,ik} \cdot \varphi_j} \qquad \text{(Eq. 5)}$$

where S2,ik is the curvature coefficient for the ith analyte in MeOH (k = 1, equivalent notation S2m,i) and ACN (k = 2, equivalent notation S2a,i). Please note that log kw is the same for MeOH and ACN, as it should be. For convenience, this equation was reparametrized in terms of the retention factors in neat MeOH and ACN (log km and log ka), noticing that:

$$\log k_{m,i} = \log k_{w,i} - \frac{S_{1,i1}}{1 + S_{2,i1}} \qquad \text{(Eq. 6)}$$

$$\log k_{a,i} = \log k_{w,i} - \frac{S_{1,i2}}{1 + S_{2,i2}} \qquad \text{(Eq. 7)}$$

This reparametrization was chosen because the retention factor in neat MeOH or ACN has a more natural interpretation than the slope.

The observed retention factors (log kObs) were further modeled similarly as previously:

$$\log k_{Obs,z} \sim N(\log k_{i[z]j[z]k[z]},\ \sigma) \qquad \text{(Eq. 8)}$$

where z denotes the zth measurement and N denotes the normal distribution with the mean given by Eq. (5) and standard deviation σ. This time a common standard deviation is used. This assumption can be relaxed if needed.
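As a worked step that is not spelled out in the text, Eqs. 6 and 7 can be solved for the slope and substituted back into Eq. 5. The sketch assumes that φ is expressed as a volume fraction, so that φ = 1 corresponds to the neat organic modifier; the symbol log k_{neat,ik} is introduced here only as shorthand for log km,i (k = 1) or log ka,i (k = 2).

```latex
% Slope recovered from the reparametrization (Eqs. 6 and 7)
S_{1,ik} = \left(\log k_{w,i} - \log k_{neat,ik}\right)\left(1 + S_{2,ik}\right)

% Eq. 5 written in terms of log k_w, log k_neat and S_2 only
\log k_{ijk} = \log k_{w,i}
  - \frac{\left(\log k_{w,i} - \log k_{neat,ik}\right)\left(1 + S_{2,ik}\right)\varphi_j}
         {1 + S_{2,ik}\,\varphi_j}
```

Setting φ = 0 returns log kw,i and setting φ = 1 returns log k_{neat,ik}, which confirms that the two end points of the curve are the parameters being estimated.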
Multilevel modeling allows a range of second-level models to be specified for the analyte-specific parameters (log kw,i, log km,i, log ka,i, ln S2m,i, ln S2a,i):

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} \\ \theta_{\log k_m} \\ \theta_{\log k_a} \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \qquad \text{(Eq. 9)}$$

where MST denotes the multivariate Student t distribution, θ is the mean value of a parameter, ν is the normality parameter, and Ω denotes the variance-covariance matrix. In particular, θlogkw, θlogkm, θlogka, θlnS2m and θlnS2a denote typical values of the parameters. The use of a multivariate distribution allows the correlation between analyte-specific parameters to be modeled. Please note that some correlations (especially between log kw,i, log km,i and log ka,i) are expected in a chromatographic system. S2m and S2a were modeled on a logarithmic scale to ensure their positive values.

The model for log kObs is the same as previously. The following priors were assigned based on literature findings¹,³²,³³ and our judgment, and are part of the model assumptions:

θlogkw ~ N(2, 5) (Eq. 10)
θlogkm ~ N(0, 5) (Eq. 11)
θlnS2m ~ N(ln(0.2), 0.5) (Eq. 12)
θlogka ~ N(0, 5) (Eq. 13)
θlnS2a ~ N(ln(2), 0.5) (Eq. 14)
In this work we decided to use weakly informative priors that do not place too much probability in any particular interval (and hence do not favor any particular values). θlogkw was assumed to be 2 ± 5; thus, without any data, we think that a typical analyte will have log kw,i in a range from -8 to 12 (± 2 SD around the mean). Similarly, log km,i and log ka,i were assumed to be in a range from -10 to 10. In the case of the θlnS2m and θlnS2a parameters, the prior means were based on the literature data¹,³²,³³ with a coefficient of variation of 50%; thus S2m,i and S2a,i were assumed to be in a range from 0.09 to 0.46 for MeOH and from 0.89 to 4.4 for ACN. Please note that these priors can be easily changed to reflect additional knowledge.
Further, we decomposed our prior on the covariance matrix into a scale (ω) and a correlation matrix (ρ) according to the formula:

$$\Omega = \mathrm{diag}(\omega)\,\rho\,\mathrm{diag}(\omega) \qquad \text{(Eq. 15)}$$

where ω and ρ were given the following priors:

ωlogkw, ωlogkm, ωlogka, ωlnS2m, ωlnS2a ~ N+(0, 5) (Eq. 16)
ρ ~ LKJ(1) (5x5 matrix) (Eq. 17)

where N+ denotes the half-normal distribution and LKJ denotes the Lewandowski, Kurowicka, and Joe distribution³⁴. In this case LKJ(1) ensures that the density is uniform over correlation matrices of order 5. The priors for the standard deviation of the residuals and for the degrees of freedom of the MST distribution were:

σ ~ N+(0, 1) (Eq. 18)
ν ~ gamma(2, 0.1) (Eq. 19)

thus favoring the normal distribution. The use of the Student t distribution ensures robustness. It was required as there are analytes that differ considerably from typical ones.
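The decomposition of the covariance matrix into scales and a correlation matrix can be illustrated numerically. The sketch below is plain NumPy rather than the Stan code used in this work, and the three-parameter example values are hypothetical (the model itself uses five parameters).

```python
import numpy as np

def covariance_from_scale_and_correlation(omega, rho):
    """Build Omega = diag(omega) @ rho @ diag(omega)."""
    omega = np.asarray(omega, dtype=float)
    rho = np.asarray(rho, dtype=float)
    d = np.diag(omega)
    return d @ rho @ d

# Hypothetical scales and correlations for three parameters.
omega = np.array([1.0, 0.5, 2.0])
rho = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.3],
    [0.0, 0.3, 1.0],
])

Omega = covariance_from_scale_and_correlation(omega, rho)
```

The diagonal of Omega holds the variances (ω squared) and each off-diagonal element equals the correlation multiplied by the two corresponding scales, which is why separate priors can be placed on ω and ρ.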
3.3 Model with no pooling
By fixing all ω values to a large value, a no-pooling approach is obtained. It is equivalent to the assumption that no information is shared between different analytes; thus all analyte-specific parameters are essentially estimated based on the analyte-specific data only. In this work the following values were assumed: ωlogkw, ωlogkm, ωlogka, ωlnS2m, ωlnS2a = 10, ρ is the identity matrix of size 5 and ν equals 20, with the rest of the code being similar to that in the previous section.
3.4 Model with partial pooling and a regression model between analyte-specific parameters and analyte-specific properties
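The model with partial pooling can be further extended by adding predictors (descriptors) explaining part of the inter-analyte variability. As an example, the relationship between log kw,i, log km,i, log ka,i and lipophilicity (MLOGPi) or molecular mass (MMOLi) can be proposed as follows:

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} + \beta_{1,\log k_w} \cdot (MLOGP_i - 2.34) \\ \theta_{\log k_m} + \beta_{1,\log k_m} \cdot (MLOGP_i - 2.34) \\ \theta_{\log k_a} + \beta_{1,\log k_a} \cdot (MLOGP_i - 2.34) \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \qquad \text{(Eq. 20)}$$

$$\begin{bmatrix} \log k_{w,i} \\ \log k_{m,i} \\ \log k_{a,i} \\ \ln S_{2m,i} \\ \ln S_{2a,i} \end{bmatrix} \sim MST\left(\nu,\ \begin{bmatrix} \theta_{\log k_w} + \beta_{2,\log k_w} \cdot (MMOL_i - 150) \\ \theta_{\log k_m} + \beta_{2,\log k_m} \cdot (MMOL_i - 150) \\ \theta_{\log k_a} + \beta_{2,\log k_a} \cdot (MMOL_i - 150) \\ \theta_{\ln S_{2m}} \\ \theta_{\ln S_{2a}} \end{bmatrix},\ \Omega\right) \qquad \text{(Eq. 21)}$$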
Such a relationship is consistent with the expected relationship between log k and MLOGP, and between log k and molecular mass, as the latter is correlated with log P. In this case the following priors were used: β1 ~ N(1.00, 0.50) and β2 ~ N(0.02, 0.01), thus assuming that the relationship between MLOGP and log kw, log km and log ka is linear with a slope close to one, and assuming a linear relationship between molecular mass and log kw, log km and log ka with a slope and standard deviation 50 times smaller (where 50 is the standard deviation of the molecular masses of the analytes).
4. Results and discussion
242
In this work we presented a series of multilevel models obtained during the analysis of
243
isocratic data obtained for 58 compounds in MeOH and ACN. The proposed models provide a
ACS Paragon Plus Environment
Page 11 of 24 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Analytical Chemistry
244
unified descriptions of the whole dataset. It is in contrary to “classical” methods of analyzing
245
chromatographic data, which tend to ignore the hierarchical structure of the data, and perform the
246
analysis at the analyte-level only.
Figure 1 and Figure 2 show, respectively, the individual predictions (predictions corresponding to future observations of the same analyte) and typical predictions (predictions corresponding to future observations of a new analyte) for 5 representative compounds selected based on the accuracy of the fit (from the worst to the best based on the root mean square error of the Pooled-log P model). The individual fits are satisfactory for all the considered models. They are also much better than for the original model that assumed the Snyder-Soczewiński equation (data not shown). The similarity of the models is also confirmed by the WAIC, LPDCV (all data) and RMSECV (all data) measures, which are essentially identical (Table 1). This means that, given all the observations, we can predict analyte retention equally well using any of the presented models. This is not true when trying to predict the retention factor for an analyte for which no experimental data are available. Such typical predictions are presented in Figure 2. The typical predictions show much higher uncertainty than the individual predictions, as there is less information about the retention factor for an analyte without any measurements. In this case only the information from other analytes and the predictors, such as log P or molecular mass, can be taken into account. In our analysis the predictive performance of the tested models expressed as LPDCV was -232.3 for Pooled-log P, -1436.7 for Pooled-Mmol, -1775.1 for Pooled and finally -4809.6 for the Unpooled model. In the case of an analyte with no measurements, the information about log P leads to more accurate predictions than molecular mass.
The goodness-of-fit plots are presented in Figure 3. These plots show the relationship between the observations and model predictions (typical and individual) and allow one to assess the calibration accuracy and sharpness of predictions³⁵. For individual predictions, both the accuracy (whether the points are close to the line of unity for the whole range of measurements) and sharpness (the spread of the points around the line of identity) are excellent for all the tested models. The situation is different for the typical predictions. The accuracy and sharpness are reasonable for the Pooled models with predictors (log P reduces uncertainty more than molecular mass). Please note that the calibration is problematic for the Unpooled and Pooled models, which means that those models should be avoided for predictions when no information on analyte properties is available.
Supporting Table S1 presents a summary of the marginal posterior distributions of the model parameters. These parameters summarize all the important features of the data and can be used by others to predict retention factors of new analytes (for a column and analytes similar to those used to develop the model).
The mean normality parameter (ν) lies between 2.60 and 2.90, depending on the model. Low values of the normality parameter indicate that the MST distribution has heavy tails; thus, there are analytes that are considerably different from the typical ones. These analytes have an unusual retention time profile. 4-Aminophenol is an example of such a compound, as it has higher retention in ACN than in MeOH over the whole range of organic modifier contents (Figure 1).
The log P value of a typical analyte in our dataset is 2.34. For such an analyte the estimated log k value equals 3.20 for neat water (log kw), -0.71 for neat MeOH (log km) and -1.00 for neat ACN (log ka). The calculated curvature coefficients (S2) equal 0.59 and 1.40 for MeOH and ACN, respectively. These numbers are close to the literature values¹,³²,³³. Also, a strong correlation (0.74-0.76) between the S2 parameters for MeOH and ACN (ρlnS2m,lnS2a) is observed. The mean β1,logkw in the Pooled-log P model is close to one (1.30). The mean β1,logkm and β1,logka are much smaller (0.29 and 0.21, respectively); however, the trend with log P is evident. A similar situation can be observed for the β2 parameters: the mean β2,logkw (0.029) in the Pooled-Mmol model is close to the prior value (0.02), and β2 for MeOH and ACN are much smaller (0.0058 and 0.0043, respectively), which is a consequence of the correlation between log P and molecular mass.
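The reported point estimates can be combined into a rough "typical" retention curve for an analyte of known MLOGP. The sketch below is only an illustration under several assumptions: all quoted values are taken to refer to the Pooled-log P model, φ is expressed as a volume fraction (0 for neat water, 1 for neat modifier), the Neue-type relationship is reparametrized through log km and log ka, and posterior uncertainty and inter-analyte variability are ignored. The function and variable names are hypothetical.

```python
import numpy as np

# Point estimates quoted in the text (typical analyte, MLOGP = 2.34).
THETA = {"logkw": 3.20, "logkm": -0.71, "logka": -1.00}
BETA1 = {"logkw": 1.30, "logkm": 0.29, "logka": 0.21}
S2 = {"MeOH": 0.59, "ACN": 1.40}
MLOGP_CENTER = 2.34

def typical_log_k(mlogp, phi, modifier):
    """Point prediction of log k for a new analyte with the given MLOGP."""
    shift = mlogp - MLOGP_CENTER
    log_kw = THETA["logkw"] + BETA1["logkw"] * shift
    key = "logkm" if modifier == "MeOH" else "logka"
    log_k_neat = THETA[key] + BETA1[key] * shift
    s2 = S2[modifier]
    # Slope implied by the end points: S1 = (log kw - log k_neat) * (1 + S2).
    s1 = (log_kw - log_k_neat) * (1.0 + s2)
    phi = np.asarray(phi, dtype=float)
    return log_kw - s1 * phi / (1.0 + s2 * phi)

# Typical curves over the whole composition range for a hypothetical analyte.
phi_grid = np.linspace(0.0, 1.0, 11)
curve_meoh = typical_log_k(3.0, phi_grid, "MeOH")
curve_acn = typical_log_k(3.0, phi_grid, "ACN")
```

A full prediction would instead propagate the posterior draws summarized in Supporting Table S1, which is what produces the wide uncertainty bands of the typical predictions.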
Correlations between the log k parameters corresponding to single-component mobile phases (water, MeOH and ACN) are evident, in particular for the Pooled model (ρlogkm,logka = 0.84, ρlogkm,logkw = 0.78, ρlogka,logkw = 0.79). For the Pooled-log P and Pooled-Mmol models this correlation persists even after including common predictors (0.76, 0.71, 0.74, and 0.53, 0.24, 0.24, respectively). If there is a strong correlation between some model parameters, then the information on one parameter gives information about the likely value of the other parameter; e.g. the correlation between log ka and log km indicates that knowing the retention times of an analyte in the water-MeOH system gives the analyst information on its retention in the water-ACN system. Thus, understanding the correlations can help to make more reliable predictions.
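How a correlation transfers information between solvents can be made explicit with the conditional mean of one parameter given another. The expression below is the standard result for a bivariate normal distribution (the same linear form holds for the Student t distribution when ν > 1); it considers only the pair (log km, log ka) and treats log km,i as known, which is a simplification of the full five-parameter model.

```latex
E\left[\log k_{a,i} \mid \log k_{m,i}\right]
  = \theta_{\log k_a}
  + \rho_{\log k_m,\log k_a}\,
    \frac{\omega_{\log k_a}}{\omega_{\log k_m}}
    \left(\log k_{m,i} - \theta_{\log k_m}\right)
```

With a correlation of 0.84, as reported for the Pooled model, an analyte retained more strongly than typical in neat MeOH is therefore expected to be retained more strongly than typical in neat ACN as well.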
Supporting Figure S2 presents the individual (analyte-specific) parameters for 5 representative analytes as obtained by the four considered models. The regularization imposed by the multilevel model is immediately visible when comparing the uncertainty of the parameters obtained using the Unpooled and Pooled models. In the case of the pooled approach the uncertainty is reduced and the individual values are “shrunk” toward the typical values. This is especially visible for parameters that are difficult to estimate (such as ln S2m,i and ln S2a,i).
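The shrinkage effect can be illustrated with a deliberately simplified case: a single normally distributed parameter with known measurement noise and known population spread. This is not the multivariate Student t model of this work, only a sketch of the mechanism; the function name and all numbers are hypothetical.

```python
def partially_pooled_mean(y_bar, n, sigma, theta, omega):
    """Posterior mean of an analyte-specific parameter in a normal-normal model.

    y_bar: mean of n analyte-specific estimates with noise SD sigma
    theta: typical (population) value; omega: between-analyte SD
    """
    weight_data = n / sigma ** 2        # precision contributed by the data
    weight_population = 1.0 / omega ** 2  # precision contributed by the population
    return (weight_data * y_bar + weight_population * theta) / (
        weight_data + weight_population
    )

# A poorly determined parameter (large sigma) is pulled strongly toward theta.
weak_data = partially_pooled_mean(y_bar=2.0, n=1, sigma=2.0, theta=0.0, omega=1.0)
# A well determined parameter (small sigma) stays close to its own estimate.
strong_data = partially_pooled_mean(y_bar=2.0, n=10, sigma=0.2, theta=0.0, omega=1.0)
# A very large omega approximates the no-pooling case: almost no shrinkage.
no_pooling = partially_pooled_mean(y_bar=2.0, n=1, sigma=2.0, theta=0.0, omega=1e6)
```

The same logic explains why the curvature parameters, which are weakly constrained by the data of a single analyte, show the most visible shrinkage.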
The multilevel models proposed in this work are a natural framework for understanding chromatographic data. They are also useful in making predictions of retention times in the case of limited chromatographic data, e.g. in situations that are usually of interest to the analyst. In this work we illustrated this concept by predicting retention times with limited access to the chromatographic data. Specifically, access to no measurements (NONE), a single MeOH (1M), a single ACN (1A), a single MeOH and a single ACN (1M & 1A), all MeOH (AllM) and all ACN (AllA) measurements was considered. Figure 4 illustrates the agreement between predictions (using cross-validation) and observations for all the models and the 6 considered scenarios. It can be used to assess accuracy and sharpness, as was done previously. The predictive performance is summarized in Table 1. The predictions along with their uncertainty are shown in Supporting Figures S-3 and S-4 for 4-aminophenol (the worst fit, with an unusual retention time profile) and xanthene (the best fit), respectively.
It is clear that more experimental data lead to more accurate predictions. Interestingly, the
pooled models are reasonably well calibrated whenever at least one measurement is available
for predictions. Still, the sharpness of predictions is slightly better once predictors (molecular
mass or log P) are included in the model. Because pooling brings in knowledge about the whole
population of analytes, adding a single experimental point for just one mobile phase (MeOH- or
ACN-containing) results in a significant reduction of the uncertainty of predictions for both
eluents. Such an effect is not observed for the Unpooled model. In this case, access to the
experimental data in one organic modifier improves the predictions only for that specific eluent,
without influencing the other. The degree of uncertainty reduction can be assessed from the
LPDCV and RMSECV measures in Table 1 and by visual inspection of Supporting Figures S-3
and S-4.
The LPDCV can be directly used to compare the models (with higher values denoting
higher predictive performance). In general, when comparing all the predictive accuracy measures
(WAIC, LPDCV, RMSECV), goodness-of-fit plots and cross-validation plots, the Pooled-log P
and Pooled-Mmol models lead to useful, well-calibrated predictions in situations where limited
data are available for predictions. As expected, the Unpooled model has the worst predictive
accuracy unless a large number of analyte-specific measurements is available.
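As an illustration of how measures of this kind are typically computed from posterior draws, the sketch below evaluates a log pointwise predictive density and a root mean square error for held-out observations. It is a minimal sketch under stated assumptions, not the implementation used in this work: it assumes a normal observation model, and the function names and toy numbers are hypothetical.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((y - mu) / sigma) ** 2

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lpd(y_obs, mu_draws, sigma_draws):
    """Sum over observations of the log of the draw-averaged predictive density.
    mu_draws[s][i] is the predicted mean of observation i in posterior draw s."""
    total = 0.0
    for i, y in enumerate(y_obs):
        logp = [normal_logpdf(y, mu_s[i], sig) for mu_s, sig in zip(mu_draws, sigma_draws)]
        total += log_mean_exp(logp)
    return total

def rmse(y_obs, mu_draws):
    """Root mean square error of the posterior-mean prediction."""
    n_draws = len(mu_draws)
    squared = 0.0
    for i, y in enumerate(y_obs):
        mean_pred = sum(mu_s[i] for mu_s in mu_draws) / n_draws
        squared += (y - mean_pred) ** 2
    return math.sqrt(squared / len(y_obs))

# Toy held-out data and two posterior draws (hypothetical numbers).
y_obs = [0.50, 1.20, -0.30]
mu_draws = [[0.48, 1.22, -0.31], [0.52, 1.18, -0.27]]

lpd_sharp = lpd(y_obs, mu_draws, sigma_draws=[0.05, 0.05])  # narrow predictive sd
lpd_wide = lpd(y_obs, mu_draws, sigma_draws=[0.5, 0.5])     # wide predictive sd
error = rmse(y_obs, mu_draws)

print(f"LPD (sharp) = {lpd_sharp:.2f}, LPD (wide) = {lpd_wide:.2f}, RMSE = {error:.4f}")
```

Both toy models have the same RMSE because their mean predictions are identical, yet the sharper one attains the higher LPD. This is why a density-based measure rewards sharpness as well as accuracy, whereas RMSE reflects accuracy only.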
5. Conclusions
In this work we proposed and compared several multilevel chromatographic retention
models. Such models make it possible to share information efficiently between analytes and
organic modifiers; this information can be used to predict retention and the associated uncertainty
for new analytes or for analytes with only a few measurements available. In particular, they can
help to solve many practical analytical problems, such as predicting retention times in
ACN-containing mobile phases given no, one or several measurements in MeOH-containing
mobile phases.
Bayesian multilevel modeling is mathematically sophisticated; thus, it has rarely been
applied in the field of analytical chemistry. Nevertheless, recent advances in state-of-the-art
platforms for statistical modeling and high-performance statistical computation (such as Stan's
probabilistic programming language) make these models attractive for solving everyday
separation prediction problems encountered by chromatography practitioners.
7. Acknowledgements
This project was supported by the National Science Centre, Poland (grant 2015/18/E/ST4/00449)
and by funds of the Polish Ministry of Science and Higher Education granted for the
development of young scientists (participants of doctoral studies), no. 01-0224/08/529.
Part of the calculations was carried out at the Academic Computer Centre in Gdańsk.
8. Conflict of Interest Disclosure
The authors declare no competing financial interest.
Tables
Table 1. Summary of the predictive performance measures used to compare the tested models. The predictive performance (log pointwise predictive density (LPDCV) and root mean square error (RMSECV)) was assessed by 10-fold cross-validation for the 4 tested models and 6 prediction scenarios (based on no measurements (None), a single MeOH measurement (1M), a single ACN measurement (1A), a single MeOH and a single ACN measurement (1M & 1A), all ACN measurements (AllA) and all MeOH measurements (AllM)). WAIC is an approximation of leave-one-out measurement cross-validation.

Available data    Unpooled                     Pooled                       Pooled-log P                 Pooled-Mmol

WAIC of the models
All Data: WAIC    -5832.5                      -5823.7                      -5814.2                      -5822.7
All Data: ΔWAIC   8.7                          0                            -9.5                         -1.0

LPDCV (LPDCV for MeOH data/LPDCV for ACN data) after 10-fold cross-validation
None              -4809.6 (-3073.5/-1736.2)    -1775.1 (-1161.4/-613.7)     -232.3 (-198.5/-33.8)        -1436.7 (-955.4/-481.3)
1M                -3168.0 (-1509.5/-1658.5)    928.4 (753.2/175.2)          931.5 (751.1/180.4)          1027.3 (828.1/199.2)
1A                -3551.7 (-3015.3/-536.4)     773.4 (228.7/544.7)          893.6 (308.8/584.8)          948.5 (222.3/726.1)
1M & 1A           -1773.7 (-1327.4/-446.3)     1373.4 (845.8/527.6)         1482.5 (855.6/626.9)         1636.7 (914.3/722.3)
AllM              429.4 (1962.1/-1532.7)       2222.7 (1948.3/274.5)        2249.0 (1946.1/302.9)        2236.0 (1949.3/286.7)
AllA              -1692.8 (-2902.3/1209.6)     1478.8 (262.5/1216.2)        1540.0 (326.9/1213.1)        1503.0 (288.8/1214.1)
All Data          3171.7 (1962.1/1209.6)       3164.5 (1948.3/1216.2)       3159.2 (1946.1/1213.1)       3163.4 (1949.3/1214.1)

RMSECV (RMSECV for MeOH data/RMSECV for ACN data) after 10-fold cross-validation
None              0.81 (0.85/0.72)             0.78 (0.82/0.73)             0.30 (0.33/0.26)             0.62 (0.64/0.59)
1M                0.48 (0.36/0.63)             0.18 (0.19/0.18)             0.18 (0.18/0.17)             0.18 (0.17/0.19)
1A                0.98 (1.2/0.34)              0.18 (0.22/0.094)            0.18 (0.21/0.096)            0.18 (0.21/0.091)
1M & 1A           0.36 (0.42/0.20)             0.16 (0.19/0.087)            0.15 (0.17/0.091)            0.15 (0.17/0.086)
AllM              0.37 (0.033/0.62)            0.12 (0.034/0.19)            0.11 (0.034/0.17)            0.11 (0.033/0.18)
AllA              0.79 (0.99/0.024)            0.20 (0.25/0.024)            0.20 (0.25/0.024)            0.20 (0.25/0.024)
All Data          0.03 (0.033/0.024)           0.03 (0.033/0.024)           0.03 (0.033/0.024)           0.03 (0.033/0.024)
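One way to read Table 1 is to check that each total LPDCV is the sum of its MeOH and ACN parts, and then to compare how much the ACN part changes when a single MeOH measurement is added. The short sketch below does this for the Unpooled and Pooled models in the None and 1M scenarios; the numbers are copied from Table 1, while the variable names are only illustrative.

```python
# (MeOH part, ACN part) of LPDCV, copied from Table 1.
lpd_parts = {
    "None": {"Unpooled": (-3073.5, -1736.2), "Pooled": (-1161.4, -613.7)},
    "1M": {"Unpooled": (-1509.5, -1658.5), "Pooled": (753.2, 175.2)},
}

# Total LPDCV values as reported in Table 1.
totals_reported = {
    "None": {"Unpooled": -4809.6, "Pooled": -1775.1},
    "1M": {"Unpooled": -3168.0, "Pooled": 928.4},
}

# The reported totals agree with the sum of the two parts up to rounding.
for scenario, models in lpd_parts.items():
    for model, (meoh, acn) in models.items():
        assert abs((meoh + acn) - totals_reported[scenario][model]) < 0.2

# Change in the ACN part of LPDCV after adding one MeOH measurement.
acn_gain = {
    model: lpd_parts["1M"][model][1] - lpd_parts["None"][model][1]
    for model in ("Unpooled", "Pooled")
}
print(acn_gain)
```

On these figures, one MeOH measurement raises the ACN part of LPDCV by about 789 for the Pooled model and by about 78 for the Unpooled model, which is consistent with information being shared across organic modifiers mainly when pooling is used.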
Figures
Figure 1. Predicted (posterior median (line) and 95% credible intervals (shaded area)) and observed retention factors (dots) for 5 representative analytes. Predictions correspond to future observations on the same analyte, i.e. posterior predictions conditioned on the observed data from the same analyte. Black corresponds to MeOH and red to ACN.
Figure 2. Predicted (posterior median (line) and 95% credible intervals (shaded area)) and observed retention factors (dots) for 5 representative analytes. Predictions correspond to future observations of a new analyte, i.e. posterior predictive distributions. Black corresponds to MeOH and red to ACN. For the Unpooled model the median is not smooth due to large posterior variability.
Figure 3. Goodness-of-fit plots of the four considered models. The observed versus the mean typical predicted retention factors (the a posteriori mean of a predictive distribution corresponding to future observations of a new analyte) and the observed versus the mean individual predicted retention times (the a posteriori mean of a predictive distribution conditioned on the observed data from the same analyte). The black symbols denote MeOH and the red symbols denote ACN.
Figure 4. Goodness-of-fit plots of the four considered models after 10-fold cross-validation. The graph shows the observed versus the mean predicted retention factors (the a posteriori mean of a predictive distribution conditioned on part of the observed data from the same analyte: none, single MeOH (1M), single ACN (1A), single MeOH and single ACN (1M & 1A), all MeOH (AllM) or all ACN (AllA)). The red symbols denote predictions for ACN and the black symbols denote predictions for MeOH.