Subscriber access provided by Universitaetsbibliothek | Johann Christian Senckenberg
Article
Benchmarking of Computational Methods for Creation of Retention Models in Quantitative Structure-Retention Relationships Studies Ruth I.J. Amos, Eva Tyteca, Mohammad Talebi, Paul Raymond Haddad, Roman Szucs, John W Dolan, and Christopher Andrew Pohl J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00346 • Publication Date (Web): 13 Oct 2017 Downloaded from http://pubs.acs.org on October 18, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
1
Benchmarking of Computational Methods for
2
Creation of Retention Models in Quantitative
3
Structure-Retention Relationships Studies
4
Ruth I. J. Amos1#*, Eva Tyteca1,3,#, Mohammad Talebi1, Paul R. Haddad1, Roman Szucs3,
5
John W. Dolan4, and Christopher A. Pohl5
6
1
7
Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia
8
2
9
Liège, 2 Passage des Deportes, Gembloux 5030, Belgium
Australian Centre for Research on Separation Science (ACROSS), School of Physical
Analytical Chemistry, AgroBioChem Department, Gembloux Agro-Biotech, University of
10
3
Pfizer Global Research and Development, Ramsgate Rd, Sandwich CT13 9ND, UK
11
4
LC Resources, 1795 NW Wallace Rd., McMinnville, OR 97128, USA
12
5
Thermo Fisher Scientific, 490 Lakeside Dr, Sunnyvale, CA 94085, USA
13
*
14
KEYWORDS
15
QSRR; geometry optimisation; calculations; molecular descriptor; density functional theory;
16
molecular mechanics
[email protected] 17 18
ABSTRACT: Quantitative Structure - Retention Relationship (QSRR) models are powerful
19
techniques for the prediction of retention times of analytes, where chromatographic retention
20
parameters are correlated with molecular descriptors encoding chemical structures of
21
analytes. Many QSRR models contain geometrical descriptors derived from the 3-D spatial
22
coordinates of computationally predicted structures for the analytes. Therefore, it is sensible
23
to calculate these structures correctly, as any error is likely to carry over to the resulting
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
24
QSRR models. This study compares molecular modelling, semi-empirical, and density
25
functional methods (both B3LYP and M06) for structure optimisation. Each of the
26
calculations was performed in a vacuum, then repeated with solvent corrections for both
27
acetonitrile and water. We also compared Natural Bond Orbital analysis with the Mulliken
28
charge calculation method. The comparison of the examined computational methods for
29
structure calculation shows that, possibly due to the error inherent in descriptor creation
30
methods, a quick and inexpensive molecular modelling method of structure determination
31
gives similar results to experiments where structures are optimised using an expensive and
32
time-consuming level of computational theory. Also, for structures with low flexibility,
33
vacuum or gas phase calculations are found to be as effective as those calculations with
34
solvent corrections added.
35 36
INTRODUCTION
37
Quantitative Structure - Retention Relationship (QSRR) models build a mathematical
38
relationship between a chromatographic parameter of an analyte (such as retention time) and
39
its structure.1 QSRR is receiving more attention in the chromatographic community as the
40
challenges faced become more complex and costly. One area where QSRR may be applied
41
with good effect is method development, where the use of this computational method can
42
minimize the time and resources necessary for the initial scoping phase of selecting the
43
optimal stationary and broad mobile phase composition.2 Another possible use of QSRR is in
44
the area of non-targeted analysis (NTA) where for example, the entire metabolite content of
45
biological samples3-4 or an entire sample of pollutants in waste water5-6 is analysed by liquid
46
chromatography coupled to mass spectrometry and the predicted retention time of a
47
metabolite or pollutant can help with its identification.
ACS Paragon Plus Environment
Page 2 of 30
Page 3 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
48
To build a QSRR model, a training set of analytes with known retention times (or other
49
chromatographic parameters of interest, such as the linear solvent strength retention model
50
parameters7) is required. The structure of each analyte in the training set is determined
51
computationally and from that structure a series of descriptors is calculated. Then, a
52
regression method is utilized to build a model for retention time prediction based on the
53
calculated descriptors. Finally a test set of analytes (along with their molecular structures and
54
descriptors) is utilized to test the model for prediction accuracy.1
55
Advances in computational chemistry have enabled thousands of different molecular
56
descriptors to be calculated for each molecule. Some descriptors are purely explanatory, such
57
as the molecular weight of the compound or constitutional descriptors (number of oxygen
58
atoms, for example).1 Two-dimensional (2-D) descriptors show connectivity,
59
autocorrelation, topological distance, and walk and path counts,1, 8 whereas descriptors based
60
on the three-dimensional (3-D) structure of the molecule give more information, such as the
61
partial charges of each of the atoms in the molecule.1 It would seem reasonable to assume
62
that the accuracy of the majority of descriptors depends on the calculated structure of the
63
molecule in question and therefore on the computational method utilized to build that
64
structure.1
65
A brief overview of the literature shows that the computational methods used for finding
66
the 3-D structure of molecules for QSRR are many and varied.1, 9-20 Whilst numerous
67
researchers have compared outcomes from different methods for variable selection and
68
regression analysis (such as artificial neural networks (ANN), genetic algorithms (GA),
69
multiple linear regression (MLR), and partial least squares regression (PLS) etc.)1, 13-14, 16, 21-22
70
a search of the literature has failed to provide any benchmarking studies on how the different
71
computational methods for original structure determination (prior to descriptor calculation)
72
may change the final error in the outcome of the QSRR models. In fact, many databases
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
73
utilized to obtain initial molecular structures have up to 3.5% error in the atomic makeup of
74
the structures (e.g. the substitution of a methoxy group for a hydroxyl group) leading to errors
75
in calculation23 even before the 3-D structure of the molecule is calculated. Error values and
76
correlation constants given for QSRR studies in the literature are confounded by changes in
77
the method of model building, the descriptors utilized, and the clustering method used for
78
building the training sets. Therefore, a more systematic study of computational methods is
79
required, keeping other factors constant and only changing the method of structure
80
optimization, in order to determine the possible contribution of this preliminary step to the
81
final accuracy of the resultant QSRR models.
82
In this study, the factors held constant were those for calculating molecular descriptors and
83
creating the model. For those factors, we utilized methods that have been highly successful in
84
our group. The methods for structure optimization were varied and the most common
85
methods compared to find which method of optimization gave the lowest error in the final
86
model.2, 7, 24-28
87
Methods found in the literature for defining the geometry of molecules prior to descriptor
88
calculation vary from the very simple - SMILES (Simplified Molecular Input Line Entry
89
System) - to molecular mechanics (MM),1, 9-10 semi-empirical methods (SE),11-12 (or a
90
combination of MM and SE13-16), and density functional methods (DFT).17-19 A description
91
of each of these levels of theory, along with examples of QSRR and QSAR (Quantitative
92
Structure-Activity Relationships) studies utilizing them, is found in the Supporting
93
Information. As no comparison study is available on the effects of the different geometry
94
calculation methods, this present study includes a benchmarking of computational methods to
95
elucidate which level of computational chemistry gives the lowest error of prediction of a
96
chromatographic parameter. Calculation methods used range from MM using the MMFF94
97
force field in the Balloon software29 to an SE method (Parametric Method 7, PM7) using
ACS Paragon Plus Environment
Page 4 of 30
Page 5 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
98
MOPAC30-31 (using as a starting point the lowest energy geometry from the MM calculations)
99
and finally to DFT calculations using B3LYP (Becke 3-parameter exchange with correlation
100
by Lee, Yang, and Parr,32-33 once again, starting from the SE optimized geometry). The M06
101
(Minnesota 2006)34 DFT method has also been used for completeness as there are many
102
different DFT methods widely available. For the DFT methods, the basis sets by Pople: 6-
103
31G using a split valence functional,35-36 6-31+G(d) and 6-311++G(d,p) were used to bring in
104
differing levels of polarization and diffuse functions.37-38
105
Many methods utilized for the definition of the structure of molecules do not involve any
106
inclusion of solvent calculations. That is, the molecules are optimized in a vacuum with no
107
explicit or implicit correction for the involvement of solvent. While Karelson39 and Stack40
108
emphasize the need to take solvent into account, Wiczling and Kubik20 utilized the
109
Polarizable Continuum Model with the Integral Equation Formalism variant (IEFPCM)41
110
solvent corrections for QSRR models for dissociating compounds with no visible
111
improvement to the outcome. In the present study results from vacuum calculations are
112
compared to those using (IEFPCM)41 for the implicit inclusion of solvent. Corrections for
113
both acetonitrile (CH3CN) and for water (H2O) have been performed to investigate whether
114
the different dielectric constants (ε = 78.39 and 36.64, respectively) will make a difference to
115
the overall calculation error. 41-42
116
Final structures from each stage have been used to create so-called localized QSRR
117
models.25 All other aspects of the model generation were kept constant so as to compare only
118
the method of geometry optimization for the molecules. The prediction errors have been
119
compared for each type of geometry optimization to find the best method of calculation.
120 121
MATERIALS AND METHODS
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
122
Dataset The reversed-phase liquid chromatography (RPLC) retention dataset used in this
123
study was obtained from the literature and includes 75 compounds.43-45 The particular dataset
124
was utilized to devise the hydrophobic interaction model relating retention of the analyte and
125
the type of reversed-phase stationary phase. The dataset is diverse in size, shape, and polarity
126
of the compounds and is representative of many compounds utilized for RPLC.43-45 The
127
column utilized for the study was a 150 x 4.6 mm, 5 µm, Waters Symmetry C18 column (GL
128
Sciences, Tokyo, Japan) using a mobile phase of 50:50 CH3CN:H2O (for neutral compounds)
129
and 50:50 CH3CN:31.2 mM potassium phosphate buffer at pH = 2.8 (for basic and acidic
130
compounds). Flow-rate was 1.5 mL/min and the column void time 0.919 min.
131
Calculation of molecular geometries MarvinSketch version 16.2.15 from ChemAxon1
132
(Budapest, Hungary) was used for drawing the molecular structures. Initial conformational
133
searches to find the 50 lowest energy structures were performed using MMff9446-50
134
implemented in Balloon.29 The lowest energy conformers were taken as input structures for
135
geometry optimization using the semi-empirical Parametric Method number 7 (PM7)30
136
implemented in Molecular Orbital PACkage (MOPAC),30-31 followed by further geometry
137
optimization of the lowest energy structure using density functional theory51-52 implemented
138
in Gaussian09.53 The DFT calculations were performed both in the gas phase and with
139
solvent corrections for either H2O or CH3CN implemented using IEFPCM41 in Gaussian09.
140
Calculations were performed using both B3LYP32-33 and M0634 functionals. The basis sets
141
included in the study are outlined in Table 1 and include polarization functions and diffuse
142
functions.35, 37
143
Calculation of molecular descriptors Dragon 6.0 software54 (Talete, Milano, Italy) was
144
employed for calculation of molecular descriptors using the resulting minimum energy
145
conformations of the compounds.
ACS Paragon Plus Environment
Page 6 of 30
Page 7 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
146
The Dragon 6.0 software54 was able to calculate nearly five thousand molecular
147
descriptors, consisting of constitutional, topological, geometrical, electrostatic, and quantum
148
chemical descriptors.
149
procedure.
150
constant or near constant values, descriptors with a standard deviation less than 0.0001,
151
strongly correlated descriptors (descriptors with a correlation coefficient >0.90) and those
152
descriptors not available for all compounds were excluded. Before statistical analysis, all the
153
descriptors were scaled to zero mean and unit variance (autoscaling procedure) because the
154
numerical values of the descriptors varied significantly. The resulting descriptor sets were
155
used to build trial models for the experimental chromatographic retention data.
The Handbook of Molecular Descriptors55 details the calculation
To minimize subsequent problems of chance correlation, descriptors with
156
Generation of QSRR models In previous studies by our group we have found that the
157
lowest error in retention time prediction can be achieved when smaller training sets of
158
chromatographically similar compounds to the target are used.25 Each compound in the
159
dataset was therefore utilized successively as a target analyte by removing it from the dataset
160
and then predicting its retention time using models derived from a subset of compounds in
161
the dataset as the training set. The predictive models established with these training sets were
162
then used to predict the retention of the target compound as an external validation strategy to
163
assess the predictive ability of the established QSRR models.
164
The training sets were obtained through so-called chromatographic similarity searching,25
165
where the compounds in the dataset were ranked according to the absolute value of the ratio
166
of each compound’s retention factor k with the k-value of the target analyte. The predictive
167
model was then built using a training set obtained by applying a k-ratio threshold (k-ratio
0.5, R2>0.6,
218
|(R2 – R02)/R2| < 0.1, and 0.8