Benchmarking of Computational Methods for Creation of Retention

Oct 13, 2017 - Quantitative structure–retention relationship (QSRR) models are powerful techniques for the prediction of retention times of analytes...
0 downloads 15 Views 510KB Size
Subscriber access provided by Universitaetsbibliothek | Johann Christian Senckenberg

Article

Benchmarking of Computational Methods for Creation of Retention Models in Quantitative Structure-Retention Relationships Studies Ruth I.J. Amos, Eva Tyteca, Mohammad Talebi, Paul Raymond Haddad, Roman Szucs, John W Dolan, and Christopher Andrew Pohl J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00346 • Publication Date (Web): 13 Oct 2017 Downloaded from http://pubs.acs.org on October 18, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

1

Benchmarking of Computational Methods for

2

Creation of Retention Models in Quantitative

3

Structure-Retention Relationships Studies

4

Ruth I. J. Amos1#*, Eva Tyteca1,3,#, Mohammad Talebi1, Paul R. Haddad1, Roman Szucs3,

5

John W. Dolan4, and Christopher A. Pohl5

6

1

7

Sciences-Chemistry, University of Tasmania, Private Bag 75, Hobart 7001, Australia

8

2

9

Liège, 2 Passage des Deportes, Gembloux 5030, Belgium

Australian Centre for Research on Separation Science (ACROSS), School of Physical

Analytical Chemistry, AgroBioChem Department, Gembloux Agro-Biotech, University of

10

3

Pfizer Global Research and Development, Ramsgate Rd, Sandwich CT13 9ND, UK

11

4

LC Resources, 1795 NW Wallace Rd., McMinnville, OR 97128, USA

12

5

Thermo Fisher Scientific, 490 Lakeside Dr, Sunnyvale, CA 94085, USA

13

*

14

KEYWORDS

15

QSRR; geometry optimisation; calculations; molecular descriptor; density functional theory;

16

molecular mechanics

[email protected]

17 18

ABSTRACT: Quantitative Structure - Retention Relationship (QSRR) models are powerful

19

techniques for the prediction of retention times of analytes, where chromatographic retention

20

parameters are correlated with molecular descriptors encoding chemical structures of

21

analytes. Many QSRR models contain geometrical descriptors derived from the 3-D spatial

22

coordinates of computationally predicted structures for the analytes. Therefore, it is sensible

23

to calculate these structures correctly, as any error is likely to carry over to the resulting

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

24

QSRR models. This study compares molecular modelling, semi-empirical, and density

25

functional methods (both B3LYP and M06) for structure optimisation. Each of the

26

calculations was performed in a vacuum, then repeated with solvent corrections for both

27

acetonitrile and water. We also compared Natural Bond Orbital analysis with the Mulliken

28

charge calculation method. The comparison of the examined computational methods for

29

structure calculation shows that, possibly due to the error inherent in descriptor creation

30

methods, a quick and inexpensive molecular modelling method of structure determination

31

gives similar results to experiments where structures are optimised using an expensive and

32

time-consuming level of computational theory. Also, for structures with low flexibility,

33

vacuum or gas phase calculations are found to be as effective as those calculations with

34

solvent corrections added.

35 36

INTRODUCTION

37

Quantitative Structure - Retention Relationship (QSRR) models build a mathematical

38

relationship between a chromatographic parameter of an analyte (such as retention time) and

39

its structure.1 QSRR is receiving more attention in the chromatographic community as the

40

challenges faced become more complex and costly. One area where QSRR may be applied

41

with good effect is method development, where the use of this computational method can

42

minimize the time and resources necessary for the initial scoping phase of selecting the

43

optimal stationary and broad mobile phase composition.2 Another possible use of QSRR is in

44

the area of non-targeted analysis (NTA) where for example, the entire metabolite content of

45

biological samples3-4 or an entire sample of pollutants in waste water5-6 is analysed by liquid

46

chromatography coupled to mass spectrometry and the predicted retention time of a

47

metabolite or pollutant can help with its identification.

ACS Paragon Plus Environment

Page 2 of 30

Page 3 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

48

To build a QSRR model, a training set of analytes with known retention times (or other

49

chromatographic parameters of interest, such as the linear solvent strength retention model

50

parameters7) is required. The structure of each analyte in the training set is determined

51

computationally and from that structure a series of descriptors is calculated. Then, a

52

regression method is utilized to build a model for retention time prediction based on the

53

calculated descriptors. Finally a test set of analytes (along with their molecular structures and

54

descriptors) is utilized to test the model for prediction accuracy.1

55

Advances in computational chemistry have enabled thousands of different molecular

56

descriptors to be calculated for each molecule. Some descriptors are purely explanatory, such

57

as the molecular weight of the compound or constitutional descriptors (number of oxygen

58

atoms, for example).1 Two-dimensional (2-D) descriptors show connectivity,

59

autocorrelation, topological distance, and walk and path counts,1, 8 whereas descriptors based

60

on the three-dimensional (3-D) structure of the molecule give more information, such as the

61

partial charges of each of the atoms in the molecule.1 It would seem reasonable to assume

62

that the accuracy of the majority of descriptors depends on the calculated structure of the

63

molecule in question and therefore on the computational method utilized to build that

64

structure.1

65

A brief overview of the literature shows that the computational methods used for finding

66

the 3-D structure of molecules for QSRR are many and varied.1, 9-20 Whilst numerous

67

researchers have compared outcomes from different methods for variable selection and

68

regression analysis (such as artificial neural networks (ANN), genetic algorithms (GA),

69

multiple linear regression (MLR), and partial least squares regression (PLS) etc.)1, 13-14, 16, 21-22

70

a search of the literature has failed to provide any benchmarking studies on how the different

71

computational methods for original structure determination (prior to descriptor calculation)

72

may change the final error in the outcome of the QSRR models. In fact, many databases

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

73

utilized to obtain initial molecular structures have up to 3.5% error in the atomic makeup of

74

the structures (e.g. the substitution of a methoxy group for a hydroxyl group) leading to errors

75

in calculation23 even before the 3-D structure of the molecule is calculated. Error values and

76

correlation constants given for QSRR studies in the literature are confounded by changes in

77

the method of model building, the descriptors utilized, and the clustering method used for

78

building the training sets. Therefore, a more systematic study of computational methods is

79

required, keeping other factors constant and only changing the method of structure

80

optimization, in order to determine the possible contribution of this preliminary step to the

81

final accuracy of the resultant QSRR models.

82

In this study, the factors held constant were those for calculating molecular descriptors and

83

creating the model. For those factors, we utilized methods that have been highly successful in

84

our group. The methods for structure optimization were varied and the most common

85

methods compared to find which method of optimization gave the lowest error in the final

86

model.2, 7, 24-28

87

Methods found in the literature for defining the geometry of molecules prior to descriptor

88

calculation vary from the very simple - SMILES (Simplified Molecular Input Line Entry

89

System) - to molecular mechanics (MM),1, 9-10 semi-empirical methods (SE),11-12 (or a

90

combination of MM and SE13-16), and density functional methods (DFT).17-19 A description

91

of each of these levels of theory, along with examples of QSRR and QSAR (Quantitative

92

Structure-Activity Relationships) studies utilizing them, is found in the Supporting

93

Information. As no comparison study is available on the effects of the different geometry

94

calculation methods, this present study includes a benchmarking of computational methods to

95

elucidate which level of computational chemistry gives the lowest error of prediction of a

96

chromatographic parameter. Calculation methods used range from MM using the MMFF94

97

force field in the Balloon software29 to an SE method (Parametric Method 7, PM7) using

ACS Paragon Plus Environment

Page 4 of 30

Page 5 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

98

MOPAC30-31 (using as a starting point the lowest energy geometry from the MM calculations)

99

and finally to DFT calculations using B3LYP (Becke 3-parameter exchange with correlation

100

by Lee, Yang, and Parr,32-33 once again, starting from the SE optimized geometry). The M06

101

(Minnesota 2006)34 DFT method has also been used for completeness as there are many

102

different DFT methods widely available. For the DFT methods, the basis sets by Pople: 6-

103

31G using a split valence functional,35-36 6-31+G(d) and 6-311++G(d,p) were used to bring in

104

differing levels of polarization and diffuse functions.37-38

105

Many methods utilized for the definition of the structure of molecules do not involve any

106

inclusion of solvent calculations. That is, the molecules are optimized in a vacuum with no

107

explicit or implicit correction for the involvement of solvent. While Karelson39 and Stack40

108

emphasize the need to take solvent into account, Wiczling and Kubik20 utilized the

109

Polarizable Continuum Model with the Integral Equation Formalism variant (IEFPCM)41

110

solvent corrections for QSRR models for dissociating compounds with no visible

111

improvement to the outcome. In the present study results from vacuum calculations are

112

compared to those using (IEFPCM)41 for the implicit inclusion of solvent. Corrections for

113

both acetonitrile (CH3CN) and for water (H2O) have been performed to investigate whether

114

the different dielectric constants (ε = 78.39 and 36.64, respectively) will make a difference to

115

the overall calculation error. 41-42

116

Final structures from each stage have been used to create so-called localized QSRR

117

models.25 All other aspects of the model generation were kept constant so as to compare only

118

the method of geometry optimization for the molecules. The prediction errors have been

119

compared for each type of geometry optimization to find the best method of calculation.

120 121

MATERIALS AND METHODS

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

122

Dataset The reversed-phase liquid chromatography (RPLC) retention dataset used in this

123

study was obtained from the literature and includes 75 compounds.43-45 The particular dataset

124

was utilized to devise the hydrophobic interaction model relating retention of the analyte and

125

the type of reversed-phase stationary phase. The dataset is diverse in size, shape, and polarity

126

of the compounds and is representative of many compounds utilized for RPLC.43-45 The

127

column utilized for the study was a 150 x 4.6 mm, 5 µm, Waters Symmetry C18 column (GL

128

Sciences, Tokyo, Japan) using a mobile phase of 50:50 CH3CN:H2O (for neutral compounds)

129

and 50:50 CH3CN:31.2 mM potassium phosphate buffer at pH = 2.8 (for basic and acidic

130

compounds). Flow-rate was 1.5 mL/min and the column void time 0.919 min.

131

Calculation of molecular geometries MarvinSketch version 16.2.15 from ChemAxon1

132

(Budapest, Hungary) was used for drawing the molecular structures. Initial conformational

133

searches to find the 50 lowest energy structures were performed using MMff9446-50

134

implemented in Balloon.29 The lowest energy conformers were taken as input structures for

135

geometry optimization using the semi-empirical Parametric Method number 7 (PM7)30

136

implemented in Molecular Orbital PACkage (MOPAC),30-31 followed by further geometry

137

optimization of the lowest energy structure using density functional theory51-52 implemented

138

in Gaussian09.53 The DFT calculations were performed both in the gas phase and with

139

solvent corrections for either H2O or CH3CN implemented using IEFPCM41 in Gaussian09.

140

Calculations were performed using both B3LYP32-33 and M0634 functionals. The basis sets

141

included in the study are outlined in Table 1 and include polarization functions and diffuse

142

functions.35, 37

143

Calculation of molecular descriptors Dragon 6.0 software54 (Talete, Milano, Italy) was

144

employed for calculation of molecular descriptors using the resulting minimum energy

145

conformations of the compounds.

ACS Paragon Plus Environment

Page 6 of 30

Page 7 of 30

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

146

The Dragon 6.0 software54 was able to calculate nearly five thousand molecular

147

descriptors, consisting of constitutional, topological, geometrical, electrostatic, and quantum

148

chemical descriptors.

149

procedure.

150

constant or near constant values, descriptors with a standard deviation less than 0.0001,

151

strongly correlated descriptors (descriptors with a correlation coefficient >0.90) and those

152

descriptors not available for all compounds were excluded. Before statistical analysis, all the

153

descriptors were scaled to zero mean and unit variance (autoscaling procedure) because the

154

numerical values of the descriptors varied significantly. The resulting descriptor sets were

155

used to build trial models for the experimental chromatographic retention data.

The Handbook of Molecular Descriptors55 details the calculation

To minimize subsequent problems of chance correlation, descriptors with

156

Generation of QSRR models In previous studies by our group we have found that the

157

lowest error in retention time prediction can be achieved when smaller training sets of

158

chromatographically similar compounds to the target are used.25 Each compound in the

159

dataset was therefore utilized successively as a target analyte by removing it from the dataset

160

and then predicting its retention time using models derived from a subset of compounds in

161

the dataset as the training set. The predictive models established with these training sets were

162

then used to predict the retention of the target compound as an external validation strategy to

163

assess the predictive ability of the established QSRR models.

164

The training sets were obtained through so-called chromatographic similarity searching,25

165

where the compounds in the dataset were ranked according to the absolute value of the ratio

166

of each compound’s retention factor k with the k-value of the target analyte. The predictive

167

model was then built using a training set obtained by applying a k-ratio threshold (k-ratio
0.5, R2>0.6,

218

|(R2 – R02)/R2| < 0.1, and 0.8