Subscriber access provided by Kaohsiung Medical University
Computational Chemistry
Predicting Thermodynamic Properties of Alkanes by Highthroughput Force Field Simulation and Machine Learning Zheng Gong, Yanze Wu, Liang Wu, and Huai Sun J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00407 • Publication Date (Web): 11 Sep 2018 Downloaded from http://pubs.acs.org on September 12, 2018
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Predicting Thermodynamic Properties of Alkanes by Highthroughput Force Field Simulation and Machine Learning
Zheng Gong, Yanze Wu, Liang Wu and Huai Sun*
School of Chemistry and Chemical Engineering, Materials Genome Initiative Center, and Key Laboratory of Scientific and Engineering Computing of Ministry of Education, Shanghai Jiao Tong University, Shanghai, China 200240
*To whom the correspondence should be addressed:
[email protected] ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Abstract Knowledge of the thermodynamic properties of molecules is essential for chemical process design and the development of new materials. Experimental measurements are often expensive and not environmentally-friendly. In the past, studies using molecular simulations have focused on a specific class of molecules, owing to the lack of a consistent force field and simulation protocol. To solve this problem, we have developed a high-throughput force field simulation (HT-FFS) procedure by combining a recently developed general force field with a validated simulation protocol to calculate thermodynamic properties for large number of molecules. This procedure is applied to calculate liquid densities, heats of vaporization, heat capacities, vapor-liquid-equilibrium curves, critical temperatures, critical densities and surface tensions for a wide range of alkanes. The predictions agree well with available experimental data in terms of accuracy and precision, demonstrating that HT-FFS is a valid approach to supplementing experimental measurements. Furthermore, the large amount of data generated by HT-FFS lays a foundation for machine learning. We have developed an artificial neural network that demonstrates the feasibility of expanding predictions beyond simulation using a machine learning model. TOC Graphic
ACS Paragon Plus Environment
Page 2 of 37
Page 3 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Introduction The thermodynamic properties of a chemical substance is often a prerequisite for chemical engineering design or the research and discovery of novel materials1. For example, liquid crystals and polymers consist of molecular moieties which share the same chemical structure as common organic molecules. Therefore, understanding the thermodynamic properties of organic compounds provides valuable insight for materials synthesis and engineering design. Traditionally, the thermodynamic properties of chemicals are measured by experimental means, which in many cases are costly, and sometimes dangerous or not environmentally-friendly. With the rapid growth in computer power, computational approaches that supplement experimental measurements have become possible. In particular, force field simulations, such as molecular dynamics (MD) or Monte Carlo (MC), have been applied successfully to calculate thermodynamic and transport properties for many molecular systems. For example, a series of biannual competitions by the Industrial Fluid Properties Simulation Collective (IFPSC)2-4 have demonstrated that computational approaches can be used to predict thermodynamic and transport properties in close agreement with experimental measurements. However, computational methods that enable the accurate measurement of properties for one set of molecules, including simulation protocols and the underlying force field, are not guaranteed to be successful for a different set of molecules. In many cases, the model force field has been specifically optimized for a targeted set of molecules using a specific simulation protocol, which prevents predictions from being extended to different molecules or conditions. Therefore, in order to make a general method that can be used for calculating the thermodynamic properties of any class of molecules consistently, an integrated procedure that combines a general force field and a validated simulation protocol, needs to be determined. One outstanding problem that must be solved is the temperature transferability of a general all-atom force fields such as OPLS-AA5, GAFF,6 COMPASS7 or TEAM8, which have been optimized mostly under ambient conditions. In recent contributions, we have developed a temperature-transferable coarse-grained force field9,10 and an all-atom (AA) force field11 for describing intermolecular interactions under a broad range of tem-
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
peratures. These force fields have been optimized using a set of well-established simulation protocols to predict thermodynamic properties accurately. Because of the enormous magnitude of chemical space, establishing a general procedure that can calculate the thermodynamic properties of any molecules of interest is an ambitious endeavor that must be carried out in steps. The force field parameters and simulation protocols need to be determined for different types of molecules, often in multiple iterations. For this purpose, a high-throughput force field simulation (HT-FFS) procedure is necessary. Using the HT-FFS procedure, we can rapidly perform simulations for a large number of molecules to identify and solve systematic problems either in the force field or the simulation protocol. In our previous paper12, we reported a HT-FFS procedure for predicting vapor-liquid-equilibrium (VLE) phase diagrams of binary mixtures of CO2 and hydrocarbons using a coarse-grained force field10. In this work, we take one step further to expand the HT-FFS procedure to calculate thermodynamic properties of molecular liquids from ground level using an all-atom force field.11 Prediction of VLE is significant because it provides the most sensitive test on description of intermolecular interactions in a force field. Several force fields such as TraPPE-UA13, TraPPE-EH14, AUA415, OPPE16, SPEADMD17, Mie potential18, SAFT-γ19 have been developed using VLE data. Most predictions reported in the literature are based on MC simulation following the pioneer work of Siepmann et. al.20, in which the two phases are simulated separately and coupled by MC moves. To predict the coexistent phases using MD is challenging because of the length and time scales required. However, along with the ever-growing computer power, it becomes possible to predict the phase separation dynamics. Using HT-FFS we are able to scan the simulations and to find out the boundaries of feasibility. Another significance of HT-FFS is that it can be used to produce an enormous amount of data for machine-learning (ML). Quantitative structure property relationships (QSPR), analogous to quantitative structure activity relationships (QSAR) acquired in life science, have been used in materials science.21-23 However, the predictive power of QSPR is limited by the size of the training dataset used for model optimization and analytical functional forms. Recent ML algorithms such as artificial neural networks (ANN) have
ACS Paragon Plus Environment
Page 4 of 37
Page 5 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
been successful due to the availability of large amounts of data (“big data”) and the capability for high-dimensional optimization. This approach has attracted much attention from the research community,24,25 ML has been applied to predict crystal structures26-28, compound properties29, condensed matter behavior30, and for drug discovery31 and materials design32. Since HT-FFS is capable of generating large amounts of data, a HT-FFS driven ML model can be built to expand predictions beyond simulations. In this study, we focus on alkane molecules with various topologies, such as linear, branched and cyclic structures. Alkanes are one of the most important energy sources and raw materials in the chemical industry, for which sufficient experimental data is available for developing a HT-FFS procedure. The procedure is automated with minimal requirements for human interactions. Intelligent algorithms have been implemented for model building, simulation, detecting and correcting errors, and statistical analysis of data. Using the data produced by HT-FFS, a simple ANN model is constructed to showcase the predictive power of HT-FFS driven ML. High-Throughput Force Field Simulation (HT-FFS) Procedure The HT-FFS procedure reported in our previous work12 has been extended to the simulation of bulk liquids and quantum mechanics (QM) calculations of molecules. The liquid is represented by a cubic box subject to 3-dimensional periodic boundary conditions. The density, heat of vaporization and heat capacity of the liquid can be calculated using this model. QM calculations are used calculate molecular contributions to the heat capacity. As in our previous work12, VLE properties are studied using a rectangular slab model, from which a series of interfacial and bulk properties are calculated, including surface tension, vapor-liquid phase diagrams, as well as the critical temperature and density. The HT-FFS procedure is illustrated in Figure 1. The procedure has five major stages: input, builder, simulation, quality-control, and output.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Figure 1. The workflow of HT-FFS procedure
The input is minimal, only a list of the SMILES strings of the molecules of interest and a list of temperatures and pressures. The builder parses the SMILES strings using OpenBabel33 to generate 3D molecular structures. These structures are used to build either Liquid or VLE model by using Packmol34. The bulk liquid model contains about 3,000 atoms (including hydrogen) in cubic box with length of about 3.2 nm. The VLE model usually contains about 12,000 atoms where about 300 atoms are initially placed in the vapor phase. The elongated box has the size of about 4.2 * 4.2 * 24 nm3. Topology and force field files for simulations are generated automatically using Direct Force Field (DFF). The TEAM-AMBER force field8 with the recently developed temperaturedependent dispersion energy term11 is used, which has the functional form:
ACS Paragon Plus Environment
Page 6 of 37
Page 7 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
=
, − + − + 1 − −1 cos
2 2 2
+
1 − cos2
+
! " #
,,
+ $% &'
() #
*
−'
() + #
* ,
(1)
where the Lennard-Jones parameters are written as functions of temperature. Simulations are carried out using GROMACS 201635. The cutoff for van der Waals (vdW) and Coulomb forces is set to 1.2 nm by default. The particle mesh Ewald (PME) method36 is used for calculating the long-range electrostatic interactions beyond this cutoff. For the liquid simulations, a continuum correction37 is applied to take into account the long-range contributions of vdW interactions. For the VLE simulations, the longrange contribution of vdW interactions are explicitly calculated using the PME method.38,39 The initial configuration of the liquid is first relaxed by energy minimization followed by 100 ps annealing at 800 K, and then subjected to 400 ps of NPT molecular dynamics simulation at a given temperature and pressure. At least 1 ns of NPT simulation is performed to further equilibrate the liquid system and collect data. The VLE model is subject to energy minimization followed by 400 ps NVT simulation. At least 10 ns of NVT simulation is performed to equilibrate the VLE system and collect data. The time steps are set to 1 fs for annealing and 2 fs for data collection. The Langvein thermostat is used to control the temperature because it is efficient for temperature coupling for gas phase. The Berendsen40 and Parrinello-Rhaman41 barostats are used to control the pressure during equilibration and data collection, respectively. The LINCS42 algorithm is used to constrain the length of any bonds involving hydrogen atoms. The QM calculation of the target molecule is conducted by using Gaussian09
43
. MDAnalysis44 and Physi-
calValidation45 packages are used to analyze the MD trajectories. The test of ensemble and configuration is the first function in the quality-control module of the HT-FFS workflow. The kinetic energy distribution are calculated to ensure that the simulation is in desired ensemble. The distribution similarity analysis is performed by utilizing the Kolmogorov-Smirnov method. The diffusion coefficient and density are estimated to validate the configuration of liquid. If the diffusion coefficient is lower than 10/0 cm /s, the system is considered to be in the solid state; if the density is
lower than 0.05 g/cm3, the system is considered to be in vapor phase. If the system should
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 37
be in a liquid state according to experimental data, solidification or vaporization indicates either the force field or the simulation protocol is insufficient and the simulation is stopped for further analysis. After the optimization of the force field and simulation protocol, it was found that this problem mainly occurs when the state point is near a phase transition. It is often difficult to maintain a stable liquid state near these boundaries. In this case, the state point is removed from the computational list. For the VLE model, it is critical to maintain stable coexistent liquid and vapor phases. The Canny edge detector method was used to automatically detect the separation of liquid and gas phases12. A hyperbolic fitting is performed for the density profile along the direction (z) normal to the interface to obtain the densities and thickness of liquid and gas phases 56 758
3 = 456 75 8
+ −
56 /58 56 /58
tanh tanh
=/=! > =/=" >
? < ? >
AB AB
(2)
where ρE and ρF are the densities of the coexisting liquid and gas phases, z and z are the
locations of the two interfaces, s denotes the thickness of the interfaces, and lI is the
length of the simulation box in the z direction. An example is shown in Figure 2. The thickness of the liquid phase and interfaces are determined by JAK = ? − ? − 2L arctanh0.99
JKP = 4L arctanh0.99
(3a) (3b)
As the temperatures increases, the thickness of the liquid phase decreases. The system is considered to be unstable if the thickness of the liquid phase is smaller than 2 nm. The state point at which this occurs, which is often close to the critical point, is then removed from the calculation.
ACS Paragon Plus Environment
Page 9 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Figure 2. Hyperbolic fitting of the density profile along the direction (z) normal to the liquid-vapor interface. Different colors denote two separated fittings.
The second function of the quality control is about the convergence. System equilibration is reached by automatically extending the simulation time. In our previous work12, equilibration was detected by monitoring the slope of the target properties against simulation time, which works fine for coarse-grained simulations as they equilibrate quickly. Because atomistic simulations converge slowly, the acceptable threshold for slope detection cannot be well determined. Herein, equilibration is detected using the method proposed by Chodera46, in which the production region is identified as the final contiguous region containing the largest number of uncorrelated samples. The temperature, pressure, potential energy and density are monitored. The equilibrium must be stable for at least half of the simulation time. If it is not, the simulation is extended in increments of 500 ps for liquid simulations and 2 ns for VLE simulations. The last function of the quality control is about the precision of prediction, which is calculated on-the-fly. Statistical inefficiency37 is calculated for the ensemble-averaged data to calculate the block size and the uncertainty12. The precision thresholds are 0.005 g/cm3 for density, 0.5 kJ/mol for intermolecular energy and 1 mN/m for surface tension. If the uncertainty of the simulation data is larger than the threshold, the simulation is extended in increments of 500 ps for liquid simulations and 2 ns for VLE simulations.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 10 of 37
The output of the HT-FFS procedure includes three sets of basic data: 1) intramolecular heat capacities calculated using QM calculations, 2) densities and intermolecular energies of liquids at given temperatures and pressures, 3) liquid-vapor coexistence densities and surface tensions at given temperatures. Other thermodynamic properties, such as compressibilities, thermal expansion coefficients, sound velocities, heat capacities, critical temperatures and densities, etc. can be derived from these three sets of basic data47. High-Throughput Force Field Simulation Results Liquids Liquid simulations were carried out for a total of 876 common alkane molecules taken from Yaws' Critical Property Data for Chemical Engineers and Chemists48. These molecules include linear, branched and cyclic alkanes. We selected molecules with less than 20 carbon atoms to avoid difficulties in sampling large molecules. Fused cyclic alkanes containing 3 or 4-member rings were excluded due to the lack of validated force field parameters. The distribution of molecules in different topologies is given in Table S1. The branched alkanes are the most abundant, while cyclopropane, cyclobutane, and bicylo-alkanes contribute small parts. For each molecule, 56 state points spanning of 8 temperatures from 25 K above the melting point (denoted as %R7 ) to 0.8 times the critical temperature (denoted as % .0∗ ) and 7 pressures at 1, 50, 100, 250, 500, 750, 1000 bar,
were simulated using NPT conditions. A total of 49,056 state points were simulated for the 876 molecules. Excluding 12 state points that failed in the quality-control stage due to being close to phase transition points, we obtained 49,044 valid data points. For each molecule, the densities and intermolecular energies were represented by polynomial expansions in the temperature and pressure, K X 3 = ∑ZK ∑Z/K X WK,X % Y
K X [A,KP\# = ∑ZK ∑Z/K X ]K,X % Y
(4a) (4b)
so that values at arbitrary state points can be obtained by interpolation. Some of the calculated data can be compared directly against experimental data taken from NIST Standard Reference Database 103b49,50. The experimental data are the rec-
ACS Paragon Plus Environment
Page 11 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
ommended value after critical evaluation on the available raw data.51 Figure 3 shows
comparisons of the densities at three different temperatures: in addition to the low %R7
and high % .0∗ boundary temperatures, the boiling point (denoted as %^_` ) is shown for
comparison. The pressures of these systems are the experimental vapor pressures. The number of data at each of the temperatures is limited by the numbers of experimental data for comparison. Overall, the agreement is satisfactory. The mean deviations (MDEV) are scattered within ±1%, the mean unsigned deviations (MUD) are nearly the same, while the standard deviations (STD) are larger at higher temperatures.
Figure 3. Comparison of experiential and simulated density of alkanes at three temperatures: %R7 , % , % .0∗ (see text for definitions).
The heats of vaporization (abcd ) can be calculated from NPT simulation data: a^_` = Ye & − , + [f − [A 5 5
8
6
(5)
Where P is the pressure, M is the molar mass of the molecule, and 3F and 3E are densities
of the vapor and liquid phases, and [F and [E are the internal energies of the gas and liq-
uid phases. By writing the internal energy as a sum of the intramolecular and intermolecular contributions (e.g. [E = [E,ghijc + [E,ghikj , we have: abcd = l% − [E,ghikj + m
(6)
m = [F,ghijc − [E,ghijc + [F,ghikj + YnF − l% − YnE
(7)
where R is the ideal gas constant, and C is a correction term:
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 12 of 37
At low temperature and pressure, each of the four terms on the right-hand-side can be approximately ignored. The first term means the intramolecular energies in the gas and liquid phases are the same. The second term means that the interaction of molecules is zero in the gas phase, the third term means the gas obeys the ideal-gas law, and the forth term means in comparison with the gas volume, liquid volume can be ignored. Therefore, the calculation of a^_` can be simplified as:
abcd = l% − [E,ghikj
(8)
This equation is broadly used in literature7,52 for calculation of abcd , which is very
straightforward for NPT or NVT simulation as only the intermolecular energy of the liquid phase is required. However, we found the predicted abcd values using Eq. (8) are systematically overestimated as the abcd value increases (see Figure 4).
Figure 4. Comparison of the heats of vaporization of alkanes at the boiling point between experiential data and calculated data using simulations and Eq.(8).
To understand these systematic deviations, we evaluate the four terms in Eq. (7) using hexane, dodecane, and octadecane as examples. The difficulty lies in the gas phase
calculations. The intermolecular energy F,ghikj was calculated from NVT simulation of a
molecule placed in a periodic box with the cell edge fixed to match the experimental vapor density. The long range vdW and electrostatic interactions were calculated using the PME method so that the value represents an averaged interaction. The intramolecular energy F,ghijc was calculated using NVT simulation of a molecule in vacuum. 100 parallel
ACS Paragon Plus Environment
Page 13 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
simulations with different initial conformations were performed. The convergence was verified by examination of the distributions of dihedral angles and the end-to-end distances, as shown by Figure S1 in Supporting Information. Considering the fluctuation of the temperature in simulation %ogp , the intramolecular potential energy is corrected by52 [F,ghijc % = [F,ghijc %ogp + q % − %ogp 3scitp − 6 − svtho
(9)
where scitp is the number of atoms and svtho is the number of constraints (the number
of hydrogen atoms) of the molecule. The four correction terms are listed in Table 1. The intermolecular energy of gas phase and the liquid volume contribution are indeed minimal. The largest error is from the intramolecular energy difference between the gas and liquid phases. This energy difference is due to the complexity of molecular conformations, which is correlated with molecular weight. The second largest error is due to the deviation from the ideal gas law YnF − l%, which is sensitive to temperature. As temper-
ature increases, the vapor density and pressure increase, and then the deviation from the ideal-gas law increases. The data in Table 1 confirms these arguments. Table 1. The correction terms to a^_` calculated for three representative molecules at
three different temperatures. The empirical corrections based on Eq. (10) are also provided. The units of each energy correction are in kJ/mol. w[KP#_ = [f,KP#_ − [A,KP#_ )
Hexane
Dodecane
Octadecane
−YnA
Ynf − l%
[f,KP\#
1.0
0.0
-0.1
0.0
406
5.3
-0.1
-0.5
0.1
288
0.0
0.0
0.0
0.0
489
1.0
0.0
-0.3
0.0
526
2.2
-0.1
-0.6
0.1
326
0.0
0.0
0.0
0.0
481
0.1
0.0
0.0
0.0
598
1.2
-0.1
-0.7
0.0
T(K)
P(bar)
203
0.0
342
0.0
0.0
0.0
ACS Paragon Plus Environment
w[KP#_
m
m\
-0.1±0.2
-0.1
-0.7
-1.9±0.2
-1.7
-1.1
-2.4
-1.4
-0.2
-1.9
-2.3±0.4
-2.1
-3.3
-2.9
-3.5
-2.2±0.6
-0.2
-3.3
-2.2
-4.8
-3.6
-6.0
-1.6±0.2 -0.2±0.4
-1.8±0.4 -0.2±0.3 -2.8±0.6
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 14 of 37
Based on the analysis, we propose an empirical correction using the product of the temperature and number of carbon atoms in the molecule. A coefficient is obtained by fitting the experimental data: m\ = −
y
z l%
(10)
This model is tested on the corrections in Table 1. Using the empirical correction, the
predicted abcd at three temperatures are displayed in Figure 5. The MDEVs range from 0.8 to 3.2 %, the STDs are around 4%, and the MUDs are from 2.7 to 4.3 %.
Figure 5. Comparison of experiential and calculated heats of vaporization at three tem-
peratures: %R7 , % , % .0∗ . The empirical correction has been applied to the calculated data.
Heat capacity was calculated by combining the molecular and intermolecular contributions.11 m{ = mijcho + mjti + mbg| + '
}~ })
* + ' }) * {
{}
{
(11)
The molecular translational and rotational contributions mijcho and mjti were calculated
using the energy equipartition theorem, while the vibrational contribution was calculated using the hindered-rotor model for dihedral modes.53 The normal mode analysis was carried out at the B3LYP/6-31G(d) level of theory, and the vibrational frequencies were scaled by 0.9613 as suggested by Kesharwani et. al.54 The intermolecular contribution '
}~ })
* and volume contribution ' }) * were evaluated from the analytic expressions {
{}
{
Eq. (4) obtained using NPT simulation data. The fluctuation formula based on tempera-
ACS Paragon Plus Environment
Page 15 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
ture-independent potential was not used. The results for isobaric heat capacity at three temperatures and corresponding vapor pressures are shown in Figure 6. The agreement of the simulated results with the experimental data is excellent. The MDEVs are negative, ranging from -2.6 to -1.2%. Both STDs and MUDs show that the deviations are larger at low temperature (%R7 ), which is presumably related to the fact that the equipartition the-
orem is a better approximation at high temperatures.
Figure 6. Comparison of experimental and predicted isobaric heat capacities at three tem-
peratures: %R7 , % , % .0∗ (see text for definitions).
Thermodynamic data at high pressure are scattered throughout literature, making it difficult to make systematic comparisons. Herein, we choose seven molecules from the NIST Chemistry Webbook55: methane, nonane, dodecane, isobutane, isohexane, cyclo-
propane, and cyclohexane. Isobaric curves for the density and Cd at high pressure (350
bar for isobutane, 280 for cyclopropane, 800 bar for cyclohexane and 1000 bar for other molecules) are shown in Figure 7. Our simulation data shows excellent agreement with the experimental data in the wide temperature and pressure range.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Figure 7. Comparison of experimental (lines) and simulated (dots) densities and isobaric heat capacities for seven molecules at high pressure. Color scheme: red, methane; magenta, nonane; navy, dodecane; wine, isobutane; dark yellow, isohexane; olive, cyclopropane; green, cyclohexane
VLE Results The VLE phase diagrams for a smaller set of molecules were calculated because the simulation is far more expensive than the simulation of liquid properties. 274 molecules with ten or fewer carbon atoms were simulated using the slab model and NVT conditions.
For each system, 8 temperatures ranging from %R7 to % .0y∗ were applied. At each tem-
perature, the equilibrium was established dynamically so that the average pressure is the saturated pressure, and the densities in the vapor and liquid phases are coexistence densities. The ensemble test indicated that the energy equipartition was maintained for both liquid and vapor phases. As an example, the distribution of translational and rotational kinetic energies calculated from the VLE simulation of decane at 466 K is displayed in
Figure S2. The mean translational and rotational kinetic energies equal 5.8 kJ/mol (1.5 RT) for both liquid and vapor phases. The predicted VLE phase diagrams of 35 representative molecules are shown in Figure 8. Additional diagrams of 240 molecules are given in Figure S3. Overall the
ACS Paragon Plus Environment
Page 16 of 37
Page 17 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
agreements between the predicted and experimental curves are good. However, a close examination of the simulation data indicates that the data are statistically meaningful at high temperatures where the vapor density is above 0.01 g/cm3, which is about one order of magnitude higher than the prediction uncertainty of 10-3 g/cm3. The problem is more pronounced in the Clausius-Clapeyron plots (Figure S4), which shows large deviations at low temperatures.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Figure 8. Comparison of calculated (dots and crosses) and experimental (lines) vaporliquid-equilibria of 35 representative molecules. The cross symbols denote less confident values.
At low temperatures, the density of gas phase is too low to have sufficient number of molecules in the vapor phase and the vaporization rate is too low to reach equilibrium under the current simulation settings. As an example, a summary of the critical data obtained from VLE simulation of hexane is given in Table 2. At 379 K, the average number of molecules in the vapor phase is 28, and the average rates of vaporization and condensation are 94/ns and 93/ns respectively. These numbers decrease as the temperature decrease, and reduce to zeros at 203 K. Using Arrhenius equation and assumption of the first order rate law, we extrapolate the rate to be 0.13/ns at 203 K, which means on average to observe one molecule crossing the interphase needs 7.7 ns. Another factor is about the length scale. The density at 203 K is in the order of 10-6, which means the simulation box needs to be expanded hundreds times to host one molecule in the vapor phase. Since the number of molecules in the vapor phase and the rate of vaporization (or condensation) are too small at low temperatures, we mark the vapor phase data lower than 0.01 g/cm3 as “less confident”, denoted by cross symbols in Figures 8 and S3.
Table 2. Summary of critical data of VLE simulation for hexane. The data are: temperature, thicknesses of liquid phase, gas phase and interfaces, densities of liquid and gas
ACS Paragon Plus Environment
Page 18 of 37
Page 19 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
phases, number of molecules in gas phase, and rate of vaporization and condensation. We define a successful vaporization or condensation as a move of a molecule from one bulk phase to another, both are at least 1 nm away from the boundary of interface. The uncertainty is given in parenthesis at the last decimal point. T
JAK
Jf_>
JKP
3A
3f
3
sf
s^_`#K=\
s}\>\
(/ns)
(/ns)
(K)
(nm)
(nm)
(nm)
(g/cm )
(g/cm )
203
4.48(2)
18.00(1)
2.95(3)
0.744(0)
0
0
0
0
273
4.46(1)
16.87(1)
4.09(2)
0.679(0) )
0.000(0)
1.6(3)
4.0
4.5
326
4.44(1)
15.90(1)
5.09(2)
0.626(0)
0.002(0)
6.6(4)
26
26
362
4.22(3)
15.02(3)
6.19(6)
0.589(1)
0.007(0)
20(1)
71
69
379
4.04(3)
14.54(4)
6.85(7)
0.568(1)
0.010(0)
28(1)
94
93
397
3.77(4)
14.03(4)
7.63(7)
0.545(1)
0.016(0)
42(1)
127
129
414
3.26(6)
13.51(5)
8.66(9)
0.522(1)
0.026(1)
63(2)
156
154
432
2.5(1)
12.6(1)
10.3(2)
0.495(1)
0.035(2)
70(4)
170
167
3
Despite the shortcoming of prediction VLE at low temperatures, the critical proper-
ties extracted from the VLE phase diagrams are satisfactory. The critical temperature %v
and critical density 3v were estimated by fitting the coexisting densities to the scaling law and the rectilinear diameter law
) .y
3E − 3F = '1 − * ) 5 75
= 3 + '1 − ) * )
(12a) (12b)
where A and B are system-dependent factors. As shown in Figure 9, excellent agreement with experimental data is obtained. For more than half of the molecules, the unsigned deviations between our predicted results and experimental data are less than 1%. The MUDs of both %v and 3v are 1.4%.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 20 of 37
Figure 9. Comparison of experimental and simulated critical temperatures and critical densities.
Another advantage of using MD to simulate the VLE coexistences is that the surface tensions can be obtained from the same simulation. To compare calculated data with experimental ones which are usually measured at different temperatures, a fitting is performed for at different temperatures to interpolate at intermediate temperatures: )
= m '1 − * )
(13)
where m and n are parameters. Comparisons of surface tensions are shown in Figure 10.
The MDEVs range from -2.2% to 8.4%, STDs are within 7.0% to 9.8% and MUDs are from 5.2% to 9.5%.
Figure 10. Comparison of experimental and simulated surface tensions at three temperatures: %R7 , % , % .0∗ (see text for definitions)
ACS Paragon Plus Environment
Page 21 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Comparison of Uncertainties In Figure 11, we plot the distributions of unsigned deviations of HT-FFS predictions
in density, abcd , m{ , critical temperature, critical density, and surface tension. The devia-
tions have multiple origins. The force field used to perform the simulations is only an approximation of the true potential. Although we have parameterized the force field very carefully and integrated temperature dependence into the functional form, the intermolecular interactions are too complicated to be exactly represented. Another origin is the simulation protocol, which may need to be further improved. Nevertheless, the distribution of the deviations represents the uncertainty of HT-FFS predictions. For a new prediction, given that the precision is controlled by the acceptable threshold, the uncertainty in matching the true (experimental) data is given by the distributions shown in Figure 11.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Figure 11. Comparison of deviations (orange) of HT-FFS predictions and uncertainties
(blue) of experimental measurements in liquid densities, a^_` , m{ , critical temperatures,
critical densities and surface tensions For density, a^_` , m{ and surface tension, the com-
parison is made at normal boiling point. The data at different temperatures provide similar results.
It is a valid question to ask how reliable the HT-FFS predictions are in comparison with experimental measurements. To answer this question, we analyzed the uncertainties in experimental measurements as collected in the NIST Standard Reference Database 103b27,50,51 and make side-by-side comparisons with the deviations in HT-FFS predictions. The results are shown in Figure 11. Although the precision of each experimental measurement can be controlled reasonably well, analogous to the precision of each simulation, the uncertainties among different measurements, either using different experimental methods or the same method but operated in different labs, show significant deviations. Comparing the deviations of HT-FFS predictions with uncertainties of experimental measurements, it is seen that in most cases, the distributions are similar.
ACS Paragon Plus Environment
Page 22 of 37
Page 23 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Machine Learning Feedforward Neural Network Three fundamental properties, density, intermolecular energy and isobaric heat capacity, obtained at 49,044 valid state points were used to develop a ML model. We chose the simplest ANN, a feed-forward neural network (FNN), built using PyTorch56 for predicting thermodynamic properties. The scheme of a FNN is shown in Figure 12. The output can be written as a function: = a … + + . . . +
(14)
where is input vector, , , … are the weight matrices of each layer,
, , … are the bias vectors of each layer, and a and are element-wise activation functions for the output and hidden layers, respectively. Here we use a = and = [ , where the [ activation function is: [ =
if > 0 e − 1 if 0
(15)
Figure 12. Schematic representation of FNN with N-1 hidden layers, where
, , … are the output vectors of the nth layer (including the input and output layers),
, , … and , , … are weight matrices and bias vectors of each layer, a
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 24 of 37
and are the activation functions corresponding to the output and hidden layers, respec-
tively.
The initial values of the weight matrices and the bias vectors are randomly assigned. The Adam optimizer57 is employed to minimize the mean squared error (MSE) between
the FNN estimation = ss and the MD simulation value of thermodynamic
properties, :
e = ∑X X − X
(16)
where is the number of training data points, X and X are the properties determined
from MD simulation and estimation by the FNN, respectively. The implemented FNN model includes 3 hidden layers with 16, 8 and 4 nodes in each layer, respectively.
The set of thermodynamic properties predicted from simulations was randomly separated into three datasets: (1) 70% (34,330) of the data constitutes the training set; 20% (9,809) for the validation set; and 10% (4,905 data points) in the test set. The validation set is used to evaluate the performance of trained ANNs using different hyper parameters. The test set is a true measure of the accuracy of the FNN because this dataset was not exposed to the training process. Because the data are randomly separated, each set has similar representation of the properties to be described. Feature Selection The features (finger prints) used in this work include information regarding molecular structure and the thermodynamic state point. We attempted to use features that resemble the atom types used in force field, as they are the most essential input to simulations. We also include molecular topology information such as the number of rings and number of chain segments. Initially we proposed to use a set of 23 structural features. The structural features were calculated by using SMARTS (SMILES arbitrary target specification) matching functionalities in OpenBabel33 package. There are two features representing the thermodynamic state point, temperature and pressure. All together, we have a total of 25 features, which we will subsequently determine which are more significant.
ACS Paragon Plus Environment
Page 25 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
The feature list was reduced by using a recursive feature elimination (RFE)49 procedure with a linear support vector machine (SVM) regressor58, implemented in the ScikitLearn package59. The main equation of linear SVM prediction is: ¢ = ∑ ¡ X ⋅ ¥X + ¦ X £
(17)
¢ is the prediction vector, is the input vector, ¥X § = 1, s are a group of supwhere ¡
port vectors, and ¨X , ¦ are coefficients. The SVM regressor was trained by optimizing ¥X , ¨X and ¦ to minimize the squared error between the prediction vector and objective vector,
∑KK − K . RFE was carried out using the training set only. At the beginning, the input
vector consisted of all features. The SVM regressor was then trained and the im-
portance of each feature was calculated by:
m = ∑ X ©X X £
(18)
where m is the importance of feature , and ©X is the kth element of jth support vector.
The feature of the least importance was eliminated from the feature list. The RFE was recursively repeated, and in each iteration one feature was removed. To evaluate the reduction procedure, at the end of each step, the remaining features were used to train a model by a simple FNN that has one hidden layer and 16 nodes. As shown in Figure 13, the mean squared error (MSE) in the predictions increase as the number of features decrease. There is a sharp increase in the error with respect to each property. Therefore, we can identify the smallest feature list using an error tolerance.
Figure 13. Mean squared error of predictions with different numbers of features using a 16 node FNN with one hidden layer.
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 26 of 37
The minimum feature list is different for different thermodynamic properties, as shown in Table 3. Several features are generally applicable, such as the state point features (T and P) and the number of carbon atoms (NC) which essentially represents molecular weight. In addition, the numbers of methylene (C2), methyl (C3) and cyclic (R) groups, which are associated with different atom types of the force field are applicable to all of the three properties. Other features are more or less related to one of the three properties. For density, the number of methyl groups attached to ternary carbon atoms (C4M1, C4M2, C4M3), and the number of carbon atoms attached to one or two (C1R, C2R) rings
are sensitive. For energetic properties (a^_` and m{ ), the number of the innermost atoms
such as secondary atoms (C3, C3M1, C3M2) and ternary (C4) carbons are also sensitive, which is consistent with the fact that the dispersion energy is sensitive to the contributions from these atom types in the force field. In addition, the flexibility of molecules, represented by the number of linear segments (L4 and L6) are sensitive in determining a^_` and m{ . Several features, such as the numbers of small (3-, 4-member) and large (8-
member) rings, fused and bridge rings, and joint rings, are not sensitive to any of the three properties because these structural features were not sufficiently represented by the data. Finally, we must note that the reduced feature lists are not unique solutions, as they were derived by following the “minimum path” as explained above. It is analogous to optimization in a high-dimension space, what we have obtained as a good solution is likely to be one of multiple solutions. A procedure for finding the best set of features is a subject for further research.
Table 3. The minimum features list used in the FNN model for different properties. The last column is the union of the three minimum lists. [E,ghikj
m{
Union
√
√
>C
CH-
√
√
√
4
C2
-CH2-
√
√
√
√
5
C1
-CH3
√
√
√
√
6
L4
-CCCC-
√
√
√
Index
Name
Group
density
1
NC
C
√
2
C4
3
√
ACS Paragon Plus Environment
Page 27 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
7
L6
-CCCCCC-
√
√
√
8
C3M1
-CH(CH3)-
√
9
C3M2
-CH(CH3)2
√
10
C4M1
>C(CH3)-
√
11
C4M2
-C(CH3)2-
√
12
C4M3
-C(CH3)3
√
13
C1R
-CRH(C)-
√
√
14
C2R
-CR(C)(C)-
√
√
15
M1R
-CRH(CH3)-
16
M2R
-CR(CH3)2-
17
R#
Cyclic (≥5)
18
R3
3-Membered ring
19
R4
4-Membered ring
20
R8
8-Membered ring
21
FR
Fused rings
22
RR
Ring-Ring
23
RJ
Atom in three rings
24
T
25
P
√ √
√
√ √ √
√
√
√
√
√
√
Temperature
√
√
√
√
Pressure
√
√
√
√
Predictions In order to improve the rate of convergence for the network parameters, the inputs are scaled so that the values are in the range of [-1, 1]. The network model is trained through a mini-batch optimization process. In each epoch, the training data is randomly divided into several batches, each containing 1024 points. After the optimization has performed 1000 steps, a new batch of data is used for training. Training each network requires about 30 minutes using a GTX 1080 graphic card. The trained model predicts the three thermodynamic properties very well in comparison to the simulation results. Figure 14 shows a comparison of FNN predictions against simulation data, using the full (union of the three property-dependent lists) feature list and the minimum feature list corresponding to each property. The maximum absolute relative errors (MARE) are calculated and listed in the charts. Using the minimum feature
ACS Paragon Plus Environment
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
lists, the prediction quality is slightly poorer than those predicted using the full feature list.
Figure 14. Comparison of FNN predictions with MD simulation data for liquid densities, intermolecular energies and isobaric heat capacities, using the union feature list (first row) and minimum feature list (second row).
Table 4 lists the statistical analysis of the FNN performance in describing the three sets of data. The root mean square error (RMSE), average absolute relative error (AARE), average deviation (bias), fraction of absolute relative error (ARE) less than 1%, and maximum absolute relative error (MARE) are listed. The errors are very close among the three thermodynamic properties. Overall, the ML model yields excellent agreement with the simulation data, with AARE are generally less than 0.5%. In particular, the test set demonstrates that the model is capable to predict these properties very well.
ACS Paragon Plus Environment
Page 28 of 37
Page 29 of 37 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Table 4. Statistical analysis of FNN prediction on training, validation, and test data sets, for the three thermodynamic properties.
density
training
validation
test
RMSE (g/cm3)
2.9E-03
3.0E-03
3.1E-03
AARE
0.28%
0.29%
0.29%
bias (g/cm )
2.5E-04
2.2E-04
2.3E-04
ARE