Machine-Learning Based Stacked Ensemble Model for Accurate

May 31, 2019 - The root-mean-square error values, uncertainty quantification, and ... assifier decision tree corresponds to one output result (a discr...
0 downloads 0 Views 5MB Size
Article pubs.acs.org/JPCA

Cite This: J. Phys. Chem. A 2019, 123, 5190−5198

Machine-Learning Based Stacked Ensemble Model for Accurate Analysis of Molecular Dynamics Simulations Published as part of The Journal of Physical Chemistry virtual special issue “Machine Learning in Physical Chemistry”. Samrendra K. Singh,† Karteek K. Bejagam,‡ Yaxin An,‡ and Sanket A. Deshmukh*,‡ †

CNH Industrial, Burr Ridge, Illinois 60527, United States Department of Chemical Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States

Downloaded via NOTTINGHAM TRENT UNIV on August 13, 2019 at 10:18:23 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.



S Supporting Information *

ABSTRACT: Accurate, faster, and on-the-fly analysis of the molecular dynamics (MD) simulations trajectory becomes very critical during the discovery of new materials or while developing force-field parameters due to automated nature of these processes. Here to overcome the drawbacks of algorithm based analysis approaches, we have developed and utilized an approach that integrates machine-learning (ML) based stacked ensemble model (SEM) with MD simulations, for the first time. As a proof-of-concept, two SEMs were developed to analyze two dynamical properties of a water droplet, its contact angle, and hydrogen bonds. The two SEMs consisted of two layered networks of random forest, artificial neural network, support vector regression, Kernel ridge regression, and knearest neighbors ML models. The root-mean-square error values, uncertainty quantification, and sensitivity analysis of both the SEMs suggested that the final result was more accurate as compared to that of the individual ML models. This new computational framework is very general, robust, and has a huge potential in analyzing large size MD simulation trajectories as it can capture critical information very accurately. by the accuracy of an algorithm.5,11−17 The ML methods heavily rely on the amount, diversity, and integrity of the training data, features used to train the models, and hyperparameters used to construct the model.18−20 To develop supervised ML models, identifying the unique and important features related to a specific property of materials is the most important and challenging task.21 Typically, 75% to 90% of the total data that represent the input features and related properties is used to train an ML model, while remaining data is used to test the predictability of a model. To the best of our knowledge, in the field of materials science, to predict the properties of a material, in almost all the reported studies, only one type of ML model has been developed and used.18 Often relying on the predictions of only one ML model may not be ideal, especially during materials discovery or FF optimization. Hence, there is a pressing need to develop more robust, accurate, and reliable ML based approaches. Moreover, it has been predicted that by 2022 MD simulations may generate up to petabytes of data, which will constitute a valuable source of information.22 Hence, developing ML approaches that can be

1. INTRODUCTION Accuracy of the results of molecular dynamics (MD) simulations depends solely on the force-field (FF) parameters, and the precision of the results is determined by the algorithms and tools used to analyze MD simulation trajectories.1−3 In recent years, MD simulations have been integrated with optimization algorithms and machine-learning (ML) methods to accelerate the development of FF parameters and design of new materials.4−9 The ability of the algorithms/methods to perform accurate analysis becomes very critical during materials search or FF optimization because the input values for the next iteration of the optimization cycle are determined by evaluating the input sets and corresponding results in the current iterations.7,10 Typically, these analyses are performed by using computer codes that are developed on the basis of certain algorithms, which may have their own limitations. A wrong estimate of a desired property from MD simulations can sabotage the entire optimization process. Hence, it is important to develop methods/tools that are accurate, computationally efficient and have the ability to perform onthe-fly analysis of the MD simulation trajectories during optimization process. Machine-learning (ML) methods have been used to predict and/or analyze molecular simulations as they are not limited © 2019 American Chemical Society

Received: April 11, 2019 Revised: May 30, 2019 Published: May 31, 2019 5190

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

Article

The Journal of Physical Chemistry A

Figure 1. Workflow of the stacked ensemble models (SEMs) developed in the present study.

forming hydrogen bonds and its hydrogen bond status (i.e., presence/absence of a hydrogen bond) were used as the input and output features, respectively. To the best of our knowledge, this contribution demonstrates the first attempt to analyze dynamic properties from MD simulation trajectories by using such an ML based SEM approach. The sensitivity analysis suggests that the predictions of the individual model and, thus, the SEM depend on the size of the data. The uncertainty quantification to evaluate the performance of the SEM shows that the final predictions of the SEM are more accurate as compared to individual ML models. Figure 1 shows the architecture of the SEM developed in this present study, the overall process of the model training, and result predictions. The first layer of SEM was constructed by combining two models of RF, ANN, SVE, KRR, and KNN, a total of ten models. Details on the architecture and hyperparameters on each of these models are provided in the Supporting Information, section S1, and Table S1. The second layer had two RF models, which used output from the first layer as an input. The final results were obtained by averaging the outputs from the second layer.

trained with such big data to perform accurate predictions is very much desirable. Here, we present a novel computational framework that integrates an ML based stacked ensemble model (SEM) with MD simulation trajectories to perform accurate analysis.23,24 The SEM has several interconnected layers, where each layer consists of a number of different ML models. The ML models of the first layer receive the features and properties as input and output predictions, respectively. The predicted properties from the ML models in the first layer are combined and fed as input to train the ML models of the second layer, to enrich the predictions further.25 The predictions from every ML model has certain error associated with it, which depends on the inherent design of the ML model, hyperparameters chosen for the model, and the complex relationship between the input features and the output. This process of stacking ML models in one or more layers lead to enrichment of the overall prediction accuracy.25

2. METHODS Here, as a proof-of-concept two SEMs were developed to analyze an MD simulation trajectory for the following two dynamic properties of a water droplet: (i) its contact angle and (ii) hydrogen-bond network in a water droplet. The SEMs developed in the present study consisted of two layers and following five types of ML models: random forest (RF),26 artificial neural network (ANN),7,27 support vector regression (SVR),28 Kernel ridge regression (KRR),29 and k-nearest neighbors (k-NN).30 The contact angle SEM was trained by using an ideal, equilibrated water droplet obtained from MD simulation. A novel set of input features describing the geometry of a droplet were created, and the corresponding contact angle was used as an output feature. The hydrogen bond SEM was trained by using the MD simulation trajectory of a water droplet on a hexagonal boron nitride (BN) sheet. For this hydrogen bond SEM, the position of water molecules

3. RESULTS AND DISCUSSION 3.1. SEM for Predicting Contact Angle. 3.1.1. Data Generation for the Training of the SEM for Contact Angle. In order to develop a robust SEM that is capable of maintaining a high level of accuracy, under a wide range of inputs, it is important to train the ML models with a large set of high fidelity data.19 Specifically, these training data must cover a large range of feature landscape. Here, to create such a training data, a spherical water droplet with ∼200 Å diameter (∼150 000 SPC/Fw molecules) was created.31−33 This water droplet was further equilibrated by performing MD simulation for 1 ns. More details on the equilibration of the droplet are given in the Supporting Information section S2. The last two equilibrated frames of the water droplet obtained from the 5191

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

Article

The Journal of Physical Chemistry A

Figure 2. Feature generation and training of SEM from ideal water droplet. (a) Original droplet with 200 Å diameter (∼150 000 SPC/Fw water molecules). (b, c) Examples of two droplets created from 80 Å radius sphere. (d, e) Example of two droplets created from 50 Å radius spheres, for training of the SEM. (f) Schematic of input feature generation for the individual ML models in the SEM. Each row on the grid represent one feature, thus resulting in a feature set of 50 elements. We used only 19 features (3rd to 21st) as an input to our SEM out of 50, where the numbering index starts from the plane of the substrate. (g) 10 ML models of Layer 1 were trained by using ten unique data sets from total of 13 data sets. (h) Training strategy for one of the two ML models, namely RF#3 model from Layer 2 is shown. RF#4 from Layer 2 was trained using twelfth data set in similar manner.

simulation were used to create 15 762 feature droplets. Specifically, a larger droplet from each frame, with a ∼200 Å diameter, was used to carve out 71 full spherical subdroplets with radius ranging from 30 to 100 Å with 1 Å increments Figure 2a. To mimic a water droplet on a planar surface, each of these full spherical subdroplets was then sliced using a cutoff plane in Z-direction, as shown in Figure 2b−e, to create feature droplets. The cutoff plane was moved in Z-direction by 1 Å steps to create new feature droplets with different contact angles from each of the full spherical subdroplets. To generate droplets that are observed in realistic hydrophilic and hydrophobic surfaces, we ensured that the cutoff plane position ranged from (10 Å to R Å) to (R Å to 10 Å) from the center of the sphere. Here, R represents the radius of the spherical subdroplet. As expected, a larger full spherical subdroplet with 100 Å radius yielded 181 feature droplets as compared to 41 feature droplets generated from a smaller full spherical subdroplet of 30 Å radius. Thus, by using all the 71 full spherical subdroplets, we were able to generate a total of 7881 feature droplets for each frame (total of 15 762 feature droplets). The features extracted from these feature droplets and their corresponding contact angles were used to train and

validate the ML models, which is sufficiently large data set to train a high-fidelity model. 3.1.2. Input Feature Engineering from a Feature Droplet. The predictive power of an ML model strongly depends on the selected input features. Generation and selection of these input features is one of the most basic and challenging steps for applications of ML models. In the present study, we have developed a set of novel input features that are transferable to a water droplet placed on any hydrophilic or hydrophobic surface. Moreover, these new features can be used to represent a water droplet of any size, represented by any type of commonly used atomistic water models, e.g., SPC, SPC/E, TIP3P, etc.34 Figure 2f shows the schematic of the feature development process that involved the mapping of positions of atoms from a 3-D droplet on a 2-D grid, and calculating the number density of water molecules in every cell in the grid. Specifically, from a feature droplet, only the oxygen of water molecules was used for feature creation. The 3-D coordinates (x, y, z) of the oxygen atoms of the water were reduced to a 2D coordinate (r, Z) system, where r is the radial distance of an atom from the Z-axis. The Z-axis is a vector normal to the substrate plane passing through the center of the droplet, as shown in Figure 2f. The (r, Z) plane was divided into 50 × 50 5192

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

Article

The Journal of Physical Chemistry A

Figure 3. Parity plots representing the training and testing of individual models in the SEM for contact angle predictions: (a) ML models in Layer 1, (b) ML models in Layer 2, (c) final prediction of the SEM, and (d) predictions of the contact angles. Data shown in black dots on the diagonal of each model represent the test data. The probability distributions of bootstrapping errors are shown in insets. The black lines in insets show the mean bootstrapping error, and the region between two red lines shows the 95% confidence interval obtained from bootstrapping technique. Contact angles estimated by the circular fit method vs the SEM method. The two angle values in the parentheses, next to the inset droplet images, represent the angles calculated and predicted by the circular fit and SEM models, respectively.

the basis of the radius of the sphere encompassing the droplet (R), and the distance of the cutoff plane from the center of the sphere (H). Water molecules at the interface of the cutoff plane and the droplet were kept or removed from the droplet by considering the position of an oxygen atom with respect to the cutoff plane. In a typical MD simulation of a water droplet on a hydrophilic or hydrophobic substrate, the distance between a water layer in direct contact with the substrate and top surface of the substrate could be ∼3.0 to ∼4.0 Å. Here, to reproduce this, we assumed a substrate plane to be ∼3.5 Å below the cutoff plane for contact angle calculations. The contact angle Θ for each of the feature droplets was calculated by using the equation below:

square grid, which could be mapped on all the feature droplets generated in the present study. The 2 Å spacing in the grid generated cells with optimum oxygen number density. For every cell in the grid, number density of the oxygen atoms was calculated by dividing their number with the volume occupied by corresponding water molecules. The projection of a 2-D cell in (r, Z) space looks like a torus shaped region in 3-D space as shown in Figure 2f. The number density values across the rows was then added, yielding a feature vector of 50 elements, which was normalized by dividing every element with the maximum value. This allowed us to generate features that can be used to predict the contact angle of water on a given hydrophilic or hydrophobic surface, irrespective of its size. One can fit a circle using the points representing the water−air interface mapped on a 2-D plane, with more than four to five points. The circle fit improves with the number of points before becoming insensitive to them. In other words, after a certain number of points, the additional points become redundant for a fitting a circle. For the training of the SEM model as discussed in detail in the following section, only 19 elements were utilized from this 50 element vector. Specifically, the elements in the first two rows of the grid near the water−surface interface were ignored. We observed that 19 elements were sufficient to represent the shape of the droplet. Thus, the 19 elements from third to 21st were utilized as input features for the training of the SEM. 3.1.3. Calculations of Contact Angle of Feature Droplets. The contact angle of a feature droplet was used as an output feature along with the corresponding input features to train individual ML models in the SEM. As shown in Figure 2b−e, the contact angle of a given feature droplet was calculated on

Θ = 90 − φ

(1)

φ = cos−1[(H − 3.5)/R]

(2)

H is the distance of the cutoff plane from the center of the sphere, and it is positive in the positive z-direction and negative in the negative z-direction. Θ and φ are shown in Figure 2b−e, The angle calculated by using eq 1 is referred as an “Actual Angle” in Figure 3. 3.1.4. Training of the Contact Angle SEM. To train the SEM, a water droplet from the MD simulation trajectory is converted into input and output features as discussed above. The number of water molecules and 19 features of a droplet are combined to create a feature vector of size 1 × 20. The feature vector is fed into every ML model in Layer 1, and the contact angles predicted by each ML model is combined to form a 1 × 10 feature vector. Therefore, the output of Layer 1 is a vector with 10 contact angle values as predicted by 5193

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

Article

The Journal of Physical Chemistry A

Figure 4. Parity plot representing the training and testing of SEM for contact angle predictions with (a) 1%, (b) 2%, (c) 5%, (d) 10%, (e) 20%, (f) 40%, (g) 60%, and (h) 80% data. Data shown in black on each model represent the test data. Insets in (a)−(h) show the uncertainty quantified by bootstrapping method. (i) Plot for the percentage (%) of data used for training and corresponding mean prediction error. It suggests the convergence of the mean error for the 10% to 100% data set. The error bar on each data shows the 95% confidence interval range.

2 (Figure 2h). Thus, the size of the input feature vector depends directly on the layer where a given ML model resides. Similarly, RF #4 model was trained using the twelfth data set. 3.1.5. Results of Training of an SEM for Contact Angle Predictions. Parts a and b of Figure 3 show the prediction accuracy of the ten and two models used in Layer 1 and Layer 2 of the SEM, respectively. The averaged predictions of the two ML models in Layer 2 are also shown in Figure 3c as final predictions. We find a tight clustering of points about the main diagonal of the predicted angle vs actual angle plot. The accuracy is as follows: final prediction > Layer 2 > Layer 1. The increased tight clustering of points in final predictions as compared to Layer 1 suggest that the ML models in Layer 2 predict the contact angle more accurately. The root-meansquare error (RMSE) value for each model is shown in section S3 of the Supporting Information (Figure S6a and Table S2), and it decreases for the final predictions as compared to the individual ML models in Layer 1 and Layer 2. The RMSE is aggregation of the residuals and can be used to evaluate the accuracy of the predictive model.36 A low RMSE value points to lower residuals, and hence higher prediction accuracy. To further investigate the model performance in detail, the uncertainty quantification was performed using bootstrap sampling technique.37−39 In a bootstrapping technique, a number of artificial data sets are generated by choosing points, randomly, from an existing data set. It has been effectively used to study uncertainties in different ML models.40,41 The insets in Figure 3a−c show the probability distribution histograms of the bootstrapping errors with 1000 measurements for each model. The black vertical line shows the mean value of the errors, and the region between two dotted red lines show the 95% likelihood confidence interval. It can be seen that the mean of the model prediction error for the models in Layer 1 ranges from ∼0.74 to ∼2.06. Considering the mean of the bootstrap model prediction error, the 95% confidence interval, and the RMSE, the final average predictions of the SEM performs better as compared to individual models in Layer 1 and Layer 2 (Figure S6a,b of the Supporting Information). 3.1.6. Testing of the Contact Angle SEM on MD simulation Trajectories of a Water Droplet. To further evaluate the reliability and accuracy of this new approach, we

individual ML models. Individual ML models in Layer 2 are trained to reduce the prediction errors made by the ML models in Layer 1. This task is performed by using the already trained models in Layer 1 to make predictions for a previously unused, unique training data set. The contact angle predictions from Layer 1 (10 contact angles, one from each ML model) are then used as the input for an ML model in Layer 2. A given ML model in Layer 2 is trained to make a better prediction for the contact angle based on the 10 predictions made by the first layer. This process is repeated for each ML model in Layer 2. The two ML models in Layer 2 predict the two values of contact angles (one for each ML model). These two predictions were then averaged to get the final value of contact angle predicted by the SEM. In detail, the original data set of 15 762 feature droplets was split into typical 80:20 training and testing data sets. Thus, making it 12 000 data points for training and 3762 data points for testing. The training data set was further split into 13 equal size subsets, where 12 sets were used for training and 1 set for the validation. Each individual ML model in the SEM was trained on one of these independent training subsets of the original data set as shown in Figure 2g,h. This ensures the independence of the various ML models, which is essential for the high fidelity of the ensemble method. A droplet feature vector of 20 elements was used as input for the contact angle SEM. The first element of the input feature vector represents the number of water molecules in a feature droplet. The remaining 19 elements were taken from the geometrical features of the droplet. The 19 droplet features were third to 21st elements of the droplet features starting from the water− surface interface. In other words, the first two elements near the water−surface interface were ignored, as it is a common practice in contact angle calculations to ignore the region near water−surface interface (1200 training feature droplets. 3.2. SEM for Predicting Hydrogen Bonds. 3.2.1. Data Generation for the Training of the SEM for Hydrogen Bonds. To further validate the robustness of the SEM approach, we developed an SEM to analyze the hydrogen bonds present in water molecules in a droplet. Specifically, we performed MD simulations of an SPC/Fw water droplet with 2000 molecules on a single BN sheet at 300 K for ∼2 ns.33 The BN sheet was represented by using the Tersoff potential.42 The interactions between SPC/Fw and BN sheet were modeled using the recently developed parameters to reproduce the macroscopic contact angle of water on hexagonal BN sheet.33 The simulation trajectory was stored after every 1 ps, which resulted in 1000 frames. All the MD simulation were performed using LAMMPS package.43 The BN sheet was infinite in X- and Y-directions, which was implemented by periodic boundary conditions (PBC). In the Z-direction box size was 300 Å so as to avoid interactions of the droplet with its image and to create an air−water interface. Each water molecule’s hydrogen bond status (i.e., presence or absence of a hydrogen bond) was considered during this analysis. The hydrogen bond status for each water molecule was calculated using the conditions described in the eq 3. A total of 100 000 data points were used to train the SEM for hydrogen bonds. 3.2.2. Input Features for Hydrogen Bond SEM. In the literature, various geometric and energetic criteria have been used to define a hydrogen bond between two water molecules.2,44,45 Here we have used the following geometric criteria to identify two hydrogen bonded water molecules: R OO ≤ 3.6 Å; R OH ≤ 2.45 Å; φ ≤ 30 Å

placing the oxygen of the central molecule at the origin (0, 0, 0). Then the central atom was reoriented such that the vector connecting the oxygen atom to one of the hydrogen atoms was aligned with the x-axis. The normal vector to the water molecular plane was aligned with the y-axis. The neighboring water molecules were reoriented to retain their original position from the simulation trajectory with respect to the central water molecule. This relative position of the neighbor water molecule, i.e., x, y, and z coordinates of oxygen and hydrogen atoms, resulted in a feature with a vector of nine elements. These features of a neighboring water molecule were used as input to train the SEM for hydrogen bond predictions. These features developed here are very general and can be used to define hydrogen bonds of a water droplet of any size and on a hydrophilic or a hydrophobic surface. 3.2.3. Output Features for Hydrogen Bond SEM. The hydrogen bond status, i.e., presence or absence of a hydrogen bond between the central water molecule and its neighboring water molecules was used as the output feature to train the individual ML models in Layer 1 and Layer 2 of the SEM. The hydrogen bonds were identified by using the geometric criteria defined in eq 3. The output feature was set to 0 and 1 if the hydrogen bond was absent and present, respectively, between the two water molecules. 3.2.4. Results of Training of an SEM for Hydrogen Bond Predictions. Here, to generate a high-quality diverse data set, the 100 000 input features of SPC/Fw water molecules and their corresponding hydrogen bond status were randomly divided into 13 equal parts. The training of the individual models in SEM for hydrogen bonds was also trained by using the approach demonstrated in Figure 2g,h. Briefly, 10 data sets were used to train individual models in Layer 1. The 11th and 12th data sets were used to train the RF#3 and RF#4 from Layer 2, respectively. The 13th data set was used to test all the ML models in the SEM. The final predictions were estimated by averaging the results of RF#3 and RF#4 models. To evaluate the robustness of this SEM in predicting the accurate hydrogen bonds, we utilized the water droplets with 2000 molecules of SPC/Fw, SPC, and SPC/E water models.33,46 Parts d−f of Figure 5 show the hydrogen bonds calculated by using the SEM (top panel), geometric criteria shown in eq 3 (middle panel), and the difference between the results obtained from SEM and geometric criteria (lower panel). Overall, for all three water models, the SEM for hydrogen bond shows an excellent agreement with the results obtained from the geometric criteria shown in eq 3. The blue region in the lower panel suggests that predictions of the ML model in the bulk region match very well with the geometric criteria. The hydrogen bonds predicted by SEM at the air−water and substrate−water interface are slightly higher (5%) as compared to the geometric criteria. A possible explanation for the higher number of hydrogen bond predicted by the SEM in the interfacial region could be due to fact that the training set has significantly higher number of features corresponding to the water molecules in the bulk region. Hence, the SEM learning is slightly biased toward predicting the hydrogen bond for the water molecules in bulk region more accurately at the expense of interfacial water molecules. Figure 5d−f and Table S4 from Supporting Information show the results obtained from uncertainty quantification performed by using bootstrap sampling technique.39 The results were obtained by taking 1000 averages of 1000 resampled data sets from validation data set. It can be seen

(3)

In eq 3, the ROO represent the distance between oxygen of a donor and acceptor atoms. The ROH, represents the distance between hydrogen atom of a donor molecule and oxygen atom of an acceptor molecule. The φ is the angle between oxygen of an acceptor molecule, oxygen of a donor molecule, and hydrogen of a donor molecule (Oacceptor···Odonor−Hdonor). Note, two water molecules that follow these geometric criteria are known to satisfy the hydrogen bond energetic criteria.2 In the present study, to generate the input features for the training of an SEM of hydrogen bonds, two water molecules with oxygen−oxygen distance within 3.6 Å were identified. Each water molecule present in the bulk had four to eight neighboring molecules and had a tetrahedron-like structure (Figure 5b). The molecule for which its neighbors were identified was placed at the center of a tetrahedron (referred as central molecule from here on), with its neighbors appearing at the vertices of the tetrahedron. The water molecules forming hydrogen bonds with each other, closely followed the geometric criteria shown in eq 3. The individual ML models in Layer 1 and Layer 2 were trained by using the relative position of oxygen and hydrogen atoms (x, y, and z coordinates) of a neighboring water molecule that form hydrogen bonds with the central molecule. This relative position of a neighboring water molecule was calculated by 5196

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

The Journal of Physical Chemistry A



that the average percentage error associated with the individual ML models is higher compared to that of the final output of the SEM. Moreover, the region that shows the 95% likelihood confidence interval also becomes narrow for the final predictions, suggesting that the accuracy of the SEM is further improved. We also evaluated the SEM with various statistical measures and training data size in Supporting Information section S6 (Figures S9 and S10, Table S4 and S5). These statistical measures suggest, similarly to SEM for the contact angle, the prediction accuracy improved by stacking the ML models. The bootstrap average prediction error did not significantly change with increasing the training data size, but the 95% confidence interval narrowed with a larger data size. The sensitivity analysis on the training data size showed that the mean error and 95% confidence interval stabilized above ∼40 000 training points (Figure S11 and Table S6 in Supporting Information).

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +1 540-231-8785. ORCID

Karteek K. Bejagam: 0000-0002-8660-9946 Sanket A. Deshmukh: 0000-0001-7573-0057 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The authors acknowledge Advanced Research Computing (ARC), Virginia Tech, for providing the computational resources. S.A.D. acknowledges the Start-up Funds from Virginia Tech. Authors are thankful to Prof. Luke Achenie for fruitful discussion on this work.



4. CONCLUSION A novel computational framework that employs the stacked ensemble model (SEM), for the first time, to analyze the molecular dynamics (MD) simulation trajectories was developed. As a proof-of-concept, two SEMs were developed and used to analyze the contact angle of a water droplet and the hydrogen bonds in water molecules in a droplet. The SEMs had two layers of several different machine-learning (ML) models including, random forest (RF), artificial neural network (ANN), support vector regression (SVR), k-nearest neighbors (k-NN), and Kernel ridge regression (KRR). To train the contact angle SEM, droplet size independent, unique features associated with the geometrical shape were developed. This SEM was trained to predict the contact angle of a water droplet for the corresponding input features. The uncertainty quantification using the bootstrapping technique and the sensitivity of the SEM were also studied. The results of the mean of the bootstrap model prediction error, the 95% confidence interval, and the RMSE suggest that final predictions of the SEM are better than those of individual models in the SEM. This SEM was also successfully used to predict the contact angle of a water droplet from MD simulation trajectory. The hydrogen bond SEM was trained by using the relative position of a water molecule forming a hydrogen bond with a given water molecule. Again, we find that the accuracy of the final prediction of the SEM was higher than that of the predictions of the individual ML models in the SEM. Overall, the SEM approach proposed here is novel and general and can be utilized to analyze structural and dynamical properties of other solvents, materials, and biomolecules. Moreover, it can be extended to analyze trajectories/data obtained from other computational methods such as density functional theory, Monte Carlo simulations, and continuum calculations, to name a few.



Article

REFERENCES

(1) Bejagam, K. K.; Singh, S.; Deshmukh, S. A. Nanoparticle Activated and Directed Assembly of Graphene into a Nanoscroll. Carbon 2018, 134, 43−52. (2) Deshmukh, S.; Mooney, D. A.; McDermott, T.; Kulkarni, S.; Don MacElroy, J. M. Molecular Modeling of Thermo-Responsive Hydrogels: Observation of Lower Critical Solution Temperature. Soft Matter 2009, 5 (7), 1514−1521. (3) Santiso, E.; Herdes, C.; Müller, E. On the Calculation of SolidFluid Contact Angles from Molecular Dynamics. Entropy 2013, 15 (9), 3734−3745. (4) An, Y.; Bejagam, K. K.; Deshmukh, S. A. Development of Transferable Nonbonded Interactions between Coarse-Grained Hydrocarbon and Water Models. J. Phys. Chem. B 2019, 123909 (5) Bejagam, K. K.; An, Y.; Singh, S.; Deshmukh, S. A. MachineLearning Enabled New Insights into the Coil-to-Globule Transition of Thermosensitive Polymers Using a Coarse-Grained Model. J. Phys. Chem. Lett. 2018, 9, 6480−6488. (6) An, Y.; Bejagam, K. K.; Deshmukh, S. A. Development of New Transferable Coarse-Grained Models of Hydrocarbons. J. Phys. Chem. B 2018, 122 (28), 7143−7153. (7) Bejagam, K. K.; Singh, S.; An, Y.; Deshmukh, S. A. MachineLearned Coarse-Grained Models. J. Phys. Chem. Lett. 2018, 9 (16), 4667−4672. (8) Patra, T. K.; Meenakshisundaram, V.; Hung, J.-H.; Simmons, D. S. Neural-Network-Biased Genetic Algorithms for Materials Design: Evolutionary Algorithms That Learn. ACS Comb. Sci. 2017, 19 (2), 96−107. (9) Bejagam, K. K.; Singh, S.; An, Y.; Berry, C.; Deshmukh, S. A. PSO-Assisted Development of New Transferable Coarse-Grained Water Models. J. Phys. Chem. B 2019, 123 (4), 909−921. (10) Meenakshisundaram, V.; Hung, J.-H.; Patra, T. K.; Simmons, D. S. Designing Sequence-Specific Copolymer Compatibilizers Using a Molecular-Dynamics-Simulation-Based Genetic Algorithm. Macromolecules 2017, 50 (3), 1155−1166. (11) Gastegger, M.; Behler, J.; Marquetand, P. Machine Learning Molecular Dynamics for the Simulation of Infrared Spectra. Chem. Sci. 2017, 8 (10), 6924−6935. (12) Glazer, D. S.; Radmer, R. J.; Altman, R. B. Combining Molecular Dynamics and Machine Learning to Improve Protein Function Recognition. Pac. Symp. Biocomput. 2008, 332−343. (13) Montavon, G.; Rupp, M.; Gobre, V.; Vazquez-Mayagoitia, A.; Hansen, K.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Machine Learning of Molecular Electronic Properties in Chemical Compound Space. New J. Phys. 2013, 15 (9), No. 095003. (14) Tribello, G. A.; Ceriotti, M.; Parrinello, M. Using Sketch-Map Coordinates to Analyze and Bias Molecular Dynamics Simulations. Proc. Natl. Acad. Sci. U. S. A. 2012, 109 (14), 5196−5201. (15) Decherchi, S.; Berteotti, A.; Bottegoni, G.; Rocchia, W.; Cavalli, A. The Ligand Binding Mechanism to Purine Nucleoside Phosphor-

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jpca.9b03420. Details of machine-learning (ML) models, the method of generating 200 Å-diameter water droplet, evaluation of the performance of the stacked ensemble model (SEM), computational details of a water droplets on a 2D sheet, sensitivity analysis of SEM (PDF) 5197

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198

Article

The Journal of Physical Chemistry A ylase Elucidated via Molecular Dynamics and Machine Learning. Nat. Commun. 2015, 6, 6155. (16) Arhami, M.; Kamali, N.; Rajabi, M. M. Predicting Hourly Air Pollutant Levels Using Artificial Neural Networks Coupled with Uncertainty Analysis by Monte Carlo Simulations. Environ. Sci. Pollut. Res. 2013, 20 (7), 4777−4789. (17) Jorge, M.; Fischer, M.; Gomes, J. R. B.; Siquet, C.; Santos, J. C.; Rodrigues, A. E. Accurate Model for Predicting Adsorption of Olefins and Paraffins on MOFs with Open Metal Sites. Ind. Eng. Chem. Res. 2014, 53 (40), 15475−15487. (18) Liu, Y.; Zhao, T.; Ju, W.; Shi, S. Materials Discovery and Design Using Machine Learning. Journal of Materiomics 2017, 3 (3), 159− 177. (19) Millard, K.; Richardson, M. On the Importance of Training Data Sample Selection in Random Forest Image Classification: A Case Study in Peatland Ecosystem Mapping. Remote Sensing 2015, 7 (7), 8489−8515. (20) Singh, S.; Abbassi, H. 1D/3D Transient HVAC Thermal Modeling of an off-Highway Machinery Cabin Using CFD-ANN Hybrid Method. Appl. Therm. Eng. 2018, 135, 406−417. (21) Li, Z.; Ma, X.; Xin, H. Feature Engineering of MachineLearning Chemisorption Models for Catalyst Design. Catal. Today 2017, 280, 232−238. (22) Pérez, A.; Martínez-Rosell, G.; De Fabritiis, G. Simulations Meet Machine Learning in Structural Biology. Curr. Opin. Struct. Biol. 2018, 49, 139−144. (23) Funda, Güneş; Russ, Wolfinger; Pei-Yi Tan Stacked Ensemble Models for Improved Prediction Accuracy, Static Analysis Symposium, 2017. (24) Sehgal, M. S. B.; Gondal, I.; Dooley, L. Stacked Regression Ensemble for Cancer Class Prediction. INDIN ’05. 2005 3rd IEEE International Conference on Industrial Informatics, 2005. (25) Wang, G.; Hao, J.; Ma, J.; Jiang, H. A Comparative Assessment of Ensemble Learning for Credit Scoring. Expert Syst. Appl. 2011, 38 (1), 223−230. (26) Breiman, L. Random Forests. Mach. Learn. 2001, 45 (1), 5−32. (27) Jain, A. K.; Mao, J.; Mohiuddin, K. M. Artificial Neural Networks: A Tutorial. Computer 1996, 29 (3), 31−44. (28) Suykens, J. A. K.; Vandewalle, J. Least Squares Support Vector Machine Classifiers. Neural Process. Letters 1999, 9 (3), 293−300. (29) An, S.; Liu, W.; Venkatesh, S. Face Recognition Using Kernel Ridge Regression. In 2007 IEEE Conference on Computer Vision and Pattern Recognition; 2007, 1−7. (30) Kramer, O. K-Nearest Neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Kramer, O., Ed.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2013, 13−23. (31) Wu, Y.; Tepper, H. L.; Voth, G. A. Flexible Simple PointCharge Water Model with Improved Liquid-State Properties. J. Chem. Phys. 2006, 124 (2), No. 024503. (32) Bejagam, K. K.; Singh, S.; Deshmukh, S. A. Development of Non-Bonded Interaction Parameters between Graphene and Water Using Particle Swarm Optimization. J. Comput. Chem. 2018, 39, 721− 734. (33) Achari, P. F.; Bejagam, K. K.; Singh, S.; Deshmukh, S. A. Development of Non-Bonded Interaction Parameters between Hexagonal Boron-Nitride and Water. Comput. Mater. Sci. 2019, 161, 339−345. (34) Mark, P.; Nilsson, L. Structure and Dynamics of the TIP3P, SPC, and SPC/E Water Models at 298 K. J. Phys. Chem. A 2001, 105 (43), 9954−9960. (35) Werder, T.; Walther, J. H.; Jaffe, R. L.; Halicioglu, T.; Koumoutsakos, P. On the Water−Carbon Interaction for Use in Molecular Dynamics Simulations of Graphite and Carbon Nanotubes. J. Phys. Chem. B 2003, 107 (6), 1345−1352. (36) Géron, A. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems; O’Reilly Media, Inc., 2017. (37) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; CRC Press, 1994.

(38) Davison, A. C.; Kuonen, D. An Introduction to the Bootstrap with Applications in R. Statistical computing & Statistical graphics newsletter 2002, 13 (1), 6−11. (39) Hirsch, R. M.; Archfield, S. A.; De Cicco, L. A. A Bootstrap Method for Estimating Uncertainty of Water Quality Trends. Environmental Modelling & Software 2015, 73, 148−166. (40) Reich, Y.; Barai, S. V. Evaluating Machine Learning Models for Engineering Problems. Artificial Intelligence in Engineering 1999, 13 (3), 257−272. (41) Li, Z.; Omidvar, N.; Chin, W. S.; Robb, E.; Morris, A.; Achenie, L.; Xin, H. Machine-Learning Energy Gaps of Porphyrins with Molecular Graph Representations. J. Phys. Chem. A 2018, 122 (18), 4571−4578. (42) Brenner, D. W.; Shenderova, O. A.; Harrison, J. A.; Stuart, S. J.; Ni, B.; Sinnott, S. B. A Second-Generation Reactive Empirical Bond Order (REBO) Potential Energy Expression for Hydrocarbons. J. Phys.: Condens. Matter 2002, 14 (4), 783. (43) Plimpton, S.; Crozier, P.; Thompson, A. LAMMPS-Large-Scale Atomic/Molecular Massively Parallel Simulator; Sandia National Laboratories, 2007; p 18. (44) Deshmukh, S. A.; Sankaranarayanan, S. K. R. S.; Suthar, K.; Mancini, D. C. Role of Solvation Dynamics and Local Ordering of Water in Inducing Conformational Transitions in poly(N-Isopropylacrylamide) Oligomers through the LCST. J. Phys. Chem. B 2012, 116 (9), 2651−2663. (45) Deshmukh, S.; Mooney, D. A.; MacElroy, J. Molecular Simulation Study of the Effect of Cross-Linker on the Properties of poly(N-Isopropyl Acrylamide) Hydrogel. Mol. Simul. 2011, 37 (10), 846−854. (46) Vega, C.; de Miguel, E. Surface Tension of the Most Popular Models of Water by Using the Test-Area Simulation Method. J. Chem. Phys. 2007, 126 (15), 154707.

5198

DOI: 10.1021/acs.jpca.9b03420 J. Phys. Chem. A 2019, 123, 5190−5198