An Improved Structure−Property Model for Predicting Melting-Point

In this work, we present a new QSPR model for predicting the melting-point temperature of a diverse organic dataset. The model benefits from the inclu...
0 downloads 13 Views 308KB Size
Ind. Eng. Chem. Res. 2006, 45, 5117-5126

5117

An Improved Structure-Property Model for Predicting Melting-Point Temperatures Srinivasa S. Godavarthy, Robert L. Robinson, Jr., and Khaled A. M. Gasem* School of Chemical Engineering, Oklahoma State UniVersity, Stillwater, Oklahoma 74078

Physical properties and thermodynamic data are essential inputs to all computer-aided molecular design applications. Basic properties, such as the melting-point temperature, are essential for developing custom chemicals with desired thermophysical behavior. Currently, accurate correlations for the melting-point temperature are limited, including recent attempts to use quantitative structure-property relationships (QSPR). The lack of a comprehensive melting-point model can be attributed to (a) the sensitivity of the melting point to subtle variations in molecular structure and (b) the inability of existing molecular descriptors to account satisfactorily for all factors that influence the melting-point behavior. In this work, we present a new QSPR model for predicting the melting-point temperature of a diverse organic dataset. The model benefits from the inclusion of novel nonlinear descriptors developed through the use of robust genetic algorithms (GAs) and neural networks. Three new descriptors were developed to account for the nonlinear effect of molecular weight, hydrogen bonding, and molecular symmetry on the melting-point temperature. The resultant QSPR model is capable of modeling melting-point behavior with a root-mean-square error (RMSE) of 12.6 K, an average absolute percent deviation (%AAD) of 4.7 in absolute temperature, and a correlation coefficient of R2 ) 0.95. This work differs from other literature efforts in that (a) an extensive dataset of over 1250 molecules is used for model development and (b) descriptor selection is performed using nonlinear algorithms. The results suggest that often-ignored structural descriptors such as molecular weight and symmetry have major roles in determining the melting point. 1. Introduction The melting point is the temperature at which the solid and liquid phases of a pure substance coexist in equilibrium at atmospheric pressure. It is an important physical property and has been studied extensively in the literature. Several experimental and modeling frameworks have been developed for obtaining the melting-point temperature. Besides the direct use of melting points to indicate the solid-liquid phase transition, they are also used to estimate the aqueous solubility,1 toxicity,2 and enthalpy and entropy of melting.3 We were motivated in this study by the need for accurate melting-point predictions for use in chemical design as well as in estimating other significant thermophysical properties. Several approaches, empirical as well as semiempirical, have been suggested in the literature for the estimation of meltingpoint temperatures. These include (a) group contribution methods and (b) correlations using structural parameters. However, all the existing strategies fail to capture fully the structural subtleties responsible for melting-point behavior, and they often provide poor melting-point predictions. Furthermore, most literature efforts are limited to homologous subsets of molecules, which precludes the extension of these models to other organic chemicals. Similar problems exist with group contribution methods, which are limited in applicability to structures that contain the functional groups included in the training set. Also, the literature models often require large amounts of experimental data for fine-tuning the model parameters. To overcome these difficulties, several structure-based models have been proposed. However, these quantitative structureproperty relationship (QSPR) models suffer from several * To whom all correspondence should be addressed. Tel.: (405)744-5280. Fax: (405) 744-6338. E-mail address: [email protected].

drawbacks, including (a) a reliance on linear models for QSPR model development, (b) the use of off-the-shelf structural descriptors, and (c) the use of general-purpose heuristic algorithms for molecular screening. Previous literature efforts have shown that the relationship between molecular structure and a thermophysical property is often nonlinear. For example, previous literature efforts have suggested that molecular weight might have a nonlinear correlation with melting-point behavior. As such, linear models fail to establish properly the significance of molecular weight as a descriptor. Also, off-the-shelf structural descriptors available through most commercial QSPR software do not account for the spatial arrangement (symmetry) or crystal packing of the chemicals. A new QSPR model that uses novel descriptors, established from theoretical observations, and nonlinear algorithms for model development is required. Accordingly, the specific objectives of the present work were (a) to study the utility of new molecular descriptors in correlating melting-point temperature, specifically, the effects of structural symmetry, functional groups, and molecular weight descriptors; and (b) to investigate the potential for improving existing linear melting-point correlations through the use of nonlinear algorithms. 2. Some Factors Affecting the Melting Point A crystalline solid is usually visualized as a rigid structure. However, in reality, even these crystals exhibit disorder at the atomic scale (atomic and molecular vibrations), and this disorder increases with temperature. Nevertheless, a molecule retains its structure because of the presence of intermolecular attractive forces. However, with increasing temperature, the vibrations steadily increase until they overcome the attractive forces holding the crystal together. The temperature at which this change occurs is the melting-point temperature. Melting is the phenomenon where the solid transforms to a liquid state and

10.1021/ie051130p CCC: $33.50 © 2006 American Chemical Society Published on Web 06/10/2006

5118

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006

the free energy of the phase transition is zero. Thermodynamically, this can be represented as

∆G ) ∆Hm - Tm∆Sm ) 0 Tm )

∆Hm ∆Sm

(1) (2)

The aforementioned relation indicates that both enthalpic and entropic forces affect the melting-point behavior of any molecule.4 Enthalpic forces include ionic, dipole, dispersion, and hydrogen-bonding effects, whereas entropic forces include effects that describe position, rotation, translation, and conformational flexibility (vibration). Tesconi3 showed that ∆Hm in the aforementioned equation is influenced by hydrogen bonding and ∆Sm is affected by the rotational and vibrational degrees of freedom. Not all types of hydrogen bonding affect melting-point behavior equally; for example, when hydrogen bonding is intramolecular, only a modest effect on the melting point was observed; however, when hydrogen bonding is intermolecular, a definite increase in the melting point is observed. Several researchers have suggested that the intermolecular bonding is highly dependent on the type of packing. They noted that strong bonding and higher melting points were observed in closepacked crystals, whereas weak bonding and lower melting temperatures were observed in amorphous crystals. Dannenfelser et al.5 have suggested that crystal structure and packing have a major role in determining the melting point of any molecule. Rotational entropy, which is calculated from the symmetry of the molecule, also determines the entropy of melting. A molecule that exhibits high rotational entropy (low symmetry) has a low melting point. Several models have been proposed to calculate the entropy of melting. Group contribution models were first suggested by Chickos et al.6 to estimate the entropy of melting for monosubstituted and multisubstituted organic molecules. They reported an average error of less than 2 eu for their model. Models based on symmetry number (σ) have also been proposed by Dannenfelser et al.:5

∆Sm ) 13.5 - R ln σ

(3)

The symmetry number σ represents the number of ways that a molecule can be oriented to give indistinguishable images; for example, the symmetry number of benzene is 12 and that of toluene is 2. In his work, Dannenfelser et al.5 evaluated the symmetry of 362 organic molecules and calculated their entropies of melting. Later, Dannenfelser and Yalkowsky7 calculated the entropies of melting for over 949 organic molecules by accounting for both rotational and vibrational effects. They reported an average error of 12.5 J K-1 mol-1. Although symmetry-based estimates provide some insight into the relationship between the melting points of molecules and their crystal structures, no methodologies exist to predict the melting points of molecules that exhibit polymorphism. Also, several organic molecules are known to exhibit liquid-crystal behavior. In addition, oscillations in organic molecules are temperature-dependent; thus, some molecules reorient their crystal lattice via rotation in the molecular plane, which causes the solid to deform at temperatures below the melting point. This reorientation of the crystal lattice has not yet been addressed adequately. Molecules that exhibited any of the aforementioned anomalous behaviors were eliminated from the database that we used for model development.

3. Literature Models for Melting-Point Predictions Numerous models to estimate melting points have been proposed over the years.8 Many literature correlations, although accurate, have been developed to describe the melting behavior of homologous series of organic molecules and fail when they are extended to diverse organic datasets. Despite their limitations, the literature models provide insight into the structural features that are significant to the estimation of the meltingpoint temperature. The earliest attempt to predict melting points was that of Mills,9 who proposed the following expression to correlate the melting points of homologous series of alkanes:

Tm )

β(x - c) 1 + γ(x - c)

(4)

where x represents the number of methylene (CH2) groups and β, γ, and c are constants. This correlation included the concept of convergence temperature, i.e., the melting point of alkanes of infinite chain length converges to β/γ. Kipping10 noticed that, for ketones with two alkyl chains, several isomers existed with different melting points and suggested modifications to the Mills correlation to account for the effect of isomers. The first attempt to correlate the structure to the melting point was performed by Longinescu,11 who derived an expression that was based on the density (D) and the number of atoms (n):

Tm ) 8.37Dxn

(5)

Tsakalotos12 noticed the oscillation in the melting point between even- and odd-chain hydrocarbons and proposed an equation based on molecular symmetry. He reported an average error of just 0.4 K in the prediction of the melting point of C17 to C24 hydrocarbons:

Tm ) 85 - 0.01882(σ - 1)2

(6)

Several researchers also have identified the significance of molecular vibrations and tried to correlate the effect of vibrations on crystal lattice disruption and melting point.13,14 Prud’homme15 observed that

Tm + T b ) T c

(7)

Lorenz and Herz16 determined that the ratio of Tm/Tb ranged in value from 0.3 to 1.0, with the average for organic molecules being 0.5893. Lyman et al.17 used this correlation on 12 diverse organic sets and found the average error to be ∼27 K. Following the work of Lindemann13 and Robertson,14 numerous studies have been reported on the relationship between molecular weight and melting point. Garner and co-workers18-20 developed equations for the melting point of both odd- and even-chain fatty acids and paraffins based on the number of C atoms (n). The average prediction error for the model was 1.7 and 0.6 K for acids (eq 8) and paraffins (eq 9), respectively:

1.03n - 3.61 0.002652n - 0.0043

(8)

0.6085n - 1.75 0.001491n - 0.00404

(9)

Tm ) Tm )

These equations are similar to those of Mills.9 Garner et al.20 noted that, from these equations, the convergence temperatures are identical for all series of homologous organic molecules.

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006 5119

He hypothesized that, at infinite chain lengths, the effect of end groups would be negligible. In reality, however, some differences in the convergence temperature were noticed, indicating that the terminal group does influence the melting behavior of even large molecules. Besides chain length, early correlations for the melting point also included molecular symmetry21 and molecular weight.22 Garner20 hypothesized that the melting point should approach a common limit as the chain length increases. Several models have been proposed to account for this phenomenon. Merckel,23 Meyer and Van dar Wyk,24 Moullin,25 and Etessam and Sawyer26 have tried to account for the effect of chain length and molecular weight on the melting point. Cramer27 postulated that the variation in a number of physical properties, including melting point, could be correlated with bulk and cohesiveness parameters. These parameters, which were obtained from factor analysis of a range of molecular characteristics, provided satisfactory melting-point predictions. The most successful QSPR predictions to date for the melting point have been developed for normal alkanes. A correlation coefficient of R2 ) 0.998 was obtained for 24 molecules using topological indices such as the Wiener index and the Bablan index.28 Dearden and Rahman29 performed a QSPR study of 41 anilines, using stepwise regression analysis: v

Tm ) 276 + 117HD + 122F + 37.2χ + 19.8HA + 139.2I4 + 47.5R - 24.9L (10) In eq 10, HA and HD account for the hydrogen bonding, F and R are the Swain-Lupton field parameters, χv represents the molecular connectivity, I4 is the indicator variable for 4-substitution, and L is the STERIMOL length parameter. A QSPR-based approach for the prediction of the melting point was first proposed by Abramowitz.30,31 His work involved the correlation of the melting point for 85 rigid, non-hydrogenbonding molecules with four descriptors (σ, ω, Tb, and ORTHO):

Tm ) 0.772Tb + 110.8 log σ + 11.56 ORTHO + 31.9ω3 234.4 (11) where σ is the symmetry number, ω the acentric factor, and Tb the boiling point. ORTHO depicts the number of groups that are in an orthogonal position (R2 ) 0.938 and s ) 22.8 K). When extended to polychlorinated biphenyls, they were able to obtain accurate predictions by replacing the boiling point with the number of chlorine (nCl) atoms in the molecule:31

Tm ) 14.9nCl + 117.6 log σ + 6.0 ORTHO + 410ω3 + 200 (12) The geometric volume was introduced as a new descriptor by Bhattacharjee et al.32 Although this descriptor showed good correlation with the melting point (R ) 0.925), it has been neglected in QSPR correlations. Dearden33 studied the effect of hydrogen bonding on the melting point. He considered a series of 42 anilines and correlated their melting points using five descriptors (R2 ) 0.886, s ) 24.6 K), which were the measures of hydrogen-bond donor ability, the hydrophobic subsistent constant, the molar refractivity, the sterimol width parameter, and the indicator variable of meta-substitution. Topological indices were used by Medic et al.34 to model melting points of pyrazolinones, which provided a correlation coefficient of R2 ) 0.891:

Tm ) -53.752χv + 101.16I + 229.157

(13)

The packing factor, which is another significant parameter when dealing with solid-state properties, was introduced by Charton and Charton.35 They used 366 cogeneric alkanes and correlated both branched and unbranched molecules with an “intermolecular force equation”, which included a variable that was capable of accounting for the packing energy contribution of the alkyl group (R2 ) 0.918, s ) 17.9 K). Katritzky et al.36 published a comprehensive QSPR analysis of the melting point. Their work involved 443 monosubstituted and disubstituted benzenes; a six-parameter correlation (R2 ) 0.837, s ) 30.2 K) was obtained. The importance of hydrogenbonding descriptors was again reflected in this QSPR model. Apart from hydrogen bonding, the melting point is also governed by the molecular packing in crystals (effects from molecular shape, size, and symmetry), and other intermolecular interactions such as charge transfer, and dipole-dipole interactions in the solid phase. Gramatica et al.37 extended Katritzky’s efforts in correlating the melting point by introducing weighted holistic invariant molecular (WHIM) descriptors. WHIM descriptors are three-dimensional (3D) molecular descriptors that take into account the size, shape, symmetry, and atom distribution of the molecules. A four-parameter model (R2 ) 0.82, s ) 21.2 K) was developed from 66 WHIM descriptors for predicting the melting point (melting-point range of mp ) 16.5-310 °C, 82 molecules). QSPR-based methods alone were only partially successful in representing the melting-point behavior of diverse subsets. Simamora and Yalkowsky38 were the first to propose a combined group contribution approach (Unified Physical Property Estimation Relationship) for predicting the melting points of organic molecules. They developed a melting-point equation from a data set containing 1690 molecules with an average error of 37.4 K. This approach was further extended by Jain and Yalkowsky39 to predict the melting points for a wide subset of aliphatic molecules. Despite the modeling efforts to date, no existing general theory of melting fully explains all the phenomena observed in association with the melting of organic molecules. One reason is a lack of adequate theory to account for the fact that similar molecules have very different types of packing, which results in significantly different melting points, even among isomers. Also, there are several structure-related differences in correlating the melting points.40 For example, the increase in the melting point with size is greatest for globular structures, followed by planar, polycyclic aromatics, cyclic alkanes, cyclic alkenes, and then linear alkenes. The most recent attempts38,41-43 have been limited to the generation of linear QSPR (Type I) correlations. The correlations generally involve four or five descriptors and accurately predict the melting points for a single class of organic molecules.44 However, these studies fail to provide accurate results when applied to other classes of organic molecules (acids, secondary alcohols, etc.). Available molecular descriptors in most commercial software do not adequately reflect the molecular packing and intermolecular forces involved. The most significant literature correlations are summarized in Table 1. 4. QSPR Methodology The development of a QSPR model for determining the melting point involves several distinct steps, as illustrated in Figure 1. To begin, a dataset of molecules chosen from reliable sources must be assembled. After this dataset is validated, the

5120

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006

Table 1. QSPR Models Reported in Literature for the Prediction of Melting Point of Organic Molecules molecule

parameter(s)

ketones paraffins, acids paraffins (44-100 carbons) polycyclic aromatic molecules hydrocarbons haloginated hydrocarbons low-volatility hydrocarbons paraffins (44-390 carbons) aliphatic hydrocarbons substituted anilines rigid, non-hydrogen-bonding organics aldehydes, ketones, amines rigid aromatics diverse organics branched and unbranched alkanes pyridines and piperidines non-hydrogen-bonding aliphatics pyridines substituted benzenes (mono, di) polychlorinated biphenyls diverse organics alkanes

location of keto group molecular weight, number of carbon atoms chain length, symmetry mechanical rigidity molecular diameter and group contribution C-C-C bond angle saturated vapor pressure topological indices, wiener and bablan indexes C-C-C-C torsion angle Hanch parameter and refractivity SIGMAL, EXPAN, ORTHO QSPR (five parameters) functional groups and hydrogen bonding functional groups packing energy contribution QSPR (6 parameter) functional groups and molecular interactions QSPR (hydrogen bonding) hydrogen bonding, crystal packing, symmetry WHIM descriptors UPPER MLRA

Figure 1. Steps involved in development of a quantitative structureproperty relationship (QSPR) correlation.

3D structures of the molecule are generated using commercial molecular visualization software. Structure optimization and descriptor generation are performed using commercial QSPR software. Model development usually begins with linear analysis and algorithms such as linear regression analysis, principal component analysis, or partial least squares may be used. The best set of descriptors is then used to develop the nonlinear QSPR model. The final step of the QSPR model development involves the validation of the new model by predicting the melting points of molecules that were not included in the model development. 4.1. Dataset. The melting points of 1250 molecules were used in the development of the QSPR model. The dataset used in the study included information from various sources.38,43,45-49 The assembled data were representative of all major classes of

R2

s (K)

0.5

0.99

1.0

0.94 0.83 0.94 0.98 0.92 0.83

22.8

0.85 0.83 0.82 0.98 0.999

24.6 17.6 17.9 12.8 36.1 30.2 21.2 23.1 1.1

ref 10 18 67 68 69 70 71 28 72 29 7 31 38 43 35 34 42 40 36 37 39 65

organic molecules of interest to chemical design applications. Structural isomers, such as orthosubstituted, metasubstituted, and parasubstituted benzenes, paraffins with varying degrees of branching, molecules that contain heteroatoms such as sulfur, and multiring molecules, along with molecules that contain hydroxyl, halogens, cyanide, and nitro functional groups were included in the dataset. The distributions of molecules, with respect to their molecular weights and melting points, are provided in Figures 2 and 3, respectively. 4.2. Structure Generation and Optimization. The twodimensional (2D) structures of the molecules were generated using ChemDraw.50 Then, 3D structures were generated for these molecules using ChemBats3DUltra. Because more than one set of 3D coordinates that satisfy the structural constraints (bond length and bond angle) can be generated for any given molecule, the conformation with the lowest Gibbs free energy must be located. Most chemicals that are used as solvents or drugs contain single bonds that join two groups of atoms. Such bonds can usually rotate with a low-to-moderate energy barrier that changes the bond angle and orientation of other groups in the structure. In nature, these representations occur with differing probabilities, and the objective of structure optimization is to obtain the 3D structure that is most prominent. The structures were initially optimized using the MOPAC51 module available in ChemBats3DUltra. AMPAC 6.052 was then used to further refine the 3D geometry of the structures. AM1 parametrizations were used to calculate the quantum-chemical molecular descriptors. The output files from AMPAC were used to calculate various descriptors. For 22 molecules, the geometry could not be optimized (the current software failed to optimize molecules with triple bonds, e.g., CN). These molecules were eliminated and a final set of 1228 molecules was used for model development and validation. 4.3. Descriptor Generation. Three commercial QSPR packages were used to calculate over 1400 molecular descriptors, including a variety of constitutional, topological, geometric, and electrostatic descriptors. Examples from each group of descriptors are presented in Table 2. A detailed description of these descriptor groups can be found elsewhere.53 The number of descriptors calculated for each molecule was dependent on the structural complexity of the molecule. Descriptors that were not calculated for a given molecule (e.g., the number of rings in a

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006 5121

Figure 2. Molecular weight distribution of the data set.

of a QSPR model, especially if neural networks are used in the model development. Orthogonalization techniques are used to obtain a subset of descriptors from which all redundant information has been removed. Nevertheless, hundreds of descriptors may still remain; further reduction of the number of these descriptors is needed before model development can begin. 5. QSPR Model Development

Figure 3. Melting-point distribution for the data set. Table 2. Examples of Molecular Descriptors Calculated from Commercial QSPR Packages molecular descriptor constitutional topological geometrical electrostatic thermodynamic quantum-chemical

example number of atoms, molecular weight Wiener index, Kier shape index moments of inertia, molecular volume charged partial surface area, min and max partial charges vibrational enthalpy, internal entropy at 300 K LUMO, HOMO

methane molecule) were set to zero in the subsequent development of QSPR correlations. 4.4. Descriptor Reduction. Not all molecular descriptors are important to the melting-point behavior; thus, algorithms to identify the most significant subset of descriptors are required. The large number of descriptors available for model development causes dimensionality problems. Furthermore, the use of irrelevant or redundant descriptors diminishes the performance

The primary objectives of this work were to examine the efficacy of (a) including descriptors from multiple commercial QSPR software, (b) including new descriptors constructed based on theoretical observations, and (c) using nonlinear neural network algorithms to improve the melting-point predictions. To meet these objectives, we conducted four case studies in the following sequence: Case Study 1, which involved the development of a linear QSPR model using descriptors obtained from a single commercial QSPR package (CODESSA54). Case Study 2, which involved the development of a linear QSPR model using descriptors from three different QSPR packages (CODESSA, AMPAC, and DRAGON55). Case Study 3, which involved the development of a linear QSPR model using constructed descriptors. Case Study 4, which involved the development of a nonlinear QSPR model. Each case study, in turn, furnished valuable guidance to the development of the final QSPR model (Case Study 4) for determining the melting point. Each case study is discussed thoroughly elsewhere;56 here, we present details for Case Studies 3 and 4 only, which provided the best results for the linear and nonlinear models, respectively. Table 3 provides a brief summary of the result obtained for each of the case studies and a comparison with the best literature models available for melting point. 5.1. Construction of New Descriptors. A review of the literature indicated that commercial software did not adequately

5122

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006

Table 3. Comparison of Melting-Point Prediction Results Obtained from Current Work and Literature case

a

model type

Literature 1 Literature 2

linear linear

Case Study 1 Case Study 2 Case Study 3 Case Study 4

linear linear linear nonlinear

From Constantinou and Gani.43

b

root mean square error, RMSE (K)

approach Literature Values functional groupsa UPPER schemeb

62.5 37.4

Current Work QSPR (single commercial software) QSPR (multiple commercial software) QSPR + three constructed descriptors QSPR + three constructed descriptors

48.7 39.5 26.1 12.6

From Abramowitz and Yalkowsky.31

address all the structural features of a molecule that are significant to the melting point. Literature studies have shown that descriptors that include the effects of molecular symmetry, molecular weight, gravitational index, hydrogen bonding, and functional group characteristics on the melting-point temperature were essential to the development of a reliable QSPR model. Therefore, three new descriptors were constructed for each molecule and included in model development: symmetry, functional groups, and molecular weight. (a) Symmetry. To account for the effect of symmetry on the melting point, the molecules were assigned symmetry numbers according to a method proposed by Dannenfelsser et al.5 The symmetry number simply represents the number of indistinguishable positions that can be obtained by rigidly rotating a molecule about its center of mass. For example, anthracene is assigned a symmetry number of four, because it can be rotated into four indiscernible positions, whereas, for phenanthrene, the symmetry number is only two, because it can be rotated into only two indistinguishable positions. In determining the symmetry number of a molecule, the terminal groups were considered to be single atoms. For molecules that are conical (chloromethane), cylindrical (carbon dioxide), or spherical (methane), Dannenfelser et al.5 empirically defined the symmetry numbers to be 10, 20 and 100, respectively. The molecules were divided into six classes based on symmetry. These classifications appear in Table 4. Each is treated as an independent variable, i.e., when creating the descriptor table, a molecule with a symmetry number of 12 (benzene) has a value of one put under SIM6 column, whereas the SIM1, SIM2, SIM3, SIM4, and SIM5 columns have zero values. The underlying assumption is that symmetry contributes equally for all molecules with similar symmetries. For example, the difference in melting point between the orthogonal molecules, which have no symmetry and their para-isomers with 2-fold symmetry is almost the same in most cases. The melting points of O-chlorobenzaldihide (σ ) 0) and P-chlorobenzaldihide (σ ) 2) are 416 and 449 K (melting-point elevation ) 33 K), while those of O-cyanoaniline (σ ) 0) and Pcyanoaniline (σ ) 2) are ∼323 and 357 K, respectively (meltingpoint elevation ) 34 K). On average, the difference between isomers with zero and 2-fold symmetry is ∼31 K. There are several cases where these generalizations fail, including molecules that exhibit high hydrogen bonding ability (e.g., trichloroaniline, tetramethyldichlorobenzene) and hexa-substituted benzenes. Because only a small number of molecules deviate from these generalizations, no modification to the symmetry descriptors was introduced; alternatively, descriptors that account for hydrogen bonding were introduced to accommodate molecules that deviate from symmetry-number-based generalizations. (b) Functional Groups. The rigidity and symmetry of a fused ring are destroyed if any functional group is attached to the

Table 4. Molecular Classifications Used in the Generation of the Symmetry Descriptor descriptora

example

coefficient 1.66

SIM1 (1) O-chlorofluorobenzene 30.18

SIM2 (2)

1,2-difluorobenzene 36.49

SIM3 (3)

trifluorocyclohexane 89.72 SIM4 (4) 1,4-difluorobenzene 78.06

SIM5 (6)

1,3,6-trifluorobenzene 93.70

SIM6 (12)

benzene a

SIM1-SIM6 are the six symmetry descriptors used to classify molecules. The six symmetry descriptors are added together to provide a linear descriptor, the coefficients of which are provided in the table and whose intercept value is 116.61.

ring.5 Also, the presence of functional groups on the ring affects the packing efficiency of the ring and markedly reduces the melting point. The melting-point depression is governed by the properties of the side chain. If the side chain is capable of hydrogen bonding, then the melting-point depression is less, compared to non-hydrogen-bonding functional groups. This requires that functional-group contributions be included to describe the melting-point behavior accurately. A set of 40 functional groups was constructed for the entire melting-point data, based on the approach proposed by Constantinou and

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006 5123 Table 5. Functional Groups and Parameters Used in Construction of the Functional Group Descriptor descriptor

parameter

descriptor

parameter

H3 CH2 CH C CH2d CHd dC Cd #CH #C CH2 (ring) CH (ring) C (ring) dCH (ring) dC (ring) F Cl Br I OH

-17.38 1.39 10.20 26.79 -19.95 -10.85 11.71 -39.61 13.90 23.04 -3.48 6.59 17.51 7.9 26.03 -15.92 -9.04 8.21 7.78 0.48

-OOdC NH2 NH N -Nd -SCOOH -CHO dO -O- (ring) OdC (ring) N (ring) NH (ring) -Nd (ring) dS- (ring) C#N NO2 SH OH (ring)

-1.95 36.83 30.88 8.38 0.86 -71.98 66.60 11.67 102.44 25.96 7.28 44.03 -32.04 13.18 25.67 16.9 31.24 18.11 -28.05 31.00

Gani.43 The melting point for a given molecule is obtained by summing the contributions from the individual functional groups. The individual functional groups used in the generation of the functional group descriptor and their correlation coefficients can be found in Table 5. (c) Molecular Weight. When a linear model was developed using just the QSPR descriptors, the molecular weight and the gravitational index were identified to be significant. However, the prediction results indicate a nonlinear relationship with the melting point. To address these nonlinearity effects in the model, a new descriptor was developed that included a nonlinear combination of the molecular weight and the gravitational index. This new descriptor accounts for the observed correlation of the melting point with the log of the molecular weight and the square root of the gravitational index. A genetic algorithm (GA)-based search tool was used to determine the functional form of the new descriptor. In the GA architecture, the descriptor combinations are basically represented as a string, and mutations involve just a change in one of the binary digits in the string, which constitute one of the operations explained in detail elsewhere.58,59 A “Roulette Wheel,” an imaginary disk designed to have a one-to-one relationship with the descriptors, is used for random selection of these descriptors. The components of the function are allowed to change randomly until a suitable form (which provides the lowest root mean square error, RMSE) is evolved. The zero generation of the GA consists of 10 seed descriptors, chosen from the 10 best one-descriptor fits obtained for the dataset and five arithmetic operators. All the descriptors, including molecular weight and gravitational index, which were determined to have a nonlinear correlation with the melting point were included. The descriptors were then combined randomly to generate new functional forms and the combinations were then scored. “Scoring” is a term used to represent the ability of the new correlations to predict the melting-point temperature accurately. The GA forces the R2 value to increase by using two parameterssSc and Ss:

Sc )

selection pressure maximum generations - N × population size maximum generations (14) Ss ) (Sc × population size) + 1

(15)

Score )

R2 + Ss Sc

(16)

Initially, the selection pressure is set to a low value (e.g., 0.5). As the number of new functions increases, the selection pressure is increased to cause the Sc value to increase faster than Ss. This drives the algorithm to find new combinations that have lower RMSE values, i.e., the algorithm is forced to identify functions that provided high-quality predictions. The general form of this descriptor evolved through the use of the GA is

log (MWxGI) Tm ) × (RI3)3/4 total number of atoms 3

(17)

where GI is the gravitational index, RI3 is the third-order randic index, and MW is the molecular weight. 5.2. Linear Model Development. Model development is usually initiated with linear analysis. There are three different multilinear regression analysis (MLRA) feature selection routines that are widely used in industry today:60 (a) regression by leaps and bounds,61 (b) genetic algorithms, and (c) simulated annealing. All three methods begin with a large pool of descriptors, generate subsets from that pool, and perform MLRA on those subsets to determine the overall quality, usually based on the RMSE of the regression. Choosing the subset of descriptors that best encodes the property of interest in a linear regression is a difficult optimization problem. The essence of multilinear regression is that any thermophysical property (y) of a series of compounds can be represented through a multilinear equation, using the descriptors (x) that are inherent to a particular compound. The general form of the correlation is N

y ) βo +

βixi ∑ i)1

(18)

where y is the property of interest, βo the intercept, N the number of molecular descriptors in the correlation, βi the coefficient for descriptor i, and xi the molecular descriptor. 5.3. Nonlinear Model Development. Most thermophysical properties exhibit nonlinear relationships to the descriptors of their chemical structures. The use of linear relationships often yields models with poor predictive ability. A neural network approach allows for the adaptation of the nonlinear model based on a set of examples. Although several types of neural networks are available, a back-propagation network was used in this work. The development of a back-propagation network involves the use of a trial-and-error approach for the selection of a network architecture and network weights. The networks differ in the number of layers and hidden neurons used in each layer. Usually, each network architecture and weight combination is tried at least once before deciding the best layout. In some cases, the network must be retrained if the model predictions are poor. The most significant problem with back-propagation networks is that of overfitting.62 When trained on the same data set for multiple cycles, the network develops a surface that accommodates all the training data; this provides an accurate estimate of the training set but has poor predictive capabilities. There are several rules of thumb that provide guidance to the number of hidden neurons to use or when to stop training the network.62 (When determining the network architecture, the rule of thumb is to keep the ratio of total observations to adjustable parameters above 2, to avoid any chance effects.63 The adjustable parameters in a neural network are simply the

5124

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006

weight and bias.) Although these rules of thumb are helpful in developing the preliminary network architecture, a trial-anderror investigation is still required before a viable nonlinear model can be developed.63,64 5.4. Model Validation. To validate the QSPR model, the experimental melting-point data are split randomly into training, prediction, and cross-validation (CV) sets. The prediction set contains ∼30% of the data (360 molecules), whereas the CV set usually consists of 10% of the data (120 molecules). The network is trained using only the training data. Periodically, the training is halted and values for the CV set are predicted. The RMSE for the CV set is plotted as a function of the number of training cycles. The error for the training set usually decreases as the number of training cycles increases,64 while the CV RMSE reaches a minimum and then gradually increases. Any training beyond this minimum RMSE value indicates that the network is being overtrained. At the onset of an increase in the CV set error, the network training is stopped. External validation of the neural network is obtained using the prediction set. The property values of the prediction set are obtained from the network after complete training of the network, followed by cross-validation. If the RMSE error of the prediction set is similar to that of the training and CV sets, then the network model satisfies the constraint for external validation.65 6. Results Initially, our QSPR model development efforts involved the use of a single QSPR software for descriptor generation. Linear regression analysis was then used for model development (Case Study 1). The large prediction errors and (R2 ) 0.89 and RMSE ) 48.7 K) suggested that all descriptors of significance to the melting point had not been considered. To overcome this problem, over 1400 descriptors from three commercial QSPR packages were used in our second case study. The linear model generated using this new pool of descriptors (Case Study 2) had a correlation coefficient of R2 ) 0.91 and RMSE ) 39.5 K. When compared to literature results (Table 3), the new QSPR model performed better than the functional group model (RMSE ) 62.5 K) and was comparable to the UPPER model predictions (RMSE ) 37.4 K). Furthermore, the commercial software did not address the effects of molecular symmetry, functional groups, and nonlinear molecular weight dependence on melting point. Therefore, three new descriptors were constructed and included in our models. Analysis of outliers indicated that most belonged to the nitrogen or the aliphatic alcohol subset. A look into the literature indicated that both classes of molecules exhibit different types of orbital hybridization, which, if taken into consideration, might help improve the model further, although previous efforts to classify functional groups based on the degree of orbital hybridization3 have shown only moderate success. This approach was not implemented in this work for two main reasons: (a) the lack of algorithms, which would help identify the type of hybridization from chemical structure, and (b) the error associated with manually entering the data. To implement the orbital hybridization manually, each structure had to be viewed through structure visualization software and the descriptors have to be calculated manually. Human intervention is required for descriptor calculation; therefore, this approach is error prone, because of the size of the data set and the tedious nature of the descriptor calculations. Table 6 presents the list of descriptors and the coefficients used in the development of the final QSPR model (Case Study 3). The coefficients given in Table 6 are obtained through linear

Figure 4. Comparison of the experimental and predicted melting points obtained using the new linear model. Table 6. Linear Model for Correlating Melting Point of Diverse Organics descriptor

parameter

intercept HA-dependent HDCA-2/TMSA (Zefirov’s PC) constructed Descriptor 2 (functional group) constructed Descriptor 3 (molecular weight) constructed Descriptor 1 (symmetry) relative number of nitrogen atoms maximum net atomic charge of a H atom relative number of single bonds maximum net atomic charge of a C atom average structural information content (order 1) average information content (order 1) Randic index (order3) moment of inertia, A FNSA-1 Kier and Hall index (order 0) number of fluorine atoms total hybridization component of the molecules

311.21 47563.00 4246.40 -434.01 233.40 220.95 186.90 80.20 -72.26 43.67 27.15 20.75 19.18 11.88 -10.17 -4.36 -4.89

analysis. A total of 16 structural descriptors were included in our model. The linear correlation yielded values of R2 ) 0.95 and RMSE ) 26.1 K. The inclusion of the three constructed descriptors in the model helped to improve the predictions for the melting point (see Case Studies 2 and 3 in Table 3). The three new descriptorssmolecular symmetry, functional groups, and molecular weightsalong with hydrogen bonding were determined to be the most significant. Figure 4 provides a comparison between the experimental and predicted melting points obtained using the new linear model (a model that included the constructed descriptors). These results notwithstanding, the linear model developed still had room for improvement; specifically, improvement may be gained through the use of nonlinear model development algorithms. A simple back-propagation network was used to develop the nonlinear models. The neural network architecture and initial weights were determined through rigorous simulations. Each neural network was initiated with a set of network weights, which were chosen randomly. The network progressively alters these weights as the training proceeds. To determine when the network has been sufficiently trained, the dataset was divided into training, CV, and prediction sets. The prediction set and the CV set contained 30% and 10% of the total data, respectively, whereas the training set usually contained the other 60%. The network performance is highly dependent on the initial weight used, which presents an additional complication in neural

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006 5125

Figure 5. Comparison of the experimental and predicted melting points obtained from a nonlinear model.

Figure 6. Residuals plot for melting-point prediction obtained from a nonlinear model

network modeling. Because only random weights are used, multiple network training sessions that use different initial weights are required to ensure reliability. The nonlinear neural network model results, which are presented in Figures 5 and 6, had values of RMSE ) 12.6 K and R2 ) 0.99. 7. Discussion The objectives of this work were 3-fold: (a) investigate the utility of including descriptors from multiple commercial QSPR software into model development, (b) investigate the utility of including new descriptors that had been developed based on previous literature observations, and (c) investigate the ability of nonlinear QSPR models in predicting melting-point temperatures. The best literature models for melting points have been documented in Table 1. Although the work by Li and Zhu65 provided the best melting-point estimate, only a selected database that contained alkanes was used and, hence, is not suitable for direct comparison with this work. The only comprehensive work found in the literature, by Jain and Yalkowsky,39 had an average absolute error of 23.12 K for 338 molecules. In this work, we present a structure-based model that is capable of proving melting-point estimates, on average, within 12.6 K for >1200 molecules that are of interest to chemical industries. This work differs from other literature efforts in that the model incorporates most of the significant descriptors identified in

previous melting studies. Furthermore, in contrast to most literature efforts, the current QSPR model in not limited to any single homologous series but can be used for most organic molecules. In fact, there has been only one effort,3 which includes a bigger subset of molecules. Although a higher R2 value is to be expected, in comparison to published QSPR models involving a single family of organic molecules, the overall model predictions (RMSE ) 12.6 K) show significant improvements over the available literature models. The results show that molecules with high hydrogen-bonding ability have a tendency to exhibit high melting points. Physically, this can be visualized in terms of extra energy required to break the hydrogen bonds. When functional groups, symmetry, and nonlinear molecular weight descriptors were incorporated into the QSPR framework, significant improvements were realized in the melting-point predictions. In addition to the improved model predictions obtained, several insights into the physics behind the structural aspects that affect the melting-point temperature were gained. For example, the results indicate that the ability of an atom to vibrate and rotate greatly affects the melting point. Specifically, longchain atoms, which have greater ability to vibrate and rotate, have lower melting points than anthracene and other organic molecules with fused rings. Development of the new descriptors posed significant technical challenges that required the development of new data mining strategies, including the use of GAs. One aspect yet to be addressed satisfactorily involves the classification of functional groups. For example, the classifying of alcohols as aromatic or aliphatic was not sufficient. Simamora38 suggested that classification based on orbital hybridization could be used to overcome this problem. Such an orbital hybridization approach for classifying functional groups should be considered if further improvements in the model predictions are desired. 8. Conclusions (1) Hydrogen bonding, molecular symmetry, molecular weight, and functional groups are major factors in correlating the melting-point behavior of organic molecules. (2) The newly developed linear model yielded an root mean square error (RMSE) of 26.1 K. This is comparable to the best melting-point model available in the literature64 (RMSE ) 37 K). (3) The new nonlinear model is capable of predicting the melting-point behavior of most organic chemicals and has an RMSE of 12.6 K. (4) Improvement in the nonlinear model may be achieved by developing methodologies to (a) classify functional groups based on hybridization and (b) account for different types of intermolecular and intramolecular hydrogen bonding. Literature Cited (1) Isnard, P.; Lambert, S. Chemosphere 1989, 18, 1837. (2) Dearden, J. C. QSAR Prediction of Melting Point, A Property of Environmental Relevance. In Proceedings of the 4th International Workshop on QSAR in EnVironmental Toxicology, Veldhoven, The Netherlands, 1990. (3) Tesconi, M. M.S. Thesis, The University of Arizona, 1999. (4) Abramowitz, R.; Yalkowsky, S. H. Chemosphere 1990, 21 (1011), 1221. (5) Dannenfelser, R. M.; Surendran, N.; Yalkowsky, S. H. SAR QSAR EnViron. Res. 1993, 1, 273-292. (6) Chickos, J. S.; Hesse, D. G.; Libeman, J. F. J. Org. Chem. 1990, 55, 3833-3840. (7) Dannenfelser, R.-M.; Yalkowsky, S. H. Estimation of Entropy of Melting from Molecular Structure: A Non-Group Contribution Method. Ind. Eng. Chem. Res. 1996, 35 (4), 1483-1486.

5126

Ind. Eng. Chem. Res., Vol. 45, No. 14, 2006

(8) Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids; McGraw-Hill: New York, 2001. (9) Mills, E. Philos. Mag. Ser. 1884, 17, 173-180. (10) Kipping, F. S. J. Chem. Soc. 1894, 63, 465-468. (11) Longinescu, G. J. Chem. Phys. 1903, 1, 296. (12) Tsakalotos, D. E. C. R. Acad. Sci. Paris 1906, 143, 1235-1236. (13) Lindemann, F. Phys. Z. 1910, 11, 609. (14) Robertson, P. W. J. Chem. Soc. 1919, 115, 1210-1223. (15) Prud’homme, M. J. Chem. Phys. 1920, 18, 359. (16) Lorenz, R.; Herz, W. Z. Anorg. Allg. Chem. 1922, 122 (2), 5160. (17) Lyman, W. J.; Neely, W. B.; Blau, G. E. EnVironmental Exposure for Chemicals; CRC Press: Boca Raton, FL, 1985. (18) Garner, W. E.; Madden, F. C.; Rushbrooke, J. E. J. Chem. Soc. 1926, 48, 2491-2502. (19) Garner, W. E.; King, A. M. J. Chem. Soc. 1929, 51, 1849-1861. (20) Garner, W. E.; Van Bibber, K.; King, A. M. J. Chem. Soc. 1931, 53, 1533-1541. (21) Beacall, T. Rec. TraV. Chim. 1928, 47, 37-44. (22) Austin, J. J. Chem. Soc. 1930, 52, 1049. (23) Merckel, J. H. C. Proc. R. Acad. 1937, 40, 164-173. (24) Meyer, K. H.; Van dar Wyk, A. HelV. Chim. Acta 1937, 20, 13131320. (25) Moullin, E. B. Proc. Cambridge Philos. Soc. 1938, 34, 459-464. (26) Etessam, A. H.; Sawyer, M. F. J. J. Inst. Petrol. 1939, 25, 253262. (27) Cramer, R. D. J. Chem. Soc. 1980, 102, 1837-1849. (28) Hanson, M.; Rouvray, D. The Use of Topological Indices to Estimate the Melting Points of Organic Molecules; Elsevier: New York, 1987. (29) Dearden, J., C.; Rahman, M. H. QSAR approach to the prediction of melting points of substituted anilines. Math. Comput. Model. 1988, 11, 843-846. (30) Abramowitz, R. Estimation of the melting point of rigid organic compounds. Ph.D. Thesis, University of Arizona, Tucson, AZ, 1986. (31) Abramowitz, R.; Yalkowsky, S. H. Pharm. Res. 1990, 7 (9), 942. (32) Bhattacharjee, S.; Rao, A. S.; Dasgupta, P. Comput. Chem. 1991, 15, 319-322. (33) Dearden, J. C. Sci. Total EnViron. 1991, 59, 109-110. (34) Medic, S. M.; Nikolic, S.; Matijevic, S. J. A QSPR study of 3-(Phthalimidoalkyl)-pyrazolin-5-one. Acta Pharm. 1992, 42, 153-167. (35) Charton, M.; Charton, B. J. Phys. Org. Chem. 1994, 7, 196. (36) Katritzky, A. R.; Maran, U.; Karelson, M.; Lobanov, V. S. J. Chem. Inf. Comput. Sci. 1997, 37, 913. (37) Gramatica, P.; Navas, N.; Todeschini, R. Chemomet. Intell. Lab. Syst. 1998, 40, 53. (38) Simamora, P.; Yalkowsky, S. H. Group Contribution Methods for Predicting the Melting Points and Boiling Points of Aromatic Compounds. Ind. Eng. Chem. Res. 1994, 33 (5), 1405-1409. (39) Jain, N.; Yalkowsky, S. H. J. Pharm. Sci. 1999, 88, 852. (40) Katritzky, A. R.; Lobanov, V. S.; Karelson, M.; Murugan, R.; Grendze, M. P.; Toomey J. E. J. ReV. Roum. Chim. 1996, 41, 851. (41) Wei, J. Boiling Points and Melting Points of Chlorofluorocarbons. Ind. Eng. Chem. Res. 2000, 39 (8), 3116. (42) Krzyzaniak, J. F.; Myrdal, P. B.; Simamora, P.; Yalkowsky, S. H. Boiling Point and Melting Point Prediction for Aliphatic, Non-HydrogenBonding Compounds. Ind. Eng. Chem. Res. 1995, 34 (7), 2530. (43) Constantinou, L.; Gani, R. AIChE J. 1994, 40 (10), 1697. (44) Needham, D. E.; Wei, I. C.; Seybold, P. G. Molecular modeling of the physical properties of alkanes. J. Am. Chem. Soc. 1988, 110 (13), 4186-4194.

(45) Bondi, A. Physical Properties of Molecular Crystals, Liquids, and Gases; Wiley: New York, 1968. (46) Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods: EnVironmental BehaVior of Organic Molecules; McGraw-Hill: New York, 1982. (47) Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods: EnVironmental BehaVior of Organic Molecules; American Chemical Society: Washington, DC, 1990. (48) Joback, K. G.; Reid, R. C. Chem. Eng. Commun. 1987, 57, 233. (49) Physical and Thermodynamic Property Database, Design Institute for Physical Property Data (DIPPR) Project 801, 1999. (50) Chemdraw 8.0, Cambridge Software, 2004. (51) Stewart, J. P. MOPAC Program Package, 1990. (52) AMPAC 6.0, Semichem, Inc., Shawnee, KS, 1998. (53) Karelson, M. Molecular Descriptors in QSAR/QSPR; Wiley: New York, 2000. (54) CODESSA 2.63, Semichem, Inc., Shawnee, KS, 1998. (55) DRAGON 4.0, Milano Chemometrics, Italy, 2003. (56) Godavarthy, S. S. Ph.D. Thesis, Oklahoma State University, Stillwater, OK, 2004. (57) Wei, J. Molecular Symmetry, Rotational Entropy, and Elevated Melting Points. Ind. Eng. Chem. Res. 1999, 38 (12), 5019. (58) Venkatasubramanian, V.; Chan, K.; Caruthers, J. M. Comput. Chem. Eng. 1994, 18 (9), 833. (59) Venkatasubramanian, V.; Chan, K.; Caruthers, J. M. J. Chem. Inf. Comput. Sci. 1995, 35, 188. (60) Myers, R. H. Classical and Modern Regression with Applications; PWS-Kent Publishing Co.: Boston, MA, 1989. (61) Draper, N. R.; Smith, H. Applied Regression Analysis; Wiley: New York, 1981. (62) Gakh, A. A.; Gakh, E. G.; Stumper, B. G.; Noid, D. W. Neural Network-Graph Theory Approach to the Prediction of the Physical Properties of Organic Compound. J. Chem. Inf. Comput. Sci. 1994, 34, 832-839. (63) Ulmer, C. W.; Smith, D. A.; Sumpter, B. G.; Noid, D. I. Computational Neural Networks and the Rational Design of Polymeric Materials: The Next Generation Polycarbonates. Comput. Theor. Polym. 1998, 8, 311-321. (64) Wessel, M. D.; Jurs, P. C. Prediction of Normal Boiling Points for a Diverse Set of Industrially Important Organic Compounds from Molecular Structure. J. Chem. Inf. Comput. Sci. 1995, 35, 841. (65) Li, Y.; Zhu, Y. Huaxue Wuli Xuebao 2000, 13, 557. (66) Zhao, L.; Yalkowsky, S. H. A Combined Group Contribution and Molecular Geometry Approach for Predicting Melting Points of Aliphatic Compounds. Ind. Eng. Chem. Res. 1999, 38, 3581. (67) Broadhurst, M. G. J. Res. Natl. Bur. Stand., Sect. A 1962, 66, 241249. (68) Flory, P.; Vrij, A. Melting Points of Linear-Chain Homologs. The Normal Paraffin Hydrocarbons. J. Am. Chem. Soc. 1963, 85, 3548. (69) Wachalewski, T. Postepy Fiz. 1970, 21, 403-406. (70) Syunyaeva, R. Chem. Technol. Fuels Oils 1981, 17, 161. (71) Mackay, D.; Shiu, W.; Bobra, A.; Billington, J.; Chan, E.; Yeun, A.; Ng, C.; Szeto, F. Volatilization of Organic Pollutants from Water. U.S. Environmental Agency Report PB 82-230939, Athens, GA, 1982. (72) Taskinikas, P.; Yalkowsky, S. Toxicol. EnViron. Chem. 1988, 17, 19-33.

ReceiVed for reView October 10, 2005 ReVised manuscript receiVed March 20, 2006 Accepted March 29, 2006 IE051130P