J. Phys. Chem. 1995, 99, 925-937
Molecular Modeling by Linear Combinations of Connectivity Indexes
Lionello Pogliani
Dipartimento di Chimica, Università della Calabria, 87030 Cosenza (CS), Italy
Received: August 12, 1994; In Final Form: October 19, 1994
The modeling power of the method of linear combinations of connectivity indexes (LCCI), based on a minimal and on an expanded set of connectivity indexes, has been tested on several properties of different classes of organic compounds: the melting points and motor octane numbers of alkanes, the melting points and solubilities of caffeine homologues, and four different physicochemical properties of organophosphorus compounds. The modeling of the first property, a classical shape-dependent property and, to date, a challenging problem of molecular modeling, was resolved by partitioning the entire set of alkanes into congruent subsets. A minimal set of normal and valence connectivity indexes was able to model the melting points of caffeine homologues, which have quite similar molecular shapes and sizes, while the modeling of the solubilities of these homologues was unraveled by taking into consideration their association in solution and by employing linear combinations of squared connectivity indexes. The very effective modeling of the two different types (shape- and size-dependent) of properties of the organophosphorus compounds, with a minimal set of connectivity indexes, also provides a test for the proposed valence δᵛ value of phosphorus in organophosphorus derivatives. Linear combinations of orthogonal connectivity indexes (LCOCI) were also tested to improve, if possible, the modeling of the properties of the given classes of compounds. The modeled properties show that the connectivity indexes can be highly dependent on the detailed knowledge of the physicochemical state of the investigated system and that, usually, LCCIs with a minimal basis set yield quite adequate modeling.
Introduction
Molecular connectivity indexes are, certainly, the most successful structural parameters derived from chemical graph theory. Recently, linear combinations of connectivity indexes (LCCI) and linear combinations of orthogonal Ω connectivity indexes (LCOCI)17-22 succeeded in modeling up to eight different physical properties of natural amino acids and three properties of inorganic salts. A graph theoretical molecular connectivity index has also been recently proposed and successfully tested to encode the cis/trans isomerism in unsaturated conformers. The aim of the proposed LCCI method is to delineate a coherent and elementary methodology that is capable of modeling the greatest number of properties of different classes of compounds with the smallest number of (graph theoretical) connectivity indexes, without the insertion of any other kind of external (nonconnectivity) indexes or parameters. The main purpose of the present paper is to test the potentialities and limits of the method of linear combinations of molecular connectivity indexes and of the corresponding orthogonal connectivity indexes (LCOCI). In this frame we will see how ad hoc segmented and supramolecular connectivity indexes can be formulated and used and how the descriptive power of the method is strongly dependent on the detailed physicochemical knowledge of the system under study. From this point of view the central point of this study is the modeling of the solubilities of the small set of caffeine homologues. No more than 6 years ago Needham et al.25 noticed the failure to model the melting points of alkanes with the aid of any kind of indexes, including connectivity indexes, and attributed this failure to the fact that melting transitions are associated with a shape-dependent dimension that is not modeled well enough by indexes which depend more on size-dependent terms.
Now, concerning the connectivity indexes, this is a rather unsatisfying conclusion for a kind of index that can encode branching and that, as recognized by Needham et al., can span more independent dimensions than do many other structural parameters. In this study a way out of this impasse will be sought by partitioning the group of alkanes into congruent subsets of compounds, where congruent means alkanes with like substituents. The modeling of the motor octane number (MON) of a set of alkanes, a well-known property in molecular modeling studies,26 will also be tried, with the intent to see how far LCCI are an alternative to other molecular modeling methodologies. The molecular modeling of these two properties of alkanes will be done with the aid of linear combinations of indexes derived from a minimal {χ}_M and an expanded {χ}_E set of indexes. While the melting points of alkanes constitute, to date, a challenging problem of molecular modeling, the very different solubilities and melting points of some caffeine homologues, which have very similar molecular shape and size, make up another challenging test for the molecular connectivity modeling of physical properties. Furthermore, many atoms of these pharmacologically important molecules show different values for the δ and valence δᵛ cardinal numbers, with the consequence that the simple and valence molecular connectivity indexes are no longer equal, and this offers the possibility of testing the ability of the valence connectivity indexes to improve the modeling of the given properties. The modeling of the solubility of the caffeine homologues will also give some interesting clues on the general problem of molecular modeling. Recently, different size- and shape-dependent physical properties of a series of organophosphorus P=O compounds have been successfully modeled with the newly proposed aN and GAI indexes, both based on quantum theory. The modeling of these same physical properties with an MCI set will, thus, show how far linear combinations of connectivity indexes can encode different size- and shape-dependent properties of a group of compounds and how far connectivity indexes are competitive with the aN and GAI indexes. This modeling would also offer the possibility of testing the value of the valence delta δᵛ for the phosphorus atom in organic phospho derivatives.
Abstract published in Advance ACS Abstracts, December 1, 1994.
0022-3654/95/2099-0925$09.00/0 © 1995 American Chemical Society
926 J. Phys. Chem., Vol. 99, No. 3, 1995
All along the modeling of the physical properties of the different sets of compounds, the possibility of obtaining a better modeling with the aid of linear combinations of orthogonal connectivity indexes (LCOCI), derived from the corresponding molecular connectivity indexes, will also be checked.
Method
In the present study the following minimal set of connectivity indexes will generally be used: {χ} = {D, Dᵛ, ⁰χ, ⁰χᵛ, ¹χ, ¹χᵛ}, which has already been successfully employed in previous studies on amino acids. Next to this set an expanded set of 16 connectivity indexes,5 which will be summarized in the following considerations, will also be tested. The connectivity and the valence connectivity indexes are rooted in the cardinal δ and valence δᵛ numbers, respectively, where δ is the number of σ bonds (or σ electrons) and δᵛ is the number of valence electrons (including π- and n-type electrons) of an atom in a hydrogen-suppressed chemical graph.5 The Σδ index, D, and the mth-order (m = 0-n) connectivity indexes for molecules with i = 1 to n non-hydrogen atoms can be defined in the following way:5,17
\[ D = \sum_i \delta_i \tag{1} \]
\[ {}^m\chi = \sum_p (\delta_1 \delta_2 \cdots \delta_{m+1})^{-1/2} \tag{2} \]
where the summation in eq 2 runs over the p mth-order paths and subscripts 1, 2, 3, ..., stand for δ values of successive adjacent non-hydrogen atoms. For example, m = 0 specifies the zeroth-order ⁰χ connectivity index, for which p equals the number i of non-hydrogen atoms (C, N, O, and P) in a molecule and δ_i is the delta value of a single atom:
\[ {}^0\chi = \sum_i (\delta_i)^{-1/2} \tag{3} \]
m = 1 specifies, instead, the first-order connectivity index, where p equals the number b of σ bonds and δ₁δ₂ is the product of the delta values of two adjacent atoms:
\[ {}^1\chi = \sum_b (\delta_1 \delta_2)^{-1/2} \tag{4} \]
The second-order (m = 2) connectivity index ²χ is the sum, over the p two-adjacent-bond paths 1-2-3, of terms (δ₁δ₂δ₃)^(-1/2). An additional hierarchy of indexes of order m > 1 and type t can be added to the hierarchy of indexes involving p paths (normally defined as ᵐχp) by summing analogous terms over substructural units involving t = c cluster, t = pc path cluster, and t = ch chain combinations of m bonds.5
Substituting δ with the corresponding valence δᵛ number, we obtain the corresponding family of valence connectivity indexes. In the case of the alkanes, where δ = δᵛ, the given minimal connectivity set contracts to {χ} = {D, ⁰χ, ¹χ}, to which we will add the following total structure χₜ connectivity index, introduced by Needham et al.25 in their study on the melting points of alkanes (eq 5). The minimal connectivity set that will be used for the modeling of the MPs and MONs of 56 and 30 alkanes, respectively, will then be {χ}_M = {χ} = {D, ⁰χ, ¹χ, χₜ}, where
\[ \chi_t = (\delta_1 \delta_2 \cdots \delta_n)^{-1/2} \tag{5} \]
While the values of these indexes for the 57 alkanes were calculated (D) or taken from Appendix I of Kier and Hall's book,36 the experimental melting points and motor octane numbers were taken from refs 25 and 31, respectively.
The experimental values for the physical properties of caffeine homologues and organophosphorus compounds were taken from refs 27-30, while the corresponding connectivity values were calculated from eqs 1-4. As the D and Dᵛ as well as the ⁰χ and ⁰χᵛ indexes for the organophosphorus compounds differ from each other by a constant term (Dᵛ - D = 11.22 and ⁰χᵛ - ⁰χ = -1.01831 for δᵛ(P) = 2.22), we can neglect Dᵛ and ⁰χᵛ and reduce the minimal set of connectivity indexes of the organophosphorus compounds to {χ}_M = {χ} = {D, ⁰χ, ¹χ, ¹χᵛ}. The difference between ¹χ and ¹χᵛ all along the given set of organophosphorus compounds is not constant (terms in eq 4 are not additive atomic terms) owing to the sometimes different δ values of the carbon atoms directly bonded to the phosphorus or oxygen atoms. It should be noticed that the proposed value δᵛ(P(=O)) = 2.22 for the organophosphorus compounds fits nicely the following definition:
\[ \delta^v = (Z^v + Z)/(Z - Z^v - 1) \tag{6} \]
where Z is the total number of electrons (Z = 15) and Zᵛ is the number of valence electrons (Zᵛ = 5), so that δᵛ = 20/9 ≈ 2.22. Normally, the connectivity indexes are to some extent linearly interrelated (collinear), and to evaluate the extent of the interrelation between two indexes χ and χ′ it was proposed to use the regression coefficient R of the linear relationship χ = aχ′ + b (collinearity criterion37) and to consider strongly collinear those indexes with R(χ,χ′) > 0.98. Recently20 the mean correlation coefficient of the interrelation IM matrix, ⟨R_IM(χ:P)⟩ (P stands for property), was also introduced to test the overall collinearity of a set of connectivity indexes employed to model P. It has to be underlined20,21,33-35 that even strongly collinear indexes with 0.98 < R < 1 (here, R = 1 when R = 0.999 999) can give rise to positive contributions to the modeling of a property as, normally, the fraction by which one index differs from the other may be very important for further improvement of the descriptive power of an LCCI. Thus, inclusion or exclusion of an index on the exclusive basis of its collinearity with another index of the same LCCI can be misleading. The melting points and solubilities of the caffeine homologues will be modeled with the aid of the {χ}_M = {χ} = {D, ⁰χ, ¹χ, ⁰χᵛ, ¹χᵛ} minimal connectivity set. In this set the Dᵛ index has been excluded as, for these homologues, R(D,Dᵛ) = 1.
The orthogonal Ω molecular connectivity indexes, which can be obtained from the corresponding χ connectivity indexes, have been introduced to bypass the problem of collinearity, to improve, when possible, the modeling of a property, and to take advantage of the constancy of the components of an LCOCI.33-35 The orthogonalization procedure can be briefly outlined in the following way: the first best χ connectivity index is chosen as the first ¹Ω index; then the second ²Ω index is obtained by subtracting from the second index that part which can be reproduced by ¹Ω. Such a process goes on, obtaining from an ⁱχ index the corresponding ⁱΩ index, which is orthogonal to every ⁱ⁻¹Ω index, that is, from which the part that can be reproduced by the other orthogonal indexes has been discarded. It should, anyway, be remembered that it is not always possible to obtain an improved modeling with linear combinations of Ω indexes. The situation strongly reminds us of what happens in quantum chemistry calculations where, sometimes, nonorthogonal basis functions work better and more easily than orthogonal ones. The physical properties P are modeled with the aid of the following dot product:
\[ \mathbf{P} = \mathbf{C} \cdot \boldsymbol{\chi} \tag{7} \]
where P is the column vector of the physical properties to be modeled (when more than one), χ is the best molecular connectivity column vector, made up of χ indexes plus the unitary χ⁰ index, and C (a row vector when just one property is modeled) is the matrix of the constant terms of the connectivity variables obtained from the multivariate regression analysis. In the case of LCOCI, vector Ω takes the place of vector χ in eq 7. The introduction of the unitary connectivity index χ⁰ (Ω⁰ for the orthogonal case) corresponds to reducing the nonhomogeneous estimate of P to a homogeneous one, which can be better represented by matrix formalism. The statistical performances of the different LCCI or LCOCI have been obtained from a multiple regression analysis. The best LCCI were sorted following the values of their correlation coefficient R, standard deviation of estimates S, quality factor Q = R/S,19 and variance ratio F = fR²/[(1 - R²)v], where f = number of degrees of freedom and v = number of variables. Normally Q and F values are directly obtained with the six-digit calculated R and S values. The Q quality factor was introduced to detect meaningless addition of a new index in the modeling of a property with an LCCI,33-35 as in such cases inclusion of an unessential index is followed by a decrease of this factor due to a growing S. The objective of a good modeling is thus the selection of an LCCI minimizing (for a given R) in some sense the estimation error S of a given P; if the LCCI is chosen so as to minimize this S, then the combination of indexes of this LCCI is called the best descriptor (or Q descriptor) and the modeling is optimal. Another objective of the modeling is the selection (for a given R) of an LCCI minimizing in some sense the number v of variables relative to the number of observations used to model a property; if the LCCI is chosen so as to minimize this v, then the combination of indexes of the corresponding LCCI is called the best F descriptor and the modeling is an F-optimal modeling.
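The orthogonalization outlined earlier in this section, and the constancy of the LCOCI coefficients that it is meant to provide, can be illustrated with a minimal sketch. It assumes that "the part which can be reproduced by ¹Ω" is the ordinary least-squares fit (with intercept) of the second index on the first, so that ²Ω is the regression residual. The function names and the toy numbers are hypothetical and are not data from this study.

```python
# Sketch under stated assumptions: 2-Omega is taken as the least-squares
# residual of the second index regressed on the first. Toy data only.

def mean(values):
    return sum(values) / len(values)

def center(values):
    m = mean(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simple_fit(x, y):
    """Least-squares fit y = a*x + b; returns (a, b)."""
    xc, yc = center(x), center(y)
    a = dot(xc, yc) / dot(xc, xc)
    return a, mean(y) - a * mean(x)

def orthogonalize(chi_first, chi_second):
    """Return (omega1, omega2) with omega1 = chi_first and
    omega2 = chi_second minus its fit on chi_first."""
    a, b = simple_fit(chi_first, chi_second)
    omega2 = [x2 - (a * x1 + b) for x1, x2 in zip(chi_first, chi_second)]
    return list(chi_first), omega2

def two_index_fit(x1, x2, y):
    """Least-squares fit y = c1*x1 + c2*x2 + c0; returns (c1, c2, c0)."""
    u, v, w = center(x1), center(x2), center(y)
    suu, svv, suv = dot(u, u), dot(v, v), dot(u, v)
    suw, svw = dot(u, w), dot(v, w)
    det = suu * svv - suv * suv
    c1 = (suw * svv - svw * suv) / det
    c2 = (svw * suu - suw * suv) / det
    c0 = mean(y) - c1 * mean(x1) - c2 * mean(x2)
    return c1, c2, c0

# Hypothetical descriptor values and property values for five compounds.
chi_a = [1.0, 2.0, 3.0, 4.0, 5.0]
chi_b = [1.2, 1.9, 3.4, 3.8, 5.3]
prop = [10.0, 13.5, 15.0, 19.5, 21.0]

omega1, omega2 = orthogonalize(chi_a, chi_b)
single = simple_fit(chi_a, prop)                   # one-index model
c_chi = two_index_fit(chi_a, chi_b, prop)          # two-index LCCI
c_omega = two_index_fit(omega1, omega2, prop)      # two-index LCOCI
```

In this sketch the first and last terms of `c_omega` coincide with the slope and intercept of the one-index model, and its middle term coincides with the coefficient of the second index in `c_chi`, which is the behavior the stepwise and deletion checks rely on.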
Generally, while Q values indicate the relative quality of different LCCI of a specific property, the F values indicate the absolute quality of an LCCI of different properties.
Results and Discussion
This Results and Discussion section is divided into as many subsections as there are sets of compounds whose physical constants have to be modeled. We will start with the modeling of the melting points of alkanes, the most controversial problem in molecular modeling and the greatest challenge to molecular connectivity indexes.
A. Modeling the Melting Points of Alkanes. The experimental values of the melting points and motor octane numbers of the alkanes, together with their corresponding χ values, have been collected in Table 1.
1. Minimal {χ} = {D, ⁰χ, ¹χ, χₜ} Set of Connectivity Indexes. The best combination of connectivity indexes used to model the MP of the full set of alkanes (n = 56) is
{D, ¹χ, χₜ}: Q = 0.0160, F = 5.99, R = 0.507, S = 31.71
From the statistical parameters of this LCCI we notice that there is no satisfactory modeling of the melting points of alkanes with the given set of χ indexes. The χ and C vectors of the combination used in eq 7 to obtain the calculated values of the melting points are
χ = (D, ¹χ, χₜ, χ⁰)
C = (11.415, -38.949, -54.416, 160.36)
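The Q and F values quoted for this combination can be checked from R, S, and the sizes n and v. The sketch below assumes that the number of degrees of freedom is f = n - v - 1; the text only calls f the number of degrees of freedom, but this choice reproduces the reported values to within the rounding of R and S. The function names are illustrative.

```python
# Sketch: quality factor Q = R/S and variance ratio F = f*R^2 / ((1 - R^2)*v).
# Assumption (not stated explicitly in the text): f = n - v - 1.

def quality_factor(r, s):
    """Q = R / S."""
    return r / s

def variance_ratio(r, n_obs, n_vars):
    """F = f R^2 / [(1 - R^2) v], with f = n_obs - n_vars - 1 (assumed)."""
    f = n_obs - n_vars - 1
    return f * r ** 2 / ((1.0 - r ** 2) * n_vars)

# Full alkane set, {D, 1-chi, chi_t}: n = 56, v = 3, R = 0.507, S = 31.71
q_full = quality_factor(0.507, 31.71)
f_full = variance_ratio(0.507, 56, 3)
print(f"Q = {q_full:.4f}, F = {f_full:.2f}")
```

Because the paper computes Q and F from six-digit R and S values, small differences in the last printed digit are to be expected when the rounded values are used.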
The calculated melting points are plotted in Figure 1 versus the corresponding experimental values. This figure confirms the very poor description of the melting points of alkanes with the given minimal set of χ indexes. In fact, a close analysis of
TABLE 1: Experimental Melting Points (kelvin) and Motor Octane Numbers (MON) and Calculated Connectivity χ Indexes for 56 Alkanes(a)
names 3 4 2 2M3 2M4 2M5 24MM6 33MM5 5
23MM4 235MMM6 33MM6 22MM5 234MEM5 22MM6 2234(M)5 4M7 3M7 224MMM6 3M6 24MM5 23MM5 3E5 2M6 3M5 233MMM6 23MM7 23ME5 3E7 244MMM6 4M8 22MM7 223MMM5 234MMM5 2M7 3M8 224MMM5 225MMM6 26MM7 2334(M)5 334MMM6 233MMM5 22MM4 223MME5 6 25MM6 33ME5 7 2M8 2244(M)5 8 9 33EE5 223MMM4 22MM3 2233(M)5
MP MON 85.46 89.80 90.1 89.88 113.25 97.6 113.55 90.3 119.48 73.5 113.65 69.9 138.69 86.6 143.43 61.9 144.61 94.4 145.35 147.05 83.4 149.34 95.6 150.95 151.97 77.4 152.06 152.20 39 152.65 35 153.15 55.0 153.75 153.91 83.5 154.05 88.5 154.55 65.0 154.87 46.4 155.15 74.3 156.35 157.15 158.19 88.1 158.25 159.77 159.95 160.15 160.88 99.9 163.94 95.9 164.11 23.8 165.55 165.77 100.0 176.37 170.25 171.03 171.95 172.45 99.4 173.28 93.4 173.95 177.80 26.0 181.95 55.7 182.28 0.0 182.54 192.75 206.61 216.36 219.63 240.04 248.24 256.60 80.2 263.25
D 4 6 2 8 6 10 14 12 8 10 16 14 12 16 14 16 14 14 16 12 12 12 12 12 10 16 16 14 16 16 16 16 14 14 14 15 14 16 16 16 16 14 10 16 10 14 14 12 16 16 14 16 16 12 8 16
2.707 10 3.41421 2.000 00 4.28445 3.577 35 4.991 56 6.569 81 5.91421 4.121 32 5.15470 7.439 15 6.621 32 5.91421 7.439 15 6.621 32 7.65470 6.405 77 6.405 77 7.491 56 5.698 67 5.861 80 5.861 80 6.698 67 5.69867 4.991 56 7.491 56 7.27602 6.568 91 7.11288 7.491 56 7.11288 7.328 42 6.78445 6.73205 6.405 77 7.112 88 6.78445 7.491 56 7.27602 7.65470 7.491 56 6.78445 5.207 10 7.491 56 4.828 42 6.569 81 6.621 32 5.535 53 7.11288 7.707 10 6.24264 6.949 74 7.32842 6.077 35 5.50000 7.707 10
¹χ
χₜ
1.41421 1.91421 1.000 00 2.27005 1.73205 2.77005 3.663 90 3.121 32 2.41421 2.64273 4.03658 3.621 32 3.06066 4.091 42 3.56066 3.85405 3.80806 3.80806 3.95450 3.30806 3.125 89 3.18073 3.346 06 3.27005 2.80806 4.00403 4.18073 3.718 74 4.34606 3.977 16 4.30806 4.06066 3.481 38 3.55341 3.77005 4.30806 3.41650 3.91650 4.125 89 3.88675 4.04204 3.50403 2.56066 4.019 38 2.914 21 3.625 89 3.681 98 3.41421 4.27005 3.707 10 3.91421 4.41421 4.24264 2.943 37 2.00000 3.81066
0.7071 0.5000 1.00000 0.4082 0.5774 0.2887 0.1667 0.2500 0.3536 0.3333 0.1361 0.1768 0.2500 0.1361 0.1768 0.1667 0.1443 0.1443 0.1443 0.2041 0.2357 0.2357 0.2041 0.2041 0.2887 0.1443 0.1179 0.1667 0.1021 0.1443 0.1021 0.1250 0.2041 0.1925 0.1443 0.1021 0.2041 0.1443 0.1179 0.1667 0.1443 0.2041 0.3536 0.1443 0.2500 0.1667 0.1768 0.1768 0.1021 0.1768 0.1250 0.0884 0.1250 0.2887 0.5000 0.1768
(a) Abbreviations as in Kier and Hall:35 2 = ethane, 3 = propane, etc.; M = methyl, E = ethyl; e.g., 34ME6 = 3-methyl-4-ethylhexane.
the set of values of Table 1 reveals the impossibility of finding any kind of correlation between the experimental melting points and the corresponding connectivity index values. Such poor modeling of the melting points of alkanes has been attributed25 to the inherent limits of the connectivity indexes themselves. Owing to the very poor scoring of the given indexes, the possibility of using, at this level, orthogonal indexes to improve the modeling of the property can be ruled out. It should anyway be noticed, from the value of the interrelation matrix ⟨R_IM(χ:MP)⟩ = 0.944, that these indexes are not strongly collinear, the only strongly interrelated indexes being D and ⁰χ, with R(D,⁰χ) = 0.988.
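The D, ⁰χ, ¹χ, and χₜ entries of Table 1 follow from eqs 1 and 3-5 once the hydrogen-suppressed graph is known. A minimal sketch for alkanes, where δ is simply the number of carbon neighbors, is given below; the bond-list input format and the function name are assumptions made for illustration.

```python
from math import prod

def connectivity_indexes(bonds):
    """Return (D, chi0, chi1, chi_t) for a hydrogen-suppressed alkane graph.

    bonds: list of (i, j) carbon-carbon bonds, atoms labeled arbitrarily.
    delta of an atom = number of bonded non-hydrogen neighbors.
    """
    delta = {}
    for i, j in bonds:
        delta[i] = delta.get(i, 0) + 1
        delta[j] = delta.get(j, 0) + 1
    d_index = sum(delta.values())                                  # eq 1
    chi0 = sum(d ** -0.5 for d in delta.values())                  # eq 3
    chi1 = sum((delta[i] * delta[j]) ** -0.5 for i, j in bonds)    # eq 4
    chi_t = prod(delta.values()) ** -0.5                           # eq 5
    return d_index, chi0, chi1, chi_t

# Propane: C1-C2-C3
propane = connectivity_indexes([(1, 2), (2, 3)])
# 2-Methylbutane: main chain C1-C2-C3-C4 with a methyl (C5) on C2
methylbutane = connectivity_indexes([(1, 2), (2, 3), (3, 4), (2, 5)])
print(propane)
print(methylbutane)
```

For propane this gives D = 4 with ⁰χ, ¹χ, and χₜ agreeing with the tabulated 2.70710, 1.41421, and 0.7071 to the printed precision. Valence indexes would need δᵛ values for heteroatoms, which this sketch does not handle.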
Figure 1. Plot of the calculated (with the χ and C vectors of the {χ}_M set) versus the experimental melting points (MP exp) for 56 alkanes.
Figure 2. Plot of the calculated (with the χ and C vectors of the {χ}_E set) versus the experimental melting points (MP exp) for 56 alkanes.
2. Expanded {χ}_E = {D, χₜ, ᵐχ, ʲχc, ᵏχpc} Set of Connectivity Indexes with m = 0-6, j = 3-6, and k = 4-6. This set of v = 16 connectivity indexes (m = 0-6 stands for m from 0 to 6 and for ᵐχp) should offer a better modeling of the melting points of alkanes and, in fact, the following combinations of connectivity indexes yield
{ᵐχ, ʲχc}:
(1) m = 1-6, j = 3-6: Q = 0.0323, F = 7.36, R = 0.788, S = 24.36
(2) m = 1-6, j = 3-5: Q = 0.0321, F = 8.07, R = 0.782, S = 24.36
(3) m = 1-5, j = 3, 5: Q = 0.0290, F = 8.50, R = 0.744, S = 25.58
Clearly, while the first is the Q-best LCCI, the third is the F-best LCCI. As a kind of compromise, the LCCI built from the connectivity indexes of item 2 is chosen to model the MP values, which are plotted versus the corresponding experimental ones in Figure 2. The generating χ and C vectors for the calculated MP are
χ = (¹χ, ²χ, ³χ, ⁴χ, ⁵χ, ⁶χ, ³χc, ⁴χc, ⁵χc, χ⁰)
C = (399.11, -278.41, -145.74, -120.84, -131.10, -100.19, 365.97, -435.10, 90.579, -299.28)
Combinations of indexes of items 1-3 were chosen with the aid of a test-and-discard method (a kind of trial-and-error method): starting with the best single-index LCCI, inclusion or exclusion of an index was checked by the resulting Q and F values of the new LCCI. Even if the plot of Figure 2 represents a clear improvement over the plot of Figure 1, the modeling is, nevertheless, far from being convincing. The small improvement obtained with the aid of connectivity indexes belonging to the extended set might be read as a hint that these indexes are in some way effective in encoding shape-dependent properties. This crucial problem, the capability of the connectivity indexes to encode not just shape-dependent properties but, more generally, properties that are dependent on noncovalent interactions occurring in a real physicochemical system, will be correctly solved only with the modeling of the solubilities of the small set of caffeine homologues, which has, thus, a central paradigmatic value for the solution of the modeling problem with an LCCI.
To bypass the inherent limits of the connectivity indexes in the modeling of the melting points of alkanes, without rejecting them, the easiest and most straightforward way should be to introduce a segmentation of the given set of alkanes into smaller congruent subsets. Another way out might be to introduce a cluster model for the connectivity indexes that takes into account molecular interactions in the liquid. Such interactions could be simulated, at the connectivity level, with the aid of a supraconnectivity index that encompasses two or more molecules, but as the physical description of real liquids is far from being satisfactory, any agreement with experimental evidence about the real number of associated molecules would be purely coincidental; we will, anyway, see how an ad hoc real model based on supramolecular connectivity indexes can be successfully applied to simulate the solubility of a real system.
3a. Segmentation of the Set of Alkane Compounds and Modeling with a Minimal {χ} = {D, ⁰χ, ¹χ, χₜ} Set of Indexes.
(1st subset) linear i alkanes (i = 2-9 and n = 8)
(2nd subset) Mi + Ei alkanes (i = 3-8 and n = 14)
(3rd subset) MMi + MEi + EEi alkanes (i = 3-7 and n = 17)
(4th subset) MMMi + MEMi alkanes (i = 4-6 and n = 13)
(5th subset) MMMMi alkanes (i = 5 and n = 4)
Let us now analyze the best LCCI descriptors of these subsets:
1st subset: The best combinations for this subset (n = 8) are
{¹χ}: Q = 0.0618, F = 86.1, R = 0.967, S = 25.03
{⁰χ, ¹χ, χₜ}: Q = 0.0673, F = 34.0, R = 0.981, S = 14.58
The first combination is chosen here to model the MP of this subset, as it has a similar Q but a much better F than the second one. The χ and C vectors used to obtain the MPcalc values of this subset of alkanes are
χ = (¹χ, χ⁰)
C = (45.426, 29.101)
2nd subset: the best {χ} for this subset (n = 14) is
{D}: Q = 0.0693, F = 29.5, R = 0.843, S = 12.16
For this subset this single-index combination is the overall best LCCI, even if its modeling power is rather poor. Vectors χ and C used to derive the MPcalc values are
χ = (D, χ⁰)
C = (5.8947, 77.035)
3rd subset: The best connectivity combinations for this subset (n = 17) are
{χₜ}: Q = 0.0105, F = 1.95, R = 0.339, S = 32.26
{¹χ, χₜ}: Q = 0.0427, F = 16.1, R = 0.834, S = 19.55
The last combination is the best one and, even if its statistical improvement over the single-index LCCI is very impressive, it remains, nevertheless, a rather poor LCCI. The poor score of the single-index combination points to the probable existence of a dominant orthogonal descriptor, as we shall see later. The χ and C vectors of this LCCI are
χ = (¹χ, χₜ, χ⁰)
C = (180.10, 1189.9, -705.54)
4th subset: The best consecutive connectivity combinations for this subset (n = 13) of alkanes are
{χₜ}: Q = 0.0532, F = 22.1, R = 0.817, S = 15.36
{D, χₜ}: Q = 0.0964, F = 36.3, R = 0.938, S = 9.721
{D, ¹χ, χₜ}: Q = 0.111, F = 31.8, R = 0.956, S = 8.650
Here, again, we have a nice modeling of the melting points with the second and third combinations of indexes, while the single-index combination looks quite decent. The χ and C vectors of the third LCCI will be chosen, because of its better Q value, to derive the MPcalc values:
χ = (D, ¹χ, χₜ, χ⁰)
C = (42.098, 106.98, 2506.2, -1299.2)
The collinearity among the indexes of this set is rather strong, as emerges from their R values: R(D,¹χ) = 0.989, R(D,χₜ) = 0.988, and R(¹χ,χₜ) = 0.988. This underlines again what has been said in the Method section: strongly collinear indexes can be quite effective in improving the modeling of a physical property.
5th subset: Owing to the very low n value (n = 4) for this subset we will choose combinations with at most v = 2 indexes. The best ones are
{χₜ}: Q = 0.0290, F = 6.04, R = 0.867, S = 29.87
{¹χ, χₜ}: Q = 1.366, F = 6692, R = 0.99996, S = 0.73
The second combination, where R(¹χ,χₜ) = 0.824, is impressive but, owing to the low number of observations, it should be handled with much care. Its χ and C vectors are
χ = (¹χ, χₜ, χ⁰)
C = (549.94, 13338, -4190.4)
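The segmented scheme amounts to applying eq 7 with a different C vector for each subset. A minimal sketch of such a lookup is given below; the coefficients are the χ and C vectors quoted above for the five subsets, while the dictionary layout, the key names, and the function name are hypothetical choices made for illustration.

```python
# Sketch: segmented LCCI model, one (indexes, C) pair per subset.
# Coefficients are those quoted in the text; the last coefficient of each
# C vector multiplies the unitary index (value 1).
SUBSET_MODELS = {
    1: (("chi1",), (45.426, 29.101)),
    2: (("D",), (5.8947, 77.035)),
    3: (("chi1", "chi_t"), (180.10, 1189.9, -705.54)),
    4: (("D", "chi1", "chi_t"), (42.098, 106.98, 2506.2, -1299.2)),
    5: (("chi1", "chi_t"), (549.94, 13338.0, -4190.4)),
}

def predict_mp(subset, indexes):
    """Calculated melting point (K) as the dot product C . chi (eq 7)."""
    names, coeffs = SUBSET_MODELS[subset]
    chi_vector = [indexes[name] for name in names] + [1.0]
    return sum(c * x for c, x in zip(coeffs, chi_vector))

# Propane (1st subset), index value from Table 1
mp_propane = predict_mp(1, {"chi1": 1.41421})
# 2,2,3,3-Tetramethylpentane (5th subset), index values from Table 1
mp_2233m5 = predict_mp(5, {"chi1": 3.81066, "chi_t": 0.1768})
print(round(mp_propane, 2), round(mp_2233m5, 2))
```

With the Table 1 index values this yields roughly 93.3 K for propane and 263.4 K for 2,2,3,3-tetramethylpentane, to be compared with the experimental entries 85.46 and 263.25 K.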
In Figure 3 the calculated melting points for the different subsets, with the given best χ and C vectors, are plotted versus their corresponding experimental values: the improvement over Figure 2 is eye-catching. It should then be possible to model the melting points of alkane subsets that differ from each other in their degree of substitution and, furthermore, to use the results exemplified in Figure 2 to set up a range for the maximum variability of the melting points of the whole set.
Figure 3. Plot of the calculated (with the χ and C vectors of the {χ}_M set) versus the experimental melting points (MP exp) of the segmented set of 56 alkanes.
TABLE 2: Calculated ²Ω (with ¹Ω = χₜ) Orthogonal Index Values from the {χₜ, ¹χ} Indexes of the Third Subset of the Alkanes
alkanes   ²Ω          alkanes   ²Ω          alkanes   ²Ω
24MM6     -0.05433    33MM5     -0.09797    23MM4     -0.07762
33MM6     -0.03642    22MM5     -0.15863    22MM6     -0.09708
24MM5     -0.17905    23MM5     -0.12421    23MM7      0.17020
23ME5      0.00051    22MM7      0.09266    26MM7      0.11536
22MM4     -0.03810    25MM6     -0.09234    33ME5      0.02424
33EE5      0.27464    22MM3      0.27814
Calculations done with orthogonal Ω indexes,23 derived from the respective best χ combinations of subsets 3-5 (orthogonal representations for subsets 1 and 2, described by a single index, are meaningless), have shown that only the third subset, which shows R(¹χ,χₜ) = 0.974, has a quite interesting orthogonal representation. The orthogonalization procedure for this third subset starts with χₜ = ¹Ω and orthogonalizes ¹χ to obtain the ²Ω values, which are collected in Table 2. Let us first check by the (i) stepwise and (ii) deletion methods if these Ω indexes (practically the ²Ω index) have been correctly calculated: (i) the coefficients of the full C(Ω) vectors can be obtained from the corresponding C(χ) vectors by deriving stepwise best-χ LCCI and using the diagonal terms of these C(χ) as those of the sought-after C(Ω), while the coefficient of the unitary Ω⁰ index of the C(Ω) vector is given by the χ⁰ coefficient of the smallest C(χ) vector, that is (subscript s specifies the stepwise C vector)
C(χₜ, χ⁰) = (111.15, 143.93)
C(χₜ, ¹χ, χ⁰) = (1189.9, 180.10, -705.54)
C(¹Ω, ²Ω, Ω⁰)s = (111.15, 180.10, 143.93)
C(¹Ω, ²Ω, Ω⁰) = (111.15, 180.10, 143.93)
and in fact the calculated (with the orthogonal indexes of Table 2) C(Ω) vector is equal to the stepwise C(Ω)s vector (up to the fourth decimal figure; because of the small number of Ω indexes, roundoff problems are negligible). (ii) Deletion of the ²Ω index from C(Ω) produces C(¹Ω, Ω⁰) = C(χₜ, χ⁰) = (111.15, 143.93). This result, self-evident here, is based on a general feature of the orthogonal indexes: the constancy of the terms of the C(Ω) vector under deletion or inclusion of an orthogonal index. Clearly, the statistical performance of the best LCCI and of the corresponding best LCOCI, made up of vectors χ = (χₜ, ¹χ, χ⁰) and Ω = (¹Ω, ²Ω, Ω⁰), respectively, is, and should be, exactly the same,33-35 as derived orthogonal indexes cannot enclose more information content than their parent indexes. Nevertheless, this does not exclude the possibility that the ²Ω index could be a better descriptor than the ¹Ω = χₜ index. Indeed, the following ²Ω LCOCI displays
280
x
{2Q}: Q = 0.0344, F = 20.8, R = 0.763, S = 22.18 The improvement over the descriptive power of the xr index is astounding, and we can conclude that the 2Q index is the dominant descriptor of the melting points of the third subset of alkanes. As the modeling of the melting points of subsets 2 and 3 is not at all satisfactory, we turn our attention to the extended, v = 16, set of connectivity indexes to look for a further improvement of the modeling of these and also of subsets 1, 4, and 5. 3b. Segmentation of the Set of Alkane Compounds and Modeling with the Extended Y = 16, Set of Indexes. The different combinations have been chosen following the alreadymentioned test and discard method. The fiist, second, and third subsets are better modeled with this extended set of indexes while the modeling of the fourth and fifth subsets does not improve at all. 1st subset: the n = 8 melting points are very well modeled by
{⁴χ}: Q = 0.0948, F = 202.5, R = 0.986, S = 10.40

The improvement over the ¹χ index of the previous {χ}M set is evident. The χ and C vectors are

χ = (⁴χ, χ⁰)
C = (137.97, 92.074)

2nd subset: the n = 14 points can be modeled by the following two best F (first) and Q (second) combinations:

{D, ⁶χpc}: Q = 0.0818, F = 20.5, R = 0.888, S = 10.9
{D, …}: Q = 0.111, F = 12.6, R = 0.957, S = 8.62

While the first combination is just better than the {D} combination of the minimal set, the second combination is quite fine. The C vector of this second LCCI is

C = (19.595, 8962.4, -10904, -2560.6, 613.71, 553.70, -9098.6)

3rd subset: this n = 17 subset can be adequately modeled by the following combination of indexes:

{…, ⁴χpc}: Q = 0.068, F = 11.6, R = 0.949, S = 13.98

The C vector of this LCCI is

C = (576.03, 679.14, -668.73, 1371.2, -2215.8, 389.37, -77.470, -867.77)

In Figure 4 the MPcalc values that have been calculated with the aid of the given χ and C vectors (MPcalc of subsets 4 and 5 were derived from the previous minimal vectors) have been plotted versus the corresponding experimental melting points. The improvement over Figure 3 is evident. Before leaving the modeling of this property, it should be noticed that, except for the 5th subset, (i) the best standard deviations of the other subsets are not insignificant, (ii) even with the best LCCI there is a high concentration of melting points at midranges, and (iii) LCCIs with a minimal basis set sometimes work better than LCCIs with an extended basis set. The given segmentation of alkanes corresponds in reality to the introduction of a new kind of ad hoc connectivity indexes, and it is on this last topic that the results on caffein homologues will shed some more light.

Figure 4. Plot of the calculated (with the χ and C vectors of the {χ}E set) versus the experimental melting points of the segmented set of 56 alkanes.

B. Modeling the Motor Octane Number of Alkanes. 1. Minimal {χ}M = {D, ⁰χ, ¹χ, χt} Set. While Xu et al. found a rather good correlation between MON values of heptane and of octane isomers separately with the quantum theoretically based GAI and aN indexes:

GAI: n(heptanes) = 8, Q = 0.0743, F = 38.4, R = 0.930, S = 12.52
     n(octanes) = 17, Q = 0.0841, F = 69.6, R = 0.907, S = 10.79
aN:  n(heptanes) = 8, Q = 0.0801, F = 44.7, R = 0.939, S = 11.73
     n(octanes) = 17, Q = 0.0903, F = 79.3, R = 0.917, S = 10.16

Balaban et al.26 (and references therein) derived a series of satisfactory correlations among the PON = MON/2 + RON/2 (RON = research octane number) of n = 45 alkanes, 35 cycloalkanes, and 73 alkenes and a mixed set of topological indexes (connectivity plus other types of indexes):

n = 45: Q = 0.118, F = 166, R = 0.961, S = 8.17
n = 35: Q = 0.082, F = 36.6, R = 0.883, S = 10.8
n = 73: Q = 0.150, F = 55, R = 0.875, S = 5.85
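The statistics quoted with each combination can be cross-checked with a few lines of code. The sketch below assumes, since this section does not restate the definitions, that Q is the ratio R/S and that F is the usual regression variance ratio for v indexes fitted to n points with a constant term; the function names are illustrative only. Both assumptions reproduce the GAI values quoted above to within rounding.

```python
# Consistency check for the statistics quoted with each combination.
# Assumed definitions (not restated in this section):
#   Q = R / S
#   F = R^2 (n - v - 1) / (v (1 - R^2))   for v indexes and n data points

def quality_factor(r, s):
    """Quality factor Q, assumed to be the ratio R / S."""
    return r / s

def fisher_ratio(r, n, v):
    """Variance ratio F for a fit of v indexes plus a constant to n points."""
    r2 = r * r
    return r2 * (n - v - 1) / (v * (1.0 - r2))

# (label, n, v, R, S, reported Q, reported F), copied from the GAI correlations.
reported = [
    ("GAI, heptanes", 8, 1, 0.930, 12.52, 0.0743, 38.4),
    ("GAI, octanes", 17, 1, 0.907, 10.79, 0.0841, 69.6),
]

for label, n, v, r, s, q_rep, f_rep in reported:
    q = quality_factor(r, s)
    f = fisher_ratio(r, n, v)
    print(f"{label}: Q = {q:.4f} (reported {q_rep}), F = {f:.1f} (reported {f_rep})")
```

The same two functions can be applied to any other line of statistics in this section; small deviations are expected because R and S are quoted to three or four figures only.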
The LCCI method allows the MON of the set of n = 30 alkanes studied by Xu et al. to be modeled in a very satisfactory and straightforward way. The best single and multiple descriptors of this property are

{χt}: Q = 0.0153, F = 4.74, R = 0.381, S = 24.88
{D, ⁰χ, ¹χ}: Q = 0.0921, F = 57.3, R = 0.932, S = 10.12

It should be noticed that the χt index does not appear in the second combination. This fact highlights one of the properties of these LCCI19,33-35 and one of the limits of the test and discard method used for the extended set: the inclusion (or exclusion) of the next-best index does not always yield the best modeling of a property. The bad scoring of the single-index
TABLE 3: Calculated ²Ω and ³Ω (D = ¹Ω) Indexes from the {D, ⁰χ, ¹χ} Set for the Modeling of the MON of Alkanes (n = 30)

alkane     ²Ω         ³Ω         alkane     ²Ω         ³Ω
224MMM5    0.19849   -0.04742    223MMM5    0.19849    0.01746
233MMM4    0.19849    0.04011    2M3        0.01901    0.01226
234MMM5    0.14609    0.04145    22MM5      0.10175   -0.03505
23MM4      0.11575    0.01674    22MM4      0.16815   -0.01729
2M4        0.08541   -0.00797    4         -0.07773    0.02464
23MM5      0.04934    0.03697    23ME5     -0.01705    0.05722
33MM5      0.10175    0.02561    24MM5      0.04934   -0.01787
33MM6      0.03536    0.00785    22MM3      0.23456   -0.06018
22MM6      0.03536   -0.05281    3M5       -0.04739    0.03251
2M5       -0.04739   -0.00550    24MM6     -0.01705    0.00238
3E5       -0.11379    0.05275    5         -0.14412    0.00688
25MM6     -0.01705   -0.03563    3M6       -0.11379    0.01475
2M6       -0.11379   -0.02326    4M7       -0.18019   -0.00302
3M7       -0.18019   -0.00302    6         -0.21053   -0.01089
2M7       -0.18019   -0.04103    7         -0.27693   -0.02865

Figure 5. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental motor octane numbers for 30 alkanes.
combination points to the possible presence of a dominant orthogonal descriptor for this property. The best modeling of the MON of the n = 30 alkanes is then achieved by the following χ and C vectors:

χ = (D, ⁰χ, ¹χ, χ⁰)
C = (-212.23, 395.69, 251.85, -484.59)
The calculated values of MON of Figure 5, which have been plotted versus the corresponding experimental ones, have been obtained with these two vectors. The figure is impressive, but it can be improved, as we shall see, with the use of the extended set of connectivity indexes. First, let us find out if there are some dominant orthogonal descriptors for this property. The very bad modeling of the {χt} LCCI endorses this possibility. The value ⟨Rmk: MON⟩ = 0.951 means that the connectivity indexes of this set are not strongly collinear even if the zeroth-order and total connectivity indexes show a strong interrelation: R(⁰χ, χt) = 0.98999. Table 3 collects the ⁱΩ (i = 1-3) orthogonal indexes, with D = ¹Ω, that have been obtained with an orthogonalization procedure performed sequentially on the {D, ⁰χ, ¹χ} combination. Let us check their validity by the stepwise and the exclusion method:
C(D, χ⁰) = (-1.6578, 91.667)
C(D, ⁰χ, χ⁰) = (-65.400, 164.81, -101.4)
C(D, ⁰χ, ¹χ, χ⁰) = (-212.23, 395.69, 251.85, -484.59)
C(¹Ω, ²Ω, ³Ω, Ω⁰)s = (-1.6578, 164.81, 251.85, 91.667)
C(¹Ω, ²Ω, ³Ω, Ω⁰) = (-1.6579, 164.81, 251.85, 91.669)
C(¹Ω, ²Ω, Ω⁰) = (-1.6578, 164.81, 91.667)

and evidently C(¹Ω, Ω⁰) = C(D, χ⁰). The derived orthogonal indexes seem, then, quite right. The two LCOCI made up of the following orthogonal connectivity indexes exhibit the following statistical quality:

{²Ω}: Q = 0.0648, F = 85.0, R = 0.867, S = 13.39
{²Ω, ³Ω}: Q = 0.0847, F = 72.8, R = 0.918, S = 10.84

A comparison with the statistical performances of the {χt} and {D, ⁰χ} (Q = 0.068, F = 41, R = 0.88, S = 13) LCCI reveals the remarkable performances of these two combinations of Ω indexes: the ²Ω index is, thus, the dominant descriptor of the MON of alkanes. The performance of this index is nearly as good as the performance of the GAI index with the octanes and of the mixed set of indexes26 with the cycloalkanes. The quality of the figure (not shown here) obtained with the aid of the second LCOCI can hardly be distinguished from the quality of Figure 5.

Figure 6. Plot of the calculated (with the χ and C vectors of the {χ}E set) versus the experimental motor octane number for 30 alkanes.

2. Expanded {χ}E = {D, χt, ᵐχ, ʲχc, ᵏχpc} Set of Connectivity Indexes with m = 0-6, j = 3-5, and k = 4-6. This v = 16 expanded set of indexes yields quite interesting descriptions of the MON of the n = 30 alkanes:

{¹χ}: Q = 0.0365, F = 27.1, R = 0.701, S = 19.19
{D, ⁰χ, ¹χ, χt, …} (v = 8): Q = 0.112, F = 32.0, R = 0.961, S = 8.50
{{χ}E - D; v = 15}: Q = 0.184, F = 45.7, R = 0.9899, S = 5.38

The single-χ description is much better than the χt description but worse than the ²Ω description of the minimal connectivity set. The LCCI made up of the v = 15 index combination (the full set minus the D index) is clearly the best overall description of the MON of alkanes and, notwithstanding its high v value, it shows a rather nice F value. In Figure 6 are shown the calculated MON values obtained with the aid of the χ and C vectors of the second
TABLE 4: Observed Melting Points (°C), Solubilities in Water at 30 or 20 °C (mg/mL), and Calculated χ Values of Caffein Homologues (Substituted Xanthines)a

name         MP    sol (temp)    D    ⁰χ         ¹χ        ⁰χv        ¹χv
37MMXa       357   0.54 (30)     28   9.58542    6.10906   7.23549    3.71350
371MMMXa     238   25.8 (30)     30   10.45567   6.53658   8.18270    4.10793
371MMEXa     167   39.8 (30)     32   11.16277   7.07459   8.88981    4.68405
371MMPXa     138   13.8 (30)     34   11.86988   7.57459   9.59692    5.18405
371MMBXa     122   5.6 (30)      36   12.57699   8.07459   10.30402   5.68405
13MMXa       270   8.1 (30)      28   9.58542    6.12590   7.23549    3.71758
137MMMXa     238   25.8 (30)     30   10.45567   6.53658   8.18270    4.10793
137MMEXa     156   36.6 (30)     32   11.16277   7.07459   8.88981    4.68405
137MMPXa     99    231.1 (30)    34   11.86988   7.57459   9.59691    5.18405
137MMBXa     108   3.7 (30)      36   12.57699   8.07459   10.30402   5.68405
137MMIXa     91    27 (20)       36   12.74012   7.93043   10.46716   5.53989
1387tMIXa    134   6.3 (20)      38   13.61036   8.34111   11.38981   5.97071
1387tMBXa    127   4.5 (20)      38   13.44723   8.48527   11.22667   6.11486

a Abbreviations as in Table 1: Xa = xanthine, tM = MMM, P = propyl, B = butyl, I = isobutyl; e.g., 137MMIXa = 1,3-dimethyl-7-isobutylxanthine, 137MMMXa = caffein.
combination (v = 8) versus the corresponding experimental MON values

C = (-392.33, 665.26, 584.29, -115.60, -52.185, -41.686, 26.657, 15.178, -874.06)

The small improvement over the plot of Figure 5 is evident. The rather fine description of the MON of alkanes, obtained with the minimal, orthogonal, and expanded set of connectivity indexes, can be ascribed to the partial size-dependent character of this property and to the possibility of the LCCI method to span a vast amount of combinations of branching related indexes and to choose, thus, combinations that can mimic in some way specific dominant attributes of a property.

C. Modeling the Melting Points and Solubilities of Caffein Homologues with a Minimal Set of {χ} = {D, ⁰χ, ¹χ, ⁰χv, ¹χv} Indexes. In Table 4 the experimental melting points and solubilities and the connectivity values of caffein homologues have been collected following the two main groups into which these xanthine ring derivatives can be divided: homologues of 37MMXa = 3,7-dimethylxanthine = theobromine and homologues of 13MMXa = 1,3-dimethylxanthine = theophylline; caffeine = 137MMMXa, which belongs to both groups, has been repeated twice in this table. All of these compounds can be thought to derive from 3MXa, which can be represented with the aid of the following δ and δv matrices (the numbers of the upper line represent the position of the atoms while c and C stand for connection):

        1, 2, 3, 4, 5, 6; 9, 8, 7
δ  =    2, 3, 3, 3, 3, 3; 2, 2, 2
        0, 1, 1, c, C, 1; c, 0, C

        1, 2, 3, 4, 5, 6; 9, 8, 7
δv =    4, 4, 5, 4, 4, 4; 5, 3, 4
        0, 6, 1, c, C, 6; c, 0, C

Connections take place vertically and horizontally only in the first row and between positions marked with C or c (zero values fill the voids); thus, position 4 is connected with position 9, position 5 with position 7, and position 1 with position 6. For more details about such kind of matrices see refs 19, 21, and 22. The δ matrices of the different xanthine derivatives can be obtained by adding the δ values of the side chains at one of the two ends (1 or 7) in the first line of the δ matrices, raising by 1 the δ values where addition follows and filling with zeros the second line if the side chain has no side chains.

These pharmacologically interesting molecules27-30 have very similar molecular shape and size but different melting points and especially very different solubilities, a fact that has been suggested to rationalize their different pharmacological activities. Three arguments have been given to explain their melting points and solubilities: (i) steric factors, (ii) hydrogen bond formation, and (iii) marked association phenomena in aqueous solutions. While these physicochemical justifications can be overlooked during the modeling of the melting points of these compounds with a minimal basis set (inclusive of valence indexes) LCCI or LCOCI, they have to be taken into proper consideration (especially the last two) to derive ad hoc supramolecular or supraconnectivity indexes to be used to model the solubilities with an LCCI.

1. Melting Points of Caffein Homologues. The modeling of this property of the n = 12 xanthine derivatives with different LCCIs derived from the {χ} = {D, ⁰χ, ¹χ, ⁰χv, ¹χv} minimal basis set reaches its optimum with the following LCCIs:

{¹χ, ¹χv}: Q = 0.0216, F = 16.6, R = 0.887, S = 41.1
{D, ¹χ, ¹χv}: Q = 0.0212, F = 10.7, R = 0.895, S = 42.1
{D, ⁰χ, ¹χ, ¹χv}: Q = 0.0231, F = 9.52, R = 0.919, S = 39.8
{χ}: Q = 0.0479, F = 32.7, R = 0.982, S = 20.5

It should be noticed that the last combination scores very well in both Q and F values, while the first combination seems quite reasonable. The poor ratings of the second, third, and fourth combinations of indexes can be improved with the introduction of the corresponding orthogonal connectivity indexes, which can offer better Q and F values. The following χ and C vectors of the last LCCI:

χ = (D, ⁰χ, ¹χ, ⁰χv, ¹χv, χ⁰)
C = (2527.9, -8264.9, -7771.9, 4275.8, 3246.3, 13295)

have been used to obtain the calculated MP, which have been plotted versus the corresponding experimental values in Figure 7. The plot is quite fine even if obtained by highly collinear indexes with ⟨Rmk: MP⟩ = 0.996, a further example that collinearity does not prevent an appropriate modeling of a property. In Table 5 have been collected the values of the orthogonal indexes obtained with an orthogonalization procedure that started with D = ¹Ω and orthogonalized successively the other indexes of the minimum set in the given order (set in the
opening lines of this section). The three following orthogonal combinations produce, in fact, significant LCOCI:

{¹Ω, ⁵Ω}: Q = 0.0266, F = 25.3, R = 0.921, S = 35.61
{¹Ω, ³Ω, ⁵Ω}: Q = 0.0346, F = 28.4, R = 0.956, S = 27.7
{¹Ω, ³Ω, ⁴Ω, ⁵Ω}: Q = 0.0428, F = 32.7, R = 0.974, S = 22.7

The first combination is even better than the four-χ-index combination, while the last combination is nearly as good as the five-χ-index combination. Practically, the melting points of the caffein homologues can be appropriately modeled by an LCOCI made up of the first combination of two Ω connectivity indexes plus the unitary index. Let us check by the stepwise and exclusion methods the rightness of the found orthogonal indexes:

C(D, χ⁰) = (-19.493, 820.26)
C(D, ⁰χ, χ⁰) = (-70.704, 132.77, 979.69)
C(D, ⁰χ, ¹χ, χ⁰) = (524.80, -727.27, -1125.9, -541.88)
C(D, ⁰χ, ¹χ, ⁰χv, χ⁰) = (1423.5, -4112.9, -2641.2, 1928.0, 2061.6)
C({χ}, χ⁰) = (2527.9, -8264.9, -7771.9, 4275.8, 3246.3, 13294)
C({Ω}, Ω⁰)s = (-19.493, 132.77, -1125.9, 1928.0, 3246.3, 820.26)
C({Ω}, Ω⁰) = (-19.492, 132.74, -1125.9, 1928.0, 3246.3, 820.24)
C(¹Ω, ³Ω, ⁵Ω, Ω⁰) = (-19.494, -1125.9, 3246.4, 820.28)
C(¹Ω, ⁵Ω, Ω⁰) = (-19.494, 3246.4, 820.28)

The stepwise C(Ω)s and calculated C(Ω) are essentially equal, while the calculated terms of the last C vectors (the best two-Ω and three-Ω vectors), which have been obtained by the exclusion method, are effectively invariant.

Figure 7. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental melting points for 12 caffein homologues.

TABLE 5: Orthogonal Connectivity Index Values of the Caffein Homologues Obtained from the Minimum {χ} Set of Indexes Used for the Modeling of Their Physical Properties (Abbreviations as in Table 4)

name         ²Ω         ³Ω         ⁴Ω         ⁵Ω
37MMXa      -0.01348   -0.02729    0.00361    0.01730
13MMXa      -0.01348   -0.01045   -0.00963   -0.00524
137MMMXa     0.08536    0.00714    0.01887   -0.01157
371MMEXa     0.02105    0.02744   -0.00631    0.00165
137MMEXa     0.02105    0.02744   -0.00631    0.00165
371MMPXa    -0.04325    0.00974   -0.00163   -0.00119
1387tMIXa    0.15442   -0.00993   -0.00302    0.00537
1387tMBXa   -0.00871    0.00962    0.00699    0.01234
371MMBXa    -0.10754   -0.00797    0.00303   -0.00404
137MMBXa    -0.10754   -0.00797    0.00303   -0.00404
137MMPXa    -0.04325    0.00974   -0.00164   -0.00120
137MMIXa     0.05559   -0.02751   -0.00698   -0.01102

2. Solubilities of Caffein Homologues. The best combination of indexes for the water solubilities of the caffein homologues is the following three-χ-index combination:

{D, ⁰χ, ¹χ}: Q = 0.0044, F = 0.282, R = 0.309, S = 70.97
and it is a very bad descriptor indeed. The modeling is so bad that even orthogonal indexes cannot be of any help here. It is interesting to notice that if the highest solubility value (231.1) is eliminated from the set, the modeling power of these three indexes improves dramatically, but we will come back to this and other difficulties in the Conclusions section. Before abandoning the modeling of this property, let us review some of the conclusions reached by Guttman and Higuchi and confirmed later (refs 27-30): (i) caffein exists in aqueous solution as monomer, dimer, and tetramer; (ii) dimerization and tetramerization reach a maximum with 7-propyltheophylline (137MMPXa); (iii) 137MMEXa and 371MMEXa mostly dimerize; and (iv) as the length of the side chain in caffein homologues increases (drastically for the butyl side chain and less drastically for the propyl side chain in 371MMPXa), association decreases. These conclusions can be used as a starting point for the construction of an ad hoc molecular connectivity (MC) model to derive new connectivity values for the modeling of the solubility of caffein homologues. It should be remembered here that the construction of meaningful ad hoc MC models was already successfully attempted during the modeling of the isoelectric and solubility points of amino acids. While there an MC fragment model based on the functional groups of the amino acids was introduced (fragment connectivity indexes), here an association model based on the known association phenomena that caffein homologues undergo in aqueous solution is introduced.
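A minimal sketch of how such an association model enters the calculation is given below. It assumes that every index of the combination, D included, is simply multiplied by the association parameter a adopted in this section (a = 1 for a monomer, 2 for a dimer, 4 for a tetramer), and it uses the coefficient vector quoted in this section for the unsquared {D, ⁰χ, ¹χ, ¹χv} combination; the pairing of that vector with this combination is inferred from the fact that it reproduces the observed solubilities. Index values are copied from Table 4, and the function and variable names are illustrative only.

```python
# Sketch of the association (supraconnectivity) model: the monomer indexes
# of Table 4 are multiplied by an association parameter a before they enter
# the linear combination. Scaling D together with the other indexes is an
# assumption of this sketch.

# name: (a, D, 0chi, 1chi, 1chi_v, observed solubility in mg/mL)
compounds = {
    "37MMXa":   (1.0, 28, 9.58542, 6.10906, 3.71350, 0.54),
    "137MMMXa": (2.0, 30, 10.45567, 6.53658, 4.10793, 25.8),
    "137MMPXa": (4.0, 34, 11.86988, 7.57459, 5.18405, 231.1),
}

# Coefficients for (D, 0chi, 1chi, 1chi_v, unitary index), as quoted in the text.
C = (214.12, -355.95, -541.62, 216.45, -74.967)

def calc_solubility(a, d, chi0, chi1, chi1v):
    """Linear combination of a-scaled indexes plus the constant term."""
    scaled = [a * x for x in (d, chi0, chi1, chi1v)]
    return sum(c * x for c, x in zip(C, scaled)) + C[-1]

for name, (a, d, chi0, chi1, chi1v, observed) in compounds.items():
    calc = calc_solubility(a, d, chi0, chi1, chi1v)
    print(f"{name}: a = {a}, calculated = {calc:.1f}, observed = {observed}")
```

For these three compounds the calculated values fall within the standard deviation S = 12.79 reported for this combination, whereas leaving 137MMPXa unscaled (a = 1) gives a value near zero instead of the observed 231.1 mg/mL.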
Thus, assuming that 137MMMXa (caffein), 137MMEXa, and 371MMEXa exist mostly as dimers (a = 2, where a is an association parameter), that 137MMPXa exists mostly as a tetramer (a = 4), and that 137MMIXa (a pharmacologically very active compound) exists as a mixture of monomer and dimer (a = 1.5), we multiply the connectivity indexes (Table 4) of the respective molecules by the given association parameter, and with these new supraconnectivity values for these compounds we start the modeling of the solubilities of the caffein homologues again. The scores of the best combinations of the new ad hoc connectivity plus supraconnectivity indexes are
{¹χv}: Q = 0.0552, F = 135.6, R = 0.965, S = 17.49
{⁰χ, ¹χ}: Q = 0.0630, F = 88.5, R = 0.976, S = 15.48
{D, ⁰χ, ¹χ, ¹χv}: Q = 0.0771, F = 66.3, R = 0.987, S = 12.79

If the rating of the third combination is noteworthy, the grades of the first and second combinations are not at all secondary: the LCCI method can often offer a wide range of good
combinations, and the selection of the best one is not always obvious. The improvement reached with the aid of these supraconnectivity indexes is remarkable, corroborating, thus, the validity of the conclusions reached in the cited experimental works. The very good description of every LCCI made up of the given combinations renders the use of the orthogonal indexes, here, superfluous. This very satisfying modeling can nevertheless be further improved, and the key for the improvement is obtained from the calculated solubility values obtained with the aid of the best χ and C vectors:

χ = (D, ⁰χ, ¹χ, ¹χv, χ⁰)
C = (214.12, -355.95, -541.62, 216.45, -74.967)

As three of the calculated values are negative, it could be worthwhile to choose a linear combination of the squares of the connectivity indexes of the best combination. In fact, the following combination of squared indexes shows a brilliant statistical score:

{D², (⁰χ)², (¹χ)², (¹χv)²}: Q = 0.193, F = 414.6, R = 0.9979, S = 5.17

In Figure 8 the experimental solubilities of the caffein homologues have been plotted together with the corresponding calculated values (every value is now positive) obtained with the aid of the following χ and C vectors:

χ = (D², (⁰χ)², (¹χ)², (¹χv)², χ⁰)
C = (0.43782, -2.17627, -4.58008, 2.91351, -9.52927)

Figure 8. Plot of the calculated (with the χ and C vectors of the {(χ)²} set) versus the experimental solubility of the 12 caffein homologues.

Adopted values for the association constant could surely be chosen to maximize the modeling of the property, independently of any experimental evidence. Such an inferred model could be used, as already said, with satisfying results to model the melting points of alkanes even if its experimental validity would be highly questionable. The proposed modeling of the solubilities of caffein homologues and the segmentation model of the melting points of alkanes help, here, to understand the fact that connectivity indexes are dependent on the state of the investigated physicochemical system and that, for the derivation of an adequate modeling of a property, the connectivity indexes have to be exactly defined for each state of the system. For example, an effective modeling of the solubilities should usually include also the connections with the hydration sphere of the solute, as recently suggested for the modeling of the relaxation times of the Ca(Tyr-DMSO) of amino acids.

TABLE 6: Observed Molar Refractivities MRD, Refractivity Index nD²⁰, Density d₄²⁰, and Calculated χ Values of 17 MPO(OR)2 Neutral Organophosphorus Compoundsa

R        MRD     nD²⁰     d₄²⁰     D    ⁰χ         ¹χ         ¹χv
butyl    54.42   1.4259   0.9638   24   10.15685   6.12132    5.48471
isobu    54.86   1.4226   0.9653   24   10.48313   5.83300    5.19640
secbu    54.81   1.4222   0.9657   24   10.48313   5.90901    5.34997
terbu    54.46   -        -        24   10.91421   5.41421    4.90140
n-Pe     64.19   -        -        28   11.57107   7.12132    6.48471
isoPe    63.61   1.5264   0.9529   28   11.89734   6.83300    6.19640
22MMP    64.27   -        -        28   12.32843   6.41421    5.77761
n-H      73.45   1.4353   0.9401   32   12.98528   8.12132    7.48471
c-H      69.23   -        -        36   12.13998   8.15660    7.59755
n-hep    82.87   1.4401   0.9303   36   14.39949   9.12132    8.48471
octyl    91.24   1.44     0.9257   40   15.81371   10.12132   9.48471
1MHep    92.02   1.4381   0.9146   40   16.13998   9.90901    9.34997
2EH      91.18   1.4414   0.9289   40   16.13998   9.98502    9.34842
nonyl    101.1   1.4445   0.9164   44   17.22792   11.12132   10.48471
decyl    109.7   1.4427   0.9093   48   18.64213   12.12132   11.48471
undec    119.8   1.4498   0.9077   52   20.05635   13.12132   12.48471
dodec    129.3   1.4512   0.9012   56   21.47056   14.12132   13.48471

a M = methyl, E = ethyl, P = propyl, Bu = butyl, Pe = pentyl, H = hexyl, Hep = heptyl, undec = undecyl, dodec = dodecyl, c = cyclo.

TABLE 7: Observed Retention Index Rf for Paper Chromatography and Calculated Connectivity χ Values of 14 RPO(OR')2 Neutral Organophosphorus Compoundsa

R     R'    Rf     ⁰χ         ¹χ         ¹χv
M     E     0.80   7.32843    4.12132    3.48471
M     P     0.71   8.74264    5.12132    4.48471
M     Bu    0.62   10.15685   6.12132    5.48471
E     Bu    0.59   10.86396   6.68198    5.99524
P     Bu    0.53   11.57107   7.18198    6.49524
M     Pe    0.48   11.57107   7.12132    6.48471
Bu    Bu    0.46   12.27817   7.68198    6.99524
M     H     0.38   12.98528   8.12132    7.48471
Pe    Bu    0.38   12.98528   8.18198    7.49524
H     Bu    0.34   13.69239   8.68198    7.99524
Hep   Bu    0.26   14.39949   9.18198    8.49524
M     Hep   0.24   14.39949   9.12132    8.48471
Oc    Bu    0.22   15.10660   9.68198    8.99524
M     Oc    0.15   15.81371   10.12132   9.48471

a Abbreviations as in Table 6, Oc = octyl.

D. Modeling the Physical Properties of Organophosphorus Compounds with a Minimal {χ} = {D, ⁰χ, ¹χ, ¹χv} Basis Set. In Tables 6 and 7 have been collected the experimental values of the physicochemical properties and the molecular connectivity values of the organophosphorus compounds. In Table 7 the minimal set for Rf is reduced to {χ} = {⁰χ, ¹χ, ¹χv} as R(D, ⁰χ) = 1. The molar refractivity MRD is a property dependent more on the size of the molecule, as are the retention Rf indexes for paper chromatography, while the density d₄²⁰ and the refractive nD²⁰ indexes are more shape dependent. The modeling of these two types of physical properties should, then, be an interesting objective. Furthermore, this modeling offers the possibility to use the recently defined δv = 2.Z5 value for the phosphorus atom in P-O organophosphorus compounds. The fact that ¹χv and ¹χ are not perfectly collinear but differ a little (their R value is 0.9999) offers the possibility of testing this valence connectivity index all along the modeling of the four physical properties. The modeling of the molar refractivities with the GAI index gave a significant scatter (no R and S values have been given, the reasoning is based on the figure of the plot), which was nicely attenuated with the modeling of the other two properties, for which we obtain (no R and S
have been given for Rf, but the given plot is rather fine):

nD²⁰: Q = 316.7, F = 111, R = 0.95, S = 0.003, n = 14
d₄²⁰: Q = 161.7, F = 191, R = 0.97, S = 0.006, n = 14

Let us first check the descriptive power of the ¹χv index alone:

MRD: Q = 0.413, F = 1573, R = 0.9953, S = 2.412, n = 17
Rf: Q = 43.69, F = 934.7, R = 0.9936, S = 0.023, n = 14
nD²⁰: Q = 398.7, F = 189.3, R = 0.9697, S = 0.0024, n = 14
d₄²⁰: Q = 136.1, F = 140.3, R = 0.9598, S = 0.0071, n = 14

Figure 9. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental molar refractivity for 17 organophosphorus compounds.

The F description of the two size-dependent properties MRD and Rf is excellent, while the F description of the two shape-dependent properties, even if not so excellent, is anyway gratifying and nearly as good as the descriptive power of the GAI index. The difference in the F modeling of the two different kinds of properties is respected even with the best combination of indexes, as can be seen from the following values of the statistical parameters and from the corresponding figures:
MRD:
{⁰χ}: Q = 0.554, F = 2831, R = 0.9974, S = 1.802
{⁰χ, ¹χ}: Q = 1.980, F = 18116, R = 0.99981, S = 0.505
{D, ⁰χ, ¹χ}: Q = 2.033, F = 12721, R = 0.99983, S = 0.492

χ = (⁰χ, ¹χ, χ⁰)
C = (3.91214, 3.77006, -8.43220)

Vectors used to obtain the calculated values of Figure 9 have been taken from the second combination, which presents a quite nice Q value and a superb F value. Noticeable is the accurate descriptive power of the ⁰χ index alone.

Rf:
{⁰χ}: Q = 44.5, F = 969.9, R = 0.9939, S = 0.022
{¹χ, ¹χv}: Q = 57.5, F = 808.6, R = 0.9966, S = 0.017
{⁰χ, ¹χ, ¹χv}: Q = 63.1, F = 650.6, R = 0.9974, S = 0.016

χ = (¹χ, ¹χv, χ⁰)
C = (0.60259, -0.71731, 0.84313)

Vectors used to obtain the calculated values of Figure 10 have been taken from the second combination, which shows a much better F value and a somewhat smaller Q value than the last combination. Even here the single-index combination is remarkable.

Figure 10. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental gas chromatographic retention indexes for 14 organophosphorus compounds.

nD²⁰:
{¹χ}: Q = 404.8, F = 195.1, R = 0.9706, S = 0.0024
{D, ⁰χ, ¹χ}: Q = 477.1, F = 90.4, R = 0.9820, S = 0.00206
{χ}: Q = 481.1, F = 68.9, R = 0.9841, S = 0.00205

Due to its better F value, which more than compensates for the worsening of the Q value relative to the full four-index combination, the vectors of the second combination have been chosen to simulate the values of this property:

χ = (D, ⁰χ, ¹χ, χ⁰)
C = (-0.01391, 0.01578, 0.03641, 1.37845)

Figure 11 shows the calculated values versus the corresponding experimental nD²⁰ ones; the plot is very good but not as fine as the two preceding ones.

d₄²⁰:
{⁰χ}: Q = 143.2, F = 155.2, R = 0.9635, S = 0.0067

χ = (⁰χ, χ⁰)
C = (-0.00628, 1.02799)

Vectors of this combination yield the best overall LCCI, and in Figure 12 are shown the calculated versus the corresponding experimental values for this property. It is evident from Figures 9-12 that the modeling quality of the first two (more size-dependent) properties is better than the modeling quality of the last two (more shape-dependent) properties, a fact that is clearly reflected by the value of the F parameter. Noticeable is the good description of the four
Figure 11. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental refractivity indexes for 14 organophosphorus compounds.

Figure 12. Plot of the calculated (with the χ and C vectors of the {χ}M set) versus the experimental densities for 14 organophosphorus compounds.
properties by the single-index combination, which has the features of a dominant descriptor, thus rendering unimportant the use of orthogonal indexes for this purpose. The ¹χv index contributes to the best combination in two of the four properties (nD²⁰ and Rf). The values of the interrelation matrices for the four properties are very high, reflecting the strong (in the sense of Mihalic et al.) collinearity among these indexes, which anyway provide a good modeling of the four properties:

⟨Rmk: MRD; Rf; nD²⁰; d₄²⁰⟩ = 0.991; 0.999914; 0.9987; 0.9987
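As a quick illustration of how the reported χ and C vectors are used, the sketch below evaluates the two-index combinations chosen for MRD and Rf on a few rows of Tables 6 and 7. The index values and coefficients are copied from this section; the helper name `lcci` and the choice of rows are illustrative assumptions.

```python
# Applying reported chi and C vectors to a few rows of Tables 6 and 7.
# The constant term (coefficient of the unitary index) is the last entry
# of each C vector.

def lcci(coeffs, indexes):
    """Return sum(c_i * chi_i) + constant, with the constant stored last."""
    *slopes, constant = coeffs
    return sum(c * x for c, x in zip(slopes, indexes)) + constant

C_MR = (3.91214, 3.77006, -8.43220)   # chi = (0chi, 1chi, unitary index)
C_RF = (0.60259, -0.71731, 0.84313)   # chi = (1chi, 1chi_v, unitary index)

# Molar refractivity of MPO(OR)2 with R = butyl and R = octyl (Table 6).
mr_butyl = lcci(C_MR, (10.15685, 6.12132))    # observed 54.42
mr_octyl = lcci(C_MR, (15.81371, 10.12132))   # observed 91.24

# Paper-chromatography Rf for the first and last rows of Table 7.
rf_first = lcci(C_RF, (4.12132, 3.48471))     # observed 0.80
rf_last = lcci(C_RF, (10.12132, 9.48471))     # observed 0.15

print(f"MRD: butyl {mr_butyl:.2f}, octyl {mr_octyl:.2f}")
print(f"Rf:  first row {rf_first:.3f}, last row {rf_last:.3f}")
```

The calculated values agree with the observed ones to within about one reported standard deviation for MRD and about two for Rf (S = 0.505 and S = 0.017, respectively).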
Conclusions All along the different sets of compounds here analyzed, there is a unifying leitmotive that concerns the ability of the LCCI method, which is based on molecular connectivity indexes, to model size- and/or shape-dependent properties. While the melting points of alkanes are a classical example of a strong shape-dependent property, the molar refractivity and the retention index in paper chromatography of phosphonates are more size dependent. The density and the refractivity index of these same phosphonates together with the motor octane numbers of alkanes seem to occupy a midregion and be both size- and shape-dependent properties. Charge, hydrogen bond, and steric factors become important factors (in reality, charges are rather
similar27-30 ) in determining the solubility and melting point of caffein analogues. The adopted, experimentally grounded, association pattern for the connectivity indexes yields a very fine modeling for the solubility of these xanthine derivatives, where the valence lxv connectivity index plays an important role being, in fact, a good single-descriptor for this property. The brilliant modeling of the adopted linear combination of squared connectivity indexes (LCSCI) is an important hint to design refined modeling of properties, which show some negative calculated values. The used supraconnectivity indexes for the solubility of caffein homologues could be taken as a trace not only for a general modeling of the solubilities (possibly taking into consideration the connections with the solvent) but also for the modeling of the melting point of alkanes. Due to the lack of experimental and conclusive theoretical information on the behavior of real liquids and consequently on the connections that take place in the liquid state an ad hoc segmentation model was adopted to model the melting points of the alkanes. Clearly, while the segmentation is not the optimal choice as it reduces drastically the number n of observations, it is nevertheless due to the lack of knowledge of the system, an obligatory way. The segmentation into congruent subsets discloses the possibility of obtaining by the aid of both a minimal (4" and 5" subsets) and an expanded (lo, 2", and 3" subsets) set of connectivity indexes, a rather satisfying modeling of this shape-dependent property. Orthogonalization of the connectivity indexes of the third subset, for the minimal case, offers the possibility to detect a dominant descriptor for this subset (2!2). 
To notice is (i) that the first (nonbranched) subset is very well modeled by the 4~ path index, (ii) that xC and xPc (especially 3 ~ and c ",,J type of indexes play a decisive role in modeling the melting points of the second and third (branched alkanes) subsets and (iii) that fourth and fifth branched subsets are well modeled by indexes belonging to the minimal set and especially by the xtindex, which is the dominant descriptor of two subsets. The modeling of the melting points of the caffein homologues can be considered as the modeling of a different segment of a vaster set of compounds to which both xanthine derivatives and alkanes belong and for which 6 and 6" are different. The rather fine modeling of this property demonstrates the importance of the valence connectivity indexes and, especially, of the corresponding orthogonal indexes, that achieve a fine modeling with less variables. The modeling of the motor Octane numbers of alkanes with a minimal set shows a dominant orthogonal (%2)descriptor and a rather good three-X-index descriptor, while the modeling with indexes from the extended set seems excellent. The modeling of the two size-dependent (MRD and Rf)and shape-dependent ( n and~ dd20) ~ properties ~ of the organophosphorus derivatives attests the better performances of the LCCI method with size-dependentproperties than with shape-dependentproperties; nevertheless, it shows also that shape-dependent properties can be satisfactorily described by a method that, spanning a rather wide space of combinations, can pick up the more appropriate connectivity indexes. This strengthens the supposition that the description of the melting points of the whole set of alkanes could be solved assuming a hypothetical ad hoc cluster model to derive new connectivity values. 
The testing of the $\delta^v$ value for the P atom in these derivatives is, clearly, not definitive, as the $^1\chi^v$ index is not a sufficient index for such a test, but its relevance in the description of both shape- and size-dependent properties is an interesting indication that valence indexes play a central role here as well. The versatility of the molecular connectivity-linear combination of connectivity indexes (LCCI-MC) method throughout the modeling of the given properties is clearly accentuated (i) by
the importance of the $D$, $\chi_t$, and higher-order (especially $\chi_c$ and $\chi_{pc}$) connectivity indexes in the description of the MP and MON of alkanes, (ii) by the good descriptive power of the valence connectivity indexes for the other properties, and finally (iii) by the versatility of the molecular connectivity orthogonal indexes. Molecular connectivity theory offers the possibility both of designing new useful connectivity indexes and of deriving, with the aid of a straightforward (and simple, if performed on minimal sets of connectivity indexes) orthogonalization procedure, the dominant descriptor of a property (when no dominant descriptor exists) and even the constant terms of the C(S2) without the aid of any orthogonalization procedure. On the other hand, if the shape- and size-descriptive power of the connectivity indexes is understood to refer to the overall shape and size resulting from association, hydration, and related phenomena that molecules undergo in real systems, then connectivity indexes have to be calculated in a way that also describes these phenomena. This means that LCCIs made up of “physicochemically ad hoc” connectivity indexes should be able to describe a larger number of properties than they currently do and that the problem of the size- and shape-dependent descriptive power of an index reduces to the capacity of the index to describe noncovalent interactions. Known connectivity indexes seem, then, to be optimal descriptors of gaseous state (e.g., boiling points) and diluted solution properties, that is, of properties that are not severely affected by the noncovalent interactions that take place among molecules, a feature that deserves deeper investigation.
For example, the LCCI-MC description of the recently published solubilities of amino acids reaches an optimum with the smaller values, i.e., in diluted solution, and if the higher solubility value of 7PTph (231.1, see Table 4) is left out of the solubility model of caffein homologues, the best but unsatisfactory combination {$D$, $^0\chi$, $^1\chi$} (Q = 0.004, F = 0.282, R = 0.309, S = 70.97, see text) improves its descriptive power to Q = 0.0613, F = 2.51, R = 0.720, S = 14.73, that is, nearly 10 times as far as the Q and F values are concerned. It could be asked why an extended set of indexes has not also been used for the caffein homologues and organophosphorus compounds: first, the description of the properties of these compounds is already fine, and second, the test-and-discard method for choosing an even better combination to use in the corresponding LCCI is very unsophisticated, as it selects only a restricted subspace of the full combinatorial space. The number of connectivity indexes possible for compounds that have different $\delta$ and $\delta^v$ values can easily reach 20 or more, and the number of combinations possible with such sets can easily become enormous (more than $10^6$), not to mention the LCOCI and the linear combinations of the squares or reciprocals or of other special constructions of indexes. Thus, one of the prerequisites of molecular connectivity modeling, and more generally of quantitative structure-property relationships (QSPR), namely elementary calculations, has to be discarded. It should be remembered that even if the capacity of minimal sets of indexes to model many properties is noteworthy and of great help, extended sets should generally be preferred because of their superior flexibility. The very last conclusions are, then, rather discouraging and ambitious simultaneously: the most suitable combination of connectivity indexes apt to model a property is (i) dependent
on the real and often unknown physicochemical state of the system and (ii) hidden somewhere among the many millions of possible combinations. But Rome was not built in a day, and it will be interesting to see where this effort will lead.

Acknowledgment. I would like to thank Prof. L. B. Kier for providing constant assistance, Dr. G. Bonacchi for an interesting scoop on caffein homologues, Prof. P. Dapporto for a stimulating discussion on solubility, and an anonymous reviewer for an interesting remark on the segmentation of alkanes. MURST, the Ministry for University, Scientific and Technological Research, is gratefully acknowledged for financial support.

References and Notes
(1) Randić, M. J. Am. Chem. Soc. 1975, 97, 6609.
(2) Kier, L. B.; Hall, L. H.; Murray, W. J.; Randić, M. J. Pharm. Sci. 1975, 64, 1971.
(3) Kier, L. B.; Hall, L. H.; Murray, W. J. J. Pharm. Sci. 1975, 64, 1974.
(4) Kier, L. B.; Hall, L. H. J. Pharm. Sci. 1981, 70, 583.
(5) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Wiley: New York, 1986.
(6) Kier, L. B.; Hall, L. H. Advances in Drug Research; Testa, B., Ed.; Academic: New York, 1992; Vol. 22.
(7) Trinajstić, N. Chemical Graph Theory; CRC: Boca Raton, FL, 1983; Vol. 2.
(8) Mihalić, Z.; Trinajstić, N. J. Chem. Educ. 1992, 69, 701.
(9) Turro, N. J. Angew. Chem., Int. Ed. Engl. 1986, 25, 882.
(10) Rouvray, D. H. Sci. Am. 1986, 254, 64.
(11) Rouvray, D. H. J. Mol. Struct. (THEOCHEM) 1989, 185, 187.
(12) Seybold, P. G.; May, M. A.; Bagal, U. A. J. Chem. Educ. 1987, 64, 575.
(13) Hansen, P. J.; Jurs, P. C. J. Chem. Educ. 1988, 65, 574.
(14) Stanton, D. T.; Jurs, P. C.; Hicks, G. M. J. Chem. Inf. Comput. Sci. 1991, 31, 301.
(15) Basak, S. C.; Magnuson, V. R.; Niemi, G. J.; Regal, R. R. Discr. Appl. Math. 1988, 19, 17.
(16) Basak, S. C.; Niemi, G. J.; Veith, G. D. J. Math. Chem. 1991, 7, 243.
(17) Pogliani, L. J. Pharm. Sci. 1992, 81, 334.
(18) Pogliani, L. Comput. Chem. 1993, 17, 283.
(19) Pogliani, L. J. Phys. Chem. 1993, 97, 6731.
(20) Pogliani, L. J. Phys. Chem. 1994, 98, 1494.
(21) Pogliani, L. Amino Acids 1994, 6, 141.
(22) Pogliani, L. Curr. Top. Pept. Prot. Res., to be published.
(23) Pogliani, L., unpublished and to be published results.
(24) Pogliani, L. J. Chem. Inf. Comput. Sci. 1994, 34, 801.
(25) Needham, D. E.; Wei, I. C.; Seybold, P. G. J. Am. Chem. Soc. 1988, 110, 4186.
(26) Balaban, A. T.; Kier, L. B.; Joshi, N. MATCD 1992, 28, 13.
(27) Guttman, D.; Higuchi, T. J. Am. Pharm. Assoc. 1957, 46, 4.
(28) Bolton, S.; Guttman, D.; Higuchi, T. J. Am. Pharm. Assoc. 1957, 46, 38.
(29) Agostini, O.; Bonacchi, G.; Dapporto, P.; Paoli, P.; Fedi, M.; Manzini, S. Arzneim.-Forsch./Drug Res. 1990, 40, 1089.
(30) Agostini, O.; Bonacchi, G.; Dapporto, P.; Paoli, P.; Pogliani, L.; Toja, E. J. Chem. Soc., Perkin Trans. 2 1994, 5, 1061.
(31) Xu, L.; Wang, H. W.; Su, Q. Comput. Chem. 1992, 16, 187.
(32) Xu, L.; Wang, H. W.; Su, Q. Comput. Chem. 1992, 16, 195.
(33) Randić, M. J. Chem. Inf. Comput. Sci. 1991, 31, 311.
(34) Randić, M. J. Mol. Struct. (THEOCHEM) 1991, 233, 45.
(35) Randić, M. New J. Chem. 1991, 15, 517.
(36) Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic: New York, 1976.
(37) Mihalić, Z.; Nikolić, S.; Trinajstić, N. J. Chem. Inf. Comput. Sci. 1992, 32, 28.
(38) Atkins, P. W. Physical Chemistry; Oxford: Oxford, 1992.
(39) Andersen, H. C. Annu. Rev. Phys. Chem. 1975, 26, 145.
(40) CRC Handbook of Chemistry and Physics, 72nd ed.; Boca Raton, FL, 1992.

JP942153C