J. Phys. Chem. 1996, 100, 18065-18077
Modeling with Special Descriptors Derived from a Medium-Sized Set of Connectivity Indices

Lionello Pogliani
Dipartimento di Chimica, Università della Calabria, 87030 Rende (CS), Italy

Received: May 17, 1996; In Final Form: September 6, 1996
The descriptive power and utility of linear combinations of specially constructed connectivity indices (LCXCI), derived by a trial-and-error procedure from a medium-sized set of eight connectivity indices or from a subset of it, have been tested on several properties of different classes of bioorganic and inorganic compounds. Two techniques have been tested for choosing the appropriate combination of indices: forward selection and the complete combinatorial technique. The latter searches the entire combinatorial space, whereas the former searches only a subspace of it; forward selection nevertheless has many advantages, among them that it is a good tool for an elementary and direct test of newly defined indices. The side-chain volume V of 18 amino acids (AA) is modeled very well by a composite index together with a connectivity index. The pI of 21 amino acids is satisfactorily modeled by special 0χv-fractional indices, which are also rather good descriptors of the melting points Tm of 20 amino acids. The solubility S of 16 and 20 amino acids is well described by reciprocal and suprareciprocal connectivity indices, respectively, a description that seems to have nothing in common with the modeling of the same property for 23 purines and pyrimidines (PP), which is achieved by squared supraconnectivity indices. Nevertheless, the solubility of the entire heterogeneous class of n = 43 amino acids, purines, and pyrimidines could be satisfactorily modeled with a set of supracomposite indices based mainly on the χtv index. The modeling of the motor octane number MON of 30 alkanes and of the melting points of 17 and 14 alkanes shows how far a minimal set of four connectivity indices can replace a larger set of 17 indices, while the modeling of the lattice enthalpies ∆HL° of 20 metal halides by a mixed set of normal and composite indices raises the problem of defining a connectivity model for inorganic compounds. The utility of the given LCXCI is generally rather high, as many properties can be satisfactorily modeled by one or just two indices (V, pI, S(AA), S(PP), and ∆HL°), and it can be enhanced, especially when the modeling requires more than three or four indices, by introducing the corresponding orthogonal indices.
Abstract published in Advance ACS Abstracts, October 15, 1996.
© 1996 American Chemical Society

Introduction

Molecular connectivity indices are certainly the most successful structure-related indices in structure-property and structure-activity studies.1-10 They are graph-theoretical parameters that originate from the chemical graphs or pseudographs of the corresponding molecule. Graphs are sets of vertices and of edges that connect the vertices, while pseudographs, which allow multiple edges and loops (edges from a vertex to itself), mimic multiple bonds and p and n electrons. Multiple regression analysis has been claimed to be an inadequate statistical approach for deriving significant predictive models11,12 (see the citation in Randić’s paper, ref 11), a claim that, if true, would paralyze the molecular connectivity paradigm; nevertheless, the statistical approach based on connectivity indices has continued to collect interesting results in the field of structure-property and structure-activity studies. Now, if a paradigm is a model or pattern based on a set of rules that defines boundaries and specifies how to be successful within these boundaries, and if its success is measured by the problems solved using these rules, then the positive results achieved by molecular connectivity (MC) in the field of structure-property studies are rather encouraging. Recently, a wide set of properties of different organic, biochemical, and even inorganic compounds has been modeled with the aid of linear combinations of connectivity indices13-27 and of orthogonal connectivity indices (LCCI and LCOCI, respectively). A connectivity model has recently been proposed to solve the old problem of coding the cis-trans isomerism in graphs of unsaturated olefins,28 ordering-dependent orthogonal indices have recently been defined,29 and a procedure for inverse imaging, which reverts from the connectivity equation back to the molecule,30,31 has been developed. Furthermore, linear combinations of reciprocal, squared, and supraconnectivity indices have recently been introduced and successfully used.23,32,33

The aim of the present paper is to develop and check special functions f(χ) = X of connectivity indices, to be used in linear combinations for the prediction of physicochemical properties of different groups of compounds whose properties cannot be described well by the normal connectivity indices χ. Throughout the present study two purposes will be kept in mind. First, the construction of f(χ) should be kept at a “pedestrian” level, in the sense defined by Trinajstić et al.; that is, calculations should remain at a level that can be carried out by hand and/or with a pocket calculator.34 Thus, the {χ} set of basic connectivity indices used in the elaboration of X should be kept rather restricted, which also prevents the combinatorial space of the different linear combinations from becoming overwhelmingly complex. Second, particular attention should be devoted to the curve-fitting paradox,11 that is, to the contradiction between the usefulness of regression equations for (coefficient) interpretation and their validity for predictive purposes, i.e., the “the more predictive the less useful” paradox. This can be addressed with predictive regressions that use a restricted number of indices or with regressions based on orthogonal indices11 that show
coefficient stability. LCCI or LCXCI equations become less and less reliable in their coefficients as more and more indices are used, and sometimes a good predictive LCCI has standard errors of its coefficients larger than the coefficients themselves. This could be phrased in the following way: give me four indices and I can fit a rhinoceros, give me five and I can include his tail. Even though a larger number of indices is necessary to draw a good-looking rhinoceros inclusive of his tail, a one-parameter LCCI is clearly superior to a multiparameter LCCI, regardless of its formal simplicity; yet complex molecular systems, which can give rise to a large number of χ indices per molecule, sometimes need a good many indices for their description. Clearly, the problem is where to draw the line, that is, when the number of indices should be judged excessive and when orthogonal indices should be used to derive more reliable equations.

The present study covers the modeling of four properties P of n = 18-21 natural amino acids, the solubility S of 23 PP bases and of 43 {PP + AA}, the motor octane number MON and melting points Tm of 30 and {17 + 14} alkanes, respectively, and the lattice enthalpies of 20 metal halides. The side-chain molecular volume V of 18 amino acids can be modeled fairly well with normal LCCI and provides a good comparative example of the predictive power of normal χ and special f(χ) indices. All the other amino acid properties, namely the solubility S (n = 20), the pH at the isoelectric point pI (n = 21), and the melting points Tm (or MP) of 20 amino acids, can be satisfactorily modeled only with the aid of LCXCI, which also achieve a very satisfactory modeling of the solubility S of the entire set of n = 43 amino acids plus purines and pyrimidines. The set of inorganic compounds will be modeled with the aid of the newly defined DZ and X indices.
The goal of this study is thus mainly heuristic (a heuristic being a procedure that provides aid and direction in the solution of a problem): to present material that provides a basis for the construction and use of linear combinations of special connectivity descriptors X = f(χ) that show meaningful predictive power and utility. Another challenge in LCXCI or LCCI studies is to keep the number of combinations of the different indices “pedestrian” once a given set of indices is chosen; the validity of a forward selection technique, which examines only a portion of the full combinatorial space, will therefore also be checked. Furthermore, this technique is well suited to a rapid check of X indices derived by a trial-and-error procedure.

Method

The chosen medium-sized {χ} set of molecular and valence (denoted by v) molecular connectivity indices for the numerical encoding of the different properties of the different classes of compounds is given by the following m = 8 indices
{χ} = {D, Dv, 0χ, 0χv, 1χ, 1χv, χt, χtv}

These indices, which form the basis functions of our molecular connectivity modeling, can be computed as follows. The sum-delta molecular connectivity index D is given by35,36

$$D = \sum_i \delta_i \qquad (1)$$

The zeroth- and first-order molecular connectivity indices are given by4

$${}^{0}\chi = \sum_i (\delta_i)^{-0.5} \qquad (2)$$

$${}^{1}\chi = \sum (\delta_i \delta_j)^{-0.5} \qquad (3)$$

The total structure molecular connectivity index χt is given by37

$$\chi_t = (\delta_1 \delta_2 \cdots \delta_N)^{-0.5} \qquad (4)$$
where δi is the delta cardinal number, which represents the count of non-hydrogen σ-bond electrons contributed by atom i.4 The sums in eqs 1 and 2 are taken over all N vertices and that in eq 3 over all edges of the molecular graph (vertices and edges corresponding to non-hydrogen atoms and σ bonds, respectively). Replacing δ with the valence δv (which represents the count of all non-hydrogen electrons contributed by atom i4) in eqs 1-4 yields the corresponding four valence molecular connectivity indices χv. The δv(S) values in the amino acids Cys and Met (0.56 and 0.67, respectively) have been taken from ref 4. Note that the total and valence total structure connectivity indices are used here for the first time in the modeling of amino acids, as are χ values with five figures (the preceding LCCI-MC calculations on amino acids were done with three figures only). The modeled properties can be calculated with the aid of the following dot product modulus
$$P = |\mathbf{C} \cdot \boldsymbol{\chi}| \qquad (5)$$
where P is the calculated property of a compound, the row vector C is the vector of the coefficients ck determined by the least-squares procedure, and the column vector χ is the vector of the connectivity descriptors χk selected with a selection technique. The descriptor of the constant term cm+1 is the unitary index χ0 ≡ 1. If χ is an m·n matrix (where n is the number of compounds), then P is the property vector of the entire class of compounds. The bars in eq 5 stand for the absolute value, which removes negative P values with no physical meaning and simultaneously enhances the description of the property.24,25,32 In this study, linear combinations of the orthogonal molecular connectivity indices iΩ (LCOCI), recently defined by Randić,13-16 will also be used to circumvent the loss of validity of the corresponding LCCI equations, as measured by the standard errors sk of the coefficients ck. For every index of an LCCI or LCXCI equation the fractional utility (the inverse of the fractional error) uk = |ck/sk| will be given, as well as the average fractional utility 〈u〉 = ∑uk/(m + 1). Orthogonal indices, which short-circuit the collinearity problem due to the mutual interrelation among χ indices, have further advantages, such as the stability of the coefficients upon introduction of a new Ω index and the generation of dominant descriptors when some χ indices (especially the single one) are poor descriptors. Their only drawback is that they can be derived only by further calculations, which become less and less “pedestrian”, notably if the number of indices to be orthogonalized is large and the number of compounds to be modeled is not held constant. Two procedures for index selection will be used throughout this paper: the forward selection and the complete combinatorial procedures.
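As a rough sketch of how differently the two search spaces grow, the number of regressions each procedure has to examine for a set of m indices can be counted directly. The function names below are illustrative only, and the forward selection count assumes that the procedure is carried through until all m indices have been inserted; under that assumption the counts agree with the entries of Table 1.

```python
from math import comb


def fst_combinations(m: int) -> int:
    """Regressions examined by forward selection over a set of m indices.

    Assumption: selection runs until every index has been inserted, so step k
    tries the m - k + 1 indices not yet in the model: m + (m - 1) + ... + 1.
    """
    return sum(m - k + 1 for k in range(1, m + 1))


def cct_combinations(m: int) -> int:
    """Regressions examined by the complete combinatorial search.

    Every non-empty subset of the m indices is fitted once:
    the sum of C(m, r) for r = 1..m.
    """
    return sum(comb(m, r) for r in range(1, m + 1))


if __name__ == "__main__":
    for m in (4, 8, 10, 30):
        print(m, fst_combinations(m), cct_combinations(m))
```

The forward selection count grows quadratically with m, while the complete search doubles with every added index, which is why the latter stops being practical well before m = 30.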
The forward selection technique (fst) is a sequential procedure based on the notion that indices should be inserted one at a time until a satisfactory Q-LCCI (Q = r/s, where r is the correlation coefficient and s the standard deviation of the estimates21) is obtained; this technique spans a subspace of the complete combinatorial space. The procedure is as follows: (a) choose the χ index that gives the largest Q value, which is the best single-index LCCI (or LCXCI); then (b) choose the next χ index of the {χ} set that, when inserted in the model, gives the largest increase in Q in the presence of the previous index, and we call this the best 2-χ LCCI; and so on, until Q starts to decrease steadily with the introduction of the next χ index of the set. The complete combinatorial technique (cct) is a procedure that searches the entire combinatorial space spanned by the indices of the {χ} set, extracting the best-Q combinations (∑r Cm,r combinations in all, where m is the number of indices, χ0 excluded, and r = 1 to m). Combinations are also sorted by their F value (F = f·r^2/[(1 - r^2)m], where f is the number of degrees of freedom, f = n - (m + 1), n is the number of data, and m is the number of indices4). This last statistic can be a valuable aid in discriminating among different Q-LCCI with rather similar Q values. Note that the Q (quality ratio) and F (Fisher ratio) values have been derived from the originally calculated r and s with five digits; thus, e.g., for n = 18, m = 3, and r = 0.996, F = 580, while for r = 0.9964, F = 645. A significant decrease in F upon introduction of one additional variable (with increasing Q and decreasing s) could mean that the new descriptor is not as good as expected; that is, its introduction has endangered the statistical quality of the combination, which can nevertheless improve again with the subsequent introduction of a more potent descriptor. The difference between the fst and cct procedures is clearly shown in Table 1, which collects the overall number of possible combinations for the two procedures with growing m. From this table we can see how the complete combinatorial space practically “explodes” with growing m (notice, from top to bottom, the growing number of digits of the entries), while the forward selection technique remains quite manageable up to nearly 10 indices. It should be added that 30 connectivity (normal + valence) indices per molecule is not an extraordinary number, keeping in mind that branched C7,8 alkanes38 can have up to 17 nonvalence molecular connectivity indices per molecule.

TABLE 1: Number of Possible Combinations for m Indices with the Forward Selection (fst) and Complete Combinatorial (cct) Techniques

m     no. of fst combinations   no. of cct combinations
2     3                         3
3     6                         7
4     10                        15
5     15                        31
6     21                        63
7     28                        127
8     36                        255
9     45                        511
10    55                        1023
20    210                       1 048 575
25    325                       33 554 431
30    465                       1 073 741 823

We will define as a minimal set a set of indices that gives rise to fewer than 100 cct (or 25 fst) combinations and as a medium-sized set a set that gives rise to fewer than 1000 cct (or 60 fst) combinations.

The modeling of the lattice enthalpies ∆HL° at 298.15 K of 20 metal halides (MeX) is somewhat problematic because the molecular connectivity paradigm has been defined for organic compounds only, for which it is possible to draw a chemical graph in terms of points and edges. To circumvent this dead end we will regard, for the time being, electrostatic bonds as edges, while the gas-phase molecule of these compounds will serve as the basis for the construction of the corresponding chemical graph (e.g., NaCl = •-•). The valence δv values of the given atoms of MeX are derived with the aid of the following relation:4

$$\delta^{v} = Z^{v}/(Z - Z^{v} - 1) \qquad (6)$$

where Zv is the number of valence electrons and Z is the atomic number of the corresponding atom. The following new index, the sum-δZ index, will also be tested:

$$D^{Z} = \sum_i \delta_i^{Z} = \sum_i (Z^{v}_i/n_i) \qquad (7)$$

where Zv has already been defined and n is the principal quantum number. For the given MeX, D, 0χ, and 1χ ≡ χt are meaningless (their values are constant throughout the set of 20 MeX, since the chemical graph of every MeX is one and the same); the set of connectivity indices for MeX therefore reduces (1χv ≡ χtv) to the following minimal basis set, {χ} = {Dv, 0χv, 1χv, DZ}.

Results and Discussion
Table 2 collects the connectivity values of n = 21 α-amino acids calculated with five significant figures (previous works on amino acids had no χt and χtv indices). While each amino acid can be characterized by its own set of connectivity indices, the connectivity values for 23 purines and pyrimidines, collected in Table 3, show three pairs of compounds (marked +, •, and 0), each with an equal set of connectivity values. Table 4 collects the lattice enthalpies of 20 metal halides and their corresponding connectivity values, among which no similarity can be detected. The experimental property values for amino acids and for purines and pyrimidines are collected in Table 5. The experimental values for the given classes of compounds have been taken from the available literature.23,39-41 The motor octane number MON and the melting points Tm of 30 and 31 alkanes, respectively, as well as their connectivity values, are to be found in ref 23.

Before starting our modeling, let us briefly discuss the utility parameters uk and 〈u〉. Randić’s exposition11 of the multiple linear regression paradox considers the modeling of the heats of formation of 18 octane isomers by a set of {χ1 - χ6} connectivity indices. He obtains the optimal descriptions shown in Table 6 (where the connectivity indices, like the different uk values, are collected into vectors). The 3-χi, 4-χi, and 6-χi LCCI show worsening Q, F, s, and 〈u〉 values (even relative to the 5-χi LCCI) and are not considered here. The best Q-LCCI is the one with five χi indices, and the best F-LCCI is the second one, which nevertheless has a rather poor u vector and a not at all impressive 〈u〉 value. The first LCCI has an excellent u vector but a very poor predictive power (see its Q and F values).
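The statistics used throughout this discussion are simple enough to recompute by hand or with a few lines of code. The sketch below restates the definitions given in the Method section (Q = r/s, F = f·r^2/[(1 - r^2)m] with f = n - (m + 1), uk = |ck/sk|, and 〈u〉 as the mean of the uk); the function names are illustrative and not part of the original work.

```python
def quality_ratio(r: float, s: float) -> float:
    """Q = r/s: correlation coefficient over standard deviation of estimates."""
    return r / s


def fisher_ratio(r: float, n: int, m: int) -> float:
    """F = f r^2 / [(1 - r^2) m], with f = n - (m + 1) degrees of freedom."""
    f = n - (m + 1)
    return f * r ** 2 / ((1.0 - r ** 2) * m)


def fractional_utility(c: float, s: float) -> float:
    """u_k = |c_k / s_k|: coefficient over its standard error."""
    return abs(c / s)


def average_utility(u) -> float:
    """<u>: mean of the u_k over all m + 1 terms (constant term included)."""
    return sum(u) / len(u)


if __name__ == "__main__":
    # Sensitivity of F to the digits kept in r (n = 18, m = 3).
    print(round(fisher_ratio(0.996, 18, 3)), round(fisher_ratio(0.9964, 18, 3)))
    # First two rows of Table 6: Q from r and s, <u> from the u vector.
    print(round(quality_ratio(0.68, 1.70), 2), round(average_utility([6.49, 20.8]), 1))
    print(round(quality_ratio(0.93, 0.49), 2), round(average_utility([1.91, 4.00, 1.14]), 2))
```

Because F depends on 1 - r^2, a change in the fourth digit of r shifts F noticeably, which is why the statistics are derived from r and s with five digits.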
To bypass this paradox, Randić proposes to use LCOCI (linear combinations of orthogonal connectivity indices),11 which show not only no interrelation and numerical stability but also an improved u vector. Let us then have a look at the corresponding LCOCI (by definition14-16 the 1-χ,χ0 LCCI equals the 1-Ω,Ω0 LCOCI) shown at the bottom of Table 6. While the predictive power of the LCOCI is unchanged, which is a good check for the Ω indices, 〈u〉 and u are clearly better here, as (normally) the most important Ω indices (that is, those with the highest ck) and Ω0 improve their utility at the expense of the less important ones. Thus c3,4 of the last LCOCI are far from optimal. In any case, the paradox has been practically removed, even if at the expense of further calculations, which for a 5-χi LCCI are not at all trivial. Nevertheless, Randić’s orthogonal Ω indices have a pedestrian aspect in the construction of ck(Ω), as there is no need to calculate the Ω values to derive the corresponding ck(Ω) values; in fact, these can be derived by a stepwise inclusion procedure13-16 from the corresponding best predictive LCCI. Thus, once it is known that an LCOCI has the same predictive power as, but a better utility than, the corresponding LCCI, the passage from ck(χ) to ck(Ω) is practically straightforward. With this in mind, in analyzing the predictive power and utility of our LCCI or LCXCI we will forget, for the time being, everything about
TABLE 2: Molecular Connectivity Indices for 21 Amino Acids (AA)

AA      D    Dv      0χ        0χv      1χ       1χv      χt       χtv
Gly     8    20      4.28446   2.63992  2.27006  1.18953  0.40825  0.03727
Ala     10   22      5.15470   3.51016  2.64273  1.62709  0.33333  0.03043
Cys     12   23.56   5.86181   4.55358  3.18074  2.40290  0.23570  0.02875
Ser     12   28      5.86181   3.66448  3.18074  1.77422  0.23570  0.00962
Val a   14   26      6.73205   5.08751  3.55342  2.53777  0.19245  0.01757
Thr     14   30      6.73205   4.53473  3.55342  2.21862  0.19245  0.00786
Met     16   26.67   7.27602   6.14607  4.18074  4.04355  0.11785  0.01859
Pro     16   28      5.98313   4.55413  3.80453  2.76688  0.08333  0.00932
Leu     16   28      7.43916   5.79462  4.03658  3.02094  0.13608  0.01242
Ile     16   28      7.43916   5.79462  4.09142  3.07578  0.13608  0.01242
Asn     16   36      7.43916   4.70278  4.03658  2.30434  0.13608  0.00254
Asp     16   38      7.43916   4.57273  4.03658  2.23927  0.13608  0.00196
Lys     18   32      7.98313   5.91594  4.68074  3.36624  0.08333  0.00439
Hyp     18   34      6.85337   4.87159  4.19838  2.84158  0.06804  0.00340
Gln     18   38      8.14627   5.40997  4.53658  2.80434  0.09623  0.00179
Glu     18   40      8.14627   5.27984  4.53658  2.73927  0.09623  0.00139
His     22   42      8.26758   5.81918  5.19838  3.15529  0.03402  0.00080
Arg     22   42      9.56048   6.70883  5.53658  3.60022  0.04811  0.00078
Phe     24   42      8.97469   6.60402  5.69838  3.72222  0.02406  0.00069
Tyr     26   48      9.84493   6.97388  6.09222  3.85651  0.01964  0.00027
Trp     32   54      10.83650  8.10402  7.18154  4.71624  0.00567  0.00009

a In ref 20 the D and Dv values of Val are incorrectly quoted (16 and 28).
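As a check on how the entries of Table 2 arise from eqs 1-4, the glycine row can be recomputed from its hydrogen-suppressed graph. The sketch below is illustrative: the δ and δv assignments are assumptions based on the usual counting (σ-bonded non-hydrogen neighbours for δ, valence electrons minus attached hydrogens for δv), and the function name is not from the original work.

```python
from math import prod

# Hydrogen-suppressed graph of glycine, H2N-CH2-C(=O)-OH.
# Vertex order: N, C(alpha), C(carboxyl), O(carbonyl), O(hydroxyl).
EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]  # sigma bonds only; C=O counts once
DELTA = [1, 2, 3, 1, 1]    # assumed simple deltas
DELTA_V = [3, 2, 4, 6, 5]  # assumed valence deltas


def connectivity_indices(delta, edges):
    """Return (D, 0chi, 1chi, chi_t) following eqs 1-4."""
    d_sum = sum(delta)                                           # eq 1
    chi0 = sum(d ** -0.5 for d in delta)                         # eq 2
    chi1 = sum((delta[i] * delta[j]) ** -0.5 for i, j in edges)  # eq 3
    chi_t = prod(delta) ** -0.5                                  # eq 4
    return d_sum, chi0, chi1, chi_t


simple = connectivity_indices(DELTA, EDGES)
valence = connectivity_indices(DELTA_V, EDGES)

if __name__ == "__main__":
    print([round(v, 5) for v in simple])
    print([round(v, 5) for v in valence])
```

With these assumed δ values the eight numbers agree with the Gly row of Table 2 to the five figures quoted there.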
TABLE 3: Calculated χ Values for 23 Purine and Pyrimidine Bases a (PP)

PP        D    Dv   0χ        0χv       1χ       1χv      χt        χtv
7I8MTp    38   62   13.61036  11.38981  8.34111  5.97071  0.003564  8.51 × 10^-5
7B8MTp    38   62   13.44723  11.22667  8.48527  6.11486  0.003086  7.37 × 10^-5
7ITp      36   60   12.74012  10.46716  7.93043  5.53989  0.004365  9.82 × 10^-5
7BTp      36   60   12.57699  10.30402  8.07459  5.68405  0.00378   8.51 × 10^-5
1BTb      36   60   12.57699  10.30402  8.07459  5.68405  0.00378   8.51 × 10^-5
7PTp(+)   34   58   11.86988  9.59691   7.57459  5.18405  0.005346  0.00012
1PTb(+)   34   58   11.86988  9.59692   7.57459  5.18405  0.005346  0.00012
7ETp(•)   32   56   11.16277  8.88981   7.07459  4.68405  0.00756   0.00017
1ETb(•)   32   56   11.16277  8.88981   7.07459  4.68405  0.00756   0.00017
Cf        30   54   10.45567  8.1827    6.53658  4.10793  0.01069   0.00024
Tp        28   52   9.58542   7.23549   6.1259   3.71758  0.013095  0.000269
Tb        28   52   9.58542   7.23549   6.10906  3.7135   0.013095  0.000269
UA        26   54   8.71518   5.72474   5.6647   3.11237  0.01604   0.00013
OA        22   50   8.43072   5.24931   5.09222  2.66333  0.03928   0.00027
X         24   48   7.84493   5.34106   5.27086  2.92873  0.01964   0.00034
IsoG(0)   24   46   7.84493   5.45738   5.27086  2.96049  0.01964   0.00043
G(0)      24   46   7.84493   5.45738   5.27086  2.96049  0.01964   0.00043
HypoX     22   42   6.97469   4.95738   4.87701  2.74509  0.02406   0.00085
A         22   40   6.97469   5.07369   4.87701  2.77277  0.02406   0.00108
T         18   36   6.85337   4.89385   4.19838  2.4856   0.06804   0.00301
5MC       18   34   6.85337   5.01016   4.19838  2.51736  0.06804   0.0038
U         16   34   5.98313   3.9712    3.78769  2.06893  0.08333   0.00347
C         16   32   5.98313   4.08751   3.78769  2.1007   0.08333   0.00439

a A = adenine, G = guanine, U = uracil, T = thymine, C = cytosine, OA = orotic acid, UA = uric acid, X = xanthine, M = methyl, P = propyl, B = butyl, I = isobutyl, Cf = caffeine = 137MMMX = 7MTp, Tb = theobromine = 37MMX, Tp = theophylline = 13MMX. Compounds marked +, •, and 0 have similar {χ} values.
LCOCI (in the sense that for the next ten seconds it is absolutely forbidden to think of a rhinoceros). As a supplement to this paragraph about orthogonal indices, let us add that the total connectivity indices χt and χtv, introduced some years ago,37 are poorly correlated with the other χ indices of the set, with interesting consequences, as we shall see.

The modeling of the side-chain molecular volume V of n = 18 amino acids (no Met, Cys, and Hyp) with the given {χ} set is very satisfactory. For this property the average value of the interrelation matrix is 〈RIM(V:{χ})〉 = 0.883, and the strongest and the weakest interrelations are Rs(D,1χ) = 0.995 and Rw(1χv,χtv) = 0.688, respectively. The fs (forward selection) and cc (complete combinatorial) techniques choose the successive combinations for V given at the top of Table 7. LCCI with more than three χ indices show poorer and poorer Q/F values. The cc technique chooses the same 1-χ and 2-χ index LCCI as the fs technique but a 3-χ-index LCCI with somewhat better Q/F values. Note that while Q and F improve for this last cc LCCI, 〈u〉 worsens considerably relative to the last fs LCCI. This can surely be ascribed to a stronger interrelation of the χ indices, which for the 3-χ cc LCCI is given by

R(D, 0χv) = 0.942, R(D, 1χv) = 0.949, R(0χv, 1χv) = 0.987

while for the 3-χ fs LCCI it is given by

R(Dv, 0χv) = 0.826, R(Dv, χtv) = 0.805, R(0χv, χtv) = 0.703

We are here at the core of Randić’s paradox. While the 3-χ fs LCCI shows a somewhat worse predictive power, it nonetheless has a much better utility than the 3-χ cc LCCI, owing to the lower interrelation of its indices. If we adopt as the best
TABLE 4: Lattice Enthalpies ∆HL° at 298.15 K (kJ mol-1) of 20 Metal Halides (MeX) and Their Corresponding Molecular Connectivity Values

MeX    ∆HL°   Dv       0χv      1χv      DZ
LiF    1037   8        1.37796  0.37796  4
NaF    926    7.11111  3.37796  1.13389  3.83333
KF     821    7.05882  4.50107  1.55839  3.75000
RbF    789    7.02857  6.29404  2.23607  3.70000
CsF    750    7.01887  7.65807  2.75162  3.66667
LiCl   852    1.77778  2.13389  1.13389  2.83333
NaCl   786    0.88889  4.13389  3.40168  2.66667
KCl    717    0.83660  5.25700  4.67516  2.58333
RbCl   695    0.80635  7.04997  6.70820  2.53333
CsCl   678    0.79665  8.41400  8.25487  2.50000
LiBr   815    1.25926  2.96396  1.96396  2.25000
NaBr   752    0.37037  4.96396  5.89188  2.41667
KBr    689    0.31808  6.08707  7.09762  2.00000
RbBr   668    0.28783  7.88004  11.6190  1.95000
CsBr   654    0.27813  9.24407  14.2979  1.91667
LiI    761    1.15556  3.53546  2.53546  1.90000
NaI    705    0.26667  5.53546  7.60639  1.73333
KI     649    0.21438  6.65857  10.4540  1.65000
RbI    632    0.18413  8.45154  15.0000  1.60000
CsI    620    0.17442  9.81557  18.4585  1.56667
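The metal halide descriptors of Table 4 follow from eqs 6 and 7 applied to the two-vertex graph •-•. The sketch below recomputes a few rows; the atomic data dictionary, and in particular the principal quantum numbers n, are standard periodic-table values supplied here as an assumption, and the function names are illustrative.

```python
# (Z, Zv, n): atomic number, number of valence electrons, and principal
# quantum number of the valence shell (assumed standard values).
ATOMS = {
    "Li": (3, 1, 2), "Na": (11, 1, 3), "Cs": (55, 1, 6),
    "F": (9, 7, 2), "Cl": (17, 7, 3), "I": (53, 7, 5),
}


def delta_v(symbol: str) -> float:
    """Valence delta of eq 6: Zv / (Z - Zv - 1)."""
    z, zv, _ = ATOMS[symbol]
    return zv / (z - zv - 1)


def delta_z(symbol: str) -> float:
    """Single-atom term of eq 7: Zv / n."""
    _, zv, n = ATOMS[symbol]
    return zv / n


def mex_indices(metal: str, halogen: str) -> dict:
    """Dv, 0chi-v, 1chi-v, and DZ for the one-edge graph of a metal halide."""
    dm, dx = delta_v(metal), delta_v(halogen)
    return {
        "Dv": dm + dx,
        "chi0v": dm ** -0.5 + dx ** -0.5,
        "chi1v": (dm * dx) ** -0.5,
        "DZ": delta_z(metal) + delta_z(halogen),
    }


if __name__ == "__main__":
    for pair in (("Li", "F"), ("Na", "Cl"), ("Cs", "I")):
        print(pair, {k: round(v, 5) for k, v in mex_indices(*pair).items()})
```

For a single edge the first-order and total valence indices coincide, which is the 1χv ≡ χtv identity noted in the Method section.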
modeling equation the 2-χ-index LCCI, which has an optimal F value and a very good 〈u〉 value, we need not worry about paradoxical problems. In Figure 1 the V volumes calculated with the 2-χ-index LCCI are plotted vs their corresponding experimental values. Its χ, C, and u vectors are

χ = (Dv, 0χv, χ0), C = (-0.58873, 22.0991, -10.0753), u = (5.1, 27.8, 4.0)

Anyway, let us check whether special composite X connectivity indices score better. The following composite index, 3.5X = (Dv)^3.5/0χv, was found by a trial-and-error procedure; it is a rather poor single index (Q/F = 0.031/11), but together with 0χv, with which it has R(3.5X, 0χv) = 0.725, it shows the following fine modeling and utility

{0χv, 3.5X}: Q = 0.424, F = 989, r = 0.996, s = 2.35, 〈u〉 = 15.7, u = (34.4, 5.3, 7.4)

The 〈u〉 and Q values are good but not impressive; the imposing F value, however, tells us that the introduction of the second index has considerably enhanced the quality of the description. Thus, the final best molecular connectivity equation for the amino acid side-chain volumes can be written in the following way

$$V = \frac{1}{{}^{0}\chi^{v}}\left[c_1({}^{0}\chi^{v})^{2} + c_2(D^{v})^{3.5}\right] + c_3\chi_0 \qquad (8)$$

with C = (21.2390, -0.00012, -19.2127). As a connectivity index cannot be zero (there would be no molecule and clearly no volume), the equation has both physical and mathematical meaning.

Let us now model the pH at the isoelectric point, pI, of n = 21 amino acids. The modeling of this property for n = 19 amino acids has already been attempted with the aid of a rather awkward construction of connectivity indices based mainly on the side-chain functional groups.20,21,36 In this study we will develop a more logical construction of composite indices apt to describe this property. Normal connectivity indices are very poor
TABLE 5: Experimental Side-Chain Molecular Volume (V) (in Å3), Solubility, S (at 25 °C, in g per kg of Water), pH at the Isoelectric Point (pI), Melting Points (Tm) for 21 Amino Acids (AA) and Experimental Solubility, S (at the Indicated T (°C), in g per 100 mL of Water) for 23 Purines and Pyrimidines (PP)

AA    V      S     pI     Tm/°C
Gly   36.3   251   5.97   290
Ala   52.6   167   6      297
Cys   -      -     5.07   178
Ser   54.9   422   5.68   228
Val   85.1   58    5.96   292-295 a
Thr   71.2   97    5.60   253
Met   -      56    5.74   283
Pro   73.6   1622  6.30   222
Leu   102    23    5.98   337
Ile   102    34    6.02   284
Asn   72.4   25    5.41   236
Asp   68.4   5     2.77   270
Lys   105.1  6     9.74   224-225 a
Hyp   -      361   5.8    -
Gln   92.7   42    5.65   185
Glu   84.7   8.6   3.22   249
His   91.1   43    7.59   277
Arg   109.1  181   10.76  238
Phe   113.9  29    5.48   284
Tyr   116.2  0.5   5.66   344
Trp   135.4  12    5.89   282

PP       S (T (°C))
7I8MTp   0.63 (20)
7B8MTp   0.45 (20)
7ITp     2.7 (20)
7BTp     0.37 (30)
1BTb     0.56 (30)
7PTp     23.11 (30)
1PTb     1.38 (30)
7ETp     3.66 (30)
1ETb     3.98 (30)
Cf       2.58 (30)
Tp       0.81 (30)
Tb       0.054 (30)
UA       0.002 (20)
OA       0.18 (18)
X        0.05 (20)
IsoG     0.006 (25)
G        0.004 (40)
HypoX    0.07 (19)
A        0.09 (25)
T        0.40 (25)
5MC      0.45 (25)
U        0.36 (25)
C        0.77 (25)

a An average value was used.
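The composite description of the side-chain volume can be evaluated directly for individual amino acids. The sketch below assumes that eq 8 reads V = c1·0χv + c2·(Dv)^3.5/0χv + c3·χ0, which is the form consistent with the index pair {0χv, 3.5X} and with 3.5X = (Dv)^3.5/0χv; the index values are the Gly and Trp entries of Table 2, the experimental volumes are taken from Table 5, and the function names are illustrative. Since c2 is quoted with only two significant figures, the results are approximate.

```python
C_EQ8 = (21.2390, -0.00012, -19.2127)  # (c1, c2, c3) quoted with eq 8


def composite_index(dv: float, chi0v: float, power: float = 3.5) -> float:
    """Composite index 3.5X = (Dv)^3.5 / 0chi-v."""
    return dv ** power / chi0v


def volume_eq8(dv: float, chi0v: float, c=C_EQ8) -> float:
    """Side-chain volume from the assumed form of eq 8 (chi_0 = 1)."""
    c1, c2, c3 = c
    return c1 * chi0v + c2 * composite_index(dv, chi0v) + c3


# name: (Dv, 0chi-v, experimental V in cubic angstroms)
SAMPLES = {"Gly": (20, 2.63992, 36.3), "Trp": (54, 8.10402, 135.4)}

calculated = {aa: volume_eq8(dv, chi0v) for aa, (dv, chi0v, _) in SAMPLES.items()}

if __name__ == "__main__":
    for aa, (_, _, v_exp) in SAMPLES.items():
        print(aa, round(calculated[aa], 1), v_exp)
```

Both calculated volumes fall within about one standard deviation of estimate (s = 2.35) of the experimental values.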
descriptors of this property, as can be seen from the following single and multi-fs (forward selection) LCCI

{1χv}: Q = 0.167, F = 1.67, r = 0.284, s = 1.70
{D, Dv, 0χ, 0χv, 1χ, 1χv}: Q = 0.619, F = 3.81, r = 0.787, s = 1.27

The average interrelation matrix value together with the Rs and Rw values

〈RIM(pI:{χ})〉 = 0.853, Rs(D,1χ) = 0.994, Rw(1χv, χtv) = 0.574

verify again that the total connectivity indices are poorly interrelated with the other indices. We now define the following fractional connectivity index XF

$$X_F = \left(1 + \frac{\Delta n}{n_T}\right)\frac{\chi}{{}^{0}\chi^{v}} \qquad (9)$$

where χ is an index of the given set (for χ = Dv → XF = DXvF, and so on), ∆n = nA - nB, where nA is the number of side-chain acid groups (COOH groups: nB = 0 and nA = 1 for Asp and Glu) and nB the number of side-chain basic groups (NH2, NH, and guanidinium groups: nA = 0 and nB = 1 for Lys and His, and nB = 2 for Arg), and nT = 3 is the total number of functional groups (main plus side-chain functional groups). This choice seems very sensible, as pI values are strongly dependent on the type of side-chain functional groups. The normalization by 0χv follows from a trial-and-error procedure. These indices show the following interrelation values

〈RIM(pI:{XF})〉 = 0.560, Rw(DXF, XtF) = 0.004, Rs(DXF, 1XF) = 0.975

They are, thus, much less correlated than their corresponding
TABLE 6: Normal (χ) and Orthogonal (Ω) Connectivity Vectors together with Their Corresponding Utility Vector u, and Statistical Parameters 〈u〉, Q, F, r, and s for the Modeling of Heats of Formation of 18 Octane Isomers

χ and Ω vectors            u vectors                            〈u〉    Q     F     r     s
χ1, χ0                     6.49, 20.8                           13.6   0.40  13.7  0.68  1.70
χ1, χ2, χ0                 1.91, 4.00, 1.14                     2.35   1.90  48.2  0.93  0.49
χ1, χ2, χ3, χ4, χ5, χ0     0.33, 1.19, 1.14, 1.85, 3.16, 16.7   4.07   2.47  32.5  0.97  0.39
1Ω, 2Ω, Ω0                 8.92, 3.99, 27.7                     13.5   1.90  48.2  0.93  0.49
1Ω, 2Ω, 3Ω, 4Ω, 5Ω, Ω0     11.3, 4.99, 0.55, 1.06, 3.16, 36     9.51   2.47  32.5  0.97  0.39
TABLE 7: Best Forward Selection (fs) and Complete Combinatorial (cc) Connectivity Index Combinations {χ} for the Modeling of the Side-Chain Molecular Volume V of 18 Amino Acids, and Best fs and cc Fractional Connectivity Index Combinations {XF} for the Modeling of the pH at the Isoelectric Point pI of 21 Amino Acids

V: {χ} and pI: {XF}                 〈u〉    Q      F     r      s
V: {χ}
fs and cc: {0χv}                    14.8   0.25   691   0.989  3.95
fs and cc: {Dv, 0χv}                12.3   0.40   887   0.996  2.48
fs: {Dv, 0χv, χtv}                  8.9    0.41   619   0.996  2.43
cc: {D, 0χv, 1χv}                   4.84   0.433  688   0.997  2.30
pI: {XF}
fs and cc: {0Xv}F                   22.4   2.12   267   0.966  0.46
fs and cc: {DXv, 0Xv}F              12.1   2.14   136   0.969  0.45
fs and cc: {DXv, 0X, 0Xv, 1X}F      7.93   2.53   95.1  0.980  0.39

Figure 2. Plot of the calculated vs the experimental pH at the isoelectric point of 21 amino acids.
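The single-index description of pI in Table 7 can be retraced with a plain least-squares fit. The sketch below assumes that the fractional index of eq 9 reads XF = (1 + ∆n/nT)·χ/0χv, so that for χ = 0χv the index reduces to 1 + ∆n/nT; the pI values are read from Table 5, the ∆n assignments follow the text (+1 for Asp and Glu, -1 for Lys and His, -2 for Arg), and the function names are illustrative.

```python
from math import sqrt

PI = {
    "Gly": 5.97, "Ala": 6.00, "Cys": 5.07, "Ser": 5.68, "Val": 5.96,
    "Thr": 5.60, "Met": 5.74, "Pro": 6.30, "Leu": 5.98, "Ile": 6.02,
    "Asn": 5.41, "Asp": 2.77, "Lys": 9.74, "Hyp": 5.80, "Gln": 5.65,
    "Glu": 3.22, "His": 7.59, "Arg": 10.76, "Phe": 5.48, "Tyr": 5.66,
    "Trp": 5.89,
}
DELTA_N = {"Asp": 1, "Glu": 1, "Lys": -1, "His": -1, "Arg": -2}  # nA - nB
N_T = 3


def fractional_index(chi, chi0v, delta_n, n_t=N_T):
    """Assumed form of eq 9: (1 + delta_n / n_T) * chi / 0chi-v."""
    return (1 + delta_n / n_t) * chi / chi0v


def single_index_fit(x, y):
    """Least-squares line y = a*x + b; returns |r|, s, Q = |r|/s, and F (m = 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = abs(sxy) / sqrt(sxx * syy)
    s = sqrt((syy - sxy ** 2 / sxx) / (n - 2))  # standard deviation of estimates
    f_ratio = (n - 2) * r ** 2 / (1 - r ** 2)
    return {"r": r, "s": s, "Q": r / s, "F": f_ratio}


# With chi = 0chi-v the ratio chi/0chi-v is 1, so 1.0 stands in for both.
x = [fractional_index(1.0, 1.0, DELTA_N.get(aa, 0)) for aa in PI]
stats = single_index_fit(x, list(PI.values()))

if __name__ == "__main__":
    print({k: round(v, 3) for k, v in stats.items()})
```

Under these assumptions the fit gives statistics close to the {0Xv}F row of Table 7, which suggests that most of the descriptive power of this index comes from the (1 + ∆n/nT) factor.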
Figure 1. Plot of the calculated vs the experimental side-chain volumes of 18 amino acids.
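Individual points of Figure 1 can be regenerated from eq 5 with the coefficient vector C = (-0.58873, 22.0991, -10.0753) quoted for the 2-χ-index LCCI. The sketch below does this for three amino acids, taking Dv and 0χv from Table 2 and the experimental volumes from Table 5; the function name and the choice of amino acids are illustrative.

```python
def lcci(coefficients, descriptors) -> float:
    """Eq 5: modulus of the dot product of coefficient and descriptor vectors."""
    return abs(sum(c * x for c, x in zip(coefficients, descriptors)))


C = (-0.58873, 22.0991, -10.0753)  # coefficients for (Dv, 0chi-v, chi_0)

# name: (Dv, 0chi-v, experimental V in cubic angstroms)
SAMPLES = {
    "Gly": (20, 2.63992, 36.3),
    "Ala": (22, 3.51016, 52.6),
    "Trp": (54, 8.10402, 135.4),
}

# chi_0 = 1 is the unitary index that carries the constant term.
predicted = {aa: lcci(C, (dv, chi0v, 1.0)) for aa, (dv, chi0v, _) in SAMPLES.items()}

if __name__ == "__main__":
    for aa, (_, _, v_exp) in SAMPLES.items():
        print(aa, round(predicted[aa], 2), v_exp)
```

The deviations from experiment for these three compounds are all smaller than the standard deviation of estimate s = 2.48 reported for this combination in Table 7.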
parent indices, especially the total connectivity fractional indices. The forward selection (fs) and the complete combinatorial (cc) techniques choose the same best sequential LCFCI (where F stands for fractional; to avoid repeating the subscript F with each X index, F is written only once, outside the parentheses), shown in Table 7. The 3-XF-index LCFCI is not shown as it has a worse Q/F value (2.125/85.6) than the 4-XF-index LCFCI. Higher-XF-index LCFCI (up to seven indices) show slowly improving Q scores but rapidly worsening F scores, even if the cc technique gives different and better Q/F-LCFCI than the fs technique. The 8-XF-index LCFCI shows a worse Q/F score than the 4-XF-index LCFCI, which is here chosen to simulate the pI of 21 amino acids even if its decreasing F is deceiving. The very good F value of the 1-XF-LCFCI together with its excellent u = (16.3, 28.4) vector clearly indicates the good quality of this elementary linear combination. The quality of the simulation is shown in Figure 2. Simulating vectors are (χ0 ≡ X0 ≡ 1)
XF = (DXv, 0X, 0Xv, 1X, X0)F, C = (-1.79023, 8.20212, -18.4178, 13.7802, 12.9388), u = (3.07, 2.75, 4.68, 2.83, 26.3)
The simulation of the melting points of organic compounds is a rather formidable task (very probably the most formidable)
in QSPR studies.23,37 Nevertheless, we will here try to simulate the melting points Tm of n = 20 amino acids (no data are available for Hyp) with the newly defined fractional XF indices of eq 9. For this simulation we allow ∆n = 1 for Leu and Tyr, ∆n = -1 for Pro, Ser, Thr, Cys, Asn, Asp, Gln, Glu, Lys, His, and Arg, and ∆n = 0 for the remaining amino acids. The explanation for such a choice is for the moment a posteriori. While the 〈RIM〉, Rs, and Rw values for the normal χ indices for the Tm simulation are practically the same as in the pI case, for the XF indices the only consistent change is shown by Rw(D, Xtv)F = 0.03, which means that, even here, total XF indices are poorly correlated to the other indices. The following are the best LCFCI for this property
fs and cc: {0X}F: 〈u〉 = 6.04, Q = 0.037, F = 50, r = 0.857, s = 23.1
fs: {0X, 1Xv}F: 〈u〉 = 3.19, Q = 0.038, F = 26, r = 0.870, s = 22.7
cc: {0X, Xt}F: 〈u〉 = 4.46, Q = 0.039, F = 27, r = 0.872, s = 22.5
Higher-order XF LCFCI show a worsening statistical score with both fs and cc techniques. Decreasing F is, in this case, a good warning parameter for the worsening quality of the combinations. The difference between the two selecting techniques here is rather small even if the difference in 〈u〉 value is not insignificant. Neither the 1-XF- nor the 2-XF-index LCFCI modeling is optimal, but the fact that the melting points of the whole set of amino acids can be practically modeled with a composite XF index formally similar to the index used in pI modeling should not be underestimated. We will not develop this simulation any further but just report the following simulating vectors (X0F ≡ χ0 ≡ 1)
XF = (0X, X0)F, C = (126.863, 110.952), u = (7.1, 5.0)
Let us now turn our attention to the simulation of the solubility S of n = 20 amino acids (AA, no data available for
TABLE 8: Best Forward Selection (fs) and Complete Combinatorial (cc) Reciprocal Connectivity Index Combinations {R} for the Modeling of the Solubility of 16 Amino Acids, S(AA-16), Best fs and cc Suprareciprocal Connectivity Index Combinations {aR} for the Modeling of the Solubility, S(AA-20), of 20 Amino Acids, and Best fs and cc Supraconnectivity Squared Index Combinations {XS} for the Modeling of the Solubility, S(PP-23), of 23 Purines and Pyrimidines
S(AA-16), S(AA-20), S(PP-23) | 〈u〉 | Q | F | r | s
S(AA-16): {R}
fs and cc: {0R} | 8.72 | 0.038 | 97.4 | 0.935 | 24.7
fs and cc: {0R, Rt} | 8.20 | 0.049 | 81.2 | 0.962 | 19.7
fs and cc: {DR, 0R, Rt} | 3.48 | 0.055 | 68.9 | 0.971 | 17.8
cc: {DRv, 0R, 0Rv, Rt} | 3.90 | 0.058 | 58.1 | 0.977 | 16.7
cc: {DR, DRv, 0R, 0Rv, 1R, 1Rv} | 2.51 | 0.059 | 39.4 | 0.981 | 16.7
S(AA-20): {aR}
fs and cc: {a0Rv} | 29.37 | 0.029 | 2052 | 0.996 | 34.7
fs and cc: {a0Rv, aRt, aRtv} | 14.56 | 0.029 | 698 | 0.996 | 34.4
S(PP-23): {XS}
fs and cc: {1X}S | 22.2 | 1.758 | 1553 | 0.993 | 0.57
fs and cc: {1X, Xt}S | 21.4 | 2.347 | 1385 | 0.996 | 0.43
fs: {DX, 1X, Xt}S | 3.90 | 2.687 | 1211 | 0.9974 | 0.37
cc: {DX, 1X, Xtv}S | 4.03 | 2.698 | 1220 | 0.9974 | 0.37
fs: {DX, 0X, 1X, Xt}S | 3.31 | 2.753 | 953 | 0.9976 | 0.36
cc: {0X, 0Xv, 1Xv, Xtv}S | 4.78 | 2.821 | 1001 | 0.9978 | 0.35
Cys) and of n = 23 purines and pyrimidines (PP) and finally to the simulation of the whole set of n = 43 AA + PP. As already mentioned, the modeling of the solubility of amino acids with a set of six reciprocal X = R = 1/χ indices (no Rt = 1/χt and Rtv = 1/χtv indices) has already been successfully attempted,32,33 but no particular attention was devoted to the total connectivity indices and to the utility of the equations. Reciprocal connectivity R indices can be considered the simplest case of composite indices, together with the squared connectivity XS = (χ)^2 indices used in the simulation of the same property of purines and pyrimidines. The normal indices with
〈RIM(S:{χ})〉 = 0.850, Rw(1χv, χtv) = 0.573, Rs(D, 1χ) = 0.993
are very poor descriptors of S(AA). The first way to simulate this property for AA is to get rid of the strong outliers, Pro, Ser, Arg, and Hyp, and simulate S with reciprocal R indices for n = 16 AA. Reciprocal indices are less interrelated than their corresponding parent indices (the change from n = 20 to n = 16 does not significantly affect the previous χ correlation values); in fact:
〈RIM(S:{R})〉 = 0.748, Rw(1Rv, Rtv) = 0.40, Rs(DR, 1R) = 0.997
The fs and cc best sequential descriptions of S(AA, n = 16) are summarized in Table 8. Total R indices play a significant role here. After the 3-R-index LCRCI the fs technique is not able to find any better Q-LCRCI while the cc technique discovers two more sequential optimal Q-LCRCI (last two in Table 8). The drastically decreasing F after the 2-R-index LCRCI for this modeling tells us that something is wrong with the inclusion of the third and further best index. If we consider a Q/F/〈u〉 criterion for the best LCRCI then the 2-R-index LCRCI is clearly the best one, and in fact, the vectors of this LCRCI will be used in the absolute value mode of eq 5, as some calculated S values are negative, to obtain the modeling of S for the n = 16 amino acids given in Figure 3
Figure 3. Plot of the calculated vs the experimental solubility of 16 amino acids.
R = (0R, Rt, R0), C = (2106.83, 0.41364, -245.623), u = (12.2, 3.01, 9.42)
The absolute value mode slightly improves the modeling: for S = C·R the calculated vs the experimental values have Q/F = 0.0507/175, while for S = |C·R|, Q/F = 0.0511/178. Let us now include in the treatment the four outliers, as they do not represent any form of experimental error. As the concept of outlier has a meaning in the context of a model, knowledge of the facts that give rise to outliers should always be used to improve the model. The unusually high solubility of the four outliers (especially Pro) can better be grasped by supposing a higher solvation that can be modeled with the introduction of supraconnectivity indices, that is, indices multiplied by an association parameter a.23,42 This parallels the method of giving outliers reduced weight on some kind of subjective basis, as this turns out to be equivalent to the subjective assertion that the model is correct but the data need to be adjusted. Using supraconnectivity indices for the four outliers obtained with a(Pro) = 8 and a(Ser, Hyp, Arg) = 2 for the {DR, DRv, 0R, 0Rv, 1R, 1Rv} subset and a(Pro) = 1/8 and a(Ser, Hyp, Arg) = 1/2 for the {Rt, Rtv} subset, and a = 1 for the remaining amino acids, we obtain a set of supra-R indices, with
〈RIM(S:{aR})〉 = 0.661, Rw(a0R, aRtv) = 0.181, Rs(a0Rv, a1R) = 0.997
for n = 20 AA, that show an even lower collinearity than the preceding indices for n = 16 amino acids. Both the cc and fs search techniques find a 1-aR and a 3-aR optimal LCRCI with supraindices where total reciprocal indices play a consistent role (see Table 8, middle). While the 1-aR-index LCRCI shows an exceptional statistical score, the 3-aR-index LCRCI is also rather good, its F and 〈u〉 values being much better than in the preceding simulation of n = 16 amino acids. Noticeable is the fact that other LCRCI with a different number of aR indices show a worsening Q, F, and 〈u〉 score. Vectors used to derive the calculated solubility values shown in Figure 4 for the n = 20 amino acids are (χ0 ≡ R0 ≡ 1)
R = (a0Rv, R0), C = (1010.789, -139.389), u = (45.30, 13.44)
Calculated values have been obtained with the absolute value eq 5, which shows a better statistical score:
S = C·R: Q = 0.0287, F = 2052; S = |C·R|: Q = 0.0292, F = 2127
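The absolute value mode used here can be illustrated with a minimal sketch. It assumes only what the text states, namely that the signed prediction is the dot product C·R and that the absolute value mode takes its modulus. The coefficient vector is the one reported above for R = (a0Rv, R0); the index value `a0Rv_example` is a hypothetical placeholder, not a value from the paper.

```python
def predict(coeffs, indices, absolute=True):
    """Evaluate P = C.X (signed) or P = |C.X| (absolute value mode)."""
    value = sum(c * x for c, x in zip(coeffs, indices))
    return abs(value) if absolute else value

# Coefficients reported in the text for R = (a0Rv, R0), with R0 = 1.
C = (1010.789, -139.389)

# Hypothetical index value chosen so that the signed result is negative.
a0Rv_example = 0.12

signed = predict(C, (a0Rv_example, 1.0), absolute=False)
folded = predict(C, (a0Rv_example, 1.0))
print(signed, folded)
```

For compounds whose signed value is already positive the two modes coincide, so the absolute value mode only changes the few compounds that would otherwise receive a negative calculated solubility.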
Figure 4. Plot of the calculated vs the experimental solubility of 20 amino acids.
Figure 5. Plot of the calculated vs the experimental solubility of 23 purine and pyrimidine bases.
Thus, a decisive and formally simple molecular connectivity equation for the solubility of the 20 amino acids should be
$S = c_1\,\dfrac{a}{{}^{0}\chi^{v}} + c_2\,\chi_0$ (10)
The solubility S of n = 23 purines and pyrimidines can instead be optimally simulated with linear combinations of supraconnectivity squared XS = (aχ)^2 indices (LCSCI, where S stands for squared), a fact that was casually discovered just by squaring the best LCCI with supra-aχ-indices.24 We will construct a squared {XS} set and search, with the aid of the fs and cc combinatorial techniques, for the best LCSCI. Note the fact that for purines and pyrimidines the association a values for 7PTp (a = 4), 1ETb and Cf (a = 2), and 7ITp (a = 1.5) are experimentally grounded23 (for the remaining compounds, a = 1). The passage from supra-aχ-indices to supra squared indices brings a lowering in the mutual correlation of the indices, as can be seen from the following values:
〈RIM(S:{χ})〉 = 0.886, Rw(0χv, χtv) = 0.631, Rs(D, 1χ) = 0.998
〈RIM(S:{XS})〉 = 0.656, Rw(0XvS, XtvS) = 0.199, Rs(DXS, 1XS) = 0.99993
The best sequential fs and cc LCSCI with supraindices (the subscript S is outside the parentheses) are shown in Table 8, bottom. Noticeable is the improvement of the 〈u〉 value with the 4-XS-index cc LCSCI after the dramatic worsening of the same value with the 3-XS-index (fs and cc) LCSCI, while drastic negative changes in F can only be detected with the last two combinations. The overall best LCSCI seems to be the second one with 2-XS indices, which has both a high predictability and utility. The following X and C vectors in the absolute value mode (S = C·X, Q/F = 2.41/2908 and S = |C·X|, Q/F = 2.53/3206) are used to obtain Figure 5, where calculated S values are plotted vs the corresponding experimental values (X0S ≡ χ0 ≡ 1)
XS = (1X, Xt, X0)S, C = (0.02570, 174.060, -0.92195), u = (52.2, 4.15, 7.76)
Thus, a final explicit MC equation for the solubility of purines and pyrimidines could be
$S = c_1\,(a\,{}^{1}\chi)^{2} + c_2\,(a\,\chi_t)^{2} + c_3\,\chi_0$ (11)
The simulation of the solubility (grams per kilogram of water) of the n = 43 set of AA plus PP with the defined supraindices for Pro, Ser, Arg, Hyp, 7PTp, 1ETb, Cf, and 7ITp turns out to be very bad. In fact, with the full set of eight indices we have Q/F/r/s = 0.0016/0.85/0.409/262, while the results with the normal connectivity indices are somewhat better but still unsatisfactory: Q/F/r/s = 0.0027/2.55/0.612/223. Note that normal indices have the following interrelation values
〈RIM(S(AA+PP):{χ})〉 = 0.822, Rw(1χv, χtv) = 0.483, Rs(D, 1χ) = 0.994
that is, interrelation is lower than in the homogeneous classes of AA or PP. The trial-and-error method together with the fs technique discovers the following set of supra composite indices
{cDχtv, cDvχtv, c0χχtv, c0χvχtv, c1χχtv, c1χvχtv, aχt, bχtv}
where the total valence χtv index multiplies every other index with the exception of χt and itself and where c = a·b. While the value of a for the supraindices of AA and PP has already been defined in the preceding paragraphs on S(AA) and S(PP), b = 1 for every amino acid, to prevent the a = k and a = 1/k values of Pro, Ser, Arg, and Hyp from canceling each other, and b = a for purines and pyrimidines. For the sake of clarity the given composite indices will be rewritten in the following form
{X} = {DX, DXv, 0X, 0Xv, 1X, 1Xv, Xt, Xtv}
The interrelation values of the new indices are somewhat lower than the interrelation values of the normal indices; in fact
〈RIM(S(AA+PP):{X})〉 = 0.763, Rw(1Xv, Xt) = 0.289, Rs(DX, 1X) = 0.998
The best 1-X- and 2-X-index LCXCI are common to both fs and cc techniques
{DX}: 〈u〉 = 7.62, Q = 0.008, F = 196, r = 0.909, s = 108.5
{DX, 0Xv}: 〈u〉 = 7.17, Q = 0.016, F = 341, r = 0.972, s = 62.3
If we do not forget that we are modeling 43 molecules belonging to different classes of compounds, both LCXCI can be considered optimal, the only negative point being a rather large s value that can be slightly improved with the following cc sequences
{DX, DXv, Xtv}: 〈u〉 = 6.02, Q = 0.019, F = 326, r = 0.981, s = 52.4
{DX, DXv, Xt, Xtv}: 〈u〉 = 4.59, Q = 0.020, F = 264, r = 0.983, s = 50.5
The best 〈u〉 and F combinations seem to be the second and the third ones. Note the fact that the fs technique, with the sequential inclusion of the next best index, gives rise each time to improved LCXCI (not shown here) that nevertheless do not achieve the same statistical score as the last 4-X-index cc LCXCI. The cc technique does not discover any better LCXCI even if r improves a little bit when more indices are included. Modeling vectors used to obtain Figure 6 are (here every Scalc value is positive)
X = (DX, DXv, Xt, Xtv, X0), C = (-2763.50, 2427.08, -494.410, -16270.3, 18.5792), u = (5.24, 8.02, 1.99, 6.11, 1.60)
The following squared X index gives rise to a remarkable single-index LCXCI description of S(AA+PP) with no other squared index or combinations of squared indices (or mixed indices) showing such a very interesting improvement in Q, F, and 〈u〉 values relative to the defined {X} indices, even if there is a worsening in s and Q values
{(DXv)^2}: 〈u〉 = 10.8, Q = 0.011, F = 365, r = 0.948, s = 83.0
In a preceding work23 the motor octane number MON of 30 alkanes could be satisfactorily modeled only with indices taken from an expanded set of 17 {D, 0χ, 1χ, χt + higher-order} indices. The best 1-χ-index and best overall LCCI were (the 〈u〉 value was not taken into account)
{6χ}: Q = 0.0365, F = 27, r = 0.701, s = 19.2
{D, 0χ, 1χ, χt, 3χ, 5χc, 4χpc, 6χpc}: Q = 0.112, F = 32, r = 0.961, s = 8.50
We will now try to derive from the minimal set {χ} = {D, 0χ, 1χ, χt}, by a trial-and-error procedure, a new set of composite indices that could allow us to forget everything about higher-order indices and at the same time have a fine MON modeling of the n = 30 alkanes. The trial-and-error search ends up with the following set of eight indices, four of which are composite.
{X}1 = {D, 0χ, 1χ, χt, 01X, 00X, 0Xt, 1Xt}1
where 01X = 0χ·1χ, 00X = (0χ)^2, 0Xt = 0χ·χt, and 1Xt = 1χ·χt. These indices are somewhat less correlated than the corresponding {χ} indices, as
〈RIM(MON:{χ})〉 = 0.951, Rw(0χ, χt) = 0.887, Rs(D, 0χ) = 0.99
〈RIM(MON:{X}1)〉 = 0.905, Rw(0Xt, 00X) = 0.66, Rs(D, 0χ) = 0.99
In Table 9 (top) it can be seen that while the best 1-X-index combination rates not worse than the preceding LCCI from the expanded set, the following sequential cc combinations (for the corresponding best fs LCXCI only Q and F values are given in parentheses) show steadily improving Q quality, a nearly
Figure 6. Plot of the calculated vs the experimental solubility of 43 amino acids, purines, and pyrimidines.
constant F quality, and a tangibly worsening 〈u〉 value after the third combination. The 4-X-index LCXCI is already better than the corresponding expanded set 8-χ-index LCCI. The composite {X}1 indices have thus nicely accomplished their task and have rendered the calculation of higher-order indices, and the huge inspection of a combinatorial space made up of 17 indices, practically unnecessary. For its good predictability and nice utility the 4-X-index (plus χ0) LCXCI (third combination in Table 9) with a vector u = (10.7, 5.1, 6.8, 9.7, 11.4) seems the best one, but for the following simulation we choose the vectors of the last combination to obtain the calculated MON values shown in Figure 7
X = (D, 0χ, 1χ, 01X, 00X, 0Xt, 1Xt, χ0)
C = (-149.988, 577.374, -411.730, 65.2829, -33.8663, -371.861, 650.937, -213.172)
u = (2.46, 4.31, 3.32, 4.80, 4.49, 6.51, 3.15, 2.38)
Another important modeling of our preceding work was the modeling of the melting points MP of 56 alkanes23 that was achieved by segmentation of the entire set of compounds. While the modeling of two segments derived from the segmentation could satisfactorily be achieved with a minimum-sized {χ} = {D, 0χ, 1χ, χt} set, the modeling of the other two segments could only be achieved with an expanded set of m = 17 connectivity indices. Thus, for the segment of n = 17 [MMi + MEi + EEi, with i = 3-7] alkanes the best single-, two-, and multi-index LCCI were
{χt}: Q = 0.011, F = 1.95, r = 0.339, s = 32.3
{1χ, χt}: Q = 0.043, F = 16, r = 0.834, s = 19.6
{1χ, χt, 2χ, 3χc, 4χc, 5χc, 4χpc}: Q = 0.068, F = 11.6, r = 0.949, s = 14.0
The trial-and-error procedure based on the two {1χ, χt} indices discovers the following mixed set of two normal and five composite X indices
{X}2 = {1χ, χt, 1Xt, 11X, Xtt, 1Xtt, 11Xt}2
where 1Xt = 1χ·χt, 11X = (1χ)^2, Xtt = (χt)^2, 1Xtt = 1χ·(χt)^2, 11Xt
TABLE 9: Best Complete Combinatorial (cc) Normal and Composite Connectivity Index Combinations {X}1 for the Modeling of the Motor Octane Number MON of 30 Alkanes, and Best cc Normal and Composite Connectivity Index Combinations {X}2 for the Modeling of the Melting Points MP of 17 Alkanes (for the Corresponding Best fs Combinations Only the Q and F Values Are Given in Parentheses)
MON(a): {X}1 and MP(a): {X}2 | 〈u〉 | Q | F | r | s
MON: {X}1
fs and cc: {0Xt}1 | 2.19 | 0.029 | 16.5 | 0.608 | 21.4
cc: {01X, 1Xt}1 | 9.99 | 0.081 (0.050) | 67.1 (25) | 0.912 | 11.2
cc: {0χ, χt, 01X, 1Xt}1 | 8.73 | 0.121 (0.109) | 74.2 (71) | 0.960 | 7.9
cc: {D, 0χ, 01X, 00X, 0Xt}1 | 5.96 | 0.135 (0.133) | 74.1 (71) | 0.969 | 7.2
cc: {0χ, 1χ, 01X, 00X, 0Xt, 1Xt}1 | 5.23 | 0.147 (0.139) | 73.2 (66) | 0.975 | 6.6
cc: {D, 0χ, 1χ, 01X, 00X, 0Xt, 1Xt}1 | 3.93 | 0.163 (0.139) | 77.4 (56) | 0.980 | 6.0
MP: {X}2
fs and cc: {Xtt}2 | 8.54 | 0.016 | 4.4 | 0.477 | 30.1
cc: {1χ, χt}2 | 5.01 | 0.043 (0.040) | 16.1 (14) | 0.835 | 19.6
cc: {1χ, χt, 1Xtt}2 | 2.93 | 0.051 (0.041) | 15.5 (10) | 0.884 | 17.2
cc: {1χ, 1Xt, 11X, 11Xt}2 | 3.76 | 0.059 (0.044) | 15.1 (8.6) | 0.913 | 15.6
fs and cc: {1χ, χt, 11X, Xtt, 1Xtt, 11Xt}2 | 2.84 | 0.069 | 12 | 0.946 | 13.6
(a) Experimental data are taken from ref 23.
Figure 7. Plot of the calculated vs the experimental motor octane MON number of 30 alkanes.
Figure 8. Plot of the calculated vs the experimental melting MP points of {17 + 14} alkanes.
= (1χ)^2·χt. This set of indices shows the following interrelation values
〈RIM(Tm:{X}2)〉 = 0.729, Rw(1Xtt, 11Xt) = 0.011, Rs(χt, Xtt) = 0.998
that is, a much lower collinearity than the {X}1 indices used to model MON. The best LCXCI for this MP modeling are collected in Table 9 (bottom) (Q/F values of fs LCXCI being in parentheses). Thus, the 6-X-index LCXCI achieves a somewhat nicer Q/F description than the 7-χ-index LCCI from an expanded set, and it will be chosen to derive the calculated melting points, even if the best 〈u〉/Q/F combination seems to be the four mixed χ and X index combination. Now, let us try to model the melting temperature of the second segment of n = 14 alkanes. This segment could be modeled by the following LCCI from an expanded m = 17 set of χ indices
{D}: Q = 0.0693, F = 29.5, r = 0.843, s = 12.2
{D, 0χ, 1χ, 2χ, 3χc, 4χpc}: Q = 0.111, F = 12.6, r = 0.957, s = 8.6
The trial-and-error procedure on the {D, 0χ, 1χ} set finds the following set of composite indices
{X}3 = {D, 0χ, 1χ, XD3, 0XD3, 1XD2}3
where XD3 = D^-3, 0XD3 = 0χ·D^-3, and 1XD2 = 1χ·D^-2. This set of indices shows a more consistent collinearity than the preceding set
〈RIM(Tm:{X}3)〉 = 0.924, Rw(0χ, XD3) = 0.839, Rs(XD3, 0XD3) = 0.996
While the best single index is always D, the best overall cc and fs modeling is given by the full {X}3 set LCXCI with (LCXCI with fewer indices show worse Q/F values)
{X}3: 〈u〉 = 4.84, Q = 0.135, F = 18.7, r = 0.970, s = 7.17
The improvement over the χ-LCCI is here more than evident. Figure 8, which describes Tm (K) for two sets of 17 and 14 alkanes, has been obtained with the following C vectors
C(17) = (-19145, -109498, 1805.5, 40864, 34176, 39201, 50930)
C(14) = (152.61, 665.74, -1275.1, 2649108, -1145571, 188761, -334871)
We will not further deepen the discussion about the melting points, as they deserve a full work; nevertheless, even here it is evident that composite indices derived from a small-sized set of connectivity indices are good basis functions for molecular modeling.
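The composite indices used for the alkanes are simple products and powers of a few base indices, so their construction can be sketched in a few lines. The sketch below follows the definitions given in the text for {X}1, {X}2, and {X}3; the function name `composite_sets` and the numerical base values are hypothetical placeholders and are not data from the paper.

```python
def composite_sets(D, chi0, chi1, chit):
    """Build the composite members of {X}1, {X}2, and {X}3 from base indices.

    D, chi0, chi1, chit stand for D, 0chi, 1chi, and chi_t of one compound.
    """
    x1 = {
        "01X": chi0 * chi1,       # 0chi * 1chi
        "00X": chi0 ** 2,         # (0chi)^2
        "0Xt": chi0 * chit,       # 0chi * chi_t
        "1Xt": chi1 * chit,       # 1chi * chi_t
    }
    x2 = {
        "1Xt": chi1 * chit,
        "11X": chi1 ** 2,
        "Xtt": chit ** 2,
        "1Xtt": chi1 * chit ** 2,
        "11Xt": chi1 ** 2 * chit,
    }
    x3 = {
        "XD3": D ** -3,
        "0XD3": chi0 * D ** -3,
        "1XD2": chi1 * D ** -2,
    }
    return x1, x2, x3

# Placeholder base values for a single compound (illustrative only).
x1, x2, x3 = composite_sets(D=6.0, chi0=3.414, chi1=1.914, chit=0.5)
print(x1, x2, x3)
```

Because every composite index is a function of the same small base set, no higher-order indices need to be computed, which is the practical point made in the text.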
TABLE 10: Best Complete Combinatorial (cc) Connectivity Index Combinations {χ} for the Modeling of the Lattice Enthalpies ∆HLφ of 20 Metal Halides, and Best cc and fs Normal and Composite Connectivity Index Combinations {X}4 for the Same Modeling (Here Only Q and F Values Are Given in Parentheses for fs Combinations)
∆HLφ: {χ} and {X}4 | 〈u〉 | Q | F | r | s
∆HLφ: {χ}
fs and cc: {0χv} | 17.21 | 0.015 | 45.4 | 0.846 | 57.3
cc: {Dv, 0χv} | 19.44 | 0.033 | 115 | 0.965 | 29.0
cc: {0χv, 1χv, DZ} | 6.28 | 0.043 | 131 | 0.980 | 22.2
cc: {Dv, 0χv, 1χv, DZ} | 6.28 | 0.044 | 102 | 0.982 | 22.2
∆HLφ: {X}4
fs and cc: {1Rv}4 | 29.8 | 0.019 | 72.2 | 0.895 | 48.1
fs and cc: {1Rv, 1.5Rv}4 | 24.2 | 0.038 | 147 | 0.972 | 25.9
cc: {0χv, DZ, 1Rv}4 | 12.5 | 0.053 (0.044) | 192 (132) | 0.986 | 18.7
cc: {0χv, DZ, 1Rv, 1.5Rv}4 | 7.4 | 0.054 (0.044) | 154 (100) | 0.988 | 18.2
fs and cc: {0χv, 1χv, DZ, 1Rv, 1.5Rv}4 | 5.6 | 0.055 | 126 | 0.989 | 17.9
Figure 9. Plot of the calculated vs the experimental lattice enthalpies of 20 metal halides.
The modeling of the lattice enthalpies of the n = 20 metal halides25 (MeX) of Table 4 with the {χ} = {Dv, 0χv, 1χv, DZ} set that has
〈RIM(∆HLφ:{χ})〉 = 0.669, Rw(Dv, 0χv) = 0.388, Rs(Dv, DZ) = 0.913
is shown in Table 10, top. We have reported here the statistical score of every sequential cc LCCI to be able to compare it with the statistical score of the following LCXCI, whose X indices, derived from the 1χv index, are special reciprocal connectivity indices
{X}4 = {0χv, 1χv, DZ, 1Rv, 1.5Rv}4
where 1Rv = (1χv)^-1 and 1.5Rv = (1χv)^-1.5. Interrelation values here are
〈RIM(∆HLφ:{X}4)〉 = 0.652, Rw(1χv, 1.5Rv) = 0.438, Rs(1Rv, 1.5Rv) = 0.981
The best sequential LCXCI are given in Table 10, bottom (in parentheses are the Q/F values of the corresponding fs LCXCI). The improvement over the preceding {χ} set is evident at every level. The overall best LCXCI is the third one, with a very interesting set of 〈u〉, Q, and especially F values that underline both the high predictability and high utility of this combination. Its modeling vectors used to obtain the calculated values of Figure 9 are
X = (0χv, DZ, 1Rv, χ0), C = (-20.1665, 45.5942, 64.4373, 719.386), u = (7.69, 6.55, 5.35, 30.20)
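The tabulated scores can be cross-checked against the reported vectors. The sketch below assumes, as the numbers in the tables suggest, that 〈u〉 is the arithmetic mean of the components of u and that Q = r/s; both relations are defined outside this section, so they are treated here as assumptions rather than as the author's stated definitions. The inputs are the values reported above for the third {X}4 combination.

```python
from statistics import mean

# Utility vector reported for X = (0chi_v, D_Z, 1R_v, chi_0).
u = (7.69, 6.55, 5.35, 30.20)

# Correlation coefficient r and standard deviation s from Table 10, third {X}4 row.
r, s = 0.986, 18.7

mean_u = mean(u)   # assumed definition of <u>
Q = r / s          # assumed definition of the quality factor Q

print(round(mean_u, 2), round(Q, 3))
```

Both results fall close to the tabulated 〈u〉 = 12.5 and Q = 0.053, the small difference in 〈u〉 being compatible with rounding of the individual u components.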
We will now introduce Randić's orthogonal Ω indices to improve the 〈u〉 and uk values11 of some of our modelings. This is a step that often prompts chemists to jump to conclusions, but it is really painless if handled properly and has definite advantages. Chosen properties will be pI(AA), S(AA+PP), and ∆HLφ(MeX), which are modeled by a LCXCI of 4, 4, and 3 special descriptors, respectively. In fact, Randić's orthogonalization procedure,13-16 even if conceptually simple, can nevertheless, for more than four or five indices, become rather tedious to perform and not at all pedestrian without a good software package. We start with pI by rewriting the best LCXCI vectors for this property ordered following the importance of the relative indices (first index, the one with the best Q; second index, the index with the nicest improvement in Q; and so on); thus we have
X = (0Xv, DXv, 1X, 0X, X0)
C = (-18.4178, -1.79023, 13.7802, 8.20212, 12.9388)
〈u〉 = 7.93; u = (4.7, 3.1, 2.8, 2.8, 26.3)
The orthogonalization of the four X with 0Xv ≡ 1Ω produces the following orthogonal vectors that, with the exception of 〈u〉, have the same statistical Q/F score as the X indices
Ω = (1Ω, 2Ω, 3Ω, 4Ω, Ω0)
C = (-8.00272, -0.10392, 2.68609, 8.20212, 13.76247)
〈u〉 = 11.55; u = (19.2, 1.3, 1.0, 2.8, 33.4)
As expected, there is a nice improvement in 〈u〉 due (i) to an exceptional improvement in the utility u1 value of the most important descriptor (0Xv, which builds a very good single-index LCXCI by itself, see pI paragraph) and (ii) to a good improvement of the utility u5 of the unitary index, while (iii) less important descriptors show a decreasing u2,3 utility. The simulation of S(AA+PP) with orthogonal indices starts also here with a reordering procedure; that is
X = (DX, Xtv, DXv, Xt, X0)
C = (-2763.5, -16270, 2427.1, -494.41, 18.5792)
〈u〉 = 4.59; u = (5.2, 6.1, 8.0, 2.0, 1.6)
Orthogonalization of the X vector generates the following vectors and utility
Ω = (1Ω, 2Ω, 3Ω, 4Ω, Ω0)
C = (1196.98, -9821.57, 2222.50, -494.41, -22.5854)
〈u〉 = 10.37; u = (30.1, 9.3, 7.8, 2.0, 2.6)
Utility doubles thanks to the first two most important descriptors and to the unitary term. The simulation of the lattice enthalpies starts also with a reordering of the following vectors
X = (1Rv, 0χv, DZ, χ0)
C = (64.4373, -20.1665, 45.5942, 719.386)
〈u〉 = 7.4; u = (5.4, 7.7, 6.6, 30.2)
to produce the following vectors with a remarkably improved utility, especially at the level of the most important and unitary indices
Ω = (1Ω, 2Ω, 3Ω, Ω0)
C = (160.628, -20.1715, 45.5942, 682.282)
〈u〉 = 41.74; u = (21.8, 7.7, 6.6, 131)
It should be underlined that C(Ω) vector values are now constant under inclusion or deletion of a new index,13-16 that is, the equation P = |C·Ω| has now (i) the same high predictive power as the corresponding equation with X indices, that is, P = |C·X|, (ii) a better utility, and (iii) a total stability.
Conclusion
The introduced special and/or composite X = f(χ) connectivity indices (personally I would call the reciprocal index the special index and the remaining ones, inclusive of the squared, composite indices) derived from a medium-sized set of connectivity indices or from a subset of it by a trial-and-error procedure are good, and sometimes even very good, descriptors of very different properties. Their construction is simple and straightforward, especially when the subset from which they are derived is rather restricted, as in the case of the simulation of the volumes V of amino acids, MON, and melting points of alkanes as well as the lattice enthalpies of MeX. Reciprocal and squared supraconnectivity indices are the easiest to obtain and have a very remarkable descriptive power for the solubility of 20 amino acids and 23 purines and pyrimidines, respectively. The introduced composite X indices, mainly based on the χtv index, used to model the entire set of 43 amino acids, purines, and pyrimidines are very interesting, as even the 1-X-index LCXCI has a good descriptive strength. Concerning this last description, we think that it is the first time that descriptors for the modeling of different classes of compounds are defined and positively tested.
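The orthogonalization applied above to pI, S(AA+PP), and ∆HLφ can be sketched as a sequential residual construction: the first descriptor is kept, and each following descriptor is replaced by the part of it that is not explained by the preceding ones. The code below is a minimal sketch of that idea using Gram-Schmidt on mean-centered columns; it is an assumption that this reproduces the procedure of refs 13-16 in every detail, and the descriptor columns are hypothetical numbers, not data from the paper.

```python
def center(v):
    """Subtract the mean so that a constant term does not affect residuals."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonalize(columns):
    """Return orthogonal descriptors; columns must be ordered by importance."""
    basis = []    # centered, mutually orthogonal columns built so far
    omegas = []   # output: first column unchanged, later ones as residuals
    for k, col in enumerate(columns):
        resid = center(col)
        for b in basis:
            coef = dot(resid, b) / dot(b, b)
            resid = [r - coef * x for r, x in zip(resid, b)]
        basis.append(resid)
        omegas.append(list(col) if k == 0 else resid)
    return omegas

# Hypothetical descriptor columns for five compounds.
X1 = [1.0, 2.0, 3.0, 4.0, 5.0]
X2 = [1.1, 1.9, 3.2, 3.9, 5.3]
X3 = [2.0, 1.0, 4.0, 3.0, 6.0]

omega = orthogonalize([X1, X2, X3])
overlaps = [dot(center(omega[i]), center(omega[j]))
            for i in range(3) for j in range(i + 1, 3)]
print(overlaps)
```

Since each new Ω carries only information not already present in the earlier ones, adding or removing a later index leaves the regression coefficients of the earlier ones unchanged, which is the stability property stressed in the text.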
Clearly, the rather inflated value of the standard deviation s of the estimate for S(AA+PP) underlines the presence of outliers (as is evident from Figure 6) that could be better modeled with a deeper knowledge of the association phenomena that these (especially amino acids) compounds undergo in solution, that is, with a refining of the supraindices through a better calculation of the association a parameter. Concerning supraindices, it is interesting to notice that they work not just at the level of each homogeneous class of compounds (that is, AA and PP) but also at the more general level of the whole heterogeneous class (AA + PP, n = 43) of compounds. The fractional XF indices introduced to model the isoelectric pI point of 21 amino acids and mainly based on the 0χv index, acting as a normalization parameter, not only accomplish their task in a remarkable way, as even a 1-XF-index LCXCI shows very good predictive power, but they also disclose the possibility to undertake an interesting
simulation of the melting points of 20 amino acids. Furthermore, these fractional indices cast some light on the importance of the side-chain functional groups in both the pI and Tm simulations of molecules rich in functional groups. The satisfactory modeling of the motor octane number MON of 30 alkanes and of the melting points of two segments of 17 and 14 alkanes with the aid of composite indices, based mainly on the D, 0χ, 1χ, and χt indices, shows how to bypass the problem of deriving and working with an extended set of connectivity indices (here, m = 17), as in both cases a good description can be obtained with fewer indices derived from a minimal-sized set of four indices. In the case of the lattice enthalpies of 20 metal halides the introduction of two special reciprocal connectivity indices, based on the 1χv index and derived from a minimal-sized set of indices, enhances the modeling of this property both at the 1-X level and at the multi-X level. The improved simulation in both Q and F parameters of the side-chain volumes of 18 amino acids with the introduction of the total and composite connectivity indices (as far as the 2-X and 3-χ linear combinations are concerned), even relative to the orthogonal Ω indices of Lučić, Nikolić, and Trinajstić, leads us to underline the importance of such indices throughout the modeling of the amino acids. The fact that a single composite X index, based on Dv and 0χv indices, achieves a further improvement of F and 〈u〉 parameters and u vector introduces us to another characteristic of LCXCI, that is, their utility. Normally, composite indices show a lower interrelation than their parent indices and this can be the origin of the rather good utility of the best LCXCI for the different properties (the choice of the best LCXCI, among the many best, is not always a simple task): V, pI, S(AA) especially for n = 20, S(PP), MON (the 4-X-index LCXCI), and ∆HLφ.
While the utility of S(AA+PP) and even of pI and ∆HLφ has been further enhanced with the use of Randić's orthogonal indices, the utility of Tm of both amino acids and alkanes has been left out of any improvement, as further work is necessary at the level of connectivity indices to achieve a satisfactory and global description of Tm; it should not be forgotten that the simulation of Tm of organic compounds is to this day a very controversial point23,37 in quantitative structure-property studies. Last, but not least, a word should be said about the forward selection technique. This technique for choosing indices achieves in many cases a good selection and, whenever it does not choose the best overall indices, these are normally not far away in terms of Q, F, and 〈u〉 values from the very best cc indices. Thus, it should always be worth starting the selection of the indices with this technique, especially if we are testing newly defined indices, and only afterward turning to the other technique. A further advantage of the forward selection technique is that many times it can help to restrict the combinatorial space of the cc technique; in fact, if a very good simulation has been reached with, e.g., four fs indices, the same good description or even an improved one should be obtained with four or fewer cc indices. Future work should then (i) deepen the meaning of the association a parameter for the supraindices, (ii) further elucidate the value of ∆n in the fractional indices for a better definition of indices that are highly dependent on side-chain functional groups, (iii) derive the ultimate index for the melting points, (iv) enlarge the connectivity model to include inorganic compounds that for the moment have a rather loose relation with it, and (v) further improve the modeling of heterogeneous classes of compounds. Many of these points (i, iii, and v) have directly to do with a better understanding of the shape-dependent character of many properties.
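The difference between the two selection techniques compared throughout this work can be made concrete with a small sketch. Forward selection (fs) adds, at each step, the single index that most improves the fit given the indices already chosen, while the complete combinatorial (cc) technique scores every subset of the requested size. As a simplification, the sketch ranks subsets by the residual sum of squares of a least-squares fit with a constant term instead of the Q, F, and 〈u〉 scores used in the paper; all function names and data are hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a linear fit on the columns plus a constant."""
    cols = [list(c) for c in columns] + [[1.0] * len(y)]
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    coef = solve(A, rhs)
    pred = [sum(c * col[i] for c, col in zip(coef, cols)) for i in range(len(y))]
    return sum((p - t) ** 2 for p, t in zip(pred, y))

def forward_selection(X, y, size):
    """Greedy search: keep earlier choices, add the best next index."""
    chosen = []
    while len(chosen) < size:
        rest = [k for k in X if k not in chosen]
        best = min(rest, key=lambda k: rss([X[j] for j in chosen + [k]], y))
        chosen.append(best)
    return chosen

def complete_combinatorial(X, y, size):
    """Exhaustive search over all subsets of the given size."""
    return list(min(combinations(X, size),
                    key=lambda ks: rss([X[k] for k in ks], y)))

# Hypothetical descriptor columns and property values for six compounds.
X = {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
     "b": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
     "c": [1.0, 4.0, 2.0, 6.0, 3.0, 5.0]}
y = [1.0, 3.1, 2.2, 5.9, 3.8, 6.2]

fs = forward_selection(X, y, 2)
cc = complete_combinatorial(X, y, 2)
rss_fs = rss([X[k] for k in fs], y)
rss_cc = rss([X[k] for k in cc], y)
print(fs, rss_fs, cc, rss_cc)
```

The cc result can never fit worse than the fs result for the same subset size, since the fs subset is one of the candidates the exhaustive search examines; the price is a number of fits that grows combinatorially with the size of the index set.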
Let us close this excursus on the many facets of molecular connectivity with the following quotation from the Templar Order of Portugal: “thus, my friend, the truth you learned at the beginning of your education and the truth you learned at the end of it are, even if different, the same truth”.
Acknowledgment. This study was written during a period of leave at the Centro de Quimica-Fisica Molecular of the Technical University of Lisbon, Portugal. It is a pleasure to thank Professors J. M. G. Martinho and M. N. Berberan-Santos of this University for their help and support during this period, as well as Professor M. Randić of the Drake University and Professor L. B. Kier of the Virginia University for their assistance, concern, and support during the entire molecular connectivity enterprise of the author. I am indebted to Professor N. Trinajstić and Dr. S. Nikolić of the University of Zagreb, as well as to Dr. S. C. Basak of the Natural Resources Research Institute, Duluth, MN, for much interesting and helpful literature. It is also a pleasure to thank the Chemistry Department, the Chemistry teaching staff, and the Faculty of Science of the University of Calabria, Italy, for allowing the author to take leave, and especially Professors G. Chidichimo, M. Ghedini, and G. Sindona as well as Professors M. Terenzi, G. Ranieri, and Dr. L. Coppola. MURST, the Ministry for University, Scientific and Technological Research, is gratefully acknowledged for financial support.
References and Notes
(1) Randić, M. J. Am. Chem. Soc. 1975, 97, 6609.
(2) Kier, L. B.; Hall, L. H.; Murray, W. J.; Randić, M. J. Pharm. Sci. 1975, 64, 1971.
(3) Kier, L. B.; Hall, L. H. J. Pharm. Sci. 1981, 70, 583.
(4) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Wiley: New York, 1986.
(5) Trinajstić, N. Chemical Graph Theory, 2nd ed.; CRC: Boca Raton, FL, 1992.
(6) Turro, N. J. Angew. Chem., Int. Ed. Engl. 1986, 25, 882.
(7) Seybold, P. G.; May, M. A.; Bagal, U. A. J. Chem. Educ. 1987, 64, 575.
(8) Hansen, P. J.; Jurs, P. C. J. Chem. Educ. 1988, 65, 574.
(9) Rouvray, D. H. J. Mol. Struct. (THEOCHEM) 1989, 185, 187.
(10) Basak, S. C.; Niemi, G. J.; Veith, G. D. J. Math. Chem. 1991, 7, 243.
(11) Randić, M. Int. J. Quant. Chem.: Quant. Biol. Symp. 1994, 21, 215.
(12) Carotti, A.; Altomare, C. Chem. Ind. 1995, 77, 13.
(13) Randić, M. Croat. Chim. Acta 1991, 64, 43.
(14) Randić, M. J. Chem. Inf. Comput. Sci. 1991, 31, 311.
(15) Randić, M. J. Mol. Struct. (THEOCHEM) 1991, 233, 45.
(16) Randić, M. New J. Chem. 1991, 15, 517.
(17) Kier, L. B.; Hall, L. H. In Advances in Drug Research; Testa, B., Ed.; Academic: New York, 1992; Vol. 22.
(18) Balaban, A. T.; Kier, L. B.; Joshi, N. Mater. Chem. 1992, 28, 13.
(19) Mihalić, Z.; Nikolić, S.; Trinajstić, N. J. Chem. Inf. Comput. Sci. 1992, 32, 28.
(20) Pogliani, L. J. Phys. Chem. 1993, 97, 6731.
(21) Pogliani, L. J. Phys. Chem. 1994, 98, 1494.
(22) Pogliani, L. Curr. Top. Pept. Prot. Res. 1994, 1, 119.
(23) Pogliani, L. J. Phys. Chem. 1995, 99, 925.
(24) Pogliani, L. J. Chem. Inf. Comput. Sci., to be published.
(25) Pogliani, L. MATH/CHEM/COMP’96, to be published in Croat. Chim. Acta.
(26) Nikolić, S.; Medić-Šarić, M.; Rendić, S.; Trinajstić, N. Drug Met. Rev. 1994, 26, 717.
(27) Lučić, B.; Nikolić, S.; Trinajstić, N. Croat. Chim. Acta 1995, 68, 417.
(28) Pogliani, L. J. Chem. Inf. Comput. Sci. 1994, 34, 801.
(29) Lučić, B.; Nikolić, S.; Trinajstić, N. Croat. Chim. Acta 1995, 68, 435.
(30) Kier, L. B.; Hall, L. H.; Frazer, J. W. J. Chem. Inf. Comput. Sci. 1993, 33, 143.
(31) Kier, L. B.; Hall, L. H.; Frazer, J. W. J. Chem. Inf. Comput. Sci. 1993, 33, 148.
(32) Pogliani, L. Amino Acids 1995, 9, 217.
(33) Pogliani, L. Croat. Chim. Acta 1996, 69, 95.
(34) Trinajstić, N.; Mihalić, Z.; Harris, F. E. Int. J. Quant. Chem.: Quant. Chem. Symp. 1994, 28, 525.
(35) See acknowledgments in ref 24.
(36) Pogliani, L. J. Pharm. Sci. 1992, 81, 334.
(37) Needham, D. E.; Wei, I. C.; Seybold, P. G. J. Am. Chem. Soc. 1988, 110, 4186.
(38) Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic: New York, 1976.
(39) CRC Handbook of Chemistry and Physics, 72nd ed.; Lide, D. R., Ed.-in-Chief; CRC Press: Boca Raton, FL, 1991-1992; p 7-3.
(40) Kier, L. B.; Hall, L. H. J. Pharm. Sci. 1976, 65, 1806.
(41) Atkins, P. W. Physical Chemistry; Oxford University Press: Oxford, U.K., 1990.
(42) Pogliani, L. Comput. Chem. 1993, 17, 283.
JP961434C