A PROLOG program for the generation of molecular formulas

one) will be reflected in the global price of the final drug (see eq 1). The program obviously does not include all the different factors that can have a role in the process of "lead optimization," but we believe that it allows the student to be familiar with some important topics in the "lead optimization" stage, such as (1) the importance of rational suggestion of representative substituents (series design methods), (2) the problems derived from the existence of anomalous or unsuitable substituents, (3) the importance of synthetic considerations in drug design, and (4) the high power of QSAR techniques in drug design.

A PROLOG Program for the Generation of Molecular Formulas
B. Mendez and J. A. Moreno

Facultad de Ciencias
Universidad Central de Venezuela
Apartado Postal 47906
Caracas 1041A, Venezuela

When engaged in the interpretation of mass spectral data of organic compounds, it is essential that students be able to assign constitutional formulas to the molecular (parent) and fragment ion peaks of the spectrum under consideration. This is by no means an easy task, and it is generally carried out either by inspecting tables and compilations of mass spectral information (14) or by implementing numerical algorithms in calculators or computers (15, 16). Considering that the latter procedure is the most adequate for this problem, we have developed a computer program called FORMULAS that generates a list of all possible molecular formulas for compounds that may contain carbon, hydrogen, nitrogen, oxygen, sulfur, phosphorus, and halogens and that are consistent with certain restrictive conditions, which include the number and type of constituent atoms, mass value, state of ionization, and degree of unsaturation.

The idea underlying the problem is that an organic molecule can be conveniently depicted as a topological plane connected graph (17, 18). A graph is a mathematical device consisting of a set of points (known as vertices) and a set of lines (known as edges) that connect pairs of the vertices. In such chemical graphs the vertices represent the average position of each of the atoms, and the edges represent the covalent bonds existing between pairs of atoms. In the program, a chemical graph is characterized by the number of vertices and edges it possesses and by its cyclomatic number. This is achieved by using the following relations. The number of edges $N_e$ in a graph can be calculated as:

$$N_e = \frac{1}{2} \sum_i D_i \, N_v(i) \qquad (1)$$

where $N_v(i)$ represents the number of vertices of degree $D_i$. The cyclomatic number of a graph, $C$, can be defined as follows:

$$C = N_e - \sum_i N_v(i) + 1 \qquad (2)$$

with $C \geq 0$ and where the sum $\sum_i N_v(i)$ is the total number of vertices. It must be noted further that the degree of a vertex, corresponding to the number of edges converging at that vertex, and the cyclomatic number of a graph, which is the number of independent loops or cycles formed by the edges, both have a counterpart in the terminology of molecular chemistry. It is clear that the degree of a vertex corresponds to the valence of the corresponding atom, whereas the cyclomatic number is equivalent to the "degree of unsaturation" or of

234 Journal of Chemical Education

"hvdroeen deficiencv" in the molecule. In the promam a . . u recursive routine generates sets of integer numbers corresnondine " to the different N,,,) (numbers of allowed isotopic species) whose nominal masses add up to the target molecular mass. The set of \,ertices is then tested for a graph condition by checking that relation 2 holds. This restriction is not enough since many graph-forming combinations of atoms are meaningless from a chemical point of view. Therefore additional restrictive rules of heuristic character ("rules of thumb"), are used in order to limit the formula production to only those that constitute plausible chemical graphs. The present version of FORMULAS uses the following heuristic rules: (1) If the number of carbons is greater than zero, then the number of edges (bonds) is greater than or equal to four. (2) If the number of atoms with odd valences is greater than zero, then this number must he even, otherwise the number of atoms with even valences, excluding the number of carbons, must be an even number. (3) If the sum of the numbers of carbon, nitrogen, oxygen, sulfur, and phosphorous atoms is greater than three, then the degree of unsaturation is arbitrary, greater than or equal to zero; otherwise the degree of unsaturation is less than or equal to two. (4) If the only atoms considered are sulfur and oxygen,then the sum of their numbers is less than or equal to three. (5) If the only atoms considered are nitrogen and phosphorous, then the sum of their numbers is less than or equal to two. When a set of N,m numbers is found fulfilling the above rules, the formula is printed in the formula window, together with its exact mass and its degree of unsaturation. The oroeram FORMULAS iscoded inPROLOG usine the T ~ ~ ~ P R(19) ~ syntax. 
LOG Although PROLOG is a ;omputer language not yet very extensively used in chemical applications (20), it has been chosen by us because i t offers many advantages when compared to common procedural languages like BASIC and Pascal. In particular its high modularitv allows the promam to be structured as a collection of indkidual indeienaent modules or procedures. These modules express facts about objects and relations between objects t h a i describe the problem. In our case the different graph and heuristic rules are programmed in a declarative manner without having to bother too much about procedural details. The results follow logically from the rules and facts defining the problem so that, if they do not agree with reality, then the rules and facts must be revised. A final and crucial advantage of PROLOG is its built-in resolution mechanism. The resolution principle embodies pattern matching- and a n automatic backtracking strategy, forming in this way a powerful reasoning mechanism (inference engine). This feature is useful when solving problems with nondeterministic initial conditions that involve complicated search strategies, as in the present case. The program FORMULAS has been designed to be simple in use. After hooting the disk, the program displays the initial screen containing title and authorship information; after pressing any key, a second screen is displayed presenting a menu with the available options. In this menu the user chooses between two execution modes. In mode 1the program produces formula listings of compounds or ions with a predefined molecular mass that the user can choose to be exact or nominal. The user is also prompted to specify the different atoms thought to conform~thecandidate molecular species, a specified number of free bonds defining its neutral or ionic character and, if known, its possible degree of unsaturation. In mode 2 the program generates a list of formulas for compounds or ions, with nominalmasses in agiven range. 
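As a rough illustration of what a search on a nominal mass involves, the following Python sketch enumerates atom counts recursively and keeps those that satisfy relations 1 and 2 together with heuristic rules 1 and 3. It is not the authors' PROLOG code: the valence table, the nominal masses, the restriction to neutral species with no free bonds, and every function name are assumptions made for the example, and rules 2, 4, and 5 are left out.

```python
# Assumed valences (vertex degrees) and nominal masses of the most abundant
# isotopes; FORMULAS uses its own table, which is not reproduced here.
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2}
NOMINAL_MASS = {"C": 12, "H": 1, "N": 14, "O": 16}
HEAVY = ("C", "N", "O", "S", "P")


def edges(counts):
    """Relation 1: half the sum of the vertex degrees (None if that sum is odd)."""
    degree_sum = sum(VALENCE[el] * n for el, n in counts.items())
    return degree_sum // 2 if degree_sum % 2 == 0 else None


def cyclomatic_number(counts):
    """Relation 2: edges - vertices + 1 (None if the atoms cannot form a graph)."""
    n_edges = edges(counts)
    if n_edges is None:
        return None
    return n_edges - sum(counts.values()) + 1


def plausible(counts):
    """Graph condition plus heuristic rules 1 and 3 only (hypothetical helper)."""
    unsaturation = cyclomatic_number(counts)
    if unsaturation is None or unsaturation < 0:
        return False
    if counts.get("C", 0) > 0 and edges(counts) < 4:  # rule 1
        return False
    heavy_atoms = sum(counts.get(el, 0) for el in HEAVY)
    if heavy_atoms <= 3 and unsaturation > 2:  # rule 3
        return False
    return True


def formulas(target_mass, elements):
    """Recursively pick a count per element so nominal masses sum to target_mass."""
    def extend(index, remaining, counts):
        if index == len(elements):
            if remaining == 0 and plausible(counts):
                yield counts
            return
        element = elements[index]
        for n in range(remaining // NOMINAL_MASS[element] + 1):
            yield from extend(index + 1,
                              remaining - n * NOMINAL_MASS[element],
                              {**counts, element: n})

    yield from extend(0, target_mass, {})
```

For a nominal mass of 78 with only carbon and hydrogen allowed, `list(formulas(78, ["C", "H"]))` leaves a single candidate, C6H6, with 15 edges and a cyclomatic number of 4. Python generators stand in here for the backtracking that PROLOG provides automatically.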
As in mode 1, the user is prompted for the desired atomic constitution, the number of free bonds, and the degree of unsaturation. When the user presets an exact mass, the program requests the desired resolution for the exact mass constraint (this option is called mass tolerance). The mass tolerance sets the number of significant decimal places to be assumed in the exact mass calculations. In both modes, during the process of formula generation the user can choose a particular compound and generate species with isotopic nuclei. In this case, the user is prompted for the specification of the particular isotopic atoms to be considered.

Recently a similar program, named MSPI and based almost entirely on heuristics, has been reported in the literature (21). The main difference between these programs is that FORMULAS relies not only on heuristic but also on more rigorous graph-theoretical rules. This approach gives our program the possibility of handling additional restrictions, like the degree of unsaturation and the ionization state, in the generation of the possible molecular formulas. This gives the program a more general character, which can be appreciated through the fact that the program FORMULAS is able to generate some plausible formulas that are not generated by the program MSPI.

The execution of the program FORMULAS requires an IBM-PC or compatible with 512 Kbytes of RAM and at least one disk drive. Copies of the program, both source and executable files, are available from the authors on a 5.25-in. disk that includes a README executable file with a brief introduction, some notes about notation, and a table of the employed atomic mass values. Send a money order for $15 to any one of the authors.
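The role of the mass tolerance described above can be illustrated with a short Python sketch. It assumes that a tolerance of d decimal places means accepting any formula whose exact mass lies within half a unit of the d-th decimal place of the target; the article does not spell out the comparison, so this reading, the mass table, and the function names are all assumptions rather than the program's actual behavior.

```python
# Assumed monoisotopic masses (u) of the most abundant isotopes; the table
# distributed with FORMULAS may use slightly different values.
EXACT_MASS = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def exact_mass(counts):
    """Sum of the assumed isotopic masses for a formula given as {element: count}."""
    return sum(EXACT_MASS[el] * n for el, n in counts.items())


def within_tolerance(counts, target, decimals):
    """Accept a formula if it matches the target to the given number of decimals."""
    return abs(exact_mass(counts) - target) <= 0.5 * 10 ** (-decimals)


# Three formulas that share the nominal mass 30.
candidates = [
    {"C": 2, "H": 6},          # 30.046950
    {"C": 1, "H": 2, "O": 1},  # 30.010565
    {"N": 2, "H": 2},          # 30.021798
]
```

With a target of 30.0106, a tolerance of three decimal places keeps only CH2O, whereas a tolerance of one decimal place cannot separate the three candidates.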

Determination of Inflection Points from Experimental Data
Daniel E. Stogryn
Mount St. Mary's College
12001 Chalon Rd.
Los Angeles, CA 90049

Plots of experimental data frequently show inflection points, the locations of which are important in an experiment. In chemistry and biochemistry, titrations using pH meters or potentiometric titrations are well-known examples. This communication provides a program for determining the location of an inflection point for those laboratories with access to a computer or workstation with an AT&T UNIX operating system. The program, which is only seven lines long, is written as a UNIX shell procedure and depends on the standard UNIX spline and awk commands. Only five or six data points, not necessarily equally spaced over abscissa values, spanning the inflection point are required for very good results useful at both the research level and in student laboratories.

The spline command provides a set of interpolated points obtained with cubic spline interpolating polynomials (22). Each polynomial connects two successive data points in such a way that there is continuity of the first and second derivatives at the point connected by two adjacent polynomials. Spline interpolation has the desirable feature of being less sensitive to "bad" data points than interpolation by a single polynomial spanning the set of data; however, even with the spline interpolation, obviously "bad" points in the neighborhood of the inflection point should be excluded from the analysis. Although the input data need not be equally spaced along the abscissa, the UNIX spline command provides approximately equally spaced interpolated values which can be further processed.
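To make the interpolation step concrete, here is a minimal Python sketch. It is not the author's shell procedure. It assumes a natural cubic spline (second derivative zero at both ends), evaluates it at equally spaced abscissas, and then takes the interval with the largest absolute increment as an estimate of the inflection point. The function names, the default of 100 intervals, and the choice to report the midpoint of the winning interval are assumptions made for the example.

```python
import bisect


def natural_spline_second_derivs(xs, ys):
    """Second derivatives of the natural cubic spline at the knots."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the interior knots, solved by the Thomas algorithm.
    diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
    rhs = [6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
           for i in range(1, n)]
    for k in range(1, n - 1):
        factor = h[k] / diag[k - 1]
        diag[k] -= factor * h[k]
        rhs[k] -= factor * rhs[k - 1]
    m = [0.0] * (n + 1)  # end values stay zero (natural boundary condition)
    for k in range(n - 2, -1, -1):
        m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]
    return m


def spline_value(xs, ys, m, x):
    """Evaluate the spline at x, using the polynomial of the enclosing interval."""
    i = max(0, min(bisect.bisect_right(xs, x) - 1, len(xs) - 2))
    h = xs[i + 1] - xs[i]
    a, b = xs[i + 1] - x, x - xs[i]
    return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h)
            + (ys[i] / h - m[i] * h / 6.0) * a
            + (ys[i + 1] / h - m[i + 1] * h / 6.0) * b)


def inflection_point(xs, ys, n_intervals=100):
    """Midpoint of the equally sized interval with the largest |increment|."""
    m = natural_spline_second_derivs(xs, ys)
    step = (xs[-1] - xs[0]) / n_intervals
    grid = [xs[0] + k * step for k in range(n_intervals + 1)]
    vals = [spline_value(xs, ys, m, x) for x in grid]
    k = max(range(n_intervals), key=lambda j: abs(vals[j + 1] - vals[j]))
    return 0.5 * (grid[k] + grid[k + 1])
```

With the five points (0, -1), (1, -0.8), (2, 0), (3, 0.8), (4, 1), `inflection_point` returns a value within one grid step of x = 2, the center of symmetry of the data.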
The awk command is a pattern scanning and processing language which, in our case, takes the output of the spline command, approximates the point where the second derivative is zero (the inflection point) by locating where the absolute value of a function's increment over approximately equally sized intervals of the abscissa is largest, and then prints the results. The UNIX shell procedure is shown in Figure 2. To use this shell procedure, a data file with the experimentally obtained coordinates arranged according to either monotone

[Figure 2. The UNIX shell procedure: a spline -n command whose output is processed by awk.]