Chemical inference. 1. Formalization of the language of organic

This article is cited by 8 publications. Daniel J. Graham,, Christopher Malarkey, and, Matthew V. Schulmerich. Information Content in Organic Molecule...
0 downloads 0 Views 2MB Size
J . Ckem. If: Comput. Sci. 1983, 23, 117-134

hits, their number is always less than five. It is considered that the strategy developed here can be extended for other more general cases such as patent information by increasing the degree of multiplicity of representations and by incorporating multiple functions of the Markush indicator. Although the tactics discussed in this paper would be sufficiently powerful even for other types of generic names in the Handbook which are not mentioned in the present paper, it is evident that there are many difficult problems to solve on a wide variety of Markush claims in patent specifications, including indefinite and a theoretically infinite number of structures. Therefore, before we can deal with the patent information in a completely satisfactory way, it would be necessary to give careful consideration at least on the following points: (1) more generic terms such as alkyl and halogen should be taken care of; (2) the Q representation should have a greater degreeof multiplicity; (3) the Markush indicator should be able to discriminate between more than two situations; (4) a more complicated query must be accepted by using logical operations (AND, OR, and so on). EXPERIMENTAL SECTION The searching program is written in BASIC PLUS and runs on a Disital Equipment Corp. PDP-11/60 at Japan Association for

International Chemical Information (JAICI). ACKNOWLEDGMENT We are indebted to T. Oshima of Japan Chemical Industry Ecology-Toxicology & Information Center (JETOC) for discussions related to the Handbook. This work was financially supported in part by a Grant-in-aid for Scientific Research from the Ministry of Education, Science, and Culture to which the authors' gratitude is due.

117

REFERENCES AND NOTES (1) (a) Ash, J. E.; Hyde, E., Eds. "Chemical Information Systems"; Ellis Horwood Ltd.: Chichester, 1975. (b) Rush, James E. 'Handling Chemical Structure Information". In "Annual Review of Information Science and Technology"; Williams, Martha E. Ed.; Knowledge Industry Publications: 1978; Vol. 13, Chapter 8. (c) Howe, Jeffrey W.; Milne, Margaret M.; Pennell, Ann F., Eds. 'Retrieval of Medical Chemical Information". ACS Symp. Ser. 1978, No. 84. (2) 'NEW CAS SERVICES". J. Chem. Inf. Compur. Sci. 1981,21, 117. (3) Lynch, M. F.; Barnard, J. M.; Welford, S.M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 1. Introduction and General Strategy". J. Chem. In$ Compur. Sci. 1981,21,148-150. (b) Bamard, J. M.; Lynch, M. F.; Welford, S.M. "Computer Storage and Retrieval of Generic Chemical Structures in Patents. 2. GENSAL, a Formal Language for the Description of Generic Chemical Structures". Ibid. 1981,21, 151-161. (c) Welford, S.M.; Lynch, M. F.; Barnard, J. M. 'Computer Storage and Retrieval of Generic Chemical Structures in Patents. 3. Chemical Grammaers and Their Role in the Manipulation of Chemical Structures". Ibid. 1981, 21, 161-168. (d) Bamard, J. M.; Lynch, M. F.; Welford, S. M.'Computer Storage and Retrieval of Generic Chemical Structures in Patents. 4. An Extended Connection Table Representation for Generic Structures". Ibid. 1982, 22, 160-164. (4) The Japanese edition has a name index which helps reduce the time if the substance in question is listed as a specific name. ( 5 ) Chemical Products Safety Division, Basic Industries Bureau, Ministry of International Trade & Industry. 'Handbook of Existing Chemical Substances", 2nd Ed.; The Chemical Daily Co. Ltd.: Tokyo, Japan, 1981. (6) (a) Kudo, Yoshihiro; Yamasaki, Tohru; Sasaki, Shin-ichi. "The Characteristic Polynomial Uniquely Represents the Topology of a Molecule". J . Chem. Doc. 1973, 13, 224-227. (b) Kudo, Yoshihiro; Sasaki, Shin-ichi. "The Connectivity Stack, A New Format for Representation of Organic Chemical Structures". Ibid. 1974, 14,200. (c) Kudo, Yoshihiro; Sasaki, Shin-ichi. 'Principle for Exaustive Enumeration of Unique Structures Constituent with Structural Information". J . Chem. Inf. Compur. Sci. 1976, 16, 43. (d) Kudo, Yoshihiro; Aoki, Shotaro; Takada, Yoshito; Taji, Toyoaki; Fujioka, Ichiro; Higashino, Kazuko; Fujishima, Hisayuki; Sasaki, Shin-ichi. "A Structural Isomers Enumeration and Display System (SIEDS)". Ibid. 1976, 16, 50. (e) Sasaki, Shin-ichi; Abe, Hidethugu; Hirota, Yuji; Kudo, Yoshihiro; Ochiai, Shukichi; Saito, Keiji; Yamasaki, Tohru. "CHEMICS-F: A Computer Program System for Structure Elucidation of Organic Compounds". Ibid. 1978, 18,21 1.

Chemical Inference. 1. Formalization of the Language of Organic Chemistry: Generic Structural Formulas JOHN E. GORDON*' and JOYCE C. BROCKWELL Chemical Abstracts Service, Columbus, Ohio 43210, and Department of Chemistry, Kent State University, Kent, Ohio 44242 Received February 22, 1983 Categorization,syntax, semantics, and history of generic structural formulas (GSFs) are discussed. Their roles in chemical inference, chemical documentation, and chemistry learning are considered in the context of normalization and formalization of languages of structural formulas, chemical equations, and mechanisms. A formal language (ABSF) of homocomposite GSFs and a heterocompite language employing normalized structural variables (NVSF) are defined and merged. Useful formal operations involving these languages, their expressive power, and their relationship to Markush SFs and the GENOA and GENSAL languages are considered. Generic structural formulas, i.e., those that denote whole classes of specific structural formulas (SFs), are heavily used in all domains of the chemical literature. Despite this usage and despite applications in other parts of chemistry to be discussed below, they have not been the subject of a general and fundamental linguistic study, with two partial exceptions: Study of the information-handling aspects of Markush SFs has been motivated by their extensive use in Lynch and his students have recently begun to describe a more fundamental treatment of generic structural formulas of variable compo~ition.~ Our work, which concentrates on fix-

ed-composition generic structures, complements Lynch's rather well. This report describes our initial study and attempts to place it in a broad context of the history of SF-class notation and its applications in chemistry and metachemistry beyond chemical information science. IMPORTANCE OF GENERAL STRUCTURAL FORMULAS Maturation of a science is generally accompanied by formalization of its languages. Chemistry is currently in this stage; its linguistic development is proceeding in two directions.

0095-2338/83/1623-0117S01.50/00 1983 American Chemical Society

118 J . Chem. Znf. Comput. Sci., Vol. 23, No. 3, 1983

First, new machinemanipulable languages are being developed for use in such systems as the structure-elucidation systems CONGEN6 and CHEMICS? the synthesis-planning systems LHASA,* SECS? etc., and the more general chemical behavior system CICLOPS.lo Second, we are engaged in formalizing the ordinary “language of organic chemistry” (LOC), the language in which structural formulas and systematic and common names combine with structural operators and other chemical connectives to produce sentences, most commonly in the form of reaction or mechanism equations. This work involves extension of expressive power and removal of ambiguity in addition to extraction of formal morphological, syntactic, and semantic rule systems for generation of grammatical expressions and study of their interpretations, while retaining the form and flavor of established usage as far as possible. Since a great many of the concepts that chemists routinely express and manipulate involve classes of SFs/compounds, the developing LOC rather keenly experiences the need for an expressive and precise language of generic structural formulas, hence the present study. It is a fundamental characteristic of chemistry that the smallest unit of molecular structure that defines an identifiable and stable set of properties observable in real chemical substances is in most cases a substructure containing 2-20 atoms. Examples are the correspondences between functional groups and their characteristic chemical reactions or between chromophores and their characteristic spectroscopic signals. Thus these substructures (the functional groups, chromophores, etc.) are fundamental chemical concepts. Since a substructure defines a class of structural formulas (those containing the substructure as a subgraph), chemical concepts often involve structural-formula classes. Indeed, one can find in the common inference patterns used by chemists many other examples of manipulation of SF classes (e.g., in making structural analogies and in combining isolated pieces of connectivity information to infer gross molecular structure). Thus one expects search by SF class to be an important tool in chemical information science. Experience has shown SF-class searching via traditional access tools to be unsatisfactory; the comprehensive chemical data bases are not organized or indexed to facilitate SF-class searching. One alternative lies in mechanized substructure search systems ( S S S ) based on comparison of library and query screens (substructures). The substantial current interest in SSS confirms the importance of SF-class searching, and SSS will presumably prove to be the method of choice. There is no substitute, however, for a paper-and-pencil notation in the hands of the working chemist. While direct graphic input to SSS is rapidly displacing composition of queries via search dictionaries, a paper-and-pencil language may always intervene between chemist and machine as a medium for defining and conceptually manipulating SF classes. Thus we visualize three roles for languages of generic structural formulas in chemical information science: (a) as a basis for generic nomenclature and substance-class indexing; (b) as a common interpretive language for discussing (i) input and output SF classes for a mechanized structure-elucidation system and (ii) substructure search-more generic than any SSS query language and independent of their screen organizations and input devices; (c) in chemical patent handling (see ref 5). Concrete examples of items b and a are given later in this paper and in the following paper of this series, respectively. A consistent and complete language of generic SFs will have basic applications in chemical education as well. A good example of the difficulties produced by lack of accurate means for describing SF classes is provided by the classroom discussion of the elucidation of structure of a naturally occurring

GORDONAND BROCKWELL compound from chemical and spectroscopic data. Each piece of structural information defines a set of possible SFs for the unknown, and the possible SFs remaining in consideration at any time are the elements of the intersection of these individual sets. The problem is with what kind of SF to represent the incompletely known structure. If one discusses this process of incorporating successive structural facts with the specific SF that represents the final answer visible to the students, they find it very difficult to visualize the intermediate stages where less structural detail is known than is visible in the fully specific SF. Using the molecular formula for the unknown is better, but it underrepresents what is known, so that established structural features must be imagined at intermediate stages. What one needs is a language of generic SFs in which one can express at each stage in structure elucidation precisely what is known about the connectivity of the substance without implying any structural features that have not been established. At present, no language is available that possesses this degree of expressive power in addition to continuity with the standard, fully specific SF on the one hand and with the fully generic molecular formula on the other. HISTORY OF GENERIC STRUCTURAL FORMULAS

(GSFS) Structural Variables. The use of a variable name, symbolized by alphabetic characters and used to represent any element in a replacement set, appeared in mathematics no later than the seventh century (numeric variables; geometric variables by 1637).” The analogous chemical variables, ranging over replacement sets of atoms, substructures, or complete SFs, date from about 1880,l2within about a decade of the time when ordinary, specific SFs took their present form. Both mathematical and chemicalstructural variables may be either locally defined (e.g., x in the sentence x2 5 x = 32)13 or globally (e.g., i in the sentence xi = 0, where by global definition i E N, and N is the set of natural numbers). Structural formulas containing locally defined structural variables are known as Markush SFs; these are considered in a separate section. We will understand “structural variables” to denote the globally defined variety and consider here the actual extent of “global” and the stability of structural-variable names over time and across schools. Some of the oldest structural-variable names have had the most stable denotations: e.g., M = metal, R = alkyl, X = halogen. Still, throughout the century of their use, different groups of chemists or the same chemists on different occasions have used these signs to denote narrower, very locally (e.g., within one paper) defined replacement sets (e.g., R = CH3, CH3CH2,CH3(CH2)loCH2)or expanded ones (e.g., R = any hydrocarbon group, saturated or unsaturated). A good indication of current usage may be obtained from Morrison and Boyd’s organic chemistry textbook (MB14). MB adhere closely to the narrowest denotations for R and X; they use HA and B for generic acid and base [but sometimes use A = any atom (p 132)] and introduce the useful G = any group in the context of a variable substituent [G is used at least once as a local variable (p 632: G = OH, NH2, NHC6Hs, NHCONH,)]. W, Y, and Z are used only as locally defined variables (e.g., pp 154, 192, 629, 661). As one goes progressively further from academic circles, usage becomes less stable; generic SFs in the patent literature very commonly use R and X as completely local variables, whose defined values apply only to a single SF. In informal transfer of information via SFs, particularly that accompanying chiefly oral discourse, local definitions are slow and cumbersome; speakers tend to rely on globally defined values for structural variables and to presume that their audience uses the same global definition. Since such discourse goes unre-

+

J. Chem. InJ Comput. Sci., Vol. 23, No. 3, 1983 119

CHEMICAL INFERENCE corded, it receives little attention in information science, but its volume and impact are very great. Speakers under time constraint and striving to maintain continuity of discourse often press ordinarily stable variables (e.g., R = alkyl) into broadened or distorted, temporary, and unannounced definitions (e.g., R = any group). This makes the information conveyed ambiguous and is particularly frustrating to learners. In summary, structural variables have never received truly global definitions, and this introduces a certain level of ambiguity wherever their use is not accompanied by a local definition. On the other hand, they are highly valuable. Normalization of global structural-variable definitions is a worthwhile goal. While this may never be achieved across the chemical community, it can be rigorously implemented in formal systems. Part Structural Formulas. When one turns to progressively larger classes of SFs, in which the variable portions of the structure dominate the constant portions, structural variables become less useful. Customary usage then denotes the class of all SFs containing a constant substructure simply by drawing the substructure, usuaIly with the connections to the variable portion shown explicitly as open bonds. Thus, I is

I I I I

CI-C-C-CH=O

I

construed as a generic SF denoting the set of all SF graphs containing this graph as a subgraph, i.e., the class of all 0chloro aldehydes. Brace Structural Formulas. A molecular formula can be construed as the generic SF denoting all the specific SFs that possess the indicated composition. This continuity of molecular formulas with SFs is illustrated nicely by SFs of the type 115 and 2,16 in which the segment in molecular formula format

early 20th century. Structure 3, which represents the alkaloid coclaurin at one stage in its determination of structure, is the last primary-journal instance we have found.’* In what follows we refer to the traditional language of BSFs as TBSF. Brace SFs are equally well suited to represent operations of SF specification, in which discovery of additional specific structural details corresponds to moving atomic symbols from the residue molform to inside the brace, and S F generalization, which does the opposite. Markush Structural Formulas. A Markush S F is a structural formula containing one or more structural variables (atoms, groups, substructures, or subscripts) together with an annexed definition of the replacement set for each variable; the definitions are strictly local and generally apply to one Markush S F only. CLASSIFICATION AND FREQUENCY OF OCCURRENCE OF GSFS GSFs fall into two major categories, depending on whether the elemental composition is fixed or variable. Table I labels these the homocomposite and heterocomposite types, respectively, and subcategorizes the latter according to the type of compositional variation present. Brace SFs handle variable connectivity well, but not variable composition. Markush and (permanently defined) structural-variable SFs do the opposite. One useful and unambiguous combination suggests itself and is shown (labeled “BraceMarkush”) in Table I; we consider various combined notations in a later section. Structural formulas with ambiguous bond placements (e.g., 4 or 5) are often called Markush SFs, even

6 1

denotes constraint of the variable portion of the structure to the indicated elemental composition. To handle molecular formula segments with more than the two open bonds exemplified in 1 and 2, while avoiding implying anything about the points of connection of the fully specific groups to the molecular formula segment, the generic and specific portions were commonly separated by a brace, as in 3, where the partial CH3-O-

~-0H-OHN=

I

C16H13

3

structures are said to be within the brace, and “C16H13”is called the residue molform. Such formulas, which we will refer to as brace SFs (BSFs), grew out of the earliest method of coding connectivity in specific SFs, e.g., in 11. When in N for H

CH3-N-CHO

17

I

H

I1

the period following ca. 1870 connectivity was increasingly indicated by dots or dashes for bonds, the brace SFs continued to be used as generic SFs, e.g., structure 3 to denote the class of all SFs possessing one imino group, one methoxyl, two hydroxyls and a total composition of (C16H13 N H + C H 3 0 + H O + H O =) C17H19N03.Brace-SF usage peaked in the

+

when the composition is constant, as in 4, but the connectivity variation is advantageously conveyed in brace notation, as in 6. Table I1 summarizes the distribution of a sample of 691 GSFs among the categories and graphic representations identified above. In each entry the upper number comes from analysis of the 376 GSFs appearing in one issue of The Journal of Organic Chemistry;37the number beneath refers to the 3 15-item sample of patent-abstract GSFs from Chemical Abstracts.38 Brace SFs are entirely absent. While The Journal of Organic Chemistry sample displays a sizeable fraction of both Markush and ostensibly globally defined structural-variable GSFs (those with no accompanying definition of structural variables), the patent sample uses essentially all Markush SFs. The 204 homologous instances from The Journal of Organic Chemistry were examined in context to attempt to infer the S F class that the author(s) wished to represent. In 133 instances (65%) the usage does not appear to conform to strict, traditional definitions (R used to denote groups beyond alkyl and X for those beyond halogen). The actual denotations of R (again, inferred from context) in both samples of GSFs are shown in Table 111. In the Markush SFs of the patent sample, R, X, Y, etc. are used quite interchangeably in the explicit, local definition of any replacement set. However, it is also clear that in the ostensibly predefined structural-variable GSFs &e., those lacking an explicit local definition) of The Journal of Organic Chemistry sample, R ranges over essentially any replacement set the author pleases. This is probably true of the primary and abstract literature generally. It is not necessarily true of academic and textbook usage, where “permanent” definitions

GORDON AND BROCKWELL

120 J . Chem. In$ Comput. Sci., Vol. 23, No. 3, 1983 Table I. Generic Structural Formulas class-member attributes

type homocomposite

have same molform?

have common substrs?

+

+

heterocomposite nat ur a1

+ -

arbitrary

how class members vary

graphic representation available

connectivity, hybridization, formal electron/charge distribution

brace. e.g.

elemental identity within a periodic family

Markush, e.g., N=CCCH,CO,-M+ for M = Li, Na, or structural-variable, e.g., CH,CH,X Markush, e.g.

“free” variation of substructure

Z-@C0CH

z = NO,, F, C1, H, CH,O 01 structural variable, e.g. G-@cocH

repeating structurally homologous compositionally homologous

non-CIH

-

+ +

number of CH,s in structural formula

+

multiples of any substructure

multiples of CH, in molform

DESIDERATA FOR GENERIC SFS

Table 11. Occurrence of GSF Types globally defined ambiguous structural free-standing bond Markush variable substructure placement

type

0 19

homocomposite heterocomposite natural

1

19 0

c=c
C< and -H appear and the number of open bonds/free valences is exactly one. A is the set of aryl radicals, defined as that subset of P whose elements aje composed solely of ABU tokens of type H-, =C