JOHN A. MORGAN and D. E. H. FREAR Chemical Codification Subcommittee, National Research Council, Washington, D. C.
Presented before the Division of Chemical Education at the 110th meeting of the American Chemical Society in Chicago, September 9-13, 1946.

During the war it was necessary to carry out large-scale test programs to find efficient insecticides, insect repellents, rodenticides, fungicides, antimalarial compounds, chemical warfare toxicants, and other applications of chemical compounds. In some of these cases little work on such a scale had been done previously; there were few clear leads for any particular types of compounds. The only possible method was a general screening program to test available compounds for each type of action.

Now that the wartime needs are past, more deliberate and effective programs can be organized. It is desirable to inspect all compounds that have been tested in some particular way to see if the active compounds have similarities in chemical structure. If such combinations are found, future work along these lines can be directed with more hope of realizing a good percentage of effective compounds.

In order to carry out correlation studies of chemical structure with biological properties, a system of classifying and inspecting chemical compounds is required. Existing systems of indexing compounds, although satisfactory for their intended uses, are not suitable for this new purpose of examining parts of structure. The alphabetical nomenclature system used in Chemical Abstracts does file together compounds having similar "index" structures, but does not allow selection of particular constituent groups. Empirical formula classification shows nothing of structure. The Beilstein system and the corresponding system developed by the Survey of Antimalarial Drugs enable the investigator to index a compound in a unique position, but closely related compounds having slight variation in substituent groups often occur in widely separated locations. None of these systems allows inspection of all parts of the structure of classified compounds at once.

The National Research Council has sponsored the development of a system of classification for use in correlation studies of chemical structure with biological activity. This will be used in conjunction with the screening test program for organic compounds being instigated by the N.R.C. under the direction of Dr. W. R. Kirner. It was decided that machine-sorted punched cards should be used in the work so that large numbers of compounds can be compared quickly in several different ways. Although this system has been developed for chemical-biological correlation studies, similar work can and should be carried out in many other applications of organic compounds.

Drs. D. E. H. Frear, E. J. Seiferle, and H. L. King at Pennsylvania State College have worked on the problem of filing and examining compounds tested as insecticides and fungicides in their studies on the relation of chemical structure to toxicity. The system of classification that they developed has been successfully used for their file of several thousand compounds and is the basis for the system developed by the Chemical Codification Subcommittee of the National Research Council. This subcommittee, under the chairmanship of Dr. C. C. Stock of the Sloan-Kettering Institute for Cancer Research, has as its members Drs. Frear and Seiferle, Dr. Drake of the University of Maryland, Drs. Haller and Hall from the U.S.D.A., Drs. Rouiller and Wardell of Edgewood Arsenal, Dr. A. M. Patterson of Chemical Abstracts, Mr. Churchill of Johns Hopkins University, and Mr. Morgan of the National Research Council. The committee has been working in close cooperation with the Army service groups, the Navy, Government bureaus, the U. S. Patent Office, and individual chemists who are interested in the project. The code that has been developed has reached the stage where we wish to have it examined by many chemists for comments and suggestions.
We expect many ideas for additional uses for the code with slight modifications.

The code is designed to describe the structure and constituent groups of any chemical compound. It can be used with any machine-sorted punched card with one card for each compound. Our examples will use the IBM card and machine since we had one of these machines available in developing the code.

For each compound the card, having 80 columns, is divided into five sections (Figure 1). The first of these is a serial number of six digits to identify the compound and relate it to the master file which will contain references and other data. Thus, six columns can take care of one million compounds by serial number.

FEBRUARY, 1947

The second section of 14 columns gives the empirical formula of the compound. Exact numbers of atoms are given for carbon, hydrogen, bromine, chlorine, fluorine, iodine, nitrogen, oxygen, and sulfur. This takes 11 columns since two each are allowed for carbon and hydrogen. The next three columns show the presence of all additional elements. One column has specific punches assigned to the 11 most commonly occurring elements so that any one or combination of several of these may be shown. The next two columns give the atomic number of any other element. A thorough survey of Chemical Abstracts has shown that this arbitrary system will successfully describe practically all organic compounds. Special punches have been allowed for compounds having more elements than are shown by this system. This empirical formula is for organic compounds only; the problem is much simpler for inorganic compounds.

The third section of the card, 40 columns, adequately describes the structure of any compound. These columns are divided into 10 groups or fields of four consecutive columns each. The structure for any compound is broken down into groups of atoms which can be identified as units, such as the COOH, CHO, and urea structures. Examination of thousands of compounds has shown that their structures can be described with an average of four or five such groups per compound. A few complex compounds may have eight or nine such groups. The ten fields allowed on a card are considered sufficient for practically all compounds, although special punches are available to show 11 or more groups.

The fourth and fifth sections, having 20 columns total, are devoted to relevant physical properties and biological data.
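The column arithmetic above (6 + 14 + 40 + 20 = 80) can be illustrated with a short sketch. It assumes, purely for illustration, that a card is represented as an 80-character string; the names `CardSections` and `split_card` are hypothetical and not part of the committee's system. Because the text does not give the exact division between the fourth and fifth sections, the last 20 columns are kept together.

```python
from __future__ import annotations

from dataclasses import dataclass

# Column widths stated in the article: 6 + 14 + 40 + 20 = 80.
SERIAL_COLS = 6
FORMULA_COLS = 14
STRUCTURE_COLS = 40
DATA_COLS = 20    # physical properties and biological data, kept together here
FIELD_WIDTH = 4   # each structural field is four consecutive columns


@dataclass
class CardSections:
    serial: str        # columns 1-6
    formula: str       # columns 7-20
    fields: list[str]  # columns 21-60, ten fields of four columns
    data: str          # columns 61-80


def split_card(card: str) -> CardSections:
    """Cut an 80-character card image (an assumed representation) into sections."""
    if len(card) != SERIAL_COLS + FORMULA_COLS + STRUCTURE_COLS + DATA_COLS:
        raise ValueError("a card image must have exactly 80 columns")
    a = SERIAL_COLS
    b = a + FORMULA_COLS
    c = b + STRUCTURE_COLS
    structure = card[b:c]
    fields = [structure[i:i + FIELD_WIDTH]
              for i in range(0, STRUCTURE_COLS, FIELD_WIDTH)]
    return CardSections(card[:a], card[a:b], fields, card[c:])
```

With this layout the first structural field begins at column 21, which agrees with the field positions given later in the article.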
Since only a brief record of biological data can be included in the 10 to 15 columns allowed, the serial number can be used to relate the chemical card to a second card containing detailed biological data. This will not always be necessary. Another N.R.C. committee with Dr. McKeen Cattell of the Cornell Medical School as chairman is working on the biological codification scheme.

The code uses the system devised by Dr. Frear and his associates in which the groups making up a compound are arranged in decreasing order of complexity. Families are established according to the empirical constitution of each group (Table 1). The most complex groups making up the first family contain halogen, nitrogen, oxygen, and sulfur in addition to carbon and hydrogen, which are always assumed to be present. An example would be a chlorosulfonamide. The second family contains only nitrogen, oxygen, and sulfur in addition to the carbon and hydrogen. The other families proceed in order (Table 2).

TABLE 1
Sample Families

Family   Constitution   Type of group
0        (CH)NOSX       [illegible]
1        (CH)NOS        Noncyclic groups
2        (CH)NOS        Cyclic groups
3        (CH)NOX        [illegible]
4        (CH)NSX        [illegible]
9        (CH)NS         [illegible]
A        (CH)OS         Noncyclic groups
B        (CH)OS         Cyclic groups
M        CZ rings
N        CH             Noncyclic groups
P        CH             Cyclic groups
Q        Z rings
R        Ionic groups attached [illegible]
S        [illegible] not [illegible]

TABLE 2
(CH)N Noncyclic Groups

Group                        Structure
Biguanides                   H2NC(:NH)NHC(:NH)NH2
Guanidines                   H2NC(:NH)NH2
Diazoamino                   -N:NNH-
Diazonium                    (-N≡N)+
Hydrazones                   :NNH2
Hydrazines                   -HNNH2
Quaternary ammonium          (=N=)+
Amines, secondary            =NH
Amines, primary              -NH2
Imines                       :NH
Cyanamides                   =NCN
Nitriles (cyanides)          -CN
Isonitriles (isocyanides)    -NC

Code numbers F10, F15, and F20 are legible in the code column of Table 2; the remaining code numbers are illegible.
Each structural group is described on the card by a four-digit figure. The first digit is the family designation from one to nine and on through the alphabet. Space is thus left for further refinements. The second and third digits identify the particular structure within a family, using numbers 01 to 99. Actually, in all families many spaces are left so that unused numbers are available for structures that have been missed in making up the code or that are not yet known. The fourth digit designates how many times the group occurs in the compound being described (Table 3).

TABLE 3
Sulfanilamide
[Structural formula of sulfanilamide with the code number of each group; F52 is legible beside the formula.]
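The four-digit figure described above (family, two-digit structure number, occurrence count) can be taken apart mechanically. The following is a minimal sketch; `GroupCode` and `parse_group` are hypothetical names, and accepting both the written form with a point (such as 162.1 or G63.3, as printed in the article) and a compact four-character form is an assumption made for convenience.

```python
from typing import NamedTuple


class GroupCode(NamedTuple):
    family: str     # one character: a digit or a letter
    structure: int  # two-digit structure number within the family
    count: int      # how many times the group occurs in the compound


def parse_group(figure: str) -> GroupCode:
    """Parse a group figure written as '162.1' or, compactly, '1621'."""
    compact = figure.replace(".", "")
    if len(compact) != 4:
        raise ValueError(f"expected four characters, got {figure!r}")
    family, structure, count = compact[0], compact[1:3], compact[3]
    if not family.isalnum() or not structure.isdigit() or not count.isdigit():
        raise ValueError(f"malformed group figure: {figure!r}")
    return GroupCode(family.upper(), int(structure), int(count))
```

For instance, `parse_group("162.1")` yields family `"1"`, structure `62`, and count `1`.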
Sulfanilamide would have the 162.1 code number since the 100 family is for carbon-nitrogen-oxygen-sulfur groups, the 62 designates the unsubstituted sulfonamide group, and the 1 shows there is only one such group in the compound.

In most cases two families have been assigned for each type of empirical constitution, one for noncyclic and the other for cyclic forms. For example, the F family codes carbon-nitrogen structures, such as amines, cyanamides, guanidines, and diazo structures, while the G family codes carbon-nitrogen heterocyclic forms, such as are found in pyridine and quinoline.

The groups that are described are not just the
functional groups but include every part of the compound. The position of the atoms in the compound cannot be shown exactly, so all position isomers will be coded alike as long as their constituent groups are alike. Ethyl propionate and propyl acetate are coded identically since each contains three groups: the ester, two carbons in a chain, and three carbons in a chain. On the other hand, p-aminophenol is coded with the hydroxy, the amino, and the phenyl groups, whereas phenyl hydroxylamine, an isomer, is coded with the phenyl and hydroxylamine groups. We believe that the groups chosen will adequately describe the structure of any compound for the purpose of comparing it with other compounds.

Because of the multiplicity of heterocyclic groups, an original mathematical scheme has been devised for their codification. The first digit within the family indicates the number of members in a ring, and the second digit the number of carbon atoms in that ring. The carbon-nitrogen heterocyclic structures, for example, make up the G family; pyridine is coded as G65 because it is a six-membered ring containing five carbons. Fused structures use this same system, being broken down into their component single ring structures. Hexamethylenetetramine, which is made up of three fused six-membered rings each containing three nitrogens and three carbons, is coded as G63.3.

The carbocyclic structures are treated differently, each of the most common structures being identified with a separate number. A few examples are benzene or phenyl, cyclohexadiene, cyclohexene, cyclohexane, anthracene, benzonaphthene, and phenanthrene. The structures not occurring commonly are grouped together under one number. For instance, all single rings containing three or four carbons are given the code number P97.

Two methods have been developed to code the organometallic and inorganic constituents. These have been designated Plans A and B.
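Returning to the numbering scheme for heterocyclic rings described above, the rule (ring size as the first digit, carbon count as the second) is simple enough to sketch in a few lines. The function name `ring_code` is hypothetical, and restricting ring size and carbon count to single digits is an assumption, since the article does not say how larger rings are handled.

```python
def ring_code(family: str, ring_size: int, carbons: int, count: int = 1) -> str:
    """Build a heterocyclic group figure such as 'G65.1'.

    family    -- family letter, e.g. 'G' for carbon-nitrogen heterocycles
    ring_size -- number of members in the single ring
    carbons   -- number of carbon atoms in that ring
    count     -- number of such rings in the compound
    """
    if not 1 <= ring_size <= 9:
        raise ValueError("ring size must fit in one digit (assumed limit)")
    if not 0 <= carbons <= ring_size:
        raise ValueError("carbon count cannot exceed ring size")
    if not 1 <= count <= 9:
        raise ValueError("occurrence count must fit in one digit")
    return f"{family}{ring_size}{carbons}.{count}"


# Examples taken from the article:
pyridine = ring_code("G", 6, 5)                    # one six-membered ring, five carbons
hexamethylenetetramine = ring_code("G", 6, 3, 3)   # three rings, three carbons each
```

The two examples reproduce the figures G65 (with an occurrence count of 1) and G63.3 quoted in the text.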
Plan A, which is currently preferred by the committee, involves the placement of the organically-linked elements into family R, and the inorganic groups into family S, with specific rules to define this separation. The groups in family R are limited to structures having the central inorganic atom linked directly to carbon. An example of a typical R group is found in ethanephosphonic acid, C2H5PO(OH)2, in which the phosphorus atom is given the number R15.1 (family R plus atomic number of element). Family S, under this system, includes those inorganic atoms not linked directly to carbon, such as the phosphorus in the phosphate radical. As in family R, the code number consists of the family designation, S, plus the atomic number. Triethyl phosphate has three ethyl groups plus S15.1 to show one phosphorus atom, and S08.4 to show four oxygen atoms.

Plan B, which is also under consideration by the committee, involves the separation of the organically-linked and inorganic groups on a chemical basis into three families. The first of these includes the organically-linked elements and radicals, the second the cations, and the third the anions. This plan, although perhaps more chemically sound than Plan A, becomes quite complex in practical application. For this reason it is being studied further by the committee and will not be presented at this time.

JOURNAL OF CHEMICAL EDUCATION

The groups are punched on the card in family order so that columns 21 to 24, the first field, will always contain the most complex group. By this system, however, no one group will always be in a certain field but may occur in any position across the card depending on the number of earlier groups for that compound. For instance, in p-aminophenol the hydroxy group, H60, is coded in the second field, but in hydroxyaminoquinoline the H60 is in the third field since the amino and heterocyclic carbon-nitrogen ring are punched in the first and second fields. At first inspection this sounds as if selecting a certain group would be a long tedious process of sorting in every field across the card. Actually, the sorting can be so organized that a majority of the cards are eliminated immediately and the work involved is not prohibitive.

The correlation of chemical structure with biological activity may be approached from either the chemical or biological angle. For example, all compounds having a certain combination of constituent groups could be selected and then examined by machine for their biological activity. On the other hand, the cards for compounds having a high degree of activity in some biological application could be selected and sorted for general survey of common structural characteristics. The physical properties could be treated the same.

Mention has been made earlier of special punches. For columns in which only single digit numbers are desired, punches 0 to 9 are used. The other two punches at the top of the card, called the 11th and 12th punches, can be used individually or together for special meanings.
In this way such characteristics as isotopes, radioactivity, the presence of unusual elements, and more than ten structural groups in a compound can be indicated on the card. It is anticipated that individual investigators will use these punches for data in which they are especially interested.

As an example of the mechanical process of sorting, we might wish to select all amino sulfonamide compounds like the biologically important sulfanilamide. The 525 compounds that we have as examples represent almost every type of organic compound. In order to eliminate most of them we can sort in the empirical formula section to select all compounds containing two or more nitrogen atoms and one or more sulfur atoms. The next criterion for the desired compounds is that they contain a sulfonamide group, which will be 160, 161, or 162. To select these, we sort in column 21 for 1. Those that have 1 must further be sorted in column 22 for a 6. This must be repeated in every field until the compounds show a group in the 2 family or in some later family. We then know that there can be no 100-family group on these cards. The sulfonamides selected are then sorted in the same way for F50, F51, and
F52, the amino groups. This combination of using the empirical formula and searching the fields, taking care to eliminate first all cards that cannot possibly fit the requirements, enables us to find desired cards quickly.

On the practical side of this work, most of the time involved is in coding the compounds. We have found that after some experience a chemist can code between 30 and 50 compounds per hour depending on their complexity. The punching by a trained operator can be carried out at the rate of 100 compounds an hour. The machine will sort cards at about 25 thousand an hour. We visualize the work of sorting as being done occasionally as problems arise, not continually.
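The selection of amino sulfonamides described above can be imitated in software as a rough sketch. Everything about the representation here is an assumption: each card is a dictionary holding a serial number, atom counts, and the compact group figures in family order; families are assumed to rank as digits followed by letters in plain alphabetical order; and the benzene ring figure `P011` is a placeholder, since the article does not give that code. The function names are hypothetical.

```python
# Assumed ranking of families: digits first, then letters in alphabetical order.
FAMILY_ORDER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def has_group(groups, family, structures):
    """Scan the fields left to right for a group of the given family.

    Because groups are punched in family order, the scan stops as soon as a
    later family appears; no remaining field can then hold the wanted group.
    """
    target = FAMILY_ORDER.index(family)
    for figure in groups:
        rank = FAMILY_ORDER.index(figure[0])
        if rank > target:
            return False
        if rank == target and int(figure[1:3]) in structures:
            return True
    return False


def select_amino_sulfonamides(cards):
    """Return serial numbers of cards passing the three sorts in the text."""
    hits = []
    for card in cards:
        formula = card["formula"]
        # First sort: empirical formula, N >= 2 and S >= 1.
        if formula.get("N", 0) < 2 or formula.get("S", 0) < 1:
            continue
        # Second sort: a sulfonamide group, 160, 161, or 162.
        if not has_group(card["groups"], "1", {60, 61, 62}):
            continue
        # Third sort: an amino group, F50, F51, or F52.
        if has_group(card["groups"], "F", {50, 51, 52}):
            hits.append(card["serial"])
    return hits


SAMPLE_CARDS = [
    # Sulfanilamide; "P011" is a placeholder for the benzene ring, not from the article.
    {"serial": "000001", "formula": {"C": 6, "H": 8, "N": 2, "O": 2, "S": 1},
     "groups": ["1621", "F521", "P011"]},
    # Pyridine-3-sulfonamide: sulfonamide plus pyridine ring, no amino group.
    {"serial": "000002", "formula": {"C": 5, "H": 6, "N": 2, "O": 2, "S": 1},
     "groups": ["1621", "G651"]},
    # p-Aminophenol: rejected by the empirical formula sort (no sulfur).
    {"serial": "000003", "formula": {"C": 6, "H": 7, "N": 1, "O": 1},
     "groups": ["F521", "H601", "P011"]},
]
```

Calling `select_amino_sulfonamides(SAMPLE_CARDS)` keeps only the sulfanilamide card. Running the cheap formula test first mirrors the advice in the text to eliminate at the outset all cards that cannot possibly fit the requirements.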
Files of punched cards can be maintained in any desired order (serial number, empirical formulas, etc.) by sorting the cards at the end of each work period. Filing the cards in order, however, is useful only when it serves to reduce the number of cards to be sorted. It might be desirable in some cases to have duplicate cards, one for each type of data that is recorded.

The punched cards do not add any knowledge to what is known, nor can the machines think. The value of the system is that it selects all available data of any type for statistical analysis and comparison quickly. We hope that you will send us your comments, ideas, and questions concerning the code.