6 Algorithms in the Computer Handling of Chemical Information
LOUIS J. O'KORN
Systems Development Dept., Chemical Abstracts Service, The Ohio State University, Columbus, OH 43210
Downloaded by UNIV OF MASSACHUSETTS AMHERST on September 5, 2015 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0046.ch006
The chemical literature emphasizes the detailed structural characteristics of chemical substances; this paper addresses computer-based algorithms that support the handling of information about chemical substances. The nature of problems requiring an algorithmic solution, examples of specific algorithms to support these solutions, and some of the continuing problems are discussed. Since representation affects the nature of algorithms, several of the computer representations of a chemical substance are mentioned. For these representations, algorithm developments that perform interconversion, registration, and structure searching are discussed.

Introduction

The techniques utilized in chemical information handling systems fall into two categories: those which handle the processing of text and those concerned with the processing of chemical substance information. The general text handling processes in chemical information handling systems are not substantially different from the processes of information handling systems for other scientific disciplines. Although not discussed here, substantial progress has been made in the development of computer-based algorithms for text information handling systems. These computer-based text information handling systems provide for data base compilation to support traditional printed publication and also the selective dissemination of the information.
Algorithm development in the areas of computer editing, data base management, sorting, computer-based composition, and text searching has been critical to the overall development of computer-based primary and secondary publications systems and text search services. Results of these developments are illustrated in the computer-based information system used at Chemical Abstracts Service (CAS) [1]. Lynch [2] describes principles and techniques for the computer-based information services and Cuadra [3] provides annual reviews of developments in information handling.

In Algorithms for Chemical Computations; Christoffersen, R.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.

It is the set of methods for representing, sorting, manipulating and retrieving information about chemical substances that distinguishes the techniques of chemical information handling from those of other disciplines. Chemical literature emphasizes the detailed structural characteristics of chemical substances. This is illustrated by the fact that for the 392,000 documents abstracted in 1975 in CHEMICAL ABSTRACTS, 1,514,000 chemical substance index entries were generated. Of these chemical substance index entries, 368,000 corresponded to substances which were reported for the first time in 1975.

This paper addresses the computer-based algorithms that support the handling of chemical substance information. Since the methods used to represent information about chemical substances are critical to the nature of the algorithms used, a variety of chemical substance representation systems are presented, along with the various system processes necessary to handle computer-based files of chemical substance information. The algorithm developments that support these system processes are summarized, and sample algorithms are provided in the appendix to illustrate supporting system processes in areas of registration, substructure searching, and interconversions. Lynch and others [4] provide an overview of principles and techniques for computer handling of information on chemical substances, and the characteristics of information handling systems utilizing these principles and techniques.
Representations of Chemical Substance Information

Chemical structure diagrams are two-dimensional visual descriptions of a chemical substance and provide an important medium for communications between chemists. Employing conventions for representing the three-dimensional structural features in the plane, these structure diagrams fall short of describing geometrical reality but they are the accepted way to describe chemical substances. Because structural diagrams are difficult to convey both orally and in written text, several other representation systems have been developed. Many of these chemical substance representation systems were developed prior to, but have been utilized in, computer-based chemical substance information handling systems. In addition, several representation systems more amenable to algorithmic computer processing have been developed.

For input, storage, manipulation, and output within computer-based systems, a representation of the chemical substance must be selected. The selection of a particular representation scheme for an information system is based on the size of the files to which it applies, the functions to be performed, the available hardware and software, and the desired balance between
manual and machine processes. The substance representation system is critical to the nature of algorithms in computer-based chemical substance information handling systems.

Not all representations are of equivalent descriptive power. Two important characteristics of a representation are unambiguity and uniqueness. A representation is unique if, upon applying the rules of the system to a chemical substance, only one representation can be derived. A representation is unambiguous if the representation applies to only one chemical substance, although there may be more than one possible representation for each chemical substance. For example, in Figure 1a, the systematic name provides a unique, unambiguous representation. The molecular formula, Figure 1b, is a unique but ambiguous representation; unique because for any chemical substance there is only one molecular formula, but ambiguous because isomers also have this molecular formula. The arbitrarily numbered connection table, Figure 1c, provides a non-unique, unambiguous representation. The representation is unambiguous since it corresponds to one and only one substance, but it is not unique because alternative numberings of the connection table would result in different representations for the same chemical substance (the connection table representation is discussed in more detail below).
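The distinction between uniqueness and unambiguity can be made concrete with a short sketch. The Python below encodes the connection table of Figure 1c and a second table for an isomer; the data layout, the function names, the valence table used to infer hydrogens, and the ortho isomer chosen for comparison are all illustrative assumptions, not part of the systems described in this paper.

```python
from collections import Counter

# Figure 1c, heavy atoms only: atom number -> (element, {neighbor: bond type}).
# 'S' = single bond, 'D' = double bond.
PARA = {
    1: ("Cl", {2: "S"}),
    2: ("C", {1: "S", 3: "S", 7: "D"}),
    3: ("C", {2: "S", 4: "D"}),
    4: ("C", {3: "D", 5: "S"}),
    5: ("C", {4: "S", 6: "D", 8: "S"}),
    6: ("C", {5: "D", 7: "S"}),
    7: ("C", {2: "D", 6: "S"}),
    8: ("Cl", {5: "S"}),
}

# Hypothetical comparison structure: the 1,2-dichloro isomer (Cl on atoms 2 and 3).
ORTHO = {
    1: ("Cl", {2: "S"}),
    2: ("C", {1: "S", 3: "S", 7: "D"}),
    3: ("C", {2: "S", 4: "D", 8: "S"}),
    4: ("C", {3: "D", 5: "S"}),
    5: ("C", {4: "S", 6: "D"}),
    6: ("C", {5: "D", 7: "S"}),
    7: ("C", {2: "D", 6: "S"}),
    8: ("Cl", {3: "S"}),
}

# Assumed simple valence model, used only to infer implicit hydrogens.
VALENCE = {"C": 4, "Cl": 1}
BOND_ORDER = {"S": 1, "D": 2}


def renumber(table, mapping):
    """Return the same structure with atom i relabeled as mapping[i]."""
    return {
        mapping[i]: (element, {mapping[j]: bond for j, bond in neighbors.items()})
        for i, (element, neighbors) in table.items()
    }


def molecular_formula(table):
    """Molecular formula (C, H, then other elements alphabetically)."""
    counts = Counter()
    for element, neighbors in table.values():
        counts[element] += 1
        used = sum(BOND_ORDER[bond] for bond in neighbors.values())
        counts["H"] += VALENCE[element] - used
    others = sorted(e for e in counts if e not in ("C", "H") and counts[e])
    ordered = [e for e in ("C", "H") if counts[e]] + others
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}" for e in ordered)


# Non-unique: reversing the numbering gives a different table for the same substance.
flipped = renumber(PARA, {i: 9 - i for i in PARA})
print(flipped != PARA)

# Ambiguous: two different substances share one molecular formula.
print(molecular_formula(PARA), molecular_formula(ORTHO))
```

The renumbered table differs from the original as data although it describes the same substance, which is the non-uniqueness noted for Figure 1c, while both isomers collapse to the single molecular formula of Figure 1b.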
In addition to being categorized according to their uniqueness and ambiguity, chemical substance representations commonly used within computer-based systems can be further classified as systematic nomenclature, fragment codes, linear notations, connection tables, and coordinate representations.

Systematic Nomenclature. Systematic nomenclature provides a unique, unambiguous representation of a chemical substance by the application of a rigorous set of systematic nomenclature rules. A representation of a chemical substance is constructed by applying these nomenclature rules to combine terms which describe the individual rings, chains, and functional groups within the chemical substance. Chemical nomenclature provides a representation which can be interpreted directly by the practicing chemist, is generally suitable for oral discourse, can be used in a printed index, and is increasingly available in computer-readable files. Davis and Rush [5, Chapter 8] describe the origin, development, and examples of systematic nomenclature systems. Figure 2 provides an example of systematic nomenclature utilizing the CHEMICAL ABSTRACTS NINTH COLLECTIVE INDEX Nomenclature Rules [6]. The systematic name in this example is cyclohexanol, 2-chloro-. It is generated by (1) determining the principal functional group, the OH group; (2) determining the ring or chain to which it is directly attached, cyclohexane; (3) naming the functional group and its attached ring, cyclohexanol; and (4) naming all other functional groups and skeletal fragments, 2-chloro, where the locant 2 identifies the point of
attachment to the cyclohexane ring.

Fragment Codes. Fragment codes are a series of predefined descriptors which are assigned to significant substructural units, e.g., rings or functional groups. A given code is assigned to a chemical substance if the structural component occurs within the chemical substance. Typically, fragment codes provide a unique, ambiguous description of a chemical substance. With the introduction of punched-card systems, fragment code systems became popular because of the simplicity of representation and the ease of the coding and searching operations. Since fragment codes offer only a partial description of a chemical substance based on predefined descriptors, two kinds of failure arise: substructural components that were not initially anticipated and defined cannot be searched, and searches produce extraneous retrievals of structures that contain the needed fragments but not in the desired relationships. Although fragment codes are valuable for subclassification of files, in the case of large files, fragment codes are usually accompanied by other, more complete representations. Figure 3 provides an example of a fragment code representation utilizing the Ring Code System [7], with codes corresponding to the card columns and punches for the particular characteristic cited.

Linear Notation. Linear notation systems use a linear string consisting of a set of symbols to represent complete topological descriptions of chemical substances. Each system has symbols which represent atoms or groups of atoms, a syntax to describe interconnections, and rules for ordering the symbols to provide a unique and unambiguous representation of the topology of a chemical substance. Once a linear notation has been derived by applying a set of ordering rules, it is easy to input and requires no specialized input equipment. The representation is very compact and the file structure is simple; also linear notations can be utilized in printed indexes. Davis and Rush [5, Chapter 9] provide general information on linear notation systems and a more detailed discussion of the origin and development of the IUPAC, Wiswesser, Hayward, and Skolnik linear notation systems. Figure 4 provides an example of a representation using Wiswesser Line Notation. For this example, the Wiswesser Line Notation is L6TJ AQ BG. The ring system is cited first and is represented by L6TJ, where L indicates the start of a carbocyclic ring, 6 indicates a six-member ring, T indicates that the ring is fully saturated, and J indicates the end of the ring system. The substituents Cl and OH are represented by G and Q, respectively, and their positions of attachment are identified by the locants A and B. Since Q occurs later than G in the defined collating sequence, Q is cited before G.
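The reading of L6TJ AQ BG given above can be followed step by step in code. The sketch below is not a Wiswesser Line Notation parser: it recognizes only the symbols explained in this example (L, a ring size, T, J, and the substituent symbols G and Q with single-letter locants), and the function name and output layout are assumptions made for illustration.

```python
import re

# Only the substituent symbols explained in the text are known to this sketch.
SUBSTITUENT_SYMBOLS = {"Q": "OH", "G": "Cl"}


def decode_simple_wln(notation):
    """Decode a notation shaped like 'L6TJ AQ BG' into its described parts."""
    ring, *substituents = notation.split()

    # L = start of a carbocyclic ring, digits = ring size,
    # optional T = fully saturated, J = end of the ring system.
    match = re.fullmatch(r"L(\d+)(T?)J", ring)
    if match is None:
        raise ValueError(f"ring description not covered by this sketch: {ring!r}")

    decoded = []
    for item in substituents:
        locant, symbol = item[0], item[1:]
        if symbol not in SUBSTITUENT_SYMBOLS:
            raise ValueError(f"substituent symbol not covered: {symbol!r}")
        decoded.append((locant, SUBSTITUENT_SYMBOLS[symbol]))

    return {
        "carbocyclic": True,
        "ring_size": int(match.group(1)),
        "saturated": match.group(2) == "T",
        "substituents": decoded,
    }


print(decode_simple_wln("L6TJ AQ BG"))
```

For the notation in the text, the result records a saturated six-member carbocyclic ring with OH at locant A and Cl at locant B, matching the structure named cyclohexanol, 2-chloro-.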
a.) Systematic Nomenclature: Benzene, 1,4-dichloro-
b.) Molecular Formula: C6H4Cl2
c.) Connection Table:

[Structure diagram with numbered atoms]

Atom No.   Element   Bonds    Connections
1          Cl        S        2
2          C         S,S,D    1,3,7
3          C         S,D      2,4
4          C         D,S      3,5
5          C         S,D,S    4,6,8
6          C         D,S      5,7
7          C         D,S      2,6
8          Cl        S        5

Figure 1. Various representations of the chemical substance
[Structure diagram residue: cyclohexane ring labeled with Cl (chloro) and OH substituents]