
Computer controlled television scan system for direct encoding of chemical structure models. W. S. Woodward, and T. L. Isenhour. Anal. Chem. , 1974, 4...
Computer Controlled Television Scan System for Direct Encoding of Chemical Structure Models

W. S. Woodward and T. L. Isenhour

Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514

A major difficulty for the widespread use of chemical information retrieval systems is the rapid encoding of chemical structures into computer compatible format. This work describes the automatic generation of a connectivity table in computer compatible format by television scanning of stylized chemical structure models coupled with on-line computerized pattern recognition.

Over the last decade, chemical information retrieval systems have advanced very rapidly, both in response to an evident need and because high-speed digital computers have become generally available (1-5). A major difficulty remains, however, in the widespread use of such systems, particularly for the occasional user. This is the problem of encoding chemical structures, necessarily the key interrogative in many chemical questions, into computer compatible format. Linear notations, such as Wiswesser Line Notation (WLN), have been developed which allow complete representation of chemical structures in computer compatible form. However, the actual encoding procedure still constitutes a major hurdle. Large agencies such as Chemical Abstracts Service have embarked upon major projects to generate computer-manipulatable notation for large files of chemical structures. In such projects, because of their production line nature, it may be possible to make the cost per structure encoded economically reasonable. However, the user of such systems must also have the ability to encode in order to interrogate the system, and, for the occasional user, familiarization with complex coding rules constitutes an inordinate investment of time for notations such as WLN.

The generation of an atom by atom connectivity table has frequently been an intermediate in the encoding procedure. Several studies have developed computer programs to produce linear notation from such connectivity tables. In these studies, the connectivity table is generated either by direct human construction or by interactive human-machine construction through a device such as a Rand Tablet (6-9). This work describes the automatic generation of a connectivity table in computer compatible format by television scanning of stylized chemical structure models coupled with on-line computerized pattern recognition.

(1) C. M. Bowman, F. A. Lander, and M. H. Reslock, J. Chem. Doc., 7, 43 (1967).
(2) C. M. Bowman, F. A. Lander, N. W. Lee, M. H. Reslock, and B. P. Smith, J. Chem. Doc., 10, 50 (1970).
(3) R. J. Feldmann and D. A. Koniver, J. Chem. Doc., 11, 154 (1971).
(4) R. J. Feldmann, S. R. Heller, K. P. Shapiro, and R. S. Heller, J. Chem. Doc., 12, 41 (1972).
(5) R. J. Feldmann and S. R. Heller, J. Chem. Doc., 12, 48 (1972).
(6) C. M. Bowman, F. A. Lander, N. W. Lee, and M. H. Reslock, J. Chem. Doc., 8, 133 (1968).
(7) C. D. Farrell, A. R. Chauvenet, and D. A. Koniver, J. Chem. Doc., 11, 52 (1971).
(8) S. R. Heller and D. A. Koniver, J. Chem. Doc., 12, 55 (1972).
(9) D. A. Koniver, W. J. Wiswesser, and E. Usdin, Science, 170, 1437 (1972).

ANALYTICAL CHEMISTRY, VOL. 46, NO. 3, MARCH 1974

The guiding assumption throughout this work has been that the most time-consuming and costly step in the preparation of structural information for presentation to computer-based retrieval systems is that of translation from traditional, highly readable graphical chemical models to the inevitably abstruse symbol strings compatible with common computer input devices. When the encoding method chosen is a simple one (e.g., hand encoding to connectivity tables), the volume of code required to represent structures tends to be large and tedious to produce. If a more efficient coding process is designed (e.g., Wiswesser Line Formulae), the resulting symbol strings are more compact but the coding procedure may be forbiddingly complex. Indeed, there seems to be diminishing reason to hope that any easy manual bridge exists between the two-dimensional medium of the structural diagram and a one-dimensional computer input form.

Reasoning along these lines leads to the conclusion that the ability to accept two-dimensional structural diagrams as direct computer input would be an important enhancement of chemical information retrieval systems. Use of interactive graphical input devices such as CRT displays and associated photosensitive pens, “Rand Tablets,” “joysticks,” etc., represents one set of methods to this end. In contrast to these approaches, which require the dedication of rather expensive hardware to the model-building process, is the concept of a system in which the model more closely resembles traditional structural diagrams. In such an arrangement, structural input would consist of a graphical construct (ideally simply the common molecular diagram) buildable easily without the assistance of, and possibly remote from, computing facilities. Thus the requirement for on-line computer access during input preparation is avoided.
Actual acceptance of the model by the computer would then occur through a scanning process divorced from, and hence not limited by the speed of, the model construction process. The work herein described explores an implementation of just such a graphical representation system under the following constraints and considerations.

1. The scanning system was to directly accept two-dimensional models constructed of materials and on a scale suitable for easy manipulation. Thus no photographic or other preprocessing was to be permitted as part of the encoding procedure.

2. The entire scanning, recognition, and encoding procedure was to proceed without human intervention, once the model had been “presented” to the equipment and the process initiated. (The system was not to depend on getting “hints” from the operator.)

3. Overall system performance parameters were of necessity qualitative but, in general, speed and accuracy of encoding were to be clearly superior to manual procedures.

4. The necessary scanning hardware was to be constructed within a minimum budget and be controlled by an existing available minicomputer: a Raytheon 704 with 8K word memory. The hardware was to interface to the programmed I/O bus of this computer and was to require no nonstandard modification of the CPU mainframe, as the 704 was to be shared with other projects.

SCANNING HARDWARE

Of first concern in design of the scanning hardware was selection of the basic graphical input device. Of many exotic possibilities, the combination of economic restrictions and performance requirements dictated selection of a low cost vidicon television camera. From among many suitable alternatives, a Panasonic WV-200P was chosen. This camera provides, for a purchase price of $220.00, a resolution of approximately 600 lines in the center of the scanned frame, optics capable of accepting the range of diagram sizes envisioned, and automatic light adaptation. Interface electronics, providing communication between computer, camera, and television monitor, were constructed along straightforward lines, and the resulting system is diagrammed in Figure 1.

The structural diagram to be scanned is presented to the television camera, which converts it to a high-speed analog signal. This signal consists of a serialization of the diagram reflectivity as produced by a left to right, top to bottom scan (raster). Because interlace of alternate scans is not presently employed, a complete scan is produced every 16.7 milliseconds and consists of approximately 240 horizontal lines of 320 resolvable elements each. Thus, the viewed diagram is effectively dissected into 76,800 discrete picture elements. Superimposed upon the picture-describing signal is synchronizing information indicating both frame and individual horizontal-line start times.

The camera output is conveyed via coaxial cable to the camera/computer interface, where it is reseparated into video and synchronization information. The continuum of light intensity levels detected by the camera is reduced to a light/dark decision by a thresholding circuit (Schmitt trigger) so that each frame element may be encoded as a single binary digit.
The output of the thresholding circuit is sampled approximately once every 200 nanoseconds, and the resulting 5-MHz bit stream is assembled into 16-bit horizontal line samples for input to the computer I/O bus. Address registers located in the frame address logic accumulate pulses from the horizontal sample oscillator and horizontal sync separator to maintain a running indication of the particular portion of the video frame being serialized at any instant. By presenting to this logic an appropriate vertical address (8 bits) specifying a particular scan line, and horizontal address (9 bits) (within that line) specifying the end of a 16-element segment, the computer can interrogate any 16-bit horizontal segment within the frame. While only one such segment can be sampled within any scan of a given line, successive lines can be addressed and sampled independently.

To serve as an aid to camera alignment and system adjustment, the recovered sync and binary video are recombined for display on a video monitor (slightly modified conventional TV receiver). Additionally, to provide an indication of computer/camera interaction, the monitor image is intensified in those regions of the frame undergoing scrutiny by the system. This display gives a helpful telltale of progress made by the encoding algorithm as it processes the model.
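The frame geometry and segment addressing described in this section can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names, the layout of the packed address word, and the representation of a thresholded frame as a list of rows of 0/1 values are hypothetical and are not taken from the interface hardware itself.

```python
LINES = 240      # horizontal lines per non-interlaced scan (from the text)
ELEMENTS = 320   # resolvable elements per line (from the text)
SEGMENT = 16     # elements delivered to the computer per sample (from the text)


def picture_elements():
    """Number of discrete picture elements in one frame."""
    return LINES * ELEMENTS


def sample_rate_mhz(period_ns=200):
    """Bit rate implied by one threshold sample every period_ns nanoseconds."""
    return 1000 / period_ns


def pack_address(line, end_element):
    """Combine an 8-bit line address and a 9-bit end-of-segment address.

    The packing order (line in the high bits) is an assumption made for
    this sketch; the text only gives the two field widths.
    """
    if not 0 <= line < LINES:
        raise ValueError("line outside frame")
    if not SEGMENT - 1 <= end_element < ELEMENTS:
        raise ValueError("segment would not fit in the line")
    return (line << 9) | end_element


def read_segment(frame, line, end_element):
    """Return the 16 elements ending at end_element as one integer.

    frame is a list of LINES rows of ELEMENTS values (1 = black).
    The left-most element becomes the most significant bit (assumed).
    """
    start = end_element - SEGMENT + 1
    word = 0
    for bit in frame[line][start:end_element + 1]:
        word = (word << 1) | bit
    return word


if __name__ == "__main__":
    frame = [[0] * ELEMENTS for _ in range(LINES)]
    frame[10][20] = 1  # one black element on line 10
    print(picture_elements(), sample_rate_mhz())
    print(bin(read_segment(frame, 10, 31)))
```

The two address widths are just sufficient for the stated frame: 8 bits cover 240 lines and 9 bits cover 320 elements.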

Figure 1. Scanning hardware

FORMAT FOR THE INPUT MEDIUM

Although the ideal input medium for a graphically interrogated chemical information retrieval system would obviously consist of unrestricted molecular diagrams, it was recognized very early in the project that concessions would be necessary to ease the problems of model recognition and interpretation. The first area encountered which seemed to require notational modifications was the design of the symbols used to represent structural elements in the diagram. While the technology of optical character recognition (OCR) as a computer input method is relatively mature, existing OCR techniques were inadequate for the problem at hand in two respects.

First, the usual OCR application consists of computer reading of typed or printed documents for subsequent processing. The encoding process is aided by many topological and graphical conventions which characterize Roman alphabet text and which simplify the pattern recognition algorithm. Examples of these helpful properties include:

1. Printed or typed text consists solely of the characters to be recognized. This seemingly trivial property is of enormous help to the designer of OCR equipment but is so common that he may seldom recognize it. Because a suitable chemical modeling nomenclature is almost sure to contain noncharacter information (e.g., bond indicating lines), this property could not be utilized in the planned system.

2. The characters comprising text are spatially disjoint and arrayed serially within the document in a priori defined ways (i.e., left-to-right, top-to-bottom). Thus, having found the first character of a document to be scanned (or even, in many cases, just having “found” the document!), a conventional OCR system has very reliable simple procedures for proceeding to the rest.

The importance of these properties to the success of automatic OCR is indicated by the lack of progress achieved by the technology in situations which lack them. Machine reading of cursive script, for example, in which the dissection of text into individual characters must precede recognition, has been largely unsuccessful.

Second, the successful mechanical recognition of standard character fonts requires that the internal graphical representation of each character consist of many (usually several hundred) resolved picture elements.
This fact arises because character fonts not explicitly designed for automatic processing often depend on relatively small features to distinguish certain character pairs (‘Q’ from ‘O’, for example). Any OCR system able to cope with such fonts must, therefore, resolve to at least this level of detail. Given the resolving capabilities of inexpensive television cameras (indeed, of the 525-line television standard itself) and the desired system capability for diagram complexity, the ability to reliably recognize standard fonts seemed impractical.

The design of a character set suitable for construction of computer readable molecular diagrams was undertaken with five goals in mind.

1. All members of the set were to share features of appearance easily recognizable by the scanning algorithm which would serve to reliably distinguish them from noncharacter diagram components and permit rapid and accurate location of characters for identification.

2. The character set was to be designed so as to minimize sensitivity of the overall encoding process to expected forms of camera error: e.g., video noise, spatial jitter, geometric distortion, etc.

3. The chosen format should make maximum use of available camera resolution.

4. The overall character outline should be “plane-filling.” Thus, characters could be arranged in compact clusters where the represented molecular topology permits, yielding a good density of complexity and efficient resolution utilization.

5. The resulting character set should be as “readable” as possible to the human model builder.

Figure 2. Character format (carbon element)

The resulting character format is shown in Figure 2. The character pattern consists of two components. The identity of the character represented is contained in the inner 3 x 3 array of 3-unit-square elements. Thus, a total of 2^9, or 512, distinct symbols is possible. Each identification “bit” is represented by a square consisting itself of 9 scan elements in order to achieve, through the resulting redundancy, greater insensitivity to camera errors and, thereby, greater reliability in character recognition.

Surrounding the character identification pattern on the left and top is a pattern shared by all characters in the set. This feature serves a location function. The spatial frequency components present in this normalizing pattern are very unlikely to be generated by noncharacter elements in envisioned structural diagrams. Thus, an effective method for locating characters within the structural models consists of a search for regions within the model containing those spatial frequencies unique to the normalization pattern.
The composite symbol has dimensions of 11 x 11 raster elements. Thus, approximately 21 characters will fit the vertical limits of the frame while 29 will fit horizontally, yielding a maximum diagram complexity of 21 x 29 = 609 characters. This represents a maximum model density adequate for envisioned applications. Indeed, it seemed that the construction effort required for models exceeding this limit would constitute a deterrent against their use in any case.

Readability of the resulting character set remains a moot point. The 3 x 3 character pattern does, however, permit the more common chemical symbols to be fairly representational (see Figure 3). In an effort to improve character appearance, the normalization pattern can be rendered in a contrasting color pattern, red on white, for example, to reduce its distracting effect on the eye. Because the spectral response of the vidicon camera “peaks” in the blue region of the optical spectrum, such an arrangement does not significantly reduce pattern contrast as seen by the scanning hardware, i.e., “red” is a good “black” to the camera.

Figure 3. Photograph of actual character set representing hexafluoroacetylacetone. (See Table I, part D, for encoding results of this example)

In addition to symbolizing chemical elements, a number of other functions are performed by members of the character set. Multiple bonds, for example, are indicated explicitly by appropriate symbols. To increase notational compactness and speed model construction, whole functional groups may be represented by a single symbol. The symbol set capacity of 512 members permits freedom in symbol definition without danger of “running out” of available codes.

Besides the requirement for a computer-readable symbol set for the representation of structural elements, a means is needed to indicate connections between elements. In the devised system, element-element connection is indicated either by simple element adjacency (i.e., touching) or by joining connected elements with solid lines.
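The redundancy argument for the character format can be made concrete with a small Python sketch. It assumes details the text does not give: that each identification bit is decided by a simple majority of its nine scan elements, that the bits are read in row-major order, and that the shared locating pattern occupies the two left-most columns and two top-most rows of the 11 x 11 symbol (inferred from 11 - 9 = 2). All names are hypothetical, and the locating pattern itself is left blank because its exact form is not specified here.

```python
CELL = 3                        # each identification bit is 3 x 3 scan elements
GRID = 3                        # 3 x 3 identification bits per character
BORDER = 2                      # inferred width of the left/top locating pattern
SYMBOL = BORDER + GRID * CELL   # 11 raster elements per side


def make_symbol(code):
    """Build a noise-free 11 x 11 symbol (1 = black) for a 9-bit code."""
    symbol = [[0] * SYMBOL for _ in range(SYMBOL)]
    for k in range(GRID * GRID):
        bit = (code >> (GRID * GRID - 1 - k)) & 1
        r, c = divmod(k, GRID)
        for i in range(CELL):
            for j in range(CELL):
                symbol[BORDER + r * CELL + i][BORDER + c * CELL + j] = bit
    return symbol


def decode_identity(symbol):
    """Recover the 9-bit code by majority vote within each 3 x 3 cell."""
    code = 0
    for r in range(GRID):
        for c in range(GRID):
            top, left = BORDER + r * CELL, BORDER + c * CELL
            black = sum(symbol[top + i][left + j]
                        for i in range(CELL) for j in range(CELL))
            code = (code << 1) | (1 if black > (CELL * CELL) // 2 else 0)
    return code


def frame_capacity(lines=240, elements=320):
    """Whole symbols that fit vertically times those that fit horizontally."""
    return (lines // SYMBOL) * (elements // SYMBOL)


if __name__ == "__main__":
    noisy = make_symbol(0b111000101)
    for row, col in [(2, 2), (2, 3), (3, 2), (4, 4)]:
        noisy[row][col] = 0     # corrupt four of the nine elements of one cell
    print(decode_identity(noisy) == 0b111000101, frame_capacity())
```

Under the majority-vote assumption, up to four wrong scan elements in a cell leave its bit unchanged, which is one way to read the "greater insensitivity to camera errors" claimed for the format.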

THE SCANNING ALGORITHM

The process by which the system accepts a structural model and reduces it to a descriptive connectivity table is divided into three major phases. They are, in order of occurrence, Initial Search, Structural Analysis, and Data Formatting and Output.

The Initial Search phase consists of a systematic scan of the frame for the purpose of locating the left-most character of a structural model. Starting at the upper left extremity of the frame, a column-wise search is made for any “black” picture elements (the molecular diagrams are represented as black characters on a white background). Upon locating black, the search algorithm interrogates the surrounding area for the presence of a structural symbol. If this interrogation is successful in locating and identifying a character, that character becomes the “root” for subsequent structural analysis and the initial search phase is terminated. In the event that no character can be isolated, the interrogation stops and the prior search is resumed at a frame location just beyond the point at which the black, but unrecognizable, picture elements were encountered. In this fashion, slight blemishes in the structural diagram or transient camera errors do not preclude successful encoding.

Table I. Examples of Connection Tables Generated from Molecular Models

A. Structure: methane. Scan time: 3.5 sec. Teletype output:

    **DERIVED CONNECTION TABLE**
    #  Type Connections
    01 H    02
    02 C    03 04 01 05
    03 H    02
    04 H    02
    05 H    02
    FRAME SCAN COMPLETED

B. Structure: three-membered carbon ring. Scan time: 3 sec. Teletype output:

    **DERIVED CONNECTION TABLE**
    #  Type Connections
    01 C    02 03
    02 C    03 01
    03 C    01 02
    FRAME SCAN COMPLETED

C. Structure: diagram not reproduced (nine carbons and one nitrogen). Scan time: 5 sec. Teletype output:

    **DERIVED CONNECTION TABLE**
    #  Type Connections
    01 C    02
    02 C    03 09 01
    03 C    04 02 08
    04 C    05 03
    05 C    04 06
    06 C    05 07
    07 C    06 08 10
    08 C    03 07
    09 C    02
    10 N    07
    FRAME SCAN COMPLETED

D. Structure: hexafluoroacetylacetone (Figure 3). Scan time: 9 sec. Teletype output:

    **DERIVED CONNECTION TABLE**
    #  Type Connections
    01 F    02
    02 C    03 08 09 01
    03 C    04 02 10
    04 C    05 12 03 13
    05 C    06 04 14
    06 C    07 16 05 17
    07 F    06
    08 F    02
    09 F    02
    10 DB   03 11
    11 O    10
    12 H    04
    13 H    04
    14 DB   05 15
    15 O    14
    16 F    06
    17 F    06
    FRAME SCAN COMPLETED

Structural analysis begins following the location and identification of one diagram character by the initial search. The purpose of the analysis phase is to discover and identify all structural symbols appearing in the diagram, and to tabulate the connections between them. The method used is a character-by-character tracing of the model, using at each step the connectedness of the diagram as a guide to the next model feature. As this exploration progresses, the revealed diagram topology is represented internally in computer memory as a threaded table in which the modeled molecular connections are mapped in chained memory references. Using the character found by the initial search as a starting point, the algorithm proceeds as follows.

1. After a character has been found and recognized, a search is performed for it (on the basis of its x-y coordinates in the diagram) in the existing connectivity table to discover if it has been encountered previously in the analysis. If the search is successful, then the character already exists as a table entry; proceed to step 2. Otherwise, the character is newly found and a connection table entry must be created for it. The first data contained in the entry must be the character type code (of the element represented) and its graphical coordinates. Each of the four sides of the character must then be examined for connections to other diagram elements. Any connected elements found are noted by appending, to the table entry being created, information denoting the nature of the element forming the connection. This information consists of a character type code if the element is a character, or a line code if the connection is one end of a bond line. Also recorded is an indication of which edge (right, bottom, left, top) of the “parent” character shares the connection. It is important to note that the preservation of this last datum enables the encoding of certain types of stereoisomerism.

2. In every case except the situation immediately following the termination of the initial search phase, the character just processed by step 1 will have been found by tracing a connection from another character. Both the originating connection and the terminating connection in the present character are marked as having been traced. Also, referencing information is inserted into both connections so that each “points” to the connection table entry of its terminating character. In this way, the graphic connectedness of the structural model is duplicated in chains of memory references in the growing connection table. By tracing these chains, later analysis steps can reconstruct the topology of the original model.

3. The character under examination is now inspected to determine if it possesses any connections which have not yet been traced to their terminations. If untraced connections exist, the first encountered becomes the next goal for the analysis. Its terminating character becomes the new character for step 1 and analysis proceeds.


4. If step 3 fails to turn up any untraced connections to the current character, the algorithm conducts a search of the entire connection table in an attempt to find untraced bonds. If one is found, its terminating character is “adopted” as the next goal and the algorithm proceeds to step 1. If none exist, then the analysis phase is complete.

The Data Formatting and Output phase consists of post-processing the connectivity table prepared by Structural Analysis so that it may be output in a form acceptable for further encoding, storage for later retrieval, etc. It is in this phase that such functions as multicharacter structural symbols may be implemented. The Data Formatting processor is really the interface between the scanning system and the end recipient of acquired structures and provides the means for tailoring the general algorithm to individual applications.
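The four analysis steps can be sketched in Python on a toy model. This is a simplified illustration and not the program run on the Raytheon 704: the model is assumed to be a dictionary mapping grid coordinates to type codes, connections arise only from adjacency (bond lines are omitted), the initial search is reduced to picking the left-most coordinate, and every function name is hypothetical.

```python
# Edge order follows the text: right, bottom, left, top (y grows downward).
EDGES = [("right", 1, 0), ("bottom", 0, 1), ("left", -1, 0), ("top", 0, -1)]


def initial_search(model):
    """Stand-in for the column-wise frame search: left-most, then top-most."""
    return min(model)


def analyze(model):
    """Trace a toy model {(x, y): type_code}; return entries in discovery order."""
    table, order = {}, []

    def enter(xy):
        # Step 1: create an entry for a newly found character and note the
        # neighbor, if any, on each of its four sides.
        if xy not in table:
            conns = []
            for edge, dx, dy in EDGES:
                neighbor = (xy[0] + dx, xy[1] + dy)
                if neighbor in model:
                    conns.append({"edge": edge, "to": neighbor, "traced": False})
            table[xy] = {"type": model[xy], "conns": conns}
            order.append(xy)

    def mark(a, b):
        # Step 2: mark both ends of the connection between a and b as traced.
        for src, dst in ((a, b), (b, a)):
            for conn in table[src]["conns"]:
                if conn["to"] == dst:
                    conn["traced"] = True

    def next_untraced(current):
        # Step 3 looks at the current character first; step 4 then falls
        # back to a search of the whole table in order of discovery.
        for xy in [current] + order:
            for conn in table[xy]["conns"]:
                if not conn["traced"]:
                    return xy, conn["to"]
        return None

    current = initial_search(model)
    enter(current)
    while True:
        found = next_untraced(current)
        if found is None:
            break               # no untraced connections remain
        origin, target = found
        enter(target)
        mark(origin, target)
        current = target
    return [(xy, table[xy]) for xy in order]


def format_table(entries):
    """Number entries in discovery order and list connections edge by edge."""
    number = {xy: i + 1 for i, (xy, _) in enumerate(entries)}
    lines = []
    for xy, entry in entries:
        conns = " ".join(f"{number[c['to']]:02d}" for c in entry["conns"])
        lines.append(f"{number[xy]:02d} {entry['type']} {conns}".rstrip())
    return lines


METHANE = {(0, 1): "H", (1, 1): "C", (2, 1): "H", (1, 2): "H", (1, 0): "H"}

if __name__ == "__main__":
    for line in format_table(analyze(METHANE)):
        print(line)
```

For the five-character methane layout, with a hydrogen on each side of the carbon, the sketch numbers the characters and lists the carbon's connections as 03 04 01 05, the same order as part A of Table I. That agreement suggests, but does not prove, that connections are reported in right, bottom, left, top order.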

RESULTS AND DISCUSSION

Examples of four connection tables generated from molecular models are given in Table I. Each table gives consecutive numbers to the encoded characters and then uses these numbers to indicate connectivity. Hence the first example (methane) in the table indicates: (1) a hydrogen connected to character 2; (2) a carbon connected to characters 3, 4, 1, and 5; etc. In this manner the entire topology of the model is encoded. Notice that the scan times were only a few seconds even for reasonably complex molecules. In the case where entire files are to be encoded, we estimate a worker with reasonable knowledge of organic structures should be able to encode on the order of 1000 structures in an eight-hour day, or about two per minute. Furthermore, the results can be stored on magnetic tape and preserved for later batch encoding to actual search notation.

In conclusion, the system described allows convenient, rapid encoding of chemical structures into computer compatible format. The structures so encoded are then capable of being computer-processed into any linear notation for use in retrieval networks. The cost of the total equipment involved is on the order of $15,000, making the apparatus a reasonably priced input terminal to a large scale retrieval network. Work is under way to make a production system from the prototype.

Received for review July 13, 1973. Accepted October 1, 1973. The financial support of the National Science Foundation is gratefully acknowledged.

Computer-Assisted Assignment of Retention Indices in Gas Chromatography-Mass Spectrometry and Its Application to Mixtures of Biological Origin

H. Nau and K. Biemann

Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Mass. 02139

A program has been written for the automatic assignment of the retention indices of all components eluted during a gas chromatography-mass spectrometry-computer experiment. A set of three standard compounds (e.g., n-alkanes) is coinjected with the sample into the gas chromatograph. The computer then identifies these compounds by their mass spectral characteristics and uses their positions to assign retention indices for the entire gas chromatogram. It is shown that the combination of these two sets of data which are structure specific and complementary, namely retention indices and mass spectra, constitutes a most efficient and reliable approach to the identification of compounds. Applications to the identification of drugs in body fluids of comatose patients as well as to the complete characterization of amino acids and oligopeptides in derivatized polypeptide hydrolyzates are presented. The retention index data provided by this program can be used for manual interpretation as well as for computer-assisted interpretation and search systems.

Since the development of real-time data acquisition and processing techniques for gas chromatograph-mass spectrometer combinations that permit continuous recording of the mass spectra of gas chromatographic eluents (1), considerable efforts have been devoted to the efficient exploitation of the thus created vast amount of potentially extremely valuable information. Computer techniques for automatic identification of the eluting fractions, either by comparison with a collection of authentic mass spectra (2) or programs mimicking the interpretative steps employed by the chemist (3-5) or combinations thereof, have been successful to varying degrees. A new interpretative approach utilizing changes of the abundance of particular ions during a gas chromatogram, so-called mass chromatograms (6), was made possible by the availability of a continuous mass spectrometric record of the entire gas chromatogram with a time resolution of a few seconds (the scan period of the mass spectrometer). The relative ease with which those interpretative methods could be applied, once one had the capability of continuously recording the mass spectra directly into a computer, naturally led to an overemphasis of the mass spectrometric data as the sole parameter for identification, relegating the gas chromatograph to the secondary role of a separation device.

(1) R. A. Hites and K. Biemann, Anal. Chem., 40, 1217 (1968).
(2) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
(3) B. G. Buchanan, A. M. Duffield, and A. V. Robertson, in “Mass Spectrometry: Techniques and Applications,” G. W. A. Milne, Ed., Wiley-Interscience, New York, N.Y., 1971, p 121.
(4) M. Senn, R. Venkataraghavan, and F. W. McLafferty, J. Amer. Chem. Soc., 88, 5593 (1966).
(5) K. Biemann, C. Cone, B. R. Webster, and G. P. Arsenault, J. Amer. Chem. Soc., 88, 5598 (1966).
(6) R. A. Hites and K. Biemann, Anal. Chem., 42, 855 (1970).