
Inverted File Structure for Molecular Formula and Homologous Series Searching of Large Data Bases

R. Geoff. Dromey

Research School of Chemistry, Australian National University, P.O. Box 4, Canberra, A.C.T., 2600 Australia

A procedure is described for constructing a highly compact inverted molecular formula data base. The data base is easily updated and it allows efficient molecular formula searches. Molecular weight and homologous series searches are also possible. The technique is potentially applicable to a wide range of storage-retrieval problems. The method provides a fourfold reduction in storage over scatter table methods employed with molecular formula data bases.

The importance of molecular formula searching of large data bases in relation to answering questions about molecular structure has been firmly established (1). Both fixed-length strings and scatter tables have been used for storage-retrieval. In the CROSSBOW (2) system 18-character fixed-length records are used to store molecular formulas. Heller (3) has used a sophisticated scatter table storage and retrieval system for molecular formulas. The basis of this latter system is the application of a hashing function to each molecular formula. The hashing function maps character combinations for each molecular formula into a numeric key value that is used to define a region of the file in which to search for a given molecular formula. The problem with this system is that “collisions” (where more than one molecular formula yields the same numeric key) increase as the size of the file grows. Collisions and other problems of file maintenance can only be minimized by using an address space that is large relative to the number of molecular formulas in the file. In the discussion that follows, a much simpler, highly compact and efficient molecular formula storage and retrieval system is described.

Binary Representation of Molecular Formulas. A high resolution molecular mass can be calculated for molecular formulas using “accurate” atomic weights for each element. Provided the accuracy of the calculated mass is extended far enough for each molecular formula, it can produce an almost unique mapping. Molecular weights up to 2000 amu can be represented by an integer in the range 0 to 2 000 000 if they are multiplied by 1000 and rounded (taking three figures after the decimal point). For example

C18H24O2 → 272.178 → 272178
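The mapping from formula to integer key can be sketched in a few lines of Python. This is only an illustration: the atomic masses below are standard monoisotopic values assumed for the sketch, and the names `parse_formula`, `accurate_mass` and `integer_key` are hypothetical, not taken from the paper.

```python
import re

# Monoisotopic atomic masses in amu (assumed values for this sketch).
ATOMIC_MASS = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C18H24O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def accurate_mass(formula):
    """Sum the accurate atomic masses over all atoms in the formula."""
    return sum(ATOMIC_MASS[el] * n for el, n in parse_formula(formula).items())


def integer_key(formula):
    """Multiply the accurate mass by 1000 and round to the nearest integer."""
    return round(accurate_mass(formula) * 1000)
```

Under these assumed masses, `integer_key("C18H24O2")` reproduces the key 272178 used in the example above.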

The integer representation of molecular formulas can be exploited to construct a storage-retrieval file. The binary representation of the integer value of the molecular formula provides the mechanism for this construction. The 22-bit representation for C18H24O2 is thus

272178 (base 10) → 0001000010011100110010 (base 2)
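The binary step can be checked with a short Python sketch. The function names and the convention that bit 0 is the least significant bit are assumptions made for illustration.

```python
KEY_BITS = 22  # number of pseudo-keys used in the text


def key_to_bits(key, width=KEY_BITS):
    """Return the fixed-width binary string of an integer key."""
    if not 0 <= key < (1 << width):
        raise ValueError("key does not fit in the pseudo-key width")
    return format(key, "0{}b".format(width))


def set_bit_positions(key):
    """Positions of the 1 bits, most significant first (bit 0 = least significant)."""
    return [b for b in range(key.bit_length() - 1, -1, -1) if (key >> b) & 1]
```

For the key 272178 this gives the 22-character pattern shown above, with eight bits set.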

Present address: Department of Computing Science, University of Wollongong, Wollongong, N.S.W., 2500 Australia.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 13, NOVEMBER 1977

Inverted File Based on the Binary Representation of Molecular Formulas. In carrying out a molecular formula search, an attempt is made to answer the question “which members (record numbers) in the data base possess the molecular formula being considered?” It is well established in the theory of data bases (4) that inverted bit map files provide a very fast method for retrieving information about members of the data base which possess a particular attribute or a particular set of attributes. There is one drawback with such a file structure: it requires a very large amount of storage space, particularly if there is a very large number of possible attributes (keys). In fact, only when one of the dimensions in this two-dimensional file structure is small can the power of inversion be truly exploited. Consider now the present molecular formula problem in the context of the inverted file structure just described. In any large data base it would be necessary to handle all possible molecular formulas to some upper limiting molecular weight; 2000 amu would suffice in most instances. If the integer representation described above were to be used, two million keys would be needed to accommodate molecular formulas up to 2000 amu. Storing the record numbers in a fixed length bit array would require 2 000 000 × 20 000 bits of storage to set up a complete inverted file (Figure 1) for a data base of only 20 000 components. This amount of storage is obviously completely unacceptable for a molecular formula storage and retrieval system. However, the binary representation of the integer keys (molecular formulas) can be exploited to construct an inverted file that has the descriptive capacity of the above inverted file yet which can be confined to much more favorable storage requirements. In fact, it is possible to reduce the number of original integer keys to a set of pseudo-keys whose number is equal to the logarithm to the base 2 of the number of original descriptor keys. A quick calculation soon reveals that this is indeed a very acceptable level of storage.
In the example considered only 22 pseudo-keys (2^21 > 2 000 000) rather than 2 million keys are required, a reduction by a factor of approximately 10^5. The newly proposed inverted file (Figure 2) is made up of a key dimension of only 22 and a record number dimension of 20 000 bits; this is only 12 222 words on a computer with a 36-bit word size. To accommodate all the information in the original inverted file, record numbers must be encoded as a bit vector. Each record number has associated with it just one molecular formula from which it is possible to derive an integer in the range 1 to 2 000 000. Logarithmic compression is achieved by setting bits in the 22-bit “record-number-bit-vector” that correspond to the binary representation of the integer “molecular formula” associated with the record number being considered. All record numbers that possess a given molecular formula have the same binary pattern set in their associated record-number-bit-vectors. For example, suppose record numbers 10123 and 15256 corresponded to the composition C18H24O2. The associated equivalent bit vectors are shown in Figure 2.

Figure 1. Inverted file based on integer representation of molecular composition (keys 0 to 2 000 000 against record numbers, 20 000 bits).

Figure 2. Log-compressed inverted file for storing of molecular composition (bits of the binary representation against record numbers, with the columns for records 10123 and 15256 shown).

Table I. Processing Profile for a Typical Molecular Formula Search. Formula searched for: C18H24O2

First pass (only bits set processed)
Number of words still active = 1028, 902, 737, 543, 311, 177, 88, 48

Second pass (“NOT” logic)
Number of words still active = 29, 22, 12, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9

Table II. Storage Requirements for Molecular Formula Search Systems

Search system      Storage required (words)
CROSSBOW           29 246
SCATTER TABLE      21 120
LOG-COMPRESSED     4 964

The Boolean Strategy for Molecular Formula Retrieval. It is easy to see how this inverted file structure lends itself to molecular formula searching. First the associated integer and binary representations are computed from the molecular formula. The binary code determines the Boolean logic strategy for combining the 22 “pseudo-key” bit arrays so as to filter out all record number bits other than those corresponding to the molecular formula that is being searched for. If the 22 pseudo-key bit arrays are designated A, B, C, D, E, ..., respectively and the bit pattern for the molecular formula to be searched for is 11010... and so on, then the following logic operations would be needed on the pseudo-keys

A .AND. B .AND. (NOT C) .AND. D .AND. (NOT E) ...

The “AND” and the “NOT” are Boolean operations. When this set of Boolean operations is carried through for the complete set of pseudo-keys the only bits that will be left set in the temporary decision bit array will be those corresponding to the molecular formula to be retrieved. A large reduction in the number of Boolean operations needed is achieved by keeping an updated list of pointers to words that still have bits set at each step in the sequence of logic operations. For 100 molecular formula searches chosen at random, it was found that on average logic operations were needed on only 13% of the words in the file. In contrast, a sequential bit-matching procedure would require a search of the entire file. Because of the high efficiency of Boolean computer instructions, the inverted processing requires a very small amount of CPU time for even very large files.
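The Boolean retrieval strategy described above can be sketched in Python as follows. This is a minimal illustration rather than the author's program: the 36-bit word size follows the text, but the function names, the in-memory list-of-words layout, and the single most-significant-first pass are assumptions made for the sketch.

```python
WORD_SIZE = 36                      # bits per machine word, as in the text
KEY_BITS = 22                       # number of pseudo-key bit arrays
WORD_MASK = (1 << WORD_SIZE) - 1


def build_inverted_file(keys):
    """keys[r] is the integer key of record r; returns KEY_BITS arrays of words."""
    n_words = (len(keys) + WORD_SIZE - 1) // WORD_SIZE
    arrays = [[0] * n_words for _ in range(KEY_BITS)]
    for record, key in enumerate(keys):
        word, bit = divmod(record, WORD_SIZE)
        for b in range(KEY_BITS):
            if (key >> b) & 1:
                arrays[b][word] |= 1 << bit
    return arrays


def search(arrays, n_records, key):
    """Return the record numbers whose stored key equals `key`."""
    n_words = len(arrays[0])
    decision = [WORD_MASK] * n_words
    tail = n_records % WORD_SIZE
    if n_words and tail:
        decision[-1] = (1 << tail) - 1      # ignore padding bits in the last word
    active = list(range(n_words))           # pointers to words that still have bits set
    for b in range(KEY_BITS - 1, -1, -1):   # most significant bit first
        want_set = (key >> b) & 1
        still_active = []
        for w in active:
            word = arrays[b][w]
            if not want_set:
                word = ~word & WORD_MASK    # Boolean NOT on this pseudo-key
            decision[w] &= word             # Boolean AND into the decision array
            if decision[w]:
                still_active.append(w)
        active = still_active
        if not active:
            break
    return [w * WORD_SIZE + bit
            for w in active
            for bit in range(WORD_SIZE)
            if (decision[w] >> bit) & 1]
```

Only the words listed in `active` are touched at each step, which is the source of the saving reported in the text; the exact fraction of words visited depends on the data.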

A very efficient processing strategy that minimizes the number of NOT operations is to make a first pass processing only those arrays corresponding to a bit being set in the logic guide bit vector. The order of array processing should be from the most significant to the least significant bit. Table I shows the words processed at each stage in the logic operations for a typical formula search. The actual record numbers that correspond to bits that remain set in the decision vector after the logic operations have been completed are retrieved by using a lookup table. The table serves as a map from the numeric value of a byte (an 8-bit computer segment) to a list of the corresponding bits set in the byte.

Comparison of Compressed-Inverted File Structure with Other Storage and Retrieval Systems. It is necessary to compare the storage requirements of the present system with those for the other systems mentioned. This task is made easier by the fact that results have been published for a molecular formula searching system based on the scatter storage technique mentioned above (3). It was found that 21120 words (36 bits each) were needed to set up a search file for 8124 molecular formulas. The CROSSBOW method would require at the very least 29246 words [8124 × (18/5), with five 7-bit ASCII characters per word] to set up the file on the same computer. In contrast to the scatter storage system and CROSSBOW, the newly proposed system under the same conditions would require only 4964 words [8124 × (22/36)] of storage. The comparative storage requirements of the three systems are summarized in Table II. Percent-wise the log-compressed inverted file structure needs only 23.5% of the storage of the next most efficient system. The log-compressed file structure has the added advantage that it can also be used to do integer (low resolution) molecular weight searches whereas the scatter storage system requires a separate file that occupies an additional 9984 words.
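The storage figures quoted above can be reproduced with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative, and word counts are truncated to whole words because that is what matches the published figures.

```python
WORD_BITS = 36          # word size of the computer quoted in the text
N_FORMULAS = 8124       # size of the published scatter-table test file

scatter_words = 21120   # reported for the scatter storage system
mw_file_words = 9984    # extra molecular weight file needed by the scatter system

# CROSSBOW: 18-character records, five characters packed per word.
crossbow_words = N_FORMULAS * 18 // 5

# Log-compressed file: 22 bits per record.
log_words = N_FORMULAS * 22 // WORD_BITS

# Share of the next most efficient system (the scatter table), in percent.
percent_of_scatter = round(100 * log_words / scatter_words, 1)

# Ratio against the scatter system once its molecular weight file is included.
ratio_with_mw_file = log_words / (scatter_words + mw_file_words)

# 20 000-record example: full inverted file in bits, compressed file in words.
full_inverted_bits = 2_000_000 * 20_000
compressed_words_20000 = 22 * 20_000 // WORD_BITS
```

With these inputs the CROSSBOW and log-compressed word counts come out at 29246 and 4964, the percentage at 23.5, and the ratio that includes the molecular weight file at just under one sixth.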
Overall, the log-compressed file requires less than one sixth the number of words needed by the scatter storage approach to set up a molecular formula-molecular weight inverted file retrieval system. Four rather than three digits would possibly be needed after the decimal point if the file were to grow over 100 000. To gauge the efficiency of the proposed file structure for molecular formula retrieval from a large data base, a file was set up which contained the equivalent of 400 000 molecular formulas. In searching this file Boolean operations were performed on 15% of the file to simulate a molecular formula retrieval. The CPU time required to search this large file on a UNIVAC 1108 computer was only 0.35 s.

Table III. Molecular weight retrieval algorithm, illustrated for M = 272 (lower limit = 271.500, independent part of the lower limit 010010001100). The later upper-limit steps are R5 = C .AND. 10000111000 and R6 = C .AND. 100001110010.

Use of the Compressed Inverted Molecular Formula File for Molecular Weight Searches. To use the molecular formula file for a molecular weight search amounts to, in principle, applying Boolean logic on a subset of the 22 arrays. Before proceeding with a description of the algorithm, it is necessary to introduce a definition of molecular weight that relates to accurate molecular mass. An integer molecular weight M is defined as that integer value that encompasses accurate masses in the range M - 0.5 to M + 0.499 amu. A complete summary of the molecular weight retrieval algorithm is given in Table III. The first steps needed are to establish integer and then binary representations for the upper and lower limits (e.g., if M = 272, the integers used to derive the binary representation would be 271500 and 272499). The next step is to establish the set of most significant bits that are common to the upper and lower limits. The appropriate Boolean logic is then carried out on the arrays that correspond to the common set of bits. This operation results in a filtering of the file down to a set of molecular weights that includes only the desired molecular weight and other values immediately adjacent. A recursive procedure is needed to filter out all molecular weights greater than the upper limit.
The results of the common logic operations C are taken together with modified subsets of the independent part A of the upper limit. The algorithm proceeds by combining the C result with the most significant bits of A until a 1 is encountered. The 1 is replaced by a zero (a Boolean NOT) and the logic is carried out on the corresponding subset of arrays. Then another start is made at the most significant bit of A and again it proceeds until the second 1 is encountered; the second 1 is then set to zero and the extended set is combined with C. The process continues until no more set bits are encountered. To filter out values less than the lower limit, essentially the same procedure is used except that in the latter case successive zeros are replaced by 1's. Reference to the example given in Table III should make the whole process clearer.

An interactive molecular formula-molecular weight retrieval system that uses the log-compressed inverted file structure has been implemented on a PDP-11/45 for a file of 19689 molecular formulas. This suite of programs makes up a subsystem of a mass spectral search system (5).

Figure 3. Some examples of homologous molecular formulas

Molecular formula      Homologous molecular formula (X = CH2)
C2H7N                  X2H3N
C3H9N                  X3H3N
C18H22O3               X11C7O3
C19H24O3               X12C7O3

Retrieving Members of a Homologous Series from a Large Data Base. Information about members of a homologous series is of importance to some users of a large data base. The compressed inverted file structure can be readily adapted to storage and retrieval of this information. The essential step is to effect a simple transformation on standard molecular formulas. This reduces to “dividing” each molecular formula by “-CH2-” (the homologous component), which removes all the -CH2- contributions to the molecular formula. If X is used to denote the basic unit -CH2-, molecular formulas can be written as homologous formulas in which the “class” component appears separately. A set of examples is given in Figure 3. The homologous series log-compressed inverted file is constructed by subtracting the homologous contribution from the accurate molecular mass, e.g., for C18H24O2:

C18H24O2 → C12H24 + C6O2 → X12C6O2
272.178 → 168.188 + 103.990
∴ C6O2 → 103.990
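The “division” by -CH2- can be sketched as follows. The rule used here, removing as many CH2 units as the carbon and hydrogen counts allow, is an assumption inferred from the examples in Figure 3; the atomic masses are assumed monoisotopic values and the function names are hypothetical.

```python
import re

# Monoisotopic atomic masses in amu (assumed values for this sketch).
ATOMIC_MASS = {"C": 12.000000, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C18H24O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def homologous_split(formula):
    """Return (number of CH2 units, residual 'class' composition)."""
    counts = parse_formula(formula)
    n_ch2 = min(counts.get("C", 0), counts.get("H", 0) // 2)
    residual = dict(counts)
    residual["C"] = counts.get("C", 0) - n_ch2
    residual["H"] = counts.get("H", 0) - 2 * n_ch2
    residual = {el: n for el, n in residual.items() if n > 0}
    return n_ch2, residual


def residual_key(formula):
    """Integer key (accurate mass x 1000, rounded) of the residual component."""
    _, residual = homologous_split(formula)
    mass = sum(ATOMIC_MASS[el] * n for el, n in residual.items())
    return round(mass * 1000)
```

For C18H24O2 this yields twelve CH2 units and the residual C6O2, whose accurate mass rounds to the 103.990 shown above.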

The residual accurate mass for C6O2 can be converted to an integer in the range 0 to 2 000 000. The procedure for encoding the binary representations of the residuals then parallels the procedure for molecular formulas as does the retrieval operation. The inverted homologous series file can be coupled with a separate compressed inverted file that represents the number of -CH2- components for each molecular formula. By searching the respective files for C6O2 and X12 (referring to the previous example) and then doing a Boolean AND to combine the two sets of results, the formula X12C6O2 and hence C18H24O2 is retrieved. This file structure allows for ranges of homologous series (e.g., X17 to X20) to be searched and it is also convenient for standard molecular formula searches. To accommodate molecular weights up to 2000 amu (e.g., X I x256 = 2 9), an extra nine bit arrays are needed on top of the 22 needed for the pseudo-class component of the molecular formula.
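The coupling of the two files can be illustrated with a small Python sketch. For brevity each bit array is held as one Python integer used as a bitmap over record numbers, which is an assumption of this sketch and not the word-based layout of the paper; the sample records and all function names are hypothetical.

```python
def build(values, n_bits):
    """One bitmap per bit position; bit r of arrays[b] is set if bit b of values[r] is set."""
    arrays = [0] * n_bits
    for record, value in enumerate(values):
        for b in range(n_bits):
            if (value >> b) & 1:
                arrays[b] |= 1 << record
    return arrays


def match(arrays, n_records, value):
    """Bitmap of the records whose stored value equals `value`."""
    if value >> len(arrays):
        return 0                              # value needs more bits than the file holds
    decision = (1 << n_records) - 1
    for b, array in enumerate(arrays):
        decision &= array if (value >> b) & 1 else ~array
    return decision


def records(bitmap):
    """Record numbers of the bits set in a bitmap."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


# Hypothetical four-record file: residual keys (C6O2, C6O2, H3N, C6O2) and CH2 counts.
residual_keys = [103990, 103990, 17027, 103990]
ch2_counts = [12, 11, 2, 18]
n = len(residual_keys)

class_file = build(residual_keys, 22)   # 22 arrays for the class component
count_file = build(ch2_counts, 9)       # 9 extra arrays for the number of CH2 units

# X12C6O2: Boolean AND of the two search results.
exact_hits = records(match(class_file, n, 103990) & match(count_file, n, 12))

# Range X17 to X20 of the same class: OR over the counts, then AND with the class.
range_bitmap = 0
for count in range(17, 21):
    range_bitmap |= match(count_file, n, count)
range_hits = records(match(class_file, n, 103990) & range_bitmap)
```

In this toy file the exact search returns record 0 and the range search returns record 3. The range query here is done by a simple OR over the individual counts, which is one possible reading of how such ranges could be searched.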


ACKNOWLEDGMENT

The author thanks J. K. MacLeod and P. Keogh for their comments. The data base of molecular formulas was supplied by G. Milne and S. Heller. The author is grateful to J. Reinfelds for pointing out an error in the molecular weight retrieval algorithm. The author thanks Mrs. Greta Pribyl and Glenda Gregor for typing the manuscript.

LITERATURE CITED

(1) J. E. Ash and E. Hyde, Ed., “Chemical Information Systems”, Horwood, Chichester, 1975.
(2) D. R. Eakin and E. Hyde, in “Computer Representation and Manipulation of Chemical Information”, W. T. Wipke, S. R. Heller, E. Hyde, and R. Feldmann, Ed., Wiley, New York, N.Y., 1973.
(3) S. R. Heller, Anal. Chem., 44, 1951 (1972).
(4) D. Lefkovitz, J. Chem. Inf. Comput. Sci., 15, 14 (1975).
(5) R. G. Dromey, J. Chem. Inf. Comput. Sci., submitted for publication.

RECEIVED for review May 23, 1977. Accepted July 21, 1977.

Trace Determination of Vinyl Chloride in Water by Direct Aqueous Injection Gas Chromatography-Mass Spectrometry

Toshihiro Fujii

The Division of Chemistry and Physics, National Institute for Environmental Studies, Yatabe, Tsukuba, Ibaraki 300-21, Japan

Described is a rapid, precise, and specific method for the analysis of vinyl chloride (VC) at the sub-ppb level in water samples. The method involves the use of mass fragmentography in gas chromatography-mass spectrometry, simultaneously recording m/e 62 and 64 after direct aqueous injection of a large sample (1000 µL) on the precolumn (diglycerol as a liquid phase) with no concentration or extraction required. Several tap waters in Tokyo areas were tested, and VC was not found to be present.

Vinyl chloride (VC) has been identified as a carcinogen (1) that is likely to be related to human cancer. Initially, the intense search for VC in the environment centered upon foods (because poly(vinyl chloride) is a common food packing material and VC migrates into foods) and the occupational atmosphere of poly(vinyl chloride) plants. Therefore, there have appeared many reports on sampling (2-4) and analytical techniques (5) capable of detecting VC at less than ppm levels in air (6) or in food samples (7). Recently, VC at the 0.1-ppb level was discovered in the drinking water of U.S. cities (8). This incident has also generated considerable concern about the search for VC in water samples. There have been many reports on the determination of volatile organics in water by various methods, such as the head space technique, the gas stripping technique, and solvent extraction. However, very few methods have been applied to the trace analysis of VC in water. To my knowledge, the only method used appears to be the gas stripping technique (9). VC is stripped off a water sample with helium or nitrogen, from which it is separated by adsorption on adsorbents. The adsorbents are then transferred into a gas chromatograph or gas chromatograph-mass spectrometer for analysis. This method is not entirely suitable with respect to simplicity of operation. In addition, it requires special equipment. This paper describes a simple, new method which allows precise quantitative determinations of sub-ppb levels of VC in water by direct aqueous injection gas chromatography-mass spectrometry (10) without any special preparations. This method is based upon mass fragmentography (11), which provides the highest sensitivity of the detector with high specificity. The large sample injection (1000 µL) affords low detection limits.

EXPERIMENTAL

Apparatus. All analyses were performed on a Finnigan 3300F gas chromatograph-quadrupole mass spectrometer equipped with a multiple ion detector, by which mass fragmentography can be carried out. The interface between the gas chromatograph and the mass spectrometer was an all-glass jet-type enrichment device. The mass spectrometer was set to unit resolution (10% valley between adjacent nominal masses). The resulting ion currents were recorded on a multichannel strip chart recorder. The instrument was operated in the electron impact mode. Other conditions held constant throughout the analysis were: helium
