
Near optimum computer searching of information files using hash coding

Mar 1, 1971 - Near optimum computer searching of information files using hash coding. Peter C. Jurs. Anal. Chem. , 1971, 43 (3), pp 364–367. DOI: 10...
However, the volume of the closed system also increases by an amount dependent on $\pi r^2$, the cross-sectional area of the manometer, and $l'$, the length of the column of water displaced. Thus,

$$V_f = V_i + \pi r^2 l' \qquad (4)$$

It has been previously shown that $h = kl$, so $h = k'l'$ and, substituting in Equation 2,

$$\frac{h}{k'}\left[P_i \pi r^2 + k' V_i \rho g + \rho g \pi r^2 h\right] = \Delta n RT$$

Substituting the following constants: $\rho$ = 1.00 g/cm³, $g$ = 980 cm/sec², $\pi$ = 3.14, and $P_i$ = 1.01 × 10⁶ dynes/cm² (760 mm Hg),

$$\frac{h}{k'} \times 10^3 \left[3171\, r^2 + 0.980\, k' V_i + 3.080\, r^2 h\right] = \Delta n RT \qquad (7)$$

The relationship between $h$ and $\Delta n$ will be linear if the third term in the brackets is negligible compared with the sum of the other two. For the manometer system used in this work, $r$ = 0.75 cm, $V_i$ = 50 cm³, and $k'$ = 2.22,

$$\frac{h}{2.22} \times 10^3 \left[1784 + 109 + 1.73\, h\right] = \Delta n RT \qquad (8)$$

and the third term is less than 0.9% of the sum of the other two if $h$ is less than 10 cm. Thus this term can be dropped and the height difference between the two arms of the manometer is linearly related to the moles of gas evolved. From Equation 4 it is seen that if $r$ is small enough, the volume change caused by the movement of manometer fluid is negligible. Then $V_f \approx V_i$ and the pressure change measured in the manometer is a true differential pressure, i.e., it does not have to be corrected for the volume change. A simple calculation shows that if $V_i$ is approximately 50 cm³, $r$ would have to be 0.10 cm or less. But the placement of conductivity electrodes in a manometer of this size would be quite difficult. However, a suitable compromise between $V_i$, $r$, and sensitivity should be possible.

RECEIVED for review September 8, 1970. Accepted December 1, 1970. Thanks are due to the Analytical Chemistry Division of the American Chemical Society for the Anacon Summer Fellowship awarded to S.J.S.

Near Optimum Computer Searching of Information Files Using Hash Coding

Peter C. Jurs

Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802

The technique of hash coding has been applied to searching information files similar to those used in spectrometry laboratories. A discussion of several searching strategies, including the optimum one, is presented, and it is shown that hash coding yields nearly optimum matching algorithms for which search times are independent of file size. Two algorithms using hash coding have been implemented; the results of experiments with these algorithms are presented. The first algorithm matches blocks of unknown 16-bit spectra to a file of known spectra at the rate of approximately 20,000 spectra per second, independent of the number of unknown spectra being matched. The second system employs a double hashing procedure to search a data file of 20,000 spectra for one unknown at a time and verify its presence or absence in 40 milliseconds, on the average.

EXPERIMENTAL SITUATIONS routinely arise in which it is necessary to match an unknown spectrum to a file of known spectra. Several investigations which deal with the problem of computer searching of infrared spectrometry files have been reported. Anderson and Covert (1) reported a system using an IBM 7080 computer with magnetic tape input which could search 167 spectra per second. Erley (2) packed the words within the computer’s memory and used logical operations to make the necessary comparisons. His system was developed with an IBM 1130 using a disk input, and it could process 1000 spectra per second. Lytle (3) used an inverted file of IR spectra and developed a system which could search 1000 spectra per second using a 500 card per minute card reader for input. This inverted file system suffers from the disadvantages that new spectra cannot be added to the file as conveniently as with the other systems and that only one search can be performed at a time. It can, however, find near matches relatively easily. Lytle and Brazie (4) have more recently reported a system which uses compressed IR spectra to obtain search rates of 333 spectra per second with 45-bit spectra on a small laboratory computer. They also use statistical data compression to develop a system on an XDS Sigma 5 computer using disk input which can process 18,000 16-bit spectra per second. These 16-bit spectra do not contain all the information present in the original spectra, however.

A major drawback of most searching systems is that the search time is proportional to the number of members in the file being searched. This paper discusses a method, in both theoretical and experimental terms, for which this limitation is not present. A nearly optimal searching strategy can be developed by using hash coding to drastically reduce the time necessary to search files of data, such as IR spectra.

The problem of exactly matching an unknown query word to one of the members of a dictionary of words arises repeatedly in information handling applications. The problem of retrieving infrared spectra from a file of standard spectra is only one example of such an application. The terminology “word” for the query is used because the information, whether it is an actual English word or a number or a spectrum, is stored in binary form in the computer memory and can be considered as a unit.

(1) D. H. Anderson and G. L. Covert, ANAL. CHEM., 39, 1288 (1967).
(2) D. S. Erley, ibid., 40, 894 (1968).
(3) F. E. Lytle, ibid., 42, 355 (1970).
(4) F. E. Lytle and T. L. Brazie, ANAL. CHEM., 1532 (1970).

ANALYTICAL CHEMISTRY, VOL. 43, NO. 3, MARCH 1971

Let the dictionary or standard data file contain $2^a$ members. Each member, and therefore any possible query word, can be thought of as a $b$-bit binary number. The general problem is to determine whether an unknown $b$-bit query is or is not a member of the data set. If a match is found, then several courses of action may be open; however, the basic problem consists in determining if the query word is a member of the data set. This problem has been termed the exact matching problem, and it can be broken down into five cases, depending on the memory size, $M$, of the computer being used (5).

I. Optimum Case. A binary number made up of $b$ bits can assume $2^b$ possible values. Therefore, a memory which contains $2^b$ bits can store every possible $b$-bit number or spectrum $w$ as a “1” in the appropriate position, the $w$th position. Then, given any query word $w$, it can be matched to the data set by merely checking to see if the $w$th bit in memory is turned on. This is the optimum case of the exact matching problem in that only one memory reference involving one bit is needed to verify the presence or absence of any query word in the data set.

II. Impossible Case. The number of bits of memory necessary to store the unordered data set is $M = (b - a)2^a$. Storage of the data set in this size memory is done by putting the words comprising the data set in numerical order and then computing their successive differences. Each spectrum would then require $(b - a)$ bits on the average, and the entire data set would require $(b - a)2^a$ bits. If $M < (b - a)2^a$, then the data set cannot be stored, and matching of query words to the data file is impossible. The data set must be broken down into subsets, forming a new problem.

III. Exhaustive Search. If the memory is large enough to store the unordered data set, that is, if $M = (b - a)2^a$, then exhaustive searching routines can be employed.
These involve sequentially looking through the data file until either a match is found or the end of the file is reached without a match. On the average, such exhaustive searching routines require M/2 matches, or memory references, of $b$ bits each per query word.

IV. Logarithmic Sort. If $M = b\,2^a$, then the data set can be stored in ordered (numerical) sequence. The search can proceed by investigating half of memory, then a quarter, then an eighth, etc. This algorithm requires at most $a$ inspections of words, each requiring $b$ bit-matches.

V. Hash Coding. If $M = (1 + f)\,b\,2^a$, then hash coding can be used (5, 6). The excess memory available, $f$, determines the average number of memory references per query for matching. For the case of $f = 1$, i.e., the memory is twice as large as the data set to be used, the number of memory references will be less than two, on the average. This is very close to optimum.

Hash coding is implemented in two steps. A filing algorithm is used to set up the data set in memory, and a retrieval algorithm is used to perform the matches. Both algorithms have available for their use a hashing algorithm $H(w, j)$. It uses two input parameters: $w$, the $b$-bit word being filed or retrieved, and $j$, an integer. It produces one output parameter, an integer $k = H(w, j)$. The integer produced by the hashing algorithm is randomly distributed over the interval $1 \le k \le m$, where $m$ is the number of words of memory being used for storage of the data set. Thus, the hashing algorithm randomly (but algorithmically) produces an integer, given the $b$-bit word being filed or retrieved and an integer $j$.

The filing algorithm utilizes the hashing function in the following manner: To file the $i$th word of the data set, the filing algorithm computes $H(w_i, 1)$. If the memory location with this address is empty, then $w_i$ is put into it. If that memory location is occupied, then the procedure is repeated for $H(w_i, 2)$, $H(w_i, 3)$, . . . until an unoccupied memory cell is found, and $w_i$ is filed there. This process is repeated for all the members of the data set. When the filing algorithm has filed all the members of the data set, then a fraction of the memory is filled and the remainder is empty.

To retrieve a query word $w$ from the data set, the retrieval algorithm computes $H(w, 1)$. If memory location $H(w, 1)$ contains $w$, a match has been found, and the algorithm terminates. If $H(w, 1)$ is empty, then $w$ is not in the data set, and the algorithm terminates. However, if $H(w, 1)$ is occupied by some other word, then the procedure is repeated for $H(w, 2)$, $H(w, 3)$, . . . until either an empty memory cell is found or $w$ itself is found. On the average the retrieval algorithm makes very few memory references, since only a fraction of the memory is filled. (The actual number of memory references obtained for a particular data set depends on the fraction of the memory which is filled and the method employed for resolving conflicts, but not on the file size.) Thus hash coding is very close to optimum with relatively little excess memory. (If an additional 2b bits of memory are available, then the search time can be lowered further by using bit-mapping of the main data set. This involves assigning an extra bit to each memory location and setting that bit to 1 only if that memory location is nonzero. Then, on the average, a query word can be retrieved with $4 + b \cdot 2^{a-b} \cong 4$ bit-references (5). Such bit mapping requires, of course, both hardware and software which allow treatment of single bits.)

The method of hash coding can be successfully applied to the problem of retrieval of spectrometry data, such as infrared spectra.
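
The filing and retrieval steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FORTRAN IV and Assembler code used in this work: the function names, the 0-based cell numbering, the probe limit, and the use of a seeded pseudo-random generator as the hashing function H(w, j) are all hypothetical choices made for the example.

```python
import random

EMPTY = None  # marks an unoccupied memory cell


def H(w, j, m):
    """Hypothetical hashing function H(w, j): a cell index in 0..m-1.

    The paper numbers cells 1..m; 0-based indexing is used here.
    Seeding a generator with the string "w:j" gives a value that looks
    random but is reproducible, so filing and retrieval follow the same
    probe sequence for a given word.
    """
    return random.Random(f"{w}:{j}").randrange(m)


def file_word(memory, w, max_probes=10_000):
    """Filing: try H(w,1), H(w,2), ... until an empty cell is found."""
    m = len(memory)
    for j in range(1, max_probes + 1):
        k = H(w, j, m)
        if memory[k] is EMPTY:
            memory[k] = w
            return j  # number of memory references used
    raise RuntimeError("no empty cell found; memory is probably full")


def retrieve(memory, w, max_probes=10_000):
    """Retrieval: return (found, number of memory references)."""
    m = len(memory)
    for j in range(1, max_probes + 1):
        cell = memory[H(w, j, m)]
        if cell == w:
            return True, j       # the query word itself was found
        if cell is EMPTY:
            return False, j      # an empty cell proves w was never filed
    raise RuntimeError("probe limit reached; memory is probably full")


def build_memory(words, f=1.0):
    """File every word into a memory of (1 + f) * len(words) cells."""
    memory = [EMPTY] * int((1 + f) * len(words))
    for w in words:
        file_word(memory, w)
    return memory
```

A call such as `build_memory(words, f=1.0)` followed by `retrieve(memory, w)` for each query reproduces the two-step procedure; the second element of the returned pair is the number of memory references, which depends on the fraction of memory filled rather than on the number of words filed.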

(5) M. Minsky and S. Papert, “Perceptrons,” MIT Press, Cambridge, Mass., 1969, p 219ff.
(6) I. Flores, “Data Structure and Management,” Prentice-Hall, Inc., Englewood Cliffs, N. J., 1970, p 146ff.

(7) W. H. Payne, J. R. Rabung, and T. P. Bogyo, Comm. ACM, 12, 85 (1969).
(8) P. A. W. Lewis, A. S. Goodman, and J. M. Miller, IBM System J., 8, 136 (1969).

EXPERIMENTATION AND RESULTS

Table I. Investigation of Hashing Procedure 1

Trial   Bits per   No. of   Spectra     f      Matches   Input       Time,   Average memory
        spectrum   blocks   per block                    device      sec     references
1       32         10       2000        0.50   68        tape/disk   2       3.0
2       16         10       2000        0.50   68        tape/disk   1       3.0
3       32         10       2000        0.50   16        tape/disk   2       3.0
4       16         10       2000        0.50   16        tape/disk   1       3.0

Table II. Investigation of Hashing Procedure 2

Trial   Bits per   No. of   Spectra     f      Matches   Input       Time per     Average memory
        spectrum   blocks   per block                    device      query, sec   references
1       16         40       500         1      64        disk        0.04         3.8
2       16         40       500         1      58        disk        0.04         3.6
3       16         40       500         1      24        disk        0.04         3.0

A filing algorithm and a retrieval algorithm for use with simulated spectra have been fully implemented and tested on the Pennsylvania State University Computation Center IBM 360/67. The hashing function H(w, j) used is based on a 360/Assembler language random number generation subroutine in the 360/67 library. It uses the multiplicative congruential method of random number generation, which is widely employed; the method is discussed in detail in (7) and (8). Any random number generator could be used; this one has been chosen because of availability and speed. The remainder of the coding was done in FORTRAN IV. All programs are available to interested parties from the author.

A few modifications and extensions of the basic hashing procedure described above are necessary for application to retrieval of simulated IR spectra. These are discussed in the following paragraphs.

Digitized spectra can be non-unique for chemical reasons even though the coding schemes are capable of representing all the spectra uniquely. For example, Lytle and Brazie (4) have pointed out that Sadtler Laboratories' Spec Finder data files have some identical codes for different molecules. To handle this possibility, the retrieval algorithm must be altered in the following way. During retrieval, when a match is found, the search must continue until an empty memory cell is found. This assures the retrieval algorithm of finding all matches for w, not just the first one. This modification increases the average number of memory references per query word slightly. (The necessity of finding all matches for each query word would require modifications of the other matching algorithms discussed in the previous section as well.)

The general discussion of filing and retrieval algorithms using a hashing function above assumed that only the actual word which is a member of the data set would be filed in memory. Actually, any information could be put there. For a file of infrared spectra it might be desirable to file a registry number with the spectrum itself, for example. Alternatively, a list of pointers could be developed by the filing algorithm to link the memory cells of the filed data with data files outside the computer. Then the retrieval algorithm need only report matches in terms of memory locations in the filed data and the operator can use this number to find the matched spectrum (in a book, etc.).

The filing and retrieval algorithms discussed above have been extensively tested with simulated spectral data. For convenience the spectra used were 16- and 32-bit spectra; this is totally arbitrary and the utility of the method is unrelated to word size. Several data sets of 20,000 spectra were generated; these were handled in subsets of 2000 spectra or less in order to better simulate the method as it might be implemented on a small laboratory computer. The IBM 360/67 utilized is in a multiprogramming environment and therefore timing of the algorithms is not accurate. The times do, however, give an idea of the speed of the method.

The procedures were implemented corresponding to two typical laboratory situations. Procedure 1 treats query spectra in blocks of, say, 100. The individual data subsets are input and searched in a sequential manner; input time is the limiting factor for searching speed. The time required for a search is almost completely independent of the number of query words involved, since the operations carried out internally are so fast compared to the input speed. Additional spectra can be added to the data file by inserting them into the existent subsets or by starting a new subset.
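
The two extensions described in this section, continuing the search past the first match and filing a registry number alongside each spectrum, can be illustrated with a short Python sketch. Everything named here is an assumption made for the example: the hashing function is a hypothetical stand-in for the library random number subroutine, cells are numbered from 0, and the registry numbers are invented labels.

```python
import random

EMPTY = None  # marks an unoccupied memory cell


def H(w, j, m):
    """Hypothetical stand-in for the paper's hashing function (0-based)."""
    return random.Random(f"{w}:{j}").randrange(m)


def file_record(memory, w, registry_no, max_probes=10_000):
    """File the pair (spectrum code, registry number) in the first empty cell."""
    m = len(memory)
    for j in range(1, max_probes + 1):
        k = H(w, j, m)
        if memory[k] is EMPTY:
            memory[k] = (w, registry_no)
            return k  # memory location, usable as a pointer to outside data
    raise RuntimeError("no empty cell found; memory is probably full")


def retrieve_all(memory, w, max_probes=10_000):
    """Collect the registry numbers of every cell holding w.

    The search does not stop at the first match; it continues along the
    probe sequence until an empty cell is reached.
    """
    m = len(memory)
    seen = set()   # cells already inspected (random probes may repeat a cell)
    matches = []
    for j in range(1, max_probes + 1):
        k = H(w, j, m)
        cell = memory[k]
        if cell is EMPTY:
            return matches
        if k not in seen and cell[0] == w:
            matches.append(cell[1])
        seen.add(k)
    raise RuntimeError("probe limit reached; memory is probably full")
```

Because a duplicate code is filed further along the same probe sequence as the first copy, and occupied cells are never emptied, walking that sequence to the first empty cell is enough to reach every copy. The `seen` set is a detail of this sketch only: it prevents a cell from being reported twice when the stand-in hashing function happens to return the same location for two values of j.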

Table I shows some results obtained using a specific implementation of procedure 1. The 20,000 16- and 32-bit spectra used were divided into ten equal subsets. Approximately 1% of the spectra appear in the data set twice, in accordance with the possibility of non-unique spectra in IR spectral data collections, as mentioned above. Each 2000 spectra were filed in a memory block of 3000 words, i.e., f = 0.5. The query spectra consisted of 100 randomly chosen spectra, with different sets used for trials 1 and 2 than for 3 and 4; the number of matches found is given in column 5. The time necessary to do the entire problem was approximately equal for either tape or disk input due to buffered input on the system used. An average of 3.0 memory references was required per query spectrum. In the form reported, this system could be used on a smaller laboratory computer since the memory requirements are not severe. Only 3000 memory locations are used for the data file subsets. Thus, procedure 1 allows searching of a large data file for a block of query spectra in a very reasonable amount of time (up to 20,000 spectra per second, depending on the length of the spectra).

In procedure 2, query spectra are looked up one at a time. A double hashing method is employed. The data set is divided into n subsets. To file a spectrum w, the filing algorithm computes H'(w) = i, a randomly chosen number i ($1 \le i \le n$). Then the spectrum being filed is inserted into the ith subset by the usual hashing procedure. The retrieval algorithm also works in two stages. Thus, only one subset of the data set need be accessed from storage for either filing or retrieval of a query word. This substantially reduces the amount of input which must be done during retrieval. Of course, procedure 2 requires a random access input device such as a disk or drum. This method is substantially superior to the first one for very small numbers of query spectra since it drastically cuts input time, which is the limiting factor in the overall speed of the search. New spectra to be added to the data set are inserted into the correct subset, which changes the fraction of memory filled in that subset and, therefore, the average memory references per query. If a great many spectra are added to the data set, the filing algorithm must rework the data set disposition with a larger number of subsets and a revised preliminary hashing function H'(w) with the new expanded range.

Table II shows some results obtained using the double hash coding procedure. The 20,000 16-bit spectra employed were divided into 40 subsets of 500 spectra each. Once again, approximately 1% of the spectra appear in the data set twice. Each memory block is half full, i.e., each 500 spectra are stored in 1000 locations. Three sets of 100 randomly chosen query spectra were selected and matched to the data set with the results shown. Line 1 shows that 64 matches were found for the first set of query spectra and that 3.8 memory references were required per spectrum. On the average, each query was answered in 40 milliseconds. Line 2 shows the results for a different set of query spectra, with 58 matches and 3.6 average memory references; line 3 gives the results for a third set of query spectra. Thus, it is seen that procedure 2 is very fast for matching a few query spectra at a time to a large data file. This algorithm has also been implemented in a form that could easily be used on a smaller laboratory computer.

Some of the variables chosen for the investigations reported in Tables I and II are arbitrary. The fraction of memory used, number of data blocks, and number of spectra per block can be chosen to maximize efficiency on any particular computer system. The values chosen here are typical in that they demonstrate the methods, but they should not be considered as optimum choices. The number of bits per spectrum used in laboratory situations depends on the data involved; the method of hash coding is not at all dependent on the word size of the data set or query words, so convenience has led to the use of 16- and 32-bit words in this work.

To summarize: the method of hash coding makes extremely fast information retrieval algorithms for laboratory computers feasible and makes the length of time to perform a search independent of the size of the file searched.
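
The double hashing of procedure 2 can be sketched as follows. This is a minimal illustration, assuming a list of in-memory lists in place of the disk subsets and hypothetical hashing functions built on a seeded pseudo-random generator; the class and method names are invented for the example, and subsets are numbered from 0 rather than from 1.

```python
import random

EMPTY = None  # marks an unoccupied memory cell


def H_prime(w, n):
    """Hypothetical preliminary hash H'(w): picks one of n subsets (0-based)."""
    return random.Random(f"subset:{w}").randrange(n)


def H(w, j, m):
    """Hypothetical in-subset hashing function, as in the single-level scheme."""
    return random.Random(f"{w}:{j}").randrange(m)


class DoubleHashedFile:
    """n subsets of m cells each; a list of lists stands in for disk storage."""

    def __init__(self, n, m):
        self.n, self.m = n, m
        self.subsets = [[EMPTY] * m for _ in range(n)]
        self.subsets_read = 0  # counts simulated disk accesses

    def _load(self, i):
        """Simulate reading subset i from a random access device."""
        self.subsets_read += 1
        return self.subsets[i]

    def file(self, w, max_probes=10_000):
        """Stage 1 selects the subset; stage 2 files w by the usual hashing."""
        block = self._load(H_prime(w, self.n))
        for j in range(1, max_probes + 1):
            k = H(w, j, self.m)
            if block[k] is EMPTY:
                block[k] = w
                return
        raise RuntimeError("subset is probably full; more subsets are needed")

    def contains(self, w, max_probes=10_000):
        """Retrieval reads exactly one subset, whatever the total file size."""
        block = self._load(H_prime(w, self.n))
        for j in range(1, max_probes + 1):
            cell = block[H(w, j, self.m)]
            if cell == w:
                return True
            if cell is EMPTY:
                return False
        raise RuntimeError("probe limit reached; subset is probably full")
```

Since H'(w) assigns words to subsets at random, the subsets are not filled equally in this sketch; choosing m comfortably larger than the expected number of words per subset keeps each block partly empty, which both filing and retrieval rely on to terminate.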

RECEIVED for review October 22, 1970. Accepted December 14, 1970.

Teflon, A Noninert Chromatographic Support

John R. Conder

Chemical Engineering Department, University College of Swansea, U.K.

Sorption by Teflon as support has been studied with di-n-nonyl phthalate as stationary phase, the liquid loading being varied from 0 to 20%. A substantial support contribution to retention is observed for a variety of both polar and nonpolar solutes. On a 6% loaded column more than 10% of the retention of water and about 14% of that of n-hexane are attributed to adsorption by Teflon. Retention and tailing effects are distinguished with the aid of a two-site sorption model. The implications for chromatographic analysis and physical measurement are described. In particular, absence of tailing should not be taken as evidence of an inert support: adsorption may still affect retention.

“TEFLON” (Du Pont) (POLYTETRAFLUOROETHYLENE) differs from diatomaceous earth supports used for gas-liquid chromatography in giving a nearly symmetrical water peak and no tailing with alcohol or amine solutes (1-3). This fact makes Teflon supports very useful for separating highly polar materials, but is also commonly taken to imply that Teflon approaches the ideal of an inert, nonadsorbent support. An “inert” support is one which adds nothing to solute retention on the liquid stationary phase, and so can safely be used for thermodynamic studies on the liquid phase or for identifying peaks by their retention parameters.

If Teflon were inert to adsorption of polar solutes, it might be expected to be inert also to nonpolar solutes. This deduction, however, conflicts with recent observations of substantial support adsorption of hydrocarbons by coated Teflon. The stationary phases were di-n-propyl tetrachlorophthalate (4), polyethylene glycol 400 (5), and squalane (6). In each case, total retention was actually greater on Teflon than on firebrick. Since adsorption is observed with such widely differing types of liquid phase and also since the activity coefficients of hydrocarbons in at least two of these phases are less than 2, the extra retention cannot be attributed solely to Gibbs adsorption at the liquid-gas interface, but may be due to adsorption by the support. Confirmation is provided by Graham, who observed and measured the adsorption of paraffins on uncoated Teflon in a static system (7). Extrapolation of his data leads one to expect a contribution to hydrocarbon retention of around several per cent from adsorption on Teflon, depending on the nature and percentage loading of the liquid phase. This suggests that other solutes, such as amines and alcohols, may also be adsorbed by the support even when no tailing is observed.

The present study was therefore undertaken to investigate the adsorption by Teflon of different types of solute, including both polar and nonpolar types, and to see whether absence of tailing is a sufficient criterion for supposing support adsorption to be absent. The contribution of support adsorption to total retention was measured by isolating it from the contribution of bulk solution in the stationary phase. This was done by varying the liquid loading and analyzing the results by a method developed previously (8). Di-n-nonyl phthalate was selected as a representative stationary phase of moderate polarity as distinct from the highly polar and nonpolar phases of previous studies (3, 5, 6).

(1) C. Landault and G. Guiochon, J. Chromatogr., 9, 133 (1962).
(2) D. M. Ottenstein, J. Gas Chromatogr., 1, 11 (1963).
(3) J. J. Kirkland, ANAL. CHEM., 35, 2003 (1963).
(4) J. R. Conder, unpublished results, 1966.
(5) M. B. Evans and J. F. Smith, J. Chromatogr., 30, 325 (1967).
(6) W. Jequier and J. Robin, Chromatographia, 1, 297 (1968).

EXPERIMENTAL

Apparatus and Materials. The gas chromatograph was a Phase Separations Model LC-2 with katharometer detector. A mercury manometer was added to measure the line pressure before the injector and a soap-bubble flowmeter with a 50-ml volume was attached at the exit from the chromatograph. The recorder was a Servoscribe RE.511, and the carrier gas, hydrogen. Columns were constructed in stainless steel tubing of 0.25-inch o.d., all approximately 36 in. long and

(7) D. P. Graham, J. Phys. Chem., 69, 4387 (1965).
(8) J. R. Conder, J. Chromatogr., 39, 273 (1969).